**1.** **Load and clean data**

Load the data and fill in the missing (NaN) country values for entries that we know are in the US. We do this by finding entries whose 'state/province' belongs to the US and setting their 'country' value to 'us'.
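
The same fill can also be written with boolean masks instead of a row-by-row loop. The following is a minimal sketch on a made-up toy table (`toy` and its values are invented for illustration and are not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the sightings table (values are made up for illustration).
toy = pd.DataFrame({
    "state/province": ["tx", "ca", "tx", "on", np.nan],
    "country": ["us", "us", np.nan, np.nan, np.nan],
})

# States seen in rows already labelled 'us'; missing states are dropped.
us_states = toy.loc[toy["country"] == "us", "state/province"].dropna().unique()

# Rows that have a known US state but no country value.
fill_mask = toy["country"].isna() & toy["state/province"].isin(us_states)
toy.loc[fill_mask, "country"] = "us"

print(toy)
```

Only the third row is filled here: 'on' never appears next to 'us', and a missing state gives no information about the country.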

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

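#fill some country values as US
sightings = pd.read_csv('../input/ufo-sightings-around-the-world/ufo_sighting_data.csv')
#states/provinces that appear in rows already labelled 'us' (missing values dropped)
us_states = sightings.loc[sightings.country == 'us', 'state/province'].dropna().unique()
#rows with a US state/province but a missing (NaN) country are set to 'us'
missing_us = sightings['state/province'].isin(us_states) & sightings.country.isna()
sightings.loc[missing_us, 'country'] = 'us'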

#set the geolocation latitude coordinate in proper float type
sightings.loc[43782,'latitude']='33.200088' # corrects a typo
sightings.latitude = pd.to_numeric(sightings.latitude)
sightings.loc[27822,'length_of_encounter_seconds']='2' # corrects a typo
sightings.loc[35692,'length_of_encounter_seconds']='8' # corrects a typo
sightings.loc[58591,'length_of_encounter_seconds']='0.5' # corrects a typo

sightings.length_of_encounter_seconds = pd.to_numeric(sightings.length_of_encounter_seconds)

#set the Date_time column as pd.datetime type
#correct some wrong time entries: '24:00' is not a valid hour, so map it to '23:59' of the same day
sightings['Date_time'] = sightings['Date_time'].str.replace('24:00', '23:59', regex=False)

sightings['Date_time'] = pd.to_datetime(sightings['Date_time'], format='%m/%d/%Y %H:%M')

#set the df index Date_time so we have a proper time series for analysis
sightings.index = sightings.Date_time
#sightings.info()
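
The cleaning above maps '24:00' to '23:59', which keeps the calendar day but shifts the time by one minute. An alternative, sketched here on made-up strings, is to read '24:00' as midnight of the following day. The names `raw`, `is_midnight` and `parsed` are illustrative only, and whether this reading matches what the reporters meant is an assumption.

```python
import pandas as pd

# Made-up timestamps in the same text format as the Date_time column.
raw = pd.Series(["10/10/1949 20:30", "10/11/2006 24:00", "12/31/1999 24:00"])

# Remember which entries used the invalid hour 24.
is_midnight = raw.str.contains("24:00", regex=False)

# Parse them as 00:00 of the stated day ...
parsed = pd.to_datetime(
    raw.str.replace("24:00", "00:00", regex=False),
    format="%m/%d/%Y %H:%M",
)

# ... and move only those entries forward by one day.
parsed = parsed.where(~is_midnight, parsed + pd.Timedelta(days=1))

print(parsed)
```

With this variant a sighting reported at 24:00 on New Year's Eve is counted in the following year, which matters for the yearly statistics below.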

**2.** **Time evolution of sightings**

Let's look at how the number of sightings has evolved over the years.

As the graph below shows, the number of sightings has increased considerably, especially since 1994! Notice that the first episode of 'The X-Files' was released in September 1993. Is there a correlation?
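
One rough way to put numbers on this question is to compare the average yearly count before and after 1994. The sketch below does this on a handful of invented timestamps (`toy`, `per_year`, `before` and `after` are hypothetical names), counting per year with `groupby` on the index year:

```python
import pandas as pd

# Invented timestamps standing in for the Date_time index.
stamps = pd.to_datetime([
    "1992-06-01 21:00", "1993-09-10 22:15",
    "1995-07-04 23:00", "1995-08-12 01:30",
    "1996-01-20 19:45", "1996-10-31 20:00",
])
toy = pd.DataFrame({"UFO_shape": ["light", "disk", "circle",
                                  "light", "triangle", "light"]}, index=stamps)

# Sightings per calendar year; years without any sighting count as 0.
per_year = toy.groupby(toy.index.year).size()
per_year = per_year.reindex(range(1992, 1997), fill_value=0)

# Mean yearly count up to 1993 and from 1994 onwards.
before = per_year.loc[:1993].mean()
after = per_year.loc[1994:].mean()

print(per_year)
print("before 1994:", before, "from 1994:", after)
```

Filling the empty years with zeros matters: without the `reindex`, a year with no sightings would silently drop out of the average.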


In [None]:
by_year = sightings.resample('A').count() # yearly counts; pandas >= 2.2 prefers the alias 'YE' over 'A'
ax = by_year.Date_time.plot()
ax.set_ylabel('Number of Sightings')
ax.set_xlabel('Year')
ax.set_xlim(['1940','2016'])

**3. Can we now estimate in which period of the year aliens like to visit the earth?** 

It seems that aliens love to come to Earth, especially in June. They also seem to like warm weather: in the northern hemisphere they are mostly seen during summertime. In the southern hemisphere the likelihood of seeing a UFO is also highest in June, although there is an increased probability of seeing one in December/January (i.e., the austral summer).

Another curious observation is that about 98% of the UFO visits take place in the northern hemisphere. This could be due to problems in the data collection, for example encounters in the southern hemisphere not being properly reported.
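
The monthly probabilities are obtained by normalizing each year's monthly counts by that year's total and then averaging over the years. A compact sketch of that idea on invented data follows (`toy`, `monthly_share` and the coordinates are made up; it uses `groupby` and `unstack` rather than the per-year loop of the cell below):

```python
import pandas as pd

# Invented sightings: timestamps plus a latitude (positive = northern hemisphere).
stamps = pd.to_datetime([
    "2000-06-10", "2000-06-21", "2000-07-04", "2000-12-24",
    "2001-06-15", "2001-01-02",
])
toy = pd.DataFrame({"latitude": [40.7, 51.5, 34.0, -33.9, 48.9, -23.5]},
                   index=stamps)

def monthly_share(df):
    """Mean over years of (sightings in month / sightings in that year)."""
    counts = df.groupby([df.index.year, df.index.month]).size().unstack(fill_value=0)
    counts = counts.reindex(columns=range(1, 13), fill_value=0)  # all 12 months
    shares = counts.div(counts.sum(axis=1), axis=0)              # each row sums to 1
    return shares.mean()

north = monthly_share(toy[toy["latitude"] > 0])

# Share of all sightings that fall in the northern hemisphere, in percent.
north_pct = 100 * (toy["latitude"] > 0).mean()

print(north)
print(round(north_pct, 2), "% of the toy sightings are in the north")
```

Normalizing per year before averaging keeps the recent, heavily reported years from dominating the seasonal profile.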

In [None]:
by_month_north = sightings[sightings.latitude > 0].resample('M').count()
by_month_south = sightings[sightings.latitude < 0].resample('M').count()
by_month = sightings.resample('M').count()

#function to extract monthly statistics for all years. stat = mean, std, min, max, etc
def get_month_prob(df, stat):
    years = np.arange(1943, 2015)
    month_name = ['Jan', 'Feb','Mar','Apr','May','Jun','Jul', 'Aug', 'Sep', 'Oct', 'Nov','Dec']
    all_months = pd.DataFrame(index=range(12))

    for year in years: #monthly sightings of each year, normalized by the total sightings of that year
        year_counts = df.loc[str(year)].Date_time #select rows by year with .loc; df['1943'] no longer selects rows in pandas >= 2.0
        all_months[str(year)] = (year_counts / year_counts.sum()).reset_index(drop=True)
    all_months.index = month_name
    all_months_stats = all_months.T.describe() #statistics for each month over all the years from 1943 to 2014
    monthly = all_months_stats.loc[stat] #extract the requested statistic (e.g. the mean) of the normalized monthly sightings
    return monthly

south = get_month_prob(by_month_south, 'mean')
north = get_month_prob(by_month_north, 'mean')
whole = get_month_prob(by_month, 'mean')

ax = whole.plot(kind='line',color='k' )
south.plot.bar(position=0,color='b', alpha=0.5, ax = ax, width=0.3)
north.plot.bar(position=1,color='r', alpha=0.5, ax = ax, width=0.3)

ax.set_xticks(np.arange(12))
ax.set_xticklabels(south.index)
ax.legend(['Whole world', 'Southern Hemisphere', 'Northern Hemisphere'])
ax.set_ylabel('Sighting probability')

total_sightings = by_month.loc['1943':].Date_time.sum()
total_sightings_north = by_month_north.loc['1943':].Date_time.sum()
print('Sightings in the northern hemisphere are', round(total_sightings_north/total_sightings *100, 2), '% of the total')

**4.  Day and time**

At what time are we more likely to be visited by aliens, and how long do these encounters last?

By extracting the total number of sightings per day of the week for all years since 1943 and normalizing it by the total number of sightings, we can estimate on which day of the week it is most likely to meet a little green man (if they are green). In the left plot below, all days have a comparable likelihood of UFO sightings. However, this likelihood increases during the weekend, with Saturday being the day with the highest frequency of encounters.

So, summertime and weekends: it seems that we are getting a profile of extraterrestrials. Let's now see at what time of day a close encounter of the third kind is most likely.

A closer look at the reported sighting times reveals another interesting behavioral fact about aliens. As shown in the histogram in the middle, aliens mainly show up from 17:00 (5 PM) to 3:00 AM. The plot on the right shows the distribution of the yearly median sighting length.

So we have summer, weekends and nighttime. These creatures are party animals!
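
The weekday and hour profiles boil down to counting and normalizing. A small sketch on invented timestamps is given here (`stamps`, `day_share` and `night_share` are hypothetical names; the 17:00 to 03:00 window is taken from the text above):

```python
import pandas as pd

# Invented timestamps: Sat, Sun, Mon, Sat, Wed.
stamps = pd.to_datetime([
    "2004-07-03 22:00", "2004-07-04 21:30", "2004-07-05 01:00",
    "2004-07-10 23:15", "2004-07-07 12:00",
])

week_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
              "Friday", "Saturday", "Sunday"]

# Fraction of sightings per weekday, in calendar (not alphabetical) order.
day_share = (
    pd.Series(stamps.day_name())
    .value_counts(normalize=True)
    .reindex(week_order, fill_value=0.0)
)

# Fraction of sightings reported between 17:00 and 03:00.
hours = pd.Series(stamps.hour)
night_share = ((hours >= 17) | (hours < 3)).mean()

print(day_share)
print("night share:", night_share)
```

Reindexing by weekday name avoids relying on the alphabetical positions of the groups, which is what the positional `reindex([1,5,6,4,0,2,3])` in the cell below depends on.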


In [None]:
import matplotlib.pyplot as plt

#get the day and time as columns
day_time = pd.DataFrame({
    'day': sightings.index.day_name(), # day name of each sighting
    'time': sightings.index.hour,      # hour of the day of each sighting
})
grouped_day = day_time.groupby(by=['day'], as_index=False).count().reindex([1,5,6,4,0,2,3]).set_index('day') #reindex here sets the day names in weekly order (not alphabetical)

def get_yearly_medians(df):
    medians = []
    years = np.arange(1943, 2015)
    for year in years:
        medians.append(df.loc[str(year)].length_of_encounter_seconds.median()) # yearly median, which is robust against outliers
    return np.array(medians)
length_median = get_yearly_medians(sightings)

fig = plt.figure(figsize=(20,5))
ax1 = fig.add_subplot(131)
(grouped_day.time/grouped_day.time.sum()).plot.bar(rot=45, ax=ax1)

ax2 = fig.add_subplot(132)
day_time.time.plot.hist(bins=24, ax=ax2)
ax2.set_xlabel('Hour of the day')

ax3 = fig.add_subplot(133)
ax3.hist(length_median, 20)
ax3.set_xlabel('Time (seconds)')
ax3.set_ylabel('Yearly frequency')

#both numbers summarize the yearly medians, not the individual sightings
print('The mean of the yearly median sighting lengths is', length_median.mean()/60, 'minutes')
print('The median of the yearly median sighting lengths is', np.median(length_median)/60, 'minutes')
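
The yearly median is used above because a few very long reports can dominate a mean. A toy illustration with invented durations (the table `toy` and its values are made up for this sketch):

```python
import pandas as pd

# Invented encounter lengths in seconds; one report claims a full day.
stamps = pd.to_datetime([
    "2001-03-01", "2001-06-10", "2001-09-20",
    "2002-02-14", "2002-07-04", "2002-11-30",
])
toy = pd.DataFrame(
    {"length_of_encounter_seconds": [60.0, 120.0, 86400.0, 30.0, 300.0, 600.0]},
    index=stamps,
)

# Group the durations by calendar year and compare both summaries.
by_year = toy.groupby(toy.index.year)["length_of_encounter_seconds"]
yearly_mean = by_year.mean()
yearly_median = by_year.median()

print(pd.DataFrame({"mean": yearly_mean, "median": yearly_median}))
```

In the first toy year the single day-long report pulls the mean to 28860 seconds, while the median stays at 120 seconds.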

**5.** **Which is the typical shape of a UFO?**

By computing the fraction of sightings in which a UFO of a given shape was observed, we can determine the most common UFO shape. It turns out that most people report not an actual shape but a light. After that, a circle and a teardrop are the most commonly observed shapes.
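
For a single column, the same ratio can be obtained directly with `value_counts(normalize=True)`. A minimal sketch on a made-up list of shapes (the values are invented and only the column idea is taken from the dataset):

```python
import pandas as pd

# Made-up shape reports; None stands for a sighting without a recorded shape.
shapes = pd.Series(["light", "light", "circle", "teardrop",
                    "light", "circle", None])

# Fraction of each shape among the reports that have a shape
# (missing values are ignored by default), most common first.
share = shapes.value_counts(normalize=True)
share_pct = (100 * share).round(1)

print(share_pct)
```

Note that the fractions are relative to the sightings with a recorded shape, so they sum to 1 even when some reports have no shape.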

In [None]:
ufo_shape = sightings.groupby(by='UFO_shape').count()
ufo_shape = 100 * ufo_shape.Date_time / ufo_shape.Date_time.sum() #share of each shape in percent
ax = ufo_shape.plot.bar(rot=65, figsize=(8,6))
ax.set_ylabel('Percentage')

In [None]:
#get the geolocation coordinate values of sightings
sightings_coord = sightings[['latitude', 'longitude']]

lon, lat = sightings_coord.longitude.values, sightings_coord.latitude.values
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
%matplotlib inline  
#m = Basemap(width=120000000,height=90000000,projection='lcc',
#            resolution='None',lat_1=45.,lat_2=55,lat_0=50,lon_0=-107.)
#draw the two regions on separate axes so that the maps do not overlap
fig, (ax_west, ax_east) = plt.subplots(1, 2, figsize=(16,6))
m = Basemap(projection='merc',llcrnrlat=0,urcrnrlat=80,
            llcrnrlon=-180,urcrnrlon=-40,lat_ts=20,resolution='c', ax=ax_west)
m1 = Basemap(projection='merc',llcrnrlat=0,urcrnrlat=80,
            llcrnrlon=-20,urcrnrlon=80,lat_ts=20,resolution='c', ax=ax_east)
x, y = m(lon, lat)
m.scatter(x,y, marker='o',color='y', alpha=0.5)
x1, y1 = m1(lon, lat)
m1.scatter(x1,y1, marker='o',color='y', alpha=0.5)

#m.fillcontinents(color='coral',lake_color='aqua')
#m.drawcoastlines()
#m.drawcountries()
#m.drawmapboundary(fill_color='aqua')
m.bluemarble()
m1.bluemarble()
fig.suptitle("Sightings locations")
plt.show()
