A day late because I got bogged down trying to make a map visualization; perhaps it will be helpful for someone. I also imported additional datasets to normalize UFO sightings per state by state population.

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from scipy.stats import chisquare
from sklearn.preprocessing import StandardScaler
# GFX
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import seaborn as sns
from mpl_toolkits.basemap import Basemap
from matplotlib.patches import Polygon
%matplotlib inline

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))
print(check_output(["ls", "../input/ufo-sightings/"]).decode("utf8"))

In [2]:
df = pd.read_csv("../input/ufo-sightings/scrubbed.csv")
print(df.shape)
df.head(5).transpose()

Examine NaNs:

In [3]:
print("Unique states:")
print(df['state'].unique())
# mean of a boolean mask = fraction of missing values
print("Proportion of NaNs: %.2f" % df['state'].isnull().mean())
print("Unique shapes:")
print(df['shape'].unique())
print("Proportion of NaNs: %.2f" % df['shape'].isnull().mean())
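
To check every column at once rather than one at a time, something like the following could be used. This is a minimal sketch on a made-up `toy` frame (the values are invented, not taken from `scrubbed.csv`):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the sightings table; values are made up for illustration.
toy = pd.DataFrame({
    "state": ["ca", "az", np.nan, "wa"],
    "country": ["us", np.nan, np.nan, "us"],
    "shape": ["light", "disk", "light", np.nan],
})

# isnull() gives booleans; their column-wise mean is the fraction of missing values.
nan_share = toy.isnull().mean()
print(nan_share.round(2))
```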

Reformat strings as categories:

In [4]:
for feature in ['country', 'shape', 'state']:
    df[feature] = df[feature].astype('category')

Histograms for states and shapes.

In [20]:
fig, axs = plt.subplots(2,1,figsize=(15,10))
axs = axs.flatten()
state_order = df.groupby('state').size().sort_values(ascending=False).index.values
# pass the data as x= (recent seaborn versions no longer accept it positionally)
sns.countplot(x=df['state'], order=state_order, ax=axs[0]);
shape_order = df.groupby('shape').size().sort_values(ascending=False).index.values
sns.countplot(x=df['shape'], order=shape_order, ax=axs[1]);
for ax in axs:
    for tick in ax.get_xticklabels():
        tick.set_rotation(90);

CA has the most observations, but that is suspect because it is also the most populous state. It would be more reasonable to look at UFO observations per capita.

In [21]:
state_count = df.groupby('state').size()
# upper-case the state codes so they can be merged with the census table
state_count = pd.DataFrame({
    'state': [s.upper() for s in state_count.index.values],
    'count': np.asarray(state_count)
})

census = pd.read_csv('../input/us_energy_census_gdp_10-14/Energy Census and Economic Data US 2010-2014.csv')
census = census[['StateCodes', 'POPESTIMATE2014']]
census.columns = ['state', 'pop']
state_count = state_count.merge(census, how="right", on='state')
# drop states that are missing a count or a population
state_count = state_count.dropna()
# UFO observations per capita
state_count['normalized count'] = state_count['count'] / state_count['pop']
# min-max scale to [0, 1] (used later as the colormap input)
norm = state_count['normalized count']
state_count['normalized count'] = (norm - norm.min()) / (norm.max() - norm.min())
state_count = state_count.sort_values(by='normalized count', ascending=False)
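
For reference, per-capita counts followed by min-max scaling can be sketched on a tiny example. The state codes, counts and populations below are hypothetical, and `min_max` is a helper name introduced only for this illustration:

```python
import pandas as pd

# Hypothetical counts and populations, not taken from the real datasets.
toy = pd.DataFrame({
    "state": ["AA", "BB", "CC"],
    "count": [100, 300, 50],
    "pop": [1_000_000, 6_000_000, 250_000],
})

def min_max(series):
    """Rescale a numeric Series linearly so that min -> 0 and max -> 1."""
    return (series - series.min()) / (series.max() - series.min())

toy["per_capita"] = toy["count"] / toy["pop"]
toy["normalized count"] = min_max(toy["per_capita"])
print(toy.sort_values("normalized count", ascending=False))
```

Note that "BB" has the largest raw count but the smallest per-capita value, which is the kind of reversal that normalizing by population is meant to expose.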

Bar chart of observations by state, normalized by state population. Totally different story here. I wonder what's going on in Arizona.

In [23]:
fig, ax = plt.subplots(figsize=(15,5))
sns.barplot(x=state_count['state'], y=state_count['normalized count']);

Oh, I forgot: we were supposed to do a chi-square test. For both the raw counts and the per-capita-normalized counts, the observations do not appear to be uniformly distributed across states, perhaps unsurprisingly given the bar charts.

In [42]:
print('Count #s:')
print(chisquare(state_count['count']))

print("(Count #s / state pop) * mean count:")
mean_count = np.mean(state_count['count'])
print(chisquare(state_count['normalized count'].apply(lambda x: x * mean_count)))
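
An alternative way to account for population, shown here only as a sketch, is to keep the raw counts as the observed frequencies and pass population-proportional expected frequencies to `chisquare` via its `f_exp` argument. The numbers below are hypothetical, not the real state data:

```python
import numpy as np
from scipy.stats import chisquare

# Hypothetical observed counts and populations for four regions.
observed = np.array([120, 80, 40, 60])
population = np.array([4_000_000, 3_000_000, 1_000_000, 2_000_000])

# Expected counts if sightings were spread in proportion to population.
# They are scaled to sum to the observed total, which chisquare requires.
expected = population / population.sum() * observed.sum()

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print("chi2 = %.3f, p = %.4f" % (stat, p_value))
```

With this setup the null hypothesis is "sightings are proportional to population" rather than "sightings are equal across states".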

Format latitude and longitude of observations as numeric:

In [43]:
# have to get rid of one pesky ill-formatted latitude
chars = set('-.0123456789')
bad_formatting = df['latitude'].astype('str').map(lambda s: not all((c in chars) for c in s))
# assign through a single .loc call to avoid chained assignment
df.loc[bad_formatting, 'latitude'] = '33.200088'
# now OK to convert to float
df['latitude'] = df['latitude'].astype(float)
# while I'm at it let's get rid of the space after longitude col
df.rename(columns={'longitude ':'longitude'}, inplace=True)
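
A more generic way to find ill-formatted values, sketched here on made-up strings rather than the real column, is `pd.to_numeric` with `errors="coerce"`: anything that cannot be parsed becomes NaN, which points at the rows to inspect.

```python
import pandas as pd

# Made-up latitude strings; one contains a stray character.
raw = pd.Series(["47.6062", "33q.200088", "-12.5"])

# errors="coerce" turns anything unparseable into NaN instead of raising.
lat = pd.to_numeric(raw, errors="coerce")

# The NaN positions identify the rows that need manual inspection.
bad_rows = raw[lat.isnull()]
print(bad_rows)
```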

Plot a US map with each state shaded in blue according to UFO observations per capita (darker means more), with the individual sightings overlaid as points.

In [54]:
# Creating new plot
plt.figure(figsize=(20,20))
# Set up a map of the contiguous US (bounds given by the corner coordinates below)
map = Basemap(projection='cyl', 
            lat_0=46.2374,
            lon_0=2.375,
            resolution='h',
            llcrnrlon=-126, llcrnrlat=24,
            urcrnrlon=-62, urcrnrlat=51)

map.readshapefile('../input/us-states-cartographic-boundary-shapefiles/cb_2016_us_state_500k',\
                  name='states', drawbounds=False)

map.drawcoastlines(zorder=20)
map.drawcountries(zorder=20)
map.drawmapboundary()
map.drawstates(zorder=20)

ax = plt.gca()

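# Look up each state's normalized count once, then shade every polygon of that state
norm_by_state = dict(zip(state_count['state'], state_count['normalized count']))
for shape, info in zip(map.states, map.states_info):
    val = norm_by_state.get(info['STUSPS'])
    if val is not None:
        # cm.Blues maps a value in [0, 1] to RGBA; keep only the RGB part
        poly = Polygon(shape, facecolor=cm.Blues(val)[:3])
        ax.add_patch(poly)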

x,y = map(df['longitude'], df['latitude'])
map.scatter(x, y, s=0.7, alpha=0.7, c='c', zorder = 10)

plt.title("UFO Spottings in the US", fontsize=20, y=1.05)

plt.show()

Possible further research: there seem to be a lot of UFO observations per capita in the Northwest, the Northeast and Arizona. One could go through the descriptions and see whether they are similar.