**Data analysis and visualization of**   
**UFOs in North America 1969 to 2019 dataset**

By: Myrna M Figueroa Lopez   

Source:    
THE NATIONAL UFO REPORTING CENTER:     
Dedicated to the Collection and Dissemination of Objective UFO Data http://www.nuforc.org/.

Data limitations: The data depend on reports made to NUFORC. Some entries are incomplete; I removed these so the illustrations could be produced in a timely fashion.

Tasks:
1. Visualization through plots and maps.
2. Text mining of report summaries.
3. Descriptive statistics.
4. Prediction.



In [None]:
#  Python 3 environment as defined by the kaggle/python Docker 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os

#obtain data file names
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
#First, open CSV file as a pandas dataframe
df = pd.read_csv('../input/ufo-sightings-1969-to-2019/nuforc_reports.csv')
#visualize the first 5 rows of the dataframe
df.head()

In [None]:
#dropping columns I won't use
df1=df.drop( ['stats', 'posted', 'text'], axis=1)
df1.head(2)

In [None]:
df1.info()

### Visualization of dataset

In [None]:
#explore the data
#display cities with most UFO sighting reports
df1['city'].value_counts(dropna=False)

In [None]:
#verify the data is for North America
df1['state'].unique()

In [None]:
#looking for entries outside of North America
df1['state'].value_counts(dropna=False)

In [None]:
#5235 state entries are NaN
#verify these are locations in North America

df1[df1['state'].isna()]

In [None]:
#remove those rows with NaNs in the state column 
df1.dropna(subset=['state'], how='all', inplace=True)

#remove those rows with NaNs in the date and time column 
df1.dropna(subset=['date_time'], how='all', inplace=True)
df1

After dropping the rows with a missing state or date_time, the row count went from 88125 to 81831.    
The 6294 removed entries were either not from North America (no state recorded) or lacked a date and time.
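
To see how much each cleaning step contributes to a drop like this, the row count can be recorded after every `dropna`. A minimal sketch on a toy frame: the four sample rows are invented for illustration, and only the column names `state` and `date_time` come from the dataset.

```python
import pandas as pd

# Toy stand-in for df1; the values are made up for illustration only.
toy = pd.DataFrame({
    "state": ["CA", None, "WA", "FL"],
    "date_time": ["2014-07-04T21:00:00", "2015-01-01T00:30:00",
                  None, "2019-05-05T22:15:00"],
})

rows_start = len(toy)

# Drop in two steps so each step's effect can be measured separately.
after_state = toy.dropna(subset=["state"])
after_date = after_state.dropna(subset=["date_time"])

removed_for_state = rows_start - len(after_state)
removed_for_date = len(after_state) - len(after_date)

print("removed for missing state:", removed_for_state)
print("removed for missing date_time:", removed_for_date)
print("rows left:", len(after_date))
```

Keeping the two counts apart makes it clear which rows were lost for lack of a location and which for lack of a timestamp.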

In [None]:
#find those rows with no latitude or longitude data

needData=df1[df1['city_latitude'].isna()]
needData

10877 UFO sighting reports in North America do not include the latitude or longitude of the city where they occurred. I need these coordinates before plotting the reports on a map.

In [None]:
# applying groupby() function to 
# group the data on city value. 
cities = df1.groupby('city') 

# Finding the values contained in the "Peoria" cities 
Peoria=cities.get_group('Peoria') 
PeoriaNY=Peoria.groupby('state')

#looking for other Peoria, NY reports 
#to copy their lat and lon
PeoriaNY=PeoriaNY.get_group('NY') 
PeoriaNY

I looked for other entries from the same location to copy their latitude and longitude, but I couldn't find any. I have to add the coordinates manually for the rows that are missing them.
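
The same lookup can be attempted for every city at once instead of one city at a time. The sketch below is an assumption about how that could be done, not the method used in this notebook: it groups a toy frame by state and city and copies the first known coordinate in each group into the rows that lack one. The sample rows and the Peoria, IL coordinates are illustrative values.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df1; rows and coordinates are illustrative only.
toy = pd.DataFrame({
    "city": ["Peoria", "Peoria", "Peoria", "Hoover"],
    "state": ["IL", "IL", "NY", "AL"],
    "city_latitude": [40.69, np.nan, np.nan, np.nan],
    "city_longitude": [-89.59, np.nan, np.nan, np.nan],
})

for col in ["city_latitude", "city_longitude"]:
    # "first" returns the first non-null value in each (state, city) group,
    # or NaN when the whole group has no coordinates.
    known = toy.groupby(["state", "city"])[col].transform("first")
    toy[col] = toy[col].fillna(known)

# Rows that no other report could help with still need manual coordinates.
still_missing = toy[toy["city_latitude"].isna()]
print(still_missing[["city", "state"]])
```

Grouping on both state and city matters because the same city name can occur in several states. Groups with no coordinates at all stay empty and still need manual entry.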

In [None]:
#entries per state that need lat and lon
needData.groupby('state')["city"].count() 

In [None]:
Alberta=needData[needData.state == 'AB']
Alberta.head()

In [None]:
df1Alberta=df1[df1.city == 'Lethbridge']
df1Alberta.head()

In [None]:
#replace NaN of rows from this city without coordinates
#Example: Lethbridge, AB, Canada coordinates are 49.665183, -112.81
## An entry error in the original data source (CSV) mislabeled row 310's city
## as Lethbrdge, so no coordinates were produced.
## Below, I fix this entry.

df1.loc[df1['city'] == 'Lethbrdge', 'city_latitude'] = 49.665183
df1.loc[df1['city'] == 'Lethbrdge', 'city_longitude'] = -112.81

#while the original row label is 310, the entry is now at position 261
#verify that the coordinates were updated
df1.iloc[261]

In [None]:
#Reset row indexes of the DataFrame
#to ignore those rows indexes previously removed

df2 = df1.reset_index(drop=True)
df2.head(2)

In [None]:
#some city entries are erroneously labeled by their state
#I will remove those entries

#Example: Alberta as city
df1Alb=df2[df2.city == 'Alberta']
df1Alb

In [None]:
#I create a new dataframe, where city does not include Alberta
#and reset indexes
df2=df2[df2.city != 'Alberta'].reset_index(drop=True)
df2

In [None]:
#before adding lat and lon to the city of St Albert,
#I make sure that the city name only occurs in the state of Alberta
df2StA=df2[df2.city == 'Saint Albert']
df2StA.head(2)

In [None]:
#add the lat and lon for the various St Albert, AB, CANADA entries
df2.loc[df2['city'] == 'Saint Albert', 'city_latitude'] = 53.630474
df2.loc[df2['city'] == 'Saint Albert', 'city_longitude'] = -113.625641

In [None]:
Wa=needData[needData.state == 'WA']
Wa['city'].value_counts()

In [None]:
#add the lat and lon for the various WA city entries 
##I use 2 conditions to ensure I replace the right information

df2.loc[(df2["state"]=="WA") & (df2["city"]=="South Hill"), 'city_latitude'] = 47.1193
df2.loc[(df2["state"]=="WA") & (df2["city"]=="South Hill"), 'city_longitude'] = -122.2877
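
The two-condition assignment above is repeated for many cities in the cells that follow. One way to keep those values in one place is a lookup table applied in a loop. This is only a sketch: `fill_coords` and `coords` are hypothetical names introduced here, and the toy frame (including a "South Hill" row in OR) is invented to show that the state condition is respected.

```python
import numpy as np
import pandas as pd

# (state, city) -> (latitude, longitude), copied from the cells in this notebook
coords = {
    ("WA", "South Hill"): (47.1193, -122.2877),
    ("WA", "Shoreline"): (47.755653, -122.341515),
}

def fill_coords(frame, lookup):
    """Write each (lat, lon) pair into the rows matching its state and city."""
    for (state, city), (lat, lon) in lookup.items():
        mask = (frame["state"] == state) & (frame["city"] == city)
        frame.loc[mask, "city_latitude"] = lat
        frame.loc[mask, "city_longitude"] = lon
    return frame

# Toy rows for illustration only
toy = pd.DataFrame({
    "state": ["WA", "WA", "OR"],
    "city": ["South Hill", "Shoreline", "South Hill"],
    "city_latitude": [np.nan, np.nan, np.nan],
    "city_longitude": [np.nan, np.nan, np.nan],
})
toy = fill_coords(toy, coords)
print(toy)
```

Adding a city then means adding one dictionary entry rather than a new cell with two near-identical lines.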

In [None]:
#Repeat for those cities with the most entries

df2.loc[(df2["state"]=="WA") & (df2["city"]=="Shoreline"), 'city_latitude'] = 47.755653
df2.loc[(df2["state"]=="WA") & (df2["city"]=="Shoreline"), 'city_longitude'] = -122.341515

In [None]:
df2.loc[(df2["state"]=="WA") & (df2["city"]=="Spokane Valley"), 'city_latitude'] = 47.6732
df2.loc[(df2["state"]=="WA") & (df2["city"]=="Spokane Valley"), 'city_longitude'] = -117.2394

In [None]:
df2.loc[(df2["state"]=="WA") & (df2["city"]=="Burien"), 'city_latitude'] = 47.4668
df2.loc[(df2["state"]=="WA") & (df2["city"]=="Burien"), 'city_longitude'] = -122.3405

In [None]:
df2.loc[(df2["state"]=="WA") & (df2["city"]=="Des Moines"), 'city_latitude'] = 47.4018
df2.loc[(df2["state"]=="WA") & (df2["city"]=="Des Moines"), 'city_longitude'] = -122.3243

In [None]:
df2.loc[(df2["state"]=="WA") & (df2["city"]=="Camano Island"), 'city_latitude'] = 48.1740
df2.loc[(df2["state"]=="WA") & (df2["city"]=="Camano Island"), 'city_longitude'] = -122.5282

In [None]:
df2.loc[(df2["state"]=="WA") & (df2["city"]=="Mount Rainier National Park"), 'city_latitude'] = 46.8800
df2.loc[(df2["state"]=="WA") & (df2["city"]=="Mount Rainier National Park"), 'city_longitude'] = -121.7269

In [None]:
#I noted that some cities have NaNs. 
#remove those rows with NaNs in the city column 
df2.dropna(subset=['city'], how='all', inplace=True)

In [None]:
#reset row indexes again
#(reset_index returns a new dataframe, so the result is assigned back)
df2 = df2.reset_index(drop=True)

After the changes above, the number of rows went from 81831 to 81715.

In [None]:
#Arizona
AZ=needData[needData.state == 'AZ']
AZ['city'].value_counts()

In [None]:
view=df2[df2.city=='Arizona']
view

In [None]:
#first, drop the rows whose city is mislabeled as Arizona
#and reset the index
#by creating a new dataframe, where city does not include Arizona
df2=df2[df2.city != 'Arizona'].reset_index(drop=True)

In [None]:
#add the lat and lon for various AZ cities with several entries
##I use 2 conditions to ensure I replace the right information

df2.loc[(df2["state"]=="AZ") & (df2["city"]=="Lake Havasu"), 'city_latitude'] = 34.4839
df2.loc[(df2["state"]=="AZ") & (df2["city"]=="Lake Havasu"), 'city_longitude'] = -114.3225

In [None]:
df2.loc[(df2["state"]=="AZ") & (df2["city"]=="North Phoenix"), 'city_latitude'] = 33.6894
df2.loc[(df2["state"]=="AZ") & (df2["city"]=="North Phoenix"), 'city_longitude'] = -112.0994

In [None]:
#Wisconsin
WI=needData[needData.state == 'WI']
WI['city'].value_counts()

In [None]:
df2.loc[(df2["state"]=="WI") & (df2["city"]=="Wauwatosa"), 'city_latitude'] = 43.0495
df2.loc[(df2["state"]=="WI") & (df2["city"]=="Wauwatosa"), 'city_longitude'] = -88.0076

In [None]:
df2.loc[(df2["state"]=="WI") & (df2["city"]=="West Allis"), 'city_latitude'] = 43.0167
df2.loc[(df2["state"]=="WI") & (df2["city"]=="West Allis"), 'city_longitude'] = -88.0070

In [None]:
#Alabama
AL=needData[needData.state == 'AL']
AL['city'].value_counts()

In [None]:
df2.loc[(df2["state"]=="AL") & (df2["city"]=="Hoover"), 'city_latitude'] = 33.4054
df2.loc[(df2["state"]=="AL") & (df2["city"]=="Hoover"), 'city_longitude'] = -86.8114

In [None]:
df2.loc[(df2["state"]=="AL") & (df2["city"]=="Homewood"), 'city_latitude'] = 33.4718
df2.loc[(df2["state"]=="AL") & (df2["city"]=="Homewood"), 'city_longitude'] = -86.8008

In [None]:
df2.loc[(df2["state"]=="AL") & (df2["city"]=="Fort Morgan"), 'city_latitude'] = 30.2256
df2.loc[(df2["state"]=="AL") & (df2["city"]=="Fort Morgan"), 'city_longitude'] = -88.0189

In [None]:
#Oregon
OR=needData[needData.state == 'OR']
OR['city'].value_counts()

In [None]:
df2.loc[(df2["state"]=="OR") & (df2["city"]=="Aloha"), 'city_latitude'] = 45.4943
df2.loc[(df2["state"]=="OR") & (df2["city"]=="Aloha"), 'city_longitude'] = -122.8670

In [None]:
df2.loc[(df2["state"]=="OR") & (df2["city"]=="Chiloquin"), 'city_latitude'] = 42.5776
df2.loc[(df2["state"]=="OR") & (df2["city"]=="Chiloquin"), 'city_longitude'] = -121.8661

In [None]:
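#NOTE: this cell repeats the city name "Chiloquin" with different coordinates,
#which would overwrite the values set in the previous cell.
#The assignment is disabled until the city these coordinates belong to is filled in.
#df2.loc[(df2["state"]=="OR") & (df2["city"]=="Chiloquin"), 'city_latitude'] = 44.3935
#df2.loc[(df2["state"]=="OR") & (df2["city"]=="Chiloquin"), 'city_longitude'] = -122.9848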

In [None]:
#British Columbia
BC=needData[needData.state == 'BC']
BC['city'].value_counts()

In [None]:
#The same city appears several times, misspelled
#I added the lat and lon for the different instances

df2.loc[(df2["state"]=="BC") & (df2["city"]=="Langley"), 'city_latitude'] = 49.1042
df2.loc[(df2["state"]=="BC") & (df2["city"]=="Langley"), 'city_longitude'] = -122.6604

In [None]:
#misspelled Langley city rows
df2.loc[(df2["state"]=="BC") & (df2["city"]=="Langely"), 'city_latitude'] = 49.1042
df2.loc[(df2["state"]=="BC") & (df2["city"]=="Langely"), 'city_longitude'] = -122.6604

df2.loc[(df2["state"]=="BC") & (df2["city"]=="Lanley"), 'city_latitude'] = 49.1042
df2.loc[(df2["state"]=="BC") & (df2["city"]=="Lanley"), 'city_longitude'] = -122.6604
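
An alternative to assigning coordinates to each misspelling separately is to normalize the city names first, so that a single coordinate assignment covers all variants. A minimal sketch, assuming the two misspellings handled above are the only ones; `spelling_fixes` is a hypothetical name and the toy rows are invented for illustration.

```python
import pandas as pd

# Misspelled variant -> canonical name (variants taken from the cells above)
spelling_fixes = {"Langely": "Langley", "Lanley": "Langley"}

# Toy stand-in for df2
toy = pd.DataFrame({
    "state": ["BC", "BC", "BC", "WA"],
    "city": ["Langley", "Langely", "Lanley", "Langley"],
})

# Restrict the correction to BC so same-named cities elsewhere are untouched.
in_bc = toy["state"] == "BC"
toy.loc[in_bc, "city"] = toy.loc[in_bc, "city"].replace(spelling_fixes)

print(toy)
```

After this step every BC row for the city carries the same name, which also keeps later counts per city consistent.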

In [None]:
df2.loc[(df2["state"]=="BC") & (df2["city"]=="West Vancouver"), 'city_latitude'] = 49.3286
df2.loc[(df2["state"]=="BC") & (df2["city"]=="West Vancouver"), 'city_longitude'] = -123.1602

In [None]:
#re-verify for rows with no latitude or longitude data

needData2=df2[df2['city_latitude'].isna()]
needData2

In [None]:
#entries per state that need lat and lon
needData2.groupby('state')["city"].count()

In [None]:
#CA
CA=needData2[needData2.state == 'CA']
CA['city'].value_counts()


In [None]:
df2.loc[(df2["state"]=="CA") & (df2["city"]=="Hollywood"), 'city_latitude'] = 34.0928
df2.loc[(df2["state"]=="CA") & (df2["city"]=="Hollywood"), 'city_longitude'] = -118.3287

In [None]:
df2.loc[(df2["state"]=="CA") & (df2["city"]=="Eastvale"), 'city_latitude'] = 33.9525
df2.loc[(df2["state"]=="CA") & (df2["city"]=="Eastvale"), 'city_longitude'] = -117.5848

In [None]:
df2.loc[(df2["state"]=="CA") & (df2["city"]=="Red Bluff"), 'city_latitude'] = 40.1785
df2.loc[(df2["state"]=="CA") & (df2["city"]=="Red Bluff"), 'city_longitude'] = -122.2358

In [None]:
#PA

PA=needData2[needData2.state == 'PA']
PA['city'].value_counts()


In [None]:
df2.loc[(df2["state"]=="PA") & (df2["city"]=="North Huntingdon"), 'city_latitude'] = 40.3302
df2.loc[(df2["state"]=="PA") & (df2["city"]=="North Huntingdon"), 'city_longitude'] = -79.7307

In [None]:
df2.loc[(df2["state"]=="PA") & (df2["city"]=="Yardley"), 'city_latitude'] = 40.2457
df2.loc[(df2["state"]=="PA") & (df2["city"]=="Yardley"), 'city_longitude'] = -74.8460

In [None]:
df2.loc[(df2["state"]=="PA") & (df2["city"]=="Whitehall"), 'city_latitude'] = 40.6572
df2.loc[(df2["state"]=="PA") & (df2["city"]=="Whitehall"), 'city_longitude'] = -75.4986

In [None]:
#AR
AR=needData2[needData2.state == 'AR']
AR['city'].value_counts()


In [None]:
df2.loc[(df2["state"]=="AR") & (df2["city"]=="Hartman"), 'city_latitude'] = 35.4326
df2.loc[(df2["state"]=="AR") & (df2["city"]=="Hartman"), 'city_longitude'] = -93.6155

In [None]:
#NY
NY=needData2[needData2.state == 'NY']
NY['city'].value_counts()

In [None]:
df2.loc[(df2["state"]=="NY") & (df2["city"]=="Cheektowaga"), 'city_latitude'] = 42.9071
df2.loc[(df2["state"]=="NY") & (df2["city"]=="Cheektowaga"), 'city_longitude'] = -78.7543

In [None]:
df2.loc[(df2["state"]=="NY") & (df2["city"]=="West Seneca"), 'city_latitude'] = 42.8359
df2.loc[(df2["state"]=="NY") & (df2["city"]=="West Seneca"), 'city_longitude'] = -78.7539

In [None]:
df2.loc[(df2["state"]=="NY") & (df2["city"]=="Williamsville"), 'city_latitude'] = 42.9639
df2.loc[(df2["state"]=="NY") & (df2["city"]=="Williamsville"), 'city_longitude'] = -78.7378

In [None]:
#Manitoba

MB=needData2[needData2.state == 'MB']
MB['city'].value_counts()

In [None]:
df2.loc[(df2["state"]=="MB") & (df2["city"]=="Anola"), 'city_latitude'] = 49.8812
df2.loc[(df2["state"]=="MB") & (df2["city"]=="Anola"), 'city_longitude'] = -96.6233

In [None]:
#NM
NM=needData2[needData2.state == 'NM']
NM['city'].value_counts()


In [None]:
df2.loc[(df2["state"]=="NM") & (df2["city"]=="Espanola"), 'city_latitude'] = 35.9910
df2.loc[(df2["state"]=="NM") & (df2["city"]=="Espanola"), 'city_longitude'] = -106.0818

In [None]:
#NV
NV=needData2[needData2.state == 'NV']
NV['city'].value_counts()


In [None]:
df2.loc[(df2["state"]=="NV") & (df2["city"]=="Rachel"), 'city_latitude'] = 37.6447
df2.loc[(df2["state"]=="NV") & (df2["city"]=="Rachel"), 'city_longitude'] = -115.7428

In [None]:
#WV
WV=needData2[needData2.state == 'WV']
WV['city'].value_counts()


In [None]:
df2.loc[(df2["state"]=="WV") & (df2["city"]=="South Charleston"), 'city_latitude'] = 38.3686
df2.loc[(df2["state"]=="WV") & (df2["city"]=="South Charleston"), 'city_longitude'] = -81.6999

In [None]:
df2.loc[(df2["state"]=="WV") & (df2["city"]=="Cross Lanes"), 'city_latitude'] = 38.4204
df2.loc[(df2["state"]=="WV") & (df2["city"]=="Cross Lanes"), 'city_longitude'] = -81.7907

In [None]:
df2.loc[(df2["state"]=="WV") & (df2["city"]=="Tomahawk"), 'city_latitude'] = 39.5304
df2.loc[(df2["state"]=="WV") & (df2["city"]=="Tomahawk"), 'city_longitude'] = -78.0469

In [None]:
#TX
TX=needData2[needData2.state == 'TX']
TX['city'].value_counts()

In [None]:
df2.loc[(df2["state"]=="TX") & (df2["city"]=="The Woodlands"), 'city_latitude'] = 30.1658
df2.loc[(df2["state"]=="TX") & (df2["city"]=="The Woodlands"), 'city_longitude'] = -95.4613

In [None]:
df2.loc[(df2["state"]=="TX") & (df2["city"]=="Lakeway"), 'city_latitude'] = 30.3680
df2.loc[(df2["state"]=="TX") & (df2["city"]=="Lakeway"), 'city_longitude'] = -97.9917

In [None]:
#remove the remaining rows with NaNs for latitude
df2.dropna(subset=['city_latitude'], how='all', inplace=True)

#remove those rows with NaNs for longitude
df2.dropna(subset=['city_longitude'], how='all', inplace=True)

#reset index (assign the result; reset_index is not in-place by default)
df2 = df2.reset_index(drop=True)

Now, I have 71540 entries with coordinates.

In [None]:
import datetime
#remove time portion of date_time input
df2['date'] = df2['date_time'].apply(lambda x: pd.Timestamp(x).strftime('%Y-%m-%d'))
df2.drop('date_time', axis=1, inplace=True)
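
The date can also be extracted with vectorized datetime accessors rather than a per-row `apply`. The sketch below assumes ISO-like timestamp strings; the sample values are invented, since the exact format of `date_time` is not shown in this notebook. Unparseable values become missing instead of raising an error.

```python
import pandas as pd

# Toy stand-in for the date_time column; values are illustrative only.
toy = pd.DataFrame({
    "date_time": ["2014-07-04T21:30:00", "2019-12-31T23:59:00", "not a date"],
})

# errors="coerce" turns anything that cannot be parsed into NaT.
parsed = pd.to_datetime(toy["date_time"], errors="coerce")

toy["date"] = parsed.dt.strftime("%Y-%m-%d")  # date part only, as text
toy["year"] = parsed.dt.year                  # numeric year, NaN where parsing failed

print(toy)
```

Note that `year` is numeric here, while the rest of this notebook stores the year as a string and filters with comparisons such as `== '2014'`.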

In [None]:
#sort by date
#and reset indexes (assign the result; reset_index is not in-place by default)
df2.sort_values(by=['date'], inplace=True, ascending=True)
df2 = df2.reset_index(drop=True)

In [None]:
#libraries for plotting
import matplotlib.pyplot as plt 
%matplotlib inline
import seaborn as sns 

In [None]:
plt.figure(figsize=(14,5))
#pass the column by keyword; recent seaborn versions no longer accept it positionally
sns.countplot(x='state', data=df2)

Most reports come from California, followed by Florida and Washington state.

In [None]:
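#get year as a new column
#the year is kept as a string, which the map filters below rely on (e.g. == '2014')
df2['year'] = df2['date'].apply(lambda x: pd.Timestamp(x).strftime('%Y'))
#preview the years as integers (the result is not assigned back to df2)
df2['year'].astype(int)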

In [None]:
# Create a pie chart of dates with percentages

df2["year"].value_counts().plot.pie(label="", title="year", figsize=(10, 10), autopct='%1.1f%%', subplots=True); 
plt.show(block=True);   

Of the data in df2, 2014 was the year with the most reports of UFO sightings.
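
The busiest year can be read directly from the counts instead of from the pie chart. A small sketch on made-up year labels (strings, as in `df2`); the real counts are not reproduced here.

```python
import pandas as pd

# Toy stand-in for df2['year']; values are invented for illustration.
years = pd.Series(["2013", "2014", "2014", "2015", "2014", "2015"], name="year")

counts = years.value_counts()          # reports per year, largest first
top_year = counts.idxmax()             # label of the year with most reports
share = counts / counts.sum() * 100    # same percentages the pie chart shows

print(top_year, counts[top_year], round(share[top_year], 1))
```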

In [None]:
#map plotting libraries
import folium
from folium import plugins
from folium.plugins import HeatMap

In [None]:
#a map of North America
NorthAm = folium.Map(location=[54.5260, -105.2551],
                   zoom_start = 2)

# Ensure lats and longs datatype are floats
df2['city_latitude'] = df2['city_latitude'].astype(float)
df2['city_longitude'] = df2['city_longitude'].astype(float)

# Filter the DF for 2014 rows

heat_df = df2[df2['year']=='2014'] # Reducing data size so it runs faster
heat_df = heat_df[['city_latitude', 'city_longitude']]

# List comprehension to make a list of lists
heat_data = [[row['city_latitude'],row['city_longitude']] for index, row in heat_df.iterrows()]

# Plot it on the map
HeatMap(heat_data).add_to(NorthAm)

# Display the map
NorthAm
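
The list of `[latitude, longitude]` pairs that the heat map receives can also be built without iterating over rows. This sketch only covers the data preparation, on invented rows, and leaves out folium itself; the resulting list has the same shape as `heat_data` above.

```python
import pandas as pd

# Toy stand-in for df2; coordinates reuse values from earlier cells.
toy = pd.DataFrame({
    "city_latitude": [34.0928, 47.4668, 40.1785],
    "city_longitude": [-118.3287, -122.3405, -122.2358],
    "year": ["2014", "2014", "2019"],
})

# Filter to one year, keep the two coordinate columns, convert to a list of lists.
is_2014 = toy["year"] == "2014"
heat_data = toy.loc[is_2014, ["city_latitude", "city_longitude"]].values.tolist()

print(heat_data)
```

On tens of thousands of rows this avoids the per-row overhead of `iterrows()`.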

In [None]:
#California sightings 1969-2019

Cali=df2[df2.state == 'CA']

#a map of California
InCali = folium.Map(location=[36.7783, -119.4179],
                   zoom_start = 4)

heat_df2 = Cali[['city_latitude', 'city_longitude']]

# List comprehension to make a list of lists
heat_data2 = [[row['city_latitude'],row['city_longitude']] for index, row in heat_df2.iterrows()]

# Plot it on the map
HeatMap(heat_data2).add_to(InCali)

# Display the map
InCali

In [None]:
#Map of California sightings in 2019 alone

#a map of California
InCali2019 = folium.Map(location=[36.7783, -119.4179],
                   zoom_start = 6)

heat_df3 = Cali[Cali['year']=='2019'] # Reducing data size so it runs faster
heat_df3 = heat_df3[['city_latitude', 'city_longitude']]

# List comprehension to make a list of lists
heat_data3 = [[row['city_latitude'],row['city_longitude']] for index, row in heat_df3.iterrows()]

# Plot it on the map
HeatMap(heat_data3).add_to(InCali2019)

# Display the map
InCali2019


### Text mining for common words

In [None]:
import nltk #natural language toolkit
from nltk import word_tokenize
from nltk.corpus import stopwords

In [None]:
## Change the summary column to string
df2['words'] = df2['summary'].astype(str)

## Lowercase all summaries
df2['words'] = df2['words'].apply(lambda x: " ".join(x.lower() for x in x.split()))

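## remove punctuation
## (regex=True is required in recent pandas, where the pattern is otherwise treated literally)
df2['words'] = df2['words'].str.replace(r'[^\w\s]', '', regex=True)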

In [None]:
#remove common words (like the, are, all, etc.)
#a set makes the membership test below much faster than a list
stop = set(stopwords.words('english'))
df2['words'] = df2['words'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))

In [None]:
#Most frequent words in the report summaries
wordfreq=df2.words.str.split(expand=True).stack().value_counts()
freqs=pd.DataFrame(wordfreq)
Descriptions=freqs.head(35)
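
The same counting idea can be sketched with the standard library alone, which avoids building a wide intermediate dataframe. The summaries and the three-word stop list below are invented stand-ins for `df2['summary']` and NLTK's English stop words.

```python
import re
from collections import Counter

# Made-up report summaries for illustration
summaries = [
    "Bright light in the sky.",
    "A light, moving fast!",
    "Three lights in formation",
]
stop = {"in", "the", "a"}  # tiny stand-in for the NLTK stop-word list

counts = Counter()
for text in summaries:
    # lowercase, strip punctuation, then count the words that are not stop words
    cleaned = re.sub(r"[^\w\s]", "", text.lower())
    counts.update(word for word in cleaned.split() if word not in stop)

most_common_word, most_common_count = counts.most_common(1)[0]
print(most_common_word, most_common_count)
```

Note that "light" and "lights" are counted separately; neither this sketch nor the cells above apply stemming.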

### Numeric descriptions

In [None]:
##numeric data to explore
#word counts

##Top 35 words found in the summaries
Descriptions

In [None]:
#mean of numeric data
##*not an interpretation of the summaries
Descriptions.mean()

In [None]:
Descriptions.std() #Standard deviation

In [None]:
Descriptions.skew()

In [None]:
#all together
Descriptions.describe()

In [None]:
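#sort by the count column, whatever its label (0 or 'count', depending on the pandas version)
Descriptions.sort_values(by=Descriptions.columns[0], ascending=True).plot(kind='barh',
                                                     title='Frequent Words in UFO reports',
                                                      figsize=(10, 10), color='darkgreen')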

In [None]:
#categorize the shape column

# One-hot encode the shape column
one_hot = pd.get_dummies(df2['shape'], prefix='shape', dummy_na=True)
# Join the encoded df
df3 = df2.join(one_hot)
df3 = df3[['city',"state", 'year', "shape_changing",'shape_chevron',
   'shape_cigar', 'shape_circle', 'shape_cone', 'shape_cross', 'shape_cylinder',
           'shape_diamond','shape_disk','shape_egg','shape_fireball','shape_flash',
           'shape_formation','shape_light','shape_other', 'shape_oval',
           'shape_rectangle', 'shape_sphere', 'shape_teardrop', "shape_triangle"]]
df3  

In [None]:
df3.describe(include="all")

In [None]:
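#correlation between the one-hot shape columns
#(city, state and year are text columns, so they are left out)
df3.drop(columns=['city', 'state', 'year']).astype(float).corr()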

In [None]:
shapes=df3.drop(['city', 'state'], axis=1)
shapes=shapes.astype(int)

In [None]:
shapes.sum()
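
Summing one-hot columns gives the number of reports per shape, and the same totals can be cross-checked with `value_counts` on the original column. A minimal sketch on invented shapes:

```python
import pandas as pd

# Toy stand-in for df2['shape']; None marks a report without a shape.
toy = pd.DataFrame({"shape": ["circle", "triangle", "circle", None, "sphere"]})

one_hot = pd.get_dummies(toy["shape"], prefix="shape")
totals_from_dummies = one_hot.sum().astype(int)   # one total per shape_* column
totals_from_counts = toy["shape"].value_counts()  # same totals, missing shapes ignored

print(totals_from_dummies)
print(totals_from_counts)
```

The two results should agree shape by shape; a mismatch would point to a problem in the encoding or column selection.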

In [None]:
#plotting shape totals

fig, ax = plt.subplots()

ax.plot(df3['shape_circle'].sum(),c='g',marker="o",ls='--',label='circle')
ax.plot(df3['shape_triangle'].sum(),c='k',marker="^", ls='-',label='triangle')
ax.plot(df3['shape_sphere'].sum(),c='r',marker="+",ls='-',label='sphere')
plt.xlim(-1, 5)
plt.ylim(0, 9000)

plt.legend(loc=1)
plt.draw()

In [None]:
#### new df
df4=df1[['state','date_time', 'shape']]

#Drop rows with NaNs
df4 = df4.dropna()
df4

In [None]:
df4.value_counts()

In [None]:
df4.describe()

In [None]:
dates=df4["date_time"].value_counts()
dates

In [None]:
df4["shape"].value_counts()

In [None]:
#make a column of the count list
reportsPerDate = pd.DataFrame (dates)
reportsPerDate.describe()

In [None]:
#turn indexes to a column
reportsPerDatemodified = reportsPerDate.reset_index()

#rename columns by position: first the date, then its number of reports
#(the labels produced by reset_index differ between pandas versions)
reportsPerDatemodified.columns = ['date_time', 'total_reports']

In [None]:
reports=reportsPerDatemodified.sort_values(by=['total_reports'],ascending=False)
reports

In [None]:
#get year as a new column and convert to integer
reports['year'] = reports["date_time"].apply(lambda x: pd.Timestamp(x).strftime('%Y'))
reports['year']=reports['year'].astype(int)

In [None]:
reports2014_2015=reports[(reports.year >= 2014) & (reports.year <= 2015)]

In [None]:
reports2014_2015.plot.scatter(x='year', y='total_reports', color='salmon', alpha=0.5)
plt.title('Reports per year')

In [None]:
#predict
reports2014_2015 = pd.get_dummies(reports2014_2015, columns=['year'], prefix='year')
reports2014_2015 = reports2014_2015.drop(['date_time'], axis = 1) 
reports2014_2015

In [None]:
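#Binary classification: predict the year (2014 vs 2015) from the number of reports
import sklearn as sk
from sklearn import metrics #module for accuracy calculation
y = reports2014_2015['year_2015'].astype(int) #target: 1 if the row is from 2015, else 0
X = reports2014_2015[['total_reports']] #feature: reports per date_time; the target is kept out of X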

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [None]:
#decision tree
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier().fit(X_train, y_train)
print('Accuracy of Decision Tree classifier on training set: {:.2f}'
     .format(clf.score(X_train, y_train)))
print('Accuracy of Decision Tree classifier on test set: {:.2f}'
     .format(clf.score(X_test, y_test)))
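
An accuracy figure is easier to interpret next to a trivial baseline. The sketch below is an illustration on synthetic data, not a result from this dataset: it compares a decision tree with scikit-learn's `DummyClassifier`, which always predicts the most frequent class. The random features and labels are assumptions made only so the example runs on its own.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: one count-like feature and a binary label unrelated to it.
rng = np.random.default_rng(0)
X = rng.integers(1, 20, size=(200, 1))
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Baseline: always predict the most frequent class seen in training.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)

baseline_acc = baseline.score(X_test, y_test)
tree_acc = tree.score(X_test, y_test)

print('Baseline accuracy on test set: {:.2f}'.format(baseline_acc))
print('Decision Tree accuracy on test set: {:.2f}'.format(tree_acc))
```

If the tree does not clearly beat the baseline on the test set, the feature carries little information about the label.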