# Capstone - Part 2: Problem Statement + EDA: Week 7

## REQUIREMENTS

### A. Aim

Analyse the historical plane crash data

- Is it possible to give a safety score to airline, aircraft based on the accident history

### B. Proposed Methods and Models

Analysis models
- Since the data spans over almost a century, dividing it as per time periods makes sense. 
    As the aircraft type and providers have changed over the years
- Triangulate locations that are more accident prone
- Analyse if weather has any influence. The month of a year along with location would be the factors to check.
- The Summary has important information about the incident. It wold be good to extract information form this in 2 ways:
    a. Get tag words about the accident conditions like fog, mountain etc
    b. Use NLP to understand if 
- Analyse the number of 

Impact of ticket prices after an accident is reported.? ( No dataset)

### C. Risks and Assumptions

#### Risks:
- Since the ratio of accidents to the number of flights per year are  less and it is widespread around the world, there is a risk that the data points are less. 
    
#### Assumptions:
- The dataset contains only the incident data. Since there is no positive data, it would be challenging to get in any kind of prediction model
    

### D. Goal and Criteria
    - Initial
    - Revised

### E. Local Database
- Should I write about how I scraped the data?
- Saving the local database into csv

In [2]:
import pandas as pd
import numpy as np

In [3]:
crashes = pd.read_csv('../datasets/plane_crash_info.csv')

In [4]:
import sqlite3
sqlite_db = '../database/plane_crash_info.sqlite'
conn = sqlite3.connect(sqlite_db) 
c = conn.cursor()

In [5]:
crashes.to_sql('crashes',             # Name of the table
            con=conn,                    # The handle to the file that is setup
            if_exists='replace',         # Overwrite, append, or fail
            index=False)                 # Add index as column

### F. DATA CLEANING & DATA MUNGING

### Examine the data

In [65]:
# Check the data shape to figure out the number of data samples that are available
print('Shape of data', crashes.shape)

# Since this is accidents in the airline industry and the data varies over almost a century, 
# it would be good to check what the data tells.
print(crashes.head())

# The data seems to have some missing values which are denoted by '?'.
# Replace '?' with null/nan values for easier manipulation later on
crashes.replace("?",np.nan, inplace = True)
crashes.head(2)

# Check out the features in the dataframe
features = crashes.columns
print(features)
# From the available features, what new features can be added?
# 1. Date - The date can be split into Day, Month and Year
# 2. Location - It can be used to get the latitudes and longitudes using the google maps api
#             - Location is currently in City, Country format. It can be split into 2 columns
# 3. Operator - It can be split into Flight category(Military, Commercial etc) and additional information for it 
# 4. AIrcraft Type, can we change it to less specific type

Shape of data (5762, 25)
                 Date      Time                            Location  \
0  September 17, 1908  00:00:00                 Fort Myer, Virginia   
1  September 07, 1909  00:00:00             Juvisy-sur-Orge, France   
2       July 12, 1912  00:00:00           Atlantic City, New Jersey   
3     August 06, 1913  00:00:00  Victoria, British Columbia, Canada   
4      March 05, 1915  00:00:00                     Tienen, Belgium   

                 operator Flight          Route                  AcType  Reg  \
0    Military - U.S. Army    NaN  Demonstration        Wright Flyer III  NaN   
1                     NaN    NaN       Air show          Wright Byplane  SC1   
2    Military - U.S. Navy    NaN    Test flight               Dirigible  NaN   
3                 Private    NaN            NaN        Curtiss seaplane  NaN   
4  Military - German Navy    NaN            NaN  Zeppelin L-8 (airship)  NaN   

  CnLn Ground:                                            Summary  

In [7]:
# Check the null values in the dataset
# Function to check the null counts in a dataframe
def checknullcount(data):
    info = []
    columns = ('Feature', 'Nullcount')
    for col in data.columns:
        nullcount = data[col].isnull().sum()
        info.append([col, nullcount])
    return (pd.DataFrame(columns=columns, data=info).sort_values('Nullcount', ascending = False))

In [64]:
print(checknullcount(crashes))

# There seem to be a lot of military flights. I want to check if that is the one with the maximum number of nulls
# Check thenumber of data points in military flights
import re
flights_only_military = crashes[crashes['operator'].str.contains("Military") == True]
print('No of military accidents: ', len(flights_only_military))

flights_wo_military = crashes[crashes['operator'].str.contains("Military") == False]
print('Accidents excluding military flights: ', len(flights_wo_military))

# Check nulls after the military flights have been removed
# Check the null values in the dataset
print(checknullcount(flights_wo_military))

# For now I will not drop them. instead create a dummy column which indicates if the type is military or not
crashes['IsMilitary'] = [1 if 'Military' in str(row) else 0 for row in crashes['operator']]

# Check cases where your route is null
crashes[crashes['Route'].isnull() == True]

#With this I can see that the null counts are very high for flight, I checked flight.unique and it returned XXX unique
#values.  

#len(crashes['Flight'].unique())=881

# This means that the flights are random numbers, so I can substitute any random number that 
# is not in the set crashes['Flight'].unique()

        Feature  Nullcount
24         Town       5750
23         City       5355
4        Flight       4428
5         Route       1499
8          CnLn       1210
20    CrewFatal        562
19    PassFatal        554
17   CrewAboard        545
16   PassAboard        539
7           Reg        354
10      Summary        232
22        State        213
15  TotalAboard         40
6        AcType         24
3      operator         21
18   TotalFatal         10
2      Location          6
21      Country          0
0          Date          0
14          Day          0
13        Month          0
1          Time          0
11   IsMilitary          0
9       Ground:          0
12         Year          0
No of military accidents:  829
Accidents excluding military flights:  4912
        Feature  Nullcount
24         Town       4903
23         City       4532
4        Flight       3618
5         Route       1066
8          CnLn        831
20    CrewFatal        300
19    PassFatal        296
17   Cr

881

In [9]:
# Operator

In [10]:
# Draw distributions for each feature

In [11]:
# Describe the data
crashes.describe(include='all')

Unnamed: 0,Date,Time,Location,operator,Flight,Route,AcType,Reg,CnLn,Aboard,Fatalities,Ground:,Summary,IsMilitary
count,5762,3645,5756,5741,1334,4263,5738,5408.0,4552.0,5762,5762,5713.0,5530,5762.0
unique,5179,1325,4693,2795,880,3840,2719,5356.0,4119.0,1035,914,53.0,5333,
top,"February 26, 1960",19:30,"Moscow, Russia",Aeroflot,-,Training,Douglas DC-3,49.0,178.0,2 (passengers:0 crew:2),1 (passengers:0 crew:1),0.0,Crashed on takeoff.,
freq,4,27,18,261,65,95,340,3.0,7.0,237,288,5459.0,16,
mean,,,,,,,,,,,,,,0.143874
std,,,,,,,,,,,,,,0.350992
min,,,,,,,,,,,,,,0.0
25%,,,,,,,,,,,,,,0.0
50%,,,,,,,,,,,,,,0.0
75%,,,,,,,,,,,,,,0.0


In [12]:
# Check the data types
# Does anything need to be converted
crashes.dtypes
# Aboard, Fatalities, Ground can be changed to integers

Date          object
Time          object
Location      object
operator      object
Flight        object
Route         object
AcType        object
Reg           object
CnLn          object
Aboard        object
Fatalities    object
Ground:       object
Summary       object
IsMilitary     int64
dtype: object

In [13]:
# Check for null values and find ways to impute it. 

In [14]:
# Check for unique values in each of the features

In [15]:
# NLP on summary to categorize into a type of accident

In [16]:
# What variables can be converted to dummy variableS?

### Updating/Cleaning Features

In [17]:
# 1. Date - The date can be split into Day, Month and Year
from datetime import datetime as dt

print(crashes.Date.dtypes)

print('Feature list before date was split:')
print(crashes.columns)

crashes['Year'] = pd.DatetimeIndex(crashes['Date']).year
crashes['Month'] = pd.DatetimeIndex(crashes['Date']).month
crashes['Day'] = pd.DatetimeIndex(crashes['Date']).day
crashes['Time'] = pd.DatetimeIndex(crashes['Date']).time

print('Feature list after date was split:')
print(crashes.columns)

object
Feature list before date was split:
Index(['Date', 'Time', 'Location', 'operator', 'Flight', 'Route', 'AcType',
       'Reg', 'CnLn', 'Aboard', 'Fatalities', 'Ground:', 'Summary',
       'IsMilitary'],
      dtype='object')
Feature list after date was split:
Index(['Date', 'Time', 'Location', 'operator', 'Flight', 'Route', 'AcType',
       'Reg', 'CnLn', 'Aboard', 'Fatalities', 'Ground:', 'Summary',
       'IsMilitary', 'Year', 'Month', 'Day'],
      dtype='object')


In [18]:
# 2. Change data for Aboard, Fatalities and Ground
import re
# crashes.Aboard.replace(to_replace='(passengers:', value = '', inplace = True)
# for aboard in crashes.Aboard:
#     print(re.findall(r'\d+', aboard))
# crashes.Aboard
aboard = [re.findall(r'\d+', aboard) for aboard in crashes.Aboard]
fatalities = [re.findall(r'\d+', fatal) for fatal in crashes.Fatalities]

aboard_df = pd.DataFrame(aboard, columns = ['TotalAboard', 'PassAboard', 'CrewAboard'])
aboard_df.head()

fatalities_df = pd.DataFrame(fatalities, columns = ['TotalFatal', 'PassFatal', 'CrewFatal'])
fatalities_df.head()

# Concatenate the aboard and fatalities dataframe
crashes = pd.concat([crashes, aboard_df, fatalities_df], axis = 1)
# Deleted redundant columns
crashes.drop(['Aboard', 'Fatalities'], inplace = True, axis = 1)
crashes.columns

# ground = [re.findall(r'\d+', fatal) for fatal in crashes['Ground:']]
# Change null values in ground to 0
crashes['Ground:'].fillna('0', inplace = True)
crashes['Ground:'].value_counts()

0       5508
1         62
2         32
3         23
4         17
5         12
8         10
7         10
6          6
10         6
11         5
14         5
44         4
13         4
22         4
24         3
19         3
12         3
20         3
15         2
25         2
2750       2
23         2
35         2
17         2
30         2
125        2
50         1
63         1
36         1
67         1
53         1
49         1
52         1
47         1
39         1
29         1
32         1
85         1
58         1
78         1
71         1
87         1
16         1
225        1
45         1
18         1
9          1
113        1
40         1
37         1
31         1
33         1
Name: Ground:, dtype: int64

In [19]:
# 3. Location - It can be used to get the latitudes and longitudes using the google maps api
#             - Location is currently in City, Country format. It can be split into 2 columns

# The location parameters range from 1 value to 4 values when split.
# Split the location feature to Country, State, Town and City
# loc = [location.split(',')[::-1] for location in crashes.Location]
# loc = pd.DataFrame(loc, columns = ['Country', 'State', 'City', 'Town'])
# # Check for null values in country
# crashes = pd.concat([crashes, loc], axis = 1)
# crashes.head()
crash_location = [str(row).split(',')[::-1] for row in crashes.Location]
crash_location = pd.DataFrame(crash_location, columns = ['Country', 'State', 'City', 'Town'])
crashes = pd.concat([crashes, crash_location], axis = 1)
crashes.head()
# crashes['Country'] = [c for c in crashes.Country]
# Check for null values in country

Unnamed: 0,Date,Time,Location,operator,Flight,Route,AcType,Reg,CnLn,Ground:,...,TotalAboard,PassAboard,CrewAboard,TotalFatal,PassFatal,CrewFatal,Country,State,City,Town
0,"September 17, 1908",00:00:00,"Fort Myer, Virginia",Military - U.S. Army,,Demonstration,Wright Flyer III,,1.0,0,...,2,1.0,1.0,1,1.0,0.0,Virginia,Fort Myer,,
1,"September 07, 1909",00:00:00,"Juvisy-sur-Orge, France",,,Air show,Wright Byplane,SC1,,0,...,1,0.0,1.0,1,0.0,0.0,France,Juvisy-sur-Orge,,
2,"July 12, 1912",00:00:00,"Atlantic City, New Jersey",Military - U.S. Navy,,Test flight,Dirigible,,,0,...,5,0.0,5.0,5,0.0,5.0,New Jersey,Atlantic City,,
3,"August 06, 1913",00:00:00,"Victoria, British Columbia, Canada",Private,,,Curtiss seaplane,,,0,...,1,0.0,1.0,1,0.0,1.0,Canada,British Columbia,Victoria,
4,"March 05, 1915",00:00:00,"Tienen, Belgium",Military - German Navy,,,Zeppelin L-8 (airship),,,0,...,41,,,21,,,Belgium,Tienen,,


In [20]:
# Check the validity of the country column
from geotext import GeoText

fix = []
for row in crashes.Country:
    if (GeoText(row).countries):
        pass
    elif (GeoText(row).cities):
        pass
    else:
        fix.append(row)

print(len(fix))
pd.Series(fix).value_counts().sort_values(0).head()
# Infer: There are 1649 locations that dont have country/city information

1691


 Western Samoa                          1
 Baangladesh                            1
 French West Indies                     1
Off Western Africa                      1
 off the Philippine island of Elalat    1
dtype: int64

In [28]:
# I've pulled out all the US states so as to be able to assign them a country
usNames = ['Virginia','New Jersey','Ohio','Pennsylvania', 'Maryland', 'Indiana', 'Iowa',
          'Illinois','Wyoming', 'Minnisota', 'Wisconsin', 'Nevada', 'NY','California',
          'WY','New York','Oregon', 'Idaho', 'Connecticut','Nebraska', 'Minnesota', 'Kansas',
          'Texas', 'Tennessee', 'West Virginia', 'New Mexico', 'Washington', 'Massachusetts',
          'Utah', 'Ilinois','Florida', 'Michigan', 'Arkansas','Colorado', 'Georgia''Missouri',
          'Montana', 'Mississippi','Alaska','Jersey', 'Cailifornia', 'Oklahoma','North Carolina',
          'Kentucky','Delaware','D.C.','Arazona','Arizona','South Dekota','New Hampshire','Hawaii',
          'Washingon','Massachusett','Washington DC','Tennesee','Deleware','Louisiana',
          'Massachutes', 'Louisana', 'New York (Idlewild)','Oklohoma','North Dakota','Rhode Island',
          'Maine','Alakska','Wisconson','Calilfornia','Virginia','Virginia.','CA','Vermont',
          'HI','AK','IN','GA','Coloado','Airzona','Alabama','Alaksa' 
          ]

# Decided to try and cleanse the country names.
afNames = ['Afghanstan'] #Afghanistan
anNames = ['off Angola'] #Angola
ausNames = ['Qld. Australia','Queensland  Australia','Tasmania','off Australia'] #Australia
argNames = ['Aregntina'] #Argentina
azNames = ['Azores (Portugal)'] #Azores
baNames = ['Baangladesh'] #Bangladesh
bahNames = ['Great Inagua'] #Bahamas
berNames = ['off Bermuda'] #Bermuda
bolNames = ['Boliva','BO'] #Bolivia
bhNames = ['Bosnia-Herzegovina'] #Bosnia Herzegovina
bulNames = ['Bugaria','Bulgeria'] #Bulgaria
canNames = ['British Columbia', 'British Columbia Canada','Canada2',
            'Saskatchewan','Yukon Territory'] #Canada
camNames = ['Cameroons','French Cameroons'] #Cameroon
caNames = ['Cape Verde Islands'] #Cape Verde
chNames = ['Chili'] #Chile
coNames = ['Comoro Islands', 'Comoros Islands'] #Comoros
djNames = ['Djbouti','Republiof Djibouti'] #Djibouti
domNames = ['Domincan Republic', 'Dominica'] #Dominican Republic
drcNames = ['Belgian Congo','Belgian Congo (Zaire)','Belgium Congo'
           'DR Congo','DemocratiRepubliCogo','DemocratiRepubliCongo',
            'DemocratiRepubliof Congo','DemoctratiRepubliCongo','Zaire',
           'Zaïre'] #Democratic Republic of Congo
faNames = ['French Equitorial Africa'] #French Equatorial Africa
gerNames = ['East Germany','West Germany'] #Germany
grNames = ['Crete'] #Greece
haNames = ['Hati'] #Haiti
hunNames = ['Hunary'] #Hungary
inNames = ['Indian'] #India
indNames = ['Inodnesia','Netherlands Indies'] #Indonesia
jamNames = ['Jamacia'] #Jamaica
malNames = ['Malaya'] #Malaysia
manNames = ['Manmar'] #Myanmar
marNames = ['Mauretania'] #Mauritania
morNames = ['Morrocco','Morroco'] #Morocco
nedNames = ['Amsterdam','The Netherlands'] #Netherlands
niNames = ['Niger'] #Nigeria
philNames = ['Philipines','Philippine Sea', 'Phillipines',
            'off the Philippine island of Elalat'] #Philippines
romNames = ['Romainia'] #Romania
rusNames = ['Russian','Soviet Union','USSR'] #Russia
saNames = ['Saint Lucia Island'] #Saint Lucia
samNames = ['Western Samoa'] #Samoa
siNames = ['Sierre Leone'] #Sierra Leone
soNames = ['South Africa (Namibia)'] #South Africa
surNames = ['Suriname'] #Surinam
uaeNames = ['United Arab Emirates', 'UAE'] #UAE
ukNames = ['England', 'UK','Wales','110 miles West of Ireland', 'Scotland'] #United Kingdom
uvNames = ['US Virgin Islands','Virgin Islands','U.S. Virgin Islands'] #U.S. Virgin Islands
vietNames = ['South Vietnam']
wkNames = ['325 miles east of Wake Island']#Wake Island
yuNames = ['Yugosalvia'] #Yugoslavia
zimNames = ['Rhodesia', 'Rhodesia (Zimbabwe)'] #Zimbabwe

clnames = []
for country in crashes['Country'].values:
    country = country.strip()
    if country in afNames:
        clnames.append('Afghanistan')
    elif country in anNames:
        clnames.append('Angola')
    elif country in ausNames:
        clnames.append('Australia')
    elif country in argNames:
        clnames.append('Argentina')
    elif country in azNames:
        clnames.append('Azores')
    elif country in baNames:
        clnames.append('Bangladesh')
    elif country in bahNames:
        clnames.append('Bahamas')
    elif country in berNames:
        clnames.append('Bermuda')
    elif country in bolNames:
        clnames.append('Bolivia')
    elif country in bhNames:
        clnames.append('Bosnia Herzegovina')
    elif country in bulNames:
        clnames.append('Bulgaria')
    elif country in canNames:
        clnames.append('Canada')
    elif country in camNames:
        clnames.append('Cameroon')
    elif country in caNames:
        clnames.append('Cape Verde')
    elif country in chNames:
        clnames.append('Chile')
    elif country in coNames:
        clnames.append('Comoros')
    elif country in djNames:
        clnames.append('Djibouti')
    elif country in domNames:
        clnames.append('Dominican Republic')
    elif country in drcNames:
        clnames.append('Democratic Republic of Congo')
    elif country in faNames:
        clnames.append('French Equatorial Africa')
    elif country in gerNames:
        clnames.append('Germany')
    elif country in grNames:
        clnames.append('Greece')
    elif country in haNames:
        clnames.append('Haiti')
    elif country in hunNames:
        clnames.append('Hungary')
    elif country in inNames:
        clnames.append('India')
    elif country in jamNames:
        clnames.append('Jamaica')
    elif country in malNames:
        clnames.append('Malaysia')
    elif country in manNames:
        clnames.append('Myanmar')
    elif country in marNames:
        clnames.append('Mauritania')
    elif country in morNames:
        clnames.append('Morocco')
    elif country in nedNames:
        clnames.append('Netherlands')
    elif country in niNames:
        clnames.append('Nigeria')
    elif country in philNames:
        clnames.append('Philippines')
    elif country in romNames:
        clnames.append('Romania')
    elif country in rusNames:
        clnames.append('Russia')
    elif country in saNames:
        clnames.append('Saint Lucia')
    elif country in samNames:
        clnames.append('Samoa')
    elif country in siNames:
        clnames.append('Sierra Leone')
    elif country in soNames:
        clnames.append('South Africa')
    elif country in surNames:
        clnames.append('Surinam')
    elif country in uaeNames:
        clnames.append('UAE')
    elif country in ukNames:
        clnames.append('United Kingdom')
    elif country in usNames:
        clnames.append('United States of America')
    elif country in uvNames:
        clnames.append('U.S. Virgin Islands')
    elif country in vietNames:
        clnames.append('Vietnam')
    elif country in wkNames:
        clnames.append('Wake Island')
    elif country in yuNames:
        clnames.append('Yugoslavia')
    elif country in zimNames:
        clnames.append('Zimbabwe')
    else:
        clnames.append(country)
        
crashes['Country'] = clnames 

In [59]:
# Check the validity of the country column
# Set minimum number of rows and columns to be displayed as we are dealing with a large dataset
pd.options.display.max_rows = 500
pd.options.display.max_columns = 100
from geotext import GeoText

fix = []
for row in crashes.Country:
    if (GeoText(row).countries):
        pass
    elif (GeoText(row).country_mentions):
        pass
    elif (GeoText(row).cities):
        pass
    else:
        fix.append(row)


pd.Series(fix).value_counts()
print(len(fix))
pd.Series(fix)
# Infer: There are 1649 locations that dont have country/city information

415


0                        Over the North Sea
1                                 North Sea
2                      Off Northern Germany
3                                 North Sea
4                                 North Sea
5                    Over the Mediterranean
6                             Off Gibraltar
7                      North Atlantic Ocean
8                Over the Mediterranean Sea
9                  Over the English Channel
10                          English Channel
11                           Czechoslovakia
12                          English Channel
13                            Unied Kingdom
14                           Atlantic Ocean
15                                   Ariège
16                            Off Gibraltar
17                              Off Morocco
18                     North Atlantic Ocean
19                            East Sardinia
20                           Czechoslovakia
21                                Off Spain
22                              

In [62]:
do_geocode('Democratic Republic of Congo')

Location(République démocratique du Congo, (-2.9814343, 23.8222636, 0.0))

In [58]:
# (GeoText('Congo').countries)
from geopy.geocoders import Nominatim
from geotext import GeoText
from geopy.exc import GeocoderTimedOut
import time

geolocator = Nominatim()
geoloc_default = geolocator.reverse("0, 0")

def do_geocode(address):
    try:
        return geolocator.geocode(address, addressdetails=True)
    except GeocoderTimedOut:
        return do_geocode(address)
    

geoloc = []
for loc in fix[44:]:
    print(loc)
    location = do_geocode(loc)
    if(location):
        time.sleep(10)
        geoloc.append(location)
    else:
        pass
#             geoloc.append(geoloc_default)
#     else:
#         geoloc.append(geoloc_default)

Off Malta-Luqa
Gulf of Tonkin
Mediterranean Sea
Mediterranean Sea
Atlantic Ocean
bulgaria
Timor
Timor
nan
Indian Ocean
nan
Off West Africa
Newfoundland
Dutch Guyana
Atlantic Ocean
Newfoundland
North Sea
New Guinea
North Sea
Eastern Libya
New Guinea
Yugoslavia
New Guinea
Yugoslavia
North Atlantic Ocean
Pacific Ocean
Swden
Yugoslavia
Bosnia
Atlantic Ocean
English Channel
Washington D.C.
New Guinea
New Guinea
South Carolina
North Atlantic Ocean
South Carolina
New Guinea
Atlantic Ocean
Himalayas
New Guinea
French Equatorial Africa
Czechoslovakia
Indian Ocean
Northern Ireland
Off Malaya
Burma
Newfoundland
Near Hong Kong International Airport
Newfoundland
North Pacific Ocean
Atlantic Ocean
Pacific Ocean
Gulf of Karkinitsky
North Pacific Ocean
Persian Gulf
800 miles east of Newfoundland
nan
Yugoslavia
Labrador
Territory of New Guinea
Near Irkutsk Russia
Democratic Republic of Congo
French Indo-China
Atlantic Ocean
Belgium Congo
Newfoundland
Manitoba
Democratic Republic of Congo
Azores
Czechos

GeocoderServiceError: [Errno 65] No route to host

In [60]:
geoloc

[Location(Gulf of Tonkin, (20.0000001, 107.9999999, 0.0)),
 Location(Mediterranean Sea, (35.0000035, 19.9999957, 0.0)),
 Location(Mediterranean Sea, (35.0000035, 19.9999957, 0.0)),
 Location(Atlantic Ocean, (13.581921, -38.3203119, 0.0)),
 Location(България, (42.6073975, 25.4856617, 0.0)),
 Location(Timor, Nusa Tenggara Timur, Indonesia, (-9.346017, 124.637279937916, 0.0)),
 Location(Timor, Nusa Tenggara Timur, Indonesia, (-9.346017, 124.637279937916, 0.0)),
 Location(Nangarhar ننگرهار, افغانستان, (34.220389, 70.3800314, 0.0)),
 Location(Indian Ocean, (-9.9999998, 69.9999999, 0.0)),
 Location(Nangarhar ننگرهار, افغانستان, (34.220389, 70.3800314, 0.0)),
 Location(Newfoundland, Newfoundland and Labrador, Canada, (49.12120935, -56.696296112741, 0.0)),
 Location(Atlantic Ocean, (13.581921, -38.3203119, 0.0)),
 Location(Newfoundland, Newfoundland and Labrador, Canada, (49.12120935, -56.696296112741, 0.0)),
 Location(North Sea, (55.3333373, 2.9999964, 0.0)),
 Location(New Guinea, Kiunga Dist

In [25]:
len(crashes[crashes['Country'] =='United States of America'])

1408

In [36]:
# from geograpy import places
# c = places.PlaceContext(['Cleveland', 'Ohio', 'United States'])
# from geopy.distance import vincenty
# newport_ri = (48.6904063, 2.373809)
# cleveland_oh = (46.603354, 1.8883335)
# print(vincenty(newport_ri, cleveland_oh).miles)

### G. DATA DICTIONARY

In [37]:
# Database Format

# Date:Date of accident,  in the format - January 01, 2001
# Time:Local time, in 24 hr. format unless otherwise specified
# Airline/Op:Airline or operator of the aircraft
# FlightNo:Flight number assigned by the aircraft operator
# Route:Complete or partial route flown prior to the accident
# AC Type:Aircraft type
# Reg:ICAO registration of the aircraft
# cnln:Construction or serial number / Line or fuselage number
# Aboard:Total aboard (passengers / crew)
# Fatalities:Total fatalities aboard (passengers / crew)
# Ground:Total killed on the ground
# Summary:Brief description of the accident and cause if known

### H1. PERFORM EDA

In [38]:
Analyse the data
Split the columns into meaningful information
No of null values and check how it can be filled.(other datasets or impute?)
Check the datatypes for all columns
Value counts, unique values for every column
Put the data frame into sql

Data points for each year. Check if it is sufficient
Would it be good to analyse data from a particular year?
Can I get the missing information from any other available datasets


SyntaxError: invalid syntax (<ipython-input-38-e4deba7f0d8a>, line 1)

### H2. SUMMARIZE EDA

## BONUS

### I. TUNING METRICS AND EVALUATION APPROACHES 
Explain how you intend to evaluate your results. What tuning metric and evaluation approaches do you intend to use?

### J. POSSIBLE ADDITIONAL DATASETS
Identify 1-2 additional datasets that may help you triangulate your findings. How might these relate to your data?

http://www.airfleets.net/home/

http://www.airsafe.com/events/models/airbus.htm

### K. BLOG POST ABOUT PART 2
Create a blog post of at least 500 words (and 1-2 graphics!) that describes your assumptions and processes for EDA. 
Link to it in your Jupyter notebook.