# Project week two

In [148]:
#import sys
#!{sys.executable} -m pip install pycountry

## Importing libraries

In [149]:
import pandas as pd
import re
import pycountry

## Information about the dataset

The original description of the dataset is: 

*"There are some big plane crashes recently. I want to know more about the crashes. For very first step, I need to collect data from somewhere, then I found http://www.planecrashinfo.com/database.htm You guys can pull new data from planecrashinfo.com by using https://github.com/hocnx/planecrashinfo_scraping"*

And the description of the column headers:

1. date:    Date of accident,  in the format - January 01, 2001
2. time:    Local time, in 24 hr. format unless otherwise specified
3. location: location information
4. Airline/Op:  Airline or operator of the aircraft
5. flight_no:   Flight number assigned by the aircraft operator
6. route:   Complete or partial route flown prior to the accident
7. ac_type:     Aircraft type
8. registration:    ICAO registration of the aircraft
9. cn_ln:   Construction or serial number / Line or fuselage number
10. aboard:  Total aboard (passengers / crew)
11. fatalities:  Total fatalities aboard (passengers / crew)
12. ground:  Total killed on the ground
13. summary:     Brief description of the accident and cause if known

## Reading dataset file

The data is stored in a CSV file, so we read it with pandas.

In [233]:
plane_crashes_data = pd.read_csv('plane_crash_info.csv')

A brief view of the data:

In [83]:
plane_crashes_data.head(10)

Unnamed: 0,date,time,location,operator,flight_no,route,ac_type,registration,cn_ln,aboard,fatalities,ground,summary,month
0,"September 17, 1908",17:18,"Fort Myer, Virginia",Military - U.S. Army,?,Demonstration,Wright Flyer III,?,1,2 (passengers:1 crew:1),1 (passengers:1 crew:0),0,"During a demonstration flight, a U.S. Army fly...",September
1,"September 07, 1909",?,"Juvisy-sur-Orge, France",?,?,Air show,Wright Byplane,SC1,?,1 (passengers:0 crew:1),1 (passengers:0 crew:0),0,Eugene Lefebvre was the first pilot to ever be...,September
2,"July 12, 1912",06:30,"Atlantic City, New Jersey",Military - U.S. Navy,?,Test flight,Dirigible,?,?,5 (passengers:0 crew:5),5 (passengers:0 crew:5),0,First U.S. dirigible Akron exploded just offsh...,July
3,"August 06, 1913",?,"Victoria, British Columbia, Canada",Private,?,?,Curtiss seaplane,?,?,1 (passengers:0 crew:1),1 (passengers:0 crew:1),0,The first fatal airplane accident in Canada oc...,August
4,"September 09, 1913",c 18:30,Over the North Sea,Military - German Navy,?,?,Zeppelin L-1 (airship),?,?,20 (passengers:? crew:?),14 (passengers:? crew:?),0,The airship flew into a thunderstorm and encou...,September
5,"October 17, 1913",c 10:30,"Near Johannisthal, Germany",Military - German Navy,?,?,Zeppelin L-2 (airship),?,?,30 (passengers:? crew:?),30 (passengers:? crew:?),0,Hydrogen gas which was being vented was sucked...,October
6,"March 05, 1915",c 01:00,"Tienen, Belgium",Military - German Navy,?,?,Zeppelin L-8 (airship),?,?,41 (passengers:? crew:?),21 (passengers:? crew:?),0,Crashed into trees while attempting to land af...,March
7,"September 03, 1915",c 15:20,"Off Cuxhaven, Germany",Military - German Navy,?,?,Zeppelin L-10 (airship),?,?,19 (passengers:? crew:?),19 (passengers:? crew:?),0,"Exploded and burned near Neuwerk Island, when...",September
8,"July 28, 1916",?,"Near Jambol, Bulgeria",Military - German Army,?,?,Schutte-Lanz S-L-10 (airship),?,?,20 (passengers:? crew:?),20 (passengers:? crew:?),0,"Crashed near the Black Sea, cause unknown.",July
9,"September 24, 1916",c 01:00,"Billericay, England",Military - German Navy,?,?,Zeppelin L-32 (airship),?,?,22 (passengers:? crew:?),22 (passengers:? crew:?),0,Shot down by British aircraft crashing in flames.,September


## Checking missing values

The first attempt to check for missing values with 'isna()' found nothing, because all the missing values were replaced with '?'

In [42]:
nulls_col = plane_crashes_data.isna().sum()
nulls_col[nulls_col > 0]

Series([], dtype: int64)

I had to search for every '?' in the dataframe

In [102]:
missings_data = (plane_crashes_data == '?').sum()
print('Missing values:')
missings_data[missings_data > 0]

Missing values:


time            2108
location           6
operator          21
flight_no       4436
route           1494
ac_type           24
registration     352
cn_ln           1207
ground            52
summary          231
dtype: int64
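An alternative to the manual search is to let pandas treat '?' as missing when the file is read, so that 'isna()' works directly. The following is a minimal sketch; the inline rows are a made-up stand-in for plane_crash_info.csv and only illustrate the `na_values` parameter of `read_csv`.

```python
import io
import pandas as pd

# Tiny stand-in for plane_crash_info.csv (illustrative rows, not the real file)
sample_csv = io.StringIO(
    "date,time,operator\n"
    '"September 17, 1908",17:18,Military - U.S. Army\n'
    '"September 07, 1909",?,?\n'
    '"July 12, 1912",06:30,Military - U.S. Navy\n'
)

# na_values tells read_csv to parse cells equal to '?' as NaN
sample = pd.read_csv(sample_csv, na_values="?")
nulls_per_column = sample.isna().sum()
print(nulls_per_column[nulls_per_column > 0])
```

Note that this only catches cells that are exactly '?'. A value such as "20 (passengers:? crew:?)" is still read as a regular string.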

With this we can see that there are no missing values for [date, aboard, fatalities]. For "aboard" and "fatalities", however, this doesn't mean that there are no problems: each value in these columns is a string that can itself contain missing parts, such as "20 (passengers:? crew:?)".

In [101]:
# A more detailed search in those columns: flag every row whose string contains a '?'
check_aboard_miss = lambda x: bool(re.search(r'[?]', str(x)))
aboard_missings = plane_crashes_data['aboard'].apply(check_aboard_miss)
fatalities_missings = plane_crashes_data['fatalities'].apply(check_aboard_miss)
abo_fatal_missing = pd.DataFrame()
abo_fatal_missing['aboard_m'] = aboard_missings
abo_fatal_missing['fatalities_m'] = fatalities_missings
# Comparing with False counts the rows WITHOUT any '?', i.e. the complete ones
complete_abo_fatal = (abo_fatal_missing == False).sum()
print('Complete values:')
complete_abo_fatal

Complete values:


aboard_m        5239
fatalities_m    5222
dtype: int64

Since the comparison with False counts the rows that contain no '?', this shows that we have:
1. 5239 complete records for aboard (the remaining 544 contain at least one '?')
2. 5222 complete records for fatalities (the remaining 561 contain at least one '?')
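
The "aboard" and "fatalities" strings follow a "total (passengers:n crew:n)" pattern, so they could be split into numeric parts with a regular expression. This is only a sketch under that assumption; `COUNTS_RE` and `parse_counts` are hypothetical names, and strings that don't match the pattern are returned as all-unknown.

```python
import re

# Pattern for strings such as "20 (passengers:? crew:?)"
COUNTS_RE = re.compile(
    r"^\s*(\d+|\?)\s*\(passengers:\s*(\d+|\?)\s+crew:\s*(\d+|\?)\)\s*$"
)

def parse_counts(text):
    """Return (total, passengers, crew); unknown parts ('?') become None."""
    match = COUNTS_RE.match(str(text))
    if match is None:
        # Unexpected format: treat every part as unknown
        return (None, None, None)
    return tuple(None if part == "?" else int(part) for part in match.groups())

print(parse_counts("2 (passengers:1 crew:1)"))
print(parse_counts("20 (passengers:? crew:?)"))
```

Applied with `plane_crashes_data['aboard'].apply(parse_counts)`, this would give one tuple per row that can then be expanded into separate columns.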

## Data cleaning

I decided to clean the columns in their order of appearance.

### Date

Here we have the date of the crash, formatted as a single string with the "Month day, year" structure.
I considered it more useful to have three columns, one for each element of the date.

In [28]:
# Search in the date column all the letters to obtain the Month
months = [''.join(re.findall(r'[A-Za-z]',plane_crashes_data['date'][da])) for da in range(len(plane_crashes_data['date']))]

In [32]:
# Search in the date column the group of two digits for the day
days = [''.join(re.findall(r'\b\d{2}\b',plane_crashes_data['date'][da])) for da in range(len(plane_crashes_data['date']))]

In [34]:
# Search in the date column the group of four digits for the year
years = [''.join(re.findall(r'\b\d{4}\b',plane_crashes_data['date'][da])) for da in range(len(plane_crashes_data['date']))]
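
As a cross-check for the three regular expressions, the same split can be obtained by parsing the string as a date. The sketch below uses the standard library and assumes every value follows the "Month DD, YYYY" format given in the column description; `split_date` is a hypothetical helper name.

```python
import calendar
from datetime import datetime

def split_date(text):
    """Parse 'Month DD, YYYY' and return (month_name, day, year)."""
    parsed = datetime.strptime(text, "%B %d, %Y")
    return calendar.month_name[parsed.month], parsed.day, parsed.year

print(split_date("September 17, 1908"))
```

Unlike the regex approach, parsing raises a `ValueError` on a malformed date, which makes bad records visible instead of silently producing an empty string.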

### Time

The time column doesn't have a homogeneous format, so we check all the formats

In [132]:
plane_crashes_data['time'].head(10)

0      17:18
1          ?
2      06:30
3          ?
4    c 18:30
5    c 10:30
6    c 01:00
7    c 15:20
8          ?
9    c 01:00
Name: time, dtype: object

There is no information about the meaning of the 'c' character in the time, so we create a new column that records whether it is present.
Note: Investigate the meaning.

In [133]:
check_c_char = lambda x: True if len(re.findall(r'[c]',str(x))) > 0 else False
char_unknow = plane_crashes_data['time'].apply(check_c_char)
char_unknow = char_unknow.replace(True, value='c')
char_unknow = char_unknow.replace(False, value='not c')

Then we search for the records that don't follow the "hh:mm" format, for example "hhmm"

In [134]:
check_time_for = lambda x: True if len(re.findall(r'[:]',str(x))) > 0 else False
format_time = plane_crashes_data['time'].apply(check_time_for)

# If the format differs from "hh:mm" we reformat it, and if the 'c' character is present we drop it
def reformat(x):
    if check_c_char(x):
        # Remove the 'c' marker together with the whitespace it leaves behind
        x = x.replace("c", "").strip()
    if not check_time_for(x) and x != '?':
        x = x[:2] + ':' + x[2:]
    return x

new_times = plane_crashes_data['time'].apply(reformat)
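
The flagging and reformatting steps above can also be combined in a single function that returns both the cleaned time and whether the 'c' prefix was present. This is a sketch, not the notebook's method: `TIME_RE` and `normalize_time` are hypothetical names, and the meaning of 'c' is still left open.

```python
import re

# One or two hour digits, an optional colon, then two minute digits
TIME_RE = re.compile(r"(\d{1,2}):?(\d{2})")

def normalize_time(raw):
    """Return ('hh:mm' or None, has_c_prefix) for a raw time string."""
    text = str(raw).strip()
    has_c_prefix = text.startswith("c")
    match = TIME_RE.search(text)
    if match is None:
        # Covers '?' and anything else without a recognizable time
        return None, has_c_prefix
    hours, minutes = match.groups()
    return f"{int(hours):02d}:{minutes}", has_c_prefix

for raw in ["17:18", "c 18:30", "1830", "?"]:
    print(raw, "->", normalize_time(raw))
```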

### Location 

In the location column most values follow a "region, Country" structure, so we split the values on commas and try to identify which part is the country.

#### Identify the country.

In [219]:
split_lb = lambda x: x.split(',')
check_is_country = lambda x: pycountry.countries.get(name=x) is not None
region_country = pd.DataFrame(plane_crashes_data['location'].apply(split_lb))

def who_is_country(li_country):
    # Return the first part of the split location that pycountry recognizes as a country name
    for part in li_country:
        # Strip the space left after each comma before looking the name up
        name = part.strip()
        if check_is_country(name):
            return name
    # No part matched a country name: keep the original list
    return li_country

countries = region_country['location'].apply(who_is_country)
#pais = pycountry.countries.get(name=region_country.iloc[[1]].values[0][0][1].replace(' ',''))
#print(pais)
#print(set(region_country['location']))
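
Since the country usually comes last in the "region, Country" structure, one variation is to scan the parts from right to left. The sketch below uses a short hard-coded set, `KNOWN_COUNTRIES`, as an illustrative stand-in for the pycountry lookup; both that set and `find_country` are hypothetical names.

```python
# Stand-in for a real lookup such as pycountry; this short set is illustrative only
KNOWN_COUNTRIES = {"France", "Germany", "Belgium", "Canada"}

def find_country(location):
    """Return the right-most comma-separated part that is a known country, else None."""
    parts = [part.strip() for part in str(location).split(",")]
    for part in reversed(parts):
        if part in KNOWN_COUNTRIES:
            return part
    return None

print(find_country("Victoria, British Columbia, Canada"))
print(find_country("Fort Myer, Virginia"))
```

Returning None for unmatched locations (US states, "Over the North Sea", misspelled names) keeps them easy to list and review separately.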

### Operator

In [137]:
plane_crashes_data['operator'].head(10)

0      Military - U.S. Army
1                         ?
2      Military - U.S. Navy
3                   Private
4    Military - German Navy
5    Military - German Navy
6    Military - German Navy
7    Military - German Navy
8    Military - German Army
9    Military - German Navy
Name: operator, dtype: object

### Flight number

Checking the number of missing values in 'flight_no', we can see that only about 1/5 of the rows hold valid information, so we can drop that column.

In [234]:
print('Missing values on flight_no: ',len(plane_crashes_data[plane_crashes_data['flight_no'] == '?']))

Missing values on flight_no:  4436
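
The same decision can be based on the share of '?' values per column rather than on the raw count. The following is a minimal sketch on a made-up frame; the 0.7 cut-off in `THRESHOLD` is an arbitrary assumption chosen for illustration, not a value from the analysis.

```python
import pandas as pd

# Made-up sample: 4 of 5 flight numbers are missing, 1 of 5 operators
sample = pd.DataFrame({
    "flight_no": ["?", "?", "?", "?", "901"],
    "operator": ["Private", "?", "Military - U.S. Navy", "Private", "Private"],
})

# The mean of a boolean frame gives the fraction of '?' per column
missing_share = (sample == "?").mean()

# Hypothetical cut-off: drop columns where more than 70% of the values are '?'
THRESHOLD = 0.7
to_drop = missing_share[missing_share > THRESHOLD].index.tolist()
print(to_drop)
```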


In [235]:
plane_crashes_data = plane_crashes_data.drop(['flight_no'],axis=1)
plane_crashes_data.head(3)

Unnamed: 0,date,time,location,operator,route,ac_type,registration,cn_ln,aboard,fatalities,ground,summary
0,"September 17, 1908",17:18,"Fort Myer, Virginia",Military - U.S. Army,Demonstration,Wright Flyer III,?,1,2 (passengers:1 crew:1),1 (passengers:1 crew:0),0,"During a demonstration flight, a U.S. Army fly..."
1,"September 07, 1909",?,"Juvisy-sur-Orge, France",?,Air show,Wright Byplane,SC1,?,1 (passengers:0 crew:1),1 (passengers:0 crew:0),0,Eugene Lefebvre was the first pilot to ever be...
2,"July 12, 1912",06:30,"Atlantic City, New Jersey",Military - U.S. Navy,Test flight,Dirigible,?,?,5 (passengers:0 crew:5),5 (passengers:0 crew:5),0,First U.S. dirigible Akron exploded just offsh...


### Registration

We replace the missing values ('?') with NaN

In [240]:
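# Boolean mask of the rows whose registration is known (not '?')
known_registers = plane_crashes_data['registration'] != '?'
# Assigning the filtered series back aligns on the index, so the '?' rows become NaN
plane_crashes_data['registration'] = plane_crashes_data['registration'][known_registers]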

In [241]:
plane_crashes_data

Unnamed: 0,date,time,location,operator,route,ac_type,registration,cn_ln,aboard,fatalities,ground,summary
0,"September 17, 1908",17:18,"Fort Myer, Virginia",Military - U.S. Army,Demonstration,Wright Flyer III,,1,2 (passengers:1 crew:1),1 (passengers:1 crew:0),0,"During a demonstration flight, a U.S. Army fly..."
1,"September 07, 1909",?,"Juvisy-sur-Orge, France",?,Air show,Wright Byplane,SC1,?,1 (passengers:0 crew:1),1 (passengers:0 crew:0),0,Eugene Lefebvre was the first pilot to ever be...
2,"July 12, 1912",06:30,"Atlantic City, New Jersey",Military - U.S. Navy,Test flight,Dirigible,,?,5 (passengers:0 crew:5),5 (passengers:0 crew:5),0,First U.S. dirigible Akron exploded just offsh...
3,"August 06, 1913",?,"Victoria, British Columbia, Canada",Private,?,Curtiss seaplane,,?,1 (passengers:0 crew:1),1 (passengers:0 crew:1),0,The first fatal airplane accident in Canada oc...
4,"September 09, 1913",c 18:30,Over the North Sea,Military - German Navy,?,Zeppelin L-1 (airship),,?,20 (passengers:? crew:?),14 (passengers:? crew:?),0,The airship flew into a thunderstorm and encou...
5,"October 17, 1913",c 10:30,"Near Johannisthal, Germany",Military - German Navy,?,Zeppelin L-2 (airship),,?,30 (passengers:? crew:?),30 (passengers:? crew:?),0,Hydrogen gas which was being vented was sucked...
6,"March 05, 1915",c 01:00,"Tienen, Belgium",Military - German Navy,?,Zeppelin L-8 (airship),,?,41 (passengers:? crew:?),21 (passengers:? crew:?),0,Crashed into trees while attempting to land af...
7,"September 03, 1915",c 15:20,"Off Cuxhaven, Germany",Military - German Navy,?,Zeppelin L-10 (airship),,?,19 (passengers:? crew:?),19 (passengers:? crew:?),0,"Exploded and burned near Neuwerk Island, when..."
8,"July 28, 1916",?,"Near Jambol, Bulgeria",Military - German Army,?,Schutte-Lanz S-L-10 (airship),,?,20 (passengers:? crew:?),20 (passengers:? crew:?),0,"Crashed near the Black Sea, cause unknown."
9,"September 24, 1916",c 01:00,"Billericay, England",Military - German Navy,?,Zeppelin L-32 (airship),,?,22 (passengers:? crew:?),22 (passengers:? crew:?),0,Shot down by British aircraft crashing in flames.
