# The Task at Hand: Getting Airline Data in Order

#### 1. Read each data file into a Pandas DataFrame.  Add meaningful names (i.e., names that would make sense to other people, given the data) to the columns of each DataFrame.
    
* Provide your syntactically correct, commented code.
* Print the first three rows of each DataFrame.  Provide your code, and the results it produced. 

In [1]:
#!/usr/bin/env python2
# -*- coding: utf-8 -*-
import pandas as pd
import re
import pickle

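## Read each data file (airlines, airports, routes, airport codes) into its own DataFrame
## and replace missing values with an empty string. Assumes the files are in the working directory.
## Missing values are marked with \N in the files; a raw string keeps the backslash literal.

dfAirlines = pd.read_csv("airlines.dat", sep = ",", na_values = r'\N', 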
                names = ("AirlineID","Name", "Alias", "IATA", "ICAO", 
                         "CallSign","Country", "Active"))
dfAirlines = dfAirlines.fillna('')
print "___   Airlines   ___", '\n', dfAirlines.head(3),"\n"

dfAirports = pd.read_csv("airports.dat", sep = ",", na_values = r'\N', 
                names = ("AirportID", "Name", "City", "Country", "AirportCode", 
                         "ICAO", "Lat", "Long", "Alt", "TimeZone", 
                         "DST", "Olson_TZ"))
dfAirports = dfAirports.fillna('')
print "___   Airports   ___", '\n',dfAirports.head(3),"\n"

dfRoutes = pd.read_csv("routes.dat", sep = ",", na_values = r'\N', 
              names = ("Airline", "AirlineID", "Src", "SrcID", "Dest", "DestID", 
                       "Codeshare", "Stops", "Equip"))
dfRoutes = dfRoutes.fillna('')
print "___   Routes   ___", '\n', dfRoutes.head(3), "\n"

dfAirportCodes = pd.read_csv("airports_codes.txt", sep = "\t", na_values = r'\N')
dfAirportCodes.columns = ["AirportCode", "City_Country", "WorldAreaCode"]
## fillna returns a new DataFrame, so assign the result back
dfAirportCodes = dfAirportCodes.fillna('')
print "___   Airport Codes   ___", '\n', dfAirportCodes.head(3)

___   Airlines   ___ 
   AirlineID            Name Alias IATA ICAO CallSign        Country Active
0          1  Private flight          -                                   Y
1          2     135 Airways             GNL  GENERAL  United States      N
2          3   1Time Airline         1T  RNX  NEXTIME   South Africa      Y 

___   Airports   ___ 
   AirportID         Name         City           Country AirportCode  ICAO  \
0          1       Goroka       Goroka  Papua New Guinea         GKA  AYGA   
1          2       Madang       Madang  Papua New Guinea         MAG  AYMD   
2          3  Mount Hagen  Mount Hagen  Papua New Guinea         HGU  AYMH   

        Lat        Long   Alt  TimeZone DST              Olson_TZ  
0 -6.081689  145.391881  5282      10.0   U  Pacific/Port_Moresby  
1 -5.207083  145.788700    20      10.0   U  Pacific/Port_Moresby  
2 -5.826789  144.295861  5388      10.0   U  Pacific/Port_Moresby   

___   Routes   ___ 
  Airline AirlineID  Src SrcID Dest DestID 
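
The reads above rely on two `read_csv` options: `names`, which supplies column labels for a file with no header row, and `na_values`, which tells pandas to treat `\N` as missing (it is not one of the markers pandas recognizes by default). A minimal sketch on an invented two-row sample, not the real airlines.dat, with a shortened hypothetical column list:

```python
import io

import pandas as pd

# Hypothetical sample in the same headerless, comma-separated layout.
sample_text = (
    '1,"Private flight",\\N,-\n'
    '2,"135 Airways",\\N,GNL\n'
)

# names labels the columns; na_values marks \N as missing.
df = pd.read_csv(
    io.StringIO(sample_text),
    names=["AirlineID", "Name", "Alias", "Code"],
    na_values=r"\N",
)

# Every \N in the Alias column was read as a missing value.
print(df["Alias"].isna().all())
```

Without `na_values`, the `\N` entries would come through as ordinary two-character strings and `fillna` would leave them untouched.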

    
#### 2. Check each DataFrame for duplicate records.  For each, report the number of duplicates you found.
    
* Provide your commented, syntactically correct code and the results it produced.  

In [2]:
## For each DataFrame above, create a copy with the duplicates removed.
## Compare on the data columns only, leaving out the table's own record ID
## (AirlineID for airlines, AirportID for airports), since two records that
## differ only in that ID are still duplicates. The difference between the
## before and after lengths is the number of duplicate rows.

before = len(dfAirlines)
noDupsAL = dfAirlines.drop_duplicates(["Name", "Alias", "IATA", "ICAO","CallSign",
                                       "Country", "Active"])
after = len(noDupsAL)
numDups = before - after
print "Number of duplicate Airline records =", numDups

before = len(dfAirports)
noDupsAP = dfAirports.drop_duplicates(["Name", "City", "Country","AirportCode", 
                                        "ICAO", "Lat", "Long", "Alt", "TimeZone", 
                                        "DST", "Olson_TZ"])
after = len(noDupsAP)
numDups = before - after
print "Number of duplicate Airport records =", numDups

before = len(dfRoutes)
noDupsRoute = dfRoutes.drop_duplicates(["Airline", "AirlineID","Src", "SrcID", "Dest", 
                                        "DestID", "Codeshare", "Stops", "Equip"])
after = len(noDupsRoute)
numDups = before - after
print "Number of duplicated Route records =", numDups

before = len(dfAirportCodes)
noDupsCodes = dfAirportCodes.drop_duplicates(["AirportCode", "City_Country", 
                                              "WorldAreaCode"])
after = len(noDupsCodes)
numDups = before - after
print "Number of duplicate Airport Code records =", numDups


Number of duplicate Airline records = 1
Number of duplicate Airport records = 29
Number of duplicated Route records = 0
Number of duplicate Airport Code records = 0
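
The before/after length comparison above can also be done in one step. As a minimal sketch (the toy frame and its values are invented for illustration, not taken from airlines.dat), `DataFrame.duplicated` flags every row that repeats an earlier one, so summing the flags gives the same count as `before - after`:

```python
import pandas as pd

# Hypothetical toy data: rows 1 and 2 differ only in AirlineID.
toy = pd.DataFrame({
    "AirlineID": [1, 2, 3, 4],
    "Name": ["Alpha Air", "Beta Air", "Beta Air", "Gamma Air"],
    "Country": ["Canada", "Chile", "Chile", "Ghana"],
})

# Columns to compare on, leaving out the record ID.
compare_cols = ["Name", "Country"]

# duplicated() marks each row that repeats an earlier one (keep="first" by default).
num_dups = int(toy.duplicated(subset=compare_cols).sum())

# Same count via the before/after length comparison.
by_length = len(toy) - len(toy.drop_duplicates(subset=compare_cols))

print("duplicates: %d (length check: %d)" % (num_dups, by_length))
```

The boolean mask is also handy for inspecting the offending rows directly, for example `toy[toy.duplicated(subset=compare_cols, keep=False)]`.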


    
#### 3. Describe the data types of the columns in each of the DataFrames.
    
* Provide your commented, syntactically correct code and the results it produced.  

In [3]:
## Get the data type of every column in each DataFrame via its dtypes attribute
airlinesTypes = dfAirlines.dtypes
airportsTypes = dfAirports.dtypes
routesTypes = dfRoutes.dtypes
codesTypes = dfAirportCodes.dtypes
print "___   Airlines dataframe   ___", '\n', airlinesTypes, "\n"
print "___   Airports dataframe   ___", '\n', airportsTypes,"\n"
print "___   Routes dataframe   ___", '\n', routesTypes, "\n"
print "___   Airport Codes dataframe   ___", '\n', codesTypes, '\n'

___   Airlines dataframe   ___ 
AirlineID     int64
Name         object
Alias        object
IATA         object
ICAO         object
CallSign     object
Country      object
Active       object
dtype: object 

___   Airports dataframe   ___ 
AirportID        int64
Name            object
City            object
Country         object
AirportCode     object
ICAO            object
Lat            float64
Long           float64
Alt              int64
TimeZone       float64
DST             object
Olson_TZ        object
dtype: object 

___   Routes dataframe   ___ 
Airline      object
AirlineID    object
Src          object
SrcID        object
Dest         object
DestID       object
Codeshare    object
Stops         int64
Equip        object
dtype: object 

___   Airport Codes dataframe   ___ 
AirportCode      object
City_Country     object
WorldAreaCode     int64
dtype: object 
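
Several ID columns in the Routes frame report `object` rather than a numeric type. A likely cause (an assumption, not verified against routes.dat) is that those columns contained `\N` markers, which became NaN on read and then empty strings after `fillna('')`, leaving a mix of numbers and strings. If numeric IDs are needed later, `pd.to_numeric` with `errors="coerce"` is one way to convert such a column back. The values below are made up:

```python
import pandas as pd

# Hypothetical ID column: numbers mixed with an empty-string placeholder.
src_id = pd.Series([2965.0, "", 2966.0], dtype=object)

# errors="coerce" turns values that cannot be parsed as numbers into NaN.
numeric_id = pd.to_numeric(src_id, errors="coerce")

print(numeric_id.dtype)
print(int(numeric_id.isna().sum()))
```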



        
#### 4. Determine how many of the airlines are "defunct." 
    
* Provide your definition of what a defunct airline is.
* Provide your commented, syntactically correct code and the results it produced.  
    

In [4]:
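## A defunct airline is defined here as one that has no routes: any airline whose IATA
## code never appears in the Airline column of the routes data is considered defunct.
## Only codes that pass a format check are counted, and the count is of distinct IATA codes.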

allAirlines = dfAirlines['IATA'].unique()
hasRoute = dfRoutes['Airline'].unique()
noRoutes = list(set(allAirlines)- set(hasRoute))
defunct = 0
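# Use a regex to screen out blank or placeholder codes (e.g. '-'): keep only codes
# that contain at least two consecutive word characters. The raw string keeps \w literal.
for r in noRoutes:
    goodID = re.search(r'\w\w', r)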
    if goodID:
        defunct += 1
print "There are", defunct, "defunct airlines \n"

There are 539 defunct airlines 
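
One detail worth checking in the filter above: `re.search` succeeds if the pattern matches anywhere in the string, so `\w\w` accepts any code containing two consecutive word characters, not only codes that are exactly two characters long. The sketch below uses made-up codes to show the difference. `re.fullmatch` requires Python 3.4 or later; whether the stricter test is the right one for this data is a judgment call.

```python
import re

# Hypothetical codes for illustration only.
codes = ["1T", "-", "", "ABC", "Q5*", "ZZ"]

# search: the pattern may match anywhere inside the string.
loose = [c for c in codes if re.search(r"\w\w", c)]

# fullmatch: the whole string must be exactly two word characters.
strict = [c for c in codes if re.fullmatch(r"\w\w", c)]

print(loose)
print(strict)
```

Note that `\w` also matches the underscore, so a character class such as `[A-Z0-9]` would be tighter still if only letters and digits should qualify.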



#### 5. Determine how many "routes from nowhere" there are in the data.  These are flights that don't originate from an airport.
    
* Provide your commented, syntactically correct code and the results it produced.  

In [5]:
## Get the source airport of every route in the routes DataFrame and compare it to the
## airport codes read from airports_codes.txt. Any route whose source airport is not in
## the code list is counted as a route from nowhere.

codeList = list(dfAirportCodes.AirportCode)
srcList = list(dfRoutes.Src)
phantomCount = 0
for i in srcList:
    if i not in codeList:
#        print "missing", i
        phantomCount +=1
print "There are", phantomCount,"flights without a valid starting airport", '\n'

There are 2253 flights without a valid starting airport 
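
The loop above tests `i not in codeList` once per route, and each test scans the list. A vectorized alternative is `Series.isin`, sketched here on invented stand-ins for `dfAirportCodes` and `dfRoutes` (the codes `XXX` and `YYY` are hypothetical):

```python
import pandas as pd

# Hypothetical stand-ins for dfAirportCodes and dfRoutes.
codes = pd.DataFrame({"AirportCode": ["GKA", "MAG", "HGU"]})
routes = pd.DataFrame({"Src": ["GKA", "XXX", "MAG", "YYY", "XXX"]})

# True for each route whose source airport appears in the code list.
known = routes["Src"].isin(codes["AirportCode"])

# Routes from nowhere are the rows where the lookup failed.
phantom_count = int((~known).sum())

print("phantom routes: %d" % phantom_count)
```

As in the loop, each route row is counted, so a missing airport that serves several routes contributes once per route.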



        
#### 6. Save your DataFrames for future use.  You may pickle them, put them in a shelve db, or store them in the tables of a SQL db.  Check to make sure that they are saved correctly.
    
 * Provide your commented, syntactically correct code and the results it produced.

In [6]:
## Pickle each DataFrame used above and store it in the working directory.
## Then check the result by reading each pickled file back in and printing its
## first three rows, which should match the output for question 1.

dfAirlines.to_pickle('Airlines.pkl')
testPickle1 = pd.read_pickle('Airlines.pkl')
dfAirports.to_pickle('Airports.pkl')
testPickle2 = pd.read_pickle('Airports.pkl')
dfRoutes.to_pickle('Routes.pkl')
testPickle3 = pd.read_pickle('Routes.pkl')
dfAirportCodes.to_pickle('AirportCodes.pkl')
testPickle4 = pd.read_pickle('AirportCodes.pkl')
print "----------Pickle test - round trip each dataframe-----------", '\n'
print "___   Airlines   ___", '\n', testPickle1.head(3),"\n"
print "___   Airports   ___", '\n', testPickle2.head(3),"\n"
print "___   Routes   ___", '\n', testPickle3.head(3),"\n"
print "___   Airport Codes   ___", '\n', testPickle4.head(3)

----------Pickle test - round trip each dataframe----------- 

___   Airlines   ___ 
   AirlineID            Name Alias IATA ICAO CallSign        Country Active
0          1  Private flight          -                                   Y
1          2     135 Airways             GNL  GENERAL  United States      N
2          3   1Time Airline         1T  RNX  NEXTIME   South Africa      Y 

___   Airports   ___ 
   AirportID         Name         City           Country AirportCode  ICAO  \
0          1       Goroka       Goroka  Papua New Guinea         GKA  AYGA   
1          2       Madang       Madang  Papua New Guinea         MAG  AYMD   
2          3  Mount Hagen  Mount Hagen  Papua New Guinea         HGU  AYMH   

        Lat        Long   Alt  TimeZone DST              Olson_TZ  
0 -6.081689  145.391881  5282      10.0   U  Pacific/Port_Moresby  
1 -5.207083  145.788700    20      10.0   U  Pacific/Port_Moresby  
2 -5.826789  144.295861  5388      10.0   U  Pacific/Port_Moresby   
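
Printing the first three rows confirms that the files load, but it only inspects part of each frame. A stricter check, sketched below on a small invented frame and a temporary file name, is to compare the restored frame with the original using `DataFrame.equals`:

```python
import os
import tempfile

import pandas as pd

# Hypothetical frame standing in for any of the four DataFrames.
df = pd.DataFrame({"AirportCode": ["GKA", "MAG"], "WorldAreaCode": [1, 2]})

# Write to a temporary directory so the sketch leaves no files behind.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "toy.pkl")
df.to_pickle(path)
restored = pd.read_pickle(path)

# equals() checks that shape, column dtypes, and all values match.
round_trip_ok = bool(df.equals(restored))
print("round trip ok: %s" % round_trip_ok)

# Clean up the temporary file and directory.
os.remove(path)
os.rmdir(tmp_dir)
```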

