# Predicting Airline Delay Project
## Part 1: Isolating the top 50 Airports by Traffic
In this part, we load the entire airline dataset and rank airports by total traffic (departures plus arrivals) to get the top 50. We then plot those 50 airports on a map of the USA using plotly and cufflinks.

In [1]:
import pandas as pd
from os import walk
import time
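#plotly.plotly is the online plotting interface in plotly < 4
#In plotly >= 4 it moved to the separate chart_studio package: import chart_studio.plotly as py
import plotly.plotly as py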
%matplotlib inline

In [8]:
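#Read all the files, keeping only the ORIGIN and DEST columns
f = []
for (dirpath, dirnames, filenames) in walk('airline_data/'):
    f.extend(filenames)
    break

#Collect one frame per file and concatenate once (DataFrame.append was removed in pandas 2.0)
#f[1:] skips the first directory entry; os.walk returns names in arbitrary order, so this is fragile
frames = [pd.read_csv('airline_data/' + file, usecols=['ORIGIN','DEST']) for file in f[1:]]
df = pd.concat(frames, ignore_index=True)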
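
Skipping a file by position depends on the order in which the directory is listed. A minimal sketch of an alternative, assuming the data files all end in `.csv`, is to select them by extension instead. The helper name `read_origin_dest` and the toy files below are hypothetical and only illustrate the idea.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_origin_dest(data_dir):
    """Read ORIGIN and DEST from every .csv file in data_dir into one frame."""
    # sorted() makes the file order deterministic; the pattern ignores non-CSV entries
    paths = sorted(Path(data_dir).glob('*.csv'))
    frames = [pd.read_csv(p, usecols=['ORIGIN', 'DEST']) for p in paths]
    return pd.concat(frames, ignore_index=True)


# Tiny demonstration on throwaway files (toy rows, not real flight data)
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, 'jan.csv').write_text('ORIGIN,DEST,DEP_DELAY\nJFK,LAX,5\nLAX,SFO,0\n')
    Path(tmp, 'feb.csv').write_text('ORIGIN,DEST,DEP_DELAY\nSFO,JFK,12\n')
    Path(tmp, '.DS_Store').write_text('not a csv')
    demo = read_origin_dest(tmp)

print(demo)
```

With the real data this would be called as `read_origin_dest('airline_data/')`, provided the extension assumption holds.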

In [23]:
#Create a new dataframe combining the Origin and Destination Airports
df = pd.concat([df['ORIGIN'],df['DEST']], ignore_index=True).to_frame(name='IATA')

In [35]:
#Select the Top 50 Airports
#rename_axis + reset_index(name=...) yields the columns IATA and Count on both old and new pandas versions
df_top50 = df['IATA'].value_counts().head(50).rename_axis('IATA').reset_index(name='Count')
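
To make the traffic definition concrete, here is a small worked example on made-up flight legs. Each flight counts once for its origin and once for its destination, so stacking the two columns and counting values gives departures plus arrivals per airport. The toy data and the variable names `flights`, `visits`, and `top_n` are illustrative only.

```python
import pandas as pd

# Toy flight legs (illustrative only, not taken from the airline dataset)
flights = pd.DataFrame({
    'ORIGIN': ['ATL', 'ATL', 'ORD', 'LAX', 'ORD'],
    'DEST':   ['ORD', 'LAX', 'ATL', 'ATL', 'ATL'],
})

# Stack origins and destinations so each flight contributes two airport visits
visits = pd.concat([flights['ORIGIN'], flights['DEST']], ignore_index=True)

# Count visits per airport and keep the busiest n
n = 2
top_n = visits.value_counts().head(n).rename_axis('IATA').reset_index(name='Count')
print(top_n)
```

Here ATL appears 2 times as an origin and 3 times as a destination, so its count is 5; ORD follows with 3.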

To plot the airports, we take each airport's latitude and longitude from the airport database at [OpenFlights.org](http://openflights.org/data.html).

In [40]:
#Read Airport data
#Assumes airports.dat has a header row naming the Country, IATA, Latitude and Longitude columns
df_airports = pd.read_csv('airports.dat')
df_airports = df_airports[df_airports['Country']=='United States'][['IATA','Latitude','Longitude']].copy()
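
If the copy of `airports.dat` has no header row, the column names have to be supplied when reading it. The sketch below assumes the field order listed in `OPENFLIGHTS_COLUMNS` and that missing values are written as `\N`; both should be checked against the OpenFlights data page before relying on them. The two sample rows are made up for illustration, with approximate coordinates.

```python
import io

import pandas as pd

# Assumed field order for airports.dat; verify against the OpenFlights data page
OPENFLIGHTS_COLUMNS = [
    'Airport ID', 'Name', 'City', 'Country', 'IATA', 'ICAO',
    'Latitude', 'Longitude', 'Altitude', 'Timezone', 'DST',
    'Tz database time zone', 'Type', 'Source',
]

# Two made-up rows in that layout (stand-ins for the real file)
sample = io.StringIO(
    '1,"Example Intl","Los Angeles","United States","LAX","KLAX",'
    '33.94,-118.41,125,-8,"A","America/Los_Angeles","airport","Demo"\n'
    '2,"Example Field","Toronto","Canada","YYZ","CYYZ",'
    '43.68,-79.63,569,-5,"A","America/Toronto","airport","Demo"\n'
)

# header=None stops pandas from treating the first data row as column names
airports = pd.read_csv(sample, header=None, names=OPENFLIGHTS_COLUMNS, na_values=['\\N'])

# Keep only US airports and the columns needed for plotting
us_airports = airports.loc[airports['Country'] == 'United States',
                           ['IATA', 'Latitude', 'Longitude']].copy()
print(us_airports)
```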

In [42]:
#Merge with the DataFrame containing the list of Top 50 Airports
df_top50 = pd.merge(df_top50,df_airports,on="IATA")
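
`pd.merge` performs an inner join by default, so any top-50 airport whose code is absent from the coordinates table would silently drop out of the result. One way to check for this, sketched here on toy frames with a deliberately unmatched code `XXX`, is a left join with `indicator=True`.

```python
import pandas as pd

# Toy stand-ins for df_top50 and df_airports (values are illustrative)
top = pd.DataFrame({'IATA': ['ATL', 'ORD', 'XXX'], 'Count': [5, 3, 1]})
coords = pd.DataFrame({
    'IATA': ['ATL', 'ORD'],
    'Latitude': [33.64, 41.98],
    'Longitude': [-84.43, -87.90],
})

# A left join keeps every row of `top`; the _merge column records where each row matched
checked = pd.merge(top, coords, on='IATA', how='left', indicator=True)

# Airports present in the traffic list but missing coordinates
missing = checked.loc[checked['_merge'] == 'left_only', 'IATA'].tolist()
print(missing)
```

An empty `missing` list means the inner join kept every airport.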

In [47]:
#Map the Airports as a Bubble map, with size corresponding to Traffic

#First create a new column with the hover text for each airport
df_top50['text'] = df_top50['IATA'] + '<br>Total Flights: ' + (df_top50['Count']/1e3).astype(str)+' (Thousands)'

#Create plot using Plotly
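#Half-open slice bounds [start, stop) for ranks 1-10, 11-30 and 31-50
#Adjacent bands share a boundary so that no airport is skipped
limits = [(0,10),(10,30),(30,50)]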
colors = ["rgb(0,116,217)","rgb(255,65,54)","rgb(133,20,75)"]
names = ["Top 10 Busiest","11-30","31-50"]
cities = []
scale = 2500

for i in range(len(limits)):
    lim = limits[i]
    df_sub = df_top50[lim[0]:lim[1]]
    city = dict(
        type = 'scattergeo',
        locationmode = 'USA-states',
        lon = df_sub['Longitude'],
        lat = df_sub['Latitude'],
        text = df_sub['text'],
        marker = dict(
            size = df_sub['Count']/scale,
            color = colors[i],
            line = dict(width=0.5, color='rgb(40,40,40)'),
            sizemode = 'area'
        ),
        name = names[i] )
    cities.append(city)
    
layout = dict(
        title = 'Top 50 Busiest Airports by Traffic',
        showlegend = True,
        geo = dict(
            scope='usa',
            projection=dict( type='albers usa' ),
            showland = True,
            landcolor = 'rgb(217, 217, 217)',
            subunitwidth=1,
            countrywidth=1,
            subunitcolor="rgb(255, 255, 255)",
            countrycolor="rgb(255, 255, 255)"
        ),
    )

fig = dict( data=cities, layout=layout )
py.iplot( fig, validate=False, filename='d3-map-airports' )
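
The plotting loop slices `df_top50` into rank bands. Because Python slices exclude their stop index, adjacent bands need to share a boundary, otherwise the row at each boundary is left out. A minimal sketch for deriving such bounds from a list of edges follows; the helper name `rank_bands` is hypothetical, and it also clips the last band when fewer rows remain (for example after an inner merge drops airports).

```python
import pandas as pd


def rank_bands(n_rows, edges=(0, 10, 30, 50)):
    """Return half-open (start, stop) slice bounds built from consecutive edges."""
    return [(lo, min(hi, n_rows))
            for lo, hi in zip(edges[:-1], edges[1:])
            if lo < n_rows]


# Toy frame with 48 ranked rows, standing in for a merged df_top50
ranked = pd.DataFrame({'rank': range(1, 49)})

bands = rank_bands(len(ranked))
# Every row should land in exactly one band
covered = sum(len(ranked.iloc[start:stop]) for start, stop in bands)
print(bands, covered)
```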