# Team 88: Basic Exploratory Analysis
## Airport Traffic Data

From the initial review of this dataset, we identified the variables of interest, added useful supplementary variables from other datasets, and completed the necessary cleaning. With the variables and dataset in place, we can now look for insights in the data.

In [None]:
#from google.colab import drive
#drive.mount('/content/drive')

Mounted at /content/drive


In [1]:
#modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  
import seaborn as sns

%matplotlib inline

Reminder of the fields in the datasets. All of these descriptions come from the data source.
- `ITIN_ID` = Itinerary ID
- `ORIGIN_AIRPORT_ID` = Origin Airport, Airport ID. An identification number assigned by US DOT to identify a unique airport. Use this field for airport analysis across a range of years because an airport can change its airport code and airport codes can be reused.
- `ORIGIN` = Origin Airport Code
- `DEST_AIRPORT_ID` = Destination Airport, Airport ID. An identification number assigned by US DOT to identify a unique airport. Use this field for airport analysis across a range of years because an airport can change its airport code and airport codes can be reused.
- `DEST` = Destination Airport Code
- `PASSENGERS` = Number of Passengers for the itinerary
- `YEAR` = Year
- `QUARTER` = Quarter (1-4)
- `ORIGIN_COUNTRY` = Origin Airport, Country Code
- `ROUNDTRIP` = Round Trip Indicator (1=Yes)
- `ITIN_FARE` = Itinerary Fare Per Person
- `ORIGIN_CITY_NAME` = collected as `AirportCityName`: Airport City Name with either U.S. State or Country
- `DEST_CITY_NAME` = collected as `AirportCityName`: Airport City Name with either U.S. State or Country

In [2]:
#import data; `path` is the Colab Drive location, but the CSV below is read from the working directory
path = '/content/drive/MyDrive/2020-Move/Learning/DS4A-correlation-one/DS4A Project/repo/data/'
all_travel = pd.read_csv('all_travel.csv').drop(['Unnamed: 0', 'ORIGIN', 'DEST'], axis=1)
print(all_travel.shape,'\n')
all_travel.head(2)

(4288742, 11) 



Unnamed: 0,ITIN_ID,ORIGIN_AIRPORT_ID,ORIGIN,DEST_AIRPORT_ID,DEST,PASSENGERS,YEAR,QUARTER,ORIGIN_COUNTRY,ROUNDTRIP,ITIN_FARE
0,20111156458,11278,DCA,14771,SFO,1.0,2011,1,US,0.0,2112.0
1,20111156497,11618,EWR,14831,SJC,1.0,2011,1,US,0.0,394.0


In [None]:
all_travel.columns

By far, most people fly in from LAX (Los Angeles International Airport), followed by a few other West Coast airports and major hubs such as New York (JFK) and Boston (BOS). Because airport codes are not common knowledge, we include a supplemental file that maps each airport code to its city name. We use it to relabel the plot above of the top 25 origin airports and for other uses going forward.
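As a minimal sketch of how a top-N ranking with readable labels could be produced, the example below sums passengers per origin airport ID and attaches city names from a lookup table. The two toy frames, the placeholder IDs and counts, and the helper name `top_origins` are assumptions made for illustration; only the column names come from the datasets described here.

```python
import pandas as pd

# Toy stand-in for the itinerary table; IDs and counts are placeholders.
toy_travel = pd.DataFrame({
    "ORIGIN_AIRPORT_ID": [90001, 90001, 90002, 90003, 90002],
    "PASSENGERS": [3.0, 2.0, 1.0, 1.0, 2.0],
})

# Toy lookup from airport ID to a readable city name.
toy_names = pd.DataFrame({
    "ORIGIN_AIRPORT_ID": [90001, 90002, 90003],
    "ORIGIN_CITY_NAME": ["Los Angeles, CA", "New York, NY", "Boston, MA"],
})


def top_origins(travel, names, n=25):
    """Return the n origin cities with the most passengers, largest first."""
    # Total passengers per origin airport ID.
    totals = travel.groupby("ORIGIN_AIRPORT_ID", as_index=False)["PASSENGERS"].sum()
    # Attach the city name; a left join keeps airports missing from the lookup.
    labeled = totals.merge(names, on="ORIGIN_AIRPORT_ID", how="left")
    ranked = labeled.nlargest(n, "PASSENGERS")
    return ranked[["ORIGIN_CITY_NAME", "PASSENGERS"]].reset_index(drop=True)


top = top_origins(toy_travel, toy_names, n=2)
print(top)
```

Aggregating before the join keeps the join small, which may matter for an itinerary table with millions of rows.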

The [Market Coordinate](https://www.transtats.bts.gov/Tables.asp?DB_ID=595&DB_Name=Aviation%20Support%20Tables) table download (`341379231_T_MASTER_CORD.csv`) is provided by the Bureau of Transportation Statistics, as is [all our data](https://www.transtats.bts.gov/Tables.asp?DB_ID=125&DB_Name=Airline%20Origin%20and%20Destination%20Survey%20%28DB1B%29&DB_Short_Name=Origin%20and%20Destination%20Survey). It includes airport codes, IDs, and locations, along with historical identifying information for the airports that appear on the downloaded itineraries.

In [3]:
#get city/location data to merge with the travel data
airport_data = pd.read_csv('341379231_T_MASTER_CORD.csv')
airport_data.shape

(18102, 14)

In [None]:
airport_data.head(2)

Unnamed: 0,AIRPORT_ID,AIRPORT,DISPLAY_AIRPORT_NAME,DISPLAY_AIRPORT_CITY_NAME_FULL,AIRPORT_COUNTRY_NAME,AIRPORT_STATE_NAME,AIRPORT_STATE_FIPS,DISPLAY_CITY_MARKET_NAME_FULL,LAT_DEGREES,LATITUDE,LON_DEGREES,LONGITUDE,AIRPORT_IS_LATEST,Unnamed: 13
0,10001,01A,Afognak Lake Airport,"Afognak Lake, AK",United States,Alaska,2.0,"Afognak Lake, AK",58.0,58.109444,152.0,-152.906667,1,
1,10003,03A,Bear Creek Mining Strip,"Granite Mountain, AK",United States,Alaska,2.0,"Granite Mountain, AK",65.0,65.548056,161.0,-161.071667,1,


We established during exploration that the key for connecting these two datasets is the airport ID, a five-digit value that is unique to each airport. The three-character alphanumeric airport codes are less reliable because they tend to be reassigned or retired.
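The reasoning about keys can be checked directly. The sketch below uses a toy lookup table (placeholder IDs and codes, not real airports) to find codes shared by more than one airport ID and to confirm that the ID is unique once historical records are filtered out. It assumes that `AIRPORT_IS_LATEST == 1` marks the current record for an airport, which should be confirmed against the data source.

```python
import pandas as pd

# Toy lookup: the code "XYZ" was reassigned, so it appears under two airport IDs,
# and ID 90002 has a historical record alongside its latest one.
toy_airports = pd.DataFrame({
    "AIRPORT_ID": [90001, 90002, 90002, 90003],
    "AIRPORT": ["XYZ", "ABC", "ABD", "XYZ"],
    "AIRPORT_IS_LATEST": [1, 0, 1, 1],
})

# Codes that map to more than one airport ID are unsafe join keys.
ids_per_code = toy_airports.groupby("AIRPORT")["AIRPORT_ID"].nunique()
reused_codes = ids_per_code[ids_per_code > 1].index.tolist()

# Assumption: AIRPORT_IS_LATEST == 1 flags the current record for each airport.
latest = toy_airports[toy_airports["AIRPORT_IS_LATEST"] == 1]
id_is_unique = latest["AIRPORT_ID"].is_unique

print("Reused codes:", reused_codes)
print("Airport ID unique among latest records:", id_is_unique)
```

If the ID is not unique in the real lookup table, a merge on it would duplicate itinerary rows, so a check like this is worth running before joining.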

In [4]:
#create temporary columns for merging; there are only four destinations, so only origin lat/lon is added
airport_data['ORIGIN_AIRPORT_ID'] = airport_data['AIRPORT_ID']
airport_data['DEST_AIRPORT_ID'] = airport_data['AIRPORT_ID']
airport_data['ORIGIN_CITY_NAME'] = airport_data['DISPLAY_AIRPORT_CITY_NAME_FULL']
airport_data['DEST_CITY_NAME'] = airport_data['DISPLAY_AIRPORT_CITY_NAME_FULL']
airport_data['ORIGIN_LONGITUDE'] = airport_data['LONGITUDE']
airport_data['ORIGIN_LATITUDE'] = airport_data['LATITUDE']
airport_data.iloc[:, -8:].head(2)

Unnamed: 0,AIRPORT_IS_LATEST,Unnamed: 13,ORIGIN_AIRPORT_ID,DEST_AIRPORT_ID,ORIGIN_CITY_NAME,DEST_CITY_NAME,ORIGIN_LONGITUDE,ORIGIN_LATITUDE
0,1,,10001,10001,"Afognak Lake, AK","Afognak Lake, AK",-152.906667,58.109444
1,1,,10003,10003,"Granite Mountain, AK","Granite Mountain, AK",-161.071667,65.548056


In [5]:
#dropping unused/useless columns
del airport_data['Unnamed: 13']
airport_data.head(2)

Unnamed: 0,AIRPORT_ID,AIRPORT,DISPLAY_AIRPORT_NAME,DISPLAY_AIRPORT_CITY_NAME_FULL,AIRPORT_COUNTRY_NAME,AIRPORT_STATE_NAME,AIRPORT_STATE_FIPS,DISPLAY_CITY_MARKET_NAME_FULL,LAT_DEGREES,LATITUDE,LON_DEGREES,LONGITUDE,AIRPORT_IS_LATEST,ORIGIN_AIRPORT_ID,DEST_AIRPORT_ID,ORIGIN_CITY_NAME,DEST_CITY_NAME,ORIGIN_LONGITUDE,ORIGIN_LATITUDE
0,10001,01A,Afognak Lake Airport,"Afognak Lake, AK",United States,Alaska,2.0,"Afognak Lake, AK",58.0,58.109444,152.0,-152.906667,1,10001,10001,"Afognak Lake, AK","Afognak Lake, AK",-152.906667,58.109444
1,10003,03A,Bear Creek Mining Strip,"Granite Mountain, AK",United States,Alaska,2.0,"Granite Mountain, AK",65.0,65.548056,161.0,-161.071667,1,10003,10003,"Granite Mountain, AK","Granite Mountain, AK",-161.071667,65.548056


In [None]:
#merge to larger dataset by airport ID (inner joins, with the key column named explicitly)
travel_df = pd.merge(all_travel, airport_data[['ORIGIN_AIRPORT_ID', 'ORIGIN_CITY_NAME', 'ORIGIN_LATITUDE', 'ORIGIN_LONGITUDE']], on='ORIGIN_AIRPORT_ID')
#travel_df = pd.merge(all_travel, airport_data[['ORIGIN_AIRPORT_ID', 'ORIGIN_CITY_NAME']])
travel_df = pd.merge(travel_df, airport_data[['DEST_AIRPORT_ID', 'DEST_CITY_NAME']], on='DEST_AIRPORT_ID')
travel_df.head(2)
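One way to guard a lookup merge like this is to let pandas verify the key relationship and report unmatched rows. The sketch below is an illustration on toy frames with placeholder values, not the merge used in this notebook: it uses a left join instead of the default inner join so that itineraries without a lookup row stay visible, `validate="many_to_one"` to fail if the lookup has duplicate IDs, and `indicator=True` to count the misses.

```python
import pandas as pd

# Toy itinerary rows; 99999 deliberately has no row in the lookup table.
toy_travel = pd.DataFrame({
    "ITIN_ID": [1, 2, 3],
    "ORIGIN_AIRPORT_ID": [90001, 90002, 99999],
    "PASSENGERS": [1.0, 2.0, 1.0],
})

# Toy lookup with one row per airport ID.
toy_lookup = pd.DataFrame({
    "ORIGIN_AIRPORT_ID": [90001, 90002],
    "ORIGIN_CITY_NAME": ["City A, ST", "City B, ST"],
})

merged = pd.merge(
    toy_travel,
    toy_lookup,
    on="ORIGIN_AIRPORT_ID",
    how="left",              # keep every itinerary, matched or not
    validate="many_to_one",  # raises MergeError if lookup IDs repeat
    indicator=True,          # adds a _merge column describing each row
)

# Rows present only in the itinerary table had no matching airport.
unmatched = int((merged["_merge"] == "left_only").sum())
print(merged)
print("Unmatched itineraries:", unmatched)
```

Comparing the row count before and after the merge is a cheap additional check that no itineraries were dropped or duplicated.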

After filtering the dataset to only include Bay Area inbound flights and aggregating all periods, we decided to use December 19th as the cutoff for our data periods. The data has no explicit date field, but the `ITIN_ID` field appears to contain the date, so we try to extract it.
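A minimal sketch of that extraction, assuming the first four digits of `ITIN_ID` are the year and the fifth digit is the quarter. This layout is a guess based on the two sample rows shown earlier, where it agrees with the `YEAR` and `QUARTER` columns; it is not taken from the data source documentation, and the `ID_YEAR` and `ID_QUARTER` column names are hypothetical.

```python
import pandas as pd

# The two sample itineraries shown earlier in this notebook.
toy = pd.DataFrame({
    "ITIN_ID": [20111156458, 20111156497],
    "YEAR": [2011, 2011],
    "QUARTER": [1, 1],
})

# Assumed layout: YYYY + Q + remaining digits.
id_str = toy["ITIN_ID"].astype(str)
toy["ID_YEAR"] = id_str.str[:4].astype(int)
toy["ID_QUARTER"] = id_str.str[4].astype(int)

# Check the assumption against the existing YEAR and QUARTER columns.
matches = bool(
    ((toy["ID_YEAR"] == toy["YEAR"]) & (toy["ID_QUARTER"] == toy["QUARTER"])).all()
)
print(toy[["ITIN_ID", "ID_YEAR", "ID_QUARTER"]])
print("Matches YEAR/QUARTER columns:", matches)
```

Running the same comparison over the full table would show whether the assumed layout holds for every itinerary before relying on it.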