### Data Collection
###### **Sources**
- [**Historical Jet Fuel Prices**](https://www.eia.gov/opendata/qb.php?sdid=PET.EER_EPJK_PF4_RGC_DPG.M) 
    - Data showcases the U.S. Gulf Coast kerosene-type jet fuel spot price in US dollars per gallon.
    - Data separated by month.
    - Data collected ranges from April 1990 to August 2020.
- [**Top 1,000 Contiguous State City-Pair Markets**](https://data.transportation.gov/Aviation/Consumer-Airfare-Report-Table-1-Top-1-000-Contiguo/4f3n-jbg2)
    - Data showcases the average airfare per route, separated by origin and destination city, for the 48 contiguous US states.
    - Data separated by quarter.
    - Data collected ranges from Q1 1996 to Q3 2019.
- [**US Domestic Flights**](https://academictorrents.com/details/a2ccf94bbb4af222bf8e69dad60a68a29f310d9a)
    - Data showcases the airline flight data including route by city, route by airport, passengers, number of flights, total seats available, distance, and population.
    - Data separated by month.
    - Data collected ranges from January 1990 to December 2009.

### Data Cleaning / Merging
- Clean 
    - Historical Jet Fuel Prices
        - Saved as variable 'fuel'
        - DatetimeIndex created
    - Top 1,000 Contiguous State City-Pair Markets
        - Saved as variable 'airfare'
        - DatetimeIndex created
        - Identify matching routes
        - Changed city names to match city names from different dataset
    - US Domestic Flights
        - Saved as variable 'flights'
        - DatetimeIndex created  
- Merge
    - Combine US Domestic Flights (left) & Top 1,000 Contiguous State City-Pair Markets.
    - Left join on route, quarter, and year to preserve shape of US Domestic Flights.
        - **Imputation:** airfare route pricing data was gathered on a quarterly basis, so the same fare was imputed for each month of the corresponding quarter.
    - Resulting dataframe contains 381 different routes over 168 months.
        - Dataset is by month and ranges from the beginning of 1996 to end of 2009.
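
The imputation described above falls out of the left join itself. Below is a minimal sketch on made-up data, assuming hypothetical frames `flights_demo` and `airfare_demo` and a generic `route` key (none of these names come from the notebook):

```python
import pandas as pd

# Toy monthly flight data for one route (values are made up for illustration)
flights_demo = pd.DataFrame({
    'route': ['A - B'] * 3,
    'year': [1996, 1996, 1996],
    'month': [1, 2, 3],
    'passengers': [100, 120, 110],
})
flights_demo['quarter'] = (flights_demo['month'] - 1) // 3 + 1

# Toy quarterly airfare data: one fare per route per quarter
airfare_demo = pd.DataFrame({
    'route': ['A - B'],
    'year': [1996],
    'quarter': [1],
    'fare': [150.0],
})

# The left join keeps every monthly row and repeats the quarterly fare across its months
merged_demo = flights_demo.merge(airfare_demo, how='left', on=['route', 'year', 'quarter'])
print(merged_demo[['month', 'fare']])
```

Because the join key stops at the quarter, each of the three monthly rows picks up the same fare, which is the imputation in practice.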



In [99]:
import pandas as pd

# Top 1,000 Contiguous State City-Pair Markets

In [100]:
# Reading in top 1000 DF - WE WILL USE THIS DF TO OBTAIN PRICING
top1000 = pd.read_csv('./data/raw/Consumer_Airfare_Report__Table_1_-_Top_1_000_Contiguous_State_City-Pair_Markets.csv')
print(top1000.shape)
top1000.head()

(95023, 18)


Unnamed: 0,Year,quarter,citymarketid_1,citymarketid_2,city1,city2,nsmiles,passengers,fare,carrier_lg,large_ms,fare_lg,carrier_low,lf_ms,fare_low,table_1_flag,Geocoded_City1,Geocoded_City2
0,2009,2,32467,34576,"Miami, FL (Metropolitan Area)","Rochester, NY",1204,203,151.46,FL,0.29,131.05,FL,0.29,131.05,1,"Miami, FL (Metropolitan Area)\n(44.977479, -93...","Rochester, NY\n(43.155708, -77.612547)"
1,2000,4,30397,33198,"Atlanta, GA (Metropolitan Area)","Kansas City, MO",692,782,172.83,DL,0.63,194.71,NJ,0.26,126.88,1,"Atlanta, GA (Metropolitan Area)\n(33.748547, -...","Kansas City, MO\n(39.099792, -94.578559)"
2,2007,4,32575,34614,"Los Angeles, CA (Metropolitan Area)","Salt Lake City, UT",590,3122,135.24,DL,0.51,144.28,B6,0.15,111.68,1,"Los Angeles, CA (Metropolitan Area)\n(34.05223...","Salt Lake City, UT\n(40.758478, -111.888142)"
3,2004,4,32337,31650,"Indianapolis, IN","Minneapolis/St. Paul, MN",503,395,206.78,NW,0.74,224.77,TZ,0.11,156.74,1,"Indianapolis, IN\n(39.76845, -86.156212)","Minneapolis/St. Paul, MN\n(44.977479, -93.264346)"
4,2008,4,30194,30559,"Dallas/Fort Worth, TX","Seattle, WA",1670,957,242.74,AA,0.47,262.43,AS,0.27,218.9,1,"Dallas/Fort Worth, TX\n(40.11086, -77.035636)","Seattle, WA\n(47.603229, -122.33028)"


### Column Creation

In [101]:
# CREATE - Route Column
top1000['market_city'] = top1000['city1'] + ' - ' + top1000['city2']

# CREATE - Datetime Column
monthly = top1000['Year'].astype(str) + 'M' + (3 * top1000['quarter']).astype(str)
from statsmodels.tsa.base.datetools import dates_from_str
monthly = dates_from_str(monthly)
top1000['year-month'] = pd.DatetimeIndex(monthly)

top1000.head()

Unnamed: 0,Year,quarter,citymarketid_1,citymarketid_2,city1,city2,nsmiles,passengers,fare,carrier_lg,large_ms,fare_lg,carrier_low,lf_ms,fare_low,table_1_flag,Geocoded_City1,Geocoded_City2,market_city,year-month
0,2009,2,32467,34576,"Miami, FL (Metropolitan Area)","Rochester, NY",1204,203,151.46,FL,0.29,131.05,FL,0.29,131.05,1,"Miami, FL (Metropolitan Area)\n(44.977479, -93...","Rochester, NY\n(43.155708, -77.612547)","Miami, FL (Metropolitan Area) - Rochester, NY",2009-06-30
1,2000,4,30397,33198,"Atlanta, GA (Metropolitan Area)","Kansas City, MO",692,782,172.83,DL,0.63,194.71,NJ,0.26,126.88,1,"Atlanta, GA (Metropolitan Area)\n(33.748547, -...","Kansas City, MO\n(39.099792, -94.578559)","Atlanta, GA (Metropolitan Area) - Kansas City, MO",2000-12-31
2,2007,4,32575,34614,"Los Angeles, CA (Metropolitan Area)","Salt Lake City, UT",590,3122,135.24,DL,0.51,144.28,B6,0.15,111.68,1,"Los Angeles, CA (Metropolitan Area)\n(34.05223...","Salt Lake City, UT\n(40.758478, -111.888142)","Los Angeles, CA (Metropolitan Area) - Salt Lak...",2007-12-31
3,2004,4,32337,31650,"Indianapolis, IN","Minneapolis/St. Paul, MN",503,395,206.78,NW,0.74,224.77,TZ,0.11,156.74,1,"Indianapolis, IN\n(39.76845, -86.156212)","Minneapolis/St. Paul, MN\n(44.977479, -93.264346)","Indianapolis, IN - Minneapolis/St. Paul, MN",2004-12-31
4,2008,4,30194,30559,"Dallas/Fort Worth, TX","Seattle, WA",1670,957,242.74,AA,0.47,262.43,AS,0.27,218.9,1,"Dallas/Fort Worth, TX\n(40.11086, -77.035636)","Seattle, WA\n(47.603229, -122.33028)","Dallas/Fort Worth, TX - Seattle, WA",2008-12-31
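
The quarter-end dates above come from `statsmodels`' `dates_from_str`. As an alternative sketch, the same dates can probably be built with pandas alone; the frame `demo` and the intermediate `labels` are hypothetical names used only for this illustration:

```python
import pandas as pd

# Toy frame with the same 'Year' and 'quarter' columns as the airfare data
demo = pd.DataFrame({'Year': [2009, 2000], 'quarter': [2, 4]})

# Build labels such as '2009Q2', parse them as quarterly periods,
# then take the last calendar day of each quarter
labels = demo['Year'].astype(str) + 'Q' + demo['quarter'].astype(str)
quarter_end = pd.PeriodIndex(labels.tolist(), freq='Q').to_timestamp(how='end').normalize()

# Assigning an index object to a column is positional, as in the cell above
demo['year-month'] = quarter_end
print(demo)
```

This is a swapped-in technique, not the notebook's method; it only removes the extra dependency for this one step.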


In [102]:
top1000 = top1000[['year-month', 'market_city', 'city1', 'city2', 'fare']]
print(top1000.shape)
top1000.head()

(95023, 5)


Unnamed: 0,year-month,market_city,city1,city2,fare
0,2009-06-30,"Miami, FL (Metropolitan Area) - Rochester, NY","Miami, FL (Metropolitan Area)","Rochester, NY",151.46
1,2000-12-31,"Atlanta, GA (Metropolitan Area) - Kansas City, MO","Atlanta, GA (Metropolitan Area)","Kansas City, MO",172.83
2,2007-12-31,"Los Angeles, CA (Metropolitan Area) - Salt Lak...","Los Angeles, CA (Metropolitan Area)","Salt Lake City, UT",135.24
3,2004-12-31,"Indianapolis, IN - Minneapolis/St. Paul, MN","Indianapolis, IN","Minneapolis/St. Paul, MN",206.78
4,2008-12-31,"Dallas/Fort Worth, TX - Seattle, WA","Dallas/Fort Worth, TX","Seattle, WA",242.74


In [103]:
top1000['year-month'] = pd.to_datetime(top1000['year-month'])
top1000 = top1000.set_index('year-month').sort_index()
print(top1000.shape)
top1000.head()

(95023, 4)


year-month,market_city,city1,city2,fare
1996-03-31,"Cleveland, OH (Metropolitan Area) - Denver, CO","Cleveland, OH (Metropolitan Area)","Denver, CO",234.76
1996-03-31,"Denver, CO - Minneapolis/St. Paul, MN","Denver, CO","Minneapolis/St. Paul, MN",120.28
1996-03-31,"Las Vegas, NV - Phoenix, AZ","Las Vegas, NV","Phoenix, AZ",73.3
1996-03-31,"Atlanta, GA (Metropolitan Area) - Buffalo, NY","Atlanta, GA (Metropolitan Area)","Buffalo, NY",199.34
1996-03-31,"Los Angeles, CA (Metropolitan Area) - San Anto...","Los Angeles, CA (Metropolitan Area)","San Antonio, TX",162.77


In [104]:
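# Group by route and month-end date; the shape is unchanged, so each route has one row per quarter and the sum leaves each fare as-is
top1000 = top1000.groupby(['market_city', 'city1', 'city2', pd.Grouper(freq='M')])[['fare']].sum().reset_index().set_index('year-month').sort_index()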
print(top1000.shape)
top1000.head()

(95023, 4)


year-month,market_city,city1,city2,fare
1996-03-31,"Hartford, CT - West Palm Beach/Palm Beach, FL","Hartford, CT","West Palm Beach/Palm Beach, FL",129.2
1996-03-31,"Minneapolis/St. Paul, MN - San Francisco, CA (...","Minneapolis/St. Paul, MN","San Francisco, CA (Metropolitan Area)",290.73
1996-03-31,"Cincinnati, OH - Tampa, FL (Metropolitan Area)","Cincinnati, OH","Tampa, FL (Metropolitan Area)",153.17
1996-03-31,"Denver, CO - Portland, OR","Denver, CO","Portland, OR",240.01
1996-03-31,"Los Angeles, CA (Metropolitan Area) - Phoenix, AZ","Los Angeles, CA (Metropolitan Area)","Phoenix, AZ",73.67


In [105]:
# test data 
top1000[top1000['market_city'] == 'Hartford, CT - West Palm Beach/Palm Beach, FL']

year-month,market_city,city1,city2,fare
1996-03-31,"Hartford, CT - West Palm Beach/Palm Beach, FL","Hartford, CT","West Palm Beach/Palm Beach, FL",129.20
1996-06-30,"Hartford, CT - West Palm Beach/Palm Beach, FL","Hartford, CT","West Palm Beach/Palm Beach, FL",132.35
1996-09-30,"Hartford, CT - West Palm Beach/Palm Beach, FL","Hartford, CT","West Palm Beach/Palm Beach, FL",118.88
1996-12-31,"Hartford, CT - West Palm Beach/Palm Beach, FL","Hartford, CT","West Palm Beach/Palm Beach, FL",132.74
1997-03-31,"Hartford, CT - West Palm Beach/Palm Beach, FL","Hartford, CT","West Palm Beach/Palm Beach, FL",130.03
...,...,...,...,...
2018-09-30,"Hartford, CT - West Palm Beach/Palm Beach, FL","Hartford, CT","West Palm Beach/Palm Beach, FL",178.87
2018-12-31,"Hartford, CT - West Palm Beach/Palm Beach, FL","Hartford, CT","West Palm Beach/Palm Beach, FL",229.45
2019-03-31,"Hartford, CT - West Palm Beach/Palm Beach, FL","Hartford, CT","West Palm Beach/Palm Beach, FL",202.92
2019-06-30,"Hartford, CT - West Palm Beach/Palm Beach, FL","Hartford, CT","West Palm Beach/Palm Beach, FL",218.25
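
The spot check above looks at one route by eye. A more direct check that no route has two rows for the same quarter might look like the sketch below, which runs on a made-up frame `demo` rather than on `top1000`:

```python
import pandas as pd

# Toy stand-in for the cleaned airfare frame (values are made up)
demo = pd.DataFrame(
    {
        'market_city': ['A - B', 'A - B', 'C - D'],
        'fare': [100.0, 110.0, 90.0],
    },
    index=pd.to_datetime(['1996-03-31', '1996-06-30', '1996-03-31']),
)
demo.index.name = 'year-month'

# Count rows that repeat an existing (route, quarter-end date) pair
n_dupes = int(demo.reset_index().duplicated(subset=['market_city', 'year-month']).sum())
print(n_dupes)
```

A result of zero means every route has at most one fare per quarter, so summing within groups cannot inflate any value.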


In [106]:
# Saving the cleaned Dataframe
top1000.to_csv('./data/clean/routepricing_byquarter.csv')

# Historical Jet Fuel Prices

In [107]:
# https://stackoverflow.com/questions/20637439/skip-rows-during-csv-import-pandas

# Reading in data
fuel = pd.read_csv('./data/raw/U.S._Gulf_Coast_Kerosene-Type_Jet_Fuel_Spot_Price_FOB_Monthly.csv', skiprows=4).rename(columns={'Month' : 'month_year', 'Series ID: PET.EER_EPJK_PF4_RGC_DPG.M Dollars per Gallon' : 'jet_fuel_price_per_gallon_usd'})
print(fuel.shape)
fuel.head()

(365, 2)


Unnamed: 0,month_year,jet_fuel_price_per_gallon_usd
0,Aug 2020,1.112
1,Jul 2020,1.084
2,Jun 2020,0.983
3,May 2020,0.686
4,Apr 2020,0.606


In [108]:
# Create Year & Month Column
fuel['year'] = fuel['month_year'].apply(lambda x: int(x[4:8]))
fuel['month'] = fuel['month_year'].apply(lambda x: x[0:3]).map({'Jan' : 1, 'Feb' : 2, 'Mar' : 3, 'Apr' : 4, 'May' : 5, 'Jun' : 6, 'Jul' : 7, 'Aug' : 8, 'Sep' : 9, 'Oct' : 10, 'Nov' : 11, 'Dec' : 12})
fuel.head()

Unnamed: 0,month_year,jet_fuel_price_per_gallon_usd,year,month
0,Aug 2020,1.112,2020,8
1,Jul 2020,1.084,2020,7
2,Jun 2020,0.983,2020,6
3,May 2020,0.686,2020,5
4,Apr 2020,0.606,2020,4


In [109]:
# Create Datetime Column
monthly = fuel['year'].astype(int).astype(str) + 'M' + fuel['month'].astype(int).astype(str)
from statsmodels.tsa.base.datetools import dates_from_str
monthly = dates_from_str(monthly)
fuel['year-month']= pd.DatetimeIndex(monthly)
fuel.head()

Unnamed: 0,month_year,jet_fuel_price_per_gallon_usd,year,month,year-month
0,Aug 2020,1.112,2020,8,2020-08-31
1,Jul 2020,1.084,2020,7,2020-07-31
2,Jun 2020,0.983,2020,6,2020-06-30
3,May 2020,0.686,2020,5,2020-05-31
4,Apr 2020,0.606,2020,4,2020-04-30
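
The month-end dates above are assembled from separate year and month columns. As a sketch of a shorter route, assuming the `month_year` strings always look like `'Aug 2020'` and that English month abbreviations are in effect, pandas can parse them directly (`demo` and `parsed` are hypothetical names):

```python
import pandas as pd

# Toy frame with the same 'month_year' format as the fuel data
demo = pd.DataFrame({'month_year': ['Aug 2020', 'Feb 2020']})

# '%b %Y' parses to the first day of the month
parsed = pd.to_datetime(demo['month_year'], format='%b %Y')

# MonthEnd(0) rolls each date forward to the last day of its own month
demo['year-month'] = parsed + pd.offsets.MonthEnd(0)
print(demo)
```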


In [110]:
fuel = fuel.set_index('year-month').sort_index().drop(columns=['year', 'month', 'month_year']).rename(columns={'jet_fuel_price_per_gallon_usd' : 'fuel_usd_pergallon'})
fuel.head()

year-month,fuel_usd_pergallon
1990-04-30,0.54
1990-05-31,0.515
1990-06-30,0.494
1990-07-31,0.535
1990-08-31,0.791


In [111]:
# Saving the cleaned Dataframe
fuel.to_csv('./data/clean/fuelpricing_bymonth.csv')

# US Domestic Flights

In [112]:
# Read in Data
flights = pd.read_csv('./data/raw/flight_edges.tsv', sep='\t', header=None).rename(columns={0:'Origin', 1:'Destination', 2:'Origin City', 3:'Destination City', 4:'Passengers', 5:'Seats', 6:'Flights', 7:'Distance', 8:'Fly Date', 9:'Origin Population', 10: 'Destination Population'})
print(flights.shape)
flights.head()

(3606803, 11)


Unnamed: 0,Origin,Destination,Origin City,Destination City,Passengers,Seats,Flights,Distance,Fly Date,Origin Population,Destination Population
0,MHK,AMW,"Manhattan, KS","Ames, IA",21,30,1,254.0,200810,122049,86219
1,EUG,RDM,"Eugene, OR","Bend, OR",41,396,22,103.0,199011,284093,76034
2,EUG,RDM,"Eugene, OR","Bend, OR",88,342,19,103.0,199012,284093,76034
3,EUG,RDM,"Eugene, OR","Bend, OR",11,72,4,103.0,199010,284093,76034
4,MFR,RDM,"Medford, OR","Bend, OR",0,18,1,156.0,199002,147300,76034


In [113]:
# Create Datetime Column
monthly = flights['Fly Date'].map(lambda x: str(int(str(x)[0:4])) + 'M' + str(int(str(x)[4:6])))
from statsmodels.tsa.base.datetools import dates_from_str
monthly = dates_from_str(monthly)
flights['year-month']= pd.DatetimeIndex(monthly)
flights.head()

Unnamed: 0,Origin,Destination,Origin City,Destination City,Passengers,Seats,Flights,Distance,Fly Date,Origin Population,Destination Population,year-month
0,MHK,AMW,"Manhattan, KS","Ames, IA",21,30,1,254.0,200810,122049,86219,2008-10-31
1,EUG,RDM,"Eugene, OR","Bend, OR",41,396,22,103.0,199011,284093,76034,1990-11-30
2,EUG,RDM,"Eugene, OR","Bend, OR",88,342,19,103.0,199012,284093,76034,1990-12-31
3,EUG,RDM,"Eugene, OR","Bend, OR",11,72,4,103.0,199010,284093,76034,1990-10-31
4,MFR,RDM,"Medford, OR","Bend, OR",0,18,1,156.0,199002,147300,76034,1990-02-28


In [114]:
# Create market routes (airport & city)
flights['market_air'] = flights['Origin'] + ' - ' + flights['Destination']
flights['market_city'] = flights['Origin City'] + ' - ' + flights['Destination City']
flights.head()

Unnamed: 0,Origin,Destination,Origin City,Destination City,Passengers,Seats,Flights,Distance,Fly Date,Origin Population,Destination Population,year-month,market_air,market_city
0,MHK,AMW,"Manhattan, KS","Ames, IA",21,30,1,254.0,200810,122049,86219,2008-10-31,MHK - AMW,"Manhattan, KS - Ames, IA"
1,EUG,RDM,"Eugene, OR","Bend, OR",41,396,22,103.0,199011,284093,76034,1990-11-30,EUG - RDM,"Eugene, OR - Bend, OR"
2,EUG,RDM,"Eugene, OR","Bend, OR",88,342,19,103.0,199012,284093,76034,1990-12-31,EUG - RDM,"Eugene, OR - Bend, OR"
3,EUG,RDM,"Eugene, OR","Bend, OR",11,72,4,103.0,199010,284093,76034,1990-10-31,EUG - RDM,"Eugene, OR - Bend, OR"
4,MFR,RDM,"Medford, OR","Bend, OR",0,18,1,156.0,199002,147300,76034,1990-02-28,MFR - RDM,"Medford, OR - Bend, OR"


In [115]:
flights = flights.set_index('year-month').drop(columns=['Fly Date']).sort_index()
flights.head()

year-month,Origin,Destination,Origin City,Destination City,Passengers,Seats,Flights,Distance,Origin Population,Destination Population,market_air,market_city
1990-01-31,SEA,ORD,"Seattle, WA","Chicago, IL",1713,4410,30,1721.0,5154164,16395048,SEA - ORD,"Seattle, WA - Chicago, IL"
1990-01-31,CLE,EWR,"Cleveland, OH","Newark, NJ",1476,4619,31,404.0,2103367,16868983,CLE - EWR,"Cleveland, OH - Newark, NJ"
1990-01-31,CRW,ROA,"Charleston, WV","Roanoke, VA",388,2100,21,114.0,307480,269195,CRW - ROA,"Charleston, WV - Roanoke, VA"
1990-01-31,CLE,EWR,"Cleveland, OH","Newark, NJ",1337,3348,31,404.0,2103367,16868983,CLE - EWR,"Cleveland, OH - Newark, NJ"
1990-01-31,CLE,EWR,"Cleveland, OH","Newark, NJ",2787,4888,52,404.0,2103367,16868983,CLE - EWR,"Cleveland, OH - Newark, NJ"


In [116]:
# # Add Year & Quarter Column for later data merge
# flights['year'] = flights['year-month'].dt.year
# flights['quarter'] = flights['year-month'].dt.quarter
# flights['month'] = flights['year-month'].dt.month

In [117]:
# Aggregate the dataframe so that each route has one row per year-month
flights = flights.groupby([pd.Grouper(freq='M'), 'market_city']).agg({'Passengers' : 'sum', 'Seats' : 'sum', 'Flights' : 'sum', 
                                                                      'Distance' : 'mean', 'Origin Population' : 'mean', 
                                                                      'Destination Population' : 'mean'}).reset_index().set_index('year-month')
flights.head()

year-month,market_city,Passengers,Seats,Flights,Distance,Origin Population,Destination Population
1990-01-31,"Abilene, TX - Dallas, TX",741,1018,7,158.0,147700,8019250
1990-01-31,"Akron, OH - Atlanta, GA",3742,5610,56,528.0,658558,3087755
1990-01-31,"Akron, OH - Birmingham, AL",75,99,1,585.0,658558,958585
1990-01-31,"Akron, OH - Chicago, IL",7863,20688,170,344.0,658558,16395048
1990-01-31,"Akron, OH - Cleveland, OH",0,123,1,40.0,658558,2103367
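
To make the aggregation above concrete, here is a minimal sketch on made-up data showing several rows for one route and month collapsing into one. It groups on `to_period('M')` instead of `pd.Grouper(freq='M')` to sidestep differences in frequency aliases across pandas versions; `demo` and `monthly` are hypothetical names:

```python
import pandas as pd

# Toy flight rows: route 'A - B' appears twice in January 1990
demo = pd.DataFrame(
    {
        'market_city': ['A - B', 'A - B', 'C - D'],
        'Passengers': [100, 50, 30],
        'Distance': [404.0, 404.0, 114.0],
    },
    index=pd.to_datetime(['1990-01-31', '1990-01-31', '1990-01-31']),
)
demo.index.name = 'year-month'

# Sum the counts and average the distance within each (month, route) group
monthly = (
    demo.groupby([demo.index.to_period('M'), 'market_city'])
        .agg(Passengers=('Passengers', 'sum'), Distance=('Distance', 'mean'))
        .reset_index()
)
print(monthly)
```

Counts such as passengers add up across rows, while per-route attributes such as distance are averaged so they are not multiplied by the number of rows.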


In [118]:
# Save the cleaned Dataframe
flights.to_csv('./data/clean/flightdata_city_bymonth.csv')

# City Location Data

In [119]:
# read in city location data
location = pd.read_csv('./data/raw/location_info.csv')
print(location.shape)
location.head()

(53, 4)


Unnamed: 0,City,State,Latitude,Longitude
0,Boston,MA,42.37,71.03
1,Hartford,CT,41.73,72.65
2,Albany,NY,42.75,73.8
3,New York,NY,40.77,73.98
4,Philadelphia,PA,39.88,75.25


In [120]:
location['city_state'] = location['City'] + ', ' + location['State']
location.head()

Unnamed: 0,City,State,Latitude,Longitude,city_state
0,Boston,MA,42.37,71.03,"Boston, MA"
1,Hartford,CT,41.73,72.65,"Hartford, CT"
2,Albany,NY,42.75,73.8,"Albany, NY"
3,New York,NY,40.77,73.98,"New York, NY"
4,Philadelphia,PA,39.88,75.25,"Philadelphia, PA"


In [121]:
location = location.sort_values(by=['City']).reset_index()
location = location[['city_state', 'Latitude', 'Longitude']]
location.head()

Unnamed: 0,city_state,Latitude,Longitude
0,"Albany, NY",42.75,73.8
1,"Albuquerque, NM",35.05,106.6
2,"Atlanta, GA",33.65,84.42
3,"Austin, TX",30.3,97.7
4,"Boston, MA",42.37,71.03


In [122]:
# Save Cleaned Dataframe
location.to_csv('./data/clean/locationdata_bycity.csv')

# Merge the Dataframes

In [123]:
# Let's recall our 4 dataframes and take a peek by reading them in from their saved locations

flights = pd.read_csv('./data/clean/flightdata_city_bymonth.csv')

fuel = pd.read_csv('./data/clean/fuelpricing_bymonth.csv')

airfare = pd.read_csv('./data/clean/routepricing_byquarter.csv')

location = pd.read_csv('./data/clean/locationdata_bycity.csv').drop(columns='Unnamed: 0')

In [124]:
# Set datetime as index on all dataframes
flights.index = pd.to_datetime(flights['year-month'])
flights = flights.drop(columns=['year-month'])

fuel.index = pd.to_datetime(fuel['year-month'])
fuel = fuel.drop(columns=['year-month'])

airfare.index = pd.to_datetime(airfare['year-month'])
airfare = airfare.drop(columns=['year-month'])

In [125]:
print(f'flights shape: {flights.shape}')
print(f'fuel shape: {fuel.shape}')
print(f'airfare shape: {airfare.shape}')
print(f'location: {location.shape}')

flights shape: (1051957, 7)
fuel shape: (365, 1)
airfare shape: (95023, 4)
location: (53, 3)


In [126]:
# How many unique routes do we have pricing data for?
print(f'# of Routes with Pricing Data: {len(airfare.market_city.unique())}')

# of Routes with Pricing Data: 1629


In [127]:
# Identify unique markets to potentially model

# Count how many rows each route has
price_count = airfare['market_city'].value_counts()

# Keep only routes with 95 rows, meaning data is available for every quarter, then save as a sorted list
price_count = sorted(price_count[price_count == 95].index)
price_count[:10]

['Albany, NY - Chicago, IL',
 'Albany, NY - Orlando, FL',
 'Albany, NY - Washington, DC (Metropolitan Area)',
 'Albuquerque, NM - Chicago, IL',
 'Albuquerque, NM - Dallas/Fort Worth, TX',
 'Albuquerque, NM - Denver, CO',
 'Albuquerque, NM - Houston, TX',
 'Albuquerque, NM - Las Vegas, NV',
 'Albuquerque, NM - Los Angeles, CA (Metropolitan Area)',
 'Albuquerque, NM - New York City, NY (Metropolitan Area)']

In [128]:
airfare = airfare.loc[airfare['market_city'].isin(price_count)]
airfare.head()

year-month,market_city,city1,city2,fare
1996-03-31,"Minneapolis/St. Paul, MN - San Francisco, CA (...","Minneapolis/St. Paul, MN","San Francisco, CA (Metropolitan Area)",290.73
1996-03-31,"Cincinnati, OH - Tampa, FL (Metropolitan Area)","Cincinnati, OH","Tampa, FL (Metropolitan Area)",153.17
1996-03-31,"Denver, CO - Portland, OR","Denver, CO","Portland, OR",240.01
1996-03-31,"Los Angeles, CA (Metropolitan Area) - Phoenix, AZ","Los Angeles, CA (Metropolitan Area)","Phoenix, AZ",73.67
1996-03-31,"Atlantic City, NJ - Miami, FL (Metropolitan Area)","Atlantic City, NJ","Miami, FL (Metropolitan Area)",96.28


In [129]:
# Unique routes we have full airfare/pricing data for that remain (originally data had 1629)
len(airfare.market_city.unique())

632

In [130]:
# Unique routes we have flight data for
len(sorted(flights.market_city.unique()))

30331

In [131]:
# Separating market_city into two columns
flights['city1'] = flights['market_city'].apply(lambda x: x.split(' - ')[0])
flights['city2'] = flights['market_city'].apply(lambda x: x.split(' - ')[1])

In [132]:
# Renaming certain values so that they match for when data is merged
# (applied in order: the metropolitan-area suffix is stripped first)
city_renames = {
    ' (Metropolitan Area)': '',
    'Dallas/Fort Worth, TX': 'Dallas, TX',
    'Greensboro/High Point, NC': 'Greensboro, NC',
    'Minneapolis/St. Paul, MN': 'Minneapolis, MN',
    'New York City, NY': 'New York, NY',
    'Raleigh/Durham, NC': 'Raleigh, NC',
}

for col in ['city1', 'city2']:
    for old, new in city_renames.items():
        # regex=False treats each key as a literal substring
        airfare[col] = airfare[col].str.replace(old, new, regex=False)

In [133]:
flights_city_names = []
for i in flights.city1.unique():
    flights_city_names.append(i)
for i in flights.city2.unique():
    flights_city_names.append(i)
flights_city_names = sorted(list(set(flights_city_names)))
print(len(flights_city_names))
flights_city_names[:10]

563


['Aberdeen, SD',
 'Abilene, TX',
 'Akron, OH',
 'Alamogordo, NM',
 'Albany, GA',
 'Albany, NY',
 'Albany, OR',
 'Albuquerque, NM',
 'Alexandria, LA',
 'Alexandria, MN']

In [134]:
airfare_city_names = []
for i in airfare.city1.unique():
    airfare_city_names.append(i)
for i in airfare.city2.unique():
    airfare_city_names.append(i)
airfare_city_names = sorted(list(set(airfare_city_names)))
print(len(airfare_city_names))
airfare_city_names[:10]

73


['Albany, NY',
 'Albuquerque, NM',
 'Amarillo, TX',
 'Atlanta, GA',
 'Atlantic City, NJ',
 'Austin, TX',
 'Birmingham, AL',
 'Boise, ID',
 'Boston, MA',
 'Buffalo, NY']

In [135]:
in_airfare_and_flight = []
not_in_airfare_and_flight = []
for i in flights_city_names:
    if i in airfare_city_names:
        in_airfare_and_flight.append(i)
    else:
        not_in_airfare_and_flight.append(i)
        

In [136]:
# 65 of 73 airfare city names match city names in flights city names
len(in_airfare_and_flight)

65

In [137]:
# These are the remaining names in airfare that do not match the flight city names
remaining = list(set(airfare_city_names) - set(in_airfare_and_flight))
set(airfare_city_names) - set(in_airfare_and_flight)

{'Boise, ID',
 'Denver, CO',
 'Fort Myers, FL',
 'Louisville, KY',
 'Midland/Odessa, TX',
 'Norfolk, VA',
 'Sarasota/Bradenton, FL',
 'West Palm Beach/Palm Beach, FL'}

In [138]:
# remaining = [i.split(', ')[0] for i in remaining]
# remaining = sorted(remaining)
for i in sorted(remaining):
    print(i.replace(' (Metropolitan Area)', ""))

Boise, ID
Denver, CO
Fort Myers, FL
Louisville, KY
Midland/Odessa, TX
Norfolk, VA
Sarasota/Bradenton, FL
West Palm Beach/Palm Beach, FL
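
The loops above compare the two sets of city names element by element. As a compact sketch of the same comparison with set operations, using made-up frames and a placeholder city name that does not appear in either dataset:

```python
import pandas as pd

# Toy stand-ins for the flights and airfare frames ('Sampletown, ZZ' is a placeholder)
flights_demo = pd.DataFrame({
    'city1': ['Albany, NY', 'Chicago, IL'],
    'city2': ['Chicago, IL', 'Albany, NY'],
})
airfare_demo = pd.DataFrame({
    'city1': ['Albany, NY', 'Sampletown, ZZ'],
    'city2': ['Chicago, IL', 'Albany, NY'],
})

# Every city that appears as either endpoint of a route
flights_cities = set(flights_demo['city1']) | set(flights_demo['city2'])
airfare_cities = set(airfare_demo['city1']) | set(airfare_demo['city2'])

matched = sorted(airfare_cities & flights_cities)    # names present in both
unmatched = sorted(airfare_cities - flights_cities)  # airfare names still needing a rename
print(matched, unmatched)
```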


In [139]:
airfare['market_city'] = airfare['city1'] + ' - ' + airfare['city2']
airfare['market_city']

year-month
1996-03-31    Minneapolis, MN - San Francisco, CA
1996-03-31             Cincinnati, OH - Tampa, FL
1996-03-31              Denver, CO - Portland, OR
1996-03-31          Los Angeles, CA - Phoenix, AZ
1996-03-31          Atlantic City, NJ - Miami, FL
                             ...                 
2019-09-30            Atlanta, GA - Milwaukee, WI
2019-09-30         New Orleans, LA - New York, NY
2019-09-30          Detroit, MI - Los Angeles, CA
2019-09-30             Chicago, IL - Portland, OR
2019-09-30      Portland, OR - Salt Lake City, UT
Name: market_city, Length: 60040, dtype: object

In [140]:
airfare = airfare.reset_index()
airfare['year-month'] = pd.to_datetime(airfare['year-month'])
airfare = airfare.set_index('year-month')

In [141]:
flights = flights.reset_index()
flights['year-month'] = pd.to_datetime(flights['year-month'])
flights = flights.set_index('year-month')

In [142]:
flights.shape

(1051957, 9)

In [143]:
airfare.shape

(60040, 4)

In [144]:
# Derive year and quarter straight from the DatetimeIndex
# (a Series taken from reset_index() would not align with the datetime index)
airfare['year'] = airfare.index.year
airfare['quarter'] = airfare.index.quarter

In [145]:
airfare = airfare.reset_index()
airfare['year-month'] = pd.to_datetime(airfare['year-month'])
airfare['year'] = airfare['year-month'].dt.year
airfare['month'] = airfare['year-month'].dt.month
airfare['quarter'] = airfare['month'].apply(lambda x: (x - 1) // 3 + 1)
airfare[airfare['year-month'] == '1996-12-31']

Unnamed: 0,year-month,market_city,city1,city2,fare,year,quarter,month
1896,1996-12-31,"Chicago, IL - Tampa, FL","Chicago, IL","Tampa, FL",137.32,1996,4,12
1897,1996-12-31,"Hartford, CT - Los Angeles, CA","Hartford, CT","Los Angeles, CA",307.27,1996,4,12
1898,1996-12-31,"Pittsburgh, PA - San Francisco, CA","Pittsburgh, PA","San Francisco, CA",314.86,1996,4,12
1899,1996-12-31,"Atlanta, GA - Dallas, TX","Atlanta, GA","Dallas, TX",208.46,1996,4,12
1900,1996-12-31,"Boston, MA - New Orleans, LA","Boston, MA","New Orleans, LA",189.05,1996,4,12
...,...,...,...,...,...,...,...,...
2523,1996-12-31,"Chicago, IL - Phoenix, AZ","Chicago, IL","Phoenix, AZ",139.98,1996,4,12
2524,1996-12-31,"San Francisco, CA - St. Louis, MO","San Francisco, CA","St. Louis, MO",195.88,1996,4,12
2525,1996-12-31,"Norfolk, VA - San Diego, CA","Norfolk, VA","San Diego, CA",269.58,1996,4,12
2526,1996-12-31,"Kansas City, MO - Tampa, FL","Kansas City, MO","Tampa, FL",136.68,1996,4,12
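
The quarter formula `(x - 1) // 3 + 1` maps months 1 to 3 to quarter 1, months 4 to 6 to quarter 2, and so on. A quick sketch on made-up dates, comparing it with the built-in `.dt.quarter` accessor (`dates`, `manual`, and `builtin` are hypothetical names):

```python
import pandas as pd

# A handful of month-end dates covering every quarter
dates = pd.Series(pd.to_datetime(
    ['1996-03-31', '1996-06-30', '1996-09-30', '1996-12-31', '1997-01-31']
))

# Quarter computed by hand from the month number
manual = dates.dt.month.apply(lambda m: (m - 1) // 3 + 1)

# Quarter as reported by pandas
builtin = dates.dt.quarter
print(manual.tolist(), builtin.tolist())
```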


In [147]:
flights = flights.reset_index()
flights['year-month'] = pd.to_datetime(flights['year-month'])
flights['year'] = flights['year-month'].dt.year
flights['month'] = flights['year-month'].dt.month
flights['quarter'] = flights['month'].apply(lambda x: (x - 1) // 3 + 1)

In [152]:
# https://stackoverflow.com/questions/45803676/python-pandas-loc-filter-for-list-of-values

# Keep only the routes listed in `flights_count` (built in cells not shown here)
flights = flights.loc[flights['market_city'].isin(flights_count)]
print(flights.shape)
flights.head()

(366720, 13)


Unnamed: 0,year-month,market_city,Passengers,Seats,Flights,Distance,Origin Population,Destination Population,city1,city2,year,month,quarter
11,1990-01-31,"Albany, NY - Atlanta, GA",9495,20216,141,852.0,811232,3087755,"Albany, NY","Atlanta, GA",1990,1,1
15,1990-01-31,"Albany, NY - Chicago, IL",11303,22257,172,723.0,811232,16395048,"Albany, NY","Chicago, IL",1990,1,1
18,1990-01-31,"Albany, NY - Detroit, MI",5125,10900,109,488.0,811232,8503650,"Albany, NY","Detroit, MI",1990,1,1
25,1990-01-31,"Albany, NY - Philadelphia, PA",8793,22494,205,212.0,811232,10881988,"Albany, NY","Philadelphia, PA",1990,1,1
33,1990-01-31,"Albuquerque, NM - Chicago, IL",6420,15849,118,1117.0,601893,16395048,"Albuquerque, NM","Chicago, IL",1990,1,1
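
The list `flights_count` is not defined in the cells shown. The filtered frame has 366,720 rows, which equals 1,528 × 240, and January 1990 to December 2009 spans 240 months, so one plausible reading is that the list holds routes with a row for every month. That is an inference, not something the notebook states. Under that assumption, a hypothetical sketch on made-up data:

```python
import pandas as pd

# Toy flights frame: 'A - B' has all three months, 'C - D' has only one
demo = pd.DataFrame({
    'market_city': ['A - B', 'A - B', 'A - B', 'C - D'],
    'year-month': pd.to_datetime(['1990-01-31', '1990-02-28', '1990-03-31', '1990-01-31']),
})

# Number of distinct months in the whole frame
n_months = demo['year-month'].nunique()

# Distinct months observed per route
months_per_route = demo.groupby('market_city')['year-month'].nunique()

# Hypothetical equivalent of `flights_count`: routes with complete monthly coverage
complete_routes = sorted(months_per_route[months_per_route == n_months].index)

filtered = demo.loc[demo['market_city'].isin(complete_routes)]
print(complete_routes, filtered.shape)
```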


In [153]:
# Merging airfare onto flights (left join on route, quarter, and year)
final = pd.merge(flights, airfare, how='left', on=['market_city', 'quarter', 'year'])
final.head()

Unnamed: 0,year-month_x,market_city,Passengers,Seats,Flights,Distance,Origin Population,Destination Population,city1_x,city2_x,year,month_x,quarter,year-month_y,city1_y,city2_y,fare,month_y
0,1990-01-31,"Albany, NY - Atlanta, GA",9495,20216,141,852.0,811232,3087755,"Albany, NY","Atlanta, GA",1990,1,1,NaT,,,,
1,1990-01-31,"Albany, NY - Chicago, IL",11303,22257,172,723.0,811232,16395048,"Albany, NY","Chicago, IL",1990,1,1,NaT,,,,
2,1990-01-31,"Albany, NY - Detroit, MI",5125,10900,109,488.0,811232,8503650,"Albany, NY","Detroit, MI",1990,1,1,NaT,,,,
3,1990-01-31,"Albany, NY - Philadelphia, PA",8793,22494,205,212.0,811232,10881988,"Albany, NY","Philadelphia, PA",1990,1,1,NaT,,,,
4,1990-01-31,"Albuquerque, NM - Chicago, IL",6420,15849,118,1117.0,601893,16395048,"Albuquerque, NM","Chicago, IL",1990,1,1,NaT,,,,
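
The rows above show `NaN` fares because the airfare data starts in 1996. To see how many flight rows actually found a fare, `pd.merge` accepts `indicator=True`, and `validate='many_to_one'` raises an error if the right-hand keys are not unique. A minimal sketch on made-up frames (`flights_demo`, `airfare_demo` are hypothetical names):

```python
import pandas as pd

# Toy monthly flights: two rows for 'A - B' and one for a route with no fare
flights_demo = pd.DataFrame({
    'market_city': ['A - B', 'A - B', 'C - D'],
    'year': [1996, 1996, 1996],
    'quarter': [1, 1, 1],
    'Passengers': [100, 120, 80],
})

# Toy quarterly airfare: one row per route per quarter
airfare_demo = pd.DataFrame({
    'market_city': ['A - B'],
    'year': [1996],
    'quarter': [1],
    'fare': [150.0],
})

merged = pd.merge(
    flights_demo, airfare_demo,
    how='left', on=['market_city', 'quarter', 'year'],
    validate='many_to_one',  # fail loudly if airfare has duplicate keys
    indicator=True,          # adds a '_merge' column: 'both' or 'left_only'
)
match_counts = merged['_merge'].value_counts()
print(match_counts)
```

Rows tagged `left_only` are the ones that would be removed by a later `dropna()`.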


In [154]:
# Merging location data (lat/long)
final = pd.merge(final, location, how='left', left_on='city1_x', right_on='city_state')
final = final.rename(columns={'Latitude' : 'Origin_Latitude', 'Longitude' : 'Origin_Longitude'})
final = pd.merge(final, location, how='left', left_on='city2_x', right_on='city_state')
final = final.rename(columns={'Latitude' : 'Destination_Latitude', 'Longitude' : 'Destination_Longitude'})
final.head()

Unnamed: 0,year-month_x,market_city,Passengers,Seats,Flights,Distance,Origin Population,Destination Population,city1_x,city2_x,...,city1_y,city2_y,fare,month_y,city_state_x,Origin_Latitude,Origin_Longitude,city_state_y,Destination_Latitude,Destination_Longitude
0,1990-01-31,"Albany, NY - Atlanta, GA",9495,20216,141,852.0,811232,3087755,"Albany, NY","Atlanta, GA",...,,,,,"Albany, NY",42.75,73.8,"Atlanta, GA",33.65,84.42
1,1990-01-31,"Albany, NY - Chicago, IL",11303,22257,172,723.0,811232,16395048,"Albany, NY","Chicago, IL",...,,,,,"Albany, NY",42.75,73.8,"Chicago, IL",41.9,87.65
2,1990-01-31,"Albany, NY - Detroit, MI",5125,10900,109,488.0,811232,8503650,"Albany, NY","Detroit, MI",...,,,,,"Albany, NY",42.75,73.8,"Detroit, MI",42.42,83.02
3,1990-01-31,"Albany, NY - Philadelphia, PA",8793,22494,205,212.0,811232,10881988,"Albany, NY","Philadelphia, PA",...,,,,,"Albany, NY",42.75,73.8,"Philadelphia, PA",39.88,75.25
4,1990-01-31,"Albuquerque, NM - Chicago, IL",6420,15849,118,1117.0,601893,16395048,"Albuquerque, NM","Chicago, IL",...,,,,,"Albuquerque, NM",35.05,106.6,"Chicago, IL",41.9,87.65


In [155]:
# Check for nulls - flights cover 1990-2009 but airfare starts in 1996 and covers fewer routes, so we expect a lot of nulls and are ok with removing those rows
final.isnull().sum()

year-month_x                   0
market_city                    0
Passengers                     0
Seats                          0
Flights                        0
Distance                       0
Origin Population              0
Destination Population         0
city1_x                        0
city2_x                        0
year                           0
month_x                        0
quarter                        0
year-month_y              302712
city1_y                   302712
city2_y                   302712
fare                      302712
month_y                   302712
city_state_x               67200
Origin_Latitude            67200
Origin_Longitude           67200
city_state_y               66960
Destination_Latitude       66960
Destination_Longitude      66960
dtype: int64
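
The null coordinates arise when a city is absent from the 53-row location table. A short sketch, on a made-up frame with a placeholder city name, of how the unmatched origin cities could be listed before dropping them:

```python
import pandas as pd

# Toy merged frame: 'Sampletown, ZZ' is a placeholder with no location match
final_demo = pd.DataFrame({
    'city1_x': ['Albany, NY', 'Sampletown, ZZ', 'Sampletown, ZZ'],
    'Origin_Latitude': [42.75, None, None],
})

# Origin cities whose latitude is missing, with the number of affected rows
missing_origin = final_demo.loc[final_demo['Origin_Latitude'].isna(), 'city1_x'].value_counts()
print(missing_origin)
```

The same pattern with `city2_x` and `Destination_Latitude` would cover destinations.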

In [159]:
# Drop rows with missing values, then drop the duplicate columns left over from the merges
final = final.dropna().drop(columns=['city1_x', 'city2_x', 'year', 'month_x', 'quarter', 'year-month_y', 'city1_y', 'city2_y', 'month_y', 
                                     'city_state_x', 'city_state_y',])
final.head()

Unnamed: 0,year-month_x,market_city,Passengers,Seats,Flights,Distance,Origin Population,Destination Population,fare,Origin_Latitude,Origin_Longitude,Destination_Latitude,Destination_Longitude
110017,1996-01-31,"Albany, NY - Chicago, IL",10985,18579,175,723.0,825245,17287860,273.9,42.75,73.8,41.9,87.65
110020,1996-01-31,"Albuquerque, NM - Chicago, IL",5203,8604,62,1119.5,680994,17287860,156.16,35.05,106.6,41.9,87.65
110021,1996-01-31,"Albuquerque, NM - Dallas, TX",43467,69323,494,573.714286,680994,8994450,89.76,35.05,106.6,32.9,97.03
110023,1996-01-31,"Albuquerque, NM - Houston, TX",13017,22992,186,750.0,680994,4268132,104.98,35.05,106.6,29.97,95.35
110024,1996-01-31,"Albuquerque, NM - Las Vegas, NV",16010,29919,223,487.0,680994,1044023,77.06,35.05,106.6,36.08,115.17


In [160]:
new_names = {
    'Destination Population' : 'pop_dest',
    'Distance' : 'dist_miles',
    'Flights' : 'num_of_flights',
    'Origin Population' : 'pop_origin',
    'Passengers' : 'passengers',
    'Seats' : 'seat_capacity',
    'fare' : 'airfare',
    'market_city' : 'route',
    'year-month_x' : 'year-month',
    'Origin_Latitude' : 'origin_lat',
    'Destination_Latitude' : 'dest_lat',
    'Origin_Longitude' : 'origin_long',
    'Destination_Longitude' : 'dest_long',
}

In [161]:
final = final.rename(columns=new_names)
final.head()

Unnamed: 0,year-month,route,passengers,seat_capacity,num_of_flights,dist_miles,pop_origin,pop_dest,airfare,origin_lat,origin_long,dest_lat,dest_long
110017,1996-01-31,"Albany, NY - Chicago, IL",10985,18579,175,723.0,825245,17287860,273.9,42.75,73.8,41.9,87.65
110020,1996-01-31,"Albuquerque, NM - Chicago, IL",5203,8604,62,1119.5,680994,17287860,156.16,35.05,106.6,41.9,87.65
110021,1996-01-31,"Albuquerque, NM - Dallas, TX",43467,69323,494,573.714286,680994,8994450,89.76,35.05,106.6,32.9,97.03
110023,1996-01-31,"Albuquerque, NM - Houston, TX",13017,22992,186,750.0,680994,4268132,104.98,35.05,106.6,29.97,95.35
110024,1996-01-31,"Albuquerque, NM - Las Vegas, NV",16010,29919,223,487.0,680994,1044023,77.06,35.05,106.6,36.08,115.17


In [163]:
final = final.set_index('year-month')
final.head()

year-month,route,passengers,seat_capacity,num_of_flights,dist_miles,pop_origin,pop_dest,airfare,origin_lat,origin_long,dest_lat,dest_long
1996-01-31,"Albany, NY - Chicago, IL",10985,18579,175,723.0,825245,17287860,273.9,42.75,73.8,41.9,87.65
1996-01-31,"Albuquerque, NM - Chicago, IL",5203,8604,62,1119.5,680994,17287860,156.16,35.05,106.6,41.9,87.65
1996-01-31,"Albuquerque, NM - Dallas, TX",43467,69323,494,573.714286,680994,8994450,89.76,35.05,106.6,32.9,97.03
1996-01-31,"Albuquerque, NM - Houston, TX",13017,22992,186,750.0,680994,4268132,104.98,35.05,106.6,29.97,95.35
1996-01-31,"Albuquerque, NM - Las Vegas, NV",16010,29919,223,487.0,680994,1044023,77.06,35.05,106.6,36.08,115.17


In [164]:
# Merge Fuel Data with the rest of data
final = pd.merge(final, fuel, how='left', left_index=True, right_index=True)
print(final.shape)
final.head()

(63000, 13)


year-month,route,passengers,seat_capacity,num_of_flights,dist_miles,pop_origin,pop_dest,airfare,origin_lat,origin_long,dest_lat,dest_long,fuel_usd_pergallon
1996-01-31,"Albany, NY - Chicago, IL",10985,18579,175,723.0,825245,17287860,273.9,42.75,73.8,41.9,87.65,0.55
1996-01-31,"Albuquerque, NM - Chicago, IL",5203,8604,62,1119.5,680994,17287860,156.16,35.05,106.6,41.9,87.65,0.55
1996-01-31,"Albuquerque, NM - Dallas, TX",43467,69323,494,573.714286,680994,8994450,89.76,35.05,106.6,32.9,97.03,0.55
1996-01-31,"Albuquerque, NM - Houston, TX",13017,22992,186,750.0,680994,4268132,104.98,35.05,106.6,29.97,95.35,0.55
1996-01-31,"Albuquerque, NM - Las Vegas, NV",16010,29919,223,487.0,680994,1044023,77.06,35.05,106.6,36.08,115.17,0.55

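The fuel merge attaches one monthly price to every route row that shares the same date. A minimal sketch on toy data follows; the values and the `validate="m:1"` check are assumptions added for illustration and are not part of the original cell.

```python
import pandas as pd

# Toy route-level frame: several rows can share the same month-end date.
route_dates = pd.to_datetime(["1996-01-31", "1996-01-31", "1996-02-29"])
routes = pd.DataFrame(
    {"route": ["A - B", "C - D", "A - B"], "airfare": [273.9, 156.16, 260.0]},
    index=pd.DatetimeIndex(route_dates, name="year-month"),
)

# Toy fuel frame: exactly one price per month (values are made up).
fuel_dates = pd.to_datetime(["1996-01-31", "1996-02-29"])
fuel = pd.DataFrame(
    {"fuel_usd_pergallon": [0.55, 0.57]},
    index=pd.DatetimeIndex(fuel_dates, name="year-month"),
)

# Left join on the index. validate="m:1" raises if fuel has duplicate dates,
# which would otherwise silently multiply route rows.
merged = pd.merge(
    routes, fuel, how="left", left_index=True, right_index=True, validate="m:1"
)

# The row count should be unchanged by a many-to-one left join.
assert len(merged) == len(routes)
print(merged)
```

Checking that the row count is unchanged after the merge is a cheap guard against accidental duplication.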

In [317]:
# Save the final combined dataframe
final.to_csv('./data/clean/combined.csv')


### Split dataset into train and test datasets
- The test set is held out so the trained model can be evaluated on unseen data.

In [178]:
# Compute the train/test proportions: the first 132 months (1996-2006) are used for training
train_data_percentage = 132 / len(final.index.unique())

print(f'First Month: {final.index.min()}')
print(f'Last Month: {final.index.max()}')
print(f'Total Months: {len(final.index.unique())}')
print(f'Train/Test % Split: {round(train_data_percentage * 100, 2)}%')
print(f'Number of Months for Training Dataset: {round(len(final.index.unique()) * train_data_percentage)}')
print(f'Number of Months for Testing Dataset: {round(len(final.index.unique()) * (1 - train_data_percentage))}')

First Month: 1996-01-31 00:00:00
Last Month: 2009-12-31 00:00:00
Total Months: 168
Train/Test % Split: 78.57%
Number of Months for Training Dataset: 132
Number of Months for Testing Dataset: 36


In [179]:
# Change these dates to adjust the train/test split
train_start_date = '1996-01-31'
train_end_date = '2006-12-31'
test_start_date = '2007-01-31'
test_end_date = '2009-12-31'

In [180]:
# Split the combined data chronologically into train and test sets so the trained model can be evaluated on unseen data

# train
final.loc[train_start_date : train_end_date].to_csv('./data/clean/train.csv')

# test
final.loc[test_start_date : test_end_date].to_csv('./data/clean/test.csv')
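
The date-based split above relies on label slicing over a `DatetimeIndex`, which includes both endpoints. A small sketch on a toy monthly frame (the `toy` frame and its values are hypothetical) shows how the 132/36 month counts arise and how to check that the two sets do not overlap. Note that `.loc` slicing with date strings assumes the index is sorted.

```python
import pandas as pd

# 168 month-end dates from January 1996 through December 2009.
months = pd.date_range("1996-01-01", periods=168, freq="MS") + pd.offsets.MonthEnd(0)
toy = pd.DataFrame({"airfare": range(168)}, index=months)

# Label slices on a DatetimeIndex include both the start and end dates.
train = toy.loc["1996-01-31":"2006-12-31"]
test = toy.loc["2007-01-31":"2009-12-31"]

# Sanity checks: no overlap in time, and no rows lost between the two sets.
assert train.index.max() < test.index.min()
assert len(train) + len(test) == len(toy)
print(len(train), len(test))
```

Splitting by date rather than at random keeps every test month strictly after the training period, which matches how a forecasting model would be used.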

In [181]:
# Save dataframes by route: combined, train, test

def route_filename(prefix, route):
    # "Albany, NY - Chicago, IL" -> ['albany', 'ny', 'chicago', 'il']
    parts = route.replace(",", "_").replace(" ", "").replace("-", "_").lower().split('_')
    # Result looks like: combined_ny-il_albany-chicago.csv
    return f'{prefix}_{parts[1]}-{parts[3]}_{parts[0]}-{parts[2]}.csv'

def save_by_route(df, prefix):
    # Write one CSV per route into ./data/clean/route_datasets/<prefix>/
    for route in df['route'].unique():
        df[df['route'] == route].to_csv('./data/clean/route_datasets/' + prefix + '/' + route_filename(prefix, route))

# save combined by route
save_by_route(final, 'combined')

# save train by route
save_by_route(final.loc[train_start_date : train_end_date], 'train')

# save test by route
save_by_route(final.loc[test_start_date : test_end_date], 'test')

### Next steps with our data
- Given the data collected, I believe we have a good chance of building a model that predicts airfare for specific routes at specific times of year.