# Adding variables

This notebook creates the distance variable to be added to our dataset.

In [1]:
import os
import importlib
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from describe import stats_histo
import problem
import warnings
import math
pd.set_option('display.max_columns', None)
warnings.filterwarnings("ignore")



We load the initial data.

In [2]:
X_train, y_train = problem.get_train_data()

In [3]:
X_train.head()

Unnamed: 0,DateOfDeparture,Departure,Arrival,WeeksToDeparture,std_wtd
0,2012-06-19,ORD,DFW,12.875,9.812647
1,2012-09-10,LAS,DEN,14.285714,9.466734
2,2012-10-05,DEN,LAX,10.863636,9.035883
3,2011-10-09,ATL,ORD,11.48,7.990202
4,2012-02-21,DEN,SFO,11.45,9.517159


First, we create the route variable `Trajet`:

In [4]:
X_train["Trajet"] = X_train["Departure"] + '-' + X_train["Arrival"]

## Computing distances

We start by loading the airports dataset.

In [22]:
airports = pd.read_csv("data/airports.csv")
airports.head(3)

Unnamed: 0,Airport_ID,Name,City,Country,IATA,ICAO,Latitude,Longitude,Altitude,Timezone,DST,Tz,Type,Source
0,1,Goroka Airport,Goroka,Papua New Guinea,GKA,AYGA,-6.08169,145.391998,5282,10,U,Pacific/Port_Moresby,airport,OurAirports
1,2,Madang Airport,Madang,Papua New Guinea,MAG,AYMD,-5.20708,145.789001,20,10,U,Pacific/Port_Moresby,airport,OurAirports
2,3,Mount Hagen Kagamuga Airport,Mount Hagen,Papua New Guinea,HGU,AYMH,-5.82679,144.296005,5388,10,U,Pacific/Port_Moresby,airport,OurAirports


In [67]:
airports.head(3)

Unnamed: 0,IATA_CODE,AIRPORT,CITY,STATE,COUNTRY,LATITUDE,LONGITUDE
0,ABE,Lehigh Valley International Airport,Allentown,PA,USA,40.65236,-75.4404
1,ABI,Abilene Regional Airport,Abilene,TX,USA,32.41132,-99.6819
2,ABQ,Albuquerque International Sunport,Albuquerque,NM,USA,35.04022,-106.60919


We select the airports of interest and check that the same airports appear as departures and as arrivals.

In [7]:
np.setdiff1d(X_train["Arrival"].unique(),
             X_train["Departure"].unique())

array([], dtype=object)
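Note that `np.setdiff1d(a, b)` only returns the values of `a` that are missing from `b`, so the call above shows that every arrival airport is also a departure airport, not the reverse. A symmetric check could look like the sketch below. The helper name `airport_mismatch` and the sample codes are hypothetical and only serve as an illustration.

```python
# Hypothetical sample codes standing in for X_train["Departure"] and X_train["Arrival"].
departures = ["ORD", "LAS", "DEN", "ATL", "DFW"]
arrivals = ["DFW", "DEN", "LAX", "ORD", "LAS"]


def airport_mismatch(departures, arrivals):
    """Return the codes that appear on one side only, split by side."""
    dep, arr = set(departures), set(arrivals)
    return {
        "only_departure": sorted(dep - arr),  # departures never used as an arrival
        "only_arrival": sorted(arr - dep),    # arrivals never used as a departure
    }


mismatch = airport_mismatch(departures, arrivals)
print(mismatch)
```

Two empty lists mean both columns cover exactly the same set of airports.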

The list of airports:

In [8]:
aep = X_train["Arrival"].unique()

In [9]:
aep

array(['DFW', 'DEN', 'LAX', 'ORD', 'SFO', 'MCO', 'LAS', 'CLT', 'MSP',
       'EWR', 'PHX', 'DTW', 'MIA', 'BOS', 'PHL', 'JFK', 'ATL', 'LGA',
       'SEA', 'IAH'], dtype=object)

In [11]:
# Collect the rows of `airports` matching each IATA code, in the order of `aep`
aep_choice = []

for i in aep:
    aep_choice.append(airports[airports.IATA == i])

distance_collect = pd.concat(aep_choice)

In [12]:
distances_data = distance_collect[["IATA",
                                   "Latitude",
                                   "Longitude"]].reset_index(drop=True)

In [13]:
dict_lat = dict(distances_data.set_index("IATA")["Latitude"])
dict_long = dict(distances_data.set_index("IATA")["Longitude"])

We try candidate distance functions to see which one is the most accurate, starting with geopy's `distance` on the first two airports.

In [16]:
from geopy.distance import distance

# (latitude, longitude) of the first two airports in distances_data: DFW and DEN
coords_1 = (distances_data.Latitude[0], distances_data.Longitude[0])
coords_2 = (distances_data.Latitude[1], distances_data.Longitude[1])

In [85]:
distance(coords_1, coords_2).km

1031.445141120004
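One candidate to compare against the geopy value is the haversine formula, which treats the Earth as a sphere. The sketch below is not part of the original notebook: the function name `haversine_km` and the mean Earth radius of 6371.0088 km are assumptions chosen for illustration, and the coordinates are the DFW and DEN values from the dictionaries printed at the end of this notebook.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0088):
    """Great-circle distance in km between two points given in degrees.

    Assumes a spherical Earth of radius `radius_km` (an assumed mean radius).
    """
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


# DFW and DEN coordinates, copied from the notebook's printed dictionaries
dfw = (32.896801, -97.038002)
den = (39.861698150635, -104.672996521)

d_haversine = haversine_km(*dfw, *den)
print(round(d_haversine, 1))
```

A spherical model usually lands within a fraction of a percent of an ellipsoidal one, so a large gap between the two results would more likely point to swapped latitude and longitude or to a wrong airport row than to the choice of formula.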

We can apply it to our DataFrame:

In [17]:
X_train.head()

Unnamed: 0,DateOfDeparture,Departure,Arrival,WeeksToDeparture,std_wtd,Trajet
0,2012-06-19,ORD,DFW,12.875,9.812647,ORD-DFW
1,2012-09-10,LAS,DEN,14.285714,9.466734,LAS-DEN
2,2012-10-05,DEN,LAX,10.863636,9.035883,DEN-LAX
3,2011-10-09,ATL,ORD,11.48,7.990202,ATL-ORD
4,2012-02-21,DEN,SFO,11.45,9.517159,DEN-SFO


In [18]:
# Distance in km (geopy) between the departure and arrival airports of each row
X_train.apply(
    lambda x: distance((dict_lat[x["Departure"]], dict_long[x["Departure"]]),
                       (dict_lat[x["Arrival"]], dict_long[x["Arrival"]])).kilometers,
    axis=1)

0       1290.346856
1       1011.046677
2       1387.023784
3        974.957134
4       1556.391964
           ...     
8897     956.518345
8898    1290.346856
8899     666.249783
8900    1090.917520
8901     956.518345
Length: 8902, dtype: float64
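The `apply` call above recomputes the distance for every row and does not store the result. Since the distance depends only on the route, one option is to compute it once per distinct `Trajet` and map it back onto the rows. This is a minimal sketch under assumptions: the column name `Distance`, the helper `haversine_km` (a standard-library stand-in so the example runs without geopy; the notebook's `distance(...).kilometers` could take its place), and the three-row toy frame are all hypothetical.

```python
import math

import pandas as pd


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0088):
    """Great-circle distance in km on a sphere (stand-in for geopy's distance)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi, dlmb = p2 - p1, math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


# Coordinates copied from the notebook's printed dictionaries (three airports only)
dict_lat = {"ORD": 41.9786, "DFW": 32.896801, "DEN": 39.861698150635}
dict_long = {"ORD": -87.9048, "DFW": -97.038002, "DEN": -104.672996521}

# Toy frame standing in for X_train
X = pd.DataFrame({"Departure": ["ORD", "DFW", "ORD"],
                  "Arrival": ["DFW", "DEN", "DFW"]})
X["Trajet"] = X["Departure"] + "-" + X["Arrival"]

# One distance per distinct route, keyed by the route label
routes = X[["Trajet", "Departure", "Arrival"]].drop_duplicates("Trajet")
route_km = {
    row.Trajet: haversine_km(dict_lat[row.Departure], dict_long[row.Departure],
                             dict_lat[row.Arrival], dict_long[row.Arrival])
    for row in routes.itertuples(index=False)
}

# Store the result as a new column (hypothetical name)
X["Distance"] = X["Trajet"].map(route_km)
print(X)
```

With 20 airports there are at most 380 directed routes, so the lookup table stays small compared with the 8902 rows of the training set.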

We save the coordinates behind both dictionaries to a CSV file:

In [19]:
distances_data.set_index("IATA").to_csv("dico_trajet.csv")
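To reuse these coordinates elsewhere, the two dictionaries can be rebuilt from the saved file. The sketch below assumes the file has the header `IATA,Latitude,Longitude`, which is what `to_csv` should produce for this frame. The helper `load_coordinates` is hypothetical, and an in-memory two-row excerpt stands in for `dico_trajet.csv` so that the example is self-contained.

```python
import csv
import io

# Two-row excerpt standing in for the contents of dico_trajet.csv (assumed layout)
csv_text = (
    "IATA,Latitude,Longitude\n"
    "DFW,32.896801,-97.038002\n"
    "DEN,39.861698150635,-104.672996521\n"
)


def load_coordinates(handle):
    """Rebuild the latitude and longitude dictionaries keyed by IATA code."""
    dict_lat, dict_long = {}, {}
    for row in csv.DictReader(handle):
        dict_lat[row["IATA"]] = float(row["Latitude"])
        dict_long[row["IATA"]] = float(row["Longitude"])
    return dict_lat, dict_long


dict_lat, dict_long = load_coordinates(io.StringIO(csv_text))
print(dict_lat, dict_long)
```

With the real file, the same helper can be called as `load_coordinates(f)` inside `with open("dico_trajet.csv", newline="") as f:`.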

In [20]:
print(dict_lat)

{'DFW': 32.896801, 'DEN': 39.861698150635, 'LAX': 33.94250107, 'ORD': 41.9786, 'SFO': 37.61899948120117, 'MCO': 28.429399490356445, 'LAS': 36.08010101, 'CLT': 35.2140007019043, 'MSP': 44.882, 'EWR': 40.69250106811523, 'PHX': 33.43429946899414, 'DTW': 42.212398529052734, 'MIA': 25.79319953918457, 'BOS': 42.36429977, 'PHL': 39.87189865112305, 'JFK': 40.63980103, 'ATL': 33.6367, 'LGA': 40.77719879, 'SEA': 47.449001, 'IAH': 29.98439979553223}


In [21]:
print(dict_long)

{'DFW': -97.038002, 'DEN': -104.672996521, 'LAX': -118.40799709999999, 'ORD': -87.9048, 'SFO': -122.375, 'MCO': -81.30899810791016, 'LAS': -115.15200039999999, 'CLT': -80.94309997558594, 'MSP': -93.221802, 'EWR': -74.168701171875, 'PHX': -112.01200103759766, 'DTW': -83.35340118408203, 'MIA': -80.29060363769531, 'BOS': -71.00520325, 'PHL': -75.24109649658203, 'JFK': -73.77890015, 'ATL': -84.428101, 'LGA': -73.87259674, 'SEA': -122.308998, 'IAH': -95.34140014648438}
