# Machine Learning

This file contains instructions on how to approach the challenge.

We are going to work on different types of Machine Learning problems:

- **Regression Problem**: The goal is to predict flight delays.
- **(Stretch) Multiclass Classification**: If a flight is delayed, we will predict what type of delay it is.
- **(Stretch) Binary Classification**: The goal is to predict whether a flight will be cancelled.

## Main Task: Regression Problem

The target variable is **ARR_DELAY**. We need to be careful about which columns we use and which we don't. For example, DEP_DELAY would be an excellent predictor, but we can't use it because in a real-life scenario we want to predict the delay before the flight takes off. We can use the average delay from earlier days, but not the delay of the actual flight we are predicting.

Likewise, the variables **CARRIER_DELAY, WEATHER_DELAY, NAS_DELAY, SECURITY_DELAY, LATE_AIRCRAFT_DELAY** shouldn't be used directly as predictors. However, we can create various transformations from their earlier values.

We will be evaluating your models by predicting the ARR_DELAY for all flights **1 week in advance**.
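One way to use earlier delays without leaking information from the flight being predicted is to lag an aggregate by the forecast horizon. The following is a minimal sketch on made-up data, not the method used in this notebook; the column name `origin_delay_lag7` and the toy values are assumptions for illustration only.

```python
import pandas as pd

# Toy data: column names follow the flights table, values are made up.
flights = pd.DataFrame({
    "fl_date": pd.to_datetime(
        ["2018-01-01", "2018-01-01", "2018-01-02", "2018-01-08", "2018-01-09"]
    ),
    "origin": ["ORD", "ORD", "ORD", "ORD", "ORD"],
    "arr_delay": [10.0, 30.0, 5.0, 50.0, -4.0],
})

# Mean arrival delay per origin airport and day.
daily = (
    flights.groupby(["origin", "fl_date"], as_index=False)["arr_delay"]
    .mean()
    .rename(columns={"arr_delay": "origin_delay_lag7"})  # hypothetical name
)

# Shift the date forward by 7 days: a flight on day D is then matched
# with the mean delay of day D-7, which is known a week in advance.
daily["fl_date"] = daily["fl_date"] + pd.Timedelta(days=7)

# Flights without a value 7 days earlier get NaN for the lagged feature.
flights = flights.merge(daily, how="left", on=["origin", "fl_date"])
```

Because the lag equals the one-week horizon, the feature can also be computed for the flights being predicted. Rows in the first week of the data have no history and need imputation or removal.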

### Create Base DataFrame

In [114]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression

In [115]:
test_df = pd.read_csv('../DB/test_sample.csv', index_col='Unnamed: 0')
test_df = test_df.dropna()

In [116]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 197074 entries, 0 to 197559
Data columns (total 20 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   fl_date             197074 non-null  object
 1   mkt_unique_carrier  197074 non-null  object
 2   branded_code_share  197074 non-null  object
 3   mkt_carrier         197074 non-null  object
 4   mkt_carrier_fl_num  197074 non-null  int64 
 5   op_unique_carrier   197074 non-null  object
 6   tail_num            197074 non-null  object
 7   op_carrier_fl_num   197074 non-null  int64 
 8   origin_airport_id   197074 non-null  int64 
 9   origin              197074 non-null  object
 10  origin_city_name    197074 non-null  object
 11  dest_airport_id     197074 non-null  int64 
 12  dest                197074 non-null  object
 13  dest_city_name      197074 non-null  object
 14  crs_dep_time        197074 non-null  int64 
 15  crs_arr_time        197074 non-null  int64 
 16  du

In [117]:
flights_df = pd.read_csv('../DB/flights_data.csv', index_col='Unnamed: 0')
flights_df.columns

Index(['fl_date', 'mkt_unique_carrier', 'branded_code_share', 'mkt_carrier',
       'mkt_carrier_fl_num', 'op_unique_carrier', 'tail_num',
       'op_carrier_fl_num', 'origin_airport_id', 'origin', 'origin_city_name',
       'dest_airport_id', 'dest', 'dest_city_name', 'crs_dep_time', 'dep_time',
       'dep_delay', 'taxi_out', 'wheels_off', 'wheels_on', 'taxi_in',
       'crs_arr_time', 'arr_time', 'arr_delay', 'cancelled',
       'cancellation_code', 'diverted', 'dup', 'crs_elapsed_time',
       'actual_elapsed_time', 'air_time', 'flights', 'distance',
       'carrier_delay', 'weather_delay', 'nas_delay', 'security_delay',
       'late_aircraft_delay', 'first_dep_time', 'total_add_gtime',
       'longest_add_gtime', 'no_name'],
      dtype='object')

In [118]:
feature_cols = list(test_df.columns)
feature_cols.append('arr_delay')

feature_df = flights_df[feature_cols]
feature_df = feature_df.dropna()

In [119]:
feature_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 162002 entries, 0 to 165071
Data columns (total 21 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   fl_date             162002 non-null  object 
 1   mkt_unique_carrier  162002 non-null  object 
 2   branded_code_share  162002 non-null  object 
 3   mkt_carrier         162002 non-null  object 
 4   mkt_carrier_fl_num  162002 non-null  int64  
 5   op_unique_carrier   162002 non-null  object 
 6   tail_num            162002 non-null  object 
 7   op_carrier_fl_num   162002 non-null  int64  
 8   origin_airport_id   162002 non-null  int64  
 9   origin              162002 non-null  object 
 10  origin_city_name    162002 non-null  object 
 11  dest_airport_id     162002 non-null  int64  
 12  dest                162002 non-null  object 
 13  dest_city_name      162002 non-null  object 
 14  crs_dep_time        162002 non-null  int64  
 15  crs_arr_time        162002 non-nul

In [120]:
X = feature_df[feature_cols]

In [121]:
# find categorical features
cols = X.columns
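# select_dtypes is the public API; the list comprehension keeps column order stable
num_cols = X.select_dtypes(include='number').columns
cat_cols = [col for col in cols if col not in num_cols]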

In [122]:
# remove redundant numeric columns
final_num_cols = list(num_cols)
final_num_cols.remove('op_carrier_fl_num')
final_num_cols.remove('flights')

# remove redundant categorical columns
final_cat_cols = ['mkt_unique_carrier', 'fl_date', 'tail_num', 'branded_code_share']

# combine final features
final_features = final_num_cols + final_cat_cols
print('final_features:', final_features)

final_features: ['mkt_carrier_fl_num', 'origin_airport_id', 'dest_airport_id', 'crs_dep_time', 'crs_arr_time', 'crs_elapsed_time', 'distance', 'arr_delay', 'mkt_unique_carrier', 'fl_date', 'tail_num', 'branded_code_share']


In [123]:
# copy so that the column assignments below don't trigger SettingWithCopyWarning
X = X[final_features].copy()

In [124]:
# convert fl_date feature into datetime
X['fl_date'] = pd.to_datetime(X['fl_date'])

In [125]:
# separate datetime into date features
X['year'] = X['fl_date'].dt.year
X['month'] = X['fl_date'].dt.month
X['week'] = X['fl_date'].dt.isocalendar().week
X['day'] = X['fl_date'].dt.day
X['day_of_week'] = X['fl_date'].dt.dayofweek

In [126]:
X = X.reset_index()

In [127]:
X.index.name = 'order'
X = X.drop(columns=['index'])

In [128]:
X.head()

order,mkt_carrier_fl_num,origin_airport_id,dest_airport_id,crs_dep_time,crs_arr_time,crs_elapsed_time,distance,arr_delay,mkt_unique_carrier,fl_date,tail_num,branded_code_share,year,month,week,day,day_of_week
0,3501,12953,13930,1300,1444,164.0,733.0,-28.0,UA,2018-01-01,N744YX,UA_CODESHARE,2018,1,1,1,0
1,3502,11433,12266,630,854,204.0,1075.0,1.0,UA,2018-01-01,N640RW,UA_CODESHARE,2018,1,1,1,0
2,3503,11618,11433,1500,1709,129.0,488.0,18.0,UA,2018-01-01,N641RW,UA_CODESHARE,2018,1,1,1,0
3,3504,11618,11278,2041,2159,78.0,199.0,32.0,UA,2018-01-01,N722YX,UA_CODESHARE,2018,1,1,1,0
4,3505,12266,11298,2140,2257,77.0,224.0,-1.0,UA,2018-01-01,N855RW,UA_CODESHARE,2018,1,1,1,0


In [129]:
# set y
y = X[['arr_delay']]

In [130]:
# drop original fl_date and arr_delay columns
X = X.drop(columns=['fl_date'])
X = X.drop(columns=['arr_delay'])

In [131]:
X

order,mkt_carrier_fl_num,origin_airport_id,dest_airport_id,crs_dep_time,crs_arr_time,crs_elapsed_time,distance,mkt_unique_carrier,tail_num,branded_code_share,year,month,week,day,day_of_week
0,3501,12953,13930,1300,1444,164.0,733.0,UA,N744YX,UA_CODESHARE,2018,1,1,1,0
1,3502,11433,12266,630,854,204.0,1075.0,UA,N640RW,UA_CODESHARE,2018,1,1,1,0
2,3503,11618,11433,1500,1709,129.0,488.0,UA,N641RW,UA_CODESHARE,2018,1,1,1,0
3,3504,11618,11278,2041,2159,78.0,199.0,UA,N722YX,UA_CODESHARE,2018,1,1,1,0
4,3505,12266,11298,2140,2257,77.0,224.0,UA,N855RW,UA_CODESHARE,2018,1,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
161997,2789,13487,14771,925,1143,258.0,1589.0,DL,N886DN,DL,2019,7,31,31,2
161998,2790,10721,13487,1841,2101,200.0,1124.0,DL,N302DN,DL,2019,7,31,31,2
161999,2791,10397,11298,1000,1116,136.0,731.0,DL,N375NC,DL,2019,7,31,31,2
162000,2791,11298,10397,1201,1512,131.0,731.0,DL,N375NC,DL,2019,7,31,31,2


### Feature Engineering

Feature engineering will play a crucial role in this problem. We have only a few attributes, so we need to create features that carry some predictive power.

- weather: we can use a weather API to look up the weather at the time of the scheduled departure and scheduled arrival.
- statistics (mean, median, std, min, max...): we can take a look at previous delays and compute descriptive statistics
- airports encoding: we need to think about what to do with the airports and other categorical variables
- time of the day: the delay probably depends on the airport traffic which varies during the day.
- airport traffic
- unsupervised learning as feature engineering?
- **what are the additional options?**: Think about what we could do more to improve the model.
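Two of the ideas above, time of day and airport traffic, can be sketched directly from columns that already exist in the base DataFrame. This is an illustrative sketch under assumptions: `X_demo`, `dep_hour`, `dep_period`, `origin_flight_count`, and the bucket boundaries are hypothetical, and the rows are copied from the `X.head()` output.

```python
import pandas as pd

# A few rows shaped like the base DataFrame above.
X_demo = pd.DataFrame({
    "origin_airport_id": [12953, 11433, 11618, 11618, 12266],
    "crs_dep_time": [1300, 630, 1500, 2041, 2140],
})

# crs_dep_time is stored as HHMM, so integer division by 100 gives the hour.
X_demo["dep_hour"] = X_demo["crs_dep_time"] // 100

# Bucket the hour into coarse periods (boundaries are an assumption).
# Intervals are left-closed: [0, 6), [6, 12), [12, 18), [18, 24).
X_demo["dep_period"] = pd.cut(
    X_demo["dep_hour"],
    bins=[0, 6, 12, 18, 24],
    right=False,
    labels=["night", "morning", "afternoon", "evening"],
)

# Frequency encoding: number of flights per origin airport in this sample,
# used here as a rough proxy for airport traffic.
X_demo["origin_flight_count"] = (
    X_demo.groupby("origin_airport_id")["origin_airport_id"].transform("count")
)
```

A value of 2400 in `crs_dep_time` would fall outside the last bucket and become NaN, so it is worth checking for it before binning. The flight counts should be computed on training data only and then mapped onto the flights being predicted.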

### Testing

In [132]:
pas_df = pd.read_csv('../DB/passengers_data.csv', index_col='Unnamed: 0')

In [133]:
pas_df.columns

Index(['departures_scheduled', 'departures_performed', 'payload', 'seats',
       'passengers', 'freight', 'mail', 'distance', 'ramp_to_ramp', 'air_time',
       'unique_carrier', 'airline_id', 'unique_carrier_name', 'region',
       'carrier', 'carrier_name', 'carrier_group', 'carrier_group_new',
       'origin_airport_id', 'origin_city_market_id', 'origin',
       'origin_city_name', 'origin_country', 'origin_country_name',
       'dest_airport_id', 'dest_city_market_id', 'dest', 'dest_city_name',
       'dest_country', 'dest_country_name', 'aircraft_group', 'aircraft_type',
       'aircraft_config', 'year', 'month', 'distance_group', 'class',
       'data_source'],
      dtype='object')

In [134]:
pas_df.origin_airport_id.value_counts()

13930    5331
10397    4827
11292    4017
12892    3820
12266    3327
         ... 
12130       1
14453       1
11496       1
15382       1
11856       1
Name: origin_airport_id, Length: 1512, dtype: int64

In [135]:
pas_df.month.value_counts()

10    14801
11    14399
12    14398
7     14064
6     14006
3     13973
9     13696
4     13685
8     13660
1     13496
5     12932
2     12818
Name: month, dtype: int64

In [136]:
def make_df(df,year):
    
    months = range(1, 13)
    dfs = []
    
    for month in months:
        X = df.loc[(df.month == month) & (df.year == year)].groupby('origin_airport_id')[['passengers']].sum()
        X['month'] = month
        X['year'] = year
        dfs.append(X)
        
    return dfs

In [137]:
passengers_per_month_2018 = make_df(pas_df, 2018)
passengers_per_month_2019 = make_df(pas_df, 2019)

In [138]:
pass_2018 = pd.concat(passengers_per_month_2018)
pass_2019 = pd.concat(passengers_per_month_2019)

In [153]:
avg_monthly_pas = pd.merge(pass_2018, pass_2019, how='left', on=['origin_airport_id', 'month'])

In [154]:
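# assign the result back: fillna(inplace=True) on an attribute-accessed column is unreliable in recent pandas
avg_monthly_pas['passengers_y'] = avg_monthly_pas['passengers_y'].fillna(avg_monthly_pas['passengers_x'])
avg_monthly_pas['year_y'] = avg_monthly_pas['year_y'].fillna(avg_monthly_pas['year_x'] + 1)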

In [155]:
avg_monthly_pas['avg_monthly_pas'] = (avg_monthly_pas.passengers_x + avg_monthly_pas.passengers_y) / 2

In [157]:
avg_monthly_pas = avg_monthly_pas[['month','avg_monthly_pas']]

In [158]:
avg_monthly_pas

origin_airport_id,month,avg_monthly_pas
10135,1,1985.0
10140,1,9861.0
10158,1,2828.0
10165,1,159.0
10194,1,0.0
...,...,...
16271,12,8694.5
16304,12,456.0
16390,12,39.0
16477,12,0.0


In [159]:
final = pd.merge(X, avg_monthly_pas, how='left', on=['origin_airport_id','month'])

In [164]:
final[['avg_monthly_pas']].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 162002 entries, 0 to 162001
Data columns (total 1 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   avg_monthly_pas  156746 non-null  float64
dtypes: float64(1)
memory usage: 2.5 MB


### Total Monthly Passengers For Each Airport Feature 

In [166]:
# store passengers csv from DB folder into a DataFrame
pas_df = pd.read_csv('../DB/passengers_data.csv', index_col='Unnamed: 0')


# helper function to find total passengers per airport
def make_df(df,year):
    
    '''
    Returns a list of DataFrames ordered by month
    showing the total passenger counts for the corresponding airport
    
    
    Arguments:
    df -- DataFrame from passengers table
            - must include 'passengers', 'month', 'year', 
              and 'origin_airport_id' features
            
    Returns:
    List of DataFrames
        - Columns: total passengers, month, year
        - Row: origin_airport_ids
    '''
    
    months = range(1, 13)
    dfs = []
    
    for month in months:
        X = df.loc[(df.month == month) & (df.year == year)].groupby('origin_airport_id')[['passengers']].sum()
        X['month'] = month
        X['year'] = year
        dfs.append(X)
        
    return dfs


# use function to find passenger counts for 2018 and 2019
passengers_per_month_2018 = make_df(pas_df, 2018)
passengers_per_month_2019 = make_df(pas_df, 2019)


# take list of DataFrames and concatenate 
pass_2018 = pd.concat(passengers_per_month_2018)
pass_2019 = pd.concat(passengers_per_month_2019)


# merge yearly DataFrames
avg_monthly_pas = pd.merge(pass_2018, pass_2019, how='left', on=['origin_airport_id', 'month'])


# fill NaN values 
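# (assign the result back: fillna(inplace=True) on an attribute-accessed column is unreliable in recent pandas)
avg_monthly_pas['passengers_y'] = avg_monthly_pas['passengers_y'].fillna(avg_monthly_pas['passengers_x'])
avg_monthly_pas['year_y'] = avg_monthly_pas['year_y'].fillna(avg_monthly_pas['year_x'] + 1)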


# calculate an average monthly passenger feature for each airport
avg_monthly_pas['avg_monthly_pas'] = (avg_monthly_pas.passengers_x + avg_monthly_pas.passengers_y) / 2


# drop unnecessary features
avg_monthly_pas = avg_monthly_pas[['month','avg_monthly_pas']]


# merge onto training DataFrame
final = pd.merge(X, avg_monthly_pas, how='left', on=['origin_airport_id','month'])

In [167]:
final

Unnamed: 0,mkt_carrier_fl_num,origin_airport_id,dest_airport_id,crs_dep_time,crs_arr_time,crs_elapsed_time,distance,mkt_unique_carrier,tail_num,branded_code_share,year,month,week,day,day_of_week,avg_monthly_pas
0,3501,12953,13930,1300,1444,164.0,733.0,UA,N744YX,UA_CODESHARE,2018,1,1,1,0,66470.5
1,3502,11433,12266,630,854,204.0,1075.0,UA,N640RW,UA_CODESHARE,2018,1,1,1,0,108812.5
2,3503,11618,11433,1500,1709,129.0,488.0,UA,N641RW,UA_CODESHARE,2018,1,1,1,0,94693.5
3,3504,11618,11278,2041,2159,78.0,199.0,UA,N722YX,UA_CODESHARE,2018,1,1,1,0,94693.5
4,3505,12266,11298,2140,2257,77.0,224.0,UA,N855RW,UA_CODESHARE,2018,1,1,1,0,155988.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
161997,2789,13487,14771,925,1143,258.0,1589.0,DL,N886DN,DL,2019,7,31,31,2,136111.0
161998,2790,10721,13487,1841,2101,200.0,1124.0,DL,N302DN,DL,2019,7,31,31,2,84468.5
161999,2791,10397,11298,1000,1116,136.0,731.0,DL,N375NC,DL,2019,7,31,31,2,379869.5
162000,2791,11298,10397,1201,1512,131.0,731.0,DL,N375NC,DL,2019,7,31,31,2,262691.5


### Feature Selection / Dimensionality Reduction

We need to apply different selection techniques to find out which one works best for our problems.

- Original Features vs. PCA components?
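The comparison between selected original features and PCA components can be set up as two pipelines scored with the same cross-validation. This is a sketch on synthetic data, assuming numeric features only; `X_num`, `y_num`, and the choice of `k=4` / `n_components=4` are placeholders, not values tuned for the flights data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numeric feature matrix (assumption).
rng = np.random.default_rng(0)
X_num = rng.normal(size=(300, 8))
y_num = 3.0 * X_num[:, 0] - 2.0 * X_num[:, 1] + rng.normal(scale=0.5, size=300)

# Same scaler and model in both pipelines; only the reduction step differs.
candidates = {
    "select_k_best": make_pipeline(
        StandardScaler(), SelectKBest(f_regression, k=4), LinearRegression()
    ),
    "pca": make_pipeline(
        StandardScaler(), PCA(n_components=4), LinearRegression()
    ),
}

# Mean 5-fold cross-validated R^2 for each approach.
scores = {
    name: cross_val_score(pipe, X_num, y_num, cv=5, scoring="r2").mean()
    for name, pipe in candidates.items()
}
print(scores)
```

Keeping the reduction step inside the pipeline means it is refit on each training fold, so the comparison does not leak information from the validation folds.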

### Modeling

Use different ML techniques to solve each problem.

- linear / logistic / multinomial logistic regression
- Naive Bayes
- Random Forest
- SVM
- XGBoost
- The ensemble of your own choice
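For the regression task, a simple loop over candidate models keeps the comparison consistent. The sketch below uses synthetic data and two of the listed model families; `X_syn`, `y_syn`, and the hyperparameters are assumptions for illustration, and XGBoost is left out to avoid an extra dependency.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data with one linear and one non-linear effect (assumption).
rng = np.random.default_rng(42)
X_syn = rng.uniform(0, 1, size=(500, 5))
y_syn = (
    20 * X_syn[:, 0]
    + 10 * np.sin(6 * X_syn[:, 1])
    + rng.normal(scale=2.0, size=500)
)

X_train, X_test, y_train, y_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0
)

models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Fit every model on the same split and record its mean absolute error.
mae = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    mae[name] = mean_absolute_error(y_test, model.predict(X_test))
print(mae)
```

A random split is used here only because the synthetic rows have no time order. For the flights data, a split by date is more faithful to the one-week-ahead setting.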

### Evaluation

You have data from 2018 and 2019 to develop models. Use different evaluation metrics for each problem and compare the performance of different models.

You are required to predict delays on **out of sample** data from the **first 7 days (1st-7th) of January 2020** and to share the file with LighthouseLabs. A sample submission can be found in the file **_sample_submission.csv_**
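Since the final predictions cover a week that lies after the training data, a validation split by date mimics the task more closely than a random split. The following sketch holds out the last 7 days of a toy frame and scores a naive baseline; the data, the `cutoff` logic, and the choice of metrics are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy frame: one row per day in December 2019, with made-up delays.
dates = pd.date_range("2019-12-01", "2019-12-31", freq="D")
demo = pd.DataFrame({
    "fl_date": dates,
    "arr_delay": np.arange(len(dates), dtype=float),
})

# Hold out the final 7 days as the validation set.
cutoff = demo["fl_date"].max() - pd.Timedelta(days=7)
train = demo[demo["fl_date"] <= cutoff]
valid = demo[demo["fl_date"] > cutoff]

# Naive baseline: predict the training mean for every validation row.
pred = np.full(len(valid), train["arr_delay"].mean())

metrics = {
    "mae": mean_absolute_error(valid["arr_delay"], pred),
    "rmse": np.sqrt(mean_squared_error(valid["arr_delay"], pred)),
    "r2": r2_score(valid["arr_delay"], pred),
}
print(metrics)
```

Any real model should beat this baseline on the same held-out week. MAE is less sensitive than RMSE to the few very long delays, so reporting both gives a fuller picture.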

======================================================================
## Stretch Tasks

### Multiclass Classification

The target variables are **CARRIER_DELAY, WEATHER_DELAY, NAS_DELAY, SECURITY_DELAY, LATE_AIRCRAFT_DELAY**. We need to do additional transformations because these variables are not binary but continuous. For each flight that was delayed, exactly one of these variables should be 1 and the others 0.

It can happen that we have two types of delays with more than 0 minutes. In this case, take the bigger one as 1 and others as 0.
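The rule above (the largest delay type wins) maps onto `idxmax` along the rows. A minimal sketch with made-up values, assuming every row has at least one positive delay; `delays`, `delay_type`, and `one_hot` are illustrative names.

```python
import pandas as pd

delay_cols = [
    "carrier_delay", "weather_delay", "nas_delay",
    "security_delay", "late_aircraft_delay",
]

# Made-up delay minutes for three delayed flights.
delays = pd.DataFrame(
    [
        [15.0, 0.0, 0.0, 0.0, 0.0],
        [0.0, 5.0, 20.0, 0.0, 0.0],   # two delay types: the larger one wins
        [0.0, 0.0, 0.0, 0.0, 45.0],
    ],
    columns=delay_cols,
)

# Name of the column holding the largest delay in each row.
delay_type = delays[delay_cols].idxmax(axis=1)

# One-hot version: exactly one 1 per row, all five columns always present.
one_hot = (
    pd.get_dummies(delay_type)
    .reindex(columns=delay_cols, fill_value=0)
    .astype(int)
)
```

`delay_type` can serve directly as the multiclass label. On exact ties, `idxmax` keeps the first column in `delay_cols`, so the column order acts as the tie-break rule. Flights with no recorded delay cause should be filtered out beforehand.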

### Binary Classification

The target variable is **CANCELLED**. The main problem here is going to be a large class imbalance: only a small fraction of all flights are cancelled. It is important to use an appropriate sampling strategy before training and to choose evaluation metrics that account for the imbalance.
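As one possible way to handle the imbalance, the sketch below uses class weighting instead of resampling and evaluates with recall and average precision rather than accuracy. It runs on synthetic data with a rare positive class; `X_syn`, `y_syn`, and the model choice are assumptions, not the notebook's method.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic data where positives ("cancelled") are rare (assumption).
rng = np.random.default_rng(0)
n = 2000
X_syn = rng.normal(size=(n, 3))
logits = 2.5 * X_syn[:, 0] - 4.5
y_syn = (rng.uniform(size=n) < 1 / (1 + np.exp(-logits))).astype(int)

# Stratify so both splits keep the same share of positives.
X_train, X_test, y_train, y_test = train_test_split(
    X_syn, y_syn, test_size=0.25, stratify=y_syn, random_state=0
)

# class_weight="balanced" reweights classes inversely to their frequency.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)

proba = clf.predict_proba(X_test)[:, 1]
recall = recall_score(y_test, clf.predict(X_test))
pr_auc = average_precision_score(y_test, proba)
print(f"positive rate={y_test.mean():.3f} recall={recall:.3f} PR-AUC={pr_auc:.3f}")
```

Accuracy is misleading here because predicting "not cancelled" for every flight already scores very high. Average precision has the positive rate as its chance level, which makes it easier to see whether the model adds anything.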