
# Hackathon - Binary Classification - Solutions

In this notebook you will find the instructor's solution to the hackathon. This is one of many possible approaches.
The main goal is to give you a baseline that you can modify or expand upon, and to show you how sklearn's pipeline can simplify your workflow.

### Import the necessary libraries

In [1]:
# Import basic libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 

# Sklearn libraries
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve, classification_report, auc
from sklearn.base import BaseEstimator, TransformerMixin # to create classes

# Category encoders
from category_encoders import OneHotEncoder, TargetEncoder

#### Import the dataset

In [2]:
# You might have to change this path or the location of this file
data = pd.read_csv('data/train.csv').set_index("ID")


In [None]:
data.head()

### EDA (Exploratory Data Analysis)

Let's start by checking the number of unique values per column.

In [None]:
data.nunique()

Some of the columns add little or no value to the model, for example because they have very few unique values or repeat information held in other columns. Let's drop them.

In [None]:
cols_irrelevant=["ORIGIN_AIRPORT_ID", "DEST_AIRPORT_ID", "OP_CARRIER_AIRLINE_ID", "CANCELLED"]

In [None]:
data = data.drop(columns=cols_irrelevant)
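One way to spot candidates for removal is to look for columns with a single unique value. Here is a minimal sketch on a toy frame; the values are invented for illustration and only the column names come from the dataset.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "ORIGIN": ["JFK", "LAX", "SFO"],
    "CANCELLED": [0, 0, 0],            # constant in this toy frame
    "DISTANCE": [2475.0, 337.0, 679.0],
})

# A column with at most one distinct value cannot help a model separate classes
n_unique = toy.nunique()
constant_cols = n_unique[n_unique <= 1].index.tolist()
print(constant_cols)
```

Columns that duplicate other columns (such as an ID and its code) will not show up here and still need a manual look.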

Let's check the data types we are dealing with.

In [None]:
data.dtypes

We can see the date columns are stored as strings rather than datetimes. We'll have to convert them.

In [None]:
date_cols_to_convert = ["DATE_DEPARTURE_UTC", "DATE_ARRIVAL_UTC", "DATE_DEPARTURE_LCL", "DATE_ARRIVAL_LCL"]
data[date_cols_to_convert] = data[date_cols_to_convert].apply(pd.to_datetime, format="%Y/%m/%d %H:%M:%S")

In [None]:
data.dtypes

Much better now!
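If some date strings could be malformed, `pd.to_datetime` can turn them into `NaT` instead of raising. This is a small sketch on made-up values; whether the real data contains bad dates is not something this notebook checks.

```python
import pandas as pd

# Toy strings, invented for illustration; the last one is deliberately invalid
toy = pd.DataFrame({
    "DATE_DEPARTURE_UTC": ["2020/01/06 14:30:00", "2020/01/07 09:05:00", "not a date"]
})

# errors="coerce" converts unparseable values to NaT instead of raising
toy["DATE_DEPARTURE_UTC"] = pd.to_datetime(
    toy["DATE_DEPARTURE_UTC"], format="%Y/%m/%d %H:%M:%S", errors="coerce"
)

n_bad = toy["DATE_DEPARTURE_UTC"].isna().sum()
print(toy.dtypes)
print("Unparseable rows:", n_bad)
```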

Let's check if we have any missing values to worry about.

In [None]:
data.isna().sum()

Hmm, two features have missing values.

In the case of the `DISTANCE` column you may have noticed some of the missing values can be filled with information from other rows.

In [None]:
data[["ORIGIN", "DEST", "DISTANCE"]].sort_values(by=["ORIGIN"]).head()

In [None]:
# Start by isolating the information that is available to you regarding the variable distance
data_distance = data[["ORIGIN","DEST","DISTANCE"]].dropna().copy()
# Create a dict that matches (Origin, destination) tuples to distances
distance_dict = data_distance.set_index(["ORIGIN","DEST"])["DISTANCE"].to_dict()
# Replace any missing value with the information contained in the dict 
data["DISTANCE"] = data.apply(lambda row: distance_dict[(row["ORIGIN"], row["DEST"])]
                        if np.isnan(row["DISTANCE"]) and (row["ORIGIN"], row["DEST"]) in distance_dict.keys()
                        else row["DISTANCE"], axis=1)

In [None]:
#distance_dict

Result:

In [None]:
data.isna().sum()

We were able to fill in most of the missing values in this feature.
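The row-wise `apply` above can be slow on large frames. A possible vectorized alternative, sketched on toy data, uses `groupby(...).transform("first")`, which returns the first non-missing value per route. It assumes every (ORIGIN, DEST) pair has one consistent distance, which this notebook does not verify.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "ORIGIN":   ["JFK", "JFK", "SFO"],
    "DEST":     ["LAX", "LAX", "SEA"],
    "DISTANCE": [2475.0, np.nan, np.nan],
})

# First non-missing DISTANCE for each (ORIGIN, DEST) pair, aligned to the original rows
route_distance = toy.groupby(["ORIGIN", "DEST"])["DISTANCE"].transform("first")

# Only fill rows where DISTANCE is missing; routes never observed stay NaN
toy["DISTANCE"] = toy["DISTANCE"].fillna(route_distance)
print(toy)
```

The SFO to SEA row stays missing because no other row carries its distance, just as with the dictionary approach.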

Let's check the correlation.

In [None]:
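# Restrict to numeric columns; pandas 2.x no longer drops non-numeric columns by default
corr = data.corr(numeric_only=True)
corr[corr.abs() > 0.7]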

The feature `DEP_DEL15` is highly correlated with our target variable. This feature is likely to be very useful.
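To focus on the target alone, you could rank the features by the absolute value of their correlation with `ARR_DEL15`. The sketch below uses a tiny invented frame, so the numbers say nothing about the real dataset.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "ARR_DEL15": [0, 1, 0, 1, 1, 0],
    "DEP_DEL15": [0, 1, 0, 1, 0, 0],
    "DISTANCE":  [100, 200, 300, 400, 500, 600],
})

# Correlation of every numeric feature with the target, strongest first
corr_with_target = (
    toy.corr(numeric_only=True)["ARR_DEL15"]
    .drop("ARR_DEL15")
    .sort_values(key=abs, ascending=False)
)
print(corr_with_target)
```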

Let's also check if we are dealing with an imbalanced dataset.

In [None]:
# Checking if the dataset is imbalanced 
data["ARR_DEL15"].value_counts(normalize=True).plot(kind="bar")
plt.title('ARR_DEL15 Distribution')
plt.xlabel("Airplane arriving late (No=0, Yes=1)")
plt.ylabel("Proportion of flights");

Note: You can also use [pandas-profiling](https://github.com/pandas-profiling/pandas-profiling)!

In [None]:
#from pandas_profiling import ProfileReport
#report = ProfileReport(data)
#report

### Putting all these steps together

Let's put the steps above into classes so they can be integrated into our pipeline.

In [None]:
class DistanceFixNA(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    def fit(self, X=None, y=None, **fit_params):
        data_distance = X[["ORIGIN","DEST","DISTANCE"]].dropna().copy()
        self.distance_mapping = data_distance.set_index(["ORIGIN","DEST"])["DISTANCE"].to_dict()
        return self
    def transform(self, data):
        X = data.copy()
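        # Fill missing distances using the (ORIGIN, DEST) mapping learned in fit
        X["DISTANCE"] = X.apply(lambda row: self.distance_mapping[(row["ORIGIN"], row["DEST"])]
                                if np.isnan(row["DISTANCE"]) and (row["ORIGIN"], row["DEST"]) in self.distance_mapping
                                else row["DISTANCE"], axis=1)
        # Routes not seen in fit fall back to the median of the data being transformed
        X["DISTANCE"] = X["DISTANCE"].fillna(X["DISTANCE"].median())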
        return X

In [None]:
class DroppingColumns(BaseEstimator, TransformerMixin):
    def __init__(self, cols=[]):
        self.cols = cols
    def fit(self, X=None, y=None, **fit_params):
        return self
    def transform(self, data):
        X = data.copy()
        X = X.drop(self.cols,axis=1)
        return X
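The classes above follow scikit-learn's transformer contract: `fit` learns whatever state is needed and `transform` applies it. As an illustration, here is a hypothetical `MedianFiller` (not part of the original solution) that learns a median on the training data and reuses it on new data.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class MedianFiller(BaseEstimator, TransformerMixin):
    """Hypothetical example: fill missing values with a median learned in fit."""
    def __init__(self, col="DISTANCE"):
        self.col = col

    def fit(self, X, y=None):
        # State learned from the training data only
        self.median_ = X[self.col].median()
        return self

    def transform(self, X):
        X = X.copy()
        # Reuse the training median, even when transforming unseen data
        X[self.col] = X[self.col].fillna(self.median_)
        return X

# Toy data, invented for illustration only
train = pd.DataFrame({"DISTANCE": [100.0, 200.0, 300.0]})
new = pd.DataFrame({"DISTANCE": [np.nan, 1000.0]})

filler = MedianFiller().fit(train)
filled = filler.transform(new)
print(filled)
```

The missing value is filled with 200.0, the training median, rather than anything computed from `new`.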

### Feature Engineering

Create new columns with the **hour** and **day of the week** of each flight

In [None]:
class CreateTimeFeatures(BaseEstimator, TransformerMixin):
    def __init__(self, cols=[]):
        self.cols = cols
    def fit(self, X=None, y=None, **fit_params):
        return self
    def transform(self, data):
        X = data.copy()
        for col in self.cols:
            X["HOUR" + col.replace("DATE","")] = X[col].dt.hour
            X["WEEK_DAY" + col.replace("DATE","")] = X[col].dt.dayofweek
        return X
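To see what the transformer produces, here is the same logic applied by hand to a toy column (the timestamps are made up). Note that `dt.dayofweek` counts from Monday=0 to Sunday=6.

```python
import pandas as pd

# Toy timestamps: 2020-01-06 is a Monday, 2020-01-12 is a Sunday
toy = pd.DataFrame({
    "DATE_DEPARTURE_LCL": pd.to_datetime(["2020-01-06 14:30:00", "2020-01-12 23:10:00"])
})

col = "DATE_DEPARTURE_LCL"
suffix = col.replace("DATE", "")  # "_DEPARTURE_LCL"

toy["HOUR" + suffix] = toy[col].dt.hour
toy["WEEK_DAY" + suffix] = toy[col].dt.dayofweek
print(toy)
```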

Produce a new feature **speed** from the `DISTANCE`, `DATE_ARRIVAL_UTC`, and `DATE_DEPARTURE_UTC` columns

In [None]:
class CalcSpeed(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    def fit(self, X=None, y=None, **fit_params):
        return self
    def transform(self, data):
        X = data.copy()
        # Speed in distance units per second of scheduled flight time
        X["SPEED"] = X["DISTANCE"] / (X["DATE_ARRIVAL_UTC"] - X["DATE_DEPARTURE_UTC"]).dt.total_seconds()
        return X
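A flight whose departure and arrival timestamps coincide would produce an infinite speed. Below is a hedged sketch of one way to guard against that, using toy values and an hourly unit; the column name `SPEED_PER_HOUR` is made up for this example, and the unit of `DISTANCE` is not stated in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration; the second row has zero duration
toy = pd.DataFrame({
    "DISTANCE": [500.0, 300.0],
    "DATE_DEPARTURE_UTC": pd.to_datetime(["2020-01-06 10:00:00", "2020-01-06 12:00:00"]),
    "DATE_ARRIVAL_UTC":   pd.to_datetime(["2020-01-06 12:00:00", "2020-01-06 12:00:00"]),
})

duration_hours = (toy["DATE_ARRIVAL_UTC"] - toy["DATE_DEPARTURE_UTC"]).dt.total_seconds() / 3600

# Division by a zero duration yields inf; replace it with NaN so it can be imputed later
toy["SPEED_PER_HOUR"] = (toy["DISTANCE"] / duration_hours).replace([np.inf, -np.inf], np.nan)
print(toy["SPEED_PER_HOUR"])
```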

### Creating the Model 

Load the dataset again. This time we will apply the same transformations through the pipeline.

In [None]:
data = pd.read_csv('data/train.csv').set_index("ID")
# Converting date columns to datetime
data[date_cols_to_convert] = data[date_cols_to_convert].apply(pd.to_datetime, format="%Y/%m/%d %H:%M:%S")

Preparing the dataset for the split (don't forget to sort by date, so that the split below is chronological)

In [None]:
data = data.sort_values(by="DATE_DEPARTURE_LCL")
X = data.drop(columns=['ARR_DEL15'])
y = data['ARR_DEL15']

Splitting the dataset into train and test sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    shuffle=False) # be careful here. By default the dataset is shuffled
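With `shuffle=False`, `train_test_split` simply cuts the data at a position: the first 80% of rows go to train and the last 20% to test. A tiny sketch, where a range of integers stands in for rows already sorted by departure time:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for rows sorted chronologically
ordered = np.arange(10)

train_part, test_part = train_test_split(ordered, test_size=0.2, shuffle=False)
print(train_part)
print(test_part)
```

Because the data was sorted first, the model is evaluated on flights that happen after the ones it was trained on.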

Creating the pipeline

In [None]:
time_features_cols = ['DATE_DEPARTURE_LCL','DATE_ARRIVAL_LCL']
# This is just a list with the name of the new time features we created in the step "create_time_features"
new_time_variables = [ft+col.replace("DATE","") for col in time_features_cols for ft in ["WEEK_DAY", "HOUR"]]
cols_to_drop = ['OP_CARRIER', 'OP_CARRIER_FL_NUM', 'TAIL_NUM', 'DATE_DEPARTURE_UTC','DATE_ARRIVAL_UTC',
                'DATE_DEPARTURE_LCL','DATE_ARRIVAL_LCL']

pipeline = Pipeline([("distance_fix", DistanceFixNA()),
                     ("create_time_features", CreateTimeFeatures(cols=time_features_cols)),
                     ('onehot_encoding', OneHotEncoder(cols=new_time_variables)),
                     ('departure_encoding', OneHotEncoder(cols=["DEP_DEL15"], handle_missing="indicator")),
                     ('target_encoding',TargetEncoder(cols=['DEST',"ORIGIN"], min_samples_leaf=30)),
                     ('speed', CalcSpeed()),
                     ("drop_columns", DroppingColumns(cols=cols_irrelevant + cols_to_drop)),
                     ('model', RandomForestClassifier(random_state=42))])
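When debugging a pipeline it helps to look at the features that reach the model. A fitted `Pipeline` can be sliced, so `[:-1]` gives every step except the final estimator. The sketch below uses a toy two-step pipeline with `SimpleImputer` rather than the custom steps above, to keep it self-contained.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Toy data, invented for illustration only
X_toy = np.array([[1.0, np.nan], [2.0, 5.0], [np.nan, 7.0], [4.0, 9.0]])
y_toy = np.array([0, 0, 1, 1])

toy_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("model", RandomForestClassifier(n_estimators=10, random_state=42)),
])
toy_pipeline.fit(X_toy, y_toy)

# Sub-pipeline with all steps but the model; transform shows what the model sees
features = toy_pipeline[:-1].transform(X_toy)
print(features)
```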

Train the model and generate the predictions

In [None]:
pipeline.fit(X_train, y_train)

In [None]:
y_scores = pipeline.predict_proba(X_test)[:,1]

In [None]:
fpr, tpr, threshold = roc_curve(y_test, y_scores)
roc_auc = auc(fpr, tpr)

plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
print("Score: "+ str(round(roc_auc,3)))
plt.show()
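The area computed from the curve with `auc(fpr, tpr)` matches what `roc_auc_score` returns directly from the labels and scores. A small check on toy values:

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Toy labels and scores, invented for illustration only
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

fpr_toy, tpr_toy, _ = roc_curve(y_true, scores)
area_from_curve = auc(fpr_toy, tpr_toy)
area_direct = roc_auc_score(y_true, scores)

print(area_from_curve, area_direct)
```

If you only need the number and not the plot, `roc_auc_score(y_test, y_scores)` is the shorter route.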

#### Check final score using the test set

Load the test set

In [None]:
test = pd.read_csv('data/test.csv').set_index("ID")
# Don't forget to convert the date columns to datetime again
test[date_cols_to_convert] = test[date_cols_to_convert].apply(pd.to_datetime, format="%Y/%m/%d %H:%M:%S")

Use the pipeline to get predictions (very simple!)

In [None]:
y_scores_final = pipeline.predict_proba(test)[:,1]

Prepare the submissions file

In [None]:
submission = pd.Series(y_scores_final,index=test.index, name='ARR_DEL15')
submission.to_csv("submission.csv")

This should get you a score of about 0.915!