# Mini-Flight-Delay

## Pre-processing

In [None]:
import pandas as pd

### Reading dataframe

In [None]:
df = pd.read_csv('../input/mini-flight-delay-prediction/flight_delays_train.csv')

In [None]:
df.head()

### Function to split values of the first 3 dataframe columns

In [None]:
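def split_values(value):
    # Keep the part after the hyphen and convert it to an integer.
    # The parameter is named `value` so it does not shadow the built-in `str`.
    return int(value.split('-')[1])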

### Mapping the values to integers with the split_values function

In [None]:
df['Month'] = df['Month'].map(split_values)
df['DayofMonth'] = df['DayofMonth'].map(split_values)
df['DayOfWeek'] = df['DayOfWeek'].map(split_values)
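
As a minimal sketch of what this mapping does, assuming the raw values look like `c-8` (a prefix, a hyphen, then the number), the toy frame below applies the same split to one column. The frame `toy`, its values, and the helper name `split_after_hyphen` are made up for illustration; the vectorized `.str` variant is an alternative that this notebook does not use.

```python
import pandas as pd

# Hypothetical sample values, assumed to follow the 'c-<number>' pattern.
toy = pd.DataFrame({'Month': ['c-8', 'c-4', 'c-12']})

def split_after_hyphen(value):
    # Keep the part after the hyphen and convert it to an integer.
    return int(value.split('-')[1])

# Element-wise mapping, as done in the cell above.
mapped = toy['Month'].map(split_after_hyphen)

# Alternative: the same split with pandas string methods.
vectorized = toy['Month'].str.split('-').str[1].astype(int)

print(mapped.tolist())
print(vectorized.tolist())
```

Both variants should give the same integers; the `.str` form avoids calling a Python function per row, which may matter on larger frames.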

In [None]:
df.head()

In [None]:
print(pd.DataFrame(df.dtypes))

### Mapping columns UniqueCarrier, Origin, Dest
- Build a dict that maps each distinct value to an integer index, then replace the values in the dataframe with those indexes.

In [None]:
# UniqueCarrier
uc_labels = df.UniqueCarrier.unique().tolist()
label_dict_uc_train = {}
for index, possible_label in enumerate(uc_labels):
    label_dict_uc_train[possible_label] = index

# Origin
origin_labels = df.Origin.unique().tolist()
label_dict_origin_train = {}
for index, possible_label in enumerate(origin_labels):
    label_dict_origin_train[possible_label] = index

# Dest
dest_labels = df.Dest.unique().tolist()
label_dict_dest_train = {}
for index, possible_label in enumerate(dest_labels):
    label_dict_dest_train[possible_label] = index

In [None]:
df['UniqueCarrier'] = df.UniqueCarrier.replace(label_dict_uc_train)
df['Origin'] = df.Origin.replace(label_dict_origin_train)
df['Dest'] = df.Dest.replace(label_dict_dest_train)
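
The dictionary-based encoding above numbers labels in order of first appearance. A rough sketch of the same idea on made-up carrier codes is shown below, next to `pd.factorize`, which also assigns codes in order of appearance. The series `toy` is hypothetical, and `.map` is used here in place of `.replace` purely for the illustration.

```python
import pandas as pd

# Made-up carrier codes for illustration only.
toy = pd.Series(['AA', 'US', 'AA', 'XE', 'US'])

# Dictionary approach: index each label by its first appearance.
label_dict = {label: index for index, label in enumerate(toy.unique().tolist())}
by_dict = toy.map(label_dict)

# pd.factorize returns the integer codes and the distinct labels in one call.
codes, uniques = pd.factorize(toy)

print(by_dict.tolist())
print(codes.tolist())
print(list(uniques))
```

Keeping `uniques` (or the dict) around is what allows the same encoding to be applied to another dataframe later.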

In [None]:
df.head()

- Our data is quite imbalanced, as we can see below:

In [None]:
df.dep_delayed_15min.value_counts()
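
To read the imbalance as proportions rather than raw counts, `value_counts` accepts `normalize=True`. The sketch below uses a made-up target column, assuming the labels are `'Y'`/`'N'` strings; the name `toy_target` is hypothetical.

```python
import pandas as pd

# Made-up labels; the 'Y'/'N' encoding is an assumption for this sketch.
toy_target = pd.Series(['N', 'N', 'N', 'N', 'Y'])

# Absolute number of rows per class.
counts = toy_target.value_counts()

# Share of rows per class (the values sum to 1).
shares = toy_target.value_counts(normalize=True)

print(counts)
print(shares)
```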

## Exploratory model analysis

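We'll explore different models using the PyCaret library.
- First, we set aside 25% of the rows as unseen data and keep the remaining 75% for modeling. PyCaret's `setup` later splits the modeling data into train and test sets; we ask for that split to be stratified with `data_split_stratify=True`.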

In [None]:
pip install pycaret

In [None]:
# for classification tasks, the split is stratified by default
# https://pycaret.org/setup/

data = df.sample(frac=0.75, random_state=786)
data_unseen = df.drop(data.index)

data.reset_index(drop=True, inplace=True)
data_unseen.reset_index(drop=True, inplace=True)

print('Data for Modeling: ' + str(data.shape))
print('Unseen Data For Predictions: ' + str(data_unseen.shape))
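
Note that `df.sample` draws rows at random, so the hold-out made in the cell above is not itself stratified. If a stratified hold-out were wanted, one possible sketch is to sample within each class using `groupby(...).sample`. The toy frame and the names `modeling` and `unseen` below are made up for illustration and are not part of the notebook's pipeline.

```python
import pandas as pd

# Made-up data: 8 on-time rows and 4 delayed rows.
toy = pd.DataFrame({
    'Distance': range(12),
    'dep_delayed_15min': ['N'] * 8 + ['Y'] * 4,
})

# Sample 75% of each class separately so both parts keep the class ratio.
modeling = toy.groupby('dep_delayed_15min', group_keys=False).sample(
    frac=0.75, random_state=786
)

# Everything that was not sampled becomes the unseen part.
unseen = toy.drop(modeling.index)

print(modeling['dep_delayed_15min'].value_counts().to_dict())
print(unseen['dep_delayed_15min'].value_counts().to_dict())
```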

- Initialize a setup with our data.

In [None]:
from pycaret.classification import *
exp_cls101 = setup(data = data, target = 'dep_delayed_15min', session_id=123,
                   numeric_features = ['Month', 'DayofMonth', 'DayOfWeek',
                                       'DepTime', 'UniqueCarrier', 'Origin',
                                       'Dest', 'Distance'], data_split_stratify=True)

- Check the performance of several candidate models using the function `compare_models`.

In [None]:
best = compare_models()

- The best performance was achieved by CatBoost, but it is also one of the most expensive models here. Since its numbers are only slightly better than those of the Light Gradient Boosting Machine (LGBM), we go with LGBM.

In [None]:
lgbm = create_model('lightgbm')

- We use `tune_model` to tune the hyperparameters; it returns the model configuration with the best performance. Under the hood, the function uses a random grid search approach.

In [None]:
tuned_model = tune_model(lgbm)

In [None]:
tuned_model
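
To illustrate the idea behind a random grid search, here is a minimal pure-Python sketch. It is not PyCaret's implementation: the search space `param_grid`, the scoring function `toy_score` (a stand-in for a cross-validated metric), and the helper `random_grid_search` are all hypothetical.

```python
import random

# Hypothetical search space; names and values are illustrative only.
param_grid = {
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'num_leaves': [15, 31, 63],
    'n_estimators': [50, 100, 200],
}

def toy_score(params):
    # Stand-in for a cross-validated metric; highest at learning_rate=0.1, num_leaves=31.
    return -abs(params['learning_rate'] - 0.1) - abs(params['num_leaves'] - 31) / 100

def random_grid_search(grid, score_fn, n_iter, seed):
    rng = random.Random(seed)
    best_params, best_score = None, float('-inf')
    for _ in range(n_iter):
        # Draw one value per hyperparameter instead of trying every combination.
        candidate = {name: rng.choice(values) for name, values in grid.items()}
        score = score_fn(candidate)
        if score > best_score:
            best_params, best_score = candidate, score
    return best_params, best_score

best_params, best_score = random_grid_search(param_grid, toy_score, n_iter=10, seed=123)
print(best_params, best_score)
```

Only `n_iter` combinations are evaluated, so the cost stays fixed even when the full grid is large, at the price of possibly missing the best combination.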

- A nice feature of the PyCaret library is this performance plot. Here we have the ROC curve (receiver operating characteristic curve), which plots the True Positive Rate against the False Positive Rate at different classification thresholds. Further reading about ROC curves and AUC can be found here:
    - https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc
    - https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-classification-in-python
    - https://stackoverflow.com/questions/44172162/f1-score-vs-roc-auc

In [None]:
plot_model(tuned_model)
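
As a small worked example of how the points on a ROC curve arise, the sketch below computes the False Positive Rate and True Positive Rate at a few thresholds. The labels, scores, and the helper `roc_point` are made up for illustration; they are not taken from the model above.

```python
# Made-up true labels (1 = delayed) and predicted scores.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

def roc_point(y_true, y_score, threshold):
    # Predict the positive class when the score reaches the threshold.
    y_pred = [1 if score >= threshold else 0 for score in y_score]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    tpr = tp / (tp + fn)  # share of actual positives that were caught
    fpr = fp / (fp + tn)  # share of actual negatives that were flagged
    return fpr, tpr

# Lowering the threshold moves the point from (0, 0) towards (1, 1).
points = [roc_point(y_true, y_score, th) for th in (0.9, 0.5, 0.3, 0.0)]
print(points)
```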

- We can also see the confusion matrix, which shows the absolute numbers of our predictions.

In [None]:
plot_model(tuned_model, plot='confusion_matrix')
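
The same kind of table can be built by hand with `pd.crosstab`. This is a sketch on made-up labels, assuming `'Y'`/`'N'` classes; the series `actual` and `predicted` are hypothetical and unrelated to the model's real predictions.

```python
import pandas as pd

# Made-up actual and predicted labels.
actual = pd.Series(['N', 'N', 'N', 'Y', 'Y', 'N'], name='actual')
predicted = pd.Series(['N', 'N', 'Y', 'Y', 'N', 'N'], name='predicted')

# Rows are the actual classes, columns are the predicted classes.
matrix = pd.crosstab(actual, predicted)
print(matrix)
```

The diagonal cells count correct predictions, while the off-diagonal cells count the two kinds of error.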

- The function `interpret_model` shows which features contribute most to the model output. We can see that, for the LGBM model, the most influential feature for deciding whether a flight was delayed is the departure time (`DepTime`).

In [None]:
interpret_model(tuned_model)

- Here we evaluate the model on the hold-out set.

In [None]:
predictions = predict_model(tuned_model)

## Generating the predictions for the actual test set

### Preprocessing dataframe values

In [None]:
df_test = pd.read_csv('../input/mini-flight-delay-prediction/flight_delays_test.csv')
df_test.head()

In [None]:
df_test['Month'] = df_test['Month'].map(lambda x: split_values(x))
df_test['DayofMonth'] = df_test['DayofMonth'].map(lambda x: split_values(x))
df_test['DayOfWeek'] = df_test['DayOfWeek'].map(lambda x: split_values(x))

In [None]:
df_test.head()

In [None]:
# Reuse the training mappings so that every label keeps the index the model
# was trained on; labels that appear only in the test set get new indexes.

# UniqueCarrier
label_dict_uc_test = dict(label_dict_uc_train)
for possible_label in df_test.UniqueCarrier.unique().tolist():
    label_dict_uc_test.setdefault(possible_label, len(label_dict_uc_test))

# Origin
label_dict_origin_test = dict(label_dict_origin_train)
for possible_label in df_test.Origin.unique().tolist():
    label_dict_origin_test.setdefault(possible_label, len(label_dict_origin_test))

# Dest
label_dict_dest_test = dict(label_dict_dest_train)
for possible_label in df_test.Dest.unique().tolist():
    label_dict_dest_test.setdefault(possible_label, len(label_dict_dest_test))

In [None]:
df_test['UniqueCarrier'] = df_test.UniqueCarrier.replace(label_dict_uc_test)
df_test['Origin'] = df_test.Origin.replace(label_dict_origin_test)
df_test['Dest'] = df_test.Dest.replace(label_dict_dest_test)

In [None]:
df_test.head()

In [None]:
df_test

### Using the trained model to predict on the test dataframe

In [None]:
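# Note: this finalizes the untuned `lgbm`; pass `tuned_model` instead to keep the tuned hyperparameters.
lgbm_final = finalize_model(lgbm)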

In [None]:
test_labels = predict_model(lgbm_final, data = df_test)

In [None]:
test_labels

In [None]:
test_labels = test_labels.drop(['Score'], axis=1)

In [None]:
test_labels

In [None]:
test_labels.to_csv('hey.csv', index=False)