# Prediction Model

After `data-cleanup`, the task is to predict whether a flight will be canceled, using these features:
1. Flight Date
2. Airline
3. Origin
4. Destination
5. Departure Time (Scheduled)
6. Arrival Time (Scheduled)

These are fewer features than

In [8]:
import pandas as pd
import numpy as np
import xgboost as xgb

from sklearn import metrics
from sklearn.model_selection import train_test_split

In [9]:
# Let's load the dataset:
flights = pd.read_parquet("../../dataset/flight-data.parquet")

In [10]:
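# Prepare it:
# XGBoost does not accept object or datetime columns directly. Without conversion it raises:
#   "When categorical type is supplied, the experimental DMatrix parameter
#   `enable_categorical` must be set to `True`. Invalid columns:
#   FlightDate: datetime64[ns], Operating_Airline: object, Origin: object, Dest: object"
# So we convert those columns to the pandas `category` dtype.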

categorical = ['FlightDate', 'Operating_Airline', 'Origin', 'Dest']
flights[categorical] = flights[categorical].astype('category')

x = flights[['FlightDate', 'Operating_Airline', 'Origin', 'Dest', 'CRSDepTime', 'CRSArrTime']]
y = flights['Cancelled']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)
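
The conversion step above can be checked on a small frame. This is a minimal sketch: the `toy` DataFrame and its values are made up for illustration and are not taken from the flight dataset. It uses the dictionary form of `astype`, which should be equivalent to the column assignment used in the notebook.

```python
import pandas as pd

# Hypothetical toy rows that mimic the column names used in the notebook.
toy = pd.DataFrame({
    "FlightDate": pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-01"]),
    "Operating_Airline": ["AA", "DL", "AA"],
    "Origin": ["JFK", "ATL", "JFK"],
    "Dest": ["LAX", "ORD", "LAX"],
    "CRSDepTime": [900, 1330, 1815],
})

categorical = ["FlightDate", "Operating_Airline", "Origin", "Dest"]

# Convert only the listed columns; numeric columns keep their dtype.
toy = toy.astype({col: "category" for col in categorical})

# Inspect the resulting dtypes and the integer codes behind one category column.
dtypes = toy.dtypes.astype(str).to_dict()
airline_codes = toy["Operating_Airline"].cat.codes.tolist()
print(dtypes)
print(airline_codes)
```

Each distinct value becomes one category, so treating `FlightDate` this way gives every calendar date its own level.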

In [11]:
# With this we can define and fit the model:

model = xgb.XGBClassifier(tree_method="hist", enable_categorical=True)
model.fit(x_train, y_train)

model.save_model("model.json")

In [12]:
# Build a Graphviz graph of the tree at index 1 (the second boosted tree):
graph = xgb.to_graphviz(model, num_trees=1)

https://xgboost.readthedocs.io/en/stable/tutorials/categorical.html
https://graphviz.readthedocs.io/en/stable/manual.html
https://stackoverflow.com/questions/33433274/anaconda-graphviz-cant-import-after-installation
https://stackoverflow.com/questions/30991532/converting-multiple-columns-to-categories-in-pandas-apply

In [13]:
# Render the graph to a PDF inside the `model-render` directory:
graph.render(directory='model-render')

'model-render/Source.gv.pdf'