# Delay Prediction - Applying Our Model to Predict Delays for Future Flights

The upcoming flights are available via the public API and the functions we built during the [exploration phase](https://github.com/yoshi-man/hkg-flights-study/blob/main/explore/extract_transform_load.ipynb) of this study. We can therefore apply the model built in the previous section to predict whether upcoming flights will be delayed. The method is:

1. Get departing flights for dates within ```[t+1, t+7]``` from the public API
2. Get arriving flights for dates within ```[t-7, t+7]``` from the public API
3. For each future departing flight, map the previous arrival flight
4. If the previous flight has an actual arrival time, use it for prediction and mark the prediction as ```Actual```
5. Otherwise, run 4 cases with ```arrival_delay in [0, 45, 90, 180] minutes```, use these for prediction and mark them as ```Tentative```

In [31]:
# Notebook setup
import time
from datetime import datetime, date, timedelta

import pandas as pd
import numpy as np
import xgboost as xgb

from flights.etl import get_flights
from flights.utils import get_previous_flight

from sklearn.preprocessing import OneHotEncoder

import pickle


## Getting the required data from the public API

Departure and arrival flights can be fetched in the same loop, although their date ranges differ.

In [32]:
# setting up required parameters
today = date.today()
today_str = today.strftime("%Y-%m-%d")
date_range = [(today + timedelta(days=t)).strftime('%Y-%m-%d') for t in range(-7, 8)] # list of date_str to use as query parameters

departures = pd.DataFrame()
arrivals = pd.DataFrame()

for date_str in date_range:
    for cargo in ['true', 'false']:
        if date_str > today_str:
            departure = get_flights(date_string=date_str, arrival='false', cargo=cargo)
            departures = pd.concat([departures, departure])
        
        arrival = get_flights(date_string=date_str, arrival='true', cargo=cargo)
        arrivals = pd.concat([arrivals, arrival])

        time.sleep(0.5)



In [19]:
print(f'With today being: {today_str}...')
print(f"Shape of Departure Flights: {departures.shape} || With Flights from {departures['actual_departure'].min()} to {departures['scheduled_departure'].max()}")
print(f"Shape of Arrival Flights: {arrivals.shape} || With Flights from {arrivals['actual_arrival'].min()} to {arrivals['scheduled_arrival'].max()}")

With today being: 2022-06-18...
Shape of Departure Flights: (428, 8) || With Flights from 2022-06-16 00:00:00 to 2022-06-25 23:25:00
Shape of Arrival Flights: (1646, 8) || With Flights from 2022-06-11 00:00:00 to 2022-06-25 23:55:00


## Mapping Previous Arrival Flight

During training, this task was done via SQL in BigQuery. However, as BigQuery only holds past flights, we have to rewrite the logic for finding the previous arrival flight for future departing flights. The rule is simple: pick the flight that:

1. Is of the same ```airline```, ```cargo``` type, and ```lane```
2. Has a scheduled/actual arrival at most 7 days before the scheduled departure
3. Has a scheduled/actual arrival at least 30 minutes before the scheduled departure
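
The actual lookup lives in ```get_previous_flight``` from ```flights.utils```, which is not shown in this notebook. As a rough illustration only, the three rules above could be expressed with pandas as follows. The function name ```find_previous_arrival```, the reading of "lane" as the reverse route (arrival ```origin``` equals departure ```destination```), the choice of the latest qualifying arrival, and the toy rows (CI903 and CI915 are made up) are all assumptions, not the project's implementation.

```python
import pandas as pd

COLUMNS = ['scheduled_arrival', 'actual_arrival', 'arrival_flight_num']

def find_previous_arrival(departure: pd.Series, arrivals: pd.DataFrame) -> pd.Series:
    """Hypothetical stand-in for get_previous_flight (assumed logic)."""
    # Best-known arrival time: the actual one if recorded, else the scheduled one
    arrival_time = arrivals['actual_arrival'].fillna(arrivals['scheduled_arrival'])
    gap = departure['scheduled_departure'] - arrival_time

    mask = (
        (arrivals['airline'] == departure['airline'])
        & (arrivals['cargo'] == departure['cargo'])
        & (arrivals['origin'] == departure['destination'])  # assumed meaning of "lane"
        & (gap >= pd.Timedelta(minutes=30))                 # rule 3
        & (gap <= pd.Timedelta(days=7))                     # rule 2
    ).to_numpy()

    if not mask.any():
        return pd.Series([pd.NaT, pd.NaT, None], index=COLUMNS)

    # Among the candidates, keep the arrival closest to the departure
    latest = arrival_time[mask].to_numpy().argmax()
    best = arrivals[mask].iloc[latest]
    return pd.Series(
        [best['scheduled_arrival'], best['actual_arrival'], best['flight_num']],
        index=COLUMNS,
    )

# Toy data for illustration only
arrivals = pd.DataFrame({
    'flight_num': ['CI903', 'CI909', 'CI915'],
    'airline': ['CAL', 'CAL', 'CAL'],
    'cargo': [False, False, False],
    'origin': ['TPE', 'TPE', 'TPE'],
    'scheduled_arrival': pd.to_datetime(
        ['2022-06-22 09:00', '2022-06-23 12:30', '2022-06-23 13:15']),
    'actual_arrival': pd.to_datetime(['2022-06-22 09:10', None, None]),
})
departure = pd.Series({
    'flight_num': 'CI910', 'airline': 'CAL', 'cargo': False, 'destination': 'TPE',
    'scheduled_departure': pd.Timestamp('2022-06-23 13:30'),
})

previous = find_previous_arrival(departure, arrivals)
print(previous['arrival_flight_num'])
```

In this toy case the 13:15 arrival is rejected because it lands only 15 minutes before departure, so the 12:30 arrival is picked over the one from the previous day.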

In [20]:
previous_flights = departures.apply(lambda x: get_previous_flight(x, arrivals), axis=1)
previous_flights.columns = ['scheduled_arrival', 'actual_arrival', 'arrival_flight_num']

In [21]:
df = pd.concat([departures, previous_flights], axis=1)

df.sample(5)

Unnamed: 0,scheduled_departure,actual_departure,flight_num,origin,destination,airline,arrival,cargo,scheduled_arrival,actual_arrival,arrival_flight_num
16,2022-06-23 13:30:00,NaT,CI910,HKG,TPE,CAL,False,False,2022-06-23 12:30:00,NaT,CI909
17,2022-06-25 15:35:00,NaT,JL026,HKG,NRT,JAL,False,False,2022-06-23 22:20:00,NaT,JL7049
23,2022-06-19 14:30:00,2022-06-19,SQ 883,HKG,SIN,SIA,False,False,2022-06-19 09:50:00,2022-06-19,SQ 8504
26,2022-06-19 03:40:00,2022-06-19,8K 526,HKG,BKK,KMI,False,True,2022-06-18 23:55:00,2022-06-18,8K 277
24,2022-06-19 03:20:00,2022-06-19,KE 314,HKG,ICN,KAL,False,True,2022-06-19 01:10:00,2022-06-19,KE 313


## Setting Up Other Columns

The final touches are adding the columns ```route_flight_options``` and ```route_airline_options```, representing the number of unique flight numbers and airlines, respectively, serving the particular lane on the same departure day. The remaining changes needed to match the format of our training data are:

- Adding ```hour``` and ```weekday``` features from departure and arrival dates
- Adding ```scheduled_turnaround``` and ```arrival_delay``` calculations
- Getting the ```IATA``` code for the airline from ```flight_num``` instead of using the ICAO code shown under the ```airline``` column
- Adding ```fleet_size``` based on ```iata```, looked up from a static file
- Converting ```cargo``` from a boolean to a binary integer representation
- Dropping unneeded columns and reordering columns to the exact format
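
To make the two route columns concrete, here is a small sketch on made-up rows. It uses ```groupby(...).transform('nunique')```, which is an alternative to the group-then-merge approach in the next cell and should give the same counts; the toy values and the ```toy``` and ```keys``` names are invented for illustration.

```python
import pandas as pd

# Toy departures: two flights to MNL and one to SGN on the 20th, one to MNL on the 21st
toy = pd.DataFrame({
    'scheduled_departure_date': ['2022-06-20', '2022-06-20', '2022-06-20', '2022-06-21'],
    'destination': ['MNL', 'MNL', 'SGN', 'MNL'],
    'flight_num': ['5J 115', 'CX 907', 'LD 561', '5J 115'],
    'airline': ['CEB', 'CPA', 'AHK', 'CEB'],
})

keys = ['scheduled_departure_date', 'destination']

# Count distinct flight numbers / airlines per (day, destination), broadcast back to every row
toy['route_flight_options'] = toy.groupby(keys)['flight_num'].transform('nunique')
toy['route_airline_options'] = toy.groupby(keys)['airline'].transform('nunique')

print(toy[['destination', 'route_flight_options', 'route_airline_options']])
```

Because ```transform``` returns a result aligned to the original rows, no intermediate frames or merges are needed.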

In [22]:
# adding route_options
df['scheduled_departure_date'] = df['scheduled_departure'].dt.date

route_options = df.copy()
route_flight_options = route_options[['scheduled_departure_date', 'destination', 'flight_num']].groupby(by=['scheduled_departure_date', 'destination']).nunique().reset_index()
route_airline_options = route_options[['scheduled_departure_date', 'destination', 'airline']].groupby(by=['scheduled_departure_date', 'destination']).nunique().reset_index()

route_flight_options.columns = ['scheduled_departure_date', 'destination', 'route_flight_options']
route_airline_options.columns = ['scheduled_departure_date', 'destination', 'route_airline_options']

df = df.merge(route_flight_options, how='left', on=['scheduled_departure_date', 'destination'])
df = df.merge(route_airline_options, how='left', on=['scheduled_departure_date', 'destination'])

df.sample(5)

Unnamed: 0,scheduled_departure,actual_departure,flight_num,origin,destination,airline,arrival,cargo,scheduled_arrival,actual_arrival,arrival_flight_num,scheduled_departure_date,route_flight_options,route_airline_options
210,2022-06-20 18:35:00,2022-06-20,LD 561,HKG,SGN,AHK,False,True,2022-06-19 00:30:00,2022-06-19 00:00:00,LD 572,2022-06-20,1,1
219,2022-06-20 21:30:00,2022-06-20,QR 8407,HKG,DOH,QTR,False,True,2022-06-19 11:05:00,2022-06-19 00:00:00,QR 8412,2022-06-20,2,1
376,2022-06-24 13:05:00,NaT,MF382,HKG,XMN,CXA,False,False,2022-06-24 12:05:00,NaT,MF381,2022-06-24,1,1
46,2022-06-19 06:50:00,2022-06-19,CX 2080,HKG,ANC,CPA,False,True,2022-06-18 23:25:00,2022-06-18 23:03:00,CX 2095,2022-06-19,15,6
251,2022-06-20 21:05:00,2022-06-20,5J 115,HKG,MNL,CEB,False,False,2022-06-20 19:45:00,2022-06-20 00:00:00,5J 114,2022-06-20,3,2


In [23]:
# adding date features as columns
df['prev_arr_hour'] = df['scheduled_arrival'].dt.hour
df['prev_arr_weekday'] = df['scheduled_arrival'].dt.weekday + 1
df['dep_hour'] = df['scheduled_departure'].dt.hour
df['dep_weekday'] = df['scheduled_departure'].dt.weekday + 1

# calculating turnaround and arrival_delay
df['scheduled_turnaround'] = (df['scheduled_departure'] - df['scheduled_arrival']).dt.total_seconds()/60
df['arrival_delay'] = (df['actual_arrival'] - df['scheduled_arrival']).dt.total_seconds()/60

# get Iata code from flight_num
df['iata'] = df['flight_num'].str[:2]

# getting fleet_size using iata
airlines = pd.read_csv('./dataset/airlines.csv')[['iata', 'aircrafts']]
airlines.columns = ['iata', 'fleet_size']
df = df.merge(airlines, how='left', on='iata')

# convert boolean to binary representation
df['cargo'] = df['cargo'].astype(int)

df.sample(5)

Unnamed: 0,scheduled_departure,actual_departure,flight_num,origin,destination,airline,arrival,cargo,scheduled_arrival,actual_arrival,...,route_flight_options,route_airline_options,prev_arr_hour,prev_arr_weekday,dep_hour,dep_weekday,scheduled_turnaround,arrival_delay,iata,fleet_size
279,2022-06-21 12:45:00,NaT,TG601,HKG,BKK,THA,False,0,2022-06-21 11:45:00,NaT,...,3,3,11.0,2.0,12,2,60.0,,TG,64.0
416,2022-06-25 11:15:00,NaT,CX163,HKG,MEL,CPA,False,0,NaT,NaT,...,1,1,,,11,6,,,CX,173.0
2,2022-06-17 07:05:00,2022-06-19 07:05:00,CC 4306,HKG,DWC,ABD,False,1,NaT,NaT,...,1,1,,,7,5,,,CC,7.0
91,2022-06-19 15:40:00,2022-06-19 00:00:00,CV 7957,HKG,SIN,CLX,False,1,2022-06-19 13:40:00,2022-06-19 00:00:00,...,10,9,13.0,7.0,15,7,120.0,-820.0,CV,30.0
265,2022-06-20 23:25:00,2022-06-20 00:00:00,LH 797,HKG,FRA,DLH,False,0,2022-06-17 07:10:00,2022-06-17 08:09:00,...,2,2,7.0,5.0,23,1,5295.0,59.0,LH,308.0


In [24]:
# splitting the data into rows that already have what they need and rows still waiting on an actual_arrival
# .copy() makes both frames independent of df, which avoids pandas' SettingWithCopyWarning on the assignments below
tentative_mask = df['scheduled_arrival'].notnull() & df['actual_arrival'].isnull()
actual_df = df[~tentative_mask].copy()
tentative_df = df[tentative_mask].copy()

# add in prediction types
actual_df['prediction_type'] = 'Actual'
tentative_df['prediction_type'] = 'Tentative'

# for tentative df, we will use [0, 45, 90, 180] as possible arrival_delay values
tentative_arrival_delays = pd.DataFrame([0, 45, 90, 180], columns=['tentative_arrival_delay'])
tentative_df = tentative_df.merge(tentative_arrival_delays, how='cross')
tentative_df['arrival_delay'] = tentative_df['tentative_arrival_delay']
tentative_df['actual_arrival'] = tentative_df['scheduled_arrival'] + pd.to_timedelta(tentative_df['arrival_delay'], unit='m')

# re-merge the actual and tentative dataframes to form df
df = pd.concat([actual_df, tentative_df])




In [53]:
# reorder to follow the exact format of training data
X = df[['prev_arr_hour', 'prev_arr_weekday', 'dep_hour', 'dep_weekday',
       'scheduled_turnaround', 'arrival_delay', 'destination', 'iata',
       'fleet_size', 'cargo', 'route_flight_options', 'route_airline_options']].reset_index().drop(columns='index')

# for results
X_info = df[['flight_num', 'scheduled_departure', 'arrival_flight_num', 'scheduled_arrival', 'actual_arrival', 'arrival_delay', 
              'origin', 'destination', 'iata', 'cargo', 'prediction_type']].reset_index().drop(columns='index')

In [54]:
X_info.head()

Unnamed: 0,flight_num,scheduled_departure,arrival_flight_num,scheduled_arrival,actual_arrival,arrival_delay,origin,destination,iata,cargo,prediction_type
0,EAU303,2022-06-09 08:15:00,,NaT,NaT,,HKG,DEL,EA,1,Actual
1,5Y 654,2022-06-16 18:20:00,,NaT,NaT,,HKG,ANC,5Y,1,Actual
2,CC 4306,2022-06-17 07:05:00,,NaT,NaT,,HKG,DWC,CC,1,Actual
3,TH 6820,2022-06-18 16:05:00,TH 6821,2022-06-17 14:45:00,2022-06-17 23:23:00,518.0,HKG,SZB,TH,1,Actual
4,CX 3296,2022-06-18 16:35:00,CX 085,2022-06-18 08:15:00,2022-06-18 10:19:00,124.0,HKG,ANC,CX,1,Actual


## Get Predictions (And get results ready...)

Load the one-hot encoders and the XGBoost model from training, and use them to run batch prediction on the prepared data.

In [55]:
# Loading the assets saved during training to assist with inference
with open("./models/20220618_iata_encoder.pkl", "rb") as f:
    iata_encoder = pickle.load(f)

with open("./models/20220618_dest_encoder.pkl", "rb") as f:
    dest_encoder = pickle.load(f)

with open("./models/20220618_t_0_13112187385559082.pkl", "rb") as f:
    bst = pickle.load(f) 

threshold = 0.13112187385559082

In [56]:
# get encoded columns
iata_OHE = pd.DataFrame(iata_encoder.transform(np.array(X['iata']).reshape(-1, 1)))
iata_OHE.columns = [f'iata_{x}' for x in iata_OHE.columns]

dest_OHE = pd.DataFrame(dest_encoder.transform(np.array(X['destination']).reshape(-1, 1)))
dest_OHE.columns = [f'dest_{x}' for x in dest_OHE.columns]

In [57]:
# X in the right format for inference
X = pd.concat([X, iata_OHE, dest_OHE], axis=1
                ).drop(columns=['iata', 'destination']).fillna(np.nan)

print(f"Data Shape: {X.shape}")
X.head()

Data Shape: (705, 223)


Unnamed: 0,prev_arr_hour,prev_arr_weekday,dep_hour,dep_weekday,scheduled_turnaround,arrival_delay,fleet_size,cargo,route_flight_options,route_airline_options,...,dest_101,dest_102,dest_103,dest_104,dest_105,dest_106,dest_107,dest_108,dest_109,dest_110
0,,,8,4,,,,1,1,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,,,18,4,,,55.0,1,1,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,,,7,5,,,7.0,1,1,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,14.0,5.0,16,6,1520.0,518.0,5.0,1,1,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,8.0,6.0,16,6,500.0,124.0,173.0,1,3,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [58]:
# Run predictions
d_matrix = xgb.DMatrix(X)

preds = bst.predict(d_matrix)
preds_label = [int(pred >= threshold) for pred in preds]

print(f"First Few Delay Probabilities: {preds[:5]}")
print(f"First Few Predictions: {preds_label[:5]}")

First Few Delay Probabilities: [0.06774487 0.30433482 0.23199902 0.8259008  0.09509188]
First Few Predictions: [0, 1, 1, 1, 0]
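
The list comprehension above can also be written as a vectorized NumPy comparison. The sketch below reuses the five probabilities and the threshold printed above; it is an equivalent formulation, not a change to the pipeline.

```python
import numpy as np

threshold = 0.13112187385559082
preds = np.array([0.06774487, 0.30433482, 0.23199902, 0.8259008, 0.09509188])

# Flag a flight as delayed when its probability reaches the tuned threshold
preds_label = (preds >= threshold).astype(int)

print(preds_label.tolist())  # [0, 1, 1, 1, 0]
```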


In [59]:
# Finally, place these predictions in our output table ready for insertion to BigQuery table
X_info['prediction_prob'] = preds
X_info['prediction'] = preds_label
X_info['prediction_date'] = datetime.now().strftime('%Y-%m-%d %H:%M:%S')

X_info.sample(5)

Unnamed: 0,flight_num,scheduled_departure,arrival_flight_num,scheduled_arrival,actual_arrival,arrival_delay,origin,destination,iata,cargo,prediction_type,prediction_prob,prediction,prediction_date
41,K4 806,2022-06-19 05:15:00,,NaT,NaT,,HKG,NGO,K4,1,Actual,0.450817,1,2022-06-18 19:40:10
295,NH812,2022-06-22 09:30:00,NH 8541,2022-06-17 13:35:00,2022-06-17 13:30:00,-5.0,HKG,NRT,NH,0,Actual,0.027293,0,2022-06-18 19:40:10
279,MU508,2022-06-21 15:35:00,,NaT,NaT,,HKG,PVG,MU,0,Actual,0.025813,0,2022-06-18 19:40:10
444,CA112,2022-06-22 14:35:00,CA6501,2022-06-21 21:00:00,2022-06-22 00:00:00,180.0,HKG,PEK,CA,0,Tentative,0.119826,0,2022-06-18 19:40:10
407,CX888,2022-06-22 00:45:00,CX865,2022-06-21 05:35:00,2022-06-21 07:05:00,90.0,HKG,YVR,CX,0,Tentative,0.062019,0,2022-06-18 19:40:10


In [60]:
# A quick summary of our predictions
print(f"Total Flights for Prediction: {X_info.shape[0]} // Total Predicted Late Flights {X_info['prediction'].sum()}")

print(f"Flights Predicted Late per Carrier and Cargo:")
pd.pivot_table(X_info, index=['iata'], columns=['cargo'], values=['prediction'], aggfunc=np.sum, margins=True).fillna(0).sort_values(by=('prediction', 'All'), ascending=False)

Total Flights for Prediction: 705 // Total Predicted Late Flights 204
Flights Predicted Late per Carrier and Cargo:


iata,prediction (cargo=0),prediction (cargo=1),prediction (All)
All,125.0,79.0,204
5J,30.0,0.0,30
TG,15.0,0.0,15
CX,13.0,2.0,15
CI,9.0,4.0,13
...,...,...,...
KE,0.0,0.0,0
LX,0.0,0.0,0
MU,0.0,0.0,0
NH,0.0,0.0,0


## Next Steps - Moving These Results to BigQuery

The idea is to run this batch prediction every night after we've collected the new data from BigQuery. The results can then be fed into a web frontend for monitoring. For now, the aim is one round of predictions per day, as the data is refreshed daily. The approach will also be through Cloud Functions, although memory limits may turn out to be an issue there.

To do so, we will have to package this pipeline into modular functions.


In [50]:
from typing import Union

def get_all_flights() -> tuple[pd.DataFrame, pd.DataFrame]:

    """
    Returns (departures, arrivals) from the public API:
    departures within [t+1, t+7] and arrivals within [t-7, t+7]
    """

    # setting up required parameters
    today = date.today()
    today_str = today.strftime("%Y-%m-%d")
    date_range = [(today + timedelta(days=t)).strftime('%Y-%m-%d') for t in range(-7, 8)] # list of date_str to use as query parameters

    departures = pd.DataFrame()
    arrivals = pd.DataFrame()

    for date_str in date_range:
        for cargo in ['true', 'false']:
            if date_str > today_str:
                departure = get_flights(date_string=date_str, arrival='false', cargo=cargo)
                departures = pd.concat([departures, departure])
            
            arrival = get_flights(date_string=date_str, arrival='true', cargo=cargo)
            arrivals = pd.concat([arrivals, arrival])

            time.sleep(0.5)

    return departures, arrivals

In [51]:
def map_previous_flights(departures: pd.DataFrame, arrivals: pd.DataFrame) -> pd.DataFrame:
    """
    For each departure flight, map all previous arrival flights to a df and return it
    """
    previous_flights = departures.apply(lambda x: get_previous_flight(x, arrivals), axis=1)
    previous_flights.columns = ['scheduled_arrival', 'actual_arrival', 'arrival_flight_num']

    df = pd.concat([departures, previous_flights], axis=1)

    return df

In [52]:
def set_up_columns(df: pd.DataFrame) -> pd.DataFrame:
    """
    Series of cleanup procedures done to fit the inference input format
    """

    # adding route_options
    df['scheduled_departure_date'] = df['scheduled_departure'].dt.date

    route_options = df.copy()
    route_flight_options = route_options[['scheduled_departure_date', 'destination', 'flight_num']].groupby(by=['scheduled_departure_date', 'destination']).nunique().reset_index()
    route_airline_options = route_options[['scheduled_departure_date', 'destination', 'airline']].groupby(by=['scheduled_departure_date', 'destination']).nunique().reset_index()

    route_flight_options.columns = ['scheduled_departure_date', 'destination', 'route_flight_options']
    route_airline_options.columns = ['scheduled_departure_date', 'destination', 'route_airline_options']

    df = df.merge(route_flight_options, how='left', on=['scheduled_departure_date', 'destination'])
    df = df.merge(route_airline_options, how='left', on=['scheduled_departure_date', 'destination'])

    # adding date features as columns
    df['prev_arr_hour'] = df['scheduled_arrival'].dt.hour
    df['prev_arr_weekday'] = df['scheduled_arrival'].dt.weekday + 1
    df['dep_hour'] = df['scheduled_departure'].dt.hour
    df['dep_weekday'] = df['scheduled_departure'].dt.weekday + 1

    # calculating turnaround and arrival_delay
    df['scheduled_turnaround'] = (df['scheduled_departure'] - df['scheduled_arrival']).dt.total_seconds()/60
    df['arrival_delay'] = (df['actual_arrival'] - df['scheduled_arrival']).dt.total_seconds()/60

    # get Iata code from flight_num
    df['iata'] = df['flight_num'].str[:2]

    # getting fleet_size using iata
    airlines = pd.read_csv('./dataset/airlines.csv')[['iata', 'aircrafts']]
    airlines.columns = ['iata', 'fleet_size']
    df = df.merge(airlines, how='left', on='iata')

    # convert boolean to binary representation
    df['cargo'] = df['cargo'].astype(int)

    return df


In [59]:
def add_prediction_types(df: pd.DataFrame) -> pd.DataFrame:
    """
    Split those with actual arrival and those without into different prediction types and return df
    """

    # splitting the data into rows that already have what they need and rows still waiting on an actual_arrival
    # .copy() makes both frames independent of df, which avoids pandas' SettingWithCopyWarning on the assignments below
    tentative_mask = df['scheduled_arrival'].notnull() & df['actual_arrival'].isnull()
    actual_df = df[~tentative_mask].copy()
    tentative_df = df[tentative_mask].copy()

    # add in prediction types
    actual_df['prediction_type'] = 'Actual'
    tentative_df['prediction_type'] = 'Tentative'

    # for tentative df, we will use [0, 45, 90, 180] as possible arrival_delay values
    tentative_arrival_delays = pd.DataFrame([0, 45, 90, 180], columns=['tentative_arrival_delay'])
    tentative_df = tentative_df.merge(tentative_arrival_delays, how='cross')
    tentative_df['arrival_delay'] = tentative_df['tentative_arrival_delay']
    tentative_df['actual_arrival'] = tentative_df['scheduled_arrival'] + pd.to_timedelta(tentative_df['arrival_delay'], unit='m')

    # re-merge the actual and tentative dataframes to form df
    return pd.concat([actual_df, tentative_df])


In [60]:
def split_input_for_prediction(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """
    Split df into the input for predictions and an info df for output
    """
    # reorder to follow the exact format of training data
    X = df[['prev_arr_hour', 'prev_arr_weekday', 'dep_hour', 'dep_weekday',
        'scheduled_turnaround', 'arrival_delay', 'destination', 'iata',
        'fleet_size', 'cargo', 'route_flight_options', 'route_airline_options']].reset_index().drop(columns='index')

    # for results
    X_info = df[['flight_num', 'scheduled_departure', 'arrival_flight_num', 'scheduled_arrival', 'actual_arrival', 'arrival_delay', 
                'origin', 'destination', 'iata', 'cargo', 'prediction_type']].reset_index().drop(columns='index')

    return X, X_info

In [61]:
def get_predictions(X: pd.DataFrame, X_info: pd.DataFrame, model_paths: list) -> pd.DataFrame:
    """
    Takes input X and returns a dataframe with predictions, ready to be inserted into BigQuery
    """
    # Loading the assets saved during training to assist with inference
    with open(model_paths[0], "rb") as f:
        iata_encoder = pickle.load(f)

    with open(model_paths[1], "rb") as f:
        dest_encoder = pickle.load(f)

    with open(model_paths[2], "rb") as f:
        bst = pickle.load(f) 

    threshold = float('0.' + str(model_paths[2].split('_')[-1].split('.')[0]))


    # get encoded columns
    iata_OHE = pd.DataFrame(iata_encoder.transform(np.array(X['iata']).reshape(-1, 1)))
    iata_OHE.columns = [f'iata_{x}' for x in iata_OHE.columns]

    dest_OHE = pd.DataFrame(dest_encoder.transform(np.array(X['destination']).reshape(-1, 1)))
    dest_OHE.columns = [f'dest_{x}' for x in dest_OHE.columns]

    # X in the right format for inference
    X = pd.concat([X, iata_OHE, dest_OHE], axis=1
                ).drop(columns=['iata', 'destination']).fillna(np.nan)


    d_matrix = xgb.DMatrix(X)

    preds = bst.predict(d_matrix)
    preds_label = [int(pred >= threshold) for pred in preds]

    X_info['prediction_prob'] = preds
    X_info['prediction'] = preds_label
    X_info['prediction_date'] = datetime.now().strftime('%Y-%m-%d %H:%M:%S')

    return X_info

In [62]:
def end_to_end_pipeline(model_paths: list) -> pd.DataFrame:

    """
    Combines all functions for an end to end pipeline:
    1. Gets latest flights from API
    2. Clean the data into the required format
    3. Use the model from model_paths to run predictions
    4. Return an output df
    """
    
    print('Getting flights from API...', end='\r')
    departures, arrivals = get_all_flights()
    print('Getting flights from API... Done.')

    print('Mapping previous flights...', end='\r')
    df = map_previous_flights(departures, arrivals)
    print('Mapping previous flights... Done.')

    print('Setting up columns...', end='\r')
    df = set_up_columns(df)
    print('Setting up columns... Done.')

    print('Adding prediction types...', end='\r')
    df = add_prediction_types(df)
    print('Adding prediction types... Done.')

    print('Splitting data for prediction...', end='\r')
    X, X_info = split_input_for_prediction(df)
    print('Splitting data for prediction... Done.')

    print('Running Predictions...', end='\r')
    output = get_predictions(X, X_info, model_paths)
    print('Running Predictions... Done.')

    print('Complete!')

    return output

In [63]:
MODEL_PATHS = ["./models/20220614_iata_encoder.pkl",
                "./models/20220614_dest_encoder.pkl",
                "./models/20220614_t_0_13483448326587677.pkl"
                ]

end_to_end_pipeline(MODEL_PATHS)

Getting flights from API... Done.
Mapping previous flights... Done.
Setting up columns... Done.
Adding prediction types... Done.
Splitting data for prediction... Done.
Running Predictions... Done.
Complete!




Unnamed: 0,flight_num,scheduled_departure,arrival_flight_num,scheduled_arrival,actual_arrival,arrival_delay,origin,destination,iata,cargo,prediction_type,prediction_prob,prediction,prediction_date
0,EAU303,2022-06-09 08:15:00,,NaT,NaT,,HKG,DEL,EA,1,Actual,0.222266,1,2022-06-18 11:45:11
1,5Y 654,2022-06-16 18:20:00,,NaT,NaT,,HKG,ANC,5Y,1,Actual,0.232168,1,2022-06-18 11:45:11
2,CC 4306,2022-06-17 07:05:00,,NaT,NaT,,HKG,DWC,CC,1,Actual,0.301860,1,2022-06-18 11:45:11
3,TH 6820,2022-06-18 16:05:00,TH 6821,2022-06-17 14:45:00,2022-06-17 23:23:00,518.0,HKG,SZB,TH,1,Actual,0.758153,1,2022-06-18 11:45:11
4,CX 3296,2022-06-18 16:35:00,CX 085,2022-06-18 08:15:00,2022-06-18 10:19:00,124.0,HKG,ANC,CX,1,Actual,0.093290,0,2022-06-18 11:45:11
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
696,CX251,2022-06-25 23:00:00,CX252,2022-06-25 07:35:00,2022-06-25 10:35:00,180.0,HKG,LHR,CX,0,Tentative,0.157477,1,2022-06-18 11:45:11
697,LH797,2022-06-25 23:25:00,LH7014,2022-06-22 07:10:00,2022-06-22 07:10:00,0.0,HKG,FRA,LH,0,Tentative,0.046918,0,2022-06-18 11:45:11
698,LH797,2022-06-25 23:25:00,LH7014,2022-06-22 07:10:00,2022-06-22 07:55:00,45.0,HKG,FRA,LH,0,Tentative,0.049921,0,2022-06-18 11:45:11
699,LH797,2022-06-25 23:25:00,LH7014,2022-06-22 07:10:00,2022-06-22 08:40:00,90.0,HKG,FRA,LH,0,Tentative,0.037810,0,2022-06-18 11:45:11


### Completed!

In this notebook, we've gone from fetching the data from the public API again, to transforming it into a useful format, to running it through the pickled XGBoost model that we trained in the previous notebook.

The next step is clearly to throw this into GCP and hope for the best!
