# Advanced Machine Learning Application - Assignment 3
## Data Product with Machine Learning
### Group 10
    - Archit Pradip Murgudkar : 14190286
    - Mahjabeen Mohiuddin : 24610507
    - Rohan Rocky Britto : 24610990
    - Smit Khatri : 24712248
    
This model was built by Rohan Rocky Britto

## Data Preprocessing

In [3]:
import pandas as pd
import numpy as np

Read the merged dataset

In [None]:
df_cleaned = pd.read_csv('../data/raw/merged_dataset.csv')

The user will enter only five fields in the Streamlit application: departure airport, destination airport, departure date, departure time and cabin code. Based on these inputs, I will derive a few more fields in the backend for my business case, so I have kept only those features in df_cleaned for analysis and preprocessing. I will not use flightDate because an exact date can lead to overfitting. Since flightWeek already holds the week of the year, I will not use the flightMonth field either.

In [None]:
df_cleaned = df_cleaned[['searchGap', 'flightYear', 'flightWeek', 'startingAirport', 'destinationAirport', 'totalFare', 'segmentsDepartureTimeRaw', 'segmentsAirlineName', 'segmentsCabinCode']]

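The day of the week can affect flight prices; for example, flights are generally more expensive on Fridays and weekends than on other weekdays. The day of the week can be extracted as a number (0 for Monday, 1 for Tuesday, etc.). Note that the cell below is commented out and flightDate is not among the columns retained above, so this feature is not used in the models that follow.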

In [None]:
#df_cleaned['flightDayOfWeek'] = pd.to_datetime(df_cleaned['flightDate']).dt.dayofweek

We will not train on the exact departure time because of the risk of overfitting; however, the departure hour can strongly influence the price. For example, a night flight might be cheaper than a day flight.

In [None]:
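# Take the first segment's departure time and extract its hour (raw string avoids invalid-escape warnings)
df_cleaned['departureHour'] = pd.to_datetime(df_cleaned['segmentsDepartureTimeRaw'].str.split(r'\|\|').str[0], utc=True).dt.hour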
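
As a quick check of what this cell computes, the sketch below applies the same split-and-parse steps to a single made-up timestamp string (the value is hypothetical, written in ISO 8601 form with a UTC offset). Note that with `utc=True` the timestamp is converted to UTC before the hour is taken, so the result is the UTC hour rather than the local departure hour.

```python
import pandas as pd

# Hypothetical two-segment itinerary; timestamps are separated by '||'
raw = pd.Series(['2022-04-17T06:05:00.000-04:00||2022-04-17T09:30:00.000-04:00'])

# Keep only the first segment's departure time
first_departure = raw.str.split(r'\|\|').str[0]

# utc=True converts 06:05 at UTC-04:00 to 10:05 UTC before extracting the hour
departure_hour = pd.to_datetime(first_departure, utc=True).dt.hour
print(departure_hour.iloc[0])
```

Whether the UTC hour or the local hour is the better feature depends on how the application collects the departure time from the user.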

In [None]:
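def pivot_airline(df, col_name):
    # Split the '||'-separated airline names into one list per row
    df_split = pd.DataFrame({})
    df_split[col_name] = df[col_name].str.split(r'\|\|')
    # One row per flight segment, keeping the original row index
    df_exploded = df_split.explode(col_name)
    unique_airlines = df_exploded[col_name].unique()
    # Count the segments per airline for every original row
    pivot_table = df_exploded.pivot_table(index=df_exploded.index, columns=col_name, aggfunc='size', fill_value=0)
    # Append the per-airline count columns to the original frame
    df = pd.concat([df, pivot_table], axis=1)
    return df, unique_airlines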

In [None]:
df_cleaned, unique_airlines = pivot_airline(df_cleaned, 'segmentsAirlineName')
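
To illustrate the shape of the per-airline count columns this step is meant to produce, here is a minimal sketch on toy data. It deliberately uses `collections.Counter` instead of the `pivot_table` call in `pivot_airline`, so it is an equivalent illustration rather than the notebook's own method; the itineraries are hypothetical.

```python
from collections import Counter
import pandas as pd

# Toy data with hypothetical itineraries; '||' separates the segments
toy = pd.DataFrame({'segmentsAirlineName': ['Delta||Delta', 'United', 'Delta||United']})

# Count how many segments each airline operates in every row
segment_counts = [Counter(names.split('||')) for names in toy['segmentsAirlineName']]

# One column per airline; airlines absent from a row get a count of 0
airline_counts = pd.DataFrame(segment_counts, index=toy.index).fillna(0).astype(int)
print(airline_counts)
```

Each row ends up with the number of segments flown by each airline, so a two-segment Delta itinerary is encoded as 2 in the Delta column and 0 elsewhere.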

Cabin codes have a hierarchy or order. Let us map them accordingly.

In [None]:
cabin_weights = {'coach': 1, 'premium coach': 2, 'business': 3, 'first': 4}

I noticed that some layover itineraries combine different cabin classes across segments, so I created a function that maps them to a single number by averaging the segment weights. For example, [coach, coach] becomes (1+1)/2 = 1, the same as coach, whereas [first, coach] becomes (4+1)/2 = 2.5, which signals to the model that a price difference might be due to a higher cabin class.

In [None]:
def calculate_cabin_weight(cabin_list):
    weighted_sum = sum(cabin_weights[cabin] for cabin in cabin_list)/len(cabin_list)
    return weighted_sum

In [None]:
df_cleaned['cabinWeight'] = df_cleaned['segmentsCabinCode'].str.split(r'\|\|').apply(calculate_cabin_weight)
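
The averaging can be checked on two hypothetical itineraries. The sketch below repeats the mapping and the function so that it runs on its own.

```python
# Same ordinal mapping as in the notebook
cabin_weights = {'coach': 1, 'premium coach': 2, 'business': 3, 'first': 4}

def calculate_cabin_weight(cabin_list):
    # Average the ordinal weights over all segments of the itinerary
    return sum(cabin_weights[cabin] for cabin in cabin_list) / len(cabin_list)

# Hypothetical cabin strings in the '||'-separated raw format
same_cabin = calculate_cabin_weight('coach||coach'.split('||'))
mixed_cabin = calculate_cabin_weight('first||coach'.split('||'))
print(same_cabin, mixed_cabin)  # 1.0 2.5
```

A cabin code that is missing from `cabin_weights` would raise a `KeyError`, so any new code in the data needs to be added to the mapping first.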

We will store the flight timetable in a CSV file so that we can display the complete list of flights that travel between the given airports at the given hour of the day.

In [None]:
timetable_cols = ['startingAirport', 'destinationAirport', 'departureHour', 'segmentsAirlineName']

In [None]:
df_airline_timetable = df_cleaned[timetable_cols].drop_duplicates().sort_values(by=timetable_cols)
df_airline_timetable.to_csv('../data/processed/airline_timetable.csv', index=False)
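
A minimal sketch of how the stored timetable might be queried later, assuming the column layout above. The helper name `find_flights` and the sample rows are hypothetical, and an in-memory frame stands in for the CSV file so the example runs on its own.

```python
import pandas as pd

# Hypothetical rows in the same layout as airline_timetable.csv
timetable = pd.DataFrame({
    'startingAirport': ['ATL', 'ATL', 'BOS'],
    'destinationAirport': ['BOS', 'BOS', 'ATL'],
    'departureHour': [10, 10, 15],
    'segmentsAirlineName': ['Delta', 'JetBlue Airways||Delta', 'United'],
})

def find_flights(timetable, origin, destination, hour):
    # Hypothetical helper: airline combinations seen for this route and hour
    mask = (
        (timetable['startingAirport'] == origin)
        & (timetable['destinationAirport'] == destination)
        & (timetable['departureHour'] == hour)
    )
    return timetable.loc[mask, 'segmentsAirlineName'].tolist()

print(find_flights(timetable, 'ATL', 'BOS', 10))
```

A route and hour with no recorded flights simply returns an empty list, which the application could use to tell the user that no matching flights were found.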

Let us check the columns and segregate them into numerical, categorical and target lists

In [None]:
df_cleaned.columns

In [1]:
target = 'totalFare'
num_cols = ['flightYear', 'flightWeek', 'searchGap', 'departureHour']
cat_cols = ['startingAirport', 'destinationAirport']
enc_cols = ['cabinWeight', 'Delta', 'JetBlue Airways', 'American Airlines', 'United', 'Spirit Airlines', 'Frontier Airlines', 'Cape Air', 'Alaska Airlines', 'Boutique Air', 'Key Lime Air', 'Southern Airways Express', 'Sun Country Airlines', 'Hawaiian Airlines', 'Contour Airlines']
X_cols = num_cols+enc_cols+cat_cols
all_cols = X_cols+[target]

Let us drop all the other columns

In [None]:
df_cleaned = df_cleaned[all_cols]

With the limited set of columns we have selected, some rows may now be duplicates. Let us check for them and handle them accordingly.

In [None]:
df_cleaned[X_cols].duplicated().unique()

Let us calculate the mean value of fares for duplicated cases.

In [None]:
df_cleaned = df_cleaned.groupby(by=X_cols, as_index=False).mean()
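
The effect of this aggregation can be seen on a toy frame with made-up values, where two rows share the same features but have different fares.

```python
import pandas as pd

# Toy data: the first two rows share identical feature values (hypothetical numbers)
toy = pd.DataFrame({
    'flightWeek': [15, 15, 16],
    'startingAirport': ['ATL', 'ATL', 'BOS'],
    'totalFare': [100.0, 140.0, 200.0],
})
feature_cols = ['flightWeek', 'startingAirport']

# Collapse duplicated feature rows into one row carrying the mean fare
deduplicated = toy.groupby(by=feature_cols, as_index=False).mean()
print(deduplicated)
```

By default `groupby` drops rows whose key columns contain missing values, so it is worth checking the row count if any feature can be NaN.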

## Train test split

As we are dealing with time-series data, we will use the older data for training the model and the comparatively newer data for testing it. This lets us check whether the model is able to pick up newer trends.

In [None]:
df_cleaned['flightWeek'].unique()

The dataset is divided by flight week: weeks 15-23 (approximately 71%) form the training set, weeks 24-25 (approximately 18%) the validation set and weeks 26-28 (approximately 11%) the test set.

In [None]:
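# .copy() makes each split an independent frame, so the later pop() does not trigger SettingWithCopyWarning
X_train = df_cleaned[df_cleaned['flightWeek'] <= 23].copy()
X_val = df_cleaned[df_cleaned['flightWeek'].isin([24,25])].copy()
X_test = df_cleaned[df_cleaned['flightWeek'] >= 26].copy()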

In [None]:
print('Training dataset percentage:', len(X_train)*100/len(df_cleaned))
print('Validation dataset percentage:', len(X_val)*100/len(df_cleaned))
print('Testing dataset percentage:', len(X_test)*100/len(df_cleaned))
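
The chronological split can also be written as a small reusable function. This is a sketch under the assumption that the week boundaries are inclusive upper limits; the helper name `split_by_week` is hypothetical and the toy frame has one row per week.

```python
import pandas as pd

def split_by_week(df, train_end, val_end, week_col='flightWeek'):
    # Hypothetical helper: chronological split on inclusive week boundaries
    train = df[df[week_col] <= train_end].copy()
    val = df[(df[week_col] > train_end) & (df[week_col] <= val_end)].copy()
    test = df[df[week_col] > val_end].copy()
    return train, val, test

# Toy frame with one row per week from 15 to 28
toy = pd.DataFrame({'flightWeek': range(15, 29), 'totalFare': 100.0})
train, val, test = split_by_week(toy, train_end=23, val_end=25)
print(len(train), len(val), len(test))
```

Because the three conditions are mutually exclusive, no week can appear in more than one split, which is the property that matters for avoiding leakage from newer data into training.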

Let us separate the target variable from the datasets

In [None]:
y_train = X_train.pop(target)
y_val = X_val.pop(target)
y_test = X_test.pop(target)

Save the datasets to the processed folder so that we can read them directly in the future

In [None]:
X_train.to_csv('../data/processed/airline_X_train.csv', index=False)
X_val.to_csv('../data/processed/airline_X_val.csv', index=False)
X_test.to_csv('../data/processed/airline_X_test.csv', index=False)

In [None]:
y_train.to_csv('../data/processed/airline_y_train.csv', index=False)
y_val.to_csv('../data/processed/airline_y_val.csv', index=False)
y_test.to_csv('../data/processed/airline_y_test.csv', index=False)

## Read the saved training, validation and testing datasets

In [4]:
X_train = pd.read_csv('../data/processed/airline_X_train.csv')
X_val = pd.read_csv('../data/processed/airline_X_val.csv')
X_test = pd.read_csv('../data/processed/airline_X_test.csv')

In [5]:
y_train = pd.read_csv('../data/processed/airline_y_train.csv')[target]
y_val = pd.read_csv('../data/processed/airline_y_val.csv')[target]
y_test = pd.read_csv('../data/processed/airline_y_test.csv')[target]

## Baseline Model

Let us build a baseline model that always predicts the mean training fare and evaluate it using MAE and RMSE scores

In [21]:
mean_value = y_train.mean()
base_preds = np.full((len(y_train), 1), mean_value)

In [6]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

In [22]:
print('The Mean Absolute Error for the baseline model is ', mean_absolute_error(y_train, base_preds))
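# np.sqrt of the MSE gives the RMSE; the squared=False argument is deprecated in recent scikit-learn releases
print('The Root Mean Squared Error for baseline model is ', np.sqrt(mean_squared_error(y_train, base_preds)))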

The Mean Absolute Error for the baseline model is  158.91944319263337
The Root Mean Squared Error for baseline model is  231.7206561042368
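
The baseline scores above are computed on the training set. As a sketch of how the same training-mean baseline could be scored against a held-out set, the example below uses made-up fares and plain NumPy; in the notebook the arrays would be `y_train` and `y_val`.

```python
import numpy as np

# Hypothetical fares; in the notebook these would be y_train and y_val
y_train_toy = np.array([100.0, 200.0, 300.0])
y_val_toy = np.array([150.0, 350.0])

# The baseline always predicts the mean of the training target
baseline_value = y_train_toy.mean()
val_preds = np.full(len(y_val_toy), baseline_value)

# MAE and RMSE written out explicitly
mae = np.mean(np.abs(y_val_toy - val_preds))
rmse = np.sqrt(np.mean((y_val_toy - val_preds) ** 2))
print(mae, rmse)
```

Scoring the baseline on the validation set would make it directly comparable with the validation scores reported for the models below.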


## Building pipelines for scaling, encoding and model execution

Importing processing and pipeline related packages

In [7]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, TargetEncoder
from joblib import dump, load

Creating a pipeline for standard scaling of numerical features and target encoding of categorical features as they have many unique values

In [8]:
num_transformer = Pipeline(
    steps=[
        ('scaler', StandardScaler())
    ]
)

In [9]:
cat_transformer = Pipeline(
    steps=[
        ('target_encoder', TargetEncoder(random_state=8))
    ]
)
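
To make the idea of target encoding concrete, the sketch below replaces each airport with the mean fare observed for it in some hypothetical training rows. This is a simplified, unsmoothed version written with pandas only; scikit-learn's `TargetEncoder` additionally shrinks each category mean toward the global mean and uses internal cross-fitting in `fit_transform`.

```python
import pandas as pd

# Hypothetical training rows
train = pd.DataFrame({
    'startingAirport': ['ATL', 'ATL', 'BOS', 'BOS', 'BOS'],
    'totalFare': [100.0, 200.0, 300.0, 400.0, 500.0],
})

# Mean fare per airport, learned from the training data only
airport_means = train.groupby('startingAirport')['totalFare'].mean()
global_mean = train['totalFare'].mean()

# Encode airports; categories never seen in training fall back to the global mean
new_rows = pd.Series(['ATL', 'BOS', 'JFK'])
encoded = new_rows.map(airport_means).fillna(global_mean)
print(encoded.tolist())
```

Each airport becomes a single numeric column regardless of how many distinct airports exist, which is why this encoding suits features with many unique values better than one-hot encoding.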

Creating a column transformer to build a column-wise transformation list

In [10]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num_transformer', num_transformer, num_cols),
        ('cat_transformer', cat_transformer, cat_cols),
        ('enc_transformer', 'passthrough', enc_cols)
    ]
)

Creating a preprocessor pipeline and storing it for future use

In [11]:
preprocessor_pipe = Pipeline(
    steps=[
        ('preprocessor', preprocessor)
    ]
)

In [None]:
dump(preprocessor_pipe, '../src/airline_preproc_pipe.joblib')

## Model Building and Evaluation

Let us start by building a Linear Regression model and check if it is able to perform well

In [23]:
from sklearn.linear_model import LinearRegression

In [24]:
lin_pipe = Pipeline(
    steps=[
        ('preprocessor', preprocessor_pipe),
        ('lin', LinearRegression())
    ]
)

In [25]:
lin_pipe.fit(X_train, y_train)

In [26]:
lin_train_preds = lin_pipe.predict(X_train)

In [27]:
lin_val_preds = lin_pipe.predict(X_val)

In [None]:
dump(lin_pipe, '../models/airline_lin_pipe.joblib')

Defining a function to evaluate the performance of the models on the training and validation datasets. We will store this function in a Python file for future use.

In [18]:
def evaluate_model(y_train, train_preds, y_val, val_preds):
    
    print('The Mean Absolute Error for training set is ', mean_absolute_error(y_train, train_preds))
    print('The Mean Absolute Error for validation set is ', mean_absolute_error(y_val, val_preds))
    
    # np.sqrt of the MSE gives the RMSE; the squared=False argument is deprecated in recent scikit-learn releases
    print('The Root Mean Squared Error for training set is ', np.sqrt(mean_squared_error(y_train, train_preds)))
    print('The Root Mean Squared Error for validation set is ', np.sqrt(mean_squared_error(y_val, val_preds)))

In [28]:
evaluate_model(y_train, lin_train_preds, y_val, lin_val_preds)

The Mean Absolute Error for training set is  109.12674880583604
The Mean Absolute Error for validation set is  113.63530710742256
The Root Mean Squared Error for training set is  161.9104771213932
The Root Mean Squared Error for validation set is  157.83134182242787


Linear Regression clearly improves on the baseline: the training MAE drops from about 158.9 to about 109.1 and the training RMSE from about 231.7 to about 161.9. Let us check whether other models can provide better results.

## Random Forest

In [11]:
from sklearn.ensemble import RandomForestRegressor

In [12]:
rf_pipe = Pipeline(
    steps=[
        ('preprocessor', preprocessor_pipe),
        ('rf', RandomForestRegressor(random_state=8, n_estimators=50, max_depth=8, min_samples_leaf=3))
    ]
)

In [13]:
rf_pipe.fit(X_train, y_train)

In [14]:
rf_train_preds = rf_pipe.predict(X_train)

In [15]:
rf_val_preds = rf_pipe.predict(X_val)

In [20]:
evaluate_model(y_train, rf_train_preds, y_val, rf_val_preds)

The Mean Absolute Error for training set is  95.90274390914483
The Mean Absolute Error for validation set is  101.82217483899977
The Root Mean Squared Error for training set is  137.06767457696677
The Root Mean Squared Error for validation set is  142.03601047932096


In [29]:
dump(rf_pipe, '../models/airline_rf_pipe.joblib')

['../models/airline_rf_pipe.joblib']

## Gradient Boosting

In [12]:
from sklearn.ensemble import GradientBoostingRegressor

In [13]:
gb_pipe = Pipeline(
    steps=[
        ('preprocessor', preprocessor_pipe),
        ('gb', GradientBoostingRegressor(random_state=8, n_estimators=50, max_depth=8, min_samples_leaf=3))
    ]
)

In [14]:
gb_pipe.fit(X_train, y_train)

In [15]:
gb_train_preds = gb_pipe.predict(X_train)

In [16]:
gb_val_preds = gb_pipe.predict(X_val)

In [19]:
evaluate_model(y_train, gb_train_preds, y_val, gb_val_preds)

The Mean Absolute Error for training set is  76.70695258456486
The Mean Absolute Error for validation set is  81.99926660043921
The Root Mean Squared Error for training set is  112.92319994483144
The Root Mean Squared Error for validation set is  117.89257871868854


In [20]:
dump(gb_pipe, '../models/airline_gb_pipe.joblib')

['../models/airline_gb_pipe.joblib']
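
To compare the three scikit-learn models side by side, the validation scores printed above can be collected and ranked. The numbers below are copied from those outputs and rounded to two decimals; the dictionary itself is only an illustration.

```python
# Validation-set scores copied from the outputs above (rounded to two decimals)
validation_scores = {
    'Linear Regression': {'MAE': 113.64, 'RMSE': 157.83},
    'Random Forest': {'MAE': 101.82, 'RMSE': 142.04},
    'Gradient Boosting': {'MAE': 82.00, 'RMSE': 117.89},
}

# Rank the models from lowest to highest validation MAE
ranking = sorted(validation_scores, key=lambda name: validation_scores[name]['MAE'])
for name in ranking:
    scores = validation_scores[name]
    print(f"{name}: MAE={scores['MAE']:.2f}, RMSE={scores['RMSE']:.2f}")
```

On both metrics the gradient boosting model has the lowest validation error of the three, followed by the random forest and then linear regression.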

## TensorFlow

In [35]:
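# Fit the preprocessor on the training data only, then apply it to the validation data.
# Separate names keep the raw X_train DataFrame available for the sklearn pipelines above.
X_train_nn = preprocessor_pipe.fit_transform(X_train, y_train)
X_val_nn = preprocessor_pipe.transform(X_val)

In [36]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.callbacks import ModelCheckpoint

# Create a simple neural network model for regression
model = Sequential()
model.add(tf.keras.Input(shape=(X_train_nn.shape[1],)))  # one input per preprocessed feature
model.add(Dense(64, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(1))  # linear output, since totalFare is a continuous target

# Define a model checkpoint callback to save the best model during training
checkpoint_callback = ModelCheckpoint("model_checkpoint.h5", save_best_only=True)

# Compile the model with a regression loss and an RMSE metric
model.compile(optimizer='adam', loss='mean_squared_error', metrics=[tf.keras.metrics.RootMeanSquaredError()])

# Train the model with checkpointing
model.fit(X_train_nn, y_train, epochs=50, validation_data=(X_val_nn, y_val), callbacks=[checkpoint_callback])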

ModuleNotFoundError: No module named 'tensorflow'