# Flight Price Forecast – Kiwi

## Table of Contents
- [Introduction](#introduction)
- [Imports](#imports)
- [Data Preprocessing](#data-preprocessing)
- [Linear Regression](#linear-regression)
- [Decision Tree](#decision-tree)
- [Gaussian Process Regression](#gaussian-process-regression)
- [Random Forest](#random-forest)
- [GridSearchCV](#gridsearchcv)
- [KNN Regressor](#knn-regressor)
- [MLP Regressor](#mlp-regressor)
- [XG Boost Regressor](#xg-boost-regressor)
- [Hist Gradient Boosting](#hist-gradient-boosting)
- [Best Performance with Best Parameters](#best-performance-with-best-parameters)
- [Feature Importance](#feature-importance)
- [Conclusions](#conclusions)

---

## Introduction

This notebook builds a machine learning pipeline for predicting flight prices. Preprocessing covers cleaning and converting the price data, extracting date features, converting durations, and numerically encoding categorical features such as the number of stops.  
The pipeline then evaluates several regression methods (Linear Regression, Decision Tree, Gaussian Process, Random Forest with GridSearchCV tuning, K-Nearest Neighbors, Multi-layer Perceptron, XGBoost, and HistGradientBoostingRegressor) and compares them using R², RMSE, MSE, and MAE.

**Top models:** the *Random Forest*, *XGBoost*, and *HistGradientBoosting* regressors showed the strongest predictive performance and significantly outperformed the basic linear models.  
Residual plots, permutation importance, and predicted-versus-actual graphs provided insight into model accuracy and feature relevance.
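
The four comparison metrics named above can be written out directly. The following is a minimal sketch in plain Python, not the notebook's own evaluation code: the helper name `regression_metrics` and the toy price values are assumptions made for illustration. The notebook itself imports `r2_score`, `mean_squared_error`, and `mean_absolute_error` from `sklearn.metrics`, which compute the same quantities.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R², RMSE, MSE and MAE for two equal-length sequences.

    Hypothetical helper for illustration only.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    # Mean squared error and mean absolute error
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n

    # R² = 1 - (residual sum of squares / total sum of squares)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")

    return {"r2": r2, "rmse": math.sqrt(mse), "mse": mse, "mae": mae}


# Toy prices (made-up numbers, not taken from the dataset)
actual = [100.0, 150.0, 200.0, 250.0]
predicted = [110.0, 140.0, 210.0, 240.0]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

RMSE and MAE are expressed in the same units as the price, which makes them easier to interpret than MSE, while R² is unitless and allows comparison across models regardless of the price scale.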


## Imports

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import xgboost as xgb
import shap

In [2]:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.gaussian_process import GaussianProcessRegressor

from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor

from sklearn.model_selection import train_test_split

from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

from sklearn.preprocessing import StandardScaler, MinMaxScaler

from sklearn.inspection import permutation_importance

from sklearn.gaussian_process.kernels import RBF, DotProduct, Matern, RationalQuadratic, WhiteKernel
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.inspection import partial_dependence, PartialDependenceDisplay

## Data Preprocessing

In [4]:

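# Load the cleaned Kiwi flight data
df = pd.read_csv('kiwi_cleaned.csv')

# Preview the first rows, then the column dtypes and non-null counts
print(df.head())
df.info()  # info() prints its report itself and returns None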

   departure_hour departure_airport  flight_length  landing_hour  \
0       11.833333               SEN             70     14.000000   
1       11.833333               SEN             70     14.000000   
2       11.833333               SEN             70     14.000000   
3       18.583333               LTN             80     20.916667   
4       13.333333               LTN             80     15.666667   

  landing_airport to_dest_company  return_departure_hour  \
0             CDG         easyJet              11.833333   
1             CDG         easyJet              19.333333   
2             CDG         easyJet              15.166667   
3             CDG         easyJet              11.833333   
4             CDG         easyJet              11.833333   

   return_departure_airport  return_flight_length  return_landing_hour  ...  \
0                       NaN                    70            12.000000  ...   
1                       NaN                    65            19.416667  