# Flight Price Forecast - Kiwi


This notebook presents a full data science pipeline for forecasting flight prices using data from the **Kiwi** platform.  
The process begins with structured data cleaning and preprocessing, including handling missing values, converting date and duration fields, and encoding categorical features (such as number of stops).

Following the preparation phase, we explore the data (EDA) to uncover trends, distributions, and potential outliers.

Then, we evaluate multiple regression models for price prediction, including:

- **Linear Regression**
- **Decision Tree**
- **Gaussian Process**
- **Random Forest**
- **K-Nearest Neighbors**
- **Multi-layer Perceptron**
- **XGBoost**
- **HistGradientBoosting**

Each model is assessed with four metrics: **R²**, **RMSE**, **MSE**, and **MAE**.  
To enhance interpretation, we include **residual plots**, **feature importance (permutation)**, and **actual vs. predicted** visualizations.
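
For reference, the four metrics can be written out directly. The sketch below is a plain-Python illustration on invented numbers, not the notebook's evaluation code (which imports its metrics from `sklearn.metrics`); the helper name `regression_metrics` is hypothetical.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE and R² for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mse = sum(e * e for e in errors) / n      # mean squared error
    mae = sum(abs(e) for e in errors) / n     # mean absolute error

    # R² = 1 - (residual sum of squares / total sum of squares)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot

    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae, "r2": r2}

# Invented prices in USD, for illustration only
metrics = regression_metrics([100, 150, 200], [110, 140, 210])
```

RMSE and MAE are in the same unit as the price, which makes them easier to read than MSE; R² is unitless and compares the model against always predicting the mean price.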

This analysis is part of a dual-platform comparison (Kiwi & Kayak).  
A separate notebook applies the same methodology to the Kayak dataset.

## Stage 1: Collecting the data
- For this stage we use a web scraper that collects flight data from two websites: Kiwi and Kayak.
- The scraper is built on async functions and includes a random user-action generator, session and cookie persistence, a dynamic viewport, and a DHCP-based IP refresh (since we are not using a proxy). Together, these reduce the chance of being flagged by the websites' bot-detection mechanisms.
- The scraper runs every combination of `ttt` (time to travel) from 1 to 30 and `los` (length of stay) from 1 to 5, on 3 different snapshot days, for every route between Rome, London, and Paris.
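
The search grid described in the last bullet can be sketched as follows. This is an illustration, not the scraper's actual code: `build_queries` is a hypothetical helper, routes are assumed to be all ordered city pairs, and the date arithmetic (departure = snapshot + `ttt` days, return = departure + `los` days) is an assumption that matches the sample rows shown later in this notebook.

```python
from datetime import date, timedelta
from itertools import permutations, product

CITIES = ["ROME", "LONDON", "PARIS"]
TTT_RANGE = range(1, 31)  # time to travel: 1..30 days
LOS_RANGE = range(1, 6)   # length of stay: 1..5 days

def build_queries(snapshot_date):
    """Yield one search query per (route, ttt, los) combination."""
    routes = permutations(CITIES, 2)  # ordered pairs, e.g. ROME -> LONDON
    for (origin, dest), ttt, los in product(routes, TTT_RANGE, LOS_RANGE):
        departure = snapshot_date + timedelta(days=ttt)
        yield {
            "snapshot_date": snapshot_date,
            "origin_city": origin,
            "destination_city": dest,
            "ttt": ttt,
            "los": los,
            "departure_date": departure,
            "return_date": departure + timedelta(days=los),
        }

queries = list(build_queries(date(2025, 2, 28)))
```

Under these assumptions each snapshot day produces 6 routes × 30 × 5 = 900 searches.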

In [1]:
%pip install -e scraping
# !playwright install

Obtaining file:///C:/Users/LaurenM/OneDrive/Desktop/flight_price_forecasting_and_clustering/scraping
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Checking if build backend supports build_editable: started
  Checking if build backend supports build_editable: finished with status 'done'
  Getting requirements to build editable: started
  Getting requirements to build editable: finished with status 'done'
  Preparing editable metadata (pyproject.toml): started
  Preparing editable metadata (pyproject.toml): finished with status 'done'
Building wheels for collected packages: scraping
  Building editable for scraping (pyproject.toml): started
  Building editable for scraping (pyproject.toml): finished with status 'done'
  Created wheel for scraping: filename=scraping-0.1-0.editable-py3-none-any.whl size=2637 sha256=036fa0a197d27adfe8896f69d8907b36cba8e06b1bb0d36aeef33e6c4fd8ce85
  Stored in directory: C:\Users\LaurenM\AppData\Local\


[notice] A new release of pip is available: 24.2 -> 25.1.1
[notice] To update, run: python.exe -m pip install --upgrade pip


## Imports

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import re, time, random, math
from datetime import datetime  # the class only; also importing the module under the same name would be shadowed
from tqdm import tqdm

In [3]:
##################### Preprocessing imports 
from sklearn.preprocessing import StandardScaler,MinMaxScaler
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from category_encoders import TargetEncoder, HashingEncoder, CountEncoder
from sklearn.impute import KNNImputer

##################### Metrics
from sklearn.metrics import make_scorer, mean_squared_error, mean_absolute_error
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

##################### Models
from xgboost import XGBClassifier, XGBRegressor
from catboost import CatBoostClassifier, CatBoostRegressor
from lightgbm import LGBMClassifier, LGBMRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.pipeline import make_pipeline

##################### Model selection 
from sklearn.model_selection import cross_val_score, cross_validate
from sklearn.gaussian_process.kernels import RBF, Matern, RationalQuadratic
from sklearn.model_selection import StratifiedKFold, KFold, RepeatedStratifiedKFold
from sklearn.model_selection import train_test_split
from sklearn.ensemble import VotingRegressor
from sklearn.inspection import permutation_importance
from sklearn.base import clone
from sklearn.metrics import r2_score  # mean_squared_error is already imported above
from scipy.optimize import minimize
from IPython.display import clear_output
from concurrent.futures import ThreadPoolExecutor
from scipy.stats import skew, kurtosis

##################### Tuning, explainability, and utilities
# !pip install shap
import optuna
import shap
import matplotlib.cm as cm
from collections import defaultdict
from currency_converter import CurrencyConverter

##################### Basic settings
random_state = 42
n_splits = 5

## Stage 2: Exploring the data

In [4]:
data1 = pd.read_csv('data_kiwi_balanced.csv')

In [5]:
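# Summary table for a first look at the data
def summary(df):
    """Per-column dtype, missing count and share, unique values, and non-null count."""
    out = pd.DataFrame(df.dtypes, columns=['dtypes'])  # avoid shadowing the built-in sum()
    out['missing#'] = df.isna().sum()
    out['missing%'] = df.isna().mean()  # fraction of rows (0-1), not a percentage
    out['uniques'] = df.nunique().values
    out['count'] = df.count().values
    return out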

display(summary(data1).style.background_gradient(cmap='Blues'))
data1.head()

| column | dtypes | missing# | missing% | uniques | count |
|---|---|---|---|---|---|
| departure_hour | object | 0 | 0.0 | 195 | 97092 |
| departure_airport | object | 5223 | 0.053794 | 11 | 91869 |
| flight_length | object | 0 | 0.0 | 129 | 97092 |
| landing_hour | object | 4235 | 0.043618 | 193 | 92857 |
| landing_airport | object | 4502 | 0.046368 | 10 | 92590 |
| to_dest_company | object | 8677 | 0.089369 | 21 | 88415 |
| return_departure_hour | object | 0 | 0.0 | 199 | 97092 |
| return_departure_airport | object | 5195 | 0.053506 | 198 | 91897 |
| return_flight_length | object | 0 | 0.0 | 124 | 97092 |
| return_landing_hour | object | 5195 | 0.053506 | 198 | 91897 |


| | departure_hour | departure_airport | flight_length | landing_hour | landing_airport | to_dest_company | return_departure_hour | return_departure_airport | return_flight_length | return_landing_hour | ... | ttt | los | snapshot_date | origin_city | destination_city | departure_date | return_date | website | layover_time | return_layover_time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 06:35 | FCO | 2h 45m | 08:20 | LGW | Wizz Air Malta | 09:10 | 12:40 | 2h 30m | 12:40 | ... | 2 | 1 | 2025-02-28 | ROME | LONDON | 2025-03-02 | 2025-03-03 | Kiwi | 0m | 0m |
| 1 | 06:35 | FCO | 2h 45m | 08:20 | LGW | Wizz Air Malta | 21:00 | NaN | 2h 30m | NaN | ... | 2 | 1 | 2025-02-28 | ROME | LONDON | 2025-03-02 | 2025-03-03 | Kiwi | 0m | 0m |
| 2 | 06:45 | FCO | 2h 50m | 08:35 | LGW | Vueling | 09:10 | 12:40 | 2h 30m | 12:40 | ... | 2 | 1 | 2025-02-28 | ROME | LONDON | 2025-03-02 | 2025-03-03 | Kiwi | 0m | 0m |
| 3 | 06:00 | CIA | 2h 45m | 07:45 | STN | Ryanair | 09:10 | 12:40 | 2h 30m | 12:40 | ... | 2 | 1 | 2025-02-28 | ROME | LONDON | 2025-03-02 | 2025-03-03 | Kiwi | 0m | 0m |
| 4 | 06:35 | FCO | 2h 45m | 08:20 | LGW | Wizz Air Malta | 18:10 | 21:35 | 2h 25m | 21:35 | ... | 2 | 1 | 2025-02-28 | ROME | LONDON | 2025-03-02 | 2025-03-03 | Kiwi | 0m | 0m |


## Stage 3: Preprocessing

The price is still an `object` column, so we need to convert it to a numeric type. From the scraping stage we know that Kiwi reports prices in NIS (₪) and Kayak in USD ($).

In [6]:
%pip install currencyconverter

Note: you may need to restart the kernel to use updated packages.





In [17]:
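def basic_cleanups(data):
    # Drop exact duplicates and rows with any missing value; copy so the
    # column assignment below does not write into a slice of the original frame
    data = data.drop_duplicates().dropna().copy()

    c = CurrencyConverter()

    def to_usd(price):
        # Kiwi prices are prefixed with '₪ ' (ILS); Kayak prices contain '$' (USD)
        amount = price.replace(',', '')
        if '₪ ' in amount:
            return c.convert(float(amount.replace('₪ ', '')), 'ILS', 'USD')
        if '$' in amount:
            return float(amount.replace('$', ''))
        return None  # unrecognized currency format

    data['price'] = data['price'].apply(to_usd)
    return data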
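
To check the string handling in isolation, the parsing step can be tried on a few hand-written price strings. The sketch below swaps the `CurrencyConverter` lookup for a fixed, made-up exchange rate so the result is reproducible; `ILS_TO_USD`, `parse_price_usd`, and the sample strings are all hypothetical.

```python
ILS_TO_USD = 0.25  # hypothetical fixed rate, for illustration only

def parse_price_usd(raw, ils_to_usd=ILS_TO_USD):
    """Parse strings like '₪ 1,200' or '$321' into a USD float; None otherwise."""
    if not isinstance(raw, str):
        return None
    amount = raw.replace(",", "")  # drop thousands separators
    if "₪" in amount:
        return float(amount.replace("₪", "").strip()) * ils_to_usd
    if "$" in amount:
        return float(amount.replace("$", "").strip())
    return None  # unrecognized currency format

examples = {raw: parse_price_usd(raw) for raw in ["₪ 1,200", "$321", "n/a"]}
```

Rows that come back as `None` are worth counting before modelling, since they would otherwise surface as missing target values.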

Next, we add a function that converts duration strings (e.g. `2h 45m`) to an integer number of minutes.

In [18]:
def time_to_minutes(time_str: str) -> int:
    # Missing durations are treated as 0 minutes
    if pd.isna(time_str):
        return 0

    # Extract hours and minutes using regex
    hours = 0
    minutes = 0
    
    h_match = re.search(r'(\d+)h', time_str)
    m_match = re.search(r'(\d+)m', time_str)

    if h_match:
        hours = int(h_match.group(1))
    if m_match:
        minutes = int(m_match.group(1))

    return hours * 60 + minutes
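
The same conversion can also be done column-wise instead of row by row. The following is one possible vectorized variant, assuming the `2h 45m` / `0m` formats seen in the preview table; `durations_to_minutes` is a hypothetical helper and not part of the notebook's pipeline.

```python
import pandas as pd

def durations_to_minutes(col: pd.Series) -> pd.Series:
    """Convert strings such as '2h 45m' to whole minutes; missing values become 0."""
    # Pull out the hour and minute digits separately; a missing part yields NaN
    hours = pd.to_numeric(col.str.extract(r"(\d+)h", expand=False)).fillna(0)
    minutes = pd.to_numeric(col.str.extract(r"(\d+)m", expand=False)).fillna(0)
    return (hours * 60 + minutes).astype(int)

sample = pd.Series(["2h 45m", "0m", "2h", None])
result = durations_to_minutes(sample)
```

Like `time_to_minutes`, this maps a missing duration to 0, so the two approaches should agree on the same column.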

In [19]:
def hour_to_numeric(hour_str):
    '''
    Convert a clock time such as "06:35", "9:10p" or "08:20+1" to fractional
    hours (e.g. 6.583333). A "+N" suffix adds N days, i.e. N * 24 hours.
    '''
    hour_str = hour_str.strip()

    # A "+N" suffix marks a time that falls N days later
    match = re.search(r"\+(\d+)", hour_str)
    extra_days = int(match.group(1)) if match else 0
    hour_str = re.sub(r"\+\d+", "", hour_str).strip()

    if 'a' in hour_str or 'p' in hour_str:
        # 12-hour format with a single-letter suffix: "9:10p" -> "9:10PM"
        hour_str = hour_str.replace('a', 'AM').replace('p', 'PM')
        time_obj = datetime.strptime(hour_str, "%I:%M%p")
    else:
        # 24-hour format, e.g. "06:35"
        time_obj = datetime.strptime(hour_str, "%H:%M")

    return time_obj.hour + time_obj.minute / 60 + extra_days * 24


In [20]:
def preprocessing(data):
    # perform basic cleanups
    data = basic_cleanups(data)

    # convert time to minutes
    data['flight_length'] = data['flight_length'].apply(time_to_minutes)
    data['return_flight_length'] = data['return_flight_length'].apply(time_to_minutes)
    data['layover_time'] = data['layover_time'].apply(time_to_minutes)
    data['return_layover_time'] = data['return_layover_time'].apply(time_to_minutes)

    # convert hours to numeric
    data['departure_hour'] = data['departure_hour'].apply(hour_to_numeric)
    data['landing_hour'] = data['landing_hour'].apply(hour_to_numeric)
    data['return_departure_hour'] = data['return_departure_hour'].apply(hour_to_numeric)
    data['return_landing_hour'] = data['return_landing_hour'].apply(hour_to_numeric)

    # convert date to datetime
    ## prices can depend on the day of the week, so derive a new feature from the departure date
    data['departure_date'] = pd.to_datetime(data['departure_date'])
    data['day_of_week'] = data['departure_date'].dt.day_name()

    # create new feature based on the origin_city and destination_city
    data['route'] = data['origin_city'] + '_' + data['destination_city']

    return data

data1 = preprocessing(data1)
data1.head()

| | departure_hour | departure_airport | flight_length | landing_hour | landing_airport | to_dest_company | return_departure_hour | return_departure_airport | return_flight_length | return_landing_hour | ... | snapshot_date | origin_city | destination_city | departure_date | return_date | website | layover_time | return_layover_time | day_of_week | route |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6.583333 | FCO | 165 | 8.333333 | LGW | Wizz Air Malta | 9.166667 | 12:40 | 150 | 12.666667 | ... | 2025-02-28 | ROME | LONDON | 2025-03-02 | 2025-03-03 | Kiwi | 0 | 0 | Sunday | ROME_LONDON |
| 2 | 6.75 | FCO | 170 | 8.583333 | LGW | Vueling | 9.166667 | 12:40 | 150 | 12.666667 | ... | 2025-02-28 | ROME | LONDON | 2025-03-02 | 2025-03-03 | Kiwi | 0 | 0 | Sunday | ROME_LONDON |
| 3 | 6.0 | CIA | 165 | 7.75 | STN | Ryanair | 9.166667 | 12:40 | 150 | 12.666667 | ... | 2025-02-28 | ROME | LONDON | 2025-03-02 | 2025-03-03 | Kiwi | 0 | 0 | Sunday | ROME_LONDON |
| 4 | 6.583333 | FCO | 165 | 8.333333 | LGW | Wizz Air Malta | 18.166667 | 21:35 | 145 | 21.583333 | ... | 2025-02-28 | ROME | LONDON | 2025-03-02 | 2025-03-03 | Kiwi | 0 | 0 | Sunday | ROME_LONDON |
| 5 | 6.583333 | FCO | 165 | 8.333333 | LGW | Wizz Air Malta | 17.0 | 20:30 | 150 | 20.5 | ... | 2025-02-28 | ROME | LONDON | 2025-03-02 | 2025-03-03 | Kiwi | 0 | 0 | Sunday | ROME_LONDON |
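
A quick consistency check on the date columns may be useful at this point. The sketch below assumes that `ttt` is the number of days from `snapshot_date` to `departure_date` and `los` the number of days from `departure_date` to `return_date`, which matches the rows shown above; `check_date_consistency` is a hypothetical helper, and the second sample row is deliberately inconsistent.

```python
import pandas as pd

def check_date_consistency(df: pd.DataFrame) -> pd.Series:
    """Boolean mask: True where ttt and los agree with the date columns."""
    snapshot = pd.to_datetime(df["snapshot_date"])
    departure = pd.to_datetime(df["departure_date"])
    returning = pd.to_datetime(df["return_date"])

    ttt_ok = (departure - snapshot).dt.days == df["ttt"]
    los_ok = (returning - departure).dt.days == df["los"]
    return ttt_ok & los_ok

rows = pd.DataFrame({
    "snapshot_date": ["2025-02-28", "2025-02-28"],
    "departure_date": ["2025-03-02", "2025-03-02"],
    "return_date": ["2025-03-03", "2025-03-05"],  # second row: 3-day stay
    "ttt": [2, 2],
    "los": [1, 1],                                # but los claims 1 day
})
mask = check_date_consistency(rows)
```

Applied to the full table, `(~mask).sum()` would give the number of rows whose scraped dates disagree with their `ttt` or `los` values.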
