# 1. Goal
The goal of this project is to build a complete end-to-end machine learning pipeline, focusing on data understanding, preparation, model training, and evaluation. This involves conducting thorough **exploratory data analysis (EDA)** and implementing essential data preprocessing steps to ensure model reliability.

Key objectives include:

- Identifying data types of all columns and presenting **descriptive statistics** for numerical features.
- **Detecting and imputing missing values** to maintain dataset integrity.
- Identifying and addressing **duplicate entries** and **outliers**, with clear justifications for retaining or removing them.
- Creating **at least three visualizations** to uncover patterns and insights within the data.
- Applying appropriate **scaling to numerical features** and **encoding to categorical features** for model compatibility.
- Training **at least 7 different machine learning models** to evaluate a variety of approaches.
- Performing **hyperparameter tuning on three models** to enhance performance.
- **Comparing model results** using validation metrics to identify the best-performing model.

# 2. Import Dependencies
Import the Python libraries needed for data analysis, visualization, preprocessing, model building, selection, validation, and prediction. The imports are sorted by increasing line length for utmost visual delight :)

In [1]:
import warnings
import numpy as np
import pandas as pd
import plotly.express as px
from xgboost import XGBRegressor
import plotly.graph_objects as go
from lightgbm import LGBMRegressor
from sklearn.impute import KNNImputer
from plotly.subplots import make_subplots
from sklearn.pipeline import make_pipeline
from category_encoders import TargetEncoder
from sklearn.neural_network import MLPRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.exceptions import ConvergenceWarning
from plotly.offline import init_notebook_mode, iplot
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, FunctionTransformer, OneHotEncoder, LabelEncoder
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor

warnings.filterwarnings("ignore", category=ConvergenceWarning)
warnings.filterwarnings("ignore", category=RuntimeWarning) # to handle runtime warnings
init_notebook_mode(connected=True) # To make plotly render offline in notebook

# 3. Data Loading and Initial Inspection

Load `train.csv` and `test.csv` using **Pandas**. 

- Display the **first few rows** of both datasets using `.head()` to preview the structure and values.
- Inspect the **data types** of each column using `.dtypes` to distinguish between numerical and categorical features.
- Review **descriptive statistics** of the numerical columns using `.describe()` to understand the distribution, central tendencies, and variability of the features.


In [2]:
train_set = pd.read_csv('/kaggle/input/mlp-term-2-2025-kaggle-assignment-1/train.csv')
test_set = pd.read_csv('/kaggle/input/mlp-term-2-2025-kaggle-assignment-1/test.csv')

In [3]:
train_set.head()

Unnamed: 0,id,airline,flight,source,departure,stops,arrival,destination,class,duration,days_left,price
0,0,Vistara,UK-930,Mumbai,Early_Morning,one,Night,Chennai,Business,,40.0,64173
1,1,Air_India,AI-539,Chennai,Evening,one,Morning,Mumbai,Economy,16.08,26.0,4357
2,2,SpiceJet,SG-8107,Delhi,Early_Morning,zero,Morning,Chennai,Economy,2.92,25.0,3251
3,3,,0.00E+00,Hyderabad,Early_Morning,zero,Morning,Bangalore,Economy,1.5,22.0,1776
4,4,Air_India,AI-569,Chennai,Early_Morning,one,Morning,Bangalore,Economy,4.83,20.0,3584


In [4]:
test_set.head()

Unnamed: 0,id,airline,flight,source,departure,stops,arrival,destination,class,duration,days_left
0,0,Vistara,UK-816,Bangalore,Morning,zero,Afternoon,Delhi,Economy,2.67,18.0
1,1,Air_India,AI-440,Chennai,Early_Morning,zero,Morning,Delhi,Economy,,5.0
2,2,SpiceJet,SG-8938,Delhi,Evening,one,Evening,Bangalore,Economy,,44.0
3,3,Vistara,UK-838,Chennai,Night,one,Evening,Kolkata,Business,21.0,26.0
4,4,Air_India,AI-429,Delhi,Morning,one,Evening,Mumbai,Business,7.25,22.0


## 3.1 View the column data types

In [5]:
display = {}
print('> Column data types in train set: \n'.upper())
for col in train_set.columns:
    display[col]= train_set[col].dtype
display = pd.DataFrame([display],index=['Type'])
display

> COLUMN DATA TYPES IN TRAIN SET: 



Unnamed: 0,id,airline,flight,source,departure,stops,arrival,destination,class,duration,days_left,price
Type,int64,object,object,object,object,object,object,object,object,float64,float64,int64


In [6]:
display = {}
print('> Column data types in test set: \n'.upper())
for col in test_set.columns:
    display[col]= test_set[col].dtype
display = pd.DataFrame([display],index=['Type'])
display

> COLUMN DATA TYPES IN TEST SET: 



Unnamed: 0,id,airline,flight,source,departure,stops,arrival,destination,class,duration,days_left
Type,int64,object,object,object,object,object,object,object,object,float64,float64


## 3.2 View the summary statistics

In [7]:
# Numeric Data
print('Summary statistics for numeric columns: \n'.upper()+'-'*40)
numeric = train_set.select_dtypes([int,float])
numeric.describe()

SUMMARY STATISTICS FOR NUMERIC COLUMNS: 
----------------------------------------


Unnamed: 0,id,duration,days_left,price
count,40000.0,36987.0,35562.0,40000.0
mean,19999.5,12.004088,26.197936,20801.49025
std,11547.14972,7.108063,13.469232,22729.14842
min,0.0,0.83,1.0,1105.0
25%,9999.75,6.67,15.0,4687.0
50%,19999.5,11.08,26.0,7353.0
75%,29999.25,15.92,38.0,42521.0
max,39999.0,47.08,49.0,114704.0


In [8]:
# Categorical Data
print('Summary statistics for categorical columns: \n'.upper()+'-'*43)
categorical = train_set.select_dtypes([object])
categorical.describe()

SUMMARY STATISTICS FOR CATEGORICAL COLUMNS: 
-------------------------------------------


Unnamed: 0,airline,flight,source,departure,stops,arrival,destination,class
count,35387,40000.0,40000,35208,37681,40000,40000,40000
unique,6,869.0,6,6,3,6,6,2
top,Vistara,0.0,Delhi,Morning,one,Night,Mumbai,Economy
freq,15063,5240.0,8189,8302,31439,12348,7821,27536


## 3.3 Visualize numeric columns

In [9]:
# Visualise numeric columns
fig = make_subplots(2,2)
fig.add_trace(go.Histogram(x=train_set['duration'], nbinsx=50, name='duration', marker_line_width=0.5), row=1, col=1)
fig.add_trace(go.Histogram(x=train_set['days_left'], nbinsx=50, name='days left', marker_line_width=0.5), row=1, col=2)
fig.add_trace(go.Histogram(x=train_set['price'], nbinsx=50, name='price', marker_line_width=0.5), row=2, col=1)

# Set x and y axis labels
fig.update_xaxes(title_text="Duration", row=1, col=1)
fig.update_yaxes(title_text="Count", row=1, col=1)

fig.update_xaxes(title_text="Days Left", row=1, col=2)
fig.update_yaxes(title_text="Count", row=1, col=2)

fig.update_xaxes(title_text="Price", row=2, col=1)
fig.update_yaxes(title_text="Count", row=2, col=1)

fig.update_layout(title_text="NUMERIC COLUMNS<br><sub>(Hover mouse to see details)</sub>")

iplot(fig)

> ### INSIGHT
> - The `days_left` column is roughly uniformly distributed, while `duration` and `price` are significantly right-skewed
> - The right-skewed columns will probably have outliers
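
The visual impression of skew can be backed by a number. The sketch below is a minimal illustration on synthetic data (the `demo` frame and its distributions are assumptions, not the real dataset); it uses pandas' `DataFrame.skew()`, which is close to 0 for symmetric data and clearly positive for a long right tail.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for two numeric columns (NOT the real flight data)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "days_left": rng.integers(1, 50, size=5000).astype(float),  # roughly uniform
    "price": rng.lognormal(mean=9.0, sigma=1.0, size=5000),      # long right tail
})

# Sample skewness per column: ~0 means symmetric, > 0 means right-skewed
skewness = demo.skew()
print(skewness.round(2))
```

On the real data the same call would be `train_set[['duration', 'days_left', 'price']].skew()`.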

## 3.4 Data Visualisation
- Visualise and infer how different features might be related to each other

### Violin Plots: Price vs Class and Stops

In [None]:
fig = make_subplots(rows=1, cols=2, subplot_titles=("Price vs Class", "Price vs Stops"))

fig.add_trace(go.Violin(x=train_set['class'],y=train_set['price'],box_visible=True,line_color='royalblue'),
    row=1, col=1
)

fig.add_trace(go.Violin(x=train_set['stops'],y=train_set['price'],box_visible=True,line_color='indianred'),
    row=1, col=2
)

fig.update_layout(
    title_text="Violin Plots: Price vs Class and Stops<br><sub>(Hover mouse to see details)</sub>",
    showlegend=False,height=600,width=1200)

fig.update_xaxes(title_text="Class", row=1, col=1)
fig.update_yaxes(title_text="Price", row=1, col=1)

fig.update_xaxes(title_text="Stops", row=1, col=2)
fig.update_yaxes(title_text="Price", row=1, col=2)
iplot(fig)

> ### INSIGHTS
> - Low-end prices dominate the economy class, while business-class prices concentrate in a middle range.
> - Surprisingly, the **most expensive flights have only one stop**, rather than two or more, as one might expect.
> - Most flights with two or more stops fall within the economy-class price range.

### Correlation Heatmap

In [None]:
numeric_df = train_set.select_dtypes(include='number')
numeric_df = numeric_df.drop('id',axis=1)
corr_matrix = numeric_df.corr().round(2)

fig = go.Figure(data=go.Heatmap(
    z=corr_matrix.values,x=corr_matrix.columns,y=corr_matrix.columns,
    colorscale='Viridis',zmin=-1,zmax=1,colorbar=dict(title="Correlation")))

fig.update_layout(
    title="Correlation Heatmap of Numerical Features<br><sub>(Hover mouse to see details)</sub>",xaxis_title="Features",yaxis_title="Features",
    xaxis_showgrid=False,yaxis_showgrid=False,width=700,height=700,font=dict(size=12))


> ### INSIGHTS
> - The numeric columns are **not strongly correlated** with one another
> - This suggests that **multicollinearity is not a concern**, so each column can contribute its own information to the model
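
A correlation matrix only looks at pairs of columns. A common complementary check is the variance inflation factor (VIF), which regresses each feature on all the others. The sketch below is an illustrative helper written with plain NumPy (the `vif` function and the synthetic `demo` frame are assumptions, not part of the original notebook); values near 1 indicate no multicollinearity, while values above roughly 5 to 10 are usually treated as a warning sign.

```python
import numpy as np
import pandas as pd

def vif(frame: pd.DataFrame) -> pd.Series:
    """Variance inflation factor of each column regressed on the others."""
    X = frame.to_numpy(dtype=float)
    out = {}
    for j, name in enumerate(frame.columns):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(y)), others])  # add an intercept term
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)    # ordinary least squares
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()                # R^2 of feature j on the rest
        out[name] = 1.0 / (1.0 - r2)
    return pd.Series(out)

# Synthetic demo: 'a_copy' is almost a duplicate of 'a', 'b' is independent
rng = np.random.default_rng(0)
a = rng.normal(size=2000)
b = rng.normal(size=2000)
demo = pd.DataFrame({"a": a, "b": b, "a_copy": a + 0.1 * rng.normal(size=2000)})

vifs = vif(demo)
print(vifs.round(2))
```

For the notebook's data the call would be `vif(train_set[['duration', 'days_left']].dropna())`.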

### Pair Plots: How economy and business classes are distributed across numerical features

In [None]:
features = ['price', 'duration', 'days_left']

fig = px.scatter_matrix(train_set,
                        dimensions=features,
                        color='class', 
                        title="Pair Plot of Key Numerical Features<br><sub>(Hover mouse to see details)</sub>",
                        height=700)

fig.update_traces(diagonal_visible=True)

> ### INSIGHTS
> - Prices increase across both classes as the number of days left decreases
> - **Economy class has longer average flight durations**

### Stacked Bar Charts: Class distribution across airlines

In [None]:
# Prepare data
airline_class_counts = train_set.groupby(['airline', 'class']).size().unstack(fill_value=0)

# Extract airlines and class counts
airlines = airline_class_counts.index
economy_counts = airline_class_counts.get('Economy', pd.Series(0, index=airlines))    # fallback keeps the airline index
business_counts = airline_class_counts.get('Business', pd.Series(0, index=airlines))

fig = go.Figure()
fig.add_trace(go.Bar(x=airlines,y=economy_counts,name='Economy',marker_color='skyblue'))
fig.add_trace(go.Bar(x=airlines,y=business_counts,name='Business',marker_color='indianred'))

fig.update_layout(barmode='stack',title='Class Distribution Across Airlines<br><sub>(Hover mouse to see details)</sub>',
                  xaxis_title='Airline',yaxis_title='Number of Flights',legend_title='Class',height=500,width=1000)


> ### INSIGHTS
> - Business class tickets appear **only** for Air India and Vistara
> - The distribution is heavily **skewed towards Air India and Vistara**, so these two airlines might influence our model predictions more

# 4. Data Cleaning and Imputation (Numeric)

- The `id` column has been dropped from both training and test datasets, as it serves only as an identifier and does not contribute to the predictive modeling process.

- Missing values have been identified across several columns using `.isna().sum()`, highlighting the need for imputation to maintain dataset completeness.

- For numerical columns, two imputation strategies have been considered:
  - **Mean Imputation**: A basic approach where missing values are filled with the column's mean.
  - **KNN Imputation**: An advanced technique that imputes missing values based on similarities with other data points.

- After evaluation, **KNN Imputation** has been preferred, as it better preserves data distribution and inter-feature relationships compared to mean-based methods.

## 4.1 Drop ID columns

In [None]:
# Drop the id column
train_set.drop('id', axis=1, inplace=True)
test_set.drop('id', axis=1, inplace=True)
print('> ID columns dropped')

## 4.2 Which columns have NaN values?

In [None]:
numeric_columns = ['duration','price','days_left']
print('Columns with Nan values are: ')
for col in numeric_columns:
    if train_set[col].isna().any():
        print("> ",col)

## 4.3 Compare imputing via mean VS via K-NN

In [16]:
columns = ['duration','days_left']
# Mean
train_set2 = train_set.copy()
train_set2[columns] = train_set[columns].fillna(train_set[columns].mean())
# K-NN
imputer = KNNImputer(n_neighbors = 5)
ss = StandardScaler()                  # Scale the data as KNN is sensitive to scaling
train_set3 = train_set.copy()
train_set3[columns] = ss.fit_transform(train_set3[columns])
train_set3[columns] = imputer.fit_transform(train_set3[columns])
train_set3[columns] = ss.inverse_transform(train_set3[columns])

# Compare the plots
fig = make_subplots(rows=3, cols=2)

fig.add_trace(go.Histogram(x=train_set['duration'], nbinsx=50, marker_line_width=0.5), row=1, col=1)
fig.add_trace(go.Histogram(x=train_set['days_left'], nbinsx=50, marker_line_width=0.5), row=1, col=2)
fig.add_trace(go.Histogram(x=train_set2['duration'], nbinsx=50, marker_line_width=0.5), row=2, col=1)
fig.add_trace(go.Histogram(x=train_set2['days_left'], nbinsx=50, marker_line_width=0.5), row=2, col=2)
fig.add_trace(go.Histogram(x=train_set3['duration'], nbinsx=50, marker_line_width=0.5), row=3, col=1)
fig.add_trace(go.Histogram(x=train_set3['days_left'], nbinsx=50, marker_line_width=0.5), row=3, col=2)

fig.update_xaxes(title_text="Duration (Original)", row=1, col=1)
fig.update_yaxes(title_text="Count", row=1, col=1)
fig.update_xaxes(title_text="Days Left (Original)", row=1, col=2)
fig.update_yaxes(title_text="Count", row=1, col=2)
fig.update_xaxes(title_text="Duration (Mean Imputed)", row=2, col=1)
fig.update_yaxes(title_text="Count", row=2, col=1)
fig.update_xaxes(title_text="Days Left (Mean Imputed)", row=2, col=2)
fig.update_yaxes(title_text="Count", row=2, col=2)
fig.update_xaxes(title_text="Duration (KNN Imputed)", row=3, col=1)
fig.update_yaxes(title_text="Count", row=3, col=1)
fig.update_xaxes(title_text="Days Left (KNN Imputed)", row=3, col=2)
fig.update_yaxes(title_text="Count", row=3, col=2)

fig.update_layout(title_text="COMPARING IMPUTATION STRATEGIES<br><sub>(Hover mouse to see details)</sub>",showlegend=False,height=700)
iplot(fig)

> ### INSIGHT
> **As we can see, imputation via mean causes a sharp spike in the central bins, while the KNN imputer does a much better job in maintaining the dataset structure, hence providing a better imputation**
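
The spike seen with mean imputation also shows up numerically: every filled row receives the same value, so the column's spread shrinks. The sketch below is a small illustration on synthetic data (the gamma-distributed `full` series and the 10% missing rate are assumptions chosen only for the demo).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical right-skewed column standing in for 'duration'
full = pd.Series(rng.gamma(shape=2.0, scale=6.0, size=10000))

# Knock out roughly 10% of the values at random
holes = full.copy()
holes[rng.random(len(holes)) < 0.1] = np.nan

# Mean imputation: every missing entry gets the same value
mean_filled = holes.fillna(holes.mean())

std_observed = holes.std()                                # spread of the observed values
std_mean_filled = mean_filled.std()                       # spread after mean imputation
share_at_mean = (mean_filled == holes.mean()).mean()      # fraction of rows in the spike

print(f"std observed:     {std_observed:.3f}")
print(f"std mean-filled:  {std_mean_filled:.3f}")
print(f"share at mean:    {share_at_mean:.3f}")
```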

## 4.4 Impute The Data with K-NN

In [17]:
# Fill in the missing values using KNN Imputer in the original dataset
train_set[columns] = ss.fit_transform(train_set[columns])
train_set[columns] = imputer.fit_transform(train_set[columns])
train_set[columns] = ss.inverse_transform(train_set[columns])

# Same for test set
test_set[columns] = ss.transform(test_set[columns])
test_set[columns] = imputer.transform(test_set[columns])
test_set[columns] = ss.inverse_transform(test_set[columns])

> **Only transform (never fit) on the test set, so that no test-set information leaks into the fitted scaler and imputer.**
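
The fit-on-train, transform-on-test pattern can be seen in isolation on a toy example. This is a minimal sketch under stated assumptions: the tiny `train_demo` and `test_demo` frames and `n_neighbors=2` are made up for illustration. `StandardScaler` ignores NaN values when fitting and keeps them in place when transforming, which is why it can run before the imputer.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

cols = ["duration", "days_left"]

# Hypothetical toy data (not the real dataset)
train_demo = pd.DataFrame({"duration": [2.0, 4.0, np.nan, 8.0, 10.0],
                           "days_left": [5.0, 10.0, 15.0, np.nan, 25.0]})
test_demo = pd.DataFrame({"duration": [np.nan, 6.0],
                          "days_left": [12.0, np.nan]})

# Fit both steps on the training data only
scaler = StandardScaler().fit(train_demo[cols])
imputer = KNNImputer(n_neighbors=2).fit(scaler.transform(train_demo[cols]))

# Apply the already-fitted steps to the test data, then undo the scaling
test_filled = test_demo.copy()
test_filled[cols] = scaler.inverse_transform(
    imputer.transform(scaler.transform(test_demo[cols]))
)
print(test_filled)
```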

## 4.5 Outlier Analysis
- Outliers have been analyzed using **box plots** for relevant numerical features, allowing visual identification of extreme values beyond the typical range.

- The **number of outliers** detected has not been significant, and thus does not warrant aggressive transformation or removal.

- Instead of applying scaling transformations to mitigate their effect, the outliers have been **clipped using IQR-based bounds**. This approach retains the overall structure of the data while capping extreme values to a reasonable range.

In [18]:
# Function for counting outliers
def count_outliers(df, columns):
    total_outliers = 0
    for column in columns:
        q1 = df[column].quantile(0.25)
        q3 = df[column].quantile(0.75)
        iqr = q3 - q1
        lower_bound = q1 - 1.5 * iqr
        upper_bound = q3 + 1.5 * iqr
        count = ((df[column] < lower_bound) | (df[column] > upper_bound)).sum()
        print(f"{column}: {count}")
        total_outliers += count
    print('-'*25)
    print(f"Total outliers: {total_outliers}")
    print('-'*25)

# Counting outliers
print('In training data: \n'+'='*25) 
count_outliers(train_set,numeric_columns)

# Plots
fig = make_subplots(1,3)
fig.add_trace(go.Box(y=train_set['duration'],name="Duration",boxpoints='outliers'),row=1,col=1)
fig.add_trace(go.Box(y=train_set['days_left'],name="Days Left",boxpoints='outliers'),row=1,col=2)
fig.add_trace(go.Box(y=train_set['price'],name="Price",boxpoints='outliers'),row=1,col=3)
fig.update_layout(title_text="OUTLIER ANALYSIS<br><sub>(Hover mouse to see details)</sub>",showlegend=False)
iplot(fig)

In training data: 
duration: 306
price: 14
days_left: 0
-------------------------
Total outliers: 320
-------------------------


> ### INSIGHTS
> - The `days_left` column has no outliers.
> - We clip the outliers rather than drop them because they are few (320) relative to the size of the dataset (~40,000 rows).

## 4.6 Clipping outliers

In [19]:
# Clip outliers
def clip(col,df):
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3-q1
    lower = q1-(1.5*iqr)
    upper = q3+(1.5*iqr)
    df[col] = df[col].clip(lower, upper)
print('In training data: \n'+'='*25)   
for col in ['duration','price']:
    clip(col,train_set)
count_outliers(train_set,numeric_columns)

In training data: 
duration: 0
price: 0
days_left: 0
-------------------------
Total outliers: 0
-------------------------


In [20]:
# Clip extreme values in test set
print('In test data: \n'+'='*25)   
for col in ['duration','days_left']:
    clip(col,test_set)
count_outliers(test_set,['duration','days_left'])

In test data: 
duration: 0
days_left: 0
-------------------------
Total outliers: 0
-------------------------
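
The `clip` helper above derives its bounds from whichever frame it is given, so the test set is clipped with its own quartiles. An alternative, sketched below as an assumption rather than as the notebook's method, is to compute the IQR bounds once on the training data and reuse them for the test data, in the same spirit as fitting the scaler and imputer on the training set only. The `iqr_bounds` helper and the toy frames are hypothetical.

```python
import pandas as pd

def iqr_bounds(s: pd.Series, k: float = 1.5):
    """Return (lower, upper) Tukey fences for a numeric series."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical toy data (not the real dataset)
train_demo = pd.DataFrame({"duration": [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 60.0]})
test_demo = pd.DataFrame({"duration": [1.0, 9.0, 75.0]})

# Bounds come from the training data only ...
lower, upper = iqr_bounds(train_demo["duration"])

# ... and are then applied to both sets
train_clipped = train_demo["duration"].clip(lower, upper)
test_clipped = test_demo["duration"].clip(lower, upper)

print(lower, upper)
print(test_clipped.tolist())
```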


# 5. Data Cleaning and Imputation (Categorical)
- Categorical features have been extracted and assessed for missing values using `.isna().sum()`.

- A detailed analysis of **unique values** has been conducted, during which **scientific notation entries** were found in the `flight` column. These unusable entries have been replaced with `NaN` to enable proper handling.

- An imputation strategy has been selected based on the number of unique values and feature importance in each column:
  - Columns like `departure` and `stops`, which have **fewer unique categories**, have been imputed using **KNN imputation** with relevant numerical columns.
  - The `airline` and `flight` columns, which contain more complex categorical data, have been imputed using predictions from **Random Forest Classifier (RFC)** models trained on non-missing entries from the same dataset.

- To ensure data consistency, **bar plots** for all four imputed columns have been visualized **before and after imputation**. These visualizations have confirmed that the **original distribution of categories has been preserved**.

## 5.1 Get categorical features and inspect unique values

In [21]:
# Get categorical features
train_cat = train_set.select_dtypes([object])
cat_cols = train_cat.columns
print('Categorical Columns are: \n'+'='*73)  
train_cat.head()

Categorical Columns are: 


Unnamed: 0,airline,flight,source,departure,stops,arrival,destination,class
0,Vistara,UK-930,Mumbai,Early_Morning,one,Night,Chennai,Business
1,Air_India,AI-539,Chennai,Evening,one,Morning,Mumbai,Economy
2,SpiceJet,SG-8107,Delhi,Early_Morning,zero,Morning,Chennai,Economy
3,,0.00E+00,Hyderabad,Early_Morning,zero,Morning,Bangalore,Economy
4,Air_India,AI-569,Chennai,Early_Morning,one,Morning,Bangalore,Economy


In [22]:
# Look at the unique values in all columns
print('Unique values are: \n'.upper()) 
for col in cat_cols:
  print(f'{col.upper()}- {len(train_set[col].unique())} uniques: {(train_set[col].unique())} \n'+'-'*90)

UNIQUE VALUES ARE: 

AIRLINE- 7 uniques: ['Vistara' 'Air_India' 'SpiceJet' nan 'AirAsia' 'Indigo' 'GO_FIRST'] 
------------------------------------------------------------------------------------------
FLIGHT- 869 uniques: ['UK-930' 'AI-539' 'SG-8107' '0.00E+00' 'AI-569' 'I5-620' 'SG-3027'
 'G8-1404' 'I5-1528' 'AI-865' 'UK-828' 'AI-570' 'AI-768' 'AI-619' 'UK-832'
 'AI-675' 'AI-683' 'I5-1561' 'AI-507' 'AI-806' 'SG-611' 'AI-770' 'UK-776'
 'AI-508' 'SG-276' 'I5-972' 'UK-899' 'I5-974' 'UK-994' 'AI-762' 'G8-292'
 'UK-823' 'G8-501' 'AI-721' 'AI-503' 'I5-768' '6.00E-219' 'UK-870'
 'UK-985' 'UK-981' 'UK-874' 'SG-612' 'UK-657' 'UK-941' 'UK-838' 'UK-820'
 'AI-685' 'I5-792' 'AI-538' 'AI-774' 'AI-698' 'AI-805' 'I5-588' 'UK-880'
 'UK-960' 'AI-442' 'SG-8264' 'G8-331' 'UK-944' 'UK-852' 'UK-706' 'UK-826'
 'UK-816' 'AI-635' 'AI-868' 'AI-505' 'AI-663' 'AI-640' 'UK-873' 'UK-897'
 'I5-336' 'I5-747' 'UK-708' 'G8-515' 'UK-910' 'UK-865' 'UK-857' 'UK-877'
 'UK-772' 'G8-365' 'G8-336' 'AI-541' 'I5-410' '6.00E-2

> ### INSIGHT
>  **Some entries in the `flight` column were wrongly entered as scientific-notation numbers, so they must be replaced with NaN values.**

## 5.2 Replace all the invalid numeric entries in `flight`

In [23]:
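# Entries whose first character is a digit (e.g. '0.00E+00', '6.00E-219') are treated as corrupted.
# astype(str) turns NaN into 'nan', so already-missing rows are never flagged.
invalid = train_set['flight'].astype(str).str[0].isin(list('1234567890'))
train_set.loc[invalid, 'flight'] = np.nan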
train_cat = train_set.select_dtypes([object])
print('> Invalid entries replaced')
train_cat.head()

> Invalid entries replaced


Unnamed: 0,airline,flight,source,departure,stops,arrival,destination,class
0,Vistara,UK-930,Mumbai,Early_Morning,one,Night,Chennai,Business
1,Air_India,AI-539,Chennai,Evening,one,Morning,Mumbai,Economy
2,SpiceJet,SG-8107,Delhi,Early_Morning,zero,Morning,Chennai,Economy
3,,,Hyderabad,Early_Morning,zero,Morning,Bangalore,Economy
4,Air_India,AI-569,Chennai,Early_Morning,one,Morning,Bangalore,Economy


In [24]:
# Do the same for test set
# Entries whose first character is a digit are treated as corrupted (NaN rows become 'nan' and are skipped)
invalid = test_set['flight'].astype(str).str[0].isin(list('1234567890'))
test_set.loc[invalid, 'flight'] = np.nan
print('> Invalid entries replaced')

> Invalid entries replaced


## 5.3 Which columns have NaN values?

In [25]:
print('Columns with Nan values are: \n'+'='*30)
for col in train_cat:
    if train_set[col].isna().any():
        print("> ",col)
print('\n Number of Nan values:\n'+'-'*30)
train_set.isna().sum()

Columns with Nan values are: 
>  airline
>  flight
>  departure
>  stops

 Number of Nan values:
------------------------------


airline        4613
flight         5922
source            0
departure      4792
stops          2319
arrival           0
destination       0
class             0
duration          0
days_left         0
price             0
dtype: int64

## 5.4 Visualize distributions (before imputing)

In [26]:
# Group flights by their airline for better representation
flight_counts = {}
for flight in train_set['flight']:
    if str(flight)[:2] not in flight_counts.keys():
        flight_counts[str(flight)[:2]] = 1
    else:
        flight_counts[str(flight)[:2]] += 1
del flight_counts['na']   # drop the bucket that str(NaN)[:2] == 'na' creates for missing flights

# Plots
fig = make_subplots(rows=2, cols=2)

airline_counts = train_set['airline'].value_counts()
departure_counts = train_set['departure'].value_counts()
stops_counts = train_set['stops'].value_counts()

fig.add_trace(go.Bar(x=airline_counts.index, y=airline_counts.values, text=airline_counts.values), row=1, col=1)
fig.add_trace(go.Bar(x=list(flight_counts.keys()), y=list(flight_counts.values()), text=list(flight_counts.values())), row=1, col=2)
fig.add_trace(go.Bar(x=departure_counts.index, y=departure_counts.values, text=departure_counts.values), row=2, col=1)
fig.add_trace(go.Bar(x=stops_counts.index, y=stops_counts.values, text=stops_counts.values), row=2, col=2)

# Set x and y axis labels for each subplot
fig.update_xaxes(title_text="Airline", row=1, col=1)
fig.update_yaxes(title_text="Count", row=1, col=1)

fig.update_xaxes(title_text="Flight Group", row=1, col=2)
fig.update_yaxes(title_text="Count", row=1, col=2)

fig.update_xaxes(title_text="Departure Time", row=2, col=1)
fig.update_yaxes(title_text="Count", row=2, col=1)

fig.update_xaxes(title_text="Stops", row=2, col=2)
fig.update_yaxes(title_text="Count", row=2, col=2)

fig.update_layout(title_text="CATEGORICAL COLUMN COUNTS<br><sub>(Hover mouse to see details)</sub>",showlegend=False,height=800)
iplot(fig)

>### IMPUTATION STRATEGY
>- For `departure` and `stops` we use the K-NN imputer, as they either have fewer missing entries (`stops`) or don't impact the price much (`departure`).
>- K-NN is less compute-heavy here and saves on training time.
>- For the `flight` and `airline` columns, we use a more robust tree-based imputation with **RandomForestClassifier**.

## 5.5 KNN-based imputation for `departure` and `stops` columns
The process involved:
- Label encoding non-null values,
- Scaling reference columns,
- Applying `KNNImputer` with `k=5`,
- Reversing scaling and decoding the labels.

In [27]:
# KNN imputation on Departure and Stops columns
cols1 = ['departure','duration','days_left']
cols2 = ['stops','duration','days_left']
le = LabelEncoder()
ss = StandardScaler()
imp = KNNImputer(n_neighbors = 5)

def knn_impute(df,ref_cols,col,le,ss,imp):
    print("> KNN Imputation in progress...")
    temp= df[col].notna()
    df.loc[temp,col] = le.fit_transform(df.loc[temp,col]) # encode non nan values
    df[ref_cols] = ss.fit_transform(df[ref_cols])         # set a standard scale before knn
    df[ref_cols] = imp.fit_transform(df[ref_cols])
    df[ref_cols] = ss.inverse_transform(df[ref_cols])     # invert the scaling
    df[col] = df[col].round().astype(int)                 # convert float results from knn to integers
    df[col] = le.inverse_transform(df[col])               # convert labels back into categories
    print(">> KNN Imputation done!")
    
knn_impute(train_set,cols1,'departure',le,ss,imp)
knn_impute(train_set,cols2,'stops',le,ss,imp)

> KNN Imputation in progress...
>> KNN Imputation done!
> KNN Imputation in progress...
>> KNN Imputation done!


## 5.6 Random Forest Classifier (RFC) Imputation on `airline` and `flight` columns
Random Forest Classifier (RFC) has been used to impute missing values in the `airline` and `flight` columns based on features like `departure`, `duration`, `days_left`, and `stops`.

Key steps:
- Label-encoded both reference and target columns,
- Trained the RFC model on non-missing rows,
- Predicted and filled missing values,
- Handled formatting and inverse-transformed encoded labels.

In [28]:
# RFC imputation on airline and flight columns
rfc = RandomForestClassifier(max_depth=6, n_estimators=100)
le = LabelEncoder()
cols3 = ['departure','duration','days_left','stops']
def rfc_impute(df_main,col,ref_cols,rfc):
    print("> RFC Imputation in progress...")
    X = df_main[ref_cols].copy()
    for x in ref_cols:                          # encode categorical reference columns into labels
        if X[x].dtype == 'object':
            X[x] = LabelEncoder().fit_transform(X[x])
    X = X.astype(float)
    mask = df_main[col].notna()                 # rows where the target column is known
    le = LabelEncoder()
    y = le.fit_transform(df_main.loc[mask,col])                           # encode non-NaN target values
    rfc.fit(X[mask],y)                                                    # train on the known rows
    df_main.loc[~mask,col] = le.inverse_transform(rfc.predict(X[~mask]))  # predict, decode and fill NaN values
    print(">> RFC Imputation done!")
    
rfc_impute(train_set,'airline',cols3,rfc)
rfc_impute(train_set,'flight',cols3,rfc)

> RFC Imputation in progress...
>> RFC Imputation done!
> RFC Imputation in progress...
>> RFC Imputation done!


In [29]:
# Confirm all missing values are imputed
train_set.isna().sum()

airline        0
flight         0
source         0
departure      0
stops          0
arrival        0
destination    0
class          0
duration       0
days_left      0
price          0
dtype: int64

In [30]:
# Impute the test set with the same helpers (each call re-fits its encoder, scaler, imputer, or RFC on the test set itself)
knn_impute(test_set,cols1,'departure',le,ss,imp)
knn_impute(test_set,cols2,'stops',le,ss,imp)
rfc_impute(test_set,'airline',cols3,rfc)
rfc_impute(test_set,'flight',cols3,rfc)

> KNN Imputation in progress...
>> KNN Imputation done!
> KNN Imputation in progress...
>> KNN Imputation done!
> RFC Imputation in progress...
>> RFC Imputation done!
> RFC Imputation in progress...
>> RFC Imputation done!


In [31]:
# Confirm all missing values are imputed
test_set.isna().sum()

airline        0
flight         0
source         0
departure      0
stops          0
arrival        0
destination    0
class          0
duration       0
days_left      0
dtype: int64

## 5.7 Visualize distributions (after imputing)

In [32]:
# Group flights by their airline for better representation
flight_counts = {}
for flight in train_set['flight']:
    if str(flight)[:2] not in flight_counts.keys():
        flight_counts[str(flight)[:2]] = 1
    else:
        flight_counts[str(flight)[:2]] += 1

# Plots
fig = make_subplots(rows=2, cols=2)

airline_counts = train_set['airline'].value_counts()
departure_counts = train_set['departure'].value_counts()
stops_counts = train_set['stops'].value_counts()

fig.add_trace(go.Bar(x=airline_counts.index, y=airline_counts.values, text=airline_counts.values), row=1, col=1)
fig.add_trace(go.Bar(x=list(flight_counts.keys()), y=list(flight_counts.values()), text=list(flight_counts.values())), row=1, col=2)
fig.add_trace(go.Bar(x=departure_counts.index, y=departure_counts.values, text=departure_counts.values), row=2, col=1)
fig.add_trace(go.Bar(x=stops_counts.index, y=stops_counts.values, text=stops_counts.values), row=2, col=2)

# Set x and y axis labels for each subplot
fig.update_xaxes(title_text="Airline", row=1, col=1)
fig.update_yaxes(title_text="Count", row=1, col=1)

fig.update_xaxes(title_text="Flight Group", row=1, col=2)
fig.update_yaxes(title_text="Count", row=1, col=2)

fig.update_xaxes(title_text="Departure Time", row=2, col=1)
fig.update_yaxes(title_text="Count", row=2, col=1)

fig.update_xaxes(title_text="Stops", row=2, col=2)
fig.update_yaxes(title_text="Count", row=2, col=2)

fig.update_layout(title_text="CATEGORICAL COLUMN COUNTS (After Imputing)<br><sub>(Hover mouse to see details)</sub>",showlegend=False,height=800)
iplot(fig)

> ### INSIGHT
> - Imputation has **preserved the dataset structure**, adding new entries to each column in proportion to its existing distribution.
> - Had mode-based imputation been used instead, it would simply have added to the already-skewed distributions and biased our final models.
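
Besides comparing bar plots by eye, the change in category shares can be tabulated. The sketch below is an illustrative helper (the `share_shift` function, the toy `before` series, and its category labels are assumptions, not part of the original notebook). It uses mode imputation as the example, so the most frequent category visibly gains share.

```python
import numpy as np
import pandas as pd

def share_shift(before: pd.Series, after: pd.Series) -> pd.DataFrame:
    """Compare category proportions before and after imputation."""
    b = before.value_counts(normalize=True)   # NaN values are excluded by default
    a = after.value_counts(normalize=True)
    out = pd.concat([b.rename("before"), a.rename("after")], axis=1).fillna(0.0)
    out["abs_shift"] = (out["after"] - out["before"]).abs()
    return out.sort_values("before", ascending=False)

# Hypothetical 'stops'-like column with two missing entries
before = pd.Series(["one"] * 6 + ["zero"] * 3 + ["two_or_more"] + [np.nan] * 2)

# Mode imputation pushes every missing value into the largest category
mode_filled = before.fillna(before.mode()[0])

shift = share_shift(before, mode_filled)
print(shift.round(3))
```

Running the same helper on a saved pre-imputation copy of a column and its imputed version would show how far each category's share moved.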

# 6. Handling Duplicates
- Duplicates are identified and dropped.

In [33]:
# Handling Duplicates

dups = train_set[train_set.duplicated()]
train_set = train_set.drop_duplicates()
print('->',dups.shape[0],'duplicates dropped from training set')
print('\nTrain set now has',train_set.shape[0],'rows.')

-> 359 duplicates dropped from training set

Train set now has 39641 rows.


# 7. Encode Categorical Columns
Categorical columns have been encoded based on their number of unique values:

- Columns with **more than 10 unique categories** have been encoded using **Target Encoding**, which replaces categories with the mean of the target variable for each group.
- Columns with **10 or fewer unique categories** have been encoded using **One-Hot Encoding**, creating binary indicator variables for each category.
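
The idea behind target encoding can be sketched with a plain `groupby`. This is a simplified illustration on made-up rows: it uses the raw per-category mean, whereas `category_encoders.TargetEncoder` additionally smooths each category mean towards the global mean, so its values will generally differ. Falling back to the global mean for unseen categories is likewise an assumption of this sketch.

```python
import pandas as pd

# Hypothetical rows (not the real dataset)
demo = pd.DataFrame({
    "flight": ["UK-930", "UK-930", "AI-539", "AI-539", "SG-8107"],
    "price":  [60000, 64000, 4000, 5000, 3200],
})

global_mean = demo["price"].mean()                      # fallback for unseen categories
flight_means = demo.groupby("flight")["price"].mean()   # mean target per category

# Replace each category with the mean price observed for it
demo["flight_te"] = demo["flight"].map(flight_means)

# Categories never seen during fitting fall back to the global mean
unseen = pd.Series(["UK-930", "XX-000"])
unseen_te = unseen.map(flight_means).fillna(global_mean)

print(demo)
print(unseen_te.tolist())
```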


## 7.1 Deciding columns for different encoding type

In [34]:
print('Number of unique values: \n'+'-'*40)
for col in train_set.columns:
    if train_set[col].dtype == object:
      print(col,":",len(train_set[col].unique()))

Number of unique values: 
----------------------------------------
airline : 6
flight : 791
source : 6
departure : 6
stops : 3
arrival : 6
destination : 6
class : 2


>### INSIGHT
> - **Features with more than 10 unique values are target encoded; all others are one-hot encoded.**
> - The `flight` column has 791 unique values, so it is target encoded.

In [35]:
# Define types of encoding as lists of column names
tar_cols = ['flight']
oh_cols = ['airline','source','departure','stops','arrival','destination','class']
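
The two lists could also be derived from the threshold rather than typed by hand. The helper below is a hypothetical sketch (the name `split_by_cardinality` and the toy data are not part of the notebook):

```python
import pandas as pd

def split_by_cardinality(df, target, threshold=10):
    """Return (high-cardinality cols, low-cardinality cols) among non-numeric features."""
    cat_cols = df.drop(columns=[target]).select_dtypes(exclude="number").columns
    counts = df[cat_cols].nunique()
    high = counts[counts > threshold].index.tolist()   # candidates for target encoding
    low = counts[counts <= threshold].index.tolist()   # candidates for one-hot encoding
    return high, low

# Toy frame: 'flight' has 12 unique values, 'stops' has 2
toy = pd.DataFrame({
    "flight": [f"F-{i}" for i in range(12)],
    "stops": ["zero", "one"] * 6,
    "price": range(12),
})
tar_cols_demo, oh_cols_demo = split_by_cardinality(toy, target="price")
print(tar_cols_demo, oh_cols_demo)
```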

## 7.2 Encoding categorical features

In [36]:
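# Encode train set
te = TargetEncoder()
target = train_set['price']
for col in tar_cols:
    # Fit the target encoder on the training data only; the test set reuses it below
    train_set[col] = te.fit_transform(train_set[col], target)
# One-hot encode the low-cardinality columns.
# Note: astype(int) also truncates any float columns to integers.
encoded_train = pd.get_dummies(train_set, columns=oh_cols).astype(int)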
encoded_target = encoded_train['price']
encoded_train = encoded_train.drop('price',axis=1)
print('> Feature matrix for train set: \n'.upper())
encoded_train

> FEATURE MATRIX FOR TRAIN SET: 



Unnamed: 0,flight,duration,days_left,airline_AirAsia,airline_Air_India,airline_GO_FIRST,airline_Indigo,airline_SpiceJet,airline_Vistara,source_Bangalore,...,arrival_Morning,arrival_Night,destination_Bangalore,destination_Chennai,destination_Delhi,destination_Hyderabad,destination_Kolkata,destination_Mumbai,class_Business,class_Economy
0,36874,14,40,0,0,0,0,0,1,0,...,0,1,0,1,0,0,0,0,1,0
1,33030,16,26,0,1,0,0,0,0,0,...,1,0,0,0,0,0,0,1,0,1
2,17176,2,25,0,0,0,0,1,0,0,...,1,0,0,1,0,0,0,0,0,1
3,16341,1,22,0,0,0,1,0,0,0,...,1,0,1,0,0,0,0,0,0,1
4,22759,4,20,0,1,0,0,0,0,0,...,1,0,1,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
39995,37205,21,43,0,0,0,0,0,1,0,...,0,0,1,0,0,0,0,0,0,1
39996,13107,14,12,0,0,0,0,0,1,0,...,0,1,0,0,0,0,0,1,1,0
39997,26463,14,11,0,0,0,0,0,1,0,...,0,1,1,0,0,0,0,0,0,1
39998,37205,24,4,0,0,0,0,0,1,0,...,0,1,1,0,0,0,0,0,1,0


In [37]:
# Do the same for test set
for col in tar_cols:
  test_set[col] = te.transform(test_set[col])
encoded_test = pd.get_dummies(test_set,columns=[x for x in oh_cols]).astype(int)
print('> Feature matrix for test set: \n'.upper())
encoded_test

> FEATURE MATRIX FOR TEST SET: 



Unnamed: 0,flight,duration,days_left,airline_AirAsia,airline_Air_India,airline_GO_FIRST,airline_Indigo,airline_SpiceJet,airline_Vistara,source_Bangalore,...,arrival_Morning,arrival_Night,destination_Bangalore,destination_Chennai,destination_Delhi,destination_Hyderabad,destination_Kolkata,destination_Mumbai,class_Business,class_Economy
0,32717,2,18,0,0,0,0,0,1,1,...,0,0,0,0,1,0,0,0,0,1
1,27674,12,5,0,1,0,0,0,0,0,...,1,0,0,0,1,0,0,0,0,1
2,18016,9,44,0,0,0,0,1,0,0,...,0,0,1,0,0,0,0,0,0,1
3,31918,21,26,0,0,0,0,0,1,0,...,0,0,0,0,0,0,1,0,1,0
4,22540,7,22,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,16272,10,23,0,1,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,1
9996,14316,7,40,0,1,0,0,0,0,0,...,0,1,0,0,0,1,0,0,0,1
9997,31541,10,24,0,0,0,0,0,1,0,...,0,1,0,0,0,1,0,0,1,0
9998,26191,16,49,0,1,0,0,0,0,0,...,1,0,1,0,0,0,0,0,1,0
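
Calling `pd.get_dummies` separately on train and test only yields matching columns when both sets contain the same categories. The visible columns above agree, but as a safeguard the test matrix could be reindexed to the train layout. A minimal sketch on toy data:

```python
import pandas as pd

train = pd.DataFrame({"stops": ["zero", "one", "two_or_more"]})
test = pd.DataFrame({"stops": ["one", "zero"]})  # 'two_or_more' never appears here

train_oh = pd.get_dummies(train, columns=["stops"]).astype(int)
test_oh = pd.get_dummies(test, columns=["stops"]).astype(int)

# Reindex test columns to the train layout; categories absent from test become all-zero columns
test_aligned = test_oh.reindex(columns=train_oh.columns, fill_value=0)

print(list(test_oh.columns))
print(list(test_aligned.columns))
```

Without this step, a scaler or model fitted on the train matrix would reject a test matrix that has a different number of columns.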


# 8. Scale Entire Feature Matrix to Standardize
Feature scaling has been applied to **standardize the feature matrix**, ensuring that all features are on a comparable scale. This is important because:
- Many machine learning algorithms (e.g., KNN, neural networks, and other gradient descent-based models) are **sensitive to feature magnitudes**.
- Unscaled features can cause models to **favor higher-magnitude variables**, leading to biased learning.
- Scaling improves **convergence speed** and **stability** during training, especially for gradient-based optimizers.
- It also ensures that **distance-based models** like KNN compute fair distances across all features, and that **regularized models** like Ridge, Lasso, and ElasticNet penalize all coefficients on a comparable scale.
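
For reference, standardization maps each feature to `z = (x - mean) / std`, with the statistics learned from the training data only. A small sketch on made-up numbers, checking `StandardScaler` against the formula:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy matrices: two features on very different scales
X_tr = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
X_te = np.array([[2.0, 700.0]])

scaler = StandardScaler()
X_tr_s = scaler.fit_transform(X_tr)  # learns mean and std from the train rows only
X_te_s = scaler.transform(X_te)      # reuses the train statistics

# Same result by hand, using the population standard deviation (ddof=0)
manual = (X_tr - X_tr.mean(axis=0)) / X_tr.std(axis=0)

print(np.allclose(X_tr_s, manual))
print(X_te_s)
```

Test values are not guaranteed to fall in the same range as the scaled train values, because they are transformed with the train mean and standard deviation.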

In [38]:
scaler = StandardScaler()
encoded_train = scaler.fit_transform(encoded_train)
encoded_test = scaler.transform(encoded_test)
print('> Scaled train and test feature matrices.')

> Scaled train and test feature matrices.


# 9. Training the models
- A **train-validation split** has been defined to assess model performance on unseen data.
- A **dictionary of models** has been created, and each model has been trained using a loop for efficiency.
- Both **K-Fold Cross-Validation** and **Holdout Validation** MSE scores have been computed to evaluate model performance.
- Based on the validation results, the **top 3 models** have been selected for **hyperparameter tuning** in the next phase.

## 9.1 Holdout validation split and model dictionary
- A total of 11 models are listed.
- All of them are regressors, chosen for variety: linear models such as `ElasticNet`, the distance-based `KNeighborsRegressor`, tree-based ensembles such as `RandomForestRegressor`, and a neural network (`MLPRegressor`).

In [39]:
# Holdout validation split
X_train, X_val, y_train, y_val = train_test_split(encoded_train, encoded_target, test_size=0.2, random_state=42)

models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(),
    'Lasso Regression': Lasso(max_iter=10000),
    'ElasticNet Regression': ElasticNet(),
    "KNN Regressor": KNeighborsRegressor(),
    'Random Forest': RandomForestRegressor(n_jobs=-1),
    'AdaBoost': AdaBoostRegressor(),
    'Gradient Boosting': GradientBoostingRegressor(),
    'XGB Regressor': XGBRegressor(),
    'LGBM Regressor': LGBMRegressor(n_jobs=-1, verbosity=-1),
    'Neural Network': MLPRegressor(activation='relu',learning_rate='adaptive',early_stopping=True,max_iter=800)
}

## 9.2 Holdout validation training on data
- Models are trained on 80% of the data and evaluated on the remaining 20%.
- R² and MSE scores are stored and compared.
- The 5 models with the **lowest** MSE go on to K-Fold Cross Validation.
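
The two holdout metrics can be computed by hand, which makes their direction explicit: a lower MSE is better, while a higher R² is better. A sketch on made-up prices:

```python
import numpy as np

y_true = np.array([3000.0, 5000.0, 45000.0, 60000.0])
y_pred = np.array([3500.0, 4500.0, 47000.0, 58000.0])

# MSE: mean of the squared residuals (in squared price units)
mse = np.mean((y_true - y_pred) ** 2)

# R²: 1 minus the ratio of residual variation to total variation around the mean
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(mse, r2)
```

These match what `mean_squared_error` and `r2_score` return for the same inputs.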

In [40]:
names = []
val_set_r2 = []
val_set_mse = []

for name, model in models.items():
    print(f'> Training {name}...')
    names.append(name)
    
    model.fit(X_train, y_train)
    
    val_preds = model.predict(X_val)
    
    # Compute R²
    val_r2 = r2_score(y_val, val_preds)
    val_set_r2.append(val_r2)
    
    # Compute MSE
    val_mse = mean_squared_error(y_val, val_preds)
    val_set_mse.append(val_mse)

evals = pd.DataFrame({
    'MODEL': names,
    'Holdout_Val_R2': val_set_r2,
    'Holdout_Val_MSE': val_set_mse
})

print('\n--- ALL MODELS TRAINED ---')
evals

> Training Linear Regression...
> Training Ridge Regression...
> Training Lasso Regression...
> Training ElasticNet Regression...
> Training KNN Regressor...
> Training Random Forest...
> Training AdaBoost...
> Training Gradient Boosting...
> Training XGB Regressor...
> Training LGBM Regressor...
> Training Neural Network...

--- ALL MODELS TRAINED ---


Unnamed: 0,MODEL,Holdout_Val_R2,Holdout_Val_MSE
0,Linear Regression,0.908235,46959840.0
1,Ridge Regression,0.908235,46959680.0
2,Lasso Regression,0.908236,46959020.0
3,ElasticNet Regression,0.873632,64667650.0
4,KNN Regressor,0.908859,46640420.0
5,Random Forest,0.976826,11859000.0
6,AdaBoost,0.922044,39893380.0
7,Gradient Boosting,0.955216,22917710.0
8,XGB Regressor,0.974136,13235400.0
9,LGBM Regressor,0.969104,15810790.0


## 9.3 5-Fold Cross Validation
- The train set is divided into 5 parts (folds).
- One fold is held out for testing while the other four are used for training.
- This is repeated 5 times so that **each fold is held out exactly once**.
- R² and MSE are evaluated after each training round.
- The mean of each score across all rounds is displayed.
- **Top 3 are selected for hyperparameter tuning**
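
The procedure above can be sketched with `cross_validate`, which fits each fold once and reports several metrics together. The synthetic data, the `LinearRegression` stand-in, and the shuffled `KFold` are assumptions made for this illustration; the notebook itself passes `cv=5` to `cross_val_score`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic regression data (hypothetical)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Each sample lands in the held-out fold exactly once
held_out = np.concatenate([test_idx for _, test_idx in cv.split(X)])

# One fit per fold, two metrics per fit
scores = cross_validate(
    LinearRegression(), X, y, cv=cv,
    scoring=["r2", "neg_mean_squared_error"],
)
mean_r2 = scores["test_r2"].mean()
mean_mse = -scores["test_neg_mean_squared_error"].mean()  # flip sign to get MSE

print(mean_r2, mean_mse)
```

Because the encoder and scaler in this notebook were fitted on the full training set before cross-validation, the fold scores may be slightly optimistic. Wrapping the preprocessing and the model with `make_pipeline` would refit them inside each fold.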


In [41]:
# Select top 5
top5_models = (
    evals.nsmallest(5, 'Holdout_Val_MSE')
         .set_index('MODEL')['Holdout_Val_MSE']
         .to_dict()
)
print('> MODELS FOR 5-FOLD CV:')
pd.DataFrame(top5_models.items(), columns=['MODEL', 'Holdout_Val_MSE'])

> MODELS FOR 5-FOLD CV:


Unnamed: 0,MODEL,Holdout_Val_MSE
0,Random Forest,11859000.0
1,XGB Regressor,13235400.0
2,LGBM Regressor,15810790.0
3,Gradient Boosting,22917710.0
4,Neural Network,32172420.0


In [None]:
# 5-Fold CV
names = []
cv_r2 = []
cv_mse = []
cv_models = {}

# Select top 5 models
for name, score in top5_models.items():
    cv_models[name] = models[name]

for name, model in cv_models.items():
    print(f'> Training {name}...')
    names.append(name)
    
    # R² via cross_val_score
    r2_scores = cross_val_score(model, encoded_train, encoded_target, cv=5, scoring='r2')
    cv_r2.append(r2_scores.mean())
    
    # MSE via cross_val_score
    mse_scores = cross_val_score(model, encoded_train, encoded_target, cv=5, scoring='neg_mean_squared_error')
    cv_mse.append(np.mean(-mse_scores))  # Flip sign to get actual MSE

evals = pd.DataFrame({
    'MODEL': names,
    '5-Fold CV R2': cv_r2,
    '5-Fold CV MSE': cv_mse
})

print('\n--- ALL MODELS TRAINED ---')
evals

> Training Random Forest...
> Training XGB Regressor...
> Training LGBM Regressor...
> Training Gradient Boosting...
> Training Neural Network...

--- ALL MODELS TRAINED ---


Unnamed: 0,MODEL,5-Fold CV R2,5-Fold CV MSE
0,Random Forest,0.977215,11750130.0
1,XGB Regressor,0.975216,12767710.0
2,LGBM Regressor,0.971217,14826250.0
3,Gradient Boosting,0.958265,21499670.0
4,Neural Network,0.941912,30797640.0


In [None]:
# Select top 3
top3_models = (
    evals.nsmallest(3, '5-Fold CV MSE')
         .set_index('MODEL')['5-Fold CV MSE']
         .to_dict()
)
print('> MODELS FOR HYPERPARAMETER TUNING:')
pd.DataFrame(top3_models.items(), columns=['MODEL', '5-Fold CV MSE'])

> MODELS FOR HYPERPARAMETER TUNING:


Unnamed: 0,MODEL,5-Fold CV MSE
0,Random Forest,11750130.0
1,XGB Regressor,12767710.0
2,LGBM Regressor,14826250.0


## 9.4 Hyperparameter Tuning
- Top 3 models are set for tuning
- The training set is split into 80-20 train-validation sets
- `GridSearchCV` is used to get the best set of parameters from training set
- Models are evaluated on validation set
- Model with best MSE on validation set will be finally used to predict test set target
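
Two details of `GridSearchCV` are worth keeping in mind when reading the tuning cell: the number of fits grows with the product of the grid sizes, and with `scoring='neg_mean_squared_error'` the reported `best_score_` is the negated MSE. The sketch below copies the Random Forest grid from the notebook to count candidates, then runs a tiny stand-in search (a `Ridge` model on synthetic data, both assumptions for illustration) to show the sign convention.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, ParameterGrid

# Grid copied from the notebook's Random Forest search
rf_params = {
    'n_estimators': [50, 150],
    'max_depth': [10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 5],
    'max_features': ['sqrt'],
}
n_candidates = len(ParameterGrid(rf_params))  # 2 * 2 * 3 * 3 * 1 = 36
n_fits = n_candidates * 2                     # cv=2 -> 72 fits, plus one final refit

# Tiny stand-in search to show the sign convention of best_score_
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.5, -2.0]) + rng.normal(scale=0.1, size=100)
search = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, cv=2,
                      scoring="neg_mean_squared_error")
search.fit(X, y)
best_cv_mse = -search.best_score_  # negate to recover a positive MSE

print(n_candidates, n_fits, best_cv_mse)
```

By default `GridSearchCV` refits the best configuration on all the data passed to `fit`, so calling `predict` on the fitted search object uses that refitted estimator.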

In [None]:
x_train,x_val,y_train,y_val = train_test_split(encoded_train, encoded_target,test_size=0.2)

rf_params = {
    'n_estimators': [50, 150],
    'max_depth': [10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 5],
    'max_features': ['sqrt']
}

xgb_params = {
    'n_estimators': [50, 100, 150],
    'max_depth': [6, 4],
    'learning_rate': [0.01, 0.1],
    'subsample': [0.8, 0.5],
    'colsample_bytree': [0.8, 1.0]
}

lgbm_params = {
    'n_estimators': [300, 500],
    'max_depth': [6, 10, -1],
    'learning_rate': [0.01, 0.1],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

# Random Forest
print('> Tuning Random Forest Regressor...')
rf = GridSearchCV(RandomForestRegressor(n_jobs=-1), rf_params, cv=2, scoring='neg_mean_squared_error', n_jobs=-1)
rf.fit(x_train, y_train)
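# best_score_ is the negated MSE (scoring='neg_mean_squared_error'), so the printed values are negative
print("Best RF MSE:", rf.best_score_)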
print("RF Best Params:", rf.best_params_)

# XGBoost
print('> Tuning XGBoost Regressor...')
xgb = GridSearchCV(XGBRegressor( verbosity=0), xgb_params, cv=2, scoring='neg_mean_squared_error', n_jobs=1)
xgb.fit(x_train, y_train)
print("Best XGB MSE:", xgb.best_score_)
print("XGB Best Params:", xgb.best_params_)

# LGBM
print('> Tuning LGBM Regressor...')
lgbm = GridSearchCV(LGBMRegressor(verbosity=-1), lgbm_params, cv=2, scoring='neg_mean_squared_error', n_jobs=1)
lgbm.fit(x_train, y_train)
print("Best LGBM MSE:", lgbm.best_score_)
print("LGBM Best Params:", lgbm.best_params_)

> Tuning Random Forest Regressor...
Best RF MSE: -14366058.247977834
RF Best Params: {'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 150}
> Tuning XGBoost Regressor...
Best XGB MSE: -14976302.640605088
XGB Best Params: {'colsample_bytree': 0.8, 'learning_rate': 0.1, 'max_depth': 6, 'n_estimators': 150, 'subsample': 0.8}
> Tuning LGBM Regressor...
Best LGBM MSE: -14231689.248837393
LGBM Best Params: {'colsample_bytree': 0.8, 'learning_rate': 0.1, 'max_depth': -1, 'n_estimators': 500, 'subsample': 0.8}


In [60]:
# Get model performance on validation set
final_eval = {}
val_mse = mean_squared_error(y_val, rf.predict(x_val))
final_eval['Random Forest Regressor'] = val_mse
val_mse = mean_squared_error(y_val, xgb.predict(x_val))
final_eval['XGB Regressor'] = val_mse
val_mse = mean_squared_error(y_val, lgbm.predict(x_val))
final_eval['LGBM Regressor'] = val_mse

finals = pd.DataFrame(final_eval.items(), columns=['MODEL', 'MSE'])
finals

Unnamed: 0,MODEL,MSE
0,Random Forest Regressor,12659140.0
1,XGB Regressor,14159760.0
2,LGBM Regressor,12839860.0
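
MSE is expressed in squared price units, which makes the figures above hard to read directly. Taking the square root gives the RMSE in the same units as `price`. A short sketch using the validation MSE values from the table above (the frame name `finals_demo` is just for this example):

```python
import numpy as np
import pandas as pd

# Validation MSE values copied from the table above
finals_demo = pd.DataFrame({
    "MODEL": ["Random Forest Regressor", "XGB Regressor", "LGBM Regressor"],
    "MSE": [12659140.0, 14159760.0, 12839860.0],
})

# RMSE is in the same units as the target
finals_demo["RMSE"] = np.sqrt(finals_demo["MSE"])
print(finals_demo)
```

For the Random Forest this works out to a typical error of roughly 3,560 price units.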


In [62]:
# Pick the model with the lowest validation MSE
best = finals.loc[finals['MSE'].idxmin(), 'MODEL']

print(f'Best model: {best}')

Best model: Random Forest Regressor


In [63]:
final_preds = pd.Series(rf.predict(encoded_test))
ids = pd.Series(range(0,len(final_preds)))
submission = pd.DataFrame({
    'id' : ids,
    'price': final_preds
})
submission

Unnamed: 0,id,price
0,0,4649.433333
1,1,9060.646667
2,2,5191.018667
3,3,61017.286667
4,4,50131.093333
...,...,...
9995,9995,4268.050000
9996,9996,4995.211111
9997,9997,85065.753333
9998,9998,54864.846667


In [None]:
submission.to_csv('submission.csv',index=False)