## EDA and Feature Engineering: Flight Price Prediction
Dataset details are available on Kaggle:
https://www.kaggle.com/datasets/shubhambathwal/flight-price-prediction


### FEATURES
The various features of the cleaned dataset are explained below:
1) Airline: The name of the airline company is stored in the airline column. It is a categorical feature having 6 different airlines.
2) Flight: Flight stores information regarding the plane's flight code. It is a categorical feature.
3) Source City: City from which the flight takes off. It is a categorical feature having 6 unique cities.
4) Departure Time: This is a derived categorical feature created by grouping time periods into bins. It stores information about the departure time and has 6 unique time labels.
5) Stops: A categorical feature with 3 distinct values that stores the number of stops between the source and destination cities.
6) Arrival Time: This is a derived categorical feature created by grouping time intervals into bins. It has six distinct time labels and keeps information about the arrival time.
7) Destination City: City where the flight will land. It is a categorical feature having 6 unique cities.
8) Class: A categorical feature that contains information on seat class; it has two distinct values: Business and Economy.
9) Duration: A continuous feature that displays the overall amount of time it takes to travel between cities in hours.
10) Days Left: This is a derived feature calculated by subtracting the booking date from the trip date.
11) Price: Target variable stores information of the ticket price.

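As a rough illustration of how a Days Left feature could be derived, the sketch below subtracts a booking date from a journey date with pandas. The frame, the column names `booking_date` and `journey_date`, and the dates themselves are hypothetical; the Kaggle dataset ships the feature already computed.

```python
import pandas as pd

# Hypothetical booking and journey dates; column names are assumptions for illustration.
bookings = pd.DataFrame({
    "booking_date": ["2022-02-10", "2022-02-10", "2022-03-01"],
    "journey_date": ["2022-02-11", "2022-03-01", "2022-03-01"],
})

# Parse both columns as datetimes, then take the difference in whole days.
booking = pd.to_datetime(bookings["booking_date"], format="%Y-%m-%d")
journey = pd.to_datetime(bookings["journey_date"], format="%Y-%m-%d")
bookings["days_left"] = (journey - booking).dt.days

print(bookings)
```

A value of 0 means the ticket was booked on the day of travel.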
In [None]:
## Core
import pandas as pd
import numpy as np
## Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
## Modeling
from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.preprocessing import StandardScaler,OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error,mean_squared_error,r2_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

import warnings
warnings.filterwarnings('ignore')

## 1) Load DATA

In [None]:
## reading data from Excel File
df= pd.read_excel('flight_price.xlsx')
df.head()

## Data Overview & Quality Check

In [None]:
# get basic info about data
df.info()

In [None]:
df.describe()

In [None]:
# Missing values
df.isnull().sum().sort_values(ascending=False)

## Basic Cleaning & consistent Casing

In [None]:
## Strip leading/trailing spaces in string columns (casing is left unchanged)
for col in df.select_dtypes(include='object').columns:
     df[col]=df[col].astype(str).str.strip()

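The cell above only strips whitespace. If consistent casing is also wanted, a minimal sketch could look like the following. The `normalize_text` helper and the sample values are hypothetical, and title case is just one possible convention.

```python
import pandas as pd

# Hypothetical sample with inconsistent spacing and casing plus one missing value.
raw = pd.DataFrame({"Airline": [" IndiGo", "indigo ", "AIR INDIA", None]})

def normalize_text(series):
    """Strip whitespace and title-case non-missing entries only."""
    # Skipping missing entries avoids turning NaN/None into the literal string 'nan'.
    mask = series.notna()
    cleaned = series.copy()
    cleaned[mask] = series[mask].astype(str).str.strip().str.title()
    return cleaned

raw["Airline"] = normalize_text(raw["Airline"])
print(raw)
```

Leaving missing values untouched keeps them visible to `df.isnull().sum()` later on.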
In [None]:
# Feature Engineering
# If Date_of_Journey is like '24/03/2019', split into day, month, year
if 'Date_of_Journey' in df.columns:
     parts=df['Date_of_Journey'].str.split('/',expand=True)
     if parts.shape[1]==3:
          df['Journey_day']=parts[0].astype(int)
          df['Journey_month']=parts[1].astype(int)
          df['Journey_year']=parts[2].astype(int)
          df.drop('Date_of_Journey',axis=1,inplace=True)

# Extract hour/minute from Dep_Time and Arrival_Time if they exist
def split_time_col(col):
     if col in df.columns:
          # keep only the time portion before any space
          df[col]=df[col].apply(lambda x:str(x).split(' ')[0])
          hh_mm=df[col].str.split(':',expand=True)
          if hh_mm.shape[1]==2:
               df[f'{col}_hour']=pd.to_numeric(hh_mm[0],errors='coerce')
               df[f'{col}_min']=pd.to_numeric(hh_mm[1],errors='coerce')
          df.drop(col,axis=1,inplace=True) 

for c in ['Dep_Time', 'Arrival_Time']:
    split_time_col(c)

# Convert Duration to total minutes if available, handling formats like '2h 50m', '5h', '45m'

def duration_to_minutes(s):
    # Parse strings such as '2h 50m', '5h' or '45m'; unparseable parts count as 0
    s=str(s).lower().strip()
    h=0
    m=0
    if 'h' in s:
        hours_part,rest=s.split('h',1)
        try:
            h = int(hours_part.strip())
        except ValueError:
            h = 0
        if 'm' in rest:
            try:
                m = int(rest.split('m')[0].strip())
            except ValueError:
                m = 0
    elif 'm' in s:
        try:
            m = int(s.split('m')[0].strip())
        except ValueError:
            m = 0
    return h*60 + m

if 'Duration' in df.columns:
    df['Duration_min'] = df['Duration'].apply(duration_to_minutes)
    df.drop('Duration', axis=1, inplace=True)
     
df.head()

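An alternative way to parse durations such as '2h 50m', '5h' and '45m' is a single regular expression. This is only a sketch: the name `duration_to_minutes_re` is hypothetical, and unlike the function above it returns `None` for unparseable strings instead of 0, so bad rows stay visible.

```python
import re

# Optional hours part followed by an optional minutes part, e.g. '2h 50m', '5h', '45m'.
_DURATION_RE = re.compile(r"^\s*(?:(\d+)\s*h)?\s*(?:(\d+)\s*m)?\s*$", re.IGNORECASE)

def duration_to_minutes_re(text):
    """Return total minutes, or None when the string is not a recognizable duration."""
    match = _DURATION_RE.match(str(text))
    # An empty string matches the pattern with both groups missing, so reject that too.
    if match is None or not any(match.groups()):
        return None
    hours, minutes = (int(g) if g else 0 for g in match.groups())
    return hours * 60 + minutes

for sample in ["2h 50m", "5h", "45m", "unknown"]:
    print(sample, "->", duration_to_minutes_re(sample))
```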


## Exploratory Data Analysis (EDA)

In [None]:
# Target distribution (Price) if present
if 'Price' in df.columns:
     plt.figure()
     df['Price'].plot(kind='hist',bins=50,edgecolor='black')
     plt.title('Distribution of Ticket Price')
     plt.xlabel("Price")
     plt.ylabel("Frequency")
     plt.show()


In [None]:
# Numeric correlations heatmap (simple)
num_cols=df.select_dtypes(include=[np.number]).columns.tolist()
if len(num_cols)>=2:  # a correlation matrix needs at least two numeric columns
     sns.set_theme(style='darkgrid')  # set the theme before drawing so it applies to this figure
     corr=df[num_cols].corr()
     plt.figure(figsize=(15,20))
     sns.heatmap(corr,annot=True)
     plt.title('Correlation Heatmap (Numeric Features)')
     plt.tight_layout()
     plt.show()

In [None]:
# Average price by key categorical features (if exist)
for cat in ['Airline', 'Source', 'Destination', 'Total_Stops', 'Route', 'Class']:
    if cat in df.columns and 'Price' in df.columns:
        plt.figure()
        df.groupby(cat)['Price'].mean().sort_values().plot(kind='bar')
        plt.title(f'Average Price by {cat}')
        plt.ylabel('Average Price')
        plt.xlabel(cat)
        plt.tight_layout()
        plt.show()

## Train Test Split and PreProcessing

In [None]:
df.head()

In [None]:
X=df.drop(columns=['Price'])
y=df['Price']

# Build the column lists from X so the target never ends up among the features
num_cols=X.select_dtypes(include=[np.number]).columns.tolist()
cat_cols=X.select_dtypes(exclude=[np.number]).columns.tolist()
num_cols,cat_cols

In [None]:
# Preprocess: scale numeric, one-hot encode categorical

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), num_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols)
    ]
)




In [None]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=42)
X_train.shape, X_test.shape


## Baseline Model - Linear Regression

In [None]:
linear_reg=Pipeline(steps=[('preprocess',preprocessor),
                    ('model',LinearRegression())])

linear_reg.fit(X_train,y_train)

y_pred=linear_reg.predict(X_test)

linear_reg_mae=mean_absolute_error(y_test,y_pred)
linear_reg_mse=mean_squared_error(y_test,y_pred)
linear_reg_rmse=np.sqrt(linear_reg_mse)
linear_reg_r2_score=r2_score(y_test,y_pred)

print(f"For Linear Regression Model -> MAE:{linear_reg_mae} | RMSE:{linear_reg_rmse} | R2_SCORE:{linear_reg_r2_score}")


In [None]:

rf = Pipeline(steps=[('preprocess', preprocessor),
                    ('model', RandomForestRegressor(
                        n_estimators=300,
                        max_depth=None,
                        random_state=42,
                        n_jobs=-1
                    ))])

rf.fit(X_train, y_train)
pred_rf = rf.predict(X_test)

mae_rf = mean_absolute_error(y_test, pred_rf)
rmse_rf = np.sqrt(mean_squared_error(y_test, pred_rf))  # 'squared=False' is deprecated since scikit-learn 1.4
r2_rf = r2_score(y_test, pred_rf)

print(f"Random Forest -> MAE: {mae_rf:.2f} | RMSE: {rmse_rf:.2f} | R^2: {r2_rf:.3f}")


In [None]:

# Extract feature names after preprocessing
ohe = rf.named_steps['preprocess'].named_transformers_['cat']
cat_feature_names = ohe.get_feature_names_out(cat_cols) if len(cat_cols) else np.array([])
feature_names = np.r_[num_cols, cat_feature_names]

importances = rf.named_steps['model'].feature_importances_

# Top 20 most important
idx = np.argsort(importances)[::-1][:20]
plt.figure()
plt.bar(range(len(idx)), importances[idx])
plt.xticks(range(len(idx)), feature_names[idx], rotation=90)
plt.title('Top 20 Feature Importances (Random Forest)')
plt.tight_layout()
plt.show()


## Conclusions

**Key Findings (examples to update based on your results):**
- Prices vary strongly by **Duration_min** and **Total_Stops**.
- **Duration_min** and **departure/arrival hour** show notable correlation with price.
- **Random Forest** outperformed Linear Regression on RMSE/R², indicating non-linear relationships and interaction effects in the data.

**Limitations:**
- Some features are derived from string parsing; data quality can affect signal.
- No hyperparameter tuning yet (Grid/Randomized Search can improve RF).

**Next Steps:**
1. Add **hyperparameter tuning** for Random Forest / try **XGBoost/LightGBM**.
2. Engineer **Days Left** (if `Date_of_Journey` and booking dates available).
3. Try **target encoding** for high-cardinality categoricals.
4. Log-transform skewed targets if necessary and repeat training.

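The log-transform idea from the last step can be sketched with scikit-learn's `TransformedTargetRegressor`, which fits on a transformed target and maps predictions back automatically. The data below is synthetic and the names `X_demo`, `y_demo` and `log_model` are placeholders; in this notebook the `regressor` argument could be one of the pipelines defined earlier.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic, right-skewed target for illustration only (not the flight data).
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 3, size=(200, 1))
y_demo = np.exp(1.0 + 0.8 * X_demo[:, 0] + rng.normal(0, 0.1, size=200))

# Fit on log1p(y) and map predictions back to the original scale with expm1.
log_model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
)
log_model.fit(X_demo, y_demo)

pred_demo = log_model.predict(X_demo[:5])
print(pred_demo)
```

Because predictions come back on the original price scale, MAE and RMSE remain directly comparable with the untransformed models.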


## Hyperparameter Tuning - Random Forest (RandomizedSearchCV)

In [None]:
from sklearn.model_selection import RandomizedSearchCV

rf_pipeline=rf

params={
    'model__n_estimators': [200, 300, 400, 500, 700],
    'model__max_depth': [None, 10, 15, 20, 30],
    'model__min_samples_split': [2, 5, 10],
    'model__min_samples_leaf': [1, 2, 4],
    # 'auto' is no longer accepted by recent scikit-learn; 1.0 (all features) is its equivalent for regressors
    'model__max_features': [1.0, 'sqrt', 0.8, 0.6]
}

randomsearch=RandomizedSearchCV(estimator=rf_pipeline,
                                param_distributions=params,
                                n_iter=20,
                                scoring='neg_mean_squared_error',
                                cv=5,
                                verbose=1,
                                random_state=42,
                                 n_jobs=-1)
randomsearch.fit(X_train,y_train)

best_rf=randomsearch.best_estimator_
best_params=randomsearch.best_params_
pred_best=randomsearch.predict(X_test)


mae_best = mean_absolute_error(y_test, pred_best)
rmse_best = np.sqrt(mean_squared_error(y_test, pred_best))  # 'squared=False' is deprecated since scikit-learn 1.4
r2_best = r2_score(y_test, pred_best)

print("Best RF params:", best_params)
print(f"Tuned Random Forest -> MAE: {mae_best:.2f} | RMSE: {rmse_best:.2f} | R^2: {r2_best:.3f}")





## Gradient Boosting Regressor (as another tree-based baseline)

In [None]:

from sklearn.ensemble import GradientBoostingRegressor

gbr = Pipeline(steps=[('preprocess', preprocessor),
                     ('model', GradientBoostingRegressor(random_state=42))])

gbr.fit(X_train, y_train)
pred_gbr = gbr.predict(X_test)

mae_gbr = mean_absolute_error(y_test, pred_gbr)
rmse_gbr = np.sqrt(mean_squared_error(y_test, pred_gbr))  # 'squared=False' is deprecated since scikit-learn 1.4
r2_gbr = r2_score(y_test, pred_gbr)

print(f"Gradient Boosting -> MAE: {mae_gbr:.2f} | RMSE: {rmse_gbr:.2f} | R^2: {r2_gbr:.3f}")


In [None]:


results = pd.DataFrame({
    'Model': ['Linear Regression', 'Random Forest', 'Tuned Random Forest', 'Gradient Boosting'],
    'MAE':   [linear_reg_mae, mae_rf, mae_best, mae_gbr],
    'RMSE':  [linear_reg_rmse, rmse_rf, rmse_best, rmse_gbr],
    'R2':    [linear_reg_r2_score, r2_rf, r2_best, r2_gbr]
}).sort_values(by='RMSE')

results



## Save Best Model For Deployment

In [None]:

import joblib

# Choose the best model based on RMSE
best_row = results.iloc[0]
best_name = best_row['Model']

if best_name == 'Tuned Random Forest':
    final_model = best_rf  # save the refit best pipeline rather than the whole search object
elif best_name == 'Random Forest':
    final_model = rf
elif best_name == 'Gradient Boosting':
    final_model = gbr
else:
    final_model = linear_reg

joblib.dump(final_model, 'flight_price_best_model.joblib')
print("Saved best model as 'flight_price_best_model.joblib'")



In [None]:
# Example: load the model and predict on the first rows of X_test (the pipeline applies preprocessing internally, so raw feature rows can be passed)
loaded = joblib.load('flight_price_best_model.joblib')
print(loaded.predict(X_test.head(3)))



- Compared multiple models; **Tuned Random Forest** and **Gradient Boosting** typically outperform the linear baseline.  
- Used **RandomizedSearchCV (5-fold)** to optimize RF hyperparameters.  
- Saved the **best model** with `joblib`, making it easy to deploy later (e.g., Streamlit).  
- Next upgrade: **XGBoost/LightGBM**, target transformation for skewed price, and time-aware validation if booking date is available.

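Time-aware validation, mentioned in the last bullet, could be sketched with scikit-learn's `TimeSeriesSplit`. The frame and its `booking_date` column are hypothetical, since the data used in this notebook does not expose a booking date. Rows have to be sorted chronologically before splitting.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical frame with a booking date, sorted so that row order follows time.
demo = pd.DataFrame({
    "booking_date": pd.date_range("2022-01-01", periods=12, freq="D"),
    "price": np.arange(12) * 100.0,
}).sort_values("booking_date").reset_index(drop=True)

tscv = TimeSeriesSplit(n_splits=3)
folds = []
for train_idx, test_idx in tscv.split(demo):
    # Each fold trains only on rows that come before its validation rows.
    folds.append((train_idx.max(), test_idx.min()))
    print(f"train up to row {train_idx.max()}, validate rows {test_idx.min()}-{test_idx.max()}")
```

The same splitter can be passed as `cv=tscv` to `RandomizedSearchCV` so that tuning never validates on bookings older than the training data.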