# Flight Prices Prediction

Predict the price of a flight from features such as airline, source and destination city, and departure and arrival time, using a machine learning model.

[Flight Price Prediction Dataset](https://www.kaggle.com/datasets/shubhambathwal/flight-price-prediction?datasetId=1957837&sortBy=voteCount)

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [2]:
df = pd.read_csv('dataset/flight price/Clean_Dataset.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300153 entries, 0 to 300152
Data columns (total 12 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   Unnamed: 0        300153 non-null  int64  
 1   airline           300153 non-null  object 
 2   flight            300153 non-null  object 
 3   source_city       300153 non-null  object 
 4   departure_time    300153 non-null  object 
 5   stops             300153 non-null  object 
 6   arrival_time      300153 non-null  object 
 7   destination_city  300153 non-null  object 
 8   class             300153 non-null  object 
 9   duration          300153 non-null  float64
 10  days_left         300153 non-null  int64  
 11  price             300153 non-null  int64  
dtypes: float64(1), int64(3), object(8)
memory usage: 27.5+ MB


Check for missing values

In [3]:
df.isna().sum()

Unnamed: 0          0
airline             0
flight              0
source_city         0
departure_time      0
stops               0
arrival_time        0
destination_city    0
class               0
duration            0
days_left           0
price               0
dtype: int64
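
The cell above counts missing values only. A duplicate check is a separate step; the sketch below shows one way to do it on a small made-up frame (`toy` and its values are illustrative, not taken from the dataset). On the real data, the `Unnamed: 0` column looks like a row index, so it would presumably need to be dropped first, e.g. `df.drop(columns='Unnamed: 0').duplicated().sum()`.

```python
import pandas as pd

# Toy frame standing in for the flight data (values are made up).
toy = pd.DataFrame({
    "airline": ["Vistara", "Vistara", "Indigo"],
    "price": [5953, 5953, 5955],
})

# duplicated() flags every row that repeats an earlier row exactly.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each repeated row.
deduplicated = toy.drop_duplicates()

print(n_duplicates, len(deduplicated))
```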

---

## Data Discovery

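Explore the dataset: the number of distinct flights per airline, the distribution of the categorical features, and summary statistics for the numeric columns.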

In [4]:
df1 = df.groupby(['flight','airline'],as_index=False).count()
df1.airline.value_counts()

Indigo       704
Air_India    218
GO_FIRST     205
SpiceJet     186
Vistara      133
AirAsia      115
Name: airline, dtype: int64
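
The counts above are the number of distinct flight codes per airline. As a sketch, the same numbers can be obtained more directly with `nunique()`; the frame below is made up purely to show that the two approaches agree.

```python
import pandas as pd

# Toy data: Indigo has two distinct flight codes, Vistara has one.
toy = pd.DataFrame({
    "airline": ["Indigo", "Indigo", "Indigo", "Vistara"],
    "flight": ["6E-1", "6E-1", "6E-2", "UK-9"],
    "price": [100, 110, 120, 130],
})

# Approach used in the notebook: one row per (flight, airline), then count rows.
via_groupby = (
    toy.groupby(["flight", "airline"], as_index=False).count()
       .airline.value_counts()
)

# Equivalent, more direct: distinct flight codes per airline.
via_nunique = toy.groupby("airline")["flight"].nunique()

print(via_nunique)
```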

In [5]:
df.groupby('class')['class'].agg('count')

class
Business     93487
Economy     206666
Name: class, dtype: int64

In [6]:
df.groupby('arrival_time')['arrival_time'].agg('count')

arrival_time
Afternoon        38139
Early_Morning    15417
Evening          78323
Late_Night       14001
Morning          62735
Night            91538
Name: arrival_time, dtype: int64

In [7]:
df.groupby('stops')['stops'].agg('count')

stops
one            250863
two_or_more     13286
zero            36004
Name: stops, dtype: int64
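
The `groupby(...).agg('count')` pattern used for `class`, `arrival_time` and `stops` can also be written with `value_counts()`. The sketch below uses a made-up `stops` column; `normalize=True` returns shares instead of raw counts, which can make class imbalance easier to read.

```python
import pandas as pd

# Toy column with the same category labels as the dataset (values are made up).
toy = pd.DataFrame({"stops": ["one", "zero", "one", "two_or_more", "one"]})

# Pattern used in the notebook.
counts = toy.groupby("stops")["stops"].agg("count")

# Equivalent counts; sort_index() restores the alphabetical order.
same_counts = toy["stops"].value_counts().sort_index()

# Fraction of rows in each category.
shares = toy["stops"].value_counts(normalize=True)

print(same_counts)
print(shares)
```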

In [8]:
df.describe()

         Unnamed: 0   duration  days_left         price
count      300153.0   300153.0   300153.0      300153.0
mean       150076.0  12.221021  26.004751  20889.660523
std    86646.852011   7.191997  13.561004  22697.767366
min             0.0       0.83        1.0        1105.0
25%         75038.0       6.83       15.0        4783.0
50%        150076.0      11.25       26.0        7425.0
75%        225114.0      16.17       38.0       42521.0
max        300152.0      49.83       49.0      123071.0


---

## Preprocessing

Convert the categorical features to integer codes with `LabelEncoder`, then split the data into training and test sets.

In [9]:
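le = LabelEncoder()
# Replace every string (object) column with integer codes;
# the same encoder object is refit on each column in turn
for col in df.columns:
    if df[col].dtype == "object":
        df[col] = le.fit_transform(df[col])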
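
Because a single `LabelEncoder` is refit for each column, only the mapping of the last encoded column survives the loop. If the mappings are needed later (for example to encode user input in an app), one option is to keep one fitted encoder per column. The sketch below assumes that design; `encoders` and `categorical_cols` are hypothetical names and the data is made up.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with two categorical columns (values are made up).
toy = pd.DataFrame({
    "airline": ["Vistara", "Indigo", "Vistara"],
    "class": ["Economy", "Business", "Economy"],
})

categorical_cols = ["airline", "class"]  # hypothetical list of columns to encode
encoders = {}                            # column name -> fitted LabelEncoder

for col in categorical_cols:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

# LabelEncoder assigns codes in sorted order of the labels.
airline_classes = list(encoders["airline"].classes_)

# Integer codes can be mapped back to the original labels.
decoded_classes = list(encoders["class"].inverse_transform([0, 1]))

print(airline_classes, decoded_classes)
```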

In [10]:
X = df.drop(['Unnamed: 0', 'flight', 'price'], axis="columns")
y = df['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
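
`train_test_split` shuffles the rows randomly, so without a seed each run produces a different split and slightly different metrics. A minimal sketch of a reproducible split, assuming a fixed `random_state` is acceptable (the value 42 and the toy arrays are arbitrary):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy features and target (10 rows, values are made up).
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# The same random_state yields the same shuffle, hence the same split.
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

identical = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(identical, split_a[1].shape)
```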

In [11]:
X_test.iloc[0]

airline              0.00
source_city          2.00
departure_time       2.00
stops                2.00
arrival_time         5.00
destination_city     0.00
class                1.00
duration             2.75
days_left           29.00
Name: 15697, dtype: float64

Scale the features to the [0, 1] range with `MinMaxScaler`.

In [12]:
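scaler = MinMaxScaler(feature_range=(0,1))
X_train = scaler.fit_transform(X_train)
# Reuse the min/max learned from the training set; refitting on the test set would scale it differently
X_test = scaler.transform(X_test)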
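
The usual convention is to fit the scaler on the training data only and then apply that same transformation to the test data. The toy example below (made-up one-column arrays) shows why: a scaler refit on the test set maps the same raw value to a different scaled value.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy one-feature data (values are made up).
train = np.array([[0.0], [10.0]])
test = np.array([[5.0], [20.0]])

scaler_demo = MinMaxScaler(feature_range=(0, 1))
train_scaled = scaler_demo.fit_transform(train)  # learns min=0, max=10
test_scaled = scaler_demo.transform(test)        # reuses min=0, max=10

# Refitting on the test set learns min=5, max=20 instead.
refit_scaled = MinMaxScaler().fit_transform(test)

print(test_scaled.ravel())   # scaled with training statistics
print(refit_scaled.ravel())  # scaled with test statistics
```

With the training statistics, the raw value 5.0 becomes 0.5; after refitting on the test set it becomes 0.0, so the model would see inconsistent inputs. Values outside the training range (here 20.0) can fall outside [0, 1], which is expected.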

## Model Building

Fit a `RandomForestRegressor` with its default hyperparameters on the scaled training data.

In [13]:
# pipeline = Pipeline([
#     ('standard scaler', MinMaxScaler(feature_range=(0,1))),
#     ('estimator', RandomForestRegressor())
# ])

# pipeline.fit(X_train, y_train)

model = RandomForestRegressor()
model.fit(X_train, y_train)
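
The commented-out pipeline above is an alternative to scaling by hand: a `Pipeline` fits the scaler on the training data and reapplies it automatically at prediction time. The sketch below runs that idea on synthetic data; the step names, `n_estimators=20`, the seeds and the data itself are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic regression data: the target depends mostly on the first feature.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 50, size=(200, 3))
y_demo = 100 * X_demo[:, 0] + rng.normal(0, 1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

pipeline = Pipeline([
    ("scaler", MinMaxScaler(feature_range=(0, 1))),
    ("estimator", RandomForestRegressor(n_estimators=20, random_state=0)),
])

# fit() fits the scaler on X_tr only, then trains the forest on the scaled data.
pipeline.fit(X_tr, y_tr)

# predict() applies the already-fitted scaler before calling the forest.
preds = pipeline.predict(X_te)
print(preds.shape, pipeline.score(X_te, y_te))
```

A side benefit is that the whole pipeline can be saved as one object, so the scaler and the model cannot get out of sync.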

## Model Evaluation

Evaluate the model on the test set using regression metrics: MAE, MSE, RMSE and R².

In [19]:
y_pred = model.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

# Printed in order: MAE, MSE, RMSE, R^2
print(mae)
print(mse)
print(rmse)
print(r2)

1082.7747339959374
7496958.84153621
2738.0574941984346
0.9853304100325447
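
To make the four numbers easier to interpret, the sketch below computes each metric by hand on a tiny made-up example and checks the result against scikit-learn. MAE and RMSE are in the same units as the price, MSE is in squared units, and R² is the fraction of the target's variance explained by the model.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy targets and predictions (values are made up).
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 330.0, 400.0])

errors = y_true - y_hat                   # [-10, 10, -30, 0]

mae_manual = np.mean(np.abs(errors))      # mean absolute error
mse_manual = np.mean(errors ** 2)         # mean squared error
rmse_manual = np.sqrt(mse_manual)         # root mean squared error

ss_res = np.sum(errors ** 2)                      # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(mae_manual, mse_manual, rmse_manual, r2_manual)
```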


## Export Model

Save the trained model for use in the app (see the `app` folder).

In [None]:
# import pickle

# with open('flight_prices_model.pkl', 'wb') as f:
#     pickle.dump(model, f)
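
The model expects scaled inputs, so an app that loads only the model would also need the fitted scaler. One possible approach, sketched below on synthetic data, is to pickle both objects together. The dictionary keys, the temporary directory and the toy data are assumptions for illustration; only the file name is taken from the cell above.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the training data.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 50, size=(50, 3))
y_demo = X_demo.sum(axis=1)

scaler_demo = MinMaxScaler().fit(X_demo)
model_demo = RandomForestRegressor(n_estimators=10, random_state=0)
model_demo.fit(scaler_demo.transform(X_demo), y_demo)

# Save the scaler and the model in one file so they stay in sync.
path = os.path.join(tempfile.mkdtemp(), "flight_prices_model.pkl")
with open(path, "wb") as f:
    pickle.dump({"scaler": scaler_demo, "model": model_demo}, f)

# Load the bundle back, as an app would, and predict on raw features.
with open(path, "rb") as f:
    bundle = pickle.load(f)

restored = bundle["model"].predict(bundle["scaler"].transform(X_demo[:5]))
original = model_demo.predict(scaler_demo.transform(X_demo[:5]))
print(np.allclose(restored, original))
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that created them.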