# Air Fare Prediction 

__Air travel has grown sharply over the last few decades, and almost every passenger tries to buy a ticket at the lowest possible fare.__

&ensp;

<div>
<img src = https://img.washingtonpost.com/rf/image_1484w/2010-2019/WashingtonPost/2017/06/29/Interactivity/Images/iStock-626867464.JPG?uuid=zUBRYlq-EeeqaTlkp9VSBw width="800">
</div>
 <center> Flight Fares </center>

&ensp;


To get the cheapest fare, the basic rule is to plan travel well in advance, but that does not guarantee the cheapest ticket, and planning ahead is not always possible. Fares are also difficult to predict because they change frequently.

This notebook is an attempt to predict air fares for various airlines in India.

## Import required Libraries

Let's import relevant libraries.

In [5]:
pip install xgboost

Collecting xgboost
  Downloading xgboost-1.2.0-py3-none-win_amd64.whl (86.5 MB)
Installing collected packages: xgboost
Successfully installed xgboost-1.2.0
Note: you may need to restart the kernel to use updated packages.


In [4]:
## Import library
%matplotlib inline

import warnings  # To suppress warnings
warnings.filterwarnings('ignore')
warnings.filterwarnings('ignore', category=DeprecationWarning)

import numpy as np  # Linear algebra
import pandas as pd  # Data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt  # MATLAB-style plotting
import seaborn as sns  # Visualisation
from datetime import datetime
from scipy import stats
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import RidgeCV
from sklearn.linear_model import LassoCV
from sklearn.linear_model import ElasticNetCV
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor


## Datasets

We have two datasets, 'Train' and 'Test', both of which contain categorical and numerical variables. They share the same columns, except that 'Test' has no 'Price' column because that is the value to be predicted. Let us load the datasets.

In [None]:
## Load datasets

train_df = pd.read_excel('Data_Train.xlsx')
test_df = pd.read_excel('Test_set.xlsx')

### Train Dataset

In [None]:
train_df.shape

In [None]:
train_df.sample(5)

### Test Dataset

In [None]:
test_df.shape

In [None]:
test_df.sample(5)

### Combining datasets into one dataset

We will combine the train and test datasets so that the same preprocessing is applied to both at once.

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat works across versions
combined_df = pd.concat([train_df, test_df])
combined_df.reset_index(inplace=True)

In [None]:
combined_df.shape

In [None]:
combined_df.sample(3)
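
To make the combine-then-split idea concrete, here is a minimal sketch on two tiny made-up frames (the values and the `toy_` names are illustrative and not taken from the dataset). Because the test frame has no 'Price' column, its rows come out of the concatenation with missing prices, and the train row count marks where the two parts can be separated again later. Unlike the cell above, this sketch uses `ignore_index=True`, so the old index is not kept as a column.

```python
import pandas as pd

# Toy stand-ins for the real train and test frames (values are made up).
toy_train = pd.DataFrame({"Airline": ["IndiGo", "Air India"], "Price": [3897, 7662]})
toy_test = pd.DataFrame({"Airline": ["SpiceJet"]})  # no 'Price' column

# Stack train on top of test; ignore_index gives a fresh 0..n-1 index.
toy_combined = pd.concat([toy_train, toy_test], ignore_index=True)

# Rows that came from the test frame have a missing price.
n_train = len(toy_train)
print(toy_combined)
print(toy_combined["Price"].isna().tolist())  # [False, False, True]
```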

## Data Wrangling, EDA and Feature Engineering

In [None]:
combined_df.info()

The "Price" column is the __Target Variable__, which has to be predicted for the test dataset. The other variables are features:

- Additional_Info: __Type of meal or any other service the passenger opts for__
- Airline: __Name of the airline__
- Arrival_Time
- Date_of_Journey
- Dep_Time: __Time of departure__
- Destination
- Source
- Duration: __Total duration of the flight__
- Route: __Cities the flight travels via__
- Total_Stops: __Total number of stops in the journey__


In [None]:
combined_df.describe()

_Only the 'index' and 'Price' columns are in numeric format._

### Processing Date Column

The 'Date_of_Journey' column holds dates in dd/mm/yyyy format, but its datatype is object. We need to convert this column to the datetime datatype.

In [None]:
combined_df['Date_of_Journey'] = pd.to_datetime(combined_df['Date_of_Journey'], format='%d/%m/%Y')

In [None]:
## Splitting Date 

combined_df['Date'] = combined_df['Date_of_Journey'].dt.day.astype(int)
combined_df['Month'] = combined_df['Date_of_Journey'].dt.month.astype(int)
combined_df['Year'] = combined_df['Date_of_Journey'].dt.year.astype(int)

In [None]:
combined_df.sample(2)
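
As a small illustration of why the explicit format string matters, the sketch below parses two made-up dates written in the same dd/mm/yyyy layout. The frame `toy` and its values are assumptions for demonstration only.

```python
import pandas as pd

# Made-up journey dates in the same dd/mm/yyyy layout.
toy = pd.DataFrame({"Date_of_Journey": ["24/03/2019", "01/05/2019"]})

# An explicit format avoids day/month ambiguity (01/05 is 1 May, not 5 January).
parsed = pd.to_datetime(toy["Date_of_Journey"], format="%d/%m/%Y")

# The .dt accessor exposes the individual date components as integers.
toy["Date"] = parsed.dt.day
toy["Month"] = parsed.dt.month
toy["Year"] = parsed.dt.year
print(toy)
```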

__As we have extracted Date, Month & Year from 'Date_of_Journey' column, we can drop this column.__

In [None]:
combined_df = combined_df.drop(['Date_of_Journey'], axis=1)

In [None]:
combined_df.sample(2)

### Processing Price Column

In [None]:
combined_df.isna().sum()

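The test rows have no price, so the 'Price' column contains NA values. Let's replace them with the mean price as a placeholder; the 'Price' column is dropped from the test rows again before modelling.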

In [None]:
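# Assign the result back; fillna(inplace=True) on a selected column is unreliable in recent pandas
combined_df['Price'] = combined_df['Price'].fillna(combined_df['Price'].mean())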
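
The sketch below shows what the mean fill does on a tiny made-up column. The `is_test` flag is a hypothetical helper, not part of the notebook; it records which rows had no price so that later statistics or plots could be restricted to real prices if desired.

```python
import numpy as np
import pandas as pd

# The last two rows mimic test rows without a price (values are made up).
toy = pd.DataFrame({"Price": [4000.0, 6000.0, np.nan, np.nan]})

# Remember which rows had no price before filling (hypothetical helper column).
toy["is_test"] = toy["Price"].isna()

# mean() skips NaN, so this is (4000 + 6000) / 2.
train_mean = toy["Price"].mean()
toy["Price"] = toy["Price"].fillna(train_mean)

print(toy)
print(toy.loc[~toy["is_test"], "Price"].mean())  # statistics on real prices only
```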

### Plot 

Let's plot ticket price against the day of the month to look for a pattern.

In [None]:
sns.jointplot(x="Date", y="Price", kind='reg', data=combined_df);

The plot above suggests that ticket fares are slightly cheaper in the middle of the month than at the start or end. Most tickets fall in the range Rs. 1,700 to Rs. 18,000. There is a slight negative correlation between ticket price and date.

### Processing Arrival_Time & Dep_Time Columns

__The 'Arrival_Time' column combines the time with a date. We don't need the date, so we strip it and keep only the time.__

In [None]:
combined_df['Arrival_Time'] = combined_df['Arrival_Time'].str.split(' ').str[0]

Extracting 'Hour' and 'Minute' into separate columns from 'Arrival_Time' and 'Dep_Time'.

In [None]:
combined_df['Arrival_Hour'] = combined_df['Arrival_Time'].str.split(':').str[0].astype(int)
combined_df['Arrival_Minute'] = combined_df['Arrival_Time'].str.split(':').str[1].astype(int)

In [None]:
combined_df = combined_df.drop(['Arrival_Time'], axis=1)

In [None]:
combined_df['Dep_Hour'] = combined_df['Dep_Time'].str.split(':').str[0].astype(int)
combined_df['Dep_Minute'] = combined_df['Dep_Time'].str.split(':').str[1].astype(int)
combined_df = combined_df.drop(['Dep_Time'], axis=1)
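
The time handling above can be tried on a few made-up strings. The sample values below are assumptions about how the columns look (an HH:MM time, sometimes followed by a date); the frame `toy` is illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({
    "Dep_Time": ["22:20", "05:50"],
    "Arrival_Time": ["01:10 22 Mar", "13:15"],  # some arrivals carry a date suffix
})

# Keep only the HH:MM token, then split it into integer hour and minute.
arrival = toy["Arrival_Time"].str.split(" ").str[0]
toy["Arrival_Hour"] = arrival.str.split(":").str[0].astype(int)
toy["Arrival_Minute"] = arrival.str.split(":").str[1].astype(int)

# expand=True returns the two parts as separate columns in a single split.
dep = toy["Dep_Time"].str.split(":", expand=True).astype(int)
toy["Dep_Hour"], toy["Dep_Minute"] = dep[0], dep[1]
print(toy)
```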

### Processing 'Total Stop' Column

__The 'Total_Stops' column contains values such as '2 stops', '1 stop' and 'non-stop'. We will fill missing values with '1 stop', replace 'non-stop' with '0 stop', and keep only the leading integer.__

In [None]:
combined_df['Total_Stops'] = combined_df['Total_Stops'].fillna('1 stop')

In [None]:
combined_df['Total_Stops'] = combined_df['Total_Stops'].replace('non-stop', '0 stop')

In [None]:
combined_df['Stop'] = combined_df['Total_Stops'].str.split(' ').str[0].astype(int)

In [None]:
# 'Total_Stops' can be dropped as we have extracted numeric values in 'Stop' 

combined_df=combined_df.drop(['Total_Stops'], axis=1) 
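
The three steps above (fill, replace, extract) can be chained. A minimal sketch on a made-up series, assuming the same string layout as the real column:

```python
import numpy as np
import pandas as pd

toy = pd.Series(["non-stop", "1 stop", "2 stops", np.nan], name="Total_Stops")

stops = (
    toy.fillna("1 stop")                 # same default as the notebook
       .replace("non-stop", "0 stop")    # make the leading token numeric
       .str.split(" ").str[0]            # keep the token before the space
       .astype(int)
)
print(stops.tolist())  # [0, 1, 2, 1]
```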

### Processing Route Column

We need to split the 'Route' column on the "→" symbol and extract the city names.

In [None]:
# Split each route once, then take up to five legs
route_parts = combined_df['Route'].str.split('→ ')
for i in range(5):
    combined_df[f'Route_{i + 1}'] = route_parts.str[i]

Let's replace NA values in the 'Route_n' columns with the string "None".

In [None]:
# Assign back instead of using inplace=True on a selected column
route_cols = ['Route_1', 'Route_2', 'Route_3', 'Route_4', 'Route_5']
combined_df[route_cols] = combined_df[route_cols].fillna("None")

In [None]:
combined_df.describe()
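
For comparison, here is a sketch of the route split on two made-up routes using `expand=True`. It differs from the cells above in two ways that are choices of this example: it strips the spaces around each city code, and it creates only as many columns as the longest route needs.

```python
import pandas as pd

toy = pd.DataFrame({"Route": ["BLR → DEL", "CCU → IXR → BBI → BLR"]})

# Split once on the arrow; strip() removes the spaces around each city code.
parts = toy["Route"].str.split("→", expand=True)
parts = parts.apply(lambda col: col.str.strip())

# Name the columns Route_1..Route_k and mark missing legs with the string "None".
parts.columns = [f"Route_{i + 1}" for i in range(parts.shape[1])]
parts = parts.fillna("None")
print(parts)
```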

### Encoding 

#### Integer Encoding

We will encode the categorical data in our dataset as integers using a label encoder.

For this we import LabelEncoder from the sklearn library, then fit and transform each column.

In [None]:
from sklearn.preprocessing import LabelEncoder

lb_encode = LabelEncoder()
categorical_cols = ["Additional_Info", "Airline", "Destination", "Source",
                    "Route_1", "Route_2", "Route_3", "Route_4", "Route_5"]
for col in categorical_cols:
    # fit_transform refits the encoder, so each column gets its own mapping
    combined_df[col] = lb_encode.fit_transform(combined_df[col])

In [None]:
combined_df.sample(5)
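
To see what the label encoder actually produces, the sketch below encodes a made-up airline column. The series `toy` is illustrative. Note that reusing a single encoder across columns, as the cell above does, overwrites its mapping each time, so a separate encoder per column would be needed to decode values later.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.Series(["IndiGo", "Air India", "SpiceJet", "IndiGo"], name="Airline")

encoder = LabelEncoder()
codes = encoder.fit_transform(toy)

# classes_ is sorted, so the integer codes follow alphabetical order.
print(list(encoder.classes_))                # ['Air India', 'IndiGo', 'SpiceJet']
print(codes.tolist())                        # [1, 0, 2, 1]
print(list(encoder.inverse_transform([2])))  # ['SpiceJet']
```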

#### One hot encoding - Dummy encoding

In addition to integer encoding, we will apply dummy (one-hot) encoding so that the model does not assume a natural ordering between categories, which may result in poor performance.

Here the integer-encoded variable is removed and a new binary variable is added for each unique integer value.

We will apply the get_dummies function from the pandas library to each column and then drop the original column.

In [None]:
Additional_Info_dummies = pd.get_dummies(combined_df["Additional_Info"], prefix='Additional_Info')    
combined_df = pd.concat([combined_df, Additional_Info_dummies], axis=1)
combined_df.drop('Additional_Info', axis=1, inplace=True)

In [None]:
Airline_dummies = pd.get_dummies(combined_df["Airline"], prefix='Airline')    
combined_df = pd.concat([combined_df, Airline_dummies], axis=1)
combined_df.drop('Airline', axis=1, inplace=True)

In [None]:
Destination_dummies = pd.get_dummies(combined_df["Destination"], prefix='Destination')    
combined_df = pd.concat([combined_df, Destination_dummies], axis=1)
combined_df.drop('Destination', axis=1, inplace=True)

In [None]:
Source_dummies = pd.get_dummies(combined_df["Source"], prefix='Source')    
combined_df = pd.concat([combined_df, Source_dummies], axis=1)
combined_df.drop('Source', axis=1, inplace=True)

In [None]:
Route_1_dummies = pd.get_dummies(combined_df["Route_1"], prefix='Route_1')    
combined_df = pd.concat([combined_df, Route_1_dummies], axis=1)
combined_df.drop('Route_1', axis=1, inplace=True)

In [None]:
Route_2_dummies = pd.get_dummies(combined_df["Route_2"], prefix='Route_2')    
combined_df = pd.concat([combined_df, Route_2_dummies], axis=1)
combined_df.drop('Route_2', axis=1, inplace=True)

In [None]:
Route_3_dummies = pd.get_dummies(combined_df["Route_3"], prefix='Route_3')    
combined_df = pd.concat([combined_df, Route_3_dummies], axis=1)
combined_df.drop('Route_3', axis=1, inplace=True)

In [None]:
Route_4_dummies = pd.get_dummies(combined_df["Route_4"], prefix='Route_4')    
combined_df = pd.concat([combined_df, Route_4_dummies], axis=1)
combined_df.drop('Route_4', axis=1, inplace=True)

In [None]:
Route_5_dummies = pd.get_dummies(combined_df["Route_5"], prefix='Route_5')    
combined_df = pd.concat([combined_df, Route_5_dummies], axis=1)
combined_df.drop('Route_5', axis=1, inplace=True)

In [None]:
combined_df=combined_df.drop(['Route'], axis=1)
combined_df=combined_df.drop(['Duration'], axis=1)

In [None]:
combined_df.sample(5)
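
The per-column cells above can also be written as a single call. A minimal sketch on made-up data follows; it assumes the columns are still strings, since `pd.get_dummies` does not need them integer-encoded first. The frame `toy` and its values are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "Airline": ["IndiGo", "Air India", "IndiGo"],
    "Source": ["Delhi", "Kolkata", "Delhi"],
    "Stop": [0, 2, 1],
})

# One call encodes every listed column, prefixes the new columns with the
# source column name, and drops the originals.
encoded = pd.get_dummies(toy, columns=["Airline", "Source"], dtype=int)
print(encoded.columns.tolist())
```

Columns that are not listed, such as 'Stop' here, pass through unchanged.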

### Missing value validation

In [None]:
combined_df.isna().sum()

## TEST TRAIN SPLIT

Let's separate out train set and test set from the combined dataset

In [None]:
# Split it into test and train

n_train = len(train_df)  # train rows come first in combined_df (10683 rows here)
df_train = combined_df[:n_train]
df_test = combined_df[n_train:]
df_test = df_test.drop(['Price'], axis=1)

In [None]:
X = df_train.drop(['Price'], axis=1)
y = df_train.Price

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

## MODEL BUILDING 

We will try different models and compare the "RMSE" score for each model.

We are going to try the following ML algorithms:
- Linear Regression
- Ridge Regression
- Lasso Regression
- Elastic Net Regularization
- Extreme Gradient Boosting (XGBoost)
- Light GBM


In [None]:
lin_reg = LinearRegression() #LinearRegression
rig_cv = RidgeCV() #Ridge Regression
lasso = LassoCV() #Lasso Regression
elastic = ElasticNetCV() #Elastic Net Regularization
xgb = XGBRegressor() #Extreme Gradient Boosting (XGBoost)
lig_gbm = LGBMRegressor() #Light GBM

models = [lin_reg, rig_cv, lasso, elastic, xgb, lig_gbm]

In [None]:
#Build our cross validation method
kfolds = KFold(n_splits=50,shuffle=True, random_state=0)

In [None]:
def cv_rmse(model):
    # cross_val_score returns negated MSE, so flip the sign before the square root
    rmse = np.sqrt(-cross_val_score(model, X_train, y_train,
                                    scoring="neg_mean_squared_error",
                                    cv=kfolds))
    return rmse
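
The sign flip in `cv_rmse` can be checked on synthetic data. The sketch below is an illustration under stated assumptions: the data are randomly generated, the model is plain linear regression, and 5 folds are used instead of the notebook's 50 to keep it quick.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data: y = 3*x0 - 2*x1 + noise (made up for illustration).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = 3 * X_toy[:, 0] - 2 * X_toy[:, 1] + rng.normal(scale=0.5, size=200)

folds = KFold(n_splits=5, shuffle=True, random_state=0)

# scikit-learn maximises scores, so MSE is reported negated; flip the sign
# before taking the square root to get one RMSE per fold.
neg_mse = cross_val_score(LinearRegression(), X_toy, y_toy,
                          scoring="neg_mean_squared_error", cv=folds)
rmse_per_fold = np.sqrt(-neg_mse)
print(rmse_per_fold.mean())
```

Since the noise has standard deviation 0.5, the mean RMSE should land close to that value.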

In [None]:
from sklearn.metrics import mean_squared_error

acc = []  # mean cross-validated RMSE per model, in the order of `models`
for model in models:
    print('Cross-validation of : {0}'.format(model.__class__))
    score = cv_rmse(model).mean()
    acc.append(score)
    print('CV score = {0}'.format(score))
    print('****')

From the above, after applying the different regression models, we can see that XGBoost performs better than the others.

So we will use __'XGBoost'__ to predict our test data.

In [None]:
xgb.fit(X_train,y_train)
y_pred_xgb = xgb.predict(X_test)
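
The cell above computes `y_pred_xgb` but does not score it. One way to do so is to compute the RMSE on the hold-out set; the sketch below shows the arithmetic on made-up numbers. The helper name `rmse` is hypothetical, and in the notebook it would be called as `rmse(y_test, y_pred_xgb)`.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up fares and predictions, only to show the arithmetic:
# errors are -300, 400 and 0, so the mean squared error is 250000 / 3.
y_true_toy = [4000, 6000, 8000]
y_pred_toy = [4300, 5600, 8000]
print(rmse(y_true_toy, y_pred_toy))
```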

Let's apply above model to predict "Price" for original test dataset.

In [None]:
df_test_xgb = df_test.copy()  # copy so the new column is not added to df_test itself
xgb_pred = xgb.predict(df_test)
df_test_xgb['Price'] = xgb_pred
df_test_xgb.to_csv('flight_price_pred.csv')

## To Conclude

In this kind of work, feature engineering plays an important role. Here we also used two encoding techniques (integer and dummy encoding) to improve our model's performance.

We compared the RMSE score of each model and then applied the model with the best RMSE score to our test dataset.

Advanced techniques such as pipelines and stacking can be used to improve the performance of the model.

Further, hyperparameter tuning can be used to fine-tune the algorithm and get the best performance from the model.
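
As a rough idea of what pipeline-based tuning could look like, here is a sketch using scikit-learn's `Pipeline` and `GridSearchCV` on synthetic data. It swaps in Ridge regression for XGBoost to stay dependency-free, and the parameter grid is an arbitrary assumption rather than a recommendation.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with a known linear signal (made up for illustration).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(120, 3))
y_toy = X_toy @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=120)

pipe = Pipeline([("scale", StandardScaler()), ("model", Ridge())])

# Parameters of a pipeline step are addressed as '<step name>__<parameter>'.
search = GridSearchCV(
    pipe,
    param_grid={"model__alpha": [0.01, 0.1, 1.0, 10.0]},
    scoring="neg_root_mean_squared_error",
    cv=5,
)
search.fit(X_toy, y_toy)
print(search.best_params_, -search.best_score_)
```

Putting the scaler inside the pipeline means it is refit on each training fold, which keeps the cross-validation score free of leakage from the validation fold.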

Source :

https://medium.com/code-to-express/flight-price-prediction-7c83616a13bb



In [None]:
# Save the fitted XGBoost model for later use
import pickle

with open('fppmodel.pkl', 'wb') as f:
    pickle.dump(xgb, f)
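
A saved model is only useful if it loads back and predicts the same values. The sketch below shows the round trip with a small linear model standing in for the trained regressor; the temporary file name and the toy data are assumptions of this example.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# A small fitted model stands in for the trained XGBoost regressor.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([1.0, 3.0, 5.0, 7.0])  # y = 2x + 1
toy_model = LinearRegression().fit(X_toy, y_toy)

path = os.path.join(tempfile.mkdtemp(), "toy_model.pkl")  # hypothetical file name

# 'with' closes the file handle even if pickling fails.
with open(path, "wb") as f:
    pickle.dump(toy_model, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored.predict([[4.0]]))  # same prediction as toy_model
```

Pickle files should only be loaded from trusted sources, and ideally with the same library versions that were used to save them.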