Predict The Flight Ticket Price 
Flight ticket prices are hard to guess: the price we see today for a flight may be quite different when we check the same flight tomorrow, and travellers often complain that fares are unpredictable. This dataset provides flight ticket prices for various airlines and city pairs between March and June 2019.

Size of training set: 10683 records

Size of test set: 2671 records

FEATURES:
Airline: The name of the airline.

Date_of_Journey: The date of the journey.

Source: The source from which the service begins.

Destination: The destination where the service ends.

Route: The route taken by the flight to reach the destination.

Dep_Time: The time when the journey starts from the source.

Arrival_Time: Time of arrival at the destination.

Duration: Total duration of the flight.

Total_Stops: Total stops between the source and destination.

Additional_Info: Additional information about the flight.

Price: The price of the ticket (the target variable).



In [63]:
#import the libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from tabulate import tabulate
from sklearn import metrics
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from scipy.stats import skew
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline


In [64]:
flightdata=pd.read_csv('Flight_Data.csv',header=0)
flightdata


Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price
0,IndiGo,24/03/2019,Banglore,New Delhi,BLR ? DEL,22:20,01:10 22 Mar,2h 50m,non-stop,No info,3897
1,Air India,1/05/2019,Kolkata,Banglore,CCU ? IXR ? BBI ? BLR,05:50,13:15,7h 25m,2 stops,No info,7662
2,Jet Airways,9/06/2019,Delhi,Cochin,DEL ? LKO ? BOM ? COK,09:25,04:25 10 Jun,19h,2 stops,No info,13882
3,IndiGo,12/05/2019,Kolkata,Banglore,CCU ? NAG ? BLR,18:05,23:30,5h 25m,1 stop,No info,6218
4,IndiGo,01/03/2019,Banglore,New Delhi,BLR ? NAG ? DEL,16:50,21:35,4h 45m,1 stop,No info,13302
...,...,...,...,...,...,...,...,...,...,...,...
10678,Air Asia,9/04/2019,Kolkata,Banglore,CCU ? BLR,19:55,22:25,2h 30m,non-stop,No info,4107
10679,Air India,27/04/2019,Kolkata,Banglore,CCU ? BLR,20:45,23:20,2h 35m,non-stop,No info,4145
10680,Jet Airways,27/04/2019,Banglore,Delhi,BLR ? DEL,08:20,11:20,3h,non-stop,No info,7229
10681,Vistara,01/03/2019,Banglore,New Delhi,BLR ? DEL,11:30,14:10,2h 40m,non-stop,No info,12648


In [65]:
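# treat cells that contain only '?' as missing (np.NaN was removed in NumPy 2.0; np.nan is the supported spelling)
flightdata = flightdata.replace('?', np.nan)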

In [66]:
flightdata.isnull().sum()

Airline            0
Date_of_Journey    0
Source             0
Destination        0
Route              1
Dep_Time           0
Arrival_Time       0
Duration           0
Total_Stops        1
Additional_Info    0
Price              0
dtype: int64

In [67]:
flightdata.keys()

Index(['Airline', 'Date_of_Journey', 'Source', 'Destination', 'Route',
       'Dep_Time', 'Arrival_Time', 'Duration', 'Total_Stops',
       'Additional_Info', 'Price'],
      dtype='object')

In [68]:
# From inspecting the data, the ticket price appears to depend on Source, Destination
# and Total_Stops, so the route, time and additional-info columns are dropped.
flightdata.drop(['Route', 'Dep_Time', 'Arrival_Time', 'Additional_Info'], axis=1, inplace=True)

In [69]:
flightdata

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Duration,Total_Stops,Price
0,IndiGo,24/03/2019,Banglore,New Delhi,2h 50m,non-stop,3897
1,Air India,1/05/2019,Kolkata,Banglore,7h 25m,2 stops,7662
2,Jet Airways,9/06/2019,Delhi,Cochin,19h,2 stops,13882
3,IndiGo,12/05/2019,Kolkata,Banglore,5h 25m,1 stop,6218
4,IndiGo,01/03/2019,Banglore,New Delhi,4h 45m,1 stop,13302
...,...,...,...,...,...,...,...
10678,Air Asia,9/04/2019,Kolkata,Banglore,2h 30m,non-stop,4107
10679,Air India,27/04/2019,Kolkata,Banglore,2h 35m,non-stop,4145
10680,Jet Airways,27/04/2019,Banglore,Delhi,3h,non-stop,7229
10681,Vistara,01/03/2019,Banglore,New Delhi,2h 40m,non-stop,12648


In [70]:
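# NOTE: the dates are written day-first (dd/mm/yyyy). Called without dayfirst=True,
# the pandas version used here read ambiguous values such as 1/05/2019 month-first
# (5 January), and the outputs recorded in this notebook reflect that parsing.
flightdata['Date_of_Journey'] = pd.to_datetime(flightdata['Date_of_Journey'])
flightdata['WeekDay'] = flightdata['Date_of_Journey'].dt.weekday  # 0-6 = Monday to Sunday
flightdata['WeekDay']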

0        6
1        5
2        4
3        3
4        3
        ..
10678    2
10679    5
10680    5
10681    3
10682    3
Name: WeekDay, Length: 10683, dtype: int64
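
The Date_of_Journey values are day-first, so an explicit format avoids any day/month ambiguity. The following is a minimal sketch on a few sample values copied from the preview above, not the parsing that produced the recorded outputs; the names `sample`, `parsed` and `check` are illustrative only.

```python
import pandas as pd

# A few values copied from the Date_of_Journey column shown earlier (dd/mm/yyyy).
sample = pd.Series(["24/03/2019", "1/05/2019", "9/06/2019", "01/03/2019"])

# An explicit format removes the day/month ambiguity entirely.
parsed = pd.to_datetime(sample, format="%d/%m/%Y")

check = pd.DataFrame({
    "raw": sample,
    "parsed": parsed,
    "weekday": parsed.dt.weekday,          # 0 = Monday ... 6 = Sunday
    "is_weekend": parsed.dt.weekday >= 5,  # Saturday or Sunday
})
print(check)
```

With this format every parsed date falls between March and June 2019, which matches the dataset description.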

In [71]:
# flag weekend journeys (Saturday = 5, Sunday = 6)
flightdata['WeekEnd'] = np.where(flightdata['WeekDay'] >= 5, 'Yes', 'No')

In [72]:
flightdata

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Duration,Total_Stops,Price,WeekDay,WeekEnd
0,IndiGo,2019-03-24,Banglore,New Delhi,2h 50m,non-stop,3897,6,Yes
1,Air India,2019-01-05,Kolkata,Banglore,7h 25m,2 stops,7662,5,Yes
2,Jet Airways,2019-09-06,Delhi,Cochin,19h,2 stops,13882,4,No
3,IndiGo,2019-12-05,Kolkata,Banglore,5h 25m,1 stop,6218,3,No
4,IndiGo,2019-01-03,Banglore,New Delhi,4h 45m,1 stop,13302,3,No
...,...,...,...,...,...,...,...,...,...
10678,Air Asia,2019-09-04,Kolkata,Banglore,2h 30m,non-stop,4107,2,No
10679,Air India,2019-04-27,Kolkata,Banglore,2h 35m,non-stop,4145,5,Yes
10680,Jet Airways,2019-04-27,Banglore,Delhi,3h,non-stop,7229,5,Yes
10681,Vistara,2019-01-03,Banglore,New Delhi,2h 40m,non-stop,12648,3,No


In [73]:
flightdata['Total_Stops'].unique()

array(['non-stop', '2 stops', '1 stop', '3 stops', nan, '4 stops'],
      dtype=object)

In [74]:
# Total_Stops is categorical: fill its single missing value with the most frequent category
flightdata['Total_Stops'] = flightdata['Total_Stops'].fillna(flightdata['Total_Stops'].value_counts().index[0])


In [75]:
# Apply LabelEncoder to each categorical column
from sklearn.preprocessing import LabelEncoder
transcol = ['Total_Stops', 'WeekEnd', 'Source', 'Destination', 'Airline']
for col in flightdata.columns:
    if col in transcol:
        print(col)
        flightdata[col] = LabelEncoder().fit_transform(flightdata[col])

Airline
Source
Destination
Total_Stops
WeekEnd


In [76]:
flightdata

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Duration,Total_Stops,Price,WeekDay,WeekEnd
0,3,2019-03-24,0,5,2h 50m,4,3897,6,1
1,1,2019-01-05,3,0,7h 25m,1,7662,5,1
2,4,2019-09-06,2,1,19h,1,13882,4,0
3,3,2019-12-05,3,0,5h 25m,0,6218,3,0
4,3,2019-01-03,0,5,4h 45m,0,13302,3,0
...,...,...,...,...,...,...,...,...,...
10678,0,2019-09-04,3,0,2h 30m,4,4107,2,0
10679,1,2019-04-27,3,0,2h 35m,4,4145,5,1
10680,4,2019-04-27,0,2,3h,4,7229,5,1
10681,10,2019-01-03,0,5,2h 40m,4,12648,3,0
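
LabelEncoder numbers the categories in sorted order, which is why 'non-stop' appears as 4 and '1 stop' as 0 in the table above, and why the airlines receive an arbitrary numeric order. As a possible alternative (not what this notebook does), the sketch below maps Total_Stops to its real stop count and one-hot encodes a nominal column. The `toy` frame and the `stops_order` mapping are assumptions made for illustration.

```python
import pandas as pd

# Toy rows shaped like the columns above (values taken from the preview).
toy = pd.DataFrame({
    "Airline": ["IndiGo", "Air India", "Jet Airways", "IndiGo"],
    "Total_Stops": ["non-stop", "2 stops", "2 stops", "1 stop"],
})

# Hypothetical explicit mapping so the code equals the actual number of stops.
stops_order = {"non-stop": 0, "1 stop": 1, "2 stops": 2, "3 stops": 3, "4 stops": 4}
toy["Total_Stops"] = toy["Total_Stops"].map(stops_order)

# One-hot encode the nominal column so no artificial ordering is implied.
encoded = pd.get_dummies(toy, columns=["Airline"], dtype=int)
print(encoded)
```

For a linear model this matters because the coefficient on a label-encoded column assumes that the numeric codes are evenly spaced and meaningfully ordered.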


In [77]:
flightdata['Duration']=flightdata['Duration'].astype(str)

In [78]:
flightdata

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Duration,Total_Stops,Price,WeekDay,WeekEnd
0,3,2019-03-24,0,5,2h 50m,4,3897,6,1
1,1,2019-01-05,3,0,7h 25m,1,7662,5,1
2,4,2019-09-06,2,1,19h,1,13882,4,0
3,3,2019-12-05,3,0,5h 25m,0,6218,3,0
4,3,2019-01-03,0,5,4h 45m,0,13302,3,0
...,...,...,...,...,...,...,...,...,...
10678,0,2019-09-04,3,0,2h 30m,4,4107,2,0
10679,1,2019-04-27,3,0,2h 35m,4,4145,5,1
10680,4,2019-04-27,0,2,3h,4,7229,5,1
10681,10,2019-01-03,0,5,2h 40m,4,12648,3,0


In [79]:
# the values contain no commas, so each one stays a one-element list
flightdata['Duration'].str.split(",", n=0)

0        [2h 50m]
1        [7h 25m]
2           [19h]
3        [5h 25m]
4        [4h 45m]
           ...   
10678    [2h 30m]
10679    [2h 35m]
10680        [3h]
10681    [2h 40m]
10682    [8h 20m]
Name: Duration, Length: 10683, dtype: object

In [80]:
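# Split values such as "2h 50m" on the space. Despite the column names, Duration_Min
# receives the hours part and Duration_Sec the minutes part (NaN when there is none).
flightdata[['Duration_Min', 'Duration_Sec']] = flightdata['Duration'].str.split(" ", expand=True)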


In [118]:
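# Strip the unit letters so that only digits remain
flightdata['Duration_Min'] = flightdata['Duration_Min'].str.replace('h', '')
# a minutes-only duration (for example "5m") would land in Duration_Min and later be counted as hours
flightdata['Duration_Min'] = flightdata['Duration_Min'].str.replace('m', '')
flightdata['Duration_Sec'] = flightdata['Duration_Sec'].str.replace('m', '')
flightdata['Duration_Sec'] = flightdata['Duration_Sec'].fillna(0)  # no minutes part -> 0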

In [112]:
flightdata

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Duration,Total_Stops,Price,WeekDay,WeekEnd,Duration_Min,Duration_Sec
0,3,2019-03-24,0,5,2h 50m,4,3897,6,1,2,50
1,1,2019-01-05,3,0,7h 25m,1,7662,5,1,7,25
2,4,2019-09-06,2,1,19h,1,13882,4,0,19,0
3,3,2019-12-05,3,0,5h 25m,0,6218,3,0,5,25
4,3,2019-01-03,0,5,4h 45m,0,13302,3,0,4,45
...,...,...,...,...,...,...,...,...,...,...,...
10678,0,2019-09-04,3,0,2h 30m,4,4107,2,0,2,30
10679,1,2019-04-27,3,0,2h 35m,4,4145,5,1,2,35
10680,4,2019-04-27,0,2,3h,4,7229,5,1,3,0
10681,10,2019-01-03,0,5,2h 40m,4,12648,3,0,2,40


In [113]:
flightdata.isnull().sum()

Airline            0
Date_of_Journey    0
Source             0
Destination        0
Duration           0
Total_Stops        0
Price              0
WeekDay            0
WeekEnd            0
Duration_Min       0
Duration_Sec       0
dtype: int64

In [119]:
# total duration in minutes = hours part * 60 + minutes part
flightdata['TotalDuration_Min'] = flightdata['Duration_Min'].astype(int) * 60 + flightdata['Duration_Sec'].astype(int)
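
The split-and-strip steps above assume that the first token is always the hours part. A more defensive variant is sketched below: it extracts the hour and minute parts by pattern, so a value with only minutes is not mistaken for hours. The sample values follow the Duration format shown earlier; "5m" is a hypothetical minutes-only entry added to exercise that case.

```python
import pandas as pd

# Sample values in the same format as the Duration column; "5m" is a
# hypothetical minutes-only entry included to exercise the edge case.
duration = pd.Series(["2h 50m", "19h", "5h 25m", "5m"])

# Pull the hour and minute parts out separately; a missing part becomes NaN,
# which is then replaced by "0" before converting to integers.
hours = duration.str.extract(r"(\d+)\s*h", expand=False).fillna("0").astype(int)
minutes = duration.str.extract(r"(\d+)\s*m", expand=False).fillna("0").astype(int)

total_minutes = hours * 60 + minutes
print(total_minutes.tolist())  # [170, 1140, 325, 5]
```

The first three results agree with the TotalDuration_Min values produced by the notebook for the same inputs (170, 1140 and 325).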

In [120]:
# drop the date and raw duration columns; WeekEnd and TotalDuration_Min now carry that information
flightdata.drop(['Date_of_Journey', 'WeekDay', 'Duration', 'Duration_Min', 'Duration_Sec'], axis=1, inplace=True)

In [121]:
flightdata

Unnamed: 0,Airline,Source,Destination,Total_Stops,Price,WeekEnd,TotalDuration_Min
0,3,0,5,4,3897,1,170
1,1,3,0,1,7662,1,445
2,4,2,1,1,13882,0,1140
3,3,3,0,0,6218,0,325
4,3,0,5,0,13302,0,285
...,...,...,...,...,...,...,...
10678,0,3,0,4,4107,0,150
10679,1,3,0,4,4145,1,155
10680,4,0,2,4,7229,1,180
10681,10,0,5,4,12648,0,160


In [122]:
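# Check for outliers: keep only rows where every column lies within 3 standard deviations.
# Note that this also scores Price and the label-encoded columns.
from scipy.stats import zscore
print('Before zscore', flightdata.shape)
z_score = np.abs(zscore(flightdata))
hrds = flightdata[np.asarray((z_score < 3).all(axis=1))]
print('After zscore', hrds.shape)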

Before zscore (10683, 7)
After zscore (10522, 7)
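
The filter above scores every column, including the label-encoded categories, where a z-score has little meaning. One possible variation, shown here on synthetic data only, is to score just the continuous columns. The `toy` frame, its values and the `continuous` list are assumptions for illustration and are not taken from the flight data.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Synthetic data for illustration only: 200 typical rows plus one extreme row.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Price": rng.normal(9000, 2000, 200),
    "TotalDuration_Min": rng.normal(600, 200, 200),
    "WeekEnd": rng.integers(0, 2, 200),
})
toy.loc[len(toy)] = [80000, 600, 0]  # one clearly extreme price

# Score only the continuous columns; encoded categories are left out.
continuous = ["Price", "TotalDuration_Min"]
z = np.abs(zscore(toy[continuous].to_numpy(), axis=0))
kept = toy[(z < 3).all(axis=1)]
print("Before:", toy.shape, "After:", kept.shape)
```

Whether Price itself should be filtered is a separate modelling decision, since removing high-priced rows changes the target distribution the model is evaluated on.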


In [123]:
x=hrds.drop(['Price'],axis=1)
x.shape

(10522, 6)

In [124]:
y=hrds['Price']
y=np.array(y).reshape(-1,1)

In [125]:
x.skew()

Airline              0.729381
Source              -0.436860
Destination          1.261808
Total_Stops          0.615379
WeekEnd              1.399518
TotalDuration_Min    0.777208
dtype: float64

In [126]:
# split the data into training (80%) and test (20%) sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.2,random_state = 42)

In [127]:
#Linear Regression
regressor = LinearRegression()  
regressor.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [128]:
y_pred = regressor.predict(X_test)

In [129]:
# calculating RMSE
from sklearn.metrics import mean_squared_error
from math import sqrt
rmse = sqrt(mean_squared_error(y_test, y_pred))
rmse

2980.8223203120897

In [130]:
df = pd.DataFrame({'Actual': np.array(y_test)[:,0], 'Predicted': y_pred[:,0]})
df

Unnamed: 0,Actual,Predicted
0,14086,10264.873480
1,6216,5583.722823
2,10919,11413.384520
3,11753,8769.238223
4,4145,4460.306224
...,...,...
2100,4995,4491.609193
2101,12032,12129.798480
2102,3383,5374.298057
2103,4423,5906.768477
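
RMSE penalises large misses more heavily than small ones, so it can help to read it next to the mean absolute error. The sketch below computes both on the first five Actual/Predicted rows displayed above; it covers only those rows, so the numbers will differ from the full test-set RMSE reported earlier.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# First five rows of the Actual/Predicted table above; not the full test set.
actual = np.array([14086, 6216, 10919, 11753, 4145])
predicted = np.array([10264.873480, 5583.722823, 11413.384520, 8769.238223, 4460.306224])

rmse = np.sqrt(mean_squared_error(actual, predicted))
mae = mean_absolute_error(actual, predicted)
print(f"RMSE on these rows: {rmse:.1f}")
print(f"MAE on these rows:  {mae:.1f}")
```

RMSE is never smaller than MAE on the same data, and a large gap between the two points to a few large individual errors.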


In [132]:
# Save the fitted linear regression model to disk
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
joblib.dump(regressor, 'Flight_Price_Model.obj')




['Flight_Price_Model.obj']
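
A saved model can be loaded again with `joblib.load` and used for prediction. The round trip is sketched below with a small stand-in model fitted on synthetic data and written to a temporary directory, so the example does not depend on the notebook state; only the file name is borrowed from the cell above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy model standing in for `regressor`; the data here is synthetic.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
model = LinearRegression().fit(X, y)

# Write the model to a temporary location and read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "Flight_Price_Model.obj")
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions.
same = np.allclose(model.predict(X), restored.predict(X))
print("Predictions match after reload:", same)
```

New rows passed to the restored model need the same columns, in the same order and with the same encoding, as the training data.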

# Conclusion
    A Linear Regression model was prepared to predict flight prices.

    Null values were replaced.

    Duration (hours and minutes) was converted to total minutes.

    A WeekEnd flag was extracted from Date_of_Journey.

    Label Encoder was applied to the categorical columns.
    
    
     
