## Flight Ticket Price Prediction

#### Business Case:
<br>Flight ticket prices in India follow a demand-and-supply model
with few pricing restrictions from regulatory bodies. Pricing is often
perceived as unpredictable, and the recent dynamic pricing scheme
has added to the confusion.</br>
<br>The objective is to create a machine learning model that predicts
the flight price from historical data, which can serve as a
reference price for customers as well as airline service providers.</br>

#### Project Goal:
<br> 1. Create a machine learning model that predicts flight ticket prices with high accuracy.</br>

#### Feature Details:
<br>Airline: The name of the airline.</br>
<br>Date_of_Journey: The date of the journey.</br>
<br>Source: The source from which the service begins.</br>
<br>Destination: The destination where the service ends.</br>
<br>Route: The route taken by the flight to reach the destination.</br>
<br>Dep_Time: The time when the journey starts from the source.</br>
<br>Arrival_Time: The time of arrival at the destination.</br>
<br>Duration: Total duration of the flight.</br>
<br>Total_Stops: Total stops between the source and destination.</br>
<br>Additional_Info: Additional information about the flight.</br>
<br>Price: The price of the ticket.</br>


In [1]:
# import libraries required for project 

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline


In [2]:
# Read/load the data

flight_data = pd.read_excel('data/Data_Train.xlsx')

In [3]:
pd.get_dummies(flight_data)

Unnamed: 0,Price,Airline_Air Asia,Airline_Air India,Airline_GoAir,Airline_IndiGo,Airline_Jet Airways,Airline_Jet Airways Business,Airline_Multiple carriers,Airline_Multiple carriers Premium economy,Airline_SpiceJet,...,Additional_Info_1 Long layover,Additional_Info_1 Short layover,Additional_Info_2 Long layover,Additional_Info_Business class,Additional_Info_Change airports,Additional_Info_In-flight meal not included,Additional_Info_No Info,Additional_Info_No check-in baggage included,Additional_Info_No info,Additional_Info_Red-eye flight
0,3897,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
1,7662,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
2,13882,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
3,6218,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
4,13302,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
5,3873,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,1,0
6,11087,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
7,22270,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
8,11087,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
9,8625,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,1,0


In [4]:
# check data sample 

flight_data.head()

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price
0,IndiGo,24/03/2019,Banglore,New Delhi,BLR → DEL,22:20,01:10 22 Mar,2h 50m,non-stop,No info,3897
1,Air India,1/05/2019,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,13:15,7h 25m,2 stops,No info,7662
2,Jet Airways,9/06/2019,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,04:25 10 Jun,19h,2 stops,No info,13882
3,IndiGo,12/05/2019,Kolkata,Banglore,CCU → NAG → BLR,18:05,23:30,5h 25m,1 stop,No info,6218
4,IndiGo,01/03/2019,Banglore,New Delhi,BLR → NAG → DEL,16:50,21:35,4h 45m,1 stop,No info,13302


### From the data above, we can start cleaning and refactoring the data
 <br> 1. Date_of_Journey has to be converted to day of week. </br>
 <br> 2. Source and Destination should use the same airport codes that appear in Route. </br>
 <br> 3. Create an array of the hops, including the source and destination. </br>


In [5]:
# check the info of the data
flight_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10683 entries, 0 to 10682
Data columns (total 11 columns):
Airline            10683 non-null object
Date_of_Journey    10683 non-null object
Source             10683 non-null object
Destination        10683 non-null object
Route              10682 non-null object
Dep_Time           10683 non-null object
Arrival_Time       10683 non-null object
Duration           10683 non-null object
Total_Stops        10682 non-null object
Additional_Info    10683 non-null object
Price              10683 non-null int64
dtypes: int64(1), object(10)
memory usage: 918.1+ KB


In [6]:
flight_data.dropna(subset=['Route','Total_Stops'], inplace=True)
flight_data.reset_index(drop=True,inplace=True)

In [7]:
from datetime import datetime

def convert_to_datetime(date):
    # Parse a day/month/year string such as '24/03/2019'
    return datetime.strptime(date, '%d/%m/%Y')

    

In [8]:
# Parse Date_of_Journey into datetime objects

from datetime import datetime

flight_data['date_instance'] = flight_data['Date_of_Journey'].apply(convert_to_datetime)



In [9]:
flight_data['Day_of_Week'] = flight_data['date_instance'].apply(lambda x : int(x.strftime('%w')))

In [10]:
flight_data['month'] = flight_data['date_instance'].apply(lambda x : int(x.strftime('%m')))
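
The row-by-row `apply` calls above could also be written in vectorized form. The following is a minimal sketch on a hypothetical three-row sample (not the real dataset), assuming the same day/month/year format. Note that `dt.dayofweek` counts Monday as 0, so it is shifted here to match `strftime('%w')`, where Sunday is 0.

```python
import pandas as pd

# Hypothetical sample mirroring the Date_of_Journey format (day/month/year).
sample = pd.DataFrame({'Date_of_Journey': ['24/03/2019', '1/05/2019', '9/06/2019']})

# Parse the whole column at once instead of calling strptime per row.
parsed = pd.to_datetime(sample['Date_of_Journey'], format='%d/%m/%Y')

# dt.dayofweek uses Monday=0; shift so that Sunday=0, as strftime('%w') does.
sample['Day_of_Week'] = (parsed.dt.dayofweek + 1) % 7
sample['month'] = parsed.dt.month

print(sample)
```

With this approach no intermediate `date_instance` column is needed.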


In [11]:
# Day_of_Week and month are now extracted, so Date_of_Journey and date_instance can be dropped
flight_data.drop(['Date_of_Journey','date_instance'],axis=1,inplace=True)


In [12]:
def convert_Stops_toInt(stop):
    # Map the textual stop count to an integer; unknown labels return None
    stops = {'non-stop': 0, '1 stop': 1, '2 stops': 2, '3 stops': 3, '4 stops': 4}
    return stops.get(stop)



In [13]:
flight_data['Stops_total'] = flight_data['Total_Stops'].apply(lambda x : convert_Stops_toInt(x))
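
One caveat with this conversion is that any label outside the five known values silently becomes a missing value. A small sketch of how that could be checked with `Series.map`, using a hypothetical sample that includes a made-up `'5 stops'` label that is not claimed to exist in the dataset:

```python
import pandas as pd

# Same label-to-count mapping as the function above.
stops_to_int = {'non-stop': 0, '1 stop': 1, '2 stops': 2, '3 stops': 3, '4 stops': 4}

# Hypothetical sample; '5 stops' is invented to show the failure mode.
sample = pd.Series(['non-stop', '2 stops', '1 stop', '5 stops'])

# map() leaves NaN wherever the label is not a key of the dictionary.
stops_total = sample.map(stops_to_int)

# List the labels that could not be converted so they can be handled explicitly.
unmapped = sample[stops_total.isna()].unique()
print(unmapped)
```

If `unmapped` is empty on the real column, the conversion covered every label.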

In [14]:
# The stops are now numeric, so the original Total_Stops column can be removed
flight_data.drop(['Total_Stops'],axis=1,inplace=True)


In [15]:
flight_data.head()

Unnamed: 0,Airline,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Additional_Info,Price,Day_of_Week,month,Stops_total
0,IndiGo,Banglore,New Delhi,BLR → DEL,22:20,01:10 22 Mar,2h 50m,No info,3897,0,3,0
1,Air India,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,13:15,7h 25m,No info,7662,3,5,2
2,Jet Airways,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,04:25 10 Jun,19h,No info,13882,0,6,2
3,IndiGo,Kolkata,Banglore,CCU → NAG → BLR,18:05,23:30,5h 25m,No info,6218,0,5,1
4,IndiGo,Banglore,New Delhi,BLR → NAG → DEL,16:50,21:35,4h 45m,No info,13302,5,3,1


In [16]:
def convert_to_min(time):
    time_in_min = 0
    hr_min = time.split(' ')
    for t in hr_min:
        if 'h' in t:
            hr = t[:-1]
            time_in_min = time_in_min + (int(hr) * 60)
        elif 'm' in t:
            minute= t[:-1]
            time_in_min = time_in_min + int(minute)
            
    return time_in_min

In [17]:
flight_data['duration_inMin'] = flight_data['Duration'].apply(lambda x : convert_to_min(x))
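
As an alternative to splitting on spaces, the hour and minute parts could be pulled out with a regular expression. This is only a sketch on a hypothetical sample, assuming durations always look like `'2h 50m'`, `'19h'`, or `'5m'`:

```python
import pandas as pd

# Hypothetical durations covering the hours-only and minutes-only cases.
sample = pd.Series(['2h 50m', '19h', '7h 25m', '5m'])

# Both parts are optional, so a missing part comes back as NaN.
parts = sample.str.extract(r'(?:(?P<hours>\d+)h)?\s*(?:(?P<minutes>\d+)m)?')

# Convert to numbers, treat a missing part as zero, and combine into minutes.
hours = pd.to_numeric(parts['hours']).fillna(0).astype(int)
minutes = pd.to_numeric(parts['minutes']).fillna(0).astype(int)
duration_in_min = hours * 60 + minutes

print(duration_in_min.tolist())
```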

In [18]:
flight_data.head()

Unnamed: 0,Airline,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Additional_Info,Price,Day_of_Week,month,Stops_total,duration_inMin
0,IndiGo,Banglore,New Delhi,BLR → DEL,22:20,01:10 22 Mar,2h 50m,No info,3897,0,3,0,170
1,Air India,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,13:15,7h 25m,No info,7662,3,5,2,445
2,Jet Airways,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,04:25 10 Jun,19h,No info,13882,0,6,2,1140
3,IndiGo,Kolkata,Banglore,CCU → NAG → BLR,18:05,23:30,5h 25m,No info,6218,0,5,1,325
4,IndiGo,Banglore,New Delhi,BLR → NAG → DEL,16:50,21:35,4h 45m,No info,13302,5,3,1,285


In [19]:
# The duration is now available in minutes, so the original Duration column can be removed

flight_data.drop(['Duration'],axis=1,inplace=True)

In [20]:
flight_data.head()

Unnamed: 0,Airline,Source,Destination,Route,Dep_Time,Arrival_Time,Additional_Info,Price,Day_of_Week,month,Stops_total,duration_inMin
0,IndiGo,Banglore,New Delhi,BLR → DEL,22:20,01:10 22 Mar,No info,3897,0,3,0,170
1,Air India,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,13:15,No info,7662,3,5,2,445
2,Jet Airways,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,04:25 10 Jun,No info,13882,0,6,2,1140
3,IndiGo,Kolkata,Banglore,CCU → NAG → BLR,18:05,23:30,No info,6218,0,5,1,325
4,IndiGo,Banglore,New Delhi,BLR → NAG → DEL,16:50,21:35,No info,13302,5,3,1,285


In [21]:
def convert_to_flight_time(dep_time):
    hh_mm = dep_time.split(':')
    if int(hh_mm[0]) >= 0 and int(hh_mm[0]) < 6:
        return 'Late Night'
    elif int(hh_mm[0]) >= 6 and int(hh_mm[0]) < 12:
        return 'Morning'
    elif int(hh_mm[0]) >= 12 and int(hh_mm[0]) < 18:
        return 'Afternoon'
    else:
        return 'Evening'


In [22]:
flight_data['dep_quadrant'] = flight_data['Dep_Time'].apply(lambda x : convert_to_flight_time(x))
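
The same six-hour bucketing could be expressed with `pd.cut`. The sketch below uses a hypothetical sample of departure times and assumes the `HH:MM` format shown in the data; `right=False` makes each bin include its lower edge, which mirrors the `>=` and `<` comparisons in the function above.

```python
import pandas as pd

# Hypothetical departure times in HH:MM format.
sample = pd.Series(['22:20', '05:50', '09:25', '18:05', '16:50'])

# Keep only the hour part as an integer.
hours = sample.str.split(':').str[0].astype(int)

# Bins are [0, 6), [6, 12), [12, 18), [18, 24).
dep_quadrant = pd.cut(
    hours,
    bins=[0, 6, 12, 18, 24],
    right=False,
    labels=['Late Night', 'Morning', 'Afternoon', 'Evening'],
)

print(dep_quadrant.tolist())
```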

In [23]:
flight_data.head()

Unnamed: 0,Airline,Source,Destination,Route,Dep_Time,Arrival_Time,Additional_Info,Price,Day_of_Week,month,Stops_total,duration_inMin,dep_quadrant
0,IndiGo,Banglore,New Delhi,BLR → DEL,22:20,01:10 22 Mar,No info,3897,0,3,0,170,Evening
1,Air India,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,13:15,No info,7662,3,5,2,445,Late Night
2,Jet Airways,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,04:25 10 Jun,No info,13882,0,6,2,1140,Morning
3,IndiGo,Kolkata,Banglore,CCU → NAG → BLR,18:05,23:30,No info,6218,0,5,1,325,Evening
4,IndiGo,Banglore,New Delhi,BLR → NAG → DEL,16:50,21:35,No info,13302,5,3,1,285,Afternoon


In [24]:
# dep_quadrant now captures the departure time, so Dep_Time can be removed. Arrival_Time adds little
# because it follows from the departure time and the flight duration, so both columns are dropped.

flight_data.drop(['Dep_Time','Arrival_Time'],axis=1,inplace=True)

In [25]:
flight_data.head()

Unnamed: 0,Airline,Source,Destination,Route,Additional_Info,Price,Day_of_Week,month,Stops_total,duration_inMin,dep_quadrant
0,IndiGo,Banglore,New Delhi,BLR → DEL,No info,3897,0,3,0,170,Evening
1,Air India,Kolkata,Banglore,CCU → IXR → BBI → BLR,No info,7662,3,5,2,445,Late Night
2,Jet Airways,Delhi,Cochin,DEL → LKO → BOM → COK,No info,13882,0,6,2,1140,Morning
3,IndiGo,Kolkata,Banglore,CCU → NAG → BLR,No info,6218,0,5,1,325,Evening
4,IndiGo,Banglore,New Delhi,BLR → NAG → DEL,No info,13302,5,3,1,285,Afternoon


In [26]:
# First, map each Source and Destination city name to its airport code,
# taken from the first and last three characters of the Route string.
# Rows with a missing Route were dropped earlier, so no error handling is needed here.
source_to_code_dict = {}

def get_source_destination_code():
    rows = zip(flight_data['Source'], flight_data['Destination'], flight_data['Route'])
    for source_local, dest_loc, route_map in rows:
        # setdefault keeps the first code seen for each city
        source_to_code_dict.setdefault(source_local, route_map[:3])
        source_to_code_dict.setdefault(dest_loc, route_map[-3:])


get_source_destination_code()


In [27]:
source_to_code_dict


{'Banglore': 'BLR',
 'New Delhi': 'DEL',
 'Kolkata': 'CCU',
 'Delhi': 'DEL',
 'Cochin': 'COK',
 'Chennai': 'MAA',
 'Mumbai': 'BOM',
 'Hyderabad': 'HYD'}

In [28]:
flight_data['Source'] = flight_data['Source'].apply(lambda x : source_to_code_dict[x])


In [29]:
flight_data['Destination'] = flight_data['Destination'].apply(lambda x : source_to_code_dict[x])

In [30]:
flight_data.head()

Unnamed: 0,Airline,Source,Destination,Route,Additional_Info,Price,Day_of_Week,month,Stops_total,duration_inMin,dep_quadrant
0,IndiGo,BLR,DEL,BLR → DEL,No info,3897,0,3,0,170,Evening
1,Air India,CCU,BLR,CCU → IXR → BBI → BLR,No info,7662,3,5,2,445,Late Night
2,Jet Airways,DEL,COK,DEL → LKO → BOM → COK,No info,13882,0,6,2,1140,Morning
3,IndiGo,CCU,BLR,CCU → NAG → BLR,No info,6218,0,5,1,325,Evening
4,IndiGo,BLR,DEL,BLR → NAG → DEL,No info,13302,5,3,1,285,Afternoon


In [31]:
code_to_num = {}

def update_code_to_num_dict():
    # Number the source and destination airports first so they get the lowest codes
    for column in ('Source', 'Destination'):
        for code in flight_data[column].unique():
            if code not in code_to_num:
                code_to_num[code] = len(code_to_num)

    # Encode each route as a fixed-length list of six airport numbers, padded with -1
    route_codes = []
    for route_map in flight_data['Route']:
        lst = [-1, -1, -1, -1, -1, -1]
        for index, stop in enumerate(route_map.split('→')):
            stop = stop.strip()
            if stop not in code_to_num:
                code_to_num[stop] = len(code_to_num)
            lst[index] = code_to_num[stop]
        route_codes.append(lst)

    flight_data['Route_Code'] = pd.Series(route_codes, index=flight_data.index)

In [32]:
update_code_to_num_dict()

In [33]:
flight_data.head()

Unnamed: 0,Airline,Source,Destination,Route,Additional_Info,Price,Day_of_Week,month,Stops_total,duration_inMin,dep_quadrant,Route_Code
0,IndiGo,BLR,DEL,BLR → DEL,No info,3897,0,3,0,170,Evening,"[0, 2, -1, -1, -1, -1]"
1,Air India,CCU,BLR,CCU → IXR → BBI → BLR,No info,7662,3,5,2,445,Late Night,"[1, 7, 8, 0, -1, -1]"
2,Jet Airways,DEL,COK,DEL → LKO → BOM → COK,No info,13882,0,6,2,1140,Morning,"[2, 9, 4, 5, -1, -1]"
3,IndiGo,CCU,BLR,CCU → NAG → BLR,No info,6218,0,5,1,325,Evening,"[1, 10, 0, -1, -1, -1]"
4,IndiGo,BLR,DEL,BLR → NAG → DEL,No info,13302,5,3,1,285,Afternoon,"[0, 10, 2, -1, -1, -1]"


In [34]:
# Route has been converted to Route_Code, so the Route column can be deleted
flight_data.drop('Route',axis=1,inplace=True)

In [35]:
# Convert Source and Destination to the same numeric codes used to build Route_Code
flight_data['Source'] = flight_data['Source'].apply(lambda x : code_to_num[x])
flight_data['Destination'] = flight_data['Destination'].apply(lambda x : code_to_num[x])

In [36]:
flight_data.head()

Unnamed: 0,Airline,Source,Destination,Additional_Info,Price,Day_of_Week,month,Stops_total,duration_inMin,dep_quadrant,Route_Code
0,IndiGo,0,2,No info,3897,0,3,0,170,Evening,"[0, 2, -1, -1, -1, -1]"
1,Air India,1,0,No info,7662,3,5,2,445,Late Night,"[1, 7, 8, 0, -1, -1]"
2,Jet Airways,2,5,No info,13882,0,6,2,1140,Morning,"[2, 9, 4, 5, -1, -1]"
3,IndiGo,1,0,No info,6218,0,5,1,325,Evening,"[1, 10, 0, -1, -1, -1]"
4,IndiGo,0,2,No info,13302,5,3,1,285,Afternoon,"[0, 10, 2, -1, -1, -1]"


In [37]:
# columns to convert
col_convert = ['Airline','Additional_Info','dep_quadrant']

flight_data = pd.get_dummies(columns=col_convert,data=flight_data)
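
The one-hot output shows both `Additional_Info_No Info` and `Additional_Info_No info`, which suggests the same category appears with two capitalizations. A minimal sketch of normalizing the text before encoding, on a hypothetical three-row sample (lower-casing is one possible choice, not something the notebook does):

```python
import pandas as pd

# Hypothetical sample with the two spellings seen in the dummy column names.
sample = pd.DataFrame({
    'Additional_Info': ['No info', 'No Info', 'In-flight meal not included'],
})

# Normalize whitespace and case so both spellings collapse into one category.
sample['Additional_Info'] = sample['Additional_Info'].str.strip().str.lower()

encoded = pd.get_dummies(sample, columns=['Additional_Info'])
print(encoded.columns.tolist())
```

After normalization the two spellings share a single indicator column instead of two.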

In [38]:
flight_data.head()

Unnamed: 0,Source,Destination,Price,Day_of_Week,month,Stops_total,duration_inMin,Route_Code,Airline_Air Asia,Airline_Air India,...,Additional_Info_Change airports,Additional_Info_In-flight meal not included,Additional_Info_No Info,Additional_Info_No check-in baggage included,Additional_Info_No info,Additional_Info_Red-eye flight,dep_quadrant_Afternoon,dep_quadrant_Evening,dep_quadrant_Late Night,dep_quadrant_Morning
0,0,2,3897,0,3,0,170,"[0, 2, -1, -1, -1, -1]",0,0,...,0,0,0,0,1,0,0,1,0,0
1,1,0,7662,3,5,2,445,"[1, 7, 8, 0, -1, -1]",0,1,...,0,0,0,0,1,0,0,0,1,0
2,2,5,13882,0,6,2,1140,"[2, 9, 4, 5, -1, -1]",0,0,...,0,0,0,0,1,0,0,0,0,1
3,1,0,6218,0,5,1,325,"[1, 10, 0, -1, -1, -1]",0,0,...,0,0,0,0,1,0,0,1,0,0
4,0,2,13302,5,3,1,285,"[0, 10, 2, -1, -1, -1]",0,0,...,0,0,0,0,1,0,1,0,0,0


In [39]:
# Data processing is finished. Let's divide the data into training and testing sets

X = flight_data.drop(['Price','Route_Code'],axis=1)
y = flight_data['Price']
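
`Route_Code` is dropped here, so the hop information does not reach the models. If it were to be kept, one option is to expand each list into separate numeric columns. The sketch below uses a hypothetical two-row frame, and the `hop_0` to `hop_5` column names are invented for illustration:

```python
import pandas as pd

# Hypothetical frame with the same list-valued Route_Code layout.
sample = pd.DataFrame({
    'Price': [3897, 7662],
    'Route_Code': [[0, 2, -1, -1, -1, -1], [1, 7, 8, 0, -1, -1]],
})

# One column per hop position; the names are illustrative only.
hop_columns = [f'hop_{i}' for i in range(6)]
hops = pd.DataFrame(sample['Route_Code'].tolist(), columns=hop_columns, index=sample.index)

# Replace the list column with the expanded hop columns.
expanded = pd.concat([sample.drop(columns='Route_Code'), hops], axis=1)
print(expanded)
```

Whether these arbitrary integer codes help a given model is a separate question; tree-based models tend to tolerate them better than linear ones.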


In [40]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

### Now that we have training data, we can fit different models

In [41]:
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()



In [42]:
lin_reg.fit(X_train,y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [43]:
lin_ref_pred = lin_reg.predict(X_train)

In [45]:
from sklearn.metrics import r2_score

# R^2 on the training set; this is not a held-out estimate
r2_score(y_train,lin_ref_pred)

0.6664809831929335

Let's try another regressor
### RandomForestRegressor

In [46]:
from sklearn.ensemble import RandomForestRegressor

ran_reg = RandomForestRegressor()

In [47]:
ran_reg.fit(X_train,y_train)



RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

In [48]:
ran_reg_pred = ran_reg.predict(X_train)

In [49]:
r2_score(y_train,ran_reg_pred)

0.9486670641688311

In [52]:
# The training score looks good. Let's check the test data

ran_pred_test = ran_reg.predict(X_test)

In [53]:
r2_score(y_test,ran_pred_test)

0.7785120086759343

In [54]:
# Let's look at some other regression models

from sklearn.linear_model import Lasso

lasso_reg = Lasso(alpha=0.1)


In [55]:
lasso_reg.fit(X_train,y_train)



Lasso(alpha=0.1, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)

In [56]:
lasso_pred = lasso_reg.predict(X_train)

In [58]:
from sklearn.metrics import mean_squared_error,mean_absolute_error

print(mean_squared_error(y_train,lasso_pred))
print(mean_absolute_error(y_train,lasso_pred))
print(r2_score(y_train,lasso_pred))

7164827.28435465
1808.0702747869586
0.6664572461986802


In [60]:
# Let's use Ridge regression

from sklearn.linear_model import Ridge

ridge_reg = Ridge(alpha=0.1,solver='cholesky')

ridge_reg.fit(X_train,y_train)

ridge_pred = ridge_reg.predict(X_train)

print(mean_squared_error(y_train,ridge_pred))
print(mean_absolute_error(y_train,ridge_pred))
print(r2_score(y_train,ridge_pred))




7164837.196326104
1807.957776546364
0.66645678476868


In [61]:
# Let's try Elastic Net as well

from sklearn.linear_model import ElasticNet

elastic_reg = ElasticNet(alpha=0.1,l1_ratio=0.5)

elastic_reg.fit(X_train,y_train)

elastic_pred = elastic_reg.predict(X_train)

print(mean_squared_error(y_train,elastic_pred))
print(mean_absolute_error(y_train,elastic_pred))
print(r2_score(y_train,elastic_pred))


9393121.126762614
1953.91125527163
0.5627239341482668
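
The linear, Lasso, Ridge, and Elastic Net scores above are all computed on the training set, while only the random forest is also checked on the test set. A sketch of scoring every model on both splits with one loop is shown below. It uses synthetic data from `make_regression` as a stand-in for the flight features, so the printed numbers say nothing about the real dataset, and the hyperparameters are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the processed flight features and prices.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Illustrative model settings, not tuned values.
models = {
    'linear': LinearRegression(),
    'ridge': Ridge(alpha=0.1),
    'random_forest': RandomForestRegressor(n_estimators=50, random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Score on both splits so a large train/test gap (overfitting) is visible.
    scores[name] = {
        'train_r2': r2_score(y_train, model.predict(X_train)),
        'test_r2': r2_score(y_test, model.predict(X_test)),
    }
    print(name, scores[name])
```

Comparing the two columns per model makes gaps like the random forest's 0.95 training score versus 0.78 test score easier to spot.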
