https://www.machinehack.com/course/predict-the-flight-ticket-price-hackathon/

Predict The Flight Ticket Price Hackathon
Flight ticket prices can be hard to guess: the price we see for a flight today may be quite different when we check the same flight tomorrow. Travellers often say that flight ticket prices are unpredictable. Here we take on the challenge. As data scientists, we set out to show that, given the right data, ticket prices can be predicted. You are provided with the prices of flight tickets for various airlines and between various cities for the months of March to June 2019.

Size of training set: 10683 records

Size of test set: 2671 records

FEATURES:

Airline: The name of the airline.

Date_of_Journey: The date of the journey

Source: The source from which the service begins.

Destination: The destination where the service ends.

Route: The route taken by the flight to reach the destination.

Dep_Time: The time when the journey starts from the source.

Arrival_Time: Time of arrival at the destination.

Duration: Total duration of the flight.

Total_Stops: Total stops between the source and destination.

Additional_Info: Additional information about the flight

Price: The price of the ticket

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Loading and Exploring Data

In [2]:
df_train=pd.read_excel("Data_Train.xlsx")
df_test=pd.read_excel("Test_set.xlsx")

In [3]:
df_train.head()

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price
0,IndiGo,24/03/2019,Banglore,New Delhi,BLR → DEL,22:20,01:10 22 Mar,2h 50m,non-stop,No info,3897
1,Air India,1/05/2019,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,13:15,7h 25m,2 stops,No info,7662
2,Jet Airways,9/06/2019,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,04:25 10 Jun,19h,2 stops,No info,13882
3,IndiGo,12/05/2019,Kolkata,Banglore,CCU → NAG → BLR,18:05,23:30,5h 25m,1 stop,No info,6218
4,IndiGo,01/03/2019,Banglore,New Delhi,BLR → NAG → DEL,16:50,21:35,4h 45m,1 stop,No info,13302


In [4]:
df_test.head()

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info
0,Jet Airways,6/06/2019,Delhi,Cochin,DEL → BOM → COK,17:30,04:25 07 Jun,10h 55m,1 stop,No info
1,IndiGo,12/05/2019,Kolkata,Banglore,CCU → MAA → BLR,06:20,10:20,4h,1 stop,No info
2,Jet Airways,21/05/2019,Delhi,Cochin,DEL → BOM → COK,19:15,19:00 22 May,23h 45m,1 stop,In-flight meal not included
3,Multiple carriers,21/05/2019,Delhi,Cochin,DEL → BOM → COK,08:00,21:00,13h,1 stop,No info
4,Air Asia,24/06/2019,Banglore,Delhi,BLR → DEL,23:55,02:45 25 Jun,2h 50m,non-stop,No info


In [5]:
df_train.shape

(10683, 11)

In [6]:
df_test.shape

(2671, 10)

In [7]:
data = pd.concat([df_train, df_test], sort=False)   # DataFrame.append was removed in pandas 2.0; concat keeps the original row labels

In [8]:
data.head()

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price
0,IndiGo,24/03/2019,Banglore,New Delhi,BLR → DEL,22:20,01:10 22 Mar,2h 50m,non-stop,No info,3897.0
1,Air India,1/05/2019,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,13:15,7h 25m,2 stops,No info,7662.0
2,Jet Airways,9/06/2019,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,04:25 10 Jun,19h,2 stops,No info,13882.0
3,IndiGo,12/05/2019,Kolkata,Banglore,CCU → NAG → BLR,18:05,23:30,5h 25m,1 stop,No info,6218.0
4,IndiGo,01/03/2019,Banglore,New Delhi,BLR → NAG → DEL,16:50,21:35,4h 45m,1 stop,No info,13302.0


In [9]:
data.tail()

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price
2666,Air India,6/06/2019,Kolkata,Banglore,CCU → DEL → BLR,20:30,20:25 07 Jun,23h 55m,1 stop,No info,
2667,IndiGo,27/03/2019,Kolkata,Banglore,CCU → BLR,14:20,16:55,2h 35m,non-stop,No info,
2668,Jet Airways,6/03/2019,Delhi,Cochin,DEL → BOM → COK,21:50,04:25 07 Mar,6h 35m,1 stop,No info,
2669,Air India,6/03/2019,Delhi,Cochin,DEL → BOM → COK,04:00,19:15,15h 15m,1 stop,No info,
2670,Multiple carriers,15/06/2019,Delhi,Cochin,DEL → BOM → COK,04:55,19:15,14h 20m,1 stop,No info,


In [10]:
data.dtypes

Airline             object
Date_of_Journey     object
Source              object
Destination         object
Route               object
Dep_Time            object
Arrival_Time        object
Duration            object
Total_Stops         object
Additional_Info     object
Price              float64
dtype: object

# Data Cleaning and Pre-processing

In [11]:
data["Additional_Info"].nunique()

10

In [12]:
data["Additional_Info"].unique()

array(['No info', 'In-flight meal not included',
       'No check-in baggage included', '1 Short layover', 'No Info',
       '1 Long layover', 'Change airports', 'Business class',
       'Red-eye flight', '2 Long layover'], dtype=object)

In [13]:
data.drop("Additional_Info", axis = 1, inplace=True)  # Removing a feature treated as irrelevant

In [14]:
data.head()

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Price
0,IndiGo,24/03/2019,Banglore,New Delhi,BLR → DEL,22:20,01:10 22 Mar,2h 50m,non-stop,3897.0
1,Air India,1/05/2019,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,13:15,7h 25m,2 stops,7662.0
2,Jet Airways,9/06/2019,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,04:25 10 Jun,19h,2 stops,13882.0
3,IndiGo,12/05/2019,Kolkata,Banglore,CCU → NAG → BLR,18:05,23:30,5h 25m,1 stop,6218.0
4,IndiGo,01/03/2019,Banglore,New Delhi,BLR → NAG → DEL,16:50,21:35,4h 45m,1 stop,13302.0


In [15]:
data["Total_Stops"].nunique()

5

In [16]:
data["Total_Stops"].unique()

array(['non-stop', '2 stops', '1 stop', '3 stops', nan, '4 stops'],
      dtype=object)

In [17]:
data[data["Total_Stops"].isnull()]

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Price
9039,Air India,6/05/2019,Delhi,Cochin,,09:45,09:25 07 May,23h 40m,,7480.0


In [18]:
data["Total_Stops"].value_counts()

1 stop      7056
non-stop    4340
2 stops     1899
3 stops       56
4 stops        2
Name: Total_Stops, dtype: int64

In [19]:
data["Total_Stops"] = data["Total_Stops"].fillna("1 stop")     # Handling the single null value with the mode ("1 stop")
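
The fill value above is typed in by hand after reading the value counts. A minimal sketch of deriving the mode programmatically instead, using a toy Series (the values are made up for illustration, not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for the Total_Stops column, with one missing value.
stops = pd.Series(["1 stop", "non-stop", "1 stop", None, "2 stops"])

# Series.mode() ignores missing values and returns the most frequent entries.
most_common = stops.mode().iloc[0]

# Fill the gap with the most frequent category.
filled = stops.fillna(most_common)
print(most_common)
print(filled.tolist())
```

This keeps the imputation correct if the data changes and a different category becomes the most frequent.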

In [20]:
data["Total_Stops"]=data["Total_Stops"].replace(to_replace="non-stop", value="0 stops")    # total stops feature cleaned
data.head()

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Price
0,IndiGo,24/03/2019,Banglore,New Delhi,BLR → DEL,22:20,01:10 22 Mar,2h 50m,0 stops,3897.0
1,Air India,1/05/2019,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,13:15,7h 25m,2 stops,7662.0
2,Jet Airways,9/06/2019,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,04:25 10 Jun,19h,2 stops,13882.0
3,IndiGo,12/05/2019,Kolkata,Banglore,CCU → NAG → BLR,18:05,23:30,5h 25m,1 stop,6218.0
4,IndiGo,01/03/2019,Banglore,New Delhi,BLR → NAG → DEL,16:50,21:35,4h 45m,1 stop,13302.0


In [21]:
data["Total_Stops"] = data["Total_Stops"].str.split(" ").str[0]                       # keep only the leading number
data["Total_Stops"] = data["Total_Stops"].astype(int)                                 # astype returns a new Series, so assign it back
data.head()

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Price
0,IndiGo,24/03/2019,Banglore,New Delhi,BLR → DEL,22:20,01:10 22 Mar,2h 50m,0,3897.0
1,Air India,1/05/2019,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,13:15,7h 25m,2,7662.0
2,Jet Airways,9/06/2019,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,04:25 10 Jun,19h,2,13882.0
3,IndiGo,12/05/2019,Kolkata,Banglore,CCU → NAG → BLR,18:05,23:30,5h 25m,1,6218.0
4,IndiGo,01/03/2019,Banglore,New Delhi,BLR → NAG → DEL,16:50,21:35,4h 45m,1,13302.0


In [22]:
data["Duration"].nunique()

374

In [23]:
data["Duration"].value_counts().sort_values()

41h 20m      1
31h 50m      1
35h 20m      1
30h 10m      1
31h 10m      1
          ... 
2h 35m     399
2h 55m     418
2h 45m     432
1h 30m     493
2h 50m     672
Name: Duration, Length: 374, dtype: int64

In [24]:
h=data["Duration"].str.split("h|m").str[0]
m=data["Duration"].str.split(" ").str[1]
m=m.str.split("m").str[0]
h.value_counts()

2     2967
1      785
3      627
5      610
7      600
9      551
12     538
8      531
13     516
11     467
10     459
6      442
14     424
15     339
23     331
26     292
16     286
4      278
22     273
24     240
21     237
25     231
27     222
20     203
18     179
19     168
17     161
28     116
29      76
30      61
38      41
37      22
33      21
32      12
36      11
35      10
34       9
31       8
39       3
47       2
40       2
42       2
41       1
Name: Duration, dtype: int64

In [25]:
m.value_counts()

30    1818
20    1260
50    1205
45    1154
35    1149
15    1135
55    1121
25    1009
40     803
5      767
10     647
Name: Duration, dtype: int64

In [26]:
print(h.isnull().sum())
print(m.isnull().sum())

0
1286


In [27]:
m=m.fillna(0)
print(m.isnull().sum())

0


In [28]:
h=h.astype(float)
m=m.astype(float)

In [29]:
m=m/60                                                   # Convert duration time into hours only
data["Duration"]=h+m
data.head()

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Price
0,IndiGo,24/03/2019,Banglore,New Delhi,BLR → DEL,22:20,01:10 22 Mar,2.833333,0,3897.0
1,Air India,1/05/2019,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,13:15,7.416667,2,7662.0
2,Jet Airways,9/06/2019,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,04:25 10 Jun,19.0,2,13882.0
3,IndiGo,12/05/2019,Kolkata,Banglore,CCU → NAG → BLR,18:05,23:30,5.416667,1,6218.0
4,IndiGo,01/03/2019,Banglore,New Delhi,BLR → NAG → DEL,16:50,21:35,4.75,1,13302.0
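
The split-based parsing above assumes every duration starts with an hour part. A hedged alternative sketch that extracts hours and minutes with one regular expression, so that a value containing only minutes (such as the hypothetical "5m" below) is not read as hours. The sample values are illustrative only:

```python
import pandas as pd

# Toy durations, including an hours-only and a minutes-only value.
durations = pd.Series(["2h 50m", "19h", "5m", "7h 25m"])

# Capture the optional hour and minute parts in separate named groups.
parts = durations.str.extract(r"(?:(?P<h>\d+)h)?\s*(?:(?P<m>\d+)m)?")

# A missing part becomes NaN, which is then treated as zero.
hours = pd.to_numeric(parts["h"]).fillna(0)
minutes = pd.to_numeric(parts["m"]).fillna(0)

# Duration expressed in hours only, as in the notebook.
duration_hours = hours + minutes / 60
print(duration_hours.round(4).tolist())
```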


In [30]:
data.drop(["Dep_Time","Arrival_Time"], axis = 1, inplace = True)     # removing features treated as irrelevant

In [31]:
data.head()

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Duration,Total_Stops,Price
0,IndiGo,24/03/2019,Banglore,New Delhi,BLR → DEL,2.833333,0,3897.0
1,Air India,1/05/2019,Kolkata,Banglore,CCU → IXR → BBI → BLR,7.416667,2,7662.0
2,Jet Airways,9/06/2019,Delhi,Cochin,DEL → LKO → BOM → COK,19.0,2,13882.0
3,IndiGo,12/05/2019,Kolkata,Banglore,CCU → NAG → BLR,5.416667,1,6218.0
4,IndiGo,01/03/2019,Banglore,New Delhi,BLR → NAG → DEL,4.75,1,13302.0


In [32]:
data["Route"].nunique()       # processing route feature

132

In [33]:
stop_1=data["Route"].str.split(" → ").str[0]
stop_2=data["Route"].str.split(" → ").str[1]
stop_3=data["Route"].str.split(" → ").str[2]
stop_4=data["Route"].str.split(" → ").str[3]
stop_5=data["Route"].str.split(" → ").str[4]
stop_6=data["Route"].str.split(" → ").str[5]

In [34]:
data["Route_stop_1"]=stop_1
data["Route_stop_2"]=stop_2
data["Route_stop_3"]=stop_3
data["Route_stop_4"]=stop_4
data["Route_stop_5"]=stop_5
data["Route_stop_6"]=stop_6

In [35]:
data.head()

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Duration,Total_Stops,Price,Route_stop_1,Route_stop_2,Route_stop_3,Route_stop_4,Route_stop_5,Route_stop_6
0,IndiGo,24/03/2019,Banglore,New Delhi,BLR → DEL,2.833333,0,3897.0,BLR,DEL,,,,
1,Air India,1/05/2019,Kolkata,Banglore,CCU → IXR → BBI → BLR,7.416667,2,7662.0,CCU,IXR,BBI,BLR,,
2,Jet Airways,9/06/2019,Delhi,Cochin,DEL → LKO → BOM → COK,19.0,2,13882.0,DEL,LKO,BOM,COK,,
3,IndiGo,12/05/2019,Kolkata,Banglore,CCU → NAG → BLR,5.416667,1,6218.0,CCU,NAG,BLR,,,
4,IndiGo,01/03/2019,Banglore,New Delhi,BLR → NAG → DEL,4.75,1,13302.0,BLR,NAG,DEL,,,


In [36]:
data["Route_stop_1"] = data["Route_stop_1"].fillna("Not")
data["Route_stop_2"] = data["Route_stop_2"].fillna("Not")
data["Route_stop_3"] = data["Route_stop_3"].fillna("Not")
data["Route_stop_4"] = data["Route_stop_4"].fillna("Not")
data["Route_stop_5"] = data["Route_stop_5"].fillna("Not")
data["Route_stop_6"] = data["Route_stop_6"].fillna("Not")
data.isnull().sum()

Airline               0
Date_of_Journey       0
Source                0
Destination           0
Route                 1
Duration              0
Total_Stops           0
Price              2671
Route_stop_1          0
Route_stop_2          0
Route_stop_3          0
Route_stop_4          0
Route_stop_5          0
Route_stop_6          0
dtype: int64

In [37]:
# Verifying the Route feature against Total_Stops: a route with k stops lists k + 2 airports
route_cols = ["Route_stop_%d" % i for i in range(1, 7)]
a = (data[route_cols] != "Not").sum(axis=1).tolist()      # number of airports listed in each row
wrong=[]
for s in range(0,999):                                    # note: only the first 999 rows are checked
    if a[s]-2 != int(data["Total_Stops"].iloc[s]):
        wrong.append(s)
print(wrong)

[]
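
The check above covers only the first 999 rows. A minimal sketch of the same consistency rule applied to every row at once, on a toy frame (the rows are invented, and the last two are deliberately inconsistent):

```python
import pandas as pd

toy = pd.DataFrame({
    "Route": ["BLR → DEL", "CCU → NAG → BLR", None, "DEL → BOM → COK"],
    "Total_Stops": [0, 1, 1, 2],
})

# Number of airports listed in each route; a missing route counts as 0.
airports = toy["Route"].str.split(" → ").str.len().fillna(0).astype(int)

# A route with k stops should list k + 2 airports.
mask = (airports - 2) != toy["Total_Stops"]
mismatch = toy.index[mask.to_numpy()].tolist()
print(mismatch)
```

On the real data, a row with a missing Route would be flagged by this version, so the result need not be empty.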


In [38]:
s = np.append(data["Route_stop_1"].unique(),data["Route_stop_2"].unique(), axis=0)     # label encoding the route feature
s = np.append(s,data["Route_stop_3"].unique(), axis=0)
s = np.append(s,data["Route_stop_4"].unique(), axis=0)
s = np.append(s,data["Route_stop_5"].unique(), axis=0)
s = np.append(s,data["Route_stop_6"].unique(), axis=0)
z = pd.Series(s)
z = z.unique()
c = z.tolist()
val = [i for i in range(0,len(z))]
mylabencode = dict(zip(c, val))
data["Route_stop_1"] = data["Route_stop_1"].map(mylabencode)
data["Route_stop_2"] = data["Route_stop_2"].map(mylabencode)
data["Route_stop_3"] = data["Route_stop_3"].map(mylabencode)
data["Route_stop_4"] = data["Route_stop_4"].map(mylabencode)
data["Route_stop_5"] = data["Route_stop_5"].map(mylabencode)
data["Route_stop_6"] = data["Route_stop_6"].map(mylabencode)
data.head()

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Duration,Total_Stops,Price,Route_stop_1,Route_stop_2,Route_stop_3,Route_stop_4,Route_stop_5,Route_stop_6
0,IndiGo,24/03/2019,Banglore,New Delhi,BLR → DEL,2.833333,0,3897.0,0,2,5,5,5,5
1,Air India,1/05/2019,Kolkata,Banglore,CCU → IXR → BBI → BLR,7.416667,2,7662.0,1,6,19,0,5,5
2,Jet Airways,9/06/2019,Delhi,Cochin,DEL → LKO → BOM → COK,19.0,2,13882.0,2,7,4,11,5,5
3,IndiGo,12/05/2019,Kolkata,Banglore,CCU → NAG → BLR,5.416667,1,6218.0,1,8,0,5,5,5
4,IndiGo,01/03/2019,Banglore,New Delhi,BLR → NAG → DEL,4.75,1,13302.0,0,8,2,5,5,5
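
The shared code table above can be built more compactly. A sketch under the assumption that every stop column should use one common vocabulary, shown on a toy frame (the codes it produces will not match the ones printed above, because the values are visited in a different order):

```python
import pandas as pd

stops = pd.DataFrame({
    "Route_stop_1": ["BLR", "CCU", "DEL"],
    "Route_stop_2": ["DEL", "NAG", "BOM"],
    "Route_stop_3": ["Not", "BLR", "COK"],
})

# One vocabulary built from every value seen in any stop column.
vocabulary = pd.unique(stops.to_numpy().ravel())
mapping = {airport: code for code, airport in enumerate(vocabulary)}

# Apply the same mapping to each column so an airport always gets the same code.
encoded = stops.apply(lambda col: col.map(mapping))
print(mapping)
print(encoded)
```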


In [39]:
from sklearn.preprocessing import LabelEncoder                  # Label encoding the remaining features
labenc=LabelEncoder()
data["Airline"]=labenc.fit_transform(data["Airline"])
data["Source"]=labenc.fit_transform(data["Source"])
data["Destination"]=labenc.fit_transform(data["Destination"])
data.head()

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Duration,Total_Stops,Price,Route_stop_1,Route_stop_2,Route_stop_3,Route_stop_4,Route_stop_5,Route_stop_6
0,3,24/03/2019,0,5,BLR → DEL,2.833333,0,3897.0,0,2,5,5,5,5
1,1,1/05/2019,3,0,CCU → IXR → BBI → BLR,7.416667,2,7662.0,1,6,19,0,5,5
2,4,9/06/2019,2,1,DEL → LKO → BOM → COK,19.0,2,13882.0,2,7,4,11,5,5
3,3,12/05/2019,3,0,CCU → NAG → BLR,5.416667,1,6218.0,1,8,0,5,5,5
4,3,01/03/2019,0,5,BLR → NAG → DEL,4.75,1,13302.0,0,8,2,5,5,5
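
Because a single `LabelEncoder` is refitted for each column above, only the last column's classes remain available afterwards. A minimal sketch that keeps one encoder per column so the codes can be mapped back to names later. The `encoders` dictionary and the toy values are assumptions for illustration:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({
    "Airline": ["IndiGo", "Air India", "IndiGo"],
    "Source": ["Banglore", "Kolkata", "Delhi"],
})

# Keep a separate fitted encoder for every categorical column.
encoders = {}
for col in ["Airline", "Source"]:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

# The stored encoder can translate codes back into the original labels.
decoded = encoders["Airline"].inverse_transform(toy["Airline"])
print(toy)
print(list(decoded))
```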


In [40]:
data['Date'] = data['Date_of_Journey'].str.split('/').str[0].astype(int)   # Processing Date of journey: day of month
data['Month'] = data['Date_of_Journey'].str.split('/').str[1].astype(int)
data['Year'] = data['Date_of_Journey'].str.split('/').str[2].astype(int)
data.head()

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Duration,Total_Stops,Price,Route_stop_1,Route_stop_2,Route_stop_3,Route_stop_4,Route_stop_5,Route_stop_6,Date,Month,Year
0,3,24/03/2019,0,5,BLR → DEL,2.833333,0,3897.0,0,2,5,5,5,5,24,3,2019
1,1,1/05/2019,3,0,CCU → IXR → BBI → BLR,7.416667,2,7662.0,1,6,19,0,5,5,1,5,2019
2,4,9/06/2019,2,1,DEL → LKO → BOM → COK,19.0,2,13882.0,2,7,4,11,5,5,9,6,2019
3,3,12/05/2019,3,0,CCU → NAG → BLR,5.416667,1,6218.0,1,8,0,5,5,5,12,5,2019
4,3,01/03/2019,0,5,BLR → NAG → DEL,4.75,1,13302.0,0,8,2,5,5,5,1,3,2019
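
The same day, month and year columns can also be obtained by parsing the strings as dates. A sketch assuming every value follows the day/month/year layout seen in the preview rows:

```python
import pandas as pd

# Sample values in the same layout as Date_of_Journey.
dates = pd.Series(["24/03/2019", "1/05/2019", "9/06/2019", "01/03/2019"])

# Parse with an explicit day-first format so 1/05 is read as 1 May, not 5 January.
parsed = pd.to_datetime(dates, format="%d/%m/%Y")

features = pd.DataFrame({
    "Date": parsed.dt.day,
    "Month": parsed.dt.month,
    "Year": parsed.dt.year,
})
print(features)
```

Parsing also fails loudly on a malformed date, which the string split would silently pass through.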


In [41]:
data = data.drop(["Route","Date_of_Journey"], axis = 1)
data.head()

Unnamed: 0,Airline,Source,Destination,Duration,Total_Stops,Price,Route_stop_1,Route_stop_2,Route_stop_3,Route_stop_4,Route_stop_5,Route_stop_6,Date,Month,Year
0,3,0,5,2.833333,0,3897.0,0,2,5,5,5,5,24,3,2019
1,1,3,0,7.416667,2,7662.0,1,6,19,0,5,5,1,5,2019
2,4,2,1,19.0,2,13882.0,2,7,4,11,5,5,9,6,2019
3,3,3,0,5.416667,1,6218.0,1,8,0,5,5,5,12,5,2019
4,3,0,5,4.75,1,13302.0,0,8,2,5,5,5,1,3,2019


In [42]:
mean = data["Price"].mean()                                # handling null values: Price is missing only for the 2671 test rows
data["Price"] = data["Price"].fillna(mean)
data.isnull().sum()

Airline         0
Source          0
Destination     0
Duration        0
Total_Stops     0
Price           0
Route_stop_1    0
Route_stop_2    0
Route_stop_3    0
Route_stop_4    0
Route_stop_5    0
Route_stop_6    0
Date            0
Month           0
Year            0
dtype: int64

The data is now clean and ready for further processing.

In [43]:
df_train = data[0:10683]                       # Restoring the original train/test split by row count
df_test = data[10683:]
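
Slicing by a fixed row count works as long as no rows are added, dropped or reordered during cleaning. A hedged sketch of an alternative that tags each row with its origin before combining. The `is_train` column is a hypothetical helper, not part of the dataset:

```python
import pandas as pd

# Toy stand-ins for the training and test frames.
train = pd.DataFrame({"Airline": ["IndiGo", "Air India"], "Price": [3897, 7662]})
test = pd.DataFrame({"Airline": ["Jet Airways"]})

# Tag each row with its origin, then combine for shared cleaning.
combined = pd.concat(
    [train.assign(is_train=True), test.assign(is_train=False)],
    sort=False,
    ignore_index=True,
)

# ... shared cleaning steps would run here ...

# Recover both parts without relying on row positions.
train_part = combined[combined["is_train"]].drop(columns="is_train")
test_part = combined[~combined["is_train"]].drop(columns=["is_train", "Price"])
print(train_part.shape, test_part.shape)
```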

In [44]:
x=df_train.drop("Price", axis = 1)             # Splitting data into independent and dependent variables
y=df_train["Price"]

In [45]:
x.head()

Unnamed: 0,Airline,Source,Destination,Duration,Total_Stops,Route_stop_1,Route_stop_2,Route_stop_3,Route_stop_4,Route_stop_5,Route_stop_6,Date,Month,Year
0,3,0,5,2.833333,0,0,2,5,5,5,5,24,3,2019
1,1,3,0,7.416667,2,1,6,19,0,5,5,1,5,2019
2,4,2,1,19.0,2,2,7,4,11,5,5,9,6,2019
3,3,3,0,5.416667,1,1,8,0,5,5,5,12,5,2019
4,3,0,5,4.75,1,0,8,2,5,5,5,1,3,2019


In [46]:
y.head()

0     3897.0
1     7662.0
2    13882.0
3     6218.0
4    13302.0
Name: Price, dtype: float64

# Feature Selection

In [47]:
from sklearn.model_selection import train_test_split                 # Splitting data into train and test
X_train,X_test,Y_train,Y_test=train_test_split(x,y,test_size=0.3,random_state=0)

In [48]:
from sklearn.linear_model import Lasso                              # Applying lasso for feature selection
from sklearn.feature_selection import SelectFromModel

In [49]:
model=SelectFromModel(Lasso(alpha=0.005,random_state=0))

In [50]:
model.fit(X_train,Y_train)

SelectFromModel(estimator=Lasso(alpha=0.005, copy_X=True, fit_intercept=True,
                                max_iter=1000, normalize=False, positive=False,
                                precompute=False, random_state=0,
                                selection='cyclic', tol=0.0001,
                                warm_start=False),
                max_features=None, norm_order=1, prefit=False, threshold=None)

In [51]:
model.get_support()

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True, False,  True,  True, False])

In [52]:
selected_features=X_train.columns[(model.get_support())]
selected_features

Index(['Airline', 'Source', 'Destination', 'Duration', 'Total_Stops',
       'Route_stop_1', 'Route_stop_2', 'Route_stop_3', 'Route_stop_4',
       'Route_stop_5', 'Date', 'Month'],
      dtype='object')
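
A small synthetic sketch of why a column such as Year can fall out of the selection: a column with the same value in every row carries no information, so Lasso leaves its coefficient at zero and `SelectFromModel` rejects it. The data below is randomly generated for illustration and is unrelated to the flight dataset:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "Duration": rng.uniform(1, 30, size=200),   # informative feature
    "Noise": rng.normal(size=200),              # unrelated feature
    "Year": np.full(200, 2019),                 # constant feature
})
y = 500 * X["Duration"] + rng.normal(scale=10, size=200)

# Same selector settings as in the notebook.
selector = SelectFromModel(Lasso(alpha=0.005, random_state=0))
selector.fit(X, y)

selected = X.columns[selector.get_support()].tolist()
print(selected)
```

With such a small `alpha` the penalty is weak, so weakly related features can still be kept; only coefficients that shrink to (almost) zero are removed.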

In [53]:
X_train=X_train.drop(['Year'],axis=1)    # Removing the unselected Year feature (Route_stop_6 was also unselected but is kept)
X_test=X_test.drop(['Year'],axis=1)

In [54]:
X_train.tail()

Unnamed: 0,Airline,Source,Destination,Duration,Total_Stops,Route_stop_1,Route_stop_2,Route_stop_3,Route_stop_4,Route_stop_5,Route_stop_6,Date,Month
9225,4,3,0,22.166667,1,1,4,0,5,5,5,12,6
4859,1,2,1,21.25,2,2,20,4,11,5,5,9,6
3264,4,2,1,20.416667,1,2,4,11,5,5,5,3,3
9845,4,2,1,25.083333,1,2,4,11,5,5,5,18,5
2732,4,3,0,27.166667,1,1,2,0,5,5,5,9,5


# Applying Algorithm

In [55]:
from sklearn.model_selection import RandomizedSearchCV
#Randomized Search CV

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 100, stop = 1200, num = 12)]
# Number of features to consider at every split
# Note: 'auto' (all features, for a regressor) was removed in scikit-learn 1.3; use 1.0 there
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(5, 30, num = 6)]
# max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10, 15, 100]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 5, 10]

In [56]:
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf}

print(random_grid)

{'n_estimators': [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1100, 1200], 'max_features': ['auto', 'sqrt'], 'max_depth': [5, 10, 15, 20, 25, 30], 'min_samples_split': [2, 5, 10, 15, 100], 'min_samples_leaf': [1, 2, 5, 10]}


In [57]:
# Use the random grid to search for best hyperparameters
# First create the base model to tune
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor()

In [58]:
# Random search of parameters, using 5-fold cross validation,
# search across 50 different combinations
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid,scoring='neg_mean_squared_error', n_iter = 50, cv = 5, verbose=2, random_state=42, n_jobs = 1)

In [59]:
rf_random.fit(X_train,Y_train)

Fitting 5 folds for each of 50 candidates, totalling 250 fits
[CV] n_estimators=400, min_samples_split=100, min_samples_leaf=10, max_features=sqrt, max_depth=5 

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.

[CV]  n_estimators=400, min_samples_split=100, min_samples_leaf=10, max_features=sqrt, max_depth=5, total=   1.0s

[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.9s remaining:    0.0s

[... remaining per-fold log lines omitted; the captured output was truncated ...]
[Parallel(n_jobs=1)]: Done 250 out of 250 | elapsed: 13.4min finished


RandomizedSearchCV(cv=5, error_score=nan,
                   estimator=RandomForestRegressor(bootstrap=True,
                                                   ccp_alpha=0.0,
                                                   criterion='mse',
                                                   max_depth=None,
                                                   max_features='auto',
                                                   max_leaf_nodes=None,
                                                   max_samples=None,
                                                   min_impurity_decrease=0.0,
                                                   min_impurity_split=None,
                                                   min_samples_leaf=1,
                                                   min_samples_split=2,
                                                   min_weight_fraction_leaf=0.0,
                                                   n_estimators=100,
                              

In [60]:
y_pred=rf_random.predict(X_test)
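
By default `RandomizedSearchCV` refits the best candidate on the full training split, and `predict` uses that refitted model. A minimal sketch of inspecting what the search chose, on a tiny synthetic problem with a much smaller grid than the notebook's so that it runs in seconds (all values here are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)
X = rng.uniform(size=(120, 3))
y = 10 * X[:, 0] + rng.normal(scale=0.1, size=120)

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions={
        "n_estimators": [10, 20],
        "max_depth": [3, 5],
        "max_features": ["sqrt", 1.0],
    },
    n_iter=3,
    cv=3,
    scoring="neg_mean_squared_error",
    random_state=42,
)
search.fit(X, y)

best_params = search.best_params_                 # winning hyperparameter combination
cv_rmse = float(np.sqrt(-search.best_score_))     # score is negated MSE, so flip the sign
print(best_params, round(cv_rmse, 3))
```

The same two attributes are available on `rf_random` after fitting.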

In [61]:
y_pred

array([ 6625.37989863,  3882.09087854, 12304.29633171, ...,
       10509.00875184, 11701.82894607,  6543.22269643])
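
The predictions can be compared with the held-out prices to judge the model. A sketch of three common regression metrics, using made-up numbers in place of `Y_test` and `y_pred`:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical held-out prices and predictions, for illustration only.
y_true = np.array([3897.0, 7662.0, 13882.0, 6218.0])
y_hat = np.array([4000.0, 7500.0, 13000.0, 6500.0])

mae = mean_absolute_error(y_true, y_hat)            # average absolute error in price units
rmse = np.sqrt(mean_squared_error(y_true, y_hat))   # penalises large errors more heavily
r2 = r2_score(y_true, y_hat)                        # share of variance explained

print(mae, rmse, r2)
```

In the notebook the equivalent call would pass `Y_test` and `y_pred`.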

# Export the model to a pickle file

In [62]:
import pickle
with open('flight_price_model.pickle','wb') as f:
    pickle.dump(rf_random,f)
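
A saved model is only useful if it can be restored. A minimal round-trip sketch with a small stand-in model and a temporary file. The file name and the linear model are assumptions for illustration; the notebook's pickle holds the fitted search object instead:

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Small stand-in model fitted on toy data.
model = LinearRegression().fit(np.array([[1.0], [2.0], [3.0]]), np.array([2.0, 4.0, 6.0]))

# Write the model to a temporary pickle file.
path = os.path.join(tempfile.mkdtemp(), "toy_model.pickle")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Read it back and confirm the restored model predicts the same values.
with open(path, "rb") as f:
    restored = pickle.load(f)

same = bool(np.allclose(model.predict([[4.0]]), restored.predict([[4.0]])))
print(same)
```

Pickles should be loaded with the same scikit-learn version that wrote them, and only from trusted sources, since unpickling can execute arbitrary code.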