# This is a project in which we predict the air ticket prices of different airlines. We are provided with a dataset with the following features:

# FEATURES:

**Airline**: The name of the airline.

**Date_of_Journey**: The date of the journey.

**Source**: The source from which the service begins.

**Destination**: The destination where the service ends.

**Route**: The route taken by the flight to reach the destination.

**Dep_Time**: The time when the journey starts from the source.

**Arrival_Time**: Time of arrival at the destination.

**Duration**: Total duration of the flight.

**Total_Stops**: Total stops between the source and destination.

**Additional_Info**: Additional information about the flight.

**Price**: The price of the ticket.

In [189]:
# Import Dependencies
%matplotlib inline

# Start Python Imports
import math, time, random, datetime

# Data Manipulation
import numpy as np
import pandas as pd

# Visualization 
import matplotlib.pyplot as plt
import missingno
import seaborn as sns
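# Newer matplotlib releases no longer ship the 'seaborn-whitegrid' style name,
# so the white-grid look is set through seaborn instead
sns.set_style('whitegrid')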

# Preprocessing
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, label_binarize

# Machine learning
import catboost
from sklearn.model_selection import train_test_split

# Let's be rebels and ignore warnings for now
import warnings
warnings.filterwarnings('ignore')

# Importing training and test set. 

In [190]:
train = pd.read_csv('Data_Train.csv')
test = pd.read_csv('Data_Test.csv')

In [191]:
#data in training set
train.head()

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price
0,IndiGo,24/03/2019,Banglore,New Delhi,BLR → DEL,22:20,01:10 22 Mar,2h 50m,non-stop,No info,3897
1,Air India,1/05/2019,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,13:15,7h 25m,2 stops,No info,7662
2,Jet Airways,9/06/2019,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,04:25 10 Jun,19h,2 stops,No info,13882
3,IndiGo,12/05/2019,Kolkata,Banglore,CCU → NAG → BLR,18:05,23:30,5h 25m,1 stop,No info,6218
4,IndiGo,01/03/2019,Banglore,New Delhi,BLR → NAG → DEL,16:50,21:35,4h 45m,1 stop,No info,13302


# Checking the number of missing values 

In [192]:
train.isnull().sum()

Airline            0
Date_of_Journey    0
Source             0
Destination        0
Route              1
Dep_Time           0
Arrival_Time       0
Duration           0
Total_Stops        1
Additional_Info    0
Price              0
dtype: int64

In [193]:
# Creating separate data frames for our training and test sets
df_train = pd.DataFrame()
df_test = pd.DataFrame()

# Before we take any action on our dataset, we need to transfer our data to the newly created data frames with some amendments: 

1 - The price has a correlation with the month in which the flight is scheduled, so we extract the month from Date_of_Journey.
2 - Since we already have the number of stops (Total_Stops), the Route column is redundant and can be dropped.
3 - The price has a correlation with the departure time (Dep_Time). To capture this relation, we split the day into two halves ('Morning' and 'Evening').
4 - Since we already have the Duration of the journey, we can drop Arrival_Time from our data frame.
5 - Duration is given in hours and minutes; we convert it to minutes only.
6 - Total_Stops is stored as a string, so we need to extract its numerical value.

In [194]:
df_train['Airline'] = train['Airline']
df_train['Date_of_Journey'] = train['Date_of_Journey']
df_train['Source'] = train['Source']
df_train['Destination'] = train['Destination']
df_train['Dep_Time'] = train['Dep_Time']
df_train['Duration'] = train['Duration']
df_train['Total_Stops'] = train['Total_Stops']
df_train['Additional_Info'] = train['Additional_Info']
df_train['Price'] = train['Price']

In [195]:
df_train.head()

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Dep_Time,Duration,Total_Stops,Additional_Info,Price
0,IndiGo,24/03/2019,Banglore,New Delhi,22:20,2h 50m,non-stop,No info,3897
1,Air India,1/05/2019,Kolkata,Banglore,05:50,7h 25m,2 stops,No info,7662
2,Jet Airways,9/06/2019,Delhi,Cochin,09:25,19h,2 stops,No info,13882
3,IndiGo,12/05/2019,Kolkata,Banglore,18:05,5h 25m,1 stop,No info,6218
4,IndiGo,01/03/2019,Banglore,New Delhi,16:50,4h 45m,1 stop,No info,13302


In [196]:
df_test['Airline'] = test['Airline']
df_test['Date_of_Journey'] = test['Date_of_Journey']
df_test['Source'] = test['Source']
df_test['Destination'] = test['Destination']
df_test['Dep_Time'] = test['Dep_Time']
df_test['Duration'] = test['Duration']
df_test['Total_Stops'] = test['Total_Stops']
df_test['Additional_Info'] = test['Additional_Info']

In [197]:
df_test.head()

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Dep_Time,Duration,Total_Stops,Additional_Info
0,Jet Airways,6/06/2019,Delhi,Cochin,17:30,10h 55m,1 stop,No info
1,IndiGo,12/05/2019,Kolkata,Banglore,06:20,4h,1 stop,No info
2,Jet Airways,21/05/2019,Delhi,Cochin,19:15,23h 45m,1 stop,In-flight meal not included
3,Multiple carriers,21/05/2019,Delhi,Cochin,08:00,13h,1 stop,No info
4,Air Asia,24/06/2019,Banglore,Delhi,23:55,2h 50m,non-stop,No info


## Checking which rows have missing values; it seems that both missing values belong to the same row 

In [198]:
df_train[df_train.isnull().any(axis=1)]

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Dep_Time,Duration,Total_Stops,Additional_Info,Price
9039,Air India,6/05/2019,Delhi,Cochin,09:45,23h 40m,,No info,7480


In [199]:
# Deleting this row
df_train = df_train[(df_train.isnull().sum(axis = 1) == 0)]

In [200]:
#checking the missing values again
df_train.isnull().sum(axis = 0)

Airline            0
Date_of_Journey    0
Source             0
Destination        0
Dep_Time           0
Duration           0
Total_Stops        0
Additional_Info    0
Price              0
dtype: int64
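
The same check-and-drop step can also be written with pandas' built-in missing-value helpers. The sketch below uses a small made-up frame (the rows are illustrative and `toy`, `rows_with_na`, and `clean` are hypothetical names), so treat it as one possible way to do it rather than the notebook's own method.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_train; the values are illustrative only.
toy = pd.DataFrame({
    "Airline": ["IndiGo", "Air India", "Jet Airways"],
    "Total_Stops": ["non-stop", np.nan, "2 stops"],
    "Price": [3897, 7480, 13882],
})

# Rows that contain at least one missing value.
rows_with_na = toy[toy.isnull().any(axis=1)]

# dropna() returns a new frame without those rows.
clean = toy.dropna()

print(rows_with_na)
print(clean.isnull().sum())
```

Because `dropna()` returns a new frame, later in-place changes on `clean` do not touch `toy`.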

## Taking out the price column and storing it in another variable named Y 

In [201]:
Y=df_train['Price']
df_train.drop(labels=['Price'],axis=1,inplace=True)

## Concatenating the training set and test set 

In [202]:
df_train = pd.concat([df_train, df_test])

In [203]:
df_train.shape

(13353, 8)
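
When two frames are stacked for shared preprocessing, they have to be separated again later. One option, sketched here on made-up data with hypothetical names (`train_part`, `test_part`, `combined`), is to label each block with the `keys` argument of `pd.concat` so the split does not depend on remembering a row count.

```python
import pandas as pd

# Toy stand-ins for the training and test frames (illustrative values).
train_part = pd.DataFrame({"Airline": ["IndiGo", "Air India", "IndiGo"]})
test_part = pd.DataFrame({"Airline": ["Jet Airways", "IndiGo"]})

# keys= adds an outer index level that records where each row came from.
combined = pd.concat([train_part, test_part], keys=["train", "test"])

# ... shared preprocessing on `combined` would go here ...

# Recover each block by its label instead of by position.
train_again = combined.loc["train"]
test_again = combined.loc["test"]

print(combined.shape, len(train_again), len(test_again))
```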

## Handling the date of journey, extracting the month from the date 

In [204]:
def datej(x):
  # Dates are in day/month/year form; keep the month part
  return x.strip().split('/')[1]
  
  
df_train['Date_of_Journey']=df_train['Date_of_Journey'].apply(datej)
df_train['Date_of_Journey']=df_train['Date_of_Journey'].astype(str)
df_train.head()

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Dep_Time,Duration,Total_Stops,Additional_Info
0,IndiGo,3,Banglore,New Delhi,22:20,2h 50m,non-stop,No info
1,Air India,5,Kolkata,Banglore,05:50,7h 25m,2 stops,No info
2,Jet Airways,6,Delhi,Cochin,09:25,19h,2 stops,No info
3,IndiGo,5,Kolkata,Banglore,18:05,5h 25m,1 stop,No info
4,IndiGo,3,Banglore,New Delhi,16:50,4h 45m,1 stop,No info
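
As an alternative to splitting the string by hand, the month can be read from a parsed date. This is a minimal sketch on three sample dates in the same day/month/year form; the names `dates`, `parsed`, and `month` are hypothetical, and the explicit `format` is an assumption that every value follows that layout.

```python
import pandas as pd

# Sample dates in day/month/year form, as in Date_of_Journey.
dates = pd.Series(["24/03/2019", "1/05/2019", "9/06/2019"])

# An explicit format avoids any day/month ambiguity.
parsed = pd.to_datetime(dates, format="%d/%m/%Y")

# Integer month number for each row.
month = parsed.dt.month
print(month.tolist())
```

Parsing first also makes other parts of the date (day of week, day of month) available if they are needed later.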


## Handling the departure time: if the departure is at 12:00 or later we mark it as Evening, otherwise as Morning 

In [205]:
def deptime(x):
  # Hour component of an 'HH:MM' string
  hour = int(x.strip().split(':')[0])
  # 12:00 and later counts as evening
  return 'Evening' if hour >= 12 else 'Morning'

df_train['Dep_Time']=df_train['Dep_Time'].apply(deptime)
df_train.head()

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Dep_Time,Duration,Total_Stops,Additional_Info
0,IndiGo,3,Banglore,New Delhi,Evening,2h 50m,non-stop,No info
1,Air India,5,Kolkata,Banglore,Morning,7h 25m,2 stops,No info
2,Jet Airways,6,Delhi,Cochin,Morning,19h,2 stops,No info
3,IndiGo,5,Kolkata,Banglore,Evening,5h 25m,1 stop,No info
4,IndiGo,3,Banglore,New Delhi,Evening,4h 45m,1 stop,No info
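
The same Morning/Evening rule can be applied to the whole column at once. The sketch below is a vectorized variant on a few sample times; `dep`, `hour`, and `part_of_day` are hypothetical names, and the cut-off at hour 12 mirrors `deptime` above.

```python
import numpy as np
import pandas as pd

# Sample 'HH:MM' departure times, including the boundary cases.
dep = pd.Series(["22:20", "05:50", "12:00", "11:59"])

# Hour component as an integer.
hour = dep.str.strip().str.split(":").str[0].astype(int)

# 12:00 and later is labelled Evening, everything earlier Morning.
part_of_day = pd.Series(np.where(hour >= 12, "Evening", "Morning"), index=dep.index)
print(part_of_day.tolist())
```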


## Handling the flight duration, converting it into minutes 

In [206]:
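def changed(duration):
    # Convert a string such as '2h 50m' or '19h' to total minutes (returned as a string)
    parts = duration.strip().split(' ')
    # The first token is assumed to be the hours part, e.g. '2h'
    minutes = int(parts[0][:-1]) * 60
    if len(parts) == 2:
        # Optional second token is the minutes part, e.g. '50m'
        minutes += int(parts[1][:-1])
    return str(minutes)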
  
  
df_train['Duration']=df_train['Duration'].apply(changed)
df_train.head()

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Dep_Time,Duration,Total_Stops,Additional_Info
0,IndiGo,3,Banglore,New Delhi,Evening,170,non-stop,No info
1,Air India,5,Kolkata,Banglore,Morning,445,2 stops,No info
2,Jet Airways,6,Delhi,Cochin,Morning,1140,2 stops,No info
3,IndiGo,5,Kolkata,Banglore,Evening,325,1 stop,No info
4,IndiGo,3,Banglore,New Delhi,Evening,285,1 stop,No info
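
The conversion above treats the first token as hours. A regular expression that looks at the `h` and `m` suffixes is a little more defensive. The sketch below is one possible variant; the names are hypothetical, and `'45m'` is a made-up minutes-only value that `changed` would read as 45 hours.

```python
import pandas as pd

# Two real-looking durations plus a hypothetical minutes-only value.
durations = pd.Series(["2h 50m", "19h", "45m"])

# Optional '<n>h' followed by an optional '<n>m'; missing parts become NaN.
parts = durations.str.extract(r"(?:(\d+)h)?\s*(?:(\d+)m)?")
hours = pd.to_numeric(parts[0]).fillna(0)
minutes = pd.to_numeric(parts[1]).fillna(0)

# Total duration in minutes, kept as integers rather than strings.
total_minutes = (hours * 60 + minutes).astype(int)
print(total_minutes.tolist())
```

Keeping the result numeric also preserves the natural ordering of durations, which is lost if the values are stored as strings.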


## Handling the number of stops 

In [207]:
def stops(x):
  x = x.strip()
  if x == 'non-stop':
    return '0'
  # '1 stop', '2 stops', ... -> keep the leading number (as a string)
  return x.split(' ')[0]

df_train['Total_Stops']=df_train['Total_Stops'].apply(stops)
df_train['Total_Stops']

0       0
1       2
2       2
3       1
4       1
5       0
6       1
7       1
8       1
9       1
10      1
11      0
12      0
13      1
14      0
15      2
16      1
17      1
18      2
19      1
20      1
21      1
22      0
23      0
24      1
25      2
26      1
27      1
28      0
29      0
       ..
2641    1
2642    1
2643    2
2644    1
2645    1
2646    2
2647    2
2648    0
2649    0
2650    2
2651    1
2652    1
2653    1
2654    1
2655    1
2656    1
2657    1
2658    1
2659    2
2660    2
2661    2
2662    0
2663    1
2664    1
2665    0
2666    1
2667    0
2668    1
2669    1
2670    1
Name: Total_Stops, Length: 13353, dtype: object
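
Note that the column above still has dtype `object`, i.e. the stop counts are strings. If integers are wanted, the extraction can be done in one vectorized chain. This is a sketch on sample values with hypothetical names (`stops_raw`, `stops_num`), not a replacement for the cell above.

```python
import pandas as pd

# Sample values in the same form as Total_Stops.
stops_raw = pd.Series(["non-stop", "2 stops", "1 stop", "4 stops"])

stops_num = (
    stops_raw.str.strip()
    .replace("non-stop", "0 stops")       # give 'non-stop' an explicit number
    .str.extract(r"(\d+)", expand=False)  # keep the leading digits
    .astype(int)                          # real integers instead of strings
)
print(stops_num.tolist())
```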

## Handling the inconsistent spelling ('No info' vs 'No Info') in the Additional_Info column 

In [208]:
s = set(df_train['Additional_Info'])
s

{'1 Long layover',
 '1 Short layover',
 '2 Long layover',
 'Business class',
 'Change airports',
 'In-flight meal not included',
 'No Info',
 'No check-in baggage included',
 'No info',
 'Red-eye flight'}

In [209]:
# Unify the two spellings of the same label. Assigning through
# df_train.iloc[i][...] would only modify a temporary copy, so the
# whole column is replaced in one step instead.
df_train['Additional_Info'] = df_train['Additional_Info'].replace('No info', 'No Info')

## Applying label encoding on the training set (train and test combined) 

In [210]:
df_train_enc = df_train.apply(LabelEncoder().fit_transform)

df_train_enc.count() #counting the number of rows

Airline            13353
Date_of_Journey    13353
Source             13353
Destination        13353
Dep_Time           13353
Duration           13353
Total_Stops        13353
Additional_Info    13353
dtype: int64
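
Encoding train and test together guarantees that every category in the test rows has a code. Another option is to fit the encoder on the training rows only and map unseen categories to a sentinel value. The sketch below assumes scikit-learn 0.24 or newer (for `handle_unknown="use_encoded_value"`); the frames and their contents are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy categorical data; the second test value does not occur in training.
train_cat = pd.DataFrame({"Airline": ["IndiGo", "Air India", "IndiGo"]})
test_cat = pd.DataFrame({"Airline": ["Air India", "Vistara"]})

# Unseen categories are encoded as -1 instead of raising an error.
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = enc.fit_transform(train_cat)
test_codes = enc.transform(test_cat)

print(train_codes.ravel(), test_codes.ravel())
```

Fitting on the training rows only keeps the test data out of the preprocessing step, at the cost of having to decide how unknown categories are handled.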

## Splitting the data into training set and test set 

In [211]:
x_test=df_train_enc.iloc[10682:]
x_train=df_train_enc.iloc[:10682]

In [212]:
x_test.count() #number of rows in test set

Airline            2671
Date_of_Journey    2671
Source             2671
Destination        2671
Dep_Time           2671
Duration           2671
Total_Stops        2671
Additional_Info    2671
dtype: int64

In [213]:
x_train.count() #number of rows in training set

Airline            10682
Date_of_Journey    10682
Source             10682
Destination        10682
Dep_Time           10682
Duration           10682
Total_Stops        10682
Additional_Info    10682
dtype: int64

## Modeling 

In [214]:
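from sklearn.model_selection import train_test_split
# Hold out part of the labelled rows for validation.
# Note: this reassigns x_train and x_test, so from here on x_test is the
# validation split, not the encoded rows from Data_Test.csv.
x_train, x_test, y_train, y_test = train_test_split(x_train,Y,random_state=10)

from sklearn.ensemble import RandomForestRegressor
rr=RandomForestRegressor(random_state=0)
rr.fit(x_train,y_train)

# Predictions for the validation split
y_pred=rr.predict(x_test)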
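
Since `train_test_split` is called on rows that already have prices, it helps to give the validation split its own names so the rows without prices stay available for the final prediction. The following is a sketch on synthetic data; every name in it (`X_labeled`, `X_unlabeled`, `X_fit`, `X_val`, ...) is hypothetical and the numbers are randomly generated, not taken from the dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_labeled = rng.integers(0, 10, size=(200, 3))   # stands in for encoded rows with prices
y_labeled = X_labeled.sum(axis=1) * 100 + 1000   # synthetic positive "prices"
X_unlabeled = rng.integers(0, 10, size=(50, 3))  # stands in for encoded rows without prices

# Separate names for the validation split keep X_unlabeled untouched.
X_fit, X_val, y_fit, y_val = train_test_split(X_labeled, y_labeled, random_state=10)

model = RandomForestRegressor(n_estimators=20, random_state=0)
model.fit(X_fit, y_fit)

val_pred = model.predict(X_val)          # compared against y_val for scoring
final_pred = model.predict(X_unlabeled)  # predictions for rows without known prices

print(val_pred.shape, final_pred.shape)
```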

## Method to check RMSLE 

In [215]:
def rmsle(Y, YH):
    # Y: actual prices, YH: predicted prices (all must be > 0 for the logs)
    total = 0.0
    for y, yh in zip(Y, YH):
        p = np.log(yh)
        r = np.log(y)
        total = total + (p-r)**2
    return (total/len(Y))**0.5
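
The function above uses the plain natural log. A common variant of RMSLE uses log(1 + x), which is what scikit-learn's `mean_squared_log_error` computes. The sketch below puts both side by side on made-up numbers; `rmsle_log` and `rmsle_log1p` are hypothetical helper names, and for prices in the thousands the two values are expected to be nearly identical.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

def rmsle_log(y_true, y_pred):
    """Same formula as rmsle above: plain natural log, inputs must be > 0."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))

def rmsle_log1p(y_true, y_pred):
    """Variant based on log(1 + x), computed via scikit-learn."""
    return float(np.sqrt(mean_squared_log_error(y_true, y_pred)))

# Illustrative actual and predicted prices.
y_true = [3897, 7662, 13882]
y_hat = [4000, 7000, 15000]

print(rmsle_log(y_true, y_hat), rmsle_log1p(y_true, y_hat))
```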

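## Predicting the price for the test split 

In [216]:
# x_test is the hold-out split created by train_test_split above
yy_pred=rr.predict(x_test)
yy_pred=yy_pred.astype(int)  # truncate to whole-number prices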

## Checking RMSLE 

In [217]:
rmsle(y_test,y_pred)

0.21692604407109634

## Converting the RMSLE into a percentage score, (1 - RMSLE) * 100 

In [218]:
(1-rmsle(y_test,y_pred))*100

78.30739559289037

## The predicted price values are: 

In [219]:
yy_pred

array([ 8687, 12627,  9887, ..., 13153,  4976, 10262])

## THE-END 