# Problem Description

Flight ticket prices can be hard to guess: we might see one price today, check the same flight tomorrow, and find a different one. Travellers often say that ticket prices are unpredictable. Here we take on the challenge! As data scientists, we will try to show that, given the right data, these prices can be predicted. The dataset contains flight ticket prices for various airlines and city pairs between March and June 2019.

## Feature Description

* Size of training set: 10683 records
* Size of test set: 2671 records

Features:

* Airline: The name of the airline.
* Date_of_Journey: The date of the journey.
* Source: The source from which the service begins.
* Destination: The destination where the service ends.
* Route: The route taken by the flight to reach the destination.
* Dep_Time: The time when the journey starts from the source.
* Arrival_Time: Time of arrival at the destination.
* Duration: Total duration of the flight.
* Total_Stops: Total stops between the source and destination.
* Additional_Info: Additional information about the flight.
* Price: The price of the ticket.

# Importing Libraries

In [None]:
#data preprocessing
import pandas as pd

#Linear Algebra
import numpy as np

#Data Visualization
import seaborn as sns
%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib import style
#plotly
!pip install chart_studio
import plotly.express as px
import chart_studio.plotly as py
import plotly.graph_objs as go
from plotly.offline import iplot
import cufflinks
cufflinks.go_offline()
cufflinks.set_config_file(world_readable=True, theme='pearl')

#Algorithms
from sklearn.svm import SVR
from sklearn import linear_model
from sklearn.linear_model import Ridge,Lasso,ElasticNet
from pandas import Series, DataFrame
from sklearn.linear_model import LinearRegression,LogisticRegression
from sklearn.tree import DecisionTreeRegressor 
from sklearn.ensemble import RandomForestRegressor 
from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.ensemble import GradientBoostingRegressor 
from sklearn import metrics  
from sklearn.metrics import confusion_matrix,classification_report
from sklearn.preprocessing import StandardScaler,LabelEncoder


import warnings
warnings.filterwarnings('ignore')

# Getting Data

In [None]:
flight=pd.read_excel('../input/data-train/Data_Train.xlsx')
flight.head()

In [None]:
flight.describe()

Here we can observe that the minimum flight price is 1759 and the maximum is 79512.

In [None]:
flight.info()

Here we can observe that Price is stored as an integer, while every other feature is stored as an object (string).

In [None]:
flight.shape

The flight data has 11 columns (10 features plus the target, Price) and 10683 rows.

# Feature Engineering

In [None]:
flight['Date']=flight['Date_of_Journey'].str.split('/').str[0]
flight['Month']=flight['Date_of_Journey'].str.split('/').str[1]
flight['Year']=flight['Date_of_Journey'].str.split('/').str[2]


In [None]:
flight.head()

In [None]:
flight.dtypes

In [None]:
flight['Date']=flight['Date'].astype(int)
flight['Month']=flight['Month'].astype(int)
flight['Year']=flight['Year'].astype(int)
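
As an alternative to splitting the string by hand, the same day, month and year columns could be derived with `pd.to_datetime`. The sketch below uses a toy frame named `demo` (not the real `flight` data) and assumes the `dd/mm/yyyy` layout implied by the `split('/')` calls above. The extra `Weekday` column is a hypothetical addition that the notebook does not use.

```python
import pandas as pd

# Toy stand-in for `flight`; the dd/mm/yyyy layout is an assumption
# taken from the split('/') logic in the notebook.
demo = pd.DataFrame({"Date_of_Journey": ["24/03/2019", "1/05/2019", "9/06/2019"]})

# Parse once, then read the parts off the datetime accessor.
parsed = pd.to_datetime(demo["Date_of_Journey"], format="%d/%m/%Y")
demo["Date"] = parsed.dt.day
demo["Month"] = parsed.dt.month
demo["Year"] = parsed.dt.year

# Hypothetical extra feature: day of week (Monday=0 ... Sunday=6).
demo["Weekday"] = parsed.dt.dayofweek

print(demo)
```

Parsing the date once also makes malformed entries fail loudly instead of producing a silently wrong integer.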

In [None]:
flight.dtypes

In [None]:
#dropping the Date_of_Journey column
flight.drop('Date_of_Journey',axis=1,inplace=True)


In [None]:
flight.head()

In [None]:
flight['Arrival_time']=flight['Arrival_Time'].str.split(' ').str[0]
flight.head()

In [None]:
#dropping the original Arrival_Time column (the clock part is kept in Arrival_time)
flight.drop('Arrival_Time',axis=1,inplace=True)
flight.head()

In [None]:
flight[flight['Total_Stops'].isnull()]

In [None]:
flight['Total_Stops']=flight['Total_Stops'].fillna('1 stop')

In [None]:
flight['Total_Stops']=flight['Total_Stops'].replace('non-stop','0 stop')

In [None]:
flight.head()

In [None]:
flight['Stop']=flight['Total_Stops'].str.split(' ').str[0]

In [None]:
flight.head()

In [None]:
flight.drop('Total_Stops',axis=1,inplace=True)

In [None]:
flight.head()

In [None]:
flight['Stop']=flight['Stop'].astype(int)
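
The stop-count steps above (fill the missing value, rename `non-stop`, keep the leading number, cast to int) can be sketched as a single chain. The frame `demo` is a toy example, and the fill value `'1 stop'` simply mirrors the choice made in the notebook.

```python
import pandas as pd

# Toy stand-in for the Total_Stops column, including one missing value.
demo = pd.DataFrame({"Total_Stops": ["non-stop", "1 stop", "2 stops", None, "3 stops"]})

demo["Stop"] = (
    demo["Total_Stops"]
    .fillna("1 stop")               # same fill value the notebook uses
    .replace("non-stop", "0 stop")  # make every label start with a number
    .str.split(" ").str[0]          # keep only the leading number
    .astype(int)
)

print(demo)
```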

In [None]:
flight.dtypes

The data looks clean so far.

In [None]:
flight['Arrival_hour']=flight['Arrival_time'].str.split(':').str[0]
flight['Arrival_minutes']=flight['Arrival_time'].str.split(':').str[1]

In [None]:
flight.head()

In [None]:
#Dropping Arrival_time feature from data
flight.drop('Arrival_time',axis=1,inplace=True)

In [None]:
#converting data type from string to int
flight['Arrival_hour']=flight['Arrival_hour'].astype(int)
flight['Arrival_minutes']=flight['Arrival_minutes'].astype(int)

In [None]:
flight['Dep_hour']=flight['Dep_Time'].str.split(':').str[0]
flight['Dep_minutes']=flight['Dep_Time'].str.split(':').str[1]

In [None]:
flight['Dep_hour']=flight['Dep_hour'].astype(int)
flight['Dep_minutes']=flight['Dep_minutes'].astype(int)
flight.drop('Dep_Time',axis=1,inplace=True)
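
The arrival and departure columns go through the same hour/minute split, so the logic could live in one small helper. This is a sketch on toy data: `split_clock` is a hypothetical name, and the trailing date in the arrival strings (`"01:10 22 Mar"`) is an assumed layout inferred from the `split(' ')` step above.

```python
import pandas as pd

def split_clock(series):
    """Return (hour, minute) integer Series from strings such as '22:20' or '01:10 22 Mar'."""
    clock = series.str.split(" ").str[0]                       # drop any trailing date
    parts = clock.str.split(":", expand=True).astype(int)      # two columns: hour, minute
    return parts[0], parts[1]

# Toy stand-in for the two time columns.
demo = pd.DataFrame({
    "Dep_Time": ["22:20", "05:50", "09:25"],
    "Arrival_Time": ["01:10 22 Mar", "13:15", "04:25 10 Jun"],
})

demo["Dep_hour"], demo["Dep_minutes"] = split_clock(demo["Dep_Time"])
demo["Arrival_hour"], demo["Arrival_minutes"] = split_clock(demo["Arrival_Time"])

print(demo)
```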

In [None]:
flight.head()

In [None]:
flight['Route_1']=flight['Route'].str.split('→').str[0]
flight['Route_2']=flight['Route'].str.split('→').str[1]
flight['Route_3']=flight['Route'].str.split('→').str[2]
flight['Route_4']=flight['Route'].str.split('→').str[3]
flight['Route_5']=flight['Route'].str.split('→').str[4]


In [None]:
flight.head()

In [None]:
#assign the result back instead of using inplace=True on a single column
flight['Price']=flight['Price'].fillna(flight['Price'].mean())

In [None]:
#legs that do not exist for a route are marked with the string 'None'
flight['Route_1']=flight['Route_1'].fillna('None')
flight['Route_2']=flight['Route_2'].fillna('None')
flight['Route_3']=flight['Route_3'].fillna('None')
flight['Route_4']=flight['Route_4'].fillna('None')
flight['Route_5']=flight['Route_5'].fillna('None')
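
The route split and the `'None'` padding can also be done in one pass. The sketch below runs on a toy frame, and `split_route` is a hypothetical helper. Unlike the cells above, it strips the whitespace around each airport code, so that a code such as `DEL` maps to a single label wherever it appears in the route.

```python
import pandas as pd

def split_route(route, width=5):
    """Split 'A → B → C' into exactly `width` legs, padding with the string 'None'."""
    legs = [] if pd.isna(route) else [leg.strip() for leg in route.split("→")]
    legs = legs[:width]
    return legs + ["None"] * (width - len(legs))

# Toy stand-in for the Route column, including one missing route.
demo = pd.DataFrame({"Route": ["BLR → DEL", "CCU → IXR → BBI → BLR", None]})

route_cols = [f"Route_{i}" for i in range(1, 6)]
legs = pd.DataFrame(demo["Route"].map(split_route).tolist(),
                    columns=route_cols, index=demo.index)
demo = demo.join(legs)

print(demo)
```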

In [None]:
flight.head()

In [None]:
flight.drop(['Route','Duration'],axis=1,inplace=True)
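
The cell above drops `Duration` without using it. If the column were kept, it could be turned into a numeric feature along the following lines. This is only a sketch: the `'2h 50m'` / `'19h'` string layout is an assumption (the notebook never shows the raw values), and `duration_to_minutes` is a hypothetical helper.

```python
import re
import pandas as pd

def duration_to_minutes(text):
    """Convert strings such as '2h 50m', '19h' or '45m' to total minutes."""
    hours = re.search(r"(\d+)\s*h", text)
    minutes = re.search(r"(\d+)\s*m", text)
    total = 60 * int(hours.group(1)) if hours else 0
    total += int(minutes.group(1)) if minutes else 0
    return total

# Toy values; the real column layout is assumed, not confirmed.
demo = pd.DataFrame({"Duration": ["2h 50m", "19h", "45m"]})
demo["Duration_minutes"] = demo["Duration"].map(duration_to_minutes)

print(demo)
```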

In [None]:
flight.head()

In [None]:
from sklearn.preprocessing import LabelEncoder
encoder=LabelEncoder()
flight['Airline']=encoder.fit_transform(flight['Airline'])
flight['Source']=encoder.fit_transform(flight['Source'])
flight['Destination']=encoder.fit_transform(flight['Destination'])
flight['Additional_Info']=encoder.fit_transform(flight['Additional_Info'])
flight['Route_1']=encoder.fit_transform(flight['Route_1'])
flight['Route_2']=encoder.fit_transform(flight['Route_2'])
flight['Route_3']=encoder.fit_transform(flight['Route_3'])
flight['Route_4']=encoder.fit_transform(flight['Route_4'])
flight['Route_5']=encoder.fit_transform(flight['Route_5'])
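
Because the single `encoder` above is refitted for every column, only the mapping of the last column survives, which makes it hard to encode the test set consistently later. One possible arrangement, shown on toy frames, keeps one fitted `LabelEncoder` per column. The names `train`, `new_rows` and `encoders` are illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy training rows and one unseen row that uses already-known categories.
train = pd.DataFrame({"Airline": ["IndiGo", "Air India", "IndiGo"],
                      "Source": ["Delhi", "Kolkata", "Delhi"]})
new_rows = pd.DataFrame({"Airline": ["Air India"], "Source": ["Delhi"]})

# Fit and keep one encoder per column.
encoders = {}
for col in ["Airline", "Source"]:
    encoders[col] = LabelEncoder().fit(train[col])
    train[col] = encoders[col].transform(train[col])

# Reuse the stored mappings for new data.
for col, enc in encoders.items():
    new_rows[col] = enc.transform(new_rows[col])

print(train)
print(new_rows)
```

Note that `transform` raises an error for a category that was not present when the encoder was fitted, so unseen labels in new data need separate handling.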

In [None]:
flight.head()

In [None]:
#dropping year column
flight.drop('Year',axis=1,inplace=True)

In [None]:
flight.isnull().sum()

Route and Total_Stops contained missing values in the raw data. They were filled in during feature engineering above, so the counts here should all be zero.

In [None]:
#check still any missing values present or not
sns.heatmap(flight.isnull())

As the heatmap shows, there is no missing data left.

# Exploratory Data Analysis

In [None]:
flight['Stop'].value_counts().iplot(kind='bar',
                                              yTitle='Counts', 
                                              linecolor='black', 
                                              opacity=0.7,
                                              color='blue',
                                              theme='pearl',
                                              bargap=0.5,
                                              gridcolor='white',
                                              title='Distribution of number of stops')

Here you can see that about 52% of flights have 1 stop, 32% are non-stop, 14% have 2 stops, and the rest have 3 stops.
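
Percentages like these can be read directly from `value_counts(normalize=True)` rather than estimated from the bar chart. The sketch uses a small made-up Series, so its numbers do not match the real data.

```python
import pandas as pd

# Made-up stop counts for ten flights.
stops = pd.Series([1, 1, 0, 2, 1, 0, 1, 3, 1, 0])

# Share of each stop count, in percent.
share = stops.value_counts(normalize=True).mul(100).round(1)

print(share)
```

On the notebook's data the equivalent call would be `flight['Stop'].value_counts(normalize=True)`.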

In [None]:

fig = px.scatter(flight, x="Arrival_hour", y="Dep_hour", color='Price')
fig.show()

From this plot we can observe that the highest-priced tickets appear for departures between 15:00 and 18:00, whatever the arrival hour. The maximum price in the data is 79512, as seen in `flight.describe()`.

In [None]:
#check correlation between features
corr_hmap=flight.corr()
plt.figure(figsize=(8,7))
sns.heatmap(corr_hmap,annot=True)
plt.show()

## Feature Selection

In [None]:
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel

In [None]:
x=flight.drop('Price',axis=1)
x.head()

In [None]:
y=flight['Price']
y.head()

In [None]:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=0)

In [None]:
x_train.shape

In [None]:
y_train.shape

In [None]:
x_test.shape

In [None]:
y_test.shape
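
With `test_size=0.3` and 10683 rows, scikit-learn rounds the test share up, so the split gives 3205 test rows and 7478 training rows. The sketch below checks this on placeholder arrays of the same length. `X_demo` and `y_demo` are dummy data, not the flight features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 10683                      # number of rows in the training file
X_demo = np.zeros((n_rows, 1))      # placeholder feature matrix
y_demo = np.zeros(n_rows)           # placeholder target

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

print(X_tr.shape, X_te.shape)       # (7478, 1) (3205, 1)
```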

In [None]:
model=SelectFromModel(Lasso(alpha=0.005,random_state=0))

In [None]:
model.fit(x_train,y_train)

In [None]:
model.get_support()

In [None]:
selected_features=x_train.columns[(model.get_support())]

In [None]:
selected_features
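
The notebook computes `selected_features` but then trains on all columns. The sketch below shows how such a mask could be applied, using synthetic data from `make_regression` rather than the flight data. The standardisation step and `alpha=1.0` are assumptions made for this toy example. The Lasso penalty depends on feature scale, so scaling before selection is a common precaution.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data: 8 features, only 3 of which drive the target.
X_arr, y_demo = make_regression(n_samples=200, n_features=8, n_informative=3,
                                noise=5.0, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

# Put features on a common scale so the penalty treats them evenly.
X_scaled = StandardScaler().fit_transform(X_demo)

selector = SelectFromModel(Lasso(alpha=1.0, random_state=0)).fit(X_scaled, y_demo)
mask = selector.get_support()       # boolean array, one entry per feature

selected = X_demo.columns[mask]
X_reduced = X_demo.loc[:, mask]     # keep only the selected columns

print(list(selected))
```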

## Random Forest Regressor

In [None]:
from sklearn.model_selection import RandomizedSearchCV
#Randomized Search CV

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 100, stop = 1200, num = 12)]
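# Number of features to consider at every split
# (1.0 means all features; 'auto' was an alias for this in RandomForestRegressor
# and is no longer accepted by recent scikit-learn releases)
max_features = [1.0, 'sqrt']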
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(5, 30, num = 6)]
# max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10, 15, 100]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 5, 10]

In [None]:
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf}

print(random_grid)

In [None]:
# Use the random grid to search for best hyperparameters
# First create the base model to tune
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor()

In [None]:
# Random search of parameters, using 5-fold cross-validation,
# searching across 50 different combinations
rf_random = RandomizedSearchCV(estimator=rf,
                               param_distributions=random_grid,
                               scoring='neg_mean_squared_error',
                               n_iter=50, cv=5, verbose=2,
                               random_state=42, n_jobs=1)


In [None]:
rf_random.fit(x_train,y_train)
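
After fitting, the search object exposes the winning combination through `best_params_` and its cross-validated score through `best_score_`. The sketch below shows this on a deliberately tiny synthetic problem so that it runs quickly. `small_grid` and the data are illustrative and are not the notebook's `random_grid`.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic regression problem.
X_demo, y_demo = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=0)

# Much smaller grid than the notebook's, purely to keep the run short.
small_grid = {"n_estimators": [10, 20], "max_depth": [3, 5], "min_samples_leaf": [1, 2]}

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions=small_grid,
    scoring="neg_mean_squared_error",
    n_iter=4, cv=3, random_state=42,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
# The score is a negated MSE, so flip the sign before taking the root.
print("CV RMSE:", (-search.best_score_) ** 0.5)
```

On the notebook's object the same attributes would be `rf_random.best_params_` and `rf_random.best_score_`.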

In [None]:
y_pred=rf_random.predict(x_test)

In [None]:
rsquare=metrics.r2_score(y_test,y_pred)
print('R-square',rsquare)
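
R-squared alone does not say how large the errors are in price units, so it can help to report MAE and RMSE next to it. The numbers below are made up for illustration. On the notebook's data the inputs would be `y_test` and `y_pred`.

```python
import numpy as np
from sklearn import metrics

# Made-up actual and predicted prices.
y_true = np.array([100.0, 200.0, 300.0])
y_hat = np.array([110.0, 190.0, 310.0])

mae = metrics.mean_absolute_error(y_true, y_hat)
rmse = np.sqrt(metrics.mean_squared_error(y_true, y_hat))
r2 = metrics.r2_score(y_true, y_hat)

print("MAE:", mae)    # 10.0
print("RMSE:", rmse)  # 10.0
print("R-square:", r2)
```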

In [None]:
#distribution of residuals (distplot is deprecated in recent seaborn releases)
sns.histplot(y_test-y_pred,kde=True)

In [None]:
plt.scatter(y_test,y_pred)


# Predicting Data

In [None]:
pred_rfr=rf_random.predict(x_test)
print("predicted price",pred_rfr)
print("actual price",y_test)
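
Printing the two arrays separately makes them hard to compare row by row. A side-by-side table, sketched here with made-up prices, keeps the original row index and adds the absolute error. The names `actual`, `predicted` and `comparison` are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up values: `actual` plays the role of y_test (with its shuffled index),
# `predicted` the role of the array returned by predict().
actual = pd.Series([3897, 7662, 13882], index=[10, 42, 7], name="actual")
predicted = np.array([4100.0, 7500.0, 13000.0])

comparison = pd.DataFrame({"actual": actual, "predicted": predicted},
                          index=actual.index)
comparison["abs_error"] = (comparison["actual"] - comparison["predicted"]).abs()

print(comparison)
```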

That completes the analysis so far.