Flight ticket prices can be hard to guess: the price we see today for a flight may be different when we check the same flight tomorrow. Travellers often say that flight ticket prices are unpredictable. With this dataset we are going to predict the flight price from the following 11 columns (10 features plus the target, Price):
Airline: The name of the airline.

Date_of_Journey: The date of the journey

Source: The source from which the service begins.

Destination: The destination where the service ends.

Route: The route taken by the flight to reach the destination.

Dep_Time: The time when the journey starts from the source.

Arrival_Time: Time of arrival at the destination.

Duration: Total duration of the flight.

Total_Stops: Total stops between the source and destination.

Additional_Info: Additional information about the flight

Price: The price of the ticket

In [None]:
import numpy as np
import pandas as pd
import sklearn
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error,mean_absolute_error
from sklearn.model_selection import train_test_split

Loading data set

In [None]:
df=pd.read_csv("Flight_ticket_train.csv")

In [None]:
df

In [None]:
df.keys() #keys() must be called to list the column labels

In [None]:
df.columns

In [None]:
df.info()

In [None]:
df.dtypes

The feature columns have dtype object (strings), so they need to be encoded before modelling.

In [None]:
df.isnull().sum()

There is one null value in Route and one in Total_Stops. I am replacing these null values with the placeholder string "NaN", so that they are treated as a separate category.

In [None]:
#Assign the result back instead of using inplace=True on a column selection
df["Route"]=df["Route"].fillna("NaN")
df["Total_Stops"]=df["Total_Stops"].fillna("NaN")
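
A small illustration of what the placeholder fill does, on a made-up frame (the values below are hypothetical and not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value per column (hypothetical values)
toy = pd.DataFrame({
    "Route": ["DEL-BOM", np.nan, "BLR-DEL"],
    "Total_Stops": ["non-stop", "1 stop", np.nan],
})

# Replace missing entries with the literal string "NaN" (it becomes a new category)
filled = toy.fillna("NaN")

print(filled.isnull().sum().sum())  # count of remaining missing values
print(filled["Route"].tolist())
```

Because so few rows are affected, dropping them with dropna() would probably be an equally reasonable choice.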

In [None]:
df.describe()

In [None]:
sns.heatmap(df.isnull())

In [None]:
plt.figure(figsize=(20,20))
sns.barplot(x='Airline',y='Price',data=df)
plt.title("Average price by airline")
plt.xticks(rotation=45)
plt.show()

The graph shows that Jet Airways Business has the highest price among all airlines.

In [None]:
plt.figure(figsize=(20,20))
sns.barplot(x='Date_of_Journey',y='Price',data=df)
plt.title("Average price by date of journey")
plt.xticks(rotation=45)
plt.show()

The highest price was on 01/03/2019 and the lowest price was on 30/04/2019.

In [None]:
plt.figure(figsize=(20,20))
sns.barplot(x='Source',y='Price',data=df)
plt.title("Average price by source")
plt.xticks(rotation=45)
plt.show()

The highest price is for Delhi, followed by Kolkata, Banglore, Mumbai and then Chennai.

In [None]:
plt.figure(figsize=(20,20))
sns.barplot(x='Destination',y='Price',data=df)
plt.title("Average price by destination")
plt.xticks(rotation=45)
plt.show()

The highest price is for the destination New Delhi, followed by Cochin, Banglore, Delhi and Hyderabad; the lowest is for Kolkata.

In [None]:
plt.figure(figsize=(20,20))
sns.barplot(x='Total_Stops',y='Price',data=df)
plt.title("Average price by total stops")
plt.xticks(rotation=45)
plt.show()

The graph shows that the highest price is charged for flights with 4 stops.

In [None]:
plt.figure(figsize=(20,20))
sns.barplot(x='Additional_Info',y='Price',data=df)
plt.title("Average price by additional info")
plt.xticks(rotation=45)
plt.show()

The highest price is for business class and the lowest is for "No check-in baggage included".

In [None]:
from sklearn.preprocessing import LabelEncoder #Converting categorical data into numerical
LE=LabelEncoder()

#Encode every categorical column with integer labels
cat_cols=["Airline","Date_of_Journey","Source","Destination","Route",
          "Dep_Time","Arrival_Time","Duration","Total_Stops","Additional_Info"]
for col in cat_cols:
    df[col]=LE.fit_transform(df[col])
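
Refitting a single encoder column by column discards each mapping as soon as the next column is fitted. A minimal sketch of one possible alternative, assuming we want to reuse the training mappings on the test data later: keep one fitted encoder per column in a dictionary. The toy frames and the `encoders` dictionary are illustrative names, not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical miniature train/test frames
train = pd.DataFrame({"Airline": ["IndiGo", "Air India", "IndiGo"],
                      "Source": ["Delhi", "Kolkata", "Delhi"]})
test = pd.DataFrame({"Airline": ["Air India", "IndiGo"],
                     "Source": ["Kolkata", "Delhi"]})

# Fit one encoder per column and remember it
encoders = {}
for col in train.columns:
    enc = LabelEncoder()
    train[col] = enc.fit_transform(train[col])
    encoders[col] = enc

# Reuse the stored mappings so the same label gets the same code in the test data
for col in test.columns:
    test[col] = encoders[col].transform(test[col])

print(test)
```

Note that LabelEncoder.transform raises a ValueError for labels it did not see during fitting, so unseen categories in the test data would need separate handling.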

In [None]:
dfcor=df.corr() #Checking correlation
dfcor

In [None]:
plt.figure(figsize=(14,10))
sns.heatmap(dfcor,annot=True)
plt.title("Correlation Matrix")
plt.show()

In [None]:
#Plotting boxplots to check outliers
df.iloc[:,0:23].boxplot(figsize=[20,8])
plt.subplots_adjust(bottom=0.25)
plt.show()

The Price column has the highest number of outliers.

In [None]:
#removing outliers
from scipy.stats import zscore
z=np.abs(zscore(df))
z

In [None]:
threshold=3
print(np.where(z>threshold)) #row and column positions of the outliers

In [None]:
df_new=df[(z<3).all(axis=1)]
print(df.shape)
print(df_new.shape)
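
The z-score filter keeps only the rows in which every column lies within 3 standard deviations of its column mean. A small sketch on synthetic data (the column names and the injected value are made up for illustration):

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Synthetic data: two roughly normal columns
rng = np.random.default_rng(0)
toy = pd.DataFrame({"a": rng.normal(size=200), "b": rng.normal(size=200)})
toy.loc[0, "a"] = 50.0  # inject one obvious outlier

# Absolute z-score of every cell, computed column by column
z = np.abs(zscore(toy.to_numpy()))

# Keep a row only if all of its z-scores are below the threshold
kept = toy[(z < 3).all(axis=1)]

print(toy.shape, kept.shape)
```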

In [None]:
loss_percent=(df.shape[0]-df_new.shape[0])/df.shape[0]*100 #percentage of rows removed as outliers
print(loss_percent)

In [None]:
df.skew() #checking skewness

In [None]:
from sklearn.preprocessing import power_transform #Removing skewness with the help of power transform
df_new=power_transform(df)
df_new=pd.DataFrame(df_new,columns=df.columns)
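
As an illustration of what the power transform does to a skewed column, here is a sketch on synthetic log-normal data (the column name and distribution parameters are arbitrary assumptions). power_transform uses the Yeo-Johnson method by default and standardizes its output.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import power_transform

# Synthetic right-skewed column
rng = np.random.default_rng(0)
toy = pd.DataFrame({"price": rng.lognormal(mean=0.0, sigma=0.8, size=1000)})

# Apply the transform and rebuild a DataFrame with the same column names
transformed = pd.DataFrame(power_transform(toy), columns=toy.columns)

skew_before = toy["price"].skew()
skew_after = transformed["price"].skew()
print(skew_before, skew_after)
```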

In [None]:
x=df.drop("Price",axis=1) #Splitting target variable
y=df["Price"]

In [None]:
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
x=sc.fit_transform(x) #scale only the features; scaling df would put the Price target into x
x
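
A quick check, on a made-up frame, that the scaler is applied to the feature columns only and that each scaled column ends up with mean 0 and standard deviation 1 (the values below are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical encoded data with the target in the last column
toy = pd.DataFrame({
    "Airline": [0, 1, 2, 1],
    "Total_Stops": [0, 2, 1, 1],
    "Price": [3900, 7600, 13800, 6200],
})

# Separate the target first, then scale the features only
features = toy.drop("Price", axis=1)
target = toy["Price"]
scaled = StandardScaler().fit_transform(features)

print(scaled.shape)
print(scaled.mean(axis=0), scaled.std(axis=0))
```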

Training, testing, validating and hypertuning the model

In [None]:
from sklearn.linear_model import LinearRegression
lr=LinearRegression()
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

In [None]:
from sklearn.linear_model import LinearRegression
maxAccu=0
maxRS=0
for i in range(1,200):
    #test_size=0.30 holds out 30% of the rows (an integer 30 would hold out only 30 rows)
    x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.30,random_state=i)
    LR=LinearRegression()
    LR.fit(x_train,y_train)
    predrf=LR.predict(x_test)
    r2=r2_score(y_test,predrf)
    if r2>maxAccu:
        maxAccu=r2
        maxRS=i
print("best R2 score is ",maxAccu,"on Random_state",maxRS)

In [None]:
#Use the random state found above for the final split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.30,random_state=maxRS)
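
train_test_split treats an integer test_size as an absolute number of test rows and a float between 0 and 1 as a fraction of the data. A minimal sketch on made-up arrays (the array sizes are arbitrary) shows the difference:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 1000 synthetic rows with two features each
X = np.arange(2000).reshape(1000, 2)
y = np.arange(1000)

# Integer: exactly 30 rows go to the test set
_, X_test_int, _, _ = train_test_split(X, y, test_size=30, random_state=0)

# Float: 30% of the rows go to the test set
_, X_test_frac, _, _ = train_test_split(X, y, test_size=0.30, random_state=0)

print(len(X_test_int), len(X_test_frac))
```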

In [None]:
lr.fit(x_train,y_train)

In [None]:
predrf_test=lr.predict(x_test)

In [None]:
print(r2_score(y_test,predrf_test))

In [None]:
from sklearn.ensemble import RandomForestRegressor

rf=RandomForestRegressor(n_estimators=100,random_state=57)
rf.fit(x_train,y_train)
predrf=rf.predict(x_test)
print(r2_score(y_test,predrf))
print('Mean absolute error:',mean_absolute_error(y_test,predrf))
print('Mean squared error:',mean_squared_error(y_test,predrf))
print('Root mean squared error:',np.sqrt(mean_squared_error(y_test,predrf)))
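
For reference, the four metrics printed above can be worked out by hand. The sketch below uses four made-up values and compares the manual formulas with the scikit-learn functions:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual and predicted prices
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 320.0, 380.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))   # (10 + 10 + 20 + 20) / 4
mse = np.mean(errors ** 2)      # (100 + 100 + 400 + 400) / 4
rmse = np.sqrt(mse)
# R2 = 1 - residual sum of squares / total sum of squares
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mae, mean_absolute_error(y_true, y_pred))
print(mse, mean_squared_error(y_true, y_pred))
print(rmse)
print(r2, r2_score(y_true, y_pred))
```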

In [None]:
from sklearn.svm import SVR

kernellist=['linear','poly','rbf','sigmoid']
for i in kernellist:
    sv=SVR(kernel=i)
    sv.fit(x_train,y_train)
    print(sv.score(x_train,y_train))

In [None]:
from sklearn.linear_model import Lasso,Ridge
ls= Lasso(alpha=0.1)
#ls=Lasso(alpha=1.0)#default
ls.fit(x_train,y_train)
ls.score(x_train,y_train)

In [None]:
ls.coef_

In [None]:
#ElasticNet is a combination of both Lasso and Ridge
from sklearn.linear_model import ElasticNet
enr=ElasticNet(alpha=0.1)
#enr=ElasticNet()
enr.fit(x_train,y_train)
enrpred=enr.predict(x_test)
print(enr.score(x_train,y_train))
enr.coef_

In [None]:
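from sklearn.model_selection import cross_val_score
enr=ElasticNet(alpha=0.1)
score=cross_val_score(enr, x,y,cv=5)
print("Cross validation score:",score.mean())
r2=r2_score(y_test,enrpred) #score the ElasticNet predictions, not the random forest ones
r2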

In [None]:
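from sklearn.model_selection import cross_val_score
ls=Lasso(alpha=0.0001)
score=cross_val_score(ls, x,y,cv=5)
print("Cross validation score:",score.mean())
ls.fit(x_train,y_train)
r2=r2_score(y_test,ls.predict(x_test)) #score the Lasso predictions
r2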

In [None]:
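from sklearn.model_selection import cross_val_score
rf=RandomForestRegressor()
score=cross_val_score(rf, x,y,cv=5)
print("Cross validation score:",score.mean())
r2=r2_score(y_test,predrf) #predrf holds the random forest predictions from the earlier cell
r2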

In [None]:
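from sklearn.model_selection import cross_val_score
lr=LinearRegression()
score=cross_val_score(lr, x,y,cv=5) #cross-validate the linear model, not rf
print("Cross validation score:",score.mean())
r2=r2_score(y_test,predrf_test) #predrf_test holds the linear regression predictions
r2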
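
The comparison between the test R2 score and the cross-validation score can also be written as one loop over the candidate models. This is only a sketch on synthetic data from make_regression; the dataset, the hyperparameters and the `results` dictionary are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic regression problem standing in for the flight data
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=0)

models = {
    "LinearRegression": LinearRegression(),
    "Lasso": Lasso(alpha=0.1),
    "ElasticNet": ElasticNet(alpha=0.1),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=20, random_state=0),
}

results = {}
for name, model in models.items():
    # Mean R2 over 5 folds of the full data
    cv_r2 = cross_val_score(model, X, y, cv=5).mean()
    # R2 of the same model on the held-out test split
    test_r2 = r2_score(y_te, model.fit(X_tr, y_tr).predict(X_te))
    results[name] = (test_r2, cv_r2, abs(test_r2 - cv_r2))
    print(name, "test R2:", test_r2, "CV R2:", cv_r2)
```

Each model is scored against its own predictions, which makes the gap between the two numbers comparable across models.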

In [None]:
import matplotlib.pyplot as plt
plt.figure(figsize=(8,6))
plt.scatter(x=y_test,y=predrf,color='r') #predrf holds the random forest predictions named in the title
plt.plot(y_test,y_test,color='b')
plt.xlabel('actual Price',fontsize=14)
plt.ylabel('predicted Price',fontsize=14)
plt.title('RandomForestRegressor()',fontsize=18)
plt.show()

The difference between the R2 score and the cross-validation score is smallest for RandomForestRegressor, so this is our best model.



In [None]:
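from sklearn.model_selection import GridSearchCV # Hyperparameter tuning with GridSearchCV
from sklearn.ensemble import RandomForestRegressor
#Recent scikit-learn versions name these criteria 'squared_error'/'absolute_error' ('mse'/'mae' were removed),
#and max_features=1.0 replaces the removed 'auto' option
parameters={'criterion':['squared_error','absolute_error'],'max_features':[1.0,'sqrt','log2']}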
rf=RandomForestRegressor()
grid=GridSearchCV(rf,parameters)
grid.fit(x_train,y_train)
print(grid.best_params_)

In [None]:
#'squared_error' and max_features=1.0 are the current names for the former 'mse' and 'auto'
rf=RandomForestRegressor(criterion='squared_error',max_features=1.0)
rf.fit(x_train,y_train)
rf.score(x_train,y_train)
predrf_decision=rf.predict(x_test)
rfs=r2_score(y_test,predrf_decision)
print("R2 Score:",rfs*100)
rfscore=cross_val_score(rf,x,y,cv=5)
rfc=rfscore.mean()
print("cross val score:",rfc*100)

Saving model

In [None]:
import joblib
joblib.dump(rf,"Flight_ticket.csv.obj")
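
A minimal round-trip sketch, using a throwaway model and a temporary directory (both are assumptions made for illustration), to confirm that a model saved with joblib gives the same predictions after loading:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny stand-in model
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
model = LinearRegression().fit(X, y)

# Save to a temporary file and load it back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.joblib")
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions
same = np.allclose(model.predict(X), restored.predict(X))
print(same)
```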

Loading test data

In [None]:
df2=pd.read_csv("Flight_ticket_test.csv")

In [None]:
df2

In [None]:
df2.columns

In [None]:
df2.keys() #keys() must be called to list the column labels

In [None]:
df2.info()

In [None]:
df2.dtypes

In [None]:
df2.isnull().sum()

In [None]:
from sklearn.preprocessing import LabelEncoder #Converting categorical data into numerical
LE=LabelEncoder()

#Refitting on the test set can assign different codes than the ones used for training
cat_cols=["Airline","Date_of_Journey","Source","Destination","Route",
          "Dep_Time","Arrival_Time","Duration","Total_Stops","Additional_Info"]
for col in cat_cols:
    df2[col]=LE.fit_transform(df2[col])

In [None]:
df2.describe()

In [None]:
dfcor=df2.corr() #correlation of the test data
dfcor

In [None]:
plt.figure(figsize=(14,10))
sns.heatmap(dfcor,annot=True)
plt.title("Correlation Matrix")
plt.show()

In [None]:
#Plotting boxplots to check outliers
df2.iloc[:,0:23].boxplot(figsize=[20,8])
plt.subplots_adjust(bottom=0.25)
plt.show()

In [None]:
#removing outliers
from scipy.stats import zscore
z=np.abs(zscore(df2))
z


In [None]:
threshold=3
print(np.where(z>threshold)) #row and column positions of the outliers

In [None]:
df2_new=df2[(z<3).all(axis=1)]
print(df2.shape)
print(df2_new.shape)

In [None]:
loss_percent=(df2.shape[0]-df2_new.shape[0])/df2.shape[0]*100 #percentage of rows removed as outliers
print(loss_percent)

In [None]:
df2.skew() #checking skewness

In [None]:
from sklearn.preprocessing import power_transform #Removing Skewness
df2_new=power_transform(df2)
df2_new=pd.DataFrame(df2_new,columns=df2.columns)

In [None]:
from sklearn.preprocessing import StandardScaler #Scaling dataset
sc=StandardScaler()
x=sc.fit_transform(df2)
x

Loading the trained model

In [None]:
p=joblib.load("Flight_ticket.csv.obj") #joblib accepts the file path directly

In [None]:
p

In [None]:
prediction=p.predict(x) #x is the scaled test data; the model was trained on scaled features

In [None]:
prediction
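
One way to make sure the test data goes through exactly the same encoding and scaling as the training data is to bundle the steps into a single scikit-learn Pipeline. The sketch below is not the notebook's method: the toy frames, the `pipe` name and the choice of OrdinalEncoder (which can map unseen categories to a fixed value) are assumptions made for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Hypothetical miniature train and test frames
train = pd.DataFrame({
    "Airline": ["IndiGo", "Air India", "Jet Airways", "IndiGo", "Air India", "Jet Airways"],
    "Total_Stops": ["non-stop", "1 stop", "2 stops", "non-stop", "1 stop", "2 stops"],
    "Price": [3900, 7600, 13800, 4100, 7300, 14200],
})
test = pd.DataFrame({
    "Airline": ["Jet Airways", "SpiceJet"],   # "SpiceJet" is unseen in train
    "Total_Stops": ["2 stops", "non-stop"],
})

cat_cols = ["Airline", "Total_Stops"]

# Unseen categories are encoded as -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

pipe = Pipeline([
    ("encode", ColumnTransformer([("cat", encoder, cat_cols)])),
    ("scale", StandardScaler()),
    ("model", RandomForestRegressor(n_estimators=50, random_state=0)),
])

# Encoder and scaler are fitted on the training data only
pipe.fit(train[cat_cols], train["Price"])

# The fitted steps are reused unchanged on the test data
prediction = pipe.predict(test[cat_cols])
print(prediction)
```

Saving `pipe` with joblib.dump would then store the preprocessing together with the model.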