# Problem Statement

Anyone who has booked a flight ticket knows how unpredictably prices vary. The cheapest available ticket on a given flight gets more and less expensive over time. This usually happens as an attempt to maximize revenue based on:


1. Time of purchase patterns (making sure last-minute purchases are expensive)

2. Keeping the flight as full as they want it (raising prices on a flight which is filling up in order to reduce sales and hold back inventory for those expensive last-minute purchases)


So, in this project we collect flight fare data along with other features and build a model to predict flight fares.

### Importing required libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt

import warnings
warnings.filterwarnings('ignore')

In [None]:
#To print all rows

pd.set_option('display.max_rows',None)

In [None]:
#importing dataset

df = pd.read_csv("Flight_Price.csv")
df.head()

Since Price is our target and it is a continuous variable, this is a regression problem.

### Features Information:
    
    
 - Airline: The name of the airline.
    
 - Journey_date: The date of the journey.
    
 - From: The source from which the service begins.
    
 - To: The destination where the service ends.
    
 - Route: The route taken by the flight to reach the destination.
    
 - D_Time: The time when the journey starts from the source.
    
 - A_Time: Time of arrival at the destination.
    
 - Stops: Total stops between the source and destination.
    
 - Price: The price of the ticket

## Preprocessing and EDA:

In [None]:
#Checking shape of our dataset
df.shape

In our dataset we have 5204 rows and 9 columns.

In [None]:
#Removing the "Unnamed: 0" column as it is only the CSV index and carries no information

df.drop('Unnamed: 0',axis=1,inplace=True)
df.head()

Removing Unnamed: 0 column as it is the index column of csv file.

In [None]:
#Removing the first row as it has only NaN values

df=df.drop([df.index[0]])

All the entries in the first row are NaN, so we have dropped this row.

In [None]:
#Cleaning the Price column: keep only digits and the decimal point, then convert to float

df.Price = df.Price.str.replace('[^0-9.]', '', regex=True).astype('float64')
df.head()

We have changed the Price column data type to float.
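To make the cleaning step concrete, here is a minimal sketch on a few made-up price strings. The sample values are assumptions, since the raw formatting of the CSV is not shown here. The pattern `[^0-9.]` removes every character that is not a digit or a dot, and `regex=True` is passed explicitly because recent pandas versions treat the pattern as a literal string by default.

```python
import pandas as pd

# Hypothetical raw price strings; the real file's formatting may differ.
raw = pd.Series(["₹ 5,955", "12,345.50", "4500"])

# regex=True is needed in pandas 2.x, where str.replace matches literally by default.
clean = raw.str.replace(r"[^0-9.]", "", regex=True).astype("float64")
print(clean.tolist())
```

Note that a prefix such as "Rs." would leave a stray dot behind with this pattern, so it is worth checking a few converted values by eye.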

In [None]:
#Checking all column names

df.columns

Above are the column names of the dataset.

In [None]:
#Checking the data types of all columns

df.dtypes

Except for Price, all columns are of object type. We still have to convert the Journey_date, Dtime and Atime columns from object to datetime.

In [None]:
#Checking the info about the dataset

df.info()

There are no NaN values in the dataset. The date and time columns still need to be converted from object to datetime.

In [None]:
#Let's check the value counts of each column to see whether any unexpected or unwanted entries are present.

for i in df.columns:
    print(df[i].value_counts())
    print('****************************************')

 - Above are the value counts of each column. In the Airline and Stops columns we need to group values to get a better understanding of these features.

In [None]:
#Grouping Airlines column for multiple airlines

df["Airline"].replace(("Spicejet, IndiGo","Air India, IndiGo","Spicejet, AirAsia","IndiGo, Air India","IndiGo, Spicejet","AirAsia, IndiGo","IndiGo, Go First","IndiGo, TruJet","Vistara, IndiGo","Spicejet, Air India","Air India, Go First","Vistara, Spicejet","Spicejet, Go First","Go First, IndiGo","IndiGo, AirAsia","Air India, AirAsia","Vistara, Go First","TruJet, IndiGo","Spicejet, Vistara","IndiGo, Vistara","Air India, Spicejet","AirAsia, Go First","Vistara, AirAsia","Vistara, Air India","Go First, AirAsia","Spicejet, TruJet","Vistara, TruJet","AirAsia, TruJet","Go First, Air India","Go First, Spicejet","Air India, Vistara"),"Multiple Airlines",inplace=True)

In [None]:
#Checking the value counts of Airline column

df.Airline.value_counts()
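The explicit list above has to be extended whenever a new airline combination appears. A pattern-based variant is sketched below; it assumes that every multi-carrier entry contains a comma and that no single-airline name does. The sample values are hypothetical.

```python
import pandas as pd

# Hypothetical sample of Airline values as they might appear in the scraped data.
airline = pd.Series(["IndiGo", "Spicejet, IndiGo", "Vistara", "Air India, Go First"])

# Assumption: any value containing a comma lists more than one carrier.
is_multi = airline.str.contains(",", regex=False)

# mask() replaces the rows where the condition is True and keeps the rest.
grouped = airline.mask(is_multi, "Multiple Airlines")
print(grouped.value_counts())
```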

In [None]:
#Grouping Stops column 

df["Stops"].replace(("1 stop via Mumbai","1 stop via Hyderabad","1 stop via Bengaluru","1 stop via New Delhi","1 stop via Ahmedabad","1 stop via Goa","1 stop via Pune","1 stop via Lucknow","1 stop via Ranchi","1 stop via Kolkata","1 stop via Chennai","1 stop via Chandigarh","1 stop via Kochi","1 stop via Jaipur","1 stop via Nagpur","1 stop via Amritsar","1 stop via Patna","1 stop via Surat","1 stop via Guwahati","1 stop via Vadodara","1 stop via Udaipur","1 stop via Indore","1 stop via Bhavnagar","1 stop via Madurai","1 stop via Bagdogra","1 stop via Varanasi","1 stop via Srinagar","1 stop via Mangalore","1 stop via Jammu","1 stop via Vijayawada","1 stop via Jodhpur","1 stop via Kalaburagi","1 stop via Aurangabad","1 stop via Rajkot","1 stop via Mysore","1 stop via Bhopal","1 stop via Tirupati","1 stop via Dehradun","1 stop via Visakhapatnam"),"1 Stop",inplace=True)

In [None]:
#Grouping Stops column

df["Stops"].replace(("2 stop via New Delhi,Hyderabad","2 stop via Hyderabad,New Delhi","2 stop via Mumbai,Hyderabad","2 stop via Mumbai,New Delhi","2 stop via Hyderabad,Mumbai","2 stop via Bengaluru,Hyderabad","2 stop via Hyderabad,Bengaluru","2 stop via New Delhi,Mumbai","2 stop via Varanasi,Bengaluru","2 stop via New Delhi,Chandigarh","2 stop via Chandigarh,New Delhi","2 stop via Chandigarh,Ahmedabad","2 stop via Ranchi,New Delhi","2 stop via Ranchi,Bengaluru","2 stop via Ahmedabad,Chandigarh","2 stop via Chandigarh,Srinagar","2 stop via Bengaluru,Ranchi","2 stop via Jammu,Srinagar","2 stop via Kochi,Mumbai","2 stop via New Delhi,Varanasi","2 stop via Hyderabad,Mysore","2 stop via Mumbai,Ranchi","2 stop via Chennai,Ranchi","2 stop via Hyderabad,Pune","2 stop via Nagpur,Pune","2 stop via Chennai,Hyderabad","2 stop via Pune,Hyderabad","2 stop via Hyderabad,Nanded","2 stop via Vijayawada,Hyderabad","2 stop via Hyderabad,Goa","2 stop via Nanded,Hyderabad","2 stop via Mumbai,Chandigarh","2 stop via Belgaum,Hyderabad","2 stop via Chennai,Jaipur","2 stop via Hyderabad,Chennai","2 stop via Hyderabad,Tirupati","2 stop via Srinagar,Chandigarh","2 stop via Mangalore,Mumbai","2 stop via Amritsar,Srinagar","2 stop via Goa,Hyderabad","2 stop via Mysore,Hyderabad"),"2 Stops",inplace=True)

In [None]:
#Grouping Stops column

df["Stops"].replace(("3 stop via Goa,New Delhi,Hyderabad","3 stop via Mumbai,Aurangabad,New Delhi","3 stop via Chandigarh,New Delhi,Ranchi","3 stop via New Delhi,Aurangabad,Mumbai","3 stop via Leh,Jammu,Srinagar","3 stop via Bhubaneswar,New Delhi,Hyderabad","3 stop via Hyderabad,New Delhi,Mumbai","3 stop via Indore,Hyderabad,Mumbai","3 stop via Hyderabad,New Delhi,Jaipur","3 stop via Hyderabad,New Delhi,Goa","3 stop via Ahmedabad,New Delhi,Hyderabad","3 stop via Belgaum,Hyderabad,Mumbai","3 stop via Hyderabad,New Delhi,Bhopal","3 stop via Mumbai,New Delhi,Hyderabad"),"3 Stops",inplace=True)

In [None]:
#Grouping Stops column

df["Stops"].replace(("4 stop via Bhubaneswar,Surat,New Delhi,Hyderabad"),"4 Stops",inplace=True)

In [None]:
#Checking the value counts of Stops column

df.Stops.value_counts()

In [None]:
#Let me assign values for Stops column

df.replace({"Non stop": 0,"1 Stop": 1,"2 Stops": 2,"3 Stops": 3,"4 Stops": 4},inplace = True)

In [None]:
#Checking the value counts of Stops column again

df.Stops.value_counts()

 - Now Stops column is set for our analysis.
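The same grouping and integer mapping can probably be done in one step by reading the leading number out of each Stops string. This is only a sketch: it assumes every value is either "Non stop" or starts with a number followed by "stop", as in the lists above.

```python
import pandas as pd

# Hypothetical Stops values in the format seen in the lists above.
stops = pd.Series([
    "Non stop",
    "1 stop via Mumbai",
    "2 stop via New Delhi,Hyderabad",
    "4 stop via Bhubaneswar,Surat,New Delhi,Hyderabad",
])

# Capture the leading integer; rows without one ("Non stop") become NaN and are filled with 0.
stop_count = (
    stops.str.extract(r"^(\d+)\s+stop", expand=False)
    .fillna("0")
    .astype(int)
)
print(stop_count.tolist())
```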

In [None]:
#Checking null values in Dataset

print("Empty cells in Dataset is ",df.isna().values.any())

print("\nColumnwise Empty cell analysis\n")

print(df.isna().sum())

 - There are no null values in our dataset

In [None]:
#Visualizing null values

plt.figure(figsize=[12,4])

sns.heatmap(df.isnull())

plt.title("Null Values")

plt.show()

 - The heatmap confirms that there are no null values in the dataset.

In [None]:
#Checking for missing observations in the target column (Price is already float, so we test for NaN)

df.loc[df['Price'].isna()]

 - There are no empty observations in our target column.

# Feature Extraction:

In [None]:
#Converting object data type to datetime in Journey_date column 

df['Journey_date'] =  pd.to_datetime(df['Journey_date'])

In [None]:
#Extracting Journey year, month and day from Journey_date (already converted to datetime above)

#Extracting year
df["Journey_year"]=df["Journey_date"].dt.year

#Extracting month
df["Journey_mon"]=df["Journey_date"].dt.month

#Extracting day
df["Journey_day"]=df["Journey_date"].dt.day

In [None]:
#Checking valuecount of Journey_year column

df.Journey_year.value_counts()

 - Since all the entries in the Journey_year column are the same, let's drop it, as it will not help our core analysis.

In [None]:
#Dropping Journey_year column

df = df.drop(["Journey_year"],axis=1)

In [None]:
#Checking valuecount of Journey_mon column

df.Journey_mon.value_counts()

 - Since all the entries in the Journey_mon column are the same, let's drop it, as it will not help our core analysis.

In [None]:
#Dropping Journey_mon column

df = df.drop(["Journey_mon"],axis=1)

In [None]:
#Checking value counts of Journey_day column

df.Journey_day.value_counts()

 - Now Journey_day is ready for our analysis.

In [None]:
#Dropping Journey_date column

df = df.drop(["Journey_date"],axis=1)

 - Dropping Journey_date column after extracting required information.

In [None]:
#Converting object data type to datetime 

df['Dtime'] =  pd.to_datetime(df['Dtime'])

df['Atime'] =  pd.to_datetime(df['Atime'])

In [None]:
#Checking the data types of all columns again

df.dtypes

 - The data type has changed now.

In [None]:
#Extracting hours and minutes from Dtime

#Extracting Hours
df["Dhour"]=df["Dtime"].dt.hour

#Extracting Minutes
df["DMin"]=df["Dtime"].dt.minute

In [None]:
#Dropping Dtime column after extraction

df = df.drop(["Dtime"],axis=1)

In [None]:
#Extracting hours and minutes from Atime

#Extracting Hours
df["AHour"]=df["Atime"].dt.hour

#Extracting Minutes
df["AMin"]=df["Atime"].dt.minute

In [None]:
#Dropping Atime column after extraction

df = df.drop(["Atime"],axis=1)

In [None]:
#Checking the data types of all columns again

df.dtypes

 - These are the data types after extraction and preprocessing.
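For reference, the hour and minute extraction can be sketched on a few sample times. The "HH:MM" layout used here is an assumption about the raw data, not something confirmed by the file; passing an explicit `format` makes pandas fail loudly on rows that do not match instead of guessing.

```python
import pandas as pd

# Hypothetical departure times; the "HH:MM" layout is an assumption about the raw data.
dtime = pd.Series(["06:15", "14:00", "23:45"])

# An explicit format avoids per-value format inference.
parsed = pd.to_datetime(dtime, format="%H:%M")

# Only the time part is meaningful; the date part is a placeholder.
features = pd.DataFrame({"Dhour": parsed.dt.hour, "DMin": parsed.dt.minute})
print(features)
```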

In [None]:
#Checking description of data set

df.describe()

 - Above are the summary statistics of the dataset. The mean and the median (50th percentile) are close for most columns, which suggests the distributions are not heavily skewed. Outliers are checked separately with box plots further below.

# Visualization:

### Univariate Analysis:

In [None]:
# checking for categorical columns

categorical_columns=[]
for i in df.dtypes.index:
    if df.dtypes[i]=='object':
        categorical_columns.append(i)
print(categorical_columns)

 - Above are the categorical columns in the data set.

In [None]:
# Now checking for numerical columns

numerical_columns=[]
for i in df.dtypes.index:
    if df.dtypes[i]!='object':
        numerical_columns.append(i)
print(numerical_columns)

 - Above are the numerical columns in the data set.

## Univariate analysis for numerical columns:

In [None]:
#Distribution plot for all numerical columns

plt.figure(figsize = (30,16))
plotnumber = 1
for column in df[numerical_columns]:
    if plotnumber <=9:
        ax = plt.subplot(3,3,plotnumber)
        sns.histplot(df[column], kde=True, stat="density")
        plt.xlabel(column,fontsize = 25)
        plt.ylabel('Density',fontsize = 25)
        plt.xticks(fontsize=20)  
        plt.yticks(fontsize=20)
    plotnumber+=1
plt.tight_layout()

- No strong skewness is visible in most of the numerical columns; skewness is measured numerically in the skewness section later.

## Univariate Analysis for categorical columns:

In [None]:
#Bar plot for all Categorical columns

plt.figure(figsize = (30,10))
plotnumber = 1
for column in df[categorical_columns]:
    if plotnumber <=3:
        ax = plt.subplot(1,3,plotnumber)
        sns.countplot(x=df[column])
        plt.xlabel(column,fontsize = 25)
        plt.ylabel('Count',fontsize = 25)
        plt.xticks(rotation=90,fontsize=20)  
        plt.yticks(fontsize=20)
    plotnumber+=1
plt.tight_layout()

 - IndiGo has the highest count, which means it accounts for the most flights in this dataset.

    
 - New Delhi has the highest count as a source, which means most flights in this dataset depart from New Delhi.


 - New Delhi also has the highest count as a destination, which means most flights in this dataset arrive at New Delhi.

# Bivariate Analysis:

In [None]:
col=['Stops', 'Journey_day', 'Dhour', 'DMin', 'AHour', 'AMin']

In [None]:
#stripplot for numerical columns

plt.figure(figsize=(40,40))
for i in range(len(col)):
    plt.subplot(4,2,i+1)
    sns.stripplot(x=df[col[i]] , y=df['Price'])
    plt.title(f"Price VS {col[i]}",fontsize=40)
    plt.xticks(fontsize=25)  
    plt.yticks(fontsize=25)
    plt.xlabel(col[i],fontsize = 30)
    plt.ylabel('Price',fontsize = 30)
    plt.tight_layout()

### Observations:

 - Flights with 2 stops cost more than other flights.


 - The price is almost the same across all journey dates.


 - Prices are high for flights departing around 2 PM, so booking a different departure time looks cheaper.


 - Departure minute has little relation to the target price.


 - Prices are high for flights arriving between 7 AM and 1 PM, so choosing a different arrival time looks cheaper.


 - Arrival minute has little relation to the target price.

In [None]:
#Bar plot for all categorical columns

plt.figure(figsize=(20,20))
for i in range(len(categorical_columns)):
    plt.subplot(3,2,i+1)
    sns.barplot(y=df['Price'],x=df[categorical_columns[i]])
    plt.title(f"Price VS {categorical_columns[i]}",fontsize=25)
    plt.xticks(rotation=90,fontsize=15)  
    plt.yticks(rotation=0,fontsize=15)
    plt.xlabel(categorical_columns[i],fontsize = 20)
    plt.ylabel('Price',fontsize = 20)
    plt.tight_layout()

### Observations:

 - The price is higher for Multiple Airlines than for any single airline.


 - Flights with Tirupati as the source have the highest price compared to other sources.

    
 - Flights with Tirupati as the destination have the highest price compared to other destinations.

# Multivariate Analysis:

In [None]:
#Pair plot for df

sns.pairplot(df,hue="Price")

 - Above are the pair plots of each pair of features.

# Checking for outliers:

In [None]:
# Identifying the outliers using boxplot

plt.figure(figsize=(30,15),facecolor='white')
plotnumber=1
for column in numerical_columns:
    if plotnumber<=9:
        ax=plt.subplot(3,3,plotnumber)
        sns.boxplot(x=df[column],color='gold')
        plt.xlabel(column,fontsize=20)
    plotnumber+=1
plt.tight_layout()

### There are outliers in

 - Stops
 - Price

Since Price is our target, we should not remove outliers from this column. Stops is a discrete count that we treat as categorical, so we do not remove outliers there either.
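The points a box plot draws as individual markers follow the 1.5 × IQR rule by default. The sketch below applies that rule by hand to a handful of made-up prices, so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical prices; the last value is deliberately extreme.
price = pd.Series([3000, 3500, 4000, 4200, 4500, 5000, 5200, 6000, 25000], dtype="float64")

q1, q3 = price.quantile(0.25), price.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside [lower, upper] are the ones flagged as outliers.
outliers = price[(price < lower) | (price > upper)]
print(lower, upper, outliers.tolist())
```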

# Checking for skewness:

In [None]:
#Checking for skewness (numeric columns only; the object columns are not encoded yet)

df.skew(numeric_only=True)

 - There is skewness in Stops and Price. Price is our target, so we do not transform it because we do not want the target to be manipulated. Stops is treated as a categorical column, so we do not remove skewness there either.

# Label Encoding:

In [None]:
# Separating categorical columns in df

cat_col=[]
for i in df.dtypes.index:
    if df.dtypes[i]=='object':
        cat_col.append(i)
print(cat_col)

 - Above is the list of categorical columns in df.

In [None]:
from sklearn.preprocessing import LabelEncoder
LE=LabelEncoder()
df[cat_col]= df[cat_col].apply(LE.fit_transform)

In [None]:
df.head()

 - Using the label encoder, I have encoded the categorical columns.
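Label encoding assigns integers in sorted order of the class names, so the codes carry no price-related meaning. Reusing one encoder across columns also discards the mapping needed to decode values later. The sketch below uses pandas categorical codes as a stand-in, which for plain string columns should give the same integers as `LabelEncoder`, and keeps the mapping. The sample values are hypothetical.

```python
import pandas as pd

# Hypothetical airline values.
airline = pd.Series(["Vistara", "IndiGo", "Air India", "IndiGo"])

# Categories are sorted, so codes follow alphabetical order.
cat = airline.astype("category")
codes = cat.cat.codes

# Keep the code -> label mapping so encoded values can be decoded later.
mapping = dict(enumerate(cat.cat.categories))
print(codes.tolist(), mapping)
```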

# Checking correlation using heat map:

In [None]:
cor=df.corr()

In [None]:
#Checking correlation

cor

 - Above are the correlations of all the pair of features. To get better visualization on the correlation of features, let's plot it using heat map.

In [None]:
# Visualizing the correlation matrix by plotting heat map.

plt.figure(figsize=(10,8))
sns.heatmap(df.corr(),linewidths=.1,vmin=-1, vmax=1, fmt='.1g', annot = True, linecolor="black",annot_kws={'size':10},cmap="coolwarm")
plt.yticks(rotation=0);

 - There is no multicollinearity issue among the features.

    
 - AMin is very weakly correlated with the target.

In [None]:
plt.figure(figsize=(15,6))
df.corr()['Price'].sort_values(ascending=False).drop(['Price']).plot(kind='bar',color='g')
plt.xlabel('Feature',fontsize=14)
plt.ylabel('Correlation with Price',fontsize=14)
plt.title('Correlation of features with Price',fontsize=18)
plt.show()

 - AMin is very weakly correlated with the target.

# Separating features and label in train dataset:

In [None]:
x = df.drop("Price",axis=1)
y = df["Price"]

 - Here we have separated our target and independent columns.

# Scaling the data using standard scaler:

In [None]:
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
X = pd.DataFrame(scaler.fit_transform(x), columns=x.columns)

 - We have scaled our data using standard scaler.

In [None]:
X.head()

 - This is the data of independent variables after scaling.
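The cell above fits the scaler on the full dataset before the train/test split, so test rows influence the scaling statistics. A leakage-free variant computes the mean and standard deviation on the training rows only and reuses them for the test rows. The sketch below shows the idea with NumPy on random data; the split sizes and values are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical feature matrix: 10 rows, 2 columns on very different scales.
x = rng.normal(loc=[10.0, 200.0], scale=[2.0, 50.0], size=(10, 2))

# Positional split for illustration: first 7 rows train, last 3 test.
x_train, x_test = x[:7], x[7:]

# StandardScaler uses the population standard deviation (ddof=0); mirror that here.
mean = x_train.mean(axis=0)
std = x_train.std(axis=0)

x_train_scaled = (x_train - mean) / std
x_test_scaled = (x_test - mean) / std  # reuse training statistics, do not refit

print(x_train_scaled.mean(axis=0).round(6), x_train_scaled.std(axis=0).round(6))
```

With scikit-learn the same effect is obtained by calling `fit_transform` on the training split and `transform` on the test split.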

# Checking for multicollinearity issue using VIF:

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif=pd.DataFrame()
vif["VIF"]=[variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif["Features"]=X.columns
vif

 - There is no multicollinearity issue in this dataset.
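For intuition, the VIF of feature j is 1 / (1 - R_j²), where R_j² comes from regressing feature j on all the other features. For standardized features this equals the j-th diagonal entry of the inverse correlation matrix. The sketch below checks this on synthetic data in which one column is nearly a copy of another; the column names and data are made up.

```python
import numpy as np

rng = np.random.default_rng(42)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)  # nearly a copy of a, so a and c are collinear

features = np.column_stack([a, b, c])

# The diagonal of the inverse correlation matrix gives 1 / (1 - R_j^2) for each column.
corr = np.corrcoef(features, rowvar=False)
vif = np.diag(np.linalg.inv(corr))
print(dict(zip(["a", "b", "c"], vif.round(2))))
```

A common rule of thumb treats values above roughly 5 to 10 as a sign of problematic collinearity.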

# Finding Best Random State and Accuracy:

In [None]:
#importing necessary libraries

from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

In [None]:
from sklearn.ensemble import RandomForestRegressor

maxAccu=0
maxRS=0
for i in range(1,200):
    X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=.30, random_state =i)
    mod = RandomForestRegressor()
    mod.fit(X_train, y_train)
    pred = mod.predict(X_test)
    acc=r2_score(y_test, pred)
    if acc>maxAccu:
        maxAccu=acc
        maxRS=i
print("Best R2 score is ",maxAccu," on Random_state ",maxRS)

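 - We got the best R2 score and the random state that produced it. Because the split is chosen using the test score itself, this figure is optimistic.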

In [None]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=.30,random_state=maxRS)  #Created train test split
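Picking the split by its test score tends to overstate how well the model generalizes. One common alternative is k-fold cross-validation, which averages the score over several splits. The following is a minimal sketch on synthetic data; `X_demo` and `y_demo` are stand-ins for the scaled features and the target, and the model settings are arbitrary.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the scaled features X and target y.
X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=0)

# Five shuffled folds; every row is used for testing exactly once.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0)

scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="r2")
print(scores.mean(), scores.std())
```

The mean gives a less split-dependent estimate, and the standard deviation shows how much the score moves between folds.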

# Regression Algorithms:

In [None]:
#importing necessary libraries

from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor as KNN
from sklearn.linear_model import SGDRegressor
from xgboost import XGBRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import BaggingRegressor
from sklearn import metrics

## i) RandomForestRegressor:

In [None]:
RFR=RandomForestRegressor()
RFR.fit(X_train,y_train)
pred=RFR.predict(X_test)
R2_score = r2_score(y_test,pred)*100
print('R2_score:',R2_score)
print('mean_squared_error:',metrics.mean_squared_error(y_test,pred))
print('mean_absolute_error:',metrics.mean_absolute_error(y_test,pred))
print('root_mean_squared_error:',np.sqrt(metrics.mean_squared_error(y_test,pred)))

 - RFR gives an R2 score of 80.26%.
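The four metrics printed for each model can be computed by hand, which makes clear what each one measures. The sketch below uses a few made-up actual and predicted prices.

```python
import numpy as np

# Hypothetical actual and predicted prices.
y_true = np.array([3000.0, 4500.0, 6000.0, 8000.0])
y_pred = np.array([3200.0, 4300.0, 6500.0, 7500.0])

errors = y_true - y_pred
mse = np.mean(errors ** 2)       # mean squared error
mae = np.mean(np.abs(errors))    # mean absolute error
rmse = np.sqrt(mse)              # root mean squared error, in price units

# R2 compares the model's squared error with that of always predicting the mean.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mse, mae, rmse, r2)
```

MAE and RMSE are in the same units as Price, while R2 is unitless and is reported as a percentage in this notebook.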

## ii) XGB Regressor:

In [None]:
XGB=XGBRegressor()
XGB.fit(X_train,y_train)
pred=XGB.predict(X_test)
R2_score = r2_score(y_test,pred)*100
print('R2_score:',R2_score)
print('mean_squared_error:',metrics.mean_squared_error(y_test,pred))
print('mean_absolute_error:',metrics.mean_absolute_error(y_test,pred))
print('root_mean_squared_error:',np.sqrt(metrics.mean_squared_error(y_test,pred)))

 - XGB gives an R2 score of 79.56%.

## iii) ExtraTreesRegressor:

In [None]:
ETR=ExtraTreesRegressor()
ETR.fit(X_train,y_train)
pred=ETR.predict(X_test)
print('R2_score:',r2_score(y_test,pred))
print('mean_squared_error:',metrics.mean_squared_error(y_test,pred))
print('mean_absolute_error:',metrics.mean_absolute_error(y_test,pred))
print('root_mean_squared_error:',np.sqrt(metrics.mean_squared_error(y_test,pred)))

 - ETR gives an R2 score of 81.13%.

## iv) Gradient Boosting Regressor:

In [None]:
GBR=GradientBoostingRegressor()
GBR.fit(X_train,y_train)
pred=GBR.predict(X_test)
print('R2_score:',r2_score(y_test,pred))
print('mean_squared_error:',metrics.mean_squared_error(y_test,pred))
print('mean_absolute_error:',metrics.mean_absolute_error(y_test,pred))
print('root_mean_squared_error:',np.sqrt(metrics.mean_squared_error(y_test,pred)))

 - GBR gives an R2 score of 65.71%.

## v) DecisionTreeRegressor:

In [None]:
DTR=DecisionTreeRegressor()
DTR.fit(X_train,y_train)
pred=DTR.predict(X_test)
print('R2_score:',r2_score(y_test,pred))
print('mean_squared_error:',metrics.mean_squared_error(y_test,pred))
print('mean_absolute_error:',metrics.mean_absolute_error(y_test,pred))
print('root_mean_squared_error:',np.sqrt(metrics.mean_squared_error(y_test,pred)))

 - DTR gives an R2 score of 64.55%.

## vi) KNN:

In [None]:
knn=KNN()
knn.fit(X_train,y_train)
pred=knn.predict(X_test)
print('R2_score:',r2_score(y_test,pred))
print('mean_squared_error:',metrics.mean_squared_error(y_test,pred))
print('mean_absolute_error:',metrics.mean_absolute_error(y_test,pred))
print('root_mean_squared_error:',np.sqrt(metrics.mean_squared_error(y_test,pred)))

 - KNN gives an R2 score of 53.35%.

## vii) Bagging Regressor:

In [None]:
BG=BaggingRegressor()
BG.fit(X_train,y_train)
pred=BG.predict(X_test)
print('R2_score:',r2_score(y_test,pred))
print('mean_squared_error:',metrics.mean_squared_error(y_test,pred))
print('mean_absolute_error:',metrics.mean_absolute_error(y_test,pred))
print('root_mean_squared_error:',np.sqrt(metrics.mean_squared_error(y_test,pred)))

 - Bagging Regressor gives an R2 score of 79.32%.

Looking at the R2 scores and errors of the models, I found ExtraTreesRegressor to be the best model, with the highest R2 score and the lowest errors.

# Hyper Parameter Tuning:

In [None]:
#importing necessary libraries

from sklearn.model_selection import GridSearchCV

In [None]:
parameter = {'max_features':[1.0,'sqrt','log2'],
             'min_samples_split':[2,3,4],
             'n_estimators':[20,40,60,80,100],
             'min_samples_leaf':[1,2,3,4,5]}

 - Defining the parameter grid for ETR.

In [None]:
GCV=GridSearchCV(ExtraTreesRegressor(),parameter,cv=5)

 - Running grid search CV for ETR.

In [None]:
GCV.fit(X_train,y_train)

 - Tuning the model using GCV.

In [None]:
GCV.best_params_

 - Got the best parameters for ETR.
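Instead of retyping the best parameters by hand, the fitted search object can hand back the tuned model directly. This is a minimal sketch on synthetic data with a deliberately tiny grid; `X_demo`, `y_demo` and the grid values are assumptions for illustration only.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train and y_train.
X_demo, y_demo = make_regression(n_samples=120, n_features=6, noise=5.0, random_state=0)

# A deliberately tiny grid so the sketch runs quickly.
grid = {"n_estimators": [10, 20], "min_samples_leaf": [1, 2]}
search = GridSearchCV(ExtraTreesRegressor(random_state=0), grid, cv=3, scoring="r2")
search.fit(X_demo, y_demo)

# With the default refit=True, best_estimator_ is already refitted on all of X_demo.
best_model = search.best_estimator_
print(search.best_params_, round(search.best_score_, 3))
```

This avoids copy errors between the search result and the final model definition.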

In [None]:
Best_mod=ExtraTreesRegressor(max_features=1.0,min_samples_leaf=2,min_samples_split=2,n_estimators=80,n_jobs=1)  # 1.0 uses all features, as 'auto' did for regressors
Best_mod.fit(X_train,y_train)
pred=Best_mod.predict(X_test)
print('R2_Score:',r2_score(y_test,pred)*100)
print('mean_squared_error:',metrics.mean_squared_error(y_test,pred))
print('mean_absolute_error:',metrics.mean_absolute_error(y_test,pred))
print("RMSE value:",np.sqrt(metrics.mean_squared_error(y_test, pred)))

 - This is our model after tuning. The R2 score is now 82.01%, up from about 81% before tuning, so tuning gave a small improvement.

# Saving the model:

In [None]:
# Saving the model using .pkl

import joblib
joblib.dump(Best_mod,"Flight_Price.pkl")

 - We have saved our model as Flight_Price.pkl using joblib.
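A quick way to check a saved model is to reload it and confirm that it reproduces the original predictions. The sketch below does this in a temporary directory with a small synthetic model; the file name and data are hypothetical. To score genuinely new flights, the fitted scaler and the category encodings would need to be saved alongside the model as well.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

# Synthetic stand-in for the training data.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 3.0])

model = ExtraTreesRegressor(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Write to a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_model.pkl")  # hypothetical file name
    joblib.dump(model, path)
    restored = joblib.load(path)

# A restored model should reproduce the original predictions exactly.
same = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same)
```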

## Predicting Flight Price for test dataset using Saved model of trained dataset:

In [None]:
# Loading the saved model
model=joblib.load("Flight_Price.pkl")

#Prediction
prediction = model.predict(X_test)
prediction

In [None]:
pd.DataFrame([model.predict(X_test)[:],y_test[:]],index=["Predicted","Actual"])

 - Above are the predicted values and the actual values. They are close to each other.

In [None]:
plt.figure(figsize=(10,5))
plt.scatter(y_test, prediction, c='crimson')
p1 = max(max(prediction), max(y_test))
p2 = min(min(prediction), min(y_test))
plt.plot([p1, p2], [p1, p2], 'b-')
plt.xlabel('Actual', fontsize=15)
plt.ylabel('Predicted', fontsize=15)
plt.title("ExtraTreesRegressor")
plt.show()

 - Plotting actual vs. predicted values to get better insight. The blue line is the ideal line where the predicted value equals the actual value, and the red dots are the predictions.