<a href="https://colab.research.google.com/github/yashtambee/Airline-Passenger-Referral-Prediction/blob/main/Yash_Tambe_Airline_Passenger_Referral_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **✈️Airline Passenger Referral Prediction**    



##### **Project Type**    - Classification
##### **Contribution**    - Team
##### **Team Member 1**  - Yash Tambe
##### **Team Member 2**  - Chaitanya Chaudhari

# **Project Summary -**

In this Airline Passenger Referral Prediction capstone project, our main objective is to predict whether passengers will refer the airline to their friends. Using the data given in the dataset, we will build machine learning classification models that predict whether a passenger recommends the airline. We will then apply cross-validation and hyperparameter tuning to make the predictions more accurate. Before modelling, we will clean and prepare the data as required.

# **GitHub Link -**

https://github.com/yashtambee/Airline-Passenger-Referral-Prediction

# **Problem Statement**


The data includes airline reviews from 2006 to 2019 for popular airlines around the world, with multiple-choice and free-text questions. The data was scraped in Spring 2019. The main objective is to predict whether passengers will refer the airline to their friends.

# **Let's Begin !**

## **Know Your Data**

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np

from statsmodels.stats.outliers_influence import variance_inflation_factor
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, auc
from sklearn.metrics import recall_score,precision_score,classification_report,roc_auc_score,roc_curve
from sklearn.metrics import f1_score
from xgboost import XGBClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
 
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

!pip install eli5
import eli5 as eli

import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset
airline_df = pd.read_excel("https://github.com/yashtambee/Airline-Passenger-Referral-Prediction/blob/main/data_airline_reviews.xlsx?raw=true")

### Dataset First View

In [None]:
# Dataset First Look
airline_df.head()

In [None]:
airline_df.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print(f'In the given dataset,\nThe total number of rows are {airline_df.shape[0]} and \nThe total number of columns are {airline_df.shape[1]}')

### Dataset Information

In [None]:
# Dataset Info
airline_df.info()

## **Understanding Your Variables**

In [None]:
# Dataset Describe
airline_df.describe(include='all')

In [None]:
# Dataset Columns
airline_df.columns

### Variables Description 

**airline**: Name of the airline.

**overall**: Overall rating given to the trip, from 1 to 10.

**author**: Author of the review

**review date**: Date of the review

**customer review**: Review of the customer in free-text format

**aircraft**: Type of the aircraft

**traveller type**: Type of traveller (e.g. business, leisure)

**cabin**: Cabin at the flight

**date flown**: Flight date

**seat comfort**: Rated between 1-5

**cabin service**: Rated between 1-5

**foodbev**: Rated between 1-5

**entertainment**: Rated between 1-5

**ground service**: Rated between 1-5

**value for money**: Rated between 1-5

**recommended**: Binary, target variable.

## **Data Wrangling**

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
len(airline_df[airline_df.duplicated()])

In [None]:
# permanently dropping the duplicate rows from the dataset
airline_df.drop_duplicates(inplace = True)

In [None]:
# checking the duplicate rows again after dropping them
len(airline_df[airline_df.duplicated()])

In [None]:
# checking the dataset shape after dropping the duplicate rows
airline_df.shape

As we can see, dropping the duplicate rows reduced the dataset from 131895 rows and 17 columns to 61184 rows and 17 columns.

In [None]:
# first view of the dataset after dropping the duplicate rows
airline_df.head()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
null_values = airline_df.isna().sum()
null_values

In [None]:
# Visualizing the missing values
plt.rcParams['figure.figsize'] = (6,5)
airline_df.isna().sum().plot(kind = 'bar', color = 'pink')
plt.xlabel('Variables with null values')
plt.ylabel('Total Null value count',labelpad = 10)
plt.title('Null values in dataframe')

In [None]:
# Permanently dropping the null values 
airline_df.dropna(inplace=True)

In [None]:
# checking the dataset shape after dropping the null values
airline_df.shape

As we can see, dropping the rows with null values reduced the dataset from 61184 rows and 17 columns to 13189 rows and 17 columns.
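Since a blanket `dropna()` removes every row that has a null in any column, it can help to see how much each column contributes to the loss. The following is a minimal sketch on a small made-up frame; the toy `df` and its values are assumptions for illustration only, not the airline data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for airline_df (values are illustrative only)
df = pd.DataFrame({
    "overall": [7, np.nan, 3, 9],
    "aircraft": [np.nan, np.nan, "A320", np.nan],
    "recommended": ["yes", "no", "no", "yes"],
})

# Share of missing values per column, in percent
missing_pct = df.isna().mean().mul(100).round(1)
print(missing_pct)

# Number of rows that survive a blanket dropna()
rows_kept = len(df.dropna())
print(f"rows kept: {rows_kept} of {len(df)}")
```

In this toy example a single sparse column is responsible for most of the dropped rows, which is the kind of pattern worth checking for before dropping.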

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable 
# creating an object named required_variables that will contain all the columns
required_variables = airline_df.loc[:,:]
for i in required_variables.columns :
  print(f'Unique values for variable {i} is as below :')
  print(required_variables[i].unique())
  print('\n')

# **EDA (Exploratory Data Analysis)**

#### Analysing which airline has the most overall rating points


In [None]:
# performing groupby operation to find the airline with the highest total of overall rating points
most_overall_rating = airline_df.groupby('airline')['overall'].sum().sort_values(ascending= False)
most_overall_rating

In [None]:
# visualizing the above result 
plt.rcParams['figure.figsize'] = (25,8)
most_overall_rating.plot(kind = 'bar')

# assigning title, x label , y label to the plot
plt.xlabel('airline', labelpad = 17, fontsize = 12)
plt.ylabel('overall ratings points', labelpad = 17, fontsize = 12)
plt.title('overall rating points by airline', pad = 19, fontsize = 14)


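From the above bar plot we can see that China Southern Airlines has the highest total of overall rating points. Because this is a sum, it reflects the number of reviews an airline received as well as how high its ratings are.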
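To make the effect of summing concrete, here is a small sketch with made-up numbers (the `toy` frame and the airline names `A` and `B` are hypothetical). It compares the sum, mean and count of ratings per airline.

```python
import pandas as pd

# Toy reviews (made-up values): A has many mediocre reviews, B has few good ones
toy = pd.DataFrame({
    "airline": ["A", "A", "A", "A", "B", "B"],
    "overall": [5, 5, 5, 5, 9, 9],
})

# Sum, mean and number of ratings per airline
summary = toy.groupby("airline")["overall"].agg(["sum", "mean", "count"])
print(summary)
```

Airline `A` leads on the sum only because it has more reviews, while `B` has the higher average rating, so the two rankings can disagree.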

#### Analysing which airline is the most worthy for money

In [None]:
# finding the most worthy for money airline using groupby operation on 'airline' & 'value for money' feature 
airline_most_worthy_for_money = airline_df.groupby('airline')['value_for_money'].sum().sort_values(ascending=False)
airline_most_worthy_for_money

In [None]:
# visualizing the above result 
plt.rcParams['figure.figsize'] = (25,8)
airline_most_worthy_for_money.plot(kind = 'bar')

# assigning title, x label , y label to the plot
plt.xlabel('airline', labelpad = 17, fontsize = 12)
plt.ylabel('value for money ratings points', labelpad = 17, fontsize = 12)
plt.title('value for money rating points by airline', pad = 19, fontsize = 14)

From the above bar plot we can see that 'China Southern Airlines' has the highest total of value-for-money rating points and 'airBaltic' has the lowest.

#### Analysing the average food & beverage and entertainment ratings given by passengers


In [None]:
# finding the average ratings of food & beverages and entertainment
avg_rating_foodbev_and_entertainment = airline_df.groupby('cabin')[['food_bev','entertainment']].mean()
avg_rating_foodbev_and_entertainment

In [None]:
# visualizing the average food beverages & entertainment ratings given by the passengers
plt.rcParams['figure.figsize'] = (7,5)
avg_rating_foodbev_and_entertainment.plot(kind = 'bar')
plt.ylim([0,4])
plt.xticks(rotation = 50)
plt.ylabel('avg  food beverages and entertainment ratings',fontsize = 9,labelpad = 10)
plt.xlabel('cabin type',fontsize = 9)

Economy Class has the lowest average food & beverage and entertainment ratings compared to the other classes,

whereas Business Class has the highest average food & beverage and entertainment ratings.

#### Analysing the top 10 airlines with the most trips

In [None]:
# finding the top 10 airlines with most no. of trips
top_10_airlines = airline_df['airline'].value_counts().head(10)

In [None]:
# visualizing the top 10 airlines with most no. of trips
plt.rcParams['figure.figsize'] = (10,8)
top_10_airlines.plot(kind = 'bar')
plt.xticks(rotation = 50)
plt.ylabel('No. of trips',fontsize = 12,labelpad = 14)
plt.xlabel('airlines',fontsize = 12,labelpad = 14)
plt.title('top 10 airlines with most no. of trips',pad = 14,fontsize = 12)

British Airways ranks at the top of the list of the 10 airlines with the most trips.

### Checking the distribution of dependent variable 

In [None]:
# checking the distribution of values of YES - NO  
target_distribution = airline_df['recommended'].value_counts()
target_distribution

In [None]:
# visualizing the distribution of the dependent variable
plt.rcParams['figure.figsize'] = (5,5)
sns.countplot(x = airline_df['recommended'])
plt.title('Distribution of Dependent Variable : recommended ',pad = 12)
plt.xlabel('recommended',fontsize = 8.5)
plt.ylabel('count',fontsize = 8.5)

From the above visualization, we can see that of the roughly 13000 reviews in the cleaned dataset, 8802 recommend the airline.

In [None]:
# checking the percentage of distribution of yes v/s no
target_distribution.plot(kind = 'pie',autopct='%.1f%%', shadow = True, explode = [0.2,0.1], fontsize = 12)

### Checking the Distribution of the Independent variables




In [None]:
# plot histogram to see the distribution of the data
fig = plt.figure(figsize = (10,10))
ax = fig.gca()
airline_df.hist(ax = ax)
plt.show()

### Checking the relationship between the categorical dependent variable and the independent variables

In [None]:
# defining the categorical variable list 
categorical_features = ['overall','seat_comfort','cabin_service','food_bev','entertainment','ground_service','value_for_money']

In [None]:
# visualizing the relationship between categorical dependent & independent variables
for col in categorical_features : 
  plt.figure(figsize=(8,5))
  sns.violinplot(x=col, y="recommended", data=airline_df, hue = 'recommended') # plots the violin plot
  plt.title("Relationship between recommended &" + " " + col)                  # assigning title to the plot

### Checking Multi - Collinearity 

In [None]:
# encoding the dependent variable "recommended" as it is going to be used for checking multicollinearity
airline_df['recommended'] = airline_df['recommended'].map({'yes':1,'no':0})

In [None]:
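# checking the correlations among the numeric features
# (non-numeric columns are excluded explicitly, since recent pandas versions raise an error on them)
airline_df.select_dtypes(include='number').corr()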

In [None]:
# analysing the correlations of the numeric features using the heatmap
plt.rcParams['figure.figsize'] = (8,6)
sns.heatmap(airline_df.select_dtypes(include='number').corr().abs(),annot = True, cmap = 'Blues')

### VIF (Variance Inflation Factor) Analysis of Independent Variables
Variance inflation factor (VIF) is a measure of the amount of multicollinearity in a set of multiple regression variables.
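For a feature $i$, the usual definition is $\mathrm{VIF}_i = 1 / (1 - R_i^2)$, where $R_i^2$ comes from regressing feature $i$ on all the other features. The sketch below computes this directly with NumPy on synthetic data; the helper name `vif_from_r2` and the toy columns `a`, `b`, `c` are assumptions for illustration. It adds an intercept to each regression, whereas, to our understanding, `variance_inflation_factor` from statsmodels uses the columns exactly as passed, so its values can differ unless a constant column is included.

```python
import numpy as np

def vif_from_r2(X, i):
    """VIF of column i: 1 / (1 - R^2) from regressing column i on the others."""
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    # Add an intercept column so that R^2 is the usual centered R^2
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)              # independent of a
c = a + 0.1 * rng.normal(size=500)    # nearly a copy of a
X = np.column_stack([a, b, c])

vifs = [vif_from_r2(X, i) for i in range(X.shape[1])]
print(vifs)
```

The two nearly identical columns get large VIF values, while the independent column stays close to 1.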



In [None]:
# defining a function to calculate the VIF
def cal_vif(x):
  # calculating the VIF of every column in x
  vif = pd.DataFrame()
  vif['variables'] = x.columns  # one row per column of the passed dataset
  vif['VIF'] = [variance_inflation_factor(x.values, i) for i in range(x.shape[1])]  # x.shape[1] is the number of columns
  return vif  # returning the VIF dataframe

In [None]:
cal_vif(airline_df[[i for i in airline_df.describe().columns if i not in ['airline','author','review_date','customer_review','aircraft',
                                                                          'traveller_type','cabin','date_flown','route','recommended']]])

Here the 'overall' and 'value for money' features have very high VIF values and a high correlation of 0.87 with each other.

Also, 'overall' and 'recommended' have a correlation of 0.86.

'value for money' and 'recommended' have a correlation of 0.79.

So we will drop the 'value_for_money' feature, since it is strongly correlated with 'overall' and less strongly correlated with the target.

The remaining features still have high VIF values, but they do not show very high correlations in the heatmap.

So we will conclude our VIF process here.

In [None]:
# dropping 'value_for_money' feature from the VIF list
cal_vif(airline_df[[i for i in airline_df.describe().columns if i not in ['airline','author','review_date','customer_review',
                                                                          'aircraft','traveller_type','cabin','date_flown','route',
                                                                          'recommended','value_for_money']]])

In [None]:
# analysing the correlations of features using the heatmap after dropping the 'value_for_money' feature
plt.rcParams['figure.figsize'] = (8,6)
sns.heatmap(airline_df[['overall','seat_comfort','cabin_service','food_bev','entertainment','ground_service','recommended']].corr(),annot = True, cmap = 'Blues')

### Outlier Detection

In [None]:
# Checking for outliers using a box plot
sns.set_theme(style="whitegrid")
sns.set(rc={'figure.figsize':(13.7,8.27)})
ax = sns.boxplot(data=airline_df, orient="v", palette="Set2")

With this, our Exploratory Data Analysis is complete.

## **Feature Engineering & Data Pre-processing**

### Reducing cardinality & Feature Encoding

In [None]:
# Creating a copy of the original dataset & dropping 'value_for_money' feature from it as per our VIF analysis
airline_df_cp = airline_df.copy().drop('value_for_money',axis = 1)

In [None]:
# first view of the copied dataset after dropping the 'value_for_money' feature
airline_df_cp.head()

In [None]:
# creating a list of columns whose cardinality is to be checked
cols_for_cardinality_check = ['airline','overall','seat_comfort','cabin_service','food_bev','entertainment','cabin','traveller_type','ground_service','recommended']

In [None]:
# Number of labels = cardinality
#Let's now check if our categorical variables have a huge number of categories. 
#This may be a problem for some machine learning models.
for var in airline_df_cp[cols_for_cardinality_check]:
    print(var, ' contains ', len(airline_df_cp[var].unique()), ' labels')

In [None]:
# encoding the original data
airline_df_cp['cabin'] = airline_df_cp['cabin'].map({'Economy Class':0 ,'Business Class':1 ,'First Class':2 ,'Premium Economy':3})
airline_df_cp['traveller_type'] = airline_df_cp['traveller_type'].map({'Solo Leisure':0 ,'Couple Leisure':1 ,'Business':2 ,'Family Leisure':3})

# creating dummies values for the airline feature
airline_df_cp = pd.get_dummies(airline_df_cp, columns=['airline'])

In [None]:
# first view of data after encoding
airline_df_cp.head()

In [None]:
# checking shape of data after feature encoding
airline_df_cp.shape

In [None]:
# checking all columns in new data set after feature encoding & creating dummy variable  
airline_df_cp.columns

### Checking Class Imbalance of Target variable 

In [None]:
# counting the total number of each class present in the dataset
# here yes = 1, no = 0
airline_df_cp['recommended'].value_counts()

In [None]:
# calculating the total number of rows in the dataset
total = airline_df_cp['recommended'].value_counts()[1] + airline_df_cp['recommended'].value_counts()[0]
print('total target variable label count :',total)

# calculating the percentage of observations of dependent variable belonging to the class 1
percentage_class_1 = round((airline_df_cp['recommended'].value_counts()[1]/total)*100,2)
print('Percentage of class 1 :',percentage_class_1)

# calculating the percentage of observations of dependent variable belonging to the class 0
percentage_class_0 = round((airline_df_cp['recommended'].value_counts()[0]/total)*100,2)
print('Percentage of class 0 :',percentage_class_0)

Here we have a class imbalance, as class 1 is almost double (2x) class 0.

The model is likely to predict class 1 well but may make more errors on class 0, because during training it sees many more class 1 examples.

So we have to handle the class imbalance, and we will use the Synthetic Minority Oversampling Technique (SMOTE) to do so.
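The core idea of SMOTE is to create new minority-class points by interpolating between an existing minority point and one of its nearest minority neighbours. The sketch below is a simplified NumPy illustration of that idea on toy data, not the imblearn implementation; the helper `smote_like` and all the data are hypothetical. In the notebook itself the imported `SMOTE` class would be used through its `fit_resample` method, typically on the training split only.

```python
import numpy as np

def smote_like(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic minority points by interpolating towards a
    randomly chosen one of each sampled point's k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                  # a point is not its own neighbour
    neighbours = np.argsort(d, axis=1)[:, :k]    # indices of the k nearest neighbours
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                 # position along the segment, in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])

rng = np.random.default_rng(1)
X_major = rng.normal(loc=0.0, size=(20, 2))   # toy majority class (class 1)
X_minor = rng.normal(loc=3.0, size=(10, 2))   # toy minority class (class 0)

# Generate enough synthetic points to match the majority class size
X_new = smote_like(X_minor, n_new=len(X_major) - len(X_minor))
X_minor_balanced = np.vstack([X_minor, X_new])
print(len(X_major), len(X_minor_balanced))
```

Because every synthetic point lies on a segment between two real minority points, the new samples stay inside the region already covered by the minority class.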