<a href="https://colab.research.google.com/github/yogesh1199/Classification-Airline-Passenger-Referral-Prediction/blob/main/classification_airline_passenger_referral_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -  **Airline Passenger Referral Prediction**



##### **Project Type**    - Classification
##### **Contribution**    - Individual
##### **Github Link**    - https://github.com/yogesh1199/Classification-Airline-Passenger-Referral-Prediction

# **Problem Statement**


**BUSINESS PROBLEM OVERVIEW**


Air transport plays a very important role in the world's current transport infrastructure and is widely regarded as a gift of the 20th century. In today's fast-paced world, air transport is a boon because of its speed. It delivers products quickly and safely to those who need them, and it also allows the tourism industry in each country to grow steadily, reducing the distance between people around the world.

Here, I have a dataset of customer service ratings for various airlines. The main objective of this project is to understand whether passengers will recommend an airline to others. The dataset is large: it initially had 131895 rows and 17 columns. Checking the data information showed two data types: 7 columns of type float64 and 10 columns of type object. The non-null counts do not match across columns, which clearly indicates that a large number of missing (null) values are present in the dataset.

The main objective is to predict whether or not passengers will refer the airline to others.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
# numpy.math was removed in NumPy 2.0; the standard-library math module is used instead
import math
import numpy as np
import pandas as pd
from numpy import loadtxt
import seaborn as sns
import matplotlib.pyplot as plt

from scipy.stats import *

from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedStratifiedKFold
from xgboost import XGBClassifier
from xgboost import XGBRFClassifier
from sklearn.tree import export_graphviz

import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset
url = "https://github.com/yogesh1199/Classification-Airline-Passenger-Referral-Prediction/raw/main/data_airline_reviews.xlsx"

airline_df = pd.read_excel(url)


### Dataset First View

In [None]:
# Dataset First Look
airline_df.head()

In [None]:
airline_df.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns
airline_df.shape

### Dataset Information

In [None]:
# Dataset Info
airline_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
len(airline_df[airline_df.duplicated()])

In [None]:
airline_df.drop_duplicates(inplace = True)

In [None]:
len(airline_df[airline_df.duplicated()])

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print(airline_df.isnull().sum())

In [None]:
# Visualizing the missing values
# Checking Null Value by plotting Heatmap
###sns.heatmap(airline_df.isnull())

plt.figure(figsize=(10, 6))
plt.imshow(airline_df.isnull(), cmap='viridis', aspect='auto')
plt.xticks(range(len(airline_df.columns)), airline_df.columns, rotation=90)
plt.colorbar(label='Missing Values')
plt.title('Missing Value Heatmap')

plt.show()
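
The heatmap shows where values are missing but not how many. A numeric summary per column complements it. This is a minimal sketch, assuming a helper named `missing_summary` (hypothetical, not part of the notebook) and a toy frame with made-up values:

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Missing-value count and percentage per column, worst column first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(df) * 100).round(2),
    })
    return summary.sort_values("missing_count", ascending=False)

# Toy frame standing in for airline_df (made-up values)
toy_df = pd.DataFrame({
    "overall": [1.0, np.nan, 8.0, 10.0],
    "cabin": ["Economy Class", None, None, "Business Class"],
    "airline": ["A", "B", "C", "D"],
})

summary = missing_summary(toy_df)
print(summary)
```

On the notebook's data the call would be `missing_summary(airline_df)`.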

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
airline_df.columns

In [None]:
# Dataset Describe
airline_df.describe(include='all')

### Variables Description

**airline**: The name of the airline being reviewed.

**overall**: The overall rating given by the customer, representing their overall satisfaction with the airline experience.

**author**: The author or person who wrote the customer review.

**review_date**: The date when the review was posted by the customer.

**customer_review**: The actual text of the customer's review expressing their opinions and feedback about the airline.

**aircraft**: The type or model of the aircraft used for the flight.

**traveller_type**: The type of traveler, indicating whether the customer is a business traveler, leisure traveler, etc.

**cabin**: The cabin class in which the customer traveled (e.g., economy, business, first class).

**route**: The route or flight path taken by the airline for the reviewed journey.

**date_flown**: The date when the flight was taken by the customer.

**seat_comfort**: The rating given by the customer for the comfort of the seat during the flight.

**cabin_service**: The rating given by the customer for the service provided in the cabin.

**food_bev**: The rating given by the customer for the quality of food and beverages provided during the flight.

**entertainment**: The rating given by the customer for the in-flight entertainment options.

**ground_service**: The rating given by the customer for the overall ground services (e.g., check-in, baggage handling).

**value_for_money**: The rating given by the customer for the perceived value for money in relation to the overall experience.

**recommended**: A binary indicator (1 or 0) indicating whether the customer would recommend the airline based on their experience.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in airline_df.columns.tolist():
  print("No. of unique values in ",i,"is",airline_df[i].nunique(),".")

airline_df['overall'].unique()


## ***3. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

# Chart - 1 - Distribution of ratings

### Q1. What is the distribution of overall ratings given by customers?
### Q2. Are there more positive or negative reviews?

In [None]:
# Q1. What is the distribution of overall ratings given by customers?
plt.figure(figsize=(8,6))
sns.histplot(airline_df['overall'],bins=10)
plt.title('Distribution of Overall Ratings')
plt.xlabel('Overall Rating')
plt.ylabel('Frequency')
plt.show()

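# Counts on the 1-10 scale, using the same cut-offs as the review categories in the next cell
positive_reviews = airline_df[airline_df['overall'] >= 7]['overall'].count()
negative_reviews = airline_df[airline_df['overall'] <= 4]['overall'].count()
print(f'Positive reviews: {positive_reviews}, negative reviews: {negative_reviews}')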



*   The chart shows that the most common rating is 1, with over 14,000 customers giving this rating.

*   The least common ratings are 4 and 6, each given by fewer than 3,000 customers.

In [None]:
### Q2. Are there more positive or negative reviews?
airline_df['review_category'] = pd.cut(airline_df['overall'], bins=[0, 4, 6, 10], labels=['Negative', 'Neutral', 'Positive'], include_lowest=True)

# Plot count of positive vs. negative reviews
plt.figure(figsize=(8, 5))
sns.countplot(x='review_category', data=airline_df, palette='viridis')
plt.title('Distribution of Reviews (Positive vs. Neutral vs. Negative)')
plt.xlabel('Review Category')
plt.ylabel('Count')
plt.show()

If we divide the ratings into 0-4 (negative), 5-6 (neutral) and 7-10 (positive), the graph above clearly shows that the dataset contains more negative reviews than positive reviews.
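
The binning described above can be checked on a handful of values. This is a minimal sketch on made-up ratings (not the notebook's data) that shows which category each boundary value falls into and the share of each category:

```python
import pandas as pd

# Toy ratings chosen to sit on the bin boundaries (made-up values)
ratings = pd.Series([1, 4, 5, 6, 7, 10], name="overall")

# Same bins and labels as the notebook cell: (0, 4], (4, 6], (6, 10]
categories = pd.cut(
    ratings,
    bins=[0, 4, 6, 10],
    labels=["Negative", "Neutral", "Positive"],
    include_lowest=True,
)

# Percentage of reviews in each category
shares = categories.value_counts(normalize=True).mul(100).round(1)

print(pd.DataFrame({"overall": ratings, "review_category": categories}))
print(shares)
```

Because the right edge of each bin is inclusive, a rating of 4 counts as negative and a rating of 6 as neutral, which matches the 0-4 / 5-6 / 7-10 split.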

## Chart - 2 - Airline Performance:

### Q1. Which airlines have the highest and lowest average overall ratings?
### Q2. How does the overall rating vary across different airlines?

In [None]:
avg_ratings = airline_df.groupby('airline')['overall'].mean().sort_values(ascending=False)

# Plot the bar plot
plt.figure(figsize=(15,25))
sns.barplot(x=avg_ratings.values, y=avg_ratings.index, palette='viridis')
plt.title('Average Overall Ratings by Airline')
plt.xlabel('Average Overall Rating')
plt.ylabel('Airline')
plt.show()

According to the graph:

*   Garuda Indonesia has the highest average overall rating.
*   Frontier Airlines has the lowest average overall rating.
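
An average alone does not show how many reviews support it. The sketch below pairs each airline's mean with its review count and optionally filters out thinly reviewed airlines. The helper `rank_airlines` and the `min_reviews` threshold are hypothetical, and the data is a made-up toy frame:

```python
import pandas as pd

def rank_airlines(df: pd.DataFrame, min_reviews: int = 2) -> pd.DataFrame:
    """Average overall rating and review count per airline, best first."""
    stats = (
        df.groupby("airline")["overall"]
        .agg(avg_rating="mean", n_reviews="count")
        .sort_values("avg_rating", ascending=False)
    )
    # Keep only airlines with enough reviews for the mean to be meaningful
    return stats[stats["n_reviews"] >= min_reviews]

# Toy data (made-up values); airline "C" has a single perfect review
toy_df = pd.DataFrame({
    "airline": ["A", "A", "A", "B", "B", "C"],
    "overall": [8.0, 9.0, 10.0, 2.0, 4.0, 10.0],
})

ranked = rank_airlines(toy_df, min_reviews=2)
print(ranked)
```

On the notebook's data the equivalent call would be `rank_airlines(airline_df, min_reviews=...)`, with the threshold chosen by inspection.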

#### Chart - 3 Review Trends:

#### Q1. How does the number of customer reviews change over time?


In [None]:
airline_df['review_date'] = pd.to_datetime(airline_df['review_date'])

# Extract year from 'review_date' for grouping
airline_df['year'] = airline_df['review_date'].dt.year

# Group by year and calculate the count of reviews
review_trends_yearly = airline_df.groupby('year').size()

# Plotting the time series plot for each year
plt.figure(figsize=(12, 6))
sns.lineplot(x=review_trends_yearly.index, y=review_trends_yearly.values, marker='o', color='blue')
plt.title('Number of Customer Reviews Over Years')
plt.xlabel('Year')
plt.ylabel('Number of Reviews')
plt.xticks(rotation=45)
plt.show()

The number of customer reviews appears to have increased over the years according to the graph.

The number of reviews may have increased more rapidly in the earlier years (between 2002 and 2010), as the slope of the line appears steeper in that section.
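
Reading growth from the slope of a line is approximate. One way to check it numerically is to compute the year-over-year change in review counts. This is a sketch with a hypothetical helper `yearly_growth` and made-up dates:

```python
import pandas as pd

def yearly_growth(dates: pd.Series) -> pd.DataFrame:
    """Reviews per year with absolute and percentage change from the previous year."""
    years = pd.to_datetime(dates).dt.year
    counts = years.value_counts().sort_index()
    return pd.DataFrame({
        "reviews": counts,
        "abs_change": counts.diff(),
        "pct_change": counts.pct_change().mul(100).round(1),
    })

# Toy review dates (made-up): 1 review in 2015, 2 in 2016, 4 in 2017
toy_dates = pd.Series([
    "2015-01-05",
    "2016-03-01", "2016-07-19",
    "2017-02-11", "2017-05-30", "2017-09-02", "2017-12-25",
])

growth = yearly_growth(toy_dates)
print(growth)
```

The first year has no previous year to compare against, so its change columns are `NaN`. On the notebook's data the call would be `yearly_growth(airline_df['review_date'])`.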

### Chart - 4 Customer Types:

##### Q1. What types of travelers (traveller_type) contribute the most reviews?
##### Q2. How do overall ratings differ among different traveller types?

In [None]:
#### Q1  What types of travelers (traveller_type) contribute the most reviews?
plt.figure(figsize=(12, 6))
sns.countplot(x='traveller_type', data=airline_df, palette='viridis')
plt.title('Number of Reviews by Traveler Type')
plt.xlabel('Traveler Type')
plt.ylabel('Number of Reviews')
plt.show()

From the above graph it is clearly visible that:

*   Solo Leisure travelers contribute the most reviews, over 12,000.
*   Business travelers contribute the fewest reviews, followed by Family Leisure and Couple Leisure.



In [None]:
# Q2. How do overall ratings differ among different traveller types?
plt.figure(figsize=(12, 6))
sns.boxplot(x='traveller_type', y='overall', data=airline_df, palette='viridis')
plt.title('Overall Ratings by Traveler Type')
plt.xlabel('Traveler Type')
plt.ylabel('Overall Rating')
plt.show()

The box plot above shows that Solo Leisure travelers tend to give higher overall ratings than the other traveler types.

### Chart - 5 - Cabin Analysis:

#### Q1. What is the distribution of ratings for different cabin classes?
#### Q2. Are certain cabin classes associated with higher overall ratings?

In [None]:
plt.figure(figsize=(12, 6))
# discrete=True gives each integer rating its own bar; bins=range(1, 6) would merge ratings 4 and 5
sns.histplot(data=airline_df, x='seat_comfort', hue='cabin', discrete=True, palette='viridis', multiple='stack')
plt.title('Distribution of Seat Comfort Ratings by Cabin Class')
plt.xlabel('Seat Comfort Rating')
plt.ylabel('Frequency')
plt.show()

The chart shows that:

*   Economy class reviews most often give a seat comfort rating of 2.
*   Business class and Premium Economy reviews most often give a rating of 4.
*   First class ratings concentrate at the top of the scale, in the 4 to 5 range.
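
A stacked histogram makes it hard to compare classes of very different sizes. A row-normalised cross-tabulation gives the share of each rating within each cabin class, and the most common rating falls out directly. The helper name `rating_share_by_group` is hypothetical and the data is a made-up toy frame:

```python
import pandas as pd

def rating_share_by_group(df: pd.DataFrame, group_col: str, rating_col: str) -> pd.DataFrame:
    """Percentage of each rating value within each group (rows sum to 100)."""
    return (
        pd.crosstab(df[group_col], df[rating_col], normalize="index")
        .mul(100)
        .round(1)
    )

# Toy data (made-up values)
toy_df = pd.DataFrame({
    "cabin": ["Economy Class"] * 4 + ["Business Class"] * 3,
    "seat_comfort": [2.0, 2.0, 1.0, 4.0, 4.0, 4.0, 5.0],
})

shares = rating_share_by_group(toy_df, "cabin", "seat_comfort")
most_common = shares.idxmax(axis=1)  # modal rating per cabin class

print(shares)
print(most_common)
```

On the notebook's data the call would be `rating_share_by_group(airline_df, 'cabin', 'seat_comfort')`.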








In [None]:
plt.figure(figsize=(12, 6))
sns.boxplot(x='cabin', y='overall', data=airline_df, palette='viridis')
plt.title('Overall Ratings by Cabin Class')
plt.xlabel('Cabin Class')
plt.ylabel('Overall Rating')
plt.show()

The box plots show the distribution of overall ratings for different cabin classes. The higher the box in the plot, the higher the overall rating for that cabin class.

*   First class cabins are typically associated with higher overall ratings than other classes. The box for first class is higher than the boxes for all other classes.
*   Business class and premium economy cabins are typically associated with mid-range overall ratings. The boxes for business class and premium economy are lower than the box for first class and higher than the box for economy class.
*   Economy class cabins are typically associated with lower overall ratings. The box for economy class is the lowest of all the cabin classes.
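
The position of each box is set by the quartiles of that group's ratings, so the comparison above can be backed with numbers. This is a sketch with a hypothetical helper `box_stats` on made-up data:

```python
import pandas as pd

def box_stats(df: pd.DataFrame, group_col: str, value_col: str) -> pd.DataFrame:
    """Q1, median and Q3 (the edges and centre line of each box) plus group size."""
    grouped = df.groupby(group_col)[value_col]
    stats = grouped.quantile([0.25, 0.5, 0.75]).unstack()
    stats.columns = ["q1", "median", "q3"]
    stats["n"] = grouped.count()
    return stats.sort_values("median", ascending=False)

# Toy data (made-up values)
toy_df = pd.DataFrame({
    "cabin": ["First Class"] * 3 + ["Economy Class"] * 5,
    "overall": [6.0, 8.0, 10.0, 1.0, 2.0, 3.0, 4.0, 5.0],
})

stats = box_stats(toy_df, "cabin", "overall")
print(stats)
```

On the notebook's data the call would be `box_stats(airline_df, 'cabin', 'overall')`. The `n` column helps judge how much weight a small group such as First class should carry.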

### Chart - 6 - Food & Beverage, Entertainment:

#### Q1. What are the average food_bev and entertainment ratings given by passengers?

In [None]:
cabin_df=airline_df.groupby('cabin')[['food_bev','entertainment']].mean().reset_index()
cabin_df

In [None]:
ratings_df = cabin_df[['cabin', 'food_bev', 'entertainment']]

# Melt the DataFrame to have a single 'Rating Type' column
melted_ratings = pd.melt(ratings_df, id_vars=['cabin'], var_name='Rating Type', value_name='Average Rating')

# Plot the barplot (values are per-cabin means of food_bev and entertainment, not the overall rating)
plt.figure(figsize=(12, 6))
sns.barplot(x='cabin', y='Average Rating', hue='Rating Type', data=melted_ratings, palette='viridis')
plt.title('Average Food & Beverage and Entertainment Ratings by Cabin Class')
plt.xlabel('Cabin Class')
plt.ylabel('Average Rating')
plt.show()

### Chart - 7 - Word Cloud
### Q1. What are the most common words or phrases used in customer reviews?

In [None]:
from wordcloud import WordCloud
text_data = ' '.join(airline_df['customer_review'].dropna().astype(str))
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text_data)

plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.title('Word Cloud of Customer Reviews')
plt.axis('off')  # Turn off the axis labels
plt.show()

#### Chart - 8 - Comparison of all independent variables

In [None]:
airline_df.hist(bins=50, figsize=(20,15),color = 'blue')
plt.show()

From the above plot:

*   For `overall`, ratings of 1 to 2 occur most frequently.
*   For `seat_comfort`, a rating of 1 is the most frequent and a rating of 4 is the second most frequent.
*   For `cabin_service`, a rating of 5 is the most frequent and a rating of 1 is the second most frequent.
*   For `food_bev`, ratings of 2, 4 and 5 occur with approximately equal frequency.
*   For both `entertainment` and `ground_service`, a rating of 3 is the most frequent and a rating of 1 is the second most frequent.
*   For `value_for_money`, a rating of 1 is the most frequent. This suggests that many passengers were dissatisfied with the service they received for the price.
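
Claims about the most and second most frequent rating can be read from value counts rather than from bar heights. This is a sketch with a hypothetical helper `top_two_ratings` on made-up data. It assumes every column has at least one non-missing value:

```python
import numpy as np
import pandas as pd

def top_two_ratings(df: pd.DataFrame, rating_cols: list) -> pd.DataFrame:
    """Most common and second most common value for each rating column."""
    rows = {}
    for col in rating_cols:
        # value_counts sorts by frequency (descending) and ignores NaN
        counts = df[col].value_counts()
        rows[col] = {
            "most_common": counts.index[0],
            "second_most_common": counts.index[1] if len(counts) > 1 else None,
        }
    return pd.DataFrame.from_dict(rows, orient="index")

# Toy data (made-up values)
toy_df = pd.DataFrame({
    "seat_comfort": [1.0, 1.0, 1.0, 4.0, 4.0, 3.0],
    "value_for_money": [1.0, 1.0, 1.0, 5.0, 5.0, np.nan],
})

top_two = top_two_ratings(toy_df, ["seat_comfort", "value_for_money"])
print(top_two)
```

On the notebook's data the column list would be the six service rating columns plus `overall`.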



#### Chart - 9 - Correlation plot

In [None]:
numerical_columns = ['overall', 'seat_comfort','cabin_service', 'food_bev', 'entertainment', 'ground_service', 'value_for_money', 'recommended']
numerical_df = airline_df[numerical_columns]

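# Calculate the correlation matrix
# numeric_only=True skips any column still stored as text (pandas >= 2.0 raises an error otherwise)
correlation_matrix = numerical_df.corr(numeric_only=True)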

# Plot a heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5)
plt.title('Correlation Matrix for Numerical Variables')
plt.show()
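
Pearson correlation is only defined for numeric columns, so a target stored as text has to be encoded before it can appear in the matrix. The sketch below assumes `recommended` holds the strings `'yes'` and `'no'` (an assumption, not confirmed by the notebook), encodes it as 1/0, and ranks features by their correlation with it. The helper name `correlation_with_target` is hypothetical and the data is made up:

```python
import pandas as pd

def correlation_with_target(df: pd.DataFrame, feature_cols: list,
                            target_col: str = "recommended") -> pd.Series:
    """Pearson correlation of each feature with a yes/no target, strongest first."""
    encoded = df[feature_cols].copy()
    # Assumption: the target contains the strings 'yes' / 'no'
    encoded[target_col] = df[target_col].map({"yes": 1, "no": 0})
    corr = encoded.corr()[target_col].drop(target_col)
    return corr.sort_values(ascending=False)

# Toy data (made-up values)
toy_df = pd.DataFrame({
    "overall": [1.0, 2.0, 9.0, 10.0],
    "entertainment": [3.0, 1.0, 2.0, 4.0],
    "recommended": ["no", "no", "yes", "yes"],
})

ranked = correlation_with_target(toy_df, ["overall", "entertainment"])
print(ranked)
```

Values that are neither `'yes'` nor `'no'` become `NaN` after the mapping and are ignored pairwise by `corr`, so it is worth checking the target's unique values first.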