<a href="https://colab.research.google.com/github/saketvaibhav7114/Classification_Airline_Passenger_Referral_Prediction/blob/main/ML_Classification_project_on_Airline_Passenger_Referral_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Airline Passenger Referral Prediction



##### **Project Type**    - Classification
##### **Contribution**    - Individual
##### **Team Member 1 -** Saket Vaibhav


# **Project Summary -**

Air transport has revolutionized global connectivity, earning its place as one of the most remarkable innovations of the twentieth century. Its hallmark attribute, speed, has made it an indispensable mode of transportation for both goods and people.

In the dynamic world of air travel, where passenger experience is a key driver of success, the ability to predict passenger referrals and recommendations has become a strategic imperative for airlines. Understanding which passengers are likely to endorse an airline to their friends and networks can be a game-changer in enhancing customer satisfaction and fueling revenue growth.

**The Data-Driven Approach:**

The journey towards predicting passenger referrals begins with robust data analysis and preprocessing. This includes loading and cleaning the dataset, conducting exploratory data analysis (EDA), and engineering relevant features. Target encoding and feature selection refine the dataset, setting the stage for model building.

**Model Selection and Hyperparameter Tuning:**

To develop an accurate prediction system, a diverse array of classification models is employed. These models include Logistic Regression, Decision Trees, Random Forests, Support Vector Machines (SVM), K-Nearest Neighbors, and Naïve Bayes. Hyperparameter tuning is performed to optimize performance, mitigate overfitting, and improve model reliability.
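
As an illustration of the tuning step, the following minimal sketch runs a grid search over the regularisation strength of a logistic regression, scored by recall. The synthetic data from `make_classification` and the grid values are assumptions chosen for illustration; they are not the project's actual data or settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the airline features (assumption: not the real dataset)
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical grid: C is the inverse regularisation strength
param_grid = {"C": [0.01, 0.1, 1, 10]}
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring="recall",  # recall is the priority metric in this project
    cv=5,
)
search.fit(X_train, y_train)

best_C = search.best_params_["C"]
test_recall = search.score(X_test, y_test)  # scored with the recall scorer
print(best_C, round(test_recall, 3))
```

Scoring the held-out split with the same metric used during the search gives a rough check on whether the tuned model overfits the training folds.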

**Evaluation Metrics:**

The core focus of the analysis is on classification metrics, with Recall as the highest priority. Accuracy and ROC AUC follow closely behind. These metrics gauge the models' ability to correctly identify passengers who recommend airlines, crucial for targeting customer engagement efforts effectively.
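
To make the three metrics concrete, here is a small sketch on made-up labels and scores (1 = recommended, 0 = not recommended). The values are illustrative assumptions, not results from this project.

```python
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score

# Toy ground truth and predictions (made-up values)
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]
# Predicted probability of class 1 (made-up values)
y_score = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.6, 0.1]

recall = recall_score(y_true, y_pred)      # TP / (TP + FN) = 3 / 4
accuracy = accuracy_score(y_true, y_pred)  # correct / total = 6 / 8
auc = roc_auc_score(y_true, y_score)       # 15 of 16 positive/negative pairs ranked correctly

print(recall, accuracy, auc)
```

Recall only looks at the passengers who actually recommend, which is why it is a natural priority when the goal is not to miss likely referrers.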

**Model Performance:**

Through meticulous analysis, it is determined that all models exceed the 90% accuracy threshold. Logistic Regression emerges as the standout performer, with SVM trailing closely as a comparably strong alternative. These models provide airlines with powerful tools to predict passenger recommendations accurately.

**Key Features:**

"Overall rating" and "value for money" are identified as pivotal factors in predicting passenger recommendations. Airlines can leverage this knowledge to enhance these aspects and elevate customer satisfaction.

**Business Implications:**

The developed classifier models are not just analytical tools; they represent opportunities for airlines to boost revenue. By predicting passenger referrals, airlines can identify and prioritize influential passengers, thus channeling resources and efforts where they will have the most impact.

**Recommendations for Airline Company:**

To thrive and expand, airlines are advised to focus on delivering exceptional cabin services, efficient ground handling, delightful food and beverages, and comfortable seating experiences. These elements are central to attracting and retaining loyal passengers who, in turn, contribute to an airline's growth and success.

In conclusion, the ability to predict passenger referrals is a game-changer for airlines in today's competitive landscape. By harnessing the power of data and machine learning, airlines can not only enhance customer satisfaction but also drive revenue growth, ensuring their continued success in the aviation industry.



# **GitHub Link -**

https://github.com/saketvaibhav7114/Classification_Airline_Passenger_Referral_Prediction

# **Problem Statement**


In the highly competitive and rapidly evolving airline industry, customer satisfaction and loyalty play a pivotal role in an airline's success. Airlines are constantly seeking innovative ways to enhance passenger experiences and boost their brand reputation. One significant factor in achieving these objectives is the ability to predict which passengers are likely to recommend the airline to their friends and networks.

The problem at hand revolves around the development of a predictive model that can accurately identify passengers who are inclined to provide referrals for the airline. This predictive model will serve as a strategic tool for airlines to:
> - Enhance Customer Satisfaction
> - Drive Revenue Growth
> - Optimize Marketing Efforts
> - Improve Service Quality
> - Gain a Competitive Edge

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import lightgbm
import warnings
warnings.filterwarnings('ignore')

# Importing all models from sklearn to be used in model building
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

from sklearn import model_selection
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC
import time
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.naive_bayes import MultinomialNB

# Importing  metrics for evaluation of models
from sklearn import metrics
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.metrics import accuracy_score,precision_score
from sklearn.metrics import recall_score,f1_score,roc_curve, roc_auc_score

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

In [None]:
#load the dataset from drive
airline_df = pd.read_excel("/content/drive/MyDrive/data_airline_reviews.xlsx")

### Dataset First View

In [None]:
# Dataset First Look
airline_df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
airline_df.shape

### Dataset Information

In [None]:
# Dataset Info
airline_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
airline_df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
missing_values=airline_df.isnull().sum()
missing_values

In [None]:
# Visualizing the missing values

columns_with_missing_values = missing_values[missing_values > 0]      #  Filter columns with missing values

# Calculate the percentage of missing values in each column
total_rows = len(airline_df)
percentage_missing = (columns_with_missing_values / total_rows) * 100

# Create a bar chart
plt.figure(figsize=(15, 6))
bar_plot = columns_with_missing_values.plot(kind='bar', color='lightcoral')
plt.xlabel('Columns',fontsize=14)
plt.ylabel('Number of Missing Values',fontsize=14)
plt.title('Number of Missing Values in Columns',fontsize=14)
plt.xticks(rotation=90, ha='center')
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Display the percentage of missing values on top of each bar
# Use .iloc for positional access, since the Series is indexed by column name
for index, value in enumerate(columns_with_missing_values):
    plt.text(index, value, f'{percentage_missing.iloc[index]:.2f}%', ha='center', va='bottom')

plt.show()

### What did you know about your dataset?

The dataset contains 131895 rows and 17 features. Every feature has some missing values, which need to be handled either by imputing them (for example with `fillna`) or by dropping the affected rows. There are also 70711 duplicate rows, which need to be dropped to obtain a clean, unique dataset. Most features are of type object or float and can be converted to a more suitable datatype where necessary. After this cleaning, the dataset will be ready for preprocessing, feature engineering, and model development.
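
The two checks behind this summary (duplicate rows and share of missing values per column) can be sketched on a toy frame. The frame `toy_df` and its values are made up for illustration and only stand in for `airline_df`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for airline_df (values are made up)
toy_df = pd.DataFrame({
    "airline": ["A", "A", "B", "B", None],
    "overall": [7.0, 7.0, np.nan, 3.0, np.nan],
    "recommended": ["yes", "yes", "no", "no", None],
})

# Rows that are exact copies of an earlier row
n_duplicates = int(toy_df.duplicated().sum())

# Percentage of missing values per column
missing_pct = toy_df.isnull().mean().mul(100).round(1)

print(n_duplicates)
print(missing_pct)
```

Looking at percentages rather than raw counts makes it easier to decide which columns to impute and which to drop.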

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
airline_df.columns

In [None]:
# Dataset Describe
airline_df.describe().T

### Variables Description

**airline:** Name of the airline.

**overall:** Overall rating given to the trip, from 1 to 10.

**author:** Author of the review

**review date:** Date of the review

**customer review:** Customer's review in free-text format

**aircraft:** Type of the aircraft

**traveller type:** Type of traveler (e.g. business, leisure)

**cabin:** Cabin class of the flight

**date flown:** Flight date

**seat comfort:** Rated between 1-5

**cabin service:** Rated between 1-5

**foodbev:** Rated between 1-5

**entertainment:** Rated between 1-5

**ground service:** Rated between 1-5

**value for money:** Rated between 1-5

**recommended**: Binary, target variable.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
airline_df.nunique()

In [None]:
#Checking the unique values of the target variable
airline_df.recommended.unique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
airline_df['aircraft'].unique()

In [None]:
# Write your code to make your dataset analysis ready.

# Drop the "aircraft" column, which has more than 80% missing values
airline_df = airline_df.drop(columns='aircraft')

# Drop the duplicate rows
airline_df.drop_duplicates(inplace = True)

# Convert the "review_date" & "date_flown" column from object to datetime data type
airline_df['review_date'] = pd.to_datetime(airline_df['review_date'])
airline_df['date_flown'] = pd.to_datetime(airline_df['date_flown'])

# Extract the year from the "date_flown" column and create a new column "year"
airline_df['year'] = airline_df['date_flown'].dt.year


In [None]:
# Check the changes made in dataset
airline_df.info()

In [None]:
airline_df.shape

### What all manipulations have you done and insights you found?

The column "aircraft" has more than 80% missing values and is therefore removed. After that, the duplicated rows are deleted. These two cleaning steps also remove some of the missing entries. The "review_date" and "date_flown" columns are stored as objects, so they are converted to the datetime datatype. A new "year" column is then created from the "date_flown" column.
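
The same cleaning steps can be written so that sparse columns are found by a threshold instead of being named by hand. This is a minimal sketch on a made-up frame; the `threshold` value and the toy data are assumptions, not the project's actual code.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for airline_df (values are made up)
toy_df = pd.DataFrame({
    "aircraft": [np.nan, np.nan, np.nan, np.nan, np.nan, "A320"],
    "date_flown": ["2019-05-01", "2019-05-01", "2018-11-01", None,
                   "2017-03-01", "2019-01-01"],
    "overall": [8, 8, 2, 5, 9, 4],
})

threshold = 0.8  # hypothetical cut-off: drop columns with > 80% missing values
sparse_cols = toy_df.columns[toy_df.isnull().mean() > threshold].tolist()

# Drop sparse columns first, then exact duplicate rows
clean_df = toy_df.drop(columns=sparse_cols).drop_duplicates()

# Convert to datetime (unparseable or missing values become NaT) and extract the year
clean_df["date_flown"] = pd.to_datetime(clean_df["date_flown"], errors="coerce")
clean_df["year"] = clean_df["date_flown"].dt.year

print(sparse_cols, len(clean_df))
```

Dropping the sparse column before deduplicating matters here: rows that differ only in a dropped column become identical and are then removed.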


## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 Check the nature of the dataset: balanced or imbalanced?

In [None]:
# Chart - 1 visualization code
plt.figure(figsize=(8, 6))
airline_df['recommended'].value_counts(normalize = True).plot(kind='bar', color= ['red','green'])
plt.xlabel('Recommended',fontsize=14)
plt.ylabel('Frequency',fontsize=14)
plt.title('Recommendation Frequency',fontsize=14)
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

**Ans:** The bar plot of normalized value counts allows a direct visual comparison of the share of "Yes" and "No" recommendations. With a single plot, we can easily compare the relative frequencies of the two categories.

##### 2. What is/are the insight(s) found from the chart?

**Ans:** The plot shows a split of roughly 52%:48% between passengers who do not recommend the airline (neutral/dissatisfied) and those who do (satisfied). The data is therefore quite balanced and does not require any special treatment or resampling.
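
The balance check can also be expressed numerically. The sketch below uses made-up labels with the roughly 52:48 split described above; the `imbalance_ratio` name is a hypothetical helper quantity, not something defined elsewhere in this notebook.

```python
import pandas as pd

# Made-up labels with a 52:48 split
labels = pd.Series(["no"] * 52 + ["yes"] * 48, name="recommended")

# Share of each class
shares = labels.value_counts(normalize=True)

# Majority share divided by minority share; 1.0 means perfectly balanced
imbalance_ratio = shares.max() / shares.min()

print(shares)
print(round(imbalance_ratio, 2))
```

A ratio close to 1 supports the decision to skip resampling.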

##### 3. Will the gained insights help creating a positive business impact? Are there any insights that lead to negative growth? Justify with specific reason.

**Ans:** The bar plot showing recommendation frequency can have both positive and potentially negative impacts on a business, depending on how they are interpreted and acted upon.

**Positive Business Impact**:

> **Strategic Decision-Making**: Understanding the proportion of customers who recommend the business's offerings can inform strategic decisions. It can guide marketing strategies, product development, and customer service improvements that align with customer preferences and satisfaction.

**Negative Business Impact**:

>  **Overlooking Negative Feedback**: Focusing solely on the positive recommendations may cause the business to overlook or dismiss critical negative feedback. Ignoring negative feedback can hinder efforts to address underlying issues and improve the customer experience.


#### Chart - 2 Check the distribution of traveller types



In [None]:
# Chart - 2 visualization code


traveller_type_counts = airline_df['traveller_type'].value_counts()

# Create a pie chart
plt.figure(figsize=(8, 8))
plt.pie(traveller_type_counts, labels=traveller_type_counts.index, autopct='%1.1f%%', startangle=140)
plt.title('Distribution of Traveller Types')
plt.axis('equal')

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

**Ans:** The "traveller_type" column contains categorical data, which means it consists of distinct categories or labels (e.g., "Business", "Leisure"). Pie charts are particularly useful for visualizing the distribution of categorical data.

##### 2. What is/are the insight(s) found from the chart?

**Ans:** The larger sectors in the pie chart represent the most frequently occurring traveler types. This helps identify the dominant traveler types in the dataset.

##### 3. Will the gained insights help creating a positive business impact? Are there any insights that lead to negative growth? Justify with specific reason.

**Ans:** A pie chart of traveler types alone may not directly lead to a positive or negative business impact: it provides foundational information about the distribution of customers but may not reveal the full picture of customer behavior or preferences. However, these insights can inform further analysis and decision-making, which can affect the business positively or negatively depending on how they are leveraged.

**Potential Positive Impacts:**

> **Customer Segmentation:** Understanding the distribution of traveler types can help in segmenting the customer base. This segmentation can lead to targeted marketing strategies and personalized services, which can enhance customer satisfaction and loyalty.

**Potential Negative Impacts:**

> **Neglect of Minority Traveler Types:** Focusing solely on the most common traveler types may lead to neglecting the needs and preferences of minority traveler groups. This could result in decreased satisfaction and loyalty among those customers.

#### Chart - 3 Recommendation based on Cabin Type

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(15, 6))
sns.countplot(data=airline_df, x='cabin', hue='recommended', palette='Set2')
plt.title('Distribution of Recommended by Cabin', fontsize=14)
plt.xlabel('Cabin', fontsize=14)
plt.ylabel('Count',fontsize=14)
# Let seaborn supply the legend labels so they always match the hue order
plt.legend(title='Recommended')
plt.show()

##### 1. Why did you pick the specific chart?

**Ans:** The countplot with the hue parameter is an effective choice when we want to compare the distribution of a binary variable (such as "recommended") within different categories (in this case, "cabin" types). It allows for clear visualization and comparison, which can lead to insights about customer preferences and recommendations across cabin types.

##### 2. What is/are the insight(s) found from the chart?

**Ans:** The chart highlights variations in the distribution of recommendations across different cabin types. For instance:
* In the "Economy" cabin, there are both more recommendations and more non-recommendations compared to other cabins.

* In contrast, "First" cabin passengers seem to have a higher rate of recommendations compared to non-recommendations.
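
Because cabins differ widely in the number of reviews, comparing rates rather than raw counts can make this insight easier to verify. A minimal sketch with `pd.crosstab` on made-up data (the toy values are assumptions, not figures from the dataset):

```python
import pandas as pd

# Toy data standing in for airline_df (values are made up)
toy_df = pd.DataFrame({
    "cabin": ["Economy Class"] * 6 + ["First Class"] * 4,
    "recommended": ["yes", "yes", "no", "no", "no", "no",
                    "yes", "yes", "yes", "no"],
})

# Row-normalized cross-tabulation: each row sums to 1
rate_by_cabin = pd.crosstab(
    toy_df["cabin"], toy_df["recommended"], normalize="index"
)

print(rate_by_cabin)
```

With `normalize="index"`, the "yes" column reads directly as the recommendation rate within each cabin.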

##### 3. Will the gained insights help creating a positive business impact? Are there any insights that lead to negative growth? Justify with specific reason.


**Ans:** The countplot of "recommended" values within each "cabin" category can potentially have both positive and negative implications for a business, depending on how the insights are leveraged and acted upon.

**Positive Business Impact:**

> Targeted Marketing: The insights can inform targeted marketing strategies. Airlines can focus their marketing efforts on promoting the features and benefits of cabin types that receive high recommendations, attracting more customers to those premium offerings.

**Negative Business Impact:**

> Missed Revenue Opportunities: Ignoring insights about low recommendation rates may result in missed revenue opportunities. By not addressing passenger concerns and improving services in underperforming cabins, airlines may lose potential revenue from dissatisfied customers.

#### Chart - 4 "Traveler Ratings: Value for Money Across Different Traveler Types"

In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(10, 6))
sns.barplot(x='traveller_type', y='value_for_money', data=airline_df)
plt.title('Value for Money by Traveler Type',fontsize=14)
plt.xlabel('Traveller Type',fontsize=14)
plt.ylabel('Value for Money',fontsize=14)
plt.show()

##### 1. Why did you pick the specific chart?

**Ans:** Bar plots are easy to understand. They display discrete categories on the x-axis and the numerical variable on the y-axis, making it straightforward for viewers to interpret the data.

##### 2. What is/are the insight(s) found from the chart?

**Ans:** We can see how different traveller types rate the "value_for_money" aspect of the airline service.

*   Solo travellers have given the highest rating for "Value For Money", while the other traveller types have given almost equal ratings.
*   This can help identify which traveller types, such as business travellers, leisure travellers, or others, find the service to be of better value.

##### 3. Will the gained insights help creating a positive business impact? Are there any insights that lead to negative growth? Justify with specific reason.

**Ans:**  Analyzing the chart of traveler types and their ratings for "value_for_money" can potentially lead to both positive and negative business impacts, depending on the specific findings and how they are acted upon.

**Positive Business Impacts**:

> **Service Improvement**: By identifying traveller types with lower ratings for "value_for_money," the airline can investigate why these travelers feel this way. This feedback can guide improvements in pricing, amenities, or services to enhance customer satisfaction.

**Negative Business Impacts**:

> **Customer Churn**: If certain traveller types consistently rate the airline's "value_for_money" poorly and these issues are not addressed, it could lead to customer churn. Unaddressed negative feedback can result in the loss of valuable customers.

#### Chart - 5 "Airline Seat Comfort Ratings: Top Airlines for Passenger Comfort"

In [None]:
# Chart - 5 visualization code

# Calculate the mean of seat comfort rating for each airline
mean_seat_comfort = airline_df.groupby('airline')['seat_comfort'].mean().reset_index()

# Sort the DataFrame by mean seat comfort ratings in descending order
mean_seat_comfort_sorted = mean_seat_comfort.sort_values(by='seat_comfort', ascending=False)

sns.set(style="whitegrid")
plt.figure(figsize=(14, 8))
sns.barplot(x='airline', y='seat_comfort', data=mean_seat_comfort_sorted, palette='viridis')

plt.xticks(rotation=90, fontsize=10)
plt.xlabel('Airline', fontsize=14)
plt.ylabel('Seat Comfort Rating', fontsize=14)
plt.title('Seat Comfort Rating by Airline', fontsize=14)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

**Ans:** Bar plots are effective for comparing the values of a single variable (in this case, "seat_comfort" ratings) across different categories (airlines). They allow you to easily see and compare how the ratings vary for each airline.

##### 2. What is/are the insight(s) found from the chart?

**Ans:** The seat comfort ratings of some airlines, such as "Air Canada", "Frontier Airlines", and "Spirit Airlines", are very poor compared to the average rating of the other airlines, while others, such as "Asiana Airlines", "EVA Air", "China Southern Airlines", and "Garuda Airlines", are rated the best.
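
A mean rating is only as reliable as the number of reviews behind it, so it can help to report the review count next to each mean. The sketch below does this on made-up data; the `min_reviews` cut-off is a hypothetical value for illustration.

```python
import pandas as pd

# Toy data standing in for airline_df (values are made up)
toy_df = pd.DataFrame({
    "airline": ["X", "X", "X", "Y", "Y", "Z"],
    "seat_comfort": [4, 5, 3, 2, 1, 5],
})

# Mean rating and number of reviews per airline, best-rated first
summary = (
    toy_df.groupby("airline")["seat_comfort"]
    .agg(mean_rating="mean", n_reviews="count")
    .sort_values("mean_rating", ascending=False)
)

min_reviews = 2  # hypothetical cut-off for a trustworthy mean
reliable = summary[summary["n_reviews"] >= min_reviews]

print(summary)
print(reliable)
```

In this toy example airline "Z" tops the ranking on a single review and is filtered out once the cut-off is applied.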

##### 3. Will the gained insights help creating a positive business impact? Are there any insights that lead to negative growth? Justify with specific reason.

**Ans:** Analyzing seat comfort ratings by airline can have both positive and potentially negative impacts on airlines' business strategies.

**Positive Business Impact**:

> **Pricing Strategies**: Airlines with exceptional seat comfort may have the opportunity to position themselves as premium carriers and charge higher ticket prices. Passengers may be willing to pay more for increased comfort, leading to higher revenue per passenger.

**Negative Business Impact**:

> **Customer Churn**: Airlines with consistently low seat comfort ratings may experience customer churn as passengers opt for competitors with better comfort offerings. This can lead to a loss of revenue and market share.

#### Chart - 6 "Average Cabin Service Ratings by Airline"

In [None]:
# Chart - 6 visualization code
# Calculate the mean cabin service rating for each airline
mean_cabin_service = airline_df.groupby('airline')['cabin_service'].mean().reset_index()

# Sort the DataFrame by mean cabin service ratings in descending order
mean_cabin_service_sorted = mean_cabin_service.sort_values(by='cabin_service', ascending=False)

sns.set(style="whitegrid")
plt.figure(figsize=(14, 8))
sns.barplot(x='airline', y='cabin_service', data=mean_cabin_service_sorted, palette='viridis')

plt.xticks(rotation=90, fontsize=10)
plt.xlabel('Airline', fontsize=14)
plt.ylabel('Cabin Service Rating', fontsize=14)
plt.title('Cabin Service Rating by Airline', fontsize=14)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

**Ans:** Bar charts are commonly used to display and compare data for different categories. They allow for easy comparison between different airlines' cabin service ratings, and the bars make it straightforward to see which airlines have higher or lower ratings.

##### 2. What is/are the insight(s) found from the chart?

**Ans:** The chart clearly shows that "Garuda Indonesia", "Nippon Airways", etc. have the highest cabin service ratings, while "Frontier Airlines" and "Spirit Airlines" have the lowest. This allows viewers to quickly identify the best- and worst-performing airlines in terms of cabin service.

##### 3. Will the gained insights help creating a positive business impact? Are there any insights that lead to negative growth? Justify with specific reason.

**Ans:** The chart can indeed help create a positive business impact for airlines. However, its insights can also lead to negative growth if not acted upon appropriately.

**Positive Business Impact:**

> **Competitive Advantage**: Airlines with higher cabin service ratings can leverage this information to promote their superior service in marketing and advertising campaigns. This can attract more passengers who value quality service, potentially leading to increased market share and revenue.

**Negative Growth Potential:**

> **Inaction**: One of the most significant potential negative impacts is inaction. If airlines do not address the issues highlighted by low cabin service ratings, they risk losing customers to competitors who offer better service. This can result in decreased revenue and market share.

#### Chart - 7 "Average Food and Beverage Ratings by Airline"

In [None]:
# Chart - 7 visualization code
# Calculate the mean food and beverage rating for each airline
mean_food_bev = airline_df.groupby('airline')['food_bev'].mean().reset_index()

# Sort the DataFrame by mean food and beverage ratings in descending order
mean_food_bev_sorted = mean_food_bev.sort_values(by='food_bev', ascending=False)

sns.set(style="whitegrid")
plt.figure(figsize=(15, 8))
sns.barplot(x='airline', y='food_bev', data=mean_food_bev_sorted, palette='viridis')

plt.xticks(rotation=90, fontsize=10)
plt.xlabel('Airline', fontsize=14)
plt.ylabel('Food Beverages Rating', fontsize=14)
plt.title('Food Beverages Rating by Airline', fontsize=14)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

**Ans:** A bar chart is an excellent choice when we want to compare a numeric variable (food and beverage ratings) across the categories of a categorical variable (airlines). It allows viewers to quickly discern differences in ratings between airlines.

##### 2. What is/are the insight(s) found from the chart?

**Ans:** The chart clearly shows that "Garuda Indonesia", "Nippon Airways", "Asiana Airlines", etc. have the highest food and beverage ratings, while "Frontier Airlines" and "Spirit Airlines" have the lowest. This allows viewers to quickly identify the best- and worst-performing airlines in terms of food and beverage services.

##### 3. Will the gained insights help creating a positive business impact? Are there any insights that lead to negative growth? Justify with specific reason.

**Ans:** Analyzing food and beverage ratings for airlines can indeed have a positive business impact. However, whether they lead to positive or negative growth depends on how the insights are interpreted and acted upon.

**Positive Business Impact:**

> **Revenue Growth**: Positive ratings can lead to higher revenue through increased ticket sales and potentially higher spending by passengers on in-flight dining options.

> **Brand Reputation**: High ratings contribute to a positive brand reputation, which can lead to brand loyalty and the attraction of new customers.

**Negative Business Impact:**

> **Reduced Revenue**: Low ratings can deter passengers from purchasing in-flight meals or snacks, resulting in reduced revenue from onboard sales.

> **Negative Publicity**: Negative feedback about food and beverages on social media or review platforms can harm an airline's image and result in negative publicity.

#### Chart - 8 "Recommendation Count per Airline"

In [None]:
# Chart - 8 visualization code

plt.figure(figsize=(13, 6))
sns.countplot(x='airline', hue='recommended', data=airline_df)
plt.xticks(rotation=90,fontsize=10)
plt.xlabel('Airline',fontsize=16)
plt.ylabel('Count',fontsize=16)
plt.title('Recommendation Count per Airline',fontsize=16)
plt.legend(title='Recommended')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

**Ans:** The countplot allows for a direct visual comparison of recommendation counts across multiple airlines. By using the hue parameter to differentiate between "Yes" and "No" recommendations, it's easy to assess the distribution of recommendations for each airline.

##### 2. What is/are the insight(s) found from the chart?

**Ans:** From the countplot that visualizes the recommendation counts for each airline, we can derive several insights and observations:


> **Most Recommended Airlines:** Qatar Airways, Singapore Airlines, China Southern Airlines, Garuda Indonesia and Qantas Airways have a higher count of "Yes" recommendations. These airlines are likely providing a positive experience to passengers, leading to more recommendations.

> **Least Recommended Airlines:** American Airlines, United Airlines, Spirit Airlines & Frontier Airlines have a higher count of "No" recommendations. These airlines may have areas for improvement in their services or customer satisfaction.

##### 3. Will the gained insights help creating a positive business impact? Are there any insights that lead to negative growth? Justify with specific reason.

**Ans:** The recommendation counts for each airline can indeed help create a positive business impact. However, the actual impact depends on how airlines respond to these insights.

**Positive Business Impact**:

> **Customer Loyalty**: Airlines with higher counts of "Yes" recommendations have the potential to build strong customer loyalty. This can lead to repeat business, positive word-of-mouth recommendations, and an increase in customer lifetime value.

> **Strategic Decision-Making**: Airlines can use these insights to make informed strategic decisions, such as investing in service enhancements, training staff, or upgrading amenities to meet passenger expectations.

**Negative Business Impact**:

> **Customer Churn**: Airlines with consistently high counts of "No" recommendations may experience customer churn. Passengers may choose competitors with better ratings, leading to a loss of revenue and market share.

> **Reputation Damage**: Persistently poor recommendation counts can harm an airline's reputation. Negative reviews and low recommendations can deter potential customers and erode trust in the brand.

#### Chart - 9 "Average Ratings of Services by Cabin Type"

In [None]:
# Chart - 9 visualization code

# Calculate the average ratings for 'seat_comfort','cabin_service','food_bev', 'entertainment','ground_service' by cabin
average_ratings = airline_df.groupby('cabin')[['seat_comfort','cabin_service','food_bev', 'entertainment','ground_service']].mean().reset_index()

# Plot a grouped bar chart, setting the figure size for this chart only
average_ratings.plot(x="cabin", y=['seat_comfort','cabin_service','food_bev', 'entertainment','ground_service'], kind="bar", figsize=(17,10), fontsize=12)
plt.xlabel("Cabin Type",fontsize=15)
plt.ylabel("Average Ratings by Cabin Type",fontsize=15)
plt.legend(["Seat Comfort", "Cabin Service", "Food & Beverage", "Entertainment", "Ground Service"])
plt.xticks(rotation=0)
plt.show()

##### 1. Why did you pick the specific chart?

**Ans:** A grouped bar chart is an effective way to compare and visualize the average ratings of different factors for each cabin type. Each factor is represented by a distinct bar, and the bars are grouped by cabin type. This clear differentiation makes it easy to identify and compare ratings for each factor within each cabin type.

##### 2. What is/are the insight(s) found from the chart?

**Ans:**

*   Business Class and First Class receive the highest average ratings across all service types.
*   Economy Class is the worst rated across all service types.


##### 3. Will the gained insights help creating a positive business impact? Are there any insights that lead to negative growth? Justify with specific reason.

**Ans:** The insights gained from the chart can potentially have a positive business impact for the airline. However, there may also be insights that could indicate areas of concern or potential negative growth.

**Positive Business Impact:**

> **Marketing and Pricing Strategies**: Understanding which cabin types receive higher ratings for specific categories enables the airline to target marketing efforts more effectively. They can promote the strengths of certain cabins and tailor pricing strategies to appeal to different customer preferences.

**Negative Growth or Concerns:**

> **Operational Challenges**: Insights into lower ratings for specific services or cabin types may signal operational challenges that need immediate attention. Failure to address these issues could result in negative growth as passengers seek better experiences elsewhere.

#### Chart - 10 Overall Rating by Passenger vs. Airline

In [None]:
# Chart - 10 visualization code

plt.figure(figsize=(14, 8))
sns.lineplot(x='airline', y='overall', data=airline_df, marker='o', markersize=6, color='black', markerfacecolor='red',linewidth=2)

plt.xlabel('Airline', fontsize=15)
plt.ylabel('Overall Rating by Passenger', fontsize=15)
plt.title('Overall Rating of Airline', fontsize=15)
plt.xticks(rotation=90, fontsize=10)

# Show the plot
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11 - Top 50 Most Traveled Routes

In [None]:
# Chart - 11 visualization code

# Count the occurrences of each unique route
route_counts = airline_df['route'].value_counts()

# Select the top 50 most traveled routes
top_50_routes = route_counts.head(50)

# Set the figure size
plt.figure(figsize=(13, 8))

# Create a bar plot for the top 50 most traveled routes
sns.barplot(x=top_50_routes.index, y=top_50_routes.values, palette='viridis')

plt.xlabel('Route',fontsize=15)
plt.ylabel('Count of Travel',fontsize=15)
plt.title('Top 50 Most Traveled Routes',fontsize=15)
plt.xticks(rotation=90,fontsize=9)

# Show the plot
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12 - Airlines vs. Frequency on Most Travelled Routes

In [None]:
# Chart - 12 visualization code

# Count the number of unique routes traveled by each airline
airline_route_counts = airline_df.groupby('airline')['route'].nunique().reset_index()

# Sort the airlines by the number of routes in descending order
airline_route_counts = airline_route_counts.sort_values(by='route', ascending=False)

# Set the figure size
plt.figure(figsize=(14, 6))

# Create a bar plot of the number of unique routes flown by each airline
sns.barplot(x='airline', y='route', data=airline_route_counts, palette='viridis')

plt.xlabel('Airline',fontsize=15)
plt.ylabel('Number of Unique Routes',fontsize=15)
plt.title('Airlines vs. Frequency on Most Travelled Routes',fontsize=15)
plt.xticks(rotation=90,fontsize=10)

# Show the plot (rotate the tick labels first so tight_layout can account for them)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13- Change in overall review over succeeding year for top 12 airlines

In [None]:
# Chart - 13 visualization code

# Calculate the average overall rating for each airline
average_overall_rating = airline_df.groupby('airline')['overall'].mean().reset_index()

# Sort by average overall rating and select the top 12 airlines
top_12_airlines = average_overall_rating.nlargest(12, 'overall')['airline']

# Filter the DataFrame to include only data for the top 12 airlines
filtered_df = airline_df[airline_df['airline'].isin(top_12_airlines)]

# Create a FacetGrid with subplots for each airline
g = sns.FacetGrid(filtered_df, col='airline', col_wrap=4, height=4)
g.map(sns.lineplot, 'year', 'overall', marker='o', lw=2)
g.set_axis_labels('Year', 'Overall Rating',fontsize=16)
g.set_titles(col_template='{col_name}')
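# Set the title at the top
plt.suptitle('Overall Rating Over Succeeding Years for Top 12 Airlines', fontsize=16)

# Apply tight_layout first, then reserve space for the title;
# calling tight_layout last would undo subplots_adjust and overlap the title
plt.tight_layout()
plt.subplots_adjust(top=0.9)
plt.show()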

##### 1. Why did you pick the specific chart?

**Ans:**  Line plots are well-suited for time-series data, which involves tracking data points over successive time periods. This makes it suitable for analyzing how overall rating changes over time.

##### 2. What is/are the insight(s) found from the chart?

**Ans:** The chart aids in competitive analysis by showing how each airline's overall rating compares to its peers.
> Airlines with consistently high ratings may have a competitive advantage, for example "China Southern Airlines" and "Garuda Airlines".

> Sudden drops or spikes in overall ratings may indicate shifts in customer sentiment, for example "Aegean Airlines".



##### 3. Will the gained insights help creating a positive business impact? Are there any insights that lead to negative growth? Justify with specific reason.

**Ans:** The line plots showing how the overall rating changes for the top 12 airlines over succeeding years can indeed help create a positive business impact for airlines and the travel industry. However, they can also potentially lead to negative growth if not acted upon effectively.

**Positive Business Impact**:

> **Identifying Improvement Areas**: Insights that highlight consistent upward trends in overall ratings can help airlines identify areas where they are excelling. They can leverage these strengths in marketing efforts to attract more customers who prioritize those aspects.


**Potential Negative Growth**:

> **Ignoring Negative Trends**: Failing to address consistent negative trends in overall ratings can lead to a decline in customer satisfaction and negative growth. If airlines do not respond to customer feedback and complaints, they risk losing customers to competitors.

> **Competitive Disadvantage**: Airlines with consistently low ratings may find it challenging to compete in the market. Negative feedback can deter potential customers, leading to decreased market share.

#### Chart - 14 - Correlation Heatmap

In [None]:
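# Correlation Heatmap visualization code
plt.figure(figsize=(12,10))

# Restrict the correlation to numeric columns; recent pandas versions raise
# an error when corr() encounters text columns such as 'airline' or 'route'
sns.heatmap(airline_df.corr(numeric_only=True), annot=True)
plt.show()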

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

columns = ['overall', 'seat_comfort', 'cabin_service', 'food_bev', 'entertainment', 'ground_service', 'value_for_money']

# Create a pairplot
sns.pairplot(airline_df[columns])
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Hypothesis 1: "Airline passengers who rate seat comfort higher are more likely to recommend the airline."

Hypothesis 2: "Reviews posted in recent years are more critical of airline services compared to reviews from earlier years."

Hypothesis 3: "Passengers who travel for business purposes rate cabin service higher than those traveling for leisure."

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis (H0):** There is no significant correlation between seat comfort ratings and the likelihood of recommending the airline.

**Alternative Hypothesis (H1):** There is a significant correlation between seat comfort ratings and the likelihood of recommending the airline.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy import stats

# Keep only rows where both "recommended" and "seat_comfort" are present,
# so each rating stays paired with its own recommendation
paired = airline_df[['recommended', 'seat_comfort']].dropna()

# Split seat comfort ratings by whether the passenger recommended the airline
comfort_yes = paired.loc[paired['recommended'] == 'yes', 'seat_comfort']
comfort_no = paired.loc[paired['recommended'] == 'no', 'seat_comfort']

# Picking out up to 500 random samples from each group to perform the t-test
sample_yes = comfort_yes.sample(min(500, len(comfort_yes)), random_state=42)
sample_no = comfort_no.sample(min(500, len(comfort_no)), random_state=42)

# Perform t-test comparing seat comfort between recommenders and non-recommenders
# (comparing seat comfort directly against the 0/1 recommendation flag would mix two different scales)
t_statistic, p_value = stats.ttest_ind(sample_yes, sample_no)

# Set the significance level (alpha)
alpha = 0.05

# Print results
print("t-statistic:", t_statistic)
print("p-value:", p_value)

# Compare p-value with alpha to make a decision
if p_value < alpha:
    print("Reject the null hypothesis. Seat comfort ratings differ significantly between passengers who recommend the airline and those who do not.")
else:
    print("Fail to reject the null hypothesis. There is no significant difference in seat comfort ratings between the two groups.")

##### Which statistical test have you done to obtain P-Value?

**Ans:** An independent two-sample t-test was performed to obtain the p-value.

##### Why did you choose the specific statistical test?

**Ans:** T-test is commonly used to compare the means of two samples or groups to assess whether the observed difference is statistically significant or if it could have occurred by chance.
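
Because the hypotheses above are phrased in terms of correlation, a point-biserial correlation is one alternative way to quantify the link between a yes/no recommendation and a numeric rating. The following is only a minimal sketch on hypothetical toy data (`toy_df` and its generated values are assumptions, not the real `airline_df`):

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical toy data standing in for airline_df (not the real dataset)
rng = np.random.default_rng(42)
toy_df = pd.DataFrame({
    "seat_comfort": rng.integers(1, 6, size=200).astype(float),
})

# Assumption for the demo: higher seat comfort makes a "yes" more likely
prob_yes = toy_df["seat_comfort"] / 6.0
toy_df["recommended"] = np.where(rng.random(200) < prob_yes, "yes", "no")

# Keep rows where both values are present so the pairs stay aligned
paired = toy_df[["recommended", "seat_comfort"]].dropna()
recommended_flag = (paired["recommended"] == "yes").astype(int)

# Point-biserial correlation between the binary flag and the rating
r_value, p_value = stats.pointbiserialr(recommended_flag, paired["seat_comfort"])

print("point-biserial r:", r_value)
print("p-value:", p_value)
```

Unlike the t-statistic, `r_value` lies between -1 and 1, so it also conveys the strength and direction of the association.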

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis (H0):** There is no significant difference in the average overall ratings of airline reviews posted in recent years compared to reviews from earlier years.

**Alternative Hypothesis (H1)**: Reviews posted in recent years have significantly lower average overall ratings compared to reviews from earlier years.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

# Define a threshold year to distinguish recent and earlier years
threshold_year = 2019

# Split the data into two groups: recent and earlier years
recent_years = airline_df[airline_df['year'] >= threshold_year]['overall']
earlier_years = airline_df[airline_df['year'] < threshold_year]['overall']

# Dropping the null value
recent_years=recent_years.dropna()
earlier_years=earlier_years.dropna()

# Picking 500 random samples
random_recent_years=recent_years.sample(500,random_state=42)
random_earlier_years=earlier_years.sample(500,random_state=42)

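# Perform a one-sided t-test: H1 states that recent ratings are lower than earlier ones
# (the `alternative` argument requires SciPy 1.6 or newer)
t_statistic, p_value = stats.ttest_ind(random_recent_years, random_earlier_years, alternative='less')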

# Set the significance level (alpha)
alpha = 0.05

# Print results
print("t-statistic:", t_statistic)
print("p-value:", p_value)

# Compare p-value with alpha to make a decision
if p_value < alpha:
    print("Reject the null hypothesis. Reviews from recent years are more critical of airline services compared to reviews from earlier years.")
else:
    print("Fail to reject the null hypothesis. There is no significant difference in review ratings between recent and earlier years.")


##### Which statistical test have you done to obtain P-Value?

**Ans:** An independent two-sample t-test was performed to obtain the p-value.

##### Why did you choose the specific statistical test?

**Ans:** T-test is commonly used to compare the means of two samples or groups to assess whether the observed difference is statistically significant or if it could have occurred by chance.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis (H0):** There is no significant difference in the average cabin service ratings between passengers who travel for business purposes and those who travel for leisure.

**Alternative Hypothesis (H1):** Passengers who travel for business purposes rate cabin service significantly higher than passengers who travel for leisure.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

# Convert 'traveller_type' column to lowercase for consistency
airline_df['traveller_type'] = airline_df['traveller_type'].str.lower()

# Define the two groups: business travelers and all other (leisure) travelers,
# excluding rows where the traveller type is missing
known_type = airline_df[airline_df['traveller_type'].notna()]
business_travelers = known_type[known_type['traveller_type'] == 'business']['cabin_service']
leisure_travelers = known_type[known_type['traveller_type'] != 'business']['cabin_service']

# Dropping the null values from both groups
business_travelers=business_travelers.dropna()
leisure_travelers=leisure_travelers.dropna()

# Picking 50 random samples from each group
sample_business_travelers=business_travelers.sample(50,random_state=42)
sample_leisure_travelers=leisure_travelers.sample(50,random_state=42)

# Perform a one-sided t-test: H1 states that business travelers rate cabin service higher
# (the `alternative` argument requires SciPy 1.6 or newer)
t_statistic, p_value = stats.ttest_ind(sample_business_travelers, sample_leisure_travelers, alternative='greater')

# Set the significance level (alpha)
alpha = 0.05

# Print results
print("t-statistic:", t_statistic)
print("p-value:", p_value)

# Compare p-value with alpha to make a decision
if p_value < alpha:
    print("Reject the null hypothesis. Business travelers rate cabin service higher than leisure travelers.")
else:
    print("Fail to reject the null hypothesis. There is no significant difference in cabin service ratings between business and leisure travelers.")


##### Which statistical test have you done to obtain P-Value?

**Ans:** An independent two-sample t-test was performed to obtain the p-value.

##### Why did you choose the specific statistical test?

**Ans:** T-test is commonly used to compare the means of two samples or groups to assess whether the observed difference is statistically significant or if it could have occurred by chance.
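
Since the ratings are ordinal 1 to 5 scores, a rank-based test such as Mann-Whitney U can serve as a cross-check on the t-test. This is a hedged sketch on randomly generated ratings; `business_ratings` and `leisure_ratings` are hypothetical stand-ins, not values from the dataset:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical 1-5 cabin service ratings for two groups (toy data)
business_ratings = rng.integers(1, 6, size=50)
leisure_ratings = rng.integers(1, 6, size=50)

# One-sided test: are business ratings stochastically greater than leisure ratings?
u_statistic, p_value = stats.mannwhitneyu(
    business_ratings, leisure_ratings, alternative="greater"
)

alpha = 0.05
print("U statistic:", u_statistic)
print("p-value:", p_value)
print("reject H0:", p_value < alpha)
```

If the two tests disagree noticeably, that is a hint that the normality assumption behind the t-test deserves a closer look.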

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer Here.
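
As an illustration only, one common pattern is to drop rows with a missing target and fill missing ratings with the column median. The sketch below uses a small hypothetical frame (`toy_df`); the column names mirror those used earlier in the notebook, but the values are invented:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for airline_df
toy_df = pd.DataFrame({
    "seat_comfort": [4.0, np.nan, 2.0, 5.0, 3.0],
    "cabin_service": [5.0, 3.0, np.nan, 4.0, 4.0],
    "recommended": ["yes", "no", None, "yes", "no"],
})

# Drop rows where the target is missing: an imputed label would only add noise
cleaned = toy_df.dropna(subset=["recommended"]).copy()

# Fill missing ratings with the column median, which is robust to skewed scores
rating_cols = ["seat_comfort", "cabin_service"]
cleaned[rating_cols] = cleaned[rating_cols].fillna(cleaned[rating_cols].median())

print(cleaned.isna().sum())
```

Whether the median, the mode, or a model-based imputer is appropriate depends on how much data is missing in each column.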

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.
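
A minimal sketch of two simple encodings, assuming a yes/no target and a nominal `cabin` column (the column name and values here are hypothetical). It shows label mapping and one-hot encoding, not the target encoding mentioned in the project summary:

```python
import pandas as pd

# Hypothetical toy frame; column names and values are assumptions
toy_df = pd.DataFrame({
    "cabin": ["Economy Class", "Business Class", "Economy Class", "First Class"],
    "recommended": ["no", "yes", "yes", "yes"],
})

# Binary target: map yes/no to 1/0
toy_df["recommended"] = toy_df["recommended"].map({"yes": 1, "no": 0})

# Nominal feature: one-hot encode, dropping the first level to avoid a redundant column
encoded = pd.get_dummies(toy_df, columns=["cabin"], drop_first=True, dtype=int)

print(encoded)
```

One-hot encoding suits low-cardinality columns; for columns with many levels, such as routes, a different scheme is usually preferable.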

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing Words Containing Digits

In [None]:
# Remove URLs & remove words containing digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### What all feature selection methods have you used  and why?

Answer Here.

##### Which all features you found important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
# Scaling your data

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

##### What data splitting ratio have you used and why?

Answer Here.
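
For reference, a stratified 80/20 split is a common starting point for classification because it keeps the class proportions the same in both sets. The sketch below runs on synthetic arrays; the 80/20 ratio and the 60/40 class mix are assumptions for illustration, not the notebook's actual choice:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic features and a 60/40 binary target (toy data)
X = rng.normal(size=(100, 3))
y = np.array([1] * 60 + [0] * 40)

# stratify=y preserves the 60/40 class ratio in both the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print("train shape:", X_train.shape, "test shape:", X_test.shape)
print("positive share in test:", y_test.mean())
```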

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalanced dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.
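
As a hedged illustration of the grid-search option, the sketch below tunes a Logistic Regression with `GridSearchCV`, scoring on recall because the project summary names recall as the top-priority metric. The data comes from `make_classification`, and the parameter grid is an arbitrary assumption:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification data (stand-in for the prepared features)
X, y = make_classification(n_samples=300, n_features=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Hypothetical grid over the inverse regularization strength C
param_grid = {"C": [0.01, 0.1, 1, 10]}

# 5-fold cross-validated grid search that selects the C with the best mean recall
search = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, scoring="recall", cv=5
)
search.fit(X_train, y_train)

# GridSearchCV refits the best model on the full training set by default
y_pred = search.predict(X_test)
test_recall = recall_score(y_test, y_pred)

print("best params:", search.best_params_)
print("test recall:", test_recall)
```

Grid search is exhaustive, so it stays practical only for small grids; randomized or Bayesian search scales better when many hyperparameters are involved.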

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ML model in a pickle file or joblib file format for the deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
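
The save and reload steps can be sketched with `joblib` as follows. The model, the synthetic data, and the file name `referral_model.joblib` are all hypothetical placeholders for whichever model is finally selected:

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in for the final trained model
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Save the fitted model to disk (a temporary directory is used for this demo)
model_path = os.path.join(tempfile.mkdtemp(), "referral_model.joblib")
joblib.dump(model, model_path)

# Load it back and confirm that the predictions match the original model
loaded_model = joblib.load(model_path)
predictions_match = bool((loaded_model.predict(X) == model.predict(X)).all())

print("saved to:", model_path)
print("predictions match:", predictions_match)
```

In a real sanity check, the reloaded model would be applied to held-out rows it has not seen during training rather than to the training data.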

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***