# **Project Name**    - IndiGo Airline Passenger Referral Prediction



##### **Project Type**    - Classification
##### **Contribution**    - Individual

# **Project Summary -**

Air travel has dramatically transformed the landscape of global connectivity, standing out as one of the most significant breakthroughs of the twentieth century. Its defining characteristic, speed, has rendered it an essential mode of transport for both individuals and goods.

In the ever-evolving realm of aviation, passenger experience plays a pivotal role in determining success. The capacity to forecast passenger endorsements and recommendations has emerged as a strategic necessity for airlines. Gaining insights into which passengers are inclined to recommend an airline to their social circles can be a transformative factor in boosting customer satisfaction and driving revenue growth.

Workflow:
- Data Collection
- Data Cleaning & Preprocessing: This includes handling missing values, outlier treatment, target encoding, and feature engineering.
- Exploratory Data Analysis (EDA): This includes visualizing the data using various graphs and plots.
- Splitting the data into training and testing sets
- Model Selection and Hyperparameter Tuning: To develop an accurate prediction system, several classification models are employed, including Logistic Regression, Random Forests, and Support Vector Machines (SVM). Hyperparameter tuning is performed to optimize performance and mitigate overfitting.
- Evaluation Metrics: The analysis focuses on classification metrics, with Recall as the highest priority, followed by Accuracy and ROC AUC. These metrics gauge the models' ability to correctly identify passengers who recommend airlines, which is crucial for targeting customer engagement efforts effectively.
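
The model selection and evaluation steps in the workflow can be sketched as follows. This is a minimal illustration on synthetic data generated with `make_classification`, not the project's actual pipeline; the parameter grid, the 80/20 split, and the choice of Logistic Regression as the tuned model are assumptions made for brevity.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the cleaned review features (assumption, not project data)
X, y = make_classification(n_samples=500, n_features=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Tune the regularisation strength, selecting the model by cross-validated recall
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},  # hypothetical grid
    scoring="recall",
    cv=5,
)
search.fit(X_train, y_train)

# Report the metrics in the priority order used in this project
y_pred = search.predict(X_test)
y_prob = search.predict_proba(X_test)[:, 1]
scores = {
    "recall": recall_score(y_test, y_pred),
    "accuracy": accuracy_score(y_test, y_pred),
    "roc_auc": roc_auc_score(y_test, y_prob),
}
print(search.best_params_, scores)
```

Setting `scoring="recall"` makes the grid search pick the hyperparameters that maximise recall rather than the default accuracy, which matches the stated metric priority.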

# **GitHub Link -**

https://github.com/tanveermohd/Indigo-Airline-Passenger-Referral-Prediction

# **Problem Statement**


In the fast-paced and fiercely competitive world of aviation, customer satisfaction and loyalty are key determinants of an airline’s success. Airlines are perpetually exploring novel methods to elevate passenger experiences and bolster their brand image. A crucial element in realizing these goals is the capability to foresee which passengers are prone to endorse the airline within their social circles.

The challenge we face involves the creation of a predictive model that can precisely pinpoint passengers who are likely to advocate for the airline. This predictive model will act as a strategic instrument for airlines to:

1. Enhance their customer service by focusing on passengers who are potential advocates.

2. Tailor their marketing and loyalty programs towards passengers who are more likely to recommend their services.

3. Improve their overall brand reputation by increasing the number of positive referrals.

4. Make informed business decisions based on the insights derived from the model.

5. Ultimately, drive revenue growth by converting satisfied passengers into brand advocates.

- Enhance Customer Satisfaction

- Drive Revenue Growth

- Optimize Marketing Efforts

- Improve Service Quality

- Gain a Competitive Edge

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')

# Importing all models from sklearn to be used in model building
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC
from sklearn import model_selection
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Importing  metrics for evaluation of models
from sklearn import metrics
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.metrics import accuracy_score,precision_score
from sklearn.metrics import recall_score,f1_score,roc_curve, roc_auc_score

### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
df = pd.read_csv('/content/drive/MyDrive/Indigo airline referral prediction/data_airline_reviews.xlsx - capstone_airline_reviews3.csv')

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
missing=df.isnull().sum()
missing

In [None]:
# Count the columns that contain missing values
columns_with_missing_values = missing[missing > 0]
len(columns_with_missing_values)


In [None]:
# Calculate the percentage of missing values in each column
total_rows = len(df)
percentage_missing = (columns_with_missing_values / total_rows) * 100

# Create a bar chart
plt.figure(figsize=(14, 6))
bar_plot = columns_with_missing_values.plot(kind='bar', color='coral')
plt.xlabel('Columns',fontsize=14)
plt.ylabel('Number of Missing Values',fontsize=14)
plt.title('Number of Missing Values in Columns',fontsize=14)
plt.xticks(rotation=90, ha='center',fontsize=10)
plt.yticks(fontsize=10)
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Display the percentage of missing values on top of each bar
for index, value in enumerate(columns_with_missing_values):
    # Use .iloc for positional access; the Series is indexed by column name
    plt.text(index, value, f'{percentage_missing.iloc[index]:.2f}%', ha='center', va='bottom',fontsize=10)

plt.show()

### What did you know about your dataset?

The dataset contains 131895 rows and 17 features. Every feature has some missing values, which need to be handled either with the fillna method or by dropping rows. There are also 70711 duplicate rows, which need to be dropped to obtain a clean, unique dataset for analysis. Most of the features are either objects or floats and will be converted to the required datatype where necessary. After this cleaning, the dataset will be ready for preprocessing, allowing the focus to shift to feature engineering and model development.
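
The cleaning described above can be tried on a small toy frame first. The values below are hypothetical, and the sketch assumes that some rows are entirely empty; it only illustrates how duplicates and missing values are counted and removed.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one duplicated row and one fully empty row
toy = pd.DataFrame({
    "airline": ["A", "A", "B", np.nan],
    "overall": [7.0, 7.0, np.nan, np.nan],
    "recommended": ["yes", "yes", "no", np.nan],
})

# Rows that are identical to an earlier row
n_duplicates = int(toy.duplicated().sum())

# Drop repeated rows, then rows in which every column is missing
cleaned = toy.drop_duplicates().dropna(how="all")

# Remaining gaps per column, to be handled with fillna or further drops
missing_after = cleaned.isnull().sum()
print(n_duplicates, cleaned.shape)
print(missing_after)
```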

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description

airline: Name of the airline.

overall: Overall rating given to the trip, between 1 and 10.

author: Author of the review

review date: Date of the review

customer review: Customer's review in free-text format

aircraft: Type of the aircraft

traveller type: Type of traveler (e.g. business, leisure)

cabin: Cabin at the flight

date flown: Flight date

seat comfort: Rated between 1-5

cabin service: Rated between 1-5

foodbev: Rated between 1-5

entertainment: Rated between 1-5

ground service: Rated between 1-5

value for money: Rated between 1-5

recommended: Binary, target variable.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique()

## ***3. Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Drop the 'aircraft' column, which has more than 80% missing values
df = df.drop(columns='aircraft')

# Drop the duplicate rows
df.drop_duplicates(inplace = True)

# Remove ordinal suffixes using regex
df['review_date'] = df['review_date'].str.replace(r'(\d+)(st|nd|rd|th)', r'\1', regex=True)

# Convert the "review_date" & "date_flown" column from object to datetime data type
df['review_date'] = pd.to_datetime(df['review_date'])
df['date_flown'] = pd.to_datetime(df['date_flown'],errors='coerce')

# Extract the year from the "date_flown" column and create a new column "year"
df['year'] = df['date_flown'].dt.year


In [None]:
df.info()

### What all manipulations have you done and insights you found?

The "aircraft" column has more than 80% missing values, so it is removed. The duplicated rows are then deleted. These two cleaning steps also reduce the number of missing entries. The "review_date" and "date_flown" columns were stored as objects and are therefore converted to the datetime datatype. A new "year" column has been created from the "date_flown" column.
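
The date handling can be checked in isolation on a few sample strings. The strings and the explicit `format` arguments below are assumptions about how the dates look; the notebook itself lets pandas infer the format.

```python
import pandas as pd

# Hypothetical sample strings in the two assumed formats
toy = pd.DataFrame({
    "review_date": ["8th May 2019", "21st March 2018", "2nd June 2017"],
    "date_flown": ["May 2019", "not a date", "June 2017"],
})

# Strip ordinal suffixes that directly follow a day number: "8th" -> "8"
toy["review_date"] = toy["review_date"].str.replace(
    r"(\d+)(st|nd|rd|th)", r"\1", regex=True
)
toy["review_date"] = pd.to_datetime(toy["review_date"], format="%d %B %Y")

# errors="coerce" turns unparseable entries into NaT instead of raising
toy["date_flown"] = pd.to_datetime(
    toy["date_flown"], format="%B %Y", errors="coerce"
)

# The year is NaN wherever date_flown could not be parsed
toy["year"] = toy["date_flown"].dt.year
print(toy)
```

Because the regex requires a digit before the suffix, month names such as "March" or "August" are left untouched.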

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 Check the dataset (balance or imbalance)

In [None]:
df['recommended'].value_counts()

In [None]:
count_percentage = df['recommended'].value_counts(normalize=True)*100
count_percentage

In [None]:
plt.figure(figsize=(8,4))

# Create a countplot of the 'recommended' data
sns.countplot(x=df['recommended'], palette='muted')

# Set the x-axis label
plt.xlabel('Recommended', fontsize=14)

# Set the y-axis label
plt.ylabel('Total counts', fontsize=14)

# Set the title of the plot
plt.title('Recommendation Status', fontsize=15)

plt.show()

##### 1. Why did you pick the specific chart?

The countplot allows for a direct visual comparison of the counts of "Yes" and "No" recommendations. By using a single plot, we can easily compare the frequencies of these two categories.

##### 2. What is/are the insight(s) found from the chart?

The dataset is fairly balanced: the distribution suggests that around 48% of customers were satisfied and recommended the airline, while around 52% were unhappy and did not recommend it to other people.

##### 3. Will the gained insights help creating a positive business impact?


The insights from the bar plot of recommendation frequency can have a positive or a negative impact on the business, depending on how they are interpreted and acted upon.

#### Chart - 2 Checking the distribution of traveller types

In [None]:
traveller_type_counts=df['traveller_type'].value_counts()
traveller_type_counts

In [None]:
# Create a pie chart
plt.figure(figsize=(4, 4))
# Create the pie chart with the explode effect and a shadow
labels = traveller_type_counts.index
sizes =traveller_type_counts.values
explode = [0.1]*len(sizes)
plt.pie(sizes, labels=labels, autopct='%1.1f%%', explode=explode, shadow=True ,startangle=140,)
plt.title('Distribution of Traveller Types')
plt.axis('equal')

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

The "traveller_type" column contains categorical data, which means it consists of distinct categories or labels (e.g., "Business", "Leisure"). Pie charts are particularly useful for visualizing the distribution of categorical data.

##### 2. What is/are the insight(s) found from the chart?

The largest sectors in the pie chart represent the most frequently occurring traveller types, which helps identify the dominant traveller types in the dataset.

"Solo Travellers" constitute 37.1% of all traveller types and hold the biggest share of the pie chart, followed by "Couple Travellers" at 25.8%.

"Business Travellers" have the smallest share in the traveller type distribution.

#### Chart - 3 distribution of Cabin type based on recommended or not

In [None]:
df['cabin'].value_counts()

In [None]:
plt.figure(figsize=(8, 5))
sns.countplot(data=df, x='cabin', hue='recommended', palette='Set2')
plt.title('Distribution of Recommended by Cabin', fontsize=14)
plt.xlabel('Cabin', fontsize=14)
plt.ylabel('Count',fontsize=14)
plt.legend(title='Recommended')  # keep seaborn's own hue labels so the legend cannot be mislabelled
plt.show()

##### 1. Why did you pick the specific chart?

The countplot with the hue parameter is an effective choice when we want to compare the distribution of a binary variable (such as "recommended") within different categories (in this case, "cabin" types). It allows for clear visualization and comparison, which can lead to insights about customer preferences and recommendations across cabin types.

##### 2. What is/are the insight(s) found from the chart?

The chart highlights variations in the distribution of recommendations across different cabin types. For instance-

In the "Economy" cabin, there are both more recommendations and more non-recommendations compared to other cabins.

In contrast, "Business class" and "First class" cabin passengers seem to have a higher rate of recommendations compared to non-recommendations.
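
Raw counts can be hard to compare across cabins of very different sizes, so a recommendation rate per cabin may make this insight easier to verify. A minimal sketch on hypothetical toy data follows; the cabin names and the "yes"/"no" values are assumptions about how the columns are encoded.

```python
import pandas as pd

# Hypothetical reviews; the real column values may differ
toy = pd.DataFrame({
    "cabin": ["Economy Class"] * 6 + ["Business Class"] * 4,
    "recommended": ["yes", "no", "no", "no", "yes", "no",
                    "yes", "yes", "yes", "no"],
})

# Row-normalised crosstab: share of "yes"/"no" within each cabin, in percent
rates = pd.crosstab(toy["cabin"], toy["recommended"], normalize="index") * 100
print(rates.round(1))
```

With `normalize="index"`, each row sums to 100, so cabins can be compared directly regardless of how many reviews each one has.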

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Targeted Marketing: The insights can inform targeted marketing strategies. Airlines can focus their marketing efforts on promoting the features and benefits of cabin types that receive high recommendations, attracting more customers to those premium offerings.

Negative Business Impact:

Missed Revenue Opportunities: Ignoring insights about low recommendation rates may result in missed revenue opportunities. By not addressing passenger concerns and improving services in underperforming cabins, airlines may lose potential revenue from dissatisfied customers.

#### Chart - 4 Value for Money Across Different Traveler Types

In [None]:

# Chart - 4 visualization code
plt.figure(figsize=(10, 6))

# Create a barplot with the 'Set2' color palette
sns.barplot(x='traveller_type', y='value_for_money', data=df, palette='Set2')

plt.title('Value for Money by Traveler Type',fontsize=14)
plt.xlabel('Traveller Type',fontsize=14)
plt.ylabel('Value for Money',fontsize=14)

plt.show()

##### 1. Why did you pick the specific chart?

Bar plots are easy to understand. They display discrete categories on the x-axis and the numerical variable on the y-axis, making it straightforward for viewers to interpret the data.






##### 2. What is/are the insight(s) found from the chart?

We can see how different traveller types rate the "value_for_money" aspect of the airline service.

Solo travellers have given the highest rating for "Value For Money", while the remaining traveller types have given almost equal ratings.
This can help identify which traveller types, such as business or leisure travellers, find the service to be of better value.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impacts:

Service Improvement: By identifying traveller types with lower ratings for "value_for_money," the airline can investigate why these travelers feel this way. This feedback can guide improvements in pricing, amenities, or services to enhance customer satisfaction.

Negative Business Impacts:

Customer Churn: If certain traveller types consistently rate the airline's "value_for_money" poorly and these issues are not

#### Chart - 5 Airline Seat Comfort Ratings: Top Airlines for Passenger Comfort

In [None]:
df['airline'].value_counts()

In [None]:
df['airline'].nunique()

In [None]:
# Calculate the average of seat comfort rating for each airline
avg_seat_comfort = df.groupby('airline')['seat_comfort'].mean().reset_index()

# Sort the DataFrame by average seat comfort ratings in descending order
avg_seat_comfort_sorted = avg_seat_comfort.sort_values(by='seat_comfort', ascending=False)
print(avg_seat_comfort_sorted)

In [None]:
# Chart - 5 visualization code
sns.set(style="whitegrid")
plt.figure(figsize=(12, 8))

# Create a barplot
sns.barplot(x='airline', y='seat_comfort', data=avg_seat_comfort_sorted, palette='viridis')

plt.xticks(rotation=90, fontsize=10)
plt.xlabel('Airline', fontsize=10)
plt.ylabel('Seat Comfort Rating', fontsize=14)
plt.title('Seat Comfort Rating by Airline', fontsize=14)
plt.tight_layout()

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

Bar plots are effective for comparing the values of a single variable (in this case, "seat_comfort" ratings) across different categories (airlines). They allow you to easily see and compare how the ratings vary for each airline.

##### 2. What is/are the insight(s) found from the chart?

The seat comfort rating of some airlines, such as "Air Canada", "Frontier Airlines", and "Spirit Airlines", is very poor compared to the average rating of all other airlines, while others, such as "Asiana Airline", "EVA Air", "China Southern Airlines", and "Garuda Airlines", are rated the best.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

Pricing Strategies: Airlines with exceptional seat comfort may have the opportunity to position themselves as premium carriers and charge higher ticket prices. Passengers may be willing to pay more for increased comfort, leading to higher revenue per passenger.

Negative Business Impact:

Customer Churn: Airlines with consistently low seat comfort ratings may experience customer churn as passengers opt for competitors with better comfort offerings. This can lead to a loss of revenue and market share.

#### Chart - 6 Average Cabin Service Ratings by Airline

In [None]:
# Calculate the average cabin service rating for each airline
avg_cabin_service = df.groupby('airline')['cabin_service'].mean().reset_index()

# Sort the DataFrame by average cabin service ratings in descending order
avg_cabin_service_sorted = avg_cabin_service.sort_values(by='cabin_service', ascending=False)
print(avg_cabin_service_sorted)

In [None]:
# Chart - 6 visualization code
sns.set(style="whitegrid")
plt.figure(figsize=(14, 8))
sns.barplot(x='airline', y='cabin_service', data=avg_cabin_service_sorted, palette='viridis')

plt.xticks(rotation=90, fontsize=10)
plt.xlabel('Airline', fontsize=14)
plt.ylabel('Cabin Service Rating', fontsize=14)
plt.title('Cabin Service Rating by Airline', fontsize=14)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Bar charts are commonly used to display and compare data for different categories. It allows for easy comparison between different airlines' cabin service ratings. The bars make it straightforward to see which airlines have higher or lower ratings.

##### 2. What is/are the insight(s) found from the chart?

The chart clearly shows that "Garuda Indonesia", "Nippon Airways", etc. have the highest cabin service ratings, while "Frontier Airlines" and "Spirit Airlines" have the lowest. This allows viewers to quickly identify the best- and worst-performing airlines in terms of cabin service.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

Competitive Advantage: Airlines with higher cabin service ratings can leverage this information to promote their superior service in marketing and advertising campaigns. This can attract more passengers who value quality service, potentially leading to increased market share and revenue.

Negative Growth Potential:

Inaction: One of the most significant potential negative impacts is inaction. If airlines do not address the issues highlighted by low cabin service ratings, they risk losing customers to competitors who offer better service. This can result in decreased revenue and market share.

#### Chart - 7 Average Food and Beverage Ratings by Airline

In [None]:
# Calculate the average food and beverage rating for each airline
avg_food_bev = df.groupby('airline')['food_bev'].mean().reset_index()

# Sort the DataFrame by average food and beverage ratings in descending order
avg_food_bev_sorted = avg_food_bev.sort_values(by='food_bev', ascending=False)

In [None]:
# Chart - 7 visualization code
sns.set(style="whitegrid")
plt.figure(figsize=(12, 8))
sns.barplot(x='airline', y='food_bev', data=avg_food_bev_sorted, palette='viridis')

plt.xticks(rotation=90, fontsize=10)
plt.xlabel('Airline', fontsize=12)
plt.ylabel('Food Beverages Rating', fontsize=14)
plt.title('Food Beverages Rating by Airline', fontsize=14)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is an excellent choice when we want to compare the values of a categorical variable (airlines) with respect to a continuous variable (food and beverage ratings). It allows viewers to quickly discern differences in ratings between airlines.

##### 2. What is/are the insight(s) found from the chart?

The chart clearly shows that "Garuda Indonesia", "Nippon Airways", "Asiana Airlines", etc. have the highest food and beverage ratings, while "Frontier Airlines" and "Spirit Airlines" have the lowest. This allows viewers to quickly identify the best- and worst-performing airlines in terms of food and beverage services.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

Revenue Growth: Positive ratings can lead to higher revenue through increased ticket sales and potentially higher spending by passengers on in-flight dining options.

Brand Reputation: High ratings contribute to a positive brand reputation, which can lead to brand loyalty and the attraction of new customers.

Negative Business Impact:

Reduced Revenue: Low ratings can deter passengers from purchasing in-flight meals or snacks, resulting in reduced revenue from onboard sales.

Negative Publicity: Negative feedback about food and beverages on social media or review platforms can harm an airline's image and result in negative publicity.

#### Chart - 8 Recommendation Count per Airline

In [None]:
# Calculate the recommendation count per airline
recommendation_counts = df.groupby(['airline', 'recommended']).size().unstack(fill_value=0)

# Reset the index
recommendation_counts.reset_index(inplace=True)

# Rename the columns
recommendation_counts.columns = ['Airline', 'No', 'Yes']

print(recommendation_counts)

In [None]:
# Chart - 8 visualization code
plt.figure(figsize=(12, 6))
sns.countplot(x='airline', hue='recommended', data=df,palette='Set2')
plt.xticks(rotation=90,fontsize=10)
plt.xlabel('Airline',fontsize=12)
plt.ylabel('Count',fontsize=16)
plt.title('Recommendation Count per Airline',fontsize=16)
plt.legend(title='Recommended')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

The countplot allows for a direct visual comparison of recommendation counts across multiple airlines. By using the hue parameter to differentiate between "Yes" and "No" recommendations, it's easy to assess the distribution of recommendations for each airline.

##### 2. What is/are the insight(s) found from the chart?

Qatar Airlines, Singapore Airlines, China Southern Airlines, Garuda Airlines, and Qantas Airlines have a higher count of "Yes" recommendations. These airlines are likely providing a positive experience to passengers, leading to more recommendations. In contrast, American Airlines, United Airlines, Spirit Airlines, and Frontier Airlines have a higher count of "No" recommendations. These airlines may have areas for improvement in their services or customer satisfaction.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

Customer Loyalty: Airlines with higher counts of "Yes" recommendations have the potential to build strong customer loyalty. This can lead to repeat business, positive word-of-mouth recommendations, and an increase in customer lifetime value.

Strategic Decision-Making: Airlines can use these insights to make informed strategic decisions, such as investing in service enhancements, training staff, or upgrading amenities to meet passenger expectations.

Negative Business Impact:

Customer Churn: Airlines with consistently high counts of "No" recommendations may experience customer churn. Passengers may choose competitors with better ratings, leading to a loss of revenue and market share.

Reputation Damage: Persistently poor recommendation counts can harm an airline's reputation. Negative reviews and low recommendations can deter potential customers and erode trust in the brand.

#### Chart - 9 Average Ratings of Services by Cabin Type

In [None]:
# Chart - 9 visualization code

# Calculate the average ratings for 'seat_comfort','cabin_service','food_bev', 'entertainment','ground_service' by cabin
average_ratings = df.groupby('cabin')[['seat_comfort','cabin_service','food_bev', 'entertainment','ground_service']].mean().reset_index()

# Set the figure size
plt.rcParams['figure.figsize']=(8,6)

# Define the color list for the bars
colors = ['b', 'g', 'r', 'c', 'm']

# Plot the data
average_ratings.plot(x="cabin", y=['seat_comfort','cabin_service','food_bev', 'entertainment','ground_service'], kind="bar", color=colors, fontsize=12)

# Set labels and title
plt.xlabel("Cabin Type",fontsize=15)
plt.ylabel("Average Ratings by Cabin Type",fontsize=15)
plt.title("Average Ratings for Different Cabin Types", fontsize=18)

# Set legend
plt.legend(["Seat Comfort", "Cabin Service", "Food & Beverage", "Entertainment", "Ground Service"], fontsize=12)

# Rotate x-axis labels
plt.xticks(rotation=0)

# Show the plot
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A grouped bar chart is an effective way to compare and visualize the average ratings of different factors for each cabin type. Each factor is represented by a distinct bar, and the bars are grouped by cabin type. This clear differentiation makes it easy to identify and compare ratings for each factor within each cabin type.

##### 2. What is/are the insight(s) found from the chart?

Business Class and First Class cabins have the best average ratings across all service types.
Economy Class is the worst rated across all service types.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

Marketing and Pricing Strategies: Understanding which cabin types receive higher ratings for specific categories enables the airline to target marketing efforts more effectively. They can promote the strengths of certain cabins and tailor pricing strategies to appeal to different customer preferences.

Negative Growth or Concerns:

Operational Challenges: Insights into lower ratings for specific services or cabin types may signal operational challenges that need immediate attention. Failure to address these issues could result in negative growth as passengers seek better experiences elsewhere.

#### Chart - 10 Overall Rating by Passenger vs. Airline

In [None]:
# Chart - 10 visualization code
plt.figure(figsize=(12, 6))
sns.set_style("whitegrid")  # Set style to whitegrid for better readability
sns.lineplot(x='airline', y='overall', data=df, marker='o', markersize=5, color='darkblue', markerfacecolor='red',linewidth=3)

plt.xlabel('Airline', fontsize=15)
plt.ylabel('Overall Rating by Passenger', fontsize=14)
plt.title('Overall Rating of Airline', fontsize=12)
plt.xticks(rotation=90, fontsize=10)

# Show the plot
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Line plots connect data points with lines, which makes fluctuations from one point to the next easy to spot. Here the line connects the overall rating of each airline, so the highs and lows across airlines stand out.

##### 2. What is/are the insight(s) found from the chart?

We can clearly observe that "Garuda Airlines","Asiana Airline" & "EVA Air" have the highest overall rating while "Frontier Airlines", "Spirit Airlines","American Airlines" & "Delta Airlines" are worst rated airlines.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

Identification of Top-Rated Airlines: Airlines with higher ratings can use this information to market themselves as customer favorites and attract more passengers. They can leverage their positive reputation to gain a competitive advantage in the market, which can lead to increased market share and revenue growth.

Potential Negative Growth:

Customer Attrition: Passengers dissatisfied with low-rated airlines may choose alternative modes of transportation or opt for competitors, resulting in customer attrition and revenue loss. These airlines may also struggle with operational challenges, including increased customer complaints, regulatory scrutiny, and employee morale issues.

#### Chart - 11 Change in overall review over succeeding year for top 12 airlines

In [None]:
# Chart - 11 visualization code
# Calculate the average overall rating for each airline
average_overall_rating = df.groupby('airline')['overall'].mean().reset_index()

# Sort by average overall rating and select the top 12 airlines
top_12_airlines = average_overall_rating.nlargest(12, 'overall')['airline']

# Filter the DataFrame to include only data for the top 12 airlines
filtered_df = df[df['airline'].isin(top_12_airlines)]

# Create a FacetGrid with subplots for each airline
g = sns.FacetGrid(filtered_df, col='airline', col_wrap=4, height=4, aspect=0.7, hue='airline', palette='Set1')
g.map(sns.lineplot, 'year', 'overall', marker='o', lw=2)

# Set axis labels and titles
g.set_axis_labels('Year', 'Overall Rating', fontsize=16)
g.set_titles(col_template='{col_name}')

# Add a title at the top, then leave room for it above the subplots
plt.suptitle('Overall Rating Over Succeeding Years for Top 12 Airlines', fontsize=16)
plt.tight_layout()
plt.subplots_adjust(top=0.9)

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

Line plots are well-suited for time-series data, which involves tracking data points over successive time periods. This makes it suitable for analyzing how overall rating changes over time.

##### 2. What is/are the insight(s) found from the chart?

The chart aids in competitive analysis by showing how each airline's overall rating compares to its peers.

Airlines with consistently high ratings may have a competitive advantage. For example "China Southern Airlines" & "Garuda Airlines".

Sudden drops or spikes in overall ratings may indicate shifts in customer sentiment, for example for "Aegean Airlines".

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

Identifying Improvement Areas: Insights that highlight consistent upward trends in overall ratings can help airlines identify areas where they are excelling. They can leverage these strengths in marketing efforts to attract more customers who prioritize those aspects.

Potential Negative Growth:

Ignoring Negative Trends: Failing to address consistent negative trends in overall ratings can lead to a decline in customer satisfaction and negative growth. If airlines do not respond to customer feedback and complaints, they risk losing customers to competitors.

Competitive Disadvantage: Airlines with consistently low ratings may find it challenging to compete in the market. Negative feedback can deter potential customers, leading to decreased market share.

#### Chart - 12 Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
df_numeric = df.select_dtypes(include=['number'])
plt.figure(figsize=(10,6))
sns.heatmap(df_numeric.corr(), annot=True,cmap='coolwarm')
plt.show()

##### 1. Why did you pick the specific chart?

Heatmaps are particularly effective for visualizing correlation between variables.

##### 2. What is/are the insight(s) found from the chart?

From the heatmap it is clearly visible that all the independent variables are strongly correlated with each other. Hence, during further data preprocessing we need to take care of multicollinearity.
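To make the heatmap reading concrete, one option is to list the feature pairs whose absolute correlation exceeds a cutoff. The sketch below is a minimal illustration on synthetic data, not the project DataFrame; the helper name `high_corr_pairs`, the column values and the 0.7 cutoff are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(frame, threshold=0.7):
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Synthetic stand-in for the rating columns (hypothetical values)
rng = np.random.default_rng(0)
base = rng.normal(size=500)
demo = pd.DataFrame({
    'seat_comfort': base + rng.normal(scale=0.3, size=500),
    'cabin_service': base + rng.normal(scale=0.3, size=500),
    'unrelated': rng.normal(size=500),
})

pairs = high_corr_pairs(demo)
print(pairs)
```

On the real data, the same helper could be applied to `df_numeric` to see which pairs drive the multicollinearity concern.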

#### Chart - 13 Pair Plot

In [None]:
# Pair Plot visualization code
columns = ['overall', 'seat_comfort', 'cabin_service', 'food_bev', 'entertainment', 'ground_service', 'value_for_money']
# Create a pairplot (seaborn creates its own figure, so no plt.figure call is needed)
sns.pairplot(df[columns])
plt.show()


##### 1. Why did you pick the specific chart?

Pairplots allow us to visualize multivariate relationships in a dataset. They help us identify patterns, trends, and relationships between variables.

##### 2. What is/are the insight(s) found from the chart?

Since all the variables are discrete in nature, it is not possible to reach any conclusion without further data analysis.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

- 1 Airline passengers who rate seat comfort higher are more likely to recommend the airline.
- 2 Reviews posted in recent years are more critical of airline services compared to reviews from earlier years.
- 3 Passengers who travel for business purposes rate cabin service higher than those traveling for leisure.

### Hypothetical Statement - 1  Airline passengers who rate seat comfort higher are more likely to recommend the airline.

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H0): There is no significant correlation between seat comfort ratings and the likelihood of recommending the airline.

Alternative Hypothesis (H1): There is a significant correlation between seat comfort ratings and the likelihood of recommending the airline.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# Keep only the rows where both "recommended" and "seat_comfort" are present,
# so every rating stays paired with its own recommendation
paired = df[['recommended', 'seat_comfort']].dropna()

# Split the seat comfort ratings by recommendation outcome
seat_comfort_yes = paired.loc[paired['recommended'] == 'yes', 'seat_comfort']
seat_comfort_no = paired.loc[paired['recommended'] == 'no', 'seat_comfort']

# Picking out 100 Random Samples from each group to perform t-test
sample_yes = seat_comfort_yes.sample(100, random_state=42)
sample_no = seat_comfort_no.sample(100, random_state=42)

# Perform t-test on seat comfort: recommenders vs. non-recommenders
t_statistic, p_value = stats.ttest_ind(sample_yes, sample_no)

# Set the significance level (alpha)
alpha = 0.05

# Print results
print("t-statistic:", t_statistic)
print("p-value:", p_value)

# Compare p-value with alpha to make a decision
if p_value < alpha:
    print("Reject the null hypothesis. There is a significant difference in seat comfort ratings.")
else:
    print("Fail to reject the null hypothesis. There is no significant difference in seat comfort ratings.")


##### Which statistical test have you done to obtain P-Value?

A two-sample t-test is performed to obtain the p-value.

##### Why did you choose the specific statistical test?

T-test is commonly used to compare the means of two samples or groups to assess whether the observed difference is statistically significant or if it could have occurred by chance.
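Because Hypothesis 1 is phrased as a correlation between a binary outcome and a numeric rating, a point-biserial correlation is one way to test it directly. The sketch below uses synthetic data rather than the project DataFrame; the variable names and the simulated effect size are assumptions made for illustration only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic data: recommenders (1) tend to give higher seat comfort ratings
recommended = rng.integers(0, 2, size=300)
seat_comfort = np.clip(
    np.round(2.5 + 1.5 * recommended + rng.normal(scale=1.0, size=300)), 1, 5
)

# Point-biserial correlation between a binary and a numeric variable
r, p_value = stats.pointbiserialr(recommended, seat_comfort)

# The same p-value comes from a pooled-variance two-sample t-test on the two groups
t_stat, p_ttest = stats.ttest_ind(
    seat_comfort[recommended == 1], seat_comfort[recommended == 0]
)

print(f"r = {r:.3f}, p = {p_value:.3g}, t = {t_stat:.3f}")
```

The correlation coefficient adds an effect size to the decision, which the t-statistic alone does not convey as directly.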

### Hypothetical Statement - 2 Reviews posted in recent years are more critical of airline services compared to reviews from earlier years.

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H0): There is no significant difference in the average overall ratings of airline reviews posted in recent years compared to reviews from earlier years.

Alternative Hypothesis (H1): Reviews posted in recent years have significantly lower average overall ratings compared to reviews from earlier years.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# Define a threshold year to distinguish recent and earlier years
threshold_year = 2019

# Split the data into two groups: recent and earlier years
recent_years = df[df['year'] >= threshold_year]['overall']
earlier_years = df[df['year'] < threshold_year]['overall']

# Dropping the null value
recent_years=recent_years.dropna()
earlier_years=earlier_years.dropna()

# Picking 100 random samples to perform t-test
random_recent_years=recent_years.sample(100,random_state=42)
random_earlier_years=earlier_years.sample(100,random_state=42)

# Perform a one-sided t-test (requires SciPy >= 1.6): H1 states that recent ratings are lower
t_statistic, p_value = stats.ttest_ind(random_recent_years, random_earlier_years, alternative='less')

# Set the significance level (alpha)
alpha = 0.05

# Print results
print("t-statistic:", t_statistic)
print("p-value:", p_value)

# Compare p-value with alpha to make a decision
if p_value < alpha:
    print("Reject the null hypothesis. Reviews from recent years are more critical of airline services compared to reviews from earlier years.")
else:
    print("Fail to reject the null hypothesis. There is no significant difference in review ratings between recent and earlier years.")


##### Which statistical test have you done to obtain P-Value?

A two-sample t-test is performed to obtain the p-value.

##### Why did you choose the specific statistical test?

T-test is commonly used to compare the means of two samples or groups to assess whether the observed difference is statistically significant or if it could have occurred by chance.

### Hypothetical Statement - 3 Passengers who travel for business purposes rate cabin service higher than those traveling for leisure.

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H0): There is no significant difference in the average cabin service ratings between passengers who travel for business purposes and those who travel for leisure.

Alternative Hypothesis (H1): Passengers who travel for business purposes rate cabin service significantly higher than passengers who travel for leisure.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# Convert 'traveller_type' column to lowercase for consistency
df['traveller_type'] = df['traveller_type'].str.lower()

# Define the two groups: business travelers and leisure travelers
# (rows with a missing traveller type are excluded from both groups)
business_travelers = df[df['traveller_type'] == 'business']['cabin_service']
leisure_travelers = df[df['traveller_type'].notna() & (df['traveller_type'] != 'business')]['cabin_service']

# Dropping the null values from both groups
business_travelers=business_travelers.dropna()
leisure_travelers=leisure_travelers.dropna()

# Picking 100 random samples to perform t-test
sample_business_travelers=business_travelers.sample(100,random_state=42)
sample_leisure_travelers=leisure_travelers.sample(100,random_state=42)

# Perform a one-sided t-test (requires SciPy >= 1.6): H1 states that business travelers rate higher
t_statistic, p_value = stats.ttest_ind(sample_business_travelers, sample_leisure_travelers, alternative='greater')

# Set the significance level (alpha)
alpha = 0.05

# Print results
print("t-statistic:", t_statistic)
print("p-value:", p_value)

# Compare p-value with alpha to make a decision
if p_value < alpha:
    print("Reject the null hypothesis. Business travelers rate cabin service higher than leisure travelers.")
else:
    print("Fail to reject the null hypothesis. There is no significant difference in cabin service ratings between business and leisure travelers.")


##### Which statistical test have you done to obtain P-Value?

A two-sample t-test is performed to obtain the p-value.

##### Why did you choose the specific statistical test?

T-test is commonly used to compare the means of two samples or groups to assess whether the observed difference is statistically significant or if it could have occurred by chance.
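Since the ratings are ordinal scores on a 1 to 5 scale, a rank-based test is a possible alternative that does not rely on normality. The following is a minimal sketch using the Mann-Whitney U test on synthetic ratings; the group distributions are assumptions made up for the example and say nothing about the real data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
ratings = np.arange(1, 6)

# Synthetic 1-5 ratings; the "business" group is skewed towards higher scores
business = rng.choice(ratings, size=100, p=[0.05, 0.10, 0.20, 0.30, 0.35])
leisure = rng.choice(ratings, size=100, p=[0.25, 0.25, 0.20, 0.20, 0.10])

# One-sided test: do business ratings tend to be greater than leisure ratings?
u_stat, p_value = stats.mannwhitneyu(business, leisure, alternative='greater')

print("U statistic:", u_stat)
print("p-value:", p_value)
```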

## ***6. Feature Engineering & Data Pre-processing***

In [None]:
# Take the useful columns and make an independent copy of them
# (copying the selection avoids SettingWithCopyWarning in later assignments)
df1 = df[['overall','traveller_type', 'cabin','seat_comfort','cabin_service', 'food_bev',
          'entertainment', 'ground_service','value_for_money', 'recommended']].copy()

In [None]:
df1.info()

### 1. Handling Missing Values

In [None]:
df1.isnull().sum()

In [None]:
# Handling Missing Values & Missing Value Imputation
# Imputing numerical column with mean using Sklearn Simple Imputer method
from sklearn.impute import SimpleImputer

# Columns to impute
numeric_column=['overall', 'seat_comfort', 'cabin_service','food_bev', 'entertainment', 'ground_service', 'value_for_money']
categorical_column=['traveller_type', 'cabin']

# Create instance of Simple Imputer with mean strategy
numeric_imputer=SimpleImputer(strategy='mean')
categorical_imputer=SimpleImputer(strategy='most_frequent')

# Fitting the imputer method
df1[numeric_column]=numeric_imputer.fit_transform(df1[numeric_column])
df1[categorical_column]=categorical_imputer.fit_transform(df1[categorical_column])

# Applying dropna() method on target variable
df1.dropna(subset='recommended',inplace=True)

In [None]:
df1.isnull().sum()

In [None]:
df1.duplicated().sum()

In [None]:
df1.shape

In [None]:
df1.drop_duplicates(inplace=True)
df1.duplicated().sum()

#### What all missing value imputation techniques have you used and why did you use those techniques?

Mean imputation technique is used on numerical columns while mode imputation technique is used on categorical columns.

Mean imputation is appropriate when we want to maintain the central tendency of the data.

Mode imputation is suitable for categorical data as it preserves the most common category.

Rows with a missing value in the target column ("recommended") are dropped with dropna() rather than imputed, because mode imputation on the target would assign labels that were never observed and could lead to class imbalance.
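The imputation choices described above can be illustrated on a tiny hand-made table. This is a sketch under stated assumptions: the `toy` DataFrame and its values are invented, and the target rows are dropped before imputing so that the features are only filled in for rows that are actually kept.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical four-row table with one gap per column
toy = pd.DataFrame({
    'seat_comfort': [1.0, np.nan, 4.0, 5.0],
    'cabin': ['Economy Class', 'Economy Class', np.nan, 'Business Class'],
    'recommended': ['yes', 'no', 'yes', np.nan],
})

# Drop rows with a missing target first, so no label is ever invented
toy = toy.dropna(subset=['recommended']).copy()

# Mean for the numeric column, most frequent value for the categorical one
toy[['seat_comfort']] = SimpleImputer(strategy='mean').fit_transform(toy[['seat_comfort']])
toy[['cabin']] = SimpleImputer(strategy='most_frequent').fit_transform(toy[['cabin']])

print(toy)
```

In a stricter setup, the imputers would be fitted on the training split only and then applied to the test split, so that test rows do not influence the imputed values.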

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
plt.figure(figsize=(14,6))
sns.boxplot(data=df1)
plt.show()

##### What all outlier treatment techniques have you used and why did you use those techniques?

There is no need to address outliers because there are no outliers in the independent variables.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
# Binary Encoding Target Variable
df1['recommended']=df1['recommended'].replace({'yes': 1, 'no': 0})

# Applying Ordinal Encoding to cabin column
df1['cabin']=df1['cabin'].replace({'Economy Class':0, 'Premium Economy':1, 'Business Class' : 2,'First Class':3})

# Applying One Hot Encoding to Traveller_Type column
ohe=pd.get_dummies(df1['traveller_type'],drop_first=True)

# Concatenating the encoded feature with original dataframe
df1=pd.concat([df1,ohe],axis=1)

# Dropping traveller_type column from the dataframe
df1=df1.drop('traveller_type',axis=1)


In [None]:
df1.head()

#### What all categorical encoding techniques have you used & why did you use those techniques?

Binary Encoding for Target Variable:- This is often done in a binary classification problem, where we want to predict one of two classes. The code converts the 'recommended' column, which holds the values 'yes' and 'no', into numerical values: 'yes' is encoded as 1 and 'no' as 0.

Ordinal Encoding for 'cabin' Column:- Ordinal encoding is suitable when there is an inherent order or ranking among the categories. The cabin classes ('Economy Class', 'Premium Economy', 'Business Class', 'First Class') are mapped to the integer values 0, 1, 2 and 3.

One-Hot Encoding for 'traveller_type' Column:- One-hot encoding is suitable for categorical columns with no intrinsic order. It creates a binary (0 or 1) column for each category, indicating whether each instance belongs to that category or not. The drop_first=True argument drops one of the one-hot-encoded columns to prevent multicollinearity.
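The three encodings can be seen side by side on a small example. This is a hedged sketch: the `toy` rows are invented, the lowercase traveller types are assumed to match the earlier lowercasing step, and `map` is used in place of `replace` purely as an illustrative choice.

```python
import pandas as pd

# Hypothetical three-row sample
toy = pd.DataFrame({
    'cabin': ['Economy Class', 'Business Class', 'First Class'],
    'traveller_type': ['business', 'solo leisure', 'couple leisure'],
    'recommended': ['no', 'yes', 'yes'],
})

# Binary target and ordinal cabin class via explicit mappings
toy['recommended'] = toy['recommended'].map({'yes': 1, 'no': 0})
cabin_order = {'Economy Class': 0, 'Premium Economy': 1, 'Business Class': 2, 'First Class': 3}
toy['cabin'] = toy['cabin'].map(cabin_order)

# One-hot encode traveller_type; drop_first removes the first category in sorted order
dummies = pd.get_dummies(toy['traveller_type'], drop_first=True, dtype=int)
encoded = pd.concat([toy.drop(columns='traveller_type'), dummies], axis=1)

print(encoded)
```

Here 'business' becomes the dropped reference category, so a row with zeros in both remaining dummy columns denotes a business traveller. Note that `map` turns any unlisted category into NaN, whereas `replace` leaves it unchanged.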



### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
# Checking Multicollinearity

def calculate_vif(X):
    vif = pd.DataFrame()
    vif['Features'] = X.columns
    vif['VIF'] = [round(variance_inflation_factor(X.values, i),2) for i in range(X.shape[1])]
    return vif

# Select columns for which VIF is calculated
selected_columns = [col for col in df1.describe().columns if col not in ['recommended']]

# Selecting the columns from DataFrame
selected_data = df1[selected_columns]

# Calculate VIF for the selected columns
vif_result = calculate_vif(selected_data)

# Sort the VIF result DataFrame by VIF in descending order
vif_result_sorted = vif_result.sort_values(by='VIF', ascending=False)

print(vif_result_sorted)

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting
# Identify columns with high VIF
high_vif_cols = vif_result[vif_result['VIF'] > 12]['Features']

# Remove columns with high VIF
df1.drop(high_vif_cols, axis=1, inplace=True)
df1.head()

##### What all feature selection methods have you used  and why?

Variance Inflation Factor method is used for feature selection.

VIF is used to identify and potentially remove features that contribute to multicollinearity. The idea is to retain a subset of features that are relatively independent of each other, reducing the negative effects of multicollinearity.

High VIF values suggest that a feature can be predicted using the other features, and therefore it might be redundant in the presence of other correlated features.
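The statement that a high-VIF feature "can be predicted using the other features" corresponds to the formula VIF_j = 1 / (1 - R_j^2), where R_j^2 is the R-squared from regressing feature j on the remaining features. The sketch below computes this directly with NumPy on synthetic data; the helper `vif` and the simulated columns are assumptions for illustration, not the notebook's `calculate_vif`.

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the others."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        y = X[:, j]
        # Design matrix: an intercept plus all remaining columns
        others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        r2 = 1.0 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(1)
a = rng.normal(size=1000)
b = rng.normal(size=1000)
c = a + b + rng.normal(scale=0.1, size=1000)  # nearly a linear combination of a and b

vif_values = vif(np.column_stack([a, b, c]))
vif_independent = vif(np.column_stack([a, b]))
print(np.round(vif_values, 2))
print(np.round(vif_independent, 2))
```

One caveat worth checking: statsmodels' `variance_inflation_factor` uses the design matrix exactly as passed, so when no constant column is included its values can differ from this intercept-based sketch.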

##### Which all features you found important and why?

In a multiple regression model, multicollinearity exists when several independent variables are correlated with one another. A small VIF value suggests low correlation across variables. Hence, with a threshold of 12, all variables with VIF < 12 are included in the model.

### 5. Data Transformation

In [None]:
df1.skew()

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

There is no need to transform the data, as it is almost symmetrical. The skewness shown in "cabin", "couple leisure" and "family leisure" arises because they are encoded categorical columns.

### 6. Data Scaling

In [None]:
# Scaling your data
# Normalizing data using MinMaxScaler

from sklearn.preprocessing import MinMaxScaler
sc = MinMaxScaler()
scaled_df = pd.DataFrame(sc.fit_transform(df1))
scaled_df.columns = df1.columns
scaled_df.head()

##### Which method have you used to scale you data and why?
MinMax scaling is used to scale the data.

Because the rating features are bounded and the box plot showed no outliers, MinMax scaling is a suitable choice. It compresses each feature into the range 0 to 1. Note that MinMax scaling is itself sensitive to extreme values, so it would be a weaker choice if outliers were present.
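MinMax scaling maps each value to (x - min) / (max - min). A common precaution is to learn the minimum and maximum from the training split only and reuse them on the test split, so that test rows do not influence the scaling. The sketch below shows this order of operations on synthetic ratings; the arrays and variable names are assumptions, and the notebook itself scales before splitting.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X_demo = rng.integers(1, 6, size=(200, 3)).astype(float)  # synthetic 1-5 ratings
y_demo = rng.integers(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=10)

# Learn min and max from the training split only, then reuse them on the test split
scaler = MinMaxScaler()
X_tr_scaled = scaler.fit_transform(X_tr)
X_te_scaled = scaler.transform(X_te)

# Each value becomes (x - min) / (max - min) using the training min and max
print(X_tr_scaled.min(axis=0), X_tr_scaled.max(axis=0))
```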

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Dimensionality reduction techniques can lead to information loss. Since we have only 10 features in our dataset, the risk of overfitting is reduced because the model has fewer opportunities to fit noise in the data. Hence no need to apply dimensionality reduction techniques such as PCA.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

X=scaled_df.drop('recommended',axis=1)
y=scaled_df['recommended']
X_train, X_test, y_train, y_test=train_test_split(X,y,test_size=0.20,random_state=10)

In [None]:
print(X_train.shape)
print(X_test.shape)

##### What data splitting ratio have you used and why?

Data is split in the ratio of 80:20 which means 80% of the data is used for training purpose & remaining 20% data is used for testing purpose.

The choice of the train-test split ratio, such as 0.8 (80%) for training and 0.2 (20%) for testing, is not a strict rule, but rather a commonly used practice in machine learning and data analysis. This ratio is often chosen due to a balance between ensuring sufficient data for training a model while also having a sizable portion for evaluating its performance on unseen data.
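An 80:20 split can also be stratified so that both parts keep the same class proportions. The notebook does not use `stratify`; the sketch below shows it as an optional variation on synthetic data, with made-up arrays and a 40/60 class mix chosen only for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X_demo = rng.normal(size=(1000, 4))
y_demo = np.array([1] * 400 + [0] * 600)  # synthetic 40/60 class mix

# stratify=y_demo keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=10, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```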

### 9. Handling Imbalanced Dataset

In [None]:
df1['recommended'].value_counts()

##### Do you think the dataset is imbalanced? Explain Why.

The dataset is not imbalanced & hence no need to balance it.

## ***7. ML Model Implementation***

### ML Model - 1 (Logistic Regression)

In [None]:
# ML Model - 1 Implementation
# Applying Logistic Regression
model_lr=LogisticRegression(fit_intercept=True, max_iter=1000)

# Fit the Algorithm
model_lr.fit(X_train,y_train)
# Predict on the model
train_class_preds = model_lr.predict(X_train)
test_class_preds = model_lr.predict(X_test)

In [None]:
# Get the accuracy scores
train_accuracy = accuracy_score(y_train, train_class_preds)
test_accuracy = accuracy_score(y_test, test_class_preds)

print("The accuracy on train data is ", train_accuracy)
print("The accuracy on test data is ", test_accuracy)

In [None]:
# Visualizing actual vs predicted value
plt.figure(figsize=(15, 5))

# Plotting the predicted values for the specific range
plt.plot(test_class_preds[100:200], label="Predicted", color='limegreen')

# Plotting the actual values for the specific range
plt.plot(np.array(y_test[100:200]), label="Actual", color='black')

plt.legend(loc='upper left')
plt.title("Actual vs. Predicted Values (Logistic Regression)")
plt.xlabel("Data Points")
plt.ylabel("Values")
plt.show()

In [None]:
# Plot the confusion matrix of Training Class

plt.figure(figsize=(8,6))
confuse_matrix_train_lr = confusion_matrix(y_train,train_class_preds)

ax= plt.subplot()
sns.heatmap(confuse_matrix_train_lr, annot=True, fmt = 'd',ax = ax,cmap='coolwarm')
ax.set_xlabel('Predicted Labels',fontsize=15)
ax.set_ylabel('Actual Labels',fontsize=15)
ax.set_title('Confusion Matrix of Training Class Data',fontsize=15)
plt.show()

In [None]:
# Plot the confusion matrix of Test Class

plt.figure(figsize=(8,6))
confuse_matrix_test_lr = confusion_matrix(y_test,test_class_preds)

ax= plt.subplot()
sns.heatmap(confuse_matrix_test_lr, annot=True, fmt = 'd',ax = ax,cmap='coolwarm')
ax.set_xlabel('Predicted Labels',fontsize=15)
ax.set_ylabel('Actual Labels',fontsize=15)
ax.set_title('Confusion Matrix of Test Class Data',fontsize=15)
plt.show()

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
print("Training Data")
print(classification_report(y_train, train_class_preds))
print("\n")
print("Testing Data")
print(classification_report(y_test, test_class_preds))

Logistic Regression is a statistical and machine learning technique used for binary and multiclass classification tasks. It's commonly used for binary classification problems, where the target variable has two possible outcomes, often denoted as 0 (negative class) and 1 (positive class).

Based on the above evaluation metric score chart, the Logistic Regression model performs well on both class 0 and class 1. It achieves a good balance between precision and recall for both classes and an overall accuracy of 85%, indicating that it makes accurate predictions for this binary classification task.
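Logistic regression produces a probability for the positive class (the sigmoid of a linear score) and, by default, predicts class 1 when that probability is at least 0.5. Since the project summary names recall as the priority metric, the sketch below illustrates how lowering the threshold trades precision for recall. It uses a synthetic dataset from `make_classification`; the 0.3 threshold and all variable names are assumptions, not values from the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(n_samples=1000, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=10)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Probability of the positive class: sigmoid of the linear decision score
proba = clf.predict_proba(X_te)[:, 1]

recalls = {}
for threshold in (0.5, 0.3):
    preds = (proba >= threshold).astype(int)
    recalls[threshold] = recall_score(y_te, preds)
    print(threshold,
          "precision:", round(precision_score(y_te, preds), 3),
          "recall:", round(recalls[threshold], 3))
```

Lowering the threshold can only add positive predictions, so recall never decreases, while precision usually falls.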

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
logistic_param={'penalty': ['l1', 'l2', 'elasticnet', None],
                'C':[0.01,0.1,1,5,10],
                'solver': ['lbfgs', 'liblinear', 'newton-cg', 'newton-cholesky', 'sag', 'saga'],
                'l1_ratio':[0,0.4,0.6,0.8,1]}

model_logistic=LogisticRegression()

# Fit the Algorithm
logistic_grid = GridSearchCV(model_logistic, logistic_param, cv=5, scoring='roc_auc')

logistic_grid.fit(X_train, y_train)

In [None]:
print(logistic_grid.best_params_)
print(logistic_grid.best_score_)

# Predict on the model
train_lr_preds = logistic_grid.predict(X_train)
test_lr_preds = logistic_grid.predict(X_test)

In [None]:
# Get the accuracy scores
train_accuracy_lr = accuracy_score(y_train, train_lr_preds)
test_accuracy_lr = accuracy_score(y_test, test_lr_preds)

print("The accuracy on train data is ", train_accuracy_lr)
print("The accuracy on test data is ", test_accuracy_lr)

In [None]:

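# Applying 10-fold Cross Validation on Logistic Regression with default hyperparameters
# (model_logistic is the untuned estimator, not the grid-search result)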
scores=cross_val_score(model_logistic,X_train,y_train,cv=10,scoring='roc_auc')
cv_score=scores.mean()
print("Cross Validation Score is: ",cv_score)

##### Which hyperparameter optimization technique have you used and why?

GridSearchCV is used for hyperparameter optimisation. It exhaustively evaluates every combination in the given hyperparameter grid with cross-validation and selects the combination with the best cross-validated score.
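Not every solver in `LogisticRegression` supports every penalty, so a single flat grid can contain combinations that fail to fit. One way to keep the search exhaustive but valid is to pass a list of sub-grids. The following is a small sketch on synthetic data; the specific sub-grids and values are assumptions chosen for brevity, not the grid used in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=0)

# A list of dicts searches each sub-grid separately, so only valid
# solver/penalty pairs are tried
param_grid = [
    {'solver': ['lbfgs'], 'penalty': ['l2'], 'C': [0.1, 1, 10]},
    {'solver': ['liblinear'], 'penalty': ['l1', 'l2'], 'C': [0.1, 1, 10]},
]

grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5, scoring='roc_auc')
grid.fit(X_demo, y_demo)

print(grid.best_params_)
print(round(grid.best_score_, 3))
print(len(grid.cv_results_['params']), "combinations evaluated")
```

With these sub-grids, 3 + 6 = 9 combinations are evaluated, each with 5-fold cross-validation.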

In [None]:
# Visualizing evaluation Metric Score chart after Hyperparameter Tuning
print("Training Data")
print(classification_report(y_train, train_lr_preds))
print("\n")
print("Testing Data")
print(classification_report(y_test, test_lr_preds))

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

As far as accuracy is concerned, there is no improvement after performing hyperparameter tuning, and it remains at 85%.

### ML Model - 2 Random Forest Classifier

In [None]:

# Random Forest Classifier
model_rf=RandomForestClassifier()

# fit the model
model_rf.fit(X_train,y_train)

In [None]:
# Predict on the model
train_preds_rf = model_rf.predict(X_train)
test_preds_rf = model_rf.predict(X_test)

In [None]:

# Get the accuracy scores
train_rf_accuracy = accuracy_score(y_train, train_preds_rf)
test_rf_accuracy = accuracy_score(y_test, test_preds_rf)

print("The accuracy on train data is ", train_rf_accuracy)
print("The accuracy on test data is ", test_rf_accuracy)

In [None]:
# Visualizing actual vs predicted value

plt.figure(figsize=(15, 5))

# Plotting the predicted values for the specific range
plt.plot(test_preds_rf[100:200],label="Predicted",color='limegreen')


# Plotting the actual values for the specific range
plt.plot(np.array(y_test[100:200]),label="Actual",color='black')

plt.legend(loc='upper left')
plt.title("Actual vs. Predicted Values (Random Forest Classifier)")
plt.xlabel("Data Points")
plt.ylabel("Values")
plt.show()

In [None]:
# Plot the confusion matrix of Training Class
plt.figure(figsize=(8,6))
confuse_matrix_train_rf = confusion_matrix(y_train,train_preds_rf)

ax= plt.subplot()
sns.heatmap(confuse_matrix_train_rf, annot=True, fmt = 'd', ax = ax, cmap='coolwarm')
ax.set_xlabel('Predicted Labels',fontsize=15)
ax.set_ylabel('Actual Labels',fontsize=15)
ax.set_title('Confusion Matrix of Training Class Data',fontsize=15)
plt.show()


In [None]:
# Plot the confusion matrix of Testing Class

plt.figure(figsize=(8,6))
confuse_matrix_test_rf = confusion_matrix(y_test,test_preds_rf)

ax= plt.subplot()
sns.heatmap(confuse_matrix_test_rf, annot=True, fmt = 'd',ax = ax,cmap='coolwarm')
ax.set_xlabel('Predicted Labels',fontsize=15)
ax.set_ylabel('Actual Labels',fontsize=15)
ax.set_title('Confusion Matrix of Test Data',fontsize=15)
plt.show()

In [None]:
# Visualizing evaluation Metric Score chart

print("Training Data")
print(classification_report(y_train, train_preds_rf))
print("\n")
print("Testing Data")
print(classification_report(y_test, test_preds_rf))

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

The Random Forest Classifier is an ensemble learning algorithm that combines multiple decision trees to make more accurate predictions. Each tree is trained on a random (bootstrap) sample of the training data, and a random subset of the input features is considered when choosing each split. This randomness helps reduce overfitting and improves the model's generalization ability. During classification, the algorithm aggregates the predictions of the individual trees and typically selects the majority class as the final prediction.
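The two sources of randomness map onto parameters of scikit-learn's `RandomForestClassifier`. The sketch below is a minimal illustration on synthetic data; the parameter values are assumptions, and the out-of-bag score is shown only as a convenient by-product of bootstrapping, not as something the notebook computes.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=4, random_state=0
)

forest = RandomForestClassifier(
    n_estimators=200,
    max_features='sqrt',  # random subset of features considered at each split
    bootstrap=True,       # each tree is trained on a bootstrap sample of the rows
    oob_score=True,       # score each row using only trees that did not see it
    random_state=0,
)
forest.fit(X_demo, y_demo)

print("Out-of-bag accuracy:", round(forest.oob_score_, 3))
print("Feature importances:", forest.feature_importances_.round(3))
```

The out-of-bag accuracy gives a generalization estimate without a separate validation set, which can help when comparing training and testing accuracy.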

From the evaluation metric score chart, we can clearly observe that the test accuracy has decreased to 82% from the 85% achieved by the Logistic Regression model. However, the Random Forest model performed better on the training data (91%) than the Logistic Regression model (84%), which suggests some overfitting.

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
max_features = [0.2,0.6,1.0]
max_depth = [2,8,None]
max_samples = [0.5,0.75,1.0]


param_grid = {'max_features': max_features,
              'max_depth': max_depth,
            'max_samples':max_samples}
# Fit the Algorithm
model_rf = RandomForestClassifier()
rf_grid = GridSearchCV(estimator = model_rf,
                       param_grid = param_grid,
                       cv = 3,
                       verbose=2,
                       n_jobs = -1)
rf_grid.fit(X_train,y_train)

In [None]:
print(rf_grid.best_params_)
print(rf_grid.best_score_)

In [None]:
# Predict on the model
train_tuned_rf_preds = rf_grid.predict(X_train)
test_tuned_rf_preds = rf_grid.predict(X_test)
# Get the accuracy scores
train_accuracy_tuned_rf = accuracy_score(y_train, train_tuned_rf_preds)
test_accuracy_tuned_rf = accuracy_score(y_test, test_tuned_rf_preds)

print("The accuracy on train data is ", train_accuracy_tuned_rf)
print("The accuracy on test data is ", test_accuracy_tuned_rf)

In [None]:
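# Applying 10-fold Cross Validation on the Random Forest model with default hyperparameters
# (model_rf is the untuned estimator, not the grid-search result)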
scores=cross_val_score(model_rf,X_train,y_train,cv=10,scoring='roc_auc')
cv_score=scores.mean()
print("Cross Validation Score is: ",cv_score)

##### Which hyperparameter optimization technique have you used and why?

GridSearchCV is used for hyperparameter optimisation. It exhaustively evaluates every combination in the given hyperparameter grid with cross-validation and selects the combination with the best cross-validated score.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

There is a slight increase in the testing accuracy after hyperparameter tuning: it rose from 82% to 85%.

However, the training accuracy, which was 91% without tuning, has decreased to 85%, so the gap between training and testing accuracy has closed.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Each evaluation metric in a machine learning model provides valuable insights into the model's performance, and these insights can have specific implications for businesses. Let's discuss each metric's indication and its potential business impact:

a) Precision:

- Indication: Precision measures the accuracy of positive predictions made by the model. It answers the question: "Of all the positive predictions made by the model, how many were actually correct?"
- Business Impact: High precision is crucial in situations where false positives are costly or detrimental to the business. For example, in a medical diagnosis application, high precision ensures that fewer healthy patients are mistakenly classified as having a disease, reducing unnecessary stress and medical costs.

b) Recall:

- Indication: Recall measures the model's ability to correctly identify all positive instances in the dataset. It answers the question: "Of all the actual positive cases, how many did the model correctly identify?"
- Business Impact: High recall is essential when missing positive cases can have severe consequences. For instance, in fraud detection, high recall ensures that the majority of fraudulent transactions are caught, minimizing financial losses for the business.

c) F1-Score:

- Indication: F1-Score is the harmonic mean of precision and recall, providing a balanced measure that considers both false positives and false negatives.
- Business Impact: F1-Score is valuable when a balance between precision and recall is needed. It helps businesses strike the right trade-off between false positives and false negatives based on their specific priorities.

d) Accuracy:

- Indication: Accuracy measures the overall correctness of the model's predictions across all classes.
- Business Impact: High accuracy is generally desirable, but it can be misleading in imbalanced datasets. In such cases, where one class is rare, a high overall accuracy may hide poor performance in the minority class.
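All four metrics can be derived from the cells of a binary confusion matrix. The sketch below works through a small hand-made example (the labels are invented, not project data) and cross-checks the manual formulas against scikit-learn's metric functions.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Small hand-made example: 1 = recommended, 0 = not recommended
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)                 # correct share of positive predictions
recall = tp / (tp + fn)                    # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print("manual :", precision, recall, f1, accuracy)
print("sklearn:", precision_score(y_true, y_pred), recall_score(y_true, y_pred),
      f1_score(y_true, y_pred), accuracy_score(y_true, y_pred))
```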

### ML Model - 3 SVM Classifier

In [None]:
## SVM Classifier
model_svc = SVC()

# fit the model
model_svc.fit(X_train,y_train)

# Predict on the model
train_preds_svc = model_svc.predict(X_train)
test_preds_svc = model_svc.predict(X_test)


# Get the accuracy scores
train_svc_accuracy = accuracy_score(y_train, train_preds_svc)
test_svc_accuracy = accuracy_score(y_test, test_preds_svc)

print("The accuracy on train data is ", train_svc_accuracy)
print("The accuracy on test data is ", test_svc_accuracy)

In [None]:
# Visualizing the Predicted vs Actual Graph
plt.figure(figsize=(15, 5))

# Plotting the predicted values for the specific range
plt.plot(test_preds_svc[100:200],label="Predicted",color='limegreen')


# Plotting the actual values for the specific range
plt.plot(np.array(y_test[100:200]),label="Actual",color='black')

plt.legend(loc='upper left')
plt.title("Actual vs. Predicted Values (SVM Classifier)")
plt.xlabel("Data Points")
plt.ylabel("Values")
plt.show()

In [None]:
# Plot the confusion matrix of Training Class

plt.figure(figsize=(8,6))
confuse_matrix_train_svc = confusion_matrix(y_train,train_preds_svc)

ax= plt.subplot()
sns.heatmap(confuse_matrix_train_svc, annot=True, fmt = 'd',ax = ax, cmap='coolwarm')
ax.set_xlabel('Predicted Labels',fontsize=15)
ax.set_ylabel('Actual Labels',fontsize=15)
ax.set_title('Confusion Matrix of Training Class Data',fontsize=15)
plt.show()

In [None]:

# Plot the confusion matrix of Testing Class

plt.figure(figsize=(8,6))
confuse_matrix_test_svc = confusion_matrix(y_test,test_preds_svc)

ax= plt.subplot()
sns.heatmap(confuse_matrix_test_svc, annot=True, fmt = 'd',ax = ax, cmap='coolwarm')
ax.set_xlabel('Predicted Labels',fontsize=15)
ax.set_ylabel('Actual Labels',fontsize=15)
ax.set_title('Confusion Matrix of Test Data',fontsize=15)
plt.show()

In [None]:
# Visualizing evaluation Metric Score chart
print("Training Data")
print(classification_report(y_train, train_preds_svc))
print("\n")
print("Testing Data")
print(classification_report(y_test, test_preds_svc))

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

A Support Vector Machine (SVM) is a powerful and versatile machine learning algorithm used for both classification and regression tasks. It's particularly well-suited for classification problems. The primary goal of an SVM is to find the optimal hyperplane that best separates data points belonging to different classes in a way that maximizes the margin between these classes.
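For a linear kernel, the separating hyperplane is w·x + b = 0 and the margin width is 2 / ||w||. The sketch below fits a linear SVM to two synthetic clusters and reads these quantities from the fitted model; the data and variable names are assumptions for illustration, and the notebook's default `SVC()` uses an RBF kernel, for which `coef_` is not available.

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated synthetic clusters (illustrative only)
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=-2.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=2.0, scale=0.5, size=(50, 2)),
])
y_demo = np.array([0] * 50 + [1] * 50)

svm = SVC(kernel='linear', C=1.0).fit(X_demo, y_demo)

# Hyperplane parameters and the resulting margin width 2 / ||w||
w = svm.coef_[0]
b = svm.intercept_[0]
margin_width = 2.0 / np.linalg.norm(w)

print("Support vectors per class:", svm.n_support_)
print("Margin width:", round(margin_width, 3))
```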

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:

# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

param_grid = {'C':[0.01,0.1,1],
              'kernel':['linear', 'poly', 'rbf', 'sigmoid']}


# Fit the Algorithm
model_svc = SVC()
svc_grid = GridSearchCV(estimator = model_svc,
                       param_grid = param_grid,
                       cv = 3,
                       verbose=2,
                       n_jobs = -1)

svc_grid.fit(X_train,y_train)

print(svc_grid.best_params_)
print(svc_grid.best_score_)


In [None]:
# Predict on the model
train_tuned_svc_preds = svc_grid.predict(X_train)
test_tuned_svc_preds = svc_grid.predict(X_test)


# Get the accuracy scores
train_accuracy_tuned_svc = accuracy_score(y_train, train_tuned_svc_preds)
test_accuracy_tuned_svc = accuracy_score(y_test, test_tuned_svc_preds)

print("The accuracy on train data is ", train_accuracy_tuned_svc)
print("The accuracy on test data is ", test_accuracy_tuned_svc)

In [None]:
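# Applying 10-fold Cross Validation (ROC AUC) on the SVM Classifier Model
# Note: model_svc is the default, untuned SVC; pass svc_grid.best_estimator_ instead to score the tuned model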
scores=cross_val_score(model_svc,X_train,y_train,cv=10,scoring='roc_auc')
cv_score=scores.mean()
print("Cross Validation Score is: ",cv_score)

##### Which hyperparameter optimization technique have you used and why?

GridSearchCV is used for hyperparameter optimisation. It exhaustively evaluates every combination in the supplied hyperparameter grid with cross-validation and keeps the combination that gives the best cross-validated score. The result is the best setting within the grid, not necessarily the best possible setting overall.
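As a rough illustration of what "exhaustive" means for the grid used above, the sketch below enumerates the candidate combinations with scikit-learn's `ParameterGrid` and counts the cross-validation fits for `cv = 3`. The variable names `candidates` and `n_folds` are introduced here only for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the tuning cell above
param_grid = {'C': [0.01, 0.1, 1],
              'kernel': ['linear', 'poly', 'rbf', 'sigmoid']}

# Every combination of C and kernel becomes one candidate model
candidates = list(ParameterGrid(param_grid))
n_folds = 3

print("Candidate combinations:", len(candidates))
print("Cross-validation fits:", len(candidates) * n_folds)
for params in candidates[:3]:
    print(params)
```

The cost grows multiplicatively with each added hyperparameter, which is why a randomized search is often preferred for larger search spaces.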

In [None]:

# Visualizing evaluation Metric Score chart after Hyperparameter Tuning
print("Training Data")
print(classification_report(y_train, train_tuned_svc_preds))
print("\n")
print("Testing Data")
print(classification_report(y_test, test_tuned_svc_preds))

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Yes. After hyperparameter tuning, the accuracy on the test data rose from 85% to 86%, a modest improvement over the untuned SVM model.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

The choice of evaluation metric depends on the business problem and on the costs and risks of each kind of error. High precision is valuable when minimizing false positives is critical, high recall is important when catching all positive cases is essential, and the F1-score balances the two. Accuracy alone can be misleading on imbalanced datasets, where a high overall accuracy may hide poor performance on the minority class. Because this project aims to identify the passengers who will recommend the airline, recall is given the highest priority, followed by accuracy and ROC AUC, and all of these are read together to assess model performance comprehensively.
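The following sketch shows how accuracy can look healthy while recall is poor. The labels are made up for illustration (90 negatives, 10 positives) and do not come from the airline dataset.

```python
# Toy imbalanced example: 90 negatives followed by 10 positives (illustrative labels only)
y_true = [0] * 90 + [1] * 10
# A classifier that raises 1 false alarm and finds only 2 of the 10 positives
y_pred = [0] * 89 + [1] * 1 + [1] * 2 + [0] * 8

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here accuracy is 0.91 even though the classifier misses 8 of the 10 positive cases (recall 0.20), which is the situation the paragraph above warns about.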

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

The choice of the best model should consider the specific goals and requirements of the project, the interpretability of the model, and potential business implications.

Based on the metrics above, and taking the accuracy score as the key criterion, the SVM model appears to be the better choice for the final prediction model. It also has a higher cross-validation score than the Random Forest Classifier.
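One way to make such a comparison systematic is to score every candidate model with the same cross-validation folds and several metrics at once. The sketch below does this on synthetic data from `make_classification`; the data, the `candidates` dictionary, and `cv_summary` are assumptions for illustration, and the numbers it prints say nothing about the airline dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

# Synthetic stand-in for the training data (illustrative only)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "SVM": SVC(),
}
metrics = ["accuracy", "recall", "roc_auc"]

cv_summary = {}
for name, model in candidates.items():
    # cross_validate returns one 'test_<metric>' array per requested metric
    results = cross_validate(model, X_demo, y_demo, cv=5, scoring=metrics)
    cv_summary[name] = {m: results[f"test_{m}"].mean() for m in metrics}
    print(name, {m: round(v, 3) for m, v in cv_summary[name].items()})
```

In the notebook itself, `X_demo` and `y_demo` would be replaced by `X_train` and `y_train`, so that all three models are ranked on the same folds and on the metric the project prioritizes.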

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

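eli5 (Explain Like I'm 5) is a model explainability library. Here its PermutationImportance wrapper is used to estimate feature importance: each feature is shuffled in turn on the test data, and the resulting drop in the model's score indicates how much the model relies on that feature.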
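If eli5 is not available in your environment, scikit-learn ships a comparable utility, `sklearn.inspection.permutation_importance`. The sketch below is an alternative and not the method used in this notebook; it runs on synthetic data, and the names `X_demo`, `svm_demo`, and `result` are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data with 6 features (illustrative only)
X_demo, y_demo = make_classification(n_samples=300, n_features=6,
                                     n_informative=3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.25, random_state=42)

svm_demo = SVC().fit(X_tr, y_tr)

# Shuffle each feature 10 times on held-out data and record the drop in accuracy
result = permutation_importance(svm_demo, X_te, y_te,
                                n_repeats=10, random_state=42)

# Report features from most to least important
for idx in result.importances_mean.argsort()[::-1]:
    print(f"feature_{idx}: {result.importances_mean[idx]:.3f} "
          f"+/- {result.importances_std[idx]:.3f}")
```

Because permutation importance only needs predictions and a score, it works for models such as SVC with a non-linear kernel that expose no coefficients or built-in feature importances.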

In [None]:
!pip install eli5

In [None]:
import eli5
from eli5.sklearn import PermutationImportance

# Logistic Regression
model_lr = LogisticRegression()
model_lr.fit(X_train, y_train)

perm_importance_lr = PermutationImportance(model_lr, random_state=42).fit(X_test, y_test)

# Random Forest Classifier
model_rf = RandomForestClassifier()
model_rf.fit(X_train, y_train)

perm_importance_rf = PermutationImportance(model_rf, random_state=42).fit(X_test, y_test)

# SVM Classifier
model_svc = SVC()
model_svc.fit(X_train, y_train)

perm_importance_svc = PermutationImportance(model_svc, random_state=42).fit(X_test, y_test)


In [None]:
# Display the permutation importances of each model
# (display() is needed because a cell only renders its last expression automatically)
from IPython.display import display

print("Logistic Regression Model")
display(eli5.show_weights(perm_importance_lr, feature_names=X_test.columns.tolist()))

print("Random Forest Classifier Model")
display(eli5.show_weights(perm_importance_rf, feature_names=X_test.columns.tolist()))

print("SVM Classifier Model")
display(eli5.show_weights(perm_importance_svc, feature_names=X_test.columns.tolist()))

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File
!pip install joblib
import joblib

# Tuned SVM Classifier model (best estimator found by the grid search, refit on the training data)
best_model = svc_grid.best_estimator_

# Specify the file path to save the model
model_filename = 'best_model_svc.joblib'

# Save the model to the file
joblib.dump(best_model, model_filename)

print(f"Model saved as {model_filename}")

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.
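A simple sanity check is to confirm that a reloaded model reproduces the predictions of the model that was saved. The sketch below demonstrates this round trip on synthetic data and a temporary file; `svm_demo`, `demo_path`, and `reloaded` are hypothetical names used only for this illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in data and model (illustrative only)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
svm_demo = SVC().fit(X_demo, y_demo)

# Save to a temporary location, then load the model back
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_svc.joblib")
    joblib.dump(svm_demo, demo_path)
    reloaded = joblib.load(demo_path)

# The reloaded model should predict exactly what the original predicts
same_predictions = np.array_equal(svm_demo.predict(X_demo), reloaded.predict(X_demo))
print("Reloaded model reproduces the original predictions:", same_predictions)
```

A joblib file should be loaded with the same scikit-learn version that produced it, and only from a trusted source, since loading executes pickled code.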


In [None]:
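# Load the saved model and predict on the held-out test data as a sanity check
loaded_model = joblib.load(model_filename)
loaded_preds = loaded_model.predict(X_test)
print("The accuracy of the loaded model on test data is ", accuracy_score(y_test, loaded_preds))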

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

In conclusion, our exploration into the Airline Passenger Referral Prediction dataset has been a comprehensive and enlightening journey into the world of machine learning and predictive analytics. Utilizing the strengths of three distinct models—Logistic Regression, SVM Classifier, and Random Forest—we have effectively addressed the challenge of predicting passenger referrals, a crucial aspect in enhancing airline operations and passenger experiences.

Key insights from this project include:

- **Logistic Regression:** This model served as a robust starting point for our analysis, offering simplicity and interpretability. It provided valuable insights into how various factors influence passenger referrals, aiding in informed decision-making.
- **SVM Classifier:** The SVM Classifier demonstrated its prowess in handling complex relationships within the data. Its ability to identify intricate patterns that simpler models might overlook resulted in high predictive accuracy.
- **Random Forest:** The Random Forest model, a collection of decision trees, excelled in both predictive accuracy and feature importance. Its ability to model non-linear relationships and highlight key drivers of passenger referrals made it an invaluable tool in this project.

Evaluating these three models side by side not only helped us select the most accurate predictor but also offered a broader understanding of the factors influencing passenger referrals. This knowledge is vital for airlines in refining their strategies, enhancing customer interactions, and ultimately improving their services.

Moreover, the insights gained from this classification project extend beyond the airline industry. The methodologies used and lessons learned can be applied to various sectors where classification and predictive modeling are key components of decision-making.

In summary, our classification project on the Airline Passenger Referral Prediction dataset has provided us with the necessary tools and insights to tackle complex prediction tasks. It highlights the importance of model diversity in achieving optimal results and stands as a testament to the power of data-driven decision-making in improving operational efficiency and customer satisfaction in the airline industry and beyond.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***