<a href="https://colab.research.google.com/github/sadhana3636/Capstone_Project_6_ML_Indigo-Airline_Passenger_Referral_Prediction/blob/main/ML_Capstone_project_IndiGo_Airline_Passenger_Referral_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -  IndiGo Airline Passenger Referral Prediction



##### **Project Type**    - Classification
##### **Contribution**    - Individual - Sadhana B


**Video link :**  https://drive.google.com/file/d/1bMzidcxziCPxMsG7SAWMPkOGdk8LAw4B/view?usp=sharing

# **Project Summary -**

**IndiGo Airline Passenger Referral Prediction** aims to analyze customer reviews from 2006 to 2019 to understand the factors influencing passenger referrals. In the highly competitive airline industry, customer experience plays a pivotal role in building brand loyalty and driving positive word-of-mouth. By leveraging machine learning techniques, this project seeks to develop a predictive model that identifies which passengers are most likely to recommend IndiGo. Insights from this analysis will help the airline refine its services by focusing on key aspects such as in-flight comfort, customer service, and overall value for money.  

With a data-driven approach, IndiGo can implement targeted improvements in areas that significantly impact passenger satisfaction. Predicting referrals enables strategic marketing efforts, allowing the airline to capitalize on positive customer feedback and strengthen its brand reputation. Additionally, continuous refinement of service quality based on predictive insights will help IndiGo differentiate itself from competitors and maintain a strong position in the market. By integrating advanced analytics into decision-making, IndiGo can enhance customer engagement, optimize service offerings, and foster long-term brand loyalty.

# **GitHub Link -**

 GitHub Link :  https://github.com/sadhana3636/Capstone_Project_6_ML_Indigo-Airline_Passenger_Referral_Prediction

# **Problem Statement**


IndiGo aims to predict passenger referrals by analyzing customer reviews from 2006 to 2019. The goal is to identify key service factors influencing recommendations and enhance customer satisfaction.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure each chart answers the following questions.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

# Data Handling & Processing
import pandas as pd
import numpy as np

# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Statistical analysis
import scipy.stats as stats
from scipy.stats import chi2_contingency

# Preprocessing & Feature Engineering
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder
from imblearn.over_sampling import SMOTE

# Machine Learning Models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
import xgboost as xgb
import lightgbm as lgb
import time

# Model Evaluation
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score, roc_curve, auc
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, StratifiedKFold
from sklearn.metrics import precision_score, recall_score, f1_score


# Warnings
import warnings
warnings.filterwarnings("ignore")

### Dataset Loading

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')


In [None]:
# Load Dataset

# Define the path to dataset
file_path = "/content/drive/MyDrive/Colab Notebooks/Alma_FoundationTrack/AlmaBetter_Capstone_Project_6/data_airline_reviews.xlsx - capstone_airline_reviews3.csv"

reviews_df = pd.read_csv(file_path)

### Dataset First View

In [None]:
# Dataset First Look

reviews_df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

# Count the number of rows and columns
num_rows, num_columns = reviews_df.shape

print(f"Number of Rows: {num_rows}")
print(f"Number of Columns: {num_columns}")


### Dataset Information

In [None]:
# Dataset Info

reviews_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

duplicate_count = reviews_df.duplicated().sum()
print(f"Number of Duplicate Rows: {duplicate_count}")

In [None]:
# Drop the duplicate rows

reviews_df.drop_duplicates(inplace=True)

In [None]:
reviews_df.shape

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count


missing_values = reviews_df.isnull().sum()
print("Missing Values:\n", missing_values)


In [None]:
# Percentage of missing values


missing_percentage = (missing_values / len(reviews_df)) * 100
print("Percentage of Missing Values:\n", round(missing_percentage,2))

In [None]:
# Bar graph representation of missing values


# Calculate missing percentages
missing_percent = reviews_df.isnull().mean() * 100

# Plot missing values as bar chart
plt.figure(figsize=(10, 5))
ax = missing_percent.sort_values(ascending=False).plot(kind='bar', color='orange')

# Add value labels on top of bars
for i, v in enumerate(missing_percent.sort_values(ascending=False)):
    plt.text(i, v + 1, f"{v:.2f}%", ha='center', va='bottom', fontsize=10, color='black')

plt.title('Percentage of Missing Values by Column')
plt.xlabel('Columns')
plt.ylabel('Percentage Missing')
plt.xticks(rotation=45)
plt.show()



### What did you know about your dataset?

1. The Airline Reviews dataset has 131,895 rows and 17 columns.
2. There are 70,711 duplicate rows, and these were dropped.
3. About 69% of the values in the 'aircraft' column are missing.


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
reviews_df.columns

In [None]:
# Dataset Describe
reviews_df.describe()

### Variables Description

There are 17 columns present in the dataset.
1. **airline :** Name of the airline.
2. **overall :** Overall rating given to the trip, between 1 and 10.
3. **author :** Author of the review
4. **review_date :** Date of the review
5. **customer_review :** Review from the customer on the travel experience.
6. **aircraft :** Type of the aircraft
7. **traveller_type :** Type of the traveller (e.g. business, leisure)
8. **cabin :** Type of cabin
9. **route :** Route of the flight (origin and destination)
10. **date_flown :** Date of the flight
11. **seat_comfort :** Rated between 1-5
12. **cabin_service :** Rated between 1-5
13. **food_bev :** Rated between 1-5
14. **entertainment :** Rated between 1-5
15. **ground_service :** Rated between 1-5
16. **value_for_money :** Rated between 1-5
17. **recommended :** target variable (yes or no)
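The rating ranges listed above can be checked against the data before any modelling. The following is a minimal sketch, assuming a small made-up frame `sample_df` in place of `reviews_df`; the helper `out_of_range_counts` is hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for reviews_df (values are made up)
sample_df = pd.DataFrame({
    "overall": [1, 10, 7, np.nan, 11],
    "seat_comfort": [1, 5, 3, 0, np.nan],
    "value_for_money": [2, 4, 5, 1, 3],
})

# Expected (min, max) per column, taken from the variable descriptions
expected_ranges = {
    "overall": (1, 10),
    "seat_comfort": (1, 5),
    "value_for_money": (1, 5),
}

def out_of_range_counts(df, ranges):
    """Count non-missing values that fall outside the expected range."""
    counts = {}
    for col, (low, high) in ranges.items():
        values = df[col].dropna()
        counts[col] = int((~values.between(low, high)).sum())
    return pd.Series(counts)

violations = out_of_range_counts(sample_df, expected_ranges)
print(violations)
```

Missing values are ignored here on purpose, since they are counted separately in the missing-value analysis.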

### Check Unique Values for each variable.

In [None]:
# Number of unique values in each column

unique_values = reviews_df.nunique()
print("Number of Unique Values in Each Column:\n", unique_values)


In [None]:
# unique values of the column 'traveller_type'

unique_values = reviews_df['traveller_type'].unique()
print("Unique Values in 'traveller_type' Column:\n", unique_values)

In [None]:
# unique values of the column 'cabin'

unique_values = reviews_df['cabin'].unique()
print("Unique Values in 'cabin' Column:\n", unique_values)

In [None]:
# unique values of the column 'recommended'

unique_values = reviews_df['recommended'].unique()
print("Unique Values in 'recommended' Column:\n", unique_values)

In [None]:
reviews_df.head()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# We have dropped duplicate rows.

### What all manipulations have you done and insights you found?

1. We have dropped duplicate rows.

2. Found the total number of missing values in each column and visualized it using graph.
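The wrangling steps described above can be collected into one reusable function. This is only a sketch under stated assumptions: `raw_df` is a made-up sample, `basic_clean` is a hypothetical helper, and the extra `dropna(how="all")` step (removing a row that is empty in every column, if one survives deduplication) is a suggestion rather than something the notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for the raw reviews (values are made up)
raw_df = pd.DataFrame({
    "airline": ["A", "A", np.nan, np.nan, "B"],
    "overall": [7.0, 7.0, np.nan, np.nan, 3.0],
    "recommended": ["yes", "yes", np.nan, np.nan, "no"],
})

def basic_clean(df):
    """Drop exact duplicate rows, then any row that is empty in every column."""
    cleaned = df.drop_duplicates()
    cleaned = cleaned.dropna(how="all")
    return cleaned.reset_index(drop=True)

clean_df = basic_clean(raw_df)

# Share of missing values per column after cleaning, in percent
missing_pct = (clean_df.isnull().mean() * 100).round(2)

print(clean_df.shape)
print(missing_pct)
```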

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

### Univariate Analysis

#### Chart - 1 : Top 10 Airlines with Largest Flights Conducted - Bar Chart

In [None]:
# Chart - 1 visualization code- Bar chart


# Count the number of rows (one per review) for each airline
top_airlines = reviews_df['airline'].value_counts().nlargest(10)

# Plotting
plt.figure(figsize=(10, 5))
ax = top_airlines.plot(kind='bar', color='green')

# Add value labels on top of bars
for i, v in enumerate(top_airlines):
    plt.text(i, v + 10, str(v), ha='center', va='bottom', fontsize=10, color='black')

plt.title('Top 10 Airlines with Largest Flights Conducted')
plt.xlabel('Airlines')
plt.ylabel('Number of Flights')
plt.xticks(rotation=45)
plt.show()




##### 1. Why did you pick the specific chart?

I chose a **bar chart** because it makes the **count of records** per airline easy to compare across the **top 10 airlines**, and the value labels show the exact counts.

##### 2. What is/are the insight(s) found from the chart?

The chart shows the **10 airlines with the most records** in the dataset. Because each row is a customer review, these counts measure **review volume**, used here as a proxy for flight volume, rather than the number of flights an airline actually operated.


*   **'Spirit Airlines'** has the highest count, at **2871**.



#### Chart - 2 : Top 10 Most Frequently Used Aircrafts - Bar chart

In [None]:
# Chart - 2 visualization code - bar chart

# Count the occurrences of each aircraft type
aircraft_counts = reviews_df['aircraft'].value_counts().head(10)  # Top 10 aircrafts

# Plot the chart
plt.figure(figsize=(10, 5))
ax = aircraft_counts.plot(kind='bar', color='skyblue')

# Add labels and title
plt.title('Top 10 Most Frequently Used Aircrafts')
plt.xlabel('Aircraft')
plt.ylabel('Number of Flights')
plt.xticks(rotation=45)

# Display values on top of bars
for i in ax.containers:
    ax.bar_label(i, fmt='%d', label_type='edge', fontsize=10)

plt.show()


##### 1. Why did you pick the specific chart?

I chose a **bar chart** because it effectively displays the **frequency of aircraft usage** with clear, distinct bars, making it easy to compare the top 10 models. It also allows for adding value labels for better readability.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that the **A320** is the aircraft named most often in the reviews, which suggests it is widely used. Because about 69% of the 'aircraft' values are missing, the **top 10 aircraft** counts cover only the reviews that state an aircraft type.

#### Chart - 3 : Top 10 Most Chosen Routes by Customers - Horizontal Bar chart

In [None]:
# Chart - 3 visualization - Horizontal Bar chart

# Top 10 most chosen routes
top_routes = reviews_df['route'].value_counts().head(10)
# Plotting the data
plt.figure(figsize=(12, 6))
ax = sns.barplot(x=top_routes.values, y=top_routes.index, palette='coolwarm')
# Display count labels near the bars
for i, value in enumerate(top_routes.values):
    ax.text(value + 5, i, f'{value}', color='black', ha='left', va='center', fontsize=10)
plt.title('Top 10 Most Chosen Routes by Customers')
plt.xlabel('Number of Flights')
plt.ylabel('Route')
plt.show()



##### 1. Why did you pick the specific chart?

The **horizontal bar chart** is chosen because it effectively displays the **top 10 most chosen routes** with clear labels and space for count values, making it easy to compare the popularity of different routes.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals the **routes that appear most often** in customer reviews, indicating popular travel corridors. This insight helps airlines **prioritize services and resources** on high-demand routes.
The most frequently reviewed route is BKK to LHR.

#### Chart - 4 : Customer Recommendations (Yes/No) -  Pie chart

In [None]:
# Chart - 4 visualization code - Pie chart

# Count the number of recommendations
recommendation_counts = reviews_df['recommended'].value_counts()

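# Plotting the pie chart
# Map colors by label so 'yes' is always green and 'no' is always red,
# regardless of which class is more frequent (unexpected labels fall back to grey)
color_map = {'yes': '#4caf50', 'no': '#f44336'}
pie_colors = [color_map.get(str(label).lower(), '#9e9e9e') for label in recommendation_counts.index]
plt.figure(figsize=(6, 6))
plt.pie(recommendation_counts, labels=recommendation_counts.index, autopct='%1.1f%%', startangle=140, colors=pie_colors)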
plt.title('Customer Recommendations (Yes/No)')
plt.show()

##### 1. Why did you pick the specific chart?

The **pie chart** is chosen because it effectively shows the **proportion of customer recommendations** in an easy-to-interpret visual format. It clearly highlights the **distribution between positive and negative feedback**.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that a **majority of reviewers do not recommend** the airline they flew, which points to **dissatisfaction with the service**. The smaller share of positive recommendations indicates **room for improvement**.

### Bivariate Analysis

#### Chart - 5 : Relationship between Traveller Type and Recommendation - Stacked bar chart

In [None]:
# Chart - 5 visualization code - stacked bar chart


# Create a stacked bar chart
fig, ax = plt.subplots(figsize=(12, 5))
stacked_bar = pd.crosstab(reviews_df['traveller_type'], reviews_df['recommended']).plot(
    kind='bar', stacked=True, colormap='viridis', ax=ax
)
# Add count labels on the bars
for container in ax.containers:
    ax.bar_label(container, label_type='center', fmt='%d', color='white', fontsize=10)
# Labels and title
plt.title('Relationship between Traveller Type and Recommendation')
plt.xlabel('Traveller Type')
plt.ylabel('Count')
plt.legend(title='Recommended', labels=['No', 'Yes'])
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

 It clearly shows the distribution of recommendations (Yes/No) within each traveller type, making it easy to compare how likely different traveller groups (e.g., solo, family, business) are to recommend the airline.

##### 2. What is/are the insight(s) found from the chart?

The chart shows how recommendations split within each traveller type, so groups such as business and leisure travellers can be compared.

Travellers in the 'Business', 'Couple Leisure', and 'Family Leisure' groups recommend the airline more often than 'Solo Leisure' travellers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights gained from the stacked bar chart can lead to a positive business impact.

*   The chart reveals that business travelers and family leisure travelers are more likely to recommend the airline, indicating higher satisfaction levels.
*   Targeted marketing strategies can be created to retain and attract more of these customer segments by offering tailored benefits, such as loyalty programs, family-friendly packages, or business-class discounts.
*    Improving the travel experience for solo leisure travelers, who are less likely to recommend the airline, can also enhance overall customer satisfaction.
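A stacked bar chart of counts can hide differences in group size, so the comparison of traveller types is easier to verify with recommendation rates. The following is a minimal sketch on a made-up sample (`sample_df` is hypothetical and the 'yes'/'no' labels are assumed); it uses `pd.crosstab` with `normalize="index"` so that each row sums to 100%.

```python
import pandas as pd

# Hypothetical sample standing in for reviews_df (values are made up)
sample_df = pd.DataFrame({
    "traveller_type": ["Business", "Business", "Solo Leisure", "Solo Leisure",
                       "Solo Leisure", "Family Leisure", "Family Leisure", "Business"],
    "recommended": ["yes", "no", "no", "no", "yes", "yes", "yes", "yes"],
})

# Row-normalised crosstab: each row sums to 100, so groups of
# different sizes become directly comparable
rate_table = pd.crosstab(
    sample_df["traveller_type"], sample_df["recommended"], normalize="index"
) * 100

# Sort by the share of 'yes' answers, highest first
rate_table = rate_table.sort_values("yes", ascending=False).round(1)
print(rate_table)
```

The same pattern applies to any other categorical column, such as `cabin`.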

#### Chart - 6 : Relationship Between Cabin and Recommended - Grouped bar chart

In [None]:
# Chart - 6 visualization code - Grouped bar chart


# Count the occurrences of each combination of 'cabin' and 'recommended'
cabin_recommend_counts = reviews_df.groupby(['cabin', 'recommended']).size().unstack()
# Plotting the grouped bar chart
plt.figure(figsize=(12, 5))
cabin_recommend_counts.plot(kind='bar', stacked=False, colormap='cividis', ax=plt.gca())
# Add value labels on the bars
for bars in plt.gca().containers:
    plt.gca().bar_label(bars, fmt='%d', label_type='edge', fontsize=10)
plt.title('Relationship Between Cabin and Recommended')
plt.xlabel('Cabin Type')
plt.ylabel('Count')
plt.legend(title='Recommended', labels=['Not Recommended', 'Recommended'])
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.show()

##### 1. Why did you pick the specific chart?

The grouped bar chart is chosen because it effectively compares the relationship between **cabin classes** and **recommendations**, making it easy to visualize how preferences vary across different cabins. It clearly displays both the distribution and the recommendation patterns.



##### 2. What is/are the insight(s) found from the chart?

The chart reveals that **Business and Premium Economy passengers** are more likely to recommend the airline compared to **Economy passengers**, indicating higher satisfaction in premium cabins. **Economy class** shows a lower recommendation rate, suggesting potential service improvement areas.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights highlight that premium cabin passengers are more satisfied and likely to recommend the airline. This presents an opportunity to expand premium services and offer loyalty programs, attracting more high-value customers.

**Potential Negative Growth:**

The lower recommendation rate in Economy class suggests potential dissatisfaction, which could lead to negative reviews and reduced customer retention. Addressing service quality in Economy class is essential to prevent losing price-sensitive customers.

### Multivariate Analysis

#### Chart - 7 : Overall Service Rating by Cabin Class and Traveller Type - Grouped Bar chart

In [None]:
# Chart - 7 visualization code - Grouped Bar chart

# Create a grouped DataFrame for average service rating by cabin and traveller type
service_by_cabin_traveller = reviews_df.groupby(['cabin', 'traveller_type'])['cabin_service'].mean().unstack()
# Plotting Grouped Bar Chart
plt.figure(figsize=(14, 4))
service_by_cabin_traveller.plot(kind='bar', colormap='viridis', ax=plt.gca())
plt.title('Overall Service Rating by Cabin Class and Traveller Type (Grouped)')
plt.xlabel('Cabin Class')
plt.ylabel('Average Service Rating')
plt.xticks(rotation=45)
plt.legend(title='Traveller Type')
plt.show()


##### 1. Why did you pick the specific chart?


The grouped bar chart is chosen because it allows for a clear side-by-side comparison of service ratings across different 'cabin' classes and 'traveller_type' categories, making it easy to identify patterns and differences.

##### 2. What is/are the insight(s) found from the chart?

The average cabin service rating is poorest in Economy class.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Positive Business Impact:
The insight regarding higher service ratings in Business and First Class highlights the effectiveness of premium services. IndiGo can leverage this by promoting loyalty programs and offering exclusive deals to frequent Business travellers, boosting customer retention and profitability.


* Potential Negative Growth:
The poor service ratings in Economy Class could deter repeat customers and harm the airline's reputation. To prevent negative growth, IndiGo should improve economy-class service by enhancing comfort, addressing customer complaints, and providing better value-for-money experiences.

#### Chart - 8 : Overall Service Rating Over Time - Line Chart

In [None]:
# Chart - 8 visualization code - Line Chart

#  Convert 'review_date' to datetime format
reviews_df['review_date'] = pd.to_datetime(reviews_df['review_date'], errors='coerce')

#  Extract Year-Month for grouping
reviews_df['year_month'] = reviews_df['review_date'].dt.to_period('M')

#  Group by Year-Month and Calculate Average Overall Rating
rating_over_time = reviews_df.groupby('year_month')['overall'].mean()

#  Plotting the Trend Over Time
plt.figure(figsize=(14, 5))
rating_over_time.plot(color='royalblue', marker='o', linestyle='-', linewidth=2, markersize=6)

plt.title('Overall Service Rating Over Time')
plt.xlabel('Time (Monthly)')
plt.ylabel('Average Overall Rating')
plt.grid(True)
plt.show()



##### 1. Why did you pick the specific chart?



The **line chart** is chosen because it effectively visualizes **trends over time**, making it easy to identify **patterns, fluctuations, and long-term changes** in the overall service rating. It clearly shows whether the rating is improving, declining, or remaining stable over different periods.

##### 2. What is/are the insight(s) found from the chart?



The **overall service rating** shows **fluctuations over time**, indicating **variations in customer satisfaction** during different periods. A **consistent decline or sharp drops** in specific months or years may highlight **service issues** or negative incidents, while upward trends suggest **improvements or positive experiences**.

* There is a drastic decline in the overall rating around 2008 and 2009.
* Within a year, around 2010, the overall rating spikes sharply upward.
* From 2010 to 2019, the overall rating given by customers declines gradually.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

*  **Will the gained insights help create a positive business impact?**
Yes, the insights can positively impact the business by identifying periods of poor customer satisfaction. IndiGo can analyze the root causes behind the 2008-2009 decline (e.g., service issues, operational challenges) and ensure similar mistakes are avoided. Additionally, they can replicate the strategies that led to the 2010 improvement to enhance customer satisfaction.

* **Are there any insights that lead to negative growth? Justify with specific reason.**
Yes, the gradual decline in overall rating from 2010 to 2019 indicates a deteriorating customer experience. If this trend continues, it could damage the airline’s reputation, reduce customer loyalty, and ultimately impact revenue growth. IndiGo needs to address customer concerns promptly to prevent further decline.
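Monthly averages can swing sharply when a month contains only a few reviews, so it may help to look at the yearly mean alongside the number of reviews behind it. The following is a minimal sketch on a made-up sample; `sample_df`, `dated`, and `yearly` are hypothetical names, not part of the notebook.

```python
import pandas as pd

# Hypothetical sample standing in for reviews_df (values are made up)
sample_df = pd.DataFrame({
    "review_date": ["2008-03-01", "2008-07-15", "2010-01-10",
                    "2010-05-02", "2010-11-23", "not a date"],
    "overall": [2, 4, 8, 9, 7, 5],
})

# Unparseable dates become NaT and are dropped before grouping
sample_df["review_date"] = pd.to_datetime(sample_df["review_date"], errors="coerce")
dated = sample_df.dropna(subset=["review_date"]).copy()
dated["year"] = dated["review_date"].dt.year

# Yearly mean rating together with the number of reviews it is based on
yearly = dated.groupby("year")["overall"].agg(mean_rating="mean", n_reviews="count")
print(yearly)
```

A year with a very small `n_reviews` should be read with caution, since a handful of reviews can move the mean a lot.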

#### Chart - 9 : Value for Money by Cabin Type - Violin Plot

In [None]:
# Chart - 9 visualization code - Violin plot

plt.figure(figsize=(12, 5))
sns.violinplot(x='cabin', y='value_for_money', data=reviews_df, palette='muted')

# Add labels and title
plt.title('Value for Money by Cabin Type (Violin Plot)')
plt.xlabel('Cabin Type')
plt.ylabel('Value for Money')
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.show()


##### 1. Why did you pick the specific chart?

The violin plot is used to visualize the distribution and density of value_for_money across different cabin types. It combines a box plot with a kernel density plot, making it easy to see both the distribution shape and variability in the data.

##### 2. What is/are the insight(s) found from the chart?

* Wider Distribution Indicates Variability: The broader shape of the violin plot in Business and First-class cabins shows greater variability in perceived value, meaning some passengers find it extremely valuable while others may not.

* Narrower Distribution Shows Consistency: The narrower shape in Economy indicates that most passengers rate the value similarly, reflecting a consistent but lower satisfaction level.

* Median Line Shows Central Tendency: The median lines reveal the typical value-for-money rating for each cabin class, helping to identify which classes are generally rated higher or lower.
* Business class and First class passengers are clearly more satisfied with the value they receive.
* The violins for Economy class and Premium Economy class are wide at the bottom, which shows that many of those passengers gave low value-for-money ratings.
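The shapes read from the violin plot can be backed by numbers. The following is a minimal sketch on a made-up sample; `sample_df` and the helpers `q25`, `q75`, and `share_low` are hypothetical, and treating ratings of 1 or 2 as "low" is an assumption made only for illustration.

```python
import pandas as pd

# Hypothetical sample standing in for reviews_df (values are made up)
sample_df = pd.DataFrame({
    "cabin": ["Economy Class"] * 4 + ["Business Class"] * 4,
    "value_for_money": [1, 1, 2, 4, 3, 4, 5, 5],
})

def q25(s):
    """Lower quartile of the ratings."""
    return s.quantile(0.25)

def q75(s):
    """Upper quartile of the ratings."""
    return s.quantile(0.75)

def share_low(s):
    """Percentage of ratings that are 1 or 2 (assumed 'low' threshold)."""
    return (s <= 2).mean() * 100

# Median, quartiles and share of low ratings per cabin
summary = sample_df.groupby("cabin")["value_for_money"].agg(
    median="median", q25=q25, q75=q75, share_low=share_low
)
print(summary)
```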


#### Chart - 10 : Overall Rating vs Recommended - Bar chart

In [None]:
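# Chart - 10 visualization code - Bar chart

# 'recommended' holds 'yes'/'no' strings, so convert it to 1/0
# in order to compute a recommendation rate
rec_df = reviews_df.dropna(subset=['overall', 'recommended']).copy()
rec_df['recommended_flag'] = (rec_df['recommended'].astype(str).str.lower() == 'yes').astype(int)

# Percentage of reviewers who recommend the airline at each overall rating
rec_rate = rec_df.groupby('overall')['recommended_flag'].mean() * 100

# Create the bar plot
plt.figure(figsize=(12, 4))
sns.barplot(x=rec_rate.index, y=rec_rate.values, palette='coolwarm')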

# Add labels and title
plt.title('Overall Rating vs Recommended')
plt.xlabel('Overall Rating')
plt.ylabel('Percentage Recommended')
plt.grid(axis='y', linestyle='--', alpha=0.5)

# Display the chart
plt.show()

##### 1. Why did you pick the specific chart?

The bar chart shows the relationship between overall ratings and the percentage of recommendations, making it easy to compare how customer satisfaction influences recommendations. The color contrast helps in quickly identifying patterns and trends.

##### 2. What is/are the insight(s) found from the chart?

Passengers who gave higher overall ratings are significantly more likely to recommend the airline. Conversely, lower overall ratings correspond to fewer recommendations, indicating that customer satisfaction strongly impacts referral likelihood.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights will help create a positive business impact. By identifying the strong correlation between overall rating and recommendations, the airline can focus on improving service quality to enhance customer satisfaction, ultimately driving more referrals and customer retention.

#### Chart - 11 :  Distribution of variables - Histogram

In [None]:
# Chart - 11 visualization code - Histogram

# Plot histograms for numerical variables
numerical_cols = reviews_df.select_dtypes(include=['float64', 'int64']).columns

# Set figure size
plt.figure(figsize=(14, 10))
# Loop through each numerical column and create subplots
for i, col in enumerate(numerical_cols, 1):
    plt.subplot(4, 4, i)  # Adjust grid size based on number of columns
    plt.hist(reviews_df[col].dropna(), bins=20, color='skyblue', edgecolor='black')
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('Frequency')
# Adjust layout
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

The histogram was chosen because it effectively visualizes the distribution of numerical variables, helping to identify patterns, such as data skewness, outliers, and central tendencies. It provides a clear overview of how frequently different values occur in each variable.

##### 2. What is/are the insight(s) found from the chart?

Among the rating columns, cabin service has the best ratings.

The overall rating is not satisfactory.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Will the gained insights help create a positive business impact?

Yes, the gained insights can drive a positive business impact.

Since cabin service is rated higher, the airline can leverage this strength in their marketing campaigns by promoting their superior cabin service, attracting more customers.
The focus on maintaining and enhancing cabin service can help in boosting customer satisfaction and loyalty, ultimately leading to better reviews and recommendations.

* Are there any insights that lead to negative growth? Justify with specific reason.
Yes, the overall rating being unsatisfactory indicates a potential risk of negative growth.

Low overall ratings may reflect poor customer experiences in certain areas, discouraging new or returning passengers.
This could lead to reduced bookings and a decline in the airline's reputation if the underlying service issues (other than cabin service) are not addressed.

#### Chart - 12 : Heatmap

In [None]:
# Chart Visualization code-  Heatmap

# Select only numerical columns for the heatmap
numerical_cols = reviews_df.select_dtypes(include=['float64', 'int64'])

# Plotting the heatmap
plt.figure(figsize=(12, 6))
sns.heatmap(numerical_cols.corr(), annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5, square=True, cbar=True)

# Add title
plt.title('Correlation Heatmap of Numerical Columns', fontsize=16)
plt.show()


##### 1. Why did you pick the specific chart?

A heatmap is chosen because it effectively visualizes the correlation between numerical variables, helping to identify strong or weak relationships. It provides a clear overview of feature dependencies, aiding in feature selection and multicollinearity detection.

##### 2. What is/are the insight(s) found from the chart?

The heatmap reveals relationships and patterns between numerical variables, highlighting which features are strongly or weakly correlated. This helps identify key influencers and potential multicollinearity issues in the dataset.
Here we can clearly see a positive correlation of 0.89 between the overall and value_for_money columns.
There is also a negative correlation of -0.60 between the entertainment and cabin_service columns.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* The **positive correlation** between 'overall' and 'value_for_money' (0.89) suggests that **improving the value-for-money experience** directly boosts overall satisfaction, enabling the airline to **enhance customer retention** and loyalty.

* The **negative correlation** between 'entertainment' and 'cabin_service' (-0.60) indicates that the two ratings tend to move in opposite directions. Correlation alone does not show that one causes the other, but the pattern is worth investigating because it may point to **negative customer experiences** and dissatisfaction.
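To read the strongest relationships off a correlation matrix without scanning the heatmap, the pairs can be ranked by absolute correlation. The following is a minimal sketch on a made-up sample (`sample_df` is hypothetical). It uses Spearman rank correlation, a swapped-in choice that suits ordinal ratings, whereas the heatmap above uses the pandas default (Pearson).

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for the numerical columns (values are made up)
sample_df = pd.DataFrame({
    "overall": [1, 3, 5, 7, 9, 10],
    "value_for_money": [1, 2, 3, 4, 5, 5],
    "entertainment": [5, 4, 4, 2, 1, 1],
})

# Spearman works on ranks, which suits 1-5 and 1-10 rating scales
corr = sample_df.corr(method="spearman")

# Keep only the upper triangle so each pair appears once and the diagonal is dropped
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Rank pairs by the strength of the relationship, ignoring the sign
pairs = pairs.sort_values(key=np.abs, ascending=False)
print(pairs)
```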

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

The three hypothetical statements can be stated as follows:

* **Hypothesis 1: Service Quality and Recommendations**

Null Hypothesis (H₀): There is no significant relationship between the airline's cabin service rating and whether a customer recommends the airline.
Alternative Hypothesis (H₁): There is a significant relationship between the cabin service rating and customer recommendations.

* **Hypothesis 2: Traveller Type and Value for Money**

Null Hypothesis (H₀): Traveller type (Business, Leisure, etc.) does not significantly affect the perceived value for money rating.
Alternative Hypothesis (H₁): Traveller type has a significant impact on the value for money rating.

* **Hypothesis 3: Overall Rating and Recommendation Likelihood**

Null Hypothesis (H₀): There is no significant difference in the overall rating between customers who recommend and those who don't.
Alternative Hypothesis (H₁): Customers who recommend the airline give a higher overall rating than those who do not.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Hypothesis 1: Service Quality and Recommendations**

**Null Hypothesis (H₀):** There is no significant relationship between the airline's cabin service rating and whether a customer recommends the airline.

 **Alternative Hypothesis (H₁):** There is a significant relationship between the cabin service rating and customer recommendations.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value


# Create a contingency table between 'cabin_service' and 'recommended'
contingency_table = pd.crosstab(reviews_df['cabin_service'], reviews_df['recommended'])

# Perform Chi-Square Test
chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)

# Display the results
print("Chi-Square Statistic:", chi2)
print("Degrees of Freedom:", dof)
print("P-Value:", p_value)

# Interpretation
alpha = 0.05  # Significance level
if p_value < alpha:
    print("\n Reject the Null Hypothesis: There is a significant relationship between cabin service rating and customer recommendations.")
else:
    print("\n Fail to Reject the Null Hypothesis: There is no significant relationship between cabin service rating and customer recommendations.")

##### Which statistical test have you done to obtain P-Value?

A Chi-Square Test of Independence is used here. The P-Value is printed as 0.0 because it is so close to 0 that it underflows floating-point precision.

##### Why did you choose the specific statistical test?

* Categorical Variables: Both cabin_service and recommended are categorical, making the Chi-Square test appropriate.

* Test for Association: It evaluates whether there is a significant relationship between the two variables.
* Contingency Table Analysis: Compares observed and expected frequencies to detect dependencies.
* Non-Parametric Test: Suitable for categorical data without assuming normal distribution.
* Measures Independence: Tests if customer recommendations depend on cabin service ratings.
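With a large sample, a P-Value near 0 only says that an association exists, not how strong it is. A minimal sketch of an effect-size check (Cramér's V) is shown below. It runs on synthetic stand-in data, and the helper name `cramers_v` and the variable names are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from scipy import stats


def cramers_v(table: pd.DataFrame) -> float:
    """Cramér's V effect size for an r x c contingency table (0 = none, 1 = perfect)."""
    chi2, _, _, _ = stats.chi2_contingency(table)
    n = table.to_numpy().sum()
    min_dim = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim)))


# Synthetic stand-in data (NOT the project's reviews_df)
rng = np.random.default_rng(0)
rating = rng.integers(1, 6, size=2000)  # ratings 1..5
# Probability of recommending rises with the rating
recommended = np.where(rng.random(2000) < rating / 6, "yes", "no")

demo_table = pd.crosstab(rating, recommended)
v = cramers_v(demo_table)
print(f"Cramér's V: {v:.3f}")
```

On the real data, the same helper could be applied to the `contingency_table` built in the cell above.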

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Hypothesis 2: Traveller Type and Value for Money**

**Null Hypothesis (H₀):** Traveller type (Business, Leisure, etc.) does not significantly affect the perceived value for money rating.

**Alternative Hypothesis (H₁):** Traveller type has a significant impact on the value for money rating.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value-
# Chi-Square Test of Independence

# Create a contingency table
contingency_table = pd.crosstab(reviews_df['traveller_type'], reviews_df['value_for_money'])

# Perform the Chi-Square test
chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)

# Display results
print("Chi-Square Statistic:", chi2)
print("Degrees of Freedom:", dof)
print("P-Value:", p_value)

# Interpretation
if p_value < 0.05:
    print("Reject Null Hypothesis: Significant relationship between traveller type and value for money.")
else:
    print("Fail to Reject Null Hypothesis: No significant relationship between traveller type and value for money.")


##### Which statistical test have you done to obtain P-Value?

Chi-Square Test of Independence.

##### Why did you choose the specific statistical test?

Chi-Square Test of Independence is used here, because
* Both traveller_type and value_for_money are categorical variables.
* The Chi-Square test evaluates whether there is a statistically significant association between the two categorical variables.
* Suitable for analyzing frequency distribution across categories.
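The Chi-Square test is an approximation that is usually considered reliable when most expected cell counts are at least 5. The sketch below shows one way to check this from the `expected` array returned by `chi2_contingency`. The data is synthetic and the category labels are assumed for illustration only.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in data (NOT the project's reviews_df); category labels are assumptions
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "traveller_type": rng.choice(
        ["Business", "Couple Leisure", "Family Leisure", "Solo Leisure"], size=1500
    ),
    "value_for_money": rng.integers(1, 6, size=1500),  # ratings 1..5
})

table = pd.crosstab(demo["traveller_type"], demo["value_for_money"])
chi2, p_value, dof, expected = stats.chi2_contingency(table)

# Share of cells whose expected count falls below the usual threshold of 5
share_small = float((expected < 5).mean())
print("Degrees of freedom:", dof)
print(f"Share of cells with expected count < 5: {share_small:.2%}")
```

If a large share of cells falls below the threshold, sparse categories can be merged before testing.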

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Hypothesis 3: Overall Rating and Recommendation Likelihood**

**Null Hypothesis (H₀):** There is no significant difference in the overall rating between customers who recommend and those who don't.

**Alternative Hypothesis (H₁):** Customers who recommend the airline give a higher overall rating than those who do not.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value : Independent Samples T-Test


# Splitting data into two groups: Recommended and Not Recommended
group_recommended = reviews_df[reviews_df['recommended'] == 'yes']['overall'].dropna()
group_not_recommended = reviews_df[reviews_df['recommended'] == 'no']['overall'].dropna()

# Perform Independent Samples T-Test
t_stat, p_value = stats.ttest_ind(group_recommended, group_not_recommended, equal_var=False)

# Display results
print("T-Statistic:", t_stat)
print("P-Value:", p_value)

# Interpretation
if p_value < 0.05:
    print("Reject Null Hypothesis: Significant difference in overall rating between recommended and not recommended customers.")
else:
    print("Fail to Reject Null Hypothesis: No significant difference in overall rating between the two groups.")

##### Which statistical test have you done to obtain P-Value?

Independent Samples T-Test (Welch's variant, since `equal_var=False` is passed).

##### Why did you choose the specific statistical test?

An Independent Samples T-Test is used because:
* overall is a numerical rating, treated here as a continuous variable.
* recommended is a binary categorical variable (yes/no).
* The T-test compares the mean overall rating between the two groups (recommended vs. not recommended).
* It tests whether the difference in means is statistically significant.
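The alternative hypothesis above is directional (recommenders rate higher), whereas `ttest_ind` is two-sided by default. A minimal sketch of a one-sided Welch test with a standardized mean difference is given below. It uses synthetic data, assumes SciPy 1.6 or later for the `alternative` argument, and all variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in ratings (NOT the project's data)
rng = np.random.default_rng(2)
rec = rng.normal(loc=8.0, scale=1.5, size=400)      # hypothetical "recommended" group
not_rec = rng.normal(loc=3.0, scale=2.0, size=500)  # hypothetical "not recommended" group

# One-sided Welch test: H1 is mean(rec) > mean(not_rec)
t_stat, p_one_sided = stats.ttest_ind(
    rec, not_rec, equal_var=False, alternative="greater"
)

# Standardized mean difference using the average of the two sample variances
pooled_sd = np.sqrt((rec.var(ddof=1) + not_rec.var(ddof=1)) / 2)
cohens_d = (rec.mean() - not_rec.mean()) / pooled_sd

print(f"T-Statistic: {t_stat:.2f}")
print(f"One-sided P-Value: {p_one_sided:.3g}")
print(f"Standardized mean difference: {cohens_d:.2f}")
```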


### Using another test

The Kolmogorov-Smirnov Test is used to check whether the data is normally distributed:
* p > 0.05 → No evidence against normality → Use T-Test.

* p ≤ 0.05 → Not normally distributed → Use Mann-Whitney U Test.

In [None]:
# Kolmogorov-Smirnov Test-

from scipy.stats import kstest

# Perform Kolmogorov-Smirnov test
stat, p_value = kstest(reviews_df['overall'].dropna(), 'norm', args=(reviews_df['overall'].mean(), reviews_df['overall'].std()))

print(f"Kolmogorov-Smirnov Statistic: {stat}")
print(f"P-Value: {p_value}")

# Interpretation
if p_value > 0.05:
    print("Fail to Reject Null Hypothesis: Data is normally distributed.")
else:
    print("Reject Null Hypothesis: Data is NOT normally distributed.")


Mann-Whitney U Test: applied when the overall rating is not normally distributed.


In [None]:
# Perform Mann-Whitney U Test
u_stat, p_value_u = stats.mannwhitneyu(group_recommended, group_not_recommended)
# Display results
print("Mann-Whitney U Statistic:", u_stat)
print("P-Value:", p_value_u)
# Interpretation
if p_value_u < 0.05:
    print("Reject Null Hypothesis: Significant difference between recommended and not recommended customers.")
else:
    print("Fail to Reject Null Hypothesis: No significant difference between the two groups.")


Since the 'overall' column is not normally distributed, the Mann-Whitney U Test was also conducted as a non-parametric alternative to the T-Test.
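As with the T-Test, the Mann-Whitney U Test can be run one-sided to match the directional hypothesis, and the U statistic can be turned into an easy-to-read effect size. The sketch below uses synthetic ratings and assumes SciPy 1.7 or later, where the returned statistic is the U value of the first sample.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in ratings on a 1..10 scale (NOT the project's data)
rng = np.random.default_rng(3)
rec = rng.integers(4, 11, size=300)     # hypothetical "recommended" group: 4..10
not_rec = rng.integers(1, 8, size=350)  # hypothetical "not recommended" group: 1..7

# One-sided test: H1 is that 'rec' tends to be larger than 'not_rec'
u_stat, p_value_u = stats.mannwhitneyu(rec, not_rec, alternative="greater")

# Common-language effect size: estimated probability that a random 'rec' rating
# exceeds a random 'not_rec' rating (ties count as one half)
cles = u_stat / (len(rec) * len(not_rec))

print("Mann-Whitney U Statistic:", u_stat)
print(f"One-sided P-Value: {p_value_u:.3g}")
print(f"Common-language effect size: {cles:.3f}")
```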

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
missing_values = reviews_df.isnull().sum()
print(missing_values)

In [None]:
# Percentage of missing values

missing_percentage = (missing_values / len(reviews_df)) * 100
print("Percentage of Missing Values:\n", round(missing_percentage,2))

About 69% of the values in the 'aircraft' column are missing, so we can drop that column.

In [None]:
# Copy of dataframe is created for preprocessing.

reviews_pre_df = reviews_df.copy()


In [None]:
# code to drop column 'aircraft'
reviews_pre_df.drop('aircraft', axis=1, inplace=True)


1. The 'author' column contributes little to the prediction task.
2. 'review_date' is of little importance, as it largely duplicates the information in the date flown.

Hence we can drop these two columns.

In [None]:
# drop the columns 'author' and 'review_date'
reviews_pre_df.drop(['author', 'review_date'], axis=1, inplace=True)

In [None]:
missing_values = reviews_pre_df.isnull().sum()
print(missing_values)

In [None]:
#  Drop rows with minor missing values
reviews_pre_df.dropna(subset=['airline', 'customer_review'], inplace=True)

In [None]:
# Select numerical columns
num_cols = reviews_pre_df.select_dtypes(include=['int64', 'float64']).columns.tolist()

# Display numerical columns
print("Numerical Columns:", num_cols)


In [None]:
# Median Imputation for Numerical Columns
for col in num_cols:
    # Assign the result back; chained inplace fillna is unreliable in recent pandas versions
    reviews_pre_df[col] = reviews_pre_df[col].fillna(reviews_df[col].median())

In [None]:
# Mode Imputation for Categorical Columns
cat_cols = ['traveller_type', 'cabin', 'route', 'recommended', 'date_flown', 'year_month']
for col in cat_cols:
    # Assign the result back; chained inplace fillna is unreliable in recent pandas versions
    reviews_pre_df[col] = reviews_pre_df[col].fillna(reviews_df[col].mode()[0])

In [None]:
# Verify if missing values are handled
print("Missing values after imputation:")
print(reviews_pre_df.isnull().sum())

#### What all missing value imputation techniques have you used and why did you use those techniques?

1. Dropped Columns: 'aircraft' was dropped because about 69% of its values were missing; 'author' and 'review_date' were dropped because they add little information.
2. Dropped Rows with Rare Missingness: Rows with missing values in 'airline' or 'customer_review' were dropped.
3. Median Imputation is used for numerical columns, as the median is not pulled by skewed rating distributions.
4. Mode Imputation is used for categorical columns, filling gaps with the most frequent category.
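The imputation above is computed on the full dataset before splitting. As an alternative sketch, the same median and mode strategies can be fitted on the training rows only with scikit-learn's `SimpleImputer`, so that no statistics from the test rows leak into training. The tiny DataFrame and its column values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Tiny made-up example (NOT the project's data)
demo = pd.DataFrame({
    "seat_comfort": [1.0, np.nan, 3.0, 4.0, 5.0, np.nan, 2.0, 4.0],
    "cabin": ["Economy", "Economy", np.nan, "Business",
              "Economy", "Economy", np.nan, "Business"],
})
train, test = train_test_split(demo, test_size=0.25, random_state=0)

num_imputer = SimpleImputer(strategy="median")         # numerical columns
cat_imputer = SimpleImputer(strategy="most_frequent")  # categorical columns

# Fit on the training rows only, then reuse the learned values on the test rows
train_num = num_imputer.fit_transform(train[["seat_comfort"]])
test_num = num_imputer.transform(test[["seat_comfort"]])
train_cat = cat_imputer.fit_transform(train[["cabin"]])
test_cat = cat_imputer.transform(test[["cabin"]])

print("Learned median:", num_imputer.statistics_[0])
print("Learned mode:", cat_imputer.statistics_[0])
```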


### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

# Plotting box plots for numerical columns
plt.figure(figsize=(16, 10))
for i, col in enumerate(num_cols, 1):
    plt.subplot(3, 3, i)
    sns.boxplot(x=reviews_df[col], color='skyblue')
    plt.title(f'Box Plot of {col}')
plt.tight_layout()
plt.show()


In [None]:
# Plotting distribution for numerical columns- histogram
plt.figure(figsize=(16, 10))
for i, col in enumerate(num_cols, 1):
    plt.subplot(3, 3, i)
    sns.histplot(reviews_df[col], kde=True, color='orange', bins=30)
    plt.title(f'Distribution of {col}')
plt.tight_layout()
plt.show()


##### What all outlier treatment techniques have you used and why did you use those techniques?

The data was checked for outliers using box plots and histograms. No visible outliers were found.
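The visual check can be complemented with a numeric one. The sketch below counts values outside the usual 1.5 × IQR whiskers that a box plot draws. The helper name `iqr_outlier_count` and the two small series are illustrative assumptions.

```python
import pandas as pd


def iqr_outlier_count(series: pd.Series, k: float = 1.5) -> int:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return int(((series < lower) | (series > upper)).sum())


# Made-up examples: a bounded rating column and one with an extreme value
demo_ratings = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 5])
demo_with_extreme = pd.Series([3, 3, 3, 4, 4, 4, 4, 5, 5, 40])

print(iqr_outlier_count(demo_ratings))       # 0
print(iqr_outlier_count(demo_with_extreme))  # 1
```

Applied in a loop over `num_cols`, this would give one outlier count per numerical column.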


### 3. Categorical Encoding

In [None]:
reviews_pre_df.head()

In [None]:
# Encode your categorical columns
# Identify categorical columns
categorical_cols = reviews_pre_df.select_dtypes(include=['object', 'category']).columns
print("Categorical Columns:", categorical_cols)

In [None]:
# Encoding binary values using Label Encoding
label_encoder = LabelEncoder()

# Encode all binary columns ('recommended' has binary values)
binary_cols = ['recommended']  # Add more binary columns if any
for col in binary_cols:
    reviews_pre_df[col + '_encoded'] = label_encoder.fit_transform(reviews_pre_df[col])

# Check encoding
reviews_pre_df[['recommended', 'recommended_encoded']].head()


In [None]:
# For columns with multiple categories (e.g., traveller_type, cabin), using One-Hot Encoding.

# One-Hot Encoding for multi-class categorical columns
multi_class_cols = ['traveller_type', 'cabin']

# Perform One-Hot Encoding
reviews_pre_df = pd.get_dummies(reviews_pre_df, columns=multi_class_cols, drop_first=True)

# Check the new columns
print(reviews_pre_df.columns)

In [None]:
reviews_pre_df.head()

In [None]:
# Frequency encoding(For High Cardinality Columns)- with many unique categories.

# Frequency encoding
freq_encoding = reviews_pre_df['route'].value_counts(normalize=True)
reviews_pre_df['route_encoded'] = reviews_pre_df['route'].map(freq_encoding)

# Check encoding
reviews_pre_df[['route', 'route_encoded']].head()

In [None]:
# verify the encoded columns

# Check the encoded dataset
print(reviews_pre_df.head())
reviews_pre_df.shape

In [None]:
reviews_pre_df.info()

In [None]:
# Frequency Encoding
airline_freq = reviews_pre_df['airline'].value_counts(normalize=True)  # Frequency of each airline
reviews_pre_df['airline_freq_encoded'] = reviews_pre_df['airline'].map(airline_freq)

# Drop original column to avoid redundancy
reviews_pre_df.drop('airline', axis=1, inplace=True)

print(" Airline column encoded successfully!")
reviews_pre_df[['airline_freq_encoded']].head()


In [None]:
# Dropping customer_review column- as it is not directly useful here.
reviews_pre_df.drop('customer_review', axis=1, inplace=True)
print(" Customer review column dropped")

# Dropping original route column- as it is already encoded
reviews_pre_df.drop('route', axis=1, inplace=True)
print(" Route column dropped (already encoded)")

In [None]:
# Dropping the 'recommended' column
reviews_pre_df.drop('recommended', axis=1, inplace=True)
print(" 'recommended' column dropped successfully")

In [None]:
# Ensure date_flown is in datetime format
reviews_pre_df['date_flown'] = pd.to_datetime(reviews_pre_df['date_flown'], errors='coerce')

# Convert date to ordinal format
reviews_pre_df['date_flown_ordinal'] = reviews_pre_df['date_flown'].map(lambda x: x.toordinal() if pd.notnull(x) else np.nan)

# Drop the original 'date_flown' column
reviews_pre_df.drop('date_flown', axis=1, inplace=True)

print(" date_flown converted to ordinal format and original column dropped.")

Converting 'year_month' to a float (fractional year)

In [None]:
# Convert period to a fractional-year float (year + month / 12)
reviews_pre_df['year_month_float'] = reviews_pre_df['year_month'].dt.year + (reviews_pre_df['year_month'].dt.month / 12)

# Drop the original period column
reviews_pre_df.drop('year_month', axis=1, inplace=True)

print(" year_month converted to fractional-year float format and original column dropped.")
print(reviews_pre_df[['year_month_float']].head())


In [None]:
# Display final dataset info
print(reviews_pre_df.info())


In [None]:
# check for missing values
print(reviews_pre_df.isnull().sum())

In [None]:
reviews_pre_df.head()

In [None]:
# Drop the 'date_flown_ordinal' column
reviews_pre_df.drop('date_flown_ordinal', axis=1, inplace=True)

print(" Column 'date_flown_ordinal' has been dropped successfully.")


In [None]:
reviews_pre_df.shape

#### What all categorical encoding techniques have you used & why did you use those techniques?

1. Label Encoding is used for the binary column 'recommended'.
2. One-Hot Encoding is used for columns with a few categories: cabin, traveller_type.
3. Frequency Encoding is used for high-cardinality columns: route, airline. This avoids creating one new column per unique value.
4. The 'customer_review' column is dropped, as it is not directly useful here.
5. 'date_flown' is converted to datetime and then to an ordinal value (dropped later), and 'year_month' is converted to a float.
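Frequency encoding replaces each category with its relative frequency. The sketch below shows the idea on made-up route codes, learning the mapping from the training rows only and assigning 0.0 to categories not seen there. The route values and the fallback of 0.0 are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up route codes (NOT taken from the project's data)
train = pd.DataFrame({"route": ["DEL-BOM", "DEL-BOM", "BLR-DEL", "DEL-BOM"]})
test = pd.DataFrame({"route": ["BLR-DEL", "CCU-MAA"]})  # "CCU-MAA" is unseen in train

# Learn relative frequencies from the training rows
freq = train["route"].value_counts(normalize=True)

train["route_encoded"] = train["route"].map(freq)
# Categories missing from the mapping become NaN, so fall back to 0.0
test["route_encoded"] = test["route"].map(freq).fillna(0.0)

print(train)
print(test)
```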

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing words containing digits.

In [None]:
# Remove URLs & Remove words containing digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

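Not applicable. The 'customer_review' column was dropped during encoding, so no text normalization was performed.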

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

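Not applicable. The 'customer_review' column was dropped during encoding, so no text vectorization was performed.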

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

In [None]:
# Drop one feature from pairs with correlation > 0.85(threshold).

# Identify highly correlated features
corr_matrix = reviews_pre_df.corr(numeric_only=True)
high_corr_pairs = set()

# Threshold for correlation
threshold = 0.85

# Identify pairs with high correlation
for i in range(len(corr_matrix.columns)):
    for j in range(i):
        if abs(corr_matrix.iloc[i, j]) > threshold:
            col1 = corr_matrix.columns[i]
            col2 = corr_matrix.columns[j]
            high_corr_pairs.add((col1, col2))

# Display correlated pairs
print("Highly correlated pairs:")
for pair in high_corr_pairs:
    print(pair)

# Drop one feature from each pair (the second one), skipping the target and removing duplicates
features_to_drop = sorted({pair[1] for pair in high_corr_pairs if pair[1] != 'recommended_encoded'})

# Drop the features
reviews_pre_df.drop(features_to_drop, axis=1, inplace=True)
print(f" Dropped highly correlated features: {features_to_drop}")


##### What all feature selection methods have you used  and why?

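* Correlation-based filtering: one feature from each pair with an absolute correlation above 0.85 was dropped to reduce multicollinearity.
* Manual selection: columns judged irrelevant or unusable ('aircraft', 'author', 'review_date', 'customer_review') were dropped earlier.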
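The correlation filter from the cell above can also be written as a small reusable function that leaves the target out of the correlation matrix, so a feature is never removed just because it correlates with the label. This is a sketch on synthetic data. The function name `correlated_features_to_drop` and the column names are hypothetical.

```python
import numpy as np
import pandas as pd


def correlated_features_to_drop(df: pd.DataFrame, target: str, threshold: float = 0.85) -> list:
    """Return the later column of every feature pair with |correlation| > threshold."""
    corr = df.drop(columns=[target]).corr(numeric_only=True).abs()
    # Keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]


# Synthetic stand-in data (NOT the project's reviews_pre_df)
rng = np.random.default_rng(4)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.01, size=200),  # nearly identical to "a"
    "b": rng.normal(size=200),                       # unrelated feature
    "target": (a > 0).astype(int),
})

to_drop = correlated_features_to_drop(demo, target="target")
print("Features to drop:", to_drop)
```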


### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data
# Already performed

The data was transformed to get better results when applying Machine Learning models:
1. Categorical variables were encoded.
2. Missing values were handled.
3. Outliers were checked (none were found).
4. Datetime conversion was done.


### 6. Data Scaling

In [None]:
reviews_pre_df.head()

In [None]:
# Z score scaling

#  Dynamically select numerical columns, excluding the target so its class labels stay intact
num_cols = reviews_pre_df.select_dtypes(include=['float64', 'int64']).columns.drop('recommended_encoded', errors='ignore')

#  Apply Standardization
scaler = StandardScaler()
reviews_pre_df[num_cols] = scaler.fit_transform(reviews_pre_df[num_cols])

#  Display the scaled dataset
reviews_pre_df.head()

##### Which method have you used to scale your data and why?

I have used Z-score scaling (standardization) to scale the data, because it:

1. Centers each feature around a mean of 0 with a standard deviation of 1, which suits models that are sensitive to feature magnitude, such as Logistic Regression.

2. Is less distorted by extreme values than Min-Max scaling, where a single extreme value sets the range and compresses all other values.

3. Puts all features on a comparable scale, which helps the solver converge and keeps coefficients comparable.
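One common way to apply standardization is inside a scikit-learn `Pipeline`, so the scaler is fitted on the training features only and the target is never scaled. The sketch below illustrates this on synthetic data. It is an alternative arrangement, not the approach used in the cells above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the project's features)
rng = np.random.default_rng(5)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(500, 3))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=500) > 5.0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The scaler learns mean and std from X_tr only; y is passed through untouched
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
print("Scaler means learned from the training set:", pipe.named_steps["standardscaler"].mean_)
print(f"Test accuracy: {test_accuracy:.3f}")
```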

### 7. Dimensionality Reduction

In [None]:
reviews_pre_df.shape

##### Do you think that dimensionality reduction is needed? Explain Why?

 No, it is not required for this dataset because:

1. Low Feature Count: The dataset has only 16 columns, which is manageable and does not suffer from the curse of dimensionality.

2. Feature-to-Row Ratio: With 61,183 rows, the dataset has sufficient samples per feature, making dimensionality reduction unnecessary.

3. Risk of Information Loss: Applying dimensionality reduction may remove important information, which could negatively impact model performance.

In [None]:
# Dimensionality Reduction
# Not required

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Not needed

### 8. Data Splitting


In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.


# Splitting the data (80% train, 20% test)
X = reviews_pre_df.drop('recommended_encoded', axis=1)  # Features
y = reviews_pre_df['recommended_encoded']  # Target

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=42,
                                                    stratify=y)

# Display the shapes
print(f"Training Set: {X_train.shape}, {y_train.shape}")
print(f"Test Set: {X_test.shape}, {y_test.shape}")


##### What data splitting ratio have you used and why?

Ratio: 80% for training and 20% for testing, because:
1. Sufficient Training Data: Allocating 80% of the data for training gives the model more samples to learn from, improving its accuracy and stability.

2. Reliable Test Evaluation: The 20% test set is large enough to give a stable estimate of how the model performs on unseen data. The split is stratified on the target, so both sets keep the same class proportions.
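The effect of `stratify=y` can be verified directly by comparing class shares in the two sets. The sketch below does this on a synthetic target whose class mix is an assumption chosen to be roughly balanced.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic target with an assumed class mix (NOT the project's data)
y_demo = np.array([0] * 534 + [1] * 466)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratification, the positive-class share is nearly identical in both sets
print(f"Train size: {len(y_tr)}, positive share: {y_tr.mean():.4f}")
print(f"Test size:  {len(y_te)}, positive share: {y_te.mean():.4f}")
```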



### 9. Handling Imbalanced Dataset

In [None]:
# to check for imbalanced data

# Check the distribution of the target variable
print("Class Distribution:")
print(y_train.value_counts(normalize=True))  # Percentage distribution
print(y_test.value_counts(normalize=True))

# Visualizing the class distribution
plt.figure(figsize=(10, 5))
sns.countplot(x=y_train, palette='coolwarm')
plt.title('Class Distribution in Training Set')
plt.xlabel('Class')
plt.ylabel('Count')
plt.show()


##### Do you think the dataset is imbalanced? Explain Why.

No, the dataset is not significantly imbalanced. The class distribution shows 53.43% for the "Not Recommended" class and 46.57% for the "Recommended" class, a difference of only about 7 percentage points.

Since both classes have fairly similar proportions, the dataset does not require class balancing techniques.
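The same comparison can be expressed as a majority-to-minority ratio, where values close to 1 indicate a balanced target. The sketch below builds a synthetic series that reproduces the shares quoted above and computes that ratio. The variable names are illustrative.

```python
import pandas as pd

# Synthetic target reproducing the quoted shares (53.43% vs 46.57%)
y_demo = pd.Series([0] * 5343 + [1] * 4657)

shares = y_demo.value_counts(normalize=True)
imbalance_ratio = shares.max() / shares.min()

print(shares)
print(f"Majority-to-minority ratio: {imbalance_ratio:.3f}")
```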

In [None]:
# Handling Imbalanced Dataset (If needed)
# Not required

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Not needed. The dataset is reasonably balanced.

## ***7. ML Model Implementation***

In [None]:
# Assuming 'X_train', 'X_test', 'y_train', and 'y_test' are already created
print(f"Training set size: {X_train.shape}, Test set size: {X_test.shape}")


### ML Model - 1 : Logistic Regression

In [None]:
reviews_pre_df.head()

In [None]:
reviews_pre_df.info()

In [None]:
# ML Model - 1 Implementation - Logistic Regression

#  Check the data type of the target variable
print("y_train data type:", y_train.dtype)
print("y_test data type:", y_test.dtype)

#  Convert target variable to discrete classes if it's continuous
if y_train.dtype != 'int' and y_train.dtype != 'bool':
    y_train = y_train.round().astype(int)
    y_test = y_test.round().astype(int)

#  Start Timer
start = time.time()

#  Train Logistic Regression Model
lr_model = LogisticRegression(max_iter=1000, random_state=42)
lr_model.fit(X_train, y_train)

#  Predictions on Train and Test Sets
y_train_pred = lr_model.predict(X_train)  # Train Set Predictions
y_test_pred = lr_model.predict(X_test)    # Test Set Predictions

#  Train Set Metrics
print("\n--- Logistic Regression - Train Set ---")
print(f"Accuracy: {accuracy_score(y_train, y_train_pred):.4f}")
print(f"ROC-AUC Score: {roc_auc_score(y_train, y_train_pred):.4f}")
print(f"Precision: {precision_score(y_train, y_train_pred):.4f}")
print(f"Recall: {recall_score(y_train, y_train_pred):.4f}")
print(f"F1-Score: {f1_score(y_train, y_train_pred):.4f}")
print("\nClassification Report (Train Set):")
print(classification_report(y_train, y_train_pred))

#  Test Set Metrics
print("\n--- Logistic Regression - Test Set ---")
print(f"Accuracy: {accuracy_score(y_test, y_test_pred):.4f}")
print(f"ROC-AUC Score: {roc_auc_score(y_test, y_test_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_test_pred):.4f}")
print(f"Recall: {recall_score(y_test, y_test_pred):.4f}")
print(f"F1-Score: {f1_score(y_test, y_test_pred):.4f}")
print("\nClassification Report (Test Set):")
print(classification_report(y_test, y_test_pred))

#  End Timer
end = time.time()
print(f"\nTime taken: {round(end - start, 2)} seconds")


In [None]:
# Performing cross validation
from sklearn.model_selection import cross_val_score

# Perform cross-validation
cv_scores = cross_val_score(lr_model, X_train, y_train, cv=5, scoring='accuracy')

# Display results
print("\nCross-Validation Scores:", cv_scores)
print("Mean Accuracy:", cv_scores.mean())
print("Standard Deviation:", cv_scores.std())


In [None]:
# Learning curve plot to check for overfitting:

from sklearn.model_selection import learning_curve

# Plot learning curve
train_sizes, train_scores, val_scores = learning_curve(
    lr_model, X_train, y_train, cv=5, scoring='accuracy', n_jobs=-1,
    train_sizes=np.linspace(0.1, 1.0, 10)
)

# Mean and std deviation for train and validation scores
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
val_mean = np.mean(val_scores, axis=1)
val_std = np.std(val_scores, axis=1)

# Plotting
plt.figure(figsize=(10, 5))
plt.plot(train_sizes, train_mean, label='Training Accuracy', color='blue')
plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, color='blue', alpha=0.2)
plt.plot(train_sizes, val_mean, label='Validation Accuracy', color='orange')
plt.fill_between(train_sizes, val_mean - val_std, val_mean + val_std, color='orange', alpha=0.2)

plt.xlabel('Training Set Size')
plt.ylabel('Accuracy')
plt.title('Learning Curve: Logistic Regression')
plt.legend()
plt.grid(True)
plt.show()


From the graph, there is no sign of overfitting: both training accuracy and validation accuracy are high.
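The ROC-AUC values in the Logistic Regression cells are computed from hard class predictions. ROC-AUC is normally computed from predicted probabilities, which use the full ranking of the samples. The sketch below contrasts the two on a synthetic dataset. The numbers it prints are not the project's results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in dataset (NOT the project's data)
X_demo, y_demo = make_classification(n_samples=1000, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

model = LogisticRegression(max_iter=1000, random_state=42).fit(X_tr, y_tr)

# ROC-AUC from hard 0/1 labels: only a single threshold is evaluated
auc_from_labels = roc_auc_score(y_te, model.predict(X_te))
# ROC-AUC from probabilities of the positive class: all thresholds are evaluated
auc_from_proba = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

print(f"ROC-AUC from labels:        {auc_from_labels:.4f}")
print(f"ROC-AUC from probabilities: {auc_from_proba:.4f}")
```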

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#  Train set metrics
train_metrics = [
    accuracy_score(y_train, y_train_pred),
    precision_score(y_train, y_train_pred),
    recall_score(y_train, y_train_pred),
    f1_score(y_train, y_train_pred),
    roc_auc_score(y_train, y_train_pred)
]

#  Test set metrics
test_metrics = [
    accuracy_score(y_test, y_test_pred),
    precision_score(y_test, y_test_pred),
    recall_score(y_test, y_test_pred),
    f1_score(y_test, y_test_pred),
    roc_auc_score(y_test, y_test_pred)
]

#  Metric names
metrics = ['Accuracy', 'Precision', 'Recall', 'F1-Score', 'ROC-AUC']

#  Plotting
fig, ax = plt.subplots(figsize=(10, 5))
x = range(len(metrics))

# Bar width
width = 0.35

#  Plotting train and test metrics
ax.bar([p - width/2 for p in x], train_metrics, width, label='Train Set', color='cornflowerblue')
ax.bar([p + width/2 for p in x], test_metrics, width, label='Test Set', color='lightcoral')

# Labels and formatting
ax.set_xlabel('Evaluation Metrics', fontsize=12)
ax.set_ylabel('Score', fontsize=12)
ax.set_title('Logistic Regression - Model Performance (Train vs Test)', fontsize=14)
ax.set_xticks(x)
ax.set_xticklabels(metrics)
ax.legend()

# Display score values above the bars
for i in range(len(metrics)):
    ax.text(i - width/2, train_metrics[i] + 0.01, f"{train_metrics[i]:.4f}", ha='center', va='bottom', fontsize=10, color='black')
    ax.text(i + width/2, test_metrics[i] + 0.01, f"{test_metrics[i]:.4f}", ha='center', va='bottom', fontsize=10, color='black')

plt.tight_layout()
plt.show()


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Hyperparameter tuning is not required in this case. The model already demonstrates high and stable performance, making additional tuning unnecessary.

##### Which hyperparameter optimization technique have you used and why?

Hyperparameter tuning is not essential in this case.

* The model already demonstrates high and stable performance, making additional tuning unnecessary.

* Cross-validation (5-fold, shown above) was still used to validate the model’s stability, but it is not needed for further improvement.

* Tuning may not significantly enhance the model’s performance and could introduce unnecessary complexity or increase runtime.
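The claim that tuning would add little can be checked cheaply with a small grid over the regularization strength `C`. The sketch below does this on a synthetic dataset and reports how much the mean cross-validated accuracy varies across the grid. The grid values are arbitrary illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in dataset (NOT the project's data)
X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=0)

# Small illustrative grid over the inverse regularization strength
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

scores = search.cv_results_["mean_test_score"]
spread = scores.max() - scores.min()
print("Best C:", search.best_params_["C"])
print(f"Accuracy spread across the grid: {spread:.4f}")
```

A small spread would support skipping further tuning; a large one would suggest it is worth revisiting.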


##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Tuning is not used here.

### ML Model - 2 : Decision Tree

In [None]:
# ML Model 2 : implementation - Decision Tree

#  Start Timer
start = time.time()

#  Initialize Decision Tree Model with default parameters
dt_model = DecisionTreeClassifier(random_state=42)

#  Train the model
dt_model.fit(X_train, y_train)

#  Predictions on Train and Test Sets
y_train_pred_dt = dt_model.predict(X_train)  # Train Set Predictions
y_test_pred_dt = dt_model.predict(X_test)    # Test Set Predictions

#  Train Set Metrics
train_metrics = {
    "Dataset": "Train Set",
    "Accuracy": accuracy_score(y_train, y_train_pred_dt),
    "Precision": precision_score(y_train, y_train_pred_dt),
    "Recall": recall_score(y_train, y_train_pred_dt),
    "F1-Score": f1_score(y_train, y_train_pred_dt),
    "ROC-AUC": roc_auc_score(y_train, y_train_pred_dt)
}

#  Test Set Metrics
test_metrics = {
    "Dataset": "Test Set",
    "Accuracy": accuracy_score(y_test, y_test_pred_dt),
    "Precision": precision_score(y_test, y_test_pred_dt),
    "Recall": recall_score(y_test, y_test_pred_dt),
    "F1-Score": f1_score(y_test, y_test_pred_dt),
    "ROC-AUC": roc_auc_score(y_test, y_test_pred_dt)
}

#  Create DataFrame for Comparison
metrics_df = pd.DataFrame([train_metrics, test_metrics])

#  Display Metrics
print("\n---  Decision Tree Model - Evaluation Metrics ---")
print(metrics_df)

#  Print Classification Reports
print("\n---  Classification Report (Train Set) ---")
print(classification_report(y_train, y_train_pred_dt))

print("\n---  Classification Report (Test Set) ---")
print(classification_report(y_test, y_test_pred_dt))

#  End Timer
end = time.time()
print(f"\n⏱ Time taken: {round(end - start, 2)} seconds")


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#  Extracting the metrics for visualization
metrics = ['Accuracy', 'Precision', 'Recall', 'F1-Score', 'ROC-AUC']
train_scores = [
    accuracy_score(y_train, y_train_pred_dt),
    precision_score(y_train, y_train_pred_dt),
    recall_score(y_train, y_train_pred_dt),
    f1_score(y_train, y_train_pred_dt),
    roc_auc_score(y_train, y_train_pred_dt)
]

test_scores = [
    accuracy_score(y_test, y_test_pred_dt),
    precision_score(y_test, y_test_pred_dt),
    recall_score(y_test, y_test_pred_dt),
    f1_score(y_test, y_test_pred_dt),
    roc_auc_score(y_test, y_test_pred_dt)
]

#  Plotting the scores
x = np.arange(len(metrics))
width = 0.35

fig, ax = plt.subplots(figsize=(10, 5))

#  Bars for Train and Test Scores
bars1 = ax.bar(x - width/2, train_scores, width, label='Train Set', color='skyblue')
bars2 = ax.bar(x + width/2, test_scores, width, label='Test Set', color='lightcoral')

#  Adding value labels above bars
for bar, score in zip(bars1, train_scores):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.005,
            f"{score:.4f}", ha='center', fontsize=10, color='black')

for bar, score in zip(bars2, test_scores):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.005,
            f"{score:.4f}", ha='center', fontsize=10, color='black')

#  Labels and Title
ax.set_xlabel('Evaluation Metrics', fontsize=12)
ax.set_ylabel('Score', fontsize=12)
ax.set_title('Decision Tree Model - Train vs Test Set Performance', fontsize=14)
ax.set_xticks(x)
ax.set_xticklabels(metrics)
ax.legend()

#  Grid and formatting
plt.ylim(0, 1.1)
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)


# Define the parameter grid for Decision Tree
param_grid = {
    'max_depth': [5, 10, 15, 20, None],        # Control depth of the tree
    'min_samples_split': [2, 5, 10, 15],        # Minimum samples to split a node
    'min_samples_leaf': [1, 5, 10],             # Minimum samples per leaf
    'criterion': ['gini', 'entropy'],           # Split criteria
    'random_state': [42]
}


In [None]:
# Apply gridsearchCV

# Initialize the Decision Tree model
dt_model = DecisionTreeClassifier()

# Set up GridSearchCV with 5-fold cross-validation
grid_search = GridSearchCV(estimator=dt_model,
                           param_grid=param_grid,
                           cv=5,                      # 5-fold cross-validation
                           scoring='accuracy',        # Use accuracy as the metric
                           n_jobs=-1,                 # Use all available CPU cores
                           verbose=2)

# Fit the model
grid_search.fit(X_train, y_train)

# Display the best parameters and best score
print("\nBest Parameters:", grid_search.best_params_)
print("Best Accuracy:", grid_search.best_score_)


In [None]:
# Evaluate the trained model

# Get the best model from GridSearchCV
best_dt_model = grid_search.best_estimator_

# Make predictions
y_pred_dt_tuned = best_dt_model.predict(X_test)

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred_dt_tuned)
roc_auc = roc_auc_score(y_test, y_pred_dt_tuned)
precision = precision_score(y_test, y_pred_dt_tuned)
recall = recall_score(y_test, y_pred_dt_tuned)
f1 = f1_score(y_test, y_pred_dt_tuned)

# Display metrics
print("\n--- Tuned Decision Tree ---")
print(f"Accuracy: {accuracy:.4f}")
print(f"ROC-AUC Score: {roc_auc:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")


##### Which hyperparameter optimization technique have you used and why?

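GridSearchCV is used here. The search space is small (5 × 4 × 3 × 2 = 120 parameter combinations), so an exhaustive search with 5-fold cross-validation (600 fits) is affordable and guarantees that the best combination in the grid is found.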

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

The evaluation metrics improve after tuning with GridSearchCV: test accuracy rises from 0.9067 to 0.9355, and F1-score from 0.8999 to 0.9296.

In [None]:
# Visualization - Decision Tree - Model Performance Before vs After Tuning

#  Metrics Before Tuning - Decision Tree
accuracy_before_dt = 0.9067
roc_auc_before_dt = 0.9063
precision_before_dt = 0.8987
recall_before_dt = 0.9012
f1_before_dt = 0.8999

#  Metrics After Tuning - Decision Tree (use actual values)
accuracy_after_dt = accuracy_score(y_test, y_pred_dt_tuned)
roc_auc_after_dt = roc_auc_score(y_test, y_pred_dt_tuned)
precision_after_dt = precision_score(y_test, y_pred_dt_tuned)
recall_after_dt = recall_score(y_test, y_pred_dt_tuned)
f1_after_dt = f1_score(y_test, y_pred_dt_tuned)

# Labels
metrics = ['Accuracy', 'ROC-AUC', 'Precision', 'Recall', 'F1-Score']

# Prepare Data for Plotting
before_dt = [accuracy_before_dt, roc_auc_before_dt, precision_before_dt, recall_before_dt, f1_before_dt]
after_dt = [accuracy_after_dt, roc_auc_after_dt, precision_after_dt, recall_after_dt, f1_after_dt]

# Plotting
x = np.arange(len(metrics))
width = 0.35

fig, ax = plt.subplots(figsize=(10, 5))

ax.bar(x - width/2, before_dt, width, label='Before Tuning', color='lightcoral')
ax.bar(x + width/2, after_dt, width, label='After Tuning', color='seagreen')

# Labels and Formatting
ax.set_xlabel('Evaluation Metrics', fontsize=12)
ax.set_ylabel('Score', fontsize=12)
ax.set_title('Decision Tree - Model Performance Before vs After Tuning', fontsize=14)
ax.set_xticks(x)
ax.set_xticklabels(metrics)
ax.legend()

# Display values on bars
for i in range(len(metrics)):
    ax.text(i - width/2, before_dt[i] + 0.005, f"{before_dt[i]:.4f}", ha='center', va='bottom', fontsize=10, color='black')
    ax.text(i + width/2, after_dt[i] + 0.005, f"{after_dt[i]:.4f}", ha='center', va='bottom', fontsize=10, color='black')

plt.tight_layout()
plt.show()


#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

**1. Accuracy (93.55%):** The model correctly predicts 93.55% of the cases, indicating overall good performance. However, accuracy alone may not reflect class imbalances.

**2. ROC-AUC Score (93.42%):** The model separates passengers who recommend the airline from those who do not, making it reliable for customer classification tasks. Because the score is computed from hard class predictions rather than predicted probabilities, it equals the balanced accuracy (the mean of the two per-class recall values).

**3. Precision (94.51%):** 94.51% of the passengers predicted to recommend the airline actually do, which keeps false positives low and helps in targeting the right customer group.

**4. Recall (91.47%):** The model correctly identifies 91.47% of the passengers who actually recommend the airline; the remaining 8.53% are missed (false negatives). Keeping this miss rate low supports customer retention efforts.

**5. F1-Score (92.96%):** Balances precision and recall, indicating the model maintains a strong trade-off, making it effective for business decisions based on referral prediction.
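The ROC-AUC values in this notebook are computed from `predict` output (hard 0/1 labels). The sketch below, on synthetic stand-in data rather than the airline reviews, illustrates how that differs from scoring predicted probabilities. The names `X_demo`, `y_demo`, and `demo_tree` are hypothetical and exist only for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: NOT the airline reviews dataset)
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

demo_tree = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X_tr, y_tr)

hard_labels = demo_tree.predict(X_te)                 # 0/1 class predictions
positive_proba = demo_tree.predict_proba(X_te)[:, 1]  # probability of class 1

# ROC-AUC from hard labels collapses the ROC curve to a single threshold,
# so it coincides with balanced accuracy.
auc_from_labels = roc_auc_score(y_te, hard_labels)
balanced_acc = balanced_accuracy_score(y_te, hard_labels)

# ROC-AUC from probabilities evaluates the ranking across all thresholds.
auc_from_proba = roc_auc_score(y_te, positive_proba)

print(f"ROC-AUC from hard labels:   {auc_from_labels:.4f}")
print(f"Balanced accuracy:          {balanced_acc:.4f}")
print(f"ROC-AUC from probabilities: {auc_from_proba:.4f}")
```

If a threshold-independent ROC-AUC is wanted in the notebook, passing `best_dt_model.predict_proba(X_test)[:, 1]` to `roc_auc_score` would be the analogous change; the reported figures would then differ from those quoted above.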

### ML Model - 3 : Random Forest



In [None]:
# ML Model - 3 Implementation - Random Forest

#  Start time
start = time.time()

#  Initialize the Random Forest model
rf_model = RandomForestClassifier(
    n_estimators=100,         # Number of trees
    max_depth=10,             # Limit depth to prevent overfitting
    min_samples_split=10,     # Minimum samples required to split a node
    min_samples_leaf=5,       # Minimum samples required at each leaf node
    random_state=42
)

#  Train the model
rf_model.fit(X_train, y_train)

#  Make predictions on both Train and Test Sets
y_train_pred_rf = rf_model.predict(X_train)   # Train set predictions
y_test_pred_rf = rf_model.predict(X_test)     # Test set predictions

#  Calculate metrics
metrics = ['Accuracy', 'ROC-AUC', 'Precision', 'Recall', 'F1-Score']

#  Train Set Metrics
train_metrics = [
    accuracy_score(y_train, y_train_pred_rf),
    roc_auc_score(y_train, y_train_pred_rf),
    precision_score(y_train, y_train_pred_rf),
    recall_score(y_train, y_train_pred_rf),
    f1_score(y_train, y_train_pred_rf)
]

#  Test Set Metrics
test_metrics = [
    accuracy_score(y_test, y_test_pred_rf),
    roc_auc_score(y_test, y_test_pred_rf),
    precision_score(y_test, y_test_pred_rf),
    recall_score(y_test, y_test_pred_rf),
    f1_score(y_test, y_test_pred_rf)
]

#  Create DataFrame for easy comparison
metrics_df = pd.DataFrame({
    'Metric': metrics,
    'Train Set': train_metrics,
    'Test Set': test_metrics
})

#  Format the scores to 4 decimal places
metrics_df = metrics_df.round(4)

#  Display the table
print("\n---  Random Forest - Evaluation Metrics ---")
print(metrics_df)

#  End time
end = time.time()
print(f"\n⏱️ Time taken: {round(end - start, 2)} seconds")


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#  Labels and Scores
metrics = ['Accuracy', 'ROC-AUC', 'Precision', 'Recall', 'F1-Score']
train_scores = [
    accuracy_score(y_train, y_train_pred_rf),
    roc_auc_score(y_train, y_train_pred_rf),
    precision_score(y_train, y_train_pred_rf),
    recall_score(y_train, y_train_pred_rf),
    f1_score(y_train, y_train_pred_rf)
]

test_scores = [
    accuracy_score(y_test, y_test_pred_rf),
    roc_auc_score(y_test, y_test_pred_rf),
    precision_score(y_test, y_test_pred_rf),
    recall_score(y_test, y_test_pred_rf),
    f1_score(y_test, y_test_pred_rf)
]

#  Plotting the chart
fig, ax = plt.subplots(figsize=(10, 5))

x = range(len(metrics))
bar_width = 0.35

# Plot Train and Test Scores
bars1 = ax.bar([p - bar_width/2 for p in x], train_scores, bar_width, label='Train Set', color='#4CAF50')
bars2 = ax.bar([p + bar_width/2 for p in x], test_scores, bar_width, label='Test Set', color='#2196F3')

#  Add score values on top of the bars
for bar, score in zip(bars1, train_scores):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.005, f"{score:.4f}", ha='center', fontsize=10, color='black')

for bar, score in zip(bars2, test_scores):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.005, f"{score:.4f}", ha='center', fontsize=10, color='black')

#  Labels and formatting
ax.set_title('Random Forest - Train vs Test Set Evaluation Metrics', fontsize=16, fontweight='bold')
ax.set_ylabel('Score', fontsize=12)
ax.set_xlabel('Metrics', fontsize=12)
ax.set_xticks(x)
ax.set_xticklabels(metrics)
ax.legend()

#  Styling
plt.ylim(0, 1.1)
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.tight_layout()

#  Display the chart
plt.show()


#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# Not required

##### Which hyperparameter optimization technique have you used and why?

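Hyperparameter tuning was not applied to this model, for two reasons:
* The train and test metrics are close and consistently high (accuracy 0.9476 on the train set vs. 0.9427 on the test set), showing no signs of overfitting or underfitting.

* Precision, recall, and F1-score are balanced, indicating that the model handles both classes effectively.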

### ML Model 4: XGBoost

In [None]:
#  Map target labels to 0 and 1
y_train = y_train.replace({-1: 0, 1: 1})
y_test = y_test.replace({-1: 0, 1: 1})

#  Verify the mapping
print("Unique values in y_train:", y_train.unique())
print("Unique values in y_test:", y_test.unique())


In [None]:
# ML Model 4 -  Implementation - XGBoost

#  Start time
start = time.time()

# Create an instance of the XGBoost classifier
xgb_model = xgb.XGBClassifier()

#  Train the model
xgb_model.fit(X_train, y_train)

#  Predictions on Train and Test sets
y_train_pred_xgb = xgb_model.predict(X_train)
y_test_pred_xgb = xgb_model.predict(X_test)

#  Train set metrics
train_accuracy = accuracy_score(y_train, y_train_pred_xgb)
train_roc_auc = roc_auc_score(y_train, y_train_pred_xgb)
train_precision = precision_score(y_train, y_train_pred_xgb)
train_recall = recall_score(y_train, y_train_pred_xgb)
train_f1 = f1_score(y_train, y_train_pred_xgb)

#  Test set metrics
test_accuracy = accuracy_score(y_test, y_test_pred_xgb)
test_roc_auc = roc_auc_score(y_test, y_test_pred_xgb)
test_precision = precision_score(y_test, y_test_pred_xgb)
test_recall = recall_score(y_test, y_test_pred_xgb)
test_f1 = f1_score(y_test, y_test_pred_xgb)

#  Create a comparison table


metrics_table = pd.DataFrame({
    'Metric': ['Accuracy', 'ROC-AUC', 'Precision', 'Recall', 'F1-Score'],
    'Train Set': [train_accuracy, train_roc_auc, train_precision, train_recall, train_f1],
    'Test Set': [test_accuracy, test_roc_auc, test_precision, test_recall, test_f1]
})

#  Display the table
print("\n---  XGBoost - Evaluation Metrics ---")
print(metrics_table)

#  Confusion matrices
print("\n--- Confusion Matrix (Train Set) ---")
print(confusion_matrix(y_train, y_train_pred_xgb))

print("\n--- Confusion Matrix (Test Set) ---")
print(confusion_matrix(y_test, y_test_pred_xgb))

#  Classification reports
print("\n--- Classification Report (Train Set) ---")
print(classification_report(y_train, y_train_pred_xgb))

print("\n--- Classification Report (Test Set) ---")
print(classification_report(y_test, y_test_pred_xgb))

#  End time
end = time.time()
print(f"\n Time taken: {round(end - start, 2)} seconds")



#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
 # Visualizing evaluation Metric Score chart


#  Metrics for Train and Test sets
metrics = ['Accuracy', 'ROC-AUC', 'Precision', 'Recall', 'F1-Score']
train_scores = [train_accuracy, train_roc_auc, train_precision, train_recall, train_f1]
test_scores = [test_accuracy, test_roc_auc, test_precision, test_recall, test_f1]

#  Plotting the comparison chart
x = np.arange(len(metrics))
width = 0.35

fig, ax = plt.subplots(figsize=(10, 5))

# Bars for Train and Test sets
bars1 = ax.bar(x - width/2, train_scores, width, label='Train Set', color='steelblue')
bars2 = ax.bar(x + width/2, test_scores, width, label='Test Set', color='orange')

# Labels and formatting
ax.set_xlabel('Metrics', fontsize=12)
ax.set_ylabel('Score', fontsize=12)
ax.set_title('XGBoost Model - Evaluation Metrics (Train vs Test)', fontsize=14, fontweight='bold')
ax.set_xticks(x)
ax.set_xticklabels(metrics)
ax.legend()

#  Display score values on bars
for bars, scores in zip([bars1, bars2], [train_scores, test_scores]):
    for bar, score in zip(bars, scores):
        ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.005,
                f"{score:.4f}", ha='center', va='bottom', fontsize=10, color='black')

#  Grid and layout adjustments
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.ylim(0, 1.1)
plt.tight_layout()
plt.show()

#  Plotting Confusion Matrices
fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Train Set Confusion Matrix
sns.heatmap(confusion_matrix(y_train, y_train_pred_xgb), annot=True, fmt='d', cmap='Blues', ax=axes[0])
axes[0].set_title('Train Set Confusion Matrix')
axes[0].set_xlabel('Predicted Labels')
axes[0].set_ylabel('True Labels')

# Test Set Confusion Matrix
sns.heatmap(confusion_matrix(y_test, y_test_pred_xgb), annot=True, fmt='d', cmap='Oranges', ax=axes[1])
axes[1].set_title('Test Set Confusion Matrix')
axes[1].set_xlabel('Predicted Labels')
axes[1].set_ylabel('True Labels')

plt.tight_layout()
plt.show()


#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
 # ML Model - 4 Implementation with hyperparameter optimization techniques- randomizedsearchCV

#  Start Timer
start = time.time()

#  XGBoost Model Initialization
xgb_model = xgb.XGBClassifier(objective='binary:logistic', random_state=42, eval_metric='logloss')

#  Define Hyperparameter Grid
param_grid = {
    'n_estimators': [100, 200, 300, 400],
    'max_depth': [3, 5, 7, 10],
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'min_child_weight': [1, 3, 5, 7],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'gamma': [0, 0.1, 0.2, 0.3],
    'reg_alpha': [0, 0.01, 0.1, 1],
    'reg_lambda': [0, 0.01, 0.1, 1]
}

#  RandomizedSearchCV with 5-fold Cross-Validation
random_search = RandomizedSearchCV(
    estimator=xgb_model,
    param_distributions=param_grid,
    n_iter=50,              # Number of random parameter combinations
    scoring='accuracy',     # Evaluation metric
    cv=5,                   # 5-fold cross-validation
    verbose=2,
    random_state=42,
    n_jobs=-1
)

#  Fit the model
random_search.fit(X_train, y_train)

#  Best model and parameters
best_xgb_model = random_search.best_estimator_
print("\n Best Hyperparameters:", random_search.best_params_)

#  Predictions on Train and Test Sets
y_train_pred_xgb_tuned = best_xgb_model.predict(X_train)
y_test_pred_xgb_tuned = best_xgb_model.predict(X_test)

#  Train Set Metrics
train_accuracy = accuracy_score(y_train, y_train_pred_xgb_tuned)
train_roc_auc = roc_auc_score(y_train, y_train_pred_xgb_tuned)
train_precision = precision_score(y_train, y_train_pred_xgb_tuned)
train_recall = recall_score(y_train, y_train_pred_xgb_tuned)
train_f1 = f1_score(y_train, y_train_pred_xgb_tuned)

#  Test Set Metrics
test_accuracy = accuracy_score(y_test, y_test_pred_xgb_tuned)
test_roc_auc = roc_auc_score(y_test, y_test_pred_xgb_tuned)
test_precision = precision_score(y_test, y_test_pred_xgb_tuned)
test_recall = recall_score(y_test, y_test_pred_xgb_tuned)
test_f1 = f1_score(y_test, y_test_pred_xgb_tuned)

#  Create a comparison table
metrics_table = pd.DataFrame({
    'Metric': ['Accuracy', 'ROC-AUC', 'Precision', 'Recall', 'F1-Score'],
    'Train Set': [train_accuracy, train_roc_auc, train_precision, train_recall, train_f1],
    'Test Set': [test_accuracy, test_roc_auc, test_precision, test_recall, test_f1]
})

#  Display the table
print("\n---  XGBoost Model - Evaluation Metrics (After Tuning) ---")
print(metrics_table)

#  Confusion matrices
print("\n--- Confusion Matrix (Train Set) ---")
print(confusion_matrix(y_train, y_train_pred_xgb_tuned))

print("\n--- Confusion Matrix (Test Set) ---")
print(confusion_matrix(y_test, y_test_pred_xgb_tuned))

#  Classification reports
print("\n--- Classification Report (Train Set) ---")
print(classification_report(y_train, y_train_pred_xgb_tuned))

print("\n--- Classification Report (Test Set) ---")
print(classification_report(y_test, y_test_pred_xgb_tuned))

#  End Timer
end = time.time()
print(f"\n Time taken: {round(end - start, 2)} seconds")


##### Which hyperparameter optimization technique have you used and why?

I used RandomizedSearchCV because it efficiently explores a wide range of hyperparameter combinations in less time compared to GridSearchCV, making it suitable for faster tuning with large datasets.
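To make the efficiency argument concrete, the following sketch counts how many candidates an exhaustive grid search would have to evaluate for the XGBoost search space defined above, and compares that with the 50 sampled candidates. The variable names (`xgb_param_grid`, `grid_search_fits`, `random_search_fits`) are illustrative and not part of the notebook.

```python
from sklearn.model_selection import ParameterGrid

# Same search space as the XGBoost tuning cell
xgb_param_grid = {
    'n_estimators': [100, 200, 300, 400],
    'max_depth': [3, 5, 7, 10],
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'min_child_weight': [1, 3, 5, 7],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'gamma': [0, 0.1, 0.2, 0.3],
    'reg_alpha': [0, 0.01, 0.1, 1],
    'reg_lambda': [0, 0.01, 0.1, 1],
}

n_iter = 50    # candidates sampled by RandomizedSearchCV
cv_folds = 5   # 5-fold cross-validation

# ParameterGrid enumerates the full Cartesian product of the value lists
total_combinations = len(ParameterGrid(xgb_param_grid))
grid_search_fits = total_combinations * cv_folds
random_search_fits = n_iter * cv_folds

print(f"Combinations in the full grid: {total_combinations}")    # 147456
print(f"Fits for an exhaustive search: {grid_search_fits}")      # 737280
print(f"Fits for the randomized search: {random_search_fits}")   # 250
```

The randomized search therefore covers only a small sample of the grid, which is the trade-off accepted in exchange for the shorter run time.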

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

In [None]:
#  Visualization comparing evaluation metric values before and after tuning.

#  Metrics and Scores
metrics = ['Accuracy', 'ROC-AUC', 'Precision', 'Recall', 'F1-Score']

#  Before Tuning Scores (test-set values hard-coded here, because the test_* variables now hold the tuned results)
before_tuning = [0.941162, 0.940760, 0.938524, 0.934901, 0.936709]

#  After Tuning Scores
after_tuning = [
    test_accuracy,
    test_roc_auc,
    test_precision,
    test_recall,
    test_f1
]

#  Plotting
x = np.arange(len(metrics))
width = 0.35

fig, ax = plt.subplots(figsize=(10, 5))

# Bars for Before and After Tuning
bars1 = ax.bar(x - width/2, before_tuning, width, label='Before Tuning', color='lightcoral')
bars2 = ax.bar(x + width/2, after_tuning, width, label='After Tuning', color='seagreen')

#  Labels and Formatting
ax.set_xlabel('Evaluation Metrics', fontsize=12)
ax.set_ylabel('Score', fontsize=12)
ax.set_title('XGBoost Model - Comparison of Evaluation Metrics (Before vs After Tuning)', fontsize=14, fontweight='bold')
ax.set_xticks(x)
ax.set_xticklabels(metrics)
ax.legend()

#  Display score values on bars
for bars, scores in zip([bars1, bars2], [before_tuning, after_tuning]):
    for bar, score in zip(bars, scores):
        ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.005,
                f"{score:.4f}", ha='center', va='bottom', fontsize=10, color='black')

#  Grid and Layout
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.ylim(0, 1.1)
plt.tight_layout()
plt.show()




There is only a small change in the evaluation metrics after tuning.

### ML Model 5: LightGBM

In [None]:
# ML Model -5  Implementation - LightGBM

#  Start Timer
start = time.time()

#  Initialize LightGBM model with basic parameters
lgb_model = lgb.LGBMClassifier(
    boosting_type='gbdt',
    objective='binary',        # For binary classification
    n_estimators=100,          # Number of boosting rounds
    learning_rate=0.1,         # Step size shrinkage
    max_depth=10,              # Maximum tree depth
    num_leaves=31,             # Number of leaves in each tree
    min_child_samples=20,      # Minimum samples in leaf
    subsample=0.8,             # Subsample ratio for rows
    colsample_bytree=0.8,      # Subsample ratio for columns
    random_state=42,
    verbose=-1
)

#  Fit the model
lgb_model.fit(X_train, y_train)

#  Make predictions
y_train_pred_lgb = lgb_model.predict(X_train)
y_test_pred_lgb = lgb_model.predict(X_test)

#  Train Set Metrics
train_accuracy = accuracy_score(y_train, y_train_pred_lgb)
train_roc_auc = roc_auc_score(y_train, y_train_pred_lgb)
train_precision = precision_score(y_train, y_train_pred_lgb)
train_recall = recall_score(y_train, y_train_pred_lgb)
train_f1 = f1_score(y_train, y_train_pred_lgb)

#  Test Set Metrics
test_accuracy = accuracy_score(y_test, y_test_pred_lgb)
test_roc_auc = roc_auc_score(y_test, y_test_pred_lgb)
test_precision = precision_score(y_test, y_test_pred_lgb)
test_recall = recall_score(y_test, y_test_pred_lgb)
test_f1 = f1_score(y_test, y_test_pred_lgb)

#  Create a comparison table
metrics_table = pd.DataFrame({
    'Metric': ['Accuracy', 'ROC-AUC', 'Precision', 'Recall', 'F1-Score'],
    'Train Set': [train_accuracy, train_roc_auc, train_precision, train_recall, train_f1],
    'Test Set': [test_accuracy, test_roc_auc, test_precision, test_recall, test_f1]
})

#  Display the table
print("\n---  LightGBM - Evaluation Metrics ---")
print(metrics_table)

#  Confusion Matrices
print("\n--- Confusion Matrix (Train Set) ---")
print(confusion_matrix(y_train, y_train_pred_lgb))

print("\n--- Confusion Matrix (Test Set) ---")
print(confusion_matrix(y_test, y_test_pred_lgb))

# Classification Reports
print("\n--- Classification Report (Train Set) ---")
print(classification_report(y_train, y_train_pred_lgb))

print("\n--- Classification Report (Test Set) ---")
print(classification_report(y_test, y_test_pred_lgb))

#  End Timer
end = time.time()
print(f"\n Time taken: {round(end - start, 2)} seconds")


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart


#  Metrics and scores
metrics = ['Accuracy', 'ROC-AUC', 'Precision', 'Recall', 'F1-Score']
train_scores = [train_accuracy, train_roc_auc, train_precision, train_recall, train_f1]
test_scores = [test_accuracy, test_roc_auc, test_precision, test_recall, test_f1]

#  Plotting the comparison chart
x = np.arange(len(metrics))
width = 0.35

fig, ax = plt.subplots(figsize=(10, 5))

# Bars for Train and Test sets
bars1 = ax.bar(x - width/2, train_scores, width, label='Train Set', color='steelblue')
bars2 = ax.bar(x + width/2, test_scores, width, label='Test Set', color='orange')

#  Labels and formatting
ax.set_xlabel('Metrics', fontsize=12)
ax.set_ylabel('Score', fontsize=12)
ax.set_title('LightGBM Model - Evaluation Metrics (Train vs Test)', fontsize=14, fontweight='bold')
ax.set_xticks(x)
ax.set_xticklabels(metrics)
ax.legend()

#  Display score values on bars
for bars, scores in zip([bars1, bars2], [train_scores, test_scores]):
    for bar, score in zip(bars, scores):
        ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.005,
                f"{score:.4f}", ha='center', va='bottom', fontsize=10, color='black')

#  Grid and layout adjustments
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.ylim(0, 1.1)
plt.tight_layout()
plt.show()

#### 2. Cross-Validation & Hyperparameter Tuning

Cross-validation and hyperparameter tuning are not strictly required here. However, to obtain a more reliable estimate of performance, Stratified K-Fold Cross-Validation and RandomizedSearchCV are used.
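As a small illustration of what stratification contributes, the sketch below splits a hypothetical, imbalanced label vector with `StratifiedKFold` and checks the share of positive labels in each validation fold. The arrays `X_demo` and `y_demo` are made up for this example and are unrelated to the airline data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels: 70 negatives and 30 positives (30% positive rate)
y_demo = np.array([0] * 70 + [1] * 30)
X_demo = np.zeros((len(y_demo), 1))  # features are irrelevant for the split itself

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Share of positive labels in each validation fold
fold_positive_rates = [
    y_demo[val_idx].mean() for _, val_idx in cv_demo.split(X_demo, y_demo)
]

for fold_number, rate in enumerate(fold_positive_rates, start=1):
    print(f"Fold {fold_number}: positive rate = {rate:.2f}")
```

Every fold keeps the overall 30% positive rate, so the per-fold accuracy scores are measured on comparable class mixes.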

In [None]:
#   Code for Stratified K-Fold Cross-Validation and RandomizedSearchCV

#  Define the LightGBM model
lgb_model = lgb.LGBMClassifier(random_state=42)

#  Hyperparameter grid
param_grid = {
    'n_estimators': [100, 200, 300, 400],          # Number of boosting rounds
    'max_depth': [-1, 10, 15, 20],                 # Max tree depth (-1 means no limit)
    'learning_rate': [0.01, 0.05, 0.1, 0.2],       # Step size
    'num_leaves': [31, 50, 70, 100],               # Number of leaves in each tree
    'min_child_samples': [10, 20, 30, 50],         # Min samples in a leaf
    'subsample': [0.6, 0.8, 1.0],                   # Row sampling
    'colsample_bytree': [0.6, 0.8, 1.0],            # Feature sampling
    'reg_alpha': [0, 0.1, 0.5, 1],                 # L1 regularization
    'reg_lambda': [0, 0.1, 0.5, 1]                 # L2 regularization
}

#  Stratified K-Fold Cross-Validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

#  RandomizedSearchCV
random_search = RandomizedSearchCV(
    estimator=lgb_model,
    param_distributions=param_grid,
    n_iter=25,                        # 25 random combinations
    scoring='accuracy',               # Evaluation metric
    cv=cv,                            # 5-fold cross-validation
    verbose=2,
    random_state=42,
    n_jobs=-1                         # Use all CPUs
)

#  Start timer
start = time.time()

#  Fit the model using RandomizedSearchCV
random_search.fit(X_train, y_train)

#  Best hyperparameters
print("\n Best Hyperparameters:", random_search.best_params_)

#  Retrieve the best model. RandomizedSearchCV (refit=True by default) has already
#  refit it on the full training set, so no additional fit call is needed.
best_lgb_model = random_search.best_estimator_

#  Predictions on Train and Test sets
y_train_pred_lgb_tuned = best_lgb_model.predict(X_train)
y_test_pred_lgb_tuned = best_lgb_model.predict(X_test)

#  Train Set Metrics
train_accuracy_tuned = accuracy_score(y_train, y_train_pred_lgb_tuned)
train_roc_auc_tuned = roc_auc_score(y_train, y_train_pred_lgb_tuned)
train_precision_tuned = precision_score(y_train, y_train_pred_lgb_tuned)
train_recall_tuned = recall_score(y_train, y_train_pred_lgb_tuned)
train_f1_tuned = f1_score(y_train, y_train_pred_lgb_tuned)

# Test Set Metrics
test_accuracy_tuned = accuracy_score(y_test, y_test_pred_lgb_tuned)
test_roc_auc_tuned = roc_auc_score(y_test, y_test_pred_lgb_tuned)
test_precision_tuned = precision_score(y_test, y_test_pred_lgb_tuned)
test_recall_tuned = recall_score(y_test, y_test_pred_lgb_tuned)
test_f1_tuned = f1_score(y_test, y_test_pred_lgb_tuned)

# Create a comparison table
metrics_table_tuned = pd.DataFrame({
    'Metric': ['Accuracy', 'ROC-AUC', 'Precision', 'Recall', 'F1-Score'],
    'Train Set': [train_accuracy_tuned, train_roc_auc_tuned, train_precision_tuned, train_recall_tuned, train_f1_tuned],
    'Test Set': [test_accuracy_tuned, test_roc_auc_tuned, test_precision_tuned, test_recall_tuned, test_f1_tuned]
})

#  Display the table
print("\n---  LightGBM Model (After Tuning) - Evaluation Metrics ---")
print(metrics_table_tuned)

# Confusion Matrices
print("\n--- Confusion Matrix (Train Set) ---")
print(confusion_matrix(y_train, y_train_pred_lgb_tuned))

print("\n--- Confusion Matrix (Test Set) ---")
print(confusion_matrix(y_test, y_test_pred_lgb_tuned))

# Classification Reports
print("\n--- Classification Report (Train Set) ---")
print(classification_report(y_train, y_train_pred_lgb_tuned))

print("\n--- Classification Report (Test Set) ---")
print(classification_report(y_test, y_test_pred_lgb_tuned))

# End Timer
end = time.time()
print(f"\n Time taken: {round(end - start, 2)} seconds")




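##### Which hyperparameter optimization technique have you used and why?

RandomizedSearchCV with stratified 5-fold cross-validation is used here. It samples 25 parameter combinations from the grid instead of trying all of them, which keeps the tuning time manageable.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.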

In [None]:
# Visualization of evaluation metrics after tuning

#  Metrics and scores
metrics = ['Accuracy', 'ROC-AUC', 'Precision', 'Recall', 'F1-Score']
train_scores = [train_accuracy_tuned, train_roc_auc_tuned, train_precision_tuned, train_recall_tuned, train_f1_tuned]
test_scores = [test_accuracy_tuned, test_roc_auc_tuned, test_precision_tuned, test_recall_tuned, test_f1_tuned]

#  Plotting the comparison chart
x = np.arange(len(metrics))
width = 0.35

fig, ax = plt.subplots(figsize=(10, 5))

# Bars for Train and Test sets
bars1 = ax.bar(x - width/2, train_scores, width, label='Train Set', color='steelblue')
bars2 = ax.bar(x + width/2, test_scores, width, label='Test Set', color='orange')

# Labels and formatting
ax.set_xlabel('Metrics', fontsize=12)
ax.set_ylabel('Score', fontsize=12)
ax.set_title('LightGBM Model - Evaluation Metrics (After Tuning)', fontsize=14, fontweight='bold')
ax.set_xticks(x)
ax.set_xticklabels(metrics)
ax.legend()

#  Display score values on bars
for bars, scores in zip([bars1, bars2], [train_scores, test_scores]):
    for bar, score in zip(bars, scores):
        ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.005,
                f"{score:.4f}", ha='center', va='bottom', fontsize=10, color='black')

#  Grid and layout adjustments
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.ylim(0, 1.1)
plt.tight_layout()
plt.show()


In [None]:
#   LightGBM - Evaluation Metrics (Before vs After Tuning)

# Metrics and scores
metrics = ['Accuracy', 'ROC-AUC', 'Precision', 'Recall', 'F1-Score']

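# Scores before tuning: baseline LightGBM test metrics from the untuned run above
test_scores_before = [test_accuracy, test_roc_auc, test_precision, test_recall, test_f1]

# Scores after tuning: test metrics of the tuned model
test_scores_after = [test_accuracy_tuned, test_roc_auc_tuned, test_precision_tuned, test_recall_tuned, test_f1_tuned]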

#  Plotting the comparison chart
x = np.arange(len(metrics))
width = 0.35

fig, ax = plt.subplots(figsize=(10, 5))

# Bars for Before and After Tuning
bars1 = ax.bar(x - width/2, test_scores_before, width, label='Before Tuning', color='lightcoral')
bars2 = ax.bar(x + width/2, test_scores_after, width, label='After Tuning', color='seagreen')

# Labels and formatting
ax.set_xlabel('Metrics', fontsize=12)
ax.set_ylabel('Score', fontsize=12)
ax.set_title('LightGBM - Evaluation Metrics (Before vs After Tuning)', fontsize=14, fontweight='bold')
ax.set_xticks(x)
ax.set_xticklabels(metrics)
ax.legend()

#  Display score values on bars
for bars, scores in zip([bars1, bars2], [test_scores_before, test_scores_after]):
    for bar, score in zip(bars, scores):
        ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.005,
                f"{score:.4f}", ha='center', va='bottom', fontsize=10, color='black')

# Grid and layout adjustments
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.ylim(0, 1.1)
plt.tight_layout()
plt.show()


Tuning produced no significant change in the evaluation metrics: test accuracy moved only from 0.9437 to 0.9446.

### Displaying all evaluation metric values from the outputs of all models.

In [None]:
#  Create a comparison table with evaluation metrics for all models
metrics_comparison = pd.DataFrame({
    'Metric': ['Accuracy', 'ROC-AUC', 'Precision', 'Recall', 'F1-Score'],

    # Train Set Metrics
    'Logistic Regression (Train)': [ 0.9379, 0.9377, 0.9325, 0.9343, 0.9334],
    'Decision Tree (Train)': [0.997896, 0.997822, 0.998725, 0.996754, 0.997738],
    'Random Forest (Train)': [0.9476, 0.9470, 0.9480 , 0.9389, 0.9434],
    'XGBoost (Train)': [0.954644, 0.954367, 0.952176, 0.950338, 0.951256],
    'LightGBM (Train)': [ 0.947697, 0.947363, 0.945055, 0.942485, 0.943768],

    # Test Set Metrics
    'Logistic Regression (Test)': [0.9381, 0.9377, 0.9338, 0.9331, 0.9335],
    'Decision Tree (Test)': [0.9355, 0.9342, 0.9451, 0.9147, 0.9296],
    'Random Forest (Test)': [0.9427, 0.9419, 0.9455, 0.9307, 0.9380],
    'XGBoost (Test)': [0.942796, 0.942470, 0.939357, 0.937708, 0.938532],
    'LightGBM (Test)': [0.944594, 0.944197, 0.942379, 0.938410, 0.940390]
})

#  Display the comparison table
print("\n--- Evaluation Metrics Comparison Across All Models ---")
metrics_comparison


### 1. Which Evaluation metrics did you consider for a positive business impact and why?

I considered Accuracy, ROC-AUC, Precision, Recall, and F1-Score for a positive business impact. Accuracy measures overall correctness, ROC-AUC measures the model’s ability to distinguish between the two classes, Precision limits false positives (passengers wrongly predicted to recommend), Recall limits false negatives (recommending passengers the model misses), and F1-Score balances precision and recall, which is crucial for minimizing false predictions and ensuring reliable results.
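The metrics named above all derive from the four confusion-matrix counts. The worked example below uses hypothetical counts (not taken from any output in this notebook) and cross-checks the hand-computed values against scikit-learn.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical confusion-matrix counts (assumption: illustrative only)
tn, fp, fn, tp = 50, 5, 8, 37

accuracy = (tp + tn) / (tp + tn + fp + fn)           # share of all predictions that are correct
precision = tp / (tp + fp)                           # share of predicted positives that are correct
recall = tp / (tp + fn)                              # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of precision and recall

# Rebuild label arrays with the same counts to cross-check against scikit-learn
y_true = np.array([0] * (tn + fp) + [1] * (fn + tp))
y_pred = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)

print(f"Accuracy:  {accuracy:.4f}  (sklearn: {accuracy_score(y_true, y_pred):.4f})")
print(f"Precision: {precision:.4f}  (sklearn: {precision_score(y_true, y_pred):.4f})")
print(f"Recall:    {recall:.4f}  (sklearn: {recall_score(y_true, y_pred):.4f})")
print(f"F1-Score:  {f1:.4f}  (sklearn: {f1_score(y_true, y_pred):.4f})")
```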

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

I chose LightGBM as the final prediction model because it achieved the highest test accuracy (94.46%), ROC-AUC (94.42%), and F1-score (94.04%) among the five models, although its margin over XGBoost and Random Forest is under half a percentage point. Its train and test accuracy are also close (94.77% vs. 94.46%), indicating good generalization.
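The selection can be checked directly against the test-set columns of the comparison table. The sketch below re-enters those values and uses `idxmax` to find the leading model for each metric; the names `test_scores_by_model` and `best_model_per_metric` are illustrative.

```python
import pandas as pd

metric_names = ['Accuracy', 'ROC-AUC', 'Precision', 'Recall', 'F1-Score']

# Test-set values copied from the comparison table above
test_scores_by_model = pd.DataFrame({
    'Logistic Regression': [0.9381, 0.9377, 0.9338, 0.9331, 0.9335],
    'Decision Tree': [0.9355, 0.9342, 0.9451, 0.9147, 0.9296],
    'Random Forest': [0.9427, 0.9419, 0.9455, 0.9307, 0.9380],
    'XGBoost': [0.942796, 0.942470, 0.939357, 0.937708, 0.938532],
    'LightGBM': [0.944594, 0.944197, 0.942379, 0.938410, 0.940390],
}, index=metric_names)

# For each metric (row), the model (column) with the highest test score
best_model_per_metric = test_scores_by_model.idxmax(axis=1)
wins_per_model = best_model_per_metric.value_counts()

print(best_model_per_metric)
print(wins_per_model)
```

LightGBM leads on accuracy, ROC-AUC, recall, and F1-score, while Random Forest has the highest precision.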

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
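One possible way to fill in the two cells above is with `joblib`. The sketch below is a minimal round trip under stated assumptions: it trains a small stand-in model on synthetic data so that it runs on its own, and the file name `referral_model_demo.joblib` is hypothetical. In the notebook, the tuned `best_lgb_model` and rows of `X_test` would take the place of `demo_model` and `X_demo`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in model on synthetic data (assumption: replace with best_lgb_model in the notebook)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)
demo_model = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_demo, y_demo)

# Save the fitted model to disk (hypothetical file name in a temporary directory)
model_path = os.path.join(tempfile.mkdtemp(), "referral_model_demo.joblib")
joblib.dump(demo_model, model_path)

# Load it back and run a sanity check on a few rows
loaded_model = joblib.load(model_path)
sample_rows = X_demo[:5]
predictions_match = np.array_equal(
    demo_model.predict(sample_rows), loaded_model.predict(sample_rows)
)

print("Model file exists:", os.path.exists(model_path))
print("Predictions identical after reload:", predictions_match)
```

A model file loaded this way should be used with the same library versions that produced it, since serialized estimators are not guaranteed to be portable across versions.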

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**


1. **Comprehensive Data Preprocessing:** The project began with thorough data cleaning, handling missing values, and performing exploratory data analysis (EDA) to identify patterns, correlations, and outliers.  
2. **Feature Engineering and Selection:** Effective feature selection improved model accuracy by retaining only relevant variables, reducing noise, and preventing overfitting.  
3. **Model Variety and Comparison:** Multiple machine learning models were implemented, including **Logistic Regression, Decision Tree, Random Forest, XGBoost, and LightGBM**, ensuring a diverse evaluation.  
4. **Cross-Validation and Tuning:** Hyperparameter tuning with cross-validation clearly improved the Decision Tree (test accuracy rose from 90.67% to 93.55%), while the gains for XGBoost and LightGBM were marginal.  
5. **LightGBM emerged as the best model**, delivering the highest accuracy (**94.46%**) and F1-Score, making it the final choice for prediction.  
6. **XGBoost showed competitive performance**, with slightly lower accuracy than LightGBM but still highly reliable, making it a close alternative.  
7. **The Decision Tree overfitted**, with 99.79% training accuracy against 93.55% test accuracy in the comparison table, whereas Random Forest generalized well (94.76% train vs. 94.27% test).  
8. **Evaluation Metrics Consistency:** Precision, recall, and F1-scores remained consistent for the best-performing models, ensuring reliability across various performance dimensions.   
9. **Scalability and Efficiency:** LightGBM was chosen as the final model due to its superior accuracy, faster execution, and scalability, making it suitable for large datasets and real-world deployment.
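The overfitting assessment in the list above can be quantified as the gap between train and test accuracy. The sketch below recomputes that gap from the accuracy values in the comparison table; the names `accuracy_by_model` and `Gap` are illustrative.

```python
import pandas as pd

# Accuracy values copied from the comparison table (train and test columns)
accuracy_by_model = pd.DataFrame(
    {
        'Train': [0.9379, 0.997896, 0.9476, 0.954644, 0.947697],
        'Test': [0.9381, 0.9355, 0.9427, 0.942796, 0.944594],
    },
    index=['Logistic Regression', 'Decision Tree', 'Random Forest', 'XGBoost', 'LightGBM'],
)

# A large positive gap means the model fits the training data better than unseen data
accuracy_by_model['Gap'] = accuracy_by_model['Train'] - accuracy_by_model['Test']
most_overfitted_model = accuracy_by_model['Gap'].idxmax()

print(accuracy_by_model.round(4))
print("Largest train-test gap:", most_overfitted_model)
```

The Decision Tree shows a gap of roughly six percentage points, while every other model stays within about one point.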

## ***Thank You***

### ***You have successfully completed your Machine Learning Capstone Project !***