# **Project Name**    -



##### **Project Type**    - EDA/Regression/Classification/Unsupervised
##### **Contribution**    - Individual/Team
##### **Team Member 1 -**
##### **Team Member 2 -**
##### **Team Member 3 -**
##### **Team Member 4 -**

# **Project Summary -**

In the current competitive e-commerce market, providing world-class customer service has turned into a strategic priority for companies that aim to achieve long-term growth and customer loyalty. Being one of India's biggest and most recognized online shopping websites, Flipkart understands that customer satisfaction is critical for keeping users engaged, building brand name, and gaining a competitive edge. With so many choices to choose from, companies that best serve customers have the best chance of fostering long-term relationships and delivering long-term profitability.
In summary, this project is a strategic undertaking to change Flipkart's customer service potential using insights driven by data. Through rigorous analysis of customer input, agent behavior, and interaction quality, the company can streamline its service operations and provide a better customer experience—thus building long-term customer relationships and driving sustained business success.

# **GitHub Link -**

https://github.com/shivakumar-2555



# **Problem Statement**


To identify and analyze the key factors influencing customer satisfaction at Flipkart by leveraging data from customer interactions, feedback, and satisfaction scores across various support channels. The goal is to gain actionable insights that can help improve the performance of customer service agents, tailor support strategies to meet diverse customer expectations, and ultimately enhance the overall customer service experience, leading to improved CSAT scores, customer retention, and brand loyalty.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Vizualization in  a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and it's performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact pf the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
from enum import unique

### Dataset Loading

In [None]:
# Load Dataset
df = pd.read_csv('/content/Customer_support_data.csv')

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values
msno.bar(df)

### What did you know about your dataset?

 The given data set customer support of Flipkart with information coloumns like order details, product category, agent handling time, issue report dates, and customer satisfaction (CSAT) scores. It indicates how different service-related attributes drive customer feedback and assists in determining influential factors impacting satisfaction and support performance.


## ***2. Understanding Your Variables***

In [None]:
unique_values = df.nunique()
print(unique_values)

In [None]:
# Dataset Columns
df.columns


In [None]:
# Dataset Describe
df.describe()




### Variables Description

Unique_id: A unique identifier for each customer support case. Not used for analysis or prediction.

channel_name: The platform through which the customer reached support (e.g., Call, Email, Chat). Useful to analyze which channels lead to better satisfaction.

category: A broad classification of the customer issue (e.g., Payment Issue, Delivery Issue). Helps in segmenting types of support queries.

Sub-category: A more specific classification under the main category, such as "Late Delivery" under "Delivery Issues".

order_date_time: The date and time when the original order was placed. Useful for calculating time delays or trends over time.

Issue reported at: The timestamp when the customer reported the issue. Helps calculate how long after ordering the issue was raised.

Product_category: The type of product involved in the complaint (e.g., Electronics, Clothing). Can be used to analyze which product categories generate more complaints or poor CSAT.

Item_price: Price of the product involved. May influence urgency or impact on customer satisfaction.

Customer_City: The city of the customer. Useful for regional analysis or performance by location.

connected_handling_time: Time taken by the support team to handle/resolve the issue. A key metric to understand its correlation with customer satisfaction.

Survey_response_Date: The date when the customer responded to the satisfaction survey. Can help analyze lag between resolution and feedback.

CSAT Score: The customer satisfaction score, usually ranging from 1 to 5. This is the target variable in your analysis and prediction.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

df.columns = df.columns.str.strip()
df.drop_duplicates(inplace=True)
df.reset_index(drop=True, inplace=True)

missing = df.isnull().sum()
print("Missing Values per Column:\n", missing[missing > 0])

threshold = 0.6 * len(df)
df = df.dropna(thresh=threshold, axis=1)

# Removed code related to 'Customer Remarks' as it was dropped due to high missing values.
# if 'Customer Remarks' in df.columns:
#     df['Customer Remarks'] = df['Customer Remarks'].fillna('')

if 'Item_price' in df.columns:
    df['Item_price'] = df['Item_price'].fillna(df['Item_price'].median())

df['CSAT Score'] = pd.to_numeric(df['CSAT Score'], errors='coerce')
df['CSAT Score'] = df['CSAT Score'].fillna(0)

for col in ['Product_category', 'Customer_City', 'connected_handling_time']:
    if col in df.columns:
        df[col] = df[col].fillna(df[col].mode()[0])

# Corrected typo in 'Issue_reported at' to 'Issue reported at'
for col in ['order_date_time', 'Issue reported at', 'Survey_response_Date']:
    if col in df.columns:
        df[col] = pd.to_datetime(df[col], errors='coerce')


text_cols = ['channel_name', 'category', 'Sub-category']
for col in text_cols:
    if col in df.columns:
        df[col] = df[col].astype(str).str.lower().str.strip()

print("\nCleaned Data Overview:")
print(df.info())

df.to_csv("cleaned_customer_data.csv", index=False)
print("Cleaned data saved as 'cleaned_customer_data.csv'")

### What all manipulations have you done and insights you found?

Manipulations:
  Stripped column names to remove unwanted whitespaces.
  Dropped duplicate rows to avoid repetition and bias in analysis.
  Dropped columns with more than 60% missing values (e.g., Agent_name, Customer Remarks).
  Converted date columns to correct format.
  Standardized text columns (channel_name, category, Sub-category) by converting to lowercase and stripping extra spaces.

insights so far:
  Several columns had high amounts of missing data and were safely removed to improve data quality.
  CSAT Score was missing in multiple rows these were either filled or can be excluded from modeling later.
  Most support requests came through a few common categories and sub-categories, which can be analyzed further in EDA.
  Some data fields like connected_handling_time and Item_price will likely be key drivers of customer satisfaction and require detailed analysis next.

## ***4. Data Vizualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
sns.countplot(x='CSAT Score', data=df)
plt.title("CSAT Score Distribution")
plt.show()

##### 1. Why did you pick the specific chart?

To understand the overall distribution of customer satisfaction ratings and identify how frequently each CSAT level occurs.

##### 2. What is/are the insight(s) found from the chart?

The majority of customers provided CSAT scores of 4 and 5, indicating generally high satisfaction levels. However, a small portion rated 1 or 2.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Pos:Confirms that the majority of customers are satisfied.
Neg:Low CSAT ratings still exist and need root-cause analysis to improve overall service quality.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
sns.boxplot(x='channel_name', y='CSAT Score', data=df)
plt.title("CSAT Score by Support Channel")
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

To explore how different communication channels impact customer satisfaction.

##### 2. What is/are the insight(s) found from the chart?

Support through chat or email generally received higher CSAT scores, while call-based support had more varied results.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Pos: Encourages investing in digital support channels.
Neg: Call support teams may need retraining or better SOPs to ensure consistent service.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
sns.boxplot(x='category', y='CSAT Score', data=df)
plt.title("CSAT Score by Category")
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

To assess if specific types of product issues affect customer satisfaction differently.

##### 2. What is/are the insight(s) found from the chart?

Certain issue categories such as “Returns” or “Refunds” showed lower average CSAT compared to others.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Pos: Enables targeted improvements for low-CSAT categories.  
Neg: If unresolved, such issue types can hurt brand image.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# Replacing with Tenure Bucket (categorical vs CSAT)
sns.boxplot(x='Tenure Bucket', y='CSAT Score', data=df)
plt.title("CSAT Score by Agent Tenure Bucket")
plt.xticks(rotation=45)
plt.show()


##### 1. Why did you pick the specific chart?

To investigate whether agent experience level (tenure) impacts customer satisfaction.

##### 2. What is/are the insight(s) found from the chart?

Mid-level experienced agents tend to have higher satisfaction scores than new or less-tenured ones.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Pos: Agent retention and training correlates with satisfaction.Neg: Inexperienced agents may lower CSAT, so onboarding needs attention.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
sns.boxplot(x='category', y='CSAT Score', data=df)
plt.title("CSAT Score by Issue Category")
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

To compare satisfaction across different customer issue types.

##### 2. What is/are the insight(s) found from the chart?

Some issue categories (like delayed delivery or wrong product) had more low scores than others.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

pos: Prioritize top complaint types for quality control. Neg: Poorly handled categories will impact CSAT over time.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
sns.violinplot(x='channel_name', y='CSAT Score', data=df)
plt.title("Violin Plot: CSAT by Channel")
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

To show both distribution and density of CSAT per support channel.

##### 2. What is/are the insight(s) found from the chart?

Email and chat have tight score bands, meaning stable performance; calls have wider variability.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Pos: Shows which channel delivers consistent satisfaction.
Neg: Inconsistent experiences via call center may harm brand image.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
top_subs = df['Sub-category'].value_counts().head(10).index
filtered_df = df[df['Sub-category'].isin(top_subs)]
# Create boxplot
plt.figure(figsize=(12, 6))
sns.boxplot(x='Sub-category', y='CSAT Score', data=filtered_df)
plt.title("CSAT Score by Top 10 Sub-categories")
plt.xlabel("Sub-category")
plt.ylabel("CSAT Score")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A box plot is ideal for comparing the **distribution and variation** of CSAT scores across multiple sub-categories. It helps identify median satisfaction and detect outliers or wider spreads per issue type.

##### 2. What is/are the insight(s) found from the chart?

Some sub-categories such as “Late Delivery” or “Payment Failed” had lower median CSAT and wider variation, showing inconsistent experiences. Others like “Product Info” were more stable and positive.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: This helps the company **pinpoint exactly which types of complaints are most dissatisfying** and focus on them first.  
Negative: If these specific sub-categories are ignored, they may lead to a pattern of repeated dissatisfaction, harming overall customer trust and retention.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
df['category'].value_counts().head(10).plot(kind='bar')
plt.title("Top Issue Categories by Volume")
plt.ylabel("Number of Tickets")
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

To identify which issue categories are most frequently reported.

##### 2. What is/are the insight(s) found from the chart?

A few categories (like delivery, payment) dominate support volume.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: These high-volume categories should be automated or streamlined.
Negative: If they remain slow or error-prone, customer churn risk increases.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
df['Manager'].value_counts().head(10).plot(kind='barh')
plt.title("Top 10 Managers by Support Case Volume")
plt.xlabel("Number of Cases")
plt.show()

##### 1. Why did you pick the specific chart?

To understand team workload and highlight high-performing or overloaded managers.

##### 2. What is/are the insight(s) found from the chart?

Some managers handle more cases than others, possibly due to location or team size.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: Data helps optimize resource allocation.
Negative: Uneven load may affect resolution time and CSAT.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title("Correlation Heatmap")
plt.show()

##### 1. Why did you pick the specific chart?

To find numeric relationships between CSAT and other variables like tenure.

##### 2. What is/are the insight(s) found from the chart?

A weak but visible correlation between tenure and CSAT.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: Helps identify indirect factors influencing CSAT.
Negative: If overlooked, hidden correlations might limit improvement strategies.

#### Chart - 11

In [None]:
# Chart - 11 visualization code
# Convert date if not already done
df['Issue_reported at'] = pd.to_datetime(df['Issue_reported at'], errors='coerce')

# Group by date only
df['Issue_reported at'].dt.date.value_counts().sort_index().plot(kind='line', figsize=(10,4))
plt.title("Support Tickets Over Time")
plt.xlabel("Date")
plt.ylabel("Number of Tickets")
plt.show()

##### 1. Why did you pick the specific chart?

Line charts are great for showing trends over time.

##### 2. What is/are the insight(s) found from the chart?

Spikes in support volume during certain dates may align with sales or campaigns

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: Predict high-volume periods to optimize staffing.
Negative: Unprepared support during high volume will hurt satisfaction.

#### Chart - 12

In [None]:
# Chart - 12 visualization code
sns.boxplot(x='category', y='CSAT Score', hue='Agent Shift', data=df)
plt.title("CSAT Score by Issue Category and Agent Shift")
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

To see if time of day (shift) affects how well agents handle different issues.

##### 2. What is/are the insight(s) found from the chart?

Night shift or weekend agents had slightly lower CSAT on complex issues.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: Can align complex queries with more experienced shifts.
Negative: Poor shift planning will lead to drops in CSAT.

#### Chart - 13

In [None]:
# Chart - 13 visualization code
sns.stripplot(x='CSAT Score', y='Tenure Bucket', data=df)
plt.title("CSAT Score by Agent Tenure")
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

To visualize exact data points and spread of satisfaction scores across agent experience levels.

##### 2. What is/are the insight(s) found from the chart?

Short-tenure agents often receive more varied and lower CSAT.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: Shows that experience matters; supports training investments.
Negative: High turnover or poor onboarding leads to quality drop.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
# sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
# plt.title("Correlation Heatmap")
# plt.show()

##### 1. Why did you pick the specific chart?

The correlation heatmap is ideal for showing the linear relationships between numerical variables at a glance. It helps us understand how strongly (or weakly) different features are related to the target variable in this case, the CSAT Score.

##### 2. What is/are the insight(s) found from the chart?

Since the dataset has mostly categorical data, the heatmap revealed that only `CSAT Score` was present as a numeric column in the original version. After encoding, some moderate correlations appeared between `CSAT Score`, `Tenure Bucket`, and `Agent Shift`, suggesting that agent-related features may influence satisfaction.

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
from sklearn.preprocessing import LabelEncoder

df_encoded = df.copy()
label_cols = ['channel_name', 'category', 'Sub-category', 'Agent Shift', 'Tenure Bucket']

for col in label_cols:
    if col in df_encoded.columns:
        le = LabelEncoder()
        df_encoded[col] = le.fit_transform(df_encoded[col].astype(str))

# Check new numeric columns
numeric_encoded = df_encoded.select_dtypes(include=['int64', 'float64'])

# Now run pair plot
import seaborn as sns
sns.pairplot(numeric_encoded)
plt.suptitle("Pair Plot with Encoded Features", y=1.02)
plt.show()

##### 1. Why did you pick the specific chart?

A pair plot is useful for visualizing relationships between all numeric variables simultaneously, including their distributions. It is especially helpful after encoding categorical features to uncover clusters or linear patterns.

##### 2. What is/are the insight(s) found from the chart?

After encoding variables like channel name, category, and shift, the pair plot showed weak but visible patterns — such as higher CSAT scores being slightly more common with certain agent shifts or longer tenure buckets. Most variable relationships were weak, indicating CSAT is influenced by multiple indirect factors rather than a single strong predictor.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

from three Hypothesis cases and they are

H1:“There is a significant difference in CSAT scores between Chat and Call support channels.”

H2:“The CSAT scores are different across different issue categories.”

H3:“Agents with higher tenure buckets have higher CSAT scores than those with lower tenure.”


### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H₀):
There is no significant difference in CSAT scores across different agent tenure buckets.

μ₁ = μ₂ = μ₃ = ...

Alternate Hypothesis (H₁):
There is a significant difference in CSAT scores based on agent tenure bucket.

At least one μᵢ ≠ μⱼ

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
#We're comparing the means of two independent groups: Chat vs Call
#The dependent variable (CSAT Score) is numerical and the independent variable (channel_name) is categorical.
from scipy.stats import ttest_ind

# Filter Chat and Call CSAT scores only
chat_scores = df[df['channel_name'].str.lower() == 'chat']['CSAT Score']
call_scores = df[df['channel_name'].str.lower() == 'call']['CSAT Score']

# Perform Independent t-test
t_stat, p_value = ttest_ind(chat_scores, call_scores, nan_policy='omit')

print("T-Statistic:", t_stat)
print("P-Value:", p_value)

##### Which statistical test have you done to obtain P-Value?

t-test to obtain the p-value

##### Why did you choose the specific statistical test?

The t-test is the appropriate statistical method for comparing the means of two independent groups (Chat and Call)
because the data in both groups is continuous (CSAT scores) and the groups are unrelated so We want to determine whether the difference in CSAT between channels is statistically significant, not just due to random chance.

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H₀):
There is no significant difference in CSAT scores between different issue categories.

μ₁ = μ₂ = μ₃ = ... = μₙ

Alternate Hypothesis (H₁):
There is at least one issue category with a significantly different CSAT score.

At least one μᵢ ≠ μⱼ

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# we are comparing means of CSAT across more than two groups (multiple categories) the dependent variable is numerical (CSAT Score) and the independent variable is categorical (category)
from scipy.stats import f_oneway

# Create a list of CSAT scores per category
groups = [group['CSAT Score'].dropna() for name, group in df.groupby('category')]

# Perform One-Way ANOVA
f_stat, p_value = f_oneway(*groups)

print("F-Statistic:", f_stat)
print("P-Value:", p_value)

##### Which statistical test have you done to obtain P-Value?

One Way ANOVA the Analysis of Variance

##### Why did you choose the specific statistical test?

One-Way ANOVA is appropriate for comparing the means of three or more independent groups. Since the column category has multiple levels, ANOVA allows us to test whether at least one group mean is significantly different from the others.If the p-value < 0.05, we reject H₀, meaning CSAT scores significantly vary across issue categories.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H₀):
There is no significant difference in CSAT scores across different agent tenure buckets.

μ₁ = μ₂ = μ₃ = ...

Alternate Hypothesis (H₁):
There is a significant difference in CSAT scores based on agent tenure bucket.

At least one μᵢ ≠ μⱼ

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
#We are comparing CSAT means across multiple tenure groups (buckets) Tenure Bucket is a categorical variable CSAT Score is numerical
# Group CSAT scores by Tenure Bucket
groups = [group['CSAT Score'].dropna() for name, group in df.groupby('Tenure Bucket')]

# Perform One-Way ANOVA
f_stat, p_value = f_oneway(*groups)

print("F-Statistic:", f_stat)
print("P-Value:", p_value)

##### Which statistical test have you done to obtain P-Value?

 One-Way ANOVA to compare CSAT scores across different agent tenure buckets.

##### Why did you choose the specific statistical test?

Tenure Bucket is a categorical variable with multiple groups, and we want to see if agent experience level impacts customer satisfaction.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
df.isnull().sum()

#### What all missing value imputation techniques have you used and why did you use those techniques?

dropna(thresh=) to remove columns with >60% missing values.

fillna(mode) for categorical columns to maintain consistency.

fillna(median) for numerical columns to avoid skew from outliers.

fillna(0) for target variable (CSAT Score) where necessary.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
# Boxplot to visualize outliers
sns.boxplot(df['CSAT Score'])

##### What all outlier treatment techniques have you used and why did you use those techniques?

Boxplot analysis

Capping at 1st and 99th percentile (if required) to reduce skew.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
from sklearn.preprocessing import LabelEncoder
label_cols = ['channel_name', 'category', 'Sub-category', 'Agent Shift', 'Tenure Bucket']
for col in label_cols:
    df[col] = LabelEncoder().fit_transform(df[col])

#### What all categorical encoding techniques have you used & why did you use those techniques?

 I used Label Encoding for ordinal-style categories and compact encoding for modeling (Random Forests, etc.).


### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction
# Not applicable - No textual data available for contraction expansion

#### 2. Lower Casing

In [None]:
# Lower Casing
# Not applicable - No textual data available to lowercase

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations
# Not applicable - No punctuation present in dataset columns

#### 4. Removing URLs & Removing words and digits contain digits.

In [None]:
# Remove URLs & Remove words and digits contain digits
# Not applicable - No text or URLs in this dataset

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords
# Not applicable - No sentence-level text available

In [None]:
# Remove White spaces
# Not applicable - No sentence-level text available

#### 6. Rephrase Text

In [None]:
# Rephrase Text
# Not applicable - No text field present to rephrase


#### 7. Tokenization

In [None]:
# Tokenization
# Not applicable - No free-text column to tokenize

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)
# Not applicable

##### Which text normalization technique have you used and why?

since the dataset contains only structured data and no natural language text.

#### 9. Part of speech tagging

In [None]:
# POS Taging
# Not applicable - No sentence/text data available for POS tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text
# Not applicable

##### Which text vectorization technique have you used and why?

TF-IDF or BoW was not used as no text data is present in the dataset.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### What all feature selection methods have you used  and why?

Answer Here.

##### Which all features you found important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data
# Creating time-based features from date
df['issue_day'] = df['Issue_reported at'].dt.day_name()
df['survey_month'] = df['Survey_response_Date'].dt.month_name()

### 6. Data Scaling

In [None]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(X_train, y_train)

# Feature importance
importances = model.feature_importances_


##### Which method have you used to scale you data and why?

Since Random Forest is a tree-based model, it does not require feature scaling.
Tree-based models are scale-invariant, meaning they work fine even if features are on different scales.

### 7. Dimesionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

No, dimensionality reduction was not necessary in this case because:

The number of features is already small (only 5 numeric columns).

Feature importance (via Random Forest) showed most features are use

In [None]:
# DImensionality Reduction (If needed)
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

I used PCA (Principal Component Analysis) only for demonstration.
PCA helps visualize high-dimensional data and compress features while retaining variance, but was not essential in this case.


### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
from sklearn.model_selection import train_test_split

# Assuming 'CSAT Score' is your target variable
X = df.drop('CSAT Score', axis=1)
y = df['CSAT Score']

# Drop non-numeric columns before splitting for simplicity
X = X.select_dtypes(include=np.number)


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

##### What data splitting ratio have you used and why?

I used an 80:20 train-test split — meaning 80% of the data is used for training the model and 20% for testing because this is a standard practice that ensures the model sees enough training data while still being fairly tested on unseen data.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)
df['CSAT Score'].value_counts()

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

I used SMOTE (Synthetic Minority Oversampling Technique) because:

It generates synthetic samples for minority classes (e.g., CSAT Score = 1, 2)

Balances the training dataset without losing data

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# Import required libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import GridSearchCV
import seaborn as sns
import matplotlib.pyplot as plt

# Step 1: Fit the Algorithm
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)

# Step 2: Predict on the Model
y_pred_rf = rf_model.predict(X_test)

# Step 3: Evaluate Performance
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))
print("\nClassification Report:\n", classification_report(y_test, y_pred_rf))

# Confusion Matrix
plt.figure(figsize=(6, 4))
sns.heatmap(confusion_matrix(y_test, y_pred_rf), annot=True, fmt='d', cmap='Blues')
plt.title("Random Forest - Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
# ML Model - 1: Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
import seaborn as sns

# Fit the Model
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)

# Predict
y_pred_rf = rf_model.predict(X_test)

# Accuracy
accuracy = accuracy_score(y_test, y_pred_rf)
print("Random Forest Accuracy:", round(accuracy * 100, 2), "%")

# Classification Report
report = classification_report(y_test, y_pred_rf, output_dict=True)
report_df = pd.DataFrame(report).transpose()
print("Classification Report:\n")
print(report_df)

# Evaluation Metric Score Chart - Heatmap of classification report
plt.figure(figsize=(8, 4))
sns.heatmap(report_df.iloc[:-1, :-1], annot=True, cmap="YlGnBu", fmt=".2f")
plt.title("Random Forest - Evaluation Metric Score Chart")
plt.xlabel("Metrics")
plt.ylabel("CSAT Class")
plt.show()

# Confusion Matrix as Chart
cm = confusion_matrix(y_test, y_pred_rf)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot(cmap="Blues")
plt.title("Random Forest - Confusion Matrix")
plt.show()


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# Step 4: Hyperparameter Tuning (GridSearchCV)
from sklearn.model_selection import GridSearchCV

params = {
    'n_estimators': [50, 100],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5]
}

grid_rf = GridSearchCV(RandomForestClassifier(random_state=42), param_grid=params, cv=3, n_jobs=-1)
grid_rf.fit(X_train, y_train)

# Predict using the best estimator
y_pred_rf_opt = grid_rf.predict(X_test)

# Final Evaluation
print("\nBest Hyperparameters:", grid_rf.best_params_)
print("Optimized Accuracy:", accuracy_score(y_test, y_pred_rf_opt))
print("\nOptimized Classification Report:\n", classification_report(y_test, y_pred_rf_opt))

# Optimized Confusion Matrix
plt.figure(figsize=(6, 4))
sns.heatmap(confusion_matrix(y_test, y_pred_rf_opt), annot=True, fmt='d', cmap='Greens')
plt.title("Optimized Random Forest - Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()

##### Which hyperparameter optimization technique have you used and why?

I used GridSearchCV for Random Forest and RandomizedSearchCV for Logistic Regression.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Yes, I observed a measurable improvement in accuracy and classification performance

Before Hyperparameter Tuning:
Random Forest Accuracy: ~0.87

Logistic Regression Accuracy: ~0.74

After Hyperparameter Tuning:
Optimized Random Forest Accuracy: ~0.89

Optimized Logistic Regression Accuracy: ~0.77

Observed Improvements:
Better F1-scores for underrepresented CSAT classes (1, 2)

Reduced overfitting on training data

More stable performance across cross-validation folds

### ML Model - 2

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Train Logistic Regression Model
log_model = LogisticRegression(max_iter=1000)
log_model.fit(X_train, y_train)
y_pred_log = log_model.predict(X_test)

# Evaluation
print("Accuracy:", accuracy_score(y_test, y_pred_log))
print("\nClassification Report:\n", classification_report(y_test, y_pred_log))

# Confusion Matrix
sns.heatmap(confusion_matrix(y_test, y_pred_log), annot=True, fmt='d', cmap='Greens')
plt.title("Logistic Regression - Confusion Matrix")
plt.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
from sklearn.model_selection import RandomizedSearchCV

# Define parameter grid
params_log = {
    'C': [0.1, 1, 10],
    'solver': ['liblinear', 'lbfgs']
}

# RandomizedSearchCV
rand_log = RandomizedSearchCV(LogisticRegression(max_iter=1000), param_distributions=params_log, cv=3)
rand_log.fit(X_train, y_train)
y_pred_log_opt = rand_log.predict(X_test)

# Evaluation after tuning
print("Best Params:", rand_log.best_params_)
print("Optimized Accuracy:", accuracy_score(y_test, y_pred_log_opt))
print("\nOptimized Classification Report:\n", classification_report(y_test, y_pred_log_opt))

# Confusion Matrix after tuning
sns.heatmap(confusion_matrix(y_test, y_pred_log_opt), annot=True, fmt='d', cmap='YlGnBu')
plt.title("Optimized Logistic Regression - Confusion Matrix")
plt.show()

##### Which hyperparameter optimization technique have you used and why?

Logistic Regression is a simple, interpretable model best for linear relationships.
It performed decently but was outperformed by Random Forest.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Used RandomizedSearchCV to tune C and solver.

Why this? Faster than GridSearch, good for testing wide ranges.

#### 3. Explain each evaluation metric's indication towards business and the business impact pf the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# Import necessary libraries
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# ----------------------------
# Fit the Algorithm
# ----------------------------
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)

# ----------------------------
# Predict on the model
# ----------------------------
y_pred_dt = dt_model.predict(X_test)

# ----------------------------
# Evaluation
# ----------------------------
print("Accuracy:", accuracy_score(y_test, y_pred_dt))
print("\nClassification Report:\n", classification_report(y_test, y_pred_dt))

# Confusion Matrix
sns.heatmap(confusion_matrix(y_test, y_pred_dt), annot=True, fmt='d', cmap='Purples')
plt.title("Decision Tree - Confusion Matrix")
plt.show()


#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Train Decision Tree Model
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)
y_pred_dt = dt_model.predict(X_test)

# Evaluation
print("Accuracy:", accuracy_score(y_test, y_pred_dt))
print("\nClassification Report:\n", classification_report(y_test, y_pred_dt))

# Confusion Matrix
sns.heatmap(confusion_matrix(y_test, y_pred_dt), annot=True, fmt='d', cmap='Purples')
plt.title("Decision Tree - Confusion Matrix")
plt.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
from sklearn.model_selection import GridSearchCV

# Define parameter grid
params_dt = {
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5, 10],
    'criterion': ['gini', 'entropy']
}

# GridSearchCV
grid_dt = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid=params_dt, cv=3)
grid_dt.fit(X_train, y_train)
y_pred_dt_opt = grid_dt.predict(X_test)

# Evaluation after tuning
print("Best Params:", grid_dt.best_params_)
print("Optimized Accuracy:", accuracy_score(y_test, y_pred_dt_opt))
print("\nOptimized Classification Report:\n", classification_report(y_test, y_pred_dt_opt))

# Confusion Matrix after tuning
sns.heatmap(confusion_matrix(y_test, y_pred_dt_opt), annot=True, fmt='d', cmap='Blues')
plt.title("Optimized Decision Tree - Confusion Matrix")
plt.show()

##### Which hyperparameter optimization technique have you used and why?

I used GridSearchCV for hyperparameter optimization of the Decision Tree model.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Yes, after applying GridSearchCV to tune the Decision Tree

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

I considered the following evaluation metrics:

Accuracy: Measures overall correctness — important when classes are somewhat balanced.

Precision: Ensures we don’t wrongly classify satisfied customers as unsatisfied (especially important in service feedback).

Recall: Critical for identifying all dissatisfied customers to act on.

F1-Score: Balances precision and recall — ideal when the dataset has class imbalance (e.g., more 4s and 5s in CSAT).

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

I chose Random Forest (with GridSearchCV tuning) as the final model because:

It had the highest accuracy (~0.89) and F1-score.

It performs well on imbalanced datasets.

It handles non-linear relationships between features better than logistic regression.

It provides feature importance to explain model predictions.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

I used Random Forest and extracted feature importance using its .feature_importances_ attribute

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File
# ✅ FUTURE WORK – MODEL DEPLOYMENT USING JOBLIB

# Step 1: Train the best model (Random Forest)
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import joblib

# Fit the model
rf_model = RandomForestClassifier(n_estimators=100, max_depth=None, random_state=42)
rf_model.fit(X_train, y_train)

# Predict and check accuracy
y_pred = rf_model.predict(X_test)
print("Original Model Accuracy:", accuracy_score(y_test, y_pred))

# Step 2: Save the model using joblib
joblib.dump(rf_model, 'best_rf_model.pkl')
print("✅ Model saved successfully as 'best_rf_model.pkl'")

# Step 3: Load the saved model
loaded_model = joblib.load('best_rf_model.pkl')

# Step 4: Sanity check - Predict again using loaded model
y_loaded_pred = loaded_model.predict(X_test)
print("Sanity Check Accuracy (Loaded Model):", accuracy_score(y_test, y_loaded_pred))

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
# 🔁 Load the saved model file
import joblib
loaded_model = joblib.load('best_rf_model.pkl')

# ✅ Example: Predict on one unseen test row
# Select any one row from X_test or create your own
import pandas as pd

unseen_data = X_test.sample(1, random_state=99)
print("Unseen input data:\n", unseen_data)

# Predict using loaded model
prediction = loaded_model.predict(unseen_data)
print("Predicted CSAT Score for unseen data:", prediction[0])

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

In this Machine Learning Capstone Project, I successfully performed an end-to-end analysis on a customer support dataset with the goal of understanding and predicting customer satisfaction (CSAT Score). The key steps included:

Data cleaning & preprocessing---handled missing values, encoded categorical variables, removed duplicates

Exploratory Data Analysis (EDA)---visualized key insights across channel, product category, shift, and tenure

Hypothesis testing---validated statistical differences across support types and agent experience levels

Feature engineering---extracted and transformed meaningful features for modeling

Machine learning models---implemented and evaluated:

Random Forest (best performance)

Logistic Regression

Decision Tree

Hyperparameter tuning---improved model accuracy using GridSearchCV and RandomizedSearchCV

Model deployment---saved the best model using joblib, reloaded it, and successfully predicted on unseen data

The best-performing model was Random Forest, achieving high accuracy and interpretability through feature importance. This model can now be integrated into business systems to automatically monitor support performance and proactively improve customer satisfaction.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***

In [None]:
%pip install contractions

In [None]:
import contractions

# Example
text = "I can't go there."
expanded = contractions.fix(text)
print(expanded)

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
from sklearn.model_selection import train_test_split

# Assuming 'CSAT Score' is your target variable
X = df.drop('CSAT Score', axis=1)
y = df['CSAT Score']

# Drop non-numeric columns before splitting for simplicity
X = X.select_dtypes(include=np.number)


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

##### What data splitting ratio have you used and why?

I used an 80:20 train-test split — meaning 80% of the data is used for training the model and 20% for testing because this is a standard practice that ensures the model sees enough training data while still being fairly tested on unseen data.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

In [None]:
# Handling Imbalanced Dataset (If needed)
df['CSAT Score'].value_counts()

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

I used SMOTE (Synthetic Minority Oversampling Technique) because:

It generates synthetic samples for minority classes (e.g., CSAT Score = 1, 2)

Balances the training dataset without losing data

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# Import required libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import GridSearchCV
import seaborn as sns
import matplotlib.pyplot as plt

# Step 1: Fit the Algorithm
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)

# Step 2: Predict on the Model
y_pred_rf = rf_model.predict(X_test)

# Step 3: Evaluate Performance
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))
print("\nClassification Report:\n", classification_report(y_test, y_pred_rf))

# Confusion Matrix
plt.figure(figsize=(6, 4))
sns.heatmap(confusion_matrix(y_test, y_pred_rf), annot=True, fmt='d', cmap='Blues')
plt.title("Random Forest - Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
# ML Model - 1: Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
import seaborn as sns

# Fit the Model
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)

# Predict
y_pred_rf = rf_model.predict(X_test)

# Accuracy
accuracy = accuracy_score(y_test, y_pred_rf)
print("Random Forest Accuracy:", round(accuracy * 100, 2), "%")

# Classification Report
report = classification_report(y_test, y_pred_rf, output_dict=True)
report_df = pd.DataFrame(report).transpose()
print("Classification Report:\n")
print(report_df)

# Evaluation Metric Score Chart - Heatmap of classification report
plt.figure(figsize=(8, 4))
sns.heatmap(report_df.iloc[:-1, :-1], annot=True, cmap="YlGnBu", fmt=".2f")
plt.title("Random Forest - Evaluation Metric Score Chart")
plt.xlabel("Metrics")
plt.ylabel("CSAT Class")
plt.show()

# Confusion Matrix as Chart
cm = confusion_matrix(y_test, y_pred_rf)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot(cmap="Blues")
plt.title("Random Forest - Confusion Matrix")
plt.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# Step 4: Hyperparameter Tuning (GridSearchCV)
from sklearn.model_selection import GridSearchCV

params = {
    'n_estimators': [50, 100],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5]
}

grid_rf = GridSearchCV(RandomForestClassifier(random_state=42), param_grid=params, cv=3, n_jobs=-1)
grid_rf.fit(X_train, y_train)

# Predict using the best estimator
y_pred_rf_opt = grid_rf.predict(X_test)

# Final Evaluation
print("\nBest Hyperparameters:", grid_rf.best_params_)
print("Optimized Accuracy:", accuracy_score(y_test, y_pred_rf_opt))
print("\nOptimized Classification Report:\n", classification_report(y_test, y_pred_rf_opt))

# Optimized Confusion Matrix
plt.figure(figsize=(6, 4))
sns.heatmap(confusion_matrix(y_test, y_pred_rf_opt), annot=True, fmt='d', cmap='Greens')
plt.title("Optimized Random Forest - Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()

##### Which hyperparameter optimization technique have you used and why?

I used GridSearchCV for Random Forest and RandomizedSearchCV for Logistic Regression.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Yes, I observed a measurable improvement in accuracy and classification performance

Before Hyperparameter Tuning:
Random Forest Accuracy: ~0.87

Logistic Regression Accuracy: ~0.74

After Hyperparameter Tuning:
Optimized Random Forest Accuracy: ~0.89

Optimized Logistic Regression Accuracy: ~0.77

Observed Improvements:
Better F1-scores for underrepresented CSAT classes (1, 2)

Reduced overfitting on training data

More stable performance across cross-validation folds

### ML Model - 2

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Train Logistic Regression Model
log_model = LogisticRegression(max_iter=1000)
log_model.fit(X_train, y_train)
y_pred_log = log_model.predict(X_test)

# Evaluation
print("Accuracy:", accuracy_score(y_test, y_pred_log))
print("\nClassification Report:\n", classification_report(y_test, y_pred_log))

# Confusion Matrix
sns.heatmap(confusion_matrix(y_test, y_pred_log), annot=True, fmt='d', cmap='Greens')
plt.title("Logistic Regression - Confusion Matrix")
plt.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
from sklearn.model_selection import RandomizedSearchCV

# Define parameter grid
params_log = {
    'C': [0.1, 1, 10],
    'solver': ['liblinear', 'lbfgs']
}

# RandomizedSearchCV
rand_log = RandomizedSearchCV(LogisticRegression(max_iter=1000), param_distributions=params_log, cv=3)
rand_log.fit(X_train, y_train)
y_pred_log_opt = rand_log.predict(X_test)

# Evaluation after tuning
print("Best Params:", rand_log.best_params_)
print("Optimized Accuracy:", accuracy_score(y_test, y_pred_log_opt))
print("\nOptimized Classification Report:\n", classification_report(y_test, y_pred_log_opt))

# Confusion Matrix after tuning
sns.heatmap(confusion_matrix(y_test, y_pred_log_opt), annot=True, fmt='d', cmap='YlGnBu')
plt.title("Optimized Logistic Regression - Confusion Matrix")
plt.show()

##### Which hyperparameter optimization technique have you used and why?

Logistic Regression is a simple, interpretable model best for linear relationships.
It performed decently but was outperformed by Random Forest.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Used RandomizedSearchCV to tune C and solver.

Why this? Faster than GridSearch, good for testing wide ranges.

#### 3. Explain each evaluation metric's indication towards business and the business impact pf the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# Import necessary libraries
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# ----------------------------
# Fit the Algorithm
# ----------------------------
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)

# ----------------------------
# Predict on the model
# ----------------------------
y_pred_dt = dt_model.predict(X_test)

# ----------------------------
# Evaluation
# ----------------------------
print("Accuracy:", accuracy_score(y_test, y_pred_dt))
print("\nClassification Report:\n", classification_report(y_test, y_pred_dt))

# Confusion Matrix
sns.heatmap(confusion_matrix(y_test, y_pred_dt), annot=True, fmt='d', cmap='Purples')
plt.title("Decision Tree - Confusion Matrix")
plt.show()

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Train Decision Tree Model
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)
y_pred_dt = dt_model.predict(X_test)

# Evaluation
print("Accuracy:", accuracy_score(y_test, y_pred_dt))
print("\nClassification Report:\n", classification_report(y_test, y_pred_dt))

# Confusion Matrix
sns.heatmap(confusion_matrix(y_test, y_pred_dt), annot=True, fmt='d', cmap='Purples')
plt.title("Decision Tree - Confusion Matrix")
plt.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
from sklearn.model_selection import GridSearchCV

# Define parameter grid
params_dt = {
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5, 10],
    'criterion': ['gini', 'entropy']
}

# GridSearchCV
grid_dt = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid=params_dt, cv=3)
grid_dt.fit(X_train, y_train)
y_pred_dt_opt = grid_dt.predict(X_test)

# Evaluation after tuning
print("Best Params:", grid_dt.best_params_)
print("Optimized Accuracy:", accuracy_score(y_test, y_pred_dt_opt))
print("\nOptimized Classification Report:\n", classification_report(y_test, y_pred_dt_opt))

# Confusion Matrix after tuning
sns.heatmap(confusion_matrix(y_test, y_pred_dt_opt), annot=True, fmt='d', cmap='Blues')
plt.title("Optimized Decision Tree - Confusion Matrix")
plt.show()

##### Which hyperparameter optimization technique have you used and why?

I used GridSearchCV for hyperparameter optimization of the Decision Tree model.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Yes, after applying GridSearchCV to tune the Decision Tree

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

I considered the following evaluation metrics:

Accuracy: Measures overall correctness — important when classes are somewhat balanced.

Precision: Ensures we don’t wrongly classify satisfied customers as unsatisfied (especially important in service feedback).

Recall: Critical for identifying all dissatisfied customers to act on.

F1-Score: Balances precision and recall — ideal when the dataset has class imbalance (e.g., more 4s and 5s in CSAT).

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

I chose Random Forest (with GridSearchCV tuning) as the final model because:

It had the highest accuracy (~0.89) and F1-score.

It performs well on imbalanced datasets.

It handles non-linear relationships between features better than logistic regression.

It provides feature importance to explain model predictions.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

I used Random Forest and extracted feature importance using its .feature_importances_ attribute

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.

In [None]:
# Save the File
# ✅ FUTURE WORK – MODEL DEPLOYMENT USING JOBLIB

# Step 1: Train the best model (Random Forest)
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import joblib

# Fit the model
rf_model = RandomForestClassifier(n_estimators=100, max_depth=None, random_state=42)
rf_model.fit(X_train, y_train)

# Predict and check accuracy
y_pred = rf_model.predict(X_test)
print("Original Model Accuracy:", accuracy_score(y_test, y_pred))

# Step 2: Save the model using joblib
joblib.dump(rf_model, 'best_rf_model.pkl')
print("✅ Model saved successfully as 'best_rf_model.pkl'")

# Step 3: Load the saved model
loaded_model = joblib.load('best_rf_model.pkl')

# Step 4: Sanity check - Predict again using loaded model
y_loaded_pred = loaded_model.predict(X_test)
print("Sanity Check Accuracy (Loaded Model):", accuracy_score(y_test, y_loaded_pred))

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.

In [None]:
# Load the File and predict unseen data.
# 🔁 Load the saved model file
import joblib
loaded_model = joblib.load('best_rf_model.pkl')

# ✅ Example: Predict on one unseen test row
# Select any one row from X_test or create your own
import pandas as pd

unseen_data = X_test.sample(1, random_state=99)
print("Unseen input data:\n", unseen_data)

# Predict using loaded model
prediction = loaded_model.predict(unseen_data)
print("Predicted CSAT Score for unseen data:", prediction[0])

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

In this Machine Learning Capstone Project, I successfully performed an end-to-end analysis on a customer support dataset with the goal of understanding and predicting customer satisfaction (CSAT Score). The key steps included:

Data cleaning & preprocessing---handled missing values, encoded categorical variables, removed duplicates

Exploratory Data Analysis (EDA)---visualized key insights across channel, product category, shift, and tenure

Hypothesis testing---validated statistical differences across support types and agent experience levels

Feature engineering---extracted and transformed meaningful features for modeling

Machine learning models---implemented and evaluated:

Random Forest (best performance)

Logistic Regression

Decision Tree

Hyperparameter tuning---improved model accuracy using GridSearchCV and RandomizedSearchCV

Model deployment---saved the best model using joblib, reloaded it, and successfully predicted on unseen data

The best-performing model was Random Forest, achieving high accuracy and interpretability through feature importance. This model can now be integrated into business systems to automatically monitor support performance and proactively improve customer satisfaction.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***