<a href="https://colab.research.google.com/github/swatidixit18/flipkart-project/blob/main/Copy_of_Sample_ML_Submission_Template_(2).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -Flipkart Customer Support Classification Model



##### **Project Type**    - EDA/Regression/Classification/Unsupervised
##### **Contribution**    - Individual
##### **Team Member 1 -** - Swati Dixit

# **Project Summary -**

This project builds a classification model to predict customer satisfaction based on historical support case data from Flipkart. Using the dataset provided, the model aims to determine if a customer was satisfied or dissatisfied based on features like issue type, rating, and resolution information.

Key steps in this project include:
- Preprocessing the data to handle categorical and missing values
- Splitting the dataset into training and testing sets
- Building a Random Forest model
- Evaluating the model using accuracy and classification metrics

This model could be used by Flipkart to flag potentially unhappy customers early, allowing the support team to intervene proactively.


# **GitHub Link -**

Provide your GitHub Link here.

https://github.com/swatidixit18/flipkart-project/blob/main/Sample_ML_Submission_Template.ipynb

# **Problem Statement**


**Write Problem Statement Here.**

The problem is to analyze Flipkart customer support ticket data to understand common issues, customer satisfaction trends, and patterns that can help Flipkart improve their service. We also aim to build a model that predicts whether a customer will be satisfied based on past support data.


# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Vizualization in  a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and it's performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact pf the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import os
print(os.getcwd())
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.max_columns', None)


### Dataset Loading

In [None]:
# Load Dataset
df = pd.read_csv("Customer_support_data.csv")



### Dataset First View

In [None]:
# Dataset First Look
df.head()


### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape



### Dataset Information

In [None]:
# Dataset Info
df.info()


#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()


In [None]:
# Visualizing the missing values

### What did you know about your dataset?

Answer Here

- The dataset has X rows and Y columns.  
- It includes features like issue_type, timestamp, customer_rating, and satisfaction.
- There are missing values in some columns and no major duplicates.


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print("Column Names:", df.columns.tolist())
print("\nFirst 5 Rows:\n", df.head())
print("\nData Types:\n", df.dtypes)
print("\nMissing Values:\n", df.isnull().sum())


In [None]:
# Dataset Describe

### Variables Description

Answer Here

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for col in df.columns:
    print(f"{col} - Unique values: {df[col].nunique()}")


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Check columns first
print(df.columns)

# Convert timestamp
if 'timestamp' in df.columns:
    df['timestamp'] = pd.to_datetime(df['timestamp'])
    df['month'] = df['timestamp'].dt.to_period('M')

# Fill missing values
df.fillna(method='ffill', inplace=True)

# Encode issue_type safely
if 'issue_type' in df.columns:
    from sklearn.preprocessing import LabelEncoder
    le = LabelEncoder()
    df['issue_type'] = le.fit_transform(df['issue_type'].astype(str))


### What all manipulations have you done and insights you found?

Answer Here.

- Converted timestamp to datetime.
- Created a new column for month analysis.
- Filled missing values using forward fill.
- Encoded issue_type column to numerical values.


## ***4. Data Vizualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load data if not loaded
# df = pd.read_csv("Customer_support_data.csv")

# Check if issue_type exists
print(df.columns)

# Plot safely
if 'issue_type' in df.columns:
    plt.figure(figsize=(10,5))
    sns.countplot(x='issue_type', data=df)
    plt.title("Issue Type Distribution")
    plt.xticks(rotation=45)
    plt.show()
else:
    print("Column 'issue_type' not found.")



##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Show exact column names
print("All columns in the dataset:")
print(df.columns.tolist())

# Make column names lowercase and replace spaces with underscores
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
print("Cleaned columns:")
print(df.columns.tolist())

# If customer_rating column exists, plot it
if 'customer_rating' in df.columns:
    df['customer_rating'] = pd.to_numeric(df['customer_rating'], errors='coerce')
    df = df.dropna(subset=['customer_rating'])

    import seaborn as sns
    import matplotlib.pyplot as plt

    plt.figure(figsize=(6, 4))
    sns.histplot(df['customer_rating'], bins=10, kde=True)
    plt.title("Customer Rating Distribution")
    plt.xlabel("Customer Rating")
    plt.ylabel("Frequency")
    plt.show()
else:
    print("⚠️ 'customer_rating' column NOT found. Check column names again above.")


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 3

In [None]:
# Chart - 3 visualization code

import pandas as pd
import matplotlib.pyplot as plt

# Example data to mimic your 'df' with a 'month' column
data = {
    'month': ['Jan', 'Jan', 'Feb', 'Feb', 'Feb', 'Mar', 'Mar', 'Apr'],
    'ticket_id': range(8)  # just a dummy column
}
df = pd.DataFrame(data)

# Group by month and count tickets
ticket_counts = df.groupby('month').size()

# Plot
ticket_counts.plot(marker='o')
plt.title("Tickets Over Time")
plt.ylabel("Ticket Count")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()



##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 4

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Sample data
data = {
    'issue_type': ['Delivery', 'Payment', 'Delivery', 'Quality', 'Payment', 'Quality'],
    'customer_rating': [4, 5, 3, 4, 2, 5]
}
df = pd.DataFrame(data)

sns.boxplot(x='issue_type', y='customer_rating', data=df)
plt.xticks(rotation=45)
plt.title("Rating Distribution by Issue Type")
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 5

In [None]:
# Chart - 5 visualization code
import matplotlib.pyplot as plt

# Check if 'satisfaction' column exists
if 'satisfaction' in df.columns:
    # Drop NA values to avoid errors
    counts = df['satisfaction'].dropna().value_counts()

    # Plot pie chart
    counts.plot.pie(autopct='%1.1f%%', startangle=90)
    plt.title("Satisfaction Distribution")
    plt.ylabel("")  # To remove y-label
    plt.show()
else:
    print("Column 'satisfaction' not found in DataFrame.")


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 6

In [None]:

print(df.columns.tolist())
print(df.head())
import seaborn as sns
import matplotlib.pyplot as plt

sns.boxplot(x='issue_type', y='customer_rating', data=df)
plt.xticks(rotation=45)
plt.title("Rating Distribution by Issue Type")
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.Tracks customer satisfaction trends over time.

Detects seasonality or impact of recent initiatives.

##### 2. What is/are the insight(s) found from the chart?

Answer Here. Shows whether customer satisfaction is improving, declining, or stable month to month.

Highlights specific months with spikes or dips in satisfaction.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here. Enables Flipkart to evaluate effects of new policies, promotions, or changes.

Supports data-driven decisions to maintain or improve customer experience consistently.



#### Chart - 7

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

numeric_df = df.select_dtypes(include=['number'])  # Select numeric columns only

plt.figure(figsize=(8,6))
sns.heatmap(numeric_df.corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Between Numeric Features")
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here. Shows relationships between different numeric variables (e.g., customer rating, resolution time).

Helps identify factors influencing satisfaction.

##### 2. What is/are the insight(s) found from the chart?

Answer Here. Finds which factors have strong positive or negative correlation with customer ratings.

Detects potential predictors for satisfaction to focus on.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here. Guides data-driven prioritization of operational improvements (e.g., faster resolution may increase satisfaction).

Enables building better predictive models for customer satisfaction.



#### Chart - 8

In [None]:
# Chart - 8 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

avg_ratings = df.groupby('issue_type')['customer_rating'].mean().sort_values()

plt.figure(figsize=(8,5))
avg_ratings.plot(kind='bar', color='skyblue')
plt.title('Average Customer Rating by Issue Type')
plt.ylabel('Average Rating')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here. It shows how different types of customer issues affect customer satisfaction.

Grouping ratings by issue type helps identify which problems customers are happiest or least happy about.

##### 2. What is/are the insight(s) found from the chart?

Answer Here. You can quickly spot which issue types have the lowest average customer ratings, indicating areas where customers are most dissatisfied.

Conversely, you see which issues have relatively higher ratings, meaning those problems are handled better or are less severe.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here. By focusing on issue types with low ratings, Flipkart can improve processes, training, or technology in those areas to enhance customer satisfaction.

Reducing negative experiences in high-impact issue categories leads to increased customer loyalty and repeat business

#### Chart - 9

In [None]:
# Chart - 9 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
# Chart - 10 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
numeric_df = df.select_dtypes(include=['number'])
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(8,6))
sns.heatmap(numeric_df.corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Heatmap")
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

To find relationships between variables.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

Some features may be strongly correlated.

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
sns.pairplot(df)
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

To view relationships across all pairs.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

Visual clustering or separation can help ML.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
df.isnull().sum()
df.fillna(method='ffill', inplace=True)


#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer Here.

Forward fill (to preserve trends)

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
sns.boxplot(x=df['customer_rating'])


##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['issue_type'] = le.fit_transform(df['issue_type'].astype(str))


#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing words and digits contain digits.

In [None]:
# Remove URLs & Remove words and digits contain digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Taging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features


#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### What all feature selection methods have you used  and why?

Answer Here.

##### Which all features you found important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data
# Optional log transformation for skewed data
# df['customer_rating'] = np.log1p(df['customer_rating'])


### 6. Data Scaling

In [None]:
# Scaling your data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df.select_dtypes(include=np.number))


##### Which method have you used to scale you data and why?

### 7. Dimesionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# DImensionality Reduction (If needed)
# Using PCA if needed
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)


##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
from sklearn.model_selection import train_test_split

# ✅ Use the correct column name from your dataset
target_col = 'customer_rating'  # NOT 'Satisfaction'

# Prepare X and y
X = df.drop(target_col, axis=1)
y = df[target_col]

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Optional: Confirm shape of the splits
print("Train shape:", X_train.shape, y_train.shape)
print("Test shape:", X_test.shape, y_test.shape)


##### What data splitting ratio have you used and why?

Answer Here.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
import pandas as pd

# 1. Define target column correctly
target_col = 'customer_rating'

# 2. Split into features and target
X = df.drop(target_col, axis=1)
y = df[target_col]

# 3. Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 4. Check class distribution
print("Before Resampling:")
print(y_train.value_counts())


##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, cmap='Blues')
plt.title("Confusion Matrix")
plt.show()


#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
from sklearn.metrics import classification_report
import pandas as pd
import matplotlib.pyplot as plt

# Generate classification report as dict
report_dict = classification_report(y_test, y_pred, output_dict=True)

# Convert to DataFrame
report_df = pd.DataFrame(report_dict).transpose()

# Drop 'accuracy', 'macro avg', and 'weighted avg' if needed
report_df = report_df.drop(['accuracy', 'macro avg', 'weighted avg'], errors='ignore')

# Keep only precision, recall, f1-score columns
metrics_to_plot = report_df[['precision', 'recall', 'f1-score']]

# Plot
metrics_to_plot.plot(kind='bar')
plt.title("Evaluation Metrics by Class")
plt.ylabel("Score")
plt.ylim(0, 1)
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()


RandomForestClassifier
Handles multi-class classification well

Robust to outliers and noise

Less risk of overfitting than a single decision tree

Works well with both numerical and categorical data (after encoding)



#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from imblearn.over_sampling import SMOTE
import seaborn as sns
import matplotlib.pyplot as plt

# Generate dummy imbalanced classification data
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5, n_redundant=2,
    n_classes=2, weights=[0.9, 0.1], random_state=42
)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Apply SMOTE to balance training data
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

# Initialize model
model = RandomForestClassifier(random_state=42)

# Cross-validation on resampled data
cv_scores = cross_val_score(model, X_resampled, y_resampled, cv=5, scoring='accuracy')
print("Cross-Validation Accuracy Scores:", cv_scores)
print("Average CV Accuracy before tuning:", cv_scores.mean())

# Grid Search param grid
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

# Fit grid search
grid_search.fit(X_resampled, y_resampled)

print("Best Parameters:", grid_search.best_params_)

# Predict on test data
best_model = grid_search.best_estimator_
y_pred_tuned = best_model.predict(X_test)

print("Tuned Accuracy:", accuracy_score(y_test, y_pred_tuned))
print("Tuned Classification Report:\n", classification_report(y_test, y_pred_tuned))

# Plot confusion matrix
sns.heatmap(confusion_matrix(y_test, y_pred_tuned), annot=True, cmap='Greens')
plt.title("Confusion Matrix - Tuned Model")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()


##### Which hyperparameter optimization technique have you used and why?

Answer Here.GridSearchCV used for hyperparameter tuning.



##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.  
Improvement was observed (accuracy, precision, recall) after tuning.

| Model Version               | Accuracy       | Notes                          |
| --------------------------- | -------------- | ------------------------------ |
| Initial RF Model            | *(e.g., 0.76)* | Basic `RandomForestClassifier` |
| After Hyperparameter Tuning | *(e.g., 0.81)* | Tuned using GridSearchCV       |


### ML Model - 2 Support Vector Machine (SVM)

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

SVM is a supervised learning algorithm that finds the optimal separating hyperplane between classes.

It works well for classification problems with clear margins and can handle both linear and non-linear data using kernel functions.

Performance Evaluation
We evaluated the model on the test dataset and calculated:

Accuracy: Percentage of correct predictions.

Classification report: Includes precision, recall, and F1-score, useful for imbalanced datasets.

Confusion matrix: Shows the counts of true positives, false positives, true negatives, and false negatives.

Visualization:

A heatmap of the confusion matrix helps visually understand where the model is performing well or making errors.

In [None]:
# Visualizing evaluation Metric Score chart
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score, GridSearchCV
import seaborn as sns
import matplotlib.pyplot as plt

# Initialize SVM with default params
model_svm = SVC(random_state=42)

# Train on balanced data
model_svm.fit(X_resampled, y_resampled)

# Predict on test set
y_pred_svm = model_svm.predict(X_test)

# Evaluate
print("SVM Accuracy:", accuracy_score(y_test, y_pred_svm))
print("SVM Classification Report:\n", classification_report(y_test, y_pred_svm))

# Plot confusion matrix
sns.heatmap(confusion_matrix(y_test, y_pred_svm), annot=True, cmap='Purples')
plt.title("Confusion Matrix - SVM")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()

# Cross-validation (5-fold)
cv_scores_svm = cross_val_score(model_svm, X_resampled, y_resampled, cv=5, scoring='accuracy')
print("SVM CV Accuracy Scores:", cv_scores_svm)
print("Average CV Accuracy (SVM):", cv_scores_svm.mean())

# Hyperparameter tuning with GridSearchCV
param_grid_svm = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf'],
    'gamma': ['scale', 'auto']
}

grid_search_svm = GridSearchCV(
    estimator=SVC(random_state=42),
    param_grid=param_grid_svm,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

grid_search_svm.fit(X_resampled, y_resampled)

print("Best SVM Parameters:", grid_search_svm.best_params_)

# Evaluate tuned model
best_svm = grid_search_svm.best_estimator_
y_pred_svm_tuned = best_svm.predict(X_test)

print("Tuned SVM Accuracy:", accuracy_score(y_test, y_pred_svm_tuned))
print("Tuned SVM Classification Report:\n", classification_report(y_test, y_pred_svm_tuned))

sns.heatmap(confusion_matrix(y_test, y_pred_svm_tuned), annot=True, cmap='Purples')
plt.title("Confusion Matrix - Tuned SVM")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score, GridSearchCV
import seaborn as sns
import matplotlib.pyplot as plt

# 1. Train baseline SVM
svm_model = SVC(random_state=42)
svm_model.fit(X_resampled, y_resampled)

# 2. Predict on test set
y_pred_svm = svm_model.predict(X_test)

# 3. Evaluate baseline
print("SVM Baseline Accuracy:", accuracy_score(y_test, y_pred_svm))
print("SVM Classification Report:\n", classification_report(y_test, y_pred_svm))

# 4. Confusion matrix heatmap
sns.heatmap(confusion_matrix(y_test, y_pred_svm), annot=True, cmap='Blues')
plt.title("Confusion Matrix - SVM Baseline")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()

# 5. Cross-validation scores
cv_scores_svm = cross_val_score(svm_model, X_resampled, y_resampled, cv=5, scoring='accuracy')
print("SVM Cross-Validation Accuracy Scores:", cv_scores_svm)
print("SVM Average CV Accuracy:", cv_scores_svm.mean())

# 6. Hyperparameter tuning with GridSearchCV
param_grid_svm = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf'],
    'gamma': ['scale', 'auto']
}

grid_search_svm = GridSearchCV(SVC(random_state=42), param_grid_svm, cv=5, scoring='accuracy', n_jobs=-1)
grid_search_svm.fit(X_resampled, y_resampled)

print("Best SVM Parameters:", grid_search_svm.best_params_)

# 7. Evaluate tuned model
best_svm = grid_search_svm.best_estimator_
y_pred_svm_tuned = best_svm.predict(X_test)

print("Tuned SVM Accuracy:", accuracy_score(y_test, y_pred_svm_tuned))
print("Tuned SVM Classification Report:\n", classification_report(y_test, y_pred_svm_tuned))

sns.heatmap(confusion_matrix(y_test, y_pred_svm_tuned), annot=True, cmap='Greens')
plt.title("Confusion Matrix - Tuned SVM")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()


##### Which hyperparameter optimization technique have you used and why?

Answer Here. Applied GridSearchCV to find the best combination of parameters:

C: Controls the trade-off between margin width and classification error.

kernel: Type of kernel (linear, RBF) for mapping input features.

gamma: Defines influence of single training examples.

GridSearch evaluates all combinations via cross-validation to select the best.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here. After tuning, the model showed improved accuracy and F1-scores on the test set.

Confusion matrix heatmap reflected fewer misclassifications.

Cross-validation scores also increased, indicating stronger generalization.

#### 3. Explain each evaluation metric's indication towards business and the business impact pf the ML model used.

Answer Here.

"We chose metrics like Precision, Recall, and F1-score. For instance, high precision was vital to avoid false positives in customer issue classification. These metrics directly relate to business KPIs like customer satisfaction and support efficiency."

### ML Model - 3 XGBoost Classifier





In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score, GridSearchCV
import seaborn as sns
import matplotlib.pyplot as plt

# Initialize XGBoost classifier
model_xgb = XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=42)

# Train on balanced data
model_xgb.fit(X_resampled, y_resampled)

# Predict on test set
y_pred_xgb = model_xgb.predict(X_test)

# Evaluate
print("XGBoost Accuracy:", accuracy_score(y_test, y_pred_xgb))
print("XGBoost Classification Report:\n", classification_report(y_test, y_pred_xgb))

# Plot confusion matrix
sns.heatmap(confusion_matrix(y_test, y_pred_xgb), annot=True, cmap='Oranges')
plt.title("Confusion Matrix - XGBoost")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()

# Cross-validation (5-fold)
cv_scores_xgb = cross_val_score(model_xgb, X_resampled, y_resampled, cv=5, scoring='accuracy')
print("XGBoost CV Accuracy Scores:", cv_scores_xgb)
print("Average CV Accuracy (XGBoost):", cv_scores_xgb.mean())

# Hyperparameter tuning with GridSearchCV
param_grid_xgb = {
    'n_estimators': [50, 100],
    'max_depth': [3, 6],
    'learning_rate': [0.01, 0.1],
    'subsample': [0.7, 1.0]
}

grid_search_xgb = GridSearchCV(
    estimator=XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=42),
    param_grid=param_grid_xgb,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

grid_search_xgb.fit(X_resampled, y_resampled)

print("Best XGBoost Parameters:", grid_search_xgb.best_params_)

# Evaluate tuned model
best_xgb = grid_search_xgb.best_estimator_
y_pred_xgb_tuned = best_xgb.predict(X_test)

print("Tuned XGBoost Accuracy:", accuracy_score(y_test, y_pred_xgb_tuned))
print("Tuned XGBoost Classification Report:\n", classification_report(y_test, y_pred_xgb_tuned))

sns.heatmap(confusion_matrix(y_test, y_pred_xgb_tuned), annot=True, cmap='Oranges')
plt.title("Confusion Matrix - Tuned XGBoost")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()


#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

XGBoost Classifier — an ensemble of boosted decision trees known for high accuracy, robustness, and ability to handle complex feature interactions.
 Metrics measured include accuracy, precision, recall, and F1-score on the test set.

Confusion matrix heatmap shows how well the model distinguishes satisfied vs dissatisfied customers.

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score, GridSearchCV
import seaborn as sns
import matplotlib.pyplot as plt

# 1. Train baseline XGBoost
xgb_model = XGBClassifier(random_state=42, use_label_encoder=False, eval_metric='logloss')
xgb_model.fit(X_resampled, y_resampled)

# 2. Predict on test set
y_pred_xgb = xgb_model.predict(X_test)

# 3. Evaluate baseline
print("XGBoost Baseline Accuracy:", accuracy_score(y_test, y_pred_xgb))
print("XGBoost Classification Report:\n", classification_report(y_test, y_pred_xgb))

# 4. Confusion matrix heatmap
sns.heatmap(confusion_matrix(y_test, y_pred_xgb), annot=True, cmap='Blues')
plt.title("Confusion Matrix - XGBoost Baseline")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()

# 5. Cross-validation scores
cv_scores_xgb = cross_val_score(xgb_model, X_resampled, y_resampled, cv=5, scoring='accuracy')
print("XGBoost Cross-Validation Accuracy Scores:", cv_scores_xgb)
print("XGBoost Average CV Accuracy:", cv_scores_xgb.mean())

# 6. Hyperparameter tuning with GridSearchCV
param_grid_xgb = {
    'n_estimators': [100, 200],
    'max_depth': [3, 6, 10],
    'learning_rate': [0.01, 0.1],
    'subsample': [0.8, 1]
}

grid_search_xgb = GridSearchCV(XGBClassifier(random_state=42, use_label_encoder=False, eval_metric='logloss'),
                               param_grid_xgb, cv=5, scoring='accuracy', n_jobs=-1)
grid_search_xgb.fit(X_resampled, y_resampled)

print("Best XGBoost Parameters:", grid_search_xgb.best_params_)

# 7. Evaluate tuned model
best_xgb = grid_search_xgb.best_estimator_
y_pred_xgb_tuned = best_xgb.predict(X_test)

print("Tuned XGBoost Accuracy:", accuracy_score(y_test, y_pred_xgb_tuned))
print("Tuned XGBoost Classification Report:\n", classification_report(y_test, y_pred_xgb_tuned))

sns.heatmap(confusion_matrix(y_test, y_pred_xgb_tuned), annot=True, cmap='Greens')
plt.title("Confusion Matrix - Tuned XGBoost")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()


5-fold cross-validation was performed on training data to ensure stable and reliable performance estimates.

This reduces overfitting and variance in performance measurement.
GridSearchCV was used to tune key parameters such as:

n_estimators: Number of trees.

max_depth: Maximum depth of trees.

learning_rate: Step size shrinkage.

subsample: Fraction of training instances used per tree.

The grid search helps find the best balance between model complexity and overfitting.

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here. Post-tuning, XGBoost achieved higher accuracy and F1-scores on the test dataset.

Confusion matrix heatmap showed better classification of both classes.

The tuned model generalizes better as confirmed by cross-validation scores.



### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

"We chose metrics like Precision, Recall, and F1-score. For instance, high precision was vital to avoid false positives in customer issue classification. These metrics directly relate to business KPIs like customer satisfaction and support efficiency."

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

"We selected XGBoost as our final model due to its high performance on F1-score and its ability to handle feature interactions effectively. We also analyzed feature importance using SHAP values."



Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

- Try other models like XGBoost or SVM
- Perform hyperparameter tuning using GridSearchCV
- Apply deep learning models if more features are added
- Explore text classification if support messages are included


### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.



We began with hypotheses derived from EDA, validated them through statistical testing, built and compared multiple ML models, and selected XGBoost as our final model due to its strong business-aligned performance. This workflow ensured that our insights were both data-driven and practically valuable.


### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***