# **Project Name**    - Email Campaign Effectiveness Prediction



##### **Project Type**    - **Supervised ML Classification**
##### **Contribution**    - Individual


# **Project Summary -**

The objective of this project is to develop a machine learning model capable of analyzing and monitoring emails within Gmail-based email marketing campaigns. The intended users are small to medium-sized business owners seeking to enhance the efficiency of their email marketing strategies and boost customer retention.

A primary challenge in email marketing is deciphering which emails are being read, ignored, or acknowledged by recipients. Gaining insights into the effectiveness of emails enables business owners to tailor their marketing approaches and enhance their likelihood of success.

To tackle this issue, we will collect data encompassing various email attributes, including the subject line, sender name, email content, format, frequency, target audience, and other pertinent factors. Leveraging this data, we will train a machine learning model to predict whether an email is likely to be read, ignored, or acknowledged by the recipient. The model will have the capability to analyze new emails and provide predictions on how they are expected to be received.

To assess the model's performance, we will divide our data into a training set and a testing set. The training set will be utilized to train the model, while the testing set will be used to evaluate its accuracy. Metrics such as precision, recall, and F1 score will be employed to gauge the model's effectiveness.
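
The evaluation plan above can be sketched as follows. This is a minimal illustration, not the project's final pipeline: the data is synthetic (an assumption), logistic regression stands in for whichever model is eventually chosen, and the 80/20 split ratio is only an example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): three classes with an imbalanced mix.
rng = np.random.default_rng(42)
y = rng.choice([0, 1, 2], size=1000, p=[0.80, 0.16, 0.04])
X = rng.normal(size=(1000, 4)) + y[:, None]  # shift features by class so there is signal

# Stratify so the train and test sets keep the same class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Train on the training set only, then predict on the held-out test set.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Per-class precision, recall and F1 on the test set.
precision, recall, f1, support = precision_recall_fscore_support(
    y_test, y_pred, labels=[0, 1, 2], zero_division=0
)
for label, p, r, f, n in zip([0, 1, 2], precision, recall, f1, support):
    print(f"class {label}: precision={p:.2f} recall={r:.2f} f1={f:.2f} n={n}")
```

Reporting the metrics per class rather than as a single accuracy figure makes it easier to see how the model behaves on the rarer classes.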

Upon successful training and evaluation, the model can be deployed in a production environment, offering business owners a valuable tool to enhance the effectiveness of their email marketing campaigns. By leveraging the model to characterize and monitor emails, they can make more informed decisions about targeting their marketing efforts and increasing customer retention.

In summary, this project endeavors to provide small to medium business owners with a robust solution for optimizing their email marketing campaigns. Through the application of machine learning to analyze and monitor emails, these business owners can make informed decisions and elevate the success potential of their marketing endeavors.


# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


Small and medium-sized business owners are currently employing Gmail-based email marketing tactics to convert potential customers into leads. However, they face a challenge in tracking the reception of their emails—whether they are being ignored, read, or acknowledged by the recipients. Their goal is to develop a machine learning model that can assist in characterizing and monitoring these emails. The primary aim is to enhance the efficiency of their email marketing initiatives and boost customer retention.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Every piece of logic should be properly commented.
4. You may add as many charts as you want. Make sure each chart answers the following questions.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure that for each algorithm, the following points are answered.


*   Explain the ML model used and its performance using an evaluation metric score chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with an updated evaluation metric score chart.

*   Explain what each evaluation metric indicates for the business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
#update scikit learn for some features like roc_auc_ovr
!pip install --upgrade scikit-learn
!pip install shap



In [None]:
# Import Libraries
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Wildcard import kept so that later cells can call scipy.stats functions by bare name
from scipy.stats import *
from scipy import stats
import math
import shap

from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score,
                             roc_auc_score, roc_curve)
# Multiclass AUC is computed with roc_auc_score(..., multi_class='ovr' or 'ovo');
# sklearn.metrics has no separate roc_auc_ovr / roc_auc_ovo function to import
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedStratifiedKFold
from xgboost import XGBClassifier
from xgboost import XGBRFClassifier

# The following lines adjust the granularity of reporting.
pd.options.display.float_format = "{:.2f}".format

import warnings
warnings.filterwarnings('ignore')



### Dataset Loading

In [None]:
# Load Dataset
df = pd.read_csv("/content/data_email_campaign.csv")

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

# Flag rows that are exact duplicates of an earlier row
duplicated_rows = df.duplicated()

# Count the duplicated rows (the True values in the boolean Series)
num_duplicates = int(duplicated_rows.sum())

# Print the result
print(f"Duplicated rows: {num_duplicates} out of {len(df)}")


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Return the names of the columns in df that contain missing values
def showMissing():
    missing = df.columns[df.isnull().any()].tolist()
    return missing

# Count missing values per affected column, largest first
missing_counts = df[showMissing()].isnull().sum().sort_values(ascending=False)

# Summarise the result in the missingVal DataFrame with two columns:
# 1. 'Missing Data Count': number of missing values for each column in df
# 2. 'Missing Data Percentage': percentage of missing values for each column in df
missingVal = pd.DataFrame()
missingVal['Missing Data Count'] = missing_counts
missingVal['Missing Data Percentage'] = missing_counts / len(df) * 100
missingVal

In [None]:
# Visualizing the missing values
# Set the size of the heatmap
plt.figure(figsize=(12, 8))

# Create a heatmap to visualize missing values in df
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')

# Set the title of the plot
plt.title('Missing Values Heatmap')

# Show the plot
plt.show()


### What did you know about your dataset?



1. The dataset consists of 68,353 observations and 12 features.

2. It contains a mix of integer, float, and object data types.
3. The dataset has no duplicate rows, so no record is counted more than once. Duplicates, if present, could bias results or distort summary statistics in downstream analyses.
4. Four features contain null values: Customer_Location (11,595, about 17.0%), Total_Past_Communications (6,825, about 10.0%), Total_Links (2,201, about 3.2%), and Total_Images (1,677, about 2.5%).




## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe().T

### Variables Description

**Attribute Information**

---
* **Email_ID** - Email ID of the customer
* **Email_Type** - Two categories, 1 and 2. They can be thought of as marketing emails versus important updates and notices, such as business-related emails
* **Subject_Hotness_Score** - A score for the email's subject line, based on how good and effective its content is
* **Email_Source_Type** - The source of the email, such as sales and marketing, or important admin emails related to the product
* **Email_Campaign_Type** - The campaign type of the email
* **Customer_Location** - Demographic data: the location where the customer resides
* **Total_Past_Communications** - The total number of previous emails from the same source
* **Time_Email_sent_Category** - Three categories, 1, 2, and 3, for the time of day the email was sent: morning, evening, or night
* **Word_Count** - Total number of words in the email
* **Total_Links** - Total number of links in the email
* **Total_Images** - Total number of images in the email
* **Email_Status** - The target variable: whether the email was ignored, read, or acknowledged by the reader
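
For readability in tables and charts, the coded categories described above can be mapped to text labels. The sketch below is only illustrative: the status codes follow how the notebook filters `Email_Status` (0 ignored, 1 read, 2 acknowledged), while the order of the time-of-day codes is an assumption, and `STATUS_LABELS`, `TIME_LABELS`, and the small `sample` frame are hypothetical names.

```python
import pandas as pd

# Hypothetical label maps. The status codes follow the notebook's usage;
# the time-of-day order (1=morning, 2=evening, 3=night) is an assumption.
STATUS_LABELS = {0: "Ignored", 1: "Read", 2: "Acknowledged"}
TIME_LABELS = {1: "Morning", 2: "Evening", 3: "Night"}

# Tiny stand-in frame that reuses the real column names.
sample = pd.DataFrame({
    "Email_Status": [0, 1, 2, 0],
    "Time_Email_sent_Category": [1, 2, 3, 2],
})

# Add label columns next to the original codes instead of overwriting them.
labelled = sample.assign(
    Email_Status_Label=sample["Email_Status"].map(STATUS_LABELS),
    Time_Sent_Label=sample["Time_Email_sent_Category"].map(TIME_LABELS),
)
print(labelled)
```

Keeping the numeric codes alongside the labels leaves the original columns available for modelling.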

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for column in df.columns.tolist():
    unique_values = df[column].nunique()
    print("Number of unique values in '{}' is {}.".format(column, unique_values))
    if unique_values < 10:  # Adjust the threshold as needed
        print("Unique Values: {}".format(df[column].unique()))
    print("-" * 40)


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Count of emails read
print("Number of Emails Read : -", len(df[df['Email_Status'] == 1]))
# Count of emails acknowledged by the reader
print("Number of Emails Acknowledged : -", len(df[df['Email_Status'] == 2]))
# Count of emails ignored
print("Number of Emails Ignored : -", len(df[df['Email_Status'] == 0]))

In [None]:
# Email Status groupby Email_Type
result_df = df.groupby(['Email_Type', 'Email_Status']).size().reset_index(name="Count")
print(result_df)


In [None]:
# Email Status groupby Customer_Location
pd.DataFrame(df.groupby('Customer_Location')['Email_Status'].value_counts().reset_index(name="Count"))

In [None]:
# Email Status groupby Email_Source_Type
pd.DataFrame(df.groupby('Email_Source_Type')['Email_Status'].value_counts().reset_index(name="Count"))

In [None]:
# Email Status groupby Email_Campaign_Type
pd.DataFrame(df.groupby('Email_Campaign_Type')['Email_Status'].value_counts().reset_index(name="Count"))

In [None]:
# Email Status groupby Time_Email_sent_Category
pd.DataFrame(df.groupby('Time_Email_sent_Category')['Email_Status'].value_counts().reset_index(name="Count"))

In [None]:
#creating variable to store numerical feature
num_feature = df.select_dtypes(include = 'float').columns.to_list()
num_feature.append('Word_Count')
num_feature

In [None]:
#creating variable to store categorial features
cat_feature = [feature for feature in df.columns.to_list() if feature not in num_feature]
cat_feature

In [None]:
#finding count, sum, mean and median based on Email Type
df.groupby('Email_Type')[num_feature].agg(['count','sum','mean','median']).T

In [None]:
#finding count, sum, mean and median based on Email_Source_Type
df.groupby('Email_Source_Type')[num_feature].agg(['count','sum','mean','median']).T

In [None]:
#finding count, sum, mean and median based on Customer_Location
df.groupby('Customer_Location')[num_feature].agg(['count','sum','mean','median']).T

In [None]:
#finding count, sum, mean and median based on Email_Campaign_Type
df.groupby('Email_Campaign_Type')[num_feature].agg(['count','sum','mean','median']).T

In [None]:
#finding count, sum, mean and median based on Time_Email_sent_Category
df.groupby('Time_Email_sent_Category')[num_feature].agg(['count','sum','mean','median']).T

In [None]:
#Analyzing mean median and sum based on Email Acknowledged with respect to numerical features
df[df['Email_Status'] == 2][num_feature].agg(['sum','mean','median']).T

In [None]:
#Analyzing mean median and sum based on Email Opened with respect to numerical features
df[df['Email_Status'] == 1][num_feature].agg(['sum','mean','median']).T

In [None]:
#Analyzing mean median and sum based on Email Ignored with respect to numerical features
df[df['Email_Status'] == 0][num_feature].agg(['sum','mean','median']).T

In [None]:
# Analyzing sum, mean and median of the numerical features for acknowledged emails, per category
for cat in cat_feature:
  if (cat == 'Email_Status') | (cat == 'Email_ID'):
    pass
  else:
    print(f'Email Acknowledged based on {cat} \n')
    print(df[df['Email_Status'] == 2].groupby(cat)[num_feature
                            ].agg(['sum','mean','median']).T)
    print('='*120)

In [None]:
# Analyzing sum, mean and median of the numerical features for read emails, per category
for cat in cat_feature:
  if (cat == 'Email_Status') | (cat == 'Email_ID'):
    pass
  else:
    print(f'Email Opened based on {cat} \n')
    print(df[df['Email_Status']== 1].groupby(cat)[num_feature
                            ].agg(['sum','mean','median']).T)
    print('='*120)

In [None]:
# Analyzing sum, mean and median of the numerical features for ignored emails, per category
for cat in cat_feature:
  if (cat == 'Email_Status') | (cat == 'Email_ID'):
    pass
  else:
    print(f'Email Ignored based on {cat} \n')
    print(df[df['Email_Status'] == 0].groupby(cat)[num_feature
                            ].agg(['sum','mean','median']).T)
    print('='*120)

In [None]:
# Calculate engagement rate based on emails acknowledged by the readers (Email_Status == 2)
engagement_count = df[df['Email_Status'] == 2]['Email_Status'].count()
total_emails = len(df)
engagement_rate = (engagement_count / total_emails) * 100

# Print the result with a comment
print(f"Engagement Rate: {engagement_rate:.2f}% (Emails Acknowledged by Readers)")

# Calculate open rate based on emails read or acknowledged by the readers (Email_Status != 0)
open_count = df[df['Email_Status'] != 0]['Email_Status'].count()
open_rate = (open_count / total_emails) * 100

# Print the open rate
print(f"Open Rate: {open_rate:.2f}% (Emails Read or Acknowledged by Readers)")

# Calculate ignored rate based on emails ignored by the readers (Email_Status == 0)
ignored_count = df[df['Email_Status'] == 0]['Email_Status'].count()
ignored_rate = (ignored_count / total_emails) * 100

# Print the ignored rate
print(f"Ignored Rate: {ignored_rate:.2f}% (Emails Ignored by Readers)")


In [None]:
# Function to calculate engagement rate
def Engagement(group):
    acknowledged_emails = group[group['Email_Status'] == 2]
    return len(acknowledged_emails) / len(group)

# Function to calculate open rate
def Open(group):
    opened_emails = group[group['Email_Status'] != 0]
    return len(opened_emails) / len(group)

# Function to calculate ignore rate
def Ignore(group):
    ignored_emails = group[group['Email_Status'] == 0]
    return (len(ignored_emails) / len(group))

# List of categorical features to analyze
cat_feature = ['Email_Type', 'Email_Source_Type', 'Customer_Location', 'Email_Campaign_Type', 'Time_Email_sent_Category']

# Calculating engagement, open, and ignored rate for each categorical feature
for cat in cat_feature:
    # Skip Email_Status and Email_ID
    if (cat == 'Email_Status') or (cat == 'Email_ID'):
        continue

    print(f'Engagement Rate for - {cat}')
    print(df.groupby(cat).apply(Engagement))
    print('\n')

    print(f'Open Rate for - {cat}')
    print(df.groupby(cat).apply(Open))
    print('\n')

    print(f'Ignored Rate for - {cat}')
    print(df.groupby(cat).apply(Ignore))
    print('=' * 120)


In [None]:
# Link-to-Word ratio
df['Link_to_Word_ratio'] = df['Total_Links'] / df['Word_Count']

# Image-to-Word ratio
df['Image_to_Word_ratio'] = df['Total_Images'] / df['Word_Count']

# Image-Link-Word ratio
df['Image_Link_Word_ratio'] = (df['Total_Images'] + df['Total_Links']) / df['Word_Count']

# Percentage of words that are links
df['Percentage_of_words_that_are_links'] = (df['Total_Links'] / df['Word_Count']) * 100

# Number of Images per link
df['Number_of_Images_per_link'] = df['Total_Images'] / df['Total_Links']

# Interaction feature: product of Subject_Hotness_Score and Total_Past_Communications
df['Hotness_Score'] = df['Subject_Hotness_Score'] * df['Total_Past_Communications']

# Calculate and print the mean of each variable
print("Mean of Link_to_Word_ratio:", df['Link_to_Word_ratio'].mean())
print("Mean of Image_to_Word_ratio:", df['Image_to_Word_ratio'].mean())
print("Mean of Image_Link_Word_ratio:", df['Image_Link_Word_ratio'].mean())
print("Mean of Percentage_of_words_that_are_links:", df['Percentage_of_words_that_are_links'].mean())
print("Mean of Number_of_Images_per_link:", df['Number_of_Images_per_link'].mean())
print("Mean of Hotness_Score:", df['Hotness_Score'].mean())


### What all manipulations have you done and insights you found?

To understand the effectiveness of the email campaigns, I started by examining the share of emails that were read, acknowledged, or ignored.


*   Emails read: 11,039, about 16.2% of the total.
*   Emails acknowledged: 2,373, about 3.5% of the total.
*   Emails ignored: 54,941, about 80.4% of the total.

Most emails were therefore ignored. To investigate why, I grouped the data by each categorical feature to uncover potential relationships, and examined every result in the context of the email status.
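
Because the three classes are this unevenly represented, a model can score well on accuracy while rarely predicting the smaller classes. One common way to quantify the imbalance is the "balanced" class-weight heuristic, `n_samples / (n_classes * class_count)`, which is the formula scikit-learn documents for `class_weight="balanced"`. The sketch below applies it to the class counts reported above; whether to use class weights, resampling, or neither is left open here.

```python
import numpy as np

# Class counts reported above: ignored (0), read (1), acknowledged (2).
counts = np.array([54941, 11039, 2373])

# Share of each class in the dataset.
shares = counts / counts.sum()

# "Balanced" heuristic: n_samples / (n_classes * class_count).
# Rarer classes receive proportionally larger weights.
weights = counts.sum() / (len(counts) * counts)

for label, share, weight in zip(range(3), shares, weights):
    print(f"class {label}: share={share:.1%} weight={weight:.2f}")
```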

I further assessed the rates at which emails were opened, ignored, or acknowledged for each categorical variable. Additionally, I computed link and image density relative to the total word count, which gives the percentage of links and images per word. Finally, I multiplied the subject hotness score by the total past communications to create a combined interaction feature.
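
The per-category rates described above can also be obtained in a single step with a row-normalised cross-tabulation. The sketch below uses a tiny made-up frame (an assumption) in place of `df`; the `open_rate` column name is hypothetical.

```python
import pandas as pd

# Tiny stand-in frame (assumption); the notebook itself would pass df.
sample = pd.DataFrame({
    "Email_Type":   [1, 1, 1, 1, 2, 2, 2, 2],
    "Email_Status": [0, 0, 1, 2, 0, 0, 0, 1],
})

# Row-normalised crosstab: each row sums to 1 and holds the ignored (0),
# read (1) and acknowledged (2) share within each Email_Type.
rates = pd.crosstab(sample["Email_Type"], sample["Email_Status"], normalize="index")

# Open rate = read or acknowledged, matching Email_Status != 0 in the notebook.
rates["open_rate"] = rates[1] + rates[2]
print(rates)
```

Compared with three separate `groupby(...).apply(...)` calls, this produces the ignored, read, and acknowledged rates side by side in one table.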

The analysis suggests that acknowledged emails are associated with an average of about 37 past communications, more than is observed for ignored emails. Acknowledged emails also average roughly 590-600 words and, in this data, typically contain no more than 10 links and 3 images. These are associations in the data rather than guaranteed thresholds.

Notably, emails of Type 1 with Source 2, Location C, and Campaign Type 1, when sent in the morning, demonstrated effective engagement, boasting a higher count of acknowledged emails.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 2

In [None]:
# Chart - 2 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 3

In [None]:
# Chart - 3 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 4

In [None]:
# Chart - 4 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 5

In [None]:
# Chart - 5 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 6

In [None]:
# Chart - 6 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 7

In [None]:
# Chart - 7 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 8

In [None]:
# Chart - 8 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9

In [None]:
# Chart - 9 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
# Chart - 10 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.
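
As one possible starting point for a hypothesis about two categorical variables (for example, whether email status is independent of email type), a chi-square test of independence can be run on their contingency table. The sketch below uses made-up counts (an assumption), not results from the dataset, and the choice of test is only a suggestion.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up data (assumption): 100 emails of each type with different status mixes.
sample = pd.DataFrame({
    "Email_Type":   [1] * 100 + [2] * 100,
    "Email_Status": [0] * 70 + [1] * 20 + [2] * 10
                  + [0] * 50 + [1] * 30 + [2] * 20,
})

# Contingency table of observed counts: rows = Email_Type, columns = Email_Status.
table = pd.crosstab(sample["Email_Type"], sample["Email_Status"])

# H0: Email_Status is independent of Email_Type.
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.3f}, dof={dof}, p-value={p_value:.4f}")

# Reject H0 at the 5% level when the p-value is below 0.05.
reject_null = p_value < 0.05
print("Reject H0" if reject_null else "Fail to reject H0")
```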

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer Here.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing words that contain digits

In [None]:
# Remove URLs & Remove words that contain digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### What all feature selection methods have you used  and why?

Answer Here.

##### Which all features you found important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
# Scaling your data

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

##### What data splitting ratio have you used and why?

Answer Here.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with an updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with an updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with an updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
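
The save-and-reload sanity check described in the two steps above might look like the following sketch. The model, the synthetic data, and the file name `model.pkl` are all placeholders (assumptions); the final notebook would substitute its best-performing model and real unseen data.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data and model (assumption), three classes of 20 rows each.
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 20)
X = rng.normal(size=(60, 3)) + y[:, None]
model = LogisticRegression(max_iter=1000).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")  # hypothetical file name

    # Save the fitted model to disk.
    with open(path, "wb") as fh:
        pickle.dump(model, fh)

    # Load it back, as a deployment process would.
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

# Sanity check: the restored model gives the same predictions as the original.
same_predictions = np.array_equal(model.predict(X), restored.predict(X))
print("Predictions match after reload:", same_predictions)
```

Pickle files should only be loaded from trusted sources, and the scikit-learn version used for loading should match the one used for saving.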

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***