# Customer Churn Prediction

## 1. Define the Business Problem and ML Solution

The first step is to define the problem statement, how to solve it, and how to evaluate the solution.

## 2. Explore the Data
In this section, we explore and analyze the data to gain insights and understanding.

## 3. Data Preparation
This section covers the preprocessing and cleaning steps performed on the data to make it suitable for training the models.

## 4. Training & Evaluation
Here, we train the models using the prepared data and evaluate their performances.

## 5. Conclusions
Finally, we summarize the key findings and conclusions from the analysis and training process.


In [None]:
## import required libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


from sklearn.preprocessing import LabelEncoder
from sklearn.utils import resample

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report, confusion_matrix, roc_curve
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score


## ML models: different algorithms
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

## EDA

In [None]:
Dataset_path='E_Commerce_Dataset.csv'

In [None]:
#Use pandas read_csv
#For example: df=pd.read_csv(....)
#For more https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

#your code starts here

#your code ends here

In [None]:
# To visualize the data, run 'head()' function
# For more https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html

#your code starts here

#your code ends here

In [None]:
# Notice that there are a few NaNs in the dataset

In [None]:
#Try to explore the data further using shape and describe() methods
#For more on 'shape': https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html 
#For more on 'describe()': https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html

In [None]:
#print shape of data

#your code starts here

#your code ends here

In [None]:
#print description of data
print("\nNumerical  Features")

#your code starts here

#your code ends here

In [None]:
#Notice how the describe function prints numerical features only.
#this behavior can be changed by modifying the 'include' parameter like this: df.describe(include='object').T

In [None]:
print("\nCategorical Features")

#your code starts here

#your code ends here

In [None]:
#Check for missing values df.isna()
#Followed by .sum() to check number of Nulls per feature
print("--------------------Data NA Check-------------------------------")

#your code starts here

#your code ends here

In [None]:
# Check the distribution of the target variable (label)
sns.countplot(x="Churn", data=df)
plt.show()

## TimeCheck1
## Data Prep

In [None]:
#Let's reduce the class imbalance by downsampling the majority class
# Separate majority and minority classes
majority_samples = df[df['Churn'] == 0]
minority_samples = df[df['Churn'] == 1]

# Downsample the majority class
downsampled_majority = resample(majority_samples,
                                replace=False,  # Sample without replacement when downsampling
                                n_samples=2*len(minority_samples),  # Keep twice as many majority rows as minority rows (2:1)
                                random_state=42)  # For reproducibility

# Combine the downsampled majority class with the minority class
balanced_dataset = pd.concat([downsampled_majority, minority_samples])
df=balanced_dataset
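
To illustrate what the downsampling step does to the class counts, here is a minimal plain-Python sketch of the same idea. It uses `random.sample` in place of scikit-learn's `resample`, and the toy labels are made up for illustration; it is not the notebook's actual data.

```python
import random
from collections import Counter

# Toy labels (hypothetical): 10 churners (1) and 90 non-churners (0).
labels = [1] * 10 + [0] * 90

minority_idx = [i for i, y in enumerate(labels) if y == 1]
majority_idx = [i for i, y in enumerate(labels) if y == 0]

# Sample without replacement, keeping twice as many majority rows as minority rows.
rng = random.Random(42)
kept_majority_idx = rng.sample(majority_idx, k=2 * len(minority_idx))

balanced_labels = [labels[i] for i in kept_majority_idx + minority_idx]
class_counts = Counter(balanced_labels)
print(class_counts)
```

The result keeps all 10 churners and 20 of the 90 non-churners, so the classes end up in a 2:1 ratio rather than perfectly balanced.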

In [None]:
# Re-check the distribution of the target variable (label)
sns.countplot(x="Churn", data=df)
plt.show()

In [None]:
#Let's drop Null values. To do that, we can drop all rows that have null values
# You can use 'df.dropna' 
# Make sure to specify axis=0 to remove rows with null values and inplace=True to modify the df
# for more: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html

#your code starts here

#your code ends here

print("-----------------After removing the null values-----------------")
print(df.isna().sum())
print("-----------------shape of data after removing nulls--------------")
df.shape

In [None]:
# Check the correlation between the variables
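fig, ax = plt.subplots(figsize=(20,10))
# numeric_only=True skips the object columns, which are not encoded yet (requires pandas >= 1.5)
sns.heatmap(df.corr(numeric_only=True), annot=True, ax=ax)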
plt.show()

In [None]:
# Now we want to preprocess the data before we start training

In [None]:
#Let's have a look at the data type of each feature
print(df.dtypes)

In [None]:
#Note that some features are of type object and ML models expect numerical inputs
#Hence, we will need to convert the categorical variables into numerical 
#We can use LabelEncoder as follows:
le = LabelEncoder()
df['PreferredLoginDevice'] = le.fit_transform(df['PreferredLoginDevice'])

#Apply the same on other features that are of type objects

#your code starts here




#your code ends here
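
To make the encoding step concrete, here is a plain-Python sketch of the mapping that `LabelEncoder` produces: the sorted unique values of a column are numbered 0, 1, 2, and so on. The column values are made up for illustration, and the sketch imitates the result rather than scikit-learn's implementation.

```python
# Toy column (hypothetical values, not taken from the dataset).
genders = ["Male", "Female", "Female", "Male"]

# Sorted unique categories receive the codes 0, 1, 2, ...
classes = sorted(set(genders))
code_of = {category: code for code, category in enumerate(classes)}
encoded = [code_of[g] for g in genders]

print(classes)
print(encoded)
```

Because every `fit_transform` call refits the encoder, a single `le` reused across columns remembers only the last column's mapping. Keeping one encoder per column is one way to preserve the mappings if you need to encode new samples later.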


In [None]:
#Let's have a look at the data type of each feature after applying LabelEncoder
print(df.dtypes)

In [None]:
#Let's have a look at one of the categorical features
df.Gender

In [None]:
#Let's drop the identifier column (CustomerID carries no predictive signal) and divide the data into X and y.
X = df.drop(['CustomerID', 'Churn'], axis=1)
y = df['Churn']

# Split the dataset into training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
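
The `stratify=y` argument asks for a split that preserves the class ratio in both parts. The sketch below shows the idea in plain Python by splitting each class separately. The helper `stratified_split` and the toy labels are hypothetical, and the rounding may differ from what `train_test_split` does internally.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size, seed):
    """Return (train_idx, test_idx), splitting each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Toy labels (hypothetical) with a 2:1 class ratio.
labels = [0] * 200 + [1] * 100
train_idx, test_idx = stratified_split(labels, test_size=0.3, seed=42)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(train_counts, test_counts)
```

Both parts keep the 2:1 ratio: 140 to 70 in the training set and 60 to 30 in the test set.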

## Timecheck2

In [1]:
#Train a DecisionTreeClassifier and RandomForestClassifier 

In [None]:
#Train DecisionTreeClassifier
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
print(classification_report(y_test, y_pred_dt))

# Plot the confusion matrix
plt.figure(figsize = (4,2))
sns.heatmap(confusion_matrix(y_test, y_pred_dt),fmt='g',annot=True)
plt.ylabel('Actual',fontsize=13)
plt.xlabel('Predicted',fontsize=13)
plt.show()


In [None]:
#Similarly, train a 'RandomForestClassifier()'
#Print classification report and draw Confusion Matrix

In [None]:
#Train a RandomForestClassifier and print classification_report
#Your code starts here




#your code ends here

# Plot the confusion matrix
#Your code starts here





#your code ends here

In [None]:
#Here is a method to calculate the feature importances
feature_importance = rf.feature_importances_
feature_names = X_test.columns
sorted_indices = np.argsort(feature_importance)[::-1]  # Sort indices in descending order
for idx in sorted_indices:
    print(f"{feature_names[idx]}: {feature_importance[idx]}")

In [None]:
#Let's test a sample; you can try any values and check the prediction
#Categorical features must be given as their LabelEncoder integer codes
test_sample = [[
      1, #Tenure
      1, #PreferredLoginDevice
      1, #CityTier
      14,#WarehouseToHome
      1, #PreferredPaymentMode
      1, #Gender
      4, #HourSpendOnApp
      4, #NumberOfDeviceRegistered
      4, #PreferedOrderCat
      1, #SatisfactionScore
      1, #MaritalStatus
      2, #NumberOfAddress
      1, #Complain
      12,#OrderAmountHikeFromlastYear
      7, #CouponUsed
      10,#OrderCount
      9, #DaySinceLastOrder
      200#CashbackAmount
     ]]
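# Wrap the sample in a DataFrame so the column names match those seen during training
print('Sample Churn prediction: ', rf.predict(pd.DataFrame(test_sample, columns=X.columns)))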

## The END

### Evaluation Metrics
![Confusion_Matrix](https://www.researchgate.net/publication/326866871/figure/fig3/AS:669601385959430@1536656819610/22-confusion-matrix-and-associated-measures.ppm)

### Accuracy
    Accuracy is the most basic evaluation metric: the fraction of observations that the model predicted correctly.
    Pros:
        Easy to understand and interpret.
        Suitable for balanced datasets.
    Cons:
        Can be misleading when dealing with imbalanced datasets.
        Ignores the types of errors (false positives and false negatives).
    Example: 
        In a binary classification task, if a model makes 80 correct predictions out of 100, the accuracy would be 80%.
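
The warning about imbalanced datasets can be seen in a few lines. This is a minimal sketch on made-up labels: a "model" that always predicts the majority class scores high accuracy while finding no churners at all.

```python
# Toy data (hypothetical): 95 non-churners (0) and 5 churners (1).
y_true = [0] * 95 + [1] * 5

# A "model" that always predicts the majority class.
y_pred = [0] * 100

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)
churners_found = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))

print(f"accuracy = {accuracy:.2f}")
print(f"churners found = {churners_found}")
```

The accuracy is 0.95 even though none of the 5 churners is detected.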

### Precision
    Precision measures how many of the predicted 1’s are actually 1’s: TP / (TP + FP).
    Pros:
        Useful when the focus is on minimizing false positives.
        Provides insights into the model's ability to avoid false alarms.
    Cons:
        Does not consider false negatives.
    Example: 
        Abuse detection - In order to avoid accusing good users, we need to make sure a user is actually an abuser when flagged as 1.

### Recall
    Recall, also known as sensitivity or true positive rate, measures how many of the actual 1’s the model captured: TP / (TP + FN).
    Pros:
        Important when the focus is on minimizing false negatives.
        Indicates the model's ability to identify positive cases.
    Cons:
        Does not account for false positives.
    Example: 
        We are running a marketing campaign with an unlimited budget and want to send coupons to users who are at high risk of leaving, to encourage them to stay. It’s important to capture as many 1’s as possible.
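
Precision and recall can both be read off the confusion matrix counts. The sketch below computes them by hand on made-up labels (1 = churn, 0 = stay); scikit-learn's `precision_score` and `recall_score` are the library equivalents.

```python
# Toy labels (hypothetical): 1 = churn, 0 = stay.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # churners correctly flagged
fp = sum(t == 0 and p == 1 for t, p in pairs)  # loyal users wrongly flagged
fn = sum(t == 1 and p == 0 for t, p in pairs)  # churners that were missed

precision = tp / (tp + fp)  # how many flagged users really churned
recall = tp / (tp + fn)     # how many churners were flagged

print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

Here 2 of the 3 flagged users are real churners (precision about 0.67), but only 2 of the 4 churners were flagged (recall 0.50).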

### F1 Score
    The F1 score is the harmonic mean of precision and recall, so it is high only when both are high.
    Pros:
        Takes into account both false positives and false negatives.
        Suitable for imbalanced datasets.
    Cons:
        F1 score gives equal weight to precision and recall, which may not always be desired.
    Example: 
        Similar to the previous example, but now the marketing budget is limited and we can’t give out unlimited coupons. If recall is high but precision is low, we spend a lot of money on users who were never at high risk. If both precision and recall are high (which also means the F1 score is high), we identify most high-risk users without overspending.
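
The coupon example can be put into numbers. This sketch uses the standard F1 formula, 2PR / (P + R), with made-up precision and recall values; the helper name `f1` is chosen here for illustration.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical campaign A: catches most churners but wastes many coupons.
high_recall_only = f1(precision=0.2, recall=0.9)

# Hypothetical campaign B: moderate on both counts.
balanced = f1(precision=0.7, recall=0.7)

print(f"A: F1 = {high_recall_only:.2f}")
print(f"B: F1 = {balanced:.2f}")
```

Campaign A scores about 0.33 and campaign B scores 0.70. The arithmetic mean of A's precision and recall would be 0.55, so the harmonic mean penalizes the low precision much more strongly.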