<a href="https://colab.research.google.com/github/saumya07p/Customer-Churn-Analysis-using-Machine-Learning/blob/main/Customer_Churn_Analysis_using_Machine_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction



Customer churn, also called customer attrition, occurs when customers end their relationship with a company or stop using its products or services. The churn rate is the proportion of customers who cancel their subscriptions or stop purchasing within a given period. A high churn rate can indicate dissatisfaction with the product, the service, or the overall customer experience. Tracking churn closely highlights areas that need improvement and gives the business a chance to resolve customer concerns. This notebook analyzes customer departures and uses customer attributes to help companies identify likely churners from the data they already have.
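
As a small illustration of the churn rate defined above, the sketch below computes it on a made-up table. The `toy` frame and its values are hypothetical; only the column name `churn` and its "Yes"/"No" labels follow the data set used in this notebook.

```python
import pandas as pd

# Hypothetical toy data: one row per customer, 'churn' is "Yes" or "No".
toy = pd.DataFrame({
    "cust_id": [1, 2, 3, 4, 5],
    "churn": ["No", "Yes", "No", "No", "Yes"],
})

# Churn rate = churned customers / all customers observed in the period.
churn_rate = (toy["churn"] == "Yes").mean()
print(f"Churn rate: {churn_rate:.0%}")
```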

# Data Understanding and Preprocessing

In [None]:
#Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Import libraries

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import classification_report

In [None]:
# Read data

customer = pd.read_csv('/content/drive/MyDrive/ML_Assignment/CustomerChurnData.csv')

customer

In [None]:
# Examine the number of rows and cols
customer.shape

In [None]:
# Show the head rows of a data frame
customer.head()

In [None]:
# Show the tail rows of a data frame

customer.tail()

In [None]:
customer.describe()

In [None]:
customer.info()

In [None]:
# Examine missing values

customer.isnull().sum()

In [None]:
#Examine data types of each column

customer.dtypes

In [None]:
customer.select_dtypes(include=['number'])

In [None]:
customer.select_dtypes(include=['object'])

In [None]:
# Identify categorical variables
categorical_cols = customer.select_dtypes(include=['object']).columns

# Convert categorical variables to "category" type
customer[categorical_cols] = customer[categorical_cols].astype('category')

customer.select_dtypes(include=['category'])

In [None]:
customer.dtypes

In [None]:
# List of categorical variables
categorical_columns = [column for column in customer.keys() if customer[column].dtype.name == 'category']
categorical_columns

# Data Visualization of Numerical Values

In [None]:
customer['tenure'].describe()

In [None]:
customer[['age','tenure']].describe()

In [None]:
customer['tenure'].max()

In [None]:
customer[customer['tenure'] == customer['tenure'].max()]

In [None]:
# Obtain the variance, standard deviation, and range (min, max) of the numeric variable 'tenure'
print("variance: ", customer['tenure'].var(), "standard deviation: ", customer['tenure'].std(), "range: ", customer['tenure'].min(), customer['tenure'].max())

In [None]:
IQR = customer['tenure'].quantile(0.75) - customer['tenure'].quantile(0.25)
print("IQR:", IQR)
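
The IQR is also what a boxplot uses to flag outliers: under the common 1.5 × IQR rule, points beyond the whiskers are drawn individually. The sketch below applies that rule by hand. The helper `iqr_outliers` and the `toy_tenure` values are hypothetical and not part of the original analysis.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Hypothetical tenure values with one extreme observation.
toy_tenure = pd.Series([1, 5, 12, 20, 24, 30, 36, 40, 200])
outliers = iqr_outliers(toy_tenure)
print(outliers.tolist())
```

On the real data, `iqr_outliers(customer['income'])` would list the same points that the boxplot shows beyond its whiskers, assuming seaborn's default whisker length of 1.5.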

In [None]:
# Boxplot of a numeric variable 'tenure'
snsplot = sns.boxplot(x='tenure', data = customer)
snsplot.set_title("Boxplot of Tenure in the Customer Churn data set")

In [None]:
# Boxplot of a numeric variable 'income'
snsplot = sns.boxplot(x='income', data = customer)
snsplot.set_title("Boxplot of Income in the Customer data set")

In [None]:
customer['age'].describe()

In [None]:
snsplot = sns.boxplot(x='age', data = customer)
snsplot.set_title("Boxplot of Age in the Customer Churn data set")

In [None]:
snsplot = sns.histplot(x='tenure', data = customer)
snsplot.set_title("Histogram of tenure in the Customer Churn Data set")

In [None]:
snsplot = sns.histplot(x='age', data = customer)
snsplot.set_title("Histogram of age in the Customer Churn Data set")

# Data Visualization of Categorical Values

In [None]:
customer['churn'].value_counts(normalize = True)

In [None]:
number_of_values = len(customer['churn'])
number_of_values

In [None]:
customer.keys()

In [None]:
customer['churn'].describe()

In [None]:
customer['churn'].value_counts()

In [None]:
#Plotting a Countplot for Churn - Yes and No

snsplot = sns.countplot(x='churn', data=customer)
snsplot.set_title("Countplot of Churn in the Customer Churn data set")

In [None]:
customer['region'].describe()

In [None]:
customer['region'].value_counts()

In [None]:
snsplot = sns.countplot(x='region', data=customer, color = 'red')
snsplot.set_xticklabels(snsplot.get_xticklabels(), ha="right")
snsplot.set_title("Countplot of region in the Customer Churn data set")

# Understanding relationships of multiple variables

In [None]:
plt.figure(figsize=(16, 6))
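# numeric_only=True skips the categorical columns, which recent pandas versions refuse to correlate
heatmap = sns.heatmap(customer.corr(numeric_only=True), vmin=-1, vmax=1, annot=True)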
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':12}, pad=12);

In [None]:
customer.corr(numeric_only=True)
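
A full correlation matrix is hard to scan when there are many columns. One possible way to pull out the strongest relationships is to keep only the upper triangle and sort by absolute value, as sketched below. The `toy` frame is hypothetical; the column names are borrowed from the data set, but the values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'longten' grows with 'tenure', 'age' does not.
toy = pd.DataFrame({
    "tenure": [1, 2, 3, 4, 5, 6],
    "longten": [10, 21, 29, 42, 50, 61],
    "age": [40, 22, 35, 58, 30, 45],
    "region": ["A", "B", "A", "C", "B", "C"],
})

# Correlate numeric columns only; 'region' is skipped.
corr = toy.corr(numeric_only=True)

# Keep each pair once by masking the diagonal and the lower triangle.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = (
    corr.where(mask)
    .stack()
    .dropna()
    .sort_values(key=abs, ascending=False)
)
print(pairs.head(3))
```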

# Model Development - Building a Decision Tree

In [None]:
from sklearn.preprocessing import LabelEncoder

label = LabelEncoder()
Customer_Churn_Encoded = customer.apply(lambda col: label.fit_transform(col) if col.dtype == 'category' or col.dtype == 'object' else col)
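
`LabelEncoder` assigns integer codes in sorted order of the labels, so "No" becomes 0 and "Yes" becomes 1. Because the cell above refits a single encoder for every column, the mapping of earlier columns is lost. The sketch below is one alternative that keeps an encoder per column so the codes can be inspected or reversed later. The `toy` frame, the `encoders` dictionary, and the region labels are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical data with two categorical columns and one numeric column.
toy = pd.DataFrame({
    "churn": ["No", "Yes", "No"],
    "region": ["Zone 2", "Zone 1", "Zone 3"],
    "tenure": [12, 3, 40],
})

encoders = {}          # one fitted encoder per categorical column
encoded = toy.copy()
for name in ["churn", "region"]:
    enc = LabelEncoder()
    encoded[name] = enc.fit_transform(toy[name])
    encoders[name] = enc

# classes_ is sorted, so the position of each label is its integer code.
churn_mapping = {str(label): code
                 for code, label in enumerate(encoders["churn"].classes_)}
print(churn_mapping)
```

With the encoders stored, `encoders["churn"].inverse_transform([0, 1])` recovers the original labels.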

In [None]:
Customer_Churn_Encoded.head()

In [None]:
Customer_Churn_Encoded.keys()

In [None]:
#Drop cust_id column
Customer_Churn_Encoded = Customer_Churn_Encoded.drop(['cust_id'], axis=1)

In [None]:
target = Customer_Churn_Encoded['churn']   #our Target variable is churn
print(target.value_counts(normalize=True))

In [None]:
# Predictors are all remaining columns except the target

predictors = Customer_Churn_Encoded.drop(['churn'], axis=1)

# Splitting into Training and Test Datasets

In [None]:
predictors_train, predictors_test, target_train, target_test = train_test_split(predictors, target, test_size=0.2, random_state=1)
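
When the target classes are imbalanced, a purely random split can leave the training and test sets with different churn proportions. The sketch below shows the `stratify` argument of `train_test_split`, which preserves the class proportions in both parts. This is an optional variation, not the split used in this notebook, and the `toy_X` and `toy_y` data are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 customers, 25% of them churners (label 1).
toy_X = pd.DataFrame({"tenure": range(20)})
toy_y = pd.Series([1] * 5 + [0] * 15)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=1, stratify=toy_y
)

# Both parts keep the 25% churn proportion of the full data.
print("train churn share:", y_tr.mean())
print("test churn share: ", y_te.mean())
```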

In [None]:
customer['churn'].value_counts()

In [None]:
# Examine the proportion of the target variable in the full data set
# (printed directly so the encoded `target` used for modeling is not overwritten)
print(customer['churn'].value_counts(normalize=True))

# Developing Model

In [None]:
# Build a decision tree model on training data with max_depth = 2
model = DecisionTreeClassifier(criterion = "entropy", random_state = 1, max_depth = 2)
model.fit(predictors_train, target_train)
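
With `criterion = "entropy"`, the tree chooses each split to maximize information gain, that is, the drop in Shannon entropy from the parent node to the weighted average of its children. The sketch below computes both quantities by hand for a made-up split. The helper functions and label arrays are hypothetical and only illustrate the formula.

```python
import numpy as np

def entropy(labels) -> float:
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(parent, left, right) -> float:
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Hypothetical node with 4 non-churners (0) and 4 churners (1).
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
left = np.array([0, 0, 0, 1])    # mostly non-churners
right = np.array([0, 1, 1, 1])   # mostly churners

print("parent entropy:  ", entropy(parent))
print("information gain:", round(information_gain(parent, left, right), 4))
```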

In [None]:
# Plot the tree
fig = plt.figure(figsize=(30,20))
tree.plot_tree(model,
               feature_names=list(predictors_train.columns),
               class_names=['No','Yes'],
               filled=True)

In [None]:
# Text version of decision tree

# Use the training columns so the feature names match what the model was fitted on
print(tree.export_text(model, feature_names=list(predictors_train.columns)))

In [None]:
# Make predictions on testing data
prediction_on_test = model.predict(predictors_test)

In [None]:
# Examine the evaluation results on testing data: confusion_matrix
cm = confusion_matrix(target_test, prediction_on_test)
ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_).plot()
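
The numbers in the classification report can be derived directly from the confusion matrix. The sketch below does this for a binary case, using scikit-learn's layout in which rows are true labels and columns are predicted labels. The `y_true` and `y_pred` lists are hypothetical.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 0 = no churn, 1 = churn.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

toy_cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = toy_cm.ravel()   # row-major order for the 2x2 case

precision = tp / (tp + fp)        # of predicted churners, how many churned
recall = tp / (tp + fn)           # of actual churners, how many were caught
accuracy = (tp + tn) / toy_cm.sum()

print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
```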

# Results and Evaluation


In [None]:
print(classification_report(target_test, prediction_on_test))

In [None]:
# Evaluate on the training data to compare against the test results
prediction_on_train = model.predict(predictors_train)
cm_DT = confusion_matrix(target_train, prediction_on_train)
ConfusionMatrixDisplay(confusion_matrix=cm_DT, display_labels=model.classes_).plot()
print(classification_report(target_train, prediction_on_train))

In [None]:
# Select predictors and target variable
predictors = ['region', 'tenure', 'age', 'marital', 'address', 'income',
              'ed', 'employ', 'retire', 'gender', 'reside', 'tollfree', 'equip',
              'callcard', 'wireless', 'longmon', 'tollmon', 'equipmon', 'cardmon',
              'wiremon', 'longten', 'tollten', 'equipten', 'cardten', 'wireten',
              'multline', 'voice', 'pager', 'internet', 'callid', 'callwait',
              'forward', 'confer', 'ebill', 'loglong', 'lninc', 'custcat']
target = 'churn'

# Create feature matrix (X) and target vector (y)
X = Customer_Churn_Encoded[predictors]
y = Customer_Churn_Encoded[target]

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Instantiate the Decision Tree classifier (fixed seed so the importances are reproducible)
model = DecisionTreeClassifier(random_state=42)

# Train the model
model.fit(X_train, y_train)

# Extract feature importances
feature_importances = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)

# Display top 10 features with the highest impact on the prediction
top_features = feature_importances.head(10)
print("Top 10 features with the highest impact on the prediction -")
print(top_features)
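
The `feature_importances_` attribute is based on impurity reduction measured on the training data, and it can favor features with many distinct values. A common cross-check is permutation importance on held-out data, sketched below with `sklearn.inspection.permutation_importance`. This is a supplementary technique, not part of the original analysis. The synthetic data from `make_classification` and the names `f0` to `f5` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the churn data: 6 features, 3 of them informative.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)
names = [f"f{i}" for i in range(6)]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

demo_model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

# Impurity-based importances (sum to 1, computed from the training data).
impurity = pd.Series(demo_model.feature_importances_, index=names)

# Permutation importances: mean drop in test accuracy when a feature is shuffled.
result = permutation_importance(
    demo_model, X_te, y_te, n_repeats=10, random_state=0
)
perm = pd.Series(result.importances_mean, index=names).sort_values(ascending=False)

print(pd.DataFrame({"impurity": impurity, "permutation": perm}))
```

Features that rank highly under both measures are more likely to matter for prediction than those that appear in only one ranking.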

**Evaluation -**

The ten features with the highest impact on the prediction are longten, internet, equipten, address, employ, age, cardten, tollten, longmon, and lninc.

The classification reports show the precision, recall, and F1-score for the training and test sets.

On the test set and training set, the model's accuracy is 81% and the weighted average is 78%.

In [None]:
!jupyter nbconvert --to html "/content/drive/MyDrive/Colab Notebooks/ML_Assignment_Saumya Prasad.ipynb"