# Project 01: Churn Prediction in Telecom Industry using Logistic Regression
<img src="image/img_customer_churn.jpg">

### Submitted By: Yashuv Baskota
### Language: Python
### Dataset: https://www.kaggle.com/datasets/mnassrib/telecom-churn-datasets

### Defining Customer Churn
Customer churn occurs when an existing customer, user, player, subscriber, or other returning client stops doing business with a company or ends the relationship.

## 1. Importing necessary libraries

In [None]:
import pandas as pd
import numpy as np
import os

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn.metrics import roc_auc_score, roc_curve, f1_score, precision_score, recall_score

import warnings
warnings.filterwarnings("ignore")

%matplotlib inline

## 2. Exploratory Data Analysis

In [None]:
# path to the dataset folder
folder_path = 'data'

# list all the filenames in the folder
filenames = os.listdir(folder_path)

# print the full path of each file
for filename in filenames:
    print(os.path.join(folder_path, filename))

In [None]:
telcom1 = pd.read_csv("data/churn-bigml-80.csv")
telcom2 = pd.read_csv("data/churn-bigml-20.csv")

# combine both files into a single DataFrame
telcom = pd.concat([telcom1, telcom2], ignore_index=True)

In [None]:
telcom.head()

In [None]:
telcom.shape

In [None]:
telcom.info()

Comment: The dataset contains *3333* rows (customers) and *20* columns (19 features plus the target).<br>
The `"Churn"` column is the target to predict.

In [None]:
# accessing Churn feature
telcom['Churn'].head(10)

### Descriptive Analysis and Data Visualization

In [None]:
telcom.describe()

In [None]:
# Count the number of data points in each category
y = telcom['Churn'].value_counts()
y

In [None]:
# Create the pie chart
plt.pie(y, labels=y.index, autopct='%1.1f%%')

# Customize the appearance of the pie chart
plt.title('Distribution of Churn')
plt.legend(title='Churn')
plt.show()

In [None]:
sns.barplot(x=y.index, y=y.values)

### Summary statistics for both classes

In [None]:
# Group telcom by 'Churn' and compute the mean of the numeric columns
# (numeric_only=True skips text columns such as 'State')
telcom.groupby('Churn').mean(numeric_only=True)

Churners seem to make more customer service calls than non-churners.
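To check an observation like this directly, it can help to compare a single column across the two classes instead of scanning the full table. The sketch below uses a tiny hand-made DataFrame; the values are invented for illustration and only the column names match the notebook.

```python
import pandas as pd

# Illustrative toy data: the values are made up, only the column names match the notebook
toy = pd.DataFrame({
    "Churn": [False, False, False, True, True],
    "Customer service calls": [1, 2, 0, 4, 5],
})

# Mean number of customer service calls for non-churners (False) and churners (True)
calls_by_churn = toy.groupby("Churn")["Customer service calls"].mean()
print(calls_by_churn)
```

Applied to `telcom`, the same expression would give one mean per class for the real data.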

In [None]:
telcom.groupby('Churn').std(numeric_only=True)

### Churn by State

In [None]:
telcom.groupby('State')['Churn'].value_counts()

In [None]:
telcom.groupby(['State','Churn']).size().unstack().plot(kind='bar', stacked=True, figsize=(30,10))

Comment: Churn counts differ from state to state, which can help a company decide where to focus its retention efforts.
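Raw counts are hard to compare because states have different numbers of customers, so a churn rate per state may be easier to read. A minimal sketch on made-up data (the state codes and labels are illustrative only):

```python
import pandas as pd

# Made-up records; only the column names match the notebook
toy = pd.DataFrame({
    "State": ["NJ", "NJ", "NJ", "NJ", "CA", "CA", "TX", "TX"],
    "Churn": [True, True, True, False, False, False, True, False],
})

# The mean of a boolean column is the fraction of True values, i.e. the churn rate
churn_rate = toy.groupby("State")["Churn"].mean().sort_values(ascending=False)
print(churn_rate)
```

Sorting in descending order puts the states with the highest share of churners first.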

#### Exploring feature distributions

In [None]:
# visualize the distribution of 'Account length'
# (histplot replaces the deprecated distplot)
sns.histplot(telcom['Account length'], kde=True)

# display the plot
plt.show()

In [None]:
sns.histplot(telcom['Total day minutes'], kde=True)
plt.show()

In [None]:
sns.histplot(telcom['Total eve minutes'], kde=True)
plt.show()

In [None]:
sns.histplot(telcom['Total intl minutes'], kde=True)
plt.show()

Comment: All of the features above appear to be well approximated by a normal distribution. If they were not, we would consider applying a feature transformation.
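A visual check can be complemented with a numeric one. The sketch below is an illustration on synthetic data, not on the churn dataset: it measures sample skewness with pandas and shows how a log transform (one possible feature transformation, chosen here as an assumption) reduces it for a right-skewed feature.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed feature (log-normal); not taken from the churn dataset
skewed = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=5000))

# Sample skewness: close to 0 for a symmetric, roughly normal feature
skew_before = skewed.skew()

# log1p compresses the long right tail
skew_after = np.log1p(skewed).skew()

print(f"skewness before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
```

Calling `.skew()` on a column such as `telcom['Total day minutes']` would give the corresponding figure for the real data.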

## 3. Data preprocessing


### Cleaning the data

In [None]:
# Check for missing values
has_missing = telcom.isnull().any()
has_missing

In [None]:
# check for duplicate rows 
duplicate_rows = telcom[telcom.duplicated()]
duplicate_rows

### Identifying features to convert

In [None]:
telcom.head()

In [None]:
telcom.dtypes

In [None]:
# Find the columns that contain boolean values
bool_columns = telcom.select_dtypes(include=['bool']).columns
print(bool_columns)

# Find the columns of object type
object_columns = telcom.select_dtypes(include=['object']).columns
print(object_columns)

### Encoding binary features

In [None]:
# Convert the boolean values to integers
telcom[bool_columns] = telcom[bool_columns].astype(int)

In [None]:
# Replace 'No' with 0 and 'Yes' with 1 in 'International plan' and 'Voice mail plan'
telcom[['International plan','Voice mail plan']] = telcom[['International plan','Voice mail plan']].apply(lambda x: x.map({'No': 0, 'Yes': 1}))

In [None]:
# see the results
telcom[['International plan','Voice mail plan','Churn']].head()
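One thing to watch with `map`: any label that is missing from the mapping dictionary silently becomes `NaN` instead of raising an error, so a capitalization mismatch can wipe out a column. The sketch below demonstrates this on a hypothetical Series with inconsistent capitalization and shows one possible safeguard.

```python
import pandas as pd

# Hypothetical column with inconsistent capitalization
plan = pd.Series(["No", "Yes", "no", "Yes"])

# Labels missing from the mapping ("no" here) become NaN instead of raising an error
encoded = plan.map({"No": 0, "Yes": 1})
print(encoded.isna().sum())

# Normalizing the case first makes the mapping robust to inconsistent capitalization
encoded_safe = plan.str.lower().map({"no": 0, "yes": 1})
print(encoded_safe.tolist())
```

Checking `isna().sum()` right after encoding is a cheap way to confirm that every label was matched.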

### Feature selection and engineering

Dropping unnecessary and correlated features

In [None]:
# drop 'State' feature
telcom = telcom.drop(columns=['State'])

# Calculate the correlation matrix
corr_matrix = telcom.corr()

# Select upper triangle of correlation matrix
# (np.bool was removed in NumPy 1.24; the built-in bool is used instead)
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Find feature columns whose correlation with an earlier column is greater than 0.95
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]
print(to_drop)

# Drop the correlated features from the dataset
telcom = telcom.drop(columns=to_drop)

telcom.head()
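The upper-triangle trick can be easier to follow on a small example. The sketch below builds a synthetic frame (the column names `minutes`, `charge`, and `calls` are illustrative, not the dataset's) in which one column is an exact linear function of another, then applies the same filter.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
minutes = rng.normal(180, 50, size=200)

# Synthetic data: 'charge' is fully determined by 'minutes', 'calls' is unrelated
toy = pd.DataFrame({
    "minutes": minutes,
    "charge": 0.17 * minutes,
    "calls": rng.normal(100, 20, size=200),
})

corr = toy.corr()

# Keep only the entries above the diagonal so each pair is inspected once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

# A column is flagged if it is highly correlated with any earlier column
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
print(to_drop)
```

Because only the upper triangle is kept, the later column of each correlated pair is flagged and the earlier one survives. The threshold is applied to the signed correlation, so strongly negative pairs are not flagged; using `corr.abs()` would catch those as well.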

### Feature scaling
Feature scaling ensures that all variables are on the same scale and have comparable influence on the model.<br>
For example, compare the scales of the `'Total intl calls'` and `'Total night minutes'` features:

In [None]:
telcom['Total intl calls'].describe()

In [None]:
telcom['Total night minutes'].describe()

In [None]:
# from sklearn.preprocessing import StandardScaler

# Scale telcom using StandardScaler
features_to_scale = [column for column in telcom.columns if column not in ['International plan','Voice mail plan','Churn']]
# print(features_to_scale)
telcom_scaled = StandardScaler().fit_transform(telcom[features_to_scale])

# Add column names back for readability
telcom_scaled_df = pd.DataFrame(telcom_scaled, columns=features_to_scale)

# summary statistics
print(telcom_scaled_df.describe())

# final preprocessed dataframe
telcom = pd.concat([telcom_scaled_df, telcom[['International plan', 'Voice mail plan','Churn']]], axis=1)
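The cell above fits the scaler on the full dataset before the train/test split, so the test rows contribute to the mean and standard deviation. A common alternative is to learn the scaling statistics from the training rows only and reuse them for the test rows. The sketch below illustrates this on synthetic data, and also checks that `StandardScaler` matches the formula z = (x - mean) / std.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=200.0, scale=50.0, size=(100, 2))  # synthetic features

X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

# Learn the mean and standard deviation from the training rows only
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuses the training statistics

# StandardScaler computes z = (x - mean) / std, using the population std (ddof=0)
manual = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)
print(np.allclose(manual, X_train_scaled))
```

Keeping the fitted `scaler` object around also makes it possible to scale new customers later with the same statistics.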

## 4. Model Building and Performance Evaluation

### Model Selection:

* **Logistic Regression**

We choose `Logistic Regression` as our estimator for this project.

In [None]:
# from sklearn.linear_model import LogisticRegression

# instantiate our classifier
clf = LogisticRegression()

### Creating training and test sets

In [None]:
# from sklearn.model_selection import train_test_split

# create the feature matrix by dropping the target variable 'Churn' from telcom
X = telcom.drop(columns=['Churn'])

# create target variable
y = telcom['Churn']

# Create training and testing sets (80% of the data is used for training;
# random_state makes the split reproducible)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Fit to the training data
clf.fit(X_train, y_train)

# The predicted labels of classifier
y_pred = clf.predict(X_test)

### Checking the size of each set

In [None]:
print(X_train.shape)
print(X_test.shape)

### Model Metrics:

In [None]:
# from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
# from sklearn.metrics import roc_auc_score, roc_curve, f1_score, precision_score, recall_score

#### Confusion matrix

In [None]:
# Calculate the confusion matrix
matrix = confusion_matrix(y_test, y_pred)
# print(matrix)

# Plot the confusion matrix using seaborn
sns.heatmap(matrix, annot=True, fmt='d', cmap='magma')

# Add labels to the plot
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.title('Confusion Matrix')

# Show the plot
plt.show()

In [None]:
print(classification_report(y_test, y_pred))

#### Accuracy, Precision, Recall and F1 Score

Accuracy is the proportion of all samples that the classifier labels correctly:

$$\text{Accuracy} = \frac{\text{True Positives} + \text{True Negatives}}{\text{Total Samples}}$$

Recall is the proportion of actual positive examples that the model classifies correctly:

$$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$$

Precision is the proportion of predicted positive examples that are actually positive:

$$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$$

The F1 score is the harmonic mean of precision and recall, so it is high only when both are high:

$$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
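As a worked example of these definitions, the sketch below computes all four metrics from hypothetical confusion-matrix counts. The numbers are invented for illustration and are not results from this model.

```python
# Hypothetical confusion-matrix counts, chosen only to illustrate the formulas
tp, fp, fn, tn = 30, 10, 20, 140

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

With these counts the accuracy is 0.85 even though only 60% of the actual churners are found, which is why precision and recall are worth reporting alongside accuracy when the classes are imbalanced.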

In [None]:
print("Accuracy: {:.2f}".format(accuracy_score(y_test, y_pred)))
print("Precision: {:.2f}".format(precision_score(y_test, y_pred)))
print("Recall: {:.2f}".format(recall_score(y_test, y_pred)))
print("F1 score: {:.2f}".format(f1_score(y_test, y_pred)))

#### ROC Curve

In [None]:
# Generate the probabilities
y_pred_prob = clf.predict_proba(X_test)[:,1]

# Use roc_curve() to calculate the false positive rate, true positive rate, and thresholds.
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

# Plot the ROC curve
plt.plot(fpr, tpr)

# Add labels and diagonal line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.plot([0, 1], [0, 1], "k--")
plt.show()

#### Area under the ROC curve

In [None]:
# the area under the ROC curve
roc_auc_score(y_test, y_pred_prob)

## 5. Making Predictions (whether a new customer will churn)

In [None]:
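def make_prediction(customer):
    """Print the predicted class for one customer given as a 2D list of scaled features."""
    prediction = clf.predict(customer)
    if prediction[0] == 1:
        print("[1] The customer is predicted to churn.")
    else:
        print("[0] The customer is predicted not to churn.")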

In [None]:
# scaled input values
new_customer1 = [[0.6262585675178604,
                  1.7188173197427594,
                  -1.0535424482925813,
                  -0.6197347815607696,
                  -1.1276788128173842,
                  0.5464802852218092,
                  -0.8676148392853111,
                  0.3011544282701762,
                  0.4523525497250106,
                  -0.6011950896927287,
                  -0.4279320210630441,
                  0.0,
                  0.0]]

new_customer2 = [[0.5257967737031338,
                  -0.5236032802413713,
                  0.9387740897371452,
                  1.5730210856813158,
                  0.8326323403400316,
                  -0.0559403500169171,
                  -0.3653036104833324,
                  -2.20323162813801,
                  0.27323229022856793,
                  -1.0075595662585095,
                  -1.1882184955849664,
                  1.0,
                  0.0]]

# make prediction on new customers
make_prediction(new_customer1)
make_prediction(new_customer2)
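The customers above are entered as already-scaled values, which is awkward in practice because new customers arrive in original units. One possible approach, sketched below under the assumption that the scaler and classifier are combined in a scikit-learn pipeline, is to let the pipeline apply the stored scaling before predicting. The training data here is tiny and made up; only the two column names are borrowed from the notebook.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Tiny made-up training set; column names mirror two of the notebook's features
train = pd.DataFrame({
    "Total day minutes": [120.0, 150.0, 180.0, 260.0, 300.0, 320.0],
    "Customer service calls": [0, 1, 1, 4, 5, 6],
})
labels = [0, 0, 0, 1, 1, 1]

# The pipeline stores the fitted scaler, so raw inputs are scaled automatically
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(train, labels)

# A new customer described in original units, not pre-scaled values
new_customer = pd.DataFrame({
    "Total day minutes": [310.0],
    "Customer service calls": [5],
})

churn_probability = model.predict_proba(new_customer)[0, 1]
print(f"Estimated churn probability: {churn_probability:.2f}")
```

Reporting the probability rather than only the class label also lets the business choose its own decision threshold.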

<br>

## Thank You!