# **Telecom Churn Analysis**
---

# Import Libraries

In [None]:
# Data manipulation
import numpy as np
import pandas as pd

# Data visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

In [None]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# Read Data

**Data Source**

- Dataset title: Telco Customer Churn
- Dataset source URL: https://raw.githubusercontent.com/IBM/telco-customer-churn-on-icp4d/master/data/Telco-Customer-Churn.csv
- Dataset source description (GitHub repository managed by IBM): https://github.com/IBM/telco-customer-churn-on-icp4d

In [None]:
# Read data
link = 'https://raw.githubusercontent.com/IBM/telco-customer-churn-on-icp4d/master/data/Telco-Customer-Churn.csv'
data = pd.read_csv(link)

# Check the sample
data.sample(n=5, random_state=100)

# 0. Data Understanding

## 0.1. Features Definition

| Feature Name              | Feature Description        |
|---------------------------|----------------------------|
| customerID                | A unique ID that identifies each customer.                   |
| gender                    | The customer’s gender: Male, Female.                     |
| SeniorCitizen             | Indicates if the customer is 65 or older.                      |
| Partner                   | Indicates if the customer has a partner or not.                        |
| Dependents                | Indicates if the customer lives with any dependents or not.          |
| tenure                    | Number of months the customer has stayed with the company.                |
| PhoneService              | Indicates if the customer has a phone service or not.                        |
| MultipleLines             | Indicates if the customer subscribes to multiple telephone lines with the company.                        |
| InternetService           | Indicates if the customer subscribes to Internet service with the company.                   |
| OnlineSecurity            | Indicates if the customer subscribes to an additional online security service provided by the company.                        |
| OnlineBackup              | Indicates if the customer subscribes to an additional online backup service provided by the company.               |
| DeviceProtection          | Indicates if the customer subscribes to an additional device protection plan for their Internet equipment provided by the company.                |
| TechSupport               | Indicates if the customer subscribes to an additional technical support plan from the company with reduced wait times.                      |
| StreamingTV               | Indicates if the customer uses their Internet service to stream television programming from a third party provider.                   |
| StreamingMovies           | Indicates if the customer uses their Internet service to stream movies from a third party provider.                   |
| Contract                  | Indicates the customer’s current contract type.                       |
| PaperlessBilling          | Indicates if the customer has chosen paperless billing.                    |
| PaymentMethod             | Indicates how the customer pays their bill.                       |
| MonthlyCharges            | Indicates the customer’s current total monthly charge for all their services from the company.                         |
| TotalCharges              | Indicates the customer’s total charges, calculated to the end of the quarter specified above.                   |
| Churn                     | Indicates whether the customer stopped using the company's service.                    |

## 0.2. Dimensions of the DataFrame

In [None]:
data.shape

In [None]:
print('This dataset has data dimensions:')
print('Number of rows: {}'.format(data.shape[0]))
print('Number of cols: {}'.format(data.shape[1]))

## 0.3. Data Types of the Features

In [None]:
# Check details of the DataFrame 
data.info()

In [None]:
# Statistics for the columns (features)
data.describe(include='all')

- The `customerID` column can be dropped because it is only an identifier.
- The `tenure` column ranges from 0 (new customers) to 72 months (6 years).
- The `MonthlyCharges` column ranges from 18.25 to 118.75.
- The `TotalCharges` column is stored as an object and needs to be converted to a numeric data type.
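A numeric column is usually read as text when some entries cannot be parsed as numbers. The sketch below uses a small made-up series (not the real data) with one blank entry to show how `pd.to_numeric` with `errors='coerce'` turns such entries into `NaN` instead of raising an error. Blank strings as the cause are an assumption here and should be checked against the actual column.

```python
import pandas as pd

# Toy stand-in for the TotalCharges column; the values are made up for illustration.
raw = pd.Series(['29.85', '1889.5', ' ', '108.15'], name='TotalCharges')

# Entries that cannot be parsed (here the blank string) become NaN.
converted = pd.to_numeric(raw, errors='coerce')

print(converted.dtype)          # numeric dtype after conversion
print(converted.isna().sum())   # how many entries could not be parsed
```

Counting the `NaN` values right after the conversion shows how many rows will need imputation or removal later.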

In [None]:
# Drop customerID
data = data.drop('customerID', axis=1)

In [None]:
# Change TotalCharges to numeric dtype
data['TotalCharges'] = pd.to_numeric(data['TotalCharges'], errors='coerce')

In [None]:
data.describe(include='all')

## 0.4. Detect Missing Values

In [None]:
# Check the features that have missing values
print(data.isna().values.any())
data.isna().sum()

In [None]:
# Handle missing values
from sklearn.impute import SimpleImputer

# Impute missing TotalCharges values with the column mean.
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# fit_transform expects 2D input, so select the column as a one-column DataFrame.
data[['TotalCharges']] = imputer.fit_transform(data[['TotalCharges']])

In [None]:
# Confirm that all NaN values have been addressed
print(data.isna().values.any())
data.isna().sum()

## 0.5. Detect Duplicate Values

In [None]:
# Check whether any rows are exact duplicates of another row
print(data.duplicated().any())
data.duplicated().sum()

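- Duplicate rows are detected, but they are kept. With `customerID` dropped, different customers can share identical feature values, so each row is still treated as a unique customer.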
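A minimal sketch of why duplicates can appear once an identifier column is removed. The frame below is hypothetical toy data, not a sample of the real dataset.

```python
import pandas as pd

# Hypothetical customers: A1 and B2 are different people with identical features.
toy = pd.DataFrame({
    'customerID': ['A1', 'B2', 'C3'],
    'gender': ['Male', 'Male', 'Female'],
    'tenure': [1, 1, 12],
    'MonthlyCharges': [20.05, 20.05, 70.0],
})

# With the ID present, every row is distinct.
dupes_with_id = toy.duplicated().sum()

# Without the ID, the second row repeats the first one.
dupes_without_id = toy.drop(columns='customerID').duplicated().sum()

print(dupes_with_id, dupes_without_id)
```

Running the same comparison on the original frame before `customerID` is dropped is one way to confirm that the duplicates come from the dropped column rather than from repeated records.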

## 0.6. Number of Unique Classes

In [None]:
# Count the number of unique values in each column
uniques = data.nunique().sort_values(ascending = False)
uniques

# 1. Exploratory Data Analysis (EDA)

# 1.0. Descriptive Statistics

In [None]:
data.columns

In [None]:
num = data.select_dtypes(exclude = ['object'])
cat = data.select_dtypes(include = ['object'])

In [None]:
num_cols = num.columns
cat_cols = cat.columns

## 1.0.1. Numerical Features

In [None]:
num.columns

In [None]:
num.describe(percentiles = [0.25, 0.5, 0.75, 0.9, 0.95]).T

In [None]:
for i in num_cols:
  sns.histplot(data[i], kde=False)
  plt.show()

## 1.0.2. Categorical Features

In [None]:
cat.columns

In [None]:
cat.describe().T

In [None]:
data['gender'].value_counts(normalize=True)

In [None]:
data['InternetService'].value_counts(normalize=True)

In [None]:
data['Contract'].value_counts(normalize=True)

In [None]:
data['PaymentMethod'].value_counts(normalize=True)

In [None]:
# customerID was dropped earlier, so every remaining categorical column is plotted
for i in cat_cols:
    sns.countplot(x=i, data=data)
    plt.show()

# 1.1. Data Visualization

In [None]:
data_num = num.columns
data_cat = cat.columns

In [None]:
len(data_num)

In [None]:
len(data_cat)

## 1.1.1. Univariate Analysis

In [None]:
from matplotlib import rcParams

rcParams['figure.figsize'] = 8, 3

for i in range(0, len(data_num)):
    plt.subplot(1, 4, i + 1)
    sns.kdeplot(x=data[data_num[i]], color='steelblue')
    plt.xlabel(data_num[i])
    plt.title(data_num[i],
              fontsize=12,
              fontweight='bold')
    plt.tight_layout(pad=2)

In [None]:
for i in range(0, len(data_num)):
    plt.subplot(1, 4, i + 1)
    sns.boxplot(x=data[data_num[i]],
                color='lightsteelblue',
                orient='h')
    plt.title(data_num[i],
              fontsize=12,
              fontweight='bold')
    plt.tight_layout(pad=2)

In [None]:
for i in range(0, len(data_num)):
    plt.subplot(1, 4, i + 1)
    sns.violinplot(x=data[data_num[i]],
                   color='lightsteelblue',
                   orient='h')
    plt.title(data_num[i],
              fontsize=12,
              fontweight='bold')
    plt.tight_layout(pad=2)

## 1.1.2. Multivariate Analysis

### 1.1.2.1. Data Distribution

In [None]:
rcParams['figure.figsize'] = 15, 4

for i in range(0, len(data_num)):
    plt.subplot(1, 4, i + 1)
    # hue controls the colors, so a separate color argument is not needed
    sns.kdeplot(data=data, x=data_num[i], hue='Churn', fill=True, alpha=0.5)
    plt.xlabel(data_num[i])
    plt.title(data_num[i], fontsize=12, fontweight='bold')
    plt.tight_layout(pad=2)
plt.show()

* Most customers who churn are **not senior citizens**; because non-seniors make up most of the dataset, this reflects volume rather than a higher churn rate.
* Customers with **lower tenure** are more likely to churn.
* Customers with **higher monthly charges** have a greater tendency to churn.
* Customers with **lower total charges** have a greater tendency to churn.
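Density plots compare shapes, but a churn rate per group makes the comparison explicit. The sketch below bins a made-up `tenure` column into bands and computes the share of churned customers in each band. The bin edges and the data are assumptions chosen for illustration only.

```python
import pandas as pd

# Made-up rows; only the column names mirror the dataset.
toy = pd.DataFrame({
    'tenure': [1, 3, 8, 15, 30, 45, 60, 70],
    'Churn': ['Yes', 'Yes', 'No', 'Yes', 'No', 'No', 'No', 'No'],
})

# Hypothetical tenure bands: up to 1 year, 1 to 4 years, 4 to 6 years.
tenure_band = pd.cut(toy['tenure'], bins=[0, 12, 48, 72], include_lowest=True)

# The mean of a boolean series is the share of True values, i.e. the churn rate.
churn_rate = (toy['Churn'] == 'Yes').groupby(tenure_band, observed=True).mean()

print(churn_rate)
```

The same pattern can be applied to `MonthlyCharges` or `TotalCharges` by swapping the column and choosing suitable bin edges.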

In [None]:
# calculate numerical feature skewness
skewness = num.skew()
skewness

* `SeniorCitizen` is binary, with most values at 0 and far fewer at 1, which produces a positive skew.
* `tenure` is bimodal: one peak sits at roughly 0-20 months with density falling off across 20-60 (a right-skewed shape), and a second peak sits at roughly 60-80 (a left-skewed shape).
* `MonthlyCharges` is also bimodal, and both modes lean toward negative skew.
* `TotalCharges` is concentrated at lower values with a long right tail, indicating positive skew.
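A single skewness value summarizes a long tail well but can hide a bimodal shape. The sketch below uses synthetic samples (not the dataset) to contrast a right-tailed distribution with a symmetric two-peak one. The distribution parameters are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Right-tailed sample: many small values and a long tail of large ones.
right_tailed = pd.Series(rng.exponential(scale=1.0, size=5000))

# Two equally sized peaks, one low and one high.
bimodal = pd.Series(np.concatenate([
    rng.normal(loc=10, scale=3, size=2500),
    rng.normal(loc=65, scale=3, size=2500),
]))

print(right_tailed.skew())  # clearly positive
print(bimodal.skew())       # close to zero despite the two peaks
```

For bimodal columns it is therefore worth reading the skewness value together with the KDE plot rather than on its own.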

In [None]:
# Define the figure size
sns.set(rc={'figure.figsize':(20,15)})

# Loop over each categorical feature
# (the loop variable is named col so it does not overwrite the cat DataFrame)
for i, col in enumerate(data_cat):
    
    # Create a subplot for each feature
    plt.subplot(4, 4, i+1)
    
    # Create a countplot for the feature, color-coded by Churn
    sns.countplot(x=col, hue='Churn', data=data, palette='muted')
    
    # Set the title and axis labels
    plt.title(f'{col} vs Churn')
    plt.ylabel('Count')
    
    # Add legend to the plot
    plt.legend(title='Churn', loc='best')
    
    # Set the spacing between subplots
    plt.tight_layout(pad=2)
    
# Show the plot
plt.show()


* Customers **without partner** have **higher churn rate** than customers with partner.
* Customers **without dependents** have **higher churn rate** than customers with dependents.
* Customers **without phone service** have **higher churn rate** than customers with phone service.
* Customers **with Fiber optic** internet service have a **higher tendency to churn** than customers with DSL internet service, and customers without internet service have relatively low churn rates.
* Customers **without online security** have **higher churn rate** than customers with online security.
* Customers **without online backup** have **higher churn rate** than customers with online backup.
* Customers **without device protection** have **higher churn rate** than customers with device protection.
* Customers **without tech support** have **higher churn rate** than customers with tech support.
* Customers **with streaming TV** have **higher churn rate** than customers without streaming TV.
* Customers **with streaming movies** have **higher churn rate** than customers without streaming movies.
* Customers **with a month-to-month contract** have **higher churn rate** than customers with longer-term contracts.
* Customers **with paperless billing** have **higher churn rate** than customers without paperless billing.
* Customers who **pay with electronic check** have a **higher tendency to churn** than customers with other payment methods.
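Count plots show absolute numbers, so categories of different sizes are hard to compare by eye. A row-normalized cross-tabulation gives the churn rate within each category directly. The sketch below uses a small made-up frame; only the column names follow the dataset.

```python
import pandas as pd

# Made-up rows for illustration.
toy = pd.DataFrame({
    'Contract': ['Month-to-month'] * 4 + ['Two year'] * 4,
    'Churn': ['Yes', 'Yes', 'No', 'No', 'No', 'No', 'No', 'Yes'],
})

# normalize='index' makes each row sum to 1, so the 'Yes' column is the churn rate.
rates = pd.crosstab(toy['Contract'], toy['Churn'], normalize='index')

print(rates)
```

Applying the same call to each categorical column is one way to verify the observations listed above against rates rather than counts.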

# 2. Feature Engineering

### 2.1. Fixing Data Type

In [None]:
data.sample()

In [None]:
data.dtypes

Based on the output above, there are no mismatched data types. The `SeniorCitizen` feature is conceptually categorical (True or False) rather than numerical (0 or 1), but we will leave it as it is because all features will be encoded as numeric for modeling.
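The notebook does not show the encoding step, so the following is only a sketch of one common approach, assuming binary Yes/No columns are mapped to 1/0 and multi-class columns are one-hot encoded. The toy frame and the choice of `pd.get_dummies` are illustrative assumptions, not the author's method.

```python
import pandas as pd

# Made-up rows; only the column names follow the dataset.
toy = pd.DataFrame({
    'Partner': ['Yes', 'No', 'Yes'],
    'Contract': ['Month-to-month', 'One year', 'Two year'],
    'SeniorCitizen': [0, 1, 0],
})

encoded = toy.copy()

# Binary Yes/No column: map to 1/0, matching how SeniorCitizen is already stored.
encoded['Partner'] = encoded['Partner'].map({'No': 0, 'Yes': 1})

# Multi-class column: one indicator column per category.
encoded = pd.get_dummies(encoded, columns=['Contract'], dtype=int)

print(encoded)
```

After this step every column is numeric, which is why `SeniorCitizen` can stay as 0/1 without a separate conversion.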

### 2.2. Handling Missing Values

In [None]:
data.isna().sum()

### 2.3. Handling Duplicate Values

In [None]:
data.duplicated().sum()

In [None]:
data.sample()

### 2.4. Feature Selection

In [None]:
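# Compute the correlation matrix for the numeric columns only
# (object columns cannot be correlated and raise an error in recent pandas versions)
corr_matrix = data.corr(numeric_only=True)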

# Create a heatmap with color scale and score values
fig, ax = plt.subplots(figsize=(10, 10))
heatmap = sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', cbar_kws={'label': 'Correlation Coefficient'})

# Set axis labels and title
plt.xlabel('Features')
plt.ylabel('Features')
plt.title('Correlation Matrix Heatmap')

# Show the plot
plt.show()


### 2.5. Handling Outliers

### 2.6. Feature Standardization

# 3. Modelling and Evaluation

# 4. Model Interpretation and Recommendation