# Clustering Churn Values  in Telecom Industry

## Objective
In this report, various methods are applied to check which is most suited to cluster the churn vs not churned in dataset of a telecommunication industry.

In [1]:
!pip install -U scikit-learn



In [6]:
# Installing Packages

import sklearn
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans,AgglomerativeClustering,DBSCAN,MeanShift
from scipy.spatial.distance import euclidean
from sklearn.metrics import accuracy_score
from sklearn.decomposition import PCA,KernelPCA
from sklearn.preprocessing import LabelEncoder,MinMaxScaler

In [7]:
# Loading Dataset

filepath = 'Downloads/telco.csv'
data = pd.read_csv(filepath)
data.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


## Data
Data information is provided below:

In [8]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


## Data Analysis and Encoding

In [9]:
# Changing 'TotalCharges' column dtype to float

data.TotalCharges = data.TotalCharges.str.replace(' ','')
data=data[pd.to_numeric(data['TotalCharges'],errors='coerce').notnull()]
data['TotalCharges'] = data['TotalCharges'].astype(float)

In [10]:
# Creating a copy of data

df = data.copy()
df.drop('customerID',axis=1, inplace=True)

In [11]:
# Encoding

le=LabelEncoder()
df['gender'] = le.fit_transform(df['gender'])
df['Partner'] = le.fit_transform(df['Partner'])
df['Dependents'] = le.fit_transform(df['Dependents'])
df['PhoneService'] = le.fit_transform(df['PhoneService'])
df['MultipleLines'] = le.fit_transform(df['MultipleLines'])
df['InternetService'] = le.fit_transform(df['InternetService'])
df['OnlineSecurity'] = le.fit_transform(df['OnlineSecurity'])
df['OnlineBackup'] = le.fit_transform(df['OnlineBackup'])
df['DeviceProtection'] = le.fit_transform(df['DeviceProtection'])
df['TechSupport'] = le.fit_transform(df['TechSupport'])
df['StreamingTV'] = le.fit_transform(df['StreamingTV'])
df['StreamingMovies'] = le.fit_transform(df['StreamingMovies'])
df['Contract'] = le.fit_transform(df['Contract'])
df['PaperlessBilling'] = le.fit_transform(df['PaperlessBilling'])
df['PaymentMethod'] = le.fit_transform(df['PaymentMethod'])

df['Churn'] = df['Churn'].replace({'No':0,'Yes':1})

## Splitting into train and test set along with Scaling

In [12]:
X = df.drop('Churn', axis=1)
y = df.Churn
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

## KMeans

In [13]:
kmeans = KMeans(n_clusters=2)
kmeans=kmeans.fit(X_train)
kmeans_ypred = kmeans.predict(X_test)
print('Accuracy Score: ',accuracy_score(y_test,kmeans_ypred))



Accuracy Score:  0.6090047393364929


## Agglomerative Clustering

In [14]:
agg = AgglomerativeClustering(n_clusters=2)
#agg = agg.fit(X_train)
agg_ypred = agg.fit_predict(X_test)
print('Accuracy Score: ',accuracy_score(y_test,agg_ypred))

Accuracy Score:  0.49146919431279623


## DBSCAN

In [15]:
db = DBSCAN(eps=2,min_samples=10)
db = db.fit(X_test)
clusters = db.labels_
print('Accuracy Score: ',accuracy_score(y_test,clusters))
print('No. of clusters:',len(set(clusters)))

Accuracy Score:  0.7341232227488151
No. of clusters: 1


DBSCAN has only created one cluster

In [16]:
ms = MeanShift(bandwidth=1)
ms.fit(X_train)
ms_ypred = ms.predict(X_test)
print('Accuracy Score: ',accuracy_score(y_test,ms_ypred))

Accuracy Score:  0.04407582938388625


Mean Shift has failed to cluster properly.

As we know there can be only 2 clusters. So it is easy using KMeans and Agglomerative Clustering using cluster value of 2. 
Kmeans has higher accuracy as compared to Agglomerative Clustering but lets repeat these two methods again after dimention reduction.

## Principal Component Analysis 

In [17]:
pca = PCA(n_components = 10)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.fit_transform(X_test)

Applying KMeans with PCA

In [18]:
kmeans = KMeans(n_clusters=2)
kmeans=kmeans.fit(X_train_pca)
kmeans_ypred = kmeans.predict(X_test_pca)
print('Accuracy Score: ',accuracy_score(y_test,kmeans_ypred))



Accuracy Score:  0.5819905213270142


Now lets have a look at Agglomerative Clustering with PCA:

In [19]:
agg = AgglomerativeClustering(n_clusters=2)
#agg = agg.fit(X_train)
agg_ypred = agg.fit_predict(X_test_pca)
print('Accuracy Score: ',accuracy_score(y_test,agg_ypred))

Accuracy Score:  0.5450236966824644


## KernelPCA

In [23]:
kernel = KernelPCA(n_components=10,kernel='rbf',gamma=1.0)
X_train_kernel = kernel.fit_transform(X_train)
X_test_kernel = kernel.fit_transform(X_test)

Applying KMeans with Kernel PCA:

In [24]:
kmeans = KMeans(n_clusters=2)
kmeans=kmeans.fit(X_train_kernel)
kmeans_ypred = kmeans.predict(X_test_kernel)
print('Accuracy Score: ',accuracy_score(y_test,kmeans_ypred))



Accuracy Score:  0.6786729857819905


Now lets see Agglomerative Clustering with KernelPCA:

In [25]:
agg = AgglomerativeClustering(n_clusters=2)
#agg = agg.fit(X_train)
agg_ypred = agg.fit_predict(X_test_kernel)
print('Accuracy Score: ',accuracy_score(y_test,agg_ypred))

Accuracy Score:  0.6947867298578199


## Findings

* KMeans and Agglomerative Clustering methods are most suited as we know the number of clusters
* Without dimension reduction, KMeans gave an average accuracy of 60% where as Agglomerative CLustering had an accuracy of nearly 50%.
* With Kernel PCA though, agglomerative CLustering showed an increase of 20% accuracy score which is fair

## Future Recommendations

Methods can be improved by different hyperparameters using GridSearch.

**** ****
**Thank You!**