## Telecom Churn Prediction

Import data and required libraries

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, recall_score, precision_score, f1_score, roc_auc_score, ConfusionMatrixDisplay
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier


## **1) Preprocess Data**

In [None]:
# Load data

teleco_df = pd.read_csv("/content/WA_Fn-UseC_-Telco-Customer-Churn.csv")

Explore Data

In [None]:
teleco_df.head()

|   | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |


In [None]:
teleco_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


Converting "TotalCharges" from object (string) to a numeric type and checking for missing values

In [None]:
teleco_df['TotalCharges'] = teleco_df['TotalCharges'].replace(' ', np.nan, regex=True)
teleco_df['TotalCharges'] = pd.to_numeric(teleco_df['TotalCharges'])
teleco_df.isnull().sum()

customerID           0
gender               0
SeniorCitizen        0
Partner              0
Dependents           0
tenure               0
PhoneService         0
MultipleLines        0
InternetService      0
OnlineSecurity       0
OnlineBackup         0
DeviceProtection     0
TechSupport          0
StreamingTV          0
StreamingMovies      0
Contract             0
PaperlessBilling     0
PaymentMethod        0
MonthlyCharges       0
TotalCharges        11
Churn                0
dtype: int64

The output above shows that there are 11 missing values in TotalCharges.
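
As a cross-check on the conversion above, `pd.to_numeric` can also coerce unparseable strings to NaN directly through its `errors="coerce"` option, which avoids the regex replacement. A minimal sketch on a made-up three-row frame (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy frame: one blank string stands in for a missing TotalCharges entry.
toy = pd.DataFrame({"TotalCharges": ["29.85", " ", "1889.5"]})

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

n_missing = int(toy["TotalCharges"].isnull().sum())
print(n_missing)
```

The count of NaN values after coercion should match the number of blank entries in the raw column.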

Looking at categorical and numeric fields

In [None]:
columns = teleco_df.columns
print("Numeric fields: \n")
for i in range(len(columns)):
    if teleco_df[columns[i]].dtypes!=object:
        print("Number of fields unique to {} ---> {}".format(columns[i], len(teleco_df[columns[i]].unique())))

print("\nCategorical field: \n")
for i in range(len(columns)):
    if teleco_df[columns[i]].dtypes==object:
        print("Number of fields unique to {} ---> {}".format(columns[i], len(teleco_df[columns[i]].unique())))

Numeric fields: 

Number of fields unique to SeniorCitizen ---> 2
Number of fields unique to tenure ---> 73
Number of fields unique to MonthlyCharges ---> 1585
Number of fields unique to TotalCharges ---> 6531

Categorical field: 

Number of fields unique to customerID ---> 7043
Number of fields unique to gender ---> 2
Number of fields unique to Partner ---> 2
Number of fields unique to Dependents ---> 2
Number of fields unique to PhoneService ---> 2
Number of fields unique to MultipleLines ---> 3
Number of fields unique to InternetService ---> 3
Number of fields unique to OnlineSecurity ---> 3
Number of fields unique to OnlineBackup ---> 3
Number of fields unique to DeviceProtection ---> 3
Number of fields unique to TechSupport ---> 3
Number of fields unique to StreamingTV ---> 3
Number of fields unique to StreamingMovies ---> 3
Number of fields unique to Contract ---> 3
Number of fields unique to PaperlessBilling ---> 2
Number of fields unique to PaymentMethod ---> 4
Number of fields

In [None]:
teleco_df['Contract'].unique()

array(['Month-to-month', 'One year', 'Two year'], dtype=object)

Creating number of months for contract type to calculate and fill null Total Charges

In [None]:
# create number of months for contract type to fill null totalcharges

teleco_df['ContractMonths'] = np.where(teleco_df['Contract']=="Month-to-month", 1, 0) + np.where(teleco_df['Contract']=="One year", 12, 0) + np.where(teleco_df['Contract']=="Two year", 24, 0)
teleco_df['ContractMonths'] = pd.to_numeric(teleco_df['ContractMonths'])
teleco_df.loc[teleco_df['TotalCharges'].isnull(), ['Contract','ContractMonths','MonthlyCharges','TotalCharges']]

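|   | Contract | ContractMonths | MonthlyCharges | TotalCharges |
| --- | --- | --- | --- | --- |
| 488 | Two year | 24 | 52.55 | NaN |
| 753 | Two year | 24 | 20.25 | NaN |
| 936 | Two year | 24 | 80.85 | NaN |
| 1082 | Two year | 24 | 25.75 | NaN |
| 1340 | Two year | 24 | 56.05 | NaN |
| 3331 | Two year | 24 | 19.85 | NaN |
| 3826 | Two year | 24 | 25.35 | NaN |
| 4380 | Two year | 24 | 20.0 | NaN |
| 5218 | One year | 12 | 19.7 | NaN |
| 6670 | Two year | 24 | 73.35 | NaN |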


In [None]:
teleco_df['cal_TotalCharges'] = teleco_df['ContractMonths'] * teleco_df['MonthlyCharges']
teleco_df['TotalCharges'] = np.where(teleco_df['TotalCharges'].isnull(), teleco_df['cal_TotalCharges'], teleco_df['TotalCharges'])
teleco_df.isnull().sum()

customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
ContractMonths      0
cal_TotalCharges    0
dtype: int64

We can now see that there are no null values left in our dataset.
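
The contract-length lookup and fill used above can also be written with a dictionary `map` and `fillna`. This is a sketch on made-up rows, with `months_per_contract` as a hypothetical helper name; it mirrors the notebook's assumption that a missing total equals one full contract term of monthly charges.

```python
import numpy as np
import pandas as pd

# Hypothetical lookup table: contract type -> length in months.
months_per_contract = {"Month-to-month": 1, "One year": 12, "Two year": 24}

toy = pd.DataFrame({
    "Contract": ["Two year", "One year", "Month-to-month"],
    "MonthlyCharges": [52.55, 19.70, 29.85],
    "TotalCharges": [np.nan, np.nan, 29.85],
})

# Estimate a total only where it is missing; existing totals are left untouched.
estimate = toy["Contract"].map(months_per_contract) * toy["MonthlyCharges"]
toy["TotalCharges"] = toy["TotalCharges"].fillna(estimate)
print(toy)
```

If the affected customers turn out to have a tenure of 0, filling with 0 instead would be an alternative worth comparing.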

In [None]:
teleco_df.head()

|   | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | ContractMonths | cal_TotalCharges |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No | 1 | 29.85 |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No | 12 | 683.4 |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes | 1 | 53.85 |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No | 12 | 507.6 |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes | 1 | 70.7 |


Creating a function to convert categorical values to numerical values.

In [None]:
def cat_to_num(df):
    le = LabelEncoder() # using LabelEncoder from sklearn
    for col in df.columns: # using for loop to iterate through all features
        if df[col].dtype == 'object': # if a column is 'object' dtype using LabelEncoder on them
            df[col] = le.fit_transform(df[col])

    return df

In [None]:
numeric_data = cat_to_num(teleco_df)
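
Because `cat_to_num` reuses a single `LabelEncoder` and modifies the frame in place, the mapping for each column is overwritten by the next one. A possible variation, sketched here on a toy frame, keeps one encoder per column and works on a copy so the integer codes can be translated back later; `encode_with_lookup` is a hypothetical name, not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encode_with_lookup(df):
    """Label-encode non-numeric columns on a copy and return the fitted encoders."""
    out = df.copy()
    encoders = {}
    for col in out.columns:
        if not pd.api.types.is_numeric_dtype(out[col]):
            enc = LabelEncoder()
            out[col] = enc.fit_transform(out[col])
            encoders[col] = enc  # keep the encoder so codes can be decoded later
    return out, encoders

toy = pd.DataFrame({"Churn": ["No", "Yes", "No"], "tenure": [1, 34, 2]})
encoded, encoders = encode_with_lookup(toy)
print(encoded["Churn"].tolist())
print(list(encoders["Churn"].classes_))
```

`encoders["Churn"].inverse_transform(...)` can then map predictions back to the original labels.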

## **2) Data Visualization**

I have created the following visualisations to understand how different features are distributed and what they tell us based on the data.

In [None]:
px.histogram(teleco_df,x='Churn',color='Dependents',barmode='group',title='Churn with Dependent People')


In [None]:
px.histogram(teleco_df, x="Churn", color="OnlineSecurity", barmode="group", title ='Churn vs Online Security')

This suggests that customers without online security are the ones who leave the company most often.

In [None]:
px.box(teleco_df,x='Churn',y='tenure')

From the above box plot we can understand the following:

*   Customers with a tenure lower than 10 months are more likely to leave the company.
*   Customers with a tenure of more than 15 months are more likely to stay with the company.
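
One way to put numbers behind the box plot is to bin tenure and compute the churn rate per bin. The sketch below uses a small made-up frame and bin edges at 10 and 15 months, so the rates are illustrative only; `Churn` is assumed to be encoded as 0/1, as it is after `cat_to_num`.

```python
import pandas as pd

toy = pd.DataFrame({
    "tenure": [1, 2, 5, 8, 12, 14, 20, 34, 45, 60],
    "Churn":  [1, 1, 1, 0, 1, 0, 0, 0, 0, 0],
})

# Bin tenure into ranges; right=False makes bins [0, 10), [10, 15), [15, inf).
bins = [0, 10, 15, float("inf")]
labels = ["<10", "10-15", "15+"]
toy["tenure_band"] = pd.cut(toy["tenure"], bins=bins, labels=labels, right=False)

# The mean of a 0/1 column is the share of churned customers in each band.
churn_rate = toy.groupby("tenure_band", observed=True)["Churn"].mean()
print(churn_rate)
```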


In [None]:
px.histogram(teleco_df, x='MonthlyCharges',nbins = 50)

This plot shows the distribution of monthly charges, i.e. how many customers receive similar charges.

In [None]:
px.histogram(teleco_df, x='tenure',color = 'Churn')

This plot shows how churn depends on customer tenure: it tells us the tenures at which customers have churned.

## **3) Problem Statement**

I am working on a classification problem: predicting whether a telecom customer will churn.
My label is the Churn column and my features are:

1. Customer ID
2. Gender
3. Senior Citizen
4. Partner
5. Dependents
6. Tenure
7. Phone Service
8. Multiple Lines
9. Internet Service
10. Online Security
11. Online Backup
12. Device Protection
13. Tech Support
14. Streaming TV
15. Streaming Movies
16. Contract
17. Paperless Billing
18. Payment Method
19. Monthly Charges
20. Total Charges

I will be using precision, recall, f1-score and support for evaluation.
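
For reference, precision, recall and F1 for the positive (churn) class follow directly from confusion-matrix counts, while support is simply the number of true samples in a class. A small worked sketch in plain Python, with made-up counts:

```python
# Made-up confusion-matrix counts for the churn (positive) class.
tp, fp, fn, tn = 40, 10, 20, 130

precision = tp / (tp + fp)   # of predicted churners, how many really churned
recall = tp / (tp + fn)      # of real churners, how many were caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
support = tp + fn            # number of true churners in the test set

print(round(precision, 3), round(recall, 3), round(f1, 3), support)
```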


## **4) Feature Selection**

Let's use Logistic regression to see which features are the most important to us and which ones are not.

In [None]:
from sklearn.model_selection import train_test_split
x = teleco_df.drop('Churn',axis=1)
y = teleco_df['Churn']
x_train, x_test, y_train, y_test= train_test_split(x,y,test_size=0.25,random_state=47)

In [None]:
model = LogisticRegression()
model.fit(x_train, y_train)
importance = model.coef_[0]
for i, v in enumerate(importance):
    print('Feature: %0d, Score: %.5f' % (i, v))

Feature: 0, Score: -0.06198
Feature: 1, Score: 0.16788
Feature: 2, Score: -0.10730
Feature: 3, Score: -0.19202
Feature: 4, Score: -0.05141
Feature: 5, Score: -0.30856
Feature: 6, Score: 0.14378
Feature: 7, Score: 0.07315
Feature: 8, Score: -0.40020
Feature: 9, Score: -0.14962
Feature: 10, Score: -0.07049
Feature: 11, Score: -0.35719
Feature: 12, Score: 0.05782
Feature: 13, Score: 0.08123
Feature: 14, Score: -0.59785
Feature: 15, Score: 0.33341
Feature: 16, Score: -0.03696
Feature: 17, Score: 0.01171
Feature: 18, Score: 0.00024



lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression



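Feature 14, which is Streaming Movies, has the most negative coefficient (-0.59785). I read this as the least important feature of our dataset, although strictly a large-magnitude coefficient signals a strong (here negative) association with churn, and coefficients fitted on unscaled features with a non-converged solver are only a rough guide.
Hence I will be running my models without this feature.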
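
Raw logistic-regression coefficients depend on each feature's scale, so a common way to compare them is to scale the inputs first and rank features by absolute coefficient. The following is a sketch on synthetic data (the column names `f0` to `f3` and the data-generating rule are made up), using `MinMaxScaler` in a pipeline; it is not the notebook's original procedure.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 4)), columns=["f0", "f1", "f2", "f3"])
X["f3"] *= 1000  # a feature on a much larger scale, like TotalCharges
# Synthetic label driven mostly by f0, weakly by f1, and not at all by f2/f3.
y = (2.0 * X["f0"] - 0.5 * X["f1"] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# Scaling first puts all coefficients on a comparable footing.
pipe = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X, y)

coefs = pd.Series(pipe.named_steps["logisticregression"].coef_[0], index=X.columns)
ranking = coefs.abs().sort_values(ascending=False)
print(ranking)
```

Under this reading, features near the bottom of `ranking` (smallest absolute coefficient) would be the candidates for removal.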

In [None]:
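# x_train/x_test were split before this drop, so the split has to be redone for it to affect the models below
teleco_df = teleco_df.drop(columns=['StreamingMovies'])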

## **5) Comparing different models**

1. Logistic Regression

In [None]:
lr = LogisticRegression()
# fitting the model on the training data
lr.fit(x_train,y_train)
# predicting on the x_test values which we separated during train_test_split()
pred = lr.predict(x_test)

print(classification_report(y_test,pred))

              precision    recall  f1-score   support

           0       0.83      0.90      0.86      1290
           1       0.64      0.49      0.56       471

    accuracy                           0.79      1761
   macro avg       0.74      0.69      0.71      1761
weighted avg       0.78      0.79      0.78      1761




lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression



2. K-Neighbours Classification

In [None]:
knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(x_train,y_train)
pred = knn.predict(x_test)
print(classification_report(y_test,pred))

              precision    recall  f1-score   support

           0       0.80      0.94      0.86      1290
           1       0.67      0.35      0.46       471

    accuracy                           0.78      1761
   macro avg       0.73      0.64      0.66      1761
weighted avg       0.76      0.78      0.75      1761



3. Decision Tree Classifier

In [None]:
dt = DecisionTreeClassifier(max_depth=5)
dt.fit(x_train,y_train)
pred = dt.predict(x_test)
print(classification_report(y_test,pred))

              precision    recall  f1-score   support

           0       0.82      0.91      0.87      1290
           1       0.67      0.47      0.55       471

    accuracy                           0.80      1761
   macro avg       0.75      0.69      0.71      1761
weighted avg       0.78      0.80      0.78      1761



4. Random Forest Classifier

In [None]:
rt = RandomForestClassifier(max_depth=5)
rt.fit(x_train,y_train)
pred = rt.predict(x_test)
print(classification_report(y_test,pred))

              precision    recall  f1-score   support

           0       0.81      0.93      0.87      1290
           1       0.68      0.41      0.51       471

    accuracy                           0.79      1761
   macro avg       0.74      0.67      0.69      1761
weighted avg       0.78      0.79      0.77      1761
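
A single train/test split can make close scores like the ones above hard to rank. As a sketch, the same four classifiers could be compared with `cross_val_score` (already imported at the top) on recall for the churn class. The data here is synthetic from `make_classification`, so the numbers it prints say nothing about the Telco data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the churn data (about 27% positives).
X, y = make_classification(n_samples=600, n_features=10, weights=[0.73, 0.27],
                           random_state=47)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "KNeighbors": KNeighborsClassifier(n_neighbors=10),
    "DecisionTree": DecisionTreeClassifier(max_depth=5, random_state=47),
    "RandomForest": RandomForestClassifier(max_depth=5, random_state=47),
}

# 5-fold cross-validated recall of class 1 for each model.
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="recall")
    results[name] = scores.mean()
    print(f"{name}: mean recall = {results[name]:.3f}")
```

Recall on class 1 is used here because missing a churner is the weak spot in all four reports above; another metric can be swapped in through the `scoring` argument.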



## **6) Unsupervised + Supervised**

In [None]:
!pip install minisom

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting minisom
  Downloading MiniSom-2.3.1.tar.gz (10 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: minisom
  Building wheel for minisom (setup.py) ... done
  Created wheel for minisom: filename=MiniSom-2.3.1-py3-none-any.whl size=10609 sha256=be9dccf3956284cafc77becf935d510057a9b8d197db267c17d5aacf041c69db
  Stored in directory: /root/.cache/pip/wheels/b4/aa/7a/d9a88098f7877aa95dd4f227fe614f75773654baa39b47bba6
Successfully built minisom
Installing collected packages: minisom
Successfully installed minisom-2.3.1


In [None]:
from minisom import MiniSom

In [None]:
som_model=MiniSom(15,15,19,sigma=0.4, learning_rate=0.2)
som_model.train(x_train.values,50000, verbose=True)

 [ 50000 / 50000 ] 100% - 0:00:00 left 
 quantization error: 65.83128286535391


In [None]:
X_training_qnt=som_model.quantization(x_train.values)

In [None]:
from sklearn.linear_model import LogisticRegression
LGR_model_2nd_layer = LogisticRegression(max_iter=1000)
LGR_model_2nd_layer.fit(X_training_qnt,y_train)
LGR_model_2nd_layer_prediction=LGR_model_2nd_layer.predict(som_model.quantization(x_test.values))

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score,f1_score
print('LGC result:')
print('Accuracy: ',accuracy_score(y_test,LGR_model_2nd_layer_prediction))
print('f1_score: ', f1_score(y_test,LGR_model_2nd_layer_prediction,average='weighted'))
print('Precision: ', precision_score(y_test,LGR_model_2nd_layer_prediction,average='weighted'))
print('Recall: ', recall_score(y_test,LGR_model_2nd_layer_prediction,average='weighted'))

LGC result:
Accuracy:  0.7739920499716071
f1_score:  0.7368430969504545
Precision:  0.7594847722160881
Recall:  0.7739920499716071
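
To make the two-stage idea concrete: the SOM's `quantization` step replaces each sample with the weight vector of its best-matching unit, and the logistic regression is then trained on those replaced vectors. The NumPy sketch below imitates that replacement with a tiny hand-written codebook (the `codebook` values and the `quantize` helper are made up for illustration and do not use MiniSom itself).

```python
import numpy as np

def quantize(samples, codebook):
    """Replace each sample with its nearest codebook vector (Euclidean distance)."""
    # distances[i, j] = distance from sample i to codebook vector j
    distances = np.linalg.norm(samples[:, None, :] - codebook[None, :, :], axis=2)
    winners = distances.argmin(axis=1)
    return codebook[winners]

codebook = np.array([[0.0, 0.0], [10.0, 10.0]])   # two made-up prototype vectors
samples = np.array([[1.0, 2.0], [9.0, 8.0], [0.5, 0.1]])

quantized = quantize(samples, codebook)
print(quantized)
```

With a 15x15 map there are at most 225 distinct vectors, so the second-stage classifier sees a coarser version of the data than the models in section 5.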


## **7) Justification**

In this project, I analyzed a telecom customer dataset to predict churn. I applied different data preprocessing techniques and performed exploratory data analysis on the dataset to find the relationship between features and churn.

For feature selection, I used logistic regression to get a list of features and their importance scores to find out which features are important and which ones are not. I found the feature StreamingMovies to be the least important.

I used precision, recall, f1-score and support to evaluate different classification models: Logistic Regression, Decision Tree Classifier, K-Neighbours Classifier and Random Forest Classifier. The combination of SOM and Logistic Regression gave the following scores:

*   Accuracy:  0.7739920499716071
*   f1_score:  0.7368430969504545
*   Precision:  0.7594847722160881
*   Recall:  0.7739920499716071





This result is plausible given the number of features in the dataset, the correlations among them, the importance of certain features in predicting churn, and the performance of the other classifiers on the same data (accuracy between 0.78 and 0.80).