**Scenario**

You are working as an analyst for an internet service provider. You have been given historical data about the company's customers and their churn behavior. Your task is to build a machine learning model that helps the company identify customers who are more likely to churn, so that it can act before losing them.

#### 1. Import the required libraries and modules that you would need.

In [41]:
import pandas as pd
import numpy as np

import seaborn as sns

from sklearn.model_selection import train_test_split    
from sklearn.linear_model import LogisticRegression

from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks


import warnings
warnings.filterwarnings('ignore')

#### 2. Read that data into Python and call the dataframe churnData.

In [2]:
data = pd.read_csv("Customer-Churn.csv")
data.head()

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,MonthlyCharges,TotalCharges,Churn
0,Female,0,Yes,No,1,No,No,Yes,No,No,No,No,Month-to-month,29.85,29.85,No
1,Male,0,No,No,34,Yes,Yes,No,Yes,No,No,No,One year,56.95,1889.5,No
2,Male,0,No,No,2,Yes,Yes,Yes,No,No,No,No,Month-to-month,53.85,108.15,Yes
3,Male,0,No,No,45,No,Yes,No,Yes,Yes,No,No,One year,42.3,1840.75,No
4,Female,0,No,No,2,Yes,No,No,No,No,No,No,Month-to-month,70.7,151.65,Yes


In [3]:
data.shape

(7043, 16)

#### 3. Check the datatypes of all the columns in the data. You would see that the column TotalCharges is object type. Convert this column into numeric type using pd.to_numeric function.

In [46]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            7043 non-null   object 
 1   SeniorCitizen     7043 non-null   int64  
 2   Partner           7043 non-null   object 
 3   Dependents        7043 non-null   object 
 4   tenure            7043 non-null   int64  
 5   PhoneService      7043 non-null   object 
 6   OnlineSecurity    7043 non-null   object 
 7   OnlineBackup      7043 non-null   object 
 8   DeviceProtection  7043 non-null   object 
 9   TechSupport       7043 non-null   object 
 10  StreamingTV       7043 non-null   object 
 11  StreamingMovies   7043 non-null   object 
 12  Contract          7043 non-null   object 
 13  MonthlyCharges    7043 non-null   float64
 14  TotalCharges      7043 non-null   float64
 15  Churn             7043 non-null   object 
dtypes: float64(2), int64(2), object(12)
memory

In [47]:
data['TotalCharges'] = pd.to_numeric(data['TotalCharges'], errors='coerce')

In [48]:
data['TotalCharges'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 7043 entries, 0 to 7042
Series name: TotalCharges
Non-Null Count  Dtype  
--------------  -----  
7043 non-null   float64
dtypes: float64(1)
memory usage: 55.1 KB


#### 4. Check for null values in the dataframe. Replace the null values.

In [8]:
data.isna().sum().to_frame().rename(columns={0:'count'}).sort_values(by='count', ascending=False)

# The TotalCharges column has 11 nulls.

Unnamed: 0,count
TotalCharges,11
gender,0
SeniorCitizen,0
Partner,0
Dependents,0
tenure,0
PhoneService,0
OnlineSecurity,0
OnlineBackup,0
DeviceProtection,0


In [9]:
data['TotalCharges'].value_counts()

20.20      11
19.75       9
20.05       8
19.90       8
19.65       8
           ..
6849.40     1
692.35      1
130.15      1
3211.90     1
6844.50     1
Name: TotalCharges, Length: 6530, dtype: int64

In [None]:
sns.histplot(data['TotalCharges'], kde=True)  # histplot replaces the deprecated distplot

In [49]:
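# Replace nulls with the column mean (truncated to an integer, as in the original run).
# Assigning the result back avoids calling fillna(inplace=True) on a column selection.
data['TotalCharges'] = data['TotalCharges'].fillna(int(data['TotalCharges'].mean()))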

In [50]:
data.TotalCharges.isna().sum()

0

#### 5. Use the following features: tenure, SeniorCitizen, MonthlyCharges and TotalCharges

#### 5.1 Scale the features either by using normalizer or a standard scaler.

In [11]:
numerical=data.select_dtypes(np.number)
numerical.head()

Unnamed: 0,SeniorCitizen,tenure,MonthlyCharges,TotalCharges
0,0,1,29.85,29.85
1,0,34,56.95,1889.5
2,0,2,53.85,108.15
3,0,45,42.3,1840.75
4,0,2,70.7,151.65


In [12]:
from sklearn.preprocessing import MinMaxScaler


transformer = MinMaxScaler().fit(numerical)
data_normalized = transformer.transform(numerical)
data_normalized = pd.DataFrame(data_normalized, columns=numerical.columns)
data_normalized.describe()

Unnamed: 0,SeniorCitizen,tenure,MonthlyCharges,TotalCharges
count,7043.0,7043.0,7043.0,7043.0
mean,0.162147,0.449599,0.462803,0.261309
std,0.368612,0.341104,0.299403,0.261366
min,0.0,0.0,0.0,0.0
25%,0.0,0.125,0.171642,0.044245
50%,0.0,0.402778,0.518408,0.159445
75%,0.0,0.763889,0.712438,0.43478
max,1.0,1.0,1.0,1.0
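
As a rough illustration of what the scaler does: `MinMaxScaler` maps each column to (x - min) / (max - min), using the minimum and maximum seen during `fit`. The sketch below reproduces that formula in plain NumPy on a few (tenure, MonthlyCharges) rows taken from the preview above. Treating three of them as "training" rows and one as a "new" row is an assumption made purely for illustration.

```python
import numpy as np

# (tenure, MonthlyCharges) for rows 0, 1 and 3 of the preview; used as "training" rows.
train = np.array([[1.0, 29.85],
                  [34.0, 56.95],
                  [45.0, 42.30]])
# Row 4 of the preview, treated here as an unseen row.
new = np.array([[2.0, 70.70]])

# "Fit": remember the per-column minimum and maximum of the training rows.
col_min = train.min(axis=0)
col_max = train.max(axis=0)

def min_max_scale(values):
    """Apply (x - min) / (max - min) with the fitted column statistics."""
    return (values - col_min) / (col_max - col_min)

scaled_train = min_max_scale(train)
scaled_new = min_max_scale(new)

print(scaled_train)  # every column spans exactly 0 to 1
print(scaled_new)    # MonthlyCharges exceeds the fitted maximum, so it scales above 1
```

Values outside the fitted range land outside [0, 1]. That is one reason a scaler is usually fit on the training rows only and then applied to the test rows.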


#### 5.2 Split the data into a training set and a test set.

In [13]:
X = data_normalized
y = data['Churn']

#### 5.3 Fit a logistic regression model on the training data.

In [14]:
# We separate the training and testing datasets and their corresponding targets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=100)

In [15]:
X_test.shape

(2113, 4)

In [16]:
classification = LogisticRegression(random_state=0, multi_class='ovr').fit(X_train, y_train)  # 'ovr' is sufficient here because the target is binary

In [17]:
predictions = classification.predict(X_test)
pd.Series(predictions).value_counts() # distribution of the predicted classes

No     1709
Yes     404
dtype: int64

In [18]:
y_test.value_counts()    # these are the actual values

# So the model predicted 1709 "No" while the actual count was 1547.

No     1547
Yes     566
Name: Churn, dtype: int64

In [19]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test,predictions)

array([[1396,  151],
       [ 313,  253]], dtype=int64)
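
The confusion matrix carries more information than the accuracy score alone. The sketch below derives accuracy, recall and precision for the "Yes" class, and the accuracy of always predicting "No", directly from the four counts printed above. It assumes scikit-learn's default layout: rows are true classes, columns are predicted classes, and labels are sorted as ["No", "Yes"].

```python
import numpy as np

# Confusion matrix from the cell above (rows: true class, columns: predicted class).
cm = np.array([[1396, 151],
               [313, 253]])

# With labels ordered ["No", "Yes"], "Yes" is the positive class.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
recall_yes = tp / (tp + fn)        # share of actual churners the model caught
precision_yes = tp / (tp + fp)     # share of predicted churners who really churned
majority_baseline = (tn + fp) / cm.sum()  # accuracy of always predicting "No"

print(f"accuracy          = {accuracy:.3f}")
print(f"recall (Yes)      = {recall_yes:.3f}")
print(f"precision (Yes)   = {precision_yes:.3f}")
print(f"majority baseline = {majority_baseline:.3f}")
```

Read this way, the model finds fewer than half of the customers who actually churned, even though its accuracy is several points above the majority-class baseline.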

#### 5.4 Check the accuracy on the test data.

In [20]:
# Accuracy

classification.score(X_test, y_test)

0.780407004259347

In [21]:
# Accuracy
from sklearn.metrics import accuracy_score

print('\nAccuracy: {:.2f}\n'.format(accuracy_score(y_test, predictions)))


Accuracy: 0.78



#### 6. Managing imbalance in the dataset
#### 6.1 Check for the imbalance.

In [51]:
data.Churn.value_counts()

No     5174
Yes    1869
Name: Churn, dtype: int64

In [None]:
numerical.SeniorCitizen.value_counts()

In [None]:
numerical.tenure.value_counts()

In [None]:
numerical.MonthlyCharges.value_counts()

In [None]:
numerical.TotalCharges.value_counts()

#### 6.2 Use the resampling strategies used in class for upsampling and downsampling to create a balance between the two classes.

#### Downsampling

In [24]:
ds = RandomUnderSampler()
X_ds, y_ds = ds.fit_resample(X, y)   # randomly drops majority-class rows; X and y are resampled together
y_ds.value_counts()

No     1869
Yes    1869
Name: Churn, dtype: int64

In [25]:
y.value_counts()

No     5174
Yes    1869
Name: Churn, dtype: int64

In [26]:
X_train, X_test, y_train, y_test = train_test_split(X_ds, y_ds, test_size=0.3, random_state=100)
classification = LogisticRegression(random_state=0, multi_class='ovr').fit(X_train, y_train)
predictions = classification.predict(X_test)
classification.score(X_test, y_test)

0.7130124777183601

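Accuracy dropped from 0.78 to 0.71 with downsampling. The two scores are not directly comparable, though: the test set here is roughly balanced, so always predicting "No" would score only about 0.50, compared with about 0.73 on the original test set.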

#### Upsampling

In [27]:
us = RandomOverSampler()
X_us, y_us = us.fit_resample(X, y)
y_us.value_counts()

No     5174
Yes    5174
Name: Churn, dtype: int64

In [28]:
X_train, X_test, y_train, y_test = train_test_split(X_us, y_us, test_size=0.3, random_state=100)
classification = LogisticRegression(random_state=0, multi_class='ovr').fit(X_train, y_train)
predictions = classification.predict(X_test)
classification.score(X_test, y_test)

0.7442834138486313

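Accuracy dropped from 0.78 to 0.74 with upsampling. As with downsampling, the test set is now roughly balanced, so the score is not directly comparable with the original one. In addition, random oversampling duplicates minority rows before the split, so copies of the same row can land in both the training set and the test set.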
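
One way to keep the test set untouched is to split first and resample only the training rows. The sketch below shows that ordering with plain NumPy random oversampling on hypothetical toy data; it is a stand-in for the idea, not the notebook's pipeline. With imbalanced-learn, the equivalent would be to call `fit_resample` on `X_train, y_train` only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 100 rows, 4 features, 25 minority rows (label 1).
X = rng.random((100, 4))
y = np.array([1] * 25 + [0] * 75)

# 1) Split first: 30 rows for testing, 70 for training.
idx = rng.permutation(len(y))
test_idx, train_idx = idx[:30], idx[30:]

# 2) Oversample the minority class inside the training rows only.
train_min = train_idx[y[train_idx] == 1]
train_maj = train_idx[y[train_idx] == 0]
extra = rng.choice(train_min, size=len(train_maj) - len(train_min), replace=True)
balanced_train_idx = np.concatenate([train_idx, extra])

X_train, y_train = X[balanced_train_idx], y[balanced_train_idx]
X_test, y_test = X[test_idx], y[test_idx]   # original class ratio, no duplicates

print(np.bincount(y_train))  # both classes now have the same count
print(np.bincount(y_test))   # test set keeps its natural imbalance
```

Scores measured on `X_test` then refer to the same class distribution as the original, unresampled model.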

#### SMOTE - Synthetic Minority Oversampling Technique 

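SMOTE creates synthetic minority samples rather than duplicating existing ones. For a minority-class point, it finds that point's k nearest neighbors within the minority class (for some pre-specified k), picks one of them at random, and places a new sample at a random position on the line segment between the two points. This is repeated until the minority class reaches the desired size.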
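
The interpolation step at the core of SMOTE can be sketched in a few lines of NumPy. This is a minimal illustration under simplifying assumptions (brute-force distances, Euclidean metric, a hypothetical helper name `smote_sketch`), not the imbalanced-learn implementation.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority samples."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise Euclidean distances between all minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)              # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # indices of the k nearest neighbors

    base = rng.integers(0, len(X_min), size=n_new)            # random minority points
    picked = neighbors[base, rng.integers(0, k, size=n_new)]  # one neighbor for each
    gap = rng.random((n_new, 1))                              # position along the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Four minority points on the corners of the unit square (toy data).
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_sketch(X_min, n_new=6, k=2)
print(synthetic)
```

Every synthetic point lies between two existing minority points, so the new samples stay inside the region the minority class already occupies.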

In [53]:
smote = SMOTE()

X_sm, y_sm = smote.fit_resample(X, y)
y_sm.value_counts()


No     5174
Yes    5174
Name: Churn, dtype: int64

In [39]:
X_train, X_test, y_train, y_test = train_test_split(X_sm, y_sm, test_size=0.3, random_state=100)
classification = LogisticRegression(random_state=0, multi_class='ovr').fit(X_train, y_train)
predictions = classification.predict(X_test)
classification.score(X_test, y_test)

0.7442834138486313

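No improvement: balancing the classes with SMOTE lowered accuracy from 0.78 to 0.74. This is expected and does not by itself mean the model is worse. On the original data, a model that always predicts "No" already reaches about 0.73 accuracy, whereas on a balanced test set that baseline is only about 0.50. Recall and precision for the "Yes" class would be a fairer basis for comparison than accuracy.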

#### Tomek's Links

Tomek links are pairs of instances from opposite classes that are each other's nearest neighbors. Removing the majority-class instance of each pair widens the gap between the two classes, which makes them easier to separate. The method only drops data: it removes majority-class instances that sit right next to a minority-class instance.


It does not make the two classes equal in size; it only removes the majority-class points that lie closest to minority-class points.
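
To make the definition concrete, the sketch below finds Tomek links on a tiny one-dimensional toy dataset and drops the majority-class member of each pair. The helper name `tomek_link_pairs` and the data are hypothetical, and the brute-force distance matrix is only suitable for small examples.

```python
import numpy as np

def tomek_link_pairs(X, y):
    """Return pairs (i, j), i < j, that are mutual nearest neighbors with different labels."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)      # ignore each point's distance to itself
    nearest = dists.argmin(axis=1)       # nearest neighbor of every point
    pairs = []
    for i, j in enumerate(nearest):
        if i < j and nearest[j] == i and y[i] != y[j]:
            pairs.append((i, int(j)))
    return pairs

# Toy data: the points at 1.0 ("No") and 1.05 ("Yes") sit right next to each other.
X = np.array([[0.0], [0.1], [1.0], [1.05], [2.0]])
y = np.array(["No", "No", "No", "Yes", "Yes"])

pairs = tomek_link_pairs(X, y)

# Drop only the majority-class ("No") member of each link.
majority = "No"
drop = [i if y[i] == majority else j for i, j in pairs]
keep = np.setdiff1d(np.arange(len(y)), drop)

print(pairs)     # the single link between indices 2 and 3
print(y[keep])   # the borderline "No" point has been removed
```

Points 0 and 1 are also mutual nearest neighbors, but they share a label, so they do not form a Tomek link and are kept.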

In [43]:
tl = TomekLinks(sampling_strategy='majority')
X_tl, y_tl = tl.fit_resample(X, y)
y_tl.value_counts()

No     4652
Yes    1869
Name: Churn, dtype: int64

In [45]:
X_train, X_test, y_train, y_test = train_test_split(X_tl, y_tl, test_size=0.3, random_state=100)
classification = LogisticRegression(random_state=0, multi_class='ovr').fit(X_train, y_train)
predictions = classification.predict(X_test)
classification.score(X_test, y_test)

0.7956055186509964

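Accuracy rose from 0.78 to 0.80 after removing majority-class ("No") points that could be confused with the minority class. Because those borderline points were removed before the split, the test set is also easier than the original one, so part of the gain may come from that rather than from a better model.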