In this lab, we will build a binary classification model for customer churn. You will use the `files_for_lab/Customer-Churn.csv` file.



### Instructions

1. Apply SMOTE for upsampling the data

    - Use logistic regression to fit the model and compute the accuracy of the model.
    - Use decision tree classifier to fit the model and compute the accuracy of the model.
    - Compare the accuracies of the two models.


2. Apply TomekLinks for downsampling

    - Keep in mind that TomekLinks does not make the two classes equal in size. It only removes majority-class points that sit right next to minority-class points (pairs of opposite-class points that are each other's nearest neighbours).
    - Use logistic regression to fit the model and compute the accuracy of the model.
    - Use decision tree classifier to fit the model and compute the accuracy of the model.
    - Compare the accuracies of the two models.
    - You can also apply the algorithm a second time and check how the imbalance between the two classes changes compared with the first pass.
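
The behaviour described for TomekLinks can be illustrated with a small hand-written NumPy sketch. This is not imbalanced-learn's implementation; the function names and the toy data are hypothetical, and the only assumption is the usual definition of a Tomek link (two points of different classes that are each other's nearest neighbour).

```python
import numpy as np

def tomek_link_majority_indices(X, y, majority_label):
    """Indices of majority-class points that take part in a Tomek link."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise squared Euclidean distances; a point is never its own neighbour.
    dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(dists, np.inf)
    nearest = dists.argmin(axis=1)

    to_drop = []
    for i, j in enumerate(nearest):
        mutual = nearest[j] == i
        if mutual and y[i] != y[j] and y[i] == majority_label:
            to_drop.append(i)
    return np.array(to_drop, dtype=int)

def drop_tomek_links(X, y, majority_label):
    """Remove the majority-class member of every Tomek link (one pass)."""
    drop = tomek_link_majority_indices(X, y, majority_label)
    keep = np.setdiff1d(np.arange(len(y)), drop)
    return X[keep], y[keep]

# Toy data (hypothetical): seven "No" points and two "Yes" points on a line.
X_toy = np.array([[-3.0], [-1.8], [0.0], [1.1], [4.0], [4.8], [9.0], [5.0], [10.5]])
y_toy = np.array(["No"] * 7 + ["Yes"] * 2)

X_1, y_1 = drop_tomek_links(X_toy, y_toy, "No")   # first pass
X_2, y_2 = drop_tomek_links(X_1, y_1, "No")       # second pass

for name, labels in [("original", y_toy), ("pass 1", y_1), ("pass 2", y_2)]:
    print(name, {c: int((labels == c).sum()) for c in ("No", "Yes")})
```

In this toy case the first pass removes two majority points and the second removes one more, because dropping a point can create a new nearest-neighbour pair. The classes stay unequal throughout.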


In [59]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score, accuracy_score
from sklearn.tree import DecisionTreeClassifier
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

In [3]:
data = pd.read_csv('Customer-Churn.csv')

In [4]:
data.head()

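|   | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | 0 | Yes | No | 1 | No | No | Yes | No | No | No | No | Month-to-month | 29.85 | 29.85 | No |
| 1 | Male | 0 | No | No | 34 | Yes | Yes | No | Yes | No | No | No | One year | 56.95 | 1889.5 | No |
| 2 | Male | 0 | No | No | 2 | Yes | Yes | Yes | No | No | No | No | Month-to-month | 53.85 | 108.15 | Yes |
| 3 | Male | 0 | No | No | 45 | No | Yes | No | Yes | Yes | No | No | One year | 42.3 | 1840.75 | No |
| 4 | Female | 0 | No | No | 2 | Yes | No | No | No | No | No | No | Month-to-month | 70.7 | 151.65 | Yes |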


In [6]:
data.shape

(7043, 15)

In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            7043 non-null   object 
 1   SeniorCitizen     7043 non-null   int64  
 2   Partner           7043 non-null   object 
 3   Dependents        7043 non-null   object 
 4   tenure            7043 non-null   int64  
 5   PhoneService      7043 non-null   object 
 6   OnlineSecurity    7043 non-null   object 
 7   OnlineBackup      7043 non-null   object 
 8   DeviceProtection  7043 non-null   object 
 9   TechSupport       7043 non-null   object 
 10  StreamingTV       7043 non-null   object 
 11  StreamingMovies   7043 non-null   object 
 12  Contract          7043 non-null   object 
 13  MonthlyCharges    7043 non-null   float64
 14  TotalCharges      7043 non-null   object 
dtypes: float64(1), int64(2), object(12)
memory usage: 825.5+ KB


In [8]:
data.isnull().sum()

gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
MonthlyCharges      0
TotalCharges        0
dtype: int64

## X-y split

In [5]:
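y = data['Churn']
# drop(..., inplace=True) returns None, so keep the result of a regular drop instead
X = data.drop('Churn', axis=1)
data = X  # later cells select columns from `data`, which must no longer contain the target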
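
A short sketch of how `DataFrame.drop` behaves with and without `inplace=True`, which matters whenever its result is assigned to a variable. The toy frame is hypothetical; only the column name `Churn` is taken from the lab data.

```python
import pandas as pd

toy = pd.DataFrame({"tenure": [1, 34, 2], "Churn": ["No", "No", "Yes"]})

# With inplace=True the frame itself is modified and the call returns None.
frame_a = toy.copy()
result_inplace = frame_a.drop("Churn", axis=1, inplace=True)

# Without inplace, the original is untouched and a new frame is returned.
frame_b = toy.copy()
X_toy = frame_b.drop("Churn", axis=1)
y_toy = frame_b["Churn"]

print(result_inplace)           # None
print(list(frame_a.columns))    # ['tenure']
print(list(X_toy.columns))      # ['tenure']
print(list(frame_b.columns))    # ['tenure', 'Churn']
```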

## Categorical and Numerical Variables

### Categorical

In [22]:
# np.object was deprecated in NumPy 1.20 and later removed; the builtin `object` selects the same columns
X_cat = data.select_dtypes(object)


In [23]:
X_cat

|   | gender | Partner | Dependents | PhoneService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | TotalCharges |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | Yes | No | No | No | Yes | No | No | No | No | Month-to-month | 29.85 |
| 1 | Male | No | No | Yes | Yes | No | Yes | No | No | No | One year | 1889.5 |
| 2 | Male | No | No | Yes | Yes | Yes | No | No | No | No | Month-to-month | 108.15 |
| 3 | Male | No | No | No | Yes | No | Yes | Yes | No | No | One year | 1840.75 |
| 4 | Female | No | No | Yes | No | No | No | No | No | No | Month-to-month | 151.65 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7038 | Male | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | One year | 1990.5 |
| 7039 | Female | Yes | Yes | Yes | No | Yes | Yes | No | Yes | Yes | One year | 7362.9 |
| 7040 | Female | Yes | Yes | No | Yes | No | No | No | No | No | Month-to-month | 346.45 |
| 7041 | Male | Yes | No | Yes | No | No | No | No | No | No | Month-to-month | 306.6 |


In [24]:
X_coded = pd.get_dummies(X_cat)

In [25]:
X_coded

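|   | gender_Female | gender_Male | Partner_No | Partner_Yes | Dependents_No | Dependents_Yes | PhoneService_No | PhoneService_Yes | OnlineSecurity_No | OnlineSecurity_No internet service | ... | TotalCharges_995.35 | TotalCharges_996.45 | TotalCharges_996.85 | TotalCharges_996.95 | TotalCharges_997.65 | TotalCharges_997.75 | TotalCharges_998.1 | TotalCharges_999.45 | TotalCharges_999.8 | TotalCharges_999.9 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7038 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7039 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7040 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7041 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |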
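
The `TotalCharges_…` columns above appear because `TotalCharges` is stored as `object`, so it is treated as categorical and every distinct amount gets its own dummy column. One possible remedy is to convert it to a number before separating categorical and numerical columns. The sketch below uses a hypothetical toy frame and assumes the column is text because of a few unparseable entries such as blank strings; the notebook does not show the actual cause.

```python
import pandas as pd

# Hypothetical rows: TotalCharges arrives as text and one entry is blank.
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Month-to-month"],
    "TotalCharges": ["29.85", "1889.5", " "],
})

# Left as text, each distinct amount becomes its own dummy column.
wide = pd.get_dummies(toy)

# errors="coerce" turns unparseable entries into NaN instead of raising.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")
n_missing = int(toy["TotalCharges"].isna().sum())

# Filling with the median is an assumption; dropping those rows is another option.
toy["TotalCharges"] = toy["TotalCharges"].fillna(toy["TotalCharges"].median())

encoded = pd.get_dummies(toy)
print(wide.shape, encoded.shape, n_missing)
```

After the conversion only `Contract` is one-hot encoded, and `TotalCharges` can be scaled together with the other numerical columns.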


### Numerical

In [30]:
X_num = data.select_dtypes(np.number)

In [37]:
transformer = StandardScaler().fit(X_num)
X_scaled = transformer.transform(X_num)
df_scaled = pd.DataFrame(X_scaled, columns = X_num.columns)
df_scaled

|   | SeniorCitizen | tenure | MonthlyCharges |
|---|---|---|---|
| 0 | -0.439916 | -1.277445 | -1.160323 |
| 1 | -0.439916 | 0.066327 | -0.259629 |
| 2 | -0.439916 | -1.236724 | -0.362660 |
| 3 | -0.439916 | 0.514251 | -0.746535 |
| 4 | -0.439916 | -1.236724 | 0.197365 |
| ... | ... | ... | ... |
| 7038 | -0.439916 | -0.340876 | 0.665992 |
| 7039 | -0.439916 | 1.613701 | 1.277533 |
| 7040 | -0.439916 | -0.870241 | -1.168632 |
| 7041 | 2.273159 | -1.155283 | 0.320338 |
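
To see what the scaled values mean, the standardisation can be reproduced by hand: subtract the column mean and divide by the population standard deviation. The column below is a hypothetical 0/1 feature, not the lab data. It shows why a binary column such as `SeniorCitizen` ends up with only two distinct values after scaling (-0.439916 and 2.273159 in the output above).

```python
import numpy as np

# Hypothetical 0/1 column, similar in kind to SeniorCitizen.
x = np.array([0, 0, 0, 1, 0, 1, 0, 0], dtype=float)

mean = x.mean()
std = x.std()            # population std (ddof=0), which StandardScaler also uses
z = (x - mean) / std

print(np.unique(z.round(6)))     # two distinct z-scores, one per original value
print(z.mean().round(6), z.std().round(6))
```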


### Concat Categorical and Numerical

In [36]:
X_full = pd.concat([X_coded,df_scaled],axis=1)
X_full

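|   | gender_Female | gender_Male | Partner_No | Partner_Yes | Dependents_No | Dependents_Yes | PhoneService_No | PhoneService_Yes | OnlineSecurity_No | OnlineSecurity_No internet service | ... | TotalCharges_996.95 | TotalCharges_997.65 | TotalCharges_997.75 | TotalCharges_998.1 | TotalCharges_999.45 | TotalCharges_999.8 | TotalCharges_999.9 | SeniorCitizen | tenure | MonthlyCharges |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -0.439916 | -1.277445 | -1.160323 |
| 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -0.439916 | 0.066327 | -0.259629 |
| 2 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -0.439916 | -1.236724 | -0.362660 |
| 3 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -0.439916 | 0.514251 | -0.746535 |
| 4 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -0.439916 | -1.236724 | 0.197365 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7038 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -0.439916 | -0.340876 | 0.665992 |
| 7039 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -0.439916 | 1.613701 | 1.277533 |
| 7040 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -0.439916 | -0.870241 | -1.168632 |
| 7041 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2.273159 | -1.155283 | 0.320338 |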


## Apply SMOTE for upsampling

In [39]:
from imblearn.over_sampling import SMOTE
smote = SMOTE()
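# fit_resample is the current name; the older fit_sample is not available in recent imbalanced-learn releases
X_sm, y_sm = smote.fit_resample(X_full, y)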
pd.DataFrame(y_sm).value_counts()

Churn
Yes      5174
No       5174
dtype: int64
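
The extra `Yes` rows above are synthetic. The core idea of SMOTE is to place each new point somewhere on the line segment between a minority sample and one of its minority-class neighbours. The NumPy sketch below illustrates that interpolation step only. It is not imbalanced-learn's implementation: the helper name and the toy points are hypothetical, and it always uses the single nearest neighbour, whereas SMOTE picks among several.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority-class points in two dimensions.
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0]])

def synth_point(points, rng):
    """Create one synthetic point between a random sample and its nearest neighbour."""
    i = rng.integers(len(points))
    dists = np.linalg.norm(points - points[i], axis=1)
    dists[i] = np.inf                  # exclude the point itself
    j = dists.argmin()                 # simplification: nearest neighbour only
    lam = rng.random()                 # interpolation weight in [0, 1)
    return points[i] + lam * (points[j] - points[i])

new_points = np.array([synth_point(minority, rng) for _ in range(5)])
print(new_points.round(3))
```

Because every new point is a weighted average of two existing minority points, the synthetic rows never fall outside the range already covered by the minority class.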

## Train-test split

In [42]:
X_train_smote, X_test_smote, y_train_smote, y_test_smote = train_test_split(X_sm, y_sm, test_size=0.33, random_state=11)
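
Splitting after resampling means that synthetic rows in the training set may have been derived from rows that land in the test set, which can make test scores look better than they would on untouched data. A commonly recommended alternative is to split first and resample the training part only. The sketch below shows that order on hypothetical random data. It uses plain resampling with replacement from `sklearn.utils.resample` as a stand-in for SMOTE, so it demonstrates the workflow rather than the SMOTE algorithm.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))                 # hypothetical features
y_toy = np.array(["Yes"] * 50 + ["No"] * 150)     # hypothetical imbalanced target

# 1) Split first, keeping the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=11, stratify=y_toy
)

# 2) Oversample the minority class in the training part only.
minority = y_tr == "Yes"
n_extra = int((~minority).sum() - minority.sum())
X_extra, y_extra = resample(
    X_tr[minority], y_tr[minority], replace=True, n_samples=n_extra, random_state=0
)
X_tr_bal = np.vstack([X_tr, X_extra])
y_tr_bal = np.concatenate([y_tr, y_extra])

# The test part keeps its original, imbalanced class distribution.
print({c: int((y_tr_bal == c).sum()) for c in ("No", "Yes")})
print({c: int((y_te == c).sum()) for c in ("No", "Yes")})
```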

## Logistic Regression

In [52]:
classification = LogisticRegression(random_state=0, solver='lbfgs',
                        multi_class='ovr').fit(X_train_smote, y_train_smote)

print("The accuracy of the model is: ",round(classification.score(X_test_smote, y_test_smote),2))
print("The kappa of the model is: ",round(cohen_kappa_score(y_sm,classification.predict(X_sm)),2))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


The accuracy of the model is:  0.81
The kappa of the model is:  0.7
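
Cohen's kappa corrects accuracy for the agreement expected by chance, which is why it can be much lower than accuracy. A minimal hand-written version is sketched below on hypothetical labels; `cohen_kappa` is an illustrative helper, not the scikit-learn function. Note also that the cell above scores accuracy on the test split but computes kappa on the whole resampled set (`X_sm`, `y_sm`), which includes the training rows, so the two figures are not measured on the same data.

```python
import numpy as np

def cohen_kappa(y_true, y_pred):
    """kappa = (p_observed - p_expected) / (1 - p_expected)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    labels = np.unique(np.concatenate([y_true, y_pred]))
    p_observed = np.mean(y_true == y_pred)
    # Chance agreement: product of the marginal frequencies, summed over labels.
    p_expected = sum(
        np.mean(y_true == label) * np.mean(y_pred == label) for label in labels
    )
    return (p_observed - p_expected) / (1 - p_expected)

y_true = ["No"] * 8 + ["Yes"] * 2
always_no = ["No"] * 10                                  # ignores the minority class
mixed = ["No"] * 7 + ["Yes"] + ["Yes", "No"]             # finds one of the two "Yes"

for name, pred in [("always_no", always_no), ("mixed", mixed)]:
    acc = np.mean(np.asarray(y_true) == np.asarray(pred))
    print(name, round(float(acc), 2), round(float(cohen_kappa(y_true, pred)), 3))
```

Both toy predictions reach the same accuracy, but only the second has a kappa above zero.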


## Decision Tree

In [48]:
model = DecisionTreeClassifier(max_depth=3)
model.fit(X_train_smote, y_train_smote)
print("The accuracy of the model is: {:4.2f}".format(model.score(X_test_smote, y_test_smote)))
print("The kappa of the model is: ",round(cohen_kappa_score(y_sm,model.predict(X_sm)),2))

The accuracy of the model is: 0.76
The kappa of the model is:  0.52


## Apply TomekLinks for downsampling

In [54]:
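# sampling_strategy is passed by keyword; fit_resample is the current name for fit_sample
tl = TomekLinks(sampling_strategy='majority')
X_tl, y_tl = tl.fit_resample(X_full, y)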
pd.DataFrame(y_tl).value_counts()



Churn
No       4619
Yes      1869
dtype: int64
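
A quick worked calculation puts these counts in context. It takes 5174 as the original `No` count (SMOTE leaves the majority class unchanged) and 7043 as the total number of rows reported by `data.shape`; everything else follows by arithmetic.

```python
counts_before = {"No": 5174, "Yes": 7043 - 5174}   # 7043 rows in total
counts_after = {"No": 4619, "Yes": 1869}           # after one TomekLinks pass

# Majority-class rows removed by TomekLinks.
removed = counts_before["No"] - counts_after["No"]

# Accuracy of a model that always predicts the majority class.
baseline = counts_after["No"] / sum(counts_after.values())

print(removed)              # 555
print(round(baseline, 4))   # 0.7119
```

Any accuracy on this data is best read against that baseline of roughly 0.71, since a model that never predicts churn already reaches it.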

## Train-test split

In [55]:
X_train_tl, X_test_tl, y_train_tl, y_test_tl = train_test_split(X_tl, y_tl, test_size=0.33, random_state=11)

## Logistic Regression

In [57]:
classification1 = LogisticRegression(random_state=0, solver='lbfgs',
                        multi_class='ovr').fit(X_train_tl, y_train_tl)

print("The accuracy of the model is: ",round(classification1.score(X_test_tl, y_test_tl),2))
print("The kappa of the model is: ",round(cohen_kappa_score(y_tl,classification1.predict(X_tl)),2))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


The accuracy of the model is:  0.81
The kappa of the model is:  0.64
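
The solver warning above suggests raising `max_iter` (the scikit-learn default is 100). The sketch below shows the parameter on hypothetical synthetic data from `make_classification`. The value 1000 is an arbitrary choice for illustration, not a setting taken from this lab.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical data standing in for the churn features.
X_toy, y_toy = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=11
)

# A larger iteration budget gives lbfgs more room to converge.
clf = LogisticRegression(max_iter=1000, random_state=0).fit(X_tr, y_tr)

print(clf.n_iter_)                       # iterations actually used
print(round(clf.score(X_te, y_te), 2))   # test accuracy
```

If `n_iter_` stays below `max_iter`, the solver stopped because it converged rather than because it ran out of iterations.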


## Decision Tree

In [58]:
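model1 = DecisionTreeClassifier(max_depth=3)
model1.fit(X_train_tl, y_train_tl)
# Score model1 (fitted on the TomekLinks data). The figures recorded below came from
# scoring `model`, the tree fitted on the SMOTE data, so expect different values on a rerun.
print("The accuracy of the model is: {:4.2f}".format(model1.score(X_test_tl, y_test_tl)))
print("The kappa of the model is: ",round(cohen_kappa_score(y_tl,model1.predict(X_tl)),2))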

The accuracy of the model is: 0.77
The kappa of the model is:  0.49
