# Lab | Imbalanced data

We will be using the `files_for_lab/customer_churn.csv` dataset to build a churn predictor.

### Instructions

1. Load the dataset and explore the variables.
2. We will try to predict the variable `Churn` using a logistic regression on the variables `tenure`, `SeniorCitizen`, and `MonthlyCharges`.
3. Extract the target variable.
4. Extract the independent variables and scale them.
5. Build the logistic regression model.
6. Evaluate the model.
7. Even a simple model will give us more than 70% accuracy. Why?
8. **Synthetic Minority Oversampling TEchnique (SMOTE)** is an oversampling technique based on nearest neighbors that adds new points between existing points. Apply `imblearn.over_sampling.SMOTE` to the dataset. Build and evaluate the logistic regression model. Is there any improvement?
9. **Tomek links** are pairs of very close instances that belong to opposite classes. Removing the majority-class instance of each pair increases the space between the two classes, which makes classification easier. Apply `imblearn.under_sampling.TomekLinks` to the dataset. Build and evaluate the logistic regression model. Is there any improvement?

In [2]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

In [3]:
import warnings
warnings.filterwarnings('ignore')

In [4]:
## load dataset
churnData = pd.read_csv('./files_for_lab/customer_churn.csv') # this file is in files_for_lab folder
churnData.head(5)

| Unnamed: 0 | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |


In [5]:
## prepare variables
X = churnData[['tenure', 'SeniorCitizen','MonthlyCharges']]
y = (churnData.Churn == 'Yes').astype(int)
transformer = StandardScaler().fit(X)
scaled_x = transformer.transform(X)

In [6]:
## model
model = LogisticRegression(random_state=0, solver='lbfgs',
                        multi_class='ovr').fit(scaled_x, y)

In [11]:
model.score(scaled_x, y)

0.7911401391452506
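
Instruction 7 asks why even a simple model exceeds 70% accuracy. One way to see it is to compare against a baseline that always predicts the majority class. The sketch below uses the class counts that appear in this notebook's outputs (5174 non-churners, 1869 churners) instead of reloading the CSV; `majority_baseline_accuracy` and `churn_counts` are hypothetical names introduced only for illustration.

```python
def majority_baseline_accuracy(class_counts):
    """Accuracy of a classifier that always predicts the most frequent class."""
    total = sum(class_counts.values())
    return max(class_counts.values()) / total

# Class counts taken from the outputs in this notebook:
# 0 = no churn, 1 = churn.
churn_counts = {0: 5174, 1: 1869}

baseline = majority_baseline_accuracy(churn_counts)
print(f"Majority-class baseline accuracy: {baseline:.4f}")
```

Under these counts the baseline is roughly 73%, so the 79% obtained by the logistic regression is only a modest gain over predicting "No" for every customer.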

In [12]:
from imblearn.over_sampling import SMOTE
smote = SMOTE()
X_sm, y_sm = smote.fit_resample(scaled_x, y)  # fit_sample was removed in newer imblearn releases
y_sm.value_counts()

1    5174
0    5174
Name: Churn, dtype: int64
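
Instruction 8 describes SMOTE as adding new points between existing points. A minimal sketch of that interpolation step is shown here, assuming two hypothetical minority-class samples. It is not imblearn's implementation: the real algorithm also chooses the neighbor among the k nearest minority-class neighbors, which this sketch skips. The helper `interpolate` and the sample values are made up for illustration.

```python
import random

def interpolate(point, neighbor, rng):
    """Return a synthetic point on the segment between point and neighbor."""
    lam = rng.random()  # uniform in [0, 1)
    return [p + lam * (n - p) for p, n in zip(point, neighbor)]

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Two hypothetical minority-class samples in a 2-D scaled feature space.
minority_point = [0.0, 1.0]
minority_neighbor = [2.0, 3.0]

synthetic = interpolate(minority_point, minority_neighbor, rng)
print(synthetic)
```

Because the same factor `lam` is applied to every feature, the synthetic sample always lies on the straight line joining the two originals.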

In [13]:
## model
model = LogisticRegression(random_state=0, solver='lbfgs',
                        multi_class='ovr').fit(X_sm, y_sm)

In [14]:
model.score(scaled_x, y)

0.7293766860712765

In [15]:
from sklearn.metrics import confusion_matrix

In [16]:
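y_pred = model.predict(X_sm)
# Note: the arguments are passed as (y_pred, y_true), so rows are
# predicted labels and columns are true labels in the output.
confusion_matrix(y_pred, y_sm)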

array([[3756, 1372],
       [1418, 3802]], dtype=int64)

In [17]:
y_pred=model.predict(scaled_x)
confusion_matrix(y_pred, y)

array([[3756,  488],
       [1418, 1381]], dtype=int64)

In [18]:
from imblearn.under_sampling import TomekLinks

In [19]:
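tl = TomekLinks(sampling_strategy='majority')  # keyword form; newer imblearn rejects the positional argument
X_tl, y_tl = tl.fit_resample(scaled_x, y)  # fit_sample was removed in newer imblearn releases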
y_tl.value_counts()

0    4694
1    1869
Name: Churn, dtype: int64
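
To make the definition in instruction 9 concrete, the following sketch finds Tomek links on a tiny hypothetical 1-D dataset by looking for mutual nearest neighbors with different labels, then drops only the majority-class member of each link. It is an illustration under those assumptions, not how `TomekLinks` is implemented internally; `nearest_neighbor` and `tomek_links` are made-up helper names.

```python
def nearest_neighbor(i, points):
    """Index of the point closest to points[i], excluding itself."""
    return min(
        (j for j in range(len(points)) if j != i),
        key=lambda j: abs(points[j] - points[i]),
    )

def tomek_links(points, labels):
    """Index pairs (i, j), i < j, that are mutual nearest neighbors
    and carry different labels."""
    links = []
    for i in range(len(points)):
        j = nearest_neighbor(i, points)
        if i < j and nearest_neighbor(j, points) == i and labels[i] != labels[j]:
            links.append((i, j))
    return links

# Hypothetical 1-D data: label 0 is the majority class, 1 the minority.
points = [0.0, 0.9, 2.0, 2.1, 5.0, 6.0]
labels = [0, 0, 0, 1, 1, 0]

links = tomek_links(points, labels)

# Drop only the majority-class member of each link.
to_remove = {idx for pair in links for idx in pair if labels[idx] == 0}
kept = [(p, l) for k, (p, l) in enumerate(zip(points, labels)) if k not in to_remove]
print(links, kept)
```

Only majority points that sit right next to a minority point are removed, which is why the class counts above change so little compared with SMOTE.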

In [20]:
model = LogisticRegression(random_state=0, solver='lbfgs',
                        multi_class='ovr').fit(X_tl, y_tl)

In [22]:
model.score(X_tl, y_tl)

0.7924729544415664

In [23]:
y_pred=model.predict(X_tl)
confusion_matrix(y_pred, y_tl)

array([[4223,  891],
       [ 471,  978]], dtype=int64)

In [24]:
model.score(scaled_x, y)

0.7799233281272185

In [25]:
y_pred=model.predict(scaled_x)
confusion_matrix(y_pred, y)

array([[4515,  891],
       [ 659,  978]], dtype=int64)
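
Accuracy alone hides how each model treats the churn class, so it can help to read precision and recall off the confusion matrices. The sketch below copies the two matrices evaluated on the original data and assumes rows are predicted labels and columns are true labels, which follows from the `confusion_matrix(y_pred, y)` argument order used in this notebook. `churn_precision_recall` is a hypothetical helper.

```python
def churn_precision_recall(matrix):
    """Precision and recall for class 1 from a 2x2 matrix.

    Assumes rows are predicted labels and columns are true labels.
    """
    false_pos = matrix[1][0]  # predicted churn, actually no churn
    true_pos = matrix[1][1]   # predicted churn, actually churn
    false_neg = matrix[0][1]  # predicted no churn, actually churn
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall

# Matrices copied from the outputs above (evaluated on the original data).
smote_matrix = [[3756, 488], [1418, 1381]]
tomek_matrix = [[4515, 891], [659, 978]]

results = {}
for name, matrix in [("SMOTE", smote_matrix), ("TomekLinks", tomek_matrix)]:
    results[name] = churn_precision_recall(matrix)
    precision, recall = results[name]
    print(f"{name}: precision={precision:.3f}, recall={recall:.3f}")
```

Read this way, the SMOTE model catches more churners (recall of about 0.74 versus 0.52) at the cost of precision (about 0.49 versus 0.60), even though its overall accuracy is lower.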