# Lab Imbalanced Data


Instructions

-    Load the dataset and explore the variables.
-    We will try to predict the variable Churn using a logistic regression on the variables tenure, SeniorCitizen and MonthlyCharges.
-    Extract the target variable.
-    Extract the independent variables and scale them.
-    Build the logistic regression model.
-    Evaluate the model.
-    Even a simple model will give us more than 70% accuracy. Why?
-    Synthetic Minority Oversampling TEchnique (SMOTE) is an oversampling technique based on nearest neighbors that adds new points between existing points. Apply imblearn.over_sampling.SMOTE to the dataset. Build and evaluate the logistic regression model. Is there any improvement?


## Load the dataset and explore the variables.

In [None]:
# Import time! 

import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

from imblearn.over_sampling import SMOTE

In [None]:
# Data loading time!

data = pd.read_csv('customer_churn.csv')

In [None]:
# Data exploring time!

data.head(60)

The data appears to be about phone/internet service. I had to do some googling to figure out what 'tenure' and 'churn' mean. As far as I can tell:
- Churn (Yes/No) indicates whether the customer discontinued their service.
- Tenure is the number of months a customer has been around. This checks out, because TotalCharges is roughly MonthlyCharges x tenure.
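
The tenure check can be sketched as follows. The rows are made-up toy values, not taken from `customer_churn.csv`, and storing `TotalCharges` as text is an assumption made here to show the conversion step; the column may already be numeric in your copy of the file.

```python
import pandas as pd

# Toy rows with made-up values; the real file has many more columns.
toy = pd.DataFrame({
    "tenure": [1, 12, 24],
    "MonthlyCharges": [30.0, 50.0, 80.0],
    "TotalCharges": ["30.0", "610.5", "1900.0"],  # text on purpose (assumption)
})

# Convert text to numbers; unparseable entries become NaN instead of raising.
total = pd.to_numeric(toy["TotalCharges"], errors="coerce")
expected = toy["MonthlyCharges"] * toy["tenure"]

# Relative gap between the recorded total and the simple product.
toy["rel_gap"] = (total - expected).abs() / expected
print(toy)
```

A small relative gap is expected rather than an exact match, since monthly prices can change over a customer's tenure.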



In [None]:
data.dtypes

In [None]:
data.describe()

Lots of categorical data; within the numerical data, SeniorCitizen is really a boolean flag (1/0) rather than a true numerical value.

In [None]:
data.isna().sum()

No NaN values, so that's nice.

Since the features I will work with are the numerical ones, I'll check those out more closely.

In [None]:
sns.displot(data.tenure, kde=True)

In [None]:
sns.displot(data.MonthlyCharges, kde=True)

There are no outliers that need to be taken care of. So that saves a few steps!

I'm also curious what the distributions of values in SeniorCitizen (1/0) and Churn (Yes/No) are.

In [None]:
data['SeniorCitizen'].value_counts()

In [None]:
data['Churn'].value_counts()

Finally, I'll check out the correlations between the three numerical values:

In [None]:
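# numeric_only=True skips the categorical columns, which recent pandas versions refuse to correlate
sns.heatmap(data.corr(numeric_only=True), annot=True)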

Ok, nothing super spectacular but I'll continue.

## We will try to predict the variable Churn using a logistic regression on the variables tenure, SeniorCitizen and MonthlyCharges.

In [None]:
data = data[['SeniorCitizen', 'tenure', 'MonthlyCharges', 'Churn']]

display(data.head())

display(data.shape)

## Extract the target variable.

In [None]:
y = data['Churn']

## Extract the independent variables and scale them.

In [None]:
X = data.select_dtypes(include=np.number)

### Time for a train-test split!

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)

I'm not 100% sure if I should also scale the SeniorCitizen column - but just thinking out loud: the MinMaxScaler uses 0 for the minimum and 1 for the maximum value, so it should not matter. I'll scale the whole enchilada and then check if I do indeed only get 1s and 0s for the SeniorCitizen column.

In [None]:
scaler = MinMaxScaler().fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

X_train = pd.DataFrame(data = X_train_scaled, columns = X.columns)
X_test = pd.DataFrame(data = X_test_scaled, columns = X.columns)

In [None]:
X_train['SeniorCitizen'].value_counts()

Ok, I was right and it encodes the ones and zeroes as ones and zeroes. 
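
Why this works can be sketched with the min-max formula itself, (x - min) / (max - min). The function below is a hand-written illustration, not scikit-learn's `MinMaxScaler`, and the tenure values are made up.

```python
import numpy as np

def min_max_scale(column):
    """Rescale a 1-D array to [0, 1] using (x - min) / (max - min)."""
    column = np.asarray(column, dtype=float)
    lo, hi = column.min(), column.max()
    return (column - lo) / (hi - lo)

senior = np.array([0, 1, 0, 0, 1])      # already binary
tenure = np.array([1, 12, 24, 72, 36])  # illustrative months

print(min_max_scale(senior))  # unchanged, because min is 0 and max is 1
print(min_max_scale(tenure))
```

This sketch assumes the column is not constant; a column with max equal to min would divide by zero.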

Next, check if I need to reset any indexes.

In [None]:
display(y_test.head())
display(X_test.head())

It looks like the index for X was reset, but not for y. So here goes:

In [None]:
y_test = y_test.reset_index(drop=True) 
y_train = y_train.reset_index(drop=True) 

## Build the logistic regression model.

In [None]:
regr = LogisticRegression(random_state=0, solver='lbfgs')

regr.fit(X_train, y_train)

## Evaluate the model.

In [None]:
display(regr.score(X_test, y_test))

In [None]:
pred = regr.predict(X_test)

In [None]:
display(regr.score(X_test, pred))

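Yikes. A 1.0 score seems fishy. I looked it up: `score(X, y)` returns the accuracy of `predict(X)` measured against `y`. Since I passed `pred` as `y`, the predictions are compared with themselves, so the result is always 1.0. The meaningful accuracy is `regr.score(X_test, y_test)` from the cell above.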

In [None]:
confusion_matrix(y_test,pred)

Unlike `regr.score(X_test, pred)`, the confusion matrix compares the predictions with the true labels in `y_test`, so the misclassifications show up here.

In [None]:
print("precision: ",precision_score(y_test,pred, pos_label = "Yes"))
print("recall: ",recall_score(y_test,pred, pos_label = "Yes"))
print("f1: ",f1_score(y_test,pred, pos_label = "Yes"))



In [None]:
pd.Series(pred).value_counts()

In [None]:
pd.Series(y_test).value_counts()

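Comparing the two counts shows how often the model predicts each class versus how often it actually occurs. The 1.0 above says nothing about that, since it only scored `pred` against itself.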
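
A minimal sketch of the difference between the two `score` calls, on made-up toy data in which identical inputs carry different labels, so no model can be perfect. The names `X_toy`, `y_toy` and `model` are illustrative and not part of the lab.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: x=1 and x=2 each appear with both labels, so errors are unavoidable.
X_toy = np.array([[0], [0], [1], [1], [2], [2], [3], [3]])
y_toy = np.array(["No", "No", "No", "Yes", "No", "Yes", "Yes", "Yes"])

model = LogisticRegression().fit(X_toy, y_toy)
pred_toy = model.predict(X_toy)

acc_vs_truth = model.score(X_toy, y_toy)    # predictions vs. true labels
acc_vs_self = model.score(X_toy, pred_toy)  # predictions vs. themselves
manual_acc = np.mean(pred_toy == y_toy)     # same quantity as acc_vs_truth

print(acc_vs_truth, acc_vs_self, manual_acc)
```

Scoring against the model's own predictions always gives 1.0, whatever the model quality, so only the first call is informative.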

## Even a simple model will give us more than 70% accuracy. Why?

Because the classes are imbalanced: even a model that only ever predicts "No" gets more than 70% accuracy.

In [None]:
yes = len(data[data['Churn'] == "Yes"])
no = len(data[data['Churn'] == "No"])
total = len(data)

print(no / total * 100)

So about 73% of the customers have Churn = 'No', which means that always predicting 'No' already yields roughly 73% accuracy.
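
The majority-class baseline can be sketched in a few lines. The label list is illustrative, built only to mimic the roughly 73/27 split reported above; it is not the dataset.

```python
from collections import Counter

# Illustrative labels with roughly the class balance reported above.
labels = ["No"] * 73 + ["Yes"] * 27

# A "model" that always predicts the most frequent class.
majority_label, majority_count = Counter(labels).most_common(1)[0]
predictions = [majority_label] * len(labels)

baseline_accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(labels)

# Recall for "Yes": how many of the true churners were found.
true_pos = sum(p == "Yes" and t == "Yes" for p, t in zip(predictions, labels))
yes_recall = true_pos / labels.count("Yes")

print(majority_label, baseline_accuracy, yes_recall)
```

The accuracy looks decent while the recall for "Yes" is zero, which is why accuracy alone is a poor yardstick on imbalanced data.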

## Synthetic Minority Oversampling TEchnique (SMOTE) is an oversampling technique based on nearest neighbors that adds new points between existing points. Apply imblearn.over_sampling.SMOTE to the dataset. Build and evaluate the logistic regression model. Is there any improvement?

In [None]:
sm = SMOTE(random_state=100, k_neighbors=3)
X_train_SMOTE,y_train_SMOTE = sm.fit_resample(X_train ,y_train)

In [None]:
X_train_SMOTE.shape
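
To illustrate what "adding new points between existing points" means, here is a simplified NumPy sketch of the interpolation idea. It is not imblearn's implementation; the function name `smote_like_sample`, its parameters and the toy points are all hypothetical.

```python
import numpy as np

def smote_like_sample(minority, k=3, n_new=5, seed=0):
    """Create synthetic points on segments between a minority point
    and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)          # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))      # pick a minority point
        j = rng.choice(neighbours[i])        # pick one of its k neighbours
        gap = rng.random()                   # position along the segment, in [0, 1)
        synthetic.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.array(synthetic)

minority_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_points = smote_like_sample(minority_toy, k=3, n_new=5)
print(new_points)
```

Every synthetic point lies on a line segment between two real minority points, so the new samples stay inside the region the minority class already occupies. The sketch needs more than `k` minority points to work.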

In [None]:
regr2 = LogisticRegression(random_state=666, solver='lbfgs')
regr2.fit(X_train_SMOTE, y_train_SMOTE)
pred2 = regr2.predict(X_test)

# print("precision: ",precision_score(y_test,pred))
# print("recall: ",recall_score(y_test,pred))
# print("f1: ",f1_score(y_test,pred))



In [None]:
display(regr2.score(X_test, y_test))

In [None]:
pred2 = regr2.predict(X_test)

In [None]:
display(regr2.score(X_test, pred2))

In [None]:
# Again 1.0, because regr2.score(X_test, pred2) compares the predictions with themselves.

In [None]:
display(confusion_matrix(y_test, pred2))

In [None]:
# And the previous model:
display(confusion_matrix(y_test, pred))
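
For reading the two matrices above: scikit-learn puts the true labels in rows and the predicted labels in columns, sorted by label. With "No" and "Yes" the layout is therefore [[TN, FP], [FN, TP]] when "Yes" is treated as the positive class. A small sketch with made-up labels (`y_true` and `y_hat` are illustrative names):

```python
from sklearn.metrics import confusion_matrix

# Made-up labels, only to show where each count lands.
y_true = ["No", "No", "No", "Yes", "Yes", "No", "Yes", "No"]
y_hat = ["No", "Yes", "No", "Yes", "No", "No", "Yes", "No"]

# Fixing the label order makes "Yes" the positive (second) class.
cm = confusion_matrix(y_true, y_hat, labels=["No", "Yes"])
tn, fp, fn, tp = cm.ravel()

print(cm)
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```

Here the top-right cell counts false positives (true "No" predicted as "Yes") and the bottom-left cell counts false negatives (true "Yes" predicted as "No").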

The new one is a trade-off: it predicts "Yes" more often, which means fewer false negatives (bottom-left cell) but more false positives (top-right cell). Let's compare the precision and recall:

In [None]:
print("precision: ",precision_score(y_test,pred2, pos_label = "Yes"))
print("recall: ",recall_score(y_test,pred2, pos_label = "Yes"))
print("f1: ",f1_score(y_test,pred2, pos_label = "Yes"))

In [None]:
# The old one again, for comparing them side by side. 
print("precision: ",precision_score(y_test,pred, pos_label = "Yes"))
print("recall: ",recall_score(y_test,pred, pos_label = "Yes"))
print("f1: ",f1_score(y_test,pred, pos_label = "Yes"))



So after SMOTE, the recall went up quite drastically, while the precision went down a bit. 
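
The trade-off follows from the definitions: precision = TP / (TP + FP) and recall = TP / (TP + FN). A small sketch with hypothetical counts (not the results of this notebook) shows how predicting "Yes" more often can raise recall while lowering precision:

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for the positive class from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts for 100 true churners; not the notebook's numbers.
before = precision_recall_f1(tp=40, fp=20, fn=60)  # cautious: few "Yes" predictions
after = precision_recall_f1(tp=75, fp=75, fn=25)   # rebalanced: more "Yes" predictions

print("before:", before)
print("after: ", after)
```

Whether the trade is worth it depends on the relative cost of missing a churner versus contacting a customer who would have stayed anyway.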