### Scenario
#### You are working as an analyst for an internet service provider. You have been given historical data about the company's customers and their churn trends. Your task is to build a machine learning model that helps the company identify customers who are more likely to default/churn, so that it can prevent losses from such customers.

### <span style="color:purple">For this lab, I would like to focus on the impact of various scaling methods on the oversampling/undersampling process, rather than on the oversampling/undersampling methods themselves or on F1/recall values (I have already covered those in the earlier data imbalance lab).</span>

### Instructions
#### In this lab, we will first take a look at the degree of imbalance in the data and then correct it using the techniques we learned in class.

#### 1. Import the required libraries and modules that you would need.

In [77]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.metrics import classification_report
import warnings
warnings.filterwarnings("ignore")

#### 2. Read the data into Python and call the dataframe churnData.

In [2]:
churnData= pd.read_csv('Customer-Churn.csv')

#### 3. Check the datatypes of all the columns in the data. You will see that the column TotalCharges is of object type. Convert this column to a numeric type using the pd.to_numeric function.

In [3]:
print(churnData.dtypes) #TotalCharges is listed as an object, which is inaccurate

gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
MonthlyCharges      float64
TotalCharges         object
Churn                object
dtype: object


In [4]:
churnData['TotalCharges']= pd.to_numeric(churnData['TotalCharges'], errors = 'coerce' )
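A minimal sketch of what `errors='coerce'` does: any entry that cannot be parsed as a number becomes `NaN` instead of raising an error, which would explain why nulls appear only after the conversion. The sample values are hypothetical, and the idea that the unparseable entries are blank strings is an assumption; the notebook does not show the raw values.

```python
import pandas as pd

# Hypothetical values; " " mimics an entry that cannot be parsed as a number
raw = pd.Series(["29.85", " ", "108.15"])

# errors="coerce" turns unparseable entries into NaN instead of raising
converted = pd.to_numeric(raw, errors="coerce")

print(converted.dtype)
print(converted.isnull().sum())
```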

In [5]:
print(churnData.dtypes) #To check if the data type conversion is successful

gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
MonthlyCharges      float64
TotalCharges        float64
Churn                object
dtype: object


#### 4. Check for null values in the dataframe. Replace the null values.

In [6]:
print(churnData.isnull().sum()) #TotalCharges has 11 nulls

gender               0
SeniorCitizen        0
Partner              0
Dependents           0
tenure               0
PhoneService         0
OnlineSecurity       0
OnlineBackup         0
DeviceProtection     0
TechSupport          0
StreamingTV          0
StreamingMovies      0
Contract             0
MonthlyCharges       0
TotalCharges        11
Churn                0
dtype: int64


In [7]:
churnData['TotalCharges'].describe() #dropping the 11 nulls would not affect the column much, as it still has 7032 non-null rows

count    7032.000000
mean     2283.300441
std      2266.771362
min        18.800000
25%       401.450000
50%      1397.475000
75%      3794.737500
max      8684.800000
Name: TotalCharges, dtype: float64

In [8]:
churnData.dropna(subset=['TotalCharges'], inplace=True) #dropping nulls from TotalCharges
print(churnData.isnull().sum())

gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64


#### 5. Use the following features: tenure, SeniorCitizen, MonthlyCharges and TotalCharges:
#### Scale the features either by using normalizer or a standard scaler.


In [9]:
features = ['tenure', 'SeniorCitizen', 'MonthlyCharges', 'TotalCharges'] #the four predictors named in the instructions
X= churnData[features]
y= churnData['Churn'] #As the question is regarding customer churn

scaler= MinMaxScaler() #scales the values of the variables so that they fall between 0 and 1
X_scaled= scaler.fit_transform(X)

#### Split the data into a training set and a test set.

In [10]:
X_train, X_test, y_train, y_test= train_test_split(X_scaled, y, test_size=0.2, random_state=42)
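Fitting the scaler on the full dataset before splitting lets the test rows influence the learned minimum and maximum. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that order on synthetic data; `X_demo`, `y_demo` and the other names are hypothetical stand-ins, not the churn data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for four numeric features and a binary target
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(200, 4))
y_demo = rng.integers(0, 2, size=200)

# Split first, so the test rows stay unseen during scaling
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

scaler_demo = MinMaxScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)  # learn min/max from training rows only
X_te_scaled = scaler_demo.transform(X_te)      # reuse those statistics on the test rows
```

With this order, scaled test values can fall slightly outside the 0 to 1 range, which is expected.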

#### Fit a logistic regression model on the training data.

In [11]:
logisticregression= LogisticRegression()
logisticregression.fit(X_train, y_train)

LogisticRegression()

In [12]:
predictions= logisticregression.predict(X_test)


#### Check the accuracy on the test data.

In [13]:
print('\nAccuracy: {:.2f}\n'.format(accuracy_score(y_test,predictions)))


Accuracy: 0.78



### Managing imbalance in the dataset

#### Check for the imbalance.

In [14]:
print(churnData['Churn'].value_counts()) # from the results below, the two classes are clearly imbalanced (roughly 73% No vs 27% Yes)

No     5163
Yes    1869
Name: Churn, dtype: int64
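To put the accuracy figures in context, the counts above can be turned into a majority-class baseline: the accuracy a model would get on the full dataset by always predicting "No". This is a small sketch using the printed counts; the test split may have a slightly different ratio.

```python
# Class counts copied from the value_counts() output above
counts = {"No": 5163, "Yes": 1869}
total = sum(counts.values())

minority_share = counts["Yes"] / total
majority_baseline = max(counts.values()) / total  # accuracy of always predicting "No"

print(f"Minority share: {minority_share:.3f}")                      # 0.266
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")  # 0.734
```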


#### Use the resampling strategies used in class for upsampling and downsampling to create a balance between the two classes. Each time, fit the model and check its accuracy.

In [15]:
def train_logistic_regression(X_train, y_train, X_test, y_test): #this function is meant to speed up the fitting process after train_test_split, if needed 
    logisticregression = LogisticRegression()
    logisticregression.fit(X_train, y_train) #fits a logreg model on the training data
    prediction = logisticregression.predict(X_test) #prediction = y_pred
    accuracy = accuracy_score(y_test, prediction) #calculates the accuracy score
    return logisticregression, accuracy #returns the trained model and the corresponding accuracy score
    

### SMOTE
#### SMOTE creates as many synthetic samples of the minority class as needed to balance the classes, by interpolating between existing minority samples and their nearest neighbours.
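The core step can be sketched in a few lines of NumPy. This is a simplified illustration and not imbalanced-learn's implementation; `minority` and `smote_like_sample` are hypothetical names. Because neighbours are chosen by distance, the way the features are scaled can change which neighbours are picked.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy minority-class points with two features
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]])

def smote_like_sample(points, k=2, rng=rng):
    """Create one synthetic point between a random sample and one of its k nearest neighbours."""
    i = rng.integers(len(points))
    dists = np.linalg.norm(points - points[i], axis=1)
    neighbours = np.argsort(dists)[1:k + 1]   # skip position 0, the point itself
    j = rng.choice(neighbours)
    gap = rng.random()                        # position along the segment, in [0, 1)
    return points[i] + gap * (points[j] - points[i])

synthetic = np.array([smote_like_sample(minority) for _ in range(5)])
print(synthetic)
```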

In [16]:
!pip install imbalanced-learn #install the package first so that the import below succeeds
from imblearn.over_sampling import SMOTE



##### I decided to look into each predictor in detail to see whether a different scaling method should be applied to each one.

In [18]:
churnData['tenure'].describe()  #Tenure has a large range of values, thus StandardScaler would be more apt

count    7032.000000
mean       32.421786
std        24.545260
min         1.000000
25%         9.000000
50%        29.000000
75%        55.000000
max        72.000000
Name: tenure, dtype: float64

In [19]:
churnData['SeniorCitizen'].describe() # Senior Citizen is a binary variable, thus no scaling is required

count    7032.000000
mean        0.162400
std         0.368844
min         0.000000
25%         0.000000
50%         0.000000
75%         0.000000
max         1.000000
Name: SeniorCitizen, dtype: float64

In [20]:
churnData['MonthlyCharges'].describe() # Monthly Charges has a specific range of values 

count    7032.000000
mean       64.798208
std        30.085974
min        18.250000
25%        35.587500
50%        70.350000
75%        89.862500
max       118.750000
Name: MonthlyCharges, dtype: float64

In [21]:
churnData['TotalCharges'].describe() # Total Charges has a large range of values, thus using Standard Scaler to standardize it would be more apt

count    7032.000000
mean     2283.300441
std      2266.771362
min        18.800000
25%       401.450000
50%      1397.475000
75%      3794.737500
max      8684.800000
Name: TotalCharges, dtype: float64
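For reference, the two scalers compared in this lab apply these formulas column by column: standardization computes `(x - mean) / std`, and min-max scaling computes `(x - min) / (max - min)`. The sketch below reproduces both with NumPy on a few illustrative values (the quartiles of `tenure` shown above, used here as a hypothetical sample). Both are linear rescalings, so they change a column's location and spread but not the shape of its distribution.

```python
import numpy as np

# Illustrative sample: the min, quartiles and max of tenure from the output above
tenure_demo = np.array([1.0, 9.0, 29.0, 55.0, 72.0])

# Standardization: zero mean, unit variance.
# StandardScaler uses the population std (ddof=0), which is NumPy's default;
# pandas' describe() reports the sample std (ddof=1) instead.
standardized = (tenure_demo - tenure_demo.mean()) / tenure_demo.std()

# Min-max scaling: smallest value maps to 0, largest to 1
min_max = (tenure_demo - tenure_demo.min()) / (tenure_demo.max() - tenure_demo.min())

print(standardized)
print(min_max)
```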

#### Scaling the Variables and applying SMOTE/TOMEKlinks

In [23]:
X= churnData[['tenure', 'SeniorCitizen', 'MonthlyCharges', 'TotalCharges']].copy() #defines the X feature matrix; .copy() avoids a SettingWithCopyWarning when its columns are reassigned

In [24]:
y= churnData['Churn'] # separating the target variable

In [25]:
scaler = StandardScaler() #applying the standardscaler to the intended variables
X[['tenure', 'TotalCharges']]= scaler.fit_transform(X[['tenure', 'TotalCharges']]) 

In [26]:
minmax_scaler= MinMaxScaler() #applying the minmaxscaler to the intended variables
X['MonthlyCharges']= minmax_scaler.fit_transform(X['MonthlyCharges'].values.reshape(-1,1))

#### SMOTE with combined scaling

In [27]:
smote= SMOTE()  #oversampling the minority class with SMOTE
X_sm, y_sm= smote.fit_resample(X,y)

In [28]:
y_sm.value_counts() #to check if the data is balanced by the SMOTE process

No     5163
Yes    5163
Name: Churn, dtype: int64

In [29]:
X_train, X_test, y_train, y_test = train_test_split(X_sm, y_sm, test_size=0.3, random_state=42)

In [30]:
logisticregression_model, accuracy = train_logistic_regression(X_train, y_train, X_test, y_test)

In [31]:
print("Accuracy (with combined scaling and SMOTE):", accuracy)

Accuracy (with combined scaling and SMOTE): 0.7333763718528082


### TOMEKlinks

#### Tomek links are pairs of instances from opposite classes that are each other's nearest neighbours. Removing the majority-class instance of each pair increases the space between the two classes, which makes classification easier.
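A toy one-dimensional example makes the definition concrete. This is a simplified NumPy sketch, not imbalanced-learn's implementation; `X_toy` and `y_toy` are hypothetical data. Since the pairs are found by distance, rescaling a feature can change which pairs qualify.

```python
import numpy as np

# Six points on a line, with class labels
X_toy = np.array([[0.0], [1.0], [1.2], [3.0], [3.1], [5.0]])
y_toy = np.array(["No", "No", "Yes", "Yes", "No", "No"])

# Pairwise distances, with the diagonal masked so a point is not its own neighbour
dist = np.abs(X_toy - X_toy.T)
np.fill_diagonal(dist, np.inf)
nearest = dist.argmin(axis=1)

# A Tomek link: two points that are each other's nearest neighbour and differ in class
tomek_pairs = [
    (i, int(j)) for i, j in enumerate(nearest)
    if nearest[j] == i and y_toy[i] != y_toy[j] and i < j
]

# Undersampling removes the majority-class ("No") member of each pair
majority_to_drop = sorted(i for pair in tomek_pairs for i in pair if y_toy[i] == "No")

print(tomek_pairs)        # [(1, 2), (3, 4)]
print(majority_to_drop)   # [1, 4]
```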

#### TOMEKlinks with combined scaling

In [34]:
from imblearn.under_sampling import TomekLinks

In [35]:
X_scaled = X

In [36]:
tomek_links = TomekLinks() #undersampling the majority class with TOMEKlinks
X_tomek, y_tomek = tomek_links.fit_resample(X_scaled, y)

In [38]:
X_train_tomek, X_test_tomek, y_train_tomek, y_test_tomek= train_test_split(X_tomek, y_tomek, test_size=0.2, random_state=42)

In [39]:
logisticregression_tomek= LogisticRegression() #decided not to use the earlier predefined function as it's easier to remember that this involves TOM
logisticregression_tomek.fit(X_train_tomek, y_train_tomek)

LogisticRegression()

In [41]:
y_pred_tomek= logisticregression_tomek.predict(X_test_tomek)
accuracy_tomek= accuracy_score(y_test_tomek, y_pred_tomek)
print("Accuracy (with combined scaling and TOMEKlinks)", accuracy_tomek)

Accuracy (with combined scaling and TOMEKlinks) 0.7730769230769231


#### Unfortunately, my combined scaling strategy resulted in a lower accuracy score than the initial model (0.78), with both SMOTE and TOMEKlinks. Hence I will repeat the process, applying a single scaling method to all X variables at a time, and then compare the accuracy scores of both approaches. However, it is interesting to note that the TOMEK model outperformed the SMOTE model.
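One caveat when comparing these scores: here the data is resampled before `train_test_split`, so each test set has a different class balance from the original data and the accuracies are not measured on the same footing. A sketch of the alternative order is shown below: split first, resample only the training rows, and evaluate on an untouched test set. It uses synthetic data and plain random oversampling via `sklearn.utils.resample` as a stand-in for SMOTE; all names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic imbalanced data: roughly a quarter of the rows are class 1
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 4))
y_demo = (X_demo[:, 0] + rng.normal(scale=1.0, size=1000) > 0.95).astype(int)

# Split first; stratify keeps the original class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Oversample the minority class in the training rows only
minority_mask = y_tr == 1
n_extra = int((~minority_mask).sum() - minority_mask.sum())
X_extra, y_extra = resample(
    X_tr[minority_mask], y_tr[minority_mask],
    replace=True, n_samples=n_extra, random_state=42,
)
X_tr_bal = np.vstack([X_tr, X_extra])
y_tr_bal = np.concatenate([y_tr, y_extra])

# The test set keeps the original imbalance, so scores stay comparable across methods
model = LogisticRegression().fit(X_tr_bal, y_tr_bal)
acc = accuracy_score(y_te, model.predict(X_te))
print(acc)
```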

#### Undoing the Scaling Process using inverse_transform

In [42]:
X[['tenure', 'TotalCharges']]=scaler.inverse_transform(X[['tenure', 'TotalCharges']]) #inverse transforming variables transformed using StandardScaler

In [45]:
X['MonthlyCharges']= minmax_scaler.inverse_transform(X['MonthlyCharges'].values.reshape(-1, 1)) #inverse transforming variable that was transformed using MinMaxScaler

#### Scaling the Variables using StandardScaler

In [46]:
X= churnData[['tenure', 'TotalCharges', 'SeniorCitizen', 'MonthlyCharges']].copy() #.copy() avoids a SettingWithCopyWarning when the columns are reassigned

In [47]:
y = churnData['Churn']

In [48]:
X[['tenure', 'TotalCharges', 'SeniorCitizen', 'MonthlyCharges']]= scaler.fit_transform(X[['tenure', 'TotalCharges', 'SeniorCitizen', 'MonthlyCharges']]) 

#### SMOTE with StandardScaler

In [49]:
smote= SMOTE() 
X_sm, y_sm= smote.fit_resample(X,y)

In [51]:
y_sm.value_counts() #to check if the data is balanced by the SMOTE process

No     5163
Yes    5163
Name: Churn, dtype: int64

In [50]:
X_train, X_test, y_train, y_test = train_test_split(X_sm, y_sm, test_size=0.3, random_state=42)

In [52]:
logisticregression_model, accuracy = train_logistic_regression(X_train, y_train, X_test, y_test)

In [53]:
print("Accuracy (with StandardScaler and SMOTE):", accuracy)

Accuracy (with StandardScaler and SMOTE): 0.7320852162685604


In [None]:
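classification_report_smote2 = classification_report(y_test, logisticregression_model.predict(X_test)) #'prediction' only exists inside train_logistic_regression, so predict again here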

#### TOMEKlinks with StandardScaler

In [54]:
X_scaled = X

In [55]:
tomek_links = TomekLinks() #undersampling the majority class with TOMEKlinks
X_tomek, y_tomek = tomek_links.fit_resample(X_scaled, y)

In [56]:
X_train_tomek, X_test_tomek, y_train_tomek, y_test_tomek= train_test_split(X_tomek, y_tomek, test_size=0.2, random_state=42)

In [57]:
logisticregression_tomek= LogisticRegression() #decided not to use the earlier predefined function as it's easier to remember that this involves TOM
logisticregression_tomek.fit(X_train_tomek, y_train_tomek)

LogisticRegression()

In [58]:
y_pred_tomek= logisticregression_tomek.predict(X_test_tomek)
accuracy_tomek= accuracy_score(y_test_tomek, y_pred_tomek)
print("Accuracy (with StandardScaler and TOMEKlinks)", accuracy_tomek)

Accuracy (with StandardScaler and TOMEKlinks) 0.7868098159509203


#### SMOTE with MinMaxScaler

In [59]:
X[['tenure', 'TotalCharges','SeniorCitizen', 'MonthlyCharges']]=scaler.inverse_transform(X[['tenure', 'TotalCharges', 'SeniorCitizen', 'MonthlyCharges']]) #inverse transforming variables transformed using StandardScaler

In [64]:
X['tenure'] = minmax_scaler.fit_transform(X['tenure'].values.reshape(-1, 1))
X['TotalCharges'] = minmax_scaler.fit_transform(X['TotalCharges'].values.reshape(-1, 1))
X['SeniorCitizen'] = minmax_scaler.fit_transform(X['SeniorCitizen'].values.reshape(-1, 1))
X['MonthlyCharges'] = minmax_scaler.fit_transform(X['MonthlyCharges'].values.reshape(-1, 1)) #mass transforming them didn't work, so I transformed them one after the other

In [65]:
smote= SMOTE()  #oversampling the minority class with SMOTE
X_sm, y_sm= smote.fit_resample(X,y)

In [66]:
y_sm.value_counts() #to check if the data is balanced by the SMOTE process

No     5163
Yes    5163
Name: Churn, dtype: int64

In [67]:
X_train, X_test, y_train, y_test = train_test_split(X_sm, y_sm, test_size=0.3, random_state=42)

In [68]:
logisticregression_model, accuracy = train_logistic_regression(X_train, y_train, X_test, y_test)

In [69]:
print("Accuracy (with MinMaxScaler and SMOTE):", accuracy)

Accuracy (with MinMaxScaler and SMOTE): 0.7304712717882504


#### TOMEKlinks with MinMaxScaler

In [70]:
X_scaled = X

In [71]:
tomek_links = TomekLinks() #undersampling the majority class with TOMEKlinks
X_tomek, y_tomek = tomek_links.fit_resample(X_scaled, y)

In [72]:
X_train_tomek, X_test_tomek, y_train_tomek, y_test_tomek= train_test_split(X_tomek, y_tomek, test_size=0.2, random_state=42)

In [73]:
logisticregression_tomek= LogisticRegression() #decided not to use the earlier predefined function as it's easier to remember that this involves TOM
logisticregression_tomek.fit(X_train_tomek, y_train_tomek)

LogisticRegression()

In [74]:
y_pred_tomek= logisticregression_tomek.predict(X_test_tomek)
accuracy_tomek= accuracy_score(y_test_tomek, y_pred_tomek)
print("Accuracy (with MinMaxScaler and TOMEKlinks)", accuracy_tomek)

Accuracy (with MinMaxScaler and TOMEKlinks) 0.7966231772831927


### Conclusion: Scaling the predictor variables with MinMaxScaler and then undersampling the majority class with TOMEKlinks seems to produce the model with the best accuracy score (about 0.797). More time would be needed to also look into the classification reports of each method/combination used so far, as well as manual oversampling/undersampling methods.
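As a pointer for that follow-up, the toy sketch below shows why per-class metrics matter on imbalanced data: a classifier that always predicts "No" reaches 80% accuracy here while finding none of the "Yes" cases. The labels are hypothetical and are not results from the churn models.

```python
from sklearn.metrics import accuracy_score, classification_report, recall_score

# Hypothetical labels: 8 non-churners, 2 churners
y_true = ["No"] * 8 + ["Yes"] * 2
y_pred_all_no = ["No"] * 10   # a model that never predicts churn

acc_all_no = accuracy_score(y_true, y_pred_all_no)
recall_yes = recall_score(y_true, y_pred_all_no, pos_label="Yes")

print(acc_all_no)   # 0.8
print(recall_yes)   # 0.0
# zero_division=0 suppresses the warning for the class that is never predicted
print(classification_report(y_true, y_pred_all_no, zero_division=0))
```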