### Scenario
#### You are working as an analyst for an internet service provider. You are provided with historical data about your company's customers and their churn trends. Your task is to build a machine learning model that will help the company identify customers that are more likely to default/churn and thus prevent losses from such customers.

### Instructions
#### In this lab, we will first take a look at the degree of imbalance in the data and correct it using the techniques we learned in class.

#### 1. Import the required libraries and modules that you would need.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.metrics import classification_report
import warnings
warnings.filterwarnings("ignore")

#### 2. Read the data into Python and call the dataframe churnData.

In [2]:
churnData= pd.read_csv('Customer-Churn.csv')

#### 3. Check the datatypes of all the columns in the data. You would see that the column TotalCharges is object type. Convert this column into numeric type using pd.to_numeric function.

In [3]:
print(churnData.dtypes) #TotalCharges is listed as an object, which is inaccurate

gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
MonthlyCharges      float64
TotalCharges         object
Churn                object
dtype: object


In [4]:
churnData['TotalCharges']= pd.to_numeric(churnData['TotalCharges'], errors = 'coerce' )
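
The 11 nulls that show up in the next check come from `errors='coerce'`: any entry that cannot be parsed as a number is turned into `NaN` instead of raising an error. A minimal sketch of that behaviour, using a hypothetical series in which a blank string stands in for an unparseable entry (the actual unparseable values in `Customer-Churn.csv` are not shown in this notebook):

```python
import pandas as pd

# Hypothetical values; the blank string stands in for an entry that is not a number.
raw = pd.Series(["29.85", "1889.5", " ", "108.15"], name="TotalCharges")

# errors="coerce" replaces anything unparseable with NaN instead of raising.
converted = pd.to_numeric(raw, errors="coerce")

print(converted.dtype)
print(converted.isna().sum())
```

Counting the `NaN` values right after the conversion, as done in step 4, shows how many rows were affected.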

In [5]:
print(churnData.dtypes) #To check if the data type conversion is successful

gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
MonthlyCharges      float64
TotalCharges        float64
Churn                object
dtype: object


#### 4. Check for null values in the dataframe. Replace the null values.

In [6]:
print(churnData.isnull().sum()) #TotalCharges has 11 nulls

gender               0
SeniorCitizen        0
Partner              0
Dependents           0
tenure               0
PhoneService         0
OnlineSecurity       0
OnlineBackup         0
DeviceProtection     0
TechSupport          0
StreamingTV          0
StreamingMovies      0
Contract             0
MonthlyCharges       0
TotalCharges        11
Churn                0
dtype: int64


In [7]:
churnData['TotalCharges'].describe() #dropping the 11 nulls would not affect the column much, as it still has 7032 non-null rows

count    7032.000000
mean     2283.300441
std      2266.771362
min        18.800000
25%       401.450000
50%      1397.475000
75%      3794.737500
max      8684.800000
Name: TotalCharges, dtype: float64

In [8]:
churnData.dropna(subset=['TotalCharges'], inplace=True) #dropping nulls from TotalCharges
print(churnData.isnull().sum())

gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64


#### 5. Use the following features: tenure, SeniorCitizen, MonthlyCharges and TotalCharges:
#### Scale the features either by using a normalizer or a standard scaler.


In [9]:
features = ['tenure', 'SeniorCitizen', 'MonthlyCharges', 'TotalCharges']
X= churnData[features]
y= churnData['Churn'] #As the question is regarding customer churn

scaler= MinMaxScaler() #scales the values of the variables so that they fall between 0 and 1
X_scaled= scaler.fit_transform(X)

#### Split the data into a training set and a test set.

In [10]:
X_train, X_test, y_train, y_test= train_test_split(X_scaled, y, test_size=0.2, random_state=42)
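
In the cells above, the scaler is fitted on all rows before the split, so the test rows contribute to the min and max the scaler learns. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that order on hypothetical random data; `X_demo`, `y_demo` and the other names are placeholders, not part of the notebook, and `stratify` is an optional extra that keeps the class ratio the same in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(200, 4))                    # hypothetical feature matrix
y_demo = rng.choice(["No", "Yes"], size=200, p=[0.73, 0.27])   # hypothetical labels

# Split first, so the test rows stay unseen during preprocessing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

scaler_demo = MinMaxScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)   # learn min/max from the training rows only
X_te_scaled = scaler_demo.transform(X_te)       # reuse those statistics on the test rows
```

With this order the scaled test values can fall slightly outside the 0 to 1 range, which is expected.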

#### Fit a logistic regression model on the training data.

In [11]:
logisticregression= LogisticRegression()
logisticregression.fit(X_train, y_train)

LogisticRegression()

In [12]:
predictions= logisticregression.predict(X_test)


#### Check the accuracy on the test data.

In [13]:
print('\nAccuracy: {:.2f}\n'.format(accuracy_score(y_test,predictions)))


Accuracy: 0.78



In [14]:
print(classification_report(y_test, predictions))  #Model is great at predicting No, but not as good at predicting Yes

              precision    recall  f1-score   support

          No       0.82      0.91      0.86      1033
         Yes       0.63      0.43      0.51       374

    accuracy                           0.78      1407
   macro avg       0.72      0.67      0.69      1407
weighted avg       0.76      0.78      0.77      1407
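
To put the 0.78 accuracy in context, it helps to compare it with a model that always predicts the majority class. The sketch below rebuilds only the class counts of the test set from the support column above (1033 No, 374 Yes); it does not use the real predictions.

```python
from sklearn.metrics import accuracy_score, recall_score

# Labels rebuilt from the support column of the report above.
y_true = ["No"] * 1033 + ["Yes"] * 374

# A "model" that never predicts churn.
always_no = ["No"] * len(y_true)

baseline_accuracy = accuracy_score(y_true, always_no)
baseline_recall_yes = recall_score(y_true, always_no, pos_label="Yes")

print(round(baseline_accuracy, 3))   # 0.734
print(baseline_recall_yes)           # 0.0
```

The always-No baseline already reaches about 0.73 accuracy while finding no churners at all, which is why recall for Yes is tracked alongside accuracy in the rest of the lab.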



### Managing imbalance in the dataset

#### Check for the imbalance.

In [15]:
print(churnData['Churn'].value_counts()) # from the results below, the classes are clearly imbalanced: roughly 73% No vs 27% Yes

No     5163
Yes    1869
Name: Churn, dtype: int64
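
The counts can also be expressed as shares with `value_counts(normalize=True)`. The sketch below rebuilds a series from the two counts printed above instead of reading the CSV, so `churn_demo` is a stand-in for `churnData['Churn']`.

```python
import pandas as pd

# Stand-in for churnData['Churn'], rebuilt from the counts printed above.
churn_demo = pd.Series(["No"] * 5163 + ["Yes"] * 1869, name="Churn")

shares = churn_demo.value_counts(normalize=True)   # class proportions
imbalance_ratio = 5163 / 1869                      # majority rows per minority row

print(shares.round(3))
print(round(imbalance_ratio, 2))
```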


#### Use the resampling strategies used in class for upsampling and downsampling to create a balance between the two classes. Each time, fit the model and check how its accuracy changes.

In [16]:
def train_logistic_regression(X_train, y_train, X_test, y_test): #this function is meant to speed up the fitting process after train_test_split, if needed 
    logisticregression = LogisticRegression()
    logisticregression.fit(X_train, y_train) #fits a logreg model on the training data
    predictions = logisticregression.predict(X_test) #prediction = y_pred
    accuracy = accuracy_score(y_test, predictions) #calculates the accuracy score
    return logisticregression, accuracy #returns the trained model and the corresponding accuracy score
    

### SMOTE
#### SMOTE creates as many synthetic samples of the minority class as needed in order to balance the classes.
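
The core idea behind a synthetic sample is interpolation: pick a minority point, pick one of its minority-class neighbours, and place a new point somewhere on the line between them. The sketch below is a simplified illustration of that step only. It always uses the single nearest neighbour, whereas imblearn's `SMOTE` chooses among several nearest neighbours; the points and the helper name `synthesize` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority-class points in a 2-feature space.
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0]])

def synthesize(points, rng):
    """Create one synthetic point between a random point and its nearest neighbour."""
    i = rng.integers(len(points))                       # pick a minority point
    distances = np.linalg.norm(points - points[i], axis=1)
    distances[i] = np.inf                               # ignore the point itself
    j = int(np.argmin(distances))                       # its nearest minority neighbour
    lam = rng.uniform(0.0, 1.0)                         # position along the segment
    return points[i] + lam * (points[j] - points[i])

new_point = synthesize(minority, rng)
print(new_point)
```

Because the new point lies on a segment between two existing minority points, it stays inside the region the minority class already occupies.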

In [17]:
!pip install imbalanced-learn
from imblearn.over_sampling import SMOTE



##### I've decided to look into each predictor in detail to see whether a different scaling method should be applied to it.

In [18]:
churnData['tenure'].describe()  #Tenure has a large range of values, thus StandardScaler would be more apt

count    7032.000000
mean       32.421786
std        24.545260
min         1.000000
25%         9.000000
50%        29.000000
75%        55.000000
max        72.000000
Name: tenure, dtype: float64

In [19]:
churnData['SeniorCitizen'].describe() # Senior Citizen is a binary variable, thus no scaling is required

count    7032.000000
mean        0.162400
std         0.368844
min         0.000000
25%         0.000000
50%         0.000000
75%         0.000000
max         1.000000
Name: SeniorCitizen, dtype: float64

In [20]:
churnData['MonthlyCharges'].describe() # Monthly Charges has a bounded, moderate range (18.25 to 118.75), so MinMaxScaler is a reasonable choice

count    7032.000000
mean       64.798208
std        30.085974
min        18.250000
25%        35.587500
50%        70.350000
75%        89.862500
max       118.750000
Name: MonthlyCharges, dtype: float64

In [21]:
churnData['TotalCharges'].describe() # Total Charges has a large range of values, thus using Standard Scaler to standardize it would be more apt

count    7032.000000
mean     2283.300441
std      2266.771362
min        18.800000
25%       401.450000
50%      1397.475000
75%      3794.737500
max      8684.800000
Name: TotalCharges, dtype: float64

#### Scaling the Variables using both StandardScaler and MinMaxScaler

In [22]:
X= churnData[['tenure', 'SeniorCitizen', 'MonthlyCharges', 'TotalCharges']].copy() #defines the X feature matrix; .copy() avoids assigning into a slice of churnData

In [23]:
y= churnData['Churn'] # separating the target variable

In [24]:
scaler = StandardScaler() #applying the standardscaler to the intended variables
X[['tenure', 'TotalCharges']]= scaler.fit_transform(X[['tenure', 'TotalCharges']]) 

In [25]:
minmax_scaler= MinMaxScaler() #applying the minmaxscaler to the intended variables
X['MonthlyCharges']= minmax_scaler.fit_transform(X['MonthlyCharges'].values.reshape(-1,1))

#### SMOTE with combined scaling

In [26]:
smote= SMOTE()  #oversampling the minority class with SMOTE
X_sm, y_sm= smote.fit_resample(X,y)

In [27]:
y_sm.value_counts() #to check if the data is balanced by the SMOTE process

No     5163
Yes    5163
Name: Churn, dtype: int64

In [28]:
X_train, X_test, y_train, y_test = train_test_split(X_sm, y_sm, test_size=0.3, random_state=42)

In [29]:
logisticregression_model, accuracy = train_logistic_regression(X_train, y_train, X_test, y_test)

In [30]:
y_pred = logisticregression_model.predict(X_test)

In [31]:
print("Accuracy (with combined scaling and SMOTE):", accuracy)

Accuracy (with combined scaling and SMOTE): 0.7265978050355067


In [32]:
report = classification_report(y_test, y_pred)
print(report) #decrease in precision in predicting No, decrease in overall accuracy but improvement in predicting Yes

              precision    recall  f1-score   support

          No       0.73      0.73      0.73      1563
         Yes       0.72      0.73      0.73      1535

    accuracy                           0.73      3098
   macro avg       0.73      0.73      0.73      3098
weighted avg       0.73      0.73      0.73      3098



### TOMEKlinks

#### Tomek links are pairs of very close instances, but of opposite classes. Removing the instances of the majority class of each pair increases the space between the two classes, facilitating the classification process.
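
To make the definition concrete: two points form a Tomek link when they belong to different classes and each is the other's nearest neighbour. The sketch below finds such pairs by brute force on a tiny hypothetical one-feature dataset and then drops the majority-class member of each pair. The helper `tomek_link_pairs` is written for illustration and is not imblearn's implementation.

```python
import numpy as np

# Hypothetical one-feature dataset; "No" is the majority class.
X_demo = np.array([[0.0], [1.0], [1.2], [3.0], [3.1]])
y_demo = np.array(["No", "No", "Yes", "Yes", "No"])

def tomek_link_pairs(X, y):
    """Return index pairs that are mutual nearest neighbours with different labels."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)          # a point is not its own neighbour
    nearest = dists.argmin(axis=1)           # nearest neighbour of every point
    pairs = []
    for i, j in enumerate(nearest):
        if nearest[j] == i and y[i] != y[j] and i < j:
            pairs.append((i, int(j)))
    return pairs

pairs = tomek_link_pairs(X_demo, y_demo)
print(pairs)   # [(1, 2), (3, 4)]

# Drop only the majority-class member of each link.
drop = [i for pair in pairs for i in pair if y_demo[i] == "No"]
keep = np.setdiff1d(np.arange(len(y_demo)), drop)
print(keep)    # [0 2 3]
```

Only points that sit right at the class boundary are removed, which is why Tomek links shrink the majority class only slightly.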

#### TOMEKlinks with combined scaling

In [33]:
from imblearn.under_sampling import TomekLinks

In [34]:
X_scaled = X

In [35]:
tomek_links = TomekLinks() #undersampling the majority class with TOMEKlinks
X_tomek, y_tomek = tomek_links.fit_resample(X_scaled, y)

In [36]:
X_train_tomek, X_test_tomek, y_train_tomek, y_test_tomek= train_test_split(X_tomek, y_tomek, test_size=0.2, random_state=42)

In [37]:
logisticregression_tomek= LogisticRegression() #decided not to use the earlier predefined function as it's easier to remember that this involves TOM
logisticregression_tomek.fit(X_train_tomek, y_train_tomek)

LogisticRegression()

In [38]:
y_pred_tomek= logisticregression_tomek.predict(X_test_tomek)
accuracy_tomek= accuracy_score(y_test_tomek, y_pred_tomek)
print("Accuracy (with combined scaling and TOMEKlinks)", accuracy_tomek)

Accuracy (with combined scaling and TOMEKlinks) 0.7730769230769231


In [39]:
report_tomek = classification_report(y_test_tomek, y_pred_tomek)
print(report_tomek) #Minor decrease in precision for No, minor improvement in predicting Yes, accuracy roughly unchanged (0.77 vs 0.78)

              precision    recall  f1-score   support

          No       0.80      0.89      0.84       898
         Yes       0.68      0.51      0.58       402

    accuracy                           0.77      1300
   macro avg       0.74      0.70      0.71      1300
weighted avg       0.76      0.77      0.76      1300



#### Unfortunately, my combined scaling strategy resulted in a lower accuracy score than before, both with SMOTE and with TOMEKlinks. Hence I will repeat the process, applying only one scaling method to all X variables at a time, and then compare the accuracy scores of both approaches. It is interesting to note that the TOMEKlinks model reached a higher accuracy than the SMOTE model.

#### Undoing the Scaling Process using inverse_transform

In [40]:
X[['tenure', 'TotalCharges']]=scaler.inverse_transform(X[['tenure', 'TotalCharges']]) #inverse transforming variables transformed using StandardScaler

In [41]:
X['MonthlyCharges']= minmax_scaler.inverse_transform(X['MonthlyCharges'].values.reshape(-1, 1)) #inverse transforming variable that was transformed using MinMaxScaler

#### Scaling the Variables using StandardScaler

In [42]:
X= churnData[['tenure', 'TotalCharges', 'SeniorCitizen', 'MonthlyCharges']].copy() #fresh, unscaled copy of the features

In [43]:
y = churnData['Churn']

In [44]:
X[['tenure', 'TotalCharges', 'SeniorCitizen', 'MonthlyCharges']]= scaler.fit_transform(X[['tenure', 'TotalCharges', 'SeniorCitizen', 'MonthlyCharges']]) 

#### SMOTE with StandardScaler

In [45]:
smote= SMOTE() 
X_sm, y_sm= smote.fit_resample(X,y)

In [46]:
y_sm.value_counts() #to check if the data is balanced by the SMOTE process

No     5163
Yes    5163
Name: Churn, dtype: int64

In [47]:
X_train, X_test, y_train, y_test = train_test_split(X_sm, y_sm, test_size=0.3, random_state=42)

In [48]:
logisticregression_model, accuracy = train_logistic_regression(X_train, y_train, X_test, y_test)

In [49]:
y_pred = logisticregression_model.predict(X_test)

In [50]:
print("Accuracy (with StandardScaler and SMOTE):", accuracy)

Accuracy (with StandardScaler and SMOTE): 0.7224015493867011


In [51]:
report = classification_report(y_test, y_pred)
print(report) #Decrease in overall accuracy, decrease in precision in predicting No but improvement in predicting Yes

              precision    recall  f1-score   support

          No       0.72      0.72      0.72      1563
         Yes       0.72      0.72      0.72      1535

    accuracy                           0.72      3098
   macro avg       0.72      0.72      0.72      3098
weighted avg       0.72      0.72      0.72      3098



#### TOMEKlinks with StandardScaler

In [52]:
X_scaled = X

In [53]:
tomek_links = TomekLinks() #undersampling the majority class with TOMEKlinks
X_tomek, y_tomek = tomek_links.fit_resample(X_scaled, y)

In [54]:
X_train_tomek, X_test_tomek, y_train_tomek, y_test_tomek= train_test_split(X_tomek, y_tomek, test_size=0.2, random_state=42)

In [55]:
logisticregression_tomek= LogisticRegression() #decided not to use the earlier predefined function as it's easier to remember that this involves TOM
logisticregression_tomek.fit(X_train_tomek, y_train_tomek)

LogisticRegression()

In [56]:
y_pred_tomek= logisticregression_tomek.predict(X_test_tomek)
accuracy_tomek= accuracy_score(y_test_tomek, y_pred_tomek)
print("Accuracy (with StandardScaler and TOMEKlinks)", accuracy_tomek)

Accuracy (with StandardScaler and TOMEKlinks) 0.7868098159509203


In [57]:
report_tomek = classification_report(y_test_tomek, y_pred_tomek)
print(report_tomek)  #Minor improvement in predicting Yes, minor increase in overall accuracy, but recall for Yes is still poor (0.52)

              precision    recall  f1-score   support

          No       0.83      0.89      0.86       938
         Yes       0.65      0.52      0.58       366

    accuracy                           0.79      1304
   macro avg       0.74      0.70      0.72      1304
weighted avg       0.78      0.79      0.78      1304



#### SMOTE with MinMaxScaler

In [58]:
X[['tenure', 'TotalCharges','SeniorCitizen', 'MonthlyCharges']]=scaler.inverse_transform(X[['tenure', 'TotalCharges', 'SeniorCitizen', 'MonthlyCharges']]) #inverse transforming variables transformed using StandardScaler

In [59]:
X['tenure'] = minmax_scaler.fit_transform(X['tenure'].values.reshape(-1, 1))
X['TotalCharges'] = minmax_scaler.fit_transform(X['TotalCharges'].values.reshape(-1, 1))
X['SeniorCitizen'] = minmax_scaler.fit_transform(X['SeniorCitizen'].values.reshape(-1, 1))
X['MonthlyCharges'] = minmax_scaler.fit_transform(X['MonthlyCharges'].values.reshape(-1, 1)) #mass transforming them didn't work, so I transformed them one after the other
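
`MinMaxScaler` scales each column independently, so transforming all four columns in one call should give the same result as the four separate calls above. A minimal sketch on a hypothetical three-row frame (the values are the min, median and max from the `describe()` outputs earlier; `X_demo` and `cols` are placeholder names):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

cols = ["tenure", "TotalCharges", "SeniorCitizen", "MonthlyCharges"]

# Hypothetical frame built from the min / median / max values shown earlier.
X_demo = pd.DataFrame({
    "tenure": [1, 29, 72],
    "TotalCharges": [18.8, 1397.475, 8684.8],
    "SeniorCitizen": [0, 0, 1],
    "MonthlyCharges": [18.25, 70.35, 118.75],
}).astype(float)

# One call scales every listed column to the 0-1 range, column by column.
X_demo[cols] = MinMaxScaler().fit_transform(X_demo[cols])
print(X_demo)
```

Working on a `.copy()` of the selected columns, and passing a two-dimensional selection such as `X_demo[cols]` to the scaler, are the two details that most often matter here.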

In [60]:
smote= SMOTE()  #oversampling the minority class with SMOTE
X_sm, y_sm= smote.fit_resample(X,y)

In [61]:
y_sm.value_counts() #to check if the data is balanced by the SMOTE process

No     5163
Yes    5163
Name: Churn, dtype: int64

In [62]:
X_train, X_test, y_train, y_test = train_test_split(X_sm, y_sm, test_size=0.3, random_state=42)

In [63]:
logisticregression_model, accuracy = train_logistic_regression(X_train, y_train, X_test, y_test)

In [64]:
y_pred = logisticregression_model.predict(X_test)

In [65]:
print("Accuracy (with MinMaxScaler and SMOTE):", accuracy)

Accuracy (with MinMaxScaler and SMOTE): 0.724661071659135


In [66]:
report = classification_report(y_test, y_pred)
print(report) #Decrease in precision in predicting No, improvement in predicting yes, reduced overall accuracy

              precision    recall  f1-score   support

          No       0.73      0.72      0.72      1563
         Yes       0.72      0.73      0.72      1535

    accuracy                           0.72      3098
   macro avg       0.72      0.72      0.72      3098
weighted avg       0.72      0.72      0.72      3098



#### TOMEKlinks with MinMaxScaler

In [67]:
X_scaled = X

In [68]:
tomek_links = TomekLinks() #undersampling the majority class with TOMEKlinks
X_tomek, y_tomek = tomek_links.fit_resample(X_scaled, y)

In [69]:
X_train_tomek, X_test_tomek, y_train_tomek, y_test_tomek= train_test_split(X_tomek, y_tomek, test_size=0.2, random_state=42)

In [70]:
logisticregression_tomek= LogisticRegression() #decided not to use the earlier predefined function as it's easier to remember that this involves TOM
logisticregression_tomek.fit(X_train_tomek, y_train_tomek)

LogisticRegression()

In [71]:
y_pred_tomek= logisticregression_tomek.predict(X_test_tomek)
accuracy_tomek= accuracy_score(y_test_tomek, y_pred_tomek)
print("Accuracy (with MinMaxScaler and TOMEKlinks)", accuracy_tomek)

Accuracy (with MinMaxScaler and TOMEKlinks) 0.7966231772831927


In [72]:
report_tomek = classification_report(y_test_tomek, y_pred_tomek)
print(report_tomek) #Minor decrease in the precision in predicting No, improvement in predicting Yes, minor increase in overall accuracy, but do note the poor recall score for Yes


              precision    recall  f1-score   support

          No       0.82      0.91      0.86       926
         Yes       0.70      0.52      0.60       377

    accuracy                           0.80      1303
   macro avg       0.76      0.72      0.73      1303
weighted avg       0.79      0.80      0.79      1303



#### Observation: Scaling the predictive variables with MinMaxScaler and then undersampling the majority class with TOMEKlinks produces the highest accuracy so far (0.80). With TOMEKlinks, MinMaxScaler beats both StandardScaler (0.79) and the combined scaling (0.77); with SMOTE, the three scaling choices give nearly the same accuracy (0.72 to 0.73).
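
For easier comparison, the scores reported so far can be collected in one table. The numbers below are copied from the outputs above (accuracy rounded to three decimals, recall for Yes as printed in each report); the frame and its column names are just one possible layout. Keep in mind that the SMOTE rows were scored on a resampled, balanced test set, so they are not strictly comparable with the other rows.

```python
import pandas as pd

results = pd.DataFrame(
    [
        ("MinMaxScaler",   "none",       0.780, 0.43),
        ("combined",       "SMOTE",      0.727, 0.73),
        ("combined",       "TomekLinks", 0.773, 0.51),
        ("StandardScaler", "SMOTE",      0.722, 0.72),
        ("StandardScaler", "TomekLinks", 0.787, 0.52),
        ("MinMaxScaler",   "SMOTE",      0.725, 0.73),
        ("MinMaxScaler",   "TomekLinks", 0.797, 0.52),
    ],
    columns=["scaling", "resampling", "accuracy", "recall_yes"],
)

# Sort by accuracy to see which combination ranks first.
print(results.sort_values("accuracy", ascending=False).to_string(index=False))
```

Sorting by `recall_yes` instead puts the SMOTE rows on top, which shows the trade-off between overall accuracy and catching churners.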

### Combining oversampling (with SMOTE) and undersampling (with TOMEKlinks) + MinMaxScaler

In [78]:
tomek = TomekLinks()
X_sm_tomek, y_sm_tomek = tomek.fit_resample(X_sm, y_sm)

In [79]:
X_train, X_test, y_train, y_test = train_test_split(X_sm_tomek, y_sm_tomek, test_size=0.3, random_state=42)

In [80]:
logisticregression_model = LogisticRegression()
logisticregression_model.fit(X_train, y_train)
y_pred = logisticregression_model.predict(X_test)

In [81]:
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of combined SMOTE and TomekLinks", accuracy)

Accuracy of combined SMOTE and TomekLinks 0.742327150084317


In [82]:
report = classification_report(y_test, y_pred)
print(report) #Reduction in precision in predicting No, improvement in predicting Yes, reduced overall accuracy, improved recall score

              precision    recall  f1-score   support

          No       0.76      0.74      0.75      1571
         Yes       0.72      0.74      0.73      1394

    accuracy                           0.74      2965
   macro avg       0.74      0.74      0.74      2965
weighted avg       0.74      0.74      0.74      2965



#### Performing TomekLinks one more time to remove borderline observations

In [87]:
X_tomek_twice, y_tomek_twice = tomek.fit_resample(X_sm_tomek, y_sm_tomek)

In [88]:
X_train_twice, X_test_twice, y_train_twice, y_test_twice = train_test_split(X_tomek_twice, y_tomek_twice, test_size=0.3, random_state=42)

In [89]:
logisticregression_model_twice = LogisticRegression()
logisticregression_model_twice.fit(X_train_twice, y_train_twice)
y_pred_twice = logisticregression_model_twice.predict(X_test_twice)

In [90]:
accuracy_twice = accuracy_score(y_test_twice, y_pred_twice)
print("Accuracy (after applying TOMEK links twice):", accuracy_twice)

Accuracy (after applying TOMEK links twice): 0.7297205180640763


In [91]:
report_twice = classification_report(y_test_twice, y_pred_twice)
print(report_twice) #reduction in No precision, improvement in predicting Yes, reduction in overall accuracy

              precision    recall  f1-score   support

          No       0.74      0.73      0.74      1521
         Yes       0.71      0.73      0.72      1413

    accuracy                           0.73      2934
   macro avg       0.73      0.73      0.73      2934
weighted avg       0.73      0.73      0.73      2934



In [None]:
## Conclusion: Weighing precision for No and Yes, recall, and accuracy, SMOTE followed by TomekLinks on MinMaxScaler-scaled features gives the best overall improvement and the most balanced trade-off between No and Yes (accuracy 0.74, recall 0.74 for both classes).