# Lab | Cross Validation

## Instructions
### Apply SMOTE for upsampling the data

- Use logistic regression to fit the model and compute the accuracy of the model.
- Use decision tree classifier to fit the model and compute the accuracy of the model.
- Compare the accuracies of the two models.

### Apply TomekLinks for downsampling
Keep in mind that TomekLinks does not make the two classes equal in size. It only removes the majority-class points that lie closest to minority-class points (the majority-class member of each Tomek link).

- Use logistic regression to fit the model and compute the accuracy of the model.
- Use decision tree classifier to fit the model and compute the accuracy of the model.
- Compare the accuracies of the two models.

You can also apply this algorithm a second time and check how the imbalance between the two classes has changed since the first pass.
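
To make the TomekLinks behaviour described above concrete, here is a minimal NumPy sketch. It is not the imbalanced-learn implementation: it assumes a binary problem, treats a Tomek link as a pair of points from different classes that are each other's nearest neighbour, and drops the majority-class member of each link. The function name `remove_tomek_links` and the toy data are illustrative assumptions.

```python
import numpy as np

def remove_tomek_links(X, y, majority_label=0):
    """Drop majority-class points that belong to a Tomek link (illustrative sketch)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise Euclidean distances; a point must not be its own neighbour.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    nn = dist.argmin(axis=1)          # index of each point's nearest neighbour
    idx = np.arange(len(X))
    mutual = nn[nn] == idx            # i and nn[i] are each other's nearest neighbour
    opposite = y != y[nn]             # and they carry different labels
    drop = mutual & opposite & (y == majority_label)
    return X[~drop], y[~drop]

# Toy 1-D data: seven majority points (label 0) and two minority points (label 1).
X = np.array([[0.0], [1.2], [2.0], [2.9], [3.0], [3.3], [5.0], [5.2], [9.0]])
y = np.array([0, 0, 0, 0, 1, 0, 0, 1, 0])

X1, y1 = remove_tomek_links(X, y)     # first pass
X2, y2 = remove_tomek_links(X1, y1)   # second pass on the cleaned data

for name, labels in [("original", y), ("pass 1", y1), ("pass 2", y2)]:
    print(name, np.bincount(labels))
```

On this toy data the class counts go from `[7 2]` to `[5 2]` after the first pass and `[4 2]` after the second: removing one majority point can expose a new Tomek link, but the classes still do not end up equal.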

In [85]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Transformation and modelling
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Metrics
from sklearn.metrics import (
    confusion_matrix, cohen_kappa_score, accuracy_score, precision_score,
    recall_score, f1_score, roc_auc_score, roc_curve, classification_report,
)

### 2. Load Dataset

> Since we saw in the previous lab that the standardized dataset yielded slightly better results in general, we will proceed with this transformation. These datasets have been taken from the previous lab.

#### SMOTE

In [86]:
X_sm = pd.read_csv('x_smote.csv')
X_sm

Unnamed: 0,tenure,seniorcitizen,monthlycharges,totalcharges
0,0.013889,0.0,0.115423,0.001275
1,0.472222,0.0,0.385075,0.215867
2,0.027778,0.0,0.354229,0.010310
3,0.625000,0.0,0.239303,0.210241
4,0.027778,0.0,0.521891,0.015330
...,...,...,...,...
10343,0.043766,1.0,0.268535,0.012653
10344,0.445696,0.0,0.458261,0.239726
10345,0.027778,0.0,0.672666,0.017805
10346,0.208333,1.0,0.561180,0.124129


In [87]:
y_sm = pd.read_csv('y_smote.csv')
y_sm

Unnamed: 0,churn
0,0
1,0
2,1
3,0
4,1
...,...
10343,1
10344,1
10345,1
10346,1


#### TomekLinks

In [88]:
X_tl = pd.read_csv('x_tl.csv')
X_tl

Unnamed: 0,tenure,seniorcitizen,monthlycharges,totalcharges
0,0.013889,0.0,0.115423,0.001275
1,0.472222,0.0,0.385075,0.215867
2,0.027778,0.0,0.354229,0.010310
3,0.625000,0.0,0.239303,0.210241
4,0.027778,0.0,0.521891,0.015330
...,...,...,...,...
6515,1.000000,0.0,0.028856,0.161620
6516,0.333333,0.0,0.662189,0.227521
6517,0.152778,0.0,0.112935,0.037809
6518,0.055556,1.0,0.558706,0.033210


In [89]:
y_tl = pd.read_csv('y_tl.csv')
y_tl

Unnamed: 0,churn
0,0
1,0
2,1
3,0
4,1
...,...
6515,0
6516,0
6517,0
6518,1


In [90]:
# Models (fixed random_state for reproducibility)
logreg = LogisticRegression(random_state=0)
dt_gini = DecisionTreeClassifier(criterion='gini', random_state=0)
dt_ent = DecisionTreeClassifier(criterion='entropy', random_state=0)

### Standardized SMOTE Dataset

In [91]:
# Train-test split on the standardized SMOTE dataset (default test_size=0.25)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(X_sm, y_sm.churn, random_state=1)

In [92]:
# Logistic Regression
logreg.fit(Xs_train, ys_train)
log_sm_predict = logreg.predict(Xs_test)

In [93]:
log_sm_metrics = classification_report(ys_test, log_sm_predict)
print("Classification report for logistic regression:\n", log_sm_metrics)

Classification report for logistic regression:
               precision    recall  f1-score   support

           0       0.77      0.73      0.75      1330
           1       0.73      0.77      0.75      1257

    accuracy                           0.75      2587
   macro avg       0.75      0.75      0.75      2587
weighted avg       0.75      0.75      0.75      2587



In [94]:
# Decision Tree, gini
dt_gini.fit(Xs_train, ys_train)
dtg_sm_predict = dt_gini.predict(Xs_test)

In [95]:
dtg_sm_metrics = classification_report(ys_test, dtg_sm_predict)
print("Classification report for decision tree (gini):\n", dtg_sm_metrics)

Classification report for decision tree (gini):
               precision    recall  f1-score   support

           0       0.79      0.76      0.77      1330
           1       0.75      0.79      0.77      1257

    accuracy                           0.77      2587
   macro avg       0.77      0.77      0.77      2587
weighted avg       0.77      0.77      0.77      2587



In [96]:
# Decision Tree, entropy
dt_ent.fit(Xs_train, ys_train)
dte_sm_predict = dt_ent.predict(Xs_test)

In [97]:
dte_sm_metrics = classification_report(ys_test, dte_sm_predict)
print("Classification report for decision tree (entropy):\n", dte_sm_metrics)

Classification report for decision tree (entropy):
               precision    recall  f1-score   support

           0       0.79      0.75      0.76      1330
           1       0.74      0.79      0.76      1257

    accuracy                           0.76      2587
   macro avg       0.77      0.77      0.76      2587
weighted avg       0.77      0.76      0.76      2587



> **Observations:**
> - Using the SMOTE dataset, the decision tree with the Gini criterion has the highest accuracy (77%), slightly ahead of the decision tree with the entropy criterion (76%) and logistic regression (75%).
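
The accuracies above come from a single train/test split, so gaps of one or two percentage points may not be meaningful. The following is a hedged sketch of how k-fold cross-validation could be used to compare the same three model types. It runs on synthetic data from `make_classification` as a stand-in for `X_sm` and `y_sm.churn` (an assumption, made so the snippet is self-contained), so its scores are not the lab's results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the resampled churn data (assumption, not the lab's CSVs).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=0, random_state=0
)

models = {
    "logistic regression": LogisticRegression(random_state=0),
    "decision tree (gini)": DecisionTreeClassifier(criterion="gini", random_state=0),
    "decision tree (entropy)": DecisionTreeClassifier(criterion="entropy", random_state=0),
}

# Stratified folds keep the class ratio roughly constant in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

cv_scores = {}
for name, model in models.items():
    # cross_val_score clones the model and refits it on each training fold.
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
    cv_scores[name] = scores
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

Comparing the mean and spread across folds gives a steadier picture than one split. Note that when data has been oversampled before splitting, as with a saved SMOTE dataset, synthetic points can end up in the validation folds, so the scores may be optimistic.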

### Standardized TomekLinks Dataset

In [98]:
# Train-test split on the standardized TomekLinks dataset (default test_size=0.25)
X_train, X_test, y_train, y_test = train_test_split(X_tl, y_tl.churn, random_state=1)

In [99]:
# Logistic Regression
logreg.fit(X_train, y_train)
log_tl_predict = logreg.predict(X_test)

In [100]:
log_tl_metrics = classification_report(y_test, log_tl_predict)
print("Classification report for logistic regression:\n", log_tl_metrics)

Classification report for logistic regression:
               precision    recall  f1-score   support

           0       0.82      0.91      0.86      1161
           1       0.69      0.51      0.59       469

    accuracy                           0.79      1630
   macro avg       0.76      0.71      0.73      1630
weighted avg       0.78      0.79      0.78      1630



In [101]:
# Decision Tree, gini
dt_gini.fit(X_train, y_train)
dtg_tl_predict = dt_gini.predict(X_test)

In [102]:
dtg_tl_metrics = classification_report(y_test, dtg_tl_predict)
print("Classification report for decision tree (gini):\n", dtg_tl_metrics)

Classification report for decision tree (gini):
               precision    recall  f1-score   support

           0       0.82      0.84      0.83      1161
           1       0.59      0.55      0.57       469

    accuracy                           0.76      1630
   macro avg       0.71      0.70      0.70      1630
weighted avg       0.76      0.76      0.76      1630



In [103]:
# Decision Tree, entropy
dt_ent.fit(X_train, y_train)
dte_tl_predict = dt_ent.predict(X_test)

In [104]:
dte_tl_metrics = classification_report(y_test, dte_tl_predict)
print("Classification report for decision tree (entropy):\n", dte_tl_metrics)

Classification report for decision tree (entropy):
               precision    recall  f1-score   support

           0       0.82      0.83      0.83      1161
           1       0.57      0.54      0.55       469

    accuracy                           0.75      1630
   macro avg       0.69      0.68      0.69      1630
weighted avg       0.74      0.75      0.75      1630



> **Observations:**
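> - Using the TomekLinks dataset, logistic regression has the highest accuracy (79%), ahead of the decision trees with the Gini (76%) and entropy (75%) criteria.
> - The test split remains imbalanced (1,161 non-churn vs. 469 churn samples), and recall on the churn class is only 0.51 to 0.55 across the three models, so accuracy alone overstates how well churners are identified.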
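
As a sanity check on the TomekLinks reports above, overall accuracy can be recovered from the per-class recall and support, and compared with the accuracy of always predicting the majority class. The sketch below uses only the rounded figures printed in those reports, so the results are approximate; the helper name `accuracy_from_recall` is illustrative.

```python
def accuracy_from_recall(recalls, supports):
    """Accuracy = correctly classified samples / all samples."""
    correct = sum(r * n for r, n in zip(recalls, supports))
    return correct / sum(supports)

# Support of class 0 and class 1 in the TomekLinks test split.
supports = (1161, 469)

# Per-class recall (class 0, class 1) as printed in the reports.
reports = {
    "logistic regression": (0.91, 0.51),
    "decision tree (gini)": (0.84, 0.55),
    "decision tree (entropy)": (0.83, 0.54),
}

# Accuracy of a trivial model that always predicts the majority class.
baseline = max(supports) / sum(supports)
print(f"majority-class baseline: {baseline:.2f}")

recovered = {}
for name, recalls in reports.items():
    recovered[name] = accuracy_from_recall(recalls, supports)
    print(f"{name}: {recovered[name]:.2f}")
```

The recovered values (0.79, 0.76, 0.75) match the reported accuracies, while the majority-class baseline is already about 0.71, which puts the models' gains in perspective.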