# **What is Churn Prediction?**

Churn prediction is the analytical study of how likely a customer is to abandon a product or service. The goal is to understand why customers leave and to take steps to change the outcome before the customer gives up the product or service.

## **About Data**

#### customerID : Customer ID
#### gender : Whether the customer is a male or a female
#### SeniorCitizen : Whether the customer is a senior citizen or not (1, 0)
#### Partner : Whether the customer has a partner or not (Yes, No)
#### Dependents : Whether the customer has dependents or not (Yes, No)
#### tenure : Number of months the customer has stayed with the company
#### PhoneService : Whether the customer has a phone service or not (Yes, No)
#### MultipleLines : Whether the customer has multiple lines or not (Yes, No, No phone service)
#### InternetService : Customer’s internet service provider (DSL, Fiber optic, No)
#### OnlineSecurity : Whether the customer has online security or not (Yes, No, No internet service)
#### OnlineBackup : Whether the customer has online backup or not (Yes, No, No internet service)
#### DeviceProtection : Whether the customer has device protection or not (Yes, No, No internet service)
#### TechSupport : Whether the customer has tech support or not (Yes, No, No internet service)
#### StreamingTV : Whether the customer has streaming TV or not (Yes, No, No internet service)
#### StreamingMovies : Whether the customer has streaming movies or not (Yes, No, No internet service)
#### Contract : The contract term of the customer (Month-to-month, One year, Two year)
#### PaperlessBilling : Whether the customer has paperless billing or not (Yes, No)
#### PaymentMethod : The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
#### MonthlyCharges : The amount charged to the customer monthly
#### TotalCharges : The total amount charged to the customer
#### Churn : Whether the customer churned or not (Yes or No)

![image.png](https://s16353.pcdn.co/wp-content/uploads/2018/06/Churn.png)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.simplefilter('ignore')
plt.style.use("fivethirtyeight")

In [None]:
data = pd.read_csv("/kaggle/input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv")

In [None]:
data.head()

In [None]:
data.dtypes

In [None]:
data.shape

In [None]:
data.isna().sum()

In [None]:
data.groupby('Churn')[['MonthlyCharges', 'tenure']].agg(['min', 'max', 'mean'])

The TotalCharges column holds numeric values, but its dtype is object.

In [None]:
data[data['TotalCharges'] == ' ']

In [None]:
data['TotalCharges'] = data['TotalCharges'].replace(' ', np.nan)

In [None]:
data[data['TotalCharges'] == ' ']

In [None]:
data['TotalCharges'].isna().sum()

In [None]:
data['TotalCharges'] = pd.to_numeric(data['TotalCharges'])

In [None]:
data['TotalCharges'].dtypes
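
The replace-then-convert steps above can also be collapsed into a single call. This is a minimal sketch on a toy Series (the values are made up for illustration), assuming that every entry that cannot be parsed should simply become NaN:

```python
import pandas as pd

# Toy stand-in for the TotalCharges column: numbers stored as strings,
# with one blank entry like the ones found above.
raw = pd.Series(["29.85", "1889.5", " ", "108.15"])

# errors="coerce" turns anything that cannot be parsed into NaN,
# so a separate replace(' ', np.nan) step is not needed.
clean = pd.to_numeric(raw, errors="coerce")

print(clean.dtype)         # float64
print(clean.isna().sum())  # 1
```

The trade-off is that `errors="coerce"` also hides any other malformed value, so it is worth checking the NaN count afterwards.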

In [None]:
data.groupby('Churn')[['MonthlyCharges', 'tenure', 'TotalCharges']].agg(['min', 'max', 'mean'])

We have 11 null values in the dataset, which we can either fill or remove. Since 11 is a small number, I will drop them.

In [None]:
data.dropna(inplace = True)

In [None]:
data.isna().sum()

In [None]:
data.shape

In [None]:
data.groupby('Churn')[['OnlineBackup', 'OnlineSecurity', 'PhoneService']].count()

In [None]:
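def half_corr_heatmap(data, title=None):
    # Correlation is only defined for numeric (and boolean) columns
    corr = data.select_dtypes(['number', 'bool']).corr()

    plt.figure(figsize=(9,9))
    sns.set(font_scale=1)
    
    # Boolean mask that hides the lower triangle and the diagonal
    mask = np.zeros_like(corr, dtype=bool)
    mask[np.tril_indices_from(mask)] = True
    
    with sns.axes_style("white"):
        sns.heatmap(corr, mask=mask, annot=True, cmap="coolwarm")
    
    if title: plt.title(f"\n{title}\n", fontsize=18)
    plt.show()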

In [None]:
half_corr_heatmap(data, 'Correlation Between Variables')

**The Churn column is missing from the heatmap because its dtype is object. Let's convert it to a numeric column.**

In [None]:
data['Churn'] = data['Churn'].map({'Yes' : 1, 'No' : 0})

In [None]:
half_corr_heatmap(data, 'Correlation Between Variables')

In [None]:
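def corr_for_target(data, target, title=None):
    # Correlation is only defined for numeric (and boolean) columns
    corr = data.select_dtypes(['number', 'bool']).corr()

    plt.figure(figsize=(4,14))
    sns.set(font_scale=1)
    
    # Drop the target's correlation with itself, then sort the rest
    sns.heatmap(corr[[target]].drop(index=target).sort_values(target, ascending=False), annot=True, cmap="coolwarm")
    
    if title: plt.title(f"\n{title}\n", fontsize=18)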

In [None]:
corr_for_target(data, 'Churn', 'Correlation Between Target')

In [None]:
sns.countplot(x = data['InternetService']);

In [None]:
sns.countplot(x = data['MultipleLines']);

In [None]:
data2 = data.drop(['customerID'], axis = 1)

To separate the numerical and categorical columns:

In [None]:
numerical = data2.select_dtypes(['number']).columns
print(f'Numerical: {numerical}\n')

categorical = data2.columns.difference(numerical)

data2[categorical] = data2[categorical].astype('object')
print(f'Categorical: {categorical}')

One-hot encoding the categorical variables into ones and zeros:

In [None]:
data2 = pd.get_dummies(data2)

In [None]:
data2.head()
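
As a small illustration of what `pd.get_dummies` does, here is a sketch on a toy frame (the column names come from the data description, the rows are made up). Numeric columns pass through unchanged and each category becomes its own indicator column. The `drop_first=True` variant is not used in this notebook; it is shown only as an option that removes one redundant level per feature.

```python
import pandas as pd

# Toy frame with one numeric and one categorical column (made-up rows).
toy = pd.DataFrame({
    "tenure": [1, 34, 2],
    "Contract": ["Month-to-month", "One year", "Month-to-month"],
})

# One indicator column per category; numeric columns are left as they are.
encoded = pd.get_dummies(toy)
print(list(encoded.columns))
# ['tenure', 'Contract_Month-to-month', 'Contract_One year']

# Optional variant: drop the first level of each categorical feature.
reduced = pd.get_dummies(toy, drop_first=True)
print(list(reduced.columns))
# ['tenure', 'Contract_One year']
```

Dropping one level mostly matters for linear models, where fully redundant indicator columns make the coefficients harder to interpret.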

Checking unique values of every column:

In [None]:
data_cols = data.drop('customerID', axis = 1)

for col in data_cols.columns:
    print(col, "\n")
    print(data[col].unique(), "\n")

In [None]:
plt.figure(figsize = (10,8))

ax = sns.distplot(data['tenure'], rug=True, rug_kws={"color": "g"},
                  kde_kws={"color": "red", "lw": 3},
                  hist_kws={"histtype": "step", "linewidth": 3,
                            "alpha": 0.4, "color": "g"});

### Some customers have stayed with this company for about 70 months (tenure is measured in months).

In [None]:
plt.figure(figsize=(12,8))

sns.distplot(data['MonthlyCharges']);

### Most customers have low monthly charges.

In [None]:
data[data['Churn'] == 1].TotalCharges.plot(kind = 'hist', alpha = 0.3, color = '#016a55', label = 'Churn = Yes')

data[data['Churn'] == 0].TotalCharges.plot(kind = 'hist', alpha = 0.3, color = '#d89955', label = 'Churn = No')

plt.xlabel('Total Charges')
plt.legend();

### Customers with lower total charges have left the brand the most.

In [None]:
data[data['Churn'] == 1].MonthlyCharges.plot(kind = 'hist', alpha = 0.3, color = '#019955', label = 'Churn = Yes')

data[data['Churn'] == 0].MonthlyCharges.plot(kind = 'hist', alpha = 0.3, color = '#d89955', label = 'Churn = No')

plt.xlabel('Monthly Charges')
plt.legend();

In [None]:
data[data['Churn'] == 1].tenure.plot(kind = 'hist', alpha = 0.3, color = '#019955', label = 'Yes')

data[data['Churn'] == 0].tenure.plot(kind = 'hist', alpha = 0.3, color = '#d89955', label = 'No')

plt.xlabel('Tenure')
plt.legend();

### Customers in roughly their first 1-8 months with the brand (tenure is measured in months) leave in the highest numbers.

In [None]:
plt.figure(figsize = (10, 6))

sns.countplot(x = 'OnlineSecurity', data = data, hue = 'Churn');

### Customers without online security leave the brand in higher numbers.

In [None]:
plt.figure(figsize = (10, 6))

sns.countplot(x = 'OnlineBackup', data = data, hue = 'Churn');

### Customers without online backup leave the brand in higher numbers.

In [None]:
plt.figure(figsize = (10, 6))

sns.countplot(x = 'Contract', data = data, hue = 'Churn');

### Customers on a month-to-month contract leave the brand in higher numbers.

In [None]:
plt.figure(figsize = (10, 6))

sns.countplot(x = 'PhoneService', data = data, hue = 'Churn');

In [None]:
plt.figure(figsize = (10, 6))

sns.countplot(x = 'MultipleLines', data = data, hue = 'Churn');

### The number of customers leaving the brand differs little between those with and without multiple lines.

In [None]:
plt.figure(figsize = (10, 6))

sns.countplot(x = 'PaperlessBilling', data = data, hue = 'Churn');

In [None]:
plt.figure(figsize = (10, 6))

sns.countplot(x = 'InternetService', data = data, hue = 'Churn');

In [None]:
plt.figure(figsize = (15, 15))

plt.subplot(3, 2, 1)
sns.countplot(x = 'gender', data = data, hue = 'Churn')

plt.subplot(3, 2, 2)
sns.countplot(x = 'DeviceProtection', data = data, hue = 'Churn')

plt.subplot(3, 2, 3)
sns.countplot(x = 'StreamingTV', data = data, hue = 'Churn')

plt.subplot(3, 2, 4)
sns.countplot(x = 'Partner', data = data, hue = 'Churn')

plt.subplot(3, 2, 5)
sns.countplot(x = 'TechSupport', data = data, hue = 'Churn')

plt.subplot(3, 2, 6)
sns.countplot(x = 'PaymentMethod', data = data, hue = 'Churn')

plt.xticks(rotation = 45);

In [None]:
plt.figure(figsize = (20, 7))

corr_for_target(data2, 'Churn');

### Splitting the Data

In [None]:
X = data2.drop('Churn', axis=1)

y = data2['Churn']

# Model Building

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from lightgbm import LGBMClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .33, random_state = 42)
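
Because churned customers are usually the minority class, it can help to keep the class ratio identical in both parts of the split. The sketch below uses made-up toy arrays (not the churn data) and the `stratify` argument of `train_test_split`; the split above does not use it, so this is only an optional variant.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 20 positives out of 100 rows (made-up data).
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 20 + [0] * 80)

# stratify=y_toy keeps the share of each class equal in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)

# Both parts keep the 20% positive rate of the full data.
print(y_tr.mean(), y_te.mean())
```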

In [None]:
models = []
models.append(('Random Forest Clas.', RandomForestClassifier()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('Decision Tree Clas.', DecisionTreeClassifier()))
models.append(("LightGBM", LGBMClassifier()))
models.append(('GBC',GradientBoostingClassifier()))
models.append(('Logistic Reg.', LogisticRegression()))
models.append(('XGB', XGBClassifier()))
models.append(('SVC', SVC()))

Creating a for loop to see cross validation scores for every model above:

In [None]:
model_names = []
scores = []

for name, model in models:
    score = cross_val_score(model, X, y, cv = 10, scoring='accuracy')
    scores.append(score)
    model_names.append(name)
    print(f"Mean of the {name} model scores : {score.mean()}")
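
Distance-based models such as KNN and SVC are sensitive to feature scale, and the loop above feeds them unscaled data. One hedged way to handle this inside cross-validation is a pipeline, so that the scaler is re-fit on the training part of every fold. The sketch uses synthetic data from `make_classification` as a stand-in for `X` and `y`; the names `X_demo`, `y_demo`, and `scaled_knn` are made up for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X and y (not the churn data).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# The scaler is re-fit on the training part of every fold,
# so no information from the validation fold leaks into it.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier())

fold_scores = cross_val_score(scaled_knn, X_demo, y_demo, cv=5, scoring="accuracy")
print(fold_scores.mean())
```

In the loop above, the same idea would mean wrapping each scale-sensitive model in `make_pipeline(StandardScaler(), model)` before passing it to `cross_val_score`.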

# Feature Importance By LightGBM

Checking the features that are most important for LGBM:

In [None]:
feature_importance = pd.DataFrame({'Importance' : LGBMClassifier().fit(X, y).feature_importances_}, index = X.columns)

feature_importance.sort_values(by = 'Importance', ascending = False, axis = 0)[:5].plot(kind = 'bar', color = '#019955', figsize = (10, 5))
plt.xlabel("Feature Importance by LightGBM", color = "#019955", fontdict= {"fontsize" : 20});

Model building with all features:

In [None]:
model_lgbm = LGBMClassifier()
model_lgbm.fit(X_train, y_train)

y_pred_lgbm = model_lgbm.predict(X_test)
y_pred_lgbm_train = model_lgbm.predict(X_train)

In [None]:
lgbm_test_as = metrics.accuracy_score(y_pred_lgbm, y_test)
lgbm_train_as = metrics.accuracy_score(y_pred_lgbm_train, y_train)

print(f"LGBM accuracy score for test data {lgbm_test_as}")
print(f"LGBM accuracy score for train data {lgbm_train_as}")

#### The gap between train and test accuracy is somewhat large, which suggests the model is overfitting.

Let's try again using only the most important features found above.

In [None]:
X_train_new = X_train[['MonthlyCharges', 'TotalCharges', 'tenure', 'PaymentMethod_Electronic check']]

X_test_new = X_test[['MonthlyCharges', 'TotalCharges', 'tenure', 'PaymentMethod_Electronic check']]

In [None]:
new_model_lgbm = LGBMClassifier()
new_model_lgbm.fit(X_train_new, y_train)

new_y_pred = new_model_lgbm.predict(X_test_new)
lgbm_ft_as = metrics.accuracy_score(new_y_pred, y_test)
lgbm_ft_as

In [None]:
new_y_pred_train = new_model_lgbm.predict(X_train_new)
lgbm_ft_as_ = metrics.accuracy_score(new_y_pred_train, y_train)
lgbm_ft_as_

Not much has changed. We couldn't improve the model as much as we wanted.

# Logistic Regression

In [None]:
log = LogisticRegression()
log.fit(X_train, y_train)

log_y_pred = log.predict(X_test)
log_y_pred_train = log.predict(X_train)

In [None]:
log_test_as = metrics.accuracy_score(log_y_pred, y_test)
log_train_as = metrics.accuracy_score(log_y_pred_train, y_train)

In [None]:
print(f"Accuracy score for test data : {log_test_as}")
print(f"Accuracy score for train data : {log_train_as}")

In [None]:
# scikit-learn metrics expect the true labels first, then the predictions
print(metrics.classification_report(y_test, log_y_pred))

In [None]:
metrics.confusion_matrix(y_test, log_y_pred)

In [None]:
metrics.confusion_matrix(y_train, log_y_pred_train)
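
scikit-learn's metric functions take the true labels first and the predictions second; swapping them transposes the confusion matrix and exchanges precision with recall. The sketch below rebuilds the four cells by hand on made-up labels (`y_true` and `y_hat` are illustrative names, not variables from this notebook) to show how the numbers relate.

```python
import numpy as np

# Made-up labels for illustration: 1 = churned, 0 = stayed.
y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0])
y_hat = np.array([1, 0, 1, 0, 0, 1, 1, 0])

tp = int(np.sum((y_true == 1) & (y_hat == 1)))  # churners caught
fp = int(np.sum((y_true == 0) & (y_hat == 1)))  # false alarms
fn = int(np.sum((y_true == 1) & (y_hat == 0)))  # churners missed
tn = int(np.sum((y_true == 0) & (y_hat == 0)))  # correctly left alone

# Same layout as confusion_matrix(y_true, y_pred) for labels [0, 1]:
# rows are true classes, columns are predicted classes.
cm = np.array([[tn, fp], [fn, tp]])
print(cm)

precision = tp / (tp + fp)  # of the predicted churners, how many really churned
recall = tp / (tp + fn)     # of the real churners, how many were found
print(precision, recall)
```

For churn, recall on the positive class is often the number to watch, since a missed churner is usually more costly than a false alarm.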

In [None]:
y_proba_log = log.predict_proba(X_test)[:, 1]
fpr, tpr, threshold = metrics.roc_curve(y_test, y_proba_log)

In [None]:
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label = 'Logistic Regression')
plt.xlabel('fpr')
plt.ylabel('tpr')
plt.title('ROC Curve')
plt.legend();

In [None]:
metrics.roc_auc_score(y_test, y_proba_log)

In [None]:
y_proba_log_train = log.predict_proba(X_train)[:, 1]
metrics.roc_auc_score(y_train, y_proba_log_train)
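
To make the ROC AUC number easier to interpret: it equals the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. The helper `auc_by_pairs` below is a hypothetical name written for this sketch; it counts correctly ordered pairs directly on a tiny made-up example.

```python
import numpy as np

def auc_by_pairs(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

# Made-up labels and predicted probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc_by_pairs(y_true, scores))  # 0.75
```

Three of the four positive/negative pairs are ordered correctly here, which gives 0.75. The pairwise version is quadratic in the number of rows, so it is meant for understanding rather than for large data.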

# SVC

In [None]:
svc = SVC()
svc.fit(X_train, y_train)

In [None]:
y_pred_svc = svc.predict(X_test)
y_pred_train = svc.predict(X_train)

svc_train_as = metrics.accuracy_score(y_train, y_pred_train)
svc_as = metrics.accuracy_score(y_test, y_pred_svc)

In [None]:
print(f"Accuracy score for test data : {svc_as}")
print(f"Accuracy score for train data : {svc_train_as}")

In [None]:
print(metrics.classification_report(y_test, y_pred_svc))

Let's try after scaling the data.

In [None]:
sc = StandardScaler()

X_train_sc = sc.fit_transform(X_train)
X_test_sc = sc.transform(X_test)
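
The scaler is fit on the training data only and then reused on the test data. A minimal NumPy sketch of the same idea, on made-up arrays, shows what `fit_transform` and `transform` compute:

```python
import numpy as np

# Made-up train and test arrays with two features on very different scales.
train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test = np.array([[2.0, 400.0]])

# "fit": learn the statistics from the training data only.
mu = train.mean(axis=0)
sigma = train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# "transform": apply the training statistics to both sets.
train_sc = (train - mu) / sigma
test_sc = (test - mu) / sigma

print(train_sc.mean(axis=0))  # approximately 0 for each feature
print(test_sc)
```

Computing `mu` and `sigma` on the test set as well would leak information about the test data into preprocessing, which is why only `transform` is called on `X_test`.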

In [None]:
svc_sc = SVC()
svc_sc.fit(X_train_sc, y_train)

y_pred_sc = svc_sc.predict(X_test_sc)
y_pred_sc_train = svc_sc.predict(X_train_sc)

svc_sc_train_as = metrics.accuracy_score(y_train, y_pred_sc_train)
svc_sc_as = metrics.accuracy_score(y_test, y_pred_sc)

In [None]:
print(f"Accuracy score for test data : {svc_sc_as}")
print(f"Accuracy score for train data : {svc_sc_train_as}")

In [None]:
params = {'kernel' : ['rbf'], 'C' : [0.1, 1, 5, 10], 'gamma' : [0.01, 0.1, 0.9, 1]}

grid = GridSearchCV(SVC(), params, cv = 5, return_train_score= False)

In [None]:
# grid.fit(X_train_sc, y_train)

In [None]:
# grid.best_params_
# best_params_ : [C = 1, gamma = 0.01, kernel = 'rbf']

In [None]:
# grid.best_score_
# best_score_ : 0.7968569389377085
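
For readers who want to see a grid search run end to end without the long fit above, here is a small sketch on synthetic data with a reduced grid. The data, the grid values, and the names `pipe`, `small_grid`, and `search` are assumptions made for this example. Putting the scaler inside the pipeline means it is re-fit within each cross-validation fold.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the scaled training data (not the churn data).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

pipe = make_pipeline(StandardScaler(), SVC())

# make_pipeline names each step after its lowercased class,
# so SVC parameters are addressed with the "svc__" prefix.
small_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": [0.01, 0.1]}

search = GridSearchCV(pipe, small_grid, cv=3)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```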

Model tuning with the best params.

In [None]:
# svc_new = SVC(**grid.best_params_)
svc_new = SVC(C = 1, gamma = 0.01, kernel = 'rbf')
svc_new.fit(X_train_sc, y_train)

y_pred_new = svc_new.predict(X_test_sc)
y_pred_new_train = svc_new.predict(X_train_sc)

svc_new_train_as = metrics.accuracy_score(y_train, y_pred_new_train)
svc_new_as = metrics.accuracy_score(y_test, y_pred_new)

print(f"Accuracy score for test data : {svc_new_as}")
print(f"Accuracy score for train data : {svc_new_train_as}")

In [None]:
# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay replaces it
metrics.RocCurveDisplay.from_estimator(svc_new, X_train_sc, y_train)

# KNN

In [None]:
testscores = []
trainscores = []

for i in range(1, 10):
    model = KNeighborsClassifier(i)
    model.fit(X_train, y_train)
    
    test_pred = model.predict(X_test)
    train_pred = model.predict(X_train)
    
    testscores.append(metrics.accuracy_score(y_test, test_pred))
    trainscores.append(metrics.accuracy_score(y_train, train_pred))

In [None]:
plt.plot(range(1, 10), testscores, label = 'Test Scores', color = 'red')

plt.plot(range(1, 10), trainscores, label = 'Train Scores', color = 'blue')

plt.legend();

We can choose k as 8.
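
Picking k from the test-score curve uses the test set for model selection, which can make the final test score look better than it is. A hedged alternative is to choose k by cross-validation on the training data only. The sketch below uses synthetic data; `X_demo`, `y_demo`, `cv_means`, and `best_k` are illustrative names, not part of the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the training data (not the churn data).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=1)

k_values = list(range(1, 10))

# Mean 5-fold accuracy for each candidate k, computed on training data only.
cv_means = [
    cross_val_score(KNeighborsClassifier(n_neighbors=k), X_demo, y_demo, cv=5).mean()
    for k in k_values
]

best_k = k_values[int(np.argmax(cv_means))]
print(best_k)
```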

In [None]:
knn = KNeighborsClassifier(8)
knn.fit(X_train, y_train)

y_pred_knn = knn.predict(X_test)
y_pred_knn_train = knn.predict(X_train)

In [None]:
knn_as = metrics.accuracy_score(y_test, y_pred_knn)
knn_as_train = metrics.accuracy_score(y_train, y_pred_knn_train)

In [None]:
print(f"Accuracy score for test data : {knn_as}")
print(f"Accuracy score for train data : {knn_as_train}")

In [None]:
metrics.confusion_matrix(y_test, y_pred_knn)

In [None]:
print(metrics.classification_report(y_test, y_pred_knn))

In [None]:
y_proba = knn.predict_proba(X_test)[:, 1]
fpr, tpr, threshold = metrics.roc_curve(y_test, y_proba)

In [None]:
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label = 'KNN')
plt.xlabel('fpr')
plt.ylabel('tpr')
plt.title('ROC Curve')
plt.legend();

In [None]:
metrics.roc_auc_score(y_test, y_proba)

In [None]:
metrics.confusion_matrix(y_test, y_pred_knn)

# Decision Tree Classifier

In [None]:
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, y_train)

y_pred_dt = decision_tree.predict(X_test)
y_pred_train_dt = decision_tree.predict(X_train)

In [None]:
dt_as = metrics.accuracy_score(y_test, y_pred_dt)
dt_as_train = metrics.accuracy_score(y_train, y_pred_train_dt)

print(f"Accuracy score for test data : {dt_as}")
print(f"Accuracy score for train data : {dt_as_train}")

# Random Forest Classifier

In [None]:
random_forest = RandomForestClassifier()
random_forest.fit(X_train, y_train)

y_pred_rf = random_forest.predict(X_test)
y_pred_train_rf = random_forest.predict(X_train)

In [None]:
rf_as = metrics.accuracy_score(y_test, y_pred_rf)
rf_as_train = metrics.accuracy_score(y_train, y_pred_train_rf)

print(f"Accuracy score for test data : {rf_as}")
print(f"Accuracy score for train data : {rf_as_train}")

In [None]:
random_forest_ = RandomForestClassifier(100)
random_forest_.fit(X_train, y_train)

y_pred_rf_ = random_forest_.predict(X_test)
y_pred_train_rf_ = random_forest_.predict(X_train)

In [None]:
rf_as_ = metrics.accuracy_score(y_test, y_pred_rf_)
rf_as_train_ = metrics.accuracy_score(y_train, y_pred_train_rf_)

print(f"Accuracy score for test data : {rf_as_}")
print(f"Accuracy score for train data : {rf_as_train_}")

Checking feature importance for random forest classifier:

In [None]:
feature_importance_ = pd.DataFrame({'Importance' : RandomForestClassifier().fit(X, y).feature_importances_}, index = X.columns)

feature_importance_.sort_values(by = 'Importance', ascending = False, axis = 0)[:5].plot(kind = 'bar', color = '#019955', figsize = (10, 5))
plt.xlabel("Feature Importance by Random Forest Classifier", color = "#019955", fontdict= {"fontsize" : 20});

In [None]:
X_train_new_ = X_train[['MonthlyCharges', 'TotalCharges', 'tenure', 'Contract_Month-to-month', 'OnlineSecurity_No']]
X_test_new_ = X_test[['MonthlyCharges', 'TotalCharges', 'tenure', 'Contract_Month-to-month', 'OnlineSecurity_No']]

In [None]:
random_forest_new = RandomForestClassifier()
random_forest_new.fit(X_train_new_, y_train)

y_pred_rf_new = random_forest_new.predict(X_test_new_)
y_pred_train_rf_new = random_forest_new.predict(X_train_new_)

In [None]:
rf_as_new = metrics.accuracy_score(y_test, y_pred_rf_new)
rf_as_train_new = metrics.accuracy_score(y_train, y_pred_train_rf_new)

print(f"Accuracy score for test data : {rf_as_new}")
print(f"Accuracy score for train data : {rf_as_train_new}")

Again, restricting the model to the top features did not change the scores much.

***Let's try with gridsearchcv to find best parameters.***

In [None]:
params_grid = {'criterion' : ['entropy', 'gini'], 'max_depth' : [2, 4, 6, 8], 'n_estimators' : [300, 400, 500],
              'min_samples_split' : [2, 4, 6, 8], 'min_samples_leaf' : [2, 3, 5, 7]}

gscv_rf = GridSearchCV(RandomForestClassifier(), params_grid, cv = 3, scoring = 'f1')
# gscv_rf.fit(X_train_sc, y_train)

In [None]:
# gscv_rf.best_params_
# {'criterion': 'gini','max_depth': 8,'min_samples_leaf': 2,'min_samples_split': 8,'n_estimators': 400}

In [None]:
# model tuning with best parameters

rf_gscv = RandomForestClassifier(n_estimators = 400, criterion = 'gini', max_depth = 8, min_samples_split = 8, min_samples_leaf = 2)
rf_gscv.fit(X_train_sc, y_train)

y_pred_gsvc = rf_gscv.predict(X_test_sc)
y_pred_gsvc_train = rf_gscv.predict(X_train_sc)

rf_gscv_as = metrics.accuracy_score(y_test, y_pred_gsvc)
rf_gscv_train_as = metrics.accuracy_score(y_train, y_pred_gsvc_train)

print(f"Accuracy score for test data : {rf_gscv_as}")
print(f"Accuracy score for train data : {rf_gscv_train_as}")

In [None]:
metrics.confusion_matrix(y_test, y_pred_gsvc)

In [None]:
metrics.confusion_matrix(y_train, y_pred_gsvc_train)

**Now, we do not have an overfitting problem!**

# XGBoost

In [None]:
xg = XGBClassifier()
xg.fit(X_train_sc, y_train)

y_pred_xg = xg.predict(X_test_sc)

y_pred_xg_train = xg.predict(X_train_sc)

xg_as = metrics.accuracy_score(y_test, y_pred_xg)
xg_as_train = metrics.accuracy_score(y_train, y_pred_xg_train)

print(f"Accuracy score of test data : {xg_as}")
print(f"Accuracy score of train data : {xg_as_train}")

In [None]:
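# plot_confusion_matrix and plot_roc_curve were removed in scikit-learn 1.2.
# Classes are ordered [0, 1], so the display labels must follow that order.
metrics.ConfusionMatrixDisplay.from_estimator(xg, X_test_sc, y_test, display_labels = [0, 1]);

In [None]:
metrics.RocCurveDisplay.from_estimator(xg, X_test_sc, y_test);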

In [None]:
parameters = {'learning_rate' : [0.01, 0.03, 0.05], 'max_depth' : [1, 4, 6], 'n_estimators' : [100, 300, 400, 600]}

In [None]:
xg_grid = GridSearchCV(XGBClassifier(), parameters, cv = 5)

In [None]:
# xg_grid.fit(X_train_sc, y_train)

In [None]:
# xg_grid.best_score_
# 0.804287486519285

In [None]:
# xg_grid.best_params_
# {'learning_rate': 0.05, 'max_depth': 1, 'n_estimators': 600}

In [None]:
# Parameter tuning
xg_gridcv = XGBClassifier(learning_rate = .05, max_depth = 1, n_estimators = 600)

xg_gridcv.fit(X_train_sc, y_train)

y_pred_xggrid = xg_gridcv.predict(X_test_sc)
y_pred_xggrid_train = xg_gridcv.predict(X_train_sc)

xg_as_grid = metrics.accuracy_score(y_test, y_pred_xggrid)
xg_as_grid_train = metrics.accuracy_score(y_train, y_pred_xggrid_train)

print(f"Accuracy score of test data : {xg_as_grid}")
print(f"Accuracy score of train data : {xg_as_grid_train}")

Now, looks good!

In [None]:
# https://matplotlib.org/examples/color/colormaps_reference.html #
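# Classes are ordered [0, 1], so the display labels must follow that order.
metrics.ConfusionMatrixDisplay.from_estimator(xg_gridcv, X_test_sc, y_test, cmap = 'cool', display_labels = [0, 1]);

In [None]:
metrics.RocCurveDisplay.from_estimator(xg_gridcv, X_test_sc, y_test);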

# Gradient Boosting Classifier

In [None]:
grad_boost = GradientBoostingClassifier()

grad_boost.fit(X_train_sc, y_train)

y_pred_grad = grad_boost.predict(X_test_sc)
y_pred_grad_train = grad_boost.predict(X_train_sc)

grad_as = metrics.accuracy_score(y_test, y_pred_grad)
grad_as_train = metrics.accuracy_score(y_train, y_pred_grad_train)

print(f"Accuracy score of test data : {grad_as}")
print(f"Accuracy score of train data : {grad_as_train}")

In [None]:
parameters_grad = {'learning_rate' : [0.01, 0.03, 0.05, 0.1], 'max_depth' : [1, 4, 6], 'n_estimators' : [100, 300, 400, 600, 800]}

grad_grid = GridSearchCV(GradientBoostingClassifier(), parameters_grad, cv = 5, scoring = 'f1')

# grad_grid.fit(X_train_sc, y_train)

In [None]:
# grad_grid.best_params_
# {'learning_rate': 0.1, 'max_depth': 1, 'n_estimators': 600}

In [None]:
# grad_grid.best_score_
# 0.5984668361905707

In [None]:
# Parameter tuning

grad_grid_ = GradientBoostingClassifier(n_estimators = 600, max_depth = 1, learning_rate = .1)

grad_grid_.fit(X_train_sc, y_train)

y_pred_grad_grid = grad_grid_.predict(X_test_sc)
y_pred_grad_grid_train = grad_grid_.predict(X_train_sc)

grad_grid_as = metrics.accuracy_score(y_test, y_pred_grad_grid)
grad_grid_as_train = metrics.accuracy_score(y_train, y_pred_grad_grid_train)

print(f"Accuracy score of test data : {grad_grid_as}")
print(f"Accuracy score of train data : {grad_grid_as_train}")

In [None]:
metrics.ConfusionMatrixDisplay.from_estimator(grad_grid_, X_test_sc, y_test, cmap = 'summer', display_labels = [0, 1]);

In [None]:
metrics.RocCurveDisplay.from_estimator(grad_grid_, X_test_sc, y_test);

## Logistic Regression w/ Scaled Data

In [None]:
log_sc = LogisticRegression()
log_sc.fit(X_train_sc, y_train)

y_pred_log_sc = log_sc.predict(X_test_sc)
y_pred_log_sc_ = log_sc.predict(X_train_sc)

log_sc_as = metrics.accuracy_score(y_test, y_pred_log_sc)
log_sc_as_ = metrics.accuracy_score(y_train, y_pred_log_sc_)

print(f"Accuracy score of test data : {log_sc_as}")
print(f"Accuracy score of train data : {log_sc_as_}")

In [None]:
metrics.ConfusionMatrixDisplay.from_estimator(log_sc, X_test_sc, y_test, cmap = 'GnBu', display_labels = [0, 1]);

>  * We used KNN, Decision Tree Classifier, Random Forest Classifier, XGBoost Classifier, LGBM, Gradient Boosting Classifier, SVC, and Logistic Regression.

In [None]:
print("Logistic Regression results : \n")
print(f"Accuracy score of test data : {log_sc_as}")
print(f"Accuracy score of train data : {log_sc_as_}\n")

print("------------------------------------------------")

print("KNN results : \n")
print(f"Accuracy score for test data : {knn_as}")
print(f"Accuracy score for train data : {knn_as_train}\n")

print("------------------------------------------------")

print("SVC results without parameter tuning : \n")
print(f"Accuracy score for test data : {svc_sc_as}")
print(f"Accuracy score for train data : {svc_sc_train_as}\n")
print("SVC results with parameter tuning : \n")
print(f"Accuracy score for test data : {svc_new_as}")
print(f"Accuracy score for train data : {svc_new_train_as}\n")

print("------------------------------------------------")

print("LGBM results with all features : \n")
print(f"LGBM accuracy score for test data {lgbm_test_as}")
print(f"LGBM accuracy score for train data {lgbm_train_as}\n")
print("LGBM results with only the most important features : \n")
print(f"LGBM accuracy score for test data {lgbm_ft_as}")
print(f"LGBM accuracy score for train data {lgbm_ft_as_}\n")

print("------------------------------------------------")

print("Decision Tree Classifier results : \n")
print(f"Accuracy score for test data : {dt_as}")
print(f"Accuracy score for train data : {dt_as_train}\n")

print("------------------------------------------------")

print("Random Forest Classifier without parameter tuning : \n")
print(f"Accuracy score for test data : {rf_as}")
print(f"Accuracy score for train data : {rf_as_train}\n")
print("Random Forest Classifier with parameter tuning : \n")
print(f"Accuracy score for test data : {rf_gscv_as}")
print(f"Accuracy score for train data : {rf_gscv_train_as}\n")

print("------------------------------------------------")

print("XGBoost results without parameter tuning : \n")
print(f"Accuracy score of test data : {xg_as}")
print(f"Accuracy score of train data : {xg_as_train}\n")
print("XGBoost results with parameter tuning : \n")
print(f"Accuracy score of test data : {xg_as_grid}")
print(f"Accuracy score of train data : {xg_as_grid_train}\n")

print("------------------------------------------------")

print("Gradient Boosting Classifier results without parameter tuning : \n")
print(f"Accuracy score of test data : {grad_as}")
print(f"Accuracy score of train data : {grad_as_train}\n")
print("Gradient Boosting Classifier results with parameter tuning : \n")
print(f"Accuracy score of test data : {grad_grid_as}")
print(f"Accuracy score of train data : {grad_grid_as_train}")
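
The printed summary can also be collected into a table, which makes the train/test gap easier to compare across models. This is only a sketch: the model names and scores below are placeholders invented for illustration, not results from this notebook. In practice the dictionary would be filled with the variables computed above, such as `log_sc_as` and `log_sc_as_`.

```python
import pandas as pd

# Placeholder scores for illustration only (NOT results from this notebook).
results = {
    "model_a": {"test": 0.80, "train": 0.81},
    "model_b": {"test": 0.78, "train": 0.99},
}

# Outer keys become columns, so transpose to get one row per model.
summary = pd.DataFrame(results).T

# A large positive gap is a sign of overfitting.
summary["gap"] = summary["train"] - summary["test"]

summary = summary.sort_values("test", ascending=False)
print(summary)
```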