# Hi all. 🙋

Today, we will build models with the famous trio (**XGBoost**, **LightGBM**, and **CatBoost**) that predict customer behavior in order to retain customers. We will analyze all relevant customer data to support focused customer retention programs.

We will also deal with imbalanced data by using the famous trio models and hyperparameter tuning with **OPTUNA**.

# Table of Contents
- Data

- Problem at Hand and Metric to Use?

- Exploratory Data Analysis

    - Target Variable

    - Numerical Features

    - Categorical Features

- Famous Trio and Imbalanced Data

    - CATBOOST / OPTUNA

    - LIGHTGBM / OPTUNA

    - XGBOOST / OPTUNA

- Model Comparison

- Conclusion

# Data
This dataset is about predicting whether a customer will change telecommunications provider, something known as "churning".

The training dataset contains 4250 samples. Each sample contains 19 features and 1 boolean variable "churn" which indicates the class of the sample. The 19 input features and 1 target variable are:

![image.png](attachment:11dae1ed-7c24-4d7f-8948-aac71aad2c37.png)

# Problem at Hand and Metric to Use?
- After analyzing the data and the data dictionary, we see that we have a **classification** problem.
- We will classify the target variable **churn**.
- For this purpose, we will look at the balance of the target variable.
- Since our target variable is imbalanced, we are not going to rely on the **Accuracy** score alone.
- Based on the problem at hand, we will also use the **Recall** score.
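
To see why accuracy can mislead on this data, consider a baseline that predicts "no churn" for every customer. The sketch below is plain Python and uses the class counts of this training set (598 churners out of 4250 customers); the helper names `accuracy` and `recall` are illustrative, not library functions.

```python
# Baseline that always predicts the majority class (0 = no churn).
n_total = 4250      # training samples
n_churn = 598       # churned customers (minority class)

y_true = [1] * n_churn + [0] * (n_total - n_churn)
y_pred = [0] * n_total  # never predicts churn

def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred):
    """Share of actual churners that the model catches."""
    true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == 1 for t in y_true)
    return true_pos / actual_pos

baseline_accuracy = accuracy(y_true, y_pred)
baseline_recall = recall(y_true, y_pred)
print(f"accuracy = {baseline_accuracy:.4f}, recall = {baseline_recall:.4f}")
```

The baseline scores roughly 86% accuracy while catching no churner at all, which is why recall is worth tracking next to accuracy.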

# Exploratory Data Analysis

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns



from sklearn.model_selection import cross_val_score, cross_val_predict, train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler, PowerTransformer, LabelEncoder

from sklearn.metrics import accuracy_score, classification_report, recall_score, confusion_matrix, roc_auc_score, precision_score, f1_score, roc_curve, auc

# plot_confusion_matrix and plot_roc_curve were removed in scikit-learn 1.2
try:
    from sklearn.metrics import plot_confusion_matrix, plot_roc_curve
except ImportError:
    from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay

import optuna
import lightgbm as lgb
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier


#importing plotly and cufflinks in offline mode
import cufflinks as cf
import plotly.offline
cf.go_offline()
cf.set_config_file(offline=False, world_readable=True)


import plotly 
import plotly.express as px
import plotly.graph_objs as go
import plotly.offline as py
from plotly.offline import iplot
from plotly.subplots import make_subplots
import plotly.figure_factory as ff

import shap 

import missingno as msno

import warnings
warnings.filterwarnings("ignore")

In [None]:
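# Use the full option names; the short forms are ambiguous in recent pandas versions
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 900)

pd.set_option('display.max_colwidth', 200)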
data = pd.read_csv("../input/customer-churn-prediction-2020/train.csv")
data.head()

In [None]:
df1 = data.copy()

In [None]:
df1.head()

In [None]:
print(f"We have {df1.shape[0]} rows and {df1.shape[1]} columns in our dataset.")

In [None]:
df1.duplicated().sum()

In [None]:
def missing (df1):
    missing_number = df1.isnull().sum().sort_values(ascending=False)
    missing_percent = (df1.isnull().sum()/df1.isnull().count()).sort_values(ascending=False)
    missing_values = pd.concat([missing_number, missing_percent], axis=1, keys=['Missing_Number', 'Missing_Percent'])
    return missing_values

missing(df1)

We have neither duplicated rows nor missing values (deeper checks may still be needed to be 100% sure).

In [None]:
df1.info()

**Based on our preliminary analysis, we conclude that**:
- We won't drop any column.
- For the **CatBoost** model, columns of type int64 will be converted to float.
- We will look at the cardinality of the categorical variables.
- And finally, we will convert **churn** column to numeric type by using **label encoding**.

## Target Variable

In [None]:
df1.churn.value_counts()

In [None]:
df1.churn.value_counts(normalize=True)

In [None]:
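y = df1['churn']
# value_counts() sorts by frequency: position 0 is the majority (non-churn), position 1 the minority (churn)
print(f'Percentage of Churn:  {round(y.value_counts(normalize=True).iloc[1]*100,2)} %  --> ({y.value_counts().iloc[1]} customers)')
print(f'Percentage Non_Churn: {round(y.value_counts(normalize=True).iloc[0]*100,2)}  %  --> ({y.value_counts().iloc[0]} customers)')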

In [None]:
y.iplot(kind="hist", title="Churns vs. NonChurns");

- It is obvious that we have imbalanced data.
- Almost 14% of the customers (598 customers) did not continue with the company and churned.
- Almost 86% of the customers (3652 customers) continued with the company and did not churn.

In [None]:
# Convert "churn" from object to int
le = LabelEncoder()
df1.churn = le.fit_transform(df1.churn)

In [None]:
df1.info()

- We converted **churn** column into numeric type.

In [None]:
numerical= df1.select_dtypes(include = 'number').columns

categorical = df1.select_dtypes(include = 'object').columns

print(f'Numerical Columns:  {df1[numerical].columns}')
print('\n')
print(f'Categorical Columns: {df1[categorical].columns}')

- For ease of usage, we got the list of the **numerical** and **categorical** features.

## Numerical Features

In [None]:
col_int = []

for col in numerical:
    if df1[col].dtype == "int64":
        col_int.append(col)

col_int.remove("churn")
col_int

In [None]:
for i in col_int:
    df1[i] = df1[i].astype(float)

- We have just converted every int64 column (except **churn**) to float.

In [None]:
df1.info()

In [None]:
df1[numerical].describe()
# df1.describe()

In [None]:
plt.figure(figsize=(16, 8))
sns.heatmap (df1[numerical].corr(), annot=True, fmt= '.2f', vmin=-1, vmax=1, center=0, cmap='coolwarm');

- The heatmap shows some multicollinearity.
- We need to drop one column from each highly correlated pair.
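
One way to find such pairs programmatically is to scan every pair of columns for an absolute Pearson correlation above a threshold. The sketch below is plain Python on a small made-up table; the values, the 0.95 threshold, and the helper names `pearson` and `highly_correlated_pairs` are assumptions for illustration. In the toy data the charge column is built as a fixed multiple of the minutes column, which is why that pair correlates perfectly.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def highly_correlated_pairs(columns, threshold=0.95):
    """Return (name_a, name_b, r) for every pair with |r| above the threshold."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) > threshold:
            pairs.append((name_a, name_b, round(r, 4)))
    return pairs

# Toy data (hypothetical values): the charge is a fixed multiple of the minutes.
minutes = [110.0, 265.1, 161.6, 243.4, 299.4, 166.7]
toy_columns = {
    "total_day_minutes": minutes,
    "total_day_charge": [0.17 * m for m in minutes],
    "total_day_calls": [100, 90, 120, 110, 105, 95],
}

pairs = highly_correlated_pairs(toy_columns)
print(pairs)
```

Only the minutes/charge pair is flagged, so one of the two columns can be dropped without losing information.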

In [None]:
drop_col = ['total_day_charge', 'total_eve_charge', 'total_night_charge', 'total_intl_charge']
df1 = df1.drop(drop_col, axis=1)
df1.shape

In [None]:
numerical= df1.select_dtypes(include = 'number').columns
numerical

In [None]:
plt.figure(figsize=(16, 8))
sns.heatmap (df1[numerical].corr(), annot=True, fmt= '.2f', vmin=-1, vmax=1, center=0, cmap='coolwarm');

- We got rid of the multicollinear columns.
- **total_day_minutes** has the highest correlation with **churn**.
- Overall, correlations among the features are low.


## Categorical Features

In [None]:
df1[categorical].nunique()

- Great news! We have neither high-cardinality nor zero-variance issues.

In [None]:
for column in df1[categorical]:
    print(f"{column}: {df1[column].unique()}")

### state vs. churn

In [None]:
for i in df1["state"].unique():
    print(f'A customer from state of {i} has a probability of {round(df1[df1["state"]==i]["churn"].mean()*100,2)} % churn.')

In [None]:
fig = px.histogram(data_frame=df1, x="state", color="churn", width=1200, height=400)
fig.show()

- While **CA** (California) has the highest rate of churn, **VA** (Virginia) has the lowest churn rate.
- Overall churn rates among states range from 5 percent to 25 percent.

### area_code vs. churn

In [None]:
# area_code values: ['area_code_415' 'area_code_408' 'area_code_510']

print(f'A customer with area_code_415 has a probability of {round(df1[df1["area_code"]=="area_code_415"]["churn"].mean()*100,2)} % churn.')
print()
print(f'A customer with area_code_408 has a probability of {round(df1[df1["area_code"]=="area_code_408"]["churn"].mean()*100,2)} % churn.')
print()
print(f'A customer with area_code_510 has a probability of {round(df1[df1["area_code"]=="area_code_510"]["churn"].mean()*100,2)} % churn.')

In [None]:
fig = px.histogram(data_frame=df1, x="area_code", color="churn", width=420, height=420)
fig.show()

- There does not seem to be much difference in churn rate among the **area_code** values.
- We may drop it later.

### international_plan vs. churn 

In [None]:
print(f'A customer with an international plan has a probability of {round(df1[df1["international_plan"]=="yes"]["churn"].mean()*100,2)} % churn.')
print()
print(f'A customer without an international plan has a probability of {round(df1[df1["international_plan"]=="no"]["churn"].mean()*100,2)} % churn.')

In [None]:
fig = px.histogram(data_frame=df1, x="international_plan", color="churn", width=420, height=420)
fig.show()

- Customers with an international plan are almost 4 times more likely to churn than those without one.

### voice_mail_plan vs. churn

In [None]:
print(f'A customer with a voice mail plan has a probability of {round(df1[df1["voice_mail_plan"]=="yes"]["churn"].mean()*100,2)} % churn.')
print()
print(f'A customer without a voice mail plan has a probability of {round(df1[df1["voice_mail_plan"]=="no"]["churn"].mean()*100,2)} % churn.')

In [None]:
fig = px.histogram(data_frame=df1, x="voice_mail_plan", color="churn", width=420, height=420)
fig.show()

- Customers without a voice mail plan are almost 2.5 times more likely to churn than those with one.
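
The comparisons in this section all repeat one computation: the mean of the 0/1 `churn` column within each category. Here is a dependency-free sketch of that idea on made-up rows; `churn_rate_by` is a hypothetical helper, not part of the notebook. On the real data, `df1.groupby("international_plan")["churn"].mean()` computes the same per-category rates.

```python
from collections import defaultdict

def churn_rate_by(rows, feature):
    """Return {category: churn rate in percent} for one categorical feature."""
    counts = defaultdict(lambda: [0, 0])  # category -> [churned, total]
    for row in rows:
        counts[row[feature]][0] += row["churn"]
        counts[row[feature]][1] += 1
    return {cat: round(100 * churned / total, 2)
            for cat, (churned, total) in counts.items()}

# Made-up rows for illustration only; churn is already encoded as 0/1.
toy_rows = [
    {"international_plan": "yes", "churn": 1},
    {"international_plan": "yes", "churn": 0},
    {"international_plan": "no", "churn": 0},
    {"international_plan": "no", "churn": 0},
    {"international_plan": "no", "churn": 1},
    {"international_plan": "no", "churn": 0},
]

rates = churn_rate_by(toy_rows, "international_plan")
print(rates)
```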

# Famous Trio and Imbalanced Data

- Now, let's look at **CatBoost**, **XGBoost**, and **LightGBM** and see how they handle imbalanced data internally.
- By allowing the training to focus more on the minority class, they do a good job even on imbalanced data.

- CatBoost, XGBoost, and LightGBM all provide a **scale_pos_weight** hyperparameter that adjusts training for imbalanced data.

- By default, **scale_pos_weight** is 1.

- With the default, the majority class and the minority class get the same weight, which suits balanced data. With imbalanced data, the story changes a bit.

- Formula for calculating the value of **scale_pos_weight** (number of majority samples divided by number of minority samples):
    - Number of non-churned (**majority**) customers: 3652
    - Number of churned (**minority**) customers: 598
    - **scale_pos_weight** = 3652 / 598, or about 6
- With this weight, an error on the minority class has about 6 times the impact, and gets about 6 times the correction, of an error on the majority class.
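
The ratio can be computed directly from the labels. The sketch below is plain Python and rebuilds the label list from the counts reported for this training set (598 churned customers out of 4250); `suggested_scale_pos_weight` is a hypothetical helper name.

```python
from collections import Counter

def suggested_scale_pos_weight(labels, positive=1):
    """Ratio of negative to positive samples, a common starting point."""
    counts = Counter(labels)
    n_pos = counts[positive]
    n_neg = len(labels) - n_pos
    return n_neg / n_pos

# Labels rebuilt from the counts reported for this training set:
# 598 churned customers out of 4250.
labels = [1] * 598 + [0] * (4250 - 598)

weight = suggested_scale_pos_weight(labels)
print(f"scale_pos_weight ~ {weight:.2f}")
```

The ratio is a starting point rather than a rule; the CatBoost and LightGBM baselines below pass `scale_pos_weight=5`, which is in the same range.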

**Note1**: If we use extreme values for **scale_pos_weight**, we can overfit the minority class and the model could make worse predictions.

**Note2**: While **CatBoost** and **LightGBM** can handle categorical features, **XGBoost** does not by default (recent versions offer opt-in support through `enable_categorical`). In this notebook, we convert categorical features before creating the XGBoost model.
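
Converting categorical features usually means one-hot encoding, which is what `pd.get_dummies` does in the XGBoost cells of this notebook. The sketch below shows the transformation in plain Python on made-up rows; `one_hot` is a hypothetical helper, and the `feature_value` column naming mirrors the default naming of `pd.get_dummies`.

```python
def one_hot(rows, feature):
    """Replace one categorical feature with 0/1 indicator columns."""
    categories = sorted({row[feature] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != feature}
        for cat in categories:
            new_row[f"{feature}_{cat}"] = int(row[feature] == cat)
        encoded.append(new_row)
    return encoded

# Made-up rows for illustration only.
toy_rows = [
    {"voice_mail_plan": "yes", "total_day_calls": 97},
    {"voice_mail_plan": "no", "total_day_calls": 110},
]

encoded_rows = one_hot(toy_rows, "voice_mail_plan")
print(encoded_rows[0])
```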

## CATBOOST

![image.png](attachment:c30320f4-9693-4166-b972-26b5c751919a.png)

- It is a boosting algorithm that was created by Yandex.
- It can handle both missing values and categorical values internally.

### CatBoost - scale_pos_weight = 5

In [None]:
numerical_1 = ['account_length', 'number_vmail_messages', 'total_day_minutes',
       'total_day_calls', 'total_eve_minutes', 'total_eve_calls',
       'total_night_minutes', 'total_night_calls', 'total_intl_minutes',
       'total_intl_calls', 'number_customer_service_calls']
numerical_1

In [None]:
accuracy= []
recall =[]
roc_auc= []
precision = []


df = pd.read_csv('../input/customer-churn-prediction-2020/train.csv')
df1 = df.copy()
le = LabelEncoder()
df1['churn']=le.fit_transform(df1['churn'])


#for i in numerical_1:
    #df1[i] = df1[i].astype(float)

    
X= df1.drop('churn', axis=1)
y= df1['churn']

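# np.float was removed in NumPy 1.24; the builtin float is what it aliased. Every non-float column is passed to CatBoost as categorical.
categorical_features_indices = np.where(X.dtypes != float)[0]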

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# With scale_pos_weight=5, minority class gets 5 times more impact and 5 times more correction than errors made on the majority class.
catboost_5 = CatBoostClassifier(verbose=False,random_state=0,scale_pos_weight=5)

catboost_5.fit(X_train, y_train,cat_features=categorical_features_indices,eval_set=(X_test, y_test))
y_pred = catboost_5.predict(X_test)

accuracy.append(round(accuracy_score(y_test, y_pred),4))
recall.append(round(recall_score(y_test, y_pred),4))
roc_auc.append(round(roc_auc_score(y_test, y_pred),4))
precision.append(round(precision_score(y_test, y_pred),4))

model_names = ['Catboost_adjusted_weight_5']
result_df1 = pd.DataFrame({'Accuracy':accuracy,'Recall':recall, 'Roc_Auc':roc_auc, 'Precision':precision}, index=model_names)
result_df1

In [None]:
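# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay.from_estimator replaces it
from sklearn.metrics import ConfusionMatrixDisplay
fig, ax = plt.subplots(figsize=(10, 6))
ConfusionMatrixDisplay.from_estimator(catboost_5, X_test, y_test, cmap=plt.cm.plasma, ax=ax);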
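
The confusion matrix holds everything needed for the accuracy, recall, and precision scores collected in the result table. The sketch below derives them from the four cells; the counts are hypothetical (chosen to add up to a 1275-row test split, 30% of 4250) and are not output from this notebook, and `scores_from_confusion` is an illustrative helper.

```python
def scores_from_confusion(tn, fp, fn, tp):
    """Derive the headline metrics from the four confusion-matrix cells."""
    return {
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
        "recall": tp / (tp + fn),        # caught churners / all churners
        "precision": tp / (tp + fp),     # true churners / predicted churners
    }

# Hypothetical counts for a 1275-row test split (30% of 4250).
scores = scores_from_confusion(tn=1050, fp=50, fn=35, tp=140)
print({name: round(value, 4) for name, value in scores.items()})
```

Raising `scale_pos_weight` typically moves errors from the `fn` cell to the `fp` cell, which trades precision for recall.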

### OPTUNA - Hyperparameter Tuning

In [None]:
def objective(trial):
    df = pd.read_csv('../input/customer-churn-prediction-2020/train.csv')
    df1 = df.copy()
    
    le = LabelEncoder()
    df1['churn']=le.fit_transform(df1['churn'])
    
    X= df1.drop('churn', axis=1)
    y= df1['churn']
    
    # np.float was removed in NumPy 1.24; the builtin float is what it aliased
    categorical_features_indices = np.where(X.dtypes != float)[0]
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    param = {
        "objective": "Logloss",
        "colsample_bylevel": trial.suggest_float("colsample_bylevel", 0.01, 0.1),
        "depth": trial.suggest_int("depth", 1, 12),
        "boosting_type": trial.suggest_categorical("boosting_type", ["Ordered", "Plain"]),
        "bootstrap_type": trial.suggest_categorical(
            "bootstrap_type", ["Bayesian", "Bernoulli", "MVS"]
        ),
        "used_ram_limit": "3gb",
    }

    if param["bootstrap_type"] == "Bayesian":
        param["bagging_temperature"] = trial.suggest_float("bagging_temperature", 0, 10)
    elif param["bootstrap_type"] == "Bernoulli":
        param["subsample"] = trial.suggest_float("subsample", 0.1, 1)

    cat_cls = CatBoostClassifier(verbose=False,random_state=0,scale_pos_weight=1.2, **param)

    cat_cls.fit(X_train, y_train, eval_set=[(X_test, y_test)], cat_features=categorical_features_indices,verbose=0, early_stopping_rounds=100)

    preds = cat_cls.predict(X_test)
    pred_labels = np.rint(preds)
    accuracy = accuracy_score(y_test, pred_labels)
    return accuracy


if __name__ == "__main__":
    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=100, timeout=600)

    print("Number of finished trials: {}".format(len(study.trials)))

    print("Best trial:")
    trial = study.best_trial

    print("  Value: {}".format(trial.value))

    print("  Params: ")
    for key, value in trial.params.items():
        print("    {}: {}".format(key, value))

- OK, let's train our **CatBoost** model with the new parameters.

In [None]:
accuracy= []
recall =[]
roc_auc= []
precision = []


df = pd.read_csv('../input/customer-churn-prediction-2020/train.csv')
df1 = df.copy()

#for target feature
le = LabelEncoder()
df1['churn']=le.fit_transform(df1['churn'])


X=df1.drop('churn', axis=1)
y=df1['churn']

#indices of the categorical columns (np.float was removed in NumPy 1.24, so compare against float)
categorical_features_indices = np.where(X.dtypes != float)[0]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

#scale_pos_weight is left at its default of 1 here, even though the data is imbalanced; only the tuned parameters are passed
#catboost_5 = CatBoostClassifier(verbose=False,random_state=0,scale_pos_weight=5)
catboost_5 = CatBoostClassifier(verbose=False,random_state=0,
                                 colsample_bylevel=0.091134936724785,
                                 depth=9,
                                 boosting_type="Ordered",
                                 bootstrap_type="MVS")

catboost_5.fit(X_train, y_train,cat_features=categorical_features_indices,eval_set=(X_test, y_test), early_stopping_rounds=100)
y_pred = catboost_5.predict(X_test)

accuracy.append(round(accuracy_score(y_test, y_pred),4))
recall.append(round(recall_score(y_test, y_pred),4))
roc_auc.append(round(roc_auc_score(y_test, y_pred),4))
precision.append(round(precision_score(y_test, y_pred),4))

model_names = ['Catboost_adjusted_weight_5_optuna']
result_df2 = pd.DataFrame({'Accuracy':accuracy,'Recall':recall, 'Roc_Auc':roc_auc, 'Precision':precision}, index=model_names)
result_df2

- With **OPTUNA** hyperparameters, we managed to increase our **Accuracy** score by 2%.
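
A note on the `Roc_Auc` column: the cells pass the hard 0/1 output of `predict` to `roc_auc_score`. With hard labels, ROC AUC reduces to the mean of the true positive rate and the true negative rate (balanced accuracy), so it says nothing about how well the model ranks customers. Passing predicted probabilities instead, for example `predict_proba(X_test)[:, 1]`, gives the usual AUC. The sketch below shows the difference in plain Python on made-up labels and scores; `roc_auc` is a hypothetical helper that implements the pairwise (Mann-Whitney) definition of AUC.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted churn probabilities, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
proba = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.2, 0.1]
hard = [int(p >= 0.5) for p in proba]  # hard labels at a 0.5 cut-off

# True positive rate and true negative rate of the hard labels.
tpr = sum(h for t, h in zip(y_true, hard) if t == 1) / y_true.count(1)
tnr = sum(1 - h for t, h in zip(y_true, hard) if t == 0) / y_true.count(0)

auc_from_labels = roc_auc(y_true, hard)
auc_from_proba = roc_auc(y_true, proba)
print(auc_from_labels, (tpr + tnr) / 2, auc_from_proba)
```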


![image.png](attachment:b3f5c277-ef75-4395-88af-c6653493bb3f.png)

## LightGBM

![image.png](attachment:f9f51bdc-d385-4a11-9bb4-d226f9760a62.png)

- It is a boosting algorithm that was developed by Microsoft.

### LightGBM - scale_pos_weight = 5

In [None]:
accuracy= []
recall =[]
roc_auc= []
precision = []


df = pd.read_csv('../input/customer-churn-prediction-2020/train.csv')
df1 = df.copy()
le = LabelEncoder()
df1['churn']=le.fit_transform(df1['churn'])

                 
df1= pd.get_dummies(df1)
X= df1.drop('churn', axis=1)
y= df1['churn']

for col in X.columns:
    col_type = X[col].dtype
    if col_type == 'object' or col_type.name == 'category':
        X[col] = X[col].astype('category')

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

lgbmc_5=LGBMClassifier(random_state=0,scale_pos_weight=5)

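# fit() no longer accepts verbose in LightGBM 4.x; log_evaluation(period=0) keeps training silent
lgbmc_5.fit(X_train, y_train, categorical_feature='auto', eval_set=[(X_test, y_test)], feature_name='auto', callbacks=[lgb.log_evaluation(period=0)])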

y_pred = lgbmc_5.predict(X_test)

accuracy.append(round(accuracy_score(y_test, y_pred),4))
recall.append(round(recall_score(y_test, y_pred),4))
roc_auc.append(round(roc_auc_score(y_test, y_pred),4))
precision.append(round(precision_score(y_test, y_pred),4))

model_names = ['LightGBM_adjusted_weight_5']
result_df3 = pd.DataFrame({'Accuracy':accuracy,'Recall':recall, 'Roc_Auc':roc_auc, 'Precision':precision}, index=model_names)
result_df3

With default parameters, we got an accuracy score of almost 0.96.

### OPTUNA - Hyperparameter Tuning

In [None]:
def objective(trial):
    df = pd.read_csv('../input/customer-churn-prediction-2020/train.csv')
    df1 = df.copy()
    le = LabelEncoder()
    df1['churn']=le.fit_transform(df1['churn'])
   
    
    X= df1.drop('churn', axis=1)
    y= df1['churn']
    
    # Mark text columns as pandas "category" so LightGBM treats them as categorical
    for col in X.columns:
        col_type = X[col].dtype
        if col_type == 'object' or col_type.name == 'category':
            X[col] = X[col].astype('category')

    # Split inside the objective so the trial trains on the category-typed features
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    param = {
        "objective": "binary",
        "metric": "binary_logloss",
        "verbosity": -1,
        "boosting_type": "dart",
        # num_leaves is suggested once; a second suggestion under the same name would be ignored
        "num_leaves": trial.suggest_int("num_leaves", 2, 2000),
        "max_depth": trial.suggest_int("max_depth", 3, 12),
        "lambda_l1": trial.suggest_float("lambda_l1", 1e-8, 10.0, log=True),
        "lambda_l2": trial.suggest_float("lambda_l2", 1e-8, 10.0, log=True),
        "feature_fraction": trial.suggest_float("feature_fraction", 0.4, 1.0),
        "bagging_fraction": trial.suggest_float("bagging_fraction", 0.4, 1.0),
        "bagging_freq": trial.suggest_int("bagging_freq", 1, 7),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
    }

    lgbmc_adj=lgb.LGBMClassifier(random_state=0,scale_pos_weight=5,**param)
    # fit() no longer accepts verbose or early_stopping_rounds in LightGBM 4.x; the early_stopping callback replaces them
    lgbmc_adj.fit(X_train, y_train, categorical_feature='auto', eval_set=[(X_test, y_test)], feature_name='auto',
                  callbacks=[lgb.early_stopping(stopping_rounds=100, verbose=False)])

    preds = lgbmc_adj.predict(X_test)
    pred_labels = np.rint(preds)
    accuracy = accuracy_score(y_test, pred_labels)
    return accuracy


if __name__ == "__main__":
    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=100)

    print("Number of finished trials: {}".format(len(study.trials)))

    print("Best trial:")
    trial = study.best_trial

    print("  Value: {}".format(trial.value))

    print("  Params: ")
    for key, value in trial.params.items():
        print("    {}: {}".format(key, value))

With parameters provided by **OPTUNA**, our accuracy score is almost 0.96.

In [None]:
accuracy= []
recall =[]
roc_auc= []
precision = []


df = pd.read_csv('../input/customer-churn-prediction-2020/train.csv')
df1 = df.copy()
le = LabelEncoder()
df1['churn']=le.fit_transform(df1['churn'])


X= df1.drop('churn', axis=1)
y= df1['churn']

#LightGBM treats a column as categorical when its dtype is 'category', so convert the object columns
for col in X.columns:
    col_type = X[col].dtype
    if col_type == 'object' or col_type.name == 'category':
        X[col] = X[col].astype('category')

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

lgbmc_5=lgb.LGBMClassifier(random_state=0,scale_pos_weight=5,
                           num_leaves=724,
                           max_depth=9,
                           lambda_l1=8.142384644362947e-06,
                           lambda_l2=3.432798202818561e-08,
                           feature_fraction=0.5164384666114301,
                           bagging_fraction=0.7323707200247135,
                           bagging_freq=5,
                           min_child_samples=5)

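#categorical_feature = 'auto' picks up every column with the 'category' dtype
#fit() no longer accepts verbose or early_stopping_rounds in LightGBM 4.x; the early_stopping callback replaces them
lgbmc_5.fit(X_train, y_train, categorical_feature='auto', eval_set=[(X_test, y_test)], feature_name='auto',
            callbacks=[lgb.early_stopping(stopping_rounds=100, verbose=False)])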

y_pred = lgbmc_5.predict(X_test)

accuracy.append(round(accuracy_score(y_test, y_pred),4))
recall.append(round(recall_score(y_test, y_pred),4))
roc_auc.append(round(roc_auc_score(y_test, y_pred),4))
precision.append(round(precision_score(y_test, y_pred),4))

model_names = ['LightGBM_adjusted_weight_5_optuna']
result_df4 = pd.DataFrame({'Accuracy':accuracy,'Recall':recall, 'Roc_Auc':roc_auc, 'Precision':precision}, index=model_names)
result_df4

With **OPTUNA** parameters in **LightGBM**, our accuracy score did not change much. 

## XGBoost

![image.png](attachment:775354cf-ce1c-47d3-9e91-5a24c8013cba.png)

### XGBoost - scale_pos_weight = 5

In [None]:
accuracy= []
recall =[]
roc_auc= []
precision = []


df = pd.read_csv("../input/customer-churn-prediction-2020/train.csv")
df1 = df.copy()
le = LabelEncoder()
df1['churn']=le.fit_transform(df1['churn'])

#Since XGBoost does not handle categorical values itself, we use get_dummies to convert categorical variables into numeric variables.
df1= pd.get_dummies(df1)
X= df1.drop('churn', axis=1)
y= df1['churn']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

xgbc_5 = XGBClassifier(random_state=0)

xgbc_5.fit(X_train, y_train)
y_pred = xgbc_5.predict(X_test)

accuracy.append(round(accuracy_score(y_test, y_pred),4))
recall.append(round(recall_score(y_test, y_pred),4))
roc_auc.append(round(roc_auc_score(y_test, y_pred),4))
precision.append(round(precision_score(y_test, y_pred),4))

model_names = ['XGBoost_adjusted_weight_5']
result_df5 = pd.DataFrame({'Accuracy':accuracy,'Recall':recall, 'Roc_Auc':roc_auc, 'Precision':precision}, index=model_names)
result_df5

With default parameters, we got an accuracy score of 0.95.

### OPTUNA - Hyperparameter Tuning

In [None]:
import numpy as np
import optuna

import sklearn.datasets
import sklearn.metrics
from sklearn.model_selection import train_test_split
import xgboost as xgb

def objective(trial):
    
    df = pd.read_csv("../input/customer-churn-prediction-2020/train.csv")
    df1 = df.copy()
    le = LabelEncoder()
    df1['churn']=le.fit_transform(df1['churn'])

    df1= pd.get_dummies(df1)
    X= df1.drop('churn', axis=1)
    y= df1['churn']
    
    #(data, target) = sklearn.datasets.load_breast_cancer(return_X_y=True)
    train_x, valid_x, train_y, valid_y = train_test_split(X, y, test_size=0.25)
    dtrain = xgb.DMatrix(train_x, label=train_y)
    dvalid = xgb.DMatrix(valid_x, label=valid_y)

    param = {
        "verbosity": 0,
        "objective": "binary:logistic",
        # use exact for small dataset.
        "tree_method": "exact",
        # defines booster, gblinear for linear functions.
        "booster": trial.suggest_categorical("booster", ["gbtree", "gblinear", "dart"]),
        # L2 regularization weight.
        "lambda": trial.suggest_float("lambda", 1e-8, 1.0, log=True),
        # L1 regularization weight.
        "alpha": trial.suggest_float("alpha", 1e-8, 1.0, log=True),
        # sampling ratio for training data.
        "subsample": trial.suggest_float("subsample", 0.2, 1.0),
        # sampling according to each tree.
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.2, 1.0),
    }

    if param["booster"] in ["gbtree", "dart"]:
        # maximum depth of the tree, signifies complexity of the tree.
        param["max_depth"] = trial.suggest_int("max_depth", 3, 9, step=2)
        # minimum child weight, larger the term more conservative the tree.
        param["min_child_weight"] = trial.suggest_int("min_child_weight", 2, 10)
        param["eta"] = trial.suggest_float("eta", 1e-8, 1.0, log=True)
        # defines how selective algorithm is.
        param["gamma"] = trial.suggest_float("gamma", 1e-8, 1.0, log=True)
        param["grow_policy"] = trial.suggest_categorical("grow_policy", ["depthwise", "lossguide"])

    if param["booster"] == "dart":
        param["sample_type"] = trial.suggest_categorical("sample_type", ["uniform", "weighted"])
        param["normalize_type"] = trial.suggest_categorical("normalize_type", ["tree", "forest"])
        param["rate_drop"] = trial.suggest_float("rate_drop", 1e-8, 1.0, log=True)
        param["skip_drop"] = trial.suggest_float("skip_drop", 1e-8, 1.0, log=True)

    bst = xgb.train(param, dtrain)
    preds = bst.predict(dvalid)
    pred_labels = np.rint(preds)
    accuracy = sklearn.metrics.accuracy_score(valid_y, pred_labels)
    return accuracy


if __name__ == "__main__":
    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=100, timeout=600)

    print("Number of finished trials: ", len(study.trials))
    print("Best trial:")
    trial = study.best_trial

    print("  Value: {}".format(trial.value))
    print("  Params: ")
    for key, value in trial.params.items():
        print("    {}: {}".format(key, value))

**OPTUNA** parameters give us a higher accuracy score (0.96).
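
The study above tunes the native `xgb.train` parameter names (`lambda`, `alpha`, `eta`), while the scikit-learn wrapper `XGBClassifier` documents them as `reg_lambda`, `reg_alpha`, and `learning_rate`. The sketch below shows one way to translate a parameter dictionary such as `study.best_trial.params` before handing it to the wrapper; `to_sklearn_params` is a hypothetical helper and the values are placeholders, not the study's output.

```python
# Native xgboost parameter names and their scikit-learn wrapper counterparts.
NATIVE_TO_SKLEARN = {
    "lambda": "reg_lambda",
    "alpha": "reg_alpha",
    "eta": "learning_rate",
}

def to_sklearn_params(native_params):
    """Rename native booster parameters; leave all other keys unchanged."""
    return {NATIVE_TO_SKLEARN.get(key, key): value
            for key, value in native_params.items()}

# Placeholder values standing in for study.best_trial.params.
best_params = {"booster": "gbtree", "lambda": 1e-8, "alpha": 5e-5,
               "eta": 0.05, "max_depth": 9}

sklearn_params = to_sklearn_params(best_params)
print(sklearn_params)
# The result could then be unpacked as XGBClassifier(random_state=0, **sklearn_params).
```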

In [None]:
from  xgboost import XGBClassifier
accuracy= []
recall =[]
roc_auc= []
precision = []


df = pd.read_csv("../input/customer-churn-prediction-2020/train.csv")
df1 = df.copy()
le = LabelEncoder()
df1['churn']=le.fit_transform(df1['churn'])

df1= pd.get_dummies(df1)
X= df1.drop('churn', axis=1)
y= df1['churn']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

xgbc_5 = XGBClassifier(random_state=0,
     booster="gbtree",
     # scikit-learn wrapper names: reg_lambda, reg_alpha, learning_rate (native: lambda, alpha, eta)
     reg_lambda=1.0747585763388536e-08,
     reg_alpha=4.888494937862174e-05,
     subsample=0.9424632124541714,
     colsample_bytree=0.9950004607929119,
     max_depth=9,
     min_child_weight=3,
     learning_rate=0.053153334432134325,
     gamma=0.0017328227799719943,
     grow_policy="lossguide")

xgbc_5.fit(X_train, y_train)
y_pred = xgbc_5.predict(X_test)

accuracy.append(round(accuracy_score(y_test, y_pred),4))
recall.append(round(recall_score(y_test, y_pred),4))
roc_auc.append(round(roc_auc_score(y_test, y_pred),4))
precision.append(round(precision_score(y_test, y_pred),4))

model_names = ['XGBoost_adjusted_weight_5_optuna']
result_df6 = pd.DataFrame({'Accuracy':accuracy,'Recall':recall, 'Roc_Auc':roc_auc, 'Precision':precision}, index=model_names)
result_df6

After applying **OPTUNA** parameters to our **XGBoost** model, we get a slightly higher score than the one with default parameters.

# Model Comparison

In [None]:
result_final= pd.concat([result_df1,result_df2,result_df3,result_df4,result_df5,result_df6],axis=0)
result_final

In [None]:
result_final.sort_values(by=['Accuracy'], ascending=True,inplace=True)
fig = px.bar(result_final, x='Accuracy', y=result_final.index,title='Model Comparison',height=600,labels={'index':'MODELS'})
fig.show()

# Conclusion

- We have developed models to classify churn cases.

- First, we carried out a detailed exploratory analysis.

- We decided which metric to use (**Accuracy**, since the author of the dataset required it).

- We looked in detail at the **CatBoost**, **LightGBM**, and **XGBoost** models.

- We tuned the hyperparameters of each model with **OPTUNA** to see the improvement.

![image.png](attachment:54868641-bc9b-4d60-9e7f-a283c4f130f0.png)