Test assignment: "Customer Churn"

Description

Customers have started leaving Beta-Bank. Every month. Not many, but it is noticeable. The bank's marketers
have calculated that retaining existing customers is cheaper than attracting new ones.
The task is to predict whether a customer will leave the bank in the near future. You are given
historical data on customer behavior and contract terminations with the bank.

Data source: https://www.kaggle.com/barelydedicated/bank-customer-churn-modeling

Instructions

1. Load and prepare the data
2. Explore the class balance and train a model without accounting for the imbalance
3. Improve the model's quality by accounting for the class imbalance
4. Run the final test

All transformations and modeling must be done in Python. Submit a notebook with
comments on each step and conclusions about the work done.
Choose the best model for this task yourself; comparing several approaches
is welcome. Any additional steps that improve the model's quality are also
welcome.

Data description.
Features:

- RowNumber – row index in the data
- CustomerId – unique customer identifier
- Surname – surname
- CreditScore – credit score
- Geography – country of residence
- Gender – gender
- Age – age
- Tenure – number of years the customer has been with the bank
- Balance – account balance
- NumOfProducts – number of bank products the customer uses
- HasCrCard – whether the customer has a credit card
- IsActiveMember – whether the customer is active
- EstimatedSalary – estimated salary

Target:
- Exited – whether the customer left

---

# 1. Loading and preparing data

In [None]:
import pandas as pd
import numpy as np

In [None]:
data=pd.read_csv('../input/bank-customer-churn-modeling/Churn_Modelling.csv', sep=',')
data.head(10)

In [None]:
data[data.Surname=='Hill']

In [None]:
len(pd.unique(data.Surname))

In [None]:
data=data.drop(['RowNumber','CustomerId', 'Surname'], axis='columns')

In [None]:
data.describe()

In [None]:
data.info()

In [None]:
def summary(data):
    print('Shape: ' , data.shape)
    return( pd.DataFrame({ "Dtypes ":data.dtypes , 
                           "NAs":data.isnull().sum() ,
                           "uniques":data.nunique() ,
                            "Levels":[ data[i].unique() for i in data.columns]}))

In [None]:
summary(data)

#### Let's check the data for outliers and remove them

In [None]:
import matplotlib.pyplot as plt

In [None]:
fig = plt.figure(figsize=(15,10))
ax1=fig.add_subplot(221)
ax2=fig.add_subplot(222)

g=ax1.hist(data['CreditScore'], bins=500, color='y', alpha=0.9)
g=ax2.boxplot(data['CreditScore'])

In [None]:
data=data.drop(data[data['CreditScore']<385].index)
len(data)
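The cutoff of 385 appears to be read off the boxplot. As a hedged alternative, the whisker limits that a default boxplot draws (1.5 × IQR beyond the quartiles) can be computed directly. The sketch below uses made-up scores, and `iqr_fences` is a hypothetical helper name, not part of the notebook.

```python
import numpy as np

def iqr_fences(values, k=1.5):
    """Return the (lower, upper) limits: quartiles extended by k * IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy credit scores (illustrative values, not the real dataset)
scores = np.array([350, 580, 600, 620, 650, 660, 700, 720, 750, 850])

low, high = iqr_fences(scores)
outliers = scores[(scores < low) | (scores > high)]
print(low, high, outliers)
```

On the real data, the same helper could be applied to `data['CreditScore']` or `data['Age']` to get reproducible thresholds instead of hand-picked ones.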

In [None]:
fig = plt.figure(figsize=(15,10))
ax1=fig.add_subplot(221)
ax2=fig.add_subplot(222)

g=ax1.hist(data['CreditScore'], bins=500, color='y', alpha=0.9)
g=ax2.boxplot(data['CreditScore'])

In [None]:
fig = plt.figure(figsize=(15,10))
ax1=fig.add_subplot(221)
ax2=fig.add_subplot(222)

g=ax1.hist(data['Age'], bins=500, color='y', alpha=0.9)
g=ax2.boxplot(data['Age'])

In [None]:
data=data.drop(data[data['Age']>60].index)
len(data)

In [None]:
fig = plt.figure(figsize=(15,10))
ax2=fig.add_subplot(221)
ax3=fig.add_subplot(222)

g2=ax2.hist(data['Age'], bins=500, color='y', alpha=0.9)
g3=ax3.boxplot(data['Age'])

In [None]:
fig = plt.figure(figsize=(15,10))
ax1=fig.add_subplot(221)
ax2=fig.add_subplot(222)

g=ax1.hist(data['Balance'], bins=500, color='y', alpha=0.9)
g=ax2.boxplot(data['Balance'])

In [None]:
len(data[data.Balance==0])

In [None]:
fig = plt.figure(figsize=(15,10))
ax1=fig.add_subplot(221)
ax2=fig.add_subplot(222)

g=ax1.hist(data['EstimatedSalary'], bins=1000, color='y', alpha=0.9)
g=ax2.boxplot(data['EstimatedSalary'])

##### Checking for normality of distribution

In [None]:
from scipy import stats

In [None]:
W, p = stats.shapiro(data.CreditScore.iloc[:5000])
print(W, p)

In [None]:
W, p = stats.shapiro(data.Age.iloc[:5000])
print(W, p)

In [None]:
W, p = stats.shapiro(data.EstimatedSalary.iloc[:5000])
print(W, p)

##### Correlation

In [None]:
import seaborn as sns

In [None]:
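# Geography and Gender are still strings here, so correlate numeric columns only
corr = data.select_dtypes(include='number').corr()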
ax = sns.heatmap(
    corr, 
    vmin=-1, vmax=1, center=0,
    cmap=sns.diverging_palette(20, 220, n=200),
    square=True
)
ax.set_xticklabels(
    ax.get_xticklabels(),
    rotation=45,
    horizontalalignment='right'
);

In [None]:
data.head(10)

##### Normalization

In [None]:
from sklearn import preprocessing

norm = preprocessing.StandardScaler()
norm.fit(data[['CreditScore','Age','Balance','EstimatedSalary','Tenure','NumOfProducts']])
N=norm.transform(data[['CreditScore','Age','Balance','EstimatedSalary','Tenure','NumOfProducts']])
N

In [None]:
data[['CreditScore','Age','Balance','EstimatedSalary','Tenure','NumOfProducts']]=N

In [None]:
data.head()
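The scaler above is fitted on the full dataset, so statistics of the future test rows influence the scaled training features. A minimal sketch of a stricter variant, fitting the scaler on the training part only, is shown below. The data is synthetic and the `_demo`, `X_tr` and `X_te` names are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two numeric features and a binary target
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50, scale=10, size=(200, 2))
y_demo = rng.integers(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Learn mean and standard deviation from the training rows only
scaler = StandardScaler().fit(X_tr)
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)  # test rows reuse the training statistics

# StandardScaler is a z-score with the population standard deviation (ddof=0)
manual = (X_tr - X_tr.mean(axis=0)) / X_tr.std(axis=0)
print(np.allclose(X_tr_s, manual))
```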

# 2. Exploring the class balance. Training models without accounting for the imbalance

In [None]:
data1 = pd.get_dummies(data, columns =['Gender', 'Geography'], drop_first=True)
data1.head()

In [None]:
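# Features: note that iloc[:, 2:] skips the first two columns (CreditScore and Age)
X = data1.iloc[:, 2:].drop(['Exited'], axis='columns')

# Target: column 8 of data1 is 'Exited'; selecting it by name is less fragile
Y = data1['Exited']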
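Positional slicing with `iloc` depends on the column order that `pd.get_dummies` produces: the untouched columns come first and the new indicator columns are appended at the end. A small sketch on a toy frame (the values are made up) shows the order and which levels `drop_first=True` removes.

```python
import pandas as pd

# Toy frame with one numeric and two categorical columns
toy = pd.DataFrame({
    "Age": [40, 35, 52],
    "Gender": ["Female", "Male", "Female"],
    "Geography": ["France", "Germany", "Spain"],
})

# drop_first=True drops the first level of each category
# (Female, France), which then acts as the baseline
encoded = pd.get_dummies(toy, columns=["Gender", "Geography"], drop_first=True)
print(list(encoded.columns))
```

Selecting features and the target by column name, rather than by position, avoids silently dropping or shifting columns if the preprocessing changes.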

### 2.1. Logistic regression

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, stratify=Y, train_size=(0.7), test_size=(0.3))

In [None]:
len(Y_test[Y_test==1])

In [None]:
len(Y_test[Y_test==0])

In [None]:
classifier=LogisticRegression()
classifier.fit(X_train, Y_train)

In [None]:
predicted_y = classifier.predict(X_test)
print('predicted_y:', predicted_y)
print('coef_:', classifier.coef_)
print('accuracy_score:',classifier.score(X_test, Y_test))

In [None]:
len(predicted_y[np.where(predicted_y==0)])

In [None]:
len(predicted_y[np.where(predicted_y==1)])

In [None]:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(Y_test, predicted_y)
tn, fp, fn, tp=cm.ravel()
print(cm)
print(tn, fp, fn, tp)

In [None]:
Re=tp/(tp+fn)
Pr=tp/(tp+fp)
Sp=tn/(tn+fp)
Bac=(Re+Sp)/2
F=2*Re*Pr/(Re+Pr)
print('Accuracy (log_reg):',classifier.score(X_test, Y_test))
print('Recall (log_reg):', Re)
print('Precision (log_reg):', Pr)
print('F-measure (log_reg):', F)
b=2   # Recall is prioritized: b>1 favors Recall, 0<b<1 favors Precision
F_b=(1+b**2)*Re*Pr/(Re+Pr*b**2)
print('F_2-measure (log_reg):', F_b)
print('Balanced accuracy (log_reg):', Bac)
print('Specificity_ (log_reg):', Sp)

##### Unlike accuracy, precision and recall focus on the positive class (churned clients), so they stay informative when the classes are imbalanced.

##### The F-measure is close to 0 because Recall is close to 0: the model misses most of the clients who actually leave (many false negatives).
- The F_2-measure, which weights Recall more heavily (b = 2), is very small.
- Precision is also poor: the model is only good at identifying "good" (0) clients.
- Specificity (Specificity_) is high because the model assigns most clients to the loyal class (a constant algorithm that always predicts 0 would score equally well on this metric).
- Taken together, these metrics indicate a low-quality model.
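The points above can be illustrated with a small worked example. The counts below are hypothetical (800 loyal and 200 churned clients), and `metrics_from_counts` is an illustrative helper that uses the same formulas as the cell above.

```python
def metrics_from_counts(tn, fp, fn, tp, beta=2.0):
    """Compute common binary metrics from confusion-matrix counts."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denom = beta ** 2 * precision + recall
    f_beta = (1 + beta ** 2) * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "specificity": specificity,
        "f_beta": f_beta,
    }

# A constant model that always predicts "stays" (0)
constant = metrics_from_counts(tn=800, fp=0, fn=200, tp=0)

# A weak model that catches only 40 of the 200 churned clients
weak = metrics_from_counts(tn=780, fp=20, fn=160, tp=40)

for name, m in [("constant", constant), ("weak", weak)]:
    print(name, {k: round(v, 3) for k, v in m.items()})
```

The constant model reaches 0.8 accuracy and perfect specificity while its recall and F_2 are zero, which is why accuracy and specificity alone say little about churn detection.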

##### Logistic regression with cross-validation

In [None]:
from sklearn.model_selection import cross_val_score, GridSearchCV, StratifiedKFold, cross_validate

In [None]:
clf_log = LogisticRegression()

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_validate(clf_log, X, Y, cv=cv, n_jobs=-1, scoring=['accuracy','precision','recall'])

print("Accuracy_test (log_reg): {}".format(scores['test_accuracy'].mean()), 
      "Recall_test (log_reg): {}".format(scores['test_recall'].mean()),
      "Precision_test (log_reg): {}".format(scores['test_precision'].mean()), sep='\n')

b=2   # b>1(Recall), 0<b<1(Precision)
F_b=(1+b**2)*scores['test_recall'].mean()*scores['test_precision'].mean()/(scores['test_recall'].mean()+scores['test_precision'].mean()*b**2)
print('F_2-measure_test (log_reg):', F_b)

##### The cross-validated metrics are no better than those from the single split
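The cell above derives F_2 from the mean precision and mean recall across folds, which is not the same as averaging the per-fold F_2 scores. One possible alternative, sketched here on synthetic data, is to pass an F_2 scorer to `cross_validate` directly. The `_demo` names and the `make_classification` settings are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic imbalanced data (roughly 80% / 20%)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=1
)

# Scorer that computes F_2 on each validation fold
f2_scorer = make_scorer(fbeta_score, beta=2)

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores_demo = cross_validate(
    LogisticRegression(max_iter=1000),
    X_demo,
    y_demo,
    cv=cv_demo,
    scoring={"recall": "recall", "precision": "precision", "f2": f2_scorer},
)

print("Mean per-fold F_2:", scores_demo["test_f2"].mean())
```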

### 2.2 RandomForest

##### Random forests are less sensitive to outliers and to deviations from normality than linear models, thanks to tree splits and the randomness built into the ensemble.

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
clf_forest = RandomForestClassifier(random_state=1, n_estimators=500, min_samples_split=10, min_samples_leaf=2)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_validate(clf_forest, X, Y, cv=cv, n_jobs=-1, scoring=['accuracy','precision','recall'], return_train_score=True)

print("Accuracy_test (Forest): {}".format(scores['test_accuracy'].mean()), 
      "Recall_test (Forest): {}".format(scores['test_recall'].mean()),
      "Precision_test (Forest): {}".format(scores['test_precision'].mean()), sep='\n')

b=2   # b>1(Recall), 0<b<1(Precision)
F_b=(1+b**2)*scores['test_recall'].mean()*scores['test_precision'].mean()/(scores['test_recall'].mean()+scores['test_precision'].mean()*b**2)
print('F_2-measure_test (Forest):', F_b)

In [None]:
clf_forest.fit(X_train, Y_train)
predicted_y = clf_forest.predict(X_test)
predicted_y

In [None]:
cm = confusion_matrix(Y_test, predicted_y)
tn, fp, fn, tp=cm.ravel()
print(cm)

##### Optimize prediction

In [None]:
# Hyperparameter optimization using RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

In [None]:
#parameters
params = {
    "n_estimators": [350, 400, 450],
    "min_samples_split": [6, 8, 10],
    "min_samples_leaf": [1, 2, 4]
}

In [None]:
random_search=RandomizedSearchCV(clf_forest, param_distributions=params, n_iter=5, scoring='roc_auc',n_jobs=-1, cv=cv,verbose=3)
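The grid above contains 3 × 3 × 3 = 27 combinations, while `n_iter=5` means the randomized search evaluates only five of them. The sketch below, using scikit-learn's `ParameterGrid` and `ParameterSampler`, makes that explicit. The name `params_demo` and the fixed `random_state` are assumptions added for reproducibility.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

params_demo = {
    "n_estimators": [350, 400, 450],
    "min_samples_split": [6, 8, 10],
    "min_samples_leaf": [1, 2, 4],
}

# Full grid size: every combination of the listed values
n_total = len(ParameterGrid(params_demo))

# The kind of sample a randomized search draws when n_iter=5
sampled = list(ParameterSampler(params_demo, n_iter=5, random_state=1))

print(n_total)
for candidate in sampled:
    print(candidate)
```

With a grid this small, an exhaustive `GridSearchCV` would also be feasible; the randomized search trades coverage for speed.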

In [None]:
random_search.fit(X_train,Y_train)

In [None]:
random_search.best_estimator_

In [None]:
random_search.best_params_

In [None]:
random_forest = RandomForestClassifier(min_samples_leaf=4, min_samples_split=10,
                       n_estimators=350, random_state=1)

In [None]:
from sklearn.model_selection import cross_val_score
score = cross_val_score(random_forest,X,Y,cv=10)
score.mean()

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, fbeta_score

In [None]:
random_forest.fit(X_train, Y_train)
Y_test_preds=random_forest.predict(X_test)

In [None]:
print('Accuracy (Forest): {0:.2f}'.format(accuracy_score(Y_test, Y_test_preds)))
print('Precision (Forest): {0:.2f}'.format(precision_score(Y_test, Y_test_preds)))
print('Recall (Forest): {0:.2f}'.format(recall_score(Y_test, Y_test_preds)))
print('F2 (Forest): {0:.2f}'.format(fbeta_score(Y_test, Y_test_preds, beta=2)))

### 2.3 XGBoost

In [None]:
from xgboost.sklearn import XGBClassifier

In [None]:
#parameters
params = {
    "learning_rate"    :[0.05,0.10,0.15,0.20,0.25,0.30],
    "max_depth"        :[ 3,4,5,6,8,10,12,15 ],
    "min_child_weight" :[ 1,3,5,7 ],
    "gamma"            :[ 0.0,0.1,0.2,0.3,0.4 ],
    "colsample_bytree" :[ 0.3, 0.4, 0.5, 0.7 ]
}

In [None]:
classifier = XGBClassifier()

##### Let's tune the model's hyperparameters

In [None]:
random_search=RandomizedSearchCV(classifier, param_distributions=params, n_iter=5, scoring='roc_auc',n_jobs=-1, cv=cv,verbose=3)

In [None]:
random_search.fit(X_train, Y_train)

In [None]:
random_search.best_estimator_

In [None]:
random_search.best_params_

In [None]:
# Parameters copied from the search result; missing=None, nthread, seed and silent
# are left out because recent xgboost releases reject or ignore them
classifier = XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.5, gamma=0.1,
              learning_rate=0.25, max_delta_step=0, max_depth=4,
              min_child_weight=5, n_estimators=100, n_jobs=1,
              objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
              subsample=1, verbosity=1)

In [None]:
score = cross_val_score(classifier,X,Y,cv=10)
score.mean()

In [None]:
classifier.fit(X_train, Y_train)
Y_test_preds=classifier.predict(X_test)

In [None]:
print('Accuracy (XGBoost): {0:.2f}'.format(accuracy_score(Y_test, Y_test_preds)))
print('Precision (XGBoost): {0:.2f}'.format(precision_score(Y_test, Y_test_preds)))
print('Recall (XGBoost): {0:.2f}'.format(recall_score(Y_test, Y_test_preds)))
print('F2 (XGBoost): {0:.2f}'.format(fbeta_score(Y_test, Y_test_preds, beta=2)))

# 3. Improving the quality of the model, taking into account the imbalance of classes

##### The data is strongly imbalanced, which can hurt predictions for the rare class.

##### Sampling methods: artificially duplicate observations from the rare class (oversampling), or discard some observations from the frequent class (undersampling).

In [None]:
data.head()

In [None]:
data.Exited[data.Exited==0].count()

In [None]:
data.Exited[data.Exited==1].count()

### 3.1.1 Logistic regression without cross-validation, accounting for the imbalance (balanced weights)

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, stratify=Y, train_size=(0.7), test_size=(0.3))

In [None]:
len(Y_test[Y_test==1])

In [None]:
len(Y_test[Y_test==0])

In [None]:
classifier=LogisticRegression(class_weight='balanced')
classifier.fit(X_train, Y_train)

In [None]:
predicted_y = classifier.predict(X_test)
print('predicted_y:', predicted_y)
print('coef_:', classifier.coef_)
print('accuracy_score:',classifier.score(X_test, Y_test))

In [None]:
len(predicted_y[np.where(predicted_y==0)])

In [None]:
len(predicted_y[np.where(predicted_y==1)])

In [None]:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(Y_test, predicted_y)
tn, fp, fn, tp=cm.ravel()
print(cm)
print(tn, fp, fn, tp)

In [None]:
Re=tp/(tp+fn)
Pr=tp/(tp+fp)
Sp=tn/(tn+fp)
Bac=(Re+Sp)/2
F=2*Re*Pr/(Re+Pr)
print('Accuracy (log_reg):',classifier.score(X_test, Y_test))
print('Recall (log_reg):', Re)
print('Precision (log_reg):', Pr)
print('F-measure (log_reg):', F)
b=2   # Recall is prioritized: b>1 favors Recall, 0<b<1 favors Precision
F_b=(1+b**2)*Re*Pr/(Re+Pr*b**2)
print('F_2-measure (log_reg):', F_b)
print('Balanced accuracy (log_reg):', Bac)
print('Specificity_ (log_reg):', Sp)

##### After class-weight balancing, the metrics are more even. Recall has increased and, with it, the F_2-measure.
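For reference, `class_weight='balanced'` in scikit-learn weights each class by `n_samples / (n_classes * np.bincount(y))`, so the rare class gets a proportionally larger weight. A minimal sketch on a made-up 80/20 label vector (the `_demo` names are illustrative):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy target: 80 loyal clients (0) and 20 churned clients (1)
y_demo = np.array([0] * 80 + [1] * 20)
classes = np.unique(y_demo)

# The "balanced" heuristic written out by hand
manual_weights = len(y_demo) / (len(classes) * np.bincount(y_demo))

# The same weights computed by scikit-learn
sk_weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)

print(manual_weights, sk_weights)
```

Here each churned client counts four times as much as a loyal one in the loss, which is what pushes the model toward higher recall.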

### 3.1.2 Logistic regression with cross-validation, accounting for the imbalance (balanced weights)

In [None]:
from sklearn.model_selection import cross_val_score, GridSearchCV, StratifiedKFold, cross_validate

In [None]:
clf_log = LogisticRegression(class_weight='balanced')

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_validate(clf_log, X, Y, cv=cv, n_jobs=-1, scoring=['accuracy','precision','recall'])

print("Accuracy_test (log_reg): {}".format(scores['test_accuracy'].mean()), 
      "Recall_test (log_reg): {}".format(scores['test_recall'].mean()),
      "Precision_test (log_reg): {}".format(scores['test_precision'].mean()), sep='\n')

b=2   # b>1(Recall), 0<b<1(Precision)
F_b=(1+b**2)*scores['test_recall'].mean()*scores['test_precision'].mean()/(scores['test_recall'].mean()+scores['test_precision'].mean()*b**2)
print('F_2-measure_test (log_reg):', F_b)

##### After class-weight balancing, the metrics are more even. Recall has increased and, with it, the F_2-measure.

### 3.1.3 RandomForest with cross-validation, accounting for the imbalance (balanced weights)

In [None]:
clf_forest = RandomForestClassifier(class_weight='balanced', random_state=1, n_estimators=500, min_samples_split=10, min_samples_leaf=2)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_validate(clf_forest, X, Y, cv=cv, n_jobs=-1, scoring=['accuracy','precision','recall'], return_train_score=True)

print("Accuracy_test (Forest): {}".format(scores['test_accuracy'].mean()), 
      "Recall_test (Forest): {}".format(scores['test_recall'].mean()),
      "Precision_test (Forest): {}".format(scores['test_precision'].mean()), sep='\n')

b=2   # b>1(Recall), 0<b<1(Precision)
F_b=(1+b**2)*scores['test_recall'].mean()*scores['test_precision'].mean()/(scores['test_recall'].mean()+scores['test_precision'].mean()*b**2)
print('F_2-measure_test (Forest):', F_b)

### 3.1.4 XGBoost with cross-validation, accounting for the imbalance (scale_pos_weight)

In [None]:
# Same parameters as before, but with scale_pos_weight=5 to up-weight the positive class;
# missing=None, nthread, seed and silent are left out because recent xgboost releases reject or ignore them
classifier = XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.5, gamma=0.1,
              learning_rate=0.25, max_delta_step=0, max_depth=4,
              min_child_weight=5, n_estimators=100, n_jobs=1,
              objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=5,
              subsample=1, verbosity=1)
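The value `scale_pos_weight=5` is hard-coded above. The XGBoost parameter documentation suggests the ratio of negative to positive instances as a typical starting point, which can be computed from the training labels. The helper name `suggested_scale_pos_weight` and the toy labels below are illustrative assumptions.

```python
def suggested_scale_pos_weight(labels):
    """Return the ratio of negative (0) to positive (1) labels."""
    positives = sum(1 for v in labels if v == 1)
    negatives = sum(1 for v in labels if v == 0)
    if positives == 0:
        raise ValueError("No positive examples; the ratio is undefined.")
    return negatives / positives

# Toy labels: 80 loyal clients and 20 churned clients
labels_demo = [0] * 80 + [1] * 20
ratio = suggested_scale_pos_weight(labels_demo)
print(ratio)
```

In the notebook, the same function could be applied to `Y_train` so that the weight follows the actual class ratio of the training split.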

In [None]:
score = cross_val_score(classifier,X,Y,cv=10)
score.mean()

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, fbeta_score

In [None]:
classifier.fit(X_train, Y_train)
Y_test_preds=classifier.predict(X_test)

In [None]:
print('Accuracy (XGBoost): {0:.2f}'.format(accuracy_score(Y_test, Y_test_preds)))
print('Precision (XGBoost): {0:.2f}'.format(precision_score(Y_test, Y_test_preds)))
print('Recall (XGBoost): {0:.2f}'.format(recall_score(Y_test, Y_test_preds)))
print('F2 (XGBoost): {0:.2f}'.format(fbeta_score(Y_test, Y_test_preds, beta=2)))

### 3.2.1 Logistic regression considering imbalance. Random undersampling and oversampling

In [None]:
num_0 = len(data1[data1['Exited']==0])
num_1 = len(data1[data1['Exited']==1])
print(num_0,num_1)

In [None]:
# oversampling
oversampled_data = pd.concat([ data1[data1['Exited']==0] , data1[data1['Exited']==1].sample(num_0, replace=True) ])
print(len(oversampled_data))

In [None]:
# undersampling
undersampled_data = pd.concat([data1[data1['Exited']==0].sample(num_1) , data1[data1['Exited']==1] ])
print(len(undersampled_data))

### Oversampling

In [None]:
X_o = oversampled_data.iloc[:, 2:].drop(['Exited'], axis='columns')
Y_o = oversampled_data.iloc[:, 8]

In [None]:
X_train_o, X_test_o, Y_train_o, Y_test_o = train_test_split(X_o, Y_o, stratify=Y_o, train_size=(0.7), test_size=(0.3))

In [None]:
classifier_o=LogisticRegression()
classifier_o.fit(X_train_o, Y_train_o)

#### Testing on oversampled_data 

In [None]:
predicted_y_o = classifier_o.predict(X_test_o)
print('predicted_y:', predicted_y_o)
print('coef_:', classifier_o.coef_)
print('accuracy_score:',classifier_o.score(X_test_o, Y_test_o))

In [None]:
cm_o = confusion_matrix(Y_test_o, predicted_y_o)
tn, fp, fn, tp=cm_o.ravel()
print(cm_o)

In [None]:
Re=tp/(tp+fn)
Pr=tp/(tp+fp)
Sp=tn/(tn+fp)
Bac=(Re+Sp)/2
F=2*Re*Pr/(Re+Pr)
print('Accuracy (log_reg):',classifier_o.score(X_test_o, Y_test_o))
print('Recall (log_reg):', Re)
print('Precision (log_reg):', Pr)
print('F-measure (log_reg):', F)
b=2   # Recall is prioritized: b>1 favors Recall, 0<b<1 favors Precision
F_b=(1+b**2)*Re*Pr/(Re+Pr*b**2)
print('F_2-measure (log_reg):', F_b)
print('Balanced accuracy (log_reg):', Bac)
print('Specificity_ (log_reg):', Sp)

#### Testing on data1 

In [None]:
predicted_y_o = classifier_o.predict(X_test)
print('predicted_y:', predicted_y_o)
print('coef_:', classifier_o.coef_)
print('accuracy_score:',classifier_o.score(X_test, Y_test))

In [None]:
cm_o = confusion_matrix(Y_test, predicted_y_o)
tn, fp, fn, tp=cm_o.ravel()
print(cm_o)

In [None]:
Re=tp/(tp+fn)
Pr=tp/(tp+fp)
Sp=tn/(tn+fp)
Bac=(Re+Sp)/2
F=2*Re*Pr/(Re+Pr)
print('Accuracy (log_reg):',classifier_o.score(X_test, Y_test))
print('Recall (log_reg):', Re)
print('Precision (log_reg):', Pr)
print('F-measure (log_reg):', F)
b=2   # Recall is prioritized: b>1 favors Recall, 0<b<1 favors Precision
F_b=(1+b**2)*Re*Pr/(Re+Pr*b**2)
print('F_2-measure (log_reg):', F_b)
print('Balanced accuracy (log_reg):', Bac)
print('Specificity_ (log_reg):', Sp)

### Undersampling

In [None]:
X_u = undersampled_data.iloc[:, 2:].drop(['Exited'], axis='columns')
Y_u = undersampled_data.iloc[:, 8]

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [None]:
X_train_u, X_test_u, Y_train_u, Y_test_u = train_test_split(X_u, Y_u, stratify=Y_u, train_size=(0.7), test_size=(0.3))

In [None]:
classifier_u=LogisticRegression()
classifier_u.fit(X_train_u, Y_train_u)

#### Testing on undersampled_data

In [None]:
predicted_y_u = classifier_u.predict(X_test_u)
print('predicted_y:', predicted_y_u)
print('coef_:', classifier_u.coef_)
print('accuracy_score:',classifier_u.score(X_test_u, Y_test_u))

In [None]:
cm_u = confusion_matrix(Y_test_u, predicted_y_u)
tn, fp, fn, tp=cm_u.ravel()
print(cm_u)

In [None]:
Re=tp/(tp+fn)
Pr=tp/(tp+fp)
Sp=tn/(tn+fp)
Bac=(Re+Sp)/2
F=2*Re*Pr/(Re+Pr)
print('Accuracy (log_reg):',classifier_u.score(X_test_u, Y_test_u))
print('Recall (log_reg):', Re)
print('Precision (log_reg):', Pr)
print('F-measure (log_reg):', F)
b=2   # Recall is prioritized: b>1 favors Recall, 0<b<1 favors Precision
F_b=(1+b**2)*Re*Pr/(Re+Pr*b**2)
print('F_2-measure (log_reg):', F_b)
print('Balanced accuracy (log_reg):', Bac)
print('Specificity_ (log_reg):', Sp)

#### Testing on data1

In [None]:
predicted_y_u = classifier_u.predict(X_test)
print('predicted_y:', predicted_y_u)
print('coef_:', classifier_u.coef_)
print('accuracy_score:',classifier_u.score(X_test, Y_test))

In [None]:
cm_u = confusion_matrix(Y_test, predicted_y_u)
tn, fp, fn, tp=cm_u.ravel()
print(cm_u)

In [None]:
Re=tp/(tp+fn)
Pr=tp/(tp+fp)
Sp=tn/(tn+fp)
Bac=(Re+Sp)/2
F=2*Re*Pr/(Re+Pr)
print('Accuracy (log_reg):',classifier_u.score(X_test, Y_test))
print('Recall (log_reg):', Re)
print('Precision (log_reg):', Pr)
print('F-measure (log_reg):', F)
b=2   # Recall is prioritized: b>1 favors Recall, 0<b<1 favors Precision
F_b=(1+b**2)*Re*Pr/(Re+Pr*b**2)
print('F_2-measure (log_reg):', F_b)
print('Balanced accuracy (log_reg):', Bac)
print('Specificity_ (log_reg):', Sp)

#### As a result of oversampling and undersampling, the quality of the logistic regression model has not changed much.
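One caveat about the "Testing on data1" results: the resampling is applied to all of `data1` before splitting, so rows from the original test split can also end up in the resampled training data. A minimal sketch of a leak-free variant, which splits first and oversamples only the training part, is given below. The toy frame and the `train_df`, `test_df` and `train_over` names are illustrative assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame: 80 loyal clients (0) and 20 churned clients (1)
toy = pd.DataFrame({"feature": range(100), "Exited": [0] * 80 + [1] * 20})

# 1. Split first, keeping the class ratio in both parts
train_df, test_df = train_test_split(
    toy, test_size=0.3, stratify=toy["Exited"], random_state=0
)

# 2. Oversample the minority class inside the training part only
majority = train_df[train_df["Exited"] == 0]
minority = train_df[train_df["Exited"] == 1]
train_over = pd.concat([
    majority,
    minority.sample(len(majority), replace=True, random_state=0),
])

# 3. The test part stays untouched and shares no rows with the training part
overlap = set(train_over.index) & set(test_df.index)
print(len(train_over), len(test_df), len(overlap))
```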

### 3.2.2 RandomForest considering the imbalance. Random undersampling and oversampling

#### Oversampling. Testing on oversampled_data 

In [None]:
clf_forest_o = RandomForestClassifier(random_state=1, n_estimators=500, min_samples_split=10, min_samples_leaf=2)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_validate(clf_forest_o, X_o, Y_o, cv=cv, n_jobs=-1, scoring=['accuracy','precision','recall'], return_train_score=True)

print("Accuracy_test (Forest): {}".format(scores['test_accuracy'].mean()), 
      "Recall_test (Forest): {}".format(scores['test_recall'].mean()),
      "Precision_test (Forest): {}".format(scores['test_precision'].mean()), sep='\n')

b=2   # b>1(Recall), 0<b<1(Precision)
F_b=(1+b**2)*scores['test_recall'].mean()*scores['test_precision'].mean()/(scores['test_recall'].mean()+scores['test_precision'].mean()*b**2)
print('F_2-measure_test (Forest):', F_b)

#### Oversampling. Testing on data1 

In [None]:
clf_forest_o.fit(X_o, Y_o)

In [None]:
predicted_y_o = clf_forest_o.predict(X_test)
print('predicted_y:', predicted_y_o)
print('accuracy_score:', clf_forest_o.score(X_test, Y_test))

In [None]:
from sklearn.metrics import confusion_matrix

cm_o = confusion_matrix(Y_test, predicted_y_o)
tn, fp, fn, tp=cm_o.ravel()
print(cm_o)

In [None]:
Re=tp/(tp+fn)
Pr=tp/(tp+fp)
Sp=tn/(tn+fp)
Bac=(Re+Sp)/2
F=2*Re*Pr/(Re+Pr)
print('Accuracy (Forest):',clf_forest_o.score(X_test, Y_test))
print('Recall (Forest):', Re)
print('Precision (Forest):', Pr)
print('F-measure (Forest):', F)
b=2   # Recall is prioritized: b>1 favors Recall, 0<b<1 favors Precision
F_b=(1+b**2)*Re*Pr/(Re+Pr*b**2)
print('F_2-measure (Forest):', F_b)
print('Balanced accuracy (Forest):', Bac)
print('Specificity_ (Forest):', Sp)

#### Undersampling. Testing on undersampled_data

In [None]:
clf_forest_u = RandomForestClassifier(random_state=1, n_estimators=500, min_samples_split=10, min_samples_leaf=2)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_validate(clf_forest_u, X_u, Y_u, cv=cv, n_jobs=-1, scoring=['accuracy','precision','recall'], return_train_score=True)

print("Accuracy_test (Forest): {}".format(scores['test_accuracy'].mean()), 
      "Recall_test (Forest): {}".format(scores['test_recall'].mean()),
      "Precision_test (Forest): {}".format(scores['test_precision'].mean()), sep='\n')

b=2   # b>1(Recall), 0<b<1(Precision)
F_b=(1+b**2)*scores['test_recall'].mean()*scores['test_precision'].mean()/(scores['test_recall'].mean()+scores['test_precision'].mean()*b**2)
print('F_2-measure_test (Forest):', F_b)

#### Undersampling. Testing on data1

In [None]:
clf_forest_u.fit(X_u, Y_u)

In [None]:
predicted_y_u = clf_forest_u.predict(X_test)
print('predicted_y:', predicted_y_u)
print('accuracy_score:', clf_forest_u.score(X_test, Y_test))

In [None]:
cm_u = confusion_matrix(Y_test, predicted_y_u)
tn, fp, fn, tp=cm_u.ravel()
print(cm_u)

In [None]:
Re=tp/(tp+fn)
Pr=tp/(tp+fp)
Sp=tn/(tn+fp)
Bac=(Re+Sp)/2
F=2*Re*Pr/(Re+Pr)
print('Accuracy (Forest):',clf_forest_u.score(X_test, Y_test))
print('Recall (Forest):', Re)
print('Precision (Forest):', Pr)
print('F-measure (Forest):', F)
b=2   # Recall is prioritized: b>1 favors Recall, 0<b<1 favors Precision
F_b=(1+b**2)*Re*Pr/(Re+Pr*b**2)
print('F_2-measure (Forest):', F_b)
print('Balanced accuracy (Forest):', Bac)
print('Specificity_ (Forest):', Sp)

### 3.2.3 XGBoost considering the imbalance. Random undersampling and oversampling

#### Oversampling. Testing on oversampled_data 

In [None]:
#parameters
params = {
    "learning_rate"    :[0.05,0.10,0.15,0.20,0.25,0.30],
    "max_depth"        :[ 3,4,5,6,8,10,12,15 ],
    "min_child_weight" :[ 1,3,5,7 ],
    "gamma"            :[ 0.0,0.1,0.2,0.3,0.4 ],
    "colsample_bytree" :[ 0.3, 0.4, 0.5, 0.7 ]
}

In [None]:
classifier = XGBClassifier()

##### Let's tune the model's hyperparameters

In [None]:
random_search=RandomizedSearchCV(classifier, param_distributions=params, n_iter=5, scoring='roc_auc',n_jobs=-1, cv=cv,verbose=3)

In [None]:
random_search.fit(X_o,Y_o)

In [None]:
random_search.best_estimator_

In [None]:
random_search.best_params_

In [None]:
# Parameters copied from the search result; missing=None, nthread, seed and silent
# are left out because recent xgboost releases reject or ignore them
classifier = XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.5, gamma=0.3,
              learning_rate=0.15, max_delta_step=0, max_depth=10,
              min_child_weight=5, n_estimators=100, n_jobs=1,
              objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
              subsample=1, verbosity=1)

In [None]:
from sklearn.model_selection import cross_val_score
score = cross_val_score(classifier,X_o,Y_o,cv=10)
score.mean()

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, fbeta_score

In [None]:
classifier.fit(X_train_o, Y_train_o)
Y_test_preds=classifier.predict(X_test_o)

In [None]:
print('Accuracy (XGBoost): {0:.2f}'.format(accuracy_score(Y_test_o, Y_test_preds)))
print('Precision (XGBoost): {0:.2f}'.format(precision_score(Y_test_o, Y_test_preds)))
print('Recall (XGBoost): {0:.2f}'.format(recall_score(Y_test_o, Y_test_preds)))
print('F2 (XGBoost): {0:.2f}'.format(fbeta_score(Y_test_o, Y_test_preds, beta=2)))

#### Oversampling. Testing on data1 

In [None]:
classifier.fit(X_train_o, Y_train_o)
Y_test_preds=classifier.predict(X_test)

In [None]:
print('Accuracy (XGBoost): {0:.2f}'.format(accuracy_score(Y_test, Y_test_preds)))
print('Precision (XGBoost): {0:.2f}'.format(precision_score(Y_test, Y_test_preds)))
print('Recall (XGBoost): {0:.2f}'.format(recall_score(Y_test, Y_test_preds)))
print('F2 (XGBoost): {0:.2f}'.format(fbeta_score(Y_test, Y_test_preds, beta=2)))

#### Undersampling. Testing on undersampled_data

In [None]:
#parameters
params = {
    "learning_rate"    :[0.05,0.10,0.15,0.20,0.25,0.30],
    "max_depth"        :[ 3,4,5,6,8,10,12,15 ],
    "min_child_weight" :[ 1,3,5,7 ],
    "gamma"            :[ 0.0,0.1,0.2,0.3,0.4 ],
    "colsample_bytree" :[ 0.3, 0.4, 0.5, 0.7 ]
}

##### Let's tune the model's hyperparameters

In [None]:
random_search.fit(X_u,Y_u)

In [None]:
random_search.best_estimator_

In [None]:
random_search.best_params_

In [None]:
# Parameters copied from the search result; missing=None, nthread, seed and silent
# are left out because recent xgboost releases reject or ignore them
classifier = XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.7, gamma=0.3,
              learning_rate=0.1, max_delta_step=0, max_depth=4,
              min_child_weight=7, n_estimators=100, n_jobs=1,
              objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
              subsample=1, verbosity=1)

In [None]:
score = cross_val_score(classifier,X_u,Y_u,cv=10)
score.mean()

In [None]:
classifier.fit(X_train_u, Y_train_u)
Y_test_preds=classifier.predict(X_test_u)

In [None]:
print('Accuracy (XGBoost): {0:.2f}'.format(accuracy_score(Y_test_u, Y_test_preds)))
print('Precision (XGBoost): {0:.2f}'.format(precision_score(Y_test_u, Y_test_preds)))
print('Recall (XGBoost): {0:.2f}'.format(recall_score(Y_test_u, Y_test_preds)))
print('F2 (XGBoost): {0:.2f}'.format(fbeta_score(Y_test_u, Y_test_preds, beta=2)))

#### Undersampling. Testing on data1

In [None]:
classifier.fit(X_train_u, Y_train_u)
Y_test_preds=classifier.predict(X_test)

In [None]:
print('Accuracy (XGBoost): {0:.2f}'.format(accuracy_score(Y_test, Y_test_preds)))
print('Precision (XGBoost): {0:.2f}'.format(precision_score(Y_test, Y_test_preds)))
print('Recall (XGBoost): {0:.2f}'.format(recall_score(Y_test, Y_test_preds)))
print('F2 (XGBoost): {0:.2f}'.format(fbeta_score(Y_test, Y_test_preds, beta=2)))

### 3.3.1 Logistic regression considering imbalance. Oversampling with SMOTE and Undersampling with Tomek Links

#### Oversampling with SMOTE:
SMOTE creates synthetic minority-class elements close to existing ones by interpolating between a minority sample and one of its nearest minority-class neighbors.
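The interpolation idea can be sketched in a few lines of NumPy. This is a simplified illustration, not the `imblearn` implementation: `smote_like_sample` is a hypothetical helper, and the four minority points are made up.

```python
import numpy as np

def smote_like_sample(minority, k=2, n_new=5, seed=0):
    """Create n_new synthetic points between minority samples and their neighbors."""
    rng = np.random.default_rng(seed)
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(minority))            # pick a random minority sample
        dists = np.linalg.norm(minority - minority[i], axis=1)
        neighbors = np.argsort(dists)[1:k + 1]     # k nearest, skipping the sample itself
        j = rng.choice(neighbors)                  # pick one neighbor at random
        gap = rng.random()                         # position along the segment, in [0, 1)
        new_points.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.array(new_points)

# Four minority-class points at the corners of the unit square
minority_demo = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_sample(minority_demo)
print(synthetic)
```

Every synthetic point lies on a segment between two existing minority points, so no new region of feature space is invented; the minority class is only densified.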

In [None]:
from imblearn.over_sampling import SMOTE

In [None]:
smote = SMOTE(sampling_strategy='minority')
X_sm, Y_sm = smote.fit_resample(X, Y)  # fit_sample was removed from imbalanced-learn; fit_resample replaces it

In [None]:
len(Y_sm[Y_sm==0])

In [None]:
len(X_sm)

In [None]:
X_train_sm, X_test_sm, Y_train_sm, Y_test_sm = train_test_split(X_sm, Y_sm, random_state=0, stratify=Y_sm, train_size=(0.7), test_size=(0.3))

In [None]:
len(Y_test_sm[Y_test_sm==1])

In [None]:
len(Y_test_sm[Y_test_sm==0])

In [None]:
classifier=LogisticRegression()
classifier.fit(X_train_sm, Y_train_sm)

In [None]:
predicted_y = classifier.predict(X_test_sm)
print('predicted_y:', predicted_y)
print('coef_:', classifier.coef_)
print('accuracy_score:',classifier.score(X_test_sm,Y_test_sm))

In [None]:
cm = confusion_matrix(Y_test_sm, predicted_y)
print(cm)

In [None]:
tn, fp, fn, tp=cm.ravel()
Re=tp/(tp+fn)
Pr=tp/(tp+fp)
Sp=tn/(tn+fp)
Bac=(Re+Sp)/2   # balanced accuracy for this confusion matrix
F=2*Re*Pr/(Re+Pr)
print(tn, fp, fn, tp)
print('Accuracy_score (log_reg):',classifier.score(X_test_sm,Y_test_sm))
print('Recall (log_reg):', Re)
print('Precision (log_reg):', Pr)
print('F-measure (log_reg):', F)
b=2   # Recall is prioritized: b>1 favors Recall, 0<b<1 favors Precision
F_b=(1+b**2)*Re*Pr/(Re+Pr*b**2)
print('F_2-measure (log_reg):', F_b)
print('Balanced accuracy (log_reg):', Bac)
print('Specificity_ (log_reg):', Sp)

#### Undersampling using Tomek Links:

#### Tomek links are pairs of nearby elements from different classes, where each element is the other's nearest neighbor. The algorithm removes the majority-class element of each pair, which cleans up the class boundary and can help the classifier.
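To make the definition concrete, here is a minimal NumPy sketch that finds Tomek links as mutual nearest neighbors with different labels. It is a simplified illustration rather than the `imblearn` implementation; `tomek_links` is a hypothetical helper and the five one-dimensional points are made up.

```python
import numpy as np

def tomek_links(X, y):
    """Return index pairs (i, j), i < j, that are mutual nearest neighbors with different labels."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)      # a point is not its own neighbor
    nearest = dists.argmin(axis=1)       # nearest neighbor of every point
    links = []
    for i, j in enumerate(nearest):
        if nearest[j] == i and y[i] != y[j] and i < j:
            links.append((i, int(j)))
    return links

# Class 0 is the majority (3 points), class 1 the minority (2 points)
X_demo = np.array([[0.0], [0.5], [1.0], [1.2], [3.0]])
y_demo = np.array([0, 0, 0, 1, 1])

links = tomek_links(X_demo, y_demo)

# With a "majority" strategy, the class-0 member of each link is removed
to_drop = [i if y_demo[i] == 0 else j for i, j in links]
print(links, to_drop)
```

Points 2 and 3 sit right next to each other across the class boundary, so they form the only link, and the majority-class point (index 2) is the one that would be dropped.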

In [None]:
from imblearn.under_sampling import TomekLinks

In [None]:
tl = TomekLinks(sampling_strategy ='majority')
X_tl, Y_tl = tl.fit_resample(X, Y)  # fit_sample was removed from imbalanced-learn; fit_resample replaces it

In [None]:
len(Y_tl[Y_tl==0])

In [None]:
len(Y_tl[Y_tl==1])

In [None]:
X_train_tl, X_test_tl, Y_train_tl, Y_test_tl = train_test_split(X_tl, Y_tl, random_state=0, stratify=Y_tl, train_size=(0.7), test_size=(0.3))

In [None]:
len(Y_test_tl[Y_test_tl==1])

In [None]:
len(Y_test_tl[Y_test_tl==0])

In [None]:
classifier=LogisticRegression()
classifier.fit(X_train_tl, Y_train_tl)

In [None]:
predicted_y = classifier.predict(X_test_tl)
print('predicted_y:', predicted_y)
print('coef_:', classifier.coef_)
print('accuracy_score:',classifier.score(X_test_tl,Y_test_tl))

In [None]:
len(predicted_y[np.where(predicted_y==0)])

In [None]:
len(predicted_y[np.where(predicted_y==1)])

In [None]:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(Y_test_tl, predicted_y)
tn, fp, fn, tp=cm.ravel()
print(cm, tn, fp, fn, tp)

In [None]:
tn, fp, fn, tp=cm.ravel()
Re=tp/(tp+fn)   # recall (sensitivity)
Pr=tp/(tp+fp)   # precision
Sp=tn/(tn+fp)   # specificity
Bac=(Re+Sp)/2   # balanced accuracy
F=2*Re*Pr/(Re+Pr)
print(tn, fp, fn, tp)
print('accuracy_score (log_reg):',classifier.score(X_test_tl,Y_test_tl))
print('Recall (log_reg):', Re)
print('Precision (log_reg):', Pr)
print('F-measure (log_reg):', F)
b=2   # priority to Recall: b>1 favors Recall, 0<b<1 favors Precision
F_b=(1+b**2)*Re*Pr/(Re+Pr*b**2)
print('F_2-measure (log_reg):', F_b)
print('Balanced accuracy (log_reg):', Bac)
print('Specificity (log_reg):', Sp)

### 3.3.2 RandomForest considering the imbalance. Oversampling with SMOTE and Undersampling with Tomek Links

In [None]:
clf_forest_sm = RandomForestClassifier(random_state=1, n_estimators=500, min_samples_split=10, min_samples_leaf=2)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_validate(clf_forest_sm, X_sm, Y_sm, cv=cv, n_jobs=-1, scoring=['accuracy','precision','recall'], return_train_score=True)

print("Accuracy_test (Forest): {}".format(scores['test_accuracy'].mean()), 
      "Recall_test (Forest): {}".format(scores['test_recall'].mean()),
      "Precision_test (Forest): {}".format(scores['test_precision'].mean()), sep='\n')

b=2   # b>1(Recall), 0<b<1(Precision)
F_b=(1+b**2)*scores['test_recall'].mean()*scores['test_precision'].mean()/(scores['test_recall'].mean()+scores['test_precision'].mean()*b**2)
print('F_2-measure_test (Forest):', F_b)
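
The cell above computes F_2 from the fold-averaged precision and recall. That value is generally not equal to the average of the per-fold F_2 scores, because F-beta is not linear in its arguments. The sketch below illustrates the gap with hypothetical two-fold scores; the numbers are invented for the example and are not output of this notebook.

```python
def f_beta(precision, recall, beta=2.0):
    """F-beta score from precision and recall."""
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

def mean(values):
    return sum(values) / len(values)

# Hypothetical per-fold scores, chosen to make the difference visible.
fold_precision = [0.9, 0.5]
fold_recall = [0.5, 0.9]

# Variant used in the cell above: F_2 of the averaged precision and recall.
f_of_means = f_beta(mean(fold_precision), mean(fold_recall))

# Alternative: compute F_2 inside every fold, then average.
mean_of_f = mean([f_beta(p, r) for p, r in zip(fold_precision, fold_recall)])

print('F_2 of mean precision/recall: {:.4f}'.format(f_of_means))
print('Mean of per-fold F_2:         {:.4f}'.format(mean_of_f))
```

If per-fold F_2 is preferred, one option is to pass a scorer built with `make_scorer(fbeta_score, beta=2)` from `sklearn.metrics` in the `scoring` argument of `cross_validate`.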

#### Oversampling with SMOTE. Testing on data1

In [None]:
clf_forest_sm.fit(X_sm, Y_sm)

In [None]:
predicted_y_sm = clf_forest_sm.predict(X_test)
print('predicted_y:', predicted_y_sm)
print('accuracy_score:', clf_forest_sm.score(X_test, Y_test))

In [None]:
cm_sm = confusion_matrix(Y_test, predicted_y_sm)
tn, fp, fn, tp=cm_sm.ravel()
print(cm_sm)

In [None]:
Re=tp/(tp+fn)
Pr=tp/(tp+fp)
Sp=tn/(tn+fp)
Bac=(Re+Sp)/2
F=2*Re*Pr/(Re+Pr)
print('Accuracy (Forest):',clf_forest_sm.score(X_test, Y_test))
print('Recall (Forest):', Re)
print('Precision (Forest):', Pr)
print('F-measure (Forest):', F)
b=2   # priority to Recall: b>1 favors Recall, 0<b<1 favors Precision
F_b=(1+b**2)*Re*Pr/(Re+Pr*b**2)
print('F_2-measure (Forest):', F_b)
print('Balanced accuracy (Forest):', Bac)
print('Specificity_ (Forest):', Sp)

#### Undersampling using Tomek Links. Testing on Tomek Links-data

In [None]:
clf_forest_tl = RandomForestClassifier(random_state=1, n_estimators=500, min_samples_split=10, min_samples_leaf=2)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_validate(clf_forest_tl, X_tl, Y_tl, cv=cv, n_jobs=-1, scoring=['accuracy','precision','recall'], return_train_score=True)

print("Accuracy_test (Forest): {}".format(scores['test_accuracy'].mean()), 
      "Recall_test (Forest): {}".format(scores['test_recall'].mean()),
      "Precision_test (Forest): {}".format(scores['test_precision'].mean()), sep='\n')

b=2   # b>1(Recall), 0<b<1(Precision)
F_b=(1+b**2)*scores['test_recall'].mean()*scores['test_precision'].mean()/(scores['test_recall'].mean()+scores['test_precision'].mean()*b**2)
print('F_2-measure_test (Forest):', F_b)

In [None]:
clf_forest_tl.fit(X_tl, Y_tl)

In [None]:
predicted_y_tl = clf_forest_tl.predict(X_test)
print('predicted_y:', predicted_y_tl)
print('accuracy_score:', clf_forest_tl.score(X_test, Y_test))

In [None]:
cm_tl = confusion_matrix(Y_test, predicted_y_tl)
tn, fp, fn, tp=cm_tl.ravel()
print(cm_tl)

In [None]:
Re=tp/(tp+fn)
Pr=tp/(tp+fp)
Sp=tn/(tn+fp)
Bac=(Re+Sp)/2
F=2*Re*Pr/(Re+Pr)
print('Accuracy (Forest):',clf_forest_tl.score(X_test, Y_test))
print('Recall (Forest):', Re)
print('Precision (Forest):', Pr)
print('F-measure (Forest):', F)
b=2   # priority to Recall: b>1 favors Recall, 0<b<1 favors Precision
F_b=(1+b**2)*Re*Pr/(Re+Pr*b**2)
print('F_2-measure (Forest):', F_b)
print('Balanced accuracy (Forest):', Bac)
print('Specificity_ (Forest):', Sp)

### 3.3.3 XGBoost considering the imbalance. Oversampling with SMOTE and Undersampling with Tomek Links

In [None]:
#parameters
params = {
    "learning_rate"    :[0.05,0.10,0.15,0.20,0.25,0.30],
    "max_depth"        :[ 3,4,5,6,8,10,12,15 ],
    "min_child_weight" :[ 1,3,5,7 ],
    "gamma"            :[ 0.0,0.1,0.2,0.3,0.4 ],
    "colsample_bytree" :[ 0.3, 0.4, 0.5, 0.7 ]
}

In [None]:
classifier = XGBClassifier()

##### Let's tune the model's hyperparameters with a randomized search

In [None]:
random_search=RandomizedSearchCV(classifier, param_distributions=params, n_iter=5, scoring='roc_auc',n_jobs=-1, cv=cv,verbose=3)
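
To put `n_iter=5` in perspective, the sketch below counts how many combinations the `params` grid contains and draws five of them at random. It only illustrates the sampling idea in plain Python. It is not how `RandomizedSearchCV` is implemented, and the `random.Random(0)` seed is an arbitrary choice for the example. Unlike this sketch, scikit-learn samples without replacement when every parameter is given as a list.

```python
import random
from math import prod

params = {
    "learning_rate":    [0.05, 0.10, 0.15, 0.20, 0.25, 0.30],
    "max_depth":        [3, 4, 5, 6, 8, 10, 12, 15],
    "min_child_weight": [1, 3, 5, 7],
    "gamma":            [0.0, 0.1, 0.2, 0.3, 0.4],
    "colsample_bytree": [0.3, 0.4, 0.5, 0.7],
}

# Size of the full grid: product of the number of values per parameter.
n_combinations = prod(len(values) for values in params.values())

# Draw five random candidates (seed chosen arbitrarily for reproducibility).
rng = random.Random(0)
n_iter = 5
sampled = [{name: rng.choice(values) for name, values in params.items()}
           for _ in range(n_iter)]

print('combinations in the grid:', n_combinations)
print('share explored with n_iter={}: {:.2%}'.format(n_iter, n_iter / n_combinations))
for candidate in sampled:
    print(candidate)
```

Five candidates cover only a small part of the grid, so a larger `n_iter` may find better settings at the cost of longer run time.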

In [None]:
random_search.fit(X_sm,Y_sm)

In [None]:
random_search.best_estimator_

In [None]:
random_search.best_params_

In [None]:
# The None-valued arguments (missing, nthread, seed, silent) are omitted:
# nthread, seed and silent are deprecated in recent xgboost releases.
classifier = XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.5, gamma=0.3,
              learning_rate=0.15, max_delta_step=0, max_depth=10,
              min_child_weight=5, n_estimators=100, n_jobs=1,
              objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
              subsample=1, verbosity=1)

In [None]:
score = cross_val_score(classifier,X_sm,Y_sm,cv=10)
score.mean()

In [None]:
classifier.fit(X_train_sm, Y_train_sm)
Y_test_preds=classifier.predict(X_test_sm)

In [None]:
print('Accuracy (XGBoost): {0:.2f}'.format(accuracy_score(Y_test_sm, Y_test_preds)))
print('Precision (XGBoost): {0:.2f}'.format(precision_score(Y_test_sm, Y_test_preds)))
print('Recall (XGBoost): {0:.2f}'.format(recall_score(Y_test_sm, Y_test_preds)))
print('F2 (XGBoost): {0:.2f}'.format(fbeta_score(Y_test_sm, Y_test_preds, beta=2)))

#### Oversampling. Testing on data1 

In [None]:
classifier.fit(X_train_sm, Y_train_sm)
Y_test_preds=classifier.predict(X_test)

In [None]:
print('Accuracy (XGBoost): {0:.2f}'.format(accuracy_score(Y_test, Y_test_preds)))
print('Precision (XGBoost): {0:.2f}'.format(precision_score(Y_test, Y_test_preds)))
print('Recall (XGBoost): {0:.2f}'.format(recall_score(Y_test, Y_test_preds)))
print('F2 (XGBoost): {0:.2f}'.format(fbeta_score(Y_test, Y_test_preds, beta=2)))

#### Undersampling using Tomek Links. Testing on Tomek Links-data

In [None]:
#parameters
params = {
    "learning_rate"    :[0.05,0.10,0.15,0.20,0.25,0.30],
    "max_depth"        :[ 3,4,5,6,8,10,12,15 ],
    "min_child_weight" :[ 1,3,5,7 ],
    "gamma"            :[ 0.0,0.1,0.2,0.3,0.4 ],
    "colsample_bytree" :[ 0.3, 0.4, 0.5, 0.7 ]
}

In [None]:
classifier = XGBClassifier()

##### Let's tune the model's hyperparameters with a randomized search

In [None]:
random_search=RandomizedSearchCV(classifier, param_distributions=params, n_iter=5, scoring='roc_auc',n_jobs=-1, cv=cv,verbose=3)

In [None]:
random_search.fit(X_tl,Y_tl)

In [None]:
random_search.best_estimator_

In [None]:
random_search.best_params_

In [None]:
# The None-valued arguments (missing, nthread, seed, silent) are omitted:
# nthread, seed and silent are deprecated in recent xgboost releases.
classifier = XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.5, gamma=0.3,
              learning_rate=0.15, max_delta_step=0, max_depth=10,
              min_child_weight=5, n_estimators=100, n_jobs=1,
              objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
              subsample=1, verbosity=1)

In [None]:
score = cross_val_score(classifier,X_tl,Y_tl,cv=10)
score.mean()

In [None]:
classifier.fit(X_train_tl, Y_train_tl)
Y_test_preds=classifier.predict(X_test_tl)

In [None]:
print('Accuracy (XGBoost): {0:.2f}'.format(accuracy_score(Y_test_tl, Y_test_preds)))
print('Precision (XGBoost): {0:.2f}'.format(precision_score(Y_test_tl, Y_test_preds)))
print('Recall (XGBoost): {0:.2f}'.format(recall_score(Y_test_tl, Y_test_preds)))
print('F2 (XGBoost): {0:.2f}'.format(fbeta_score(Y_test_tl, Y_test_preds, beta=2)))

#### Undersampling using Tomek Links. Testing on data1

In [None]:
classifier.fit(X_train_tl, Y_train_tl)
Y_test_preds=classifier.predict(X_test)

In [None]:
print('Accuracy: {0:.2f}'.format(accuracy_score(Y_test, Y_test_preds)))
print('Precision: {0:.2f}'.format(precision_score(Y_test, Y_test_preds)))
print('Recall: {0:.2f}'.format(recall_score(Y_test, Y_test_preds)))
print('F2: {0:.2f}'.format(fbeta_score(Y_test, Y_test_preds, beta=2)))

# 4. Conclusion

### In this work, model quality was judged mainly by the F2 metric. The F2 values for every model and resampling strategy are collected in the following table.

In [None]:
tbl = {'Section': [2, 3.1, 3.2, 3.2, 3.3, 3.3],
       'Kind': ['-', 'Balanced weight', 'Random oversampling', 'Random undersampling', 'Oversampling(SMOTE)', 'Undersampling(Tomek Links)'],
       'Log_Reg': [0.02, 0.51, 0.50, 0.51, 0.63, 0.13],
       'Forest': [0.39, 0.52, 0.92, 0.76, 0.83, 0.65],
       'XGBoost': [0.39, 0.61, 0.79, 0.67, 0.67, 0.57]}
table=pd.DataFrame(tbl)
table

### The best result is achieved by the Random Forest trained on randomly oversampled data: F_2 = 0.92.
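
The winner can also be read off programmatically. The sketch below repeats the F2 values from the table in plain Python structures and looks up the maximum overall and per model. The variable names are chosen for this example only.

```python
kinds = ['-', 'Balanced weight', 'Random oversampling', 'Random undersampling',
         'Oversampling(SMOTE)', 'Undersampling(Tomek Links)']

# F2 values copied from the summary table above.
f2_scores = {
    'Log_Reg': [0.02, 0.51, 0.50, 0.51, 0.63, 0.13],
    'Forest':  [0.39, 0.52, 0.92, 0.76, 0.83, 0.65],
    'XGBoost': [0.39, 0.61, 0.79, 0.67, 0.67, 0.57],
}

# Best (model, strategy) pair over the whole table.
best_model, best_kind, best_f2 = max(
    ((model, kind, score)
     for model, scores in f2_scores.items()
     for kind, score in zip(kinds, scores)),
    key=lambda row: row[2],
)

# Best strategy for each model separately, stored as (score, strategy).
best_per_model = {model: max(zip(scores, kinds))
                  for model, scores in f2_scores.items()}

print('Best overall:', best_model, '|', best_kind, '| F2 =', best_f2)
for model, (score, kind) in best_per_model.items():
    print('{:8s} best F2 = {:.2f} with {}'.format(model, score, kind))
```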