**Task: build a classification model for banking crisis detection.**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Part 1
**Data reading and general preprocessing**

**Data reading.**

In [None]:
data = pd.read_csv("../input/africa-economic-banking-and-systemic-crisis-data/african_crises.csv").dropna()
data.head()

In [None]:
data.info()

**Encoding categorical columns as integers.**

In [None]:
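# Unique values of each categorical column, in order of first appearance.
cc3s = list(data["cc3"].unique())
countries = list(data["country"].unique())
# Target labels are coded in reverse order of first appearance.
banking_crisis_id = list(data["banking_crisis"].unique())[::-1]

# Replace each value with its integer index. Using map avoids the chained
# assignment (data[col][mask] = ...), which may silently fail to update data.
data["cc3"] = data["cc3"].map({value: i for i, value in enumerate(cc3s)})
data["country"] = data["country"].map({value: i for i, value in enumerate(countries)})
data["banking_crisis"] = data["banking_crisis"].map(
    {value: i for i, value in enumerate(banking_crisis_id)}
)

# astype returns a new DataFrame, so the result must be assigned back.
data = data.astype("float")
data.head()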
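
The same kind of integer encoding can also be sketched with `pd.factorize`, which assigns codes in order of first appearance. The toy frame and the label strings `"crisis"` / `"no_crisis"` below are assumptions for illustration, not values read from the dataset.

```python
import pandas as pd

# Toy frame with hypothetical values, standing in for the real dataset.
toy = pd.DataFrame({
    "country": ["Algeria", "Algeria", "Egypt", "Kenya", "Egypt"],
    "banking_crisis": ["crisis", "no_crisis", "no_crisis", "crisis", "no_crisis"],
})

# factorize returns integer codes in order of first appearance,
# plus the unique values that those codes index into.
country_codes, country_labels = pd.factorize(toy["country"])
toy["country"] = country_codes

# An explicit mapping makes the meaning of the target codes unambiguous
# (the label strings here are assumed).
toy["banking_crisis"] = toy["banking_crisis"].map({"no_crisis": 0, "crisis": 1})

print(toy)
print(list(country_labels))
```

Writing the target mapping out explicitly avoids depending on which label happens to appear first in the file.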

In [None]:
sns.heatmap(data.corr())
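
A heatmap can be hard to read precisely, so one option is to rank features by the absolute value of their correlation with the target. This is a minimal sketch on synthetic data; the column names `feature_a`, `feature_b` and `target` are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic data (hypothetical): the target depends on feature_a only.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "feature_a": rng.normal(size=300),
    "feature_b": rng.normal(size=300),
})
demo["target"] = (demo["feature_a"] + 0.5 * rng.normal(size=300) > 0).astype(float)

# Take the target column of the correlation matrix and drop the target itself.
corr_with_target = demo.corr()["target"].drop("target")

# Sort by absolute value so strong negative correlations rank high too.
ranked = corr_with_target.abs().sort_values(ascending=False)
print(ranked)
```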

**Data preparation for classification model building.**

Data normalization.

In [None]:
X = data.drop("banking_crisis", axis=1)
y = data["banking_crisis"].astype("int")

In [None]:
X_means = X.mean()
X_stds = X.std()
X = (X-X_means)/X_stds

X.head()
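
The normalization above uses the mean and standard deviation of the whole dataset. A common alternative, sketched here on synthetic data, is to fit the statistics on the training split only and reuse them for the test split. The names `X_demo`, `y_demo` and the feature columns are hypothetical. Note that `StandardScaler` uses the population standard deviation, while `DataFrame.std()` defaults to the sample one, so the values differ slightly.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features (hypothetical), standing in for the real X and y.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(loc=5.0, scale=2.0, size=(200, 3)),
                      columns=["f1", "f2", "f3"])
y_demo = rng.integers(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Fit the scaler on the training rows only, then apply the same
# statistics to the test rows so no test information leaks into training.
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.mean(axis=0))  # close to 0 on the training split
print(X_tr_scaled.std(axis=0))   # close to 1 on the training split
```

For a random forest the scaling itself has little effect on the splits, so this matters more if the same preprocessing is reused with a scale-sensitive model.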

In [None]:
X_corr = X.corr()
sns.heatmap(X_corr)

# Part 2
**Building and testing of classification models**

Building a **simple model based on a random forest**.

In [None]:
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [None]:
rf_pred_model_d1 = RandomForestClassifier(n_estimators=10, min_samples_leaf=20).fit(X_train, y_train)

Testing the initial model.

In [None]:
print("Accuracy:")
print( metrics.accuracy_score(y_test, rf_pred_model_d1.predict(X_test)) )
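
Accuracy can look high when one class dominates, so it may help to compare it with a majority-class baseline and to check recall on the minority class. The sketch below uses synthetic imbalanced data from `make_classification` (an assumption, not the crisis dataset) with the same forest settings as above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (hypothetical): roughly 10% positives.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

model = RandomForestClassifier(
    n_estimators=10, min_samples_leaf=20, random_state=0
).fit(X_tr, y_tr)
pred = model.predict(X_te)

# Accuracy obtained by always predicting the majority class.
positive_rate = np.mean(y_te)
baseline = max(positive_rate, 1 - positive_rate)

cm = confusion_matrix(y_te, pred)
recall = recall_score(y_te, pred, zero_division=0)

print("Baseline accuracy:", baseline)
print("Model accuracy:   ", accuracy_score(y_te, pred))
print("Positive recall:  ", recall)
print(cm)
```

If the model accuracy is only slightly above the baseline, the recall and the confusion matrix give a clearer picture of how many positive cases are actually detected.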

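As we see, this model configuration achieves high accuracy in banking crisis detection. Let's check the classification model on a shuffled dataset. Note that `train_test_split` already shuffles rows by default, so this mainly repeats the experiment on a different random split.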

Building and testing the second model.

In [None]:
from sklearn.utils import shuffle

In [None]:
data = shuffle(data)
data.head()
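
For reference, `train_test_split` shuffles rows itself unless `shuffle=False` is passed. A minimal sketch on ten hypothetical row indices shows the difference:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten ordered row indices (hypothetical stand-in for the dataset rows).
rows = np.arange(10)

# Default behaviour: rows are shuffled before splitting.
_, test_default = train_test_split(rows, test_size=0.2, random_state=0)

# shuffle=False keeps the original order, so the test set is the last rows.
_, test_ordered = train_test_split(rows, test_size=0.2, shuffle=False)

print("Default split test rows:  ", test_default)
print("Unshuffled split test rows:", test_ordered)
```

Passing `random_state` makes the shuffled split reproducible between runs.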

In [None]:
X = data.drop("banking_crisis", axis=1)
y = data["banking_crisis"].astype("int")

X_means = X.mean()
X_stds = X.std()
X = (X-X_means)/X_stds

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

rf_pred_model_d2 = RandomForestClassifier(n_estimators=10, min_samples_leaf=20).fit(X_train, y_train)

In [None]:
print("Accuracy:")
print( metrics.accuracy_score(y_test, rf_pred_model_d2.predict(X_test)) )

As we see, accuracy is above 90% again. This suggests that the selected architecture gives good predictions without overfitting.
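
Two random splits give only limited evidence about overfitting. One way to check more systematically, sketched here on synthetic data (an assumption, not the crisis dataset), is stratified k-fold cross-validation that compares training and validation accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic imbalanced data (hypothetical): roughly 10% positives.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0
)

model = RandomForestClassifier(n_estimators=10, min_samples_leaf=20, random_state=0)

# Stratified folds keep the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = cross_validate(
    model, X_demo, y_demo, cv=cv, scoring="accuracy", return_train_score=True
)

train_mean = scores["train_score"].mean()
test_mean = scores["test_score"].mean()

print("Mean train accuracy:     ", train_mean)
print("Mean validation accuracy:", test_mean)
```

A large gap between the two means would point to overfitting, while similar values are consistent with the conclusion above.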