**OUMAR ALPHA YAYA CISSE**

# PROBLEM
- My inspiration comes from two questions:
        1-) What are the factors most associated with systemic crises in Africa?
        2-) At what annual inflation rate does an inflation crisis become a practical certainty?

# EXPLORATORY DATA ANALYSIS

## Shape analysis :

    - target variable : 'systemic_crisis'
    - Rows and columns : 1059 R , 14 C
    - Types of variables : 3 qualitative variables, 4 continuous variables, 7 discrete variables
    - Analysis of missing values: No missing value
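
The shape figures above can be recovered with a few pandas calls. The sketch below is only illustrative: `toy` is a made-up miniature frame, not the real dataset, so the printed numbers differ from those listed above.

```python
import pandas as pd

# Hypothetical miniature frame standing in for the real dataset.
toy = pd.DataFrame({
    "country": ["Algeria", "Algeria", "Egypt"],
    "year": [1870, 1871, 1990],
    "exch_usd": [0.05, 0.05, 2.0],
    "systemic_crisis": [1, 0, 0],
})

n_rows, n_cols = toy.shape                # rows and columns
type_counts = toy.dtypes.value_counts()   # number of columns per dtype
n_missing = int(toy.isna().sum().sum())   # total number of missing cells

print(n_rows, n_cols)
print(type_counts)
print(n_missing)
```

On the real data the same three calls would be applied to `df`.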

## Background analysis :

    #- MODULE

In [None]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split , learning_curve , GridSearchCV
from sklearn.preprocessing import PolynomialFeatures , StandardScaler
from sklearn.feature_selection import SelectKBest , f_classif
from sklearn.metrics import classification_report , confusion_matrix , precision_recall_curve , f1_score , recall_score , precision_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier , AdaBoostClassifier
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

    #- CODE

In [None]:
a_crises = pd.read_csv('../input/africa-economic-banking-and-systemic-crisis-data/african_crises.csv')
df = a_crises.copy()
pd.set_option('display.max_rows', 500)

## VISUALIZATION OF THE TARGET

In [None]:
df['systemic_crisis'].value_counts().plot.pie(figsize=(12,9))

In [None]:
df['systemic_crisis'].value_counts()

    #- Visualization of the target variable: 82 observations with a systemic crisis and 977 observations without one

## Meaning of Variable Types (Variable Analysis):

#### CONTINUOUS VARIABLE

In [None]:
cont_variable = ['exch_usd' ,'gdp_weighted_default', 'inflation_annual_cpi' , 'year']

- We have a total of 4 continuous variables including 3 relating to economic variables and one relating to the year.

#### DISCRETE VARIABLE

In [None]:
disc_variable = ['domestic_debt_in_default', 'sovereign_external_debt_default', 'independence', 'currency_crises', 'inflation_crises', 'banking_crisis' ,'systemic_crisis']

- All the discrete variables are binary.

In [None]:
#df[df['currency_crises'] == 2]
df.drop([142, 146, 775, 840],inplace=True)

- The exception is currency_crises, which takes a third value, 2, in four rows (index labels 142, 146, 775 and 840). The owner of the dataset documents only the values 0 and 1 for this variable, so we treat the value 2 as an error and delete those rows.
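
As an alternative to hard-coded row labels, the invalid rows can be located by condition. This is a minimal sketch on invented values; only the column name `currency_crises` comes from the dataset, and `toy`, `invalid` and `clean` are hypothetical names.

```python
import pandas as pd

# Hypothetical rows; only the column name comes from the dataset.
toy = pd.DataFrame({"currency_crises": [0, 1, 2, 0, 1, 2]})

# Rows whose value falls outside the documented set {0, 1}.
invalid = ~toy["currency_crises"].isin([0, 1])
print(toy.index[invalid].tolist())

# Keep only the valid rows; no row labels need to be typed by hand.
clean = toy[~invalid].copy()
print(clean["currency_crises"].unique())
```

Filtering by condition keeps working if the row order or the index of the source file changes.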

In [None]:
# Map the text labels of banking_crisis to 0/1.
code = {
    'crisis': 1,
    'no_crisis': 0,
    False: 0,
    True: 1,
}

def encoding(df):
    df['banking_crisis'] = df['banking_crisis'].map(code)
    # df['got_crises'] = df['got_crises'].map(code)
    return df

df = encoding(df)

- We encode banking_crisis here to make the EDA easier; this step normally belongs in preprocessing.

#### QUALITATIVE VARIABLE

In [None]:
qual_variable = ['case' ,'cc3', 'country']

- cc3 and country both identify the country, just like case. Of these 3 variables, we will keep only the one that is most useful for building our model.

## TARGET / VARIABLES RELATIONSHIP (Assumption):

#### TARGET / CONTINUOUS VARIABLE:

    #- exch_usd

In [None]:
plt.figure(figsize = (12,9))
sns.lineplot(x = 'year', y = 'exch_usd', hue = 'systemic_crisis', data = df, palette = 'colorblind')
plt.xlabel('Year')
plt.ylabel('exch_usd')
plt.show()

- For observations with a systemic crisis, the exchange rate is almost zero from 1860 to 1980. It begins to vary strongly from 1980 to 2000, and from 2000 to 2012 we observe higher exchange rates in these countries.

    # - gdp_weighted_default

In [None]:
plt.figure(figsize = (12,9))
sns.lineplot(x = 'year', y = 'gdp_weighted_default', hue = 'systemic_crisis', data = df)
plt.xlabel('Year')
plt.ylabel('gdp_weighted_default')
plt.show()

- For observations with a systemic crisis, total debt in default relative to GDP is almost zero from 1860 to 1980. From 1980 to around 2000 it varies only slightly, and from 2000 to 2012 we observe more debt in default relative to GDP in these countries.

    #- inflation_annual_cpi

In [None]:
plt.figure(figsize = (12,9))
sns.lineplot(x = 'year', y = 'inflation_annual_cpi', hue = 'systemic_crisis', data = df, palette = 'colorblind')
plt.xlabel('Year')
plt.ylabel('inflation_annual_cpi')
plt.show()

- This graph is difficult to interpret.

- In conclusion, the continuous variables are mainly economic indicators. For observations with a systemic crisis, they show almost no activity from 1860 to 1980, begin to vary from 1980 to 2000, and take higher values from 2000 to 2012.

#### TARGET / DISCRETE VARIABLE:

In [None]:
disc_variable = ['domestic_debt_in_default', 'sovereign_external_debt_default', 'independence', 'currency_crises', 'inflation_crises', 'banking_crisis' ]#'systemic_crisis']
for col in disc_variable:
    plt.figure(figsize=(9,6))
    sns.heatmap(pd.crosstab(df['systemic_crisis'],df[col]),annot=True,fmt='d')

- Among observations where the country is independent, 10% are affected by a systemic crisis.
- Among observations where the country is not independent, only 1 in 237 is affected by a systemic crisis.
- Among observations with a banking crisis, 81% are also affected by a systemic crisis.
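
Shares like these can be read directly from a row-normalized crosstab instead of being computed by hand from the heatmap counts. The sketch below uses made-up observations, so its rates do not match the figures above; only the column names follow the dataset.

```python
import pandas as pd

# Made-up observations; column names follow the dataset.
toy = pd.DataFrame({
    "banking_crisis":  [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "systemic_crisis": [1, 1, 1, 0, 0, 0, 0, 0, 0, 1],
})

# normalize="index" divides each row by its total, which gives the share
# of systemic crises within each banking_crisis group.
rates = pd.crosstab(toy["banking_crisis"], toy["systemic_crisis"], normalize="index")
print(rates)
```

In this toy frame, `rates.loc[1, 1]` is the share of banking-crisis observations that also have a systemic crisis.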

#### TARGET / QUALITATIVE VARIABLE:

In [None]:
sys_crisis =df[df['systemic_crisis']==1]
not_sys_crisis = df[df['systemic_crisis']==0]

In [None]:
sys_crisis['country'].value_counts().plot.pie(figsize=(19,8))

- The Central African Republic is the country in which we observe the most systemic crises.
- Morocco is the country in which we observe the fewest systemic crises.

## VARIABLE / VARIABLE RELATIONSHIP, MORE DETAILED ANALYSIS (Correlation):

#### CONTINUOUS VARIABLE / CONTINUOUS VARIABLE:

In [None]:
plt.figure(figsize=(12,9))
sns.heatmap(df[cont_variable].corr(),annot=True)

- The continuous variables correlate very weakly with each other. The strongest correlation is 0.25, between year and exch_usd.

#### DISCRETE VARIABLE / DISCRETE VARIABLE:

In [None]:
plt.figure(figsize=(12,9))
# disc_variable was redefined above without the target, so add it back here
sns.heatmap(df[disc_variable + ['systemic_crisis']].corr(),annot=True)

- We observe a very strong correlation of 0.86 between systemic_crisis and banking_crisis. banking_crisis is therefore strongly associated with systemic_crisis and is a promising feature for the model.

- These variables correlate quite a bit with each other.

#### CONTINUOUS VARIABLE / DISCRETE VARIABLE

In [None]:
#df['got_crises'] = df[['systemic_crisis','currency_crises', 'inflation_crises', 'banking_crisis']].sum(axis=1) >= 1

In [None]:
plt.figure(figsize=(12,9))
sns.heatmap(df[list(disc_variable)+list(cont_variable)].corr(),annot=True)

- There is very little correlation between the continuous and discrete variables. The strongest correlation, 0.43, involves exch_usd.
- The variables most strongly associated with our target are the discrete variables.

# ANSWER TO THE PROBLEM:
     1-) What are the factors most associated with systemic crises in Africa? :
  These factors are the year, the country and, most importantly, having a banking crisis within the same year.
     
     2-) At what annual inflation rate does an inflation crisis become a practical certainty? :
  At a rate of around 54.042.
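
One way to look for such a threshold, sketched here on invented numbers, is to compare the highest inflation rate observed without an inflation crisis to the lowest rate observed with one. The values below are made up and do not reproduce the 54.042 figure; `toy`, `highest_without_crisis` and `lowest_with_crisis` are hypothetical names.

```python
import pandas as pd

# Made-up observations; column names follow the dataset.
toy = pd.DataFrame({
    "inflation_annual_cpi": [2.0, 8.5, 19.9, 21.0, 35.0, 120.0],
    "inflation_crises":     [0,   0,   0,    1,    1,    1],
})

by_flag = toy.groupby("inflation_crises")["inflation_annual_cpi"]
highest_without_crisis = by_flag.max().loc[0]  # largest rate with flag 0
lowest_with_crisis = by_flag.min().loc[1]      # smallest rate with flag 1
print(highest_without_crisis, lowest_with_crisis)
```

Within a sample, every observation above `highest_without_crisis` is flagged as a crisis, so that value is a candidate for the rate at which a crisis becomes a practical certainty.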

# PREPROCESSING

In [None]:
df_prep = a_crises.copy()

    # - ENCODING

In [None]:
# Map the text labels of banking_crisis to 0/1.
code = {
    'crisis': 1,
    'no_crisis': 0,
    False: 0,
    True: 1,
}

def encodage(df):
    df['banking_crisis'] = df['banking_crisis'].map(code)
    # df['got_crises'] = df['got_crises'].map(code)
    return df

    # - FEATURE ENGINEERING

In [None]:
def feature_engineering(df):
    #df['got_crises'] = df[['systemic_crisis','currency_crises', 'inflation_crises', 'banking_crisis']].sum(axis=1) >= 1

    # case, cc3 and country all describe the country, so we keep only case,
    # which is the numeric encoding of country and cc3
    df = df.drop(['cc3','country'],axis=1)
    return df

    # - PREPROCESSING 

In [None]:
train_set , test_set = train_test_split(df_prep,test_size=0.2)

In [None]:
def preprocessing(df):
    df = feature_engineering(df)
    df = encodage(df)
    
    y = df['systemic_crisis']
    X = df.drop('systemic_crisis',axis=1)
    
    return X , y

In [None]:
X_train , y_train = preprocessing(train_set)
X_test , y_test = preprocessing(test_set)

    #- MODEL

In [None]:
process = make_pipeline(PolynomialFeatures(2,include_bias=False))

In [None]:
svc = make_pipeline(process,StandardScaler(),SVC(random_state=0))

    #- EVALUATION

In [None]:
def evaluation(model):
    model.fit(X_train,y_train)
    y_pred = model.predict(X_test)
    
    print(confusion_matrix(y_test,y_pred))
    print(classification_report(y_test,y_pred))
    
    N , train_score , val_score = learning_curve(model,X_train,y_train,scoring='f1',train_sizes=np.linspace(0.1,1,10),cv=4)
    
    plt.figure(figsize=(12,9))
    plt.plot(N,train_score.mean(axis=1),label='train_score')
    plt.plot(N,val_score.mean(axis=1),label='val_score')
    plt.legend()
evaluation(svc)

# GRID SEARCH CV

In [None]:
param_grid = {
    'pipeline__polynomialfeatures__degree' : [2 , 3, 4, 6],
    'svc__C' : [10, 100 , 1000],
    'svc__gamma' : [1e-1, 1e-2, 1e-4]
}
grid = GridSearchCV(svc,param_grid,cv=4)

In [None]:
grid.fit(X_train,y_train)
y_pred = grid.predict(X_test)

In [None]:
evaluation(grid.best_estimator_)

In [None]:
grid.best_params_

# precision_recall_curve

In [None]:
# Precision and recall as a function of the decision threshold
precision , recall , threshold = precision_recall_curve(y_test,grid.best_estimator_.decision_function(X_test))
plt.figure(figsize=(12,9))
plt.plot(threshold,precision[:-1],label='precision')
plt.plot(threshold,recall[:-1],label='recall')
plt.legend()

In [None]:
def model_final(X,model,threshold=0.6):
    return model.decision_function(X) > threshold

In [None]:
y_pred = model_final(X_test,grid.best_estimator_,0.8)

In [None]:
f1_score(y_test,y_pred)

In [None]:
recall_score(y_test,y_pred)

In [None]:
precision_score(y_test,y_pred)
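
The effect of moving the decision threshold can be checked on its own. The sketch below is not the notebook's model: it assumes synthetic, imbalanced data from `make_classification` and a default `SVC`, and the thresholds are arbitrary. It only illustrates that raising the threshold on `decision_function` can never increase recall.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic, imbalanced data standing in for the crisis dataset.
X, y = make_classification(n_samples=400, n_features=6,
                           weights=[0.85, 0.15], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

clf = SVC(random_state=0).fit(X_tr, y_tr)
scores = clf.decision_function(X_te)

# A higher threshold flags fewer observations as positive,
# so recall can only stay equal or drop.
recalls = []
for t in (-0.5, 0.0, 0.5):
    pred = scores > t
    recalls.append(recall_score(y_te, pred))
    print(t, precision_score(y_te, pred, zero_division=0), recalls[-1])
```

Choosing the threshold is therefore a trade-off: for a crisis detector, a lower threshold favours recall at the cost of more false alarms.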