<div align='center'><font size="5" color='#353B47'>The data scientist backpack</font></div>
<div align='center'><font size="4" color="#353B47">A good preparation for case study interviews</font></div>
<br>
<div align='center'><img src="https://s27389.pcdn.co/wp-content/uploads/shutterstock_112621262-1000x440.jpg"></div>
<br>
<hr>

<div align="justify">This notebook walks through good practices for the kind of case study you are likely to face during a long interview process. I was given this one to complete within 4 hours when applying for a first role as a data scientist.</div>



<font color="red" size="4">ADVICE</font>
> If you want to play with the interactive functions allowing you to choose the parameters in a mini interface, this notebook must be forked.

# Let's dig

<div align="justify">The goal of the study is to predict whether an applicant will be hired, based on their characteristics. This is a binary classification problem.</div>

## Data description

- date: date of the application
- age: age of the candidate
- diplome: highest qualification diploma (bac, licence, master, doctorat)
- specialite: minor of the diploma (geologie, forage, detective, archeologie,...)
- salaire: asked salary
- dispo: oui : directly available, non : not directly available
- sexe: female (F) or male (M)
- exp: years of relevant experience
- cheveux: hair color (chatain, brun, blond, roux)
- note: grade (out of 100) for gold digging exam
- embauche: Has the candidate been hired ? (0 : no, 1 : yes)

## <div id="summary">Table of contents</div>

**<font size="2"><a href="#chap1">1. Import libraries and data</a></font>**
**<br><font size="2"><a href="#chap2">2. Handling Missing Values</a></font>**
**<br><font size="2"><a href="#chap3">3. EDA</a></font>**
**<br><font size="2"><a href="#chap4">4. Preparing the data for modelling</a></font>**
**<br><font size="2"><a href="#chap5">5. Random Forest</a></font>**
**<br><font size="2"><a href="#chap6">6. Feature Importance</a></font>**
**<br><font size="2"><a href="#chap7">7. Improvement</a></font>**

# <div id="chap1">1. Import libraries and data</div>

In [None]:
# import sys
# print(sys.version)

<div align="justify">The test was coded under python 3.7.3 (default, Apr 24 2019, 15:29:51) [MSC v.1915 64 bit (AMD64)]</div>

In [None]:
%matplotlib inline

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
from pprint import pprint
import ipywidgets as widgets
from ipywidgets import interact, interact_manual

# Data manipulation
import numpy as np
import pandas as pd
import pandas_profiling

# Modelling
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix

In [None]:
# Import dataframe
data = pd.read_csv('../input/applicants-for-a-gold-digger-position/data.csv').drop(['Unnamed: 0'], axis = 1)

# Display five first rows of the dataframe 
data.head()

In [None]:
categorical_feature = list(data.dtypes[data.dtypes == 'object'].index)
numerical_feature = list(data.describe().columns)

def information(dataframe):
    
    '''
    Print the number of observations, features and those which are numerical or categorical
    '''
    
    print(f'the dataframe contains {dataframe.shape[0]} observations and {dataframe.shape[1]} columns\n')
    print(f"the dataframe contains:\n   -   {len(numerical_feature)} numeric features:\n   '--->   {numerical_feature}\n\n\
   -   {len(categorical_feature)} categorical features:\n   '--->   {categorical_feature}")
    

information(data)

In [None]:
# For further information
#pandas_profiling.ProfileReport(data)

In [None]:
# Export in html file
#profile = pandas_profiling.ProfileReport(data)
#profile.to_file("OrpheeProfiling.html")

**<font size="2"><a href="#summary">Back to summary</a></font>**

----

# <div id="chap2"> 2. Handling Missing Values</div>

<div align='justify'>In this case study, I did not observe many missing values. I therefore decided to delete the lines containing at least one missing value. Nevertheless, be careful to always check that the data are usable.</div>

In [None]:
# There are too few observations with NAs, which is not significant for the study
data = data.dropna().reset_index(drop=True)
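
<div align='justify'>Before dropping rows, it can help to quantify how much data would be lost. The sketch below is one possible check, not part of the original test: <code>missing_summary</code> is a hypothetical helper and the toy frame uses made-up values.</div>

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return per-column NaN counts and the share of rows with at least one NaN."""
    per_column = df.isna().sum()
    share_rows_affected = df.isna().any(axis=1).mean()
    return per_column, share_rows_affected

# Toy frame standing in for the real data (values are made up)
toy = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "note": [55.0, 72.5, np.nan, 90.0],
    "sexe": ["F", "M", "M", None],
})

per_column, share = missing_summary(toy)
print(per_column)
print(f"Rows with at least one missing value: {share:.0%}")
```

<div align='justify'>If the share of affected rows is small, as it was here, dropping them is easy to justify; otherwise imputation is probably worth considering.</div>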

**<font size="2"><a href="#summary">Back to summary</a></font>**

----

# <div id="chap3">3. EDA</div>

<div align='justify'>Exploratory analysis is an essential step in a case study. It allows you to quickly see if there are any outliers in the data and also to get an overview of the data you are manipulating. It is from this exploratory analysis that we will be able to confirm the variables to be used for our model.</div>

### <font color='blue' size='3'>3.1 Label proportion</font>

In [None]:
def plot_label_proportion(df):
    # Plot
    fig = go.Figure([go.Bar(x=df.embauche.value_counts().index, y=df.embauche.value_counts().values)])

    fig.update_layout(title="Countplot showing proportion of hired candidates",
                      xaxis_title="Embauche (hired)",
                      yaxis_title="Count")

    fig.show()

plot_label_proportion(data)

<font size='3'>Do the best performers on the exam stand a better chance?</font>

In [None]:
# Let's consider the top students, i.e. those with a grade above 80 out of 100
top_20perc_grades = data[data['note'] > 0.8*100 ].reset_index(drop=True)
plot_label_proportion(top_20perc_grades)

In [None]:
# Looking for outliers: check the min and max of note
print(data['note'].min(), data['note'].max(), "\n")

# Get the index label of the highest note
idx_max_note = data['note'].idxmax()
print(idx_max_note, "\n")

# Check the corresponding row
print(data.loc[idx_max_note, :], "\n")

<div align='justify'>It turns out that the maximum grade is above 100. A score above 100 could be read as bonus points, but I chose to treat it as a typo. In the end, any choice is fine as long as you can justify it.</div>

In [None]:
# Filtering
data = data[data['note']<=100].reset_index(drop=True)

<div align='justify'>Surprisingly, among the candidates with top grades, more are rejected than hired. Hiring therefore does not seem to depend on exam performance alone. Let's dig deeper...</div>
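
<div align='justify'>Raw counts can mislead when the classes are unbalanced, so comparing hire rates between groups is a useful complement. A minimal sketch, assuming the same column names (<code>note</code>, <code>embauche</code>); the helper <code>hire_rate_by_group</code> and the toy values are illustrative only.</div>

```python
import numpy as np
import pandas as pd

def hire_rate_by_group(df, threshold=80):
    """Hire rate for candidates above `threshold` versus all the others."""
    group = np.where(df["note"] > threshold, "high grade", "other")
    # The mean of a 0/1 column is the proportion of ones
    return df.groupby(group)["embauche"].mean()

# Made-up data, only to show the mechanics
toy = pd.DataFrame({
    "note":     [95, 88, 82, 60, 55, 40, 70, 30],
    "embauche": [1,  0,  0,  0,  0,  0,  1,  0],
})

rates = hire_rate_by_group(toy)
print(rates)
```

<div align='justify'>On the real data, a high-grade hire rate close to the overall rate would support the idea that the grade alone does not drive the decision.</div>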

In [None]:
diplome_20perc_grades = top_20perc_grades.groupby(['diplome'])['embauche'].count()
diplome_20perc_grades

In [None]:
# Plot
fig = go.Figure(data=[go.Pie(labels=diplome_20perc_grades.index, values=diplome_20perc_grades.values, textinfo='label+percent',
                             insidetextorientation='radial'
                            )])

fig.update_traces(hole=.3, hoverinfo="label+percent+name")

fig.update_layout(
    title_text="Type of diploma distribution")

fig.show()

In [None]:
salaries_asked_d = top_20perc_grades[top_20perc_grades['diplome']=='doctorat']['salaire']
salaries_asked_b = top_20perc_grades[top_20perc_grades['diplome']=='bac']['salaire']
salaries_asked_l = top_20perc_grades[top_20perc_grades['diplome']=='licence']['salaire']
salaries_asked_m = top_20perc_grades[top_20perc_grades['diplome']=='master']['salaire']

In [None]:
# Graph 1
x1 = salaries_asked_d.values
x2 = data[data['diplome']=='doctorat']['salaire'].values
x3 = data['salaire'].values

hist_data = [x1,x2,x3]
group_labels = ['doctorant_top_20%','doctorant','all']

fig1 = ff.create_distplot(hist_data, 
                          group_labels, 
                          show_hist=False, 
                          show_rug=False)

fig1.update_layout(
    title_text="PhD distribution")

fig1.show()


# Graph 2
x1 = salaries_asked_b.values
x2 = data[data['diplome']=='bac']['salaire'].values
x3 = data['salaire'].values

hist_data = [x1,x2,x3]
group_labels = ['bac_top_20%','bac','all']

fig2 = ff.create_distplot(hist_data, 
                          group_labels, 
                          show_hist=False, 
                          show_rug=False)

fig2.update_layout(
    title_text="Baccalauréat degree distribution")

fig2.show()

# Graph 3
x1 = salaries_asked_l.values
x2 = data[data['diplome']=='licence']['salaire'].values
x3 = data['salaire'].values

hist_data = [x1,x2,x3]
group_labels = ['licence_top_20%','licence','all']

fig3 = ff.create_distplot(hist_data, 
                          group_labels, 
                          show_hist=False,
                          show_rug=False)

fig3.update_layout(
    title_text="License degree distribution")

fig3.show()


# Graph 4
x1 = salaries_asked_m.values
x2 = data[data['diplome']=='master']['salaire'].values
x3 = data['salaire'].values

hist_data = [x1,x2,x3]
group_labels = ['master_top_20%','master','all']

fig4 = ff.create_distplot(hist_data, 
                          group_labels, 
                          show_hist=False, 
                          show_rug=False)

fig4.update_layout(
    title_text="Master's degree distribution")

fig4.show()

<div align='justify'>The salary expectations of PhD holders in the top-grade group closely match those of PhD holders overall. Nevertheless, PhD holders do not ask for higher salaries than the global distribution.</div>

In [None]:
# Graph 5
x1 = data[data['diplome']=='doctorat']['exp'].values
x2 = data['exp'].values

hist_data = [x1,x2]
group_labels = ['doctorat','all']

fig5 = ff.create_distplot(hist_data, 
                          group_labels, 
                          show_hist=False, 
                          show_rug=False)

fig5.update_layout(
    title_text="Experience distribution: PhD holders vs all candidates")

fig5.show()

<div align='justify'>... and it is NOT because they have less experience.</div>

### <font color='blue' size='3'>3.2 For further information, Date feature is a gold mine too</font>

#### <font color='orange' size='2'>*Generating the TS*</font>

In [None]:
# Set date feature as a datetime format
data['date'] = pd.to_datetime(data['date'])

In [None]:
# Check if one date can contain many applications
print(data.shape[0])
print(len(data['date'].unique()))

<div align='justify'>It turns out that it can: several applications may share the same date, so the date column does not map one-to-one to observations and cannot be used directly as a time series (TS) index. I need to group by date first.</div>

In [None]:
# Groupby data to count how many applications per day
application_per_day = data.groupby(["date"])['embauche'].count()
application_per_day.head()

In [None]:
index = list(application_per_day.index)
application_per_day = np.transpose(list(application_per_day))

In [None]:
# Check that the range of days corresponds to the length of the index (no missing days)
print(max(index) - min(index))
print(len(index))

In [None]:
X = pd.date_range(start=min(index), freq='D',periods=1825)

data_obs_per_day = pd.DataFrame({'index':index, 'nb_app_day':application_per_day})

data_obs_per_day.set_index(X, inplace=True)
data_obs_per_day = data_obs_per_day['nb_app_day']

# Compute the range of data_observations to calibrate the plot
print(max(data_obs_per_day))
print(min(data_obs_per_day))
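
<div align='justify'>As an alternative to a hard-coded number of periods, the daily series can be built so that days without any application show up as zeros. This is a sketch under that assumption; <code>daily_counts</code> is a hypothetical helper and the dates are made up.</div>

```python
import pandas as pd

def daily_counts(dates):
    """Count applications per calendar day, filling days without applications with 0."""
    dates = pd.to_datetime(pd.Series(dates)).dt.normalize()
    counts = dates.value_counts().sort_index()
    full_range = pd.date_range(counts.index.min(), counts.index.max(), freq="D")
    return counts.reindex(full_range, fill_value=0)

# Four applications over four days, with nothing on 2010-01-03
toy_dates = ["2010-01-01", "2010-01-01", "2010-01-02", "2010-01-04"]
series = daily_counts(toy_dates)
print(series)
```

<div align='justify'>Reindexing on the full date range makes a missing day explicit instead of silently shifting the index.</div>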

In [None]:
def zoom(dataframe, startDate, endDate):
    
    ''' 
    Plot the time series within a specific interval
    dataframe (Series): daily counts named 'nb_app_day', with timestamps as index
    startDate (string): start of the interval (inclusive), in the same date format as the index
    endDate (string): end of the interval (exclusive), in the same date format as the index
    '''
    
    ts_to_plot = dataframe[(dataframe.index>=startDate) & (dataframe.index<endDate)]

    mean_ts_to_plot = np.mean(ts_to_plot)
    std = np.std(ts_to_plot)
    print(f'mean: {mean_ts_to_plot}, std: {std}')
    
    ts_to_plot = pd.DataFrame(ts_to_plot)
    ts_to_plot['date'] = ts_to_plot.index
    ts_to_plot = ts_to_plot.reset_index(drop=True)
    
    return px.line(ts_to_plot, x='date', y='nb_app_day', range_x=[startDate, endDate])

#### <font color='orange' size='2'>*Plotting the TS*</font>

In [None]:
fig = zoom(data_obs_per_day, '2010', '2011')
fig.show()

fig = zoom(data_obs_per_day, '2011', '2012')
fig.show()

fig = zoom(data_obs_per_day, '2012', '2013')
fig.show()

fig = zoom(data_obs_per_day, '2013', '2014')
fig.show()

fig = zoom(data_obs_per_day, '2014', '2015')
fig.show()

<div align='justify'>No seasonality is observed: the mean and standard deviation are very close from one year to the next.</div>
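
<div align='justify'>Beyond eyeballing yearly plots, one quick numerical check is to pool the years and average the daily counts per calendar month. The sketch below uses a synthetic series (Poisson counts are an assumption, not the real data) and a hypothetical helper <code>monthly_profile</code>.</div>

```python
import numpy as np
import pandas as pd

def monthly_profile(series):
    """Average daily count for each calendar month (1..12), pooled over years."""
    return series.groupby(series.index.month).mean()

# Synthetic daily series with no seasonal pattern (assumed Poisson counts)
rng = np.random.default_rng(0)
idx = pd.date_range("2010-01-01", periods=730, freq="D")
flat = pd.Series(rng.poisson(10, size=len(idx)), index=idx)

profile = monthly_profile(flat)
print(profile.round(2))
print("Spread between months:", round(profile.max() - profile.min(), 2))
```

<div align='justify'>A spread between months that is small relative to the daily standard deviation is consistent with the absence of seasonality.</div>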

### <font color='blue' size='3'>3.3 Levels of each categorical feature</font>

In [None]:
def levels_cat_features(dataframe, indexes):
    levels = [dataframe.loc[:,i].unique() for i in indexes]
    return(levels)

In [None]:
levels_cat_features(data, list(data.dtypes[data.dtypes == 'object'].index))

### <font color='blue' size='3'>3.4 Statistical dependencies</font>

#### <font color='orange' size='2'>*3.4.1 Speciality x sex*</font>

In [None]:
# No missing values for accurateness of proportions

@interact
def class_proportion(speciality = ['geologie', 'forage', 'detective', 'archeologie'] , 
                     gender = ['','M','F'], 
                     hired = ['', True]):
    
    if hired == True:
        data_embauche = data[data['embauche']==1]
        temp = round(data_embauche[data_embauche['specialite']==speciality].shape[0] / data.shape[0], 3)
        temp_m = round(data_embauche[(data_embauche['specialite']==speciality) &\
                        (data_embauche['sexe']=='M')].shape[0] / data[data['specialite']==speciality].shape[0], 3)
        temp_f = round(data_embauche[(data_embauche['specialite']==speciality) &\
                        (data_embauche['sexe']=='F')].shape[0] / data[data['specialite']==speciality].shape[0], 3)    

        plt.figure(figsize=(14,7))

        prop_m = round(data_embauche[data_embauche['sexe']=='M'].shape[0] / data.shape[0], 3)
        prop_f = round(data_embauche[data_embauche['sexe']=='F'].shape[0] / data.shape[0], 3)
        print(f'\nProportion of hired men: {prop_m}')
        print(f'Proportion of hired women: {prop_f}\n\n')

        if gender == '':
            print(f"Proportion of hired candidates in {speciality}: {temp}\n\n")
            print(f"Proportion of hired male in {speciality} in the selected speciality: {temp_m}")
            print(f"Proportion of hired female in {speciality} in the selected speciality: {temp_f}")

            plt.subplot(1,2,1)
            sns.countplot(x="specialite", 
                          hue="sexe", 
                          data=data_embauche, 
                          palette="muted")
            plt.title('Number of hired candidates per speciality')
            plt.tight_layout(pad = 7)
            plt.subplot(1,2,2)
            sns.countplot(x="specialite", 
                          hue="sexe", 
                          data=data_embauche[data_embauche['specialite']==speciality], 
                          palette="muted")
            plt.title('Number of hired candidates for a chosen speciality')
            
        else:
            print(f"Proportion of hired candidates in {speciality}: {temp}")
            sns.countplot(x="specialite", 
                          hue="sexe", 
                          data=data_embauche[data_embauche['sexe']==gender], 
                          palette="muted")
            plt.title('Number of hired candidates per speciality for a selected gender')
            
    else:
        temp = round(data[data['specialite']==speciality].shape[0] / data.shape[0], 2)
        temp_m = round(data[(data['specialite']==speciality) &\
                        (data['sexe']=='M')].shape[0] / data[data['specialite']==speciality].shape[0], 2)
        temp_f = round(data[(data['specialite']==speciality) &\
                        (data['sexe']=='F')].shape[0] / data[data['specialite']==speciality].shape[0], 2)    

        plt.figure(figsize=(14,7))

        prop_m = round(data[data['sexe']=='M'].shape[0] / data.shape[0], 2)
        prop_f = round(1 - prop_m, 2)
        print(f'\nProportion of men: {prop_m}')
        print(f'Proportion of women: {prop_f}\n\n')

        if gender == '':
            print(f"Proportion of candidates in {speciality}: {temp}\n\n")
            print(f"Proportion of male in {speciality} in the selected speciality: {temp_m}")
            print(f"Proportion of female in {speciality} in the selected speciality: {temp_f}")

            plt.subplot(1,2,1)
            sns.countplot(x="specialite", 
                          hue="sexe", 
                          data=data, 
                          palette="muted")
            plt.title('Number of candidates per speciality')
            plt.tight_layout(pad = 7)
            plt.subplot(1,2,2)
            sns.countplot(x="specialite", 
                          hue="sexe", 
                          data=data[data['specialite']==speciality], 
                          palette="muted")
            plt.title('Number of candidates for a selected speciality')
            
        else:
            print(f"Proportion of {speciality}: {temp}")
            sns.countplot(x="specialite", 
                          hue="sexe", 
                          data=data[data['sexe']==gender], 
                          palette="muted")
            plt.title('Number of candidates per speciality for a selected gender')

<div align='justify'>The countplot shows that women are more represented in archeologie and detective specialities whereas men are more represented in geologie and forage specialities. The most represented speciality is **Geologie**, which contains more than 50% of all observations. It is also interesting to see that women are equally represented in each field of study.</div>

#### <font color='orange' size='2'>*3.4.2 Hair x salary*</font>

In [None]:
plt.figure(figsize=(10,7))

sns.set(style="ticks", palette="pastel")

# Draw a nested boxplot of asked salary by hair colour and sex
sns.boxplot(x="cheveux", y="salaire",
            hue="sexe", palette=["m", "g"],
            data=data)
sns.despine(offset=10, trim=True)

In [None]:
# Set up the matplotlib figure
f, axes = plt.subplots(2, 2, figsize=(7, 7), sharex=True)
sns.despine(left=True)

roux = data.loc[data['cheveux']=='roux', 'salaire']
blond = data.loc[data['cheveux']=='blond', 'salaire']
brun = data.loc[data['cheveux']=='brun', 'salaire']
chatain = data.loc[data['cheveux']=='chatain', 'salaire']

# Plot a filled kernel density estimate
sns.distplot(roux, 
             hist=False, 
             color="orange", 
             kde_kws={"shade": True}, 
             ax=axes[0, 0])

sns.distplot(blond, 
             hist=False, 
             color="yellow", 
             kde_kws={"shade": True}, 
             ax=axes[0, 1])

sns.distplot(brun, 
             hist=False, 
             color="brown", 
             kde_kws={"shade": True}, 
             ax=axes[1, 0])

sns.distplot(chatain, 
             hist=False, 
             color="goldenrod", 
             kde_kws={"shade": True}, 
             ax=axes[1, 1])

plt.tight_layout()

<div align='justify'>Hair colour shows no visible relationship with salary: each level follows the same distribution. For every hair colour, men have a higher average salary than women, and within each gender that average is almost the same across hair colours.</div>

#### <font color='orange' size='2'>*3.4.3 Experience x grade*</font>

In [None]:
@interact
def correlation(numerical_feature_1 = numerical_feature, 
                numerical_feature_2 = numerical_feature):
    
    temp = round(data[numerical_feature_1].corr(data[numerical_feature_2]),4)
    print(f'Correlation between {numerical_feature_1} and {numerical_feature_2}: {temp}')

In [None]:
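# Keep numeric columns only: recent pandas versions raise on non-numeric columns
corr = data.select_dtypes(include='number').corr()

# generate a mask for the upper triangle (np.bool was removed from NumPy, use bool)
mask = np.zeros_like(corr, dtype=bool)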
# return the indices for the upper triangle of an (n,m) array
mask[np.triu_indices_from(mask)] = True

sns.set_style("white")
f, ax = plt.subplots(figsize=(11,7))
plt.title("Correlation matrix")
sns.heatmap(corr, 
            mask=mask, 
            cmap=sns.diverging_palette(220,10, as_cmap=True),
            square=True, 
            vmax = 1, 
            center = 0, 
            linewidths = .5, 
            cbar_kws = {"shrink": .5})

plt.show()

<div align='justify'>There is no correlation between experience and the grade obtained, meaning that no linear relationship links the two variables. The heatmap does not show much correlation between the other numerical features either.</div>
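
<div align='justify'>Keep in mind that a Pearson correlation close to zero only rules out a linear link, not dependence in general. The classic illustration below (synthetic values, unrelated to the dataset) uses a variable that is fully determined by another one.</div>

```python
import numpy as np

# x is symmetric around zero; y depends entirely on x, but not linearly
x = np.linspace(-1, 1, 201)
y = x ** 2

pearson = np.corrcoef(x, y)[0, 1]
print(f"Pearson correlation between x and x^2: {pearson:.4f}")
```

<div align='justify'>The coefficient is numerically zero even though <code>y</code> is a function of <code>x</code>, which is why a scatter plot remains a useful complement to the heatmap.</div>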

**<font size="2"><a href="#summary">Back to summary</a></font>**

---

# <div id="chap4"> 4. Preparing the data for modelling</div>

### <font color='blue' size='3'>4.1 One Hot Encoding</font>

In [None]:
# Convert each categorical feature into (number of levels - 1) boolean features
data_rdy = pd.get_dummies(data, drop_first=True)
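
<div align='justify'>To make the effect of <code>drop_first=True</code> concrete, here is a small sketch on a toy frame (made-up values): the first level in sorted order becomes the reference and gets no column of its own.</div>

```python
import pandas as pd

# Toy frame with one categorical and one numerical column (values are made up)
toy = pd.DataFrame({"cheveux": ["brun", "blond", "roux", "chatain"],
                    "age": [30, 41, 25, 37]})

encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
print(encoded)
```

<div align='justify'>Here <code>blond</code> is the dropped reference level: a blond candidate is encoded as zeros in all three remaining <code>cheveux_*</code> columns.</div>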

In [None]:
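# Drop the datetime column before computing correlations
corr = data_rdy.drop(columns=['date']).corr()
# generate a mask for the upper triangle (np.bool was removed from NumPy, use bool)
mask = np.zeros_like(corr, dtype=bool)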
# return the indices for the upper triangle of an (n,m) array
mask[np.triu_indices_from(mask)] = True

sns.set_style("white")
f, ax = plt.subplots(figsize=(11,7))
plt.title("Correlation matrix")
sns.heatmap(corr, mask=mask, cmap=sns.diverging_palette(220,10, as_cmap=True),
            square=True, vmax = 1, center = 0, linewidths = .5, cbar_kws = {"shrink": .5})
plt.show()

### <font color='blue' size='3'>4.2 Split into train and test</font>

In [None]:
X = data_rdy.drop(['embauche','date'], axis = 1)
y = data_rdy.embauche

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state = 12345)
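
<div align='justify'>Since the target is unbalanced, one option worth considering is a stratified split, which keeps the proportion of hired candidates the same in both sets. This is a sketch on made-up data, not what was done in the original test; it relies on the <code>stratify</code> argument of <code>train_test_split</code>.</div>

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy unbalanced target: 10% positives (made-up data)
y_toy = np.array([1] * 10 + [0] * 90)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.4, random_state=12345, stratify=y_toy)

print("Positive rate train:", y_tr.mean(), "| test:", y_te.mean())
```

<div align='justify'>With the real data this would amount to passing <code>stratify=y</code> to the call above.</div>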

<font size="2"><a href="#summary">Back to summary</a></font>

---

# <div id="chap5"> 5. Random Forest</div>

<div align='justify'>I chose an ensemble method, the random forest (RF), which I believe fits this problem well. Training is usually fast because the trees can be built in parallel, and the model is generally more accurate than a single CART tree. Random forests also cope reasonably well with unbalanced data, which is precisely our case, although class weights or resampling may still help. Support for missing values depends on the implementation, which is one more reason the incomplete rows were dropped earlier. The main drawback is interpretability: an RF builds many decorrelated trees, each with its own splitting rules, and aggregates their votes to produce the prediction, so it is harder to interpret than a logistic regression.</div>

### <font color='blue' size='3'>5.1 Random gridsearch</font>

In [None]:
# Instantiate model
rf = RandomForestClassifier(random_state=42)

# Look at parameters used by random forest
print('Parameters currently in use:\n')
pprint(rf.get_params())

In [None]:
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 1000, num = 10)]

# Number of features to consider at every split
# ('auto' was an alias of 'sqrt' for classifiers and was removed in recent scikit-learn versions)
max_features = ['sqrt', 'log2']

# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)

# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]

# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]

# Method of selecting samples for training each tree
bootstrap = [False]

# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
pprint(random_grid)

In [None]:
# Use the random grid to search for best hyperparameters
# First create the base model to tune
rf = RandomForestClassifier(random_state=42)

# Random search of parameters, using 3 fold cross validation, 
# search across 10 different combinations, and use all available cores
rf_random = RandomizedSearchCV(estimator = rf, 
                               param_distributions = random_grid, 
                               n_iter = 10, 
                               cv = 3, 
                               verbose=2, 
                               random_state=42, 
                               n_jobs = -1)

# Fit the random search model
rf_random.fit(X_train, y_train)
rf_random.best_params_

In [None]:
def evaluate(model, test_features, test_labels):
    predictions = model.predict_proba(test_features)
    
    probs = predictions[:,1]
    # calculate AUC
    auc = roc_auc_score(test_labels, probs)
    print('AUC: %.3f' % auc)

    # calculate roc curve
    fpr, tpr, thresholds = roc_curve(test_labels, probs)
    # plot no skill
    plt.plot([0, 1], [0, 1], linestyle='--')
    # plot the roc curve for the model
    plt.plot(fpr, tpr, marker='.')
    # show the plot
    plt.show()
    return(auc, probs)

base_model = RandomForestClassifier(n_estimators = 10, 
                                    random_state = 42)

base_model.fit(X_train, 
               y_train)

base_accuracy = evaluate(base_model, 
                         X_test, 
                         y_test)[0]

<div align='justify'>AUC (area under the ROC curve) is one of the most widely used metrics for evaluating a binary classifier. It measures separability, that is, how well the model distinguishes between the two classes. The higher the AUC, the better the model ranks hired candidates above non-hired ones: 0.5 corresponds to random guessing and 1 to perfect separation.</div>
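
<div align='justify'>One way to build intuition for AUC is to compute it by hand as the share of (positive, negative) pairs that the model ranks in the right order. The helper <code>pairwise_auc</code> below is an illustrative sketch in plain Python, not the scikit-learn implementation, and the scores are made up.</div>

```python
from itertools import product

def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(labels, scores))  # 3 of the 4 pairs are ordered correctly -> 0.75
```

<div align='justify'>This quadratic version is only meant for small examples; <code>roc_auc_score</code> should remain the tool for real data.</div>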

In [None]:
best_random = rf_random.best_estimator_
random_accuracy = evaluate(best_random, X_test, y_test)

<div align='justify'>With an AUC of about 0.87, a randomly chosen hired candidate gets a higher predicted probability than a randomly chosen non-hired candidate roughly 87% of the time.</div>

In [None]:
print('Improvement of {:0.2f}%.'.format( 100 * (random_accuracy[0] - base_accuracy) / base_accuracy))

### <font color='blue' size='3'>5.2 Confusion matrix</font>

<div align='justify'>A confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known. The confusion matrix itself is relatively simple to understand, but the related terminology can be confusing.</div>

In [None]:
y_predict = evaluate(best_random, X_test, y_test)
y_predict = y_predict[1]
y_predict = [round(i) for i in y_predict]

In [None]:
confusion = confusion_matrix(y_test, y_predict)

def display_results(cm):
    # scikit-learn layout: rows = true class, columns = predicted class
    tn, fp = cm[0]
    fn, tp = cm[1]

    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that are found
    f1_score = 2 * precision * recall / (precision + recall)

    print(f'Precision : {round(precision, 2)}')
    print(f'Recall    : {round(recall, 2)}')
    print(f'F1 Score  : {round(f1_score, 2)}')
    
display_results(confusion)

- Precision: of the candidates predicted positive, the share that are actually positive.

- Recall: of the actual positives, the share that the model correctly labels as positive (true positives).

- F1 score: the harmonic mean of precision and recall.
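
<div align='justify'>These metrics matter most when the classes are unbalanced, because accuracy alone can look flattering. The sketch below uses made-up labels and a hypothetical helper <code>classification_summary</code> written in plain Python to show a model with high accuracy but poor recall.</div>

```python
def classification_summary(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": (tp + tn) / len(pairs),
            "precision": precision, "recall": recall, "f1": f1}

# 10 positives out of 100; the model finds only 2 of them and raises 1 false alarm
y_true = [1] * 10 + [0] * 90
y_pred = [1] * 2 + [0] * 8 + [1] * 1 + [0] * 89

summary = classification_summary(y_true, y_pred)
print(summary)
```

<div align='justify'>Accuracy is 0.91 here while recall is only 0.2, which is exactly the situation precision, recall and F1 are meant to expose.</div>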

**<font size="2"><a href="#summary">Back to summary</a></font>**

---

# <div id="chap6"> 6. Feature importance</div>

In [None]:
# Get numerical feature importances
importances = list(best_random.feature_importances_)

# List of tuples with variable and importance
# (the model was trained on the one-hot encoded columns of X_train, not on data.columns)
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(X_train.columns, importances)]

# Sort the feature importances by most important first
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)

# Print out the feature and importances 
for pair in feature_importances:
    print('Variable: {:20} Importance: {}'.format(*pair))

In [None]:
# Set the style
plt.style.use('fivethirtyeight')

# list of x locations for plotting
x_values = list(range(len(importances)))

# Make a bar chart
plt.bar(x_values, importances, orientation = 'vertical')

# Tick labels for x axis (same columns the model was trained on)
plt.xticks(x_values, X_train.columns, rotation='vertical')

# Axis labels and title
plt.ylabel('Importance')
plt.xlabel('Variable')
plt.title('Variable Importances')

<div align='justify'>As expected, hair color has no influence in the model.</div>
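
<div align='justify'>Because the model is trained on one-hot encoded columns, the importance of a categorical feature such as hair colour is spread over several dummy columns. One possible way to read it per original feature is to sum the dummies back together. The helper <code>group_importances</code> and the numbers below are illustrative assumptions, not results from this notebook.</div>

```python
import pandas as pd

def group_importances(columns, importances, categorical_prefixes):
    """Sum importances of dummy columns (prefix_level) back onto their source feature."""
    s = pd.Series(importances, index=columns)

    def source(col):
        for prefix in categorical_prefixes:
            if col.startswith(prefix + "_"):
                return prefix
        return col

    # groupby with a function applies it to each index label
    return s.groupby(source).sum().sort_values(ascending=False)

# Made-up importances, only to show the mechanics
columns = ["age", "note", "cheveux_brun", "cheveux_roux", "sexe_M"]
importances = [0.4, 0.3, 0.05, 0.05, 0.2]

grouped = group_importances(columns, importances, ["cheveux", "sexe"])
print(grouped)
```

<div align='justify'>With the real model, the inputs would be <code>X_train.columns</code> and <code>best_random.feature_importances_</code>.</div>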

**<font size="2"><a href="#summary">Back to summary</a></font>**

---

# <div id="chap7"> 7. Improvement</div>

### <font color='blue' size='3'>7.1 Gridsearch with CV</font>

In [None]:
# Create the parameter grid based on the results of random search 
param_grid = {
    'bootstrap': [True],
    'max_depth': [80, 90, 100],
    'max_features': [2, 3],
    'min_samples_leaf': [3, 4],
    'min_samples_split': [8, 10],
    'n_estimators': [100, 200]
}
# Create a base model
rf = RandomForestClassifier(random_state=42)
# Instantiate the grid search model
grid_search = GridSearchCV(estimator = rf, param_grid = param_grid, 
                          cv = 3, n_jobs = -1, verbose = 2)

In [None]:
# Fit the grid search to the data
grid_search.fit(X_train, y_train)
grid_search.best_params_
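
<div align='justify'>Unlike the randomized search, an exhaustive grid search tries every combination, so it is worth estimating the cost before launching it. The sketch below simply counts the combinations in the grid used above, with 3-fold cross-validation.</div>

```python
# Same grid as in the cell above
param_grid = {
    'bootstrap': [True],
    'max_depth': [80, 90, 100],
    'max_features': [2, 3],
    'min_samples_leaf': [3, 4],
    'min_samples_split': [8, 10],
    'n_estimators': [100, 200]
}

# Number of combinations = product of the number of values per parameter
n_combinations = 1
for values in param_grid.values():
    n_combinations *= len(values)

cv = 3
total_fits = n_combinations * cv
print(f"{n_combinations} combinations x {cv} folds = {total_fits} fits")
```

<div align='justify'>That is 48 combinations and 144 model fits, plus one final refit of the best combination on the whole training set, since <code>GridSearchCV</code> refits by default.</div>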

In [None]:
best_grid = grid_search.best_estimator_
grid_accuracy = evaluate(best_grid, X_test, y_test)

In [None]:
print('Improvement of {:0.2f}%.'.format( 100 * (grid_accuracy[0] - base_accuracy) / base_accuracy))

**<font size="2"><a href="#summary">Back to summary</a></font>**

<hr>
<br>
<div align='justify'><font color="#353B47" size="4">Thank you for taking the time to read this notebook. I hope it answered your questions or satisfied your curiosity, and that it was easy to follow. <u>Any constructive comments are welcome</u>. They help me progress and motivate me to share better quality content. I am above all a passionate person who tries to advance my own knowledge as well as that of others. If you liked it, feel free to <u>upvote and share my work.</u> </font></div>
<br>
<div align='center'><font color="#353B47" size="3">Thank you and may passion guide you.</font></div>