## Random Forest

Random Forest is an ensemble of Decision Trees. With a few exceptions, a `RandomForestClassifier` has all the hyperparameters of a `DecisionTreeClassifier` (to control how trees are grown), plus all the hyperparameters of a `BaggingClassifier` to control the ensemble itself.

The Random Forest algorithm introduces extra randomness when growing trees: instead of searching for the very best feature when splitting a node, it searches for the best feature among a random subset of features. This results in greater tree diversity, which trades a higher bias for a lower variance and generally yields a better overall model. A `BaggingClassifier` built on decision trees that sample features at each split is therefore roughly equivalent to a `RandomForestClassifier`. Run the cell below to visualize a single estimator from a random forest model trained on the Iris dataset to classify flowers into species.

In [None]:
from sklearn.datasets import load_iris
iris = load_iris()

# Model (can also use single decision tree)
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=10)

# Train
model.fit(iris.data, iris.target)
# Extract single tree
estimator = model.estimators_[5]

from sklearn.tree import export_graphviz
# Export as dot file
export_graphviz(estimator, out_file='tree.dot', 
                feature_names = iris.feature_names,
                class_names = iris.target_names,
                rounded = True, proportion = False, 
                precision = 2, filled = True)

# Convert to png using system command (requires Graphviz)
from subprocess import call
call(['dot', '-Tpng', 'tree.dot', '-o', 'tree.png', '-Gdpi=600'])

# Display in jupyter notebook
from IPython.display import Image
Image(filename = 'tree.png')

Notice how each split separates the data into buckets of similar observations. This is a single tree on a relatively simple classification dataset, but the same method applies to more complex datasets, where the trees grow deeper.
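
The rough equivalence between a random forest and a bag of decision trees that sample features at each split can be sketched as follows. This is a minimal illustration on the Iris data, not a tuned comparison; `n_estimators=50`, `max_features="sqrt"`, `random_state=42`, and 5-fold cross-validation are assumptions chosen for the example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)

# Random forest: bootstrap samples plus a random feature subset at every split
rf = RandomForestClassifier(n_estimators=50, max_features="sqrt", random_state=42)

# Bagging over decision trees that also sample features at every split
bag = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt"),
    n_estimators=50,
    bootstrap=True,
    random_state=42,
)

rf_acc = cross_val_score(rf, X_iris, y_iris, cv=5).mean()
bag_acc = cross_val_score(bag, X_iris, y_iris, cv=5).mean()
print(f"random forest: {rf_acc:.3f}, bagged trees: {bag_acc:.3f}")
```

The two scores should land close to each other, since both models combine bootstrap sampling with per-split feature sampling.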

## Coronavirus
Coronavirus disease (COVID-19) is an infectious disease caused by a new virus.
The disease causes respiratory illness (like the flu) with symptoms such as a cough, fever, and in more severe cases, difficulty breathing. You can protect yourself by washing your hands frequently, avoiding touching your face, and avoiding close contact (1 meter or 3 feet) with people who are unwell. An outbreak of COVID-19 started in December 2019 and, at the time this project was created, was continuing to spread throughout the world. Many governments recommended only essential outings to public places and closed most businesses that do not serve food or sell essential items. An excellent [spatial dashboard](https://www.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6) built by Johns Hopkins shows the daily confirmed cases by country.

This case study was designed to drive home the important role that data science plays in real-world situations like this pandemic. This case study uses the Random Forest Classifier and a dataset from the South Korean cases of COVID-19 provided on [Kaggle](https://www.kaggle.com/kimjihoo/coronavirusdataset) to encourage research on this important topic. The goal of the case study is to build a Random Forest Classifier to predict the 'state' of the patient.

First, please load the needed packages and modules into Python. Next, load the data into a pandas dataframe for ease of use.

In [1]:
import os
import pandas as pd
from datetime import datetime,timedelta
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import pandas_profiling
from pandas_profiling import ProfileReport
from pandas_profiling.utils.cache import cache_file

%matplotlib inline

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

from sklearn.metrics import precision_recall_curve
from sklearn.metrics import f1_score
from sklearn.metrics import auc
from sklearn.metrics import classification_report,confusion_matrix,roc_curve,roc_auc_score
from sklearn.metrics import accuracy_score,log_loss

from sklearn.linear_model import LogisticRegression

import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=DeprecationWarning)

import plotly.graph_objects as go






In [2]:
url ='SouthKoreacoronavirusdataset/PatientInfo.csv'
df = pd.read_csv(url, parse_dates=['symptom_onset_date','confirmed_date','released_date','deceased_date'], infer_datetime_format=True)
df.head()

| | patient_id | global_num | sex | birth_year | age | country | province | city | disease | infection_case | infection_order | infected_by | contact_number | symptom_onset_date | confirmed_date | released_date | deceased_date | state |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000000001 | 2.0 | male | 1964.0 | 50s | Korea | Seoul | Gangseo-gu | | overseas inflow | 1.0 | | 75.0 | 2020-01-22 | 2020-01-23 | 2020-02-05 | NaT | released |
| 1 | 1000000002 | 5.0 | male | 1987.0 | 30s | Korea | Seoul | Jungnang-gu | | overseas inflow | 1.0 | | 31.0 | NaT | 2020-01-30 | 2020-03-02 | NaT | released |
| 2 | 1000000003 | 6.0 | male | 1964.0 | 50s | Korea | Seoul | Jongno-gu | | contact with patient | 2.0 | 2002000000.0 | 17.0 | NaT | 2020-01-30 | 2020-02-19 | NaT | released |
| 3 | 1000000004 | 7.0 | male | 1991.0 | 20s | Korea | Seoul | Mapo-gu | | overseas inflow | 1.0 | | 9.0 | 2020-01-26 | 2020-01-30 | 2020-02-15 | NaT | released |
| 4 | 1000000005 | 9.0 | female | 1992.0 | 20s | Korea | Seoul | Seongbuk-gu | | contact with patient | 2.0 | 1000000000.0 | 2.0 | NaT | 2020-01-31 | 2020-02-24 | NaT | released |


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2218 entries, 0 to 2217
Data columns (total 18 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   patient_id          2218 non-null   int64         
 1   global_num          1314 non-null   float64       
 2   sex                 2073 non-null   object        
 3   birth_year          1764 non-null   float64       
 4   age                 1957 non-null   object        
 5   country             2218 non-null   object        
 6   province            2218 non-null   object        
 7   city                2153 non-null   object        
 8   disease             19 non-null     object        
 9   infection_case      1163 non-null   object        
 10  infection_order     42 non-null     float64       
 11  infected_by         469 non-null    float64       
 12  contact_number      411 non-null    float64       
 13  symptom_onset_date  193 non-null    datetime64[n

In [4]:
df.nunique()

patient_id            2218
global_num            1303
sex                      2
birth_year              96
age                     11
country                  4
province                17
city                   134
disease                  1
infection_case          16
infection_order          6
infected_by            206
contact_number          72
symptom_onset_date      38
confirmed_date          45
released_date           35
deceased_date           16
state                    3
dtype: int64

In [5]:
df1 = df.copy()

In [6]:
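# Compute age in years from birth_year; this overwrites the decade labels ('50s', '30s', ...)
# Note: datetime.now() makes the result depend on the year in which the cell is run
now = datetime.now()
year = now.year
df['age'] = year - df['birth_year']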

In [7]:
#dropping birth_year as age is calculated
df.drop('birth_year', inplace=True, axis=1)
df.head()

| | patient_id | global_num | sex | age | country | province | city | disease | infection_case | infection_order | infected_by | contact_number | symptom_onset_date | confirmed_date | released_date | deceased_date | state |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000000001 | 2.0 | male | 56.0 | Korea | Seoul | Gangseo-gu | | overseas inflow | 1.0 | | 75.0 | 2020-01-22 | 2020-01-23 | 2020-02-05 | NaT | released |
| 1 | 1000000002 | 5.0 | male | 33.0 | Korea | Seoul | Jungnang-gu | | overseas inflow | 1.0 | | 31.0 | NaT | 2020-01-30 | 2020-03-02 | NaT | released |
| 2 | 1000000003 | 6.0 | male | 56.0 | Korea | Seoul | Jongno-gu | | contact with patient | 2.0 | 2002000000.0 | 17.0 | NaT | 2020-01-30 | 2020-02-19 | NaT | released |
| 3 | 1000000004 | 7.0 | male | 29.0 | Korea | Seoul | Mapo-gu | | overseas inflow | 1.0 | | 9.0 | 2020-01-26 | 2020-01-30 | 2020-02-15 | NaT | released |
| 4 | 1000000005 | 9.0 | female | 28.0 | Korea | Seoul | Seongbuk-gu | | contact with patient | 2.0 | 1000000000.0 | 2.0 | NaT | 2020-01-31 | 2020-02-24 | NaT | released |


In [8]:
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| patient_id | 2218.0 | 4014678000.0 | 2192419000.0 | 1000000000.0 | 1700000000.0 | 6001000000.0 | 6004000000.0 | 7000000000.0 |
| global_num | 1314.0 | 4664.817 | 2874.044 | 1.0 | 1908.5 | 5210.5 | 7481.5 | 8717.0 |
| age | 1764.0 | 45.01134 | 19.41264 | 0.0 | 27.0 | 45.5 | 58.0 | 104.0 |
| infection_order | 42.0 | 2.285714 | 1.254955 | 1.0 | 1.25 | 2.0 | 3.0 | 6.0 |
| infected_by | 469.0 | 2600789000.0 | 1570638000.0 | 1000000000.0 | 1200000000.0 | 2000000000.0 | 4100000000.0 | 6113000000.0 |
| contact_number | 411.0 | 24.12895 | 91.08779 | 0.0 | 2.0 | 5.0 | 16.0 | 1160.0 |


df['symptom_onset_date'] = pd.to_datetime(df['symptom_onset_date'])
df['confirmed_date'] = pd.to_datetime(df['confirmed_date'])
df['released_date'] = pd.to_datetime(df['released_date'])
df['deceased_date'] = pd.to_datetime(df['deceased_date'])

In [9]:
cols = ['sex', 'country','province', 'city','infection_case', 'state']
df[cols] = df[cols].astype('category')

In [10]:
df.dtypes

patient_id                     int64
global_num                   float64
sex                         category
age                          float64
country                     category
province                    category
city                        category
disease                       object
infection_case              category
infection_order              float64
infected_by                  float64
contact_number               float64
symptom_onset_date    datetime64[ns]
confirmed_date        datetime64[ns]
released_date         datetime64[ns]
deceased_date         datetime64[ns]
state                       category
dtype: object

In [11]:
df.isnull().sum()

patient_id               0
global_num             904
sex                    145
age                    454
country                  0
province                 0
city                    65
disease               2199
infection_case        1055
infection_order       2176
infected_by           1749
contact_number        1807
symptom_onset_date    2025
confirmed_date         141
released_date         1995
deceased_date         2186
state                   88
dtype: int64

The `state` column has 88 missing values, meaning these patients are not recorded as 'isolated', 'released', or 'deceased'. We can therefore replace these NaN values with a new category, 'missing'.

In [12]:
df.state.unique()

['released', 'isolated', 'deceased', NaN]
Categories (3, object): ['released', 'isolated', 'deceased']

In [13]:
df['sex'].unique()

['male', 'female', NaN]
Categories (2, object): ['male', 'female']

In [14]:
# filling Nan values in state as missing
df['state'] = df['state'].cat.add_categories('missing').fillna('missing')
df['sex'] = df['sex'].cat.add_categories('neutral').fillna('neutral')

In [15]:
df.state.unique(), df['sex'].unique()

(['released', 'isolated', 'deceased', 'missing']
 Categories (4, object): ['released', 'isolated', 'deceased', 'missing'],
 ['male', 'female', 'neutral']
 Categories (3, object): ['male', 'female', 'neutral'])

In [16]:
# Missing disease flags become 0 and True becomes 1; assigning the result back avoids chained indexing
df['disease'] = df['disease'].fillna(0).astype(int)



In [17]:
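# Fill remaining numeric NaNs with the column mean; numeric_only skips category and datetime columns
df = df.fillna(df.mean(numeric_only=True))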

In [18]:
df.isnull().sum()

patient_id               0
global_num               0
sex                      0
age                      0
country                  0
province                 0
city                    65
disease                  0
infection_case        1055
infection_order          0
infected_by              0
contact_number           0
symptom_onset_date    2025
confirmed_date         141
released_date         1995
deceased_date         2186
state                    0
dtype: int64

In [19]:
df.state.value_counts()

isolated    1791
released     307
missing       88
deceased      32
Name: state, dtype: int64

### The `state` column shows 307 released patients, but 84 of them have no `released_date`. These 84 NaT values are imputed with the last release date in the data, 2020-03-19

In [None]:
print('No of patients whose released dates missing are :', len(df[df['state']=='released'])- len(df[df['released_date'].notnull()]))


In [None]:
# Select the 84 released patients whose released_date is missing
filt = df['released_date'].isna() & (df['state'] == 'released')

df.loc[filt, ['released_date']] = pd.to_datetime('2020-03-19')

### Missing `symptom_onset_date` values are filled with `confirmed_date` minus 2 days, as is the case for most patients

In [None]:
# A comparison such as `!= pd.NaT` is always True, so use isna()/notna() to test for missing dates
filt1 = df['symptom_onset_date'].isna() & df['confirmed_date'].notna()
df.loc[filt1, ['symptom_onset_date','confirmed_date']]

In [None]:
# Estimate onset as two days before confirmation; selecting a Series keeps the rows index-aligned
confirm_dates = df.loc[filt1, 'confirmed_date']
df.loc[filt1, 'symptom_onset_date'] = confirm_dates - timedelta(days=2)

In [None]:
df.loc[filt1, 'symptom_onset_date'] = pd.to_datetime(df.loc[filt1, 'symptom_onset_date'])

In [None]:
df['symptom_onset_date'] = pd.to_datetime(df['symptom_onset_date'])

In [None]:
df.info()

In [None]:
df.loc[df['deceased_date'].notnull()]

In [None]:
df.iloc[[109,187],:]

#### Rows 109 and 187 contain conflicting data: both have a released date and a deceased date, while `state` is 'released'. Dropping these rows avoids the conflict.

In [None]:
df.drop([109,187], inplace=True)

In [None]:
filt2 = (df['deceased_date'].isin([pd.NaT]) & (df['state'] == 'deceased'))
df.loc[filt2, ['deceased_date','state']]

#### Patients marked 'deceased' in `state` but missing a `deceased_date` (index 1611 and 1946) are assigned the end date of the dataset, 2020-03-19

In [None]:
df.loc[filt2, 'deceased_date'] =  pd.to_datetime('2020-03-19')

In [None]:
df[df['state']=='deceased']

In [None]:
filt3 = (df['symptom_onset_date'].isin([pd.NaT]) & df['confirmed_date'].isin([pd.NaT])& (df['state'] == 'deceased'))
df.loc[filt3, ['symptom_onset_date','confirmed_date','state']]

In [None]:
# fill symptom_onset_date and confirmed_date for index 1946 with the dates of the next row
df.loc[filt3, ['symptom_onset_date']] = pd.to_datetime('2020-02-17')
df.loc[filt3, ['confirmed_date']] = pd.to_datetime('2020-02-19')

In [None]:
filt4 = (df['symptom_onset_date'].isin([pd.NaT]) & df['confirmed_date'].isin([pd.NaT])& (df['state'] =='isolated'))
df.loc[filt4, ['symptom_onset_date','confirmed_date','state']]

#### There are 140 isolated patients whose `symptom_onset_date` and `confirmed_date` are both missing. These dates could be imputed anywhere in the range 19 Jan 2020 to 18 Mar 2020. Because these 140 patients are still in isolation, `symptom_onset_date` is imputed as 17 Mar 2020 and `confirmed_date` as 18 Mar 2020

In [None]:
df['symptom_onset_date'].sort_values(ascending=False).head(1)

In [None]:
df['confirmed_date'] .sort_values(ascending=False).head(1)

In [None]:
df.loc[filt4, ['symptom_onset_date']] = pd.to_datetime('2020-03-17')
# 18 Mar 2020 matches the imputation date stated in the note above
df.loc[filt4, ['confirmed_date']] = pd.to_datetime('2020-03-18')

In [None]:
df['infection_case'].unique()

In [None]:
df.info()

#### The null values in `released_date` and `deceased_date` do not represent missing data; they simply do not apply to patients who were not released or did not die. That outcome is already captured in the `state` column, so both date columns are dropped

In [None]:
df = df.drop(['released_date','deceased_date'],axis =1)

Review the count of unique values by column.

In [None]:
print(df.nunique())

Review the percent of unique values by column.

In [None]:
print(df.nunique()/df.shape[0])

Review the range of values per column.

In [None]:
df.describe().T

### Check for duplicated rows

In [None]:
duplicateRowsDF = df[df.duplicated()]
duplicateRowsDF

In [None]:
df.info()

**<font color='teal'> Create dummy features for object type features. </font>**

In [None]:
dfd = pd.get_dummies(df.drop('state', axis=1),drop_first=True)
dfd.columns

In [None]:
dfd
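
The effect of `drop_first=True` is easier to see on a tiny frame. The sketch below uses a made-up `toy` DataFrame (hypothetical values, not taken from the patient data) to compare the columns produced with and without dropping the first level.

```python
import pandas as pd

# Hypothetical miniature frame with one categorical and one numeric column
toy = pd.DataFrame({
    "sex": pd.Categorical(["male", "female", "neutral", "female"]),
    "age": [56.0, 33.0, 29.0, 28.0],
})

full = pd.get_dummies(toy)                      # one indicator column per level
reduced = pd.get_dummies(toy, drop_first=True)  # first level of each category dropped

print(list(full.columns))
print(list(reduced.columns))
```

Dropping the first level removes a column that is fully determined by the others, which matters more for linear models than for tree ensembles but keeps the feature matrix smaller.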

Print the categorical columns and their associated levels.

In [None]:
dfo = df.select_dtypes(exclude=['datetime', 'int64','float64'])
dfo

In [None]:
#get levels for all variables
vn = pd.DataFrame(dfo.nunique()).reset_index()
vn.columns = ['VarName', 'LevelsCount']
# sort_values returns a new DataFrame, so assign it back before displaying
vn = vn.sort_values(by='LevelsCount', ascending=False)
vn

**<font color='teal'> Plot the correlation heat map for the features.</font>**

In [None]:
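# Correlation is only defined for numeric columns, so select them explicitly
sns.heatmap(df.select_dtypes(include='number').corr())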

In [None]:
df.hist(bins=25,figsize=(15,15))

In [None]:
# get counts of each value (print so the counts show alongside the plot)
print(df.state.value_counts())

# rotate x-axis labels by 90 degrees
plt.xticks(rotation=90)

# count plot for one variable
sns.countplot(data = df, x = 'state')

In [None]:
# get counts of each value (print so the counts show alongside the plot)
print(df.sex.value_counts())

# rotate x-axis labels by 90 degrees
plt.xticks(rotation=90)

# count plot for one variable
sns.countplot(data = df, x = 'sex')

In [None]:
dfdt = df.select_dtypes(exclude=('datetime', 'category'))
dfdt

In [None]:
# define standard scaler
scaler = StandardScaler()
# transform data
scaled = scaler.fit_transform(dfdt)
print(scaled)
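
As a quick check of what `StandardScaler` computes, the sketch below compares it with the manual formula z = (x - mean) / std on a few made-up values. The numbers are illustrative only; the one detail worth noting is that the scaler uses the population standard deviation (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-column data for illustration
values = np.array([[2.0], [5.0], [16.0], [75.0]])

scaled_sklearn = StandardScaler().fit_transform(values)

# Manual standardization: subtract the column mean, divide by the population std (ddof=0)
scaled_manual = (values - values.mean(axis=0)) / values.std(axis=0)

print(np.allclose(scaled_sklearn, scaled_manual))
```

Tree-based models such as random forests do not require scaled inputs, so this step mainly helps when comparing distributions or feeding the same features to distance-based methods.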

In [None]:
fig, axes = plt.subplots(4, 2, figsize=(30, 30), sharey=True)

sns.boxplot(data=dfdt, x='patient_id',ax=axes[0,0])
sns.boxplot(data=dfdt, x='global_num', ax=axes[0,1])
sns.boxplot(data=dfdt, x='age', ax=axes[1,0])
sns.boxplot(data=dfdt, x='disease', ax=axes[1,1])
sns.boxplot(data=dfdt, x='infection_order',whis=1, ax=axes[2,0])
sns.boxplot(data=dfdt, x='infected_by', ax=axes[2,1])
sns.boxplot(data=dfdt, x='contact_number', ax=axes[3,0])

In [None]:
X= df.drop('state', axis =1).values
y=df[['state']]

In [None]:
dfd['symptom_onset_date'] = dfd['symptom_onset_date'].astype('int64')
dfd['confirmed_date'] = dfd['confirmed_date'].astype('int64')

In [None]:
X = dfd.values
y = df[['state']].values.ravel()

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
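
Because the classes are heavily imbalanced (only 32 deceased patients in the counts shown earlier), a plain random split can leave very few rare-class rows in the test set. One option, sketched here on synthetic labels that mirror those class counts, is the `stratify` argument of `train_test_split`. The names `X_demo` and `y_demo` are placeholders, not the notebook's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels reproducing the class counts reported by value_counts() above
counts = {"isolated": 1791, "released": 307, "missing": 88, "deceased": 32}
y_demo = np.repeat(list(counts), list(counts.values()))
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature

# Plain split versus a split that preserves class proportions
_, _, _, y_plain = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
_, _, _, y_strat = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

deceased_plain = int((y_plain == "deceased").sum())
deceased_strat = int((y_strat == "deceased").sum())
print("deceased in test set:", deceased_plain, "(plain) vs", deceased_strat, "(stratified)")
```

With stratification, the test set keeps roughly 20% of each class, which makes per-class metrics more stable.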

### Finding Best Hyperparameters

In [None]:
def print_results(results):
    print('BEST PARAMS: {}\n'.format(results.best_params_))

    means = results.cv_results_['mean_test_score']
    stds = results.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, results.cv_results_['params']):
        print('{} (+/-{}) for {}'.format(round(mean, 3), round(std * 2, 3), params))

In [None]:
rf = RandomForestClassifier()
parameters = {
    'n_estimators': [5, 50, 250],
    'max_depth': [2, 4, 8, 16, 32, None]
}

cv = GridSearchCV(rf, parameters, cv=5)
cv.fit(X_train, y_train)

print_results(cv)
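
Rather than copying the best parameters by hand, the fitted search object already holds a model refit with them. The sketch below shows this on the Iris data with a deliberately small grid; the grid values and `random_state` are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X_demo, y_demo = load_iris(return_X_y=True)

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {"n_estimators": [5, 50], "max_depth": [2, None]},
    cv=5,
)
search.fit(X_demo, y_demo)

# With the default refit=True, best_estimator_ is retrained on all data passed to fit()
best_model = search.best_estimator_
print(search.best_params_, round(search.best_score_, 3))
```

Using `best_estimator_` keeps the final model in sync with the search results if the grid or the data change later.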

In [None]:
best_rf = RandomForestClassifier(n_estimators=250, max_depth=16,)
model = best_rf.fit(X_train, y_train)

In [None]:
y_pred = model.predict(X_test)
y_pred

In [None]:
y_pred_prob = model.predict_proba(X_test)
y_pred_prob

In [None]:
lr_probs = y_pred_prob[:,1]

In [None]:
ac = accuracy_score(y_test, y_pred)
ac

In [None]:
f1 = f1_score(y_test, y_pred, average='weighted')
cm = confusion_matrix(y_test, y_pred)

print('Random Forest: Accuracy=%.3f' % (ac))

print('Random Forest: f1-score=%.3f' % (f1))

In [None]:
macro_roc_auc_ovo = roc_auc_score(y_test, y_pred_prob, multi_class="ovo",
                                  average="macro")
weighted_roc_auc_ovo = roc_auc_score(y_test, y_pred_prob, multi_class="ovo",
                                     average="weighted")
macro_roc_auc_ovr = roc_auc_score(y_test, y_pred_prob, multi_class="ovr",
                                  average="macro")
weighted_roc_auc_ovr = roc_auc_score(y_test, y_pred_prob, multi_class="ovr",
                                     average="weighted")
print("One-vs-One ROC AUC scores:\n{:.6f} (macro),\n{:.6f} " "(weighted by prevalence)".format(macro_roc_auc_ovo, weighted_roc_auc_ovo))
print("One-vs-Rest ROC AUC scores:\n{:.6f} (macro),\n{:.6f} ""(weighted by prevalence)".format(macro_roc_auc_ovr, weighted_roc_auc_ovr))

In [None]:
dfd0 = pd.get_dummies(df)
dfd0.columns

In [None]:
dfd0

In [None]:
from sklearn.cluster import KMeans
x = dfd.values


In [None]:
# Fit KMeans for k = 1..10 and record the inertia (within-cluster sum of squares)
Error = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i).fit(x)
    Error.append(kmeans.inertia_)
plt.plot(range(1, 11), Error)
plt.title('Elbow method')
plt.xlabel('No of clusters')
plt.ylabel('Error')
plt.show()

In [None]:
kmeans3 = KMeans(n_clusters=4)
labels = kmeans3.fit_predict(x)
print(labels)

In [None]:
plt.scatter(x[:,0],x[:,1],c=labels, cmap='rainbow')

### Create Confusion Matrix Plots
Confusion matrices are a great way to review model performance on a multi-class classification problem. Seeing which class the misclassified observations end up in helps you decide whether you need to build additional features to improve the overall model. In the example below we plot a regular counts confusion matrix as well as a normalized (row-percent) confusion matrix. The normalized matrix is particularly helpful when you have unbalanced class sizes.

In [None]:
# confusion_matrix orders string labels alphabetically, so the tick labels must use the same order
class_names = ['deceased', 'isolated', 'missing', 'released']

In [None]:
import itertools
import numpy as np
import matplotlib.pyplot as plt

from sklearn import svm, datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()


# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)
np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names,
                      title='Confusion matrix, without normalization')
#plt.savefig('figures/RF_cm_multi_class.png')

# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, normalize=True,
                      title='Normalized confusion matrix')
#plt.savefig('figures/RF_cm_proportion_multi_class.png', bbox_inches="tight")
plt.show()
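
To see why the normalized view helps with unbalanced classes, consider a tiny made-up example with eight 'isolated' and two 'released' patients. The sketch uses the `normalize="true"` option of `confusion_matrix`, which divides each row by its total, the same row normalization applied in `plot_confusion_matrix` above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels: the majority class is predicted perfectly, the minority class only half the time
y_true_demo = ["isolated"] * 8 + ["released"] * 2
y_pred_demo = ["isolated"] * 8 + ["isolated", "released"]

labels = ["isolated", "released"]
counts_cm = confusion_matrix(y_true_demo, y_pred_demo, labels=labels)
rates_cm = confusion_matrix(y_true_demo, y_pred_demo, labels=labels, normalize="true")

print(counts_cm)
print(rates_cm)
```

The counts matrix shows a single error out of ten predictions, while the normalized matrix makes it clear that half of the 'released' patients were misclassified.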

### Plot feature importances
The random forest algorithm can be used as a regression or classification model. In either case it tends to be a bit of a black box, where understanding what's happening under the hood can be difficult. Plotting the feature importances is one way that you can gain a perspective on which features are driving the model predictions.

In [None]:
df.columns

In [None]:
feature_importance = best_rf.feature_importances_
# make importances relative to max importance
feature_importance = 100.0 * (feature_importance / feature_importance.max())
# indices of the 30 most important features, in ascending order for the horizontal bar plot
sorted_idx = np.argsort(feature_importance)[-30:]

pos = np.arange(sorted_idx.shape[0]) + .5
print(pos.size)
plt.figure(figsize=(10,10))
plt.barh(pos, feature_importance[sorted_idx], align='center')
# X is a NumPy array, so the feature names come from the dummy-encoded DataFrame dfd
plt.yticks(pos, dfd.columns[sorted_idx])
plt.xlabel('Relative Importance')
plt.title('Variable Importance')
plt.show()
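
Pairing importances with feature names in a pandas Series is a compact way to inspect them without a plot. The sketch below does this on the Iris data; `forest` and `importances` are illustrative names, and the forest settings are assumptions rather than the tuned model from this case study.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(iris.data, iris.target)

# Impurity-based importances are normalized to sum to 1 across features
importances = pd.Series(forest.feature_importances_, index=iris.feature_names)
print(importances.sort_values(ascending=False))
```

Keep in mind that impurity-based importances can favor high-cardinality features, so they are best read as a rough ranking rather than an exact measure.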

The popularity of random forest is primarily due to how well it performs across a wide range of data situations. It tends to handle highly correlated features well, whereas a linear regression model would not. In this case study we demonstrate its performance with only a few features, almost all of them highly correlated with each other.
Random Forest is also an efficient way to investigate the importance of a set of features in a large dataset. Consider random forest one of your first choices when building a tree-based model, especially for multiclass classification.
