# Training DT and RF Classifiers on Your Data
This notebook will take you through the steps necessary to train Decision Tree (DT) and Random Forest (RF) classifiers to recognize ICD-9 codes, or items from similar dictionaries, from free text.

## Setup

### Imports and Definitions
Make sure that the packages imported below are installed in the environment where this notebook will run.

In [12]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pickle

from sklearn.datasets import make_multilabel_classification
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
import warnings
from sklearn.exceptions import UndefinedMetricWarning
warnings.filterwarnings("ignore", category=UndefinedMetricWarning)
warnings.filterwarnings("ignore", category=UserWarning)
from sklearn.datasets import fetch_20newsgroups

In [8]:
def printBestModelStatistics(gridCV, scoring, modelName):
    """
    Description: Prints the mean validation scores of the best model found by a grid search.
    Input:
        gridCV (GridSearchCV): A fitted sklearn GridSearchCV object (multi-metric, with refit set).
        scoring (list): Names of the metrics in gridCV.cv_results_ which are to be printed.
        modelName (str): Name of the model, used as the label in the printed output.
    Output:
        None: Results are printed.
    """
    scoringDict = {}
    bestModelIndex = gridCV.best_index_
    # Build the header once, before the loops, so it exists even if scoring is empty.
    outStr = "For Model {}:".format(modelName)
    for score in scoring:
        scoringDict[score] = gridCV.cv_results_["mean_test_" + score][bestModelIndex]
    for scoreName, scoreVal in scoringDict.items():
        outStr += "\n\t{}: {}".format(scoreName, np.round(scoreVal, decimals = 3))
    print(outStr)
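
The helper relies on two conventions of a multi-metric `GridSearchCV`: the mean validation score of each scorer is stored in `cv_results_` under the key `"mean_test_<scorer>"`, and `best_index_` points at the candidate ranked first on the `refit` metric. A minimal sketch on synthetic data follows; the toy dataset, the one-parameter grid, and the names `X_toy`, `Y_toy`, and `search` are illustrative assumptions, not part of this notebook's pipeline.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic multilabel data, used only for illustration.
X_toy, Y_toy = make_multilabel_classification(
    n_samples=60, n_features=10, n_classes=3, random_state=0
)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [1, 3]},
    scoring=["f1_micro", "f1_weighted"],
    refit="f1_weighted",  # the metric that determines best_index_
    cv=3,
)
search.fit(X_toy, Y_toy)

# Look up the best candidate's mean validation score for each metric.
best = search.best_index_
for name in ["f1_micro", "f1_weighted"]:
    print(name, round(search.cv_results_["mean_test_" + name][best], 3))
```

This is the same lookup that `printBestModelStatistics` performs for each entry of `scoring`.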

### Load Data

Change the `dataPrefix` variable below to run this notebook on a different dataset. The code expects separate train and validation files, named `dataPrefix + "_train.csv"` and `dataPrefix + "_valid.csv"`, with identically named columns (input = `"TEXT"`, labels = `"V9"`), so check this before running.

In [9]:
# Load data
# dataPrefix = 'data/mimic_0815'
dataPrefix = 'data/mimic_mm_reps_0815'
# dataPrefix = 'data/csu_mm_reps_nodiag_no17_0815'
# dataPrefix = 'data/csu_nodiag_no17_0815'
nRows = None  # None reads every row; set an integer to load only the first nRows rows of each file
df = pd.read_csv(dataPrefix + "_train.csv", usecols = ["TEXT", "V9"], nrows = nRows)
nTrain = df.shape[0]
df = pd.concat([df, pd.read_csv(dataPrefix + "_valid.csv", usecols = ["TEXT", "V9"], nrows = nRows)])
nVal = df.shape[0] - nTrain
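# Each V9 entry holds several labels separated by "-"; split them apart.
Y_temp = [x.split("-") for x in df.V9]
Y = []
for sublist in Y_temp:
    # Keep only the part after the last ":" of each label.
    categories = [cat.split(":")[-1] for cat in sublist]
    Y.append(categories)
# Convert the lists of labels into a binary indicator matrix (one column per label).
Y = MultiLabelBinarizer().fit_transform(Y)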
print("n train: {}\nn Val: {}".format(nTrain, nVal))

n train: 39541
n Val: 13181
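
To make the label handling concrete, here is a small sketch of the same parsing applied to made-up label strings. The strings are hypothetical; they only assume the format implied by the code above, namely labels separated by `-`, each optionally carrying a `:`-separated prefix.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical V9 values; the real files may use different prefixes and codes.
raw_labels = ["a:1-b:7", "7", "c:1-d:3-e:7"]

# Split each entry on "-" and keep the text after the last ":" of each piece.
parsed = [[part.split(":")[-1] for part in entry.split("-")] for entry in raw_labels]

mlb = MultiLabelBinarizer()
Y_demo = mlb.fit_transform(parsed)

print(parsed)        # one list of label strings per row
print(mlb.classes_)  # column order of the indicator matrix
print(Y_demo)        # binary matrix: rows are documents, columns are labels
```

Keeping the fitted `MultiLabelBinarizer` in a variable, as this sketch does, makes it possible to map predicted columns back to label names through `classes_`.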


### Set Up Model

The code below defines the pipelines, hyperparameter grids, and grid-search objects needed to train the DT and RF models over a grid of hyperparameters.

In [10]:
modelDict = {}
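# Single fixed split: rows [0, nTrain) are used for training, rows [nTrain, nTrain + nVal) for validation.
cv = lambda: zip([np.arange(nTrain)], [np.arange(nTrain, nTrain + nVal)])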
n_jobs = 2
verbose = 1

scoring = ["f1_micro", "f1_macro", "f1_weighted", "precision_samples", "recall_samples"]
importantMetric = "f1_weighted"
# scoring = None#["f1","precision", "recall"]
# importantMetric = None#"f1"

max_df = (0.5, 0.75, 0.95)
tf_idf_norm = ('l1', 'l2')
myCountVectorizer = ('vect', CountVectorizer(stop_words = 'english', min_df = 0.05))
myTfidfTransformer = ('tfidf', TfidfTransformer(norm = 'l2',  use_idf = True))

# Decision Tree Classifier

estimators_DT = []
estimators_DT.append(myCountVectorizer)
estimators_DT.append(myTfidfTransformer)
estimators_DT.append(('DT', DecisionTreeClassifier()))
paramGrid_DT = [
    {
        'vect__max_df': max_df,
#         'vect__max_features': (5000, 10000),
#         'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
        # 'tfidf__use_idf': (True, False),
        'tfidf__norm': tf_idf_norm,
        "DT__max_features": ["sqrt", "log2", 0.5],# we have lots of possibly extraneous
        # features so it might be good to use lower numbers here
        "DT__max_depth": [None],# still need to understand if deeper trees are better.
        "DT__criterion":["gini"],
    }
]
modelDict["DT"] = {"pipe": Pipeline(estimators_DT),
                             "params": paramGrid_DT}
modelDict["DT"]["gridcv"] = GridSearchCV(estimator = modelDict["DT"]["pipe"],
                       param_grid = modelDict["DT"]["params"],
                       cv = cv(), 
                       n_jobs = n_jobs, return_train_score = False,
                       scoring = scoring, refit = importantMetric, verbose = verbose)

# Random Forest Classifier
estimators_RF = []
estimators_RF.append(myCountVectorizer)
estimators_RF.append(myTfidfTransformer)
estimators_RF.append(('RFC', RandomForestClassifier()))
paramGrid_RF = [
    {
        'vect__max_df': max_df,
        'tfidf__norm': tf_idf_norm,
        "RFC__n_estimators": [5, 10],# second most important hyperparameter to tune. First
        # is the max number of features.
        "RFC__max_features": ["sqrt", "log2", 0.5],# we have lots of possibly extraneous
        # features so it might be good to use lower numbers here
        "RFC__max_depth": [None],# still need to understand if deeper trees are better.
        "RFC__criterion":["gini"],
    }
]
modelDict["RFC"] = {"pipe": Pipeline(estimators_RF),
                             "params": paramGrid_RF}
modelDict["RFC"]["gridcv"] = GridSearchCV(estimator = modelDict["RFC"]["pipe"],
                       param_grid = modelDict["RFC"]["params"],
                       cv = cv(), 
                       n_jobs = n_jobs, return_train_score = False,
                       scoring = scoring, refit = importantMetric, verbose = verbose)
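
Because `cv()` yields exactly one (train, validation) index pair, the grid search does not cross-validate in the usual sense; it scores every candidate on the same fixed validation rows. As a sketch of the same idea, scikit-learn's `PredefinedSplit` can express this split too. The sizes below are small stand-ins for `nTrain` and `nVal`, and the `_demo` names are illustrative.

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

n_train_demo, n_val_demo = 6, 3  # stand-ins for nTrain and nVal

# The notebook's approach: one hand-built (train, validation) index pair.
manual_splits = list(zip([np.arange(n_train_demo)],
                         [np.arange(n_train_demo, n_train_demo + n_val_demo)]))

# PredefinedSplit version: -1 marks rows never used for validation,
# 0 marks the rows that form validation fold 0.
test_fold = np.concatenate([np.full(n_train_demo, -1),
                            np.zeros(n_val_demo, dtype=int)])
ps = PredefinedSplit(test_fold)
ps_splits = list(ps.split())

print(manual_splits[0])
print(ps_splits[0])
```

A `PredefinedSplit` object can be passed to `GridSearchCV(cv=...)` directly and reused across searches, whereas a `zip` iterator is exhausted after one pass, which is why the notebook calls `cv()` afresh for each model.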

## Model Training

The code below trains each model and prints the best validation scores and hyperparameters found by the grid search.

In [11]:
for modelName, currModelDict in modelDict.items():
    print("Training {}".format(modelName))
    currModelDict["gridcv"].fit(df.TEXT.values, Y)
    printBestModelStatistics(gridCV = currModelDict["gridcv"],
                         scoring = scoring, modelName = modelName)
    currModelDict["refitMetric"] = importantMetric
    print("Best Model Parameters {}".format(currModelDict["gridcv"].best_params_))
    print("*"*100)

Training DT
Fitting 1 folds for each of 18 candidates, totalling 18 fits


[Parallel(n_jobs=2)]: Done  18 out of  18 | elapsed: 32.7min finished


For Model DT:
	f1_micro: 0.617
	f1_macro: 0.488
	f1_weighted: 0.615
	precision_samples: 0.638
	recall_samples: 0.607
Best Model Parameters {'DT__criterion': 'gini', 'DT__max_depth': None, 'DT__max_features': 0.5, 'tfidf__norm': 'l1', 'vect__max_df': 0.5}
****************************************************************************************************
Training RFC
Fitting 1 folds for each of 36 candidates, totalling 36 fits


[Parallel(n_jobs=2)]: Done  36 out of  36 | elapsed: 146.2min finished


For Model RFC:
	f1_micro: 0.671
	f1_macro: 0.509
	f1_weighted: 0.644
	precision_samples: 0.731
	recall_samples: 0.627
Best Model Parameters {'RFC__criterion': 'gini', 'RFC__max_depth': None, 'RFC__max_features': 0.5, 'RFC__n_estimators': 5, 'tfidf__norm': 'l2', 'vect__max_df': 0.5}
****************************************************************************************************
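
The candidate counts in the log follow directly from the grids: every combination of the listed values is one candidate, and with a single train/validation split each candidate is fitted once. A quick check using only the standard library is sketched below; the dictionaries copy the value lists from the setup cell, and `count_candidates` is a helper introduced here for illustration.

```python
from itertools import product

grid_dt = {
    "vect__max_df": (0.5, 0.75, 0.95),
    "tfidf__norm": ("l1", "l2"),
    "DT__max_features": ["sqrt", "log2", 0.5],
    "DT__max_depth": [None],
    "DT__criterion": ["gini"],
}
grid_rf = {
    "vect__max_df": (0.5, 0.75, 0.95),
    "tfidf__norm": ("l1", "l2"),
    "RFC__n_estimators": [5, 10],
    "RFC__max_features": ["sqrt", "log2", 0.5],
    "RFC__max_depth": [None],
    "RFC__criterion": ["gini"],
}

def count_candidates(grid):
    """Number of hyperparameter combinations in a single grid dictionary."""
    return len(list(product(*grid.values())))

n_dt = count_candidates(grid_dt)  # 3 * 2 * 3 * 1 * 1
n_rf = count_candidates(grid_rf)  # 3 * 2 * 2 * 3 * 1 * 1
print(n_dt, n_rf)
```

scikit-learn's `ParameterGrid` reports the same numbers through `len(ParameterGrid(grid))`, and it also handles a list of several grid dictionaries.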


### Save Output

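The code below pickles `modelDict` to a file. For each model, the dictionary holds the pipeline, its parameter grid, the fitted grid-search object (including its cross-validation results), and the refit metric.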

In [14]:
with open(dataPrefix + "modelPerformance.pkl", "wb") as pickleFile:
    pickle.dump(modelDict, pickleFile, protocol= 2)
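
To reuse the results later, the file can be read back with `pickle.load`. The sketch below round-trips a small stand-in dictionary through a temporary file; the dictionary contents and the file name are placeholders, not the notebook's real results.

```python
import os
import pickle
import tempfile

# Stand-in for modelDict; the real one holds pipelines and fitted GridSearchCV objects.
demo_dict = {"DT": {"refitMetric": "f1_weighted",
                    "params": [{"DT__criterion": ["gini"]}]}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_modelPerformance.pkl")
    # Write with the same protocol the notebook uses.
    with open(path, "wb") as pickle_file:
        pickle.dump(demo_dict, pickle_file, protocol=2)
    # Read the dictionary back from disk.
    with open(path, "rb") as pickle_file:
        restored = pickle.load(pickle_file)

print(restored["DT"]["refitMetric"])
```

Unpickling the real `modelDict` requires scikit-learn to be importable, and pickles of fitted estimators are not guaranteed to load under a different scikit-learn version than the one that wrote them.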