# Cost Sensitive Random Forest using CostCla  

In this notebook we use cost-sensitive learning to train the model with the actual costs of false positives and false negatives taken into account. The CostCla package accepts a cost matrix as an input parameter, and we hoped that training with it would minimize the resulting cost of predictions.

Documentation: https://albahnsen.github.io/CostSensitiveClassification/

### Load Packages

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, recall_score, confusion_matrix
import seaborn as sns

### Data Preprocessing and Modeling Preparation  

The data preprocessing and model preparation steps are the same as those used for the previous models.

In [4]:
data = pd.read_csv('dataset_diabetes/diabetic_data.csv', na_values = '?')

def convert_code(code):
    try:
        code = float(code)
        if code >= 1 and code <= 139:
            return 1
        elif code >= 140 and code <= 239:
            return 2
        elif code >= 240 and code <= 279:
            return 3
        elif code >= 280 and code <= 289:
            return 4
        elif code >= 290 and code <= 319:
            return 5
        elif code >= 320 and code <= 389:
            return 6
        elif code >= 390 and code <= 459:
            return 7
        elif code >= 460 and code <= 519:
            return 8
        elif code >= 520 and code <= 579:
            return 9
        elif code >= 580 and code <= 629:
            return 10
        elif code >= 630 and code <= 679:
            return 11
        elif code >= 680 and code <= 709:
            return 12
        elif code >= 710 and code <= 739:
            return 13
        elif code >= 740 and code <= 759:
            return 14
        elif code >= 760 and code <= 779:
            return 15
        elif code >= 780 and code <= 799:
            return 16
        elif code >= 800 and code <= 999:
            return 17
    except ValueError:  # float() fails on non-numeric codes (those containing 'V' or 'E')
        if 'V' in code:
            return 18
        elif 'E' in code:
            return 19
        else:
            return 'Code not mapped'

data['diag_1_mapped'] = data.diag_1.apply(convert_code)
data['diag_2_mapped'] = data.diag_2.apply(convert_code)
data['diag_3_mapped'] = data.diag_3.apply(convert_code)

data = data.loc[~data.discharge_disposition_id.isin([11,13,14,18,20,21])]

data['Target_Label'] = (data.readmitted == '<30').astype(int)

num_col_names = ['time_in_hospital','num_lab_procedures', 'num_procedures', 'num_medications',\
                 'number_outpatient', 'number_emergency', 'number_inpatient','number_diagnoses']

cat_col_names = ['race', 'gender', 'max_glu_serum', 'A1Cresult', 'metformin', 'repaglinide',\
                 'nateglinide', 'chlorpropamide', 'glimepiride', 'acetohexamide', 'glipizide',\
                 'glyburide', 'tolbutamide', 'pioglitazone', 'rosiglitazone', 'acarbose',\
                 'miglitol', 'troglitazone','tolazamide', 'insulin', 'glyburide-metformin',\
                 'glipizide-metformin','glimepiride-pioglitazone', 'metformin-rosiglitazone',\
                 'metformin-pioglitazone', 'change', 'diabetesMed','payer_code']

# Fill NA with 'UNK'
data['race'] = data['race'].fillna('UNK')
data['payer_code'] = data['payer_code'].fillna('UNK')
data['medical_specialty'] = data['medical_specialty'].fillna('UNK')
data['diag_1_mapped'] = data['diag_1_mapped'].fillna('UNK')
data['diag_2_mapped'] = data['diag_2_mapped'].fillna('UNK')
data['diag_3_mapped'] = data['diag_3_mapped'].fillna('UNK')

# Get top 10 medical specialties
top_10_spec = list(data['medical_specialty'].value_counts(dropna=False)[0:10].index)

# New medical_specialty column
data['med_spec_new'] = data['medical_specialty'].copy()

# Replace values with 'Other' if not in Top 10
data.loc[~data.med_spec_new.isin(top_10_spec), 'med_spec_new'] = 'Other'

# Convert Numerical Categorical Columns to strings
cat_col_num = ['admission_type_id', 'discharge_disposition_id', 'admission_source_id',\
               'diag_1_mapped', 'diag_2_mapped', 'diag_3_mapped']
data[cat_col_num] = data[cat_col_num].astype(str)

# Create Categorical Predictors DataFrame
data_cat = pd.get_dummies(data[cat_col_names + cat_col_num + ['med_spec_new']], drop_first = True)

# Add Categorical Predictor Variables to main DataFrame
data = pd.concat([data, data_cat], axis = 1)

# Retain columns of data_cat
data_cat_cols = list(data_cat.columns)

# Create Age Group Variable
age_dict = {'[0-10)':0, 
            '[10-20)':10, 
            '[20-30)':20, 
            '[30-40)':30, 
            '[40-50)':40, 
            '[50-60)':50,
            '[60-70)':60, 
            '[70-80)':70, 
            '[80-90)':80, 
            '[90-100)':90}
data['age_group'] = data.age.replace(age_dict)

# Create indicator for whether weight was recorded
data['has_weight'] = data.weight.notnull().astype('int')

# Save feature names
features = ['age_group', 'has_weight']

# Dataframe for modeling
model_data = data[num_col_names + data_cat_cols + features + ['Target_Label']]

# Shuffle Data
model_data = model_data.sample(n=len(model_data),random_state=10)
model_data = model_data.reset_index(drop=True)

# 15% Validation / 15% Test split / 70% Train
vd_test = model_data.sample(frac=0.3, random_state=10)
test_data = vd_test.sample(frac=0.5, random_state=10)
vd_data = vd_test.drop(test_data.index)
train_data = model_data.drop(vd_test.index)

# Split training data into positive and negative
positive = train_data.Target_Label == 1
train_data_pos = train_data.loc[positive]
train_data_neg = train_data.loc[~positive]

# Merge and Balance
train_data_balanced = pd.concat([train_data_pos, train_data_neg.sample(n = len(train_data_pos), random_state=10)], axis = 0)

# Shuffle
train_data_balanced = train_data_balanced.sample(n = len(train_data_balanced), random_state = 10).reset_index(drop=True)

train_matrix = train_data[num_col_names + data_cat_cols + features].values
train_balanced_matrix = train_data_balanced[num_col_names + data_cat_cols + features].values
vd_matrix = vd_data[num_col_names + data_cat_cols + features].values

train_labels = train_data_balanced['Target_Label'].values
vd_labels = vd_data['Target_Label'].values

scaler = StandardScaler()
scaler.fit(train_matrix)

scaled_train = scaler.transform(train_balanced_matrix)
scaled_vd = scaler.transform(vd_matrix)



# Modeling

In [5]:
def report(actual, predicted):
    # Average misclassification cost per observation.
    # With labels=[0, 1], ravel() returns the counts in the order TN, FP, FN, TP.
    n_tn, n_fp, n_fn, n_tp = confusion_matrix(actual, predicted, labels=[0, 1]).ravel()
    # Cost weights: 1780 per false positive, 14400 per false negative.
    cost = (n_fp * 1780 + n_fn * 14400) / len(actual)

    return cost
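
To make the cost metric concrete, here is a minimal sketch that recomputes the average cost per observation on a handful of made-up labels. The weights 1780 and 14400 are the ones used in `report` above; the helper name `average_cost` and the toy labels are assumptions for illustration only, not notebook data.

```python
def average_cost(actual, predicted, fp_cost=1780, fn_cost=14400):
    """Average misclassification cost per observation (hypothetical helper)."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    # A false positive is a true 0 predicted as 1; a false negative is a true 1 predicted as 0.
    false_pos = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    false_neg = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return (false_pos * fp_cost + false_neg * fn_cost) / len(actual)

# Toy labels (made up): exactly one false positive and one false negative.
toy_actual = [0, 0, 1, 1, 0]
toy_predicted = [0, 1, 1, 0, 0]
toy_cost = average_cost(toy_actual, toy_predicted)
print(toy_cost)  # (1 * 1780 + 1 * 14400) / 5 = 3236.0
```

Because a false negative is weighted roughly eight times more heavily than a false positive, a model that misses readmissions is penalized far more than one that raises false alarms.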

### Cost Sensitive Random Forest

In [6]:
from costcla.models import CostSensitiveRandomForestClassifier



In [7]:
cost = []
fp = []
fn = []
for i in range(1, 50, 5):        # candidate false-positive cost
    for j in range(1, 50, 5):    # candidate false-negative cost
        # Constant cost matrix: one identical [i, j, 0, 0] row per training observation
        cost_matrix = np.tile([i, j, 0, 0], (len(train_labels), 1))
        CSRandomForest = CostSensitiveRandomForestClassifier()
        CSRandomForest.fit(scaled_train, train_labels, cost_matrix)
        # Classify as positive when the predicted probability exceeds 0.50
        vd_probabilities = CSRandomForest.predict_proba(scaled_vd)[:, 1]
        vd_predictions = (vd_probabilities > 0.50).astype(int)
        cost.append(report(vd_labels, vd_predictions))
        fp.append(i)
        fn.append(j)
        
cost_df = pd.DataFrame({'FP': fp, 'FN': fn, 'Cost': cost})
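
The nested loops above visit every pairing of the two candidate cost values. As a quick check on the size of that grid and on how row indices in `cost_df` map back to cost pairs, the following sketch enumerates the same pairs with `itertools.product`. The names `cost_values` and `grid` are illustrative and do not appear in the notebook.

```python
from itertools import product

cost_values = list(range(1, 50, 5))             # same range used by both loops
grid = list(product(cost_values, cost_values))  # (FP cost, FN cost) pairs in loop order

print(cost_values)  # [1, 6, 11, 16, 21, 26, 31, 36, 41, 46]
print(len(grid))    # 100
print(grid[11])     # (6, 6)
```

The search therefore fits 100 separate forests, and row 11 of `cost_df` corresponds to the pair (6, 6).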



# Results

In [8]:
cost_df.sort_values(by='Cost', ascending=True)

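|     | FP  | FN  | Cost        |
|-----|-----|-----|-------------|
| 11  | 6   | 6   | 1262.185518 |
| 33  | 16  | 16  | 1263.005087 |
| 89  | 41  | 46  | 1269.196460 |
| 0   | 1   | 1   | 1271.999442 |
| 78  | 36  | 41  | 1277.885567 |
| ... | ... | ... | ...         |
| 80  | 41  | 1   | 1691.992473 |
| 38  | 16  | 41  | 1692.116524 |
| 48  | 21  | 41  | 1692.951425 |
| 58  | 26  | 41  | 1693.695728 |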


**Conclusion**: The CostCla package was useful because it accepts a cost matrix during training. However, further research showed that the algorithm was designed primarily for observation-dependent cost matrices. In other words, it targets problems where the cost of a misclassification differs from one observation to the next depending on the feature values. Here we instead used a constant cost matrix and checked the effect of different cost proportions for FP and FN during training. As a result, even the parameters that produced the lowest-cost model did not perform better than the previous models.
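
To illustrate the distinction drawn in the conclusion, the sketch below contrasts a constant cost matrix with an observation-dependent one. It assumes the four-column layout used in this notebook, with the false-positive cost first, the false-negative cost second, and two zero columns for correct predictions. The rule tying the false-negative cost to length of stay, the rate `fn_cost_per_day`, and both helper names are purely hypothetical; neither the notebook nor CostCla prescribes them.

```python
import numpy as np

def constant_cost_matrix(n_rows, fp_cost, fn_cost):
    """One identical [FP, FN, 0, 0] cost row per observation."""
    return np.tile([fp_cost, fn_cost, 0.0, 0.0], (n_rows, 1))

def dependent_cost_matrix(days_in_hospital, fp_cost, fn_cost_per_day):
    """Hypothetical rule: the false-negative cost scales with length of stay."""
    days = np.asarray(days_in_hospital, dtype=float)
    costs = np.zeros((len(days), 4))
    costs[:, 0] = fp_cost                 # same false-positive cost for every row
    costs[:, 1] = fn_cost_per_day * days  # false-negative cost varies by row
    return costs

toy_days = [1, 3, 10]  # made-up lengths of stay
const = constant_cost_matrix(len(toy_days), fp_cost=1780, fn_cost=14400)
dep = dependent_cost_matrix(toy_days, fp_cost=1780, fn_cost_per_day=2000)
print(const)
print(dep)
```

In the constant case every row is the same, so the matrix only conveys the ratio between the two costs. In the dependent case each observation carries its own penalty, which is the setting the package was designed for.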