The problem we’ll solve is a binary classification task with the goal of predicting an individual’s health. The features are socioeconomic and lifestyle characteristics of individuals, and the label is 0 for poor health and 1 for good health. This dataset was collected by the Centers for Disease Control and Prevention (CDC).

Some tutorials on Random Forests:
https://www.youtube.com/watch?v=J4Wdy0Wc_xQ
https://towardsdatascience.com/an-implementation-and-explanation-of-the-random-forest-in-python-77bf308a9b76

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load in

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:

df = pd.read_csv('../input/behavioral-risk-factor-surveillance-system/2015.csv').sample(10000, random_state = 50)
df.head()

In [None]:
df['_RFHLTH'].value_counts()

In [None]:
df['_RFHLTH'] = df['_RFHLTH'].replace({2: 0})

In [None]:
df['_RFHLTH'].value_counts()

In [None]:
df = df.loc[df['_RFHLTH'].isin([0, 1])].copy()

In [None]:
df['_RFHLTH'].value_counts()

In [None]:
df = df.rename(columns = {'_RFHLTH': 'Label'})
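The recoding steps above can be followed on a toy frame. This is only a sketch: the values below are made up for illustration, and 9 simply stands in for any code other than 1 or 2.

```python
import pandas as pd

# Toy stand-in for the raw '_RFHLTH' column (illustrative values, not taken from the survey file)
toy = pd.DataFrame({'_RFHLTH': [1.0, 2.0, 9.0, 1.0, 2.0]})

# Map code 2 to 0 so the label becomes binary (0 = poor health, 1 = good health)
toy['_RFHLTH'] = toy['_RFHLTH'].replace({2: 0})

# Keep only rows whose label is 0 or 1; rows with any other code are dropped
toy = toy.loc[toy['_RFHLTH'].isin([0, 1])].copy()

# Give the column a readable name
toy = toy.rename(columns={'_RFHLTH': 'Label'})

print(toy['Label'].value_counts())
```

Of the five toy rows, four remain: two with label 1 and two with label 0.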

We won't do any data exploration in this notebook, but in general, exploring the data is a best practice. It can guide feature engineering (which we also won't do here) and help you identify and correct anomalies or mistakes in the data.

In [None]:
df.shape

In [None]:
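# Percentage of non-missing values in each column (len(df) avoids hard-coding the row count)
percentOfData = df.count() * 100 / len(df)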

In [None]:
percentOfData.where(percentOfData<50).dropna()

In [None]:
badFeatures = percentOfData.where(percentOfData<50).dropna()

In [None]:
# Remove columns in which fewer than 50% of the values are present
df = df.drop(columns = badFeatures.index.to_list())
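The filter above keeps only the columns in which at least half of the values are present. A minimal sketch of the same idea on a toy frame is shown here; the column names are hypothetical, and `notna().mean()` is used as one possible way to get the fraction of present values without referring to the row count.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical column names
toy = pd.DataFrame({
    'mostly_present': [1.0, 2.0, np.nan, 4.0],        # 75% of values present
    'mostly_missing': [np.nan, np.nan, np.nan, 1.0],  # 25% of values present
    'complete': [1.0, 2.0, 3.0, 4.0],                 # 100% of values present
})

# Percentage of non-missing values per column
percent_present = toy.notna().mean() * 100

# Columns that fall below the 50% threshold
sparse_columns = percent_present[percent_present < 50].index.to_list()

# Drop the sparse columns and keep the rest
toy = toy.drop(columns=sparse_columns)
print(toy.columns.to_list())
```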

In [None]:
# Keep only the float64 columns
df = df.select_dtypes(include=['float64'])

In [None]:
# Remove a few more columns
df = df.drop(columns=['SEX','_STATE','FMONTH','SEQNO','DISPCODE','MARITAL','EDUCA','POORHLTH', 'PHYSHLTH', 'GENHLTH', 'HLTHPLN1', 'MENTHLTH'])

In [None]:
from IPython.display import HTML
HTML(pd.DataFrame(df.dtypes).to_html())

In [None]:
df.head()

In [None]:
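from sklearn.model_selection import train_test_split

# Separate the labels from the features so the target does not leak into the model inputs
labels = df['Label']
features = df.drop(columns='Label')

# Hold out 30% of the examples as test data
train, test, train_labels, test_labels = train_test_split(features, labels, test_size = 0.3, 
                                                          random_state = 50)

In [None]:
# Impute missing values with the training-set column means
# (reusing the training means for the test set keeps test statistics out of preprocessing)
train_means = train.mean()
train = train.fillna(train_means)
test = test.fillna(train_means)

In [None]:
train.columns

In [None]:
# Class balance of the training labels
sns.countplot(x=train_labels)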

In [None]:
train.shape

In [None]:
test.shape
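When imputing with column means, one common convention is to compute the means on the training split only and apply them to both splits. The sketch below illustrates this on toy data; the frames and column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy splits with missing values
toy_train = pd.DataFrame({'x': [1.0, np.nan, 3.0], 'y': [10.0, 20.0, np.nan]})
toy_test = pd.DataFrame({'x': [np.nan, 100.0], 'y': [np.nan, np.nan]})

# Column means computed from the training split only (x: 2.0, y: 15.0)
train_means = toy_train.mean()

# Fill both splits with the training means
toy_train_filled = toy_train.fillna(train_means)
toy_test_filled = toy_test.fillna(train_means)

print(toy_test_filled)
```

A side benefit is that a test column that is entirely missing, like `y` here, can still be filled, whereas its own mean would be undefined.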

In [None]:
# Train tree
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(random_state=50, max_depth=60)
tree.fit(train, train_labels)
print(f'Decision tree has {tree.tree_.node_count} nodes with maximum depth {tree.tree_.max_depth}.')

### Assess Decision Tree Performance

Given the number of nodes in our decision tree and the maximum depth, we expect it has overfit to the training data. This means it will do much better on the training data than on the testing data.
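To see this effect in isolation, here is a minimal sketch on synthetic data (not the survey data). It assumes scikit-learn's `make_classification` with some label noise and compares an unrestricted tree with a shallow one; the sample size, noise level, and depth of 3 are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy binary classification problem (flip_y randomly reassigns 20% of the labels)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=50)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=50)

results = {}
for depth in (None, 3):  # None lets the tree grow until every leaf is pure
    model = DecisionTreeClassifier(max_depth=depth, random_state=50).fit(X_train, y_train)
    train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
    test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    results[depth] = (train_auc, test_auc)
    print(f'max_depth={depth}: train AUC {train_auc:.3f}, test AUC {test_auc:.3f}')
```

The unrestricted tree memorizes the training set, noise included, so its training score says little about how it will do on held-out data.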

In [None]:
# Make probability predictions
train_probs = tree.predict_proba(train)[:, 1]
probs = tree.predict_proba(test)[:, 1]

train_predictions = tree.predict(train)
predictions = tree.predict(test)

In [None]:
from sklearn.metrics import precision_score, recall_score, roc_auc_score, roc_curve

print(f'Train ROC AUC Score: {roc_auc_score(train_labels, train_probs)}')
print(f'Test ROC AUC  Score: {roc_auc_score(test_labels, probs)}')

In [None]:
print(f'Baseline ROC AUC: {roc_auc_score(test_labels, [1 for _ in range(len(test_labels))])}')
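A constant guess scores 0.5 because ROC AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The helper below is a hypothetical, brute-force illustration of that definition, not how scikit-learn computes the metric.

```python
import numpy as np

def pairwise_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    # Score difference for every (positive, negative) pair
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

toy_labels = [0, 1, 1, 0, 1]
print(pairwise_auc(toy_labels, [1] * len(toy_labels)))       # constant guess: 0.5
print(pairwise_auc(toy_labels, [0.1, 0.9, 0.8, 0.2, 0.7]))   # perfect ranking: 1.0
```

With a constant score every pair is a tie, so the result is 0.5 regardless of the class balance.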

Our model does outperform a baseline guess, but it has severely overfit: it achieves a perfect ROC AUC on the training data.

TODO: construct ROC curve and confusion matrix
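One possible starting point for the TODO above is sketched here on synthetic data, assuming scikit-learn's `roc_curve` and `confusion_matrix`. The data, the model, and the depth of 3 are placeholders rather than this notebook's actual setup.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the survey data)
X, y = make_classification(n_samples=500, n_features=10, flip_y=0.1, random_state=50)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=50)

model = DecisionTreeClassifier(max_depth=3, random_state=50).fit(X_train, y_train)

# ROC curve: false positive rate and true positive rate at each score threshold
fpr, tpr, thresholds = roc_curve(y_test, model.predict_proba(X_test)[:, 1])

# Confusion matrix: rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, model.predict(X_test))
print(cm)
```

In this notebook, the same calls would take `test_labels`, `probs`, and `predictions` in place of the synthetic arrays, and `plt.plot(fpr, tpr)` would draw the curve.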