# Random Forest

### Problem Statement

- Split the Diabetes data into training and test datasets and build a Random Forest model to classify the `Class variable` column.

### Data Understanding

In [41]:
import warnings
warnings.filterwarnings('ignore')

In [42]:
import pandas as pd
import numpy as np
db = pd.read_csv("~/desktop/Digi 360/Module 19/Diabetes.csv", encoding='mac_roman')
db.head(5)

| | Number of times pregnant | Plasma glucose concentration | Diastolic blood pressure | Triceps skin fold thickness | 2-Hour serum insulin | Body mass index | Diabetes pedigree function | Age (years) | Class variable |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | YES |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | NO |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | YES |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | NO |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | YES |


In [43]:
# Renaming columns

db = db.rename(columns={' Number of times pregnant':'np', ' Plasma glucose concentration':'plasma',
                       ' Diastolic blood pressure':'bp',' Triceps skin fold thickness':'tsf',
                       ' 2-Hour serum insulin':'serum',' Body mass index':'bmi',
                       ' Diabetes pedigree function':'dpf',' Age (years)':'age',' Class variable':'class'})
db.columns

Index(['np', 'plasma', 'bp', 'tsf', 'serum', 'bmi', 'dpf', 'age', 'class'], dtype='object')

In [44]:
db.shape

(768, 9)

In [45]:
db.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
np        768 non-null int64
plasma    768 non-null int64
bp        768 non-null int64
tsf       768 non-null int64
serum     768 non-null int64
bmi       768 non-null float64
dpf       768 non-null float64
age       768 non-null int64
class     768 non-null object
dtypes: float64(2), int64(6), object(1)
memory usage: 54.1+ KB


In [46]:
db.isnull().sum()

np        0
plasma    0
bp        0
tsf       0
serum     0
bmi       0
dpf       0
age       0
class     0
dtype: int64
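
`isnull()` reports no missing values, but the preview above shows zeros in `tsf` and `serum`, which may be placeholders for unrecorded measurements rather than real readings. The following is a minimal sketch for counting such zeros, assuming they do mark missing data (the source does not confirm this). The `sample` frame is a hypothetical stand-in that re-types the five preview rows so the snippet runs on its own, and the choice of `suspect_cols` is also an assumption.

```python
import pandas as pd

# Hypothetical stand-in for `db`: the five rows shown by db.head(5)
sample = pd.DataFrame({
    'np':     [6, 1, 8, 1, 0],
    'plasma': [148, 85, 183, 89, 137],
    'bp':     [72, 66, 64, 66, 40],
    'tsf':    [35, 29, 0, 23, 35],
    'serum':  [0, 0, 0, 94, 168],
    'bmi':    [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Columns where a zero is assumed to be implausible (an assumption)
suspect_cols = ['plasma', 'bp', 'tsf', 'serum', 'bmi']

# Count zeros per column; on the full data, use db[suspect_cols] instead
zero_counts = (sample[suspect_cols] == 0).sum()
print(zero_counts)
```

If the counts on the full dataset turn out to be large, it may be worth deciding how to treat those rows before modelling.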

### Splitting the dataset 

In [47]:
from sklearn.model_selection import train_test_split

In [48]:
db_X = db.iloc[:,:8]
db_X.head()

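| | np | plasma | bp | tsf | serum | bmi | dpf | age |
|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 |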


In [49]:
db_y = db.iloc[:,8]
db_y.head()

0    YES
1     NO
2    YES
3     NO
4    YES
Name: class, dtype: object

In [50]:
X_train, X_test, y_train, y_test = train_test_split(db_X, db_y, test_size=0.2,random_state=4)
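
The split above fixes `random_state` but does not stratify, so the share of `YES` and `NO` labels can differ between the training and test sets. One option, sketched here on hypothetical toy data (`X_demo` and `y_demo` are not from the notebook), is to pass the labels to the `stratify` parameter of `train_test_split`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 30% 'YES' labels
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array(['YES'] * 30 + ['NO'] * 70)

# stratify=y_demo keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=4, stratify=y_demo
)

# Share of 'YES' labels in each split
print((y_tr == 'YES').mean(), (y_te == 'YES').mean())
```

On the notebook's data the equivalent call would add `stratify=db_y`; the scores reported below were produced without it.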

In [51]:
X_train.head()

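| | np | plasma | bp | tsf | serum | bmi | dpf | age |
|---|---|---|---|---|---|---|---|---|
| 596 | 0 | 67 | 76 | 0 | 0 | 45.3 | 0.194 | 46 |
| 90 | 1 | 80 | 55 | 0 | 0 | 19.1 | 0.258 | 21 |
| 734 | 2 | 105 | 75 | 0 | 0 | 23.3 | 0.56 | 53 |
| 694 | 2 | 90 | 60 | 0 | 0 | 23.5 | 0.191 | 25 |
| 517 | 7 | 125 | 86 | 0 | 0 | 37.6 | 0.304 | 51 |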


In [52]:
y_train.head()

596    NO
90     NO
734    NO
694    NO
517    NO
Name: class, dtype: object

### Building the model

In [53]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

In [54]:
# Decision Tree
dt = DecisionTreeClassifier()
dt.fit(X_train,y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

In [55]:
# Checking the score on train data
dt.score(X_train,y_train)

1.0

In [56]:
# Checking the score on test data
dt.score(X_test,y_test)

0.7012987012987013

The decision tree is overfitting: it scores 100% on the training data but only about 70% on the test data. Let's try an ensemble method, which should generalize better.
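
An unrestricted tree keeps splitting until it fits the training data perfectly, which explains the training score of 1.0. As a rough illustration, the sketch below compares an unrestricted tree with a depth-limited one on synthetic data. The data from `make_classification` is a hypothetical stand-in for the diabetes set, and `max_depth=3` is an arbitrary illustrative value, not a tuned one.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the diabetes data (hypothetical, for illustration only)
X_demo, y_demo = make_classification(n_samples=768, n_features=8, n_informative=4,
                                     flip_y=0.1, random_state=4)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=4)

# Unrestricted tree: grows until the training samples are fit exactly
full_tree = DecisionTreeClassifier(random_state=4).fit(X_tr, y_tr)

# Depth-limited tree: max_depth=3 is an arbitrary illustrative value
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=4).fit(X_tr, y_tr)

# Print train and test accuracy for each tree
for name, model in [('full', full_tree), ('max_depth=3', shallow_tree)]:
    print(name, model.score(X_tr, y_tr), model.score(X_te, y_te))
```

Comparing the gap between train and test scores for the two trees gives a quick feel for how much of the perfect training score comes from memorization.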

### RandomForest Classifier

In [57]:
rf = RandomForestClassifier(n_estimators=10)

In [58]:
rf.fit(X_train,y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [59]:
#Finding score for test Data
rf.score(X_test, y_test)

0.7532467532467533

In [60]:
#Finding score for train Data
rf.score(X_train, y_train)

0.988599348534202
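
The forest improves the test score, but a gap remains (about 98.9% on the training data vs. about 75.3% on the test data), and because no `random_state` is set the numbers will change from run to run. Below is a sketch of a reproducible fit that also reports an out-of-bag estimate. It uses synthetic data as a hypothetical stand-in, and `n_estimators=100` is an illustrative choice rather than a recommendation from the source.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the diabetes data (hypothetical, for illustration only)
X_demo, y_demo = make_classification(n_samples=768, n_features=8, n_informative=4,
                                     flip_y=0.1, random_state=4)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=4)

# random_state makes the fit repeatable; oob_score=True scores each sample
# using only the trees that did not see it during bootstrapping
rf_demo = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=4)
rf_demo.fit(X_tr, y_tr)

print('OOB score :', rf_demo.oob_score_)
print('Test score:', rf_demo.score(X_te, y_te))
```

The out-of-bag score is computed from the training data alone, so it can serve as a sanity check alongside the held-out test score.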

### Extract Feature Importance

In [61]:
# Extract feature importances, sorted from most to least important
fi = pd.DataFrame({'feature': list(X_train.columns),
                   'importance': rf.feature_importances_}
                  ).sort_values('importance', ascending=False)

# Display the top five features
fi.head()

| | feature | importance |
|---|---|---|
| 1 | plasma | 0.207194 |
| 5 | bmi | 0.191013 |
| 7 | age | 0.153859 |
| 6 | dpf | 0.126474 |
| 2 | bp | 0.088761 |


Feature importances can give us insight into a problem by telling us which variables best discriminate between the classes.

For example, here `plasma`, the patient's plasma glucose concentration, is the most important feature, which makes sense in the problem context because diabetes is diagnosed from elevated blood glucose.
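
The values in `feature_importances_` are relative: they are normalized to sum to 1 and averaged over the trees in the forest, so with only 10 trees the ranking can shift between runs. The sketch below, on synthetic stand-in data (hypothetical, not the diabetes set), checks the normalization and looks at how much the individual trees disagree.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the diabetes data (hypothetical, for illustration only)
X_demo, y_demo = make_classification(n_samples=768, n_features=8, n_informative=4,
                                     random_state=4)

rf_demo = RandomForestClassifier(n_estimators=10, random_state=4).fit(X_demo, y_demo)

# Forest-level importances: one value per feature
importances = rf_demo.feature_importances_

# Importances of each individual tree, shape (n_trees, n_features)
per_tree = np.array([t.feature_importances_ for t in rf_demo.estimators_])

# Standard deviation across trees: a rough measure of how stable each value is
spread = per_tree.std(axis=0)

print('sum of importances:', importances.sum())
print('spread across trees:', spread)
```

A large spread relative to the importance itself suggests the ranking of that feature should be read with caution.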

### Visualization of a single tree

In [62]:
# Remove the labels from the features
# axis 1 refers to the columns
db_viz= db.drop('class', axis = 1)

# Saving feature names for later use
feature_list = list(db_viz.columns)
feature_list

['np', 'plasma', 'bp', 'tsf', 'serum', 'bmi', 'dpf', 'age']

In [63]:
# Import tools needed for visualization
from sklearn.tree import export_graphviz
import pydot

# Pull out one tree from the forest
tree = rf.estimators_[5]

# Export the image to a dot file
export_graphviz(tree, out_file = 'tree.dot', feature_names = feature_list, rounded = True, precision = 1)

# Use dot file to create a graph
(graph, ) = pydot.graph_from_dot_file('tree.dot')

# Write graph to a png file
graph.write_png('tree.png')
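
The `pydot` route above needs Graphviz installed on the machine. Where that is not available, one alternative is scikit-learn's `export_text`, which prints the same split rules as plain text. The sketch below uses a small synthetic forest as a hypothetical stand-in for `rf`; `max_depth=3` is set only to keep the printed tree short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

# Same feature names as feature_list in the notebook
feature_list = ['np', 'plasma', 'bp', 'tsf', 'serum', 'bmi', 'dpf', 'age']

# Synthetic stand-in for the fitted forest (hypothetical, for illustration only)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=4)
rf_demo = RandomForestClassifier(n_estimators=10, max_depth=3,
                                 random_state=4).fit(X_demo, y_demo)

# Pull out one tree from the forest and render its rules as text
tree_demo = rf_demo.estimators_[5]
rules = export_text(tree_demo, feature_names=feature_list)
print(rules)
```

With the notebook's own objects, the call would be `export_text(tree, feature_names=feature_list)`.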

### Conclusion

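- The Random Forest model reaches about 75% accuracy on the test data, compared with about 70% for the single decision tree.
- `Plasma glucose concentration` is the most important feature for classifying whether a person has diabetes or not.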