# Module 13 Lab - Rule Based Machine Learning

## Directions

1. Show all work/steps/calculations. Generate Markdown/code cells for each answer.
2. You must submit your lab by the deadline to the Lab section of the Course Module from which you downloaded this file.
3. You may use any core Python libraries or NumPy/SciPy. **Additionally, code from the Module notebooks and lectures is fair to use and modify.** You may also consult Stack Overflow (SO). If you use something from SO, please place a comment with the URL to document the code.

In [3]:
import numpy as np
import seaborn as sns
import pandas as pd

We talked about a wide variety of algorithms this module, but we're going to concentrate on just two: Decision Trees and Random Forests.

**Problem 1.**

Using the insurance data set, construct a Decision Tree to estimate charges using the scikit-learn [Decision Tree](https://scikit-learn.org/stable/modules/tree.html) module. Use validation curves to estimate the best tree depth. With this tree depth, perform 3 rounds of 10-fold cross validation to get a sense of generalization error, and use learning curves to assess the bias/variance trade-off.

Visualize the tree if possible. 
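One lightweight way to inspect a fitted tree without a plotting backend is scikit-learn's `export_text`, which prints the split rules as indented text. The sketch below is only an illustration: `X_demo`, `y_demo`, and the feature names `f0`..`f2` are synthetic placeholders (assumptions), not the insurance columns.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic placeholder data (assumption): f0 drives most of the target.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(200, 3))
y_demo = 10 * X_demo[:, 0] + 2 * (X_demo[:, 1] > 0.5) + rng.normal(0, 0.1, 200)

# A shallow tree keeps the printed rules short enough to read.
tree_demo = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, y_demo)

# export_text renders each split as "feature <= threshold" with leaf values.
rules = export_text(tree_demo, feature_names=["f0", "f1", "f2"])
print(rules)
```

For the lab itself, the same call would presumably take the fitted insurance tree and `list(X.columns)` as the feature names.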

Compare with your linear regression results. Use Bayesian inference to test the difference of means.
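One possible reading of "Bayesian inference" for a difference of means is the Bayesian bootstrap, which reweights the observed scores with Dirichlet weights to get posterior draws of each mean. This is an assumption about the intended method, and the score arrays below are synthetic placeholders, not results from either model. The helper name `bayesian_bootstrap_mean` is hypothetical.

```python
import numpy as np

def bayesian_bootstrap_mean(scores, n_draws=4000, seed=0):
    """Posterior draws of the mean using Dirichlet(1, ..., 1) weights."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    weights = rng.dirichlet(np.ones(len(scores)), size=n_draws)
    return weights @ scores  # one weighted mean per draw

# Synthetic placeholder scores (NOT real results): 30 values per model,
# standing in for 3 rounds of 10-fold cross validation.
rng = np.random.default_rng(1)
scores_a = rng.normal(0.80, 0.04, size=30)  # stand-in for the tree
scores_b = rng.normal(0.70, 0.04, size=30)  # stand-in for linear regression

diff = bayesian_bootstrap_mean(scores_a, seed=2) - bayesian_bootstrap_mean(scores_b, seed=3)
prob_a_better = (diff > 0).mean()
low, high = np.percentile(diff, [2.5, 97.5])
print(f"P(mean_a > mean_b) = {prob_a_better:.3f}")
print(f"95% credible interval for the difference: [{low:.3f}, {high:.3f}]")
```

In the lab, `scores_a` and `scores_b` would be replaced by the per-fold scores of the decision tree and the earlier linear regression.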

### Convert categorical data to dummy variables

In [4]:
data_raw = pd.read_csv("insurance.csv")
data_raw.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


In [5]:
data = pd.get_dummies(data_raw)
data.head()

Unnamed: 0,age,bmi,children,charges,sex_female,sex_male,smoker_no,smoker_yes,region_northeast,region_northwest,region_southeast,region_southwest
0,19,27.9,0,16884.924,1,0,0,1,0,0,0,1
1,18,33.77,1,1725.5523,0,1,1,0,0,0,1,0
2,28,33.0,3,4449.462,0,1,1,0,0,0,1,0
3,33,22.705,0,21984.47061,0,1,1,0,0,1,0,0
4,32,28.88,0,3866.8552,0,1,1,0,0,1,0,0


In [18]:
X = data.drop(columns="charges")
y = data.loc[:,"charges"]

### Decision Tree Regression

In [25]:
from sklearn import tree
from sklearn.model_selection import cross_val_score

depth = []
for i in range(3,20):
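    # max_depth is left at its default (None) here, so i only labels the run:
    # every iteration scores the same unconstrained tree.
    clf = tree.DecisionTreeRegressor()
    # Perform 5-fold cross validation
    scores = cross_val_score(clf, X, y, cv=5)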
    depth.append((i,scores.mean()))
print(depth)

[(3, 0.7084712014379398), (4, 0.7096849264563504), (5, 0.7094385495218182), (6, 0.6834263757353716), (7, 0.7019954650077398), (8, 0.7001310360666697), (9, 0.7020833147051682), (10, 0.702897511999846), (11, 0.7080699588530865), (12, 0.7077403151037918), (13, 0.6984678267627721), (14, 0.7136769500958375), (15, 0.7116366139751422), (16, 0.7105848484041051), (17, 0.7055818974709793), (18, 0.716242037825249), (19, 0.7064361432881946)]
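Because the loop above builds the tree with default settings on every iteration, the printed scores do not yet show how performance changes with depth. A minimal sketch of what the problem statement asks for, using `validation_curve` to vary `max_depth` and `RepeatedKFold` for 3 rounds of 10-fold cross validation, might look like the following. The data here is a synthetic placeholder from `make_regression` (an assumption), so the numbers it prints say nothing about the insurance data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import RepeatedKFold, cross_val_score, validation_curve
from sklearn.tree import DecisionTreeRegressor

# Synthetic placeholder data (assumption), standing in for X and y.
X_demo, y_demo = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=0)

# Validation curve: score the tree at each candidate max_depth.
depths = np.arange(1, 11)
train_scores, test_scores = validation_curve(
    DecisionTreeRegressor(random_state=0), X_demo, y_demo,
    param_name="max_depth", param_range=depths, cv=5, scoring="r2",
)
mean_test = test_scores.mean(axis=1)
best_depth = int(depths[mean_test.argmax()])

# 3 rounds of 10-fold cross validation at the selected depth.
rkf = RepeatedKFold(n_splits=10, n_repeats=3, random_state=0)
scores = cross_val_score(
    DecisionTreeRegressor(max_depth=best_depth, random_state=0),
    X_demo, y_demo, cv=rkf, scoring="r2",
)
print("best depth:", best_depth)
print("mean R^2:", scores.mean(), "std:", scores.std())
```

Plotting `train_scores.mean(axis=1)` and `mean_test` against `depths` gives the validation curve itself; a widening gap between the two lines is the usual sign of overfitting at larger depths.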


**Problem 2.**

Now use the [Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html) on the insurance data set. Use validation curves to optimize the hyperparameters. Estimate generalization error with 3 rounds of 10-fold cross validation. Instead of learning curves, examine the importance of the features. How does this compare with your linear regression from before?
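Feature importance can be read from a fitted forest's `feature_importances_` attribute. The sketch below illustrates the mechanics on synthetic placeholder data (an assumption) in which only the first two columns influence the target; the names `f0`..`f3` are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic placeholder data (assumption): f0 matters most, f2 and f3 are noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = 5 * X_demo[:, 0] + 1 * X_demo[:, 1] + rng.normal(0, 0.5, 300)
names = ["f0", "f1", "f2", "f3"]

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Impurity-based importances are non-negative and sum to 1.
importances = forest.feature_importances_
ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
for name, value in ranked:
    print(f"{name}: {value:.3f}")
```

When comparing with linear regression, note that these importances carry no sign, so they rank features by how much they reduce error but do not show the direction of an effect the way a coefficient does.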

In [28]:
from sklearn.ensemble import RandomForestRegressor
depth = []
for i in range(3,20):
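    # max_depth is fixed at 2 and random_state at 0, so i only labels the run:
    # every iteration scores the same forest.
    clf = RandomForestRegressor(n_estimators=100, max_depth=2, random_state=0)
    # Perform 5-fold cross validation (cross_val_score fits clones of clf itself)
    scores = cross_val_score(clf, X, y, cv=5)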
    depth.append((i,scores.mean()))
print(depth)


[(3, 0.8271190100597805), (4, 0.8271190100597805), (5, 0.8271190100597805), (6, 0.8271190100597805), (7, 0.8271190100597805), (8, 0.8271190100597805), (9, 0.8271190100597805), (10, 0.8271190100597805), (11, 0.8271190100597805), (12, 0.8271190100597805), (13, 0.8271190100597805), (14, 0.8271190100597805), (15, 0.8271190100597805), (16, 0.8271190100597805), (17, 0.8271190100597805), (18, 0.8271190100597805), (19, 0.8271190100597805)]
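The scores above are identical for every value of `i` because the forest is always built with `max_depth=2` and `random_state=0`. A hedged sketch of a validation curve that does vary the hyperparameter is shown below. It again uses synthetic placeholder data from `make_regression` (an assumption), and the small `n_estimators` is chosen only to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import validation_curve

# Synthetic placeholder data (assumption), standing in for X and y.
X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=0)

# Score the forest at several max_depth values with 5-fold cross validation.
depths = [2, 4, 6, 8]
train_scores, test_scores = validation_curve(
    RandomForestRegressor(n_estimators=25, random_state=0), X_demo, y_demo,
    param_name="max_depth", param_range=depths, cv=5, scoring="r2",
)

for d, tr, te in zip(depths, train_scores.mean(axis=1), test_scores.mean(axis=1)):
    print(f"max_depth={d}: train R^2={tr:.3f}, validation R^2={te:.3f}")
```

The same pattern applies to other hyperparameters such as `n_estimators` or `max_features` by changing `param_name` and `param_range`.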
