<h2>About this Project</h2>
<p>In this project, you will implement a regression tree that predicts the severity of a patient's heart disease based on a set of attributes. The training data is in <code>heart_disease_train.csv</code> and the test data is in <code>heart_disease_test.csv</code>. Before you begin, take a look at the two CSV files and <code>attribute.txt</code>, which describes each attribute in the CSV files. You can download the files for review using the links below:</p>

* [heart_disease_train.csv](files/heart_disease_train.csv)
* [heart_disease_test.csv](files/heart_disease_test.csv)
* [attribute.txt](files/attribute.txt)


<h3>Evaluation</h3>

<p><strong>This project must be successfully completed and submitted in order to receive credit for this course. Your score on this project will be included in your final grade calculation.</strong></p>
    
<p>You are expected to write code where you see <em># YOUR CODE HERE</em> within the cells of this notebook. Not all cells will be graded; code input cells followed by cells marked with <em>#Autograder test cell</em> will be graded. Upon submitting your work, the code you write at these designated positions will be assessed using an "autograder" that will run all test cells to assess your code. You will receive feedback from the autograder that will identify any errors in your code. Use this feedback to improve your code if you need to resubmit. Be sure not to change the names of any provided functions, classes, or variables within the existing code cells, as this will interfere with the autograder. Also, remember to execute all code cells sequentially, not just those you’ve edited, to ensure your code runs properly.</p>
    
<p>You can resubmit your work as many times as necessary before the submission deadline. If you experience difficulty or have questions about this exercise, use the Q&A discussion board to engage with your peers or seek assistance from the instructor.</p>

<p>Before starting your work, please review <a href="https://s3.amazonaws.com/ecornell/global/eCornellPlagiarismPolicy.pdf">eCornell's policy regarding plagiarism</a> (the presentation of someone else's work as your own without source credit).</p>

<h3>Submit Code for Autograder Feedback</h3>

<p>Once you have completed your work on this notebook, you will submit your code for autograder review. Follow these steps:</p>

<ol>
  <li><strong>Save your notebook.</strong></li>
  <li><strong>Mark as Completed —</strong> In the blue menu bar along the top of this code exercise window, you’ll see a menu item called <strong>Education</strong>. In the <strong>Education</strong> menu, click <strong>Mark as Completed</strong> to submit your code for autograder/instructor review. This process will take a moment and a progress bar will show you the status of your submission.</li>
	<li><strong>Review your results —</strong> Once your work is marked as complete, the results of the autograder will automatically be presented in a new tab within the code exercise window. You can click on the assessment name in this feedback window to see more details regarding specific feedback/errors in your code submission.</li>
  <li><strong>Repeat, if necessary —</strong> The Jupyter notebook will always remain accessible in the first tabbed window of the exercise. To reattempt the work, you will first need to click <strong>Mark as Uncompleted</strong> in the <strong>Education</strong> menu and then proceed to make edits to the notebook. Once you are ready to resubmit, follow steps one through three. You can repeat this procedure as many times as necessary.</li>
</ol>
<p>You can also download a copy of this notebook in multiple formats using the <strong>Download as</strong> option in the <strong>File</strong> menu above.</p>

## Getting Started

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import sys
import pandas as pd

%matplotlib inline

sys.path.append('/home/codio/workspace/.guides/hf')
from helper import *

print('You\'re running python %s' % sys.version.split(' ')[0])

You're running python 3.6.8


<h2>Implement a Regression Tree</h2>

<h3>Part One: Implement <code>load_data</code> [Graded]</h3>

Now, implement a function called <code>load_data</code>, which loads the given <code>.csv</code> file and returns <code>X, y</code>, where <code>X</code> holds the patients' attributes and <code>y</code> is the severity of each patient's heart disease. <b>The function should handle two explicit cases: if <code>label=True</code>, it should return the data <code>X</code> and the corresponding label vector <code>y</code>; if <code>label=False</code>, it should return only the data <code>X</code>.</b>

In [2]:
def load_data(file='heart_disease_train.csv', label=True):
    '''
    Input:
        file: filename of the dataset
        label: a boolean to decide whether to return the labels or not
    Returns:
        X: patient attributes
        y: label (only if label=True)
    '''
    # YOUR CODE HERE
    # Read the file once, skipping the header row
    data = np.loadtxt(file, skiprows=1, delimiter=",")

    if label:
        # The last column holds the label; all other columns are attributes
        X = data[:, :-1]
        y = data[:, -1]
        return X, y

    # No label column: every column is an attribute
    return data
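
To see the column split in isolation, here is a minimal sketch that applies the same <code>np.loadtxt</code> slicing to a small in-memory CSV. The column names and values are made up for illustration and are not taken from the heart disease files.

```python
import io
import numpy as np

# Hypothetical three-row CSV: two attribute columns followed by a label column
csv_text = "age,chol,label\n63,233,0\n67,286,2\n41,204,1\n"

# skiprows=1 drops the header line; np.loadtxt also accepts file-like objects
data = np.loadtxt(io.StringIO(csv_text), skiprows=1, delimiter=",")

X_demo = data[:, :-1]  # every column except the last
y_demo = data[:, -1]   # the last column only

print(X_demo.shape, y_demo.shape)
```

The attribute matrix keeps one row per patient, while the label vector is one-dimensional with the same number of entries.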

In [3]:
X, y = load_data()

In [4]:
# The following tests check that your load_data function reads in the correct number of rows, the correct number of unique values for y, and the same training data as the correct implementation

Xtrain, ytrain = load_data()
Xtrain_grader, ytrain_grader = load_data_grader()
Xtest = load_data(file='heart_disease_test.csv', label=False)
Xtest_grader = load_data_grader(file='heart_disease_test.csv', label=False)

def load_data_test1():
    return (len(Xtrain) == len(ytrain))

def load_data_test2():
    return (len(Xtrain) == len(Xtrain_grader))

def load_data_test3():
    y_unique = np.sort(np.unique(ytrain))
    y_grader_unique = np.sort(np.unique(ytrain_grader))
    
    if len(y_unique) != len(y_grader_unique):
        return False
    else:
        return np.linalg.norm(y_unique - y_grader_unique) < 1e-7
    
def load_data_test4():
    return(type(Xtrain)==np.ndarray and type(ytrain)==np.ndarray and type(Xtest)==np.ndarray)

def load_data_test5():
    Xtrain.sort()
    Xtrain_grader.sort()
    return np.linalg.norm(Xtrain-Xtrain_grader)<1e-07

def load_data_test6():
    ntr,dtr=Xtrain.shape
    nte,dte=Xtest.shape
    return dtr==dte

def load_data_test7():
    Xtest.sort()
    Xtest_grader.sort()
    return np.linalg.norm(Xtest-Xtest_grader)<1e-07

runtest(load_data_test1,'load_data_test1')
runtest(load_data_test2,'load_data_test2')
runtest(load_data_test3,'load_data_test3')
runtest(load_data_test4,'load_data_test4 (Testing for correct types)')
runtest(load_data_test5,'load_data_test5 (Testing training data for correctness)')
runtest(load_data_test6,'load_data_test6 (training and testing data dimensions should match)')
runtest(load_data_test7,'load_data_test7 (Testing test data for correctness)')




Running Test: load_data_test1 ... ✔ Passed!
Running Test: load_data_test2 ... ✔ Passed!
Running Test: load_data_test3 ... ✔ Passed!
Running Test: load_data_test4 (Testing for correct types) ... ✔ Passed!
Running Test: load_data_test5 (Testing training data for correctness) ... ✔ Passed!
Running Test: load_data_test6 (training and testing data dimensions should match) ... ✔ Passed!
Running Test: load_data_test7 (Testing test data for correctness) ... ✔ Passed!


In [5]:
# Autograder test cell - worth 1 point
# runs load_data test1

In [6]:
# Autograder test cell - worth 1 point
# runs load_data test2

In [7]:
# Autograder test cell - worth 1 point
# runs load_data test3

In [8]:
# Autograder test cell - worth 1 point
# runs load_data test4

In [9]:
# Autograder test cell - worth 1 point
# runs load_data test5

In [10]:
# Autograder test cell - worth 1 point
# runs load_data test6

In [11]:
# Autograder test cell - worth 1 point
# runs load_data test7

Now, you will use the regression tree from the previous assignment for this prediction problem. You can create, train, and query a regression tree using the <code>RegressionTree</code> class we've provided, as demonstrated below:

In [12]:
# Create a regression tree with no restriction on its depth.
# To create a tree of depth k instead,
# call RegressionTree(depth=k)
tree = RegressionTree(depth=np.inf)

# To fit/train the regression tree
tree.fit(X, y)

# To use the trained regression tree to make predictions
pred = tree.predict(X)

<h3> Part Two: Find the Optimal Regression Tree [Graded]</h3>

In <code>test</code>, you will find the optimal regression tree for the dataset <code>heart_disease_train.csv</code> and return its prediction on <code>heart_disease_test.csv</code>. You will be evaluated based on <code>square_loss</code>. You will receive full credit if your model's test loss is less than <strong>0.17</strong>. You may use any functions that you implemented in the previous project.
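
One common way to choose a depth is to hold out part of the training data and compare validation losses. The sketch below is only an illustration of that idea, not the required approach. <code>pick_depth</code> and <code>BinnedMeanModel</code> are hypothetical names: the latter is a small stand-in with <code>fit</code> and <code>predict</code> methods so the example runs without the course's helper module, and the data is synthetic.

```python
import numpy as np

def square_loss(pred, truth):
    return np.mean((pred - truth) ** 2)

class BinnedMeanModel:
    """Hypothetical stand-in for a depth-limited tree: splits the first
    feature into 2**depth equal-width bins and predicts each bin's mean label."""

    def __init__(self, depth):
        self.depth = depth

    def fit(self, X, y):
        x = X[:, 0]
        # Interior bin edges only; searchsorted maps each x to a bin index
        self.edges = np.linspace(x.min(), x.max(), 2 ** self.depth + 1)[1:-1]
        bins = np.searchsorted(self.edges, x, side="right")
        # Empty bins fall back to the overall mean label
        self.means = np.full(2 ** self.depth, y.mean())
        for b in np.unique(bins):
            self.means[b] = y[bins == b].mean()

    def predict(self, X):
        return self.means[np.searchsorted(self.edges, X[:, 0], side="right")]

def pick_depth(make_model, X, y, depths, val_fraction=0.25, seed=0):
    """Fit one model per depth on a training split and return the depth
    with the lowest square loss on the held-out split."""
    order = np.random.RandomState(seed).permutation(len(y))
    n_val = int(len(y) * val_fraction)
    val_idx, tr_idx = order[:n_val], order[n_val:]
    losses = {}
    for d in depths:
        model = make_model(d)
        model.fit(X[tr_idx], y[tr_idx])
        losses[d] = square_loss(model.predict(X[val_idx]), y[val_idx])
    best = min(losses, key=losses.get)
    return best, losses

# Synthetic data for illustration: a step function of a single feature
rng = np.random.RandomState(1)
X_demo = rng.uniform(0, 1, size=(200, 1))
y_demo = (X_demo[:, 0] > 0.5).astype(float)

best_depth, val_losses = pick_depth(BinnedMeanModel, X_demo, y_demo, depths=[1, 2, 3, 4])
print(best_depth, val_losses)
```

In the notebook, the same function could presumably be called with <code>lambda d: RegressionTree(depth=d)</code> as the factory, since the provided tree is constructed with a <code>depth</code> argument and exposes <code>fit</code> and <code>predict</code>.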

In [13]:
def square_loss(pred, truth):
    return np.mean((pred - truth)**2)
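
As a quick check of what this loss measures, the sketch below evaluates it for a constant predictor on made-up binary labels. The labels are purely illustrative. For a constant prediction equal to the label mean, the square loss equals the variance of the labels, which gives a useful baseline to compare a trained tree against.

```python
import numpy as np

def square_loss(pred, truth):
    return np.mean((pred - truth) ** 2)

# Made-up labels for illustration: three zeros and one one
truth = np.array([0.0, 0.0, 0.0, 1.0])

# A constant predictor that always outputs the label mean (0.25 here)
baseline = np.full_like(truth, truth.mean())

baseline_loss = square_loss(baseline, truth)
print(baseline_loss)  # 0.1875, the variance of the labels

# A perfect prediction has zero loss
print(square_loss(truth, truth))  # 0.0
```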

In [14]:
def test():
    '''
    Returns:
        prediction: the prediction of your regression tree on heart_disease_test.csv
    '''
    prediction = None
    Xtrain, ytrain = load_data(file='heart_disease_train.csv', label=True)
    # Binarize the severity: False means severity 0, True means any severity above 0
    ytrain=ytrain>0
    Xtest = load_data(file='heart_disease_test.csv', label=False)
    
    # YOUR CODE HERE
    # Train a depth-4 tree; limiting the depth reduces the risk of overfitting
    tree = RegressionTree(depth=4)
    tree.fit(Xtrain, ytrain)
    prediction = tree.predict(Xtest)

    return prediction

In [15]:
# The following test will check that your test function returns a loss less than 0.17 on the test dataset
# ground truth:
gt = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

pred = test()
test_loss = square_loss(pred, gt)
print('Your test loss: {:0.4f}'.format(test_loss))

def test_loss_test():
    return (test_loss < 0.17)

runtest(test_loss_test, 'test_loss_test')

Your test loss: 0.1523
Running Test: test_loss_test ... ✔ Passed!


In [16]:
# Baseline: square loss of a constant prediction equal to the mean of ytrain
square_loss(np.mean(ytrain),gt)

0.248206469010062

In [17]:
ytrain=(ytrain>0)*1.0
ytrain

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1.

In [18]:
# Autograder test cell - worth 1 point
# runs test function test