<h2>About this Project</h2>

In this project, you will walk through how to approach a data science problem on a dataset from scratch. The context is as follows: you are tasked with predicting the severity of a patient's heart disease from a set of medical attributes. We will restrict our attention to regression trees and walk through how to approach this problem systematically.

<h3>Evaluation</h3>

<p><strong>This project must be successfully completed and submitted in order to receive credit for this course. Your score on this project will be included in your final grade calculation.</strong></p>
    
<p>You are expected to write code where you see <em># YOUR CODE HERE</em> within the cells of this notebook. Not all cells will be graded; code input cells followed by cells marked with <em>#Autograder test cell</em> will be graded. Upon submitting your work, the code you write at these designated positions will be assessed using an "autograder" that will run all test cells to assess your code. You will receive feedback from the autograder that will identify any errors in your code. Use this feedback to improve your code if you need to resubmit. Be sure not to change the names of any provided functions, classes, or variables within the existing code cells, as this will interfere with the autograder. Also, remember to execute all code cells sequentially, not just those you’ve edited, to ensure your code runs properly.</p>
    
<p>You can resubmit your work as many times as necessary before the submission deadline. If you experience difficulty or have questions about this exercise, use the Q&A discussion board to engage with your peers or seek assistance from the instructor.</p>

<p>Before starting your work, please review <a href="https://s3.amazonaws.com/ecornell/global/eCornellPlagiarismPolicy.pdf">eCornell's policy regarding plagiarism</a> (the presentation of someone else's work as your own without source credit).</p>

<h3>Submit Code for Autograder Feedback</h3>

<p>Once you have completed your work on this notebook, you will submit your code for autograder review. Follow these steps:</p>

<ol>
  <li><strong>Save your notebook.</strong></li>
  <li><strong>Mark as Completed —</strong> In the blue menu bar along the top of this code exercise window, you’ll see a menu item called <strong>Education</strong>. In the <strong>Education</strong> menu, click <strong>Mark as Completed</strong> to submit your code for autograder/instructor review. This process will take a moment and a progress bar will show you the status of your submission.</li>
	<li><strong>Review your results —</strong> Once your work is marked as complete, the results of the autograder will automatically be presented in a new tab within the code exercise window. You can click on the assessment name in this feedback window to see more details regarding specific feedback/errors in your code submission.</li>
  <li><strong>Repeat, if necessary —</strong> The Jupyter notebook will always remain accessible in the first tabbed window of the exercise. To reattempt the work, you will first need to click <strong>Mark as Uncompleted</strong> in the <strong>Education</strong> menu and then proceed to make edits to the notebook. Once you are ready to resubmit, follow steps one through three. You can repeat this procedure as many times as necessary.</li>
</ol>
<p>You can also download a copy of this notebook in multiple formats using the <strong>Download as</strong> option in the <strong>File</strong> menu above.</p>

## Getting Started

In [23]:
import numpy as np
import matplotlib.pyplot as plt
import sys

%matplotlib inline

sys.path.append('/home/codio/workspace/.modules')
from helper import *

print('You\'re running python %s' % sys.version.split(' ')[0])

You're running python 3.6.8


## Data Science in the Wild

### Part Zero: Understand the Type of Data and Information to be Extracted from the Data [Not Graded]

For your convenience, we have split the data into a training set and a test set. The training data is in `heart_disease_train.csv` and the test data is in `heart_disease_test.csv`. Do all of your model selection on the training set and evaluate only your _final_ model on the test data. Selecting a model based on the test data is considered cheating, so please refrain from doing so!
    
Before you begin, take a look at the two csv files and `attribute.txt`, which contains a description of each attribute in the csv files. You can download the files for review using the links below:

* [heart_disease_train.csv](files/heart_disease_train.csv)
* [heart_disease_test.csv](files/heart_disease_test.csv)
* [attribute.txt](files/attribute.txt)

### Part One: Implement `load_data` [Graded]

Implement a function called **`load_data`**, which will load the given `.csv` file and return `X, y` data, where `X` are the patients' attributes and `y` is the severity of the patients' heart disease. Your function should:
1. Open the file
2. Read the comma-separated columns in the first line of the `.csv` (remember to strip the ending delimiters `'\n'`).
3. For each line except the first one, read the comma-separated column values, convert all values from `str` to `float`, and add to the data matrix (remember to strip the ending delimiters `'\n'`).
4. Use the `'label'` column for `y` when labels are requested (`label=True`).

**Implementation Notes:**
- In any case, do not include the `'label'` column in the data matrix; otherwise the model would be able to use this feature to predict `y`!
- The function should handle two explicit cases. With `label=True`, it should output the data matrix `X` and the corresponding label vector `y`; with `label=False`, it should output only the data matrix `X`.
- Feel free to use `pd.read_csv` or other data loaders. Just make sure the returned `X, y` are NumPy arrays of shapes `nxd` and `n` respectively.
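
The four steps above can be sketched with plain file reading, as a point of comparison rather than the expected solution. The function name `load_csv_sketch`, the `label_name` parameter, and the toy column names `age` and `chol` are assumptions made for this illustration; only the `'label'` column name comes from the description above.

```python
import os
import tempfile
import numpy as np

def load_csv_sketch(path, label=True, label_name="label"):
    """Illustrative loader: header row first, then numeric rows."""
    with open(path, "r") as f:
        # Steps 1-2: open the file and read the comma-separated header.
        header = f.readline().rstrip("\n").split(",")
        # Step 3: convert every remaining non-empty line to floats.
        rows = [[float(v) for v in line.rstrip("\n").split(",")]
                for line in f if line.strip()]
    data = np.array(rows)
    # Boolean mask marking the label column (all False if it is absent).
    is_label = np.array([name == label_name for name in header])
    X = data[:, ~is_label]            # never keep the label in X
    if label:
        # Step 4: the label column becomes the n-dimensional vector y.
        y = data[:, is_label].ravel()
        return X, y
    return X

# Toy file with hypothetical columns, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.csv")
    with open(path, "w") as f:
        f.write("age,chol,label\n63,233,0\n67,286,2\n")
    X_demo, y_demo = load_csv_sketch(path, label=True)
    X_only = load_csv_sketch(path, label=False)

print(X_demo.shape, y_demo.shape, X_only.shape)  # (2, 2) (2,) (2, 2)
```

Looking the label column up by name, instead of assuming it is the last column, keeps `X` free of the label whether or not labels are requested.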

In [41]:
def load_data(file='heart_disease_train.csv', label=True):
    """
    Returns the data matrix and optionally the corresponding label vector.
    
    Input:
        file: filename of the dataset
        label: a boolean to decide whether to return the labels or not
        
    Output:
        X: (numpy array) nxd data matrix of patient attributes
        y: (numpy array) n-dimensional vector of labels (if label=False, y is not returned)
    """
    X = None
    y = None
    
    # YOUR CODE HERE
    # Read every cell as a string so the header row does not break parsing.
    txt = np.loadtxt(file, delimiter=",", dtype=str)

    if label:
        # Skip the header row; the last column holds the label, so keep it
        # out of X and return it separately as y.
        X = txt[1:, :-1].astype(float)
        y = txt[1:, -1].astype(float)
        return X, y

    # This branch assumes the file has no label column, so every column
    # after the header row is treated as a feature.
    X = txt[1:, :].astype(float)
    return X
    

In [42]:
X, y = load_data()
print(f'Training data matrix shape: {X.shape}')
print(f'Labels vector shape: {y.shape}')

Training data matrix shape: (244, 13)
Labels vector shape: (244,)


In [43]:
# The following tests check that your load_data function reads in the correct number of rows, the correct number of unique values for y, and the same training data as the correct implementation

Xtrain, ytrain = load_data()
Xtrain_grader, ytrain_grader = load_data_grader()
Xtest = load_data(file='heart_disease_test.csv', label=False)
Xtest_grader = load_data_grader(file='heart_disease_test.csv', label=False)

def load_data_test1():
    return (len(Xtrain) == len(ytrain))

def load_data_test2():
    return (len(Xtrain) == len(Xtrain_grader))

def load_data_test3():
    y_unique = np.sort(np.unique(ytrain))
    y_grader_unique = np.sort(np.unique(ytrain_grader))
    
    if len(y_unique) != len(y_grader_unique):
        return False
    else:
        return np.linalg.norm(y_unique - y_grader_unique) < 1e-7
    
def load_data_test4():
    return(type(Xtrain)==np.ndarray and type(ytrain)==np.ndarray and type(Xtest)==np.ndarray)

def load_data_test5():
    Xtrain.sort()
    Xtrain_grader.sort()
    return np.linalg.norm(Xtrain-Xtrain_grader)<1e-07

def load_data_test6():
    ntr,dtr=Xtrain.shape
    nte,dte=Xtest.shape
    return dtr==dte

def load_data_test7():
    Xtest.sort()
    Xtest_grader.sort()
    return np.linalg.norm(Xtest-Xtest_grader)<1e-07

runtest(load_data_test1,'load_data_test1')
runtest(load_data_test2,'load_data_test2')
runtest(load_data_test3,'load_data_test3')
runtest(load_data_test4,'load_data_test4 (Testing for correct types)')
runtest(load_data_test5,'load_data_test5 (Testing training data for correctness)')
runtest(load_data_test6,'load_data_test6 (training and testing data dimensions should match)')
runtest(load_data_test7,'load_data_test7 (Testing test data for correctness)')



Running Test: load_data_test1 ... ✔ Passed!
Running Test: load_data_test2 ... ✔ Passed!
Running Test: load_data_test3 ... ✔ Passed!
Running Test: load_data_test4 (Testing for correct types) ... ✔ Passed!
Running Test: load_data_test5 (Testing training data for correctness) ... ✔ Passed!
Running Test: load_data_test6 (training and testing data dimensions should match) ... ✔ Passed!
Running Test: load_data_test7 (Testing test data for correctness) ... ✔ Passed!


In [27]:
# Autograder test cell - worth 1 point
# runs load_data test1

In [28]:
# Autograder test cell - worth 1 point
# runs load_data test2

In [29]:
# Autograder test cell - worth 1 point
# runs load_data test3

In [30]:
# Autograder test cell - worth 1 point
# runs load_data test4

In [31]:
# Autograder test cell - worth 1 point
# runs load_data test5

In [32]:
# Autograder test cell - worth 1 point
# runs load_data test6

In [33]:
# Autograder test cell - worth 1 point
# runs load_data test7

### Part Two: Picking a Metric for Evaluation [Not Graded]

Since this is a classification problem, there are several loss functions or metrics we could use to evaluate a model. We will use `square_loss` as our loss function, since a classification problem can always be cast as a regression problem. The loss function is implemented below.

In [44]:
def square_loss(pred, truth):
    """
    Calculates the loss between predicted and true labels.
    
    Input:
        pred: n-dimensional vector of predicted labels
        truth: n-dimensional vector of true labels
        
    Output:
        loss: average squared loss
    """
    return np.mean((pred - truth)**2)
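
As a small worked example of this loss, the sketch below evaluates it on made-up vectors (`truth`, `hard`, and `soft` are illustrative values, not project data). When both predictions and labels are 0 or 1, each squared error is 0 or 1, so the average squared loss equals the fraction of wrong predictions.

```python
import numpy as np

def square_loss(pred, truth):
    # Same definition as above: mean of the squared differences.
    return np.mean((pred - truth) ** 2)

truth = np.array([0, 0, 1, 1])

# Hard 0/1 predictions: one of the four labels is wrong.
hard = np.array([0, 1, 1, 1])
hard_loss = square_loss(hard, truth)

# Real-valued predictions, as a regression tree might return.
soft = np.array([0.1, 0.6, 0.9, 0.8])
soft_loss = square_loss(soft, truth)

print(hard_loss)  # 0.25, i.e. one mistake out of four
print(soft_loss)  # (0.01 + 0.36 + 0.01 + 0.04) / 4, about 0.105
```

Real-valued predictions that are confident and correct are penalized only slightly, while a confident wrong prediction costs close to a full misclassification.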

### Part Three: Model Selection [Graded]

At this point, we have 
- Split the data into training set and test set
- Understood and loaded the data
- Picked a metric for evaluation

Now we are ready for model selection!

A data scientist would typically try several models, such as a perceptron or linear regression. For simplicity, we restrict our attention to regression trees. Implement the **`test`** function, which loads the training and test sets, finds the optimal regression tree trained on `heart_disease_train.csv`, and returns the tree's predictions on `heart_disease_test.csv`. You will be evaluated on `square_loss`, and you will receive full credit if your classifier's test loss is less than **0.18**. You may use any functions that you implemented in the previous projects.

_Hint: A few things you can try: selecting the best depth to avoid overfitting, picking the optimal subset of features for your classification model, etc._
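
To make the depth-selection idea in the hint concrete, here is a self-contained sketch of k-fold cross-validation over candidate depths. It does not use the course helpers: `fit_tree`, `predict`, `kfold_indices`, and `select_depth` are hypothetical stand-ins written for this illustration, the tree is a deliberately tiny greedy implementation, and the data is synthetic.

```python
import numpy as np

def fit_tree(X, y, depth):
    """Tiny regression tree: a leaf is a float, a split is a tuple."""
    if depth == 0 or np.all(y == y[0]):
        return float(np.mean(y))
    best = None
    for j in range(X.shape[1]):
        # Dropping the largest value keeps both sides of a split non-empty.
        for t in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= t
            sse = (np.sum((y[left] - y[left].mean()) ** 2)
                   + np.sum((y[~left] - y[~left].mean()) ** 2))
            if best is None or sse < best[0]:
                best = (sse, j, t)
    if best is None:  # every feature is constant in this node
        return float(np.mean(y))
    _, j, t = best
    left = X[:, j] <= t
    return (j, t,
            fit_tree(X[left], y[left], depth - 1),
            fit_tree(X[~left], y[~left], depth - 1))

def predict(tree, X):
    def one(node, x):
        while isinstance(node, tuple):
            j, t, below, above = node
            node = below if x[j] <= t else above
        return node
    return np.array([one(tree, x) for x in X])

def kfold_indices(n, k, seed=0):
    """Return k (train_idx, val_idx) pairs covering a shuffled range(n)."""
    folds = np.array_split(np.random.default_rng(seed).permutation(n), k)
    return [(np.concatenate(folds[:i] + folds[i + 1:]), folds[i])
            for i in range(k)]

def select_depth(X, y, depths, k=5):
    """Pick the depth with the lowest mean validation square loss."""
    splits = kfold_indices(len(y), k)
    val_losses = []
    for d in depths:
        fold_losses = []
        for tr, va in splits:
            tree = fit_tree(X[tr], y[tr], d)
            fold_losses.append(np.mean((predict(tree, X[va]) - y[va]) ** 2))
        val_losses.append(float(np.mean(fold_losses)))
    return depths[int(np.argmin(val_losses))], val_losses

# Synthetic data: the label depends on feature 0, with 10% of labels flipped.
rng = np.random.default_rng(0)
X_toy = rng.uniform(size=(120, 3))
y_toy = (X_toy[:, 0] > 0.5).astype(float)
flip = rng.uniform(size=120) < 0.1
y_toy[flip] = 1 - y_toy[flip]

depths = [1, 2, 4, 8]
best_depth, val_losses = select_depth(X_toy, y_toy, depths, k=5)
print("mean validation loss per depth:", dict(zip(depths, val_losses)))
print("selected depth:", best_depth)
```

Only the training data is touched here: each depth is scored on held-out folds of the training set, so the test set stays unseen until the final model is chosen.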

Here are the functions/classes from previous projects available to you:

In [45]:
## Regression Tree
# Create a regression tree with no restriction on its depth.
# To create a tree of depth k instead,
# call RegressionTree(depth=k)
tree = RegressionTree(depth=np.inf)

# To fit/train the regression tree
tree.fit(X, y)

# To use the trained regression tree to make predictions
pred = tree.predict(X)

## k-Fold Cross Validation
depths = [1, 3]

# To generate 5 folds for X data
indices = generate_kFold(n=X.shape[0], k=5)

# To find best depth across the folds
best_depth, training_losses, validation_losses = cross_validation(X, y, depths, indices)

In [46]:
def test():
    """
    Loads the training and test sets, trains a regression tree and outputs predictions for the test set.
    
    Output:
        prediction: the prediction of your classifier on the heart_disease_test.csv
    """
    prediction = None
    Xtrain, ytrain = load_data(file='heart_disease_train.csv', label=True)
    # Binarize the labels: any severity above 0 becomes True
    ytrain = ytrain > 0
    Xtest = load_data(file='heart_disease_test.csv', label=False)
    
    # YOUR CODE HERE
    #raise NotImplementedError()
    tree = RegressionTree(depth = 4)
    tree.fit(Xtrain, ytrain)
    
    return tree.predict(Xtest)

In [47]:
# The following test will check that your test function returns a loss less than 0.18 on a sample dataset
# ground truth:
gt = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

pred = test()
test_loss = square_loss(pred, gt)
print('Your test loss: {:0.4f}'.format(test_loss))

def test_loss_test():
    return (test_loss < 0.18)

runtest(test_loss_test, 'test_loss_test')

Your test loss: 0.2203
Running Test: test_loss_test ... ✖ Failed!
 The output of your function does not match the expected output. Check your code and try again.


In [None]:
# Autograder test cell - worth 1 point
# runs test function test