# Module 2 Assignment

A few things you should keep in mind when working on assignments:

1. Make sure you fill in any place that says `# YOUR CODE HERE`. Do not write your answer anywhere else other than where it says `# YOUR CODE HERE`. Anything you write elsewhere will be removed or overwritten by the autograder.
2. Before you submit your assignment, make sure everything runs as expected: in the menubar, select Kernel, then Restart & Run All to restart the kernel and run all cells.
3. Do not change the title (i.e. file name) of this notebook.
4. Make sure that you save your work (in the menubar, select File → Save and Checkpoint).
5. All work must be your own. If you do use any code from another source (such as a course notebook or a website), you need to properly cite the source.

-----

In [None]:
import pandas as pd
import numpy as np

from sklearn.svm import SVC

from nose.tools import assert_equal, assert_almost_equal, assert_true, assert_is_instance

-----

## Loading Breast Cancer Data

In this assignment, we will work with a breast cancer data set to make predictive models. Before we build a model, we first load the data into the assignment notebook, and randomly sample several rows.
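As a minimal sketch of the load-and-sample step, the example below reads a small in-memory CSV instead of the assignment file. The CSV text and the names `csv_text`, `toy`, and `rows` are hypothetical; only the `id` and `class` column names mirror the ones used later in this notebook.

```python
import io

import pandas as pd

# Hypothetical stand-in for the data file: six rows, two columns.
csv_text = "id,class\n1,2\n2,4\n3,2\n4,4\n5,2\n6,4\n"
toy = pd.read_csv(io.StringIO(csv_text))

# Draw three rows at random; passing random_state makes the draw repeatable.
rows = toy.sample(3, random_state=0)
print(rows)
```

Without `random_state`, `sample` is expected to return a different selection on each run, which is fine for a quick visual check of the data.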

-----

In [None]:
df = pd.read_csv('./breast-cancer-wisconsin.data')
df.sample(5)

-----

## Problem 1: Creating Training and Testing Datasets

We pass a DataFrame into the `data_split` function, which is shown below. Your task in this problem is to complete this function by using the `train_test_split` function from the scikit-learn library to split the input DataFrame into two new DataFrames (i.e., a training set and a testing set). Specifically, you must complete the following tasks:
- Split the input DataFrame into two new DataFrames, one each for the training and testing data.
- The `test_size` argument in `train_test_split` should be set to the `size` parameter.
- The `random_state` argument in `train_test_split` should be set to the `rs` parameter.
- Return the training and testing DataFrames, in that order.
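To illustrate how `train_test_split` behaves, here is a minimal sketch on a toy DataFrame. The frame `toy` and its column names are hypothetical and are not part of the assignment data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy DataFrame with 20 rows; not the assignment data.
toy = pd.DataFrame({'feature': np.arange(20), 'label': np.arange(20) % 2})

# Hold out 25% of the rows; fixing random_state makes the split repeatable.
toy_train, toy_test = train_test_split(toy, test_size=0.25, random_state=0)

print(toy_train.shape, toy_test.shape)
```

When `test_size` is a fraction, the number of test rows is rounded up and the remaining rows go to the training set. That is consistent with the 512 and 171 row counts checked in the test cell for this problem, assuming the input has 683 rows.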

In [None]:
from sklearn.model_selection import train_test_split

def data_split(data, size=0.25, rs=0):
    '''
    Split input DataFrame into train and test DataFrames
    
    Parameters
    ---------
    data: input DataFrame
    size: fraction of data to hold out for testing
    rs: random state

    Returns
    -------
    Two DataFrames, one for training, and one for testing, respectively
    '''

    ### BEGIN SOLUTION
    # Split the rows into training and testing DataFrames
    train, test = train_test_split(data, test_size=size, random_state=rs)

    return train, test
    ### END SOLUTION

In [None]:
# Test Function
train, test = data_split(df)

# Test return types
assert_equal(type(train), pd.DataFrame, msg="train is not a DataFrame")
assert_equal(type(test), pd.DataFrame, msg="test is not a DataFrame")

# Test return counts
assert_equal(train.count()['id'], 512)
assert_equal(test.count()['id'], 171)

-----

We now separate the training and testing data into features and labels for use with the scikit-learn library, which accepts pandas objects and converts them to NumPy arrays internally. This requires defining the dependent feature, in this case `class`, which must be removed from the set of independent features.
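The separation of labels from features can be sketched on a toy frame as follows. The frame `toy` and the names `labels` and `features` are hypothetical; only the `class` column name mirrors the assignment.

```python
import pandas as pd

# Hypothetical toy frame; only the 'class' column name mirrors the assignment.
toy = pd.DataFrame({'f1': [1, 2, 3], 'f2': [4, 5, 6], 'class': [2, 4, 2]})

labels = toy['class']                 # dependent feature (a Series)
features = toy.drop('class', axis=1)  # independent features (a DataFrame)

# to_numpy() yields explicit NumPy arrays if they are needed.
print(features.to_numpy().shape, labels.to_numpy().shape)
```

Note that `drop` returns a new DataFrame here, so the original frame still contains the `class` column.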

-----

In [None]:
y = train['class']
X = train.drop('class', axis = 1)
yTest = test['class']
XTest = test.drop('class', axis=1)

-----

## Problem 2: Creating a Logistic Regression Classifier

In the following Code cell, you are given a template for the `classify` function. Your task is to complete this function to perform the following tasks:
- Create a Logistic Regression classifier by using the `LogisticRegression` estimator.
- Specify the `random_state` to use by using the input `rs` parameter.
- Fit the new estimator on `X` (i.e., the features) and `y` (i.e., the labels).
- Return the fitted Logistic Regression model.
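The create-and-fit pattern described above can be sketched on synthetic data. Everything in this example (`X_toy`, `y_toy`, `toy_model`, and the choice of labels 2 and 4) is an assumption for illustration and does not come from the assignment data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, well-separated two-class data (hypothetical, not the assignment data).
rng = np.random.RandomState(0)
X_toy = np.vstack([rng.normal(-2, 1, size=(50, 2)),
                   rng.normal(2, 1, size=(50, 2))])
y_toy = np.array([2] * 50 + [4] * 50)  # assumed class labels

# Create the estimator with a fixed seed, then fit it; fit() returns the estimator.
toy_model = LogisticRegression(random_state=0).fit(X_toy, y_toy)

print(toy_model.get_params()['random_state'])
print(toy_model.classes_)
```

Other entries returned by `get_params()`, such as `solver`, depend on the installed scikit-learn version, so a full dictionary comparison may only hold for the version the notebook was written against.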

-----

In [None]:
from sklearn.linear_model import LogisticRegression

def classify(X, y, rs=0):
    '''
    Create and fit an LR classifier to input data
    
    Parameters
    ---------
    X: NumPy array used for training
    y: NumPy array used for training
    rs: seed used for random number generator

    Returns
    -------
    The LR model
    '''

    ### BEGIN SOLUTION
    # Create the estimator with the requested seed and fit it
    model = LogisticRegression(random_state=rs)
    model.fit(X, y)

    return model
    ### END SOLUTION


In [None]:
# Compute linear model by using our new classify function
model_lr = classify(X, y, rs=0)

# Test Linear Model
assert_equal(model_lr.get_params(), 
             {'C': 1.0,
              'class_weight': None,
              'dual': False,
              'fit_intercept': True,
              'intercept_scaling': 1,
              'max_iter': 100,
              'multi_class': 'ovr',
              'n_jobs': 1,
              'penalty': 'l2',
              'random_state': 0,
              'solver': 'liblinear',
              'tol': 0.0001,
              'verbose': 0,
              'warm_start': False})

-----

## Problem 3: Computing Mean Accuracy

The Code cell below provides a template for a function that computes the mean accuracy of a given model on test data, which are specified as `X`, the features, and `y`, the labels. Your task is to complete this function by:
- Calculating the mean accuracy for the given model by using these features and labels.
- Returning the computed mean accuracy.
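Mean accuracy is the fraction of samples whose predicted label matches the true label. The sketch below, on synthetic data, compares three ways of computing it; the names `X_toy`, `y_toy`, `toy_model`, `by_score`, `by_hand`, and `by_metric` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic two-class data with some overlap (hypothetical, not the assignment data).
rng = np.random.RandomState(0)
X_toy = np.vstack([rng.normal(-1, 1, size=(50, 2)),
                   rng.normal(1, 1, size=(50, 2))])
y_toy = np.array([2] * 50 + [4] * 50)
toy_model = LogisticRegression(random_state=0).fit(X_toy, y_toy)

# 1. The classifier's built-in score method.
by_score = toy_model.score(X_toy, y_toy)

# 2. Fraction of predictions that equal the true labels.
by_hand = np.mean(toy_model.predict(X_toy) == y_toy)

# 3. The accuracy_score metric.
by_metric = accuracy_score(y_toy, toy_model.predict(X_toy))

print(by_score, by_hand, by_metric)
```

For scikit-learn classifiers, all three values should agree, so the `score` method is usually the most direct choice.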

-----

In [None]:
def mean_acc(model, X, y):
    '''
    Compute the mean accuracy for a given model and data set.
    
    Parameters
    ---------
    model: the model of interest
    X: NumPy array containing independent data (features)
    y: NumPy array containing dependent data (labels)

    Returns
    -------
    A float containing the mean accuracy
    '''

    ### BEGIN SOLUTION
    # The score method of a classifier returns the mean accuracy
    return model.score(X, y)
    ### END SOLUTION

In [None]:
# Compute Accuracy scores for linear model
model_lr_scores = mean_acc(model_lr, XTest, yTest)

# Test accuracy score
assert_almost_equal(0.6257, model_lr_scores, places=2)

**&copy; 2017: Robert J. Brunner at the University of Illinois.**

This notebook is released under the [Creative Commons license CC BY-NC-SA 4.0][ll]. Any reproduction, adaptation, distribution, dissemination or making available of this notebook for commercial use is not allowed unless authorized in writing by the copyright holder.

[ll]: https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode 