In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib
%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (10, 8) # set default figure size, 10in by 8in

**Due: Wednesday 10/31 (by midnight)**

Name: 

CWID:


# Part 1: Linear Regression

The following cell defines data for the gold medal winning long jump performances at the Summer Olympics from 1900 through 1984.  Copy the data into your notebook and run the cell to define these variables:


In [3]:
# gold medal winning long jump, in inches
long_jump = np.array([282.875, 289, 294.5, 299.25, 281.5, 293.125, 304.75, 300.75, 317.3125, 308, 
                      298, 308.25, 319.75, 317.75, 350.5, 324.5, 328.5, 336.25, 336.25])
y = long_jump # variable y, corresponding to usage in lecture notebooks

# corresponding year of olympics
year = np.array([1900, 1904, 1908, 1912, 1920, 1924, 1928, 1932, 1936, 1948, 1952, 1956, 
                 1960, 1964, 1968, 1972, 1976, 1980, 1984])
x = year # variable x, corresponding to usage in lecture notebooks

# number of training examples
m = len(long_jump) 

# create the X array, where the first row is all 1's and the second row holds the original x inputs,
# suitable for use in cost and gradient function calculations
X = np.ones( (2, m) )
X[1:,:] = x.T # the second row contains the raw inputs

**Task 1**: Plot the long jump performance by year (year on the x axis, long jump performance on the y axis).  Use blue circle markers (no lines) to create a scatter plot of the data.  Label the axes of the figure and add a title.  Copy and paste the code that creates your figure from your notebook into the eCollege text box provided for this task.
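
As a rough illustration of the plotting calls involved, here is a minimal sketch. The axis labels and title text are placeholders chosen for this example, not wording the assignment requires, and the data arrays are repeated only so the snippet runs on its own.

```python
import numpy as np
import matplotlib.pyplot as plt

# long jump data copied from the assignment's data cell
long_jump = np.array([282.875, 289, 294.5, 299.25, 281.5, 293.125, 304.75, 300.75, 317.3125, 308,
                      298, 308.25, 319.75, 317.75, 350.5, 324.5, 328.5, 336.25, 336.25])
year = np.array([1900, 1904, 1908, 1912, 1920, 1924, 1928, 1932, 1936, 1948, 1952, 1956,
                 1960, 1964, 1968, 1972, 1976, 1980, 1984])

fig, ax = plt.subplots()
# the format string "bo" means blue ("b") circle markers ("o") with no connecting line
ax.plot(year, long_jump, "bo")
ax.set_xlabel("Year")                                   # placeholder label text
ax.set_ylabel("Winning long jump (inches)")             # placeholder label text
ax.set_title("Olympic gold medal long jump by year")    # placeholder title text
```

With `%matplotlib inline` active, the figure should render when the cell finishes; in a plain script you would also call `plt.show()`.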

In [6]:
# place plot here

**Task 2**: Now, using the long jump data, perform a linear fit of the data.  For this cell, use the polyfit() and poly1d() functions from the NumPy library to fit a line to the data (we demonstrated polyfit() back in the Lecture-03a notebook).  Print out the coefficients you find for the best fit using polyfit().  Replot the figure from the previous task, adding the line that represents the hypothesis/model found by polyfit() and poly1d().
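
The fitting step might look roughly like the sketch below. It assumes a degree-1 polynomial, for which `np.polyfit` returns the coefficients highest power first (slope, then intercept). The names `model`, `line_x`, and `line_y` are illustrative only.

```python
import numpy as np

# long jump data copied from the assignment's data cell
long_jump = np.array([282.875, 289, 294.5, 299.25, 281.5, 293.125, 304.75, 300.75, 317.3125, 308,
                      298, 308.25, 319.75, 317.75, 350.5, 324.5, 328.5, 336.25, 336.25])
year = np.array([1900, 1904, 1908, 1912, 1920, 1924, 1928, 1932, 1936, 1948, 1952, 1956,
                 1960, 1964, 1968, 1972, 1976, 1980, 1984])

# fit a straight line (degree 1); coefficients come back as [slope, intercept]
coefficients = np.polyfit(year, long_jump, 1)
slope, intercept = coefficients
print("slope (inches per year):", slope)
print("intercept (inches):", intercept)

# poly1d wraps the coefficients in a callable model: model(x) = slope * x + intercept
model = np.poly1d(coefficients)

# two end points are enough to draw the fitted line over the scatter plot
line_x = np.array([year.min(), year.max()])
line_y = model(line_x)
```

Passing `line_x` and `line_y` to a plotting call on the same axes as the scatter plot would overlay the fitted line.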

In [7]:
# perform linear regression using numpy polyfit() and poly1d()

# plot figure adding regression line to figure to visualize fitted model

**Task 3**: We did not demonstrate using the scikit-learn library for linear or logistic regression,
but we did later look at the general framework of scikit-learn and how to use it to fit models.
Look up the scikit-learn documentation on fitting linear regression models to data.  Using scikit-learn
functions, perform the same linear regression and display the coefficients of the best fit.  Do they
match the parameters found using the NumPy library functions?
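
One possible shape for this, sketched with scikit-learn's `LinearRegression` estimator: the main detail to watch is that scikit-learn expects the inputs as a 2-D array of shape (number of samples, number of features), so the 1-D `year` array is reshaped into a single column. The variable names here are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# long jump data copied from the assignment's data cell
long_jump = np.array([282.875, 289, 294.5, 299.25, 281.5, 293.125, 304.75, 300.75, 317.3125, 308,
                      298, 308.25, 319.75, 317.75, 350.5, 324.5, 328.5, 336.25, 336.25])
year = np.array([1900, 1904, 1908, 1912, 1920, 1924, 1928, 1932, 1936, 1948, 1952, 1956,
                 1960, 1964, 1968, 1972, 1976, 1980, 1984])

# reshape the 1-D inputs into a column: one row per Olympics, one feature (the year)
features = year.reshape(-1, 1)

regressor = LinearRegression()   # fits an intercept by default
regressor.fit(features, long_jump)

sk_slope = regressor.coef_[0]
sk_intercept = regressor.intercept_
print("slope:", sk_slope)
print("intercept:", sk_intercept)

# compare against the NumPy fit of the same line
np_slope, np_intercept = np.polyfit(year, long_jump, 1)
print("matches polyfit:", np.allclose([sk_slope, sk_intercept], [np_slope, np_intercept]))
```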

In [8]:
# perform linear regression using scikit-learn functions

**Task 4**: Perform linear regression by hand using advanced optimization techniques, as we did in our
lecture notebooks.  Here are functions suitable for calculating the cost and gradients for
simple linear regression of one variable (they expect X as defined above, with the 1's in the first row):

In [9]:
def compute_linear_regression_cost(theta, X, y):
    """Compute the cost function for linear regression.  
    
    Given a set of inputs X (we assume that the first row has been 
    initialized to 1's for the theta_0 parameter), and the correct 
    outputs for these inputs y, calculate the hypothesized outputs 
    for a given set of parameters theta.  Then we compute the sum of
    the squared differences (and divide the final result by 2*m), 
    which gives us the cost.
    
    Args
    ----
    theta (numpy nx1 array) - An array of the set of theta parameters
       to evaluate
    X (numpy nxm array) - The example inputs, one column per training
       example; the first row is expected to be all 1's.
    y (numpy m size array) - A vector of the correct outputs of length m
       
    Returns
    -------
    J (float) - The sum squared difference cost function for the given
       theta parameters
    """
    
    # determine the number of training examples from the size of the correct outputs
    m = len(y)
    
    # You need to return the following variable correctly
    J = 0.0
    
    # Compute the cost of a particular choice of theta
    # and return the resulting cost J
    hypothesis = np.dot(theta.T, X)
    J = np.sum( (hypothesis - y)**2.0 ) / (2.0 * m)
    
    return J


def compute_linear_regression_gradients(theta, X, y):
    """Compute the gradients of the theta parameters for our linear regression
    cost function.
    
    Given a set of inputs X (we assume that the first row has been 
    initialized to 1's for the theta_0 parameter), and the correct 
    outputs for these inputs y, calculate the gradient of the cost function
    with respect to each of the theta parameters.
    
    Args
    ----
    theta (numpy nx1 array) - An array of the set of theta parameters
       to evaluate
    X (numpy nxm array) - The example inputs, one column per training
       example; the first row is expected to be all 1's.
    y (numpy m size array) - A vector of the correct outputs of length m
       
    Returns
    -------
    gradients - A numpy n sized vector of the computed gradients.
    """

    # determine the number of training examples from the size of the correct outputs
    # and the number of parameters from the size of theta
    m = len(y)
    n = len(theta)
    
    # You need to return the following variable with the correctly calculated
    # gradients of theta
    gradients = np.zeros(n)

    hypothesis = np.dot(theta.T, X)
    for j in range(n):
        gradients[j] = np.sum((hypothesis - y) * X[j,:]) / m
        
    return gradients

Using the advanced optimization minimize() method (from the scipy.optimize library), run a minimization of the cost function for the long jump data, using the BFGS optimization method.  Print out the resulting theta parameters that are calculated.  Do they agree with the parameters you calculated using the polyfit() method?
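
A hedged sketch of how the `minimize()` call can be wired up follows. It makes one assumption the assignment does not ask for: the year is standardized before optimizing, because raw inputs near 1900 to 1984 make the cost surface very elongated, which can slow or stall BFGS. The fitted parameters are then mapped back to the original scale. The compact `cost` and `gradients` helpers are stand-ins written for this example; they use the same formulas and the same X layout (1's in the first row) as the functions above.

```python
import numpy as np
from scipy.optimize import minimize

# long jump data copied from the assignment's data cell
long_jump = np.array([282.875, 289, 294.5, 299.25, 281.5, 293.125, 304.75, 300.75, 317.3125, 308,
                      298, 308.25, 319.75, 317.75, 350.5, 324.5, 328.5, 336.25, 336.25])
year = np.array([1900, 1904, 1908, 1912, 1920, 1924, 1928, 1932, 1936, 1948, 1952, 1956,
                 1960, 1964, 1968, 1972, 1976, 1980, 1984])
m = len(long_jump)

# ASSUMPTION (not part of the assignment): standardize the inputs for better conditioning
year_mean, year_std = year.mean(), year.std()
X = np.ones((2, m))                      # first row: 1's for theta_0
X[1, :] = (year - year_mean) / year_std  # second row: standardized year

def cost(theta, X, y):
    """Sum of squared errors divided by 2*m."""
    residuals = theta @ X - y
    return np.sum(residuals ** 2) / (2.0 * len(y))

def gradients(theta, X, y):
    """Gradient of the cost with respect to each theta parameter."""
    residuals = theta @ X - y
    return X @ residuals / len(y)

result = minimize(cost, np.zeros(2), args=(X, long_jump), method="BFGS", jac=gradients)
print(result.message)

# undo the standardization so the parameters are comparable with polyfit()
slope = result.x[1] / year_std
intercept = result.x[0] - result.x[1] * year_mean / year_std
print("intercept, slope on the original scale:", intercept, slope)
```

Running the same call on the unscaled X is a useful experiment: comparing `result.message` and the resulting parameters against polyfit() shows how much the conditioning matters.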

In [10]:
# optimize the linear regression fit by hand here

# Part 2: Logistic Regression

In assignment 03 we performed a logistic regression classification on a set of data representing student
exam scores on 2 exams where students were classified as being admitted or not admitted to the university
(based on exam scores and other unknown criteria).  Read the help documentation on using scikit-learn
library functions to perform a logistic classification, and fit a classifier to this same data.
Create a plot that visualizes the classification achieved by your scikit-learn logistic classifier.
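
The sketch below outlines the scikit-learn calls involved. The assignment 03 exam data is not reproduced in this notebook, so the example uses clearly synthetic stand-in scores generated from a fixed random seed; replace `scores` and `labels` with the real data. It also shows one way to get a linear decision boundary from the fitted model: the boundary is where `w1*x1 + w2*x2 + b = 0`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# SYNTHETIC stand-in data (not the assignment 03 data): two clusters of exam-score pairs
rng = np.random.default_rng(0)
not_admitted = rng.normal(loc=45.0, scale=10.0, size=(50, 2))
admitted = rng.normal(loc=75.0, scale=10.0, size=(50, 2))
scores = np.vstack([not_admitted, admitted])          # shape (100, 2): exam 1, exam 2
labels = np.concatenate([np.zeros(50, dtype=int), np.ones(50, dtype=int)])

# max_iter is raised as a precaution because the features are not scaled
classifier = LogisticRegression(max_iter=1000)
classifier.fit(scores, labels)
print("training accuracy:", classifier.score(scores, labels))

# the decision boundary is the line w1*x1 + w2*x2 + b = 0; solve it for x2
w1, w2 = classifier.coef_[0]
b = classifier.intercept_[0]
boundary_exam1 = np.array([scores[:, 0].min(), scores[:, 0].max()])
boundary_exam2 = -(b + w1 * boundary_exam1) / w2
```

Plotting the two classes with different markers and then drawing `boundary_exam1` against `boundary_exam2` on the same axes gives one way to visualize the classifier.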

In [11]:
# perform logistic regression using scikit-learn library here

# visualize the classification of the data by showing the decision boundary made by the logistic regressor

# Part 3: Decision Tree

You have been hired by a biology professor to build a decision tree that predicts whether a mushroom is
poisonous, and have been given the following table of data:


| Colour | Height | Stripes | Texture | Poisonous? |
|--------|--------|---------|---------|------------|
| Purple | Tall   | Yes     | Rough   | Yes        |
| Purple | Tall   | Yes     | Smooth  | Yes        |
| Red    | Short  | Yes     | Hairy   | No         |
| Blue   | Short  | No      | Smooth  | No         |
| Blue   | Short  | Yes     | Hairy   | Yes        |
| Red    | Tall   | No      | Hairy   | No         |
| Blue   | Tall   | Yes     | Smooth  | Yes        |
| Blue   | Short  | Yes     | Smooth  | Yes        |
| Blue   | Tall   | No      | Hairy   | No         |
| Blue   | Short  | Yes     | Rough   | Yes        |
| Red    | Short  | No      | Smooth  | No         |
| Purple | Short  | No      | Hairy   | Yes        |
| Red    | Tall   | Yes     | Hairy   | No         |
| Purple | Tall   | Yes     | Hairy   | Yes        |
| Purple | Tall   | No      | Rough   | No         |
| Purple | Tall   | No      | Smooth  | No         |

First, put the data into a file and clean it so that it works well as training data
for a decision tree classifier.  You may need to convert the text values into numerical factors, for example.
Then, using a scikit-learn decision tree classifier, fit a tree to your data.  Show some predictions from
your model.  For example, given a mushroom that is blue, tall, striped and smooth, does your classifier
predict you can eat it?
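
A minimal sketch of one way to approach this, under a few assumptions: the table rows are typed inline rather than read from a file, the text values are converted with scikit-learn's `OneHotEncoder` (other encodings would also work), and the tree is left unpruned with a fixed `random_state` so repeated runs build the same tree.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# rows copied from the table above: colour, height, stripes, texture
rows = [
    ["Purple", "Tall", "Yes", "Rough"], ["Purple", "Tall", "Yes", "Smooth"],
    ["Red", "Short", "Yes", "Hairy"], ["Blue", "Short", "No", "Smooth"],
    ["Blue", "Short", "Yes", "Hairy"], ["Red", "Tall", "No", "Hairy"],
    ["Blue", "Tall", "Yes", "Smooth"], ["Blue", "Short", "Yes", "Smooth"],
    ["Blue", "Tall", "No", "Hairy"], ["Blue", "Short", "Yes", "Rough"],
    ["Red", "Short", "No", "Smooth"], ["Purple", "Short", "No", "Hairy"],
    ["Red", "Tall", "Yes", "Hairy"], ["Purple", "Tall", "Yes", "Hairy"],
    ["Purple", "Tall", "No", "Rough"], ["Purple", "Tall", "No", "Smooth"],
]
# the "Poisonous?" column, in the same row order
poisonous = ["Yes", "Yes", "No", "No", "Yes", "No", "Yes", "Yes",
             "No", "Yes", "No", "Yes", "No", "Yes", "No", "No"]

# one-hot encode the text features, then fit an unpruned decision tree
model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    DecisionTreeClassifier(random_state=0),
)
model.fit(rows, poisonous)
print("training accuracy:", model.score(rows, poisonous))

# a blue, tall, striped, smooth mushroom
prediction = model.predict([["Blue", "Tall", "Yes", "Smooth"]])
print("poisonous?", prediction[0])
```

The example mushroom happens to match a row already in the table, so an unpruned tree simply reproduces that row's label; trying combinations that do not appear in the table is a better check of how the tree generalizes.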