# Assignment 02: Scikit Learn Basic Regression and Classification

**Due Date:** Friday 9/18/2020 (by midnight)


## Introduction 

In this exercise we will perform a regression task and a classification task using the scikit-learn framework and the Python statsmodels library.  You should work through the tutorials on using scikit-learn and statsmodels before doing this assignment, and review the materials from our units on regression and classification tasks.

For the first part of this assignment, I recommend looking through the following tutorials on using
Scikit Learn and the statsmodels library for linear regression:

[A Beginners guide to Linear Regression in Python with Scikit-Learn](https://towardsdatascience.com/a-beginners-guide-to-linear-regression-in-python-with-scikit-learn-83a8f7ae2b4f)

[Use statsmodels to Perform Linear Regression in Python](https://datatofish.com/statsmodels-linear-regression/)

I am using this material as a reference for the first part of this assignment.

**Please fill these in before submitting, just in case I accidentally mix up file names while grading**:

Name: Joe Student

CWID-5: (Last 5 digits of cwid)

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# By convention, we often just import the specific classes/functions
# from scikit-learn we will need to train a model and perform prediction.
# Here we include all of the classes and functions you should need for this
# assignment from the sklearn library, but there could be other methods you might
# want to try or would be useful to the way you approach the problem, so feel free
# to import others you might need or want to try
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_recall_curve

# statsmodels has an api, it is often imported as sm by convention
import statsmodels.api as sm

%matplotlib inline

In [2]:
plt.rcParams['figure.figsize'] = (10, 8) # set default figure size, 10in by 8in

## Linear Regression with One Variable
--------


### Scikit-Learn LinearRegression model

Load the single-feature data from the file named "data/assg-02-house-data.csv" using Pandas, then plot the points
in the dataset using a basic matplotlib figure. This data set has profit (y, the dependent
variable) for a food truck business, specified in units of $10,000.
The profit is the variable we want to predict (the regression variable).
Profit is a function of the population size (x, the independent variable),
which here is expressed in units of 10,000 people (e.g. population 10 = 100,000 people).
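As a rough sketch of the loading step, the cell below reads a small in-memory CSV with Pandas. The column names `population` and `profit` and the three data rows are made up for illustration; check the header of the real file, since its column names may differ.

```python
import io
import pandas as pd

# Hypothetical stand-in for the CSV file: made-up column names and values
csv_text = """population,profit
5.0,1.0
10.0,7.5
20.0,20.0
"""

# With the real file you would pass the file path instead of a StringIO object
data = pd.read_csv(io.StringIO(csv_text))

# Pull the feature and the target out as NumPy arrays for plotting or fitting
x = data["population"].to_numpy()
y = data["profit"].to_numpy()

print(data.shape)
```

From here a scatter plot of `x` against `y` can be drawn with `plt.scatter(x, y)` or `plt.plot(x, y, 'o')`.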

In [3]:
# load the assignment 02 housing price linear regression data here


In [4]:
# plot the data here, matplotlib example


The tutorial shows an example of building a regression model where some data is held back from training
so that we can evaluate the accuracy of our predictive model.  We will try that next.  First of all,
fit a linear regression model to all of the data using the scikit-learn `LinearRegression` object.

Once you fit the model, show the intercept and slope (the fitted parameters of the model)
that were determined to be the best fit.  Also use the `score()` function
to display the $R^2$ score of the fitted model.
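The general pattern looks something like the following minimal sketch. It uses made-up data that lies exactly on the line y = 2x + 1 rather than the assignment data, so the fitted parameters are easy to check by eye.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (not the assignment data): y = 2x + 1 exactly
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0

# scikit-learn expects a 2-D feature array of shape (n_samples, n_features)
X = x.reshape(-1, 1)

model = LinearRegression()
model.fit(X, y)

intercept = model.intercept_   # a scalar when y is 1-D
slope = model.coef_[0]         # coef_ holds one coefficient per feature
r2 = model.score(X, y)         # R^2 computed on the data passed in

print(f"intercept: {intercept:.4f}, slope: {slope:.4f}, R^2: {r2:.4f}")
```

Note the `reshape(-1, 1)`: a single feature still has to be passed as a column, not as a flat 1-D array.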

In [5]:
# fit the linear regression model to all of the data


In [6]:
# retrieve the intercept and slope


In [7]:
# use the score() function to display the model's R^2 fit


Compare the intercept and slope you determined here using Scikit Learn with the following.

The slope and intercept fitted parameters you should find are:

intercept: -3.89578088

slope: 1.19303364

As shown in our lecture notebooks, use the predict() method of your Scikit Learn regression model to predict
a value for each of our x data features.
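To illustrate what predict() returns, here is a small sketch on made-up data (not the assignment data). It also checks that the predictions are the same values you would get by applying the fitted intercept and slope by hand.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up stand-in data, not the assignment data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])
X = x.reshape(-1, 1)

model = LinearRegression().fit(X, y)

# predict() returns one hypothesized y value per row of X
y_hat = model.predict(X)

# The same values computed by hand from the fitted parameters
y_manual = model.intercept_ + model.coef_[0] * x

print(np.allclose(y_hat, y_manual))
```

Plotting `x` against `y_hat` as a line over a scatter plot of the raw points gives the fitted line.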

In [8]:
# using predict() from scikit-learn find the predicted or hypothesized profit for each of the model populations


Now we can plot the linear fit line determined by Scikit Learn against our data.

In [9]:
# plot the fitted line using the predict() method from the LinearRegression object


### statsmodels Linear Regression

In contrast to the scikit-learn library, the Python statsmodels library is primarily geared towards doing statistical
analysis of data, similar to a stats package like SPSS or R.  You can perform a linear regression on a data
set using the statsmodels package, and get much more information about the goodness of the fit from the
constructed model.

In the next cell, create a model using the statsmodels `OLS` (ordinary least squares) class, fit the model, and use the summary() function
to get information about the fit.

In [10]:
# load the data from our assignment 02 linear regression problem again if needed


When building a model of data, like a linear regression model, there are terms or parameters that the model
fits to the data.   One of these terms is known as the **bias** term.  We will learn more about what this 
term is in the coming weeks.  

The Scikit-learn library assumes you have not represented the bias term in the `X` data that is being fit.
But the statsmodels library does not make this assumption.  So there is actually an additional step you
need to perform before fitting, which is to add in the constant bias term to the `X` data being fit.

In [11]:
# unlike for sklearn library, we actually have to add the dummy feature by hand to 
# represent the intercept feature, it is not assumed automatically by OLS
# use the add_constant() method to add a column to represent our intercept coefficient in the model.


In [12]:
# use the statsmodels summary method to get a summary of the statistical fit of your linear regression.
# Compare the fitted parameters with the results from scikit-learn before.


In the summary you should note that you get the same coefficients (const and x1) as we have determined using 
scikit-learn previously.  The $R^2$ measure of the fit is also the same
as what we got for fitting all of the data for the sklearn model.  

The rest of the summary gives statistical information about how well the model fits
the data.  The values in the table under the [0.025 0.975] columns give a 95% confidence interval for
each coefficient.  For example, given the measured noise and fit, we are 95% confident that the true
coefficient for the x1 parameter (the slope of the line) is somewhere between 1.035 and 1.351.  The P>|t| measure is also
important here.  This is a p-value that measures how surprised we would be to see a coefficient this far from 0 if there were
actually no linear relationship between the independent variable and the dependent variable.  The p-values reported
for both coefficients are basically 0; for the slope, this means we would be very surprised to see this fit if there were no linear
relationship between the feature and the dependent variable.  When the p-value is large (a cutoff
of 0.05 is commonly used), we would not be so surprised to see the result even if there were no
linear relationship.


## Logistic Regression for a Binary Classifier
------

### scikit-learn LogisticRegression model

Load and plot the exam score data with binary class labels of accepted/not accepted.
This data is found in a file named "data/assg-02-exam-data.csv". Plot the data using
a standard Matplotlib scatter plot to visualize it.  This data set has 2 exam scores (exam1 and exam2),
for a number of students, and a binary category for each student of whether they were admitted
or not to a university degree program.

[Logistic Regression using Python (scikit-learn)](https://towardsdatascience.com/logistic-regression-using-python-sklearn-numpy-mnist-handwriting-recognition-matplotlib-a6b31e2b166a)

[Logistic Regression in Python Using statsmodels](http://blog.yhat.com/posts/logistic-regression-python-rodeo.html)

In [13]:
# load the data for the logistic classification problem here


In [14]:
# replot the exam1/exam2 data indicating the binary categories using marker type again here for reference

# plot not admitted students first

# plot the admitted students

Now we will use Scikit Learn to fit a model again, but of course we will fit a binary logistic regression classifier
to our data to find the best decision boundary between the two classes.

In the next cell, create the Scikit Learn logistic regression model and fit it to our data.

In [15]:
# Create the scikit-learn LogisticRegression instance here

# and fit your model to the college acceptance using exam scores data here


For a binary classifier with 2 features like this, there are 3 parameters in the model:
the intercept, and the parameters for exam1 and exam2 that were fit to define the decision boundary.

You should get close to the following parameters, where the first
parameter is the intercept, and the second and third are the theta parameters fit for the exam1 and exam2
features, respectively:

[-25.05219314   0.20535491   0.2005838 ]

In the model returned by Scikit Learn, `intercept_` should match the intercept value,
and `coef_` should match the exam1 and exam2 coefficient parameters.

In [16]:
# display the intercept and the model coefficients for the exam1 and exam2 feature here


The parameters in this case might not exactly match because of the differences in the optimization meta-parameters, but
they will be close and essentially form almost the same decision boundary.

As we did in the example lectures this week, for a data set with 2 features we can use the intercept and coefficients to
visualize the decision boundary specified by the fitted logistic regression model.
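The boundary is the set of points where the model's linear score is exactly 0 (predicted probability 0.5), so theta0 + theta1 * exam1 + theta2 * exam2 = 0 can be solved for exam2. The sketch below uses the reference parameters listed above; the exam1 range of 30 to 100 is an assumption chosen only for illustration.

```python
import numpy as np

# Reference parameters quoted above: intercept, exam1 weight, exam2 weight
theta0, theta1, theta2 = -25.05219314, 0.20535491, 0.2005838

# Two exam1 values are enough to draw a straight line (range is an assumption)
exam1 = np.array([30.0, 100.0])

# Solve theta0 + theta1 * exam1 + theta2 * exam2 = 0 for exam2
exam2 = -(theta0 + theta1 * exam1) / theta2

def sigmoid(z):
    """Logistic function mapping a linear score to a probability."""
    return 1.0 / (1.0 + np.exp(-z))

# Points on the boundary should have predicted probability 0.5
prob_on_boundary = sigmoid(theta0 + theta1 * exam1 + theta2 * exam2)
print(exam2, prob_on_boundary)
```

With a fitted scikit-learn model, the same three numbers would come from `intercept_[0]`, `coef_[0, 0]`, and `coef_[0, 1]`, and `plt.plot(exam1, exam2)` draws the line over the scatter plot.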

In [17]:
# plot the decision boundary line found by the scikit-learn logistic classification


Finally, as we showed in our lecture materials from chapter 3, calculate the confusion matrix of your trained classifier
on the training data.  Also plot the precision and recall scores, either as two separate lines (one for each) or as a single curve of precision
versus recall.  We showed both in our lecture and textbook readings for this week.
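These evaluation steps fit together roughly as in the following sketch, which uses two synthetic overlapping clusters instead of the exam data. It relies on `decision_function()` to get the continuous scores that `precision_recall_curve()` needs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_recall_curve

# Synthetic stand-in data: two overlapping 2-D clusters, 50 points each
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(2.0, 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = LogisticRegression()
clf.fit(X, y)

# Hard class predictions feed the confusion matrix
y_pred = clf.predict(X)
cm = confusion_matrix(y, y_pred)   # rows: true class, columns: predicted class

# Continuous decision scores feed the precision/recall computation
scores = clf.decision_function(X)
precisions, recalls, thresholds = precision_recall_curve(y, scores)

print(cm)
```

The returned `precisions` and `recalls` arrays have one more entry than `thresholds`, so plotting each against the threshold uses `precisions[:-1]` and `recalls[:-1]`, while `plt.plot(recalls, precisions)` gives the precision versus recall curve.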

In [18]:
from sklearn.metrics import confusion_matrix

# calculate predictions for the x inputs

# display the confusion matrix

In [19]:
from sklearn.metrics import precision_recall_curve

# get the decision scores

# calculate precision and recall 


In [20]:
# plot precision and recall curves


### statsmodels Logistic Regression

Likewise, use the statsmodels library to redo the logistic regression classification of the admit/not admit
data set once again.  In the following cells, load the data, add in the constant column needed by statsmodels
to fit the model using the intercept parameter, then create an instance and fit the model, and show a summary
of your logistic regression results.

In [21]:
# get fresh reload of the data if needed here to ensure you have correct starting values of the assignment 02
# classification data


In [22]:
# unlike for sklearn library, we actually have to add the dummy feature by hand to 
# represent the intercept feature, it is not assumed automatically by Logit
# make sure you add the intercept feature column here before fitting the model.


In [23]:
# create an instance of the statsmodels Logit model (you don't need MNLogit here since this is
# a binary classification task).
# fit your model to get a statsmodels results wrapper for the fit


In [24]:
# display a summary of the fit of your classifier.  You might want to compare your intercept and
# fitted coefficients again, though this time they probably won't match.
# HINT: if you want them to match, try turning off the penalty for scikit-learn.  It adds 
# a regularization penalty by default, which we will learn about later in the course.


As mentioned, by default the statsmodels logistic regression adds no regularization penalty.  We will 
discuss regularizing ML models in the coming weeks.  Adding regularization to get a more general model is not
something one normally does when performing statistical analysis, so it is not an easily
specified parameter of statsmodels objects.  But when building ML models to classify new unseen data, it 
is important to try and make the model as general and robust as possible.