# Lab: Classification
## CMSE 381 - Spring 2023
## Jan 27, 2023



In this module, we are going to test out the classification methods we discussed in class.

## Getting a feel for the data

We're going to use the `Smarket` data set from the ISLR book as included in their R package.  I've included a csv with this notebook for you to use. 

This data set consists of percentage returns for the S&P 500 stock index over 1,250 days, from the beginning of 2001 until the end of 2005. For each date, we have recorded the percentage returns for each of the five previous trading days, `Lag1` through `Lag5`. We have also recorded `Volume` (the number of shares traded on the previous day, in billions), `Today` (the percentage return on the date in question) and `Direction` (whether the market was `Up` or `Down` on this date). Our goal is to predict `Direction` (a qualitative response) using the other features.

In [2]:
import numpy as np 
import matplotlib.pyplot as plt
import pandas as pd 
%matplotlib inline
import seaborn as sns

In [3]:
smarket = pd.read_csv('Smarket.csv', index_col = 0)

In [4]:
smarket

| | Year | Lag1 | Lag2 | Lag3 | Lag4 | Lag5 | Volume | Today | Direction |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 2001 | 0.381 | -0.192 | -2.624 | -1.055 | 5.010 | 1.19130 | 0.959 | Up |
| 2 | 2001 | 0.959 | 0.381 | -0.192 | -2.624 | -1.055 | 1.29650 | 1.032 | Up |
| 3 | 2001 | 1.032 | 0.959 | 0.381 | -0.192 | -2.624 | 1.41120 | -0.623 | Down |
| 4 | 2001 | -0.623 | 1.032 | 0.959 | 0.381 | -0.192 | 1.27600 | 0.614 | Up |
| 5 | 2001 | 0.614 | -0.623 | 1.032 | 0.959 | 0.381 | 1.20570 | 0.213 | Up |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1246 | 2005 | 0.422 | 0.252 | -0.024 | -0.584 | -0.285 | 1.88850 | 0.043 | Up |
| 1247 | 2005 | 0.043 | 0.422 | 0.252 | -0.024 | -0.584 | 1.28581 | -0.955 | Down |
| 1248 | 2005 | -0.955 | 0.043 | 0.422 | 0.252 | -0.024 | 1.54047 | 0.130 | Up |
| 1249 | 2005 | 0.130 | -0.955 | 0.043 | 0.422 | 0.252 | 1.42236 | -0.298 | Down |


Note that the `Year` column only has the year information.  In the case of this data, the sorted order tracks the days, so be sure to not accidentally shuffle it! 

&#9989; **<font color=red>Do this:</font>** Write a brief description of the data set. Which of the available variables are quantitative? Which are categorical? Draw some plots of the data and get a feel for what the columns mean. 

In [None]:
# Your code here #
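
If you're not sure where to start, here is a minimal sketch of one way to separate numeric columns from non-numeric ones. It uses a tiny made-up frame called `toy` (the values are invented and are not from `Smarket.csv`), so treat it as a pattern to adapt rather than an answer.

```python
import pandas as pd

# Hypothetical toy frame with the same kinds of columns as Smarket (values invented).
toy = pd.DataFrame({
    "Year": [2001, 2001, 2002, 2002],
    "Lag1": [0.4, -0.6, 1.0, 0.1],
    "Volume": [1.2, 1.3, 1.4, 1.3],
    "Direction": ["Up", "Down", "Up", "Up"],
})

# Columns stored as numbers vs. everything else (here, strings).
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
other_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(other_cols)

# For a label column, counting the values is usually more useful than a mean.
print(toy["Direction"].value_counts())
```

Keep in mind that a column stored as a number is not automatically a quantitative variable in the modeling sense, so you still need to think about what each column means.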

&#9989; **<font color=red>Do this:</font>** Take a look at the correlation matrix, which can be found with `dataframe.corr()`. Does it include all of the variables in the data set? What do you notice about the correlations?

*Hint: A great way to see what's going on with a matrix is to use the `plt.matshow` command.*

In [None]:
# Your code here #
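
One practical note: in recent versions of `pandas` (2.0 and later), calling `.corr()` on a frame that contains a string column raises an error unless you pass `numeric_only=True`. The sketch below shows that call on synthetic data; the frame `toy` and its columns `a`, `b`, `c`, `label` are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)

# Synthetic data, invented for this example only.
toy = pd.DataFrame({
    "a": x,
    "b": 2 * x + rng.normal(scale=0.1, size=n),  # built to track "a" closely
    "c": rng.normal(size=n),                      # independent noise
    "label": np.where(x > 0, "Up", "Down"),       # string column
})

# Restrict the correlation matrix to the numeric columns.
corr = toy.corr(numeric_only=True)
print(corr.round(2))
```

The resulting `corr` is itself a dataframe, so it can be handed to `plt.matshow` as suggested in the hint.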

&#9989; **<font color=red>Q:</font>** Do some further investigation on the high correlation value you saw in the previous step. Can you justify why that particular pair of variables has a high correlation? 

## Classification using Logistic Regression

Our goal is to predict `Direction`, a categorical variable taking as values the strings `Up` and `Down`.


For this module, we will largely use the tools from `sklearn`  for classification. One of the big perks of the `sklearn` module is that there is a great deal of uniformity in the classes. So once we have a handle on how to interact with one kind of classification tool, very minor tweaks in the code will allow for the use of a new model. In fact, many of the things we'll do today should look very similar in terms of the syntax to the linear regression lab from a few weeks ago. 

For our first try doing classification, we'll use `LogisticRegression` from the `sklearn.linear_model` module. I'm a huge fan of the `sklearn` documentation since it includes a great deal of info on the math behind what we're doing as well as explanations of the code:
- [`sklearn` mathematical description of logistic regression](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression)
- [`LogisticRegression` class documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression)

In [None]:
from sklearn.linear_model import LogisticRegression 

Let's first predict `Direction` using `Lag1`, `Lag2`, and `Volume`. 
Our first job is to extract the portion of the dataframe that we want to use. `sklearn` is happiest when we hand it an array. 

In [None]:
X = np.array(smarket[['Lag1','Lag2','Volume']])
Y = np.array(smarket.Direction)

In [None]:
print(X.shape)
print(Y.shape)
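
As a small illustration of why the two shapes differ, here is a sketch on a made-up four-row frame (the name `toy` and its values are invented). Selecting a list of columns gives a 2D array, while selecting a single column gives a 1D array.

```python
import pandas as pd

# Hypothetical four-row frame; values are invented.
toy = pd.DataFrame({
    "Lag1": [0.4, -0.6, 1.0, 0.1],
    "Lag2": [-0.2, 0.4, -0.6, 1.0],
    "Volume": [1.2, 1.3, 1.4, 1.3],
    "Direction": ["Up", "Down", "Up", "Up"],
})

# A list of column names keeps the result two-dimensional: (rows, features).
X_toy = toy[["Lag1", "Lag2", "Volume"]].to_numpy()

# A single column name gives a one-dimensional array of labels: (rows,).
y_toy = toy["Direction"].to_numpy()

print(X_toy.shape)
print(y_toy.shape)
```

Here `.to_numpy()` plays the same role as wrapping the selection in `np.array(...)`.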

Once we have our data, we create an instance of the model class we want, in this case `LogisticRegression`, and fit the model to the data. Note that setting `random_state=0` fixes the random seed, so rerunning the following cell will return the same answer every time. 

In [None]:
clf = LogisticRegression(random_state=0)
clf.fit(X,Y)


Great, that was easy! Once we've fit the model, the main task is to understand how to extract information from it. 

&#9989; **<font color=red>Do this:</font>** Extract the coefficients and intercept from the trained model. *(Note: You might need to take a look at the documentation to figure out how to do that.)* What is the equation, in terms of the variables used, that you are modeling? 

In [None]:
# Your code here
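
If the documentation feels like a lot, the sketch below shows the general pattern on synthetic data. The names `X_toy`, `y_toy`, and `toy_clf` are made up, and the data is random, so the numbers mean nothing for the stock market. It also checks by hand that, for a two-class problem, the fitted coefficients and intercept plugged into the logistic function reproduce the model's probability for the second entry of `classes_`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic two-feature data; the label depends mostly on the first feature.
X_toy = rng.normal(size=(300, 2))
y_toy = np.where(X_toy[:, 0] + 0.5 * rng.normal(size=300) > 0, "Up", "Down")

toy_clf = LogisticRegression(random_state=0).fit(X_toy, y_toy)

print(toy_clf.classes_)    # order of the class labels
print(toy_clf.coef_)       # one row of coefficients, one entry per feature
print(toy_clf.intercept_)  # a single intercept

# Rebuild the probability of classes_[1] from the fitted parameters.
z = X_toy @ toy_clf.coef_.ravel() + toy_clf.intercept_[0]
p_manual = 1 / (1 + np.exp(-z))
```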

While it's good to know what equation we're modeling with, the big perk here is that your `sklearn` class will evaluate the model on your data points for you. Yay!

&#9989; **<font color=red>Do this:</font>** Use the `predict_proba` function to determine the probabilities $Pr(Y = \texttt{Down} \mid X)$ for the data set. What shape is the output matrix? Why that shape? What do the columns represent?

In [None]:
# Your code here
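
One thing that trips people up is knowing which column of the output goes with which label. A minimal sketch on synthetic data (again with made-up names `X_toy`, `y_toy`, `toy_clf`) shows one way to look this up from the model's `classes_` attribute instead of guessing.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Synthetic data with three features; invented for this example.
X_toy = rng.normal(size=(100, 3))
y_toy = np.where(X_toy[:, 0] + rng.normal(size=100) > 0, "Up", "Down")

toy_clf = LogisticRegression(random_state=0).fit(X_toy, y_toy)
proba = toy_clf.predict_proba(X_toy)

print(proba.shape)
print(toy_clf.classes_)

# Find the column that belongs to the label "Down" and pull it out.
down_col = list(toy_clf.classes_).index("Down")
p_down = proba[:, down_col]
```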

Of course, this gives us the probability of each label for a given data point, but we really would like to have the prediction itself. 


&#9989; **<font color=red>Do this:</font>** Use the `predict` function to determine the predictions for each input data point in the original $X$ matrix and store the output as `Yhat`. How many predictions are different from the actual `Direction` value? What's the percent error for the model?

In [None]:
# Your code here
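
Counting mismatches between two label arrays is a one-liner with `numpy`. The sketch below uses two short hand-written arrays, `y_true` and `y_pred` (both invented), in place of `Y` and `Yhat`.

```python
import numpy as np

# Hand-written labels for illustration only.
y_true = np.array(["Up", "Down", "Up", "Up", "Down"])
y_pred = np.array(["Up", "Up", "Up", "Down", "Down"])

# Elementwise comparison gives a boolean array; True marks a wrong prediction.
wrong = y_true != y_pred

n_wrong = int(np.sum(wrong))   # number of mismatches
error_rate = np.mean(wrong)    # fraction of mismatches

print(n_wrong)           # 2
print(100 * error_rate)  # percent error
```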


&#9989; **<font color=red>Do this:</font>** An even easier way of figuring out the error rate is through the score. What does the output of `clf.score(X,Y)` mean and how is it related to the number you determined above?

In [None]:
# Your code here
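
To sanity check your interpretation, you can compare `score` against a number computed by hand. Here is a sketch on synthetic data (names `X_toy`, `y_toy`, `toy_clf` are made up) that compares it with the fraction of correct predictions and with `accuracy_score` from `sklearn.metrics`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(2)

# Synthetic data, invented for this example.
X_toy = rng.normal(size=(150, 2))
y_toy = np.where(X_toy[:, 1] + rng.normal(size=150) > 0, "Up", "Down")

toy_clf = LogisticRegression(random_state=0).fit(X_toy, y_toy)
y_hat = toy_clf.predict(X_toy)

score = toy_clf.score(X_toy, y_toy)
frac_correct = np.mean(y_hat == y_toy)
error_rate = np.mean(y_hat != y_toy)

print(score, frac_correct, error_rate)
```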

**Confusion matrix**

As we saw in class, the percent error is a rather limited way of evaluating the classification model. Luckily `sklearn` provides commands for computing the confusion matrix for a given model easily. The `confusion_matrix` command computes the confusion matrix, and `ConfusionMatrixDisplay` gives a nice visual representation. 

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay


In [None]:
# This code gives the confusion matrix, assuming you stored the predicted values as `Yhat`.
C = confusion_matrix(Y,Yhat)

C

In [None]:
# This code gives a visual representation, labeling the axes with the class names
ConfusionMatrixDisplay(C, display_labels=clf.classes_).plot()
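
Whatever convention is used for rows and columns, the diagonal of the confusion matrix counts the correct predictions, so it ties back to the accuracy. The sketch below checks this on two short hand-written label arrays (invented for illustration). It also passes the `labels` argument, which pins down the order of the classes explicitly.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hand-written labels for illustration only.
y_true = np.array(["Up", "Down", "Up", "Up", "Down", "Down"])
y_pred = np.array(["Up", "Up", "Up", "Down", "Down", "Down"])

# Fix the class order so the matrix entries are unambiguous.
labels = ["Down", "Up"]
C_toy = confusion_matrix(y_true, y_pred, labels=labels)
print(C_toy)

# Correct predictions sit on the diagonal.
accuracy_from_C = np.trace(C_toy) / C_toy.sum()
print(accuracy_from_C)
```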



&#9989; **<font color=red>Q:</font>** The makers of `sklearn` made a PARTICULARLY strange choice when it comes to the confusion matrix representation.  What is different about the `sklearn` confusion matrix from how we saw it in class?

*Your answer here*



-----
### Congratulations, we're done!


<a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/">Creative Commons Attribution-NonCommercial 4.0 International License</a>.