# Lecture 13: September 6th, 2023

__Updates:__ 

* Extended deadlines for Homework 7 and Homework 8; they are now due Monday.
* Project planning worksheet due tonight.
* There will be lecture on Monday of Week 6.

## Brief field trip to Lecture 12

The end of lecture 12 had some important concepts surrounding overfitting. I added in some charts after the class, and want to spend a few minutes at the start of this lecture going over them.

## Introduction to the MNIST Database

We have a YouTube video describing this database (and also solving the first two problems from Homework 7), but I'll spend a bit of lecture time today going over the most important points.

![](mnist5.png)

[Source: Medium](https://medium.com/comet-ml/real-time-numbers-recognition-mnist-on-an-iphone-with-coreml-from-a-to-z-283161441f90)

Imagine I asked you to identify the number above. It's pretty easy for a human to look at it and recognize it as the number 5...but how could you teach a computer to perform this same task? Suddenly, the task no longer seems so easy.

* The MNIST (Modified National Institute of Standards and Technology) database is a collection of 70,000 images of handwritten digits.

* Each image can be understood as a $28 \times 28$ grid of pixels, with values ranging between 0 (darkest) and 255 (lightest).

* Imagine we lay out the pixels all in one row; from this perspective, each image is a point in 784-dimensional space.
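To make the "one long row" idea concrete, here is a small sketch. It uses a made-up $28 \times 28$ array as a stand-in for a real MNIST image (the pixel values are random, not actual data), and the variable names are just for illustration.

```python
import numpy as np

# Stand-in for one MNIST image: random integers in 0..255 (not real data).
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(28, 28))

# Lay the 28 rows end to end to get a single vector of length 784.
flat = img.reshape(-1)
print(flat.shape)

# The pixel in row r, column c lands at position 28*r + c in the flat vector.
r, c = 3, 5
print(flat[28 * r + c] == img[r, c])
```

Each of the 784 entries is one coordinate, which is the sense in which an image is a point in 784-dimensional space.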

## Logistic Regression

__Remember:__ The point of polynomial regression was to predict a continuous value (think about predicting the price of a taxi ride based on the distance traveled).

* __Very Confusing:__ Despite its name, logistic regression is used for classification problems.
* Think of our handwritten digit examples from the MNIST database; recognizing handwritten digits is a classification problem. If this seems confusing, here are some points to consider:
    * If we're recognizing digits, we know there are only 10 possible outputs: zero through nine. Even though these numbers have an ordering, it isn't important for identifying the number. For example, labeling 0 with "The number zero" wouldn't make a difference.
    * Imagine I got 4.3 as the output of my model. Would this make sense if I'm trying to predict what digit it is? This type of output doesn't make sense for classification, but would make sense if I was trying to predict something with regression (like the price of a taxi ride).

### The Sigmoid Function

$$
\sigma(x) = \frac{e^x}{e^x + 1} = \frac{1}{1 + e^{-x}}
$$

The sigmoid function is very good for getting probabilities, because its output always lies strictly between 0 and 1. In logistic regression, we don't predict a class directly; instead, we predict the probability of being in that class. In today's lecture, we'll do a lot with binary classification, so we'll be predicting the probability of something being in a class or not in a class.

![](sigmoid.png)

[Source: Wikimedia Commons](https://commons.wikimedia.org/wiki/File:Logistic-curve.svg)
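As a quick sanity check on the formula above, the following sketch evaluates both forms of the sigmoid at a few inputs. The function names `sigmoid_a` and `sigmoid_b` are made up for this example.

```python
import math

def sigmoid_a(x):
    # First form from the formula: e^x / (e^x + 1)
    return math.exp(x) / (math.exp(x) + 1)

def sigmoid_b(x):
    # Second form: 1 / (1 + e^(-x))
    return 1 / (1 + math.exp(-x))

for x in [-5, -1, 0, 1, 5]:
    a, b = sigmoid_a(x), sigmoid_b(x)
    # Print the value, whether the two forms agree,
    # and whether sigmoid(-x) equals 1 - sigmoid(x).
    print(x, round(b, 4), math.isclose(a, b), math.isclose(sigmoid_b(-x), 1 - b))
```

The symmetry $\sigma(-x) = 1 - \sigma(x)$ is what lets a single number describe both "in the class" and "not in the class" for binary classification.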

### Penguins Example

* Using the penguins dataset from Seaborn, fit a logistic regression model to classify whether or not a penguin is in the Chinstrap species, using its flipper length and its bill length.

Basic idea: we'll ask "what's the probability of a penguin belonging to the Chinstrap species, given that we know its flipper length and its bill length". 

In [None]:
import numpy as np
import pandas as pd
import altair as alt 
import seaborn as sns

In [None]:
df = sns.load_dataset("penguins").dropna()

In [None]:
df.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female
5,Adelie,Torgersen,39.3,20.6,190.0,3650.0,Male


In [None]:
cols = ["flipper_length_mm","bill_length_mm"]

In [None]:
alt.Chart(df).mark_point(filled=True,size=80).encode(
    x = alt.X("flipper_length_mm").scale(zero=False),
    y = alt.Y("bill_length_mm").scale(zero=False),
    color="species",
    shape="species"
)

Just looking at the chart, it seems like there are pretty clear distinctions between the species, with a few outliers.

__Remember:__ for this example, we're doing binary classification. Our question is "Is the penguin Chinstrap, or not?"

In [None]:
(df["species"] == "Chinstrap").sum()

68

In [None]:
df["species"].shape

(333,)

In [None]:
df["species"]

0      Adelie
1      Adelie
2      Adelie
4      Adelie
5      Adelie
        ...  
338    Gentoo
340    Gentoo
341    Gentoo
342    Gentoo
343    Gentoo
Name: species, Length: 333, dtype: object

In our dataset, there are 333 penguins represented, and we know that 68 of them are Chinstrap.

So our two categories for logistic regression will be "Chinstrap" and "Other". One benefit of binary classification is that the coefficients are easier to interpret.

Next, we add a column which is True if a penguin belongs to the Chinstrap species, and is False otherwise.

In [None]:
df["is_chinstrap"] = (df["species"] == "Chinstrap")
df.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,is_chinstrap
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male,False
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female,False
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female,False
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female,False
5,Adelie,Torgersen,39.3,20.6,190.0,3650.0,Male,False


At this point, we're still preparing our data for logistic regression. Next, we'll split into a training set and a test set.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
cols

['flipper_length_mm', 'bill_length_mm']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df[cols],df["is_chinstrap"],test_size=0.2,random_state=0)

In [None]:
X_test

Unnamed: 0,flipper_length_mm,bill_length_mm
62,185.0,37.6
60,185.0,35.7
283,231.0,54.3
107,190.0,38.2
65,192.0,41.6
...,...,...
122,176.0,40.2
298,215.0,45.2
22,189.0,35.9
151,201.0,41.5


In [None]:
y_test

62     False
60     False
283    False
107    False
65     False
       ...  
122    False
298    False
22     False
151    False
252    False
Name: is_chinstrap, Length: 67, dtype: bool

Notice the indices for `X_test` and `y_test` are the same!

We're finally ready for logistic regression! We follow our usual workflow of import > instantiate > fit > predict.

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
#clf to remind us of classification
clf = LogisticRegression()

In [None]:
clf.fit(X_train,y_train)

In [None]:
clf.predict(X_test)

array([False, False, False, False, False, False,  True, False, False,
        True, False, False, False,  True, False, False, False, False,
        True, False, False, False, False, False, False, False, False,
       False, False, False, False,  True, False, False, False, False,
        True, False,  True, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False,  True,
       False, False, False, False])

In [None]:
#predictions on X_test
clf.predict(X_test).shape

(67,)

In [None]:
#Actual values
y_test

62     False
60     False
283    False
107    False
65     False
       ...  
122    False
298    False
22     False
151    False
252    False
Name: is_chinstrap, Length: 67, dtype: bool

Now we want to ask: how well did our model do?

In [None]:
(y_test == clf.predict(X_test)).mean()

0.9402985074626866

This is saying that we got 94.03% accuracy. Not too bad, huh?

Instead of computing the accuracy by hand, there's a much faster way to get this number using the `score` method.

In [None]:
clf.score(X_test,y_test)

0.9402985074626866

This is saying that about 94% of the time, our model was able to correctly predict whether or not a penguin belonged to the chinstrap species.
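Here is a small sketch of why the "by hand" computation works, using made-up labels rather than the penguin data. It also puts the 94% in context: that score equals 63 correct out of 67, and, as a point of comparison that was not computed in lecture, a model that always answers "not Chinstrap" would already be right for 265 of the 333 penguins in the full dataset.

```python
import numpy as np

# Toy labels and predictions (made up for illustration, not the penguin data).
y_true = np.array([False, False, True, True, False])
y_pred = np.array([False, True, True, True, False])

# Comparing gives booleans; True counts as 1 and False as 0,
# so the mean is the fraction of entries that match.
accuracy = (y_true == y_pred).mean()
print(accuracy)  # 4 of the 5 entries match

# The test accuracy reported above equals 63 correct out of 67.
print(63 / 67)

# Baseline on the full dataset: always predict "not Chinstrap".
n_total, n_chinstrap = 333, 68
baseline = (n_total - n_chinstrap) / n_total
print(baseline)
```

The baseline is computed on all 333 penguins, not on the 67 test rows, so treat it as a rough reference point only.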

Accuracy on the test set is pretty high, so we don't need to worry too much about overfitting. But, just to be safe, let's check how the model does on the train set to see if it does much better.

In [None]:
clf.score(X_train,y_train)

0.9624060150375939

Great question from the chat: in general, including more training data tends to improve accuracy on new data, although it is not guaranteed.

### Interpreting Coefficients

We're still dealing with the same question from before: what is the probability a penguin is Chinstrap, given we know its bill length and flipper length.

The "learning" in "machine learning" for logistic regression comes down to finding the following coefficients and intercept (bias).

In [None]:
clf.coef_

array([[-0.38263519,  1.17201067]])

In [None]:
flip_coef, bill_coef = clf.coef_[0]
clf.coef_[0]

array([-0.38263519,  1.17201067])

In [None]:
cols

['flipper_length_mm', 'bill_length_mm']

In [None]:
flip_coef

-0.3826351932141673

In [None]:
bill_coef

1.1720106733426894

In [None]:
clf.intercept_

array([20.93321887])
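One hedged way to read these numbers (a standard fact about logistic regression, though not something we covered in lecture): the linear combination is the log of the odds $p/(1-p)$, so raising one feature by 1 mm while holding the other fixed multiplies the odds by $e^{\text{coefficient}}$. The values below are copied from the outputs above; the names `bill_odds_factor` and `flip_odds_factor` are made up for this sketch.

```python
import math

# Coefficients copied from the fitted model's output above.
flip_coef = -0.3826351932141673
bill_coef = 1.1720106733426894

# odds = p / (1 - p). The log of the odds is the linear part of the model,
# so adding 1 mm to a feature multiplies the odds by exp(coefficient).
bill_odds_factor = math.exp(bill_coef)
flip_odds_factor = math.exp(flip_coef)

print(bill_odds_factor)  # greater than 1: longer bill, higher odds of Chinstrap
print(flip_odds_factor)  # less than 1: longer flipper, lower odds of Chinstrap
```

The signs alone already tell part of the story: a positive coefficient pushes the probability up, and a negative one pushes it down.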

We already saw that the sigmoid function is very natural for modeling probability.

In [None]:
sigmoid = lambda x: 1/(1 + np.exp(-x))

In [None]:
sigmoid(0)

0.5

In [None]:
sigmoid(10)

0.9999546021312976

The larger the input, the closer the function is to 1.

__Motivating Question:__ What does our model predict if the flipper has length 200mm and the bill has length 50mm?

In [None]:
flip = 200
bill = 50

Here is the value that we want to put into the sigmoid function. It is a linear function of the inputs, which is why logistic regression is considered a linear model.

In [None]:
flip_coef*flip + bill_coef*bill + clf.intercept_

array([3.00671389])

Now, we put this number into the sigmoid function.

In [None]:
sigmoid(flip_coef*flip + bill_coef*bill + clf.intercept_)

array([0.95287652])

Our model assigns about a 95% chance that the penguin is Chinstrap.

Do we get the same thing using `clf.predict`?

In [None]:
#this will give an error
clf.predict([flip,bill])



ValueError: Expected 2D array, got 1D array instead:
array=[200  50].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

In [None]:
#easy fix
clf.predict([[flip,bill]])



array([ True])
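The error message suggests `reshape(1, -1)` for a single sample. The following sketch, using only NumPy, shows that this produces the same 2D layout as the nested list in the fix above. The names `one_penguin` and `as_row` are made up for the example.

```python
import numpy as np

flip, bill = 200, 50

one_penguin = np.array([flip, bill])
print(one_penguin.shape)   # 1D: this is the shape scikit-learn rejected

# reshape(1, -1): one row (one sample), as many columns as needed (features).
as_row = one_penguin.reshape(1, -1)
print(as_row.shape)

# Writing the nested list directly gives the same 2D layout.
print(np.array_equal(as_row, np.array([[flip, bill]])))
```

The general rule is that rows are samples and columns are features, even when there is only one sample.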

Here, our model predicts that the hypothetical penguin belongs to the Chinstrap species. What if we actually wanted to see the probability?

In [None]:
clf.predict_proba([[flip,bill]])



array([[0.04712348, 0.95287652]])

Here's how to interpret these results:
* With 4.7% chance, the penguin is _not_ Chinstrap
* With about 95.3% chance, the penguin _is_ Chinstrap 
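
The two columns can be reproduced by hand for several penguins at once. This is a minimal sketch, assuming the coefficient and intercept values printed earlier; `proba_by_hand` is a hypothetical helper written for this example, not part of scikit-learn, and the second row of `X` is a made-up penguin.

```python
import numpy as np

# Values copied from clf.coef_ and clf.intercept_ above.
coef = np.array([-0.3826351932141673, 1.1720106733426894])
intercept = 20.93321887

def proba_by_hand(X):
    """Return rows of [P(not Chinstrap), P(Chinstrap)] for each row of X."""
    z = X @ coef + intercept        # linear part, one value per penguin
    p = 1 / (1 + np.exp(-z))        # sigmoid turns it into a probability
    return np.column_stack([1 - p, p])

# Columns: flipper_length_mm, bill_length_mm. The second row is hypothetical.
X = np.array([[200.0, 50.0],
              [200.0, 45.0]])
P = proba_by_hand(X)
print(P)
```

The first row should agree with the `predict_proba` output above up to rounding, and each row sums to 1 because the two columns are $1 - p$ and $p$.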

### Predicting if a penguin is in the chinstrap species (decision boundaries)

* Same setup as before, now we answer the following question.
* Using the model, describe all flipper lengths and bill lengths for which our model thinks there is an 80% chance the penguin is Chinstrap. Give your answer as a formula for bill length in terms of flipper length.

Next, I'll redefine some variables, just to keep things consistent with my notes.

In [None]:
cols

['flipper_length_mm', 'bill_length_mm']

In [None]:
fcoef, bcoef = clf.coef_[0]

For a given flipper length, what value of bill length gives 80% confidence? This comes down to solving the following equation.

$$
0.8 = \frac{1}{1 + \exp(-(\text{intercept} + \text{fcoef}*\text{flength}+ \text{bcoef}*\text{blength}))}
$$
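
Here is one way to carry out the algebra, writing $z$ for the linear part (a symbol introduced here only for brevity). The same steps work with any target probability strictly between 0 and 1 in place of 0.8.

$$
\begin{aligned}
0.8 &= \frac{1}{1 + \exp(-z)}, \qquad z = \text{intercept} + \text{fcoef}*\text{flength} + \text{bcoef}*\text{blength} \\
1 + \exp(-z) &= \frac{1}{0.8} \\
-z &= \log\left(\frac{1}{0.8} - 1\right) \\
\text{bcoef}*\text{blength} &= -\log\left(\frac{1}{0.8} - 1\right) - \text{intercept} - \text{fcoef}*\text{flength} \\
\text{blength} &= \frac{1}{\text{bcoef}}\left(-\log\left(\frac{1}{0.8} - 1\right) - \text{intercept} - \text{fcoef}*\text{flength}\right)
\end{aligned}
$$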

Now, let's write a function that solves for bill length in terms of flipper length.

In [None]:
bill80 = lambda flength: 1/(bcoef)*((-1)*np.log((1/0.8)-1)-clf.intercept_[0] - fcoef*flength)

I'm now going to define a function `bill50` similarly. 

In [None]:
bill50 = lambda flength: 1/(bcoef)*((-1)*np.log((1/0.5)-1)-clf.intercept_[0] - fcoef*flength)

Now, let's test out our function! Remember, as input, we pass a flipper length, and as output we get the bill length that leads to a certain amount of confidence.

In [None]:
bill80(200)

48.617402067720555

Interpretation: This is saying that if we have a penguin with a flipper length of 200mm, then a bill length of 48.6mm leads our model to have 80% confidence that the penguin is a Chinstrap penguin.
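
A quick way to double-check this interpretation is to plug the bill length back into the model and confirm that the probability comes out as 0.8. This is a sketch, assuming the coefficient and intercept values printed earlier; `bill_for_prob` and `chinstrap_prob` are hypothetical helpers that generalize `bill80` and `bill50` to any probability.

```python
import math

# Values copied from the fitted model's output earlier in these notes.
fcoef = -0.3826351932141673
bcoef = 1.1720106733426894
intercept = 20.93321887

def bill_for_prob(p, flength):
    """Bill length at which the model's Chinstrap probability equals p."""
    return (-math.log(1 / p - 1) - intercept - fcoef * flength) / bcoef

def chinstrap_prob(flength, blength):
    """Model probability of Chinstrap: sigmoid of the linear part."""
    z = intercept + fcoef * flength + bcoef * blength
    return 1 / (1 + math.exp(-z))

b80 = bill_for_prob(0.8, 200)
print(b80)                       # about 48.6, in line with bill80(200)
print(chinstrap_prob(200, b80))  # back to 0.8, up to floating-point rounding
```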

Next, we add 80% and 50% values to our DataFrame.

In [None]:
#add a column representing boundary for 80% confidence
df["bdry80"] = df["flipper_length_mm"].map(bill80)

In [None]:
df["bdry50"] = df["flipper_length_mm"].map(bill50)

In [None]:
df["pred"] = clf.predict(df[cols])

In [None]:
df.sample(10)

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,is_chinstrap,bdry80,bdry50
174,Chinstrap,Dream,43.2,16.6,187.0,2900.0,Female,True,44.373194,43.190359
328,Gentoo,Biscoe,43.3,14.0,208.0,4575.0,Female,False,51.229223,50.046388
201,Chinstrap,Dream,49.8,17.3,198.0,3675.0,Female,True,47.964447,46.781613
73,Adelie,Torgersen,45.8,18.9,197.0,4150.0,Male,False,47.637969,46.455135
102,Adelie,Biscoe,37.7,16.0,183.0,3075.0,Female,False,43.067283,41.884449
18,Adelie,Torgersen,34.4,18.4,184.0,3325.0,Female,False,43.393761,42.210927
37,Adelie,Dream,42.2,18.5,180.0,3550.0,Female,False,42.087851,40.905016
88,Adelie,Dream,38.3,19.2,189.0,3950.0,Male,False,45.026149,43.843315
264,Gentoo,Biscoe,50.5,15.9,222.0,5550.0,Male,False,55.799909,54.617074
151,Adelie,Dream,41.5,18.5,201.0,4000.0,Male,False,48.94388,47.761045


Row 174 actually looks like a really interesting edge case. Let's come back to it when we have our plots.

Now, we'll start a series of plots that will pull everything together.

In [None]:
#same base chart from before
c = alt.Chart(df).mark_point(filled=True,size=80).encode(
    x=alt.X("flipper_length_mm").scale(zero=False),
    y=alt.Y("bill_length_mm").scale(zero=False),
    color="pred",
    shape="species"
)
c

In [None]:
#Just a line right now, nothing too special until we see the charts all together
c80 = alt.Chart(df).mark_line(color="red").encode(
    x=alt.X("flipper_length_mm").scale(zero=False),
    y=alt.Y("bdry80").scale(zero=False)
)
c80

In [None]:
c50 = alt.Chart(df).mark_line(color="black").encode(
    x=alt.X("flipper_length_mm").scale(zero=False),
    y=alt.Y("bdry50").scale(zero=False)
)
c50

In [None]:
c + c80 + c50

Note: At the end of lecture we updated color to show the predictions! Notice that the color is showing us where the decision boundary is.

The cool thing about this picture is that it tells us how the model makes decisions.

Observations:
* Recall the red (top) line marks, for each flipper length, the bill length at which our model has 80% confidence. So, any penguin above this line has > 80% chance of being a Chinstrap penguin according to our model.
* Between the two lines, there is between a 50% and 80% chance of being a Chinstrap penguin.

* The black (bottom) line is very important! It is called the decision boundary. Recall that it represents 50% confidence that a penguin belongs to the Chinstrap species. Anything above the bottom line gets classified by our model as being a Chinstrap penguin, while everything below it gets classified as other.

* Notice that the decision boundary is a line! This is part of the reason why logistic regression is considered a linear model (remember, we're finding coefficients of a linear function, even if the probability function itself is not linear).
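
The observations above can be checked numerically. This is a sketch, assuming the coefficient and intercept values printed earlier; `slope`, `offset`, and `gap` are names made up for this example. It also revisits row 174 from the random sample shown earlier (flipper 187.0 mm, bill 43.2 mm).

```python
import math

# Values copied from the fitted model's output earlier in these notes.
fcoef = -0.3826351932141673
bcoef = 1.1720106733426894
intercept = 20.93321887

# Setting the linear part equal to log(p / (1 - p)) and solving for bill length
# gives a line: bill = slope * flipper + offset(p).
slope = -fcoef / bcoef

def offset(p):
    return (math.log(p / (1 - p)) - intercept) / bcoef

# The slope does not depend on p, so the 80% and 50% lines are parallel
# and differ only by a constant vertical gap.
gap = offset(0.8) - offset(0.5)
print(slope, gap)

# Row 174 from the sample above: flipper 187.0 mm, bill 43.2 mm.
z = intercept + fcoef * 187.0 + bcoef * 43.2
p174 = 1 / (1 + math.exp(-z))
print(p174)  # just over 0.5: barely on the Chinstrap side of the boundary
```

The gap matches the difference between the `bdry80` and `bdry50` columns in the sample table, and row 174 sits almost exactly on the decision boundary, which is what makes it an edge case.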
