# Lecture 14: September 8th, 2023

__Today:__ 

* Give a few more examples with logistic regression
* Loss function for classification problems

__Reminders and Updates:__
* We will have lecture on Monday (our last lecture :( ). I plan to cover K-Means clustering.
* New token-earning opportunity on Canvas
* I will release new chances for ML outcomes from last week 
* Remaining ML outcomes will be posted by tonight (P20 is already live)
* Homework 7 and Homework 8 are due Monday!

__Final Project:__ 

You do not have to include all of our ML models in your final project! You are welcome to, but it is not a requirement. In lecture we've seen the following models:

* Linear regression
* Logistic regression
* K-Means clustering (Monday)

Including at least one of these models is expected for the final project, but you do not need to do all of them. If you are able to spend a good amount of time with just one of the above models, then I think that's good. If you end up finishing a question too fast, it might be nice to try another model.

## Recap of Lecture 13

Recall that on Wednesday, we used Logistic Regression with flipper length and bill length columns to predict whether or not a penguin belonged to the Chinstrap species. As a reminder, this is binary classification, because we only have two possible outputs.

In [4]:
import altair as alt
import pandas as pd
import seaborn as sns
import numpy as np

In [None]:
df = sns.load_dataset("penguins").dropna(axis=0)
df["isChinstrap"] = (df["species"] == "Chinstrap")

In [None]:
cols = ["flipper_length_mm", "bill_length_mm"]

In [None]:
c1 = alt.Chart(df).mark_point(filled=True,size=100).encode(
    x = alt.X("flipper_length_mm").scale(zero=False),
    y = alt.Y("bill_length_mm").scale(zero=False),
    color="species",
    shape="species",
    tooltip = ["species"]  # "pred" and "index" are not columns of df yet
).properties(title='actual species')
c1

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
clf = LogisticRegression()

In [None]:
clf.fit(df[cols], df["isChinstrap"])

In [None]:
df["pred"] = clf.predict(df[cols])

In today's lecture, I want to talk a little bit more about this same example.

In [None]:
df.loc[[18, 19, 20], cols + ["isChinstrap", "pred"]]

Unnamed: 0,flipper_length_mm,bill_length_mm,isChinstrap,pred
18,184.0,34.4,False,False
19,194.0,46.0,False,True
20,174.0,37.8,False,False


The Chinstrap penguins all appear in the same portion of the dataset, so we shouldn't be concerned that none of the penguins in the rows labeled 18, 19, and 20 are Chinstrap. Looking at the prediction column, what goes wrong?

<font color=red>__Notice__: The model has the wrong prediction for row label 19. </font>

One question we can ask: how confident was our model that the penguin in the row labeled 19 was really a Chinstrap?

To get the classes that our model predicts, we can use the `classes_` attribute of `clf`. The important thing here is the order in which the classes are returned. When we ask for probabilities, they will be returned in this same order.

In [None]:
clf.classes_

array([False,  True])

The first entry, False, means the penguin is not a Chinstrap. The second entry, True, means the penguin is a Chinstrap.

Now let's predict the probabilities for these three rows.

The raw output uses scientific notation, which is a little difficult to read, so right after it we change NumPy's print options to make things look a little better.

In [None]:
clf.predict_proba(df.loc[[18,19,20],cols])

array([[9.99771697e-01, 2.28303114e-04],
       [3.29559670e-01, 6.70440330e-01],
       [7.71818762e-01, 2.28181238e-01]])

In [None]:
# Print floats as plain decimals instead of scientific notation
np.set_printoptions(suppress=True)

In [None]:
clf.predict_proba(df.loc[[18,19,20],cols])

array([[0.9997717 , 0.0002283 ],
       [0.32955967, 0.67044033],
       [0.77181876, 0.22818124]])

In [None]:
clf.classes_

array([False,  True])

These are our classes. Notice, we could have replaced the boolean values with strings in our model, and it would have worked the same. We'll see this with multiclass logistic regression later in the lecture.

* False: "The penguin is not Chinstrap"
* True: "The penguin is Chinstrap"

In [None]:
df["isChinstrap"]

0      False
1      False
2      False
4      False
5      False
       ...  
338    False
340    False
341    False
342    False
343    False
Name: isChinstrap, Length: 333, dtype: bool

Let's try to interpret these probabilities for each row.

For the row labeled 18, the model is 99.97% sure that the penguin is not a Chinstrap. This is good, because if we look at the true data, the corresponding penguin is not a Chinstrap.

For the row labeled 19, the model is 33% sure that the penguin is not a Chinstrap (and thus 67% sure that it is a Chinstrap). Compare how confident the model was for row 18 versus row 19. This is what we want to see: the model was very confident where it was correct (row 18) and noticeably less confident where it was incorrect (row 19).
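To connect these probabilities back to the "pred" column, here is a minimal sketch of how a prediction can be read off from a row of probabilities: pick the column with the largest probability and look up the class at that position. The arrays are copied from the outputs above; the variable names (`probs`, `most_likely`, `pred_from_probs`) are just for illustration.

```python
import numpy as np

# Class order and probabilities copied from the outputs above.
classes = np.array([False, True])
probs = np.array([
    [0.9997717, 0.0002283],    # row 18
    [0.32955967, 0.67044033],  # row 19
    [0.77181876, 0.22818124],  # row 20
])

# Column index of the largest probability in each row.
most_likely = np.argmax(probs, axis=1)

# Look up the class label stored at that column index.
pred_from_probs = classes[most_likely]
print(pred_from_probs)
```

For rows 18, 19, and 20 this recovers False, True, False, which agrees with the "pred" column. This is also why the order of `classes_` matters: the position of a probability is the only thing that tells us which class it belongs to.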

In [None]:
df.loc[[18, 19, 20], cols + ["isChinstrap", "pred"]]

Unnamed: 0,flipper_length_mm,bill_length_mm,isChinstrap,pred
18,184.0,34.4,False,False
19,194.0,46.0,False,True
20,174.0,37.8,False,False


__Great question from the chat:__ at what point should we be concerned about the model being confidently incorrect?
__Answer:__ There is no single good answer. It's really going to depend on your specific problem and data (for example, does your data have some weird outliers?).

__Interpretation Question:__ As flipper length increases, is the penguin more or less likely to be a Chinstrap, according to our model? What about bill length?

![](sigmoid.png)

Recall:

* We are dealing with the sigmoid function: $\sigma(z) = \frac{1}{1 + e^{-z}}$
    * Its range is $(0,1)$, which makes it a good candidate for modeling probability.
    * As $z$ increases, $\sigma(z)$ increases as well (i.e. the sigmoid function is increasing); therefore, the larger the input, the closer the probability is to 1, which means the outcome is more likely to happen. (Similarly, a smaller input means the outcome is less likely to happen.)
* Recall, for logistic regression, the input $z$ to the sigmoid is the value $L(x)$ of the following linear function of the features $x$:
$$
L(x) = \text{bill coefficient} \cdot \text{bill length} + \text{flipper coefficient} \cdot \text{flipper length} + \text{intercept}
$$

* Putting everything together, because $L(x)$ is what we pass to the sigmoid function, if $L(x)$ is larger, that means there's a larger probability, and if $L(x)$ is smaller, that means there's a smaller probability.

So, to answer the question above, we should look at the coefficients of our model.

In [None]:
cols

['flipper_length_mm', 'bill_length_mm']

In [None]:
clf.coef_

array([[-0.34802208,  1.08405225]])

The coefficient for flipper length is negative. This means that if flipper length increases (with bill length held fixed), then $L(x)$ decreases, so the probability is smaller and the penguin is less likely to be a Chinstrap.

The coefficient for bill length, on the other hand, is positive! This means that if bill length increases (with flipper length held fixed), then so does $L(x)$, so the probability is larger and the penguin is more likely to be a Chinstrap.
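Here is a small sketch of that reasoning in code. The two coefficients are copied from `clf.coef_` above. The intercept is an assumption: it is an approximate value chosen so that the score for row 19 roughly matches the 67% shown earlier, and the fitted value would come from `clf.intercept_`. The function names are made up for this example.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: maps any real number into (0, 1)."""
    return 1 / (1 + np.exp(-z))

# Coefficients copied from clf.coef_ above, in the order [flipper, bill].
flipper_coef, bill_coef = -0.34802208, 1.08405225
# ASSUMPTION: approximate intercept for illustration only,
# not read from the fitted model.
intercept = 18.36

def prob_chinstrap(flipper_length, bill_length):
    """Pass the linear score L(x) through the sigmoid."""
    score = flipper_coef * flipper_length + bill_coef * bill_length + intercept
    return sigmoid(score)

base = prob_chinstrap(194.0, 46.0)            # measurements of row 19
longer_flipper = prob_chinstrap(199.0, 46.0)  # flipper 5 mm longer
longer_bill = prob_chinstrap(194.0, 47.0)     # bill 1 mm longer
print(base, longer_flipper, longer_bill)
```

Whatever the exact intercept is, the direction of the change is decided by the sign of the coefficient: the longer flipper lowers the probability and the longer bill raises it.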

## Multiclass Logistic Regression

Many of you have already started the homeworks for this week, which deal with multiclass logistic regression (this is where we predict more than 2 classes). Let's see an example of it in action together as a class.

* Drop the “pred” column from `df` (we will make a new one below) using the `drop` method and a suitable `axis` keyword argument.

In [None]:
df.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,isChinstrap,pred
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male,False,False
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female,False,False
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female,False,False
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female,False,False
5,Adelie,Torgersen,39.3,20.6,190.0,3650.0,Male,False,False


In [None]:
# errors="ignore" means that re-running this cell will not raise an error if "pred" has already been dropped
df = df.drop("pred", axis=1, errors="ignore")

In [None]:
df.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,isChinstrap
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male,False
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female,False
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female,False
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female,False
5,Adelie,Torgersen,39.3,20.6,190.0,3650.0,Male,False


* Fit a new logistic regression classifier, using the same input features, but this time using the “species” column as our target. (This is our first time seeing logistic regression with more than two classes. When we perform classification with two classes, it is called “Binary Classification”, and is often easier to explain.)

In [None]:
df.species.unique()

array(['Adelie', 'Chinstrap', 'Gentoo'], dtype=object)

Notice there are 3 species of penguins, so there will be 3 classes in our model. Even though there are now more than 2 classes, the procedure will be exactly the same as what we did for binary classification.

In [None]:
clf = LogisticRegression()

In [None]:
cols

['flipper_length_mm', 'bill_length_mm']

In [None]:
clf.fit(df[cols],df["species"])

One nice thing is that the model automatically reports its outputs using the values in the "species" column. In particular, these will be strings, which is a nice example of how our output doesn't need to be boolean or numeric.

* Check the `coef_` attribute. How does it relate to the `coef_` attribute we found above, where we were only considering Chinstrap penguins?

In [None]:
clf.classes_

array(['Adelie', 'Chinstrap', 'Gentoo'], dtype=object)

The order of the classes is extremely important. scikit-learn stores them in sorted order, which for strings means alphabetical order. Looking at the coefficients below, the first row corresponds to the coefficients for the Adelie species, the second row to the Chinstrap species, and the third row to the Gentoo species.
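As a quick illustration of this ordering, here is a sketch using `np.unique`, which returns the sorted distinct values of an array. I am assuming this mirrors how `classes_` ends up ordered; the label arrays below are made up for the example.

```python
import numpy as np

# Hypothetical labels, deliberately not in alphabetical order.
labels = np.array(["Gentoo", "Adelie", "Chinstrap", "Adelie"])

# np.unique returns the distinct values in sorted order.
sorted_classes = np.unique(labels)
print(sorted_classes)

# The same happens for booleans: False sorts before True.
sorted_bools = np.unique(np.array([True, False, True]))
print(sorted_bools)
```

The order in which the labels first appear in the data does not matter; only the sorted order does.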

In [None]:
clf.coef_

array([[-0.10544131, -0.59253116],
       [-0.2674632 ,  0.71305121],
       [ 0.37290451, -0.12052006]])

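Now we have 3 rows, where each row holds the coefficients for one penguin species, in the same order as `clf.classes_`. Within each row, the first entry is the coefficient for flipper length and the second is the coefficient for bill length, matching the order of `cols`.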

* Add a column “pred” to `df` containing the predicted values.

In [None]:
df["pred"] = clf.predict(df[cols])
df

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,isChinstrap,pred
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male,False,Adelie
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female,False,Adelie
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female,False,Adelie
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female,False,Adelie
5,Adelie,Torgersen,39.3,20.6,190.0,3650.0,Male,False,Adelie
...,...,...,...,...,...,...,...,...,...
338,Gentoo,Biscoe,47.2,13.7,214.0,4925.0,Female,False,Gentoo
340,Gentoo,Biscoe,46.8,14.3,215.0,4850.0,Female,False,Gentoo
341,Gentoo,Biscoe,50.4,15.7,222.0,5750.0,Male,False,Gentoo
342,Gentoo,Biscoe,45.2,14.8,212.0,5200.0,Female,False,Gentoo


Remember the model had made a mistake in row 19 when we were just doing binary classification. Let's see how it did for this multiclass classification.

In [None]:
df.loc[[19],:]

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,isChinstrap,pred
19,Adelie,Torgersen,46.0,21.5,194.0,4200.0,Male,False,Chinstrap


Let's look at the probabilities for this row.

In [None]:
clf.classes_

array(['Adelie', 'Chinstrap', 'Gentoo'], dtype=object)

In [None]:
clf.predict_proba(df.loc[[19],cols])

array([[0.08336009, 0.91516209, 0.00147781]])

Here, our model is 91% sure that the penguin is a Chinstrap. This is worse than in our binary classification, where it was only 67% sure: the penguin is actually an Adelie, so the model is now more confident in a wrong answer.

In [None]:
clf.predict_proba(df.loc[[18,19,20],cols])

array([[0.99998469, 0.00001469, 0.00000062],
       [0.08336009, 0.91516209, 0.00147781],
       [0.99375309, 0.00624688, 0.00000003]])

Our model is very confident that rows 18 and 20 are Adelie penguins (this is good, because in this case that is correct). It is less confident (about 91.5%) that row 19 is a Chinstrap, which is good, because it got this one wrong.
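To see where numbers like these can come from, here is a rough sketch. It assumes the multinomial formulation of logistic regression, in which each class gets its own linear score and the scores are converted into probabilities with the softmax function. The coefficient matrix is copied from `clf.coef_` above. The intercepts are an assumption: approximate values back-solved by hand from the row 19 probabilities, not read from `clf.intercept_`. The function and variable names are made up for this example.

```python
import numpy as np

def softmax(scores):
    """Turn each row of scores into probabilities that sum to 1."""
    # Subtracting the row maximum avoids overflow and does not change the result.
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

# Rows: Adelie, Chinstrap, Gentoo. Columns: flipper length, bill length.
coef = np.array([
    [-0.10544131, -0.59253116],
    [-0.2674632, 0.71305121],
    [0.37290451, -0.12052006],
])
# ASSUMPTION: approximate intercepts for illustration only.
intercept = np.array([48.26, 22.03, -70.29])

# Measurements of the penguin in row 19: flipper length, bill length.
X = np.array([[194.0, 46.0]])

scores = X @ coef.T + intercept  # one linear score per class
probs = softmax(scores)
print(probs)
```

With these assumed intercepts, the Chinstrap score is the largest, so Chinstrap gets the largest probability, in line with the output above.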

* Make an Altair scatter plot showing the predicted values.

In [None]:
cols

['flipper_length_mm', 'bill_length_mm']

In [None]:
# Copy the row labels into a column so the tooltip can display them
df["index"] = df.index
df.index

Int64Index([  0,   1,   2,   4,   5,   6,   7,  12,  13,  14,
            ...
            332, 333, 334, 335, 337, 338, 340, 341, 342, 343],
           dtype='int64', length=333)

In [None]:
c2 = alt.Chart(df).mark_point(filled=True,size=100).encode(
    x=alt.X(cols[0]).scale(zero=False),
    y=alt.Y(cols[1]).scale(zero=False),
    color="pred",
    shape="pred",
    tooltip=["species","pred","index"]
).properties(title="predicted species")
c2 

In [None]:
#Trying to find the row 19 penguin!
c3 = alt.Chart(df[df["species"]=="Adelie"]).mark_point(filled=True,size=100).encode(
    x=alt.X(cols[0]).scale(zero=False),
    y=alt.Y(cols[1]).scale(zero=False),
    color="pred",
    shape="pred",
    tooltip=["species","pred","index"]
).properties(title="predicted species")
c3

Let's now view this chart next to the actual species.

In [None]:
c1 | c2

It looks like our model did a pretty good job of predicting the species! The main difference is the predictions look a little more regular than the actual data, but this is to be expected.

It's a little difficult to see (because of the space between the points), but the decision boundaries are line segments, where on one side of each line segment we make one prediction, and on the other side we make a different prediction.

### Generating fake data to better see the decision boundaries

* Using `np.linspace`, make a NumPy array of 70 equally spaced x-coordinates and 70 equally spaced y-coordinates. Name these NumPy arrays `x` and `y`.

In [None]:
x = np.linspace(170,235,70)
y = np.linspace(30,60,70)

Notice, we selected these ranges to roughly follow the ranges in the real data (flipper length and bill length).

In [None]:
x

array([170.        , 170.94202899, 171.88405797, 172.82608696,
       173.76811594, 174.71014493, 175.65217391, 176.5942029 ,
       177.53623188, 178.47826087, 179.42028986, 180.36231884,
       181.30434783, 182.24637681, 183.1884058 , 184.13043478,
       185.07246377, 186.01449275, 186.95652174, 187.89855072,
       188.84057971, 189.7826087 , 190.72463768, 191.66666667,
       192.60869565, 193.55072464, 194.49275362, 195.43478261,
       196.37681159, 197.31884058, 198.26086957, 199.20289855,
       200.14492754, 201.08695652, 202.02898551, 202.97101449,
       203.91304348, 204.85507246, 205.79710145, 206.73913043,
       207.68115942, 208.62318841, 209.56521739, 210.50724638,
       211.44927536, 212.39130435, 213.33333333, 214.27536232,
       215.2173913 , 216.15942029, 217.10144928, 218.04347826,
       218.98550725, 219.92753623, 220.86956522, 221.8115942 ,
       222.75362319, 223.69565217, 224.63768116, 225.57971014,
       226.52173913, 227.46376812, 228.4057971 , 229.34

In [None]:
y

array([30.        , 30.43478261, 30.86956522, 31.30434783, 31.73913043,
       32.17391304, 32.60869565, 33.04347826, 33.47826087, 33.91304348,
       34.34782609, 34.7826087 , 35.2173913 , 35.65217391, 36.08695652,
       36.52173913, 36.95652174, 37.39130435, 37.82608696, 38.26086957,
       38.69565217, 39.13043478, 39.56521739, 40.        , 40.43478261,
       40.86956522, 41.30434783, 41.73913043, 42.17391304, 42.60869565,
       43.04347826, 43.47826087, 43.91304348, 44.34782609, 44.7826087 ,
       45.2173913 , 45.65217391, 46.08695652, 46.52173913, 46.95652174,
       47.39130435, 47.82608696, 48.26086957, 48.69565217, 49.13043478,
       49.56521739, 50.        , 50.43478261, 50.86956522, 51.30434783,
       51.73913043, 52.17391304, 52.60869565, 53.04347826, 53.47826087,
       53.91304348, 54.34782609, 54.7826087 , 55.2173913 , 55.65217391,
       56.08695652, 56.52173913, 56.95652174, 57.39130435, 57.82608696,
       58.26086957, 58.69565217, 59.13043478, 59.56521739, 60.  

I want to take the Cartesian product of `x` and `y` to get a grid of artificial data that I can plot and predict on. There are a few ways to get this: we could use `itertools.product` or the NumPy function `meshgrid`.

In [None]:
from itertools import product

In [None]:
product(x,y)

<itertools.product at 0x7f6193ae3380>

It's difficult to see what this is at first glance, but what if we convert it to a list?

In [None]:
list(product(x,y))

[(170.0, 30.0),
 (170.0, 30.434782608695652),
 (170.0, 30.869565217391305),
 (170.0, 31.304347826086957),
 (170.0, 31.73913043478261),
 (170.0, 32.17391304347826),
 (170.0, 32.608695652173914),
 (170.0, 33.04347826086956),
 (170.0, 33.47826086956522),
 (170.0, 33.91304347826087),
 (170.0, 34.34782608695652),
 (170.0, 34.78260869565217),
 (170.0, 35.21739130434783),
 (170.0, 35.65217391304348),
 (170.0, 36.08695652173913),
 (170.0, 36.52173913043478),
 (170.0, 36.95652173913044),
 (170.0, 37.391304347826086),
 (170.0, 37.82608695652174),
 (170.0, 38.26086956521739),
 (170.0, 38.69565217391305),
 (170.0, 39.130434782608695),
 (170.0, 39.565217391304344),
 (170.0, 40.0),
 (170.0, 40.434782608695656),
 (170.0, 40.869565217391305),
 (170.0, 41.30434782608695),
 (170.0, 41.73913043478261),
 (170.0, 42.17391304347826),
 (170.0, 42.608695652173914),
 (170.0, 43.04347826086956),
 (170.0, 43.47826086956522),
 (170.0, 43.91304347826087),
 (170.0, 44.34782608695652),
 (170.0, 44.78260869565217),
 

In [None]:
len(list(product(x,y)))

4900

* Make a DataFrame `df_art` (for “artificial”) containing all the possible pairs of coordinates from `x` and `y`. (We chose 70 above so `df_art` will have 4900 rows, which is a good length for Altair.)

In [None]:
df_art = pd.DataFrame(list(product(x,y)))
df_art.head()

Unnamed: 0,0,1
0,170.0,30.0
1,170.0,30.434783
2,170.0,30.869565
3,170.0,31.304348
4,170.0,31.73913


In [None]:
df_art.columns = cols
df_art.head()

Unnamed: 0,flipper_length_mm,bill_length_mm
0,170.0,30.0
1,170.0,30.434783
2,170.0,30.869565
3,170.0,31.304348
4,170.0,31.73913
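For comparison, here is a sketch of the `meshgrid` alternative mentioned earlier. It builds the same grid of pairs without `itertools`. The name `df_grid` is made up for this example, and `indexing="ij"` is chosen so that the rows come out in the same order as `product(x, y)`.

```python
from itertools import product

import numpy as np
import pandas as pd

x = np.linspace(170, 235, 70)
y = np.linspace(30, 60, 70)

# With indexing="ij", xx[i, j] == x[i] and yy[i, j] == y[j], so flattening
# visits the pairs in the same order as product(x, y).
xx, yy = np.meshgrid(x, y, indexing="ij")

df_grid = pd.DataFrame({
    "flipper_length_mm": xx.ravel(),
    "bill_length_mm": yy.ravel(),
})

# Build the itertools version and check that the two grids agree.
df_product = pd.DataFrame(list(product(x, y)), columns=df_grid.columns)
print(df_grid.equals(df_product))
```

Either approach is fine for 4900 points. `meshgrid` avoids building a Python list of tuples, which may matter for much larger grids.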


We've already fit our Logistic Regression model to the real data. Let's predict on this sample data.

* Add a corresponding “pred” column to `df_art`.

In [None]:
df_art["pred"] = clf.predict(df_art[cols])

In [None]:
df_art

Unnamed: 0,flipper_length_mm,bill_length_mm,pred
0,170.0,30.000000,Adelie
1,170.0,30.434783,Adelie
2,170.0,30.869565,Adelie
3,170.0,31.304348,Adelie
4,170.0,31.739130,Adelie
...,...,...,...
4895,235.0,58.260870,Gentoo
4896,235.0,58.695652,Gentoo
4897,235.0,59.130435,Gentoo
4898,235.0,59.565217,Gentoo


This is saying that if there were a penguin with a flipper length of 170 and a bill length of 30, the model would predict it to be an Adelie penguin. Note: we don't expect a real penguin to have any of these measurements exactly.

When we predicted on `df_art`, it was important that its column names matched the names of the columns of `df` that we used to fit the model. If they did not match, scikit-learn would complain (with a warning or an error, depending on the version).

* Make another Altair scatter plot of the predicted species, this time using `df_art`.

In [None]:
c4 = alt.Chart(df_art).mark_point(filled=True).encode(
    x=alt.X(cols[0]).scale(zero=False),
    y=alt.Y(cols[1]).scale(zero=False),
    color="pred",
    shape="pred"
).properties(title="artificial data")
c4

In [None]:
c2 | c4

Here we can really clearly see the decision boundaries! Notice that they are lines!!
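Here is a short argument for why the boundaries are lines, under the assumption that each class $k$ has its own linear score, with the predicted class being the one with the largest score. The symbols $a_k$, $b_k$, and $c_k$ are notation introduced just for this argument:

$$
L_k(x) = a_k \cdot \text{flipper length} + b_k \cdot \text{bill length} + c_k
$$

The boundary between two classes $j$ and $k$ is where their scores tie, $L_j(x) = L_k(x)$, which rearranges to

$$
(a_j - a_k) \cdot \text{flipper length} + (b_j - b_k) \cdot \text{bill length} + (c_j - c_k) = 0
$$

This is a linear equation in flipper length and bill length, so its solutions form a straight line in the plane. On one side of the line class $j$ has the larger score, and on the other side class $k$ does.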

## A loss function for classification

So far, the only way we've seen to measure the performance of our logistic regression model is accuracy (i.e. the proportion of predictions our model got correct). Accuracy is not a loss function. Think back to our examples from linear regression with mean squared error and mean absolute error. For loss functions, a smaller number is better, while for accuracy, a higher number is better.

__Key Takeaway:__ Accuracy is too rough of a measure. Here's an example:

Imagine our model outputs the following probabilities for [Adelie, Chinstrap, Gentoo] on two different penguins: [1,0,0] and [0.34, 0.33, 0.33]. In both cases, our model will predict that we have an Adelie penguin. But! In the first case, it thinks this is true with probability 1, while in the second case, it thinks all options are (basically) equally likely. There's a big difference between these two cases! Suppose the model got both of these correct. Just checking accuracy wouldn't let us distinguish between these two cases.

This is where we got to in lecture today, but I'll fill in the notes below so that you can read up on your own :)

A commonly used loss function for logistic regression is called __log loss__ (sometimes called __cross entropy__).

__Definition:__ Suppose we have $n$ observations $(X_1,y_1), \dots, (X_n,y_n)$. Suppose our model predicts with probability $\pi_{y_i}(X_i)$ that $y_i$ is the output corresponding to the input $X_i$. The corresponding log loss is as follows:

$$
\text{log loss} = \frac{1}{n}\sum_{i=1}^{n} - \log(\pi_{y_i}(X_i))
$$

This definition might be a little strange at first, but let me try to convince you that this is a good definition with some nice properties.

* Suppose our model predicted every value perfectly. Then each $\pi_{y_i}(X_i) = 1$, and since $\log(1) = 0$, the entire sum is zero. This is just like our performance measures being zero in the polynomial regression examples from earlier lectures.
* Suppose our model makes the worst possible prediction. This would be like saying $\pi_{y_i}(X_i)$ is zero, which means our logarithm is undefined. You can imagine the loss being infinite, or, in the implementation of scikit-learn, being a very large positive number.
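To make this concrete, here is a small sketch of the per-observation term $-\log(\pi_{y_i}(X_i))$, using the two probability lists from the accuracy example above and assuming the true species is Adelie in both cases. The variable names are made up for this example.

```python
import numpy as np

# Probabilities for [Adelie, Chinstrap, Gentoo] from the example above.
confident = [1.0, 0.0, 0.0]
unsure = [0.34, 0.33, 0.33]
true_index = 0  # position of Adelie, the assumed true species

# Each observation contributes -log(probability assigned to the true class).
loss_confident = -np.log(confident[true_index])
loss_unsure = -np.log(unsure[true_index])
print(loss_confident, loss_unsure)

# The contribution grows quickly as the true class gets less probability.
for p in [0.99, 0.5, 0.01]:
    print(p, -np.log(p))
```

Both predictions count as correct for accuracy, but the confident one contributes zero loss while the unsure one contributes about 1.08, so log loss does distinguish the two cases.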

Here is some example data. Say we have three possible outputs, the three classes of penguin, Adelie, Chinstrap, and Gentoo. Say we have three data points:

* $(X_1, \text{Adelie})$, with predicted probabilities [0.8, 0.1, 0.1] (that is, probability of Adelie 0.8, probability of Chinstrap 0.1, probability of Gentoo 0.1)

* $(X_2, \text{Gentoo})$, with predicted probabilities [0.1, 0.5, 0.4]

* $(X_3, \text{Adelie})$, with predicted probabilities [0.6, 0.3, 0.1]

In [3]:
pred_probs = [[0.8,0.1,0.1],[0.1,0.5,0.4],[0.6,0.3,0.1]]
n = 3

Let's compute the log loss in this case.

In [5]:
(1/n)*(-np.log(0.8)+ -np.log(0.4) + -np.log(0.6))

0.5500866356514518

Next we create a list which holds the true species of each penguin.

In [11]:
y_true = ["Adelie","Gentoo","Adelie"]

In practice, we'd never compute the log loss by hand. Here's how we use scikit-learn.

In [7]:
from sklearn.metrics import log_loss

In [8]:
#This will give an error! The Error message gives a nice hint...
log_loss(y_true,pred_probs)

ValueError: y_true and y_pred contain different number of classes 2, 3. Please provide the true labels explicitly through the labels argument. Classes found in y_true: ['Adelie' 'Gentoo']

Notice! `y_true` contains just two species, Adelie and Gentoo, while each row of probabilities has three entries, one per class. We can fix this as follows:

In [12]:
log_loss(y_true, pred_probs, labels=["Adelie","Chinstrap","Gentoo"])

0.5500866356514518

Notice this is the same result we got above! A few comments:
* The order in which we type the labels doesn't matter. scikit-learn will alphabetize them.
* The columns of `pred_probs` must appear in the order in which scikit-learn sorts the labels, that is, alphabetically.
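To see why the column order matters, here is a sketch of the same computation done by hand with NumPy. It is not scikit-learn's implementation, just the definition of log loss applied directly: each true label is mapped to its column in alphabetical order, and the probability in that column is used. The helper names (`column_of`, `true_cols`, `true_probs`) are made up for this example.

```python
import numpy as np

labels = ["Adelie", "Chinstrap", "Gentoo"]
pred_probs = np.array([
    [0.8, 0.1, 0.1],
    [0.1, 0.5, 0.4],
    [0.6, 0.3, 0.1],
])
y_true = ["Adelie", "Gentoo", "Adelie"]

# Map each label to its column, using alphabetical order.
column_of = {label: i for i, label in enumerate(sorted(labels))}
true_cols = [column_of[label] for label in y_true]

# For each row, pick the probability assigned to the true class.
true_probs = pred_probs[np.arange(len(y_true)), true_cols]

# Log loss is the average of -log(probability of the true class).
manual_log_loss = -np.log(true_probs).mean()
print(manual_log_loss)
```

The probabilities picked out are 0.8, 0.4, and 0.6, the same three numbers used in the by-hand computation earlier. If the columns of `pred_probs` were in a different order, different numbers would be picked and the loss would be wrong.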
