### Probabilities and the Sigmoid Function

One thing we could try for our binary classification is a linear regression model with a threshold. We could
look at all our data and try to draw a line through it that minimizes the sum of squares error. Then, we
could determine a threshold value (say .5) such that if our linear model is above that threshold, then we
predict a positive outcome, and if it’s below we predict a negative outcome.

This model might do... OK. At least when it comes to classification. However, there’s a real problem when
you want to interpret what the outcome means. If I’ve got a linear model that predicts how tall somebody
will be based on their parents’ height, and it comes back with 75 inches, that means their predicted height
is 75 inches. However, it’s not clear what an output of, say, .78 would mean in this case. Even worse, the
linear model can potentially have predicted values less than 0 or greater than 1. This is nuts! This would be a predicted value greater, or less, than the largest, or smallest, possible actual value. It’s very unclear what that should mean.

Well, what would we want the output value of our model to mean? These are all binary predictions, and
for almost any binary prediction, you’re probably interested in more than just a yes/no, true/false, 1/0
prediction. Most of the time, you want to not just know what the most likely outcome is, but you want to
have some idea of exactly how likely that outcome is. So, ideally, what we want from our logistic regression
model is a *probability*.

A probability is a number between 0 and 1, and so instead of a line, we’d want a function with a range from
0 to 1. A function that meets this requirement is the sigmoid, which is defined as:

$\displaystyle S(x) = \frac{1}{1+e^{-x}}$

Now, there are very good mathematical reasons why the sigmoid function is what we use for logistic regression, but to get into that would divert us too much from the main concepts we want to cover. So, we won’t formally derive why we use the sigmoid,
but let’s explore it a bit to see how its properties align with what we’d want for a binary prediction model.

Some of the nice properties of the sigmoid include:

- Its values are constrained between 0 and 1, so no matter how large (positive) the input to the function
is, the output is always less than 1, and no matter how small (negative) the input to the function is,
the output is always greater than 0. That’s something we want if we’re going to treat these outputs as
probabilities.

- The sigmoid function is increasing, which means that if something makes an outcome more likely,
even more of that something makes the outcome even more likely. This makes intuitive sense, and
aligns with what we’d want for a straightforward probability model.

- The sigmoid function is smooth, which mathematically means it’s “differentiable”, but more qualitatively means it doesn’t have any weird jumps or spikes in the probability, which is again what we’d want for a straightforward model.

- The sigmoid function has a region where the probabilities are very low, a region where the probabilities are very high, and a transition region where the probabilities are more in-between, which is what we’d expect intuitively for this type of classification problem.
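
To make these properties a bit more concrete, here's a minimal sketch that evaluates the sigmoid at a handful of inputs. The function name `sigmoid` and the particular inputs are just choices for this illustration.

```python
import math

def sigmoid(x: float) -> float:
    """Return S(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

# Illustrative inputs (an assumption of this sketch), from very negative to very positive.
inputs = [-10, -2, 0, 2, 10]
outputs = [sigmoid(x) for x in inputs]

for x, s in zip(inputs, outputs):
    print(f"S({x:>3}) = {s:.5f}")
```

The printed values stay strictly between 0 and 1, they go up as the input goes up, and most of the change happens in the transition region around 0.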

### The Logistic Regression Model

OK, so now we’ve seen the sigmoid function. How do we use it to make a model? Where are the parameters? Well, for a logistic regression model, you actually start with a linear model:

$c_{1}X_{1} + c_{2}X_{2} + \cdots + c_{n}X_{n} + b$

and then you plug this linear model into the sigmoid function, which gives us:
$\displaystyle \frac{1}{1 + e^{-(c_{1}X_{1}+c_{2}X_{2}+\cdots+c_{n}X_{n}+b)}}$

If we’ve only got one independent variable, like in our model predicting the probability of passing the final
exam based on the number of hours studied, we have a univariate logistic regression model, and the formula
for our model (rewriting $c_{1}$ as $m$, and $X_{1}$ as $X$) is:
$\displaystyle \frac{1}{1 + e^{-(mX+b)}}$
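
Here's a small sketch of that univariate model as a function. The name `pass_probability` and the values of `m` and `b` are made up for illustration; they are not fitted to any data.

```python
import math

def pass_probability(hours: float, m: float, b: float) -> float:
    """Univariate logistic model: 1 / (1 + e^(-(m * hours + b)))."""
    return 1.0 / (1.0 + math.exp(-(m * hours + b)))

# Hypothetical coefficients, chosen only to show the shape of the model.
m_example, b_example = 1.0, -2.0

for hours in [0, 1, 2, 3, 4]:
    p = pass_probability(hours, m_example, b_example)
    print(f"{hours} hours studied -> predicted probability of passing {p:.3f}")
```

With these made-up coefficients, the linear part $mX + b$ is 0 at 2 hours, so that's where the predicted probability crosses .5.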

Now, as in linear regression, the task is to solve for the model that does the best job fitting our data. In
linear regression, this was the model that minimized the sum of squares between our predictions and our
data. However, in logistic regression, this sum of squares approach doesn’t make much sense. So, what do
we do?

Before we get into the answer, let’s step back and look at a different question. Suppose we have a
biased coin, which if you flip ends up heads with probability $p$, and tails with probability $(1 − p)$, and it is
not necessarily the case that $p = .5$. Suppose you flip the coin 10 times, and you get the following sequence
of heads and tails:
$H, H, T, H, T, H, H, H, T, H.$

Now, for any value $0 < p < 1$, that sequence is technically possible. However, if the probability of heads
is $1\%$, so $p = .01$, then it’s very unlikely that you’d get 7 heads in 10 flips. On the other hand, if it were a
fair coin, so $p = .5$, then 7 heads in 10 flips is more likely. A natural question then is - for what value of
$p$ is the given sequence most likely? Formulating this mathematically, $p$ is the probability any given flip is
heads, $(1 − p)$ is the probability any given flip is tails, and the probability of any sequence of flips is just the
product of the probabilities of the individual flips, so:
$P(H, H, T, H, T, H, H, H, T, H) = pp(1 − p)p(1 − p)ppp(1 − p)p = p^{7}(1 − p)^{3}$

This is the probability that, with ten flips of the coin, you’d see our sequence of heads and tails, given the
chance of heads is $p$. We call this the likelihood of our sequence, and the value of $p$ that maximizes it is the
*maximum likelihood estimate*. In this example it’s a calculus exercise to show that the likelihood is maximized
at $p = .7$. In other words, if your coin shows up as heads $70\%$ of the time, your best guess is that its
probability of heads is $70\%$. Not exactly shocking.
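
For anyone who wants to see the calculus exercise, here's one way to work it using the product rule. The name $L(p)$ is just our label here for the likelihood above.

$\displaystyle L(p) = p^{7}(1-p)^{3}$

$\displaystyle \frac{dL}{dp} = 7p^{6}(1-p)^{3} - 3p^{7}(1-p)^{2} = p^{6}(1-p)^{2}\left(7(1-p) - 3p\right) = p^{6}(1-p)^{2}(7 - 10p)$

For $0 < p < 1$ the factor $p^{6}(1-p)^{2}$ is positive, so the derivative is positive when $p < .7$, zero at $p = .7$, and negative when $p > .7$. The likelihood climbs until $p = .7$ and falls afterward, which makes $p = .7$ the maximum.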

Returning to logistic regression, suppose we’re looking at the first problem we discussed, predicting
whether a student will pass the final exam based on the hours they study, and we have the following
four data points:

| Student   | Hours of Study | Exam grade |
| :---:     |    :----:   | :---: |
| Student 1 | 3       | F |
| Student 2 | 2       | P |
| Student 3 | 0       | F |
| Student 4 | 4       | P |


Our equation for the probability of passing is:
$\displaystyle S(X, m, b) = \frac{1}{1 + e^{-(mX+b)}}$,
where $X$ is the number of hours studied, and so the probability of the above four data points, the likelihood
of our data, will be given by:
$P(m, b) = (1 − S(3, m, b))S(2, m, b)(1 − S(0, m, b))S(4, m, b) = $

$\displaystyle \left(1 -\frac{1}{1+e^{-(3m+b)}}\right)\left(\frac{1}{1+e^{-(2m+b)}}\right)\left(1-\frac{1}{1+e^{-(0m+b)}}\right)\left(\frac{1}{1+e^{-(4m+b)}}\right)$

Again, with linear regression, our goal was to find the values of $m$ and $b$ that minimized the sum of squares
error. With logistic regression, our goal is to find the values of $m$ and $b$ that maximize the likelihood, which
for our four data points would be the function above. With more data points, it would be exactly the same
setup, just with more terms in our product.
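
As a sanity check, the likelihood for the four students can be written as a short function. This is a sketch, with the names `students`, `pass_probability`, and `likelihood` chosen for this example, and the trial values of `m` and `b` picked by hand rather than fitted.

```python
import math

# (hours studied, passed?) for the four students in the table above.
students = [(3, False), (2, True), (0, False), (4, True)]

def pass_probability(hours: float, m: float, b: float) -> float:
    """S(X, m, b) = 1 / (1 + e^(-(m * X + b)))."""
    return 1.0 / (1.0 + math.exp(-(m * hours + b)))

def likelihood(m: float, b: float) -> float:
    """Product over students of S for a pass and (1 - S) for a fail."""
    result = 1.0
    for hours, passed in students:
        p = pass_probability(hours, m, b)
        result *= p if passed else (1.0 - p)
    return result

# Two hand-picked (not fitted) coefficient pairs, just to compare.
print(likelihood(0.0, 0.0))   # every factor is .5
print(likelihood(0.8, -1.8))  # a trial pair that does better
```

With $m = 0$ and $b = 0$ the model says every student has a 50% chance of passing, so the likelihood is $.5^{4} = .0625$. The second pair gives a larger likelihood, so by our criterion it's the better model of the two.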

Generally speaking, finding the values of the coefficients (in this case m and b) that maximize this likelihood function is hard. In our biased coin example, finding the exact value of p that gave us the maximum likelihood was possible. Unfortunately, that’s very rarely the case. Usually there’s no easy way to exactly calculate the maximum likelihood. 

Note that this is a big difference between logistic and linear regression. With linear regression, there are exact, optimized formulas that will tell you the coefficients that give you the best fit line. Not so with logistic regression.

If we can’t solve our equation exactly, what do we do? We use numerical methods to get close to the
exact solution. Precisely how this is done introduces a very important concept in data science and machine
learning called gradient descent, which we'll discuss at the end of class today.

### The Confusion Matrix

Once we have a logistic regression model, or really any type of classifier, an obvious question to ask is how
well it works. Very few models are perfect, and almost every one will get some classifications wrong. A
confusion matrix is a tabular way to record the successes and failures of a model, and provides a quick
overview of the model’s performance.

The rows of a confusion matrix represent the actual category classifications for our data, while the columns
represent the category classifications from our model. The entries represent the number of data elements
with the given actual and model classifications. For example, suppose we have a binary logistic regression
model predicting whether a tumor is malignant or benign, and it has the following confusion matrix:

| Tumor Diagnosis | Malignant | Benign |
| :---:     |    :----:   | :---: |
| Malignant | 29      | 2 |
| Benign    | 4       | 66 |

The first row of data represents the tumors that were actually malignant, and the second row of data represents the tumors that were actually benign. The first column of data represents the tumors the model classified as malignant, while the second column of data represents the tumors the model classified as benign.

From this we can see that there were 33 tumors classified as malignant, and of these 33, 29 were actually
malignant, while 4 were incorrectly classified, and were in reality benign. When a model marks a negative
(in this case benign) as a positive (in this case malignant) this is called a false positive or a type I error. We
can also see there were 68 tumors classified as benign, and of these 68, 66 were actually benign, while 2
were incorrectly classified, and were in reality malignant. When a model marks a positive (in this case
malignant) as a negative (in this case benign) this is called a false negative or a type II error.
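
Here's a minimal sketch of how a confusion matrix like this gets tallied from a list of actual labels and a list of model labels. The two lists are rebuilt from the counts in the table above; the individual tumors and their order are an assumption made purely so there is something to count.

```python
from collections import Counter

# Rebuilt from the table: 31 actually malignant tumors, then 70 actually benign ones.
actual = ["malignant"] * 31 + ["benign"] * 70
predicted = (
    ["malignant"] * 29 + ["benign"] * 2      # model's calls on the malignant tumors
    + ["malignant"] * 4 + ["benign"] * 66    # model's calls on the benign tumors
)

# Count each (actual, predicted) pair.
counts = Counter(zip(actual, predicted))

labels = ["malignant", "benign"]
# Rows are actual classes, columns are the model's classes.
matrix = [[counts[(a, p)] for p in labels] for a in labels]

false_positives = counts[("benign", "malignant")]   # type I errors
false_negatives = counts[("malignant", "benign")]   # type II errors

print(matrix)
print("false positives:", false_positives, "false negatives:", false_negatives)
```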

Now, it’s not always the case that the cost associated with a false positive is the same as that associated
with a false negative. For example, if when the model says malignant the patient goes in for more tests,
then the cost of a false positive is relatively small compared to the cost of a false negative, which would
mean the patient doesn’t receive treatment when they should. In this case instead of setting the threshold
for a positive diagnosis at p > .5, you might set it at p > .05, which will significantly increase the number
of positive diagnoses, and increase the type I error (false positives) while decreasing the type II error (false
negatives). This might be fine given the nature of the problem. On the other hand, if when the model says
malignant the patient goes in for invasive surgery, then the cost of a false positive is significant, and we
wouldn’t bring the threshold down from p = .5 as much, if at all.
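
To see the threshold tradeoff in action, here's a toy sketch. The eight probabilities and labels below are invented for illustration and don't come from any real model.

```python
# (model's probability of malignant, actually malignant?) -- invented values.
cases = [
    (0.95, True), (0.60, True), (0.30, True), (0.08, True),
    (0.40, False), (0.10, False), (0.03, False), (0.01, False),
]

def count_errors(threshold: float) -> tuple[int, int]:
    """Return (false positives, false negatives) when positive means p > threshold."""
    false_pos = sum(1 for p, malignant in cases if p > threshold and not malignant)
    false_neg = sum(1 for p, malignant in cases if p <= threshold and malignant)
    return false_pos, false_neg

for threshold in (0.5, 0.05):
    fp, fn = count_errors(threshold)
    print(f"threshold {threshold}: {fp} false positives, {fn} false negatives")
```

On this made-up data, dropping the threshold from .5 to .05 trades the two false negatives for two false positives.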

### Accuracy, Precision, Recall, and Specificity

There are a few terms associated with confusion matrices that any practicing data scientist needs to know.
These are:

- Accuracy - This one is pretty straightforward. The accuracy of a model is how often it’s right, regardless of whether it’s right about a true positive, or a true negative. The formula for accuracy is:
$\text{Accuracy} = \displaystyle \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}$

- Precision - The precision is concerned with the positive. Specifically, it is the fraction of the time the
model’s positive result is accurate. Defined precisely, the formula for precision is:
$\text{Precision} = \displaystyle \frac{\text{True Positives}}{\text{Predicted Positives}} = \frac{\text{True Positives}}{\text{True Positives + False Positives}}$

- Recall - The recall is concerned with how often the model correctly identifies an actual positive. Its formula is:
$\text{Recall} = \displaystyle \frac{\text{True Positives}}{\text{Real Positives}} = \frac{\text{True Positives}}{\text{True Positives + False Negatives}}$

- Specificity - The specificity is concerned with how often the model correctly identifies an actual negative. Its formula is:
$\text{Specificity} = \displaystyle \frac{\text{True Negatives}}{\text{Real Negatives}} = \frac{\text{True Negatives}}{\text{True Negatives + False Positives}}$
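
As a quick worked example, here are the four measures computed for the tumor confusion matrix from the previous section, treating malignant as the positive class. The variable names are just choices for this sketch.

```python
# Counts from the tumor confusion matrix (malignant = positive).
true_pos, false_neg = 29, 2    # actually malignant
false_pos, true_neg = 4, 66    # actually benign

total = true_pos + false_neg + false_pos + true_neg

accuracy = (true_pos + true_neg) / total          # 95 / 101
precision = true_pos / (true_pos + false_pos)     # 29 / 33
recall = true_pos / (true_pos + false_neg)        # 29 / 31
specificity = true_neg / (true_neg + false_pos)   # 66 / 70

print(f"accuracy    {accuracy:.3f}")
print(f"precision   {precision:.3f}")
print(f"recall      {recall:.3f}")
print(f"specificity {specificity:.3f}")
```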

In a great model, ideally, all of these will be high. However, in reality, frequently tradeoffs must be made,
and which of these you choose to optimize will depend on the nature of your problem, and the different
costs for different types of mistakes. For example, in our tumor example, which of these do you think
would be most important?

Everything we’ve talked about so far can be applied to any binary classification problem, and it’s straightforward to generalize these ideas to non-binary classification as well. However, remember, one of the big ideas behind logistic regression was that it not only gave you a yes/no prediction, but it also gave you a probability. Is there a way of measuring how good those probability scores are? Yes, there is!

### The ROC Curve and AUC

The ROC curve (which stands for the receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. The curve plots two parameters, the true positive rate (TPR), which is the recall, on the y-axis, and the false positive rate (FPR), which is 1 minus the specificity, on the x-axis. When the threshold is p = 1, so as strict as it could be, then nothing will be classified as positive, and both the TPR and FPR will be 0. When the threshold is p = 0, so as relaxed as it could be, then everything will be classified as positive, and both the TPR and FPR will be 1. As we move from p = 1 down to p = 0, the TPR and FPR will grow at their own rates, and we can plot their values on a curve. This is the ROC curve.

<center>
    <img src="ROC.png" width="600">
</center>

We measure the performance of the model by looking at the area under the ROC curve. This performance
measure is called AUC, which just stands for "Area Under the Curve". This is a number between 0 and 1,
and would be 1 for a perfect model (one that gave every real positive probability 1, and every real negative
probability 0). For a random model, you’d get an AUC of about .5. A good AUC is usually between .8 and
.9.
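
Here's a rough sketch of how the ROC points and the AUC could be computed by hand. The scores and labels are invented for illustration, the function names are our own, and the area is approximated with the trapezoid rule, which is one common choice rather than the only one.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs as the threshold moves from strict to relaxed."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # strictest threshold: nothing is classified positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoid-rule area under a list of (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Invented model probabilities and true labels (1 = positive, 0 = negative).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

points = roc_points(scores, labels)
auc = area_under_curve(points)
print(points)
print("AUC:", auc)
```

On this toy data only one negative (the one scored 0.7) outranks any positives, and the AUC comes out to .875. A model that ranked every positive above every negative would get 1.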

## Gradient Descent

The function that we use for predictions in logistic regression is the sigmoid function:
$\displaystyle S(\textbf{X}) = \frac{1}{1 + e^{-(c_{1}X_{1}+c_{2}X_{2}+\cdots+c_{n}X_{n}+b)}}$

If we're given a set of input data and we want to find the best predictor, we don't try to minimize the sum of squares error as we did with linear regression. Instead, with logistic regression, we try to *maximize the likelihood*. In other words, we try to find the values of our coefficients such that the probability of our dataset is maximized.

Now, unfortunately, the exact values that achieve this aren't always easy to find. For the sigmoid function and logistic regression, it's not really possible to find a closed-form solution the way we can with linear regression. So, to get around this difficulty, we use *gradient descent*. 

Stated more mathematically, at any given point it's frequently straightforward to find the gradient (the vector of partial derivatives) of our function. This gradient tells us the direction of greatest increase at that point, and so if we're looking to maximize a function, we can take a step in that direction. If we're looking to minimize the function, we can take a step in the other direction.

For the sigmoid function, the partial derivative with respect to a coefficient is relatively easy to calculate:
$\displaystyle \frac{\partial S}{\partial c_{i}} = \frac{X_{i}e^{-(c_{1}X_{1}+c_{2}X_{2}+\cdots+c_{n}X_{n}+b)}}{\left(1 + e^{-(c_{1}X_{1}+c_{2}X_{2}+\cdots+c_{n}X_{n}+b)}\right)^{2}} = X_{i}\left(1-\frac{1}{1 + e^{-(c_{1}X_{1}+c_{2}X_{2}+\cdots+c_{n}X_{n}+b)}}\right)\left(\frac{1}{1 + e^{-(c_{1}X_{1}+c_{2}X_{2}+\cdots+c_{n}X_{n}+b)}}\right) = X_{i}(1-S)S$

Here $c_{0} = b$ and $X_{0} = 1$.
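
If you'd like to convince yourself of the $X_{i}(1-S)S$ formula, here's a small sketch that compares it against a numerical (finite difference) estimate of the same partial derivative. The coefficients and inputs are arbitrary values picked for the check.

```python
import math

def sigmoid_of_linear(coeffs, b, xs):
    """S = 1 / (1 + e^(-(c_1 X_1 + ... + c_n X_n + b)))."""
    z = sum(c * x for c, x in zip(coeffs, xs)) + b
    return 1.0 / (1.0 + math.exp(-z))

def partial_wrt_coefficient(coeffs, b, xs, i):
    """Analytic partial derivative of S with respect to coeffs[i]: X_i (1 - S) S."""
    s = sigmoid_of_linear(coeffs, b, xs)
    return xs[i] * (1.0 - s) * s

# Arbitrary values chosen for this check (not from any dataset).
coeffs, b, xs = [0.5, -1.0], 0.25, [2.0, 3.0]
i, h = 0, 1e-5

# Central difference: nudge coeffs[i] up and down by h and see how S changes.
up = coeffs.copy()
up[i] += h
down = coeffs.copy()
down[i] -= h
numeric = (sigmoid_of_linear(up, b, xs) - sigmoid_of_linear(down, b, xs)) / (2 * h)
analytic = partial_wrt_coefficient(coeffs, b, xs, i)

print("numeric :", numeric)
print("analytic:", analytic)
```

The two numbers should agree to many decimal places, which is a handy habit whenever you derive a gradient by hand.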

From this, we can fairly easily calculate the gradient at any point $\textbf{X}$. Now, an important hyperparameter here is the step size - how far you move in the direction of (or opposite) the gradient at each step. This is sometimes known as the *learning rate*, and its study is an important field within machine learning. For today, we'll assume the step size is constant, but please note that isn't always the case.
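
Here's a minimal sketch of the idea with a constant step size, using the four students from earlier. Two assumptions to flag: since we're maximizing, the steps go in the direction of the gradient (so strictly speaking this is gradient ascent), and the gradient is taken of the logarithm of the likelihood rather than the likelihood itself. The log has its maximum at the same coefficients, and its gradient works out to a simple sum of (actual minus predicted) terms. The step size and number of steps are hand-picked.

```python
import math

# (hours studied, passed?) for the four students; 1 = pass, 0 = fail.
students = [(3, 0), (2, 1), (0, 0), (4, 1)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def likelihood(m, b):
    """Probability of the four observed outcomes for slope m and intercept b."""
    result = 1.0
    for hours, passed in students:
        p = sigmoid(m * hours + b)
        result *= p if passed else 1.0 - p
    return result

def log_likelihood_gradient(m, b):
    """Gradient of log(likelihood) with respect to (m, b)."""
    grad_m = sum((passed - sigmoid(m * hours + b)) * hours for hours, passed in students)
    grad_b = sum(passed - sigmoid(m * hours + b) for hours, passed in students)
    return grad_m, grad_b

def gradient_ascent(step_size=0.05, steps=5000):
    """Repeatedly step in the direction of the gradient, starting from m = b = 0."""
    m, b = 0.0, 0.0
    for _ in range(steps):
        grad_m, grad_b = log_likelihood_gradient(m, b)
        m += step_size * grad_m
        b += step_size * grad_b
    return m, b

m_fit, b_fit = gradient_ascent()
print("m:", m_fit, "b:", b_fit)
print("likelihood at start:", likelihood(0.0, 0.0))
print("likelihood at end:  ", likelihood(m_fit, b_fit))
```

The likelihood at the end is higher than the .0625 we started with, and the gradient there is essentially zero, which is what we'd expect at a maximum.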

Something to note is that the number of terms in the product that makes up our likelihood is equal to the number of observations in our dataset. For a relatively small dataset this isn't a problem, but for a huge dataset this can make likelihood calculations rather involved, and iterative likelihood calculations extremely resource intensive.

#### Stochastic Gradient Descent

One way to get around this issue is with *stochastic gradient descent*. The idea behind stochastic gradient descent is that (in the extreme) we only see how our prediction works on one single observation, and then we adjust our parameters accordingly based upon our prediction for that observation. In other words, each observation nudges the parameters one way or the other, depending on how far off our prediction for it was.

This can make the optimization problem *much* less computationally intensive, although it does introduce some potential problems, and makes your optimization dependent on the order in which you evaluate the observations. A middle ground between pure gradient descent and extreme stochastic gradient descent is batched stochastic gradient descent, where we divide our data into disjoint groups, and run gradient descent over each group individually.
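
Here's a rough sketch of the extreme, one-observation-at-a-time version on the four students from earlier. As assumptions for this example: we step toward higher log-likelihood (so it's really stochastic gradient ascent), the observations are reshuffled on each pass with a fixed random seed, and the step size and number of passes are hand-picked.

```python
import math
import random

# (hours studied, passed?) for the four students; 1 = pass, 0 = fail.
students = [(3, 0), (2, 1), (0, 0), (4, 1)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def likelihood(m, b):
    """Probability of the four observed outcomes for slope m and intercept b."""
    result = 1.0
    for hours, passed in students:
        p = sigmoid(m * hours + b)
        result *= p if passed else 1.0 - p
    return result

def stochastic_gradient_ascent(data, step_size=0.02, passes=2000, seed=0):
    """Update (m, b) after every single observation instead of after the whole dataset."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    m, b = 0.0, 0.0
    order = list(data)
    for _ in range(passes):
        rng.shuffle(order)  # the visiting order affects the path we take
        for hours, passed in order:
            error = passed - sigmoid(m * hours + b)  # actual minus predicted
            m += step_size * error * hours
            b += step_size * error
    return m, b

m_sgd, b_sgd = stochastic_gradient_ascent(students)
print("m:", m_sgd, "b:", b_sgd)
print("likelihood:", likelihood(m_sgd, b_sgd))
```

Each update only touches one observation, so it's cheap, but the parameters keep jiggling a little around the best values instead of settling exactly on them.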