# Tutorial 10 - Logistic Regression

*Written and revised by Jozsef Arato, Mengfan Zhang, Dominik Pegler*  
Computational Cognition Course, University of Vienna  
https://github.com/univiemops/tewa1-computational-cognition

---
**This tutorial will cover:**

How to

1. make predictions with a logistic regression model and visualize them
2. understand how the intercept and slope affect the sigmoid
3. simulate binary data based on the predicted sigmoid probabilities
4. fit a logistic regression model to the simulated data
5. compare the parameters of the fitted model to the parameters that were used to generate the data
6. understand the logistic regression likelihood and score

---

## Import libraries

In [None]:
import matplotlib.pyplot as plt
import numpy as np

## Simulating data with a logistic regression model

Scenario: time spent studying predicts success on a pass-or-fail exam:
p(Success)  🕔 🧑

### Sigmoid
The first step for this simulation is to implement the sigmoid function.
The function should take x as input and return:

1/(1+e<sup>-x</sup>)

Use np.exp.




In [None]:
def my_sigmoid(x):
    return 1 / (1 + np.exp(-x))
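
A quick sanity check can help before plotting. The sketch below assumes SciPy is installed and compares `my_sigmoid` with `scipy.special.expit`, SciPy's built-in sigmoid. It also checks two properties of the curve: the value at zero and the symmetry around it. The array `z` is only an illustrative input.

```python
import numpy as np
from scipy import special


def my_sigmoid(x):
    return 1 / (1 + np.exp(-x))


z = np.linspace(-5, 5, 11)  # illustrative test inputs

# the sigmoid crosses 0.5 exactly at x = 0
print(my_sigmoid(0))

# our implementation should agree with SciPy's expit
print(np.allclose(my_sigmoid(z), special.expit(z)))

# symmetry: sigmoid(-x) = 1 - sigmoid(x)
print(np.allclose(my_sigmoid(-z), 1 - my_sigmoid(z)))
```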

Test this function with x values from -5 to 5 and plot the result.

In [None]:
x = np.linspace(-5, 5, 20)
plt.plot(x, my_sigmoid(x))
plt.xlabel("x", fontsize=15)
plt.ylabel("Sigmoid(x)", fontsize=15)

To implement the exam scenario, we assume that students study between 0 and 50 hours for the exam.

However, let's make data that is centered at zero, so that students study -25 to +25 hours relative to the average.

In [None]:
n = 100  # number of data points
x = np.linspace(-25, 25, n)
print(x)

The next step is to see what the sigmoid predicts for these hours of studying.

In [None]:
plt.plot(x, my_sigmoid(x))

plt.xlabel("Relative hours studied", fontsize=14)
plt.ylabel("P(Success)", fontsize=14)


Clearly, the above is not sufficient: the curve jumps from 0 to 1 within a few hours around zero, and we have no way to adjust it.

We have not yet implemented the logistic regression model:

### Linear equation inside the sigmoid


p(Y)=sigmoid(b<sub>0</sub>+b<sub>1</sub>*X)

Define two parameters for the intercept and the slope, and use them with my_sigmoid to test how they change the model predictions.

Visualize the result as above.


In [None]:
# sigmoid prediction for a single combination
b0 = -0.1  # intercept
b1 = 1.5  # slope
lin_pred = b0 + b1 * x
p_y = my_sigmoid(lin_pred)
plt.plot(x, p_y)
plt.xlabel("Relative hours studied", fontsize=14)
plt.ylabel("P(Success)", fontsize=14)
plt.ylim([0, 1])

In [None]:
plt.plot(x, my_sigmoid(b0 + b1 * x))
plt.xlabel("Relative hours studied", fontsize=14)
plt.ylabel("P(Success)", fontsize=14)
plt.ylim([0, 1])
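
One way to read the two parameters: the curve reaches P(Success) = 0.5 where the linear predictor b0 + b1 * x equals zero, that is at x = -b0 / b1. The sketch below checks this for the values used above. The names `x_mid` and `p_mid` are illustrative and not part of the tutorial code.

```python
import numpy as np


def my_sigmoid(x):
    return 1 / (1 + np.exp(-x))


b0 = -0.1  # intercept
b1 = 1.5  # slope

# the linear predictor b0 + b1 * x is zero at x = -b0 / b1
x_mid = -b0 / b1
p_mid = my_sigmoid(b0 + b1 * x_mid)

print("midpoint of the curve (relative hours):", x_mid)
print("predicted probability at the midpoint:", p_mid)
```

Changing b0 therefore shifts the curve left or right, while b1 controls how steep the transition is.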

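Check the "divide by 4" rule and compare it to the difference in predicted probability across one unit of X.

slope/4 is roughly the change in P(Y) for a unit change in X near the midpoint of the curve (where P(Y) = 0.5); far from the midpoint the change is much smaller.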

In [None]:
print(b1)
print(b1 / 4)
print(my_sigmoid(b1 * 0.5) - my_sigmoid(b1 * -0.5))
print(my_sigmoid(b1 * 5) - my_sigmoid(b1 * 4))
print(my_sigmoid(b1 * 15) - my_sigmoid(b1 * 14))
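
The rule follows from the derivative of the logistic curve: d/dx sigmoid(b0 + b1 * x) = b1 * p * (1 - p), which is largest at p = 0.5, where it equals b1 / 4. The sketch below checks this numerically. The fine grid `x_fine` and the use of `np.gradient` are illustrative choices, not part of the tutorial.

```python
import numpy as np


def my_sigmoid(x):
    return 1 / (1 + np.exp(-x))


b0, b1 = -0.1, 1.5
x_fine = np.linspace(-5, 5, 10001)  # fine grid for a numerical derivative
p = my_sigmoid(b0 + b1 * x_fine)

analytic_slope = b1 * p * (1 - p)  # derivative of the logistic curve
numeric_slope = np.gradient(p, x_fine)  # finite-difference approximation

max_slope = analytic_slope.max()
print("steepest slope of the curve:", max_slope)
print("b1 / 4:", b1 / 4)
```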

## Testing predictions for different intercepts
(plotted on the same figure)

Use a for loop.

In [None]:
# test multiple values of b0, with b1 held fixed
plt.figure()
b1 = 1
b0s = np.linspace(-3, 3, 10)
for b0 in b0s:
    p_y = my_sigmoid(b0 + x * b1)
    plt.plot(x, p_y, label=np.round(b0, 1))
plt.legend()
plt.xlabel("hours studied")
plt.grid(True)
plt.ylabel("P(Success)", fontsize=14)

In [None]:
my_sigmoid(3)

### Inverse of the sigmoid
The inverse of the sigmoid is the logit function.

In [None]:
from scipy import special

special.logit(0.9525741268224334)
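
The logit can also be written by hand as log(p / (1 - p)). The sketch below uses a hypothetical helper `my_logit` (not part of the tutorial) to show that it undoes `my_sigmoid` and agrees with `scipy.special.logit`.

```python
import numpy as np
from scipy import special


def my_sigmoid(x):
    return 1 / (1 + np.exp(-x))


def my_logit(p):
    # hypothetical helper: log-odds of a probability p
    return np.log(p / (1 - p))


z = np.array([-2.3, 0.0, 3.0])  # values on the linear-predictor scale
p = my_sigmoid(z)  # mapped to probabilities
recovered = my_logit(p)  # mapped back to the linear scale

print(p)
print(recovered)
print(np.allclose(recovered, special.logit(p)))
```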

In [None]:
my_sigmoid(3)

In [None]:
my_sigmoid(-2.3)

In [None]:
my_sigmoid(0)

In [None]:
my_sigmoid(3)

## Testing predictions for different slopes
(plotted on the same figure)

Use a for loop.

In [None]:
b1s = np.linspace(0.1, 3, 10)
b0 = 0
plt.figure()
for b1 in b1s:
    plt.plot(x, my_sigmoid(b0 + x * b1), label=np.round(b1, 1))
plt.legend()
plt.xlabel("hours studied")
plt.ylabel("p(Success)")
plt.grid(True)

Try to find good values for b<sub>0</sub> and b<sub>1</sub> for an exam where almost everyone who studies less than -10 hours (relative to the average) fails, but almost everyone who studies more than +20 hours succeeds.




In [None]:
b0 = 0
b1 = 1  # your code
lin_pred = b0 + b1 * x
p_y = my_sigmoid(lin_pred)
plt.plot(x, p_y)
plt.xlabel("Hours Studied", fontsize=14)
plt.ylabel("P(Success)", fontsize=14)

Find values for b0 and b1 for an exam where there is a smooth transition: people who study more are more likely to succeed, and almost everyone who studied more than approximately 30 hours passes.

In [None]:
# YOUR CODE
# YOUR CODE

Now that we have set up the model, we are almost ready to simulate some data.

For this we need the NumPy random.binomial generator.

Test this function and try to understand what it is doing (manipulate the probability of getting 0 and 1).

The goal here is to use probabilities to simulate random binary outcomes.



In [None]:
for i in range(10):
    print(np.random.binomial(1, 0.5))

In [None]:
np.random.binomial(1, 0.1)

In [None]:
np.random.binomial(1, 0.5, 10)
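
With a single trial (first argument 1), each draw is 0 or 1, and the second argument is the probability of drawing a 1. The sketch below illustrates this by averaging many draws. It assumes NumPy's newer `default_rng` interface with a fixed seed for reproducibility, while the tutorial cells use `np.random.binomial` directly. The dictionary `means` is only illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator, an illustrative choice

means = {}
for p in [0.1, 0.5, 0.9]:
    draws = rng.binomial(1, p, size=10_000)  # 10,000 single-trial draws
    means[p] = draws.mean()  # proportion of 1s
    print("p =", p, "proportion of 1s =", means[p])
```

The proportion of 1s should land close to p, and it gets closer as the number of draws grows.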

## Now we are ready to start the simulation
### Simulate X
I. Simulate a group of students (the cell below uses n = 70) whose study time is uniformly distributed between 0 and 100 hours (np.random.uniform). Visualize the result with a histogram.

Or, centered at zero, between -50 and 50!


In [None]:
n = 70
x = np.random.uniform(-50, 50, n)

plt.hist(x)
plt.xlabel("relative hours studied")

### Calculate p(success)
II. Make predictions for the probability of passing the exam for each student, using the linear equation and the sigmoid function.

Visualize the prediction. Sort the data by x or use a scatter plot, as a simple line plot of unsorted data will be messy.

In [None]:
b0gen = 1.5
b1gen = 0.12
y_pred = my_sigmoid(b0gen + b1gen * x)
order = np.argsort(x)  # sort x and y_pred together so the pairs stay aligned
plt.plot(x[order], y_pred[order])
# plt.scatter(x,y_pred)

plt.xlabel("Relative hours studied", fontsize=14)
plt.ylabel("P(Success)", fontsize=14)

### Simulate Y
Now we can use the predicted probabilities from above to simulate whether each student passed the exam or not, using np.random.binomial.

1. Make y equal to the simulated success (=1) or fail (=0), with a single random draw based on each value in y_pred.


2. Visualize the result with a scatter plot, together with the predicted curve.

In [None]:
print(y_pred)
y = np.zeros(n)
for i in range(n):
    y[i] = np.random.binomial(1, y_pred[i])

plt.scatter(x, y)
order = np.argsort(x)  # keep each x paired with its own prediction
plt.plot(x[order], y_pred[order])

plt.xlabel("Relative hours studied", fontsize=14)
plt.ylabel("Exam outcome", fontsize=14)
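
The loop above draws one outcome per student. As an alternative sketch, the binomial generator also accepts an array of probabilities, so the same simulation can be written without a loop. The seeded `default_rng` generator and the names `x_sim`, `p_sim` and `y_sim` are illustrative assumptions. The parameter values are the generating values used above.

```python
import numpy as np


def my_sigmoid(x):
    return 1 / (1 + np.exp(-x))


rng = np.random.default_rng(1)  # seeded generator, an illustrative choice
n = 70
x_sim = rng.uniform(-50, 50, n)  # relative hours studied
p_sim = my_sigmoid(1.5 + 0.12 * x_sim)  # generating intercept and slope from above

# one Bernoulli draw per student, each with its own probability
y_sim = rng.binomial(1, p_sim)

print(y_sim)
```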


Now we have a data set simulated with a logistic model! ✌

The next step is to actually fit a logistic regression to this data! 👼

## Fitting logistic regression


In [None]:
from sklearn.linear_model import LogisticRegression

Try to fit the logistic regression model to the x, y data that we created above.

scikit-learn expects X as a 2D array of shape (n_samples, n_features), so we need x.reshape(-1, 1).

In [None]:
log_reg = LogisticRegression()
log_reg.fit(x.reshape(-1, 1), y)
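
To see what the reshape does, the sketch below uses a small illustrative array (`x_demo` is not part of the tutorial data) and prints its shape before and after.

```python
import numpy as np

x_demo = np.array([-3.0, 0.5, 2.0])  # 1D array: three samples
X_demo = x_demo.reshape(-1, 1)  # 2D array: three rows, one feature column

print(x_demo.shape)
print(X_demo.shape)
print(X_demo)
```

The -1 tells NumPy to infer the number of rows from the length of the array, and the 1 requests a single column.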

Print the fitted parameters: the intercept (intercept_) and the slope (coef_).



In [None]:
print("fitted slope", log_reg.coef_)
print("generating slope", b1gen)

print("fitted intercept", log_reg.intercept_)
print("generating intercept", b0gen)
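
Besides sampling noise, one possible source of a small mismatch is that scikit-learn's LogisticRegression applies L2 regularization by default (C=1.0), which shrinks the slope slightly toward zero. The sketch below simulates a fresh data set and compares the default fit with a fit that uses a very large C, which makes the penalty negligible. The seed, the sample size and all variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)  # seeded generator, an illustrative choice
n_reg = 500
x_reg = rng.uniform(-50, 50, n_reg)
p_reg = 1 / (1 + np.exp(-(1.5 + 0.12 * x_reg)))  # generating values from above
y_reg = rng.binomial(1, p_reg)
X_reg = x_reg.reshape(-1, 1)

default_fit = LogisticRegression().fit(X_reg, y_reg)  # default L2 penalty, C=1.0
weak_penalty_fit = LogisticRegression(C=1e6).fit(X_reg, y_reg)  # penalty almost off

print("slope, default C:", default_fit.coef_[0, 0])
print("slope, C=1e6:", weak_penalty_fit.coef_[0, 0])
print("generating slope:", 0.12)
```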

## Predicted probability (likelihood of each data point)

### Using the built-in method from scikit-learn

In [None]:
ll = log_reg.predict_proba(x.reshape(-1, 1))
lls = np.zeros(n)
lls[y == 0] = ll[y == 0, 0]  # where true y is 0, use first column
lls[y == 1] = ll[y == 1, 1]  # where true y is 1, use 2nd column
print(lls)

## Predicted probability (likelihood of each data point)

### Using custom code

In [None]:
def sigm_ll(x, y, b1, b0):
    predp = my_sigmoid(x * b1 + b0)
    ll = np.zeros_like(predp)
    ll[y == 0] = 1 - predp[y == 0]
    ll[y == 1] = predp[y == 1]
    return ll


sigm_ll(x, y, log_reg.coef_[0], log_reg.intercept_)
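
The per-point likelihoods can be combined into a single number for the whole data set. The sketch below sums their logarithms to get the total log-likelihood and compares it with `sklearn.metrics.log_loss`, which returns the mean negative log-likelihood. It also shows that the `score` method of LogisticRegression reports accuracy (the proportion of correct 0/1 predictions), not a likelihood. The data are re-simulated here so that the block runs on its own, and all variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.default_rng(3)  # seeded generator, an illustrative choice
n_ll = 200
x_ll = rng.uniform(-50, 50, n_ll)
y_ll = rng.binomial(1, 1 / (1 + np.exp(-(1.5 + 0.12 * x_ll))))
X_ll = x_ll.reshape(-1, 1)
model = LogisticRegression().fit(X_ll, y_ll)

proba = model.predict_proba(X_ll)  # columns: P(y=0), P(y=1)
# likelihood of each observed outcome under the fitted model
point_lik = np.where(y_ll == 1, proba[:, 1], proba[:, 0])

total_log_lik = np.sum(np.log(point_lik))  # log-likelihood of the data set
mean_neg_log_lik = log_loss(y_ll, proba)  # same quantity, sign-flipped and averaged
accuracy = model.score(X_ll, y_ll)  # proportion of correct class predictions

print("total log-likelihood:", total_log_lik)
print("log loss:", mean_neg_log_lik)
print("accuracy (score):", accuracy)
```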

## color data by likelihood

In [None]:
plt.scatter(x, y, c=lls)  # color data by model likelihood
plt.xlabel("Hours studied", fontsize=14)
plt.plot(
    np.sort(x),
    my_sigmoid(np.sort(x) * log_reg.coef_[0] + log_reg.intercept_),
    label="fitted",
)
# plt.plot(np.sort(x),my_sigmoid(np.sort(x)*b1gen+b0gen),label='data gen')
# plt.legend()
plt.colorbar()
plt.ylabel("Exam passed", fontsize=14)

try to visualize the fitted model, together with the data

## Compare data generating and fitted model
Compare the generating model from the simulation with the fitted model that we got from scikit-learn.
Compare both the numbers and visualize the curves.

In [None]:
# coef_ is a 2D array and intercept_ a 1D array, so index out the scalars
y_pred_fitted = my_sigmoid(log_reg.intercept_[0] + log_reg.coef_[0, 0] * x)

plt.figure()
# your code
# your code

plt.xlabel("Relative hours studied", fontsize=14)
plt.ylabel("Exam outcome", fontsize=14)
plt.legend()

## Homework
Use the relevant parts of the code above to make 2 functions: one for generating binary outcome data based on the logistic model, and one for fitting the logistic regression with scikit-learn.

### Function 1.: MakeLogisticDat

Three inputs: 1. intercept, 2. slope, 3. array X, which is used to make the predictions.

The function should take the 3rd input X and use the intercept and the slope to calculate P(Y) for each value of X. P(Y) means the predicted probability that the outcome is 1 (as above).

Next, the function should simulate a binary outcome in array Y for each P(Y) (as above).

One output: numpy.array Y (same length as input X) that contains the simulated binary outcomes.


### Function 2.: FitLogistic
Inputs (2): an X and a Y array.
Fit the scikit-learn logistic regression (note that you have to reshape X so that it is a 2D array).
Output (2): intercept and slope of the fitted logistic regression.

!! Pay attention to the order of the inputs and the outputs (they should be in the same order, and of the same number, as above).
!! The function names should also be exactly as defined above!


### Difference between data generating and fitted model

Once you have these functions ready, change the number of students for the exam (from 10 to 1010 in steps of 50), generate a different X data set for each, fit the logistic regression to the generated data, and observe how the sample size affects the difference between the data generating and the fitted model.
**Visualize these findings:** x-axis: number of data points; y-axis: fitted value and true data generating value of the slope.

Test this with at least 3 different values for the data generating slope, and make separate figures!



In [None]:
# your code
# your code

In [None]:
# your code
# your code
# your code
# your code