# **DSFM Exercise**: Simple predictions - linear and logit models

Creator: [Data Science for Managers - EPFL Program](https://www.dsfm.ch)  
Source:  [https://github.com/dsfm-org/code-bank.git](https://github.com/dsfm-org/code-bank.git)  
License: [MIT License](https://opensource.org/licenses/MIT). See open source [license](LICENSE) in the Code Bank repository. 

-------------

## Overview

In this exercise, we start with a simple linear prediction problem. Given some toy data, we try to recover the parameters of a data generating process (DGP) and make predictions based on the fitted parameters. We then turn to a noisy binary outcome and compare a linear regression with a logistic regression.

------

# Part 0: Setup

In [None]:
# plot data
import matplotlib.pyplot as plt

# numerical matrix operations
import numpy as np    

# data science models
from sklearn.linear_model import LinearRegression, LogisticRegression
from scipy.special import expit

# Part 1: Define a linear DGP

We simply define the linear DGP as a line of the form `y = m * x + b` with the following coefficients:

- `m = 2`
- `b = 3`

**Q 1:** What are the values of `y` at `x` equal to 1, 2, 3, 4 and 5?
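One possible way to answer this question is to evaluate the line on a NumPy array. This is a minimal sketch; the variable names `m`, `b`, `X` and `y` follow the text, and nothing else is assumed.

```python
import numpy as np

# True coefficients of the DGP y = m * x + b
m, b = 2, 3

# Inputs at which to evaluate the line
X = np.array([1, 2, 3, 4, 5])

# NumPy applies the arithmetic elementwise
y = m * X + b

print(y)  # [ 5  7  9 11 13]
```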

# Part 2: Use the mean to make a prediction

In its simplest form, our "model" predicts the mean of `y` for every input.

**Q 1:** Plot the values of X and y. Does this DGP have any noise/variance?

**Q 2:** What is the mean of our outcome variable `y`?

**Q 3**: How well does the mean fit our DGP? Plot the mean of `y` and the values of `X` and `y`.
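The three questions above could be approached as in the sketch below. It rebuilds the Part 1 data so that it runs on its own. The mean squared error is an added illustration of "how well the mean fits" and is not asked for in the exercise.

```python
import matplotlib.pyplot as plt
import numpy as np

# Rebuild the Part 1 data so the example runs on its own
X = np.array([1, 2, 3, 4, 5])
y = 2 * X + 3

# The "model": a single constant prediction, the mean of y
y_mean = y.mean()
print(y_mean)  # 9.0

# Mean squared error when we always predict the mean (illustrative metric)
mse = np.mean((y - y_mean) ** 2)
print(mse)  # 8.0

# Data as points, mean prediction as a horizontal line
fig, ax = plt.subplots()
ax.scatter(X, y, label="data")
ax.axhline(y_mean, color="red", linestyle="--", label="mean of y")
ax.set_xlabel("X")
ax.set_ylabel("y")
ax.legend()
```

All points lie exactly on a straight line, so the DGP has no noise. The mean matches the data only at `x = 3`. In a notebook the figure renders at the end of the cell; in a script, add `plt.show()`.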

# Part 3: Fit a linear regression and predict

We now move beyond the mean. We fit a linear regression, which estimates the coefficients `m` and `b` that best fit the data generated by the DGP from Part 1. 

**Q 1**: Fit a linear regression to `X` and `y`. Hint: scikit-learn expects the features as a two-dimensional array of shape `(n_samples, n_features)`, so do not forget to reshape `X`.

**Q 2**: What is the R^2 score? What are the estimated coefficient and intercept values?
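A minimal sketch of the fit, assuming the Part 1 data and the `LinearRegression` class imported in the setup. The name `X_2d` is introduced here only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Rebuild the Part 1 data so the example runs on its own
X = np.array([1, 2, 3, 4, 5])
y = 2 * X + 3

# Reshape from (5,) to (5, 1): one column per feature
X_2d = X.reshape(-1, 1)

# Fit ordinary least squares
model = LinearRegression().fit(X_2d, y)

# R^2 on the training data, slope and intercept
r2 = model.score(X_2d, y)
print(r2, model.coef_, model.intercept_)
```

Because the data are noise-free, the fit should recover a slope of about 2 and an intercept of about 3, with an R^2 of about 1, up to floating-point error.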

**Q 3**: What is the predicted value of `y` at `x = 6`?

Given the DGP `y = m * x + b` with true coefficients `m = 2` and `b = 3`, the value at `x = 6` is:

`y = m * x + b`

`y = 2 * x + 3`

`y = 2 * 6 + 3`

`y = 15`

Let's validate this prediction with our fitted model.
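The hand calculation can be checked against the fitted model roughly as follows. The sketch refits the model so that it is self-contained; `x_new` and `y_hat` are illustrative names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Refit on the Part 1 data
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = 2 * X.ravel() + 3
model = LinearRegression().fit(X, y)

# predict() also expects a 2-D array: one row, one feature
x_new = np.array([[6]])
y_hat = model.predict(x_new)

print(y_hat)  # approximately [15.]
```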

# Part 4: Set up a nonlinear DGP

We now consider toy data that was generated by a *nonlinear* and *noisy* DGP. We want to predict two classes, where `y` equals `1` or `0`, that depend only on the value of `x`. 

Think back to the Credit Default demo. In this example, `y` stands for defaulting and `x` might stand for your amount of debt. In this oversimplified model, the more debt you have, the more likely you are to default. We would like predictions to lie in the continuous interval `[0,1]`.

**Q 1**: Generate a Gaussian random sample for values of `X` centered around `0`. 

**Q 2**: For all values `X > 0`, set `y` equal to `1` and `0` otherwise. Add some random noise to `X`. Hint: the numerical value of the boolean `True` is 1 and `False` is 0.
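One way to generate such data is sketched below. The sample size, the noise scale of 0.3 and the random seed are assumptions chosen for illustration; the exercise does not prescribe them.

```python
import numpy as np

# Assumed settings: seed, sample size and noise scale are arbitrary choices
rng = np.random.default_rng(seed=0)
n_samples = 100

# Gaussian sample centered around 0
X = rng.normal(loc=0.0, scale=1.0, size=n_samples)

# Boolean comparison cast to integers: True -> 1, False -> 0
y = (X > 0).astype(int)

# Add noise to X after labeling, so the classes overlap near 0
X = X + 0.3 * rng.normal(size=n_samples)

print(X[:5], y[:5])
```

Adding the noise after the labels are set is what makes the problem noisy: some points with `y = 1` end up with a slightly negative `X`, and vice versa.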

# Part 5: Plot toy data and fit a linear classifier (linear regression)

In this part, we first plot the data and then fit a linear regression to it. Plotting the data is key before fitting any statistical model. 

**Q 1**: Plot `X` and `y` on a scatterplot. Will a linear classifier fit these data well?

**Q 2**: Fit a linear regression and look at the intercept. Does the intercept have a reasonable value? Why or why not?

**Q 3**: Plot the linear regression fit. What can you say about predictions for very small/large `x`?
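The sketch below fits a linear regression to the binary outcome and evaluates it on a grid. The data generation repeats the assumed settings (seed, 100 samples, noise scale 0.3), and the grid range of -5 to 5 is likewise an arbitrary choice.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data under assumed settings (seed, sample size, noise scale)
rng = np.random.default_rng(seed=0)
X = rng.normal(size=100)
y = (X > 0).astype(int)
X = X + 0.3 * rng.normal(size=100)

# Linear regression used as a (poor) classifier
ols = LinearRegression().fit(X.reshape(-1, 1), y)
print(ols.intercept_, ols.coef_)

# Evaluate the fitted line on a grid wider than most of the data
x_grid = np.linspace(-5, 5, 200)
y_line = ols.predict(x_grid.reshape(-1, 1))

# Smallest and largest fitted values on the grid
print(y_line.min(), y_line.max())

# Scatter of the data with the fitted line on top
fig, ax = plt.subplots()
ax.scatter(X, y, alpha=0.5, label="data")
ax.plot(x_grid, y_line, color="red", label="linear regression")
ax.set_xlabel("X")
ax.set_ylabel("y")
ax.legend()
```

A straight line is unbounded, so for sufficiently small or large `x` the fitted values fall below 0 or rise above 1 and cannot be read as probabilities.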

# Part 6: Fit a nonlinear classifier (logistic regression)

We now implement a nonlinear classifier, which should better fit the data.

**Q 1**: Fit a logistic regression. 

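**Q 2**: Plot the fitted function. Hint: apply the `expit()` function to the linear predictor `x * coef_ + intercept_`, not to the class labels returned by `predict()`. `expit` is the logistic sigmoid (the inverse of the logit), so it maps the linear predictor to probabilities in `[0,1]`. How does this model fit the data?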
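A possible solution sketch follows, again under the assumed data settings (seed, 100 samples, noise scale 0.3). It uses scikit-learn's default regularization, which the exercise does not specify. The comparison with `predict_proba` is added only as a sanity check.

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.special import expit
from sklearn.linear_model import LogisticRegression

# Toy data under assumed settings (seed, sample size, noise scale)
rng = np.random.default_rng(seed=0)
X = rng.normal(size=100)
y = (X > 0).astype(int)
X = X + 0.3 * rng.normal(size=100)

# Fit a logistic regression with default settings
clf = LogisticRegression().fit(X.reshape(-1, 1), y)

# Linear predictor on a grid, then the logistic sigmoid
x_grid = np.linspace(-5, 5, 200)
linear_part = x_grid * clf.coef_[0, 0] + clf.intercept_[0]
proba = expit(linear_part)

# Sanity check: predict_proba returns P(y = 1) in its second column
proba_check = clf.predict_proba(x_grid.reshape(-1, 1))[:, 1]
print(np.allclose(proba, proba_check))  # True

# Scatter of the data with the fitted S-shaped curve
fig, ax = plt.subplots()
ax.scatter(X, y, alpha=0.5, label="data")
ax.plot(x_grid, proba, color="red", label="logistic regression")
ax.set_xlabel("X")
ax.set_ylabel("P(y = 1)")
ax.legend()
```

Unlike the straight line from Part 5, the fitted curve stays between 0 and 1 for every `x`.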