# **Simple Prediction**: Linear and Logit Models (SOLUTIONS)

Source:  [https://github.com/d-insight/code-bank.git](https://github.com/d-insight/code-bank.git)  
License: [MIT License](https://opensource.org/licenses/MIT). See open source [license](LICENSE) in the Code Bank repository. 

-------------

## Overview

In this exercise, we consider a simple linear prediction problem. Given some toy data, we try to recover the parameters of a data generating process (DGP) and make predictions based on the fitted parameters. DGP is a fancy term referring to the process that generates the real-world data we want to model.

---------

## Part 0: Setup

In [None]:
# Plot data
import matplotlib.pyplot as plt

# Numerical matrix operations
import numpy as np    

# Data science models
from sklearn.linear_model import LinearRegression, LogisticRegression
from scipy.special import expit

### Define a data generating process (DGP)

We simply define the linear DGP as a line of the form `y = m * x + b` with the following coefficients:

- `m = 2`
- `b = 3`

In [None]:
# Function that implements our DGP

def dgp(x):
    """
    Linear DGP of the form y=mx+b, where m=2 and b=3
    
    Parameter: 
        x (float): input value
    
    Returns: 
        float: f(x) 
        
    """
    
    return 2 * x + 3
    

In [None]:
# Toy data to use in the analysis below

X = [1, 2, 3, 4, 5]

y = []
for x in X:
    y.append(dgp(x))
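
The loop above can also be written more compactly. The following is a minimal sketch, assuming NumPy is acceptable at this point; the names `X_list`, `y_list`, `X_arr`, and `y_arr` are illustrative and not part of the original notebook.

```python
import numpy as np

def dgp(x):
    """Linear DGP y = 2 * x + 3 (same coefficients as above)."""
    return 2 * x + 3

X_list = [1, 2, 3, 4, 5]

# List comprehension: one call of dgp per input value
y_list = [dgp(x) for x in X_list]

# Vectorized: NumPy applies the arithmetic element-wise to the whole array
X_arr = np.array(X_list)
y_arr = dgp(X_arr)

print(y_list)  # [5, 7, 9, 11, 13]
print(y_arr)   # [ 5  7  9 11 13]
```

Both variants give the same values; the vectorized form tends to be preferable once the data gets larger.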

In [None]:
# Look at X

X

In [None]:
# Look at y

y

# **MAIN EXERCISE**

## Part 1: Predict the mean

In its simplest form, our "model" just predicts the mean of `y`, regardless of `x`.

**Q 1:** What is the mean of our outcome variable `y`?

In [None]:
# Compute the mean

y_mean = sum(y)/len(y)
y_mean

**Q 2**: Plot the mean of `y` and the values of `X` and `y`. How well does the mean fit our DGP?

In [None]:
# Plot mean prediction

plt.scatter(X, y)
plt.plot(X, y)
plt.hlines(y_mean, xmin = min(X), xmax = max(X), colors='red')
plt.ylabel('y')
plt.xlabel('X')

As we can see, this prediction does not fit well. The prediction is biased and has no variance. In fact, no prediction will have variance, as the DGP has no noise.
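
To put a number on how poorly the mean fits, one option is to look at the residuals and the mean squared error (MSE). This is a small sketch under the assumption that MSE is the error measure of interest; the variable names are illustrative only.

```python
import numpy as np

# Outcomes of the linear DGP y = 2 * x + 3 for x = 1, ..., 5
y_demo = np.array([5, 7, 9, 11, 13])

# The "model" predicts the same value (the mean) for every x
y_mean_demo = y_demo.mean()

# Residuals: how far each true value is from the mean prediction
residuals = y_demo - y_mean_demo
mse_mean = np.mean(residuals ** 2)

print(residuals)  # [-4. -2.  0.  2.  4.]
print(mse_mean)   # 8.0
```

The residuals grow systematically with distance from the center of the data, which is the pattern visible in the plot above. For a mean-only prediction, the MSE equals the variance of `y`.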

## Part 2: Fit and predict a linear regression

We now move beyond the mean. We fit a linear regression, which estimates the values of `m` and `b` that best fit the data.

In [None]:
# Prepare data 

X = np.array(X).reshape(-1, 1)
y = np.array(y)


**Q 1**: Fit a linear regression to `X` and `y`. Hint: do not forget to reshape the data into a two-dimensional shape. Why do we need to reshape our data?

In [None]:
# Fit the linear regression 

reg = LinearRegression().fit(X, y)
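
On the reshape question: scikit-learn estimators expect `X` as a two-dimensional array of shape `(n_samples, n_features)`. The sketch below illustrates the difference using hypothetical names (`x_flat`, `x_col`, `reg_demo`) that are not part of the exercise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x_flat = np.array([1, 2, 3, 4, 5])   # shape (5,): a flat vector
y_demo = 2 * x_flat + 3

x_col = x_flat.reshape(-1, 1)        # shape (5, 1): 5 samples, 1 feature
print(x_flat.shape, x_col.shape)

# Fitting on the flat vector fails because the number of features is ambiguous
try:
    LinearRegression().fit(x_flat, y_demo)
    raised = False
except ValueError as err:
    raised = True
    print("1D input rejected:", type(err).__name__)

# Fitting on the column vector works and yields one coefficient per feature
reg_demo = LinearRegression().fit(x_col, y_demo)
print(reg_demo.coef_.shape)
```

The `-1` in `reshape(-1, 1)` asks NumPy to infer the number of rows from the length of the array.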


**Q 2**: What is the R^2 score (i.e. the variance explained) for the above model?

In [None]:
# Evaluate the fit in terms of R^2 (i.e. variance explained)

reg.score(X, y)
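
For intuition, R^2 can also be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The helper `r_squared` below is a hypothetical function written for illustration, not part of scikit-learn.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1 - ss_res / ss_tot

x_demo = np.array([1, 2, 3, 4, 5], dtype=float)
y_demo = 2 * x_demo + 3

# Baseline from Part 1: always predict the mean
r2_mean = r_squared(y_demo, np.full_like(y_demo, y_demo.mean()))

# Predictions from the true line y = 2 * x + 3
r2_line = r_squared(y_demo, 2 * x_demo + 3)

print(r2_mean)  # 0.0
print(r2_line)  # 1.0
```

Predicting the mean explains none of the variance, while a line that matches the noise-free DGP explains all of it.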


**Q 3**: What is the predicted value of `y` at `x = 6`?

Given the DGP `y = m * x + b`, its true coefficients `m = 2` and `b = 3` and `x = 6`:

`y = m * x + b`

`y = 2 * x + 3`

`y = 2 * 6 + 3`

`y = 15`

Let's validate this prediction with our fitted model.

In [None]:
# Predict at x = 6

reg.predict([[6]])

# **ADVANCED EXERCISE**

*Optional.* If time permits and you feel comfortable with Python, continue with the advanced parts of this exercise below.

**Q 0**: What are the estimated coefficient and intercept values?

In [None]:
# Estimate coefficient m

reg.coef_


In [None]:
# Estimate intercept b

reg.intercept_ 
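
As a cross-check, the same coefficients can be recovered with an ordinary least-squares solve in plain NumPy. This is a sketch assuming the noise-free toy data from Part 0; `A`, `m_hat`, and `b_hat` are illustrative names.

```python
import numpy as np

x_demo = np.array([1, 2, 3, 4, 5], dtype=float)
y_demo = 2 * x_demo + 3

# Design matrix with a column for the slope and a column of ones for the intercept
A = np.column_stack([x_demo, np.ones_like(x_demo)])

# Least-squares solution of A @ [m, b] = y
(m_hat, b_hat), *_ = np.linalg.lstsq(A, y_demo, rcond=None)

print(m_hat, b_hat)        # approximately 2 and 3
print(m_hat * 6 + b_hat)   # approximately 15, the prediction at x = 6
```

Because the data contain no noise, the estimates should match the true coefficients up to floating-point error.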


## Part 4: Set up a nonlinear DGP

We now consider toy data that was generated by a *nonlinear* and *noisy* DGP. We want to predict two classes, where `y` equals `1` or `0`, that depend only on the value of `x`. 

Think back to the Credit Default demo. In this example, `y` stands for defaulting and `x` might stand for your amount of debt. In this oversimplified model, the more debt you have, the more likely you are to default. We would like predictions to lie in the continuous interval `[0,1]`.

**Q 1**: Generate a Gaussian random sample for values of `X` centered around `0`. Inspect the first 10 values of `X`.

In [None]:
# Generate random samples with seed = 0 and look at the first 10 values of X

n_samples = 100
np.random.seed(0)

X = np.random.normal(size = n_samples)
X[:10]

**Q 2**: For all values `X > 0`, set `y` equal to `1` and `0` otherwise. Add some random noise to `X`. Hint: the numerical value of the boolean `True` is 1 and `False` is 0.

In [None]:
# Set y to 1 if X > 0

y = X > 0

# Convert boolean values to numbers

y = y.astype(float)

# Look at the first 10 entries of y

y[:10]

In [None]:
# Add some zero-mean Gaussian noise to X and look at the first 10 entries of X

X = X + 0.2 * np.random.normal(size = n_samples)
X[:10]
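
Because `y` was set before the noise was added, some points near `x = 0` can end up on the "wrong" side of zero. The sketch below counts them. It assumes that a local `np.random.RandomState(0)` reproduces the draws made after `np.random.seed(0)`; all variable names are illustrative.

```python
import numpy as np

n_samples = 100
rng = np.random.RandomState(0)  # local generator, leaves the global seed untouched

# Same recipe as above: labels come from the clean values, then noise is added
x_clean = rng.normal(size=n_samples)
y_demo = (x_clean > 0).astype(float)
x_noisy = x_clean + 0.2 * rng.normal(size=n_samples)

# A point is "mismatched" if the sign of the noisy x disagrees with its label
mismatch = (x_noisy > 0) != (y_demo == 1)

print(mismatch.sum(), "of", n_samples, "points crossed x = 0 due to noise")
print(x_noisy[mismatch])
```

These mismatched points are what make the two classes overlap around zero, so no threshold on `x` separates them perfectly.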

## Part 5: Plot toy data and fit a linear classifier (linear regression)

In this part, we first plot the data and then fit a linear regression to it. Plotting the data is key before fitting any statistical model.

**Q 1**: Plot `X` and `y` on a scatterplot. Will a linear classifier fit these data well?

In [None]:
# Scatterplot with axis labels

plt.scatter(X, y)
plt.ylabel('y')
plt.xlabel('X')
plt.ylim(-.25, 1.25)
plt.xlim(X.min(), X.max())

**Q 2**: Fit a linear regression and look at the intercept. Does the intercept have a reasonable value? Why or why not?

In [None]:
# Fit the linear regression (hint: call .reshape(-1, 1) on X because scikit-learn expects a 2D array with one column per predictor)

X = X.reshape(-1, 1)

reg = LinearRegression().fit(X, y)

In [None]:
reg.intercept_

**Q 3**: Plot the linear regression fit. What can you say about predictions for very small/large `x`?

In [None]:
# Plot linear regression fit

# np.linspace returns evenly spaced numbers over a specified interval
X_test = np.linspace(-2, 2, 100).reshape(-1, 1)

# Option 1: manually extract the coefficients for the line y = m * x + b
y_pred_linear = reg.coef_ * X_test + reg.intercept_

# Option 2: use the .predict() function
# y_pred_linear = reg.predict(X_test)

plt.plot(X_test, y_pred_linear, linewidth=1)
plt.scatter(X, y)
plt.ylabel('y')
plt.xlabel('X')
plt.ylim(-.25, 1.25)
plt.xlim(X.min(), X.max())

We can see that the linear classifier is not a good fit. While the intercept looks plausible, the model predicts values outside the `[0,1]` range.
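
To make the out-of-range problem concrete, the following sketch regenerates the toy data and counts how many linear predictions on the test grid fall outside `[0, 1]`. It assumes the same data recipe and seed as above; names such as `reg_demo` and `X_grid` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Recreate the noisy toy data (assumption: same recipe and seed as in Part 4)
rng = np.random.RandomState(0)
x_clean = rng.normal(size=100)
y_demo = (x_clean > 0).astype(float)
X_demo = (x_clean + 0.2 * rng.normal(size=100)).reshape(-1, 1)

reg_demo = LinearRegression().fit(X_demo, y_demo)

# Predict on an evenly spaced grid between -2 and 2
X_grid = np.linspace(-2, 2, 100).reshape(-1, 1)
pred = reg_demo.predict(X_grid)

# Flag predictions that cannot be interpreted as probabilities
outside = (pred < 0) | (pred > 1)

print(outside.sum(), "of", len(pred), "predictions lie outside [0, 1]")
print(pred.min(), pred.max())
```

A straight line is unbounded, so the further `x` moves from the center of the data, the further the prediction drifts below 0 or above 1.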

## Part 6: Fit a nonlinear classifier (logistic regression)

We now implement a nonlinear classifier, which should better fit the data.

**Q 1**: Fit a logistic regression. 

In [None]:
# Fit the logistic regression

logReg = LogisticRegression(solver='lbfgs').fit(X, y)


**Q 2**: Plot the fitted function. Hint: apply the `expit()` function to the linear prediction. `expit` is the logistic sigmoid, i.e. the inverse of the logit function, so it maps the predictions onto an S-shaped curve between 0 and 1. How well does this model fit the data?

In [None]:
# Plot logistic regression fit

# Option 1: manually extract the coefficients (uncomment line below)
# y_pred_nonlinear = X_test * logReg.coef_ + logReg.intercept_
# transform the linear prediction with expit, the inverse of the logit function (uncomment line below)
# y_pred_nonlinear = expit(y_pred_nonlinear)

# Option 2: use the .predict_proba() function and keep the probability of class 1
y_pred_nonlinear = logReg.predict_proba(X_test)[:,1]

plt.plot(X_test, y_pred_nonlinear, linewidth=1)
plt.scatter(X, y)
plt.ylabel('y')
plt.xlabel('X')
plt.ylim(-.25, 1.25)
plt.xlim(X.min(), X.max())
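
The two options in the cell above should produce the same curve. The sketch below checks this on regenerated toy data, assuming the same recipe and seed as in Part 4 and scikit-learn's default settings for a binary problem; names such as `log_reg_demo` and `X_grid` are illustrative.

```python
import numpy as np
from scipy.special import expit, logit
from sklearn.linear_model import LogisticRegression

# Recreate the noisy toy data (assumption: same recipe and seed as in Part 4)
rng = np.random.RandomState(0)
x_clean = rng.normal(size=100)
y_demo = (x_clean > 0).astype(float)
X_demo = (x_clean + 0.2 * rng.normal(size=100)).reshape(-1, 1)

log_reg_demo = LogisticRegression().fit(X_demo, y_demo)
X_grid = np.linspace(-2, 2, 100).reshape(-1, 1)

# Option 1: linear score m * x + b, then squash it with the sigmoid
z = X_grid @ log_reg_demo.coef_.T + log_reg_demo.intercept_
p_manual = expit(z).ravel()

# Option 2: let scikit-learn compute the probability of class 1
p_sklearn = log_reg_demo.predict_proba(X_grid)[:, 1]

print(expit(0.0))                                # 0.5: a score of zero means 50/50
print(np.allclose(p_manual, p_sklearn))          # both options agree
print(np.allclose(logit(p_manual), z.ravel()))   # logit undoes expit
```

Unlike the linear fit, every predicted value here stays strictly between 0 and 1, so it can be read as a probability.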
