<a href="https://colab.research.google.com/github/tbonne/IntroPychStats/blob/main/notebooks/lm_predictions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src='http://drive.google.com/uc?export=view&id=1VwLYFU_RpimHn-ApqqcH95EVIe00-UTE' width=500>

# <font color='darkorange'>Making predictions</font>

In this notebook we'll learn how to make predictions using our regression models. To do this we will first fit our model to the data.


### 1. Load in the data

Let's load in some packages. These contain functions that other people have written, and will hopefully make our lives a lot easier!

In [None]:
install.packages("jtools")
install.packages("ggstance")
library(jtools)

Then let's load in the IQ data

In [None]:
#here we will read in a csv file and place it into something called df_IQ
df_IQ <- read.csv("https://raw.githubusercontent.com/tbonne/IntroPychStats/main/data/kidIQ.csv", header = TRUE)

#let's take a look at the data
head(df_IQ)

### 2. Visualize our data

Then let's draw a scatterplot. Here we will choose:
> What we'd like to predict (child IQ), which goes on the y-axis.
> What we'd like to use to help make those predictions (mom IQ), which goes on the x-axis.

In [None]:
plot(x=df_IQ$mom_iq,y=df_IQ$kid_score) 

### 3. Define and fit our model

Now we can specify the model we'd like to fit: i.e., predict child IQ using mom IQ.
> Remember, here we use the formula: "what we'd like to predict" ~ "what we'd like to use to help make those predictions."
  

In [None]:
#fit a linear model
model_childIQ <- lm(kid_score ~ mom_iq, data=df_IQ)


This bit of code then uses our inputs to find the best-fit linear equation:
> kid_score = a + b * mom_iq

Let's use the summ function to tell us what values of a and b it found for the best-fit line.
> Note: we'll also calculate our 95% confidence intervals here!

In [None]:
#What does the best fit model look like?
summ(model_childIQ, confint=TRUE)

We can see from this output that the model is pretty certain that the slope in the population is somewhere between 0.49 and 0.72.
> That is the range of population values that are compatible with our sample!
> I.e., it is unlikely that we'd get a sample like the one we got if the real slope between moms and children was -0.3.
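
If you'd like to pull out a, b, and their confidence intervals as plain numbers, base R's `coef()` and `confint()` are one way to do it. The sketch below uses a tiny made-up data frame (`toy` and `toy_model` are hypothetical names, and the numbers are not the kidIQ data) so that it runs on its own. Swapping in `model_childIQ` should give values that line up closely with the summ table.

```r
# Made-up toy data (NOT the kidIQ data), just so this example runs by itself
toy <- data.frame(mom_iq    = c(80, 90, 100, 110, 120),
                  kid_score = c(70, 82, 85, 95, 98))

# fit the same kind of model as above
toy_model <- lm(kid_score ~ mom_iq, data = toy)

# a (the intercept) and b (the slope) of the best-fit line
coef(toy_model)

# 95% confidence intervals for a and b
confint(toy_model, level = 0.95)
```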

### 4. Visualize the results

Let's take a look at the estimates a little more visually

In [None]:
#plot the estimates of the slopes
plot_summs(model_childIQ)

Let's take a look at the regression line a little more visually

In [None]:
#plot line on the data
effect_plot(model_childIQ, pred = mom_iq, interval = TRUE, plot.points = TRUE)

### 5. Make predictions

Let's now use the model we just fit to make a prediction. The idea here is that we can use our best fit line to answer prediction questions. 
> E.g., what IQ does our model predict for a child whose mom's IQ is 120?

In [None]:
#make a prediction
predict(model_childIQ, newdata=data.frame(mom_iq=120) )
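
To see what `predict()` is doing under the hood, here is a small sketch that plugs a value into a + b * mom_iq by hand and compares the result with `predict()`. It uses a made-up toy data frame (`toy` and `toy_model` are hypothetical names, not the kidIQ data) so it can run on its own.

```r
# Made-up toy data (NOT the kidIQ data), just so this example runs by itself
toy <- data.frame(mom_iq    = c(80, 90, 100, 110, 120),
                  kid_score = c(70, 82, 85, 95, 98))
toy_model <- lm(kid_score ~ mom_iq, data = toy)

# pull a (intercept) and b (slope) out of the fitted model
a <- coef(toy_model)[["(Intercept)"]]
b <- coef(toy_model)[["mom_iq"]]

# prediction by hand: plug mom_iq = 120 into the line
by_hand <- a + b * 120

# prediction using the predict function
with_predict <- predict(toy_model, newdata = data.frame(mom_iq = 120))

# the two should agree
all.equal(by_hand, unname(with_predict))
```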

> E.g., what IQ does our model predict for a child whose mom's IQ is 80?

In [None]:
#make a prediction


> E.g., what IQ does our model predict for a child whose mom's IQ is 170?

In [None]:
#make a prediction


<img src='http://drive.google.com/uc?export=view&id=1qWrKY9TgpgQaBCzZfz1xLTV6iCeSwfmG' width=150>

## Let's head back to the slides!

Let's take a look at how good our model is at predicting IQ.
  
Let's first make predictions for every data point and compare them to the observed values.

In [None]:
#let's make some predictions (if we don't give the prediction function new values it will just use the data that was used when building the model)
df_IQ$predictions <- predict(model_childIQ)

#Let's take a look
head(df_IQ)

Next let's plot the predictions and the observed IQ values of the children.

In [None]:
plot(y=df_IQ$kid_score, x=df_IQ$mom_iq)
points(y=df_IQ$predictions, x=df_IQ$mom_iq, col="red")

Not too surprisingly, we see that all those predicted points fall on a line, i.e., the regression line!
  
We can also see that those predictions are sometimes very far away from the real observations, i.e., there is some amount of residual error!
  
Let's look at this error in a plot. To do this, let's subtract the predictions from the observations to get the residuals. Then let's plot the residuals.

In [None]:
#take the difference between the observed and the predicted
df_IQ$residual <- df_IQ$kid_score - df_IQ$predictions

#take a look
plot(y=df_IQ$residual, x=df_IQ$mom_iq)
abline(h=0, lty=3) #just adding a little dashed line here

Here we can see a scatter plot of the residuals on the y-axis and the mom's IQ on the x-axis. Each point on the plot tells us how far off our prediction was. So we can see some positive values (where our prediction was too low) and some negative values (where our prediction was too high). 
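
R can also hand us the residuals directly through the `residuals()` function. The sketch below checks, on a made-up toy data frame (`toy` and `toy_model` are hypothetical names, not the kidIQ data), that "observed minus predicted" gives the same numbers.

```r
# Made-up toy data (NOT the kidIQ data), just so this example runs by itself
toy <- data.frame(mom_iq    = c(80, 90, 100, 110, 120),
                  kid_score = c(70, 82, 85, 95, 98))
toy_model <- lm(kid_score ~ mom_iq, data = toy)

# residuals by hand: observed minus predicted
by_hand <- toy$kid_score - predict(toy_model)

# residuals straight from the model
built_in <- residuals(toy_model)

# the two should agree
all.equal(unname(by_hand), unname(built_in))

# positive = prediction was too low, negative = prediction was too high
built_in
```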
  
Points near zero on the y-axis are points where our model did very well! So if the model did really well we should see all the points around zero (i.e., close to that dashed line). If it was a weak model we should see points that vary a lot and are not close to the dashed line.
> What measure do you know of that gets at how much something varies? That's right, variance!

We'll use that to estimate model performance with something called r-squared (sometimes written R2). R2 compares how much variation there is in your data with how much is left over in the residuals. If you have a good model, the variation in your data (e.g., kid_score) should be well predicted, and the residuals should all sit near zero (i.e., the variance of the residuals will be low, because your model is making good predictions).
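
Written as an equation, the comparison described above looks like this (the labels "residuals" and "observed" are just our shorthand for the two sets of numbers):

$$R^2 = 1 - \frac{\mathrm{Var}(\text{residuals})}{\mathrm{Var}(\text{observed})}$$

If the residuals vary almost as much as the observed data, the fraction is close to 1 and R2 is close to 0. If the residuals barely vary at all, the fraction is close to 0 and R2 is close to 1.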

Below let's calculate the variance in your observed data (i.e., kid IQ score).


In [None]:
var_data <- var(df_IQ$kid_score)

Next let's calculate the variance in your residuals (i.e., what is left after making your predictions)

In [None]:
var_residuals <- var(df_IQ$residual)

Finally, let's use a formula to calculate R2!

In [None]:
1-(var_residuals/var_data)

Above you should see a value of R2. This reflects how much of the variation in kid IQ scores your model can explain! R2 ranges from 0 to 1.

> Values close to 1 suggest your model is doing very well at predicting kid IQ scores.

> Values close to 0 suggest your model is not doing very well at predicting kid IQ scores.

If you go all the way back to the top, you'll see that we won't have to go through all these steps every time we'd like to get the R2 for our model. It is calculated for us when we look at the summary table using the **summ()** function!
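
As a quick sanity check, the sketch below compares the by-hand R2 with the value that base R's `summary()` stores for a linear model (in `$r.squared`), which should be the same quantity that summ() reports. It uses a made-up toy data frame (`toy` and `toy_model` are hypothetical names, not the kidIQ data) so it runs on its own.

```r
# Made-up toy data (NOT the kidIQ data), just so this example runs by itself
toy <- data.frame(mom_iq    = c(80, 90, 100, 110, 120),
                  kid_score = c(70, 82, 85, 95, 98))
toy_model <- lm(kid_score ~ mom_iq, data = toy)

# R2 by hand: 1 - (variance of residuals / variance of observed data)
toy$residual <- toy$kid_score - predict(toy_model)
r2_by_hand <- 1 - var(toy$residual) / var(toy$kid_score)

# R2 as calculated for us by R
r2_built_in <- summary(toy_model)$r.squared

# the two should agree
all.equal(r2_by_hand, r2_built_in)
```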