<a href="https://colab.research.google.com/github/tbonne/IntroPychStats/blob/main/notebooks/lm_house_prices.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src='http://drive.google.com/uc?export=view&id=1L4qAFCwXR9S6RHVzdOc7PhL7-B3yW-Jb' width=300>

# <font color='darkorange'>Can you predict house prices?</font>

In this notebook we'll use the housing data to build multiple linear regression models, make predictions, and help answer questions.


### 1. Load in the data

Let's load in some packages. These contain functions that other people have written, and will hopefully make our lives a lot easier!

In [None]:
install.packages("jtools")
install.packages("ggstance")
library(jtools)

Then let's load in the [Boston housing data](https://www.kaggle.com/fedesoriano/the-boston-houseprice-data).

In [None]:
#here we will read in a csv file and place it into something called df_houses
df_houses <- read.csv("https://raw.githubusercontent.com/tbonne/IntroPychStats/main/data/BostonHousesPrices.csv", header = TRUE)

#let's take a look at the data
head(df_houses)

Now that we can see the data, think about a question you might like to ask about what leads some houses to have higher or lower prices than others.

> E.g., what predicts a house's price?

Write out your question in words here: 

> Is it the case that houses with more rooms are more expensive?

> Are houses next to the river more expensive?

> How well can I predict the price of a house?



### 2. Visualize our data

Then let's plot a scatterplot (feel free to plot a few here, it is always a good idea to explore your data before modeling!). Here we will choose: 
> What we'd like to predict and put it on the y-axis.

> What we'd like to use to help make those predictions and put it on the x-axis.

> The choice of these variables should follow from the question you're asking above!


<font color = "darkred"> (?) Replace the question mark below with the name of the column that you'd like to use to make predictions about the price of houses. </font>

In [None]:
plot(x=df_houses$?,y=df_houses$price) 

In [None]:
plot(x=factor(df_houses$?),y=df_houses$price) 

In [None]:
plot(x=df_houses$?,y=df_houses$price)
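
As a rough illustration of the two kinds of plot used above, here is a minimal sketch on simulated data. The data frame `df_sim` and its columns `rooms`, `river`, and `price` are made up for this example and are not taken from the housing file, so swap in your own column names.

```r
# Simulated stand-in data (hypothetical columns, not the real housing file)
set.seed(1)
n <- 200
df_sim <- data.frame(
  rooms = round(runif(n, min = 3, max = 9)),  # number of rooms
  river = rbinom(n, size = 1, prob = 0.3)     # 1 = next to the river, 0 = not
)
df_sim$price <- 5 + 4 * df_sim$rooms + 3 * df_sim$river + rnorm(n, sd = 3)

# Numeric predictor on the x-axis: plot() draws a scatterplot
plot(x = df_sim$rooms, y = df_sim$price, xlab = "rooms", ylab = "price")

# Predictor wrapped in factor(): plot() draws one boxplot per group
plot(x = factor(df_sim$river), y = df_sim$price, xlab = "river", ylab = "price")
```

Wrapping a yes/no column in `factor()` tells R to treat it as groups rather than as a number, which is why the second plot shows boxes instead of points.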

### 3. Define and fit our model

Now we can specify the model we'd like to fit.
> Remember, here we use the formula: "what we'd like to predict" ~ "what we'd like to use to help make those predictions."
  

<font color = "darkred"> (?) Replace the question mark below with the formula that will help you answer your question. </font>

In [None]:
#fit a linear model
model_houses <- lm(?, data=df_houses)


This bit of code then uses our inputs to find the best-fit linear equation for:

> $price_i \sim Normal(\mu_i, \sigma)$

> $\mu_i = a + b_{rooms} * rooms_i + b_{river} * river_i$
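
One way to build intuition for these two equations is to run them forwards: pick values for $a$, the $b$'s, and $\sigma$, simulate prices from them, and check that `lm` finds values close to the ones we chose. The sketch below does this with made-up numbers and hypothetical column names (`rooms`, `river`, `price`); none of it comes from the real housing data.

```r
# Simulate data from the model, then see whether lm() recovers the parameters
set.seed(42)
n <- 500
sim <- data.frame(
  rooms = runif(n, min = 3, max = 9),      # hypothetical predictor
  river = rbinom(n, size = 1, prob = 0.3)  # hypothetical 0/1 predictor
)

# "True" values chosen for the illustration only
a <- 10
b_rooms <- 5
b_river <- 2
sigma <- 3

# mu_i = a + b_rooms * rooms_i + b_river * river_i
mu <- a + b_rooms * sim$rooms + b_river * sim$river

# price_i ~ Normal(mu_i, sigma)
sim$price <- rnorm(n, mean = mu, sd = sigma)

# Fit the same structure and compare the estimates to the chosen values
fit_sim <- lm(price ~ rooms + river, data = sim)
coef(fit_sim)    # estimates of a, b_rooms, b_river
sigma(fit_sim)   # estimate of sigma (residual standard error)
```

The estimates will not match the chosen values exactly, because each simulated sample contains noise; that gap is what the confidence intervals describe.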


Let's use the summ function to tell us what values of a and b it found for the best fit line. 
> Note: we'll also calculate our 95% confidence intervals here!

In [None]:
#What does the best fit model look like?
summ(model_houses, confint=TRUE)

We can see from this output that the model is pretty certain that the population slope is somewhere between ? and ?.
> That is the range of population values that are compatible with our sample!
  
We can also get a sense of how well your model's predictions reflect the observed values using $R^2$.
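
If you'd like to pull these numbers out yourself, base R offers `confint()` for the confidence intervals and `summary()` for $R^2$. The sketch below uses simulated data with hypothetical columns (`rooms`, `price`), not the housing file, and is only meant to show where the numbers come from.

```r
# Simulated stand-in data (hypothetical columns)
set.seed(7)
n <- 300
sim <- data.frame(rooms = runif(n, min = 3, max = 9))
sim$price <- 10 + 5 * sim$rooms + rnorm(n, sd = 4)

fit_sim <- lm(price ~ rooms, data = sim)

# 95% confidence intervals for the intercept and the slope
ci <- confint(fit_sim, level = 0.95)
ci

# R2: the proportion of the variance in price that the model accounts for
r2 <- summary(fit_sim)$r.squared
r2

# For a linear model with an intercept, R2 is also the squared correlation
# between the observed values and the model's fitted values
cor(sim$price, fitted(fit_sim))^2
```

An $R^2$ near 1 means the fitted values track the observed prices closely; an $R^2$ near 0 means the predictors explain little of the variation.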

### 4. Visualize the results

Let's take a look at the estimates a little more visually.

In [None]:
#plot the estimates of the slopes
plot_summs(model_houses)

Let's take a look at the regression line a little more visually.

<font color = "darkred"> (?) Replace the question mark below with the variable that you used to help make your predictions. </font>

In [None]:
#plot line on the data
effect_plot(model_houses, pred = ?, interval = TRUE, plot.points = TRUE)

In [None]:
#plot line on the data
effect_plot(model_houses, pred = ?, interval = TRUE, plot.points = TRUE)

### 5. Checking assumptions

**Assumption 1**

Let's check the assumption that the errors (residuals) are normally distributed.

In [None]:
hist(model_houses$residuals)

The above plot is just like the histograms we've looked at in the past. Now we are looking at how errors are distributed.

> If the histogram does not show many small errors and only a few large errors (both positive and negative), then a normal distribution might not be the best model of the data. We might also be missing an important variable...
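
To get a feel for what "looks normal" means, it can help to compare a case where the assumption holds with one where it does not. The sketch below simulates both (all numbers are made up for illustration) and draws the two residual histograms side by side.

```r
# Two simulated datasets: one with normal errors, one with right-skewed errors
set.seed(3)
n <- 300
x <- runif(n, min = 3, max = 9)
y_normal <- 10 + 5 * x + rnorm(n, sd = 3)             # normal errors
y_skewed <- 10 + 5 * x + (rexp(n, rate = 1 / 3) - 3)  # skewed errors, mean 0

res_normal <- residuals(lm(y_normal ~ x))
res_skewed <- residuals(lm(y_skewed ~ x))

# Side-by-side histograms of the residuals
par(mfrow = c(1, 2))
hist(res_normal, main = "Normal errors", xlab = "residual")
hist(res_skewed, main = "Skewed errors", xlab = "residual")
par(mfrow = c(1, 1))
```

The first histogram should look roughly symmetric and bell-shaped, while the second should show a long tail of large positive errors.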

**Assumption 2** - no patterns in the residuals
  
Let's check the assumption that the variance in the errors is constant.

In [None]:
plot(y=model_houses$residuals, x=model_houses$fitted.values)
abline(h = 0, lty=3)

The above plot shows you all the errors (residuals) for each value that the model predicts. Ideally, we'd like to see errors evenly distributed around 0 (i.e., the dashed line).

> If there is more variance in the errors for some predicted values, then the model is better at predicting some values than others.
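
As a point of comparison, the sketch below simulates one dataset where the spread of the errors is constant and one where it grows with the predictor, then draws the same residual plot for each. The data are made up for illustration.

```r
# Constant spread versus spread that grows with x
set.seed(11)
n <- 300
x <- runif(n, min = 3, max = 9)
y_const  <- 10 + 5 * x + rnorm(n, sd = 3)        # same spread everywhere
y_funnel <- 10 + 5 * x + rnorm(n, sd = 0.8 * x)  # spread grows with x

fit_const  <- lm(y_const ~ x)
fit_funnel <- lm(y_funnel ~ x)

par(mfrow = c(1, 2))
plot(x = fitted(fit_const), y = residuals(fit_const),
     main = "Constant variance", xlab = "fitted", ylab = "residual")
abline(h = 0, lty = 3)
plot(x = fitted(fit_funnel), y = residuals(fit_funnel),
     main = "Funnel shape", xlab = "fitted", ylab = "residual")
abline(h = 0, lty = 3)
par(mfrow = c(1, 1))
```

In the second panel the points should fan out as the fitted values increase, which is the pattern to watch for in your own residual plot.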

**Assumption 3** - no patterns in the residuals
   
Let's check the assumption that the relationship between your variables is linear (i.e., that a straight line, and not a curvy line, fits the data best). We can see this intuitively in the original scatterplot, or we can look at the residuals!

In [None]:
plot(y=model_houses$residuals, x=model_houses$fitted.values)
abline(h = 0, lty=3)

The plot above shows the residuals against the values the model predicts. If a straight line fits the data well, the residuals should scatter around the dashed line without any curve. If they bend above and below the line in a systematic way, a curvy line might fit better.
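
To see what a violation of linearity looks like in a residual plot, the sketch below fits a straight line to simulated data that are truly curved, and then refits with an added squared term. The data and the choice of a squared term are assumptions made for this illustration only.

```r
# Simulated data with a curved relationship between x and y
set.seed(5)
n <- 300
x <- runif(n, min = 3, max = 9)
y <- 10 + 1.5 * x^2 + rnorm(n, sd = 2)

# A straight line misses the curve: residuals form a U shape
fit_line <- lm(y ~ x)
plot(x = fitted(fit_line), y = residuals(fit_line),
     main = "Straight line on curved data", xlab = "fitted", ylab = "residual")
abline(h = 0, lty = 3)

# Adding a squared term lets the model bend: the U shape disappears
fit_curve <- lm(y ~ x + I(x^2))
plot(x = fitted(fit_curve), y = residuals(fit_curve),
     main = "With a squared term", xlab = "fitted", ylab = "residual")
abline(h = 0, lty = 3)

# Typical size of the errors under each model
c(line = sigma(fit_line), curve = sigma(fit_curve))
```

In the first plot the residuals sit above the dashed line at both ends and below it in the middle, which is the kind of pattern that signals a straight line is not enough.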

There are two things to keep in mind when checking the assumptions of the linear regression.

> The first is that the assumptions do not need to be met perfectly to give you a reasonable estimate.

> The second is that often the way the model fails can help you build a better model.

### 6. Interpret the results

From the results above, how would you answer the question you posed in section 1?  
> What is the association between the variables that you tested?

> What does the confidence interval tell you about how certain you are in the sign and magnitude of that association?

> How "good" are your model predictions?
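
One possible way to put a number on how "good" the predictions are is to look at the typical size of the errors and at the range the model gives for a new house. The sketch below uses simulated data with hypothetical columns (`rooms`, `price`) and base R's `predict()`; treat it as one option among several, not the only way to judge a model.

```r
# Simulated stand-in data (hypothetical columns)
set.seed(9)
n <- 300
sim <- data.frame(rooms = runif(n, min = 3, max = 9))
sim$price <- 10 + 5 * sim$rooms + rnorm(n, sd = 4)

fit_sim <- lm(price ~ rooms, data = sim)

# Root mean squared error: the typical size of a prediction error,
# in the same units as price
rmse <- sqrt(mean(residuals(fit_sim)^2))
rmse

# Predicted price for a hypothetical new house with 6 rooms,
# with a 95% prediction interval (lwr and upr)
new_house <- data.frame(rooms = 6)
pred <- predict(fit_sim, newdata = new_house, interval = "prediction", level = 0.95)
pred
```

A prediction interval is wider than the confidence interval for the slope because it includes the house-to-house noise ($\sigma$) as well as the uncertainty in the fitted line.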
  
If you've finished this section, try going back up and asking another question!