<img src='https://images.squarespace-cdn.com/content/v1/56acc1138a65e2a286012c54/1540838202280-0J5KTEAF49LT2SWSA4TT/cell-phone-addiction_0.jpg' width=500>

# <font color='darkorange'>Answering questions using a statistical model</font>

In this notebook we'll use statistical inference to answer the question: "Is there a relationship between age and how much someone uses their phone?"

To do this we'll use linear regression, which describes the relationship with the equation of a straight line:

y = a + b*x

In this equation we have variables (the data we are going to use: e.g., the columns in a spreadsheet).

> Here "y" holds the values of the response variable. This is the thing you'd like to predict! In the case of the IQ dataset this could be the child IQ values.

> The "x" in the equation holds the values you'd like to use to help make those predictions. Again, in the case of the IQ dataset this could be the mom's IQ scores.

In this equation we also have parameters (these are learnt from the data).

> The "a" is the intercept, and measures where the line crosses the y-axis on the plot (i.e., the predicted value of y when x is 0).

> The "b" is the slope, and measures how much the y value is predicted to change when the x value goes up by one. Positive slopes suggest that when x goes up so does y, and negative slopes suggest that when x goes up y goes down.
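
To make the roles of a and b concrete, here is a minimal sketch that plugs made-up numbers into the equation. The values of `a`, `b`, and `x` are hypothetical and chosen only for illustration; they are not estimates from any dataset.

```r
# Hypothetical parameter values (made up for illustration only)
a <- 10    # intercept: predicted y when x is 0
b <- -0.5  # slope: predicted change in y for each one-unit increase in x

# Hypothetical x values
x <- c(0, 1, 2, 10)

# Apply the linear equation y = a + b*x to every x value
y_pred <- a + b * x
print(y_pred)
```

Because the slope here is negative, every one-unit increase in x lowers the predicted y by 0.5.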

  
  

Now that we've looked at all the pieces of the equation let's try to use it! This will help us better interpret what the parameters really mean.

## Fitting a line to the data

First let's load in some packages. These have functions that other people have made, and will hopefully make our lives a lot easier!

In [6]:
#install the packages we need
install.packages("jtools")
install.packages("ggstance")
#load jtools so that we can use its functions
library(jtools)



Then let's load in the phone usage data

In [None]:
#here we will read in a csv file and place it into something called df_app
df_app <- read.csv("https://raw.githubusercontent.com/rhi-batstone/IntroPsychStats/main/IntroPychStats-main/data/app_usage.csv", header = TRUE)

#let's take a look at the data
df_app

Then let's draw a scatterplot. Here we will choose:

> What we'd like to predict (the `hours` column), which goes on the y-axis.

> What we'd like to use to help make those predictions (the `age` column), which goes on the x-axis.

In [None]:
plot(x = df_app$age, y = df_app$hours)

Rather than fitting the line by eye, let's use ordinary least squares (OLS) to find the best fit line for us! To do this we'll use a function called **lm()**, which stands for linear model (lm). We'll use this function a lot in the next few classes, so there will be plenty of time to figure all this out. Don't worry if this seems like a lot at the moment.

The **lm()** function needs us to tell it what kind of linear equation to use. In particular it needs to know what you'd like to predict and what you'd like to use to make those predictions. To do this it takes a formula, which is written in a specific way to make it easier to use:

> What you want to predict ~ what you want to use to make those predictions
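
As a sketch of how this formula input looks in practice, the example below uses `cars`, a small dataset that ships with R (it is not part of this notebook's data). It predicts stopping distance (`dist`) from speed (`speed`).

```r
# cars is a built-in R dataset with two columns: speed and dist
# We want to predict dist, and use speed to make those predictions
model_cars <- lm(dist ~ speed, data = cars)

# coef() returns the fitted intercept (a) and slope (b)
coef(model_cars)
```

The column to the left of the `~` is the one being predicted, and the column to the right is the one used to make the predictions.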

Try and fill in the equation below. You should replace the question marks below with the correct variable name (i.e., column name).


  

In [None]:
#fit a linear model
model_age <- lm(? ~ ?, data = df_app)

Let's take a look at what it found. To do this we'll use a function called **summ()** from the jtools package. It is very useful and will tell us what values of a and b it found for the best fit line.
  
> Let's also calculate a confidence interval so that we can get a sense of what the slope and intercept are for the population and not just the sample.


In [None]:
summ(model_age, confint = TRUE)

What does the output suggest are likely values for the intercept and slope?

> What is the range of population values that is compatible with our sample?
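
For comparison, confidence intervals can also be computed with base R's `confint()` function. The sketch below again uses the built-in `cars` dataset rather than the phone data, so the model name `model_cars` and its numbers are purely illustrative.

```r
# Fit a line to the built-in cars data (not the phone data)
model_cars <- lm(dist ~ speed, data = cars)

# 95% confidence intervals for the intercept and the slope
ci <- confint(model_cars, level = 0.95)
print(ci)

# Does the interval for the slope exclude 0?
ci["speed", 1] > 0 | ci["speed", 2] < 0
```

If the interval for the slope excludes 0, then a slope of 0 (no linear relationship) is not among the population values that are compatible with the sample.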

Let's take a look at the estimates a little more visually

In [None]:
#plot the estimates of the slopes
plot_summs(model_age)

Let's take a look at the regression line a little more visually

In [None]:
#plot line on the data
effect_plot(model_age, pred = age, interval = TRUE, plot.points = TRUE)

### Checking assumptions

**Assumption 1**

Let's check the assumption that the errors (residuals) are normally distributed.

In [None]:
hist(model_age$residuals)

The above plot is just like the histograms we've looked at in the past. Now we are looking at how errors are distributed.

> If the histogram does not show many small errors and only a few large errors (both positive and negative), then a normal distribution might not be the best model of the errors. We might also be missing an important variable...
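
A normal Q-Q plot is another common way to look at the same assumption. The sketch below uses simulated data (every number is made up), so the errors are normal by construction and the plots show roughly what a well-behaved case looks like.

```r
set.seed(1)  # make the simulated data reproducible

# Simulated data: a straight line plus normally distributed errors
x <- runif(100, min = 0, max = 10)
y <- 2 + 0.5 * x + rnorm(100, mean = 0, sd = 1)
model_sim <- lm(y ~ x)

# Histogram: many small errors, few large ones
hist(residuals(model_sim))

# Q-Q plot: points close to the line suggest roughly normal errors
qqnorm(residuals(model_sim))
qqline(residuals(model_sim))
```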

**Assumption 2** - no patterns in the residuals
  
Let's check the assumption that the variance in the errors is constant.

In [None]:
plot(y = model_age$residuals, x = model_age$fitted.values)
abline(h = 0, lty = 3)

The above plot shows you all the errors (residuals) for each value that the model predicts. Ideally, we'd like to see errors evenly distributed around 0 (i.e., the dashed line).

> If there is more variance in the errors for some prediction values then this means the model is better at predicting some values than others.
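
To see what a violation of this assumption can look like, here is a small simulated sketch in which the errors are deliberately made larger for larger x values. The data and the name `model_het` are invented for illustration.

```r
set.seed(2)  # make the simulated data reproducible

# Simulated data where the spread of the errors grows with x
x <- runif(200, min = 1, max = 10)
y <- 1 + 2 * x + rnorm(200, mean = 0, sd = x)
model_het <- lm(y ~ x)

# The residuals fan out as the fitted values get larger
plot(x = model_het$fitted.values, y = model_het$residuals)
abline(h = 0, lty = 3)

# Compare the spread of the residuals for small versus large fitted values
low <- model_het$fitted.values < median(model_het$fitted.values)
sd_low <- sd(model_het$residuals[low])
sd_high <- sd(model_het$residuals[!low])
c(low = sd_low, high = sd_high)
```

A funnel shape in the plot, or a clearly larger spread on one side, is the kind of pattern that suggests the variance is not constant.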

**Assumption 3** - no patterns in the residuals
   
Let's check the assumption that the relationship between your variables is linear (i.e., that a straight line and not a curvy line fits the data best). We can see this intuitively in the original scatter plot, or we can look at the residuals!

In [None]:
plot(y = model_age$residuals, x = model_age$fitted.values)
abline(h = 0, lty = 3)

The plot above is the same residual plot we saw before. If a straight line fits the data, the residuals should be scattered around the dashed line with no curved pattern. A curve in the residuals suggests that a curvy line might fit better.

There are two things to keep in mind when checking the assumptions of the linear regression.

> The first is that the assumptions do not need to be perfect to give you a reasonable estimate.

> The second is that often the way the model fails can help you build a better model.
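
As one hedged illustration of the second point, the sketch below simulates data with a curved relationship (all numbers are made up). The straight-line model leaves a U-shaped pattern in the residuals, which hints that adding a squared term might help. Adding `I(x^2)` is just one possible fix, shown here as an assumption and not as a recommendation for the phone data.

```r
set.seed(3)  # make the simulated data reproducible

# Simulated data with a curved (quadratic) relationship
x <- runif(200, min = 0, max = 5)
y <- 1 + x^2 + rnorm(200, mean = 0, sd = 1)

# A straight line misses the curve, so the residuals show a U shape
model_line <- lm(y ~ x)
plot(x = model_line$fitted.values, y = model_line$residuals)
abline(h = 0, lty = 3)

# The pattern suggests adding a squared term to the model
model_curve <- lm(y ~ x + I(x^2))

# Compare the typical size of the errors for the two models
c(line = sigma(model_line), curve = sigma(model_curve))
```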

## Handing in your work

Once you've run the code above go to Moodle and answer some questions. These questions will mostly be about interpreting the outputs of your linear regression!