<a href="https://colab.research.google.com/github/tbonne/IntroPychStats/blob/main/lm_final_exam.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src='http://drive.google.com/uc?export=view&id=1fDQUuaVfjkpHK2MpaeIQBe4UBME7l4SP' width=500>



# <font color='darkorange'>Alcohol consumption and happiness</font>

In this notebook we'll do some exploratory data analysis! 


### 1. Load in the data

Let's load in some packages. These contain functions that other people have written, and will hopefully make our lives a lot easier!

In [None]:
install.packages("jtools")
install.packages("ggstance")
library(jtools)

Load in a dataset that records how much alcohol each country consumes, along with several other measures of that country.


In [None]:
#Country happiness and alcohol consumption
df_EDA <- read.csv("https://raw.githubusercontent.com/tbonne/IntroPychStats/main/data/HappinessAlcoholConsumption.csv", header = TRUE)

#let's take a look at the data
head(df_EDA)
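
Beyond `head()`, a few base R functions give a quick overview of a data frame. The sketch below uses a small made-up data frame (`toy` and all of its values are invented for illustration; only the column names mirror ones used later in this notebook). The same calls should work on `df_EDA`.

```r
# Toy data frame; the values are invented purely for illustration
toy <- data.frame(
  HappinessScore   = c(7.5, 6.1, 4.3, NA),
  Spirit_PerCapita = c(100, 250, 30, 80),
  Hemisphere       = c("north", "south", "north", "north")
)

str(toy)       # column types and the first few values
summary(toy)   # ranges, quartiles, and NA counts for numeric columns

# Number of missing values in each column
n_missing <- colSums(is.na(toy))
n_missing
```

Checking for missing values early is useful because `lm()` drops rows with missing values by default.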

Now that we can see the data, think about a question you might like to ask or about what variable you'd like to predict. 

> E.g., what predicts the happiness level of a country?

Write out your question in words here: 


### 2. Visualize our data

**Histogram**

Histograms are a great way to see how a numeric variable is distributed. Let's take a look at a histogram of the happiness scores.

In [None]:
hist(df_EDA$HappinessScore)

**Scatterplot**

Scatterplots are a great way to explore the relationships between two variables. Let's look at the relationship between the amount of spirits consumed and the happiness score of the country.

In [None]:
plot(x=df_EDA$Spirit_PerCapita,y=df_EDA$HappinessScore) 
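
A scatterplot can be paired with a single number that summarises the strength of the linear association. A minimal sketch using simulated data (the variables `spirits` and `happiness` below are made up; on the real data you could pass `df_EDA$Spirit_PerCapita` and `df_EDA$HappinessScore` instead):

```r
set.seed(1)

# Simulated stand-ins for two numeric variables
spirits   <- runif(50, min = 0, max = 300)
happiness <- 5 + 0.004 * spirits + rnorm(50, sd = 1)

plot(x = spirits, y = happiness)

# Pearson correlation: ranges from -1 (perfect negative) to 1 (perfect positive)
r <- cor(spirits, happiness, use = "complete.obs")
r
```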

**Boxplot**

Boxplots are a great way to see how a numeric variable is distributed within certain categories. Let's look at how happiness scores are distributed within each hemisphere.

In [None]:
plot(x=factor(df_EDA$Hemisphere),y=df_EDA$HappinessScore) 
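
The numbers behind a boxplot can also be computed directly. The sketch below uses simulated data (`toy`, `group`, and `score` are hypothetical names, not columns of the alcohol dataset) to get the median and the number of observations in each category:

```r
set.seed(2)

# Simulated data: two categories with slightly different means
toy <- data.frame(
  group = rep(c("A", "B"), each = 20),
  score = c(rnorm(20, mean = 5), rnorm(20, mean = 6))
)

boxplot(score ~ group, data = toy)

# Median per category (the thick line inside each box)
group_medians <- tapply(toy$score, toy$group, median)
# Number of observations per category
group_counts <- table(toy$group)

group_medians
group_counts
```

Counts are worth checking alongside the plot, since a box built from only a handful of observations is much less reliable than one built from many.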

### 3. Define and fit a model

Now we can specify the model we'd like to fit. Let's predict HappinessScore using the amount of spirits consumed, the hemisphere the country is in, and the human development index (HDI).

> Remember, here we use the formula: "what we'd like to predict" ~ "what we'd like to use to help make those predictions."

  

In [None]:
#fit a linear model
model_EDA <- lm(HappinessScore ~ Hemisphere + HDI + Spirit_PerCapita, data=df_EDA)


This bit of code then uses our inputs to find the best-fit linear equation for:
> $y \sim Normal(\mu, \sigma) $

> $\mu = a + b_{hemisphere} * hemisphere + b_{HDI}*HDI + b_{spirits}*Spirits $
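
To make the equation concrete, the sketch below simulates data from known values of the intercept, slope, and error standard deviation, then checks that `lm()` recovers them. All names and numbers here (`a_true`, `b_true`, `sigma_true`, `x`, `y`) are made up for illustration and are not taken from the alcohol dataset.

```r
set.seed(42)

n          <- 500
a_true     <- 2.0   # intercept (a)
b_true     <- 0.5   # slope (b)
sigma_true <- 1.0   # standard deviation of the errors (sigma)

x  <- rnorm(n)
mu <- a_true + b_true * x                    # mu = a + b * x
y  <- rnorm(n, mean = mu, sd = sigma_true)   # y ~ Normal(mu, sigma)

fit <- lm(y ~ x)

coef(fit)    # estimates of a and b; should be close to 2.0 and 0.5
sigma(fit)   # estimate of sigma; should be close to 1.0
```

The estimates will not match the true values exactly, because they are computed from a finite, noisy sample. That uncertainty is what the confidence intervals describe.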


Let's use the summ function to tell us what values of a and b it found for the best fit line. 
> We'll calculate our 95% confidence interval here (i.e., confint=TRUE)

> We'll also scale our numeric variables to make them easier to compare (i.e., scale=TRUE)

In [None]:
#What does the best fit model look like?
summ(model_EDA, confint=TRUE, scale = TRUE)

Remember that the Est. column in the table above holds the estimates of the associations/differences. The 2.5% and 97.5% columns give the lower and upper bounds of the 95% confidence interval for each estimate.
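
For intuition about where such intervals come from: a 95% confidence interval for a coefficient is the estimate plus or minus a critical t-value times its standard error. The base R sketch below uses the built-in `mtcars` dataset (a stand-in chosen for illustration, not the alcohol data) to compare a hand calculation with `confint()`:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

est <- coef(summary(fit))[, "Estimate"]
se  <- coef(summary(fit))[, "Std. Error"]

# Critical value of the t distribution for a two-sided 95% interval
t_crit <- qt(0.975, df = df.residual(fit))

by_hand <- cbind(lower = est - t_crit * se,
                 upper = est + t_crit * se)

by_hand
confint(fit, level = 0.95)   # built-in equivalent
```

If an interval excludes zero, the data are reasonably consistent about the sign of that association; a wide interval signals more uncertainty about its magnitude.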

### 4. Visualize the results

Let's take a look at the estimates in our summary table a little more visually. 
> The points in the plot below are the model estimates (i.e., associations/differences).

> The horizontal lines show us the 95% confidence intervals of those estimates.

> If scale = TRUE, we can also see which associations/differences contribute the most to the predictions.

In [None]:
#plot the estimates of the slopes
plot_summs(model_EDA, scale=TRUE)

Let's take a look at the regression line a little more visually.
> The plot below shows us the predicted line or difference 

> This can be plotted for each variable that you used to help make predictions

> This plot can help you better understand the relationship between one of your predictor variables and the outcome you are trying to predict.

Let's first take a look at what the model predicts about happiness as we change the amount of spirits consumed in a country.

In [None]:
#plot line on the data
effect_plot(model_EDA, pred = Spirit_PerCapita, interval = TRUE, plot.points = TRUE)
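
Roughly speaking, a plot like this traces the model's predictions as one predictor changes while the others are held fixed. The sketch below illustrates that idea in base R with the built-in `mtcars` dataset (a stand-in, not the alcohol data), holding the second predictor at its mean. It is an approximation of the idea under those assumptions, not a description of how `effect_plot()` works internally.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Vary wt across its observed range; hold hp at its mean
grid <- data.frame(
  wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 50),
  hp = mean(mtcars$hp)
)

# Predicted values with a 95% confidence band
pred <- predict(fit, newdata = grid, interval = "confidence", level = 0.95)

plot(mtcars$wt, mtcars$mpg, xlab = "wt", ylab = "mpg")
lines(grid$wt, pred[, "fit"])            # predicted line
lines(grid$wt, pred[, "lwr"], lty = 2)   # lower bound of the band
lines(grid$wt, pred[, "upr"], lty = 2)   # upper bound of the band
```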

Next, let's look at what the model predicts about happiness as we change the human development index of a country.

In [None]:
#plot line on the data
effect_plot(model_EDA, pred = HDI, interval = TRUE, plot.points = TRUE)

Next, let's look at what the model predicts about happiness as we change the hemisphere of a country.

In [None]:
#plot line on the data
effect_plot(model_EDA, pred = Hemisphere, interval = TRUE, plot.points = TRUE)

### 5. Checking assumptions

**Assumption 1**

Let's check the assumption that the errors (residuals) are normally distributed.

In [None]:
hist(model_EDA$residuals)

The above plot is just like the histograms we've looked at in the past. Now we are looking at how errors are distributed.

> If the errors do not show many small errors and few large errors (both positive and negative), then a normal distribution might not be the best model of the data. We might also be missing an important variable...
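
A normal quantile-quantile (Q-Q) plot is a common complement to the histogram: if the residuals are roughly normal, the points fall close to the reference line. A minimal sketch, with the built-in `mtcars` dataset standing in for the model above (on this notebook's model the same calls would take `model_EDA$residuals`):

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
res <- residuals(fit)

qqnorm(res)   # sample quantiles against theoretical normal quantiles
qqline(res)   # reference line through the first and third quartiles

# Residuals from a least-squares fit with an intercept average to zero
mean(res)
```

Points that bend away from the line at the ends indicate heavier or lighter tails than a normal distribution would produce.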

**Assumption 2** - no patterns in the residuals
  
Let's check the assumption that the variance in the errors is constant.

In [None]:
plot(y=model_EDA$residuals, x=model_EDA$fitted.values)
abline(h = 0, lty=3)

The above plot shows you all the errors (residuals) for each value that the model predicts. Ideally, we'd like to see errors evenly distributed around 0 (i.e., the dashed line).

> If there is more variance in the errors for some prediction values then this means the model is better at predicting some values than others. 
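
One way to look at the spread more directly is to plot the size of the residuals, ignoring their sign, against the predictions. The sketch below uses the built-in `mtcars` dataset as a stand-in for the model above, and adds a crude numeric comparison of the residual spread in the lower and upper halves of the predictions (the split at the median is an arbitrary choice made for illustration).

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Square root of the absolute standardized residuals against fitted values;
# a roughly flat band of points suggests constant variance
spread <- sqrt(abs(rstandard(fit)))
plot(x = fitted(fit), y = spread)

# Crude check: residual standard deviation below vs above the median prediction
upper_half <- fitted(fit) > median(fitted(fit))
sd_by_half <- tapply(residuals(fit), upper_half, sd)
sd_by_half
```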

**Assumption 3** - no patterns in the residuals
   
Let's check the assumption that the relationship between your variables is linear (i.e., that a straight line and not a curvy line fits the data best). We can see this intuitively in the original scatter plot, or we can look at the residuals!

In [None]:
plot(y=model_EDA$residuals, x=model_EDA$fitted.values)
abline(h = 0, lty=3)

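The plot above shows the residuals against the model's predictions again, this time to check linearity. If a straight line fits the data well, the residuals should scatter evenly around the dashed line across the whole range of predictions. A curved pattern (e.g., mostly positive, then negative, then positive again) suggests that a curvy line might fit better.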
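
Curvature in a residual plot can be easier to spot with a smoothed trend line drawn through the points. A minimal sketch, using the built-in `mtcars` dataset as a stand-in for the model above:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

plot(x = fitted(fit), y = residuals(fit))
abline(h = 0, lty = 3)

# Smoothed trend of the residuals; a clear bend away from the dashed
# line hints that a straight line is missing some curvature
trend <- lowess(x = fitted(fit), y = residuals(fit))
lines(trend, col = "red")
```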

There are two things to keep in mind when checking the assumptions of the linear regression.

> The first is that the assumptions do not need to be met perfectly to give you a reasonable estimate.

> The second is that often the way the model fails can help you build a better model.

### 6. Interpret and communicate the results

From the results above, how would you answer the question you posed in section 1?  
> What is the association between the variables that you tested?

> What does the confidence interval tell you about how certain you are in the sign and magnitude of that association?

> How "good" are your model predictions?

> How closely does your model meet the model assumptions?
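
Two common summaries of how "good" a linear model's predictions are can be computed in base R. The sketch below uses the built-in `mtcars` dataset as a stand-in; on this notebook's model, `fit` would be replaced by `model_EDA`.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

# R-squared: proportion of variance in the outcome explained by the model
r2 <- summary(fit)$r.squared

# Root mean squared error: typical size of a prediction error,
# in the units of the outcome
rmse <- sqrt(mean(residuals(fit)^2))

c(r_squared = r2, rmse = rmse)
```

Both numbers describe fit to the data the model was trained on, so they tend to be optimistic about predictions for new data.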
