<h1><center>Data science for Geographers</center></h1>

<h2><center>Practical 2 - Data modelling in R using multiple regression</center></h2>

![image2.png](attachment:image2.png)

# Practical 2: Data analysis using R

## Contents

- 1. Introduction
    - Reminder: What are we trying to do?
- 2. Summarising and visualising
    - Some variable cleaning and recoding
    - Summary statistics and tables
    - Plots (boxplots and histograms)
- 3. Linear Regression in R
    - With one predictor
    - With more than one predictor

## 1. Introduction

### Reminder: What are we trying to do?

If you remember from the previous practical, we are going through all of the steps for a research project using secondary quantitative data. In our particular example, we are undertaking a research project where we are interested in the geography of the availability of tobacco products (i.e. similar to the paper below):

https://tobaccocontrol.bmj.com/content/tobaccocontrol/25/1/75.full.pdf

Last week, we went through the process of cleaning, recoding and merging the datasets together so that by the end we had a dataset ready to answer our research questions. This final dataset is an area-level dataset (as opposed to individual-level data, for example). Our unit of analysis is the datazone, and for each datazone we have information on the level of deprivation (in terms of employment levels, health, education and income), the number of tobacco retailers per unit of population, smoking rates among pregnant women and whether the area is urban or rural. Since there are approximately 6500 datazones in Scotland, we now have a dataset which looks like this, with approximately 6500 rows of data:

![image.png](attachment:image.png)

So, we have got the boring part out of the way! Now it's on to the interesting part, the actual analysis of this data! Let's first establish the three research questions. Today we are going to showcase how we might approach answering questions like these:

1. Is the number of tobacco retailers in urban areas associated with smoking rates in datazones in Scotland?
2. Is this association independent of the level of deprivation in these datazones?
3. Is the level of deprivation in urban datazones associated with the number of tobacco retailers?

The first thing we have to do, as always, is load in the packages we need. As with last week, everything we need is in the tidyverse collection of packages. So, let's import that now...

In [None]:
#First let's load in the package we need as before
library(tidyverse)

And now let's read the data we created in last week's practical...

In [None]:
analysis_data <- read_csv("merged_data.csv")

And take a look at the data to remind ourselves what is there...

In [None]:
head(analysis_data)

### Exercise

To refresh our memories from last week, let's do one more variable recode to see what you can remember:

1. Create new variables called `health_quintile` and `retailers_adj_quintile`. These should be based on the existing `health_rank` and `retailers_adj` variables but dividing them into five equal groups.

2. Change these two new variables so that they are factor variables. Label `health_quintile` the same as the other "quintile" variables, i.e. "Most deprived", "Second", "Third", "Fourth" and "Least deprived". Label `retailers_adj_quintile` as you see fit.

3. Change the variables `simd_quintile` and `simd_decile` into factors and relabel appropriately (you'll need to think of labels for the `simd_decile` variable).

Bonus points if you can figure out how to do tasks 1 and 2 in one command using the `%>%` pipe!

<b>Hint: you can use `mutate` and the `ntile` command here. Refer back to last week's notebook if you get stuck!</b>
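As a rough illustration of how these pieces fit together, here is a minimal sketch on a made-up data frame (the `toy` data and its `rank` column are invented for this example and are not part of our dataset). `ntile()` splits the values into equal-sized groups numbered from 1, and `as.factor()` plus `fct_recode()` (from the forcats package, part of the tidyverse) then attach readable labels.

```r
library(dplyr)
library(forcats)

# Made-up data: ten areas with ranks 1 to 10 (not our real dataset)
toy <- data.frame(rank = 1:10)

toy <- toy %>%
    # ntile() puts the two lowest ranks in group 1, the next two in group 2, ...
    mutate(rank_quintile = ntile(rank, 5)) %>%
    # Convert to a factor and swap the numbers for labels
    mutate(rank_quintile = as.factor(rank_quintile) %>%
      fct_recode("Lowest" = "1", "Second" = "2",
      "Third" = "3", "Fourth" = "4", "Highest" = "5"))

# Each label should appear twice
toy %>%
    count(rank_quintile)
```

Running `count()` straight after a recode like this is a cheap way to confirm that the groups are the size you expect.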

In [None]:
#1
analysis_data <- analysis_data %>%
    mutate(health_quintile = ntile(health_rank, 5))

analysis_data <- analysis_data %>%
    mutate(retailers_adj_quintile = ntile(retailers_adj, 5))

#2 
analysis_data <- analysis_data %>%
    mutate(retailers_adj_quintile = as.factor(retailers_adj_quintile) %>%
      fct_recode("Lowest availability" = "1", "Second" = "2",
      "Third" = "3", "Fourth" = "4", "Highest availability" = "5"))

analysis_data <- analysis_data %>%
    mutate(health_quintile = as.factor(health_quintile) %>%
      fct_recode("Most deprived" = "1", "Second" = "2",
      "Third" = "3", "Fourth" = "4", "Least deprived" = "5"))

#bonus solution!
analysis_data <- analysis_data %>%
    mutate(health_quintile = ntile(health_rank, 5)) %>%
    mutate(health_quintile = as.factor(health_quintile) %>%
      fct_recode("Most deprived" = "1", "Second" = "2",
      "Third" = "3", "Fourth" = "4", "Least deprived" = "5")) %>%
    mutate(retailers_adj_quintile = ntile(retailers_adj, 5)) %>%
    mutate(retailers_adj_quintile = as.factor(retailers_adj_quintile) %>%
      fct_recode("Lowest availability" = "1", "Second" = "2",
      "Third" = "3", "Fourth" = "4", "Highest availability" = "5"))


#3
analysis_data <- analysis_data %>%
    mutate(simd_quintile = as.factor(simd_quintile) %>%
      fct_recode("Most deprived" = "1", "Second" = "2",
      "Third" = "3", "Fourth" = "4", "Least deprived" = "5")) %>%
    mutate(simd_decile = as.factor(simd_decile) %>%
      fct_recode("Most deprived" = "1",
                 "Second" = "2",
                 "Third" = "3",
                 "Fourth" = "4",
                 "Fifth" = "5",
                 "Sixth" = "6",
                 "Seventh" = "7",
                 "Eighth" = "8",
                 "Ninth" = "9",
                 "Least deprived" = "10"))
      


## 2. Summarising and visualising

In [None]:
#Solutions

#1
analysis_data <- analysis_data %>%
    mutate(health_quintile = ntile(health_rank, 5))

#2
analysis_data <- analysis_data %>%
    mutate(health_quintile = as.factor(health_quintile) %>%
      fct_recode("Most deprived" = "1", "Second" = "2",
      "Third" = "3", "Fourth" = "4", "Least deprived" = "5"))
           

### Summary statistics and tables

Before we get stuck into the modelling for our research questions, let's draw up some plots and descriptives of our data. You should have some familiarity with these commands from "Key Methods".


First let's look at some of our categorical variables using `count`.

In [None]:
#Tables for the categorical variables
analysis_data %>% 
  count(urban_rural)

analysis_data %>% 
  count(simd_quintile)

analysis_data %>% 
  count(simd_decile)

Now let's look at some of the continuous variables using `summary()`...

Remember when looking at the below code that using the `$` is a method for selecting a particular variable from a dataset...

In [None]:
#Summaries for two of the continuous variables
summary(analysis_data$simd_rank)
summary(analysis_data$total_population)
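To make the `$` idea concrete, here is a small sketch on an invented data frame (`toy` and its columns are made up for illustration). `toy$population` pulls out a single column as a plain vector, which is what `summary()` then works on.

```r
# Made-up data frame (not our real dataset)
toy <- data.frame(
    area = c("A", "B", "C", "D"),
    population = c(500, 750, 1000, 1250)
)

# $ selects one column and returns it as a vector
toy$population

# Double square brackets with the column name do the same thing
identical(toy$population, toy[["population"]])

# So summary() here only sees the population values
summary(toy$population)
```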

Note that you can also look at all variables together by just using the `summary()` command with just the data object. Like...

In [None]:
summary(analysis_data)

You might notice that R does not produce the same kind of summary for every variable. Because we have told R which of our variables are factors and which are numerical (continuous), it knows how to summarise each one: numerical variables get a minimum, quartiles, mean and maximum, while factors get a count for each category.
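A quick way to see how `summary()` treats different kinds of variable is to try it on a tiny invented data frame (the `toy` data below is made up for illustration) and compare what R prints for each column.

```r
# Made-up data frame with one numerical and one factor column
toy <- data.frame(
    smoking_rate = c(0.10, 0.15, 0.20, 0.25),
    area_type = factor(c("Urban", "Urban", "Rural", "Urban"))
)

# Check how R has stored each column
sapply(toy, class)

# Numerical column: minimum, quartiles, mean and maximum
summary(toy$smoking_rate)

# Factor column: number of rows in each level
summary(toy$area_type)

# Whole data frame: R picks the matching summary for each column
summary(toy)
```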

### Plots (boxplots and histograms)

It is also good practice to look at our data using graphs and plots, particularly boxplots and histograms.

Let's look at the `retailers_adj` variable.

In [None]:
#Histogram and boxplot for the tobacco retailers
ggplot(data = analysis_data) +
  geom_histogram(mapping = aes(x = retailers_adj), binwidth = 0.5)

ggplot(data = analysis_data, mapping = aes(x = retailers_adj)) +
  geom_boxplot()

#Some other plots we might want to look at:

#ggplot(data = analysis_data) +
#  geom_histogram(mapping = aes(x = smoking_rate), binwidth = 0.1)

Looking at these graphs, it seems we need to trim some outlying observations (removing observations where `retailers_adj` is greater than 100).

In [None]:
analysis_data <- analysis_data %>%
    filter(retailers_adj <= 100)
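Before trimming, it can be worth counting how many rows the filter will drop. The sketch below uses an invented data frame (`toy` is made up for illustration) to show this, and also that `filter()` drops rows where the condition is `NA`, so datazones with a missing `retailers_adj` value would be removed too.

```r
library(dplyr)

# Made-up data: one extreme value and one missing value
toy <- data.frame(retailers_adj = c(2, 5, 8, 250, NA))

# How many rows are above the cut-off, and how many are missing?
sum(toy$retailers_adj > 100, na.rm = TRUE)
sum(is.na(toy$retailers_adj))

# filter() keeps only rows where the condition is TRUE,
# so both the extreme value and the missing value are dropped
trimmed <- toy %>%
    filter(retailers_adj <= 100)

nrow(toy)
nrow(trimmed)

# To keep the missing values, say so explicitly
trimmed_keep_na <- toy %>%
    filter(retailers_adj <= 100 | is.na(retailers_adj))

nrow(trimmed_keep_na)
```

Whether missing values should be kept depends on the analysis. The point is only that the choice is made deliberately.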

In many cases, boxplots are more useful when combined with categorical variables...

In [None]:
ggplot(data = analysis_data, mapping = aes(x = simd_quintile, y = smoking_rate)) +
  geom_boxplot()

ggplot(data = analysis_data, mapping = aes(x = retailers_adj_quintile, y = smoking_rate)) +
  geom_boxplot()

ggplot(data = analysis_data, mapping = aes(x = simd_quintile, y = retailers_adj)) +
  geom_boxplot()
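The thick line in each box is the group median, so a table of medians by group is a handy numerical companion to these plots. A minimal sketch on invented data (the `toy` data frame and its values are made up for illustration):

```r
library(dplyr)

# Made-up data: a grouping variable and a continuous outcome
toy <- data.frame(
    quintile = factor(c("Low", "Low", "Low", "High", "High", "High"),
                      levels = c("Low", "High")),
    smoking_rate = c(0.10, 0.12, 0.20, 0.25, 0.30, 0.32)
)

# One row per group: number of areas, median and interquartile range
toy_summary <- toy %>%
    group_by(quintile) %>%
    summarise(n = n(),
              median_rate = median(smoking_rate),
              iqr_rate = IQR(smoking_rate))

toy_summary
```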

## 3. Linear Regression in R

We have seen a hint of some of the results to our research questions but let's complete our analysis formally using regression.

As a reminder, here are our research questions:

1. Is the number of tobacco retailers associated with smoking rates in datazones in Scotland?
2. Is this association independent of the level of deprivation in these datazones?
3. Is the level of deprivation in datazones associated with the number of tobacco retailers?

Since we are only interested in urban areas let's remove all non-urban areas first...

In [None]:
#Create a new analysis file
urban_only <- analysis_data %>%
    filter(urban_rural_2cat =="Urban")

In [None]:
head(urban_only)

Ok, now we can tackle question 1 using a linear regression: 

In [None]:
#Run a linear regression model
single_linear_regression <- lm(smoking_rate ~ retailers_adj, data=urban_only, na.action=na.exclude)

This has produced the model and saved it in an object called `single_linear_regression`. We now need to view this output using `summary()`...

In [None]:
summary(single_linear_regression)

We also want confidence intervals...

In [None]:
confint(single_linear_regression)
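To see how the pieces of the output relate to each other, here is a small sketch on simulated data (the variables `x` and `y` and the numbers used to generate them are invented, so the results say nothing about tobacco retailers). The slope from `coef()` is the estimated change in the outcome for a one-unit increase in the predictor, and `confint()` gives a 95% confidence interval around it by default.

```r
set.seed(1)

# Simulated data: y goes up by about 0.5 for each extra unit of x
toy <- data.frame(x = runif(200, min = 0, max = 10))
toy$y <- 1 + 0.5 * toy$x + rnorm(200)

toy_model <- lm(y ~ x, data = toy)

# Estimated intercept and slope
coef(toy_model)

# 95% confidence intervals (the default level)
toy_ci <- confint(toy_model)
toy_ci

# If the interval for the slope does not contain 0, the association
# is statistically significant at the 5% level
slope_excludes_zero <- toy_ci["x", 1] > 0 || toy_ci["x", 2] < 0
slope_excludes_zero
```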

Let's look at research question 2:

2. Is this association independent of the level of deprivation in these datazones?

In [None]:
#Add deprivation (SIMD rank) as a second predictor, still using the urban-only data
multiple_linear_regression <- lm(smoking_rate ~ retailers_adj + simd_rank, data=urban_only, na.action=na.exclude)
summary(multiple_linear_regression)
confint(multiple_linear_regression)

We can also use categorical predictors if we wish. In other words, we can use the quintile versions of our variables, starting with `retailers_adj_quintile`.

<b>Bonus Question! Before we run this command, can you think why we might want to use a categorical variable, rather than a continuous one?</b>

In [None]:
multiple_linear_regression <- lm(smoking_rate ~ retailers_adj_quintile, data=urban_only, na.action=na.exclude)
summary(multiple_linear_regression)
confint(multiple_linear_regression)
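When a factor goes into `lm()`, R compares every level with a reference level (by default the first level of the factor). The sketch below uses simulated data (the `group` and `y` variables are invented for illustration) to show how to read those coefficients.

```r
set.seed(2)

# Simulated data: three groups with different average outcomes
toy <- data.frame(
    group = factor(rep(c("Low", "Middle", "High"), each = 50),
                   levels = c("Low", "Middle", "High"))
)
group_means <- c(Low = 10, Middle = 12, High = 15)
toy$y <- unname(group_means[as.character(toy$group)]) + rnorm(150)

toy_model <- lm(y ~ group, data = toy)

# The first level ("Low") is the reference category:
# - (Intercept) is the average of y in the "Low" group
# - groupMiddle and groupHigh are differences from "Low"
coef(toy_model)

# The same numbers can be recovered from the group averages
observed_means <- tapply(toy$y, toy$group, mean)
observed_means
```

In our own models the reference category is whichever level comes first in the factor, so it is worth checking `levels()` before interpreting the output.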

And now let's add the categorical SIMD variable, `simd_quintile`...

In [None]:
multiple_linear_regression <- lm(smoking_rate ~ retailers_adj_quintile + simd_quintile, data=urban_only, na.action=na.exclude)
summary(multiple_linear_regression)
confint(multiple_linear_regression)

What is going on here? When we add new variables to a model we are essentially adding more dimensions to the data. This can be illustrated by the images below (ignore the axis labels!). The top image is a straightforward linear regression with one outcome and one predictor. The second is a regression with two predictor variables.   

![image-2.png](attachment:image-2.png) ![image.png](attachment:image.png)

We can also add more than two variables (although we can't show this on a graph, as we cannot go beyond three dimensions!). But the principle is the same. In practical terms, what we say is that we are "adjusting" or "controlling" for the other variables. In other words, any effects we observe in our model we can say occur "independently" of the other variables. If we had observed a significant effect of tobacco retailers after adjusting for deprivation, we would conclude that tobacco retailers are associated with smoking irrespective of the level of deprivation in an area.
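The idea of "adjusting" can be seen in a small simulation (all variables and numbers below are invented for illustration and are not estimates from our data). Here `deprivation` drives both `retailers` and `smoking`, while `retailers` has no effect of its own, so the retailers coefficient should be clearly positive on its own and shrink towards zero once deprivation is added.

```r
set.seed(3)
n <- 1000

# Simulated data: deprivation influences both retailers and smoking,
# but retailers has no direct effect on smoking
deprivation <- rnorm(n)
retailers <- deprivation + rnorm(n)
smoking <- 2 * deprivation + rnorm(n)
toy <- data.frame(deprivation, retailers, smoking)

# Unadjusted model: retailers looks associated with smoking
unadjusted <- lm(smoking ~ retailers, data = toy)

# Adjusted model: the association is now estimated holding deprivation fixed
adjusted <- lm(smoking ~ retailers + deprivation, data = toy)

# Compare the retailers coefficient before and after adjustment
coef(unadjusted)["retailers"]
coef(adjusted)["retailers"]
```

This is the pattern to look for when comparing the models for research questions 1 and 2: how much does the retailers coefficient change once deprivation is in the model?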

Let's look at research question 3:

3. Is the level of deprivation in datazones associated with the number of tobacco retailers?

To answer this, let's put all of the different parts of the SIMD into our model (i.e. the `income_rank`, `health_rank`, `employment_rank`, `education_rank`, `access_rank`, `crime_rank` and `housing_rank` variables).

In [None]:
#Look at predicting retailers
multiple_linear_regression <- lm(retailers_adj ~ income_rank + 
                                 health_rank + 
                                 employment_rank + 
                                 education_rank + 
                                 access_rank + 
                                 crime_rank + 
                                 housing_rank, 
                                 data=urban_only, na.action=na.exclude)

summary(multiple_linear_regression)
confint(multiple_linear_regression)

## Independent analysis exercises

Take a look at some of the remaining variables. What other research questions might you explore? To get you started, look at the two health variables we haven't looked at yet: `alcohol_admissions` and `drug_admissions`. For the remainder of the class, have a think about a simple research question you might examine using either of these as a dependent variable. And then:

1. Think about how you might need to recode these variables and then have a go at carrying this out
2. Produce relevant descriptive statistics that might start to address your research question
3. Run an appropriate regression model
4. Have a think about what other additional variables you might need to include, in addition to your independent variable of interest, and add these to your model.
5. Have a think about how you might interpret the analysis, and perhaps write a few words of interpretation in a markdown cell below the results.