# Module 4: Regression Basics

In this module, you will learn how to fit and plot a regression line using R. 

## Review of Scatterplots

Before we can do any analysis, we have to import our data into R. In this module, we will work with a new dataset that focuses on a bicycle sharing system that operates in and around Washington, DC in the US. This dataset contains several variables over 2 years, but we focus on just the average daily temperature and number of bicycles used over 100 days. Temperature is measured in degrees Celsius. Both temperature and number of bicycles are numeric variables.

Let's import the bicycle dataset into R using the "read.csv()" function. The name of the dataset file is "Bikes.csv". This dataset contains variable names, so we need to set the "header" input equal to "TRUE", or just leave it out. Let's import our dataset now, and call it "data.bikes".

In [None]:
data.bikes = read.csv(file = "Bikes.csv")

Notice that we left out the "header" input, so R uses the default value, "TRUE". Let's check that we imported the dataset correctly by printing out the first few lines using the "head()" function.

In [None]:
head(data.bikes)

This looks right. Now we can move on.
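If you would like to see what the "header" input actually changes, here is a small sketch. It does not use "Bikes.csv"; instead it writes a made-up three-row data frame (called "toy", a name chosen only for this illustration) to a temporary file and reads it back in three ways.

```r
# Hypothetical toy data (not the real bicycle dataset), written to a temporary file
toy = data.frame(Temp = c(10.5, 12.0, 15.2), Bikes = c(1200, 1350, 1600))
toy.path = tempfile(fileext = ".csv")
write.csv(toy, file = toy.path, row.names = FALSE)

# Leaving out "header" and setting it to TRUE give the same data frame
toy.default = read.csv(file = toy.path)
toy.explicit = read.csv(file = toy.path, header = TRUE)
print(identical(toy.default, toy.explicit))

# With header = FALSE, the variable names are read in as an extra row of data
toy.noheader = read.csv(file = toy.path, header = FALSE)
print(head(toy.noheader))
```

In the last printout, the variable names show up as the first row of data, which is why we want "header" to be "TRUE" for files like ours.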

Now that we have imported our data, let's make a scatterplot. We discussed how to make scatterplots in Module 1, so we will only briefly review them here.

Remember that we make scatterplots in R with the "plot()" function. The "plot()" function has 2 main inputs, "x" and "y". These inputs are the variables that go on the x-axis (horizontal) and y-axis (vertical). We can also use the optional inputs "main", "xlab" and "ylab" to add a title, x-axis label and y-axis label respectively.

Let's make a scatterplot with temperature on the x-axis and number of bicycles on the y-axis. We should also include appropriate axis labels and a title. Remember that we access the temperature and number of bicycles from inside "data.bikes" by using "$".

In [None]:
plot(x=data.bikes$Temp, y=data.bikes$Bikes, xlab="Temperature (Degrees C)", ylab = "Number of Bicycles Rented", 
     main = "Temperature and Bike Rentals")

## Fitting Regression Lines

Working with the bicycle dataset, we would like to determine if there is a relationship between temperature and number of bicycles rented. A powerful tool we have for investigating this question is linear regression. We will use temperature as a predictor variable and number of bicycles rented as a response variable. That is, we are going to try to estimate the number of bicycles rented based on the temperature. Our regression analysis starts by fitting a regression line.

In order to fit a regression line, we must calculate the slope and intercept of that line. The formulas for these are messy, and hard to use with more than a few points. The bicycle dataset that we are working with has 100 observations, and the original dataset has over 700. You would definitely not want to calculate the slope and intercept for this dataset by hand.
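To give a feel for what a by-hand calculation involves, here is a minimal sketch using the standard least-squares formulas on five made-up points. The variables "x" and "y" are hypothetical toy values, not part of the bicycle dataset, and this module does not ask you to compute coefficients this way.

```r
# Hypothetical toy data (not the bicycle dataset), small enough to check by hand
x = c(1, 2, 3, 4, 5)
y = c(2, 4, 5, 4, 5)

# Deviations of each variable from its mean
dev.x = x - mean(x)
dev.y = y - mean(y)

# Least-squares slope: sum of cross-products divided by sum of squared x deviations
slope = sum(dev.x * dev.y) / sum(dev.x^2)

# Least-squares intercept: the line passes through the point (mean(x), mean(y))
intercept = mean(y) - slope * mean(x)

print(slope)      # 0.6
print(intercept)  # 2.2
```

Even with only five points this takes several steps, which is why we would rather not do it for 100 observations.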

Fortunately, R is very good at calculating slopes and intercepts. In fact, there is a function that calculates these and other useful things for doing regression analysis. This function is called "lm()" (for Linear Model; the regression line is one type of linear model). 

The "lm()" function takes a special type of input called a formula. The relationship that we want to investigate is whether the number of bicycles rented depends on the temperature. The symbol in R for "depends on" is "~". Therefore, formulas in R are always written as the response variable, then a "~", then the predictor variable. Writing the relationship we are interested in as an R formula is then "Bikes ~ Temp", which says that Bikes "depends on" Temp.

The first input for the "lm()" function is called "formula", and it is a formula that describes the relationship we are interested in. The second input for the "lm()" function is called "data", and it is the data frame that contains the variables we are studying. 

In order to investigate whether the number of bicycles depends on the temperature, we need to set formula equal to "Bikes ~ Temp" (this says that the number of bicycles rented depends on the temperature). These variables are contained in our data frame called "data.bikes", so we need to set "data" equal to "data.bikes". Let's do this now and call the results reg.bikes.

In [None]:
reg.bikes = lm(formula = Bikes ~ Temp, data = data.bikes)

The output we get from "lm()" is called an lm object, because it is the object we get from the "lm()" function. That is, "reg.bikes" is an lm object.

We have now fit a regression line, but R hasn't actually told us what the slope and intercept are. We have to print our new object to see its results. Let's do this now.

In [None]:
print(reg.bikes)

The part after "Call:" tells you what code you wrote to fit this regression model. This can be helpful to make sure that you wrote your formula correctly and that you specified the correct data frame. 

The really interesting part is after "Coefficients". This tells you the values of your regression coefficients, which is another name for the slope and intercept. The value of the intercept is easy to find: it's the one under "(Intercept)". In this case, our intercept is equal to 604.66. 

The value of our slope is under "Temp", in this case equal to 99.66 (it's just a coincidence that these decimals are the same). It's less obvious why the slope is labelled "Temp", but this turns out to be useful in later courses when you want to fit regression models with multiple predictors. In order to make sense in this more complicated scenario, R always labels the slope with the variable it corresponds to. 

If this part doesn't really make sense, that's fine. Just know that the slope of a regression line is always labelled with the name of your predictor variable. In this case, the label is "Temp", and its value is 99.66.
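To see how the slope and intercept work together, here is a short sketch that uses the two coefficients reported above to estimate the number of bicycles rented at a given temperature. The temperature of 20 degrees is a hypothetical value chosen only for illustration, and the names "intercept.bikes", "slope.bikes", "temp.new" and "pred.bikes" are made up for this example.

```r
# Coefficients as reported in the printout described above
intercept.bikes = 604.66
slope.bikes = 99.66

# Hypothetical temperature (degrees Celsius), chosen only for illustration
temp.new = 20

# A point on the regression line: intercept + slope * predictor
pred.bikes = intercept.bikes + slope.bikes * temp.new
print(pred.bikes)  # 2597.86
```

In words, the fitted line estimates about 2598 bicycles rented on a day with an average temperature of 20 degrees Celsius.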

## Plotting Regression Lines

We have now fit a regression line to the bicycle data, but we only have numbers for the slope and intercept. This is an important first step, but it doesn't tell us much unless we can see the regression line. Conveniently, R is also very good at plotting regression lines.

Your first instinct may be to use the "plot()" function on the lm object we made in the last section (called "reg.bikes"). This does do something, and it will be useful later, but for now it's more information than we need. What we really want to do is add our regression line to a scatterplot of temperature and number of bicycles rented. The function that does this in R is called "abline()" (the name comes from the equation of a line, y = a + bx, where "a" is the intercept and "b" is the slope).

The "abline()" function in R can be used in several different ways, each with different inputs. Right now, we will use the input called "reg". If we set "reg" equal to an lm object, then the "abline()" function will automatically add that regression line to the last scatterplot you made.

Let's try this now. First, we need to make a new scatterplot with number of bikes rented on the y-axis and temperature on the x-axis. We should make sure that our title says something about the regression line we will be adding. Then we can add the regression line that we calculated in the last section. We do this using the "abline()" function, and set "reg" equal to "reg.bikes", the regression object that we computed.

In [None]:
plot(x=data.bikes$Temp, y=data.bikes$Bikes, xlab="Temperature (Degrees C)", ylab = "Number of Bicycles Rented", 
     main = "Regression Line of Bike Rentals vs Temperature")
abline(reg = reg.bikes)

Notice that we made our scatterplot and added our regression line in the same cell. If we split these functions across 2 cells, the "abline()" function won't have a scatterplot to add its line to.
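As an illustration of one of the other ways to use "abline()", the sketch below draws the same line twice: once with the "reg" input, and once by passing the intercept and slope directly through the "a" and "b" inputs. It uses a made-up five-row data frame called "toy" (hypothetical values, not the bicycle dataset) so that the cell runs on its own, and it pulls the coefficients out of the lm object with the "coef()" function.

```r
# Hypothetical toy data, used only so this cell runs on its own
toy = data.frame(Temp = c(5, 10, 15, 20, 25),
                 Bikes = c(1100, 1650, 2050, 2600, 3100))
reg.toy = lm(formula = Bikes ~ Temp, data = toy)

# coef() returns a named vector: the intercept first, then the slope
coef.toy = coef(reg.toy)
print(coef.toy)

plot(x = toy$Temp, y = toy$Bikes, xlab = "Temperature (Degrees C)",
     ylab = "Number of Bicycles Rented",
     main = "Two Ways to Draw the Same Regression Line")

# Way 1: give abline() the lm object
abline(reg = reg.toy)

# Way 2: give abline() the intercept (a) and slope (b) directly
abline(a = coef.toy[["(Intercept)"]], b = coef.toy[["Temp"]], lty = 2, col = "red")
```

The dashed red line should sit exactly on top of the solid black one, because both calls describe the same intercept and slope.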

## Residual Plots

Residuals are a very useful tool for checking how well our regression line fits the data. Plotting the residuals allows us to check many of the assumptions required for the regression line to be appropriate. Remember that the residual for each observation is its value on the response variable minus its predicted value based on the regression line. We could calculate these manually, but R has a function that does it for us.

The function in R for calculating residuals is called "resid()". The "resid()" function has only 1 input, called "object", and this is the regression object that we want to get residuals for. We then store these residuals under another name.

Let's get the residuals from our "reg.bikes" regression object, and call the results "resid.bikes".

In [None]:
resid.bikes = resid(reg.bikes)

Let's print the first few residuals to make sure nothing looks weird.

In [None]:
head(resid.bikes)

Notice that some of these residuals are positive and some are negative. This is fine. Positive residuals correspond to points above the regression line, and negative residuals correspond to points below the line.
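To connect "resid()" back to the definition of a residual (observed value minus predicted value), here is a small sketch that computes residuals both ways and checks that they agree. It uses a made-up data frame called "toy" (hypothetical values, not the bicycle dataset) and the "fitted()" function, which returns the predicted value on the regression line for each observation.

```r
# Hypothetical toy data, used only so this cell runs on its own
toy = data.frame(Temp = c(5, 10, 15, 20, 25),
                 Bikes = c(1100, 1650, 2050, 2600, 3100))
reg.toy = lm(formula = Bikes ~ Temp, data = toy)

# Residuals from R's built-in function
resid.toy = resid(reg.toy)

# Residuals by the definition: observed response minus predicted response
manual.resid = toy$Bikes - fitted(reg.toy)

# The two sets of residuals should match
print(all.equal(unname(manual.resid), unname(resid.toy)))

# A positive residual means the observed point is above the line
print(resid.toy > 0)
```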

Looking at the values of the residuals doesn't tell us very much. What we really want to do is plot them. We can do this using the "plot()" function that you are already familiar with. Typically, we put residuals on the y-axis, and the predictor variable on the x-axis. The resulting graph is called a residual plot.

Let's make a residual plot for our regression object, "reg.bikes". We will put temperature, "data.bikes$Temp", on the x-axis, and the residuals, "resid.bikes", on the y-axis.

In [None]:
plot(x = data.bikes$Temp, y=resid.bikes, xlab="Temperature (Degrees C)", ylab = "Residuals",
    main = "Residual Plot for a Regression of Bikes Rented vs Temp")

You will often see residual plots with a horizontal line at 0. This is because the mean of the residuals is always 0, and the line helps us judge the residuals' departure from this mean.
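You can check the claim about the mean yourself. The sketch below fits a line to a made-up data frame called "toy" (hypothetical values, not the bicycle dataset) and computes the mean of its residuals. Because computers store decimals with limited precision, the result may be a tiny number very close to 0 rather than exactly 0.

```r
# Hypothetical toy data, used only so this cell runs on its own
toy = data.frame(Temp = c(5, 10, 15, 20, 25),
                 Bikes = c(1100, 1650, 2050, 2600, 3100))
reg.toy = lm(formula = Bikes ~ Temp, data = toy)

# Mean of the residuals: 0 up to rounding error
mean.resid = mean(resid(reg.toy))
print(mean.resid)

# Check that the mean is negligibly small
print(abs(mean.resid) < 1e-8)
```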

We can add a horizontal line to our plot using the "abline()" function. Remember that we used the "reg" input with this function to plot a regression line. Another input for "abline()" is called "h". The "h" input plots a horizontal line at the value you specify. 

Remember that "abline()" only adds lines to plots; it doesn't make a new plot. This means that we need to make a plot for "abline()" to add to.

Let's re-make the residual plot from the last cell with a horizontal line at 0. We can use the "abline()" function with "h" equal to 0.

In [None]:
plot(x = data.bikes$Temp, y=resid.bikes, xlab="Temperature (Degrees C)", ylab = "Residuals",
    main = "Residual Plot for a Regression of Bikes Rented vs Temp")
abline(h = 0)

Do you notice any potential problems with this residual plot?

## Influential Points

This one is going to be messy. Is it worth including, or should we just wait until they get to leverage later?

## Extra Topic

The actual bike sharing dataset goes into much more detail than we discuss here. If you are interested in learning more, you can find the original dataset and a more thorough explanation of what each variable means at http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset#.

This dataset is stored on the UCI Machine Learning Repository, along with over 350 other datasets. Each of these datasets is uploaded by a researcher, and they often describe what problems they were trying to solve using their data. This is a great place to look if you want real-world datasets to practice what you're learning in these modules.