# DS102 Statistical Programming in R : Lesson Eight - Linear Regression

### Table of Contents <a class="anchor" id="DS102L8_toc"></a>

* [Table of Contents](#DS102L8_toc)
    * [Page 1 - Introduction](#DS102L8_page_1)
    * [Page 2 - Scatter Plots](#DS102L8_page_2)
    * [Page 3 - Correlation Basics](#DS102L8_page_3)
    * [Page 4 - Calculating Correlation](#DS102L8_page_4)
    * [Page 5 - Introduction to Regression](#DS102L8_page_5)
    * [Page 6 - Computing Linear Regression](#DS102L8_page_6)
    * [Page 7 - Key Terms, R Libraries and Functions](#DS102L8_page_7)
    * [Page 8 - Hands On](#DS102L8_page_8)
    * [Page 9 - Hands On Solution](#DS102L8_page_9)    

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 1 - Introduction<a class="anchor" id="DS102L8_page_1"></a>

[Back to Top](#DS102L8_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

In [1]:
from IPython.display import VimeoVideo
# Tutorial Video Name: Linear Regression
VimeoVideo('247057958', width=720, height=480)

Linear regression is a method for investigating the relationship between two variables. In linear regression, the relationship between the variables is represented as a line, and you compute parameters of the line (how steep it is and where it starts) as well as determine how accurately this line represents the relationship.

You will begin this lesson with scatter plots, which are used extensively to understand the relationship between continuous variables. You will then learn about correlation.

By the end of this lesson, you should be able to: 

* Create scatterplots
* Assess correlations visually and numerically
* Create correlation matrices
* Create and analyze linear regression models

This lesson will culminate with a hands-on practice that examines the relationship between horsepower and the time it takes for a car to cover a quarter mile.

<div class="panel panel-success">
    <div class="panel-heading">
        <h3 class="panel-title">Additional Info!</h3>
    </div>
    <div class="panel-body">
        <p>You may want to watch <a href="https://vimeo.com/446985965">this recorded live workshop</a> on simple linear regression, which is meant to go along with this lesson.</p>
    </div>
</div>


In [2]:
from IPython.display import VimeoVideo
# Tutorial Video Name: Linear Regression
VimeoVideo('446985965', width=720, height=480)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 2 - Scatter Plots<a class="anchor" id="DS102L8_page_2"></a>

[Back to Top](#DS102L8_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">



# Scatter Plots

In a scatter plot, data are displayed as a collection of points. Each data point is determined by the values of two variables, one on the horizontal (left to right) axis and one on the vertical (up and down) axis. Scatter plots make the relationship between variables easy to see.

In R, creating a scatter plot is relatively simple. In this lesson, you will use ```ggplot2``` to create scatter plots as well. If you have closed R since last using ```ggplot2```, remember that you will need to load it by using the following command at the beginning of every RStudio session: 

```{r}
library(ggplot2)
```

You could also click the check box next to ```ggplot2``` in the ```Packages``` tab.

You will start by creating a scatter plot using the ```faithful``` data set; you will plot eruption times versus waiting times, with ```eruptions``` on the horizontal axis, or ```x=``` axis, and ```waiting``` on the vertical axis, or ```y=``` axis:

```{r}
d <- ggplot(faithful, aes(x = eruptions, y = waiting))
d + geom_point()
```

These commands produce the following scatter plot:

![A scatterplot. The x axis is labeled eruptions and runs from one point five to approximately five point two five. The y axis is labeled waiting and runs from forty to one hundred. Data points are scattered and are mostly clustered in the bottom left and the upper right portions.](Media/L07-ScatterPlot.png)

You can add a title and improve the axis labels using the ```ggtitle()```, ```xlab()```, and ```ylab()``` functions:

```{r}
d + geom_point() + ggtitle("Old Faithful Eruption vs Waiting Times") +
  xlab("Eruption Time (min)") + ylab("Waiting Time (min)")
```

![A scatterplot titled old faithful eruption versus waiting times. The x axis is labeled eruption time in minutes and runs from one point five to approximately five point two five. The y axis is labeled waiting time in minutes and runs from forty to one hundred. Data points are scattered and are clustered mostly in the bottom left and the upper right portions.](Media/L07-ScatterPlot2.png)

You see from this plot that there are two clusters of data: 

* A short eruption time followed by a short wait until the next eruption
* A long eruption time followed by a long wait. 

After creating a scatter plot, you may want to see how well the data fit a straight line. You can do this easily in ```ggplot2``` by adding ```+ geom_smooth()``` and specifying, as an argument to ```geom_smooth()```, that you want the method, or shape of line, to be ```lm```.  ```lm``` stands for *linear model*.  A linear model will create a straight line on the graph.

Here's how all that code fits together: 

```{r}
d + geom_point() + geom_smooth(method=lm)
```

This gives the following scatter plot with a *best fit* line.  The phrase "best fit line" means that the line you see below wasn't just plunked on the graph any old place; it was strategically fit to all the data points so that the total squared vertical distance between the points and the line is as small as possible.

Here is the plot with your best fit line added:

![A scatterplot. The x axis is labeled eruptions and runs from one point five to approximately five point two five. The y axis is labeled waiting and runs from forty to one hundred. Data points are scattered and are mostly clustered in the bottom left and the upper right portions. A blue line with a gray area around it, the confidence region, runs diagonally from the bottom left to the upper right to show the best line of fit.](Media/L07-ScatterPlot3.png)

In addition to the line that fits the data, the ```geom_smooth()``` function adds a grey area around the line. In this image, it may be a little difficult to see, because it does not expand much past the blue line.  It is easiest to see at the beginning and the end of the line, where there is a little contrast with all the black dots.  This grey shading is called the *confidence region*. Roughly speaking, if the boundaries of the region are close to the line, it indicates that you are confident in the accuracy of your estimates of the parameters that define the line; if the region extends away from the line, it indicates that you are less confident in the accuracy of the line.  You can think of the grey shading as something like a margin of error.
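To make the confidence region more concrete, the sketch below fits the same straight line with base R's ```lm()``` and asks ```predict()``` for the 95 percent confidence interval at three eruption times. The names ```fit```, ```new_points```, ```band```, and ```band_width``` are illustrative, and the three eruption times are arbitrary choices rather than part of the lesson.

```{r}
# Fit the same kind of straight line that geom_smooth(method=lm) draws
fit <- lm(waiting ~ eruptions, data = faithful)

# Three eruption times: near the low end, the middle, and the high end
new_points <- data.frame(eruptions = c(2, 3.5, 5))

# Columns: fit (the line), lwr and upr (the edges of the confidence region)
band <- predict(fit, newdata = new_points, interval = "confidence", level = 0.95)
band

# Width of the region at each eruption time
band_width <- band[, "upr"] - band[, "lwr"]
band_width
```

The region is narrowest near the middle of the data and widens toward the ends, which is why the grey shading is easiest to spot at the beginning and end of the line.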

If you don't want your graph to include the grey shading, it can easily be turned off with the argument ```se=FALSE``` (R is case sensitive, so ```FALSE``` must be in capitals):

```{r}
d + geom_point() + geom_smooth(method=lm, se=FALSE)
```

This yields the image below.  It may be easier to see the grey shading on the previous graph now that you have something with which to contrast it!

![A scatterplot. The x axis is labeled eruptions and runs from one point five to approximately five point two five. The y axis is labeled waiting and runs from forty to one hundred. Data points are scattered and are mostly clustered in the bottom left and the upper right portions. A blue line runs diagonally from the bottom left to the upper right to show the best line of fit. The confidence region, which would be shown as a gray area around the blue line, is turned off.](Media/L07-ScatterPlot4.png)

If you feel that black points and a blue regression line are too somber, you can always change the color of the points and the line: 

```{r}
d + geom_point(color = "firebrick") +
  geom_smooth(method=lm, se=FALSE, color = "goldenrod2")
```

This gives you the following colorful plot:

![A scatterplot. The x axis is labeled eruptions and runs from one point five to approximately five point two five. The y axis is labeled waiting and runs from forty to one hundred. Data points are scattered, appearing as red dots, and are mostly clustered in the bottom left and the upper right portions. A yellow line runs diagonally from the bottom left to the upper right to show the best line of fit.](Media/L07-ScatterPlot5.png)

---

## Review
Below is a quiz to review the recently covered material. Quizzes are _not_ graded.

In [1]:
try:
    from DS_Students import MultipleChoice
    from ipynb.fs.full.DS102Questions import *
except:
    !pip install DS_Students
    from DS_Students import MultipleChoice
    from ipynb.fs.full.DS102Questions import *

<p style="text-align: center">
  <img src="Media/L07-ScatterPlotEx.png" alt="Drawing" style="width: 500px;"/>
</p>

In [2]:
try:
    display(L8P2Q1)
except:
    pass


<p style="text-align: center">
  <img src="Media/L07-ScatterPlotEx2.png" alt="Drawing" style="width: 500px;"/>
</p>

In [3]:
try:
    display(L8P2Q2)
except:
    pass


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 3 - Correlation Basics<a class="anchor" id="DS102L8_page_3"></a>

[Back to Top](#DS102L8_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


In [4]:
from IPython.display import VimeoVideo
# Tutorial Video Name: Linear Regression
VimeoVideo('329392215', width=720, height=480)

The transcript for the above topic tutorial video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO102-L08-pg3tutorial.zip)**.

# Correlation Basics

Two variables are correlated if there is a *linear relationship* between them; in a scatter plot, correlation is indicated if the points in the plot tend to lie close to a straight line. Both words in the phrase "linear relationship" are important.  *Linear* is important because if there is no semblance of a straight line in the graph, the relationship will not be picked up by the "standard" correlation, which is called *Pearson's Correlation* and is denoted by the symbol *r*. Take a look at the graphs below.  None of them counts as correlated; each has an *r* value of zero, because the patterns are not linear.

![Seven different graphs in different shapes, each with an r value of zero. None would count as being correlated.](Media/correlation3.png)

Relationship is equally important; there must be some connection between the two variables.  It can't be random; it has to be a pattern that is visible in the data.
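To see why the word *linear* matters, here is a minimal sketch with made-up numbers (```x``` and ```y``` are illustrative, not from any dataset). The two variables are perfectly related, because ```y``` is just ```x``` squared, yet the pattern is a U shape rather than a straight line.

```{r}
# A perfect U-shaped (non-linear) relationship: y is completely determined by x
x <- -10:10
y <- x^2

# Pearson's r comes out at zero because the pattern is not a straight line
cor(x, y)
```

A clear pattern in the data is therefore not enough on its own; Pearson's *r* only measures how closely the points follow a straight line.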

---

## Direction of Correlations

There are technically three directions for a correlation: 

* Positive
* Negative
* No correlation (uncorrelated)

---

### Positive Correlations

A positive correlation is indicated if the line goes up from left to right. This means that the variables change together in the same direction; as one goes up, the other goes up, or as one goes down, the other goes down. A positive correlation is indicated statistically by having a positive *r* value.  The graphic below has positive correlations depicted on the left hand side. 

---

### Negative Correlations

If the line slopes down from left to right, then it indicates a negative correlation. This means that the variables change together in different directions.  As one goes up, the other goes down.  A negative correlation is indicated statistically by having a negative *r* value.  The graphic below shows a negative correlation on the right.

![On the left, a graph with a positive correlation, with a line sloping up from left to right. On the right, a graph with a negative correlation, with the line sloping down from left to right.](Media/correlation4.png)

---

## Strength of Correlations

Not only can you judge a correlation by its direction, but also by its strength.  A strong correlation forms a very tight grouping of dots to make a line, like the graph below on the far left.  A moderate correlation will have the rough shape of a line, but the dots will be a bit more spread out, like the middle graph.  A weak correlation shows a very general trend, like the one on the right below.

![Three graphs. Graph number one, dots grouped tightly into a line, showing a strong correlation. Graph number two, dots grouped more loosely into a rough shape of a line, showing a moderate correlation. Graph number three, dots grouped very loosely, showing a weak correlation.](Media/correlation1.png)

You can have a strong, moderate, or weak correlation that is either positive or negative.  Check out the full spectrum below: 

![Seven different graphs showing different correlation strengths. Graph one has an r value of one point zero, a strong correlation. Graph two has an r value of zero point eight, a strong correlation. Graph three has an r value of zero point four, a moderate correlation. Graph four has an r value of zero point zero, showing no correlation. Graph five has an r value of negative zero point four, showing a moderate correlation. Graph six has an r value of negative zero point eight, showing a strong correlation. Graph seven has an r value of negative one point zero, showing a strong correlation.](Media/correlation5.png)

The numbers indicate the *r* values for a correlation. Correlations can range from -1 to +1. The closer to 1, whether a positive or negative one, the stronger the correlation is.  The closer to 0, again, whether positive or negative, the weaker the correlation is.  Here are some correlation interpretation guidelines, which apply to the size of *r* regardless of its sign:

<table class="table table-striped">
    <tr>
        <th>Correlation Strength</th>
        <th>Correlation Coefficient (r value)</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Strong</td>
        <td>0.7 - 1.0</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Moderate</td>
        <td>0.3 - 0.69</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Weak</td>
        <td>0.1 - 0.29</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>None</td>
        <td>0.0 - 0.09</td>
    </tr>
</table>
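The guideline table above can be sketched as a small helper function. ```interpret_r()``` is a hypothetical name, not a built-in R function, and it assumes that values falling between the table's ranges (such as 0.695) belong to the lower category.

```{r}
# Hypothetical helper: label an r value using the guideline table above
interpret_r <- function(r) {
  size <- abs(r)  # strength ignores the sign
  if (size >= 0.7) {
    strength <- "strong"
  } else if (size >= 0.3) {
    strength <- "moderate"
  } else if (size >= 0.1) {
    strength <- "weak"
  } else {
    return("none")
  }
  direction <- if (r > 0) "positive" else "negative"
  paste(strength, direction)
}

interpret_r(0.83)
interpret_r(-0.45)
interpret_r(0.05)
```

Taking the absolute value first is what lets the same cutoffs serve both positive and negative correlations.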

<div class="panel panel-info">
    <div class="panel-heading">
        <h3 class="panel-title">Tip!</h3>
    </div>
    <div class="panel-body">
        <p>These guidelines definitely work for neatly controlled lab experiments; some argue that they don't work as well in real-world scenarios, which encompass the majority of data science work! Count yourself lucky if you get a moderate correlation in your work. </p>
    </div>
</div>

---

## Types of Correlations

There are different "flavors" of correlations for different situations.  They are all interpreted the same way, but are calculated differently behind the scenes in R, to make sure that they are as accurate as possible. 

The three main types of correlations you will learn about are: 

* **Pearson's *r*:** For two normally distributed, continuous variables
* **Spearman's Rho:** For two non-normally distributed continuous variables
* **Kendall's Tau:** For two ordinal variables (ordered categories that can be ranked)

<div class="panel panel-info">
    <div class="panel-heading">
        <h3 class="panel-title">Tip!</h3>
    </div>
    <div class="panel-body">
        <p>A correlation will ALWAYS be between two variables, no more, no less, no matter which type of correlation you're using!</p>
    </div>
</div>

---

## Correlation DOES NOT EQUAL Causation

Just because two things are related does not mean that one caused the other! Sometimes, there are additional variables not in your dataset that can indirectly relate to the relationship between the two variables, and sometimes, they really are just randomly related without a whole lot of rationale behind it. Take this scenario as a cautionary tale: 

>As ice cream sales increase, so does the number of armed robberies leading to murder.

![A sun. Two arrows extend from the sun. The first points to ice cream and is labeled causation. The second points to a man with a bag of money on his back, a robber, and is labeled causation. Between the ice cream and the robber is double sided arrow that is labeled correlation.](Media/correlation2.png)

Does this mean that after eating ice cream, people get inspired to go out and commit murder? Most likely not.

Does this mean that murderers go out and celebrate their deeds by having some congratulatory ice cream?  Again, unlikely. 

In other words, ice cream sales did not cause murders, and murders did not cause ice cream sales to increase.  Instead, there is a pesky third variable at work: heat! As temperatures increase in the summer, people look to cool off with some ice cream.  Makes logical sense! And, as temperatures increase, so do aggressive tendencies.  Hence the spike in murder rates. The lesson here to take to heart is that just because there is a relationship between two variables, that does not mean that one was responsible for the other. Reporting this incorrectly can make for some embarrassing mistakes.
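The third-variable idea can be sketched with simulated data. Everything below is made up for illustration: the variable names, the seed, and the numbers are assumptions, not real ice cream or crime figures. Temperature drives both of the other variables, and neither one affects the other.

```{r}
set.seed(102)  # arbitrary seed so the made-up numbers are reproducible

# Simulated data, not real measurements: temperature drives both variables
n <- 1000
temperature <- rnorm(n)
ice_cream   <- temperature + rnorm(n, sd = 0.5)
crime       <- temperature + rnorm(n, sd = 0.5)

# Ice cream sales and crime come out strongly correlated...
cor(ice_cream, crime)

# ...but once temperature is accounted for, what is left over is not
ice_cream_leftover <- resid(lm(ice_cream ~ temperature))
crime_leftover     <- resid(lm(crime ~ temperature))
cor(ice_cream_leftover, crime_leftover)
```

The first correlation is large even though the simulation never lets ice cream influence crime; the second is close to zero because the shared cause has been removed.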

<div class="panel panel-success">
    <div class="panel-heading">
        <h3 class="panel-title">Want to have a good laugh?</h3>
    </div>
    <div class="panel-body">
        <p>Here are some more crazy correlations to take a peek at! <a href="https://tylervigen.com/spurious-correlations">Spurious Correlations</a></p>
    </div>
</div>

---

## Examples

You will next create a plot that indicates very little correlation. The ```USArrests``` data frame has gruesome statistics on arrests for violent crimes, in arrests per 100,000 residents, in each of the 50 US states in 1973. One of the variables in this data frame is ```Murder```, which is the murder rate for each state. Another variable is ```UrbanPop```, which is the percentage of the state's population that lives in an urban area. You can create a scatter plot of these two variables together with the linear regression line using the following commands:

```{r}
d <- ggplot(USArrests, aes(x = UrbanPop, y = Murder))
d + geom_point() + geom_smooth(method=lm, se=FALSE)
```

This gives the following scatter plot:

![A scatter plot. The x axis is labeled Urban Pop and runs from thirty to over ninety. The y axis is labeled murder and runs from zero to approximately eighteen. Data points are scattered in a way that does not imply much of a pattern. A blue line runs through the data from left to right and is almost flat, with very little upward slope.](Media/L07-NoCorrelationPlot.png)

You can see that the data do not seem to be in much of a pattern. The line is almost flat, with very little slope. You say that the ```Murder``` and ```UrbanPop``` variables are uncorrelated. They do not have a linear relationship.

You can now make a plot that shows a negative correlation. The data frame ```mtcars``` has data from the 1974 *Motor Trend* magazine; these data cover 32 car models from 1973–74. One variable in this data frame is ```mpg```, which is the car's mileage in miles per US gallon of fuel. Another variable is ```disp```, the engine displacement in cubic inches. You can create a scatter plot of these two variables with the linear regression line using the following commands:

```{r}
d <- ggplot(mtcars, aes(x = disp, y = mpg))
d + geom_point() + geom_smooth(method=lm, se=FALSE)
```

![A scatter plot. The x axis is labeled d i s p and runs from fifty to five hundred. The y axis is labeled m p g and runs from ten to thirty five. Data points are scattered, starting at the upper left portion and forming a downward curve to the lower right portion. A blue linear regression line through the data with a negative slope.](Media/L07-NegCorrelationPlot.png)

Because the linear regression line has a negative slope (it goes from upper left to lower right), these data values are negatively correlated. A larger value of displacement tends to be associated with a smaller value of miles per gallon.
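As a preview of the next page, the visual impressions from both plots can be checked numerically with base R's ```cor()``` function, which returns Pearson's *r* by default.

```{r}
# Very little linear relationship: r is close to 0
cor(USArrests$UrbanPop, USArrests$Murder)

# Negative linear relationship: r is well below 0
cor(mtcars$disp, mtcars$mpg)
```

The first value falls in the "None" range of the guideline table, and the second is a strong negative correlation, matching the nearly flat line and the downward-sloping line in the two plots.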

---

## Review
Below is a quiz to review the recently covered material. Quizzes are _not_ graded.

<p style="text-align: center">
  <img src="Media/L07-CorrelationEx1.png" alt="Drawing" style="width: 500px;"/>
</p>

In [4]:
try:
    display(L8P3Q1)
except:
    pass


<p style="text-align: center">
  <img src="Media/L07-CorrelationEx2.png" alt="Drawing" style="width: 500px;"/>
</p>

In [6]:
try:
    display(L8P3Q2)
except:
    pass


<p style="text-align: center">
  <img src="Media/L07-CorrelationEx3.png" alt="Drawing" style="width: 500px;"/>
</p>

In [7]:
try:
    display(L8P3Q3)
except:
    pass


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 4 - Calculating Correlation <a class="anchor" id="DS102L8_page_4"></a>

[Back to Top](#DS102L8_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


In [None]:
from IPython.display import VimeoVideo
# Tutorial Video Name: Linear Regression
VimeoVideo('329392134', width=720, height=480)

The transcript for the above topic tutorial video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO102-L08-pg4tutorial.zip)**.

# Calculating Correlation

Now that you understand what a correlation is and how to interpret it, you will learn how to calculate correlations in R. 

---

## cor.test() 

The simplest way to find a correlation is to make use of the ```cor.test()``` function. You will select the two variables you want to correlate; with ```cor.test()``` you can only do two variables at a time. Here is the code: 

```{r}
cor.test(mtcars$hp, mtcars$cyl, method="pearson", use = "complete.obs")
```

This runs ```cor.test()``` on the ```hp``` and ```cyl``` variables of the ```mtcars``` dataset.  You will use the ```method=``` argument to specify ```"pearson"``` if you have two continuous variables that are normally distributed.  If, however, you have two continuous variables that are NOT normally distributed (this is called *non-parametric*), then you will use the argument ```"spearman"```, which will conduct the non-parametric correlation *Spearman's Rho*, pronounced "row." If you have two ordinal variables (ordered categories that are numeric or have been recoded to numeric), then you can make use of the argument ```"kendall"```, which conducts the test *Kendall's Tau*, pronounced like "ow!" with a "t" on the front, or the beginning of "tower."

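The ```use=``` argument with the value ```"complete.obs"``` means that you don't have to have a complete dataset; R will use what it has, as long as it has data for the two variables you are trying to correlate. Strictly speaking, ```use=``` belongs to the ```cor()``` function; ```cor.test()``` already drops incomplete pairs on its own, so including the argument here does no harm.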
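Because ```mtcars``` has no missing values, the effect of ```"complete.obs"``` is easiest to see on a copy with a few gaps added. The sketch below uses ```cor()```, where the ```use=``` argument is documented; ```hp_with_gaps``` is an illustrative name, and the two blanked-out values are chosen arbitrarily.

```{r}
# Copy of horsepower with two values deliberately blanked out (illustration only)
hp_with_gaps <- mtcars$hp
hp_with_gaps[c(1, 2)] <- NA

# Without use=, cor() returns NA as soon as any value is missing
cor(hp_with_gaps, mtcars$cyl)

# With use = "complete.obs", cor() keeps only rows where both values are present
cor(hp_with_gaps, mtcars$cyl, use = "complete.obs")
```

The second call gives the same answer as dropping the first two cars from both variables by hand.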

The results from ```cor.test()``` are below:

```text
	Pearson's product-moment correlation

data:  mtcars$hp and mtcars$cyl
t = 8.2286, df = 30, p-value = 3.478e-09
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.6816016 0.9154223
sample estimates:
      cor 
0.8324475 
```

The first line tells you what analysis you ran - the Pearson's correlation.  The second line tells you the data you used.  Next you have information about whether this correlation was significant.  The *t* value is usually not reported, but the overall *p* value often is.  As with other tests, if the *p* value is less than .05, then the correlation is significant.

The last important part of this R output to pay attention to is the number underneath ```cor```.  This is your *r* correlation coefficient, which you will need to interpret.  Here, the horsepower of a car and the number of cylinders in its engine are strongly positively correlated, at *r* = 0.83.
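If you would rather pull these numbers out than read them off the printed output, the object returned by ```cor.test()``` stores them by name. In this short sketch, ```result``` is an illustrative variable name.

```{r}
result <- cor.test(mtcars$hp, mtcars$cyl, method = "pearson")

result$estimate   # the r value shown under "cor"
result$p.value    # the p value
result$conf.int   # the 95 percent confidence interval for r

# TRUE when the correlation is significant at the .05 level
result$p.value < 0.05
```

Storing the result this way is handy when you need to reuse the *r* or *p* value later in a script.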

---

## Detailed Correlation Matrices 

```cor.test()``` works just fine, but what if you want to look at all variables in your dataset at once, just to take a quick peek at what's going on? Well, using ```cor.test()``` would take a long time! But there is a solution: *correlation matrices*. This lets you look at more than one correlation at a time, in a handy graphic.

The easiest way to create a correlation matrix with the *p* value included is to use the ```PerformanceAnalytics``` library and the ```chart.Correlation()``` function.

So, get ```PerformanceAnalytics``` all installed and running: 

```{r}
install.packages("PerformanceAnalytics")
library("PerformanceAnalytics")
```

Then, because ```chart.Correlation()``` will look at all the data in your data frame, you need to limit it to only the quantitative, continuous variables.  This can easily be done with subsetting.  You are telling R that out of the ```mtcars``` data frame, you want to keep all the rows (if you wanted to keep certain rows, those numbers would go before the comma), and that you want to keep only the first seven columns in the data frame.  You'll name this new truncated dataset ```mtcars_quant```. 

```{r}
mtcars_quant <- mtcars[, c(1,2,3,4,5,6,7)]
```

And for a visual, this is what ```mtcars_quant``` now looks like.  You'll notice only the first seven columns were kept, compared to the entirety of the ```mtcars``` dataset.

![A data frame with seven columns, m p g, c y l, d i s p, h p, d r a t, w t, and q s e c. Each row has a heading that is a type of car, including Mazda R X 4, Mazda R X 4 wagon, Datsun seven hundred ten, and more. The cells for each row contain data relevant to the car.](Media/correlation6.png)
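Column positions are easy to miscount, so an alternative is to select the same columns by name. This is a sketch rather than a required step; ```quant_columns``` and ```mtcars_quant_by_name``` are illustrative names.

```{r}
# Same seven columns, selected by name instead of by position
quant_columns <- c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec")
mtcars_quant_by_name <- mtcars[, quant_columns]

# TRUE if this matches the position-based subset
identical(mtcars_quant_by_name, mtcars[, c(1,2,3,4,5,6,7)])
```

Selecting by name also keeps working if the column order of a data frame changes.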

Now that you have a dataset filled with only quantitative variables, it is time to make your correlation matrix! All you need to do is call the ```chart.Correlation()``` function.  The first argument will be your data frame name, ```mtcars_quant```, the second will be ```histogram=```, and the third will be ```method=```, which is the type of correlation. ```method=``` takes the same arguments as ```cor.test()```: ```"pearson"```, ```"spearman"```, and ```"kendall"```. 

```{r}
chart.Correlation(mtcars_quant, histogram=FALSE, method="pearson")
```

And here is the plot that results from the above code: 

![Correlation plot. Along the diagonal are the following variable names, from top left to bottom right: mpg, cyl, disp, hp, drat, wt, qsec. To the right of the diagonal, there are correlation values, with zero to three red stars indicating the level of significance. From top to bottom and left to right, these read: -.85***, -.85***, -.78***, .68***, -.87***, .42*, .90***, .83***, -.70***, .78***, -.59***, .79***, -.71***, .89***, -.43*, -.45**, .66***, -.71***, -.71***, .001, and -0.17. To the left of the diagonal are qq plots for each of the variables.](Media/correlation7.png)

Although there is a lot to look at here, you only need to pay attention to the right-hand side. You will read this by the intersections of the variables on the left with the variables on the bottom, so the correlation of ```mpg``` with ```cyl``` is -0.85, and it is significant at *p* < .001, because three stars are displayed in that cell.  

<div class="panel panel-info">
    <div class="panel-heading">
        <h3 class="panel-title">Tip!</h3>
    </div>
    <div class="panel-body">
        <p>By convention, if statistical significance stars aren't labeled, one star means p less than .05, two stars means p less than .01, and three stars means p less than .001. </p>
    </div>
</div>
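The star convention in the tip above can be written out as a small function. ```p_to_stars()``` is a hypothetical helper for illustration, not something provided by ```PerformanceAnalytics```.

```{r}
# Hypothetical helper: convert a p value into the conventional stars
p_to_stars <- function(p) {
  if (p < 0.001) {
    "***"
  } else if (p < 0.01) {
    "**"
  } else if (p < 0.05) {
    "*"
  } else {
    ""   # not significant at the .05 level
  }
}

p_to_stars(3.478e-09)  # p value from the hp and cyl example
p_to_stars(0.03)
p_to_stars(0.20)
```

The checks run from the strictest cutoff to the loosest, so each *p* value receives the largest number of stars it qualifies for.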

---

## Visually Pleasing Correlation Matrices

The graph above conveniently has the larger *r* values printed in a larger font size as well, so it is somewhat intuitive to interpret.  But it definitely isn't the most pleasing to the eye. If you like a chart that is more of a looker, then ```corrplot()``` is the way to go.  However, it doesn't list significance values or specific *r* values, which is a downside. Another downside to ```corrplot()``` is that getting the image is a bit more complex.

---

### Getting a Correlation Matrix Table using cor()

First, you need to turn your data into a correlation matrix. The ```corrplot()``` function won't take your data frame as an argument.  To do this, you will use the function ```cor()``` on the quantitative dataset ```mtcars_quant``` that you have been using.

```{r}
corr_matrix <- cor(mtcars_quant)
corr_matrix
```

When you call your newly created matrix, this is what you will see: 

```text
            mpg        cyl       disp         hp        drat         wt        qsec
mpg   1.0000000 -0.8521620 -0.8475514 -0.7761684  0.68117191 -0.8676594  0.41868403
cyl  -0.8521620  1.0000000  0.9020329  0.8324475 -0.69993811  0.7824958 -0.59124207
disp -0.8475514  0.9020329  1.0000000  0.7909486 -0.71021393  0.8879799 -0.43369788
hp   -0.7761684  0.8324475  0.7909486  1.0000000 -0.44875912  0.6587479 -0.70822339
drat  0.6811719 -0.6999381 -0.7102139 -0.4487591  1.00000000 -0.7124406  0.09120476
wt   -0.8676594  0.7824958  0.8879799  0.6587479 -0.71244065  1.0000000 -0.17471588
qsec  0.4186840 -0.5912421 -0.4336979 -0.7082234  0.09120476 -0.1747159  1.00000000
```

This correlation matrix displays the correlation between the variables along the top and the variables along the left-hand side. This is similar to how things were displayed in the correlation matrix graphic above, but you'll notice that the values are no longer confined to the upper right-hand side. Correlation matrices are often displayed as a triangle instead of a rectangle because the information repeats in the second half.  The correlation between ```mpg``` on the left and ```cyl``` on the top is the same as the correlation between ```cyl``` on the left and ```mpg``` on the top. See?

![A data set with seven columns and seven rows. The columns and the rows have the same headings, which are m p g, c y l , d i s p, h p, d r a t, w t, and q s e c. The set of data shows correlation between the rows and columns. For example, the correlation between m p g on the left and c y l on the top is the same as the correlation between c y l on the left and m p g on the top.](Media/correlation8.png)

The defining line to look for is a diagonal line of 1s all the way down from the upper left corner to the lower right corner. This line presents itself because the correlation of any variable with itself will always be 1.  So ```mpg``` with ```mpg``` is 1, ```cyl``` with ```cyl``` is 1, and so on, all the way down. Take a look: 

![A data set with seven columns and seven rows. The columns and the rows have the same headings, which are m p g, c y l , d i s p, h p, d r a t, w t, and q s e c. The set of data shows correlation between the rows and columns. For example, the correlation between m p g on the left and c y l on the top is the same as the correlation between c y l on the left and m p g on the top. A yellow diagonal line runs from the top left to the bottom right, through each instance of a correlation of one point zero zero zero zero zero zero zero.](Media/correlation9.png)

Everything the yellow line touches is a 1.  The top right half is mirrored in the bottom left half.
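Both properties can be checked in code. The sketch below rebuilds the matrix so that it runs on its own; ```upper_only``` is an illustrative name.

```{r}
# Same matrix as above, rebuilt from the first seven columns of mtcars
corr_matrix <- cor(mtcars[, 1:7])

# TRUE when the top right half mirrors the bottom left half
isSymmetric(corr_matrix)

# TRUE when every entry on the diagonal is 1
all.equal(unname(diag(corr_matrix)), rep(1, 7))

# Blank out the repeated lower half to get the triangle layout
upper_only <- round(corr_matrix, 2)
upper_only[lower.tri(upper_only)] <- NA
upper_only
```

Printing ```upper_only``` shows the same triangle shape that correlation plots use, with ```NA``` wherever a value would merely repeat.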

---

### Installing corrplot()

Next, you will need to install and make available the library ```corrplot```: 

```{r}
install.packages("corrplot")
library("corrplot")
```

---

### Using corrplot()

And finally, you are ready to actually make a beautiful, visually pleasing correlation matrix plot!

```{r}
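p_matrix <- cor.mtest(mtcars_quant)$p  # p values for every pair; p.mat= needs p values, not correlations
corrplot(corr_matrix, type="upper", order="hclust", p.mat = p_matrix, sig.level = 0.01, insig="blank")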
```

Remember, ```corrplot()``` is based off of the correlation matrix you created above, ```corr_matrix``` rather than your data frame.  ```type=``` allows you to pick whether you want to see the top or the bottom of the mirror-image matrix; in this case you have chosen ```"upper"```, which yields the following image:

![A correlation matrix plot showing the top of the mirror image matrix. The first row has seven squares, the second six, the third five, the fourth four, the fifth three, the sixth two, and the seventh one. The column and row headings are the same, w t, h p, c y l, d i s p, q s e c, m p g, and d r a t. To the right of the matrix is a vertical scale showing one in blue on the top and negative one in red on the bottom, with lighted shades of each color in between. The scale has increments of zero point two.](Media/correlation10.png)

If you had chosen ```"lower"``` instead, you would get this plot: 

![A correlation matrix plot showing the bottom of the mirror image matrix. The first row has one square, the second two, the third three, the fourth four, the fifth five, the sixth six, and the seventh seven. The column and row headings are the same, w t, h p, c y l, d i s p, q s e c, m p g, and d r a t. To the bottom of the matrix is a horizontal scale showing negative one in red on the left and one in blue on the right, with lighted shades of each color in between. The scale has increments of zero point two.](Media/correlation11.png)

The rest of the arguments to ```corrplot()``` have to do with demonstrating significance on the chart by only showing the significant values.  You can change the significance level using ```sig.level=```; here it is set to *p* = .01, but you could choose *p* = .05 instead to get more findings, *p* = .001 to get a more rigorously determined set of correlations, or any other value you'd like. The ```insig="blank"``` argument then tells R to leave blank anything that isn't significant at the level you have specified.  This is why not every cell of the matrices above is filled in.

When interpreting either graph above, note that correlations are shown with both a color and a size gradient.  The smaller and lighter a square is, the weaker the correlation between the two variables.  Positive correlations are shown in blues, while negative correlations are shown in reds.
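
The `p.mat` argument of `corrplot()` is documented as a matrix of *p* values, one per pair of variables, which `sig.level` is then compared against. Here is a minimal sketch of building such a matrix with base R's `cor.test()`. It assumes the data are the first seven columns of `mtcars`, and the names `vars`, `corr_values`, and `p_values` are hypothetical, chosen only for this illustration.

```{r}
# Assumption: the same seven mtcars columns shown in the figures above
vars <- mtcars[, 1:7]
corr_values <- cor(vars)

# Build a matrix of p-values, one cor.test() per pair of columns
n <- ncol(vars)
p_values <- matrix(0, nrow = n, ncol = n, dimnames = dimnames(corr_values))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    if (i != j) {
      p_values[i, j] <- cor.test(vars[[i]], vars[[j]])$p.value
    }
  }
}

# Plot only if corrplot is installed; cells with p > sig.level are left blank
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(corr_values, type = "upper", order = "hclust",
                     p.mat = p_values, sig.level = 0.01, insig = "blank")
}
```

The corrplot package also ships a `cor.mtest()` helper that returns a comparable matrix of *p* values, so the loop above is mainly useful for seeing what that matrix contains.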

<div class="panel panel-success">
    <div class="panel-heading">
        <h3 class="panel-title">Really into corrplot()?</h3>
    </div>
    <div class="panel-body">
        <p>Then <a href="https://www.sthda.com/english/wiki/visualize-correlation-matrix-using-correlogram">learn all the fun things that can be done with it</a> from the Statistical Tools for High-Throughput Data Analysis group.</p>
    </div>
</div>

---

## Review
Below is a quiz to review the recently covered material. Quizzes are _not_ graded.

In [8]:
try:
    display(L8P4Q1, L8P4Q2, L8P4Q3, L8P4Q4)
except:
    pass


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 5 - Introduction to Regression<a class="anchor" id="DS102L8_page_5"></a>

[Back to Top](#DS102L8_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


In [5]:
from IPython.display import VimeoVideo
# Tutorial Video Name: Linear Regression
VimeoVideo('329392264', width=720, height=480)

The transcript for the above topic tutorial video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO102-L08-pg5tutorial.zip)**.

# What is Linear Regression? 

Linear regression is the way you will compute the parameters of the line that best fits your data. It gives a relationship between the horizontal and vertical values in a scatter plot. In the process of computing the parameters of the line, you can also determine how well the line fits your data.

*Simple linear regression*, which you will learn here, is when there are only two variables. It is possible to do linear regression with more than two variables; this is called *multiple linear regression*, and you'll dive into it later.  When you compute a linear regression, you are actually computing the equation of the line that best fits the data.

You may remember from algebra that a line is represented by an equation of this form:

```text
y = mx + b
```

Where you have the following information: 

* **x:** Variable on the horizontal axis.
* **y:** Variable on the vertical axis.
* **m:** The slope of the line; indicates how steep the angle of the line is.
* **b:** The intercept of the line; where the line crosses the y axis.

If you know the slope and intercept of the line, then you can compute the y value of any point on the line from its x value. (You can also, with some algebra, compute the x value of any point on the line from its y value.) So when you say that the linear regression computes the parameters of the line that best fits your data, you are saying that the linear regression computes the slope and the intercept of the line.
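
As a quick illustration of that arithmetic, the sketch below uses a made-up slope and intercept (the values of `m`, `b`, and `x` are hypothetical and are not taken from any dataset) to compute y from x and then recover x from y.

```{r}
# Hypothetical slope and intercept, chosen only for illustration
m <- 2
b <- 1

# Compute y from x using y = mx + b
x <- 3
y <- m * x + b
print(y)  # 7

# Rearranging gives x = (y - b) / m, which recovers the original x
x_from_y <- (y - b) / m
print(x_from_y)  # 3
```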

---

## Review
Below is a quiz to review the recently covered material. Quizzes are _not_ graded.

In [9]:
try:
    display(L8P5Q1, L8P5Q2, L8P5Q3)
except:
    pass


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 6 - Computing Linear Regression<a class="anchor" id="DS102L8_page_6"></a>

[Back to Top](#DS102L8_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


In [6]:
from IPython.display import VimeoVideo
# Tutorial Video Name: Linear Regression
VimeoVideo('329392186', width=720, height=480)

The transcript for the above topic tutorial video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO102-L08-pg6tutorial.zip)**.

# Computing Linear Regression

You will compute a linear regression using the data in the R dataset ```cars```. Previously, you used this dataset to create a box plot. ```cars``` includes speed measurements (in miles per hour) and stopping distances (in feet) for cars measured in the 1920s. Take a look:

```{r}
head(cars)
```

```text
  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10
```

Computing the linear regression in R is extremely simple. It is computed by the function ```lm()``` as follows:

```{r}
lin_reg <- lm(dist ~ speed, cars)
print(lin_reg)
```

```lin_reg``` is a variable name. The ```lm()``` function returns an object that stores all the information computed by the linear regression. So the assignment statement assigns this object to ```lin_reg```. When you call the ```print()``` function for the ```lin_reg``` object, it prints the information that ```lm()``` was called with, and it prints the coefficients of the line that best fits the data. 

You indicated which variables you wanted to fit the line with using the ```dist ~ speed``` argument to ```lm()```. In R terminology, this argument is a formula. In this formula, ```dist``` is the variable on the vertical axis, which you called ```y``` in the equation for the line above, and ```speed``` is the variable on the horizontal axis, which you called ```x```. You can read the tilde ```~``` as the word "by."  So, you are asking R to produce a line showing stopping distance by speed. The last argument to the ```lm()``` function is the data frame name: ```cars``` in this case.

When ```lin_reg``` is printed, this is the output R will provide: 

```text
Call:
lm(formula = dist ~ speed, data = cars)

Coefficients:
(Intercept)        speed  
    -17.579        3.932  
```

The coefficients are the slope and intercept of the line.  ```(Intercept)``` is the intercept, or ```b``` for the line, and the slope is labeled with the name of the x variable that was used to create it, so here the slope is labeled ```speed```. The equation for this particular linear regression line predicting stopping distance using speed is therefore:

```text
y = 3.932x - 17.579
```
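
If you would rather pull these numbers out of the model than read them off the printed output, base R's `coef()` function returns them as a named vector. The sketch below refits the model so that it runs on its own; the names `m` and `b` simply mirror the equation of a line.

```{r}
# Refit the model so this block runs on its own
lin_reg <- lm(dist ~ speed, data = cars)

# coef() returns a named vector: "(Intercept)" is b and "speed" is m
b <- coef(lin_reg)[["(Intercept)"]]
m <- coef(lin_reg)[["speed"]]

print(round(c(slope = m, intercept = b), 3))
```

Storing the coefficients this way avoids retyping rounded values when you use them in later calculations.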

---

# Linear Regression Model Summary

The object stored in ```lin_reg``` has much more information stored in it. You can access some of this information by the following command:

```{r}
summary(lin_reg)
```

And the summary information that R will provide you with is below:

```text
Call:
lm(formula = dist ~ speed, data = cars)

Residuals:
    Min      1Q  Median      3Q     Max 
-29.069  -9.525  -2.272   9.215  43.201 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -17.5791     6.7584  -2.601   0.0123 *  
speed         3.9324     0.4155   9.464 1.49e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared:  0.6511,    Adjusted R-squared:  0.6438 
F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12
```

The significance codes printed next to each *p* value read as follows:

* Three asterisks means 'less than 0.001'
* Two asterisks means 'less than 0.01'
* One asterisk means 'less than 0.05'
* One dot (or period) means 'less than 0.1'
* No code means 'less than 1'

---

## Is the Overall Model Significant?

The first thing you will want to look for is the ```F-statistic``` and its ```p-value```, at the very bottom.  These tell you if the overall model is significant.  What is meant by that? If the overall model is significant, it means that your x variable (or x variables, in the case of multiple regression) is a significant predictor of your y variable. If the *p* value isn't significant at *p* < .05 at the very least, then the rest of the output really isn't worth looking at.  You didn't find anything interesting to talk about or report!

---

## Which Individual Predictors are Significant? 

Luckily, speed is a significant predictor of stopping distance, since the *p* value is quite small; much smaller than .05. So, you can go on to look at the rest of the output. You'll next want to glance at the ```Coefficients``` section.  In the ```Coefficients``` section, you can find the information provided with the ```print()``` function above, under the column header ```Estimate```. The first row is the intercept and the second row labeled ```speed``` is the slope. Those are the values you would plug into your equation, just like before.  

The other important parts of this ```Coefficients``` table are the ```t value``` and ```Pr(>|t|)``` columns, for every row except ```Intercept```.  Because you only had one x variable, since you are doing simple linear regression, ```speed``` is the only row listed besides ```Intercept```.  If you were doing multiple regression, there would be multiple x variables and thus multiple rows after ```Intercept```. R conducts a *t*-test on each individual predictor of y to see if it contributes anything to the prediction model. The ```Pr(>|t|)``` column is the *p* value for this *t*-test, and if it is significant, then you know that the x variable has an impact on the y variable.  So, speed is a significant predictor of stopping distance according to the *t*-test as well.  Since there was only one x variable, you expect the *F*-test results to match the *t*-test results, in that they should both be significant.  That is the case here: both *F* and *t* are significant, with the same *p* value of 1.49e-12.
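
The values in this table can also be read programmatically. The sketch below refits the model, pulls the *t* value and *p* value for ```speed``` out of the summary object, and checks the link between the two tests: with a single predictor, the *F* statistic equals the square of the *t* value (9.464 squared is about 89.57). The variable names are hypothetical.

```{r}
lin_reg <- lm(dist ~ speed, data = cars)
model_summary <- summary(lin_reg)

# The coefficient table is a matrix with one row per term
coef_table <- model_summary$coefficients
t_speed <- coef_table["speed", "t value"]
p_speed <- coef_table["speed", "Pr(>|t|)"]

# The overall F statistic is stored as a named vector (value, numdf, dendf)
f_value <- model_summary$fstatistic[["value"]]

print(c(t_squared = t_speed^2, F = f_value))
print(p_speed < 0.05)
```

This equality only holds for simple linear regression; with several predictors, the *F*-test covers the whole model while each *t*-test covers one predictor.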

---

## How Much Variance is Explained by this Model?

Next, move down to the ```Multiple R-squared``` and ```Adjusted R-squared``` values.  Both measure how well the model fits, but the second one is adjusted for the number of predictors in the model, because plain R-squared never decreases when you add another predictor, even a useless one. As a general rule, looking at ```Adjusted R-squared``` is the more prudent thing to do. The R-squared value is also called the *coefficient of determination*. It is a measure of the proportion of the variability in the y variable that the line explains. In this case, because the Adjusted R-squared value is 0.6438, the line explains approximately 64% of the variability in stopping distance. Put another way, speed accounts for about 64% of the variation in stopping distance.  The rest is due to other variables that have not been included in the model and to random variation. The larger the R-squared value, the more closely the line fits the data.
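
To see where these two numbers come from, here is a minimal sketch that recomputes them by hand from the residuals, using the standard formulas for R-squared and adjusted R-squared. The variable names are hypothetical, and `k` is the number of predictors (one, for simple linear regression).

```{r}
lin_reg <- lm(dist ~ speed, data = cars)

# R-squared = 1 - (residual sum of squares / total sum of squares)
ss_residual <- sum(residuals(lin_reg)^2)
ss_total <- sum((cars$dist - mean(cars$dist))^2)
r_squared <- 1 - ss_residual / ss_total

# Adjusted R-squared penalizes for the number of predictors (k) given n rows
n <- nrow(cars)
k <- 1
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

print(round(c(r_squared = r_squared, adj_r_squared = adj_r_squared), 4))
```

The same values are available directly as `summary(lin_reg)$r.squared` and `summary(lin_reg)$adj.r.squared`.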

---

# Using Linear Regression to Predict Values

You can use this model to predict the necessary stopping distance for a given speed. For example, suppose you want to know the stopping distance for a 1920s vehicle traveling at 21 miles per hour. You could substitute 21 for x in the regression equation you computed above and solve for y to get the predicted stopping distance.

```text
y = mx + b
y = 3.932 * 21 - 17.579
y = 64.99
```

So your model predicts that a car going 21 miles per hour will require 64.99 feet to stop. You can describe what you have done graphically as well. In the image below, to find the predicted stopping distance at 21 miles per hour, you go up from the x axis at a value of 21 until you reach the regression line. Then, you go horizontally left to the y axis and read the value, which would be 64.99.

![A scatter plot. The x axis is labeled speed and runs from zero to twenty five. The y axis is labeled dist and runs from zero to one hundred twenty. Data is scattered from the bottom left to the upper right and a blue regression line moves from bottom left to upper right through the data. A red dashed line moves upward from twenty one on the x axis until it hits the regression line and then moves left until it hits the y axis at sixty four point nine nine three.](Media/L07-LinearRegressionPrediction.png)

Does your linear model guarantee that the actual stopping distance for a car going 21 miles per hour will be 64.99 feet? No, because there is variability in the relationship between speed and stopping distance. All you can say is that, based on your data, you would expect a car going 21 miles per hour to require something like 65 feet to stop; it may be longer or shorter than 65 feet.
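
R can do this substitution for you with the `predict()` function. The sketch below refits the model and asks for the prediction at 21 miles per hour; the names `new_speed` and `predicted_dist` are hypothetical. Because `predict()` uses the unrounded coefficients, the result is about 65.00 rather than the hand-computed 64.99.

```{r}
lin_reg <- lm(dist ~ speed, data = cars)

# predict() needs a data frame whose column name matches the predictor
new_speed <- data.frame(speed = 21)
predicted_dist <- predict(lin_reg, newdata = new_speed)
print(predicted_dist)

# A prediction interval gives a range for an individual car, not only a point
predict(lin_reg, newdata = new_speed, interval = "prediction")
```

The interval in the second call is one way to express the variability described above: it shows the range in which an individual stopping distance is expected to fall.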

Suppose you want to find the stopping distance of a car going 45 miles per hour. You could put 45 into the x value of the regression equation and compute y:

```text
y = mx + b
y = 3.932 * 45 - 17.579
y = 159.36
```

Your regression model shows that it will take 159.36 feet to stop. However, you should be very hesitant to accept this number for the following reason: you have created the model using speeds between 4 and 25 miles per hour.  There is no data to verify that your model will work well for speeds above 25 miles per hour.

In this case, you are using the model to extrapolate beyond the data, and extrapolation can be fraught with peril. In the case of the distance necessary to stop a car, this linear model may not be accurate at higher speeds.
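
One simple safeguard is to compare a new speed against the range of speeds in the data before trusting a prediction. The sketch below is only an illustration; `is_extrapolation()` is a hypothetical helper, not a built-in R function.

```{r}
# Range of speeds the model was fitted on
speed_range <- range(cars$speed)
print(speed_range)

# Hypothetical helper: TRUE when a speed lies outside the observed range
is_extrapolation <- function(speed) {
  speed < speed_range[1] | speed > speed_range[2]
}

print(is_extrapolation(21))
print(is_extrapolation(45))
```

A speed of 21 falls inside the observed range of 4 to 25 miles per hour, while 45 does not, so only the second prediction is an extrapolation.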

---

# Scatter Plot with a Best Fit Line

Lastly, you can create a scatter plot of the speed versus distance data with a line that best fits the data.  This is done with the ```method=lm``` argument; you are asking R to fit the linear model into the graph.

```{r}
# ggplot2 provides ggplot(), geom_point(), and geom_smooth()
library("ggplot2")
d <- ggplot(cars, aes(x = speed, y = dist))
d + geom_point() + geom_smooth(method=lm, se=FALSE)
```

Here is the result:

![A scatter plot. The x axis is labeled speed and runs from zero to twenty five. The y axis is labeled dist and runs from zero to one hundred twenty. Data is scattered from the bottom left to the upper right and a blue regression line moves from bottom left to upper right through the data.](Media/L07-LinearRegressionLinePlot.png)

The blue line is the line described by the linear regression equation you found above.

<div class="panel panel-info">
    <div class="panel-heading">
        <h3 class="panel-title">Tip!</h3>
    </div>
    <div class="panel-body">
        <p>Often, when you create a scatter plot, you will create the linear regression line at the same time so you can see how well the data fit the line.</p>
    </div>
</div>

---

## Summary

When you have two continuous variables, analyses like *t*-tests won't work.  Instead, you'll use correlation to measure the relationship between the variables, or simple linear regression to model that relationship and predict y values.  An ideal visualization for two continuous variables is a scatterplot with a best fit line; it allows you to visually assess a correlation.  

Correlations can be assessed by both their strength (ranging from 0 to 1) and their direction (positive or negative).  A positive correlation has both variables varying in the same direction, while a negative correlation has the variables varying in opposite directions.  

Linear regression can be used to predict values, and thus is often called *predictive modeling*. A line is created using the equation y = mx + b, where m is the slope and b is the y intercept.  Luckily, R calculates these values for you. 

---

## Review
Below is a quiz to review the recently covered material. Quizzes are _not_ graded.

In [10]:
try:
    display(L8P6Q1, L8P6Q2, L8P6Q3)
except:
    pass


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 7 - Key Terms, R Libraries and Functions <a class="anchor" id="DS102L8_page_7"></a>

[Back to Top](#DS102L8_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Key Terms

Below is a list and short description of the important keywords learned in this lesson. Please read through and go back and review any concepts you do not fully understand. Great Work!

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Correlation</td>
        <td>Linear relationship between two variables. Statistically denoted as r. </td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Pearson's R</td>
        <td>A type of correlation that is for two continuous, normally distributed variables.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Spearman's Rho</td>
        <td>A type of correlation that is for two non-normally distributed variables.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Kendall's Tau</td>
        <td>A type of correlation that is for two ordinal (ranked) variables.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Correlation Direction</td>
        <td>Can be positive, negative, or uncorrelated.  </td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Positive Correlation</td>
        <td>When two variables vary in the same direction.  The points form a line going up from left to right on the graph, and the r value is positive. </td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Negative Correlation</td>
        <td>When two variables vary in different directions.  The points form a line going down from left to right, and the r value is negative.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Uncorrelated</td>
        <td>When there is no correlation. This may mean there is no relationship between points at all, or it may mean that the relationship the points have is not linear. </td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Correlation Strength</td>
        <td>Ranges from 0 to 1, with zero being uncorrelated and 1 being perfectly correlated. Can be strong, moderate, or weak.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Correlation Matrix</td>
        <td>A table or graphic of all correlations in a dataset.  </td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Simple Linear Regression</td>
        <td>Regression with only one x and one y variable. </td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Multiple Linear Regression</td>
        <td>Regression with more than one x variable.  </td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Slope</td>
        <td>Steepness of the regression line (m). </td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>y-intercept</td>
        <td>Where the regression line crosses the y axis (b).</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Multiple R Squared</td>
        <td>The proportion of variance in y that can be explained by x.  Most appropriate for simple linear regression, where there is only one predictor. </td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Adjusted R Squared</td>
        <td>The proportion of variance in y that can be explained by x, adjusted for the number of predictors in the model, since the multiple R squared value never decreases when a predictor is added.  Thus used for multiple regression primarily.  </td>
    </tr>
</table>

---

# Key R Functions

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>geom_point()</td>
        <td>A function for ggplot that produces a scatterplot.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>geom_smooth()</td>
        <td>A function for ggplot that produces a best fit line. </td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>method=lm</td>
        <td>An argument for geom_smooth() that produces a linear model line.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>se=FALSE</td>
        <td>An argument to geom_smooth() that removes the grey shaded confidence region from the best fit line.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>cor.test()</td>
        <td>Computes the correlation between two variables and tests it for significance.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>method=</td>
        <td>An argument to cor.test() and chart.Correlation() that allows you to specify the type of correlation.  </td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>use="complete.obs"</td>
        <td>An argument to cor.test() that allows you to use as much data as is available, even if some points are missing.  </td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>chart.Correlation()</td>
        <td>Creates an easy correlation matrix graph.  </td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>histogram=</td>
        <td>Argument to chart.Correlation() that suppresses histograms when FALSE. </td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>cor()</td>
        <td>Creates a table correlation matrix.  </td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>corrplot()</td>
        <td>Creates visually pleasing correlation matrices.  </td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>lm()</td>
        <td>Creates a linear regression model.  </td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>summary()</td>
        <td>Provides additional output for linear regression and other statistics.</td>
    </tr>
</table>

---

# Key R Libraries

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>PerformanceAnalytics</td>
        <td>Package used for detailed correlation matrices.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>corrplot</td>
        <td>Creates visually pleasing correlation matrices. </td>
    </tr>
</table>


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 8 - Practice Hands On<a class="anchor" id="DS102L8_page_8"></a>

[Back to Top](#DS102L8_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


```c-lms
activity-type: project
activity-name: Lesson 8 Practice Hands-On
points: 0
due-at: 67%
close-at: end-of-module
```

For your Lesson 8 Hands-On, use the `mtcars` data frame to examine the effect that engine horsepower (`hp`) and vehicle weight (`wt`, measured in thousands of pounds) have on the time necessary to travel one quarter mile from a standing start (`qsec`). Use this information to answer the questions below. This Hands-On will **not** be graded, but you are encouraged to complete it. The best way to become a great Data Scientist is to practice! Please write your answers within an R Script file, and submit your project in the area below. 

---
## Requirements

1. Create a scatter plot with a trend line where the horizontal axis is engine horsepower and the vertical axis is quarter mile time. What is the relationship between time and engine horsepower: positively correlated, negatively correlated, or uncorrelated? 

2. Compute the linear regression for time and engine horsepower.  What is the equation of the line? What is the R-squared value? Is this what you would expect?

3. Create a scatter plot with a trend line where the horizontal axis is vehicle weight and the vertical axis is quarter mile time. What is this relationship: positively correlated, negatively correlated, or uncorrelated? 

4. Compute the linear regression for these two variables. What is the equation of the line? What is the R-squared value? Is this what you would expect?

5. Create a report (Microsoft PowerPoint or equivalent) that shows your results and the code you used to generate the results. Please include your interpretation of the data and answer all the questions posed above. 

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Be sure to zip and submit your entire document when finished!</p>
    </div>
</div>

<div class="panel panel-info">
    <div class="panel-heading">
        <h3 class="panel-title">Tip!</h3>
    </div>
    <div class="panel-body">
        <p>To zip your file on <b>Windows</b>, right click on the file and select "Send to", then select "Compressed (zipped) folder". For <b>Mac</b> users, right click on the file and select "Compress", then select your file from the options.</p>
    </div>
</div>


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 9 - Hands-On Solution<a class="anchor" id="DS102L8_page_9"></a>

[Back to Top](#DS102L8_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


```c-lms
topic: Hands-On Solution
```

---
# Solutions 

---
## Part 1 Code Solution 

```{r}
#To view the dataset 
mtcars 

#Overall goal: Examine the effect of horsepower and vehicle weight on the time necessary to travel a quarter mile from a standing start

#Scatter plot with engine horsepower and quarter mile time

library("ggplot2")
ggplot(data=mtcars, aes(x = hp, y = qsec)) + geom_point() + geom_smooth(method='lm', se = TRUE)

#The variables in the above graph are negatively correlated, because the points trend downward from left to right.

#Compute the linear regression for engine horsepower and quarter mile  time

regression <- lm(qsec~hp, mtcars)
summary(regression)

#The equation of the line is: y = -0.02x + 20.56

#The adjusted R squared value is .49, meaning that horsepower explains about 49% of the variance in the quarter mile time.

#This is what I would expect - the more horsepower, the more powerful engine and the faster a car should be able to go.
```

---
## Part 2 Code Solution

```{r}

#Scatter plot with vehicle weight and quarter mile time

library("ggplot2")
ggplot(data=mtcars, aes(x = wt, y = qsec)) + geom_point() + geom_smooth(method='lm', se = TRUE)

#The above graph is uncorrelated.

#Compute the linear regression for vehicle weight and quarter mile time

regression2 <- lm(qsec~wt, mtcars)
summary(regression2)

#The equation of the line is: y = -.32x + 18.88

#The adjusted R squared value is -.00 (effectively zero), meaning that vehicle weight accounts for almost none of the variance in quarter mile time

#I would have thought that if a car was heavier, it would be slower off the mark.  It looks like weights just don't vary enough in this dataset though.
```

---
## Example Presentation

An example presentation for this hands on can be found **<a href="https://repo.exeterlms.com/documents/V2/DataScience/Stat-Prog-R/Lesson-7-Hands-On-Presentation.zip"> here </a>**.
