# DS106 Modeling : Lesson One Companion Notebook

### Table of Contents <a class="anchor" id="DS106L1_toc"></a>

* [Table of Contents](#DS106L1_toc)
    * [Page 1 - Introduction to Modeling](#DS106L1_page_1)
    * [Page 2 - Introduction to Linear Regression](#DS106L1_page_2)
    * [Page 3 - A Basic Regression Example](#DS106L1_page_3)
    * [Page 4 - ](#DS106L1_page_4)
    * [Page 5 - ](#DS106L1_page_5)
    * [Page 6 - ](#DS106L1_page_6)
    * [Page 7 - ](#DS106L1_page_7)
    * [Page 8 - ](#DS106L1_page_8)
    * [Page 9 - ](#DS106L1_page_9)
    * [Page 10 - ](#DS106L1_page_10)
    

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 1 - Introduction to Modeling <a class="anchor" id="DS106L1_page_1"></a>

[Back to Top](#DS106L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

In [1]:
from IPython.display import VimeoVideo
# Tutorial Video Name: Introduction to Modeling
VimeoVideo('246121269', width=720, height=480)

# Introduction

This lesson will be your first foray into modeling.  At its core, modeling is built on regression, and you will learn the different ways to build on a simple linear regression so that you can best fit your data!

By the end of this lesson, you should be able to:

* Calculate the equation of a line
* Understand residuals
* List the assumptions for linear regression
* Test assumptions and complete linear regression in R and Python

This lesson will culminate in a hands-on in which you complete your own linear regression analyses in R and Python.

<div class="panel panel-success">
    <div class="panel-heading">
        <h3 class="panel-title">Additional Info!</h3>
    </div>
    <div class="panel-body">
        <p>You may want to watch this <a href="https://vimeo.com/436309839"> recorded live workshop </a> that goes over the theory of regressions and the statistical assumptions associated with them.</p>
    </div>
</div>

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 2 - Introduction to Linear Regression<a class="anchor" id="DS106L1_page_2"></a>

[Back to Top](#DS106L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


*Simple linear regression* is a method used to fit a line to data when you only have two variables.  Regression allows you to write an equation that models the relationship between the independent variable, x, and the dependent variable, y. You can think of x as a predictor, and y as the outcome you are trying to predict. When the model is created, it allows you to predict the value of y for any given value of x. 

The table below shows all the different terminology you may encounter for independent and dependent variables.

![A table with two columns. Column one, independent variable. I V. X. Predictor. Explanatory. Column two, dependent variable. D V. Y. Outcome. Response.](Media/106.L6.20.png)

---

## Linear Equations

A linear equation is the technical term for any equation that describes a line. Two important characteristics of a line are its *slope* and its *y-intercept*.

---

### y-Intercept

The *y-intercept* is the value at which the line crosses the y-axis. Stated differently, the y-intercept is the value of y that corresponds to x = 0. Given the equation of a line, you can find the y-intercept by plugging in a zero for the x. You may remember having done this during algebra class in high school. Remember that old standby equation, ```y = mx + b```? Well, the ```b``` part of it is the y-intercept.

For example, if your equation is:

```y = 12x + 7```

and you set the value of x to 0, then the equation becomes:

```y = 12(0) + 7```.  

Anything times zero is zero, so ```y = 7```.
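
If it helps to see the substitution carried out, here is a minimal sketch in Python. The function name `line` and its default arguments are just illustrative choices for this example, not anything the lesson requires.

```python
def line(x, m=12, b=7):
    """Evaluate y = m*x + b; the defaults match the example equation y = 12x + 7."""
    return m * x + b

# Plugging in x = 0 leaves only the b term, which is the y-intercept.
y_intercept = line(0)
print(y_intercept)  # 7
```

Changing `b` and calling `line(0)` again is a quick way to convince yourself that the y-intercept depends only on `b`, never on the slope.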

---

### Slope

The *slope* of a line, ```m```, is a measure of how steep the line is. It shows how much influence x has on y. If the slope is positive, then the value of y increases as the value of x increases. If the slope is negative, the value of y decreases as x increases. This mirrors the sign of a correlation: a positive slope goes with a positive correlation, and a negative slope goes with a negative correlation.

When the slope is expressed as a fraction, the numerator (the number on top) indicates the vertical change when moving from a single point on the line to another point on the line. The denominator (the number on the bottom) indicates the horizontal change. Try an example:

There is a point on the graph below at (0,2) - which is expressed as being at (x,y).  So it is 0 on the x-axis, and 2 on the y-axis. If you have the linear equation ```y = (3/5)x + 2```, where would the next point fall?

![A graph showing the x and y axes. There is a point plotted at open parentheses zero comma two close parentheses. A line with an upward slope passes through the point.](Media/L01-01.png)

Since the slope is 3/5, you can move from the point (0, 2) to another point on the line by going up 3 (the numerator of the slope; up is positive) and to the right 5 (the denominator of the slope; right is positive). This brings you to the point (5, 5), which is also on the line.

![A graph showing the x and y axes. There is a point plotted at open parentheses zero comma two close parentheses and a point at open parentheses five comma five close parentheses. A line with an upward slope of three fifths passes through the two points and continues in both directions.](Media/L01-02.png)

With two points, you now have a better idea of the shape of the line and you have a good idea of the relationship between x and y.
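
The "up 3, right 5" step can also be checked with a few lines of Python. This is only a sketch: the helper name `slope` is made up for this example, and `Fraction` is used so that the slope stays as 3/5 instead of becoming the decimal 0.6.

```python
from fractions import Fraction

def slope(p1, p2):
    """Return rise over run between two (x, y) points (fails for a vertical line)."""
    (x1, y1), (x2, y2) = p1, p2
    return Fraction(y2 - y1, x2 - x1)

start = (0, 2)
m = Fraction(3, 5)

# Move right by the denominator (run) and up by the numerator (rise).
next_point = (start[0] + m.denominator, start[1] + m.numerator)
print(next_point)                # (5, 5)
print(slope(start, next_point))  # 3/5
```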

---

## Linear Equations in Statistics

In regression, you will probably see different notation than in algebra. The y-intercept is represented with the symbol b<sub>0</sub> and the slope with the symbol b<sub>1</sub>. You can remember the difference between these two by thinking that your y-intercept is the y value when x is zero, so the symbol has a zero in it (b<sub>0</sub>).

This leads to the following representation of the linear equation:

> y = b<sub>1</sub>x +  b<sub>0</sub>

Sometimes, you may also see the y-intercept as the first part of this equation instead of tacked onto the end (order changed).

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 3 - A Basic Regression Example<a class="anchor" id="DS106L1_page_3"></a>

[Back to Top](#DS106L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# A Basic Regression Example to Sink your *Teeth* Into

You will revisit the **[crocodile data you used previously](https://repo.exeterlms.com/documents/V2/DataScience/Data-Wrang-Visual/crocodiles.zip)**. 

Recall that the crocodile data was arranged in three columns (common name, head length, and body length), and looks like this:

![A table with three columns, common name, head length, and body length. In each row, the common name is estuarine crocodile. Each row has different data for head length and body length.](Media/L01-03.png)

---

## Interpreting the Regression Output

You can assume that there is a linear relationship between the head length and the body length, and you can express this symbolically as:

> y = b<sub>1</sub>x + b<sub>0</sub>

where y represents the predicted body length of an estuarine crocodile with a head length of x. The number b<sub>1</sub> is called the *estimated regression coefficient*. It represents how much y is expected to change for each one-unit change in x. Regardless of what tool is used to create the regression equation, the output will give values for b<sub>0</sub> and b<sub>1</sub>. In this case, the y-intercept is b<sub>0</sub> = −18.274 and the slope is b<sub>1</sub> = 7.660.

---

### Interpreting the Slope

When the regression equation was created, the head length was the predictor variable (x), and the body length was the response (y). So, in terms of the regression equation, you have a slope of b<sub>1</sub> = 7.660. In other words, with every 1 cm increase in the length of the head, the body is expected to increase by about 7.66 cm. This is the practical interpretation for any simple linear regression equation: a one unit change in the horizontal axis variable predicts a b<sub>1</sub> unit change in the vertical axis variable.
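
You can verify this interpretation numerically. The sketch below hard-codes the coefficients reported above; the function name `predict_body_length` and the head lengths of 40 cm and 41 cm are arbitrary choices for illustration (both fall inside the observed range of head lengths).

```python
B0 = -18.274  # y-intercept from the regression output
B1 = 7.660    # slope from the regression output

def predict_body_length(head_cm):
    """Predicted body length (cm) for a given head length (cm)."""
    return B1 * head_cm + B0

# Increase the head length by exactly 1 cm and compare the two predictions.
change = predict_body_length(41) - predict_body_length(40)
print(round(change, 3))  # 7.66
```

The difference equals the slope no matter which pair of head lengths one unit apart you choose, because the intercept cancels out in the subtraction.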

---

### Interpreting the y-Intercept

What about the interpretation of b<sub>0</sub>, the y-intercept? For this, you often need to be a bit careful. For example, with the crocodile data, the y-intercept of the regression equation is b<sub>0</sub> = −18.274. This means that if the head length gets down to 0 cm, the predicted body length is about −18.3 cm. Well, you can't have a negative body length, and you also can't have a head that is 0 cm long.  It just doesn't make sense.

Note that in the data table, the head lengths range from 24 cm to 61 cm. You have no business predicting the body length of a crocodile whose head length is outside of this range, because you don't have enough data to be accurate.
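
One way to stay honest about this is to check a head length against the observed range before trusting a prediction. The following is a sketch only: the constant and function names are hypothetical, and the bounds are the 24 cm and 61 cm limits mentioned above.

```python
HEAD_MIN_CM = 24  # smallest head length in the data table
HEAD_MAX_CM = 61  # largest head length in the data table

def is_extrapolation(head_cm):
    """Return True when head_cm falls outside the observed range of head lengths."""
    return not (HEAD_MIN_CM <= head_cm <= HEAD_MAX_CM)

print(is_extrapolation(40))  # False
print(is_extrapolation(0))   # True
```

A head length of 0 cm is flagged, which is exactly why the y-intercept has no sensible real-world interpretation here.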

---

## Making Predictions

During an archaeological dig in northwest Africa, an 87 cm skull from a juvenile sarcosuchus (an extinct genus of crocodilian and a distant relative of living crocodiles, believed to have lived 112 million years ago) was discovered. You want to estimate the body length of this crocodilian using the regression equation you calculated for the estuarine crocodiles.

![A drawing of a sarcosuchus, an extinct genus of crocodile that is believed to have lived one hundred and twelve million years ago.](Media/L01-04.png)

You are going to ignore for a moment that making this estimate will require extrapolation (predicting outside the range of your data, which is really guesstimating), because in the absence of anything else to help you predict, it seems to be the best approach. Besides, extrapolation on the high side of the range is less risky than extrapolating on the low side. And all you are really trying to do anyway is estimate the length of an extinct crocodile. It is not likely to affect world peace.

To predict the body length, you can substitute x = 87 into the regression equation, since the skull that was found was 87 cm long:

```y = 7.660x - 18.274```

```y = 7.660(87) - 18.274```

```y = 648.146```

Stop and think about this for a minute. This juvenile sarcosuchus is estimated to be nearly six and a half meters long! That is over 21 feet! Imagine how big it might have been if it reached maturity! And you thought regular crocs were scary!
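
If you would like to double-check the arithmetic and the unit conversions, here is a short sketch. The variable names are illustrative, and the conversion factors (100 cm per meter, 30.48 cm per foot) are standard values rather than anything taken from the crocodile data.

```python
B0, B1 = -18.274, 7.660  # y-intercept and slope from the regression output
CM_PER_METER = 100
CM_PER_FOOT = 30.48

skull_cm = 87
body_cm = B1 * skull_cm + B0  # substitute x = 87 into y = b1*x + b0

print(f"{body_cm:.3f} cm")                 # 648.146 cm
print(f"{body_cm / CM_PER_METER:.2f} m")   # 6.48 m
print(f"{body_cm / CM_PER_FOOT:.1f} ft")   # 21.3 ft
```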

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 4 - <a class="anchor" id="DS106L1_page_4"></a>

[Back to Top](#DS106L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 5 - <a class="anchor" id="DS106L1_page_5"></a>

[Back to Top](#DS106L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 6 - <a class="anchor" id="DS106L1_page_6"></a>

[Back to Top](#DS106L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 7 - <a class="anchor" id="DS106L1_page_7"></a>

[Back to Top](#DS106L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 8 - <a class="anchor" id="DS106L1_page_8"></a>

[Back to Top](#DS106L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 9 - <a class="anchor" id="DS106L1_page_9"></a>

[Back to Top](#DS106L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 10 - <a class="anchor" id="DS106L1_page_10"></a>

[Back to Top](#DS106L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">