# DS106 Modeling : Modeling with linear regression

### Table of Contents <a class="anchor" id="DS106L1_toc"></a>

* [Table of Contents](#DS106L1_toc)
    * [Page 1 - Introduction to Modeling](#DS106L1_page_1)
    * [Page 2 - Introduction to Linear Regression](#DS106L1_page_2)
    * [Page 3 - A Basic Regression Example](#DS106L1_page_3)
    * [Page 4 - What is a Residual? ](#DS106L1_page_4)
    * [Page 5 - Assumptions for Linear Regression](#DS106L1_page_5)
    * [Page 6 - You'll Get Along Swimmingly with Regression in R!](#DS106L1_page_6)
    * [Page 7 - Running Simple Linear Regression and Interpreting the Output](#DS106L1_page_7)
    * [Page 8 - Regression in Python](#DS106L1_page_8)
    * [Page 9 - Interpreting the Regression Output](#DS106L1_page_9)
    * [Page 10 - Key Terms](#DS106L1_page_10)
    * [Page 11 - Lesson 1 Practice Hands-On](#DS106L1_page_11)
    * [Page 12 - Lesson 1 Practice Hands-On Solution](#DS106L1_page_12)
    

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 1 - Introduction to Modeling <a class="anchor" id="DS106L1_page_1"></a>

[Back to Top](#DS106L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

from IPython.display import VimeoVideo
# Tutorial Video Name: Introduction to Modeling
VimeoVideo('246121269', width=720, height=480)

# Introduction

This lesson will be your first foray into modeling.  At its core, a great deal of modeling builds on regression - and you will learn all the different ways to build on a simple linear regression so that you can best fit your data!

By the end of this lesson, you should be able to:

* Calculate the equation of a line
* Understand residuals
* List the assumptions for linear regression
* Test assumptions and complete linear regression in R and Python

This lesson will culminate in a hands-on in which you complete your own linear regression analyses in R and Python.

<div class="panel panel-success">
    <div class="panel-heading">
        <h3 class="panel-title">Additional Info!</h3>
    </div>
    <div class="panel-body">
        <p>You may want to watch this <a href="https://vimeo.com/436309839"> recorded live workshop </a> that goes over the theory of regressions and the statistical assumptions associated with them.</p>
    </div>
</div>

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 2 - Introduction to Linear Regression<a class="anchor" id="DS106L1_page_2"></a>

[Back to Top](#DS106L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


*Simple linear regression* is a method used to fit a line to data when you only have two variables.  Regression allows you to write an equation that models the relationship between the independent variable, x, and the dependent variable, y. You can think of x as a predictor, and y as the outcome you are trying to predict. When the model is created, it allows you to predict the value of y for any given value of x. 

The table below shows all the different terminology you may encounter for independent and dependent variables.

![A table with two columns. Column one, independent variable. I V. X. Predictor. Explanatory. Column two, dependent variable. D V. Y. Outcome. Response.](Media/106.L6.20.png)

---

## Linear Equations

A linear equation is the technical term for any equation that describes a line. Two important characteristics of a line are its *slope* and its *y-intercept*.

---

### y-Intercept

The *y-intercept* is the value at which the line crosses the y-axis. Stated differently, the y-intercept is the value of y that corresponds to x = 0. Given the equation of a line, you can find the y-intercept by plugging in a zero for the x. You may remember having done this during algebra class in high school. Remember that old standby equation, ```y = mx + b```? Well, the ```b``` part of it is the y-intercept.

For example, if your equation is:

```y = 12x + 7```

and you set the value of x to 0, then the equation becomes:

```y = 12(0) + 7```.  

Anything times zero is zero, so ```y = 7```.

---

### Slope

The *slope* of a line, ```m```, is a measure of how steep the line is. It shows how much influence x has on y. If the slope is positive, then the value of y increases as the value of x increases. If the slope is negative, the value of y decreases as x increases. This is similar to correlations. 

When the slope is expressed as a fraction, the numerator (the number on top) indicates the vertical change when moving from a single point on the line to another point on the line. The denominator (the number on the bottom) indicates the horizontal change. Try an example:

There is a point on the graph below at (0,2) - which is expressed as being at (x,y).  So it is 0 on the x-axis, and 2 on the y-axis. If you have the linear equation ```y = (3/5)x + 2```, where would the next point fall?

![A graph showing the x and y axes. There is a point plotted at open parentheses zero comma two close parentheses. A line with an upward slope passes through the point.](Media/L01-01.png)

Since the slope is 3/5, then if you want to move from the point (0, 2) to another point on the graph, you can go up 3 (the numerator of the slope, up is positive) and to the right 5 (the denominator of the slope, right is positive) to arrive at the point (5, 5), which will also be on the line.

![A graph showing the x and y axes. There is a point plotted at open parentheses zero comma two close parentheses and a point at open parentheses five comma five close parentheses. A line with an upward slope of three fifths passes through the two points and continues in both directions.](Media/L01-02.png)

With two points, you now have a better idea of the shape of the line and you have a good idea of the relationship between x and y.
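To make the slope and y-intercept concrete, here is a minimal Python sketch that evaluates the example line ```y = (3/5)x + 2``` at a few x values. The function name ```line_y``` is only an illustrative choice, not something from the lesson.

```python
def line_y(x, slope, intercept):
    """Return y for a given x on the line y = slope * x + intercept."""
    return slope * x + intercept


# The example line from this page: y = (3/5)x + 2
slope, intercept = 3 / 5, 2

# Moving right by 5 (the denominator) raises y by 3 (the numerator).
for x in (0, 5, 10):
    print(f"x = {x:>2}  ->  y = {line_y(x, slope, intercept):.1f}")
```

Plugging in x = 0 returns the y-intercept, and each step of 5 along the x-axis adds 3 to y, which is exactly what a slope of 3/5 means.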

---

## Linear Equations in Statistics

In regression, you probably will see different notation than in algebra. The y-intercept is represented with the symbol b<sub>0</sub> and the slope with the symbol b<sub>1</sub>. You can remember the difference between these two by thinking that your y-intercept is the y value when x is zero - so the symbol has a zero in it ( b<sub>0</sub>). 

This leads to the following representation of the linear equation:

> y = b<sub>1</sub>x +  b<sub>0</sub>

Sometimes, you may also see the y-intercept as the first part of this equation instead of tacked onto the end (order changed).

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 3 - A Basic Regression Example<a class="anchor" id="DS106L1_page_3"></a>

[Back to Top](#DS106L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# A Basic Regression Example to Sink your *Teeth* Into

You will revisit the **[crocodile data you used previously](https://repo.exeterlms.com/documents/V2/DataScience/Data-Wrang-Visual/crocodiles.zip)**. 

Recall that the crocodile data was arranged in three columns, and looks like this:

![A table with three columns, common name, head length, and body length. In each row, the common name is estuarine crocodile. Each row has different data for head length and body length.](Media/L01-03.png)

---

## Interpreting the Regression Output

You can assume that there is a linear relationship between the head length and the body length, and you can express this symbolically as:

> y = b<sub>1</sub>x + b<sub>0</sub>

where y represents the predicted body length of an estuarine crocodile with a head length of x. The number b<sub>1</sub> is called the *estimated regression coefficient*. The coefficient represents the magnitude of change that will take place in y based on the x variable. Regardless of what tool is used to create the regression equation, the output will give values for b<sub>0</sub> and b<sub>1</sub>. In this case, the y-intercept, b<sub>0</sub>, = −18.274. The slope, b<sub>1</sub>, = 7.660.

---

### Interpreting the Slope

When the regression equation was created, the head length was the predictor variable (x), and the body length was the response (y). So, in terms of the regression equation, you have a slope of b<sub>1</sub> = 7.660. Or, with every 1 cm increase in the length of the head, the body is expected to increase by about 7.66 cm. This is the practical interpretation for any simple linear regression equation - that is, a one unit change in the horizontal axis variable predicts a b<sub>1</sub> unit change in the vertical axis variable. 

---

### Interpreting the y-Intercept

What about the interpretation of b<sub>0</sub>, the y-intercept? For this, you often need to be a bit careful. For example, with the crocodile data, the y-intercept of the regression equation is b<sub>0</sub> = −18.274. This means that if the head length gets down to 0 cm, the predicted body length is ~ -18.3 cm. Well, you can't have a negative body length, and you also can't have a head that is zero cm.  It just doesn't make sense.

Note that in the data table, the head lengths range from 24 cm to 61 cm. You have no business predicting the body length of a crocodile whose head length is outside of this range, because you don't have enough data to be accurate.

---

## Making Predictions

During an archaeological dig in northwest Africa, an 87 cm skull from a juvenile sarcosuchus (an extinct genus that is a distant relative of living crocodiles, believed to have lived 112 million years ago) was discovered. You want to estimate the body length of this crocodilian using the regression equation you calculated for the estuarine crocodiles.

![A drawing of a sarcosuchus, an extinct genus of crocodile that is believed to have lived one hundred and twelve million years ago.](Media/L01-04.png)

You are going to ignore for a moment that making this estimate will require extrapolation (predicting outside the range of the data you have), because in the absence of anything else to help you predict, it seems to be the best approach. Besides, extrapolating above the observed range here at least yields a physically possible length, whereas extrapolating far enough below it yields a negative one. And all you are really trying to do anyway is estimate the length of an extinct crocodile. It is not likely to affect world peace.

To predict the body length, you can substitute x = 87 into the regression equation, since the found skull was 87cm:

```y = 7.660x - 18.274```

```y = 7.660(87) - 18.274```

```y = 648.146```

Stop and think about this for a minute. This juvenile sarcosuchus is estimated to be nearly six and a half meters long! That is over 21 feet! Imagine how big it might have been if it reached maturity! And you thought regular crocs were scary!
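The same prediction can be sketched in a few lines of Python. The coefficients and the 24 cm to 61 cm head-length range come from this page; the function names ```predict_body_length``` and ```is_extrapolation``` are hypothetical helpers made up for this illustration.

```python
B0 = -18.274              # y-intercept from the crocodile regression
B1 = 7.660                # slope from the crocodile regression
HEAD_RANGE_CM = (24, 61)  # head lengths observed in the data


def predict_body_length(head_cm):
    """Predicted body length (cm) for a given head length (cm)."""
    return B1 * head_cm + B0


def is_extrapolation(head_cm):
    """True when the head length falls outside the observed range."""
    low, high = HEAD_RANGE_CM
    return not (low <= head_cm <= high)


skull_cm = 87
body_cm = predict_body_length(skull_cm)
print(f"predicted body length: {body_cm:.3f} cm")
print(f"that is {body_cm / 100:.2f} m, or {body_cm / 30.48:.1f} ft")
print("extrapolating beyond the data:", is_extrapolation(skull_cm))
```

Keeping the observed range next to the coefficients is a cheap way to remind yourself (and your coworkers) when a prediction is an extrapolation.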

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 4 - What is a Residual? <a class="anchor" id="DS106L1_page_4"></a>

[Back to Top](#DS106L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# What is a Residual? 

Once you start using a regression to make predictions, the question always arises of "how close am I to reality?" Does your regression make good estimates, that are close to the real data, or bad estimates that are nowhere near it? You can answer these questions by looking at *residuals*. A residual in regression is simply the difference between the actual y and the predicted y. Typically, you want the residuals in regression to be small (indicating less error) and to be scattered evenly around zero, with no pattern.

In this very simple picture of a linear regression, the blue line represents the regression equation. Each red dot is a single data point. There are some short green segments leading from each red dot to the blue line.  The green lines are the residuals, and each data point has one.

![A graph with a linear regression. The x axis runs from zero to five in increments of one. The y axis runs from four to ten in increments of one. A blue line represents the regression equation and slopes upward, starting from less than zero point five on the x axis. Red dots are plotted on the graph, and each red dot is a single data point. Short green segments lead from each red dot to the blue line. The green lines are residuals.](Media/L01-07.png)

For any given value of x, the residual is the difference between what the _model_ (derived from your regression equation) says the y should be and the actual value for y. The longer the green segment, the larger the residual. Residuals can be either positive (red dot is above the blue line) or negative (red dot is below the blue line). If a data point happens to lie exactly on the regression line, the residual is zero. That means your equation guessed the value perfectly!
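Here is a small Python sketch of that calculation. The data points and the fitted line are hypothetical numbers chosen for illustration; they are not the values plotted in the figure above.

```python
# Hypothetical data points and a hypothetical fitted line y = 1.0x + 5.0
xs = [1.0, 2.0, 3.0, 4.0]
ys = [6.5, 6.8, 8.0, 9.4]
b0, b1 = 5.0, 1.0

# Predicted y for each x, straight from the regression equation
predicted = [b1 * x + b0 for x in xs]

# Residual = actual y minus predicted y
# (positive when the point sits above the line, negative when below)
residuals = [y - p for y, p in zip(ys, predicted)]

for x, y, p, r in zip(xs, ys, predicted, residuals):
    print(f"x={x:.0f}  actual={y:.1f}  predicted={p:.1f}  residual={r:+.1f}")
```

The third point lies exactly on the line, so its residual is zero; the others are above or below the line by the amount shown.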

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 5 - Assumptions for Linear Regression<a class="anchor" id="DS106L1_page_5"></a>

[Back to Top](#DS106L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Assumptions for Linear Regression

In order to use regression for hypothesis testing, certain conditions need to be satisfied. These are called *assumptions*, and they ensure that the data meets the minimum requirements necessary to draw inferences without adding too much bias.  There are six assumptions for a linear regression model:

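1.  There is a linear relationship between x and y.
2.  *Normality of the error term*: the errors (residuals) are approximately normally distributed.
3.  *Homoscedasticity* (also called *homogeneity of variance*): the variance of the error terms is constant for all values of x.
4.  The x's are fixed and measured without error. (In other words, the x's can be considered as known constants.)
5.  *Independence*: the observations are independent. With more than one predictor, you also check for *multicollinearity*, meaning predictors that are strongly correlated with each other.
6.  Lack of outliers.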

You will learn about each of these assumptions first on a theoretical level, and then in the next few pages will learn how to test them in Python and R.

---

## 1. Testing for a Linear Relationship between x and y

To check for a linear relationship, two things must be done:

* Create a scatterplot, and look at the shape. Does it look like a straight line?
* Plot the residuals, and look for an 'absence of patterns'

---

### Examine the Scatterplot for Linearity

What you are looking for is a shape that is obviously not linear, because if you find one, you shouldn't do a linear regression! There are other types of regression that might work, however.

---

### Examine the Residuals for Patterns

You are looking for a residual plot with basically a straight line across.  No real pattern or shape to it, like the one below: 

![A residual plot showing data points that entail a mostly straight line across. There is no real pattern or shape to it.](Media/L01-11.png)

And here are some problematic residual plots: 

![A problematic residual plot with a megaphone shaped data cloud, indicating that there may be another variable that is not being accounted for and that is influencing the data.](Media/L01-12.png)

The above image indicates that there may be another variable you are not accounting for that is influencing your data.  This is called a *moderator*.  

![A problematic residual plot with an upside down U shaped data cloud, indicating that the distribution is non linear and that an exponential model should be used rather than a linear model.](Media/L01-13.png)

This image above indicates that you have a non-linear relationship and should try a non-linear model (for example, one with a squared term, or an exponential one) rather than a linear one.

---

## 2. Testing for Normality of the Error Term

Error is just another word for residuals, so this assumption asks whether the residuals are approximately normally distributed. It is closely tied to *homoscedasticity*, which basically means that at each level of the predictor variable, the spread of the error should be relatively constant. Approximately the same. Said another way, if you have homoscedasticity, you have equal variance for all levels of your independent variable. Normality can be checked with either a histogram or a Q-Q plot of the residuals. The histogram is shown here:

![A histogram of residuals. Data is shown in vertical bars. X axis negative fifteen point nine, negative nine point nine, y axis, four. X axis negative nine point nine, negative three point nine, y axis eight. X axis negative three point nine, two point one, y axis eleven. X axis two point one, eight point one, y axis eight. X axis eight point one, fourteen point one has no data. X axis fourteen point one, twenty point one, y axis two. X axis twenty point one, twenty six point one, y axis two.](Media/L01-14.png)

On first glance, your knee-jerk reaction might be "hey, that doesn't look normal at all." It is easy to be too critical here. Is the histogram of the residuals tall in the middle, and short on the edges? If so, consider the residuals (or error term) approximately normally distributed.

---

## 3. Testing for Homogeneity of Variance

The third requirement is that there is constant variance in the error term. The way to check this is (once again) to look at the residuals plot. Are you starting to think that looking at residuals is a great diagnostic? If you aren't, you should...

Non-constant variance in the error term is detected by seeing a shape in the residuals plot that looks like a megaphone. If there is no obvious megaphone shape, the variance can be assumed to be constant.

A megaphone shaped residuals plot would have a data cloud that looks like this:

![A residual plot with a megaphone shaped data cloud.](Media/L01-12.png)

<div class="panel panel-info">
    <div class="panel-heading">
        <h3 class="panel-title">Tip!</h3>
    </div>
    <div class="panel-body">
        <p>Non-constant variance can also show up as a cone shape facing the other direction! </p>
    </div>
</div>

If your residuals don't look like this, and instead form a flat, even band, you can proceed.
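If you would like a rough number to go with the eyeball check, one option is to compare the spread of the residuals for low fitted values against the spread for high fitted values. The Python sketch below does that with hypothetical residuals; ```spread_ratio``` is a made-up helper for this illustration, and it is an informal check rather than a formal statistical test.

```python
from statistics import pstdev


def spread_ratio(fitted, residuals):
    """Residual spread in the upper half of fitted values divided by
    the spread in the lower half. Values near 1 suggest a flat band;
    values far from 1 suggest a megaphone or cone."""
    pairs = sorted(zip(fitted, residuals))
    half = len(pairs) // 2
    low = [r for _, r in pairs[:half]]
    high = [r for _, r in pairs[half:]]
    return pstdev(high) / pstdev(low)


fitted = [1, 2, 3, 4, 5, 6, 7, 8]

# Hypothetical residuals: an even band vs. a widening megaphone
flat_band = [1, -1, 1, -1, 1, -1, 1, -1]
megaphone = [0.5, -0.5, 1, -1, 2, -2, 4, -4]

print("flat band ratio:", round(spread_ratio(fitted, flat_band), 2))
print("megaphone ratio:", round(spread_ratio(fitted, megaphone), 2))
```

A cone facing the other direction would give a ratio well below 1 instead of well above it.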

---

## 4. Testing the X as a Known Constant

The fourth requirement is that the x's are measured without error. This is nearly impossible to check, so it is usually assumed. Careful data collection and entry will ensure the independent variable is measured with as little error as possible.  (Having no error is nearly impossible!) 

---

## 5. Testing for Independence and Multicollinearity

The fifth requirement is that the observations, or y's, are independent. Independence is a tricky topic when it comes to statistics. In this case, it is saying that knowing the value of any one of the y's doesn't tell you anything about any other data point. In other words, if a change in the value of a response is going to affect the value of another response variable, you have dependence; otherwise, there is independence.

Here's another way to illustrate this. Suppose you had 48 kids come to your home on Halloween last year for trick-or-treat. This year, the number of trick-or-treaters you get will depend on a lot of things, such as the weather, or the day of the week where Halloween falls, or whether or not there will be an awesome Halloween carnival at the park right next to your house until sundown, etc. But the number of trick-or-treaters will not depend on how many you had last year. It may be similar to what you had last year, but just because things are similar doesn't make them dependent on one another.

In the case of simple linear regression, where you only have one independent variable, multicollinearity cannot occur, so there is nothing to test and you can just forge ahead.  However, in multiple regression, you can correlate each pair of independent variables with each other, and if they are strongly correlated, then this is a very good indication of multicollinearity.
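As a sketch of that pairwise check, the Python below computes the Pearson correlation for every pair of predictors. The three predictors are hypothetical, and the 0.8 cutoff is only an assumed rule of thumb for "strongly correlated", not a value from this lesson.

```python
from itertools import combinations
from math import sqrt


def pearson_r(a, b):
    """Pearson correlation between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_ab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    s_aa = sum((x - mean_a) ** 2 for x in a)
    s_bb = sum((y - mean_b) ** 2 for y in b)
    return s_ab / sqrt(s_aa * s_bb)


# Hypothetical predictors: two of them measure the same thing in different units
predictors = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [30, 22, 41, 27, 35],
}

CUTOFF = 0.8  # assumed rule of thumb, adjust to taste

for (name_1, col_1), (name_2, col_2) in combinations(predictors.items(), 2):
    r = pearson_r(col_1, col_2)
    flag = "  <- possible multicollinearity" if abs(r) > CUTOFF else ""
    print(f"{name_1} vs {name_2}: r = {r:.2f}{flag}")
```

Only the two height columns should be flagged, since they carry essentially the same information.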

---

## 6. Testing for Outliers

Outliers, or data points that are outside the typical pattern of your data, can skew your results horribly.  

Compare a dataset without outliers to one where outliers have been artificially added in.  On the right, you see the original data.  On the left, you have data with three outliers.  Bet you can spot them! See how drastically the slope of the line has changed, just by adding in those three funky points? 

![Two data sets. On the left, a data set in which most of the data is plotted closely together, but three data points are outliers that are far outside the others. On the right, the original data set without outliers, in which the data is plotted in a mostly straight line.](Media/106.L6.21.png)

So, it's best to check for and remove outliers.  There are three ways a data point can be an outlier, so you'll need to check all three types and correct for them (i.e., remove the offending data point) if necessary.

* **Leverage:** An extreme value in the x space.  You typically screen for these outliers using a value aptly named "leverage."
* **Distance:** An extreme value in the y space.  You typically screen for these outliers by looking at a value called the studentized deleted residual.
* **Influential:** An extreme value in both the x and y space. This is screened for using values called Cook's Distance, DFFITS, and DFBETAS.

Typically, you will check for outliers by either looking at leverage and distance values generated with your model and spotting anything outside normal parameters, or you can look at graphs of your data (box plot and/or line graph) to identify things that visually stick out.  You'll learn how to test for each outlier as you move into regression in Python and R.
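To give a feel for what a leverage value is, here is a minimal Python sketch for the one-predictor case, where the leverage of a point is ```1/n + (x - mean)^2 / Sxx```. The x values are hypothetical, and the ```2p/n``` cutoff (with ```p = 2``` coefficients, the intercept and the slope) is a commonly quoted rule of thumb that is assumed here, not taken from this lesson.

```python
def leverages(xs):
    """Leverage (hat value) of each point in a simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return [1 / n + (x - x_bar) ** 2 / sxx for x in xs]


# Hypothetical predictor values with one point far out in the x space
toy_x = [1, 2, 3, 4, 5, 20]
hat_values = leverages(toy_x)

# Assumed rule of thumb: flag leverage above 2p/n, with p = 2 coefficients
cutoff = 2 * 2 / len(toy_x)
flagged = [x for x, h in zip(toy_x, hat_values) if h > cutoff]

for x, h in zip(toy_x, hat_values):
    print(f"x = {x:>2}  leverage = {h:.3f}")
print("cutoff:", round(cutoff, 3), " flagged x values:", flagged)
```

Notice that leverage depends only on x. A point can have high leverage and still sit right on the line, which is why distance and influence are checked separately.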

----

### When You Find Outliers

Although you may end up deleting outliers, there's a triage approach for dealing with them - you don't want to automatically exclude outliers.

1. Make sure the outlier is not a result of a data entry error.  If you have the raw data, it is easy to check, but if someone else is entering data into a database or website and you're only seeing it second hand, this can be much more difficult.  Do your best to track outliers to their origin if you can. Correct the value if it was truly a mistake, and retain it if the value is genuinely extreme.

    <div class="panel panel-info">
        <div class="panel-heading">
            <h3 class="panel-title">Tip!</h3>
        </div>
        <div class="panel-body">
            <p>If you have any say in the data collection process, you can make this step much easier by ensuring there are data checks in place.  For instance, if patients are filling out a survey with their age, you can throw up an error flag that does not allow any values above 120 years of age, as this is clearly outside the current human life span and must represent an error, rather than a genuine extreme value. </p>
        </div>
    </div>

2. Run a different statistic that is not as sensitive to outliers, such as a rank-based (nonparametric) or robust method.

3. Run the regression analysis with and without the outliers to see how the model differs. 

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>There are some people who will not just remove their outliers, but replace them (called imputation in statistics) with other values, such as the mean.  This is typically something to stay away from! It can unduly bias your data and give an inaccurate picture of the results!</p>
    </div>
</div>

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 6 - You'll Get Along Swimmingly with Regression in R!<a class="anchor" id="DS106L1_page_6"></a>

[Back to Top](#DS106L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# You'll Get Along *Swimmingly* with Regression in R! 

Now that you have an understanding of the assumptions you will need to meet, you will begin the process of creating simple linear regression in R.  You will start with all the prep work needed!

<div class="panel panel-success">
    <div class="panel-heading">
        <h3 class="panel-title">Additional Info!</h3>
    </div>
    <div class="panel-body">
        <p>You may want to watch this <b><a href="https://vimeo.com/463655964"> recorded live workshop </a></b> that goes over how to do regression in R.</p>
    </div>
</div>


---

## Install Packages

You will need the following libraries to perform regression in R.  Make sure that you install them first:

```{r}
install.packages("car")
install.packages("caret")
install.packages("gvlma")
install.packages("predictmeans")
install.packages("e1071")
install.packages("lmtest")
```

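The ```lm()``` function you will use to create linear models is part of base R.  The packages above are used to test various aspects of the linear regression assumptions; for example, ```car``` provides ```ncvTest()``` and ```lmtest``` provides ```bptest()```.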

## Load Packages

After you've installed the appropriate packages, you will load the packages into R like this: 

```{r}
library("car")
library("caret")
library("gvlma")
library("predictmeans")
library("e1071")
library("lmtest")
```

---

## Load in Data

Manatees are curious, peaceful sea creatures that like to sun themselves just below the ocean's surface. Some environmentalists have claimed that manatees are being killed by powerboat propellers. This **[data](https://repo.exeterlms.com/documents/V2/DataScience/Modeling-Optimization/manatees.zip)** gives the number of Florida powerboat registrations in thousands and the number of manatees killed by powerboats for the years 1977-2006.

---

## Question Setup
 
The question you are trying to answer with this analysis is: ```Does the number of powerboats registered in Florida impact the number of manatees killed there by powerboats each year?```

---

## Data Wrangling

Luckily, there is no additional data wrangling you will need to do with this dataset to complete a linear regression.

---

## Test Assumptions

You will now begin the onerous process of testing all the assumptions for linear regression.  Although it can be a lot of work, it is an important part of regression because it will ensure that your results are as accurate as they can be.

---

### Testing for Linearity

Before conducting a regression analysis, it is important to first create a scatterplot and visually examine the data. You can only use linear regression when the two variables show a "linear" shape. If the data look like a random scattering of points, curved, or any other strange shape, there is no linear relationship in the data, and so it's not appropriate to run linear regression.  Start by looking at the scatterplot, to determine whether you have a linear relationship between your two variables.

Here's some example code to get a scatterplot: 

```{r}
scatter.smooth(x=manatees$PowerBoats, y=manatees$ManateeDeaths, main="Manatee Deaths by Power Boats")
```

Where ```scatter.smooth``` tells R you want a scatterplot with a line, you specify your x (in this case, ```PowerBoats```) and your y (```ManateeDeaths```), and then can put a title on it with ```main=``` if you'd like.  If you're just screening data quickly and no one else will see what you've created, by all means, skip the title step!

This code results in the following graph: 

![A scatterplot showing manatee deaths by power boats.](Media/L6.1.png)

As you can see, there is a linear relationship between the number of power boats and the number of manatee deaths. This means you have passed the assumption of linearity!

---

### Testing  for Homoscedasticity

Now you need to check for homoscedasticity.  

---

#### Create the Linear Model

In order to do this, you first need to create the basic regression model, because you will be able to run a whole bunch of tests once you have it, including the test for homoscedasticity. Here is the code to create a linear model: 

```{r}
lmMod <- lm(ManateeDeaths~PowerBoats, data=manatees)
```

This creates a model named ```lmMod``` (you could name it whatever you like, but choose something you can easily recognize in the future and/or will make sense to your coworkers), using the ```lm``` function, which stands for...you guessed it...linear model! Put your dependent variable first, in this case, ```ManateeDeaths```, and then a tilde (```~```) to mean "by" and then specify your independent variable.  Lastly, specify the dataset with the ```data=``` command. 
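If you are curious what a linear model function computes for one predictor, here is a minimal Python sketch of ordinary least squares using the closed-form formulas ```b1 = Sxy / Sxx``` and ```b0 = mean(y) - b1 * mean(x)```. The numbers are hypothetical toy data, not the manatee dataset, and ```fit_simple_ols``` is a made-up name for this illustration.

```python
def fit_simple_ols(x, y):
    """Return (b0, b1) for the least-squares line y = b0 + b1 * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical toy data (independent variable first, then dependent)
toy_x = [1, 2, 3, 4, 5]
toy_y = [2.9, 5.2, 6.8, 9.1, 11.0]

b0, b1 = fit_simple_ols(toy_x, toy_y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(toy_x, toy_y)]

print(f"intercept b0 = {b0:.3f}, slope b1 = {b1:.3f}")
print(f"sum of residuals = {sum(residuals):.6f}")
```

The residuals of a least-squares line with an intercept always sum to (essentially) zero, which is a handy sanity check on any fit.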

---

#### Test for Homoscedasticity

So you now have a model created, but you don't actually want to test a hypothesis yet.  You want to see if this model is robust enough to use for analysis, which means it needs to meet all of our assumptions.  You can create some graphs that will allow you to test for homoscedasticity using this code: 

```{r}
par(mfrow=c(2,2))
plot(lmMod)
```

And this figure is the output: 

![Four graphs that test for homoscedasticity. Top left, residuals vs fitted. The x axis is fitted values and the y axis is residuals. Bottom left, scale location. The x axis is fitted values and the y axis is standardized residuals. Top right, normal Q Q. X axis is theoretical quantiles and y axis is standardized residuals. Bottom right, residuals versus leverage. X axis is leverage and y axis is standardized residuals.](Media/106.L6.2.png)

It shows four plots, which all examine different information about residuals.  The ones you really want to pay attention to are on the left.  The top left graph shows the fitted values against the residuals, while the bottom left (scale-location) shows the fitted values against the square root of the absolute standardized residuals.  Both of these graphs should show randomly scattered points with a flat red line, straight across, if the model is homoscedastic and thus meets the assumption that allows you to complete linear regression.  However, you can see a suspicious upward trend in your residuals, specifically in the bottom of the two left plots.  This raises a red flag.  You can declare this to be heteroscedastic (the opposite of homoscedastic).  

If you're on the fence, or you're not graphically inclined, you can also test for homoscedasticity with the Breusch-Pagan test or the Non-Constant Variance (NCV) test.  Here's the code for the Breusch-Pagan test:  

```{r}
lmtest::bptest(lmMod)
```

And here are the results: 

```text
	studentized Breusch-Pagan test

data:  lmMod
BP = 7.8867, df = 1, p-value = 0.00498
```

You'll notice we have a *p* value of .005.  Since this is less than .05, it means it is significant, and having a significant Breusch-Pagan test means that you unfortunately have heteroscedasticity. 
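For intuition about what this test measures, here is a Python sketch of the studentized (Koenker) form of the statistic for a single predictor: fit the regression, square the residuals, regress those squared residuals on x, and multiply the resulting R-squared by n. The data is hypothetical (noise that grows with x), the function names are made up for this illustration, and 3.841 is the standard chi-square critical value for 1 degree of freedom at the .05 level.

```python
def ols_fit(x, y):
    """Return (intercept, slope) of the least-squares line."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def breusch_pagan_stat(x, y):
    """n * R^2 from regressing the squared residuals on x."""
    n = len(x)
    b0, b1 = ols_fit(x, y)
    sq_resid = [(yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)]

    # Auxiliary regression: squared residuals explained by x
    a0, a1 = ols_fit(x, sq_resid)
    mean_sq = sum(sq_resid) / n
    ss_total = sum((e - mean_sq) ** 2 for e in sq_resid)
    ss_model = sum((a0 + a1 * xi - mean_sq) ** 2 for xi in x)
    return n * ss_model / ss_total


# Hypothetical data: y = 2x plus noise whose size grows with x
toy_x = list(range(1, 11))
toy_y = [2 * i + ((-1) ** (i + 1)) * 0.1 * i for i in toy_x]

CHI2_CRIT_DF1 = 3.841  # critical value for df = 1, alpha = .05

stat = breusch_pagan_stat(toy_x, toy_y)
print(f"BP statistic = {stat:.3f}")
print("evidence of heteroscedasticity:", stat > CHI2_CRIT_DF1)
```

The idea is simply that if x can predict how large the squared residuals are, the spread of the error is not constant.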

Similarly, here's the code for the NCV test: 

```{r}
car::ncvTest(lmMod)
```

Which yields these results:

```text
Non-constant Variance Score Test 
Variance formula: ~ fitted.values 
Chisquare = 9.037421, Df = 1, p = 0.0026451
```

### [Errata](Errata/DS106Mod-Change-Log.ipynb)
<a class="anchor" id="pvalue"></a>

You get a *p* value of .003, shown above.  The same rules apply as for your Breusch-Pagan test, so since this is significant, you have a third indicator that the data is heteroscedastic, and you have not met the assumption of homoscedasticity. 
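Both tests report a chi-square statistic with 1 degree of freedom, so you can double-check the reported *p* values yourself. For 1 degree of freedom, the upper-tail probability equals ```erfc(sqrt(statistic / 2))```. The Python sketch below applies that identity to the two statistics shown above; the function name is a made-up helper.

```python
from math import erfc, sqrt


def chi2_df1_p_value(stat):
    """Upper-tail p value of a chi-square statistic with 1 degree of freedom."""
    return erfc(sqrt(stat / 2))


# Statistics copied from the R output above
reported = [
    ("Breusch-Pagan", 7.8867),
    ("NCV", 9.037421),
]

for name, stat in reported:
    p = chi2_df1_p_value(stat)
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"{name}: statistic = {stat}, p = {p:.5f} ({verdict})")
```

This shortcut only works for 1 degree of freedom; with more predictors in the variance formula you would need a general chi-square distribution function.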

<div class="panel panel-info">
    <div class="panel-heading">
        <h3 class="panel-title">Tip!</h3>
    </div>
    <div class="panel-body">
        <p>It is unnecessary to test homoscedasticity in multiple ways, since results are unlikely to change much between methods. But if you're questioning things as you look at the graph, the definitive answer can always be found with a statistical test of homoscedasticity!</p>
    </div>
</div>

---

#### Correcting for Homoscedasticity Violations

So, you violated an assumption for linear regression, which means that whatever regression model you build has a much higher likelihood of being biased.  Lucky for you, most of the time, you can recover from a violation of homoscedasticity. 

The first thing to try is to transform your dependent variable.  You'll use a Box-Cox transformation, to try and make your data more normal.  And no, you haven't just stepped into a Dr. Seuss book. You can run the following code from the ```caret``` library:

```{r}
distBCMod1 <- caret::BoxCoxTrans(manatees$ManateeDeaths)
print(distBCMod1)
```

The code above will estimate the Box-Cox transformation, and if you print it, you get some summary statistics about the input data along with the estimated lambda:

![Input data summary. Minimum, fourteen point zero. First quartile, thirty point five. Median, fifty point zero. Mean, fifty two point eight. Third quartile, seventy five point five. Maximum, ninety seven point zero. Largest forward slash smallest, six point nine three. Sample skewness, zero point one seven three. Estimated lambda, zero point five.](Media/106.L6.5.png)

This by itself is not incredibly helpful, though it can be nice to see summary statistics on your transformed dependent variable.  In order to re-test your transformed variable, you need to bind it to the current dataset, using the function ```cbind()```:

```{r}
manatees <- cbind(manatees, dist_newM=predict(distBCMod1, manatees$ManateeDeaths))
```

You use the ```cbind()``` function to tack on the new variable, which has been named ```dist_newM```, which is filled with the predicted data from the Box-Cox transformation you did above (distBCMod1).
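For intuition about what the transformation does to each value, here is a minimal Python sketch of the standard Box-Cox formula. It assumes ```caret``` applies this standard form with the estimated lambda of 0.5 from the summary above; the function name ```box_cox``` is made up for illustration, and the three inputs are the minimum, median, and maximum from that summary.

```python
import math

def box_cox(value, lam):
    """Standard Box-Cox transform of one strictly positive value."""
    if value <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return math.log(value)          # limiting case when lambda is 0
    return (value ** lam - 1) / lam

lam = 0.5  # estimated lambda reported in the Box-Cox summary above

# Minimum, median, and maximum ManateeDeaths from the same summary
for deaths in (14, 50, 97):
    print(deaths, round(box_cox(deaths, lam), 3))
```

With a lambda of 0.5 this is essentially a rescaled square root, which pulls large values in more than small ones.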

Then it's just a simple matter of creating a new linear model using your transformed dependent variable, and testing it once more with either the Breusch-Pagan test or the NCV test, like so:

```{r}
lmMod_bc2 <- lm(dist_newM~PowerBoats, data=manatees)
lmtest::bptest(lmMod_bc2)
```

Here is the output of the Breusch-Pagan test:

```text
	studentized Breusch-Pagan test

data:  lmMod_bc2
BP = 5.6078, df = 1, p-value = 0.01788
```

You'll notice that the results are still significant. So you are STILL violating the assumption of homoscedasticity, which is a bummer.  But cheer up, friend - don't lose all hope yet. If even after Box-Cox transformation your data still does not meet the assumption of homoscedasticity, you can use a different method for calculating the linear regression model.  Up until now, you have been using ordinary least squares (OLS) regression, which is by far the most popular, because it is the simplest. OLS fits your regression line by choosing the intercept and slope that minimize the sum of the squared residuals.  However, there are a myriad of other methods from which to choose.  
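To make "minimizing the sum of squared residuals" concrete, here is a small Python sketch. It is only an illustration: the five data points are made up (they are not the manatees data), and the helper name ```sse``` is an assumption for this example.

```python
import numpy as np

# Toy data for illustration only (not the manatees dataset)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

def sse(intercept, slope):
    """Sum of squared residuals for the line y = intercept + slope * x."""
    residuals = y - (intercept + slope * x)
    return float(np.sum(residuals ** 2))

# Closed-form OLS estimates for simple linear regression
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

best = sse(intercept, slope)
# Nudging either coefficient away from the OLS solution makes the fit worse
nudged = [sse(intercept + 0.1, slope), sse(intercept, slope + 0.1)]
print(round(slope, 3), round(intercept, 3), round(best, 3), nudged)
```

Any other line through the same points has a larger sum of squared residuals, which is exactly what "least squares" means.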

One of those methods is generalized least squares.  If you find yourself in need, check out this **[R Pubs on heteroscedasticity](https://rpubs.com/cyobero/187387)** and how to correct for it. Changing your regression method, however, is relatively advanced, and just for the purposes of learning, you'll continue on with the regression even though you have violated the assumption of homoscedasticity.

---

## Testing for Homogeneity of Variance

Above, you were able to generate plots for testing homoscedasticity by looking at your residuals.  These plots also provide useful information about homogeneity of variance, or how evenly your data is distributed.  All you need to do is to see whether your data forms a nice box, or whether it is cone shaped at either end.  In this case, the residual plots you generated above (repeated here for your convenience) show no cone, so you have passed the assumption of homogeneity of variance.

![Four graphs that test for homoscedasticity but also show useful information about homogeneity of variance. Top left, residuals vs fitted. The x axis is fitted values and the y axis is residuals. Bottom left, scale location. The x axis is fitted values and the y axis is standardized residuals. Top right, normal Q Q. X axis is theoretical quantiles and y axis is standardized residuals. Bottom right, residuals versus leverage. X axis is leverage and y axis is standardized residuals.](Media/106.L6.2.png)

---

### The GVLMA Library for Assumptions

To add additional tools to your arsenal, there is yet another way to test assumptions: the ```gvlma``` library. This will automatically run several assumption tests on your data and tell you in plain language whether you have met that assumption.  However, beware! The ```gvlma``` function does not test all assumptions necessary for regression and may not always be accurate, so make sure to investigate each thoroughly regardless of the ```gvlma``` findings. When you are pressed for time, are just scoping out your data, or are analyzing extremely low-stakes data, however, you may find ```gvlma``` a lifesaver.  Here's the code: 

```{r}
gvlma(lmMod_bc2)
```

And the output it provides:

```text
Call:
lm(formula = dist_newM ~ PowerBoats, data = manatees)

Coefficients:
(Intercept)   PowerBoats  
   -1.76474      0.01872  


ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance =  0.05 

Call:
 gvlma(x = lmMod_bc2) 

                     Value p-value                   Decision
Global Stat        12.1803 0.01606 Assumptions NOT satisfied!
Skewness            1.1918 0.27496    Assumptions acceptable.
Kurtosis            0.1569 0.69207    Assumptions acceptable.
Link Function       6.9611 0.00833 Assumptions NOT satisfied!
Heteroscedasticity  3.8705 0.04914 Assumptions NOT satisfied!
```

You will go over each piece of the output point by point:

* **Global Stat:** This is the overall test. It combines the four checks below into a single statistic (here 1.1918 + 0.1569 + 6.9611 + 3.8705 = 12.1803), so it is flagged whenever the assumptions look violated as a group, for example when the relationship between your x and y data is not linear.  
  > Here it is flagged mainly because of the Link Function and Heteroscedasticity rows, so read it alongside the individual checks rather than trusting it on its own.   
* **Skewness:** This is a measure of whether the distribution of your residuals is symmetric, rather than lopsided to one side.   
* **Kurtosis:** This is an indicator of whether the tails of your residual distribution are heavier or lighter than those of a normal distribution.
* **Link Function:** This tests whether the form of the model linking x to y is specified correctly.  A violation is often taken to mean that the dependent variable is categorical rather than continuous, in which case analyses such as logistic regression should be used instead.
  > Your model has not satisfied this assumption according to GVLMA, but guess what? Both the number of Power Boats and the number of Manatee Deaths are continuous, not categorical.  Make sure to carefully investigate GVLMA results.
* **Heteroscedasticity:** This is an indicator of whether the variance of your residuals is constant.  If you fail this, you fail the assumption of homoscedasticity discussed extensively above.

---

### Screening for Outliers

Next, you will screen for outliers.  You will need to test for all three types of outliers: distance, leverage, and influential points. 

---

#### Testing for Outliers in X Space

You will start by screening for outliers in x space.  You can do this by looking at Cook's Distance values.  To do this, you'll use the ```predictmeans``` library. Just run your regression model through its ```CookD()``` function. It will generate a plot that highlights anything that is an outlier in x space.  

Cook's distance measures how much the predicted values (fitted values) change when a particular observation is excluded from the model. It is a way to detect influential points that disproportionately affect the estimated coefficients and overall model fit.

```{r}
predictmeans::CookD(lmMod, group=NULL, plot=TRUE, idn=3, newwd=TRUE)
```
### [Errata](Errata/DS106Mod-Change-Log.ipynb)
<a class="anchor" id="forjupyter"></a>


##### NOTE: If you are using R in Jupyter Lab/ Notebook, change your final argument ```newwd=``` to FALSE (```newwd=FALSE```). This removes the "new window" pop-up and allows the results to print.

This is the resulting plot:

![A plot labeled Cooks Distance. The x axis is labeled observation number and runs from zero to thirty five. The y axis is labeled cooks distance and runs from zero point zero zero to zero point one five. Each observation number has a vertical bar. Three bars are labeled, twenty four, twenty six, and twenty nine, representing outliers in x space.](Media/modeling1.png)

You'll note that the numbers on the plot are the row numbers of data that are outliers in x space.   So cases numbered 24, 26, and 29 are all outliers in x space. 
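If you are curious what goes into each bar of that plot, the following Python sketch computes Cook's distance by hand. It is a minimal illustration on made-up toy data (not the manatees data), using the textbook formula that combines each residual with its leverage; it is not a description of how ```predictmeans``` implements it internally.

```python
import numpy as np

# Toy data for illustration only (not the manatees dataset)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

X = np.column_stack([np.ones_like(x), x])       # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)    # OLS coefficients
residuals = y - X @ beta

n, p = X.shape                                  # observations, coefficients
leverage = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
mse = np.sum(residuals ** 2) / (n - p)          # residual variance estimate

# Large residual and/or high leverage both push Cook's distance up
cooks_d = residuals ** 2 / (p * mse) * leverage / (1 - leverage) ** 2
print(np.round(cooks_d, 3))
```

In this toy example the first point has both a sizable residual and the highest leverage, so it gets the largest Cook's distance.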

You can also look at leverage values. Anything that has a leverage value of between .2 and .5 is a moderate problem, and anything over .5 is a major problem. 

To test for leverage, you will use the ```hat()``` function and then plot it: 

```{r}
lev = hat(model.matrix(lmMod))
plot(lev)
```

This will yield the following plot:

![A plot with an x axis labeled index and y axis labeled leverage. The first data point is plotted high in the upper left, above zero point one zero. Data points then move downward to near zero point zero two, remain flat, move up toward zero point one zero and then back down again to below zero point zero six.](Media/106.L6.24.png)

You can probably guess that you don't have anything over .2 by looking at this chart, but it's good to know for sure.  You can check with the code below, which will provide you with a list of data points with leverage over .2.  

```{r}
manatees[lev>.2,]
```

In this case, this is the only output you get, and the lack of rows tells you that you have no outliers with a leverage over .2, just as you suspected from your chart.

```text
[1] PowerBoats    ManateeDeaths dist_new      dist_newM    
<0 rows> (or 0-length row.names)
```

Therefore, you can conclude that you have no outliers in x space.
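For a model with a single predictor, leverage has a simple closed form that depends only on how far each x value sits from the mean of x. The Python sketch below illustrates this on made-up values (one deliberately extreme) and applies the same .2 cutoff used above; the data are an assumption for illustration only.

```python
import numpy as np

# Toy predictor values for illustration only; the last one is far from the rest
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0])
n = len(x)

# Leverage for simple linear regression: 1/n + (x - mean)^2 / sum of squares
leverage = 1 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)

flagged = np.where(leverage > 0.2)[0]   # row positions with leverage over .2
print(np.round(leverage, 3))
print(flagged)
```

Only the extreme value is flagged here, which mirrors the idea that leverage is about unusual x values, regardless of what y is.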

---

#### Testing for Outliers in y Space 

Next you want to test for outliers in y space.  This is typically done using something called the *studentized deleted residual*, or sometimes just shortened to the studentized residual.  This is similar to your regular residual, or error term, which you have already learned, but instead of calculating standard error with all of the residuals, you calculate it based on *n* - 1.  This means that you have *deleted* one residual, which is where the name comes from.

The ```outlierTest()``` function from the ```car``` library gives you information about the studentized deleted residual in a single line of code: 

```{r}
car::outlierTest(lmMod)
```

Here's the output that code provides: 

```text
No Studentized residuals with Bonferonni p < 0.05
Largest |rstudent|:
   rstudent unadjusted p-value Bonferonni p
24 2.691645           0.011213      0.39245
```

This particular code tests the single most extreme observation.  If the Bonferroni *p* value is significant (i.e. less than .05), then it is likely that you have at least one outlier.  You can also look at the column for ```rstudent```, which is the raw studentized deleted residual. If this value is over 2.5 or 3ish, you have a problem with outliers in the y space.  Here your value is about 2.7, so it could be considered an outlier, but since your Bonferroni *p* value is not significant, you'll leave it be. You have no outliers in y space.
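The "deleted" part of the name can be shown directly: for each row, refit the line without that row and use the spread of the remaining residuals as the yardstick. The Python sketch below does this on made-up toy data (not the manatees data); it is a hedged illustration of the idea, not a copy of what ```car``` does internally.

```python
import numpy as np

# Toy data for illustration only (not the manatees dataset)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])
n = len(x)

# Full-model residuals and leverage values
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)
leverage = 1 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)

t_deleted = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i                      # drop observation i
    b1, b0 = np.polyfit(x[keep], y[keep], 1)      # refit without it
    sse_without_i = np.sum((y[keep] - (b0 + b1 * x[keep])) ** 2)
    s_without_i = np.sqrt(sse_without_i / (n - 1 - 2))   # n - 1 rows, 2 coefficients
    t_deleted[i] = residuals[i] / (s_without_i * np.sqrt(1 - leverage[i]))

print(np.round(t_deleted, 3))
```

Because the suspect row is left out when estimating the spread, an unusual point cannot hide by inflating its own standard error.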

---

#### Testing for Outliers in x and y Space

To test for influential values, or outliers in both x and y space, there are two metrics to look at: DFFITS and DFBETAS.  

They are readily available with the function ```influence.measures``` for your model.  Check it out!

```{r}
summary(influence.measures(lmMod))
```

The output that you receive from this code is shown below:

```text
Potentially influential observations of
	 lm(formula = ManateeDeaths ~ PowerBoats, data = manatees) :

   dfb.1_ dfb.PwrB dffit cov.r   cook.d hat  
1   0.01  -0.01     0.01  1.19_*  0.00   0.10
23 -0.07   0.18     0.46  0.78_*  0.09   0.03
24 -0.20   0.32     0.57  0.74_*  0.14   0.04
```

You have a wide variety of outlier indicators here along the top, and the row numbers of any values that may be outliers are shown on the far left.  The columns named ```dfb.1_``` and ```dfb.PwrB``` are the DFBETAS for the intercept and for ```PowerBoats```, and the one named ```dffit``` is for DFFITS.  If the absolute value of a DFFITS or DFBETAS is greater than 1, you most likely have a problem with outliers that are influential. Luckily in this case, there are no values greater than 1 for these three points, so you have no influential outliers in your data. 

---

Phew! If you got a little lost back there, **[here](https://repo.exeterlms.com/documents/V2/DataScience/Modeling-Optimization/modeling_with_manatees.zip)** is an example of how the code should look when everything is all said and done. 

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 7 - Running Simple Linear Regression and Interpreting the Output<a class="anchor" id="DS106L1_page_7"></a>

[Back to Top](#DS106L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Running Simple Linear Regression and Interpreting the Output

You have successfully run the assumption gauntlet and come out on top! Even though you did not meet the assumption of homoscedasticity, you are going to perform ordinary least squares regression, because most of the time you will be able to correct heteroscedasticity with the Box-Cox transformation and thus will want OLS instead of another regression method.  It is likely that because the ```manatees``` dataset was just an example, not a full and wholly real dataset, something screwy has happened with the data. 

---

## Hypothesis Testing

In regression, you are trying to determine if there is a relationship between the predictor variable (the x) and the response variable (the y). The way to test for this is to test if the slope in the regression equation is different from zero. If the slope is zero, this means you have a flat line, which suggests that there is no linear relationship between the two variables. If the slope is not zero, that suggests that there is a linear relationship between the two variables.

You need a little more terminology here. Recall that there are population parameters and sample statistics. Now, you have coefficients for the linear regression equation, b<sub>0</sub> and b<sub>1</sub>, which are sample statistics.  Anytime you have English (not Greek) letters, you have a sample statistic.  Anytime you have Greek letters, you have a population parameter.  There are population parameters for which b<sub>0</sub> and b<sub>1</sub> are estimates - these are β<sub>0</sub> and β<sub>1</sub>. The name of the Greek letter β is 'beta.'

If you want to test the slope to see if it is zero, that means you are testing the null hypothesis:

![H sub zero, beta sub one equals zero. H sub a, beta sub one does not equal zero.](Media/L01-15.png)

---

## Interpreting Output

Remember that you have already created your model.  Which means, once you are satisfied with its validity, all you need to do is call up a summary! 

```{r}
summary(lmMod_bc2)
```

The above code produces this output:

```text
Call:
lm(formula = dist_newM ~ PowerBoats, data = manatees)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.11978 -0.95827 -0.05081  0.75875  2.74620 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.764745   0.842078  -2.096   0.0439 *  
PowerBoats   0.018718   0.001106  16.929   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.226 on 33 degrees of freedom
Multiple R-squared:  0.8967,	Adjusted R-squared:  0.8936 
F-statistic: 286.6 on 1 and 33 DF,  p-value: < 2.2e-16
```

The first thing to look at in this output is the very bottom line, which reads ```F-statistic```.  This tells you the *F* statistic for the overall model as well as the *p* value for the overall model.  The larger the *F* statistic, the more likely it is to be significant, and with a whopping 286.6, you would expect *p* to be significant.  Lo and behold, it is, at *p* < .001. 

So what is the practical interpretation of the slope not being equal to zero? It means there is sufficient evidence to lead us to believe that the number of registered power boats in Florida somehow influences the manatee deaths in Florida.

Your decision rule is to reject the null hypothesis, since the *p* value is less than .05. There is sufficient evidence to suggest that there is a linear relationship between the number of powerboats registered in Florida and the number of manatees killed by powerboats. This conclusion fits our intuition. If there are more boats on the water, it seems plausible that this may be related to the number of manatees killed. 

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>If any statistical conclusion is counter intuitive, you should always be very wary!</p>
    </div>
</div>

Next, you are interested in the ```Coefficients``` table, which provides the y-intercept (b<sub>0</sub>), and the slope (b<sub>1</sub>) for this regression equation.  If you had not transformed your dependent variable, you could easily interpret the raw values in the estimates column. Because you did transform it, the estimates are on the Box-Cox scale, so you can either back-transform them for interpretation, or ignore these estimates.  There is still lots to glean from this output.  The *t* value tests each independent variable in the model.  In your case of simple linear regression, you only have one independent variable, and thus one *t* test. Just like other *t* tests you have learned, the larger the *t* value, the more likely the test will be significant.  And you see that here you have a *t* test that is significant at *p* < .001, meaning that the number of power boats registered in Florida is a significant predictor of the number of manatee deaths. 

How large a predictor, you might ask? Well, R<sup>2</sup> can provide that answer! The second to last line in your R output provides both a ```Multiple R-squared``` value and an ```Adjusted R-squared``` value. 

* **Multiple R Squared:** Square of the multiple R. This is also called the coefficient of determination, and is a measure of how well the regression line fits the data. A common interpretation of R<sup>2</sup> is that it measures how much variation in the y-axis variable is explained by the x-axis variable.

  > In terms of the manatee data, the practical interpretation would be something like: "About 90% of the variation in the (transformed) number of manatee deaths each year can be explained by the number of power boats. The other 10% is due to error or other variables not accounted for in this model."

* **Adjusted R Squared:** Measure of R squared that adjusts for the number of terms in the model. For models with only one term (i.e. the simple linear regression you have tackled in this lesson), there will not be much difference between the R squared and the adjusted R squared. However, it's always good to get in the habit of using the adjusted R squared, because with more than one independent variable it will be much more accurate (usually lower) than R squared.
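The adjustment itself is a one-line formula, sketched here in Python using the numbers from the output above. The function name is made up for illustration; the sample size of 35 is inferred from the 33 residual degrees of freedom with one predictor and an intercept.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Penalize R squared for the number of predictors in the model."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# 33 residual degrees of freedom with one predictor implies 35 observations
print(round(adjusted_r_squared(0.8967, 35, 1), 4))  # 0.8936, as in the output
```

With only one predictor and 35 rows the penalty is tiny, which is why the two values in the output are so close.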

<div class="panel panel-success">
    <div class="panel-heading">
        <h3 class="panel-title">Additional Info!</h3>
    </div>
    <div class="panel-body">
        <p>Recently, Google Sheets has added the ability to create a linear regression, and provide the regression equation and R squared value. <a href="https://www.youtube.com/watch?v=n9-lYzJwX40"> Here are Step-by-Step Instructions</a></p>
    </div>
</div>

---

## Interpreting Non-Transformed Regression Output

If you had not transformed your dependent variable, you could add one more interpretation to your R output. Here is the model without any data transformations:

```{r}
summary(lmMod)
```

And the output it provides:

```text
Call:
lm(formula = ManateeDeaths ~ PowerBoats, data = manatees)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.736  -6.642  -1.239   4.374  22.309 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -42.525657   6.347072   -6.70 1.25e-07 ***
PowerBoats    0.129133   0.008334   15.49  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.237 on 33 degrees of freedom
Multiple R-squared:  0.8792,	Adjusted R-squared:  0.8755 
F-statistic: 240.1 on 1 and 33 DF,  p-value: < 2.2e-16
```

The practical interpretation of the slope (b<sub>1</sub>) is this: For every thousand more power boats that are registered in the state of Florida, there is a predicted additional 0.129 manatee deaths due to collisions between boats and manatees. Another way to look at it is by stating that for every additional 10,000 power boats registered, you can expect an increase of roughly 1.3 manatee deaths.

If you tried to interpret the meaning of the y-intercept (b<sub>0</sub>), you would have this: If there were no boats registered in Florida, there would be 'negative 42.5 manatee deaths per year.' However, this is not a practical interpretation, since there are no data points anywhere near x = 0. So, in this case, and many others, the purpose of the y-intercept in a regression equation is just to define the location of the regression line. 

---

## Predictions

If you want to predict the number of manatees that will be killed if, say, 850 thousand powerboats are registered, then you plug x = 850 into the equation, and determine the predicted value:

```y = 0.129(850) − 42.526```
```y = 67.1```

Note that if there are 850 thousand powerboats registered, then the regression equation suggests that there will be an estimated 67.1 manatees killed. Of course, you cannot kill a fraction of a manatee. However, when making predictions, it is appropriate to leave the prediction as a decimal. This is a predicted value, so it does not actually have to be attainable.
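The same prediction can be written as a tiny Python function. This sketch plugs in the unrounded coefficients from the ```summary(lmMod)``` output above; the function name is an assumption for illustration. Using the unrounded slope gives about 67.2 rather than the 67.1 you get with the slope rounded to 0.129.

```python
INTERCEPT = -42.525657  # (Intercept) estimate from summary(lmMod)
SLOPE = 0.129133        # PowerBoats estimate from summary(lmMod)

def predict_deaths(power_boats_thousands):
    """Predicted manatee deaths for a number of power boats, in thousands."""
    return INTERCEPT + SLOPE * power_boats_thousands

print(round(predict_deaths(850), 1))  # 67.2
```

Keep in mind that predictions are only trustworthy for x values inside the range of the data used to fit the model.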

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 8 - Regression in Python<a class="anchor" id="DS106L1_page_8"></a>

[Back to Top](#DS106L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Regression in Python

Now that you have gone over the basics of simple linear regression in R, you can port regression over to Python! 

---

## Import Packages

You will first need to import packages.  You'll need ```pandas``` to read in your data and the rest for data visualizations associated with assumption testing. 

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pylab import *
import seaborn as sns
%matplotlib inline
import statsmodels.api as sm
import statsmodels.stats.api as sms
from scipy.stats import boxcox
```

---

## Load in Data

You'll use the same ```manatees``` dataset to keep things easy. For your convenience, **[here it is again](https://repo.exeterlms.com/documents/V2/DataScience/Modeling-Optimization/manatees.zip)**. 

---

## Data Wrangling

Fortunately, there is no additional data wrangling required to be able to run linear regression in Python either.

---

## Test Assumptions

Next, you will learn how to test the linear regression assumptions in Python.

---

### Testing for Linearity and Normality

Once you have imported the data, you can then plot the data using ```sns.pairplot(manatees)```.  This will provide you with a scatterplot of the relationship between the variables so you can check for a linear relationship, and the histograms of each variable, so you can check for a normal distribution.  

![Four charts, two scatterplots showing the relationship between variables, and two histograms, one for each variable, power boats and manatee deaths.](Media/106.L6.9.png)

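If a regular histogram isn't enough, you can also print one with the normal distribution curve on it, which is convenient for assessing normality. Just try ```sns.distplot([''])```, placing the dataset name in the parens and the variable name in the single quotes. (In newer versions of ```seaborn```, ```distplot``` is deprecated; ```sns.histplot()``` with ```kde=True``` draws a similar plot.) Here is what that looks like: 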

```python
sns.distplot(manatees['PowerBoats'])
```

![A histogram with a normal distribution curve on it for the variable power boats.](Media/106.L6.10.png)

And here's the one for Manatee Deaths:

```python
sns.distplot(manatees['ManateeDeaths'])
```

![A histogram with a normal distribution curve on it for the variable manatee deaths.](Media/106.L6.11.png)

Are these variables normally distributed? It's a tough call, but roughly, yes.  

---

### Testing for Homoscedasticity

Your next step is to test for homoscedasticity.  

---

#### Create the Basic Model

Like R, this requires the model to be created. There are two packages that will handle regression in Python.  The first is ```sklearn```, and it is most commonly used for machine learning. The second is ```statsmodels```, which is the package you will discuss here.  This is largely because the output is easier to read and use.

You need to assign each of your variables to an x and a y, since ```statsmodels``` does not take full data frames (unlike R).  Here's how: 

```python
x = manatees['PowerBoats']
y = manatees['ManateeDeaths']
```

If you want to look at these, you can call them simply by typing in ```x``` or ```y```. Now that you have your variables identified, you will build your model.  You can call it anything you want - in this case it's simply named ```model```. 

```python
model = sm.OLS(y,x).fit()
```

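You call the OLS function out of the ```statsmodels``` package to indicate that you want to perform ordinary least squares regression instead of any other type, then place your y and x values, and call ```.fit()```. Be aware that, unlike ```lm()``` in R, ```sm.OLS``` does not add a y-intercept on its own: as written, this model is fit without one, and you would wrap x in ```sm.add_constant(x)``` to include it. The rest of this page uses the model as written above.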

<div class="panel panel-info">
    <div class="panel-heading">
        <h3 class="panel-title">Tip!</h3>
    </div>
    <div class="panel-body">
        <p>You will need to have y and x in that order - no switching them!</p>
    </div>
</div>
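To see what a missing intercept does to a fit, here is a small sketch that uses ```numpy``` least squares as a stand-in for ```sm.OLS``` (an assumption made so the example runs without ```statsmodels```). The data are made-up toy values, not the manatees data.

```python
import numpy as np

# Toy data for illustration only (not the manatees dataset)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Design matrix with only x: the fitted line is forced through the origin
X_no_const = x.reshape(-1, 1)
slope_only, *_ = np.linalg.lstsq(X_no_const, y, rcond=None)

# Design matrix with a leading column of ones: an intercept is estimated too
X_const = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X_const, y, rcond=None)

print("no intercept  : slope =", round(slope_only[0], 3))
print("with intercept: intercept =", round(coef[0], 3), "slope =", round(coef[1], 3))
```

The slope doubles when the line is forced through the origin in this toy example, so it is worth deciding deliberately whether your model should include an intercept.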

You will talk through how to view and interpret the output of the model later - when you're sure it's a rigorous model that isn't biased by a violation of assumptions.  Right now, you need to find and plot the residuals from this model, so you can test for homoscedasticity!

---

#### Test for Homoscedasticity

Unlike in R, you'll calculate the residuals yourself here (```statsmodels``` also stores them as ```model.resid```, but working them out by hand shows where they come from).  It's not too hard.  Remember that the residual, or error term, is just the true values (your y from your dataset) minus the predicted values your model found.  Go ahead and calculate your residuals: 

```python
pred_val = model.fittedvalues.copy()
true_val = manatees['ManateeDeaths'].values.copy()
residual = true_val - pred_val
```

The first line creates your predicted values, which have been called ```pred_val```, though of course you could call this anything you like.  Simply copy the fitted values from your model named ```model```.  Similarly, to get your true values into a useable form, called ```true_val```, you just need to copy your dependent variable from your original dataset.  Then you can subtract your predicted values from your true values in a new item called ```residual```. 

Now that you have your residuals, you can both graph them and run statistical tests on them.  How about graphing it first? 

```python
fig, ax = plt.subplots(figsize=(6, 2.5))
_ = ax.scatter(residual, pred_val)
```

The code above produces a residual plot.  You already know from exploration in R that this plot shows some heteroscedasticity - just look at how splayed out the dots are, when they should be random and you should be able to draw basically a flat horizontal line across.

![A residual plot with data splayed out, not plotted in a basically flat horizontal line across.](Media/106.L6.15.png)

But what if you wanted to double check your graph with some statistics? ```statsmodels``` has a couple different options.  The first is the Breusch Pagan test you ran in R. You can call the ```diagnostic.het_breuschpagan``` function out of the ```sms``` package. You will feed in your residual variable, then the dataset and the independent variable.  

```python
sms.diagnostic.het_breuschpagan(residual, manatees[['PowerBoats']])
```

And here is your output:

```text
(14.786518579352652, nan, 24.871600356010788, 1.7849503580150847e-05)
```

![Open parentheses fourteen point seven eight six five one eight five seven nine three five two six five two, n a n, twenty four point eight seven one six zero zero three five six zero one zero seven eight eight, one point seven eight four nine five zero three five eight zero one five zero eight four seven e zero five close parentheses.](Media/106.L6.16.png)

The numbers in the output above mean the following, with left to right discussed from top to bottom: 

* 14.79: This is the Lagrange multiplier statistic.  
  > Ignore!
* nan: This is the *p* value for the Lagrange multiplier statistic.
  > Ignore!
* 24.87: This is the *F* value to test for homoscedasticity.  Like all *F* values, the bigger it is, the more likely it is to be statistically significant.
* .000018: This is the *p* value to test for homoscedasticity.  If it is < .05 (statistically significant), then this means you have violated the assumption of homoscedasticity and your data is, in fact, heteroscedastic.
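For intuition about what the test is checking, the sketch below computes a studentized Breusch-Pagan statistic by hand: regress the squared residuals on the predictor and multiply the resulting R squared by the number of rows. This is a minimal illustration on made-up toy data, assuming the studentized form of the test and one predictor; the function names are hypothetical and library implementations may differ in detail.

```python
import math
import numpy as np

def ols_fit(X, y):
    """OLS coefficients for a design matrix X that already includes a constant."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def chi2_sf_1df(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))

def breusch_pagan_lm(x, y):
    """Studentized Breusch-Pagan LM statistic and p value for one predictor."""
    X = np.column_stack([np.ones_like(x), x])
    resid_sq = (y - X @ ols_fit(X, y)) ** 2
    # Auxiliary regression: do the squared residuals depend on x?
    aux_fitted = X @ ols_fit(X, resid_sq)
    ss_res = np.sum((resid_sq - aux_fitted) ** 2)
    ss_tot = np.sum((resid_sq - resid_sq.mean()) ** 2)
    lm = float(len(y) * (1 - ss_res / ss_tot))
    return lm, chi2_sf_1df(lm)

# Toy data for illustration only (not the manatees dataset)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])
lm_stat, p_value = breusch_pagan_lm(x, y)
print(round(lm_stat, 4), round(p_value, 4))
```

A small p value would mean the size of the residuals changes systematically with x, which is the signature of heteroscedasticity.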

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>You'll notice that in the output, the p value is written in scientific notation. The number after the "e" tells you how many places to move the decimal point, and a negative number means moving it to the left, so 1.78e-05 is .0000178. You see it written out here in the interpretation, but don't be fooled into missing that "e" and not realizing this was very statistically significant!</p>
    </div>
</div>
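If scientific notation is hard to read at a glance, Python can print the same number in ordinary decimal form. This short sketch uses the *p* value copied from the Breusch-Pagan output above.

```python
p_value = 1.7849503580150847e-05   # copied from the Breusch-Pagan output above

print(f"{p_value:.7f}")   # fixed-point form: 0.0000178
print(p_value < 0.05)     # True, so the result is statistically significant
```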

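The second test ```statsmodels``` gives you is the Harvey-Collier test. Note that its null hypothesis is that the relationship is correctly modeled as linear, so it is a check of the model's specification rather than a direct test of constant variance. As with the Breusch-Pagan test, a statistically significant result means the assumption has been violated.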

```python
sms.linear_harvey_collier(model)
```

The results are shown here:

```text
Ttest_1sampResult(statistic=5.479252885631467, pvalue=5.433881942347164e-06)
```

![T test underscore one sampResult open parentheses statistic equals five point four seven nine two five two eight eight five six three one four six seven, p value equals five point four three three eight eight one nine four two three four seen one six four e zero six close parentheses.](Media/106.L6.19.png)

And again, you have violated our assumption.  No surprise!

Of course, because you have heteroscedasticity, you either want to transform your variables so that they become homoscedastic or you want to choose a model that will tolerate the heteroscedasticity, like generalized least squares.  Check out this **[conference presentation](https://conference.scipy.org/scipy2010/slides/skipper_seabold_statsmodels.pdf)** under the "Robust Standard Errors" slide headings if you want to learn more.  

Here's how to apply a Box-Cox transformation in Python, as you did in R.  Note that this example transforms the independent variable ```PowerBoats```, whereas in R you transformed the dependent variable ```ManateeDeaths```.  You can simply call the ```boxcox()``` function: 

```python
transformed, _ = boxcox(manatees['PowerBoats'])
```

This provides an array of transformed ```PowerBoats``` values (the second return value, discarded here with ```_```, is the estimated lambda).  You can call this array anything you like. 

Then it's easy to plot your transformed values, to see if they have become more normal (and thus more likely to be homoscedastic):

```python
plt.hist(transformed)
```

![A histogram with data that does not show a normal distribution with a bell curve.](Media/106.L6.17.png)

As you can see, you still don't have a normal distribution with a beautiful bell curve, so you are unlikely to have corrected your heteroscedasticity.  You can also pull these values into a model if you like, and then re-run your tests for homoscedasticity. Just re-assign x as ```transformed``` instead of as ```PowerBoats``` like so, and create a new model:

```python
x = transformed
model1 = sm.OLS(y,x).fit()
model1.summary()
```

Then create new residual values:

```python
pred_val = model1.fittedvalues.copy()
true_val = manatees['ManateeDeaths'].values.copy()
residual = true_val - pred_val
```

And re-plot, to yield this graph:

![A graph with an x axis that runs from negative thirty to thirty and with a y axis that runs from forty to seventy. Data points are plotted on the graph, from low on the y axis at the lower left of the graph to high on the y axis at the upper right of the graph.](Media/106.L6.18.png)

You could also re-test using the Breusch Pagan or the Harvey Collier, but since you already know your data hasn't been fixed, you'll skip this step in Python.

---

### Testing for Multicollinearity

If you had more independent variables, you would need to test for multicollinearity.  This is easy to do in Python, by asking for all correlations between variables in the dataset.  You typically call this a correlation matrix.  If you're a pure numbers person, the correlation matrix is available in table form.  You'll just call the ```.corr()``` function after the dataset name like this:

```python
manatees.corr()
```

Which will produce this table: 

![A correlation matrix. Power boats, power boats, one point zero zero zero zero zero zero. Power boats, manatee deaths, zero point nine three seven six three seven. Manatee deaths, power boats, zero point nine three seven six three seven. Manatee deaths, manatee deaths, one point zero zero zero zero zero zero.](Media/106.L6.12.png)
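One handy cross-check: in a regression with a single predictor, the square of this correlation equals the model's R squared. The Python sketch below demonstrates that on made-up toy data (not the manatees data) and then squares the 0.937637 from the table, which lands on the Multiple R-squared of 0.8792 that R reported for the untransformed model.

```python
import numpy as np

# Toy data for illustration only (not the manatees dataset)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

r = np.corrcoef(x, y)[0, 1]                    # Pearson correlation

slope, intercept = np.polyfit(x, y, 1)         # simple linear regression
ss_res = np.sum((y - (intercept + slope * x)) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(round(r ** 2, 4), round(r_squared, 4))   # the two values agree
print(round(0.937637 ** 2, 4))                 # correlation from the table, squared
```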

If you're more of a visual learner, you can also opt for a heat map, and if you use the ```annot=True``` option, it will print the correlations out in your heat map as well.

```python
sns.heatmap(manatees.corr(), annot=True)
```

Which produces this graphic: 

![A heat map showing correlations. The heat map has four sections. Top left, power boats, power boats, one. Top right, power boats, manatee deaths, zero point nine four. Bottom left, manatee deaths, power boats, zero point nine four. Bottom right, manatee deaths, manatee deaths, one.](Media/106.L6.13.png)
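
With more than a handful of variables, reading every cell of a correlation matrix gets tedious. As a minimal sketch, the block below lists only the pairs whose correlation is strong enough to worry about. The helper ```high_correlation_pairs```, the cutoff of .8, and the ```Registrations``` and ```Rainfall``` columns are all assumptions made up for illustration; they are not part of the manatees dataset.

```python
import numpy as np
import pandas as pd

# Made-up predictors: Registrations is nearly a copy of PowerBoats, Rainfall is unrelated
rng = np.random.default_rng(1)
n = 100
boats = rng.normal(700, 100, n)
demo = pd.DataFrame({
    "PowerBoats": boats,
    "Registrations": boats * 1.1 + rng.normal(0, 5, n),  # hypothetical column
    "Rainfall": rng.normal(50, 10, n),                    # hypothetical column
})

def high_correlation_pairs(df, threshold=0.8):
    """Return (column, column, r) for each pair with |r| at or above the threshold."""
    corr = df.corr()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):      # upper triangle only, skips the diagonal
            r = float(corr.iloc[i, j])
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(r, 3)))
    return pairs

flagged = high_correlation_pairs(demo)
print(flagged)
```

Only the independent variables belong in a check like this; a strong correlation between a predictor and the dependent variable is what you are hoping to find, not a multicollinearity problem.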

---

### Screening for Outliers

In Python, you can easily create a plot that will test for influential data points (those that are outliers in both x and y space) by looking at leverage and studentized deleted residuals at the same time.  

```python
fig, ax = plt.subplots(figsize=(12,8))
fig = sm.graphics.influence_plot(model, alpha = .05, ax = ax, criterion="cooks")
```

You call the ```influence_plot()``` function from ```statsmodels```, and then feed in your regression model and an alpha of .05 as the cutoff for whether a studentized residual is large enough to be flagged.  The size of each dot is based on Cook's Distance: ```(criterion="cooks")```. The plot that comes out shows you leverage on the x axis and distance, measured with studentized deleted residuals, on the y axis: 

![An influence plot. The x axis is labeled H leverage and the y axis is labeled studentized residuals. Data is plotted in differently sized dots. The larger the dot, the more likely it is to be an outlier.](Media/106.L6.29.png)

The larger the dot on this plot, the more likely something is to be an outlier.  However, something is only officially an outlier when it has a row number next to it.  In this case, you don't have any studentized deleted residuals bigger than 2.5-3, and your leverage values aren't bigger than .5, which would indicate a major problem, so you're good to go.

Just to give you an example of what it will look like when you have influential outliers, here's a graph of the dataset with outliers artificially added in: 

![An influence plot. The x axis is labeled H leverage and the y axis is labeled studentized residuals. Data is plotted in differently sized dots. The larger the dot, the more likely it is to be an outlier. Data clustered in the top left are all small dots. In the bottom right is a significantly bigger dot, labeled thirty seven.](Media/106.L6.30.png)

See how row numbers 36 and 37 stick out like a sore thumb and are labeled? Those would be outliers to remove.

But back to your real data. If you'd rather look via table instead of a graph, just to double check the values, you can: 

```python
infl = model.get_influence()
print(infl.summary_frame())
```

You call the function ```get_influence``` on your regression model, and then are able to print out the summary of the influence information.  

This is the output provided.  Depending on the size of your monitor, it may show up in two blocks, like this: 

```text
    dfb_PowerBoats   cooks_d  standard_resid  hat_diag  dffits_internal  \
0        -0.132313  0.017079       -1.344190  0.009364        -0.130685   
1        -0.091052  0.008333       -0.911488  0.009930        -0.091285   
2        -0.131236  0.016941       -1.240967  0.010881        -0.130158   
3        -0.138372  0.018804       -1.261270  0.011682        -0.137128   
4        -0.110133  0.012142       -0.982851  0.012413        -0.110190   
5        -0.160086  0.024862       -1.409258  0.012364        -0.157676   
6        -0.156233  0.023818       -1.341372  0.013064        -0.154330   
7        -0.124618  0.015514       -1.016502  0.014792        -0.124555   
8        -0.121656  0.014847       -0.946983  0.016286        -0.121846   
9        -0.109841  0.012187       -0.816096  0.017970        -0.110396   
10       -0.172217  0.029260       -1.201621  0.019862        -0.171055   
11       -0.009384  0.000091       -0.063826  0.021783        -0.009524   
12       -0.017788  0.000326       -0.114619  0.024205        -0.018052   
13       -0.116716  0.013811       -0.737538  0.024761        -0.117520   
14       -0.003570  0.000013       -0.024061  0.022178        -0.003624   
15       -0.077150  0.006084       -0.519499  0.022046        -0.077999   
16       -0.198493  0.038546       -1.309628  0.021980        -0.196330   
17       -0.027007  0.000751       -0.177870  0.023180        -0.027400   
18       -0.133642  0.018011       -0.849614  0.024344        -0.134204   
19        0.021350  0.000469        0.133459  0.025677         0.021666   
20       -0.046574  0.002230       -0.281645  0.027339        -0.047219   
21       -0.037879  0.001476       -0.213243  0.031446        -0.038423   
22        0.310787  0.091682        1.635976  0.033121         0.302790   
23        0.401756  0.147611        1.954662  0.037197         0.384202   
24       -0.013867  0.000198       -0.066572  0.042786        -0.014075   
25        0.421093  0.163851        1.872551  0.044642         0.404785   
26        0.036783  0.001393        0.169473  0.046251         0.037320   
27       -0.032946  0.001118       -0.150991  0.046730        -0.033430   
28       -0.067383  0.004666       -0.299773  0.049356        -0.068305   
29        0.270730  0.072503        1.164574  0.050747         0.269264   
30       -0.006067  0.000038       -0.026550  0.051047        -0.006158   
31        0.064526  0.004279        0.287097  0.049356         0.065417   
32        0.293257  0.084142        1.308732  0.046826         0.290072   
33        0.399878  0.148732        1.818009  0.043062         0.385658   
34        0.124546  0.015809        0.605561  0.041330         0.125735   

    student_resid    dffits  
0       -1.360930 -0.132313  
1       -0.909161 -0.091052  
2       -1.251247 -0.131236  
3       -1.272714 -0.138372  
4       -0.982345 -0.110133  
5       -1.430796 -0.160086  
6       -1.357918 -0.156233  
7       -1.017015 -0.124618  
8       -0.945506 -0.121656  
9       -0.811997 -0.109841  
10      -1.209786 -0.172217  
11      -0.062884 -0.009384  
12      -0.112943 -0.017788  
13      -0.732494 -0.116716  
14      -0.023704 -0.003570  
15      -0.513845 -0.077150  
16      -1.324053 -0.198493  
17      -0.175316 -0.027007  
18      -0.846056 -0.133642  
19       0.131516  0.021350  
20      -0.277796 -0.046574  
21      -0.210224 -0.037879  
22       1.679183  0.310787  
23       2.043968  0.401756  
24      -0.065590 -0.013867  
25       1.947990  0.421093  
26       0.167033  0.036783  
27      -0.148804 -0.032946  
28      -0.295723 -0.067383  
29       1.170911  0.270730  
30      -0.026157 -0.006067  
31       0.283187  0.064526  
32       1.323099  0.293257  
33       1.885039  0.399878  
34       0.599833  0.124546  
```

The DFBETAS value is shown in the ```dfb_PowerBoats``` column, and if the absolute value of any entry is greater than 1, then you have a problem with an influential value.

The DFFITS value is shown in the ```dffits``` column, and like DFBETAS, an absolute value greater than 1 indicates a problem.

Leverage values are shown in the ```hat_diag``` column. If a value is in the .2 - .5 range, you may have a moderate outlier problem, and if it is over .5, then you have a large outlier problem.

Studentized deleted residuals are shown in the ```student_resid``` column, and if you have an absolute value over 2.5 or 3, you probably have an outlier problem.
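
To make the leverage and studentized deleted residual columns less of a black box, here is a rough sketch of how those two quantities can be computed directly and screened against the cutoffs above. It is hand-rolled with ```numpy``` rather than taken from ```statsmodels```; the function name ```influence_screen``` and the demo data (30 well-behaved points plus one planted outlier) are assumptions for illustration, and the model is assumed to include an intercept.

```python
import numpy as np

def influence_screen(x, y):
    """Leverage and studentized deleted residuals for a one-predictor model (illustrative)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones_like(x), x])            # intercept + predictor
    n, p = X.shape

    hat = X @ np.linalg.inv(X.T @ X) @ X.T               # hat matrix
    leverage = np.diag(hat)                              # one leverage value per row

    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sse = resid @ resid

    # Error variance estimated with row i left out, then the studentized deleted residual
    s2_deleted = (sse - resid ** 2 / (1 - leverage)) / (n - p - 1)
    student = resid / np.sqrt(s2_deleted * (1 - leverage))
    return leverage, student

# Made-up data: 30 points on a line plus one planted outlier in the last row
rng = np.random.default_rng(7)
x_demo = np.append(np.arange(30, dtype=float), 100.0)
y_demo = 3 + 2 * x_demo + rng.normal(0, 1, size=x_demo.size)
y_demo[-1] = 50.0                                        # far from the line and far out in x

leverage, student = influence_screen(x_demo, y_demo)
flagged = np.where((leverage > 0.5) | (np.abs(student) > 3))[0]
print("flagged rows:", flagged)
```

The planted row is extreme in both x space (high leverage) and y space (large studentized deleted residual), which is what makes a point influential.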

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 9 - Interpreting the Regression Output<a class="anchor" id="DS106L1_page_9"></a>

[Back to Top](#DS106L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Interpreting the Regression Output

Having tested all your assumptions, you are now ready to actually interpret your regression model.  Ordinarily, you would not run regression until you meet all the assumptions, but because you know your data is just test data that made it difficult to fix the heteroscedasticity problem, you're going to proceed for the purpose of instruction.

To call up the summary of your regression model, which has all the statistics you could ever want and need, you will simply use its name and ```.summary()```.

```python
model.summary()
```

 The regression output is shown here:

![O L S Regression Results. Dependent variable, manatee deaths. Model, O L S. Method, least squares. Date, Tuesday twenty five september two thousand eighteen. Time, ten zero two zero zero. Number of observations, thirty five. D F residuals, thirty four. D F model, one. Covariance type, non robust. R squared, zero point nine four five. Adjusted r squared, zero point nine four three. F statistic, five hundred eighty four point four. Probability open parentheses f statistic close parentheses, five point three five E negative twenty three. Log likelihood, negative one hundred forty one point four eight. A I C, two hundred eighty five point zero. B I C, two hundred eighty six point five. Powerboats, coefficient, zero point zero seven five zero, standard error, zero point zero zero three, T, twenty four point one seven four, P greater than absolute value t, zero point zero zero zero, open bracket zero point zero two five, zero point zero six nine, zero point nine seven five close bracket, zero point zero eight one. Omnibus, four point zero two two, probability open parentheses omnibus close parentheses zero point one three four, skew, zero point seven seven five, kurtosis, two point six six zero, durbin watson, zero point seven eight two, jarque bera open parentheses J B close parentheses, three point six seven six, probability open parentheses J B close parentheses, zero point one five nine, condition number, one point zero zero.](Media/106.L6.14.png)

You find your overall *F* value under ```F-statistic``` in the right-hand column, and underneath is the *p* value. The overall model is statistically significant, because the *p* value is less than .05, so you can then move on to interpreting other things.  You have your slope under ```coef``` in the middle table, as well as the associated *t* test and *p* value for the significance of this particular independent variable in your model.  Because the *p* value is < .05, you can determine that ```PowerBoats``` has a significant effect on the number of manatee deaths.  By looking at the ```R-squared``` and ```Adj. R-squared``` in the upper right-hand corner, you see that this model accounts for 94% of the variance in manatee deaths.  

The last table also provides some model fit indices:
* **Omnibus:** This is a combined test of the skew and kurtosis of the residuals.  You want a value close to zero, which would indicate normality.
* **Prob(Omnibus):** This is the *p* value for the Omnibus test.  Because the null hypothesis is that the residuals are normally distributed, you would like to see a value above .05 here; a value close to zero indicates that the residuals are not normal.
* **Skew:** Again, you would like to see a value close to zero, and this result feeds into the Omnibus test discussed above.
* **Kurtosis:** This is reported on the scale where a normal distribution has a kurtosis of 3, so a value close to 3 means your residuals are relatively normal.
* **Durbin-Watson:** This tests for autocorrelation in the residuals, meaning whether each residual is related to the one before it.  The statistic ranges from 0 to 4, and you would like a value close to 2.  Values well below 2 point to positive autocorrelation.
* **Jarque-Bera (JB):** This also tests skew and kurtosis. It should also be close to zero.
* **Prob (JB):** The *p* value for the Jarque-Bera test.  As with Prob(Omnibus), you would like this to be above .05.
* **Condition Number:** You would like to see a condition number below 30 or so, because that indicates low multicollinearity.  If you have higher than 30, it's time to suspect related variables!

Now that you understand what these model fit indices are, you can see that:

* You have passed the Omnibus and Jarque-Bera tests, because both *p* values (.134 and .159) are above .05, although the skew of 0.775 shows the residuals lean somewhat to the right
* Your Durbin-Watson value of 0.782 is well below 2, so your residuals appear to be positively autocorrelated
* You have no multicollinearity - which is expected since you only have one independent variable!
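
The Durbin-Watson and Jarque-Bera figures in the summary table come from short formulas, and seeing them written out can make the numbers easier to read. The block below is a hand-rolled sketch, not the ```statsmodels``` source: the function names and the made-up residual series are assumptions for illustration. The last call plugs in the skew (0.775), kurtosis (2.660), and number of observations (35) reported in the output above.

```python
import math
import numpy as np

def durbin_watson(resid):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    e = np.asarray(resid, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

def skew_and_kurtosis(resid):
    """Sample skew and kurtosis, with kurtosis on the scale where normal = 3."""
    e = np.asarray(resid, dtype=float)
    centered = e - e.mean()
    m2 = np.mean(centered ** 2)
    skew = np.mean(centered ** 3) / m2 ** 1.5
    kurtosis = np.mean(centered ** 4) / m2 ** 2
    return skew, kurtosis

def jarque_bera_from_moments(n, skew, kurtosis):
    """Jarque-Bera statistic and its p value (chi-square with 2 df)."""
    jb = n / 6 * (skew ** 2 + (kurtosis - 3) ** 2 / 4)
    p_value = math.exp(-jb / 2)          # upper tail of chi-square with 2 df
    return jb, p_value

# Made-up residual series: independent noise versus a drifting random walk
rng = np.random.default_rng(42)
white_noise = rng.normal(size=500)
random_walk = np.cumsum(rng.normal(size=500))

dw_white = durbin_watson(white_noise)    # near 2: neighbouring residuals unrelated
dw_walk = durbin_watson(random_walk)     # near 0: strong positive autocorrelation
noise_skew, noise_kurt = skew_and_kurtosis(white_noise)

# Values reported in the regression output above
jb_reported, p_reported = jarque_bera_from_moments(n=35, skew=0.775, kurtosis=2.660)
print(round(dw_white, 2), round(dw_walk, 3))
print(round(jb_reported, 2), round(p_reported, 3))
```

The recomputed Jarque-Bera value lands within rounding of the 3.676 and 0.159 shown in the output, which is a handy way to confirm which scale the kurtosis is reported on.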

---

## Summary

* In statistics, you write the linear regression equation as y = b<sub>1</sub>x + b<sub>0</sub> where b<sub>0</sub> is the y-intercept of the line and b<sub>1</sub> is the slope of the line.

* Linear regression allows you to predict values of y for a given x. This is done by first calculating the coefficients b<sub>0</sub> and b<sub>1</sub> and then plugging in the desired value of x and solving for y.

* The independent variable (x) is the variable that is not affected by what happens to the other variable.

* The dependent variable (y) is the variable that is affected by what happens to the other variable. It *depends* on  the independent variable. For example, in the correlation between the number of powerboats and the number of manatee deaths, the number of deaths is affected by the number of powerboats in the water, but not the other way around. So, you would assign x to represent the number of powerboats and y to represent the number of manatee deaths.

* The true linear regression line is y = β<sub>1</sub>x + β<sub>0</sub> where β<sub>0</sub> is the true y-intercept of the line and β<sub>1</sub> is the true slope of the line.

* A residual is the difference between the observed value of y for a given x and the predicted value of y on the regression line for the same x.

* To check all the assumptions for linear regression, you will need to create a scatterplot of x and y, a residual plot, and a Q-Q plot of the residuals.

* To conduct a hypothesis test to determine whether there is a linear relationship between two variables, test the slope (β<sub>1</sub>) to see whether it is equal to zero.

* The appropriate hypotheses for this test are:
  H<sub>0</sub>: β<sub>1</sub> = 0 and
  H<sub>a</sub>: β<sub>1</sub> ≠ 0
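
The points in this summary can be tied together in a few lines of code. The sketch below computes b<sub>0</sub> and b<sub>1</sub> by least squares, the residuals, and the *t* statistic used to test whether the slope is zero. The function name and the five data points are made up for illustration, and the model is assumed to include an intercept.

```python
import numpy as np

def simple_linear_regression(x, y):
    """Least squares intercept, slope, residuals, and t statistic for the slope."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    x_bar, y_bar = x.mean(), y.mean()

    sxx = np.sum((x - x_bar) ** 2)
    b1 = np.sum((x - x_bar) * (y - y_bar)) / sxx     # slope
    b0 = y_bar - b1 * x_bar                          # y-intercept

    resid = y - (b0 + b1 * x)                        # observed minus predicted
    s2 = np.sum(resid ** 2) / (n - 2)                # residual variance
    t_stat = b1 / np.sqrt(s2 / sxx)                  # tests H0: slope = 0
    return b0, b1, resid, t_stat

# Made-up data for illustration
x_demo = [1, 2, 3, 4, 5]
y_demo = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1, resid, t_stat = simple_linear_regression(x_demo, y_demo)
prediction = b0 + b1 * 6                             # predicted y when x = 6
print(round(b0, 2), round(b1, 2), round(prediction, 2))
```

A *t* statistic far from zero, compared against a *t* distribution with n - 2 degrees of freedom, is the evidence for rejecting H<sub>0</sub>: β<sub>1</sub> = 0.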

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 10 - Key Terms<a class="anchor" id="DS106L1_page_10"></a>

[Back to Top](#DS106L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Key Terms

Below is a list and short description of the important keywords learned in this lesson. Please read through and go back and review any concepts you do not fully understand. Great Work!

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Simple Linear Regression</td>
        <td>A regression model with one independent variable and one dependent variable.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>y-Intercept</td>
        <td>The value where your line crosses the y-axis.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Slope</td>
        <td>How steep your line is.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Residual</td>
        <td>The difference between the real data and the estimate you have generated.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Assumptions</td>
        <td>The minimum requirements your data must meet to be accurate for a particular statistical analysis.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Homoscedasticity</td>
        <td>Equal spread of the residuals across all values of x.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Homogeneity of Variance</td>
        <td>Constant error variance.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Multicollinearity</td>
        <td>High correlation between variables.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Leverage</td>
        <td>An extreme outlier in x space.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Distance</td>
        <td>An extreme outlier in y space.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Influential</td>
        <td>An extreme value in both x and y space.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Skewness</td>
        <td>A measure of how normally distributed data is horizontally.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Kurtosis</td>
        <td>A measure of how normally distributed data is vertically.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Heteroscedasticity</td>
        <td>When the spread of the residuals changes across values of x.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Multiple R-Squared</td>
        <td>Shows how much variation in the y variable is accounted for by the x variable.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Adjusted R-Squared</td>
        <td>Like multiple R-squared, but adjusted for the number of independent variables.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>DFBETAS</td>
        <td>A screening tool for influential outliers.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>DFFITS</td>
        <td>A screening tool for outliers.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Studentized Deleted Residuals</td>
        <td>A screening tool for outliers.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Omnibus</td>
        <td>A test of skew and kurtosis for the residual.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Durbin Watson Test</td>
        <td>Tests for autocorrelation in the residuals.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Jarque-Bera test</td>
        <td>Tests for skew and kurtosis.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Condition Number</td>
        <td>An indicator of multicollinearity.</td>
    </tr>
</table>

---

## Key R Libraries

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>car</td>
        <td>For conducting linear models.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>caret</td>
        <td>For testing assumptions in linear models.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>gvlma</td>
        <td>For testing assumptions in linear models.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>predictmeans</td>
        <td>For testing assumptions in linear models.</td>
    </tr>
</table>

---

## Key R Code

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>scatter.smooth()</td>
        <td>Creates a scatter plot with a smoothed line through the points.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>lm()</td>
        <td>Creates a linear model.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>bptest()</td>
        <td>Conducts a Breusch-Pagan test.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>ncvTest()</td>
        <td>Conducts a non-constant variance test.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>BoxCoxTrans()</td>
        <td>Completes a BoxCox transformation on your data.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>cbind()</td>
        <td>Binds columns to a dataset.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>gvlma()</td>
        <td>Tests a whole slew of linear regression assumptions.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>CookD()</td>
        <td>Examines Cook's Distance for outliers in x space.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>hat()</td>
        <td>Examines leverage.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>outlierTest()</td>
        <td>Tests for outliers.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>influence.measures()</td>
        <td>Tests for influential outliers.</td>
    </tr>
</table>

---

## Key Python Packages

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>statsmodels</td>
        <td>For completing linear regression.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>scipy.stats</td>
        <td>For testing linear regression assumptions.</td>
    </tr>
</table>

---

## Key Python Code

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>sm.OLS()</td>
        <td>Completes an ordinary least squares regression. </td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>sms.diagnostic.het_breuschpagan()</td>
        <td>For conducting a Breusch-Pagan test of homoscedasticity.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>sms.linear_harvey_collier()</td>
        <td>For conducting a Harvey Collier test of homoscedasticity. </td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>boxcox()</td>
        <td>Applies the BoxCox transformation to your data.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>.corr()</td>
        <td>Creates a correlation matrix.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>sns.heatmap()</td>
        <td>Creates a heatmap of your correlation matrix.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>sm.graphics.influence_plot()</td>
        <td>Creates a plot of influential outliers.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>get_influence()</td>
        <td>Provides outlier statistics to screen.</td>
    </tr>
</table>

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 11 - Lesson 1 Practice Hands-On<a class="anchor" id="DS106L1_page_11"></a>

[Back to Top](#DS106L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

This Hands-On will **not** be graded, but we encourage you to complete it. The best way to become a great data scientist is to practice.

---

It is a well-known phenomenon that most of us shrink throughout the day each day.  The effects of gravity mean that our height measured at the end of the day is less than our height measured at the beginning of the day.  Fortunately, at night, our bodies stretch out again, so that from one morning to the next, each of us has returned to the morning height from the day before.

In the dataset below, there are AM and PM height measurements (in mm) for students from a boarding school in India.

## Hands on Part 1

Take the **[following dataset](https://repo.exeterlms.com/documents/V2/DataScience/Modeling-Optimization/heights.zip)**, and complete simple linear regression in R.  Make sure to test, note, and correct for all assumptions if possible!

## Hands on Part 2

Take the above dataset, and complete simple linear regression in Python.  Make sure to test, note, and correct for all assumptions if possible!

Create a slide presentation that walks through your assumptions and overall findings in R and Python.   Try to explain it in layman's terms.  Please also submit your code files for grading!

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Be sure to zip and submit your entire directory when finished!</p>
    </div>
</div>



<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 12 - Lesson 1 Practice Hands-On Solution<a class="anchor" id="DS106L1_page_12"></a>

[Back to Top](#DS106L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Lesson 1 Practice Hands-On Solution

Below you will find the solutions for Parts I & II of the Lesson 1 Practice Hands-On.

---

## Part I

```{r}
#DSO106 Modeling L1 Practice Hands-on Solution

#Load in Libraries
library("car")
library("caret")
library("gvlma")
library("predictmeans")

#Test Assumptions

## Linearity

scatter.smooth(x=heights$AM_Height, y=heights$PM_Height, main="Morning by Evening Height")

### Since this looks linear, the assumption of linearity has been met!

## Homoscedasticity

### Run the Basic Model

lmModHeights = lm(PM_Height~AM_Height, data=heights)

### Graph it

par(mfrow=c(2,2))
plot(lmModHeights)

#### Looking at the graphs, the lines should be approximately flat. The line in the top left graph curves and the line in the bottom left graph dips, so the assumption of homoscedasticity may not be met.

### Breusch-Pagan Test

lmtest::bptest(lmModHeights)

#### Since this test was not significant, there is homoscedasticity! The assumption is met!

### Non-Constant Variance Test

car::ncvTest(lmModHeights)

#### Same here - it wasn't significant, so the assumption has been met!

## Homogeneity of Variance

### Looking at the graphs from the last assumption, this may not have been passed. But continuing for learning purposes!

### GVLMA test

gvlma(lmModHeights)

#### All assumptions acceptable! Wow!

## Screening for outliers in x space

###Cook's D

CookD(lmModHeights, group=NULL, plot=TRUE, idn=3, newwd=TRUE)

#### Looks like observations 3, 4, and 12 may be a problem

### Leverage values

lev = hat(model.matrix(lmModHeights))
plot(lev)

heights[lev>.2,]

#### Going by leverage values, only 3 is really an issue

## Screening for outliers in y space

car::outlierTest(lmModHeights)

### This test was significant, so it's likely there is at least one outlier

## Screening for outliers in x and y space (influential points)

summary(influence.measures(lmModHeights))

### Looks like the values on the list are 3, 11, and 37.  Should probably try a model in which outliers are removed from the data

## Creating a new model without outliers to test against the model with outliers

heightsNoO <- heights[c(1,2,5,6,7,8,9,10,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,38,39,40,41),]
lmModHeightsNoO = lm(PM_Height~AM_Height, data=heightsNoO)

## Look at the model summaries for each

summary(lmModHeights)

### Looks like morning height is a significant predictor of evening height and explains 99% of the variance in evening height.

summary(lmModHeightsNoO)

### Very similar results with the model with no outliers, so it's fine to keep and use the original model with all the data!
```

---

## Part II

**[HERE](https://repo.exeterlms.com/documents/V2/DataScience/Modeling-Optimization/ModelingL1HandsOn2.zip)** is a Jupyter Notebook file containing the solution in Python. You will need to download it, save it, and then open it with your own Jupyter Notebook.