<!-- TITLE for 1 -->
# 1.0 What is Linear Regression?

<!-- INTRODUCTION for 1.1 -->

Linear regression is a machine learning technique that models the relationship between explanatory variables (also called predictor variables, independent variables, or x-variables) and a continuous dependent variable (also called the target or y-variable). In the Boston housing dataset, the target variable is MEDV (median home value). It's considered continuous because it can take any value within a range (in principle, anywhere from zero upward) rather than one of a fixed set of options. This is in contrast to discrete variables, which you can think of as categories. For discrete variables, values can only fall into one of a finite number of options. You'll use linear regression to find the relationship between the predictor variables (nitric oxide concentrations, proximity to the Charles River, etc.) and the median home value for towns in Boston.

<!-- TASK 1.1 -->
## 1.1 Linear Regression for Predictions

There are two common scenarios in which linear regression may be useful. The first is making predictions or forecasting. In these situations, you fit a model to learn the relationship between some predictor variables and a target variable. If you want to know the value of the target variable in some new case, the model can make a prediction for you. One example might be forecasting the profits of a business in the next quarter based on data you've collected this quarter. For the Boston housing data, you'll be able to predict the median home value for new towns (as long as you collect data on nitric oxide levels, number of rooms per home, etc.). Notice that you can predict either future values or simply values for cases you haven't seen before.

Think of a few other examples of when you might use linear regression to make a prediction.

In [None]:
## STARTER CODE for 1.1

# Linear regression could be used to predict...

<!-- HINT for 1.1-->
Double click for hint.
<!-- Think about sets of related variables. -->

<!-- TASK 1.2 -->
## 1.2 A Quick Math Interlude

Before you learn about the second use for linear regressions, let's take a few minutes to talk math. Given that the word "linear" is in the name, you might have guessed that linear regressions have to do with lines.

Remember the equation for a line? $$y = mx + b $$

Here, y represents the target or dependent variable, m is the slope, x is a predictor variable, and b is the y-intercept.

You might also see something like this: 

$$y = w_1x_1 + w_0$$

Or like this: 

$$y = \beta_1x_1 + \beta_0$$

In a case where you have multiple predictor variables (like in the Boston housing dataset), the equation might look more like this, with one value of x and one weight for every variable in the dataset:

$$y = w_3x_3 + w_2x_2 + w_1x_1 + w_0$$


All of these are telling you the same thing: that you can calculate the value of the target variable (y) by multiplying each predictor variable (x) by its weight (m, w, or $\beta$), adding up the results, and adding the intercept (b, $w_0$, or $\beta_0$).

When you use your computer to fit a linear regression, it's basically calculating the values of the weights (also called regression coefficients) to make the best guess for y.
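To make the equation concrete, here is a minimal sketch with made-up numbers. The weights, intercept, and x values below are illustrative assumptions, not values from the Boston data. It calculates y for three predictors by multiplying each one by its weight and adding the intercept.

```python
import numpy as np

# Hypothetical weights w_1, w_2, w_3 and intercept w_0 (made up for illustration)
weights = np.array([2.0, -1.0, 0.5])
intercept = 4.0

# One observation with three predictor values x_1, x_2, x_3
x = np.array([3.0, 2.0, 10.0])

# y = w_1*x_1 + w_2*x_2 + w_3*x_3 + w_0
y = np.dot(weights, x) + intercept
print(y)  # 2*3 - 1*2 + 0.5*10 + 4 = 13.0
```

Fitting a linear regression runs this idea in reverse: given many observations of x and y, the computer searches for the weights and intercept that reproduce y as closely as possible.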

If you'd like to learn more about these concepts, check out this [video](https://www.youtube.com/watch?v=KsVBBJRb9TE) or this [article](http://onlinestatbook.com/2/regression/intro.html).

<!-- TASK 1.3 -->
## 1.3 Linear Regression for Understanding Relationships

Now, back to explaining why linear regression can be useful. Looking at the values your computer will calculate for the weights (m, w, or $\beta$) can tell you which variables are the best predictors of the target variable. Variables whose coefficients are furthest from zero, in either direction, have the strongest relationship with the target variable (assuming the predictors are measured on comparable scales). The sign of the coefficient tells you whether the target rises or falls as the predictor increases.

If a linear regression calculates a very positive coefficient for a particular predictor variable, this means that there is a positive relationship between it and the target. As the predictor increases, so does the target. (Imagine the relationship between working more hours and having a higher paycheck. As you work more at your hourly wage job, you earn more money.) 

If the regression reveals a very negative coefficient for a predictor variable, it means there is an inverse relationship between it and the target. As the predictor increases, the target decreases. (For example, there is an inverse relationship between the number of coats in your closet when you throw a party and how much empty space there is. As you add more coats, the amount of space goes down.)

Finally, if the coefficient is close to zero, the predictor has a negligible effect on the target. (Such as the relationship between the temperature outside and how tall you are. The temperature can move up or down, but your height won't change.)
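The three cases above can be sketched with simulated data. Everything in this example (the variable names, the numbers, and the helper function `slope`) is invented for illustration; it simply fits a one-predictor regression to each pair of variables and reports the coefficient.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200

# Simulated data (assumptions for illustration only)
hours = rng.uniform(0, 40, n)
paycheck = 15 * hours + rng.normal(0, 5, n)       # rises with hours worked

coats = rng.uniform(0, 30, n)
space = 100 - 3 * coats + rng.normal(0, 2, n)     # falls as coats are added

temperature = rng.uniform(-10, 35, n)
height = 170 + rng.normal(0, 8, n)                # unrelated to temperature

def slope(x, y):
    """Fit a one-predictor linear regression and return its coefficient."""
    model = LinearRegression().fit(x.reshape(-1, 1), y)
    return model.coef_[0]

pos = slope(hours, paycheck)
neg = slope(coats, space)
near_zero = slope(temperature, height)
print(pos, neg, near_zero)
```

The first coefficient should come out clearly positive, the second clearly negative, and the third close to zero, matching the paycheck, coat closet, and height examples.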

Think of other relationships between variables that might yield a positive, negative or neutral regression coefficient.

In [None]:
## STARTER CODE for 1.3
# Positive regression coefficient:
# Negative regression coefficient:
# Neutral regression coefficient:

<!-- CONCLUSION for 1.0 -->
In this step, you learned about your first machine learning technique: linear regression. You know that it can be used to make predictions or to understand the relationships between variables, and saw a few examples. Next, you'll use Scikit-Learn to fit a linear regression on the Boston housing dataset.

<!-- TITLE for 2 -->
# 2.0 Fitting a Linear Regression

In [23]:
# Pre-run
import pandas as pd
import matplotlib.pyplot as plt
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an earlier version.
from sklearn.datasets import load_boston
import numpy as np

dataset = load_boston()
data = pd.DataFrame(dataset.data, columns = dataset.feature_names)
data['MEDV'] = dataset.target

<!-- INTRODUCTION for 2.1 -->
Now that you know what a linear regression is, and what it can be used for, it's time to fit one. In this lesson, you'll become familiar with the Scikit-Learn library, which contains tools for machine learning in Python, including (but certainly not limited to) linear regression. Using the Boston housing data, you'll learn to fit a linear regression that can predict a town's median home value based on the other variables (proportion of homes built before 1940, nitric oxide concentration, accessibility to highways, etc.).

When you're modeling data in Scikit-Learn, there are three basic steps to remember: **fit** the model, **predict** the target variable, and **evaluate** the model's fit.
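As a rough preview of those three steps, here is a minimal sketch on simulated data rather than the Boston dataset. The arrays `X_demo` and `y_demo` and the name `demo_model` are made up for illustration; `.fit()`, `.predict()`, and `.score()` are the Scikit-Learn methods covered later in this tutorial.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Simulated data: two predictors and a target that depends on them linearly
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 1.0 + rng.normal(0, 0.1, 100)

demo_model = LinearRegression()

# 1. Fit: learn the weights from the predictors and the target
demo_model.fit(X_demo, y_demo)

# 2. Predict: generate the model's guess for each observation
demo_predictions = demo_model.predict(X_demo)

# 3. Evaluate: measure how closely the guesses match the target (R-squared)
demo_score = demo_model.score(X_demo, y_demo)
print(demo_predictions.shape, demo_score)
```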

<!-- TASK 2.1 -->
## 2.1 Import LinearRegression

Throughout this step, you'll be creating a LinearRegression object and calling different methods on it.

Import the LinearRegression class from Scikit-Learn.
Create a LinearRegression object and assign it to the variable 'model'.

In [None]:
## STARTER CODE for 2.1
from sklearn.linear_model import LinearRegression

model = 

<!-- HINT for 2.1 -->
Double click for hint.
<!-- Create an instance of a linear regression model by declaring `LinearRegression()`. -->

<!-- TASK 2.2 -->
## 2.2 Assign Predictor and Target Values

To fit a model in Scikit-Learn, the target and predictor variables need to be in separate arrays. It's conventional to assign all values of predictor variables to an array called X, and the column of the target variable to another array called y.

Assign the predictor variables to X and the target variable to y. Check their dimensions to be sure they have the same number of observations (506 different towns), and that X has 13 columns (one for each predictor variable).

In [None]:
## STARTER CODE for 2.2
X = 
y = 

<!-- HINT -->
Double click for hint.
<!-- You can find the predictor variables in the `.data` attribute of the original Boston dataset import from sklearn, and the target variable in the `.target` attribute. Use `.shape` to see the dimensions of each array. -->

<!-- TASK 2.3 -->
## 2.3 Fit the Model

One of the things that makes Scikit-Learn easy to use is that many different models have the same methods, so once you've learned how to implement one modeling technique, the process for implementing other models will look very similar. The first method you'll learn is `.fit()`.

The `.fit()` method takes arrays of predictor variables and target variables as arguments, and trains the model to learn the relationship between them. 

Fit the model on the X and y arrays you just defined.

In [None]:
## STARTER CODE for 2.3
model.fit()

<!-- HINT for 2.3-->
Double click for hint.
<!-- Add X and y inside `.fit()`. -->

<!-- TASK 2.4 -->
## 2.4 Predict Values of y

Now, the model has been trained on the relationship between the predictor variables and the target variable. Under the hood, Scikit-Learn has calculated a coefficient for every predictor variable. Given a new observation of a town for which you've measured proportion of old homes, nitric oxide levels, etc., the model can predict its median home value by calculating the linear combination of all values using these coefficients.

Each Scikit-Learn model type has a `.predict()` method that takes in an array of predictor variables and outputs the model's "guess" for the target variables. 
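To show what "linear combination" means in practice, the sketch below uses simulated data (the names `X_demo`, `y_demo`, and `demo_model` are assumptions for illustration). It rebuilds the predictions by hand from the fitted model's `.coef_` and `.intercept_` attributes and checks that they match the output of `.predict()`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Simulated data with three predictors
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([1.5, -0.5, 2.0]) + 0.7 + rng.normal(0, 0.2, 50)

demo_model = LinearRegression().fit(X_demo, y_demo)

# Multiply each predictor by its coefficient, sum across columns, add the intercept
manual = X_demo @ demo_model.coef_ + demo_model.intercept_

# The hand-built values agree with the model's own predictions
same = np.allclose(manual, demo_model.predict(X_demo))
print(same)
```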

Use `.predict()` to create an array of predicted median home values for every observation in X and assign it to the variable "predictions". Check the length of X and the length of the "predictions" array. (They should be the same, since there's one prediction for every town in X.)

In [None]:
## STARTER CODE for 2.4
predictions =

<!-- HINT -->
Double click for hint.
<!-- Remember that you can check the length of an array using `len()`. -->

<!-- CONCLUSION for 2.0 -->
In this step, you've created a linear regression model in Scikit-Learn, trained it on the predictor and target variables using the `.fit()` method, and used it to make its own predictions for median home value with the `.predict()` method. The next question is: how did it perform? Next, you'll learn how to evaluate the model's performance.

<!-- TITLE for 3 -->
# 3.0 Evaluating the Model Graphically

<!-- INTRODUCTION for 3 -->
Now that you've trained the model and had it make some predictions, it's time to evaluate its performance. How accurate are its guesses for median home value compared to the actual measured values of MEDV in the dataset?

<!-- TASK 3.1 -->
## 3.1 Plotting Measured Values vs Predictions

One way to visually assess the model's performance is to plot the recorded median home values from the dataset against the predicted values that the model generated. If the model predicted perfectly for every town, the points would fall on a straight diagonal line because the predicted value would equal the measured value. The less accurate the guesses are, the more the dots will diverge from the line.

Make a scatterplot with the observed values of MEDV on the x-axis and the model's predictions on the y-axis. Don't forget to include axis labels!

In [None]:
## STARTER CODE for 3.1
# add code for scatter plot here
plt.plot([0,60],[0,60], color = 'red')
# add code here
plt.ylim(0,60)
plt.xlim(0,60)
plt.show()

<!-- HINT -->
Double click for hint.

<!-- Remember that you stored the MEDV values under the variable y, and that since these go on the x-axis, y will be the first argument in the .scatter() function. -->

<!-- TASK 3.2 -->
## 3.2 Investigating Coefficients

You've just plotted the actual median home values against the model's predictions. Were the points clustered along a line or diffusely spread? Does it look like the model's predictions were close to the actual values?

When the model calculated its predictions for median home value, it assigned a weight to each of the predictor variables. This means that factors like AGE, RAD, or NOX each have a different level of importance in determining the final outcome. Think back to some of the plots you made during exploratory data analysis. Some of the variables, like RM, seemed to have a stronger correlation with MEDV than others, like RAD. A stronger correlation generally makes a variable a better predictor, and the model will often calculate a larger coefficient (further from zero) for it, although the size of a coefficient also depends on the units the variable is measured in.

Scikit-Learn linear models such as LinearRegression store the coefficients they calculate in the .coef\_ attribute. To see which variables had the highest coefficient (and therefore the strongest weight in the guess for median home value), you can make a dataframe that lists variables and coefficients and sort it from highest to lowest.
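One caveat when ranking variables by coefficient: the size of a coefficient depends on the units of its predictor. The sketch below uses simulated data (the distance and price variables are invented for illustration) to fit the same relationship twice, once with distance in kilometers and once in meters.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Simulated data: price falls as distance grows
rng = np.random.default_rng(7)
distance_km = rng.uniform(1, 20, 300)
price = 50 - 1.2 * distance_km + rng.normal(0, 1, 300)

# Same predictor, expressed in kilometers and then in meters
coef_km = LinearRegression().fit(distance_km.reshape(-1, 1), price).coef_[0]
distance_m = distance_km * 1000
coef_m = LinearRegression().fit(distance_m.reshape(-1, 1), price).coef_[0]

# The coefficient shrinks by a factor of 1000, but the relationship is unchanged
print(coef_km, coef_m)
```

Both models make identical predictions, so a small coefficient does not by itself mean a weak predictor. It is worth keeping the units of each Boston variable in mind when comparing their coefficients.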

Assign the model.coef\_ to the 'coefficient' column of the coefficient_df dataframe.

In [None]:
## STARTER CODE for 3.2
coefficient_df = pd.DataFrame(columns = ['variable', 'coefficient'])
# keep the original feature names as column labels
X = pd.DataFrame(X, columns = dataset.feature_names)
coefficient_df['variable'] = X.columns
# add code here
coefficient_df.sort_values('coefficient', ascending = False)

<!-- HINT for 3.2-->
Double click for hint.
<!-- You can assign the value of the coefficient column like this: `coefficient_df['coefficient'] =` -->

<!-- TASK 3.3 -->
## 3.3 Plotting Coefficients

You've probably noticed that some of the coefficients for the predictor variables are positive and others are negative. Remember, this represents the direction of the relationship between each variable and the target: a very negative coefficient means that the two variables are inversely related. For instance, as NOX increases, MEDV goes down. A positive coefficient means that the variables are positively related: as RM increases, so does MEDV. The further a coefficient is from zero in either direction, the stronger the relationship. Coefficients close to zero have little effect on the median home value.

To visualize these trends, you can plot the coefficients on a bar graph. The longest bars in either direction belong to the variables that had the most influence over the model's predictions.

Assign the values in the `coefficient` column of the dataframe to `bar_heights`.
Assign the values in the `variable` column of the dataframe to `labels`.

In [None]:
## STARTER CODE for 3.3
# sort the coefficient dataframe from highest to lowest
coefficient_df = coefficient_df.sort_values('coefficient', ascending = False)

# assign bar heights, positions, and labels
bar_heights = 
bar_positions = np.arange(0.5, 13.5, 1)
labels = 

# make plot
plt.figure(figsize = (8,6))
# bars are centered on their positions, so the tick labels go at the same positions
plt.bar(bar_positions, bar_heights)
plt.xticks(bar_positions, labels, rotation = 70)
plt.ylabel('Coefficient')
plt.show()

<!-- HINT -->
Double click for hint.
<!-- Remember that you can refer to a column in a dataframe by slicing. -->

<!-- CONCLUSION for 3.0 -->
In this step, you evaluated the performance of the model visually by comparing the predictions it made for home value to the recorded values from the dataset, and plotting the coefficients for each of the predictor variables. From these plots, it appears that the model fits reasonably well, and the most important factors in predicting home value are the nitric oxide concentration and the number of rooms.

<!-- INTRODUCTION for 4 -->
# 4.0 Using Evaluation Metrics

In this lesson, you'll learn how to evaluate a linear regression using the R-squared value, mean absolute error, and root mean squared error.

<!-- TASK 4.1 -->
## 4.1 Introducing the R-squared value

For a more exact interpretation of how well the model fits the data, you can look at the R-squared value (also known as the coefficient of determination). Although there were 13 predictor variables in this multiple regression, imagine training the model on a single variable. You could plot that single variable on the x-axis and the target variable on the y-axis. If you drew a line through the resulting scatter of points to minimize the distance between each point and the line, you'd have drawn the line of best fit. (This is basically what a linear regression does when it makes predictions for the target variable.) You can quantify how well the line fits the points using the R-squared value. When there is a perfect fit (not a single point off the line), the R-squared value is equal to one.

Look at the example plot with a simulated target and predictor variable. Each data point falls exactly on the line, so the R-squared for the imaginary model that generated this data would be 1.

In [None]:
## STARTER CODE for 4.1
# Plotting the predicted values for a model where R-squared = 1
plt.scatter([0, 0.5, 1, 1.5, 2, 2.5, 3], [0, 0.5, 1, 1.5, 2, 2.5,3], label = 'actual values', color = 'red')
plt.plot([0,3], [0,3], label = 'predicted values')
plt.xlabel('Predictor Variable')
plt.ylabel('Target Variable')
plt.legend(loc = 'best')
plt.xlim(0,3)
plt.ylim(0,3)
plt.show()

<!-- HINT for 4.1-->
Double click for hint.
<!-- Just run the cell. -->

<!-- TASK 4.2 -->
## 4.2 Calculating R-squared

The example you just saw was for a single linear regression where there was only one predictor and one target variable. Think about doing this for our 13-factor multiple linear regression... You'd have to plot each variable on its own axis in 13-dimensional space. This doesn't visualize well, but the bottom line is the same - the closer R-squared is to one, the better the model's fit.

The `.score()` method takes the arrays of predictor and target variables as arguments and returns the R-squared value of the model.
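For intuition about what `.score()` returns, R-squared can also be calculated by hand as one minus the ratio of the squared prediction errors to the squared deviations of the target from its mean:

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$

The sketch below applies this formula to simulated data (the names `X_demo`, `y_demo`, and `demo_model` are assumptions for illustration) and compares the result with `.score()`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Simulated data with two predictors
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(80, 2))
y_demo = 2.0 * X_demo[:, 0] + X_demo[:, 1] + rng.normal(0, 0.5, 80)

demo_model = LinearRegression().fit(X_demo, y_demo)
demo_predictions = demo_model.predict(X_demo)

# Sum of squared prediction errors
ss_res = np.sum((y_demo - demo_predictions) ** 2)
# Sum of squared deviations of the target from its mean
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
r2_sklearn = demo_model.score(X_demo, y_demo)
print(r2_manual, r2_sklearn)
```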

Assign the R-squared value of the model to the variable `Rsq` and print it.

In [None]:
## STARTER CODE for 4.2
Rsq = 

<!-- HINT for 4.2-->
Double click for hint.
<!-- The function operates on the model you've built, so call `model.score()` and input X and y. -->

<!-- TASK 4.3 -->
## 4.3 Mean Absolute Error

Another value that you can use to assess your model's performance is the mean absolute error, or MAE. This metric is an average of the distance between each predicted value and the corresponding actual value. It can be useful when you want to know how far off your model was on average, in meaningful units.

Since the target variable, median home value, was measured in thousands of dollars, the mean absolute error will also be in thousands of dollars.

You can calculate MAE using the `mean_absolute_error()` function. It takes the actual target variables and the predicted target variables as arguments.
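As a small worked example of what the function calculates, the sketch below uses four made-up actual and predicted values (not taken from the Boston data) and computes MAE both by hand and with `mean_absolute_error()`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up values for illustration, in thousands of dollars
actual = np.array([20.0, 25.0, 30.0, 35.0])
predicted = np.array([22.0, 24.0, 27.0, 35.0])

# Absolute errors are 2, 1, 3, and 0, so their average is 1.5
mae_manual = np.mean(np.abs(actual - predicted))
mae_sklearn = mean_absolute_error(actual, predicted)
print(mae_manual, mae_sklearn)
```

An MAE of 1.5 here would mean the predictions were off by \$1,500 on average.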

Calculate the MAE of y and the model's predictions.

In [None]:
## STARTER CODE for 4.3
from sklearn.metrics import mean_absolute_error

mae = 

<!-- HINT for 4.3-->
Double click for hint.

<!-- Input `y` and `predictions` as arguments to the `mean_absolute_error()` function. -->

<!-- TASK 4.4 -->
## 4.4 Root Mean Squared Error

You just saw that the mean absolute error for the model was 3.27. Since the units for home value were thousands of dollars, this means that, on average, the model's predictions for median home value were off by \$3,270.

When deciding whether to evaluate a model using MAE or another metric, root mean squared error (RMSE), you must consider the effect of outliers. An outlier is a value very far from the mean and median of the dataset. Given that our model was trained on median home values ranging from about \$5,000 to \$50,000, we can be reasonably confident in its predictions within this range. However, what if we asked the model to predict the value for a town whose median home value was \$100,000? This is far outside the range the model trained on, so its guess is likely to be far off.

In this situation, when a model's guesses are badly wrong for a few outliers, the MAE will change, but not a lot. It doesn't penalize the model too much for being very wrong, as long as it's only a few times.

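If you want your metric to be sensitive to large errors, RMSE is the more appropriate choice. Because the errors are squared before they are averaged, a few very wrong predictions raise RMSE much more than they raise MAE. Like MAE, RMSE is expressed in the same units as the target (here, thousands of dollars), but it can no longer be read as the typical size of an error.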

Scikit-Learn has a `mean_squared_error()` function, and you can calculate RMSE by simply taking the square root of the mean squared error.
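To see how differently the two metrics react to a single bad prediction, here is a minimal sketch with made-up values (the arrays and the helper `mae_and_rmse` are assumptions for illustration). The two sets of predictions are identical except for the last value.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up values for illustration, in thousands of dollars
actual = np.array([20.0, 25.0, 30.0, 35.0, 40.0])
close_preds = np.array([22.0, 23.0, 32.0, 33.0, 42.0])  # every error is 2
one_miss = np.array([22.0, 23.0, 32.0, 33.0, 70.0])     # last error is 30

def mae_and_rmse(y_true, y_pred):
    """Return (MAE, RMSE) for a pair of arrays."""
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    return mae, rmse

mae_close, rmse_close = mae_and_rmse(actual, close_preds)
mae_miss, rmse_miss = mae_and_rmse(actual, one_miss)
print(mae_close, rmse_close)
print(mae_miss, rmse_miss)
```

With errors of 2 everywhere, both metrics equal 2. After the single large miss, MAE rises to 7.6 while RMSE rises to about 13.5, because squaring gives the large error much more weight.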

Calculate the RMSE for y and the model's predictions.

In [None]:
## STARTER CODE for 4.4
from sklearn.metrics import mean_squared_error
from math import sqrt

rmse = 

<!-- HINT for 4.4-->
Double click for hint.

<!-- The `sqrt()` function returns the square root of a value. Apply this to the output of the `mean_squared_error()` function. -->

<!-- CONCLUSION for 4.0 -->
In this lesson, you've accomplished a lot: you created arrays of target and predictor variables, fit a linear regression, learned which variables had the most predictive power, and then evaluated model fit in 4 different ways. Now that you know how to do this for linear regressions using Scikit-Learn, you can easily transfer these same steps to a logistic regression or some other type of model. Just remember the basic steps: **fit**, **predict**, and **evaluate**.