![Title](Images/cisco.png)

## Lab - Evaluating Fit Errors in Linear Regression

### Objectives
In this lab, you will become familiar with the concepts of evaluating fit errors in linear regression.
<li>**Part 1: Import the Libraries and Data**</li>
<li>**Part 2: Calculate the Errors**</li>

### Scenario / Background
In statistics, linear regression is a way to model a relationship between dependent variable $y$ and independent variable $x$. The goal of regression is to find a model that describes the data as accurately as possible.

In this lab, you will use the sales data and result of the linear regression from a previous lab to evaluate the accuracy of the model.
### Required Resources
* 1 PC with Internet access
* Python libraries: `pandas`, `numpy`, and `sklearn`
* Datafiles: stores-dist.csv

### Part 1: Import the Libraries and Data

In this part, you will import the libraries and the data from the file `stores-dist.csv`.

#### Step 1: Import the libraries.

In this step, you will import the following libraries:

* `numpy` as np
* `pandas` as pd

In [2]:
# Code Cell 1

# This lab produces some minor warnings that can be ignored.
# They appear because some libraries are updated more often than others,
# and the system is letting the user know that some functions will be deprecated soon.
# The following code prevents the warnings from being displayed; comment it out
# to see the warnings.
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

#### Step 2: Import the data.

In this step, you will import the data from the file `stores-dist.csv`, change the column headings, and verify that the file is imported correctly.

The column headings `annual net sales` and `number of stores in district` are renamed to make them easier to work with during data processing.

* `annual net sales` to sales
* `number of stores in district` to stores

In [6]:
# Code Cell 2

# Import the file stores-dist.txt
salesDist = pd.read_csv('./Data/stores-dist.txt')

# Change the column headings
salesDist.columns = ['district','sales','stores']

# Verify the imported data
salesDist.head()

Unnamed: 0,district,sales,stores
0,1,231.0,12
1,2,156.0,13
2,3,10.0,16
3,4,519.0,2
4,5,437.0,6


The district column is not necessary for the evaluation of the linear regression fit; therefore, the column can be dropped.

In [7]:
# Code Cell 3
# Drop the district column.
sales = salesDist.drop('district',axis=1)

# Verify that the district column has been dropped.
sales.head()

Unnamed: 0,sales,stores
0,231.0,12
1,156.0,13
2,10.0,16
3,519.0,2
4,437.0,6


### Part 2: Calculate the Errors

In this part, you will use `numpy` to generate a regression line for the analyzed data. You will also calculate the centroid for this dataset. The centroid is the point whose coordinates are the mean of the $x$ values and the mean of the $y$ values. A simple linear regression line fitted by least squares always passes through the centroid.

You will also use `sklearn.metrics` to evaluate the linear regression model. You will calculate the $R^2$ score, the mean squared error (MSE), and the mean absolute error (MAE).
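
The centroid property described above can be checked numerically. The following is a minimal sketch, not part of the original lab: it assumes only the five rows displayed by `sales.head()` rather than the full file, so the fitted coefficients differ from the ones the lab reports. The variable names are chosen for illustration.

```python
import numpy as np

# First five rows of the dataset, as displayed by sales.head() in this lab.
# Using only these rows is an assumption made to keep the sketch self-contained.
stores = np.array([12, 13, 16, 2, 6], dtype=float)
sales = np.array([231.0, 156.0, 10.0, 519.0, 437.0])

# Centroid: the point (mean of x, mean of y)
x_mean = stores.mean()
y_mean = sales.mean()

# First-order least-squares fit; polyfit returns [slope, intercept]
slope, intercept = np.polyfit(stores, sales, 1)

# Evaluate the fitted line at the mean of x
y_at_x_mean = slope * x_mean + intercept

print("centroid:", (x_mean, y_mean))
print("fitted line at x_mean:", y_at_x_mean)
```

The two printed y values agree up to floating-point rounding, which is what "the line passes through the centroid" means in practice.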

#### Step 1: Assign the x and y variables.
Assign the sales column from the dataframe to the dependent variable $y$, and the stores column to the independent variable $x$.

In [8]:
# Code Cell 4
#dependent variable for y axis
y = sales.sales
#independent variable for x axis
x = sales.stores

#### Step 2: Calculate the y values in the model.
In a previous lab, you fitted a first-order polynomial model with `np.polyfit`, which returns the vector of coefficients that minimizes the squared error. Wrapping that vector in `np.poly1d` gives a polynomial object `p`, and calling `p(x)` computes the model's value for each value of `x`.

To recall the slope and y-intercept of the line, use the variable `p`. Printing `p` displays the coefficients in descending order of power. For a first-order polynomial, the first coefficient displayed is the slope (`m`) and the second coefficient is the y-intercept (`b`). Indexing `p` follows the power of the term, so `p[1]` is the slope and `p[0]` is the y-intercept.

In [9]:
# Code Cell 5
# compute the y values from the polynomial model for each x value
order = 1
p = np.poly1d(np.polyfit(x, y, order))

print('The array p(x) stores the calculated y value from the polynomial model for each x value,\n\n{}.'.format(p(x)))
print('\nThe vector of coefficients p describes this regression model:\n{}'.format(p))
print('\nThe zeroth order term (y-intercept or b) is stored in p[0]: {}.'.format(p[0]))
print('\nThe first order term (slope or m) is stored in p[1]: {}.'.format(p[1]))

The array p(x) stores the calculated y value from the polynomial model for each x value,

[169.93468442 134.14759895  26.78634257 527.80553905 384.65719719
 420.44428266 205.72176988 134.14759895  26.78634257 277.29594081
 527.80553905 313.08302627 456.23136812  62.57342803 169.93468442
 205.72176988 420.44428266  98.36051349 313.08302627 527.80553905
 563.59262451  62.57342803 134.14759895 348.87011173 384.65719719
 563.59262451 277.29594081].

The vector of coefficients p describes this regression model:
 
-35.79 x + 599.4

The zeroth order term (y-intercept or b) is stored in p[0]: 599.3797099726614.

The first order term (slope or m) is stored in p[1]: -35.787085462974005.
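
The difference between the display order and the index order of a `poly1d` object is easy to mix up. The sketch below uses a hypothetical line, $y = -2x + 5$, chosen only for illustration, to show both conventions side by side.

```python
import numpy as np

# Hypothetical line y = -2x + 5; coefficients are passed in descending order of power
p_demo = np.poly1d([-2.0, 5.0])

# .coeffs keeps the descending order used at construction: [slope, intercept]
coeffs = p_demo.coeffs

# Indexing follows the power of the term: p_demo[0] is the x^0 term, p_demo[1] the x^1 term
intercept = p_demo[0]
slope = p_demo[1]

# Calling the object evaluates the polynomial
value_at_3 = p_demo(3.0)

print("coeffs:", coeffs)
print("intercept:", intercept, "slope:", slope)
print("p_demo(3.0):", value_at_3)
```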


#### Step 3: Use different measures to evaluate models.
In this step, you will use `sklearn` to evaluate the model. `sklearn` offers a range of measures. You will calculate the $R^2$ score, mean squared error (MSE), and mean absolute error (MAE) using the functions in `sklearn`.

To calculate the value for each measure, provide `y` as the first argument; these are the observed values from the imported file, `stores-dist.csv`. As the second argument, use the values from `p(x)`, which were calculated from your first-order polynomial model in the form of:

$$y = mx + b$$

where m is `p[1]` and b is `p[0]` in the `poly1d` result.

The $R^2$ (coefficient of determination) regression score measures the goodness of fit of the model. The best possible score for $R^2$ is 1.0. The score indicates the proportion of the variance in the observed outcome that the model explains.

In [10]:
# Code Cell 6
from sklearn.metrics import r2_score
r2 = r2_score(y, p(x))
r2

0.83217523508888
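
To make the score less of a black box, the following sketch computes $R^2$ by hand from its usual definition, $R^2 = 1 - SS_{res} / SS_{tot}$, using only `numpy`. The helper name `r_squared` is hypothetical, and the data are again only the five rows shown by `sales.head()`, so the value will not match the score above.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# First five rows of the dataset (illustrative subset, not the full file)
stores = np.array([12, 13, 16, 2, 6], dtype=float)
sales = np.array([231.0, 156.0, 10.0, 519.0, 437.0])

p_sample = np.poly1d(np.polyfit(stores, sales, 1))

# Score of the fitted line on the sample
r2_sample = r_squared(sales, p_sample(stores))

# A model that always predicts the mean of y explains none of the variance
r2_mean_only = r_squared(sales, np.full_like(sales, sales.mean()))

print("R^2 of fitted line:", r2_sample)
print("R^2 of mean-only model:", r2_mean_only)
```

The mean-only baseline scores 0, which gives the scale some meaning: a score near 1 means the line does much better than simply predicting the average.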

The mean squared error (MSE) is a measure of how well the model can be used to make a prediction. It is the average of the squared differences between the predicted and observed values, so it is always non-negative, and values closer to zero are better.

In [11]:
# Code Cell 7
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y, p(x))
mse

5961.386465941158
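
Because the MSE is expressed in squared units, it can be hard to read directly. A minimal sketch of the calculation follows, with a hypothetical helper name and made-up observed and predicted values. It also takes the square root of the MSE reported above, often called the root mean squared error (RMSE), which is roughly 77.2 in the units of the sales column.

```python
import numpy as np

def mean_squared_error_manual(y_true, y_pred):
    """Average of the squared differences between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Hypothetical observed and predicted values, for illustration only
observed = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]

# Errors are 1, 0, and -2, so the MSE is (1 + 0 + 4) / 3
mse_example = mean_squared_error_manual(observed, predicted)
rmse_example = np.sqrt(mse_example)

# Square root of the MSE value reported in this lab
rmse_lab = np.sqrt(5961.386465941158)

print("example MSE:", mse_example)
print("example RMSE:", rmse_example)
print("lab RMSE:", rmse_lab)
```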

The mean absolute error (MAE) is a measure of how close predictions are to the eventual outcomes. The MAE is an average of the absolute errors between the prediction and the true value.

In [12]:
# Code Cell 8
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y, p(x))
mae

61.2232611786873
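
The MAE and MSE can rank the same predictions differently, because squaring penalizes large errors more heavily. The sketch below uses hypothetical values to show two sets of predictions with the same MAE but different MSE; the helper name is an assumption for illustration.

```python
import numpy as np

def mean_absolute_error_manual(y_true, y_pred):
    """Average of the absolute differences between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Hypothetical observed values, for illustration only
observed = np.array([10.0, 10.0, 10.0, 10.0])

# Every prediction is off by 2
pred_even = observed + np.array([2.0, 2.0, 2.0, 2.0])
# Three predictions are exact and one is off by 8
pred_outlier = observed + np.array([0.0, 0.0, 0.0, 8.0])

mae_even = mean_absolute_error_manual(observed, pred_even)
mae_outlier = mean_absolute_error_manual(observed, pred_outlier)

# MSE for the same two sets of predictions
mse_even = float(np.mean((observed - pred_even) ** 2))
mse_outlier = float(np.mean((observed - pred_outlier) ** 2))

print("MAE:", mae_even, mae_outlier)
print("MSE:", mse_even, mse_outlier)
```

Both sets have an MAE of 2, but the set with one large miss has four times the MSE, which is why the two measures are usually reported together.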

All these measures allow you to determine how well your model can make predictions. In this lab, you evaluated only one model, simple linear regression.



---

ANSWERS TO QUESTIONS

1. What value did you obtain for the coefficient of determination?

0.83217523508888

2. What value of the mean squared error was calculated for your model?

5961.386465941158

3. What value of the mean absolute error was obtained?

61.2232611786873

4. How do you find the covariance of two variables using NumPy? (See 14-15.)

np.cov(x, y)
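
Note that `np.cov` returns a covariance matrix rather than a single number. The following sketch, which again assumes only the five rows shown by `sales.head()`, extracts the covariance from the off-diagonal entry and relates it to the regression slope, which for simple linear regression equals the covariance of $x$ and $y$ divided by the variance of $x$.

```python
import numpy as np

# First five rows of the dataset (illustrative subset, not the full file)
stores = np.array([12, 13, 16, 2, 6], dtype=float)
sales = np.array([231.0, 156.0, 10.0, 519.0, 437.0])

# 2x2 matrix: [[var(x), cov(x, y)], [cov(y, x), var(y)]], sample statistics by default
cov_matrix = np.cov(stores, sales)

cov_xy = cov_matrix[0, 1]   # covariance of stores and sales
var_x = cov_matrix[0, 0]    # variance of stores

# Slope of the least-squares line computed two ways
slope_from_cov = cov_xy / var_x
slope_from_fit = np.polyfit(stores, sales, 1)[0]

print("covariance matrix:\n", cov_matrix)
print("slope from covariance:", slope_from_cov)
print("slope from polyfit:", slope_from_fit)
```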


<font size='0.5'>&copy; 2017 Cisco and/or its affiliates. All rights reserved. This document is Cisco Public.</font>