
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Linear Regression Lab 1

**Objectives**:
1. Develop a single-variable linear regression model.
2. Develop a multi-variable linear regression model.

In this lab, we'll do a quick demonstration of single-variable and multi-variable linear regression using Python and Scikit-Learn.

After each demonstration, you'll have the opportunity to complete the exercises yourself.

In [0]:
%run ../../Includes/Classroom-Setup

## Setup

### Load the Data
The `../../Includes/Classroom-Setup` notebook has made an aggregate table of data available to us. 

We can load the data as a Pandas DataFrame using the cell below. The `.toPandas()` method converts the Spark DataFrame to a Pandas DataFrame. We will use the Pandas DataFrame with Scikit-Learn throughout this Module.

In [0]:
ht_agg_spark_df = spark.read.table("ht_agg")
ht_agg_pandas_df = ht_agg_spark_df.toPandas()

### Framing a Business Problem

We have spoken frequently about the entire data science process starting with a good question. 

Over the next few labs, we will use supervised machine learning to answer the following business question:

> Given a user's fitness profile, can we predict the average number of steps they are likely to take each day?

Here, our **inputs** will be fitness profile information and our **output** will be the average number of daily steps. The fitness profile information consists of average daily measurements of BMI, VO2, and resting and active heartrates.

We will perform supervised learning to develop a function to map these inputs to average daily steps.

### Scikit-Learn

#### Overview

Scikit-Learn (sklearn) is one of the most popular libraries for doing machine learning in Python.

Scikit-Learn features:
- Simple and efficient tools for predictive data analysis
- Accessible to everybody, and reusable in various contexts
- Built on NumPy, SciPy, and matplotlib
- Open source, commercially usable - BSD license

#### The Scikit-Learn `estimator` API

The main API for performing machine learning with sklearn is the **estimator** API.

An estimator is any object that learns from data. It may be:

- A **predictor**
   - classification algorithm
   - regression algorithm
   - clustering algorithm
- A **transformer** that extracts/filters useful features from raw data
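
The two kinds of estimator can be sketched side by side. This is a minimal illustration on synthetic data (not the lab's `ht_agg` table); `StandardScaler` is used here only as one convenient example of a transformer, and all variable names are made up for the sketch.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only (assumption: two numeric features).
rng = np.random.default_rng(0)
features = rng.normal(loc=50.0, scale=10.0, size=(100, 2))
target = 3.0 * features[:, 0] - 2.0 * features[:, 1] + rng.normal(size=100)

# A transformer learns from data with .fit() and changes data with .transform().
scaler = StandardScaler()
scaled_features = scaler.fit(features).transform(features)

# A predictor learns from data with .fit() and produces outputs with .predict().
predictor = LinearRegression()
predictions = predictor.fit(scaled_features, target).predict(scaled_features)

print(scaled_features.mean(axis=0).round(6))  # each column is centered near 0
print(predictions.shape)                      # one prediction per row
```

Both objects share the same `.fit()` entry point, which is what makes them estimators.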

#### Fitting a predictor model with sklearn

- All estimator objects expose a `.fit()` method.
- For supervised learning, this looks like `predictor.fit(features, target)`

```
estimator.fit(data)
```
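
As a small sketch of what `.fit()` does for a supervised predictor, the example below fits `LinearRegression` to four hand-made points that lie exactly on the line y = 2x + 1. The data and variable names are illustrative assumptions. In sklearn, `.fit()` returns the estimator itself, and learned values are stored in attributes ending with an underscore.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hand-made data that follows y = 2x + 1 exactly (illustrative only).
features = np.array([[1.0], [2.0], [3.0], [4.0]])  # 2-D: one row per sample
target = np.array([3.0, 5.0, 7.0, 9.0])            # 1-D: one value per sample

predictor = LinearRegression()
fitted = predictor.fit(features, target)

# .fit() returns the same object, so calls can be chained.
print(fitted is predictor)

# Learned parameters: slope close to 2 and intercept close to 1.
print(predictor.coef_, predictor.intercept_)
```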

#### Evaluating a model with sklearn

- Classifier and regressor objects expose a `.score()` method
- For supervised learning, this looks like `predictor.score(features, target)`
- sklearn uses a built-in default metric that depends on whether the predictor is a classification or a regression algorithm
- For classification, `predictor.score(features, target)` returns the mean accuracy
- For regression, `predictor.score(features, target)` returns the coefficient of determination, R²
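
To make the regression metric concrete, the sketch below computes R² by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares, and compares it with `.score()`. The data is synthetic and the coefficients used to generate it are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic single-feature data (illustrative values, not the lab's table).
rng = np.random.default_rng(42)
features = rng.uniform(18.0, 35.0, size=(200, 1))
target = 15000.0 - 250.0 * features[:, 0] + rng.normal(scale=500.0, size=200)

predictor = LinearRegression().fit(features, target)
predictions = predictor.predict(features)

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((target - predictions) ** 2)    # unexplained variation
ss_tot = np.sum((target - target.mean()) ** 2)  # total variation around the mean
r2_by_hand = 1.0 - ss_res / ss_tot

# The two numbers should agree up to floating-point error.
print(r2_by_hand, predictor.score(features, target))
```

A value near 1 means the model explains most of the variation in the target; a value near 0 means it does little better than always predicting the mean.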

## Demonstration

### Single-Variable Linear Regression

Our first set of models will have a single independent variable (or single feature) and a single dependent variable (or single target).

A way to think about the relationship between feature and target is to put them both into a sentence, "for a [feature] of [value], we would predict that this user would have [value] [target]".

In our case, we might assume that the feature `mean_bmi` is predictive of our target `mean_steps`, so our sentence could read:

> "For a mean BMI of 20, we would predict that this user would have 4000 mean steps."


Our intuition and domain knowledge can help us discern predictive features.

### Setting up Linear Regression

First, we'll import our estimator of choice, a predictor called Linear Regression.

In [0]:
from sklearn.linear_model import LinearRegression

Then, we'll instantiate or create an instance of our estimator.

In [0]:
lr = LinearRegression()

### Create Feature Vectors

🧐 sklearn expects our feature(s) as a two-dimensional matrix and our target as a one-dimensional vector. This is why you will see two sets of square brackets around our feature (a matrix) and a single set of square brackets around our target (a vector).

In [0]:
X = ht_agg_pandas_df[['mean_bmi']]
y = ht_agg_pandas_df['mean_steps']
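
The difference between the two bracket styles can be checked on a tiny DataFrame. This sketch uses a hypothetical three-row stand-in, `demo_df`, with the same column names as the lab data; the values are made up.

```python
import pandas as pd

# Hypothetical stand-in for ht_agg_pandas_df (made-up values).
demo_df = pd.DataFrame({
    "mean_bmi": [19.5, 22.1, 27.3],
    "mean_steps": [11000.0, 9500.0, 7200.0],
})

X_demo = demo_df[["mean_bmi"]]  # list of column names -> 2-D DataFrame
y_demo = demo_df["mean_steps"]  # single column name   -> 1-D Series

print(type(X_demo).__name__, X_demo.shape)  # DataFrame (3, 1)
print(type(y_demo).__name__, y_demo.shape)  # Series (3,)
```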

### Fit the Model

Next, fit our model using the same `.fit(features, target)` pattern we learned earlier.

The model will learn the relationship between the features and the target; that is, we will "train" or "fit" the model.

In [0]:
lr.fit(X, y)
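
After fitting, a single-variable model is just a line: prediction = intercept + slope × feature. The sketch below shows how the learned slope and intercept could be inspected and used to fill in the sentence "for a mean BMI of 20, we would predict ...". It runs on synthetic data with arbitrary generating coefficients, so the numbers say nothing about the real `ht_agg` table; `demo_df`, `lr_demo`, and `new_user` are names invented for this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (assumption: steps fall as BMI rises).
rng = np.random.default_rng(1)
bmi = rng.uniform(18.0, 35.0, size=300)
steps = 15000.0 - 250.0 * bmi + rng.normal(scale=800.0, size=300)
demo_df = pd.DataFrame({"mean_bmi": bmi, "mean_steps": steps})

X_demo = demo_df[["mean_bmi"]]
y_demo = demo_df["mean_steps"]
lr_demo = LinearRegression().fit(X_demo, y_demo)

# Learned parameters of the fitted line.
slope = lr_demo.coef_[0]
intercept = lr_demo.intercept_

# Predict for a new user with a mean BMI of 20 (same column name as in training).
new_user = pd.DataFrame({"mean_bmi": [20.0]})
predicted_steps = lr_demo.predict(new_user)[0]

print(f"steps = {intercept:.1f} + ({slope:.1f}) * bmi")
print(f"predicted mean steps at BMI 20: {predicted_steps:.0f}")
```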

### Evaluate the model

Finally, use the `.score()` method to evaluate the single-variable model.

In [0]:
lr.score(X, y)

## Your Turn

### Exercise 1: Single-Variable Linear Regression

Fit a single-variable linear model for each of the features.
1. prepare a feature matrix for each of these features:
 - `mean_bmi`
 - `mean_active_heartrate`
 - `mean_resting_heartrate`
 - `mean_vo2`
1. fit a single-variable linear model for each of these features
1. evaluate using `.score()` each of these models and print the result

In [0]:
# ANSWER
X_bmi = ht_agg_pandas_df[['mean_bmi']]
X_active_heartrate = ht_agg_pandas_df[['mean_active_heartrate']]
X_resting_heartrate = ht_agg_pandas_df[['mean_resting_heartrate']]
X_vo2 = ht_agg_pandas_df[['mean_vo2']]

lr_bmi = LinearRegression()
lr_active_heartrate = LinearRegression()
lr_resting_heartrate = LinearRegression()
lr_vo2 = LinearRegression()

lr_bmi.fit(X_bmi, y)
lr_active_heartrate.fit(X_active_heartrate, y)
lr_resting_heartrate.fit(X_resting_heartrate, y)
lr_vo2.fit(X_vo2, y)

print("bmi:               ", lr_bmi.score(X_bmi, y))
print("active_heartrate:  ", lr_active_heartrate.score(X_active_heartrate, y))
print("resting_heartrate: ", lr_resting_heartrate.score(X_resting_heartrate, y))
print("vo2:               ", lr_vo2.score(X_vo2, y))
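
The same exercise could also be written as a loop over column names, which avoids repeating each step four times. This is only a sketch of that alternative: it runs on a synthetic `demo_df` with made-up values, and the `scores` dictionary is a name chosen for this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in with the lab's column names (values are made up).
rng = np.random.default_rng(7)
n = 200
demo_df = pd.DataFrame({
    "mean_bmi": rng.uniform(18.0, 35.0, size=n),
    "mean_active_heartrate": rng.uniform(100.0, 160.0, size=n),
    "mean_resting_heartrate": rng.uniform(50.0, 90.0, size=n),
    "mean_vo2": rng.uniform(20.0, 60.0, size=n),
})
demo_df["mean_steps"] = (
    12000.0
    - 200.0 * demo_df["mean_bmi"]
    + 60.0 * demo_df["mean_vo2"]
    + rng.normal(scale=700.0, size=n)
)

y_demo = demo_df["mean_steps"]
feature_names = [
    "mean_bmi",
    "mean_active_heartrate",
    "mean_resting_heartrate",
    "mean_vo2",
]

# Fit one single-variable model per feature and record its training R^2.
scores = {}
for name in feature_names:
    X_single = demo_df[[name]]  # double brackets keep the 2-D shape
    model = LinearRegression().fit(X_single, y_demo)
    scores[name] = model.score(X_single, y_demo)

for name, r2 in scores.items():
    print(f"{name:<24} {r2:.3f}")
```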

## Demonstration
### Multiple-Variable Linear Regression

Our next set of models will use more than one feature but still have
a single target.

We can apply similar logic in forming a sentence to describe the relationship "for a [feature1] of [value1] and a [feature2] of [value2], we would predict that this user would have [value] [target]".

e.g.
> "For a mean BMI of 20 and a mean active heartrate of 125, we would predict that this user would have 9500 mean steps."

Let's try this model out.

In [0]:
ht_agg_pandas_df.mean_active_heartrate.sample()

#### Display results from previous models

Before we train this new model, let's display the results from the previous models
for comparison.

In [0]:
print("bmi:               ", lr_bmi.score(X_bmi, y))
print("active_heartrate:  ", lr_active_heartrate.score(X_active_heartrate, y))
print("resting_heartrate: ", lr_resting_heartrate.score(X_resting_heartrate, y))
print("vo2:               ", lr_vo2.score(X_vo2, y))

#### Train new multiple-variable linear regression

Train the new model using both `mean_bmi` and `mean_active_heartrate` as predictors.

In [0]:
X_bmi_act_hr = ht_agg_pandas_df[['mean_bmi', 'mean_active_heartrate']]
lr_bmi_act_hr = LinearRegression()
lr_bmi_act_hr.fit(X_bmi_act_hr, y)
print("bmi_act_hr: ", lr_bmi_act_hr.score(X_bmi_act_hr, y))
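
For ordinary least squares with an intercept, the R² measured on the training data cannot go down when a feature is added, so a higher training score alone does not show that the extra feature is useful on new data. The sketch below illustrates this on synthetic data and also fills in the two-feature sentence for a mean BMI of 20 and a mean active heartrate of 125. The data, generating coefficients, and the helper `training_r2` are assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in with two of the lab's column names (made-up values).
rng = np.random.default_rng(3)
n = 250
demo_df = pd.DataFrame({
    "mean_bmi": rng.uniform(18.0, 35.0, size=n),
    "mean_active_heartrate": rng.uniform(100.0, 160.0, size=n),
})
demo_df["mean_steps"] = (
    20000.0
    - 220.0 * demo_df["mean_bmi"]
    - 40.0 * demo_df["mean_active_heartrate"]
    + rng.normal(scale=600.0, size=n)
)
y_demo = demo_df["mean_steps"]


def training_r2(columns):
    """Fit a linear model on the given columns and return its training R^2."""
    X_subset = demo_df[columns]
    return LinearRegression().fit(X_subset, y_demo).score(X_subset, y_demo)


r2_bmi = training_r2(["mean_bmi"])
r2_act_hr = training_r2(["mean_active_heartrate"])
r2_both = training_r2(["mean_bmi", "mean_active_heartrate"])
print(f"bmi: {r2_bmi:.3f}  act_hr: {r2_act_hr:.3f}  both: {r2_both:.3f}")

# Predict for one new user; column order matches the training columns.
both_columns = ["mean_bmi", "mean_active_heartrate"]
lr_both = LinearRegression().fit(demo_df[both_columns], y_demo)
new_user = pd.DataFrame({"mean_bmi": [20.0], "mean_active_heartrate": [125.0]})
predicted_steps = lr_both.predict(new_user)[0]
print(f"predicted mean steps: {predicted_steps:.0f}")
```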

## Your Turn
### Exercise 2: Multi-Variable Linear Regression
😎 Note that the two-feature model from the demonstration performs better than any of the single-feature models.

Fit four multiple-variable linear models.
1. prepare a feature matrix for each model
1. fit a linear model for each feature matrix
1. evaluate each model using `.score()` and print the result

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** Did you try any models with more than two features? Multiple-variable
linear models can use any or all of the features.

In [0]:
# ANSWER
X_1 = ht_agg_pandas_df[['mean_active_heartrate', 'mean_resting_heartrate']]
X_2 = ht_agg_pandas_df[['mean_active_heartrate', 'mean_vo2']]
X_3 = ht_agg_pandas_df[['mean_active_heartrate', 'mean_bmi', 'mean_vo2']]
X_4 = ht_agg_pandas_df[['mean_active_heartrate', 'mean_bmi', 'mean_vo2', 'mean_resting_heartrate']]

lr_1 = LinearRegression()
lr_2 = LinearRegression()
lr_3 = LinearRegression()
lr_4 = LinearRegression()

lr_1.fit(X_1, y)
lr_2.fit(X_2, y)
lr_3.fit(X_3, y)
lr_4.fit(X_4, y)

print("model 1: ", lr_1.score(X_1, y))
print("model 2: ", lr_2.score(X_2, y))
print("model 3: ", lr_3.score(X_3, y))
print("model 4: ", lr_4.score(X_4, y))


&copy; 2021 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>