
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Principal Components Analysis Lab

**Objective**: *Apply PCA to a dataset to learn more about how the features are related to one another.*

In this lab, you will apply what you've learned in this lesson. When you are finished, use your answers to the exercises to complete the corresponding quiz in Coursera.

In [0]:
%run "../../Includes/Classroom-Setup"

## Exercise 1

In this exercise, you will create a user-level table with the following additional columns:

1. `steps_change` - the difference between the maximum steps and the minimum steps
1. `workout_minutes_change` - the difference between the maximum workout minutes and the minimum workout minutes
1. `var_workout_minutes` - the population variance of the workout minutes
1. `var_steps` - the population variance of the steps
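
The two variance columns rely on SQL's `var_pop`, which divides by the number of observations rather than by one less. As a minimal sketch of that distinction, using made-up step counts (the values are hypothetical and do not come from `adsda.ht_daily_metrics`), NumPy's `ddof` argument switches between the two definitions:

```python
import numpy as np

# Hypothetical daily step counts for a single device (illustrative values only)
steps = np.array([4000.0, 6000.0, 8000.0, 10000.0])

# Population variance divides by n, matching SQL's var_pop
population_variance = np.var(steps, ddof=0)

# Sample variance divides by n - 1, matching SQL's var_samp
sample_variance = np.var(steps, ddof=1)

# The "change" columns are simply the maximum minus the minimum
steps_change = steps.max() - steps.min()

print(population_variance, sample_variance, steps_change)
```

The population variance is always the smaller of the two, and the gap shrinks as the number of daily records per device grows.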

Fill in the blanks in the cell below to create the `adsda.ht_user_metrics_pca_lab` table.

In [0]:
%sql
-- ANSWER
CREATE OR REPLACE TABLE adsda.ht_user_metrics_pca_lab
USING DELTA LOCATION "/adsda/ht-user-metrics-pca-lab" AS (
  SELECT min(resting_heartrate) AS min_resting_heartrate,
         avg(resting_heartrate) AS avg_resting_heartrate,
         max(resting_heartrate) AS max_resting_heartrate,
         min(active_heartrate) AS min_active_heartrate,
         avg(active_heartrate) AS avg_active_heartrate,
         max(active_heartrate) AS max_active_heartrate,
         min(bmi) AS min_bmi,
         avg(bmi) AS avg_bmi,
         max(bmi) AS max_bmi,
         min(vo2) AS min_vo2,
         avg(vo2) AS avg_vo2,
         max(vo2) AS max_vo2,
         min(workout_minutes) AS min_workout_minutes,
         avg(workout_minutes) AS avg_workout_minutes,
         max(workout_minutes) AS max_workout_minutes,
         min(steps) AS min_steps,
         avg(steps) AS avg_steps,
         max(steps) AS max_steps,
         avg(steps) * avg(active_heartrate) AS as_x_aah,
         max(bmi) - min(bmi) AS bmi_change,
         max(steps) - min(steps) AS steps_change,
         max(workout_minutes) - min(workout_minutes) AS workout_minutes_change,
         var_pop(workout_minutes) AS var_workout_minutes,
         var_pop(steps) AS var_steps
  FROM adsda.ht_daily_metrics
  GROUP BY device_id
)

**Coursera Quiz:** How many rows and columns are in `adsda.ht_user_metrics_pca_lab`?

Fill in the blanks to get the answer to the question.

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** Refer back to the previous lesson's lab for help.

In [0]:
# ANSWER
df = spark.table("adsda.ht_user_metrics_pca_lab").toPandas()
df.shape

## Exercise 2

In this exercise, you will perform PCA.

Fill in the blanks below to fit a PCA model on the standardized features.

In [0]:
# ANSWER
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

pca = PCA(random_state=42)
pca.fit(scale(df))
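
The cell above standardizes the features with `scale` before fitting, because PCA is sensitive to the units of each column. A minimal sketch on synthetic data (the array `X_demo` and its column scales are invented for illustration) shows what `scale` does:

```python
import numpy as np
from sklearn.preprocessing import scale

# Synthetic features on very different scales (hypothetical stand-ins for
# columns such as steps and BMI)
rng = np.random.default_rng(42)
X_demo = np.column_stack([
    rng.normal(8000, 2500, size=200),  # large-valued column
    rng.normal(25, 3, size=200),       # small-valued column
])

X_demo_scaled = scale(X_demo)

# After scaling, every column has mean ~0 and population standard deviation ~1
print(X_demo_scaled.mean(axis=0).round(6))
print(X_demo_scaled.std(axis=0).round(6))
```

Without this step, the large-valued column would dominate the first component simply because of its units.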

**Coursera Quiz:** How many components were computed?

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** Refer back to the Applied PCA demo.

In [0]:
# ANSWER
pca.n_components_
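
When `PCA` is created without `n_components`, as in this lab, scikit-learn keeps `min(n_samples, n_features)` components. A small sketch on synthetic data (the 50 x 4 shape and the `_demo` names are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

# Synthetic data: 50 rows and 4 features (arbitrary, for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))

pca_demo = PCA(random_state=42)
pca_demo.fit(scale(X_demo))

# With n_components left unset, one component is computed per feature here,
# because there are more rows than features
print(pca_demo.n_components_)
```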

## Exercise 3

In this exercise, you will visualize and identify the variance explained by the first component.

Fill in the blanks below to complete these tasks.

In [0]:
# ANSWER
import matplotlib.pyplot as plt
import numpy as np

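# Bar chart of the share of variance explained by each of the 24 components
plt.bar(range(1, 25), pca.explained_variance_ratio_)
plt.xlabel('Component')
plt.xticks(range(1, 25))
plt.ylabel('Proportion of variance explained')
plt.ylim(0, 1)
plt.yticks(np.linspace(0, 1, 11))  # ticks at 0.0, 0.1, ..., 1.0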
plt.show()

**Coursera Quiz:** How much of the total variation in the feature set is explained by the first component?

In [0]:
# ANSWER
print(pca.explained_variance_ratio_[0])
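
For standardized features, the explained variance ratios are the eigenvalues of the feature correlation matrix divided by their sum. The following sketch checks this on synthetic data (the data and the `_demo` names are assumptions made for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

# Synthetic, correlated features: one shared signal plus varying noise
rng = np.random.default_rng(1)
base = rng.normal(size=(300, 1))
X_demo = np.hstack(
    [base + rng.normal(scale=s, size=(300, 1)) for s in (0.2, 0.5, 1.0, 2.0)]
)

pca_demo = PCA(random_state=42).fit(scale(X_demo))

# Eigenvalues of the correlation matrix, sorted from largest to smallest
eigenvalues = np.sort(np.linalg.eigvalsh(np.corrcoef(X_demo, rowvar=False)))[::-1]
ratios_from_eigenvalues = eigenvalues / eigenvalues.sum()

print(pca_demo.explained_variance_ratio_)
print(ratios_from_eigenvalues)

# The ratios sum to 1 when every component is kept
print(pca_demo.explained_variance_ratio_.sum())
```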

## Exercise 4

In this exercise, you will determine how many components it takes to account for 90 percent of the variation in the feature set.

Fill in the blanks below to visualize the cumulative sum of variance explained.

In [0]:
# ANSWER
# Line chart of the running total of variance explained
plt.plot(range(1, 25), np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Component')
plt.xticks(range(1, 25))
plt.ylabel('Cumulative proportion of variance explained')
plt.ylim(0, 1)
plt.yticks(np.linspace(0, 1, 11))  # ticks at 0.0, 0.1, ..., 1.0
plt.show()

The charts above are helpful, but it can be hard to read exact values from them. Let's determine this programmatically.

**Coursera Quiz**: How many components does it take to explain 90 percent of the variation in the original feature set?

In [0]:
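# Walk through the cumulative variance and stop at the first count that reaches 90 percent
for n_components, cumulative_variance in zip(range(1, 25), np.cumsum(pca.explained_variance_ratio_)):
  if cumulative_variance >= 0.9:
    print((n_components, cumulative_variance))
    break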
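
The same lookup can be written without a loop. The sketch below is one possible alternative, run on synthetic data (the data and the `_demo` names are hypothetical): `np.argmax` finds the first cumulative value at or above the threshold, and passing a fraction as `n_components` lets scikit-learn choose the count itself.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

# Synthetic data: 6 features driven mostly by 2 latent signals (illustrative)
rng = np.random.default_rng(2)
latent = rng.normal(size=(200, 2))
X_demo = scale(latent @ rng.normal(size=(2, 6)) + 0.3 * rng.normal(size=(200, 6)))

pca_demo = PCA(svd_solver="full").fit(X_demo)
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)

# Index of the first cumulative value at or above 0.9, as a 1-based count
n_needed = int(np.argmax(cumulative >= 0.9)) + 1
print(n_needed, cumulative[n_needed - 1])

# A fractional n_components keeps just enough components to explain
# more than that share of the variance
pca_90 = PCA(n_components=0.9, svd_solver="full").fit(X_demo)
print(pca_90.n_components_)
```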

## Exercise 5

In this exercise, you will examine the factor loadings of the PCA model.

Fill in the blanks below to return the factor loadings.

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** Check out the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) to determine which attribute returns the components.

In [0]:
# ANSWER
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
loadings

**Coursera Quiz**: How many rows and columns are there in this loadings matrix?

In [0]:
loadings.shape
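
With standardized inputs, each loading can be read as the correlation between an original feature and a component. The sketch below checks that interpretation on synthetic data (the data and the `_demo` names are assumptions for illustration); the small correction factor arises because `scale` uses the population standard deviation while `explained_variance_` divides by `n - 1`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

# Synthetic data: 5 features driven mostly by 2 latent signals (illustrative)
rng = np.random.default_rng(3)
latent = rng.normal(size=(200, 2))
X_demo = scale(latent @ rng.normal(size=(2, 5)) + 0.5 * rng.normal(size=(200, 5)))
n_rows, n_features = X_demo.shape

pca_demo = PCA(random_state=42).fit(X_demo)
loadings_demo = pca_demo.components_.T * np.sqrt(pca_demo.explained_variance_)

# Component scores for each row
scores = pca_demo.transform(X_demo)

# Correlation between every feature (rows) and every component (columns)
full_corr = np.corrcoef(X_demo, scores, rowvar=False)
feature_component_corr = full_corr[:n_features, n_features:]

# The loadings match the correlations up to a factor of sqrt(n / (n - 1))
correction = np.sqrt(n_rows / (n_rows - 1))
print(np.allclose(loadings_demo / correction, feature_component_corr))
```

Rows of the loadings matrix correspond to features and columns correspond to components, which is why the lab transposes `components_`.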

## Exercise 6

In this exercise, you will use the loadings from the previous exercise to determine which feature is most correlated with the first component.

Fill in the blanks below to create a more useful loadings DataFrame using the `loadings` matrix defined in the previous exercise.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> This is the same data, but it now has helpful column and row names to make more sense of what we're looking at.

In [0]:
# ANSWER
import pandas as pd

component_columns = ["PC" + str(x) for x in range(1, 25)]
loadings_df = pd.DataFrame(loadings, columns=component_columns, index=df.columns)
loadings_df

**Coursera Quiz**: Which of the features is most correlated (in any direction) with the first component `PC1`?

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> The code below uses the Pandas DataFrame API. You can always turn this back into a SQL table if you're more comfortable with SQL.

In [0]:
abs(loadings_df["PC1"]).sort_values(ascending=False)
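
If only the name of the top feature is needed, rather than the full sorted list, `idxmax` returns it directly. A minimal sketch with made-up loadings (the numbers and the `loadings_demo_df` name are hypothetical):

```python
import pandas as pd

# Hypothetical loadings for three features on two components (made-up numbers)
loadings_demo_df = pd.DataFrame(
    {"PC1": [0.35, -0.92, 0.10], "PC2": [0.80, 0.15, -0.55]},
    index=["avg_steps", "avg_bmi", "avg_vo2"],
)

# Take absolute values so strong negative correlations count too,
# then return the index label of the largest one
most_correlated = loadings_demo_df["PC1"].abs().idxmax()
print(most_correlated)  # avg_bmi
```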

## Exercise 7

In this exercise, you will prepare a feature set using the first few components from our PCA process.

Fill in the blanks below to prepare the new feature set using only the first three components.

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** Check out the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) to determine which method applies dimensionality reduction to an existing feature set.

In [0]:
# ANSWER
component_df = pd.DataFrame(pca.transform(scale(df)), columns=component_columns)
component_3_df = component_df.loc[:, ["PC1", "PC2", "PC3"]]
component_3_df
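
Computing every component and then keeping the first three gives the same features as requesting three components up front. The sketch below illustrates this on synthetic data (the data and the `_demo` names are hypothetical); absolute values are compared because the sign of a component is arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

# Synthetic data: 6 features driven mostly by 3 latent signals (illustrative)
rng = np.random.default_rng(4)
latent = rng.normal(size=(200, 3))
X_demo = scale(latent @ rng.normal(size=(3, 6)) + 0.3 * rng.normal(size=(200, 6)))

# Approach used in the lab: compute every component, then keep the first three
all_scores = PCA(svd_solver="full").fit_transform(X_demo)
first_three = all_scores[:, :3]

# Alternative: ask for three components directly
direct_three = PCA(n_components=3, svd_solver="full").fit_transform(X_demo)

# Compare absolute values, since the sign of a component is arbitrary
print(np.allclose(np.abs(first_three), np.abs(direct_three)))

component_3_demo_df = pd.DataFrame(direct_three, columns=["PC1", "PC2", "PC3"])
print(component_3_demo_df.shape)
```

Fitting all components first, as the lab does, keeps the full variance breakdown available for inspection before deciding how many to retain.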

**Coursera Quiz**: Which of the following is a drawback of using PCA to reduce the feature space for supervised learning problems?

* PCA only needs a few components to represent the original features
* The curse of dimensionality
* PCA only works with a few columns at a time
* The resulting features are less interpretable

Congrats! That concludes our lesson on PCA!

Be sure to submit your quiz answers to Coursera, and join us in the next module to learn about feature engineering and selection.

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>