
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Data Science Project

**Objective**: *Design, complete, and assess a common data science project.*

In this lab, you will use the data science process to design, build, and assess a common data science project.

In [0]:
%run "../../Includes/Classroom-Setup"

## Project Details

In recent months, our health tracker company has noticed that many customers drop out of the sign-up process when they have to self-identify their exercise lifestyle (`ht_users.lifestyle`) – this is especially true for those with a "Sedentary" lifestyle. As a result, the company is considering removing this step from the sign-up process. However, the company knows this data is valuable for targeting introductory exercises and they don't want to lose it for customers that sign up after the step is removed.

In this data science project, our business stakeholders are interested in identifying which customers have a sedentary lifestyle – specifically, they want to know if we can correctly identify whether somebody has a "Sedentary" lifestyle at least 95 percent of the time. If we can meet this objective, the organization will be able to remove the lifestyle-specification step of the sign-up process *without losing the valuable information provided by the data*.


<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> There are no solutions provided for this project. You will need to complete it independently using the guidance detailed below and the previous labs from the project.


## Exercise 1

Summary: 
* Specify the data science process question. 
* Indicate whether this is framed as a supervised learning or unsupervised learning problem. 
* If it is supervised learning, indicate whether the problem is a regression problem or a classification problem.

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** When we are interested in predicting something, we are usually talking about a supervised learning problem.

In [0]:
# Question: can we predict a customer's lifestyle, specifically whether it is "Sedentary"?
# This is a supervised learning problem.
# It is a classification problem; logistic regression is the planned model.


## Exercise 2

Summary: 

* Specify the data science objective. 
* Indicate which evaluation metric should be used to assess the objective.

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** Remember, the data science objective needs to be measurable.

In [0]:
# Objective: correctly identify whether a customer has a "Sedentary" lifestyle at least 95% of the time.
# Evaluation metric: accuracy.


## Exercise 3

Summary:
* Design a baseline solution.
* Develop a baseline solution – be sure to split data between training for development and test for assessment.
* Assess your baseline solution. Does it meet the project objective? If not, use it as a threshold for further development.

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** Recall that baseline solutions are meant to be easy to develop.

In [0]:

%sql
CREATE OR REPLACE TABLE adsda.ht_user_metrics_lifestyle
USING DELTA LOCATION "/adsda/ht-user-metrics-lifestyle" AS (
  SELECT avg(resting_heartrate) AS avg_resting_heartrate,
         avg(active_heartrate) AS avg_active_heartrate,
         avg(bmi) AS bmi,
         avg(vo2) AS avg_vo2,
         avg(workout_minutes) AS avg_workout_minutes,
         avg(steps) AS steps,
         first(lifestyle) AS lifestyle
  FROM adsda.ht_daily_metrics
  GROUP BY device_id
)

In [0]:
ht_lifestyle_pd_df = spark.table("adsda.ht_user_metrics_lifestyle").toPandas()

In [0]:
ht_lifestyle_pd_df.describe(include='all')


| | avg_resting_heartrate | avg_active_heartrate | bmi | avg_vo2 | avg_workout_minutes | steps | lifestyle |
|---|---|---|---|---|---|---|---|
| count | 3000.0 | 3000.0 | 3000.0 | 3000.0 | 3000.0 | 3000.0 | 3000 |
| unique | | | | | | | 4 |
| top | | | | | | | Cardio Enthusiast |
| freq | | | | | | | 1064 |
| mean | 62.26662 | 119.975708 | 22.902468 | 32.351569 | 35.57314 | 10211.991552 | |
| std | 12.521525 | 16.910238 | 4.49268 | 7.029757 | 12.472619 | 2985.154675 | |
| min | 45.04649 | 82.041834 | 7.592313 | 10.934276 | 4.219295 | 5047.646575 | |
| 25% | 52.024483 | 106.580546 | 19.761279 | 27.334516 | 32.626821 | 7181.889726 | |
| 50% | 58.526237 | 117.846432 | 22.912607 | 33.212109 | 36.840635 | 10839.99726 | |
| 75% | 70.799247 | 131.75827 | 26.005915 | 37.412472 | 41.755371 | 12759.914384 | |


In [0]:
ht_lifestyle_pd_df['lifestyle'].value_counts()

In [0]:
ht_lifestyle_pd_df['predictedlifestyle']='Cardio Enthusiast'

In [0]:
ht_lifestyle_pd_df.head()

| | avg_resting_heartrate | avg_active_heartrate | bmi | avg_vo2 | avg_workout_minutes | steps | lifestyle | predictedlifestyle |
|---|---|---|---|---|---|---|---|---|
| 0 | 82.683797 | 139.434875 | 22.398064 | 20.994012 | 5.502632 | 5171.49589 | Sedentary | Cardio Enthusiast |
| 1 | 77.732942 | 127.057153 | 25.150813 | 25.527475 | 37.216702 | 7115.591781 | Weight Trainer | Cardio Enthusiast |
| 2 | 86.511629 | 147.315731 | 19.148256 | 19.448407 | 45.000087 | 7257.693151 | Weight Trainer | Cardio Enthusiast |
| 3 | 77.550541 | 129.577004 | 24.240376 | 21.401302 | 37.886069 | 7129.690411 | Weight Trainer | Cardio Enthusiast |
| 4 | 68.933106 | 136.502687 | 30.726596 | 28.85523 | 32.241984 | 6958.378082 | Weight Trainer | Cardio Enthusiast |
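
The cells above build a constant "most common class" prediction but do not score it. A minimal sketch of how such a baseline could be assessed is shown here; the helper name `majority_baseline_accuracy` and the toy label lists are hypothetical and stand in for the training and test portions of the real table.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Predict the most common training label for every test row.

    Returns the chosen label and the fraction of test rows it matches.
    """
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    correct = sum(1 for label in test_labels if label == majority_label)
    return majority_label, correct / len(test_labels)

# Toy labels for illustration only; these are not the project's data.
toy_train = ["Cardio Enthusiast", "Cardio Enthusiast", "Sedentary", "Weight Trainer"]
toy_test = ["Cardio Enthusiast", "Sedentary", "Weight Trainer", "Cardio Enthusiast"]

baseline_label, baseline_accuracy = majority_baseline_accuracy(toy_train, toy_test)
print(baseline_label, baseline_accuracy)
```

The majority label is chosen from the training labels only, so the test labels play no part in designing the baseline. The resulting accuracy can then serve as the threshold that later models have to beat.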



## Exercise 4

Summary: 
* Design the machine learning solution, but do not yet develop it. 
* Indicate whether a machine learning model will be used. If so, indicate which machine learning model will be used and what the label/output variable will be.

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** Consider solutions that align with the framing you did in Exercise 1.

In [0]:
from sklearn.preprocessing import LabelEncoder

# Planned model: multi-class logistic regression with `lifestyle` as the label.
# Candidate feature sets to compare.
X_1 = ht_lifestyle_pd_df[['avg_active_heartrate', 'avg_resting_heartrate']]
X_2 = ht_lifestyle_pd_df[['bmi', 'avg_vo2']]
X_3 = ht_lifestyle_pd_df[['avg_active_heartrate', 'bmi', 'avg_workout_minutes']]
X_4 = ht_lifestyle_pd_df[['avg_resting_heartrate', 'bmi', 'avg_vo2', 'avg_active_heartrate']]

# Encode the lifestyle strings as integer class labels.
le = LabelEncoder()
lifestyle = ht_lifestyle_pd_df['lifestyle']
y = le.fit_transform(lifestyle)
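
Because the label is encoded as integers, it helps to know which integer stands for which lifestyle before reading predictions or confusion matrices. The sketch below uses a small toy list (not the project's data) to show how scikit-learn's `LabelEncoder` exposes that mapping; the names `toy_lifestyles` and `code_by_label` are illustrative only.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels for illustration only; these are not the project's data.
toy_lifestyles = ["Sedentary", "Weight Trainer", "Cardio Enthusiast", "Sedentary"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(toy_lifestyles)

# classes_ holds the original labels in sorted order; each position is the integer code.
code_by_label = {label: code for code, label in enumerate(encoder.classes_)}
print(code_by_label)

# inverse_transform maps integer codes back to the original strings.
decoded = encoder.inverse_transform(encoded).tolist()
print(encoded.tolist(), decoded)
```

The same lookup applied to the notebook's `le` object would give the code assigned to "Sedentary", which is the class the stakeholders care about.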

## Exercise 5

Summary: 
* Explore your data. 
* Specify which tables and columns will be used for your label/output variable and your feature variables.

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** Consider aggregating features from other tables.

In [0]:
ht_lifestyle_pd_df.describe(include='all')

| | avg_resting_heartrate | avg_active_heartrate | bmi | avg_vo2 | avg_workout_minutes | steps | lifestyle | predictedlifestyle |
|---|---|---|---|---|---|---|---|---|
| count | 3000.0 | 3000.0 | 3000.0 | 3000.0 | 3000.0 | 3000.0 | 3000 | 3000 |
| unique | | | | | | | 4 | 1 |
| top | | | | | | | Cardio Enthusiast | Cardio Enthusiast |
| freq | | | | | | | 1064 | 3000 |
| mean | 62.26662 | 119.975708 | 22.902468 | 32.351569 | 35.57314 | 10211.991552 | | |
| std | 12.521525 | 16.910238 | 4.49268 | 7.029757 | 12.472619 | 2985.154675 | | |
| min | 45.04649 | 82.041834 | 7.592313 | 10.934276 | 4.219295 | 5047.646575 | | |
| 25% | 52.024483 | 106.580546 | 19.761279 | 27.334516 | 32.626821 | 7181.889726 | | |
| 50% | 58.526237 | 117.846432 | 22.912607 | 33.212109 | 36.840635 | 10839.99726 | | |
| 75% | 70.799247 | 131.75827 | 26.005915 | 37.412472 | 41.755371 | 12759.914384 | | |


In [0]:
ht_lifestyle_pd_df.head()

| | avg_resting_heartrate | avg_active_heartrate | bmi | avg_vo2 | avg_workout_minutes | steps | lifestyle | predictedlifestyle |
|---|---|---|---|---|---|---|---|---|
| 0 | 82.683797 | 139.434875 | 22.398064 | 20.994012 | 5.502632 | 5171.49589 | Sedentary | Cardio Enthusiast |
| 1 | 77.732942 | 127.057153 | 25.150813 | 25.527475 | 37.216702 | 7115.591781 | Weight Trainer | Cardio Enthusiast |
| 2 | 86.511629 | 147.315731 | 19.148256 | 19.448407 | 45.000087 | 7257.693151 | Weight Trainer | Cardio Enthusiast |
| 3 | 77.550541 | 129.577004 | 24.240376 | 21.401302 | 37.886069 | 7129.690411 | Weight Trainer | Cardio Enthusiast |
| 4 | 68.933106 | 136.502687 | 30.726596 | 28.85523 | 32.241984 | 6958.378082 | Weight Trainer | Cardio Enthusiast |


## Exercise 6

Summary: 
* Prepare your modeling data. 
* Create a customer-level modeling table with the correct output variable and features. 
* Finally, split your data between training and test sets. Make sure this split aligns with that of your baseline solution.

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** Consider how to make the data split reproducible.

In [0]:
from sklearn.model_selection import train_test_split

# Split every feature set and the labels in a single call so that all
# training/test rows stay aligned with y_train/y_test; the fixed random_state
# makes the split reproducible.
(X_1_train, X_1_test,
 X_2_train, X_2_test,
 X_3_train, X_3_test,
 X_4_train, X_4_test,
 y_train, y_test) = train_test_split(X_1, X_2, X_3, X_4, y, random_state=42)
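
The reproducibility hint can be addressed with a fixed `random_state`, and row alignment across several feature sets can be kept by passing them all to one `train_test_split` call. The sketch below checks both properties on synthetic arrays; `features_a`, `features_b`, `row_ids`, and `split_once` are made-up names, and `row_ids` stands in for the label vector so that alignment is easy to verify.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for two feature sets and a label vector (not project data).
rng = np.random.default_rng(0)
features_a = rng.normal(size=(20, 2))
features_b = rng.normal(size=(20, 3))
row_ids = np.arange(20)  # each "label" is simply the original row index

def split_once(seed):
    # Arrays passed in one call are partitioned with the same row order.
    return train_test_split(features_a, features_b, row_ids,
                            test_size=0.25, random_state=seed)

first = split_once(42)
second = split_once(42)

# Same seed: every returned array is identical between the two calls.
same_split = all(np.array_equal(a, b) for a, b in zip(first, second))

# Alignment: the rows in each feature split match the row indices in the label split.
a_train, a_test, b_train, b_test, ids_train, ids_test = first
aligned = (np.array_equal(features_a[ids_test], a_test)
           and np.array_equal(features_b[ids_train], b_train))
print(same_split, aligned)
```

Calling `train_test_split` separately for each feature set without a seed would shuffle the rows differently each time, so the labels from one call would not match the features from another.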

## Exercise 7

Summary: 
* Build the model specified in your answer to Exercise 4. 
* Be sure to use an evaluation metric that aligns with your specified objective.

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** This evaluation metric should align with the one used in your baseline solution.

In [0]:
from sklearn.linear_model import LogisticRegression
lr_1 = LogisticRegression(max_iter=10000)
lr_2 = LogisticRegression(max_iter=10000)
lr_3 = LogisticRegression(max_iter=10000)
lr_4 = LogisticRegression(max_iter=10000)

lr_1.fit(X_1_train, y_train)
lr_2.fit(X_2_train, y_train)
lr_3.fit(X_3_train, y_train)
lr_4.fit(X_4_train, y_train)
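
The large `max_iter` suggests the solver may need many iterations on unscaled features such as heart rate and step counts. One possible tweak, offered here as an assumption rather than as part of the original solution, is to standardize the features inside a scikit-learn pipeline before the logistic regression. The data below is synthetic and the names `X_demo`, `y_demo`, and `scaled_lr` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature data on very different scales (not project data).
rng = np.random.default_rng(0)
heartrate = rng.normal(60, 10, size=200)
steps = rng.normal(10000, 3000, size=200)
X_demo = np.column_stack([heartrate, steps])
y_demo = (steps < 8000).astype(int)  # made-up binary label for the demo

# The scaler is fit on the training data only, then applied before the classifier.
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_demo, y_demo)

demo_predictions = scaled_lr.predict(X_demo)
print(demo_predictions.shape)
```

Keeping the scaler inside the pipeline means the same transformation is applied automatically at prediction time, so test rows are never used to compute the scaling statistics.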

## Exercise 8

Summary: 
* Assess your model against the overall objective. 
* Be sure to use an evaluation metric that aligns with your specified objective.

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** Remember that we assess our models against our test data set to ensure that our solutions generalize.

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** If your solution doesn't meet the objective, consider tweaking the model and data used by the model until it does meet the objective.

In [0]:
from sklearn.metrics import accuracy_score, confusion_matrix

y_train_1_predicted = lr_1.predict(X_1_train)
y_test_1_predicted = lr_1.predict(X_1_test)
y_train_2_predicted = lr_2.predict(X_2_train)
y_test_2_predicted = lr_2.predict(X_2_test)
y_train_3_predicted = lr_3.predict(X_3_train)
y_test_3_predicted = lr_3.predict(X_3_test)
y_train_4_predicted = lr_4.predict(X_4_train)
y_test_4_predicted = lr_4.predict(X_4_test)

In [0]:

train_1_accuracy = accuracy_score(y_train, y_train_1_predicted)
train_1_conf_mat = confusion_matrix(y_train, y_train_1_predicted)
test_1_accuracy = accuracy_score(y_test, y_test_1_predicted)
test_1_conf_mat = confusion_matrix(y_test, y_test_1_predicted)

train_2_accuracy = accuracy_score(y_train, y_train_2_predicted)
train_2_conf_mat = confusion_matrix(y_train, y_train_2_predicted)
test_2_accuracy = accuracy_score(y_test, y_test_2_predicted)
test_2_conf_mat = confusion_matrix(y_test, y_test_2_predicted)

train_3_accuracy = accuracy_score(y_train, y_train_3_predicted)
train_3_conf_mat = confusion_matrix(y_train, y_train_3_predicted)
test_3_accuracy = accuracy_score(y_test, y_test_3_predicted)
test_3_conf_mat = confusion_matrix(y_test, y_test_3_predicted)

train_4_accuracy = accuracy_score(y_train, y_train_4_predicted)
train_4_conf_mat = confusion_matrix(y_train, y_train_4_predicted)
test_4_accuracy = accuracy_score(y_test, y_test_4_predicted)
test_4_conf_mat = confusion_matrix(y_test, y_test_4_predicted)

print("model 1: training accuracy: ", train_1_accuracy)
print("model 1: training confusion matrix: ")
print(train_1_conf_mat)
print(" ")
print("model 1: test accuracy:     ", test_1_accuracy)
print("model 1: test confusion matrix:     ")
print(test_1_conf_mat)
print(" ")
print("model 2: training accuracy: ", train_2_accuracy)
print("model 2: training confusion matrix: ")
print(train_2_conf_mat)
print(" ")
print("model 2: test accuracy:     ", test_2_accuracy)
print("model 2: test confusion matrix:     ")
print(test_2_conf_mat)
print(" ")
print("model 3: training accuracy: ", train_3_accuracy)
print("model 3: training confusion matrix: ")
print(train_3_conf_mat)
print(" ")
print("model 3: test accuracy:     ", test_3_accuracy)
print("model 3: test confusion matrix:     ")
print(test_3_conf_mat)
print(" ")
print("model 4: training accuracy: ", train_4_accuracy)
print("model 4: training confusion matrix: ")
print(train_4_conf_mat)
print(" ")
print("model 4: test accuracy:     ", test_4_accuracy)
print("model 4: test confusion matrix:     ")
print(test_4_conf_mat)
print(" ")
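
The accuracies printed above are computed over all lifestyle classes, while the stakeholders' objective concerns "Sedentary" specifically. One reading of that objective, stated here as an assumption, is a Sedentary-versus-rest accuracy of at least 0.95. The sketch below shows how encoded predictions could be collapsed to that binary view; the toy labels are invented for illustration and are not project results.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score
from sklearn.preprocessing import LabelEncoder

OBJECTIVE = 0.95  # threshold taken from the project description

# Toy labels and predictions for illustration only; not project results.
toy_actual = ["Sedentary", "Sedentary", "Cardio Enthusiast", "Weight Trainer",
              "Cardio Enthusiast", "Weight Trainer", "Sedentary", "Cardio Enthusiast"]
toy_predicted = ["Sedentary", "Cardio Enthusiast", "Cardio Enthusiast", "Cardio Enthusiast",
                 "Cardio Enthusiast", "Weight Trainer", "Sedentary", "Weight Trainer"]

encoder = LabelEncoder().fit(toy_actual)
y_true = encoder.transform(toy_actual)
y_pred = encoder.transform(toy_predicted)

# Collapse the multi-class labels to "is Sedentary" (1) versus everything else (0).
sedentary_code = encoder.transform(["Sedentary"])[0]
true_is_sedentary = (y_true == sedentary_code).astype(int)
pred_is_sedentary = (y_pred == sedentary_code).astype(int)

multiclass_accuracy = accuracy_score(y_true, y_pred)
sedentary_accuracy = accuracy_score(true_is_sedentary, pred_is_sedentary)
sedentary_recall = recall_score(true_is_sedentary, pred_is_sedentary)
meets_objective = sedentary_accuracy >= OBJECTIVE

print(multiclass_accuracy, sedentary_accuracy, sedentary_recall, meets_objective)
```

In this toy case the Sedentary-versus-rest accuracy is higher than the multi-class accuracy, because confusing two non-sedentary lifestyles does not count as an error under the binary view. Recall is reported alongside it since accuracy alone can look good when few customers are sedentary.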

After completing all of the above exercises, you should be ready to communicate your results. Move to the next video in the lesson for a description of that part of the project.

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>