# Guided Capstone Step 6. Documentation

**The Data Science Method**  


1.   Problem Identification 

2.   Data Wrangling 
  
3.   Exploratory Data Analysis 
 
4.   Pre-processing and Training Data Development

5.  Modeling

6.   **Documentation**
  * Review the Results
  * Finalize Code 
  * Finalize Documentation
  * Create a Project Report 
  * Create a Slide Deck for the Executive Audience

In this guided capstone we revisit many of the actions we took in the previous guided capstone steps. This gives you the opportunity to practice the code you wrote to answer the questions in steps 4 and 5.

**<font color='teal'> Start by loading the necessary packages and printing out our current working directory just to confirm we are in the correct project directory. </font>**

In [1]:
#load python packages
import os
import pandas as pd
import datetime
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
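%matplotlib inline

# Print the current working directory to confirm we are in the correct project directory
print(os.getcwd())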

## Fit Models with Training Dataset

**<font color='teal'> Using sklearn, fit the model you chose in Guided Capstone 5 on your training dataset. This includes creating dummy features for states if you need them, scaling the data, and creating train and test splits before fitting the chosen model. Also, remember to generate a model performance score (MAE or explained variance) based on the testing hold-out dataset.</font>**
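If your chosen model does use the state column, dummy features can be created with `pd.get_dummies`. The following is a minimal sketch on a hypothetical three-row frame (the values are made up; only the column names mirror the resort data), not the encoding used by the best model below, which drops `state` entirely.

```python
import pandas as pd

# Hypothetical toy frame; values are illustrative, not taken from the resort data.
toy = pd.DataFrame({
    "state": ["Montana", "California", "Montana"],
    "AdultWeekday": [58.0, 47.0, 62.0],
})

# One indicator column per state. drop_first=True removes the first category
# (alphabetically "California"), which is fully determined by the others.
toy_encoded = pd.get_dummies(toy, columns=["state"], drop_first=True, dtype=int)
print(toy_encoded)
```

Dropping the first category keeps the indicator columns from being perfectly collinear with the intercept of a linear model.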

#### Best Model 

In [2]:
df = pd.read_csv("../EDA/data/step3_output.csv")
df = df.drop("Unnamed: 0", axis=1)
df.head(5)

Unnamed: 0,Name,state,summit_elev,vertical_drop,trams,fastEight,fastSixes,fastQuads,quad,triple,...,SkiableTerrain_ac,Snow Making_ac,daysOpenLastYear,yearsOpen,averageSnowfall,AdultWeekday,AdultWeekend,projectedDaysOpen,NightSkiing_ac,clusters
0,Hilltop Ski Area,Alaska,2090,294,0,0.0,0,0,0,1,...,30.0,30.0,150.0,36.0,69.0,30.0,34.0,152.0,30.0,1
1,Sunrise Park Resort,Arizona,11100,1800,0,0.0,0,1,2,3,...,800.0,80.0,115.0,49.0,250.0,74.0,78.0,104.0,80.0,2
2,Yosemite Ski & Snowboard Area,California,7800,600,0,0.0,0,0,0,1,...,88.0,88.0,110.0,84.0,300.0,47.0,47.0,107.0,0.0,2
3,Boreal Mountain Resort,California,7700,500,0,0.0,0,1,1,3,...,380.0,200.0,150.0,54.0,400.0,49.0,64.16681,150.0,200.0,2
4,Donner Ski Ranch,California,8012,750,0,0.0,0,0,0,1,...,505.0,60.0,163.0,82.0,400.0,75.0,75.0,170.0,0.0,2


In [3]:
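# The best model selected in step 5 was created by dropping the columns
# 'Name', 'AdultWeekend', 'state', 'summit_elev' and 'vertical_drop'
# ('base_elev' does not appear among the columns displayed for this file, so it is not dropped here)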
X = df.drop(['Name','AdultWeekend','state','summit_elev','vertical_drop'], axis=1)

# Declare a response variable, called y, and assign it the AdultWeekend column of the df 
y = df.AdultWeekend

from sklearn import preprocessing as ppg

# Here we use the StandardScaler() method of the preprocessing package, and then call the fit() method with parameter X 
scaler = ppg.StandardScaler().fit(X)

# Declare a variable called X_scaled, and assign it the result of calling the transform() method with parameter X
X_scaled=scaler.transform(X) 

from sklearn.model_selection import train_test_split

# Get the 1-dimensional flattened array of our response variable y
# (convert to a NumPy array first; calling ravel() directly on a Series is deprecated in recent pandas)
y = y.to_numpy().ravel()

# Call the train_test_split() function with the first two parameters set to X_scaled and y 
# Declare four variables, X_train, X_test, y_train and y_test separated by commas 
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.25, random_state=1)

# Replace NaN, +inf and -inf with 0 in both splits so that fitting and prediction see clean inputs
X_train = np.nan_to_num(X_train, nan=0.0, posinf=0.0, neginf=0.0)
X_test = np.nan_to_num(X_test, nan=0.0, posinf=0.0, neginf=0.0)

# Linear Regression 
from sklearn import linear_model

lm = linear_model.LinearRegression()
model = lm.fit(X_train,y_train)
model.intercept_

56.92520947786605
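The cell above fits the scaler on every row before splitting, so statistics from the test rows influence the scaling. One common alternative, sketched here on synthetic data rather than the resort data, is to split first and let a scikit-learn `Pipeline` fit the scaler on the training rows only. The names `X_demo`, `y_demo`, and `pipe` are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and response (not the resort data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 50 + X_demo @ np.array([5.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=200)

# Split first, so the test rows stay unseen until scoring.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1
)

# The pipeline fits StandardScaler on X_tr only, then fits the regression.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)

# score() applies the training-set scaling to X_te and returns R^2.
print(pipe.score(X_te, y_te))
```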

In [4]:
# Make a variable called y_pred and assign it the result of calling predict() on our model variable with parameter X_test
y_pred = model.predict(X_test)

# Metrics
from sklearn.metrics import explained_variance_score as evs,mean_absolute_error as mae

print("Explained Variance Score = " + str(evs(y_test, y_pred)))
print("Mean Absolute Error = " + str(mae(y_test, y_pred)))

Explained Variance Score = 0.7584894253464819
Mean Absolute Error = 6.3897770848327164
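To make the two scores concrete: the mean absolute error is the average of |y − ŷ|, and the explained variance score is 1 − Var(y − ŷ) / Var(y). The sketch below checks both formulas against scikit-learn on four made-up prices (the numbers are illustrative, not model output).

```python
import numpy as np
from sklearn.metrics import explained_variance_score, mean_absolute_error

# Made-up actual and predicted ticket prices, for illustration only.
y_true = np.array([60.0, 75.0, 49.0, 82.0])
y_hat = np.array([64.0, 70.0, 52.0, 80.0])

residuals = y_true - y_hat  # [-4, 5, -3, 2]

# Mean absolute error: average size of the residuals, in dollars.
mae_manual = np.mean(np.abs(residuals))  # (4 + 5 + 3 + 2) / 4 = 3.5

# Explained variance: share of the variance in y_true not left in the residuals.
evs_manual = 1 - np.var(residuals) / np.var(y_true)

print(mae_manual, mean_absolute_error(y_true, y_hat))
print(evs_manual, explained_variance_score(y_true, y_hat))
```

Read this way, the MAE reported above means predictions on the hold-out set miss the actual weekend price by about $6.39 on average.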


## Review the results 

**<font color='teal'> Now, let's predict the Big Mountain Weekend price with our model in order to provide a recommendation to our managers on how to price the `AdultWeekend` lift ticket. First we need to find the row for Big Mountain resort in our data using string contains or string matching.</font>**

In [5]:
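# str.contains takes a single pattern; its second positional argument is `case`,
# so str.contains('Big', 'Mountain') only ever searched for 'Big'.
# No row in df has a name containing 'Big Mountain', so list every name containing 'Big'.
df[df['Name'].str.contains('Big')]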

Unnamed: 0,Name,state,summit_elev,vertical_drop,trams,fastEight,fastSixes,fastQuads,quad,triple,...,SkiableTerrain_ac,Snow Making_ac,daysOpenLastYear,yearsOpen,averageSnowfall,AdultWeekday,AdultWeekend,projectedDaysOpen,NightSkiing_ac,clusters
44,Big Snow Resort - Blackjack,Michigan,850,465,0,0.0,0,0,0,0,...,170.0,86.0,95.0,42.0,210.0,49.0,65.0,115.0,0.0,1
46,Big Snow Resort - Indianhead Mountain,Michigan,1935,638,0,0.0,0,0,1,1,...,240.0,150.0,120.0,60.0,204.0,49.0,49.0,120.0,0.0,1
117,Ski Big Bear,Pennsylvania,1250,650,0,0.0,0,0,0,0,...,26.0,26.0,75.0,43.0,69.0,50.0,62.0,75.0,26.0,1
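A short sketch of three ways to match resort names with pandas, using a hypothetical list of names (only "Big Sky Resort" and "Ski Big Bear" appear in the data above; the other two are invented for illustration). It shows a literal substring match, a match on two words in any order, and an exact match.

```python
import pandas as pd

# Hypothetical names; "Big Mountain Resort" and "Mountain View Big Hill" are invented.
names = pd.Series([
    "Big Sky Resort",
    "Ski Big Bear",
    "Big Mountain Resort",
    "Mountain View Big Hill",
])

# Literal substring match: the whole phrase must appear as written.
literal = names[names.str.contains("Big Mountain", regex=False)]

# Both words, in any order: combine two boolean masks with &.
both = names[
    names.str.contains("Big", regex=False)
    & names.str.contains("Mountain", regex=False)
]

# Exact match on the full name.
exact = names[names == "Big Sky Resort"]

print(literal.tolist())
print(both.tolist())
print(exact.tolist())
```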


In [6]:
# Checked the original dataset for 'Big Mountain' in Montana: it is not there.
# The only Montana resort with 'Big' in its name is 'Big Sky Resort'.
df2 = pd.read_csv("../Data_Wrangling/updated_ski_data.csv")
df2[df2['Name'].str.contains('Big')]

Unnamed: 0,Name,Region,state,summit_elev,vertical_drop,base_elev,trams,fastEight,fastSixes,fastQuads,...,LongestRun_mi,SkiableTerrain_ac,Snow Making_ac,daysOpenLastYear,yearsOpen,averageSnowfall,AdultWeekday,AdultWeekend,projectedDaysOpen,NightSkiing_ac
74,Big Squaw Mountain Ski Resort,Maine,Maine,3200,660,1750,0,,0,0,...,0.8,,,67.0,6.0,,30.0,30.0,58.0,
97,Big Powderhorn Mountain,Michigan,Michigan,1800,600,1200,0,0.0,0,0,...,1.0,253.0,228.0,100.0,55.0,214.0,69.0,69.0,108.0,
99,Big Snow Resort - Blackjack,Michigan,Michigan,850,465,385,0,0.0,0,0,...,1.0,170.0,86.0,95.0,42.0,210.0,49.0,65.0,115.0,
105,Big Snow Resort - Indianhead Mountain,Michigan,Michigan,1935,638,1297,0,,0,0,...,1.0,240.0,150.0,120.0,60.0,204.0,49.0,49.0,120.0,
140,Big Sky Resort,Montana,Montana,11166,4350,7500,1,1.0,2,5,...,6.0,5800.0,400.0,144.0,46.0,400.0,,,144.0,
238,Ski Big Bear,Pennsylvania,Pennsylvania,1250,650,600,0,0.0,0,0,...,1.5,26.0,26.0,75.0,43.0,69.0,50.0,62.0,75.0,26.0
239,Big Boulder,Pennsylvania,Pennsylvania,2175,600,1700,0,0.0,0,0,...,,55.0,55.0,76.0,72.0,50.0,50.0,65.0,95.0,55.0


In [7]:
df2[df2['Name'].str.contains('Big Sky Resort')]

Unnamed: 0,Name,Region,state,summit_elev,vertical_drop,base_elev,trams,fastEight,fastSixes,fastQuads,...,LongestRun_mi,SkiableTerrain_ac,Snow Making_ac,daysOpenLastYear,yearsOpen,averageSnowfall,AdultWeekday,AdultWeekend,projectedDaysOpen,NightSkiing_ac
140,Big Sky Resort,Montana,Montana,11166,4350,7500,1,1.0,2,5,...,6.0,5800.0,400.0,144.0,46.0,400.0,,,144.0,


In [10]:
df3 = pd.read_csv("../Data_Wrangling/data/step2_output.csv")
df3[df3['Name'].str.contains('Big')]

Unnamed: 0.1,Unnamed: 0,Name,Region,state,summit_elev,vertical_drop,base_elev,trams,fastEight,fastSixes,...,LongestRun_mi,SkiableTerrain_ac,Snow Making_ac,daysOpenLastYear,yearsOpen,averageSnowfall,AdultWeekday,AdultWeekend,projectedDaysOpen,NightSkiing_ac
74,74,Big Squaw Mountain Ski Resort,Maine,Maine,3200,660,1750,0,0.0,0,...,0.8,,,67.0,6.0,0.0,30.0,30.0,58.0,0.0
97,97,Big Powderhorn Mountain,Michigan,Michigan,1800,600,1200,0,0.0,0,...,1.0,253.0,228.0,100.0,55.0,214.0,69.0,69.0,108.0,0.0
99,99,Big Snow Resort - Blackjack,Michigan,Michigan,850,465,385,0,0.0,0,...,1.0,170.0,86.0,95.0,42.0,210.0,49.0,65.0,115.0,0.0
105,105,Big Snow Resort - Indianhead Mountain,Michigan,Michigan,1935,638,1297,0,0.0,0,...,1.0,240.0,150.0,120.0,60.0,204.0,49.0,49.0,120.0,0.0
140,140,Big Sky Resort,Montana,Montana,11166,4350,7500,1,1.0,2,...,6.0,5800.0,400.0,144.0,46.0,400.0,57.916957,64.16681,144.0,0.0
238,238,Ski Big Bear,Pennsylvania,Pennsylvania,1250,650,600,0,0.0,0,...,1.5,26.0,26.0,75.0,43.0,69.0,50.0,62.0,75.0,26.0
239,239,Big Boulder,Pennsylvania,Pennsylvania,2175,600,1700,0,0.0,0,...,0.0,55.0,55.0,76.0,72.0,50.0,50.0,65.0,95.0,55.0


In [11]:
df3[df3['Name'] == 'Big Sky Resort']

Unnamed: 0.1,Unnamed: 0,Name,Region,state,summit_elev,vertical_drop,base_elev,trams,fastEight,fastSixes,...,LongestRun_mi,SkiableTerrain_ac,Snow Making_ac,daysOpenLastYear,yearsOpen,averageSnowfall,AdultWeekday,AdultWeekend,projectedDaysOpen,NightSkiing_ac
140,140,Big Sky Resort,Montana,Montana,11166,4350,7500,1,1.0,2,...,6.0,5800.0,400.0,144.0,46.0,400.0,57.916957,64.16681,144.0,0.0


**<font color='teal'> Prepare the Big Mountain resort data row as you did in the model fitting stage.</font>**

**<font color='teal'> Predict the Big Mountain resort `AdultWeekend` price and print it out.</font>** This is the expected price to present to management: the model's estimate given the characteristics of the resort relative to those of the other ski resorts.

**<font color='teal'> Print the Big Mountain resort actual `AdultWeekend` price.</font>**
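The three steps above can be sketched as follows. This is a minimal illustration on synthetic data, assuming a fitted scaler and model like the ones from the fitting stage; the frame `resorts`, the name "Resort 7", and the two feature columns are stand-ins, not the real resort row.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the resort data (values are generated, not real).
rng = np.random.default_rng(1)
n = 40
resorts = pd.DataFrame({
    "Name": [f"Resort {i}" for i in range(n)],
    "daysOpenLastYear": rng.integers(60, 180, size=n).astype(float),
    "averageSnowfall": rng.integers(50, 450, size=n).astype(float),
})
resorts["AdultWeekend"] = (
    20
    + 0.2 * resorts["daysOpenLastYear"]
    + 0.05 * resorts["averageSnowfall"]
    + rng.normal(scale=1.0, size=n)
)

feature_cols = ["daysOpenLastYear", "averageSnowfall"]
scaler_demo = StandardScaler().fit(resorts[feature_cols])
model_demo = LinearRegression().fit(
    scaler_demo.transform(resorts[feature_cols]), resorts["AdultWeekend"]
)

# 1. Select the single row of interest by exact name.
target = resorts[resorts["Name"] == "Resort 7"]

# 2. Keep the same feature columns and reuse the already-fitted scaler
#    (transform only; do not refit on one row).
target_scaled = scaler_demo.transform(target[feature_cols])

# 3. Predict, then compare with the recorded price.
predicted = model_demo.predict(target_scaled)[0]
actual = target["AdultWeekend"].iloc[0]
print(f"Predicted AdultWeekend: {predicted:.2f}")
print(f"Actual AdultWeekend:    {actual:.2f}")
```

The key design point is that the row goes through exactly the same column selection and scaling as the training data; refitting the scaler on a single row would reduce every feature to zero.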

**<font color='teal'> An important part of reviewing the results is generating figures that visualize the data story. We can use the clusters we added to our data frame to create scatter plots comparing the `AdultWeekend` values with other characteristics. Run the example below to get started, then build two or three more figures to include in your data storytelling.</font>**

In [8]:
#plt.scatter(df['summit_elev'], df['vertical_drop'], c=df['clusters'], s=50, cmap='viridis', label ='clusters')
#plt.scatter(ac['summit_elev'], ac['vertical_drop'], c='black', s=100)
#plt.xlabel('summit_elev')
#plt.ylabel('vertical_drop')
#plt.title('summit_elev by vertical_drop by cluster')
#plt.savefig('figures/fig1.png',bbox_inches='tight')
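As a starting point for an additional figure, here is a hedged sketch that plots price against snowfall, colored by cluster, with one resort highlighted. It runs on randomly generated data and writes to a temporary directory; the frame `demo`, the highlighted row, and the output file name are assumptions for illustration. In the notebook you would substitute `df` and a path under `figures/`.

```python
import os
import tempfile

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Randomly generated stand-in for the clustered resort data.
rng = np.random.default_rng(2)
demo = pd.DataFrame({
    "averageSnowfall": rng.uniform(50, 450, size=60),
    "AdultWeekend": rng.uniform(30, 120, size=60),
    "clusters": rng.integers(0, 3, size=60),
})
highlight = demo.iloc[[0]]  # stand-in for the resort of interest

fig, ax = plt.subplots()
# All resorts, colored by cluster label.
ax.scatter(demo["averageSnowfall"], demo["AdultWeekend"],
           c=demo["clusters"], s=50, cmap="viridis")
# The highlighted resort, drawn larger and in black on top.
ax.scatter(highlight["averageSnowfall"], highlight["AdultWeekend"],
           c="black", s=100, label="resort of interest")
ax.set_xlabel("averageSnowfall")
ax.set_ylabel("AdultWeekend")
ax.set_title("AdultWeekend by averageSnowfall by cluster")
ax.legend()

fig_path = os.path.join(tempfile.mkdtemp(), "fig_snowfall_vs_price.png")
fig.savefig(fig_path, bbox_inches="tight")
plt.close(fig)
```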

## Finalize Code

Making sure our code is well organized and easy to follow is an important step. This is the time to review the notebooks and Python scripts you've created and clean them up so they are succinct and easy to follow. Additionally, we will save our final model to disk for future use in a data pipeline. Pickle is a module that serializes (and de-serializes) Python objects, so a fitted model can be written to a file and loaded again later without retraining. joblib, used in the example below, does the same job and is efficient for objects that hold large NumPy arrays. Serialized models are widely used in production environments where machine learning models are deployed at scale.

**<font color='teal'> Run the example code below to save out your model. Notice that we save it in the models folder we created in our previous guided capstone step.</font>**

In [9]:
#import pickle
#s = pickle.dumps(model)
#from joblib import dump, load
#dump(model, 'models/regression_model_adultweekend.joblib') 
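To show what saving and reloading looks like end to end, the following sketch fits a tiny regression, writes it with `joblib.dump`, reads it back with `joblib.load`, and confirms the restored model predicts identically. The data and the file name `regression_model_demo.joblib` are illustrative assumptions, and the file goes to a temporary directory rather than the `models` folder.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.linear_model import LinearRegression

# Tiny illustrative dataset: y = 2x + 1.
X_small = np.array([[0.0], [1.0], [2.0], [3.0]])
y_small = np.array([1.0, 3.0, 5.0, 7.0])
fitted = LinearRegression().fit(X_small, y_small)

# Serialize the fitted estimator to disk.
model_path = os.path.join(tempfile.mkdtemp(), "regression_model_demo.joblib")
dump(fitted, model_path)

# Later (for example in another script), load it and predict without refitting.
restored = load(model_path)
print(restored.predict(np.array([[4.0]])))
```

Only load model files from sources you trust, since unpickling can execute arbitrary code, and load them with the same library versions used to save them where possible.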

## Finalize Documentation

For model documentation, we want to save the model performance metrics as well as the features included in the final model. You could also save the model performance metrics and coefficients for the other models you tried in case you want to refer to them later. **<font color='teal'> Create a dataframe containing the coefficients and the model performance metrics and save it out as a csv file, then upload it to your GitHub repository.</font>**
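One possible layout for that file is a row per feature with its coefficient, plus the metrics repeated in their own columns. The sketch below builds it on synthetic data; the frame `X_doc`, the file name, and the choice to score on the training rows are simplifications for illustration. In the notebook you would use the fitted `model`, the column names of `X`, and the hold-out metrics.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import explained_variance_score, mean_absolute_error

# Synthetic stand-in for the scaled features and response.
rng = np.random.default_rng(3)
features = ["daysOpenLastYear", "averageSnowfall", "Snow Making_ac"]
X_doc = pd.DataFrame(rng.normal(size=(80, 3)), columns=features)
y_doc = (55 + 4.0 * X_doc["daysOpenLastYear"] - 2.0 * X_doc["averageSnowfall"]
         + 0.5 * X_doc["Snow Making_ac"] + rng.normal(scale=1.0, size=80))

reg = LinearRegression().fit(X_doc, y_doc)
y_doc_pred = reg.predict(X_doc)  # scored on training rows here for brevity only

# One row per feature, largest absolute coefficient first.
report = pd.DataFrame({"feature": features, "coefficient": reg.coef_})
report = report.sort_values("coefficient", key=lambda s: s.abs(), ascending=False)

# Model-level values repeated on every row so the CSV stays a single flat table.
report["intercept"] = reg.intercept_
report["explained_variance"] = explained_variance_score(y_doc, y_doc_pred)
report["mean_absolute_error"] = mean_absolute_error(y_doc, y_doc_pred)

report_path = os.path.join(tempfile.mkdtemp(), "model_metrics_demo.csv")
report.to_csv(report_path, index=False)
print(report)
```

Writing with `index=False` avoids the extra `Unnamed: 0` column that otherwise appears when the file is read back with `pd.read_csv`.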