# 4. Pre-Processing and Training Data Development

The Data Science Method

1. Problem Identification

2. Data Wrangling

3. Exploratory Data Analysis

4. Pre-processing and Training Data Development

* Create dummy or indicator features for categorical variables
* Standardize the magnitude of numeric features
* Split into testing and training datasets
* Apply scaler to the testing set

5. Modeling

* Fit Models with Training Data Set
* Review Model Outcomes — Iterate over additional models as needed.
* Identify the Final Model

6. Documentation

* Review the Results
* Present and share your findings - storytelling
* Finalize Code
* Finalize Documentation

In [6]:
import os
import pandas as pd
import datetime
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

In [7]:
print(os.getcwd())
df = pd.read_csv('data/step3_output.csv', index_col='Unnamed: 0')
print(df.shape)
df.head(5)

C:\Users\Sanjay\1Springboard\Springboard\Springboard\Guided-capstone-Unit-6
(165, 26)


| Unnamed: 0 | Name | state | summit_elev | vertical_drop | trams | fastEight | fastSixes | fastQuads | quad | triple | ... | SkiableTerrain_ac | Snow Making_ac | daysOpenLastYear | yearsOpen | averageSnowfall | AdultWeekday | AdultWeekend | projectedDaysOpen | NightSkiing_ac | clusters |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | Hilltop Ski Area | Alaska | 2090 | 294 | 0.0 | 0.0 | 0.0 | 0 | 0 | 1 | ... | 30.0 | 30.0 | 150.0 | 36.0 | 69.0 | 30.0 | 34.0 | 152.0 | 30.0 | 0 |
| 4 | Sunrise Park Resort | Arizona | 11100 | 1800 | 0.0 | 0.0 | 0.0 | 1 | 2 | 3 | ... | 800.0 | 80.0 | 115.0 | 49.0 | 250.0 | 74.0 | 78.0 | 104.0 | 80.0 | 1 |
| 5 | Yosemite Ski & Snowboard Area | California | 7800 | 600 | 0.0 | 0.0 | 0.0 | 0 | 0 | 1 | ... | 88.0 | 174.873239 | 110.0 | 84.0 | 300.0 | 47.0 | 47.0 | 107.0 | 100.395722 | 1 |
| 10 | Donner Ski Ranch | California | 8012 | 750 | 0.0 | 0.0 | 0.0 | 0 | 0 | 1 | ... | 505.0 | 60.0 | 163.0 | 82.0 | 400.0 | 75.0 | 75.0 | 170.0 | 100.395722 | 1 |
| 12 | June Mountain | California | 10090 | 2590 | 0.0 | 0.0 | 0.0 | 2 | 0 | 0 | ... | 1500.0 | 174.873239 | 115.103943 | 58.0 | 250.0 | 57.916957 | 64.16681 | 128.0 | 100.395722 | 1 |


In [8]:
#Create dummy variables for state. Add the dummies back to the dataframe and remove the original column for state.
df = pd.concat([df.drop(['state'], axis=1), pd.get_dummies(df[['state']])], axis=1)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 165 entries, 2 to 329
Data columns (total 60 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Name                  165 non-null    object 
 1   summit_elev           165 non-null    int64  
 2   vertical_drop         165 non-null    int64  
 3   trams                 165 non-null    float64
 4   fastEight             165 non-null    float64
 5   fastSixes             165 non-null    float64
 6   fastQuads             165 non-null    int64  
 7   quad                  165 non-null    int64  
 8   triple                165 non-null    int64  
 9   double                165 non-null    int64  
 10  surface               165 non-null    int64  
 11  total_chairs          165 non-null    int64  
 12  Runs                  165 non-null    float64
 13  TerrainParks          165 non-null    float64
 14  LongestRun_mi         165 non-null    float64
 15  SkiableTerrain_ac     1
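
As a small illustration of the dummy-encoding step above, the sketch below applies the same `concat` / `get_dummies` pattern to a made-up three-row frame. The `toy` frame and its values are assumptions for illustration, not the resort data. It also passes `dtype=float` to `pd.get_dummies`, which is one way to get float indicator columns directly.

```python
import pandas as pd

# Hypothetical toy frame standing in for the resort data (illustrative values only).
toy = pd.DataFrame({
    "Name": ["Resort A", "Resort B", "Resort C"],
    "state": ["Alaska", "Arizona", "Alaska"],
    "vertical_drop": [294, 1800, 600],
})

# One indicator column per state; dtype=float requests float64 indicators.
dummies = pd.get_dummies(toy[["state"]], dtype=float)

# Drop the original categorical column and append the indicators.
encoded = pd.concat([toy.drop(columns=["state"]), dummies], axis=1)

print(encoded.columns.tolist())
print(encoded.dtypes)
```

Each row has exactly one indicator set to 1.0, so the indicator columns sum to 1 across every row.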

In [10]:
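# Cast the dummy columns to float. Older pandas returns uint8 dummies; pandas 2.0+ returns bool.
cols = df.select_dtypes(include=['uint8', 'bool']).columns
df[cols] = df[cols].astype(float)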
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 165 entries, 2 to 329
Data columns (total 60 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Name                  165 non-null    object 
 1   summit_elev           165 non-null    int64  
 2   vertical_drop         165 non-null    int64  
 3   trams                 165 non-null    float64
 4   fastEight             165 non-null    float64
 5   fastSixes             165 non-null    float64
 6   fastQuads             165 non-null    int64  
 7   quad                  165 non-null    int64  
 8   triple                165 non-null    int64  
 9   double                165 non-null    int64  
 10  surface               165 non-null    int64  
 11  total_chairs          165 non-null    int64  
 12  Runs                  165 non-null    float64
 13  TerrainParks          165 non-null    float64
 14  LongestRun_mi         165 non-null    float64
 15  SkiableTerrain_ac     1

In [11]:
df.nunique()

Name                    165
summit_elev             139
vertical_drop           100
trams                     1
fastEight                 1
fastSixes                 1
fastQuads                 3
quad                      3
triple                    6
double                    7
surface                   7
total_chairs             14
Runs                     56
TerrainParks              5
LongestRun_mi            25
SkiableTerrain_ac       101
Snow Making_ac           71
daysOpenLastYear         66
yearsOpen                53
averageSnowfall          66
AdultWeekday             55
AdultWeekend             54
projectedDaysOpen        54
NightSkiing_ac           47
clusters                  3
state_Alaska              2
state_Arizona             2
state_California          2
state_Colorado            2
state_Connecticut         2
state_Idaho               2
state_Illinois            2
state_Indiana             2
state_Iowa                2
state_Maine               2
state_Maryland      
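
The `nunique()` output above shows that `trams`, `fastEight`, and `fastSixes` each hold a single value. A minimal sketch of finding such constant columns programmatically, rather than reading them off the listing, could look like this. The `toy` frame and the `constant_cols` name are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame: 'trams' is constant, the other columns vary.
toy = pd.DataFrame({
    "trams": [0.0, 0.0, 0.0, 0.0],
    "fastQuads": [0, 1, 2, 0],
    "quad": [0, 0, 1, 2],
})

# nunique() ignores NaN by default, so an all-NaN column counts as 0 unique values.
n_unique = toy.nunique()
constant_cols = n_unique[n_unique <= 1].index.tolist()

# A constant column cannot help a model distinguish rows, so it is safe to drop.
reduced = toy.drop(columns=constant_cols)

print(constant_cols)
print(reduced.columns.tolist())
```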

## Standardize the magnitude of numeric features

Using sklearn's preprocessing module, standardize the scale of the features in the dataframe. The resort name is not needed for modeling, so it can be dropped here as well. We also hold out the response variable so that its true values remain available for reviewing model performance. Let's set <b>AdultWeekend</b> as the y variable, our response for scaling and modeling. Later we will go back and consider <b>AdultWeekday, daysOpenLastYear, and projectedDaysOpen</b>. For now, leave them in the development dataframe.

In [12]:
# first we import the preprocessing package from the sklearn library
from sklearn import preprocessing

# Declare the explanatory variables, X, by dropping 'Name' and the response 'AdultWeekend' from df.
# Also drop 'trams', 'fastEight', and 'fastSixes': each has a single unique value (see df.nunique() above), so they carry no information.
X = df.drop(['Name', 'AdultWeekend', 'trams', 'fastEight', 'fastSixes'], axis=1)

# Declare a response variable, called y, and assign it the AdultWeekend column of the df 
y = df['AdultWeekend'] 

# Create a StandardScaler from the preprocessing module and fit it to X (this learns each column's mean and standard deviation)
scaler = preprocessing.StandardScaler().fit(X)

# Declare a variable called X_scaled, and assign it the result of calling the transform() method with parameter X 
X_scaled = scaler.transform(X) 
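
To make concrete what the scaling step does, here is a small sketch on a toy array. It checks that `StandardScaler` is equivalent to subtracting each column's mean and dividing by its population standard deviation (`ddof=0`). The array values and the names `X_toy`, `toy_scaler`, and `manual` are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy two-column array (illustrative values on very different scales).
X_toy = np.array([
    [2090.0, 294.0],
    [11100.0, 1800.0],
    [7800.0, 600.0],
    [8012.0, 750.0],
])

toy_scaler = StandardScaler().fit(X_toy)
X_toy_scaled = toy_scaler.transform(X_toy)

# The same computation by hand: z = (x - mean) / std, with np.std's default ddof=0.
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(np.allclose(X_toy_scaled, manual))
print(X_toy_scaled.mean(axis=0).round(6), X_toy_scaled.std(axis=0).round(6))
```

After scaling, every column has mean 0 and standard deviation 1, so no feature dominates purely because of its units.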

## Split into training and testing datasets


Using train_test_split from sklearn.model_selection, create a 75/25 train/test split with y = <b>AdultWeekend</b>. We start with the adult weekend ticket price as the response variable for modeling.

In [13]:
# Import the train_test_split function from the sklearn.model_selection utility.  
from sklearn.model_selection import train_test_split

# Convert the response variable y to a 1-dimensional NumPy array (Series.ravel() is deprecated in recent pandas)
y = y.to_numpy()

# Call the train_test_split() function with the first two parameters set to X_scaled and y 
# Declare four variables, X_train, X_test, y_train and y_test separated by commas 
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.25, random_state=1)
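
The cells above fit the scaler on all rows before splitting, so the test rows contribute to the means and standard deviations used for scaling. One common alternative, in line with the "Apply scaler to the testing set" step in the outline, is to split first, fit the scaler on the training rows only, and then apply it to both sets. The sketch below shows that ordering on synthetic data. The arrays and the names `X_demo`, `y_demo`, and `demo_scaler` are assumptions, not the notebook's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 40 rows, 3 features (not the resort data).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=100.0, scale=20.0, size=(40, 3))
y_demo = rng.normal(loc=60.0, scale=10.0, size=40)

# 1) Split the unscaled data first.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1
)

# 2) Fit the scaler on the training rows only.
demo_scaler = StandardScaler().fit(X_tr)

# 3) Apply the same fitted scaler to both sets.
X_tr_scaled = demo_scaler.transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)
```

With this ordering the training columns have mean 0 exactly, while the test columns are only approximately centered, because their statistics were never seen by the scaler.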