# Capstone Two - Pre-processing and Training Data Development

The enclosed represents Chapter 16.3 of the Springboard Data Scientist Career Track. The structure is as follows:
   * Creating Dummy Variables
   * Splitting the Data into Training & Testing subsets for Machine Learning
   * Standardized Scaling

I hope this submission shows that I understand when to apply the proper steps.

This code was built in a Jupyter Notebook & uploaded to GitHub.


# Pre work

**Importing the relevant libraries to start**

In [160]:
# Import the required libraries
import quandl
from fredapi import Fred
from getpass import getpass
import investpy

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors as colors
import seaborn as sns
import os
import lxml
import datetime
import requests
import json
import collections
from scipy import stats
import numpy as np


**Setting up the date time stamps.**

In [161]:
# These will be needed later

now = datetime.datetime.now()
year = now.strftime("%Y")
month = now.strftime("%m")
day = now.strftime("%d")
today_y_m_d_dash = now.strftime("%Y-%m-%d")
today_d_m_y_dash = now.strftime("%d/%m/%Y")


**Establishing connections to the relevant Application Programming Interfaces ( if applicable ).**

In [162]:
# Importing Quandl API requires a password
# If you don't have one, please see the link below
# https://docs.quandl.com/docs#section-authentication

my_quandl_API = getpass()

········


In [163]:
quandl.ApiConfig.api_key = my_quandl_API

In [164]:
# Importing FRED API requires a password ( FRED stands for Federal Reserve Economic Data )
# If you don't have one, please see the link below
# https://fred.stlouisfed.org/docs/api/fred/

my_FRED_API = getpass()

········


In [165]:
fred = Fred(api_key=my_FRED_API)

**Establishing a color scale for heatmaps which will be used later.**

In [166]:
cdict = {'green':  ((0.0, 0.0, 0.0),   # no green at 0
                  (0.5, 1.0, 1.0),   # all channels set to 1.0 at 0.5 to create white
                  (1.0, 0.8, 0.8)),  # set to 0.8 so it's not too bright at 1

        'red': ((0.0, 0.8, 0.8),   # set to 0.8 so it's not too bright at 0
                  (0.5, 1.0, 1.0),   # all channels set to 1.0 at 0.5 to create white
                  (1.0, 0.0, 0.0)),  # no red at 1

        'blue':  ((0.0, 0.0, 0.0),   # no blue at 0
                  (0.5, 1.0, 1.0),   # all channels set to 1.0 at 0.5 to create white
                  (1.0, 0.0, 0.0))   # no blue at 1
       }

# Create the colormap using the dictionary
GnRd = colors.LinearSegmentedColormap('GnRd', cdict)
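
A quick way to sanity-check a segmented colormap is to sample it at its two ends and its midpoint. The sketch below rebuilds the same channel definitions under a separate, illustrative name ( `cmap_check` is not part of the notebook ) and prints the RGBA values; the low end should come out red, the middle close to white & the high end green.

```python
import matplotlib.colors as colors

# Same channel definitions as the notebook's cdict:
# each tuple is (position, value just below, value just above)
cdict_check = {
    'red':   ((0.0, 0.8, 0.8), (0.5, 1.0, 1.0), (1.0, 0.0, 0.0)),
    'green': ((0.0, 0.0, 0.0), (0.5, 1.0, 1.0), (1.0, 0.8, 0.8)),
    'blue':  ((0.0, 0.0, 0.0), (0.5, 1.0, 1.0), (1.0, 0.0, 0.0)),
}
cmap_check = colors.LinearSegmentedColormap('GnRd', cdict_check)

# Calling a colormap with a float in [0, 1] returns an (R, G, B, A) tuple
low_rgba = cmap_check(0.0)
mid_rgba = cmap_check(0.5)
high_rgba = cmap_check(1.0)
print(low_rgba, mid_rgba, high_rgba)
```

In a heatmap this means the most negative values render red & the most positive values render green.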


# 1.0 Creating Dummy Variables

We will import the dataframe from the previous Exploratory Data Analysis ( EDA ) section.

In [168]:
df = pd.read_csv('./_Capstone_One_Inflation/data/1.0_MAIN/QonQ_main_roll.csv')

As a reminder, the dataframe is composed of the quarterly change ( looking back ) in both Inflation & the Variables. The variables, however, use a rolling average. The rationale is that a variable may have had a bad day / week at the end of its respective term. If so, a point-to-point change may not properly display the impact it had on inflation.
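
To illustrate the idea, here is a minimal sketch of a quarterly change computed on a rolling average versus on raw month-end values. The prices are made up, and the 3-month window is an assumption for illustration; the window actually used in the earlier EDA notebook may differ.

```python
import pandas as pd

# Hypothetical month-end prices for a single variable (made-up numbers)
prices = pd.Series(
    [100.0, 102.0, 101.0, 105.0, 107.0, 110.0, 108.0, 112.0],
    index=pd.date_range('2020-01-31', periods=8, freq=pd.offsets.MonthEnd()),
    name='Example',
)

# Smooth first (assumed 3-month window), then compare with the value 3 months earlier
rolling_avg = prices.rolling(window=3).mean()
qoq_change = rolling_avg / rolling_avg.shift(3) - 1

# Point-to-point alternative: sensitive to a single bad month-end print
raw_change = prices / prices.shift(3) - 1

print(pd.DataFrame({'rolling': qoq_change, 'raw': raw_change}))
```

The rolling version needs more history before it produces a value, which is the price paid for the smoothing.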

In [170]:
df.head()

Unnamed: 0,Date,Inflation,Wages CPI,WTI,Copper,Soybeans,Natural Gas,Heating Oil,Corn,Wheat,...,Lean Hogs,Sugar,Lumber,Capacity Utilization,GDP,M2 Velocity,PMI,USD Index,Initial Jobless Claims,Unemployment Rate
0,2021-02-28 00:00:00,0.501,0.00763,0.294725,0.182753,0.235484,0.039364,0.362725,0.267969,0.081492,...,0.071797,0.153711,0.364326,0.019749,0.015004,-0.012195,2.6,-0.028476,32916.66667,-0.733333
1,2021-01-31 00:00:00,0.218,0.006251,0.133619,0.149469,0.247356,0.133913,0.212935,0.24438,0.075828,...,0.083947,0.167539,0.119677,0.026555,0.015004,-0.012195,2.266667,-0.021939,-17083.33333,-1.133333
2,2020-12-31 00:00:00,-0.009,0.006591,0.045217,0.11095,0.212504,0.331085,0.075287,0.227807,0.121747,...,0.251831,0.183723,0.017824,0.022883,0.015004,-0.012195,4.0,-0.020209,-234000.0,-2.033333
3,2020-11-30 00:00:00,-0.135,0.009198,0.012381,0.111624,0.189713,0.400814,-0.024694,0.194635,0.137836,...,0.340989,0.145086,0.352576,0.030228,0.015004,-0.012195,3.566667,-0.022875,-412333.3333,-2.766667
4,2020-10-31 00:00:00,0.196,0.012842,0.268018,0.150999,0.135852,0.421347,0.081213,0.128424,0.11224,...,0.22121,0.164845,0.733392,0.063448,0.015004,-0.012195,7.033333,-0.042067,-604333.3333,-3.833333


Per below, the data has been cleaned ( i.e. no null values ) & every variable came in from the API as a float ( which we want ). `Date` is the only non-float column.

In [171]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 312 entries, 0 to 311
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Date                    312 non-null    object 
 1   Inflation               312 non-null    float64
 2   Wages CPI               312 non-null    float64
 3   WTI                     312 non-null    float64
 4   Copper                  312 non-null    float64
 5   Soybeans                312 non-null    float64
 6   Natural Gas             312 non-null    float64
 7   Heating Oil             312 non-null    float64
 8   Corn                    312 non-null    float64
 9   Wheat                   312 non-null    float64
 10  Cattle                  312 non-null    float64
 11  Lean Hogs               312 non-null    float64
 12  Sugar                   312 non-null    float64
 13  Lumber                  312 non-null    float64
 14  Capacity Utilization    312 non-null    fl

In [172]:
df.shape

(312, 21)

In [173]:
df.describe()

Unnamed: 0,Inflation,Wages CPI,WTI,Copper,Soybeans,Natural Gas,Heating Oil,Corn,Wheat,Cattle,Lean Hogs,Sugar,Lumber,Capacity Utilization,GDP,M2 Velocity,PMI,USD Index,Initial Jobless Claims,Unemployment Rate
count,312.0,312.0,312.0,312.0,312.0,312.0,312.0,312.0,312.0,312.0,312.0,312.0,312.0,312.0,312.0,312.0,312.0,312.0,312.0,312.0
mean,-0.030804,0.005489,0.020214,0.01981,0.012885,0.035417,0.019414,0.014929,0.016604,0.006404,0.018066,0.019915,0.028027,-0.000384,0.010657,-0.004439,0.073184,0.000752,3685.096,0.00609
std,0.827474,0.0058,0.150741,0.120212,0.098168,0.197019,0.137546,0.122589,0.109806,0.066072,0.140843,0.147224,0.157614,0.017885,0.014045,0.022552,2.960773,0.034428,303492.4,0.880617
min,-4.846,-0.034864,-0.528002,-0.4902,-0.327256,-0.378821,-0.456848,-0.340215,-0.299369,-0.209991,-0.322853,-0.36217,-0.318534,-0.128528,-0.094662,-0.200725,-11.833333,-0.07218,-2430333.0,-4.266667
25%,-0.38775,0.003967,-0.057181,-0.050026,-0.041698,-0.08885,-0.050995,-0.056145,-0.055938,-0.039485,-0.075435,-0.080698,-0.07357,-0.004309,0.008557,-0.007373,-1.7,-0.024314,-9958.333,-0.2
50%,-0.0255,0.0063,0.025297,0.009065,0.006706,0.007933,0.020651,0.005665,0.006844,0.011647,0.007895,-0.010051,0.016908,0.001742,0.011626,-0.002556,-0.1,0.000669,-2625.0,-0.066667
75%,0.33325,0.008169,0.115388,0.086206,0.06149,0.157548,0.104415,0.076297,0.078351,0.055361,0.113479,0.100615,0.1122,0.006261,0.014817,0.003964,1.675,0.021954,5166.667,0.066667
max,4.007,0.02337,0.586463,0.496467,0.288663,0.629854,0.362725,0.475869,0.443412,0.165321,0.376706,0.604139,0.931183,0.096035,0.084535,0.040798,10.666667,0.118429,3365750.0,9.266667


In [174]:
df_dummies = pd.get_dummies( df, columns=['Inflation'], prefix='D' )

In [175]:
df_dummies.head(2)

Unnamed: 0,Date,Wages CPI,WTI,Copper,Soybeans,Natural Gas,Heating Oil,Corn,Wheat,Cattle,...,D_1.4609999999999999,D_1.5290000000000001,D_1.5319999999999998,D_1.663,D_1.9140000000000001,D_2.157,D_2.336,D_2.8089999999999997,D_3.322,D_4.007
0,2021-02-28 00:00:00,0.00763,0.294725,0.182753,0.235484,0.039364,0.362725,0.267969,0.081492,0.056126,...,0,0,0,0,0,0,0,0,0,0
1,2021-01-31 00:00:00,0.006251,0.133619,0.149469,0.247356,0.133913,0.212935,0.24438,0.075828,0.052881,...,0,0,0,0,0,0,0,0,0,0


In [176]:
df.head()

Unnamed: 0,Date,Inflation,Wages CPI,WTI,Copper,Soybeans,Natural Gas,Heating Oil,Corn,Wheat,...,Lean Hogs,Sugar,Lumber,Capacity Utilization,GDP,M2 Velocity,PMI,USD Index,Initial Jobless Claims,Unemployment Rate
0,2021-02-28 00:00:00,0.501,0.00763,0.294725,0.182753,0.235484,0.039364,0.362725,0.267969,0.081492,...,0.071797,0.153711,0.364326,0.019749,0.015004,-0.012195,2.6,-0.028476,32916.66667,-0.733333
1,2021-01-31 00:00:00,0.218,0.006251,0.133619,0.149469,0.247356,0.133913,0.212935,0.24438,0.075828,...,0.083947,0.167539,0.119677,0.026555,0.015004,-0.012195,2.266667,-0.021939,-17083.33333,-1.133333
2,2020-12-31 00:00:00,-0.009,0.006591,0.045217,0.11095,0.212504,0.331085,0.075287,0.227807,0.121747,...,0.251831,0.183723,0.017824,0.022883,0.015004,-0.012195,4.0,-0.020209,-234000.0,-2.033333
3,2020-11-30 00:00:00,-0.135,0.009198,0.012381,0.111624,0.189713,0.400814,-0.024694,0.194635,0.137836,...,0.340989,0.145086,0.352576,0.030228,0.015004,-0.012195,3.566667,-0.022875,-412333.3333,-2.766667
4,2020-10-31 00:00:00,0.196,0.012842,0.268018,0.150999,0.135852,0.421347,0.081213,0.128424,0.11224,...,0.22121,0.164845,0.733392,0.063448,0.015004,-0.012195,7.033333,-0.042067,-604333.3333,-3.833333


We knew going into the process that the entire data set is composed of non-categorical data ( floats ) & the dummy variable output above confirms it: because `Inflation` is continuous, `get_dummies` creates one column per distinct value, which adds no useful information.

**We will proceed forward without the dummy variables.**
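
For contrast, the sketch below shows what `get_dummies` does to a continuous column versus a genuinely categorical one. The `Quarter` column is hypothetical & is not part of this data set.

```python
import pandas as pd

toy = pd.DataFrame({
    'Inflation': [0.501, 0.218, -0.009, -0.135],   # continuous values
    'Quarter': ['Q1', 'Q1', 'Q4', 'Q4'],           # hypothetical categorical column
})

# Continuous column: one dummy per distinct value, so four new columns for four rows
dummies_continuous = pd.get_dummies(toy, columns=['Inflation'], prefix='D')

# Categorical column: one dummy per category, so only two new columns
dummies_categorical = pd.get_dummies(toy, columns=['Quarter'], prefix='Q')

print(dummies_continuous.shape, dummies_categorical.shape)
```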

# 2.0 Split data into training and testing subsets

We will now undertake the train / test split. Please note, there are three (3) scaling approaches ( next section ) & each one will have its own data frame. Listed below is a summary of each scaling approach & the name of its data frame, which we will formalize in the next section. The train / test split for each is set up here:

   * **MinMaxScaler** ( often called Normalization )
      * This approach rescales each feature to the range 0 -> 1
      * Data Frame Name  |  `df_MM_only`
   * **Standardization**
      * This approach subtracts the mean of each feature ( so the mean becomes Zero ) & divides by the standard deviation, so values are expressed in standard deviations from the mean
      * Data Frame Name  |  `df_SS_only`
   * **Log Transformation**
      * This approach is usually used with data that has long tails
      * Data Frame Name  |  `df_Log_only`

The creation of these train / test splits will start here but before we proceed we will need to import the necessary libraries.

In [177]:
# from library.sb_utils import save_file
from sklearn import __version__ as sklearn_version
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV, learning_curve
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_regression


**2.1 Train test split**

To begin, let's see what a `70% / 30% Train/Test Split` looks like.

In [178]:
print(' Train |', len(df) * 0.7, '\n','Test  |', len(df) * 0.3)

 Train | 218.39999999999998 
 Test  | 93.6


In [179]:
X_train, X_test, y_train, y_test = train_test_split(df.drop(columns='Inflation'),
                                                    df.Inflation, test_size=0.3,
                                                    random_state=47
                                                   )

In [180]:
X_train.shape, X_test.shape

((218, 20), (94, 20))

Looks like the fractional rows were resolved by 'rounding': the Test Set was rounded up from 93.6 to 94 & the Train Set took the remaining 218.
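
The arithmetic can be checked on a stand-in array with the same number of rows. This sketch assumes the test count is rounded up & the training set receives whatever remains, which is consistent with the shapes shown above; `X_demo` is a made-up placeholder, not the real data.

```python
import math
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 312
test_size = 0.3

n_test = math.ceil(test_size * n_rows)   # 93.6 rounds up to 94
n_train = n_rows - n_test                # the remainder goes to training

# Placeholder feature matrix with one column and 312 rows
X_demo = np.arange(n_rows).reshape(-1, 1)
X_tr_demo, X_te_demo = train_test_split(X_demo, test_size=test_size, random_state=47)

print(n_train, n_test, X_tr_demo.shape, X_te_demo.shape)
```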

In [181]:
y_train.shape, y_test.shape

((218,), (94,))

In [182]:
date_list = ['Date']
date_train = X_train[['Date']]
date_test = X_test[['Date']]

X_train.drop(columns=date_list, inplace=True)
X_test.drop(columns=date_list, inplace=True)
X_train.shape, X_test.shape

((218, 19), (94, 19))

In [183]:
X_train.dtypes

Wages CPI                 float64
WTI                       float64
Copper                    float64
Soybeans                  float64
Natural Gas               float64
Heating Oil               float64
Corn                      float64
Wheat                     float64
Cattle                    float64
Lean Hogs                 float64
Sugar                     float64
Lumber                    float64
Capacity Utilization      float64
GDP                       float64
M2 Velocity               float64
PMI                       float64
USD Index                 float64
Initial Jobless Claims    float64
Unemployment Rate         float64
dtype: object

In [184]:
X_test.dtypes

Wages CPI                 float64
WTI                       float64
Copper                    float64
Soybeans                  float64
Natural Gas               float64
Heating Oil               float64
Corn                      float64
Wheat                     float64
Cattle                    float64
Lean Hogs                 float64
Sugar                     float64
Lumber                    float64
Capacity Utilization      float64
GDP                       float64
M2 Velocity               float64
PMI                       float64
USD Index                 float64
Initial Jobless Claims    float64
Unemployment Rate         float64
dtype: object

You have only numeric features in your X now!

**2.1 Initial Not-Even Model**

We will now start to see how good the mean is as a predictor. In other words, what if your best guess for inflation is simply its average?

In [185]:
train_mean = y_train.mean()
round(train_mean,4)

-0.0348

In [186]:
df_inf_mean = round(df['Inflation'].mean(),4)
df_inf_mean

-0.0308

In [187]:
diff = ( df_inf_mean * -1 ) - ( train_mean * -1 )
diff

-0.004025688073394491

How good is this? How closely does this match, or explain, the actual values? There are many ways of assessing how well one set of values agrees with another, which brings us to the subject of metrics.

That said, the training mean is close to the full-sample mean: they differ by only about 0.004.

In [188]:
dumb_reg = DummyRegressor(strategy='mean')
dumb_reg.fit(X_train, y_train)
dumb_reg.constant_

array([[-0.03482569]])

**2.2 Metrics**

**2.2.1 R-squared or coefficient of determination**
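
For reference, the standard definition compares the residual sum of squares with the total sum of squares around the mean $\bar{y}$ of the observed values:

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$

A predictor that always returns the mean of the values it is scored against gets exactly 0, and anything that fits worse than that mean gets a negative value.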

In [189]:
def r_squared(y, ypred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ybar = np.sum(y) / len(y)              # mean of the observed values
    sum_sq_tot = np.sum((y - ybar)**2)     # total sum of squares
    sum_sq_res = np.sum((y - ypred)**2)    # residual sum of squares
    R2 = 1.0 - sum_sq_res / sum_sq_tot     # ratio, not a second subtraction
    return R2

In [190]:
y_tr_pred_ = train_mean * np.ones(len(y_train))
y_tr_pred_[:5]

array([-0.03482569, -0.03482569, -0.03482569, -0.03482569, -0.03482569])

In [191]:
y_tr_pred = dumb_reg.predict(X_train)
y_tr_pred[:5]

array([-0.03482569, -0.03482569, -0.03482569, -0.03482569, -0.03482569])

The `DummyRegressor` produces exactly the same results and saves us having to mess about broadcasting the mean to an array of the appropriate length. It also gives us an object with `fit()` and `predict()` methods, so we can use it as conveniently as any other `sklearn` estimator.

In [192]:
r_squared(y_train, y_tr_pred)

0.0

In [193]:
y_te_pred = train_mean * np.ones(len(y_test))
r_squared(y_test, y_te_pred)

-0.0002880609661533029

**2.2.2 Mean Absolute Error**

Simply speaking we are taking the average of the absolute errors:
$$MAE = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$$

In [194]:
def mae(y, ypred):
    abs_error = np.abs( y - ypred )
    mae = np.mean(abs_error)
    return mae

In [195]:
mae(y_train, y_tr_pred)

0.5430366972477064

In [196]:
mae(y_test, y_te_pred)

0.5542122779621316

**2.2.3 Mean Squared Error**

Another common metric (and an important one internally for optimizing machine learning models) is the mean squared error. This is simply the average of the square of the errors:

$$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

In [197]:
def mse(y,ypred):
    sq_error = ( y - ypred )**2
    mse = np.mean(sq_error)
    return mse

In [198]:
mse(y_train, y_tr_pred)

0.7101061531015902

In [199]:
mse(y_test, y_te_pred)

0.6185939867578665

In [200]:
np.sqrt([mse(y_train, y_tr_pred), mse(y_test, y_te_pred)])

array([0.84267797, 0.78650746])

**2.3 sklearn metrics**

To make it easier to compute, `sklearn.metrics` provides many commonly used metrics, including the ones above.
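
As a small cross-check, the sketch below computes the three metrics by hand on made-up arrays & compares them with the `sklearn.metrics` functions. The numbers are illustrative only & are unrelated to the inflation data.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# Made-up observed and predicted values
y_true = np.array([0.5, 0.2, -0.1, -0.3, 0.4])
y_hat = np.array([0.4, 0.1, 0.0, -0.2, 0.1])

resid = y_true - y_hat
manual_mae = np.mean(np.abs(resid))
manual_mse = np.mean(resid ** 2)
manual_r2 = 1.0 - np.sum(resid ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# The hand-rolled values should agree with sklearn up to floating-point error
print(np.isclose(manual_mae, mean_absolute_error(y_true, y_hat)))
print(np.isclose(manual_mse, mean_squared_error(y_true, y_hat)))
print(np.isclose(manual_r2, r2_score(y_true, y_hat)))
```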

**2.3.1 R-squared**

In [201]:
r2_score(y_train, y_tr_pred), r2_score(y_test, y_te_pred)

(0.0, -0.0002880609661533029)

**2.3.2 Mean Absolute Error**

In [202]:
mean_absolute_error(y_train, y_tr_pred), mean_absolute_error(y_test, y_te_pred)

(0.5430366972477064, 0.5542122779621317)

**2.3.3 Mean Squared Error**

In [203]:
mean_squared_error(y_train, y_tr_pred), mean_squared_error(y_test, y_te_pred)

(0.7101061531015906, 0.6185939867578666)

**MinMaxScaler** | *Assignment to its respective data frames.*

In [204]:
X_train_MM = pd.DataFrame(X_train)
X_test_MM = pd.DataFrame(X_test)
y_train_MM = pd.DataFrame(y_train)
y_test_MM = pd.DataFrame(y_test)
y_tr_pred_MM = pd.DataFrame(y_tr_pred)

**Standardization** | *Assignment to its respective data frames.*

In [205]:
X_train_SS = pd.DataFrame(X_train)
X_test_SS = pd.DataFrame(X_test)
y_train_SS = pd.DataFrame(y_train)
y_test_SS = pd.DataFrame(y_test)
y_tr_pred_SS = pd.DataFrame(y_tr_pred)

**Log Transformation** | *Assignment to its respective data frames.*

In [206]:
X_train_Log = pd.DataFrame(X_train)
X_test_Log = pd.DataFrame(X_test)
y_train_Log = pd.DataFrame(y_train)
y_test_Log = pd.DataFrame(y_test)
y_tr_pred_Log = pd.DataFrame(y_tr_pred)

# 3.0 Scale standardization

Now that the train / test splits have been completed, we will implement the three (3) scaling approaches listed below, each with a basic summary. Their data frames were created at the end of the previous section:

   * **MinMaxScaler** ( often called Normalization )
      * This approach rescales each feature to the range 0 -> 1
   * **Standardization**
      * This approach subtracts the mean of each feature ( so the mean becomes Zero ) & divides by the standard deviation, so values are expressed in standard deviations from the mean
   * **Log Transformation**
      * This approach is usually used with data that has long tails
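
The three approaches can be compared side by side on a single made-up, long-tailed feature. One caveat worth flagging: a plain logarithm is undefined for zero & negative values, which this data set contains, so the sketch uses `PowerTransformer` with the Yeo-Johnson method as a log-like stand-in. That choice is an assumption for illustration, not necessarily what the Log Transformation section will use.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, PowerTransformer

rng = np.random.default_rng(0)
# One hypothetical long-tailed feature that includes negative values
feature = (rng.lognormal(mean=0.0, sigma=1.0, size=200) - 1.0).reshape(-1, 1)

mm = MinMaxScaler().fit_transform(feature)      # rescaled to the range 0 -> 1
ss = StandardScaler().fit_transform(feature)    # mean 0, standard deviation 1
# Yeo-Johnson: a power transform that accepts zero and negative inputs
pt = PowerTransformer(method='yeo-johnson').fit_transform(feature)

print('MinMax   min/max :', mm.min(), mm.max())
print('Standard mean/std:', ss.mean(), ss.std())
print('Power    mean/std:', pt.mean(), pt.std())
```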

Before we proceed we will need to import the necessary libraries.

In [207]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PowerTransformer

# 3.1 MinMaxScaler

In [208]:
pipe_MM = make_pipeline(
    SimpleImputer(strategy='median'), 
    MinMaxScaler(), 
    LinearRegression()
)

In [209]:
type(pipe_MM)

sklearn.pipeline.Pipeline

In [210]:
hasattr(pipe_MM, 'fit'), hasattr(pipe_MM, 'predict')

(True, True)

In [211]:
# X_train_MM = pd.DataFrame(X_train)
# X_test_MM = pd.DataFrame(X_test)
# y_train_MM = pd.DataFrame(y_train)
# y_test_MM = pd.DataFrame(y_test)
# y_tr_pred_MM = pd.DataFrame(y_tr_pred)

Fit the pipeline

In [212]:
pipe_MM.fit(X_train_MM, y_train_MM)

Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')),
                ('minmaxscaler', MinMaxScaler()),
                ('linearregression', LinearRegression())])

Make predictions on the train and test sets

In [213]:
y_tr_pred_MM = pipe_MM.predict(X_train_MM)
y_te_pred_MM = pipe_MM.predict(X_test_MM)

Assess performance

In [214]:
r2_score(y_train, y_tr_pred_MM), r2_score(y_test, y_te_pred_MM)

(0.4241012214816976, 0.327593460165006)

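**With an R-squared of about 0.42 on the `Training` set & 0.33 on the `Testing` set, the model explains well under half of the variance in inflation, though it does beat the mean-only baseline ( R-squared of roughly 0 ).**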
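
R-squared is best read against the mean-only baseline ( roughly 0 ) rather than against 50%. The sketch below makes that comparison on synthetic data; every name & number in it is made up for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(47)
# Synthetic features and a noisy linear target (not the inflation data)
X_syn = rng.normal(size=(300, 5))
y_syn = X_syn @ np.array([1.0, -0.5, 0.0, 0.3, 0.0]) + rng.normal(scale=1.0, size=300)

X_a, X_b, y_a, y_b = train_test_split(X_syn, y_syn, test_size=0.3, random_state=47)

# Baseline: always predict the training mean
baseline = DummyRegressor(strategy='mean').fit(X_a, y_a)
# Model: scale, then fit a linear regression
model = make_pipeline(MinMaxScaler(), LinearRegression()).fit(X_a, y_a)

baseline_r2 = r2_score(y_b, baseline.predict(X_b))
model_r2 = r2_score(y_b, model.predict(X_b))
print(baseline_r2, model_r2)
```

A constant prediction can never score above 0, so any positive test score is an improvement over guessing the mean.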

# 3.2 Standardization

In [215]:
# X_train_SS = pd.DataFrame(X_train)
# X_test_SS = pd.DataFrame(X_test)
# y_train_SS = pd.DataFrame(y_train)
# y_test_SS = pd.DataFrame(y_test)
# y_tr_pred_SS = pd.DataFrame(y_tr_pred)

In [216]:
pipe_SS = make_pipeline(
    SimpleImputer(strategy='median'), 
    StandardScaler(),
    SelectKBest(f_regression),
    LinearRegression()
)

In [217]:
# Pass y as a 1-D array to avoid the column-vector warning
pipe_SS.fit(X_train_SS, y_train_SS.values.ravel())

Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')),
                ('standardscaler', StandardScaler()),
                ('selectkbest',
                 SelectKBest(score_func=<function f_regression at 0x000002039C2D3550>)),
                ('linearregression', LinearRegression())])

Now fit a `StandardScaler` on the training split & use its `transform()` method to apply the scaling to both the train and test splits.

In [219]:
scaler = StandardScaler()
scaler.fit(X_train_SS)
X_tr_scaled_SS = scaler.transform(X_train_SS)
X_te_scaled_SS = scaler.transform(X_test_SS)

Training the model on the train split

In [220]:
lm = LinearRegression().fit(X_tr_scaled_SS, y_train)

Making predictions using the model on both train and test splits.

In [221]:
y_tr_pred_SS = lm.predict(X_tr_scaled_SS)
y_te_pred_SS = lm.predict(X_te_scaled_SS)

Assessing the model's performance.

In [242]:
median_r2_SS = r2_score(y_train, y_tr_pred_SS), r2_score(y_test, y_te_pred_SS)
median_r2_SS

(0.41581843221746717, 0.327593460165006)

**The `Testing` score matches the MinMaxScaler pipeline & the `Training` score is nearly the same ( about 0.42 ); the model still explains well under half of the variance in inflation.**
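
One reason the choice of scaler makes so little difference here: for an ordinary least squares fit with an intercept, linearly rescaling each feature changes the coefficients but not the fitted predictions. The sketch below illustrates this on synthetic data ( all names are made up ); a regularized model such as ridge regression would not share this property.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)
# Synthetic features on very different scales (made-up data)
X_syn = rng.normal(size=(200, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])
y_syn = X_syn @ np.array([1.0, 0.1, 0.01, 0.001]) + rng.normal(scale=0.5, size=200)

X_fit, X_new = X_syn[:150], X_syn[150:]
y_fit = y_syn[:150]

pred_minmax = make_pipeline(MinMaxScaler(), LinearRegression()).fit(X_fit, y_fit).predict(X_new)
pred_standard = make_pipeline(StandardScaler(), LinearRegression()).fit(X_fit, y_fit).predict(X_new)

# Largest disagreement between the two sets of predictions
max_gap = np.max(np.abs(pred_minmax - pred_standard))
print(max_gap)
```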

**3.2.2 Defining a new pipeline to select a different number of features**

In [271]:
pipe15_SS = make_pipeline(
    SimpleImputer(strategy='median'), 
    StandardScaler(),
    SelectKBest(f_regression, k=15),
    LinearRegression()
)

In [273]:
pipe15_SS.fit(X_train_SS, y_train_SS.values.ravel())

Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')),
                ('standardscaler', StandardScaler()),
                ('selectkbest',
                 SelectKBest(k=15,
                             score_func=<function f_regression at 0x000002039C2D3550>)),
                ('linearregression', LinearRegression())])

**3.2.3 Assess performance on train and test data**

In [274]:
y_tr_pred_SS = pipe15_SS.predict(X_train_SS)
y_te_pred_SS = pipe15_SS.predict(X_test_SS)

**3.2.4 Assessing performance using cross-validation**

In [255]:
cv_results_SS = cross_validate(pipe_SS, X_train_SS, y_train_SS.values.ravel(), cv=5)

In [256]:
cv_scores_SS = cv_results_SS['test_score']
cv_scores_SS

**3.2.5 Hyperparameter search using GridSearchCV**

Pulling the above together, we have:
* a pipeline that
    * imputes missing values
    * scales the data
    * selects the k best features
    * trains a linear regression model
* a technique (cross-validation) for estimating model performance

Now use cross-validation to evaluate multiple values of k & pick the value that gives the best performance. `make_pipeline` automatically names each step after the lowercased class name, and a step's parameters are accessed by appending a double underscore followed by the parameter name. Here the step is named 'selectkbest' and the parameter is 'k', so the grid key is 'selectkbest__k'.

You can also list the names of all the parameters in a pipeline like this:

In [264]:
pipe_SS.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'simpleimputer', 'standardscaler', 'selectkbest', 'linearregression', 'simpleimputer__add_indicator', 'simpleimputer__copy', 'simpleimputer__fill_value', 'simpleimputer__missing_values', 'simpleimputer__strategy', 'simpleimputer__verbose', 'standardscaler__copy', 'standardscaler__with_mean', 'standardscaler__with_std', 'selectkbest__k', 'selectkbest__score_func', 'linearregression__copy_X', 'linearregression__fit_intercept', 'linearregression__n_jobs', 'linearregression__normalize'])

In [265]:
k = [k+1 for k in range(len(X_train_SS.columns))]
grid_params_SS = {'selectkbest__k': k}

In [266]:
lr_grid_cv_SS = GridSearchCV(pipe_SS, param_grid=grid_params_SS, cv=5, n_jobs=-1)

In [270]:
lr_grid_cv_SS.fit(X_train_SS, y_train_SS.values.ravel())

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('simpleimputer',
                                        SimpleImputer(strategy='median')),
                                       ('standardscaler', StandardScaler()),
                                       ('selectkbest',
                                        SelectKBest(score_func=<function f_regression at 0x000002039C2D3550>)),
                                       ('linearregression',
                                        LinearRegression())]),
             n_jobs=-1,
             param_grid={'selectkbest__k': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
                                            12, 13, 14, 15, 16, 17, 18, 19]})
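
Once a search has been fitted, its results live in attributes that end in a single trailing underscore, such as `cv_results_` and `best_params_`. The sketch below runs the same kind of search on synthetic data ( made-up names & numbers ) to show how the per-k scores & the winning k can be read back.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Six synthetic features, of which only the first two drive the target
X_syn = rng.normal(size=(200, 6))
y_syn = 2.0 * X_syn[:, 0] - 1.0 * X_syn[:, 1] + rng.normal(scale=0.5, size=200)

pipe_demo = make_pipeline(StandardScaler(), SelectKBest(f_regression), LinearRegression())
param_grid = {'selectkbest__k': list(range(1, X_syn.shape[1] + 1))}

search = GridSearchCV(pipe_demo, param_grid=param_grid, cv=5)
search.fit(X_syn, y_syn)

# One entry per candidate k, in the order given by the grid
mean_scores = search.cv_results_['mean_test_score']
std_scores = search.cv_results_['std_test_score']
best_k = search.best_params_['selectkbest__k']
print(best_k, mean_scores.round(3))
```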

In [278]:
score_mean = lr_grid_cv_SS.cv_results_['mean_test_score']
score_std = lr_grid_cv_SS.cv_results_['std_test_score']
cv_k = [k for k in lr_grid_cv_SS.cv_results_['param_selectkbest__k']]

# 3.3 Log Transformation

# Questions & Answers

**Question |** Does my data set have any categorical data, such as Gender or day of the week? 

**Answer |** No, it did not have categorical data from the beginning.

**Question |** Do my features have data values that range from 0 - 100 or 0-1 or both and more?

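**Answer |** Neither. Per `df.describe()`, the features sit on very different scales: most are fractional changes within roughly -0.5 to 1.0, `PMI` runs from about -11.8 to 10.7 & `Initial Jobless Claims` runs from about -2.4 million to 3.4 million. This is why scaling is applied before modeling.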