# WHY PIPELINES ARE IMPORTANT
### CONSISTENTLY REFINING YOUR CRUDE DATA INTO FINE FEATURES IS KEY

Sklearn has a wonderful tool called [Pipelines](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html). A Pipeline is an object that takes a list of transformers as a parameter and passes data through them in order. Several prebuilt transformers are available, such as StandardScaler. Others can be found here: [Sklearn Pipeline Transformers](https://scikit-learn.org/stable/data_transforms.html).
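
To make the idea concrete, here is a minimal sketch of a Pipeline that chains a prebuilt transformer with a model. The toy data, the step names (`scale`, `model`) and the choice of `LinearRegression` as the final step are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data, invented for this sketch only.
rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, size=(20, 2))
y_train = X_train @ np.array([1.5, -2.0]) + rng.normal(0, 0.1, size=20)
X_test = rng.normal(3, 1, size=(5, 2))

# Each step is a (name, object) pair; data flows through the steps in order.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LinearRegression()),
])

# fit() learns the scaling statistics and the model from the training data only.
pipe.fit(X_train, y_train)

# predict() scales X_test with the statistics learned during fit, then predicts.
predictions = pipe.predict(X_test)
```

The point to notice is that the scaler is fitted once, on the training data, and the Pipeline reuses those fitted values every time new data passes through it.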

However, these prebuilt transformers can be limiting, or a desired transformation may not be implemented in sklearn. So let's build our own!

At first, the sklearn Pipeline interface and custom transformers can be intimidating, but they don't have to be. 

<img src='images/refinery_pipes.jpg'/>

In [1]:
import pandas as pd
import numpy as np

### Why use pipelines?

Pipelines are useful because they can be trained to process data in a consistent way.
Here are two examples:
1. Standardizing columns consistently
2. Creating Dummy Variables

#### 1. Standardizing Columns Consistently

In [2]:
# create two fake one-column dataframes: training values drawn around 0, test values drawn around 3
training_df = pd.DataFrame(np.random.normal(0, 1, size=5), columns=['value'])
test_df = pd.DataFrame(np.random.normal(3, 1, size=5), columns=['value'])

In [3]:
training_df

Unnamed: 0,value
0,0.15691
1,-1.914345
2,-1.567131
3,-0.933098
4,-0.285161


In [4]:
test_df

Unnamed: 0,value
0,2.144488
1,2.036589
2,4.301508
3,4.632995
4,3.074765


If we want to scale the `value` column of `training_df` before putting it into a machine learning model, and then scale the `value` column of `test_df` for the same model, the naive approach would be to scale the two columns separately:

In [5]:
training_mean = training_df.loc[:,'value'].mean()
training_standard_deviation = training_df.loc[:,'value'].std()
training_column_standardized = (training_df.loc[:,'value'] - training_mean)/ training_standard_deviation

In [6]:
training_column_standardized

0    1.236550
1   -1.167270
2   -0.764307
3   -0.028472
4    0.723499
Name: value, dtype: float64

In [7]:
test_mean = test_df.loc[:,'value'].mean()
test_standard_deviation = test_df.loc[:,'value'].std()
wrong_test_column_standardized = (test_df.loc[:,'value'] - test_mean)/ test_standard_deviation

In [8]:
wrong_test_column_standardized

0   -0.912676
1   -1.002726
2    0.887521
3    1.164171
4   -0.136289
Name: value, dtype: float64

__*What is wrong with this?*__

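The testing data is not being standardized in relation to the training data! Valuable information has been destroyed: the test values were drawn around 3 while the training values were drawn around 0, but after being scaled separately both columns look centered on 0, so the model can no longer see that shift.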

Here is the correct way to standardize the test column, using both the mean and the standard deviation of the training data:

In [9]:
# scale the test data with the statistics learned from the training data
correct_test_column_standardized = (test_df.loc[:,'value'] - training_mean)/ training_standard_deviation

In [10]:
correct_test_column_standardized

0    3.5433
1    3.4180
2    6.0466
3    6.4313
4    4.6229
Name: value, dtype: float64
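
The same "learn from the training data, reuse on the test data" pattern is what sklearn's `StandardScaler` does with `fit` and `transform`. The sketch below uses freshly generated toy data (the seed and variable names are assumptions). One detail to be aware of: `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas' `.std()` defaults to the sample standard deviation (`ddof=1`), so the numbers differ slightly from a hand calculation unless `ddof=0` is passed.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data generated for this sketch; values will differ from the cells above.
rng = np.random.default_rng(42)
training_df = pd.DataFrame(rng.normal(0, 1, size=5), columns=["value"])
test_df = pd.DataFrame(rng.normal(3, 1, size=5), columns=["value"])

scaler = StandardScaler()
scaler.fit(training_df[["value"]])  # learn mean and std from the training data only

training_scaled = scaler.transform(training_df[["value"]])
test_scaled = scaler.transform(test_df[["value"]])  # reuse the training statistics

# Reproduce the scaler by hand: training mean, training std with ddof=0.
train_mean = training_df["value"].mean()
train_std = training_df["value"].std(ddof=0)
manual_test_scaled = (test_df["value"] - train_mean) / train_std
```

Here `test_scaled` and `manual_test_scaled` hold the same numbers, because both rely only on statistics taken from the training data.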

#### 2. Creating Dummy Variables

In [11]:
training_species_df = pd.DataFrame([['cat'], ['dog'], ['bat'], ['mouse'], ['rat']], columns=['species'])

In [12]:
test_species_df = pd.DataFrame([[np.nan], ['dog'], ['rat'], ['cat'], ['kangaroo']], columns=['species'])

In [13]:
training_species_df

Unnamed: 0,species
0,cat
1,dog
2,bat
3,mouse
4,rat


In [14]:
test_species_df

Unnamed: 0,species
0,
1,dog
2,rat
3,cat
4,kangaroo


In [15]:
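# select the column as a dataframe so the dummy columns keep the species_ prefix;
# dtype=int gives 0/1 columns instead of True/False on newer pandas versions
pd.get_dummies(training_species_df.loc[:, ['species']], dtype=int)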

Unnamed: 0,species_bat,species_cat,species_dog,species_mouse,species_rat
0,0,1,0,0,0
1,0,0,1,0,0
2,1,0,0,0,0
3,0,0,0,1,0
4,0,0,0,0,1


In [16]:
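# select the column as a dataframe so the dummy columns keep the species_ prefix;
# dtype=int gives 0/1 columns instead of True/False on newer pandas versions
pd.get_dummies(test_species_df.loc[:, ['species']], dtype=int)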

Unnamed: 0,species_cat,species_dog,species_kangaroo,species_rat
0,0,0,0,0
1,0,1,0,0
2,0,0,0,1
3,1,0,0,0
4,0,0,1,0


__*What is wrong with this?*__

`pd.get_dummies()` is pretty dumb, huh? 

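Our machine learning model will not work correctly with the above dataframes!
- The two dataframes have a different number of columns (five versus four)!
- The columns do not line up: the same position holds a different species in each dataframe!
- The test dummies are missing columns that exist in the training dummies (`species_bat` and `species_mouse`)!
- The test dummies have a column that is not in the training dummies (`species_kangaroo`)!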
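
As a quick illustration that these problems come from `pd.get_dummies()` having no memory of the training data, the sketch below forces the test dummies into the training layout with `DataFrame.reindex`. This is only one possible workaround, shown under the assumption that the training dummy columns are still available; the rest of this section takes a different route and remembers the training categories instead.

```python
import numpy as np
import pandas as pd

training_species_df = pd.DataFrame(
    [["cat"], ["dog"], ["bat"], ["mouse"], ["rat"]], columns=["species"]
)
test_species_df = pd.DataFrame(
    [[np.nan], ["dog"], ["rat"], ["cat"], ["kangaroo"]], columns=["species"]
)

training_dummies = pd.get_dummies(training_species_df[["species"]], dtype=int)
test_dummies = pd.get_dummies(test_species_df[["species"]], dtype=int)

# Keep only the training columns, in training order.
# Columns missing from the test data are added and filled with 0;
# columns unseen in training (species_kangaroo) are dropped.
aligned_test_dummies = test_dummies.reindex(
    columns=training_dummies.columns, fill_value=0
)
```

After alignment, the missing value and the unseen `kangaroo` both become rows of all zeros, and the test columns match the training columns exactly.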

To fix the problems we find in `pd.get_dummies()`, we can create our own code that we can later put into functions or a class.

The following code serves as a demonstration of the concepts we will use later to create a custom Dummifier Class:

In [17]:
# Create an empty dataframe to add columns to:
training_dummy_df = pd.DataFrame()

In [18]:
training_dummy_df

In [19]:
# save the unique values found in the training data (unique() returns an array)
unique_col_values = training_species_df['species'].unique()

In [20]:
# for each item in the list of unique items, make a pd.Series consisting of 0's and 1's 
# where the item is present in the species column of the training_species_df
for item in unique_col_values:
    training_dummy_df[item] = (training_species_df.loc[:,'species'] == item).astype(int)

In [22]:
test_dummy_df = pd.DataFrame()

In [23]:
# for each item in the list of unique items, make a pd.Series consisting of 0's and 1's 
# where the item is present in the species column of the test_species_df
for item in unique_col_values:
    test_dummy_df[item] = (test_species_df.loc[:,'species'] == item).astype(int)

In [24]:
test_dummy_df

Unnamed: 0,cat,dog,bat,mouse,rat
0,0,0,0,0,0
1,0,1,0,0,0
2,0,0,0,0,1
3,1,0,0,0,0
4,0,0,0,0,0
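
The two loops above can be folded into a class that follows sklearn's `fit`/`transform` convention. What follows is a rough sketch, not a finished implementation: the class name `SimpleDummifier` and its `column` parameter are hypothetical, and dropping missing values during `fit` is an assumption about the desired behavior.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class SimpleDummifier(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: one 0/1 column per category seen during fit."""

    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        # Remember the categories found in the training data, in order of appearance.
        self.categories_ = list(X[self.column].dropna().unique())
        return self

    def transform(self, X):
        # Build the same columns, in the same order, for any dataframe.
        dummies = {
            category: (X[self.column] == category).astype(int)
            for category in self.categories_
        }
        return pd.DataFrame(dummies, index=X.index)


training_species_df = pd.DataFrame(
    [["cat"], ["dog"], ["bat"], ["mouse"], ["rat"]], columns=["species"]
)
test_species_df = pd.DataFrame(
    [[np.nan], ["dog"], ["rat"], ["cat"], ["kangaroo"]], columns=["species"]
)

dummifier = SimpleDummifier(column="species")
training_dummy_df = dummifier.fit_transform(training_species_df)  # learns the categories
test_dummy_df = dummifier.transform(test_species_df)              # reuses them
```

Because the categories are stored on the object during `fit`, the test dataframe always comes out with the training columns, and unseen or missing species become rows of zeros, just like in the loop version.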
