# Foundations of Machine Learning Models

## Assignment: Custom Transformer and Transformation Pipeline

#### Name: Richard Lee

### Details
Your task in this assignment is to create a custom transformation pipeline that takes in raw data and returns fully prepared, clean data that is ready for model training.  However, we will not actually train any models in this assignment.  This pipeline will employ an imputer class, a user-defined transformer class, and a data-normalization class.

Please note that the order of features in the final feature matrix must be correct.  The figure below illustrates the input and output of the transformation pipeline.  The positions of features $x_1$ and $x_2$ do not change: they remain in the first and second columns, respectively, both before and after the transformation pipeline.  In the transformed dataset, the $x_5$ feature comes next and is followed by the newly computed feature $x_6$.  Finally, the last two columns are the remaining one-hot vectors obtained from encoding the categorical feature $x_3$.

<img src="DataTransformation.png" width="500" />

# Import Data

- Make sure that your notebook and data file are located in the same folder since this is what CodeGrade will be expecting.  
- Import the data from the file called `CustomTransformerData.csv` and call this DataFrame `custom_transform`.  
- Output the `custom_transform` DataFrame to see the data.
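
As a rough illustration of what the import step does with a missing field, the sketch below reads a small in-memory CSV instead of the real file. The text in `csv_text` is an assumption: it copies the five rows displayed in this notebook and supposes the file has a plain header row with these column names. The names `csv_text`, `sample`, and `missing_per_column` are hypothetical.

```python
import io

import pandas as pd

# Stand-in for the file on disk (assumed layout): five rows copied from
# this notebook's displayed data. The fourth row has an empty x2 field.
csv_text = """x1,x2,x3,x4,x5
1.5,2.354153,COLD,593,0.75
2.5,3.314048,WARM,340,2.083333
3.5,4.021604,COLD,551,4.083333
4.5,,COLD,2368,6.75
5.5,5.847601,WARM,2636,10.083333
"""

# read_csv accepts any file-like object, so StringIO works in place of a path.
sample = pd.read_csv(io.StringIO(csv_text))

# Empty fields are parsed as NaN, which is what the imputer fills in later.
missing_per_column = sample.isna().sum()
print(sample.shape)
print(missing_per_column)
```

Counting missing values per column right after import is a quick way to see which features the imputer will actually change.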

In [1]:
import pandas as pd

custom_transform = pd.read_csv('CustomTransformerData.csv')

custom_transform.head()

    x1        x2    x3    x4         x5
0  1.5  2.354153  COLD   593   0.750000
1  2.5  3.314048  WARM   340   2.083333
2  3.5  4.021604  COLD   551   4.083333
3  4.5       NaN  COLD  2368   6.750000
4  5.5  5.847601  WARM  2636  10.083333


# Create Numeric and Categorical DataFrames

- Create two new DataFrames: 
  - Create one DataFrame called `data_num` that holds the numeric features from the `custom_transform` DataFrame.  
  - Create another DataFrame called `data_cat` that holds the categorical feature from `custom_transform`.
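
One possible way to split a DataFrame by column type is `select_dtypes`. The sketch below uses a hypothetical three-column frame called `toy`; selecting with `include="number"` and `exclude="number"` is just one option, chosen here because it does not depend on how a given pandas version labels string columns.

```python
import pandas as pd

# Hypothetical frame with two numeric columns and one text column.
toy = pd.DataFrame({
    "x1": [1.5, 2.5],
    "x3": ["COLD", "WARM"],
    "x4": [593, 340],
})

# Keep every numeric column (floats and ints alike).
toy_num = toy.select_dtypes(include="number")

# Keep everything that is not numeric, i.e. the categorical column.
toy_cat = toy.select_dtypes(exclude="number")

print(list(toy_num.columns))
print(list(toy_cat.columns))
```

Note that the relative order of the selected columns is preserved, which matters later when columns are addressed by position.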

In [2]:
data_num = custom_transform.select_dtypes(include=['float64', 'int64'])

In [3]:
data_cat = custom_transform.select_dtypes(include=['object'])
data_num.head(), data_cat.head()

(    x1        x2    x4         x5
 0  1.5  2.354153   593   0.750000
 1  2.5  3.314048   340   2.083333
 2  3.5  4.021604   551   4.083333
 3  4.5       NaN  2368   6.750000
 4  5.5  5.847601  2636  10.083333,
      x3
 0  COLD
 1  WARM
 2  COLD
 3  COLD
 4  WARM)

# Create Custom Transformer

Create a custom transformer, just as we did in the lecture video entitled "Custom Transformers", that performs two computations: 

1. Adds an attribute to the end of the numerical data (i.e. new last column) that is equal to $\frac{x_1^3}{x_5}$ for each observation.  In other words, for each instance, you will cube the $x_1$ column and then divide by the $x_5$ column.

2. Drops the entire $x_4$ feature column if the passed function argument `drop_x4` is `True` and doesn't drop the column if `drop_x4` is `False`. (See further instructions below.)
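
The two computations above can be sketched with plain NumPy before wrapping them in a class. This is only an illustration under stated assumptions: the array `X` holds made-up rows in the order $x_1, x_2, x_4, x_5$, and the positions `0`, `2`, and `3` are hard-coded for that order.

```python
import numpy as np

# Hypothetical numeric rows in the column order x1, x2, x4, x5.
X = np.array([
    [1.5, 2.0, 593.0, 0.75],
    [2.0, 3.0, 340.0, 2.0],
])

# Computation 1: cube x1 and divide by x5, element by element.
x6 = X[:, 0] ** 3 / X[:, 3]

# Append x6 as a new last column.
with_x6 = np.c_[X, x6]

# Computation 2: remove the x4 column (position 2) along the column axis.
without_x4 = np.delete(with_x6, 2, axis=1)

print(x6)
print(with_x6.shape, without_x4.shape)
```

Because $x_6$ is appended at the end, the position of $x_4$ is unaffected, so the drop can safely happen after the new column is added.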

You must name your custom transformer class `AssignmentTransformer`. Your class should include an input parameter called `drop_x4` with a default value of `True` that deletes the $x_4$ feature column when its value is `True`, but preserves the $x_4$ feature column when you pass a value of `False`.

This transformer will be used in a pipeline. In that pipeline, an imputer will be run *before* this transformer. Keep in mind that the imputer will output an array, so **this transformer must be written to accept an array.**  This is very important and a cause of many errors that students encounter.  In other words, think about using NumPy in your transformer instead of Pandas.

Additionally, this transformer will ONLY be given the numerical features of the data. The categorical feature will be handled elsewhere in the full pipeline. This means that your code for this transformer **must reflect the absence of the categorical $x_3$ column** when indexing data structures.  Again, this is very important: it is the second most common cause of errors that students encounter.
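
To make both warnings concrete, here is a minimal sketch, assuming a made-up numeric frame named `toy_num` with the columns $x_1, x_2, x_4, x_5$. It shows that `SimpleImputer` hands back a NumPy array (so column names are gone) and lists the integer position each feature occupies once $x_3$ is absent. The names `toy_num`, `imputed`, and `positions` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical numeric-only frame; x3 is deliberately not present.
toy_num = pd.DataFrame({
    "x1": [1.5, 2.5, 3.5],
    "x2": [2.0, np.nan, 4.0],
    "x4": [593, 340, 551],
    "x5": [0.75, 2.0, 4.0],
})

# With default settings, fit_transform returns a NumPy array, not a DataFrame.
imputed = SimpleImputer(strategy="mean").fit_transform(toy_num)
print(type(imputed))

# Integer positions a downstream transformer must use instead of column names.
positions = {name: i for i, name in enumerate(toy_num.columns)}
print(positions)

# The missing x2 value was replaced by the mean of the observed x2 values.
print(imputed[1, 1])
```

With $x_3$ removed, $x_4$ sits at position 2 and $x_5$ at position 3, one slot earlier than in the raw data.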

In [4]:
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np

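# Column positions in the numeric-only array (x3 is absent, so the order is x1, x2, x4, x5).
X1_INDEX = 0
X5_INDEX = 3
X4_INDEX = 2

class AssignmentTransformer(BaseEstimator, TransformerMixin):
    """Append x6 = x1**3 / x5 as the last column and optionally drop x4."""

    def __init__(self, drop_x4=True):
        self.drop_x4 = drop_x4

    def fit(self, X, y=None):
        # Nothing to learn from the data; return self so the pipeline can chain calls.
        return self

    def transform(self, X, y=None):
        # X is a NumPy array here because the imputer runs before this step.
        x6 = X[:, X1_INDEX]**3 / X[:, X5_INDEX]
        X = np.c_[X, x6]  # append x6 as the new last column

        if self.drop_x4:
            # x4 keeps its original position because x6 was appended at the end.
            X = np.delete(X, X4_INDEX, axis=1)

        return X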


# Create Transformation Pipeline for Numerical Features

Create a custom transformation pipeline for only the numeric data called `num_pipeline` that:

1. Applies the `SimpleImputer` class to the data, where the strategy is set to `mean`.  The name of this step should be "imputer".

2. Applies the custom `AssignmentTransformer` class to the data.  Make sure that your custom transformer uses the default argument where you drop the $x_4$ column.  The name of this step should be "custom_trans".

3. Applies the `StandardScaler` class to the data.  The name of this step should be "std_scaler".

In [5]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="mean")),
    ('custom_trans', AssignmentTransformer()),
    ('std_scaler', StandardScaler())
])
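
As a small aside on how named steps behave, the sketch below builds a hypothetical two-step pipeline, `toy_pipeline`, on made-up data. It leaves out the custom transformer so that the block stays self-contained, and it is not meant to replace `num_pipeline`.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data with one missing value in the first column.
toy = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, 30.0],
])

toy_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("std_scaler", StandardScaler()),
])
toy_out = toy_pipeline.fit_transform(toy)

# Step names give access to each fitted step after fitting.
step_names = list(toy_pipeline.named_steps)
learned_means = toy_pipeline.named_steps["imputer"].statistics_

print(step_names)
print(learned_means)
print(toy_out)
```

A mean-imputed entry equals its column mean, so it standardizes to 0 whenever that column reaches the scaler unchanged. Exact zeros in a standardized column that originally had missing values are therefore expected rather than a sign of a bug.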



# Quick Testing

The full pipeline will be implemented with a `ColumnTransformer` class.  However, to be sure that our numeric pipeline is working properly, let's invoke the `fit_transform()` method of the `num_pipeline` object, passing it your `data_num` DataFrame.  Save this output data into a variable called `data_num_trans`.

### Run Pipeline and Create Transformed Numeric Data

In [6]:
data_num_trans = num_pipeline.fit_transform(data_num)

data_num_trans


array([[-1.63835604, -1.72914963, -1.19507691, -1.59050349],
       [-1.44560827, -1.52555901, -1.15738431, -1.39982426],
       [-1.2528605 , -1.37548847, -1.10084543, -1.20914502],
       [-1.06011273,  0.        , -1.02546024, -1.01846579],
       [-0.86736496, -0.9882004 , -0.93122876, -0.82778656],
       [-0.67461719, -0.69501705, -0.81815098, -0.63710732],
       [-0.48186942, -0.53226557, -0.68622691, -0.44642809],
       [-0.28912165, -0.27633017, -0.53545654, -0.25574886],
       [-0.09637388, -0.03636359,  0.        , -0.6099295 ],
       [ 0.09637388,  0.128392  , -0.17737691,  0.12560961],
       [ 0.28912165,  0.26571811,  0.02993235,  0.31628884],
       [ 0.48186942,  0.4501331 ,  0.25608791,  0.50696808],
       [ 0.67461719,  0.75841437,  0.50108976,  0.69764731],
       [ 0.86736496,  0.88038895,  0.76493791,  0.88832654],
       [ 1.06011273,  0.        ,  1.04763235,  1.07900578],
       [ 1.2528605 ,  1.41623801,  1.34917309,  1.26968501],
       [ 1.44560827,  1.

### One-Hot Encode Categorical Features

Similarly, you will employ a `OneHotEncoder` class in the `ColumnTransformer` below to construct the final full pipeline.  First, though, let's instantiate an object of the `OneHotEncoder` class to use in the `ColumnTransformer`.

1. Call this object `cat_encoder` and set its `drop` parameter to `'first'`.  
2. Also, include `sparse=False` when instantiating the object to make the auto grading work correctly.  
3. Next, call the `fit_transform()` method and pass it your categorical data (`data_cat`).  
4. Save this output data into a variable called `data_cat_OHE`.
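
The effect of `drop='first'` can be seen on a tiny example. The sketch below is an illustration only: the label `HOT` is an assumed third category that does not come from the assignment, and `make_dense_encoder` is a hypothetical helper. The helper exists because scikit-learn 1.2 renamed the `sparse` parameter to `sparse_output` and later releases removed the old name, so the exact keyword depends on the installed version; for the assignment itself, follow the instructions above.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder


def make_dense_encoder(**kwargs):
    """Hypothetical helper: build a dense-output encoder on old or new scikit-learn."""
    try:
        # Newer releases use the sparse_output keyword.
        return OneHotEncoder(sparse_output=False, **kwargs)
    except TypeError:
        # Older releases only know the sparse keyword.
        return OneHotEncoder(sparse=False, **kwargs)


# Made-up labels; "HOT" is an assumption used only for illustration.
labels = np.array([["COLD"], ["WARM"], ["HOT"]])

encoder = make_dense_encoder(drop="first")
encoded = encoder.fit_transform(labels)

# Categories are sorted, and the first one (COLD) gets no column of its own.
print(list(encoder.categories_[0]))
print(encoded)
```

With the first category dropped, a row of all zeros stands for that category, and each remaining category keeps one column in sorted order.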

In [7]:
from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder(sparse=False, drop='first')

data_cat_OHE = cat_encoder.fit_transform(data_cat)

data_cat_OHE


array([[0., 0.],
       [0., 1.],
       [0., 0.],
       [0., 0.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [0., 0.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [0., 0.],
       [1., 0.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [1., 0.]])

# Put it All Together with a Column Transformer

Now, we are finally ready to construct the full transformation pipeline called `full_pipeline` that will transform our raw data into clean, ready-to-train data.  
1. Construct this `ColumnTransformer` below.  Note that you will use:
  - the `num_pipeline` in the full_pipeline with your numeric data
  - the `cat_encoder` object in the full_pipeline with your categorical data
2. Call the `fit_transform()` method on the original `custom_transform` data to obtain the final, clean data.  
3. Save this output data into a variable called `data_trans`.
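
Since the final column order matters, it may help to see how a `ColumnTransformer` arranges its output. The sketch below is a toy under stated assumptions: the frame `toy` and its column names `a`, `label`, and `b` are made up, and `sparse_threshold=0` is passed only to force a dense result here.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical frame: the text column sits between the two numeric columns.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "label": ["COLD", "WARM", "COLD"],
    "b": [10.0, 20.0, 30.0],
})

toy_transformer = ColumnTransformer(
    [
        ("num", StandardScaler(), ["a", "b"]),
        ("cat", OneHotEncoder(drop="first"), ["label"]),
    ],
    sparse_threshold=0,  # always return a dense array in this sketch
)
out = toy_transformer.fit_transform(toy)

# Output columns follow the order of the transformer list:
# scaled a, scaled b, then the one-hot column for WARM.
print(out)
```

The output order is set by the order of the transformers in the list, not by where the columns sat in the input, which is why the one-hot columns end up last.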

In [8]:
from sklearn.compose import ColumnTransformer

full_pipeline = ColumnTransformer([
    ("num", num_pipeline, list(data_num.columns)),
    ("cat", cat_encoder, list(data_cat.columns))
])

data_trans = full_pipeline.fit_transform(custom_transform)

data_trans


array([[-1.63835604, -1.72914963, -1.19507691, -1.59050349,  0.        ,
         0.        ],
       [-1.44560827, -1.52555901, -1.15738431, -1.39982426,  0.        ,
         1.        ],
       [-1.2528605 , -1.37548847, -1.10084543, -1.20914502,  0.        ,
         0.        ],
       [-1.06011273,  0.        , -1.02546024, -1.01846579,  0.        ,
         0.        ],
       [-0.86736496, -0.9882004 , -0.93122876, -0.82778656,  0.        ,
         1.        ],
       [-0.67461719, -0.69501705, -0.81815098, -0.63710732,  0.        ,
         1.        ],
       [-0.48186942, -0.53226557, -0.68622691, -0.44642809,  1.        ,
         0.        ],
       [-0.28912165, -0.27633017, -0.53545654, -0.25574886,  0.        ,
         0.        ],
       [-0.09637388, -0.03636359,  0.        , -0.6099295 ,  0.        ,
         1.        ],
       [ 0.09637388,  0.128392  , -0.17737691,  0.12560961,  1.        ,
         0.        ],
       [ 0.28912165,  0.26571811,  0.02993235,  0.

# Prepare for Grading

Prepare your `data_trans` NumPy array for grading by running the code below.  We use the NumPy [around()](https://numpy.org/doc/stable/reference/generated/numpy.around.html) function to round all the values to 2 decimal places; it returns a NumPy array.
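
For a quick feel of the rounding step, this sketch applies `np.around` to three values taken from the numeric pipeline output shown earlier in the notebook. The variable names `values` and `rounded` are hypothetical.

```python
import numpy as np

# Three entries copied from the transformed numeric output above.
values = np.array([-1.63835604, 0.128392, -1.19507691])

# Round every element to 2 decimal places; the result is a new array.
rounded = np.around(values, decimals=2)
print(rounded)
```

The printed form drops trailing zeros, so a value rounded to `-1.20` is displayed as `-1.2`.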

Please double check that the final order of the features in this `data_trans` array matches what was given to you at the top of this document.

In [9]:
data_trans = np.around(data_trans, decimals=2)
data_trans

array([[-1.64, -1.73, -1.2 , -1.59,  0.  ,  0.  ],
       [-1.45, -1.53, -1.16, -1.4 ,  0.  ,  1.  ],
       [-1.25, -1.38, -1.1 , -1.21,  0.  ,  0.  ],
       [-1.06,  0.  , -1.03, -1.02,  0.  ,  0.  ],
       [-0.87, -0.99, -0.93, -0.83,  0.  ,  1.  ],
       [-0.67, -0.7 , -0.82, -0.64,  0.  ,  1.  ],
       [-0.48, -0.53, -0.69, -0.45,  1.  ,  0.  ],
       [-0.29, -0.28, -0.54, -0.26,  0.  ,  0.  ],
       [-0.1 , -0.04,  0.  , -0.61,  0.  ,  1.  ],
       [ 0.1 ,  0.13, -0.18,  0.13,  1.  ,  0.  ],
       [ 0.29,  0.27,  0.03,  0.32,  0.  ,  1.  ],
       [ 0.48,  0.45,  0.26,  0.51,  0.  ,  1.  ],
       [ 0.67,  0.76,  0.5 ,  0.7 ,  0.  ,  0.  ],
       [ 0.87,  0.88,  0.76,  0.89,  1.  ,  0.  ],
       [ 1.06,  0.  ,  1.05,  1.08,  1.  ,  0.  ],
       [ 1.25,  1.42,  1.35,  1.27,  0.  ,  1.  ],
       [ 1.45,  1.55,  1.67,  1.46,  1.  ,  0.  ],
       [ 1.64,  1.71,  2.01,  1.65,  1.  ,  0.  ]])