# DTSC670: Foundations of Machine Learning Models
## Module 2
## Assignment 4: Custom Transformer and Transformation Pipeline

#### Name: Trenton Middleton

Begin by writing your name above.

Your task in this assignment is to create a custom transformation pipeline that takes in raw data and returns fully prepared, clean data that is ready for model training.  However, we will not actually train any models in this assignment.  This pipeline will employ an imputer class, a user-defined transformer class, and a data-normalization class.

Please note that the order of features in the final feature matrix must be correct.  The figure below illustrates the input and output of the transformation pipeline.  The positions of features $x_1$ and $x_2$ do not change: they remain in the first and second columns, respectively, both before and after the transformation pipeline.  In the transformed dataset, the $x_5$ feature comes next, followed by the newly computed feature $x_6$.  Finally, the last two columns are the remaining one-hot vectors obtained from encoding the categorical feature $x_3$.

<img src="DataTransformation.png" width="500" />

# Import Data

Import data from the file called `CustomTransformerData.csv`.

In [1]:
# Import numpy and pandas
# import csv file
import numpy as np
import pandas as pd
customer_trans = pd.read_csv("CustomTransformerData.csv")
customer_trans

Unnamed: 0,x1,x2,x3,x4,x5
0,1.5,2.354153,COLD,593,0.75
1,2.5,3.314048,WARM,340,2.083333
2,3.5,4.021604,COLD,551,4.083333
3,4.5,,COLD,2368,6.75
4,5.5,5.847601,WARM,2636,10.083333
5,6.5,7.22991,WARM,2779,14.083333
6,7.5,7.997255,HOT,1057,18.75
7,8.5,9.203947,COLD,819,24.083333
8,9.5,10.335348,WARM,3349,
9,10.5,11.112142,HOT,3235,36.75


# Create Custom Transformer

Create a custom transformer, just as we did in the lecture video entitled "Custom Transformers", that performs two computations: 

1. Adds an attribute to the end of the data (i.e. new last column) that is equal to $\frac{x_1^3}{x_5}$ for each observation

2. Drops the entire $x_4$ feature column.  (See further instructions below.)

You must name your custom transformer class `Assignment4Transformer`. Your class should include an input parameter with a default value of `True` that deletes the $x_4$ feature column when its value is `True`, but preserves the $x_4$ feature column when its value is `False`.

This transformer will be used in a pipeline. In that pipeline, an imputer will be run *before* this transformer. Keep in mind that the imputer will output an array, so **this transformer must be written to accept an array.**

Additionally, this transformer will ONLY be given the numerical features of the data. The categorical feature will be handled elsewhere in the full pipeline. This means that your code for this transformer **must reflect the absence of the categorical $x_3$ column** when indexing data structures.
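
As a rough illustration of the two computations on a plain array, here is a minimal sketch. The toy values and the names `X_toy`, `x1_col`, `x4_col`, `x5_col`, and `X_out` are made up for this example and are not part of the assignment data; the only assumption carried over from the text is that the array holds the columns $x_1$, $x_2$, $x_4$, $x_5$ in that order, with no $x_3$ column.

```python
import numpy as np

# Hypothetical toy array with columns in the order x1, x2, x4, x5.
# The categorical x3 column is absent, so x4 sits at index 2 and x5 at index 3.
X_toy = np.array([
    [1.0, 10.0, 100.0, 2.0],
    [2.0, 20.0, 200.0, 4.0],
])

# Illustrative column positions (hypothetical names)
x1_col, x4_col, x5_col = 0, 2, 3

# New feature: x1 cubed divided by x5, computed for every row at once
x6 = X_toy[:, x1_col] ** 3 / X_toy[:, x5_col]

# Drop the x4 column, then append x6 as the new last column
X_out = np.c_[np.delete(X_toy, x4_col, axis=1), x6]
```

Here `X_out` has the columns $x_1$, $x_2$, $x_5$, $x_6$, which matches the feature order required of the numeric part of the final matrix.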

In [2]:
# Create an imputer that fills NaN values with the column mean.
# Keep the imputer output as an array: the transformer needs to work with an array.
# Only numerical data is passed to the transformer.

from sklearn.impute import SimpleImputer
data_num = customer_trans[['x1', 'x2', 'x4', 'x5']]
imputer = SimpleImputer(strategy = 'mean')
imputer.fit(data_num)
X = imputer.transform(data_num)
X


array([[1.50000000e+00, 2.35415298e+00, 5.93000000e+02, 7.50000000e-01],
       [2.50000000e+00, 3.31404772e+00, 3.40000000e+02, 2.08333333e+00],
       [3.50000000e+00, 4.02160446e+00, 5.51000000e+02, 4.08333333e+00],
       [4.50000000e+00, 1.05067957e+01, 2.36800000e+03, 6.75000000e+00],
       [5.50000000e+00, 5.84760100e+00, 2.63600000e+03, 1.00833333e+01],
       [6.50000000e+00, 7.22991004e+00, 2.77900000e+03, 1.40833333e+01],
       [7.50000000e+00, 7.99725523e+00, 1.05700000e+03, 1.87500000e+01],
       [8.50000000e+00, 9.20394654e+00, 8.19000000e+02, 2.40833333e+01],
       [9.50000000e+00, 1.03353477e+01, 3.34900000e+03, 4.30245098e+01],
       [1.05000000e+01, 1.11121419e+01, 3.23500000e+03, 3.67500000e+01],
       [1.15000000e+01, 1.17596108e+01, 2.16000000e+02, 4.40833333e+01],
       [1.25000000e+01, 1.26290958e+01, 2.52900000e+03, 5.20833333e+01],
       [1.35000000e+01, 1.40825889e+01, 1.73500000e+03, 6.07500000e+01],
       [1.45000000e+01, 1.46576780e+01, 1.25400000e

In [3]:
# CREATION OF TRANSFORMER

from sklearn.base import BaseEstimator, TransformerMixin

# column index
x1_ix, x2_ix, x4_ix, x5_ix = 0, 1, 2, 3

class Assignment4Transformer(BaseEstimator, TransformerMixin):
    def __init__(self, drop_x4 = True):
        self.drop_x4 = drop_x4

    def fit(self, X, y=None):
        return self  

    # Append x6 = x1 cubed / x5 as the last column, and drop x4 when drop_x4 is True
    def transform(self, X):
        x6 = X[:, x1_ix] ** 3 / X[:, x5_ix]

        if self.drop_x4:
            X = np.delete(X, x4_ix, axis=1)
        return np.c_[X, x6]
        

# Create Transformation Pipeline for Numerical Features

Create a custom transformation pipeline for numeric data only called `num_pipeline` that:

1. Applies the `SimpleImputer` class to the data, where the strategy is set to `mean`.

2. Applies the custom `Assignment4Transformer` class to the data.

3. Applies the `StandardScaler` class to the data.

In [4]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Create the transformation pipeline for the numerical features: imputer, custom transformer, then scaler
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('attribs_adder', Assignment4Transformer()),
    ('std_scaler', StandardScaler())])
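
To see how mean imputation and standardization interact, the following minimal sketch runs a two-step pipeline on a single made-up column. The names `toy`, `toy_pipeline`, `toy_scaled`, `filled`, and `manual` are hypothetical and the custom transformer is left out on purpose. It relies on `StandardScaler` dividing by the population standard deviation, which is what `np.std` computes by default.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical single-column data with one missing value
toy = np.array([[1.0], [2.0], [np.nan], [6.0]])

toy_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('std_scaler', StandardScaler()),
])
toy_scaled = toy_pipeline.fit_transform(toy)

# Manual equivalent: fill NaN with the mean of the observed values,
# then subtract the column mean and divide by the population standard deviation
filled = np.where(np.isnan(toy), np.nanmean(toy), toy)
manual = (filled - filled.mean(axis=0)) / filled.std(axis=0)
```

Because the imputed entry equals the column mean, it becomes 0 after scaling, which is one way to recognize imputed values in the scaled output of a column that is imputed and then scaled directly.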



# Create Numeric and Categorical DataFrames

Create two new data frames.  Create one DataFrame called `data_num` that holds the numeric features.  Create another DataFrame called `data_cat` that holds the categorical features.

In [5]:
# New data frame for numerical data
data_num = customer_trans[['x1', 'x2', 'x4', 'x5']]

# New dataframe for categorical data
data_cat = customer_trans[['x3']]


# Quick Testing

The full pipeline will be implemented with a `ColumnTransformer` class.  However, to be sure that our numeric pipeline is working properly, let's invoke the `fit_transform()` method of the `num_pipeline` object.  Then, take a look at the transformed data to be sure all is well.

### Run Pipeline and Create Transformed Numeric Data

In [6]:
# Test the numeric pipeline on its own: fit_transform() runs the imputer, the custom transformer, and the scaler in order
column_transformed = num_pipeline.fit_transform(data_num)
column_transformed



array([[-1.63835604, -1.72914963, -1.19507691, -1.59050349],
       [-1.44560827, -1.52555901, -1.15738431, -1.39982426],
       [-1.2528605 , -1.37548847, -1.10084543, -1.20914502],
       [-1.06011273,  0.        , -1.02546024, -1.01846579],
       [-0.86736496, -0.9882004 , -0.93122876, -0.82778656],
       [-0.67461719, -0.69501705, -0.81815098, -0.63710732],
       [-0.48186942, -0.53226557, -0.68622691, -0.44642809],
       [-0.28912165, -0.27633017, -0.53545654, -0.25574886],
       [-0.09637388, -0.03636359,  0.        , -0.6099295 ],
       [ 0.09637388,  0.128392  , -0.17737691,  0.12560961],
       [ 0.28912165,  0.26571811,  0.02993235,  0.31628884],
       [ 0.48186942,  0.4501331 ,  0.25608791,  0.50696808],
       [ 0.67461719,  0.75841437,  0.50108976,  0.69764731],
       [ 0.86736496,  0.88038895,  0.76493791,  0.88832654],
       [ 1.06011273,  0.        ,  1.04763235,  1.07900578],
       [ 1.2528605 ,  1.41623801,  1.34917309,  1.26968501],
       [ 1.44560827,  1.

### One-Hot Encode Categorical Features

Similarly, you will employ a `OneHotEncoder` class in the `ColumnTransformer` below to construct the final full pipeline.  First, though, instantiate an object of the `OneHotEncoder` class called `cat_encoder` that has the `drop` parameter set to `'first'`.  Next, call the `fit_transform()` method and pass it your categorical data.  Take a look at the transformed one-hot vectors to be sure all is well.
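
As a small illustration of what `drop='first'` does, the sketch below encodes a made-up column that uses the same three labels as $x_3$. The names `toy_cat`, `toy_encoder`, `toy_1hot`, and `kept` are hypothetical. The encoder sorts the categories, so `COLD` comes first and is the one dropped.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column with the labels COLD, HOT, and WARM
toy_cat = np.array([['COLD'], ['WARM'], ['HOT'], ['COLD']])

toy_encoder = OneHotEncoder(drop='first')
toy_1hot = toy_encoder.fit_transform(toy_cat).toarray()

# Categories are stored in sorted order; the first one is dropped,
# so the two remaining columns correspond to HOT and WARM
kept = toy_encoder.categories_[0][1:]
```

With this encoding, a `COLD` row is `[0., 0.]`, a `HOT` row is `[1., 0.]`, and a `WARM` row is `[0., 1.]`.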

In [7]:

from sklearn.preprocessing import OneHotEncoder

# Create the one-hot encoder, which turns the categorical data into numeric values
# drop='first' drops the first category
cat_encoder = OneHotEncoder(drop='first')
data_cat_1hot = cat_encoder.fit_transform(data_cat)
data_cat_1hot.toarray()




array([[0., 0.],
       [0., 1.],
       [0., 0.],
       [0., 0.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [0., 0.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [0., 0.],
       [1., 0.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [1., 0.]])

# Put it All Together with a Column Transformer

Now, we are finally ready to construct the full transformation pipeline called `full_pipeline` that will transform our raw data into clean, ready-to-train data.  Construct this ColumnTransformer below, then call the `fit_transform()` method to obtain the final, clean data.  Save this output data into a variable called `data_trans`.
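
The column order of a `ColumnTransformer` output follows the order in which its transformers are listed, not the column order of the input DataFrame. The sketch below shows this on a made-up DataFrame; `toy_df`, `toy_ct`, `toy_out`, and the column names `a`, `b`, and `kind` are hypothetical, a plain `StandardScaler` stands in for the numeric pipeline, and `sparse_threshold=0` is set here only so that this toy example always returns a dense array.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data: the categorical column sits between the numeric ones
toy_df = pd.DataFrame({
    'a': [1.0, 2.0, 3.0],
    'kind': ['COLD', 'HOT', 'WARM'],
    'b': [10.0, 20.0, 40.0],
})

toy_ct = ColumnTransformer(
    [
        ('num', StandardScaler(), ['a', 'b']),
        ('cat', OneHotEncoder(drop='first'), ['kind']),
    ],
    sparse_threshold=0,  # assumption for this sketch: force dense output
)
toy_out = toy_ct.fit_transform(toy_df)
```

In `toy_out`, the two scaled numeric columns come first and the two one-hot columns come last, even though `kind` is the middle column of `toy_df`.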

In [8]:
from sklearn.compose import ColumnTransformer


num_attribs = list(data_num)
cat_attribs = ["x3"]

# full pipeline
full_pipeline = ColumnTransformer([
    ('num', num_pipeline, num_attribs),
    ('cat', OneHotEncoder(drop='first'), cat_attribs)])

# creation of final data
data_trans = full_pipeline.fit_transform(customer_trans)
data_trans

array([[-1.63835604, -1.72914963, -1.19507691, -1.59050349,  0.        ,
         0.        ],
       [-1.44560827, -1.52555901, -1.15738431, -1.39982426,  0.        ,
         1.        ],
       [-1.2528605 , -1.37548847, -1.10084543, -1.20914502,  0.        ,
         0.        ],
       [-1.06011273,  0.        , -1.02546024, -1.01846579,  0.        ,
         0.        ],
       [-0.86736496, -0.9882004 , -0.93122876, -0.82778656,  0.        ,
         1.        ],
       [-0.67461719, -0.69501705, -0.81815098, -0.63710732,  0.        ,
         1.        ],
       [-0.48186942, -0.53226557, -0.68622691, -0.44642809,  1.        ,
         0.        ],
       [-0.28912165, -0.27633017, -0.53545654, -0.25574886,  0.        ,
         0.        ],
       [-0.09637388, -0.03636359,  0.        , -0.6099295 ,  0.        ,
         1.        ],
       [ 0.09637388,  0.128392  , -0.17737691,  0.12560961,  1.        ,
         0.        ],
       [ 0.28912165,  0.26571811,  0.02993235,  0.

# Prepare for Grading

Prepare your `data_trans` NumPy array for grading by using the NumPy [around()](https://numpy.org/doc/stable/reference/generated/numpy.around.html) function to round all the values to 2 decimal places - this will return a NumPy array.
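
For reference, a minimal sketch of `np.around()` on a few made-up values; `vals` and `rounded` are hypothetical names, and nothing is printed here so that the single required `print()` stays in the final cell.

```python
import numpy as np

# Hypothetical values rounded to 2 decimal places
vals = np.array([-1.63835604, 0.128392, 1.0])
rounded = np.around(vals, decimals=2)  # array with values -1.64, 0.13, 1.0
```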

Please note the final order of the features in your final numpy array, which is given at the top of this document.

___You MUST print your final answer, which is the NumPy array discussed above, using the `print()` function!  This MUST be the only `print()` statement in the entire notebook!  Do not print anything else using the print() function in this notebook!___

In [9]:
# Prepare data_trans for grading by rounding to 2 decimal places
print(np.around(data_trans, decimals=2))

[[-1.64 -1.73 -1.2  -1.59  0.    0.  ]
 [-1.45 -1.53 -1.16 -1.4   0.    1.  ]
 [-1.25 -1.38 -1.1  -1.21  0.    0.  ]
 [-1.06  0.   -1.03 -1.02  0.    0.  ]
 [-0.87 -0.99 -0.93 -0.83  0.    1.  ]
 [-0.67 -0.7  -0.82 -0.64  0.    1.  ]
 [-0.48 -0.53 -0.69 -0.45  1.    0.  ]
 [-0.29 -0.28 -0.54 -0.26  0.    0.  ]
 [-0.1  -0.04  0.   -0.61  0.    1.  ]
 [ 0.1   0.13 -0.18  0.13  1.    0.  ]
 [ 0.29  0.27  0.03  0.32  0.    1.  ]
 [ 0.48  0.45  0.26  0.51  0.    1.  ]
 [ 0.67  0.76  0.5   0.7   0.    0.  ]
 [ 0.87  0.88  0.76  0.89  1.    0.  ]
 [ 1.06  0.    1.05  1.08  1.    0.  ]
 [ 1.25  1.42  1.35  1.27  0.    1.  ]
 [ 1.45  1.55  1.67  1.46  1.    0.  ]
 [ 1.64  1.71  2.01  1.65  1.    0.  ]]
