# Hand-made Standardizer

## 1. Key Concept: Stateless vs. Stateful transformers

👇 Consider the following train and test sets

In [None]:
import numpy as np
import pandas as pd

X_train = pd.DataFrame({
    'A': {0: 1, 1: 2, 2: 3},
    'B': {0: 2, 1: 3, 2: 4},
    'C': {0: 3, 1: 4, 2: 5}})
display(X_train)

X_test = pd.DataFrame({
    'A': {0: 1, 1: 2, 2: 3},
    'B': {0: 2, 1: 3, 2: 4},
    'C': {0: 3, 1: 4, 2: 10}})
display(X_test)

|   | A | B | C |
|---|---|---|---|
| 0 | 1 | 2 | 3 |
| 1 | 2 | 3 | 4 |
| 2 | 3 | 4 | 5 |

|   | A | B | C |
|---|---|---|---|
| 0 | 1 | 2 | 3 |
| 1 | 2 | 3 | 4 |
| 2 | 3 | 4 | 10 |


👇 And the following pipeline

In [None]:
from sklearn import set_config; set_config(display='diagram')
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.pipeline import make_pipeline, make_union

scaler = StandardScaler()
feature_averager = FunctionTransformer(lambda df: pd.DataFrame(1/3 * (df["A"] + df["B"] + df["C"])))
pipe = make_union(scaler, feature_averager)
pipe

In [None]:
pipe.fit(X_train)
pd.DataFrame(pipe.transform(X_train))

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | -1.224745 | -1.224745 | -1.224745 | 2.0 |
| 1 | 0.0 | 0.0 | 0.0 | 3.0 |
| 2 | 1.224745 | 1.224745 | 1.224745 | 4.0 |


In [None]:
pd.DataFrame(pipe.transform(X_test))

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | -1.224745 | -1.224745 | -1.224745 | 2.0 |
| 1 | 0.0 | 0.0 | 0.0 | 3.0 |
| 2 | 1.224745 | 1.224745 | 7.348469 | 5.666667 |
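
The four output columns come from placing the outputs of the two transformers side by side. Here is a minimal sketch of that equivalence, rebuilding the same union from scratch. The names `averager`, `union` and `stacked` are chosen here for illustration, and `np.hstack` is only used as a point of comparison.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import make_union
from sklearn.preprocessing import FunctionTransformer, StandardScaler

X_train = pd.DataFrame({"A": [1, 2, 3], "B": [2, 3, 4], "C": [3, 4, 5]})

# Same stateless averager as in the cell above
averager = FunctionTransformer(
    lambda df: pd.DataFrame(1 / 3 * (df["A"] + df["B"] + df["C"]))
)

# The union applies each transformer to X and concatenates the results column-wise
union = make_union(StandardScaler(), averager).fit(X_train)
union_output = union.transform(X_train)

# Doing the same thing by hand: 3 scaled columns + 1 averaged column
stacked = np.hstack([
    StandardScaler().fit_transform(X_train),
    averager.transform(X_train),
])

print(union_output.shape)
print(np.allclose(union_output, stacked))
```

The first three columns belong to the scaler and the last one to the averager, which is the layout shown in the tables above.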


☝️ Notice how the `StandardScaler` and the `FunctionTransformer` are fundamentally different:

1️⃣ `FunctionTransformer` can only perform **stateless** transformations

$(X_1, X_2, X_3) \rightarrow \frac{X_1 + X_2 + X_3}{3}$ for our `feature_averager`

Other examples of stateless transformations:

$X \rightarrow \log(X)$  
$(X_1, X_2) \rightarrow X_1 + 5X_2^2$

2️⃣ `StandardScaler` performs a **stateful** transformation

$
X \rightarrow \frac{(X-\mu )}{\sigma}
$

- It needs to **store** information from the train set during `.fit` (here, `mean_train` and `std_train`)
- in order to **reuse** it later during the `.transform` phase, on *either* the train or the test set
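
The difference can be inspected directly on fitted objects. This is a minimal sketch, assuming the same toy frames as above. `log_transformer` is a name introduced here for illustration. `mean_` and `scale_` are the attributes in which `StandardScaler` keeps what it learned during `.fit`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer, StandardScaler

X_train = pd.DataFrame({"A": [1, 2, 3], "B": [2, 3, 4], "C": [3, 4, 5]})
X_test = pd.DataFrame({"A": [1, 2, 3], "B": [2, 3, 4], "C": [3, 4, 10]})

# Stateful: fit() stores the train statistics on the instance
scaler = StandardScaler().fit(X_train)
print(scaler.mean_)   # per-column means of X_train
print(scaler.scale_)  # per-column standard deviations of X_train (ddof=0)

# transform() reuses the *train* statistics, even on the test set
scaled_test = scaler.transform(X_test)
outlier = (10 - scaler.mean_[2]) / scaler.scale_[2]
print(scaled_test[2, 2], outlier)

# Stateless: the result depends only on the data passed to transform()
log_transformer = FunctionTransformer(np.log).fit(X_train)
logged_test = log_transformer.transform(X_test)
print(np.allclose(logged_test, np.log(X_test)))
```

The value in the last row of column `C` is computed with the mean and standard deviation of `X_train`, not those of `X_test`.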

☝️ What if we wanted to code our own stateful custom transformer? For that, we have to write our own class.

## 2. Create your own stateful transformer

### 2.1 CustomStandardizer

👉 Try to code your own class `CustomStandardizer` that behaves exactly like `StandardScaler` from scikit-learn.  
This means having both a `fit()` and a `transform()` method.

Then, fit it on `X_train` and transform `X_test` with it to compare with the original scikit-learn version!





In [None]:
# TransformerMixin inheritance is used to create fit_transform() method from fit() and transform()
from sklearn.base import TransformerMixin, BaseEstimator

# $DELETE_BEGIN
# Bonus: allows us to raise a NotFittedError when one calls the transform method before fitting the instance
from sklearn.exceptions import NotFittedError
# $DELETE_END

class CustomStandardizer(TransformerMixin, BaseEstimator):
    
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        # Store what needs to be stored as instance attributes. Return "self" to allow chaining fit and transform.
        # $CHALLENGIFY_BEGIN
        self.means = X.mean()
        self.stds = X.std(ddof=0)
        # Return self to allow chaining & fit_transform
        return self
        # $CHALLENGIFY_END
    
    def transform(self, X, y=None): 
        # $CHALLENGIFY_BEGIN
        if not (hasattr(self, "means") and hasattr(self, "stds")):
            raise NotFittedError("This CustomStandardizer instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.")
        # Standardization
        standardized_feature = (X - self.means) / self.stds
        return standardized_feature
        # $CHALLENGIFY_END
    
    # $DELETE_BEGIN
    def inverse_transform(self, X, y=None):
        if not (hasattr(self, "means") and hasattr(self, "stds")):
            raise NotFittedError("This CustomStandardizer instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.")
        return X * self.stds + self.means
    # $DELETE_END
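
As an alternative to the manual `hasattr` checks, scikit-learn ships a `check_is_fitted` helper in `sklearn.utils.validation`. Below is a minimal sketch, assuming the learned attributes are renamed with a trailing underscore (`means_`, `stds_`), as scikit-learn does for its own estimators. The class name `UnderscoreStandardizer` is hypothetical and only used for this illustration.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted


class UnderscoreStandardizer(TransformerMixin, BaseEstimator):
    """Hypothetical variant that delegates the fitted check to scikit-learn."""

    def fit(self, X, y=None):
        # Learned state uses a trailing underscore by convention
        self.means_ = X.mean()
        self.stds_ = X.std(ddof=0)
        return self

    def transform(self, X, y=None):
        # Raises NotFittedError if the listed attributes are missing
        check_is_fitted(self, ["means_", "stds_"])
        return (X - self.means_) / self.stds_


X_train = pd.DataFrame({"A": [1, 2, 3], "B": [2, 3, 4], "C": [3, 4, 5]})

raised = False
try:
    UnderscoreStandardizer().transform(X_train)  # transform before fit
except NotFittedError as error:
    raised = True
    print(type(error).__name__)

fitted_output = UnderscoreStandardizer().fit(X_train).transform(X_train)
print(fitted_output)
```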

In [None]:
# Try it out below
custom_standardizer = CustomStandardizer()
custom_standardizer.fit(X_train)
custom_standardizer.transform(X_test)

|   | A | B | C |
|---|---|---|---|
| 0 | -1.224745 | -1.224745 | -1.224745 |
| 1 | 0.0 | 0.0 | 0.0 |
| 2 | 1.224745 | 1.224745 | 7.348469 |
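
To compare with the original scikit-learn version, the same arithmetic can be checked against `StandardScaler` directly. This is a minimal sketch that spells out the standardization by hand instead of reusing the class above, so that it runs on its own. `manual` and `reference` are names chosen here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

X_train = pd.DataFrame({"A": [1, 2, 3], "B": [2, 3, 4], "C": [3, 4, 5]})
X_test = pd.DataFrame({"A": [1, 2, 3], "B": [2, 3, 4], "C": [3, 4, 10]})

# Hand-made standardization: statistics come from the train set only
manual = (X_test - X_train.mean()) / X_train.std(ddof=0)

# Reference implementation
reference = StandardScaler().fit(X_train).transform(X_test)

print(np.allclose(manual, reference))
```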


In [None]:
from nbresult import ChallengeResult

tmp = CustomStandardizer()
tmp_train = np.array(tmp.fit_transform(X_train))
tmp_test = np.array(tmp.transform(X_test))

result = ChallengeResult('standardizer', 
                         X_train_transformed=tmp_train,
                         X_test_transformed=tmp_test
)

result.write()
print(result.check())

platform darwin -- Python 3.8.6, pytest-6.2.1, py-1.10.0, pluggy-0.13.1 -- /Users/jb/.pyenv/versions/3.8.6/envs/lewagon/bin/python3.8
cachedir: .pytest_cache
rootdir: /Users/jb/code/lewagon/data-solutions/05-ML/08-Workflow/05-Hand-Made-Standardizer
plugins: anyio-2.0.2, dash-1.19.0
collecting ... collected 1 item

tests/test_standardizer.py::TestStandardizer::test_solution PASSED       [100%]



💯 You can commit your code:

git add tests/standardizer.pickle

git commit -m 'Completed standardizer step'

git push origin master


<details>
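<summary>💡 Hint if the test above only fails by a small margin</summary>

Be careful: there is a slight difference between `np.std()` and the pandas `.std()` method! NumPy divides by $n$ by default (`ddof=0`), whereas pandas divides by $n-1$ (`ddof=1`), and `StandardScaler` follows the NumPy convention. This Stack Overflow [post](https://stackoverflow.com/questions/44220290/sklearn-standardscaler-result-different-to-manual-result) might help 😉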
      
</details>
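
The hint above can be verified numerically. Here is a minimal sketch on the same toy `X_train`, where the variable names are chosen for illustration:

```python
import numpy as np
import pandas as pd

X_train = pd.DataFrame({"A": [1, 2, 3], "B": [2, 3, 4], "C": [3, 4, 5]})

# pandas default: sample standard deviation (divides by n - 1)
pandas_default = X_train.std()

# NumPy default: population standard deviation (divides by n)
numpy_default = np.std(X_train.to_numpy(), axis=0)

# Asking pandas for ddof=0 gives the NumPy result
pandas_ddof0 = X_train.std(ddof=0)

print(pandas_default.to_numpy())
print(numpy_default)
print(np.allclose(pandas_ddof0, numpy_default))
```

With only three rows the gap is large (1.0 versus about 0.816), which is enough to make a comparison with `StandardScaler` fail.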

### 2.2 Inverse Transform

❗️ Scikit-learn transformers such as `StandardScaler` also have an [`inverse_transform`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler.inverse_transform) method. Try to implement it in your custom scaler!
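
For standardization, inverting the transformation means computing $X = Z \sigma + \mu$ with the statistics stored during `.fit`. Below is a minimal sketch of that round trip using scikit-learn's own `StandardScaler` as the reference. `X_scaled`, `X_restored` and `manual` are names chosen here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

X_train = pd.DataFrame({"A": [1, 2, 3], "B": [2, 3, 4], "C": [3, 4, 5]})

scaler = StandardScaler().fit(X_train)
X_scaled = scaler.transform(X_train)

# Built-in inverse
X_restored = scaler.inverse_transform(X_scaled)

# Same computation by hand: multiply by the stored std, add the stored mean
manual = X_scaled * scaler.scale_ + scaler.mean_

print(np.allclose(X_restored, X_train))
print(np.allclose(manual, X_train))
```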

In [None]:
# YOUR CODE HERE

In [None]:
# Test yourself below

custom_scaler = CustomStandardizer().fit(X_train)
X_train_transformed = custom_scaler.transform(X_train)
display(X_train_transformed)

X_train_detransformed = custom_scaler.inverse_transform(X_train_transformed)
display(X_train_detransformed)

|   | A | B | C |
|---|---|---|---|
| 0 | -1.224745 | -1.224745 | -1.224745 |
| 1 | 0.0 | 0.0 | 0.0 |
| 2 | 1.224745 | 1.224745 | 1.224745 |

|   | A | B | C |
|---|---|---|---|
| 0 | 1.0 | 2.0 | 3.0 |
| 1 | 2.0 | 3.0 | 4.0 |
| 2 | 3.0 | 4.0 | 5.0 |


In [None]:
assert np.allclose(X_train_detransformed, X_train)

### 2.3 Complete custom pipeline!

👉 Now that we have replicated scikit-learn's `StandardScaler`, we can create many new ones!

Try to create the following:

- A `CustomStandardizer(shrink_factor=1)` which takes one additional argument, so that features are divided by `shrink_factor` standard deviations instead of one


- A `FeatureAverager()` class that improves upon the one you built in section 1 by scaling the result as follows:

$$(X_1, X_2, X_3) \rightarrow \frac{\frac{1}{3}(X_1 + X_2 + X_3)}{\max(X_1, X_2, X_3)}$$

Then, use them both in your initial feature union `pipe` to make your own custom pipeline!

In [None]:
# Custom Standardizer

# $DELETE_BEGIN
# Bonus: allows us to raise a NotFittedError when one calls the transform method before fitting the instance
from sklearn.exceptions import NotFittedError
# $DELETE_END

class CustomStandardizer(TransformerMixin, BaseEstimator):
    
    def __init__(self, shrink_factor = 1):
        # $CHALLENGIFY_BEGIN
        self.shrink_factor = shrink_factor
        # $CHALLENGIFY_END
    
    def fit(self, X, y=None):
        # Store what needs to be stored as instance attributes. Return "self" to allow chaining fit and transform.
        # $CHALLENGIFY_BEGIN
        self.means = X.mean()
        self.stds = X.std(ddof=0)
        # Return self to allow chaining & fit_transform
        return self
        # $CHALLENGIFY_END
    
    def transform(self, X, y=None): 
        # $CHALLENGIFY_BEGIN
        if not (hasattr(self, "means") and hasattr(self, "stds")):
            raise NotFittedError("This CustomStandardizer instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.")
        # Standardization
        standardized_feature = (X - self.means) / self.stds / self.shrink_factor
        return standardized_feature
        # $CHALLENGIFY_END
    
    def inverse_transform(self, X, y=None):
        # $CHALLENGIFY_BEGIN
        if not (hasattr(self, "means") and hasattr(self, "stds")):
            raise NotFittedError("This CustomStandardizer instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.")
        return X * self.shrink_factor * self.stds + self.means
        # $CHALLENGIFY_END
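
Storing the constructor argument under the same name (`self.shrink_factor = shrink_factor`) is what lets `BaseEstimator` provide `get_params`, `set_params` and cloning for free. Here is a minimal sketch on a deliberately tiny transformer. The class `Shrinker` is hypothetical and only stands in for the standardizer above.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone


class Shrinker(TransformerMixin, BaseEstimator):
    """Hypothetical toy transformer: divides every value by shrink_factor."""

    def __init__(self, shrink_factor=1):
        # Attribute name must match the argument name for get_params to work
        self.shrink_factor = shrink_factor

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X, y=None):
        return X / self.shrink_factor


shrinker = Shrinker(shrink_factor=2)
params = shrinker.get_params()
print(params)

# clone() builds an unfitted copy from the parameters; set_params() edits them
other = clone(shrinker).set_params(shrink_factor=3)
print(shrinker.shrink_factor, other.shrink_factor)

halved = shrinker.fit_transform(pd.DataFrame({"A": [2, 4]}))
print(halved)
```

These hooks are what tools such as grid search rely on to try different values of a parameter.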

Test your new `CustomStandardizer` custom transformer by fitting it on `X_train` and transforming it

In [None]:
custom_scaler = CustomStandardizer(shrink_factor=2).fit(X_train)

X_train_transformed = custom_scaler.transform(X_train)
display(X_train_transformed)

X_train_detransformed = custom_scaler.inverse_transform(X_train_transformed)
display(X_train_detransformed)

|   | A | B | C |
|---|---|---|---|
| 0 | -0.612372 | -0.612372 | -0.612372 |
| 1 | 0.0 | 0.0 | 0.0 |
| 2 | 0.612372 | 0.612372 | 0.612372 |

|   | A | B | C |
|---|---|---|---|
| 0 | 1.0 | 2.0 | 3.0 |
| 1 | 2.0 | 3.0 | 4.0 |
| 2 | 3.0 | 4.0 | 5.0 |


In [None]:
# Feature Averager

# $DELETE_BEGIN
# Bonus: allows us to raise a NotFittedError when one calls the transform method before fitting the instance
from sklearn.exceptions import NotFittedError
# $DELETE_END

class FeatureAverager(TransformerMixin, BaseEstimator):
    
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        # Store what needs to be stored as instance attributes. Return "self" to allow chaining fit and transform.
        # $CHALLENGIFY_BEGIN
        self.features_sum = X['A'] + X['B'] + X['C']
        self.max_factor = np.max(self.features_sum)
        # Return self to allow chaining & fit_transform
        return self
        # $CHALLENGIFY_END
    
    def transform(self, X, y=None): 
        # $CHALLENGIFY_BEGIN
        if not (hasattr(self, "max_factor") and hasattr(self, "features_sum")):
            raise NotFittedError("This FeatureAverager instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.")
        # Feature Averager
        feature_averager = (1/3 * self.features_sum) / self.max_factor
        return pd.DataFrame(feature_averager)
        # $CHALLENGIFY_END
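
Note that the `transform` above reuses the row sums stored during `fit`, so its output does not depend on the `X` it receives. Here is a minimal sketch of a variant, assuming the intent is to average the features of whichever frame is passed to `transform` while keeping only the scaling factor from the train set. The class name `InputFeatureAverager` and the attribute `max_factor_` are hypothetical, and this variant returns different values on `X_test` than the class above.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.exceptions import NotFittedError


class InputFeatureAverager(TransformerMixin, BaseEstimator):
    """Hypothetical variant: only the scaling factor is learned during fit."""

    def fit(self, X, y=None):
        # State: the largest row sum seen in the train set
        self.max_factor_ = (X["A"] + X["B"] + X["C"]).max()
        return self

    def transform(self, X, y=None):
        if not hasattr(self, "max_factor_"):
            raise NotFittedError("This InputFeatureAverager instance is not fitted yet.")
        # Row sums are recomputed from the frame being transformed
        features_sum = X["A"] + X["B"] + X["C"]
        return pd.DataFrame(1 / 3 * features_sum / self.max_factor_)


X_train = pd.DataFrame({"A": [1, 2, 3], "B": [2, 3, 4], "C": [3, 4, 5]})
X_test = pd.DataFrame({"A": [1, 2, 3], "B": [2, 3, 4], "C": [3, 4, 10]})

averager = InputFeatureAverager().fit(X_train)
train_out = averager.transform(X_train)
test_out = averager.transform(X_test)
print(train_out)
print(test_out)
```

On `X_train` the values match those of the class above. On `X_test` the last row changes because its row sum (17) is divided by the train maximum (12).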
    

Test your `FeatureAverager` custom transformer by fitting it on `X_train` and transforming it

In [None]:
# $CHALLENGIFY_BEGIN
custom_feature_averager = FeatureAverager().fit(X_train)

X_train_transformed = custom_feature_averager.transform(X_train)
display(X_train_transformed)
# $CHALLENGIFY_END

|   | 0 |
|---|---|
| 0 | 0.166667 |
| 1 | 0.25 |
| 2 | 0.333333 |


In [None]:
from nbresult import ChallengeResult

tmp = FeatureAverager()
tmp_train = np.array(tmp.fit_transform(X_train))
tmp_test = np.array(tmp.transform(X_test))

result = ChallengeResult('feature_averager', 
                         X_train_transformed=tmp_train,
                         X_test_transformed=tmp_test
)

result.write()
print(result.check())

platform darwin -- Python 3.8.6, pytest-6.2.1, py-1.10.0, pluggy-0.13.1 -- /Users/jb/.pyenv/versions/3.8.6/envs/lewagon/bin/python3.8
cachedir: .pytest_cache
rootdir: /Users/jb/code/lewagon/data-solutions/05-ML/08-Workflow/05-Hand-Made-Standardizer
plugins: anyio-2.0.2, dash-1.19.0
collecting ... collected 1 item

tests/test_feature_averager.py::TestFeatureAverager::test_solution PASSED [100%]



💯 You can commit your code:

git add tests/feature_averager.pickle

git commit -m 'Completed feature_averager step'

git push origin master


Create a feature union named `pipe` using your custom standardizer and the feature averager created above

In [None]:
# $CHALLENGIFY_BEGIN
custom_standardizer = CustomStandardizer(shrink_factor=1)
custom_feature_averager = FeatureAverager()

pipe = make_union(custom_standardizer, custom_feature_averager)
pipe
# $CHALLENGIFY_END

Fit `pipe` on `X_train`, then transform both `X_train` and `X_test` with it

In [None]:
# fit and transform X_train

# $CHALLENGIFY_BEGIN
pipe.fit(X_train)
pd.DataFrame(pipe.transform(X_train))
# $CHALLENGIFY_END

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | -1.224745 | -1.224745 | -1.224745 | 0.166667 |
| 1 | 0.0 | 0.0 | 0.0 | 0.25 |
| 2 | 1.224745 | 1.224745 | 1.224745 | 0.333333 |


In [None]:
# transform X_test (pipe is already fitted on X_train)

# $CHALLENGIFY_BEGIN
pd.DataFrame(pipe.transform(X_test))
# $CHALLENGIFY_END

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | -1.224745 | -1.224745 | -1.224745 | 0.166667 |
| 1 | 0.0 | 0.0 | 0.0 | 0.25 |
| 2 | 1.224745 | 1.224745 | 7.348469 | 0.333333 |


In [None]:
from nbresult import ChallengeResult

tmp = pipe
tmp_train = np.array(tmp.fit_transform(X_train))
tmp_test = np.array(tmp.transform(X_test))

result = ChallengeResult('feature_union_custom_transformers', 
                         X_train_transformed=tmp_train,
                         X_test_transformed=tmp_test
)

result.write()
print(result.check())

platform darwin -- Python 3.8.6, pytest-6.2.1, py-1.10.0, pluggy-0.13.1 -- /Users/jb/.pyenv/versions/3.8.6/envs/lewagon/bin/python3.8
cachedir: .pytest_cache
rootdir: /Users/jb/code/lewagon/data-solutions/05-ML/08-Workflow/05-Hand-Made-Standardizer
plugins: anyio-2.0.2, dash-1.19.0
collecting ... collected 1 item

tests/test_feature_union_custom_transformers.py::TestFeatureUnionCustomTransformers::test_solution PASSED [100%]



💯 You can commit your code:

git add tests/feature_union_custom_transformers.pickle

git commit -m 'Completed feature_union_custom_transformers step'

git push origin master


🏁 Congratulations! Don't forget to commit and push your notebooks