# Object-Oriented Programming - Practice
> Writing a Standard Scaler from Scratch

In this notebook, we walk through the process of coding sklearn's `StandardScaler` class from scratch.

In [1]:
# Run this cell unchanged
# Import assignment packages
import pandas as pd
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

# Load data
data = load_diabetes()
df = pd.DataFrame(data['data'], columns = data['feature_names'])
df['target'] = data['target']

# Output preview of data
df.head(2)

| | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.038076 | 0.05068 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019908 | -0.017646 | 151.0 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.06833 | -0.092204 | 75.0 |


**Let's set up a train-test split for our dataset.**

In [2]:
# Run this cell unchanged
X = df.drop('target', axis = 1)
y = df.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2021)

**The process we will move through in this notebook will be the following:**
1. Write `fit` and `transform` functions ***outside*** of a class. 
    * We want to get the code working before we throw it into a class.
    
2. Create a `StandardScaler` class with `fit` and `transform` methods
    * We will need to add the `self` variable during this step. 
    
3. Compare our results with Sklearn's

### fit

In the cell below, we have defined a function called `fit`. 

**This function should receive 1 argument.**
1. `X` - A pandas dataframe or numpy array

**This function should execute the following steps:**
1. Convert `X` to a numpy array by passing the input into `np.array`.
    * Positional indexing differs between the two types. A pandas dataframe uses `.iloc` (e.g. `df.iloc[:, 0]` returns the first column), but numpy arrays have no `.iloc` attribute. Converting the input to a numpy array up front lets the same slicing syntax, `x_array[:, 0]`, work for either input type.
    
2. Loop over the columns of the numpy array.
3. For each column, calculate the mean and standard deviation.
4. Store the statistics in the container as a tuple with the following format:
```python
(mean, standard_deviation)
```
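
The indexing note in step 1 and the tuple format can be illustrated with a small sketch. The toy dataframe and variable names below are hypothetical and are not part of the assignment; they only show that the same column is reachable both ways and what one `(mean, standard_deviation)` entry looks like.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the diabetes dataset
toy_df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

# Positional column access on a dataframe goes through .iloc
first_col_df = toy_df.iloc[:, 0]

# After conversion, the same column is reached with plain numpy slicing
toy_array = np.array(toy_df)
first_col_np = toy_array[:, 0]

# One container entry in the (mean, standard_deviation) format
stats = (first_col_np.mean(), first_col_np.std())
print(stats)
```

Note that `.std()` on a numpy array defaults to the population standard deviation (`ddof=0`).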

In [7]:
container = []

def fit(X):
    # Convert X to a numpy array by passing the input into np.array
    x_array = np.array(X)
    # Loop over the columns of the numpy array.
    for col in range(x_array.shape[1]):
        # For each column, calculate the mean and standard deviation
        mean = x_array[:, col].mean()
        std = x_array[:, col].std()
        # Store the statistics in the container
        container.append((mean,std))
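
As an aside, the column loop can also be expressed with numpy's `axis` argument. The sketch below is one possible variation, assuming a hypothetical helper named `fit_vectorized` that returns the statistics instead of appending to the shared `container`; it is not the version the notebook's tests expect.

```python
import numpy as np

def fit_vectorized(X):
    """Hypothetical variant of fit: returns [(mean, std), ...] per column."""
    x_array = np.asarray(X, dtype=float)
    # axis=0 collapses the rows, leaving one statistic per column
    means = x_array.mean(axis=0)
    stds = x_array.std(axis=0)
    return list(zip(means, stds))

# Toy input with two columns
toy = np.array([[1.0, 10.0], [3.0, 30.0]])
toy_stats = fit_vectorized(toy)
print(toy_stats)
```

Returning the statistics avoids relying on a global variable, which is the same concern the class version later addresses with `self`.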

Let's test our function on `X_train`.

In [8]:
# Run this cell unchanged

from src.public_tests import test_fit
# Create container
container = []
# Run fit function
fit(X_train)
# Test results
test_fit(container)

✅ The fit function added the correct data to the container!


### transform

Below we define a function called `transform`. 

**This function should receive 1 argument**
1. `X` - Pandas dataframe or numpy array

**This function should execute the following steps:**
1. Convert `X` to a numpy array by passing the input into `np.array`.
2. Loop over the columns of `X`.
3. Access the mean and standard deviation that were created from the `fit` function and stored in the container variable.
4. Subtract the mean from the column and divide by the standard deviation.
5. Return the transformed version of X
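
Step 4 is the usual z-score calculation. As a minimal sketch on a single made-up column (the values are illustrative only), subtracting the mean and dividing by the standard deviation should leave the column with a mean of roughly 0 and a standard deviation of roughly 1:

```python
import numpy as np

# Hypothetical single column of values
column = np.array([2.0, 4.0, 6.0])

mean = column.mean()   # 4.0
std = column.std()     # population standard deviation (ddof=0)

# Subtract the mean, then divide by the standard deviation
scaled = (column - mean) / std

print(scaled)
print(scaled.mean(), scaled.std())
```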

In [13]:
def transform(X):
    # Convert X to a numpy array by passing the input into np.array
    x_array = np.array(X)
    # Loop over the columns of X
    for col in range(x_array.shape[1]):
        # Access the mean and standard deviation that were 
        # created from the fit function and stored in the container variable.
        mean = container[col][0]
        std = container[col][1]
        # Subtract the mean from the column 
        # and divide by the standard deviation.
        x_array[:, col] = ((x_array[:, col] - mean) / std)
    # Return the transformed version of X
    return x_array
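
One caveat about writing the scaled values back into `x_array` in place: if the input holds integers, `np.array` keeps an integer dtype, and the assignment silently truncates the scaled values. The diabetes features are already floats, so the notebook is unaffected. The sketch below uses made-up integer data to show the difference; passing `dtype=float` is one possible safeguard, not something the assignment requires.

```python
import numpy as np

# Hypothetical integer-valued input
int_data = np.array([[1, 10], [2, 20], [3, 30]])

as_int = np.array(int_data)                 # keeps the integer dtype
as_float = np.array(int_data, dtype=float)  # forces a float copy

for arr in (as_int, as_float):
    for col in range(arr.shape[1]):
        column = arr[:, col]
        # In-place assignment casts the result to the array's dtype
        arr[:, col] = (column - column.mean()) / column.std()

print(as_int)    # fractional parts are lost
print(as_float)  # fractional parts are kept
```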

Let's test our `transform` function on the training data!

In [14]:
from src.public_tests import test_transform

container = []
fit(X_train)
X_train_scaled = transform(X_train)
test_transform(X_train_scaled[:5])

✅ The transform function returned the correct data!


## Move our code into a `StandardScaler` class!

**Please complete the StandardScaler class**

In [16]:
class StandardScaler:
    
    def fit(self, X):
        self.data = []
        x_array = np.array(X)
        for col in range(x_array.shape[1]):
            mean = x_array[:, col].mean()
            std = x_array[:, col].std()
            self.data.append((mean,std))
            
    def transform(self, X):
        x_array = np.array(X)
        for col in range(x_array.shape[1]):
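            # Read the statistics stored on this instance by fit,
            # not the global container from the earlier cells
            mean = self.data[col][0]
            std = self.data[col][1]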
            x_array[:, col] = ((x_array[:, col] - mean) / std)
        return x_array
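
To see why the statistics belong on the instance rather than in a shared variable, consider two scalers fit on different data. The sketch below uses a hypothetical stand-in class, `MiniScaler`, with made-up inputs; it is a compact illustration rather than the class the assignment asks for.

```python
import numpy as np

class MiniScaler:
    """Hypothetical stand-in that keeps its statistics on self."""

    def fit(self, X):
        x_array = np.array(X, dtype=float)
        self.data = [
            (x_array[:, c].mean(), x_array[:, c].std())
            for c in range(x_array.shape[1])
        ]

    def transform(self, X):
        x_array = np.array(X, dtype=float)
        for c, (mean, std) in enumerate(self.data):
            x_array[:, c] = (x_array[:, c] - mean) / std
        return x_array

scaler_a = MiniScaler()
scaler_a.fit([[0.0], [2.0]])      # mean 1, std 1

scaler_b = MiniScaler()
scaler_b.fit([[10.0], [30.0]])    # mean 20, std 10

# The same input is scaled differently by each instance
scaled_a = scaler_a.transform([[3.0]])
scaled_b = scaler_b.transform([[3.0]])
print(scaled_a, scaled_b)
```

Because each instance reads its own `self.data`, fitting `scaler_b` does not change what `scaler_a` returns.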

Now let's compare our results with Sklearn's scaler!

In [17]:
from sklearn.preprocessing import StandardScaler as SklearnScaler

In [18]:
# Create an instance of our scaler
our_scaler = StandardScaler()
our_scaler.fit(X_train)

# Create an instance of sklearn's scaler
sk_scaler = SklearnScaler()
sk_scaler.fit(X_train)

StandardScaler()

In [19]:
# Scale the training data with both scalers
our_scaled_train = our_scaler.transform(X_train)
sk_scaled_train = sk_scaler.transform(X_train)

# Scale the test data with both scalers (both were fit on the training data only)
our_scaled_test = our_scaler.transform(X_test)
sk_scaled_test = sk_scaler.transform(X_test)

In [20]:
# Check if our scaled train is the same as sklearn's
np.all(our_scaled_train == sk_scaled_train)

True

In [21]:
# Check if our scaled test is the same as sklearn's
np.all(our_scaled_test == sk_scaled_test)

True
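
The exact `==` checks above returned `True` for this dataset. As a general caution, exact equality between floating-point arrays can fail when two implementations order their arithmetic differently, so a tolerance-based check such as `np.allclose` is often the safer comparison. A minimal sketch with made-up values:

```python
import numpy as np

# Two ways of writing the same value that differ in the last bit
a = np.array([0.1 + 0.2])
b = np.array([0.3])

exact_match = bool(np.all(a == b))         # False
tolerant_match = bool(np.allclose(a, b))   # True
print(exact_match, tolerant_match)
```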