<h1 style='text-align: center'>Importance of scaling data</h1>

<p  style='text-align: center'>
This notebook is in <span style='color: green; font-weight: 700'>Active</span> state of development!
<a style='font-weight:700' href='https://github.com/LilDataScientist'> Code on GitHub! </a></p>

# Materials

* [Gradient Descent in Practice I - Feature Scaling](https://www.coursera.org/learn/machine-learning/lecture/xx3Da/gradient-descent-in-practice-i-feature-scaling)
* [All about Feature Scaling](https://towardsdatascience.com/all-about-feature-scaling-bcc0ad75cb35)

# Preface ⌚
This notebook was not created to achieve the best public score, but to give a clear understanding of why features need to be scaled.

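Sklearn's LinearRegression fits the model with a direct least-squares solver rather than gradient descent, so it does not show how sensitive gradient descent is to the scale of the features. To see why scaling is needed, let's create our own LinearRegression trained with gradient descent.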

# What we are trying to prove 🥇

Scaling data can have a great impact on the result. In the picture below, the gradient descent path on the left is clearly much longer than the one on the right. Applying scaling to the left case produces the right one.


<div style="width:100%;text-align: center;">
    <img src="https://miro.medium.com/max/600/1*yi0VULDJmBfb1NaEikEciA.png" width="600px"/> 
</div>

Because sklearn's LinearRegression does not use gradient descent, we cannot use it to observe this difference, so we need to implement our own Linear Regression class!

# Loss function

We will minimize the mean squared error (MSE), which is the mean of the squared differences between the model output $y_{pred}$ and the actual value $y_{real}$:

$$\large
y_{pred} = bias + \langle x, w \rangle
$$


$$\large
MSE = \frac{1}{n}\sum_{i=1}^n{(y_{pred} - y_{real})^2}
$$

Let's calculate the gradients with respect to $w$ and $bias$.

Since $y_{pred}$ is $bias + \langle w, x \rangle$, we can say that:

$$\large
MSE = \frac{1}{n} \sum_{i=1}^n {( bias + \langle w, x \rangle  - y_{real})^2}
$$
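As a small illustration of the two formulas above, here is a minimal NumPy sketch. The toy values of `X`, `y_real`, `w`, and `bias` are made up for the example, and `predict` and `mse` are hypothetical helper names, not part of the notebook's class.

```python
import numpy as np

def predict(X, w, bias):
    # y_pred = bias + <x, w>, computed for every row x of X at once
    return X @ w + bias

def mse(y_pred, y_real):
    # Mean of the squared differences between prediction and target
    return np.mean((y_pred - y_real) ** 2)

# Toy data (assumption): three samples with a single feature
X = np.array([[1.0], [2.0], [3.0]])
y_real = np.array([2.0, 4.0, 6.0])
w = np.array([1.5])
bias = 0.5

y_pred = predict(X, w, bias)   # [2.0, 3.5, 5.0]
loss = mse(y_pred, y_real)     # (0 + 0.25 + 1) / 3
print(y_pred, loss)
```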

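Now let's take the partial derivatives with respect to $w_i$ and $bias$. The formulas below are written for a single sample; the full gradient is their average over all $n$ samples.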

# Derivative with respect to $w_i$

$$\large
\frac{dMSE}{dw_i} = 2 \cdot x_i \cdot (bias + \langle w, x \rangle - y_{real})
$$  

Since $y_{pred}$ is $bias + \langle w, x \rangle$ we can say that: 

$$\large
\frac{dMSE}{dw_i} = 2 \cdot x_i \cdot (y_{pred} - y_{real})
$$

Since the constant factor 2 does not change where the minimum is (it can be absorbed into the learning rate), we can omit it for convenience:

$$\large
\frac{dMSE}{dw_i} = x_i \cdot (y_{pred} - y_{real})
$$

# Derivative with respect to $bias$

$$\large
\frac{dMSE}{dbias} = 2 \cdot (bias + \langle w, x \rangle - y_{real})
$$

Since $y_{pred}$ is $bias + \langle w, x \rangle$, we can say that:

$$\large
\frac{dMSE}{dbias} = 2 \cdot (y_{pred} - y_{real})
$$

Since the constant factor 2 does not change where the minimum is (it can be absorbed into the learning rate), we can omit it for convenience:

$$\large
\frac{dMSE}{dbias} = y_{pred} - y_{real}
$$
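One way to sanity-check these derivatives is to compare them with finite differences. The sketch below is only an illustration on random data: `mse` and `analytic_gradients` are hypothetical helper names, and it keeps the factor 2 and averages over all samples so that the result is the exact gradient of the MSE.

```python
import numpy as np

def mse(X, y, w, bias):
    return np.mean((X @ w + bias - y) ** 2)

def analytic_gradients(X, y, w, bias):
    # error = y_pred - y_real for every sample
    error = X @ w + bias - y
    dw = 2 * X.T @ error / len(y)   # average of 2 * x_i * (y_pred - y_real)
    db = 2 * error.mean()           # average of 2 * (y_pred - y_real)
    return dw, db

# Random data (assumption): 50 samples, 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)
w = rng.normal(size=3)
bias = 0.7

dw, db = analytic_gradients(X, y, w, bias)

# Central finite differences as an independent estimate of the gradient
eps = 1e-6
num_dw = np.zeros_like(w)
for i in range(len(w)):
    step = np.zeros_like(w)
    step[i] = eps
    num_dw[i] = (mse(X, y, w + step, bias) - mse(X, y, w - step, bias)) / (2 * eps)
num_db = (mse(X, y, w, bias + eps) - mse(X, y, w, bias - eps)) / (2 * eps)

print(np.allclose(dw, num_dw, atol=1e-6), abs(db - num_db) < 1e-6)
```

If both checks print `True`, the analytic formulas agree with the numerical estimate up to rounding error.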

# Linear Regression 💻

In [None]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_log_error

In [None]:
class LinearRegression:
    
    def __init__(self, iterations=10000, learning_rate=0.1):
        self.lr = learning_rate
        self.iterations = iterations
        self.weights = None
        self.bias = None

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0

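        for iteration in range(self.iterations):
            y_predicted = self.predict(X)
            # Residuals (prediction error per sample), not the loss value itself
            error = y_predicted - y

            # Gradients of the MSE (constant factor 2 dropped), averaged over samples
            dw = (X.T @ error) / n_samples
            db = error.mean(axis=0)

            # Gradient descent step
            self.weights -= self.lr * dw
            self.bias -= self.lr * db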


    def predict(self, X):
        return X @ self.weights + self.bias

# Load and split data 📱

For simplicity, we will use only one feature: LotArea.

In [None]:
train_data = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/train.csv')

X_train, X_test, y_train, y_test = train_test_split(train_data[['LotArea']], train_data['SalePrice'], test_size=0.25)

# Fit model without scaling features 📉

Because the values of LotArea and SalePrice are quite large, we need to use an extremely small learning rate; otherwise the updates overshoot. With such a small learning rate, gradient descent needs a lot of time to approach the minimum! Let's see:

In [None]:
model = LinearRegression(iterations=10000, learning_rate=0.000000001)
model.fit(X_train, y_train)
mean_squared_log_error(y_test, model.predict(X_test))

It took around 1 minute and we only reached a mean squared log error of around 0.7. That's very bad.
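To see why such a tiny learning rate is needed, here is a minimal sketch on synthetic data. The numbers are made up to roughly mimic lot areas and prices (they are not the Kaggle data), and `gradient_descent` is a hypothetical helper that mirrors the update rule of the class above for a single feature. With the unscaled feature, a learning rate of 0.1 makes the parameters explode, while 1e-9 stays stable but barely moves the bias.

```python
import numpy as np

def gradient_descent(x, y, learning_rate, iterations):
    """Fit y ~ w * x + b with plain gradient descent (single feature)."""
    w, b = 0.0, 0.0
    for _ in range(iterations):
        error = w * x + b - y
        w -= learning_rate * np.mean(x * error)
        b -= learning_rate * np.mean(error)
    return w, b

# Synthetic data (assumption): areas in the thousands, prices in the hundreds of thousands
rng = np.random.default_rng(0)
x = rng.uniform(1_000, 20_000, size=200)
y = 50_000 + 10 * x + rng.normal(0, 1_000, size=200)

# Overflow is expected in the divergent run, so silence NumPy's warnings there
with np.errstate(over="ignore", invalid="ignore"):
    w_large_lr, b_large_lr = gradient_descent(x, y, learning_rate=0.1, iterations=200)

w_small_lr, b_small_lr = gradient_descent(x, y, learning_rate=1e-9, iterations=200)

print("lr=0.1 :", w_large_lr, b_large_lr)   # not finite: every step overshoots
print("lr=1e-9:", w_small_lr, b_small_lr)   # finite, but the bias is still close to 0
```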

# Scaler class 💿

A machine learning algorithm that works on numbers does not know what those numbers represent. A weight of 10 grams and a price of 10 dollars are two different things, but to the model they are just the same number.

The problem is that if one feature has much larger values than another, the algorithm implicitly treats the larger feature as more important, simply because of its magnitude.

We can solve this problem by bringing all features into the same small range, say $-1 \leq x_i \leq 1$, where $x_i$ is a feature.

There are many scalers, but we will be using the Min-Max scaler.

$$x_{new} = \frac{x - x_{min}}{x_{max} - x_{min}}$$

MinMaxScaler scales every feature into the range $[0, 1]$.
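As a quick worked example of the formula, here is a minimal NumPy sketch on a made-up three-element feature (the values are only for illustration). It also shows the inverse mapping and what happens to a value outside the range seen during fitting.

```python
import numpy as np

# Toy feature (assumption): three made-up values
x = np.array([2.0, 4.0, 10.0])
x_min, x_max = x.min(), x.max()

# Min-max scaling: the minimum maps to 0 and the maximum maps to 1
x_scaled = (x - x_min) / (x_max - x_min)          # [0.0, 0.25, 1.0]

# Inverse transform recovers the original values
x_restored = x_scaled * (x_max - x_min) + x_min   # [2.0, 4.0, 10.0]

# A new value above the fitted maximum lands outside [0, 1]
new_scaled = (12.0 - x_min) / (x_max - x_min)     # 1.25

print(x_scaled, x_restored, new_scaled)
```

Note that the formula divides by $x_{max} - x_{min}$, so a constant feature (where both are equal) would need special handling.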

In [None]:
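class StandardScaler:
    # Note: despite its name, this class implements min-max scaling to [0, 1],
    # not standardization (zero mean, unit variance).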
    
    def __init__(self):
        self.max_value = None
        self.min_value = None
    
    def fit(self, X):
        self.max_value = X.max()
        self.min_value = X.min()
    
    def transform(self, X):
        return (X - self.min_value) / (self.max_value - self.min_value)
    
    def inverse_transform(self, X):
        """
        Scale back the data to the original representation
        """
        return (X * (self.max_value - self.min_value)) + self.min_value

# Fit model with scaling features 📈

In [None]:
scaler = StandardScaler()
scaler.fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

model = LinearRegression(iterations=10000, learning_rate=0.1)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mean_squared_log_error(y_test, y_pred)

As expected, we get a better score in less time!
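The same effect can be reproduced without the Kaggle data. The sketch below is only an illustration under stated assumptions: the data is synthetic, `gradient_descent` and `mse` are hypothetical helpers, and it compares the training MSE (not the mean squared log error used above) after the same number of iterations with and without min-max scaling.

```python
import numpy as np

def gradient_descent(x, y, learning_rate, iterations):
    """Fit y ~ w * x + b with plain gradient descent (single feature)."""
    w, b = 0.0, 0.0
    for _ in range(iterations):
        error = w * x + b - y
        w -= learning_rate * np.mean(x * error)
        b -= learning_rate * np.mean(error)
    return w, b

def mse(x, y, w, b):
    return np.mean((w * x + b - y) ** 2)

# Synthetic data (assumption): areas in the thousands, prices in the hundreds of thousands
rng = np.random.default_rng(0)
x = rng.uniform(1_000, 20_000, size=200)
y = 50_000 + 10 * x + rng.normal(0, 1_000, size=200)

iterations = 2_000

# Unscaled feature: only a tiny learning rate is stable
w_raw, b_raw = gradient_descent(x, y, learning_rate=1e-9, iterations=iterations)
mse_raw = mse(x, y, w_raw, b_raw)

# Min-max scaled feature: a much larger learning rate can be used
x_scaled = (x - x.min()) / (x.max() - x.min())
w_scaled, b_scaled = gradient_descent(x_scaled, y, learning_rate=0.1, iterations=iterations)
mse_scaled = mse(x_scaled, y, w_scaled, b_scaled)

print(f"MSE without scaling: {mse_raw:.3e}")
print(f"MSE with scaling:    {mse_scaled:.3e}")
```

With the same iteration budget, the scaled run ends with a much lower error, because the unscaled run has hardly had time to learn the bias.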