## Scaling in Machine Learning

Scaling is the process of standardizing or normalizing the features of a dataset so that they lie on comparable numeric ranges and no feature dominates simply because of its units. This matters because many machine learning algorithms are sensitive to the scale of the input features, including models trained with gradient descent (e.g., linear regression, logistic regression) and distance-based algorithms (e.g., k-nearest neighbors, support vector machines).

### Types of Scaling

#### 1. Standardization (Z-score Normalization)
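- Transforms each feature to have a mean of 0 and a standard deviation of 1
- Formula: `z = (x - μ)/σ`, where `μ` is the feature's mean and `σ` its standard deviation
- Works well when the data is roughly Gaussian; the result is not confined to a fixed range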

#### 2. Min-Max Scaling
- Scales each feature to a fixed range, usually [0, 1]
- Formula: `x' = (x - min(x))/(max(x) - min(x))`
- Effective for algorithms that require bounded input, but sensitive to outliers because a single extreme value sets `min(x)` or `max(x)`
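
The fixed range does not have to be [0, 1]. As a minimal sketch, assuming a target range `[lo, hi]` (names chosen here only for illustration), the general form is `x' = lo + (x - min(x)) * (hi - lo) / (max(x) - min(x))`. scikit-learn exposes the same idea through the `feature_range` parameter of `MinMaxScaler`:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative data, not taken from a real dataset
x = np.array([2.0, 4.0, 6.0, 10.0])
lo, hi = -1.0, 1.0  # hypothetical target range

# Manual min-max scaling to [lo, hi]
x_manual = lo + (x - x.min()) * (hi - lo) / (x.max() - x.min())

# scikit-learn expects a 2D array of shape (n_samples, n_features)
scaler = MinMaxScaler(feature_range=(lo, hi))
x_sklearn = scaler.fit_transform(x.reshape(-1, 1)).ravel()

print(x_manual)
print(np.allclose(x_manual, x_sklearn))
```

With `lo, hi = 0.0, 1.0` this reduces to the formula listed above.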

#### 3. Robust Scaling
- Centers on the median and scales by the interquartile range (IQR), making it robust to outliers
- Formula: `x' = (x - median)/IQR`, where `IQR = Q3 - Q1`
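
The robust-scaling formula can be sketched as follows. The data is illustrative, and the comparison with `RobustScaler` assumes its default settings (centering on the median, scaling by the 25th to 75th percentile range):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Illustrative data with one outlier (12)
x = np.array([1.0, 12.0, 3.0, 4.0, 5.0])

# Manual robust scaling: subtract the median, divide by the IQR
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
x_manual = (x - median) / iqr

# scikit-learn expects a 2D array of shape (n_samples, n_features)
x_sklearn = RobustScaler().fit_transform(x.reshape(-1, 1)).ravel()

print(median, iqr)
print(x_manual)
print(np.allclose(x_manual, x_sklearn))
```

The outlier still maps to a large value, but it no longer determines how the remaining points are spread, because the median and IQR are computed from the middle of the data.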

#### 4. MaxAbs Scaling
- Scales each feature by its maximum absolute value, so values fall in [-1, 1]
- Formula: `x' = x / max(|x|)`
- Useful for data that is already centered at zero or is sparse, since it does not shift the data and zeros stay zero
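
A minimal sketch of MaxAbs scaling on illustrative data, compared against scikit-learn's `MaxAbsScaler`:

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Illustrative data containing both negative and positive values
x = np.array([-4.0, 2.0, 8.0, -1.0])

# Manual MaxAbs scaling: divide by the largest absolute value
x_manual = x / np.abs(x).max()

# scikit-learn expects a 2D array of shape (n_samples, n_features)
x_sklearn = MaxAbsScaler().fit_transform(x.reshape(-1, 1)).ravel()

print(x_manual)
print(np.allclose(x_manual, x_sklearn))
```

Because the data is only divided and never shifted, the sign of each value is preserved.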

### Importance of Scaling

- **Improved Convergence**: Helps gradient descent converge faster, because features on similar scales allow a single learning rate to suit every parameter
- **Equal Feature Weight**: Prevents features with large numeric ranges from dominating distance calculations
- **Model Performance**: Can significantly change the results of distance-based algorithms and often improves model accuracy
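
To illustrate the distance argument, the sketch below uses a hypothetical dataset with an age column and an income column, and measures how much of the squared Euclidean distance between two rows comes from income. The data and the helper `income_share` are made up for this example.

```python
import numpy as np

# Hypothetical data: columns are age (years) and income (currency units)
X = np.array([
    [25, 50_000],
    [60, 50_500],
    [26, 58_000],
    [45, 42_000],
    [35, 65_000],
], dtype=float)


def income_share(p, q):
    """Fraction of the squared Euclidean distance between p and q
    that is contributed by the income column (index 1)."""
    sq_diff = (p - q) ** 2
    return sq_diff[1] / sq_diff.sum()


# Standardize each column using the population standard deviation (ddof=0)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Rows 0 and 1 differ by 35 years of age but only 500 units of income
raw_share = income_share(X[0], X[1])
scaled_share = income_share(X_std[0], X_std[1])

print(f"income share of squared distance, raw:    {raw_share:.4f}")
print(f"income share of squared distance, scaled: {scaled_share:.4f}")
```

On the raw data, income accounts for more than 99% of the squared distance between the two rows even though their incomes are nearly identical, purely because income is measured in larger numbers. After standardization, the 35-year age gap accounts for almost all of it.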

### When to Scale

Always consider scaling when:
- Using algorithms that are sensitive to feature magnitudes
- Features are measured in different units or on different scales
- The data contains significant outliers (in which case robust scaling is usually the safer choice)
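
These conditions can be checked quickly before choosing a scaler. The sketch below compares feature ranges and flags outliers on two small illustrative arrays. The helper `iqr_outliers` and the 1.5 × IQR threshold are assumptions for this example (a common rule of thumb, not a definitive test).

```python
import numpy as np

# Two small illustrative features on different scales
a = np.array([60, 100, 95, 125, 130])
b = np.array([1, 12, 3, 4, 5])


def iqr_outliers(x, k=1.5):
    """Return the values of x outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    mask = (x < q1 - k * iqr) | (x > q3 + k * iqr)
    return x[mask]


for name, x in {"a": a, "b": b}.items():
    value_range = x.max() - x.min()
    print(name, "range:", value_range, "outliers:", iqr_outliers(x))
```

Here the two features span different ranges (70 versus 11), and `b` contains one value that the rule flags as an outlier, so both the second and third conditions apply.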

In [134]:
import numpy as np
import pandas as pd

In [135]:
a = np.array([60, 100, 95, 125, 130]) 
b = np.array([1, 12, 3, 4, 5])

### Standard Scaling

In [136]:
# np.std uses the population standard deviation (ddof=0) by default
ma, mb = np.mean(a, axis=0), np.mean(b, axis=0)
sa, sb = np.std(a, axis=0), np.std(b, axis=0)

In [137]:
scl_a = (a - ma) / sa
scl_b = (b - mb) / sb

In [138]:
(ma, sa), scl_a.mean(), scl_a.std()

((102.0, 25.019992006393608), 8.881784197001253e-17, 0.9999999999999998)

In [139]:
(mb, sb), scl_b.mean(), scl_b.std()

((5.0, 3.7416573867739413), -2.2204460492503132e-17, 1.0)

In [140]:
scl_a, scl_b

(array([-1.67865761, -0.07993608, -0.27977627,  0.91926488,  1.11910507]),
 array([-1.06904497,  1.87082869, -0.53452248, -0.26726124,  0.        ]))

In [141]:
from sklearn.preprocessing import StandardScaler

In [142]:
scl_std = StandardScaler()

df = pd.DataFrame({
    'a': a,
    'b': b,
})

scl_std.fit(df)

In [143]:
print(scl_std.mean_)
print(ma, mb)

[102.   5.]
102.0 5.0


In [144]:
scl_std.transform(df).T

array([[-1.67865761, -0.07993608, -0.27977627,  0.91926488,  1.11910507],
       [-1.06904497,  1.87082869, -0.53452248, -0.26726124,  0.        ]])

In [145]:
scl_a, scl_b

(array([-1.67865761, -0.07993608, -0.27977627,  0.91926488,  1.11910507]),
 array([-1.06904497,  1.87082869, -0.53452248, -0.26726124,  0.        ]))
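
Rather than comparing the printed arrays by eye, the match can be checked programmatically. The sketch below rebuilds the same data so that it runs on its own. It also shows a common pitfall: pandas' `DataFrame.std` defaults to the sample standard deviation (`ddof=1`), whereas `np.std` and `StandardScaler` use the population form (`ddof=0`).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

a = np.array([60, 100, 95, 125, 130])
b = np.array([1, 12, 3, 4, 5])
df = pd.DataFrame({"a": a, "b": b})

# scikit-learn result
sk = StandardScaler().fit_transform(df)

# Manual standardization with the population standard deviation (ddof=0)
manual_pop = ((df - df.mean()) / df.std(ddof=0)).to_numpy()

# Manual standardization with the pandas default (sample std, ddof=1)
manual_sample = ((df - df.mean()) / df.std()).to_numpy()

print(np.allclose(manual_pop, sk))
print(np.allclose(manual_sample, sk))
```

The first check should pass and the second should fail, since dividing by the sample standard deviation produces slightly smaller scaled values.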

### Min-Max Scaling

In [146]:
scl_a_mm = (a-a.min()) / (a.max()-a.min())
scl_b_mm = (b-b.min()) / (b.max()-b.min())

In [147]:
scl_a_mm, scl_b_mm

(array([0.        , 0.57142857, 0.5       , 0.92857143, 1.        ]),
 array([0.        , 1.        , 0.18181818, 0.27272727, 0.36363636]))

In [148]:
from sklearn.preprocessing import MinMaxScaler

In [150]:
mm_scaler = MinMaxScaler()

df = pd.DataFrame({
    'a': a,
    'b': b,
})

mm_scaler.fit(df)

In [153]:
mm_scaler.transform(df).T

array([[0.        , 0.57142857, 0.5       , 0.92857143, 1.        ],
       [0.        , 1.        , 0.18181818, 0.27272727, 0.36363636]])