# Numeric Scaling

---

## What is Numeric Scaling?

Numeric scaling is a preprocessing technique that transforms numerical features to a common scale while preserving the relative ordering and spacing of values within each feature.

---


## Why Scaling Matters

Some machine learning algorithms are sensitive to the scale of numerical features. Without proper scaling, these algorithms may perform poorly or converge slowly.

### Algorithms affected by feature scaling include:

- **Support Vector Machines (SVM):**  
  SVM uses distance-based calculations to find the optimal hyperplane. Features with larger scales can dominate the distance metrics, leading to biased results.

- **k-Nearest Neighbors (k-NN):**  
  k-NN relies on distance metrics (e.g., Euclidean distance) to find neighbors. Features on larger scales will disproportionately influence neighbor selection.

- **Neural Networks:**  
  Large feature scales can cause gradients to explode or vanish, making training unstable or slow.

Proper scaling puts features on a comparable footing, so that no feature dominates the learning process because of its units alone. This typically improves model performance and training stability.

---

## Example Data

Consider the following sample dataset of two features:

| Sample | Feature 1 | Feature 2 |
|--------|-----------|-----------|
| 1      | 10        | 1000      |
| 2      | 15        | 1500      |
| 3      | 20        | 2000      |
| 4      | 25        | 2500      |
| 5      | 30        | 3000      |

Feature 2 is 100 times larger than Feature 1 in every sample, so scale-sensitive algorithms such as SVM, k-NN, and neural networks will be dominated by Feature 2 unless the data is scaled.
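To make the scale problem concrete, the following sketch computes the Euclidean distance between samples 1 and 2 from the table and measures how much of the squared distance each feature contributes. The variable names are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

# Samples 1 and 2 from the table: [Feature 1, Feature 2]
a = np.array([10.0, 1000.0])
b = np.array([15.0, 1500.0])

# Per-feature squared differences that make up the Euclidean distance
squared_diffs = (a - b) ** 2
distance = np.sqrt(squared_diffs.sum())

# Fraction of the squared distance contributed by each feature
share = squared_diffs / squared_diffs.sum()

print(distance)
print(share)
```

Feature 2 accounts for roughly 99.99% of the squared distance, so a distance-based model would effectively ignore Feature 1 on the unscaled data.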

---

## Common Scaling Methods

### 1. Min-Max Scaling (Normalization)

Scales features to a fixed range, usually [0, 1].

**Formula:**  
X' = (X - X_min) / (X_max - X_min)

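Min-max scaling is sensitive to outliers: a single extreme value sets X_min or X_max and compresses all other values into a narrow part of the range.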

```python
from sklearn.preprocessing import MinMaxScaler
import numpy as np

data = np.array([[10, 1000],
                 [15, 1500],
                 [20, 2000],
                 [25, 2500],
                 [30, 3000]])

scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(data)
print(scaled_data)
```

Output:

```
[[0.   0.  ]
 [0.25 0.25]
 [0.5  0.5 ]
 [0.75 0.75]
 [1.   1.  ]]
```
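As a cross-check on the formula, here is a minimal NumPy sketch that applies min-max scaling by hand, column by column. It assumes no column is constant (otherwise the denominator would be zero). The second part uses a hypothetical column with one extreme value to illustrate the outlier sensitivity noted above.

```python
import numpy as np

data = np.array([[10, 1000],
                 [15, 1500],
                 [20, 2000],
                 [25, 2500],
                 [30, 3000]], dtype=float)

# Column-wise minimum and maximum
col_min = data.min(axis=0)
col_max = data.max(axis=0)

# X' = (X - X_min) / (X_max - X_min), applied per column via broadcasting
manual_scaled = (data - col_min) / (col_max - col_min)
print(manual_scaled)

# Hypothetical column with a single outlier (1000)
with_outlier = np.array([10, 15, 20, 25, 1000], dtype=float)
outlier_scaled = (with_outlier - with_outlier.min()) / (with_outlier.max() - with_outlier.min())
print(outlier_scaled)
```

In the second case the four ordinary values all land below 0.02, while the outlier alone maps to 1.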


### 2. Standardization (Z-score Normalization)

Standardization is a scaling technique that transforms numerical features so that they have a mean of zero and a standard deviation of one (unit variance).

---

#### Formula:

X' = (X - μ) / σ

- X: Original value  
- μ: Mean of the feature  
- σ: Standard deviation of the feature  
- X': Standardized value

---

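#### Characteristics:

- Centers the data around zero (mean = 0).  
- Scales the data to have unit variance (standard deviation = 1).  
- Less sensitive to outliers than min-max scaling, although extreme values still shift the mean and inflate the standard deviation.  
- Useful for algorithms that work best with centered, comparably scaled features, such as regularized linear regression, logistic regression, and neural networks. Standardization does not make the data normally distributed; it only shifts and rescales it.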


```python
from sklearn.preprocessing import StandardScaler
import numpy as np

data = np.array([[10, 1000],
                 [15, 1500],
                 [20, 2000],
                 [25, 2500],
                 [30, 3000]])

scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
print(scaled_data)
```

Output:

```
[[-1.41421356 -1.41421356]
 [-0.70710678 -0.70710678]
 [ 0.          0.        ]
 [ 0.70710678  0.70710678]
 [ 1.41421356  1.41421356]]
```
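The same result can be reproduced by hand, which also shows which standard deviation is used. This is a minimal sketch, assuming the population standard deviation (`ddof=0`, NumPy's default), which is what `StandardScaler` uses. With the sample standard deviation (`ddof=1`) the values would differ slightly.

```python
import numpy as np

data = np.array([[10, 1000],
                 [15, 1500],
                 [20, 2000],
                 [25, 2500],
                 [30, 3000]], dtype=float)

# Column-wise mean and population standard deviation (ddof=0)
mean = data.mean(axis=0)
std = data.std(axis=0)

# X' = (X - mu) / sigma, applied per column via broadcasting
manual_standardized = (data - mean) / std
print(manual_standardized)

# Each column should now have mean 0 and standard deviation 1
print(manual_standardized.mean(axis=0))
print(manual_standardized.std(axis=0))
```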


### 3. Robust Scaling

Robust scaling uses statistics that are robust to outliers, namely the median and the interquartile range (IQR), instead of the mean and standard deviation.

---

#### Formula:

X' = (X - median(X)) / IQR(X)

- X: Original value  
- median(X): Median of the feature  
- IQR(X): Interquartile range (Q3 - Q1) of the feature  
- X': Scaled value


---

#### Characteristics:

- Centers data around the median instead of the mean.  
- Scales data according to the IQR (difference between 75th and 25th percentile).  
- More robust to outliers than standardization and min-max scaling.  
- Useful when data contains many outliers or is not normally distributed.

```python
from sklearn.preprocessing import RobustScaler
import numpy as np

data = np.array([[10, 1000],
                 [15, 1500],
                 [20, 2000],
                 [25, 2500],
                 [1000, 3000]])  # Outlier in the first column

scaler = RobustScaler()
scaled_data = scaler.fit_transform(data)
print(scaled_data)
```

Output:

```
[[-1.  -1. ]
 [-0.5 -0.5]
 [ 0.   0. ]
 [ 0.5  0.5]
 [98.   1. ]]
```
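The robust scaling formula can also be checked by hand. The sketch below assumes the 25th and 75th percentiles with NumPy's default linear interpolation, which matches the default `quantile_range=(25.0, 75.0)` of `RobustScaler`.

```python
import numpy as np

data = np.array([[10, 1000],
                 [15, 1500],
                 [20, 2000],
                 [25, 2500],
                 [1000, 3000]], dtype=float)  # Outlier in the first column

# Column-wise median and interquartile range (Q3 - Q1)
median = np.median(data, axis=0)
q1, q3 = np.percentile(data, [25, 75], axis=0)
iqr = q3 - q1

# X' = (X - median(X)) / IQR(X), applied per column via broadcasting
manual_robust = (data - median) / iqr
print(manual_robust)
```

The outlier does not move the median (20) or the IQR (10) of the first column, so the four ordinary values keep the same spacing they would have without it, and only the outlier itself maps to a large value.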
