# Numerical Data Scaling & Normalization

Data preprocessing is often necessary for models to work well: it can improve their accuracy and help reduce overfitting. For more information, check out the User Guide for [Data Preprocessing](https://scikit-learn.org/stable/modules/preprocessing.html) in the scikit-learn docs.

We will first look at **numerical data**, meaning integers or floating-point values.

- **Discrete**: Number of children, number of floors, number of employees
- **Continuous**: Height, Weight, Price, Temperature, Distance

The most common feature engineering techniques when dealing with numerical data are:

- **Normalization**: Converting numerical values into a standard range.
- **Binning** (also referred to as bucketing): Converting numerical values into buckets of ranges.

> Here are some reasons for scaling features:
>
> - Models that rely on the distance between a pair of samples, for instance k-nearest neighbors, should be trained on normalized features to make each feature contribute approximately equally to the distance computations.
>
> - Many models such as logistic regression use a numerical solver (based on gradient descent) to find their optimal parameters. This solver converges faster when the features are scaled, as it requires fewer steps (called iterations) to reach the optimal solution.

## Numerical data: normalization

The goal of normalization is to transform features to be on a similar scale. For example, consider the following two features:

- Feature X spans the range `154` to `24,917,482`.
- Feature Y spans the range `5` to `22`.

These two features span very different ranges. Normalization might manipulate X and Y so that they span a similar range, perhaps 0 to 1. Intuitively, it puts the values on a fair scale for comparison:

- A house price of `$10` million is large, and so is a room count of `20`. After scaling both features to the 0-to-1 range, large values end up close to 1 and small values end up close to 0, regardless of the original units.

See: [Importance of Feature Scaling](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html)
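To make the distance argument concrete, here is a minimal sketch using the X and Y ranges quoted above. The three samples `a`, `b`, and `c` and the helper `min_max` are hypothetical, chosen only for illustration. Before scaling, the Euclidean distance is driven almost entirely by X; after scaling, the large change in Y becomes visible.

```python
import numpy as np

# Assumed feature bounds, taken from the X and Y ranges in the text.
x_min, x_max = 154.0, 24_917_482.0
y_min, y_max = 5.0, 22.0

# Three hypothetical samples, each as [X, Y].
a = np.array([1_000_000.0, 6.0])
b = np.array([1_000_000.0, 21.0])  # same X as a, very different Y
c = np.array([1_200_000.0, 6.0])   # same Y as a, modestly different X


def min_max(sample):
    """Scale an [X, Y] sample into the 0-to-1 range using the assumed bounds."""
    lows = np.array([x_min, y_min])
    highs = np.array([x_max, y_max])
    return (sample - lows) / (highs - lows)


# Euclidean distances on the raw features.
raw_ab = np.linalg.norm(a - b)
raw_ac = np.linalg.norm(a - c)

# Euclidean distances after min-max scaling.
scaled_ab = np.linalg.norm(min_max(a) - min_max(b))
scaled_ac = np.linalg.norm(min_max(a) - min_max(c))

print(f"raw:    d(a, b) = {raw_ab:.1f}, d(a, c) = {raw_ac:.1f}")
print(f"scaled: d(a, b) = {scaled_ab:.3f}, d(a, c) = {scaled_ac:.3f}")
```

On the raw features, `c` looks far from `a` and `b` looks close; after scaling, the ordering flips, because Y now contributes on the same footing as X.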

### Example: House Prices vs Jacket Prices

```python
import pandas as pd
```

```python
house_price = 1_000

house_prices = [
    500_000,
    600_000,
    700_000,
    550_000,
    450_000,
    800_000,
    1_200_000,
    900_000,
]
```

```python
jacket_price = 1_000
jacket_prices = [
    100,
    200,
    300,
    400,
    500,
    450,
    650,
    1200,
]
```

```python
df = pd.DataFrame({
    'house_prices': house_prices,
    'jacket_prices': jacket_prices
})
df
```

|   | house_prices | jacket_prices |
|---|--------------|---------------|
| 0 | 500000 | 100 |
| 1 | 600000 | 200 |
| 2 | 700000 | 300 |
| 3 | 550000 | 400 |
| 4 | 450000 | 500 |
| 5 | 800000 | 450 |
| 6 | 1200000 | 650 |
| 7 | 900000 | 1200 |


```python
df['house_price_z'] = (df['house_prices'] - df['house_prices'].mean()) / df['house_prices'].std()
df['jacket_price_z'] = (df['jacket_prices'] - df['jacket_prices'].mean()) / df['jacket_prices'].std()
df
```

|   | house_prices | jacket_prices | house_price_z | jacket_price_z |
|---|--------------|---------------|---------------|----------------|
| 0 | 500000 | 100 | -0.853666 | -1.102396 |
| 1 | 600000 | 200 | -0.451941 | -0.808424 |
| 2 | 700000 | 300 | -0.050216 | -0.514452 |
| 3 | 550000 | 400 | -0.652804 | -0.220479 |
| 4 | 450000 | 500 | -1.054529 | 0.073493 |
| 5 | 800000 | 450 | 0.35151 | -0.073493 |
| 6 | 1200000 | 650 | 1.958411 | 0.514452 |
| 7 | 900000 | 1200 | 0.753235 | 2.1313 |


```python
df.loc[df['jacket_prices'] == 1200, ['jacket_prices', 'jacket_price_z']]
```

|   | jacket_prices | jacket_price_z |
|---|---------------|----------------|
| 7 | 1200 | 2.1313 |


```python
df.loc[df['house_prices'] == 500000, ['house_prices', 'house_price_z']]
```

|   | house_prices | house_price_z |
|---|--------------|---------------|
| 0 | 500000 | -0.853666 |


### Conclusion: Jacket vs House

Thus, based on the data we have, a house price of 500,000 (z-score = -0.85) is relatively cheap, while a jacket price of 1,200 (z-score = 2.13) is relatively expensive, once both are placed on a common scale. In fact, since the house's z-score is negative, its price is below the mean house price, on the low end of the scale.

## Normalization methods

We will cover three popular normalization methods:

- linear scaling
- Z-score scaling
- log scaling

### Linear scaling

**Linear scaling** (more commonly shortened to just **scaling**) means converting floating-point values from their natural range into a standard range—usually 0 to 1 or -1 to +1. See [`MinMaxScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) in scikit-learn.
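As a minimal sketch, linear scaling to the 0-to-1 range can be written as `(x - min) / (max - min)`. The example below reuses the house prices from the earlier example and checks the hand-written formula against scikit-learn's `MinMaxScaler`; the variable names are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

house_prices = np.array(
    [500_000, 600_000, 700_000, 550_000, 450_000, 800_000, 1_200_000, 900_000],
    dtype=float,
)

# Hand-written linear scaling: the minimum maps to 0 and the maximum maps to 1.
manual = (house_prices - house_prices.min()) / (house_prices.max() - house_prices.min())

# MinMaxScaler expects a 2D array of shape (n_samples, n_features).
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(house_prices.reshape(-1, 1)).ravel()

print(np.round(scaled, 3))
```

In a real pipeline the scaler would typically be fitted on the training data only and then reused to transform validation and test data, so that all splits share the same minimum and maximum.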

**Usage**: Linear scaling is effective when these conditions are met:

1. **Stable Bounds:** The minimum and maximum values of the feature remain relatively consistent over time.
2. **Few Outliers:** The feature has minimal outliers, and those that exist are not significantly extreme.
3. **Uniform Distribution:** The feature's values are distributed evenly across its range.

**Example: Human Age**

Human age is a good candidate for linear scaling because:

- Its range is well-defined (typically 0 to 100 years).
- Outliers are rare (a very small percentage of people live beyond 100).
- A large dataset should provide sufficient representation of all age groups, even if some are more common.

**Note**: Most real-world features do not meet all of the criteria for linear scaling. **Z-score scaling is typically a better normalization choice than linear scaling**.

### Z-score scaling

**Z-score scaling** centers and rescales the values so that they have mean=0 and stddev=1. See: [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) in scikit-learn. For roughly normally distributed data, nearly all Z-scores (about 99.7%) fall between -3 and +3.

It is often a required step in applying statistical techniques, such as the z-score method of outlier removal, hypothesis testing, and others.
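The sketch below computes Z-scores for the jacket prices in two ways. One detail worth knowing: pandas `Series.std()` defaults to the sample standard deviation (`ddof=1`), while `StandardScaler` divides by the population standard deviation (`ddof=0`), so the two give slightly different Z-scores on small datasets. The variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

jacket_prices = pd.Series([100, 200, 300, 400, 500, 450, 650, 1200], dtype=float)

# pandas: Series.std() uses the sample standard deviation (ddof=1) by default.
z_pandas = (jacket_prices - jacket_prices.mean()) / jacket_prices.std()

# scikit-learn: StandardScaler uses the population standard deviation (ddof=0).
# It expects a 2D array of shape (n_samples, n_features).
z_sklearn = StandardScaler().fit_transform(
    jacket_prices.to_numpy().reshape(-1, 1)
).ravel()

# Reproducing StandardScaler by hand means asking pandas for ddof=0.
z_manual = (jacket_prices - jacket_prices.mean()) / jacket_prices.std(ddof=0)

print(pd.DataFrame({"pandas_ddof1": z_pandas, "sklearn_ddof0": z_sklearn}).round(4))
```

With many samples the difference between the two conventions becomes negligible; it is only noticeable here because the dataset has eight rows.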

**Usage**: Z-score is a good choice when:

1. the data follows a **normal distribution** or
2. a distribution **somewhat like a normal distribution**.

While many distributions are approximately **normal within their core**, they may still contain extreme outliers. For instance, a `net_worth` feature might have most values within 3 standard deviations of the mean, while a few lie hundreds of standard deviations away. To address such cases, **combining Z-score scaling with clipping can be effective**.
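A minimal sketch of combining Z-score scaling with clipping, assuming a synthetic `net_worth` feature (the values, the random seed, and the ±3 threshold are all illustrative assumptions, not data from this document):

```python
import numpy as np

# Synthetic data: a roughly normal core plus three extreme outliers.
rng = np.random.default_rng(0)
net_worth = rng.normal(loc=100_000, scale=20_000, size=1_000)
net_worth[:3] = [5e7, 8e7, 2e8]

# Z-score scaling (NumPy's std() uses the population formula, ddof=0).
z = (net_worth - net_worth.mean()) / net_worth.std()

# Clip the scaled values so nothing falls outside -3 to +3.
z_clipped = np.clip(z, -3.0, 3.0)

print("max z before clipping:", round(float(z.max()), 2))
print("max z after clipping: ", float(z_clipped.max()))
```

Note that the outliers also inflate the mean and standard deviation used for scaling, so the core values end up squeezed close to 0. If that is a problem, clipping the raw values before scaling is an alternative worth trying.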


### Log scaling

**Log scaling** computes the logarithm of the raw value. In theory, the logarithm could be any base; in practice, log scaling usually calculates the natural logarithm (ln). See: [Effect of transforming the targets in regression model](https://scikit-learn.org/stable/auto_examples/compose/plot_transformed_target.html) from the Scikit-learn docs.

Log scaling is a useful technique for data that follows a **power law distribution**. This type of distribution is characterized by a few high-value data points and many low-value ones.

- Example 1: a few popular movies have a massive number of ratings, while the majority of movies have only a handful.
- Example 2: most books sell very few copies, some sell a moderate number (in the thousands), and only a select few become bestsellers and sell millions.

By applying log scaling, we can **transform this skewed distribution into a more balanced one**. This transformation can significantly improve the performance of machine learning models, leading to more accurate predictions.
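As a minimal sketch, assume a hypothetical array of rating counts per movie that spans several orders of magnitude. Taking the natural logarithm compresses the long tail into a much narrower range; `np.log1p` (which computes `ln(1 + x)`) is a common variant when the feature can be zero.

```python
import numpy as np

# Hypothetical rating counts: many small values and a few very large ones.
ratings = np.array([1, 2, 3, 5, 8, 12, 40, 150, 9_000, 250_000], dtype=float)

# Natural log scaling; requires strictly positive values.
log_ratings = np.log(ratings)

# log1p(x) = ln(1 + x) is defined at 0, which helps for count features.
log1p_ratings = np.log1p(ratings)

print("raw range:", ratings.min(), "to", ratings.max())
print("log range:", round(float(log_ratings.min()), 2), "to", round(float(log_ratings.max()), 2))
```

The transformation is invertible (`np.exp` undoes `np.log`, and `np.expm1` undoes `np.log1p`), which matters when the log-scaled quantity is a prediction target that must be converted back to its original units.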

![Log and exponential transformation](../assets/img/log_and_exp_transformation.png)

**Figure 1.** Log and exponential transformation

Source: [discuss.boardinfinity.com](http://discuss.boardinfinity.com/)


### Clipping

**Clipping** is a technique used to mitigate the impact of outliers. It involves capping the values of extreme data points to a specific maximum or minimum value. While this might seem counterintuitive, clipping can often significantly improve the performance of machine learning models.

- Example: consider a dataset with a feature called roomsPerPerson. This feature represents the average number of rooms per person in a household. While most values are normally distributed, a few extreme outliers exist, such as households with an exceptionally high number of rooms per person.

By clipping these outliers, we can prevent them from disproportionately influencing the model's training process and ultimately improve its predictive accuracy.
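A minimal sketch of clipping with pandas, assuming hypothetical `rooms_per_person` values and an arbitrarily chosen cap of 4.0 (in practice the threshold would come from domain knowledge or from a percentile of the training data):

```python
import pandas as pd

# Hypothetical values: a plausible core plus two extreme households.
rooms_per_person = pd.Series([0.8, 1.0, 1.2, 1.5, 2.0, 2.5, 17.0, 55.0])

# Cap everything above the assumed threshold; smaller values are unchanged.
clipped = rooms_per_person.clip(upper=4.0)

print(clipped.tolist())
```

Clipping keeps every row: the two extreme households stay in the dataset, but they are treated as having the maximum allowed value.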

A related technique is the Z-score outlier removal method. Instead of capping extreme values, it omits any values that fall more than 3 standard deviations from the mean (that is, outside the -3 to +3 Z-score range).
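A minimal sketch of Z-score outlier removal, assuming hypothetical `rooms_per_person` values: rows whose Z-score falls outside -3 to +3 are dropped, so the result has fewer rows than the input (whereas capping keeps every row).

```python
import numpy as np
import pandas as pd

# Hypothetical data: 40 ordinary households plus one extreme value.
rooms_per_person = pd.Series(np.concatenate([np.linspace(0.5, 2.5, 40), [55.0]]))

# Z-scores using the population standard deviation (ddof=0).
z = (rooms_per_person - rooms_per_person.mean()) / rooms_per_person.std(ddof=0)

# Keep only the rows within 3 standard deviations of the mean.
kept = rooms_per_person[z.abs() <= 3]

print(len(rooms_per_person), "rows before,", len(kept), "rows after")
```

On very small datasets this method can fail to flag anything, because a single extreme value inflates the standard deviation it is measured against.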

## Summary of normalization techniques

The best normalization technique is one that works well in practice, so try new ideas if you think they'll work well on your feature distribution.

- **Linear scaling**: when the feature is uniformly distributed across a fixed range.
- **Log scaling**: when the feature conforms to the power law distribution.
- **Z-score scaling**: when the feature distribution does not contain extreme outliers.
- **Clipping**: when the feature contains extreme outliers.

You’ll find good visuals in the scikit-learn docs: [Compare the effect of different scalers on data with outliers](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html).


## Numerical Data Binning / Bucketing

**Binning** / **Bucketing** is a feature engineering technique that involves grouping numerical values into discrete categories or bins.

- Example: number of shoppers (store visitors) versus temperature (weather)

![](../assets/img/scatter_temp_vs_shoppers.png)

**Figure 2.** Scatter plot of temperature vs. number of shoppers

The graph suggests three clusters in the following subranges:

- Bin 1 is the temperature range 4-11.
- Bin 2 is the temperature range 12-26.
- Bin 3 is the temperature range 27-36.

![](../assets/img/scatter_temp_vs_shoppers_binned.png)

**Figure 3.** The same scatter plot, with two red lines marking the thresholds between the three bins.

Binning is a good alternative to scaling or clipping when either of the following conditions is met:

- The overall linear relationship between the feature and the label is weak or nonexistent. (The example above is non-linear: the number of shoppers rises and then falls as temperature increases, forming a curve rather than a line.)
- The feature values are clustered.
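The three temperature bins above can be sketched with `pandas.cut`. The sample temperatures, the bin labels, and the column prefix are hypothetical. The edges `[3, 11, 26, 36]` rely on `pd.cut` using right-inclusive intervals by default, so for whole-number temperatures they correspond to 4-11, 12-26, and 27-36.

```python
import pandas as pd

# Hypothetical whole-number temperature readings.
temperature = pd.Series([5, 11, 12, 20, 26, 27, 35])

# Intervals are (3, 11], (11, 26], (26, 36] because right=True is the default.
binned = pd.cut(
    temperature,
    bins=[3, 11, 26, 36],
    labels=["bin_1", "bin_2", "bin_3"],
)

# One-hot encode the bins so a model can learn a separate weight per bin.
one_hot = pd.get_dummies(binned, prefix="temp")

print(pd.DataFrame({"temperature": temperature, "bin": binned}))
```

Values outside the outermost edges would become `NaN`, so in practice the first and last edges are often widened (or set to negative and positive infinity) to cover unseen temperatures.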