## [Scaling Numerical Data](https://towardsdatascience.com/scaling-numerical-data-explained-a-visual-guide-with-code-examples-for-beginners-11676cdb45cb/)

> Transforming adult-sized data for child-like models

Data scaling (including what some call "normalization") is the process of transforming these adult-sized numbers into child-friendly proportions. It’s about creating a level playing field where every feature, big or small, can be understood and valued appropriately, so that a feature measured in large units does not dominate one measured in small units.

In [1]:
!pip install -q pandas numpy scikit-learn matplotlib scipy

In [2]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler, PowerTransformer
from scipy import stats

import warnings
warnings.filterwarnings('ignore')

# Build the example dataset as a dictionary of columns
data = {
    'Temperature_Celsius': [15, 18, 22, 25, 28, 30, 32, 29, 26, 23, 20, 17],
    'Humidity_Percent': [50, 55, 60, 65, 70, 75, 80, 72, 68, 62, 58, 52],
    'Wind_Speed_kmh': [5, 8, 12, 15, 10, 7, 20, 18, 14, 9, 6, 11],
    'Golfers_Count': [20, 35, 50, 75, 100, 120, 90, 110, 85, 60, 40, 25],
    'Green_Speed': [8.5, 9.0, 9.5, 10.0, 10.5, 11.0, 11.5, 11.0, 10.5, 10.0, 9.5, 9.0]
}

df = pd.DataFrame(data)

df.head()

Unnamed: 0,Temperature_Celsius,Humidity_Percent,Wind_Speed_kmh,Golfers_Count,Green_Speed
0,15,50,5,20,8.5
1,18,55,8,35,9.0
2,22,60,12,50,9.5
3,25,65,15,75,10.0
4,28,70,10,100,10.5


#### Min-Max Scaling
Min-Max Scaling transforms all values to a fixed range, typically between 0 and 1, by subtracting the minimum value and dividing by the range (maximum minus minimum).

In [3]:
# 1. Min-Max Scaling for Temperature_Celsius
min_max_scaler = MinMaxScaler()
df['Temperature_MinMax'] = min_max_scaler.fit_transform(df[['Temperature_Celsius']])

df.head()

Unnamed: 0,Temperature_Celsius,Humidity_Percent,Wind_Speed_kmh,Golfers_Count,Green_Speed,Temperature_MinMax
0,15,50,5,20,8.5,0.0
1,18,55,8,35,9.0,0.176471
2,22,60,12,50,9.5,0.411765
3,25,65,15,75,10.0,0.588235
4,28,70,10,100,10.5,0.764706
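
As a quick sanity check, the min-max formula can be written out by hand and compared with `MinMaxScaler`. This is a minimal sketch that reuses the temperature values from the dataset above; the variable names `manual_minmax` and `sklearn_minmax` are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Temperature_Celsius values copied from the dataset above
temperature = np.array([15, 18, 22, 25, 28, 30, 32, 29, 26, 23, 20, 17], dtype=float)

# Min-max formula: (x - min) / (max - min)
value_range = temperature.max() - temperature.min()
manual_minmax = (temperature - temperature.min()) / value_range

# Same computation via scikit-learn (it expects a 2-D array, hence the reshape)
sklearn_minmax = MinMaxScaler().fit_transform(temperature.reshape(-1, 1)).ravel()

print(np.round(manual_minmax[:5], 6))
print(np.allclose(manual_minmax, sklearn_minmax))
```

The smallest temperature maps to 0 and the largest to 1, which is why the result depends entirely on the two extreme values in the column.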


#### Standard Scaling
Standard Scaling centers data around a mean of 0 and scales it to a standard deviation of 1, achieved by subtracting the mean and dividing by the standard deviation.

In [4]:
# 2. Standard Scaling for Wind_Speed_kmh
standard_scaler = StandardScaler()
df['Wind_Speed_Standardized'] = standard_scaler.fit_transform(df[['Wind_Speed_kmh']])

df.head()

Unnamed: 0,Temperature_Celsius,Humidity_Percent,Wind_Speed_kmh,Golfers_Count,Green_Speed,Temperature_MinMax,Wind_Speed_Standardized
0,15,50,5,20,8.5,0.0,-1.379693
1,18,55,8,35,9.0,0.176471,-0.71744
2,22,60,12,50,9.5,0.411765,0.165563
3,25,65,15,75,10.0,0.588235,0.827816
4,28,70,10,100,10.5,0.764706,-0.275939
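
The same kind of check works for standard scaling. The sketch below assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses; pandas' `Series.std()` defaults to `ddof=1` and would give slightly different numbers. Variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Wind_Speed_kmh values copied from the dataset above
wind_speed = np.array([5, 8, 12, 15, 10, 7, 20, 18, 14, 9, 6, 11], dtype=float)

# z-score formula: (x - mean) / std, with the population std (ddof=0)
manual_standard = (wind_speed - wind_speed.mean()) / wind_speed.std(ddof=0)

# Same computation via scikit-learn
sklearn_standard = StandardScaler().fit_transform(wind_speed.reshape(-1, 1)).ravel()

print(np.round(manual_standard[:5], 6))
print(np.allclose(manual_standard, sklearn_standard))
```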


#### Robust Scaling
Robust Scaling centers data around the median and scales it by the interquartile range (IQR), which makes it less sensitive to outliers than Standard Scaling.

In [5]:
# 3. Robust Scaling for Humidity_Percent
robust_scaler = RobustScaler()
df['Humidity_Robust'] = robust_scaler.fit_transform(df[['Humidity_Percent']])

df.head()

Unnamed: 0,Temperature_Celsius,Humidity_Percent,Wind_Speed_kmh,Golfers_Count,Green_Speed,Temperature_MinMax,Wind_Speed_Standardized,Humidity_Robust
0,15,50,5,20,8.5,0.0,-1.379693,-1.018868
1,18,55,8,35,9.0,0.176471,-0.71744,-0.641509
2,22,60,12,50,9.5,0.411765,0.165563,-0.264151
3,25,65,15,75,10.0,0.588235,0.827816,0.113208
4,28,70,10,100,10.5,0.764706,-0.275939,0.490566
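
To make the median and IQR concrete, here is a minimal sketch that computes them with NumPy and compares the result with `RobustScaler` under its default settings (centering on the median, scaling by the 25th to 75th percentile range). Variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Humidity_Percent values copied from the dataset above
humidity = np.array([50, 55, 60, 65, 70, 75, 80, 72, 68, 62, 58, 52], dtype=float)

# Median and interquartile range (IQR = Q3 - Q1)
median = np.median(humidity)
q1, q3 = np.percentile(humidity, [25, 75])
iqr = q3 - q1

# Robust scaling formula: (x - median) / IQR
manual_robust = (humidity - median) / iqr

# Same computation via scikit-learn with default parameters
sklearn_robust = RobustScaler().fit_transform(humidity.reshape(-1, 1)).ravel()

print(median, q1, q3)
print(np.allclose(manual_robust, sklearn_robust))
```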


#### Log Transformation
Log Transformation applies a logarithmic function to the data, which compresses large values much more than small ones.

In [6]:
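# 4. Log Transformation for Golfers_Count
# log1p computes log(1 + x), so a count of 0 would still be valid
df['Golfers_Log'] = np.log1p(df['Golfers_Count'])
# Note: fit_transform refits standard_scaler, replacing the Wind_Speed_kmh fit
df['Golfers_Log_std'] = standard_scaler.fit_transform(df[['Golfers_Log']])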

df.head()

Unnamed: 0,Temperature_Celsius,Humidity_Percent,Wind_Speed_kmh,Golfers_Count,Green_Speed,Temperature_MinMax,Wind_Speed_Standardized,Humidity_Robust,Golfers_Log,Golfers_Log_std
0,15,50,5,20,8.5,0.0,-1.379693,-1.018868,3.044522,-1.870976
1,18,55,8,35,9.0,0.176471,-0.71744,-0.641509,3.583519,-0.90476
2,22,60,12,50,9.5,0.411765,0.165563,-0.264151,3.931826,-0.280378
3,25,65,15,75,10.0,0.588235,0.827816,0.113208,4.330733,0.434712
4,28,70,10,100,10.5,0.764706,-0.275939,0.490566,4.615121,0.94451
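
The sketch below illustrates two properties of `np.log1p` that are easy to overlook: it equals `log(1 + x)`, and `np.expm1` undoes it, so the original counts can be recovered after modeling. Variable names are illustrative.

```python
import numpy as np

# Golfers_Count values copied from the dataset above
golfers = np.array([20, 35, 50, 75, 100, 120, 90, 110, 85, 60, 40, 25], dtype=float)

# log1p(x) is log(1 + x); it stays finite when x is 0
golfers_log = np.log1p(golfers)

# expm1(y) is exp(y) - 1, the inverse of log1p
golfers_recovered = np.expm1(golfers_log)

# Ratio of largest to smallest value, before and after the transform
ratio_before = golfers.max() / golfers.min()
ratio_after = golfers_log.max() / golfers_log.min()

print(np.round(golfers_log[:5], 6))
print(ratio_before, round(ratio_after, 2))
```

The ratio between the largest and smallest value shrinks after the transform, which is the "compression" the text describes.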


#### Box-Cox Transformation
Box-Cox is a family of power transformations (with the log transformation as the special case λ = 0) that aims to bring the distribution of the data closer to normal. The parameter lambda (λ) is estimated from the data so that the transformed values look as normal as possible. The input must be strictly positive.

In [7]:
# 5. Box-Cox Transformation for Green_Speed
df['Green_Speed_BoxCox'], lambda_param = stats.boxcox(df['Green_Speed'])

df.head()

Unnamed: 0,Temperature_Celsius,Humidity_Percent,Wind_Speed_kmh,Golfers_Count,Green_Speed,Temperature_MinMax,Wind_Speed_Standardized,Humidity_Robust,Golfers_Log,Golfers_Log_std,Green_Speed_BoxCox
0,15,50,5,20,8.5,0.0,-1.379693,-1.018868,3.044522,-1.870976,5.599116
1,18,55,8,35,9.0,0.176471,-0.71744,-0.641509,3.583519,-0.90476,5.916225
2,22,60,12,50,9.5,0.411765,0.165563,-0.264151,3.931826,-0.280378,6.229654
3,25,65,15,75,10.0,0.588235,0.827816,0.113208,4.330733,0.434712,6.539637
4,28,70,10,100,10.5,0.764706,-0.275939,0.490566,4.615121,0.94451,6.846381
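
For reference, the Box-Cox formula is (x^λ - 1) / λ when λ ≠ 0 and log(x) when λ = 0. The sketch below applies that formula by hand using the λ estimated by `scipy.stats.boxcox` and checks that both routes agree; the helper `box_cox_manual` is a hypothetical name introduced only for this illustration.

```python
import numpy as np
from scipy import stats

# Green_Speed values copied from the dataset above (all strictly positive)
green_speed = np.array([8.5, 9.0, 9.5, 10.0, 10.5, 11.0,
                        11.5, 11.0, 10.5, 10.0, 9.5, 9.0])

def box_cox_manual(x, lam):
    """Apply the Box-Cox formula for a given lambda (illustrative helper)."""
    if lam == 0:
        return np.log(x)
    return (x ** lam - 1.0) / lam

# With no lambda supplied, scipy estimates it and returns (transformed, lambda)
scipy_transformed, estimated_lambda = stats.boxcox(green_speed)

# Apply the formula by hand with the estimated lambda
manual_transformed = box_cox_manual(green_speed, estimated_lambda)

print("Estimated lambda:", estimated_lambda)
print(np.allclose(manual_transformed, scipy_transformed))
```

Note that `stats.boxcox` returns the raw transformed values, whereas `PowerTransformer(method='box-cox')` also standardizes them by default, so the two outputs are on different scales.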


In [8]:
# Same log + standardize step as In [6], stored under different column names
df['Golfers_Count_Log'] = np.log1p(df['Golfers_Count'])
df['Golfers_Count_Log_std'] = standard_scaler.fit_transform(df[['Golfers_Count_Log']])

# PowerTransformer standardizes its output by default (standardize=True)
box_cox_transformer = PowerTransformer(method='box-cox')
df['Green_Speed_BoxCox'] = box_cox_transformer.fit_transform(df[['Green_Speed']])

# lambda_param was estimated by stats.boxcox in In [7]
print("Box-Cox lambda parameter:", lambda_param)

Box-Cox lambda parameter: 0.7900474068367501


In [9]:
# Display the results
transformed_data = df[[
    'Temperature_MinMax', 
    'Humidity_Robust', 
    'Wind_Speed_Standardized',
    'Green_Speed_BoxCox',
    'Golfers_Log_std', 
]]

transformed_data = transformed_data.round(2)
print(transformed_data)

    Temperature_MinMax  Humidity_Robust  Wind_Speed_Standardized  \
0                 0.00            -1.02                    -1.38   
1                 0.18            -0.64                    -0.72   
2                 0.41            -0.26                     0.17   
3                 0.59             0.11                     0.83   
4                 0.76             0.49                    -0.28   
5                 0.88             0.87                    -0.94   
6                 1.00             1.25                     1.93   
7                 0.82             0.64                     1.49   
8                 0.65             0.34                     0.61   
9                 0.47            -0.11                    -0.50   
10                0.29            -0.42                    -1.16   
11                0.12            -0.87                    -0.06   

    Green_Speed_BoxCox  Golfers_Log_std  
0                -1.70            -1.87  
1                -1.13         