# Feature Scaling Exercise

Adapted from Dipanjan Sarkar et al. 2018. [Practical Machine Learning with Python](https://link.springer.com/book/10.1007/978-1-4842-3207-1).

## Overview

Data processing is a crucial step in developing Machine Learning systems, combining domain expertise with mathematical transformations. Feature scaling plays a vital role in preventing model bias toward features with larger magnitude values, making it particularly important when experimenting with multiple Machine Learning algorithms.

## Learning Objectives

- Understand the importance of feature scaling in machine learning
- Learn three key feature scaling techniques:
  - Standardized Scaling (Z-score)
  - Min-Max Scaling
  - Robust Scaling
- Apply feature scaling methods using scikit-learn preprocessing tools

### Tasks

- Apply `StandardScaler` to standardize data using the Z-score method
- Implement Min-Max scaling to transform features to the [0, 1] range
- Use `RobustScaler` for scaling data with outliers
- Compare results across different scaling methods

## Prerequisites

- Python programming environment
- Basic understanding of statistical and machine learning concepts
- Familiarity with common ML libraries


## Get Started

- Please select the "conda_python3" kernel on the SageMaker notebook instance.

### Import necessary libraries


In [None]:
# Import necessary libraries for data preprocessing
# Import StandardScaler, MinMaxScaler, and RobustScaler from scikit-learn's preprocessing module.
# These are used for scaling and normalizing numerical features to improve model performance.
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler  # Scalers for feature normalization

# Import the NumPy library and alias it as 'np'.
# NumPy is fundamental for numerical computations in Python, especially for handling arrays and matrices.
import numpy as np  # Numerical operations

# Import the pandas library and alias it as 'pd'.
# Pandas is used for data manipulation and analysis, providing data structures like DataFrames and Series.
import pandas as pd  # Data manipulation and analysis

## Feature Scaling

Unscaled features can bias machine learning models, particularly algorithms that are sensitive to variable magnitudes, such as regularized or gradient-trained linear and logistic regression. Tree-based models (e.g., random forests) are largely insensitive to feature scale, but scaling features remains a recommended practice: it puts features on a comparable footing when evaluating multiple algorithms and helps gradient-based optimization converge faster. Applying one consistent preprocessing approach across all features also makes model performance comparisons more reliable and often improves overall results.
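
As a rough illustration of how magnitude alone can dominate a computation, the sketch below compares Euclidean distances before and after Z-scoring. The two features (income and age) and all values are hypothetical and are not part of the exercise data; distance is used here only because it makes the effect easy to see.

```python
import numpy as np

# Hypothetical samples: column 0 is income (dollars), column 1 is age (years)
X = np.array([
    [50000.0, 25.0],
    [51000.0, 25.0],   # 2% higher income, same age
    [50000.0, 65.0],   # same income, 40 years older
])

def euclidean(a, b):
    """Euclidean distance between two 1D vectors."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Raw distances: the income column dominates because of its magnitude
raw_d01 = euclidean(X[0], X[1])   # 1000.0
raw_d02 = euclidean(X[0], X[2])   # 40.0

# Z-score each column (population standard deviation, ddof=0)
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# After scaling, both differences contribute on a comparable footing
scaled_d01 = euclidean(Z[0], Z[1])
scaled_d02 = euclidean(Z[0], Z[2])

print(raw_d01, raw_d02)
print(scaled_d01, scaled_d02)
```

On the raw values, a 2% income difference looks 25 times larger than a 40-year age difference; after scaling, the two distances come out equal for this toy data.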

In [None]:
# Set NumPy print options to suppress scientific notation for small numbers.
# This makes the output of NumPy arrays more readable by displaying small numbers in fixed-point notation instead of scientific notation (e.g., 0.0001 instead of 1e-4).
np.set_printoptions(suppress=True)

## Load sample RNASeq data

The RNA-seq data shows substantial variation in expression levels across the six sample values, measured in FPKM (Fragments Per Kilobase of transcript per Million mapped reads). The values span several orders of magnitude, from 1 to 19,000, with some genes showing markedly higher transcriptional activity than others. This wide dynamic range underscores the need for appropriate scaling before downstream comparative analyses.

In [None]:
# Create a Pandas DataFrame named 'fpkms'.
# The DataFrame is initialized with a list of numerical values.
# Each value in the list represents an FPKM (Fragments Per Kilobase of transcript per Million mapped reads) value,
# likely representing gene expression levels from RNA sequencing data.
# The 'columns=["fpkms"]' argument specifies that the DataFrame should have a single column named "fpkms".
fpkms = pd.DataFrame([1295.0, 25.0, 19000.0, 5.0, 1.0, 300.0], columns=["fpkms"])

# Display the DataFrame 'fpkms'.
# This line will output the DataFrame to the console, showing the FPKM values in a tabular format with the column name "fpkms".
fpkms

## Standardized Scaling

Standard scaling (Z-score normalization) transforms feature values by:  
1. **Centering**: Subtracting the mean ($\mu$) from each value  
2. **Scaling**: Dividing by the standard deviation ($\sigma$)  

Mathematically:  
$$
SS(x_i) = \frac{x_i - \mu}{\sigma}
$$

Key properties:  
- Results in a distribution with $\mu = 0$ and $\sigma = 1$  
- Preserves the shape of the original distribution  
- Alternative scaling by variance ($\sigma^2$) is possible but less common
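
The properties above can be checked with a minimal NumPy sketch. The array `x` holds hypothetical toy values chosen so the arithmetic is easy to follow; it is not the exercise data.

```python
import numpy as np

# Hypothetical toy values, not taken from the exercise data
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = x.mean()     # 5.0
sigma = x.std()   # population standard deviation (ddof=0): 2.0

# Center, then scale
z = (x - mu) / sigma

print(z)                    # [-1.5 -0.5 -0.5 -0.5  0.   0.   1.   2. ]
print(z.mean(), z.std())    # approximately 0.0 and 1.0
```

The ordering and relative spacing of the values are unchanged; only the location and scale differ.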

### `fit_transform()` Function in Scaling

The `fit_transform()` function is a convenient and efficient method used in scaling and preprocessing data in machine learning. It combines two steps into one: **fitting** and **transforming**.

1. **Fit**:  
   The function first **fits** the scaler to the data, calculating the necessary parameters (e.g., mean, standard deviation, min, or max) based on the input dataset.

2. **Transform**:  
   It then **transforms** the data by applying the scaling operation (e.g., normalization or standardization) using the calculated parameters.

#### Key Points:
- **Use Case**:  
  `fit_transform()` is particularly useful for preprocessing **training data**, as it ensures the scaling is consistent and based on the distribution of the training set.

- **Test/Validation Data**:  
  For test or validation data, only the `transform()` function should be used to apply the same scaling parameters learned from the training data. This prevents **data leakage** and ensures the model generalizes well to unseen data.

#### Example:
```python
from sklearn.preprocessing import StandardScaler

# X_train and X_test are assumed to be 2D feature arrays
# produced by an earlier train/test split (not defined here)

# Initialize the scaler
scaler = StandardScaler()

# Fit and transform the training data
X_train_scaled = scaler.fit_transform(X_train)

# Transform the test data using the same scaler
X_test_scaled = scaler.transform(X_test)
```

In [None]:
# Create a StandardScaler object.
# StandardScaler is used for feature scaling to standardize the 'fpkms' column.

ss = # Your code goes here

# Apply StandardScaler to the 'fpkms' column of the DataFrame 'fpkms' and create a new column named 'zscore' to store the standardized values.
# ss.fit_transform() calculates the mean and standard deviation of the 'fpkms' column and then standardizes the data.
# Hint: scikit-learn scalers expect 2D input, so select the column with double brackets (a one-column DataFrame) rather than as a 1D Series.
# The standardized values (z-scores) are added as a new column named 'zscore' to the 'fpkms' DataFrame.

fpkms["zscore"] = # Your code goes here

# Display the DataFrame 'fpkms' with the newly added 'zscore' column.
# This line shows the DataFrame after the standardization process, allowing you to inspect the results.
fpkms

In [None]:
# We can manually use the formula to compute the same result
# Convert the 'fpkms' column from the 'fpkms' DataFrame into a NumPy array named 'fw'.
# It's assumed that 'fpkms' is a Pandas DataFrame and it has a column named 'fpkms' containing gene expression values (FPKM - Fragments Per Kilobase of transcript per Million mapped reads).
fw = np.array(fpkms["fpkms"])

# Calculate the z-score for the first element of the 'fw' array (which represents FPKM values).
# fw[0] accesses the first FPKM value in the 'fw' array.
# np.mean(fw) calculates the mean of all FPKM values in the 'fw' array.
# np.std(fw) calculates the standard deviation of all FPKM values in the 'fw' array.
# The formula (fw[0] - np.mean(fw)) / np.std(fw) standardizes the first FPKM value by subtracting the mean and dividing by the standard deviation, resulting in a z-score.
(fw[0] - np.mean(fw)) / np.std(fw)
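
One detail worth keeping in mind when checking results by hand: `StandardScaler` uses the population standard deviation (equivalent to `np.std(x, ddof=0)`), which is also NumPy's default, whereas pandas' `Series.std()` defaults to the sample standard deviation (`ddof=1`). The sketch below, using hypothetical toy values, shows how the two choices give slightly different z-scores.

```python
import numpy as np

# Hypothetical toy values, not taken from the exercise data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

pop_std = np.std(x)             # ddof=0, matches StandardScaler
sample_std = np.std(x, ddof=1)  # ddof=1, matches the pandas default

z_pop = (x[0] - x.mean()) / pop_std
z_sample = (x[0] - x.mean()) / sample_std

print(pop_std, sample_std)
print(z_pop, z_sample)
```

With only a handful of values the gap is noticeable, so a manual check that uses the sample standard deviation will not match the scaler's output exactly.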

## Min-Max Scaling

Min-max scaling normalizes feature values to a fixed range [0, 1] using the transformation:  

$$
MMS(x_i) = \frac{x_i - \min(X)}{\max(X) - \min(X)}
$$

where:  
- $x_i$ = original feature value  
- $\min(X)$ = minimum value in feature $X$  
- $\max(X)$ = maximum value in feature $X$  

This preserves the shape of the original distribution (the relative spacing between values) while putting all features on a consistent scale.  
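
A minimal NumPy sketch of the formula, using hypothetical toy values. The second step, mapping the result onto an arbitrary range $[a, b]$, is an addition for illustration; scikit-learn's `MinMaxScaler` exposes the same idea through its `feature_range` parameter.

```python
import numpy as np

# Hypothetical toy values, not taken from the exercise data
x = np.array([10.0, 20.0, 30.0, 50.0])

# Min-max scaling to [0, 1]
x01 = (x - x.min()) / (x.max() - x.min())

# Optional: stretch the [0, 1] result onto another range [a, b]
a, b = -1.0, 1.0
xab = a + x01 * (b - a)

print(x01)   # [0.   0.25 0.5  1.  ]
print(xab)   # [-1.  -0.5  0.   1. ]
```

The minimum always maps to the lower bound and the maximum to the upper bound, so a single extreme value determines how much room is left for everything else.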

In [None]:
# Create an instance of the MinMaxScaler.
# MinMaxScaler is used to scale and translate each feature individually such that it is in the given range on the training set, e.g. between zero and one.

mms = # Your code goes here

# Apply MinMaxScaler to the 'fpkms' column of the 'fpkms' DataFrame.
# .fit_transform() method first computes the minimum and maximum values of the 'fpkms' column and then scales the column to the range [0, 1].
# The result is assigned to a new column named 'minmax' in the 'fpkms' DataFrame.

fpkms["minmax"] = # Your code goes here

# Display the 'fpkms' DataFrame after applying MinMaxScaler.
# This will show the original 'fpkms' column and the newly created 'minmax' column containing the scaled values.
fpkms

In [None]:
# We can manually use the formula to compute the same result
# Normalizing the first element of the array `fw` using min-max scaling
# Formula: (x - min) / (max - min)
# This scales the value to a range between 0 and 1
(fw[0] - np.min(fw)) / (np.max(fw) - np.min(fw))

## Robust Scaling

Min-max scaling is sensitive to outliers: a single extreme value compresses the scaled range for all other observations. Robust scaling addresses this with a median-centered, IQR-normalized transformation:  

$$
RS(x_i) = \frac{x_i - \text{median}(X)}{\text{IQR}(X)}
$$
  
where $\text{IQR}(X) = Q_3(X) - Q_1(X)$ is the interquartile range, the difference between the 75th percentile ($Q_3$) and the 25th percentile ($Q_1$). This approach preserves the structure of the bulk of the data while minimizing the influence of outliers.
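
The contrast with min-max scaling can be sketched in NumPy on hypothetical toy values containing one outlier. The percentiles here use NumPy's default linear interpolation, which is assumed to match `RobustScaler` with its default quantile range of (25, 75).

```python
import numpy as np

# Hypothetical toy values with one outlier (100), not the exercise data
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Robust scaling: center on the median, divide by the IQR
median = np.median(x)                  # 3.0
q1, q3 = np.percentile(x, [25, 75])    # 2.0 and 4.0
robust = (x - median) / (q3 - q1)

# Min-max scaling on the same values, for comparison
minmax = (x - x.min()) / (x.max() - x.min())

print(robust)   # [-1.  -0.5  0.   0.5 48.5]
print(minmax)   # first four values are squeezed below about 0.03
```

Under min-max scaling the four typical values end up nearly indistinguishable, while robust scaling keeps them evenly spread and leaves the outlier as a clearly large value.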


In [None]:
# Create a RobustScaler instance to scale the 'fpkms' column

rs = # Your code goes here

# Apply RobustScaler transformation to the 'fpkms' column and store the result in a new column 'robust'
# RobustScaler is useful for handling outliers since it scales data based on the interquartile range (IQR)

fpkms["robust"] = # Your code goes here

# Display the modified DataFrame with the new 'robust' column
fpkms

## Conclusion

Feature scaling is essential for many machine learning algorithms, particularly those sensitive to feature magnitudes like linear and logistic regression. We explored three powerful scaling techniques:

- Z-score scaling for standardization
- Min-Max scaling for bounded ranges
- Robust scaling for handling outliers

Each method serves specific purposes and should be chosen based on your data characteristics and model requirements.

## Clean up

Remember to shut down your Jupyter Notebook environment and delete any unnecessary files or resources once you've completed the tutorial.
