# Feature Scaling Exercise

Adapted from Dipanjan Sarkar et al. 2018. [Practical Machine Learning with Python](https://link.springer.com/book/10.1007/978-1-4842-3207-1).

## Overview

Data processing is a critical step in building Machine Learning systems, requiring both domain knowledge and mathematical transformations. Feature scaling helps prevent model bias toward features with high magnitude values and is especially important when trying multiple Machine Learning algorithms.

## Learning Objectives

- Understand the importance of feature scaling in machine learning
- Learn three key feature scaling techniques:
  - Standardized Scaling (Z-score)
  - Min-Max Scaling
  - Robust Scaling
- Apply feature scaling methods using scikit-learn preprocessing tools

### Tasks

- Apply `StandardScaler` to standardize data using the Z-score method
- Implement Min-Max scaling to transform features to the [0, 1] range
- Use `RobustScaler` for scaling data with outliers
- Compare results across different scaling methods

## Prerequisites

- Python programming environment
- Basic understanding of statistical and machine learning concepts
- Familiarity with common ML libraries


## Get Started

### Set up conda environment

Ensure that you have created the conda environment using the `conda_env.yml` file included in this repository, e.g.:

```
# Create conda environment
conda env create -f conda_env.yml

# Register the kernel
python -m ipykernel install --user \
    --name=nigms_sandbox_ud \
    --display-name "Python (NIGMS Sandbox UD)"
```

Then, when starting the notebook, select the `"Python (NIGMS Sandbox UD)"` kernel from the list.

Note that you may need to restart Jupyter Lab for these changes to take effect.

### Import necessary libraries


In [None]:
# Import necessary libraries for data preprocessing
# Import StandardScaler, MinMaxScaler, and RobustScaler from scikit-learn's preprocessing module.
# These are used for scaling and normalizing numerical features to improve model performance.
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler  # Scalers for feature normalization
# Import the NumPy library and alias it as 'np'.
# NumPy is fundamental for numerical computations in Python, especially for handling arrays and matrices.
import numpy as np  # Numerical operations
# Import the pandas library and alias it as 'pd'.
# Pandas is used for data manipulation and analysis, providing data structures like DataFrames and Series.
import pandas as pd  # Data manipulation and analysis

## Feature Scaling

Using raw values as input features can bias a model toward features with very large magnitudes.
Models such as linear or logistic regression are typically sensitive to the magnitude or
scale of their features. Other models, such as tree-based methods, can still work without
feature scaling. However, it is still recommended to scale the features,
especially if you want to try out multiple Machine Learning algorithms on the same input features.


In [None]:
# Set NumPy print options to suppress scientific notation for small numbers.
# This makes the output of NumPy arrays more readable by displaying small numbers in fixed-point notation instead of scientific notation (e.g., 0.0001 instead of 1e-4).
np.set_printoptions(suppress=True)

## Load sample RNASeq data

We have six genes with their RNASeq FPKM values (Fragments Per Kilobase of transcript per Million mapped reads). It is evident that some genes
are expressed far more than others, giving rise to values that differ widely in scale and magnitude.


In [None]:
# Create a Pandas DataFrame named 'fpkms'.
# The DataFrame is initialized with a list of numerical values.
# Each value in the list represents an FPKM (Fragments Per Kilobase of transcript per Million mapped reads) value,
# likely representing gene expression levels from RNA sequencing data.
# The 'columns=["fpkms"]' argument specifies that the DataFrame should have a single column named "fpkms".
fpkms = pd.DataFrame([1295.0, 25.0, 19000.0, 5.0, 1.0, 300.0], columns=["fpkms"])
# Display the DataFrame 'fpkms'.
# This line will output the DataFrame to the console, showing the FPKM values in a tabular format with the column name "fpkms".
fpkms

## Standardized Scaling

The standard scaler standardizes each value in a feature column by removing the mean and scaling
the values to unit variance. This is also known as centering and scaling and can be denoted
mathematically as $SS(X_i) = \frac{X_i - \mu_X}{\sigma_X}$, where the mean $\mu_X$ is subtracted from each value $X_i$ in feature $X$ and the result is divided by the standard deviation $\sigma_X$. This is also known as $Z$-score scaling. If needed, we can also divide by the variance instead of the standard deviation, although the result is then no longer a standard $Z$-score.
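
As a minimal sketch of the formula above, the following computes Z-scores by hand with NumPy. The array `x` holds hypothetical toy values, not the exercise data. One detail worth noting: `np.std` defaults to the population standard deviation (`ddof=0`), which is also what scikit-learn's `StandardScaler` uses, whereas pandas' `Series.std` defaults to the sample standard deviation (`ddof=1`), so the two can give slightly different results.

```python
import numpy as np

# Hypothetical toy feature (not the exercise data)
x = np.array([2.0, 4.0, 6.0, 8.0])

mu = x.mean()          # mean of the feature
sigma = x.std()        # population standard deviation (ddof=0)
z = (x - mu) / sigma   # center on the mean, then scale to unit variance

# Dividing by the sample standard deviation (ddof=1) instead
# yields values of slightly smaller magnitude.
z_sample = (x - mu) / x.std(ddof=1)

print(z)
print(z_sample)
```

After scaling, `z` has a mean of 0 and a population standard deviation of 1.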


In [None]:
# Create a StandardScaler object.
# StandardScaler is used for feature scaling to standardize the 'fpkms' column.

ss = # Your code goes here

# Apply StandardScaler to the 'fpkms' column of the DataFrame 'fpkms' and create a new column named 'zscore' to store the standardized values.
# ss.fit_transform() calculates the mean and standard deviation of the 'fpkms' column and then standardizes the data.
# The standardized values (z-scores) are added as a new column named 'zscore' to the 'fpkms' DataFrame.

fpkms["zscore"] = # Your code goes here

# Display the DataFrame 'fpkms' with the newly added 'zscore' column.
# This line shows the DataFrame after the standardization process, allowing you to inspect the results.
fpkms

In [None]:
# We can manually use the formula to compute the same result
# Convert the 'fpkms' column from the 'fpkms' DataFrame into a NumPy array named 'fw'.
# It's assumed that 'fpkms' is a Pandas DataFrame and it has a column named 'fpkms' containing gene expression values (FPKM - Fragments Per Kilobase of transcript per Million mapped reads).
fw = np.array(fpkms["fpkms"])
# Calculate the z-score for the first element of the 'fw' array (which represents FPKM values).
# fw[0] accesses the first FPKM value in the 'fw' array.
# np.mean(fw) calculates the mean of all FPKM values in the 'fw' array.
# np.std(fw) calculates the standard deviation of all FPKM values in the 'fw' array.
# The formula (fw[0] - np.mean(fw)) / np.std(fw) standardizes the first FPKM value by subtracting the mean and dividing by the standard deviation, resulting in a z-score.
(fw[0] - np.mean(fw)) / np.std(fw)

## Min-Max Scaling

With min-max scaling, we can transform and scale our feature values such that each value is within the
range of [0, 1]. The min-max scaler can be represented as $MMS(X_i)=\frac{X_i - \min(X)}{\max(X) - \min(X)}$, where we scale each value $X_i$ in the feature $X$ by subtracting the minimum value of the feature, $\min(X)$, from it and dividing the result by the difference between the maximum and minimum values of the feature, $\max(X)-\min(X)$.
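
The formula above can be sketched directly in NumPy. The helper `min_max_scale` and the toy array `x` are hypothetical names introduced only for illustration. The optional `low` and `high` arguments show how the same idea extends to a range other than [0, 1]; scikit-learn's `MinMaxScaler` exposes a comparable option through its `feature_range` parameter.

```python
import numpy as np

def min_max_scale(x, low=0.0, high=1.0):
    """Linearly rescale a 1-D array so its minimum maps to `low`
    and its maximum maps to `high`.

    Hypothetical helper; assumes the values are not all equal
    (otherwise the denominator would be zero).
    """
    x = np.asarray(x, dtype=float)
    unit = (x - x.min()) / (x.max() - x.min())  # the [0, 1] formula
    return low + unit * (high - low)

# Hypothetical toy feature (not the exercise data)
x = np.array([2.0, 4.0, 6.0, 10.0])

scaled = min_max_scale(x)                # values 0, 0.25, 0.5, 1
scaled_pm1 = min_max_scale(x, -1.0, 1.0) # values -1, -0.5, 0, 1

print(scaled)
print(scaled_pm1)
```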


In [None]:
# Create an instance of the MinMaxScaler.
# MinMaxScaler is used to scale and translate each feature individually such that it is in the given range on the training set, e.g. between zero and one.

mms = # Your code goes here

# Apply MinMaxScaler to the 'fpkms' column of the 'fpkms' DataFrame.
# .fit_transform() method first computes the minimum and maximum values of the 'fpkms' column and then scales the column to the range [0, 1].
# The result is assigned to a new column named 'minmax' in the 'fpkms' DataFrame.

fpkms["minmax"] = # Your code goes here

# Display the 'fpkms' DataFrame after applying MinMaxScaler.
# This will show the original 'fpkms' column and the newly created 'minmax' column containing the scaled values.
fpkms

In [None]:
# We can manually use the formula to compute the same result
# Normalizing the first element of the array `fw` using min-max scaling
# Formula: (x - min) / (max - min)
# This scales the value to a range between 0 and 1
(fw[0] - np.min(fw)) / (np.max(fw) - np.min(fw))

## Robust Scaling

The disadvantage of min-max scaling is that outliers often distort the scaled values of a
feature. Robust scaling uses statistical measures that are not strongly affected by
outliers. Mathematically, this scaler can be represented as $RS(X_i) = \frac{X_i - \operatorname{median}(X)}{IQR_{(1,3)}(X)}$, where we scale each value $X_i$ of feature $X$ by subtracting the median of $X$ and dividing the result by the IQR (Inter-Quartile Range) of $X$, which is the difference between the third quartile (75th percentile) and the first quartile (25th percentile).
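
To illustrate why this matters, the sketch below applies both formulas by hand to a hypothetical toy array with one extreme outlier (not the exercise data). It assumes the quartiles are computed with NumPy's default linear interpolation; the 25th and 75th percentiles are also the default quantile range of scikit-learn's `RobustScaler`.

```python
import numpy as np

# Hypothetical toy feature with a single extreme outlier
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Min-max scaling: the outlier defines the range, so the
# remaining values are squeezed close to 0.
minmax = (x - x.min()) / (x.max() - x.min())

# Robust scaling: the median and IQR are barely influenced by
# the outlier, so the typical values stay well spread out.
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
robust = (x - median) / (q3 - q1)

print(minmax)
print(robust)
```

Under min-max scaling the four typical values all land below 0.05, while under robust scaling they span -1 to 0.5 and the outlier remains clearly separated.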


In [None]:
# Create a RobustScaler instance to scale the 'fpkms' column

rs = # Your code goes here

# Apply RobustScaler transformation to the 'fpkms' column and store the result in a new column 'robust'
# RobustScaler is useful for handling outliers since it scales data based on the interquartile range (IQR)

fpkms["robust"] = # Your code goes here

# Display the modified DataFrame with the new 'robust' column
fpkms

## Conclusion

Feature scaling is essential for many machine learning algorithms, particularly those sensitive to feature magnitudes like linear and logistic regression. We explored three powerful scaling techniques:

- Z-score scaling for standardization
- Min-Max scaling for bounded ranges
- Robust scaling for handling outliers

Each method serves specific purposes and should be chosen based on your data characteristics and model requirements.

## Clean up

Remember to shut down your Jupyter Notebook environment and delete any unnecessary files or resources once you've completed the tutorial.
