# Scale Data with Outliers

Adapted from Jason Brownlee. 2020. [Data Preparation for Machine Learning](https://machinelearningmastery.com/data-preparation-for-machine-learning/).


## Overview

Many machine learning algorithms perform better when numerical input variables are scaled to a standard range. This is particularly important for:

- Algorithms that use **a weighted sum of inputs**, such as linear regression.
- Algorithms that rely on **distance measures**, such as k-nearest neighbors (k-NN).

#### Standardization: A Common Scaling Technique

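Standardization is a widely used scaling technique that rescales an input variable to have zero mean and unit variance. It shifts and rescales the distribution without changing its shape, so the result resembles a standard Gaussian only when the variable was approximately Gaussian to begin with. It involves: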

1. Subtracting the mean from the values.
2. Dividing by the standard deviation.

Mathematically, standardization is represented as:
$$
z = \frac{x - \mu}{\sigma}
$$
where:
- $x$ is the original value,
- $\mu$ is the mean of the input variable,
- $\sigma$ is the standard deviation of the input variable,
- $z$ is the standardized value.
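
As a small illustration of the formula above, the sketch below standardizes a toy variable by hand and compares the result with scikit-learn's `StandardScaler`. The values in `x` are made up for the example and are not part of the diabetes dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy values for a single input variable
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = x.mean()           # mean of the variable
sigma = x.std()         # population standard deviation (ddof=0)
z = (x - mu) / sigma    # standardized values

# StandardScaler expects a 2D array of shape (n_samples, n_features)
z_sklearn = StandardScaler().fit_transform(x.reshape(-1, 1)).ravel()

print("mean:", mu, "std:", sigma)
print("z:", z)
print("matches StandardScaler:", np.allclose(z, z_sklearn))
```

After the transform, `z` has zero mean and unit standard deviation, while the relative spacing of the values is unchanged.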

#### Limitations of Standardization

Standardization can become skewed or biased if the input variable contains **outlier values**. Outliers can significantly affect the mean and standard deviation, leading to suboptimal scaling.
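
To make this sensitivity concrete, here is a minimal sketch that compares summary statistics of a toy variable with and without one extreme value. The data and the helper name `summarize` are assumptions made for illustration only.

```python
import numpy as np

# Hypothetical toy variable, then the same variable with one extreme value added
clean = np.array([10.0, 11.0, 12.0, 13.0, 14.0])
with_outlier = np.append(clean, 100.0)


def summarize(values):
    """Return the statistics used by standardization and by robust scaling."""
    q1, q3 = np.percentile(values, [25, 75])
    return {
        "mean": values.mean(),
        "std": values.std(),
        "median": np.median(values),
        "iqr": q3 - q1,
    }


stats_clean = summarize(clean)
stats_outlier = summarize(with_outlier)

for key in ("mean", "std", "median", "iqr"):
    print("%-6s clean=%.2f  with outlier=%.2f" % (key, stats_clean[key], stats_outlier[key]))
```

In this toy case a single extreme value moves the mean and standard deviation a long way, while the median and interquartile range change only slightly.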

#### Robust Scaling: An Alternative Approach

To address the limitations of standardization, **robust scaling** can be used. This technique is less sensitive to outliers and involves:

1. Subtracting the median from the values.
2. Dividing by the interquartile range (IQR).

We describe **robust scaling** in detail later in this tutorial.

#### Summary

- **Standardization** is a common scaling technique that transforms data to have zero mean and unit variance but can be skewed by outliers.
- **Robust scaling** uses the median and interquartile range, making it more suitable for datasets with outliers.
- Choosing the right scaling technique depends on the characteristics of your data and the machine learning algorithm being used.

## Learning Objectives

- Understand why many machine learning algorithms prefer scaled numerical input variables
- Learn robust scaling techniques that use percentiles to scale numerical input variables containing outliers
- Master using `RobustScaler` to scale numerical input variables using median and interquartile range
- Evaluate the impact of different scaling ranges on model performance

### Tasks to complete

- Examine raw dataset distributions
- Apply robust scaling transformation
- Evaluate model performance with different scaling ranges
- Compare results between scaled and unscaled data


## Prerequisites

- A working Python environment
- Basic understanding of Python programming concepts
- Basic understanding of machine learning concepts
- Familiarity with pandas and numpy libraries
- Knowledge of basic statistical concepts


## Get Started

To start, we install required packages and import necessary libraries.


### Install packages

In [None]:
# Install necessary Python libraries using pip
%pip install matplotlib numpy pandas scikit-learn  

# matplotlib - A library for creating static, animated, and interactive visualizations in Python.
# numpy - A fundamental package for numerical computing, providing support for arrays and mathematical operations.
# pandas - A powerful data analysis and manipulation library, useful for working with structured data.
# scikit-learn - A machine learning library that provides simple and efficient tools for data mining and analysis.


### Import libraries

In [None]:
# Import necessary libraries
from matplotlib import pyplot  # For plotting graphs
from numpy import mean, std  # For calculating mean and standard deviation
from pandas import DataFrame, read_csv  # For handling data as DataFrames and reading CSV files
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score  # For cross-validation
from sklearn.neighbors import KNeighborsClassifier  # K-Nearest Neighbors classifier
from sklearn.pipeline import Pipeline  # To create machine learning pipelines
from sklearn.preprocessing import LabelEncoder, RobustScaler  # For data preprocessing

## Robust Scaling Data

When working with machine learning algorithms, input variables with **very large values** relative to other variables can dominate or skew the model's behavior. This happens because algorithms may focus excessively on the variables with larger values, effectively ignoring those with smaller values. This imbalance can lead to suboptimal model performance.

Additionally, **outliers** in the data can further complicate the scaling process. Outliers can distort the probability distribution of the data, making traditional scaling techniques like standardization less effective. Standardization relies on the mean and standard deviation, both of which can be heavily influenced by outliers, resulting in skewed scaling.

### Robust Scaling: A Solution for Outliers and Large Values

To address these issues, **robust scaling** is a preferred alternative. Robust scaling uses the **median (50th percentile)** and the **interquartile range (IQR)** to scale the data. The IQR is the difference between the 75th percentile (Q3) and the 25th percentile (Q1). 

The steps for robust scaling are as follows:
1. Subtract the **median** from each value.
2. Divide the result by the **IQR**.

Mathematically, robust scaling can be expressed as:

$$
z = \frac{x - \text{median}}{\text{IQR}}
$$

where:
- $x$ is the original value,
- $\text{median}$ is the median of the input variable,
- $\text{IQR}$ is the interquartile range (Q3 - Q1),
- $z$ is the robustly scaled value.
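
The formula can be applied by hand with NumPy. The sketch below uses a made-up variable containing one extreme value (`100.0`), which is an assumption for illustration and not data from this tutorial.

```python
import numpy as np

# Hypothetical toy variable; the last value is an outlier
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])

median = np.median(x)                  # 50th percentile
q1, q3 = np.percentile(x, [25, 75])    # 25th and 75th percentiles
iqr = q3 - q1                          # interquartile range

z = (x - median) / iqr                 # robustly scaled values

print("median:", median, "IQR:", iqr)
print("z:", z)
```

The outlier is still far from the other values after scaling. Robust scaling does not remove it. It only keeps the outlier from distorting the center and spread used to scale the remaining values.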

### Advantages of Robust Scaling
- **Resilience to Outliers**: Since the median and IQR are less sensitive to extreme values, robust scaling is more effective for datasets containing outliers.
- **Balanced Scaling**: It puts the bulk of each variable on a comparable scale, so no single variable dominates the model simply because of its units, which helps algorithms that are sensitive to input variable ranges.

By using robust scaling, you can ensure that your data is appropriately normalized, even in the presence of outliers or variables with large value ranges.


## Diabetes Dataset

The dataset records medical details for each patient along with whether the patient experienced the onset of diabetes within five years.

### Dataset Overview
- **Number of Instances**: 768
- **Number of Attributes**: 8 (all numeric-valued) plus a class variable.

### Attributes Description
1. **Number of times pregnant**
2. **Plasma glucose concentration** (measured 2 hours after an oral glucose tolerance test)
3. **Diastolic blood pressure** (mm Hg)
4. **Triceps skinfold thickness** (mm)
5. **2-Hour serum insulin** (mu U/ml)
6. **Body mass index** (weight in kg/(height in m)^2)
7. **Diabetes pedigree function** (a function that scores the likelihood of diabetes based on family history)
8. **Age** (years)
9. **Class variable** (0 or 1, where 1 indicates a positive test for diabetes)

### Missing Attribute Values
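- The dataset contains missing attribute values. They are not marked explicitly; in this CSV they appear as zeros in columns where zero is not a plausible measurement (for example, blood pressure or body mass index).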

### Class Distribution
- **Class Value 0**: 500 instances (interpreted as "tested negative for diabetes")
- **Class Value 1**: 268 instances (interpreted as "tested positive for diabetes")

### Additional Resources
- **Dataset File**: [pima-indians-diabetes.csv](https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv)
- **Dataset Details**: [pima-indians-diabetes.names](https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.names)

In [None]:
# Load and summarize the diabetes dataset
# Path to the dataset file (adjust it to where the CSV is stored locally)
pima_indians_diabetes_csv = "../../Data/pima-indians-diabetes.csv"

# Load the dataset; the file has no header row, so columns are numbered 0 to 8
dataset = read_csv(pima_indians_diabetes_csv, header=None)
print(dataset.head())  # Display the first five rows of the dataset

# Summarize the shape of the dataset
print(dataset.shape)  # Print the number of rows and columns in the dataset

# Summarize each variable (statistical summary)
print(dataset.describe())  # Display summary statistics (count, mean, std, min, max, etc.)

### Dataset Summary

The dataset consists of:
- **8** input variables
- **1** output variable
- **768** rows of data

A statistical summary of the input variables reveals that each variable operates on a **very different scale**. This characteristic makes the dataset an excellent candidate for exploring **data scaling methods**.

### Visualizing Input Variables

To better understand the dataset, we can create a **histogram for each input variable**. These plots highlight two key observations:
1. **Differing Scales**: Each input variable has a distinct range and scale.
2. **Presence of Outliers**: Some distributions exhibit outliers, which can skew the data.

### Implications for Data Scaling

Given the **varying scales** and the **presence of outliers**, the dataset is well-suited for applying a **robust scaler transform**. This method is particularly effective for standardizing data when:
- Input variables have significantly different scales.
- Outliers are present, as robust scaling is less sensitive to extreme values compared to traditional standardization.


By using a robust scaler, we can ensure that the data is appropriately normalized, leading to better performance in machine learning models.

In [None]:
# Generate histograms for all variables in the dataset
# xlabelsize and ylabelsize control the font size of axis labels
fig = dataset.hist(xlabelsize=4, ylabelsize=4)

# Reduce the title size of each subplot for better readability
for ax in fig.ravel():
    ax.title.set_size(4)

# Display the histogram plot
pyplot.show()

Next, let's fit and evaluate a machine learning model on the raw dataset. We will use
a k-nearest neighbor algorithm with default hyperparameters and evaluate it using repeated
stratified k-fold cross-validation.

In [None]:
# Evaluate k-Nearest Neighbors (KNN) on the raw Pima Indians Diabetes dataset

# Load dataset from CSV file (the path 'pima_indians_diabetes_csv' is defined in the loading cell above)
dataset = read_csv(pima_indians_diabetes_csv, header=None)

# Convert the dataset into a NumPy array
data = dataset.values

# Separate the dataset into input features (X) and output labels (y)
X, y = data[:, :-1], data[:, -1]

# Ensure input features are of type float (for consistency in calculations)
X = X.astype("float32")

# Encode the output labels as integers (necessary for classification models)
y = LabelEncoder().fit_transform(y.astype("str"))

# Define the KNN model with default parameters
model = KNeighborsClassifier()

# Set up cross-validation with 10 splits, repeated 3 times for robust evaluation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# Perform cross-validation and compute accuracy scores across different folds
n_scores = cross_val_score(model, X, y, scoring="accuracy", cv=cv, n_jobs=-1)

# Report the model's mean accuracy and standard deviation across all cross-validation runs
print("Accuracy: %.3f (%.3f)" % (mean(n_scores), std(n_scores)))

In this case we can see that the model achieved a mean classification accuracy of about
71.7 percent.

Next, let's explore a robust scaling transform of the dataset.

## IQR Robust Scaler Transform

We can apply the **Robust Scaler** to the diabetes dataset directly. This method scales the data using the **Interquartile Range (IQR)**, making it robust to outliers. Here's how it works:

1. **Define the RobustScaler Instance**:
   - A `RobustScaler` object is created with default hyperparameters.

2. **Apply the Transform**:
   - The `fit_transform()` function is called on the dataset. This function:
     - Computes the median and IQR for each feature.
     - Scales the data based on these values.

3. **Output**:
   - The result is a transformed version of the dataset where the values are scaled to the IQR, ensuring robustness to outliers and varying scales.

This approach is particularly useful for datasets like the diabetes dataset, where input variables have differing scales and may contain outliers.
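
Before running the transform on the full dataset, the steps above can be sketched on a tiny made-up array. `X_toy` is a hypothetical example, not the diabetes data; `center_` and `scale_` are the fitted attributes in which `RobustScaler` stores the per-feature median and IQR.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical two-feature data on very different scales;
# the last row holds an extreme value in the second feature
X_toy = np.array([
    [1.0, 100.0],
    [2.0, 200.0],
    [3.0, 300.0],
    [4.0, 400.0],
    [5.0, 5000.0],
])

scaler = RobustScaler()                  # default hyperparameters
X_scaled = scaler.fit_transform(X_toy)   # learn median/IQR per feature, then scale

print("medians:", scaler.center_)
print("IQRs:", scaler.scale_)
print("scaled data:\n", X_scaled)
print("medians after scaling:", np.median(X_scaled, axis=0))
```

Both features end up with a median of zero and an IQR of one, even though their original units differ by two orders of magnitude.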

In [None]:
# Load the Pima Indians Diabetes dataset from a CSV file
# `header=None` ensures that no row is treated as column names
dataset = read_csv(pima_indians_diabetes_csv, header=None)

# Extract only the numeric input features (excluding the target variable in the last column)
data = dataset.values[:, :-1]

# Initialize the RobustScaler, which scales features using statistics 
# that are robust to outliers (i.e., median and interquartile range)
trans = RobustScaler()

# Fit the scaler to the data and transform it to remove the influence of outliers
data = trans.fit_transform(data)

# Convert the transformed NumPy array back into a Pandas DataFrame for better readability
dataset = DataFrame(data)

# Display summary statistics of the transformed dataset
# This includes count, mean, standard deviation, min, and percentile values
print(dataset.describe())

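We can see that the
distributions have been adjusted. The median (50%) of each variable is now zero and the
interquartile range (75% minus 25%) is now 1.0. The standard deviations are no longer on very
different scales, but they are not forced to exactly 1.0, because robust scaling divides by the
IQR rather than the standard deviation.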

In [None]:
# Generate histograms for all variables in the dataset
# xlabelsize and ylabelsize adjust the font size of axis labels for readability
fig = dataset.hist(xlabelsize=4, ylabelsize=4)

# Reduce the title font size for each subplot in the figure
for ax in fig.ravel():
    ax.title.set_size(4)

# Display the histogram plot
pyplot.show()

Histogram plots of the variables are created, although the distributions don't look much
different from their original distributions seen in the previous section. We can see that the
center of mass for each distribution is now close to zero.

Next, let's evaluate the same KNN model as the previous section, but in this case on a
robust scaler transform of the dataset.

In [None]:
# Evaluate K-Nearest Neighbors (KNN) on the Pima Indians Diabetes dataset 
# using RobustScaler for data preprocessing

# Load the dataset from CSV file
dataset = read_csv(pima_indians_diabetes_csv, header=None)  # Replace with actual file path
data = dataset.values  # Convert dataframe to NumPy array for processing

# Separate the dataset into input features (X) and target labels (y)
X, y = data[:, :-1], data[:, -1]

# Convert input features to float type and encode target labels as integers
X = X.astype("float32")  
y = LabelEncoder().fit_transform(y.astype("str"))  # Ensures target labels are numerically encoded

# Define a preprocessing and modeling pipeline
trans = RobustScaler()  # Use RobustScaler to handle outliers and normalize features
model = KNeighborsClassifier()  # Initialize KNN classifier
pipeline = Pipeline(steps=[("t", trans), ("m", model)])  # Create a pipeline with scaling and classification

# Define cross-validation strategy
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)  
# This splits data into 10 folds and repeats the process 3 times to ensure robust evaluation

# Evaluate the pipeline using cross-validation and compute accuracy scores
n_scores = cross_val_score(pipeline, X, y, scoring="accuracy", cv=cv, n_jobs=-1)

# Report mean accuracy and standard deviation of scores
print("Accuracy: %.3f (%.3f)" % (mean(n_scores), std(n_scores)))

We can see that the robust scaler transform results in a lift in
performance from 71.7 percent accuracy without the transform to about 73.4 percent with the
transform.

## Explore Robust Scaler Range

By default, the **Robust Scaler** uses the **Interquartile Range (IQR)** to scale each variable. The IQR is bounded by the **25th percentile (Q1)** and the **75th percentile (Q3)**. This range is specified by the `quantile_range` argument as a tuple of percentiles between 0 and 100, which defaults to `(25.0, 75.0)`.

### Customizing the Range
The range can be adjusted to potentially improve model performance:
- **Wider Range**: Includes more data points, reducing the number of values considered outliers.
- **Narrower Range**: Excludes more data points, increasing the number of values considered outliers.

### Example: Exploring Different Ranges
The example below demonstrates the effect of varying the range from:
- **1st to 99th** percentiles (very wide range, minimizing outliers)
- **30th to 70th** percentiles (narrow range, maximizing outliers)

By experimenting with different ranges, you can fine-tune the scaling process to better suit your dataset and improve model performance.
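
As a quick illustration of what `quantile_range` changes, the sketch below fits the scaler on a made-up column holding the values 0 to 100 and records the divisor (`scale_`) learned for a few ranges. The data and the `scales` dictionary are assumptions for this example only.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical single feature with the values 0, 1, ..., 100
X_toy = np.arange(101, dtype=float).reshape(-1, 1)

scales = {}
for low in [1, 10, 25, 30]:
    # quantile_range is given in percentiles (0 to 100), not fractions
    scaler = RobustScaler(quantile_range=(low, 100 - low)).fit(X_toy)
    scales[low] = scaler.scale_[0]
    print("range %d-%d -> divisor %.1f" % (low, 100 - low, scales[low]))
```

A wider range produces a larger divisor, so the scaled values are compressed more. A narrower range produces a smaller divisor, so the same values are spread further from zero.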

In [None]:
# Explore the scaling range of the robust scaler transform

# Get the dataset
def get_dataset():
    # Load the dataset from the specified CSV file without a header row
    dataset = read_csv(pima_indians_diabetes_csv, header=None)
    
    # Convert the dataset into a numpy array
    data = dataset.values
    
    # Separate the dataset into input features (X) and output labels (y)
    X, y = data[:, :-1], data[:, -1]  # Inputs (all columns except the last), Outputs (last column)
    
    # Ensure that the input features are of type float32 (for better performance)
    X = X.astype("float32")
    
    # Convert the output labels to integers using LabelEncoder (for classification tasks)
    y = LabelEncoder().fit_transform(y.astype("str"))  # Convert to string and then encode to integers
    
    # Return the prepared input and output data
    return X, y



# Get a list of models to evaluate
def get_models():
    # Create an empty dictionary to store the models
    models = dict()
    
    # Iterate over a list of values to create different models
    for value in [1, 5, 10, 15, 20, 25, 30]:
        # Define the transformation step using RobustScaler with quantile_range based on 'value'
        trans = RobustScaler(quantile_range=(value, 100 - value))
        
        # Define the classifier model using KNeighborsClassifier
        model = KNeighborsClassifier()
        
        # Store the model in the dictionary, where the key is the string representation of 'value'
        # The pipeline includes both the transformation step and the model step
        models[str(value)] = Pipeline(steps=[("t", trans), ("m", model)])
    
    # Return the dictionary of models
    return models

# Evaluate a given model using cross-validation
def evaluate_model(model, X, y):
    # Create a Repeated Stratified K-Fold cross-validation object
    # n_splits=10: 10-fold cross-validation (split data into 10 parts)
    # n_repeats=3: repeat the process 3 times to get more robust results
    # random_state=1: ensures reproducibility of results by fixing the random seed
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    
    # Perform cross-validation and calculate accuracy scores for each fold
    # scoring="accuracy" specifies that the evaluation metric is accuracy
    # n_jobs=-1: use all available CPUs to speed up the computation
    scores = cross_val_score(model, X, y, scoring="accuracy", cv=cv, n_jobs=-1)
    
    # Return the array of accuracy scores from each fold
    return scores

# Define the dataset by calling the function to load or generate it
X, y = get_dataset()

# Get the models to evaluate by calling a function that returns a dictionary of model names and their respective model objects
models = get_models()

# Initialize empty lists to store the evaluation results and model names
results, names = list(), list()

# Loop through each model and evaluate it
for name, model in models.items():
    # Evaluate the current model using the dataset X and y, and store the scores
    scores = evaluate_model(model, X, y)
    
    # Append the scores to the results list
    results.append(scores)
    
    # Append the model's name to the names list
    names.append(name+"-"+str(100-int(name)))
    
    # Print the model's name along with its mean and standard deviation of the evaluation scores
    print(">%s-%s %.3f (%.3f)" % (name, 100-int(name), mean(scores), std(scores)))

We can see that ranges such as 10-90 and 15-85 perform better than the default
of 25-75.

In [None]:
# Plotting a boxplot to compare the performance of different models

# 'results' contains the performance data for each model (e.g., accuracy scores, etc.)
# 'names' is a list of the model names corresponding to the performance results
# The 'boxplot' function is used to create the boxplot for comparing results visually
pyplot.boxplot(results, showmeans=True)  # Plot the boxplot with showing means

# Add x-axis and y-axis labels
# Each box summarizes the per-fold accuracy scores for one quantile range
pyplot.xlabel("Quantile range (percentiles)")
pyplot.ylabel("Accuracy")

# Set the x-axis labels using 'xticklabels'
pyplot.xticks(ticks=range(1, len(names) + 1), labels=names)  # Set the model names on the x-axis

# Display the plot
pyplot.show()  # Show the generated plot to the user

### Evaluating IQR Range Impact

To assess the effect of different IQR ranges on model performance, **box and whisker plots** were created to summarize the classification accuracy scores for each range. The plots reveal:

- **Subtle Differences**: There is a slight variation in the **distribution** and **mean accuracy** between different IQR ranges.
- **Comparison of Ranges**:
  - **15th to 85th** Percentiles: A wider range that includes more data points, potentially reducing the impact of outliers.
  - **25th to 75th** Percentiles: The default IQR range, which is narrower and may classify more values as outliers.

These results suggest that adjusting the IQR range can influence model performance, though the differences may be subtle. Further experimentation with additional ranges could help identify the optimal configuration for the dataset.
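
One possible way to automate this kind of experiment is a grid search over `quantile_range`. The sketch below is a hedged example rather than part of the original tutorial: it uses synthetic data from `make_classification` as a stand-in (an assumption), and the names `X_demo`, `y_demo`, and `search` are hypothetical. To try it on the diabetes data, pass the `X` and `y` returned by `get_dataset()` instead.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in data (assumption); replace with the diabetes X, y if available
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=1)

# Same pipeline structure as above: robust scaling followed by KNN
pipeline = Pipeline(steps=[("t", RobustScaler()), ("m", KNeighborsClassifier())])

# Candidate ranges, addressed through the pipeline step name "t"
param_grid = {"t__quantile_range": [(q, 100 - q) for q in (5, 10, 15, 20, 25, 30)]}

# Smaller cross-validation setup than in the tutorial, to keep the sketch fast
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=1)

search = GridSearchCV(pipeline, param_grid, scoring="accuracy", cv=cv)
search.fit(X_demo, y_demo)

print("Best range:", search.best_params_["t__quantile_range"])
print("Best mean accuracy: %.3f" % search.best_score_)
```

Because the range is chosen using the same cross-validation scores that are reported, the best score tends to be slightly optimistic. A held-out test set or nested cross-validation gives a fairer estimate.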

## Conclusion

Robust scaling provides an effective way to standardize numerical variables when outliers are present. Different scaling ranges (like 10-90 or 15-85 percentiles) can outperform the default 25-75 percentile range, showing the importance of testing different scaling parameters.

## Clean up

Remember to shut down your Jupyter Notebook environment and delete any unnecessary files or resources once you've completed the tutorial.