# Feature Scaling: Scale Numerical Data

Adapted from Jason Brownlee. 2020. [Data Preparation for Machine Learning](https://machinelearningmastery.com/data-preparation-for-machine-learning/).


## Overview

Many machine learning algorithms perform significantly better when numerical input variables are scaled to a standard range. This is because algorithms that rely on distance calculations (e.g., k-Nearest Neighbors, Support Vector Machines) or gradient-based optimization (e.g., linear regression, neural networks) are sensitive to the scale of the input features. Without scaling, features with larger magnitudes can dominate the model's behavior, leading to biased or suboptimal results.

This tutorial provides a comprehensive guide to two widely used techniques for scaling numerical data: **normalization** and **standardization**. These preprocessing steps ensure that all input features contribute equally to the model's learning process, improving performance and convergence.

#### Key Concepts:
1. **Normalization**:
   - Rescales numerical features to a fixed range, typically [0, 1].
   - Useful for algorithms that work best with small, bounded input values, such as neural networks.

2. **Standardization**:
   - Rescales numerical features to have a mean of 0 and a standard deviation of 1.
   - Works well when input variables are approximately normally distributed, and is commonly used with linear models such as linear regression and logistic regression.

#### Why Scaling Matters:
- **Improves Model Performance**: Ensures that no single feature dominates due to its scale.
- **Speeds Up Convergence**: Helps gradient-based algorithms converge faster during training.
- **Enhances Interpretability**: Makes it easier to compare the importance of features.
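
To make the "dominating feature" point concrete, here is a minimal sketch of a Euclidean distance between two made-up records. The two feature values (an insulin-like measurement in the hundreds and a pedigree-like score below 3) are hypothetical and chosen only for illustration; they are not taken from the dataset used later.

```python
import numpy as np

# Two hypothetical records: (large-scale feature, small-scale feature).
a = np.array([100.0, 0.2])
b = np.array([110.0, 2.0])

# Euclidean distance, as used by default in k-nearest neighbors.
diff = a - b
dist = np.linalg.norm(diff)

# Fraction of the squared distance contributed by each feature.
share = diff**2 / np.sum(diff**2)

print("distance:", dist)
print("share per feature:", share)
```

Even though the second feature changes by a factor of ten between the two records, nearly all of the distance comes from the first feature simply because its units are larger.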

By the end of this tutorial, you will understand how to apply normalization and standardization to your datasets, ensuring that your machine learning models perform at their best.

## Learning Objectives

- Understand why data scaling is a recommended pre-processing step for many machine learning algorithms
- Learn how data scaling can be achieved through normalizing or standardizing real-valued input and output variables
- Apply standardization and normalization techniques to improve predictive modeling algorithm performance

### Tasks to complete

- Load and examine the diabetes dataset
- Apply MinMaxScaler transformation
- Apply StandardScaler transformation
- Compare model performance with different scaling approaches

## Prerequisites

- Basic understanding of Python programming
- Familiarity with NumPy and scikit-learn libraries
- Knowledge of basic statistical and machine learning concepts

## Get Started

To start, we install the required packages and import the necessary libraries.

### Install required packages

In [None]:
# Install necessary Python libraries using the pip package manager.
# These libraries are commonly used for data analysis, machine learning, and plotting.

# %pip is a magic command in Jupyter Notebook to install Python packages directly in the notebook environment.
# - matplotlib: A plotting library for creating static, animated, and interactive visualizations.
# - numpy: A fundamental library for numerical computations, supporting arrays, matrices, and mathematical functions.
# - pandas: A powerful library for data manipulation and analysis, providing data structures like DataFrames.
# - scikit-learn: A comprehensive library for machine learning, offering tools for data preprocessing, modeling, and evaluation.
%pip install matplotlib numpy pandas scikit-learn

### Import necessary libraries

In [None]:
# Import the pyplot module from matplotlib for plotting functionalities and alias it as plt.
from matplotlib import pyplot as plt

# Import the asarray, mean, and std functions from the numpy library for numerical operations.
from numpy import asarray, mean, std

# Import the DataFrame class and the read_csv function from the pandas library for data manipulation and reading CSV files.
from pandas import DataFrame, read_csv

# Import RepeatedStratifiedKFold and cross_val_score from sklearn.model_selection for model evaluation using cross-validation.
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Import KNeighborsClassifier from sklearn.neighbors for K-Nearest Neighbors classification model.
from sklearn.neighbors import KNeighborsClassifier

# Import Pipeline from sklearn.pipeline to create composite estimators.
from sklearn.pipeline import Pipeline

# Import LabelEncoder, MinMaxScaler, and StandardScaler from sklearn.preprocessing for data preprocessing tasks like label encoding and feature scaling.
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler

## Numerical Data Scaling Methods

- **Normalization**: Scales each input variable separately to the range of **0 to 1**, which is the range in which floating-point values have the most precision.  
  Formula:  
  $$
  X_{\text{normalized}} = \frac{X - X_{\text{min}}}{X_{\text{max}} - X_{\text{min}}}
  $$

- **Standardization**: Scales each input variable separately by **subtracting the mean** (centering) and **dividing by the standard deviation**. This transforms the distribution to have a **mean of zero** and a **standard deviation of one**.  
  Formula:  
  $$
  X_{\text{standardized}} = \frac{X - \mu}{\sigma}
  $$  
  where $\mu$ is the mean and $\sigma$ is the standard deviation.

### Data Normalization

**Normalization** is a technique used to rescale data from its original range so that all values fall within a new range of **0 to 1**. To perform normalization, you need to know or accurately estimate the **minimum** and **maximum** observable values in the dataset. These values can often be estimated from the available data.

- **Purpose**: Normalization ensures that all features contribute equally to the model's learning process, especially when features have different scales.
- **Formula**:  
  $$
  X_{\text{normalized}} = \frac{X - X_{\text{min}}}{X_{\text{max}} - X_{\text{min}}}
  $$  
  where:
  - $X$ is the original value,
  - $X_{\text{min}}$ is the minimum value in the dataset,
  - $X_{\text{max}}$ is the maximum value in the dataset.

In [None]:
# example of a normalization

# Define a NumPy array named 'data' containing sample data for normalization.
data = asarray([[100, 0.001], [8, 0.05], [50, 0.005], [88, 0.07], [4, 0.1]])
# Print the original 'data' array to show the data before normalization.
print(data)

# Transform features by scaling each feature to a given range.
# This estimator scales and translates each feature individually such
# that it is in the given range on the training set, e.g. between
# zero and one.
# Create a MinMaxScaler object named 'scaler'.
# MinMaxScaler scales features to a range between zero and one by default.
scaler = MinMaxScaler()

# Fit the MinMaxScaler 'scaler' to the 'data' to calculate the min and max values for each feature.
# Then, transform the 'data' using the fitted scaler to normalize it.
scaled = scaler.fit_transform(data)

# Print the 'scaled' array to show the data after Min-Max normalization.
print(scaled)
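
As a cross-check, the normalization formula can be applied by hand with NumPy and compared with the `MinMaxScaler` output. This is a minimal sketch using the same five-row sample array; the names `col_min`, `col_max`, and `manual` are introduced here for illustration only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

data = np.asarray([[100, 0.001], [8, 0.05], [50, 0.005], [88, 0.07], [4, 0.1]])

# Column-wise minimum and maximum, i.e. X_min and X_max in the formula.
col_min = data.min(axis=0)
col_max = data.max(axis=0)

# Apply (X - X_min) / (X_max - X_min) to every column via broadcasting.
manual = (data - col_min) / (col_max - col_min)

# Compare with scikit-learn's result on the same data.
scaled = MinMaxScaler().fit_transform(data)
print(np.allclose(manual, scaled))
```

Each column is scaled with its own minimum and maximum, which is why the two features end up on the same 0 to 1 range despite starting several orders of magnitude apart.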

### Data Standardization

Standardizing a dataset involves rescaling the distribution of values so that:
- The **mean** of the observed values becomes **0**.
- The **standard deviation** becomes **1**.

This process is equivalent to **subtracting the mean** (centering the data) and **dividing by the standard deviation**. Like normalization, standardization is often useful and sometimes required for machine learning algorithms, especially when input features have differing scales.

#### Key Points:
- **Assumption**: Standardization works best when the data approximately follows a **Gaussian distribution** (bell curve) with a well-behaved mean and standard deviation.
- **Flexibility**: You can still standardize data that is not Gaussian, but the mean and standard deviation may then summarize the data poorly (for example, when there are strong outliers), so the results may be less reliable.
- **Use Cases**: Standardization is particularly important for algorithms that rely on distance calculations (e.g., k-Nearest Neighbors, Support Vector Machines) or gradient-based optimization (e.g., linear regression, neural networks).

#### Formula:
$$
X_{\text{standardized}} = \frac{X - \mu}{\sigma}
$$
where:
- $X$ is the original value,
- $\mu$ is the mean of the dataset,
- $\sigma$ is the standard deviation of the dataset.

In [None]:
# example of a standardization
# This section demonstrates how to standardize data using scikit-learn's StandardScaler.

# Define a NumPy array named 'data' representing the dataset to be standardized.
data = asarray([[100, 0.001], [8, 0.05], [50, 0.005], [88, 0.07], [4, 0.1]])

# Print the original 'data' array to show the data before standardization.
print(data)

# Create a StandardScaler object named 'scaler'.
# StandardScaler is used to standardize features by removing the mean and scaling to unit variance.
scaler = StandardScaler()

# Fit the StandardScaler to the 'data' array and then transform the data.
# 'fit_transform' computes the mean and standard deviation from the data and then performs standardization.
scaled = scaler.fit_transform(data)

# Print the 'scaled' array, which now contains the standardized data.
print(scaled)
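
The standardization formula can be checked by hand in the same way. This is a minimal sketch under the assumption that the population standard deviation (NumPy's default, `ddof=0`) is the one to use, which matches what `StandardScaler` computes; the names `mu`, `sigma`, and `manual` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.asarray([[100, 0.001], [8, 0.05], [50, 0.005], [88, 0.07], [4, 0.1]])

# Column-wise mean and population standard deviation (ddof=0 is NumPy's default).
mu = data.mean(axis=0)
sigma = data.std(axis=0)

# Apply (X - mu) / sigma to every column via broadcasting.
manual = (data - mu) / sigma

# Compare with scikit-learn's result on the same data.
scaled = StandardScaler().fit_transform(data)
print(np.allclose(manual, scaled))

# After standardization each column has mean ~0 and standard deviation ~1.
print(manual.mean(axis=0), manual.std(axis=0))
```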

## Diabetes Dataset

Each record in the dataset describes a patient and is labeled according to whether or not the onset of diabetes occurred within five years.

```
Number of Instances: 768
Number of Attributes: 8 plus class 
For Each Attribute: (all numeric-valued)
   1. Number of times pregnant
   2. Plasma glucose concentration a 2 hours in an oral glucose tolerance test
   3. Diastolic blood pressure (mm Hg)
   4. Triceps skin fold thickness (mm)
   5. 2-Hour serum insulin (mu U/ml)
   6. Body mass index (weight in kg/(height in m)^2)
   7. Diabetes pedigree function
   8. Age (years)
   9. Class variable (0 or 1)
Missing Attribute Values: Yes
Class Distribution: (class value 1 is interpreted as "tested positive for
   diabetes")
   Class Value  Number of instances
   0            500
   1            268
```

You can learn more about the dataset here:

* Diabetes Dataset File ([pima-indians-diabetes.csv](https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv))
* Diabetes Dataset Details ([pima-indians-diabetes.names](https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.names))

In [None]:
# This section of the code is designed to load the pima-indians-diabetes dataset and provide a basic summary of its contents.

# Load the dataset from a CSV file.
dataset = read_csv("../../Data/pima-indians-diabetes.csv", header=None)

# Print the first few rows of the dataset.
# This allows for a quick inspection of the data's structure and content.
print(dataset.head())

# Print the shape (number of rows and columns) of the dataset.
print(dataset.shape)

# Print descriptive statistics for each variable in the dataset.
# This includes count, mean, std, min, max, and percentiles.
print(dataset.describe())

### Dataset Overview

The dataset consists of:
- **8 input variables**,
- **1 output variable**, and
- **768 rows of data**.

A statistical summary of the input variables reveals that each variable has a **significantly different scale**. This characteristic makes the dataset an excellent candidate for exploring and applying **data scaling methods**.

### Visualization of Input Variables

To further understand the dataset, we can create a **histogram** for each input variable. These plots confirm the **differing scales** across the variables and highlight the variability in their distributions.


In [None]:
# Generate histograms for each column in the 'dataset' DataFrame.
fig = dataset.hist(xlabelsize=4, ylabelsize=4)

# Iterate through each histogram subplot and set the title font size to 4.
for ax in fig.ravel():
    ax.title.set_size(4)

# Display the plot containing the histograms.
plt.show()

### Model Evaluation on Raw Data

Next, we will fit and evaluate a machine learning model using the **raw dataset**. For this task, we will employ a **k-nearest neighbor (KNN) algorithm** with its default hyperparameters. To ensure robust evaluation, we will use **repeated stratified k-fold cross-validation**, which provides a reliable estimate of the model's performance by maintaining the class distribution across folds and repeating the process multiple times.

In [None]:
# This section of the code evaluates a K-Nearest Neighbors (KNN) classifier on the raw Pima Indians Diabetes dataset without any feature transformation.

# Load the dataset from a CSV file named "pima-indians-diabetes.csv" located in the "../../Data/" directory into a pandas DataFrame.
# 'header=None' argument indicates that the CSV file does not have a header row.
dataset = read_csv("../../Data/pima-indians-diabetes.csv", header=None)

# Convert the pandas DataFrame 'dataset' into a NumPy array 'data'.
# This is often done for compatibility with scikit-learn and for faster numerical computations.
data = dataset.values

# separate into input and output columns
# Split the 'data' array into input features (X) and output labels (y).
# 'data[:, :-1]' selects all rows (:) and all columns except the last one (:-1) for input features (X).
# 'data[:, -1]' selects all rows (:) and only the last column (-1) for output labels (y).
X, y = data[:, :-1], data[:, -1]

# Ensure inputs are floats and the output is an integer label.
# Convert the input features 'X' to float32 so they are in a numerical format suitable for machine learning models.
X = X.astype("float32")

# Encode the output labels 'y' using LabelEncoder.
# 'y.astype("str")' first converts the labels to string type to handle potential mixed data types.
# 'LabelEncoder().fit_transform()' then fits the LabelEncoder to the labels and transforms them into integer labels (0, 1, 2, ...).
y = LabelEncoder().fit_transform(y.astype("str"))

# define and configure the model
# Create a KNeighborsClassifier object named 'model' with default parameters.
model = KNeighborsClassifier()

# evaluate the model
# Create a RepeatedStratifiedKFold cross-validation object named 'cv'.
# 'n_splits=10' sets the number of folds for k-fold cross-validation to 10.
# 'n_repeats=3' specifies that the cross-validation process should be repeated 3 times.
# 'random_state=1' ensures that the data splitting is consistent across runs for reproducibility.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# Perform cross-validation using 'cross_val_score' to evaluate the 'model'.
# 'model' is the KNeighborsClassifier model to evaluate.
# 'X' is the input features.
# 'y' is the output labels.
# 'scoring="accuracy"' specifies that accuracy is the metric to evaluate.
# 'cv=cv' uses the RepeatedStratifiedKFold object 'cv' for cross-validation splitting.
# 'n_jobs=-1' utilizes all available CPU cores for parallel computation to speed up the process.
n_scores = cross_val_score(model, X, y, scoring="accuracy", cv=cv, n_jobs=-1)

# report model performance
# Print the performance of the KNN model.
# "Accuracy: %.3f (%.3f)" is a format string.
# '%.3f' is replaced by the mean accuracy calculated from 'n_scores', rounded to 3 decimal places.
# '%.3f' is replaced by the standard deviation of the accuracy scores in 'n_scores', rounded to 3 decimal places.
print("Accuracy: %.3f (%.3f)" % (mean(n_scores), std(n_scores)))

In this case we can see that the model achieved a mean classification accuracy of about
71.7 percent.
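
To see what repeated stratified k-fold cross-validation does with these settings, the following sketch builds synthetic labels with the same class counts as the diabetes dataset (500 negative, 268 positive) and inspects the splits. The placeholder feature matrix and the variable names are assumptions made for illustration; no real patient data is used.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# Synthetic labels mirroring the class distribution of the diabetes dataset.
y = np.array([0] * 500 + [1] * 268)
# Placeholder features: only the labels matter for stratification.
X = np.zeros((len(y), 1))

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
splits = list(cv.split(X, y))

# 10 folds repeated 3 times gives one accuracy score per split.
print("number of splits:", len(splits))

# Share of the positive class in every test fold versus the whole dataset.
test_pos_rates = [y[test_idx].mean() for _, test_idx in splits]
print("overall positive rate:", y.mean())
print("test-fold positive rates range:", min(test_pos_rates), max(test_pos_rates))
```

Every test fold keeps roughly the same positive rate as the full dataset, so the reported mean and standard deviation summarize 30 comparable accuracy scores.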

## `MinMaxScaler` Transform

We can apply the `MinMaxScaler` to the diabetes dataset to normalize the input variables. By default, the `MinMaxScaler` scales values to the range **0 to 1**. Here's how it works:

1. **Define the Scaler**: Create an instance of `MinMaxScaler` with default hyperparameters.
2. **Transform the Data**: Use the `fit_transform()` function to apply the scaling to the dataset, generating a transformed version where all input variables are normalized to the specified range.

### Summary of each input variable

In [None]:
# visualize a minmax scaler transform of the diabetes dataset

# Load the 'pima-indians-diabetes.csv' dataset from the specified path using pandas' read_csv function.
# 'header=None' indicates that the CSV file does not have a header row.
dataset = read_csv("../../Data/pima-indians-diabetes.csv", header=None)

# Extract all rows and all columns except the last one from the dataset's values (NumPy array).
# This assumes the last column is the target variable and the rest are features.
data = dataset.values[:, :-1]

# perform a min-max scaler transform of the dataset
# Create a MinMaxScaler object to scale features to a range between 0 and 1.
trans = MinMaxScaler()

# Fit the MinMaxScaler to the data and then transform the data.
# 'fit_transform' learns the min and max values from the data and applies the scaling.
data = trans.fit_transform(data)

# Convert the NumPy array 'data' (which is now scaled) back into a pandas DataFrame.
dataset = DataFrame(data)

# Print descriptive statistics of the transformed DataFrame.
# 'describe()' provides count, mean, std, min, 25%, 50%, 75%, max for each column.
print(dataset.describe())

After applying the transformation, we can observe that the distributions of the variables have been adjusted. The minimum and maximum values for each variable are now **0.0** and **1.0**, respectively, confirming that the data has been successfully normalized to the desired range.
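
Note that the 0 to 1 guarantee holds for the data the scaler was fitted on. The sketch below, using small made-up arrays named `train` and `new`, illustrates what can happen when a fitted `MinMaxScaler` is applied to values outside the minimum and maximum it learned.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data: the scaler only ever sees 'train'.
train = np.array([[10.0], [20.0], [30.0]])
new = np.array([[5.0], [25.0], [40.0]])

scaler = MinMaxScaler().fit(train)

# The learned per-feature minimum and maximum.
print(scaler.data_min_, scaler.data_max_)

# Values below the learned minimum map below 0; values above the maximum map above 1.
new_scaled = scaler.transform(new)
print(new_scaled.ravel())
```

This is why the minimum and maximum should be known or estimated from data that is representative of what the model will see later.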

### Histogram plots of the variables

In [None]:
# Create a histogram for each variable in the 'dataset' DataFrame.
# 'xlabelsize=4' and 'ylabelsize=4' set the font size of the x- and y-axis tick labels.
fig = dataset.hist(xlabelsize=4, ylabelsize=4)

# 'fig.ravel()' flattens the array of subplots into a 1D array;
# set the title of each subplot to font size 4.
for ax in fig.ravel():
    ax.title.set_size(4)

# Display the histograms.
plt.show()

Histogram plots of the variables are generated to visualize their distributions. While the overall shapes of the distributions appear similar to their original forms (as seen in the previous section), we can confirm that the minimum and maximum values have been successfully scaled to **0** and **1**, respectively, as intended.

### Model Evaluation

Next, we will evaluate the same **k-nearest neighbor (KNN) model** as before, but this time using the dataset transformed by the **`MinMaxScaler`**. This will allow us to assess the impact of normalization on the model's performance.

In [None]:
# This script evaluates a K-Nearest Neighbors (KNN) classifier on the Pima Indians Diabetes dataset,
# using a MinMaxScaler for feature scaling and Repeated Stratified K-Fold cross-validation for performance evaluation.

# Read the CSV file 'pima-indians-diabetes.csv' into a pandas DataFrame.
# 'header=None' indicates that the CSV file does not have a header row.
dataset = read_csv("../../Data/pima-indians-diabetes.csv", header=None)

# Extract the values from the pandas DataFrame and store them in a NumPy array named 'data'.
data = dataset.values

# Separate the 'data' array into input features (X) and output labels (y).
# 'data[:, :-1]' selects all rows and all columns except the last one for input features (X).
# 'data[:, -1]' selects all rows and only the last column for output labels (y).
X, y = data[:, :-1], data[:, -1]

# Ensure inputs are floats and output is an integer label
# Convert the input features 'X' to float32 data type.
X = X.astype("float32")

# Encode the output labels 'y' using LabelEncoder to convert them into integer labels.
# 'y.astype("str")' first converts the labels to string type to handle potential mixed data types before encoding.
y = LabelEncoder().fit_transform(y.astype("str"))

# Define the pipeline
# Create a MinMaxScaler object named 'trans' to scale the input features.
trans = MinMaxScaler()
# Create a KNeighborsClassifier object named 'model' with default parameters.
model = KNeighborsClassifier()
# Create a Pipeline object named 'pipeline' that chains the MinMaxScaler and KNeighborsClassifier.
# 'steps=[("t", trans), ("m", model)]' defines the steps as a list of tuples, where 't' is the scaler and 'm' is the model.
pipeline = Pipeline(steps=[("t", trans), ("m", model)])

# Evaluate the pipeline
# Create a RepeatedStratifiedKFold cross-validation object named 'cv'.
# 'n_splits=10' sets the number of folds for k-fold cross-validation to 10.
# 'n_repeats=3' specifies that the cross-validation process should be repeated 3 times.
# 'random_state=1' ensures that the data splitting is consistent across runs for reproducibility.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# Evaluate the 'pipeline' using cross-validation with RepeatedStratifiedKFold.
# 'pipeline' is the model pipeline to evaluate.
# 'X' is the input features.
# 'y' is the output labels.
# 'scoring="accuracy"' specifies that accuracy is the metric to evaluate.
# 'cv=cv' uses the 'cv' cross-validation object defined above.
# 'n_jobs=-1' utilizes all available CPU cores for parallel computation to speed up the process.
n_scores = cross_val_score(pipeline, X, y, scoring="accuracy", cv=cv, n_jobs=-1)

# Report pipeline performance
# Print the mean and standard deviation of the cross-validation accuracy scores.
# "Accuracy: %.3f (%.3f)" is a format string to print the accuracy and standard deviation, rounded to 3 decimal places.
print("Accuracy: %.3f (%.3f)" % (mean(n_scores), std(n_scores)))

We can see that the `MinMaxScaler` transform results in a lift in
performance from 71.7 percent accuracy without the transform to about 73.9 percent with the
transform.
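
The evaluation above wraps the scaler and the model in a `Pipeline` so that, within each cross-validation split, the scaler is fitted on the training portion only and then applied to the held-out portion. The sketch below spells out that behavior for a single train/test split on synthetic data; the data-generating code and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data with three features on very different scales.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3)) * np.array([1.0, 50.0, 0.01])
y = (X[:, 0] + 100.0 * X[:, 2] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)

# Manual version: fit the scaler on the training data only, then reuse it on the test data.
scaler = MinMaxScaler().fit(X_train)
knn = KNeighborsClassifier().fit(scaler.transform(X_train), y_train)
manual_pred = knn.predict(scaler.transform(X_test))

# Pipeline version: the same two steps chained into one estimator.
pipeline = Pipeline(steps=[("t", MinMaxScaler()), ("m", KNeighborsClassifier())])
pipeline.fit(X_train, y_train)
pipeline_pred = pipeline.predict(X_test)

print(np.array_equal(manual_pred, pipeline_pred))
```

Fitting the scaler on the full dataset before splitting would let information from the held-out rows leak into training, which is what the pipeline avoids.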

## `StandardScaler` Transform

We can apply the `StandardScaler` to the diabetes dataset to standardize the input variables. By default, the `StandardScaler` performs the following operations:
1. **Centering**: Subtracts the mean of each variable to center the data around **0.0**.
2. **Scaling**: Divides by the standard deviation to ensure the data has a standard deviation of **1.0**.

### Steps to Apply `StandardScaler`:
1. **Define the Scaler**: Create an instance of `StandardScaler` with default hyperparameters.
2. **Transform the Data**: Use the `fit_transform()` function to apply the standardization to the dataset, generating a transformed version where the input variables have a mean of 0 and a standard deviation of 1.

### Summary of each input variable

In [None]:
# This section of the code demonstrates how to apply StandardScaler to the Pima Indians Diabetes dataset and then summarizes the transformed data.

# Load the Pima Indians Diabetes dataset from the specified CSV file into a pandas DataFrame.
# 'header=None' indicates that the CSV file does not have a header row.
dataset = read_csv("../../Data/pima-indians-diabetes.csv", header=None)

# Extract the input features (all columns except the last one) from the dataset DataFrame and convert them into a NumPy array.
# '[:, :-1]' selects all rows (:) and all columns up to, but not including, the last column (:-1).
data = dataset.values[:, :-1]

# Initialize a StandardScaler object named 'trans'.
# StandardScaler standardizes features by removing the mean and scaling to unit variance.
trans = StandardScaler()

# Fit the StandardScaler 'trans' to the input data 'data' and then transform the data.
# 'fit_transform' computes the mean and standard deviation on the data and then performs the standardization.
data = trans.fit_transform(data)

# Convert the transformed NumPy array 'data' back into a pandas DataFrame named 'dataset'.
# This is done to easily summarize the transformed data using DataFrame methods.
dataset = DataFrame(data)

# Print the descriptive statistics of the transformed DataFrame 'dataset'.
# 'describe()' method provides summary statistics of the DataFrame, including count, mean, std, min, 25%, 50%, 75%, max for each column.
print(dataset.describe())

We can see that the
distributions have been adjusted and that the mean is a very small number close to zero and
the standard deviation is very close to 1.0 for each variable.
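
One likely reason the reported standard deviation is close to, rather than exactly, 1.0: `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas reports the sample standard deviation (`ddof=1`) by default. The sketch below illustrates the effect on synthetic data with the same number of rows (768); the random data is an assumption and stands in for the real features.

```python
import numpy as np
from pandas import DataFrame
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the features: 768 rows, 2 columns.
rng = np.random.default_rng(0)
raw = rng.normal(loc=50.0, scale=10.0, size=(768, 2))

scaled = StandardScaler().fit_transform(raw)
n = scaled.shape[0]

# Population standard deviation (ddof=0): what StandardScaler normalizes to 1.
pop_std = scaled.std(axis=0)
# Sample standard deviation (ddof=1): the pandas default.
sample_std = DataFrame(scaled).std().values

print(pop_std)
print(sample_std)
# The two differ by the factor sqrt(n / (n - 1)).
print(np.sqrt(n / (n - 1)))
```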

### Histogram plots of the variables

In [None]:
# Create histograms for all variables in the dataset
# The xlabelsize and ylabelsize parameters control the font size of axis labels
fig = dataset.hist(xlabelsize=4, ylabelsize=4)

# Reduce the title size for each subplot in the figure for better readability
for ax in fig.ravel():
    ax.title.set_size(4)

# Display the histogram plots
plt.show()

Histogram plots of the variables are generated to visualize their distributions. While the overall shapes of the distributions appear similar to their original forms (as seen in the previous section), the **scale on the x-axis** has changed. Notably, the **center of mass** of each distribution now sits around **0**, which is more apparent for some variables than others.

### Model Evaluation

Next, we will evaluate the same **k-nearest neighbor (KNN) model** as before, but this time using the dataset transformed by the **`StandardScaler`**. This will allow us to assess the impact of standardization on the model's performance. The complete example is provided below.

In [None]:
# Read the 'pima-indians-diabetes.csv' dataset from the specified relative path using pandas' read_csv function, assuming no header row.
dataset = read_csv("../../Data/pima-indians-diabetes.csv", header=None)

# Convert the pandas DataFrame 'dataset' into a NumPy array 'data' for easier numerical operations.
data = dataset.values

# Separate features (X) and target labels (y):
# all columns except the last one are features; the last column is the label.
X, y = data[:, :-1], data[:, -1]

# Ensure input features are floats and encode target labels as integers.
X = X.astype("float32")
y = LabelEncoder().fit_transform(y.astype("str"))

# Define a pipeline with standardization and a KNN classifier.
# StandardScaler standardizes each feature to zero mean and unit variance.
trans = StandardScaler()

# KNeighborsClassifier with default parameters.
model = KNeighborsClassifier()

# The Pipeline applies the StandardScaler ('trans') first and then the KNeighborsClassifier ('model').
pipeline = Pipeline(steps=[("t", trans), ("m", model)])

# Define the cross-validation strategy
# Create a RepeatedStratifiedKFold cross-validation object named 'cv' for stratified k-fold cross-validation repeated multiple times.
# This performs 10-fold cross-validation, repeated 3 times for robustness
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# Evaluate the pipeline using cross-validation
# Perform cross-validation using cross_val_score to evaluate the 'pipeline' on the dataset (X, y).
# Computes accuracy scores for each fold and repetition
n_scores = cross_val_score(pipeline, X, y, scoring="accuracy", cv=cv, n_jobs=-1)

# Report the mean and standard deviation of accuracy across all folds and repeats.
print("Accuracy: %.3f (%.3f)" % (mean(n_scores), std(n_scores)))

The results show that applying the **`StandardScaler` transform** leads to a noticeable improvement in model performance. The accuracy increases from **71.7%** (without scaling) to approximately **74.1%** (with standardization). This performance is slightly higher than the **73.9%** accuracy achieved using the **`MinMaxScaler`**.

## Conclusion

Data scaling techniques, such as **normalization** and **standardization**, can improve model performance. In this tutorial, we observed that, compared with **71.7%** accuracy for the KNN model on the raw data:
- The **`MinMaxScaler`** raised accuracy to about **73.9%**.
- The **`StandardScaler`** raised accuracy to about **74.1%**.

The choice between these scaling methods depends on the specific characteristics of your data and the requirements of your machine learning model.
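
When it is unclear which method suits a dataset, one option is to evaluate the candidates side by side under the same cross-validation scheme. The sketch below does this on a synthetic dataset from `make_classification` whose columns are deliberately rescaled; the dataset, the `candidates` dictionary, and the column multipliers are all assumptions for illustration, and the scores it prints say nothing about the diabetes dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic classification data with columns stretched to very different scales.
X, y = make_classification(n_samples=300, n_features=8, n_informative=5, random_state=1)
X = X * np.array([1, 10, 100, 1000, 1, 0.1, 0.01, 1])

# Scaling options to compare; None means "no scaling".
candidates = {
    "raw": None,
    "minmax": MinMaxScaler(),
    "standard": StandardScaler(),
}

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
results = {}
for name, scaler in candidates.items():
    # Build a pipeline with an optional scaling step followed by KNN.
    steps = [("m", KNeighborsClassifier())]
    if scaler is not None:
        steps.insert(0, ("t", scaler))
    scores = cross_val_score(Pipeline(steps=steps), X, y, scoring="accuracy", cv=cv)
    results[name] = (np.mean(scores), np.std(scores))
    print("%s: %.3f (%.3f)" % (name, results[name][0], results[name][1]))
```

Using one `cv` object with a fixed `random_state` means every candidate is scored on the same splits, which keeps the comparison fair.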

## Clean up

Remember to:
- Delete any downloaded datasets
- Shut down your SageMaker notebook instance when finished
- Remove any unnecessary resources

