Why Standardize or Scale Data?
Machine learning is a difficult subject to learn in part because errors are not limited to programming mistakes; they can come from many other areas as well.
One common error is failing to understand the assumptions of a machine learning model.
A common assumption in many types of models is that your data is scaled appropriately.
Terms to Know
Scale
Generally means to change the range of the values. The shape of the distribution doesn’t change. Think about how a scale model of a
building has the same proportions as the original, just smaller. That’s why we say it is "drawn to scale".

Standardize
Standardizing is one of several kinds of scaling. It means shifting and rescaling the values so that the distribution has a mean of 0 and a
standard deviation of 1. It does not change the shape of the distribution: skewed data stays skewed, so the output is not necessarily close
to a normal distribution.
Note:
Scaled values lose their original units. Dollars are no longer in dollar units, meters are no longer in meter units, etc.

The Math
Standardization is calculated as:
standardized_feature = (feature - mean_of_feature) / std_dev_of_feature
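As a quick check of the formula, the sketch below standardizes a small array with NumPy. The values are made up for illustration and are not taken from any dataset in this lesson. It assumes the population standard deviation (NumPy's default, ddof=0), which is also what StandardScaler uses.

```python
import numpy as np

# Made-up feature values, chosen so the mean is 5 and the std dev is 2
feature = np.array([2, 4, 4, 4, 5, 5, 7, 9], dtype=float)

mean_of_feature = feature.mean()      # 5.0
std_dev_of_feature = feature.std()    # 2.0 (population std, ddof=0)

# Apply the formula: subtract the mean, then divide by the std dev
standardized_feature = (feature - mean_of_feature) / std_dev_of_feature

# Values are -1.5, -0.5, -0.5, -0.5, 0, 0, 1, 2
print(standardized_feature)
```

A value of 9 becomes 2 because it sits two standard deviations above the mean of 5.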

Standardizing Data in Python
In Python you can scale data by using Scikit-learn's StandardScaler.
To avoid data leakage, the scaler should be fit on only the training set. When the scaler is fit on data, it calculates the mean and standard
deviation of each feature. The scaler can then be used to transform both the training and test sets based on the calculations done
during the fit step. This means that the means and standard deviations are calculated using only the training data,
because we want to keep information in the test data, including information about its means and variances, reserved for final model
evaluation only. Scaling the target values (y) is generally not required.
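The fit-on-train, transform-both pattern can be sketched on synthetic data as follows. The random feature matrix is a stand-in, not the apartments data, and all variable names are hypothetical. Because the statistics come from the training rows alone, the scaled training columns have a mean of 0, while the scaled test columns generally do not land exactly on 0.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 100 rows, 2 features on very different scales
rng = np.random.default_rng(42)
X_demo = np.column_stack([
    rng.uniform(100, 1000, size=100),           # a square-footage-like feature
    rng.uniform(100_000, 1_000_000, size=100),  # a price-like feature
])
y_demo = rng.integers(0, 2, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

demo_scaler = StandardScaler()
demo_scaler.fit(X_tr)                       # statistics come from the training rows only
X_tr_scaled = demo_scaler.transform(X_tr)   # scaled with training statistics
X_te_scaled = demo_scaler.transform(X_te)   # scaled with the SAME training statistics

print(X_tr_scaled.mean(axis=0))  # essentially 0 for every feature
print(X_te_scaled.mean(axis=0))  # generally not exactly 0
```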
The code below can be used to standardize your data.

In [1]:
# Import Libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Load Data
path = r"C:\Users\User\github_projects\Machine_Learning_with_Python\datasets\apartments.csv"
df = pd.read_csv(path)
df.head()

Unnamed: 0,Sold,SqFt,Price
0,1,200,906442
1,0,425,272629
2,1,675,824862
3,1,984,720344
4,0,727,879679


In [2]:
# Assign Target y and Features X
# The target is the 'Sold' column which indicates whether the apartment sold within 2 weeks of being listed. The features are the square
# footage and list price of the apartment.
# Assign the target column as y
y = df['Sold']
# Assign the rest of the columns as X
X = df.drop(columns = 'Sold')
# Train Test Split for Model Validation
# Now we will split the data into a training set and testing set.
# train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [5]:
# Explore the Features
# Before scaling, let's explore our original data. Notice that we are only exploring the training set. We are keeping any information about
# the test set hidden from our analysis.
# Obtain descriptive statistics of your features
X_train.describe().round(0)


# The descriptive statistics for this cell help you understand the original data (before we scale it). Notice that the range of SqFt is
# 114 to 997 and the range of Price is 109277 to 995878. The mean SqFt is 564 and the mean Price is 524950.

Unnamed: 0,SqFt,Price
count,75.0,75.0
mean,564.0,524950.0
std,285.0,274185.0
min,114.0,109277.0
25%,320.0,272804.0
50%,588.0,503613.0
75%,836.0,786078.0
max,997.0,995878.0


In [6]:
# Instantiate and Fit the Scaler on the Training Data
# Note that we only fit on the TRAINING set of data. This means that all calculations for scaling are based only on the training data.
# Remember, the purpose of the test set is to simulate unseen data so we do not use it in any calculations for pre-processing.
# instantiate scaler
scaler = StandardScaler()
# fit scaler on training data
scaler.fit(X_train)

# The fit step performs the calculations, but it does NOT apply them. After fitting, the data is still the same.
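
To see what the fit step stored, you can inspect the fitted scaler's mean_ and scale_ attributes. The sketch below uses a tiny made-up array rather than X_train, so the numbers are illustrative only. It also checks that transform gives the same result as applying the standardization formula by hand.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tiny made-up training array: 4 rows, 2 features
X_small = np.array([[1.0, 100.0],
                    [2.0, 200.0],
                    [3.0, 300.0],
                    [4.0, 400.0]])

small_scaler = StandardScaler()
small_scaler.fit(X_small)

print(small_scaler.mean_)    # per-feature means learned during fit
print(small_scaler.scale_)   # per-feature standard deviations learned during fit

# transform applies (value - mean_) / scale_ column by column
manual = (X_small - small_scaler.mean_) / small_scaler.scale_
print(np.allclose(manual, small_scaler.transform(X_small)))
```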

In [7]:
# Use the Scaler to Transform Both the Training and Testing Data
# In order to apply the calculations made during the fit step, you will need to transform the data. 
# We will transform both the train set and test set.

# transform training data
train_scaled = scaler.transform(X_train)

# transform testing data
test_scaled = scaler.transform(X_test)

# view the first 5 rows of train_scaled
train_scaled[:5]

array([[-1.37431725,  1.72912293],
       [ 1.34901239, -0.33899825],
       [ 1.35959527, -0.23730597],
       [ 1.1197165 , -1.01916082],
       [ 0.98919422,  1.07150845]])

In [8]:
# Notice that StandardScaler, like the other sklearn transformers we will be learning about, outputs NumPy arrays by default, not pandas
# DataFrames. Converting a NumPy array back to a DataFrame is optional, but it can be done with pd.DataFrame().
# We will convert back to a DataFrame here so that we can more easily explore and understand the effects of transforming our data with
# StandardScaler.
# transform back to a dataframe
X_train_scaled = pd.DataFrame(train_scaled, columns=X_train.columns)
X_train_scaled.head()

Unnamed: 0,SqFt,Price
0,-1.374317,1.729123
1,1.349012,-0.338998
2,1.359595,-0.237306
3,1.119716,-1.019161
4,0.989194,1.071508


In [9]:
# Explore the Scaled Data
# Obtain descriptive statistics of the scaled data
# Use .round(2) to eliminate scientific notation and maintain 2 places after the decimal
X_train_scaled.describe().round(2)

Unnamed: 0,SqFt,Price
count,75.0,75.0
mean,-0.0,0.0
std,1.01,1.01
min,-1.59,-1.53
25%,-0.86,-0.93
50%,0.09,-0.08
75%,0.96,0.96
max,1.53,1.73
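
The std row reads 1.01 rather than exactly 1 because describe() reports the sample standard deviation (dividing by n - 1), while StandardScaler divides by the population standard deviation (dividing by n). With 75 rows the ratio is sqrt(75 / 74), about 1.007, which rounds to 1.01. A sketch with random stand-in data (not the apartments data) shows the same effect:

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.uniform(100, 1000, size=75)   # stand-in feature with 75 rows

# Standardize with the population std (ddof=0), as StandardScaler does
z = (values - values.mean()) / values.std(ddof=0)

population_std = z.std(ddof=0)   # the quantity the scaler sets to 1
sample_std = z.std(ddof=1)       # the quantity pandas describe() reports

print(round(population_std, 2))  # 1.0
print(round(sample_std, 2))      # 1.01
```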


In [None]:
# The first thing you should notice about the descriptive statistics is that the mean of each feature is approximately 0 and the
# standard deviation is approximately 1.
# The original features were on different scales. Each scaled value now represents how far the original value is from the mean of
# its feature, in units of standard deviation. Values close to the mean are close to zero; the farther a value is from the mean,
# the larger its magnitude.
# You will also notice that some values are negative and others are positive. With the new mean set to 0, any value below the mean of the
# feature is negative and any value above the mean is positive.
# Values with large magnitudes (in either the + or - direction) could be considered outliers. There is no exact threshold for
# establishing outliers, but a common rule of thumb treats scaled values below -3 or above 3 as potential outliers.
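
The -3 / 3 threshold described above can be applied with a simple boolean mask. The sketch below uses made-up data with one deliberately extreme value and computes the scaled values with the same formula StandardScaler uses. The threshold of 3 is a convention, not a fixed rule.

```python
import numpy as np

rng = np.random.default_rng(0)
# 99 ordinary made-up values near 50, plus one deliberately extreme value
values = np.append(rng.normal(loc=50, scale=5, size=99), 500.0)

# Standardize (population std, as StandardScaler does)
z = (values - values.mean()) / values.std()

# Flag values more than 3 standard deviations from the mean
outlier_mask = np.abs(z) > 3

print(np.flatnonzero(outlier_mask))  # positions of the flagged values
print(values[outlier_mask])          # the flagged values themselves
```

Only the extreme value at the last position is flagged here. Note that a large outlier also inflates the mean and standard deviation used for scaling, so flagged values deserve a closer look rather than automatic removal.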

In [None]:
# Summary
# Scaling data is required to meet the assumptions of many, but not all, kinds of models. Standardizing is one type of scaling that is often
# used. Standardizing means to subtract the mean of a series of numbers from each number in that series, then divide the result by the
# standard deviation of that series. In Scikit-learn you can use StandardScaler() to scale your data before using it for machine learning.
# We always fit transformers like StandardScaler() on training data, then use the fitted transformer to transform both the training data and
# the testing data.