# When to standardize

Now that you've learned when it is appropriate to standardize your data, which of these scenarios is a reason to standardize?

- A column you want to use for modeling has extremely high variance.
- You have a dataset with several continuous columns on different scales, and you'd like to train a linear model on the data.
- The models you're working with use some sort of distance metric in a linear space.
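
To make the distance-metric point concrete, here is a minimal sketch using made-up numbers (the `data` array and the `squared_distance_share` helper are hypothetical, not part of the `wine` dataset). It measures how much each column contributes to the squared Euclidean distance between two rows, before and after standardizing.

```python
import numpy as np

# Hypothetical two-column data: column 0 is small-scale, column 1 is large-scale.
data = np.array([
    [2.1, 700.0],
    [2.4, 1500.0],
    [2.6, 900.0],
    [2.9, 1100.0],
])

def squared_distance_share(a, b):
    """Fraction of the squared Euclidean distance contributed by each column."""
    squared_diff = (a - b) ** 2
    return squared_diff / squared_diff.sum()

# On the raw data, the large-scale column accounts for almost all of the distance.
raw_share = squared_distance_share(data[0], data[3])

# Standardize each column to mean 0 and standard deviation 1, then compare again.
standardized = (data - data.mean(axis=0)) / data.std(axis=0)
scaled_share = squared_distance_share(standardized[0], standardized[3])

print("raw share per column:   ", raw_share)
print("scaled share per column:", scaled_share)
```

On the raw values the second column decides the distance almost by itself, which is why distance-based models such as k-nearest neighbors tend to benefit from standardization.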

# Modeling without normalizing

Let's take a look at what might happen to your model's accuracy if you try to model data without doing some sort of standardization first.

Here we have a subset of the `wine` dataset. One of the columns, `Proline`, has an extremely high variance compared to the other columns. This is a case where a technique like log normalization, which you'll learn about in the next section, comes in handy.

In [14]:
import pandas as pd

wine = pd.read_csv("dataset/wine_types.csv")
wine.head()

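| | Type | Alcohol | Malic acid | Ash | Alcalinity of ash | Magnesium | Total phenols | Flavanoids | Nonflavanoid phenols | Proanthocyanins | Color intensity | Hue | OD280/OD315 of diluted wines | Proline |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065 |
| 1 | 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050 |
| 2 | 1 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185 |
| 3 | 1 | 14.37 | 1.95 | 2.5 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480 |
| 4 | 1 | 13.24 | 2.59 | 2.87 | 21.0 | 118 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735 |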


In [15]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
# Separate the features from the target labels
X = wine.drop('Type', axis=1)
y = wine['Type']

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

knn = KNeighborsClassifier()

# Fit the knn model to the training data
knn.fit(X_train, y_train)

# Score the model on the test data
print(knn.score(X_test, y_test))

0.7777777777777778


# Checking the variance

Check the variance of the columns in the `wine` dataset. Which column is the most appropriate candidate for normalization?

In [16]:
wine.var()

Type                                0.600679
Alcohol                             0.659062
Malic acid                          1.248015
Ash                                 0.075265
Alcalinity of ash                  11.152686
Magnesium                         203.989335
Total phenols                       0.391690
Flavanoids                          0.997719
Nonflavanoid phenols                0.015489
Proanthocyanins                     0.327595
Color intensity                     5.374449
Hue                                 0.052245
OD280/OD315 of diluted wines        0.504086
Proline                         99166.717355
dtype: float64
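
Rather than scanning the output by eye, you can let pandas rank the variances. The sketch below uses only the five rows printed by `wine.head()` earlier, restricted to three columns, so the variances differ from the full-dataset values above; the names `sample` and `highest_variance_column` are illustrative.

```python
import pandas as pd

# The five rows shown by wine.head() above, restricted to three columns.
sample = pd.DataFrame({
    "Ash": [2.43, 2.14, 2.67, 2.50, 2.87],
    "Magnesium": [127, 100, 101, 113, 118],
    "Proline": [1065, 1050, 1185, 1480, 735],
})

# Sort the per-column sample variances from largest to smallest.
variances = sample.var().sort_values(ascending=False)
print(variances)

# Name of the column with the largest variance.
highest_variance_column = variances.idxmax()
print(highest_variance_column)
```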

# Log normalization in Python

Now that we know that the `Proline` column in our `wine` dataset has a large amount of variance, let's log normalize it.

In [17]:
import numpy as np
# Print out the variance of the Proline column
print(wine['Proline'].var())

# Apply the log normalization function to the Proline column
wine['Proline_log'] = np.log(wine['Proline'])

# Check the variance of the normalized Proline column
print(wine['Proline_log'].var())

99166.71735542428
0.17231366191842018
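
A few properties of the log transform are worth checking on a small sample. This sketch uses the five `Proline` values printed by `wine.head()` earlier; the variable names are illustrative. It assumes strictly positive inputs, since `np.log` is undefined at zero and below.

```python
import numpy as np
import pandas as pd

# Proline values from the five rows shown by wine.head() earlier.
proline = pd.Series([1065, 1050, 1185, 1480, 735], name="Proline")

# np.log is only defined for strictly positive values, so check before transforming.
assert (proline > 0).all()
proline_log = np.log(proline)

# The log compresses the spread but keeps the ordering of the values.
print(proline.var(), proline_log.var())
print(proline.rank().equals(proline_log.rank()))

# The transform is reversible with np.exp.
recovered = np.exp(proline_log)
```

Because the transform preserves rank order and can be inverted, it changes the scale of the column without discarding information.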


# Scaling data - investigating columns

You want to use the `Ash`, `Alcalinity of ash`, and `Magnesium` columns in the `wine` dataset to train a linear model, but it's possible that these columns are all measured in different ways, which would bias a linear model.

Which of the following statements about these columns is true?

In [18]:
wine[['Ash', 'Alcalinity of ash', 'Magnesium']].describe()

| | Ash | Alcalinity of ash | Magnesium |
| --- | --- | --- | --- |
| count | 178.0 | 178.0 | 178.0 |
| mean | 2.366517 | 19.494944 | 99.741573 |
| std | 0.274344 | 3.339564 | 14.282484 |
| min | 1.36 | 10.6 | 70.0 |
| 25% | 2.21 | 17.2 | 88.0 |
| 50% | 2.36 | 19.5 | 98.0 |
| 75% | 2.5575 | 21.5 | 107.0 |
| max | 3.23 | 30.0 | 162.0 |


- The max of Ash is 3.23, the max of Alcalinity of ash is 30, and the max of Magnesium is 162.

# Scaling data - standardizing columns

Since we know that the `Ash`, `Alcalinity of ash`, and `Magnesium` columns in the `wine` dataset are all on different scales, let's standardize them in a way that allows for use in a linear model.

In [19]:
# Import StandardScaler
from sklearn.preprocessing import StandardScaler

# Create the scaler
scaler = StandardScaler()

# Subset the DataFrame you want to scale 
wine_subset = wine[['Ash', 'Alcalinity of ash', 'Magnesium']]

# Apply the scaler to wine_subset
wine_subset_scaled = scaler.fit_transform(wine_subset)
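
To see what the scaler actually does, the sketch below applies `StandardScaler` to the five rows printed by `wine.head()` earlier and compares the result with a manual computation. The names `subset`, `scaled`, and `manual` are illustrative. Note that `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas pandas' `std()` defaults to `ddof=1`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# The five rows shown by wine.head() earlier, restricted to the three columns.
subset = pd.DataFrame({
    "Ash": [2.43, 2.14, 2.67, 2.50, 2.87],
    "Alcalinity of ash": [15.6, 11.2, 18.6, 16.8, 21.0],
    "Magnesium": [127, 100, 101, 113, 118],
})

# fit_transform returns a NumPy array, so wrap it to keep the column names.
scaled = pd.DataFrame(
    StandardScaler().fit_transform(subset),
    columns=subset.columns,
    index=subset.index,
)

# StandardScaler computes (x - mean) / std with the population std (ddof=0).
manual = (subset - subset.mean()) / subset.std(ddof=0)

print(scaled.mean().round(6))
print(scaled.std(ddof=0).round(6))
print(np.allclose(scaled.to_numpy(), manual.to_numpy()))
```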

# KNN on non-scaled data

Before adding standardization to your scikit-learn workflow, you'll first take a look at the accuracy of a k-nearest neighbors model on the `wine` dataset without standardizing the data.

In [20]:
# Split the dataset and labels into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Fit the k-nearest neighbors model to the training data
knn.fit(X_train, y_train)

# Score the model on the test data
print(knn.score(X_test, y_test))

0.7777777777777778


# KNN on scaled data

The accuracy score on the unscaled `wine` dataset was decent, but let's see what you can achieve by using standardization. Once again, the `knn` model, as well as the `X` and `y` feature and label sets, has already been created for you.

In [25]:
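# Drop the three columns, then re-attach the (still unscaled) subset at the end
wine_dropped = wine.drop(['Ash', 'Alcalinity of ash', 'Magnesium'], axis=1)
wine_new = pd.concat([wine_dropped, wine_subset], axis=1)

# Separate the features from the target labels
X = wine_new.drop('Type', axis=1)
y = wine_new['Type']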

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Instantiate a StandardScaler
scaler = StandardScaler()

# Fit the scaler on the training features only, then apply it to both sets
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Fit the k-nearest neighbors model to the training data
knn.fit(X_train_scaled, y_train)

# Score the model on the test data
print(knn.score(X_test_scaled, y_test))

0.9555555555555556
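
One way to keep the "fit on train, transform on test" rule from being broken by accident is to bundle the scaler and the model into a scikit-learn pipeline. The sketch below is an illustration under one assumption: it uses scikit-learn's bundled `load_wine` data as a stand-in for `dataset/wine_types.csv`, so its scores need not match the ones printed above.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled wine data stands in for dataset/wine_types.csv.
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Baseline: k-nearest neighbors on the raw features.
unscaled_knn = KNeighborsClassifier().fit(X_train, y_train)
unscaled_score = unscaled_knn.score(X_test, y_test)

# Pipeline: the scaler is fit on the training data only, then reused on the test data.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier())
scaled_knn.fit(X_train, y_train)
scaled_score = scaled_knn.score(X_test, y_test)

print(f"unscaled: {unscaled_score:.3f}")
print(f"scaled:   {scaled_score:.3f}")
```

Calling `score` on the pipeline applies the already-fitted scaler to `X_test` before predicting, so the test set never influences the scaling parameters.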
