# Standardizing data for machine learning

### What is standardization?

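Standardization is a preprocessing step that transforms continuous numerical data so that features are comparable to one another. This section covers two approaches: log normalization and feature scaling.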

### Log normalization

Log normalization is a method for standardizing your data that can be useful when you have a particular column with high variance. It applies a log transformation to your values, which compresses large values and, for right-skewed data, often brings the distribution closer to normal, an assumption that a lot of machine learning models make. The method of log normalization we're going to work with in Python takes the natural log of each number, which is simply the exponent to which you would raise the mathematical constant _e_ (approximately equal to 2.718) to get that number.

Log normalization is a good strategy when you care about relative changes in a linear model, when you still want to capture the magnitude of change, and when all of your values are positive (the natural log is undefined for zero and negative numbers). It's a nice way to reduce the variance of a column and make it comparable to other columns for modeling.
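To make the "relative changes" point concrete, here is a minimal sketch using only NumPy. The numbers are made up for illustration; the idea is that equal ratios become equal differences once you take the log.

```python
import numpy as np

# The natural log is the inverse of raising e to a power.
x = 100.0
log_x = np.log(x)
print(log_x)          # about 4.605
print(np.exp(log_x))  # recovers 100.0, up to floating-point error

# Equal ratios map to equal differences on the log scale:
# doubling 10 -> 20 and doubling 100 -> 200 both add log(2).
small_step = np.log(20.0) - np.log(10.0)
large_step = np.log(200.0) - np.log(100.0)
print(small_step, large_step)

# Inputs between 0 and 1 produce negative logs.
log_of_half = np.log(0.5)
print(log_of_half)
```

The last line is worth keeping in mind: the input must be positive, but the output is only positive for inputs greater than 1.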

Let's see how this works in Python. Here we have a small dataset where one of the columns, `col2`, has high variance. We can check the variance across a set of data using the `.var()` method:

In [1]:
import pandas as pd

var_df = pd.DataFrame({"col1": [1.0, 1.2, 0.75, 1.6], 
                       "col2": [3.0, 45.5, 28.0, 100.0]})

var_df.var()

col1       0.128958
col2    1691.729167
dtype: float64

Applying log normalization to data in Python is fairly straightforward. We can use the `np.log()` function from NumPy to log normalize `col2`.

In [2]:
import numpy as np

var_df["col2_log"] = np.log(var_df["col2"])
var_df

   col1   col2  col2_log
0  1.00    3.0  1.098612
1  1.20   45.5  3.817712
2  0.75   28.0  3.332205
3  1.60  100.0  4.605170


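If we check the variance of both `col1` and `col2_log`, we can see that the variances are much closer together now. Note that `np.var()` computes the population variance (dividing by n), whereas the pandas `.var()` method used earlier computes the sample variance (dividing by n - 1), which is why the value for `col1` differs slightly from the one above: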

In [3]:
np.var(var_df[["col1", "col2_log"]])

col1        0.096719
col2_log    1.697165
dtype: float64
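The two variance conventions can be compared side by side. This is a small sketch that reuses the `col1` values from above; the variable names are only for illustration.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, 1.2, 0.75, 1.6])  # same numbers as col1

# pandas defaults to the sample variance (ddof=1, divide by n - 1).
sample_var = values.var()

# NumPy defaults to the population variance (ddof=0, divide by n).
population_var = np.var(values.to_numpy())

print(sample_var)      # about 0.128958
print(population_var)  # about 0.096719

# Passing ddof explicitly makes the two libraries agree.
print(np.isclose(values.var(ddof=0), population_var))
```

Whichever convention you pick, it's worth using the same one throughout when comparing columns.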

### Scaling for feature comparison

Scaling is a method of standardization that's most useful when you're working with a dataset that contains continuous features on different scales, and you're using a model that compares features through distances or linear combinations (like k-nearest neighbors or linear regression). Feature scaling transforms the features in your dataset so they have a mean of zero and a variance of one, which makes it easier to compare features directly. Many models in scikit-learn assume, or perform better with, features on comparable scales.

Let's take a look at another dataframe.

In [4]:
scale_dict = {"col1": [1.0, 1.2, 0.75, 1.6],
              "col2": [48.0, 45.5, 46.2, 50.0],
              "col3": [100.0, 101.3, 103.5, 104.1]}

scale_df = pd.DataFrame(scale_dict)

scale_df

   col1  col2   col3
0  1.00  48.0  100.0
1  1.20  45.5  101.3
2  0.75  46.2  103.5
3  1.60  50.0  104.1


In each column, the numbers are relatively close to one another, but the columns themselves sit on very different scales. If we look at the variance, it's relatively low within each column:

In [5]:
scale_df.var()

col1    0.128958
col2    4.055833
col3    3.649167
dtype: float64

To better model this data, scaling would be a good choice here.

Scikit-learn has a variety of scaling methods, but we're only going to focus on the `StandardScaler` class. It works by removing the mean and scaling each feature to have unit variance. There's a simpler `scale()` function in scikit-learn, but the benefit of using `StandardScaler` is that it stores the mean and standard deviation it learned, so you can apply the same transformation to other data, like a test set or new data from the same source, without having to refit everything.
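As a rough sketch of what "apply the same transformation on other data" means in practice, the example below fits a scaler on a small training array and then reuses it on new rows. The arrays `train` and `new_data` are hypothetical and only there for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training data and new, unseen rows (one feature each).
train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
new_data = np.array([[3.0], [6.0]])

scaler = StandardScaler()
scaler.fit(train)  # learns the mean and standard deviation from train only

print(scaler.mean_)   # mean of the training column: 3.0
print(scaler.scale_)  # population standard deviation: sqrt(2)

# The new rows are scaled with the training statistics, not their own.
new_scaled = scaler.transform(new_data)
print(new_scaled)
```

Because the statistics come from the training data, the value 3.0 maps to 0 in `new_scaled` even though it is not the mean of `new_data`.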

In [6]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

Once we have a `StandardScaler` instance, we can call its `.fit_transform()` method on the DataFrame. The method returns a NumPy array, so here we're wrapping the result back into a DataFrame to see the results more easily.

In [7]:
df_scaled = pd.DataFrame(scaler.fit_transform(scale_df), columns=scale_df.columns)

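Taking a look at the DataFrame, we can see that it has been scaled and that the variance is now equal across columns. The value is 1.333 rather than exactly 1 because `StandardScaler` divides by the population standard deviation, while the pandas `.var()` method computes the sample variance; with four rows, the ratio between the two is 4/3: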

In [8]:
print(df_scaled)
print("\n")
print(df_scaled.var())

       col1      col2      col3
0 -0.442127  0.329683 -1.344939
1  0.200967 -1.103723 -0.559132
2 -1.245995 -0.702369  0.770695
3  1.487156  1.476409  1.133375


col1    1.333333
col2    1.333333
col3    1.333333
dtype: float64
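To see what the scaler is doing under the hood, here is a sketch that reproduces the result by hand with the z-score formula, z = (x - mean) / std, using the population standard deviation. It rebuilds `scale_df` so the block runs on its own; `manual` and `sk_result` are names chosen for this example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

scale_df = pd.DataFrame({"col1": [1.0, 1.2, 0.75, 1.6],
                         "col2": [48.0, 45.5, 46.2, 50.0],
                         "col3": [100.0, 101.3, 103.5, 104.1]})

# Subtract each column's mean and divide by its population std (ddof=0).
manual = (scale_df - scale_df.mean()) / scale_df.std(ddof=0)

sk_result = StandardScaler().fit_transform(scale_df)

print(np.allclose(manual.to_numpy(), sk_result))  # the two approaches agree
print(manual.var(ddof=0))  # population variance: 1.0 per column
print(manual.var())        # sample variance: 4/3 per column
```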


### Standardization and modeling

Let's try out two situations where you'd want to use these standardization techniques on a dataset, beginning with log normalization. We'll then do some modeling with non-standardized and standardized data, to compare the results.

Let's take a look at the feature variance in the `wine_types` dataset:

In [9]:
dir_string = "../../datasets/"
wine_types = pd.read_csv(dir_string + "wine_types.csv")

wine_types.var()

Type                                0.600679
Alcohol                             0.659062
Malic acid                          1.248015
Ash                                 0.075265
Alcalinity of ash                  11.152686
Magnesium                         203.989335
Total phenols                       0.391690
Flavanoids                          0.997719
Nonflavanoid phenols                0.015489
Proanthocyanins                     0.327595
Color intensity                     5.374449
Hue                                 0.052245
OD280/OD315 of diluted wines        0.504086
Proline                         99166.717355
dtype: float64

We can see that the `Proline` column has much higher variance than anything else in the dataset, which makes it a good candidate for log normalization. Let's apply the same `np.log()` function we used before:

In [10]:
wine_types["Proline_log"] = np.log(wine_types["Proline"])

wine_types["Proline_log"].var()

0.17231366191842018

We can see the variance has been reduced, which will make feature comparisons easier.

Finally, let's apply scaling to the `wine_types` dataset as well. Let's say that we want to use the `Ash`, `Alcalinity of ash`, and `Magnesium` columns to train a linear model, but it's possible that these columns are all measured in different ways. 

Taking a quick look at these columns with the `.describe()` method:

In [11]:
wine_types[["Ash", "Alcalinity of ash", "Magnesium"]].describe()

              Ash  Alcalinity of ash   Magnesium
count  178.000000         178.000000  178.000000
mean     2.366517          19.494944   99.741573
std      0.274344           3.339564   14.282484
min      1.360000          10.600000   70.000000
25%      2.210000          17.200000   88.000000
50%      2.360000          19.500000   98.000000
75%      2.557500          21.500000  107.000000
max      3.230000          30.000000  162.000000


We can see that these three columns are on different scales: the means, mins, and maxes are wildly different. Let's standardize them in a way that allows for use in a linear model.

First we'll create a new `StandardScaler` instance:

In [12]:
wine_scaler = StandardScaler()

Then we'll apply it to the subset of the `wine_types` dataset:

In [13]:
wine_subset = wine_types[["Ash", "Alcalinity of ash", "Magnesium"]]

wine_subset_scaled = wine_scaler.fit_transform(wine_subset)

And finally, let's take a look at the scaled data and its variance by transforming it back into a DataFrame:

In [14]:
wine_subset_scaled_df = pd.DataFrame(wine_subset_scaled, columns=wine_subset.columns) 

print(wine_subset_scaled_df.head())
print("\n")
print(wine_subset_scaled_df.var())

        Ash  Alcalinity of ash  Magnesium
0  0.232053          -1.169593   1.913905
1 -0.827996          -2.490847   0.018145
2  1.109334          -0.268738   0.088358
3  0.487926          -0.809251   0.930918
4  1.840403           0.451946   1.281985


Ash                  1.00565
Alcalinity of ash    1.00565
Magnesium            1.00565
dtype: float64


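The data has been scaled and the variance is equal across columns. The value is 1.00565 rather than exactly 1 because `StandardScaler` uses the population standard deviation while the pandas `.var()` method computes the sample variance; with 178 rows, the ratio is 178/177, or about 1.00565.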

### scikit-learn refresher

Let's do a quick refresher on k-nearest neighbors and scikit-learn in general, which we'll be using throughout the workshop. 

K-nearest neighbors is a model that classifies data based on its distance to training set data. A new data point is assigned a label based on the class that the majority of surrounding data points belongs to, using a distance metric like the Euclidean distance metric. 
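The idea can be sketched in a few lines of NumPy. This is a simplified illustration, not how scikit-learn implements it; the function `knn_predict` and the toy arrays are hypothetical names made up for this example.

```python
from collections import Counter

import numpy as np


def knn_predict(train_X, train_y, point, k=3):
    """Label `point` by majority vote among its k nearest training rows."""
    # Euclidean distance from the new point to every training row.
    distances = np.sqrt(((train_X - point) ** 2).sum(axis=1))
    # Indices of the k closest training rows.
    nearest = np.argsort(distances)[:k]
    # The most common label among those neighbors wins.
    votes = Counter(train_y[nearest].tolist())
    return votes.most_common(1)[0][0]


# Two well-separated toy clusters, labeled 0 and 1.
toy_X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
toy_y = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(toy_X, toy_y, np.array([0.5, 0.5])))  # 0
print(knn_predict(toy_X, toy_y, np.array([5.5, 5.5])))  # 1
```

Because the prediction depends directly on distances, a feature with a much larger numeric range can dominate the vote, which is why scaling matters for this model.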

Let's walk through an example of k-nearest neighbors and the scikit-learn workflow. First, let's generate a toy dataset using scikit-learn's `make_classification()` function, which generates a normally-distributed dataset you can use for classification problems. There are a number of ways you can customize the generated dataset, but we're going to keep it simple here.

The parameters filled in below are:
- `n_samples`: the number of samples (rows) in the dataset. The default is 100; here it's increased to 1000.
- `n_features`: the number of features (columns) in the dataset. Instead of the default 20, let's keep it very simple at 3 features.
- `n_classes`: the number of class labels. We'll keep it at the default 2. 
- `n_redundant`: the number of redundant features, which are generated as random linear combinations of the informative features. They're useful if you want to test out dimensionality reduction methods, for example. We're not doing that here, so we'll set it to 0.

The function generates both the X and y sets, so we'll store those sets in the `X_gen` and `y_gen` variables.

In [15]:
from sklearn.datasets import make_classification

X_gen, y_gen = make_classification(n_samples=1000, n_features=3, n_classes=2, n_redundant=0)

Now that we have the data, let's build our model. We'll need both `train_test_split`, to split up our dataset before training, and `KNeighborsClassifier()`, to train the k-nearest neighbors model.

In [16]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

Let's split up our data into training and test sets:

In [17]:
X_train, X_test, y_train, y_test = train_test_split(X_gen, y_gen)
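By default, `train_test_split()` holds out 25% of the rows for the test set and shuffles the data randomly on every call. The sketch below uses stand-in arrays (`demo_X` and `demo_y` are made up for illustration) to show the resulting shapes and how the optional `random_state` parameter makes a split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same shape as the generated set: 1000 rows, 3 features.
demo_X = np.arange(3000).reshape(1000, 3)
demo_y = np.arange(1000) % 2

# With no test_size given, 25% of the rows go to the test set.
X_tr, X_te, y_tr, y_te = train_test_split(demo_X, demo_y, random_state=0)
print(X_tr.shape, X_te.shape)  # (750, 3) (250, 3)

# Reusing the same random_state reproduces exactly the same split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(demo_X, demo_y, random_state=0)
print(np.array_equal(X_te, X_te2))  # True
```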

Once we've done that, we can create our `KNeighborsClassifier()` model, and then use `.fit()` to fit the model to our training data.

In [18]:
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

KNeighborsClassifier()

The last step we want to take is to check the accuracy of our classifier. There are a multitude of ways you can validate the accuracy of your model, but we'll just use `KNeighborsClassifier`'s built-in `.score()` method, which takes your `X` test data, predicts its `y` labels, and then compares the predicted `y` to the true values of `y` you pass in as the second parameter. The output is the fraction of samples in the test `X` that received a correct prediction.

In [19]:
knn.score(X_test, y_test)

0.964
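To unpack what `.score()` does for a classifier, the sketch below computes the same number three ways: by hand, with `accuracy_score()`, and with `.score()`. It regenerates its own data with a fixed `random_state` (an addition for repeatability), so the value it prints will not necessarily match the 0.964 above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Same settings as the walkthrough, plus random_state for repeatability.
X_demo, y_demo = make_classification(n_samples=1000, n_features=3,
                                     n_classes=2, n_redundant=0,
                                     random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

model = KNeighborsClassifier().fit(X_tr, y_tr)
preds = model.predict(X_te)

# Accuracy is the fraction of predictions that match the true labels.
manual_acc = (preds == y_te).mean()
metric_acc = accuracy_score(y_te, preds)
score_acc = model.score(X_te, y_te)

print(manual_acc, metric_acc, score_acc)  # all three values are equal
```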

Now that we've refreshed ourselves on the basics of k-nearest neighbors and scikit-learn, let's move on to applying standardization in a real modeling situation.

### Wine_types modeling: non-scaled data

Let's first take a look at the accuracy of a k-nearest neighbors model on the `wine_types` dataset without standardizing the data. We're going to use the features in this dataset to predict the `Type` of wine, of which there are three labels: 1, 2, and 3.

Let's break out `Type` into its own `y_wine` dataset as our predicted class, and we'll keep the rest of the features in `X_wine`.

In [20]:
X_wine = wine_types.drop("Type", axis=1)
y_wine = wine_types["Type"]

We've already imported `train_test_split`, so let's split up our data:

In [21]:
X_train, X_test, y_train, y_test = train_test_split(X_wine, y_wine)

Now we can create a new `KNeighborsClassifier()` model:

In [22]:
knn_wine_unscaled = KNeighborsClassifier()

And then we can fit the model to our training set and then take a look at its accuracy:

In [23]:
knn_wine_unscaled.fit(X_train, y_train)

knn_wine_unscaled.score(X_test, y_test)

0.6444444444444445

The accuracy isn't terrible, but it's possible that we can improve our model by scaling our data!

### Wine_types modeling: scaled data

The accuracy score on the unscaled `wine_types` dataset was decent, but we can likely do better if we scale the dataset. The process is mostly the same as the previous exercise, with the added step of scaling the data.

Let's apply `StandardScaler` to our `wine_types` feature set:

In [24]:
wine_scaler = StandardScaler()

X_scaled = wine_scaler.fit_transform(X_wine)

Once again, we'll split up the data:

In [25]:
X_train_scaled, X_test_scaled, y_train_scaled, y_test_scaled = train_test_split(X_scaled, y_wine)

Then fit the `KNeighborsClassifier` model to the data and take a look at the accuracy score:

In [26]:
knn_wine_scaled = KNeighborsClassifier()
knn_wine_scaled.fit(X_train_scaled, y_train_scaled)

knn_wine_scaled.score(X_test_scaled, y_test_scaled)

0.9777777777777777

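We can see that there's been a dramatic increase in the accuracy of our model, simply by scaling our dataset. Keep in mind that `train_test_split()` shuffles the data randomly, so the exact scores will vary from run to run.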
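One possible variation on this comparison is sketched below. It makes two changes that are assumptions on our part rather than part of the workflow above: both models are scored on the same split (via a fixed `random_state`), and the scaler sits inside a pipeline so it is fit on the training rows only. Since `wine_types.csv` isn't bundled with scikit-learn, the sketch uses the built-in `load_wine()` dataset, which has the same 178 rows and 13 features but labels the classes 0, 1, and 2.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Built-in stand-in for wine_types.csv (an assumption for this sketch).
X_wine_demo, y_wine_demo = load_wine(return_X_y=True)

# One split, shared by both models, so the scores are directly comparable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_wine_demo, y_wine_demo, random_state=42, stratify=y_wine_demo)

# Baseline: k-nearest neighbors on the raw features.
unscaled_model = KNeighborsClassifier().fit(X_tr, y_tr)

# Pipeline: the scaler learns its statistics from the training rows only,
# then the same transformation is applied to the test rows when scoring.
scaled_model = make_pipeline(StandardScaler(), KNeighborsClassifier())
scaled_model.fit(X_tr, y_tr)

unscaled_acc = unscaled_model.score(X_te, y_te)
scaled_acc = scaled_model.score(X_te, y_te)
print(unscaled_acc, scaled_acc)
```

Fitting the scaler inside the pipeline keeps information from the test rows out of the training step, which gives a more honest estimate of how the model would do on new data.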