## Our Unscaled Data

We'll start by creating a small DataFrame with two features on very different scales (alien weights in the tens, spaceship weights in the thousands) and a label.

In [None]:
import pandas as pd

alien_fuel_data = {"alien_weight":[80, 63, 70, 93, 67, 72, 88, 61], 
              "spaceship_weight":[2993, 3267, 4231, 3987, 2324, 4118, 5003, 2576], 
              "fuel":[523, 353, 489, 628, 411, 528, 339, 418]}

# practice what we learned last lesson: turn alien_fuel_data into a DataFrame
alien_fuel_df = pd.DataFrame(alien_fuel_data)
alien_fuel_df.head()

   alien_weight  spaceship_weight  fuel
0            80              2993   523
1            63              3267   353
2            70              4231   489
3            93              3987   628
4            67              2324   411


In [None]:
# practice more from last lesson: extract the features into X and the label into y
X = alien_fuel_df[["alien_weight", "spaceship_weight"]]  # the two features
y = alien_fuel_df["fuel"]  # the label

## Revisiting the `train_test_split`

Remember how we do this?

If not, no biggie. That's what [docs](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) are for.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training data: \n", X_train, "\n")
print("Test data: \n", X_test)

Training data: 
    alien_weight  spaceship_weight
0            80              2993
7            61              2576
2            70              4231
4            67              2324
3            93              3987
6            88              5003 

Test data: 
    alien_weight  spaceship_weight
1            63              3267
5            72              4118


## Scaling the training and test data

To understand how we use the training data's values to scale the testing data, we'll need to explore scikit-learn's `.fit()` and `.transform()` methods.

We first obtain the mean and standard deviation of each feature from the training data only.

In [None]:
from sklearn import preprocessing

scaler = preprocessing.StandardScaler().fit(X_train) # learns the standardization values from the training data only

print(scaler.mean_) # prints the mean of both columns of training data
print(scaler.scale_) # prints the SD of both columns of training data

[  76.5 3519. ]
[ 11.47097787 959.35516538]
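
As a quick check on what `.fit()` stores, the sketch below recomputes both statistics with NumPy. To stay runnable on its own, it retypes the six training rows from the split output above; `X_train_values` and `scaler_check` are illustrative names, not part of the lesson's code. One detail worth noticing: `scale_` matches the population standard deviation (`ddof=0`), not the sample standard deviation that pandas' `.std()` returns by default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# the six training rows from the split above: [alien_weight, spaceship_weight]
X_train_values = np.array([
    [80, 2993],
    [61, 2576],
    [70, 4231],
    [67, 2324],
    [93, 3987],
    [88, 5003],
], dtype=float)

# illustrative scaler, fit on the same rows as the lesson's scaler
scaler_check = StandardScaler().fit(X_train_values)

# per-column statistics computed by hand
manual_mean = X_train_values.mean(axis=0)
manual_sd = X_train_values.std(axis=0, ddof=0)  # population SD (divide by n)

print("mean:", manual_mean, "vs", scaler_check.mean_)
print("SD:  ", manual_sd, "vs", scaler_check.scale_)
```

If the two pairs of numbers agree, `.fit()` has done nothing more mysterious than record a mean and a standard deviation per column.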


### Transforming the training data

Then we use the values generated by `.fit()` to transform the training data.

In [None]:
X_train_scaled = scaler.transform(X_train) # transforms the training data based on the values generated from .fit()

print("Training data before scaling: \n", X_train, "\n")

# By default, transform() returns a NumPy array, so the column names (and the original row index) are lost.
# The column names are added back on the following line.
X_train_scaled_formatted = pd.DataFrame(X_train_scaled, columns=X_train.columns)
print("Training data after scaling: \n", X_train_scaled_formatted)

Training data before scaling: 
    alien_weight  spaceship_weight
0            80              2993
7            61              2576
2            70              4231
4            67              2324
3            93              3987
6            88              5003 

Training data after scaling: 
    alien_weight  spaceship_weight
0      0.305118         -0.548285
1     -1.351236         -0.982952
2     -0.566647          0.742165
3     -0.828177         -1.245628
4      1.438413          0.487828
5      1.002530          1.546872
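
To make the transformation itself less of a black box, here is a minimal sketch of the arithmetic behind `.transform()`: subtract the fitted mean and divide by the fitted standard deviation, column by column. As before, the training rows are retyped by hand and `X_train_values`, `scaler_check`, and `manual_scaled` are illustrative names only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# the six training rows from the split above: [alien_weight, spaceship_weight]
X_train_values = np.array([
    [80, 2993],
    [61, 2576],
    [70, 4231],
    [67, 2324],
    [93, 3987],
    [88, 5003],
], dtype=float)

scaler_check = StandardScaler().fit(X_train_values)

# z = (x - mean) / SD, applied to every value in each column
manual_scaled = (X_train_values - scaler_check.mean_) / scaler_check.scale_

# compare the hand-computed values with what transform() returns
print(np.allclose(manual_scaled, scaler_check.transform(X_train_values)))

# after scaling, each training column should have mean ~0 and SD ~1
print(manual_scaled.mean(axis=0).round(6))
print(manual_scaled.std(axis=0).round(6))
```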


### Transforming the testing set

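Then we use the same values from fitting the training data to transform the testing data. We do not call `.fit()` again on the test set. This prevents the data leakage that would occur if the test rows were allowed to influence the mean and standard deviation used for scaling.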

In [None]:
# the scaler object still has the same mean and standard deviation that it did on the training set
print(scaler.mean_)
print(scaler.scale_)

# that same transformation is applied to X_test
X_test_scaled = scaler.transform(X_test)

print("\nTest data before scaling: \n", X_test, "\n")
X_test_scaled_formatted = pd.DataFrame(X_test_scaled, columns=X_test.columns)
print("Test data after scaling: \n", X_test_scaled_formatted)

[  76.5 3519. ]
[ 11.47097787 959.35516538]

Test data before scaling: 
    alien_weight  spaceship_weight
1            63              3267
5            72              4118 

Test data after scaling: 
    alien_weight  spaceship_weight
0     -1.176883         -0.262676
1     -0.392294          0.624378
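
To see what leakage would look like here, the sketch below contrasts the approach used in this lesson (fit on the training rows only) with a scaler fit on all eight rows. The rows are retyped by hand from the outputs above, and `train_only_scaler` and `leaky_scaler` are hypothetical names chosen for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# training and test rows from the split above: [alien_weight, spaceship_weight]
X_train_values = np.array([
    [80, 2993],
    [61, 2576],
    [70, 4231],
    [67, 2324],
    [93, 3987],
    [88, 5003],
], dtype=float)
X_test_values = np.array([
    [63, 3267],
    [72, 4118],
], dtype=float)

# the lesson's approach: statistics come from the training rows only
train_only_scaler = StandardScaler().fit(X_train_values)

# the leaky approach: the test rows also shape the mean and SD
leaky_scaler = StandardScaler().fit(np.vstack([X_train_values, X_test_values]))

print("train-only mean:", train_only_scaler.mean_)
print("leaky mean:     ", leaky_scaler.mean_)

# the same test rows end up with different scaled values
print("test scaled (train-only):\n", train_only_scaler.transform(X_test_values))
print("test scaled (leaky):\n", leaky_scaler.transform(X_test_values))
```

The difference is small on eight rows, but the principle is the same at any size: the test set is supposed to stand in for data the model has never seen, so nothing computed from it should feed into preprocessing.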
