# Unit 3 Feature Scaling

## Lesson Introduction

Hey there! Today, we're going to learn about **feature scaling**. You might be wondering what feature scaling is and why we should care. Simply put, feature scaling is like making sure all the ingredients in your recipe are measured in the same unit. Imagine trying to mix pounds of flour and teaspoons of salt without converting one to the other: it wouldn't make sense, right?

Our goal is to understand why feature scaling is crucial in machine learning and to learn how to do it using Python and a library called **Scikit-learn**.

-----

## What is Feature Scaling?

Feature scaling puts all your features on a comparable scale so that no single feature dominates a machine learning model simply because of its units. Without scaling, features with large values can dominate, leading to biased outcomes. For example, when predicting house prices, one feature might be in the thousands (like square footage) and another in single digits (like the number of rooms). Models that rely on distances or gradient-based optimization may then give the smaller-valued feature too little weight, even if it is just as informative.
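To see how a large-valued feature can dominate, here is a minimal sketch using Euclidean distance, which many distance-based models rely on. The three houses and their values are made up purely for illustration:

```python
import numpy as np

# Hypothetical houses described as [square footage, number of rooms]
house_a = np.array([2000.0, 3.0])
house_b = np.array([2010.0, 8.0])  # almost the same size, five more rooms
house_c = np.array([2100.0, 3.0])  # same number of rooms, slightly larger

# Euclidean distances on the raw, unscaled features
dist_ab = np.linalg.norm(house_a - house_b)
dist_ac = np.linalg.norm(house_a - house_c)

print(f"Distance A-B: {dist_ab:.2f}")
print(f"Distance A-C: {dist_ac:.2f}")
```

On the raw values, house B looks much closer to house A than house C does, even though B has more than twice as many rooms. The square-footage differences swamp the room counts.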

There are two common types:

  * **Standardization**: Transforms data to have a mean ($\mu$) of 0 and a standard deviation ($\sigma$) of 1.

      * Formula: $z = \frac{x - \mu}{\sigma}$, where $x$ is the original feature value, $\mu$ is the mean of the feature, and $\sigma$ is the standard deviation of the feature.

  * **Normalization**: Rescales data to range between 0 and 1.

      * Formula: $x' = \frac{x - \min(x)}{\max(x) - \min(x)}$, where $x$ is the original feature value, $\min(x)$ is the minimum value of the feature, and $\max(x)$ is the maximum value of the feature.

Today, we'll focus on both standardization using `StandardScaler` and normalization using `MinMaxScaler` from `Scikit-learn`.
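To make the two formulas concrete, here is a small sketch that applies them by hand with NumPy to a single made-up feature. It assumes the population standard deviation (dividing by $n$), which is NumPy's default and the convention `StandardScaler` follows:

```python
import numpy as np

# A single illustrative feature
x = np.array([1.0, 2.0, 3.0, 4.0])

# Standardization: z = (x - mu) / sigma
mu = x.mean()
sigma = x.std()  # NumPy defaults to the population standard deviation (ddof=0)
z = (x - mu) / sigma

# Normalization: x' = (x - min(x)) / (max(x) - min(x))
x_norm = (x - x.min()) / (x.max() - x.min())

print("Standardized:", z)
print("Normalized:  ", x_norm)
```

The standardized values are centered on 0, while the normalized values run from exactly 0 to exactly 1.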

-----

## Example of Feature Scaling with `StandardScaler`

Let's create a small sample dataset to see how feature scaling works.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Sample dataset
data = {'Feature1': [1, 2, 3, 4], 'Feature2': [10, 20, 30, 40]}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
```

Output:

```
Original DataFrame:
   Feature1  Feature2
0         1        10
1         2        20
2         3        30
3         4        40
```

Before scaling, `Feature1` ranges from 1 to 4, and `Feature2` ranges from 10 to 40. Let's scale this dataset using `StandardScaler`.

-----

## Applying Feature Scaling with `StandardScaler`

We’ll use the `StandardScaler` to perform the scaling. The `fit_transform` method will calculate the mean and standard deviation for scaling, and then apply the scaling to the data.

```python
# Feature scaling with StandardScaler
standard_scaler = StandardScaler()
scaled_data_standard = standard_scaler.fit_transform(df)
```
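As a rough illustration of what `fit_transform` does under the hood, the sketch below splits it into separate `fit` and `transform` calls and prints the parameters the scaler learned. The variable names `two_step` and `one_step` are only for this example:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'Feature1': [1, 2, 3, 4], 'Feature2': [10, 20, 30, 40]})

scaler = StandardScaler()
scaler.fit(df)  # Step 1: learn each feature's mean and standard deviation
print("Learned means:", scaler.mean_)
print("Learned standard deviations:", scaler.scale_)

two_step = scaler.transform(df)  # Step 2: apply (x - mean) / std to every value
one_step = StandardScaler().fit_transform(df)  # Both steps in a single call
print("Same result:", np.allclose(two_step, one_step))
```

Keeping the two steps separate is useful when you want to reuse the learned parameters on other data.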

`fit_transform` returns a NumPy array rather than a DataFrame, so let's convert the scaled data back to a DataFrame for better readability.

```python
# Convert the scaled data back to a DataFrame for better readability
scaled_df_standard = pd.DataFrame(scaled_data_standard, columns=df.columns)
print("Scaled DataFrame (StandardScaler):")
print(scaled_df_standard)
```

Output:

```
Scaled DataFrame (StandardScaler):
   Feature1  Feature2
0 -1.341641 -1.341641
1 -0.447214 -0.447214
2  0.447214  0.447214
3  1.341641  1.341641
```

-----

## Scaling Double-Check

Let's check that the data is scaled correctly by calculating the mean and standard deviation of both features. `StandardScaler` divides by the population standard deviation, while pandas' `.std()` defaults to the sample standard deviation (`ddof=1`), so we pass `ddof=0` to compare like with like:

```python
print("Mean of each feature after scaling (should be close to 0):")
print(scaled_df_standard.mean())
print("Standard deviation of each feature after scaling (should be close to 1):")
print(scaled_df_standard.std(ddof=0))
```

Here is the output:

```
Mean of each feature after scaling (should be close to 0):
Feature1    0.0
Feature2    0.0
dtype: float64

Standard deviation of each feature after scaling (should be close to 1):
Feature1    1.0
Feature2    1.0
dtype: float64
```

The mean of each feature in the scaled DataFrame is 0, and the population standard deviation is 1. With the default `ddof=1`, pandas would instead report about 1.1547 for this four-row dataset, which reflects a different formula rather than a scaling error. Putting both features on the same scale makes it easier for the machine learning model to treat them equally.
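The choice of `ddof` matters when checking the standard deviation. As a minimal sketch, the snippet below takes one standardized column from the four-row sample dataset (values copied by hand, rounded to six decimals) and computes its standard deviation both ways:

```python
import numpy as np

# One standardized feature from the four-row sample dataset
z = np.array([-1.341641, -0.447214, 0.447214, 1.341641])

population_std = z.std(ddof=0)  # divides by n, the convention StandardScaler uses
sample_std = z.std(ddof=1)      # divides by n - 1, the pandas .std() default

print(f"Population std (ddof=0): {population_std:.6f}")
print(f"Sample std (ddof=1):     {sample_std:.6f}")
```

The gap between the two shrinks as the number of rows grows, so it is mostly visible on tiny datasets like this one.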

-----

## Example of Feature Scaling with `MinMaxScaler`

Let's also apply feature scaling using `MinMaxScaler` to see how normalization works. The good news is that `MinMaxScaler` is used in exactly the same way as `StandardScaler`: you just swap the scaler class and the rest of the code stays the same!

```python
# Feature scaling with MinMaxScaler
minmax_scaler = MinMaxScaler()
scaled_data_minmax = minmax_scaler.fit_transform(df)
```
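For a peek at what `MinMaxScaler` learns during fitting, the sketch below prints the per-feature minimum and maximum it stores. It also shows the optional `feature_range` parameter, which is not needed in this lesson but lets you pick a target range other than the default 0 to 1. The name `wide_scaler` is just for this example:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({'Feature1': [1, 2, 3, 4], 'Feature2': [10, 20, 30, 40]})

scaler = MinMaxScaler()
scaler.fit(df)  # learn each feature's minimum and maximum
print("Learned minimums:", scaler.data_min_)
print("Learned maximums:", scaler.data_max_)

# Optional: rescale to a range other than the default (0, 1)
wide_scaler = MinMaxScaler(feature_range=(-1, 1))
scaled_wide = wide_scaler.fit_transform(df)
print(scaled_wide)
```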

Convert the normalized data back to a DataFrame for better readability and verify the range.

```python
# Convert the scaled data back to a DataFrame for better readability
scaled_df_minmax = pd.DataFrame(scaled_data_minmax, columns=df.columns)
print("Scaled DataFrame (MinMaxScaler):")
print(scaled_df_minmax)
```

Output:

```
Scaled DataFrame (MinMaxScaler):
   Feature1  Feature2
0  0.000000  0.000000
1  0.333333  0.333333
2  0.666667  0.666667
3  1.000000  1.000000
```

-----

## Scaling Double-Check

Let's validate the results:

```python
print("Minimum of each feature after scaling (should be 0):")
print(scaled_df_minmax.min())
print("Maximum of each feature after scaling (should be 1):")
print(scaled_df_minmax.max())
```

Output:

```
Minimum of each feature after scaling (should be 0):
Feature1    0.0
Feature2    0.0
dtype: float64

Maximum of each feature after scaling (should be 1):
Feature1    1.0
Feature2    1.0
dtype: float64
```

The minimum of each feature in the scaled DataFrame is 0, and the maximum is 1, ensuring that all data points fall within this range.

-----

## Lesson Summary

Great job! You learned what feature scaling is and why it is essential in machine learning. By scaling your features, you ensure that all of them contribute on a comparable scale to the model. You also got hands-on with Python, `StandardScaler`, and `MinMaxScaler` from `Scikit-learn` to both standardize and normalize a sample dataset.

Now it's time to move on to some practice exercises. You'll get the chance to apply what you learned and become even more confident in your ability to scale features effectively. Let's get started!

## Scaling Recipe Ingredients

In the given code, you can see how to scale ingredient measurements using both `StandardScaler` and `MinMaxScaler` from Scikit-learn. Scaling puts the two measurements on a comparable scale even though they use different units. Click Run to see the scaled measurements for Flour (cups) and Salt (tsp).


```python
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Sample recipe ingredient measurements (e.g., teaspoons, cups)
recipe_data = {'Flour (cups)': [4, 7, 12], 'Salt (tsp)': [0.5, 1.5, 2.5]}
df = pd.DataFrame(recipe_data)

# Standard scaling
standard_scaler = StandardScaler()
standard_scaled_data = standard_scaler.fit_transform(df)
standard_scaled_df = pd.DataFrame(standard_scaled_data, columns=df.columns)
print("Standard Scaled Recipe Ingredients:")
print(standard_scaled_df)

# Min-max scaling
minmax_scaler = MinMaxScaler()
minmax_scaled_data = minmax_scaler.fit_transform(df)
minmax_scaled_df = pd.DataFrame(minmax_scaled_data, columns=df.columns)
print("Min-Max Scaled Recipe Ingredients:")
print(minmax_scaled_df)

```

Output:

```
Standard Scaled Recipe Ingredients:
   Flour (cups)  Salt (tsp)
0     -1.111168   -1.224745
1     -0.202031    0.000000
2      1.313198    1.224745
Min-Max Scaled Recipe Ingredients:
   Flour (cups)  Salt (tsp)
0         0.000         0.0
1         0.375         0.5
2         1.000         1.0
```

## Standardize Ingredient Quantities

Space Explorer, let's change the normalization of the ingredient quantities to standardization. Replace the MinMaxScaler with StandardScaler to standardize the data. Once done, run the code to observe how the data transforms differently.

Let's code!

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Recipe ingredient quantities
recipe_data = {'Flour_g': [100, 200, 300, 400], 'Salt_g': [1, 2, 3, 4]}
df = pd.DataFrame(recipe_data)

# Normalize the ingredient quantities
minmax_scaler = MinMaxScaler()
normalized_data = minmax_scaler.fit_transform(df)

# Convert back to DataFrame and print
normalized_df = pd.DataFrame(normalized_data, columns=df.columns)
print(normalized_df)

```

To standardize the ingredient quantities instead of normalizing them, you need to replace `MinMaxScaler` with `StandardScaler`. Here's the updated code:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler # Changed from MinMaxScaler

# Recipe ingredient quantities
recipe_data = {'Flour_g': [100, 200, 300, 400], 'Salt_g': [1, 2, 3, 4]}
df = pd.DataFrame(recipe_data)

# Standardize the ingredient quantities # Changed from Normalize
standard_scaler = StandardScaler() # Changed from MinMaxScaler
standardized_data = standard_scaler.fit_transform(df) # Changed variable name

# Convert back to DataFrame and print
standardized_df = pd.DataFrame(standardized_data, columns=df.columns) # Changed variable name
print(standardized_df)

```

## Feature Scaling for Recipe Measurements

Hey there, Space Voyager!

You've been given some code to scale recipe measurements using MinMaxScaler. It looks like there's a small issue preventing it from working correctly. Can you identify and fix the problem? Run the code to see what's going wrong.

Good luck!


```python
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Sample recipe measurements (grams and teaspoons)
data = {'Ingredient1': [10, 20, 30, 40], 'Ingredient2': [1, 2, 3, 4]}
df = pd.DataFrame(data)

# Feature scaling with MinMaxScaler
minmax_scaler = MinMaxScaler()
scaled_data_minmax = minmax_scaler.fit(df)

# Convert the scaled data back to a DataFrame for better readability
scaled_df_minmax = pd.DataFrame(scaled_data_minmax, columns=df.columns)
print(scaled_df_minmax)

```

The issue in the provided code is that `minmax_scaler.fit(df)` only learns the scaling parameters (each feature's minimum and maximum) and returns the fitted scaler object, not the transformed data. Building a DataFrame from that scaler object then fails. To both fit the scaler and transform the data, you need to use `fit_transform(df)`.

Here's the corrected code:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Sample recipe measurements (grams and teaspoons)
data = {'Ingredient1': [10, 20, 30, 40], 'Ingredient2': [1, 2, 3, 4]}
df = pd.DataFrame(data)

# Feature scaling with MinMaxScaler
minmax_scaler = MinMaxScaler()
scaled_data_minmax = minmax_scaler.fit_transform(df) # Changed from .fit(df) to .fit_transform(df)

# Convert the scaled data back to a DataFrame for better readability
scaled_df_minmax = pd.DataFrame(scaled_data_minmax, columns=df.columns)
print(scaled_df_minmax)
```
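To see why the original version fails, the sketch below checks what `fit` actually returns and then calls `transform` separately. The variable names `fit_result` and `scaled` are only for this example:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({'Ingredient1': [10, 20, 30, 40], 'Ingredient2': [1, 2, 3, 4]})

scaler = MinMaxScaler()
fit_result = scaler.fit(df)
print(fit_result is scaler)  # fit returns the scaler itself, not scaled data

scaled = scaler.transform(df)  # transform returns the scaled values as an array
scaled_df = pd.DataFrame(scaled, columns=df.columns)
print(scaled_df)
```

Calling `fit` followed by `transform` gives the same result as `fit_transform`, so either form fixes the exercise.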

## Recipe Data Scaling

Alright, Space Voyager! We need to complete our recipe scaling. Your task is to fill in the missing pieces to standardize and normalize our recipe data.


```python
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Recipe dataset
recipes = {'Sugar': [5, 10, 15, 20], 'Flour': [500, 1000, 1500, 2000]}
df_recipes = pd.DataFrame(recipes)

# Standardize the dataset using StandardScaler
# TODO: Fill in the standardized_recipes using the fit_transform method
standardized_recipes = _____

# Normalize the dataset using MinMaxScaler
# TODO: Fill in the normalized_recipes using the fit_transform method
normalized_recipes = _____

# Print the standardized and normalized dataframes
print("Standardized Recipes:\n", pd.DataFrame(standardized_recipes, columns=df_recipes.columns))
print("Normalized Recipes:\n", pd.DataFrame(normalized_recipes, columns=df_recipes.columns))

```

Here's one way to complete the code:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Recipe dataset
recipes = {'Sugar': [5, 10, 15, 20], 'Flour': [500, 1000, 1500, 2000]}
df_recipes = pd.DataFrame(recipes)

# Standardize the dataset using StandardScaler
scaler_standard = StandardScaler()
standardized_recipes = scaler_standard.fit_transform(df_recipes)

# Normalize the dataset using MinMaxScaler
scaler_minmax = MinMaxScaler()
normalized_recipes = scaler_minmax.fit_transform(df_recipes)

# Print the standardized and normalized dataframes
print("Standardized Recipes:\n", pd.DataFrame(standardized_recipes, columns=df_recipes.columns))
print("Normalized Recipes:\n", pd.DataFrame(normalized_recipes, columns=df_recipes.columns))
```