# Lesson 2: Data Scaling Techniques


Hello! Today, we are diving into the world of data scaling techniques. Imagine you are playing a game where you need to fit different shapes into matching holes. If your shapes vary greatly in size, it can be challenging. Similarly, in data analysis and machine learning, features (or columns) in your dataset may have vastly different scales. This can affect the performance of your analysis or model.

Our goal for this lesson is to understand two key data scaling techniques: Standard Scaling and Min-Max Scaling. By the end of this lesson, you'll be able to apply these techniques to scale features in a dataset, making them easier to work with.

## Understanding Standard Scaling

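Standard Scaling is like leveling the playing field for your data. It transforms your data so it has a mean (average) of 0 and a standard deviation (how spread out the numbers are) of 1. This is especially useful when features are measured in different units or ranges, because it puts them on a comparable scale. Note that Standard Scaling only shifts and rescales the values; it does not change the shape of their distribution, so skewed data stays skewed.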

The formula for standard scaling is:

\[ z = \frac{X - \mu}{\sigma} \]

Where:

- \( X \) is the original value.
- \( \mu \) is the mean of the values.
- \( \sigma \) is the standard deviation of the values.

In simpler terms, you subtract the average value from each data point and then divide by how much your data varies from the average.
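To make the formula concrete, here is a minimal sketch on three made-up values (the numbers are an assumption chosen so the arithmetic is easy to check by hand). It uses pandas' default `std()`, which is the sample standard deviation.

```python
import pandas as pd

# Hypothetical toy values, chosen so the arithmetic is easy to follow
values = pd.Series([2.0, 4.0, 6.0])

mean = values.mean()  # (2 + 4 + 6) / 3 = 4.0
std = values.std()    # sample standard deviation (ddof=1) = 2.0

# Apply z = (X - mean) / std to every value at once
z = (values - mean) / std
print(z.tolist())     # [-1.0, 0.0, 1.0]
```

The value equal to the mean maps to 0, and values one standard deviation below or above the mean map to -1 and 1.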

### Applying Standard Scaling

Let's use the Titanic dataset to perform Standard Scaling on the age and fare columns.

```python
import pandas as pd
import seaborn as sns

# Load the Titanic dataset
titanic = sns.load_dataset('titanic')

# Calculate mean and standard deviation for 'age' and 'fare'
age_mean = titanic['age'].mean()
age_std = titanic['age'].std()
fare_mean = titanic['fare'].mean()
fare_std = titanic['fare'].std()

# Standard Scaling
titanic['age_standard'] = (titanic['age'] - age_mean) / age_std
titanic['fare_standard'] = (titanic['fare'] - fare_mean) / fare_std

print(titanic[['age', 'age_standard', 'fare', 'fare_standard']].head())
```

Output:

```
    age  age_standard      fare  fare_standard
0  22.0     -0.530005   7.2500      -0.502445
1  38.0      0.571499  71.2833       0.786845
2  26.0     -0.254046   7.9250      -0.488854
3  35.0      0.432593  53.1000       0.420731
4  35.0      0.432593   8.0500      -0.485866
```
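One detail worth knowing: pandas' `std()` computes the sample standard deviation (`ddof=1`), while some tools, such as scikit-learn's `StandardScaler`, use the population standard deviation (`ddof=0`). The sketch below, on made-up values, shows that the two choices give slightly different scaled results, so small differences between tools are expected.

```python
import pandas as pd

values = pd.Series([2.0, 4.0, 6.0, 8.0])  # hypothetical toy values

sample_std = values.std()            # pandas default: divides by n - 1
population_std = values.std(ddof=0)  # divides by n instead

z_sample = (values - values.mean()) / sample_std
z_population = (values - values.mean()) / population_std

print(sample_std, population_std)
print(z_sample.round(3).tolist())
print(z_population.round(3).tolist())
```

On a dataset as large as Titanic the difference is tiny, but on small samples it can be noticeable.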

## Understanding Min-Max Scaling

Min-Max Scaling adjusts the scale of your data to fit within a specific range, typically between 0 and 1. This is like resizing shapes to fit in a smaller box, making them easier to compare.

The formula for Min-Max Scaling is:

\[ X' = \frac{X - X_{\text{min}}}{X_{\text{max}} - X_{\text{min}}} \]

Where:

- \( X \) is the original value.
- \( X_{\text{min}} \) is the minimum value in the feature.
- \( X_{\text{max}} \) is the maximum value in the feature.

In simpler terms, you subtract the smallest value from each data point and then divide by the range (difference between the largest and smallest values).
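As a quick illustration of the formula, the following sketch scales three made-up values (an assumption for the example, not data from the Titanic dataset).

```python
import pandas as pd

# Hypothetical toy values
values = pd.Series([10.0, 20.0, 50.0])

v_min = values.min()  # 10.0
v_max = values.max()  # 50.0

# Apply X' = (X - min) / (max - min) to every value at once
scaled = (values - v_min) / (v_max - v_min)
print(scaled.tolist())  # [0.0, 0.25, 1.0]
```

The smallest value becomes 0, the largest becomes 1, and 20 lands a quarter of the way between them.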

### Applying Min-Max Scaling

Let's apply Min-Max Scaling to the age and fare columns in the Titanic dataset.

```python
# Calculate min and max for 'age' and 'fare'
age_min = titanic['age'].min()
age_max = titanic['age'].max()
fare_min = titanic['fare'].min()
fare_max = titanic['fare'].max()

# Min-Max Scaling
titanic['age_minmax'] = (titanic['age'] - age_min) / (age_max - age_min)
titanic['fare_minmax'] = (titanic['fare'] - fare_min) / (fare_max - fare_min)

print(titanic[['age', 'age_minmax', 'fare', 'fare_minmax']].head())
```

Output:

```
    age  age_minmax     fare  fare_minmax
0  22.0    0.271174   7.2500     0.014151
1  38.0    0.472229  71.2833     0.139136
2  26.0    0.321438   7.9250     0.015469
3  35.0    0.434531  53.1000     0.103644
4  35.0    0.434531   8.0500     0.015713
```
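Min-Max Scaling is not limited to the range 0 to 1. A minimal sketch of scaling to an arbitrary range follows; the helper name `min_max_to_range` and its parameters `low` and `high` are hypothetical, introduced only for this illustration.

```python
import pandas as pd

def min_max_to_range(values: pd.Series, low: float, high: float) -> pd.Series:
    """Hypothetical helper: rescale linearly so the minimum maps to `low`
    and the maximum maps to `high`."""
    # First scale to the 0-1 range, then stretch and shift to [low, high]
    unit = (values - values.min()) / (values.max() - values.min())
    return low + unit * (high - low)

fares = pd.Series([10.0, 20.0, 50.0])  # hypothetical toy values
scaled = min_max_to_range(fares, -1.0, 1.0)
print(scaled.tolist())  # [-1.0, -0.5, 1.0]
```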

## Lesson Summary

Great job! Today, you learned about the importance of data scaling and explored two common techniques: Standard Scaling and Min-Max Scaling. These techniques help bring features to a common scale, making them easier to analyze and to work with in machine learning models.

Now it's time for some hands-on practice. You'll apply Standard Scaling and Min-Max Scaling to different columns in a dataset using the CodeSignal IDE. This will solidify your understanding and give you practical experience in scaling data. Enjoy scaling your data!

## Standard Scaling for the Age Column

Wondering how to standardize a feature in the Titanic dataset? This task shows how passenger ages can be standardized so that they are easier to compare with other features.

```py
import pandas as pd
import seaborn as sns

# Load the Titanic dataset
titanic = sns.load_dataset('titanic')

# Standard Scaling for 'age' column
age_mean = titanic['age'].mean()
age_std = titanic['age'].std()
titanic['age_standard'] = (titanic['age'] - age_mean) / age_std

print(titanic[['age', 'age_standard']].head())

```

The code above applies Standard Scaling to the 'age' column of the Titanic dataset. This puts the ages on a standardized scale, which makes them easier to compare with other features and to use in statistical analyses or machine learning models whose performance depends on feature scale.

Here’s a breakdown of what each part of the script does:

1. **Import Libraries**:
   ```python
   import pandas as pd
   import seaborn as sns
   ```
   This imports Pandas, which is used for data manipulation, and Seaborn, which provides the Titanic dataset and is also useful for data visualization.

2. **Load Dataset**:
   ```python
   titanic = sns.load_dataset('titanic')
   ```
   This line loads the Titanic dataset from Seaborn's repository into a Pandas DataFrame called `titanic`. The dataset includes various details about the Titanic passengers such as age, sex, class, and survival status.

3. **Calculate Mean and Standard Deviation**:
   ```python
   age_mean = titanic['age'].mean()
   age_std = titanic['age'].std()
   ```
   These lines calculate the mean and standard deviation of the 'age' column. These values are necessary for the standardization formula, where data is scaled to have a mean of zero and a standard deviation of one.

4. **Standard Scaling**:
   ```python
   titanic['age_standard'] = (titanic['age'] - age_mean) / age_std
   ```
   Here, each age value is standardized by subtracting the mean of the age column and then dividing by its standard deviation. This transformation results in a new column `age_standard` where the scaled ages have a mean of 0 and a standard deviation of 1.

5. **Print Results**:
   ```python
   print(titanic[['age', 'age_standard']].head())
   ```
   This line prints the first five rows of the original 'age' column alongside the new 'age_standard' column to show the effect of the scaling.

The output from this script will help you see how the original age values have been transformed into standardized scores, which can be particularly useful when comparing measurements that have different units or different scales. Standardizing the data simplifies many computational tasks in data analysis and is a common preprocessing step for many machine learning algorithms.
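The `age` column in this dataset contains missing values. The sketch below uses a small made-up series with one missing entry (an assumption for illustration) to show how the calculation behaves: `mean()` and `std()` skip missing values by default, and the missing entry stays missing after scaling.

```python
import pandas as pd

# Hypothetical toy ages with one missing value
ages = pd.Series([22.0, 38.0, None, 35.0])

# mean() and std() ignore NaN by default (skipna=True)
age_standard = (ages - ages.mean()) / ages.std()

print(age_standard)
print(age_standard.isna().sum())        # the missing value is still missing
print(round(age_standard.std(), 10))    # spread of the non-missing values
```

If missing values need to be filled in or dropped, that is a separate step to do before or after scaling, depending on the analysis.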

## Apply Min-Max Scaling to 'fare' Column

Awesome, Space Explorer!

Now, let's apply what you’ve learned. Change the code to use Min-Max Scaling on the fare column instead of Standard Scaling on the age column. This will help adjust the fare values to a scale between 0 and 1.

Let's code!

```py
import pandas as pd
import seaborn as sns

# Load the Titanic dataset
titanic = sns.load_dataset('titanic')

# Standard Scaling for 'age' column
age_mean = titanic['age'].mean()
age_std = titanic['age'].std()
titanic['age_standard'] = (titanic['age'] - age_mean) / age_std

print(titanic[['age', 'age_standard']].head())


```

To modify the code for applying Min-Max Scaling to the 'fare' column of the Titanic dataset, we need to change the calculations to find the minimum and maximum values of the 'fare' column and then use these to scale the fare values to a range between 0 and 1. Here’s how you can adjust your script:

```python
import pandas as pd
import seaborn as sns

# Load the Titanic dataset
titanic = sns.load_dataset('titanic')

# Min-Max Scaling for 'fare' column
fare_min = titanic['fare'].min()
fare_max = titanic['fare'].max()
titanic['fare_minmax'] = (titanic['fare'] - fare_min) / (fare_max - fare_min)

print(titanic[['fare', 'fare_minmax']].head())
```

### Explanation of Changes:

1. **Remove Standard Scaling for 'age'**: Since the focus is now on the 'fare' column, the lines related to scaling the 'age' column have been removed.

2. **Calculate Min and Max for 'fare'**:
   - `fare_min = titanic['fare'].min()`: This line calculates the minimum fare in the dataset.
   - `fare_max = titanic['fare'].max()`: This line calculates the maximum fare in the dataset.

3. **Apply Min-Max Scaling**:
   - `titanic['fare_minmax'] = (titanic['fare'] - fare_min) / (fare_max - fare_min)`: This formula adjusts each fare value to a scale of 0 to 1. Subtracting `fare_min` shifts all values so that the lowest fare becomes 0. Dividing by `(fare_max - fare_min)` scales the highest fare to 1, with all other values falling proportionately between these two points.

4. **Print Results**:
   - `print(titanic[['fare', 'fare_minmax']].head())`: This prints the first five rows to show the original 'fare' values and their corresponding scaled values in the new 'fare_minmax' column.

This approach ensures that fare values are scaled down to a uniform range, making them easier to compare and use in further analyses or machine learning models where input value scaling might be crucial.
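The same steps can be wrapped in a reusable function. The sketch below is one possible version; the name `min_max_scale` is hypothetical, and returning zeros for a constant column is a design assumption made here to avoid dividing by zero, not a rule from the lesson.

```python
import pandas as pd

def min_max_scale(values: pd.Series) -> pd.Series:
    """Hypothetical helper: scale a numeric Series to the range 0 to 1."""
    value_range = values.max() - values.min()
    if value_range == 0:
        # All values are identical, so the ratio is undefined.
        # Assumption for this sketch: return 0.0 for every row.
        return pd.Series(0.0, index=values.index)
    return (values - values.min()) / value_range

print(min_max_scale(pd.Series([0.0, 5.0, 10.0])).tolist())  # [0.0, 0.5, 1.0]
print(min_max_scale(pd.Series([5.0, 5.0, 5.0])).tolist())   # [0.0, 0.0, 0.0]
```

With the Titanic data, this could be called as `min_max_scale(titanic['fare'])`.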

## Standard Scaling of Titanic 'Fare' Column

Hey there, Space Explorer!

Let's practice Standard Scaling on the Titanic dataset. Fill in the missing pieces of code to standardize the fare column.

Keep shining!

```py
import pandas as pd
import seaborn as sns

# Load the Titanic dataset
titanic = sns.load_dataset('titanic')

# Calculate mean and standard deviation for 'fare' column
fare_mean = titanic['fare'].mean()
fare_std = titanic['fare'].std()

# TODO: Apply Standard Scaling to 'fare' column
# Hint: Use the formula (X - mean) / std to create new 'fare_standard' column

print(titanic[['fare', 'fare_standard']].head())


```

To complete the TODO, apply Standard Scaling to the 'fare' column of the Titanic dataset. This involves subtracting the mean of the 'fare' column from each fare value and then dividing by the standard deviation. Here's the complete code with the missing line added to create the 'fare_standard' column:

```python
import pandas as pd
import seaborn as sns

# Load the Titanic dataset
titanic = sns.load_dataset('titanic')

# Calculate mean and standard deviation for 'fare' column
fare_mean = titanic['fare'].mean()
fare_std = titanic['fare'].std()

# Apply Standard Scaling to 'fare' column
titanic['fare_standard'] = (titanic['fare'] - fare_mean) / fare_std

print(titanic[['fare', 'fare_standard']].head())
```

### Explanation of the Added Code:
- **Standard Scaling for 'fare'**:
  - `titanic['fare_standard'] = (titanic['fare'] - fare_mean) / fare_std`: This line applies the standard scaling formula. It subtracts the mean fare (`fare_mean`) from each entry in the 'fare' column and divides the result by the standard deviation (`fare_std`). This transformation standardizes the 'fare' values, resulting in a new column `fare_standard` where the scaled values have a mean of 0 and a standard deviation of 1.

### Output:
- The `print` statement will display the first five rows of the original 'fare' values alongside their standardized values in the 'fare_standard' column. This allows you to see how the original fare values have been transformed to a standardized scale.

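This process is a common preprocessing step, especially for machine learning models that are sensitive to feature scale, such as distance-based or gradient-based methods, because it brings features onto a comparable scale. Standard Scaling does not make the data normally distributed; it only shifts and rescales the values.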
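One way to check what Standard Scaling does and does not change is to compare skewness before and after. The sketch below uses the five fare values shown in the lesson output, with the mean and standard deviation computed from those five values only (an assumption for brevity, not the full dataset).

```python
import pandas as pd

# First five fares from the lesson output
fares = pd.Series([7.25, 71.2833, 7.925, 53.1, 8.05])

# Standard Scaling using statistics of these five values only
fare_standard = (fares - fares.mean()) / fares.std()

# Skewness measures asymmetry; shifting and rescaling leaves it unchanged
print(fares.skew())
print(fare_standard.skew())
```

The two printed values match, which shows that the scaled column has the same distribution shape as the original, just centered at 0 with unit spread.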