# Lesson 2

## Topic Overview

In today's lesson, you'll learn how to standardize financial data using the `StandardScaler` from the `sklearn` library. Scaling puts features on a comparable scale so that no single feature dominates simply because of its units, which improves the performance and robustness of many machine learning models.

**Lesson Goal:** By the end of this lesson, you will be able to effectively scale financial features and understand the importance of this step in preparing data for machine learning.

## Revision: Loading and Preprocessing the Dataset

Let's quickly recall how to load and preprocess the Tesla stock dataset:

```python
import pandas as pd
import datasets

# Load the dataset
data = datasets.load_dataset('codesignal/tsla-historic-prices')
tesla_df = pd.DataFrame(data['train'])

# Feature Engineering: creating new features
tesla_df['High-Low'] = tesla_df['High'] - tesla_df['Low']
tesla_df['Price-Open'] = tesla_df['Close'] - tesla_df['Open']
```

We've successfully loaded the Tesla dataset and created new features: `High-Low` and `Price-Open`.

## Introduction to Feature Scaling

Feature scaling is crucial for machine learning for several reasons:

- **Equal Contribution:** Ensures all features contribute equally to the model.
- **Improved Convergence:** Helps in faster convergence during model training by making gradients less sensitive to feature magnitude.
- **Prevent Dominance:** Prevents features with larger scales from dominating those with smaller scales.

Feature scaling is particularly useful in scenarios like:

- **Predicting House Prices:** Square footage in thousands vs. the number of bedrooms in single digits.
- **Stock Market Analysis:** Stock price in hundreds vs. trading volume in millions.
- **Health Data:** Age in the 0-100 range vs. blood pressure in the hundreds.
- **Retail Sales Prediction:** Number of items sold vs. store rating in single digits.

These examples highlight the importance of scaling to ensure uniform treatment of features, thereby enhancing model performance.
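
To make the dominance problem concrete, here is a minimal sketch using made-up house data (the numbers and the `distance` helper are illustrative assumptions, not part of the lesson's dataset). It compares Euclidean distances before and after standardizing each column by hand:

```python
import numpy as np

# Hypothetical houses: [square footage, number of bedrooms]
houses = np.array([
    [2000.0, 2.0],  # A
    [2010.0, 5.0],  # B: about the same size as A, three more bedrooms
    [2100.0, 2.0],  # C: same bedrooms as A, 100 sq ft larger
    [1500.0, 3.0],  # D
])


def distance(p, q):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(p - q))


# Raw distances: square footage dominates because its numbers are larger
raw_ab = distance(houses[0], houses[1])
raw_ac = distance(houses[0], houses[2])

# Standardize each column: subtract its mean, divide by its standard deviation
scaled = (houses - houses.mean(axis=0)) / houses.std(axis=0)
scaled_ab = distance(scaled[0], scaled[1])
scaled_ac = distance(scaled[0], scaled[2])

print(f"Raw:    A-B = {raw_ab:.2f}, A-C = {raw_ac:.2f}")
print(f"Scaled: A-B = {scaled_ab:.2f}, A-C = {scaled_ac:.2f}")
```

On the raw data, house A looks closer to B than to C, because a 100 sq ft gap outweighs a three-bedroom gap. After standardization the ordering flips, since the bedroom difference is large relative to how much bedrooms vary in this small sample.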

## Defining Standardization

Standardization involves transforming your data so that the mean of each feature is 0 and the standard deviation is 1. This process ensures all features are on the same scale, improving the performance and robustness of machine learning models. The formula for standardization is:

\[
z = \frac{x - \mu}{\sigma}
\]

Where:

- \( z \) is the standardized value,
- \( x \) is the original value,
- \( \mu \) is the mean of the feature, calculated as the average of all values of that feature,
- \( \sigma \) is the standard deviation of the feature, which measures the amount of variation or dispersion of the values.

By applying this formula, each feature will have a mean of 0 and a standard deviation of 1, enabling more stable and faster convergence during the training of machine learning models.
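
As a small worked example of the formula, the sketch below standardizes a short list of made-up numbers using only Python's standard library. It uses the population standard deviation (`statistics.pstdev`), which matches the convention `StandardScaler` uses:

```python
import statistics

# Made-up values chosen so the arithmetic is easy to follow
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mu = statistics.fmean(values)      # mean: 5.0
sigma = statistics.pstdev(values)  # population standard deviation: 2.0

# Apply z = (x - mu) / sigma to every value
z_scores = [(x - mu) / sigma for x in values]

print("z-scores:", z_scores)
print("mean of z-scores:", statistics.fmean(z_scores))
print("std of z-scores:", statistics.pstdev(z_scores))
```

The value 9.0 becomes 2.0 because it sits two standard deviations above the mean, and the standardized list has mean 0 and standard deviation 1.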

## Implementing StandardScaler on Financial Data

Let's proceed to scale our features using `StandardScaler` from `sklearn`. The `StandardScaler` standardizes features by removing the mean and scaling to unit variance.

First, we define our features:

```python
from sklearn.preprocessing import StandardScaler

# Defining features
features = tesla_df[['High-Low', 'Price-Open', 'Volume']].values
```

Now, let's initialize the scaler and apply it to our features:

```python
# Scaling
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)
```

Here, `fit_transform` first computes each feature's mean and standard deviation from the data (the fit step) and then returns the standardized array (the transform step).
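
To see the two steps separately, the following sketch runs `fit` and `transform` one after the other on a tiny synthetic array (the array `X` is an assumption for illustration, not the Tesla data) and compares the result with `fit_transform`. After fitting, the learned statistics are available as the `mean_` and `scale_` attributes:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tiny synthetic array with two features on very different scales
X = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])

# One step: fit and transform together
one_step = StandardScaler().fit_transform(X)

# Two steps: learn the statistics, then apply them
scaler = StandardScaler()
scaler.fit(X)
two_step = scaler.transform(X)

print("Learned means:", scaler.mean_)
print("Learned standard deviations:", scaler.scale_)
print("Same result:", np.allclose(one_step, two_step))
```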

## Inspecting Scaled Features

It's good practice to inspect the scaled features to confirm they have been standardized as expected. Let's display the first few rows of the scaled features:

```python
# Displaying the first few scaled features
print("Scaled features (first 5 rows):\n", features_scaled[:5])
```

The output of the above code will be:

```
Scaled features (first 5 rows):
 [[-0.48165383  0.08560547  2.29693712]
 [-0.48579183 -0.02912844  2.00292929]
 [-0.50368231 -0.04721815  0.33325453]
 [-0.51901702 -0.0599476  -0.23997882]
 [-0.52169457 -0.06145506  0.08156432]]
```

This output shows the standardized values for the first five rows. Each value is now expressed in standard deviations from its column's mean; for example, the `Volume` value in the first row is about 2.3 standard deviations above average. Five rows alone cannot confirm that each column has a mean of 0 and a standard deviation of 1, so the next section checks this over the full array.

## Validating Scaled Features

After scaling your features, it's important to check the mean and standard deviation to ensure they are correctly standardized. You can do this using the following code:

```python
# Checking mean values and standard deviations of scaled features
scaled_means = features_scaled.mean(axis=0)
scaled_stds = features_scaled.std(axis=0)

print("\nMean values of scaled features:", scaled_means)
print("Standard deviations of scaled features:", scaled_stds)
```

The output will show that the means are close to 0 and the standard deviations are close to 1:

```
Mean values of scaled features: [ 3.39667875e-17  5.57267607e-18 -6.79335750e-17]
Standard deviations of scaled features: [1. 1. 1.]
```

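The means are on the order of 1e-17, which is zero up to floating-point rounding error, and every standard deviation is 1. This confirms that your features have been successfully standardized.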
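
One detail worth knowing when you run this check: `StandardScaler` divides by the population standard deviation (`ddof=0`), which is also NumPy's default, while pandas' `.std()` defaults to the sample standard deviation (`ddof=1`). The sketch below, on a tiny synthetic column, shows why a pandas-based check can report values slightly above 1 even though the scaling is correct:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Tiny synthetic column; the effect shrinks as the number of rows grows
X = np.array([[1.0], [2.0], [3.0], [4.0]])
scaled = StandardScaler().fit_transform(X)

population_std = scaled.std(axis=0)              # NumPy default: ddof=0
sample_std = pd.DataFrame(scaled).std().values   # pandas default: ddof=1

print("NumPy std (ddof=0): ", population_std)
print("pandas std (ddof=1):", sample_std)
```

With only four rows the gap is visible; on a dataset with thousands of rows the two numbers agree to several decimal places.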

## Lesson Summary

In this lesson, we revisited loading and preprocessing the Tesla stock dataset, discussed the importance of scaling features, and used `StandardScaler` to standardize our financial data features. By checking the means and standard deviations of the scaled features, we confirmed they were correctly standardized.

Experiment with scaling other features in the dataset to understand their impact further. This practice will reinforce your understanding and skill in data preprocessing, which is vital for building effective and reliable machine-learning models. Happy coding!

## Scaling a Single Feature with StandardScaler

Let's focus on scaling only the `Volume` feature and adding it as a new column `Volume_Scaled` in our dataset. Here's how you can modify the code:

```python
import pandas as pd
import datasets
from sklearn.preprocessing import StandardScaler

# Load the dataset
data = datasets.load_dataset('codesignal/tsla-historic-prices')
tesla_df = pd.DataFrame(data['train'])

# Feature Engineering: creating new features
tesla_df['High-Low'] = tesla_df['High'] - tesla_df['Low']
tesla_df['Price-Open'] = tesla_df['Close'] - tesla_df['Open']

# Scaling only the Volume feature
scaler = StandardScaler()
tesla_df['Volume_Scaled'] = scaler.fit_transform(tesla_df[['Volume']])

# Displaying the first few rows of the new Volume_Scaled feature
print(tesla_df[['Volume', 'Volume_Scaled']].head())
```

### Explanation:
1. **Scaling the Volume Feature:** 
   - `tesla_df['Volume_Scaled'] = scaler.fit_transform(tesla_df[['Volume']])`: This line scales the `Volume` column using `StandardScaler` and adds the scaled values as a new column called `Volume_Scaled`.

2. **Display the Scaled Feature:**
   - The `print` statement shows the original `Volume` and the new `Volume_Scaled` features for the first five rows, allowing you to compare the unscaled and scaled data.

This modification helps you focus on feature scaling for a single column while integrating the scaled feature back into the original DataFrame. 🚀
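
The double brackets in `tesla_df[['Volume']]` matter because `StandardScaler` expects a 2D input with one column per feature. The sketch below uses a small made-up DataFrame (not the Tesla data) to show that a 1D Series is rejected, and that the `(n, 1)` result can be flattened with `.ravel()` before assigning it to a column, which makes the shape explicit:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up volumes for illustration only
df = pd.DataFrame({"Volume": [100.0, 200.0, 300.0, 400.0]})

scaler = StandardScaler()

# Single brackets give a 1D Series, which the scaler rejects
try:
    scaler.fit_transform(df["Volume"])
    one_d_rejected = False
except ValueError:
    one_d_rejected = True

# Double brackets give a one-column DataFrame (2D), which works
scaled_2d = scaler.fit_transform(df[["Volume"]])

# Flatten the (n, 1) array to 1D before storing it as a column
df["Volume_Scaled"] = scaled_2d.ravel()

print("1D input rejected:", one_d_rejected)
print("Scaled shape:", scaled_2d.shape)
print(df)
```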

## Identify and Fix the Code

The issue in the code is the call to `scaler.transform(features)` on a scaler that was never fitted. `fit_transform` first computes each feature's mean and standard deviation and then transforms the data, whereas `transform` on its own only works once the scaler has already been fitted.

Here’s the corrected code:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
import datasets

# Load the dataset
data = datasets.load_dataset('codesignal/tsla-historic-prices')
tesla_df = pd.DataFrame(data['train'])

# Feature Engineering: creating new features
tesla_df['High-Low'] = tesla_df['High'] - tesla_df['Low']
tesla_df['Price-Open'] = tesla_df['Close'] - tesla_df['Open']

# Defining features
features = tesla_df[['High-Low', 'Price-Open', 'Volume']].values

# Scaling
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)  # Corrected here

# Displaying the first few scaled features
print("Scaled features (first 5 rows):\n", features_scaled[:5])

# Checking mean values and standard deviations of scaled features
scaled_means = features_scaled.mean(axis=0)
scaled_stds = features_scaled.std(axis=0)

print("\nMean values of scaled features:", scaled_means)
print("Standard deviations of scaled features:", scaled_stds)
```

### Explanation:
- **Issue:** The code was using `scaler.transform(features)` without first fitting the scaler. This raises a `NotFittedError` because the scaler has not yet learned the mean and standard deviation of each feature.
- **Fix:** Replace `transform` with `fit_transform`, which fits the scaler to the data and then transforms it, ensuring that the scaling is applied correctly.

Now the code will run correctly and produce the scaled features needed for further analysis or machine learning tasks.
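
The difference between the two methods can be reproduced on a tiny synthetic array (the arrays `train` and `new_data` are assumptions for illustration). The sketch first triggers the error, then shows the case where plain `transform` is the right call: reusing statistics learned earlier on data the scaler has not seen.

```python
import numpy as np
from sklearn.exceptions import NotFittedError
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only
train = np.array([[1.0], [2.0], [3.0]])
new_data = np.array([[4.0]])

scaler = StandardScaler()

# Calling transform before fit raises NotFittedError
try:
    scaler.transform(train)
    error_name = None
except NotFittedError as exc:
    error_name = type(exc).__name__

# Fit once, then reuse the learned mean and standard deviation
scaler.fit(train)
new_scaled = scaler.transform(new_data)

print("Error before fitting:", error_name)
print("New value scaled with the fitted statistics:", new_scaled)
```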

## Scaling Financial Features with StandardScaler

Here's the completed code to properly scale the features using `StandardScaler` and validate the scaling:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
import datasets

# Load the dataset
data = datasets.load_dataset('codesignal/tsla-historic-prices')
tesla_df = pd.DataFrame(data['train'])

# Feature Engineering: creating new features
tesla_df['High-Low'] = tesla_df['High'] - tesla_df['Low']
tesla_df['Price-Open'] = tesla_df['Close'] - tesla_df['Open']

# Defining features
# Features include the original price columns, 'Volume', and the two engineered columns
features = tesla_df[['Open', 'High', 'Low', 'Close', 'Volume', 'High-Low', 'Price-Open']]

# Initialize the StandardScaler and scale the features
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)

# Displaying the first few scaled features
print("Scaled features (first 5 rows):\n", features_scaled[:5])

# Checking mean values and standard deviations of scaled features
scaled_means = features_scaled.mean(axis=0)
scaled_stds = features_scaled.std(axis=0)

print("\nMean values of scaled features:", scaled_means)
print("Standard deviations of scaled features:", scaled_stds)
```

### Title: Scaling Features and Validating with StandardScaler

### Conclusion:
The code successfully scales the defined features from the Tesla stock dataset using `StandardScaler`. It first loads the dataset and performs feature engineering to create new columns. Then, it initializes `StandardScaler`, fits it to the features, transforms the features, and prints the scaled values. Finally, it validates the scaling by printing the mean values and standard deviations of the scaled features. This ensures the features are scaled properly for further analysis or modeling tasks.
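
Because `fit_transform` returns a plain NumPy array, the column names are lost even when a DataFrame is passed in. If you want to keep them, one option is to wrap the result back into a DataFrame. The sketch below does this on a small made-up frame (the values and the `day1`-style index labels are illustrative assumptions):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up values standing in for a few rows of stock data
features = pd.DataFrame(
    {"Open": [10.0, 12.0, 11.0], "Volume": [1000.0, 3000.0, 2000.0]},
    index=["day1", "day2", "day3"],
)

# fit_transform returns a NumPy array without column names
scaled_array = StandardScaler().fit_transform(features)

# Rebuild a DataFrame with the original column names and index
scaled_df = pd.DataFrame(
    scaled_array, columns=features.columns, index=features.index
)

print(scaled_df)
```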

## Implement Feature Scaling Using StandardScaler

Here's the code with the missing pieces filled in to scale the features correctly using `StandardScaler` and display the first few rows of the scaled features:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
import datasets

# Load the Tesla stock dataset
data = datasets.load_dataset('codesignal/tsla-historic-prices')
tesla_df = pd.DataFrame(data['train'])

# Feature Engineering: creating new features
# Create a new feature with a value corresponding to a daily price change
tesla_df['Daily_Change'] = tesla_df['Close'] - tesla_df['Open']
# Create a new feature with a value equal to the mean price during the day
tesla_df['Mean_Price'] = (tesla_df['High'] + tesla_df['Low']) / 2

# Defining features
features = tesla_df[['Daily_Change', 'Mean_Price', 'Volume', 'Open']].values

# Scaling the features
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)

# Displaying the first few scaled features
print("Scaled features (first 5 rows):\n", features_scaled[:5])

# Check mean values and standard deviations of scaled features
scaled_means = features_scaled.mean(axis=0)
scaled_stds = features_scaled.std(axis=0)

print("\nMean values of scaled features:", scaled_means)
print("Standard deviations of scaled features:", scaled_stds)
```

### Title: Feature Scaling and Validation with Tesla Stock Data

### Conclusion:
The code successfully generates new features, `Daily_Change` and `Mean_Price`, and then scales them along with the `Volume` and `Open` columns using `StandardScaler`. After scaling, it displays the first few rows of the scaled features and checks the mean values and standard deviations to ensure that the features have been standardized correctly. This provides a solid foundation for further data analysis or model training.

## Final Data Scaling Implementation

Here's the complete implementation that loads the Tesla dataset, creates new features, scales them with `StandardScaler`, and validates the scaled features according to the given TODOs:

```python
import pandas as pd
import datasets
from sklearn.preprocessing import StandardScaler

# Load the Tesla dataset using `datasets.load_dataset`
data = datasets.load_dataset('codesignal/tsla-historic-prices')

# Convert the dataset to a pandas DataFrame
tesla_df = pd.DataFrame(data['train'])

# Create new features: 'Volatility' and 'Daily_Average'
# Volatility represents the daily price range (High - Low) relative to the opening price.
tesla_df['Volatility'] = (tesla_df['High'] - tesla_df['Low']) / tesla_df['Open']
# Daily_Average represents the average of the daily high and low prices.
tesla_df['Daily_Average'] = (tesla_df['High'] + tesla_df['Low']) / 2

# Define the features from the DataFrame
# The features will include Volatility, Daily_Average, and Volume
features = tesla_df[['Volatility', 'Daily_Average', 'Volume']].values

# Initialize the StandardScaler and scale the features
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)

# Print the first 5 rows of the scaled features
print("Scaled features (first 5 rows):\n", features_scaled[:5])

# Check and print the mean values and standard deviations of the scaled features
scaled_means = features_scaled.mean(axis=0)
scaled_stds = features_scaled.std(axis=0)

print("\nMean values of scaled features:", scaled_means)
print("Standard deviations of scaled features:", scaled_stds)
```

### Title: Tesla Stock Feature Engineering and Scaling

### Conclusion:
The implementation loads the Tesla stock dataset, creates two new features (`Volatility` and `Daily_Average`), and scales them along with the `Volume` column using `StandardScaler`. The scaled features are printed, and their means and standard deviations are checked to confirm proper scaling. This process standardizes the features, making them ready for further data analysis or modeling tasks.