# Lesson Overview

Welcome! In today's lesson, we will learn how to split a dataset into training and testing sets. This is a crucial step in preparing your data for machine learning models to ensure they generalize well to unseen data.

**Lesson Goal:** By the end of this lesson, you will understand how to split financial datasets, such as Tesla's stock data, into training and testing sets using Python.

## Revision of Preprocessing Steps

Before we delve into splitting the dataset, let's briefly review the preprocessing steps we have covered so far. The dataset has been loaded, new features have been engineered, and the features have been scaled.

Here's the code for those steps for a quick revision:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
import datasets

# Loading and preprocessing the dataset (revision)
data = datasets.load_dataset('codesignal/tsla-historic-prices')
tesla_df = pd.DataFrame(data['train'])
tesla_df['High-Low'] = tesla_df['High'] - tesla_df['Low']
tesla_df['Price-Open'] = tesla_df['Close'] - tesla_df['Open']

# Defining features and target
features = tesla_df[['High-Low', 'Price-Open', 'Volume']].values
# Target is the column that we are trying to predict
target = tesla_df['Close'].values

# Scaling
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)
```

## Understanding the Importance of Splitting Datasets

Overfitting occurs when a model learns the training data too well and then performs poorly on new, unseen data. To detect it, you need to evaluate your machine learning model on data it has never seen during training. This is where splitting the dataset into training and testing sets comes into play.

### Why Split?

- **Training Set:** Used to train the machine learning model.
- **Testing Set:** Used to evaluate the model's performance and check its ability to generalize to unseen data.

This ensures that your model's performance is not just tailored to the training data but can be generalized to new inputs.
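
To make the point concrete, here is a minimal sketch on synthetic data (not the Tesla dataset). The data, the noise level, and the choice of `DecisionTreeRegressor` are assumptions made purely for illustration: an unrestricted tree memorizes the training rows, and only the held-out test set reveals the gap.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (illustrative assumption): one feature, noisy linear target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 5.0, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# A tree with no depth limit can fit every training point, noise included.
model = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)
print(f"R^2 on training data: {train_r2:.3f}")
print(f"R^2 on testing data:  {test_r2:.3f}")
```

The training score alone would suggest a near-perfect model; the lower testing score is what exposes the overfitting.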

## Implementing Dataset Split with `train_test_split`

The `train_test_split` function from `sklearn.model_selection` helps us easily split the data.

**Parameters of `train_test_split`:**

- `test_size`: The proportion of the dataset to include in the test split (e.g., 0.25 means 25% of the data will be used for testing).
- `train_size`: The proportion of the dataset to include in the train split. It is optional: when omitted, it is set to the complement of `test_size`.
- `random_state`: Controls the shuffling applied to the data before the split. Providing a fixed value ensures reproducibility.
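
The following sketch tries these parameters on toy arrays. The arrays and variable names are assumptions for illustration. It also shows that `test_size` accepts an integer, which is then read as an absolute number of test rows rather than a proportion.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy arrays (illustrative only): 100 samples, 3 features.
X = np.arange(300).reshape(100, 3)
y = np.arange(100)

# test_size as a fraction: 25% of the rows go to the test set.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)
print(X_tr.shape, X_te.shape)      # (75, 3) (25, 3)

# test_size as an integer: an absolute number of test rows.
X_tr_n, X_te_n, _, _ = train_test_split(X, y, test_size=10, random_state=42)
print(X_tr_n.shape, X_te_n.shape)  # (90, 3) (10, 3)

# The same random_state reproduces exactly the same split.
X_te_again = train_test_split(X, y, test_size=0.25, random_state=42)[1]
same_split = np.array_equal(X_te, X_te_again)
print(same_split)                  # True
```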

Let's split our scaled features and targets into training and testing sets:

```python
from sklearn.model_selection import train_test_split

# Splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(features_scaled, target, test_size=0.25, random_state=42)
```

The `train_test_split` function will split our dataset into training and testing sets:

- `features_scaled` and `target` are the inputs.
- `test_size=0.25` means 25% of the data goes to the test set.
- `random_state=42` ensures reproducibility. Any other integer works as well; what matters is using the same value across runs.
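
By default `train_test_split` shuffles the rows before splitting. For time-ordered data such as daily stock prices, some workflows prefer a chronological split so that the test set only contains rows that come after the training rows. This lesson uses the shuffled split; the sketch below is an optional variant, using the function's `shuffle` parameter on stand-in data where row `i` plays the role of "day `i`".

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for time-ordered data (illustrative): row i represents day i.
days = np.arange(20)
X = days.reshape(-1, 1)
y = days * 1.5

# shuffle=False keeps the original order: the first 75% of the rows
# become the training set and the last 25% become the testing set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, shuffle=False
)

print("Last training day:", X_train[-1, 0])  # 14
print("First testing day:", X_test[0, 0])    # 15
```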

## Verifying Shapes and Contents of the Split Data

After splitting the dataset, it's important to verify the shapes and the contents of the resulting sets to ensure the split was done correctly.

### Checking Shapes:

Print the shapes of the training and testing sets to confirm the split ratio is as expected.

### Inspecting Sample Rows:

Print a few rows of the training and testing sets to visually inspect the data.

Let's check our split data:

```python
# Verify splits
print(f"Training features shape: {X_train.shape}")
print(f"Testing features shape: {X_test.shape}")

print(f"First 5 rows of training features: \n{X_train[:5]}")
print(f"First 5 training targets: {y_train[:5]}\n")

print(f"First 5 rows of testing features: \n{X_test[:5]}")
print(f"First 5 testing targets: {y_test[:5]}")
```

The output of the above code will be:

```
Training features shape: (2510, 3)
Testing features shape: (837, 3)
First 5 rows of training features: 
[[-4.66075964e-01  6.80184955e-02  3.11378946e-01]
 [ 4.01701510e+00  5.04529577e+00 -4.61555718e-02]
 [ 2.04723437e+00  3.09900603e+00  9.43022378e-04]
 [-5.30579018e-01 -2.30986178e-02 -5.67163058e-01]
 [-4.78854883e-01 -5.79376618e-02 -6.94451021e-01]]
First 5 training targets: [ 17.288    355.666656 222.419998  15.000667  13.092   ]

First 5 rows of testing features: 
[[-0.36226203  0.2087143   0.69346624]
 [ 1.27319589  1.04049732  0.58204785]
 [-0.53556882 -0.03231093 -0.86874821]
 [-0.49029475  0.07773304 -0.51784526]
 [ 3.0026057  -4.41816938 -0.31923731]]
First 5 testing targets: [ 23.209333 189.606674  14.730667  16.763332 325.733337]
```

This output confirms that our dataset has been successfully split into training and testing sets, showing the shape of each set and giving us a glimpse into the rows of our features and targets post-split. It's an important validation step to ensure our data is ready for machine learning model training and evaluation.
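
The printed shapes can also be checked with a little arithmetic. The sketch below assumes that, for a fractional `test_size`, scikit-learn rounds the test count up and gives the remaining rows to the training set; that assumption is consistent with the 2510 and 837 rows shown above.

```python
import math

n_samples = 2510 + 837  # total rows, taken from the shapes printed above
test_size = 0.25

# Assumed rounding rule: test count is rounded up, training set gets the rest.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 2510 837
```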

## Lesson Summary

Great job! In this lesson, we:

- Discussed the importance of splitting datasets to detect overfitting.
- Implemented `train_test_split` to divide the dataset into training and testing sets.
- Verified the shapes and inspected sample rows of the resulting splits.

These steps are crucial for ensuring that your machine learning models can generalize well to new data. Up next, you'll have some practice exercises to solidify your understanding and improve your data preparation skills. Keep going!

## Adjust the Dataset Split Ratio

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import datasets

# Load the dataset
data = datasets.load_dataset('codesignal/tsla-historic-prices')
tesla_df = pd.DataFrame(data['train'])

# Create new features
tesla_df['High-Low'] = tesla_df['High'] - tesla_df['Low']
tesla_df['Price-Open'] = tesla_df['Close'] - tesla_df['Open']

# Define features and target
features = tesla_df[['High-Low', 'Price-Open', 'Volume']].values
target = tesla_df['Close'].values

# Scale the features
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features_scaled, target, test_size=0.30, random_state=42)

# Verify splits
print(f"Training features shape: {X_train.shape}")
print(f"Testing features shape: {X_test.shape}")

print(f"First 5 rows of training features: \n{X_train[:5]}")
print(f"First 5 training targets: {y_train[:5]}\n")

print(f"First 5 rows of testing features: \n{X_test[:5]}")
print(f"First 5 testing targets: {y_test[:5]}")

```

## Fix the Dataset Split

It looks like there's an issue with how the values returned by `train_test_split` are unpacked. Given a feature array and a target array, the `train_test_split` function from `sklearn.model_selection` returns four values in a fixed order: `X_train`, `X_test`, `y_train`, and `y_test`. Your code assigns them in a different order, so the variable names no longer match their contents.

Here's the corrected version of the code:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import datasets

# Load dataset
data = datasets.load_dataset('codesignal/tsla-historic-prices')
tesla_df = pd.DataFrame(data['train'])

# Feature engineering
tesla_df['High-Low'] = tesla_df['High'] - tesla_df['Low']
tesla_df['Price-Open'] = tesla_df['Close'] - tesla_df['Open']

# Define features and target
features = tesla_df[['High-Low', 'Price-Open', 'Volume']].values
target = tesla_df['Close'].values

# Scale features
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features_scaled, target, test_size=0.25, random_state=42)

# Verify splits
print(f"Training features shape: {X_train.shape}")
print(f"Testing features shape: {X_test.shape}")

print(f"First 5 rows of training features:\n{X_train[:5]}")
print(f"First 5 training targets: {y_train[:5]}\n")

print(f"First 5 rows of testing features:\n{X_test[:5]}")
print(f"First 5 testing targets: {y_test[:5]}")
```

### Key Changes:
- Changed `X_train, y_train, X_test, y_test` to `X_train, X_test, y_train, y_test` in the `train_test_split` function call.

### Conclusion:
This correction ensures that the features and targets are split into the appropriate training and testing sets, and the resulting shapes and sample values can be verified as expected. Happy debugging! 🚀
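
To see why the order matters, here is a small sketch on toy data. The arrays and the `bad_` prefixed names are assumptions for illustration; the mismatched order is the one listed under Key Changes. The shapes show that the misnamed variables hold the wrong pieces.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (illustrative only): 100 samples, 3 features.
X = np.random.default_rng(0).normal(size=(100, 3))
y = np.arange(100, dtype=float)

# Mismatched order: the function still returns
# (X_train, X_test, y_train, y_test), so these names are misleading.
bad_X_train, bad_y_train, bad_X_test, bad_y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print(bad_y_train.shape)  # (25, 3): actually the test features
print(bad_X_test.shape)   # (75,): actually the training targets

# Matching order: features and targets line up row for row.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print(X_train.shape, y_train.shape)  # (75, 3) (75,)
```

Note that the mismatched version raises no error at the split itself, which is why printing the shapes is a useful check.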

## Fill in the Blanks: Splitting and Scaling Data

Here’s how you can fill in the blanks to ensure the code works as intended:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import datasets

# Loading and preprocessing the dataset (revision)
data = datasets.load_dataset('codesignal/tsla-historic-prices')
tesla_df = pd.DataFrame(data['train'])
tesla_df['High-Low'] = tesla_df['High'] - tesla_df['Low']
tesla_df['Price-Open'] = tesla_df['Close'] - tesla_df['Open']

# Defining features and target
features = tesla_df[['High-Low', 'Price-Open', 'Volume']].values
target = tesla_df['Close'].values

# Scaling
scaler = StandardScaler()
# TODO: Fit and transform the features using StandardScaler
features_scaled = scaler.fit_transform(features)

# Splitting the dataset
# TODO: Split the preprocessed features and target into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features_scaled, target, test_size=0.25, random_state=42)

# Verify splits
print(f"Training features shape: {X_train.shape}")
print(f"Testing features shape: {X_test.shape}")

print(f"First 5 rows of training features: \n{X_train[:5]}")
print(f"First 5 training targets: {y_train[:5]}\n")

print(f"First 5 rows of testing features: \n{X_test[:5]}")
print(f"First 5 testing targets: {y_test[:5]}")
```

### Key Fill-ins:
1. **Scaling the features**:
   ```python
   features_scaled = scaler.fit_transform(features)
   ```
   This scales the features using `StandardScaler`, ensuring that they have a mean of 0 and a standard deviation of 1.

2. **Splitting the dataset**:
   ```python
   X_train, X_test, y_train, y_test = train_test_split(features_scaled, target, test_size=0.25, random_state=42)
   ```
   This correctly splits the scaled features and target into training and testing sets.
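
The exercise scales all rows before splitting, which is the approach this lesson follows. A commonly used variant, sketched below on synthetic data, splits first and fits the scaler on the training rows only, so that no statistics from the test rows influence the preprocessing. The data and the `_raw` variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features (illustrative only): 200 samples, 3 features.
rng = np.random.default_rng(0)
features = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
target = rng.normal(size=200)

# Split the raw features first.
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    features, target, test_size=0.25, random_state=42
)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train_raw)  # learn mean/std from training rows only
X_test = scaler.transform(X_test_raw)        # reuse those statistics on the test rows

# The training features now have mean 0 and standard deviation 1.
print(np.allclose(X_train.mean(axis=0), 0.0))  # True
print(np.allclose(X_train.std(axis=0), 1.0))   # True
# The test features are close to, but not exactly, mean 0.
print(X_test.mean(axis=0))
```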

### Conclusion:
This completes the data preprocessing steps and ensures the data is properly prepared for any further analysis or model training. Keep up the stellar exploration! 🚀✨

## Splitting the Dataset into Training and Testing Sets

The error occurs because `features_scaled` and `target` contain different numbers of samples. This typically happens when `dropna()` is applied to the features but not to the target, leaving the two arrays with mismatched lengths.

To fix this, we need to ensure that both `features` and `target` have the same number of rows after dropping any rows with missing values.

Here’s how to adjust the code:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import datasets

# Load the dataset
data = datasets.load_dataset('codesignal/tsla-historic-prices')
tesla_df = pd.DataFrame(data['train'])

# Create new features
tesla_df['5 Day Moving Avg'] = tesla_df['Close'].rolling(window=5).mean()
tesla_df['10 Day Moving Avg'] = tesla_df['Close'].rolling(window=10).mean()

# Drop rows with NaN values in both features and target to ensure consistency
tesla_df = tesla_df.dropna(subset=['5 Day Moving Avg', '10 Day Moving Avg', 'Volume', 'Close'])

# Define features and target
features = tesla_df[['5 Day Moving Avg', '10 Day Moving Avg', 'Volume']].values
target = tesla_df['Close'].values

# Scale the features
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)

# Split the dataset into training and testing sets, using 25% for testing and random_state of 42
X_train, X_test, y_train, y_test = train_test_split(features_scaled, target, test_size=0.25, random_state=42)

# Verify splits
print(f"Training features shape: {X_train.shape}")
print(f"Testing features shape: {X_test.shape}")
print(f"First 5 rows of training features: \n{X_train[:5]}")
print(f"First 5 training targets: {y_train[:5]}\n")
print(f"First 5 rows of testing features: \n{X_test[:5]}")
print(f"First 5 testing targets: {y_test[:5]}")
```

### Key Fix:
- **Drop rows with `NaN` values**:
  ```python
  tesla_df = tesla_df.dropna(subset=['5 Day Moving Avg', '10 Day Moving Avg', 'Volume', 'Close'])
  ```
  This ensures that both the features and the target have consistent lengths.

### Conclusion:
This adjustment ensures that the `features` and `target` arrays have the same number of samples, avoiding the `ValueError`. Now, your code should work without any issues! 🌌🚀
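
The mismatch can be reproduced on a small synthetic DataFrame. The sketch below is illustrative only (made-up prices, a single moving-average feature): it first triggers the `ValueError` by dropping rows from the features alone, then applies the fix of dropping rows on the DataFrame before deriving both arrays.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic prices (illustrative only, not the Tesla dataset).
df = pd.DataFrame({"Close": np.linspace(10.0, 30.0, 50)})
df["5 Day Moving Avg"] = df["Close"].rolling(window=5).mean()

# Mismatch: NaN rows are dropped from the features but not from the target.
bad_features = df[["5 Day Moving Avg"]].dropna().values  # 46 rows
bad_target = df["Close"].values                          # 50 rows

try:
    train_test_split(bad_features, bad_target, test_size=0.25, random_state=42)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Fix: drop the rows on the DataFrame, then derive both arrays from it.
clean = df.dropna(subset=["5 Day Moving Avg", "Close"])
features = clean[["5 Day Moving Avg"]].values
target = clean["Close"].values
print(features.shape[0] == target.shape[0])  # True
```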

## Preprocess and Split Tesla Stock Data

Here's the complete code to accomplish the tasks outlined in the TODO comments:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import datasets

# TODO: Load the dataset 'codesignal/tsla-historic-prices' and convert it to a DataFrame
data = datasets.load_dataset('codesignal/tsla-historic-prices')
tesla_df = pd.DataFrame(data['train'])

# TODO: Create new features 'SMA20' (20-day Simple Moving Average) and 'EMA20' (20-day Exponential Moving Average)
tesla_df['SMA20'] = tesla_df['Close'].rolling(window=20).mean()
tesla_df['EMA20'] = tesla_df['Close'].ewm(span=20, adjust=False).mean()

# TODO: Drop NaN values that were created by moving average
tesla_df = tesla_df.dropna(subset=['SMA20', 'EMA20', 'Volume', 'Close'])

# TODO: Define features and target
# `features` include SMA20, EMA20, and Volume, `target` includes Close prices
features = tesla_df[['SMA20', 'EMA20', 'Volume']].values
target = tesla_df['Close'].values

# TODO: Scale features using StandardScaler
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)

# TODO: Split the dataset into training and testing sets using train_test_split
X_train, X_test, y_train, y_test = train_test_split(features_scaled, target, test_size=0.25, random_state=42)

# TODO: Verify splits by printing shapes and sample rows of training and testing sets
print(f"Training features shape: {X_train.shape}")
print(f"Testing features shape: {X_test.shape}")
print(f"First 5 rows of training features: \n{X_train[:5]}")
print(f"First 5 training targets: {y_train[:5]}\n")
print(f"First 5 rows of testing features: \n{X_test[:5]}")
print(f"First 5 testing targets: {y_test[:5]}")
```

### Breakdown of the Steps:
1. **Loading the Dataset**:
   ```python
   data = datasets.load_dataset('codesignal/tsla-historic-prices')
   tesla_df = pd.DataFrame(data['train'])
   ```
   This loads the dataset and converts it into a DataFrame.

2. **Creating New Features**:
   ```python
   tesla_df['SMA20'] = tesla_df['Close'].rolling(window=20).mean()
   tesla_df['EMA20'] = tesla_df['Close'].ewm(span=20, adjust=False).mean()
   ```
   This creates the `SMA20` feature with a 20-day rolling mean and the `EMA20` feature with an exponentially weighted mean of span 20.

3. **Dropping NaN Values**:
   ```python
   tesla_df = tesla_df.dropna(subset=['SMA20', 'EMA20', 'Volume', 'Close'])
   ```
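   This removes any rows with `NaN` values. Here they come from `SMA20`, whose first 19 rows have no complete 20-day window; `EMA20` computed with `adjust=False` is defined from the first row onward.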

4. **Defining Features and Target**:
   ```python
   features = tesla_df[['SMA20', 'EMA20', 'Volume']].values
   target = tesla_df['Close'].values
   ```
   This defines the `features` and `target` for model training.

5. **Scaling the Features**:
   ```python
   scaler = StandardScaler()
   features_scaled = scaler.fit_transform(features)
   ```
   The `StandardScaler` standardizes each feature to a mean of 0 and a standard deviation of 1.

6. **Splitting the Dataset**:
   ```python
   X_train, X_test, y_train, y_test = train_test_split(features_scaled, target, test_size=0.25, random_state=42)
   ```
   This splits the data into training and testing sets.

7. **Verifying the Splits**:
   ```python
   print(f"Training features shape: {X_train.shape}")
   print(f"Testing features shape: {X_test.shape}")
   print(f"First 5 rows of training features: \n{X_train[:5]}")
   print(f"First 5 training targets: {y_train[:5]}\n")
   print(f"First 5 rows of testing features: \n{X_test[:5]}")
   print(f"First 5 testing targets: {y_test[:5]}")
   ```
   This final step prints out the shapes and sample rows of the training and testing datasets to verify everything is in order.
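
As a quick check on the `dropna` step, the sketch below counts the missing values each moving average produces. It uses a synthetic price series as a stand-in for the `Close` column, so the numbers describe the technique rather than the Tesla data.

```python
import numpy as np
import pandas as pd

# Synthetic closing prices (illustrative only): 60 rows.
close = pd.Series(np.linspace(100.0, 150.0, 60), name="Close")

sma20 = close.rolling(window=20).mean()
ema20 = close.ewm(span=20, adjust=False).mean()

# The rolling mean needs a full 20-row window, so its first 19 values are NaN.
print("NaN rows in SMA20:", int(sma20.isna().sum()))  # 19
# The exponentially weighted mean starts from the first row.
print("NaN rows in EMA20:", int(ema20.isna().sum()))  # 0
```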

### Conclusion:
With all these steps combined, you now have a fully preprocessed dataset ready for any further analysis or machine learning tasks. Keep reaching for the stars! 🌟🚀