## Introduction
In today's lesson, we preprocess the Iris dataset for use with TensorFlow. We will cover data splitting, feature scaling, and one-hot encoding, the transformations typically applied to data before it is fed to a neural network. Understanding these steps is a foundation for the rest of your machine learning work. Let's get into it!

## Overview of the Iris Dataset
Before we delve into data preprocessing, we need to understand the data we are processing. The Iris dataset comprises measurements of 150 Iris flowers, 50 from each of three species. Each sample includes the following 4 features:

- Sepal length (cm): e.g., 5.1, 4.9, 4.7, etc.
- Sepal width (cm): e.g., 3.5, 3.0, 3.2, etc.
- Petal length (cm): e.g., 1.4, 1.4, 1.3, etc.
- Petal width (cm): e.g., 0.2, 0.2, 0.2, etc.

Additionally, each sample has a class label representing the Iris species. The targets in the dataset are represented as one of the following options:

- Iris setosa: 0
- Iris versicolor: 1
- Iris virginica: 2

With these measurements and labels, the Iris dataset is a small multivariate dataset that is often used in machine learning introductions.
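To make the features and labels described above concrete, the following minimal sketch inspects the metadata that scikit-learn ships with the dataset. It relies on the feature_names and target_names attributes of the object returned by load_iris; the class_counts variable is an illustrative name introduced here.

```python
from collections import Counter
from sklearn.datasets import load_iris

iris = load_iris()

# Column names for the four measurements
print(iris.feature_names)

# Species names; the position in this array matches the integer label
print(iris.target_names)

# Count how many samples carry each integer label
class_counts = Counter(iris.target.tolist())
print(class_counts)
```

Each of the three labels should appear 50 times, which is why the dataset is described as balanced.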

## Insight into Data Preprocessing
Data preprocessing is a crucial step in machine learning. It converts raw data into a format suited to the next stage of processing, making it easier for algorithms to extract information and, in turn, improving their predictions. In today's lesson we cover four steps: loading, splitting, scaling, and encoding the data.

## Step 1: Loading the Dataset
Before diving into preprocessing, let's start by loading the Iris dataset. We use the load_iris function from scikit-learn for this purpose. It returns the feature matrix X and the target vector y.

```Python
from sklearn.datasets import load_iris

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Displaying shapes
print(f'X shape: {X.shape}')
print(f'y shape: {y.shape}')
```
The output will be:

```sh
X shape: (150, 4)
y shape: (150,)
```

Here, X contains 150 samples, each with 4 features (sepal length, sepal width, petal length, and petal width). The y vector contains 150 class labels, with each label representing one of the three Iris species. This initial step helps us understand the dimensions of our dataset before we proceed with further processing.

## Step 2: Splitting into Training and Testing Sets
The first preprocessing step is data splitting. We divide the dataset into two parts: a training set and a testing set. The training set is used to fit the model, while the testing set is held out to evaluate its performance on data the model has not seen, which gives us an estimate of how well it generalizes. We use scikit-learn's train_test_split function for this purpose. The stratify parameter keeps the proportion of each class in both splits the same as in the original dataset, and random_state fixes the shuffling seed so the split is reproducible. In our example, we use 70% of the data for training and 30% for testing.

```Python
from sklearn.model_selection import train_test_split

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
```
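As a quick check on what stratify does, the sketch below counts the labels in each split. The names train_counts and test_counts are illustrative; np.bincount simply counts how often each integer label occurs.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# np.bincount counts occurrences of each integer label (0, 1, 2)
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print(f'Training class counts: {train_counts}')
print(f'Testing class counts:  {test_counts}')
```

Because the original dataset has 50 samples per class, a stratified 70/30 split should place 35 samples of each class in the training set and 15 of each in the testing set.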

## Step 3: Feature Scaling
After splitting the data, we perform feature scaling so that all input features are on a comparable scale, which prevents features with larger numeric ranges from dominating those with smaller ones. We use the StandardScaler from scikit-learn, which standardizes each feature by subtracting its mean and dividing by its standard deviation, giving a mean of 0 and unit variance. The fit method calculates the mean and standard deviation from the training data only; transform then applies those same statistics to both the training and testing sets, so no information from the testing set leaks into preprocessing.

```Python
from sklearn.preprocessing import StandardScaler

# Scale the features
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```
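To see what the scaler actually learns and applies, here is a small sketch under the same split settings as above. It reads the fitted mean_ and scale_ attributes and reproduces transform by hand; manual_test_scaled is an illustrative name.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Statistics learned from the training set only
print(f'Learned means:  {scaler.mean_}')
print(f'Learned scales: {scaler.scale_}')

# transform applies z = (x - mean) / scale, column by column
manual_test_scaled = (X_test - scaler.mean_) / scaler.scale_

# Training columns end up with mean ~0 and standard deviation ~1;
# test columns are only approximately standardized
print(f'Train mean after scaling: {X_train_scaled.mean(axis=0).round(6)}')
print(f'Train std after scaling:  {X_train_scaled.std(axis=0).round(6)}')
print(f'Test mean after scaling:  {X_test_scaled.mean(axis=0).round(3)}')
```

The testing set is not expected to have a mean of exactly 0, because it is scaled with the training statistics rather than its own.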

## Step 4: Target Encoding
The final preprocessing step is target encoding. The targets in the Iris dataset are categorical: the integers 0, 1, and 2 identify species and carry no numeric order. We convert them with one-hot encoding, which represents each class as a binary (0 or 1) vector with a single 1 in the position of that class. For example, a sample labeled 1 (Iris versicolor) is represented as [0, 1, 0] after one-hot encoding. We use the OneHotEncoder from scikit-learn for this. Because the encoder expects a 2D array, we reshape the label vector into a single column with reshape(-1, 1), and sparse_output=False (available in scikit-learn 1.2 and later) makes it return a dense NumPy array instead of a sparse matrix. The fit method learns the unique categories present in the training data, which are then used for encoding.

```Python
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the targets
encoder = OneHotEncoder(sparse_output=False)
encoder.fit(y_train.reshape(-1, 1))
y_train_encoded = encoder.transform(y_train.reshape(-1, 1))
y_test_encoded = encoder.transform(y_test.reshape(-1, 1))
```

## Data Preprocessing in Practice
The function below gathers all of the preprocessing steps (loading, splitting, scaling, and encoding) into one place. Wrapping them in a single function keeps the code modular: we can import it from another file where we develop our model and get the processed data with one call.

```Python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder

def load_preprocessed_data():
    # Load the Iris dataset
    iris = load_iris()
    X, y = iris.data, iris.target

    # Split the dataset into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

    # Scale the features
    scaler = StandardScaler().fit(X_train)
    X_train_scaled = scaler.transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # One-hot encode the targets
    encoder = OneHotEncoder(sparse_output=False).fit(y_train.reshape(-1, 1))
    y_train_encoded = encoder.transform(y_train.reshape(-1, 1))
    y_test_encoded = encoder.transform(y_test.reshape(-1, 1))

    return X_train_scaled, X_test_scaled, y_train_encoded, y_test_encoded
```

## Loading and Printing Preprocessed Data
After defining the function that preprocesses the data, we can load the preprocessed data and print a sample of the training input and target.

```Python
# Load preprocessed data
X_train, X_test, y_train, y_test = load_preprocessed_data()

# Print a sample of one training input and target
print(f'Sample of preprocessed X_train: {X_train[0]}')
print(f'Sample of preprocessed y_train: {y_train[0]}\n')

# Print the shape of scaled and encoded data
print(f'Shape of preprocessed X_train: {X_train.shape}')
print(f'Shape of preprocessed X_test: {X_test.shape}')
print(f'Shape of preprocessed y_train: {y_train.shape}')
print(f'Shape of preprocessed y_test: {y_test.shape}')
```
The output of the above code will be:

```sh
Sample of preprocessed X_train: [-0.90045861 -1.22024754 -0.4419858  -0.13661044]
Sample of preprocessed y_train: [0. 1. 0.]

Shape of preprocessed X_train: (105, 4)
Shape of preprocessed X_test: (45, 4)
Shape of preprocessed y_train: (105, 3)
Shape of preprocessed y_test: (45, 3)
```

This output illustrates the results of our preprocessing steps: the feature values are standardized, and the targets are one-hot encoded and ready for a machine learning model. The first training target, [0. 1. 0.], corresponds to class 1 (Iris versicolor).

## Lesson Summary and Practice
In conclusion, we have successfully preprocessed the Iris dataset and made it ready for machine learning modeling with TensorFlow. We've loaded, split, scaled, and encoded the data using Python. This fundamental knowledge is essential to you as a Machine Learning Engineer to improve accuracy and build efficient models using TensorFlow.

Next, we will have exercises to consolidate these preprocessing steps. The exercises aim to enhance your understanding and application of data preprocessing and prepare you for more challenging tasks in the future. Happy learning!




## Exploring and Preprocessing the Iris Dataset
We have covered important concepts for preprocessing the Iris dataset. Now, let's put that knowledge into practice.

In this task, you will see how to preprocess the Iris dataset for TensorFlow by running the provided code. This includes splitting the dataset, scaling the features, and one-hot encoding the targets.

Simply execute the code to observe how the data is preprocessed and the shapes of the resulting arrays.

```py
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder

def load_preprocessed_data():
    # Load the Iris dataset
    iris = load_iris()
    X, y = iris.data, iris.target

    # Split the dataset into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

    # Scale the features
    scaler = StandardScaler().fit(X_train)
    X_train_scaled = scaler.transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # One-hot encode the targets
    encoder = OneHotEncoder(sparse_output=False).fit(y_train.reshape(-1, 1))
    y_train_encoded = encoder.transform(y_train.reshape(-1, 1))
    y_test_encoded = encoder.transform(y_test.reshape(-1, 1))

    return X_train_scaled, X_test_scaled, y_train_encoded, y_test_encoded

# Load preprocessed data
X_train, X_test, y_train, y_test = load_preprocessed_data()

# Print a sample of one training input and target
print(f'Sample of preprocessed X_train: {X_train[0]}')
print(f'Sample of preprocessed y_train: {y_train[0]}\n')

# Print the shape of scaled and encoded data
print(f'Shape of preprocessed X_train: {X_train.shape}')
print(f'Shape of preprocessed X_test: {X_test.shape}')
print(f'Shape of preprocessed y_train: {y_train.shape}')
print(f'Shape of preprocessed y_test: {y_test.shape}')
```

## Changing Train-Test Split Ratio

Great job understanding the basics of preprocessing the Iris dataset. Now, let's explore how changing the train-test split ratio impacts data preparation.

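Change the train-test split ratio from 70/30 to 80/20 by modifying the test_size parameter from 0.3 to 0.2. Watch how the shapes of the training and testing arrays change: a larger training set gives the model more examples to learn from, while a smaller testing set gives a noisier estimate of its performance.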

```py
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder

def load_preprocessed_data():
    # Load the Iris dataset
    iris = load_iris()
    X, y = iris.data, iris.target

    # Create an 80/20 train-test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

    # Scale the features
    scaler = StandardScaler().fit(X_train)
    X_train_scaled = scaler.transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # One-hot encode the targets
    encoder = OneHotEncoder(sparse_output=False).fit(y_train.reshape(-1, 1))
    y_train_encoded = encoder.transform(y_train.reshape(-1, 1))
    y_test_encoded = encoder.transform(y_test.reshape(-1, 1))

    return X_train_scaled, X_test_scaled, y_train_encoded, y_test_encoded

# Load preprocessed data
X_train, X_test, y_train, y_test = load_preprocessed_data()

# Print a sample of one training input and target
print(f'Sample of preprocessed X_train: {X_train[0]}')
print(f'Sample of preprocessed y_train: {y_train[0]}\n')

# Print the shape of scaled and encoded data
print(f'Shape of preprocessed X_train: {X_train.shape}')
print(f'Shape of preprocessed X_test: {X_test.shape}')
print(f'Shape of preprocessed y_train: {y_train.shape}')
print(f'Shape of preprocessed y_test: {y_test.shape}')
```
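To compare the two ratios side by side without the scaling and encoding steps, a minimal sketch such as the following can help. The split_sizes dictionary is an illustrative name used only to collect the results.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Record (training size, testing size) for each test_size value
split_sizes = {}
for test_size in (0.3, 0.2):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=42
    )
    split_sizes[test_size] = (len(X_train), len(X_test))
    print(f'test_size={test_size}: {len(X_train)} train, {len(X_test)} test')
```

With 150 samples, a test_size of 0.3 yields 105 training and 45 testing samples, while 0.2 yields 120 and 30.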

## Fix the Data Preprocessing Bugs

So far, you have learned how to preprocess the Iris dataset. Now, it's time to practice and ensure you can implement it correctly.

In this exercise, you need to fix a few bugs in the code that preprocesses the Iris dataset for TensorFlow. There are mistakes that prevent the code from running correctly.

Your task is to find and fix these errors. This will help you understand common mistakes and ensure you can preprocess datasets accurately.

```py
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder

def load_preprocessed_data():
    # Load the Iris dataset
    iris = load_iris()
    X, y = iris.data, iris.target

    # Split the dataset into training and testing sets (80/20 split)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

    # Scale the features
    scaler = StandardScaler().fit(X_train)
    X_train_scaled = scaler.transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # One-hot encode the targets
    encoder = OneHotEncoder(sparse_output=False).fit(y_train.reshape(-1, 1))
    y_train_encoded = encoder.transform(y_train.reshape(-1, 1))
    y_test_encoded = encoder.transform(y_test.reshape(-1, 1))

    return X_train_scaled, X_test_scaled, y_train_encoded, y_test_encoded

# Load preprocessed data
X_train, X_test, y_train, y_test = load_preprocessed_data()

# Print a sample of one training input and target
print(f'Sample of preprocessed X_train: {X_train[0]}')
print(f'Sample of preprocessed y_train: {y_train[0]}\n')

# Print the shape of scaled and encoded data
print(f'Shape of preprocessed X_train: {X_train.shape}')
print(f'Shape of preprocessed X_test: {X_test.shape}')
print(f'Shape of preprocessed y_train: {y_train.shape}')
print(f'Shape of preprocessed y_test: {y_test.shape}')
```

## Hands-on Data Preprocessing

In the previous tasks, you practiced various preprocessing techniques for the Iris dataset.

Now, in this exercise, you will complete the code to preprocess the dataset. This includes scaling the features and one-hot encoding the targets.

Fill in the missing parts denoted by TODO comments.

```py
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder

def load_preprocessed_data():
    # Load the Iris dataset
    iris = load_iris()
    X, y = iris.data, iris.target

    # Split the dataset into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

    # TODO: Scale the features
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # TODO: One-hot encode the targets
    encoder = OneHotEncoder(sparse_output=False)
    y_train_encoded = encoder.fit_transform(y_train.reshape(-1, 1))
    y_test_encoded = encoder.transform(y_test.reshape(-1, 1))

    return X_train_scaled, X_test_scaled, y_train_encoded, y_test_encoded

# Load preprocessed data
X_train, X_test, y_train, y_test = load_preprocessed_data()

# Print a sample of one training input and target
print(f'Sample of preprocessed X_train: {X_train[0]}')
print(f'Sample of preprocessed y_train: {y_train[0]}\n')

# Print the shape of scaled and encoded data
print(f'Shape of preprocessed X_train: {X_train.shape}')
print(f'Shape of preprocessed X_test: {X_test.shape}')
print(f'Shape of preprocessed y_train: {y_train.shape}')
print(f'Shape of preprocessed y_test: {y_test.shape}')
```

## End-to-end Preprocessing the Iris Dataset

You've done a great job understanding the concepts of preprocessing the Iris dataset.

In this final task, you will implement the complete process of preprocessing the Iris dataset for TensorFlow from scratch. Follow the steps to load the dataset, split it into training and testing sets, scale the features, and one-hot encode the target labels.

Create the function load_preprocessed_data() to achieve the desired outcomes.

```py
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder

def load_preprocessed_data():
    # Load the Iris dataset
    iris = load_iris()
    
    # Assign the dataset attributes: data to X and target to y
    X, y = iris.data, iris.target

    # Split the dataset into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

    # Scale the features
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # One-hot encode the targets
    encoder = OneHotEncoder(sparse_output=False)
    y_train_encoded = encoder.fit_transform(y_train.reshape(-1, 1))
    y_test_encoded = encoder.transform(y_test.reshape(-1, 1))

    return X_train_scaled, X_test_scaled, y_train_encoded, y_test_encoded

# Load preprocessed data
X_train, X_test, y_train, y_test = load_preprocessed_data()

# Print a sample of one training input and target
print(f'Sample of preprocessed X_train: {X_train[0]}')
print(f'Sample of preprocessed y_train: {y_train[0]}\n')

# Print the shape of scaled and encoded data
print(f'Shape of preprocessed X_train: {X_train.shape}')
print(f'Shape of preprocessed X_test: {X_test.shape}')
print(f'Shape of preprocessed y_train: {y_train.shape}')
print(f'Shape of preprocessed y_test: {y_test.shape}')
```