<img src="https://hilpisch.com/tpq_logo.png" alt="The Python Quants" width="35%" align="right" border="0"><br>


# Deep Learning Basics with PyTorch

**Dr. Yves J. Hilpisch with GPT-5**


# Chapter 2 — Data, Features, and Representations

This Colab-ready notebook mirrors the chapter’s Iris example: visualisation, a scaler + logistic regression pipeline, and a 2D decision boundary.

## Overview

This notebook is a concise, hands-on companion to Chapter 2 of *Deep Learning Basics with PyTorch*.
Run each cell, read the short notes,
and try small variations to build intuition.

Tips:
- Run cells top to bottom; restart the kernel if the state gets confusing.
- Prefer small, fast experiments; iterate quickly and observe outputs.
- Keep an eye on shapes, dtypes, and devices when using PyTorch.


## Highlights

- Chapter 2 turns raw measurements into features you can visualise and model.
- Comparing petal vs sepal measurements shows why feature choice matters.
- Scaling plus logistic regression forms a reliable baseline before trying fancier models.

## Guidance

1. Explore the Iris dataset and visualise different feature pairs.
2. Build intuition about separability before fitting any model.
3. Use a pipeline to combine scaling and classification without leakage.
4. Interpret the confusion matrix and decision boundary to see what the model learned.

## Explanation

Treat this notebook as a map of Chapter 2: sketch the data, run a clean train/test split, and relate the geometry of the boundary to model behaviour.

In [None]:
# Optional: ensure packages are present (Colab usually has these)
# !pip -q install scikit-learn matplotlib numpy
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8')  # plotting
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, ConfusionMatrixDisplay
%config InlineBackend.figure_format = 'retina'


## Load and Visualise Petal Features

Petal length and width are the most separable pair discussed in the book. Plot them first to see how clearly the species separate in two dimensions.

In [None]:
iris = datasets.load_iris()
X = iris.data[:, [2, 3]]  # petal length, petal width; inputs
y = iris.target  # targets/labels

plt.figure(figsize = (4, 3))  # plotting
for cls, marker, label in [
    (0, 'o', iris.target_names[0]),
    (1, 's', iris.target_names[1]),
    (2, '^', iris.target_names[2]),
]:
    idx = y == cls
    plt.scatter(X[idx, 0], X[idx, 1], marker = marker, label = label, s = 25)  # plotting

plt.xlabel('petal length (cm)')  # plotting
plt.ylabel('petal width (cm)')
plt.legend(frameon = False)
plt.tight_layout()
plt.show()


### Explanation

- Petal measurements form tight clusters, matching Figure 2 from the chapter.
- A quick scatter plot tells you that a linear model should perform well here.
- Keep the figure small and readable; you want to spot overlaps at a glance.
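
The visual impression of tight clusters can be cross-checked with a few numbers. The following is a minimal sketch, not part of the chapter's code: it prints per-class means and standard deviations of the two petal features and compares the largest setosa petal length with the smallest petal length among the other two species. The names `setosa_max` and `others_min` are illustrative.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data[:, [2, 3]]  # petal length, petal width
y = iris.target

# per-class mean and standard deviation of both petal features
for cls, name in enumerate(iris.target_names):
    mean = X[y == cls].mean(axis = 0)
    std = X[y == cls].std(axis = 0)
    print(f'{name:<10} mean = {np.round(mean, 2)}  std = {np.round(std, 2)}')

# gap check: largest setosa petal length vs smallest among the other species
setosa_max = X[y == 0, 0].max()
others_min = X[y != 0, 0].min()
print(f'setosa max = {setosa_max:.1f} cm, others min = {others_min:.1f} cm')
```

If `setosa_max` is smaller than `others_min`, a single threshold on petal length already isolates setosa, which is what the scatter plot suggests.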

## Optional: Inspect Sepal Features

Sepal length and width are less separable—use this view to contrast with the petal plot and understand why feature choice matters.

In [None]:
X_sepal = iris.data[:, [0, 1]]

plt.figure(figsize = (4, 3))  # plotting
for cls, marker, label in [
    (0, 'o', iris.target_names[0]),
    (1, 's', iris.target_names[1]),
    (2, '^', iris.target_names[2]),
]:
    idx = y == cls
    plt.scatter(X_sepal[idx, 0], X_sepal[idx, 1], marker = marker,
        label = label, s = 25)  # plotting

plt.xlabel('sepal length (cm)')  # plotting
plt.ylabel('sepal width (cm)')
plt.legend(frameon = False)
plt.tight_layout()
plt.show()


### Explanation

- Sepal features overlap more, mirroring Figure 3 in the text.
- Even with clean data, not every 2D projection separates classes—this motivates feature engineering.
- Keep both plots handy when you discuss model performance.
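
One way to put a number on the petal versus sepal contrast is to cross-validate the same model on each feature pair. This is a sketch under assumptions: `cross_val_score` and the 5-fold setting are not used elsewhere in this notebook, and the dictionary names `feature_pairs` and `cv_scores` are illustrative.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
feature_pairs = {'sepal': [0, 1], 'petal': [2, 3]}  # column indices in iris.data

cv_scores = {}
for name, cols in feature_pairs.items():
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter = 1000))
    # 5-fold cross-validation; the scaler is refit inside every fold
    scores = cross_val_score(model, iris.data[:, cols], iris.target, cv = 5)
    cv_scores[name] = scores.mean()
    print(f'{name}: mean CV accuracy = {cv_scores[name]:.3f}')
```

The gap between the two mean accuracies is a rough, model-dependent measure of how much the choice of feature pair matters here.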

## Train a Scaler + Logistic Regression Pipeline

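Standardisation puts the features on a comparable scale so that none dominates merely because of its units. The pipeline mirrors the book’s advice against leakage: split first, let the scaler learn its mean and standard deviation from the training set only, then fit the classifier on the scaled training data.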

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.25, random_state = 42, stratify = y
)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter = 1000))
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f"Accuracy: {acc:.3f}")
ConfusionMatrixDisplay.from_predictions(y_test, y_pred,
    display_labels = iris.target_names)
plt.tight_layout()  # plotting
plt.show()


### Explanation

- Stratified splits preserve the class proportions of the full dataset in both the training and the test set.
- Accuracy is high because petal features separate species cleanly.
- The confusion matrix reveals whether any class pair still gets mixed up—interpret it before celebrating the score.
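
The confusion matrix can also be read numerically. The sketch below repeats the split and fit from the cell above so that it runs on its own, then derives per-class recall from the matrix rows. `confusion_matrix` is an extra import not used in the notebook, and the names `cm` and `recall` are illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X, y = iris.data[:, [2, 3]], iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.25, random_state = 42, stratify = y
)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter = 1000))
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)

cm = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted class
recall = cm.diagonal() / cm.sum(axis = 1)  # share of each true class that was recovered
for name, r in zip(iris.target_names, recall):
    print(f'{name:<10} recall = {r:.3f}')
```

The diagonal of `cm` divided by its total equals the overall accuracy; the per-class values show which species, if any, account for the errors.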

## Decision Boundary in 2D

Fit on the full dataset to mirror the book’s decision-boundary visual; this fit is for visualisation only, not for evaluation. The contour plot shows how a linear classifier partitions the feature space.

In [None]:
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter = 1000))
pipe.fit(X, y)

xmin, xmax = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
ymin, ymax = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
xx, yy = np.meshgrid(
    np.linspace(xmin, xmax, 200),
    np.linspace(ymin, ymax, 200),
)
zz = pipe.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.figure(figsize = (4, 3))  # plotting
plt.contourf(xx, yy, zz, levels = [-0.5, 0.5, 1.5, 2.5], cmap = 'coolwarm', alpha = 0.2)
for cls, marker, label in [
    (0, 'o', iris.target_names[0]),
    (1, 's', iris.target_names[1]),
    (2, '^', iris.target_names[2]),
]:
    idx = y == cls
    plt.scatter(X[idx, 0], X[idx, 1], marker = marker, label = label, s = 25)  # plotting

plt.xlabel('petal length (cm)')  # plotting
plt.ylabel('petal width (cm)')
plt.legend(frameon = False)
plt.tight_layout()
plt.show()


### Explanation

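- The piecewise-linear boundaries slice the space into three regions; compare them with the scatter points to see where classes sit close to a boundary.
- The contour levels (-0.5, 0.5, 1.5, 2.5) bracket the integer class labels (0, 1, 2), so each class gets its own filled band, matching the book’s boundary figure.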
- Try swapping in sepal features or a non-linear model to reproduce Chapter 2’s “less separable” discussion.
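
To see why the regions have straight edges, it can help to recompute the predictions by hand. The sketch below assumes the default step names that `make_pipeline` generates (`standardscaler`, `logisticregression`) and uses the fitted `coef_` and `intercept_` attributes: each class gets one linear score on the scaled inputs, and the predicted class is the one with the largest score. The names `scores` and `manual_pred` are illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X, y = iris.data[:, [2, 3]], iris.target
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter = 1000))
pipe.fit(X, y)

scaler = pipe.named_steps['standardscaler']  # default step names from make_pipeline
clf = pipe.named_steps['logisticregression']

# one linear score per class: w_k . x_scaled + b_k
scores = scaler.transform(X) @ clf.coef_.T + clf.intercept_
manual_pred = clf.classes_[scores.argmax(axis = 1)]

print('coefficients (one row per class):')
print(np.round(clf.coef_, 2))
print('manual argmax matches pipe.predict:',
    np.array_equal(manual_pred, pipe.predict(X)))
```

The boundary between two classes is where their scores are equal, which is a straight line in the scaled plane; since standardisation is an affine map, it stays a straight line in the original petal coordinates.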

## Exercises

1. Create a leakage example and fix it using proper splits and pipelines.
2. Compare results with/without standardization for a chosen model.
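
A possible starting point for Exercise 1, offered as a sketch rather than a reference solution: the leaky variant fits the scaler on all rows before splitting, the clean variant splits first and lets the pipeline fit the scaler on the training rows only. The names `leaky_scaler`, `clean_scaler`, `leaky_acc`, and `clean_acc` are illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X, y = iris.data[:, [2, 3]], iris.target

# leaky variant: the scaler sees every row, including the future test rows
leaky_scaler = StandardScaler().fit(X)
Xl_train, Xl_test, yl_train, yl_test = train_test_split(
    leaky_scaler.transform(X), y, test_size = 0.25, random_state = 42, stratify = y
)
leaky_clf = LogisticRegression(max_iter = 1000).fit(Xl_train, yl_train)
leaky_acc = leaky_clf.score(Xl_test, yl_test)

# clean variant: split first; the pipeline fits the scaler on training rows only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.25, random_state = 42, stratify = y
)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter = 1000))
clean_acc = pipe.fit(X_train, y_train).score(X_test, y_test)
clean_scaler = pipe.named_steps['standardscaler']

print('leaky scaler mean:', np.round(leaky_scaler.mean_, 3))
print('clean scaler mean:', np.round(clean_scaler.mean_, 3))
print(f'leaky accuracy = {leaky_acc:.3f}, clean accuracy = {clean_acc:.3f}')
```

On a small, clean dataset such as Iris the two accuracies may well be close or identical; the point of the comparison is which rows the scaler has seen, not the size of the gap.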


