<img src="https://theaiengineer.dev/tae_logo_gw_flatter.png" width=35% align=right>

# Python Primer for Machine & Deep Learning
## Scikit-learn Mini Primer

**&copy; Dr. Yves J. Hilpisch**

AI-Powered by GPT-5

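This notebook walks through a basic scikit-learn workflow: split the data, fit a pipeline that combines preprocessing and a model, evaluate it with a confusion matrix and a ROC curve, fit a regression baseline, and finish with cross‑validation and grid search.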

In [None]:
# If running in a fresh Colab, uncomment to install
# !python -m pip install -q scikit-learn matplotlib

### Train/test split and pipeline (classification)

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_curve, auc
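import numpy as np
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8')
# Synthetic binary classification data: 600 samples, 6 features (4 informative)
X, y = make_classification(n_samples=600, n_features=6, n_informative=4, random_state=0)
# Hold out 20% for testing; stratify keeps the class ratio roughly equal in both splits
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# The pipeline fits the scaler on the training data only, then fits the classifier
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(Xtr, ytr)
y_pred = clf.predict(Xte)
acc = accuracy_score(yte, y_pred)
acc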
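
As a minimal sketch of what `make_pipeline` builds, the cell below recreates the same pipeline under a new name (`pipe`, chosen here for illustration only) and inspects its steps. `make_pipeline` names each step after its lowercased class name, and calling `predict` on the pipeline should give the same labels as scaling the test data and calling the classifier by hand.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same synthetic data and split as in the cell above
X, y = make_classification(n_samples=600, n_features=6, n_informative=4, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# `pipe` is an illustrative name, not part of the original notebook
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(Xtr, ytr)

# Step names are generated from the lowercased class names
step_names = list(pipe.named_steps)
print(step_names)

# The fitted scaler holds one mean per feature, learned from the training data only
scaler = pipe.named_steps['standardscaler']
print(scaler.mean_.shape)

# Manual equivalent of pipe.predict: transform with the fitted scaler, then predict
manual = pipe.named_steps['logisticregression'].predict(scaler.transform(Xte))
same = bool((manual == pipe.predict(Xte)).all())
print(same)
```

These generated step names are also what the `logisticregression__C` key in a parameter grid refers to.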

### Confusion matrix and ROC

In [None]:
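cm = confusion_matrix(yte, y_pred)  # rows: true class, columns: predicted class
fig, ax = plt.subplots(figsize=(5.2,4))
im = ax.imshow(cm, cmap='Blues')
ax.grid(False)  # the active style may draw grid lines across the cells
ax.set_xticks([0, 1]); ax.set_yticks([0, 1])  # one tick per class
for (i,j), v in np.ndenumerate(cm):
    ax.text(j, i, str(v), ha='center', va='center')
ax.set_xlabel('predicted'); ax.set_ylabel('true'); ax.set_title('Confusion matrix')
plt.show()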
y_score = clf.predict_proba(Xte)[:,1]
fpr, tpr, _ = roc_curve(yte, y_score); roc_auc = auc(fpr, tpr)
fig, ax = plt.subplots(figsize=(5.2,4))
ax.plot(fpr, tpr, label=f'ROC AUC = {roc_auc:.3f}')
ax.plot([0,1],[0,1],'--',c='0.6'); ax.set(xlabel='FPR', ylabel='TPR', title='ROC curve')
ax.legend(); ax.grid(alpha=0.3); plt.show()
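
To cross-check the plotted AUC and to read rates off the confusion matrix, the following sketch refits the same pipeline, compares `auc(fpr, tpr)` with `roc_auc_score`, and derives precision and recall from the four matrix cells. The names `precision`, `recall`, `auc_from_curve`, and `auc_direct` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (auc, confusion_matrix, precision_score,
                             recall_score, roc_auc_score, roc_curve)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same data, split, and model as in the cells above
X, y = make_classification(n_samples=600, n_features=6, n_informative=4, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(Xtr, ytr)

y_pred = clf.predict(Xte)
y_score = clf.predict_proba(Xte)[:, 1]  # probability of the positive class

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(yte, y_pred).ravel()
precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of true positives that are found (the TPR)

# Area under the ROC curve, computed two ways
fpr, tpr, _ = roc_curve(yte, y_score)
auc_from_curve = auc(fpr, tpr)
auc_direct = roc_auc_score(yte, y_score)

print(f'precision={precision:.3f} recall={recall:.3f}')
print(f'AUC (curve)={auc_from_curve:.3f} AUC (direct)={auc_direct:.3f}')
```

Note that precision and recall depend on the default 0.5 decision threshold used by `predict`, while the AUC summarizes all thresholds at once.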

### Regression baseline

In [None]:
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
Xr, yr = make_regression(n_samples=500, n_features=4, noise=10.0, random_state=1)
# Use separate names so the classification split (Xtr, Xte, ytr, yte) is not overwritten
Xr_tr, Xr_te, yr_tr, yr_te = train_test_split(Xr, yr, test_size=0.2, random_state=42)
reg = make_pipeline(StandardScaler(), LinearRegression())
reg.fit(Xr_tr, yr_tr)
yhat = reg.predict(Xr_te)
r2, mae = r2_score(yr_te, yhat), mean_absolute_error(yr_te, yhat)
r2, mae
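
The scores above are easier to judge against a trivial reference. A minimal sketch, assuming the same synthetic data: `DummyRegressor` (from `sklearn.dummy`, not used elsewhere in this notebook) always predicts the training mean, so its test R² cannot exceed zero, and the linear model should beat it clearly on both metrics. All variable names below are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same synthetic regression data and split settings as in the cell above
Xr, yr = make_regression(n_samples=500, n_features=4, noise=10.0, random_state=1)
Xr_tr, Xr_te, yr_tr, yr_te = train_test_split(Xr, yr, test_size=0.2, random_state=42)

# Trivial reference: always predict the mean of the training targets
dummy = DummyRegressor(strategy='mean').fit(Xr_tr, yr_tr)
# Linear regression baseline with scaling
linear = make_pipeline(StandardScaler(), LinearRegression()).fit(Xr_tr, yr_tr)

r2_dummy = r2_score(yr_te, dummy.predict(Xr_te))
r2_linear = r2_score(yr_te, linear.predict(Xr_te))
mae_dummy = mean_absolute_error(yr_te, dummy.predict(Xr_te))
mae_linear = mean_absolute_error(yr_te, linear.predict(Xr_te))

print(f'dummy : R2={r2_dummy:.3f} MAE={mae_dummy:.2f}')
print(f'linear: R2={r2_linear:.3f} MAE={mae_linear:.2f}')
```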

### Cross‑validation and grid search

In [None]:
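from sklearn.model_selection import cross_val_score, GridSearchCV
# 5-fold CV accuracy of the classification pipeline; the scaler is refit inside each fold
scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
print(f'CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}')
# Tune the inverse regularization strength C; the key prefix is the pipeline step name
grid = GridSearchCV(make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000)),
                     param_grid={'logisticregression__C':[0.1,1.0,10.0]}, cv=5, scoring='accuracy')
grid.fit(X, y)
grid.best_params_, grid.best_score_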
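
Fitting the grid search on the full data means `best_score_` is a cross-validated estimate, with no untouched test set left for a final check. One possible variant, sketched here under the same setup, tunes on the training split only and then reports accuracy on the held-out split. The name `test_acc` is introduced for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same synthetic data and split as in the classification section
X, y = make_classification(n_samples=600, n_features=6, n_informative=4, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Search over C using 5-fold CV on the training split only
grid = GridSearchCV(
    make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000)),
    param_grid={'logisticregression__C': [0.1, 1.0, 10.0]},
    cv=5,
    scoring='accuracy',
)
grid.fit(Xtr, ytr)

# With the default refit=True, the best pipeline is refit on all of Xtr,
# so grid.score and grid.predict use that refit model
test_acc = grid.score(Xte, yte)
print(grid.best_params_)
print(f'CV accuracy={grid.best_score_:.3f} test accuracy={test_acc:.3f}')
```

The test accuracy here is computed on data the search never saw, so it is the less optimistic of the two numbers to report.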

## Exercises
1. Replace LogisticRegression with a tree‑based model (e.g., RandomForest) and compare ROC/AUC.
2. Fit a StandardScaler on the full dataset before splitting (outside the pipeline) and explain why this leaks test-set information into training.
3. For regression, plot predicted vs. true and add a y=x dashed line.
