# Scikit-learn Starter Notebook
A hands-on notebook to practice the basics of scikit-learn.

**What you'll learn:**
1) Loading datasets  
2) Train/test split  
3) Training a simple model  
4) Evaluating with accuracy & confusion matrix  
5) Using preprocessing (scaling)  
6) Building a Pipeline  
7) Cross-validation  

---
Run each cell from top to bottom (Shift+Enter in Jupyter).

## 0. Setup
Install the packages if they are not already available; otherwise skip this cell.

In [None]:
# If needed, uncomment:
# !pip install -q scikit-learn pandas matplotlib


## 1. Imports

In [None]:
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.pipeline import Pipeline
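import matplotlib.pyplot as plt
import sklearn

# Report the library versions so results can be reproduced later.
print(f"Versions -> numpy: {np.__version__}, pandas: {pd.__version__}, scikit-learn: {sklearn.__version__}")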

## 2. Load a sample dataset (Iris)
Scikit-learn ships with small, clean datasets like *Iris* (150 samples, 4 numeric features, 3 classes). We'll use it for quick experiments.

In [None]:
iris = datasets.load_iris(as_frame=True)
X = iris.data
y = iris.target

print("X shape:", X.shape)
print("y shape:", y.shape)
X.head()
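
Before modeling, it can help to look at the combined frame and the class balance. The sketch below is one way to do that, assuming a scikit-learn version that supports `as_frame=True`; the `species` column name is chosen here for illustration and is not part of the dataset.

```python
from sklearn import datasets

iris = datasets.load_iris(as_frame=True)

# iris.frame holds the four feature columns plus an integer "target" column.
df = iris.frame.copy()

# Map integer labels (0, 1, 2) to readable class names.
# "species" is a column name made up for this example.
df["species"] = df["target"].map(dict(enumerate(iris.target_names)))

# Count how many samples belong to each class.
class_counts = df["species"].value_counts()
print(df.shape)
print(class_counts)
```

A balanced class distribution like this one is part of why plain accuracy is a reasonable metric for Iris.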

## 3. Train/Test Split
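Split the data into training and test sets (80/20). `stratify=y` keeps the class proportions the same in both sets, and `random_state=42` makes the split reproducible.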

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train.shape, X_test.shape
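
To see what stratification does in practice, the following sketch repeats the split and compares class proportions in the two subsets. It is a quick sanity check rather than a required step; the names `train_props` and `test_props` are introduced here for illustration.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris(as_frame=True)
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fraction of each class in the training and test sets.
# With stratify=y these should match the proportions in the full dataset.
train_props = y_train.value_counts(normalize=True).sort_index()
test_props = y_test.value_counts(normalize=True).sort_index()

print("Train proportions:\n", train_props)
print("Test proportions:\n", test_props)
```

Dropping `stratify=y` and rerunning shows how the proportions can drift when the split is purely random.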

## 4. Train a simple model (Logistic Regression)

In [None]:
log_clf = LogisticRegression(max_iter=1000)
log_clf.fit(X_train, y_train)

y_pred = log_clf.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print("Accuracy:", acc)
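
Accuracy is simply the fraction of test samples that were predicted correctly. The sketch below recomputes it by hand and also looks at the class probabilities behind the predictions; it refits its own model (named `clf` here) so that it runs on its own.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Accuracy = share of test samples whose predicted label equals the true label.
manual_acc = np.mean(y_pred == y_test)

# predict_proba returns one probability per class; each row sums to 1.
proba = clf.predict_proba(X_test)

print("Manual accuracy:", manual_acc)
print("accuracy_score: ", accuracy_score(y_test, y_pred))
print("Probabilities for the first test sample:", proba[0].round(3))
```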

### Confusion Matrix & Classification Report

In [None]:
cm = confusion_matrix(y_test, y_pred)
print(cm)
print("\nClassification Report:\n", classification_report(y_test, y_pred, target_names=iris.target_names))

# Plot confusion matrix (simple matplotlib, single figure, default colors)
fig, ax = plt.subplots()
ax.imshow(cm)
ax.set_title("Confusion Matrix")
ax.set_xlabel("Predicted")
ax.set_ylabel("True")
ax.set_xticks(range(len(iris.target_names)))
ax.set_yticks(range(len(iris.target_names)))
ax.set_xticklabels(iris.target_names, rotation=45, ha='right')
ax.set_yticklabels(iris.target_names)
for i in range(cm.shape[0]):
    for j in range(cm.shape[1]):
        ax.text(j, i, cm[i, j], ha='center', va='center')
plt.tight_layout()
plt.show()
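
The per-class numbers in the classification report can be read directly off the confusion matrix: rows are true classes and columns are predicted classes. The sketch below derives recall and precision that way and compares them with scikit-learn's own functions. It refits the model so it can run independently, and it assumes every class is predicted at least once (otherwise the precision division would hit a zero).

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
y_pred = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict(X_test)

cm = confusion_matrix(y_test, y_pred)

# The diagonal holds the correctly classified samples for each class.
correct = np.diag(cm)

# Recall: correct predictions divided by the number of true samples (row sums).
recall = correct / cm.sum(axis=1)

# Precision: correct predictions divided by the number of predictions (column sums).
precision = correct / cm.sum(axis=0)

print("Recall per class:   ", recall)
print("Precision per class:", precision)
```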

## 5. Add preprocessing (Standardization)
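Models that rely on distances or on gradient-based optimization, such as logistic regression, often converge faster and perform better when features are on a comparable scale. We'll use a **Pipeline** to combine scaling and the model, so the scaler is fit on the training data only and then applied unchanged to the test data.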

In [None]:
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000))
])
pipe.fit(X_train, y_train)
print("Pipeline test accuracy:", pipe.score(X_test, y_test))
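
For intuition about what the scaling step does, the sketch below checks that `StandardScaler` subtracts each column's mean and divides by its standard deviation, and that the scaler inside a fitted pipeline has learned its statistics from the training data only. The variable names (`manual`, `fitted_scaler`) are chosen for this example.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Standardization by hand: z = (x - mean) / std, computed per column.
# StandardScaler uses the population standard deviation (ddof=0), as does np.std.
manual = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)
scaled = StandardScaler().fit_transform(X_train)
print("Matches manual computation:", np.allclose(manual, scaled))

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)

# Steps can be looked up by the names given in the Pipeline definition.
fitted_scaler = pipe.named_steps["scale"]
print("Means learned from the training set:", fitted_scaler.mean_)
```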

## 6. Try a different model (Decision Tree)

In [None]:
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)
print("DecisionTree test accuracy:", tree.score(X_test, y_test))
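
One advantage of a decision tree is that the fitted model can be inspected. The sketch below prints the tree's depth, its learned rules, and the feature importances, and compares training with test accuracy as a rough check for overfitting. The model is refit under the name `tree_clf` so the example runs on its own.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = datasets.load_iris(as_frame=True)
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

tree_clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Size of the fitted tree.
print("Depth:", tree_clf.get_depth(), "| Leaves:", tree_clf.get_n_leaves())

# Text rendering of the learned decision rules.
print(export_text(tree_clf, feature_names=list(X.columns)))

# Importance of each feature; the values sum to 1.
importances = dict(zip(X.columns, tree_clf.feature_importances_))
print(importances)

# A large gap between these two numbers would suggest overfitting.
train_acc = tree_clf.score(X_train, y_train)
test_acc = tree_clf.score(X_test, y_test)
print("Train accuracy:", train_acc, "| Test accuracy:", test_acc)
```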

## 7. Cross-Validation
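Use **cross_val_score** on the full dataset to estimate generalization performance. Because the scaler lives inside the pipeline, it is refit on the training portion of each fold, so no information from the held-out fold leaks into preprocessing.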

In [None]:
scores = cross_val_score(pipe, X, y, cv=5)
print("CV scores:", scores)
print("Mean CV:", scores.mean())
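
When `cv` is an integer and the estimator is a classifier, `cross_val_score` uses stratified folds without shuffling. The sketch below shows one way to make the folds explicit and shuffled, and reports the spread of the scores alongside the mean. The choice of `random_state=42` simply mirrors the rest of this notebook.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_iris(return_X_y=True)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Explicit splitter: 5 stratified folds, shuffled with a fixed seed.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# One accuracy value per fold; the pipeline is refit from scratch each time.
scores = cross_val_score(pipe, X, y, cv=cv)

print("Fold accuracies:", scores.round(3))
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

The standard deviation gives a rough sense of how much the estimate depends on the particular split.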

## 8. Mini-Exercises (try these!)
1. Change the classifier in the pipeline to `DecisionTreeClassifier` and compare CV scores.
2. Adjust tree depth: `DecisionTreeClassifier(max_depth=2)` – how does accuracy change?
3. Replace Iris with `datasets.load_wine(as_frame=True)` – what changes?
4. Add `StandardScaler()` to a tree pipeline (even if trees don't need scaling) – does it affect results?
5. Try `KNeighborsClassifier` from `sklearn.neighbors` (remember to scale!).