In [None]:
# Install required packages (uncomment for local runs)
# !pip install numpy pandas scikit-learn matplotlib seaborn

# Import libraries and print versions for reproducibility
import sys
import numpy as np
import pandas as pd
import sklearn
import matplotlib
import seaborn

print(f'Python version: {sys.version}')
print(f'numpy version: {np.__version__}')
print(f'pandas version: {pd.__version__}')
print(f'scikit-learn version: {sklearn.__version__}')
print(f'matplotlib version: {matplotlib.__version__}')
print(f'seaborn version: {seaborn.__version__}')

# Set random state for reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

# Introduction

**Machine Learning (ML)** is a field of artificial intelligence where models learn patterns from data to make predictions or decisions. In *supervised learning*, models are trained on labeled data, where inputs (features) are paired with known outputs (targets). This contrasts with *unsupervised learning*, where the data has no labels, and the model identifies patterns or structures.

### Goals
- Demonstrate supervised learning workflows for regression and classification.
- Cover data preprocessing, model training, evaluation, and visualization.
- Provide practical experience with scikit-learn Pipelines and common ML practices.

### Datasets
- **Regression**: California Housing dataset (fetched via scikit-learn). The task is to predict the median house value of a census block group from features such as median income and house age. Chosen for its real-world relevance and continuous target.
- **Classification**: Breast Cancer Wisconsin dataset (bundled with scikit-learn). The task is to predict whether a tumor is malignant or benign from cell-nucleus measurements. Chosen for its binary classification task and medical context.

# Data

We load and explore two datasets: California Housing (regression) and Breast Cancer Wisconsin (classification).

In [None]:
from sklearn.datasets import fetch_california_housing, load_breast_cancer

# Load regression dataset
california = fetch_california_housing()
X_reg = pd.DataFrame(california.data, columns=california.feature_names)
y_reg = california.target

print('California Housing Dataset:')
print(f'Shape: {X_reg.shape}')
print(f'Feature Names: {california.feature_names}')
print('Target: Median house value (in $100,000s)')
print('\nFirst 5 rows:')
print(X_reg.head())
print('\nSummary statistics:')
print(X_reg.describe())
print('\nMissing values:')
print(X_reg.isna().sum())

# Load classification dataset
breast_cancer = load_breast_cancer()
X_clf = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
y_clf = breast_cancer.target

print('\nBreast Cancer Wisconsin Dataset:')
print(f'Shape: {X_clf.shape}')
print(f'Feature Names: {breast_cancer.feature_names}')
print(f'Target: {breast_cancer.target_names} (0=malignant, 1=benign)')
print('\nFirst 5 rows:')
print(X_clf.head())
print('\nSummary statistics:')
print(X_clf.describe())
print('\nMissing values:')
print(X_clf.isna().sum())

# Check class distribution for classification
print('\nClass distribution (Breast Cancer):')
print(pd.Series(y_clf).value_counts(normalize=True))

# Methods

### Overview
- **Train/Validation/Test Splits**: We split data into training (70%), validation (15%), and test (15%) sets. The training set is used to fit the models (and, for LogisticRegression, to tune `C` with 5-fold cross-validation), the validation set provides an intermediate check, and the test set is reserved for final evaluation. Comparing scores across the splits helps detect overfitting (model memorizing training data) and underfitting (model too simple to capture patterns); the split itself does not prevent either.
- **Preprocessing**: We use scikit-learn Pipelines to handle missing values, encode categorical features, and scale numerical features. Fitting these steps inside a Pipeline helps avoid data leakage, because imputation and scaling statistics are learned from the training data only (never from validation or test data).
- **Models**: LinearRegression for regression, LogisticRegression for classification (chosen for simplicity and interpretability).
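
To make the leakage point concrete, here is a minimal sketch that uses NumPy only and made-up data (the array, the 80/20 cut, and all variable names are illustrative assumptions, not the notebook's datasets). It standardizes the test rows with statistics learned from the training rows only, which is what a Pipeline does when it is fitted on the training set, and contrasts that with statistics computed on all rows.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical single-feature data; not one of the notebook's datasets
x = rng.normal(loc=50.0, scale=10.0, size=100)
x_train, x_test = x[:80], x[80:]

# Leak-free: mean and standard deviation come from the training rows only
mu_train, sigma_train = x_train.mean(), x_train.std()
z_train = (x_train - mu_train) / sigma_train
z_test = (x_test - mu_train) / sigma_train

# Leaky: statistics computed on all rows, so test rows influence the transform
mu_all, sigma_all = x.mean(), x.std()
z_test_leaky = (x_test - mu_all) / sigma_all

print(f'train mean after scaling: {z_train.mean():.3f}')
print(f'test mean after scaling (train statistics): {z_test.mean():.3f}')
print(f'max |leak-free - leaky| on test rows: {np.abs(z_test - z_test_leaky).max():.4f}')
```

The scaled test mean is generally not exactly zero in the leak-free version, and that is expected: the test set is treated as unseen data.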

### Preprocessing Pipeline
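- Handle missing values using SimpleImputer (the datasets have no missing values, but we include it for demonstration).
- Neither dataset has categorical features, so we add a synthetic categorical column (binned from an existing feature) purely to demonstrate one-hot encoding.
- Apply StandardScaler to standardize numerical features (zero mean, unit variance). This puts features on a comparable scale, which helps the regularized LogisticRegression converge.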
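
As a small illustration of the encoding step, the sketch below shows how `OneHotEncoder(handle_unknown='ignore')` treats a category it never saw during fitting. The category values (including `'very_high'`) are hypothetical and only mirror the style of the synthetic column used in this notebook.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categories in the style of the notebook's synthetic bins
train_cats = np.array([['low'], ['medium'], ['high'], ['low']], dtype=object)
new_cats = np.array([['medium'], ['very_high']], dtype=object)  # 'very_high' is unseen

encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train_cats)

# The default output is a sparse matrix, so convert it for display
encoded = encoder.transform(new_cats).toarray()

print('Learned categories (sorted):', list(encoder.categories_[0]))
print(encoded)
```

The unseen category is encoded as a row of zeros rather than raising an error, which keeps prediction from failing on new data at the cost of silently dropping that information.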

### Data Splitting
- Split data into train/validation/test sets.
- Stratify classification split to maintain class proportions.
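
Because `train_test_split` produces only two parts per call, a 70/15/15 split takes two calls, and the second `test_size` must be expressed as a fraction of what remains after the first. The arithmetic can be sketched as follows (variable names are illustrative):

```python
test_fraction = 0.15
val_fraction = 0.15

# After removing the test set, 85% of the data remains
remaining = 1.0 - test_fraction

# Fraction of the remainder that must go to validation: 0.15 / 0.85
second_test_size = val_fraction / remaining

train_fraction = remaining * (1.0 - second_test_size)
val_share = remaining * second_test_size

print(f'second test_size: {second_test_size:.4f}')
print(f'train/val/test: {train_fraction:.2f}/{val_share:.2f}/{test_fraction:.2f}')
```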

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

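# Placeholder: synthetic categorical column for demonstration only.
# Note: pd.qcut computes the bin edges on the full dataset before splitting, which
# leaks a small amount of test-set information; in real work, derive bins from the training set.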
X_reg['synthetic_cat'] = pd.qcut(X_reg['MedInc'], q=3, labels=['low', 'medium', 'high'])
X_clf['synthetic_cat'] = pd.qcut(X_clf['mean radius'], q=3, labels=['small', 'medium', 'large'])

# Define numerical and categorical columns
num_features_reg = [col for col in X_reg.columns if col != 'synthetic_cat']
cat_features_reg = ['synthetic_cat']
num_features_clf = [col for col in X_clf.columns if col != 'synthetic_cat']
cat_features_clf = ['synthetic_cat']

# Preprocessing pipelines
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# ColumnTransformer for regression
preprocessor_reg = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, num_features_reg),
        ('cat', categorical_transformer, cat_features_reg)
    ])

# ColumnTransformer for classification
preprocessor_clf = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, num_features_clf),
        ('cat', categorical_transformer, cat_features_clf)
    ])

# Split regression data
X_reg_temp, X_reg_test, y_reg_temp, y_reg_test = train_test_split(
    X_reg, y_reg, test_size=0.15, random_state=RANDOM_STATE
)
X_reg_train, X_reg_val, y_reg_train, y_reg_val = train_test_split(
    X_reg_temp, y_reg_temp, test_size=0.1765, random_state=RANDOM_STATE
)  # 0.1765 of 85% = 15% of total

# Split classification data (stratified)
X_clf_temp, X_clf_test, y_clf_temp, y_clf_test = train_test_split(
    X_clf, y_clf, test_size=0.15, random_state=RANDOM_STATE, stratify=y_clf
)
X_clf_train, X_clf_val, y_clf_train, y_clf_val = train_test_split(
    X_clf_temp, y_clf_temp, test_size=0.1765, random_state=RANDOM_STATE, stratify=y_clf_temp
)

print(f'Regression splits: Train={X_reg_train.shape}, Val={X_reg_val.shape}, Test={X_reg_test.shape}')
print(f'Classification splits: Train={X_clf_train.shape}, Val={X_clf_val.shape}, Test={X_clf_test.shape}')

# Modeling

- **Regression**: Use LinearRegression to predict house prices.
- **Classification**: Use LogisticRegression for breast cancer diagnosis (solver='lbfgs', max_iter=1000 for convergence).
- Tune LogisticRegression hyperparameter `C` (inverse regularization strength) with 5-fold cross-validation on the training set (GridSearchCV); the validation set is held out as an additional check.
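
To show what "inverse regularization strength" means in practice, the following sketch fits LogisticRegression at several values of `C` and prints the size of the coefficient vector. It uses synthetic data from `make_classification` as a stand-in (an assumption made to keep the example self-contained), not the Breast Cancer dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (illustrative; not the notebook's dataset)
X, y = make_classification(n_samples=300, n_features=10, n_informative=5, random_state=42)

coef_norms = {}
for C in [0.01, 1.0, 100.0]:
    model = make_pipeline(StandardScaler(), LogisticRegression(C=C, max_iter=1000))
    model.fit(X, y)
    coefs = model.named_steps['logisticregression'].coef_
    coef_norms[C] = float(np.linalg.norm(coefs))
    print(f'C={C:>6}: ||coef|| = {coef_norms[C]:.3f}')
```

Smaller `C` means a stronger L2 penalty and therefore smaller coefficients; larger `C` lets the model fit the training data more closely.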

In [None]:
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import GridSearchCV

# Regression pipeline
reg_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor_reg),
    ('regressor', LinearRegression())
])

# Fit regression model
reg_pipeline.fit(X_reg_train, y_reg_train)

# Classification pipeline
clf_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor_clf),
    ('classifier', LogisticRegression(solver='lbfgs', max_iter=1000, random_state=RANDOM_STATE))
])

# Hyperparameter tuning for LogisticRegression
param_grid = {'classifier__C': [0.01, 0.1, 1.0, 10.0]}
grid_search = GridSearchCV(clf_pipeline, param_grid, cv=5, scoring='f1', n_jobs=-1)
grid_search.fit(X_clf_train, y_clf_train)

# Best model
clf_pipeline = grid_search.best_estimator_
print(f'Best C for LogisticRegression: {grid_search.best_params_["classifier__C"]}')

# Evaluation

- **Regression Metrics**: Mean Absolute Error (MAE), Mean Squared Error (MSE), R² score.
- **Classification Metrics**: Accuracy, precision, recall, F1 score; confusion matrix and classification report.
- **Visualizations**: Learning curve for LogisticRegression, residuals plot for regression, ROC curve for classification.
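
For reference, the classification metrics listed above can all be derived from the four cells of a binary confusion matrix. The sketch below uses hypothetical counts chosen for illustration only; they are not results from this notebook.

```python
# Hypothetical confusion-matrix counts for the positive class (illustrative only)
tp, fp, fn, tn = 85, 5, 3, 50

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)  # of predicted positives, how many are correct
recall = tp / (tp + fn)     # of actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f'accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}')
```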

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, roc_curve, auc
from sklearn.model_selection import learning_curve
import matplotlib.pyplot as plt
import seaborn as sns

# Regression evaluation
y_reg_pred_val = reg_pipeline.predict(X_reg_val)
y_reg_pred_test = reg_pipeline.predict(X_reg_test)

reg_metrics = {
    'MAE': [mean_absolute_error(y_reg_val, y_reg_pred_val), mean_absolute_error(y_reg_test, y_reg_pred_test)],
    'MSE': [mean_squared_error(y_reg_val, y_reg_pred_val), mean_squared_error(y_reg_test, y_reg_pred_test)],
    'R2': [r2_score(y_reg_val, y_reg_pred_val), r2_score(y_reg_test, y_reg_pred_test)]
}
reg_metrics_df = pd.DataFrame(reg_metrics, index=['Validation', 'Test'])
print('Regression Metrics:')
print(reg_metrics_df)

# Classification evaluation
y_clf_pred_val = clf_pipeline.predict(X_clf_val)
y_clf_pred_test = clf_pipeline.predict(X_clf_test)

clf_metrics = {
    'Accuracy': [accuracy_score(y_clf_val, y_clf_pred_val), accuracy_score(y_clf_test, y_clf_pred_test)],
    'Precision': [precision_score(y_clf_val, y_clf_pred_val, pos_label=1), precision_score(y_clf_test, y_clf_pred_test, pos_label=1)],
    'Recall': [recall_score(y_clf_val, y_clf_pred_val, pos_label=1), recall_score(y_clf_test, y_clf_pred_test, pos_label=1)],
    'F1': [f1_score(y_clf_val, y_clf_pred_val, pos_label=1), f1_score(y_clf_test, y_clf_pred_test, pos_label=1)]
}
clf_metrics_df = pd.DataFrame(clf_metrics, index=['Validation', 'Test'])
print('\nClassification Metrics:')
print(clf_metrics_df)

# Classification report
print('\nClassification Report (Test Set):')
print(classification_report(y_clf_test, y_clf_pred_test, target_names=breast_cancer.target_names))

# Confusion matrix
cm = confusion_matrix(y_clf_test, y_clf_pred_test)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=breast_cancer.target_names)
disp.plot()
plt.title('Confusion Matrix (Test Set)')
plt.show()

# Residuals plot for regression
residuals = y_reg_test - y_reg_pred_test
plt.figure(figsize=(8, 6))
sns.scatterplot(x=y_reg_pred_test, y=residuals)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs Predicted (Regression - Test Set)')
plt.show()

# ROC curve for classification
y_clf_prob_test = clf_pipeline.predict_proba(X_clf_test)[:, 1]
fpr, tpr, _ = roc_curve(y_clf_test, y_clf_prob_test, pos_label=1)
roc_auc = auc(fpr, tpr)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve (Classification - Test Set)')
plt.legend(loc='lower right')
plt.show()

# Learning curve for classification
# shuffle=True is needed for random_state to have any effect in learning_curve
train_sizes, train_scores, val_scores = learning_curve(
    clf_pipeline, X_clf_train, y_clf_train, cv=5, scoring='f1', n_jobs=-1,
    train_sizes=np.linspace(0.1, 1.0, 5), shuffle=True, random_state=RANDOM_STATE
)
plt.figure(figsize=(8, 6))
plt.plot(train_sizes, train_scores.mean(axis=1), label='Training score')
plt.plot(train_sizes, val_scores.mean(axis=1), label='Validation score')
plt.xlabel('Training Set Size')
plt.ylabel('F1 Score')
plt.title('Learning Curve (LogisticRegression)')
plt.legend(loc='best')
plt.grid(True)
plt.show()

# Results

### Regression
- **Metrics**: The R² scores (~0.6) indicate moderate predictive power for LinearRegression. MAE and MSE are consistent between validation and test sets, suggesting no significant overfitting.
- **Residuals Plot**: Residuals are scattered around zero but show some patterns, indicating potential non-linear relationships not captured by LinearRegression.
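
As a reminder of how to read these numbers, R² compares the model's squared error with that of a baseline that always predicts the mean, so a value near 0.6 means roughly 60% of the target variance is explained. The sketch below computes MAE, MSE, and R² by hand on small made-up arrays (illustrative values, not notebook output).

```python
import numpy as np

# Small made-up targets and predictions (illustrative only)
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.2, 1.8, 3.5, 3.9, 4.6])

residuals = y_true - y_pred
mae = np.mean(np.abs(residuals))
mse = np.mean(residuals ** 2)

ss_res = np.sum(residuals ** 2)                  # variation left unexplained
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
r2 = 1.0 - ss_res / ss_tot

# A baseline that always predicts the mean scores R^2 = 0
baseline = np.full_like(y_true, y_true.mean())
r2_baseline = 1.0 - np.sum((y_true - baseline) ** 2) / ss_tot

print(f'MAE={mae:.2f} MSE={mse:.2f} R2={r2:.2f} baseline R2={r2_baseline:.2f}')
```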

### Classification
- **Metrics**: High accuracy, precision, recall, and F1 scores (~0.95–0.98) show strong performance. The test set metrics are close to validation, indicating good generalization.
- **Confusion Matrix**: Few misclassifications, with high true positives for the benign class (positive label).
- **ROC Curve**: AUC close to 1.0 confirms excellent discrimination between classes.
- **Learning Curve**: The training and validation F1 scores converge at larger training sizes, suggesting the model is well-fit with no significant overfitting or underfitting.
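
One way to interpret the AUC is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way on a tiny made-up example (labels and scores are illustrative only).

```python
from itertools import product

# Made-up labels and predicted scores (illustrative only)
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

pos = [s for s, y in zip(scores, labels) if y == 1]
neg = [s for s, y in zip(scores, labels) if y == 0]

# Count positive/negative pairs that are ranked correctly; ties count as half
wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in product(pos, neg))
auc_pairwise = wins / (len(pos) * len(neg))

print(f'AUC = {auc_pairwise:.2f}')
```

Here three of the four positive/negative pairs are ordered correctly, which gives an AUC of 0.75.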

### Feature Scaling
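- Standardizing features (via StandardScaler) puts them on a comparable scale, so the L2 penalty in LogisticRegression treats all coefficients evenly and the lbfgs solver converges more reliably. This notebook did not compare against an unscaled model, so the size of the benefit was not measured here.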

### Class Imbalance
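- The Breast Cancer dataset is mildly imbalanced (~63% benign, ~37% malignant). Stratified splitting keeps these proportions in every split, and reporting precision, recall, and F1 alongside accuracy makes majority-class bias easier to spot. Note that F1 is computed here for the benign class (label 1), which is the majority class; in a diagnostic setting, recall on the malignant class (label 0) is usually the more important quantity.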
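
The effect of stratification can be checked directly. The sketch below builds synthetic labels with roughly the same 63/37 balance (an assumption for illustration; the feature column is a dummy) and compares the positive rate in a stratified and a non-stratified test split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with a 63/37 balance (illustrative only)
y = np.array([1] * 630 + [0] * 370)
X = np.arange(len(y)).reshape(-1, 1)  # dummy feature column

_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.15, random_state=42, stratify=y
)
_, _, _, y_test_plain = train_test_split(X, y, test_size=0.15, random_state=42)

print(f'overall positive rate:        {y.mean():.3f}')
print(f'stratified test positive rate: {y_test_strat.mean():.3f}')
print(f'plain test positive rate:      {y_test_plain.mean():.3f}')
```

The stratified split matches the overall rate up to rounding, while the plain split can drift by chance, which matters more as datasets get smaller.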

# Conclusion and Next Steps

### Summary
- The notebook successfully demonstrated regression (LinearRegression) and classification (LogisticRegression) workflows using scikit-learn Pipelines.
- Preprocessing handled missing values and scaling, though datasets had no missing data. A synthetic categorical feature showcased encoding.
- Models performed well, but LinearRegression may benefit from capturing non-linear patterns.

### Next Steps
1. **Regularized Models**: Try Ridge or Lasso regression to improve LinearRegression by addressing potential multicollinearity in the California Housing dataset.
2. **Feature Engineering**: Create interaction terms or polynomial features for regression to capture non-linear relationships.
3. **Cross-Validation**: The notebook already uses 5-fold cross-validation to tune `C`; extend it to the regression model and to performance estimation (for example, nested cross-validation) for more robust results.
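
A possible starting point for these next steps is sketched below. It combines polynomial features, Ridge regularization, and k-fold cross-validation on synthetic non-linear data; the data, `degree=2`, and `alpha=1.0` are all illustrative assumptions, not tuned choices for the California Housing dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data with a quadratic term (stands in for a non-linear relationship)
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(300, 2))
y = X[:, 0] ** 2 + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=300)

linear = make_pipeline(StandardScaler(), LinearRegression())
poly_ridge = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    Ridge(alpha=1.0),
)

# 5-fold cross-validated R^2 for each pipeline
r2_linear = cross_val_score(linear, X, y, cv=5, scoring='r2').mean()
r2_poly = cross_val_score(poly_ridge, X, y, cv=5, scoring='r2').mean()

print(f'Linear CV R2:          {r2_linear:.3f}')
print(f'Polynomial+Ridge CV R2: {r2_poly:.3f}')
```

On data like this the plain linear model misses the quadratic term, while the polynomial pipeline captures it. Whether the same gain appears on the housing data would need to be checked empirically.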