
# 🧪 Midterm Project: Data Science with Python — Notebook Scaffold

**Generated:** 2025-10-15 16:01

This notebook is a scaffold aligned to your assignment requirements:
- Data acquisition, cleaning, preprocessing
- EDA & visualizations (≥ 3 plots)
- Two ML models + ≥ 2 evaluation metrics
- Reproducible and well-commented

> Replace any **TODO** blocks with your team's specifics (dataset path, target column, decisions, etc.).



## 0. Project Config (EDIT ME)

Fill these in for your chosen dataset.


In [None]:

# === Project Configuration ===
CONFIG = {
    # TODO: set the absolute/relative path to your dataset file (CSV recommended for reproducibility)
    "data_path": "PATH/TO/YOUR_DATA.csv",

    # TODO: set your target column name after you inspect the dataset
    "target": None,  # e.g., "SalePrice" or "churned"

    # Optional: rows to sample for quick iteration (set to None to use full data)
    "row_limit": None,

    # Optional: treat specific values as NaN
    "na_values": ["NA", "N/A", "", "?", "null", "None"]
}



## 1. Data Acquisition

- **Source**: Describe where you obtained the data (Kaggle/UCI/API/etc.).
- **License**: Note any usage restrictions if applicable.
- **Method**: If from an API, paste example request; if file, document download steps.
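
One way to document a file download reproducibly is to record a checksum alongside the source and license notes, so teammates can confirm they are working from the same file. The sketch below is a minimal illustration, not a required step; `file_sha256`, `describe_source`, and the dictionary keys are hypothetical names introduced here.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def describe_source(path, source, license_note):
    """Bundle the provenance notes for a dataset file into one dictionary."""
    p = Path(path)
    return {
        "file": p.name,
        "bytes": p.stat().st_size,
        "sha256": file_sha256(p),
        "source": source,          # where you obtained the file
        "license": license_note,   # usage restrictions, if any
    }
```

Once `CONFIG["data_path"]` is set, calling `describe_source` on it with your own source and license text gives a record you can paste into this section.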


In [None]:

# === Load Data ===
assert CONFIG["data_path"] != "PATH/TO/YOUR_DATA.csv", "Please set CONFIG['data_path'] to your dataset file."

read_kwargs = {}
if CONFIG["na_values"]:
    read_kwargs["na_values"] = CONFIG["na_values"]

df = pd.read_csv(CONFIG["data_path"], **read_kwargs)

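if CONFIG["row_limit"]:
    # Cap the sample size at the number of rows so small datasets don't raise an error
    n_rows = min(CONFIG["row_limit"], len(df))
    df = df.sample(n_rows, random_state=RANDOM_STATE)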

print("Shape:", df.shape)
df.head()



## 2. Cleaning

Document decisions:
- How many duplicates? Dropped?
- Which columns had missing values? Strategy per column (drop vs. impute) and **why**.
- Any obvious invalid categories/out-of-range numeric values corrected?
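
The questions above can be answered with a few pandas calls. The sketch below uses a made-up toy frame (the column names and the 0 to 110 age range are assumptions for illustration) to count duplicates, report missingness per column, and flag out-of-range values.

```python
import numpy as np
import pandas as pd

# Toy data standing in for your dataset (values are made up for illustration).
toy = pd.DataFrame({
    "age": [34, np.nan, 29, 34, 120],
    "city": ["Austin", "Boston", None, "Austin", "Boston"],
    "income": [52000.0, 61000.0, np.nan, 52000.0, 58000.0],
})

# Count exact duplicate rows (row 3 repeats row 0).
n_duplicates = int(toy.duplicated().sum())

# Per-column missing counts and percentages, worth quoting in your write-up.
missing_report = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": (toy.isna().mean() * 100).round(1),
})

# Flag out-of-range values with an explicit, documented rule (assumed valid range: 0 to 110).
age_out_of_range = toy.loc[~toy["age"].between(0, 110) & toy["age"].notna(), "age"]

print(f"Duplicate rows: {n_duplicates}")
print(missing_report)
print("Out-of-range ages:", age_out_of_range.tolist())
```

Whether a flagged value is corrected, set to missing, or dropped is a judgment call; record the rule and the reason next to the code that applies it.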


In [None]:

# === Inspect Raw ===
df.info()  # prints its summary directly and returns None, so display() is not needed
display(df.describe(include='all').T)

# === Basic Cleaning Example (customize for your data) ===

# 1) Remove duplicate rows if any
before = len(df)
df = df.drop_duplicates()
after = len(df)
print(f"Removed {before - after} duplicate rows.")

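# 2) Strip whitespace from object (string) columns.
#    Non-string values, including NaN, are left untouched so missing values stay missing.
for col in df.select_dtypes(include=['object']).columns:
    df[col] = df[col].map(lambda v: v.strip() if isinstance(v, str) else v)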

# 3) Optional: drop columns with excessive missingness (> 60%) — justify in report
missing_ratio = df.isna().mean().sort_values(ascending=False)
print("Missing ratio (top 10):")
display(missing_ratio.head(10))

# Example threshold (EDIT/justify)
THRESH = 0.60
to_drop = missing_ratio[missing_ratio > THRESH].index.tolist()
if to_drop:
    print("Dropping columns due to high missingness:", to_drop)
    df = df.drop(columns=to_drop)

print("Shape after cleaning:", df.shape)



## 3. Preprocessing

Use at least **two** techniques (e.g., encoding categoricals + scaling numerics).  
Engineered features are a plus; explain the rationale.
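
As an illustration of engineered features, the sketch below derives a ratio and a difference from existing columns. The toy frame and its column names (`total_area`, `n_rooms`, `year_built`, `year_sold`) are hypothetical; substitute features that make sense for your dataset.

```python
import pandas as pd

# Toy housing-style data; the column names are hypothetical.
toy = pd.DataFrame({
    "total_area": [1200.0, 850.0, 2000.0],
    "n_rooms": [4, 3, 0],
    "year_built": [1995, 2010, 1978],
    "year_sold": [2020, 2021, 2019],
})


def add_engineered_features(frame):
    """Return a copy of `frame` with two derived columns added."""
    out = frame.copy()
    # Ratio feature: area per room; a room count of 0 becomes missing rather than infinite.
    rooms = out["n_rooms"].where(out["n_rooms"] > 0)
    out["area_per_room"] = out["total_area"] / rooms
    # Difference feature: age of the house at the time of sale.
    out["age_at_sale"] = out["year_sold"] - out["year_built"]
    return out


engineered = add_engineered_features(toy)
print(engineered[["area_per_room", "age_at_sale"]])
```

Both features are computed row by row, so adding them before the train/test split does not leak test information. Anything that depends on dataset-wide statistics (means, medians, category frequencies) should still be learned from the training set only.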


In [None]:

# === Identify Features & Target ===
assert CONFIG["target"] is not None, "Set CONFIG['target'] to your target column name."
assert CONFIG["target"] in df.columns, f"Target '{CONFIG['target']}' not found in columns."

y = df[CONFIG["target"]]
X = df.drop(columns=[CONFIG["target"]])

# Drop columns that are entirely NA after cleaning (defensive)
X = X.dropna(axis=1, how='all')

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE, stratify=y if y.nunique() < 20 else None
)

# Column types
cat_cols = X_train.select_dtypes(include=['object', 'category']).columns.tolist()
num_cols = X_train.select_dtypes(include=[np.number]).columns.tolist()

print("Categorical features:", cat_cols[:10], "..." if len(cat_cols) > 10 else "")
print("Numeric features:", num_cols[:10], "..." if len(num_cols) > 10 else "")

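# Preprocessing pipelines
from sklearn.impute import SimpleImputer  # imported here so this cell has everything it needs

# Numeric columns: fill gaps with the median, then standardize
numeric_transformer = Pipeline(steps=[
    ("imputer",  SimpleImputer(strategy="median")),
    ("scaler",   StandardScaler())
])

# Categorical columns: fill gaps with the most frequent value, then one-hot encode
categorical_transformer = Pipeline(steps=[
    ("imputer",  SimpleImputer(strategy="most_frequent")),
    ("encoder",  OneHotEncoder(handle_unknown="ignore"))
])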

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, num_cols),
        ("cat", categorical_transformer, cat_cols)
    ]
)

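# Heuristic: decide if task is classification or regression based on target dtype/cardinality.
# Text, categorical, and boolean targets are treated as class labels; integer targets only
# when they have relatively few distinct values. Set is_classification by hand if this guesses wrong.
is_low_card_int = y.dtype.kind in "iu" and y.nunique() < max(50, int(0.2 * len(y)))
is_label_like = (
    pd.api.types.is_object_dtype(y)
    or pd.api.types.is_string_dtype(y)
    or pd.api.types.is_bool_dtype(y)
    or isinstance(y.dtype, pd.CategoricalDtype)
)
is_classification = bool(is_low_card_int or is_label_like)
print("Task type:", "Classification" if is_classification else "Regression")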



## 4. Exploratory Data Analysis (EDA) + Visualizations

Create at least **three** distinct plots. Suggestions:
- Histograms of key numeric features
- Box plots to inspect outliers
- Scatter plots showing relationships with the target (if numeric)
- Correlation heatmap (for numeric features)

> Ensure titles, axis labels, and clear legends where applicable.
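
The scatter-plot suggestion above could look like the following sketch. The helper name `scatter_vs_target` and the synthetic `area`/`price` data are assumptions for illustration; in the notebook you would pass a frame that holds both your feature and your numeric target.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def scatter_vs_target(frame, feature, target):
    """Scatter one numeric feature against a numeric target, with a title and axis labels."""
    fig, ax = plt.subplots()
    ax.scatter(frame[feature], frame[target], alpha=0.6)
    ax.set_title(f"{target} vs. {feature}")
    ax.set_xlabel(feature)
    ax.set_ylabel(target)
    return fig, ax


# Synthetic data for illustration only; swap in your own frame and column names.
rng = np.random.default_rng(42)
toy = pd.DataFrame({"area": rng.uniform(50, 250, size=100)})
toy["price"] = 1000 * toy["area"] + rng.normal(0, 20000, size=100)

fig, ax = scatter_vs_target(toy, "area", "price")
```

In a notebook cell, follow the call with `plt.show()` to render the figure. Returning `fig` and `ax` keeps the plot available for further labeling or for saving later.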


In [None]:

# === Basic EDA ===
print("Train shape:", X_train.shape, "Test shape:", X_test.shape)
display(X_train.describe(include='all').T)

# 1) Histogram: numeric columns
for col in X_train.select_dtypes(include=[np.number]).columns[:3]:
    plt.figure()
    X_train[col].hist(bins=30)
    plt.title(f"Histogram of {col}")
    plt.xlabel(col)
    plt.ylabel("Count")
    plt.show()

# 2) Box plot: first numeric column (if available)
num_cols_plot = X_train.select_dtypes(include=[np.number]).columns
if len(num_cols_plot) > 0:
    c = num_cols_plot[0]
    plt.figure()
    plt.boxplot(X_train[c].dropna())  # vertical orientation is the default
    plt.title(f"Box Plot of {c}")
    plt.ylabel(c)
    plt.show()

# 3) Correlation heatmap (numeric only)
if len(num_cols_plot) > 1:
    corr = X_train[num_cols_plot].corr()
    plt.figure()
    plt.imshow(corr, aspect='auto', interpolation='nearest')
    plt.colorbar()
    plt.title("Correlation Heatmap (numeric features)")
    plt.xticks(range(len(num_cols_plot)), num_cols_plot, rotation=90)
    plt.yticks(range(len(num_cols_plot)), num_cols_plot)
    plt.tight_layout()
    plt.show()



## 5. Modeling

Train **two** algorithms appropriate to your task.  
We’ll build a baseline linear model and a Random Forest with light tuning.


In [None]:

# === Build Models ===

results = []

if is_classification:
    # Model A: Logistic Regression (baseline)
    clf_lr = Pipeline(steps=[
        ("preprocess", preprocessor),
        ("model", LogisticRegression(max_iter=1000, random_state=RANDOM_STATE))
    ])
    clf_lr.fit(X_train, y_train)
    y_pred_lr = clf_lr.predict(X_test)
    # Class probabilities: positive-class column for binary targets, full matrix for multiclass
    proba_lr = clf_lr.predict_proba(X_test)
    y_proba_lr = proba_lr[:, 1] if proba_lr.shape[1] == 2 else proba_lr

    acc = accuracy_score(y_test, y_pred_lr)
    prec = precision_score(y_test, y_pred_lr, average='weighted', zero_division=0)
    rec = recall_score(y_test, y_pred_lr, average='weighted', zero_division=0)
    f1 = f1_score(y_test, y_pred_lr, average='weighted', zero_division=0)
    try:
        if proba_lr.shape[1] == 2:
            roc = roc_auc_score(y_test, y_proba_lr)
        else:
            # One-vs-rest AUC needs the full probability matrix, one column per class
            roc = roc_auc_score(y_test, y_proba_lr, multi_class='ovr', labels=clf_lr.classes_)
    except ValueError:
        roc = None  # e.g., a class is absent from y_test, so ROC AUC is undefined

    results.append({
        "model": "LogisticRegression",
        "accuracy": acc, "precision": prec, "recall": rec, "f1": f1, "roc_auc": roc
    })

    # Model B: RandomForestClassifier with light tuning
    clf_rf = Pipeline(steps=[
        ("preprocess", preprocessor),
        ("model", RandomForestClassifier(random_state=RANDOM_STATE))
    ])

    param_grid = {
        "model__n_estimators": [200, 400],
        "model__max_depth": [None, 10, 20],
        "model__min_samples_split": [2, 5]
    }

    grid = GridSearchCV(clf_rf, param_grid, cv=3, n_jobs=-1, scoring="f1_weighted")
    grid.fit(X_train, y_train)

    best_rf = grid.best_estimator_
    y_pred_rf = best_rf.predict(X_test)
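    # Class probabilities: positive-class column for binary targets, full matrix for multiclass
    proba_rf = best_rf.predict_proba(X_test)
    y_proba_rf = proba_rf[:, 1] if proba_rf.shape[1] == 2 else proba_rf

    acc = accuracy_score(y_test, y_pred_rf)
    prec = precision_score(y_test, y_pred_rf, average='weighted', zero_division=0)
    rec = recall_score(y_test, y_pred_rf, average='weighted', zero_division=0)
    f1 = f1_score(y_test, y_pred_rf, average='weighted', zero_division=0)
    try:
        if proba_rf.shape[1] == 2:
            roc = roc_auc_score(y_test, y_proba_rf)
        else:
            # One-vs-rest AUC needs the full probability matrix, one column per class
            roc = roc_auc_score(y_test, y_proba_rf, multi_class='ovr', labels=best_rf.classes_)
    except ValueError:
        roc = None  # e.g., a class is absent from y_test, so ROC AUC is undefined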

    results.append({
        "model": "RandomForestClassifier (tuned)",
        "accuracy": acc, "precision": prec, "recall": rec, "f1": f1, "roc_auc": roc,
        "best_params": grid.best_params_
    })

else:
    # Regression path
    # Model A: Linear Regression (baseline)
    reg_lr = Pipeline(steps=[
        ("preprocess", preprocessor),
        ("model", LinearRegression())
    ])
    reg_lr.fit(X_train, y_train)
    y_pred_lr = reg_lr.predict(X_test)

    # RMSE as the square root of MSE; the squared=False argument is deprecated in recent scikit-learn
    rmse = np.sqrt(mean_squared_error(y_test, y_pred_lr))
    mae = mean_absolute_error(y_test, y_pred_lr)
    r2 = r2_score(y_test, y_pred_lr)

    results.append({
        "model": "LinearRegression",
        "rmse": rmse, "mae": mae, "r2": r2
    })

    # Model B: RandomForestRegressor with light tuning
    reg_rf = Pipeline(steps=[
        ("preprocess", preprocessor),
        ("model", RandomForestRegressor(random_state=RANDOM_STATE))
    ])

    param_grid = {
        "model__n_estimators": [200, 400],
        "model__max_depth": [None, 10, 20],
        "model__min_samples_split": [2, 5]
    }

    grid = GridSearchCV(reg_rf, param_grid, cv=3, n_jobs=-1, scoring="neg_root_mean_squared_error")
    grid.fit(X_train, y_train)

    best_rf = grid.best_estimator_
    y_pred_rf = best_rf.predict(X_test)

    # RMSE as the square root of MSE; the squared=False argument is deprecated in recent scikit-learn
    rmse = np.sqrt(mean_squared_error(y_test, y_pred_rf))
    mae = mean_absolute_error(y_test, y_pred_rf)
    r2 = r2_score(y_test, y_pred_rf)

    results.append({
        "model": "RandomForestRegressor (tuned)",
        "rmse": rmse, "mae": mae, "r2": r2,
        "best_params": grid.best_params_
    })

# Display results
pd.DataFrame(results)



## 6. Interpretation (Plain English)

Summarize what the metrics mean in context of your problem:
- If classification: Which model generalizes better? Discuss precision/recall tradeoffs.
- If regression: Compare RMSE/MAE/R² and what they imply for prediction quality.
- Note any limitations, class imbalance, or data constraints.
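
To make the precision/recall discussion concrete, here is a small worked example on made-up labels (not results from any real model). It shows how the two metrics follow from the confusion-matrix counts.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for ten test cases (1 = positive class).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# For binary labels ordered [0, 1], ravel() yields tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)

print(f"tp={tp}, fp={fp}, fn={fn}, tn={tn}")
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

In this toy case the model is right about two of the three cases it flags (precision of about 0.67) but finds only two of the four actual positives (recall of 0.50). Which of those matters more depends on the cost of a false alarm versus a missed case in your problem.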



## 7. Save Artifacts (Optional)

Export cleaned data, model, or figures for your report.
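
For the model itself, one common option is `joblib`, which is installed alongside scikit-learn. The sketch below fits a tiny stand-in pipeline on made-up numbers, saves it to a temporary folder, and reloads it; in the notebook you would save your own fitted pipeline to your output folder instead.

```python
import os
import tempfile

import joblib
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny stand-in pipeline; in the notebook you would use your fitted pipeline instead.
X_toy = [[0.0], [1.0], [2.0], [3.0]]
y_toy = [1.0, 3.0, 5.0, 7.0]
model = Pipeline(steps=[("scaler", StandardScaler()), ("model", LinearRegression())])
model.fit(X_toy, y_toy)

# Write to a temporary folder here; use your own output folder in practice.
out_dir = tempfile.mkdtemp()
model_path = os.path.join(out_dir, "model.joblib")
joblib.dump(model, model_path)

# Reload and check that the restored pipeline still predicts.
restored = joblib.load(model_path)
print(restored.predict([[4.0]]))
```

Saving the whole pipeline, rather than only the final estimator, keeps the preprocessing steps bundled with the model. Files written this way are pickle-based, so load only files you trust and use a compatible scikit-learn version when reloading.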


In [None]:

# === Save Outputs ===
OUTPUT_DIR = "outputs"
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Example: save cleaned dataset
df.to_csv(os.path.join(OUTPUT_DIR, "cleaned_dataset.csv"), index=False)

# Save results table (named results_df because `_` is reserved by IPython for the last output)
results_df = pd.DataFrame(results)
results_df.to_csv(os.path.join(OUTPUT_DIR, "model_results.csv"), index=False)

print("Saved:", os.listdir(OUTPUT_DIR))


In [None]:

# === Setup & Imports ===
# Run this cell FIRST: every code cell above relies on these imports and on RANDOM_STATE.
# If you don't have some packages, install them first (uncomment the next line in your own environment).
# !pip install pandas numpy scikit-learn matplotlib
import os
import json
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score,
    mean_squared_error, mean_absolute_error, r2_score
)
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Plot settings (do not set specific colors per project rules)
plt.rcParams.update({'figure.figsize': (8, 5)})
pd.set_option('display.max_columns', 100)

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)
