# Capstone Project — Initial Report & Exploratory Data Analysis (EDA)

This notebook satisfies the deliverables for **Module 20.1**:
- Data cleaning + EDA
- Clear, labeled visualizations
- Feature engineering
- **One baseline ML model** with metric + interpretation

> **How to use:** Put your dataset into `../data/` and set `DATA_PATH` below.


In [None]:
# 1) Imports
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

from sklearn.linear_model import LogisticRegression, LinearRegression

pd.set_option("display.max_columns", 200)


## 2) Load data

Update `DATA_PATH` to point at your file. `TARGET_COL` is set later, in section 4, once you know your target column name.


In [None]:
DATA_PATH = "../data/dataset.csv"   # <- change if your file name differs
df = pd.read_csv(DATA_PATH)

print("Shape:", df.shape)
df.head()


## 3) Initial exploration

We check:
- column types
- missing values
- duplicates
- basic stats


In [None]:
df.info()


In [None]:
df.isna().sum().sort_values(ascending=False).head(25)


In [None]:
dup_count = df.duplicated().sum()
dup_count


## 4) Data cleaning

**Recommended minimum:**
- remove duplicates
- handle missing values (drop or impute)
- sanity-check values (e.g., negative ages, impossible ranges)
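
The sanity-check step can be sketched as a range check per column. This is a minimal illustration on toy data: the column names `age` and `price`, the ranges in `valid_ranges`, and the helper `count_out_of_range` are all hypothetical and should be replaced with rules that fit your dataset.

```python
import pandas as pd

# Toy data; the column names "age" and "price" are hypothetical.
toy = pd.DataFrame({
    "age": [25, -3, 41, 130, 37],
    "price": [10.0, 12.5, -1.0, 8.0, 9.5],
})

# Hypothetical valid ranges (inclusive) for each column.
valid_ranges = {"age": (0, 120), "price": (0.0, float("inf"))}


def count_out_of_range(frame: pd.DataFrame, ranges: dict) -> pd.Series:
    """Count non-missing values outside [low, high] for each listed column."""
    counts = {}
    for col, (low, high) in ranges.items():
        values = frame[col].dropna()
        counts[col] = int((~values.between(low, high)).sum())
    return pd.Series(counts, name="out_of_range")


violations = count_out_of_range(toy, valid_ranges)
print(violations)
```

Whether flagged rows should be dropped, capped, or set to missing depends on the dataset, so the sketch only counts them.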


In [None]:
# Remove duplicates
df = df.drop_duplicates()
print("After duplicates removal:", df.shape)


### Missing values strategy

Pick **one** approach:
- If the dataset is large and missingness is small → drop rows with a missing target or missing key columns
- Otherwise → impute numeric columns with the median and categorical columns with the most frequent value
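
The two approaches above might look like this on a toy frame. The column names `target`, `num`, and `cat` are made up for illustration. For modeling, imputation is better fit on the training split only, which the pipeline in section 10 handles; the direct `fillna` calls here are meant for exploration.

```python
import numpy as np
import pandas as pd

# Toy data with missing values; column names are hypothetical.
toy = pd.DataFrame({
    "target": [1.0, 0.0, np.nan, 1.0, 0.0, 1.0],
    "num": [1.0, np.nan, 3.0, 4.0, 100.0, 6.0],
    "cat": ["a", "b", None, "a", "a", "b"],
})

# Approach 1: drop rows whose target is missing.
dropped = toy.dropna(subset=["target"])

# Approach 2: impute numeric with the median, categorical with the most frequent value.
imputed = toy.copy()
imputed["num"] = imputed["num"].fillna(imputed["num"].median())
imputed["cat"] = imputed["cat"].fillna(imputed["cat"].mode().iloc[0])

print(dropped.shape)
print(imputed[["num", "cat"]])
```

The median is used rather than the mean because a single extreme value (100.0 in the toy column) would otherwise pull the imputed value upward.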


In [None]:
# Example: drop rows where target is missing (update TARGET_COL after you set it)
TARGET_COL = None  # <- set like: "churn" or "price"

# If you know the target, uncomment:
# df = df.dropna(subset=[TARGET_COL])

# Quick overall view
df.isna().mean().sort_values(ascending=False).head(20)


## 5) Feature overview (categorical vs numeric)

In [None]:
numeric_cols = df.select_dtypes(include=["number"]).columns.tolist()
categorical_cols = df.select_dtypes(include=["object", "category", "bool"]).columns.tolist()

print("Numeric cols:", len(numeric_cols))
print("Categorical cols:", len(categorical_cols))

numeric_cols[:10], categorical_cols[:10]


## 6) EDA — univariate plots

### 6.1 Numeric distributions


In [None]:
# Plot up to first 6 numeric columns (adjust as needed)
cols_to_plot = numeric_cols[:6]

for col in cols_to_plot:
    plt.figure(figsize=(7, 4))
    sns.histplot(df[col].dropna(), kde=True)
    plt.title(f"Distribution of {col}")
    plt.xlabel(col)
    plt.ylabel("Count")
    plt.tight_layout()
    plt.show()


### 6.2 Categorical counts

In [None]:
# Plot up to first 6 categorical columns (adjust as needed)
cat_to_plot = categorical_cols[:6]

for col in cat_to_plot:
    plt.figure(figsize=(7, 4))
    top_vals = df[col].astype(str).value_counts().head(15)
    sns.barplot(x=top_vals.values, y=top_vals.index)
    plt.title(f"Top categories in {col}")
    plt.xlabel("Count")
    plt.ylabel(col)
    plt.tight_layout()
    plt.show()


## 7) EDA — relationships

### 7.1 Correlation heatmap (numeric only)


In [None]:
if len(numeric_cols) > 1:
    plt.figure(figsize=(10, 6))
    corr = df[numeric_cols].corr(numeric_only=True)
    sns.heatmap(corr, annot=False)
    plt.title("Correlation heatmap (numeric features)")
    plt.tight_layout()
    plt.show()


### 7.2 Target vs feature plots

Once you set `TARGET_COL`, you can compare:
- numeric feature vs target (boxplot for classification, scatter for regression)
- categorical feature vs target (count/mean plots)
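
For the categorical-feature case, a compact tabular summary can precede any plot. The sketch below uses toy data with hypothetical columns `plan`, `churn`, and `spend`: a normalized cross-tabulation for a classification-like target and a per-category mean for a regression-like target.

```python
import pandas as pd

# Toy data; "plan", "churn", and "spend" are hypothetical column names.
toy = pd.DataFrame({
    "plan": ["basic", "basic", "pro", "pro", "pro", "basic"],
    "churn": ["yes", "no", "no", "no", "yes", "yes"],
    "spend": [10.0, 20.0, 50.0, 70.0, 60.0, 30.0],
})

# Classification-like target: share of each target class within each category.
class_share = pd.crosstab(toy["plan"], toy["churn"], normalize="index")

# Regression-like target: mean target value per category.
mean_by_cat = toy.groupby("plan")["spend"].mean()

print(class_share)
print(mean_by_cat)
```

Either result can be passed to a bar plot, for example `class_share.plot(kind="bar", stacked=True)`, with a title and axis labels added as in the other cells.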


In [None]:
# Set your target column name here
# TARGET_COL = "your_target"

if TARGET_COL is not None and TARGET_COL in df.columns:
    if pd.api.types.is_bool_dtype(df[TARGET_COL]) or not pd.api.types.is_numeric_dtype(df[TARGET_COL]):
        # Classification-like target
        for col in numeric_cols[:4]:
            plt.figure(figsize=(7, 4))
            sns.boxplot(x=df[TARGET_COL].astype(str), y=df[col])
            plt.title(f"{col} by {TARGET_COL}")
            plt.xlabel(TARGET_COL)
            plt.ylabel(col)
            plt.tight_layout()
            plt.show()
    else:
        # Regression-like target
        for col in numeric_cols[:4]:
            if col == TARGET_COL:
                continue
            plt.figure(figsize=(7, 4))
            sns.scatterplot(x=df[col], y=df[TARGET_COL])
            plt.title(f"{TARGET_COL} vs {col}")
            plt.xlabel(col)
            plt.ylabel(TARGET_COL)
            plt.tight_layout()
            plt.show()
else:
    print("Set TARGET_COL to enable target-based plots.")


## 8) Outlier analysis (numeric)

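We use the IQR rule to flag potential outliers in the first few numeric columns: a value is flagged if it lies below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR. A flagged value is a candidate for review, not necessarily an error.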


In [None]:
def iqr_outlier_fraction(series: pd.Series) -> float:
    s = series.dropna()
    if s.empty:
        return 0.0
    q1, q3 = np.percentile(s, [25, 75])
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return ((s < lower) | (s > upper)).mean()

outlier_report = []
for col in numeric_cols[:10]:
    frac = iqr_outlier_fraction(df[col])
    outlier_report.append((col, frac))

outlier_df = pd.DataFrame(outlier_report, columns=["column", "outlier_fraction"]).sort_values("outlier_fraction", ascending=False)
outlier_df


## 9) Feature engineering (examples)

Add transformations that make sense for your dataset, such as:
- ratios (e.g., spend / visits)
- logs (for skewed variables)
- binning (e.g., age group)
- date feature extraction (year/month/day)
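
The four transformation types listed above can be sketched on a toy frame as follows. All column names (`spend`, `visits`, `income`, `age`, `signup_date`) and the age bin edges are hypothetical; adapt them to columns that exist in your data.

```python
import numpy as np
import pandas as pd

# Toy data; every column name here is hypothetical.
toy = pd.DataFrame({
    "spend": [100.0, 250.0, 0.0, 80.0],
    "visits": [4, 10, 0, 2],
    "income": [30000.0, 120000.0, 45000.0, 0.0],
    "age": [17, 34, 52, 71],
    "signup_date": ["2023-01-15", "2023-06-30", "2024-02-29", "2024-12-01"],
})

# Ratio: turn zero visits into NaN to avoid division by zero.
toy["spend_per_visit"] = toy["spend"] / toy["visits"].replace(0, np.nan)

# Log: log1p(x) = log(1 + x), which is defined at x = 0.
toy["log_income"] = np.log1p(toy["income"])

# Binning: right-closed intervals (0, 18], (18, 40], (40, 65], (65, 120].
toy["age_group"] = pd.cut(
    toy["age"],
    bins=[0, 18, 40, 65, 120],
    labels=["minor", "young_adult", "middle_aged", "senior"],
)

# Date features: parse once, then extract the parts.
signup = pd.to_datetime(toy["signup_date"])
toy["signup_year"] = signup.dt.year
toy["signup_month"] = signup.dt.month
toy["signup_day"] = signup.dt.day

print(toy)
```

Rows where the ratio is undefined end up as missing values, which the median imputer in the baseline pipeline would then fill.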


In [None]:
# Example: log transform a skewed numeric column (replace with your own)
# if "income" in df.columns:
#     df["log_income"] = np.log1p(df["income"])

df.head()


## 10) Baseline model

Pick the correct baseline depending on your task:

### Classification
- Logistic Regression
- Metric: Accuracy (if classes are balanced) or F1 (if classes are imbalanced)

### Regression
- Linear Regression
- Metric: RMSE / MAE, plus R²
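
To see why the metric choice matters, consider the toy example below. The labels and predictions are invented for illustration. The first part shows a predictor that always outputs the majority class on a 90/10 split; the second part compares MAE and RMSE on predictions that contain one large error.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    mean_absolute_error,
    mean_squared_error,
)

# Classification: 90 negatives, 10 positives (invented labels).
y_true = np.array([0] * 90 + [1] * 10)
# A "model" that always predicts the majority class.
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
# zero_division=0 returns 0.0 instead of warning when no positives are predicted.
f1 = f1_score(y_true, y_pred, zero_division=0)
print("Accuracy:", acc, "F1:", f1)

# Regression: absolute errors are 2, 2, 0, and 10 (invented values).
y_reg_true = np.array([10.0, 20.0, 30.0, 40.0])
y_reg_pred = np.array([12.0, 18.0, 30.0, 50.0])

mae = mean_absolute_error(y_reg_true, y_reg_pred)
rmse = np.sqrt(mean_squared_error(y_reg_true, y_reg_pred))
print("MAE:", mae, "RMSE:", rmse)
```

Accuracy looks high even though the predictor never finds a positive case, while F1 for the positive class is zero. In the regression part, RMSE exceeds MAE because squaring gives the single large error more weight.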


In [None]:
# Decide task type automatically (best-effort)
# If TARGET_COL is numeric -> regression; else -> classification
if TARGET_COL is None:
    raise ValueError("Please set TARGET_COL before modeling.")

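# Rows with a missing target cannot be used for supervised training
df_model = df.dropna(subset=[TARGET_COL])

y = df_model[TARGET_COL]
X = df_model.drop(columns=[TARGET_COL])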

# Identify columns
numeric_features = X.select_dtypes(include=["number"]).columns.tolist()
categorical_features = X.select_dtypes(include=["object", "category", "bool"]).columns.tolist()

# Preprocess
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median"))
])

categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ],
    remainder="drop"
)

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Choose model
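# pandas counts bool as numeric, so exclude it: a True/False target is classification
is_regression = pd.api.types.is_numeric_dtype(y) and not pd.api.types.is_bool_dtype(y)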

if is_regression:
    model = LinearRegression()
else:
    model = LogisticRegression(max_iter=2000)

clf = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("model", model)
])

clf.fit(X_train, y_train)
pred = clf.predict(X_test)

# Evaluation
if is_regression:
    mae = mean_absolute_error(y_test, pred)
    # The `squared=False` argument was deprecated and later removed from scikit-learn
    rmse = np.sqrt(mean_squared_error(y_test, pred))
    r2 = r2_score(y_test, pred)
    print("Baseline Regression Results")
    print("MAE:", mae)
    print("RMSE:", rmse)
    print("R2:", r2)
else:
    acc = accuracy_score(y_test, pred)
    print("Baseline Classification Results")
    print("Accuracy:", acc)
    print("\nClassification report:\n", classification_report(y_test, pred))
    print("\nConfusion matrix:\n", confusion_matrix(y_test, pred))


## 11) Metric rationale (write in your own words)

Replace the text below with your real explanation.

- If you used **Accuracy**: explain class balance + why accuracy makes sense.
- If you used **F1**: explain class imbalance + importance of precision/recall tradeoff.
- If you used **RMSE/MAE**: explain error units + why you chose it.
- Always include 2–6 sentences of interpretation.


**Example text (edit):**

I chose **Accuracy** as the baseline metric because the target classes are relatively balanced, and accuracy provides a clear overall measure of correct predictions. The baseline model achieves an accuracy of XX on the test set, which sets an initial benchmark. In the next module, I will compare additional models and use cross-validation to confirm stability and improve performance.
