# LAB 2 - Exploratory Data Analysis (EDA) & Data Curation
**Lab:** 2 of 4  
**Last updated:** 2025-12-26

## Goal
Perform EDA and apply data curation techniques:
- Distributions & trends
- Correlation analysis
- Missing data handling (imputation, interpolation)
- Outlier detection (IQR)
- Simple data synthesis for training/testing pipelines

## Dataset options
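- Preferred: the Lab 1 output, `outputs/lab1_cleaned_oil_production.csv`
- Alternative: a dataset downloaded from Kaggle: https://www.kaggle.com/datasets
- Fallback: if neither the Lab 1 output nor `data/oil_production_sample.csv` exists, the notebook generates a synthetic daily dataset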

## 1) Setup

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

pd.set_option("display.max_columns", 50)
pd.set_option("display.width", 120)

## 2) Load dataset (Lab 1 output preferred)

In [None]:
from pathlib import Path

PATH_LAB1 = Path("outputs/lab1_cleaned_oil_production.csv")
DATA_PATH = PATH_LAB1 if PATH_LAB1.exists() else Path("data/oil_production_sample.csv")

def make_synthetic_daily(n_rows=365, seed=7):
    rng = np.random.default_rng(seed)
    dates = pd.date_range("2024-01-01", periods=n_rows, freq="D")
    field = rng.choice(["NorthSea-A", "NorthSea-B", "GOM-Alpha", "ME-East"], size=n_rows)
    oil = rng.normal(52000, 8500, size=n_rows).clip(15000, 90000)
    gas = (oil * rng.normal(3.7, 0.7, size=n_rows)).clip(10000, 550000)
    wc = rng.normal(0.30, 0.09, size=n_rows).clip(0.05, 0.7)
    down = rng.poisson(1.3, size=n_rows).astype(float)
    df = pd.DataFrame({"date": dates, "field": field, "oil_bbl": oil, "gas_mcf": gas, "avg_water_cut": wc, "downtime_hr": down})
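    # Inject data-quality problems for the later sections:
    # ~3% missing water-cut values, ~1% oil spikes (x4), ~1% missing gas values
    df.loc[rng.random(n_rows)<0.03, "avg_water_cut"] = np.nan
    df.loc[rng.random(n_rows)<0.01, "oil_bbl"] *= 4
    df.loc[rng.random(n_rows)<0.01, "gas_mcf"] = np.nan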
    return df

if DATA_PATH.exists():
    df = pd.read_csv(DATA_PATH)
    print("Loaded:", DATA_PATH, "shape:", df.shape)
else:
    df = make_synthetic_daily()
    print("Using synthetic dataset. shape:", df.shape)

df["date"] = pd.to_datetime(df["date"], errors="coerce")
df.head()
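
The later cells assume the column names produced by the synthetic generator (`field`, `oil_bbl`, `gas_mcf`, and so on), which a Kaggle dataset may not share. A minimal sketch of a column check follows; `EXPECTED_COLS` and `missing_columns` are hypothetical names introduced here for illustration, not part of the lab code.

```python
import pandas as pd

# Assumed column names, copied from the synthetic generator in this lab
EXPECTED_COLS = ["date", "field", "oil_bbl", "gas_mcf", "avg_water_cut", "downtime_hr"]

def missing_columns(frame, expected=EXPECTED_COLS):
    """Return the expected columns that are absent from `frame` (hypothetical helper)."""
    return [c for c in expected if c not in frame.columns]

# Toy frame standing in for a dataset with a different schema
toy = pd.DataFrame({"date": ["2024-01-01"], "field": ["GOM-Alpha"], "oil_bbl": [50000.0]})
absent = missing_columns(toy)
print("Missing columns:", absent)
```

If the list is non-empty, rename or map the dataset's columns before running the remaining sections.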

## 3) Missingness and summary stats

In [None]:
df.isna().mean().sort_values(ascending=False)

In [None]:
df.describe().T
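
Overall missingness can hide the fact that gaps are concentrated in one field. The sketch below computes the share of missing values per column within each group; it uses a small toy frame so it runs on its own, and `missing_share_by_group` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

def missing_share_by_group(frame, group_col):
    """Share of missing values per column, computed within each group."""
    value_cols = [c for c in frame.columns if c != group_col]
    return frame[value_cols].isna().groupby(frame[group_col]).mean()

# Toy data: field A has gaps, field B is complete
toy = pd.DataFrame({
    "field": ["A", "A", "B", "B"],
    "oil_bbl": [1.0, np.nan, 3.0, 4.0],
    "gas_mcf": [np.nan, np.nan, 1.0, 2.0],
})
by_field = missing_share_by_group(toy, "field")
print(by_field)
```

On the lab data, the equivalent call would be `missing_share_by_group(df, "field")`.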

## 4) Distributions

In [None]:
# Select numeric columns by dtype so text columns in a loaded CSV are skipped
num_cols = df.select_dtypes(include="number").columns.tolist()
df[num_cols].hist(bins=30, figsize=(12,8))
plt.suptitle("Distributions of numeric features")
plt.show()

## 5) Time-series trends by field

In [None]:
for f in df["field"].dropna().unique():
    sub = df[df["field"]==f].sort_values("date")
    plt.figure(figsize=(10,3))
    plt.plot(sub["date"], sub["oil_bbl"])
    plt.title(f"Oil production trend: {f}")
    plt.xlabel("date"); plt.ylabel("oil_bbl")
    plt.show()
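
Raw daily values can be noisy enough to obscure a trend. One common option is a rolling mean per field, sketched here on toy data; the window length `WINDOW` is an assumption chosen for the example, not a value prescribed by the lab.

```python
import pandas as pd

WINDOW = 3  # hypothetical window; a 7-day window is a common choice for daily data

toy = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=5, freq="D"),
    "field": ["A"] * 5,
    "oil_bbl": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Sort first so the rolling window follows time order within each field
toy = toy.sort_values(["field", "date"])
toy["oil_bbl_roll"] = toy.groupby("field")["oil_bbl"].transform(
    lambda s: s.rolling(WINDOW, min_periods=1).mean()
)
print(toy[["date", "oil_bbl", "oil_bbl_roll"]])
```

Plotting `oil_bbl_roll` next to `oil_bbl` makes it easier to separate trend from day-to-day noise.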

## 6) Correlation analysis

In [None]:
# Restrict to numeric columns; recent pandas versions raise an error on text columns in corr()
corr = df.select_dtypes(include="number").corr()
corr

In [None]:
plt.figure(figsize=(6,5))
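# Fix the color scale to [-1, 1] so colors are comparable across datasets
plt.imshow(corr, interpolation="nearest", cmap="coolwarm", vmin=-1, vmax=1)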
plt.xticks(range(len(corr.columns)), corr.columns, rotation=45, ha="right")
plt.yticks(range(len(corr.index)), corr.index)
plt.title("Correlation heatmap")
plt.colorbar()
plt.tight_layout()
plt.show()
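
A caveat when reading the heatmap for time-series data: two series that merely share a trend can show a high correlation even when their day-to-day movements are unrelated. The sketch below illustrates this with simulated data (an assumption made for illustration, not the lab dataset) by comparing the correlation of levels with the correlation of first differences.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
t = np.arange(n)

# Two series that share only an upward trend; their noise is independent
a = pd.Series(1.0 * t + rng.normal(0, 5, size=n))
b = pd.Series(2.0 * t + rng.normal(0, 5, size=n))

levels_corr = a.corr(b)               # dominated by the shared trend
diff_corr = a.diff().corr(b.diff())   # correlation of day-over-day changes
print(f"levels: {levels_corr:.3f}, day-over-day changes: {diff_corr:.3f}")
```

Differencing is only one way to remove a trend; detrending or correlating within shorter windows are alternatives.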

## 7) Outlier detection using IQR

In [None]:
def iqr_bounds(series, k=1.5):
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k*iqr, q3 + k*iqr

lb, ub = iqr_bounds(df["oil_bbl"].dropna())
outliers = df[(df["oil_bbl"] < lb) | (df["oil_bbl"] > ub)]
print("IQR bounds:", lb, ub)
print("Outliers:", len(outliers))
outliers.head()
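
Once outliers are flagged, one option is to cap them at the IQR bounds instead of dropping the rows. This is a minimal sketch on a toy series; `cap_to_iqr` is a hypothetical helper, and `iqr_bounds` is repeated from the cell above so the block runs on its own. Whether capping is appropriate depends on whether the spikes are measurement errors or real events.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def cap_to_iqr(series, k=1.5):
    """Clip values outside the IQR bounds to the nearest bound (hypothetical helper)."""
    lb, ub = iqr_bounds(series.dropna(), k)
    return series.clip(lower=lb, upper=ub)

toy = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 100.0])
capped = cap_to_iqr(toy)
print(capped.tolist())
```

Here Q1 = 11.25 and Q3 = 13.75, so the upper bound is 13.75 + 1.5 × 2.5 = 17.5 and the value 100.0 is capped to 17.5.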

## 8) Missing data handling (median + interpolation)

In [None]:
df_cur = df.copy()

# Median imputation by field
for col in ["gas_mcf", "avg_water_cut"]:
    if col in df_cur.columns:
        df_cur[col] = df_cur.groupby("field")[col].transform(lambda s: s.fillna(s.median()))

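# Interpolate oil by field.
# transform() keeps the original row index, so the result aligns with df_cur;
# groupby(...).apply() can return a (field, index) MultiIndex that fails on assignment.
df_cur = df_cur.sort_values(["field","date"])
df_cur["oil_bbl_interp"] = df_cur.groupby("field")["oil_bbl"].transform(lambda s: s.interpolate(limit_direction="both"))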

df_cur.isna().sum()
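
Interpolation is easier to defend across short gaps than long ones. The sketch below fills only gaps of at most `MAX_GAP` consecutive missing values and leaves longer gaps as NaN; `MAX_GAP` and `interpolate_short_gaps` are hypothetical names, and the threshold is an assumption for the example. Note that `interpolate(limit=...)` alone would partially fill a long gap, which is why the run length is computed explicitly.

```python
import numpy as np
import pandas as pd

MAX_GAP = 2  # hypothetical threshold: longest run of NaNs that will be filled

def interpolate_short_gaps(s, max_gap):
    """Linearly interpolate interior gaps, leaving gaps longer than max_gap as NaN."""
    is_na = s.isna()
    # Label consecutive runs of missing / present values, then measure each run
    run_id = (is_na != is_na.shift()).cumsum()
    run_len = is_na.groupby(run_id).transform("size")
    filled = s.interpolate(limit_area="inside")
    # Restore NaN wherever the original gap was too long
    return filled.where(~(is_na & (run_len > max_gap)))

toy = pd.DataFrame({
    "field": ["A"] * 8,
    "date": pd.date_range("2024-01-01", periods=8, freq="D"),
    "oil_bbl": [100.0, np.nan, 120.0, np.nan, np.nan, np.nan, np.nan, 200.0],
})
toy["oil_bbl_interp"] = toy.groupby("field")["oil_bbl"].transform(
    lambda s: interpolate_short_gaps(s, MAX_GAP)
)
print(toy[["date", "oil_bbl", "oil_bbl_interp"]])
```

The one-day gap is filled with 110.0, while the four-day gap stays missing.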

## 9) Simple data synthesis (augmentation)

In [None]:
rng = np.random.default_rng(0)
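# Cap the sample size so small datasets do not raise "larger sample than population"
pool = df_cur.dropna(subset=["oil_bbl_interp","gas_mcf"])
sample = pool.sample(min(200, len(pool)), random_state=0)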

synthetic = sample.copy()
synthetic["oil_bbl_interp"] *= (1 + rng.normal(0, 0.02, size=len(synthetic)))
synthetic["gas_mcf"] *= (1 + rng.normal(0, 0.03, size=len(synthetic)))
synthetic["is_synthetic"] = 1

real = df_cur.copy()
real["is_synthetic"] = 0

augmented = pd.concat([real, synthetic], ignore_index=True)
augmented["is_synthetic"].value_counts()
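
If the augmented data feeds a train/test pipeline, one common precaution is to split the real rows first and add jittered copies to the training part only, so no test row has a near-duplicate in training. The sketch below shows this on toy data; `jitter_rows`, the 25% test fraction, and the noise scales are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def jitter_rows(frame, rel_sd, rng):
    """Copy `frame` and apply multiplicative Gaussian noise per column (hypothetical helper)."""
    out = frame.copy()
    for col, sd in rel_sd.items():
        out[col] = out[col] * (1 + rng.normal(0, sd, size=len(out)))
    out["is_synthetic"] = 1
    return out

rng = np.random.default_rng(0)
real = pd.DataFrame({
    "oil_bbl_interp": np.linspace(40000, 60000, 20),
    "gas_mcf": np.linspace(150000, 220000, 20),
})
real["is_synthetic"] = 0

# Split real rows first, then augment the training part only
test = real.sample(frac=0.25, random_state=0)
train = real.drop(test.index)
train_aug = pd.concat(
    [train, jitter_rows(train, {"oil_bbl_interp": 0.02, "gas_mcf": 0.03}, rng)],
    ignore_index=True,
)
print(train_aug["is_synthetic"].value_counts())
```

The test set then contains real rows only, which keeps evaluation metrics tied to observed data.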

## 10) Save curated dataset for Lab 3

In [None]:
from pathlib import Path
OUT = Path("outputs/lab2_curated_for_ml.csv")
OUT.parent.mkdir(parents=True, exist_ok=True)
# Writes the curated real rows only; the `augmented` frame with synthetic rows is not saved here
df_cur.to_csv(OUT, index=False)
print("Saved:", OUT)

## Checkpoint questions
1) Why might correlation be misleading for time-series data?  
2) When is interpolation acceptable and when is it risky?  
3) How would you decide whether to cap, remove, or keep outliers in Oil & Gas?