# Assignment-1 Solution (PDF Learning)
**Student:** Rohan Malhotra  
**Roll Number (r):** 102303437  

This notebook:
- Loads the Kaggle India Air Quality dataset CSV(s) from `data/`
- Extracts **NO2** as feature **x**
- Computes `z = x + ar*sin(br*x)` for your roll number
- Estimates parameters **λ, μ, c** for:  
$$\hat{p}(z)=c\,e^{-\lambda(z-\mu)^2}$$

Place the Kaggle CSV(s) in a `data/` folder located in the same directory as this notebook.

In [None]:

import math
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


## Step 0 — Set roll number and compute $a_r, b_r$

In [None]:

r = 102303437

# Compute each remainder once and reuse it
r_mod7 = r % 7
r_mod5 = r % 5

ar = 0.05 * r_mod7        # a_r = 0.05 * (r mod 7)
br = 0.3 * (r_mod5 + 1)   # b_r = 0.3 * ((r mod 5) + 1)

r_mod7, r_mod5, ar, br
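
As a quick check of the arithmetic for this roll number: 102303437 leaves remainder 5 when divided by 7 and remainder 2 when divided by 5, so $a_r = 0.05 \times 5 = 0.25$ and $b_r = 0.3 \times 3 = 0.9$. A minimal sketch of the same calculation, where `coefficients` is a hypothetical helper name that is not used elsewhere in this notebook:

```python
import math

def coefficients(r: int) -> tuple[float, float]:
    """Return (a_r, b_r) using the same formulas as the cell above."""
    a_r = 0.05 * (r % 7)
    b_r = 0.3 * ((r % 5) + 1)
    return a_r, b_r

a_r, b_r = coefficients(102303437)

# Compare with a tolerance: 0.3 * 3 is not exactly 0.9 in binary floating point.
assert math.isclose(a_r, 0.25)
assert math.isclose(b_r, 0.9)
```

The tolerance-based comparison matters only when checking the values in code; the printed values above are what the later cells use.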


## Step 1 — Load dataset and extract NO2 as x
The cell below picks the first CSV in `data/` (in sorted filename order) whose header contains a `NO2` column, matched case-insensitively.

In [None]:

data_dir = Path("data")
csvs = sorted(data_dir.glob("*.csv"))
if not csvs:
    raise FileNotFoundError("No .csv found in data/. Download from Kaggle and place CSV(s) inside data/.")

csv_path, no2_col = None, None
for p in csvs:
    try:
        df_head = pd.read_csv(p, nrows=5)
    except Exception:
        # Unreadable or malformed file: skip it and try the next CSV
        continue
    cols = {c.lower(): c for c in df_head.columns}
    if "no2" in cols:
        csv_path, no2_col = p, cols["no2"]
        break

if csv_path is None:
    raise ValueError("Couldn't find a CSV containing a 'NO2' column (case-insensitive) in data/.")

csv_path, no2_col
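
The case-insensitive lookup used above can be tried in isolation, without any files. This is a small sketch under assumptions: `find_column` is a hypothetical helper name, and the header list is made up for illustration rather than taken from the Kaggle dataset.

```python
def find_column(columns, target="no2"):
    """Return the original column name matching target, ignoring case, or None."""
    lookup = {str(c).lower(): c for c in columns}
    return lookup.get(target.lower())

header = ["stn_code", "sampling_date", "so2", "NO2", "rspm"]  # made-up header
print(find_column(header))            # NO2
print(find_column(["date", "pm10"]))  # None
```

Returning the original spelling, not the lowercased key, is what lets `df[no2_col]` work later regardless of how the CSV capitalizes the column.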


In [None]:

df = pd.read_csv(csv_path)
x = pd.to_numeric(df[no2_col], errors="coerce").to_numpy(dtype=float)
x = x[np.isfinite(x)]
x[:10], x.size
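
To see what the cleaning step keeps and discards, here is a small sketch on made-up values (not rows from the dataset). Entries that cannot be parsed as numbers become NaN under `errors="coerce"`, and the `np.isfinite` mask then removes them.

```python
import numpy as np
import pandas as pd

# Made-up raw values: numeric strings, a non-numeric string, and a missing entry
raw = pd.Series(["12.5", "abc", None, "7"], dtype=object)

values = pd.to_numeric(raw, errors="coerce").to_numpy(dtype=float)
clean = values[np.isfinite(values)]   # drops NaN (and any +/-inf)

print(clean)
```

Only the two parseable entries survive, so `x.size` in the cell above can be smaller than the number of rows in the CSV.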


## Step 1 (continued) — Transform x → z

In [None]:

z = x + ar * np.sin(br * x)

z[:10], z.size
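
Two properties of this transform are easy to verify numerically. Since $|\sin| \le 1$, each value moves by at most $a_r$; and since $dz/dx = 1 + a_r b_r \cos(b_r x) \ge 1 - a_r b_r$, the map is strictly increasing whenever $a_r b_r < 1$. The sketch below assumes the values $a_r = 0.25$, $b_r = 0.9$ for r = 102303437 and uses a synthetic grid in place of the real NO2 data.

```python
import numpy as np

ar_demo, br_demo = 0.25, 0.9                    # values for r = 102303437
x_demo = np.linspace(0.0, 100.0, 10_001)        # synthetic, sorted x values
z_demo = x_demo + ar_demo * np.sin(br_demo * x_demo)

# The shift |z - x| is bounded by ar because |sin| <= 1
max_shift = float(np.max(np.abs(z_demo - x_demo)))

# dz/dx >= 1 - ar*br = 0.775 > 0, so sorted x gives strictly increasing z
is_increasing = bool(np.all(np.diff(z_demo) > 0))

print(max_shift <= ar_demo, is_increasing)      # True True
```

In practice this means z is a small, order-preserving perturbation of x, so the histogram of z should look very similar to the histogram of NO2.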


## Step 2 — Estimate parameters (MLE, normalized)
For $\hat{p}$ to be a proper density it must integrate to 1:
$$\int_{-\infty}^{\infty} c\,e^{-\lambda(z-\mu)^2}\,dz = c\sqrt{\pi/\lambda} = 1 \;\Rightarrow\; c=\sqrt{\lambda/\pi}$$
This is a Normal distribution with mean $\mu$ and variance $\sigma^2 = 1/(2\lambda)$, so the maximum-likelihood estimates (MLE) are:
- $\mu$ = sample mean of z
- $\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (z_i-\mu)^2$
- $\lambda = 1/(2\sigma^2)$
- $c = \sqrt{\lambda/\pi}$
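
These estimates follow from maximizing the log-likelihood of the sample $z_1,\dots,z_n$. A short derivation sketch, with $c=\sqrt{\lambda/\pi}$ substituted so that only $\mu$ and $\lambda$ remain free:

$$\ell(\mu,\lambda)=\sum_{i=1}^{n}\log\hat{p}(z_i)=\frac{n}{2}\log\frac{\lambda}{\pi}-\lambda\sum_{i=1}^{n}(z_i-\mu)^2$$
$$\frac{\partial\ell}{\partial\mu}=2\lambda\sum_{i=1}^{n}(z_i-\mu)=0\;\Rightarrow\;\mu=\frac{1}{n}\sum_{i=1}^{n}z_i$$
$$\frac{\partial\ell}{\partial\lambda}=\frac{n}{2\lambda}-\sum_{i=1}^{n}(z_i-\mu)^2=0\;\Rightarrow\;\lambda=\frac{n}{2\sum_{i=1}^{n}(z_i-\mu)^2}=\frac{1}{2\sigma^2}$$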

In [None]:

mu = float(np.mean(z))
s2 = float(np.mean((z - mu) ** 2))  # MLE variance (divide by n)
lam = 1.0 / (2.0 * s2)
c = math.sqrt(lam / math.pi)

lam, mu, c
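
One way to gain confidence in these formulas is to run them on synthetic data where the answer is known. The sketch below draws samples from a Normal distribution with assumed parameters (mean 25, standard deviation 10, chosen only for illustration), so the expected result is $\lambda \approx 1/(2 \cdot 10^2) = 0.005$. It also checks by a Riemann sum that the fitted curve integrates to about 1.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
true_mu, true_sigma = 25.0, 10.0                  # assumed, for illustration only
sample = rng.normal(true_mu, true_sigma, size=100_000)

mu_hat = float(np.mean(sample))
s2_hat = float(np.mean((sample - mu_hat) ** 2))   # divide by n, as in the MLE
lam_hat = 1.0 / (2.0 * s2_hat)
c_hat = math.sqrt(lam_hat / math.pi)

# Riemann-sum check that c * exp(-lam * (z - mu)^2) integrates to about 1
zs = np.linspace(true_mu - 10 * true_sigma, true_mu + 10 * true_sigma, 200_001)
dz = zs[1] - zs[0]
area = float(np.sum(c_hat * np.exp(-lam_hat * (zs - mu_hat) ** 2)) * dz)

print(mu_hat, lam_hat, area)
```

The real NO2 data need not be Normal, so agreement on synthetic data validates the formulas, not the quality of the fit to z.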


### (Optional) Fit c directly to histogram via least squares
If your instructor expects `c` to be learned as a free parameter rather than fixed by normalization, fit all three parameters `(λ, μ, c)` to the histogram density by nonlinear least squares.

In [None]:

from scipy.optimize import curve_fit

# Histogram (density=True gives a density estimate)
y, edges = np.histogram(z, bins=60, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])

def p_hat(zv, lam_, mu_, c_):
    return c_ * np.exp(-lam_ * (zv - mu_)**2)

# Initial guesses: use MLE values as a good start
p0 = (lam, mu, c)

popt, pcov = curve_fit(p_hat, centers, y, p0=p0, maxfev=20000)
lam_fit, mu_fit, c_fit = popt

(lam_fit, mu_fit, c_fit)
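
As a cross-check on the nonlinear fit, note that the model is a parabola in log space: $\log \hat{p}(z) = -\lambda z^2 + 2\lambda\mu z + (\log c - \lambda\mu^2)$. A degree-2 polynomial fit to the log of the bin heights therefore recovers all three parameters with linear least squares. This is an alternative technique, not the method used in the cell above, and the sketch runs on a noise-free curve with assumed parameter values instead of the real histogram.

```python
import math
import numpy as np

# Noise-free Gaussian-shaped "histogram" with known, assumed parameters
lam_true, mu_true = 0.005, 25.0
c_true = math.sqrt(lam_true / math.pi)
centers_demo = np.linspace(0.0, 50.0, 60)
y_demo = c_true * np.exp(-lam_true * (centers_demo - mu_true) ** 2)

# log y = a*z^2 + b*z + d  with  a = -lam, b = 2*lam*mu, d = log(c) - lam*mu^2
a, b, d = np.polyfit(centers_demo, np.log(y_demo), 2)
lam_lin = -a
mu_lin = b / (2.0 * lam_lin)
c_lin = math.exp(d + lam_lin * mu_lin ** 2)

print(lam_lin, mu_lin, c_lin)
```

On a real histogram, empty bins have height 0 and must be dropped before taking the log, and sparse tail bins are noisy, so the log-space result is best treated as a sanity check or an initial guess for `curve_fit`.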


## Plot: Histogram of z and fitted curves

In [None]:

zmin, zmax = np.percentile(z, [0.5, 99.5])
grid = np.linspace(zmin, zmax, 500)

plt.figure(figsize=(9,5))
plt.hist(z, bins=60, density=True, alpha=0.5, label="Histogram density")

plt.plot(grid, c*np.exp(-lam*(grid-mu)**2), label="MLE normalized fit")
plt.plot(grid, c_fit*np.exp(-lam_fit*(grid-mu_fit)**2), label="Curve-fit (λ, μ, c)")

plt.xlabel("z")
plt.ylabel("density")
plt.legend()
plt.title("Assignment-1: Density fit on transformed variable z")
plt.show()


## Step 3 — Final values to submit
Use either the **normalized MLE** values or the **curve-fit** values, depending on what your instructor expects.
This cell prints both.

In [None]:

print("=== Roll number r =", r, "===")
print(f"ar = {ar:.6f}, br = {br:.6f}")
print("\n--- Normalized MLE ---")
print(f"lambda (λ) = {lam:.10f}")
print(f"mu (μ)     = {mu:.10f}")
print(f"c          = {c:.10f}")

print("\n--- Histogram curve-fit ---")
print(f"lambda (λ) = {lam_fit:.10f}")
print(f"mu (μ)     = {mu_fit:.10f}")
print(f"c          = {c_fit:.10f}")


## Save results to JSON

In [None]:

import json

results = {
    "student": "Rohan Malhotra",
    "roll_number": r,
    "csv_used": str(csv_path),
    "no2_column": no2_col,
    "ar": ar,
    "br": br,
    "normalized_mle": {"lambda": lam, "mu": mu, "c": c},
    "hist_curve_fit": {"lambda": float(lam_fit), "mu": float(mu_fit), "c": float(c_fit)},
    "n_points_used": int(z.size),
}

with open("results.json", "w") as f:
    json.dump(results, f, indent=2)

results
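
Before submitting, it can be worth reloading the saved file and confirming that it parses and that the normalized values still satisfy $c=\sqrt{\lambda/\pi}$. The sketch below uses a made-up dictionary and a temporary file named `results_demo.json` (both assumptions for illustration) so that it does not touch the real `results.json`.

```python
import json
import math
import os
import tempfile

# Made-up values standing in for the real results
demo = {
    "normalized_mle": {"lambda": 0.005, "mu": 25.0, "c": math.sqrt(0.005 / math.pi)},
}

path = os.path.join(tempfile.mkdtemp(), "results_demo.json")
with open(path, "w") as f:
    json.dump(demo, f, indent=2)
with open(path) as f:
    loaded = json.load(f)

mle = loaded["normalized_mle"]
round_trip_ok = loaded == demo
normalized_ok = math.isclose(mle["c"], math.sqrt(mle["lambda"] / math.pi))

print(round_trip_ok, normalized_ok)   # True True
```

The same two checks can be pointed at `results.json` by changing `path`.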
