In [1]:
# from google.colab import drive
# drive.flush_and_unmount()           # ignore errors if already unmounted

#If cannot remount, simply delete the mounted drive and then remount
# rm -rf /content/drive


In [2]:
# Colab cell
from google.colab import drive

drive.mount('/content/drive', force_remount=True)



Mounted at /content/drive


In [3]:
# Adjust these two for YOUR repo
REPO_OWNER = "kadkins3880"
REPO_NAME  = "STAT4160"   # e.g., unified-stocks-team1
BASE_DIR   = "/content/drive/MyDrive/dspt25"
CLONE_DIR  = f"{BASE_DIR}/{REPO_NAME}"
REPO_URL   = f"https://github.com/{REPO_OWNER}/{REPO_NAME}.git"

# if on my office computer

# REPO_NAME  = "lectureNotes"   # e.g., on my office computer
# BASE_DIR = r"E:\OneDrive - Auburn University Montgomery\teaching\AUM\STAT 4160 Productivity Tools" # on my office computer
# CLONE_DIR  = f"{BASE_DIR}\{REPO_NAME}"

import os, pathlib
pathlib.Path(BASE_DIR).mkdir(parents=True, exist_ok=True)


In [4]:
import os, subprocess, shutil, pathlib

if not pathlib.Path(CLONE_DIR).exists():
    !git clone {REPO_URL} {CLONE_DIR}
else:
    # If the folder exists, just ensure it's a git repo and pull latest
    os.chdir(CLONE_DIR)
    # !git status
    # !git pull --rebase # !git pull --ff-only
os.chdir(CLONE_DIR)
print("Working dir:", os.getcwd())

Working dir: /content/drive/MyDrive/dspt25/STAT4160


**Assumptions:** You completed Sessions 9–12 and have `data/processed/features_v1.parquet` (or `features_v1_ext.parquet`). If a file is missing, the lab provides a small synthetic fallback so tests still run. **Goal today:** Make it **hard to ship bad data** by adding precise, fast tests.

------------------------------------------------------------------------

## Session 13 — pytest + Data Validation

### Learning goals

By the end of class, students can:

1.  Write **fast, high‑signal tests** for data pipelines (shapes, dtypes, nulls, **no look‑ahead**).
2.  Validate a DataFrame with **Pandera** (schema + value checks) or **custom checks** only.
3.  Use **logging** effectively and capture logs in tests.
4.  Run tests in Colab / locally and prepare for CI in Session 14.

------------------------------------------------------------------------

## Agenda

-   What to test (and not), “data tests” vs unit tests, speed budget
-   Pandera schemas & custom checks; tolerance and stability
-   Logging basics (`logging`, levels, handlers); testing logs with `caplog`
-   **In‑class lab**: add `tests/test_features.py` (+ optional Pandera test), fixtures, config; run & fix
-   Wrap‑up + homework briefing

------------------------------------------------------------------------


### What to test

-   **Contract tests** (tests that check an agreed contract between a data consumer and a data provider) for data:

    -   **Schema**: required columns exist; dtypes sane (`ticker` categorical, calendar ints).
    -   **Nulls**: no NAs in training‑critical cols.
    -   **Semantics**: `r_1d` is **lead** of `log_return`; rolling features computed from **past only**.
    -   **Keys**: no duplicate `(ticker, date)`; dates strictly increasing within ticker.

-   Keep tests **under \~5s total** (CI budget). Avoid long recomputations; sample/take head.
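The key checks listed above can be sketched in a few lines. This is a minimal illustration, assuming a frame with `ticker` and `date` columns; the helper name `key_problems` is hypothetical and not part of the lab code.

```python
import pandas as pd

def key_problems(df: pd.DataFrame) -> list[str]:
    """Return human-readable key violations (an empty list means the contract holds)."""
    problems = []
    n_dup = int(df.duplicated(subset=["ticker", "date"]).sum())
    if n_dup:
        problems.append(f"{n_dup} duplicate (ticker, date) rows")
    for tkr, g in df.groupby("ticker", observed=True):
        # Strictly increasing dates have a positive step between consecutive rows
        steps = g["date"].diff().dropna()
        if not (steps > pd.Timedelta(0)).all():
            problems.append(f"dates not strictly increasing for {tkr}")
    return problems

good = pd.DataFrame({
    "ticker": ["AAPL", "AAPL", "MSFT"],
    "date": pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-02"]),
})
# Append a copy of the second AAPL row to create one duplicate key
bad = pd.concat([good, good.iloc[[1]]], ignore_index=True)

print(key_problems(good))  # []
print(key_problems(bad))
```

Returning a list of messages rather than raising on the first problem keeps the failure report complete when a test asserts `not key_problems(df)`.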

### Pandera vs custom checks

-   **Pandera**: declarative schema; optional dependency; good for **column existence + ranges**.
-   **Custom**: essential for **domain logic** (look‑ahead bans, exact rolling formulas).

### Logging basics

-   Use `logging.getLogger(__name__)`; set level via env (`LOGLEVEL=INFO`).
-   Log **counts, ranges, and any data drops** inside build scripts.
-   In tests: use `caplog` to assert a warning is emitted for suspicious conditions.

------------------------------------------------------------------------


### 1) (Optional) Install test‑time helpers (Pandera)

In [5]:
!pip -q install pytest pandera pyarrow

# Pandera

**Pandera** is a **Python library** used for **data validation and testing of pandas DataFrames**.

It helps you **define schemas** that describe what your data *should* look like — for example, the expected column names, data types, ranges, or even statistical properties — and then automatically checks whether a given DataFrame conforms to those expectations.

---

###  Why Pandera is useful

When working with data pipelines (e.g., ETL (Extract, Transform, Load), feature engineering, model training), your data can easily become corrupted or inconsistent.
Pandera acts as a **“data contract”** (agreement between different parts of a data system) between pipeline stages.

It helps you:

* Catch data quality issues early (e.g., missing columns, wrong dtypes, NaNs)
* Ensure data consistency across steps
* Document dataset expectations
* Write **unit tests** for data (like how `pytest` tests code logic)

---

###  Example

```python
import pandas as pd
import pandera as pa
from pandera import Column, Check

# Define a schema for a DataFrame
class StocksSchema(pa.DataFrameModel):
    ticker: pa.typing.Series[str]
    date: pa.typing.Series[pd.Timestamp]
    adj_close: pa.typing.Series[float] = pa.Field(gt=0)  # must be > 0
    volume: pa.typing.Series[int] = pa.Field(ge=0)        # must be ≥ 0
    log_return: pa.typing.Series[float]

# Create a sample DataFrame
df = pd.DataFrame({
    "ticker": ["AAPL", "AAPL"],
    "date": pd.to_datetime(["2024-01-01", "2024-01-02"]),
    "adj_close": [189.5, 190.2],
    "volume": [1000, 1200],
    "log_return": [0.0, 0.0037],
})

# Validate it
validated = StocksSchema.validate(df)
print("✅ Data passed validation!")
```

If the data violates the schema (say `adj_close` ≤ 0), Pandera will raise a **SchemaError** with a clear message.

---

###  Other features

* **Integration** with `pytest` for automated testing
* **Statistical checks** (e.g., mean, std, correlations)
* **Lazy validation** (collect all errors before raising)
* **Schema inference** (autogenerate schema from sample data)
* **Compatibility** with `pandas`, `polars`, `modin`, `dask`, and `pyspark`

---

###  Analogy

If **pandas** is for manipulating data,
then **pandera** is for *ensuring that the manipulated data makes sense.*



#  Put a tiny **logging helper** in your repo (used by build scripts & tests)

```python
def setup_logging(name: str = "dspt"):
```

Defines a function that sets up a **logger** named `"dspt"` (by default).
You can call it like `logger = setup_logging("myapp")` to get your own logger instance.

---

###  Determine the log level

```python
level = os.getenv("LOGLEVEL", "INFO").upper()
```

* Reads the environment variable `LOGLEVEL`.
* If not set, defaults to `"INFO"`.
* Converts it to uppercase (`"debug"` → `"DEBUG"`).

Typical levels:
`DEBUG < INFO < WARNING < ERROR < CRITICAL`

This allows you to change verbosity **without editing code**, e.g.:

```bash
export LOGLEVEL=DEBUG
python run_pipeline.py
```

---

### Get a named logger

```python
logger = logging.getLogger(name)
```

This retrieves (or creates) a **logger object** with that name.
Named loggers are hierarchical — for example, `"dspt.submodule"` inherits from `"dspt"`.

---

###  Add a handler (only once)

```python
if not logger.handlers:
    handler = logging.StreamHandler()
    fmt = "%(asctime)s | %(levelname)s | %(name)s | %(message)s"  # %-style formatting, explained further below
    handler.setFormatter(logging.Formatter(fmt))
    logger.addHandler(handler)
```

* **Handlers** define *where* logs go (console, file, etc.).
  Here it uses `StreamHandler`, which prints to the console (`sys.stderr`).
* **Formatter** defines the output format:

  ```
  2025-10-06 09:45:23,612 | INFO | dspt | Building database...
  ```
* The `if not logger.handlers` guard prevents adding duplicate handlers if this function is called multiple times (a common pitfall).

---

###  Set the logging level

```python
logger.setLevel(level)
```

Tells the logger which messages to process (e.g., `DEBUG` shows all, `INFO` hides debug details).

---

###  Return it for use

```python
return logger
```

So you can do:

```python
logger = setup_logging("build_db")
logger.info("Building prices database...")
logger.debug(f"Using LOGLEVEL={os.getenv('LOGLEVEL')}")
```

---

### Typical Output Example

```
2025-10-06 09:47:22,023 | INFO | build_db | Building prices database...
2025-10-06 09:47:22,025 | DEBUG | build_db | Connected to data/prices.db
```
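To check the handler guard and the `LOGLEVEL` override in one go, the pieces above can be assembled and called twice. This is a sketch: the logger name `demo_guard` is arbitrary, and setting `os.environ` inside the script stands in for `export LOGLEVEL=DEBUG` in a shell.

```python
import logging
import os

def setup_logging(name: str = "dspt") -> logging.Logger:
    level = os.getenv("LOGLEVEL", "INFO").upper()
    logger = logging.getLogger(name)
    if not logger.handlers:  # guard: attach the console handler only once
        handler = logging.StreamHandler()
        fmt = "%(asctime)s | %(levelname)s | %(name)s | %(message)s"
        handler.setFormatter(logging.Formatter(fmt))
        logger.addHandler(handler)
    logger.setLevel(level)
    return logger

os.environ["LOGLEVEL"] = "debug"       # lower case is fine, the helper upper-cases it
first = setup_logging("demo_guard")
second = setup_logging("demo_guard")   # same name, so the same logger object

print(first is second)                     # True
print(len(first.handlers))                 # 1
print(logging.getLevelName(first.level))   # DEBUG
first.debug("visible because LOGLEVEL is DEBUG")
```

Without the guard, the second call would attach a second handler and every message would be printed twice.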



# __future__ annotations

```python
from __future__ import annotations
```

It tells Python to **treat all type annotations as strings**, i.e. *not* to evaluate them immediately at runtime.

This is called **“postponed evaluation of annotations”**.

Normally, when you write something like:

```python
class Node:
    def __init__(self, next: Node | None = None):
        self.next = next
```

Python tries to *evaluate* `Node` right away.
But at that point, the class `Node` hasn’t been fully defined yet — so you’d get an error:

```
NameError: name 'Node' is not defined
```

---

###  Fix using `from __future__ import annotations`

```python
from __future__ import annotations

class Node:
    def __init__(self, next: Node | None = None):
        self.next = next
```

Now Python **stores** `"Node | None"` as a **string** in the function’s `__annotations__` instead of trying to resolve it immediately.



### Why this matters

 **Avoids forward-reference issues**

```python
class A:
    def link(self, b: B):  # ← B not yet defined
        ...
class B: ...
```

→ Works fine if you’ve imported `annotations`.

 **Improves import performance**
Since annotations aren’t evaluated at import time, it can slightly speed up module loading.

 **Simplifies circular type references**
You no longer have to write `'B'` manually (string form) in type hints.
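The stored strings can be inspected directly. In the sketch below the class definition is run through `exec` only so that the `__future__` import can sit at the top of its own source text; in a real module the import is simply the first statement of the file.

```python
import textwrap

source = textwrap.dedent('''
    from __future__ import annotations

    class Node:
        def __init__(self, next: Node | None = None):
            self.next = next
''')

namespace = {}
exec(source, namespace)   # defining the class raises no NameError
Node = namespace["Node"]

annotations = Node.__init__.__annotations__
print(annotations)        # {'next': 'Node | None'}
```

The annotation is kept as the plain string `'Node | None'`; tools that need the real type can resolve it later, for example with `typing.get_type_hints`.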

---
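### Status in newer Python versions

Postponed evaluation (PEP (Python Enhancement Proposal) 563) was once planned to become the default, but that plan was dropped, so on Python 3.7 through 3.13 you still need the import to get this behavior.
Python 3.14 instead adopts PEP 649: annotations are evaluated lazily, only when something inspects them, which also makes forward references work without the import. Many projects still include the import for **compatibility with older versions**.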




# **Python’s old-style string formatting syntax** (often called the **printf-style** format).


```python
"%(asctime)s"
```

---

##  General pattern

```
%(<key>)<type>
```

is a placeholder that says:

> “Look up `<key>` in a dictionary and format its value according to `<type>`.”

---

###  The `%` sign

* Marks the start of a **format specifier**.
* It tells Python: “Insert a value here when formatting.”

Example:

```python
"Hello %s" % "World"
# → "Hello World"
```

---

###  The parentheses `(asctime)`

* When you see parentheses inside the `%`, it means **dictionary-based formatting**.
* The string will be formatted with values from a dictionary that has that key.

Example:

```python
"%(name)s is %(age)d years old" % {"name": "Alice", "age": 30}
# → "Alice is 30 years old"
```

Here:

* `name` and `age` are keys in the dictionary.
* `s` means format as string.
* `d` means format as integer.

---

###  The `s` at the end

That’s the **type specifier** — it tells Python how to render the value.

| Specifier | Meaning                       | Example                                 |
| --------- | ----------------------------- | --------------------------------------- |
| `%s`      | string (any object → `str()`) | `"Hello %s" % "World"`                  |
| `%d`      | integer                       | `"Count: %d" % 5`                       |
| `%f`      | floating point                | `"Pi ≈ %.2f" % 3.14159` → `"Pi ≈ 3.14"` |
| `%r`      | `repr()` of the object        | `"%r" % [1,2,3]` → `"[1, 2, 3]"`        |

---

###  In the logging context

The **logging formatter** uses this same printf-style syntax.
The `logging` module fills in a dictionary like:

```python
{
  "asctime": "2025-10-06 09:32:47,321",
  "levelname": "INFO",
  "name": "build_db",
  "message": "Building database..."
}
```

Then applies:

```python
"%(asctime)s | %(levelname)s | %(name)s | %(message)s" % record_dict
```

Which produces:

```
2025-10-06 09:32:47,321 | INFO | build_db | Building database...
```

---

### Modern alternative (f-strings)

Nowadays, you’d usually use **f-strings** or `.format()` for most Python code:

```python
name = "Alice"
age = 30
f"{name} is {age} years old"
```

But the **logging module** still defaults to the older `%()` style for backwards compatibility; `logging.Formatter` also accepts `style="{"` or `style="$"` if you prefer another syntax.
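The substitution can be watched in isolation by formatting a hand-built `LogRecord`. This is a sketch for illustration: real code never constructs records manually, and the file name `build_db.py` is a placeholder.

```python
import logging

# A record like the one logging creates for logger.info("Loaded %d rows", 42)
record = logging.LogRecord(
    name="build_db", level=logging.INFO, pathname="build_db.py", lineno=1,
    msg="Loaded %d rows", args=(42,), exc_info=None,
)

percent = logging.Formatter("%(levelname)s | %(name)s | %(message)s")
brace = logging.Formatter("{levelname} | {name} | {message}", style="{")

line_percent = percent.format(record)
line_brace = brace.format(record)

print(line_percent)  # INFO | build_db | Loaded 42 rows
print(line_brace)    # INFO | build_db | Loaded 42 rows
```

Both formatters read the same record attributes; only the placeholder syntax differs.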



#  **`repr()`**

is a built-in function that returns the **official string representation** of an object — that is, a string that, if possible, can be used to **recreate the object** when passed to `eval()`.

---

###  **Definition**

```python
repr(object)
```

Returns a string that represents the object in a way **useful for debugging and development**.


* `repr()` is meant for **developers**.
* `str()` is meant for **users**.

Think of it like this:

| Function    | Audience  | Purpose                                       |
| ----------- | --------- | --------------------------------------------- |
| `repr(obj)` | Developer | Unambiguous, ideally `eval(repr(obj)) == obj` |
| `str(obj)`  | User      | Readable, meant for display                   |

---

###  **Examples**

```python
x = 'hello\nworld'
print(str(x))   # prints: hello
                #         world
print(repr(x))  # prints: 'hello\nworld'
```

Explanation:

* `str()` shows the human-friendly version.
* `repr()` shows the *literal representation* (including escape characters).

---

###  Defining a custom `__repr__`

You can define how `repr()` behaves for your own classes by implementing the **`__repr__`** method.

```python
class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __repr__(self):
        return f"Point({self.x}, {self.y})"

p = Point(2, 3)
print(repr(p))  # Output: Point(2, 3)
```

That output is valid Python: `eval("Point(2, 3)")` creates a new `Point` with the same coordinates (the two objects compare equal only if the class also defines `__eq__`).


* `repr()` gives a precise, debug-oriented string form of an object.
* Used internally by the interpreter when you type an expression in the REPL:

  ```python
  >>> [1, 2, 3]
  [1, 2, 3]  # actually uses repr()
  ```
* If an object defines both `__str__` and `__repr__`, `print()` uses `__str__`, and the interactive shell uses `__repr__`.
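The split between `__str__` and `__repr__` can be seen side by side in a small sketch. The `__eq__` method is an addition for this illustration so that the `eval` round trip can be checked with `==`.

```python
class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __repr__(self):
        # !r applies repr() to each field, so strings would keep their quotes
        return f"Point({self.x!r}, {self.y!r})"

    def __str__(self):
        return f"({self.x}, {self.y})"

    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

p = Point(2, 3)
print(str(p))                # (2, 3)
print(repr(p))               # Point(2, 3)
print(f"{p} vs {p!r}")       # (2, 3) vs Point(2, 3)
print(eval(repr(p)) == p)    # True
```

In f-strings and log messages, `!r` (or `%r`) is a convenient way to request the debug form without calling `repr()` explicitly.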



In [6]:
# save this file to scripts/logsetup.py
from __future__ import annotations
import logging, os

def setup_logging(name: str = "dspt"):
    level = os.getenv("LOGLEVEL", "INFO").upper()
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler()
        fmt = "%(asctime)s | %(levelname)s | %(name)s | %(message)s"
        handler.setFormatter(logging.Formatter(fmt))
        logger.addHandler(handler)
    logger.setLevel(level)
    return logger

# Create **pytest config** and a fixture (with safe fallback data)

This step sets up the **pytest configuration and fixtures** for a data-science project.

```python
from pathlib import Path
Path("pytest.ini").write_text("""[pytest]
addopts = -q
testpaths = tests
filterwarnings =
    ignore::FutureWarning
""")
```

* `[pytest]` — standard section header for Pytest config.
* `addopts = -q` — run tests in **quiet mode** (no verbose test names).
* `testpaths = tests` — look for tests inside the `tests/` directory only.
* `filterwarnings = ignore::FutureWarning` — suppress annoying warnings from pandas/numpy like:

  ```
  FutureWarning: The frame.append method is deprecated...
  ```

# Fixtures

 `tests/conftest.py`: This file defines **shared fixtures** for all tests in your suite.
Pytest automatically discovers any `conftest.py` in the directories it collects tests from, so fixtures defined there need no import.

---

###  The fixture
In pytest, a fixture is a reusable setup function that provides data, resources, or state to your test functions.

```python
@pytest.fixture(scope="session")
def features_df():
```

Defines a reusable **fixture** named `features_df` that tests can depend on.


* `scope="session"` means it’s created **once per test session**, shared by all tests → faster test runs.

####  Example usage in a test

```python
def test_columns_exist(features_df):
    expected = {"ticker", "date", "adj_close", "log_return", "roll_mean_20", "roll_std_20"}
    assert expected.issubset(features_df.columns)
```

Pytest automatically injects the fixture, no import needed.




```python
@pytest.fixture
def features_df():
    ...
```

## The decorator “@” syntax

The `@` symbol in Python introduces a **decorator** —
a special shorthand for wrapping one function (or class) **inside another**.

Conceptually, this:

```python
@decorator
def func():
    ...
```

is equivalent to writing:

```python
def func():
    ...
func = decorator(func)
```

###  What a decorator does

A **decorator** is itself a function that:

* Takes another function (or class) as input.
* Returns a *modified* or *enhanced* version of it.

So decorators are a clean way to **add behavior** without changing the original function’s core logic.

---

###  Simple example

```python
def announce(func):
    def wrapper():
        print("Starting...")
        func()
        print("Done!")
    return wrapper
```

Now apply it:

```python
@announce
def greet():
    print("Hello!")

greet()
```

**Output:**

```
Starting...
Hello!
Done!
```

What happens internally:

1. `greet` is passed into `announce`.
2. `announce` returns the `wrapper` function.
3. The name `greet` now refers to `wrapper`.
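Because `greet` now refers to `wrapper`, the simple version above loses the original function's name and cannot forward arguments or return values. A common refinement, sketched here with a hypothetical `add` function, uses `functools.wraps` and `*args, **kwargs`:

```python
import functools

def announce(func):
    @functools.wraps(func)           # copy func's name and docstring onto wrapper
    def wrapper(*args, **kwargs):    # accept whatever arguments func takes
        print("Starting...")
        result = func(*args, **kwargs)
        print("Done!")
        return result                # pass the return value through
    return wrapper

@announce
def add(a, b):
    """Add two numbers."""
    return a + b

total = add(2, 3)
print(total)         # 5
print(add.__name__)  # add
```

Keeping the original name matters for tools that report function names, including pytest's test output.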

---


###  Other common decorators

| Decorator                       | Meaning / Use                                                                    |
| ------------------------------- | -------------------------------------------------------------------------------- |
| `@staticmethod`                 | Define a static method inside a class                                            |
| `@classmethod`                  | Define a method that receives the class (`cls`) instead of the instance (`self`) |
| `@property`                     | Turn a method into an attribute-like property                                    |
| `@dataclass`                    | Auto-generate class boilerplate (`__init__`, `__repr__`, etc.)                   |
| `@lru_cache`                    | Cache function results for performance                                           |
| `@app.route(...)`               | Flask-style URL binding                                                          |
| `@pytest.mark.parametrize(...)` | Run one test with multiple inputs                                                |




# Shifting a Time Series

```python
df["log_return"].shift(-1)
```


`Series.shift(n)` **moves** the data *downward* or *upward* by `n` rows, filling the empty spots with `NaN`.

| Direction                                | Argument | Meaning             |
| ---------------------------------------- | -------- | ------------------- |
| Downward (later rows get earlier values) | `n > 0`  | “look back” (lag)   |
| Upward (earlier rows get later values)   | `n < 0`  | “look ahead” (lead) |

---

###  Example

Suppose you have:

| index | log_return |
| :---- | ---------: |
| 0     |       0.01 |
| 1     |       0.02 |
| 2     |       0.03 |

Then:

```python
df["log_return"].shift(1)
```

→ moves values **down 1 row**:

| index | shifted |
| :---- | ------: |
| 0     |     NaN |
| 1     |    0.01 |
| 2     |    0.02 |

and

```python
df["log_return"].shift(-1)
```

→ moves values **up 1 row**:

| index | shifted |
| :---- | ------: |
| 0     |    0.02 |
| 1     |    0.03 |
| 2     |     NaN |
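With several tickers stacked in one frame, a plain `shift(-1)` pulls the next ticker's first value into the previous ticker's last row. A small sketch with made-up returns shows the difference between shifting the whole column and shifting within each ticker:

```python
import pandas as pd

df = pd.DataFrame({
    "ticker": ["AAPL"] * 3 + ["MSFT"] * 3,
    "log_return": [0.01, 0.02, 0.03, 0.10, 0.20, 0.30],
})

naive = df["log_return"].shift(-1)                          # ignores ticker boundaries
per_ticker = df.groupby("ticker")["log_return"].shift(-1)   # shifts inside each ticker

print(pd.DataFrame({"ticker": df["ticker"], "naive": naive, "per_ticker": per_ticker}))
```

In the naive column the last AAPL row receives 0.10, which is MSFT's first return; in the grouped column it is `NaN`, which is the correct "no next day" marker.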



In [7]:
# pytest.ini
from pathlib import Path
Path("pytest.ini").write_text("""[pytest]
addopts = -q
testpaths = tests
filterwarnings =
    ignore::FutureWarning
""")

# tests/conftest.py
from pathlib import Path
import pandas as pd, numpy as np, pytest

def _synth_features():
    # minimal synthetic features for 3 tickers, 60 days
    rng = np.random.default_rng(0)
    dates = pd.bdate_range("2023-01-02", periods=60)
    frames=[]
    for t in ["AAPL","MSFT","GOOGL"]:
        ret = rng.normal(0, 0.01, size=len(dates)).astype("float32")
        adj = 100 * np.exp(np.cumsum(ret))
        df = pd.DataFrame({
            "date": dates,
            "ticker": t,
            "adj_close": adj.astype("float32"),
            "log_return": np.r_[np.nan, np.diff(np.log(adj))].astype("float32")
        })
        # next-day label
        df["r_1d"] = df["log_return"].shift(-1)
        # rolling
        df["roll_mean_20"] = df["log_return"].rolling(20, min_periods=20).mean()
        df["roll_std_20"]  = df["log_return"].rolling(20, min_periods=20).std()
        df["zscore_20"]    = (df["log_return"]-df["roll_mean_20"])/(df["roll_std_20"]+1e-8)
        df["weekday"] = df["date"].dt.weekday.astype("int8")
        df["month"]   = df["date"].dt.month.astype("int8")
        frames.append(df)
    out = pd.concat(frames, ignore_index=True).dropna().reset_index(drop=True)
    out["ticker"] = out["ticker"].astype("category")
    return out

@pytest.fixture(scope="session")
def features_df():
    p = Path("data/processed/features_v1.parquet")
    if p.exists():
        df = pd.read_parquet(p)
        # Ensure expected minimal cols exist (compute light ones if missing)
        if "weekday" not in df: df["weekday"] = pd.to_datetime(df["date"]).dt.weekday.astype("int8")
        if "month" not in df:   df["month"] = pd.to_datetime(df["date"]).dt.month.astype("int8")
        return df.sort_values(["ticker","date"]).reset_index(drop=True)
    # fallback
    return _synth_features().sort_values(["ticker","date"]).reset_index(drop=True)

#  4) **High‑value tests**: shapes, nulls, look‑ahead ban

### Check nulls
```
na = features_df[crit].isna().sum().to_dict()
```

* `.sum()` sums the `True` values, which counts the number of NaNs per column.
* `.to_dict()` converts the Series into a dictionary, e.g. `{"log_return": 0, "r_1d": 0}`.

###  `@pytest.mark.parametrize("W", [20])`

This is a **pytest parameterization decorator**.

It means the test will run once for each value of `W` in the list.
Here only `W = 20`, but you could add more (e.g. `[5, 10, 20, 50]`).

Purpose: easily test multiple rolling-window sizes without rewriting the test.
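A sketch with several window sizes, using a made-up series rather than the lab's `features_df` fixture, might look like this:

```python
import numpy as np
import pandas as pd
import pytest

@pytest.mark.parametrize("W", [5, 10, 20])
def test_rolling_mean_is_nan_until_window_full(W):
    s = pd.Series(np.arange(30, dtype="float64"))
    rm = s.rolling(W, min_periods=W).mean()
    # The first W-1 rows have too little history, so they must be NaN
    assert rm.iloc[: W - 1].isna().all()
    # The first defined value is the plain mean of the first W observations
    assert rm.iloc[W - 1] == pytest.approx(s.iloc[:W].mean())
```

Pytest reports each value as its own test case, so a failure for one window size does not hide the results for the others.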



In [8]:
# tests/test_features.py
import numpy as np, pandas as pd
import pytest

REQUIRED_COLS = ["date","ticker","log_return","r_1d","weekday","month"]

def test_required_columns_present(features_df):
    missing = [c for c in REQUIRED_COLS if c not in features_df.columns]
    assert not missing, f"Missing required columns: {missing}"

def test_key_no_duplicates(features_df):
    dup = features_df[["ticker","date"]].duplicated().sum()
    assert dup == 0, f"Found {dup} duplicate (ticker,date) rows"

def test_sorted_within_ticker(features_df):
    for tkr, g in features_df.groupby("ticker"):
        assert g["date"].is_monotonic_increasing, f"Dates not sorted for {tkr}"

def test_nulls_in_critical_columns(features_df):
    crit = ["log_return","r_1d"]
    na = features_df[crit].isna().sum().to_dict()
    assert all(v == 0 for v in na.values()), f"NAs in critical cols: {na}"

def test_calendar_dtypes(features_df):
    assert str(features_df["weekday"].dtype) in ("int8","Int8"), "weekday should be compact int"
    assert str(features_df["month"].dtype)   in ("int8","Int8"), "month should be compact int"

def test_ticker_is_categorical(features_df):
    # allow object if reading from some parquet engines, but prefer category
    assert features_df["ticker"].dtype.name in ("category","CategoricalDtype","object")

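def test_r1d_is_lead_of_log_return(features_df):
    for tkr, g in features_df.groupby("ticker"):
        # r_1d at t equals log_return at t+1; compare by position, because
        # Series.equals also compares index labels, which differ after slicing
        lead = g["r_1d"].to_numpy()[:-1]
        nxt = g["log_return"].to_numpy()[1:]
        assert np.array_equal(lead, nxt), f"Lead/lag mismatch for {tkr}"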

@pytest.mark.parametrize("W", [20])
def test_rolling_mean_matches_definition(features_df, W):
    if f"roll_mean_{W}" not in features_df.columns:
        pytest.skip(f"roll_mean_{W} not present")
    for tkr, g in features_df.groupby("ticker"):
        s = g["log_return"]
        rm = s.rolling(W, min_periods=W).mean()
        # compare only where defined
        mask = ~rm.isna()  # ignore the first W-1 rows, where the rolling mean is NaN
        diff = (g[f"roll_mean_{W}"][mask] - rm[mask]).abs().max()
        assert float(diff) <= 1e-7, f"roll_mean_{W} mismatch for {tkr} (max diff {diff})"

# Skip Directive for pytest

`pytest.skip` is pytest's **skip directive**. It is useful for conditionally skipping tests (or entire test modules) when a dependency isn’t available.

```python
pytest.skip("pandera not installed", allow_module_level=True)
```

The function `pytest.skip(reason)` immediately **stops execution of the current test** (or module, if allowed)
and marks it as **“skipped”** in the test results rather than as a failure.

So in your pytest report you’ll see something like:

```
SKIPPED [1] test_schema_validation.py:5: pandera not installed
```
The message

```python
"pandera not installed"
```

is the **reason** displayed in the test summary —
so you or other developers immediately know why the test was skipped.

---

### The keyword `allow_module_level=True`

Normally, you can only call `pytest.skip()` **inside** a test function or fixture.
If you call it **at the top level** of a module (before any tests run), pytest would raise an error.

Setting `allow_module_level=True` tells pytest:

> “It’s okay to skip all tests in this entire module right now.”

So it’s used for *module-wide conditional skipping.*

---

##  Typical usage pattern

You’d often see this near the top of a test file:

```python
import pytest

try:
    import pandera as pa
except ImportError:
    pytest.skip("pandera not installed", allow_module_level=True)
```

This means:

* If `pandera` is not available in the environment,
* Pytest will skip the **entire test file**,
* Instead of crashing or failing with an `ImportError`.

---

###  Example test file

```python
# tests/test_schema_validation.py
import pytest

try:
    import pandera as pa
except ImportError:
    pytest.skip("pandera not installed", allow_module_level=True)

def test_schema(features_df):
    # this test only runs if pandera is available
    ...
```

Output when Pandera missing:

```
collected 0 items / 1 skipped
============================= short test summary info =============================
SKIPPED [1] test_schema_validation.py:5: pandera not installed
```
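Pytest also ships a shorthand for this try/except pattern, `pytest.importorskip`. The sketch below uses the standard-library `json` module as a stand-in so it runs anywhere; in the lab file the call would be `pa = pytest.importorskip("pandera")`.

```python
import pytest

# Returns the imported module, or skips the whole file if the import fails
json_mod = pytest.importorskip("json")

def test_roundtrip():
    payload = {"a": 1}
    assert json_mod.loads(json_mod.dumps(payload)) == payload
```

The skip reason is generated automatically from the module name, so there is no message string to keep in sync.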



# Pandera Column Validation
`nullable`, `coerce`, and `Check.str_length(...)`
are key parts of how **Pandera’s `Column`** validation works.
The schema defines the **expected properties** of one column:

```python
Column(dtype, nullable=..., coerce=..., checks=...)
```

where:

* `dtype` is the expected data type (e.g. `pa.Float`, `pa.String`, `pa.DateTime`)
* the other arguments control **how** the column is validated and possibly converted.

---

##  `nullable`

> Whether the column is allowed to contain `NaN` (missing) values.

| Setting          | Behavior                                      |
| ---------------- | --------------------------------------------- |
| `nullable=False` | Column must have **no nulls** (`NaN`, `None`) |
| `nullable=True`  | Missing values are allowed                    |

Example:

```python
Column(pa.Float, nullable=False)
```

fails if any value in that column is missing.

```python
Column(pa.String, nullable=True)
```

passes even if some rows have `NaN` or `None`.

---

##  `coerce`

**Meaning:**

> Whether Pandera should **automatically convert** the column’s data type
> to match the declared type before validating.

| Setting                  | Behavior                                            |
| ------------------------ | --------------------------------------------------- |
| `coerce=True`            | Try to cast column values to the specified dtype    |
| `coerce=False` (default) | Expect the column to already have the correct dtype |

Example:

```python
Column(pa.String, coerce=True)
```

If your data looks like:

```python
ticker
0   AAPL
1   MSFT
2   1234   ← numeric type, but still okay
```

Pandera will cast that numeric `1234` to string `"1234"` automatically before checking.

---

##  `Check.str_length(1, 12)`

**Meaning:**

> Validate that each string’s length lies between 1 and 12 characters.

It’s a **built-in Pandera check** specialized for string data.

Example:

```python
Check.str_length(min_value=1, max_value=12)
```

* Passes for `"AAPL"` (length 4)
* Fails for `""` (length 0)
* Fails for `"VERYLONGTICKERSYMBOL"` (length > 12)


---

###  other common `Check` helpers

| Check                      | Description                 |
| -------------------------- | --------------------------- |
| `Check.in_range(min, max)` | numeric range               |
| `Check.isin([...])`        | membership in a list or set |
| `Check.less_than(x)`       | all values < x              |
| `Check.greater_than(x)`    | all values > x              |
| `Check.str_matches(regex)` | regex match for strings     |




In [9]:
# tests/test_schema_pandera.py
import pytest, pandas as pd, numpy as np
try:
    import pandera.pandas as pa
    from pandera import Column, Check, DataFrameSchema
except Exception:
    pytest.skip("pandera not installed", allow_module_level=True)

schema = pa.DataFrameSchema({
    "date":     Column(pa.DateTime, nullable=False),
    "ticker":   Column(pa.String,  nullable=False, coerce=True, checks=Check.str_length(1, 12)),
    # coerce=True upcasts float32 columns (e.g. the synthetic fallback) to float64 before checking
    "log_return": Column(pa.Float, nullable=False, coerce=True,
                         checks=Check(lambda s: np.isfinite(s).all(), error="log_return must be finite")),
    "r_1d":       Column(pa.Float, nullable=False, coerce=True,
                         checks=Check(lambda s: np.isfinite(s).all(), error="r_1d must be finite")),
    "weekday":  Column(pa.Int8, checks=Check.isin(range(7))),                 # 0..6
    "month":    Column(pa.Int8, checks=Check.in_range(1, 12)),                # 1..12, both ends inclusive by default
})

def test_schema_validate(features_df):
    # Cast ticker to string for schema validation; categorical is ok → string
    df = features_df.copy()
    df["ticker"] = df["ticker"].astype(str)
    schema.validate(df[["date","ticker","log_return","r_1d","weekday","month"]])

# 6) Duplicates


```python
df[["ticker","date"]].duplicated()
```

This returns a **Boolean Series** where each row is `True` if that combination of `ticker` and `date`
has appeared **before** in the DataFrame.

For example:

| ticker | date       | duplicated                    |
| :----- | :--------- | :---------------------------- |
| AAPL   | 2024-01-02 | False                         |
| AAPL   | 2024-01-03 | False                         |
| AAPL   | 2024-01-03 | True  ← duplicate of previous |
| MSFT   | 2024-01-02 | False                         |

---

### Important details:

* `duplicated()` marks **the second and later occurrences** of duplicates as `True`.
* The first appearance of a given `(ticker, date)` pair remains `False`.
* You can change that with `keep=False` to mark **all** duplicates, e.g.:

  ```python
  df[["ticker","date"]].duplicated(keep=False)
  ```



##  Example DataFrame

```python
import pandas as pd

df = pd.DataFrame({
    "ticker": ["AAPL", "AAPL", "AAPL", "MSFT", "MSFT", "GOOGL"],
    "date":   ["2024-01-02", "2024-01-02", "2024-01-03", "2024-01-02", "2024-01-02", "2024-01-02"],
    "adj_close": [189.5, 189.5, 190.2, 320.0, 320.0, 135.7]
})

print(df)
```

Output:

```
  ticker        date  adj_close
0   AAPL  2024-01-02      189.5
1   AAPL  2024-01-02      189.5
2   AAPL  2024-01-03      190.2
3   MSFT  2024-01-02      320.0
4   MSFT  2024-01-02      320.0
5  GOOGL  2024-01-02      135.7
```


```python
mask = df[["ticker", "date"]].duplicated()
print(mask)
```

Output:

```
0    False
1     True
2    False
3    False
4     True
5    False
dtype: bool
```

### Explanation:

* Row **1** is `True` because `("AAPL", "2024-01-02")` already appeared at row 0.
* Row **4** is `True` because `("MSFT", "2024-01-02")` already appeared at row 3.

So `.duplicated()` flags the **second and later occurrences** of duplicates.

```python
df[["ticker","date"]].duplicated().sum()
```

Output:

```
2
```



## Seeing the duplicate rows themselves

```python
dupes = df[df[["ticker","date"]].duplicated(keep=False)]
print(dupes)
```

Output:

```
  ticker        date  adj_close
0   AAPL  2024-01-02      189.5
1   AAPL  2024-01-02      189.5
3   MSFT  2024-01-02      320.0
4   MSFT  2024-01-02      320.0
```

Here `keep=False` marks **all** duplicates (both first and subsequent ones).

---

##  Dropping duplicates (optional)

```python
clean = df.drop_duplicates(subset=["ticker","date"], keep="first")
print(clean)
```

Output:

```
  ticker        date  adj_close
0   AAPL  2024-01-02      189.5
2   AAPL  2024-01-03      190.2
3   MSFT  2024-01-02      320.0
5  GOOGL  2024-01-02      135.7
```
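
In a test, the same idea can be turned into an assertion that fails with the offending rows in the message. The helper below is a minimal sketch, not part of the course repo: the name `assert_unique_keys` and the tiny `raw` frame are made up for illustration.

```python
import pandas as pd

def assert_unique_keys(df, keys=("ticker", "date")):
    """Fail with a readable message if any key combination repeats."""
    # keep=False flags every member of a duplicate group, not just the repeats
    dup_mask = df.duplicated(subset=list(keys), keep=False)
    assert not dup_mask.any(), (
        f"{int(dup_mask.sum())} rows share a {tuple(keys)} key:\n{df.loc[dup_mask]}"
    )

# Hypothetical sample data: the two AAPL rows collide on (ticker, date)
raw = pd.DataFrame({
    "ticker": ["AAPL", "AAPL", "MSFT"],
    "date":   ["2024-01-02", "2024-01-02", "2024-01-02"],
})
deduped = raw.drop_duplicates(subset=["ticker", "date"], keep="first")

assert_unique_keys(deduped)      # passes silently

try:
    assert_unique_keys(raw)      # fails: duplicate AAPL key
except AssertionError as err:
    print(str(err).splitlines()[0])
```

Using `keep=False` in the failure message shows both copies of each duplicated key, which usually makes the cause easier to spot than seeing only the later occurrence.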




# Pytest Caplog
```python
assert any("duplicate" in rec.message for rec in caplog.records)
```


##   What `caplog` is

`caplog` looks like an ordinary argument, but it is a built-in **pytest fixture** that *captures log messages* emitted during a test.
It lets you inspect what your code sent to `logging`, e.g., warnings, errors, debug messages.

At the start of the test, you set:

```python
caplog.set_level(logging.WARNING)
```

This tells pytest: “Capture all log records at level WARNING or higher.”

So, if the `check_for_duplicates(df)` function logs something like:

```python
logger.warning("Found duplicate ticker-date pairs!")
```

then that message gets stored in `caplog.records`.



| Part                         | Meaning                                                              |
| ---------------------------- | -------------------------------------------------------------------- |
| `caplog.records`             | a list of captured `LogRecord` objects emitted during this test      |
| `rec.message`                | the text of each log message (after formatting)                      |
| `"duplicate" in rec.message` | check whether the substring `"duplicate"` appears in the log message |
| `any(...)`                   | returns `True` if *at least one* log record contains that substring  |

So this line asserts:

> “At least one WARNING message was logged that contains the word ‘duplicate’.”

---

###  Example of what it’s testing

If inside `check_for_duplicates` you have:

```python
logger.warning("Found 1 duplicate rows in key columns.")
```

then during the test:

```python
caplog.records[0].message == "Found 1 duplicate rows in key columns."
```

and the assertion passes because `"duplicate"` is in that string.

---

###  If no message was logged

Then `caplog.records` would be empty,
`any("duplicate" in rec.message for rec in caplog.records)` would be `False`,
and pytest would fail the test with:

```
AssertionError: assert False
```
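
To build intuition for what `caplog` collects, here is a stand-alone sketch that runs without pytest. It is only an analogy, not pytest's actual implementation: `ListHandler` and the logger name `dspt_demo` are hypothetical, and the handler simply stores each `LogRecord` in a list the way `caplog.records` exposes them.

```python
import logging

class ListHandler(logging.Handler):
    """Collect every LogRecord it receives (a stand-in for caplog.records)."""
    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)

logger = logging.getLogger("dspt_demo")   # hypothetical logger name
logger.setLevel(logging.WARNING)          # drop anything below WARNING
handler = ListHandler()
logger.addHandler(handler)

logger.info("ignored: below WARNING")
logger.warning("Found %d duplicate (ticker,date) rows", 1)

# getMessage() applies the %-formatting, much like rec.message under caplog
messages = [rec.getMessage() for rec in handler.records]
print(messages)
assert any("duplicate" in m for m in messages)

logger.removeHandler(handler)             # clean up so later code is unaffected
```

Only the warning is stored, because the INFO message is filtered out by the logger's level before any handler sees it.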



In [10]:
# save to tests/test_logging.py
import logging, pandas as pd, numpy as np, pytest
from scripts.logsetup import setup_logging

def check_for_duplicates(df, logger=None):
    logger = logger or setup_logging("dspt")
    dups = df[["ticker","date"]].duplicated().sum()
    if dups > 0:
        logger.warning("Found %d duplicate (ticker,date) rows", dups)
    return dups

def test_duplicate_warning(caplog):
    caplog.set_level(logging.WARNING)
    df = pd.DataFrame({"ticker":["AAPL","AAPL"], "date":pd.to_datetime(["2024-01-02","2024-01-02"])})
    dups = check_for_duplicates(df)
    assert dups == 1
    assert any("duplicate" in rec.message for rec in caplog.records)

In [11]:
!pytest tests/test_logging.py

.                                                                        [100%]
1 passed in 2.97s


# Homework (due before Session 14)

**Goal:** Create a **Health Check** notebook that prints key diagnostics and is easy to include in your Quarto report.

### Part A — Build a reusable **health** module


### Python datetime

```python
pd.to_datetime(df["date"]).min().date()
```


###  `.date()`

Converts the pandas `Timestamp` (or `datetime`) to a **plain Python `datetime.date`** object:

```
datetime.date(2024, 1, 2)
```

This strips the time portion (keeps only year–month–day).

---

##  Example

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2024-01-03", "2024-01-05", "2024-01-02"]
})

start_date = pd.to_datetime(df["date"]).min().date()
print(start_date)
```

Output:

```
2024-01-02
```

and the type is:

```python
type(start_date)
# datetime.date
```



### Count the number of distinct non-NA values
```python
df["ticker"].nunique()
```

The method `.nunique()` means:

> “Count the number of **distinct (unique)** non-NA values in this Series.”

It’s shorthand for:

```python
len(df["ticker"].dropna().unique())
```

---

###  Example

```python
import pandas as pd

df = pd.DataFrame({
    "ticker": ["AAPL", "MSFT", "AAPL", "GOOGL", "MSFT", None]
})

print(df["ticker"].nunique())
```

Output:

```
3
```

---

## Notes

* `NaN` / `None` values are **ignored by default**.
* If you want to **include** NaN in the count, you can pass `dropna=False`:

```python
df["ticker"].nunique(dropna=False)
```

→ would return `4` in the example above (`AAPL`, `MSFT`, `GOOGL`, and NaN).




```python
np.nanmin(s)
```


`np.nanmin()` returns the **smallest non-NaN value** in `s`, **ignoring NaN** values. Infinities are not skipped, so `-inf` can be the result.

Equivalent logic:

```python
np.min(s[~np.isnan(s)])
```

So it “skips over” any missing data.



##  Example

```python
import numpy as np

s = np.array([3.5, np.nan, 2.1, 5.0])

print(np.nanmin(s))
```

**Output:**

```
2.1
```


##  Comparison with similar functions

| Function       | Behavior with NaN             | Example result for `[3.5, np.nan, 2.1]` |
| -------------- | ----------------------------- | --------------------------------------- |
| `np.min()`     | **Propagates NaN** (no error) | `nan`                                   |
| `np.nanmin()`  | **Ignores NaN**               | `2.1`                                   |
| `np.nanmax()`  | Ignores NaN, finds maximum    | `3.5`                                   |
| `np.nanmean()` | Ignores NaN, computes mean    | `2.8`                                   |



##  Caution

If *all* elements are `NaN`, then:

```python
np.nanmin([np.nan, np.nan])
```

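does **not** raise an error. NumPy emits a warning (the exact text can vary with the input type) and returns `nan`:

```
RuntimeWarning: All-NaN slice encountered
```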
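
An all-NaN column is a realistic case in a health report (for example, a feature that failed to compute). One way to sidestep it, sketched here under the assumption that reporting `None` is acceptable for such a column, is to drop NaNs first and check whether anything is left. The helper name `safe_range` is hypothetical.

```python
import numpy as np
import pandas as pd

def safe_range(values):
    """Return (min, max) of the non-NaN values, or (None, None) if there are none."""
    # errors="coerce" turns unparseable entries into NaN, which dropna() then removes
    s = pd.to_numeric(pd.Series(values), errors="coerce").dropna()
    if s.empty:
        return None, None
    return float(s.min()), float(s.max())

print(safe_range([3.5, np.nan, 2.1, 5.0]))
print(safe_range([np.nan, np.nan]))
```

Returning `None` also keeps the value JSON-friendly, since `json.dumps` writes it as `null`.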





##  `if h.get("nulls"):`

* `h` is  a **dictionary** holding various statistics or summaries, like:

  ```python
  h = {
      "rows": 3000,
      "cols": 8,
      "nulls": {"log_return": 5, "r_1d": 3, "volume": 0}
  }
  ```
* `h.get("nulls")` tries to access the key `"nulls"`.

  * If it exists and is **truthy** (not empty), the code inside runs.
  * If `"nulls"` is missing or an empty dict `{}`, the `if` block is skipped.
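
The three cases can be checked directly. This small sketch uses made-up dictionaries; only the `"nulls"` key matters here.

```python
# Three hypothetical summaries: key missing, key present but empty, key with data
candidates = {
    "key missing":    {"rows": 3000},
    "empty dict":     {"rows": 3000, "nulls": {}},
    "non-empty dict": {"rows": 3000, "nulls": {"log_return": 5}},
}

# bool(...) mirrors what the `if` statement evaluates
results = {label: bool(h.get("nulls")) for label, h in candidates.items()}
for label, runs in results.items():
    print(f"{label:15s} -> if-block runs: {runs}")
```

Unlike `h["nulls"]`, `h.get("nulls")` returns `None` for a missing key instead of raising `KeyError`, so the check is safe even when the summary was built without that entry.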


---
# Build lines using Python

##  `lines += ["", "## Top Null Counts", ""]`

This adds three new strings to a list called `lines`, which is accumulating lines for a Markdown file.

It’s equivalent to appending:

```markdown
(blank line)
## Top Null Counts
(blank line)
```



---

##  `lines += [f"- **{k}**: {v}" for k,v in h["nulls"].items()]`

This is a **list comprehension** that builds a bulleted list of null counts.


| Variable            | Description                                        |
| ------------------- | -------------------------------------------------- |
| `k`                 | column name                                        |
| `v`                 | number of missing (null) values in that column     |
| `f"- **{k}**: {v}"` | formatted Markdown line like `- **log_return**: 5` |

Example result:

```python
["- **log_return**: 5", "- **r_1d**: 3", "- **volume**: 0"]
```

Then `lines += ...` appends those lines to the existing list.

---

##  So in Markdown, the generated section would look like:

```
## Top Null Counts

- **log_return**: 5
- **r_1d**: 3
- **volume**: 0
```
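
Putting the pieces together, the guard and the two `lines +=` statements can be wrapped in a small function and tried on a sample dictionary. This is an illustrative sketch: the function name `render_null_section` is hypothetical, and the values in `h` are the made-up ones used above.

```python
def render_null_section(h):
    """Return Markdown lines for the null-count section, or [] if there is nothing to report."""
    lines = []
    if h.get("nulls"):
        lines += ["", "## Top Null Counts", ""]
        lines += [f"- **{k}**: {v}" for k, v in h["nulls"].items()]
    return lines

h = {"rows": 3000, "cols": 8, "nulls": {"log_return": 5, "r_1d": 3, "volume": 0}}
markdown = "\n".join(["# Data Health Summary"] + render_null_section(h))
print(markdown)
```

Keeping the text as a list of lines until the final `"\n".join(...)` makes it easy to add or skip whole sections conditionally.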





In [12]:
# save to scripts/health.py
from __future__ import annotations
import pandas as pd, numpy as np, json
from pathlib import Path

def df_health(df: pd.DataFrame) -> dict:
    out = {}
    out["rows"] = int(len(df))
    out["cols"] = int(df.shape[1])
    out["date_min"] = str(pd.to_datetime(df["date"]).min().date())
    out["date_max"] = str(pd.to_datetime(df["date"]).max().date())
    out["tickers"]  = int(df["ticker"].nunique())
    # Null counts (top 10)
    na = df.isna().sum().sort_values(ascending=False)
    out["nulls"] = na[na>0].head(10).to_dict()
    # Duplicates
    out["dup_key_rows"] = int(df[["ticker","date"]].duplicated().sum())
    # Example numeric ranges for core cols
    for c in [x for x in ["log_return","r_1d","roll_std_20"] if x in df.columns]:
        s = pd.to_numeric(df[c], errors="coerce")
        out[f"{c}_min"] = float(np.nanmin(s))
        out[f"{c}_max"] = float(np.nanmax(s))
    return out

def write_health_report(in_parquet="data/processed/features_v1.parquet",
                        out_json="reports/health.json", out_md="reports/health.md"):
    p = Path(in_parquet)
    if not p.exists():
        raise SystemExit(f"Missing {in_parquet}.")
    df = pd.read_parquet(p)
    h = df_health(df)
    Path(out_json).write_text(json.dumps(h, indent=2))
    # Render a small Markdown summary
    lines = [
        "# Data Health Summary",
        "",
        f"- Rows: **{h['rows']}**; Cols: **{h['cols']}**; Tickers: **{h['tickers']}**",
        f"- Date range: **{h['date_min']} → {h['date_max']}**",
        f"- Duplicate (ticker,date) rows: **{h['dup_key_rows']}**",
    ]
    if h.get("nulls"):
        lines += ["", "## Top Null Counts", ""]
        lines += [f"- **{k}**: {v}" for k,v in h["nulls"].items()]
    Path(out_md).write_text("\n".join(lines))
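    print("Wrote", out_json, "and", out_md)

# Without this guard, `python scripts/health.py` only defines the functions and exits.
if __name__ == "__main__":
    write_health_report()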

In [13]:
!python scripts/health.py

### Part B — **Health Check notebook** (`reports/health.ipynb`)

Create a new notebook `reports/health.ipynb` with **two cells**:

**Cell 1 (setup):**

In [14]:
# !pip install --upgrade ipython  # Optional: upgrading IPython may cause problems in Colab; the manual importlib.reload approach below is the safer alternative.


`%load_ext autoreload` is an **IPython magic command** that tells the notebook to automatically reload your Python modules (like `scripts.health`) whenever you edit them, so you don’t have to restart the kernel.

* `%autoreload 2` means *always reload everything except excluded modules*.
* But the extension’s implementation (in `/usr/local/lib/python3.12/dist-packages/IPython/extensions/autoreload.py`) still calls:

  ```python
  from imp import reload
  ```

  which fails, since `imp` was **completely removed** in Python 3.12 (it had been deprecated since 3.4).

---

##  Fix / Workarounds

### Option 1 — upgrade ipython

Newer IPython versions (≥ 8.26) already fixed this.
So first try upgrading IPython:

```bash
!pip install --upgrade ipython
```

Then restart your runtime and re-run:

```python
%load_ext autoreload
%autoreload 2
```

This works on IPython versions that import `reload` from `importlib` instead of `imp`.

---

### Option 2 — Use the modern replacement: `importlib.reload`

If you can’t upgrade IPython (e.g., fixed Colab image), skip the magic and use the Python API directly:

```python
from importlib import reload
import scripts.health

reload(scripts.health)
from scripts.health import write_health_report
write_health_report()
```

This manually reloads the module after you edit it.
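
The sketch below shows the effect on a throwaway module so it can run anywhere. The module name `demo_reload_mod` and its `VERSION` variable are hypothetical; they stand in for `scripts.health` and whatever you edited in it.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Write a throwaway module to a temp folder (hypothetical name: demo_reload_mod)
tmp_dir = Path(tempfile.mkdtemp())
module_path = tmp_dir / "demo_reload_mod.py"
module_path.write_text("VERSION = 1\n")

sys.path.insert(0, str(tmp_dir))
importlib.invalidate_caches()
mod = importlib.import_module("demo_reload_mod")
from demo_reload_mod import VERSION as old_name   # bound to the value at import time

# "Edit" the file, then reload the module object in place
module_path.write_text("VERSION = 200\n")
importlib.reload(mod)

print(mod.VERSION)   # reflects the edited file
print(old_name)      # still the old value: `from ... import` names are not rebound
```

This is why the snippet above repeats `from scripts.health import write_health_report` after calling `reload`: names imported with `from ... import` keep pointing at the old objects until they are imported again.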



In [15]:
# In Colab, the following cell works only once after upgrading IPython. Running it a second time crashes and freezes the runtime.

# %load_ext autoreload
# %autoreload 2
# from scripts.health import write_health_report
# write_health_report()  # writes reports/health.json and reports/health.md

**Cell 2 (display in notebook):**

In [16]:
# Manual reload is safe.
from importlib import reload
import scripts.health

reload(scripts.health)
from scripts.health import write_health_report
write_health_report()


Wrote reports/health.json and reports/health.md


In [17]:
from pathlib import Path
print(Path("reports/health.md").read_text())

# Data Health Summary

- Rows: **3975**; Cols: **18**; Tickers: **25**
- Date range: **2020-01-29 → 2020-09-07**
- Duplicate (ticker,date) rows: **0**


### Part C — Include health output in your **Quarto report**

In `reports/eda.qmd`, add a section:

````markdown
## Data Health (auto-generated)

```{python}
from pathlib import Path
print(Path("reports/health.md").read_text())
```
````

### Part D — Add a **Makefile** target and a quick test

**Makefile append:** (When you copy this into the Makefile, indent recipe lines with four spaces; the code below then converts them to the tabs that `make` requires. This is only a Colab workaround.)
```bash
.PHONY: health test pytest
health: ## Generate health.json and health.md from the current features parquet
    python scripts/health.py

pytest:
    pytest -q

test: pytest
```



# 1) `.PHONY`

Use **`.PHONY`**  to tell `make` that certain targets are *not files*, just commands.
Why? If a file named `test` or `pytest` exists, `make` would think the target is already “up to date” and skip running it. Marking targets as phony forces them to run.

```make
.PHONY: health test pytest
```

# 2) `test: pytest`

In a Makefile, the syntax is:

```
target: prerequisites
[TAB] recipe...
```

So in:

```make
test: pytest
```

* `test` is the **target**.
* `pytest` is a **prerequisite** (a dependency target).
* Putting `pytest` **on the same line after the colon** means: “before building `test`, build `pytest`.”

When you run `make test`, `make` first ensures the `pytest` target has run successfully; if `pytest` has its own recipe, that recipe runs. If `pytest` is also phony, it runs every time.

Example with recipes:

```make
.PHONY: test pytest

pytest:
	pytest -q

test: pytest
	@echo "All tests completed"
```

Here, `make test` will:

1. Run `pytest -q` (because `test` depends on `pytest`)
2. Then echo “All tests completed”.

You can list **multiple prerequisites**:

```make
test: lint unit integration
```

…and `make` will run `lint`, `unit`, and `integration` before `test`. In a serial build, GNU make processes them left to right; with `-j` they may run in parallel unless you add ordering constraints.




In [18]:
%%bash

# BACK UP FIRST
cp Makefile Makefile.bak
# Replace lines that BEGIN with 4 spaces by a single tab
perl -i -pe 's/^\h{4}(?=\S)/\t/' Makefile
cat Makefile

# Makefile — unified-stocks
SHELL := /bin/bash
.SHELLFLAGS := -eu -o pipefail -c
.ONESHELL:


PY := python
QUARTO := quarto

START ?= 2020-01-01
END   ?= 2025-08-01
ROLL  ?= 30

DATA_RAW := data/raw/prices.csv
FEATS    := data/processed/features.parquet
REPORT   := docs/reports/eda.html

# Default target
.DEFAULT_GOAL := help

.PHONY: help all clean clobber qa report backup

help: ## Show help for each target
	@awk 'BEGIN {FS = ":.*##"; printf "Available targets:\n"} /^[a-zA-Z0-9_\-]+:.*##/ {printf "  \033[36m%-18s\033[0m %s\n", $$1, $$2}' $(MAKEFILE_LIST)

# all: $(DATA_RAW) $(FEATS) report backup ## Run the full pipeline and back up artifacts
all: $(DATA_RAW) $(FEATS) report train backup

$(DATA_RAW): scripts/get_prices.py tickers_25.csv
	$(PY) scripts/get_prices.py --tickers tickers_25.csv --start $(START) --end $(END) --out $(DATA_RAW)

$(FEATS): scripts/build_features.py $(DATA_RAW) scripts/qa_csv.sh
	# Basic QA first
	scripts/qa_csv.sh $(DATA_RAW)
	$(PY) scripts/build_features.py

In [19]:
%%bash
set -euo pipefail
cd "/content/drive/MyDrive/dspt25/STAT4160"
make health
make pytest

python scripts/health.py
pytest tests/test_logging.py -q
.                                                                        [100%]


**Test that health files exist:**



In [20]:
# save to tests/test_health_outputs.py
import json
import os

def test_health_files_exist():
    assert os.path.exists("reports/health.json")
    assert os.path.exists("reports/health.md")
    # The JSON must parse; json.load raises on malformed content
    with open("reports/health.json") as f:
        json.load(f)

In [21]:
!pytest tests/test_health_outputs.py

.                                                                        [100%]
1 passed in 0.31s

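A slightly stronger check than "the file exists and parses" is to confirm that the expected keys are present. The sketch below assumes the key names written by `df_health` in this session; the helper name `missing_health_keys` is hypothetical, and the demo writes to a temporary file so it does not depend on `reports/health.json`.

```python
import json
import tempfile
from pathlib import Path

# Keys written by df_health in this session's scripts/health.py
REQUIRED_KEYS = {"rows", "cols", "date_min", "date_max", "tickers", "nulls", "dup_key_rows"}

def missing_health_keys(path):
    """Return the sorted list of required keys absent from a health JSON file."""
    data = json.loads(Path(path).read_text())
    return sorted(REQUIRED_KEYS - set(data))

# Demo: a deliberately incomplete report in a temp folder
demo_path = Path(tempfile.mkdtemp()) / "health.json"
demo_path.write_text(json.dumps({"rows": 3975, "cols": 18, "tickers": 25}))
print(missing_health_keys(demo_path))
```

In a real test you could assert that `missing_health_keys("reports/health.json") == []`, which fails with the list of missing keys in the message.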


##  **The `-k` flag in pytest**

`-k` lets you **run only tests whose names (or test node IDs) match a keyword expression**.

###  Basic usage

```bash
pytest -k "duplicate"
```

Runs all tests whose names *contain* the substring `"duplicate"` —
for example:

* `test_duplicate_warning`
* `test_find_duplicates`

and skips the rest.



In [22]:
%%bash
make health
pytest -q -k health

python scripts/health.py
.                                                                        [100%]
