In [None]:
# from google.colab import drive
# drive.flush_and_unmount()           # ignore errors if already unmounted

# If remounting fails, delete the mount point and then remount
# rm -rf /content/drive


In [1]:
# Colab cell
from google.colab import drive

drive.mount('/content/drive', force_remount=True)



Mounted at /content/drive


In [2]:
# Adjust these two for YOUR repo
REPO_OWNER = "ywanglab"
REPO_NAME  = "STAT4160"   # e.g., unified-stocks-team1
BASE_DIR   = "/content/drive/MyDrive/dspt25"
CLONE_DIR  = f"{BASE_DIR}/{REPO_NAME}"
REPO_URL   = f"https://github.com/{REPO_OWNER}/{REPO_NAME}.git"

# if on my office computer

# REPO_NAME  = "lectureNotes"   # e.g., on my office computer
# BASE_DIR = r"E:\OneDrive - Auburn University Montgomery\teaching\AUM\STAT 4160 Productivity Tools" # on my office computer
# CLONE_DIR  = f"{BASE_DIR}\\{REPO_NAME}"   # escape the backslash inside an f-string

import os, pathlib
pathlib.Path(BASE_DIR).mkdir(parents=True, exist_ok=True)


In [3]:
import os, subprocess, shutil, pathlib

if not pathlib.Path(CLONE_DIR).exists():
    !git clone {REPO_URL} {CLONE_DIR}
else:
    # If the folder exists, just ensure it's a git repo and pull latest
    os.chdir(CLONE_DIR)
    # !git status
    # !git pull --rebase # !git pull --ff-only
os.chdir(CLONE_DIR)
print("Working dir:", os.getcwd())

Working dir: /content/drive/MyDrive/dspt25/STAT4160




## Session 14 — pre‑commit & GitHub Actions CI

### Learning goals

Students will be able to:

1.  Configure **pre‑commit** to run **Black**, **Ruff** (lint + import sort), and **nbstripout** on every commit.
2.  Keep commits clean and **notebook outputs stripped**.
3.  Add a fast **GitHub Actions** CI workflow that runs pre‑commit hooks and **pytest** on each PR (Pull Request).
4.  Keep CI runtime **under \~3–4 minutes** with caching and a lean dependency set.

------------------------------------------------------------------------

## Agenda

-    why pre‑commit; the “quality gate”; anatomy of a fast CI
-    Black vs Ruff; when nbstripout matters; what belongs in CI
-    **In‑class lab**: configure pre‑commit (Black, Ruff, nbstripout) → run locally → add CI workflow → local dry‑run
-    Wrap‑up + homework briefing
-    Buffer

------------------------------------------------------------------------

## Main points

**Why pre‑commit?**

-   Prevent “drive‑by” problems before they enter history: unformatted code, stray notebook outputs, trailing whitespace.
-   Hooks run **locally on commit**, then again in **CI** for defense‑in‑depth.

**Black & Ruff**

-   **Black**: opinionated formatter → consistent diffs; no bikeshedding.
-   **Ruff**: very fast linter (flake8 family), plus **import sorting**; can also fix many issues (`--fix`).
-   You can use **both** (common) or let Ruff handle formatting too; we’ll use both for clarity.

**nbstripout**

-   Remove cell outputs from notebooks to keep diffs small, avoid repository bloat from embedded binary data (e.g., base64-encoded images), and reduce CI time.
-   Two patterns: **pre‑commit hook** (recommended) and/or **git filter** (`nbstripout --install`).

**CI scope (fast!)**

-   Lint + tests only; **no heavy training** in CI.
-   Cache dependencies; pin Python (3.11+).
-   Keep tests deterministic and **\< \~5s** (already done in Session 13).

------------------------------------------------------------------------



###  **Black**

* **What it is:** an **automatic Python code formatter**.
* **Goal:** make all Python code look the same — consistent indentation, spacing, quotes, etc.
* **Key idea:** “*Blackened code is code you can’t argue about.*”
* **Typical usage:**

  ```bash
  black .
  ```

  It reformats all `.py` files in place to follow a uniform style.

Benefits:

* No more style debates (“single vs double quotes”).
* Fast and deterministic formatting.
* Integrates easily with **pre-commit** and **GitHub Actions**.

---

###  **Ruff**

* **What it is:** a **blazing-fast Python linter and formatter**, written in Rust.
* **Replaces:** tools like `flake8`, `isort`, and partially `black`.
* **Purpose:** catch style errors, unused imports, bad patterns, and optionally auto-fix them.

Typical usage:

```bash
ruff check .
ruff format .
```

Benefits:

* Extremely fast.
* Enforces PEP8 style, import sorting, and code hygiene.
* Can run as part of pre-commit or CI.

---

###  **nbstripout**

* **What it is:** a **Jupyter Notebook cleaner**.
* **Goal:** remove execution outputs (plots, printed text, etc.) before committing `.ipynb` files to Git.
* Prevents large diffs and bloated repos.

Usage:

```bash
nbstripout --install
```

This installs a Git filter so that whenever you commit a notebook, outputs and metadata are automatically stripped.

 Benefits:

* Keeps notebook diffs clean (only code changes appear).
* Saves repo space.
* Plays nicely with CI and code review.
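
To make "stripping outputs" concrete, here is a minimal sketch of the idea using only the standard library. It is not nbstripout's actual implementation; the helper name `strip_outputs` and the tiny hand-written notebook are assumptions for illustration. A notebook is a JSON document, so clearing outputs amounts to emptying each code cell's `outputs` list and resetting its `execution_count`.

```python
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of a notebook dict with code-cell outputs removed.

    Hypothetical helper for illustration; nbstripout handles more cases.
    """
    cleaned = json.loads(json.dumps(notebook))  # cheap deep copy of JSON data
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


# A tiny hand-written notebook (illustrative data only)
nb = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": 3,
            "source": ["print('hi')"],
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["hi\n"]}
            ],
        },
    ],
}

stripped = strip_outputs(nb)
print(json.dumps(stripped["cells"][1], indent=1))
```

The source of each cell is untouched, so a diff of the stripped notebook shows only code and text changes.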

---

### Summary Table

| Tool           | Type               | Main Purpose                              | Typical Use                      |
| -------------- | ------------------ | ----------------------------------------- | -------------------------------- |
| **Black**      | Formatter          | Reformat code automatically               | `black .`                        |
| **Ruff**       | Linter / Formatter | Detect and fix style or logic issues fast | `ruff check .` / `ruff format .` |
| **nbstripout** | Notebook cleaner   | Strip outputs before Git commit           | `nbstripout --install`           |

---

Together, they form a strong trio for clean, reproducible, and review-friendly Python + Jupyter workflows.



##  **PEP 8**

**PEP 8** stands for *Python Enhancement Proposal 8*,
which is the **official style guide for Python code**.

It defines conventions for writing *clean, consistent, and readable* Python.

###  Key PEP 8 rules (summary)

| Category        | Examples                                                                                |
| --------------- | --------------------------------------------------------------------------------------- |
| **Indentation** | 4 spaces per level (no tabs)                                                            |
| **Line length** | ≤ 79 characters per line                                                                |
| **Naming**      | `snake_case` for functions/variables, `CamelCase` for classes, `ALL_CAPS` for constants |
| **Spacing**     | one space around operators (`x = y + 2`), none inside parentheses                       |
| **Imports**     | one import per line, grouped by standard → third-party → local                          |
| **Docstrings**  | use triple quotes `"""` for functions/classes/modules                                   |
| **Readability** | blank lines between top-level functions/classes                                         |

Tools like **Black** and **Ruff** help enforce PEP 8 automatically.





## In‑class lab (Colab‑friendly)

### 1) Install tools locally (for this Colab runtime)

In [5]:
!pip -q install pre-commit black ruff nbstripout pytest


### 2) Add **tool config** to `pyproject.toml` (Black + Ruff)

Refer to lec10-inclass.ipynb for python package and `pyproject.toml`.

> If you don’t have a `pyproject.toml`, this cell will create a minimal one; otherwise it appends/updates sections.

##  `.toml` — what it stands for

**TOML** = **Tom’s Obvious, Minimal Language**
(named after its creator, **Tom Preston-Werner**, co-founder of GitHub).

It’s a simple **configuration file format** — similar in spirit to `.ini` or `.yaml`, but more predictable and unambiguous.

So,
`pyproject.toml` literally means:

> “Python project configuration file written in the TOML format.”


TOML is designed to be:

* **Human-readable** like INI files,
* **Strictly structured** like JSON,
* **Minimal** and easy to parse.

### Example TOML syntax

```toml
[tool.black]
line-length = 88
target-version = ["py312"]

[project]
name = "my_package"
version = "0.1.0"
dependencies = ["pandas", "numpy"]
```

### Features

* Uses `key = value` pairs
* `[section]` headers group settings
* Supports strings, numbers, arrays, and nested tables

---

##  `pyproject.toml` in Python

This file is the **central configuration hub** for modern Python projects (PEP 518, PEP 621).
It replaced the older scattering of tool-specific configs:

* `setup.py`
* `setup.cfg`
* `tox.ini`
* `.flake8`
* `mypy.ini`, etc.

### Common uses

| Tool                      | Example section                                                            |
| ------------------------- | -------------------------------------------------------------------------- |
| **Build system**          | `[build-system]` — defines the build backend (e.g. `setuptools`, `poetry`) |
| **Black**                 | `[tool.black]` — formatting rules                                          |
| **Ruff**                  | `[tool.ruff]` — linting rules                                              |
| **pytest**                | `[tool.pytest.ini_options]` — test options                                 |
| **isort**, **mypy**, etc. | each can define their settings under `[tool.<name>]`                       |

For packaging, the file name is not a free choice: build tools look for a file named exactly `pyproject.toml` in the project root, and most linters and formatters check the same file for their `[tool.*]` settings.


`pyproject.toml` was introduced by **PEP 518** (the `[build-system]` table); **PEP 621** later standardized project metadata in the `[project]` table.



The Python packaging ecosystem agreed to use **one shared file name** — `pyproject.toml` — so all build tools and linters know where to look automatically.

---

##  Tools that read `pyproject.toml`

Not every tool treats the file the same way:

| Tool                                                             | Purpose                      | Reads `pyproject.toml`?                                                                                      |
| ---------------------------------------------------------------- | ---------------------------- | ------------------------------------------------------------------------------------------------------------ |
| **pip**, **build**, **setuptools**, **Poetry**, **Flit**         | Packaging and dependencies   | Yes; the file name is fixed by the packaging standards                                                       |
| **Black**, **Ruff**, **mypy**, **pytest**, **isort**, **pylint** | Code style, linting, testing | Yes, under `[tool.<name>]`; most also accept a dedicated file (e.g., `ruff.toml`, `mypy.ini`, `pytest.ini`)  |
| **tox**                                                          | Test automation              | Optional; `tox.ini` is the traditional location                                                              |
| **pre-commit**                                                   | Git hook manager             | No; it reads `.pre-commit-config.yaml`                                                                       |

So if you rename `pyproject.toml` to something else (e.g., `project_config.toml`), packaging tools **won’t detect it**, and other tools would have to be pointed at the file explicitly (if they support that at all).

---

## If you’re using your own script

In *your own* code — like your `upsert()` function — you can use **any filename you want**, for example:

```python
pyproj = Path("config.toml")
```

That’s perfectly fine as long as **your script** knows which file to read/write.

It just won’t be picked up automatically by external tools.



# How to write/update a pyproject.toml


* The function (`upsert` below) updates (`update`) an existing `[section]` if it already exists,
  or adds (`insert`) a new section if it doesn’t — hence **upsert**.
* Each “section” is assumed to look like this:

  ```
  [section_header]
  key = value
  another_key = value
  ```


# Use regex (advanced)

```python
pattern = rf"(?ms)^\[{re.escape(section_header)}\]\s*.*?(?=^\[|\Z)" # `r`: raw literal string
```
`re.escape(string)` escapes every character in `string` that has a special meaning in regex, so the whole string is matched literally (for example, the `.` in `tool.black` matches only a dot).

| Component                     | Meaning                                                                                                                          |
| ----------------------------- | -------------------------------------------------------------------------------------------------------------------------------- |
| `(?ms)`                       | Enables **multiline (`m`)** and **dot-all (`s`)** modes:<br> • `^` and `$` match at line boundaries.<br> • `.` matches newlines. |
| `^\[`                         | Match a literal `[` at the start of a line (beginning of a section).                                                             |
| `{re.escape(section_header)}` | Insert the literal section name safely (escaping special regex chars).                                                           |
| `\]`                          | Closing bracket of the section header.                                                                                           |
| `\s*`                         | Optional whitespace/newline after the header.                                                                                    |
| `.*?`                         | Lazily match all lines in the section body (non-greedy).                                                                         |
| `(?=^\[\|\Z)`                 | Stop when the next section header starts (`^\[`) or at end of file (`\Z`).                                                       |

So this matches the *entire section block* from `[section_header]` up to (but not including) the next header or EOF.

Example match:

```
[tool.black]
line-length = 88
target-version = ["py312"]
```
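
To see the match boundaries in practice, the following sketch runs the same pattern against a small three-section text (the sample text is made up for illustration). Note that the `[` in `["py312"]` does not end the match, because it is not at the start of a line.

```python
import re

# Illustrative sample text with three sections
text = (
    '[project]\n'
    'name = "demo"\n'
    '\n'
    '[tool.black]\n'
    'line-length = 88\n'
    'target-version = ["py312"]\n'
    '\n'
    '[tool.ruff]\n'
    'line-length = 88\n'
)

section_header = "tool.black"
pattern = rf"(?ms)^\[{re.escape(section_header)}\]\s*.*?(?=^\[|\Z)"

match = re.search(pattern, text)
print(repr(match.group()))               # the whole [tool.black] block
print(text[match.end():].splitlines()[0])  # first line after the match
```

The match includes the blank line that separates the section from the next header, which matters when the block is later replaced.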

---

### `if re.search(pattern, existing):`

Check whether that section already exists in the text.

---

###  If it exists → **replace it**

```python
existing = re.sub(pattern, f"[{section_header}]\n{body}\n", existing)
```
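* `re.sub(pattern, replacement, string)` searches `string` for every substring that matches `pattern` and replaces each one with `replacement`.
* Here it replaces the matched section with the same header followed by the new body.
* `re.sub` interprets backslash escapes in a string `replacement`, so if `body` may contain backslashes, pass a function (e.g., a `lambda`) as the replacement instead.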
* `body` is expected to be the lines inside the section, like:

  ```
  line-length = 88
  target-version = ["py312"]
  ```

Result: the section is replaced in-place.
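
One caveat is worth a small experiment: when the replacement is a plain string, `re.sub` parses backslash escapes in it. The sketch below uses a made-up `[tool.demo]` section whose new body contains a Windows-style path; the string form fails, while a function replacement inserts the text verbatim. The section name and values are assumptions for illustration.

```python
import re

text = "[tool.demo]\nold = 1\n\n[other]\nx = 2\n"
pattern = r"(?ms)^\[tool\.demo\]\s*.*?(?=^\[|\Z)"

body = r"path = 'C:\data'"              # body containing a backslash
new_block = f"[tool.demo]\n{body}\n\n"  # keep a blank line before the next header

# String replacement: "\d" is parsed as an (invalid) escape sequence.
try:
    re.sub(pattern, new_block, text)
    string_form_ok = True
except re.error:
    string_form_ok = False

# Function replacement: the returned string is inserted as-is.
updated = re.sub(pattern, lambda _match: new_block, text)

print(string_form_ok)
print(updated)
```

The bodies used in this lab contain no backslashes, so the string form works there; the function form is simply the safer default.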

---

###  Else → **append new section**

```python
existing += f"\n[{section_header}]\n{body}\n"
```

If the section didn’t exist, it’s appended at the end of the text file.

---

##  Example

```python
existing = """
[project]
name = "demo"
"""

upsert("tool.black", "line-length = 88\nskip-string-normalization = true")

print(existing)
```

**Output:**

```
[project]
name = "demo"

[tool.black]
line-length = 88
skip-string-normalization = true
```

If you run it again with a new body for `tool.black`, the function updates that section instead of adding a duplicate.
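
The insert-then-update behavior can be checked with a variant that takes and returns the text instead of relying on a global variable. This is a sketch, not the function used in the lab cell; the name `upsert_section` is hypothetical.

```python
import re


def upsert_section(text: str, section_header: str, body: str) -> str:
    """Replace an existing [section_header] block or append a new one."""
    pattern = rf"(?ms)^\[{re.escape(section_header)}\]\s*.*?(?=^\[|\Z)"
    block = f"[{section_header}]\n{body}\n"
    if re.search(pattern, text):
        # A function replacement avoids backslash processing in `block`.
        return re.sub(pattern, lambda _match: block, text)
    return text + f"\n{block}"


doc = '[project]\nname = "demo"\n'
doc = upsert_section(doc, "tool.black", "line-length = 88")    # insert
doc = upsert_section(doc, "tool.black", "line-length = 100")   # update
print(doc)
```

Because the function has no hidden state, it is easy to test and can be applied to any TOML-like text, at the cost of having to pass the text around explicitly.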





```regex
(?ms)
```

sets **two regex flags** inside the pattern:

| Flag | Name          | Effect                                                                                  |
| ---- | ------------- | --------------------------------------------------------------------------------------- |
| `m`  | **multiline** | makes `^` and `$` match *at the start and end of every line*, not just the whole string |
| `s`  | **dot-all**   | makes the dot (`.`) match *any* character, including **newlines**                       |

Together, they change how the rest of the regex behaves.

---

###  Without flags

```regex
^\[        # ^ only matches the very start of the string
.          # . matches any character except newline
```

* `^` would match only once, at the start of the text.
* `.` would **stop at line breaks**.

###  With `(?ms)`

* `^` matches at the start of **each line** (because of `m`).
* `.` matches **across multiple lines** (because of `s`).

So in:

```
[section1]
a = 1
b = 2

[section2]
c = 3
```

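`(?ms)^\[section1\].*?(?=^\[|\Z)`
will match `[section1]` and everything following it, including line breaks, up to the next line that starts with `[`. (On its own, a trailing lazy `.*?` would match as little as possible, which is nothing; the lookahead is what forces it to expand.)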

---

###  Equivalent Python flags

In code, you could have also written:

```python
re.search(pattern, text, flags=re.MULTILINE | re.DOTALL)
```

but `(?ms)` sets them directly inside the regex string.
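
A short comparison, using the two-section sample from above, suggests the two spellings behave identically, and shows what happens when the flags are left out entirely.

```python
import re

text = "[section1]\na = 1\nb = 2\n\n[section2]\nc = 3\n"
core = r"^\[section1\].*?(?=^\[|\Z)"

# Flags written inside the pattern
inline = re.search("(?ms)" + core, text)

# Same flags passed as an argument
explicit = re.search(core, text, flags=re.MULTILINE | re.DOTALL)

# No flags: "." cannot cross the newline and "^" matches only at position 0,
# so the lookahead can never succeed and there is no match at all.
no_flags = re.search(core, text)

print(inline.group() == explicit.group())
print(no_flags)
```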

---

##  `.*?` — the **lazy (non-greedy)** wildcard


| Symbol | Meaning                                                |
| ------ | ------------------------------------------------------ |
| `.`    | any character (because of `(?s)` it includes newlines) |
| `*`    | repeat **zero or more times**                          |
| `?`    | make it **lazy / non-greedy**                          |

###  Default behavior (greedy)

`.*` tries to match **as much text as possible**.

Example:

```python
re.search(r"\[.*\]", "[one][two][three]").group()
# → '[one][two][three]'  (too greedy!)
```

###  Lazy behavior (non-greedy)

`.*?` tries to match **as little text as possible**, still satisfying the pattern.

```python
re.search(r"\[.*?\]", "[one][two][three]").group()
# → '[one]'
```

So,

```regex
.*?
```

means “grab everything inside this section, **but stop as soon as** the next `[` header or the end of file appears.”

---

##  `(?=^\[|\Z)` — a **lookahead**

This part is called a **positive lookahead**.

### Syntax

```regex
(?=pattern)
```

It means:

> “Assert that what follows **matches `pattern`**,
> but **don’t include it** in the match.”

---


```regex
(?=^\[|\Z)
```


> “Stop right **before** the next `[` at the start of a line (`^\[`)
> or before the end of the file (`\Z`).”

| Symbol | Meaning                                            |
| ------ | -------------------------------------------------- |
| `^`    | start of a line (because of multiline mode `(?m)`) |
| `\[`   | literal `[` (start of next section header)         |
| `\|`   | OR                                                 |
| `\Z`   | absolute end of the string                         |

So it **defines where the section ends** without consuming the next section’s header.

---

###  Example visualization

Input text:

```
[one]
a = 1
b = 2

[two]
x = 9
```

Regex (simplified):

```
(?ms)^\[one\]\s*.*?(?=^\[|\Z)
```

| Stage         | Matches                     |
| ------------- | --------------------------- |
| `^\[one\]`    | start of `[one]` section    |
| `\s*.*?`      | all lines after it          |
| `(?=^\[\|\Z)` | stop *right before* `[two]` |

 Result matches only:

```
[one]
a = 1
b = 2
```

and stops cleanly before `[two]`.
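
Because the lookahead does not consume the next header, the following search can start exactly at that header. The sketch below checks this on the same input and then uses a generalized header pattern (`\[[^\]\n]+\]`, an assumption for illustration) with `re.finditer` to list every section.

```python
import re

text = "[one]\na = 1\nb = 2\n\n[two]\nx = 9\n"

m = re.search(r"(?ms)^\[one\]\s*.*?(?=^\[|\Z)", text)
rest = text[m.end():]          # whatever follows the match

# Any header, then its body, up to the next header or the end of the text
any_section = r"(?ms)^\[[^\]\n]+\]\s*.*?(?=^\[|\Z)"
blocks = [b.group().strip() for b in re.finditer(any_section, text)]

print(repr(rest))
for block in blocks:
    print("---")
    print(block)
```

If the lookahead consumed `[two]`, the second block would lose its header or be skipped altogether.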

---

##  Putting it all together

The full pattern:

```regex
(?ms)^\[SECTION\]\s*.*?(?=^\[|\Z)
```

| Component      | Meaning                                   |
| -------------- | ----------------------------------------- |
| `(?ms)`        | multiline + dot-all modes                 |
| `^\[SECTION\]` | section header at start of a line         |
| `\s*`          | skip whitespace or newlines               |
| `.*?`          | lazily capture everything in this section |
| `(?=^\[\|\Z)`  | stop before next header or end of file    |

---

##  Summary Table

| Piece         | Name               | Function                                                |
| ------------- | ------------------ | ------------------------------------------------------- |
| `(?ms)`       | regex flags        | allow `^` to match per line and `.` to include newlines |
| `.*?`         | lazy quantifier    | capture minimal content until next condition            |
| `(?=^\[\|\Z)` | positive lookahead | stop matching before next `[header]` or end of file     |




# Parentheses `()` in regular expressions

Parentheses are one of the most important (and most overloaded!) symbols in **regular expressions**.
They have **two major roles**, depending on context:

1. **Grouping**
2. **Capturing (and back-referencing)**

---

##  Grouping

Parentheses let you **group parts** of a regex together so you can:

* Apply a quantifier (`*`, `+`, `?`) to the **whole group**
* Combine multiple options with `|` (OR)
* Control precedence (like parentheses in math)

### Example 1 — group repetition

```regex
(ab)+
```

matches:

```
ab, abab, ababab, ...
```

because the `+` now applies to the **entire group** `ab`, not just `b`.

---

###  Example 2 — capturing with OR

```regex
(cat|dog)
```

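matches either `"cat"` or `"dog"` and captures the matched word in a group.

Without parentheses, `cat|dog` still matches `cat` or `dog`, but nothing is captured, and anything written after it binds only to the last alternative (e.g., `cat|dogs?` means `cat` or `dogs?`).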



###  In

```regex
(?=^\[|\Z)
```

Here, the parentheses group the **two alternatives**:

```
^\[
```

or

```
\Z
```

so the lookahead reads as one combined condition:

> “stop before the next header **or** end of file.”

---

##  Capturing groups

By default, parentheses **capture** whatever text they match — so you can refer to it later.

###  Example

```python
import re
m = re.search(r"(\d{4})-(\d{2})-(\d{2})", "2025-10-06")
print(m.groups())
```

Output:

```
('2025', '10', '06')
```

Each pair of parentheses captured a piece of the match.

You can refer to them later by:

* index: `\1`, `\2`, etc. inside regex patterns,
* or `.group(1)`, `.group(2)` in Python.
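
The sketch below exercises the three ways of referring to a captured group: `.group(n)` in Python, `\n` in a replacement string, and `\n` inside the pattern itself. The sample sentence is made up for illustration.

```python
import re

date = "2025-10-06"
date_pattern = r"(\d{4})-(\d{2})-(\d{2})"

# 1) Access groups from Python
m = re.search(date_pattern, date)
year, month, day = m.group(1), m.group(2), m.group(3)

# 2) Refer to groups in a replacement string
us_style = re.sub(date_pattern, r"\2/\3/\1", date)

# 3) Backreference inside the pattern: \1 must repeat what group 1 matched
doubled = re.search(r"\b(\w+) \1\b", "this is is a test")

print(year, month, day)
print(us_style)
print(doubled.group())
```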

---

###  Non-capturing groups

If you just want to **group but not capture**, use `(?:...)`
(e.g. to avoid extra stored groups).

Example:

```regex
(?:cat|dog)s?
```

matches `cat`, `cats`, `dog`, `dogs`, but doesn’t store any groups.
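
The difference is easiest to see with `re.findall`, which returns the captured groups when the pattern has any and the full matches otherwise. A small sketch on a made-up phrase:

```python
import re

text = "cats and dogs"

# Capturing group: findall returns what the group captured
capturing = re.findall(r"(cat|dog)s?", text)

# Non-capturing group: findall returns the full matches
non_capturing = re.findall(r"(?:cat|dog)s?", text)

# The match object of the non-capturing version stores no groups
stored = re.search(r"(?:cat|dog)s?", text).groups()

print(capturing)
print(non_capturing)
print(stored)
```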

---

###  In

```regex
(?ms)
```

and

```regex
(?=^\[|\Z)
```

These two constructs look like ordinary groups, but they are **special forms of parentheses** that **do not capture**.

They are called **inline modifiers** and **lookaheads** respectively.
The parentheses are required syntactically to introduce those special regex features.

| Form      | Meaning                           |
| --------- | --------------------------------- |
| `(?ms)`   | inline flags (multiline + dotall) |
| `(?=...)` | positive lookahead                |
| `(?:...)` | non-capturing group               |
| `(abc)`   | capturing group                   |

---

##  Summary Table

| Pattern form | Name                | Captures? | Purpose                            |
| ------------ | ------------------- | --------- | ---------------------------------- |
| `(abc)`      | Capturing group     |  Yes     | store matched text                 |
| `(?:abc)`    | Non-capturing group |  No      | grouping only                      |
| `(?=abc)`    | Positive lookahead  |  No      | assert that next chars match `abc` |
| `(?!abc)`    | Negative lookahead  |  No      | assert next chars do **not** match |
| `(?ms)`      | Inline flags        |  No      | enable multiline/dot-all behavior  |

---



In [None]:
from pathlib import Path
import textwrap, re

pyproj = Path("pyproject.toml")
existing = pyproj.read_text() if pyproj.exists() else ""

#upsert: update + insert
def upsert(section_header, body):
    global existing    #update the global variable existing
    pattern = rf"(?ms)^\[{re.escape(section_header)}\]\s*.*?(?=^\[|\Z)"
    if re.search(pattern, existing):
        existing = re.sub(pattern, f"[{section_header}]\n{body}\n", existing)
    else:
        existing += f"\n[{section_header}]\n{body}\n"

# Black
upsert("tool.black", textwrap.dedent("""
line-length = 88
target-version = ["py311"]
""").strip())

# Ruff (modern layout)
upsert("tool.ruff", textwrap.dedent("""
line-length = 88
target-version = "py311"
""").strip())

upsert("tool.ruff.lint", textwrap.dedent("""
select = ["E","F","I"]  # pycodestyle errors, Pyflakes, isort
ignore = ["E501"]       # let Black handle line length
""").strip())

upsert("tool.ruff.lint.isort", textwrap.dedent("""
known-first-party = ["projectname"]
""").strip())

pyproj.write_text(existing.strip()+"\n")
print(pyproj.read_text())

[tool.black]
line-length = 88
target-version = ["py311"]

[tool.ruff]
line-length = 88
target-version = "py311"

[tool.ruff.lint]
select = ["E","F","I"]  # pycodestyle errors, Pyflakes, isort
ignore = ["E501"]       # let Black handle line length

[tool.ruff.lint.isort]
known-first-party = ["projectname"]



### 3) Create `.pre-commit-config.yaml` with hooks (Black, Ruff, nbstripout)

> Versions below are stable at time of writing—feel free to bump later.

##  What is a *hook*?

A **hook** is a small script that automatically runs when a certain Git event happens —
for example: before you commit, before you push, or after merging.


* The **pre-commit** framework installs *Git pre-commit hooks*.
* That means: **every time you run `git commit`**, pre-commit automatically runs checks and formatters on your code *before* the commit is saved.
* If any hook fails (e.g., Black reformats code or Ruff finds lint errors), the commit is **aborted** until you fix them.

It’s an automatic “quality gate.”

---

##  The config file: `.pre-commit-config.yaml`

This YAML file lists:

* Which repositories provide the hooks,
* Which versions to use,
* And which hooks from each repo to run.

You can think of each **repo block** as a *plugin source*, and each **hook entry** inside it as an *individual check or formatter*.

---

###  File structure overview

```yaml
repos:                        # Top-level key: list of repositories providing hooks
  - repo: <URL>               # 1st repository (source of hook)
    rev: <version>            # The exact version to use (tag, commit, or release)
    hooks:                    # List of hook definitions from that repo
      - id: <hook_id>         # One hook (name defined by that repo)
        <optional settings>   # e.g. args, files, language_version
```

---

####  1. Black — code formatter

```yaml
- repo: https://github.com/psf/black
  rev: 24.4.2
  hooks:
    - id: black
      language_version: python3.11
```

| Key                | Meaning                                           |
| ------------------ | ------------------------------------------------- |
| `repo`             | the GitHub repo where the hook lives              |
| `rev`              | the release version to pin (avoid auto-updates)   |
| `hooks`            | list of specific hooks from that repo             |
| `id: black`        | the hook’s internal ID — runs the Black formatter |
| `language_version` | tell pre-commit which Python to use for the hook  |

Effect: runs **Black** on staged `.py` files before committing, reformatting code automatically.

---

####  2. Ruff — linter and formatter

```yaml
- repo: https://github.com/astral-sh/ruff-pre-commit
  rev: v0.5.0
  hooks:
    - id: ruff
      args: [--fix, --exit-non-zero-on-fix]
    - id: ruff-format
```

| Key                            | Meaning                                                                  |
| ------------------------------ | ------------------------------------------------------------------------ |
| `repo`                         | Ruff’s pre-commit integration repository                                 |
| `rev`                          | version tag of Ruff                                                      |
| First hook: `id: ruff`         | run Ruff linter                                                          |
| `args`                         | extra command-line arguments                                             |
| `--fix`                        | auto-fix simple issues                                                   |
| `--exit-non-zero-on-fix`       | make Ruff fail if it fixed something (forces you to re-commit after fix) |
| Second hook: `id: ruff-format` | runs Ruff’s fast code formatter                                          |

 Effect: ensures linting and formatting are applied before committing.

---

#### 3. nbstripout — Jupyter cleaner

```yaml
- repo: https://github.com/kynan/nbstripout
  rev: 0.7.1
  hooks:
    - id: nbstripout
      files: \.ipynb$
```

| Key                | Meaning                                        |
| ------------------ | ---------------------------------------------- |
| `repo`             | source repo of nbstripout hook                 |
| `rev`              | version                                        |
| `id: nbstripout`   | the actual hook name                           |
| `files: \.ipynb$`  | regex to restrict to notebook files (`.ipynb`); written as `\\.ipynb$` when embedded in a regular Python string |

 Effect: strips notebook outputs before committing, keeping diffs small.

---

####  4. pre-commit built-in hooks

```yaml
- repo: https://github.com/pre-commit/pre-commit-hooks
  rev: v4.6.0
  hooks:
    - id: end-of-file-fixer
    - id: trailing-whitespace
    - id: check-yaml
    - id: check-added-large-files
```

These are general-purpose hygiene hooks:

| Hook ID                   | Purpose                                                        |
| ------------------------- | -------------------------------------------------------------- |
| `end-of-file-fixer`       | ensures files end with a single newline                        |
| `trailing-whitespace`     | removes extra spaces at line ends                              |
| `check-yaml`              | checks that YAML files parse without syntax errors             |
| `check-added-large-files` | blocks adding very large files accidentally (e.g., data dumps) |

 Effect: enforces small, clean commits.
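
To illustrate what the two fixer hooks do, and why a hook that changes a file counts as a failure, here is a simplified sketch in plain Python. The function names are hypothetical and the real hooks handle more edge cases (binary files, Markdown line breaks, and so on).

```python
def strip_trailing_whitespace(text: str) -> str:
    """Remove spaces and tabs at the end of every line."""
    return "\n".join(line.rstrip(" \t") for line in text.split("\n"))


def fix_end_of_file(text: str) -> str:
    """Make non-empty text end with exactly one newline."""
    return text.rstrip("\n") + "\n" if text.strip() else ""


def run_fixers(text: str) -> tuple[str, int]:
    """Apply both fixers; exit code 1 means 'the file was modified'."""
    fixed = fix_end_of_file(strip_trailing_whitespace(text))
    return fixed, (0 if fixed == text else 1)


dirty = "x = 1   \ny = 2\t\n\n\n"
clean, code = run_fixers(dirty)             # first run: fixes and "fails"
again, code_again = run_fixers(clean)       # second run: nothing to change

print(repr(clean), code)
print(repr(again), code_again)
```

This mirrors the commit workflow: the first attempt is rejected because files changed, and after re-staging the fixed files the second attempt passes.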

---

##  How it all works together

1. You install once:

   ```bash
   pip install pre-commit
   pre-commit install
   ```

   This sets up a **Git hook** in `.git/hooks/pre-commit`.

2. Now whenever you run:

   ```bash
   git commit -m "Update"
   ```

   pre-commit:

   * Runs **Black**, **Ruff**, **nbstripout**, etc.
   * Shows results (and auto-fixes where possible).
   * Aborts commit if anything fails.



In [None]:
from pathlib import Path
cfg = Path(".pre-commit-config.yaml")
cfg.write_text("""repos:
  - repo: https://github.com/psf/black
    rev: 24.4.2
    hooks:
      - id: black
        language_version: python3.12  # set to the version of your system or leave it blank

  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.5.0
    hooks:
      - id: ruff
        args: [--fix, --exit-non-zero-on-fix]
      - id: ruff-format

  - repo: https://github.com/kynan/nbstripout
    rev: 0.7.1
    hooks:
      - id: nbstripout
        files: \\.ipynb$

  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.6.0
    hooks:
      - id: end-of-file-fixer
      - id: trailing-whitespace
      - id: check-yaml
      - id: check-added-large-files
""")
print(cfg.read_text())

repos:
  - repo: https://github.com/psf/black
    rev: 24.4.2
    hooks:
      - id: black
        language_version: python3.12  # set to the version of your system or leave it blank

  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.5.0
    hooks:
      - id: ruff
        args: [--fix, --exit-non-zero-on-fix]
      - id: ruff-format

  - repo: https://github.com/kynan/nbstripout
    rev: 0.7.1
    hooks:
      - id: nbstripout
        files: \.ipynb$

  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.6.0
    hooks:
      - id: end-of-file-fixer
      - id: trailing-whitespace
      - id: check-yaml
      - id: check-added-large-files



### 4) Install the local git hook & run on all files

``` python
!pre-commit install
!pre-commit run --all-files
```

> The first run will **download** hook toolchains (Black, Ruff, etc.), format files, and strip notebook outputs. Commit changes after verifying.



##  `!pre-commit install`


This command **installs the Git hook** that runs every time you make a commit.

You usually run it **once per repository** (after adding `.pre-commit-config.yaml`).


When you run:

```bash
pre-commit install
```

pre-commit modifies your Git repo by creating this file:

```
.git/hooks/pre-commit
```

That’s a small shell script. The generated file is longer than this, but conceptually it boils down to something like the following simplified sketch:

```bash
#!/bin/sh
exec pre-commit run --hook-stage pre-commit "$@"
```

From then on, whenever you type:

```bash
git commit -m "something"
```

Git automatically executes that script *before* finalizing the commit.

Effect:

* All hooks listed in `.pre-commit-config.yaml` (Black, Ruff, nbstripout, etc.) are run on **staged files**.
* If any hook fails, the commit is **blocked** until you fix the issues.

---

###  You’ll usually see output like:

```
[INFO] Installing environment for https://github.com/psf/black.
[INFO] Once installed this environment will be reused.
[INFO] Installing environment for https://github.com/astral-sh/ruff-pre-commit.
...
```

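`pre-commit install` itself only prints `pre-commit installed at .git/hooks/pre-commit`. The `[INFO]` messages above appear the first time the hooks actually run (your first commit or first `pre-commit run`), when pre-commit sets up a small isolated environment for each tool.
Future runs are fast because those environments are cached.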

---

##  `!pre-commit run --all-files`

This command **manually runs all hooks** on **every file** in the repository — not just staged ones.

The `--all-files` flag tells pre-commit to ignore Git’s staging and check everything.


Typical uses:

* After first installing pre-commit, to clean up the entire codebase once:

  ```bash
  pre-commit run --all-files
  ```
* In continuous integration (CI), to verify the repo is clean before merging.

For each hook in `.pre-commit-config.yaml`, you’ll see output like:

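```
black....................................................................Passed
ruff.....................................................................Passed
nbstripout..............................................................Passed
trailing-whitespace.....................................................Failed
- hook id: trailing-whitespace
- files were modified by this hook
```

pre-commit reports each hook as “Passed”, “Failed”, or “Skipped”. A hook that modifies files (e.g., Black reformats code) is reported as “Failed” with the note “files were modified by this hook”; review the changes, `git add` them, and run again.
If a hook fails without modifying anything, the lines printed below it explain why.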

---

###  Quick workflow recap

```bash
# 1. Install framework
pip install pre-commit

# 2. Create .pre-commit-config.yaml

# 3. Activate hooks in this repo
pre-commit install

# 4. (Optional) Run everything manually
pre-commit run --all-files
```

From then on, `git commit` will automatically trigger pre-commit checks and fixes for you.



In [None]:
!pre-commit install #install the hooks
#!pre-commit run --all-files

pre-commit installed at .git/hooks/pre-commit


`pre-commit` makes it easy to run **one specific hook** on **one specific file** — perfect for debugging or quick fixes.


---

##   General syntax

```bash
pre-commit run <hook-id> --files <file1> [<file2> ...]
```

* `<hook-id>` → the ID of the hook, exactly as it appears in `.pre-commit-config.yaml`
* `--files` → the file(s) to run the hook on

---

##  Example 1 — run **Black** on a single Python file

```bash
pre-commit run black --files scripts/health.py
```

 This runs only the **Black** hook on that file — without touching anything else.

If Black reformats it, it will modify the file in place (just like `black scripts/health.py`).

---

##  Example 2 — run **Ruff** linter on one file

```bash
pre-commit run ruff --files notebooks/eda.ipynb
```

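 Runs Ruff’s linting (and `--fix` if you enabled that in your config) only on that notebook. Note that with `rev: v0.5.0` the `ruff` hook matches only Python files by default, so a notebook is skipped unless you add `types_or: [python, pyi, jupyter]` to the hook; ruff-pre-commit v0.6.0 and later include notebooks by default.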

---

##  Example 3 — run **nbstripout** manually on a notebook

```bash
pre-commit run nbstripout --files reports/demo.ipynb
```

 This strips outputs from that notebook only.

---

##  Example 4 — run **trailing-whitespace** cleanup on one file

```bash
pre-commit run trailing-whitespace --files README.md
```

 Removes any extra spaces at the end of lines in `README.md`.

---

##  Running multiple hooks on one file

You can omit the hook name to run **all configured hooks** on that file:

```bash
pre-commit run --files scripts/health.py
```

That runs every hook (Black, Ruff, etc.) that applies to `.py` files.

---

##  Adding `--all-files` (for comparison)

```bash
pre-commit run black --all-files
```

→ runs the **Black** hook on *every tracked file*.

So:

* `--files` = specific file(s)
* `--all-files` = everything

---

##  Checking available hook IDs

If you’re unsure of a hook’s `id`, run the hooks in verbose mode; each result is then followed by a `- hook id: ...` line:

```bash
pre-commit run --all-files --verbose
```

or just open `.pre-commit-config.yaml`, where each hook’s ID appears under:

```yaml
hooks:
  - id: black
  - id: ruff
  - id: nbstripout
  - id: trailing-whitespace
```
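
The IDs can also be pulled out of the config text with a small script. The sketch below uses a regular expression instead of a YAML parser to stay dependency-free, so it assumes the usual `- id: name` layout on one line; `list_hook_ids` is a hypothetical helper, not a pre-commit API.

```python
import re


def list_hook_ids(config_text: str) -> list[str]:
    """Return every hook id found in a pre-commit config, in file order."""
    return re.findall(
        r"^[ \t]*-[ \t]+id:[ \t]*([\w.-]+)", config_text, flags=re.MULTILINE
    )


sample = """repos:
  - repo: https://github.com/psf/black
    rev: 24.4.2
    hooks:
      - id: black
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.5.0
    hooks:
      - id: ruff
        args: [--fix, --exit-non-zero-on-fix]
      - id: ruff-format
"""
hook_ids = list_hook_ids(sample)
print(hook_ids)
```

In the repo you would pass `Path(".pre-commit-config.yaml").read_text()` instead of the sample string.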

---

## Summary

| Task                               | Command                                        |
| ---------------------------------- | ---------------------------------------------- |
| Run **one hook** on **one file**   | `pre-commit run <hook-id> --files <file>`      |
| Run **one hook** on **many files** | `pre-commit run <hook-id> --files file1 file2` |
| Run **all hooks** on **one file**  | `pre-commit run --files <file>`                |
| Run **one hook** on **all files**  | `pre-commit run <hook-id> --all-files`         |
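
If you call these commands from a script or notebook, the table above maps onto a small argument builder. This is a sketch under the assumption that you then hand the list to `subprocess.run`; `precommit_cmd` is a made-up helper name.

```python
from typing import Optional, Sequence


def precommit_cmd(
    hook_id: Optional[str] = None,
    files: Sequence[str] = (),
    all_files: bool = False,
) -> list[str]:
    """Build the argument list for `pre-commit run`."""
    if files and all_files:
        # pre-commit treats --files and --all-files as mutually exclusive
        raise ValueError("use either files or all_files, not both")
    cmd = ["pre-commit", "run"]
    if hook_id:
        cmd.append(hook_id)
    if all_files:
        cmd.append("--all-files")
    elif files:
        cmd += ["--files", *files]
    return cmd


print(precommit_cmd("black", files=["scripts/health.py"]))
print(precommit_cmd(all_files=True))
```

Running it would then be `subprocess.run(precommit_cmd("black", all_files=True))`, which requires pre-commit to be installed.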

---

### Example recap

| Goal                            | Command                                                |
| ------------------------------- | ------------------------------------------------------ |
| Run Black on a script           | `pre-commit run black --files scripts/health.py`       |
| Run Ruff linter on a notebook   | `pre-commit run ruff --files notebooks/eda.ipynb`      |
| Strip outputs from one notebook | `pre-commit run nbstripout --files reports/demo.ipynb` |





You **can** technically run a hook’s underlying tool (e.g. `black`, `ruff`, `nbstripout`) directly — you don’t *need* to use `pre-commit run`.

But — using `pre-commit run` ensures:

* The same **versions**, **arguments**, and **excludes** as defined in `.pre-commit-config.yaml`
* Consistent behavior across environments (since pre-commit manages isolated virtualenvs)

So:

| Approach                | Command                               | When to use                                                   |
| ----------------------- | ------------------------------------- | ------------------------------------------------------------- |
|  **Standard way**      | `pre-commit run black --files foo.py` | Use this for reproducible, config-driven automation           |
|  **Direct hook call** | `black foo.py`                        | Fine for ad-hoc local use, or when pre-commit isn’t available |

---

##  How hooks actually work

When you wrote in `.pre-commit-config.yaml`:

```yaml
- repo: https://github.com/psf/black
  rev: 24.4.2
  hooks:
    - id: black
```

pre-commit:

1. Downloads that repo.
2. Creates a **tiny isolated environment** for it (in `~/.cache/pre-commit/...`).
3. Runs the command listed in the hook metadata (in this case, the Black formatter).

So, when you call:

```bash
pre-commit run black --files foo.py
```

you’re really executing that internal command in the controlled environment (using the exact version pinned to `rev: 24.4.2`).

---

##  Invoking the hook tool directly

If you just want to run the formatter/linter directly, you can! For example:

```bash
black foo.py
ruff check foo.py --fix
nbstripout notebook.ipynb
```

That will work, **provided those tools are installed** (e.g. via `pip install black ruff nbstripout`).



---

##  Example comparison

**Using pre-commit:**

```bash
pre-commit run black --files src/app.py
```

→ uses Black 24.4.2 in pre-commit’s virtualenv
→ applies any `args` defined for the hook in `.pre-commit-config.yaml` (the config above defines none for Black)

**Using direct command:**

```bash
black src/app.py
```

→ uses whatever version of Black is installed globally (maybe newer/older)
→ may ignore settings if you didn’t pass them manually or define `pyproject.toml`

---

##  Hooks that *don’t* have a direct CLI

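Some hooks (like `end-of-file-fixer` or `check-added-large-files`) are tiny Python scripts that live in the `pre-commit-hooks` repository.
pre-commit installs them into its own cached environment, not onto your `PATH`, so there’s normally no `end-of-file-fixer` command in your shell (unless you separately `pip install pre-commit-hooks`).

So for those, the practical route is: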

```bash
pre-commit run end-of-file-fixer --files README.md
```



---

##  Summary

| Type of hook / goal                                                | Can you run directly? | Preferred way                         |
| ------------------------------------------------------------------ | --------------------- | ------------------------------------- |
| Third-party tool (Black, Ruff, nbstripout)                         | Yes                   | Either direct CLI or `pre-commit run` |
| Hooks from `pre-commit-hooks` (e.g. end-of-file-fixer, check-yaml) | Not by default        | `pre-commit run`                      |
| Want consistent versioning/args                                    | n/a                   | Use pre-commit                        |
| Want quick ad-hoc check                                            | n/a                   | Run tool directly                     |
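
The choice between the two approaches can be automated with `shutil.which`. The sketch below assumes the hook id equals the executable name, which holds for `black` and `nbstripout` but not for Ruff (its direct form is `ruff check`); `pick_command` is a hypothetical helper.

```python
import shutil
from typing import Callable, Optional


def pick_command(
    tool: str,
    path: str,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Prefer `pre-commit run <tool>`; fall back to the bare tool if needed."""
    if which("pre-commit"):
        return ["pre-commit", "run", tool, "--files", path]
    if which(tool):
        return [tool, path]
    raise FileNotFoundError(f"neither pre-commit nor {tool} is on PATH")


# Demo with a fake lookup that pretends only `black` is installed.
only_black = lambda name: "/usr/bin/black" if name == "black" else None
print(pick_command("black", "foo.py", which=only_black))
```

Passing the lookup function as a parameter keeps the demo independent of what is installed on your machine.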




### 5) (Optional) Also install **git filter** for nbstripout by writing the filters to .gitattributes

> This is an extra layer; pre‑commit hook above already strips outputs. Use this to guarantee outputs are removed even when bypassing pre‑commit.



In [None]:
!nbstripout --install --attributes .gitattributes
print(open(".gitattributes").read())

data/processed/*.parquet filter=lfs diff=lfs merge=lfs -text
data/*.db filter=lfs diff=lfs merge=lfs -text
models/*.pt filter=lfs diff=lfs merge=lfs -text
reports/*.pdf filter=lfs diff=lfs merge=lfs -text

*.ipynb filter=nbstripout
*.zpln filter=nbstripout
*.ipynb diff=ipynb



```bash
!nbstripout --install --attributes .gitattributes
```

This uses a **Jupyter shell escape** (`!`) to call `nbstripout`,
a tool that cleans Jupyter notebooks by **removing cell outputs and some metadata** before they are committed to Git.

By default, Jupyter notebook files (`.ipynb`) contain a lot of:

* Execution outputs (plots, print results),
* Metadata (kernel info, timestamps, etc.).

That makes them:

* **large**,
* **hard to diff**, and
* **messy in version control**.

 **`nbstripout`** fixes this by automatically stripping those outputs whenever you `git add` or `git commit` the notebook.

When you run:

```bash
nbstripout --install
```

it registers a **Git filter** named `nbstripout` in your repository configuration (`.git/config`) so that notebook files are cleaned automatically when they are staged and committed.

It also writes an attribute rule that tells Git which files the filter applies to.

---

##  The `--attributes .gitattributes` flag

This specifies **which file** to write those rules to.

Without the flag, nbstripout writes them to `.git/info/attributes`, which is local to your clone and never committed.
With the flag, the rules go into the versioned `.gitattributes` file, so they travel with the repository:

```bash
--attributes .gitattributes
```

So it adds the nbstripout filter configuration to the file `.gitattributes` in your repository root.



##  What `.gitattributes` is

`.gitattributes` is a Git file that defines **per-file behaviors** (like filters, merge drivers, or text normalization).

For example, it can say:

* “apply this filter when committing any `.ipynb` file,” or
* “treat these files as binary.”

After running your command, the file `.gitattributes` typically contains something like this:

```text
# nbstripout filter to automatically clean notebooks before committing
*.ipynb filter=nbstripout
```

### Bonus tip

You can confirm the filter is installed with:

```bash
git config --show-origin filter.nbstripout.clean
```

and test it manually with:

```bash
nbstripout notebook.ipynb
```


Here is how **`.gitattributes`** configures Git’s behavior for specific file types, in more detail.


`.gitattributes` is a plain-text configuration file that tells Git **how to handle certain files**, based on filename patterns (globs).

It can define:

* How to **store** files (`filter`, `text`, `binary`)
* How to **compare** them (`diff`)
* How to **merge** them (`merge`)
* Or which files should be treated as binary (no line endings normalized, etc.)

Each line has this general pattern:

```
<pattern> <attribute1> <attribute2> ...
```
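
To see how such a line breaks down, here is a minimal parser sketch. It assumes simple patterns without spaces or quoting, and `parse_gitattributes_line` is a made-up name; Git's own parsing handles more cases.

```python
def parse_gitattributes_line(line: str):
    """Split one .gitattributes line into (pattern, attributes)."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None  # blank line or comment
    pattern, *tokens = line.split()
    attrs = {}
    for tok in tokens:
        if tok.startswith("-"):
            attrs[tok[1:]] = False   # "-text": attribute unset
        elif "=" in tok:
            key, value = tok.split("=", 1)
            attrs[key] = value       # "filter=lfs": attribute set to a value
        else:
            attrs[tok] = True        # "text": attribute set
    return pattern, attrs


parsed = parse_gitattributes_line("models/*.pt filter=lfs diff=lfs merge=lfs -text")
print(parsed)
```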

---

##  These lines use **Git LFS (Large File Storage)**

```text
data/processed/*.parquet filter=lfs diff=lfs merge=lfs -text
data/*.db              filter=lfs diff=lfs merge=lfs -text
models/*.pt            filter=lfs diff=lfs merge=lfs -text
reports/*.pdf          filter=lfs diff=lfs merge=lfs -text
```

| Part         | Meaning                                                                                                                                                                                               |
| ------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `filter=lfs` | Send these files through the **Git LFS** filter when adding/committing. LFS replaces the actual binary file with a small *pointer file* in Git, while storing the real large data in the LFS storage. |
| `diff=lfs`   | Use LFS’s built-in diff driver (shows only file size + pointer info, not binary gibberish).                                                                                                           |
| `merge=lfs`  | Use LFS’s safe merge driver (prevents binary conflicts).                                                                                                                                              |
| `-text`      | Tell Git **not** to treat these as text files → no line-ending normalization.                                                                                                                         |

###  Why:

* `.parquet`, `.db`, `.pt`, `.pdf` files are **large binaries**.
* LFS prevents your repo from ballooning in size and keeps Git operations fast.
* Without LFS, Git would try to store and diff these binaries directly — very inefficient.

So those lines say:

> “For all Parquet, SQLite DBs, PyTorch model weights, and PDFs — store them via Git LFS and treat them as binary data.”
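
For intuition about what "pointer file" means, the sketch below builds the small text that Git stores in place of the real file, following the Git LFS pointer layout (version line, SHA-256 object id, size in bytes). The helper name `lfs_pointer` and the sample bytes are illustrative only; in practice `git lfs` creates this for you.

```python
import hashlib


def lfs_pointer(data: bytes) -> str:
    """Build the text of a Git LFS pointer for the given file contents."""
    oid = hashlib.sha256(data).hexdigest()
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(data)}\n"
    )


pointer = lfs_pointer(b"pretend this is a large parquet file")
print(pointer)
```

However large the real file is, the pointer stays at three short lines, which is why history remains small.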

---

##  These lines integrate **nbstripout** for Jupyter notebooks

```text
*.ipynb filter=nbstripout
*.zpln filter=nbstripout
```

| Part                | Meaning                                              |
| ------------------- | ---------------------------------------------------- |
| `filter=nbstripout` | Use the **nbstripout** clean filter when committing. |

So whenever a notebook is staged for a commit:

* Notebook files (`*.ipynb`, `*.zpln`) are passed through `nbstripout`.
* It **removes cell outputs and execution counts** (plus some volatile metadata), leaving only the code + markdown in the version Git stores. Your working copy keeps its outputs.
* The result: small diffs and no output noise in Git history.
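
The core of the stripping step can be sketched with the standard `json` module. This is a simplified illustration of the idea, not nbstripout's implementation (which also handles metadata and other options); `strip_outputs` is a hypothetical name.

```python
import json


def strip_outputs(notebook_json: str) -> str:
    """Return notebook JSON with code-cell outputs and execution counts cleared."""
    nb = json.loads(notebook_json)
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return json.dumps(nb, indent=1)


# A tiny hand-written notebook with one executed code cell.
raw = json.dumps({
    "cells": [
        {"cell_type": "code", "source": ["1 + 1"], "metadata": {},
         "outputs": [{"output_type": "execute_result"}], "execution_count": 3},
        {"cell_type": "markdown", "source": ["notes"], "metadata": {}},
    ],
    "metadata": {}, "nbformat": 4, "nbformat_minor": 5,
})
cleaned = json.loads(strip_outputs(raw))
print(cleaned["cells"][0])
```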

---

##  Special diff driver for notebooks

```text
*.ipynb diff=ipynb
```

| Part         | Meaning                                                                                                |
| ------------ | ------------------------------------------------------------------------------------------------------ |
| `diff=ipynb` | Use a custom diff driver (`ipynb`) that knows how to show JSON differences more clearly for notebooks. |

This only works if an `ipynb` diff driver is configured in Git. `nbstripout --install` sets one up (the `diff.ipynb.textconv` setting), so `git diff` compares the stripped notebooks.


* Normal `git diff` on `.ipynb` files is unreadable (huge JSON blobs).
* This option can make diffs cleaner (e.g. show changed cells, not full raw JSON).



###  Example flow

When you do:

```bash
git add data/processed/example.parquet
git commit -m "add data"
```

Git stores only a small pointer in history; the real `.parquet` file goes to LFS storage.

When you commit a notebook:

```bash
git add analysis.ipynb
git commit -m "update notebook"
```

`nbstripout` runs automatically, removing cell outputs from the committed version and keeping your repo clean.




### 6) Add a tiny “bad style” file to see hooks in action

In [6]:
from pathlib import Path
p = Path("scripts/bad_style.py")
p.write_text("import os,sys\n\n\ndef add(a,b):\n  return(a +  b)\n")
print("Wrote:", p)
print(20*"-")
print(open("scripts/bad_style.py").read())
# directly apply the hook script
!black scripts/bad_style.py
print(open("scripts/bad_style.py").read())

Wrote: scripts/bad_style.py
--------------------
import os,sys


def add(a,b):
  return(a +  b)

reformatted scripts/bad_style.py

All done! ✨ 🍰 ✨
1 file reformatted.
import os, sys


def add(a, b):
    return a + b



In [None]:
from pathlib import Path
p = Path("scripts/bad_style.py")
p.write_text("import os,sys\n\n\ndef add(a,b):\n  return(a +  b)\n")
print("Wrote:", p)
print(20*"-")
print(open("scripts/bad_style.py").read())
# directly apply the hook script
!ruff check scripts/bad_style.py
print(open("scripts/bad_style.py").read())

Wrote: scripts/bad_style.py
--------------------
import os,sys


def add(a,b):
  return(a +  b)

[1m[91mE401 [0m[[1m[96m*[0m] [1mMultiple imports on one line[0m
 [1m[94m-->[0m scripts/bad_style.py:1:1
  [1m[94m|[0m
[1m[94m1 |[0m import os,sys
  [1m[94m|[0m [1m[91m^^^^^^^^^^^^^[0m
  [1m[94m|[0m
[1m[96mhelp[0m: [1mSplit imports[0m

[1m[91mI001 [0m[[1m[96m*[0m] [1mImport block is un-sorted or un-formatted[0m
 [1m[94m-->[0m scripts/bad_style.py:1:1
  [1m[94m|[0m
[1m[94m1 |[0m import os,sys
  [1m[94m|[0m [1m[91m^^^^^^^^^^^^^[0m
  [1m[94m|[0m
[1m[96mhelp[0m: [1mOrganize imports[0m

[1m[91mF401 [0m[[1m[96m*[0m] [1m`os` imported but unused[0m
 [1m[94m-->[0m scripts/bad_style.py:1:8
  [1m[94m|[0m
[1m[94m1 |[0m import os,sys
  [1m[94m|[0m        [1m[91m^^[0m
  [1m[94m|[0m
[1m[96mhelp[0m: [1mRemove unused import[0m

[1m[91mF401 [0m[[1m[96m*[0m] [1m`sys` imported but unused[0m
 [1m[94m-->[0m scr

In [None]:
from pathlib import Path
p = Path("scripts/bad_style.py")
p.write_text("import os,sys\n\n\ndef add(a,b):\n  return(a +  b)\n")
print("Wrote:", p)

# Run hooks just on this file
!pre-commit run --files scripts/bad_style.py  #warning: it takes more than an hour
print(open("scripts/bad_style.py").read())

Wrote: scripts/bad_style.py
[INFO][m Installing environment for https://github.com/psf/black.
[INFO][m Once installed this environment will be reused.
[INFO][m This may take a few minutes...
[INFO][m Installing environment for https://github.com/astral-sh/ruff-pre-commit.
[INFO][m Once installed this environment will be reused.
[INFO][m This may take a few minutes...
[INFO][m Installing environment for https://github.com/kynan/nbstripout.
[INFO][m Once installed this environment will be reused.
[INFO][m This may take a few minutes...
[INFO][m Installing environment for https://github.com/pre-commit/pre-commit-hooks.
[INFO][m Once installed this environment will be reused.
[INFO][m This may take a few minutes...
black....................................................................[42mPassed[m
ruff.....................................................................[41mFailed[m
[2m- hook id: ruff[m
[2m- exit code: 1[m

Found 3 errors (3 fixed, 0 remaining).

ruff-

> You should see Black and Ruff fix spacing/imports; trailing whitespace hooks may also fire.



##  What Colab has by default

By default, **Colab notebooks do not have any pre-commit hook installed.**
That means:

* There is **no `.pre-commit-config.yaml`**
* There is **no `.git/hooks/pre-commit`** script
* Commits (if you use `git`) happen with **no automatic checks or formatting**

So “default Colab” = **no pre-commit at all**.

---

##  If you want to *remove* or *disable* the custom pre-commit hook

###  Option A — Temporary disable for one commit

If you just want to bypass pre-commit once:

```bash
git commit -m "your message" --no-verify
```

That skips all hooks for that commit, but keeps them installed.

---

###  Option B — Uninstall pre-commit hooks

Remove the hook script from `.git/hooks`:

```bash
!pre-commit uninstall
```

or manually:

```bash
!rm -f .git/hooks/pre-commit
```

This restores Git’s default behavior (no pre-commit hook).
You can confirm it’s gone:

```bash
!ls .git/hooks | grep pre-commit
```

(should show nothing or be empty)
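
The same check can be done from Python, which is handy inside a notebook cell. This sketch assumes the default hooks folder (`.git/hooks`, i.e. no custom `core.hooksPath`); `hook_installed` is a made-up helper, and the demo runs on a throwaway directory that only imitates a repo layout.

```python
import tempfile
from pathlib import Path


def hook_installed(repo_root: str = ".") -> bool:
    """True if a pre-commit hook script exists in the default hooks folder."""
    return (Path(repo_root) / ".git" / "hooks" / "pre-commit").is_file()


with tempfile.TemporaryDirectory() as tmp:
    hooks_dir = Path(tmp) / ".git" / "hooks"
    hooks_dir.mkdir(parents=True)
    before = hook_installed(tmp)                          # no hook file yet
    (hooks_dir / "pre-commit").write_text("#!/bin/sh\n")  # fake hook script
    after = hook_installed(tmp)

print(before, after)
```

In the real repo, `hook_installed()` with no argument checks the current working directory.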

---

###  Option C — Remove configuration file and cache

If you want a full reset:

```bash
!rm -f .pre-commit-config.yaml
!pre-commit clean
```

That deletes:

* Your `.pre-commit-config.yaml`
* Pre-commit’s cached environments in `~/.cache/pre-commit`

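If the Git hook is still installed, commits will now fail with a “No .pre-commit-config.yaml file was found” error, so also run `pre-commit uninstall` (Option B) if you want commits to proceed without checks.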

---

##   If you want to *restore to Colab default state* entirely

In Colab, you usually start from a blank VM every session, so you can:

```bash
%cd /content
!rm -rf .git
!rm -rf .pre-commit-config.yaml .gitattributes
```

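This deletes the Git repo in the current directory (including its local history) and any pre-commit traces. The `%cd /content` line assumes the repo was cloned into `/content`; if it lives elsewhere (this notebook uses `CLONE_DIR`), change into that directory first.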


---

##  Summary

| Goal                      | Command                                           | Result                          |
| ------------------------- | ------------------------------------------------- | ------------------------------- |
| Skip pre-commit once      | `git commit --no-verify`                          | Temporarily bypass hooks        |
| Uninstall pre-commit hook | `pre-commit uninstall`                            | Removes `.git/hooks/pre-commit` |
| Full cleanup              | `rm .pre-commit-config.yaml` + `pre-commit clean` | Removes config + caches         |
| Reset to Colab default    | `rm -rf .git`                                     | No repo, no hooks, blank state  |

---

 **Most common safe route:**

```bash
!pre-commit uninstall
!pre-commit clean
```

Then you’re back to the same blank environment Colab starts with — commits won’t trigger any pre-commit checks.




# Run the following to get back to the Colab default, and remember to comment out the last three lines of `.gitattributes` (the nbstripout filter and `diff=ipynb` entries).

In [None]:
!pre-commit uninstall

pre-commit uninstalled


### 7) Add a fast **GitHub Actions CI** workflow (`.github/workflows/ci.yml`)

GitHub Actions is a Continuous Integration / Continuous Deployment (CI/CD) platform built into GitHub.
It lets you define workflows — sequences of automated steps — that run when certain events happen in your repo (like a push, pull request, or issue creation).

The file
`.github/workflows/ci.yml` defines an **automated Continuous Integration (CI) workflow**.

##  What is `ci.yml`?

* It’s a **GitHub Actions workflow file**.
* Located at `.github/workflows/ci.yml`.
* It tells GitHub what to do automatically whenever someone **pushes**, **opens a pull request**, or merges into your repo.

In the code below, it runs your project’s **tests and pre-commit checks** on GitHub’s cloud runners.


```yaml
name: CI
```

* Gives your workflow a **name** (“CI” = Continuous Integration).
* This is what appears under the **Actions** tab in your GitHub repo.

---

###  Trigger conditions

```yaml
on:
  push:
    branches: [ main, master, develop ]
  pull_request:
    branches: [ main, master, develop ]
```

* `on:` defines **when** this workflow should run.

* **push:** Run when code is pushed to any of those branches (`main`, `master`, `develop`).
* **pull_request:** Run when a pull request targets one of those branches.

So any new commit or PR to your main development branches automatically triggers the CI checks.
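
The trigger rule can be modeled as a tiny predicate. This is a simplified model that ignores glob patterns and other filters GitHub supports; `should_run` and `WATCHED` are names invented for the sketch.

```python
WATCHED = {"main", "master", "develop"}


def should_run(event: str, branch: str) -> bool:
    """Decide whether the workflow triggers.

    event:  "push" or "pull_request"
    branch: the pushed branch (push) or the PR's target branch (pull_request)
    """
    return event in {"push", "pull_request"} and branch in WATCHED


print(should_run("push", "main"))                # push to a watched branch
print(should_run("push", "feature/plots"))       # push to some other branch
print(should_run("pull_request", "develop"))     # PR targeting a watched branch
```

Note the consequence: a push to a feature branch alone does not start CI under this configuration; opening a PR from it into `main` does.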

---

###  Concurrency control

```yaml
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
```

This prevents **multiple concurrent runs** of the same workflow on the same branch.

* `group:` defines a unique identifier for a run (combines workflow name + branch ref)
* `cancel-in-progress: true` cancels any older run if a new one starts on the same branch

Example: If you push twice to `main` quickly, the older run is cancelled so only the newest one continues — saves time and compute.
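
The cancellation rule can be simulated as follows. The sketch makes one simplifying assumption: every earlier run in a group is still in progress when a newer one starts. `surviving_runs` is a hypothetical helper; the group key mirrors `${{ github.workflow }}-${{ github.ref }}`.

```python
def surviving_runs(runs: list[tuple[str, str]]) -> list[int]:
    """Given (workflow, ref) pairs in start order, return indexes of runs not cancelled."""
    latest: dict[str, int] = {}
    for i, (workflow, ref) in enumerate(runs):
        group = f"{workflow}-{ref}"
        latest[group] = i  # a newer run in the same group replaces the older one
    return sorted(latest.values())


runs = [
    ("CI", "refs/heads/main"),    # first push to main
    ("CI", "refs/heads/main"),    # second push to main, cancels the first
    ("CI", "refs/pull/7/merge"),  # a PR run lives in its own group
]
print(surviving_runs(runs))
```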

---

###  Define jobs

```yaml
jobs:
  build:
    runs-on: ubuntu-latest
    timeout-minutes: 10
```

* A workflow can have **multiple jobs**; here, only one job named `build`.
* `runs-on: ubuntu-latest` → use GitHub’s Linux runner VM.
* `timeout-minutes: 10` → automatically cancel if it runs longer than 10 minutes.

Each job consists of sequential **steps**.

---

###  Step 1 — Checkout your repo

```yaml
- uses: actions/checkout@v4
```

* Downloads your repo onto the GitHub runner.
* Without this, the workflow wouldn’t see your files.

---

###  Step 2 — Setup Python

```yaml
- uses: actions/setup-python@v5
  with:
    python-version: '3.11'
    cache: 'pip'
    cache-dependency-path: |
      requirements.txt
      pyproject.toml
```

* Installs **Python 3.11** on the runner.
* Enables **pip cache**, so repeated runs reuse previously downloaded packages instead of fetching them again.
* The cache key is derived from the contents of `requirements.txt` and `pyproject.toml`, so changing either file invalidates the cache.

This makes Python setup faster in subsequent CI runs.
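
The idea of a content-based cache key can be sketched like this. The key format here is invented for illustration and is not the one `actions/setup-python` actually produces; only the principle (same dependency files, same key) carries over.

```python
import hashlib


def cache_key(file_contents: dict[str, bytes], prefix: str = "pip") -> str:
    """Derive a key that changes whenever any dependency file changes."""
    digest = hashlib.sha256()
    for name in sorted(file_contents):  # sorted for a stable order
        digest.update(name.encode())
        digest.update(file_contents[name])
    return f"{prefix}-{digest.hexdigest()[:16]}"


key_a = cache_key({"requirements.txt": b"pandas==2.2.2\n"})
key_b = cache_key({"requirements.txt": b"pandas==2.2.2\n"})
key_c = cache_key({"requirements.txt": b"pandas==2.2.3\n"})
print(key_a == key_b, key_a == key_c)
```

Identical files give a cache hit; bumping a version gives a new key and therefore a fresh download.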

---

###  Step 3 — Install dependencies

```yaml
- name: Install dependencies
  run: |
    python -m pip install --upgrade pip
    if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
    pip install pre-commit pytest
```

* Runs a shell command (`run:` block).
* Upgrades pip.
* Installs dependencies listed in `requirements.txt` (if it exists).
* Also installs `pre-commit` and `pytest` for code formatting and testing.

---

###  Step 4 — Run pre-commit hooks

```yaml
- name: pre-commit
  uses: pre-commit/action@v3.0.1
```

* Runs all hooks defined in your `.pre-commit-config.yaml` (Black, Ruff, nbstripout, etc.).
* Uses the official **pre-commit GitHub Action**.
* Ensures consistent formatting and linting before any merge.

 This checks your code automatically, just like running:

```bash
pre-commit run --all-files
```

locally.

---

###  Step 5 — Run pytest

```yaml
- name: pytest
  run: pytest -q --maxfail=1
```

* Runs your Python unit tests quietly (`-q` = quiet mode).
* `--maxfail=1` stops after the first failure (saves time in CI).
* If any test fails, the entire workflow fails and the PR shows a red ❌. To actually block merging, mark this check as required in the branch protection rules.

 With that rule in place, only tested, passing code can be merged into your main branches.
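
For this step to pass, pytest needs at least one test to collect; with none, it exits with code 5 and the step fails. A minimal smoke test could look like the sketch below. The file name `tests/test_smoke.py` and the inline `add` function are placeholders; in the real repo you would import the function from your own module.

```python
# tests/test_smoke.py  (hypothetical file name)


def add(a, b):
    """Stand-in for a function from your project."""
    return a + b


def test_add():
    # pytest collects functions whose names start with "test_"
    assert add(2, 3) == 5
    assert add(-1, 1) == 0
```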

---

##  Big-picture summary

| Section                | Purpose                                           |
| ---------------------- | ------------------------------------------------- |
| `name: CI`             | Labels this workflow as “Continuous Integration”  |
| `on:`                  | Tells GitHub *when* to run the workflow (push/PR) |
| `concurrency:`         | Cancels older runs on same branch                 |
| `jobs:`                | Defines the set of automated tasks                |
| `runs-on:`             | Chooses OS (Ubuntu runner)                        |
| `steps:`               | The sequential actions in the job                 |
| `actions/checkout`     | Pulls your repo into the runner                   |
| `actions/setup-python` | Installs Python and caching                       |
| `Install dependencies` | Installs your project dependencies                |
| `pre-commit`           | Runs Black, Ruff, nbstripout, etc.                |
| `pytest`               | Runs your unit tests                              |

---

###  What happens on GitHub

Once you commit and push this file:

* It appears under **GitHub → Actions** tab.
* Every push or PR triggers a **“CI” run**.
* Each step’s logs show in the web UI.
* You get ✅ or ❌ status on the PR automatically.

---

###  TL;DR summary in plain English

> Every time you push code or open a pull request to `main`, `master`, or `develop`:
>
> * GitHub sets up a Python 3.11 Linux VM.
> * Installs dependencies and pre-commit tools.
> * Runs pre-commit checks (code formatting, linting).
> * Runs all tests (`pytest`).
> * Cancels old runs if a newer one starts.
> * Marks the PR green ✅ (if all pass) or red ❌ (if any fail).





##  What is a **GitHub cloud runner?**

Think of it as a **temporary virtual machine (VM)** that GitHub provides in the cloud to execute your workflow steps.

When your workflow (`ci.yml`) triggers on a push or pull request:

>  GitHub spins up a fresh, isolated virtual computer — called a **runner** — installs your dependencies, runs your commands, and then deletes the machine when done.

###  Key properties

| Feature                | Description                                                        |
| ---------------------- | ------------------------------------------------------------------ |
| **Ephemeral**          | It’s brand new for each workflow run (no leftover files or state). |
| **Location**           | Runs in GitHub’s cloud (free for public repositories; private repositories get a monthly quota of included minutes). |
| **OS options**         | `ubuntu-latest`, `windows-latest`, or `macos-latest`.              |
| **Preinstalled tools** | Comes with Python, Git, Node.js, Docker, etc. ready to use.        |
| **Ephemeral storage**  | Anything written to disk disappears after the run.                 |
| **Security**           | Each run is sandboxed from other repos/users.                      |

So this line

```yaml
runs-on: ubuntu-latest
```

means:

> “Run this job on a clean Ubuntu Linux VM hosted by GitHub.”

---

##  `uses:` — what it means in GitHub Actions

Each line that starts with `uses:` means:

> “Run this predefined, reusable **Action**, referenced as `owner/repo@version` (many are listed in the GitHub Marketplace).”

An **Action** is basically a packaged script (usually in a public GitHub repo) that performs a common task — like checking out code, setting up Python, or running pre-commit.

GitHub automatically downloads and runs that Action on your runner.

---

##  `uses: actions/checkout@v4`

This is the **most common action**: it checks out (downloads) your repository’s code onto the runner. The `@v4` suffix pins the action to its v4 release line.

When the runner starts, it has common tools preinstalled but doesn’t have your code yet.

`actions/checkout` fetches your repository with Git under the hood (by default only the commit that triggered the run), so later steps can actually see and test your files.

### Example of what it does internally

```bash
git init
git remote add origin https://github.com/<user>/<repo>.git
git fetch origin <branch>
git checkout <branch>
```

 After this step, your project files (Python scripts, notebooks, etc.) are available in the VM’s workspace.

---

##  `uses: actions/setup-python@v5`

This Action installs and configures the version of **Python** you need on the runner.

```yaml
- uses: actions/setup-python@v5
  with:
    python-version: '3.11'
    cache: 'pip'
    cache-dependency-path: |
      requirements.txt
      pyproject.toml
```

1. Installs Python 3.11 on the Ubuntu VM.
2. Caches pip packages so future runs are faster.
3. Sets the correct `PATH` so when you run `python` or `pip`, it uses your requested version.

After this step, commands like:

```bash
python --version
pip install ...
pytest
```

will all use Python 3.11 (`pytest` becomes available once the install step has run).

The `|` in YAML (used in GitHub Actions workflow files) has a very specific meaning: it tells YAML to treat the following indented lines as a multi-line string, preserving newlines (`\n`) inside it. If you printed that value, it would look like:
`"requirements.txt\npyproject.toml\n"`.
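
In Python terms, the value is an ordinary string with embedded newlines. The snippet below does not parse YAML; it simply starts from the string that the block scalar produces and shows how it splits into one path per line.

```python
# The string a YAML "|" block yields for the two indented lines.
value = "requirements.txt\npyproject.toml\n"

# One entry per line, which is how a multi-path setting is usually consumed.
paths = value.splitlines()
print(paths)
```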

---

##  `uses: pre-commit/action@v3.0.1`

This is the official **pre-commit GitHub Action**.
It runs all hooks defined in your `.pre-commit-config.yaml`.

```yaml
- name: pre-commit
  uses: pre-commit/action@v3.0.1
```


* Installs the `pre-commit` package inside the runner.
* Automatically executes:

  ```bash
  pre-commit run --all-files
  ```

  using the exact same configuration you use locally.
* Runs your hooks: Black, Ruff, nbstripout, trailing-whitespace, etc.
* If any hook fails, the workflow fails (marking the PR ❌).

 This ensures everyone’s code passes the same style and lint checks before merging.
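To make "a hook fails" concrete, the sketch below imitates a fixer-style hook such as trailing-whitespace. It is a simplified stand-in written for this lesson, not the real hook's code; `fix_trailing_whitespace` and `run_hook` are hypothetical names. The key behavior is that a fixer reports failure whenever it had to change a file, so the second run passes.

```python
import tempfile
from pathlib import Path


def fix_trailing_whitespace(path: Path) -> bool:
    """Strip trailing spaces and tabs from each line; return True if the file changed."""
    original = path.read_text()
    fixed = "\n".join(line.rstrip(" \t") for line in original.split("\n"))
    if fixed != original:
        path.write_text(fixed)
        return True
    return False


def run_hook(paths) -> int:
    """Return a process-style exit code: 1 if any file had to be fixed, else 0."""
    changed = [p for p in paths if fix_trailing_whitespace(p)]
    for p in changed:
        print(f"Fixing {p.name}")
    return 1 if changed else 0


with tempfile.TemporaryDirectory() as tmp:
    readme = Path(tmp) / "README.md"
    readme.write_text("# Tiny edit  \n")
    first_run = run_hook([readme])   # fixes the file, so the hook "fails"
    second_run = run_hook([readme])  # nothing left to fix, so it passes
    cleaned = readme.read_text()

print(first_run, second_run, repr(cleaned))
```

In CI there is no second run, which is why a file that still needs fixing turns the check red until someone commits the fix.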

---


##  Visualization: what happens on GitHub

1. GitHub spins up a clean Linux VM (the **cloud runner**).
2. It runs each step in order:

   * Checkout → get your repo
   * Setup Python → install tools
   * Pre-commit → check formatting
   * Pytest → run tests
3. It displays the logs in the **Actions** tab.
4. It destroys the VM after the workflow ends.
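The "steps in order" behavior can be sketched locally. The example below is a toy runner, not how GitHub Actions is implemented: each step is a small stand-in command (the step names and commands are made up for illustration), and the runner stops at the first non-zero exit code, which is the default behavior of a job.

```python
import subprocess
import sys

# Hypothetical stand-ins for the real steps; each is a tiny Python command.
STEPS = [
    ("checkout", [sys.executable, "-c", "print('code fetched')"]),
    ("pre-commit", [sys.executable, "-c", "import sys; sys.exit(1)"]),  # simulate a failing hook
    ("pytest", [sys.executable, "-c", "print('tests passed')"]),
]


def run_steps(steps):
    """Run steps in order; stop at the first non-zero exit code."""
    results = []
    for name, cmd in steps:
        code = subprocess.run(cmd).returncode
        results.append((name, code))
        if code != 0:
            break  # later steps are skipped once one fails
    return results


results = run_steps(STEPS)
print(results)
```

Because the simulated pre-commit step exits with 1, the pytest step never runs. This is also why cheap checks are placed before slower ones.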




In [None]:
from pathlib import Path

wf_dir = Path(".github/workflows")
wf_dir.mkdir(parents=True, exist_ok=True)

wf = wf_dir / "ci.yml"
wf.write_text("""name: CI
on:
  push:
    branches: [ main, master, develop ]
  pull_request:
    branches: [ main, master, develop ]
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
jobs:
  build:
    runs-on: ubuntu-latest
    timeout-minutes: 10
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
          cache: 'pip'
          cache-dependency-path: |
            requirements.txt
            pyproject.toml

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
          pip install pre-commit pytest

      # Run pre-commit (Black, Ruff, nbstripout, etc.)
      - name: pre-commit
        uses: pre-commit/action@v3.0.1

      # Run tests (fast only)
      - name: pytest
        run: pytest -q --maxfail=1
""")
print(wf.read_text())

name: CI
on:
  push:
    branches: [ main, master, develop ]
  pull_request:
    branches: [ main, master, develop ]
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
jobs:
  build:
    runs-on: ubuntu-latest
    timeout-minutes: 10
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
          cache: 'pip'
          cache-dependency-path: |
            requirements.txt
            pyproject.toml

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
          pip install pre-commit pytest

      # Run pre-commit (Black, Ruff, nbstripout, etc.)
      - name: pre-commit
        uses: pre-commit/action@v3.0.1

      # Run tests (fast only)
      - name: pytest
        run: pytest -q --maxfail=1



### 8) Add a `Makefile` convenience (optional but nice)



In [None]:
from pathlib import Path
mk = Path("Makefile")
text = mk.read_text() if mk.exists() else ""
if "lint:" not in text:  # append only if no lint target is defined yet
    text += """

.PHONY: lint test ci-local
lint: ## Run pre-commit hooks on all files
\tpre-commit run --all-files

test: ## Run fast tests
\tpytest -q --maxfail=1

ci-local: lint test ## Simulate CI locally
"""
    mk.write_text(text)
print(mk.read_text())

# Makefile — unified-stocks
SHELL := /bin/bash
.SHELLFLAGS := -eu -o pipefail -c
.ONESHELL:


PY := python
QUARTO := quarto

START ?= 2020-01-01
END   ?= 2025-08-01
ROLL  ?= 30

DATA_RAW := data/raw/prices.csv
FEATS    := data/processed/features.parquet
REPORT   := docs/reports/eda.html

# Default target
.DEFAULT_GOAL := help

.PHONY: help all clean clobber qa report backup

help: ## Show help for each target
	@awk 'BEGIN {FS = ":.*##"; printf "Available targets:\n"} /^[a-zA-Z0-9_\-]+:.*##/ {printf "  \033[36m%-18s\033[0m %s\n", $$1, $$2}' $(MAKEFILE_LIST)

# all: $(DATA_RAW) $(FEATS) report backup ## Run the full pipeline and back up artifacts
all: $(DATA_RAW) $(FEATS) report train backup

$(DATA_RAW): scripts/get_prices.py tickers_25.csv
	$(PY) scripts/get_prices.py --tickers tickers_25.csv --start $(START) --end $(END) --out $(DATA_RAW)

$(FEATS): scripts/build_features.py $(DATA_RAW) scripts/qa_csv.sh
	# Basic QA first
	scripts/qa_csv.sh $(DATA_RAW)
	$(PY) scripts/build_features.py

## Wrap‑up

-   You configured **pre‑commit** with **Black**, **Ruff** (lint + import sort), and **nbstripout** to keep the repo clean.
-   You added a fast **CI** that runs the same hooks plus **pytest** on every PR.
-   CI time stays small due to **caching** and a **lean dependency set**; tests are fast by design (Session 13).

## Homework (due before next session)

**Goal:** Prove the workflow works end‑to‑end with a green PR from a **fresh clone**.

### Part A — Fresh‑clone smoke test (local)

In [None]:
%%bash
# On your laptop or a new Colab session:
git clone https://github.com/YOUR_USER/unified-stocks-teamX.git
cd unified-stocks-teamX
python -m pip install -U pip   # -m: run pip with this Python interpreter; -U: upgrade to the latest version
pip install pre-commit pytest
pre-commit install
pre-commit run --all-files
pytest -q --maxfail=1

### Part B — Open a PR that turns CI green

1.  **Create a branch** and make a tiny, style‑breaking change, then commit. Pre‑commit fixes the file automatically and aborts that commit, so stage the fix and commit again.


```bash
git checkout -b <branch-name>
```


`git checkout` lets you **switch branches** or **restore files**.

* Without `-b`, it means “switch to an existing branch”.

  ```bash
  git checkout main
  ```

  → moves you to the branch named `main`.

* With `-b`, it means “**create a new branch** and switch to it immediately”.



`-b` stands for **branch**.

So

```bash
git checkout -b new-feature
```

is shorthand for two separate steps:

```bash
git branch new-feature      # create the branch
git checkout new-feature    # switch to it
```

##   Typical workflow example

```bash
# 1. Make sure you’re up to date
git pull origin main

# 2. Create and switch to a new branch
git checkout -b docs/update-readme

# 3. Make some edits...

# 4. Commit your changes
git add README.md
git commit -m "Update README for new installation steps"

# 5. Push the new branch to GitHub
git push -u origin docs/update-readme
```

 Now a remote branch `docs/update-readme` exists, and you can open a Pull Request from it.

---

##  What happens internally

When you run `git checkout -b new-branch`, Git:

1. Creates a new **branch pointer** at the current commit.
2. Moves `HEAD` to point to that branch.
3. Leaves your working directory as it is, because the new branch starts at the commit you already had checked out.
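The pointer picture can be sketched as a toy model. `ToyRepo` below is a hypothetical class written for this lesson; it only mimics the bookkeeping (branch names mapped to commit ids, and `HEAD` naming the current branch) and is not how Git stores data on disk.

```python
class ToyRepo:
    """A toy model of branch pointers and HEAD."""

    def __init__(self, first_commit: str):
        self.branches = {"main": first_commit}  # branch name -> commit id
        self.head = "main"                      # HEAD names the current branch

    def current_commit(self) -> str:
        return self.branches[self.head]

    def checkout_b(self, name: str, start=None) -> None:
        """Mimic `git checkout -b <name> [<start>]`."""
        if name in self.branches:
            raise ValueError(f"branch {name!r} already exists")
        base = self.branches[start] if start else self.current_commit()
        self.branches[name] = base  # 1. new pointer at the base commit
        self.head = name            # 2. HEAD now points to the new branch

    def commit(self, commit_id: str) -> None:
        """Advance only the current branch's pointer."""
        self.branches[self.head] = commit_id


repo = ToyRepo("a1b2c3")
repo.checkout_b("new-feature")
same_commit = repo.branches["new-feature"] == repo.branches["main"]
repo.commit("d4e5f6")  # main stays where it was
print(repo.head, same_commit, repo.branches)
```

Right after `checkout_b`, both branches point at the same commit; only the next commit makes them diverge.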

---

##  Optional: create from another branch

You can specify a base branch other than your current one:

```bash
git checkout -b hotfix main
```

That means:

> “Create a new branch `hotfix` from `main`.”

---

##  Modern alternative

Git 2.23 introduced a dedicated command for this:

```bash
git switch -c <branch-name>
```

It’s the same as:

```bash
git checkout -b <branch-name>
```

For example:

```bash
git switch -c dev/test-feature
```

---

## Summary

| Command                       | Meaning                            |
| ----------------------------- | ---------------------------------- |
| `git checkout main`           | Switch to existing branch          |
| `git checkout -b new-feature` | Create and switch to a new branch  |
| `git checkout -b hotfix main` | Create new branch from `main`      |
| `git switch -c new-feature`   | Modern equivalent of `checkout -b` |




In [None]:
%%bash
git checkout -b chore/ci-badge-and-hooks
echo "# Tiny edit  " >> README.md   # trailing spaces (will be fixed)
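git add -A
# The first commit is expected to fail: the trailing-whitespace hook fixes README.md and aborts.
git commit -m "chore: add CI badge + enable pre-commit hooks" || {
  git add -A   # stage the hook's fix, then commit again
  git commit -m "chore: add CI badge + enable pre-commit hooks"
}
git push -u origin chore/ci-badge-and-hooks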

2.  **Add a CI badge** to `README.md`:

    ``` markdown
    ![CI](https://github.com/YOUR_USER/unified-stocks-teamX/actions/workflows/ci.yml/badge.svg)
    ```

3.  **Open a Pull Request** on GitHub and verify that:

    -   The pre‑commit step passes.
    -   pytest passes.
    -   Total runtime is under about 3–4 minutes.

4.  **Merge once green.** If the run is red, fix the problem locally; do not disable hooks.
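The badge URL from step 2 follows a fixed pattern, so it can be generated instead of typed by hand. The helper below is a small convenience sketch; `ci_badge_markdown` is a hypothetical name, and the owner and repository values are the same placeholders used above.

```python
def ci_badge_markdown(owner: str, repo: str, workflow_file: str = "ci.yml") -> str:
    """Build the Markdown for a GitHub Actions status badge."""
    url = f"https://github.com/{owner}/{repo}/actions/workflows/{workflow_file}/badge.svg"
    return f"![CI]({url})"


badge = ci_badge_markdown("YOUR_USER", "unified-stocks-teamX")
print(badge)
```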

### Part C — (Optional) Tune Ruff + Black to your taste

-   In `pyproject.toml`, try:

    ``` toml
    [tool.black]
    line-length = 100

    [tool.ruff]
    line-length = 100

    [tool.ruff.lint]
    select = ["E","F","I","B"]  # enable flake8-bugbear
    ignore = ["E501"]
    ```

-   Run `pre-commit run --all-files` and ensure CI remains green.



## Context: what `select` and `ignore` mean

Ruff (and Flake8) use **rule codes** — short identifiers for different linting checks.
Each code group (E, F, I, B, etc.) corresponds to a specific family of style or logic issues.

You can tell the linter which ones to:

* **enable** → using `select`
* **disable (ignore)** → using `ignore`

---


### `select = ["E", "F", "I", "B"]`


> “Enable all linting rules that start with these letters.”



| Code prefix | Origin                      | Purpose                                                                |
| ----------- | --------------------------- | ---------------------------------------------------------------------- |
| **E**       | pycodestyle (formerly PEP8) | Style errors — whitespace, indentation, etc.                           |
| **F**       | pyflakes                    | Logic errors — undefined variables, imports, etc.                      |
| **I**       | isort                       | Import sorting — ensures consistent import order                       |
| **B**       | flake8-bugbear              | Catches subtle bugs and bad practices (performance, readability, etc.) |

So this line activates the most common and useful rule sets:
 style, logic, imports, and bugbear-quality checks.

Example issues it might catch:

* E711: comparison to `None` with `==` instead of `is`
* F821: undefined variable name
* I001: imports not sorted
* B007: loop control variable not used within the loop body
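As an illustration, the small module below contains one deliberate violation from each of the four families. It runs without error as written (the function with the undefined name is never called), which is the point: these are problems a linter can see without running the code. The comments describe what Ruff is expected to report with `select = ["E", "F", "I", "B"]`; exact messages vary by version.

```python
import sys
import os  # I001: imports are not sorted (os should come before sys)

PLATFORM = sys.platform
SEPARATOR = os.sep


def is_missing(value):
    return value == None  # E711: comparison to None should use `is`


def count_items(items):
    total = 0
    for item in items:  # B007: loop variable `item` is never used in the body
        total += 1
    return total


def broken():
    return undefined_name  # F821: undefined name (fails only if the function is called)


print(is_missing(None), count_items("abc"))
```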

---

###  `ignore = ["E501"]`


> “Ignore rule E501 (line too long).”

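`E501` = *“line too long”*, a pycodestyle rule. pycodestyle's default limit is 79 characters; Ruff checks against the configured `line-length` (100 in the snippet above).
You’re ignoring it because most modern projects let **Black** handle line wrapping.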

In other words:

* Ruff/Flake8 won’t warn about long lines.
* You rely on **Black** to wrap long lines of code. Black does not split long strings or comments, so a few long lines can remain.
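For intuition about what is being switched off, a line-length check boils down to something like the sketch below. This is a simplified stand-in, not Ruff's implementation (which also accounts for details such as character width); `long_lines` is a hypothetical helper.

```python
def long_lines(source: str, limit: int = 100):
    """Return (line_number, length) for every line longer than `limit` characters."""
    return [
        (number, len(line))
        for number, line in enumerate(source.splitlines(), start=1)
        if len(line) > limit
    ]


# Line 1 is short; line 2 is a 125-character expression.
sample = "x = 1\n" + "y = " + "1 + " * 30 + "1\n"
found = long_lines(sample, limit=79)
print(found)
```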



### Part D — (Optional) Add notebook QA without executing them

-   Add **nbqa** to run Ruff on notebooks (code cells only):

    ``` yaml
    # append to .pre-commit-config.yaml
    - repo: https://github.com/nbQA-dev/nbQA
      rev: 1.8.5
      hooks:
        - id: nbqa-ruff
          args: [--fix]
          additional_dependencies: [ruff==0.5.0]
    ```

-   Re‑install hooks and run `pre-commit run --all-files`.
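"Without executing them" is possible because a notebook is just JSON: a tool can pull out the source of each code cell and analyze it statically. The sketch below shows that idea with the standard library only. It is not nbQA's implementation, the notebook content is hand-written for illustration, and unlike nbQA it does not handle IPython magics such as `%%bash`.

```python
import ast
import json

# A minimal notebook document, written by hand for illustration
NOTEBOOK_JSON = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title\n"]},
        {"cell_type": "code", "metadata": {}, "outputs": [], "execution_count": None,
         "source": ["import os\n", "print(os.getcwd())\n"]},
        {"cell_type": "code", "metadata": {}, "outputs": [], "execution_count": None,
         "source": ["x = (1 +\n"]},  # deliberately invalid Python
    ],
})


def code_cells(nb_text: str):
    """Yield (cell_index, source) for each code cell, skipping markdown cells."""
    nb = json.loads(nb_text)
    for index, cell in enumerate(nb["cells"]):
        if cell["cell_type"] == "code":
            yield index, "".join(cell["source"])


def syntax_problems(nb_text: str):
    """Parse each code cell statically; nothing is executed."""
    problems = []
    for index, source in code_cells(nb_text):
        try:
            ast.parse(source)
        except SyntaxError as exc:
            problems.append((index, exc.msg))
    return problems


problems = syntax_problems(NOTEBOOK_JSON)
print([index for index, _ in problems])
```

Only the third cell (index 2) is reported, and the `print(os.getcwd())` in the second cell never runs.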