Skip to content

14 continuous integration

github-actions[bot] edited this page Jun 25, 2026 · 1 revision

📖 This page is generated from modules/14-continuous-integration/README.md. Edit the source, not the wiki; edits here are overwritten on the next sync. Run the hands-on labs from the repo, linked inline.

Previous: Module 13: Testing in the AI Era

Module 14: Continuous Integration

The AI writes code that looks right. CI checks whether it actually is: automatically, on every push, before anyone trusts it. This module turns the tests you wrote in Module 13 into a gate that runs itself.


Prerequisites

  • Module 2: Version Control. Pushes, commits, and the diff habit are the substrate CI sits on.
  • Module 8: Remotes and Hosting. CI runs on the forge, triggered by pushes. You need a repo pushed to a remote (any forge: GitHub, GitLab, a self-hosted Forgejo/Gitea, whatever you set up in Module 8) for there to be anything to trigger.
  • Module 13: Testing in the AI Era. CI is mostly "run the tests, automatically." You need tests to run. If you skipped writing them, this module's lab ships a small suite so you're not blocked, but the real payoff is automating your tests.

You do not need Docker, secrets management, or your own runner yet; those are Modules 16, 17, and 19. On a SaaS forge (GitHub, GitLab.com, Bitbucket, and the rest) this module uses the forge's hosted runners, which require zero setup. One honesty note for the self-host track: a self-hosted Forgejo/Gitea/GitLab CE has the CI feature but no hosted compute; nothing actually runs until you attach a runner, and that's Module 19. The workflow you write here is correct either way and will run the moment a runner is registered; to watch it go green now, use a SaaS forge's hosted runners, then come back and own the compute end-to-end in Module 19.


Learning objectives

By the end of this module you can:

  1. Explain what CI actually is, automated checks bound to a trigger, and why "on every push" is the part that makes it valuable.
  2. Write a forge-native CI workflow that checks out your code, installs its tools, and runs a linter and your test suite.
  3. Read a CI run: find which step failed, read the log, and reproduce the failure locally.
  4. Watch CI catch a breaking change before it reaches anyone who would trust the broken code.
  5. Recognize that CI is the same concept on every forge, and port a pipeline from one to another.

Key concepts

What CI is, stripped down

Continuous Integration has a grand-sounding name and a mundane core: a set of checks that run automatically whenever you push code, on a clean machine you don't control. That's it. The checks are usually the same commands you'd run by hand (lint, build, test), and the magic is entirely in the word automatically.

You already run checks. Before you commit, you (sometimes) run the tests, (sometimes) run the linter, (sometimes) remember to. CI removes every "sometimes." It runs the checks the same way, every time, on every push, whether you remember or not, whether you're tired or not, whether it's a one-line fix you're sure about or not. The discipline you can't reliably enforce on yourself, a machine enforces for free.

Three properties make CI more than a glorified shell script:

  • It's triggered, not invoked. You don't run CI; pushing runs it. The check is bound to the event, so it can't be skipped by forgetting.
  • It runs on a clean machine. The forge spins up a fresh, throwaway runner with nothing of yours on it: no half-installed dependency, no environment variable you set six months ago and forgot. If your code only works because of something special about your laptop, CI finds out immediately. ("Works on my machine" dies here. Module 16 takes the reproducibility idea further with containers.)
  • Its result is visible and shared. A green check or a red X shows up on the commit and on the pull request (Module 10), where everyone (every human reviewer and, later, every agent) can see whether this code passed the gate.

The pipeline: checkout → setup → checks

Almost every CI configuration, on every forge, is the same four moves:

  1. Check out the code onto the runner. The runner starts empty; first you put your repo on it.
  2. Set up the environment: install the language runtime, pin its version.
  3. Install the tools the checks need: the test runner, the linter.
  4. Run the checks: lint, then test. Any check that exits non-zero fails the whole run.

That last point is the load-bearing one. CI's entire enforcement mechanism is the exit code. Every tool you'd run in a terminal returns 0 for success and non-zero for failure. python3 -m unittest exits non-zero if a test fails. ruff check exits non-zero if it finds a lint problem. CI runs your commands and watches those exit codes; one failure turns the run red. You're not learning a new testing system; you're wiring the tools you already have to a trigger.

What goes in a CI run for this audience

Three tiers of check, cheapest first, because a fast check that fails early saves you waiting on a slow one:

  • Lint. Static checks that don't run your code: style, unused imports, obvious mistakes. Fast, cheap, catches a surprising amount. We use a linter as the example here; the principle is tool-agnostic.
  • Build. Does the code even assemble? For an interpreted language like our Python example there's no compile step, so "build" often collapses into "does it import without erroring." For compiled languages this is where a broken type or missing symbol gets caught.
  • Test. The Module 13 suite. The expensive, high-value tier: it actually runs your code and checks behavior.

Order them cheap-to-expensive so the fast checks fail fast. There's no reason to spend two minutes running the test suite if the linter would have rejected the push in three seconds.

The worked example: a forge-native workflow

Here's a complete, real CI pipeline for the tasks-app. This is GitHub Actions YAML, the most common dialect and our default example, but read it as a concept, not a product. Every forge has the exact same pipeline in its own dialect; the GitLab version is in the lab folder, and it's the same five moves.

name: CI

on:
  push:
  pull_request:

jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - name: Check out the code
        uses: actions/checkout@v7
      - name: Set up Python
        uses: actions/setup-python@v6
        with:
          python-version: "3.12"
      - name: Install tools
        run: pip install ruff
      - name: Lint
        run: ruff check .
      - name: Test
        run: python -m unittest

Reading it top to bottom: on: is the trigger (push and pull request). runs-on: picks the clean machine. The steps: are the four moves: checkout, set up Python, install the tools, then the two checks. uses: pulls in a pre-built action (someone else's reusable step); run: is just a shell command. The linter runs first because it's cheap; the tests run last because they're the expensive, decisive check. Only the linter needs a pip install here; the tests run on Python's standard-library unittest runner from Module 13, so there's nothing to install for them.

This file lives in the repo, committed and versioned like everything else. That's deliberate: your pipeline is code, it's reviewed as a diff in a PR (Module 10), and a teammate or an agent inherits it automatically by cloning. The same logic as committing the AI's config in Module 5. The automation around your work is itself a durable, shared artifact.

Reading a failed run

When CI goes red, the skill is triage, and it's fast once you know the shape:

  1. Open the run. The forge shows the job as a list of steps with a red X on the one that failed.
  2. The first red step is the cause. Steps run in order and stop at the first failure; everything after it is skipped, not broken. Don't get distracted by the skipped steps.
  3. Read that step's log. It's the same output the tool prints in your terminal: a failing unittest assertion, a ruff finding with a file and line number. CI didn't invent a new error format; it's showing you the command's own output.
  4. Reproduce it locally. The same command from the failed step (python3 -m unittest or ruff check .) fails the same way on your own machine, because CI ran exactly that command. That reproducibility is the point: fix locally, confirm green locally, push again.

That loop (red on the forge, reproduce locally, fix, push) is the entire day-to-day of working with CI. The clean-machine runner occasionally surfaces a failure you can't reproduce locally. That's not CI being flaky; it's CI correctly catching that your machine has something the clean one doesn't. (See "Where it breaks.")


The AI angle

This is the module where CI stops being generic devops hygiene and becomes specifically about AI-assisted work.

AI generates code that looks right. That's not a knock on the models; it's their defining property. They produce fluent, plausible, well-formatted code that passes a human skim, because "looks like correct code" is close to what they're optimizing for. The failure mode isn't garbage that obviously won't run; it's the function that's 95% right with a flipped comparison, the refactor that quietly drops an edge case, the "cleanup" that breaks one path you didn't think to re-check. A human reviewer skimming a confident-looking diff is exactly the reviewer that misses these (Module 10 is the whole skill of not missing them, and it's hard).

CI is the reviewer that doesn't skim. It runs the code. It doesn't care how clean the diff looks or how confidently the commit message is worded; it executes the tests and reports the exit code. The flipped comparison fails an assertion. The dropped edge case fails the test that covered it. The plausibility that fools a human is invisible to a process that only checks behavior.

This compounds with everything else AI changes about your workflow:

  • AI raises your push rate. You're making more changes, faster, more of them generated. Manual pre-push checking scales with discipline and doesn't survive volume. The automated gate scales for free; it doesn't get tired on the fortieth push of the day.
  • AI can fix what CI catches. A red CI run is a precise, machine-readable problem statement: the exact command, the exact failing assertion, the exact line. That's ideal input for an agent. Paste the failed log into Claude Code (or your agent) and direct it to fix the failure. (Module 25 automates this into agents that respond to a failing pipeline on their own. CI is the trigger that makes self-healing possible.)
  • CI is the gate that makes letting agents run safely possible at all. Every later module that hands the AI more autonomy (issue-to-PR agents, unattended runs) relies on the fact that nothing the agent produces reaches anyone without passing CI first. The supervision is structural: it's this gate, not a human watching the agent type.

You don't add CI despite using AI. The faster and more confidently the AI writes plausible code, the more you need a reviewer that checks behavior instead of believing the diff.


Hands-on lab

Starting point (this lab is skip-friendly). You do not need to have done the earlier labs. To begin from a clean, known state, copy this module's snapshot into a fresh tasks-app and make the first commit:

mkdir -p ~/ai-workflow-course/tasks-app
cp -r ~/ai-workflow-course/modules/14-continuous-integration/lab/start/. ~/ai-workflow-course/tasks-app/
cd ~/ai-workflow-course/tasks-app && git init -b main && git add -A && git commit -m "start: module 14"

Already carrying your tasks-app from earlier modules? Keep using it and ignore this box. Lab language: YAML (the CI config) plus the Python tasks-app and shell commands. You direct the agent to place files, commit, and recover; you commit a starter workflow, watch it pass, then break it on purpose and watch CI catch it.

You'll need:

  • The tasks-app from Modules 1–2, pushed to a forge (Module 8). Any forge works.
  • The starter files in this module's lab/:
    • ci-starter.yml: the workflow (GitHub Actions flavor).
    • gitlab-ci-starter.yml: the same pipeline for GitLab, if that's your forge.
    • test_tasks.py: a small test suite (use your Module 13 tests instead if you have them).
  • Python 3.10+ locally, and your agent. Examples use Claude Code; sub your own agent anywhere.

Part A: Run the checks locally first

Never push a workflow you haven't run by hand. CI just runs the same commands, so prove they work on your machine first.

  1. Direct your agent to set up the project, then run the checks yourself once. Tell Claude Code (sub your own agent): "Copy the lab's test_tasks.py next to tasks.py in ~/ai-workflow-course/tasks-app, then install ruff into this project." The agent places the file and handles the install, including the PEP 668 fallback (a per-project venv) if the system Python refuses a global install. What it runs looks like:

    cd ~/ai-workflow-course/tasks-app
    pip install ruff
    # if pip is refused with "externally-managed-environment" (PEP 668, common on recent
    # Debian/Ubuntu and Homebrew Python), the agent falls back to a per-project venv:
    #   python3 -m venv .venv && source .venv/bin/activate   # Windows: .venv\Scripts\activate
    #   pip install ruff

    Then run both checks yourself, once. This is the one part you do by hand on purpose: feeling that CI is nothing more than these same two commands is what makes the rest of the module click.

    python3 -m unittest   # should report all tests passing
    ruff check .         # should report no issues (or fix what it flags)

    If both are clean locally, CI will be green. If not, fix it here; it's faster than waiting on a runner. (Only the linter needs installing. The stdlib unittest runner ships with Python.)

Part B: Add the workflow and watch it pass

  1. Direct the agent to put the workflow where your forge looks for it. Tell Claude Code which forge you're on and let it pick the path:

    • GitHub / Forgejo / Gitea: lab/ci-starter.yml goes to .github/workflows/ci.yml (Forgejo/Gitea also read .forgejo/workflows/ or .gitea/workflows/; the agent checks which yours uses).
    • GitLab: lab/gitlab-ci-starter.yml goes to .gitlab-ci.yml at the repo root.
  2. Direct the agent to commit and push it, then verify. Tell Claude Code: "Stage the new workflow and test_tasks.py, commit with a message about adding CI, and push." Let it decide what to stage and run the git for you. What it runs looks like:

    git add .github/workflows/ci.yml test_tasks.py    # path varies by forge; the agent picks it
    git commit -m "Add CI: lint and test on every push"
    git push

    Verify it committed the workflow and the test file (a git show --stat HEAD confirms what landed), not stray files.

  3. Open your repo in the forge's web UI and find the run (usually an "Actions," "CI/CD," or "Pipelines" tab, and a status icon on the commit). Watch the steps execute and turn green. That green check is the gate now standing guard on every future push. (Self-host track: if the run sits queued with nothing picking it up, that's the no-hosted-runner situation from the prerequisites; the workflow is correct, it just has no compute until you attach a runner in Module 19. Run this part on a SaaS forge to see green right now.)

Part C: Break it on purpose and watch CI catch it

This is the whole point. You're going to ship the kind of plausible-but-wrong change AI produces, and watch CI stop it.

  1. Introduce a breaking change with the agent. Ask Claude Code (sub your own) for something that sounds like a cleanup but changes behavior: "Refactor pending() in tasks.py to be simpler." If it stays correct, nudge it until the logic actually changes. The classic plausible break: have pending() return self.tasks (all tasks) instead of filtering out the done ones. It reads fine. It's wrong.

  2. Notice it still looks right. Glance at the diff. The function is short, clean, plausible. This is exactly the trap from "The AI angle": nothing in the appearance warns you.

  3. Direct the agent to commit and push the change it just made. Tell Claude Code: "Commit this and push it." What it runs looks like:

    git add tasks.py
    git commit -m "Simplify pending()"
    git push

    Then verify CI goes red.

  4. Watch CI go red. Open the run, find the first failed step (Test), and read the log: test_pending_excludes_completed_tasks failed, with the assertion and the actual-vs-expected values. CI caught in seconds what a skim would have waved through.

  5. Hand the failure to the agent and let it recover. Paste the red CI log (the failed Test step) into Claude Code and direct it: "Reproduce this locally, then undo the bad change safely; it's already pushed." Your job is to verify it makes the right call, not to type git. The check: because the commit is already on shared history, the team-safe undo is git revert, not git restore (Module 12). What the agent runs looks like:

    python3 -m unittest          # fails locally too: same command, same failure
    git revert --no-edit HEAD   # new commit that undoes "Simplify pending()" (Module 12)
    git push                    # CI re-runs on the fixed code and goes green again

    Verify CI goes green again, and that the agent chose revert (a new inverting commit) over a history-rewriting undo on a branch others may have pulled.

  6. (Optional, to feel the linter tier.) Add an obviously unused import to cli.py (import os at the top, unused), then direct the agent to commit and push. Watch the Lint step fail before the tests even run: the cheap check failing fast. Have the agent remove it and push again.

You've now seen both halves: CI passing as a guardrail that stays out of your way, and CI failing as the reviewer that caught a change you might have trusted.


Where it breaks

The honest caveats, because a skeptical audience trusts the limits more than the pitch:

  • CI only catches what your checks check. A green run means "the linter found nothing and the tests passed," not "the code is correct." If the AI broke behavior you have no test for, CI is cheerfully green while the bug ships. CI is exactly as good as your test suite (Module 13), and no better. The flipped-comparison bug above got caught because a test covered it.
  • Green CI is not "reviewed." It checks behavior, not design, intent, security, or whether the feature is even the right one. It does not replace human review (Module 10) or the security gates in Module 15; it sits alongside them. Treating a green check as sign-off is how plausible-wrong code with no failing test sails straight through.
  • The clean machine is a feature that feels like a bug. Sooner or later CI fails in a way you can't reproduce locally: a dependency you have installed but never declared, a file outside the repo your code quietly reads, a path that only exists on your machine. That's not flakiness; it's CI correctly catching that your code depends on something that isn't in the repo. Fix the dependency, don't blame the runner. (Module 16's containers make local and CI environments identical, which kills most of these.)
  • Slow CI gets ignored. If the run takes fifteen minutes, people stop waiting for it and start merging around it, and the gate is worthless. Keep it fast: cheap checks first, and don't put things in CI that don't need to run on every push.
  • CI is not free compute, and it's not infinite. Hosted runners have usage limits and queue times, and a workflow that triggers on every push to every branch can burn through them. (Module 19 is where you understand and own that compute.)
  • A committed workflow runs code from the repo. A pull request from an untrusted fork can propose changes to the workflow itself. Forges have settings for how CI handles fork PRs; the defaults are usually safe, but it's a real attack surface worth knowing exists (the supply-chain thread picks up in Modules 15 and 22).

Check for understanding

You're done when:

  • Your tasks-app has a committed CI workflow that runs a linter and your tests on every push, and you've watched it go green on the forge.
  • You pushed a plausible-but-wrong change and watched CI catch it: found the failed step, read the log, reproduced the failure locally, and fixed it.
  • You can explain, in your own words, why CI specifically matters for AI-generated code (it checks behavior, not appearance) and the one thing a green check does not tell you (that the code is correct; only that your checks passed).
  • You can point at the same pipeline in two forge dialects and see it's the same five moves.

When pushing a change and expecting the gate to either bless it or stop it feels automatic, when you'd be uneasy merging code that hadn't been through CI, you've got it. Module 15 adds the next gates on the same pushes: scanning for vulnerable dependencies, leaked secrets, and the packages AI hallucinates into existence.


Verify-before-publish

CI YAML and the actions it references drift faster than the rest of this durable-core material. Re-check at build time:

  • Action versions. Confirm actions/checkout and actions/setup-python major versions in ci-starter.yml are current and not deprecated. Pinned majors (@v7, @v6) age.
  • Runner labels. Confirm ubuntu-latest (and any GitLab image: tag) still resolves to a supported image; default runner OS versions roll forward.
  • Trigger and config syntax. Verify the on: keys and overall workflow schema against the forge's current docs; Actions YAML keys do change.
  • Forge UI labels. The tab names in the lab ("Actions," "CI/CD," "Pipelines") and the workflow file locations (.github/workflows/, .gitlab-ci.yml, .forgejo/, .gitea/) match what the current forge versions actually use.
  • Tool names. The example linter (ruff) is current, installable, and still behaves as described, or swap in the equivalent the rest of the course uses. (The test runner is Python's standard-library unittest, which ships with Python; no install, nothing to drift.)

Continue to: Module 15: Security Scanning for AI-Generated Code

Clone this wiki locally