This document outlines the setup and workflow of our GitHub Actions data pipeline. The primary goal is to manage and version-control data generated by our CI/CD processes, with a special focus on handling a SQLite database and its schema migrations.
A dedicated, orphaned branch (`_data`) serves as the storage for data artifacts. This keeps large data files and frequent data updates out of the main source code history, making the main repository lighter and faster to clone.
This composite action manages the interaction with the `_data` branch by creating a git worktree.
- `operation: setup`: Checks out the `_data` branch into a `.pipeline-data-worktree` directory and uses `rsync` to copy the entire contents of the worktree's `data` directory into the main workspace's `data` directory.
- `operation: update`: Uses `rsync` to sync the `data` directory from the main workspace to the worktree, then commits and force-pushes the changes to the `_data` branch.
- `operation: cleanup`: Removes the `.pipeline-data-worktree` directory. This should be run at the end of a workflow, typically using an `if: always()` condition to ensure cleanup happens even if other steps fail.
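
The three operations can be pictured as short command sequences. The following is only a sketch under stated assumptions: the function name `worktreeCommands`, the commit message, the `origin` remote name, and the exact `git` and `rsync` flags are illustrative and may differ from what the composite action actually runs.

```typescript
// Hypothetical sketch: maps each operation to the shell commands it might run.
// The real composite action may use different flags or additional steps.
type WorktreeOperation = "setup" | "update" | "cleanup";

const DATA_BRANCH = "_data";
const WORKTREE_DIR = ".pipeline-data-worktree";

function worktreeCommands(operation: WorktreeOperation): string[] {
  switch (operation) {
    case "setup":
      return [
        // Assumption: the remote is named "origin".
        `git fetch origin ${DATA_BRANCH}`,
        `git worktree add ${WORKTREE_DIR} ${DATA_BRANCH}`,
        // Trailing slashes make rsync copy directory contents, not the directory itself.
        `rsync -a ${WORKTREE_DIR}/data/ data/`,
      ];
    case "update":
      return [
        `rsync -a data/ ${WORKTREE_DIR}/data/`,
        `git -C ${WORKTREE_DIR} add data`,
        // Assumption: the commit message is a placeholder.
        `git -C ${WORKTREE_DIR} commit -m "Update data"`,
        `git -C ${WORKTREE_DIR} push --force origin ${DATA_BRANCH}`,
      ];
    case "cleanup":
      return [`git worktree remove --force ${WORKTREE_DIR}`];
  }
}
```

Returning the commands as data rather than executing them keeps the sketch easy to inspect; a real step would pass each string to a shell.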
This action handles the dumping and restoring of the SQLite database in a way that is compatible with our migration-based schema management.
- `operation: dump`:
  - Dumps the live SQLite database (e.g., `data/db.sqlite`) into a diffable format in the specified dump directory (e.g., `data/dump`) using `sqlite-diffable`.
  - Copies the Drizzle migration journal (`drizzle/meta/_journal.json`) into the dump directory as `_journal.json`. This is a critical step that versions the database schema state along with the data itself.
- `operation: restore`:
  - Reads the latest migration number from the `_journal.json` file located within the dump directory.
  - Initializes a new, empty database.
  - Runs database migrations from the main branch's `drizzle` directory up to the version specified in the journal file. This creates a database with the exact schema that corresponds to the dumped data.
  - Loads the data from the diffable dump into the database.
  - Runs any remaining migrations from the main `drizzle` folder to bring the database schema fully up to date with the latest code in the main branch.
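
The restore ordering is easiest to follow as a split of the migration list around the point where the dump was taken. The sketch below assumes the journal contains an `entries` array whose items carry a numeric `idx` and a string `tag`; the names `planRestore` and `latestMigrationIndex` and the migration tags in the usage example are hypothetical.

```typescript
// Assumed minimal shape of a Drizzle journal; real journals carry more fields.
interface JournalEntry {
  idx: number;
  tag: string;
}

interface Journal {
  entries: JournalEntry[];
}

interface RestorePlan {
  beforeLoad: string[]; // migrations to run before loading the dumped data
  afterLoad: string[]; // migrations to run once the data is loaded
}

// Returns -1 when the dump was taken before any migration had been applied.
function latestMigrationIndex(journal: Journal): number {
  return journal.entries.reduce((max, entry) => Math.max(max, entry.idx), -1);
}

function planRestore(dumpJournal: Journal, mainJournal: Journal): RestorePlan {
  const dumpedIdx = latestMigrationIndex(dumpJournal);
  // Copy before sorting so the caller's journal is not mutated.
  const ordered = mainJournal.entries.slice().sort((a, b) => a.idx - b.idx);
  return {
    beforeLoad: ordered.filter((e) => e.idx <= dumpedIdx).map((e) => e.tag),
    afterLoad: ordered.filter((e) => e.idx > dumpedIdx).map((e) => e.tag),
  };
}

// Usage with hypothetical migration tags:
const plan = planRestore(
  { entries: [{ idx: 0, tag: "0000_init" }, { idx: 1, tag: "0001_add_scores" }] },
  {
    entries: [
      { idx: 0, tag: "0000_init" },
      { idx: 1, tag: "0001_add_scores" },
      { idx: 2, tag: "0002_add_summaries" },
    ],
  },
);
// plan.beforeLoad → ["0000_init", "0001_add_scores"]
// plan.afterLoad  → ["0002_add_summaries"]
```

Loading the data between the two groups means the dump is inserted into the schema it was written against, and only then moved forward.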
This repository uses several GitHub Actions workflows to automate testing, data processing, and deployment.
This is the main data processing workflow. It's responsible for fetching the latest data from sources like GitHub, processing it, and generating summaries.
- Triggers:
  - Runs on a daily schedule (`cron: "0 23 * * *"`).
  - Can be manually triggered (`workflow_dispatch`) with various options to control its behavior (e.g., forcing re-ingestion, specifying date ranges).
- Key Jobs:
  - `ingest-export`:
    - Checks out the `_data` branch and restores the database.
    - Runs the `ingest` pipeline to fetch new data (issues, PRs, etc.).
    - Runs the `process` pipeline to calculate scores and other metrics.
    - Runs the `export` pipeline to save processed data.
    - Dumps the updated database and pushes all new data artifacts to the `_data` branch.
  - `generate-summaries`:
    - Depends on the successful completion of `ingest-export`.
    - Restores the latest database from the `_data` branch.
    - Uses an AI service to generate project and contributor summaries.
    - On scheduled runs, generates project summaries daily and contributor summaries weekly.
    - Pushes the generated summaries and updated database state back to the `_data` branch.
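
The daily and weekly cadence in `generate-summaries` could be decided from the run date alone. This is a minimal sketch, not the workflow's actual logic: the function name `summariesForScheduledRun` is hypothetical, and the choice of Sunday for the weekly contributor summaries is an assumption, since the weekday is not stated here. GitHub Actions evaluates `schedule` cron expressions in UTC, so the sketch uses UTC as well.

```typescript
type SummaryKind = "project" | "contributor";

// Assumption: weekly contributor summaries run on Sunday (UTC day 0).
const WEEKLY_SUMMARY_UTC_DAY = 0;

function summariesForScheduledRun(runDate: Date): SummaryKind[] {
  // Project summaries are generated on every scheduled run.
  const kinds: SummaryKind[] = ["project"];
  // Contributor summaries are added once a week.
  if (runDate.getUTCDay() === WEEKLY_SUMMARY_UTC_DAY) {
    kinds.push("contributor");
  }
  return kinds;
}

// Usage: a scheduled run at 23:00 UTC on Sunday, 7 January 2024.
const sundayKinds = summariesForScheduledRun(new Date("2024-01-07T23:00:00Z"));
// sundayKinds → ["project", "contributor"]
```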
This workflow runs on every pull request against the `main` branch to ensure code quality and prevent regressions.
- Triggers:
  - `pull_request` on the `main` branch.
- Key Jobs:
  - `check`: Lints the code and runs type-checking with TypeScript.
  - `build`: Ensures the Next.js application builds successfully with the PR changes. It restores the production data to ensure the build process is realistic.
  - `test-pipelines`: Runs the core data pipelines (`ingest`, `process`, `export`) in a test mode to verify their integrity.
  - `check-migrations`: If the database schema (`src/lib/data/schema.ts`) is modified, this job verifies that a corresponding Drizzle migration has been generated.
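
One way to picture the `check-migrations` rule is as a test over the list of files changed in the pull request. The sketch below is an assumption about how such a check could work, not the job's actual implementation: the function name `checkMigrations` is hypothetical, and it treats any change under `drizzle/` as evidence of a generated migration.

```typescript
const SCHEMA_PATH = "src/lib/data/schema.ts";
// Assumption: generated migrations live under this directory.
const MIGRATIONS_DIR = "drizzle/";

interface MigrationCheckResult {
  ok: boolean;
  reason: string;
}

function checkMigrations(changedFiles: string[]): MigrationCheckResult {
  const schemaChanged = changedFiles.some((file) => file === SCHEMA_PATH);
  if (!schemaChanged) {
    return { ok: true, reason: "schema unchanged" };
  }
  // indexOf(...) === 0 is a prefix test: the path starts with the migrations directory.
  const migrationChanged = changedFiles.some((file) => file.indexOf(MIGRATIONS_DIR) === 0);
  return migrationChanged
    ? { ok: true, reason: "schema changed with a migration" }
    : { ok: false, reason: "schema changed without a migration" };
}
```

A path-based check like this is cheap but coarse; it cannot tell whether the migration actually matches the schema change.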
This workflow handles the deployment of the application to GitHub Pages.
- Triggers:
  - Manually via `workflow_dispatch`.
  - Automatically after the `Run Pipelines` workflow successfully completes on the `main` branch.
- Key Steps:
  - Restores the latest data from the `_data` branch.
  - Runs any pending database migrations.
  - Builds the Next.js application for production.
  - Copies the `data` directory into the `out` directory to be included in the deployment.
  - Deploys the contents of the `out` directory to GitHub Pages.
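
The two deployment triggers can be expressed as a single predicate. This sketch assumes the automatic trigger is implemented with GitHub's `workflow_run` event, whose payload exposes the upstream run's `name`, `conclusion`, and `head_branch`; the `DeployTrigger` type and `shouldDeploy` function are hypothetical names introduced for illustration.

```typescript
// Hypothetical, trimmed-down view of the triggering event.
interface DeployTrigger {
  eventName: string; // e.g. "workflow_dispatch" or "workflow_run"
  workflowName?: string; // workflow_run.name of the upstream workflow
  conclusion?: string; // workflow_run.conclusion, e.g. "success" or "failure"
  headBranch?: string; // workflow_run.head_branch
}

function shouldDeploy(trigger: DeployTrigger): boolean {
  // Manual runs always deploy.
  if (trigger.eventName === "workflow_dispatch") {
    return true;
  }
  // Automatic runs deploy only after a successful "Run Pipelines" run on main.
  return (
    trigger.eventName === "workflow_run" &&
    trigger.workflowName === "Run Pipelines" &&
    trigger.conclusion === "success" &&
    trigger.headBranch === "main"
  );
}
```

Checking the conclusion explicitly matters because a `workflow_run` trigger fires when the upstream workflow completes, whether or not it succeeded.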