Automated detection and tracking of fake engagement on GitHub
A JS Labs project, part of the AI Slop Intelligence initiative.
Runs every day. Scores every suspicious account. Detects coordinated bot campaigns.
Files issues directly on compromised repos so maintainers can act.
Support this project
BTC 3QjWqhQbHdHgWeYHTpmorP8Pe1wgDjJy54
ETH 0x5851e6145F4773d1585b8686095FB16E368a4dA1
ZEC t1KSR5YkNPbjqRSCoLKo5AddFWdm9Kzxh1B
GitHub stars are a trust signal. They are how developers decide what to evaluate, what to depend on, and what to recommend. That signal is being systematically corrupted.
During the AI boom of 2024-2026, an industry of bot farms emerged to manufacture credibility for low-quality, often malicious repositories. A project with 800 stars in 48 hours reads as legitimate to a developer scanning search results. That's the point. The goal of fake engagement isn't the stars themselves; it's the social proof those stars produce, and the downstream decisions that social proof influences.
The pattern is identifiable. Accounts created the same week, no bio, no followers, no original repositories, starring the same 15 repos within a 2-hour window. Not one campaign, but dozens running simultaneously, every day, across thousands of accounts. The data shows repos where 185 out of 185 engagers are classified as likely fake: a 100% fakeness ratio. Entire trending placements built on nothing.
phantomstars was built because this problem is tractable. The signal-to-noise ratio in GitHub's public API is, for now, still high enough that coordinated campaigns leave clear fingerprints. This project reads those fingerprints, publishes the raw data, and notifies affected repository maintainers directly.
This is part of the broader AI Slop Intelligence work at JS Labs, ongoing research into the mechanics and measurable effects of low-quality AI-generated content flooding developer ecosystems. Fake engagement isn't a peripheral issue. It's the distribution mechanism that gets slop in front of real users.
phantomstars runs a daily GitHub Actions job that:
- Scrapes the GitHub Trending page for repos gaining stars today
- Queries the GitHub Search API for repos created in the last 7 days with sudden star activity (the wider window catches multi-day campaigns missed by 24h-only scans)
- Pulls recent engagement events (stars, forks) via the Events API (last 24 hours per repo)
- Fetches the full profile of every engaging account via GraphQL: account creation date, follower/following counts, bio, repo history
- Scores every account against a composite heuristics model: account age, profile completeness, repository patterns, and activity history
- Detects coordinated campaigns using timestamp clustering and union-find: clusters of suspicious accounts that engaged within a 3-hour window
- Appends all suspects to an append-only JSONL ledger committed back to this repo
- Publishes a per-repo intelligence feed showing which repos are being targeted and at what fakeness ratio
- Files GitHub issues directly on targeted repos so maintainers see the campaign data in their own issue tracker
- Writes a formatted scan report to the GitHub Actions job summary
No servers. No databases. No infrastructure bill.
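The profile-fetch step can be sketched as a single GraphQL request per account. This is a minimal illustration, not the project's actual query: the field names exist in GitHub's GraphQL schema, but the selection, the `forks` alias, and the helper name `build_profile_request` are assumptions.

```python
import json

# Assumed field selection covering the signals listed above
# (creation date, bio, social graph, repository history).
PROFILE_QUERY = """
query($login: String!) {
  user(login: $login) {
    login
    createdAt
    bio
    company
    location
    followers { totalCount }
    following { totalCount }
    repositories(ownerAffiliations: [OWNER]) { totalCount }
    forks: repositories(ownerAffiliations: [OWNER], isFork: true) { totalCount }
  }
}
"""


def build_profile_request(login: str) -> str:
    """Return the JSON body for a POST to https://api.github.com/graphql."""
    return json.dumps({"query": PROFILE_QUERY, "variables": {"login": login}})
```

The body would be sent with an `Authorization: bearer <token>` header. Passing the login as a variable rather than interpolating it keeps the query string constant across accounts.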
Yes. When a repo's fakeness ratio exceeds 40% or a coordinated campaign is detected, phantomstars opens an issue directly on that repository. The issue contains the full suspect table, campaign membership, composite scores, and account creation dates: everything a maintainer needs to investigate and report to GitHub.
If issues are disabled on a targeted repo, the notification is skipped silently and recorded in the scan log.
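The notification rule can be sketched as follows. The ratio is assumed to be `likely_fake / total_scanned`, which matches the sample `repos.jsonl` record (62 of 87 gives 0.713); the function names and the treatment of an empty scan are hypothetical.

```python
FAKENESS_THRESHOLD = 0.40  # from the text: issues are filed above 40%


def fakeness_ratio(likely_fake: int, total_scanned: int) -> float:
    """Share of scanned engagers classified as likely fake, rounded to 3 places."""
    if total_scanned == 0:
        return 0.0  # assumption: an empty scan counts as a zero ratio
    return round(likely_fake / total_scanned, 3)


def should_notify(likely_fake: int, total_scanned: int, campaign_count: int) -> bool:
    """True when the ratio exceeds the threshold or any campaign was detected."""
    ratio = fakeness_ratio(likely_fake, total_scanned)
    return ratio > FAKENESS_THRESHOLD or campaign_count > 0
```

With these definitions, a repo with 185 likely-fake engagers out of 185 has a ratio of 1.0 and always triggers a notification.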
Yes. If your account appears in data/suspects.jsonl and you believe the classification is incorrect, open a false positive issue using the provided template. Reports are reviewed manually before any allowlist addition. The allowlist is stored in data/allowlist.txt; accounts listed there are excluded from all future scans and from the suspects ledger.
A campaign ID (e.g. c-a3f9b2e1) is a deterministic 8-character hex fingerprint derived from the SHA-256 hash of the sorted set of member logins in that campaign. The same group of accounts will produce the same campaign ID across independent scan runs, enabling longitudinal tracking. It is not a repo name, a username, or any external identifier.
Stability: the ID is stable as long as the campaign's member set is unchanged. If bots are added or suspended between scans, the ID changes because the membership changed. This is expected and reflects real-world drift in bot farm composition.
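A campaign ID of this shape could be derived roughly as below. The text specifies only SHA-256 over the sorted member logins and an 8-character hex result; the newline separator, UTF-8 encoding, de-duplication, and the choice of the first 8 hex characters are assumptions.

```python
import hashlib
from typing import Iterable


def campaign_id(logins: Iterable[str]) -> str:
    """Fingerprint a campaign from its member logins, independent of input order."""
    # Sort (and de-duplicate) so the same member set always hashes identically.
    canonical = "\n".join(sorted(set(logins)))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"c-{digest[:8]}"
```

Because the input is sorted before hashing, two scans that observe the same members in a different order produce the same ID, while adding or removing a single login yields a different one.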
Yes. Every account's creation date is fetched from the GitHub GraphQL API (createdAt field) and stored in each suspect record as account_created_at. It's also the primary input to the account age score, the strongest single signal for fake accounts. Accounts created within 2 days of engaging score 1.0 on age alone.
Individual scores carry meaningful false positive rates. A new developer with a sparse profile legitimately scores 0.75+. The tool accounts for this by requiring campaign-level evidence before filing issues; a single suspicious account is not enough. A coordinated cluster of 40+ accounts, all created the same week, all scoring 0.75+, all engaging within 90 minutes, is a different matter. That's where confidence becomes actionable.
The data is always probabilistic. The issue bodies say so explicitly. The goal is to give maintainers the signal and the raw evidence to make their own judgement.
| Date | Scanned | Likely Fake | Suspicious | Campaigns | New Fakes (24h) |
|---|---|---|---|---|---|
| 2026-05-17 | 8015 | 831 | 5709 | 82 | 831 |
| Repo | Engagers | Likely Fake | Fakeness % | Campaigns |
|---|---|---|---|---|
| Carolina313876/Quantum-Vanity-Address-Forge | 185 | 185 | 100.0% | 1 |
| tonylinden54/palisade-security-nexus | 185 | 185 | 100.0% | 1 |
| johanwolfaardt-ctrl/Account-Symphony-Dashboard | 185 | 185 | 100.0% | 1 |
| keerthanapranesh/Claude-Code-Swarm-Toolkit | 185 | 185 | 100.0% | 1 |
| psyicarus/quizlet-match-whisper | 185 | 185 | 100.0% | 1 |
| yanilsa09cabrera-jpg/soundboard-studio-pro | 185 | 185 | 100.0% | 1 |
| ogaawin/Draft-2026-CAD-Workspace | 185 | 185 | 100.0% | 1 |
| nanasalgadas1000-cell/seraph-nuke-inferno | 185 | 185 | 100.0% | 1 |
| 23k65A1408/Create-Aeronautics-Skywards | 185 | 185 | 100.0% | 1 |
| shritanu16007-ctrl/Delta-Executor-Next-Gen | 185 | 185 | 100.0% | 1 |
| 8015238355/mm2-analytics-dashboard-2026 | 185 | 185 | 100.0% | 1 |
| johnicassere/lab-rat-race | 185 | 185 | 100.0% | 1 |
| NazmulHudha/office-automation-toolkit | 185 | 185 | 100.0% | 1 |
| e7137768-stack/Extreme-DAW-Beat-Forge-2026 | 185 | 185 | 100.0% | 1 |
| jonathanngaboyeka/rust-movement-optimizer | 185 | 185 | 100.0% | 1 |
| wilmer-afk/Apex-Injector | 185 | 185 | 100.0% | 1 |
| husammuhayman/homm-legacy-lore-tome | 185 | 185 | 100.0% | 1 |
| ImanFahrel/joystick-canvas | 185 | 185 | 100.0% | 1 |
| ipinputra/GPT-Image-2-Unlocked-API-Toolkit | 185 | 185 | 100.0% | 1 |
| MHuy9911/Game-Network-Turbo-Chamber | 185 | 185 | 100.0% | 1 |
| LindyNongmaithem/guild-inflator-plus | 185 | 185 | 100.0% | 1 |
| Khanhhayho-spec/jetbrains-enhancement-kit | 185 | 185 | 100.0% | 1 |
| Aryanzzzz25/f95-zone-sync-manager | 185 | 185 | 100.0% | 1 |
| JaideepN07/Crosshair-Studio-Engine | 185 | 185 | 100.0% | 1 |
| ujan007/3dsmax-2027-studio-workflow | 185 | 185 | 100.0% | 1 |
Each account receives a composite suspicion score (0.0 = clean, 1.0 = likely fake) from four signals:
| Signal | Weight | Measurement |
|---|---|---|
| Account age | 35% | < 2 days → 1.00 · < 7 days → 0.90 · < 30 days → 0.55 · < 90 days → 0.20 · older → 0.00 |
| Profile completeness | 30% | Points for: no bio (+0.25), no location (+0.15), no company (+0.10), zero followers (+0.30), zero following (+0.10), bot-pattern username (+0.20) |
| Repository pattern | 25% | Zero repos → 0.90 · all repos are forks → 0.80 · >85% fork ratio → 0.55 |
| Activity history | 10% | Accounts >14 days old with zero repos + zero social graph → 0.80 (ghost accounts). Zero repos only → 0.60. All-forks + no social graph → 0.50 |
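The first two rows of the table can be sketched in code. This is an illustration under stated assumptions: age is measured from account creation to the engagement time, and the profile points (which sum to 1.10 when every condition holds) are capped at 1.0. Function and parameter names are hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional


def age_score(created_at: datetime, engaged_at: datetime) -> float:
    """Map account age at engagement time onto the bands from the table."""
    age_days = (engaged_at - created_at).total_seconds() / 86400
    if age_days < 2:
        return 1.00
    if age_days < 7:
        return 0.90
    if age_days < 30:
        return 0.55
    if age_days < 90:
        return 0.20
    return 0.00


def profile_score(bio: Optional[str], location: Optional[str], company: Optional[str],
                  followers: int, following: int, bot_like_username: bool) -> float:
    """Sum the profile-completeness points; the cap at 1.0 is an assumption."""
    points = 0.0
    points += 0.25 if not bio else 0.0
    points += 0.15 if not location else 0.0
    points += 0.10 if not company else 0.0
    points += 0.30 if followers == 0 else 0.0
    points += 0.10 if following == 0 else 0.0
    points += 0.20 if bot_like_username else 0.0
    return min(round(points, 2), 1.0)
```

An account created on 2026-05-15 and engaging on 2026-05-17 is exactly two days old, so it falls in the "< 7 days" band and scores 0.90 on age.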
Classification thresholds:
| Score | Classification |
|---|---|
| ≥ 0.75 | likely_fake |
| ≥ 0.45 | suspicious |
| < 0.45 | clean (not stored) |
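Combining the four signals is a weighted sum followed by a threshold lookup. The sketch below uses the weights and thresholds from the tables above; the dictionary keys, function names, and rounding to three places are assumptions.

```python
from typing import Dict

# Weights from the signal table; they sum to 1.0.
WEIGHTS: Dict[str, float] = {
    "account_age": 0.35,
    "profile": 0.30,
    "repo_pattern": 0.25,
    "activity": 0.10,
}


def composite_score(scores: Dict[str, float]) -> float:
    """Weighted sum of the four per-signal scores, each expected in [0, 1]."""
    return round(sum(WEIGHTS[name] * scores[name] for name in WEIGHTS), 3)


def classify(composite: float) -> str:
    """Apply the classification thresholds from the table."""
    if composite >= 0.75:
        return "likely_fake"
    if composite >= 0.45:
        return "suspicious"
    return "clean"
```

For example, sub-scores of 1.0 (age), 0.8 (profile), 0.9 (repository pattern), and 0.6 (activity) give 0.35 + 0.24 + 0.225 + 0.06 = 0.875, which classifies as `likely_fake`.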
A campaign is a group of ≥ 4 suspicious accounts that all engaged with the same repo within a 3-hour window. The algorithm uses union-find to build connected components; accounts that co-engaged within the window are merged, and any component above the minimum size is flagged as a coordinated campaign.
Campaign IDs are stable SHA-256 fingerprints of the sorted member set. The same campaign detected on consecutive days will have the same ID as long as membership is unchanged.
Why campaigns are the real signal: Individual scores have meaningful false positive rates. A new developer with a sparse profile can score 0.80 alone. Forty accounts all scoring 0.75+, created within the same week, all starring the same repo within 90 minutes, is not a coincidence. The campaign signal is where the data becomes actionable: the difference between a suspicious data point and evidence of a coordinated operation.
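One way to realise the clustering step is sketched below. It assumes the input is already limited to suspicious accounts for a single repo, and that two accounts are merged when their engagement times are at most 3 hours apart. Note that chaining merges this way can produce a component whose total span exceeds 3 hours; whether the project bounds the span is not stated. All names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

WINDOW = timedelta(hours=3)
MIN_SIZE = 4


def find(parent: Dict[str, str], x: str) -> str:
    """Return the root of x, halving the path as we walk up."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def detect_campaigns(events: List[Tuple[str, datetime]]) -> List[List[str]]:
    """Group (login, engaged_at) pairs for one repo into campaigns."""
    ordered = sorted(events, key=lambda e: e[1])
    parent = {login: login for login, _ in ordered}
    # After sorting, linking neighbours within the window yields the same
    # components as linking every pair within the window.
    for (a, ta), (b, tb) in zip(ordered, ordered[1:]):
        if tb - ta <= WINDOW:
            parent[find(parent, a)] = find(parent, b)
    groups = defaultdict(set)
    for login in parent:
        groups[find(parent, login)].add(login)
    return [sorted(g) for g in groups.values() if len(g) >= MIN_SIZE]
```

Four accounts starring within an hour of each other form one campaign; a fifth account engaging ten hours later stays outside it.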
All findings are committed to data/suspects.jsonl and data/repos.jsonl, one JSON record per line, append-only. The GitHub Actions job summary (visible in the Actions UI after each run) provides a formatted per-scan report.
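An append-only JSONL ledger needs very little code. The helpers below are a sketch of the idea, not the project's `storage.py`; the function names and the choice to sort keys are assumptions.

```python
import json
from pathlib import Path
from typing import Iterable, List, Union


def append_records(path: Union[str, Path], records: Iterable[dict]) -> None:
    """Append each record as one JSON object per line; existing lines are never rewritten."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, sort_keys=True) + "\n")


def read_records(path: Union[str, Path]) -> List[dict]:
    """Load every record from the ledger, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Opening the file in append mode means each daily run only adds lines, so the git history of the ledger shows exactly what each scan contributed.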
suspects.jsonl holds one record per flagged account per scan:
{
  "login": "user98432",
  "account_age_score": 0.9,
  "profile_score": 0.8,
  "repo_pattern_score": 0.8,
  "activity_score": 0.85,
  "composite": 0.842,
  "classification": "likely_fake",
  "campaign_id": "c-a3f9b2e1",
  "scan_date": "2026-05-17",
  "account_created_at": "2026-05-15",
  "target_repos": ["owner/repo-a", "owner/repo-b"]
}
repos.jsonl holds one record per targeted repo per scan:
{
  "full_name": "owner/suspicious-repo",
  "total_scanned": 87,
  "likely_fake": 62,
  "suspicious": 18,
  "fakeness_ratio": 0.713,
  "classification": "likely_fake",
  "campaign_count": 3,
  "scan_date": "2026-05-17"
}
Query examples:
# All likely_fake accounts from today
jq 'select(.scan_date == "2026-05-17" and .classification == "likely_fake") | .login' data/suspects.jsonl
# Accounts created in the last 3 days that were flagged
jq 'select(.account_created_at >= "2026-05-14") | [.login, .account_created_at, .classification] | @tsv' -r data/suspects.jsonl
# Which repos were targeted today, sorted by fakeness ratio
jq 'select(.scan_date == "2026-05-17") | [.full_name, .fakeness_ratio, .likely_fake] | @tsv' -r data/repos.jsonl | sort -t$'\t' -k2 -rn
# All members of a specific campaign
jq 'select(.campaign_id == "c-a3f9b2e1") | [.login, .account_created_at, .composite] | @tsv' -r data/suspects.jsonl
# Repos a specific account targeted
jq 'select(.login == "user98432") | .target_repos[]' data/suspects.jsonl
# High-confidence repos: fakeness ratio above 60%
jq 'select(.fakeness_ratio >= 0.6) | [.full_name, .fakeness_ratio, .campaign_count] | @tsv' -r data/repos.jsonl | sort -t$'\t' -k2 -rn
Your fork owns the data. Results are committed back to data/suspects.jsonl and data/repos.jsonl on your fork after every daily run.
Create a classic Personal Access Token with scopes:
- public_repo: read public repo events and stargazers, create issues on public repos
- read:user: fetch user profiles via GraphQL
Settings → Secrets and variables → Actions → New repository secret → name it GH_TOKEN.
The default GITHUB_TOKEN has restricted rate limits and cannot call the user GraphQL endpoint at full capacity. A PAT is required.
Actions → Enable GitHub Actions on your fork. The workflow runs at 07:00 UTC daily (after GitHub resets the trending page). Manual trigger available via Actions → Daily Phantom Stars Scan → Run workflow.
After each run, the formatted scan report is visible in Actions → [run] → Summary.
git clone https://github.com/YOUR_USERNAME/phantomstars.git
cd phantomstars
python -m venv venv && source venv/bin/activate
pip install -e .
GH_TOKEN=ghp_your_token python -m phantomstars.main
phantomstars/
├── .github/
│   ├── workflows/daily-scan.yml    # Cron: 07:00 UTC, free on public repos
│   └── ISSUE_TEMPLATE/false_positive.yml
├── src/phantomstars/
│   ├── config.py           # All constants, no argparse, no env parsing
│   ├── models.py           # Frozen dataclasses
│   ├── github_client.py    # REST + GraphQL, tenacity retries, rate-limit aware
│   ├── heuristics.py       # Per-user composite scoring engine
│   ├── campaigns.py        # Timestamp clustering + union-find
│   ├── storage.py          # JSONL append + query helpers
│   ├── reporter.py         # README dashboard injector
│   ├── notifier.py         # GitHub Issues notifier (files on targeted repos)
│   └── main.py             # Orchestration entry point
├── tests/
│   ├── conftest.py
│   ├── test_heuristics.py
│   └── test_campaigns.py
├── data/
│   ├── suspects.jsonl      # Append-only account findings ledger
│   ├── repos.jsonl         # Append-only per-repo intelligence
│   └── allowlist.txt       # Accounts excluded from future scans
└── pyproject.toml
- Events API cap: maximum 300 recent events per repo. Repos with thousands of stars in a day have partial coverage.
- Search index lag: GitHub's search index is eventually consistent. Repos created seconds before the scan boundary may be missed.
- Heuristic drift: Bot operators adapt. Score weights may require periodic tuning; adjust constants in config.py.
- Individual false positives: A new developer with a sparse profile scores 0.75+ in isolation. Campaign membership is the high-confidence signal.
- Campaign ID drift: If a bot farm's membership changes between scans (bots suspended, new bots added), the campaign ID changes. This reflects actual campaign evolution, not a bug.
- Rate limits: 5,000 API requests/hour on an authenticated PAT. Well within limits for standard trending page sizes.
- Issues disabled: Some targeted repos disable issues. Notifications for those repos are skipped silently.
If your account appears in data/suspects.jsonl and you believe it is incorrectly classified:
- Find your entry: jq 'select(.login == "YOUR_LOGIN")' data/suspects.jsonl
- Open a false positive issue with your login, classification, scan date, and explanation
- Reports are reviewed manually. Verified false positives are added to data/allowlist.txt and excluded from all future scans.
Note: opening an issue does not modify or remove any existing data. The suspects ledger is append-only. The allowlist only affects future scans.
pip install -e ".[dev]"
python -m black .
python -m ruff check .
python -m mypy src
python -m pytest
All four checks must pass before opening a PR.
This tool performs read-only analysis of public GitHub data using the official GitHub API. Where issues are filed on targeted repositories, they contain probabilistic findings and are clearly labelled as automated. Findings are indicators, not accusations. False positives exist and are expected.
Built with AI as a coding partner, in response to an ecosystem problem created in part by AI.
Apache 2.0. See LICENSE
Built by tg12 · GitHub
A JS Labs project · AI Slop Intelligence Dashboards