tg12/phantomstars

Python 3.13 · Apache 2.0 · GitHub Actions · Daily

phantomstars

Automated detection and tracking of fake engagement on GitHub

A JS Labs project — part of the AI Slop Intelligence initiative.
Runs every day. Scores every suspicious account. Detects coordinated bot campaigns.
Files issues directly on compromised repos so maintainers can act.


Support this project

BTC   3QjWqhQbHdHgWeYHTpmorP8Pe1wgDjJy54
ETH   0x5851e6145F4773d1585b8686095FB16E368a4dA1
ZEC   t1KSR5YkNPbjqRSCoLKo5AddFWdm9Kzxh1B


Why this exists

GitHub stars are a trust signal. They are how developers decide what to evaluate, what to depend on, and what to recommend. That signal is being systematically corrupted.

During the AI boom of 2024-2026, an industry of bot farms emerged to manufacture credibility for low-quality, often malicious repositories. A project with 800 stars in 48 hours reads as legitimate to a developer scanning search results. That's the point. The goal of fake engagement isn't the stars themselves; it's the social proof those stars produce, and the downstream decisions that social proof influences.

The pattern is identifiable. Accounts created the same week, no bio, no followers, no original repositories, starring the same 15 repos within a 2-hour window. Not one campaign, but dozens running simultaneously, every day, across thousands of accounts. The data shows repos where 185 out of 185 engagers are classified as likely fake. A 100% fakeness ratio. Entire trending placements built on nothing.

phantomstars was built because this problem is tractable. The signal-to-noise ratio in GitHub's public API is, for now, still high enough that coordinated campaigns leave clear fingerprints. This project reads those fingerprints, publishes the raw data, and notifies affected repository maintainers directly.

This is part of the broader AI Slop Intelligence work at JS Labs, ongoing research into the mechanics and measurable effects of low-quality AI-generated content flooding developer ecosystems. Fake engagement isn't a peripheral issue. It's the distribution mechanism that gets slop in front of real users.


What it does

phantomstars runs a daily GitHub Actions job that:

  1. Scrapes the GitHub Trending page for repos gaining stars today
  2. Queries the GitHub Search API for repos created in the last 7 days with sudden star activity (the wider window catches multi-day campaigns missed by 24h-only scans)
  3. Pulls recent engagement events (stars, forks) via the Events API (last 24 hours per repo)
  4. Fetches the full profile of every engaging account via GraphQL: account creation date, follower/following counts, bio, repo history
  5. Scores every account against a composite heuristics model: account age, profile completeness, repository patterns, and activity history
  6. Detects coordinated campaigns using timestamp clustering and union-find: clusters of suspicious accounts that engaged within a 3-hour window
  7. Appends all suspects to an append-only JSONL ledger committed back to this repo
  8. Publishes a per-repo intelligence feed showing which repos are being targeted and at what fakeness ratio
  9. Files GitHub issues directly on targeted repos so maintainers see the campaign data in their own issue tracker
  10. Writes a formatted scan report to the GitHub Actions job summary
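
As an illustration of step 2, here is a minimal sketch of how the Search API request could be built. The `created:` and `stars:` qualifiers and the `/search/repositories` endpoint are standard GitHub search features; the function name and the `min_stars` cutoff are assumptions for this example, since the README does not state the real threshold.

```python
from datetime import date, timedelta
from urllib.parse import urlencode

SEARCH_URL = "https://api.github.com/search/repositories"


def build_search_url(today: date, window_days: int = 7, min_stars: int = 50) -> str:
    """Return a Search API URL for repos created inside the scan window.

    min_stars is a hypothetical cutoff; the project's real value is not
    given in this README.
    """
    since = today - timedelta(days=window_days)
    query = f"created:>={since.isoformat()} stars:>={min_stars}"
    params = {"q": query, "sort": "stars", "order": "desc", "per_page": 100}
    return f"{SEARCH_URL}?{urlencode(params)}"


url = build_search_url(date(2026, 5, 17))
print(url)
```

The 7-day default mirrors the window described in step 2. A star-count qualifier is only a rough proxy for "sudden star activity"; the actual scan may use a different signal.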

No servers. No databases. No infrastructure bill.


Frequently asked questions

Does it notify the targeted repo?

Yes. When a repo's fakeness ratio exceeds 40% or a coordinated campaign is detected, phantomstars opens an issue directly on that repository. The issue contains the full suspect table, campaign membership, composite scores, and account creation dates: everything a maintainer needs to investigate and report to GitHub.

If issues are disabled on a targeted repo, the notification is skipped without raising an error, and the skip is recorded in the scan log.
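
The notification rule above can be sketched as a small decision function. This is not the code in notifier.py; the function name and return values are hypothetical. `has_issues` corresponds to the field of the same name on GitHub's REST repository object.

```python
FAKENESS_THRESHOLD = 0.40  # "exceeds 40%" in the answer above


def notification_action(
    fakeness_ratio: float, campaign_count: int, has_issues: bool
) -> str:
    """Decide what to do for one scanned repo.

    Returns "none", "file_issue", or "skip_and_log" (hypothetical labels).
    """
    triggered = fakeness_ratio > FAKENESS_THRESHOLD or campaign_count > 0
    if not triggered:
        return "none"
    # Issues disabled: nothing is filed, but the skip is still recorded.
    return "file_issue" if has_issues else "skip_and_log"
```

Either condition alone triggers a notification, so a repo with a low fakeness ratio but one detected campaign still gets an issue.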

Can I report a false positive?

Yes. If your account appears in data/suspects.jsonl and you believe the classification is incorrect, open a false positive issue using the provided template. Reports are reviewed manually before any allowlist addition. The allowlist is stored in data/allowlist.txt; accounts listed there are excluded from all future scans and from the suspects ledger.

What is the campaign ID?

A campaign ID (e.g. c-a3f9b2e1) is a deterministic 8-character hex fingerprint derived from the SHA-256 hash of the sorted set of member logins in that campaign. The same group of accounts will produce the same campaign ID across independent scan runs, enabling longitudinal tracking. It is not a repo name, a username, or any external identifier.

Stability: the ID is stable as long as the campaign's member set is unchanged. If bots are added or suspended between scans, the ID changes because the membership changed. This is expected and reflects real-world drift in bot farm composition.
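
A minimal sketch of how such a fingerprint could be derived with the standard library. The README confirms SHA-256, a sorted member set, 8 hex characters, and the `c-` prefix seen in the example. The newline separator and the lowercasing of logins are assumptions made for this illustration.

```python
import hashlib
from collections.abc import Iterable


def campaign_id(member_logins: Iterable[str]) -> str:
    """Return a deterministic ID such as 'c-xxxxxxxx' for a member set."""
    # Assumption: logins are lowercased (GitHub logins are case-insensitive)
    # and de-duplicated before sorting.
    members = sorted({login.lower() for login in member_logins})
    # Assumption: members are joined with a newline before hashing.
    digest = hashlib.sha256("\n".join(members).encode("utf-8")).hexdigest()
    return f"c-{digest[:8]}"
```

Because the input is sorted first, the order in which accounts are discovered does not affect the ID. Adding or removing a single login changes the hash input and therefore the ID, which is the drift described above.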

Does it check account creation dates?

Yes. Every account's creation date is fetched from the GitHub GraphQL API (createdAt field) and stored in each suspect record as account_created_at. It's also the primary input to the account age score, the strongest single signal for fake accounts. Accounts created within 2 days of engaging score 1.0 on age alone.
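
For illustration, a profile lookup of this kind could be expressed as one GraphQL document that fetches several users through aliases. The fields below (`createdAt`, `bio`, `company`, `location`, `followers`, `following`, `repositories`) exist on GitHub's GraphQL `User` object. The batching approach and the helper name are assumptions, not a description of github_client.py.

```python
import json

# Fields requested for every user; all exist on the GraphQL User object.
USER_FIELDS = """
    login
    createdAt
    bio
    company
    location
    followers { totalCount }
    following { totalCount }
    repositories { totalCount }
"""


def build_profile_query(logins: list[str]) -> str:
    """Build one GraphQL query that fetches several users via aliases."""
    parts = []
    for index, login in enumerate(logins):
        # json.dumps produces a quoted, escaped string literal, which is
        # also valid GraphQL string syntax for ordinary login characters.
        parts.append(
            f"  u{index}: user(login: {json.dumps(login)}) {{{USER_FIELDS}  }}"
        )
    return "query {\n" + "\n".join(parts) + "\n}"
```

Each alias (`u0`, `u1`, ...) comes back as its own key in the response, so one request can cover a batch of accounts. `createdAt` is returned as an ISO 8601 timestamp.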

How confident is it?

Individual scores carry meaningful false positive rates. A new developer with a sparse profile can legitimately score 0.75 or higher. The tool accounts for this by requiring aggregate evidence before filing issues (a repo-level fakeness ratio above 40% or a detected campaign); a single suspicious account is not enough. A coordinated cluster of 40+ accounts, all created the same week, all scoring 0.75+, all engaging within 90 minutes, is a different matter. That's where confidence becomes actionable.

The data is always probabilistic. The issue bodies say so explicitly. The goal is to give maintainers the signal and the raw evidence to make their own judgement.


Live dashboard

| Date | Scanned | Likely Fake | Suspicious | Campaigns | New Fakes (24h) |
|---|---|---|---|---|---|
| 2026-05-17 | 8015 | 831 | 5709 | 82 | 831 |

Today's most-targeted repos

| Repo | Engagers | Likely Fake | Fakeness % | Campaigns |
|---|---|---|---|---|
| Carolina313876/Quantum-Vanity-Address-Forge | 185 | 185 | 100.0% | 1 |
| tonylinden54/palisade-security-nexus | 185 | 185 | 100.0% | 1 |
| johanwolfaardt-ctrl/Account-Symphony-Dashboard | 185 | 185 | 100.0% | 1 |
| keerthanapranesh/Claude-Code-Swarm-Toolkit | 185 | 185 | 100.0% | 1 |
| psyicarus/quizlet-match-whisper | 185 | 185 | 100.0% | 1 |
| yanilsa09cabrera-jpg/soundboard-studio-pro | 185 | 185 | 100.0% | 1 |
| ogaawin/Draft-2026-CAD-Workspace | 185 | 185 | 100.0% | 1 |
| nanasalgadas1000-cell/seraph-nuke-inferno | 185 | 185 | 100.0% | 1 |
| 23k65A1408/Create-Aeronautics-Skywards | 185 | 185 | 100.0% | 1 |
| shritanu16007-ctrl/Delta-Executor-Next-Gen | 185 | 185 | 100.0% | 1 |
| 8015238355/mm2-analytics-dashboard-2026 | 185 | 185 | 100.0% | 1 |
| johnicassere/lab-rat-race | 185 | 185 | 100.0% | 1 |
| NazmulHudha/office-automation-toolkit | 185 | 185 | 100.0% | 1 |
| e7137768-stack/Extreme-DAW-Beat-Forge-2026 | 185 | 185 | 100.0% | 1 |
| jonathanngaboyeka/rust-movement-optimizer | 185 | 185 | 100.0% | 1 |
| wilmer-afk/Apex-Injector | 185 | 185 | 100.0% | 1 |
| husammuhayman/homm-legacy-lore-tome | 185 | 185 | 100.0% | 1 |
| ImanFahrel/joystick-canvas | 185 | 185 | 100.0% | 1 |
| ipinputra/GPT-Image-2-Unlocked-API-Toolkit | 185 | 185 | 100.0% | 1 |
| MHuy9911/Game-Network-Turbo-Chamber | 185 | 185 | 100.0% | 1 |
| LindyNongmaithem/guild-inflator-plus | 185 | 185 | 100.0% | 1 |
| Khanhhayho-spec/jetbrains-enhancement-kit | 185 | 185 | 100.0% | 1 |
| Aryanzzzz25/f95-zone-sync-manager | 185 | 185 | 100.0% | 1 |
| JaideepN07/Crosshair-Studio-Engine | 185 | 185 | 100.0% | 1 |
| ujan007/3dsmax-2027-studio-workflow | 185 | 185 | 100.0% | 1 |


Scoring model

Each account receives a composite suspicion score (0.0 = clean, 1.0 = likely fake) from four signals:

| Signal | Weight | Measurement |
|---|---|---|
| Account age | 35% | < 2 days → 1.00 · < 7 days → 0.90 · < 30 days → 0.55 · < 90 days → 0.20 · older → 0.00 |
| Profile completeness | 30% | Points for: no bio (+0.25), no location (+0.15), no company (+0.10), zero followers (+0.30), zero following (+0.10), bot-pattern username (+0.20) |
| Repository pattern | 25% | Zero repos → 0.90 · all repos are forks → 0.80 · >85% fork ratio → 0.55 |
| Activity history | 10% | Accounts >14 days old with zero repos + zero social graph → 0.80 (ghost accounts). Zero repos only → 0.60. All-forks + no social graph → 0.50 |
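
The table can be read as a weighted sum of per-signal scores. The sketch below implements the age and profile rows plus the weighting, under two assumptions that the README does not confirm: the composite is a plain weighted sum, and the profile points (which add up to 1.10) are capped at 1.0. Function and parameter names are illustrative, not taken from heuristics.py.

```python
WEIGHTS = {"age": 0.35, "profile": 0.30, "repos": 0.25, "activity": 0.10}


def age_score(age_days: float) -> float:
    """Step function from the 'Account age' row."""
    for limit, score in ((2, 1.00), (7, 0.90), (30, 0.55), (90, 0.20)):
        if age_days < limit:
            return score
    return 0.00


def profile_score(
    *,
    has_bio: bool,
    has_location: bool,
    has_company: bool,
    followers: int,
    following: int,
    bot_like_username: bool,
) -> float:
    """Sum the penalty points from the 'Profile completeness' row."""
    points = 0.0
    points += 0.0 if has_bio else 0.25
    points += 0.0 if has_location else 0.15
    points += 0.0 if has_company else 0.10
    points += 0.30 if followers == 0 else 0.0
    points += 0.10 if following == 0 else 0.0
    points += 0.20 if bot_like_username else 0.0
    # Assumption: the listed points sum to 1.10, so the total is capped.
    return min(points, 1.0)


def composite_score(age: float, profile: float, repos: float, activity: float) -> float:
    """Assumed combination rule: a plain weighted sum of the four signals."""
    return (
        WEIGHTS["age"] * age
        + WEIGHTS["profile"] * profile
        + WEIGHTS["repos"] * repos
        + WEIGHTS["activity"] * activity
    )
```

With the sub-scores 0.9, 0.8, 0.8 and 0.85, this weighted sum gives 0.84. The sample suspects.jsonl record in the Data format section shows 0.842 for the same inputs, so the real engine may apply adjustments that are not described in this README.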

Classification thresholds:

| Score | Classification |
|---|---|
| ≥ 0.75 | likely_fake |
| ≥ 0.45 | suspicious |
| < 0.45 | clean (not stored) |
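
A minimal sketch of the threshold mapping. The cutoffs come from the table above; the function names are illustrative.

```python
LIKELY_FAKE_THRESHOLD = 0.75
SUSPICIOUS_THRESHOLD = 0.45


def classify(composite: float) -> str:
    """Map a composite score to one of the three classifications."""
    if composite >= LIKELY_FAKE_THRESHOLD:
        return "likely_fake"
    if composite >= SUSPICIOUS_THRESHOLD:
        return "suspicious"
    return "clean"


def should_store(composite: float) -> bool:
    """Clean accounts are not written to the ledger."""
    return classify(composite) != "clean"
```

Both boundaries are inclusive, so a score of exactly 0.75 is likely_fake and exactly 0.45 is suspicious.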

Campaign detection

A campaign is a group of ≥ 4 suspicious accounts that all engaged with the same repo within a 3-hour window. The algorithm uses union-find to build connected components; accounts that co-engaged within the window are merged, and any component above the minimum size is flagged as a coordinated campaign.
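
The clustering step can be sketched as follows for a single repo. This is an illustration of union-find over timestamps, not the code in campaigns.py. It assumes the input is already limited to suspicious accounts, and that merging neighbours in time order is acceptable (after sorting, two events within the window imply every event between them is also within the window, so this yields the same components as pairwise merging).

```python
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(hours=3)
MIN_CAMPAIGN_SIZE = 4


class UnionFind:
    """Disjoint-set structure keyed by login, with path compression."""

    def __init__(self) -> None:
        self.parent: dict[str, str] = {}

    def find(self, item: str) -> str:
        self.parent.setdefault(item, item)
        root = item
        while self.parent[root] != root:
            root = self.parent[root]
        # Point every node on the walked path directly at the root.
        while self.parent[item] != root:
            self.parent[item], item = root, self.parent[item]
        return root

    def union(self, a: str, b: str) -> None:
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def detect_campaigns(events: list[tuple[str, datetime]]) -> list[set[str]]:
    """events: (login, engaged_at) pairs for suspicious accounts on ONE repo."""
    ordered = sorted(events, key=lambda event: event[1])
    uf = UnionFind()
    for login, _ in ordered:
        uf.find(login)  # register every login, even isolated ones
    for (prev_login, prev_time), (login, time) in zip(ordered, ordered[1:]):
        if time - prev_time <= WINDOW:
            uf.union(prev_login, login)
    groups: dict[str, set[str]] = defaultdict(set)
    for login, _ in ordered:
        groups[uf.find(login)].add(login)
    return [members for members in groups.values() if len(members) >= MIN_CAMPAIGN_SIZE]
```

One consequence of building connected components this way is that a long chain of closely spaced events can form a single component spanning more than three hours end to end. Whether the project bounds the total span is not stated here.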

Campaign IDs are stable SHA-256 fingerprints of the sorted member set. The same campaign detected on consecutive days will have the same ID as long as membership is unchanged.

Why campaigns are the real signal: Individual scores have meaningful false positive rates. A new developer with a sparse profile can score 0.80 alone. Forty accounts all scoring 0.75+, created within the same week, all starring the same repo within 90 minutes, is not a coincidence. The campaign signal is where the data becomes actionable: the difference between a suspicious data point and evidence of a coordinated operation.


Data format

All findings are committed to data/suspects.jsonl and data/repos.jsonl, one JSON record per line, append-only. The GitHub Actions job summary (visible in the Actions UI after each run) provides a formatted per-scan report.
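
As a rough sketch of the append-only pattern, each scan serialises its records to one JSON object per line and opens the ledger in append mode, so earlier lines are never rewritten. The helper names are hypothetical and are not taken from storage.py.

```python
import json
from pathlib import Path


def to_jsonl(records: list[dict]) -> str:
    """Serialise records as one compact JSON object per line."""
    return "".join(json.dumps(record, separators=(",", ":")) + "\n" for record in records)


def append_records(path: Path, records: list[dict]) -> int:
    """Append records to a JSONL ledger and return how many were written."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Mode "a" only ever adds to the end of the file.
    with path.open("a", encoding="utf-8") as handle:
        handle.write(to_jsonl(records))
    return len(records)
```

Because every line is a complete JSON document, the file stays valid after each append and can be filtered line by line with tools such as jq.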

suspects.jsonl — one record per flagged account per scan:

{
  "login": "user98432",
  "account_age_score": 0.9,
  "profile_score": 0.8,
  "repo_pattern_score": 0.8,
  "activity_score": 0.85,
  "composite": 0.842,
  "classification": "likely_fake",
  "campaign_id": "c-a3f9b2e1",
  "scan_date": "2026-05-17",
  "account_created_at": "2026-05-15",
  "target_repos": ["owner/repo-a", "owner/repo-b"]
}

repos.jsonl — one record per targeted repo per scan:

{
  "full_name": "owner/suspicious-repo",
  "total_scanned": 87,
  "likely_fake": 62,
  "suspicious": 18,
  "fakeness_ratio": 0.713,
  "classification": "likely_fake",
  "campaign_count": 3,
  "scan_date": "2026-05-17"
}
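
In the sample record, `fakeness_ratio` is consistent with `likely_fake / total_scanned` (62 / 87 ≈ 0.713). A minimal sketch under that assumption follows; the rounding to three decimals and the zero-engager fallback are also assumptions.

```python
def fakeness_ratio(likely_fake: int, total_scanned: int) -> float:
    """Share of scanned engagers classified as likely_fake."""
    if total_scanned == 0:
        return 0.0  # assumption: a repo with no scanned engagers has ratio 0
    return round(likely_fake / total_scanned, 3)
```

Under this reading, accounts classified as suspicious (18 in the sample) do not count toward the ratio.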

Query examples:

# All likely_fake accounts from today
jq 'select(.scan_date == "2026-05-17" and .classification == "likely_fake") | .login' data/suspects.jsonl

# Accounts created in the last 3 days that were flagged
jq 'select(.account_created_at >= "2026-05-14") | [.login, .account_created_at, .classification] | @tsv' -r data/suspects.jsonl

# Which repos were targeted today, sorted by fakeness ratio
jq 'select(.scan_date == "2026-05-17") | [.full_name, .fakeness_ratio, .likely_fake] | @tsv' -r data/repos.jsonl | sort -t$'\t' -k2 -rn

# All members of a specific campaign
jq 'select(.campaign_id == "c-a3f9b2e1") | [.login, .account_created_at, .composite] | @tsv' -r data/suspects.jsonl

# Repos a specific account targeted
jq 'select(.login == "user98432") | .target_repos[]' data/suspects.jsonl

# High-confidence repos: fakeness ratio above 60%
jq 'select(.fakeness_ratio >= 0.6) | [.full_name, .fakeness_ratio, .campaign_count] | @tsv' -r data/repos.jsonl | sort -t$'\t' -k2 -rn
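
For readers who prefer Python to jq, two of the queries above can be sketched with the standard library. The helper names are illustrative and are not the query helpers in storage.py.

```python
import json
from collections.abc import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def likely_fake_logins(lines: Iterable[str], scan_date: str) -> list[str]:
    """Logins classified likely_fake on the given scan date."""
    return [
        record["login"]
        for record in iter_records(lines)
        if record.get("scan_date") == scan_date
        and record.get("classification") == "likely_fake"
    ]


def campaign_members(lines: Iterable[str], campaign_id: str) -> list[str]:
    """Logins recorded as members of one campaign."""
    return [
        record["login"]
        for record in iter_records(lines)
        if record.get("campaign_id") == campaign_id
    ]
```

The functions accept any iterable of lines, so an open file handle for data/suspects.jsonl can be passed directly, since file objects iterate line by line.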

Setup

1. Fork this repo

Your fork owns the data. Results are committed back to data/suspects.jsonl and data/repos.jsonl on your fork after every daily run.

2. Add a GitHub PAT secret

Create a classic Personal Access Token with scopes:

  • public_repo: read public repo events and stargazers, create issues on public repos
  • read:user: fetch user profiles via GraphQL

Settings → Secrets and variables → Actions → New repository secret → name it GH_TOKEN.

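The default GITHUB_TOKEN is not sufficient. It is scoped to the repository running the workflow, so it cannot open issues on other repositories, and its rate limit (1,000 requests per hour per repository) is lower than a PAT's, which constrains the profile lookups. A PAT is required.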

3. Enable Actions

Actions → Enable GitHub Actions on your fork. The workflow runs at 07:00 UTC daily (after GitHub resets the trending page). Manual trigger available via Actions → Daily Phantom Stars Scan → Run workflow.

After each run, the formatted scan report is visible in Actions → [run] → Summary.

4. Run locally

git clone https://github.com/YOUR_USERNAME/phantomstars.git
cd phantomstars
python -m venv venv && source venv/bin/activate
pip install -e .
GH_TOKEN=ghp_your_token python -m phantomstars.main

Project structure

phantomstars/
├── .github/
│   ├── workflows/daily-scan.yml       # Cron: 07:00 UTC, free on public repos
│   └── ISSUE_TEMPLATE/false_positive.yml
├── src/phantomstars/
│   ├── config.py                      # All constants, no argparse, no env parsing
│   ├── models.py                      # Frozen dataclasses
│   ├── github_client.py               # REST + GraphQL, tenacity retries, rate-limit aware
│   ├── heuristics.py                  # Per-user composite scoring engine
│   ├── campaigns.py                   # Timestamp clustering + union-find
│   ├── storage.py                     # JSONL append + query helpers
│   ├── reporter.py                    # README dashboard injector
│   ├── notifier.py                    # GitHub Issues notifier (files on targeted repos)
│   └── main.py                        # Orchestration entry point
├── tests/
│   ├── conftest.py
│   ├── test_heuristics.py
│   └── test_campaigns.py
├── data/
│   ├── suspects.jsonl                 # Append-only account findings ledger
│   ├── repos.jsonl                    # Append-only per-repo intelligence
│   └── allowlist.txt                  # Accounts excluded from future scans
└── pyproject.toml

Limitations and known failure modes

  • Events API cap: maximum 300 recent events per repo. Repos with thousands of stars in a day have partial coverage.
  • Search index lag: GitHub's search index is eventually consistent. Repos created seconds before the scan boundary may be missed.
  • Heuristic drift: Bot operators adapt. Score weights may require periodic tuning; adjust constants in config.py.
  • Individual false positives: A new developer with a sparse profile can score 0.75+ in isolation. Campaign membership is the high-confidence signal.
  • Campaign ID drift: If a bot farm's membership changes between scans (bots suspended, new bots added), the campaign ID changes. This reflects actual campaign evolution, not a bug.
  • Rate limits: 5,000 API requests/hour on an authenticated PAT. Well within limits for standard trending page sizes.
  • Issues disabled: Some targeted repos disable issues. Notifications for those repos are skipped silently.

False positive process

If your account appears in data/suspects.jsonl and you believe it is incorrectly classified:

  1. Find your entry: jq 'select(.login == "YOUR_LOGIN")' data/suspects.jsonl
  2. Open a false positive issue with your login, classification, scan date, and explanation
  3. Reports are reviewed manually. Verified false positives are added to data/allowlist.txt and excluded from all future scans.

Note: opening an issue does not modify or remove any existing data. The suspects ledger is append-only. The allowlist only affects future scans.
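
A minimal sketch of how an allowlist could be applied to future scans. The file format of data/allowlist.txt is not documented here, so this assumes one login per line with optional `#` comments. Both helper names are hypothetical.

```python
def parse_allowlist(text: str) -> set[str]:
    """Parse allowlist text. Assumed format: one login per line, '#' comments."""
    logins: set[str] = set()
    for raw_line in text.splitlines():
        entry = raw_line.split("#", 1)[0].strip().lower()
        if entry:
            logins.add(entry)
    return logins


def filter_allowed(records: list[dict], allowlist: set[str]) -> list[dict]:
    """Drop records whose login is allowlisted (case-insensitive match)."""
    return [record for record in records if record["login"].lower() not in allowlist]
```

Filtering happens before new records are written, which matches the note above: existing ledger lines stay in place and only future scans are affected.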


Contributing

pip install -e ".[dev]"
python -m black .
python -m ruff check .
python -m mypy src
python -m pytest

All four checks (black, ruff, mypy, pytest) must pass before opening a PR.


Disclaimer

This tool performs read-only analysis of public GitHub data using the official GitHub API. Where issues are filed on targeted repositories, they contain probabilistic findings and are clearly labelled as automated. Findings are indicators, not accusations. False positives exist and are expected.

Built with AI as a coding partner, in response to an ecosystem problem created in part by AI.


License

Apache 2.0. See LICENSE


Author

Built by tg12 · GitHub

A JS Labs project · AI Slop Intelligence Dashboards
