# 🐍 Semgrep SAST Scanner Demo
**Globomantics Robot Fleet Manager** | Module 5 — Third-Party Scanners & SARIF Integration

This notebook demonstrates Semgrep as a third-party SAST scanner, SARIF output inspection, and integration with GitHub's unified Security tab.

In [None]:
import subprocess
import json
import os

REPO = "timothywarner-org/globomantics-robot-fleet"
REPO_DIR = r"C:\github\globomantics-robot-fleet"

def run(cmd, cwd=REPO_DIR):
    """Run a shell command and display formatted output."""
    print(f"⚡ Running: {cmd}\n{'─' * 60}")
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True, cwd=cwd)
    if result.stdout.strip():
        print(result.stdout.strip())
    if result.stderr.strip():
        # Semgrep writes progress and scan status to stderr, so print it as well
        print(result.stderr.strip())
    print("─" * 60)
    return result

## 🔧 Pre-flight Checks

In [None]:
run("semgrep --version")
run("codeql --version")
run("gh auth status")
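
If one of these tools is missing, the resulting shell error is easy to overlook in the cell output. A minimal sketch, assuming the three executables are named `semgrep`, `codeql`, and `gh` on `PATH` (the helper name `missing_tools` is hypothetical), checks for them before any scan runs:

```python
import shutil

def missing_tools(tools=("semgrep", "codeql", "gh")):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]

missing = missing_tools()
if missing:
    print(f"Not found on PATH: {', '.join(missing)}")
else:
    print("All pre-flight tools are on PATH")
```

This only confirms the executables exist; `gh auth status` is still needed to confirm that authentication works.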

## 🔍 Full Repository Scan (JavaScript + Rust)

Two Semgrep config packs stacked:
- **`p/security-audit`** — covers JavaScript, general web patterns, and common CWEs
- **`p/rust`** — catches unsafe blocks, panics in library code, and Rust anti-patterns

Output is **SARIF 2.1.0**, the standard interchange format for static analysis results.
One scan command covers the entire polyglot repo.

In [None]:
run("semgrep scan --config p/security-audit --config p/rust --sarif --output results.sarif .")
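
Stacking packs amounts to repeating `--config`. As a sketch, a hypothetical helper `build_semgrep_cmd` can assemble the same command from a list of packs, using only the flags shown in the cell above:

```python
import shlex

def build_semgrep_cmd(configs, output, target="."):
    """Build a `semgrep scan` argument list with one --config per pack."""
    cmd = ["semgrep", "scan"]
    for config in configs:
        cmd += ["--config", config]
    cmd += ["--sarif", "--output", output, target]
    return cmd

cmd = build_semgrep_cmd(["p/security-audit", "p/rust"], "results.sarif")
print(shlex.join(cmd))
```

Passing the list form to `subprocess.run` without `shell=True` sidesteps quoting differences between Windows and POSIX shells.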

## 🦀 Rust-Only Scan

Targeting just the `rust-telemetry-cli/` directory with Rust-specific rules.

**Why this matters:** Rust has not been among CodeQL's core languages, which are
JavaScript/TypeScript, Python, Java, Kotlin, C/C++, C#, Go, Ruby, and Swift. CodeQL's
coverage changes between releases, so confirm Rust's current status in the
supported-languages list. Where CodeQL coverage is missing, you need a third-party
scanner like Semgrep, which is exactly why GHAS supports SARIF uploads from external tools.

In [None]:
run("semgrep scan --config p/rust --sarif --output rust-scan.sarif ./rust-telemetry-cli")

## 📊 SARIF Structure Inspection

SARIF 2.1.0 has a clean hierarchical structure:

```
sarif
├─ $schema / version
└─ runs[]
   ├─ tool.driver
   │  ├─ name, semanticVersion
   │  └─ rules[] (definitions of each check)
   └─ results[]
      ├─ ruleId
      ├─ message.text
      ├─ locations[].physicalLocation (file + line)
      └─ partialFingerprints (tracks alerts across commits)
```

**GH-500 Exam Tip:** `partialFingerprints` is how GitHub tracks alerts across commits —
even through file renames. The SARIF version **must be 2.1.0** for GitHub ingestion.
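
To make the hierarchy concrete, here is a hand-written SARIF skeleton as a Python dict. The tool name, rule ID, message, file path, and fingerprint value are invented for illustration and are not output from the demo repo; `check_sarif_shape` is a hypothetical helper that covers only the fields shown in the tree above, not the full schema.

```python
# Illustrative SARIF skeleton; every value below is made up for the example.
minimal_sarif = {
    "$schema": "https://json.schemastore.org/sarif-2.1.0.json",
    "version": "2.1.0",
    "runs": [{
        "tool": {"driver": {
            "name": "ExampleScanner",
            "semanticVersion": "1.0.0",
            "rules": [{"id": "example.rule-id"}],
        }},
        "results": [{
            "ruleId": "example.rule-id",
            "message": {"text": "Example finding"},
            "locations": [{"physicalLocation": {
                "artifactLocation": {"uri": "src/app.js"},
                "region": {"startLine": 12},
            }}],
            "partialFingerprints": {"exampleFingerprint/v1": "abc123"},
        }],
    }],
}

def check_sarif_shape(doc):
    """Return a list of problems found in the fields this notebook relies on."""
    problems = []
    if doc.get("version") != "2.1.0":
        problems.append(f"version is {doc.get('version')!r}, expected '2.1.0'")
    runs = doc.get("runs")
    if not isinstance(runs, list) or not runs:
        problems.append("runs[] is missing or empty")
        return problems
    for i, run_data in enumerate(runs):
        if "name" not in run_data.get("tool", {}).get("driver", {}):
            problems.append(f"runs[{i}].tool.driver.name is missing")
        if not isinstance(run_data.get("results"), list):
            problems.append(f"runs[{i}].results is missing")
    return problems

print(check_sarif_shape(minimal_sarif) or "shape looks OK")
```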

In [None]:
sarif_path = os.path.join(REPO_DIR, "results.sarif")
if os.path.exists(sarif_path):
    with open(sarif_path, 'r', encoding='utf-8') as f:
        sarif = json.load(f)

    for i, run_data in enumerate(sarif.get("runs", [])):
        tool = run_data.get("tool", {}).get("driver", {})
        results = run_data.get("results", [])
        rules = tool.get("rules", [])

        print(f"📊 Run {i+1}")
        print(f"   🔧 Tool: {tool.get('name', 'unknown')} v{tool.get('semanticVersion', '?')}")
        print(f"   📋 Rules loaded: {len(rules)}")
        print(f"   🎯 Results found: {len(results)}")
        print("─" * 60)

        if results:
            print("\n🔍 First 10 findings:")
            for r in results[:10]:
                rule_id = r.get("ruleId", "?")
                msg = r.get("message", {}).get("text", "")[:80]
                locs = r.get("locations", [])
                if locs:
                    phys = locs[0].get("physicalLocation", {})
                    file = phys.get("artifactLocation", {}).get("uri", "?")
                    line = phys.get("region", {}).get("startLine", "?")
                    print(f"   ⚠️  {rule_id}")
                    print(f"      📁 {file}:{line}")
                    print(f"      💬 {msg}")
                    print()
else:
    print("❌ results.sarif not found; run the full scan first (Cell 6)")
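
A list of the first ten findings says little about which rules dominate. As a sketch, a hypothetical `count_by_rule` helper tallies results per `ruleId` across all runs; the sample document is invented for illustration.

```python
from collections import Counter

def count_by_rule(sarif_doc):
    """Tally findings per ruleId across every run in a SARIF document."""
    counts = Counter()
    for run_data in sarif_doc.get("runs", []):
        for finding in run_data.get("results", []):
            counts[finding.get("ruleId", "?")] += 1
    return counts

# Invented sample; with real data, pass the `sarif` dict loaded in the cell above.
sample = {"runs": [{"results": [
    {"ruleId": "rule-a"}, {"ruleId": "rule-b"}, {"ruleId": "rule-a"},
]}]}

for rule_id, n in count_by_rule(sample).most_common():
    print(f"{n:>4}  {rule_id}")
```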

## 🚀 Upload SARIF to GitHub

The `codeql github upload-results` command bridges non-GitHub CI systems (Jenkins,
Azure DevOps, GitLab CI, CircleCI) with the GitHub Security tab.

The `--sarif-category` flag is **critical**: it keeps one upload from replacing another.
GitHub overwrites earlier results that share the same tool and category on a ref, so
each scanner or upload source needs its own category.

**GH-500 Exam Tip:** Categories are required when using multiple scanners. Without them,
the latest upload silently replaces previous results.
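
For SARIF produced by a third-party tool, the category can also travel inside the file itself. The sketch below assumes GitHub reads it from `runs[].automationDetails.id`, with a trailing slash marking the value as a category with no run ID; `set_category` is a hypothetical helper and the sample document is a placeholder.

```python
import json

def set_category(sarif_doc, category):
    """Stamp every run with an automationDetails.id carrying the category."""
    for run_data in sarif_doc.get("runs", []):
        details = run_data.setdefault("automationDetails", {})
        # Trailing slash: category only, no run ID (assumed GitHub convention).
        details["id"] = category.rstrip("/") + "/"
    return sarif_doc

# Placeholder document with two empty runs.
doc = {"version": "2.1.0", "runs": [{"results": []}, {"results": []}]}
set_category(doc, "semgrep-local")
print(json.dumps(doc["runs"][0]["automationDetails"]))
```

Writing the category into the file keeps it attached to the results no matter which upload path is used later.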

In [None]:
# Get current commit hash
result = subprocess.run("git rev-parse HEAD", shell=True, capture_output=True, text=True, cwd=REPO_DIR)
commit = result.stdout.strip()
print(f"📌 Current commit: {commit}\n")

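# ⚠️ This uploads to GitHub; uncomment to execute.
# Note: --sarif-category is documented for `codeql database analyze`. Confirm with
# `codeql github upload-results --help` that your CLI version accepts it here.
# run(f"codeql github upload-results --repository={REPO} --ref=refs/heads/main --commit={commit} --sarif=results.sarif --sarif-category=semgrep-local")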

print(f"🔗 View results: https://github.com/{REPO}/security/code-scanning")
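
When the CodeQL CLI is not available, the code scanning REST API (`POST /repos/{owner}/{repo}/code-scanning/sarifs`) is another route. The sketch below only prepares the request body and sends nothing. It assumes the endpoint expects the `sarif` field to be gzip-compressed and then Base64-encoded, with `commit_sha` and `ref` alongside; the helper name and sample values are hypothetical.

```python
import base64
import gzip
import json

def build_sarif_payload(sarif_text, commit_sha, ref):
    """Build the JSON body for a code scanning SARIF upload (not sent here)."""
    compressed = gzip.compress(sarif_text.encode("utf-8"))
    return {
        "commit_sha": commit_sha,
        "ref": ref,
        "sarif": base64.b64encode(compressed).decode("ascii"),
    }

# Placeholder values for illustration only.
payload = build_sarif_payload(
    json.dumps({"version": "2.1.0", "runs": []}),
    commit_sha="0" * 40,
    ref="refs/heads/main",
)
print(sorted(payload))
```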

## 📋 View Alerts in GitHub

In [None]:
# All alerts
print("📋 All Code Scanning Alerts:")
print("=" * 60)
run(f'gh api repos/{REPO}/code-scanning/alerts --jq ".[] | {{number, rule: .rule.id, tool: .tool.name, state: .state}}"')

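print("\n🐍 Semgrep Alerts Only:")
print("=" * 60)
# The reported tool name can vary by Semgrep version (for example "Semgrep OSS"),
# so match it case-insensitively instead of testing for one exact string.
run(f'gh api repos/{REPO}/code-scanning/alerts --jq ".[] | select(.tool.name | test(\\"semgrep\\"; \\"i\\")) | {{number, rule: .rule.id}}"')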
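
The same grouping can happen in Python once the JSON is in hand. A sketch, assuming the alert objects carry the `number`, `rule.id`, `tool.name`, and `state` fields used in the `--jq` expressions above; the sample alerts are invented and `alerts_by_tool` is a hypothetical helper.

```python
import json
from collections import defaultdict

def alerts_by_tool(alerts):
    """Group alert numbers by the name of the tool that raised them."""
    grouped = defaultdict(list)
    for alert in alerts:
        tool_name = alert.get("tool", {}).get("name", "unknown")
        grouped[tool_name].append(alert.get("number"))
    return dict(grouped)

# Invented sample standing in for the JSON that `gh api` returns.
sample_json = """[
  {"number": 1, "rule": {"id": "rule-a"}, "tool": {"name": "CodeQL"}, "state": "open"},
  {"number": 2, "rule": {"id": "rule-b"}, "tool": {"name": "Semgrep"}, "state": "open"},
  {"number": 3, "rule": {"id": "rule-c"}, "tool": {"name": "Semgrep"}, "state": "dismissed"}
]"""

for tool_name, numbers in alerts_by_tool(json.loads(sample_json)).items():
    print(f"{tool_name}: {numbers}")
```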

## 💡 GH-500 Exam Quick Reference

| Topic | Key Point |
|-------|----------|
| SARIF upload action | `github/codeql-action/upload-sarif@v4` (NOT `codeql-action/analyze`) |
| Permission required | `security-events: write` — 403 without it |
| Categories | REQUIRED for multiple scanners — prevents result overwrites |
| Autofix scope | Works on **CodeQL alerts only**, not third-party results |
| SARIF version | Must be **2.1.0** |
| Fingerprints | `partialFingerprints` track alerts across commits |
| Semgrep coverage | 30+ languages including Rust (historically a CodeQL gap; confirm current CodeQL support) |