Skip to content

sqlcss/sqlcsi

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

13 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

SQL-CSI: SQL Server Case Scene Investigation

AI-powered toolkit for CSS engineers to analyze SQL Server ERRORLOG files, XEvent traces (system_health), crash dumps, and engine source code. Runs inside VS Code as a Copilot Chat agent (@sql-csi) with a router → sub-agent → skill architecture and MCP integration.

Architecture

sqlcsi/
├── sql-csi.agent.md                  # Entry agent — router + full-analysis orchestrator
│
├── agents/                            # Sub-agents (description + lightweight workflow)
│   ├── errorlog-analysis.agent.md
│   ├── xevent-analysis.agent.md
│   ├── ag-failover-analysis.agent.md
│   ├── docs-lookup.agent.md
│   ├── dump-analysis.agent.md
│   └── source-search.agent.md
│
├── skills/                            # Detailed methodologies (forked context per skill)
│   ├── errorlog-analysis/SKILL.md
│   ├── xevent-analysis/SKILL.md
│   ├── ag-failover-analysis/SKILL.md
│   ├── stuck-db-analysis/
│   │   ├── SKILL.md
│   │   └── reference/                 # Pre-cached source code knowledge base
│   │       ├── database_switch_roles_pipeline.md
│   │       └── lock_dependency.md
│   ├── docs-lookup/SKILL.md
│   └── dump-analysis/SKILL.md
│   #  (source-search has no separate skill — methodology is inline in its agent file)
│
├── scripts/                           # Standalone scripts (CLI + invoked by skills)
│   ├── parse_errorlog.js
│   ├── extract_xel.ps1
│   ├── parse_xevent.js
│   ├── gen_merged_report.js
│   └── ag-failover-analysis/          # AG failover multi-step pipeline
│       ├── parse_alwayson_out.js      # Phase 1: AlwaysOn.OUT → ag_schema.json
│       ├── extract_ag_errorlog.js     # Phase 2: ERRORLOG → ag_errorlog_events.json
│       ├── build_failover_timeline.js # Phase 2b: Detect FO incidents
│       ├── import_ag_xevent.sql       # Phase 3: XEL → SQL Server shredded tables
│       ├── merge_timeline.js          # Phase 2c: Merge ERRORLOG + XEvent
│       ├── gen_fo_report.js           # Phase 7: Single combined HTML report
│       └── gen_per_fo_report.js       # Phase 7: Per-FO HTML + MD reports
│
├── .vscode/
│   └── mcp.json                       # MCP server registrations
│
├── .github/
│   └── copilot-instructions.md        # Workspace-wide Copilot guidance
│
└── reports/                           # Generated reports (git-ignored)

Agents and Skills

Component Type Purpose Backed by
sql-csi Entry agent Route requests; orchestrate full analysis
errorlog-analysis Agent + Skill Parse ERRORLOG, extract errors, timeline, patterns scripts/parse_errorlog.js
xevent-analysis Agent + Skill Parse system_health XEL, wait analysis, merged report scripts/extract_xel.ps1, scripts/parse_xevent.js, scripts/gen_merged_report.js
docs-lookup Agent + Skill KB fixes, CU applicability, wait type research microsoft-learn MCP
dump-analysis Agent + Skill WinDbg / Mirrors commands for SQL Server dumps WinDbg (external)
source-search Agent Engine source code search across SQL 2016/2017/2019/2022/2025 msdata / csswiki MCP
ag-failover-analysis Agent + Skill AG failover: parse AlwaysOn.OUT/ERRORLOG/XEvent, classify stuck DBs, per-FO reports scripts/ag-failover-analysis/*
stuck-db-analysis Skill Deep-dive stuck RESOLVING DBs: Cat A/B/C comparison, source code pipeline mapping Pre-cached KB in reference/

Why split agent / skill? Agents are short entry files with description frontmatter so the router can match user intent. Skills hold the detailed step-by-step methodology and are loaded only when the matching agent activates, keeping context small.

Workflows

# Trigger (example) Routes to What it does
1 analyze errorlog <path> errorlog-analysis Parse ERRORLOG, extract errors, timeline, patterns
2 analyze xevent <path> xevent-analysis Parse system_health XEL files, wait analysis, merge with ERRORLOG
3 research error <N> docs-lookup Look up KB fixes, wait type causes via Microsoft Learn
4 analyze dump <path> dump-analysis Generate WinDbg/Mirrors commands for a SQL crash dump
5 search error <N> source-search Search SQL Server source code for error definition + raising code
6 full analysis All (orchestrated) Run 1 → 2 → 3 → 5 sequentially; produce combined report
7 analyze AG failover <path> ag-failover-analysis Parse AlwaysOn.OUT + ERRORLOG + XEvent, detect FO incidents, per-DB status tables, trigger analysis
8 analyze stuck DB stuck-db-analysis Cat A/B/C comparison, DatabaseSwitchRoles pipeline mapping, source code reference

Key Features

ERRORLOG Analysis

  • Auto-detect UTF-16LE / UTF-8 encoding
  • Multi-line message parsing
  • Error classification by subsystem (HADR, LOCKING, MEMORY, etc.)
  • Pattern detection: cascades, repeating, paired errors, LSN progression
  • Priority assignment: HIGH / MEDIUM / LOW
  • AG role change timeline tracking

XEvent Analysis

  • Binary XEL extraction via PowerShell Read-SqlXEvent
  • 4 focused analysis areas:
    • sp_server_diagnostics — WARNING/ERROR states only
    • scheduler_monitor — CPU > 75% or Memory < 80%
    • error_reported — ERRORLOG complement
    • wait_info — all waits with category classification
  • Cross-correlation with ERRORLOG findings

Microsoft Docs Lookup

  • KB article search and CU applicability check
  • Wait type root cause analysis from official CSS I/O guide
  • Diagnostic DMV queries

Dump Analysis

  • Subsystem-aware WinDbg/Mirrors script generation (HADR, Memory, Scheduler, Locking, IO, Connectivity)
  • SOSRingBuffer LINQ queries filtered by error number / task / scheduler
  • Parsing of pasted-back WinDbg output

Source Code Search

  • Error definition lookup in sqlerrorcodes.h
  • Find code that raises the error
  • Function logic analysis
  • XEvent diagnostics discovery
  • HTML report with Azure DevOps links

AG Failover Analysis

  • Multi-phase pipeline: AlwaysOn.OUT → ERRORLOG → XEvent → SQL Server import → merge → report
  • Per-failover incident detection with 120s clustering
  • Per-host, per-DB status tables with DTC, Reverting, ABORT kill, NQ rollback columns
  • Trigger analysis: connection timeouts, WSFC errors (41xxx chain), lease termination, SYSTEM_UNHEALTHY
  • Old PRIMARY vs New PRIMARY identification from conn_terminated + ag_directions
  • Key Event Timeline per host with color-coded badges
  • WSFC Error Chain cascade diagram (41005→41034→41144→41143→41161)
  • Conclusion with trigger cause, IO context, per-AG outcome, per-host recovery summary
  • Both HTML (Catppuccin Mocha theme) and Markdown output
  • Stuck DB deep-dive: Cat A/B/C comparison table mapped to DatabaseSwitchRoles source code
  • Pre-cached source code KB: pipeline steps with line numbers, lock dependency, deadlock chains

Prerequisites

  • VS Code with GitHub Copilot Chat extension
  • Node.js (v18+)
  • PowerShell with SqlServer module (auto-installed on first XEL extraction)
  • MCP servers configured in .vscode/mcp.json:
    • microsoft-learn — for docs-lookup
    • msdata / csswiki — for source-search
    • Optional: mssql, bluebird-mcp-*, azure-mcp, enghub, icm-prod, SqlOps

Quick Start

Open the sqlcsi folder in VS Code and use Copilot Chat:

@sql-csi analyze errorlog \\server\share\ERRORLOG

@sql-csi search error 19432 in SQL2022

@sql-csi full analysis, errorlog is \\server\share\ERRORLOG, case ID SR12345

AG Failover Analysis — Quick Start

1. Prepare logs — organize into per-host subdirectories:

C:\Temp\<case_id>\
├── <old_primary_host>\          e.g. HKAZEPWDB0031\
│   ├── *_AlwaysOn.OUT           (REQUIRED)
│   ├── *_ERRORLOG*              (REQUIRED)
│   ├── AlwaysOn_health*.xel     (REQUIRED)
│   ├── *SQLDIAG*.xel            (REQUIRED)
│   └── system_health*.xel       (optional)
└── <new_primary_host>\          e.g. HKAZEPWDB0011\
    └── (same files)

2. Have a local SQL Server — XEvent data is imported into a temp database (ag_<case_id>). Default connection: localhost with Windows auth. No special setup needed.

3. Invoke the agent — provide the required parameters:

@ag-failover-analysis 使用 ag-failover-analysis skill 分析 case <case_id>:
- Case Dir: C:\Temp\<case_id>
- Old Primary: <hostname>
- New Primary: <hostname>
- Date: YYYY-MM-DD
- UTC Offset: +8
- SQL Server: localhost

The agent runs 7 steps automatically:

  1. Parse AlwaysOn.OUT → ag_schema.json
  2. Extract ERRORLOG → ag_errorlog_events.json + failover_incidents.json
  3. Import XEvent → SQL database ag_<case_id>
  4. Merge ERRORLOG + XEvent → merged_timeline.json
  5. Generate per-FO reports → reports/<case_id>_fo[1-3]_analysis.html/.md
  6. If stuck DBs found → invoke stuck-db-analysis for deep-dive
  7. Search CSS Wiki TSGs, Microsoft Learn KBs, and bug work items → reports/<case_id>_docs/

4. Reports — generated in reports/:

File Content
<case_id>_fo1_analysis.html FO1 report (HTML, Catppuccin Mocha theme)
<case_id>_fo1_analysis.md FO1 report (Markdown)
<case_id>_fo2_analysis.html/.md FO2 report
<case_id>_fo3_analysis.html/.md FO3 report
<case_id>_fo3_stuck_analysis.md Stuck DB deep-dive (if applicable)

Standalone Scripts

The scripts in scripts/ can also be used directly from the CLI:

# ERRORLOG
node scripts/parse_errorlog.js ERRORLOG* --days 7 --json --output reports/findings.json

# XEvent
powershell -File scripts/extract_xel.ps1 -Path "system_health*.xel" -Days 7 -Output reports/extract.json
node scripts/parse_xevent.js reports/extract.json --errorlog reports/findings.json --json --output reports/xe_findings.json

# Merged report
node scripts/gen_merged_report.js reports/findings.json reports/xe_findings.json reports/merged.html

License

Internal tool for Microsoft CSS. Not for public distribution.

About

SQL-CSI: SQL Server Case Scene Investigation - AI-powered ERRORLOG, XEvent, and source code analysis toolkit for CSS engineers

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors