AI-powered toolkit for CSS engineers to analyze SQL Server ERRORLOG files, XEvent traces (system_health), crash dumps, and engine source code. Runs inside VS Code as a Copilot Chat agent (@sql-csi) with a router → sub-agent → skill architecture and MCP integration.
sqlcsi/
├── sql-csi.agent.md # Entry agent — router + full-analysis orchestrator
│
├── agents/ # Sub-agents (description + lightweight workflow)
│ ├── errorlog-analysis.agent.md
│ ├── xevent-analysis.agent.md
│ ├── ag-failover-analysis.agent.md
│ ├── docs-lookup.agent.md
│ ├── dump-analysis.agent.md
│ └── source-search.agent.md
│
├── skills/ # Detailed methodologies (forked context per skill)
│ ├── errorlog-analysis/SKILL.md
│ ├── xevent-analysis/SKILL.md
│ ├── ag-failover-analysis/SKILL.md
│ ├── stuck-db-analysis/
│ │ ├── SKILL.md
│ │ └── reference/ # Pre-cached source code knowledge base
│ │ ├── database_switch_roles_pipeline.md
│ │ └── lock_dependency.md
│ ├── docs-lookup/SKILL.md
│ └── dump-analysis/SKILL.md
│ # (source-search has no separate skill — methodology is inline in its agent file)
│
├── scripts/ # Standalone scripts (CLI + invoked by skills)
│ ├── parse_errorlog.js
│ ├── extract_xel.ps1
│ ├── parse_xevent.js
│ ├── gen_merged_report.js
│ └── ag-failover-analysis/ # AG failover multi-step pipeline
│ ├── parse_alwayson_out.js # Phase 1: AlwaysOn.OUT → ag_schema.json
│ ├── extract_ag_errorlog.js # Phase 2: ERRORLOG → ag_errorlog_events.json
│ ├── build_failover_timeline.js # Phase 2b: Detect FO incidents
│ ├── import_ag_xevent.sql # Phase 3: XEL → SQL Server shredded tables
│ ├── merge_timeline.js # Phase 2c: Merge ERRORLOG + XEvent
│ ├── gen_fo_report.js # Phase 7: Single combined HTML report
│ └── gen_per_fo_report.js # Phase 7: Per-FO HTML + MD reports
│
├── .vscode/
│ └── mcp.json # MCP server registrations
│
├── .github/
│ └── copilot-instructions.md # Workspace-wide Copilot guidance
│
└── reports/ # Generated reports (git-ignored)
| Component | Type | Purpose | Backed by |
|---|---|---|---|
sql-csi |
Entry agent | Route requests; orchestrate full analysis | — |
errorlog-analysis |
Agent + Skill | Parse ERRORLOG, extract errors, timeline, patterns | scripts/parse_errorlog.js |
xevent-analysis |
Agent + Skill | Parse system_health XEL, wait analysis, merged report | scripts/extract_xel.ps1, scripts/parse_xevent.js, scripts/gen_merged_report.js |
docs-lookup |
Agent + Skill | KB fixes, CU applicability, wait type research | microsoft-learn MCP |
dump-analysis |
Agent + Skill | WinDbg / Mirrors commands for SQL Server dumps | WinDbg (external) |
source-search |
Agent | Engine source code search across SQL 2016/2017/2019/2022/2025 | msdata / csswiki MCP |
ag-failover-analysis |
Agent + Skill | AG failover: parse AlwaysOn.OUT/ERRORLOG/XEvent, classify stuck DBs, per-FO reports | scripts/ag-failover-analysis/* |
stuck-db-analysis |
Skill | Deep-dive stuck RESOLVING DBs: Cat A/B/C comparison, source code pipeline mapping | Pre-cached KB in reference/ |
Why split agent / skill? Agents are short entry files with description frontmatter so the router can match user intent. Skills hold the detailed step-by-step methodology and are loaded only when the matching agent activates, keeping context small.
| # | Trigger (example) | Routes to | What it does |
|---|---|---|---|
| 1 | analyze errorlog <path> |
errorlog-analysis |
Parse ERRORLOG, extract errors, timeline, patterns |
| 2 | analyze xevent <path> |
xevent-analysis |
Parse system_health XEL files, wait analysis, merge with ERRORLOG |
| 3 | research error <N> |
docs-lookup |
Look up KB fixes, wait type causes via Microsoft Learn |
| 4 | analyze dump <path> |
dump-analysis |
Generate WinDbg/Mirrors commands for a SQL crash dump |
| 5 | search error <N> |
source-search |
Search SQL Server source code for error definition + raising code |
| 6 | full analysis |
All (orchestrated) | Run 1 → 2 → 3 → 5 sequentially; produce combined report |
| 7 | analyze AG failover <path> |
ag-failover-analysis |
Parse AlwaysOn.OUT + ERRORLOG + XEvent, detect FO incidents, per-DB status tables, trigger analysis |
| 8 | analyze stuck DB |
stuck-db-analysis |
Cat A/B/C comparison, DatabaseSwitchRoles pipeline mapping, source code reference |
- Auto-detect UTF-16LE / UTF-8 encoding
- Multi-line message parsing
- Error classification by subsystem (HADR, LOCKING, MEMORY, etc.)
- Pattern detection: cascades, repeating, paired errors, LSN progression
- Priority assignment: HIGH / MEDIUM / LOW
- AG role change timeline tracking
- Binary XEL extraction via PowerShell
Read-SqlXEvent - 4 focused analysis areas:
sp_server_diagnostics— WARNING/ERROR states onlyscheduler_monitor— CPU > 75% or Memory < 80%error_reported— ERRORLOG complementwait_info— all waits with category classification
- Cross-correlation with ERRORLOG findings
- KB article search and CU applicability check
- Wait type root cause analysis from official CSS I/O guide
- Diagnostic DMV queries
- Subsystem-aware WinDbg/Mirrors script generation (HADR, Memory, Scheduler, Locking, IO, Connectivity)
- SOSRingBuffer LINQ queries filtered by error number / task / scheduler
- Parsing of pasted-back WinDbg output
- Error definition lookup in
sqlerrorcodes.h - Find code that raises the error
- Function logic analysis
- XEvent diagnostics discovery
- HTML report with Azure DevOps links
- Multi-phase pipeline: AlwaysOn.OUT → ERRORLOG → XEvent → SQL Server import → merge → report
- Per-failover incident detection with 120s clustering
- Per-host, per-DB status tables with DTC, Reverting, ABORT kill, NQ rollback columns
- Trigger analysis: connection timeouts, WSFC errors (41xxx chain), lease termination, SYSTEM_UNHEALTHY
- Old PRIMARY vs New PRIMARY identification from
conn_terminated+ag_directions - Key Event Timeline per host with color-coded badges
- WSFC Error Chain cascade diagram (41005→41034→41144→41143→41161)
- Conclusion with trigger cause, IO context, per-AG outcome, per-host recovery summary
- Both HTML (Catppuccin Mocha theme) and Markdown output
- Stuck DB deep-dive: Cat A/B/C comparison table mapped to
DatabaseSwitchRolessource code - Pre-cached source code KB: pipeline steps with line numbers, lock dependency, deadlock chains
- VS Code with GitHub Copilot Chat extension
- Node.js (v18+)
- PowerShell with
SqlServermodule (auto-installed on first XEL extraction) - MCP servers configured in .vscode/mcp.json:
microsoft-learn— fordocs-lookupmsdata/csswiki— forsource-search- Optional:
mssql,bluebird-mcp-*,azure-mcp,enghub,icm-prod,SqlOps
Open the sqlcsi folder in VS Code and use Copilot Chat:
@sql-csi analyze errorlog \\server\share\ERRORLOG
@sql-csi search error 19432 in SQL2022
@sql-csi full analysis, errorlog is \\server\share\ERRORLOG, case ID SR12345
1. Prepare logs — organize into per-host subdirectories:
C:\Temp\<case_id>\
├── <old_primary_host>\ e.g. HKAZEPWDB0031\
│ ├── *_AlwaysOn.OUT (REQUIRED)
│ ├── *_ERRORLOG* (REQUIRED)
│ ├── AlwaysOn_health*.xel (REQUIRED)
│ ├── *SQLDIAG*.xel (REQUIRED)
│ └── system_health*.xel (optional)
└── <new_primary_host>\ e.g. HKAZEPWDB0011\
└── (same files)
2. Have a local SQL Server — XEvent data is imported into a temp database (ag_<case_id>).
Default connection: localhost with Windows auth. No special setup needed.
3. Invoke the agent — provide the required parameters:
@ag-failover-analysis 使用 ag-failover-analysis skill 分析 case <case_id>:
- Case Dir: C:\Temp\<case_id>
- Old Primary: <hostname>
- New Primary: <hostname>
- Date: YYYY-MM-DD
- UTC Offset: +8
- SQL Server: localhost
The agent runs 7 steps automatically:
- Parse AlwaysOn.OUT →
ag_schema.json - Extract ERRORLOG →
ag_errorlog_events.json+failover_incidents.json - Import XEvent → SQL database
ag_<case_id> - Merge ERRORLOG + XEvent →
merged_timeline.json - Generate per-FO reports →
reports/<case_id>_fo[1-3]_analysis.html/.md - If stuck DBs found → invoke
stuck-db-analysisfor deep-dive - Search CSS Wiki TSGs, Microsoft Learn KBs, and bug work items →
reports/<case_id>_docs/
4. Reports — generated in reports/:
| File | Content |
|---|---|
<case_id>_fo1_analysis.html |
FO1 report (HTML, Catppuccin Mocha theme) |
<case_id>_fo1_analysis.md |
FO1 report (Markdown) |
<case_id>_fo2_analysis.html/.md |
FO2 report |
<case_id>_fo3_analysis.html/.md |
FO3 report |
<case_id>_fo3_stuck_analysis.md |
Stuck DB deep-dive (if applicable) |
The scripts in scripts/ can also be used directly from the CLI:
# ERRORLOG
node scripts/parse_errorlog.js ERRORLOG* --days 7 --json --output reports/findings.json
# XEvent
powershell -File scripts/extract_xel.ps1 -Path "system_health*.xel" -Days 7 -Output reports/extract.json
node scripts/parse_xevent.js reports/extract.json --errorlog reports/findings.json --json --output reports/xe_findings.json
# Merged report
node scripts/gen_merged_report.js reports/findings.json reports/xe_findings.json reports/merged.htmlInternal tool for Microsoft CSS. Not for public distribution.