Replication Package: What Breaks When LLMs Code? Characterizing Operational Safety Failures of Agentic Code Assistants
This repository contains the official replication package, including source code and datasets, for the empirical study "What Breaks When LLMs Code? Characterizing Operational Safety Failures of Agentic Code Assistants." It provides the scripts, raw data, and qualitative coding guidelines needed to replicate our data collection, filtering, and taxonomy construction for both real-world GitHub incidents and prior academic literature.
The repository is organized into the following main directories and files:
📦 replication-package
┣ 📜 README.md # This documentation file
┣ 📜 annotation-guideline.pdf # The coding manual used by the authors
┣ 📂 Figures/ # Contains all figures presented in the paper
┗ 📂 Code/ # Contains all automated collection and filtering scripts
┃ ┣ 📂 Issue-collection/
┃ ┃ ┗ 📜 issue_llm_filter.py # Uses LLMs to filter for genuine operational safety failures
┃ ┗ 📂 Paper-collection/
┃ ┃ ┣ 📜 keyword_paper_filter.py # Filters academic literature based on predefined SE safety keywords
┃ ┃ ┗ 📜 paper_llm_filter.py # Uses an LLM to assess paper relevance to code generation
annotation-guideline.pdf contains the coding procedures used to map the 547 in-the-wild GitHub incidents and 185 academic papers onto our 33-node, 7-dimension safety taxonomy. It includes definitions, inclusion/exclusion criteria, and examples to support inter-rater reliability.
The Figures/ folder contains all six figures presented in the paper.
Code/Issue-collection/ contains the scripts that extract and refine real-world operational failures caused by autonomous coding agents.
issue_llm_filter.py: Feeds the parsed issues through LLMs to filter out standard bugs automatically and isolate genuine safety and execution failures of autonomous agents.
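The filtering step described above can be pictured with a minimal sketch. It is not the code in issue_llm_filter.py: the prompt wording, the YES/NO answer format, the 4000-character truncation, and the names `build_prompt`, `parse_verdict`, `filter_issues`, and `fake_llm` are all assumptions made for illustration. The model call is passed in as a plain function, so the example runs without any API key.

```python
from typing import Callable, Iterable, List

# Hypothetical screening prompt; the wording used in the study may differ.
PROMPT_TEMPLATE = (
    "You are screening GitHub issues about agentic code assistants.\n"
    "Answer YES if the issue reports an operational safety failure caused by "
    "the agent's own autonomous actions, or NO if it is an ordinary bug, "
    "feature request, or question.\n\n"
    "Title: {title}\n"
    "Body: {body}\n"
    "Answer:"
)


def build_prompt(issue: dict) -> str:
    """Render the screening prompt for one parsed issue."""
    title = issue.get("title") or ""
    body = issue.get("body") or ""  # an issue body can be missing or None
    return PROMPT_TEMPLATE.format(title=title, body=body[:4000])


def parse_verdict(reply: str) -> bool:
    """Keep the issue only if the first word of the reply is YES."""
    words = reply.strip().split()
    return bool(words) and words[0].strip(".,:;!").upper() == "YES"


def filter_issues(issues: Iterable[dict], ask_llm: Callable[[str], str]) -> List[dict]:
    """Return the issues that the model labels as safety failures."""
    return [i for i in issues if parse_verdict(ask_llm(build_prompt(i)))]


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call (assumption): flags prompts mentioning 'deleted'."""
    return "YES" if "deleted" in prompt.lower() else "NO"


issues = [
    {"title": "Agent deleted my .git directory", "body": "It ran rm -rf without asking."},
    {"title": "Typo in docs", "body": None},
]
kept = filter_issues(issues, fake_llm)
print([i["title"] for i in kept])  # ['Agent deleted my .git directory']
```

Keeping the model call behind a single `ask_llm` function makes it straightforward to swap providers or to replay cached replies when re-running the filter.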
Code/Paper-collection/ contains the scripts that collect and filter the academic literature dataset.
keyword_paper_filter.py: Applies a keyword filter to isolate papers mentioning LLM safety, code generation, and agentic workflows.
paper_llm_filter.py: Evaluates the titles and abstracts of the collected papers to determine whether their focus is on code generation.
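A keyword filter of the kind keyword_paper_filter.py applies might look like the sketch below. The keyword groups, the rule that a paper must match every required group, and the names `KEYWORD_GROUPS`, `matched_groups`, and `keep_paper` are assumptions for illustration. The study's actual keyword list and matching rule are defined in the script itself and are not reproduced here.

```python
import re

# Hypothetical keyword groups (assumption); not the study's actual list.
KEYWORD_GROUPS = {
    "llm": ["large language model", "llm", "code assistant"],
    "code": ["code generation", "program synthesis", "coding agent"],
    "safety": ["safety", "security", "unsafe", "failure"],
}


def compile_group(terms):
    """Build one case-insensitive, whole-word pattern for a list of terms."""
    alternation = "|".join(re.escape(t) for t in terms)
    # The optional trailing "s" lets simple plurals such as "LLMs" match.
    return re.compile(rf"\b(?:{alternation})s?\b", re.IGNORECASE)


PATTERNS = {name: compile_group(terms) for name, terms in KEYWORD_GROUPS.items()}


def matched_groups(paper: dict) -> set:
    """Return the names of the keyword groups found in the title or abstract."""
    text = f"{paper.get('title') or ''} {paper.get('abstract') or ''}"
    return {name for name, pattern in PATTERNS.items() if pattern.search(text)}


def keep_paper(paper: dict, required=("llm", "safety")) -> bool:
    """Keep a paper only if every required group has at least one match."""
    return set(required) <= matched_groups(paper)


papers = [
    {
        "title": "Security Risks in LLM-Generated Code",
        "abstract": "We study code generation by large language models.",
    },
    {"title": "A Survey of Graph Databases", "abstract": "We review storage engines."},
]
for paper in papers:
    print(keep_paper(paper), sorted(matched_groups(paper)))
```

Grouping the keywords keeps the filter permissive within a topic and strict across topics, which is one plausible way to narrow a large corpus before the LLM-based relevance check.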