sageerthan/ai-code-assistant-optimization-study

AI Code Assistant Optimization Study

This repository contains datasets, experimental code, evaluation metrics, and analysis for a comparative study of how AI code assistants (Cursor, GitHub Copilot, Codeium, and Amazon Q) optimize code across multiple programming languages (Java, JavaScript, Python, and PHP). The study examines optimization quality and assistant-specific performance patterns.

Experimental Setup

The experimental setup was designed to treat all tools and programming languages equally. Each programming problem was presented separately to each of the four AI assistants using an identical problem description taken directly from LeetCode. To mitigate potential bias, prompts remained consistent across tools and languages, with minimal rephrasing only where a language's syntax or input constraints required it.

The experimental procedure comprised seven distinct steps:

Step 1: Query Generation and Prerequisites

Fifty LeetCode challenges were randomly chosen with a fixed mix of 15 easy, 20 medium, and 15 hard problems. This ratio prioritizes medium-level tasks, as they more closely approximate the complexity of typical software development challenges, encompassing critical aspects such as optimization, correctness, and debugging. The basic context for each query, namely the function name, arguments, and comments, was extracted manually from the LeetCode data set. A script then used the extracted data to generate a separate code file for each problem in Java, JavaScript, PHP, and Python. This produced 200 files per AI assistant (50 problems × 4 languages), for a total of 800 files, stored in separate folders labeled by AI assistant.
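The file-generation step could look roughly like the sketch below. It is not the study's actual script: the record layout (`slug`, `comment`, `signatures`), the folder names in `ASSISTANTS`, and both function names are hypothetical, chosen only to illustrate producing one stub per assistant, language, and problem.

```python
from pathlib import Path

# Folder names are assumed; the study only says folders are labeled by assistant.
ASSISTANTS = ["cursor", "copilot", "codeium", "amazon_q"]

# language -> (file extension, line-comment prefix)
LANGUAGES = {
    "java": ("java", "//"),
    "javascript": ("js", "//"),
    "php": ("php", "//"),
    "python": ("py", "#"),
}


def render_stub(problem: dict, language: str) -> str:
    """Return the problem comment followed by the bare function signature."""
    _, prefix = LANGUAGES[language]
    header = "\n".join(f"{prefix} {line}" for line in problem["comment"].splitlines())
    body = f"{header}\n{problem['signatures'][language]}\n"
    # PHP files need an opening tag before any code.
    return "<?php\n" + body if language == "php" else body


def write_stubs(problems: list[dict], out_dir: Path) -> int:
    """Write one stub per (assistant, language, problem); return the file count."""
    count = 0
    for assistant in ASSISTANTS:
        for language, (ext, _) in LANGUAGES.items():
            folder = out_dir / assistant / language
            folder.mkdir(parents=True, exist_ok=True)
            for problem in problems:
                path = folder / f"{problem['slug']}.{ext}"
                path.write_text(render_stub(problem, language), encoding="utf-8")
                count += 1
    return count
```

With 50 problem records, `write_stubs` would write 50 × 4 × 4 = 800 files, matching the totals described above.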

Step 2: AI Assistant Suggestion Collection

For each AI assistant, the corresponding set of code files was loaded into a VSCode project, and that assistant was enabled to provide suggestions for all files in the project. Once suggestions had been collected for one assistant, it was disabled before the next was enabled, so that each assistant's suggestions were collected in isolation. In all cases, only the first suggestion offered by the assistant was recorded.

Step 3: Correctness Evaluation

Each of the 800 collected suggestions was manually submitted to the LeetCode platform for its corresponding problem. The LeetCode API was then used to fetch the submission results as JSON files. A dedicated Python script processed these JSON files and computed the statistics needed to answer the research questions on the correctness of the code produced by each AI tool.
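As an illustration of this kind of processing, the sketch below computes an acceptance rate per assistant and language. The JSON field names (`assistant`, `language`, `status_msg`) and the one-file-per-submission layout are assumptions, not the study's documented schema.

```python
import json
from collections import defaultdict
from pathlib import Path


def load_results(folder: Path) -> list[dict]:
    """Read every *.json file in the folder as one submission result."""
    return [
        json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(folder.glob("*.json"))
    ]


def acceptance_rates(results: list[dict]) -> dict[tuple[str, str], float]:
    """Return the share of accepted submissions per (assistant, language)."""
    totals = defaultdict(int)
    accepted = defaultdict(int)
    for result in results:
        key = (result["assistant"], result["language"])
        totals[key] += 1
        # "Accepted" as the passing verdict is an assumed field value.
        if result["status_msg"] == "Accepted":
            accepted[key] += 1
    return {key: accepted[key] / totals[key] for key in totals}
```

Grouping by difficulty instead would only require adding that field to the key.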

Step 4: Efficiency Evaluation

The submission results, obtained as JSON through the LeetCode API, were processed in an Anaconda Jupyter Notebook environment. Relevant fields, including the problem statement, runtime, memory usage, and programming language of each solution, were extracted into a CSV data set for analyzing runtime efficiency. A second CSV data set was built to analyze memory efficiency alongside runtime efficiency. Using these data sets, runtime efficiency, memory efficiency, and overall performance were analyzed for all four AI code assistants with Python scripts in Jupyter Notebook.
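A minimal sketch of the JSON-to-CSV step and a per-assistant summary follows, using only the standard library. The column names in `FIELDS` and the units (`runtime_ms`, `memory_mb`) are assumptions made for illustration; the study's actual notebooks may be organized differently.

```python
import csv
from collections import defaultdict
from pathlib import Path
from statistics import mean

# Assumed column names; extra keys in a result record are ignored.
FIELDS = ["problem", "assistant", "language", "runtime_ms", "memory_mb"]


def write_efficiency_csv(results: list[dict], path: Path) -> None:
    """Flatten submission results into one CSV row per solution."""
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(results)


def mean_by_assistant(path: Path, column: str) -> dict[str, float]:
    """Average a numeric column (runtime or memory) for each assistant."""
    groups = defaultdict(list)
    with path.open(newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            groups[row["assistant"]].append(float(row[column]))
    return {assistant: mean(values) for assistant, values in groups.items()}
```

Calling `mean_by_assistant(path, "runtime_ms")` and `mean_by_assistant(path, "memory_mb")` on the same file gives the two views side by side.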

Step 5: Understandability, Reliability and Maintainability Evaluation

To evaluate non-functional code quality attributes, SonarQube was run on all 800 code files, comprising 200 files (50 problems × 4 languages) generated by each AI assistant. The following metrics were collected: cognitive complexity and cyclomatic complexity for understandability; security rating for code security; number of bugs for reliability; and number of code smells for maintainability. Analyzing bugs and code smells separately distinguishes flaws that cause incorrect or unstable behavior (bugs) from flaws that hinder code understanding or maintenance (code smells). In addition, a Python script extracted these five metrics from SonarQube and compiled them into an Excel document.
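One way such an extraction script might read the five metrics is through the SonarQube Web API endpoint `api/measures/component`. The sketch below is an assumption about the approach, not the study's script: the server URL, component keys, and token are placeholders, and writing the result to Excel is left out.

```python
import base64
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# SonarQube metric keys for the five metrics named above.
METRIC_KEYS = ["cognitive_complexity", "complexity", "security_rating", "bugs", "code_smells"]


def measures_url(base_url: str, component_key: str) -> str:
    """Build the api/measures/component URL for one project or file."""
    query = urlencode({"component": component_key, "metricKeys": ",".join(METRIC_KEYS)})
    return f"{base_url.rstrip('/')}/api/measures/component?{query}"


def parse_measures(payload: dict) -> dict[str, float]:
    """Turn the response's list of measures into {metric key: numeric value}."""
    measures = payload["component"].get("measures", [])
    return {m["metric"]: float(m["value"]) for m in measures if "value" in m}


def fetch_measures(base_url: str, component_key: str, token: str) -> dict[str, float]:
    """Query a running SonarQube server; the token is sent as the basic-auth user."""
    auth = base64.b64encode(f"{token}:".encode()).decode()
    request = Request(
        measures_url(base_url, component_key),
        headers={"Authorization": f"Basic {auth}"},
    )
    with urlopen(request) as response:
        return parse_measures(json.load(response))
```

Keeping `parse_measures` separate from the network call lets the parsing be checked against a saved response without a live server.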

Step 6: Debugging Evaluation

In the debugging phase, each AI assistant was given the buggy code it had itself generated during the problem-solving phase, together with the error messages that pointed to the location of the error. With this information, the assistant had to trace the cause of the error and correct the code. This made it possible to observe how effectively each assistant revises its own earlier work in response to feedback. By simulating an iterative debugging process, the evaluation gauges each assistant's ability to adjust and correct its output under conditions similar to real-world programming.

Step 7: Security Evaluation

This security evaluation employed a structured experimental setup to systematically evaluate security weaknesses in AI-generated source code extracted from open-source GitHub repositories. The experimental environment was designed to support static security analysis of isolated code snippets while ensuring consistency and reproducibility across different AI code generation tools.

7.1 Data Preparation and Organization

The collected repositories were first cloned locally, and relevant Python and JavaScript files containing AI-generated code segments were identified based on repository metadata and in-code annotations. From each repository, only the code snippets explicitly attributed to AI-assisted generation tools were extracted. These snippets were then organized according to the generating tool (GitHub Copilot, Amazon Q, or Cursor), programming language, and application domain to enable comparative analysis. To avoid introducing external dependencies or configuration inconsistencies, the extracted code snippets were analyzed in isolation without modifying their original structure or logic. Any non-AI-generated code within the same files was excluded from the analysis to ensure that the evaluation focused solely on AI-produced content.

7.2 Static Security Analysis Environment

Static security analysis was conducted using an established automated analysis tool capable of detecting security vulnerabilities in Python and JavaScript source code. The tool, CodeQL, was executed in a controlled local environment to ensure consistent results across all experiments. The default security rule sets provided by the tool were used to minimize configuration bias and to reflect real-world usage scenarios. Each code snippet was analyzed independently, and the detected security issues were recorded along with their severity levels, vulnerability descriptions, and source locations. The analysis process was repeated uniformly across all datasets to maintain comparability between AI code generation tools.
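CodeQL can write its results as a SARIF file, so recording each issue could be sketched as below. This assumes SARIF output was used, which the study does not state; the output record keys (`rule_id`, `level`, and so on) are illustrative names.

```python
import json
from pathlib import Path


def extract_findings(sarif: dict) -> list[dict]:
    """Collect rule, severity level, description, and location for each result."""
    findings = []
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            # Use the first reported location, if any.
            locations = result.get("locations") or [{}]
            physical = locations[0].get("physicalLocation", {})
            findings.append(
                {
                    "rule_id": result.get("ruleId"),
                    # SARIF treats a missing level as "warning".
                    "level": result.get("level", "warning"),
                    "description": result.get("message", {}).get("text", ""),
                    "file": physical.get("artifactLocation", {}).get("uri"),
                    "line": physical.get("region", {}).get("startLine"),
                }
            )
    return findings


def read_sarif_findings(sarif_path: Path) -> list[dict]:
    """Load a SARIF file from disk and extract its findings."""
    return extract_findings(json.loads(sarif_path.read_text(encoding="utf-8")))
```

Each returned record corresponds to one detected issue and can be tagged with the generating tool and language before manual review.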

7.3 False Positive Validation

Given the known limitations of automated static analysis tools, a manual validation phase was incorporated into the experimental setup. All reported vulnerabilities were reviewed by the researcher to identify and eliminate false positives. This validation involved examining the surrounding code context, usage patterns, and data flows to determine whether the reported issues constituted genuine security weaknesses. Only validated vulnerabilities were retained for subsequent analysis. This step was essential to ensure the accuracy and reliability of the reported findings and to prevent overestimation of security risks in AI-generated code.

7.4 Vulnerability Classification and Analysis

The confirmed security vulnerabilities were classified using the Common Weakness Enumeration (CWE) framework. Each identified issue was mapped to its corresponding CWE category, enabling standardized categorization and comparison across different tools and codebases. The categorized results were then aggregated and analyzed to examine trends in vulnerability types, frequency, and distribution across AI code generation tools, programming languages, and application domains. This structured experimental setup provided a robust foundation for addressing the research questions related to the security characteristics of AI-generated code.
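The mapping and aggregation could be sketched as follows. CodeQL security queries commonly carry tags of the form `external/cwe/cwe-079`; relying on those tags, and the record fields `tool`, `language`, `tags`, and `validated`, are assumptions for illustration rather than the study's documented procedure.

```python
import re
from collections import Counter

# Matches CodeQL-style CWE tags such as "external/cwe/cwe-079".
CWE_TAG = re.compile(r"^external/cwe/cwe-(\d+)$")


def cwe_ids(tags: list[str]) -> list[str]:
    """Convert CWE tags to normalized IDs, e.g. 'CWE-79'; ignore other tags."""
    ids = []
    for tag in tags:
        match = CWE_TAG.match(tag)
        if match:
            ids.append(f"CWE-{int(match.group(1))}")
    return ids


def count_cwes(findings: list[dict]) -> Counter:
    """Count manually validated findings per (tool, language, CWE)."""
    counts = Counter()
    for finding in findings:
        # Skip anything not confirmed during false-positive validation.
        if not finding.get("validated"):
            continue
        for cwe in cwe_ids(finding.get("tags", [])):
            counts[(finding["tool"], finding["language"], cwe)] += 1
    return counts
```

The resulting counter can be summed over any of the three key components to compare tools, languages, or weakness types.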
