# "AI Software Testing" vs. "Testing Software with AI"

### Introduction

Since the beginning of 2023, the world has been experiencing a new wave of applications powered by deep learning, and a few of those focus on automating the process of software testing. Paradoxically, little thought seems to be given to the way that testing itself will change in the context of AI-based applications. This essay aims to discuss the challenges and opportunities that arise from the intersection of these two trends.

Foundational models will progressively be incorporated into virtually all systems, independent of their target domain. This will lead to a new generation of software that is able to learn from its users and improve over time. As put by Andrej Karpathy in his seminal article about Software 2.0: "we are witnessing a massive transition across the industry where of a lot of 1.0 code is being ported into 2.0 code. Software (1.0) is eating the world, and now AI (Software 2.0) is eating software"[1]. This new generation of software will require a new understanding of what software testing is and how it should be done.

Nevertheless, the new AI-powered testing tools do not seem to address this question. Software has historically been developed and tested with an implicit expectation of determinism, i.e., that for a given set of inputs the software should produce the same output. This is not the case for applications powered by deep learning, as they are inherently non-deterministic and hallucination is an expected behavior.
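One way to test under non-determinism is to replace exact-match assertions with a threshold on the pass rate over repeated trials. The sketch below is only an illustration: `toy_model` is a hypothetical stand-in for a stochastic model call, and the number of trials and the 0.85 threshold are arbitrary values chosen for the example.

```python
import random
from typing import Callable


def pass_rate(
    system: Callable[[str], str],
    prompt: str,
    is_acceptable: Callable[[str], bool],
    trials: int = 200,
) -> float:
    """Fraction of repeated calls whose output satisfies the acceptance check."""
    passed = sum(1 for _ in range(trials) if is_acceptable(system(prompt)))
    return passed / trials


_rng = random.Random(0)  # seeded so the example is reproducible


def toy_model(prompt: str) -> str:
    """Hypothetical model: answers correctly about 95% of the time."""
    return "4" if _rng.random() < 0.95 else "5"


rate = pass_rate(toy_model, "2 + 2 = ?", lambda out: out == "4")

# A deterministic test would assert toy_model(...) == "4" and fail intermittently.
# Here the assertion is on the observed rate instead (threshold is illustrative).
assert rate >= 0.85
```

The test no longer states that the output is always right, only that it is right often enough for the use case.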

To build a system that is correct enough to be safe, it is necessary to have an approach to software testing that effectively incorporates the "correctness spectrum problem", which has been studied in the context of self-driving cars[2]. In that context, correctness is defined in terms of the probability that the system fails across all scenarios it is expected to encounter in the real world ("knowns") as well as all scenarios it is **not** expected to encounter ("unknowns"). If all "knowns" are defined and guardrails are built to prevent any "unknowns", then it is possible to achieve high bars of correctness for specific use cases.

This gives rise to the concept of operational design domain (ODD)[3], which is the set of conditions that the system is expected to encounter in the real world. By bounding the application to an ODD, it becomes much easier to test and validate the correctness of a system and establish safe operating conditions. All scenarios that fall outside the ODD are considered "unknowns" and are not expected to be handled by the system. This is a fundamental shift in the way software is developed and tested.
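As a minimal sketch of what bounding an application to an ODD could look like in code, the example below checks each request against an explicit domain description before the model is called. The fields chosen here (`languages`, `topics`, `max_chars`) and the `handle` function are hypothetical; a real ODD would be defined by the application's own requirements.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet


@dataclass(frozen=True)
class OperationalDesignDomain:
    """Hypothetical ODD: the conditions the application is designed to handle."""

    languages: FrozenSet[str]
    topics: FrozenSet[str]
    max_chars: int

    def contains(self, language: str, topic: str, text: str) -> bool:
        return (
            language in self.languages
            and topic in self.topics
            and 0 < len(text) <= self.max_chars
        )


def handle(
    odd: OperationalDesignDomain,
    model: Callable[[str], str],
    language: str,
    topic: str,
    text: str,
) -> Dict[str, str]:
    """Call the model only inside the ODD; refuse everything else."""
    if not odd.contains(language, topic, text):
        return {"status": "refused", "reason": "outside operational design domain"}
    return {"status": "ok", "answer": model(text)}


odd = OperationalDesignDomain(frozenset({"en"}), frozenset({"billing"}), max_chars=500)


def echo_model(text: str) -> str:
    """Stand-in for a real model call."""
    return f"answer to: {text}"


inside = handle(odd, echo_model, "en", "billing", "Why was I charged twice?")
outside = handle(odd, echo_model, "en", "medical", "What dose should I take?")
```

Tests can then be split into two groups: behavior inside the ODD, and confirmation that requests outside it are refused rather than answered.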

A possible architecture for applications powered by deep learning is composed of five layers [4]:

1. General AI models: these are the foundational models that are trained on large, public datasets.
2. Specific AI models: these are trained on narrow data to outperform general models in specific use cases.
3. Hyperlocal AI models: these are trained on local, proprietary data.
4. Generative OS or API layer: this layer helps the application to access all the AI models required to solve a problem and is responsible for orchestrating the data flow between them.
5. Applications layer: this is the user-facing layer and should have powerful network effects and embedding characteristics.

The main engineering challenge in building a system that is correct enough to be safe is to put in place a pipeline that enforces the ODD while properly handling results that are out of distribution. Such a pipeline should use general models to generate completion candidates, which are embedded after user validation. These completion embeddings are then continuously used to fine-tune a foundational model on the data collected from users, improving its performance over time.

To measure correctness and enforce the ODD, uncertainty can be estimated and used as a signal that helps the system detect when it is out of distribution[5][6]. This, too, changes the way software is developed and tested, as it requires a new set of tools and processes. One example is infrastructure for end-to-end validation, in which the application is instrumented to collect and validate the data consumed in the data flow of its different modules.
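A simple proxy for uncertainty, sketched below, is to sample the model several times on the same input and measure how much the answers disagree, using the entropy of the empirical answer distribution. This is a simplification chosen for illustration and is not the method of [5] or [6]; the `threshold` value is an arbitrary assumption that would need calibration on real data.

```python
import math
from collections import Counter
from typing import Sequence


def predictive_entropy(samples: Sequence[str]) -> float:
    """Shannon entropy (in nats) of the empirical distribution of sampled answers."""
    counts = Counter(samples)
    total = len(samples)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def is_out_of_distribution(samples: Sequence[str], threshold: float = 0.5) -> bool:
    """Flag an input when repeated samples disagree more than the threshold allows."""
    return predictive_entropy(samples) > threshold


# Five samples that agree: low uncertainty, treated as in-distribution.
confident = ["Paris"] * 5
# Five samples that mostly disagree: high uncertainty, flagged.
uncertain = ["Paris", "Lyon", "Nice", "Paris", "Rome"]

flag_confident = is_out_of_distribution(confident)
flag_uncertain = is_out_of_distribution(uncertain)
```

Logging the entropy value alongside each request gives the data points that later make it possible to tune the threshold.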

### Preparing an AI application for deployment

When deploying Generative AI applications powered by deep learning, the primary focus should be on using the foundational models (both general and specific) to generate candidate outputs, and then gathering user feedback to refine them. This feedback loop helps create custom embeddings, which are crucial for improving application performance over time. Techniques such as RAG (Retrieval-Augmented Generation) rely on these embeddings: stored vectors are selected based on their similarity to the query being processed, using similarity functions such as cosine similarity or Euclidean distance, among others.
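The retrieval step of RAG can be sketched with cosine similarity over a small in-memory store. The vectors and document names below are made up for the example; in practice the embeddings would come from an embedding model and the store would typically be a vector database.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: Sequence[float], store: Dict[str, Sequence[float]], k: int = 2) -> List[str]:
    """Return the ids of the k stored vectors most similar to the query."""
    ranked = sorted(
        store.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:k]]


# Toy 2-dimensional "embeddings" (illustrative values only).
store = {
    "refund_policy": [1.0, 0.0],
    "shipping_times": [0.0, 1.0],
    "refund_form": [0.9, 0.1],
}

retrieved = top_k([1.0, 0.0], store, k=2)
```

The retrieved documents are then added to the prompt, so the model answers from validated data rather than from its parameters alone.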

In this context, the application layer (GUI or otherwise) serves as the interface where users interact, providing feedback, which in turn helps refine the AI models and their outputs. It is important for the user interface to have network effects and embedding characteristics, preferably adopting some aspect of data validation, where users can see the data being collected and validate it.

In terms of architecture, the API layer should be deployed with output parsers, which are responsible for post-processing the results and validating the format of the objects returned by the neural networks. Whenever possible, a data point should be stored only after user validation. This is particularly important because AI models may hallucinate or generate output in unexpected formats.
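A minimal output parser might look like the sketch below, which assumes the model was asked to reply with a JSON object. The `OutputFormatError` class, the `parse_model_output` function, and the schema are hypothetical names introduced for this example.

```python
import json
from typing import Any, Dict


class OutputFormatError(ValueError):
    """Raised when a model response does not match the expected structure."""


def parse_model_output(raw: str, schema: Dict[str, type]) -> Dict[str, Any]:
    """Parse a JSON response and check that each required field has the right type."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise OutputFormatError(f"response is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise OutputFormatError("expected a JSON object")
    for field, expected_type in schema.items():
        if field not in data:
            raise OutputFormatError(f"missing field: {field}")
        if not isinstance(data[field], expected_type):
            raise OutputFormatError(f"field {field!r} should be {expected_type.__name__}")
    return data


SCHEMA = {"label": str, "score": float}

parsed = parse_model_output('{"label": "spam", "score": 0.93}', SCHEMA)

# A chatty response that wraps the payload in prose is rejected, not stored.
try:
    parse_model_output("Sure! Here is the result: spam", SCHEMA)
    rejected = False
except OutputFormatError:
    rejected = True
```

Rejected responses can be retried or routed to a fallback, while accepted ones still wait for user validation before being stored.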

### Scaling safely

Scaling involves expanding the scope of AI models from generalized to hyperlocal ones. This transition focuses on leveraging proprietary data to fine-tune foundational models for specific, localized use cases. The emphasis here is on adapting foundational models to better suit the unique attributes and requirements of a particular domain.

Several steps are involved in scaling:

1. Collecting localized data: this covers the procedures used to collect data through prompt engineering and to pre-process it with custom embeddings. These procedures should be in place from the inception of the application, as discussed above.
2. Training foundational models on this proprietary data to create hyperlocal models: fine-tuning the foundational models on the localized data produces hyperlocal models that are better suited to the specific domain. This can be done using frameworks such as PyTorch, TensorFlow, or JAX, and platforms of open-source tools and models, such as Hugging Face.
3. Error analysis: a simple technique with outsized benefits is to sort the validation samples by the model loss. With libraries such as Hugging Face Transformers, passing the label during the forward pass makes the model calculate and return the loss automatically. The goal of this exercise is to detect wrong labels or quirks of the dataset. This information can, in turn, help to refine the dataset, which might either be mislabeled or require a new class altogether. This can often lead to a performance gain as big as (or bigger than) adding more data or using larger models.
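The error-analysis step can be sketched independently of any framework. The example below assumes the class probabilities for each validation sample have already been obtained from the model, and computes the cross-entropy loss by hand; the sample texts and probabilities are invented.

```python
import math
from typing import List, Sequence, Tuple

Sample = Tuple[str, Sequence[float], int]  # (text, class probabilities, label)


def cross_entropy(probs: Sequence[float], label: int) -> float:
    """Negative log-probability the model assigned to the labeled class."""
    return -math.log(max(probs[label], 1e-12))


def rank_by_loss(samples: List[Sample]) -> List[Tuple[float, str, int]]:
    """Sort validation samples from highest to lowest loss."""
    scored = [(cross_entropy(probs, label), text, label) for text, probs, label in samples]
    return sorted(scored, key=lambda item: item[0], reverse=True)


# Labels: 0 = negative, 1 = positive (illustrative data).
validation = [
    ("great movie", [0.10, 0.90], 1),
    ("terrible plot", [0.95, 0.05], 1),  # labeled positive: likely a wrong label
    ("not bad at all", [0.70, 0.30], 1),
]

ranked = rank_by_loss(validation)
worst_text = ranked[0][1]
```

Reviewing the top of this ranking by hand is usually enough to spot mislabeled samples, such as the second one here.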

Deep learning models are exceptionally good at finding and exploiting shortcuts to get to a prediction. For this reason, it is also worth investigating the predictions the model is most confident about, to check that the model is not improperly exploiting spurious features in the text.
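Listing the most confident predictions for manual review needs very little code. The sketch below assumes class probabilities are already available; the texts are invented to show how a shared boilerplate string can surface as a suspicious pattern.

```python
from typing import List, Sequence, Tuple

Prediction = Tuple[str, Sequence[float]]  # (text, class probabilities)


def most_confident(predictions: List[Prediction], n: int = 2) -> List[Tuple[float, int, str]]:
    """Return (confidence, predicted class, text) for the n most confident predictions."""
    scored = []
    for text, probs in predictions:
        confidence = max(probs)
        scored.append((confidence, list(probs).index(confidence), text))
    return sorted(scored, key=lambda item: item[0], reverse=True)[:n]


predictions = [
    ("Sent from my iPhone: love it", [0.01, 0.99]),
    ("okay I guess", [0.45, 0.55]),
    ("Sent from my iPhone: awful", [0.03, 0.97]),
]

top = most_confident(predictions, n=2)
```

In this toy data both top predictions share the same signature line and the same predicted class despite opposite sentiment, which is the kind of shortcut a reviewer should look for.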

### Maintainability of Generative AI applications

Maintaining Generative AI applications involves managing the orchestration of data flow between different layers of AI models. The API layer plays a pivotal role in ensuring seamless communication and integration between these models: it routes inputs and outputs between them and falls back on the foundational models when necessary.
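The fallback behavior can be sketched as a router that tries models from most to least specific. The convention that a model returns `None` to abstain is an assumption made for this example, as are the model functions themselves.

```python
from typing import Callable, Optional, Sequence, Tuple

Model = Callable[[str], Optional[str]]  # returns None to abstain (assumed convention)


def route(prompt: str, models: Sequence[Tuple[str, Model]]) -> Tuple[str, str]:
    """Try each model in order and return (model_name, answer) from the first that answers."""
    for name, model in models:
        try:
            answer = model(prompt)
        except Exception:
            continue  # a failing layer should not take the whole request down
        if answer is not None:
            return name, answer
    raise RuntimeError("no model produced an answer")


def hyperlocal(prompt: str) -> Optional[str]:
    """Hypothetical hyperlocal model that only knows about invoices."""
    return "See invoice #A" if "invoice" in prompt else None


def general(prompt: str) -> Optional[str]:
    """Hypothetical general model that always answers."""
    return "generic answer"


layers = [("hyperlocal", hyperlocal), ("general", general)]

served_locally = route("where is my invoice?", layers)
served_by_fallback = route("what is the weather?", layers)
```

Returning the model name together with the answer makes it possible to monitor how often each layer is actually used.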

Key components for maintainability are:

- Developing robust APIs that efficiently manage data flow between the various AI model layers: using a service provider has the benefit of not having to worry about the infrastructure of the API, given that providers already offer several "quality-of-life" solutions, such as function calling and data serialization, which are crucial for the maintainability of the application. Once the application is robust enough that it is reasonable to train and use its own models, this infrastructure needs to be built from scratch, which is a non-trivial task.
- Ensuring compatibility and scalability within the API infrastructure to accommodate potential future model expansions or changes: the API infrastructure should be flexible enough to maintain backward and forward compatibility with several model providers. This is particularly relevant as the models are constantly being fine-tuned and updated, which can lead to changes in the data flow between the models.
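One common way to keep the application independent of any single provider is to code against a small interface and wrap each provider in an adapter. The sketch below is an assumption about how this could be organized; the class names and the request and response shapes are invented and do not correspond to any real provider's API.

```python
from typing import Callable, Protocol


class CompletionProvider(Protocol):
    """Interface the application codes against, regardless of who serves the model."""

    def complete(self, prompt: str) -> str: ...


class HostedProvider:
    """Hypothetical adapter around a third-party client function."""

    def __init__(self, send: Callable[[dict], dict]) -> None:
        self._send = send

    def complete(self, prompt: str) -> str:
        response = self._send({"input": prompt})  # payload shape is invented
        return response["text"]


class LocalProvider:
    """Hypothetical adapter around a model hosted by the application itself."""

    def __init__(self, generate: Callable[[str], str]) -> None:
        self._generate = generate

    def complete(self, prompt: str) -> str:
        return self._generate(prompt)


def summarize(provider: CompletionProvider, document: str) -> str:
    """Application code: unaware of which provider is behind the interface."""
    return provider.complete(f"Summarize: {document}")


# Stand-ins for real backends, used only to exercise the adapters.
hosted = HostedProvider(lambda payload: {"text": payload["input"].upper()})
local = LocalProvider(lambda prompt: prompt.lower())

from_hosted = summarize(hosted, "abc")
from_local = summarize(local, "ABC")
```

When a provider changes its request or response format, only the corresponding adapter needs to be updated.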

Maintainability also includes establishing a comprehensive validation system that spans the entire AI application pipeline. End-to-end validation involves validating data inputs, model predictions, and outputs at every stage of the application, which helps ensure the integrity and reliability of the system. Regularly auditing and updating the validation protocols is also crucial to keep the application robust.
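End-to-end validation can be sketched as a pipeline in which every stage is paired with a check on its own output. The stage names, the stand-in prediction, and the `StageValidationError` class below are hypothetical.

```python
from typing import Any, Callable, List, Tuple

# (stage name, transformation, validator for the transformation's output)
Stage = Tuple[str, Callable[[Any], Any], Callable[[Any], bool]]


class StageValidationError(ValueError):
    """Raised when the output of a pipeline stage fails its validation check."""


def run_pipeline(value: Any, stages: List[Stage]) -> Any:
    """Run each stage in order, validating its output before moving on."""
    for name, transform, is_valid in stages:
        value = transform(value)
        if not is_valid(value):
            raise StageValidationError(f"validation failed after stage {name!r}")
    return value


stages: List[Stage] = [
    ("clean_input", str.strip, lambda s: len(s) > 0),
    (
        "predict",
        lambda s: {"label": "positive", "score": 0.9},  # stand-in for a model call
        lambda p: 0.0 <= p["score"] <= 1.0,
    ),
    (
        "format_output",
        lambda p: f'{p["label"]} ({p["score"]:.2f})',
        lambda s: isinstance(s, str),
    ),
]

result = run_pipeline("  nice product  ", stages)
```

Because the error names the failing stage, a blank input is reported as a `clean_input` failure instead of surfacing later as a confusing model output.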

### Sources: 

[1] Karpathy, Andrej, 2017. Software 2.0. URL: https://karpathy.medium.com/software-2-0-a64152b37c35

[2] Morgan Beller, 2023. The AI Startup Litmus Test. URL: https://www.nfx.com/post/ai-startup-litmus-test

[3] Wikipedia. Operational design domain. URL: https://en.wikipedia.org/wiki/Operational_design_domain

[4] James Currier, 2022. Generative AI Market Map and 5-Layer Tech Stack. URL: https://www.nfx.com/post/generative-ai-tech-5-layers#The-5--Layer-Generative-Tech-Stack

[5] Yarin Gal, Riashat Islam, and Zoubin Ghahramani. Deep bayesian active learning with image data. In International Conference on Machine Learning, pages 1183–1192. PMLR, 2017.

[6] Donggeun Yoo and In So Kweon. Learning loss for active learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 93–102, 2019.