# GenAI Ops – Deployment, Scaling, Maintainability, Security
By Marcos Florencio

**Disclosure: A few ideas here were taken from a paper I am currently working on related to active learning in the context of generative AI applications.**

TL;DR: Generative AI applications powered by deep learning require a new set of tools and processes to be deployed, scaled, maintained and secured properly. This text discusses some of the challenges and opportunities in building such applications.

## Introduction

Foundational models will progressively be incorporated into virtually all applications, regardless of their target domain. This will lead to a new generation of software that is able to learn from its users and improve over time, giving rise to new design patterns and operations. A possible architecture for these applications is composed of five layers`[1]`:

1. General AI models: these are the foundational models that are trained on large, public datasets.
2. Specific AI models: these are trained on narrow data to outperform general models in specific use cases.
3. Hyperlocal AI models: these are trained on local, proprietary data.
4. Generative OS or API layer: this layer helps the application to access all the AI models required to solve a problem and is responsible for orchestrating the data flow between them.
5. Applications layer: this is the user-facing layer and should have powerful network effects and embedding characteristics.

Layers one and two are easily commoditized, as they are trained on public data. Layer three is the most defensible, as it is trained on proprietary data. Layer four is the most valuable, as it is the layer that orchestrates the data flow between the models and reflects the business's domain knowledge and value proposition. Layer five should have powerful network effects and embedding characteristics, as similar as possible to those of a social network.

Machine learning applications have historically been used in the context of supervised learning, where the model is trained on a labeled dataset. In order to deploy a generative AI application powered by deep learning, it is necessary to have an infrastructure that is able to support the following tasks:

1. Prompt engineering and pre-processing data with custom embeddings: in which a general model is used to generate completion candidates, which are then used to generate custom embeddings after user validation.
2. Fine-tuning foundational models: in which foundational models are continuously fine-tuned on the data being collected from the users, in order to improve their performance over time.
3. Post-processing results to enforce ODD: large foundational models hallucinate and generate results that are out of distribution. In order to measure correctness, it is necessary to enforce ODD (operational design domain), which is the set of attributes that the model is expected to perform well on. This can be achieved through uncertainty estimation, which consists of analyzing uncertain data points to help the model better learn about the data (see the error analysis step under Scaling) `[2, 3]`.
4. Infrastructure to support end-to-end validation: in which the application is instrumented to collect and validate data that will be fed into the models, as well as to orchestrate the data flow among different models.
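
The uncertainty estimation mentioned in item 3 can be sketched as follows. This is a minimal illustration, assuming the model can produce several stochastic class-probability vectors per input (for example via Monte Carlo dropout, in the spirit of `[2]`). The function names and the sample pool are hypothetical, and predictive entropy is only one of several possible uncertainty scores.

```python
import math
from typing import Dict, List, Sequence


def predictive_entropy(prob_samples: Sequence[Sequence[float]]) -> float:
    """Entropy (in nats) of the mean class distribution over stochastic passes."""
    n_passes = len(prob_samples)
    n_classes = len(prob_samples[0])
    mean_probs = [
        sum(sample[c] for sample in prob_samples) / n_passes
        for c in range(n_classes)
    ]
    return -sum(p * math.log(p) for p in mean_probs if p > 0.0)


def most_uncertain(pool: Dict[str, Sequence[Sequence[float]]], k: int) -> List[str]:
    """Return the ids of the k inputs with the highest predictive entropy."""
    ranked = sorted(
        pool,
        key=lambda item_id: predictive_entropy(pool[item_id]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical pool: each id maps to class probabilities from three stochastic passes.
pool = {
    "a": [[0.98, 0.02], [0.97, 0.03], [0.99, 0.01]],  # confident
    "b": [[0.55, 0.45], [0.40, 0.60], [0.50, 0.50]],  # uncertain
    "c": [[0.80, 0.20], [0.85, 0.15], [0.75, 0.25]],
}
queue_for_review = most_uncertain(pool, k=2)
```

The inputs in `queue_for_review` are the ones a human would validate first, so annotation effort goes where the model is least sure.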

Given a fixed budget for application development, some ingenuity is required in order to build a system that is correct enough to be safe. The main challenge is to be able to circumvent the need to acquire annotated data to train a model to perform well on a specific task while retaining maximum generalization performance. This is known as the data annotation bottleneck and is related to the correctness spectrum problem`[4]`.

The correctness problem has been studied in the context of self-driving cars, where the goal is to build a system that is correct enough to be safe. In that context, correctness is defined in terms of the probability that the system fails in a given scenario. This relates to the distinction between known unknowns, the scenarios the system is expected to encounter in the real world, and unknown unknowns, the scenarios it is not expected to encounter. If all "knowns" are defined and guardrails are built to guard against any "unknown unknowns", then it is possible to achieve a high bar of correctness for specific use cases.

This gives rise to the concept of operational design domain (ODD), which is the set of scenarios that the system is expected to encounter in the real world. By bounding the application to an ODD, it becomes much easier to test and validate the correctness of a pipeline. The sections below discuss some architectural decisions that can be made to build a system that is correct enough to be safe.

## Deployment

For deploying Generative AI applications powered by deep learning, the primary focus should lie on utilizing the foundational models (both general and specific) to generate potential outcomes, followed by gathering user feedback to refine these outputs. This feedback loop aids in creating custom embeddings, which are crucial for enhancing the application performance over time. In this context, the application layer (GUI or otherwise) serves as the interface where users interact, providing feedback, which in turn helps refine the AI models and their outputs.

In terms of architecture, the API layer could be deployed using a container orchestration tool such as Kubernetes or in a serverless fashion, both of which allow the application to run on a cloud provider such as Azure. All endpoints containing Generative AI logic should use output parsers, which are responsible for post-processing the results to enforce ODD. This is particularly important considering that AI models can hallucinate or generate output in unexpected formats.
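
An output parser of the kind described above might look like the following sketch. It assumes the model was prompted to return a JSON object with `category` and `confidence` fields and that the ODD is a closed set of categories. The field names, the `ALLOWED_CATEGORIES` set, and the `parse_completion` helper are hypothetical and do not belong to any specific library.

```python
import json
from dataclasses import dataclass

ALLOWED_CATEGORIES = {"travel", "meals", "software"}  # hypothetical ODD


class OutOfDomainError(ValueError):
    """Raised when a completion is malformed or falls outside the ODD."""


@dataclass(frozen=True)
class ParsedOutput:
    category: str
    confidence: float


def parse_completion(raw: str) -> ParsedOutput:
    """Parse a raw model completion and reject anything outside the ODD."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise OutOfDomainError(f"not valid JSON: {exc}") from exc
    if not isinstance(payload, dict):
        raise OutOfDomainError("expected a JSON object")

    category = payload.get("category")
    confidence = payload.get("confidence")
    if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
        raise OutOfDomainError(f"category outside the ODD: {category!r}")
    # bool is a subclass of int, so it is excluded explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise OutOfDomainError("confidence must be a number")
    if not 0.0 <= confidence <= 1.0:
        raise OutOfDomainError("confidence must be in [0, 1]")
    return ParsedOutput(category=category, confidence=float(confidence))


parsed = parse_completion('{"category": "meals", "confidence": 0.91}')
```

An endpoint would catch `OutOfDomainError` and either retry the generation or return a controlled error, so malformed or hallucinated output never reaches the caller.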

## Scaling

Scaling within GenAI Ops involves expanding the scope of AI models from generalized to hyperlocal ones. This transition focuses on leveraging proprietary data to fine-tune foundational models for specific, localized use cases. The emphasis here is on adapting foundational models to better suit the unique attributes and requirements of a particular domain.

Several steps are involved in scaling:

1. Collecting localized data: this involves the procedures put in place to collect data using prompt engineering and pre-processing with custom embeddings, which should be available from the inception of the application, as discussed above.

2. Training foundational models on this proprietary data to create hyperlocal models: this involves fine-tuning the foundational models on the localized data to create hyperlocal models that are better suited to the specific domain. This can be done using frameworks such as PyTorch, TensorFlow, or JAX, and platforms for open-source tools and models, such as Hugging Face.

3. Error analysis: a simple technique with outsized benefits is to sort the validation samples by model loss. In libraries such as Hugging Face Transformers, when the label is passed during the forward pass, the loss is automatically calculated and returned. The goal of this exercise is to detect wrong labels or quirks of the dataset. This information can, in turn, help refine the dataset, which might contain mislabeled samples or require a new class altogether. This can often lead to a performance gain as big as (or bigger than) that of having more data or larger models.
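
The loss-sorting step can be sketched in a few lines. This is a minimal illustration that assumes per-sample class probabilities have already been collected from the model; the `Sample` type, the helper names, and the expense examples are hypothetical.

```python
import math
from typing import List, NamedTuple, Sequence


class Sample(NamedTuple):
    text: str
    label: int               # annotated class index
    probs: Sequence[float]   # predicted class probabilities


def cross_entropy(sample: Sample, eps: float = 1e-12) -> float:
    """Negative log-likelihood of the annotated label."""
    return -math.log(max(sample.probs[sample.label], eps))


def rank_by_loss(samples: Sequence[Sample]) -> List[Sample]:
    """Highest-loss samples first: candidates for mislabels or dataset quirks."""
    return sorted(samples, key=cross_entropy, reverse=True)


validation = [
    Sample("taxi to airport", 0, [0.95, 0.05]),
    Sample("team lunch", 1, [0.10, 0.90]),
    Sample("hotel minibar", 0, [0.08, 0.92]),  # possibly mislabeled
]
worst_first = rank_by_loss(validation)
```

Reviewing the head of `worst_first` by hand is usually enough to spot systematic labeling problems.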

Deep learning models are exceptionally good at finding and exploiting shortcuts to get to a prediction. For this reason, it is also worth investigating the predictions the model is most confident about, to check that the model is not improperly exploiting random features in the text.
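
One way to start this investigation is sketched below, assuming predictions are available as text plus class probabilities. Counting frequent tokens among the most confident predictions is a crude heuristic chosen here for illustration, not an established method, and all names and data are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Prediction = Tuple[str, Sequence[float]]  # (input text, class probabilities)


def most_confident(preds: Sequence[Prediction], k: int) -> List[Prediction]:
    """Top-k predictions by maximum class probability."""
    return sorted(preds, key=lambda p: max(p[1]), reverse=True)[:k]


def frequent_tokens(preds: Sequence[Prediction], n: int) -> List[Tuple[str, int]]:
    """Most common whitespace-separated tokens among the given predictions."""
    counts = Counter(tok for text, _ in preds for tok in text.lower().split())
    return counts.most_common(n)


preds = [
    ("invoice #123 uber ride", [0.99, 0.01]),
    ("invoice #456 uber eats", [0.98, 0.02]),
    ("conference ticket", [0.60, 0.40]),
    ("invoice #789 uber trip", [0.97, 0.03]),
]
top = most_confident(preds, k=3)
suspects = frequent_tokens(top, n=2)
```

Tokens that dominate the confident predictions, such as a vendor name in this toy data, are worth checking by hand to see whether the model relies on them instead of the actual content.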

Alas, unlike software development and release, which have mature continuous integration and continuous delivery (CI/CD) pipelines, many AI-related operations are often performed manually. This hinders the ability to scale: manual operations are error-prone and hard to standardize, and the team can only grow linearly while throughput tends to grow exponentially.

In order to scale, GenAI Ops can use AI models to automatically analyze large volumes of telemetry data to detect and diagnose incidents, scaling its capabilities with growing demand. This process of adoption should happen in a continuum, where the AI models are progressively incorporated into the system, starting with the most critical components and gradually expanding to other areas.

## Maintainability

Maintaining Generative AI applications involves managing the orchestration of data flow between different layers of AI models. The API Layer plays a pivotal role in ensuring seamless communication and integration between these models. It streamlines data processing, facilitates model interactions, and supports the overall system's functioning.

Key components for maintainability:

1. Developing robust APIs that efficiently manage data flow between various AI model layers.
2. Ensuring compatibility and scalability within the API infrastructure to accommodate potential future model expansions or changes.

Maintainability also includes establishing a comprehensive validation system that spans the entire AI application pipeline. End-to-end validation involves validating data inputs, model predictions, and outputs at every stage of the application, guaranteeing the integrity and reliability of the system.

Actions for maintainability:

1. Implementing robust validation mechanisms throughout the entire pipeline, including data input validation, model performance validation, and output verification.
2. Regularly auditing and updating validation protocols to maintain system integrity.
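
The staged validation described above could be wired together as in the sketch below. It assumes a classifier that returns a label and a confidence score. The stage names, the length limit, the confidence threshold, and `stub_model` are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Callable, List, Set, Tuple


@dataclass
class StageResult:
    stage: str
    ok: bool
    detail: str = ""


def validate_input(text: str, max_chars: int = 500) -> StageResult:
    ok = bool(text.strip()) and len(text) <= max_chars
    return StageResult("input", ok, "" if ok else "empty or too long")


def validate_prediction(confidence: float, threshold: float = 0.7) -> StageResult:
    ok = confidence >= threshold
    return StageResult("prediction", ok, "" if ok else "low confidence")


def validate_output(label: str, allowed: Set[str]) -> StageResult:
    ok = label in allowed
    return StageResult("output", ok, "" if ok else f"unexpected label {label!r}")


def run_pipeline(
    text: str,
    model: Callable[[str], Tuple[str, float]],
    allowed: Set[str],
) -> List[StageResult]:
    """Run each check in order and stop at the first failing stage."""
    results = [validate_input(text)]
    if not results[-1].ok:
        return results
    label, confidence = model(text)
    results.append(validate_prediction(confidence))
    if results[-1].ok:
        results.append(validate_output(label, allowed))
    return results


def stub_model(text: str) -> Tuple[str, float]:
    """Stand-in for a real model call."""
    return ("meals", 0.9)


report = run_pipeline("team lunch", stub_model, {"meals", "travel"})
```

Because each `StageResult` records where a request failed, the same records can be logged and audited when the validation protocols are reviewed.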

## Security

It is possible to reverse engineer embedding vectors and recover an approximation of the original content. I believe not all companies have realized this and its possible consequences yet. This is a significant security risk, as information sent to external model providers and services in unencrypted form can potentially be leaked. As such, it is important to build applications that can run entirely inside a VPC (virtual private cloud), so that data is not sent to external services. This mitigates the risk of data leakage and improves security and privacy, which also helps with GDPR compliance.

Another relevant source of risk is unexpected scenarios, which can be mitigated by enforcing the Operational Design Domain (ODD). By defining the scope within which the AI system operates, it becomes more manageable to anticipate potential threats and vulnerabilities. This can be achieved by defining and enforcing strict operational protocols that identify and flag any anomalies or deviations from the defined ODD, so that the system operates within the specified boundaries and any security breaches or deviations are detected and addressed promptly.

According to Cheng et al., key GenAI Ops security measures include incident detection, failure prediction, root cause analysis and automated actions`[5]`. In the article, they offer a plethora of recommendations for each of these measures but, for the sake of brevity, I will only discuss incident detection, which I believe to be the most relevant.

The most common telemetry data available to identify and act upon incidents are logs, metrics, and traces. Logs are the most common source, as they are the most flexible and can capture any type of information. Metrics capture information about the system's performance, such as CPU and memory usage. Traces capture information about the system's behavior, such as the sequence of events that led to a particular outcome. Moreover, multi-modal anomaly detection techniques can be employed if multiple telemetry data sources are involved in the detection process (Cheng et al., page 6). These data sources can be consumed by deep learning models specifically designed to detect anomalies.
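
To make incident detection on metrics concrete, the sketch below flags outliers with a trailing-window z-score. This is a deliberately simple statistical stand-in, not one of the deep learning detectors surveyed by Cheng et al.; the window size, the threshold, and the CPU readings are hypothetical.

```python
from collections import deque
from statistics import mean, pstdev
from typing import Deque, Iterable, List, Tuple


def detect_anomalies(
    values: Iterable[float],
    window: int = 5,
    threshold: float = 3.0,
) -> List[Tuple[int, float]]:
    """Flag points more than `threshold` standard deviations away from
    the mean of the preceding `window` observations."""
    history: Deque[float] = deque(maxlen=window)
    flagged: List[Tuple[int, float]] = []
    for index, value in enumerate(values):
        if len(history) == window:  # only score once the window is full
            mu = mean(history)
            sigma = pstdev(history)
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                flagged.append((index, value))
        history.append(value)
    return flagged


cpu_percent = [41, 43, 40, 42, 44, 41, 43, 97, 42, 40]  # hypothetical metric samples
incidents = detect_anomalies(cpu_percent)
```

Here the spike at index 7 is the only flagged point. In a real system the flagged points would be correlated with logs and traces before an incident is raised.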

To enhance security, it is important to incorporate real-time monitoring and response mechanisms to address security threats. The system should also be able to automatically process incident evidence so that it can serve as training data for future models.

## Conclusion

By addressing these aspects of Deployment, Scaling, Maintainability, and Security within the GenAI Ops framework, organizations can effectively harness the potential of Generative AI while mitigating risks and ensuring the robustness and reliability of their applications.

## Example code available at:

[1] https://github.com/themarcosf/inteli-year2-project4-ai: This repository contains the code for fine-tuning a base foundational model using proprietary data (check the `nn` folder). The project consists of building a Generative AI application powered by deep learning to better categorize expenses. The application is composed of two layers: a hyperlocal model, which is responsible for classifying the expenses, and an API layer that orchestrates the data flow between the models.

[2] ai-cookbook

[3] skill-assist

## References:

[1] James Currier, 2022. Generative AI Market Map and 5-Layer Tech Stack. URL: https://www.nfx.com/post/generative-ai-tech-5-layers#The-5--Layer-Generative-Tech-Stack

[2] Yarin Gal, Riashat Islam, and Zoubin Ghahramani. Deep bayesian active learning with image data. In International Conference on Machine Learning, pages 1183–1192. PMLR, 2017.

[3] Donggeun Yoo and In So Kweon. Learning loss for active learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 93–102, 2019.

[4] Morgan Beller, 2023. The AI Startup Litmus Test. URL: https://www.nfx.com/post/ai-startup-litmus-test

[5] Qian Cheng, et al., 2023. AI for IT Operations (AIOps) on Cloud Platforms: Reviews, Opportunities and Challenges. arXiv preprint arXiv:2304.04661. URL: https://arxiv.org/pdf/2304.04661.pdf