# Retrieval Augmented Generation System Report

## System Design
By revisiting our original requirements below, we can confirm that the system meets the requirements.
1. The system should achieve a high relevancy on domain-specific (even institution-specific) information
    * This was achieved by utilizing retrieval augmented generation, where relevant context is obtained from a corpus of specific information and used to generate a response. This allows the system's answers to be highly relevant and even include institution-specific information taken from the corpus of documents.
2. The system should be able to adapt to dynamic/growing information (Add new documents or new information and Remove or modify stale information)
    * This was achieved by implementing the `PUT /document` and `DELETE /document` endpoints to allow users to add and remove documents from the corpus of information.
3. The system should have the ability to process through billions of documents
    * This was achieved by using an indexing strategy on the corpus of documents to make searching, inserting, and removing documents more efficient. In real-world situations, more advanced indexing strategies and databases may be used to further improve scalability.
4. The system should have the ability to cite information sources
    * This was achieved by returning the context information along with the answer to a question in the response from the `POST /question` endpoint.

A diagram of the entire system, from the Module 10 lecture slides, can be seen below. There are four main services: the extraction service, the retrieval service, the generator service, and the interface service. The extraction service processes the corpus documents, splits them into chunks of sentences, and extracts embedding vectors from each chunk, saving the embeddings to NumPy files. The extraction service applies a KD-Tree indexing strategy to store the text embeddings in an organized way that makes the search process more efficient. The retrieval service performs a nearest neighbors search, using the KD-Tree index, to find the document chunks that are most relevant to a given input question. The generator service uses the retrieved context (document chunks) and the input question to generate a response to the question. Finally, the interface service is how the outside world interacts with the system: by providing questions as text input, adding and removing documents from the corpus, or making API calls to retrieve information such as logs or document files.

![diagram](../assets/images/diagram.png)

## Data, Data Pipelines, and Model

### Data
There are four main types of data: corpus documents, embedding vectors, questions, and question logs. The 10 corpus documents that were provided for this case study are stored in the `storage/corpus` folder. A set of 109 validation questions with known answers, used for the analyses in [design_considerations.ipynb](./design_considerations.ipynb), is stored in `qa_resources/questions.csv`.  When the system starts, the embeddings for each document chunk in the corpus are precomputed and saved to `.npy` files in the `storage/embeddings` directory, inside the folder that corresponds with the model name and sentences per chunk value used to compute the embeddings. Finally, a `.json` file is stored in the `storage/logs` folder each time a question is asked using the `POST /question` endpoint. The log contains the question, the answer, timestamp, and the context chunks used to generate the answer.
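
As an illustration of the question log described above, the following is a minimal sketch of how one record could be written. The function name `write_question_log`, the JSON field names, and the timestamp-based file name are assumptions made for this example, not the project's actual implementation; only the logged contents (question, answer, timestamp, context chunks) and the `storage/logs` location come from the description above.

```python
import json
import time
from pathlib import Path


def write_question_log(log_dir, question, answer, context_chunks):
    """Write one question/answer record to a JSON file and return its path.

    Field names and the file naming scheme are illustrative assumptions.
    """
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)

    record = {
        "question": question,
        "answer": answer,
        "timestamp": time.time(),  # seconds since the epoch
        "context": list(context_chunks),
    }

    # A nanosecond timestamp in the file name makes collisions unlikely.
    log_path = log_dir / f"question_{time.time_ns()}.json"
    log_path.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return log_path
```

For example, `write_question_log("storage/logs", question, answer, chunks)` would create one new `.json` file per call.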

### Data Pipelines
There are two main data pipelines: one used to precompute the document embeddings when the system starts, and one used to generate answers to questions while the system is running. The pipeline that is used to precompute the embeddings involves only the extraction service, which includes the preprocessing, embedding, and indexing modules. It starts by using the `DocumentProcessing` class to read the documents and divide them into chunks of sentences. Next the `Embedding` class is used to calculate the embedding vector for each chunk, and save the vectors to the `storage/embeddings` folder. Finally, the embedding vectors are organized using the `KDTree` class to index the embedding vectors in a way that makes the search process quicker than a brute force approach, since entire branches of the search space can be disregarded during the search process. This KDTree index will be used by the retrieval service in the next data pipeline.
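
The chunking step of this pipeline can be sketched as follows. This is a simplified stand-in for the `DocumentProcessing` class, whose code is not shown here: the function name `split_into_chunks` is hypothetical, and sentence boundaries are approximated with a regular expression rather than a proper sentence tokenizer.

```python
import re


def split_into_chunks(text, sentences_per_chunk=6):
    """Split text into chunks of at most `sentences_per_chunk` sentences.

    Sentence boundaries are approximated as '.', '!' or '?' followed by
    whitespace; a real system might use a dedicated sentence tokenizer.
    """
    if sentences_per_chunk < 1:
        raise ValueError("sentences_per_chunk must be at least 1")

    # Split after sentence-ending punctuation, dropping empty fragments.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    # Group consecutive sentences; the last chunk may be shorter.
    return [
        " ".join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]
```

Each returned chunk would then be passed to the embedding model, and the resulting vectors indexed.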

The second pipeline is used each time the question answering process is triggered using the `POST /question` endpoint. This involves the extraction, retrieval, and generator services. This pipeline starts by using the `Embedding` class to compute the embedding vector for the question and then the `KDTreeSearch` class to perform a nearest neighbors search on the KDTree created in the first data pipeline. This search returns the k embedding vectors that are the closest to that of the question in embeddings space. These vectors correspond to the document chunks that are most semantically similar to the question. Based on the analysis done in [design_considerations.ipynb](./design_considerations.ipynb), the k value used in the deployed system was 10. The text associated with those document chunks and the original question are passed to the generator service, which generates an answer to the given question using the provided context. The answer and the context chunks used to generate the answer are then returned to the user and also logged along with the original question and timestamp in a `.json` file.
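
To make the retrieval step concrete, here is a small pure-Python sketch of a KD-tree build and an exact k-nearest-neighbors query using Euclidean distance. It is not the project's `KDTree` or `KDTreeSearch` code: the function names `build_kdtree` and `knn_search` and the tuple-based node layout are assumptions chosen for brevity. It does show the idea mentioned above, that a whole branch can be skipped when it cannot contain a closer point.

```python
import heapq
import math


def build_kdtree(points, indices=None, depth=0):
    """Recursively build a KD-tree; each node is (index, axis, left, right)."""
    if indices is None:
        indices = list(range(len(points)))
    if not indices:
        return None
    axis = depth % len(points[0])  # cycle through the dimensions
    indices = sorted(indices, key=lambda i: points[i][axis])
    mid = len(indices) // 2  # the median point becomes this node
    return (
        indices[mid],
        axis,
        build_kdtree(points, indices[:mid], depth + 1),
        build_kdtree(points, indices[mid + 1:], depth + 1),
    )


def knn_search(tree, points, query, k):
    """Return the indices of the k points nearest to `query`, closest first."""
    if k < 1:
        raise ValueError("k must be at least 1")
    heap = []  # max-heap via negated distances: (-distance, index)

    def visit(node):
        if node is None:
            return
        index, axis, left, right = node
        dist = math.dist(points[index], query)
        if len(heap) < k:
            heapq.heappush(heap, (-dist, index))
        elif dist < -heap[0][0]:
            heapq.heapreplace(heap, (-dist, index))

        diff = query[axis] - points[index][axis]
        near, far = (left, right) if diff < 0 else (right, left)
        visit(near)
        # Only descend the far branch if it could still hold a closer point.
        if len(heap) < k or abs(diff) < -heap[0][0]:
            visit(far)

    visit(tree)
    return [index for _, index in sorted(heap, reverse=True)]
```

In the deployed pipeline the points would be the chunk embeddings and the query would be the question embedding, with k=10; the returned indices map back to the chunk text passed to the generator.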

### Model
There are two different models used in this system: one is used for extracting embedding vectors from text, and the other is used to generate a human-like response to a given question. The embedding model used for this case study is the `all-mpnet-base-v2` Hugging Face Sentence Transformer model described (along with other model versions) here: [https://sbert.net/docs/sentence_transformer/pretrained_models.html](https://sbert.net/docs/sentence_transformer/pretrained_models.html). The `all-mpnet-base-v2` version was chosen for deployment because the Hugging Face documentation states that it provides the best quality. If better efficiency is needed, the `all-MiniLM-L6-v2` version could be used instead. Additionally, different embedding models could be analyzed using the process described in [design_considerations.ipynb](./design_considerations.ipynb) to determine the best embedding model for our use case.

The generator model used in this case study is a version of a Bidirectional Encoder Representations from Transformers (BERT) model. The versions available from Hugging Face are listed on their [pretrained models page](https://huggingface.co/models?sort=trending&search=bert), but the main general versions are BERT Base (`google-bert/bert-base-uncased`) and BERT Large (`google-bert/bert-large-uncased`). Another version of the BERT Large model, fine-tuned with a question answering dataset, is `google-bert/bert-large-cased-whole-word-masking-finetuned-squad`. This was the version used in the deployed system because it had the best performance in the analysis described in [design_considerations.ipynb](./design_considerations.ipynb).

## Metrics Definition
Most of the metrics tracked by the system are offline metrics due to time and feasibility constraints of this Case Study project. In the real world, cloud resources such as AWS CloudWatch, Datadog, and Splunk could be used to obtain more online metrics. These could be configured to include things like API errors and traffic, latency, and duration.

### Offline Metrics
The exact match and transformer match metrics have been implemented in this case study, and are used mainly by the analyses in [design_considerations.ipynb](design_considerations.ipynb). The average precision @ k, average recall @ k, mean average precision @ k, and mean reciprocal rank @ k have been implemented in the previous case study, and could be used again here to help evaluate the context retrieval service, if we have ground truth answers to questions and/or a way of defining whether a retrieved context chunk is a positive or negative outcome. Additionally, [Ragas metrics](https://docs.ragas.io/en/stable/) could also be calculated to more specifically evaluate the context retrieval and answer quality as described [here](https://towardsdatascience.com/evaluating-rag-pipelines-with-ragas-5ff28aa27984). The Flask application doesn't include an endpoint to directly get these metrics since that was not a specified requirement of the case study, and we don't have the correct ground truth answers for real world question data. In a real system, an endpoint to get these metrics would be useful, but would require implementing a way to collect ground truth answers for each question asked.
1. **Exact Match**: This metric gives an idea of how well a generated answer to a question matches the ground truth answer. However, the answer pair is only considered a match if every character matches exactly, which is often too strict for many natural language tasks.
2. **Transformer Match**: This metric gives a better idea of how well a generated answer to a question matches the ground truth answer. It tries to determine matches based on semantic similarity, which allows for more flexibility with things like phrasing, paraphrasing, and synonyms in the generated answer.
3. **Average Precision @ k**: This metric is tracked so we can identify how useful a model is for reducing the risk of false positives, i.e. document chunks that the system thinks are relevant, but actually aren't. A low false positive rate is an important aspect of the RAG system because we want to minimize the rate of incorrect answers or hallucinations, which make the system appear unreliable. This metric is also necessary in order to calculate mAP (discussed below). However, it's important to note that this metric only considers the number of positive matches, and not the ranking of those matches.
4. **Average Recall @ k**: This is an important metric for ensuring that we're retrieving the document chunks that are truly relevant, allowing us to maximize true positives and correct answers. This is important so that questions with known answers in the corpus documents are answered correctly. However, it's important to note that this metric doesn't consider the rank of the positive matches, only the number of positive matches in the entire dataset (which could potentially be very large).
5. **Mean Average Precision @ k**: Since precision @ k and recall @ k both ignore the ranking of positive matches, Mean Average Precision may be a more robust measure of performance. The mAP @ k value takes into account both the number of positive context matches and their rankings, which can be useful when comparing different model versions and tuning extraction and retrieval parameters.
6. **Mean Reciprocal Rank @ k**: This metric is simple and straightforward since it's based only on the rank of the first positive match in a list of predicted matches. However, it only considers the rank of the first positive match, and ignores all other potential positive matches in the list, making it not ideal for comparing the performance of different models, especially since there could potentially be multiple positive context chunk matches for a given input question.
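
The ranking metrics above can be sketched in a few lines, given a list of 0/1 relevance labels for the retrieved chunks in ranked order. These are common textbook formulations and may differ in detail (for example, in how average precision is normalized) from the implementations in the previous case study; all function names are hypothetical. Transformer match is omitted because it depends on an embedding model.

```python
def exact_match(predicted, truth):
    """1 if the two answers are character-for-character identical, else 0."""
    return int(predicted == truth)


def precision_at_k(relevance, k):
    """Fraction of the top-k retrieved chunks that are relevant."""
    return sum(relevance[:k]) / k if k > 0 else 0.0


def recall_at_k(relevance, k, total_relevant):
    """Fraction of all relevant chunks that appear in the top k."""
    if total_relevant == 0:
        return 0.0
    return sum(relevance[:k]) / total_relevant


def average_precision_at_k(relevance, k):
    """Mean of precision@i over the ranks i <= k that hold a relevant chunk."""
    hits, total = 0, 0.0
    for rank, is_relevant in enumerate(relevance[:k], start=1):
        if is_relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def reciprocal_rank_at_k(relevance, k):
    """1 / rank of the first relevant chunk in the top k, or 0 if none."""
    for rank, is_relevant in enumerate(relevance[:k], start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0


def mean_over_queries(metric, relevance_lists, k):
    """Average a per-query ranking metric over several queries."""
    return sum(metric(r, k) for r in relevance_lists) / len(relevance_lists)
```

Under these definitions, `mean_over_queries(average_precision_at_k, ...)` gives mAP @ k and `mean_over_queries(reciprocal_rank_at_k, ...)` gives MRR @ k.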

### Online Metrics
These metrics were not implemented for this case study due to time and feasibility constraints, but given more time to develop a real-world system, these metrics could be very useful for determining how well the deployed system is performing.
1. **Latency**: Keeping track of the time that it takes for the system to produce an answer upon receiving a question will be important for ensuring that the system remains performant and scalable, even as the number of corpus documents grows. The system should still be able to produce answers quickly so that users do not have to wait long periods of time to get an answer to a simple question.
2. **CPU Utilization**: Keeping track of CPU utilization would be important to ensure that the API interface responds adequately to all users and is not overloaded. If CPU utilization is too high, it may result in performance degradation leading to increased latency. Additionally, increased utilization may indicate an attempted misuse of the system by someone who may be purposefully trying to overload the system. Sending CPU utilization alerts, as well as including a load balancer in the system architecture, will help ensure consistent system performance and load.
3. **Faithfulness**: This is a [Ragas](https://docs.ragas.io/en/stable/) metric that doesn't require a ground truth answer, so could be calculated for every input question. It uses the generated answer and retrieved context to measure how factual a generated answer is compared to the retrieved context, which can be useful to ensure that the system is using its retrieved context properly.
4. **Answer Relevancy**: This is another [Ragas](https://docs.ragas.io/en/stable/) metric that doesn't require a ground truth answer. It uses the original question, retrieved context, and generated answer to determine how relevant the generated answer is to the original question. This may be useful because if the answers are not relevant, this may indicate that corpus documents need to be added or removed to help create more relevant answers by providing more relevant context.
5. **Context Precision**: This is another [Ragas](https://docs.ragas.io/en/stable/) metric that doesn't require a ground truth answer. It uses the original question, retrieved context, and generated answer to determine how useful the retrieved context was for generating an answer to the original question. If this metric is low, that indicates that the corpus documents may need to be adjusted since the retrieved context is not very useful to the generator.
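
Although latency tracking was not implemented, a minimal sketch of how it could be measured in-process is shown below. The `timed` decorator and `percentile` helper are hypothetical names introduced for this example; a production system would more likely rely on a monitoring service such as those mentioned earlier.

```python
import math
import time
from functools import wraps


def timed(func):
    """Wrap `func` so that it returns (result, elapsed_seconds)."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        return result, time.perf_counter() - start
    return wrapper


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]
```

Wrapping the question answering function with `timed` and collecting the elapsed times would allow, for example, a 95th percentile latency to be reported alongside the mean.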

## Analysis of System Parameters and Configurations
Exploratory data analysis was performed in [data_analysis.ipynb](data_analysis.ipynb), which led to the discovery of corpus embeddings generally clustered by document, with question and answer embeddings generally near their corresponding documents. Additionally, the analysis showed uneven document sizes and relevant documents among the validation questions. More detail is in that file, but the general conclusion was that the `all-mpnet-base-v2` embedding model seemed to work fairly well for clustering similar document chunks together, indicating potential for good context retrieval results.

The following design considerations and analyses were performed and described in [design_considerations.ipynb](design_considerations.ipynb), and are described again here. References are cited in that file as well.

### Design Considerations
1. **Sentences per Chunk**: One important factor to consider is the number of sentences per chunk, which is used by the `DocumentProcessing` class to split the corpus documents into smaller chunks before the pipeline calculates the embedding vector for each chunk. During the retrieval and generation process, these chunks will be used as context and provided as input to the generator LLM that generates the answer to the question. Therefore, the number of sentences in these chunks will directly affect how much contextual information the generator will have available to generate a response. A high number of sentences per chunk provides the generator with lots of contextual information, which might be useful for complex questions. However, too much information may result in irrelevant information being provided to the generator, which may confuse the model and produce unreliable answers. A small number of sentences per chunk could decrease the chances of providing irrelevant information, but could also decrease the chances of including the true answer in the context. **Analysis Plan:** To analyze the effects of different numbers of sentences per chunk, a validation set of questions, like the ones in [qa_resources/questions.csv](../qa_resources/questions.csv), could be used to test how the RAG system performs when the corpus documents have been processed using different numbers of sentences per chunk. For example we could use sentence per chunk values of [1,2,3,4,5,6] and for each value, we can calculate the document embeddings using the specified sentences per chunk, and use those embeddings to retrieve the context to be used to generate an answer for each question in the validation set. The answers to each question can be evaluated using automated metrics such as exact match and transformer match metrics. 
Additionally, the time it takes to get the context and generate an answer for a question can be tracked to identify any potential differences in latency for different sentences per chunk values. The sentences per chunk value that results in the best balance of high transformer match and low latency would be chosen for deployment to the system. Transformer match metric is prioritized over exact match because it allows for some flexibility in phrasing, paraphrasing, or use of synonyms in the generated response.
2. **Similarity Measure**: Another important design consideration is the similarity measure used in the Nearest Neighbor search when calculating the distance between two embedding vectors during the context retrieval process. This is important because it can affect how relevant the retrieved context is to a given question. The distance metric must ensure that higher distances are associated with dissimilar text and lower distances are associated with similar text strings. Different distance metrics can lead to different results for the same points, so it's important to carefully choose the distance measure that results in the best system performance. Some popular distance metrics include Manhattan distance (aka "city block" distance), Euclidean distance (aka "straight line" distance), Minkowski distance (a generalized version of Manhattan and Euclidean distances), and cosine similarity, which measures the angle between two vectors. **Analysis Plan:** To analyze the different similarity measures, each variation (Manhattan distance, Euclidean distance, Minkowski distance with different values for the p parameter, and cosine similarity) can be used to retrieve context used to generate answers for a validation set of questions with known answers. The exact match and transformer match metrics can then be calculated for each run, so that the quality of the system outputs can be compared for the context search using each similarity measure. Additionally, the mean search time per question can be tracked so we can identify any differences in efficiency between the similarity measures. The similarity measure resulting in the best balance of transformer match and efficiency would be chosen for deployment. Note that for the analysis, the transformer match is calculated using the output of the generator that creates a response using the context identified using the KNN search.
This does not directly evaluate the outputs of the KNN search, but additional analysis could be done to evaluate the nearest neighbors directly. This could be done by calculating the quality of the ranked nearest samples by finding the mean average precision, where relevant matches are determined by the transformer match metric applied to the nearest neighbor chunk and the ground truth answer. This analysis would more directly determine how the similarity measure affects the relevance of the retrieved context chunks.
3. **Number of Nearest Neighbors (K)**: Another important design consideration is the number of nearest neighbors (k value) to be used in the search service when retrieving relevant context chunks for a given question. Lower k values may be more sensitive to noisy data (overlap between chunks of corpus documents in our case), but can also be useful to account for complex decision boundaries between clusters of document chunks. Higher k values produce results that are less affected by outliers or overlap in document chunk clusters, but can introduce bias from irrelevant sentences that may become included when we expand k. Similar to the sentences per chunk analysis, a small k value could decrease the chances of providing irrelevant information, but could also decrease the chances of including the true answer in the context. Ideally, k should be large enough to capture chunks containing the true answer to the question, but small enough to exclude irrelevant information as much as possible. **Analysis Plan:** To find the best k value, the search service can be run on a validation set of questions, using a different k value for each run. For each run, for each question, the k retrieved context chunks will then be used as input with the question to the generator that will create an answer to the question. The answers can be evaluated using the exact match and transformer match metrics for each run, so that the quality of the answers can be compared for each k value. Additionally, mean search time per question can be tracked so we can identify any differences in efficiency for different k values. The k value resulting in the best balance of transformer match metric and search time would be chosen for deployment.
4. **Embedding Model Selection**: Another important design consideration is the embedding model used to generate text embedding vectors from the corpus documents and input questions. The `all-MiniLM-L6-v2` sentence transformer model was provided in the case study as an example, with additional pretrained sentence transformer models described at [https://sbert.net/docs/sentence_transformer/pretrained_models.html](https://sbert.net/docs/sentence_transformer/pretrained_models.html). Each model produces different embeddings, leading to differences in cluster separability of text chunks and documents, leading to differing KNN search results, resulting in different contextual information and therefore generated responses. Models that produce embeddings that put text chunks of the same topic (i.e. from the same document) closer together, and text of different topics further apart (i.e. models that produce well-defined clusters of text) will ultimately lead to the most relevant context retrieved and best generated answer results. Additionally, the different embedding models all have different sizes, speeds, and accuracies. Typically, the larger models perform better but may be slower. Smaller models can be faster, but may perform worse. Furthermore, the models produce embedding vectors of different sizes, which is important to consider because it affects our system's storage requirements, since we are saving the embedding vectors for each document chunk. Additionally, smaller embedding vectors are more computationally efficient, but contain less information and therefore may not fully represent key aspects that differentiate each text chunk. Larger embedding vectors are more computationally intensive to search, but contain more information and may represent more complex aspects that differentiate text.
However, larger embedding vectors may also contain more extraneous, or noisy information unrelated to the text topic such as specific punctuation or phrasing information. Ideally, the embedding vectors should be large enough to capture key aspects of the text that are significant for differentiating between ideas, but small enough so they don't contain too much noisy information that isn't important for a sentence (e.g. punctuation or paraphrasing). **Analysis Plan:** To analyze and compare each model, embeddings can be computed for the corpus documents using a few different models such as `all-MiniLM-L6-v2`, `all-mpnet-base-v2`, `all-MiniLM-L12-v2`, and `paraphrase-MiniLM-L3-v2`. The clustering quality of the resulting embeddings can be measured using silhouette score. A higher silhouette score indicates better intra-cluster cohesion and inter-cluster separation. Therefore, a higher silhouette score indicates that using the model in our system will result in better clustering of document text leading to more relevant context retrieval and ideally better generated answers. To further support this, a validation set of known questions can be run through our RAG system using the embeddings produced by each of the 4 models, and the transformer match metric can be calculated for the generated answer produced using the context retrieved from the KNN search using the embedding vectors produced by each model. Additionally, response time per question can be measured to analyze any potential differences in efficiency of using the embeddings produced by each model. The model that leads to the highest transformer match value with an efficient response time is the one that should be chosen for deployment.
5. **Generator Model Selection**: Another important design consideration is the generator model used to produce an answer to a given question using the provided context. The models used for this case study are versions of Bidirectional Encoder Representations from Transformers (BERT) models. The versions available from Hugging Face are listed on their [pretrained models page](https://huggingface.co/models?sort=trending&search=bert), but the main general versions are BERT Base (`google-bert/bert-base-uncased`) and BERT Large (`google-bert/bert-large-uncased`). Another version of the BERT Large model, fine-tuned with a question answering dataset, is `google-bert/bert-large-cased-whole-word-masking-finetuned-squad`. BERT Large uses higher dimensional embeddings, more encoder blocks and attention heads per encoder block, and larger hidden layers, resulting in roughly 3x more parameters than BERT Base. Similar to the embedding models discussed above, typically the larger models perform better, but may be slower and require more storage space. Smaller models may be faster and require less space, but may not perform as well. The larger model may be better for more complicated language tasks, while the smaller one may be sufficient for simpler tasks. **Analysis Plan:** To analyze and compare each generator model, our validation set of questions with known answers can be used with their related context to generate responses using each of the generator model versions. The answers from each version can be evaluated by calculating the transformer match metric. Additionally, the time to generate each answer can be tracked to identify any potential differences in efficiency between the generator model versions. The model that results in the best transformer match metric and reasonable time efficiency should be chosen for deployment.
6. **Search Technique**: Choosing a search technique is an important design consideration for the retrieval service in the system. The search module is responsible for searching the database of known embeddings (from a corpus of documents) to find the most relevant context chunks to the input question. This search needs to be accurate so that we are able to provide the generator model with contextual knowledge that includes the information needed to generate a correct response, but the search also needs to be efficient so that users don't have to wait for long periods of time for the system to generate an answer to a simple question. Even if we decide that we'll use K-Nearest Neighbors for our search algorithm, there are still two main aspects of the search module to consider: exact vs. approximate search and indexing strategy. Implementing some kind of indexing is important because it allows the data to be organized in a way that makes searching, inserting, deleting, and retrieval more efficient. **(1) Exact vs. Approximate:** Exact searches like brute force can be highly accurate and can guarantee that the returned samples are the most relevant to the input question; however, this can be computationally expensive, especially if the embeddings database has many samples or high dimensionality. Approximate searches don't require visiting every single sample in the embeddings database, and therefore are more computationally efficient; however, they may be less accurate since they don't guarantee that the optimal solution is identified, only an adequate solution that's good enough. **(2) Indexing Strategies:** Indexing strategies organize the data in a way that improves efficiency for search and retrieval. Some strategies like KD trees can be applied to both exact and approximate searches, while others like locality-sensitive hashing (LSH) and hierarchical navigable small world graphs (HNSW) are intended for approximate searches.
KD trees partition the data by dimension into "hyperrectangles", allowing for more efficient search since some branches can be eliminated; however, fewer branches can be eliminated as dimensionality increases, making this less efficient for high dimensional data. LSH groups similar items into buckets by increasing the probability of hashing collisions for similar items, and is a good alternative for high dimensional data; however, it may result in false negatives due to its probabilistic nature. HNSW is an accurate, scalable and efficient solution for high dimensional data, where the data is organized into hierarchical layers, with coarse layers at the top and more fine-grained layers towards the bottom; however, storing a graph of all of the layers can be memory intensive. **Analysis Plan:** To analyze the different search techniques, each variation (brute force, exact KNN, KD trees, LSH, HNSW) can be run on a validation set of questions with known answers. The transformer match metric can be calculated for the generated answers produced using the context retrieved from the KNN search for each search technique. Additionally, the mean search time per question will be tracked, so we can determine which technique is the most efficient. The search technique resulting in the best balance of the transformer match metric and efficiency would be chosen for deployment.
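
The similarity measures named in the list above can be written down directly. The sketch below is a plain-Python illustration with hypothetical function names, not the code used by the `KDTreeSearch` class. Cosine similarity is converted to a distance (1 minus the similarity) so that, like the other measures, smaller values mean more similar text.

```python
import math


def minkowski_distance(a, b, p=2):
    """Minkowski distance of order p (p=1: Manhattan, p=2: Euclidean)."""
    if p < 1:
        raise ValueError("p must be at least 1")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def cosine_distance(a, b):
    """1 - cosine similarity, so that smaller values mean more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)
```

One relationship worth noting when comparing these measures: for vectors normalized to unit length, the squared Euclidean distance equals twice the cosine distance, so the two produce the same nearest-neighbor ranking on normalized embeddings.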

### Extraction Service
1. **Sentences per Chunk**: The first analysis is to compare different values of the sentences per chunk parameter used to split up the corpus documents, on the quality of the answers to the validation set of questions. Additionally, the time it takes to retrieve the most relevant documents and generate an answer is tracked to compare efficiency of the different sentence per chunk values. The results indicate that a sentence per chunk value of 6 is the optimal value of the ones tested since it resulted in the lowest average time per answer and the highest transformer match metric. Although the exact match metric was highest for a value of 4, the transformer match metric is prioritized over exact match because it allows for some flexibility in phrasing, paraphrasing, or use of synonyms in the generated response. For these reasons, I chose to use a sentences per chunk value of 6 for my deployed system.

### Retrieval Service
1. **Number of Nearest Neighbors (K)**: The next analysis is for determining the best number of nearest neighbors (k value) to use in the KNN search service used for retrieving the most relevant context chunks for a given question. The k values explored were [1, 2, 3, 4, 5, 10, 20, 25]. I went up to 25 because a common rule of thumb is to choose k equal to the square root of the number of "training" samples [Ref 6]. For our corpus, using sentences_per_chunk = 6 results in 512 total document chunks, and sqrt(512) is about 23, so I rounded it to 25 for my experiment. The `all-mpnet-base-v2` embedding model, BERT Large fine-tuned question answering model, sentences per chunk = 6, and Euclidean distance measure were used for this analysis. To measure the quality and efficiency of answer generation results for each value of k, the average time per question to get context and generate an answer was tracked, and the exact and transformer match metrics were calculated using the generated answer and ground truth answer. The results indicate that the average time per question generally increases as k increases, with a steeper increase from k=1 to k=5, and then a more gradual increase for larger k, especially from k=20 to k=25. The exact match and transformer match metrics both increase with k from k=1 to k=3, but then decrease at k=4, increasing again to a maximum value at k=10 before decreasing for k > 10. Since the exact match and transformer match metrics are both at their maximum at k=10, with reasonable efficiency, k=10 was used as the nearest neighbors value deployed to the system.
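
The square-root rule of thumb used above to set the upper end of the k range amounts to a one-line calculation. The helper name `rule_of_thumb_k` is hypothetical.

```python
import math


def rule_of_thumb_k(num_samples):
    """Rule-of-thumb k: the square root of the sample count, rounded."""
    return round(math.sqrt(num_samples))
```

For the 512 document chunks in this corpus, the helper returns 23, which was then rounded up to 25 for the experiment.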

### Generator Service
1. **Generator Model Selection**: The next analysis determines the effect of different question answering models on the quality and efficiency of the generated answers. BERT Large, BERT Base, and a BERT Large version fine-tuned for question answering are used to generate answers to the validation set of questions, and the exact match, transformer match, and time to generate answers are tracked. The results indicate that the BERT Large fine-tuned version gives the best answers based on the transformer match and exact match metrics; however, it also takes the second longest amount of time to generate an answer, with an average of about 0.18 seconds per answer. Although the BERT Base model was about 3 times faster, its transformer match and exact match metrics were over 3 times worse. Ultimately, since 0.18 seconds is a reasonable response time for a human using a chatbot, I chose the best performing model, BERT Large fine-tuned, for deployment to the system.
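
As a rough illustration of how an exact match score over a validation set could be computed, consider the sketch below. The normalization steps (lowercasing, removing punctuation, collapsing whitespace) are assumptions and may differ from the metric used in this analysis; the transformer match metric is not shown.

```python
import string
from typing import List


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace (assumed rules)."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def exact_match(prediction: str, truth: str) -> bool:
    """True when the normalized prediction equals the normalized ground truth."""
    return normalize(prediction) == normalize(truth)


def exact_match_rate(predictions: List[str], truths: List[str]) -> float:
    """Fraction of predictions that exactly match their ground truth answer."""
    if len(predictions) != len(truths):
        raise ValueError("predictions and truths must have the same length")
    if not predictions:
        return 0.0
    hits = sum(exact_match(p, t) for p, t in zip(predictions, truths))
    return hits / len(predictions)


if __name__ == "__main__":
    print(exact_match_rate(["The Library.", "in 1999"], ["the library", "2001"]))
```

Because exact match rewards only identical strings after normalization, a correct paraphrase scores zero, which is why a more flexible metric is prioritized in these analyses.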

### Overall System
Overall, each component of the system was chosen based on analysis results, including the k value, generator model, and sentences per chunk parameter. However, the system remains flexible and tunable, so its performance can be adjusted as part of future system enhancements. For example, the pipeline's `search_context()` function allows k to be passed in as a parameter, the `DocumentProcessing` class's `split_document()` function allows `sentences_per_chunk` to be passed as a parameter, the `KDTreeSearch` class allows a similarity measure to be passed, and the `Embedding` and `BERTQuestionAnswer` classes allow model names to be passed in to be used for the embedding model and generator model. This flexibility and parameterization will allow our system to be tuned and refined as needed, based on the results and potential issues identified during further testing or production use after deployment.
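
One possible way to keep these tunable values together is a single configuration object, sketched below. The `PipelineConfig` class and its field names are hypothetical and are not part of the existing codebase; the defaults mirror the values chosen in the analyses above, and the generator model name is left for the caller to supply.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical container for the tunable pipeline parameters."""

    generator_model: str  # name of the question answering model (caller-supplied)
    embedding_model: str = "all-mpnet-base-v2"
    sentences_per_chunk: int = 6
    k: int = 10
    similarity_measure: str = "euclidean"

    def __post_init__(self) -> None:
        # Fail early on values that could never produce a valid search.
        if self.k < 1:
            raise ValueError("k must be >= 1")
        if self.sentences_per_chunk < 1:
            raise ValueError("sentences_per_chunk must be >= 1")


if __name__ == "__main__":
    deployed = PipelineConfig(generator_model="example-qa-model")  # placeholder name
    experiment = replace(deployed, k=5)  # same settings, different k
    print(deployed.k, experiment.k)
```

Keeping the configuration immutable and deriving variants with `replace` makes it straightforward to record exactly which settings produced a given set of evaluation results.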

## Post-Deployment Policies
### Monitoring and Maintenance Plan
The stored JSON logs and the online metrics are a large part of the monitoring and mitigation plan. Every question sent to the `POST /question` endpoint is saved in its own file, together with its corresponding answer and retrieved context. This gives us insight into all questions asked and allows us to identify potential issues with the model and system, including why the system may not be performing as expected when incorrect answers are generated. For example, if users are getting incorrect or outdated answers, the logs can help identify the documents or chunks of information that need to be updated in the system's corpus. Online metrics such as latency, answer relevancy, and context precision also show how well the deployed system is working. For example, a decrease in answer relevancy or context precision would indicate a need to check the logs for patterns in the questions, answers, or context that suggest ways in which the corpus can be improved. Any system maintenance or changes would be performed offline, and the Docker image could then be built and deployed in a container to ensure stability and reproducibility. Additionally, cloud providers such as AWS, Microsoft Azure, or Google Cloud Platform provide services that make it easy to deploy new containers. For example, we could start the new container, precompute the embeddings for the corpus documents, and switch traffic to the new container only after the precomputing is done, so that the system is not offline while the precomputing occurs.
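
The per-question logging described above could look roughly like the following. This is a sketch under stated assumptions: the function name `log_interaction`, the record fields, and the use of a UUID as the file name are illustrative and may not match the deployed log format.

```python
import json
import tempfile
import uuid
from datetime import datetime, timezone
from pathlib import Path
from typing import List


def log_interaction(
    log_dir: str, question: str, answer: str, context_chunks: List[str]
) -> Path:
    """Write one question/answer/context record to its own JSON file."""
    directory = Path(log_dir)
    directory.mkdir(parents=True, exist_ok=True)
    record = {
        "id": str(uuid.uuid4()),  # assumed: unique file name per question
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "question": question,
        "answer": answer,
        "context": context_chunks,
    }
    file_path = directory / f"{record['id']}.json"
    file_path.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return file_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = log_interaction(tmp, "Who runs the lab?", "Dr. Example", ["chunk text"])
        print(path.read_text(encoding="utf-8"))
```

Storing the retrieved context next to each answer is what makes it possible to trace an incorrect or outdated answer back to the specific chunk that needs updating.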

### Fault Mitigation Strategies
One fault mitigation strategy is to back up the Docker image so it can be redeployed if the system goes down. Containerization makes it easier to rebuild the system the exact same way repeatedly. Additionally, the corpus documents, logs, and precomputed embedding vectors could be stored in a separately hosted database instead of in a file system in the Docker container or on a local computer. This would ensure that the data is preserved even if the interface system, server, or Docker container goes down, or the Docker image is redeployed. Carefully monitoring the question logs and online metrics for irregularities can help us catch issues early, ideally as soon as they arise, so that the system remains reliable, useful, and trustworthy to users. Alerts could be set up to notify engineers of potentially incorrect answers or irrelevant context retrieval, for example if the nearest neighbors search returns neighbors that are all fairly far from the input question (i.e. the distance is above some threshold for all predicted neighbors), or if the answer relevancy metric is low. The alerts could potentially be configured using third party monitoring tools like AWS CloudWatch, Splunk, or Datadog. Additionally, the system's `POST /question` endpoint could be updated to include in the response the similarity measure between the question and each predicted context neighbor, giving the endpoint consumer a quantitative idea of how relevant the retrieved context chunks are to the input question, so the output can be overridden (i.e. a warning message could be displayed) if necessary. Similar output rewriting can be done to return a user-friendly response if the generator returns an unhelpful answer such as an empty string or the `[CLS]` special token.

Additionally, content moderation guardrails could be implemented to prevent the system from processing questions or returning answers that include things like violent, hateful, or abusive content, protected code or material without citations, or prompt injection attacks. Another fault mitigation strategy is context re-ranking and question/context rewriting or summarization to decrease the number of tokens sent to the generator model, since more tokens cost more and models have fixed context windows, meaning they can only process a maximum number of tokens.
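
The distance-threshold check and the output rewriting described in this section can be sketched together. This is a hedged illustration: the names `context_is_weak` and `finalize_answer`, the fallback message, the warning labels, and any specific threshold value are assumptions that would need to be tuned against real logs.

```python
from typing import Dict, List, Optional

# Assumed user-facing message; the wording is illustrative.
FALLBACK_MESSAGE = "Sorry, I could not find an answer to that question."


def context_is_weak(distances: List[float], max_distance: float) -> bool:
    """True when no context was retrieved or every neighbor is beyond the threshold."""
    return not distances or min(distances) > max_distance


def finalize_answer(
    raw_answer: str, distances: List[float], max_distance: float
) -> Dict[str, Optional[str]]:
    """Replace unusable answers and flag answers built on distant context."""
    cleaned = raw_answer.strip()
    if cleaned in ("", "[CLS]"):
        # The generator produced nothing usable, so return a friendly message.
        return {"answer": FALLBACK_MESSAGE, "warning": "no_answer"}
    if context_is_weak(distances, max_distance):
        # Keep the answer but let the consumer display a warning.
        return {"answer": cleaned, "warning": "low_context_relevance"}
    return {"answer": cleaned, "warning": None}


if __name__ == "__main__":
    print(finalize_answer("[CLS]", [0.4, 0.6], max_distance=1.0))
    print(finalize_answer("Room 101", [1.5, 1.8], max_distance=1.0))
    print(finalize_answer("Room 101", [0.4, 1.8], max_distance=1.0))
```

The same `warning` field could also drive alerting, for example by counting how often `low_context_relevance` appears in the logs over a time window.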