# Visual Search System Report

## System Design
Revisiting the original requirements below confirms that the system meets each of them.
1. The system should be able to identify millions of known personnel
    * This can be achieved by making the system scalable, which includes using an indexing strategy to organize the data in a way that makes the search process quicker. In this system, we used a KD-Tree to achieve this. The embeddings for the gallery are precomputed when the system first starts and saved to files so they can be reused later, which speeds up the search process. Analysis in [design_considerations.ipynb](./design_considerations.ipynb) showed that the probe images could be processed in a couple hundredths of a second each.
2. The system should be able to detect non-employees
    * This was achieved by using a representation learning strategy, which involved precomputing embeddings for images from a gallery of known employees. Non-employees can be detected by finding the nearest neighbors to the probe image: if the distance between the identified nearest neighbors and the probe image is very large, the probe does not sufficiently match any of the gallery images, indicating that the person in the probe image is an intruder.
3. The system should be able to maintain a high performance despite different lighting conditions
    * This was achieved by implementing a preprocessing module in the extraction service, which scales and resizes the input images to ensure that the inputs provided to the model are consistent. Additional transformations, such as brightness and color adjustments, could be added to this preprocessing module if necessary to further accommodate different lighting conditions. Additionally, our use of the SimCLR model for calculating embeddings helps with this requirement, since it is trained on randomly augmented views of the input images to produce embedding vectors such that differently augmented views of the same image are close together, while vectors for different images are far apart. As a result, images of the same person in different lighting conditions should ideally be close together in embedding space.
4. The system should be able to adjust access permissions (add new hires / remove recent departures)
    * This was achieved by implementing the `add_identity` and `remove_identity` endpoints, which can be used to add new identities and images to the gallery of known employees, or remove existing images from it. These endpoints add or remove a single image at a time, but can be called multiple times to apply the functionality to multiple images. I also implemented a `GET /identity` endpoint to make it easy to see which image files are in the gallery for a given identity (person's name). The `POST /images` endpoint can be used to send a list of filenames and receive a zip file containing the `.jpg` images associated with those filenames.
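
The non-employee check described in requirement 2 can be illustrated with a minimal sketch. This is not the deployed code: the function names, the two-dimensional toy embeddings, and the `max_distance` threshold value are all assumptions made for illustration, and a brute-force search stands in for the KD-Tree.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def authenticate(probe, gallery, k=3, max_distance=1.0):
    """Return (is_known, matches) for a probe embedding.

    gallery maps an identity name to its embedding. The probe is treated as
    a known employee only if its closest gallery match lies within
    max_distance (a hypothetical threshold that would need tuning).
    """
    ranked = sorted(
        (euclidean(probe, emb), name) for name, emb in gallery.items()
    )[:k]
    is_known = bool(ranked) and ranked[0][0] <= max_distance
    return is_known, ranked


# Toy gallery with made-up 2-D embeddings.
gallery = {"alice": [0.0, 0.0], "bob": [1.0, 0.0], "carol": [0.0, 2.0]}

known, matches = authenticate([0.1, 0.0], gallery, k=2, max_distance=0.5)
stranger, _ = authenticate([5.0, 5.0], gallery, k=2, max_distance=0.5)
```

In practice the threshold would have to be chosen from validation data, since a value that is too loose admits intruders and one that is too strict locks out employees.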

A diagram of the entire system, from the Module 8 lecture slides, can be seen below. There are three main services: the extraction service, the retrieval service, and the interface service. The extraction service processes images and extracts embedding vectors from each image, saving the embeddings to numpy files. This is run on both the gallery images and probe images. For the gallery images, the extraction service applies a KD-Tree indexing strategy to store the gallery embeddings in an organized way that makes the search process more efficient. The retrieval service performs a nearest neighbors search to find the images most similar to a given probe image, using the KD-Tree index created after the gallery embeddings were computed at system startup. Finally, the interface service is how the outside world interacts with the system, either by providing input as authentication requests and images or by making API calls to retrieve information such as access logs or image files.

![diagram](../assets/images/diagram.png)

## Data, Data Pipelines, and Model

### Data
There are three main types of data: gallery and probe images, embedding vectors, and output prediction data. The data obtained from the `identities.tar` and `multi_image_identities.tar` files are stored in the `storage/gallery` and `storage/multi_image_gallery` directories respectively. The `gallery` has 500 images (one per identity) and the `multi_image_gallery` has 2265 images (multiple images for many identities). The 999 `probe` images from the `multi_image_identities.tar` file are stored in the `simclr_resources/probe` directory. These probe images were used as a test set during the analysis in [design_considerations.ipynb](./design_considerations.ipynb). When the system starts, the embeddings for each image in the gallery are precomputed and saved to `.npy` files in the `storage/embeddings` directory, inside the folder that corresponds with the model name used to compute the embeddings. Finally, the system outputs its nearest neighbor predictions in the form of access logs upon every authentication request. The probe image is stored as a `.jpg` file and the access log as a `.json` file in the `storage/logs` folder. The access log contains the path to the probe image, timestamp, and list of predicted closest matches (including filename, first name, and last name).
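
As an illustration of the access log contents described above, the sketch below builds one log record and serializes it to JSON. The key names (`probe_image`, `timestamp`, `matches`), the helper name, and the example file names are hypothetical and may differ from what the system actually writes.

```python
import json
from datetime import datetime, timezone


def build_access_log(probe_path, matches):
    """Assemble one access log record (field names are assumptions)."""
    return {
        "probe_image": probe_path,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "matches": [
            {"filename": filename, "first_name": first, "last_name": last}
            for filename, first, last in matches
        ],
    }


# Hypothetical probe path and match, for illustration only.
log = build_access_log(
    "storage/logs/probe_0001.jpg",
    [("Jane_Doe_0001.jpg", "Jane", "Doe")],
)
serialized = json.dumps(log, indent=2)
```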

### Data Pipelines
There are two main data pipelines: one used on the gallery of known employees and one used on the unknown probe images. The data pipeline that is used for the gallery of known employees is run when the system first starts, and it involves only the extraction service, including the preprocessing, embedding, and indexing modules. The pipeline starts by using the `Preprocessing` class to scale and resize each image in the gallery. Next, the `Model` class is used to calculate the embedding vector for each image. The embedding vectors are then saved to their own `.npy` file in the `storage/embeddings` folder as described above in the Data section. Finally, the embedding vectors are organized into a KD-Tree using the `KDTree` class, which indexes the known embedding vectors in a way that makes the search process quicker than a brute force approach, since entire branches of the search space can be disregarded during the search process. This KDTree index will be used by the retrieval service in the next data pipeline.
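
To show why a KD-Tree can skip entire branches, here is a small self-contained sketch of the idea. It is not the project's `KDTree` class; the function names, the dictionary-based node layout, and the random 3-D points are assumptions chosen to keep the example short.

```python
import math
import random


def build(points, depth=0):
    """Recursively split the points on the median of one axis per level."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build(points[:mid], depth + 1),
        "right": build(points[mid + 1:], depth + 1),
    }


def nearest(node, target, best=None):
    """Return (distance, point) for the stored point closest to target."""
    if node is None:
        return best
    dist = math.dist(node["point"], target)
    if best is None or dist < best[0]:
        best = (dist, node["point"])
    diff = target[node["axis"]] - node["point"][node["axis"]]
    if diff < 0:
        near, far = node["left"], node["right"]
    else:
        near, far = node["right"], node["left"]
    best = nearest(near, target, best)
    # The far branch can only hold a closer point if the splitting plane
    # itself is closer than the best match found so far; otherwise prune it.
    if abs(diff) < best[0]:
        best = nearest(far, target, best)
    return best


random.seed(0)
points = [tuple(random.random() for _ in range(3)) for _ in range(200)]
tree = build(points)
query = (0.5, 0.5, 0.5)

kd_result = nearest(tree, query)
brute_result = min((math.dist(p, query), p) for p in points)
```

The tree search and the brute-force scan return the same nearest point, but the tree version avoids visiting branches whose splitting plane is already farther away than the current best match.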

The second data pipeline is run for each probe image that is provided in an authentication request initiated using the `POST /authenticate` endpoint. This pipeline involves both the extraction and retrieval services. Like the first pipeline, this one starts by scaling and resizing the probe image using the `Preprocessing` class and calculating the embedding vector using the `Model` class. Then the `KDTreeSearch` class is used to perform a nearest neighbors search on the KD-Tree created in the first data pipeline. This search returns the k embedding vectors that are closest to that of the probe image in embedding vector space based on a defined similarity measure. Based on the analysis done in [design_considerations.ipynb](./design_considerations.ipynb), the k value used in the deployed system was 3 and the similarity measure used was Euclidean distance. See that file for further analysis details. The nearest neighbor information is saved in an access log and returned to the interface user.

### Model
The embedding model used for this case study is the SimCLR model, which is a self-supervised model that can be used for image representation. All aspects of the model architecture, including the projection head neural network and the backbone neural network, are implemented in [model.py](../src/extraction/model.py). This model works well for our image representation task because it aims to produce embedding vectors such that the distance between differently augmented views of the same image is minimized, while maximizing the distance between views of different images. This means that ideally similar images (images of the same person) will be closer together in embedding space, while images of different people will be further apart. Four versions of trained SimCLR models were provided for this case study, which have been trained on each combination of the following image sizes and architectures: image sizes = [64, 224] and architectures = [resnet_018, resnet_034]. Based on the model selection analysis performed in [design_considerations.ipynb](./design_considerations.ipynb), the `model_size_224_resnet_018` model was used for the deployed system in `deployment.py`.
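
The "pull augmented views together, push other images apart" objective can be sketched with the NT-Xent contrastive loss from the original SimCLR paper. How the provided models were actually trained is not documented here, so this is an illustration of the general technique only; the function names, the toy 2-D embeddings, and the temperature value are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def nt_xent(z, i, j, temperature=0.5):
    """NT-Xent loss for the positive pair (i, j) within the batch z.

    The numerator rewards similarity between the two views of one image;
    the denominator sums over every other embedding in the batch.
    """
    numerator = math.exp(cosine(z[i], z[j]) / temperature)
    denominator = sum(
        math.exp(cosine(z[i], z[k]) / temperature)
        for k in range(len(z))
        if k != i
    )
    return -math.log(numerator / denominator)


# Toy batch: z[0] and z[1] point in nearly the same direction, z[2] and z[3] do not.
z = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]

loss_similar_pair = nt_xent(z, 0, 1)
loss_dissimilar_pair = nt_xent(z, 0, 2)
```

The loss is small when the designated positive pair is already close and large when it is not, which is the pressure that shapes the embedding space during training.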

## Metrics Definition
Most of the metrics tracked by the system are offline metrics due to time and feasibility constraints of this Case Study project. In the real world, cloud resources such as AWS CloudWatch, Datadog, and Splunk could be used to obtain more online metrics. These could be configured to include things like API errors and traffic, latency, and duration.

### Offline Metrics
These metrics have been implemented in this case study, and are used mainly by the analyses in [design_considerations.ipynb](design_considerations.ipynb). The Flask application doesn't include an endpoint to directly get these metrics since that was not a specified requirement of the case study, and we don't have the correct identification labels for real world probe data. In a real system, an endpoint to get these metrics would be useful, but would require implementing a way to collect human annotations by security guards of the system's predicted outputs. With the true labels known, these offline metrics can be calculated.
1. **Average Precision @ k**: This metric is tracked so we can identify how useful a model is for reducing the risk of false positives. A low false positive rate is an important aspect of the identification system because we want to minimize the rate of non-employees (intruders) who are allowed access to the company. It is also necessary in order to calculate mAP (discussed below). However, it's important to note that this metric only considers the number of positive matches, and not the ranking of those matches.
2. **Average Recall @ k**: This is an important metric for ensuring that we're correctly identifying employees who are truly known employees in the gallery, allowing us to maximize true positives. This is important so that real employees do not get locked out of the company, resulting in an inability for them to do their jobs. However, it's important to note that this metric doesn't consider the rank of the positive matches, only the number of positive matches in the entire dataset (which could be very large).
3. **Mean Average Precision @ k**: Since precision @ k and recall @ k both have their drawbacks and do not consider the ranking of positive matches, a more robust measure of performance may be Mean Average Precision. The mAP @ k value takes into account both the number of positive matches and their rankings, which can be useful when comparing different model versions and tuning model parameters. This is why mAP was the prioritized metric when comparing models during the analyses described below.
4. **Mean Reciprocal Rank @ k**: This metric is simple and straightforward since it's based only on the rank of the first positive match in a list of predicted matches. However, it only considers the rank of the first positive match, and ignores all other potential positive matches in the list, making it not ideal for comparing the performance of different models, especially when using the multi image gallery where there could potentially be multiple positive matches for a given input probe.
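
The four offline metrics can be sketched for a single probe as follows. The function names are illustrative, not those used in the notebook, and average precision is normalized here by `min(len(relevant), k)`, which is one of several conventions in use.

```python
def precision_at_k(relevant, ranked, k):
    """Fraction of the top-k results that are correct matches."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k


def recall_at_k(relevant, ranked, k):
    """Fraction of all correct matches that appear in the top k."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant) if relevant else 0.0


def average_precision_at_k(relevant, ranked, k):
    """Rank-aware precision: averages precision at each correct hit."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    denom = min(len(relevant), k)
    return total / denom if denom else 0.0


def reciprocal_rank_at_k(relevant, ranked, k):
    """1 / rank of the first correct match, or 0 if none is in the top k."""
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def mean(values):
    """Average a per-probe metric over all probes (e.g. AP -> mAP)."""
    return sum(values) / len(values) if values else 0.0


# One probe whose true identity has two gallery images, a1 and a2.
relevant = {"a1", "a2"}
ranked = ["b1", "a1", "a2"]

p = precision_at_k(relevant, ranked, 3)
r = recall_at_k(relevant, ranked, 3)
ap = average_precision_at_k(relevant, ranked, 3)
rr = reciprocal_rank_at_k(relevant, ranked, 3)
```

Averaging the per-probe values with `mean` over the whole probe set gives the dataset-level figures, including mAP @ k and MRR @ k.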

### Online Metrics
These metrics were not implemented for this case study due to time and feasibility constraints, but given more time to develop a real-world system, these metrics could be very useful for determining how well the deployed system is performing.
1. **Latency**: Keeping track of the time that it takes for the system to produce predicted identification results upon receiving an image in an authentication request will be important for ensuring that the system remains performant and scalable, even if the number of employees or gallery images grows. The system should still be able to produce identification results quickly so that entrants do not have to wait at the access point for long periods of time. The system should be at least as fast as (and hopefully faster than) a human security guard, and tracking this latency will help ensure that goal is met.
2. **Successful Breaches**: Tracking the number of successful breaches (intruders) will be helpful in determining how well the system is performing (i.e. the number of false positives). This will help to identify weaknesses in the system, and checking the access logs will help to identify any potential patterns associated with these breaches. This metric would require some way to provide feedback to the system to indicate that an authentication request resulted in a breach.
3. **Alert Response Time**: Keeping track of the time taken for security guards to respond to any system alerts regarding potential security breaches would be an important metric for determining how well the automated system and human security guards can work together to maintain security of the company. The system's alerts should be sent soon enough to allow the security guards time to intervene, while the security guards should intervene swiftly upon receiving an alert so they can stop potential intruders before they get too far.
4. **CPU Utilization**: Keeping track of CPU utilization would be important to ensure that the API interface responds adequately to all locations and is not overloaded. If CPU utilization is too high, it may result in performance degradation leading to increased latency. Additionally, increased utilization may indicate an attempted hack/system breach by someone who may be purposefully trying to overload the system. Sending CPU utilization alerts, as well as including a load balancer in the system architecture, will help ensure consistent system performance and load, while alerting security guards of potential breaches.
5. **Performance by Location**: Tracking all of the above metrics for each office location and access point will be important for identifying any potential differences in locations, and addressing any weaknesses for specific locations. Additionally, high performing locations could be analyzed to find ways to improve lower performing locations.
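
As noted above, none of the online metrics were implemented. Purely as an illustration of how the latency metric could be collected, the sketch below wraps a request handler with a timer and reports summary statistics. The `LatencyTracker` class and the stand-in handler are hypothetical.

```python
import functools
import math
import statistics
import time


class LatencyTracker:
    """Collects wall-clock durations of wrapped calls (illustrative only)."""

    def __init__(self):
        self.samples = []

    def timed(self, func):
        """Decorator that records how long each call to func takes."""
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                self.samples.append(time.perf_counter() - start)
        return wrapper

    def summary(self):
        """Return the count, mean, and 95th-percentile latency in seconds."""
        if not self.samples:
            return {"count": 0, "mean": 0.0, "p95": 0.0}
        ordered = sorted(self.samples)
        p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
        return {
            "count": len(ordered),
            "mean": statistics.mean(ordered),
            "p95": ordered[p95_index],
        }


tracker = LatencyTracker()


@tracker.timed
def fake_authenticate(value):
    """Stand-in for an authentication handler."""
    return value * 2


results = [fake_authenticate(i) for i in range(5)]
stats = tracker.summary()
```

Keeping one tracker per office location or access point would also support the per-location comparison described in item 5.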

## Analysis of System Parameters and Configurations
Exploratory data analysis was performed in [data_analysis.ipynb](data_analysis.ipynb), which led to the discovery of different embedding distributions and clustering strength between the four different trained models. More detail is in that file, but the general conclusion was that the `model_size_224_resnet_018` model may be best to include in the deployed system to get higher identification rates of employees due to its clustering power and embedding distribution compared to those of the other models.

The following design considerations and analyses were performed and described in [design_considerations.ipynb](design_considerations.ipynb), and are described again here. References are cited in that file as well.

### Design Considerations
1. **Search Technique**: Choosing a search technique is an important design consideration for the retrieval service in the system. The search module is responsible for searching the database of known embeddings to find the most similar samples to the input probe. This search needs to be accurate so that we don't misidentify an intruder as a known employee (i.e. a false positive), but also needs to be efficient so that entrants don't have to wait at the access point for long periods of time for the system to complete its search. Even if we decide that we'll use K-Nearest Neighbors for our search algorithm, there are still two main aspects of the search module to consider: exact vs. approximate search and indexing strategy. Implementing some kind of indexing is important because it allows the data to be organized in a way that makes searching, inserting, deleting, and retrieval more efficient. **(1) Exact vs. Approximate:** Exact searches like brute force can be highly accurate and can guarantee that the returned samples are the most similar to the input probe; however, this can be computationally expensive, especially if the embeddings database has many samples or high dimensionality. Approximate searches don't require visiting every single sample in the embeddings database, and therefore are more computationally efficient; however, they may be less accurate since they don't guarantee that the optimal solution is identified, only one that is good enough. **(2) Indexing Strategies:** Indexing strategies organize the data in a way that improves efficiency for search and retrieval. Some strategies like KD trees can be applied to both exact and approximate searches, while others like locality-sensitive hashing (LSH) and hierarchical navigable small world graphs (HNSW) are intended for approximate searches.
KD trees partition the data by dimension into "hyperrectangles", allowing for more efficient search since some branches can be eliminated; however, branch elimination becomes less effective as dimensionality increases, making this less efficient for high dimensional data. LSH groups similar items into buckets by increasing the probability of hashing collisions for similar items, and is a good alternative for high dimensional data; however, it may result in false negatives due to its probabilistic nature. HNSW is an accurate, scalable and efficient solution for high dimensional data, where the data is organized into hierarchical layers, with coarse layers at the top and more fine-grained layers towards the bottom; however, storing a graph of all of the layers can be memory intensive. **Analysis Plan:** To analyze the different search techniques, each variation (brute force, exact KNN, KD trees, LSH, HNSW) can be run on a validation set of embedding vectors for known probe images. The mean average precision can then be calculated for each run, so that the quality of the search outputs (ranked nearest samples) can be compared for each search technique. Additionally, the mean search time per sample will be tracked, so we can determine which technique is the most efficient. The search technique resulting in the best balance of mean average precision and efficiency would be chosen for deployment. Mean average precision is chosen as a comparison metric since it takes into account both the number of positive samples identified, as well as the rank of those samples, unlike precision@k, recall@k, and mean reciprocal rank.
2. **Similarity Measure**: Another important design consideration is the similarity measure used in the Nearest Neighbor search when calculating the distance between two embedding vectors. The distance metric must ensure that higher distances are associated with dissimilar images and lower distances are associated with similar images. Different distance metrics can lead to different results for the same points, so it's important to carefully choose the distance measure. Some popular distance metrics include Manhattan distance (aka "city block" distance), Euclidean distance (aka "straight line" distance), Minkowski distance (a generalized version of Manhattan and Euclidean distances), and cosine similarity, which measures the angle between two vectors. **Analysis Plan:** To analyze the different similarity measures, each variation (Manhattan distance, Euclidean distance, Minkowski distance with different values for the p parameter, and cosine similarity) can be run on a validation set of embedding vectors for known probe images. The mean average precision can then be calculated for each run, so that the quality of the search outputs (ranked nearest samples) can be compared for the search using each similarity measure. Additionally, the mean search time per sample will be tracked, so we can identify any differences in efficiency between the similarity measures. The similarity measure resulting in the best balance of mean average precision and efficiency would be chosen for deployment.
3. **Number of Nearest Neighbors (K)**: Another important design consideration is the number of nearest neighbors (k value) to be used in the search service when returning possible matches for a given probe. Lower k values may be more sensitive to noisy data (overlap between clusters of identities in our case), but can also be useful to account for complex decision boundaries between clusters. Higher k values produce results that are less affected by outliers or overlap in clusters (assuming we have somewhat well-defined clusters), but can introduce bias from false matches that may become included when we expand k. Additionally, a larger k value would mean the downstream consumer of our search results (whether that's some kind of result synthesizer or just the security person looking at a computer) would need to look through more potential matches in order to identify the person in the probe image. Ideally, k should be large enough to capture true matches, but small enough to exclude non-matching identities as much as possible. **Analysis Plan:** To find the best k value, the search service can be run on a validation set of probe images, using a different k value for each run. The mean average precision can again be calculated for each run, so that the quality of search outputs (accounting for rank of positives and number of positives) can be compared for each k value. Additionally, mean search time per image can be tracked so we can identify any difference in efficiency for different k values. The k value resulting in the best balance of mean average precision and search time would be chosen for deployment.
4. **Embedding Vector Dimensionality**: Another important design consideration is the dimensionality of the embedding vectors. Although all 4 models provided for this case study have been trained to produce embedding vectors of 256 dimensions, additional model variations could be trained to produce vectors of other dimensions. Smaller embedding vectors are more computationally efficient, but contain less information and therefore may not fully represent key aspects that differentiate each identity. Larger embedding vectors are more computationally intensive to search, but contain more information and may represent more complex aspects that differentiate identities. However, larger embedding vectors may also contain more extraneous, or noisy information unrelated to the person's identity. Ideally, the embedding vectors should be large enough to capture key aspects of the image that are significant for differentiating between identities, but small enough so they don't contain too much noisy information that isn't important for a person's identity (e.g. whether they're wearing a hat or not). **Analysis Plan:** To find the best embedding vector size, additional SimCLR model versions can be trained using different sizes in the ProjectionHead that is used to convert the features to the embedding space. Each of these models can be used to produce embeddings for the multi image gallery, and the quality of the embeddings can be numerically determined by measuring the cluster separability using silhouette score. A higher silhouette score indicates embeddings that are easier to cluster by identity, which will lead to more accurate identifications by our search service. An additional analysis can be done by using the search service with the embeddings of each model to find neighbors for a set of known probe images so we can determine mAP values for each model. 
However, the cluster analysis using silhouette score may suffice since better clusters are expected to lead to better search results.
5. **Potential for Racial Bias**: Another important design consideration is the potential for racial bias. This can be caused by lack of representation in the training set as well as image augmentation and preprocessing techniques. Lack of racial representation in training data has been shown repeatedly to result in biased models [Ref 6, 7, 8]. Ideally, the dataset used to train the embeddings model should contain images of people with varying skin tones. Additionally, image processing steps such as contrast adjustments do not have a uniform effect for all skin tones, and additional steps such as hue or saturation adjustments may be beneficial. Additional augmentations such as color and geometric transforms have been shown to improve performance as well [Ref 9]. Ensuring that the model is free from racial bias as much as possible will save IronClad Technology from potential lawsuits, as well as save time and effort for the security guard who is responsible for confirming the match results that our system returns to them. **Analysis Plan:** To analyze the potential for racial bias, we can build a validation set of known identities with varying skin tones and use our search service to find the closest matches for each probe image in the validation set. If we categorize the validation images into light, medium, and dark skin tones, we can calculate the mAP for each group and ensure that the system performs similarly for each group. Additionally, we can analyze the training dataset to determine whether all skin tones are adequately represented. This can be done using a machine learning model similar to the STAR-ED framework introduced by Tadesse et al. [Ref 10], which assesses skin tone representation in educational materials. This will allow us to determine whether we need to augment the training set to include more samples of specific skin tones.
6. **Model Selection**: Another important design consideration is the embedding model version to be deployed in our system. Four versions of trained SimCLR models have been provided for this case study, which have been trained on each combination of the following image sizes and architectures: image sizes = [64, 224] and architectures = [resnet_018, resnet_034]. Each model produces different embeddings, leading to differences in cluster separability of identities, leading to differing KNN search results, resulting in different identification rates. Models that produce embeddings that put images of the same person closer together, and images of different people further apart (i.e. models that produce well-defined clusters of identities) will ultimately lead to the best identification results. **Analysis Plan:** To analyze and compare each model, embeddings can be computed for the multi image gallery using all 4 models. The clustering quality of the resulting embeddings can be measured using silhouette score. A higher silhouette score indicates better intra-cluster cohesion and inter-cluster separation. Therefore, a higher silhouette score indicates that using the model in our system will result in better identification results. To further support this, a validation set of known probe images can be run through our search system using the embeddings produced by each of the 4 models, and the mAP metric can be calculated for the search results for each model. Additionally, search time per image can be measured to analyze any potential differences in efficiency of searching the embeddings produced by each model. The model that leads to the highest mAP value with an efficient search time is the one that should be chosen for deployment.
7. **Gallery Selection (Number of Images per Identity)**: Another important design consideration is the gallery selection, or more specifically, the number of images per identity. For our case study, we've been provided a gallery containing one image per identity and another multi image gallery that contains multiple images for some identities. The gallery selection, and number of images per identity, will impact our search results significantly. If there's only one image per identity in the gallery, then a correct identification by the KNN search will require the probe image to be closest (or very close, depending on k value) to the one correct identity in the gallery. However, if there are multiple images per identity in the gallery, then the probe image can be close to any of the correct identity images, and the KNN search will be more likely to accurately identify the probe. Having multiple images per identity can lead to system robustness against noise unrelated to identity such as differences in angles of the probe image, differences in hair style, or outfit. However, having too many images per identity could lead to lots of cluster overlap, especially if the cluster boundaries are not well-defined, resulting in more noisy or incorrect search results. **Analysis Plan:** To determine the best gallery to use in the system, embeddings can be calculated for each image in each gallery. Then a validation set of known probe images can be run through the search service using each gallery. The mAP metric can be calculated for the search results for each gallery, and the one that produces the highest mAP value is the one that should be chosen for deployment. To determine more specifically the optimal number of images per identity to include in the gallery, a similar iterative experiment can be run, but using a specially constructed gallery each time. 
For example, we can start by constructing a gallery containing two images per identity (since we already covered the one-image case with the single-image gallery), and running our validation set of probe images to calculate the mAP for the search results. This can be repeated by constructing a gallery containing 3, 4, ..., n images per identity and comparing the mAP values. The gallery that produces the highest mAP value indicates the optimal number of images per identity.
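
Several of the analysis plans above rely on silhouette score to judge how well embeddings cluster by identity. The sketch below computes it from scratch on toy data to make the definition concrete; it is not the notebook's implementation, and the function name, the 2-D points, and the labels are assumptions. It expects at least two distinct labels.

```python
import math
from collections import defaultdict


def silhouette_score(embeddings, labels):
    """Mean silhouette coefficient over all samples.

    For each sample, a is the mean distance to the other members of its own
    cluster and b is the lowest mean distance to any other cluster; the
    coefficient is (b - a) / max(a, b), ranging from -1 to 1.
    """
    clusters = defaultdict(list)
    for emb, label in zip(embeddings, labels):
        clusters[label].append(emb)

    scores = []
    for emb, label in zip(embeddings, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # The sample's distance to itself is 0, so divide by len(own) - 1.
        a = sum(math.dist(emb, other) for other in own) / (len(own) - 1)
        b = min(
            sum(math.dist(emb, other) for other in members) / len(members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Two tight, well-separated toy "identities".
embeddings = [(0.0, 0.0), (0.0, 0.1), (5.0, 5.0), (5.0, 5.1)]

good_score = silhouette_score(embeddings, ["a", "a", "b", "b"])
bad_score = silhouette_score(embeddings, ["a", "b", "a", "b"])
```

Labels that match the true grouping give a score near 1, while labels that cut across the groups give a negative score, which is the behavior the model comparison depends on.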

### Extraction Service
1. **Model Selection**: The first analysis is for selecting which embedding model to deploy to the system. Recall that four versions of trained SimCLR models have been provided for this case study, which have been trained on each combination of the following image sizes and architectures: image sizes = [64, 224] and architectures = [resnet_018, resnet_034]. The embeddings were computed for the gallery images using each of the four models. In `data_analysis.ipynb`, the clustering quality of the embeddings produced by each of the four models was analyzed by comparing silhouette scores. The results of that analysis indicated that the models in order of best to worst clustering ability were `model_size_224_resnet_018`, `model_size_064_resnet_034`, `model_size_064_resnet_018`, then `model_size_224_resnet_034`. In the analysis here, we'll use our validation set of probe images to find the nearest neighbors to each probe and calculate the mean average precision (mAP) for each model. Precision@k, recall@k, and mean reciprocal rank @ k are also tracked to get a better idea of model behavior, but mAP is given priority since it accounts for both number of positive items as well as their rank. Additionally, search time per image is tracked for each model. The results indicated that the search time per image increases with more complex models; however, the difference in search time between the slowest (`model_size_224_resnet_034`) and fastest (`model_size_064_resnet_018`) models is only about 0.01 seconds, which is negligible and insignificant to the end user, i.e. the security guard at the access point who is confirming the search results. 
Additionally, the values of all four metrics (precision@k, recall@k, MRR@k, mAP) support the results of the silhouette score clustering analysis, indicating that the models ranked from highest performing to lowest performing are `model_size_224_resnet_018`, `model_size_064_resnet_034`, `model_size_064_resnet_018`, then `model_size_224_resnet_034` based on the search result metrics. This confirms the decision to deploy `model_size_224_resnet_018` to the system.
2. **Gallery Selection**: The next analysis selects which gallery to deploy to the system. Recall that we've been given a gallery containing one image per identity and a multi image gallery containing multiple images for some identities. The embeddings were computed for each of these galleries and used to search for the nearest neighbors to our validation set of probe images. The same metrics used in the previous analysis (precision@k, recall@k, MRR@k, mAP, search time) were used here to measure the quality of the search results for each gallery. The results indicated that the search times per image were slightly higher for the multi image gallery than for the single image gallery; however, the difference was on the order of thousandths of a second per image, which is negligible and unnoticeable to the end user, i.e. the security guard confirming the search results at the access point. The results also indicated that the multi image gallery produced higher values for all four metrics, with recall and MRR nearly doubled and mAP nearly quadrupled for all models compared to the single image gallery. This led to the decision to deploy the multi image gallery to the system.
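
The ranking metrics used in both analyses above can be sketched in a few lines. This is a minimal illustration, not the project's `RankingMetrics` class: the function names are hypothetical, and normalizing average precision by `min(n_relevant, k)` is one common convention that the original implementation may or may not follow.

```python
from typing import List, Sequence


def hits_at_k(ranked_labels: Sequence[str], true_label: str, k: int) -> List[bool]:
    """Mark each of the top-k retrieved gallery labels as a hit or a miss."""
    return [label == true_label for label in ranked_labels[:k]]


def precision_at_k(hits: Sequence[bool]) -> float:
    """Fraction of the retrieved items that match the probe identity."""
    return sum(hits) / len(hits) if hits else 0.0


def recall_at_k(hits: Sequence[bool], n_relevant: int) -> float:
    """Fraction of the identity's gallery images that were retrieved."""
    return sum(hits) / n_relevant if n_relevant else 0.0


def reciprocal_rank(hits: Sequence[bool]) -> float:
    """1 / rank of the first hit, or 0.0 when the top k contains no hit."""
    for rank, hit in enumerate(hits, start=1):
        if hit:
            return 1.0 / rank
    return 0.0


def average_precision(hits: Sequence[bool], n_relevant: int) -> float:
    """Sum of precision@i over the ranks i that are hits, normalized.

    Assumption: the normalizer is min(n_relevant, k); other conventions exist.
    """
    denom = min(n_relevant, len(hits))
    if denom == 0:
        return 0.0
    found, total = 0, 0.0
    for rank, hit in enumerate(hits, start=1):
        if hit:
            found += 1
            total += found / rank
    return total / denom


def mean_over_probes(values: Sequence[float]) -> float:
    """Average a per-probe metric over the validation set (AP -> mAP, RR -> MRR)."""
    return sum(values) / len(values) if values else 0.0


# Toy example: the probe is "id_7", which has 2 images in the gallery.
hits = hits_at_k(["id_7", "id_2", "id_7"], true_label="id_7", k=3)
example = {
    "precision@3": precision_at_k(hits),
    "recall@3": recall_at_k(hits, n_relevant=2),
    "rr@3": reciprocal_rank(hits),
    "ap@3": average_precision(hits, n_relevant=2),
}
```

Because average precision rewards hits that appear early in the ranking, it reflects both how many relevant gallery images are returned and where they land, which is why it is given priority here.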

### Search Service
1. **Similarity Measure**: The first search service analysis determines the best similarity measure for computing the distance between embedding vectors in the KNN search service. The similarity measures explored were Euclidean distance, Manhattan distance, cosine similarity, Minkowski distance with p=5, and Minkowski distance with p=10. The `model_size_224_resnet_018` model and multi image gallery were used for this analysis. Each similarity measure was used to find the nearest neighbors to our validation set of probe images. I added a `set_search_measure()` function to my pipeline class to easily update the search measure used for the KD-Tree search. To measure the quality and efficiency of the search results for each similarity measure, the same metrics used in the previous analyses were calculated (precision@k, recall@k, MRR@k, mAP, search time). The results indicated that the search times per image were best using cosine similarity, moderate using Euclidean and Manhattan distance, and worst using either of the Minkowski distances; however, the difference was on the order of hundredths of a second per image, which is negligible and unnoticeable to the end user, i.e. the security guard confirming the search results at the access point. The results also indicated that cosine similarity produced the worst search results (with all four performance metrics being 0), and the remaining four similarity measures produced similar results to each other. However, the Minkowski distances did result in slightly lower values for all four metrics compared to the Euclidean and Manhattan distances, which performed very similarly. The slightly better efficiency and metric values of Euclidean distance over Manhattan distance led to the decision to deploy the Euclidean distance similarity measure to the system.
2. **Number of Nearest Neighbors (K)**: The next analysis determines the best number of nearest neighbors (k value) to use in the KNN search service. The k values explored were [1, 2, 3, 4, 5, 10, 30, 50]. I went up to 50 because a common rule of thumb is to choose k equal to the square root of the number of "training" samples [Ref 11], which for our multi image gallery is about 48, so I rounded it to 50 for my experiment. The `model_size_224_resnet_018` model, multi image gallery, and Euclidean distance measure were used for this analysis. To measure the quality and efficiency of the search results for each k value, the same metrics used in the previous analyses were calculated (precision@k, recall@k, MRR@k, mAP, search time). The results indicated that the search times per image generally increased as k increased for k > 1; however, the difference was on the order of thousandths of a second per image, which is negligible and unnoticeable to the end user, i.e. the security guard confirming the search results at the access point. The results also indicated that increasing k leads to increased recall@k and MRR@k, and decreased precision@k. Additionally, mAP increases as k increases from 1 through 3, but then decreases as k increases beyond 3. Since we're prioritizing mAP over the other metrics, and because a low false positive rate is important for this system so we correctly identify non-employees as intruders, I chose k=3 to deploy to the system since it produced the highest mAP value. Additionally, three identities seems manageable enough to not overwhelm the security guard or upstream service that has to synthesize the search results of our search service.
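
To make the compared measures concrete, the sketch below implements them in plain Python and runs a brute-force k-nearest-neighbor search. It is a stand-in for illustration only: it does not use a KD-Tree and is not the project's `KDTreeSearch` class, and the names `MEASURES` and `brute_force_neighbors` are hypothetical. Note that cosine similarity grows as vectors get closer while the distances shrink, so the sketch converts it to a distance to keep a single sort order across all measures.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def minkowski(a: Vector, b: Vector, p: float) -> float:
    """Minkowski distance; p=1 is Manhattan and p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity, so that smaller always means closer."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # direction is undefined for a zero vector
    return 1.0 - dot / (norm_a * norm_b)


# The five measures compared in the analysis above.
MEASURES: Dict[str, Callable[[Vector, Vector], float]] = {
    "manhattan": lambda a, b: minkowski(a, b, 1),
    "euclidean": lambda a, b: minkowski(a, b, 2),
    "minkowski_p5": lambda a, b: minkowski(a, b, 5),
    "minkowski_p10": lambda a, b: minkowski(a, b, 10),
    "cosine": cosine_distance,
}


def brute_force_neighbors(
    probe: Vector, gallery: Sequence[Vector], k: int, measure: str = "euclidean"
) -> List[Tuple[float, int]]:
    """Return the k closest gallery entries as (distance, gallery_index) pairs."""
    dist = MEASURES[measure]
    scored = sorted((dist(probe, vec), idx) for idx, vec in enumerate(gallery))
    return scored[:k]


# Toy gallery of 2-D "embeddings"; real embeddings have many more dimensions.
gallery = [[0.0, 0.0], [3.0, 4.0], [1.0, 0.0]]
closest = brute_force_neighbors([0.0, 0.0], gallery, k=2, measure="euclidean")
```

Sweeping `measure` over the keys of `MEASURES`, or `k` over the values listed above, and scoring each run with the ranking metrics reproduces the shape of both experiments.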

### Overall System
Overall, each component of the system was chosen based on analysis results, including the selected model, k value, similarity measure, and gallery. However, the system allows for flexibility and tuning, so that performance can be adjusted as part of future system enhancements. For example, the `KDTree` class, `RankingMetrics` class, and `nearest_neighbors` search function allow k to be passed as a parameter, the `Preprocessing` class allows image size to be specified, the `KDTreeSearch` class allows a similarity measure to be passed, and the `Model` class allows a different model path to be specified via parameter. This flexibility and parameterization will allow our system to be tuned and refined as needed, based on the results and potential issues identified during further testing or production use after deployment. A rectification-style service that allows the security guards to provide input on the quality of the system's predicted identifications, which is out of scope for this case study, would be useful for tracking the performance of the deployed system and helping to identify potential areas for improvement.
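
One way to keep these tunable choices in a single place is a small configuration object. The sketch below is hypothetical: `SearchConfig` and its field names are not part of the existing codebase, and the defaults simply restate the values selected in the analyses above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SearchConfig:
    """Hypothetical container for the deployed settings chosen above."""

    model_name: str = "model_size_224_resnet_018"
    image_size: int = 224
    gallery: str = "multi_image"
    measure: str = "euclidean"
    k: int = 3

    def __post_init__(self) -> None:
        # Reject values that the analyses above never covered.
        if self.k < 1:
            raise ValueError("k must be at least 1")
        if self.image_size not in (64, 224):
            raise ValueError("image_size must be 64 or 224")


# Deployed defaults, plus a variant for a later tuning experiment.
deployed = SearchConfig()
experiment = replace(deployed, k=5, measure="manhattan")
```

Making the object immutable means an experiment always starts from a copy, so the deployed settings cannot be changed by accident.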

## Post-Deployment Policies
### Monitoring and Maintenance Plan
The stored images and JSON objects in the access logs, as well as the online metrics, are a large part of the monitoring and maintenance plan. Every probe image sent to the `/authenticate` endpoint and its corresponding access log with predictions are saved in their own files. This gives us insight into everyone who attempts to enter through the access points, and allows us to identify potential issues with the model and system, including why the system may not be performing as expected when incorrect identifications are made. For example, if criminals come up with new, creative ways to trick the system, these access logs could help identify the new patterns so the system's weaknesses can be addressed. Online metrics such as number of breaches and alert response time can also help show how well the deployed system is working. For example, an increase in the number of breaches would indicate a need to check the access logs and look for patterns in the breaches. Additional online metrics such as latency, ranking accuracy, false positive rate, and false negative rate would all help ensure our system continues to meet performance requirements. This would require a way to collect additional input from the security guard indicating how accurate the system's predictions are. Any system maintenance or changes would be performed offline, and the Docker image could be built and deployed in a container to ensure stability and reproducibility. Additionally, cloud providers such as AWS, Microsoft Azure, or Google Cloud Platform provide services that make it easy to deploy new containers. For example, we could start the new container, precompute the embeddings for the gallery images, and deploy the container only after the precomputing is done, so that the system is not offline while the precomputing occurs.
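
As a rough sketch of how the additional online metrics could be derived once guard feedback is collected, the example below summarizes a batch of access records. The record schema (`system_accepted`, `guard_confirms_employee`, `latency_s`) is an assumption for illustration and does not reflect the current access log format. Here a "positive" means the system matched the probe to a known employee, so a false positive corresponds to an intruder being accepted.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class AccessRecord:
    """Hypothetical log entry enriched with the security guard's verdict."""

    system_accepted: bool          # system matched the probe to an employee
    guard_confirms_employee: bool  # guard's ground-truth judgment
    latency_s: float               # time taken to answer the request


def _ratio(numerator: float, denominator: float) -> Optional[float]:
    """Return numerator / denominator, or None when it is undefined."""
    return numerator / denominator if denominator else None


def summarize(records: Iterable[AccessRecord]) -> Dict[str, Optional[float]]:
    """Compute false positive rate, false negative rate, and mean latency."""
    rows = list(records)
    tp = sum(r.system_accepted and r.guard_confirms_employee for r in rows)
    fp = sum(r.system_accepted and not r.guard_confirms_employee for r in rows)
    fn = sum(not r.system_accepted and r.guard_confirms_employee for r in rows)
    tn = sum(not r.system_accepted and not r.guard_confirms_employee for r in rows)
    return {
        "false_positive_rate": _ratio(fp, fp + tn),  # intruders let through
        "false_negative_rate": _ratio(fn, fn + tp),  # employees turned away
        "mean_latency_s": _ratio(sum(r.latency_s for r in rows), len(rows)),
    }


# Toy batch of four records, one per outcome type.
summary = summarize([
    AccessRecord(True, True, 0.02),
    AccessRecord(True, False, 0.04),
    AccessRecord(False, False, 0.03),
    AccessRecord(False, True, 0.03),
])
```

Tracking these values per day or per access point would make a rise in accepted intruders visible without reading individual log files.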

### Fault Mitigation Strategies
Some fault mitigation strategies may include backing up the Docker image so it can be redeployed if the system goes down. Containerization makes it easier to rebuild the system the exact same way repeatedly. Additionally, the gallery images, logs, and precomputed embedding vectors could be stored in a separately hosted database instead of in a file system in the Docker container or on a local computer. This would ensure that the data is preserved even if the interface system or local computer at the access point crashes, the Docker container goes down, or the Docker image is redeployed. Carefully monitoring irregularities in the access logs and online metrics, in addition to implementing more online metrics as described above, can help us catch potential issues before or immediately when they arise, allowing us to capture intruders, position additional security guards at access points, and perform manual identification when necessary. This monitoring could include alerts that notify humans of potential intruders, for example when the nearest neighbors search returns neighbors that are all fairly far away from the probe image (i.e. the similarity measure is above some threshold for all predicted neighbors). The alerts could potentially be configured using third party monitoring tools like AWS CloudWatch, Splunk, or Datadog. Additionally, the system's `/authenticate` endpoint could be updated to include in the response the similarity measure between the probe and each predicted neighbor, giving the endpoint consumer a quantitative idea of how closely the system's predictions match the input image, so the output can be overridden (e.g. by the security guard) if necessary.
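
The threshold-based alert and the enriched response described above could look roughly like the following. This is a sketch under stated assumptions: the function names, the response fields, and the threshold value are all hypothetical, and a real threshold would need to be tuned on validation data for the chosen distance measure.

```python
from typing import Dict, List, Sequence, Tuple


def is_potential_intruder(distances: Sequence[float], threshold: float) -> bool:
    """Flag the probe when every returned neighbor is farther than the threshold."""
    if not distances:
        return True  # no neighbors at all is treated as a failed match
    return all(d > threshold for d in distances)


def build_response(
    neighbors: List[Tuple[str, float]], threshold: float
) -> Dict[str, object]:
    """Attach each neighbor's distance and an alert flag to the response body."""
    distances = [distance for _, distance in neighbors]
    return {
        "predictions": [
            {"identity": identity, "distance": distance}
            for identity, distance in neighbors
        ],
        "alert": is_potential_intruder(distances, threshold),
    }


# Toy example with k=3 neighbors and an illustrative threshold of 1.0.
far_match = build_response([("id_7", 1.2), ("id_2", 1.5), ("id_9", 1.9)], threshold=1.0)
near_match = build_response([("id_7", 0.4), ("id_2", 1.5), ("id_9", 1.9)], threshold=1.0)
```

Returning the distances alongside the flag lets the security guard see how close the best match was and override the system's decision when the evidence is borderline.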