# SentenceFeature Extractor bottleneck analysis

## Analysis and observations from the workflow profiling
One of the major bottlenecks observed during workflow profiling was the overhead of the SentenceFeature Extractor (i.e., the function used to compute the feature embedding value).

### Problem:
The current implementation computes the feature embedding for every entry in the cache table in order to find the most similar cache entry in the database. The SentenceFeature function has significant overhead because it must compute the embedding for a given sentence (text). Recomputing this value on every search hurts the performance of the cache library, so we need a way to avoid this overhead and speed up SentenceFeature extraction.

### Proposed solution
Instead of computing the embeddings on every database search, we can precompute the embedding of each entry and store it in a separate column of the cache database, which avoids the repeated overhead.
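The idea can be illustrated with a minimal, self-contained sketch. Everything in it is an assumption made for illustration and not the library's actual code: the `embed` function is a toy letter-frequency stand-in for SentenceFeature, the `cache` table schema, the `put` and `get` helpers, the use of SQLite, and the JSON-encoded `embedding` column are all hypothetical. The point it shows is that the embedding is computed once when an entry is written, so a lookup only embeds the query.

```python
import json
import math
import sqlite3

embed_calls = 0  # counts how often the expensive embedding step runs


def embed(text):
    """Hypothetical stand-in for SentenceFeature: normalized letter-frequency vector."""
    global embed_calls
    embed_calls += 1
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]


def cosine(a, b):
    # Vectors are normalized (or all zeros), so the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cache (prompt TEXT, response TEXT, embedding TEXT)")


def put(prompt, response):
    # Embed once, at insert time, and store the vector in its own column.
    conn.execute(
        "INSERT INTO cache (prompt, response, embedding) VALUES (?, ?, ?)",
        (prompt, response, json.dumps(embed(prompt))),
    )


def get(query):
    # Only the query is embedded here; stored vectors are read back as-is.
    query_vec = embed(query)
    best_response, best_score = None, -1.0
    rows = conn.execute("SELECT response, embedding FROM cache")
    for response, stored in rows:
        score = cosine(query_vec, json.loads(stored))
        if score > best_score:
            best_response, best_score = response, score
    return best_response, best_score


put("What is the capital of France?", "Paris")
put("How do I sort a list in Python?", "Use sorted()")
response, score = get("what is the capital of france")
print(response, round(score, 3), "embed calls:", embed_calls)
```

With this layout the number of embedding computations per lookup stays at one regardless of how many entries the table holds, whereas recomputing per entry grows linearly with the table size.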

In [10]:
# Get the OpenAI key if it is not already set in the environment
import os

try:
    api_key = os.environ["OPENAI_API_KEY"]
except KeyError:
    # A missing environment variable raises KeyError, not OpenAIError
    api_key = str(input("🔑 Enter your OpenAI key: "))
    os.environ["OPENAI_API_KEY"] = api_key

In [4]:
api_key = str(input("🔑 Enter your OpenAI key: "))
os.environ["OPENAI_API_KEY"] = api_key

## Performance analysis with precomputed embeddings

In [3]:
from large_test_optimized import test_optimized_largetest

test_optimized_largetest()

  sentences_sorted = [sentences[idx] for idx in length_sorted_idx]


Total pair 101
Total similar pair 36
Total time taken after using Precomputed Embeddings for Similarity  66.65347003936768
Precision 0.5666666666666667
Recall 0.9444444444444444
TP FN FP 34 2 26


## Performance analysis without precomputed embeddings

In [8]:
from large_test_old import test_largetest

test_largetest()

  sentences_sorted = [sentences[idx] for idx in length_sorted_idx]

Total pair 101
Total similar pair 36
Total time taken 168.68619990348816
Precision 0.5666666666666667
Recall 0.9444444444444444
TP FN FP 34 2 26


## Performance Comparison

**With precomputed embeddings** Total time taken for cache.get() for 100 entries = 66.65347003936768 seconds

**Without precomputed embeddings** Total time taken for cache.get() for 100 entries = 168.68619990348816 seconds

Using precomputed embeddings significantly improves the performance of the cache.get() function, as detailed below:
#### Speedup: 
The optimized code is approximately **2.53 times faster** than the unoptimized code.
#### Relative Improvement:
The optimized code takes approximately **60.49% less time** than the unoptimized code, i.e. (168.69 - 66.65) / 168.69 ≈ 0.6049.
#### Throughput:
Before optimization: Throughput = 100 / 168.69 ≈ 0.593 entries per second.

After optimization: Throughput = 100 / 66.65 ≈ 1.500 entries per second.
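As a quick check, the comparison metrics can be recomputed from the two measured totals. This is a small sketch using only the timings reported above; the variable names are chosen here for illustration.

```python
# Measured totals from the two runs above, in seconds
t_without = 168.68619990348816  # without precomputed embeddings
t_with = 66.65347003936768      # with precomputed embeddings
n_entries = 100                 # number of cache.get() lookups timed

speedup = t_without / t_with
time_reduction_pct = (t_without - t_with) / t_without * 100
throughput_before = n_entries / t_without  # entries per second
throughput_after = n_entries / t_with      # entries per second

print(f"Speedup: {speedup:.2f}x")
print(f"Time reduction: {time_reduction_pct:.2f}%")
print(f"Throughput before: {throughput_before:.3f} entries/s")
print(f"Throughput after: {throughput_after:.3f} entries/s")
```

Note that the speedup and the time reduction describe the same measurement in two ways: a 2.53x speedup corresponds to the run taking roughly 60% less time, not to the code being 60% "faster" in the throughput sense.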
