# Smart Document Retrieval
<p>The primary focus of this notebook is to illustrate how a transformer model can embed text data into a numerical representation, which can then be used to calculate a similarity score against the embedding of a query string. Additionally, we explore some of the architectural theory of a complete application.</p>
<p>Smart Document Retrieval can be divided into the following key components:
    <li>1) Document Management</li>
    <li>2) Source Text Extraction</li>
    <li>3) Source Text Storage</li>
    <li>4) Source Text Embedding
        <ul>
            <li>4a) Model Selection</li>
        </ul>
    </li>
    <li>5) Source Embedding Storage and Management</li>
    <li>6) Query String Embedding</li>
    <li>7) Similarity Scoring and Ranking</li>
    <li>8) Advanced techniques: <a href='https://www.sbert.net/examples/applications/retrieve_rerank/README.html'>Retrieval and Re-Ranking - Bi-Encoders (Retrieval) and Cross-Encoders (Re-Ranker)</a></li>
</p>

## 1) Document Management
<p>How you manage the documents and resources depends on your use case and is beyond the scope of this notebook; however, it is important to consider several items.
    <li><b>Access</b>
        <ul>
            <li>Will the application have continuous access to source documents?</li>
            <li>Will the application need privileged permissions?</li>
        </ul>
    </li>
    <li><b>Versioning</b>
        <ul>
            <li>Is there a document versioning process?</li>
            <li>Are there duplicates or variations of a document?</li>
        </ul>
    </li>
    <li><b>Document/Resource Types</b>
        <ul>
            <li>What type of document formats will be used? (e.g., MS Word, Excel, Google Docs, Websites, Emails, etc.)</li>
            <li>Are there different format versions?</li>
        </ul>
    </li>
    </p>
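<p>The duplicate question above can be explored with a content fingerprint. The following is a minimal sketch, assuming documents are already available as plain text; the helper names (<code>content_fingerprint</code>, <code>find_duplicates</code>) and the sample documents are hypothetical and not part of this notebook's code.</p>

```python
import hashlib


def content_fingerprint(text: str) -> str:
    """Return a SHA-256 hex digest of case- and whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(documents: dict) -> dict:
    """Group document ids by fingerprint and keep groups with more than one id."""
    groups = {}
    for doc_id, text in documents.items():
        groups.setdefault(content_fingerprint(text), []).append(doc_id)
    return {fp: ids for fp, ids in groups.items() if len(ids) > 1}


# Hypothetical documents used only for illustration
docs = {
    "report_v1.txt": "Quarterly results are attached.",
    "report_copy.txt": "quarterly   results are ATTACHED.",
    "memo.txt": "Please review the new policy.",
}

duplicate_groups = find_duplicates(docs)
print(list(duplicate_groups.values()))
```

<p>This only catches copies that are identical after normalization. Near-duplicates, such as edited versions of the same document, would need a fuzzier comparison.</p>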

## 2) Source Text Extraction
<p>The process for extracting the source text will vary by use case. We offer some things to consider during your design, but there may be other considerations based on your requirements. The example dataset used in this notebook was extracted from the Enron email dataset, selecting the message content for embedding.
    <li><b>Content Extraction - Technical</b>
        <ul>
            <li>How will you access the source text within the resource/document?</li>
            <li>What libraries / tools will be needed to extract text?</li>
        </ul>
    </li>
    <li><b>Content Selection</b>
        <ul>
            <li>What parts of the document will be selected for extraction? (e.g., Subject Line, Executive Summary, Individual Sections, etc.)</li>
            <li>If you would like the application to identify specific locations within a document that contain the relevant information, you will need to extract source text at the same level of granularity.</li>
        </ul>
    </li>
    <li><b>Content Quality</b>
        <ul>
            <li>Do you need to remove meta-data or file formatting components such as XML tags?</li>
            <li>Are there errors that need to be fixed? (e.g., spelling, formatting)</li>
        </ul>
    </li>
    </p>
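<p>As one way to approach the content quality questions above, the sketch below strips markup tags, decodes entities, and collapses whitespace using only the standard library. The function name <code>clean_extracted_text</code> and the sample fragment are hypothetical. A regular expression is a rough tool for markup; for well-formed XML a real parser is usually the safer choice.</p>

```python
import html
import re

# Matches anything that looks like an opening or closing markup tag
TAG_PATTERN = re.compile(r"<[^>]+>")


def clean_extracted_text(raw: str) -> str:
    """Remove markup tags, decode HTML/XML entities, and collapse whitespace."""
    without_tags = TAG_PATTERN.sub(" ", raw)
    decoded = html.unescape(without_tags)
    return " ".join(decoded.split())


# Hypothetical fragment used only for illustration
raw_fragment = "<abstract><p>Profit &amp; loss\n\n  summary for   Q3.</p></abstract>"
cleaned = clean_extracted_text(raw_fragment)
print(cleaned)
```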

## 3) Source Text Storage
<p>The example dataset used in this notebook is stored in a simple parquet file. However, if your use case needs to scale to millions or billions of items, a database may be beneficial. One option is to use MongoDB running in its own container to store the source text data.</p>

In [2]:
%load_ext autoreload
%autoreload
# Importing the needed libraries & Modules

# Import cudf. cudf is part of the NVIDIA RAPIDS data science SDK and is used to hold the dataframes
# in GPU memory.
import cudf

# Import SentenceTransformer and util from the HuggingFace sentence_transformers library, which has
# been pre-installed in this environment.
from sentence_transformers import SentenceTransformer, util

# Import pickle. pickle is used to store the embedding
import pickle

# Import Path. Used to manage file system
from pathlib import Path

# Import smart_search. This module was created for this example to simplify the management of the
# various models that can be used for the embedding process.
import smart_search

# Set some notebook variables
DATASET_NAME = "enron"
DATA_PATH = "../data/"
MODEL_PATH = "../models/"
EMBEDDING_FOLDER = DATA_PATH + "embeddings/"
PARQUET_PATH = DATA_PATH + "enron_extracted/email_data.parquet"
RUN_EXAMPLE_DATASET = True

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [3]:
# Verify the dataset exists. If not, download, extract, and preprocess the dataset.
file_path = Path(PARQUET_PATH)
if file_path.exists():
    print("The file exists.")
else:
    print("The file does not exist. Setting up dataset now.")
    %run data_setup.py

The file exists.


### Loading Example Dataset
<p>The dataset used in this example comprises over 500,000 email messages from the Enron dataset. It was downloaded and extracted using the included <a href='gather_enron_dataset.ipynb'>gather_enron_dataset</a> notebook and parsed using the <a href='parse_enron_data.ipynb'>parse_enron_data</a> notebook.</p>

<p>The source text dataset is stored in a parquet file containing the following:
    <li><b>file_path:</b> source file path from the Enron dataset</li>
    <li><b>message_id:</b> unique ID assigned in the Enron dataset</li>
    <li><b>date:</b> email date</li>
    <li><b>from_address:</b> sender address of the parent email</li>
    <li><b>to_address:</b> destination address(es) of the parent email</li>
    <li><b>org_filename:</b> name of the source file in the Enron dataset</li>
    <li><b>is_reply_forward:</b> flag indicating whether the message was a reply or a forward</li>
    <li><b>message:</b> email message content</li>
    
</p>
<p>The method of storing the source text may vary based on your use case (e.g., CSV, parquet, JSON, MongoDB). We use a parquet file in this example for simplicity.</p>

In [5]:
%%time
# Load in the example dataset
df = cudf.read_parquet(PARQUET_PATH).reset_index(drop=True)
print("The dataset contains {} entries".format(df.shape[0]))
df.head(2)

The dataset contains 517401 entries
CPU times: user 3.22 s, sys: 3.39 s, total: 6.61 s
Wall time: 6.56 s


Unnamed: 0,file_path,message_id,date,from_address,to_address,org_filename,is_reply_forward,message
0,/project/data/maildir/panus-s/inbox/25.,<31058281.1075863216949.JavaMail.evans@thyme>,"Thu, 15 Nov 2001 10:35:36 -0800 (PST)",mark.greenberg@enron.com,d..gros@enron.com,SPANUS (Non-Privileged).pst,False,\n\nTom -\n\nPlease take a look at the attache...
1,/project/data/maildir/panus-s/inbox/28.,<26834675.1075863217034.JavaMail.evans@thyme>,"Wed, 21 Nov 2001 17:37:47 -0800 (PST)",cheryl.johnson@enron.com,"laurel.adams@enron.com, lane.alexander@enron.c...",SPANUS (Non-Privileged).pst,False,"\n\nAttached is the final report for November,..."


## 4) Source Text Embedding
<p>Historically, search relied on simple <a href='https://en.wikipedia.org/wiki/Lexicography'>lexicographical</a> pattern matching such as <a href='https://en.wikipedia.org/wiki/Regular_expression'>regex</a>. Although lexical search can be useful for some use cases, it has several disadvantages, such as needing to specify the precise terms to search for. To improve search results, it can be advantageous to search based on <a href='https://en.wikipedia.org/wiki/Semantic_similarity#:~:text=Semantic%20similarity%20is%20a%20metric,as%20opposed%20to%20lexicographical%20similarity.'>semantic similarity</a>, comparing concepts rather than exact words.</p>
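<p>As a small illustration of the limitation described above, the sketch below runs a regex search over three made-up messages (they are not taken from the dataset). The pattern finds the message that contains the literal search term but misses a conceptually related message that uses a different word.</p>

```python
import re

# Made-up messages used only for illustration
messages = [
    "She's about 2 1/2 months pregnant right now.",
    "We are having a baby in the spring.",
    "Please review the attached contract.",
]

# Lexical search: match the literal word "baby", ignoring case
pattern = re.compile(r"\bbaby\b", re.IGNORECASE)
lexical_hits = [m for m in messages if pattern.search(m)]

print(lexical_hits)
```

<p>Only the second message is returned. The first message is about the same concept but shares no search term with the pattern, which is the gap semantic search aims to close.</p>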

<p>To be able to search by concept, we must be able to represent our data in the form of concepts. This is where <a href='https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)'>Transformers</a> come in. <a href='https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)'>Transformers</a> are a form of Machine Learning that can be applied to Natural Language Processing (NLP). The models have been trained on extremely large datasets such as Wikipedia to develop the ability to represent input text as a high-dimensional numerical vector; this process is called <a href='https://vaclavkosar.com/ml/transformer-embeddings-and-tokenization'>embedding</a>. If this sounds complicated, don't worry: the hard parts are all abstracted away for us, and we just need to use the sentence transformer library. Although there are benefits to understanding how the models work, sometimes it can be just as valuable to show how easy they are to use and how impressive the results can be with off-the-shelf models. If greater accuracy is needed, you can always <a href='https://www.sbert.net/docs/training/overview.html'>train transformers</a> on your own datasets to improve their capabilities.</p>

### 4a) Model Selection
<p>There are a large number of models to choose from on <a href='https://huggingface.co/'>HuggingFace</a>, even for just the task of <a href='https://huggingface.co/models?pipeline_tag=sentence-similarity&sort=downloads'>Sentence Similarity</a> (>800 as of 11/2022). We have included a Python module to help simplify the organization and selection of a smaller subset of models to experiment with (~100). Using <a href='https://huggingface.co/'>HuggingFace</a> simplifies the process of downloading and running the various models. It is not the only way to consume Transformers, but it was chosen because it is one of the easiest ways to get started.</p>
    
There are several areas to consider when selecting a model for a given task:
<li><b>Model Size</b> - Larger models need more VRAM and can take longer to run but may be more 'accurate'.</li>
<li><b>Model Architecture</b> - Some models are designed for specific use cases or fine-tuned for a given problem. If your use case is similar, you might get high performance out of the box.</li>
<li><b>Task</b> - Different models have been trained for different tasks. Some examples include Semantic Similarity, Semantic Search, Question Answering, and Document Summarization.</li>
    
<p>As stated above, the models have been trained for specific tasks. In our case we are trying to identify documents that are semantically similar to our query string. Within the Semantic Similarity group there are subgroups of tasks. One is identifying semantically similar sentences, where we evaluate two or more sentences and score their similarity. When the elements being evaluated are of similar length (sentence to sentence, paragraph to paragraph), the process is called <b>symmetric semantic search</b>. If you are evaluating a short query phrase or word against sentences, paragraphs, or even documents, it is referred to as <b>asymmetric semantic search</b>. Models have been specially trained for each type.</p>
    
<li><a href='https://www.sbert.net/docs/pretrained_models.html#sentence-embedding-models'>Symmetric Semantic Search Pretrained Models</a></li>
<li><a href='https://www.sbert.net/docs/pretrained-models/msmarco-v3.html'>Asymmetric Semantic Search</a></li>

### Loading the Model
<p>Loading the model is as simple as passing the model's name as an input argument to create a model object. If the model isn't available locally, it will be downloaded automatically. One of the hardest parts of working with HuggingFace is keeping track of all the models available. You can view all the models available for <a href='https://huggingface.co/models?pipeline_tag=sentence-similarity&sort=downloads'>Sentence Similarity</a> and copy the name into the code or, to simplify things, use the very basic Python module <a href='smart_search.py'>smart_search.py</a> that we created to hold model names.</p>

<details>
  <summary>SentenceTransformer Parameters</summary>
<li><b>model_name_or_path</b> – If it is a filepath on disc, it loads the model from that path. If it is not a path, it first tries to download a pre-trained SentenceTransformer model. If that fails, tries to construct a model from Huggingface models repository with that name.</li>
<li><b>modules</b> – This parameter can be used to create custom SentenceTransformer models from scratch.</li>
<li><b>device</b> – Device (like ‘cuda’ / ‘cpu’) that should be used for computation. If None, checks if a GPU can be used.</li>
<li><b>cache_folder</b> – Path to store models</li>
<li><b>use_auth_token</b> – HuggingFace authentication token to download private models.</li>
    </details>

In [6]:
# Select and load model.
# Note: If a given model hasn't been used since the container has been loaded it will be downloaded automatically.

# sentence_models is a large list of models. They have not been grouped by task beyond sentence similarity.
#model_name = smart_search.sentence_models[6]
#model_name = smart_search.default_model

# asymmetric_cosine_similarity_models are special-purpose models for asymmetric semantic similarity using cosine similarity
model_name = smart_search.asymmetric_cosine_similarity_models[1]

# symmetric_models are special-purpose models for symmetric semantic similarity
#model_name = smart_search.symmetric_models[1]

print("Loading model: '{}'".format(model_name))
model = SentenceTransformer(model_name,cache_folder = MODEL_PATH)

Loading model: 'msmarco-roberta-base-v3'




### Source Text Embedding
<p>To embed the source text, we can pass the entire column of our dataset into the model object in a single line of code as shown in the cell blocks below.</p>

<p>A couple of important items to note here:
    <li>You only need to embed the source text once for a given model. Depending on your use case, you may wish to store the embeddings in a database for later use.</li>
    <li>Each model embeds input text differently, so the source text and query text must be embedded with the same model. If you store embeddings for later, record which model produced them and which source document each one belongs to; comparing embeddings from different models will likely give unexpected results.</li>
    </p>
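<p>One way to follow the advice above is to store the model name next to the embeddings and verify it when loading. The sketch below is a minimal illustration with made-up vectors and model names; <code>save_embeddings</code> and <code>load_embeddings_for_model</code> are hypothetical helpers, separate from the helper functions defined later in this notebook.</p>

```python
import pickle
import tempfile
from pathlib import Path


def save_embeddings(path, model_name, ids, embeddings):
    """Store embeddings together with the name of the model that produced them."""
    payload = {"model_name": model_name, "ids": ids, "embeddings": embeddings}
    with open(path, "wb") as f_out:
        pickle.dump(payload, f_out, protocol=pickle.HIGHEST_PROTOCOL)


def load_embeddings_for_model(path, expected_model_name):
    """Load embeddings, refusing files that were produced by a different model."""
    with open(path, "rb") as f_in:
        payload = pickle.load(f_in)
    if payload["model_name"] != expected_model_name:
        raise ValueError(
            "Embeddings were created with '{}', not '{}'".format(
                payload["model_name"], expected_model_name
            )
        )
    return payload["ids"], payload["embeddings"]


# Made-up ids, vectors, and model names used only for illustration
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "embeddings_demo.pkl"
    save_embeddings(demo_path, "model-a", ["doc-1", "doc-2"], [[0.1, 0.2], [0.3, 0.4]])

    ids, vectors = load_embeddings_for_model(demo_path, "model-a")
    try:
        load_embeddings_for_model(demo_path, "model-b")
        mismatch_detected = False
    except ValueError:
        mismatch_detected = True

print(ids, mismatch_detected)
```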


<details>
  <summary>encode Parameters</summary>
    <li><b>sentences</b> – the sentences to embed</li>
    <li><b>batch_size</b> – the batch size used for the computation</li>
    <li><b>show_progress_bar</b> – Output a progress bar when encode sentences</li>
    <li><b>output_value</b> – Default sentence_embedding, to get sentence embeddings. Can be set to token_embeddings to get wordpiece token embeddings. Set to None, to get all output values</li>
    <li><b>convert_to_numpy</b> – If true, the output is a list of numpy vectors. Else, it is a list of pytorch tensors.</li>
    <li><b>convert_to_tensor</b> – If true, you get one large tensor as return. Overwrites any setting from convert_to_numpy</li>
    <li><b>device</b> – Which torch.device to use for the computation</li>
    <li><b>normalize_embeddings</b> – If set to true, returned vectors will have length 1. In that case, the faster dot-product (util.dot_score) instead of cosine similarity can be used.</li>
    </details>
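<p>The note on <b>normalize_embeddings</b> relies on a simple identity: for vectors of length 1, the dot product equals the cosine similarity. The sketch below checks this with two small made-up vectors in plain Python; it does not call the sentence transformer library.</p>

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def normalize(v):
    """Scale a vector to length 1."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]


# Made-up vectors used only for illustration
a = [3.0, 4.0]
b = [4.0, 3.0]

cos = cosine_similarity(a, b)
dot_of_unit_vectors = dot(normalize(a), normalize(b))

# Both values are approximately 0.96
print(cos, dot_of_unit_vectors)
```

<p>Because normalization is done once at embedding time, each later comparison can skip the two square roots and the division.</p>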

#### Performance Considerations
The examples below show the benefits of processing an array of inputs versus iterating over a loop. Additional performance improvements may be possible by utilizing Triton Inference Server.

In [7]:
# Create a subset to be used as an example
if RUN_EXAMPLE_DATASET:
    example_size = 10000
    example_df = df[0:example_size]

<p>In the cell below we call the encoder once for every message. This does not take advantage of parallel processing, and we can see the difference in processing time. The timings are a result of test runs using an NVIDIA RTX A6000.</p>
<p><b>NVIDIA RTX A6000</b>
<li>Model: all-mpnet-base-v2</li>
<li>Example Size: 10,000 Messages</li>
<li>Wall Time: 2min 13s --> 133 Seconds</li></p>

<b>NVIDIA RTX A3500 (note: Now running on CUDA 12.2)</b>
<p>
<li>Model: all-mpnet-base-v2</li>
<li>Example Size: 10,000 Messages</li>
<li>Wall Time: 1min 35s --> 95 Seconds</li>
</p>

<p>
<li>Model: msmarco-roberta-base-v3</li>
<li>Example Size: 10,000 Messages</li>
<li>Wall Time: </li>
</p>

In [8]:
%%time
if RUN_EXAMPLE_DATASET:
    # Initialize embedding list
    source_embedding_loop = []
    
    # Loop through the examples and embed each message
    for i in range(0,example_df.shape[0]):   
        source_embedding_loop.append(model.encode(example_df['message'][i], convert_to_tensor=True))

KeyboardInterrupt: 

<p>In the cell below we call the encoder with the entire array. This method allows us to take advantage of batch processing. The timings are a result of test runs using an NVIDIA RTX A6000. Note we can run various batch sizes.</p>

<b>NVIDIA RTX A6000</b>
<p>Model: all-mpnet-base-v2
<li>Example Size: 10,000 Messages</li>
<li>Wall Time: 34.5 Seconds</li></p>

<p>
Model: msmarco-distilbert-base-v4
<li>Example Size: 10,000 Messages</li>
<li>Wall Time: 18.4 Seconds</li></p>

<b>NVIDIA RTX A3500 (note: Now running on CUDA 12.2)</b>
<p>
<li>Model: msmarco-roberta-base-v3</li>
<li>Example Size: 10,000 Messages</li>
<li>Wall Time: </li>
</p>




In [9]:
%%time
if RUN_EXAMPLE_DATASET:
    source_embeddings_array = model.encode(example_df.message.to_pandas(),convert_to_tensor=True)

CPU times: user 6min 53s, sys: 5.97 s, total: 6min 59s
Wall time: 6min 36s


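<p>Note that batch encoding was roughly 4x faster than the per-message loop (34.5 seconds versus 133 seconds for all-mpnet-base-v2 on the A6000); this could have a significant impact when running on the full set of roughly 500,000 messages.</p>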

<p>
<li>Model: msmarco-distilbert-base-v4</li>
<li>Corpus Size: 500,000 Messages</li>
<li>Batch Size: 64</li>
<li>Wall Time: 18min 5s</li>
</p>

Note: Batch size does not seem to have a significant effect on performance.
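<p>The timings reported above can be turned into a quick back-of-the-envelope check. The sketch below computes the loop-versus-batch speedup and linearly extrapolates the 10,000-message batch timing to the full corpus. The linear scaling is an assumption; it ignores differences in message length across the corpus.</p>

```python
# Timings reported above for 10,000 messages on the A6000
sample_size = 10_000
loop_seconds = 133.0    # per-message loop, all-mpnet-base-v2
batch_seconds = 34.5    # batched encode, all-mpnet-base-v2

speedup = loop_seconds / batch_seconds

# Linear extrapolation for msmarco-distilbert-base-v4 (assumes uniform message length)
distilbert_batch_seconds = 18.4
corpus_size = 500_000
estimated_seconds = distilbert_batch_seconds * corpus_size / sample_size

# Measured wall time reported above for the full corpus: 18min 5s
measured_seconds = 18 * 60 + 5

print("speedup: {:.2f}x".format(speedup))
print("estimated: {:.1f} min, measured: {:.1f} min".format(
    estimated_seconds / 60, measured_seconds / 60))
```

<p>The estimate comes out somewhat lower than the measured time, which suggests the first 10,000 messages are not fully representative of the whole corpus.</p>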

### Embedding the entire dataset
We only need to embed the entire dataset once. We first check whether embeddings for this model / dataset combination already exist on disk. If so, we load them; if not, we compute and save them. Embedding the ~500,000 emails took roughly 20 to 45 minutes on the GPUs tested, depending on the model.

In [22]:
# Helper functions to read and write embeddings to disk.
def load_embeddings(embedding_file_path):
    """Load the embeddings stored in a pickle file written by write_embeddings."""
    with open(embedding_file_path, "rb") as f_in:
        stored_data = pickle.load(f_in)

    # The file also holds 'message_id'; for now we only need the embeddings.
    return stored_data['embeddings']


def write_embeddings(embedding_folder, embedding_file_name, message_ids, source_embeddings):
    """Write the message ids and their embeddings to a pickle file."""
    dir_path = Path(embedding_folder)

    # Create the directory (and any missing parents) if it does not exist.
    if not dir_path.is_dir():
        print("Directory does not exist. Creating it now.")
        dir_path.mkdir(parents=True, exist_ok=True)

    # Build the file path
    file_path = dir_path / embedding_file_name

    # Write the embeddings and message ids to disk
    with open(file_path, "wb") as f_out:
        pickle.dump({'message_id': message_ids, 'embeddings': source_embeddings}, f_out, protocol=pickle.HIGHEST_PROTOCOL)

In [23]:
%%time

# Flag for multi-gpu embedding.
TRAIN_MULTI = False

# Create the file name that would be used to store the embeddings.
embedding_file_name = "embeddings_{}_{}.pkl".format(DATASET_NAME,model_name)

# Create embedding Path object
embedding_file = Path(EMBEDDING_FOLDER + embedding_file_name)

# Check whether an embedding file already exists for this dataset / model combination
if embedding_file.is_file():
    # If so, load it.
    print("Embedding file exists. Loading it now.")
    source_embeddings = load_embeddings(embedding_file)
else:
    # If an embedding file does not exist. Embed the dataset and cache the data.
    print("Embedding file does not exist. Creating now.")
    
    if TRAIN_MULTI:
        pool = model.start_multi_process_pool()
        source_embeddings = model.encode_multi_process(df.message.to_pandas(),pool)
        model.stop_multi_process_pool(pool)
    else:
        source_embeddings = model.encode(df.message.to_pandas(),convert_to_tensor=True,show_progress_bar=True)
    
    # Write out the generated embeddings
    write_embeddings(EMBEDDING_FOLDER,embedding_file_name,df.message_id.to_pandas(),source_embeddings)
    
print(embedding_file)

Embedding file does not exist. Creating now.


Batches:  19%|█▉        | 3048/16169 [26:56<1:55:58,  1.89it/s] 


KeyboardInterrupt: 

### Timings
<li>A6000 embeddings_enron_msmarco-roberta-base-v3.pkl - 34min 24s</li>
<li>A6000 embeddings_enron_msmarco-distilbert-base-v3.pkl - 21min 55s</li>
<li>A6000 embeddings_enron_multi-qa-mpnet-base-dot-v1 - 45min 29s</li>
<li>A3500 embeddings_enron_msmarco-distilbert-base-v4 - 20min 3s</li>
<li>A3500 embeddings_enron_msmarco-roberta-base-v3 - </li>

## 6) Query String Embedding
<p>Using the same model, we then embed our query string to be used for comparison.</p>

In [24]:
%%time
# Embed the query string
query_string = 'we are having a baby'
query_embedding = model.encode(query_string,convert_to_tensor=True)

CPU times: user 12.5 ms, sys: 164 µs, total: 12.7 ms
Wall time: 11.9 ms


## 7) Similarity Scoring and Ranking
<p>Next, we need to calculate the similarity between the query embedding and all the source text embeddings. One of the most common approaches is to calculate the cosine similarity. Again, the complexities and math have been abstracted away by the <a href='https://www.sbert.net/docs/package_reference/util.html'>util.cos_sim</a> and util.semantic_search functions.</p>
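<p>To make the scoring and ranking step concrete, the sketch below ranks a tiny made-up corpus of 2-dimensional vectors by cosine similarity to a query vector and keeps the top k. It is a plain-Python illustration of the idea, not the library's implementation; the result dictionaries reuse the <code>corpus_id</code> key seen in the notebook code, and the <code>top_k</code> helper is hypothetical.</p>

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, corpus_vecs, k):
    """Return the k corpus entries most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), idx)
              for idx, vec in enumerate(corpus_vecs)]
    best = heapq.nlargest(k, scored)
    return [{"corpus_id": idx, "score": score} for score, idx in best]


# Made-up vectors used only for illustration
corpus = [
    [1.0, 0.0],  # corpus_id 0: nearly parallel to the query
    [0.0, 1.0],  # corpus_id 1: nearly orthogonal to the query
    [1.0, 1.0],  # corpus_id 2: in between
]
query = [1.0, 0.1]

results = top_k(query, corpus, k=2)
print([r["corpus_id"] for r in results])
```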

In [25]:
%%time
# Set k as the number of top results
k = 100

# Using the util function to run semantic search, default to cosine
topk_results = util.semantic_search(query_embedding, source_embeddings, top_k=k)[0]

# Extract the result ids
topk_results_ids = [result['corpus_id'] for result in topk_results]

# Get a dataframe of the top k results
topk_df = df.iloc[topk_results_ids].reset_index()

CPU times: user 501 ms, sys: 0 ns, total: 501 ms
Wall time: 486 ms


In [26]:
# Display the message. 
import re

# Using re to clean up empty lines
msg = re.sub('\n\n', '', topk_df.message[0])
print(msg)

Diane has passed her 5 year mark with the company, and as such is entitled to all the rights and priviledges of a veteran.  Please congratulate her with me and come admire her gift!Cara


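## 8) Advanced Techniques
<p>The semantic search above uses a bi-encoder: the query and each message are embedded independently and then compared. As described in <a href='https://www.sbert.net/examples/applications/retrieve_rerank/README.html'>Retrieval and Re-Ranking</a>, a cross-encoder can then re-rank the retrieved top-k results by scoring each query/message pair jointly. This is slower per pair but is typically more accurate, which is why it is applied only to the top-k candidates.</p>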


In [15]:
# Import the cross encoder library
from sentence_transformers.cross_encoder import CrossEncoder

# Load the cross encoder model
cross_model = CrossEncoder(smart_search.cross_encoder_models[1])



In [16]:
%%time
# Calculate the cross-encoder scores and assign scores to dataframe column
topk_df['score'] = [cross_model.predict([query_string,msg]) for msg in topk_df.message.to_pandas()]

CPU times: user 1.97 s, sys: 142 ms, total: 2.11 s
Wall time: 2.02 s


In [17]:
# Sort the dataframe based on descending score
topk_df = topk_df.sort_values('score',ascending=False).reset_index()

# Print the top result
print(topk_df.message[0])



Hey,

I just got an email from Linda McKula (we went to her wedding in Napa);
anyway she was sharing the news that her and Tim are going to be having a
baby.  She's about 2 1/2 months preggo right now.  She said things have
gotton really quiet/slow work wise for them but they are thankful that they
still have jobs (for now).  Later.

I just found out that I might have to be in a 4 hour mtg today from 11-3; so
let me know what your plans are this afternoon via email.
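<p>The two-stage flow used in this section can be sketched without any models. In the toy example below, a cheap word-overlap score stands in for the bi-encoder retrieval stage and a stricter word-pair score stands in for the cross-encoder; both scoring functions and the sample documents are hypothetical stand-ins, not what the real models compute.</p>

```python
def tokenize(text):
    """Lowercase and split on whitespace."""
    return text.lower().split()


def retrieval_score(query, doc):
    """Stage 1 stand-in: fraction of query words that appear in the document."""
    query_words = set(tokenize(query))
    return len(query_words & set(tokenize(doc))) / len(query_words)


def rerank_score(query, doc):
    """Stage 2 stand-in: number of adjacent query word pairs found in the document."""
    q, d = tokenize(query), tokenize(doc)
    doc_pairs = set(zip(d, d[1:]))
    return sum(1 for pair in zip(q, q[1:]) if pair in doc_pairs)


def retrieve_and_rerank(query, docs, k):
    """Keep the top-k documents by the cheap score, then reorder them by the stricter score."""
    candidates = sorted(docs, key=lambda doc: retrieval_score(query, doc), reverse=True)[:k]
    return sorted(candidates, key=lambda doc: rerank_score(query, doc), reverse=True)


# Made-up documents used only for illustration
docs = [
    "a baby is what we are all having trouble naming",
    "they said we are having a baby in june",
    "the meeting is at noon",
]

ranked = retrieve_and_rerank("we are having a baby", docs, k=2)
print(ranked[0])
```

<p>The first two documents tie in the retrieval stage because both contain every query word; the second stage separates them by looking at the query and document together.</p>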


### Additional Resources

<li><a href='https://huggingface.co/'>HuggingFace</a></li>
<li><a href='https://huggingface.co/models?pipeline_tag=sentence-similarity&sort=downloads'>Sentence Similarity Models</a></li>
<li><a href='https://www.sbert.net/docs/usage/semantic_textual_similarity.html'>Semantic Textual Similarity</a></li>

### Environments
#### Past Environment
This notebook has been developed and tested on the following:
<li>NVIDIA RTX A6000</li>
<li>RAPIDS - rapidsai-core:22.10-cuda11.5-base-ubuntu20.04-py3.9</li>
<li>Pytorch 1.12.1</li>
<li>sentence-transformers</li>

#### New Environment
<li>NVIDIA RTX A3500</li>
<li>nvcr.io/nvidia/ai-workbench/pytorch:1.0.2</li>
<li>Pytorch 2.1</li>
<li>sentence-transformers</li>