# Smart Document Retrieval
<p>The primary focus of this notebook is to illustrate how a transformer model embeds text data into a numerical representation that can be used to calculate a similarity score against the embedding of a query string.</p>

In [None]:
!pip install sentence_transformers

In [1]:
## Imports and dependencies
%load_ext autoreload
%autoreload
# Importing the needed libraries & Modules

import glob
import re
import pandas as pd

# Import SentenceTransformer and util from the HuggingFace sentence_transformer library which has
# been pre-installed in this environment.
from sentence_transformers import SentenceTransformer, util

# Import pickle. pickle is used to store the embedding
import pickle

# Import Path. Used to manage file system
from pathlib import Path

# Import smart_search. This module was created for this example to simplify the management of the 
# various models that can be used for the embedding process.
import smart_search

def get_txt_files_in_folder(folder_path):
    txt_files = glob.glob(folder_path + '/*.txt')
    return txt_files

def read_file_lines(filename):
    try:
        with open(filename, 'r') as file:
            lines = file.readlines()
            lines = [line.strip() for line in lines]
            return lines
    except FileNotFoundError:
        print(f"File '{filename}' not found.")
        return []  

def extract_info_from_txt(txt_path):
    """Parse one abstract file into a dict of its labelled fields."""
    with open(txt_path, 'r') as file:
        content = file.read()

    def find_field(pattern):
        # Fail with a clear message instead of an AttributeError when a field is missing.
        match = re.search(pattern, content)
        if match is None:
            raise ValueError(f"Pattern {pattern!r} not found in '{txt_path}'")
        return match.group(1)

    return {
        'Document Number': find_field(r'Document Number: (\d+)'),
        'Date': find_field(r'Date: (\d+)'),
        'Title': find_field(r'Title: (.+)'),
        'Abstract': find_field(r'Abstract: (.+)')
    }

def create_dataframe_from_txt_files(file_paths):
    data = [extract_info_from_txt(file_path) for file_path in file_paths]
    df = pd.DataFrame(data)
    return df    

# Set some notebook variables
DATASET_NAME = "uspto"
EMBEDDING_FOLDER = "embeddings/"

ModuleNotFoundError: No module named 'sentence_transformers'

# Source Text Storage
<p>The example dataset used in this notebook is a collection of abstracts extracted from the patent-grant-full-text-dataxml dataset. The abstracts were stored as plain text files, then imported into a dataframe and stored in a parquet file.</p>

In [2]:
# Create dataframe of abstracts
folder_path = "./output/abstracts/"
abs_files = get_txt_files_in_folder(folder_path)

print(f"There are {len(abs_files)} files.")

There are 534169 files.


In [3]:
# Create the DataFrame
abstract_dataframe_path = 'uspto_abstracts.parquet'
abstract_file = Path(abstract_dataframe_path)

if abstract_file.is_file():
    df = pd.read_parquet(abstract_dataframe_path)
else:
    print('Did not find abstract file.')
    df = create_dataframe_from_txt_files(abs_files)
    # Cache the dataframe so later runs can skip parsing the text files.
    df.to_parquet(abstract_dataframe_path)

df.head()

Unnamed: 0,Document Number,Date,Title,Abstract
0,11276134,20220315,Reconfigurable image processing hardware pipeline,A reconfigurable image processing pipeline in...
1,11324814,20220510,Live attenuated oral vaccine against ETEC and ...,Disclosed is the attenuated Salmonella typhi ...
2,11508069,20221122,Method for processing event data flow and comp...,The present disclosure provides a method for ...
3,11304408,20220419,Leash attachment,A leash attachment and method for using the l...
4,11383015,20220712,System and method for plasma purification prio...,A method of collecting mononuclear cells incl...


In [4]:
abs_txt = read_file_lines(abs_files[0])    
# Show example document

for line in abs_txt:
    print(line)

Document Number: 11276134
Date: 20220315
Title: Reconfigurable image processing hardware pipeline
Abstract:  A reconfigurable image processing pipeline includes an image signal processor (ISP), a control processor, and a local memory. ISP processes raw pixel data for a frame based on an image processing parameter and provides lines of processed pixel data to control processor via a first interface. For each region of interest (ROI) in the frame, ISP generates auto-exposure and auto-white balance (2A) statistics based on the lines for the ROI and writes them to the local memory via a second interface. Control processor reads 2A statistics from the local memory, determines the image processing parameter based on them, and provides the image processing parameter to ISP. ISP also generates an integer N bin histogram for control processor, which sums a portion of the N total bins and compares the summed bin count to a lighting transition threshold. The image processing parameter is further 

## Source Text Embedding
<p>Historical methods for search involved simple <a href='https://en.wikipedia.org/wiki/Lexicography'>lexicographical</a> pattern matching such as <a href='https://en.wikipedia.org/wiki/Regular_expression'>regex</a>. Although lexical search can be useful for some use cases, it has several disadvantages, such as needing to specify the precise terms to search for. To improve search results it can be advantageous to search based on <a href='https://en.wikipedia.org/wiki/Semantic_similarity#:~:text=Semantic%20similarity%20is%20a%20metric,as%20opposed%20to%20lexicographical%20similarity.'>semantic similarity</a>, comparing concepts rather than words.</p>
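
To make the limitation concrete, here is a minimal sketch of a lexical search missing a document that describes the same concept in different words. The two sample sentences are made up for illustration and are not taken from the patent dataset.

```python
import re

# Hypothetical mini-corpus; not taken from the USPTO dataset.
documents = [
    "A leash attachment for walking a dog.",
    "A tether that secures a canine to its owner during exercise.",
]

# Lexical search only finds documents that contain the literal term.
pattern = re.compile(r"\bleash\b", re.IGNORECASE)
lexical_hits = [doc for doc in documents if pattern.search(doc)]

print(lexical_hits)
```

The second sentence describes essentially the same idea but is not returned, because none of its words match the search term. A semantic search is intended to close this kind of gap.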

<p>To be able to search by concept we must be able to represent our data in the form of concepts. This is where <a href='https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)'>Transformers</a> come in. <a href='https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)'>Transformers</a> are a form of Machine Learning that can be applied to Natural Language Processing (NLP). The models have been trained on extremely large datasets such as Wikipedia to develop the ability to represent input text as a high-dimensional numerical vector; this process is called <a href='https://vaclavkosar.com/ml/transformer-embeddings-and-tokenization'>embedding</a>. If this sounds complicated, don't worry: the hard parts are all abstracted away for us, and we just need to use the sentence transformer library. Although there are benefits to understanding how the models work, sometimes it can be just as valuable to show how easy they are to use and how impressive the results can be using off-the-shelf models. If greater accuracy is needed you can always <a href='https://www.sbert.net/docs/training/overview.html'>train transformers</a> on your own datasets to improve their capabilities.</p>

### Model Selection
<p>There are a large number of models to choose from on <a href='https://huggingface.co/'>HuggingFace</a>, even for just the task of <a href='https://huggingface.co/models?pipeline_tag=sentence-similarity&sort=downloads'>Sentence Similarity</a> (>800 as of 11/2022). We have included a Python module to help simplify organization and selection of a smaller subset of models to experiment with (~100). Using <a href='https://huggingface.co/'>HuggingFace</a> simplifies the process of downloading and running the various models. It is not the only way to consume Transformers, but it was chosen as it is one of the easiest ways to get started.</p>
    
There are several areas to consider when selecting a model for a given task:
<li><b>Model Size</b> - Large models need more VRAM and can take longer to run but may be more 'accurate'</li>
<li><b>Model Architecture</b> - Some models might be designed for specific use cases or fine-tuned for a given problem. If your use case is similar, you might get high performance out of the box.</li>
<li><b>Task</b> - Different models have been trained for different tasks. Some examples of various tasks include: Semantic Similarity, Semantic Search, Question Answering, and Document Summarization.</li>
    
<p>As stated above, the models have been trained for specific tasks. In our case we are trying to identify documents that are semantically similar to our query string. Within the Semantic Similarity group there are subgroups of tasks. One is identifying semantically similar sentences, where we evaluate two or more sentences and score their similarity. When the elements being evaluated are of similar length (sentence to sentence, paragraph to paragraph) the process is called <b>symmetric semantic search</b>. If you are comparing a short query phrase or word to sentences, paragraphs, or even documents, it is referred to as <b>asymmetric semantic search</b>. Models have been specially trained for each type.</p>
    
<li><a href='https://www.sbert.net/docs/pretrained_models.html#sentence-embedding-models'>Symmetric Semantic Search Pretrained Models</a></li>
<li><a href='https://www.sbert.net/docs/pretrained-models/msmarco-v3.html'>Asymmetric Semantic Search</a></li>

### Loading the Model
<p>Loading the model is as simple as passing the model's name as an input argument to create a model object. If the model isn't available locally it will be downloaded automatically. One of the hardest parts of working with HuggingFace is keeping track of all the models available. You can view all the models available for <a href='https://huggingface.co/models?pipeline_tag=sentence-similarity&sort=downloads'>Sentence Similarity</a> and copy the name into the code or, to simplify things, use the very basic Python module <a href='smart_search.py'>smart_search.py</a> that we created to hold model names.</p>

<details>
  <summary>SentenceTransformer Parameters</summary>
<li><b>model_name_or_path</b> – If it is a filepath on disc, it loads the model from that path. If it is not a path, it first tries to download a pre-trained SentenceTransformer model. If that fails, tries to construct a model from Huggingface models repository with that name.</li>
<li><b>modules</b> – This parameter can be used to create custom SentenceTransformer models from scratch.</li>
<li><b>device</b> – Device (like ‘cuda’ / ‘cpu’) that should be used for computation. If None, checks if a GPU can be used.</li>
<li><b>cache_folder</b> – Path to store models</li>
<li><b>use_auth_token</b> – HuggingFace authentication token to download private models.</li>
    </details>
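
The cell below passes `device='cuda'` directly, which assumes a GPU is present. As a hedged sketch, a small helper could pick the device at runtime instead. The name `pick_device` is hypothetical and not part of sentence_transformers; the sketch assumes PyTorch is the backend and falls back to the CPU if it cannot be imported.

```python
def pick_device():
    """Return 'cuda' when PyTorch reports a usable GPU, otherwise 'cpu'."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is no GPU support to query, so default to CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Selected device: {device}")
```

The returned string could then be passed as the `device` argument described above. Leaving `device` as None has a similar effect, since the library then checks for a GPU itself.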

In [5]:
# Select and load model.
# Note: If a given model hasn't been used since the container has been loaded it will be downloaded automatically.

# The sentence_models list is a large list of models. They have not been grouped by task beyond sentence similarity 
#model_name = smart_search.sentence_models[3]
#model_name = smart_search_models.default_model

# asymmetric_cosine_similarity_models are special purpose models for Asymmetric Semantic Similarity through cosine similarity calculations
model_name = smart_search.asymmetric_cosine_similarity_models[0]

# symmetric_models are special purpose models for Symmetric Semantic Similarity
#model_name = smart_search.symmetric_models[3]

print("Loading model: '{}'".format(model_name))
#model = SentenceTransformer(model_name,cache_folder='./models/', device='cpu')
model = SentenceTransformer(model_name,cache_folder='./models/', device='cuda')

Loading model: 'msmarco-distilbert-base-v4'



### Source Text Embedding
<p>To embed the source text, we can pass the entire column of our dataset into the model object in a single line of code as shown in the cell blocks below.</p>

<p>A couple of important items to note here:
    <li>You only need to embed the source text once for a given model. Depending on your use case you may wish to store the embeddings in a database for later use.</li>
    <li>Each model embeds input text differently, so the source text and query text must be embedded using the same model. If you store your embeddings for later, track which model produced them and which source documents they correspond to, as comparing embeddings from different models will likely give unexpected results.</li>
    </p>
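
One way to keep track of the model is to store its name next to the embeddings and check it on load. The following is a minimal sketch, not the approach used later in this notebook (which encodes the model name in the file name). The function names and the `model_name` dictionary key are assumptions made for illustration, and toy vectors stand in for real model output.

```python
import pickle
import tempfile
from pathlib import Path


def save_embeddings(path, model_name, embeddings):
    """Write embeddings together with the name of the model that produced them."""
    with open(path, "wb") as f_out:
        pickle.dump({"model_name": model_name, "embeddings": embeddings}, f_out)


def load_embeddings_for_model(path, expected_model_name):
    """Load embeddings, refusing to return them if another model created them."""
    with open(path, "rb") as f_in:
        stored = pickle.load(f_in)
    if stored["model_name"] != expected_model_name:
        raise ValueError(
            f"Embeddings were created with '{stored['model_name']}', "
            f"not '{expected_model_name}'."
        )
    return stored["embeddings"]


# Toy vectors stand in for real model output.
toy_embeddings = [[0.1, 0.2], [0.3, 0.4]]

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "embeddings_demo.pkl"
    save_embeddings(demo_path, "msmarco-distilbert-base-v4", toy_embeddings)

    # Loading with the matching model name succeeds.
    loaded = load_embeddings_for_model(demo_path, "msmarco-distilbert-base-v4")

    # Loading with a different model name is rejected.
    try:
        load_embeddings_for_model(demo_path, "all-mpnet-base-v2")
        mismatch_detected = False
    except ValueError:
        mismatch_detected = True

print(loaded, mismatch_detected)
```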


<details>
  <summary>encode Parameters</summary>
    <li><b>sentences</b> – the sentences to embed</li>
    <li><b>batch_size</b> – the batch size used for the computation</li>
    <li><b>show_progress_bar</b> – Output a progress bar when encoding sentences</li>
    <li><b>output_value</b> – Default sentence_embedding, to get sentence embeddings. Can be set to token_embeddings to get wordpiece token embeddings. Set to None, to get all output values</li>
    <li><b>convert_to_numpy</b> – If true, the output is a list of numpy vectors. Else, it is a list of pytorch tensors.</li>
    <li><b>convert_to_tensor</b> – If true, you get one large tensor as return. Overwrites any setting from convert_to_numpy</li>
    <li><b>device</b> – Which torch.device to use for the computation</li>
    <li><b>normalize_embeddings</b> – If set to true, returned vectors will have length 1. In that case, the faster dot-product (util.dot_score) instead of cosine similarity can be used.</li>
    </details>
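
The note on `normalize_embeddings` relies on a simple identity: once two vectors are scaled to length 1, their dot product equals their cosine similarity. The sketch below checks this in plain Python with two toy vectors. The helper names are illustrative and are not part of sentence_transformers.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(dot(v, v))


def normalize(v):
    """Scale a vector to length 1."""
    length = norm(v)
    return [x / length for x in v]


def cos_sim_plain(a, b):
    """Cosine similarity computed from unnormalized vectors."""
    return dot(a, b) / (norm(a) * norm(b))


u = [3.0, 4.0]
v = [1.0, 2.0]

cos_uv = cos_sim_plain(u, v)
dot_normalized = dot(normalize(u), normalize(v))

# Both values agree up to floating point rounding.
print(cos_uv, dot_normalized)
```

Because the normalization is done once at embedding time, each later comparison needs only the dot product, which is why the dot-product route can be faster.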

<p>In the cell below we call the encoder for every abstract. This does not take advantage of parallel processing, and the processing time differs noticeably between models. The timings below are the result of test runs using an NVIDIA RTX A6000.</p>
<li>Example Size: 534,169 Patents</li>

<b>asymmetric_cosine_similarity_models</b>
<li>Model: msmarco-distilbert-base-v4</li>
<li>Wall Time: 12min 45s</li>
<br>
<li>Model: msmarco-roberta-base-v3</li>
<li>Wall Time: 23min 4s</li>
<br>
<li>Model: msmarco-distilbert-base-v3</li>
<li>Wall Time: 12min 16s</li>
<br>
<b>symmetric_cosine_similarity_models</b>
<li>Model: all-mpnet-base-v2</li>
<li>Wall Time: 24min 47s</li>
<br>
<li>Model: multi-qa-mpnet-base-dot-v1</li>
<li>Wall Time: </li>
<br>
<li>Model: all-distilroberta-v1</li>
<li>Wall Time: </li>
<br>
<li>Model: all-MiniLM-L12-v2</li>
<li>Wall Time: </li>
<br>



### Embedding the entire dataset
We only need to embed the entire dataset once. We first check whether embeddings for this model / dataset combination already exist. If so, we load them from disk; if not, we compute them. Embedding the ~500,000 patents can take as long as 30 minutes.

In [6]:
# Create helper functions to read and write embedding to files.
def load_embeddings(embedding_file_path):

    # Load documents & embeddings from disk
    with open(embedding_file_path, "rb") as fIn:
        stored_data = pickle.load(fIn)

    # As of now we only need the stored embeddings
    return stored_data['embeddings']

def write_embeddings(embedding_folder, embedding_file_name,message_ids,source_embeddings):

    # Check if the directory exists
    dir_path = Path(embedding_folder)

    if not dir_path.is_dir():
        print("Directory does not exist. Creating it now.")
        # If the directory doesn't exist create it (including any missing parents).
        dir_path.mkdir(parents=True)

    # Create the file path
    file_path = dir_path / embedding_file_name

    # Write out the embedding and message_id to disk
    with open(file_path, "wb") as fOut:
        pickle.dump({'document': message_ids, 'embeddings': source_embeddings}, fOut, protocol=pickle.HIGHEST_PROTOCOL)

In [7]:
%%time

# Create the file name that would be used to store the embeddings.
embedding_file_name = "embeddings_{}_{}.pkl".format(DATASET_NAME,model_name)

# Create embedding Path object
embedding_file = Path(EMBEDDING_FOLDER + embedding_file_name)

# Check if the embedding file already exists
if embedding_file.is_file():
    # If an embedding file exists for this dataset / model combination, load it.
    print("Embedding file exists. Loading it now.")
    source_embeddings = load_embeddings(embedding_file)
else:
    # If an embedding file does not exist, embed the dataset and cache the data.
    print("Embedding file does not exist. Creating now.")
    
    source_embeddings = model.encode(df.Abstract,convert_to_tensor=True,show_progress_bar=True)
    
    # Write out the generated embeddings
    write_embeddings(EMBEDDING_FOLDER,embedding_file_name,df.Abstract,source_embeddings)
    
print(embedding_file)

Embedding file exists. Loading it now.
embeddings/embeddings_uspto_msmarco-distilbert-base-v4.pkl
CPU times: user 407 ms, sys: 2.38 s, total: 2.79 s
Wall time: 2.8 s


## Query String Embedding
<p>Using the same model, we then embed our query string to be used for comparison.</p>

In [8]:
%%time
# Embed the query string
#query_string = 'Artificial intelligence (AI) anomaly monitoring in a storage system. The AI anomaly monitoring may include writing commands into a log jointly with the execution of the commands on storage media of a drive. The log includes information regarding the operation of the drive including, at least, the commands. In turn, each drive in the storage system may include an AI processor core that may access the log and apply an AI analysis to the log to monitor for an anomaly regarding the operation of the drive. As each drive in the storage system may use the AI process core to detect anomalies locally to the drive, the computational and network resources needed to employ the AI monitoring may be reduced.'
query_string = "An anomaly detector includes a writing unit that writes anomaly detection data readable by an external diagnostic device to an external memory when an anomaly is detected in an on-board device. Further, the anomaly detector includes a determination unit that determines whether a failure is occurring in a memory, which is used when a processor is operated during the writing unit performs the writing. Also, the anomaly detector includes a resetting unit that resets the memory by activating a specified one of reset functions of the processor when the determination unit determines that a failure is occurring in the memory. When the determination unit determines that a failure is occurring in the memory, the writing unit writes the anomaly detection data after the memory is reset by the specified one of the reset functions. "
query_embedding = model.encode(query_string,convert_to_tensor=True)

print(query_string)

An anomaly detector includes a writing unit that writes anomaly detection data readable by an external diagnostic device to an external memory when an anomaly is detected in an on-board device. Further, the anomaly detector includes a determination unit that determines whether a failure is occurring in a memory, which is used when a processor is operated during the writing unit performs the writing. Also, the anomaly detector includes a resetting unit that resets the memory by activating a specified one of reset functions of the processor when the determination unit determines that a failure is occurring in the memory. When the determination unit determines that a failure is occurring in the memory, the writing unit writes the anomaly detection data after the memory is reset by the specified one of the reset functions. 
CPU times: user 295 ms, sys: 125 ms, total: 420 ms
Wall time: 414 ms


## Similarity Scoring and Ranking
<p>Next, we need to calculate the similarity between the query embedding and all the source text embeddings. One of the most common approaches is to calculate the cosine similarity. Again, the complexities and math have been abstracted away by the <a href='https://www.sbert.net/docs/package_reference/util.html'>util.cos_sim</a> and util.semantic_search functions.</p>
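
For intuition, the scoring and ranking step can be sketched in plain Python: compute the cosine similarity between the query vector and every corpus vector, then keep the k highest scores. This is an illustrative stand-in, not the library's implementation. The function names and toy vectors are assumptions; only the `corpus_id` and `score` keys mirror the result format used in the next cell.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_by_cosine(query, corpus, k):
    """Return up to k hits as dicts with 'corpus_id' and 'score', best first."""
    scored = [
        {"corpus_id": i, "score": cosine(query, vec)}
        for i, vec in enumerate(corpus)
    ]
    return heapq.nlargest(k, scored, key=lambda hit: hit["score"])


# Toy 2-dimensional "embeddings"; real embeddings have hundreds of dimensions.
toy_query = [1.0, 0.0]
toy_corpus = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]

hits = top_k_by_cosine(toy_query, toy_corpus, k=2)
print([hit["corpus_id"] for hit in hits])
```

Note that the vector `[2.0, 0.0]` ranks first even though it is longer than the query: cosine similarity compares direction only, not magnitude.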

In [9]:
%%time
# Set k as the number of top results
k = 50

# Using the util function to run semantic search, default to cosine
topk_results = util.semantic_search(query_embedding, source_embeddings, top_k=k)[0]

# Extract the result ids
topk_results_ids = [result['corpus_id'] for result in topk_results]

# Get a dataframe of the top k results
topk_df = df.iloc[topk_results_ids].reset_index()

CPU times: user 18 ms, sys: 2.9 ms, total: 20.9 ms
Wall time: 19.1 ms


In [10]:
topk_df.head()

Unnamed: 0,index,Document Number,Date,Title,Abstract
0,371003,11379310,20220705,Anomaly detector,An anomaly detector includes a writing unit t...
1,370470,11283705,20220322,"Anomaly detector, anomaly detection network, m...",An anomaly detector (100) for detecting an ab...
2,405805,11423698,20220823,Anomaly detector for detecting anomaly using c...,Embodiments of the present disclosure disclos...
3,448866,11333580,20220517,"Anomaly detecting device, anomaly detection me...",An anomaly detecting device includes: a singu...
4,503675,11520672,20221206,"Anomaly detection device, anomaly detection me...",An anomaly detection device according to the ...


In [11]:
for i in range(5):
    print(f"Document: {topk_df['Document Number'][i]}")
    print(f"Title: {topk_df['Title'][i]}")
    print(f"Abstract: {topk_df['Abstract'][i]}\n")

Document: 11379310
Title: Anomaly detector
Abstract:  An anomaly detector includes a writing unit that writes anomaly detection data readable by an external diagnostic device to an external memory when an anomaly is detected in an on-board device. Further, the anomaly detector includes a determination unit that determines whether a failure is occurring in a memory, which is used when a processor is operated during the writing unit performs the writing. Also, the anomaly detector includes a resetting unit that resets the memory by activating a specified one of reset functions of the processor when the determination unit determines that a failure is occurring in the memory. When the determination unit determines that a failure is occurring in the memory, the writing unit writes the anomaly detection data after the memory is reset by the specified one of the reset functions. 

Document: 11283705
Title: Anomaly detector, anomaly detection network, method for detecting an abnormal activity,

## Advanced Techniques


In [12]:
# Import the cross encoder library
from sentence_transformers.cross_encoder import CrossEncoder

# Load the cross encoder model
cross_model = CrossEncoder(smart_search.cross_encoder_models[0])




In [13]:
%%time
# Calculate the cross-encoder scores and assign scores to dataframe column
topk_df['score'] = [cross_model.predict([query_string,msg]) for msg in topk_df.Abstract]

CPU times: user 355 ms, sys: 21.1 ms, total: 376 ms
Wall time: 374 ms


In [14]:
# Sort the dataframe based on descending score
topk_df = topk_df.sort_values('score',ascending=False).reset_index()

In [15]:
for i in range(5):
    print(f"Document: {topk_df['Document Number'][i]}")
    print(f"Title: {topk_df['Title'][i]}")
    print(f"Abstract: {topk_df['Abstract'][i]}\n")

Document: 11379310
Title: Anomaly detector
Abstract:  An anomaly detector includes a writing unit that writes anomaly detection data readable by an external diagnostic device to an external memory when an anomaly is detected in an on-board device. Further, the anomaly detector includes a determination unit that determines whether a failure is occurring in a memory, which is used when a processor is operated during the writing unit performs the writing. Also, the anomaly detector includes a resetting unit that resets the memory by activating a specified one of reset functions of the processor when the determination unit determines that a failure is occurring in the memory. When the determination unit determines that a failure is occurring in the memory, the writing unit writes the anomaly detection data after the memory is reset by the specified one of the reset functions. 

Document: 11580005
Title: Anomaly pattern detection system and method
Abstract:  Provided is an anomaly pattern d

# Identification of most similar sentence(s)

In [18]:
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [17]:
!pip install nltk

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Collecting nltk
  Downloading nltk-3.8.1-py3-none-any.whl (1.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.5/1.5 MB 2.7 MB/s eta 0:00:00
Installing collected packages: nltk
Successfully installed nltk-3.8.1

In [19]:
def preprocess(sentence):
    stopwords = set(nltk.corpus.stopwords.words('english'))
    tokens = nltk.word_tokenize(sentence.lower())
    filtered_tokens = [t for t in tokens if t not in stopwords and t.isalpha()]
    return ' '.join(filtered_tokens)

def compute_tfidf(sentences):
    vectorizer = TfidfVectorizer()
    return vectorizer.fit_transform(sentences)

def compute_similarity(vector1, vector2):
    return cosine_similarity(vector1, vector2)[0][0]

def sentence_similarity(sentence1, sentence2):
    preprocessed_sentences = [preprocess(sentence) for sentence in [sentence1, sentence2]]
    tfidf_vectors = compute_tfidf(preprocessed_sentences)
    similarity = compute_similarity(tfidf_vectors[0], tfidf_vectors[1])
    return similarity

def split_paragraph_into_sentences(paragraph):
    # Tokenize the paragraph into sentences
    sentences = nltk.sent_tokenize(paragraph)
    return sentences


In [20]:
cross_model = CrossEncoder(smart_search.cross_encoder_models[0])

#scores = [cross_model.predict([query_string,msg]) for msg in topk_df.Abstract]

i = 1

print(topk_df['Title'][i])

sentences = split_paragraph_into_sentences(topk_df['Abstract'][i])

print(len(sentences))

for sentence in sentences:
    score = cross_model.predict([query_string,sentence])
    print(f"Similarity Score: {score}, Sentence: {sentence}\n")



Anomaly pattern detection system and method
2
Similarity Score: 8.103279113769531, Sentence:  Provided is an anomaly pattern detection system including an anomaly detection device connected to one or more servers.

Similarity Score: 8.256790161132812, Sentence: The anomaly detection device may include an anomaly detector configured to model input data by considering all of the input data as normal patterns, and detect an anomaly pattern from the input data based on the modeling result.
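
The cell above prints a score for each sentence. To actually identify the most similar sentence, the scores can be paired with their sentences and the maximum taken. The sketch below does this with the two scores and sentences copied from the output above, so it runs without loading the cross-encoder. The variable names are illustrative.

```python
# (score, sentence) pairs copied from the printed output above.
scored_sentences = [
    (
        8.103279113769531,
        "Provided is an anomaly pattern detection system including an anomaly "
        "detection device connected to one or more servers.",
    ),
    (
        8.256790161132812,
        "The anomaly detection device may include an anomaly detector configured "
        "to model input data by considering all of the input data as normal "
        "patterns, and detect an anomaly pattern from the input data based on "
        "the modeling result.",
    ),
]

# Pick the pair with the highest cross-encoder score.
best_score, best_sentence = max(scored_sentences, key=lambda pair: pair[0])

print(f"Most similar sentence (score {best_score}): {best_sentence}")
```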



In [21]:
import nltk
from nltk.corpus import wordnet

nltk.download('wordnet')
nltk.download('punkt')

def get_synonyms(word):
    """Returns a set of synonyms for the given word."""
    synonyms = set()
    for syn in wordnet.synsets(word):
        for lemma in syn.lemmas():
            synonyms.add(lemma.name())
    return synonyms

def similar_words_or_synonyms(string1, string2):
    """Finds and returns similar words or synonyms between two strings."""
    words1 = nltk.word_tokenize(string1)
    words2 = nltk.word_tokenize(string2)

    common_words = set(words1) & set(words2)
    
    synonym_matches = set()

    for word1 in words1:
        # Look up the synonyms once per word rather than once per word pair.
        synonyms1 = get_synonyms(word1)
        for word2 in words2:
            if word1 != word2 and word2 in synonyms1:
                synonym_matches.add((word1, word2))
                
    return common_words, synonym_matches

string1 = "I am happy and cheerful today."
string2 = "It is a joyous and delighted day."

common, synonyms = similar_words_or_synonyms(query_string, topk_df['Abstract'][i])

print("Common words:", common)
print("Synonym matches:", synonyms)

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Common words: {'anomaly', 'detector', 'detection', '.', 'to', 'data', 'an', 'of', ',', 'by', 'device', 'the', 'is', 'one'}
Synonym matches: {('detected', 'detect'), ('includes', 'include')}
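
The common-word set above is dominated by punctuation and function words such as 'the' and 'of'. As a hedged sketch, the set could be filtered down to the more informative terms. The stopword list here is a small hand-picked assumption for illustration; the NLTK stopword list downloaded earlier in the notebook would be a more complete choice.

```python
# Common-word set copied from the printed output above.
common_words_from_output = {
    'anomaly', 'detector', 'detection', '.', 'to', 'data', 'an', 'of',
    ',', 'by', 'device', 'the', 'is', 'one',
}

# Hand-picked stopwords for illustration only (an assumption, not NLTK's list).
toy_stopwords = {'to', 'an', 'of', 'by', 'the', 'is', 'one'}

# Keep alphabetic tokens that are not stopwords, sorted for stable output.
informative_words = sorted(
    word for word in common_words_from_output
    if word.isalpha() and word not in toy_stopwords
)

print(informative_words)
```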
