To search with the full prompt rather than basic keyword matching, you can use Natural Language Processing (NLP) techniques to find the most relevant documents. One effective approach is embedding-based semantic search, which compares vector representations of the query and the documents. Here's how it can be done:

# Solution Overview

- Embed each document in your docs list and the query.
- Compare the query embedding with each document's embedding to find the most relevant ones.
- Retrieve the documents that are most similar to your query.

Here are some steps to implement this approach:

Install Sentence Transformers for easy text embedding generation.

In [None]:
%pip install sentence-transformers

Generate Embeddings for documents and query.

Use a pre-trained model from the `sentence-transformers` library, such as `all-MiniLM-L6-v2`, to convert each document and the query into vectors.

Compute Similarity between the query vector and each document vector.

Retrieve the Most Relevant Documents based on similarity scores.
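
To make the similarity step concrete, the sketch below computes cosine similarity by hand for toy 3-dimensional vectors. The vectors and the `cosine_similarity` helper are made up for illustration. Real embeddings from `all-MiniLM-L6-v2` have 384 dimensions, and `util.cos_sim` performs the equivalent computation on tensors.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for embeddings (hypothetical values)
query_vec = [1.0, 0.0, 1.0]
doc_vecs = {
    "same direction": [2.0, 0.0, 2.0],
    "partly related": [1.0, 1.0, 0.0],
    "unrelated": [0.0, 1.0, 0.0],
}

# Score every toy document against the toy query
scores = {name: cosine_similarity(query_vec, vec) for name, vec in doc_vecs.items()}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

A vector pointing the same way as the query scores close to 1 regardless of its length, and an orthogonal vector scores 0. This is why a threshold on the score can serve as a relevance cut-off.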

# Keywords

In [None]:
from sentence_transformers import SentenceTransformer, util
import numpy as np

# Load a pre-trained model for generating embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')  # This model is efficient and accurate for semantic similarity

# Example documents
docs = [
    'This document covers basics of NLP and its applications.',
    'IoT systems are expanding rapidly in the tech world.',
    'This is an overview of NLP techniques in text summarization.',
    'Artificial Intelligence in healthcare is transforming the industry.',
    'NLP models are being used to improve customer support experiences.'
]

# Query
query = "Give me all documents about NLP."

# Step 1: Generate embeddings for each document
doc_embeddings = model.encode(docs)

# Step 2: Generate embedding for the query
query_embedding = model.encode(query)

# Step 3: Compute cosine similarity between the query and each document
similarities = util.cos_sim(query_embedding, doc_embeddings)[0]  # [0] to get the row array

# Step 4: Retrieve documents with high similarity scores
# Keep the documents whose similarity to the query exceeds the threshold
threshold = 0.4  # Define a threshold for relevance
relevant_docs = [docs[i] for i in np.where(similarities > threshold)[0]]

# Print results
print("Relevant documents related to your query:")
for doc in relevant_docs:
    print(doc)




Relevant documents related to your query:
This document covers basics of NLP and its applications.
This is an overview of NLP techniques in text summarization.
NLP models are being used to improve customer support experiences.
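
A fixed threshold such as 0.4 may need tuning per query or per model. One common alternative is to keep the k highest-scoring documents instead. The sketch below uses a hypothetical `top_k` helper and made-up scores, not output from the model.

```python
def top_k(docs, scores, k=3):
    """Return the k (doc, score) pairs with the highest scores, best first."""
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

docs = ['doc about NLP basics', 'doc about IoT', 'doc about NLP summarization']
scores = [0.62, 0.08, 0.55]  # illustrative values only, not model output

best = top_k(docs, scores, k=2)
for doc, score in best:
    print(f"{score:.2f}  {doc}")
```

With the real pipeline, the tensor returned by `util.cos_sim(...)[0]` could be passed as `similarities.tolist()`.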


# Title

In [6]:
from sentence_transformers import SentenceTransformer, util
import numpy as np

# Load a pre-trained model for generating embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')  # This model is efficient and accurate for semantic similarity

# Example documents
docs = [
    'Title: NLP Integration. Introduction: This document covers basics of NLP and its applications. Date: April 9, 2024',
    'Title: IoT System. Introduction: IoT systems are expanding rapidly in the tech world. Date: June 3, 2023',
    'Title: NLP Techniques. Introduction: This is an overview of NLP techniques in text summarization. Date: July 12, 2024',
    'Title: AI in Health. Introduction: Artificial Intelligence in healthcare is transforming the industry. December 12, 2021',
    'Title: NLP Models. Introduction: NLP models are being used to improve customer support experiences. December 1, 2020'
]

# Query
query = "Give me the document entitled: IoT System."

# Step 1: Generate embeddings for each document
doc_embeddings = model.encode(docs)

# Step 2: Generate embedding for the query
query_embedding = model.encode(query)

# Step 3: Compute cosine similarity between the query and each document
similarities = util.cos_sim(query_embedding, doc_embeddings)[0]  # [0] to get the row array

# Step 4: Retrieve documents with high similarity scores
# Keep the documents whose similarity to the query exceeds the threshold
threshold = 0.4  # Define a threshold for relevance
relevant_docs = [docs[i] for i in np.where(similarities > threshold)[0]]

# Print results
print("Relevant documents related to your query:")
for doc in relevant_docs:
    print(doc)


Relevant documents related to your query:
Title: IoT System. Introduction: IoT systems are expanding rapidly in the tech world. Date: June 3, 2023


# Semantic Search with Year Filtering

In [5]:
from sentence_transformers import SentenceTransformer, util
import numpy as np
import re

# Load a pre-trained model for generating embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')  # This model is efficient and accurate for semantic similarity

# Function to extract years from the query
def extract_years_from_query(query):
    years = re.findall(r'\b(20\d{2})\b', query)  # Finds all four-digit numbers starting with '20'
    return [int(year) for year in years] if years else None  # Convert to integers for easier processing

def semantic_search(docs,query):
    # Semantic Search
    # Step 1: Generate embeddings for each document
    doc_embeddings = model.encode(docs)
    # Step 2: Generate embedding for the query
    query_embedding = model.encode(query)
    # Step 3: Compute cosine similarity between the query and each document
    similarities = util.cos_sim(query_embedding, doc_embeddings)[0]  # [0] to get the row array
    # Step 4: Retrieve documents with high similarity scores
    # Keep the documents whose similarity to the query exceeds the threshold
    threshold = 0.4  # Define a threshold for relevance
    relevant_docs = [docs[i] for i in np.where(similarities > threshold)[0]]

    # Print results
    print("Relevant documents related to your query:")
    for doc in relevant_docs:
        print(doc)
        
    return relevant_docs

def search(docs,query):
    # Extract years
    years_in_query = extract_years_from_query(query)
    # If the query mentions years, filter on them before searching semantically
    if years_in_query:
        print("Years found in query:", years_in_query)
        # Filter documents that contain any of the years found in the query
        filtered_docs = [
            doc for doc in docs
            if any(str(year) in doc for year in years_in_query)  # Check if any year is in the document text
        ]
        
        print("Relevant documents based on years in the query:")
        for doc in filtered_docs:
            print(doc)
    
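        # No document mentions the requested years, so there is nothing to rank
        if not filtered_docs:
            return []

        # Rank the year-filtered documents semantically; fall back to the
        # year-filtered list if none clears the similarity threshold
        result = semantic_search(filtered_docs, query)
        return result if result else filtered_docs
    else:
        print("No years found in query.")
        # No year filter applies, so search the full document list
        return semantic_search(docs, query)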


# Example documents
docs = [
    'Title: NLP Integration. Introduction: This document covers basics of NLP and its applications. Date: April 9, 2024',
    'Title: IoT System. Introduction: IoT systems are expanding rapidly in the tech world. Date: June 3, 2023',
    'Title: NLP Techniques. Introduction: This is an overview of NLP techniques in text summarization. Date: July 12, 2024',
    'Title: AI in Health. Introduction: Artificial Intelligence in healthcare is transforming the industry. December 12, 2021',
    'Title: NLP Models. Introduction: NLP models are being used to improve customer support experiences. December 1, 2020'
]

# Query
query = "Give me all documents using NLP at 2020 and 2023."

search(docs,query)


Years found in query: [2020, 2023]
Relevant documents based on years in the query:
Title: IoT System. Introduction: IoT systems are expanding rapidly in the tech world. Date: June 3, 2023
Title: NLP Models. Introduction: NLP models are being used to improve customer support experiences. December 1, 2020
Relevant documents related to your query:
Title: NLP Models. Introduction: NLP models are being used to improve customer support experiences. December 1, 2020


['Title: NLP Models. Introduction: NLP models are being used to improve customer support experiences. December 1, 2020']
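
The year filter above uses a plain substring test (`str(year) in doc`), which would also match a year embedded in a longer number. Here is a minimal sketch of a stricter check. It assumes a hypothetical `doc_mentions_year` helper and reuses the `\b` word-boundary idea from `extract_years_from_query`. The second sample document is invented for the example.

```python
import re

def doc_mentions_year(doc, year):
    """True if `year` appears in `doc` as a standalone number."""
    return re.search(rf'\b{year}\b', doc) is not None

docs = [
    'Title: IoT System. Date: June 3, 2023',
    'Title: Part catalogue. Serial number 120231.',  # contains "2023" only as a substring
]

# Only the first document mentions 2023 as a standalone year
matches = [doc for doc in docs if doc_mentions_year(doc, 2023)]
print(matches)
```

Swapping this check into the list comprehension in `search` would leave the rest of the flow unchanged.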