**Source:** https://www.mongodb.com/developer/products/atlas/llm-accuracy-vector-search-unstructured-metadata/

In [1]:
# Optional - disable warnings from the Tokenizer
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [2]:
# Install Unstructured partition for PDF and dependencies
!pip install "unstructured[pdf]"
!apt-get -qq install poppler-utils tesseract-ocr
!pip install -q --user --upgrade pillow

!pip install pymongo
!pip install sentence-transformers
!pip install requests
!pip install google-generativeai



## Get Paper from MongoDB and load PDF File

In [3]:
from pymongo.mongo_client import MongoClient
from pymongo.server_api import ServerApi

# Read the connection string from the environment instead of hard-coding credentials.
# Expected shape: "mongodb+srv://<user>:<password>@<cluster-host>/ceur_ws?retryWrites=true&w=majority"
uri = os.environ["MONGODB_URI"]

# Create a new client and connect to the server
client = MongoClient(uri, server_api=ServerApi('1'))

# Send a ping to confirm a successful connection
try:
    client.admin.command('ping')
    print("Pinged your deployment. You successfully connected to MongoDB!")
except Exception as e:
    print(e)

Pinged your deployment. You successfully connected to MongoDB!


In [4]:
# Fetch the paper with the given title from the collection
db = client.get_database('ceur_ws')
papers_collection = db.papers

title = "Empowering Supply Chains Resilience: LLMs-Powered BN for Proactive Supply Chain Risk Identification"
document = papers_collection.find_one({'title': title})
print(document)

{'_id': ObjectId('666efa86afd0ba1cc886a845'), 'url': 'https://ceur-ws.org/Vol-3707/D2R224_paper_2.pdf', 'title': 'Empowering Supply Chains Resilience: LLMs-Powered BN for Proactive Supply Chain Risk Identification', 'pages': None, 'author': ['Maryam Shahsavari', 'Omar Khadeer Hussain', 'Morteza Saberi', 'Pankaj Sharma'], 'abstract': None, 'keywords': None, 'content': None, 'volume_id': ObjectId('666efa85afd0ba1cc886a83f')}


In [5]:
import requests
import tempfile

response = requests.get(document['url'])
response.raise_for_status()  # stop early if the download failed

# Create a temporary file to write the PDF content
filename = f"{document['title']}.pdf"
print(filename)
temp_file_path = os.path.join(tempfile.gettempdir(), filename)

with open(temp_file_path, 'wb') as temp_file:
    temp_file.write(response.content)

Empowering Supply Chains Resilience: LLMs-Powered BN for Proactive Supply Chain Risk Identification.pdf


# Extract Texts and Metadata using Unstructured's `partition_pdf`

The document we are extracting is the paper "Empowering Supply Chains Resilience: LLMs-Powered BN for Proactive Supply Chain Risk Identification". We will import `partition_pdf` from the Unstructured library to do so.

Parameters for `partition_pdf`:
* `filename`: the path to the PDF file.
* `strategy`: set this to `ocr_only` if you only need to extract the text. We use the `hi_res` strategy to detect complex elements.
* `infer_table_structure`: set to `True` if we want to extract the metadata for tables.

In [6]:
import os
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(temp_file_path,
                         strategy="hi_res",
                         infer_table_structure=True)

os.remove(temp_file_path)

Let's look at some sample outputs, i.e., the element type and text.

In [7]:
display(*[(type(element), element.text) for element in elements[14:18]])
print("\n")

(unstructured.documents.elements.NarrativeText,
 'Figure 1: An overview of CERIA framework [5]')

(unstructured.documents.elements.ListItem,
 '• BN construction: What are the causal relationships between events that ultimately pose or contribute to SC risks?')

(unstructured.documents.elements.ListItem,
 '• Event detection (direct inference): How can risk events be detected from textual sources? What are the relevant phrases?')

(unstructured.documents.elements.ListItem,
 '• Probability of risk event occurrence: How can the chance of risk event occurrence be quantified?')





In [8]:
from collections import Counter

display(Counter(type(element) for element in elements))
print("\n")

Counter({unstructured.documents.elements.Title: 16,
         unstructured.documents.elements.NarrativeText: 33,
         unstructured.documents.elements.Text: 3,
         unstructured.documents.elements.Image: 4,
         unstructured.documents.elements.Footer: 1,
         unstructured.documents.elements.ListItem: 18,
         unstructured.documents.elements.Formula: 1,
         unstructured.documents.elements.FigureCaption: 1,
         unstructured.documents.elements.Table: 1})





We can convert the extracted elements into a list of Python dictionaries using the `convert_to_dict` function, so that we can transform the records for the vector database.

In [9]:
from unstructured.staging.base import convert_to_dict

# built-in function to convert Unstructured elements to Python dictionary
records = convert_to_dict(elements)

# display the first record
records[0]

{'type': 'Title',
 'element_id': 'abec683fd59890f103c80927814cf7c4',
 'text': 'Empowering Supply chains Resilience: LLMs-Powered BN for Proactive Supply Chain Risk Identification',
 'metadata': {'detection_class_prob': 0.7737700343132019,
  'coordinates': {'points': ((248.03055555555554, 231.17932844444465),
    (248.03055555555554, 340.961669921875),
    (1407.0091552734375, 340.961669921875),
    (1407.0091552734375, 231.17932844444465)),
   'system': 'PixelSpace',
   'layout_width': 1654,
   'layout_height': 2339},
  'last_modified': '2024-06-16T21:26:12',
  'filetype': 'application/pdf',
  'languages': ['eng'],
  'page_number': 1,
  'file_directory': '/tmp',
  'filename': 'Empowering Supply Chains Resilience: LLMs-Powered BN for Proactive Supply Chain Risk Identification.pdf'}}
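
The `metadata` block is what later lets an answer point back to its source. As a minimal sketch (the helper name `citation_of` is hypothetical and not part of Unstructured), the two fields used for citation can be pulled out of a record like this:

```python
def citation_of(record: dict) -> str:
    """Return a short 'filename, p. N' citation string for one element record."""
    meta = record.get("metadata", {})
    filename = meta.get("filename", "unknown file")
    page = meta.get("page_number")
    return f"{filename}, p. {page}" if page is not None else filename


# A trimmed-down record in the shape shown above (values shortened for illustration).
sample = {
    "type": "Title",
    "text": "Empowering Supply chains Resilience",
    "metadata": {"page_number": 1, "filename": "paper.pdf"},
}
print(citation_of(sample))  # paper.pdf, p. 1
```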

# Prepare the Data for Storage and Retrieval in MongoDB

## Vectorize the texts using `SentenceTransformer` library

We will store and retrieve both the text and metadata in MongoDB. First, we need to vectorize the text so that vector search can rank records by their similarity to the query.

In [10]:
from sentence_transformers import SentenceTransformer
from pprint import pprint

model = SentenceTransformer('microsoft/mpnet-base')

Some weights of MPNetModel were not initialized from the model checkpoint at microsoft/mpnet-base and are newly initialized: ['mpnet.pooler.dense.bias', 'mpnet.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [11]:
# Let's test the model and check its embedding dimension
emb = model.encode("this is a test").tolist()
print(len(emb))
print(emb[:10])
print("\n")

768
[-0.15820947289466858, 0.008248747326433659, -0.03334711864590645, -0.017948446795344353, -0.07760719209909439, -0.0592564232647419, 0.24941007792949677, 0.0987861379981041, 0.010875198058784008, 0.09362597018480301]




We can now vectorize the text in our records.

We will structure each record in the following format to be stored and retrieved in MongoDB:
* Type
* Element ID
* Metadata
* Text
* Embedding

In [12]:
for record in records:
    txt = record['text']
    record['embedding'] = model.encode(txt).tolist()
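
The loop above encodes one text at a time. `SentenceTransformer.encode` also accepts a list of sentences, so the texts can be encoded in a single call. The sketch below is one possible variation, not the notebook's method; `attach_embeddings` and `fake_encode` are hypothetical names, and `fake_encode` is only a stand-in so the example runs without downloading a model.

```python
from typing import Callable, Dict, List, Sequence


def attach_embeddings(records: List[Dict], encode: Callable[[Sequence[str]], Sequence]) -> None:
    """Encode all record texts in one batch and store each vector on its record."""
    texts = [record["text"] for record in records]
    vectors = encode(texts)
    for record, vector in zip(records, vectors):
        # NumPy rows expose .tolist(); plain sequences are copied into a list.
        record["embedding"] = vector.tolist() if hasattr(vector, "tolist") else list(vector)


def fake_encode(texts):
    """Stand-in encoder (assumption): a 2-dimensional vector based on text length."""
    return [[float(len(text)), 0.0] for text in texts]


demo = [{"text": "abc"}, {"text": "hello"}]
attach_embeddings(demo, fake_encode)
print(demo[0]["embedding"])  # [3.0, 0.0]
```

In the notebook, the call would be `attach_embeddings(records, model.encode)`.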

In [13]:
# print the first record with embedding
records[0]

{'type': 'Title',
 'element_id': 'abec683fd59890f103c80927814cf7c4',
 'text': 'Empowering Supply chains Resilience: LLMs-Powered BN for Proactive Supply Chain Risk Identification',
 'metadata': {'detection_class_prob': 0.7737700343132019,
  'coordinates': {'points': ((248.03055555555554, 231.17932844444465),
    (248.03055555555554, 340.961669921875),
    (1407.0091552734375, 340.961669921875),
    (1407.0091552734375, 231.17932844444465)),
   'system': 'PixelSpace',
   'layout_width': 1654,
   'layout_height': 2339},
  'last_modified': '2024-06-16T21:26:12',
  'filetype': 'application/pdf',
  'languages': ['eng'],
  'page_number': 1,
  'file_directory': '/tmp',
  'filename': 'Empowering Supply Chains Resilience: LLMs-Powered BN for Proactive Supply Chain Risk Identification.pdf'},
 'embedding': [-0.026116566732525826,
  -0.03958490118384361,
  -0.08057887107133865,
  0.04201807081699371,
  0.13461977243423462,
  -0.08225736021995544,
  0.09691937267780304,
  0.015870891511440277,
  0.

## Connect and upload records onto MongoDB Atlas

We will use PyMongo `insert_many` to upload our records to the specified `Database` and `Collection` within the `Cluster`.

In [14]:
db_name = "ceur_ws"
collection_name = "vector_records"

# delete all first
client[db_name][collection_name].delete_many({})

# insert
client[db_name][collection_name].insert_many(records)

InsertManyResult([ObjectId('666f5914a59438a4efb90e88'), ObjectId('666f5914a59438a4efb90e89'), ObjectId('666f5914a59438a4efb90e8a'), ObjectId('666f5914a59438a4efb90e8b'), ObjectId('666f5914a59438a4efb90e8c'), ObjectId('666f5914a59438a4efb90e8d'), ObjectId('666f5914a59438a4efb90e8e'), ObjectId('666f5914a59438a4efb90e8f'), ObjectId('666f5914a59438a4efb90e90'), ObjectId('666f5914a59438a4efb90e91'), ObjectId('666f5914a59438a4efb90e92'), ObjectId('666f5914a59438a4efb90e93'), ObjectId('666f5914a59438a4efb90e94'), ObjectId('666f5914a59438a4efb90e95'), ObjectId('666f5914a59438a4efb90e96'), ObjectId('666f5914a59438a4efb90e97'), ObjectId('666f5914a59438a4efb90e98'), ObjectId('666f5914a59438a4efb90e99'), ObjectId('666f5914a59438a4efb90e9a'), ObjectId('666f5914a59438a4efb90e9b'), ObjectId('666f5914a59438a4efb90e9c'), ObjectId('666f5914a59438a4efb90e9d'), ObjectId('666f5914a59438a4efb90e9e'), ObjectId('666f5914a59438a4efb90e9f'), ObjectId('666f5914a59438a4efb90ea0'), ObjectId('666f5914a59438a4efb90e

## Query the index based on embedding similarity

Now that we have the records stored in MongoDB Atlas, we can search for relevant texts using vector search. To do so, we need to vectorize the `query` using the same embedding model and use the `aggregate` function to retrieve the records from the index.

In the `pipeline` we will specify the following:
* `index`: the name of the vector search index in the collection.
* `vector`: the vectorized query from the user.
* `k`: number of most similar records we want to extract from the collection.
* `score`: the similarity score generated by MongoDB Atlas.

**MongoDB Search Vector Index**
```json
{
  "mappings": {
    "dynamic": true,
    "fields": {
      "embedding": {
        "type": "knnVector",
        "dimensions": 768,
        "similarity": "euclidean"
      }
    }
  }
}
```
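
With `euclidean` similarity, a smaller distance between two vectors means a closer match, and the score is reported so that higher is better. The sketch below assumes the normalization that the Atlas Vector Search documentation gives for euclidean similarity, `score = 1 / (1 + distance)`. Whether exactly this formula applies to the index and operator used in this notebook is an assumption, so treat the numbers as illustrative.

```python
import math


def euclidean_score(a, b):
    """Map a Euclidean distance to a (0, 1] score, assuming score = 1 / (1 + distance)."""
    distance = math.dist(a, b)
    return 1.0 / (1.0 + distance)


# The distance between these two points is 5, so the score is 1/6.
print(euclidean_score([0.0, 0.0], [3.0, 4.0]))
# Identical vectors have distance 0 and therefore the maximum score of 1.0.
print(euclidean_score([1.0, 2.0], [1.0, 2.0]))
```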

In [15]:
# Simulate a query
query = "What is the CERIA framework?"

vector_query = model.encode(query).tolist()

In [16]:
vector_query

[-0.0978887751698494,
 -0.042714040726423264,
 -0.04149283841252327,
 0.0005479802493937314,
 0.0009935381822288036,
 -0.005337526556104422,
 0.14971381425857544,
 0.055392105132341385,
 -0.00322324363514781,
 0.10315658152103424,
 0.10232720524072647,
 0.009140967391431332,
 0.03893635794520378,
 0.15473856031894684,
 -0.07673250883817673,
 -0.03659925237298012,
 0.11289875954389572,
 0.11759001761674881,
 0.019048327580094337,
 -0.0028604757972061634,
 0.010642807930707932,
 0.07243625819683075,
 -0.005616281647235155,
 -0.003398529253900051,
 0.10700326412916183,
 0.057952478528022766,
 -0.036660611629486084,
 0.18039658665657043,
 -0.03456201031804085,
 0.1459304392337799,
 3.62518330803141e-05,
 -0.031441908329725266,
 -0.0669606402516365,
 -0.05446122586727142,
 0.11367166042327881,
 -0.09401025623083115,
 0.06049805134534836,
 0.0476296991109848,
 0.10526076704263687,
 0.2341161072254181,
 0.0568121075630188,
 0.023488763719797134,
 -0.1051904484629631,
 -0.03761982172727585,
 -

In [21]:
pipeline = [
  {
    "$search": {
        "index":"default",
        "knnBeta": {
            "vector": vector_query,
            "path": "embedding",
            "k": 5,
        }
    }
  },
  {
    "$project": {
        "embedding": 0,
        "_id": 0,
        "score": {
            "$meta": "searchScore"
        },
    }
  }
]

results = list(client[db_name][collection_name].aggregate(pipeline))
print(results)
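
The `knnBeta` operator used above has been deprecated by MongoDB in favor of the `$vectorSearch` aggregation stage. As a hedged sketch of what an equivalent pipeline could look like: the helper `build_vector_search_pipeline` and the index name `vector_index` are hypothetical, and the stage needs an index of type `vectorSearch`, which is not the `knnVector` mapping defined in this notebook.

```python
def build_vector_search_pipeline(query_vector, index_name="vector_index",
                                 limit=5, num_candidates=100):
    """Build a $vectorSearch pipeline that returns the `limit` closest records."""
    if num_candidates < limit:
        raise ValueError("num_candidates must be at least as large as limit")
    return [
        {
            "$vectorSearch": {
                "index": index_name,              # hypothetical index name
                "path": "embedding",              # field that stores the vectors
                "queryVector": query_vector,
                "numCandidates": num_candidates,  # neighbors considered before ranking
                "limit": limit,                   # records returned
            }
        },
        {
            "$project": {
                "embedding": 0,
                "_id": 0,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]


pipeline_sketch = build_vector_search_pipeline([0.1, 0.2, 0.3])
print(pipeline_sketch[0]["$vectorSearch"]["limit"])  # 5
```

The resulting list can be passed to `aggregate` in the same way as the `pipeline` above.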

[{'type': 'NarrativeText', 'element_id': 'b24583b42c9cec08eb89aab234f4d992', 'text': 'Figure 1: An overview of CERIA framework [5]', 'metadata': {'detection_class_prob': 0.8344075679779053, 'coordinates': {'points': [[248.03055555555554, 892.219915111111], [248.03055555555554, 919.8939151111111], [801.4171752929688, 919.8939151111111], [801.4171752929688, 892.219915111111]], 'system': 'PixelSpace', 'layout_width': 1654, 'layout_height': 2339}, 'last_modified': '2024-06-16T21:26:12', 'filetype': 'application/pdf', 'languages': ['eng'], 'page_number': 3, 'file_directory': '/tmp', 'filename': 'Empowering Supply Chains Resilience: LLMs-Powered BN for Proactive Supply Chain Risk Identification.pdf'}, 'score': 0.2424938976764679}, {'type': 'ListItem', 'element_id': '37e56ca85ce4d240f46ef1a4067c91fd', 'text': '• BN construction: What are the causal relationships between events that ultimately pose or contribute to SC risks?', 'metadata': {'detection_class_prob': 0.8987293839454651, 'coordinat

# Generate the LLM Output with Source Document Citation

Once we retrieve the relevant records, we can generate the LLM output. In this tutorial, we will use the Gemini 1.5 Flash model from Google through the `google-generativeai` package. You must provide an API key and call the model's `generate_content` method.

To prepare for generating the output, we'll send the full records retrieved in the earlier step. Next, we need to tweak the prompt so the LLM can use the metadata in its output, i.e., the source filename and page number.
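
The notebook sends the full records as context. If prompt size becomes a concern, one possible variation (not what this notebook does) is to keep only the fields the model needs for answering and citing. The helper name `build_context` is hypothetical; the sample record reuses values from the search result shown earlier.

```python
import json


def build_context(results):
    """Serialize only the text and citation metadata of each retrieved record."""
    slim = []
    for record in results:
        meta = record.get("metadata", {})
        slim.append({
            "text": record.get("text", ""),
            "page_number": meta.get("page_number"),
            "filename": meta.get("filename"),
            "score": record.get("score"),
        })
    return json.dumps(slim, ensure_ascii=False)


sample_results = [{
    "type": "NarrativeText",
    "text": "Figure 1: An overview of CERIA framework [5]",
    "metadata": {
        "page_number": 3,
        "filename": "Empowering Supply Chains Resilience: LLMs-Powered BN for Proactive Supply Chain Risk Identification.pdf",
        "coordinates": {"system": "PixelSpace"},  # dropped from the context
    },
    "score": 0.2424938976764679,
}]
print(build_context(sample_results))
```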

In [22]:
import os
import json
import google.generativeai as genai

# Set Gemini API Key
# Read the key from the environment instead of hard-coding it in the notebook
GOOGLE_API_KEY = os.environ["GOOGLE_API_KEY"]
genai.configure(api_key=GOOGLE_API_KEY)
gemini_model = genai.GenerativeModel('gemini-1.5-flash')

In [23]:
# Stringify the list of results
import json

context = json.dumps(results)

In [24]:
# Generate response using Gemini model
prompt = f"""You are a useful assistant. Use the assistant's content to answer the user's query. Summarize your answer using the provided context and cite any relevant 'page_number' and 'filename' metadata in your reply.
  Context: {context}
  Query: {query}
"""

try:
    response = gemini_model.generate_content(prompt)
    print(response.text)
except Exception as e:
    print(f"Error generating response: {e}")


The CERIA framework is an approach to identifying and managing risks in supply chains. The CERIA framework involves a series of steps that start with identifying causal relationships between events. This information is then used to build a Bayesian network (BN) that can be used to predict and mitigate potential supply chain risks.  The framework is mentioned in the paper "Empowering Supply Chains Resilience: LLMs-Powered BN for Proactive Supply Chain Risk Identification.pdf" on page 3. 



## Evaluating the LLM Output Quality with Source Document

> **User query**: "What is the CERIA framework?"

> **Gemini**: "The CERIA framework is an approach to identifying and managing risks in supply chains. The CERIA framework involves a series of steps that start with identifying causal relationships between events. This information is then used to build a Bayesian network (BN) that can be used to predict and mitigate potential supply chain risks.  The framework is mentioned in the paper "Empowering Supply Chains Resilience: LLMs-Powered BN for Proactive Supply Chain Risk Identification.pdf" on page 3."

The highly specific output cites information from the source document, "Empowering Supply Chains Resilience: LLMs-Powered BN for Proactive Supply Chain Risk Identification.pdf". The answer is referenced with the exact page number (page 3), making it easy for anyone to verify the information. This level of detail is crucial when answering research-related queries, and it enhances the trustworthiness and reliability of the LLM output.
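
Because every retrieved record carries a `page_number`, a cited page can also be checked mechanically. The following is a rough sketch under simple assumptions: citations appear in the answer as "page N", and the helper names `cited_pages` and `unsupported_pages` are hypothetical. It flags pages the answer mentions that were not among the retrieved records.

```python
import re


def cited_pages(answer):
    """Collect page numbers written as 'page N' in the answer text."""
    return {int(n) for n in re.findall(r"\bpage\s+(\d+)", answer, flags=re.IGNORECASE)}


def unsupported_pages(answer, results):
    """Return cited page numbers that do not occur in any retrieved record."""
    retrieved = {r.get("metadata", {}).get("page_number") for r in results}
    return cited_pages(answer) - retrieved


retrieved_records = [{"text": "Figure 1: An overview of CERIA framework [5]",
                      "metadata": {"page_number": 3}}]
print(unsupported_pages("The framework is mentioned on page 3.", retrieved_records))  # set()
print(unsupported_pages("See page 7 for details.", retrieved_records))                # {7}
```

An empty set means every cited page is backed by a retrieved record; it does not confirm that the answer's claims are correct.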

By incorporating unstructured metadata and MongoDB's Vector Search, we can not only improve the precision of the outputs but also make them more reliable and verifiable.