### This notebook contains the code for RAG implementation with Qdrant DB for Proposal Pro backend
Steps followed:
- Upload files
- for file in files:
    <!-- update_vector_db.py -->
    - for each page in file: 
        - update metadata
        - chunkify with metadata (RecursiveCharacterTextSplitter)
        - append to a docs_list
    - return docs_list
    - update metadata for each chunk in docs_list:
        - chunk.metadata.update({"tags": "", "filename": ""})
    - get embedding model to use
    - create/update vector DB with the docs_list, embedding model, using collection_name (creates embeddings too)
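
The per-file loop above can be sketched in plain Python. This is only an illustration: `Chunk`, `split_text` and `build_docs_list` are hypothetical names, and the naive fixed-window splitter stands in for `RecursiveCharacterTextSplitter`, which splits on separators rather than at fixed offsets. Embedding and the vector DB write are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """A text chunk plus its metadata (hypothetical stand-in for a document object)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_text(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Naive fixed-window splitter, used here instead of RecursiveCharacterTextSplitter."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def build_docs_list(pages: list[str], filename: str, tags: list[str]) -> list[Chunk]:
    """Chunk every page with page-level metadata, then add file-level metadata."""
    docs_list = []
    for page_number, page_text in enumerate(pages, start=1):
        page_metadata = {"page_number": page_number, "total_pages": len(pages)}
        for piece in split_text(page_text):
            docs_list.append(Chunk(piece, dict(page_metadata)))
    # File-level fields are applied once all pages have been chunked.
    for chunk in docs_list:
        chunk.metadata.update({"tags": list(tags), "filename": filename})
    return docs_list


docs_list = build_docs_list(["a" * 450, "b" * 50], "example.pdf", [])
print(len(docs_list), docs_list[0].metadata)
```

Each chunk carries its own copy of the metadata, so later changes to one chunk do not affect the others.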


## DB Architecture 
### Approach 1:
Each **Collection** in the Qdrant DB will be associated with a **Project**.
The collections will be simple vector collections with:

- **collection_name**: *ProjectName_ProjectID*.
- **VectorParams**: Size=1024, Distance=COSINE
- **PointStruct**: id, vector=chunk_embedding, payload=chunk_metadata
    - *chunk_metadata*= {filename, tags, page no., content src}

A Qdrant collection can be viewed as a pool of embeddings from all the docs associated with a Project, with the chunk info (Filename, Page No.) maintained in the payload.
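
As a rough sketch of the layout described above, the snippet below builds a collection name and one point as plain dictionaries. The helper names and the deterministic `uuid5` id are assumptions made for illustration; the payload keys mirror the chunk metadata fields listed above.

```python
import uuid

VECTOR_SIZE = 1024  # matches the VectorParams size listed above


def collection_name_for(project_name: str, project_id: str) -> str:
    """Follow the ProjectName_ProjectID naming scheme."""
    return f"{project_name}_{project_id}"


def make_point(embedding, filename, page_number, content_source, chunk_index, tags=()):
    """Build one point (id, vector, payload) as a plain dict."""
    if len(embedding) != VECTOR_SIZE:
        raise ValueError(f"expected {VECTOR_SIZE} dimensions, got {len(embedding)}")
    # Assumption: a deterministic id derived from file name and chunk index,
    # so that re-ingesting the same chunk produces the same id.
    point_id = str(uuid.uuid5(uuid.NAMESPACE_URL, f"{filename}#{chunk_index}"))
    return {
        "id": point_id,
        "vector": list(embedding),
        "payload": {
            "filename": filename,
            "tags": list(tags),
            "page_number": page_number,
            "content_source": content_source,
        },
    }


point = make_point([0.0] * VECTOR_SIZE, "example.pdf", 3, "text", 45)
print(collection_name_for("Demo", "42"), point["id"])
```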

For ***Retrieval***, the query is embedded and sent to the project's Qdrant collection, which returns the *Docs* whose embeddings are the closest match to the query.
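
To make "closest match" concrete, here is a minimal exhaustive cosine-similarity search over a handful of toy points. It is only a conceptual sketch: Qdrant answers the same kind of query through its HNSW index rather than by scanning every point, and the function names and two-dimensional vectors are made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vector, points, k=2):
    """Score every point against the query and return the k best (score, point) pairs."""
    scored = [(cosine_similarity(query_vector, p["vector"]), p) for p in points]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


points = [
    {"id": "a", "vector": [1.0, 0.0], "payload": {"page_number": 1}},
    {"id": "b", "vector": [0.0, 1.0], "payload": {"page_number": 2}},
    {"id": "c", "vector": [1.0, 1.0], "payload": {"page_number": 3}},
]
results = top_k([1.0, 0.1], points, k=2)
print([p["id"] for _, p in results])
```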

#### Pros of this approach:
1. One collection per project

2. Maximum isolation

3. Fastest retrieval (each search touches only one project's index and needs no payload filter)

4. Clean resource boundaries

5. Easy deletes and backup

A good fit for a proposal-development application, where the project is the main unit of work.
#### Cons to this approach:
1. Too many collections will be created. Each collection has:
    - Its own HNSW index (a graph-based index for approximate nearest-neighbour search, with roughly logarithmic search time)
    - Its own metadata and memory footprint
    - Background maintenance processes
 ***Conclusion***: *Not scalable* once the collection count exceeds a few thousand

2. Cross-project *search* is not directly supported with this approach, since every collection has to be queried separately. Searches like:
    - “Search across all user projects”
    - “Search across multiple teams”
    - or some form of global context
  become very hard because each project is isolated.

3. A Qdrant instance with thousands of collections is more complex to move or manage than a single large collection with payload filters.

The next approach overcomes these issues.

### Approach 2:
A **Collection** will hold all the data for all the projects in a single pool.
The **Payload** of each chunk will contain the following info:
- *userID*
- *projectID*
- *fileID*
- *chunkID*
- *text*
- *fileURL_GCS*
- *fileName*

For ***Retrieval***, we must *filter* the search query by *userID* and *projectID*.
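
The effect of that filter can be illustrated with an in-memory sketch: restrict the shared pool to one user and project first, then rank only what is left. In Qdrant the same restriction would be expressed with `Filter(must=[FieldCondition(...)])`, as in the scroll cells of this notebook. The function names and toy data here are hypothetical, and the vectors are assumed to be unit-normalized so that the dot product equals cosine similarity.

```python
def dot(a, b):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def filtered_search(points, query_vector, user_id, project_id, limit=3):
    """Keep only points of one user and project, then rank them by similarity."""
    candidates = [
        p for p in points
        if p["payload"].get("userID") == user_id
        and p["payload"].get("projectID") == project_id
    ]
    ranked = sorted(candidates, key=lambda p: dot(query_vector, p["vector"]), reverse=True)
    return ranked[:limit]


points = [
    {"id": 1, "vector": [1.0, 0.0], "payload": {"userID": "u1", "projectID": "p1"}},
    {"id": 2, "vector": [0.6, 0.8], "payload": {"userID": "u1", "projectID": "p1"}},
    {"id": 3, "vector": [1.0, 0.0], "payload": {"userID": "u2", "projectID": "p9"}},
]
hits = filtered_search(points, [1.0, 0.0], "u1", "p1")
print([h["id"] for h in hits])
```

Point 3 matches the query exactly but belongs to another user, so it never reaches the ranking step.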
#### Cons to this approach:
1. Embeddings from unrelated users and projects share one pool, so storage appears cluttered
2. Search will be slower because of the filters and the larger index
3. Retrieval logic becomes more complex
4. A buggy filter may leak data between users or projects
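
One way to reduce the risk in point 4 is to build the filter in a single place and refuse to build it when either id is missing. The sketch below is an assumption about how that could look, not part of the existing backend: `tenant_filter` is a hypothetical helper that returns the filter in the JSON shape used by Qdrant's REST API (`must` conditions with `key` and `match.value`).

```python
def tenant_filter(user_id: str, project_id: str) -> dict:
    """Build a mandatory userID + projectID filter, failing loudly on missing ids."""
    if not user_id or not project_id:
        # An empty or None id must never silently widen the search scope.
        raise ValueError("both user_id and project_id are required")
    return {
        "must": [
            {"key": "userID", "match": {"value": user_id}},
            {"key": "projectID", "match": {"value": project_id}},
        ]
    }


search_filter = tenant_filter("u1", "p1")
print(search_filter)

try:
    tenant_filter("u1", "")
except ValueError as exc:
    print("rejected:", exc)
```

Routing every search through one such helper keeps the isolation rule in one reviewable spot instead of scattering it across call sites.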


### Approach 3:
One **Collection** per **tenant**, or per **User**.

### Approach 4:
Use Named-Vectors






In [3]:
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct, VectorParams, Distance
from qdrant_client.models import Filter, FieldCondition, MatchValue

# Connect to a local Qdrant instance on the default REST port
qdrant_client = QdrantClient("localhost", port=6333)

In [2]:
qdrant_client.create_collection(
    collection_name="iQss",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE)
)
print(qdrant_client.get_collections())

collections=[CollectionDescription(name='local_collection'), CollectionDescription(name='iQss')]


In [2]:
qdrant_client.scroll(
    collection_name="iQss",
    limit=20
)

([Record(id='001de355-51a8-5cd7-91c1-00f71afee879', payload={'source': 'Attachment 0002 Instructions to Quoters 19Aug.1724079599359.pdf', 'tags': [], 'sf_pages': [], 'image_pages': [], 'total_pages': 10, 'page_number': 3, 'is_sf_page': False, 'content_source': 'text', 'filename': 'Attachment 0002 Instructions to Quoters 19Aug.1724079599359.pdf', 'projectId': '28099d9c-1616-40da-9656-f392ce157ee5', 'chunk_text': 'Page 3:\nfacilities and experience and will base its evaluation on the information presented in the\nQuote.\n\nL2.4 Quoters that include data they do not want disclosed to the public for any purpose, or\nused by the Government except for evaluation purposes, shall mark the title page with the\nfollowing language:\n\n"This Quote includes data that shall not be disclosed outside the Government and shall not\nbe duplicated, used or disclosed--in whole or in part--for any purpose other than to\nevaluate this Quote. If, however, a Task Order is awarded to this Quoter as a result of-

In [4]:
# retrieve/scroll the points first, before deletion
qdrant_client.scroll(
    collection_name="iQss",
    scroll_filter=Filter(
        must=[
            FieldCondition(key="filename", 
                           match=MatchValue(value='Attachment 0001 ECMA Cloud Cybersecurity Support PWS 15AUG24.1723739597369.pdf')),
            FieldCondition(key="projectId",
                           match=MatchValue(value="ae74669a-c985-45bb-9940-d3bb694e53f0"))
        ]
    ),
    limit=20
)

([Record(id='00effeeb-b5c4-5313-bf65-5b999538d159', payload={'source': 'Attachment 0001 ECMA Cloud Cybersecurity Support PWS 15AUG24.1723739597369.pdf', 'tags': [], 'sf_pages': [], 'image_pages': [1], 'total_pages': 39, 'page_number': 26, 'is_sf_page': False, 'content_source': 'text', 'filename': 'Attachment 0001 ECMA Cloud Cybersecurity Support PWS 15AUG24.1723739597369.pdf', 'projectId': 'ae74669a-c985-45bb-9940-d3bb694e53f0', 'chunk_text': 'Page 26:\n953\n\n954 **7.9. Service Contract Reporting Requirements**\n\n955 The Contractor shall meet all Service Contract Reporting Requirements IAW DFARS\n956 252.204-7023 Reporting Requirements for Contracted Services - Basic.\n\n957\n### 958 8. Performance Requirements Summary (PRS)\n\n959', 'chunk_index': 45}, vector=None, shard_key=None, order_value=None),
  Record(id='011c6840-0fd4-5544-bc84-9a1088470979', payload={'source': 'Attachment 0001 ECMA Cloud Cybersecurity Support PWS 15AUG24.1723739597369.pdf', 'tags': [], 'sf_pages': [], 'imag

In [5]:
# delete points for a particular file in a project:
qdrant_client.delete(
    collection_name="iQss",
    points_selector=Filter(
        must=[
            FieldCondition(key="filename",
                           match=MatchValue(value='Attachment 0001 ECMA Cloud Cybersecurity Support PWS 15AUG24.1723739597369.pdf')),
            FieldCondition(key="projectId",
                           match=MatchValue(value="ae74669a-c985-45bb-9940-d3bb694e53f0"))
        ]
    )
)

UpdateResult(operation_id=248, status=<UpdateStatus.COMPLETED: 'completed'>)

In [8]:
qdrant_client.scroll(
    collection_name="iQss",
    scroll_filter=Filter(
        must=[
            FieldCondition(key="filename",
                           match=MatchValue(value='Attachment 0001 ECMA Cloud Cybersecurity Support PWS 15AUG24.1723739597369.pdf')),
            FieldCondition(key="projectId",
                           match=MatchValue(value="ae74669a-c985-45bb-9940-d3bb694e53f0"))
        ]
    )
)

([], None)

In [9]:
qdrant_client.scroll(
    collection_name="iQss",
    limit=20)

([Record(id='001de355-51a8-5cd7-91c1-00f71afee879', payload={'source': 'Attachment 0002 Instructions to Quoters 19Aug.1724079599359.pdf', 'tags': [], 'sf_pages': [], 'image_pages': [], 'total_pages': 10, 'page_number': 3, 'is_sf_page': False, 'content_source': 'text', 'filename': 'Attachment 0002 Instructions to Quoters 19Aug.1724079599359.pdf', 'projectId': '28099d9c-1616-40da-9656-f392ce157ee5', 'chunk_text': 'Page 3:\nfacilities and experience and will base its evaluation on the information presented in the\nQuote.\n\nL2.4 Quoters that include data they do not want disclosed to the public for any purpose, or\nused by the Government except for evaluation purposes, shall mark the title page with the\nfollowing language:\n\n"This Quote includes data that shall not be disclosed outside the Government and shall not\nbe duplicated, used or disclosed--in whole or in part--for any purpose other than to\nevaluate this Quote. If, however, a Task Order is awarded to this Quoter as a result of-

In [10]:
# Modify payload fields: set_payload (=> update/add fields in the payload), overwrite_payload (=> replace the entire payload)
qdrant_client.set_payload(
    collection_name="iQss",
    points=['149a9cb3-82ff-50cb-b998-6750a70f730a'],
    payload={"new_field": "new_value"}
    
)

UpdateResult(operation_id=249, status=<UpdateStatus.COMPLETED: 'completed'>)

In [16]:
qdrant_client.retrieve(
    collection_name="iQss",
    ids=['149a9cb3-82ff-50cb-b998-6750a70f730a'],
    with_vectors=False,
    with_payload=True
)

[Record(id='149a9cb3-82ff-50cb-b998-6750a70f730a', payload={'source': 'Attachment 0001 ECMA Cloud Cybersecurity Support PWS 15AUG24.1723739597369.pdf', 'tags': [], 'sf_pages': [], 'image_pages': [1], 'total_pages': 39, 'page_number': 39, 'is_sf_page': False, 'content_source': 'text', 'filename': 'Attachment 0001 ECMA Cloud Cybersecurity Support PWS 15AUG24.1723739597369.pdf', 'projectId': '23d73a68-3ca6-48c9-bbac-2eb888e73585', 'chunk_text': '1392 12. DoD Cloud Service Provider Security Requirements Guide, Version 1, Release 1,\n1393 14 June 2024.\n\n1394 13. DoD Directive 8140.01, “Cyberspace Workforce Management”, Dated 05\n1395 October 2020\n\n1396 14. DoD Instruction 8140.02, “Identification, Tracking, and Reporting of Cyberspace\n1397 Workforce Requirements”, Dated 21 December 2021\n\n1398 15. DoD- Manual 8140.03, “Cyberspace Workforce Qualification and Management\n1399 Program”, dated 15 February 2023\n\n1400 16. DoD Instruction 8500.01, “Cybersecurity”, dated 7 October 2019, Inc

In [17]:
# delete_payload (deletes one or more fields in the payload) and clear_payload (clears the entire payload)
qdrant_client.delete_payload(
    collection_name="iQss",
    points=['149a9cb3-82ff-50cb-b998-6750a70f730a'],
    keys=["new_field"]
)

UpdateResult(operation_id=251, status=<UpdateStatus.COMPLETED: 'completed'>)

In [19]:
qdrant_client.retrieve(
    collection_name="iQss",
    ids=['149a9cb3-82ff-50cb-b998-6750a70f730a'],
    with_payload=True,
    with_vectors=False

)

[Record(id='149a9cb3-82ff-50cb-b998-6750a70f730a', payload={'source': 'Attachment 0001 ECMA Cloud Cybersecurity Support PWS 15AUG24.1723739597369.pdf', 'tags': [], 'sf_pages': [], 'image_pages': [1], 'total_pages': 39, 'page_number': 39, 'is_sf_page': False, 'content_source': 'text', 'filename': 'Attachment 0001 ECMA Cloud Cybersecurity Support PWS 15AUG24.1723739597369.pdf', 'projectId': '23d73a68-3ca6-48c9-bbac-2eb888e73585', 'chunk_text': '1392 12. DoD Cloud Service Provider Security Requirements Guide, Version 1, Release 1,\n1393 14 June 2024.\n\n1394 13. DoD Directive 8140.01, “Cyberspace Workforce Management”, Dated 05\n1395 October 2020\n\n1396 14. DoD Instruction 8140.02, “Identification, Tracking, and Reporting of Cyberspace\n1397 Workforce Requirements”, Dated 21 December 2021\n\n1398 15. DoD- Manual 8140.03, “Cyberspace Workforce Qualification and Management\n1399 Program”, dated 15 February 2023\n\n1400 16. DoD Instruction 8500.01, “Cybersecurity”, dated 7 October 2019, Inc

In [22]:
qdrant_client.clear_payload(
    collection_name="iQss",
    points_selector=['149a9cb3-82ff-50cb-b998-6750a70f730a']
)

UpdateResult(operation_id=253, status=<UpdateStatus.COMPLETED: 'completed'>)

In [23]:
qdrant_client.retrieve(
    collection_name="iQss",
    ids=['149a9cb3-82ff-50cb-b998-6750a70f730a'],
    with_payload=True,
    with_vectors=False

)

[Record(id='149a9cb3-82ff-50cb-b998-6750a70f730a', payload={}, vector=None, shard_key=None, order_value=None)]
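
The four payload operations used in the cells above differ only in how they treat existing fields. The sketch below models that behaviour on a plain dictionary; the `apply_*` functions are hypothetical local helpers for illustration, not calls into `qdrant_client`.

```python
def apply_set_payload(payload: dict, new_fields: dict) -> dict:
    """set_payload: add or update the given fields, keep everything else."""
    return {**payload, **new_fields}


def apply_overwrite_payload(payload: dict, new_payload: dict) -> dict:
    """overwrite_payload: replace the whole payload with the new one."""
    return dict(new_payload)


def apply_delete_payload(payload: dict, keys: list) -> dict:
    """delete_payload: remove only the listed keys."""
    return {k: v for k, v in payload.items() if k not in keys}


def apply_clear_payload(payload: dict) -> dict:
    """clear_payload: remove every field."""
    return {}


payload = {"filename": "example.pdf", "page_number": 39}
after_set = apply_set_payload(payload, {"new_field": "new_value"})
after_delete = apply_delete_payload(after_set, ["new_field"])
after_overwrite = apply_overwrite_payload(after_set, {"only": "this"})
after_clear = apply_clear_payload(after_set)
print(after_set, after_delete, after_overwrite, after_clear, sep="\n")
```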

## Vector Manipulation:
- __create__: upsert()
- __read__: retrieve(), scroll()
- __update__: upsert() with the new vector
- __delete__: delete(), which removes entire points (vector and payload) selected by Filters or point ids

### To replace vectors:
- retrieve the vector
- modify it 
- upsert it back 
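
The three steps above can be sketched against a toy in-memory store. The `upsert`, `retrieve` and `replace_vector` functions below are hypothetical stand-ins that only imitate the behaviour of the client methods with the same purpose. The point they illustrate is that an upsert with an existing id replaces the whole point, so the payload has to be carried along with the new vector.

```python
def upsert(store: dict, point_id: str, vector: list, payload: dict) -> None:
    """Insert the point, or fully replace it if the id already exists."""
    store[point_id] = {"vector": list(vector), "payload": dict(payload)}


def retrieve(store: dict, point_id: str) -> dict:
    """Return a copy of the stored point so callers cannot mutate the store directly."""
    point = store[point_id]
    return {"vector": list(point["vector"]), "payload": dict(point["payload"])}


def replace_vector(store: dict, point_id: str, new_vector: list) -> None:
    """Retrieve the point, swap its vector, and upsert it back with the same payload."""
    point = retrieve(store, point_id)
    point["vector"] = list(new_vector)
    upsert(store, point_id, point["vector"], point["payload"])


store = {}
upsert(store, "p-1", [0.1, 0.2], {"filename": "example.pdf"})
replace_vector(store, "p-1", [0.9, 0.8])
print(store["p-1"])
```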

### To search:
- __search()__ using a query vector, with or without Filters
- __search_batch()__ to run several search requests in one call