### This notebook contains the code for RAG implementation with Qdrant DB for Proposal Pro backend
Steps followed:
- Upload files
- for file in files:
    <!-- update_vector_db.py -->
    - for each page in file: 
        - update metadata
        - chunkify with metadata (RecursiveCharacterTextSplitter)
        - append to a docs_list
    - return docs_list
    - update metadata for each chunk in docs_list:
        - chunk.metadata.update({"tags": "", "filename": ""})
    - get embedding model to use
    - create/update vector DB with the docs_list, embedding model, using collection_name (creates embeddings too)
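
The per-page chunking and metadata steps above can be sketched as follows. This is a minimal illustration, not the code in `update_vector_db.py`: `Chunk`, `split_text`, and `build_docs_list` are hypothetical names, and `split_text` is a simplified fixed-size splitter standing in for `RecursiveCharacterTextSplitter`. Choosing the embedding model and writing to the vector DB are left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Chunk:
    """One piece of page text plus the metadata that later becomes the payload."""
    text: str
    metadata: Dict = field(default_factory=dict)


def split_text(text: str, chunk_size: int = 200, overlap: int = 20) -> List[str]:
    """Fixed-size splitter with overlap (simplified stand-in, see lead-in)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    pieces = []
    start = 0
    while start < len(text):
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last piece already reaches the end of the text
        start += step
    return pieces


def build_docs_list(pages: List[str], filename: str, tags: List[str]) -> List[Chunk]:
    """Chunk every page, then stamp file-level metadata on each chunk."""
    docs_list = []
    for page_number, page_text in enumerate(pages, start=1):
        for piece in split_text(page_text):
            docs_list.append(Chunk(piece, {"page_number": page_number}))
    # Second pass: file-level fields are the same for every chunk of the file.
    for chunk in docs_list:
        chunk.metadata.update({"tags": tags, "filename": filename})
    return docs_list
```

Page-level fields are set while chunking and file-level fields in a second pass, mirroring the order of the steps in the list.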


## DB Architecture 
### Approach 1:
Each **Collection** in the Qdrant DB will be associated with a **Project**.
The collections will be simple vector databases with:

- **collection_name**: *ProjectName_ProjectID*.
- **VectorParams**: Size=1024, Distance=COSINE
- **PointStruct**: id, vector=chunk_embedding, payload=chunk_metadata
    - *chunk_metadata*= {filename, tags, page no., content src}

A Qdrant collection can be viewed as a pool of embeddings from all the docs associated with a Project, with the chunk info (Filename, Page No.) maintained in the payload.
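
A minimal sketch of how one chunk could be shaped into the point layout described above. It builds plain dictionaries rather than `PointStruct` objects so it runs without a Qdrant instance. `collection_name_for` and `make_point` are hypothetical helpers, and deriving the point id as a version-5 UUID from filename, page, and chunk index is an assumption made for illustration; the source does not say how ids are generated.

```python
import uuid
from typing import Dict, List, Sequence

VECTOR_SIZE = 1024  # matches the VectorParams size listed above


def collection_name_for(project_name: str, project_id: str) -> str:
    """Build the ProjectName_ProjectID collection name."""
    return f"{project_name}_{project_id}"


def make_point(embedding: Sequence[float], filename: str, page_number: int,
               chunk_index: int, tags: List[str],
               content_source: str = "text") -> Dict:
    """Return an id / vector / payload dictionary for one chunk."""
    if len(embedding) != VECTOR_SIZE:
        raise ValueError(f"expected {VECTOR_SIZE} dimensions, got {len(embedding)}")
    # Assumption: a deterministic id, so re-ingesting the same chunk
    # overwrites the existing point instead of creating a duplicate.
    point_id = str(uuid.uuid5(uuid.NAMESPACE_URL,
                              f"{filename}#{page_number}#{chunk_index}"))
    return {
        "id": point_id,
        "vector": list(embedding),
        "payload": {
            "filename": filename,
            "tags": tags,
            "page_number": page_number,
            "content_source": content_source,
        },
    }
```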

For ***Retrieval***, the query is sent to the project's Qdrant collection, which returns the *Docs* that most closely match the query.
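
To make "closest match" concrete, here is a minimal brute-force sketch of cosine-similarity ranking over a list of points. It only illustrates the scoring: Qdrant itself answers queries through its HNSW index rather than by scanning every vector, and `cosine_similarity` and `top_k` are hypothetical helper names.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], points: List[Dict], k: int = 3) -> List[Tuple[float, Dict]]:
    """Score every point against the query and keep the k best matches."""
    scored = [(cosine_similarity(query, p["vector"]), p) for p in points]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```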

#### Pros of this approach:
1. One collection per project

2. Maximum isolation

3. Fastest retrieval (each search covers only one project's vectors and needs no payload filter)

4. Clean resource boundaries

5. Easy deletes and backup

This approach suits a proposal-development application well, where the project is the main unit of work.
#### Cons to this approach:
1. Too many collections will be created. Each collection has:
    - Its own HNSW index (a graph-based index structure that makes nearest-neighbor search run in roughly logarithmic time)
    - Its own metadata and memory footprint
    - Background maintenance processes
 ***Conclusion***: *Not scalable* once the collection count exceeds a few thousand

2. Cross-project *search* is not possible with this approach. Searches like:
    - “Search across all user projects”
    - “Search across multiple teams”
    - or some form of global context
  become very hard because of isolation of each project.

3. A Qdrant instance with thousands of collections is more complex to move or manage than a single large collection with payload filters.

The next approach overcomes these issues.

### Approach 2:
A **Collection** will hold all the data for all the projects in a single pool.
The **Payload** to each chunk will contain the following info:
- *userID*
- *projectID*
- *fileID*
- *chunkID*
- *text*
- *fileURL_GCS*
- *fileName*

For ***Retrieval***, we must *filter* the search query by *userID* and *projectID*.
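
A minimal sketch of such a filter, written as the JSON-style dictionary that Qdrant's filtering syntax uses (`must` with `key` / `match` conditions). `build_project_filter` and `matches` are hypothetical helpers. Because a missing condition would return other projects' chunks, the builder refuses to produce a filter unless both ids are present.

```python
from typing import Dict


def build_project_filter(user_id: str, project_id: str) -> Dict:
    """Build a filter that restricts a search to one user's project."""
    if not user_id or not project_id:
        # Fail loudly rather than run an unscoped search over the shared pool.
        raise ValueError("both user_id and project_id are required")
    return {
        "must": [
            {"key": "userID", "match": {"value": user_id}},
            {"key": "projectID", "match": {"value": project_id}},
        ]
    }


def matches(payload: Dict, flt: Dict) -> bool:
    """Local check of the same semantics: every `must` condition has to hold."""
    return all(
        payload.get(cond["key"]) == cond["match"]["value"]
        for cond in flt["must"]
    )
```

The `matches` helper is only there to show the intended semantics in a unit test; in the notebook the equivalent conditions are expressed with `Filter`, `FieldCondition`, and `MatchValue`.
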
#### Cons to this approach:
1. Embeddings from every project are stored together in one pool, so the storage is more cluttered
2. Search will be slower because of the filters and the larger index
3. This approach will result in more complex retrieval logic
4. A buggy filter may leak data between users or projects


### Approach 3:
One **Collection** per **tenant**, or per **User**.

### Approach 4:
Use Named-Vectors






In [1]:
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct, VectorParams, Distance
from qdrant_client.models import Filter, FieldCondition, MatchValue

qdrant_client = QdrantClient("localhost", port=6333)

In [None]:
# qdrant_client.delete_collection(collection_name="iQss")

True

In [12]:
qdrant_client.create_collection(
    collection_name="iQss",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE)
)
print(qdrant_client.get_collections())

collections=[CollectionDescription(name='iQss'), CollectionDescription(name='local_collection')]


In [2]:
qdrant_client.scroll(
    collection_name="iQss",
    limit=20
)

([Record(id='00488fc6-cb35-5d97-ad3e-21a86bb4fdc7', payload={'source': 'Exhibit A Contract Data Requirements List CDRL 13AUG.1723575501858.pdf', 'tags': [], 'sf_pages': [2, 4, 5, 6, 7, 8], 'image_pages': [1], 'total_pages': 16, 'page_number': 15, 'is_sf_page': False, 'content_source': 'text', 'filename': 'Exhibit A Contract Data Requirements List CDRL 13AUG.1723575501858.pdf', 'projectId': '46b64b60-b3e1-4971-aa54-fb25874ccf85', 'chunk_text': 'calendar days after contract award. Government<br>comments provided within (10) business days from receipt.<br>Transition Out Plan: Submit within 60 calendar days of receiving a written request from the Government<br>or NLT 60 calendar days from the start of the final Option Period of the contract.<br>BLK 14: All submissions in electronic format ONLY in the form of editable Excel files and submitted to<br>the COR, and Contracting Officer. The Government may choose to expand the list of recipients.<br>Delivery of record shall be a Government direc

In [None]:
# retrieve/scroll the points first, before deletion
qdrant_client.scroll(
    collection_name="iQss",
    scroll_filter=Filter(
        must=[
            FieldCondition(key="filename", 
                           match=MatchValue(value='Attachment 0001 ECMA Cloud Cybersecurity Support PWS 15AUG24.1723739597369.pdf')),
            FieldCondition(key="projectId",
                           match=MatchValue(value="ae74669a-c985-45bb-9940-d3bb694e53f0"))
        ]
    ),
    limit=20
)

In [5]:
# delete points for a particular file in a project:
qdrant_client.delete(
    collection_name="iQss",
    points_selector=Filter(
        must=[
            FieldCondition(key="filename",
                           match=MatchValue(value='Attachment 0001 ECMA Cloud Cybersecurity Support PWS 15AUG24.1723739597369.pdf')),
            FieldCondition(key="projectId",
                           match=MatchValue(value="ae74669a-c985-45bb-9940-d3bb694e53f0"))
        ]
    )
)

UnexpectedResponse: Unexpected Response: 500 (Internal Server Error)
Raw response content:
b'{"status":{"error":"Service internal error: task 985 panicked with message \\"called `Result::unwrap()` on an `Err` value: OutputTooSmall { expected: 4, actual: 0 }\\""},"time":1.03975541}'

In [2]:
qdrant_client.scroll(
    collection_name="iQss",
    scroll_filter=Filter(
        must=[
            FieldCondition(key="projectId",
                           match=MatchValue(value="58458d50-e546-4d8b-a797-54ca2b31acce"))
        ]
    )
)

UnexpectedResponse: Unexpected Response: 500 (Internal Server Error)
Raw response content:
b'{"status":{"error":"Service internal error: 1 of 1 read operations failed:\\n  Service internal error: task 11485 panicked with message \\"called `Result::unwrap()` on an `Err` value: OutputTooSmall  ...'

In [9]:
qdrant_client.scroll(
    collection_name="iQss",
    limit=20)

([Record(id='001de355-51a8-5cd7-91c1-00f71afee879', payload={'source': 'Attachment 0002 Instructions to Quoters 19Aug.1724079599359.pdf', 'tags': [], 'sf_pages': [], 'image_pages': [], 'total_pages': 10, 'page_number': 3, 'is_sf_page': False, 'content_source': 'text', 'filename': 'Attachment 0002 Instructions to Quoters 19Aug.1724079599359.pdf', 'projectId': '28099d9c-1616-40da-9656-f392ce157ee5', 'chunk_text': 'Page 3:\nfacilities and experience and will base its evaluation on the information presented in the\nQuote.\n\nL2.4 Quoters that include data they do not want disclosed to the public for any purpose, or\nused by the Government except for evaluation purposes, shall mark the title page with the\nfollowing language:\n\n"This Quote includes data that shall not be disclosed outside the Government and shall not\nbe duplicated, used or disclosed--in whole or in part--for any purpose other than to\nevaluate this Quote. If, however, a Task Order is awarded to this Quoter as a result of-

In [10]:
# Modify payload fields: set_payload updates/adds fields; overwrite_payload replaces the entire payload
qdrant_client.set_payload(
    collection_name="iQss",
    points=['149a9cb3-82ff-50cb-b998-6750a70f730a'],
    payload={"new_field": "new_value"}
    
)

UpdateResult(operation_id=249, status=<UpdateStatus.COMPLETED: 'completed'>)

In [16]:
qdrant_client.retrieve(
    collection_name="iQss",
    ids=['149a9cb3-82ff-50cb-b998-6750a70f730a'],
    with_vectors=False,
    with_payload=True
)

[Record(id='149a9cb3-82ff-50cb-b998-6750a70f730a', payload={'source': 'Attachment 0001 ECMA Cloud Cybersecurity Support PWS 15AUG24.1723739597369.pdf', 'tags': [], 'sf_pages': [], 'image_pages': [1], 'total_pages': 39, 'page_number': 39, 'is_sf_page': False, 'content_source': 'text', 'filename': 'Attachment 0001 ECMA Cloud Cybersecurity Support PWS 15AUG24.1723739597369.pdf', 'projectId': '23d73a68-3ca6-48c9-bbac-2eb888e73585', 'chunk_text': '1392 12. DoD Cloud Service Provider Security Requirements Guide, Version 1, Release 1,\n1393 14 June 2024.\n\n1394 13. DoD Directive 8140.01, “Cyberspace Workforce Management”, Dated 05\n1395 October 2020\n\n1396 14. DoD Instruction 8140.02, “Identification, Tracking, and Reporting of Cyberspace\n1397 Workforce Requirements”, Dated 21 December 2021\n\n1398 15. DoD- Manual 8140.03, “Cyberspace Workforce Qualification and Management\n1399 Program”, dated 15 February 2023\n\n1400 16. DoD Instruction 8500.01, “Cybersecurity”, dated 7 October 2019, Inc

In [17]:
# delete_payload deletes the given field(s) from the payload; clear_payload removes the entire payload
qdrant_client.delete_payload(
    collection_name="iQss",
    points=['149a9cb3-82ff-50cb-b998-6750a70f730a'],
    keys=["new_field"]
)

UpdateResult(operation_id=251, status=<UpdateStatus.COMPLETED: 'completed'>)

In [8]:
qdrant_client.retrieve(
    collection_name="iQss",
    ids=['149a9cb3-82ff-50cb-b998-6750a70f730a'],
    with_payload=False,
    with_vectors=True

)

[Record(id='149a9cb3-82ff-50cb-b998-6750a70f730a', payload=None, vector=[0.06298948, 0.014014003, -0.07050165, -0.017617786, 0.0022400487, 0.047341123, 0.06916976, 0.024694154, -0.0037767435, -0.019113036, -0.018501468, 0.04903598, 0.013226027, 0.050686512, -0.027578423, -0.04445896, 0.08981244, 0.054441098, -0.002859866, 0.017270872, -0.009756833, -0.033335976, -0.01466845, -0.007997554, -0.01740664, -0.035193883, 0.061325334, -0.040383887, -0.026761763, -0.0626083, 0.029271832, 0.04640405, -0.0199372, -0.025123173, 0.015867177, -0.0038895044, -0.020880543, 0.017268507, 0.055887695, -0.0041888724, -0.07694683, -0.06601306, -0.07465553, 0.04871077, -0.032403734, 0.029506478, -0.015492165, 0.013802274, -0.08777208, 0.028352642, 0.030651437, 0.02516033, -0.06642999, -0.00094483775, -0.008977796, -0.069117084, -0.02050515, -0.065528326, -0.0021071497, -0.003673967, -0.020597052, -0.017770324, 0.044295486, -0.019505078, 0.02980397, -0.004716311, -0.00046019713, -0.027671715, -0.055217046, 

In [22]:
qdrant_client.clear_payload(
    collection_name="iQss",
    points_selector=['149a9cb3-82ff-50cb-b998-6750a70f730a']
)

UpdateResult(operation_id=253, status=<UpdateStatus.COMPLETED: 'completed'>)
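
The difference between the four payload operations used or mentioned in the cells above can be modelled on a plain dictionary. This is only a sketch of the semantics for top-level keys, not how Qdrant implements them. The functions below are local stand-ins that share names with the client methods, and they return new dictionaries instead of mutating a stored point.

```python
from typing import Dict, Iterable


def set_payload(current: Dict, new: Dict) -> Dict:
    """Add or update the given fields; all other fields are kept."""
    return {**current, **new}


def overwrite_payload(current: Dict, new: Dict) -> Dict:
    """Replace the whole payload with the new one."""
    return dict(new)


def delete_payload(current: Dict, keys: Iterable[str]) -> Dict:
    """Remove only the listed fields."""
    drop = set(keys)
    return {k: v for k, v in current.items() if k not in drop}


def clear_payload(current: Dict) -> Dict:
    """Remove every field, leaving an empty payload."""
    return {}
```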

In [None]:
import pprint
pprint.pprint((qdrant_client.retrieve(
    collection_name="iQss",
    ids=['0ca682cc-f072-53a0-90e7-a0460b1d1037'],
    with_payload=True,
    with_vectors=True

)))

[Record(id='0ca682cc-f072-53a0-90e7-a0460b1d1037', payload={'source': 'Exhibit A Contract Data Requirements List CDRL 13AUG.1723575501858.pdf', 'tags': [], 'sf_pages': [2, 4, 5, 6, 7, 8], 'image_pages': [1], 'total_pages': 16, 'page_number': 9, 'is_sf_page': False, 'content_source': 'text', 'filename': 'Exhibit A Contract Data Requirements List CDRL 13AUG.1723575501858.pdf', 'projectId': 'ae74669a-c985-45bb-9940-d3bb694e53f0', 'chunk_text': 'will prevent disclosure or reconstruction<br>of the document.<br>BLK 10, 12, 13: Monthly, NLT 5 business days after the end of the month.<br>BLK 14: All submissions in electronic format ONLY in the form of editable Word, Excel, PowerPoint, or<br>Visio documents and submitted to the COR, ECMA Data Manager, Project Manager, and Contracting<br>Officer. The Government may choose to expand the list of recipients. Delivery of record shall be a<br>Government direct data repository.|16. REMARKS<br>BLK 4: Contractor format is acceptable, but shall include: 

In [3]:
qdrant_client.count(
    collection_name="iQss",
    count_filter=Filter(
        must=[
            FieldCondition(
                key="projectId",
                match=MatchValue(value="e3e99169-7708-42b9-97a1-920b1bf13b9c"),
            )
        ]
    ),
)


CountResult(count=120)

## Vector Manipulation:
- __create__: upsert()
- __read__: retrieve(), scroll()
- __update__: upsert() with the new vector
- __delete__: delete() removes entire points, selected by Filters or by point ids
- __search__: search() with a query vector, with or without Filters; search_batch() for several queries at once

### To replace vectors:
- retrieve the vector
- modify it
- upsert it back
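
The retrieve, modify, upsert cycle for replacing a vector, listed above, can be sketched against a tiny in-memory stand-in so that it runs without a Qdrant instance. `TinyStore` and `replace_vector` are hypothetical. With the real client, the same three steps would be a `retrieve(...)` call with `with_vectors=True` followed by an `upsert(...)` that reuses the point id.

```python
from typing import Dict, Sequence


class TinyStore:
    """Hypothetical in-memory stand-in for one collection (point id -> point)."""

    def __init__(self) -> None:
        self._points: Dict[str, Dict] = {}

    def __len__(self) -> int:
        return len(self._points)

    def upsert(self, point: Dict) -> None:
        # Insert the point, or overwrite the stored one if the id already exists.
        self._points[point["id"]] = {
            "id": point["id"],
            "vector": list(point["vector"]),
            "payload": dict(point.get("payload", {})),
        }

    def retrieve(self, point_id: str) -> Dict:
        # Return a copy so callers cannot change stored data by accident.
        stored = self._points[point_id]
        return {
            "id": stored["id"],
            "vector": list(stored["vector"]),
            "payload": dict(stored["payload"]),
        }


def replace_vector(store: TinyStore, point_id: str, new_vector: Sequence[float]) -> Dict:
    point = store.retrieve(point_id)    # 1. retrieve the existing point
    point["vector"] = list(new_vector)  # 2. modify the vector only
    store.upsert(point)                 # 3. upsert it back under the same id
    return point
```

Reusing the same id is what makes the upsert an update: the payload travels with the retrieved point, so only the vector changes.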

In [25]:
qdrant_client.delete(
    collection_name="iQss",
    points_selector=Filter(
        must=[
            FieldCondition(key="projectId",
                           match=MatchValue(value="b53ead60-a4ab-458e-98e4-88cf023023b1"))
        ]
    )
)

UpdateResult(operation_id=259, status=<UpdateStatus.COMPLETED: 'completed'>)