### This notebook contains the code for the RAG implementation with Qdrant DB for the Proposal Pro backend
Steps followed:
- Upload files
- for file in files:
    - for each page in file: 
        - update metadata
        - chunkify with metadata (RecursiveCharacterTextSplitter)
        - append to a docs_list
    - return docs_list
    - update metadata for each chunk in docs_list:
        - chunk.metadata.update({"tags": "", "filename": ""})
    - get embedding model to use
    - either create embeddings for each chunk explicitly, or
    - create/update the vector DB from the docs_list, the embedding model, and a collection_name (this step creates the embeddings too)
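
The chunking and metadata steps above can be sketched in plain Python. This is only an illustration under stated assumptions: `split_text` is a naive fixed-size splitter standing in for `RecursiveCharacterTextSplitter`, and the `Chunk` class, the `files` input shape (filename mapped to a list of page texts), and the `page` metadata key are hypothetical names, not the notebook's actual code.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """One text chunk plus the metadata that travels with it (hypothetical type)."""
    text: str
    metadata: dict = field(default_factory=dict)


def split_text(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Naive fixed-size splitter; a stand-in for RecursiveCharacterTextSplitter."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def build_docs_list(files: dict[str, list[str]], tags: str = "") -> list[Chunk]:
    """Chunk every page of every file and carry page-level metadata along.

    `files` maps a filename to its list of page texts (assumed input shape).
    """
    docs_list: list[Chunk] = []
    for filename, pages in files.items():
        for page_no, page_text in enumerate(pages, start=1):
            # Page-level metadata is copied onto each chunk of that page.
            page_metadata = {"filename": filename, "page": page_no}
            for piece in split_text(page_text):
                docs_list.append(Chunk(piece, dict(page_metadata)))
    # Second pass: chunk-level metadata update, as in the steps above.
    for chunk in docs_list:
        chunk.metadata.update({"tags": tags})
    return docs_list
```

Embedding and the vector DB write would follow on the returned `docs_list`; they are left out here because they depend on the embedding model chosen.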


## DB Architecture 
### Approach 1:
Each **Collection** in the Qdrant DB will be associated with a **Project**.
Each collection is a simple vector store with:

- **collection_name**: *ProjectName_ProjectID*.
- **VectorParams**: Size=1024, Distance=COSINE
- **PointStruct**: id, vector=chunk_embedding, payload=chunk_metadata
    - *chunk_metadata*= {filename, tags, page no., content src}

A Qdrant collection can be viewed as a pool of embeddings from all the docs associated with a Project, with the chunk info (Filename, Page No.) maintained in the payload.
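
To make the layout concrete, here is a rough sketch of the collection settings and a single point as plain dictionaries, shaped like the bodies of Qdrant's REST API for creating a collection and upserting points. The helper names, the payload keys (`page`, `source`, `chunk_index`), and the UUID-based id scheme are assumptions for illustration; only the size, distance, and the id/vector/payload structure come from the notes above.

```python
import uuid


def collection_name(project_name: str, project_id: str) -> str:
    """ProjectName_ProjectID, as described above."""
    return f"{project_name}_{project_id}"


def collection_config(size: int = 1024, distance: str = "Cosine") -> dict:
    """Vector settings for one collection (size=1024, cosine distance)."""
    return {"vectors": {"size": size, "distance": distance}}


def make_point(embedding: list[float], metadata: dict, size: int = 1024) -> dict:
    """Build one point: id, vector=chunk_embedding, payload=chunk_metadata."""
    if len(embedding) != size:
        raise ValueError(f"expected a vector of length {size}, got {len(embedding)}")
    # Qdrant point ids must be unsigned integers or UUIDs. Deriving a UUID from
    # the chunk's location (an assumed scheme) keeps re-uploads idempotent.
    key = f"{metadata['filename']}|{metadata['page']}|{metadata.get('chunk_index', 0)}"
    return {
        "id": str(uuid.uuid5(uuid.NAMESPACE_URL, key)),
        "vector": list(embedding),
        "payload": dict(metadata),
    }
```

Because the id is derived from the filename, page, and chunk index, uploading the same file twice would overwrite the existing points rather than duplicate them.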

For ***Retrieval***, the query is sent to the project's Qdrant collection, which returns the *Docs* that most closely match the query.
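
As a rough picture of what "closest match" means under cosine distance, the sketch below ranks stored points against a query vector by brute force. It is not how Qdrant searches internally (Qdrant uses an approximate HNSW index); the function names and the in-memory `points` list are assumptions for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vector: list[float], points: list[dict], k: int = 3) -> list[tuple]:
    """Return the k (score, payload) pairs most similar to the query vector."""
    scored = [
        (cosine_similarity(query_vector, point["vector"]), point["payload"])
        for point in points
    ]
    # Highest similarity first; the payload tells us which file and page matched.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

The payload returned with each hit is what lets the backend trace a retrieved chunk back to its file and page.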

#### Cons to this approach:
1. Too many collections will be created. Each collection has:
    - Its own HNSW index
    - Its own metadata and memory footprint
    - Background maintenance processes

    ***Conclusion***: Not feasible once the collection count grows beyond a few thousand.
2. Cross-project *search* is not possible with this approach. Searches such as:
    - “Search across all user projects”
    - “Search across multiple teams”
    - or some form of global context

    become very hard because each project is isolated in its own collection.

3. A Qdrant instance with thousands of collections is more complex to move or manage than a single large collection with payload filters.


### Approach 2:
A single **Collection** will hold the data for all projects in one pool.
The **Payload** of each chunk will contain the following info:
- *userID*
- *projectID*
- *fileID*
- *chunkID*
- *text*
- *fileURL_GCS*
- *fileName*

For ***Retrieval***, we must *filter* the search by *userID* and *projectID*.
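
A minimal sketch of such a filter is shown below. The dictionary mirrors the `must` / `match` structure of Qdrant's filter conditions, while `matches` is a hypothetical local helper that mimics the effect on a list of payloads so the behaviour can be checked without a running Qdrant instance.

```python
def project_filter(user_id: str, project_id: str) -> dict:
    """Filter that keeps only points belonging to one user's project."""
    return {
        "must": [
            {"key": "userID", "match": {"value": user_id}},
            {"key": "projectID", "match": {"value": project_id}},
        ]
    }


def matches(payload: dict, flt: dict) -> bool:
    """Local stand-in for the filter: every `must` condition has to hold."""
    return all(
        payload.get(condition["key"]) == condition["match"]["value"]
        for condition in flt["must"]
    )


# Example: three chunks in the shared pool, only one belongs to (u1, p9).
payloads = [
    {"userID": "u1", "projectID": "p9", "chunkID": 1},
    {"userID": "u1", "projectID": "p2", "chunkID": 2},
    {"userID": "u2", "projectID": "p9", "chunkID": 3},
]
visible = [p for p in payloads if matches(p, project_filter("u1", "p9"))]
```

Since every project shares one pool, this filter is the only thing separating one project's chunks from another's, so it would need to be applied on every search.
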
#### Cons to this approach:
1. This approach results in cluttered storage of embeddings, as it appears





