# vectorDBpipe: End-to-End RAG Pipeline

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/yashdesai023/vectorDBpipe/blob/main/experiments/vector_pipeline_demo.ipynb)

Welcome to the **vectorDBpipe** demo! 

In this notebook, we build the ingestion and retrieval stages of a **Retrieval-Augmented Generation (RAG)** pipeline in under five minutes.

**What we will do:**
1.  **Install** the library.
2.  **Generate** dummy text data (representing your internal knowledge base).
3.  **Ingest** the data (Load -> Clean -> Chunk -> Embed -> Store).
4.  **Search** the data using semantic queries.

## 1. Installation

First, we install the package directly from PyPI. 
> **Note:** On a local Windows machine without a GPU, you may also need the CPU-only requirements file; see the commented line in the cell below.

In [1]:
# Install vectorDBpipe
!pip install vectordbpipe

# If on Windows local machine without GPU, uncomment the line below:
# !pip install -r https://github.com/yashdesai023/vectorDBpipe/raw/main/requirements-cpu.txt
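
Before moving on, it can help to confirm that the install actually landed in the active environment. The sketch below uses only the standard library; `is_installed` is a hypothetical helper written for this notebook, and `vectorDBpipe` is assumed to be the import name because that is what the import statement in section 3 uses.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located in this environment."""
    return importlib.util.find_spec(module_name) is not None


# "vectorDBpipe" is the import name used later in this notebook (an assumption
# about the package layout, taken from the import in section 3).
print("vectorDBpipe importable:", is_installed("vectorDBpipe"))
```

If this prints `False`, re-run the install cell and restart the kernel before continuing.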



## 2. Setup & Data Generation

We need some data to search! Let's create a temporary directory `my_data/` and add some text files representing corporate documents.

In [2]:
import os

# 1. Create Data Directory
data_dir = "my_data"
os.makedirs(data_dir, exist_ok=True)

# 2. Create Dummy Files
documents = {
    "policy_remote_work.txt": """
    Remote Work Policy (2025):
    Employees are allowed to work from home 3 days a week.
    Core hours are 10 AM to 4 PM EST.
    Approval from the manager is required for full-time remote work.
    """,
    "project_alpha_specs.txt": """
    Project Alpha Specifications:
    The goal is to build a scalable vector search engine.
    It must support Pinecone and ChromaDB.
    The latency for search queries should be under 200ms.
    """,
    "meeting_minutes_jan.txt": """
    Minutes of Meeting - Jan 15:
    - Discussed the budget for Q1.
    - Approved the purchase of new H100 GPUs.
    - Team outing scheduled for Feb 20th at the Central Park.
    """
}

for filename, content in documents.items():
    with open(os.path.join(data_dir, filename), "w", encoding="utf-8") as f:
        f.write(content.strip())

print(f"✅ Generated {len(documents)} sample documents in '{data_dir}'")

✅ Generated 3 sample documents in 'my_data'


### 💡 Pro Tip: Customizing `config.yaml`

In a real project, you would edit `vectorDBpipe/config/config.yaml`. Here is a quick guide on key settings:

*   **`model.name`**: Switch to a larger model like `sentence-transformers/all-mpnet-base-v2` for better accuracy.
*   **`vector_db.type`**: Change from `"chroma"` (local) to `"pinecone"` (cloud serverless) for production.
*   **`paths.data_dir`**: Point this to your actual document folder (e.g., `"/content/my_pdfs/"`).
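
To make the three settings above concrete, here is a minimal sketch that renders them into YAML text. `render_config` is a hypothetical helper, not part of vectorDBpipe, and it emits only the keys named in this tip; a real config file will likely need further keys, such as the `logs_dir` and `persist_directory` entries used in the demo config in section 3.

```python
def render_config(model_name: str, db_type: str, data_dir: str) -> str:
    """Build YAML text containing only the three settings discussed above."""
    return (
        "paths:\n"
        f'  data_dir: "{data_dir}"\n'
        "\n"
        "model:\n"
        f'  name: "{model_name}"\n'
        "\n"
        "vector_db:\n"
        f'  type: "{db_type}"\n'
    )


# Example values are taken from the bullets above.
config_text = render_config(
    model_name="sentence-transformers/all-mpnet-base-v2",
    db_type="chroma",
    data_dir="/content/my_pdfs/",
)
print(config_text)
```

Keeping the values as function arguments makes it easy to generate one config per environment (for example, local Chroma versus Pinecone) without hand-editing the file.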

## 3. Initialize the Pipeline

We use `TextPipeline` as our main entry point. 

By default, it looks for a `config.yaml`. In this demo, we create one dynamically:

In [3]:
from vectorDBpipe.pipeline.text_pipeline import TextPipeline

# Optional: Create a custom config file to point to our new data folder
config_content = """
paths:
  data_dir: "my_data/"
  logs_dir: "logs/"

model:
  name: "sentence-transformers/all-MiniLM-L6-v2"

vector_db:
  type: "chroma"  # Using local ChromaDB for this demo
  persist_directory: "chroma_db_storage"
"""

with open("demo_config.yaml", "w") as f:
    f.write(config_content)

# Initialize Pipeline with our custom config
pipeline = TextPipeline(config_path="demo_config.yaml")
print("✅ Pipeline Initialized!")


## 4. Run Ingestion (Load -> Clean -> Chunk -> Embed -> Store)

This is the magic step. We call `process()` to read all files in `my_data/`, turn them into vectors, and store them in ChromaDB.

We use `batch_size=10` here because our data is small, but for large datasets, use `100` or more.

In [None]:
pipeline.process(batch_size=10)
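
To build intuition for what the clean, chunk, and batch stages do, here is a toy sketch using only the standard library. It is not vectorDBpipe's implementation: the function names, the whitespace-based cleaning, and the word-count chunking rule are all assumptions chosen for illustration, and the embed and store stages are left out entirely.

```python
import re
from typing import Iterator, List


def clean(text: str) -> str:
    """Collapse runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def chunk(text: str, max_words: int = 8) -> List[str]:
    """Split text into consecutive pieces of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def batches(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield successive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


raw = "Employees are allowed   to work from home\n3 days a week. Core hours are 10 AM to 4 PM EST."
chunks = chunk(clean(raw))
for n, batch in enumerate(batches(chunks, batch_size=2), start=1):
    # In a real pipeline, each batch would be embedded and written to the store here.
    print(f"batch {n}: {len(batch)} chunk(s)")
```

A larger `batch_size` means fewer round trips to the embedding model and the vector store, which is why bigger datasets benefit from values like `100`.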

## 5. Semantic Search

Now that our data is indexed, let's ask questions! Notice we don't need exact keyword matches. 

*   **Query 1**: "Can I work from home on Fridays?" (should match *Remote Work Policy*)
*   **Query 2**: "What did we decide about new hardware?" (should match *Meeting Minutes*)
*   **Query 3** (try it yourself): "What are the requirements for Project Alpha?" (should match *Project Specs*)

In [None]:
def print_results(results):
    print("\n--- 🔍 Search Results ---")
    for i, res in enumerate(results):
        meta = res.get('metadata', {})
        print(f"Result {i+1} (Source: {meta.get('source', 'Unknown')}):")
        print(f"  \"...{meta.get('text', '')}...\"\n")

# Test Query 1
q1 = "Can I work from home on Fridays?"
print(f"Query: {q1}")
results = pipeline.search(q1, top_k=1)
print_results(results)

# Test Query 2
q2 = "What did we decide about new hardware?"
print(f"Query: {q2}")
results = pipeline.search(q2, top_k=1)
print_results(results)
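
Why does a query match a document that shares few of its words? Semantic search compares embedding vectors rather than keywords. The sketch below ranks documents by cosine similarity, a common choice of metric; whether your backend uses cosine or another distance depends on its configuration, so treat that as an assumption. The 3-dimensional vectors are hand-made for illustration and are not real embeddings.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made toy vectors, NOT real embeddings; each axis loosely stands for a topic.
doc_vectors: Dict[str, List[float]] = {
    "policy_remote_work.txt": [0.9, 0.1, 0.0],
    "project_alpha_specs.txt": [0.1, 0.9, 0.1],
    "meeting_minutes_jan.txt": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.1]  # stands in for a remote-work question

# Rank document names from most to least similar to the query.
ranked = sorted(
    doc_vectors,
    key=lambda name: cosine_similarity(query_vector, doc_vectors[name]),
    reverse=True,
)
print("Best match:", ranked[0])
```

With real embeddings the vectors have hundreds of dimensions, but the ranking step works the same way.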

## 6. What's Next?

You have successfully built a Search Engine!

**To go to Production:**
1.  Change `vector_db.type` to `"pinecone"` in `config.yaml`.
2.  Set your `PINECONE_API_KEY` environment variable.
3.  Point `data_dir` to your real, gigabyte-scale dataset.
4.  Run `pipeline.process(batch_size=100)`.
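
A quick pre-flight check can catch a missing API key before a long ingestion run starts. This is a minimal sketch using only the standard library; `missing_settings` is a hypothetical helper, and the only variable it checks is the `PINECONE_API_KEY` named in step 2.

```python
import os
from typing import List, Mapping, Sequence


def missing_settings(
    env: Mapping[str, str],
    required: Sequence[str] = ("PINECONE_API_KEY",),
) -> List[str]:
    """Return the names of required variables that are unset or empty."""
    return [name for name in required if not env.get(name)]


missing = missing_settings(os.environ)
if missing:
    print("Set these before running the pipeline:", ", ".join(missing))
else:
    print("Environment looks ready.")
```

Passing the environment in as an argument keeps the check easy to test and easy to extend if your deployment needs more variables.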

Enjoy using **vectorDBpipe**! 🚀