# Creating a Vector Database for Diabetic-Friendly Recipes

This notebook creates a vector database from the diabetic-friendly recipes dataset using ChromaDB and sentence transformers. 

- Loads the `ashikan/diabetic-friendly-recipes` dataset from the Hugging Face Hub.
- Formats each recipe (title, ingredients, instructions, NER field) into a single text document for semantic search and retrieval.
- Generates a dense vector embedding for each recipe with the `all-MiniLM-L6-v2` SentenceTransformer model, so recipes can be compared by meaning rather than by exact keywords.
- Creates a persistent ChromaDB collection that stores the recipe embeddings and metadata for use in retrieval-augmented generation (RAG) workflows.
- Saves all embeddings to disk for downstream visualization and analysis.


In [1]:
# imports
import os
import numpy as np
from tqdm import tqdm
from sentence_transformers import SentenceTransformer
import chromadb
from datasets import load_dataset




In [2]:
# Load the dataset
dataset = load_dataset("ashikan/diabetic-friendly-recipes")
recipes = dataset['train']

In [3]:
# Initialize ChromaDB client
DB_PATH = "recipes_vectorstore"
client = chromadb.PersistentClient(path=DB_PATH)

# Recreate the collection from scratch: delete any existing one first
collection_name = "recipes"
existing_collections = [collection.name for collection in client.list_collections()]
if collection_name in existing_collections:
    client.delete_collection(collection_name)
    print(f"Deleted existing collection: {collection_name}")

collection = client.create_collection(collection_name)

Benefits of a local SentenceTransformer model compared to OpenAI embeddings:

It is free and fast, and it runs locally, so the data never leaves our machine.

In [4]:
# Initialize the sentence transformer model
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
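
To make "similarity search" concrete, here is a minimal pure-Python sketch of ranking vectors by cosine similarity. The helper names `cosine_similarity` and `top_k` and the 3-dimensional toy vectors are assumptions made for illustration; the vectors produced by `all-MiniLM-L6-v2` have 384 dimensions, and ChromaDB does the ranking itself using its own index and distance setting.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, vectors, k=2):
    """Return the indices of the k vectors most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(vectors)]
    scored.sort(reverse=True)  # highest similarity first
    return [i for _, i in scored[:k]]


# Hypothetical toy "embeddings", not real model output.
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]
toy_query = [1.0, 0.05, 0.0]

nearest = top_k(toy_query, toy_vectors, k=2)
```

The first two toy vectors point in nearly the same direction as the query, so they rank ahead of the third.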

In [5]:
# Prepare data for vector database
documents = []
metadatas = []
ids = []

for idx, recipe in enumerate(tqdm(recipes)):
    # Create a text representation of the recipe
    recipe_text = f"Title: {recipe['recipeName']}\n"
    recipe_text += f"Ingredients: {', '.join(recipe['ingredients'])}\n"
    recipe_text += f"Instructions: {recipe['steps']}\n"
    recipe_text += f"NER: {recipe['NER']}\n"

    documents.append(recipe_text)
    metadatas.append({
        'title': recipe['recipeName'],
        'ingredients_count': len(recipe['ingredients']),
        'instructions_length': len(recipe['steps'])
    })
    ids.append(str(idx))

100%|██████████| 718/718 [00:00<00:00, 3949.25it/s]
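
The text-building step above can also be factored into a small helper. This is a sketch, not the notebook's own code: `format_recipe` and the sample recipe are hypothetical, and the helper assumes a field may arrive as a list, a string, or `None`. Unlike the loop above, it joins list-valued fields into plain text instead of relying on Python's list formatting.

```python
def format_recipe(recipe):
    """Build the text that gets embedded for one recipe (hypothetical helper)."""

    def as_text(value, sep):
        # Join lists, pass strings through, and treat None as empty.
        if value is None:
            return ""
        if isinstance(value, (list, tuple)):
            return sep.join(str(item) for item in value)
        return str(value)

    lines = [
        f"Title: {as_text(recipe.get('recipeName'), ' ')}",
        f"Ingredients: {as_text(recipe.get('ingredients'), ', ')}",
        f"Instructions: {as_text(recipe.get('steps'), ' ')}",
        f"NER: {as_text(recipe.get('NER'), ', ')}",
    ]
    return "\n".join(lines) + "\n"


# Made-up sample row; field names follow the loop above.
sample_recipe = {
    "recipeName": "Lentil Soup",
    "ingredients": ["1 cup lentils", "2 carrots"],
    "steps": ["Rinse lentils.", "Simmer 20 minutes."],
    "NER": None,
}
sample_text = format_recipe(sample_recipe)
```

Keeping the formatting in one function makes it easier to reuse the same text layout when embedding a query-side recipe later.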


In [6]:
# Generate embeddings and add to collection
batch_size = 100
for i in tqdm(range(0, len(documents), batch_size)):
    batch_docs = documents[i:i + batch_size]
    batch_metadatas = metadatas[i:i + batch_size]
    batch_ids = ids[i:i + batch_size]
    
    # Generate embeddings
    embeddings = model.encode(batch_docs).tolist()
    
    # Add to collection
    collection.add(
        documents=batch_docs,
        embeddings=embeddings,
        metadatas=batch_metadatas,
        ids=batch_ids
    )

100%|██████████| 8/8 [01:16<00:00,  9.57s/it]
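
Once the collection is populated, it can be queried with the same model. The sketch below is one possible way to do that; `search_recipes` is a hypothetical helper, not part of the notebook. It assumes `collection` is the ChromaDB collection and `model` is the SentenceTransformer created above, and it relies on `collection.query` returning one inner list per query for `ids`, `metadatas`, and `distances`.

```python
def search_recipes(collection, model, query, k=3):
    """Return the k recipes closest to a free-text query (hypothetical helper)."""
    # Embed the query with the same model that was used for the documents.
    query_embedding = model.encode([query]).tolist()

    results = collection.query(query_embeddings=query_embedding, n_results=k)

    # One inner list per query; a single query was sent, so take index 0.
    ids = results["ids"][0]
    metadatas = results["metadatas"][0]
    distances = results["distances"][0]

    return [
        {"id": rid, "title": meta.get("title"), "distance": dist}
        for rid, meta, dist in zip(ids, metadatas, distances)
    ]
```

A call such as `search_recipes(collection, model, "low-carb chicken dinner")` would return a list of dictionaries ordered from closest to farthest, where a smaller `distance` means a closer match.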
