## **🛠️ Tools You May Consider**  
(*These are recommendations to help you get started. You are free to use alternative tools—just document your choices clearly!*)  
- **Database**: FAISS, ChromaDB, SQLite, Elasticsearch, Neo4j, etc.  
- **Embedding Models**: Hugging Face Sentence-Transformers, OpenAI Embeddings  
- **LLM for Generation**: OpenAI: gpt-4o-mini
- **Others**: LangChain, GraphRAG, etc.

## **📌 Final Delivery**  
Your final submission should include:  
✅ A well-documented **GitHub repository or notebook**  
✅ A clear **README** explaining your approach  
✅ Structured **retrieval and generation modules**  

### **🔥 Bonus Points For**  
✨ Innovative retrieval techniques  
✨ Well-organized, modular code  
✨ Creative visualizations or user interfaces  


# 1. Set up working environment

In [None]:
!git clone https://github.com/richdanis/NaNsense.git
# Use the %cd magic: "!cd" runs in a subshell, so the directory change would not persist
%cd NaNsense
!pip install -r requirements.txt

In [None]:
# Load the Drive and mount
from google.colab import drive
drive.mount('/content/drive/')

# 2. Knowledge Base Preparation

## 2.1 Load documents

Once you have been granted access to this folder, it will appear under "Shared drives" in your Google Drive. You can then mount your drive as shown above and access the data at "/content/drive/Shared drives/Datathon/Data/hackathon_data/". Enjoy the ride! :)

In [None]:
folder_path = "/content/drive/Shared drives/Datathon/Data/hackathon_data/"  # Google Drive path of the dataset
folder_path = "data/hackathon_data"  # Local copy; overrides the Drive path above (remove this line to read from Drive)


Load the JSON files and write the filtered output to `data/clean`.

In [None]:
!mkdir -p data/clean

In [None]:
import os
from tqdm import tqdm
from src.preprocessing import filter_json_file

for filename in tqdm(os.listdir(folder_path)):
    if filename.endswith(".json"):
        filepath = os.path.join(folder_path, filename)
        filter_json_file(filepath, "data/clean")

## 2.2 Pre-process documents.

Feel free to explore and pre-process the data. You may want to clean or segment the documents as you see fit.

In [7]:
def document_clean(docs):
  """
  You may want to clean the dataset, add the code here.
  """
  pass
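
As a starting point for the stub above, here is a minimal sketch of one possible cleaning pass. It assumes each document is a dict with `content` and `metadata` keys (the shape built later in section 2.3); the function name `clean_documents` and the `min_chars` threshold are illustrative choices, not part of the provided code.

```python
import re

def clean_documents(docs, min_chars=20):
    """Normalize whitespace, drop near-empty texts, and remove exact duplicates."""
    seen = set()
    cleaned = []
    for doc in docs:
        # Collapse runs of whitespace (newlines, tabs, repeated spaces) into single spaces
        text = re.sub(r"\s+", " ", doc["content"]).strip()
        if len(text) < min_chars:
            continue  # too short to carry useful information
        if text in seen:
            continue  # exact duplicate of an earlier document
        seen.add(text)
        cleaned.append({"content": text, "metadata": doc["metadata"]})
    return cleaned

sample = [
    {"content": "  Acme  builds\n\nindustrial robots in Ohio. ", "metadata": {"source": "a"}},
    {"content": "Acme builds industrial robots in Ohio.", "metadata": {"source": "b"}},
    {"content": "Home", "metadata": {"source": "c"}},
]
cleaned_sample = clean_documents(sample)
print(cleaned_sample)
```

Only the first sample survives: the second is a duplicate after normalization and the third is below the length threshold. Tune `min_chars` to your data; menu pages or address blocks may be short but still useful.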

## 2.3 Document Indexing and Storage (Profiling)

Feel free to choose how to index and store the provided documents in a knowledge base, so that they can be retrieved in whichever way your system design calls for: keyword search, vector representations, graph relations, etc.
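
To make the keyword option concrete, the sketch below builds a small inverted index in plain Python. It is only an illustration under simple assumptions (lowercase alphanumeric tokens, AND semantics across keywords); the helper names `build_inverted_index` and `keyword_lookup` are hypothetical and do not come from the repository.

```python
import re
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase token to the set of document positions containing it."""
    index = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        for token in re.findall(r"[a-z0-9]+", doc["content"].lower()):
            index[token].add(doc_id)
    return index

def keyword_lookup(index, keywords):
    """Return ids of documents that contain every keyword (AND semantics)."""
    postings = [index.get(k.lower(), set()) for k in keywords]
    if not postings:
        return []
    return sorted(set.intersection(*postings))

toy_docs = [
    {"content": "French cheese and wine shop", "metadata": {}},
    {"content": "Wine bar in Paris", "metadata": {}},
    {"content": "Cheese factory tours", "metadata": {}},
]
toy_index = build_inverted_index(toy_docs)
print(keyword_lookup(toy_index, ["wine", "cheese"]))
```

An index like this answers exact keyword queries in time proportional to the posting lists rather than the corpus size, which matters once the collection grows to hundreds of thousands of chunks.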

In [1]:
import langchain
from langchain_community.document_loaders import JSONLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
import json
import os
from tqdm import tqdm

def chunk_documents(documents, chunk_size=500, chunk_overlap=100):
    """
    Split documents into chunks for better retrieval.
    
    Args:
        documents: List of document dictionaries with content and metadata
        chunk_size: Maximum size of chunks
        chunk_overlap: Overlap between chunks
    
    Returns:
        List of LangChain Document objects
    """
    from langchain.schema import Document
    
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
    )
    
    chunked_docs = []
    for doc in tqdm(documents):
        splits = text_splitter.split_text(doc["content"])
        for i, split in enumerate(splits):
            chunked_docs.append(
                Document(
                    page_content=split,
                    metadata={
                        **doc["metadata"],
                        "chunk_id": i
                    }
                )
            )
    
    return chunked_docs
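
To build intuition for how `chunk_size` and `chunk_overlap` interact, here is a simplified fixed-width sliding window in plain Python. This is not how `RecursiveCharacterTextSplitter` works internally (it prefers to split on separators such as paragraphs and sentences); the function `sliding_window_chunks` is a hypothetical helper that only illustrates the overlap idea.

```python
def sliding_window_chunks(text, chunk_size=500, chunk_overlap=100):
    """Cut text into fixed-width windows where consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

demo_chunks = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(demo_chunks)
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the last character of one chunk reappears as the first character of the next in the demo. Larger overlaps reduce the chance of cutting an answer in half, at the cost of more chunks to embed.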

#### Option 1: Without Metadata

In [2]:
# Walk the data/clean folder and collect one document per page URL (chunking happens later)
documents = []
for filename in tqdm(os.listdir("data/clean")):
    if filename.endswith(".json"):
        filepath = os.path.join("data/clean", filename)
        with open(filepath, "r") as f:
            data = json.load(f)
            for url in data["text_by_page_url"]:
                documents.append({"content": data["text_by_page_url"][url], "metadata": {"source": url}})

100%|██████████| 13144/13144 [00:12<00:00, 1070.66it/s]


#### Option 2: With Metadata

In [None]:
from src.fuzzy_metadata import fuzzy_is_meta

documents = fuzzy_is_meta(use_all_doc=True)

#### Continue as before

In [3]:
chunked_documents = chunk_documents(documents)

100%|██████████| 258097/258097 [02:01<00:00, 2123.07it/s]


In [4]:
from src.keyword_retrieval import fuzzy_search

keywords = fuzzy_search(["France", "Cheese", "Wine"], chunked_documents)


Fuzzy searching: 100%|██████████| 3976253/3976253 [01:21<00:00, 49037.01it/s]


In [11]:
keywords[20]

Document(metadata={'source': 'https://www.thehenryrestaurant.com/locations/the-henry-west-hollywood/menus/dinner-menu/', 'chunk_id': 2}, page_content='fig, pumpkin seed, candied pecan, pecorino, mustard vinaigrette Entrées Wagyu Cheeseburger* 25 lettuce, tomato, pickle, charred onion, white cheddar, american cheese, henry sauce Scottish Salmon* 38 toasted quinoa, marcona almond pesto, crispy sweet potato, watercress, pomegranate glaze Filet Mignon* 56 horseradish gratin, roasted brussels sprout, wild mushroom, cipollini onion, burgundy sauce Add Lobster 24 Bolognese 29 garganelli pasta, truffle mushroom butter, herbed ricotta, garlic toast')
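
The implementation of `fuzzy_search` lives in `src/keyword_retrieval` and is not shown in this notebook. As a rough illustration of what fuzzy keyword matching can look like, the sketch below uses the standard library's `difflib`; the function `fuzzy_contains` and the `0.8` threshold are assumptions made for this example, not the repository's method.

```python
from difflib import SequenceMatcher

def fuzzy_contains(keyword, text, threshold=0.8):
    """Return True if any word in text is similar enough to keyword."""
    keyword = keyword.lower()
    for word in text.lower().split():
        word = word.strip(".,;:*()\"'")  # drop punctuation stuck to the word
        # ratio() is 1.0 for identical strings and falls toward 0.0 as they diverge
        if SequenceMatcher(None, keyword, word).ratio() >= threshold:
            return True
    return False

print(fuzzy_contains("cheese", "american chese"))   # misspelling still matches
print(fuzzy_contains("wine", "house red wine"))     # exact word match
print(fuzzy_contains("france", "burgundy sauce"))   # no similar word
```

Word-level comparison like this tolerates small typos but does not match substrings, so "cheese" would not match "cheeseburger" at this threshold. Whether that is desirable depends on your queries.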

# 3. Retrieval Augmented Generation

## 3.1 Load Knowledge Database

In [17]:
import torch
from langchain_community.embeddings import HuggingFaceEmbeddings

# Replace OpenAI embeddings with a local model
def get_local_embeddings(model_name="sentence-transformers/all-MiniLM-L6-v2"):
    """
    Create a local embedding model using HuggingFace models.
    
    Args:
        model_name: Name of the HuggingFace embedding model
    
    Returns:
        HuggingFaceEmbeddings model
    """
    model_kwargs = {'device': 'cuda' if torch.cuda.is_available() else 'cpu'}  # Run on the GPU when one is available, otherwise on the CPU
    encode_kwargs = {'normalize_embeddings': True}
    
    embeddings = HuggingFaceEmbeddings(
        model_name=model_name,
        model_kwargs=model_kwargs,
        encode_kwargs=encode_kwargs
    )
    
    return embeddings
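
A note on `normalize_embeddings=True`: once vectors are scaled to unit length, their dot product equals their cosine similarity, so the two scores rank documents identically. The plain-Python sketch below checks this on toy vectors; the helper names are illustrative only.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(vec, vec))
    if norm == 0:
        return list(vec)
    return [x / norm for x in vec]

def cosine(a, b):
    """Cosine similarity computed from the raw, unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

u, v = [3.0, 4.0], [4.0, 3.0]
cos_uv = cosine(u, v)
dot_normalized = dot(l2_normalize(u), l2_normalize(v))
print(cos_uv, dot_normalized)
```

Both numbers agree up to floating-point error, which is why a vector store can use a cheap inner product on normalized embeddings.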

In [18]:
def create_vector_db(documents, persist_directory="./chroma_db"):
    """
    Create and persist a vector database from documents.
    Embeddings are computed with the local model returned by get_local_embeddings().
    
    Args:
        documents: List of LangChain Document objects
        persist_directory: Directory to save the vector database
    
    Returns:
        Chroma vector store
    """
    # Initialize the embedding model
    embeddings = get_local_embeddings()
    
    # Create and persist the vector store
    vectordb = Chroma.from_documents(
        documents=documents,
        embedding=embeddings,
        persist_directory=persist_directory,

    )
    
    vectordb.persist()
    print(f"Vector database created with {len(documents)} chunks and saved to {persist_directory}")
    
    return vectordb

In [None]:
# Index the chunked LangChain Document objects (not the raw dicts in `documents`)
vector_db = create_vector_db(chunked_documents)

## 3.2 Relevant Document Retrieval

Feel free to check and improve your retrieval performance, as it affects the generation results significantly.

In [25]:
def retrieve_documents(query, vectordb, k=1):
    """
    Retrieve relevant documents from the vector database based on the query.
    
    Args:
        query: User query string
        vectordb: Vector database to search
        k: Number of documents to retrieve
    
    Returns:
        List of retrieved documents
    """
    retriever = vectordb.as_retriever(search_type="similarity", search_kwargs={"k": k})
    docs = retriever.get_relevant_documents(query)
    return docs
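
One simple way to check retrieval quality is hit rate at k: the fraction of queries for which the expected source appears among the top k results. The sketch below is a minimal version that assumes you have a small set of (query, expected source URL) pairs; `hit_rate_at_k` and the stub retriever are hypothetical and stand in for a real vector store.

```python
def hit_rate_at_k(labeled_queries, retrieve_sources, k=3):
    """Fraction of queries whose expected source is among the top-k retrieved sources."""
    if not labeled_queries:
        return 0.0
    hits = 0
    for query, expected_source in labeled_queries:
        if expected_source in retrieve_sources(query, k):
            hits += 1
    return hits / len(labeled_queries)

# Stub retriever with fixed rankings, used here instead of a real vector store
_FAKE_RANKINGS = {
    "q1": ["url_a", "url_b", "url_c"],
    "q2": ["url_x", "url_y", "url_z"],
}

def stub_retrieve_sources(query, k):
    """Return the top-k source URLs for a query from the fixed rankings."""
    return _FAKE_RANKINGS.get(query, [])[:k]

labeled = [("q1", "url_b"), ("q2", "url_w")]
print(hit_rate_at_k(labeled, stub_retrieve_sources, k=3))
```

To use it with the retriever above, you could pass `lambda q, k: [d.metadata["source"] for d in retrieve_documents(q, vector_db, k=k)]` as `retrieve_sources`, then compare the score across different values of `k`, chunk sizes, or embedding models.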

## 3.3 Response Generation

In [None]:
from src.prompts import generate_answer, load_prompts

query = "What company is located in 29010 Commerce Center Dr., Valencia, 91355, California, US?"
retrieved_docs = retrieve_documents(query, vector_db)
prompts = load_prompts()
prompt_template = prompts["rag_default"]
response = generate_answer(query, retrieved_texts=retrieved_docs, prompt_template=prompt_template, model="gpt-4o")

print("Query:", query)
print("Retrieved Documents:", [doc.page_content for doc in retrieved_docs])
print("Generated Answer:", response)
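
The `generate_answer` and `load_prompts` helpers are defined in `src/prompts` and are not shown here. To illustrate the general idea of combining a query with retrieved passages, here is a minimal sketch with a made-up template; `HYPOTHETICAL_RAG_TEMPLATE` and `build_rag_prompt` are assumptions for this example and may differ from the actual `rag_default` prompt.

```python
# Hypothetical template: the real "rag_default" prompt lives in src/prompts and is not shown here
HYPOTHETICAL_RAG_TEMPLATE = (
    "Answer the question using only the context below. "
    "If the context does not contain the answer, say so.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)

def build_rag_prompt(query, retrieved_texts, template=HYPOTHETICAL_RAG_TEMPLATE):
    """Number each retrieved passage and substitute it into the template."""
    context = "\n\n".join(
        f"[{i}] {text}" for i, text in enumerate(retrieved_texts, start=1)
    )
    return template.format(context=context, question=query)

example_prompt = build_rag_prompt(
    "Where is the bakery located?",
    ["The bakery is on Main Street.", "It opens at 7 am."],
)
print(example_prompt)
```

With LangChain documents, the passages would typically be `[doc.page_content for doc in retrieved_docs]`. Numbering the passages makes it easier to ask the model to cite which passage supports its answer.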

# 4. Evaluation

In [None]:
# 4. Evaluation
from src.evaluate import evaluate_rag_system, save_evaluation_results

# Set evaluation parameters
benchmark_file = "benchmark.json"  # Path to your benchmark file
k = 3  # Number of documents to retrieve for each query
prompt_template_name = "rag_default"  # Prompt template to use
model = "gpt-4o-mini"  # Model to use for generation

# Run evaluation
print("Starting evaluation...")
evaluation_results = evaluate_rag_system(
    benchmark_file=benchmark_file,
    k=k,
    prompt_template_name=prompt_template_name,
    model=model
)

# Print summary metrics
print("\nEvaluation Results:")
for metric, value in evaluation_results["metrics"].items():
    print(
        f"{metric}: {value:.2f}%"
        if "percentage" in metric
        else f"{metric}: {value}"
    )

# Save results
save_evaluation_results(evaluation_results, "evaluation_results.json")

# Display some example results
print("\nExample Results:")
for i, result in enumerate(evaluation_results["results"][:3]):  # Show first 3 examples
    print(f"\nExample {i+1}:")
    print(f"Question: {result['question']}")
    print(f"Reference: {result['reference_answer']}")
    print(f"Prediction: {result['predicted_answer']}")
    print(f"Exact Match: {'Yes' if result['exact_match'] == 1.0 else 'No'}")
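
How `src.evaluate` computes `exact_match` is not shown in this notebook. For reference, a common way to score exact match is to normalize both strings before comparing them, as in the sketch below; `normalize_answer` and `exact_match` here are illustrative assumptions, not the repository's implementation.

```python
import re
import string

def normalize_answer(text):
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def exact_match(prediction, reference):
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return 1.0 if normalize_answer(prediction) == normalize_answer(reference) else 0.0

print(exact_match("ABC Corporation.", "abc  corporation"))
print(exact_match("ABC Corp", "ABC Corporation"))
```

Exact match is strict: a correct answer phrased as a full sentence scores zero. If your generated answers are verbose, consider pairing it with a softer metric such as token overlap.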