## **🛠️ Tools You May Consider**  
(*These are recommendations to help you get started. You are free to use alternative tools—just document your choices clearly!*)  
- **Database**: FAISS, ChromaDB, SQLite, Elasticsearch, Neo4j, etc.  
- **Embedding Models**: Hugging Face Sentence-Transformers, OpenAI Embeddings  
- **LLM for Generation**: OpenAI: gpt-4o-mini
- **Others**: LangChain, GraphRAG, etc.

## **📌 Final Delivery**  
Your final submission should include:  
✅ A well-documented **GitHub repository or notebook**  
✅ A clear **README** explaining your approach  
✅ Structured **retrieval and generation modules**  

### **🔥 Bonus Points For**  
✨ Innovative retrieval techniques  
✨ Well-organized, modular code  
✨ Creative visualizations or user interfaces  


# 1. Set up working environment

# 2. Knowledge Base Preparation

## 2.1 Load documents

Once you have been granted access to this folder, it will appear in your Google Drive under "Shared drives". You can then mount your drive as follows and access the data at "/content/drive/Shared drives/Datathon/Data/hackathon_data/". Enjoy the ride! :)

In [1]:
# Load the Drive and mount
# from google.colab import drive
# drive.mount('/content/drive/')

Load the JSON files.

In [6]:
import os
from src.preprocessing import filter_json_file

for filename in os.listdir("data/hackathon_data")[:5]:
    if filename.endswith(".json"):
        filepath = os.path.join("data/hackathon_data", filename)
        filter_json_file(filepath, "data/clean")

Filtered: cabotcorp.com.json (kept 68/70 pages)
Filtered: thedesignpeople.com.json (kept 69/70 pages)
Filtered: stenograph.com.json (kept 70/70 pages)
Filtered: cleaningguys.com.json (kept 70/70 pages)
Filtered: fmssolutions.com.json (kept 69/70 pages)


## 2.2 Pre-process documents.

Feel free to explore and pre-process the data. You may want to clean or segment the documents as you see fit.

In [7]:
def document_clean(docs):
  """
  You may want to clean the dataset, add the code here.
  """
  pass
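
As a starting point for the stub above, here is a minimal sketch of one possible cleaning step. It assumes each page is a plain string; `clean_text` is a hypothetical helper, not part of the provided `src.preprocessing` module. It collapses runs of whitespace and drops repeated lines such as navigation bars.

```python
import re


def clean_text(text: str) -> str:
    """Collapse whitespace and drop empty or repeated lines (hypothetical helper)."""
    seen = set()
    kept = []
    for line in text.splitlines():
        # Normalize every run of whitespace to a single space
        line = re.sub(r"\s+", " ", line).strip()
        # Skip blank lines and lines already seen (e.g. repeated menus)
        if not line or line in seen:
            continue
        seen.add(line)
        kept.append(line)
    return "\n".join(kept)


# Made-up sample page text, for illustration only
sample = "Home  |  About\n\nHome  |  About\nWe make   widgets.\n"
cleaned = clean_text(sample)
print(cleaned)
```

Exact-match deduplication is deliberately conservative; near-duplicate detection would need a fuzzier comparison.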

## 2.3 Document Indexing and Storage (Profiling)

Feel free to choose how to index and store the provided documents in a knowledge database, so that they can be retrieved in whatever way your system design calls for, such as keyword search, vector representation, or graph relations.
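
For the keyword-search option, a minimal sketch of an inverted index in plain Python is shown below. The function names and the toy documents (with `example.com` URLs) are assumptions for illustration; a real system would more likely rely on SQLite full-text search or Elasticsearch.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def keyword_search(index, query):
    """Return the ids of documents containing every query token (AND search)."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Toy documents keyed by a made-up source URL
toy_docs = {
    "https://example.com/a": "Industrial cleaning services in Texas",
    "https://example.com/b": "Stenography machines and software",
}
toy_index = build_inverted_index(toy_docs)
hits = keyword_search(toy_index, "cleaning texas")
print(hits)
```

Exact-token matching like this handles addresses and product names well, which is where pure vector search tends to be weaker, so the two approaches can complement each other.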

In [8]:
import langchain
from langchain_community.document_loaders import JSONLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
import json

def chunk_documents(documents, chunk_size=500, chunk_overlap=100):
    """
    Split documents into chunks for better retrieval.
    
    Args:
        documents: List of document dictionaries with content and metadata
        chunk_size: Maximum size of chunks
        chunk_overlap: Overlap between chunks
    
    Returns:
        List of LangChain Document objects
    """
    from langchain.schema import Document
    
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
    )
    
    chunked_docs = []
    for doc in documents:
        splits = text_splitter.split_text(doc["content"])
        for i, split in enumerate(splits):
            chunked_docs.append(
                Document(
                    page_content=split,
                    metadata={
                        **doc["metadata"],
                        "chunk_id": i
                    }
                )
            )
    
    return chunked_docs
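
To make `chunk_size` and `chunk_overlap` concrete, the sketch below uses a simplified fixed-width sliding window. This is not the algorithm `RecursiveCharacterTextSplitter` uses (that one prefers to split on separators such as paragraphs and sentences); `sliding_window_chunks` is a hypothetical helper that only illustrates how consecutive chunks share text.

```python
def sliding_window_chunks(text, chunk_size=10, chunk_overlap=4):
    """Split text into fixed-width chunks that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


window_chunks = sliding_window_chunks(
    "abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=4
)
for chunk in window_chunks:
    print(chunk)
```

Each chunk starts with the last four characters of the previous one, so a fact that straddles a chunk boundary still appears whole in at least one chunk.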

In [9]:
# go over the data/clean folder and chunk the documents
documents = []
for filename in os.listdir("data/clean"):
    if filename.endswith(".json"):
        filepath = os.path.join("data/clean", filename)
        with open(filepath, "r", encoding="utf-8") as f:
            data = json.load(f)
        # Each page becomes one document, keyed by its source URL
        for url, text in data["text_by_page_url"].items():
            documents.append({"content": text, "metadata": {"source": url}})

documents = chunk_documents(documents)

In [16]:
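# Keep only the first 5 chunks for a quick smoke test.
# Remove this line to index the full corpus.
documents = documents[:5]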

# 3. Retrieval Augmented Generation

## 3.1 Load Knowledge Database

In [17]:
from langchain_community.embeddings import HuggingFaceEmbeddings

# Replace OpenAI embeddings with a local model
def get_local_embeddings(model_name="sentence-transformers/all-MiniLM-L6-v2"):
    """
    Create a local embedding model using HuggingFace models.
    
    Args:
        model_name: Name of the HuggingFace embedding model
    
    Returns:
        HuggingFaceEmbeddings model
    """
    model_kwargs = {'device': 'cpu'}  # Use 'cuda' if you have a GPU
    encode_kwargs = {'normalize_embeddings': True}
    
    embeddings = HuggingFaceEmbeddings(
        model_name=model_name,
        model_kwargs=model_kwargs,
        encode_kwargs=encode_kwargs
    )
    
    return embeddings
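
The `normalize_embeddings` option above scales each vector to unit length. A small self-contained sketch of why that matters: for unit-length vectors, the dot product equals the cosine similarity. The vectors are toy values, not real model output.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy 2-dimensional "embeddings"
vec_a, vec_b = [3.0, 4.0], [4.0, 3.0]
dot_of_normalized = dot(l2_normalize(vec_a), l2_normalize(vec_b))
cosine_of_raw = cosine(vec_a, vec_b)
print(dot_of_normalized, cosine_of_raw)
```

With unit vectors, squared Euclidean distance is `2 - 2 * cosine`, so ranking by distance and ranking by cosine similarity give the same order.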

In [18]:
def create_vector_db(documents, persist_directory="./chroma_db"):
    """
    Create and persist a vector database from documents.

    Embeddings come from get_local_embeddings() (a local HuggingFace model).

    Args:
        documents: List of LangChain Document objects
        persist_directory: Directory to save the vector database
    
    Returns:
        Chroma vector store
    """
    # Initialize the embedding model
    embeddings = get_local_embeddings()
    
    # Create and persist the vector store
    vectordb = Chroma.from_documents(
        documents=documents,
        embedding=embeddings,
        persist_directory=persist_directory,
    )
    
    vectordb.persist()
    print(f"Vector database created with {len(documents)} chunks and saved to {persist_directory}")
    
    return vectordb

In [20]:
vector_db = create_vector_db(documents)

Vector database created with 5 chunks and saved to ./chroma_db


## 3.2 Relevant Document Retrieval

Feel free to check and improve your retrieval performance, as it significantly affects the generation results.

In [25]:
def retrieve_documents(query, vectordb, k=1):
    """
    Retrieve relevant documents from the vector database based on the query.
    
    Args:
        query: User query string
        vectordb: Vector database to search
        k: Number of documents to retrieve
    
    Returns:
        List of retrieved documents
    """
    retriever = vectordb.as_retriever(search_type="similarity", search_kwargs={"k": k})
    # invoke() replaces the deprecated get_relevant_documents() in recent LangChain releases
    docs = retriever.invoke(query)
    return docs
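
One simple way to check retrieval quality is hit rate at k: the fraction of test queries whose expected source appears among the top-k results. The sketch below is self-contained and uses made-up URLs; `hit_rate_at_k` is a hypothetical helper, and the labelled query set is something you would need to build yourself.

```python
def hit_rate_at_k(retrieved_sources, expected_sources, k):
    """Fraction of queries whose expected source is in the top-k retrieved sources.

    retrieved_sources: one ranked list of source URLs per query
    expected_sources:  the single correct source URL per query
    """
    if not expected_sources:
        return 0.0
    hits = 0
    for retrieved, expected in zip(retrieved_sources, expected_sources):
        if expected in retrieved[:k]:
            hits += 1
    return hits / len(expected_sources)


# Made-up ranked results for two queries
ranked = [
    ["https://example.com/1", "https://example.com/2", "https://example.com/3"],
    ["https://example.com/4", "https://example.com/5", "https://example.com/6"],
]
expected = ["https://example.com/2", "https://example.com/9"]

score_at_1 = hit_rate_at_k(ranked, expected, k=1)
score_at_3 = hit_rate_at_k(ranked, expected, k=3)
print(score_at_1, score_at_3)
```

To use it with the retriever above, each ranked list could be built as `[d.metadata["source"] for d in retrieve_documents(q, vector_db, k=k)]`, since every chunk carries its page URL in the `source` metadata field.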

## 3.3 Response Generation

In [23]:
from src.prompts import generate_answer, load_prompts

query = "What company is located in 29010 Commerce Center Dr., Valencia, 91355, California, US?"
retrieved_docs = retrieve_documents(query, vector_db)
prompts = load_prompts()
prompt_template = prompts["rag_default"]
response = generate_answer(query, retrieved_texts=retrieved_docs, prompt_template=prompt_template, model="gpt-4o")

print("Query:", query)
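# NOTE: the list below is a hard-coded placeholder, not the actual retrieval result.
# To inspect what was really retrieved, print [doc.page_content for doc in retrieved_docs].
print("Retrieved Documents:", ["ABC Corporation is located at 29010 Commerce Center Dr., Valencia, 91355, California, US."])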
print("Generated Answer:", response)


Query: What company is located in 29010 Commerce Center Dr., Valencia, 91355, California, US?
Retrieved Documents: ['ABC Corporation is located at 29010 Commerce Center Dr., Valencia, 91355, California, US.']
Generated Answer: I don't have enough information to answer this question.
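
An answer like the one above is easier to debug when the exact context sent to the model is visible. The sketch below shows one way retrieved chunks might be assembled into a prompt. `build_context` and `TEMPLATE` are hypothetical and are not the contents of `src.prompts`; the chunk text is made up.

```python
def build_context(chunks, max_chars=2000):
    """Join (source, text) pairs into a numbered context block, capped at max_chars."""
    parts = []
    used = 0
    for i, (source, text) in enumerate(chunks, start=1):
        block = f"[{i}] {source}\n{text}"
        # Stop before exceeding the character budget
        if used + len(block) > max_chars:
            break
        parts.append(block)
        used += len(block)
    return "\n\n".join(parts)


# Hypothetical prompt template, for illustration only
TEMPLATE = (
    "Answer using only the context below. "
    "If the context is insufficient, say so.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\nAnswer:"
)

# Made-up retrieved chunk
toy_retrieved = [("https://example.com/contact", "Example Co. is at 1 Example St.")]
prompt = TEMPLATE.format(
    context=build_context(toy_retrieved),
    question="Where is Example Co.?",
)
print(prompt)
```

Printing the assembled prompt before calling the model shows quickly whether a poor answer comes from retrieval (the fact is missing from the context) or from generation.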


# 4. Evaluation