## **🛠️ Tools You May Consider**  
(*These are recommendations to help you get started. You are free to use alternative tools—just document your choices clearly!*)  
- **Database**: FAISS, ChromaDB, SQLite, Elasticsearch, Neo4j and etc.  
- **Embedding Models**: Hugging Face Sentence-Transformers, OpenAI Embeddings  
- **LLM for Generation**: OpenAI: gpt-4o-mini
- **Others**: Langchain, GraphRAG, and etc.

## **📌 Final Delivery**  
Your final submission should include:  
✅ A well-documented **GitHub repository or notebook**  
✅ A clear **README** explaining your approach  
✅ A structured **retrieval and generation modules**  

### **🔥 Bonus Points For**  
✨ Innovative retrieval techniques  
✨ Well-organized, modular code  
✨ Creative visualizations or user interfaces  


# 1. Set up working environment

In [1]:
# !pip install openai

# # Database options
# !pip install chromadb # if you use chromadb as your vector database

# # Others
# !pip install langchain-community # if you use langchain for orchastration
# !pip install transformers #if you use huggingface for vector embedding

In [2]:
# enable GPU if needed, GPU can speed up your vector embedding if you computing these vectors locally (not using API)

# import torch

# device = "cuda" if torch.cuda.is_available() else "cpu"
# print(f"Using device: {device}")

In [1]:
import os
# import chromadb
# from langchain.embeddings import HuggingFaceEmbeddings
# from langchain.vectorstores import Chroma
# from langchain.llms import OpenAI
# from langchain.chains import RetrievalQA

# # Set OpenAI API Key
# os.environ["OPENAI_API_KEY"] = ""


# 2. Knowledge Base Preparation

## 2.1 Load documents

Once you are added access to this folder, it will appear at your google drive "Shared drives". Then you can mount your drive and as following, and access your data from "/content/drive/Shared drives/Datathon/Data/hackathon_data/". Enjoy the ride! :)

In [None]:
# Load the Drive and mount
# from google.colab import drive
# drive.mount('/content/drive/')

Load json file.

In [None]:
# folder_path = "/content/drive/Shared drives/Datathon/Data/hackathon_data/"# Google drive path of the dataset
folder_path = "../data/hackathon_data/"
files_in_folder = os.listdir(folder_path)

len(files_in_folder)

In [None]:
def load_documents(json_file):
    """Loads the JSON file."""
    with open(json_file, 'r') as f:
      try:
          data = json.load(f)
          return data
      except json.JSONDecodeError:
          print(f"Error reading {json_file}, it may not be a valid JSON file.")
    return []

In [None]:
for filename in files_in_folder:
    if filename.endswith('.json'):
        file_path = os.path.join(folder_path, filename)
        doc = load_documents(file_path)
        break
print(doc.keys())
doc

## 2.2 Pre-process documents.

Feel free to explore and pre-process the data. You may want to clean or segment the documents as you see fit.

In [None]:
def page_segment(docs):
    """You may prefer to load each page separately."""
    i = 0
    page_segment = []
    for s in list(docs['text_by_page_url'].values()):
      page_segment.append({"docID": docs['doc_id'], "pageID": 'page_' + str(i), "text": s})
      i += 1
    return page_segment

In [None]:
def segment_documents(docs, chunk_size=500):
    """Segments documents into chunks of a given token size. Replace this function with your segmentation approach or maybe use the original document without segmentation."""
    segmented = []
    for doc_id, content in docs.items():
        for i in range(0, len(content), chunk_size):
            segment = content[i : i + chunk_size]
            segmented.append({"id": doc_id, "text": segment})
    return segmented



In [None]:
def document_clean(docs):
  """
  You may want to clean the dataset, add the code here.
  """
  pass

## 2.3 Document Indexing and Storage (Profiling)

Feel free to choose different ways to indexing and storing the provided documents in a knowledge database.

So that they can be retrieved in different ways according to your system design choices, such as search by keywords, vector representation, graph relation, and etc.

# 3. Retrieval Augmented Generation

## 3.1 Load Knowledge Database

## 3.2 Relevant Document Retrieval

Feel free to check and improve your retrieval performance as it affect the generation results significantly.

In [None]:
def retrieve_documents(query, db_path, embedding_model):
  """
  retrieve relevant documents from the knowledge database to the query.
  """
  pass

## 3.3 Response Generation

In [2]:
from src.prompts import generate_answer, load_prompts

query = "What company is located in 29010 Commerce Center Dr., Valencia, 91355, California, US?"
retrieved_docs = ["ABC Corporation is located at 29010 Commerce Center Dr., Valencia, 91355, California, US."]
prompts = load_prompts()
prompt_template = prompts["rag_default"]
response = generate_answer(query, retrieved_texts="Lalo", prompt_template=prompt_template, model="gpt-4o")

print("Query:", query)
print("Retrieved Documents:", ["ABC Corporation is located at 29010 Commerce Center Dr., Valencia, 91355, California, US."])
print("Generated Answer:", response)

Query: What company is located in 29010 Commerce Center Dr., Valencia, 91355, California, US?
Retrieved Documents: ['ABC Corporation is located at 29010 Commerce Center Dr., Valencia, 91355, California, US.']
Generated Answer: I don't have enough information to answer this question.


# 4. Evaluation