# Retrieval-Augmented Generation using Gemma LLMs

Description:
This project implements Retrieval-Augmented Generation (RAG) entirely locally, covering both the pipeline and its database.


## Key terms

| Term                                | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
| ----------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Token**                           | A sub-word piece of text. For example, "hello, world!" could be split into ["hello", ",", "world", "!"]. A token can be a whole word,<br> part of a word or group of punctuation characters. 1 token ~= 4 characters in English, 100 tokens ~= 75 words.<br> Text gets broken into tokens before being passed to an LLM.                                                                                                                                                                                                                                                                                  |
| **Embedding**                       | A learned numerical representation of a piece of data. For example, a sentence of text could be represented by a vector with<br> 768 values. Similar pieces of text (in meaning) will ideally have similar values.                                                                                                                                                                                                                                                                                                                                                                                        |
| **Embedding model**                 | A model designed to accept input data and output a numerical representation. For example, a text embedding model may take in 384 <br>tokens of text and turn it into a vector of size 768. An embedding model can be, and often is, different from the LLM. |
| **Similarity search/vector search** | Similarity search/vector search aims to find two vectors which are close together in high-dimensional space. For example, <br>two pieces of similar text passed through an embedding model should have a high similarity score, whereas two pieces of text about<br> different topics will have a lower similarity score. Common similarity score measures are dot product and cosine similarity. |
| **Large Language Model (LLM)**      | A model which has been trained to numerically represent the patterns in text. A generative LLM will continue a sequence when given a sequence. <br>For example, given a sequence of the text "hello, world!", a generative LLM may produce "we're going to build a RAG pipeline today!".<br> This generation will be highly dependent on the training data and prompt. |
| **LLM context window**              | The number of tokens an LLM can accept as input. For example, as of March 2024, GPT-4 has a default context window of 32k tokens<br> (about 96 pages of text) but can go up to 128k if needed. A recent open-source LLM from Google, Gemma (March 2024) has a context<br> window of 8,192 tokens (about 24 pages of text). A larger context window means an LLM can accept more relevant information<br> to assist with a query. For example, in a RAG pipeline, if a model has a larger context window, it can accept more reference items<br> from the retrieval system to aid with its generation. |
| **Prompt**                          | A common term for describing the input to a generative LLM. The idea of "[prompt engineering](https://en.wikipedia.org/wiki/Prompt_engineering)" is to structure a text-based<br> (or potentially image-based as well) input to a generative LLM in a specific way so that the generated output is ideal. This technique is<br> possible because of an LLM's capacity for in-context learning, as in, it is able to use its representation of language to break down <br>the prompt and recognize what a suitable output may be (note: the output of LLMs is probabilistic, so terms like "may output" are used). |
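
To make the two similarity measures named in the table concrete, here is a minimal pure-Python sketch. The function names `dot_product` and `cosine_similarity` and the toy three-dimensional vectors are illustrative assumptions; the embeddings used later in this notebook have 768 dimensions and are scored with PyTorch instead.

```python
import math


def dot_product(a, b):
    """Sum of element-wise products; grows with vector magnitude."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product divided by both vector lengths, so only direction matters."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    return dot_product(a, b) / (norm_a * norm_b)


# Toy vectors (hypothetical values, not real embeddings)
query = [1.0, 2.0, 3.0]
same_direction = [2.0, 4.0, 6.0]  # the query scaled by 2
different = [3.0, -1.0, 0.5]

print(dot_product(query, same_direction))        # 28.0
print(cosine_similarity(query, same_direction))  # approximately 1.0
print(cosine_similarity(query, different))       # much lower
```

Because the dot product is not normalized, longer vectors score higher even when they point in the same direction, which is why raw dot-product scores can be far larger than 1 while cosine similarity stays between -1 and 1.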


## Requirements and Setup


In [1]:
import subprocess
import sys


def install_from_requirements(requirements_file="requirements.txt"):
    try:
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", "-r", requirements_file]
        )
        print(f"Successfully installed all packages from {requirements_file}")
    except subprocess.CalledProcessError as e:
        print(f"Failed to install packages from {requirements_file}: {e}")

## 1. PDF Document Reading


In [2]:
import os
import requests

# Target folder name
target_folder = "information_file"

# PDF file name
pdf_filename = "human-nutrition-text.pdf"

# Full path to the PDF inside the target folder
pdf_path = os.path.join(target_folder, pdf_filename)

# URL of the PDF to download
url = "https://pressbooks.oer.hawaii.edu/humannutrition2/open/download?type=pdf"

# Create the folder if it does not exist yet
if not os.path.exists(target_folder):
    os.makedirs(target_folder)
    print(f"Folder '{target_folder}' has been created.")

# Download the PDF if it does not exist yet
if not os.path.exists(pdf_path):
    print(f"File '{pdf_filename}' not found in '{target_folder}'. Downloading...")

    try:
        # Send a GET request to the URL
        response = requests.get(url)
        response.raise_for_status()  # Raises HTTPError for bad responses (status code 4xx or 5xx)

        # Open the file in binary write mode and save the content
        with open(pdf_path, "wb") as file:
            file.write(response.content)
        print(f"File '{pdf_filename}' has been downloaded and saved to '{pdf_path}'.")

    except requests.exceptions.RequestException as e:
        print(f"Failed to download the file. Error: {e}")

else:
    print(f"File '{pdf_filename}' already exists at '{pdf_path}'.")

# Check that the file exists after the download attempt
if os.path.exists(pdf_path):
    print(f"Check: File '{pdf_filename}' found in folder '{target_folder}'.")
else:
    print(f"Check: File '{pdf_filename}' NOT found in folder '{target_folder}'.")

File 'human-nutrition-text.pdf' already exists at 'information_file\human-nutrition-text.pdf'.
Check: File 'human-nutrition-text.pdf' found in folder 'information_file'.


In [3]:
# Requires !pip install PyMuPDF, see: https://github.com/pymupdf/pymupdf
import pymupdf # (pymupdf, found this is better than pypdf for our use case, note: licence is AGPL-3.0, keep that in mind if you want to use any code commercially)
from tqdm.auto import tqdm # for progress bars, requires !pip install tqdm 

def text_formatter(text: str) -> str:
    """Performs minor formatting on text."""
    cleaned_text = text.replace("\n", " ").strip() # note: this might be different for each doc (best to experiment)

    # Other potential text formatting functions can go here
    return cleaned_text

# Open PDF and get lines/pages
# Note: this only focuses on text, rather than images/figures etc
def open_and_read_pdf(pdf_path: str) -> list[dict]:
    """
    Opens a PDF file, reads its text content page by page, and collects statistics.

    Parameters:
        pdf_path (str): The file path to the PDF document to be opened and read.

    Returns:
        list[dict]: A list of dictionaries, each containing the page number
        (adjusted), character count, word count, sentence count, token count, and the extracted text
        for each page.
    """
    doc = pymupdf.open(pdf_path)  # open the document
    pages_and_texts = []
    for page_number, page in tqdm(enumerate(doc)):  # iterate the document pages
        text = page.get_text()  # get plain text encoded as UTF-8
        text = text_formatter(text)
        pages_and_texts.append({"page_number": page_number - 41,  # adjust page numbers since our PDF starts on page 42
                                "page_char_count": len(text),
                                "page_word_count": len(text.split(" ")),
                                "page_sentence_count_raw": len(text.split(". ")),
                                "page_token_count": len(text) / 4,  # 1 token = ~4 chars, see: https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them
                                "text": text})
    return pages_and_texts

pages_and_texts = open_and_read_pdf(pdf_path=pdf_path)
pages_and_texts[:2]

1208it [00:01, 920.20it/s]


[{'page_number': -41,
  'page_char_count': 29,
  'page_word_count': 4,
  'page_sentence_count_raw': 1,
  'page_token_count': 7.25,
  'text': 'Human Nutrition: 2020 Edition'},
 {'page_number': -40,
  'page_char_count': 0,
  'page_word_count': 1,
  'page_sentence_count_raw': 1,
  'page_token_count': 0.0,
  'text': ''}]

In [4]:
# Take a random sample of pages from the PDF
import random
import pandas as pd

random.sample(pages_and_texts, k=3)
df = pd.DataFrame(pages_and_texts)
df.head()

# Summary statistics for the pages we collected
df.describe().round(2)

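|       | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count |
| ----- | ----------- | --------------- | --------------- | ----------------------- | ---------------- |
| count | 1208.0      | 1208.0          | 1208.0          | 1208.0                  | 1208.0           |
| mean  | 562.5       | 1148.0          | 198.3           | 9.97                    | 287.0            |
| std   | 348.86      | 560.38          | 95.76           | 6.19                    | 140.1            |
| min   | -41.0       | 0.0             | 1.0             | 1.0                     | 0.0              |
| 25%   | 260.75      | 762.0           | 134.0           | 4.0                     | 190.5            |
| 50%   | 562.5       | 1231.5          | 214.5           | 10.0                    | 307.88           |
| 75%   | 864.25      | 1603.5          | 271.0           | 14.0                    | 400.88           |
| max   | 1166.0      | 2308.0          | 429.0           | 32.0                    | 577.0            |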
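
The `page_token_count` column relies on the rule of thumb of roughly 4 characters per token. As a rough sketch of what that implies for the 8,192-token Gemma context window mentioned in the key terms, the snippet below estimates how many average-length pages would fit. The helper name `estimate_tokens` and the constant names are illustrative assumptions, and a real tokenizer will produce different counts.

```python
CHARS_PER_TOKEN = 4            # rule of thumb used throughout this notebook
CONTEXT_WINDOW_TOKENS = 8192   # Gemma context window cited in the key terms
MEAN_PAGE_TOKENS = 287.0       # mean page_token_count from the statistics above


def estimate_tokens(text: str) -> float:
    """Approximate the token count of a string from its character count."""
    return len(text) / CHARS_PER_TOKEN


# The title page from the sample output above has 29 characters
print(estimate_tokens("Human Nutrition: 2020 Edition"))  # 7.25

# How many average pages fit in the context window (ignoring the prompt itself)
pages_that_fit = int(CONTEXT_WINDOW_TOKENS // MEAN_PAGE_TOKENS)
print(pages_that_fit)  # 28
```

This is only an upper bound: the query, the instructions in the prompt, and the generated answer all share the same window.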


## 2. Text Splitting/Chunking


In [5]:
from spacy.lang.en import English # see https://spacy.io/usage for install instructions

nlp = English()

# Add a sentencizer pipeline, see https://spacy.io/api/sentencizer/ 
nlp.add_pipe("sentencizer")

# Create a document instance as an example
doc = nlp("This is a sentence. This another sentence.")
assert len(list(doc.sents)) == 2

# Access the sentences of the document
list(doc.sents)

[This is a sentence., This another sentence.]

In [6]:
for item in tqdm(pages_and_texts):
    item["sentences"] = list(nlp(item["text"]).sents)
    
    # Make sure all sentences are strings
    item["sentences"] = [str(sentence) for sentence in item["sentences"]]
    
    # Count the sentences 
    item["page_sentence_count_spacy"] = len(item["sentences"])

100%|██████████| 1208/1208 [00:02<00:00, 524.08it/s]


In [9]:
# Inspect an example
random.sample(pages_and_texts, k=1)

[{'page_number': 985,
  'page_char_count': 67,
  'page_word_count': 15,
  'page_sentence_count_raw': 3,
  'page_token_count': 16.75,
  'text': 'PART\xa0XVII  CHAPTER 17. FOOD SAFETY  Chapter 17. Food Safety  |  985',
  'sentences': ['PART\xa0XVII  CHAPTER 17.',
   'FOOD SAFETY  Chapter 17.',
   'Food Safety  |  985'],
  'page_sentence_count_spacy': 3}]

In [12]:
# Define split size to turn groups of sentences into chunks
num_sentence_chunk_size = 10 

# Create a function that splits a list into consecutive slices of a given size
def split_list(input_list: list, 
               slice_size: int) -> list[list[str]]:
    """
    Splits the input_list into sublists of size slice_size (or as close as possible).

    For example, a list of 17 sentences would be split into two lists of [[10], [7]]
    """
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]

# Loop through pages and texts and split sentences into chunks
for item in tqdm(pages_and_texts):
    item["sentence_chunks"] = split_list(input_list=item["sentences"],
                                         slice_size=num_sentence_chunk_size)
    item["num_chunks"] = len(item["sentence_chunks"])

100%|██████████| 1208/1208 [00:00<00:00, 403626.16it/s]
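
As a quick check of the docstring's claim, the sketch below runs the same slicing logic on 17 placeholder sentences. The function body is repeated so the snippet is self-contained; the `sentences` list is made-up illustrative data, not text from the PDF.

```python
def split_list(input_list: list, slice_size: int) -> list[list]:
    """Split input_list into consecutive sublists of at most slice_size items."""
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]


# 17 placeholder sentences (hypothetical data)
sentences = [f"Sentence {n}." for n in range(1, 18)]

chunks = split_list(sentences, slice_size=10)
print([len(chunk) for chunk in chunks])  # [10, 7]
```

The last chunk simply holds whatever is left over, so short trailing chunks are expected and are filtered later by token count.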


In [11]:
# Sample an example from the group (note: many samples have only 1 chunk as they have <=10 sentences total)
random.sample(pages_and_texts, k=1)

[{'page_number': 74,
  'page_char_count': 2158,
  'page_word_count': 378,
  'page_sentence_count_raw': 17,
  'page_token_count': 539.5,
  'text': 'From the Stomach to the Small Intestine  When food enters the stomach, a highly muscular organ, powerful  peristaltic contractions help mash, pulverize, and churn food into  chyme. Chyme is a semiliquid mass of partially digested food that  also contains gastric juices secreted by cells in the stomach. These  gastric juices contain hydrochloric acid and the enzyme pepsin, that  chemically start breakdown of the protein components of food.  The length of time food spends in the stomach varies by the  macronutrient composition of the meal. A high-fat or high-protein  meal takes longer to break down than one rich in carbohydrates.  It usually takes a few hours after a meal to empty the stomach  contents completely into the small intestine.  The small intestine is divided into three structural parts: the  duodenum, the jejunum, and the ileum. On

In [13]:
import re

# Split each chunk into its own item
pages_and_chunks = []
for item in tqdm(pages_and_texts):
    for sentence_chunk in item["sentence_chunks"]:
        chunk_dict = {}
        chunk_dict["page_number"] = item["page_number"]
        
        # Join the sentences together into a paragraph-like structure, aka a chunk (so they are a single string)
        joined_sentence_chunk = "".join(sentence_chunk).replace("  ", " ").strip()
        joined_sentence_chunk = re.sub(r'\.([A-Z])', r'. \1', joined_sentence_chunk) # ".A" -> ". A" for any full-stop/capital letter combo 
        chunk_dict["sentence_chunk"] = joined_sentence_chunk

        # Get stats about the chunk
        chunk_dict["chunk_char_count"] = len(joined_sentence_chunk)
        chunk_dict["chunk_word_count"] = len([word for word in joined_sentence_chunk.split(" ")])
        chunk_dict["chunk_token_count"] = len(joined_sentence_chunk) / 4 # 1 token = ~4 characters
        
        pages_and_chunks.append(chunk_dict)

# How many chunks do we have?
len(pages_and_chunks)

100%|██████████| 1208/1208 [00:00<00:00, 27726.53it/s]


1843
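
The join-then-regex step above is easier to follow on a small example. The sketch below uses three made-up sentences (an assumption for illustration) to show why the `".A" -> ". A"` substitution is needed after joining with an empty string.

```python
import re

# Hypothetical sentences, stored without trailing spaces as in item["sentences"]
sentence_chunk = [
    "Chyme is a semiliquid mass.",
    "These gastric juices contain acid.",
    "It takes a few hours.",
]

# Joining with "" glues each full stop to the next capital letter
joined = "".join(sentence_chunk).replace("  ", " ").strip()
print(joined)

# Re-insert a space between a full stop and a following capital letter
fixed = re.sub(r"\.([A-Z])", r". \1", joined)
print(fixed)
```

Note that the pattern is a heuristic: it also rewrites abbreviations such as "U.S." to "U. S.", and it does not restore spaces after "?" or "!".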

In [14]:
# Get stats about our chunks
df = pd.DataFrame(pages_and_chunks)
df.describe().round(2)

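|       | page_number | chunk_char_count | chunk_word_count | chunk_token_count |
| ----- | ----------- | ---------------- | ---------------- | ----------------- |
| count | 1843.0      | 1843.0           | 1843.0           | 1843.0            |
| mean  | 583.38      | 734.44           | 112.33           | 183.61            |
| std   | 347.79      | 447.54           | 71.22            | 111.89            |
| min   | -41.0       | 12.0             | 3.0              | 3.0               |
| 25%   | 280.5       | 315.0            | 44.0             | 78.75             |
| 50%   | 586.0       | 746.0            | 114.0            | 186.5             |
| 75%   | 890.0       | 1118.5           | 173.0            | 279.62            |
| max   | 1166.0      | 1831.0           | 297.0            | 457.75            |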


In [71]:
# Show random chunks that are 30 tokens or fewer in length
min_token_length = 30
for row in df[df["chunk_token_count"] <= min_token_length].sample(5).iterrows():
    print(f'Chunk token count: {row[1]["chunk_token_count"]} | Text: {row[1]["sentence_chunk"]}')

Chunk token count: 11.0 | Text: 978 | Food Supplements and Food Replacements
Chunk token count: 5.25 | Text: Young Adulthood | 907
Chunk token count: 24.25 | Text: There are several lecithin supplements on the market Nonessential and Essential Fatty Acids | 315
Chunk token count: 20.5 | Text: PART XVI CHAPTER 16. PERFORMANCE NUTRITION Chapter 16. Performance Nutrition | 931
Chunk token count: 16.25 | Text: Table 14.2  Micronutrient Levels during Puberty 886 | Adolescence


In [68]:
pages_and_chunks_over_min_token_len = df[
    df["chunk_token_count"] > min_token_length
].to_dict(orient="records")
pages_and_chunks_over_min_token_len[:2]

[{'page_number': -39,
  'sentence_chunk': 'Human Nutrition: 2020 Edition UNIVERSITY OF HAWAI‘I AT MĀNOA FOOD SCIENCE AND HUMAN NUTRITION PROGRAM ALAN TITCHENAL, SKYLAR HARA, NOEMI ARCEO CAACBAY, WILLIAM MEINKE-LAU, YA-YUN YANG, MARIE KAINOA FIALKOWSKI REVILLA, JENNIFER DRAPER, GEMADY LANGFELDER, CHERYL GIBBY, CHYNA NICOLE CHUN, AND ALLISON CALABRESE',
  'chunk_char_count': 308,
  'chunk_word_count': 42,
  'chunk_token_count': 77.0},
 {'page_number': -38,
  'sentence_chunk': 'Human Nutrition: 2020 Edition by University of Hawai‘i at Mānoa Food Science and Human Nutrition Program is licensed under a Creative Commons Attribution 4.0 International License, except where otherwise noted.',
  'chunk_char_count': 210,
  'chunk_word_count': 30,
  'chunk_token_count': 52.5}]

## 3. Chunk Embedding


In [None]:
# Requires: !pip install transformers torch

from transformers import BertTokenizer, BertModel
import torch

device = torch.device("cpu")
# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
embedding_model = BertModel.from_pretrained('bert-base-uncased')
embedding_model.to(device)
embedding_model.eval()  # set model to evaluation mode

# List of sentences
sentences = [
    "The Sentences Transformers library provides an easy and open-source way to create embeddings.",
    "Sentences can be embedded one by one or as a list of strings.",
    "Embeddings are one of the most powerful concepts in machine learning!",
    "Learn to use embeddings well and you'll be well on your way to being an AI engineer."
]

# Tokenize and encode each sentence and get the embedding
embeddings_dict = {}
for sentence in sentences:
    # Tokenize and convert to tensor
    inputs = tokenizer(sentence, return_tensors='pt', truncation=True, padding=True)
    
    with torch.no_grad():  # no need to calculate gradients
        outputs = embedding_model(**inputs)

    # Use the [CLS] token representation as the sentence embedding
    cls_embedding = outputs.last_hidden_state[:, 0, :].squeeze().numpy()
    
    embeddings_dict[sentence] = cls_embedding

# Print embeddings
for sentence, embedding in embeddings_dict.items():
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")


In [70]:
single_sentence = "Yo! How cool are embeddings?"
# Use tokenizer() rather than encode()

inputs = tokenizer(single_sentence, return_tensors='pt')

print(f"Sentence: {single_sentence}")
print(f"Token IDs:\n{inputs['input_ids']}")
print(f"Token IDs shape: {inputs['input_ids'].shape}")

Sentence: Yo! How cool are embeddings?
Token IDs:
tensor([[  101, 10930,   999,  2129,  4658,  2024,  7861,  8270,  4667,  2015,
          1029,   102]])
Token IDs shape: torch.Size([1, 12])
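
The output above also lets us compare the 4-characters-per-token heuristic with an actual tokenizer. The sketch below copies the token IDs from that output; the variable names are illustrative assumptions. In the `bert-base-uncased` vocabulary, ID 101 is `[CLS]` and ID 102 is `[SEP]`, which the tokenizer adds around the text.

```python
single_sentence = "Yo! How cool are embeddings?"

# Token IDs copied from the tokenizer output above
token_ids = [101, 10930, 999, 2129, 4658, 2024, 7861, 8270, 4667, 2015, 1029, 102]

CLS_ID, SEP_ID = 101, 102  # special tokens in the bert-base-uncased vocabulary

# Heuristic estimate versus what the tokenizer actually produced
estimated = len(single_sentence) / 4
actual_with_special = len(token_ids)
actual_without_special = len([t for t in token_ids if t not in (CLS_ID, SEP_ID)])

print(estimated, actual_with_special, actual_without_special)  # 7.0 12 10
```

For short strings the heuristic can be noticeably off, here partly because "embeddings" is split into several word pieces, so it is best treated as a rough budget rather than an exact count.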


In [72]:
%%time
# Embedding process
for item in tqdm(pages_and_chunks_over_min_token_len):
    sentence = item["sentence_chunk"]
    
    # Tokenize and move the inputs to the target device (CPU here)
    inputs = tokenizer(sentence, return_tensors='pt', truncation=True, padding=True).to(device)
    
    with torch.no_grad():
        outputs = embedding_model(**inputs)

    # Take the embedding of the [CLS] token
    cls_embedding = outputs.last_hidden_state[:, 0, :].squeeze().cpu().numpy()
    
    # Store it on the chunk dictionary
    item["embedding"] = cls_embedding

100%|██████████| 1680/1680 [05:44<00:00,  4.87it/s]

CPU times: total: 13min 43s
Wall time: 5min 44s





In [None]:
# Example output
for item in pages_and_chunks_over_min_token_len:
    print("Sentence:", item["sentence_chunk"])
    print("Embedding:", item["embedding"][:5], "...")  # Print the first 5 dimensions
    print()

In [36]:
# Collect the text chunks from the data
text_chunks = [item["sentence_chunk"] for item in pages_and_chunks_over_min_token_len]

# Embed texts in batches
def get_bert_embeddings(texts, batch_size=32):
    embeddings = []
    for i in tqdm(range(0, len(texts), batch_size)):
        batch_texts = texts[i:i+batch_size]

        # Tokenize the batch
        inputs = tokenizer(batch_texts, padding=True, truncation=True, return_tensors="pt")
        inputs = {k: v.to(device) for k, v in inputs.items()}

        with torch.no_grad():
            outputs = embedding_model(**inputs)

        # Take the [CLS] token embeddings
        cls_embeddings = outputs.last_hidden_state[:, 0, :]  # shape: (batch_size, hidden_size)
        embeddings.append(cls_embeddings.cpu())  # move to CPU before collecting

    # Concatenate all batches
    return torch.cat(embeddings, dim=0)


In [37]:
%%time
text_chunk_embeddings = get_bert_embeddings(text_chunks, batch_size=32)

# Check the shape of the result
print("Embeddings shape:", text_chunk_embeddings.shape)  # [num_sentences, 768]

100%|██████████| 51/51 [07:30<00:00,  8.83s/it]

Embeddings shape: torch.Size([1617, 768])
CPU times: total: 24min 42s
Wall time: 7min 30s
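
The progress bar above reports 51 iterations, which follows directly from the batch arithmetic. The sketch below reproduces the index ranges that `range(0, len(texts), batch_size)` yields for 1,617 texts; the helper name `batch_ranges` is an illustrative assumption, not part of the notebook.

```python
import math


def batch_ranges(n_items: int, batch_size: int) -> list[tuple[int, int]]:
    """Return (start, end) index pairs covering n_items in steps of batch_size."""
    return [
        (start, min(start + batch_size, n_items))
        for start in range(0, n_items, batch_size)
    ]


ranges = batch_ranges(1617, 32)
print(len(ranges))               # 51
print(ranges[0], ranges[-1])     # (0, 32) (1600, 1617)
print(math.ceil(1617 / 32))      # 51
```

The final batch holds only 17 texts, which is fine because `padding=True` pads each batch to its own longest sequence.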





In [74]:
# Save embeddings to file
text_chunks_and_embeddings_df = pd.DataFrame(pages_and_chunks_over_min_token_len)
embeddings_df_save_path = "text_chunks_and_embeddings_df.csv"
text_chunks_and_embeddings_df.to_csv(embeddings_df_save_path, index=False)

In [39]:
# Import saved file and view
text_chunks_and_embedding_df_load = pd.read_csv(embeddings_df_save_path)
text_chunks_and_embedding_df_load.head()

|   | page_number | sentence_chunk | chunk_char_count | chunk_word_count | chunk_token_count | embedding |
| - | ----------- | -------------- | ---------------- | ---------------- | ----------------- | --------- |
| 0 | -39 | Human Nutrition: 2020 Edition UNIVERSITY OF HA... | 308 | 42 | 77.0 | [-3.05841178e-01 -1.73748225e-01 -3.68292153e-... |
| 1 | -38 | Human Nutrition: 2020 Edition by University of... | 210 | 30 | 52.5 | [-6.83715582e-01 -1.44871831e-01 -4.53810960e-... |
| 2 | -37 | Contents Preface University of Hawai‘i at Māno... | 766 | 114 | 191.5 | [-4.82553899e-01 3.34395736e-01 -4.54973906e-... |
| 3 | -36 | Lifestyles and Nutrition University of Hawai‘i... | 941 | 142 | 235.25 | [-5.82269311e-01 2.03452647e-01 -5.93261480e-... |
| 4 | -35 | The Cardiovascular System University of Hawai‘... | 998 | 152 | 249.5 | [-6.98299170e-01 2.10482717e-01 -2.91191518e-... |
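
Note that the `embedding` column comes back from the CSV as a string such as `[-3.05841178e-01 -1.73748225e-01 ...]`, not as an array. The sketch below is a standard-library analogue of parsing that string back into numbers; the helper name `parse_embedding` is an assumption, and the sample cell is shortened to the first two values of row 0 for illustration.

```python
def parse_embedding(cell: str) -> list[float]:
    """Strip the surrounding brackets and parse whitespace-separated floats."""
    return [float(value) for value in cell.strip("[]").split()]


# First two values of row 0 above; a real cell holds 768 values
saved_cell = "[-3.05841178e-01 -1.73748225e-01]"

vector = parse_embedding(saved_cell)
print(len(vector))
print(vector)
```

Splitting on any whitespace also copes with the line breaks that appear inside long array strings. A binary format would avoid the string round trip entirely, but CSV keeps the data easy to inspect.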


## 4. RAG System (use vector search)


In [75]:
import random

import torch
import numpy as np
import pandas as pd

device = "cuda" if torch.cuda.is_available() else "cpu"

# Import texts and embedding df
text_chunks_and_embedding_df = pd.read_csv("text_chunks_and_embeddings_df.csv")

# Convert embedding column back to np.array (it got converted to string when it got saved to CSV)
text_chunks_and_embedding_df["embedding"] = text_chunks_and_embedding_df[
    "embedding"
].apply(lambda x: np.fromstring(x.strip("[]"), sep=" "))

# Convert texts and embedding df to list of dicts
pages_and_chunks = text_chunks_and_embedding_df.to_dict(orient="records")

# Convert embeddings to torch tensor and send to device (note: NumPy arrays are float64, torch tensors are float32 by default)
embeddings = torch.tensor(
    np.array(text_chunks_and_embedding_df["embedding"].tolist()), dtype=torch.float32
).to(device)
embeddings.shape

torch.Size([1680, 768])

In [None]:
from torch.nn.functional import cosine_similarity

# 1. Define the query
query = "macronutrients functions"
print(f"Query: {query}")


# 2. Tokenize and embed the query
inputs = tokenizer(query, return_tensors="pt", truncation=True, padding=True).to(device)
with torch.no_grad():
    outputs = embedding_model(**inputs)
    query_embedding = outputs.last_hidden_state[:, 0, :].squeeze(0)  # (768,)


# 3. text_chunk_embeddings is assumed to exist already, shape: [N, 768]
# Make sure both tensors live on the same device
query_embedding = query_embedding.to(device)
text_chunk_embeddings = text_chunk_embeddings.to(device)


# Score with the dot product (cosine_similarity would also work)
dot_scores = torch.matmul(text_chunk_embeddings, query_embedding)

# 4. Get top-5 most similar results
top_k = 5
top_scores, top_indices = torch.topk(dot_scores, k=top_k)

print("\nTop results:")
for i, (score, idx) in enumerate(zip(top_scores, top_indices)):
    print(f"{i+1}. Score: {score.item():.4f} | Text: {text_chunks[idx]}")

Query: macronutrients functions

Top results:
1. Score: 184.0620 | Text: Table 2.1 The Eleven Organ Systems in the Human Body and Their Major Functions Organ System Organ Components Major Function Cardiovascular heart, blood/lymph vessels, blood, lymph Transport nutrients and waste products Digestive mouth, esophagus, stomach, intestines Digestion and absorption Endocrine all glands (thyroid, ovaries, pancreas) Produce and release hormones Lymphatic tonsils, adenoids, spleen and thymus A one-way system of vessels that transport lymph throughout the body Immune white blood cells, lymphatic tissue, marrow Defend against foreign invaders Integumentary skin, nails, hair, sweat glands Protective, body temperature regulation Muscular skeletal, smooth, and cardiac muscle Body movement Nervous brain, spinal cord, nerves Interprets and responds to stimuli Reproductive gonads, genitals Reproduction and sexual characteristics Respiratory lungs, nose, mouth, throat, trachea Gas exchange Skeletal b

In [80]:
# 1. Define the query
import textwrap
from time import perf_counter as timer
query = "macronutrients functions"
print(f"Query: {query}")

# 2. Tokenize and embed the query
with torch.no_grad():
    query_inputs = tokenizer(query, return_tensors="pt", truncation=True, padding=True).to(device)
    query_outputs = embedding_model(**query_inputs)
    query_embedding = query_outputs.last_hidden_state[:, 0, :].squeeze(0)  # shape: (768,)
    query_embedding = query_embedding.unsqueeze(0)  # shape: (1, 768)
    

# 3. Make sure text_chunk_embeddings is already available
# (if not, call get_bert_embeddings() defined earlier)
# text_chunk_embeddings = get_bert_embeddings(text_chunks)

# 4. Compute cosine similarity
start_time = timer()
similarity_scores = cosine_similarity(text_chunk_embeddings, query_embedding, dim=1)  # shape: (N,)
end_time = timer()
print(f"\nTime taken to compute cosine similarity: {end_time - start_time:.4f} seconds")

# 5. Take the top-k results
top_k = 5
top_scores, top_indices = torch.topk(similarity_scores, k=top_k)

# 6. Helper that prints long text wrapped neatly
def print_wrapped(text, wrap_length=80):
    print(textwrap.fill(text, wrap_length))

# 7. Show the results
print(f"\nTop {top_k} Results:")
for score, idx in zip(top_scores, top_indices):
    result = pages_and_chunks_over_min_token_len[idx]
    print(f"\nScore: {score:.4f}")
    print("Text:")
    print_wrapped(result["sentence_chunk"])
    print(f"Page number: {result.get('page_number', 'N/A')}")
    print("-" * 80)

Query: macronutrients functions

Time taken to compute cosine similarity: 0.0077 seconds

Top 5 Results:

Score: 0.8466
Text:
Enriched wheat flour refers to white flour with added vitamins.)Eat less of
products that list HFCS and other sugars such as sucrose, honey, dextrose, and
cane sugar in the first five ingredients. If you want to eat less processed
foods then, in general, stay away from products with 274 | Carbohydrates and
Personal Diet Choices
Page number: 274
--------------------------------------------------------------------------------

Score: 0.8463
Text:
cure a disease. Science is a stepwise process that builds on past evidence and
finally culminates into a well-accepted conclusion. Unfortunately, not all
scientific conclusions are developed in the interest of human health, and some
can be biased. Therefore, it is important to know where a scientific study was
conducted and who provided the funding, as this can have an impact on the
scientific conclusions being made. For 

In [77]:
larger_embeddings = torch.randn(100 * embeddings.shape[0], 768).to(device)
print(f"Embeddings shape: {larger_embeddings.shape}")

Embeddings shape: torch.Size([168000, 768])


In [78]:
import textwrap

# Helper that wraps long text so it prints neatly
def print_wrapped(text, wrap_length=80):
    wrapped_text = textwrap.fill(text, wrap_length)
    print(wrapped_text)

print(f"\nQuery: '{query}'\n")
print("Top Results:")

# Show the top-k results of the most recent search
for score, idx in zip(top_scores, top_indices):
    print(f"Score: {score:.4f}")
    
    # Look up the chunk dictionary
    chunk_data = pages_and_chunks_over_min_token_len[idx]
    
    print("Text:")
    print_wrapped(chunk_data["sentence_chunk"])
    
    # Show the page number if available
    if "page_number" in chunk_data:
        print(f"Page number: {chunk_data['page_number']}")
    
    print("\n" + "-"*80 + "\n")



Query: 'macronutrients functions'

Top Results:
Score: 0.8466
Text:
Enriched wheat flour refers to white flour with added vitamins.)Eat less of
products that list HFCS and other sugars such as sucrose, honey, dextrose, and
cane sugar in the first five ingredients. If you want to eat less processed
foods then, in general, stay away from products with 274 | Carbohydrates and
Personal Diet Choices
Page number: 274

--------------------------------------------------------------------------------

Score: 0.8463
Text:
cure a disease. Science is a stepwise process that builds on past evidence and
finally culminates into a well-accepted conclusion. Unfortunately, not all
scientific conclusions are developed in the interest of human health, and some
can be biased. Therefore, it is important to know where a scientific study was
conducted and who provided the funding, as this can have an impact on the
scientific conclusions being made. For example, an air quality study paid for by
a tobacco comp

In [82]:
from transformers import BertTokenizer, BertModel
import torch
from torch.nn.functional import cosine_similarity
from time import perf_counter as timer
import textwrap

# Load the BERT model and tokenizer only once
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert_model = BertModel.from_pretrained("bert-base-uncased")
bert_model.eval()
bert_model.to("cpu")  # or "cuda" if available

def embed_query_bert(query: str) -> torch.Tensor:
    """Turns a query into an embedding vector using the BERT [CLS] token."""
    with torch.no_grad():
        inputs = tokenizer(query, return_tensors="pt", truncation=True, padding=True)
        outputs = bert_model(**inputs)
        embedding = outputs.last_hidden_state[:, 0, :]  # Take the [CLS] token
        return embedding.squeeze(0)  # Shape: (768,)

def retrieve_relevant_resources_bert(
    query: str,
    embeddings: torch.Tensor,
    n_resources_to_return: int = 5,
    print_time: bool = True,
) -> tuple[torch.Tensor, torch.Tensor]:
    """
    Embeds a query with BERT and returns top-k cosine similarity scores and indices from a tensor of embeddings.
    """

    # Embed the query and move it to the same device as the stored embeddings
    query_embedding = embed_query_bert(query).unsqueeze(0).to(embeddings.device)  # (1, 768)

    # Compute cosine similarity
    start_time = timer()
    scores = cosine_similarity(embeddings, query_embedding, dim=1)
    end_time = timer()

    if print_time:
        print(f"[INFO] Time taken to get scores on {len(embeddings)} embeddings: {end_time - start_time:.5f} seconds.")

    top_scores, top_indices = torch.topk(scores, k=n_resources_to_return)
    return top_scores, top_indices

def print_wrapped(text: str, wrap_length: int = 80):
    """Wraps long text so it is easier to read in the console."""
    print(textwrap.fill(text, wrap_length))

def print_top_results_and_scores_bert(
    query: str,
    embeddings: torch.Tensor,
    pages_and_chunks: list[dict],
    n_resources_to_return: int = 5,
):
    """
    Takes a query, retrieves most relevant resources using BERT and prints them out in descending order.
    """

    scores, indices = retrieve_relevant_resources_bert(
        query=query,
        embeddings=embeddings,
        n_resources_to_return=n_resources_to_return
    )

    print(f"\nQuery: '{query}'\n")
    print("Results:")
    for score, idx in zip(scores, indices):
        chunk = pages_and_chunks[idx]
        print(f"Score: {score:.4f}")
        print("Text:")
        print_wrapped(chunk["sentence_chunk"])
        print(f"Page number: {chunk.get('page_number', 'N/A')}")
        print("-" * 80)


In [84]:
print_top_results_and_scores_bert(
    query="symptoms of pellagra",
    embeddings=text_chunk_embeddings,  # output of the earlier BERT batching step; rows must line up with pages_and_chunks
    pages_and_chunks=pages_and_chunks_over_min_token_len,
    n_resources_to_return=5
)


[INFO] Time taken to get scores on 1617 embeddings: 0.00547 seconds.

Query: 'symptoms of pellagra'

Results:
Score: 0.8262
Text:
Learning Activities Technology Note: The second edition of the Human Nutrition
Open Educational Resource (OER) textbook features interactive learning
activities.  These activities are available in the web-based textbook and not
available in the downloadable versions (EPUB, Digital PDF, Print_PDF, or Open
Document). Learning activities may be used across various mobile devices,
however, for the best user experience it is strongly recommended that users
complete these activities using a desktop or laptop computer and in Google
Chrome.   An interactive or media element has been excluded from this version of
the text. You can view it online here: http://pressbooks.oer.hawaii.edu/
humannutrition2/?p=144   An interactive or media element has been excluded from
this version of the text. You can 170 | Regulation of Water Balance
Page number: 170
--------------------

## 5. Prompting


## 6. Answer Generation


In [None]:
import subprocess
import sys


def export_requirements(output_file="requirements.txt"):
    try:
        # Run pip freeze with the current interpreter and redirect its output to the file
        with open(output_file, "w") as f:
            subprocess.check_call([sys.executable, "-m", "pip", "freeze"], stdout=f)
        print(f"Package list saved to {output_file}")
    except subprocess.CalledProcessError as e:
        print(f"Failed to export packages: {e}")


export_requirements()  # writes to requirements.txt