# Retrieval-Augmented Generation using Gemma LLMs

Description:
This project implements Retrieval-Augmented Generation (RAG) entirely locally, covering both the system and its database.


## Key terms

| Term                                | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
| ----------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Token**                           | A sub-word piece of text. For example, "hello, world!" could be split into ["hello", ",", "world", "!"]. A token can be a whole word,<br> part of a word or group of punctuation characters. 1 token ~= 4 characters in English, 100 tokens ~= 75 words.<br> Text gets broken into tokens before being passed to an LLM.                                                                                                                                                                                                                                                                                  |
| **Embedding**                       | A learned numerical representation of a piece of data. For example, a sentence of text could be represented by a vector with<br> 768 values. Similar pieces of text (in meaning) will ideally have similar values.                                                                                                                                                                                                                                                                                                                                                                                        |
| **Embedding model**                 | A model designed to accept input data and output a numerical representation. For example, a text embedding model may take in 384 <br>tokens of text and turn it into a vector of size 768. An embedding model can be, and often is, different from the LLM. |
| **Similarity search/vector search** | Similarity search/vector search aims to find vectors which are close together in high-dimensional space. For example, <br>two pieces of similar text passed through an embedding model should have a high similarity score, whereas two pieces of text about<br> different topics will have a lower similarity score. Common similarity score measures are dot product and cosine similarity. |
| **Large Language Model (LLM)**      | A model which has been trained to numerically represent the patterns in text. A generative LLM will continue a sequence when given a sequence. <br>For example, given a sequence of the text "hello, world!", a generative LLM may produce "we're going to build a RAG pipeline today!".<br> This generation will be highly dependent on the training data and prompt. |
| **LLM context window**              | The number of tokens an LLM can accept as input. For example, as of March 2024, GPT-4 has a default context window of 32k tokens<br> (about 96 pages of text) but can go up to 128k if needed. A recent open-source LLM from Google, Gemma (March 2024) has a context<br> window of 8,192 tokens (about 24 pages of text). A higher context window means an LLM can accept more relevant information<br> to assist with a query. For example, in a RAG pipeline, if a model has a larger context window, it can accept more reference items<br> from the retrieval system to aid with its generation. |
| **Prompt**                          | A common term for describing the input to a generative LLM. The idea of "[prompt engineering](https://en.wikipedia.org/wiki/Prompt_engineering)" is to structure a text-based<br> (or potentially image-based as well) input to a generative LLM in a specific way so that the generated output is ideal. This technique is<br> possible because of an LLM's capacity for in-context learning, as in, it is able to use its representation of language to break down <br>the prompt and recognize what a suitable output may be (note: the output of LLMs is probabilistic, so terms like "may output" are used). |
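
The two similarity measures named in the table can be sketched in a few lines. This is a minimal pure-Python illustration on made-up 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions). The helper names `dot_product` and `cosine_similarity` are hypothetical and not part of any library used in this notebook.

```python
import math


def dot_product(a: list[float], b: list[float]) -> float:
    """Sum of element-wise products; sensitive to vector magnitude."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product of the two vectors after scaling each to unit length."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: a zero vector matches nothing
    return dot_product(a, b) / (norm_a * norm_b)


# Toy vectors (made up for illustration)
v1 = [1.0, 2.0, 3.0]
v2 = [2.0, 4.0, 6.0]   # same direction as v1, twice the length
v3 = [-3.0, 0.0, 1.0]  # orthogonal to v1

print(dot_product(v1, v2))        # 28.0
print(cosine_similarity(v1, v2))  # close to 1.0
print(cosine_similarity(v1, v3))  # 0.0
```

The dot product grows with vector length, while cosine similarity depends only on direction, which is why the two can rank the same pair of texts differently unless the embeddings are normalized.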


## Requirements and Setup


In [None]:
import subprocess
import sys


def install_from_requirements(requirements_file="requirements.txt"):
    try:
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", "-r", requirements_file]
        )
        print(f"Successfully installed all packages from {requirements_file}")
    except subprocess.CalledProcessError as e:
        print(f"Failed to install packages from {requirements_file}: {e}")

## 1. PDF Document Reading


In [3]:
import os
import requests

# Target folder name
target_folder = "information_file"

# PDF file name
pdf_filename = "human-nutrition-text.pdf"

# Full path of the PDF file inside the target folder
pdf_path = os.path.join(target_folder, pdf_filename)

# URL of the PDF file to download
url = "https://pressbooks.oer.hawaii.edu/humannutrition2/open/download?type=pdf"

# Create the folder if it does not exist yet
if not os.path.exists(target_folder):
    os.makedirs(target_folder)
    print(f"Folder '{target_folder}' has been created.")

# Download the PDF if it does not exist yet
if not os.path.exists(pdf_path):
    print(f"File '{pdf_filename}' not found in '{target_folder}'. Downloading...")

    try:
        # Send a GET request to the URL
        response = requests.get(url)
        response.raise_for_status()  # Raises HTTPError for bad responses (4xx or 5xx status codes)

        # Open the file in binary write mode and save the content
        with open(pdf_path, "wb") as file:
            file.write(response.content)
        print(f"File '{pdf_filename}' has been downloaded and saved to '{pdf_path}'.")

    except requests.exceptions.RequestException as e:
        print(f"Failed to download the file. Error: {e}")

else:
    print(f"File '{pdf_filename}' already exists at '{pdf_path}'.")

# Check that the file exists after the download attempt
if os.path.exists(pdf_path):
    print(f"Check: File '{pdf_filename}' found in folder '{target_folder}'.")
else:
    print(f"Check: File '{pdf_filename}' NOT found in folder '{target_folder}'.")

File 'human-nutrition-text.pdf' already exists at 'information_file\human-nutrition-text.pdf'.
Check: File 'human-nutrition-text.pdf' found in folder 'information_file'.


In [10]:
# Requires !pip install PyMuPDF, see: https://github.com/pymupdf/pymupdf
import pymupdf # (pymupdf, found this is better than pypdf for our use case, note: licence is AGPL-3.0, keep that in mind if you want to use any code commercially)
from tqdm.auto import tqdm # for progress bars, requires !pip install tqdm 

def text_formatter(text: str) -> str:
    """Performs minor formatting on text."""
    cleaned_text = text.replace("\n", " ").strip() # note: this might be different for each doc (best to experiment)

    # Other potential text formatting functions can go here
    return cleaned_text

# Open PDF and get lines/pages
# Note: this only focuses on text, rather than images/figures etc
def open_and_read_pdf(pdf_path: str) -> list[dict]:
    """
    Opens a PDF file, reads its text content page by page, and collects statistics.

    Parameters:
        pdf_path (str): The file path to the PDF document to be opened and read.

    Returns:
        list[dict]: A list of dictionaries, each containing the page number
        (adjusted), character count, word count, sentence count, token count, and the extracted text
        for each page.
    """
    #doc = fitz.open(pdf_path)  # open a document
    doc = pymupdf.open(pdf_path)
    pages_and_texts = []
    for page_number, page in tqdm(enumerate(doc)):  # iterate the document pages
        text = page.get_text()  # get plain text encoded as UTF-8
        text = text_formatter(text)
        pages_and_texts.append({"page_number": page_number - 41,  # adjust page numbers since our PDF starts on page 42
                                "page_char_count": len(text),
                                "page_word_count": len(text.split(" ")),
                                "page_sentence_count_raw": len(text.split(". ")),
                                "page_token_count": len(text) / 4,  # 1 token = ~4 chars, see: https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them
                                "text": text})
    return pages_and_texts

pages_and_texts = open_and_read_pdf(pdf_path=pdf_path)
pages_and_texts[:2]

1208it [00:01, 953.05it/s] 


[{'page_number': -41,
  'page_char_count': 29,
  'page_word_count': 4,
  'page_sentence_count_raw': 1,
  'page_token_count': 7.25,
  'text': 'Human Nutrition: 2020 Edition'},
 {'page_number': -40,
  'page_char_count': 0,
  'page_word_count': 1,
  'page_sentence_count_raw': 1,
  'page_token_count': 0.0,
  'text': ''}]

In [15]:
#Taking sample of text from the information file
import random
import pandas as pd

random.sample(pages_and_texts, k=3)
df = pd.DataFrame(pages_and_texts)
df.head()

#stats of the book we collected
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count |
| --- | --- | --- | --- | --- | --- |
| count | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 |
| mean | 562.5 | 1148.0 | 198.3 | 9.97 | 287.0 |
| std | 348.86 | 560.38 | 95.76 | 6.19 | 140.1 |
| min | -41.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 25% | 260.75 | 762.0 | 134.0 | 4.0 | 190.5 |
| 50% | 562.5 | 1231.5 | 214.5 | 10.0 | 307.88 |
| 75% | 864.25 | 1603.5 | 271.0 | 14.0 | 400.88 |
| max | 1166.0 | 2308.0 | 429.0 | 32.0 | 577.0 |


## 2. Text Splitting/Chunking


In [17]:
from spacy.lang.en import English # see https://spacy.io/usage for install instructions

nlp = English()

# Add a sentencizer pipeline, see https://spacy.io/api/sentencizer/ 
nlp.add_pipe("sentencizer")

# Create a document instance as an example
doc = nlp("This is a sentence. This another sentence.")
assert len(list(doc.sents)) == 2

# Access the sentences of the document
list(doc.sents)

[This is a sentence., This another sentence.]

In [18]:
for item in tqdm(pages_and_texts):
    item["sentences"] = list(nlp(item["text"]).sents)
    
    # Make sure all sentences are strings
    item["sentences"] = [str(sentence) for sentence in item["sentences"]]
    
    # Count the sentences 
    item["page_sentence_count_spacy"] = len(item["sentences"])

100%|██████████| 1208/1208 [00:02<00:00, 429.41it/s]


In [20]:
# Inspect an example
random.sample(pages_and_texts, k=1)

[{'page_number': 277,
  'page_char_count': 1924,
  'page_word_count': 318,
  'page_sentence_count_raw': 16,
  'page_token_count': 481.0,
  'text': 'often.  • Calm your “sweet tooth” by eating fruits, such as berries or an  apple.  • Replace sugary soft drinks with seltzer water, tea, or a small  amount of 100 percent fruit juice added to water or soda water.  The Food Industry: Functional Attributes of  Carbohydrates and the Use of Sugar Substitutes  In the food industry, both fast-releasing and slow-releasing  carbohydrates are utilized to give foods a wide spectrum of  functional attributes, including increased sweetness, viscosity, bulk,  coating ability, solubility, consistency, texture, body, and browning  capacity. The differences in chemical structure between the  different carbohydrates confer their varied functional uses in foods.  Starches, gums, and pectins are used as thickening agents in making  jam, cakes, cookies, noodles, canned products, imitation cheeses,  and a varie

In [21]:
# Define split size to turn groups of sentences into chunks
num_sentence_chunk_size = 10 

# Create a function that splits a list into sublists of the desired size
def split_list(input_list: list, 
               slice_size: int) -> list[list[str]]:
    """
    Splits the input_list into sublists of size slice_size (or as close as possible).

    For example, a list of 17 sentences would be split into two lists of [[10], [7]]
    """
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]

# Loop through pages and texts and split sentences into chunks
for item in tqdm(pages_and_texts):
    item["sentence_chunks"] = split_list(input_list=item["sentences"],
                                         slice_size=num_sentence_chunk_size)
    item["num_chunks"] = len(item["sentence_chunks"])

100%|██████████| 1208/1208 [00:00<00:00, 19744.21it/s]


In [22]:
# Sample an example from the group (note: many samples have only 1 chunk as they have <=10 sentences total)
random.sample(pages_and_texts, k=1)

[{'page_number': 56,
  'page_char_count': 1877,
  'page_word_count': 334,
  'page_sentence_count_raw': 10,
  'page_token_count': 469.25,
  'text': '•  Explain the anatomy and physiology of the  digestive system and other supporting organ systems  •  Describe the relationship between diet and each of  the organ systems  •  Describe the process of calculating Body Mass  Index (BMI)  The Native Hawaiians believed there was a strong connection  between health and food. Around the world, other cultures had  similar views of food and its relationship with health. A famous  quote by the Greek physician Hippocrates over two thousand years  ago, “Let food be thy medicine and medicine be thy food” bear much  relevance on our food choices and their connection to our health.  Today, the scientific community echoes Hippocrates’ statement as  it recognizes some foods as functional foods. The Academy of  Nutrition and Dietetics defines functional foods as “whole foods  and fortified, enriched, or enh

In [31]:
import re

# Split each chunk into its own item
pages_and_chunks = []
for item in tqdm(pages_and_texts):
    for sentence_chunk in item["sentence_chunks"]:
        chunk_dict = {}
        chunk_dict["page_number"] = item["page_number"]
        
        # Join the sentences together into a paragraph-like structure, aka a chunk (so they are a single string)
        joined_sentence_chunk = "".join(sentence_chunk).replace("  ", " ").strip()
        joined_sentence_chunk = re.sub(r'\.([A-Z])', r'. \1', joined_sentence_chunk) # ".A" -> ". A" for any full-stop/capital letter combo 
        chunk_dict["sentence_chunk"] = joined_sentence_chunk

        # Get stats about the chunk
        chunk_dict["chunk_char_count"] = len(joined_sentence_chunk)
        chunk_dict["chunk_word_count"] = len(joined_sentence_chunk.split(" "))
        chunk_dict["chunk_token_count"] = len(joined_sentence_chunk) / 4 # 1 token = ~4 characters
        
        pages_and_chunks.append(chunk_dict)

# How many chunks do we have?
len(pages_and_chunks)

100%|██████████| 1208/1208 [00:00<00:00, 21135.64it/s]


1843

In [34]:
# Get stats about our chunks
df = pd.DataFrame(pages_and_chunks)
df.describe().round(2)

| | page_number | chunk_char_count | chunk_word_count | chunk_token_count |
| --- | --- | --- | --- | --- |
| count | 1843.0 | 1843.0 | 1843.0 | 1843.0 |
| mean | 583.38 | 734.44 | 112.33 | 183.61 |
| std | 347.79 | 447.54 | 71.22 | 111.89 |
| min | -41.0 | 12.0 | 3.0 | 3.0 |
| 25% | 280.5 | 315.0 | 44.0 | 78.75 |
| 50% | 586.0 | 746.0 | 114.0 | 186.5 |
| 75% | 890.0 | 1118.5 | 173.0 | 279.62 |
| max | 1166.0 | 1831.0 | 297.0 | 457.75 |


In [36]:
# Show random chunks with at most 40 tokens in length
min_token_length = 40
for row in df[df["chunk_token_count"] <= min_token_length].sample(5).iterrows():
    print(f'Chunk token count: {row[1]["chunk_token_count"]} | Text: {row[1]["sentence_chunk"]}')

Chunk token count: 17.5 | Text: Published August 2011. Accessed September 22, 2017. Introduction | 147
Chunk token count: 11.75 | Text: Accessed April 15, 2018. 1046 | Comparing Diets
Chunk token count: 4.5 | Text: 516 | Introduction
Chunk token count: 21.0 | Text: Updated September 2003. Accessed November 28,2017. Discovering Nutrition Facts | 735
Chunk token count: 31.5 | Text: view it online here: http://pressbooks.oer.hawaii.edu/ humannutrition2/?p=494 944 | The Essential Elements of Physical Fitness
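
The short chunks above look like page footers and citation fragments rather than useful content. One possible follow-up, sketched here under the assumption that such chunks should simply be dropped, is to keep only chunks above the token threshold. The `sample_texts` list is a hypothetical stand-in for `pages_and_chunks`, built from text already printed in this notebook.

```python
min_token_length = 40  # same threshold as in the cell above

# Stand-in texts: two short footer-like chunks and one longer passage
sample_texts = [
    "516 | Introduction",
    "Accessed April 15, 2018. 1046 | Comparing Diets",
    "In the food industry, both fast-releasing and slow-releasing "
    "carbohydrates are utilized to give foods a wide spectrum of "
    "functional attributes, including increased sweetness, viscosity, bulk, "
    "coating ability, solubility, consistency, texture, body, and browning capacity.",
]
sample_chunks = [
    {"sentence_chunk": text, "chunk_token_count": len(text) / 4}  # 1 token ~= 4 characters
    for text in sample_texts
]

# Keep only chunks longer than the threshold
filtered_chunks = [
    chunk for chunk in sample_chunks if chunk["chunk_token_count"] > min_token_length
]

for chunk in filtered_chunks:
    print(chunk["chunk_token_count"], "|", chunk["sentence_chunk"][:60])
```

The threshold is a judgment call: set too high it discards short but meaningful passages, so it is worth sampling the rejected chunks before committing to a value.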


## 3. Chunk Embedding


In [37]:
# Requires !pip install sentence-transformers
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2", 
                                      device="cpu") # choose the device to load the model to (note: GPU will often be *much* faster than CPU)

# Create a list of sentences to turn into numbers
sentences = [
    "The Sentences Transformers library provides an easy and open-source way to create embeddings.",
    "Sentences can be embedded one by one or as a list of strings.",
    "Embeddings are one of the most powerful concepts in machine learning!",
    "Learn to use embeddings well and you'll be well on your way to being an AI engineer."
]

# Sentences are encoded/embedded by calling model.encode()
embeddings = embedding_model.encode(sentences)
embeddings_dict = dict(zip(sentences, embeddings))

# See the embeddings
for sentence, embedding in embeddings_dict.items():
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")

ModuleNotFoundError: No module named 'sentence_transformers'

In [15]:
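import os

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
from huggingface_hub import login

# Read the Hugging Face access token from the environment instead of hardcoding it
# (assumes the token has been exported as HF_TOKEN)
login(token=os.environ["HF_TOKEN"])

# Load the Gemma 3 4B instruction-tuned model and tokenizer
model_id = "google/gemma-3-4b-it"  # "it" = instruction-tuned variant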
tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(
    model_id, 
    device_map="auto",
    torch_dtype=torch.float16,  # Half precision saves memory; float16 can overflow to inf/NaN on some models, in which case torch.bfloat16 is worth trying
    low_cpu_mem_usage=True      # Optimize memory usage during loading
).eval()


Loading checkpoint shards: 100%|██████████| 2/2 [00:16<00:00,  8.23s/it]


In [16]:
import numpy as np  # needed for the zero-vector fallback below


def get_gemma_embeddings(sentences, model, tokenizer):
    """Generate embeddings from Gemma model with proper handling to avoid NaN values"""
    embeddings = []
    
    for sentence in sentences:
        # Tokenize with padding and truncation for safety
        inputs = tokenizer(
            sentence, 
            return_tensors="pt", 
            padding=True, 
            truncation=True, 
            max_length=512  # Limit length to avoid memory issues
        ).to(model.device)
        
        try:
            # Get hidden states with gradient tracking disabled
            with torch.no_grad():
                # Request hidden states explicitly
                outputs = model(
                    **inputs, 
                    output_hidden_states=True,
                    return_dict=True
                )
                
                # Access hidden states properly (making sure it's not None)
                if outputs.hidden_states is None:
                    print(f"Warning: No hidden states produced for: {sentence[:50]}...")
                    # Use a zero vector as fallback
                    hidden_state = torch.zeros(1, inputs['input_ids'].shape[1], model.config.hidden_size, 
                                              device=model.device, dtype=torch.float32)
                else:
                    # Get last hidden layer, convert to float32 for stability
                    hidden_state = outputs.hidden_states[-1].to(dtype=torch.float32)
                
                # Get attention mask and handle possible missing values
                mask = inputs.attention_mask.unsqueeze(-1)
                
                # Safe mean pooling: first sum, then divide, with safety checks
                sum_embeddings = torch.sum(hidden_state * mask, dim=1)
                sum_mask = torch.clamp(mask.sum(dim=1), min=1e-9)  # Avoid division by zero
                embedding = sum_embeddings / sum_mask
                
                # Check for NaN values and replace
                if torch.isnan(embedding).any():
                    print(f"Warning: NaN detected in embedding for: {sentence[:50]}...")
                    # Replace NaNs with zeros
                    embedding = torch.nan_to_num(embedding, nan=0.0)
                
                # Convert to numpy and add to results
                embeddings.append(embedding.cpu().numpy().squeeze())
        
        except Exception as e:
            print(f"Error processing sentence: {sentence[:50]}...")
            print(f"Error details: {str(e)}")
            # Add a zero vector as fallback
            embeddings.append(np.zeros(model.config.hidden_size))
    
    return embeddings

In [18]:
# Create a list of sentences to turn into numbers
sentences = [
    "The Sentences Transformers library provides an easy and open-source way to create embeddings.",
    "Sentences can be embedded one by one or as a list of strings.",
    "Embeddings are one of the most powerful concepts in machine learning!",
    "Learn to use embeddings well and you'll be well on your way to being an AI engineer."
]

# Get embeddings
embeddings = get_gemma_embeddings(sentences, model, tokenizer)
embeddings_dict = dict(zip(sentences, embeddings))

# See the embeddings
for sentence, embedding in embeddings_dict.items():
    print("Sentence:", sentence)
    print("Embedding shape:", embedding.shape)
    print("First 5 values:", embedding[:5])
    print("")

Sentence: The Sentences Transformers library provides an easy and open-source way to create embeddings.
Embedding shape: (2560,)
First 5 values: [0. 0. 0. 0. 0.]

Sentence: Sentences can be embedded one by one or as a list of strings.
Embedding shape: (2560,)
First 5 values: [0. 0. 0. 0. 0.]

Sentence: Embeddings are one of the most powerful concepts in machine learning!
Embedding shape: (2560,)
First 5 values: [0. 0. 0. 0. 0.]

Sentence: Learn to use embeddings well and you'll be well on your way to being an AI engineer.
Embedding shape: (2560,)
First 5 values: [0. 0. 0. 0. 0.]



## 4. RAG System (use vector search)


## 5. Prompting


## 6. Answer Generation


In [8]:
# Function to generate text with Gemma
def generate_text(prompt, max_length=100):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    
    # Generate text
    with torch.no_grad():
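        outputs = model.generate(
            **inputs,
            max_length=max_length,  # counts prompt tokens plus generated tokens
            num_return_sequences=1,
            do_sample=True,  # temperature and top_p only apply when sampling is enabled
            temperature=0.7,
            top_p=0.9,
        )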
    
    # Decode and return the generated text
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return generated_text

In [19]:
# Example usage
prompt = "Explain the concept of embeddings in NLP"
response = generate_text(prompt)
print(response)

RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

In [None]:
import subprocess
import sys


def export_requirements(output_file="requirements.txt"):
    try:
        # Run pip freeze with the current interpreter and redirect its output to the file
        with open(output_file, "w") as f:
            subprocess.check_call([sys.executable, "-m", "pip", "freeze"], stdout=f)
        print(f"Package list saved to {output_file}")
    except subprocess.CalledProcessError as e:
        print(f"Failed to export packages: {e}")


export_requirements()  # saves to requirements.txt

Package list saved to requirements.txt
