Only run the block below once to install packages.

In [None]:
!pip install -i https://pypi.org/simple/ bitsandbytes
# Note: do not install the separate "fitz" package from PyPI; `import fitz` is provided by PyMuPDF (installed below)
!pip install tqdm
!pip install huggingface_hub
!pip install PyMuPDF
!pip install transformers
!pip install spacy
!pip install sentence_transformers
!pip install llama_index
!pip install llama-index-embeddings-huggingface
!pip install flash-attn
!pip install accelerate

In [25]:
# Get total GPU memory (the capacity of the device, not the amount currently free)
import torch
gpu_memory_bytes = torch.cuda.get_device_properties(0).total_memory
gpu_memory_gb = round(gpu_memory_bytes / (2**30))
print(f"Available GPU memory: {gpu_memory_gb} GB")

Available GPU memory: 8 GB


In [1]:
# keep the notebook clean

import warnings
warnings.filterwarnings("ignore")

Steps to Implement the Code

1. Open a PDF Document:

You can use nearly any PDF document for this process.

2. Format the Text for an Embedding Model:

Prepare the PDF text for processing by splitting it into manageable chunks. This process is known as text splitting or chunking.

3. Embed the Text Chunks:

Convert each chunk of text into a numerical representation (embedding) that can be stored for later use.

4. Build a Retrieval System:

Implement a vector search mechanism to find relevant text chunks based on a query.

5. Create a Prompt with Retrieved Text:

Incorporate the retrieved text chunks into a prompt for further processing.

6. Generate an Answer:

Use the passages retrieved from the CRYSTAL23 manual to generate a comprehensive answer to the query.

In [2]:
# Get PDF document
PDF_PATH = "crystal23.pdf"


The code below extracts text from a PDF document using the PyMuPDF library (imported as `fitz`), with `tqdm` for progress tracking. `text_formatter` applies minor cleanup, and `open_and_read_pdf` collects the text of each page along with some basic statistics (character, word, raw sentence, and approximate token counts).

In [4]:
# Requires !pip install PyMuPDF, see: https://github.com/pymupdf/pymupdf
import fitz # (pymupdf, found this is better than pypdf for our use case, note: licence is AGPL-3.0, keep that in mind if you want to use any code commercially)
from tqdm.auto import tqdm # for progress bars, requires !pip install tqdm

def text_formatter(text: str) -> str:
    """Performs minor formatting on text."""
    cleaned_text = text.replace("\n", " ").strip() # note: this might be different for each doc (best to experiment)

    # Other potential text formatting functions can go here
    return cleaned_text

# Open PDF and get lines/pages
# Note: this only focuses on text, rather than images/figures etc
def open_and_read_pdf(pdf_path: str) -> list[dict]:
    """
    Opens a PDF file, reads its text content page by page, and collects statistics.

    Parameters:
        pdf_path (str): The file path to the PDF document to be opened and read.

    Returns:
        list[dict]: A list of dictionaries, each containing the zero-based page
        number, character count, word count, raw sentence count, approximate token
        count, and the extracted text for each page.
    """
    doc = fitz.open(pdf_path)  # open a document
    pages_and_texts = []
    for page_number, page in tqdm(enumerate(doc)):  # iterate the document pages
        text = page.get_text()  # get plain text encoded as UTF-8
        text = text_formatter(text)
        pages_and_texts.append({"page_number": page_number,  # zero-based page index from enumerate (no offset applied)
                                "page_char_count": len(text),
                                "page_word_count": len(text.split(" ")),
                                "page_sentence_count_raw": len(text.split(". ")),
                                "page_token_count": len(text) / 4,  # 1 token = ~4 chars, see: https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them
                                "text": text})
    return pages_and_texts

pages_and_texts = open_and_read_pdf(pdf_path=PDF_PATH)

0it [00:00, ?it/s]

This section uses the spaCy library to split the text of each page into sentences more reliably than the raw `". "` split used above. The `sentencizer` component is added to a blank English pipeline. It identifies sentence boundaries from punctuation without performing full parsing, which makes it fast and suitable for simple sentence segmentation tasks.

In [5]:
from spacy.lang.en import English # see https://spacy.io/usage for install instructions

nlp = English()

# Add a sentencizer pipeline, see https://spacy.io/api/sentencizer/
nlp.add_pipe("sentencizer")

for item in tqdm(pages_and_texts):
    item["sentences"] = list(nlp(item["text"]).sents)

    # Make sure all sentences are strings
    item["sentences"] = [str(sentence) for sentence in item["sentences"]]

    # Count the sentences
    item["page_sentence_count_spacy"] = len(item["sentences"])

  0%|          | 0/524 [00:00<?, ?it/s]
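
To make the idea of punctuation-based segmentation concrete, here is a minimal sketch using only the standard library. It is a rough stand-in, not spaCy's actual `sentencizer` implementation: the `naive_sentence_split` helper and the sample text are made up for illustration.

```python
import re

def naive_sentence_split(text: str) -> list[str]:
    """Split after '.', '!' or '?' when followed by whitespace (no linguistic parsing)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

# Made-up sample text containing an abbreviation
sample = "In some cases (e.g. optimization of the geometry) a restart is needed. See the manual."
sentences = naive_sentence_split(sample)
for sentence in sentences:
    print(repr(sentence))
```

A rule like this is fast but also splits after abbreviations such as "e.g.", so some of the resulting "sentences" are fragments. Keep that limitation in mind when inspecting short chunks later on.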

This function aims to split the sentences extracted from each page of the PDF into manageable chunks. These chunks are defined by a specified number of sentences, which allows for better handling and processing of the text data in subsequent steps. Feel free to play around with `num_sentence_chunk_size`.

In [6]:
# Define split size to turn groups of sentences into chunks
num_sentence_chunk_size = 10

# Create a function that splits a list into sublists of the desired size
def split_list(input_list: list,
               slice_size: int) -> list[list[str]]:
    """
    Splits the input_list into sublists of size slice_size (the last sublist may be shorter).

    For example, a list of 17 sentences would be split into two sublists of 10 and 7 sentences.
    """
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]

# Loop through pages and texts and split sentences into chunks
for item in tqdm(pages_and_texts):
    item["sentence_chunks"] = split_list(input_list=item["sentences"],
                                         slice_size=num_sentence_chunk_size)
    item["num_chunks"] = len(item["sentence_chunks"])

  0%|          | 0/524 [00:00<?, ?it/s]
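
As a quick check of the chunking logic, the sketch below repeats the same slicing on a made-up list of 17 placeholder sentences. The function is redefined here only so the snippet runs on its own.

```python
def split_list(input_list: list, slice_size: int) -> list[list]:
    """Same slicing as above: consecutive sublists of at most slice_size items."""
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]

# 17 placeholder sentences (made up for illustration)
sentences = [f"Sentence {n}." for n in range(17)]
chunks = split_list(sentences, slice_size=10)
chunk_sizes = [len(chunk) for chunk in chunks]
print(chunk_sizes)
```

Because the loop above splits each page separately, a chunk never spans two pages, and the last chunk of a page can be much shorter than `num_sentence_chunk_size`.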

This section of the code further processes the text by splitting each chunk of sentences into its own item and calculating various statistics for each chunk. The resulting data is stored in a new list called `pages_and_chunks`. 

In [7]:
import re

# Split each chunk into its own item
pages_and_chunks = []
for item in tqdm(pages_and_texts):
    for sentence_chunk in item["sentence_chunks"]:
        chunk_dict = {}
        chunk_dict["page_number"] = item["page_number"]

        # Join the sentences together into a paragraph-like structure, aka a chunk (so they are a single string)
        joined_sentence_chunk = "".join(sentence_chunk).replace("  ", " ").strip()
        joined_sentence_chunk = re.sub(r'\.([A-Z])', r'. \1', joined_sentence_chunk) # ".A" -> ". A" for any full-stop/capital letter combo
        chunk_dict["sentence_chunk"] = joined_sentence_chunk

        # Get stats about the chunk
        chunk_dict["chunk_char_count"] = len(joined_sentence_chunk)
        chunk_dict["chunk_word_count"] = len(joined_sentence_chunk.split(" "))
        chunk_dict["chunk_token_count"] = len(joined_sentence_chunk) / 4 # 1 token = ~4 characters

        pages_and_chunks.append(chunk_dict)

  0%|          | 0/524 [00:00<?, ?it/s]

Let's take a look at some of the stats. 

In [8]:
import pandas as pd

# Get stats about our chunks
df = pd.DataFrame(pages_and_chunks)
df.describe().round(3)

| statistic | page_number | chunk_char_count | chunk_word_count | chunk_token_count |
|---|---|---|---|---|
| count | 1079.0 | 1079.0 | 1079.0 | 1079.0 |
| mean | 271.449 | 1014.742 | 168.678 | 253.686 |
| std | 160.304 | 677.433 | 114.814 | 169.358 |
| min | 0.0 | 1.0 | 1.0 | 0.25 |
| 25% | 122.5 | 412.5 | 64.0 | 103.125 |
| 50% | 278.0 | 944.0 | 162.0 | 236.0 |
| 75% | 408.0 | 1507.5 | 249.5 | 376.875 |
| max | 523.0 | 3177.0 | 566.0 | 794.25 |


Note that some chunks have a very low token count. Let's see what these chunks contain.

In [9]:
# Show random chunks with 20 tokens or fewer
min_token_length = 20
for row in df[df["chunk_token_count"] <= min_token_length].sample(5).iterrows():
    print(f'Chunk token count: {row[1]["chunk_token_count"]} | Text: {row[1]["sentence_chunk"]}')

Chunk token count: 9.25 | Text: So, the input looks like: ... DFT 138
Chunk token count: 9.75 | Text: In some cases (e.g. optimization of 207
Chunk token count: 11.25 | Text: Electronic conﬁguration: [Ar] 4s(2) 3d(8) 381
Chunk token count: 0.75 | Text: 323
Chunk token count: 13.5 | Text: Any character can follow: ENDGEOM, ENDGINP, etc etc.48


These are mostly page numbers, cross-references, or sentence fragments, so we can exclude them from the RAG process.

In [10]:
pages_and_chunks_over_min_token_len = df[df["chunk_token_count"] > min_token_length].to_dict(orient="records")

Now it's time to embed the chunks. Our objective is to transform each of our text chunks into numerical representations known as embedding vectors. These vectors are sequences of numbers that capture the semantic meaning of the text, making it easier for computers to understand and find patterns.

Once the text samples are converted into embedding vectors, they will no longer be human-readable. However, this transformation is crucial for computational analysis and pattern recognition. The computer can use these embeddings to perform various tasks, such as searching for similar texts or clustering related topics.

To achieve this, we'll use the `sentence-transformers` library, which offers a variety of pre-trained embedding models. Specifically, we will utilize the `all-mpnet-base-v2` model for generating embeddings.

In [11]:
%%time

from sentence_transformers import SentenceTransformer


embedding_model = SentenceTransformer(model_name_or_path="sentence-transformers/all-mpnet-base-v2", device='cuda:0',
                                      trust_remote_code=True) # choose the device to load the model to (note: GPU will often be *much* faster than CPU)

# Create embeddings on the GPU
for item in tqdm(pages_and_chunks_over_min_token_len):
    item["embedding"] = embedding_model.encode(item["sentence_chunk"])

  0%|          | 0/1029 [00:00<?, ?it/s]

CPU times: total: 4min 2s
Wall time: 32 s
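
To illustrate how embedding vectors let a computer compare texts, here is a minimal sketch with made-up 3-dimensional vectors (the real embeddings in this notebook have 768 dimensions). The helper names are hypothetical and use only the standard library.

```python
import math

def dot(a: list[float], b: list[float]) -> float:
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))

def normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vector, vector))
    return [x / norm for x in vector]

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Made-up toy "embeddings"
query = [0.9, 0.1, 0.0]
chunk_a = [0.8, 0.2, 0.1]   # points in nearly the same direction as the query
chunk_b = [0.0, 0.3, 0.9]   # points in a very different direction

print(f"query vs chunk_a: {cosine_similarity(query, chunk_a):.3f}")
print(f"query vs chunk_b: {cosine_similarity(query, chunk_b):.3f}")
```

For unit-length vectors the plain dot product equals the cosine similarity, which is why a dot product score can be used for ranking when the embeddings are normalized.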


In [12]:
# Save embeddings to file
text_chunks_and_embeddings_df = pd.DataFrame(pages_and_chunks_over_min_token_len)
embeddings_df_save_path = "text_chunks_and_embeddings_df.csv"
text_chunks_and_embeddings_df.to_csv(embeddings_df_save_path, index=False)

In [13]:
import torch
import numpy as np
import pandas as pd

device = "cuda" if torch.cuda.is_available() else "cpu"

# Import texts and embedding df
text_chunks_and_embedding_df = pd.read_csv("text_chunks_and_embeddings_df.csv")

# Convert embedding column back to np.array (it got converted to string when it got saved to CSV)
text_chunks_and_embedding_df["embedding"] = text_chunks_and_embedding_df["embedding"].apply(lambda x: np.fromstring(x.strip("[]"), sep=" "))

# Convert texts and embedding df to list of dicts
pages_and_chunks = text_chunks_and_embedding_df.to_dict(orient="records")

# Convert embeddings to torch tensor and send to device (note: NumPy arrays are float64, torch tensors are float32 by default)
embeddings = torch.tensor(np.array(text_chunks_and_embedding_df["embedding"].tolist()), dtype=torch.float32).to(device)
embeddings.shape

torch.Size([1029, 768])
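
The string-to-array conversion above can be hard to picture, so here is a small standard-library sketch of what happens to one CSV cell. The `parse_embedding` helper is hypothetical, and the sample cell is shortened to three values for illustration.

```python
def parse_embedding(cell: str) -> list[float]:
    """Turn a cell like '[-9.3e-03  2.5e-04 ...]' back into a list of floats."""
    # Drop the surrounding brackets, then split on any whitespace (spaces or newlines)
    return [float(token) for token in cell.strip("[]").split()]

# A shortened example of how an embedding looks after being written to CSV as text
cell = "[-9.30144172e-03  2.51885125e-04\n -3.25771905e-02]"
vector = parse_embedding(cell)
print(vector)
```

Storing vectors as text is convenient for inspection but may lose some precision and needs this parsing step on load. For larger collections, a binary format (for example `np.save` or `torch.save`) or a vector database is usually a better fit.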

This code section demonstrates how to instantiate a language model using the Hugging Face Transformers library, specifically the `AutoModelForCausalLM` class. We will use quantization to reduce the memory footprint. 

Quantization is a technique used in machine learning to reduce the precision of numerical values. Specifically, 4-bit quantization represents each weight using only 4 bits, which significantly reduces memory usage compared to the default 32-bit or even 16-bit floating-point precision, at the cost of some accuracy.
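
As a rough illustration of the idea, the sketch below quantizes a few made-up weights to signed 4-bit integers with a single shared scale and estimates the memory needed for the weights at different precisions. This is a deliberately simplified scheme, not what `bitsandbytes` actually does (the library uses block-wise 4-bit formats with additional details), and the 7 billion parameter count is a round illustrative figure rather than the exact size of any model.

```python
def quantize_4bit(values: list[float]) -> tuple[list[int], float]:
    """Symmetric absmax quantization to integers in [-7, 7] with one shared scale."""
    scale = max(abs(v) for v in values) / 7 or 1.0  # fall back to 1.0 if all values are zero
    codes = [round(v / scale) for v in values]
    return codes, scale

def dequantize(codes: list[int], scale: float) -> list[float]:
    """Map the integer codes back to approximate floating-point values."""
    return [code * scale for code in codes]

def weight_memory_gib(n_params: float, bits_per_param: int) -> float:
    """Memory needed to store the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30

weights = [0.12, -0.50, 0.33, 0.07, -0.21, 0.44]  # made-up values
codes, scale = quantize_4bit(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(f"codes: {codes}")
print(f"largest rounding error: {max_error:.4f}")

for bits in (32, 16, 4):
    print(f"{bits:>2}-bit weights for 7e9 parameters: {weight_memory_gib(7e9, bits):.1f} GiB")
```

The rounding error is the price paid for the smaller footprint: each weight is stored as one of only 15 levels instead of a full floating-point number.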

Additionally, Flash Attention reduces the memory cost, and usually the runtime, of standard dense self-attention. It still computes exact attention over all tokens. The savings come from processing the attention matrix in blocks that fit in fast on-chip GPU memory, so the full matrix never has to be written to slower GPU memory.
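
One ingredient of Flash Attention is that the softmax can be computed block by block with running statistics, so the full set of attention scores never has to be held at once while the result stays exact. The pure-Python sketch below shows this for a single query with made-up numbers. It only illustrates the arithmetic (the usual scaling by the square root of the head dimension is omitted); the real implementation is a fused GPU kernel.

```python
import math

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def attention_reference(query, keys, values):
    """Softmax attention for one query, computing every score at once."""
    scores = [dot(query, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]

def attention_blockwise(query, keys, values, block_size=2):
    """Same result, but keys/values are visited in blocks with a running softmax."""
    running_max = float("-inf")
    running_sum = 0.0
    accumulator = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        block_scores = [dot(query, k) for k in keys[start:start + block_size]]
        block_values = values[start:start + block_size]
        new_max = max(running_max, max(block_scores))
        # Rescale what was accumulated so far to the new maximum (0.0 on the first block)
        rescale = math.exp(running_max - new_max)
        running_sum *= rescale
        accumulator = [a * rescale for a in accumulator]
        for score, value in zip(block_scores, block_values):
            weight = math.exp(score - new_max)
            running_sum += weight
            accumulator = [a + weight * v for a, v in zip(accumulator, value)]
        running_max = new_max
    return [a / running_sum for a in accumulator]

# Made-up toy data: one query, five keys and values of dimension 2
query = [1.0, 0.5]
keys = [[0.2, 0.1], [1.0, 0.3], [-0.5, 0.8], [0.7, 0.7], [0.0, -1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, -1.0], [-1.0, 3.0]]

full = attention_reference(query, keys, values)
blocked = attention_blockwise(query, keys, values, block_size=2)
print(full)
print(blocked)
```

Both functions attend to every key. The block-wise version simply never needs all the scores in memory at the same time.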

In [14]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers.utils import is_flash_attn_2_available

model_id = "google/gemma-7b-it"

use_quantization_config = True
# Create quantization config for smaller model loading (optional)
# Requires !pip install bitsandbytes accelerate, see: https://github.com/TimDettmers/bitsandbytes, https://huggingface.co/docs/accelerate/
# For models that require 4-bit quantization (use this if you have low GPU memory available)
from transformers import BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(load_in_4bit=True,
                                         bnb_4bit_compute_dtype=torch.float16)

# Bonus: Setup Flash Attention 2 for faster inference, default to "sdpa" or "scaled dot product attention" if it's not available
# Flash Attention 2 requires NVIDIA GPU compute capability of 8.0 or above, see: https://developer.nvidia.com/cuda-gpus
# Requires !pip install flash-attn, see: https://github.com/Dao-AILab/flash-attention
if (is_flash_attn_2_available()) and (torch.cuda.get_device_capability(0)[0] >= 8):
  attn_implementation = "flash_attention_2"
else:
  attn_implementation = "sdpa"
print(f"[INFO] Using attention implementation: {attn_implementation}")



# Instantiate tokenizer (tokenizer turns text into numbers ready for the model)

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=model_id)
# Instantiate the model
llm_model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path=model_id,
                                                 torch_dtype=torch.float16, # datatype to use, we want float16
                                                 quantization_config=quantization_config if use_quantization_config else None,
                                                 low_cpu_mem_usage=False, # use full memory
                                                 attn_implementation=attn_implementation) # which attention version to use

if not use_quantization_config: # quantization takes care of device setting automatically, so if it's not used, send model to GPU
    llm_model.to("cuda")
    
llm_model    

[INFO] Using attention implementation: sdpa


`config.hidden_act` is ignored, you should use `config.hidden_activation` instead.
Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use
`config.hidden_activation` if you want to override this behaviour.
See https://github.com/huggingface/transformers/pull/29402 for more details.


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

GemmaForCausalLM(
  (model): GemmaModel(
    (embed_tokens): Embedding(256000, 3072, padding_idx=0)
    (layers): ModuleList(
      (0-27): 28 x GemmaDecoderLayer(
        (self_attn): GemmaSdpaAttention(
          (q_proj): Linear4bit(in_features=3072, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=3072, out_features=4096, bias=False)
          (v_proj): Linear4bit(in_features=3072, out_features=4096, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=3072, bias=False)
          (rotary_emb): GemmaRotaryEmbedding()
        )
        (mlp): GemmaMLP(
          (gate_proj): Linear4bit(in_features=3072, out_features=24576, bias=False)
          (up_proj): Linear4bit(in_features=3072, out_features=24576, bias=False)
          (down_proj): Linear4bit(in_features=24576, out_features=3072, bias=False)
          (act_fn): PytorchGELUTanh()
        )
        (input_layernorm): GemmaRMSNorm()
        (post_attention_layernorm): GemmaRMSNorm()
  

On my local machine, the quantized Gemma model fits into 8 GB of VRAM, so we can continue!

![GPU Usage](./images/gpu_usage.png)

Now it's time to start generating text. Gemma requires a specific chat template to format prompts. Let's check it out:

In [15]:
input_text = "What is the VCI method"
print(f"Input text:\n{input_text}")

# Create prompt template for instruction-tuned model
dialogue_template = [
    {"role": "user",
     "content": input_text}
]

# Apply the chat template
prompt = tokenizer.apply_chat_template(conversation=dialogue_template,
                                       tokenize=False, # keep as raw text (not tokenized)
                                       add_generation_prompt=True)
print(f"\nPrompt (formatted):\n{prompt}")

Input text:
What is the VCI method

Prompt (formatted):
<bos><start_of_turn>user
What is the VCI method<end_of_turn>
<start_of_turn>model



Now we can tokenize the formatted prompt and generate text with Gemma:

In [16]:
%%time

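# Tokenize the formatted prompt (turn it into numbers) and send it to the GPU
# Note: the chat template already adds <bos> and the tokenizer prepends another one
# (see the two leading 2s in the output below); pass add_special_tokens=False to avoid the duplicate
input_ids = tokenizer(prompt, return_tensors="pt").to("cuda")
print(f"Model input (tokenized):\n{input_ids}\n")

# Generate outputs based on the tokenized input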
# See generate docs: https://huggingface.co/docs/transformers/v4.38.2/en/main_classes/text_generation#transformers.GenerationConfig
outputs = llm_model.generate(**input_ids,
                             max_new_tokens=512) # define the maximum number of new tokens to create
print(f"Model output (tokens):\n{outputs[0]}\n")

Model input (tokenized):
{'input_ids': tensor([[    2,     2,   106,  1645,   108,  1841,   603,   573,   744, 10621,
          2370,   107,   108,   106,  2516,   108]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}

Model output (tokens):
tensor([     2,      2,    106,   1645,    108,   1841,    603,    573,    744,
         10621,   2370,    107,    108,    106,   2516,    108,    651,    744,
         10621,   2370,  12353,    604,  32137, 235290, 102254,  11832,   6441,
        235265,   1165,    603,    476,  21843,   2370,   1671,    577,  14461,
           573,  12087,  15133,    604,    476,   6280,  14464, 235265,    714,
           744,  10621,   2370,    603,   3482,    611,    573,   4268,    674,
           573,  12087,  15133,    604,    476,   6280,  14464,    603,   2764,
           731, 235292,    109,   4471, 235263,  22173,    868, 235287,   8402,
        235278, 235263, 235278, 235274, 235290, 2352

Note that the output is just numbers. We need to decode these numbers to make them human readable:

In [17]:
# Decode the output tokens to readable text
outputs_decoded = tokenizer.decode(outputs[0])
print(f"Model output (decoded):\n{outputs_decoded}\n")

Model output (decoded):
<bos><bos><start_of_turn>user
What is the VCI method<end_of_turn>
<start_of_turn>model
The VCI method stands for Variable-Confidence Index Method. It is a statistical method used to estimate the confidence interval for a population proportion. The VCI method is based on the idea that the confidence interval for a population proportion is given by:

$$p ± z*sqrt(p(1-p)/n)$$

where p is the sample proportion, z is the z-score for the desired confidence level, and n is the sample size.

The VCI method estimates the confidence interval by using the sample proportion as an estimate for the population proportion and then substituting this estimate into the formula for the confidence interval. The VCI method is a simple and straightforward method for estimating confidence intervals, but it can be biased if the sample proportion is close to 0 or 1.<eos>



This output is completely wrong: Gemma invents an unrelated statistical method. The model was trained on a lot of data, but it lacks niche information about CRYSTAL23. Let's implement the RAG pipeline, starting with semantic search, to feed relevant context to Gemma so it can answer our question.

In [18]:
from time import perf_counter as timer

from sentence_transformers import util, SentenceTransformer

def retrieve_relevant_resources(query: str,
                                embeddings: torch.Tensor,
                                model: SentenceTransformer=embedding_model,
                                n_resources_to_return: int=5,
                                print_time: bool=False): # set to True to print the scoring time
    """
    Embeds a query with model and returns the top-k dot product scores and their indices from embeddings.
    """

    # Embed the query
    query_embedding = model.encode(query,
                                   convert_to_tensor=True)

    # Get dot product scores on embeddings
    start_time = timer()
    dot_scores = util.dot_score(query_embedding, embeddings)[0]
    end_time = timer()

    if print_time:
        print(f"[INFO] Time taken to get scores on {len(embeddings)} embeddings: {end_time-start_time:.5f} seconds.")

    # Keep the k highest scores and the indices of the matching chunks
    scores, indices = torch.topk(input=dot_scores,
                                 k=n_resources_to_return)

    return scores, indices

def print_top_results_and_scores(query: str,
                                 embeddings: torch.Tensor,
                                 pages_and_chunks: list[dict]=pages_and_chunks,
                                 n_resources_to_return: int=5):
    """
    Takes a query, retrieves most relevant resources and prints them out in descending order.

    Note: Requires pages_and_chunks to be formatted in a specific way (see above for reference).
    """
    
    scores, indices = retrieve_relevant_resources(query=query,
                                                  embeddings=embeddings,
                                                  n_resources_to_return=n_resources_to_return)
    
    print(f"Query: {query}\n")
    print("Results:")
    # Loop through zipped together scores and indices
    for score, index in zip(scores, indices):
        print(f"Score: {score:.4f}")
        # Print relevant sentence chunk (since the scores are in descending order, the most relevant chunk will be first)
        print_wrapped(pages_and_chunks[index]["sentence_chunk"])
        # Print the page number too so we can look up the passage in the manual and check the results
        print(f"Page number: {pages_and_chunks[index]['page_number']}")
        print("\n")
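
The retrieval step boils down to "score every chunk against the query, then keep the best k". Here is a minimal standard-library sketch of that logic with made-up 2-dimensional vectors. The `top_k_by_dot_score` helper is hypothetical and stands in for the `util.dot_score` plus `torch.topk` combination used above.

```python
import heapq

def top_k_by_dot_score(query_vec: list[float],
                       chunk_vecs: list[list[float]],
                       k: int) -> list[tuple[float, int]]:
    """Return (score, index) pairs for the k chunks with the highest dot product."""
    scores = [sum(q * c for q, c in zip(query_vec, vec)) for vec in chunk_vecs]
    top_indices = heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)
    return [(scores[i], i) for i in top_indices]

# Made-up toy vectors standing in for chunk embeddings
chunk_vecs = [[0.1, 0.9], [0.8, 0.2], [0.6, 0.6], [0.0, 1.0]]
query_vec = [1.0, 0.0]

results = top_k_by_dot_score(query_vec, chunk_vecs, k=2)
for score, index in results:
    print(f"Score: {score:.4f} | chunk index: {index}")
```

The returned indices play the same role as `indices` above: they are used to look up the matching entries in `pages_and_chunks`.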

In [19]:
def prompt_formatter(query: str, 
                     context_items: list[dict]) -> str:
    """
    Augments query with text-based context from context_items.
    """
    # Join context items into one bulleted list
    context = "- " + "\n- ".join([item["sentence_chunk"] for item in context_items])

    # Create a base prompt with instructions for the model
    # Note: this is very customizable, e.g. examples of the desired answer style could be added here.
    # We could also write this in a txt file and import it in if we wanted.
    base_prompt = """Based on the following context items, please answer the query.
Give yourself room to think by extracting relevant passages from the context before answering the query.
Don't return the thinking, only return the answer.
Make sure your answers are as explanatory as possible.
\nNow use the following context items to answer the user query:
{context}
\nRelevant passages: <extract relevant passages from the context here>
User query: {query}
Answer:"""

    # Update base prompt with context items and query   
    base_prompt = base_prompt.format(context=context, query=query)

    # Create prompt template for instruction-tuned model
    dialogue_template = [
        {"role": "user",
        "content": base_prompt}
    ]

    # Apply the chat template
    prompt = tokenizer.apply_chat_template(conversation=dialogue_template,
                                          tokenize=False,
                                          add_generation_prompt=True)
    return prompt
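
To see what ends up in the final prompt string, the sketch below rebuilds the two pieces by hand: the bulleted context block and the Gemma turn markers, as they appear in the formatted prompt printed earlier in this notebook. Both helpers are hypothetical stand-ins for illustration. In real code, keep using `tokenizer.apply_chat_template`, since the template belongs to the tokenizer and may differ between models.

```python
def build_context_block(chunks: list[str]) -> str:
    """Join text chunks into a bulleted list, one chunk per bullet."""
    return "- " + "\n- ".join(chunks)

def build_gemma_prompt(user_content: str) -> str:
    """Wrap user content in the turn markers seen in the printed prompt above."""
    return f"<bos><start_of_turn>user\n{user_content}<end_of_turn>\n<start_of_turn>model\n"

# Made-up placeholder chunks
context = build_context_block(["First retrieved passage.", "Second retrieved passage."])
prompt_text = build_gemma_prompt(f"Context:\n{context}\nUser query: What is the VCI method?")
print(prompt_text)
```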

Now, let's generate a prompt with context:

In [20]:
query = "What is the VCI method?"
print(f"Query: {query}")

# Get relevant resources
scores, indices = retrieve_relevant_resources(query=query,
                                              embeddings=embeddings)
    
# Create a list of context items
context_items = [pages_and_chunks[i] for i in indices]

# Format prompt with context items
prompt = prompt_formatter(query=query,
                          context_items=context_items)
print(prompt)

Query: What is the VCI method?
<bos><start_of_turn>user
Based on the following context items, please answer the query.
Give yourself room to think by extracting relevant passages from the context before answering the query.
Don't return the thinking, only return the answer.
Make sure your answers are as explanatory as possible.

Now use the following context items to answer the user query:
- 8.46) to be diagonalized. In particular, this latter aspect is the main limiting factor to the application of standard VCI to the study of those systems where more than just a few vibration modes need to be coupled. Therefore, it is crucial to devise eﬀective schemes to reduce as much as possible the conﬁg- urational space used in the VCI expansion. We have implemented two such schemes, to be introduced below. The following strategies can be used: 246
- If instead one wants to run a VCI calculation on a previously computed harmonic and anhar- monic PES (from ﬁles FREQINFO. DAT and SCANPES. DAT), th

In [21]:
# Define helper function to print wrapped text 
import textwrap

def print_wrapped(text, wrap_length=80):
    wrapped_text = textwrap.fill(text, wrap_length)
    print(wrapped_text)

With the helper defined, we can search the CRYSTAL23 manual and print the top 5 results (the most relevant passages) for our query. The score and page number are printed with each passage so the results can be checked against the manual.

In [22]:
query_embedding = embedding_model.encode(query, convert_to_tensor=True)
dot_scores = util.dot_score(a=query_embedding, b=embeddings)[0]
top_results_dot_product = torch.topk(dot_scores, k=5)

print(f"Query: '{query}'\n")
print("Results:")
# Loop through zipped together scores and indices from torch.topk
for score, idx in zip(top_results_dot_product[0], top_results_dot_product[1]):
    print(f"Score: {score:.4f}")
    # Print relevant sentence chunk (since the scores are in descending order, the most relevant chunk will be first)
    print("Text:")
    print_wrapped(pages_and_chunks[idx]["sentence_chunk"])
    # Print the page number too so we can look up the passage in the manual (and check the results)
    print(f"Page number: {pages_and_chunks[idx]['page_number']}")
    print("\n")

Query: 'What is the VCI method?'

Results:
Score: 0.4764
Text:
8.46) to be diagonalized. In particular, this latter aspect is the main limiting
factor to the application of standard VCI to the study of those systems where
more than just a few vibration modes need to be coupled. Therefore, it is
crucial to devise eﬀective schemes to reduce as much as possible the conﬁg-
urational space used in the VCI expansion. We have implemented two such schemes,
to be introduced below. The following strategies can be used: 246
Page number: 245


Score: 0.4637
Text:
If instead one wants to run a VCI calculation on a previously computed harmonic
and anhar- monic PES (from ﬁles FREQINFO. DAT and SCANPES. DAT), the input would
read like: ... Optimized Geometry Input... FREQCALC RESTART RESTPES ANHAPES 12 4
5 6 7 8 9 10 11 12 13 14 15 3 0.9 VCI 6 3 1 END [End of FREQCALC block] END [End
of Geometry block] ... Basis Set... The results of the VCI calculation are
printed in the output ﬁle below this header:

Now, let's generate text with Gemma:

In [23]:
def ask(query, 
        temperature=0.2,
        max_new_tokens=2048,
        format_answer_text=True, 
        return_answer_only=True):
    """
    Takes a query, finds relevant resources/context and generates an answer to the query based on the relevant resources.
    """
    
    # Get just the scores and indices of top related results
    scores, indices = retrieve_relevant_resources(query=query,
                                                  embeddings=embeddings)
    
    # Create a list of context items
    context_items = [pages_and_chunks[i] for i in indices]

    # Add score to context item
    for i, item in enumerate(context_items):
        item["score"] = scores[i].cpu() # return score back to CPU 
        
    # Format the prompt with context items
    prompt = prompt_formatter(query=query,
                              context_items=context_items)
    
    # Tokenize the prompt
    input_ids = tokenizer(prompt, return_tensors="pt").to("cuda")

    # Generate an output of tokens
    outputs = llm_model.generate(**input_ids,
                                 temperature=temperature,
                                 do_sample=True,
                                 max_new_tokens=max_new_tokens)
    
    # Turn the output tokens into text
    output_text = tokenizer.decode(outputs[0])

    if format_answer_text:
        # Replace special tokens and unnecessary help message
        output_text = output_text.replace(prompt, "").replace("<bos>", "").replace("<eos>", "").replace("Sure, here is the answer to the user query:\n\n", "")

    # Only return the answer without the context items
    if return_answer_only:
        return output_text
    
    return output_text, context_items

In [24]:
%%time

query = "What is the VCI method?"
print(f"Query: {query}")

# Answer query with context and return context 
answer, context_items = ask(query=query, 
                            temperature=0.2,
                            max_new_tokens=2048,
                            return_answer_only=False)

print(f"Answer:\n")
print_wrapped(answer)
print(f"Context items:")
context_items

Query: What is the VCI method?
Answer:

The VCI (Vibrational Configuration Interaction) method is a method for
calculating vibrational states of a molecular system based on the expansion of
the wave function of each vibrational state in terms of M-mode wave functions of
different vibrational configurations. The VCI method relies on the construction
and diagonalization of the VCI Hamiltonian matrix, from which all vibrational
states are simultaneously determined. The VCI method can be expressed in matrix
form as follows: HA = AE, where A is the squared matrix containing, column-wise,
the coeﬃcients An,s of the eigenvectors, E is the diagonal matrix of the
eigenvalues and H is the VCI Hamiltonian matrix.
Context items:
CPU times: total: 6.09 s
Wall time: 24.2 s


[{'page_number': 245,
  'sentence_chunk': '8.46) to be diagonalized. In particular, this latter aspect is the main limiting factor to the application of standard VCI to the study of those systems where more than just a few vibration modes need to be coupled. Therefore, it is crucial to devise eﬀective schemes to reduce as much as possible the conﬁg- urational space used in the VCI expansion. We have implemented two such schemes, to be introduced below. The following strategies can be used: 246',
  'chunk_char_count': 455,
  'chunk_word_count': 78,
  'chunk_token_count': 113.75,
  'embedding': array([-9.30144172e-03,  2.51885125e-04, -3.25771905e-02,  2.66325697e-02,
         -8.01468827e-03, -9.70605481e-03,  2.31933165e-02,  2.03944044e-03,
         -4.89266962e-02,  1.20525993e-02, -2.63694208e-02, -1.82678681e-02,
         -7.24090682e-03,  3.13642025e-02, -3.17825563e-02, -3.46553628e-03,
          4.41575609e-02,  5.86699927e-03, -4.37664948e-02, -3.10996938e-02,
          3.66917

Much better! Thanks to the context provided, Gemma now correctly identifies VCI as Vibrational Configuration Interaction and describes it using the manual. Feel free to experiment with `temperature` and `max_new_tokens`.
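
To build some intuition for `temperature`, here is a minimal sketch of how dividing the model's raw scores (logits) by the temperature changes the sampling probabilities. The logits are made up, and the function is a simplified illustration of the common approach rather than the library's exact code path.

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [logit / temperature for logit in logits]
    top = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
sharp = softmax_with_temperature(logits, temperature=0.2)
flat = softmax_with_temperature(logits, temperature=1.0)
print("temperature 0.2:", [round(p, 3) for p in sharp])
print("temperature 1.0:", [round(p, 3) for p in flat])
```

A low temperature such as the 0.2 used above concentrates almost all probability on the top candidate, which gives more repeatable answers. Higher values spread the probability out and produce more varied text.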