# Query PDF documents using RAG (Llama-Index + Nebius AI)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/nebius/token-factory-cookbook/blob/main/rag/rag-pdf-llama-index/rag_pdf_query.ipynb)
[![](https://img.shields.io/badge/Powered%20by-Nebius%20AI-orange?style=flat&labelColor=orange&color=green)](http://tokenfactory.nebius.com/)

This example shows how to query a PDF using the [LlamaIndex](https://docs.llamaindex.ai/en/stable/) framework, with the LLM running on [Nebius Token Factory](https://tokenfactory.nebius.com/).

[Read more about it here](https://github.com/nebius/token-factory-cookbook/blob/main/rag/rag-pdf-llama-index/README.md)


## References and Acknowledgements

- [LlamaIndex documentation](https://docs.llamaindex.ai/en/stable/)
- [Nebius Token Factory](https://tokenfactory.nebius.com/)
- [Nebius Token Factory documentation](https://docs.tokenfactory.nebius.com/inference/quickstart)

## Prerequisites

- A Nebius API key. Sign up for free at [Token Factory](https://tokenfactory.nebius.com/).

## 2 - Install Dependencies

In [1]:
import os

# Colab sets COLAB_RELEASE_TAG; use it to detect the runtime
if os.getenv("COLAB_RELEASE_TAG"):
    RUNNING_ON_COLAB = True
    print("Running in Colab")
else:
    RUNNING_ON_COLAB = False
    print("NOT Running in Colab")

Running in Colab


In [12]:
if RUNNING_ON_COLAB:
  # Install the required packages
  !pip install -q llama-index-llms-litellm \
                  jedi \
                  llama-index-readers-file \
                  llama-index-llms-nebius \
                  llama-index-embeddings-nebius \
                  llama-index-embeddings-huggingface \
                  python-dotenv

## 3 - Load Configuration

In [13]:
import os

## Recommended way of getting configuration
if RUNNING_ON_COLAB:
    from google.colab import userdata
    NEBIUS_API_KEY = userdata.get('NEBIUS_API_KEY')
else:
    from dotenv import load_dotenv
    load_dotenv()
    NEBIUS_API_KEY = os.getenv('NEBIUS_API_KEY')


## Quick hack (not recommended): hardcode the API key here
# NEBIUS_API_KEY = "your_key_here"

if NEBIUS_API_KEY:
    print('✅ NEBIUS_API_KEY found')
    os.environ['NEBIUS_API_KEY'] = NEBIUS_API_KEY
else:
    raise RuntimeError('❌ NEBIUS_API_KEY NOT found')

✅ NEBIUS_API_KEY found


## 4 - Data

In [19]:
import os

input_dir = 'data'

if RUNNING_ON_COLAB:
    # Create the data directory and download the sample PDF
    os.makedirs(input_dir, exist_ok=True)
    !wget -O '{input_dir}/attention.pdf' 'https://github.com/sriks8/nebius/raw/main/attn.pdf'


--2026-02-02 00:15:48--  https://github.com/sriks8/nebius/raw/main/attn.pdf
Resolving github.com (github.com)... 140.82.112.4
Connecting to github.com (github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/sriks8/nebius/main/attn.pdf [following]
--2026-02-02 00:15:49--  https://raw.githubusercontent.com/sriks8/nebius/main/attn.pdf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2215244 (2.1M) [application/octet-stream]
Saving to: ‘data/attention.pdf’


2026-02-02 00:15:49 (31.0 MB/s) - ‘data/attention.pdf’ saved [2215244/2215244]



## 5 - Setup Embedding Model

We can run the embedding model locally (fast) or on the cloud.

If running locally:
- choose smaller models
- faster, but less accurate

If running on the cloud:
- we can run large models (billions of parameters)

In [20]:
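## Local model
import os
# Optional: route Hugging Face downloads through a mirror; remove this line to use the default hub endpoint
os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'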

from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings

## Running embedding models on Nebius cloud
# from llama_index.embeddings.nebius import NebiusEmbedding
# Settings.embed_model = NebiusEmbedding(
#                         model_name='BAAI/bge-en-icl',
#                         api_key=os.getenv("NEBIUS_API_KEY") # if not specified here, it is taken from the env variable
#                        )

## Try out a few open source embedding models locally
Settings.embed_model = HuggingFaceEmbedding(
    # model_name = 'sentence-transformers/all-MiniLM-L6-v2' # 23 M params
    model_name = 'BAAI/bge-small-en-v1.5'  # 33M params
    # model_name = 'Qwen/Qwen3-Embedding-0.6B'  # 600M params
    # model_name = 'BAAI/bge-en-icl'  # 7B params
    # model_name = 'intfloat/multilingual-e5-large-instruct'  # 560M params
)



## 6 - Setup LlamaIndex with Nebius

We can use `llama_index.llms.nebius.NebiusLLM` or `llama_index.llms.litellm.LiteLLM`.

See the examples below.

In [21]:
from llama_index.llms.nebius import NebiusLLM
from llama_index.llms.litellm import LiteLLM
from llama_index.core import Settings

Settings.llm = NebiusLLM(
    model='meta-llama/Llama-3.3-70B-Instruct',
    # model='deepseek-ai/DeepSeek-R1-0528',
    # model='openai/gpt-oss-20b',
    api_key=os.getenv("NEBIUS_API_KEY")  # if not specified, it is taken from the env variable
)

## Alternative: LiteLLM (model names take a 'nebius/' prefix; pick one)
# Settings.llm = LiteLLM(
#     model='nebius/meta-llama/Llama-3.3-70B-Instruct',
#     # model='nebius/deepseek-ai/DeepSeek-R1-0528',
#     # model='nebius/openai/gpt-oss-20b',
#     api_key=os.getenv("NEBIUS_API_KEY")  # if not specified, it is taken from the env variable
# )

## 6 - Read PDFs

In [22]:
import os
import glob

pattern = os.path.join(input_dir, '*.pdf')
input_file_count = len(glob.glob(pattern, recursive=True))

In [23]:
from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader(input_dir).load_data()
print(f'Loaded {len(documents)} docs from {input_file_count} files')


Loaded 15 docs from 1 files
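
Getting 15 documents from a single file is expected: the default PDF reader in LlamaIndex typically produces one `Document` per page rather than one per file. A minimal sketch for checking how the documents are distributed across files, assuming each document exposes a `metadata` dict with a `file_name` key (the key appears in the query output of this notebook); `docs_per_file` is a hypothetical helper, not part of LlamaIndex:

```python
from collections import Counter


def docs_per_file(documents):
    """Count how many Document objects were produced from each source file."""
    # Assumption: every document has a `metadata` dict; a missing file name
    # is counted under a placeholder key.
    return Counter(doc.metadata.get("file_name", "<unknown>") for doc in documents)


# In the notebook, after loading the documents:
# print(docs_per_file(documents))
```

If more PDFs are added to the `data` directory, the same helper shows how many documents each of them contributes.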


In [24]:
%%time

from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents)

CPU times: user 9.13 s, sys: 398 ms, total: 9.53 s
Wall time: 9.63 s
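
Conceptually, `VectorStoreIndex.from_documents` splits the documents into chunks, embeds each chunk with `Settings.embed_model`, and keeps the vectors for similarity search. The following is a toy sketch of that idea in plain Python, not LlamaIndex's implementation: `embed` is a hypothetical stand-in that uses word counts instead of a neural model, and `ToyVectorIndex` is an illustrative name.

```python
import math
from collections import Counter


def embed(text):
    """Hypothetical stand-in for an embedding model: a bag-of-words vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorIndex:
    """Stores (chunk, vector) pairs and returns the chunks closest to a query."""

    def __init__(self, chunks):
        # "Indexing" here just means embedding every chunk once, up front.
        self.entries = [(chunk, embed(chunk)) for chunk in chunks]

    def top_k(self, query, k=2):
        # Embed the query, rank all chunks by similarity, keep the best k.
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:k]]


toy_index = ToyVectorIndex([
    "attention weighs the relevance of each input token",
    "the model was trained on eight GPUs",
    "multi-head attention runs several attention layers in parallel",
])
print(toy_index.top_k("how does attention work", k=2))
```

A real embedding model replaces the word counts with dense vectors that capture meaning, which is why the choice of embedding model affects retrieval quality.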


## 7 - Query documents

In [25]:
response = index.as_query_engine().query("What is attention mechanism?")
print(response)

The attention mechanism is a technique used to focus on specific parts of the input data that are relevant for a particular task, rather than considering the entire input equally. It is often used in deep learning models, particularly in natural language processing and computer vision tasks. The attention mechanism allows the model to weigh the importance of different input elements, such as words or pixels, and allocate more attention to the elements that are most relevant for the task at hand. This is typically achieved through a set of attention weights, which are learned during training and are used to compute a weighted sum of the input elements. The attention mechanism can be used to model complex dependencies and relationships between different parts of the input data, and has been shown to be effective in a wide range of applications.


In [None]:
# see where the answer came from
response.metadata

{'31b40151-7d77-440a-91e3-5911a070dff9': {'file_path': '/home/sujee/my-stuff/projects/nebius/token-factory-cookbook-1/rag/rag-pdf-llama-index/data/attention.pdf',
  'file_name': 'attention.pdf',
  'file_type': 'application/pdf',
  'file_size': 2215244,
  'creation_date': '2025-07-07',
  'last_modified_date': '2025-07-07'},
 '0409e07f-8930-4a01-806d-b94ce9301994': {'file_path': '/home/sujee/my-stuff/projects/nebius/token-factory-cookbook-1/rag/rag-pdf-llama-index/data/attention.pdf',
  'file_name': 'attention.pdf',
  'file_type': 'application/pdf',
  'file_size': 2215244,
  'creation_date': '2025-07-07',
  'last_modified_date': '2025-07-07'}}
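
`response.metadata` only lists file-level details. To see which passages were retrieved and how well they matched, the response's `source_nodes` can be inspected; the query engine retrieves the top 2 chunks by default, which is consistent with the two entries above. A minimal sketch, assuming each source node exposes `score`, `metadata`, and `get_content()` the way LlamaIndex's `NodeWithScore` does; `summarize_sources` is a hypothetical helper:

```python
def summarize_sources(source_nodes, preview_chars=80):
    """Return one summary line per retrieved chunk: score, file name, text preview."""
    lines = []
    for sn in source_nodes:
        # The score can be None for retrievers that do not rank by similarity.
        score = "n/a" if sn.score is None else f"{sn.score:.3f}"
        file_name = sn.metadata.get("file_name", "<unknown>")
        # Collapse whitespace so the preview fits on a single line.
        preview = " ".join(sn.get_content().split())[:preview_chars]
        lines.append(f"[{score}] {file_name}: {preview}")
    return lines


# In the notebook (similarity_top_k is optional; shown here to retrieve more chunks):
# response = index.as_query_engine(similarity_top_k=3).query("What is attention mechanism?")
# for line in summarize_sources(response.source_nodes):
#     print(line)
```

Low scores or unrelated previews are a hint that the answer may rest on weak context, in which case a larger `similarity_top_k` or a different embedding model is worth trying.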