# Query PDF documents using RAG (Llama-Index + Nebius AI)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/nebius/ai-studio-cookbook/blob/main/rag/rag-pdf-llama-index/rag_pdf_query.ipynb)
[![](https://img.shields.io/badge/Powered%20by-Nebius%20AI-orange?style=flat&labelColor=orange&color=green)](https://nebius.com/ai-studio)

This example shows how to query a PDF using the [LlamaIndex](https://docs.llamaindex.ai/en/stable/) framework, with the LLM running on [Nebius AI Studio](https://studio.nebius.com/).

[Read more about it here](https://github.com/nebius/ai-studio-cookbook/blob/main/rag/rag-pdf-llama-index/README.md)


## References and Acknowledgements

- [LlamaIndex documentation](https://docs.llamaindex.ai/en/stable/)
- [Nebius AI Studio](https://studio.nebius.com/)
- [Nebius AI Studio documentation](https://docs.nebius.com/studio/inference/quickstart)

## Prerequisites

- Nebius API key. Sign up for free at [AI Studio](https://studio.nebius.com/).

## 1 - Setup

### 1.1 - If running on Google Colab

Add `NEBIUS_API_KEY` to **Secrets** as follows:

![](https://github.com/nebius/ai-studio-cookbook/raw/main/images/google-colab-1.png)


### 1.2 - If running locally

Create a `.env` file containing `NEBIUS_API_KEY` as follows:

```text
NEBIUS_API_KEY=your_api_key_goes_here
```
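`load_dotenv()` from python-dotenv reads this file and exports its entries as environment variables, which is why `os.getenv('NEBIUS_API_KEY')` can find the key later. To illustrate the idea only, here is a simplified, hypothetical loader (`parse_env_lines`, `load_env_file`). The real python-dotenv also handles `export` prefixes, multi-line values and variable expansion, so use it rather than this sketch.

```python
import os


def parse_env_lines(lines):
    """Parse KEY=VALUE lines the way a simple .env loader might (simplified sketch)."""
    values = {}
    for raw in lines:
        line = raw.strip()
        # skip blank lines, comments and anything that is not KEY=VALUE
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # drop surrounding quotes such as KEY="value" or KEY='value'
        values[key.strip()] = value.strip().strip("\"'")
    return values


def load_env_file(path=".env", override=False):
    """Copy parsed pairs into os.environ; existing variables win unless override=True."""
    with open(path, encoding="utf-8") as f:
        values = parse_env_lines(f)
    for key, value in values.items():
        if override or key not in os.environ:
            os.environ[key] = value
    return values
```

Like `load_dotenv()`, the sketch does not overwrite variables that are already set unless asked to, so a key exported in your shell takes precedence over the file.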



## 2 - Install Dependencies

In [1]:
import os

# Colab sets the COLAB_RELEASE_TAG environment variable
if os.getenv("COLAB_RELEASE_TAG"):
    RUNNING_ON_COLAB = True
    print("Running in Colab")
else:
    RUNNING_ON_COLAB = False
    print("NOT Running in Colab")

NOT Running in Colab


In [2]:
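if RUNNING_ON_COLAB:
    # Install the required packages
    # (when running locally, install the same packages in your own environment)
    # llama-index-readers-file provides the PDF reader used by SimpleDirectoryReader
    !pip install -q llama-index-llms-litellm \
                    llama-index-llms-nebius \
                    llama-index-embeddings-nebius \
                    llama-index-embeddings-huggingface \
                    llama-index-readers-file \
                    python-dotenv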

## 3 - Load Configuration

In [3]:
import os, sys

## Recommended way of getting configuration
if RUNNING_ON_COLAB:
    from google.colab import userdata
    NEBIUS_API_KEY = userdata.get('NEBIUS_API_KEY')
else:
    from dotenv import load_dotenv
    load_dotenv()
    NEBIUS_API_KEY = os.getenv('NEBIUS_API_KEY')


## Quick hack (not recommended): hardcode the API key here
# NEBIUS_API_KEY = "your_key_here"

if NEBIUS_API_KEY:
    print('✅ NEBIUS_API_KEY found')
    os.environ['NEBIUS_API_KEY'] = NEBIUS_API_KEY
else:
    raise RuntimeError('❌ NEBIUS_API_KEY NOT found')

✅ NEBIUS_API_KEY found
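The check above only confirms the key is non-empty, so a `.env` file that still contains the template value `your_api_key_goes_here` would pass and fail later at the first API call. A minimal sketch of a stricter check, assuming the placeholder strings used in this notebook are the ones worth catching; `check_api_key` and `PLACEHOLDER_KEYS` are hypothetical names.

```python
# Placeholder values taken from the templates in this notebook (an assumption, extend as needed)
PLACEHOLDER_KEYS = {"your_api_key_goes_here", "your_key_here"}


def check_api_key(value):
    """Return the trimmed key, or raise RuntimeError if it is missing or a placeholder."""
    if value is None or not value.strip():
        raise RuntimeError("❌ NEBIUS_API_KEY NOT found")
    key = value.strip()
    if key in PLACEHOLDER_KEYS:
        raise RuntimeError("❌ NEBIUS_API_KEY is still the template placeholder")
    return key
```

Usage would be `NEBIUS_API_KEY = check_api_key(NEBIUS_API_KEY)` before exporting the key to `os.environ`.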


## 4 - Data

In [4]:
import os

# When running locally, put the PDF(s) to query in this folder
input_dir = 'data'

if RUNNING_ON_COLAB:
    os.makedirs(input_dir, exist_ok=True)
    !wget -O '{input_dir}/attention.pdf' 'https://raw.githubusercontent.com/nebius/ai-studio-cookbook/main/data/whitepapers/attention-is-all-you-need.pdf'
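The download above only runs on Colab; locally, the notebook expects the PDF to already be in `data/`. A minimal pure-Python equivalent of the `wget` call, assuming you want the same paper locally. `ensure_pdf` is a hypothetical helper built on the standard library's `urllib.request.urlretrieve`; it skips the download when the file already exists, so re-running the cell is cheap.

```python
import os
import urllib.request

# Same URL the Colab cell passes to wget
PDF_URL = ("https://raw.githubusercontent.com/nebius/ai-studio-cookbook/"
           "main/data/whitepapers/attention-is-all-you-need.pdf")


def ensure_pdf(url, dest):
    """Download url to dest unless the file already exists; return dest."""
    if os.path.exists(dest):
        return dest  # already downloaded, skip the network call
    os.makedirs(os.path.dirname(dest) or ".", exist_ok=True)
    urllib.request.urlretrieve(url, dest)
    return dest


# Usage outside Colab (downloads about 2.2 MB on the first run):
# ensure_pdf(PDF_URL, os.path.join("data", "attention.pdf"))
```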
    

## 5 - Set Up the Embedding Model

We can run the embedding model either locally or in the cloud.

If running locally:
- choose smaller models
- expect lower accuracy, but faster embedding and no API calls

If running in the cloud:
- we can use large models (billions of parameters)

In [5]:
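## Local model
import os
## Optional: download models through a Hugging Face mirror (e.g. if huggingface.co is unreachable).
## It must be set before the imports below to take effect.
# os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'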

from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings

## Running embedding models on Nebius cloud
# from llama_index.embeddings.nebius import NebiusEmbedding
# Settings.embed_model = NebiusEmbedding(
#                         model_name='BAAI/bge-en-icl',
#                         api_key=os.getenv("NEBIUS_API_KEY")  # if not specified here, it is read from the environment
#                        )

## Try out a few open-source embedding models locally
Settings.embed_model = HuggingFaceEmbedding(
    # model_name = 'sentence-transformers/all-MiniLM-L6-v2'  # 23M params
    model_name = 'BAAI/bge-small-en-v1.5'  # 33M params
    # model_name = 'Qwen/Qwen3-Embedding-0.6B'  # 600M params
    # model_name = 'BAAI/bge-en-icl'  # 7B params
    # model_name = 'intfloat/multilingual-e5-large-instruct'  # 560M params
)
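Whichever model is chosen, its job is to turn each text chunk (and later the query) into a vector; retrieval then ranks chunks by how close their vectors are to the query vector. The sketch below shows that ranking with cosine similarity on made-up 3-dimensional vectors (real models such as `BAAI/bge-small-en-v1.5` produce vectors with hundreds of dimensions). It illustrates the idea only and is not LlamaIndex's vector store code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Indices of the k chunk vectors most similar to query_vec, best first."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine_similarity(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy "embeddings" for three chunks and one query (made-up numbers)
chunk_vecs = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
query_vec = [0.9, 0.1, 0.0]
print(top_k(query_vec, chunk_vecs))  # [0, 1]
```

A larger embedding model changes the numbers in these vectors, not the ranking step, which is why swapping `model_name` above needs no other code changes (the index does have to be rebuilt).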



## 6 - Set Up LlamaIndex with Nebius

We can use `llama_index.llms.nebius.NebiusLLM` or `llama_index.llms.litellm.LiteLLM`.

See the examples below.

In [6]:
from llama_index.llms.nebius import NebiusLLM
from llama_index.llms.litellm import LiteLLM
from llama_index.core import Settings

Settings.llm = NebiusLLM(
    model='meta-llama/Llama-3.3-70B-Instruct',
    # model='deepseek-ai/DeepSeek-R1-0528',
    # model='Qwen/Qwen3-30B-A3B',
    api_key=os.getenv("NEBIUS_API_KEY")  # if not specified, it is read from the environment
)

## Alternative: the same models through LiteLLM (note the 'nebius/' prefix)
# Settings.llm = LiteLLM(
#     model='nebius/meta-llama/Llama-3.3-70B-Instruct',
#     # model='nebius/deepseek-ai/DeepSeek-R1-0528',
#     # model='nebius/Qwen/Qwen3-30B-A3B',
#     api_key=os.getenv("NEBIUS_API_KEY")  # if not specified, it is read from the environment
# )
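The two wrappers name the same models differently: `NebiusLLM` takes the plain model id, while the commented-out `LiteLLM` example prefixes it with `nebius/` so LiteLLM knows which provider to route the request to. If you switch between them, a small helper can keep a single list of model ids. `MODEL_IDS` and `litellm_model_name` below are hypothetical names, a sketch assuming the `nebius/` prefix shown above is the only difference.

```python
# Nebius model ids, as passed to NebiusLLM
MODEL_IDS = [
    "meta-llama/Llama-3.3-70B-Instruct",
    "deepseek-ai/DeepSeek-R1-0528",
    "Qwen/Qwen3-30B-A3B",
]


def litellm_model_name(model_id, provider="nebius"):
    """Add the LiteLLM provider prefix ('nebius/...') unless it is already there."""
    prefix = f"{provider}/"
    return model_id if model_id.startswith(prefix) else prefix + model_id


for model_id in MODEL_IDS:
    print(model_id, "->", litellm_model_name(model_id))
```

Usage would then be `LiteLLM(model=litellm_model_name(MODEL_IDS[0]))` alongside `NebiusLLM(model=MODEL_IDS[0])`.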

## 7 - Read PDFs

In [7]:
import os
import glob

pattern = os.path.join(input_dir, '*.pdf')
input_file_count = len(glob.glob(pattern, recursive=True))
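`recursive=True` only changes `glob`'s behaviour when the pattern contains `**`; with `'*.pdf'` the cell above counts top-level PDFs only, which matches `SimpleDirectoryReader`'s default of not descending into subfolders. Note also that `SimpleDirectoryReader` loads every file in the folder, not only PDFs, unless you pass `required_exts=[".pdf"]`. The hypothetical helper below makes the two glob cases explicit; if you count recursively, pass `recursive=True` to `SimpleDirectoryReader` as well so the two numbers agree.

```python
import glob
import os


def count_pdfs(input_dir, recursive=False):
    """Count .pdf files in input_dir, optionally including subfolders."""
    if recursive:
        # '**' together with recursive=True matches input_dir itself and every subfolder
        pattern = os.path.join(input_dir, "**", "*.pdf")
        return len(glob.glob(pattern, recursive=True))
    return len(glob.glob(os.path.join(input_dir, "*.pdf")))
```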

In [8]:
from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader(input_dir).load_data()
print(f'Loaded {len(documents)} docs from {input_file_count} files')


Loaded 1 docs from 1 files


In [9]:
%%time

from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents)

CPU times: user 11.6 s, sys: 344 ms, total: 12 s
Wall time: 9.78 s
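`VectorStoreIndex.from_documents` does more than store the text: it splits each document into smaller chunks (nodes), embeds every chunk with `Settings.embed_model`, and keeps the vectors in memory. LlamaIndex's default splitter is sentence-aware and measures chunk size in tokens (tunable through `Settings.chunk_size` and `Settings.chunk_overlap`); the simplified sketch below counts words instead, just to show how a fixed chunk size with overlap works. `chunk_words` is a hypothetical helper, not a LlamaIndex API.

```python
def chunk_words(text, chunk_size=8, overlap=2):
    """Split text into chunks of chunk_size words; consecutive chunks share `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reached the end of the text
    return chunks


print(chunk_words("a b c d e f g h i j k l", chunk_size=5, overlap=2))
# ['a b c d e', 'd e f g h', 'g h i j k', 'j k l']
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.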


## 8 - Query Documents

In [10]:
response = index.as_query_engine().query("What is attention mechanism?")
print(response)

Attention mechanism appears to be a concept related to artificial intelligence and machine learning, with various subsections and topics discussed, including scaled dot-product attention, multi-head attention, applications of attention, and self-attention. It seems to be a key component in training and optimizing models, with discussions on positional encoding, embeddings, and softmax. However, the exact definition and explanation of the attention mechanism are not explicitly provided.
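Under the hood, `as_query_engine()` retrieves the chunks most similar to the question (2 by default, which matches the two entries in `response.metadata` below) and asks the LLM to answer from them. The sketch below shows that retrieve-then-generate flow with a hypothetical prompt template and stub functions in place of the index and the Nebius-hosted LLM; LlamaIndex's actual prompt wording differs. Because only a couple of chunks reach the LLM, a vague answer like the one above can mean the relevant passage was not retrieved; `index.as_query_engine(similarity_top_k=5)` is one way to pass more context.

```python
def build_prompt(question, chunks):
    """Stuff the retrieved chunks into a QA prompt (hypothetical template)."""
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


def answer(question, retrieve, llm, top_k=2):
    """retrieve(question, k) returns chunk strings; llm(prompt) returns the answer text."""
    chunks = retrieve(question, top_k)
    return llm(build_prompt(question, chunks))


# Stubs standing in for the vector index and the LLM
docs = ["chunk one text", "chunk two text", "chunk three text"]


def fake_retrieve(question, k):
    return docs[:k]  # pretend these are the k most similar chunks


def echo_llm(prompt):
    return prompt  # return the prompt itself so it can be inspected


print(answer("What is attention mechanism?", fake_retrieve, echo_llm))
```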


In [11]:
# see where the answer came from
response.metadata

{'31b40151-7d77-440a-91e3-5911a070dff9': {'file_path': '/home/sujee/my-stuff/projects/nebius/ai-studio-cookbook-1/rag/rag-pdf-llama-index/data/attention.pdf',
  'file_name': 'attention.pdf',
  'file_type': 'application/pdf',
  'file_size': 2215244,
  'creation_date': '2025-07-07',
  'last_modified_date': '2025-07-07'},
 '0409e07f-8930-4a01-806d-b94ce9301994': {'file_path': '/home/sujee/my-stuff/projects/nebius/ai-studio-cookbook-1/rag/rag-pdf-llama-index/data/attention.pdf',
  'file_name': 'attention.pdf',
  'file_type': 'application/pdf',
  'file_size': 2215244,
  'creation_date': '2025-07-07',
  'last_modified_date': '2025-07-07'}}
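`response.metadata` is keyed by node id, so with several documents it quickly gets long. A small hypothetical helper can tally which files the retrieved chunks came from. For a closer look, `response.source_nodes` exposes each retrieved chunk with its similarity score (`.score`, and `.node.get_content()` for the text), which helps judge whether retrieval found the right passages.

```python
def summarize_sources(metadata):
    """Count how many retrieved chunks came from each source file."""
    counts = {}
    for node_metadata in metadata.values():
        name = node_metadata.get("file_name", "<unknown>")
        counts[name] = counts.get(name, 0) + 1
    return counts


# With the response above: summarize_sources(response.metadata) -> {'attention.pdf': 2}
```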