## NOTE:
This notebook modifies Project 6 Step 2 (Integrating Open-Source LLMs into RAG (with a PDF!)). It builds an open-source LLM alternative (Mistral, Phi-2, or TinyLlama) for the same RAG pipeline previously implemented on the Sample Contract Document.

## Open-Source LLM Model Code (Mistral)

In [None]:
# Install required libraries with CUDA support
!pip install -q torch

In [None]:
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

CUDA available: True
GPU: Tesla T4


In [None]:
# Check CUDA version first
!nvcc --version

# Install llama-cpp-python with CUDA 12.x support
!pip install --no-cache-dir llama-cpp-python==0.2.90 --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu123

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Jun__6_02:18:23_PDT_2024
Cuda compilation tools, release 12.5, V12.5.82
Build cuda_12.5.r12.5/compiler.34385749_0
Looking in indexes: https://pypi.org/simple, https://abetlen.github.io/llama-cpp-python/whl/cu123
Collecting llama-cpp-python==0.2.90
  Downloading https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.90-cu123/llama_cpp_python-0.2.90-cp312-cp312-linux_x86_64.whl (444.5 MB)
Collecting diskcache>=5.6.1 (from llama-cpp-python==0.2.90)
  Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB)
Downloading diskcache-5.6.3-py3-none-any.whl (45 kB)
Installing collected packages: diskcache, llama-cpp-python
Successfully installe

In [None]:
!pip install llama-index

Collecting llama-index
  Downloading llama_index-0.14.4-py3-none-any.whl.metadata (13 kB)
Collecting llama-index-cli<0.6,>=0.5.0 (from llama-index)
  Downloading llama_index_cli-0.5.3-py3-none-any.whl.metadata (1.4 kB)
Collecting llama-index-core<0.15.0,>=0.14.4 (from llama-index)
  Downloading llama_index_core-0.14.4-py3-none-any.whl.metadata (2.5 kB)
Collecting llama-index-embeddings-openai<0.6,>=0.5.0 (from llama-index)
  Downloading llama_index_embeddings_openai-0.5.1-py3-none-any.whl.metadata (400 bytes)
Collecting llama-index-indices-managed-llama-cloud>=0.4.0 (from llama-index)
  Downloading llama_index_indices_managed_llama_cloud-0.9.4-py3-none-any.whl.metadata (3.7 kB)
Collecting llama-index-llms-openai<0.7,>=0.6.0 (from llama-index)
  Downloading llama_index_llms_openai-0.6.1-py3-none-any.whl.metadata (3.0 kB)
Collecting llama-index-readers-file<0.6,>=0.5.0 (from llama-index)
  Downloading llama_index_readers_file-0.5.4-py3-none-any.whl.metadata (5.7 kB)
Collecting llama-inde

In [None]:
from llama_cpp import Llama
import os

# Download Mistral model if not already present
model_path = "/content/mistral.gguf"
if not os.path.exists(model_path):
    !wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf -O {model_path}
    print(f"Model downloaded to {model_path}")

# Verify file exists and check size
if os.path.exists(model_path):
    print(f"Model file exists. Size: {os.path.getsize(model_path) / (1024 * 1024):.2f} MB")
else:
    print("Model file not found!")

# Load the model with GPU acceleration
try:
    llm = Llama(
        model_path=model_path,
        n_gpu_layers=1,  # Offload only 1 layer to the GPU as a cautious start; -1 offloads all layers
        n_ctx=2048,      # Context window size in tokens
        verbose=True     # Print llama.cpp loading logs
    )
    print("Model loaded successfully!")
except Exception as e:
    print(f"Error loading model: {e}")

--2025-10-03 17:51:37--  https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf
Resolving huggingface.co (huggingface.co)... 18.164.174.17, 18.164.174.55, 18.164.174.118, ...
Connecting to huggingface.co (huggingface.co)|18.164.174.17|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cas-bridge.xethub.hf.co/xet-bridge-us/65778ac662d3ac1817cc9201/865f5e4682dddb29c2e20270b2471a7590c83a414bbf1d72cf4c08fdff2eeca4?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=cas%2F20251003%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20251003T174403Z&X-Amz-Expires=3600&X-Amz-Signature=c54013192af0fe4a99d8ce51767ee5839fba4e15ee4405d6ed9774a44a996d8b&X-Amz-SignedHeaders=host&X-Xet-Cas-Uid=public&response-content-disposition=inline%3B+filename*%3DUTF-8%27%27mistral-7b-instruct-v0.2.Q4_K_M.gguf%3B+filename%3D%22mistral-7b-instruct-v0.2.Q4_K_M.gguf%22%3B&x-id=GetObject&Expires=17

llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from /content/mistral.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.2
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.

Model downloaded to /content/mistral.gguf
Model file exists. Size: 4166.07 MB


llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 4.07 GiB (4.83 BPW) 
llm_load_print_meta: general.name     = mistralai_mistral-7b-instruct-v0.2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: max token length = 48
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    yes
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: Tesla T4, compute capability 7.5, VMM: yes
llm_load_tensors: ggml ctx size =    0.27 MiB
llm_load_tensors: offloading 1 rep

Model loaded successfully!


AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
Model metadata: {'tokenizer.chat_template': "{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}", 'tokenizer.ggml.add_eos_token': 'false', 'tokenizer.ggml.padding_token_id': '0', 'tokenizer.ggml.unknown_token_id': '0', 'tokenizer.ggml.eos_token_id': '2', 'general.architecture': 'llama', 'llama.rope.freq_base': 
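
The `tokenizer.chat_template` entry in the metadata above describes how Mistral-7B-Instruct expects a conversation to be laid out. As a rough illustration, the sketch below re-implements that template in plain Python. The function name `format_mistral_prompt` is hypothetical, and the BOS/EOS strings are taken from the load log (`<s>` and `</s>`). Whether the BOS token belongs in the prompt string depends on how the tokenizer is invoked, so treat this as a reading aid rather than a drop-in prompt builder.

```python
BOS_TOKEN = "<s>"   # BOS token reported in the load log
EOS_TOKEN = "</s>"  # EOS token reported in the load log


def format_mistral_prompt(messages):
    """Render chat messages following the chat template shown in the model metadata."""
    parts = [BOS_TOKEN]
    for i, message in enumerate(messages):
        role, content = message["role"], message["content"]
        # Roles must alternate user/assistant, starting with user.
        if (role == "user") != (i % 2 == 0):
            raise ValueError(
                "Conversation roles must alternate user/assistant/user/assistant/..."
            )
        if role == "user":
            parts.append("[INST] " + content + " [/INST]")
        elif role == "assistant":
            parts.append(content + EOS_TOKEN)
        else:
            raise ValueError("Only user and assistant roles are supported!")
    return "".join(parts)


prompt = format_mistral_prompt(
    [{"role": "user", "content": "What is the refund policy?"}]
)
print(prompt)  # <s>[INST] What is the refund policy? [/INST]
```

Note that the template has no system role, so any instructions have to be folded into the first user message.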

In [None]:
!pip install pymupdf
!pip install llama-index-llms-llama-cpp

Collecting pymupdf
  Downloading pymupdf-1.26.4-cp39-abi3-manylinux_2_28_x86_64.whl.metadata (3.4 kB)
Downloading pymupdf-1.26.4-cp39-abi3-manylinux_2_28_x86_64.whl (24.1 MB)
Installing collected packages: pymupdf
Successfully installed pymupdf-1.26.4
Collecting llama-index-llms-llama-cpp
  Downloading llama_index_llms_llama_cpp-0.5.1-py3-none-any.whl.metadata (4.2 kB)
Collecting llama-cpp-python<0.4,>=0.3.0 (from llama-index-llms-llama-cpp)
  Downloading llama_cpp_python-0.3.16.tar.gz (50.7 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Downloading llama_inde

In [None]:
!pip install llama-index-embeddings-huggingface

Collecting llama-index-embeddings-huggingface
  Downloading llama_index_embeddings_huggingface-0.6.1-py3-none-any.whl.metadata (458 bytes)
Downloading llama_index_embeddings_huggingface-0.6.1-py3-none-any.whl (8.9 kB)
Installing collected packages: llama-index-embeddings-huggingface
Successfully installed llama-index-embeddings-huggingface-0.6.1


In [None]:
import fitz  # PyMuPDF

# Load the sample contract PDF
pdf_path = "/content/sample_contract.pdf"
doc = fitz.open(pdf_path)

# Extract text from all pages
text = "\n".join([page.get_text() for page in doc])

print(f"Extracted {len(text.split())} words from the PDF.")

Extracted 315 words from the PDF.
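
Before indexing, a RAG pipeline splits the extracted text into chunks that are embedded and retrieved individually. The sketch below shows the general idea with fixed-size, overlapping word windows. It is not LlamaIndex's splitter (which, to my understanding, is token-based and sentence-aware), and the names `chunk_words`, `chunk_size`, and `overlap` as well as the values 200 and 20 are illustrative assumptions.

```python
def chunk_words(text, chunk_size=200, overlap=20):
    """Split text into overlapping windows of whitespace-separated words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


# Stand-in for the 315-word contract text extracted above.
sample = " ".join(f"w{i}" for i in range(315))
chunks = chunk_words(sample, chunk_size=200, overlap=20)
print(len(chunks), [len(c.split()) for c in chunks])  # 2 [200, 135]
```

With a document this short, a larger chunk size would put the whole contract into a single chunk, in which case retrieval has nothing to choose between.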


In [None]:
from llama_index.core import VectorStoreIndex, Document, get_response_synthesizer
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.settings import Settings
from llama_index.llms.llama_cpp import LlamaCPP
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Configure the LLM
llm = LlamaCPP(
    model_path="/content/mistral.gguf",
    temperature=0.7,
    max_new_tokens=512,
    context_window=2048,
    model_kwargs={"n_gpu_layers": 1}
)

# Configure open-source embedding model
embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5"  # Lightweight but effective embedding model
)

# Set as the default LLM and embedding model
Settings.llm = llm
Settings.embed_model = embed_model

# Wrap the contract text extracted from the PDF in a Document
documents = [Document(text=text)]

# Build index
index = VectorStoreIndex.from_documents(documents)

# Configure retriever
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=2,  # Retrieve 2 most similar chunks
)

# Configure response synthesizer
response_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize",
)

llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from /content/mistral.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.2
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.
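
The retriever configured above returns the `similarity_top_k=2` chunks whose embeddings are closest to the query embedding. As a minimal sketch of that ranking step, the code below scores toy 3-dimensional vectors by cosine similarity and keeps the two best. The helper names and vectors are made up for illustration; real embeddings from `BAAI/bge-small-en-v1.5` have many more dimensions, and LlamaIndex's own vector store handles this internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Return the indices of the k chunks most similar to the query."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


# Toy embeddings: chunks 0 and 1 point roughly the same way as the query.
chunk_vecs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
query_vec = [1.0, 0.0, 0.0]
print(top_k(query_vec, chunk_vecs, k=2))  # [0, 1]
```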


In [None]:
queries = [
    "What are the penalties for late payments?",
    "Summarize the key terms in this contract.",
    "What is the refund policy?",
]

# Assemble the query engine from the retriever and response synthesizer configured above
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)

# Run all three
for q in queries:
    print("\n" + "="*90)
    print(q)
    print("="*90)
    response = query_engine.query(q)
    print(response)


What are the penalties for late payments?



llama_print_timings:        load time =    2129.12 ms
llama_print_timings:      sample time =       0.76 ms /    16 runs   (    0.05 ms per token, 21108.18 tokens per second)
llama_print_timings: prompt eval time =    3773.20 ms /   555 tokens (    6.80 ms per token,   147.09 tokens per second)
llama_print_timings:        eval time =    8193.68 ms /    15 runs   (  546.25 ms per token,     1.83 tokens per second)
llama_print_timings:       total time =   11978.92 ms /   570 tokens
Llama.generate: 541 prefix-match hit, remaining 15 prompt tokens to eval


1.5% per month from the due date until paid in full.

Summarize the key terms in this contract.



llama_print_timings:        load time =    2129.12 ms
llama_print_timings:      sample time =       9.30 ms /   192 runs   (    0.05 ms per token, 20640.72 tokens per second)
llama_print_timings: prompt eval time =    6357.91 ms /    15 tokens (  423.86 ms per token,     2.36 tokens per second)
llama_print_timings:        eval time =  110849.58 ms /   191 runs   (  580.36 ms per token,     1.72 tokens per second)
llama_print_timings:       total time =  117371.16 ms /   206 tokens
Llama.generate: 541 prefix-match hit, remaining 11 prompt tokens to eval



This contract is between ABC Company Inc. (Service Provider) and XYZ Corporation (Client), entered into on January 15, 2025. The Service Provider agrees to provide consulting services (Services) to the Client as described in Exhibit A. The Services shall be performed in accordance with industry standards and practices. The Client agrees to pay the Service Provider at the rates specified in Exhibit B on a monthly basis, with net 30-day payment terms. Late payments will accrue interest at 1.5% per month. The Agreement has a one-year term, which can be terminated by either party with thirty (30) days written notice. The Client may request a refund within 14 days of service delivery, but no refunds will be issued for completed projects. Both parties acknowledge the confidentiality of each other's information and agree to maintain confidentiality.

What is the refund policy?



llama_print_timings:        load time =    2129.12 ms
llama_print_timings:      sample time =       3.96 ms /    79 runs   (    0.05 ms per token, 19939.42 tokens per second)
llama_print_timings: prompt eval time =    3889.22 ms /    11 tokens (  353.56 ms per token,     2.83 tokens per second)
llama_print_timings:        eval time =   45901.87 ms /    78 runs   (  588.49 ms per token,     1.70 tokens per second)
llama_print_timings:       total time =   49849.22 ms /    89 tokens


1. If Client is dissatisfied with the Services, Client may request a refund within 14 days of service delivery. 2. Refunds are issued at the sole discretion of Service Provider and will be processed within 30 days of approval. 3. No refunds will be issued for completed projects that meet the specifications outlined in Exhibit A.
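
The `llama_print_timings` lines in the output above report how long each stage took. A small parser makes the generation speed easier to compare across queries. The sketch below is an assumption-laden helper (the regex and the name `tokens_per_second` are mine, written against the log format shown in this notebook) that recomputes throughput from the reported milliseconds and counts.

```python
import re

# Matches lines such as:
# "llama_print_timings:  eval time =  8193.68 ms /  15 runs  ( ... )"
TIMING_RE = re.compile(
    r"llama_print_timings:\s+(?P<stage>[a-z ]+?) time =\s*(?P<ms>[\d.]+) ms /\s*(?P<count>\d+) (?:runs|tokens)"
)


def tokens_per_second(log_line):
    """Return (stage, tokens per second) for a timing line, or None if it has no count."""
    match = TIMING_RE.search(log_line)
    if match is None:
        return None
    seconds = float(match.group("ms")) / 1000.0
    return match.group("stage").strip(), int(match.group("count")) / seconds


# The three "eval time" lines copied from the query outputs above.
eval_lines = [
    "llama_print_timings:        eval time =    8193.68 ms /    15 runs   (  546.25 ms per token,     1.83 tokens per second)",
    "llama_print_timings:        eval time =  110849.58 ms /   191 runs   (  580.36 ms per token,     1.72 tokens per second)",
    "llama_print_timings:        eval time =   45901.87 ms /    78 runs   (  588.49 ms per token,     1.70 tokens per second)",
]
for line in eval_lines:
    stage, tps = tokens_per_second(line)
    print(f"{stage}: {tps:.2f} tokens/s")
```

Generation runs at roughly 1.7 to 1.8 tokens per second in all three queries. This may reflect that `n_gpu_layers=1` leaves most of the model's 32 blocks on the CPU, so raising that value is a reasonable first thing to try if latency matters.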


## Closed-Source LLM Model Code (Gemini)

In [None]:
# (run once if needed)
!pip -q install llama-index-llms-gemini

In [None]:
import os
from llama_index.llms.gemini import Gemini
from llama_index.core.response_synthesizers import get_response_synthesizer
from llama_index.core.query_engine import RetrieverQueryEngine

# Read the API key from the environment rather than hardcoding it in the notebook
GOOGLE_API_KEY = os.environ.get("GOOGLE_API_KEY", "")

# Fail early with a clear message if the key is missing
if not GOOGLE_API_KEY:
    raise RuntimeError("Set GOOGLE_API_KEY before running Gemini. Example: os.environ['GOOGLE_API_KEY']='YOUR_KEY'")

# Initialize Gemini (same model as in the earlier step; adjust temperature / max_tokens as needed)
gemini_llm = Gemini(model="models/gemini-2.0-flash", temperature=0.2, max_tokens=400)

# Build a response synthesizer that uses Gemini
response_synthesizer = get_response_synthesizer(response_mode="compact", llm=gemini_llm)

# Assemble query engine with your existing retriever
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)

# The three suggested queries
queries = [
    "What are the penalties for late payments?",
    "Summarize the key terms in this contract.",
    "What is the refund policy?",
]

for q in queries:
    print("\n" + "="*90)
    print(q)
    print("="*90)
    resp = query_engine.query(q)
    print(resp)




What are the penalties for late payments?
Late payments will incur interest at a rate of 1.5% per month from the due date until the payment is made in full.


Summarize the key terms in this contract.
This agreement, which is effective as of January 15, 2025, is between ABC Company Inc. and XYZ Corporation. ABC Company Inc. will provide consulting services to XYZ Corporation as described in Exhibit A, and will invoice them monthly, with payments due within 30 days. Late payments will incur a 1.5% monthly interest fee. The agreement lasts for one year, but can be terminated by either party with 30 days written notice. Refunds can be requested within 14 days of service delivery, but are not available for completed projects meeting Exhibit A specifications, and are processed within 30 days of approval. Both parties agree to keep each other's confidential information private.


What is the refund policy?
If a client is not happy with the services, they can ask for a refund within 14 days 