## Load model

In [1]:
from llama_index.llms.ollama import Ollama

llm = Ollama(
    model="llama3.2:latest",
    request_timeout=300.0,
    additional_kwargs={"num_ctx": 16384, "num_predict": -1},
    batch_size=8,
)

# response = llm.stream_complete("What are the key findings of llama2 paper?")

# for r in response:
#     print(r.delta, end="")

## Load an embedding model

In [4]:
from llama_index.embeddings.fastembed import FastEmbedEmbedding

embed_model = FastEmbedEmbedding(
    model_name="BAAI/bge-small-en-v1.5",
)

Fetching 5 files: 100%|██████████| 5/5 [00:27<00:00,  5.41s/it]


In [5]:
embeddings = embed_model.get_text_embedding("Hello World!")
print(len(embeddings))
print(embeddings[:5])

384
[-0.0033337711356580257, -0.011862986721098423, 0.04161721095442772, -0.03819842264056206, 0.024203987792134285]
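
The 384-dimensional vectors printed above are usually compared with cosine similarity. The block below is a minimal stdlib-only sketch of that measure; the toy 3-dimensional vectors and the `cosine_similarity` helper are illustrative assumptions, not part of the notebook or of the embedding library.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Guard against zero vectors, which have no direction.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real 384-dimensional embeddings (assumption).
v1 = [1.0, 0.0, 0.0]
v2 = [1.0, 1.0, 0.0]
v3 = [0.0, 0.0, 1.0]

sim_12 = cosine_similarity(v1, v2)  # vectors 45 degrees apart
sim_13 = cosine_similarity(v1, v3)  # orthogonal vectors
print(sim_12, sim_13)
```

In the notebook, the same helper could be applied to two results of `embed_model.get_text_embedding(...)` to get a rough sense of how close two sentences are in embedding space.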


## Ingestion Pipeline

In [None]:
# Load data

from llama_index.readers.file import PyMuPDFReader

# Note: load_data() does not accept a show_progress argument; passing one
# raises "TypeError: ... got an unexpected keyword argument 'show_progress'",
# as seen in an earlier run of this cell.
documents = PyMuPDFReader().load_data(file_path="./data/llama2.pdf")
# loader = PyMuPDFReader()
# documents = loader.load(file_path="./data/llama2.pdf")

In [7]:
# Initialize vector database and add nodes to it

from llama_index.core import Settings, StorageContext
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore
import qdrant_client

Settings.embed_model = embed_model

client = qdrant_client.QdrantClient(
    # location=":memory:",
    host="localhost",
    port=6333,
)

vector_store = QdrantVectorStore(client=client, collection_name="rag_demo_collection")
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    transformations=[
        SentenceSplitter(chunk_size=3000, chunk_overlap=400),
    ],
    show_progress=True,
)

Parsing nodes: 100%|██████████| 77/77 [00:00<00:00, 2135.34it/s]
Generating embeddings: 100%|██████████| 77/77 [00:07<00:00,  9.90it/s]
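
The `chunk_size` and `chunk_overlap` settings used above control how much text each node holds and how much neighbouring nodes share. The sketch below illustrates only the overlap idea with a plain fixed-size sliding window over a token list. It is an assumption-laden simplification: `SentenceSplitter` also tries to respect sentence boundaries and counts tokens with a tokenizer, which the hypothetical `chunk_with_overlap` helper here does not attempt.

```python
def chunk_with_overlap(tokens, chunk_size, chunk_overlap):
    """Split `tokens` into windows of `chunk_size` sharing `chunk_overlap` items."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window has reached the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Ten integer "tokens" keep the overlap easy to see (illustrative data).
tokens = list(range(10))
chunks = chunk_with_overlap(tokens, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

Each window starts `chunk_size - chunk_overlap` items after the previous one, so the last item of one chunk reappears as the first item of the next. With the notebook's values (3000 and 400) the same arithmetic gives a stride of 2600 tokens.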


## Retrieval pipeline

In [None]:
from llama_index.core.retrievers import VectorIndexRetriever

# configure retriever
# retriever = VectorIndexRetriever(
#     index=index,
#     similarity_top_k=5,
# )

In [None]:
# query_str = "What are the different variants of the Llama 2 model?"

# response_nodes = retriever.retrieve(query_str)

# for node in response_nodes:
#     # print(node.metadata)
#     print(f"---------------------------------------------")
#     print(f"Score: {node.score:.3f}")
#     print(node.get_content())
#     print(f"---------------------------------------------\n\n")
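
The commented-out retriever above asks for the `similarity_top_k=5` best-scoring nodes. As a rough illustration of what top-k retrieval means, the sketch below scores every stored vector against a query vector and keeps the best `k`. The corpus, the 2-dimensional vectors, and the `top_k` helper are all made up for the example; a real vector store such as Qdrant uses an approximate index rather than this brute-force scan.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity of two non-zero, equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query_vec, corpus, k):
    """Return the k (score, text) pairs most similar to query_vec."""
    scored = [(cosine(query_vec, vec), text) for text, vec in corpus]
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


# Hypothetical chunks with hand-picked 2-dimensional "embeddings".
corpus = [
    ("chunk about model sizes", [0.9, 0.1]),
    ("chunk about safety", [0.1, 0.9]),
    ("chunk about pretraining data", [0.7, 0.3]),
]

results = top_k([1.0, 0.0], corpus, k=2)
for score, text in results:
    print(f"Score: {score:.3f}  {text}")
```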

## Generation pipeline with Query Engine

In [None]:
# # Prompt components

# persona = "You are a world-class research scientist and expert in Large Language Models."
# instruction = "Analyze the provided research paper with extreme attention to detail and provide answer to the user queries."
# data_format = "Organize your response in a clean, well-organized markdown format. Use headings and subheadings to make it easy to read and understand. Use bullet points wherever necessary."
# audience = "While this is for busy researchers, provide complete technical depth. Do not summarize or simplify technical details."
# tone = "The tone should be professional and clear."

# qa_prompt_tmpl = (
#     f"{persona}\n\n"
#     f"{instruction}\n\n"
#     f"{data_format}\n\n"
#     f"{audience}\n\n"
#     f"{tone}\n\n"
#     "Context information is below.\n"
#     "---------------------\n"
#     "{context_str}\n"
#     "---------------------\n"
#     "Using the context information, provide an accurate response to the user query.\n"
#     "Query: {query_str}\n"
#     "Answer: "
# )

In [19]:
# Prompt components

persona = "You are a world-class research scientist and expert in Large Language Models. You can break down complex ideas into comprehensible pieces and fetch key points from long research documents."
instruction = "Analyze the provided context with extreme attention to detail and provide an accurate response to the queries. Include all the key details from each section of the paper."
data_format = "Organize your response in a clean, well-organized markdown format. Use headings and subheadings to make it easy to read and understand. Use bullet points wherever necessary."
audience = "Your audience is AI researchers. Provide complete technical depth. Avoid summarization and do not simplify technical details."
tone = "The tone should be professional and clear."


qa_prompt_tmpl = (
    f"{persona}\n\n"
    f"{data_format}\n\n"
    f"{instruction}\n\n"
    f"{audience}\n\n"
    f"{tone}\n\n"
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Using the context information, provide an accurate response to the user query.\n"
    "Query: {query_str}\n"
    "Answer: "
)
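
The template above mixes two kinds of string literal: the `f"..."` pieces are filled in immediately when the cell runs, while the plain pieces keep `{context_str}` and `{query_str}` as literal placeholders to be filled later. The sketch below shows that behaviour with a shortened, hypothetical persona and uses plain `str.format` as a stand-in for whatever fills the placeholders at query time.

```python
# Shortened stand-in for the notebook's persona string (assumption).
persona = "You are a helpful assistant."

template = (
    f"{persona}\n\n"                    # f-string: interpolated right now
    "Context information is below.\n"   # plain strings: braces stay literal
    "{context_str}\n"
    "Query: {query_str}\n"
    "Answer: "
)

# The persona text is already baked in; the two placeholders are not.
print(template)

# Fill the remaining placeholders, as would happen once per query.
prompt = template.format(
    context_str="Llama 2 ships in several sizes.",
    query_str="Which sizes?",
)
print(prompt)
```

One caveat of this pattern: if an interpolated component such as `persona` itself contained curly braces, the later `format` call would treat them as placeholders, so such text would need its braces doubled.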

In [20]:
from llama_index.core import get_response_synthesizer
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.postprocessor import SimilarityPostprocessor
from llama_index.core import PromptTemplate
from IPython.display import display, Markdown

qa_prompt = PromptTemplate(qa_prompt_tmpl)

# configure response synthesizer
# response_synthesizer = get_response_synthesizer(llm=llm, streaming=True, text_qa_template=qa_prompt)


# # assemble query engine
# query_engine = RetrieverQueryEngine(
#     retriever=retriever,
#     response_synthesizer=response_synthesizer,
#     # node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.7)],
# )

query_engine = index.as_query_engine(
    llm=llm,
    streaming=True,
    text_qa_template=qa_prompt,
)

query_str = "Provide a detailed section-wise break-down of the llama2 paper."
    
streaming_response = query_engine.query(
    query_str,
)

# display(Markdown(streaming_response.response))
streaming_response.print_response_stream()

**Section A.1 Introduction**
The Llama2 paper provides an overview of the advancements made in Large Language Models (LLMs) using the LLaMA-2 architecture. The authors highlight the importance of pre-training on diverse and high-quality datasets to improve the performance and robustness of LLMs.

**Section A.2 Methodology**

* **Pre-training**: The authors used a combination of 7B, 30B, and 40B parameters for the base model and fine-tuned it using a smaller dataset.
* **Fine-tuning**: The model was fine-tuned on specific tasks such as writing, science, history, and more to improve its performance on those domains.
* **Multi-task learning**: The model was trained on multiple tasks simultaneously to learn shared representations across different domains.

**Section A.3 Results**

| Model Size | Average Perplexity (7B) | Average Perplexity (30B) |
| --- | --- | --- |
| 7B | 0.24 | - |
| 30B | 0.28 | 0.36 |
| 40B | 0.38 | 0.53 |

The results show that the model with 30B parameters outperfor

In [21]:
query_str = "How does llama2 perform? Also, add result table showing performance comparison with other models."

streaming_response = query_engine.query(query_str)

streaming_response.print_response_stream()

**Llama2 Performance Analysis**

The provided context information outlines the performance of Llama2 on various benchmarks, including MMLU, BBH, AGI Eval, and more. Here's a detailed analysis of Llama2's performance:

### Model Size Comparison

Llama2 models are compared to other large language models, including MPT, Falcon, Llama 1, and open-source models like GPT-3.5, GPT-4, PaLM, and PaLM-2-L.

*   **MMLU (5-shot)**: Llama2 70B outperforms Llama1 65B by approximately 5 points and MPT models of the corresponding size.
*   **BBH (3-shot)**: Llama2 70B improves results on BBH by around 8 points compared to Llama1 65B.
*   **AGI Eval**: Llama2 models outperform Falcon models in all categories, except code benchmarks.

### Performance Comparison

Here is a summary of the performance comparison between Llama2 and other models:

| Benchmark | Llama2 (70B) | GPT-3.5 | GPT-4 | PaLM | PaLM-2-L |
| --- | --- | --- | --- | --- | --- |
| MMLU (5-shot) | 68.9 | 70.0 | - | 78.3 | - |
| TriviaQA (1