## Load model

In [1]:
from llama_index.llms.ollama import Ollama

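llm = Ollama(
    model="llama3.1:latest",
    request_timeout=1200.0,  # seconds; local generation of long answers can be slow
    context_window=4000,  # maximum context size, in tokens
    additional_kwargs={"num_predict": 100000},  # Ollama option: cap on generated tokens
)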

# response = llm.stream_complete("What are the key findings of llama2 paper?")

# for r in response:
#     print(r.delta, end="")

## Load an embedding model

In [2]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Load BAAI/bge-small-en-v1.5. "mps" selects the Apple Silicon GPU backend;
# use "cuda" or "cpu" on other hardware.
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5", device="mps")



In [3]:
embeddings = embed_model.get_text_embedding("Hello World!")
print(len(embeddings))
print(embeddings[:5])

384
[-0.0032757290173321962, -0.011690833605825901, 0.04155922308564186, -0.03814816474914551, 0.024183081462979317]
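
The 384-element list above is a dense vector, and retrieval later ranks chunks by how close their vectors are to the query vector. The following is a minimal sketch of that comparison, assuming cosine similarity is the metric in use. The helper name `cosine_similarity` and the toy vectors are illustrative and do not come from the notebook.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for real 384-dimensional embeddings.
v1 = [1.0, 0.0, 1.0]
v2 = [2.0, 0.0, 2.0]  # same direction as v1, different length
v3 = [0.0, 1.0, 0.0]  # orthogonal to v1

print(cosine_similarity(v1, v2))
print(cosine_similarity(v1, v3))
```

With the real model, the same helper could be applied to two outputs of `embed_model.get_text_embedding(...)` to compare a pair of sentences.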


## Ingestion Pipeline

In [4]:
# Load data

from llama_index.readers.file import PyMuPDFReader

loader = PyMuPDFReader()
documents = loader.load(file_path="./data/llama2.pdf")

In [5]:
# Initialize vector database and add nodes to it

from llama_index.core import Settings, StorageContext
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore
import qdrant_client

Settings.embed_model = embed_model

# index = VectorStoreIndex.from_documents(
#     documents,
#     transformations=[
#         SentenceSplitter(chunk_size=1024, chunk_overlap=0)
#         ],
#     show_progress=True
#     )

client = qdrant_client.QdrantClient(
    # location=":memory:",
    host="localhost",
    port=6333,
)

vector_store = QdrantVectorStore(client=client, collection_name="pytholic")
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    transformations=[
        SentenceSplitter(chunk_size=1024, chunk_overlap=0),
        ],
    show_progress=True
)


Parsing nodes: 100%|██████████| 77/77 [00:00<00:00, 1087.12it/s]
Generating embeddings: 100%|██████████| 107/107 [00:02<00:00, 44.22it/s]
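
The progress bars above show 77 parsed inputs producing 107 embedded chunks, which is the effect of splitting with `chunk_size=1024` and `chunk_overlap=0`. To illustrate what those two parameters control, here is a deliberately simplified sketch that slides a window over whitespace-separated words. It is not how `SentenceSplitter` works internally (that class counts tokens and tries to respect sentence boundaries); the function `chunk_words` is a hypothetical helper written only for this illustration.

```python
def chunk_words(text, chunk_size, chunk_overlap=0):
    """Split text into windows of `chunk_size` words that share `chunk_overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the window already reached the end of the text
    return chunks


text = "a b c d e f g h i j"
print(chunk_words(text, chunk_size=4, chunk_overlap=0))
print(chunk_words(text, chunk_size=4, chunk_overlap=2))
```

A non-zero overlap repeats the tail of one chunk at the head of the next, which costs extra embeddings but can keep a sentence that straddles a boundary retrievable from either side.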


## Retrieval pipeline

In [6]:
from llama_index.core.retrievers import VectorIndexRetriever

# configure retriever
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=2,
)

In [7]:
query_str = "What are the different variants of the Llama 2 model?"

response_nodes = retriever.retrieve(query_str)

for node in response_nodes:
    # print(node.metadata)
    print("---------------------------------------------")
    print(f"Score: {node.score:.3f}")
    print(node.get_content())
    print("---------------------------------------------\n\n")

---------------------------------------------
Score: 0.811
A.7
Model Card
Table 52 presents a model card (Mitchell et al., 2018; Anil et al., 2023) that summarizes details of the models.
Model Details
Model Developers
Meta AI
Variations
Llama 2 comes in a range of parameter sizes—7B, 13B, and 70B—as well as
pretrained and fine-tuned variations.
Input
Models input text only.
Output
Models generate text only.
Model Architecture
Llama 2 is an auto-regressive language model that uses an optimized transformer
architecture. The tuned versions use supervised fine-tuning (SFT) and reinforce-
ment learning with human feedback (RLHF) to align to human preferences for
helpfulness and safety.
Model Dates
Llama 2 was trained between January 2023 and July 2023.
Status
This is a static model trained on an offline dataset. Future versions of the tuned
models will be released as we improve model safety with community feedback.
License
A custom commercial license is available at:
ai.meta.com/resources/
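
The retriever above is configured with `similarity_top_k=2`, meaning it returns the two chunks whose vectors score highest against the query. A minimal sketch of that ranking step follows, assuming cosine similarity as the score. The chunk labels, the 2-dimensional vectors, and the helper names `cosine` and `top_k` are all made up for illustration; a vector database such as Qdrant does this search with an index rather than a full scan.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query_vec, chunks, k=2):
    """Score every (text, vector) pair and return the k best as (score, text), best first."""
    scored = [(cosine(query_vec, vec), text) for text, vec in chunks]
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


# Hypothetical chunks with toy embeddings.
chunks = [
    ("model card", [0.9, 0.1]),
    ("acknowledgments", [0.1, 0.9]),
    ("pretraining details", [0.7, 0.3]),
]
query = [1.0, 0.0]

for score, text in top_k(query, chunks, k=2):
    print(f"{score:.3f}  {text}")
```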


## Generation pipeline with Query Engine

In [28]:
# Prompt components

persona = """You are an expert in Large Language models.
You excel at breaking down complex papers into digestible key details.\n""" 

# instruction = "Summarize the key findings of the paper provided.\n"

# context = """Extract and highlight the most crucial points from each section that can help 
# researchers quickly understand the most vital information of the paper.\n
# Highlight all the proposed key model variants, performance comparisons, methodologies, 
# training details, and experiments. Engineers really care about experimental details and benchmarks.
# Be as detailed as possible. Your details should be minimum five pages long, encapsulating all the
# important information. Go through each page one-by-one."""

data_format = """Create a bullet-point output that outlines each part.
Follow this up with a concise paragraph that encapsulates the main results.\n"""

audience = """This output is designed for busy researchers who need to
quickly grasp the newest trends in Large Language Models.\n"""

tone = "The tone should be professional and clear.\n"

qa_prompt_tmpl = (
    f"{persona}{data_format}{audience}{tone}"
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the query.\n"
    "Query: {query_str}\n"
    "Answer: "
)
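
Only the first literal in `qa_prompt_tmpl` carries the `f` prefix, so the persona, format, audience, and tone strings are interpolated immediately, while `{context_str}` and `{query_str}` survive as placeholders to be filled at query time. The sketch below demonstrates this with shortened, illustrative strings and uses plain `str.format` as a stand-in for the later substitution step; it is not a claim about how the template class fills values internally.

```python
persona = "You are an expert.\n"
tone = "Be clear.\n"

template = (
    f"{persona}{tone}"          # f-string: interpolated now, when the string is built
    "Context: {context_str}\n"  # plain literal: braces are kept as placeholders
    "Query: {query_str}\n"
    "Answer: "
)

# Fill the remaining placeholders, as would happen once context has been retrieved.
filled = template.format(
    context_str="Llama 2 comes in 7B, 13B, and 70B sizes.",
    query_str="Which sizes exist?",
)

print(template)
print(filled)
```

One consequence of this design is that any literal braces inside the interpolated strings would be read as placeholders by a later `format` call, so they would need to be doubled (`{{` and `}}`).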


In [29]:
from llama_index.core import get_response_synthesizer
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.postprocessor import SimilarityPostprocessor
from llama_index.core import PromptTemplate

qa_prompt = PromptTemplate(qa_prompt_tmpl)

# configure response synthesizer
response_synthesizer = get_response_synthesizer(llm=llm, streaming=True, text_qa_template=qa_prompt)


# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
    # node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.7)],
)

# query
query_str = """Extract and highlight the most crucial points from each section that can help 
researchers quickly understand the most vital information of the paper.\n
Highlight all the proposed key model variants, performance comparisons, methodologies, 
training details, and experiments. Engineers really care about experimental details and benchmarks.
Be as detailed as possible. Your details should be minimum five pages long, encapsulating all the
important information. Do not miss any paragraph from the paper."""
streaming_response = query_engine.query(
    query_str,
)

streaming_response.print_response_stream()

Based on the provided context, I will extract and highlight the most crucial points from each section to help researchers quickly understand the vital information of the paper.

**A. Contributions**

* The authors list their contributions, with the following leaders:
	+ Science and Engineering Leadership: Guillem Cucurull, Naman Goyal, Louis Martin, Thomas Scialom, Ruan Silva, Kevin Stone, Hugo Touvron.
	+ Technical and Management Leadership: Sergey Edunov, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic.
* Core Contributors: a list of 27 individuals who contributed to the project.
* Contributors: a list of 25 individuals who assisted with annotations, quality control, and other tasks.

**A.1 Acknowledgments**

* The authors acknowledge the help of many contributors, including:
	+ Human annotators and internal leads for organizing annotations and quality control.
	+ A large internal red team that helped improve model safety and robustness.
	+ Members of 
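
The query engine above leaves `node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.7)]` commented out. As a rough sketch of what such a cutoff does, the snippet below drops retrieved chunks whose score falls below a threshold before they reach the LLM. `ScoredChunk` and `apply_cutoff` are hypothetical stand-ins rather than library classes, and the second score is invented for illustration (only 0.811 appears in the retrieval output earlier).

```python
from dataclasses import dataclass


@dataclass
class ScoredChunk:
    """Hypothetical stand-in for a retrieved node with its similarity score."""
    text: str
    score: float


def apply_cutoff(chunks, similarity_cutoff):
    """Keep only chunks whose score is at least `similarity_cutoff`."""
    return [c for c in chunks if c.score >= similarity_cutoff]


retrieved = [
    ScoredChunk("model card", 0.811),
    ScoredChunk("contributions", 0.640),  # invented score, for illustration only
]

kept = apply_cutoff(retrieved, similarity_cutoff=0.7)
print([c.text for c in kept])
```

If every chunk falls below the threshold, the list is empty and the model receives no context, so a cutoff is worth tuning against the scores actually observed.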

In [21]:
# from IPython.display import Markdown, display

# # define prompt viewing function
# def display_prompt_dict(prompts_dict):
#     for k, p in prompts_dict.items():
#         text_md = f"**Prompt Key**: {k}<br>" f"**Text:** <br>"
#         display(Markdown(text_md))
#         print(p.get_template())
#         display(Markdown("<br><br>"))
        
# prompts_dict = query_engine.get_prompts()
# display_prompt_dict(prompts_dict)