## Load Models

In [1]:
from langchain_ollama.llms import OllamaLLM

In [2]:
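llm = OllamaLLM(
    model="llama3.2:latest",
    # Ollama model options are top-level fields on OllamaLLM
    num_ctx=16384,  # context window size in tokens
    num_predict=-1,  # -1 lets the model generate until it stops on its own
    # HTTP client timeout in seconds; assumes a langchain-ollama release that supports client_kwargs
    client_kwargs={"timeout": 300.0},
)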

In [None]:
from langchain_community.embeddings.huggingface import HuggingFaceEmbeddings

embed_model = HuggingFaceEmbeddings(
    model_name="BAAI/bge-small-en-v1.5",
    show_progress=True,
    model_kwargs={"device": "mps"},
)

In [4]:
embeddings = embed_model.embed_query(text="Hello world!")
print(len(embeddings))
print(embeddings[:5])

Batches: 100%|██████████| 1/1 [00:00<00:00,  2.66it/s]

384
[-0.0032757290173321962, -0.011690833605825901, 0.04155922308564186, -0.03814816474914551, 0.024183081462979317]
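
Vectors like the one printed above are what the vector store compares at query time, commonly with cosine similarity. The following is a minimal sketch of that comparison and of top-k ranking, using hand-written 3-dimensional toy vectors rather than real model output. The helper names `cosine_similarity` and `top_k` are illustrative and not part of any library used in this notebook.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return (index, score) pairs for the k most similar document vectors."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 3-dimensional vectors standing in for real 384-dimensional embeddings.
toy_docs = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
toy_query = [1.0, 0.1, 0.0]

ranked = top_k(toy_query, toy_docs, k=2)
print(ranked)  # indices 0 and 1 rank highest for this query
```

A production vector store uses approximate nearest-neighbour indexes instead of scoring every vector, but the ranking criterion is the same idea.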





## Ingestion Pipeline

In [5]:
# Load data

from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader(file_path="./data/llama2.pdf")
documents = loader.load()

In [6]:
from langchain_text_splitters import RecursiveCharacterTextSplitter


def create_academic_text_splitter():
    """
    Creates a text splitter optimized for academic papers with appropriate
    chunk sizes and overlap to maintain context and section coherence.
    """
    return RecursiveCharacterTextSplitter(
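        # Large chunks (measured in characters) keep more of each section together
        chunk_size=3000,
        # Overlap repeats the end of one chunk at the start of the next
        chunk_overlap=400,
        # Try markdown-style headings first, then paragraphs, lines, words, characters
        separators=["\n## ", "\n### ", "\n\n", "\n", " ", ""],
        # Keep each separator attached to its chunk instead of discarding it
        keep_separator=True,
        # Measure chunk length in characters
        length_function=len,
        # Record each chunk's character offset in its metadata as "start_index"
        add_start_index=True,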
    )
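
To see how `chunk_size` and `chunk_overlap` interact, here is a deliberately simplified sketch: a fixed-size sliding window over characters. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that one splits on the separators and merges the pieces, so real chunk boundaries differ); the function name `sliding_window_chunks` is hypothetical and only illustrates the size and overlap arithmetic.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows; each window repeats the tail of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
print(chunks)
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

With the notebook's settings (3000 and 400), each new window would advance by 2600 characters under this simplified model.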

In [7]:
# Initialize the vector database and add the document chunks to it

# from langchain_core.vectorstores import InMemoryVectorStore
from langchain.indexes.vectorstore import VectorstoreIndexCreator
from langchain_qdrant import QdrantVectorStore

# Define host and port
host = "localhost"
port = "6333"

# Create index with VectorstoreIndexCreator
index_creator = VectorstoreIndexCreator(
    vectorstore_cls=QdrantVectorStore,
    embedding=embed_model,
    text_splitter=create_academic_text_splitter(),
    vectorstore_kwargs={
        "collection_name": "rag_demo_collection",
        # "location": ":memory:"
        "url": f"http://{host}:{port}"  # Constructed from host and port
    }
)

# Create index
index = index_creator.from_documents(documents)


Batches: 100%|██████████| 1/1 [00:00<00:00, 25.86it/s]
Batches: 100%|██████████| 2/2 [00:01<00:00,  1.21it/s]
Batches: 100%|██████████| 2/2 [00:01<00:00,  1.46it/s]


## Retrieval Pipeline

In [8]:
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.prompts import ChatPromptTemplate

def create_system_prompt():
    """
    Creates a system prompt template for paper summarization.
    """
    return ChatPromptTemplate.from_messages(
        [
            (
                "system",
                (
                    "You are a world-class research scientist and expert in Large Language Models. You can break down complex ideas into digestible pieces and extract key points from long research documents.\n\n"
                    
                    "Context information is below.\n"
                    "---------------------\n"
                    "{context}\n"
                    "---------------------\n"
                    
                    "Using the context information, provide an accurate response to the user query.\n"
                    "Question: {query}\n"
                ),
            )
        ]
    )
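
The template above has two placeholders, `{context}` and `{query}`. A stuff-documents chain fills `{context}` by concatenating the text of the retrieved documents. The sketch below imitates that step with plain string formatting; `ToyDocument`, `TOY_TEMPLATE`, and `render_prompt` are hypothetical names for illustration, and the `"\n\n"` separator is assumed to match the chain's default.

```python
from dataclasses import dataclass


@dataclass
class ToyDocument:
    """Hypothetical stand-in for a retrieved document; only the text field matters here."""
    page_content: str


TOY_TEMPLATE = (
    "Context information is below.\n"
    "---------------------\n"
    "{context}\n"
    "---------------------\n"
    "Question: {query}\n"
)


def render_prompt(docs: list[ToyDocument], query: str, separator: str = "\n\n") -> str:
    """Join the document texts and substitute them, with the query, into the template."""
    context = separator.join(doc.page_content for doc in docs)
    return TOY_TEMPLATE.format(context=context, query=query)


rendered = render_prompt(
    [ToyDocument("Chunk one."), ToyDocument("Chunk two.")],
    query="What is in the chunks?",
)
print(rendered)
```

Because every retrieved chunk is pasted into one prompt, the number of retrieved documents times the chunk size has to fit within the model's context window.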

In [9]:
def get_response(query: str, top_docs=5):
    """
    Get response using RAG pipeline.
    """
    
    docs = index.vectorstore.similarity_search(query, k=top_docs)
    
    # Create the prompt and chain
    prompt = create_system_prompt()
    document_chain = create_stuff_documents_chain(
        llm=llm,
        prompt=prompt,
    )

    # Define the prompt variables
    prompt_vars = {
        "context": docs,
        "query": query,
    }

    # Generate response using retrieved documents
    for token in document_chain.stream(prompt_vars):
        print(token, end="")
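
`get_response` prints tokens as they arrive but returns nothing, so the answer cannot be reused afterwards. One possible variation, sketched here under the assumption that the stream yields plain strings, is to accumulate the tokens while printing them. `stream_and_collect` and `fake_stream` are hypothetical helpers; `fake_stream` stands in for `document_chain.stream(...)` so the example runs without a model.

```python
from typing import Iterable, Iterator


def stream_and_collect(tokens: Iterable[str]) -> str:
    """Print each token as it arrives and return the full concatenated text."""
    parts = []
    for token in tokens:
        print(token, end="", flush=True)
        parts.append(token)
    print()  # finish the line once the stream ends
    return "".join(parts)


def fake_stream() -> Iterator[str]:
    """Hypothetical token source standing in for document_chain.stream(...)."""
    yield from ["Llama ", "2 ", "is ", "a ", "language ", "model."]


answer = stream_and_collect(fake_stream())
```

The returned string can then be saved to a file or passed to a later step.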

In [10]:
instruction = "Analyze the provided research paper with extreme attention to detail and provide a detailed analysis of the paper. Include all the key details from each section of the paper."
data_format = "Organize your response in clean, well-structured markdown. Use headings and subheadings to make it easy to read and understand. Use bullet points wherever necessary."
audience = "While this is for busy researchers, provide complete technical depth. Do not summarize or simplify technical details."
tone = "The tone should be professional and clear."

query_str = (
    f"{instruction}\n\n" f"{data_format}\n\n" f"{audience}\n\n" f"{tone}\n\n"
)

get_response(query=query_str)

Batches: 100%|██████████| 1/1 [00:00<00:00,  3.36it/s]


# Detailed Analysis of the Research Paper on Large Language Models

## Introduction

The provided research paper focuses on Supervised Fine-Tuning (SFT) for large language models. The authors aim to improve the quality of the model by fine-tuning it with high-quality annotation data.

### Key Takeaways

* SFT is a crucial step in improving the performance of large language models.
* High-quality annotation data is essential for achieving good results.
* The authors implemented a quality assurance process to ensure that only high-quality annotations are used.

## Getting Started

The paper begins by discussing the importance of collecting high-quality SFT data. The authors highlight that third-party SFT data is available, but it often lacks diversity and quality.

### Key Points

* The authors started by collecting publicly available instruction tuning data (Chung et al., 2022) and used this data to bootstrap their SFT stage.
* Quality is all you need: the authors focused on collecting 

In [11]:
get_response(query="Summarize the fine-tuning process of llama2.")

Batches: 100%|██████████| 1/1 [00:00<00:00,  8.75it/s]


The fine-tuning process of Llama 2 involves several stages and techniques:

1. **Supervised Fine-Tuning**: The initial version of Llama 2-Chat is created through supervised fine-tuning, where the model is trained on a dataset specifically designed for chat-based conversations.

2. **Reward Modeling**: After the initial fine-tuning, the model undergoes iterative reward modeling using techniques such as rejection sampling and Proximal Policy Optimization (PPO).

3. **Reinforcement Learning with Human Feedback (RLHF)**: The model is further refined through RLHF methodologies, which involve aligning the model's behavior with human feedback.

4. **Ghost Attention (GAtt)**: A new technique called Ghost Attention is introduced to control dialogue flow over multiple turns.

The fine-tuning process involves several iterations, and the accumulation of reward modeling data in parallel with model enhancements is crucial to ensure that the reward models remain within distribution.