![title.png](https://i.ibb.co/2KmT38V/title.png)

![objectives.png](https://i.ibb.co/fxbWnNQ/objectives.png)



## Notebook setup and dependency installation

In [1]:
!   pip install -qU \
    openai==1.30 \
    pinecone-client==4.1.0 \
    datasets==2.19 \
    tqdm

In [2]:
from IPython.display import HTML, display, Markdown
from typing import Dict


def chunk_display_html(chunk: Dict[str, str]) -> None:
    html_template = """
<html>
<head>
<style>
    table {{
        font-family: arial, sans-serif;
        border-collapse: collapse;
        width: 100%;
    }}
    td, th {{
        border: 1px solid #dddddd;
        text-align: left;
        padding: 8px;
    }}
</style>
</head>
<body>
    <table>
        <tr>
            <th>Key</th>
            <th>Value</th>
        </tr>
        <tr>
            <td>Title</td>
            <td>{title}</td>
        </tr>
        <tr>
            <td>DOI</td>
            <td>{doi}</td>
        </tr>
        <tr>
            <td>Chunk ID</td>
            <td>{chunk_id}</td>
        </tr>
        <tr>
            <td>Chunk</td>
            <td>{chunk}</td>
        </tr>
        <tr>
            <td>ID</td>
            <td>{id}</td>
        </tr>
        <tr>
            <td>Summary</td>
            <td>{summary}</td>
        </tr>
        <tr>
            <td>Source</td>
            <td>{source}</td>
        </tr>
        <tr>
            <td>Authors</td>
            <td>{authors}</td>
        </tr>
        <tr>
            <td>Categories</td>
            <td>{categories}</td>
        </tr>
        <tr>
            <td>Comment</td>
            <td>{comment}</td>
        </tr>
        <tr>
            <td>Journal Reference</td>
            <td>{journal_ref}</td>
        </tr>
        <tr>
            <td>Primary Category</td>
            <td>{primary_category}</td>
        </tr>
        <tr>
            <td>Published</td>
            <td>{published}</td>
        </tr>
        <tr>
            <td>Updated</td>
            <td>{updated}</td>
        </tr>
        <tr>
            <td>References</td>
            <td>{references}</td>
        </tr>
    </table>
</body>
</html>
"""

    # Format the HTML with the generated rows
    html_output = html_template.format(
        doi=chunk.get("doi", "N/A"),
        chunk_id=chunk.get("chunk-id", "N/A"),
        chunk=chunk.get("chunk", "N/A"),
        id=chunk.get("id", "N/A"),
        title=chunk.get("title", "N/A"),
        summary=chunk.get("summary", "N/A"),
        source=chunk.get("source", "N/A"),
        authors=chunk.get("authors", "N/A"),
        categories=chunk.get("categories", "N/A"),
        comment=chunk.get("comment", "N/A"),
        journal_ref=chunk.get("journal_ref", "N/A"),
        primary_category=chunk.get("primary_category", "N/A"),
        published=chunk.get("published", "N/A"),
        updated=chunk.get("updated", "N/A"),
        references=chunk.get("references", "N/A"),
    )

    # Display the HTML in an IPython notebook
    display(HTML(html_output))


def display_retrieved_context(context_response):
    # HTML template for the main container and individual tables
    html_template = """
    <html>
    <head>
    <style>
        .container {{
            display: flex;
            flex-wrap: wrap;
        }}
        .table-container {{
            margin: 10px;
            padding: 10px;
            border: 1px solid #dddddd;
        }}
        table {{
            font-family: arial, sans-serif;
            border-collapse: collapse;
            width: 100%;
        }}
        td, th {{
            border: 1px solid #dddddd;
            text-align: left;
            padding: 8px;
        }}
    </style>
    </head>
    <body>
        <div class="container">
            {tables}
        </div>
    </body>
    </html>
    """

    # Function to generate HTML table for a single dictionary
    def generate_table_for_dict(data):
        rows = "\n".join(
            "<tr><td>{key}</td><td>{value}</td></tr>".format(
                key=key, value=value if value is not None else "N/A"
            )
            for key, value in data.items()
        )
        table_html = """
        <div class="table-container">
            <table>
                <tr>
                    <th>Key</th>
                    <th>Value</th>
                </tr>
                {rows}
            </table>
        </div>
        """.format(
            rows=rows
        )
        return table_html

    # Generate HTML tables for all dictionaries in the list
    tables = "\n".join(
        generate_table_for_dict(data["metadata"]) for data in context_response
    )

    # Format the main HTML with the generated tables
    html_output = html_template.format(tables=tables)

    # Display the HTML in an IPython notebook
    display(HTML(html_output))


def display_markdown(content: str) -> None:
    display(Markdown(content))

![step-away-rag.png](https://i.ibb.co/Y288TWR/step-away-rag.png)

## Setup OpenAI

Enter the OpenAI API key and instantiate the OpenAI client.

In [3]:
import getpass
from openai import OpenAI

OPENAI_API_KEY = getpass.getpass("Enter your OpenAI API key: ")
openai = OpenAI(api_key=OPENAI_API_KEY)

## Implement request to OpenAI GPT Models

Implement a function that sends a prompt to an LLM and returns its answer.
The OpenAI client exposes the following call:
`openai_client.chat.completions.create(model: str, messages: List[Dict[str, str]])`

API reference: https://platform.openai.com/docs/api-reference/chat/create 

In [4]:
def llm_completion(prompt: str, openai_client: OpenAI, model: str = "gpt-3.5-turbo") -> str:

    response = openai_client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a helpful assistant. Output is markdown"},
            {"role": "user", "content": prompt},
        ],
        temperature=0.0,
    )

    return response.choices[0].message.content


In [5]:
display_markdown(llm_completion("What is the capital of Germany?", openai))

The capital of Germany is Berlin.

In [6]:
display_markdown(llm_completion("Who is 25th person that landed on the moon?", openai, model="gpt-3.5-turbo"))

The 25th person to land on the moon was Harrison Schmitt. He was part of the Apollo 17 mission in December 1972.

![hallucinations-problem.png](https://i.ibb.co/gMvNZC6/hallucinations-problem.png)

In [7]:
display_markdown(llm_completion("What are key benefits of mistral 7B?", openai, model="gpt-3.5-turbo"))

### Key Benefits of Mistral 7B:
1. **High Performance**: The Mistral 7B offers excellent performance capabilities, making it suitable for a wide range of applications.
  
2. **Reliability**: This model is known for its reliability and durability, ensuring that it can withstand various environmental conditions and heavy usage.

3. **Energy Efficiency**: The Mistral 7B is designed to be energy-efficient, helping to reduce operational costs and environmental impact.

4. **User-Friendly**: It is user-friendly and easy to operate, making it suitable for both experienced and novice users.

5. **Versatility**: The Mistral 7B is versatile and can be used for different purposes, making it a flexible choice for various projects.

6. **Compact Design**: Its compact design makes it easy to transport and store, ideal for users who require mobility and space-saving solutions.

7. **Advanced Features**: The Mistral 7B comes equipped with advanced features that enhance its performance and usability, providing users with a seamless experience.

![knowledge-cutoff.png](https://i.ibb.co/ccccpxZ/knowledge-cutoff.png)

In [8]:
context_example = """
Answer the question based on the following context. If you can't find the answer, say "I don't know".

Context:
We introduce Mistral 7B, a 7–billion-parameter language model engineered for
superior performance and efficiency. Mistral 7B outperforms the best open 13B
model (Llama 2) across all evaluated benchmarks, and the best released 34B
model (Llama 1) in reasoning, mathematics, and code generation. Our model
leverages grouped-query attention (GQA) for faster inference, coupled with sliding
window attention (SWA) to effectively handle sequences of arbitrary length with a
reduced inference cost. We also provide a model fine-tuned to follow instructions,
Mistral 7B – Instruct, that surpasses Llama 2 13B – chat model both on human and
automated benchmarks. Our models are released under the Apache 2.0 license.

Question: What are key benefits of Mistral 7B?
"""
display_markdown(llm_completion(context_example, openai))

The key benefits of Mistral 7B include:

1. Superior performance and efficiency compared to other models.
2. Outperforming the best open 13B model (Llama 2) across all evaluated benchmarks.
3. Outperforming the best released 34B model (Llama 1) in reasoning, mathematics, and code generation.
4. Leveraging grouped-query attention (GQA) for faster inference.
5. Using sliding window attention (SWA) to effectively handle sequences of arbitrary length with reduced inference cost.
6. Providing a fine-tuned model, Mistral 7B – Instruct, that surpasses Llama 2 13B – chat model on both human and automated benchmarks.

![why-rag.png](https://i.ibb.co/KF3xr64/why-rag.png)

![rag-vs-prompt-vs-finetune.png](https://i.ibb.co/C8rvnTY/Screenshot-2024-05-30-at-21-17-45.png)

## How to build a RAG?

Steps:
1. Build a knowledge base
2. Build retrieval
3. Augment and generate


## 1. Build a knowledge base

![build-knowledge-base.png](https://i.ibb.co/dGnjrCk/build-knowledge-base.png)


### Setup Pinecone 

Enter the Pinecone API key inside the prompt and create a Pinecone client.

In [10]:
PINECONE_REGION = "eu-west-1"
PINECONE_CLOUD = "aws"
INDEX_NAME = "pinecone-workshop-1"
VECTOR_DIMENSIONS = 1536
PINECONE_API_KEY = getpass.getpass("Enter your Pinecone API key: ")

In [11]:
from pinecone import Pinecone

pinecone = Pinecone(api_key=PINECONE_API_KEY)

pinecone.list_indexes()

{'indexes': [{'dimension': 1536,
              'host': 'pinecone-workshop-1-2kw20wn.svc.apw5-4e34-81fa.pinecone.io',
              'metric': 'cosine',
              'name': 'pinecone-workshop-1',
              'spec': {'serverless': {'cloud': 'aws', 'region': 'us-west-2'}},
              'status': {'ready': True, 'state': 'Ready'}},
             {'dimension': 1536,
              'host': 'pinecone-worshop-1-2kw20wn.svc.apw5-4e34-81fa.pinecone.io',
              'metric': 'cosine',
              'name': 'pinecone-worshop-1',
              'spec': {'serverless': {'cloud': 'aws', 'region': 'us-west-2'}},
              'status': {'ready': True, 'state': 'Ready'}}]}

### Create a Pinecone Index

Create a serverless index whose dimension matches the OpenAI embedding size (1536) and that uses the cosine similarity metric. 
Index creation requires an index name, the embedding dimension, a similarity metric, and a `ServerlessSpec` for the serverless setup.

More info on serverless: https://docs.pinecone.io/reference/architecture/serverless-architecture#overview  
Ref API: https://docs.pinecone.io/reference/api/control-plane/create_index 
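
To make the `cosine` metric concrete, here is a minimal pure-Python sketch of the score it is based on. The function name `cosine_similarity` is illustrative only and is not part of the Pinecone client; Pinecone computes the metric on the server side.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # same direction → 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # orthogonal → 0.0
```

The score depends only on direction, not on vector length, which is why it is a common choice for comparing text embeddings.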

In [12]:
from pinecone import ServerlessSpec

# TODO: Create a new index with the specified name, dimension, metric, and spec
# 1. check if the index already exists and delete it
# 2. create a Pinecone index
# 3. create a new Index reference object with the specified name and pool_threads

if INDEX_NAME in [index.name for index in pinecone.list_indexes()]:
    pinecone.delete_index(INDEX_NAME)

pinecone.create_index(
    name=INDEX_NAME,
    dimension=VECTOR_DIMENSIONS,
    metric="cosine",
    spec=ServerlessSpec(cloud=PINECONE_CLOUD, region=PINECONE_REGION),
)

index = pinecone.Index(INDEX_NAME, pool_threads=20)
print(index.describe_index_stats())

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}


### Dataset

We are going to use a sample of 1000 AI papers that can be found here: https://huggingface.co/datasets/smartcat/ai-arxiv2-chunks-embedded 

The dataset is already chunked and embedded with `text-embedding-3-small`, so we can load it and upsert it into Pinecone directly.

If you want to play with chunking strategies and embeddings, you can find the full data set here: https://huggingface.co/datasets/jamescalam/ai-arxiv2

Dataset API reference: https://huggingface.co/docs/datasets/en/index  
Slicing and indexing: https://huggingface.co/docs/datasets/en/access 

In [13]:
import datasets

dataset = datasets.load_dataset("smartcat/ai-arxiv2-chunks-embedded", split="train")
dataset

Dataset({
    features: ['doi', 'chunk-id', 'chunk', 'id', 'title', 'summary', 'source', 'authors', 'categories', 'comment', 'journal_ref', 'primary_category', 'published', 'updated', 'references', 'embeddings', 'metadata'],
    num_rows: 79782
})

In [14]:
chunk_display_html(dataset[0])

| Key | Value |
| --- | --- |
| Title | Foundations of Vector Retrieval |
| DOI | 2401.09350 |
| Chunk ID | 0 |
| Chunk | 4 2 0 2 n a J 7 1 ] S D . s c [ 1 v 0 5 3 9 0 . 1 0 4 2 : v i X r a Sebastian Bruch # Foundations of Vector Retrieval # Preface We are witness to a few years of remarkable developments in Artificial Intelligence with the use of advanced machine learning algorithms, and in particular, deep learning. Gargantuan, complex neural networks that can learn through self-supervision–and quickly so with the aid of special- ized hardware–transformed the research landscape so dramatically that, incremental overnight it seems, many fields experienced not the usual, progress, but rather a leap forward. Machine translation, natural language understanding, information retrieval, recommender systems, and computer vision are but a few examples of research areas that have had to grapple with the shock. Countless other disciplines beyond computer science such as robotics, biology, and chemistry too have benefited from deep learning. |
| ID | 2401.09350#0 |
| Summary | Vectors are universal mathematical objects that can represent text, images, speech, or a mix of these data modalities. That happens regardless of whether data is represented by hand-crafted features or learnt embeddings. Collect a large enough quantity of such vectors and the question of retrieval becomes urgently relevant: Finding vectors that are more similar to a query vector. This monograph is concerned with the question above and covers fundamental concepts along with advanced data structures and algorithms for vector retrieval. In doing so, it recaps this fascinating topic and lowers barriers of entry into this rich area of research. |
| Source | http://arxiv.org/pdf/2401.09350 |
| Authors | Sebastian Bruch |
| Categories | cs.DS, cs.IR |
| Comment | |


In [15]:
print(len(dataset[0]["embeddings"]))
print(dataset[0]["embeddings"][:10])

1536
[-0.020014504, -0.013545036, 0.04353385, -0.0029185074, 0.03552278, -0.034943033, 0.013927143, 0.06566971, -0.06888468, 0.026971487]


In [16]:
list(dataset[0]["metadata"].keys())

['authors',
 'chunk_id',
 'doc_id',
 'primary_category',
 'published',
 'source',
 'summary',
 'text',
 'title']

## Data upsert to Pinecone

Let's insert the data into Pinecone in batches. 

From our dataset we need three columns:
1. `id` - the ID of the chunk we want to insert
2. `embeddings` - the vector embedding of the chunk, generated with `text-embedding-3-small`
3. `metadata` - a dictionary with additional data about the chunk

Metadata filtering: https://docs.pinecone.io/guides/data/filter-with-metadata 
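
As a rough illustration of how the three columns fit together, the sketch below pairs them into one record per chunk. Pinecone's `upsert` accepts each vector either as an `(id, values, metadata)` tuple or as a dictionary with those three keys; the helper name `to_records` and the toy values are assumptions made for this example and are not part of the dataset or the client.

```python
from typing import Any, Dict, List, Sequence


def to_records(
    ids: Sequence[str],
    embeddings: Sequence[Sequence[float]],
    metadata: Sequence[Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Pair up the three columns into one record per chunk."""
    if not (len(ids) == len(embeddings) == len(metadata)):
        raise ValueError("columns must have the same length")
    return [
        {"id": i, "values": list(v), "metadata": m}
        for i, v, m in zip(ids, embeddings, metadata)
    ]


records = to_records(
    ids=["2401.09350#0"],
    embeddings=[[0.1, 0.2, 0.3]],  # toy 3-dimensional vector, not a real embedding
    metadata=[{"title": "Foundations of Vector Retrieval"}],
)
print(records[0]["id"])  # → 2401.09350#0
```

The dictionary form is a little more verbose than tuples, but it makes the role of each field explicit when reading the upsert code.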

### Note on optimization

We are going to pass the `async_req=True` parameter. Each upsert then returns a future, and we need to wait for all of them to complete.
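
The submit-then-wait pattern can be sketched with the standard library alone. The example below uses `multiprocessing.pool.ThreadPool` purely as an analogy: `fake_upsert` is a hypothetical stand-in for a network call, and nothing here talks to Pinecone.

```python
import time
from multiprocessing.pool import ThreadPool
from typing import List


def fake_upsert(batch: List[int]) -> int:
    """Hypothetical stand-in for a network call; returns the batch size."""
    time.sleep(0.01)
    return len(batch)


batches = [[1, 2, 3], [4, 5], [6]]

with ThreadPool(processes=3) as pool:
    # apply_async returns immediately with a result handle (a "future")
    futures = [pool.apply_async(fake_upsert, (batch,)) for batch in batches]
    # .get() blocks until the corresponding call has finished
    upserted = [future.get() for future in futures]

print(sum(upserted))  # → 6
```

Submitting all requests first and collecting the results afterwards lets the calls overlap instead of running one after another.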

Optimization tips:
1. Deploy the application in the same region as the index
2. Upsert in batches
3. Parallelize the upserts
4. Use `GRPCIndex`, but make sure to add backoff for throttling
5. Use namespaces and metadata filtering
6. Stay within quotas and limits: https://docs.pinecone.io/reference/quotas-and-limits

For scale-up and optimizations, make sure to read: https://docs.pinecone.io/guides/operations/performance-tuning#increasing-throughput
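
The backoff mentioned in tip 4 could look roughly like the sketch below. It is a generic retry helper, not a Pinecone API: the name `with_backoff` and its parameters are assumptions, and it catches every `Exception` only to stay self-contained. In real code you would narrow that to the client's throttling error.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_backoff(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
) -> T:
    """Retry `call`, doubling the wait after each failure (plus jitter)."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay + random.uniform(0, delay / 2))
    raise RuntimeError("unreachable")  # keeps type checkers satisfied


attempts = {"count": 0}


def flaky() -> str:
    """Hypothetical call that fails twice before succeeding."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("throttled")
    return "ok"


print(with_backoff(flaky, base_delay=0.01))  # → ok
```

The random jitter spreads out retries from parallel workers so they do not all hit the service again at the same moment.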

In [17]:
from tqdm import tqdm
from pinecone import Index

def upsert_batch(ds: datasets.Dataset, index: Index, batch_size: int = 100) -> None:
    # TODO: Upsert the vectors to the Pinecone index
    # 1. Iterate over the dataset in batches
    # 2. Select the batch
    # 3. Extract the IDs, embeddings, and metadata
    # 4. Upsert the vectors to the Pinecone index
    # 5. Optimization: Use async_req=True to make the upserts asynchronous and wait for futures to complete

    futures = []
    for i in range(0, len(ds), batch_size):
        i_end = min(i + batch_size, len(ds))
        batch = ds.select(range(i, i_end))

        ids = batch["id"]
        embeddings = batch["embeddings"]
        metadata = batch["metadata"]
        futures.append(
            index.upsert(vectors=list(zip(ids, embeddings, metadata)), async_req=True)
        )

    for future in tqdm(futures, total=len(futures), desc="Upsert to Pinecone"):
        future.get()

In [18]:
upsert_batch(dataset.select(range(500)), index, batch_size=20)

Upsert to Pinecone: 100%|██████████| 25/25 [00:00<00:00, 131.28it/s]


In [20]:
print(index.describe_index_stats())

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 500}},
 'total_vector_count': 500}


In [40]:
upsert_batch(dataset, index)

Upsert to Pinecone: 100%|██████████| 798/798 [00:00<00:00, 3071.12it/s]


In [42]:
print(index.describe_index_stats())

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 82082}},
 'total_vector_count': 82082}


### What has been done with the dataset?

![chunking-dataset.png](https://i.ibb.co/9wV70Q7/chunking-dataset.png)

#### Chunking Strategies
1. Character split (with overlapping)
2. Recursive character split
3. Document specific splitting
4. Semantic Chunking
5. Agentic?
6. More?

Introduction to chunking: https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb
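
As an illustration of the first strategy, here is a minimal sketch of a character split with overlap. It is not how the workshop dataset was chunked; the function name `split_with_overlap` and its parameters are assumptions chosen for this example.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Split text into fixed-size character chunks that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


print(split_with_overlap("abcdefghij", chunk_size=4, overlap=2))
# → ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk, at the cost of storing some text twice.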

### Revisit Agenda

![done-next.png](https://i.ibb.co/QYC9nrb/done-next.png)


## 2. Retrieve against the query

Once we have inserted everything into Pinecone, let's query it. The input is the query text and the output is the top-k most similar chunks.
Steps: 
1. Encode the input text to generate embeddings
2. Call Pinecone's query function to retrieve top K similar results


Embedding API Ref: https://platform.openai.com/docs/api-reference/embeddings  
Query API Ref: https://docs.pinecone.io/reference/api/data-plane/query 
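
Conceptually, the second step scores stored vectors against the query vector and keeps the best `top_k`. The sketch below does this by exhaustive scan over a toy in-memory store; the names `top_k_matches` and `store` are illustrative assumptions. A vector database typically relies on approximate nearest-neighbour indexes rather than scanning every vector.

```python
from typing import Dict, List, Sequence, Tuple


def top_k_matches(
    query: Sequence[float],
    vectors: Dict[str, Sequence[float]],
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the best top_k."""
    scored = [
        (vec_id, sum(q * v for q, v in zip(query, vec)))
        for vec_id, vec in vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy unit-length vectors; the embeddings in this notebook have 1536 dimensions.
store = {
    "a": [1.0, 0.0],
    "b": [0.0, 1.0],
    "c": [0.6, 0.8],
}
print(top_k_matches([1.0, 0.0], store, top_k=2))
# → [('a', 1.0), ('c', 0.6)]
```

Because the toy vectors have unit length, the dot product used here equals the cosine similarity.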

![semantic-search.png](https://i.ibb.co/2dWGRHn/semantic-search.png)

In [25]:
from typing import List


def encode(
    text: str, openai_client: OpenAI, model: str = "text-embedding-3-small"
) -> List[float]:
    # TODO: create embeddings using OpenAI API
    # 1. Call openai_client.embeddings.create with the model and text
    response = openai_client.embeddings.create(
        model=model, input=text
    )
    return response.data[0].embedding

In [26]:
res = encode("What are key benefits of mistral 7B?", openai)
print(res[:10])
print(len(res))

[-0.02581052854657173, 0.03242715075612068, 0.019921164959669113, -0.021119002252817154, -0.036020658910274506, 0.023999514058232307, -0.05059434100985527, 0.05886511504650116, 0.024940671399235725, -0.006969555746763945]
1536


In [27]:
from pinecone import QueryResponse


def semantic_search(
    query: str, index: Index, openai_client: OpenAI, top_k: int = 10
) -> QueryResponse:
    # TODO: Perform semantic search on input query
    # 1. encode the query using OpenAI embeddings API
    # 2. call index.query with the query embedding to get the top_k results
    # 3. Include metadata and exclude values in the query
    query_embedding = encode(query, openai_client)
    ret = index.query(
        vector=query_embedding, top_k=top_k, include_metadata=True, include_values=False
    )
    return ret

In [28]:
res = semantic_search("What is Mistral 7B?", index, openai)
print(res.matches[0].metadata['text'])

Abstract
We introduce Mistral 7B, a 7–billion-parameter language model engineered for superior performance and efficiency. Mistral 7B outperforms the best open 13B model (Llama 2) across all evaluated benchmarks, and the best released 34B model (Llama 1) in reasoning, mathematics, and code generation. Our model leverages grouped-query attention (GQA) for faster inference, coupled with sliding window attention (SWA) to effectively handle sequences of arbitrary length with a reduced inference cost. We also provide a model fine-tuned to follow instructions, Mistral 7B – Instruct, that surpasses Llama 2 13B – chat model both on human and automated benchmarks. Our models are released under the Apache 2.0 license. Code: https://github.com/mistralai/mistral-src Webpage: https://mistral.ai/news/announcing-mistral-7b/
# Introduction


## 3. Augment and generate

![workflow-rag-simple.png](https://i.ibb.co/p0gwY23/workflow-rag-simple.png)

### Generate a final response

Combine all the pieces:
1. Perform a semantic search on the input query
2. Build a context (prompt) for an LLM
3. Call the LLM to generate a final response
4. Return the final response and the retrieved context

The relevant context can be found in the metadata. You can use:
1. `title` - the paper title
2. `published` - the publication date
3. `primary_category` - the paper's primary arXiv category
4. `summary` - the paper summary
5. `text` - the chunk text; this is the most useful field for building the LLM context

In [29]:
from typing import Tuple

def from_metadata(metadata: Dict) -> str:
    return f"""

***************************************
Title: {metadata['title']}
Authors: {metadata['authors'][:5]}
Published: {metadata['published']}
Primary category: {metadata['primary_category']}
Paper summary: {metadata['summary']}
Text: {metadata['text']}
***************************************
"""


def get_prompt(query: str, query_results: QueryResponse) -> str:
    context = "\n".join(
        [from_metadata(result.metadata) for result in query_results.matches]
    )
    return f"""
You are a helpful assistant that answers questions. Answer the following question: "{query}"
-----------------------------------
Use the following context to answer the question {context}
-----------------------------------
If the answer is not in the context, answer with there is not enough information to answer the question.
The answers should be clear, easy to understand, complete and comprehensive.
"""


def rag(query: str, index: Index, openai_client: OpenAI, top_k: int = 5) -> Tuple[str, QueryResponse]:
    # TODO: Wire up the RAG pipeline
    # 1. Perform semantic search to retrieve the context (QueryResponse)
    # 2. Generate the prompt using the context (QueryResponse). Make sure to include the query and relevant metadata in the prompt
    # 3. Call the LLM completion with the prompt to get the response
    query_results = semantic_search(query, index, openai_client, top_k=top_k)
    prompt = get_prompt(query, query_results)
    response = llm_completion(prompt, openai_client)
    return response, query_results

In [30]:
answer, context = rag("What are key benefits of Mistral 7B?", index, openai)
display_markdown(answer)

### Key Benefits of Mistral 7B:

1. **Superior Performance**: Mistral 7B is engineered for superior performance, outperforming other models across various benchmarks, especially in reasoning, mathematics, and code generation.

2. **Efficiency**: Mistral 7B is designed to be efficient, striking a balance between high performance and computational costs. It leverages innovative attention mechanisms like grouped-query attention (GQA) and sliding window attention (SWA) to enhance efficiency.

3. **Ease of Deployment**: Mistral 7B is released under the Apache 2.0 license, accompanied by a reference implementation that facilitates easy deployment on various platforms such as AWS, GCP, or Azure. Integration with tools like vLLM and Hugging Face is streamlined for easier deployment.

4. **Adaptability**: Mistral 7B is crafted for ease of fine-tuning across a wide range of tasks, showcasing its adaptability and versatility in real-world applications.

5. **Model Fine-Tuning**: Mistral 7B offers the capability for fine-tuning models to follow specific instructions, surpassing other models in both human and automated benchmarks.

6. **Balanced Model**: Mistral 7B demonstrates that a carefully designed language model can deliver high performance while maintaining an efficient inference, addressing the challenges of escalating model sizes and computational costs in the NLP domain.

7. **Innovative Attention Mechanisms**: The model leverages grouped-query attention (GQA) for faster inference and sliding window attention (SWA) to handle sequences of arbitrary length effectively, contributing to its enhanced performance and efficiency.

### Note:
If you need more specific details or have any other questions, feel free to ask!

In [31]:
display_retrieved_context(context.matches[:2])

| Key | Value |
| --- | --- |
| authors | Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed |
| chunk_id | 3.0 |
| doc_id | 2310.06825 |
| primary_category | cs.CL |
| published | 20231010 |
| source | http://arxiv.org/pdf/2310.06825 |
| summary | We introduce Mistral 7B v0.1, a 7-billion-parameter language model engineered for superior performance and efficiency. Mistral 7B outperforms Llama 2 13B across all evaluated benchmarks, and Llama 1 34B in reasoning, mathematics, and code generation. Our model leverages grouped-query attention (GQA) for faster inference, coupled with sliding window attention (SWA) to effectively handle sequences of arbitrary length with a reduced inference cost. We also provide a model fine-tuned to follow instructions, Mistral 7B -- Instruct, that surpasses the Llama 2 13B -- Chat model both on human and automated benchmarks. Our models are released under the Apache 2.0 license. |
| text | Mistral 7B is released under the Apache 2.0 license. This release is accompanied by a reference implementation1 facilitating easy deployment either locally or on cloud platforms such as AWS, GCP, or Azure using the vLLM [17] inference server and SkyPilot 2. Integration with Hugging Face 3 is also streamlined for easier integration. Moreover, Mistral 7B is crafted for ease of fine-tuning across a myriad of tasks. As a demonstration of its adaptability and superior performance, we present a chat model fine-tuned from Mistral 7B that significantly outperforms the Llama 2 13B – Chat model. Mistral 7B takes a significant step in balancing the goals of getting high performance while keeping large language models efficient. Through our work, our aim is to help the community create more affordable, efficient, and high-performing language models that can be used in a wide range of real-world applications. # 2 Architectural details The cat sat on the The cat sat on the window size The cat sat on the Vanilla Attention Sliding Window Attention Effective Context Length |
| title | Mistral 7B |

| Key | Value |
| --- | --- |
| authors | Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed |
| chunk_id | 1.0 |
| doc_id | 2310.06825 |
| primary_category | cs.CL |
| published | 20231010 |
| source | http://arxiv.org/pdf/2310.06825 |
| summary | We introduce Mistral 7B v0.1, a 7-billion-parameter language model engineered for superior performance and efficiency. Mistral 7B outperforms Llama 2 13B across all evaluated benchmarks, and Llama 1 34B in reasoning, mathematics, and code generation. Our model leverages grouped-query attention (GQA) for faster inference, coupled with sliding window attention (SWA) to effectively handle sequences of arbitrary length with a reduced inference cost. We also provide a model fine-tuned to follow instructions, Mistral 7B -- Instruct, that surpasses the Llama 2 13B -- Chat model both on human and automated benchmarks. Our models are released under the Apache 2.0 license. |
| text | Abstract We introduce Mistral 7B, a 7–billion-parameter language model engineered for superior performance and efficiency. Mistral 7B outperforms the best open 13B model (Llama 2) across all evaluated benchmarks, and the best released 34B model (Llama 1) in reasoning, mathematics, and code generation. Our model leverages grouped-query attention (GQA) for faster inference, coupled with sliding window attention (SWA) to effectively handle sequences of arbitrary length with a reduced inference cost. We also provide a model fine-tuned to follow instructions, Mistral 7B – Instruct, that surpasses Llama 2 13B – chat model both on human and automated benchmarks. Our models are released under the Apache 2.0 license. Code: https://github.com/mistralai/mistral-src Webpage: https://mistral.ai/news/announcing-mistral-7b/ # Introduction |
| title | Mistral 7B |


In [32]:
display_markdown(llm_completion("Does ChatGPT have personality?", openai))

ChatGPT doesn't have a personality of its own, as it is an AI language model designed to assist with a wide range of tasks and provide helpful responses based on the input it receives. Its responses are generated based on patterns in the data it has been trained on and the context of the conversation.

In [33]:
answer, _ = rag("Does ChatGPT have personality? Write also paper title and year of publish", index, openai)
display_markdown(answer)

### Answer:
Based on the paper titled "The Self-Perception and Political Biases of ChatGPT" published in 20230414, ChatGPT does exhibit personality traits. The paper reveals that ChatGPT perceives itself as highly open and agreeable, has the Myers-Briggs personality type ENFJ, and is among the 15% of test-takers with the least pronounced dark traits.

### Paper Title and Year of Publish:
- **Title:** The Self-Perception and Political Biases of ChatGPT
- **Year of Publish:** 20230414

In [34]:
answer, _ = rag(
    "What is retrieval augmented generation?",
    index,
    openai,
    top_k=20
)
display_markdown(answer)

### Answer: 

**Retrieval augmented generation** refers to a paradigm where a generative model is enhanced by incorporating a retrieval mechanism to improve the quality of generated outputs. This approach allows the model to retrieve relevant information from external sources, such as a knowledge base or a set of documents, to augment the generation process. By combining the capabilities of generative models with the retrieval of external knowledge, retrieval augmented generation aims to enhance the accuracy and relevance of the generated content.

In the provided context, retrieval augmented generation is discussed in the context of language models that utilize retrieval mechanisms to enhance their performance in tasks such as language modeling, question answering, and knowledge-intensive natural language processing tasks. The integration of retrieval mechanisms allows these models to access external knowledge sources, leading to improved performance in generating text and answering queries.

If you need more specific details or examples related to retrieval augmented generation in the context of language models, please let me know.

In [35]:
answer, _ = rag("The difference between Mistral and Kosmos?", index, openai, top_k=20)
display_markdown(answer)

### Difference between Mistral and Kosmos

Based on the provided context, Mistral and Kosmos are both large language models engineered for superior performance and efficiency. However, they differ in their specific capabilities and focus areas:

- **Mistral**:
  - **Publication Date**: 20231010
  - **Primary Category**: cs.CL
  - **Key Features**:
    - Engineered for superior performance and efficiency
    - Outperforms Llama 2 13B across all evaluated benchmarks
    - Leverages grouped-query attention (GQA) and sliding window attention (SWA) for faster inference
    - Provides a model fine-tuned to follow instructions, surpassing the Llama 2 13B -- Chat model
    - Released under the Apache 2.0 license

- **Kosmos**:
  - **Publication Date**: 20230227
  - **Primary Category**: cs.CL
  - **Key Features**:
    - A Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context, and follow instructions
    - Trained on web-scale multimodal corpora, including text and images
    - Achieves impressive performance in language understanding, generation, OCR-free NLP, perception-language tasks, and vision tasks
    - Benefits from cross-modal transfer, transferring knowledge between language and multimodal domains
    - Introduces a dataset of Raven IQ test for diagnosing nonverbal reasoning capability

In summary, Mistral is focused on language model efficiency and performance, while Kosmos is a multimodal model that excels in various tasks related to language understanding, generation, and perception.

## LangChain

Now that we have covered the core concepts, we can explore the LangChain framework.

In [36]:
!pip install -qU \
    langchain-pinecone \
    langchain-openai \
    langchain

In [37]:
from langchain_pinecone import PineconeVectorStore
from langchain_openai import OpenAIEmbeddings
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQAWithSourcesChain

embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",
    openai_api_key=OPENAI_API_KEY,
)

vectorstore = PineconeVectorStore(
    index=index,
    embedding=embeddings,
    text_key="text",
    pinecone_api_key=PINECONE_API_KEY,
    index_name=INDEX_NAME,
)

llm = ChatOpenAI(openai_api_key=OPENAI_API_KEY, model="gpt-3.5-turbo", temperature=0.0)
langchain_rag = RetrievalQAWithSourcesChain.from_chain_type(
    llm=llm, chain_type="stuff", retriever=vectorstore.as_retriever()
)

In [38]:
langchain_rag.invoke("What are all benefits of using Mistral 7B?")

{'question': 'What are all benefits of using Mistral 7B?',
 'answer': 'The benefits of using Mistral 7B include superior performance and efficiency, easy deployment on cloud platforms, integration with Hugging Face, and ease of fine-tuning across various tasks.\n',
 'sources': 'http://arxiv.org/pdf/2310.06825'}

## Advanced RAG - TBA

TODO: Add examples

In [39]:
answer, context = rag(
    "The key difference between Mistral 7B and Palm?", index, openai, top_k=20
)
display_markdown(answer)

The key difference between Mistral 7B and Palm is that Mistral 7B is a 7-billion-parameter language model engineered for superior performance and efficiency. It outperforms Llama 2 13B across all evaluated benchmarks and Llama 1 34B in reasoning, mathematics, and code generation. On the other hand, the context does not provide specific information about Palm to make a direct comparison.

### Indexing Time

### Query Time

### Evaluations

### Agentic Approach

In [None]:
# Notes
# Better embeddings
# Better chunking
# Better indexing
# Better query understanding
# History management
# Evaluation
# Ranking (Context Limitation)
# Guardrails
# Frameworks (langchain / canopy)