![title.png](https://i.ibb.co/2KmT38V/title.png)

![section-breakpoint.png](https://i.ibb.co/344JqH3/section-breakpoint.png)

![objectives.png](https://i.ibb.co/fxbWnNQ/objectives.png)

![image.png](https://i.ibb.co/rHVSp3w/visual-breakpoint.png)

## Notebook setup and dependency installation

In [268]:
!   pip install -qU \
    openai==1.30 \
    "pinecone-client[grpc]==4.1.0" \
    datasets==2.19 \
    tqdm

In [180]:
from IPython.display import HTML, display, Markdown
from typing import Dict


def chunk_display_html(chunk: Dict[str, str]) -> None:
    html_template = """
<html>
<head>
<style>
    table {{
        font-family: arial, sans-serif;
        border-collapse: collapse;
        width: 100%;
    }}
    td, th {{
        border: 1px solid #dddddd;
        text-align: left;
        padding: 8px;
    }}
</style>
</head>
<body>
    <table>
        <tr>
            <th>Key</th>
            <th>Value</th>
        </tr>
        <tr>
            <td>Title</td>
            <td>{title}</td>
        </tr>
        <tr>
            <td>doi</td>
            <td>{doi}</td>
        </tr>
        <tr>
            <td>Chunk ID</td>
            <td>{chunk_id}</td>
        </tr>
        <tr>
            <td>chunk</td>
            <td>{chunk}</td>
        </tr>
        <tr>
            <td>id</td>
            <td>{id}</td>
        </tr>
        <tr>
            <td>Summary</td>
            <td>{summary}</td>
        </tr>
        <tr>
            <td>Source</td>
            <td>{source}</td>
        </tr>
        <tr>
            <td>Authors</td>
            <td>{authors}</td>
        </tr>
        <tr>
            <td>Categories</td>
            <td>{categories}</td>
        </tr>
        <tr>
            <td>Comment</td>
            <td>{comment}</td>
        </tr>
        <tr>
            <td>Journal Reference</td>
            <td>{journal_ref}</td>
        </tr>
        <tr>
            <td>Primary Category</td>
            <td>{primary_category}</td>
        </tr>
        <tr>
            <td>Published</td>
            <td>{published}</td>
        </tr>
        <tr>
            <td>Updated</td>
            <td>{updated}</td>
        </tr>
        <tr>
            <td>References</td>
            <td>{references}</td>
        </tr>
    </table>
</body>
</html>
"""

    # Format the HTML with the generated rows
    html_output = html_template.format(
        doi=chunk.get("doi", "N/A"),
        chunk_id=chunk.get("chunk-id", "N/A"),
        chunk=chunk.get("chunk", "N/A"),
        id=chunk.get("id", "N/A"),
        title=chunk.get("title", "N/A"),
        summary=chunk.get("summary", "N/A"),
        source=chunk.get("source", "N/A"),
        authors=chunk.get("authors", "N/A"),
        categories=chunk.get("categories", "N/A"),
        comment=chunk.get("comment", "N/A"),
        journal_ref=chunk.get("journal_ref", "N/A"),
        primary_category=chunk.get("primary_category", "N/A"),
        published=chunk.get("published", "N/A"),
        updated=chunk.get("updated", "N/A"),
        references=chunk.get("references", "N/A"),
    )

    # Display the HTML in an IPython notebook
    display(HTML(html_output))


def display_retrieved_context(context_response):
    # HTML template for the main container and individual tables
    html_template = """
    <html>
    <head>
    <style>
        .container {{
            display: flex;
            flex-wrap: wrap;
        }}
        .table-container {{
            margin: 10px;
            padding: 10px;
            border: 1px solid #dddddd;
        }}
        table {{
            font-family: arial, sans-serif;
            border-collapse: collapse;
            width: 100%;
        }}
        td, th {{
            border: 1px solid #dddddd;
            text-align: left;
            padding: 8px;
        }}
    </style>
    </head>
    <body>
        <div class="container">
            {tables}
        </div>
    </body>
    </html>
    """

    # Function to generate HTML table for a single dictionary
    def generate_table_for_dict(data):
        rows = "\n".join(
            "<tr><td>{key}</td><td>{value}</td></tr>".format(
                key=key, value=value if value is not None else "N/A"
            )
            for key, value in data.items()
        )
        table_html = """
        <div class="table-container">
            <table>
                <tr>
                    <th>Key</th>
                    <th>Value</th>
                </tr>
                {rows}
            </table>
        </div>
        """.format(
            rows=rows
        )
        return table_html

    # Generate HTML tables for all dictionaries in the list
    tables = "\n".join(
        generate_table_for_dict(data["metadata"]) for data in context_response
    )

    # Format the main HTML with the generated tables
    html_output = html_template.format(tables=tables)

    # Display the HTML in an IPython notebook
    display(HTML(html_output))


def display_markdown(content: str) -> None:
    display(Markdown(content))

![image.png](https://i.ibb.co/rHVSp3w/visual-breakpoint.png)

![step-away-rag.png](https://i.ibb.co/Y288TWR/step-away-rag.png)

## Setup OpenAI

Enter the OpenAI API key and instantiate the OpenAI client.

Paste your OpenAI API key (it looks like `sk-XXXXXXXXXX`) into the prompt.

In [181]:
import getpass
from openai import OpenAI

OPENAI_API_KEY = getpass.getpass("Enter your OpenAI API key: ")
openai = OpenAI(api_key=OPENAI_API_KEY)
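
For non-interactive runs, the key could also be read from an environment variable before falling back to the hidden prompt. This is a minimal sketch: the helper name `read_api_key` and the use of an environment variable are assumptions for illustration, not part of the workshop code.

```python
import getpass
import os


def read_api_key(env_var: str, prompt: str) -> str:
    """Return the key from the environment, or ask for it interactively."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fall back to a hidden prompt, as the cell above does
        key = getpass.getpass(prompt).strip()
    return key
```

With this helper, the cell above could use `read_api_key("OPENAI_API_KEY", "Enter your OpenAI API key: ")` in place of the direct `getpass.getpass` call.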

![section-breakpoint.png](https://i.ibb.co/344JqH3/section-breakpoint.png)

## Generate text using OpenAI GPT


API reference: https://platform.openai.com/docs/api-reference/chat/create 

In [182]:
def generate(prompt: str, openai_client: OpenAI, model: str = "gpt-3.5-turbo") -> str:

    response = openai_client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a helpful assistant. Output is markdown"},
            {"role": "user", "content": prompt},
        ],
        temperature=0.0,
    )

    return response.choices[0].message.content


![section-breakpoint.png](https://i.ibb.co/344JqH3/section-breakpoint.png)

In [183]:
display_markdown(generate("What is the capital of Germany?", openai))

The capital of Germany is Berlin.

In [184]:
display_markdown(generate("Who is 25th person that landed on the moon?", openai, model="gpt-3.5-turbo"))

The 25th person to land on the moon was Harrison Schmitt. He was part of the Apollo 17 mission in December 1972 and was the only geologist to walk on the moon.

![hallucinations-problem.png](https://i.ibb.co/gMvNZC6/hallucinations-problem.png)

![section-breakpoint.png](https://i.ibb.co/344JqH3/section-breakpoint.png)

In [185]:
display_markdown(generate("What are key benefits of mistral 7B?", openai, model="gpt-3.5-turbo"))

The Mistral 7B is a powerful and versatile wind turbine that offers several key benefits:

1. **High Efficiency**: The Mistral 7B is designed to maximize energy production with its high efficiency blades and generator, ensuring that you get the most out of the available wind resources.

2. **Robust Construction**: This wind turbine is built to withstand harsh weather conditions and has a durable construction that ensures long-term reliability and performance.

3. **Quiet Operation**: The Mistral 7B is designed to operate quietly, making it suitable for residential areas or locations where noise levels need to be minimized.

4. **Easy Installation**: With its user-friendly design and easy installation process, the Mistral 7B can be set up quickly and efficiently, saving time and effort.

5. **Remote Monitoring**: Some models of the Mistral 7B come with remote monitoring capabilities, allowing you to track the performance of the turbine and troubleshoot any issues from a distance.

Overall, the Mistral 7B offers a combination of efficiency, durability, and ease of use, making it a popular choice for renewable energy generation.

![knowledge-cutoff.png](https://i.ibb.co/ccccpxZ/knowledge-cutoff.png)

![section-breakpoint.png](https://i.ibb.co/344JqH3/section-breakpoint.png)

In [186]:
context_example = """
Answer the question based on the following context. If you can't find the answer, say "I don't know".

Context:
We introduce Mistral 7B, a 7–billion-parameter language model engineered for
superior performance and efficiency. Mistral 7B outperforms the best open 13B
model (Llama 2) across all evaluated benchmarks, and the best released 34B
model (Llama 1) in reasoning, mathematics, and code generation. Our model
leverages grouped-query attention (GQA) for faster inference, coupled with sliding
window attention (SWA) to effectively handle sequences of arbitrary length with a
reduced inference cost. We also provide a model fine-tuned to follow instructions,
Mistral 7B – Instruct, that surpasses Llama 2 13B – chat model both on human and
automated benchmarks. Our models are released under the Apache 2.0 license.

Question: What are key benefits of Mistral 7B?
"""
display_markdown(generate(context_example, openai))

The key benefits of Mistral 7B include:

1. Superior performance and efficiency compared to other models like Llama 2 (13B) and Llama 1 (34B).
2. Outperforming other models in reasoning, mathematics, and code generation.
3. Leveraging grouped-query attention (GQA) for faster inference.
4. Using sliding window attention (SWA) to handle sequences of arbitrary length with reduced inference cost.
5. Providing a fine-tuned model, Mistral 7B – Instruct, that surpasses Llama 2 13B – chat model on both human and automated benchmarks.

![visual-breakpoint.png](https://i.ibb.co/rHVSp3w/visual-breakpoint.png)

![why-rag.png](https://i.ibb.co/KF3xr64/why-rag.png)

![section-breakpoint.png](https://i.ibb.co/344JqH3/section-breakpoint.png)

![how-to-build-rag.png](https://i.ibb.co/wgXqfFv/how-to-build-rag.png)

![visual-breakpoint.png](https://i.ibb.co/rHVSp3w/visual-breakpoint.png)

## 1. Build a knowledge base

![section-breakpoint.png](https://i.ibb.co/344JqH3/section-breakpoint.png)

![build-knowledge-base.png](https://i.ibb.co/dGnjrCk/build-knowledge-base.png)


![section-breakpoint.png](https://i.ibb.co/344JqH3/section-breakpoint.png)

### Setup Pinecone 

Enter the Pinecone API key inside the prompt and create a Pinecone client.

In [260]:
PINECONE_REGION = "us-east-1"
PINECONE_CLOUD = "aws"
INDEX_NAME = "pinecone-workshop-1"
VECTOR_DIMENSIONS = 1536
PINECONE_API_KEY = getpass.getpass("Enter your Pinecone API key: ")

In [314]:
from pinecone.grpc import PineconeGRPC
#from pinecone import Pinecone

pinecone = PineconeGRPC(api_key=PINECONE_API_KEY)
# pinecone = Pinecone(api_key=PINECONE_API_KEY)

pinecone.list_indexes()

{'indexes': [{'dimension': 1536,
              'host': 'pinecone-worshop-1-2kw20wn.svc.apw5-4e34-81fa.pinecone.io',
              'metric': 'cosine',
              'name': 'pinecone-worshop-1',
              'spec': {'serverless': {'cloud': 'aws', 'region': 'us-west-2'}},
              'status': {'ready': True, 'state': 'Ready'}},
             {'dimension': 1536,
              'host': 'pinecone-workshop-1-2kw20wn.svc.aped-4627-b74a.pinecone.io',
              'metric': 'cosine',
              'name': 'pinecone-workshop-1',
              'spec': {'serverless': {'cloud': 'aws', 'region': 'us-east-1'}},
              'status': {'ready': True, 'state': 'Ready'}}]}

![section-breakpoint.png](https://i.ibb.co/344JqH3/section-breakpoint.png)

### Create a Pinecone Index

More info on serverless: https://docs.pinecone.io/reference/architecture/serverless-architecture#overview  
Ref API: https://docs.pinecone.io/reference/api/control-plane/create_index 

In [315]:
from pinecone import ServerlessSpec

# Check if the index already exists and delete it
if INDEX_NAME in [index.name for index in pinecone.list_indexes()]:
    pinecone.delete_index(INDEX_NAME)

# Create a new index with the specified name, dimension, metric, and spec
# Docs: https://docs.pinecone.io/reference/api/control-plane/create_index
pinecone.create_index(
    name=INDEX_NAME,
    dimension=VECTOR_DIMENSIONS,
    metric="cosine",
    spec=ServerlessSpec(cloud=PINECONE_CLOUD, region=PINECONE_REGION),
)

# Get a reference to the newly created index by name
index = pinecone.Index(INDEX_NAME)
print(index.describe_index_stats())

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 0}},
 'total_vector_count': 0}
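
Note that the cell above deletes an existing index with the same name, so run it only against an index you are happy to lose. A freshly created index may also take a moment to become ready. The sketch below is a generic polling helper; the name `wait_until` and its parameters are hypothetical and not part of the Pinecone client.

```python
import time
from typing import Callable


def wait_until(
    is_ready: Callable[[], bool], timeout_s: float = 60.0, interval_s: float = 1.0
) -> bool:
    """Poll `is_ready` until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False
        # Sleep between checks so we don't hammer the API
        time.sleep(interval_s)
```

Assuming `describe_index` reports the same `status` field that appears in the `list_indexes()` output above, it could be used as `wait_until(lambda: pinecone.describe_index(INDEX_NAME).status["ready"])`.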


![section-breakpoint.png](https://i.ibb.co/344JqH3/section-breakpoint.png)

### Dataset

We are going to use a sample of 1000 AI papers that can be found here: https://huggingface.co/datasets/smartcat/ai-arxiv2-chunks-embedded 

The dataset is already chunked and embedded using `text-embedding-3-small`, so we can load it and upsert it to Pinecone directly.

If you want to play with chunking strategies and embeddings, you can find the full data set here: https://huggingface.co/datasets/jamescalam/ai-arxiv2

Dataset API reference: https://huggingface.co/docs/datasets/en/index  
Slicing and indexing: https://huggingface.co/docs/datasets/en/access 

In [248]:
import datasets

dataset = datasets.load_dataset("smartcat/ai-arxiv2-chunks-embedded", split="train")
dataset

Dataset({
    features: ['doi', 'chunk-id', 'chunk', 'id', 'title', 'summary', 'source', 'authors', 'categories', 'comment', 'journal_ref', 'primary_category', 'published', 'updated', 'references', 'embeddings', 'metadata'],
    num_rows: 79782
})

In [249]:
chunk_display_html(dataset[0])

Key,Value
Title,Foundations of Vector Retrieval
doi,2401.09350
Chunk ID,0
chunk,"4 2 0 2 n a J 7 1 ] S D . s c [ 1 v 0 5 3 9 0 . 1 0 4 2 : v i X r a Sebastian Bruch # Foundations of Vector Retrieval # Preface We are witness to a few years of remarkable developments in Artificial Intelligence with the use of advanced machine learning algorithms, and in particular, deep learning. Gargantuan, complex neural networks that can learn through self-supervisionâand quickly so with the aid of special- ized hardwareâtransformed the research landscape so dramatically that, incremental overnight it seems, many fields experienced not the usual, progress, but rather a leap forward. Machine translation, natural language understanding, information retrieval, recommender systems, and computer vision are but a few examples of research areas that have had to grapple with the shock. Countless other disciplines beyond computer science such as robotics, biology, and chemistry too have benefited from deep learning."
id,2401.09350#0
Summary,"Vectors are universal mathematical objects that can represent text, images, speech, or a mix of these data modalities. That happens regardless of whether data is represented by hand-crafted features or learnt embeddings. Collect a large enough quantity of such vectors and the question of retrieval becomes urgently relevant: Finding vectors that are more similar to a query vector. This monograph is concerned with the question above and covers fundamental concepts along with advanced data structures and algorithms for vector retrieval. In doing so, it recaps this fascinating topic and lowers barriers of entry into this rich area of research."
Source,http://arxiv.org/pdf/2401.09350
Authors,Sebastian Bruch
Categories,"cs.DS, cs.IR"
Comment,


In [250]:
print(len(dataset[0]["embeddings"]))
print(dataset[0]["embeddings"][:10])

1536
[-0.020014504, -0.013545036, 0.04353385, -0.0029185074, 0.03552278, -0.034943033, 0.013927143, 0.06566971, -0.06888468, 0.026971487]


In [251]:
list(dataset[0]["metadata"].keys())

['authors',
 'chunk_id',
 'doc_id',
 'primary_category',
 'published',
 'source',
 'summary',
 'text',
 'title']

![section-breakpoint.png](https://i.ibb.co/344JqH3/section-breakpoint.png)

### What has been done with the dataset?

![chunking-dataset.png](https://i.ibb.co/9wV70Q7/chunking-dataset.png)

![section-breakpoint.png](https://i.ibb.co/344JqH3/section-breakpoint.png)

#### Chunking Strategies
1. Character split (with overlapping)
2. Recursive character split
3. Document specific splitting
4. Semantic Chunking
5. Agentic?
6. More?

Introduction to chunking: https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb
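
As an illustration of the first strategy, here is a minimal character split with overlap. It is a sketch under simple assumptions (fixed-size windows over raw characters, hypothetical function name and defaults) and is not the code that was used to build the dataset.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int = 200, overlap: int = 50) -> List[str]:
    """Split `text` into windows of `chunk_size` characters that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk, at the cost of storing some text twice.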

![section-breakpoint.png](https://i.ibb.co/344JqH3/section-breakpoint.png)

### Data upsert to Pinecone

Let's insert the data into Pinecone in batches.

From our dataset we need 3 columns:
1. `id` - the ID of the chunk we want to insert
2. `embeddings` - the vector embedding of the chunk, created with `text-embedding-3-small`
3. `metadata` - a dictionary with additional data about the chunk

The code for upserting:
```python
index.upsert(vectors=[
    (id1, vector1, metadata1),
    (id2, vector2, metadata2),
    # ...
])
```
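
The `(id, vector, metadata)` tuples can be built by zipping the three columns together and slicing them into batches. A minimal sketch, with a hypothetical helper name, that works on plain Python sequences:

```python
from typing import Any, Dict, Iterator, List, Sequence, Tuple

Vector = Tuple[str, List[float], Dict[str, Any]]


def iter_vector_batches(
    ids: Sequence[str],
    embeddings: Sequence[List[float]],
    metadata: Sequence[Dict[str, Any]],
    batch_size: int = 100,
) -> Iterator[List[Vector]]:
    """Yield lists of (id, vector, metadata) tuples, at most `batch_size` per list."""
    if not (len(ids) == len(embeddings) == len(metadata)):
        raise ValueError("ids, embeddings and metadata must have the same length")
    for start in range(0, len(ids), batch_size):
        end = start + batch_size
        yield list(zip(ids[start:end], embeddings[start:end], metadata[start:end]))
```

Each yielded list has the shape expected by the `vectors` argument in the snippet above.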

### Note on optimization

To improve throughput, we are going to add the `async_req=True` parameter. 
Upsert will then return futures that we need to wait on.

Optimization tips:
1. Deploy the application in the same region as the index
2. Upsert in batches
3. Send upserts in parallel
4. Use GRPCIndex, but make sure to add backoff for throttling
5. Use namespaces and metadata filtering
6. Stay within quotas and limits: https://docs.pinecone.io/reference/quotas-and-limits

Optimized upsert code:
```python
future = index.upsert(
    vectors=[
        (id1, vector1, metadata1),
        (id2, vector2, metadata2),
        # ...
    ],
    async_req=True,
)
# ... submit more batches ...
future.get()  # wait for the upsert to complete (gRPC futures use .result() instead)
```
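
The backoff mentioned in tip 4 can be sketched as a small retry wrapper with exponential delays and jitter. The name `call_with_backoff` and its defaults are hypothetical, and for simplicity the sketch retries on any exception; real code should catch only the client's throttling errors.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn`, retrying with exponentially growing delays when it raises."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:  # assumption: narrow this to throttling errors in real code
            if attempt == max_attempts - 1:
                raise  # out of retries, surface the last error
            delay = base_delay_s * (2 ** attempt)
            # Add jitter so parallel workers do not retry in lockstep
            sleep(delay + random.uniform(0, delay / 2))
    raise RuntimeError("unreachable")
```

A batch upsert could then be wrapped as `call_with_backoff(lambda: index.upsert(vectors=batch))`.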

For scaling up and further optimizations, make sure to read: https://docs.pinecone.io/guides/operations/performance-tuning#increasing-throughput  
Metadata filtering: https://docs.pinecone.io/guides/data/filter-with-metadata 
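
To give a feel for metadata filtering, the sketch below builds the keyword arguments for a query restricted to one arXiv category. The helper name is hypothetical; `primary_category` is one of the metadata fields in this dataset, and `$eq` is Pinecone's equality operator for filters.

```python
from typing import Any, Dict, List


def build_filtered_query(
    vector: List[float], category: str, top_k: int = 10
) -> Dict[str, Any]:
    """Build query arguments that only match chunks from one primary category."""
    return {
        "vector": vector,
        "top_k": top_k,
        "include_metadata": True,
        # Only vectors whose metadata field equals `category` are considered
        "filter": {"primary_category": {"$eq": category}},
    }
```

Given an embedded query vector, the result could be passed on as `index.query(**build_filtered_query(vector, "cs.CL"))`.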

In [316]:
from tqdm import tqdm
from pinecone import Index


def upsert_batch_async(ds: datasets.Dataset, index: Index, batch_size: int = 100) -> None:
    # Manual variant, kept for reference: submit every batch with `async_req=True`,
    # then wait for all of them. The `upsert_batch` defined below is the one used
    # in the following cells.
    futures = []
    for i in range(0, len(ds), batch_size):
        i_end = min(i + batch_size, len(ds))
        batch = ds.select(range(i, i_end))

        # Select the id, embeddings, and metadata from the batch
        ids = batch["id"]
        embeddings = batch["embeddings"]
        metadata = batch["metadata"]

        # Upsert the vectors to the Pinecone index asynchronously
        # Use `index.upsert(vectors=[(id1, vector1, metadata1), (id2, vector2, metadata2)], async_req=True)`
        futures.append(
            index.upsert(vectors=list(zip(ids, embeddings, metadata)), async_req=True)
        )

    # Wait for all the upserts to complete
    for future in tqdm(futures, total=len(futures), desc="Waiting for upserts to complete"):
        # The REST client returns results with `.get()`; the gRPC client returns
        # futures with `.result()`. Use whichever is available.
        wait = getattr(future, "result", None) or future.get
        wait()


def upsert_batch(ds: datasets.Dataset, index: Index, batch_size: int = 200) -> None:
    df = ds.to_pandas()
    df = df[["id", "embeddings", "metadata"]]
    df.rename(columns={"embeddings": "values"}, inplace=True)
    index.upsert_from_dataframe(df, batch_size=batch_size, show_progress=True)

In [317]:
upsert_batch(dataset.select(range(500)), index, batch_size=100)

sending upsert requests:   0%|          | 0/500 [00:00<?, ?it/s]

collecting async responses:   0%|          | 0/5 [00:00<?, ?it/s]

In [318]:
print(index.describe_index_stats())

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 500}},
 'total_vector_count': 500}


In [319]:
upsert_batch(dataset, index, batch_size=150)

sending upsert requests:   0%|          | 0/79782 [00:00<?, ?it/s]

collecting async responses:   0%|          | 0/532 [00:00<?, ?it/s]

![visual-breakpoint.png](https://i.ibb.co/rHVSp3w/visual-breakpoint.png)

## Revisit Agenda


![done-next.png](https://i.ibb.co/QYC9nrb/done-next.png)


![visual-breakpoint.png](https://i.ibb.co/rHVSp3w/visual-breakpoint.png)

## 2. Retrieve against the query



![section-breakpoint.png](https://i.ibb.co/344JqH3/section-breakpoint.png)

![semantic-search.png](https://i.ibb.co/2dWGRHn/semantic-search.png)

![section-breakpoint.png](https://i.ibb.co/344JqH3/section-breakpoint.png)

Now that everything is inserted into Pinecone, let's query it. The input is the query text and the output is the top K most similar chunks.
Steps: 
1. Encode the input text to generate an embedding 
2. Call Pinecone's query function to retrieve the top K most similar results

Embedding API Ref: https://platform.openai.com/docs/api-reference/embeddings  
Query API Ref: https://docs.pinecone.io/reference/api/data-plane/query 
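
The index was created with `metric="cosine"`, so "similar" here means a high cosine similarity between the query embedding and the stored embeddings. The brute-force sketch below shows the idea on toy two-dimensional vectors; the function names are hypothetical, and a vector database uses far more efficient search than this full scan.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(
    query: List[float], vectors: Dict[str, List[float]], k: int = 2
) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the k best."""
    scored = [(vid, cosine_similarity(query, vec)) for vid, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


toy_index = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print(top_k_similar([1.0, 0.0], toy_index, k=2))
```

Here the query points in the same direction as `a`, so `a` ranks first, followed by `c`, which is 45 degrees away.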

In [203]:
from typing import List


def encode(
    text: str, openai_client: OpenAI, model: str = "text-embedding-3-small"
) -> List[float]:
    # Use the OpenAI API to encode the text
    # Docs: https://platform.openai.com/docs/api-reference/embeddings
    response = openai_client.embeddings.create(
        model=model, input=text
    )

    return response.data[0].embedding

In [72]:
res = encode("What are key benefits of mistral 7B?", openai)
print(res[:10])
print(len(res))

[-0.02581052854657173, 0.03242715075612068, 0.019921164959669113, -0.021119002252817154, -0.036020658910274506, 0.023999514058232307, -0.05059434100985527, 0.05886511504650116, 0.024940671399235725, -0.006969555746763945]
1536


![section-breakpoint.png](https://i.ibb.co/344JqH3/section-breakpoint.png)

### Query Pinecone

In [204]:
from pinecone import QueryResponse


def retrieve(
    query: str, index: Index, openai_client: OpenAI, top_k: int = 10
) -> QueryResponse:
    # Encode the query using the OpenAI API
    query_embedding = encode(query, openai_client)

    # Use the Pinecone index to query the encoded vector
    # Docs: https://docs.pinecone.io/reference/api/data-plane/query
    ret = index.query(
        vector=query_embedding, top_k=top_k, include_metadata=True, include_values=False
    )
    return ret

In [205]:
res = retrieve("What is Mistral 7B?", index, openai)
print(res.matches[0].metadata['text'])

Abstract
We introduce Mistral 7B, a 7–billion-parameter language model engineered for superior performance and efficiency. Mistral 7B outperforms the best open 13B model (Llama 2) across all evaluated benchmarks, and the best released 34B model (Llama 1) in reasoning, mathematics, and code generation. Our model leverages grouped-query attention (GQA) for faster inference, coupled with sliding window attention (SWA) to effectively handle sequences of arbitrary length with a reduced inference cost. We also provide a model fine-tuned to follow instructions, Mistral 7B – Instruct, that surpasses Llama 2 13B – chat model both on human and automated benchmarks. Our models are released under the Apache 2.0 license. Code: https://github.com/mistralai/mistral-src Webpage: https://mistral.ai/news/announcing-mistral-7b/
# Introduction


![visual-breakpoint.png](https://i.ibb.co/rHVSp3w/visual-breakpoint.png)

## 3. Augment and generate

![section-breakpoint.png](https://i.ibb.co/344JqH3/section-breakpoint.png)

![workflow-rag-simple.png](https://i.ibb.co/p0gwY23/workflow-rag-simple.png)

![section-breakpoint.png](https://i.ibb.co/344JqH3/section-breakpoint.png)

### Generate a final response

Combine all the pieces together:
1. Perform a semantic search on the input query
2. Build a context (prompt) for an LLM
3. Call the LLM to generate a final response
4. Return the final response and the retrieved context

The relevant context can be found in the metadata. You can use:
1. `title` - the paper title
2. `published` - the publication date
3. `primary_category` - the primary arXiv category
4. `summary` - the paper summary
5. `text` - the chunk text, which is the most useful field for building a context for the LLM

In [206]:
from typing import Tuple

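def from_metadata(metadata: Dict) -> str:
    # `authors` appears as a comma-separated string in this dataset, so split it
    # before slicing; otherwise `[:5]` would keep only the first five characters.
    authors = metadata['authors']
    if isinstance(authors, str):
        authors = authors.split(", ")
    authors = ", ".join(authors[:5])
    return f"""
***
    Title: {metadata['title']}
    Authors: {authors}
    Published: {metadata['published']}
    Primary category: {metadata['primary_category']}
    Paper summary: {metadata['summary']}
    Text: {metadata['text']}
"""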


def augment(query: str, query_results: QueryResponse) -> str:
    context = "\n".join(
        [from_metadata(result.metadata) for result in query_results.matches]
    )
    return f"""
Answer the question based on the following context. If you can't find the answer, say "I don't know".
The answers should be clear, easy to understand, complete and comprehensive.

Context:
{context}

Question: {query}
"""


def rag(query: str, index: Index, openai_client: OpenAI, top_k: int = 5) -> Tuple[str, QueryResponse]:
    # 1. [RETRIEVE]: Use the `retrieve` function to get the top_k results
    # 2. [AUGMENT]: Use the `augment` function to generate the prompt (context + question)
    # 3. [GENERATE]: Use the `generate` function to generate the response
    # 4. Return the response and query_results
    query_results = retrieve(query, index, openai_client, top_k=top_k)
    prompt = augment(query, query_results)
    response = generate(prompt, openai_client)
    return response, query_results

In [321]:
answer, context = rag("What are key benefits of Mistral 7B?", index, openai)
display_markdown(answer)

The key benefits of Mistral 7B are:

1. **Superior Performance**: Mistral 7B is engineered for superior performance, outperforming the Llama 2 13B model across all evaluated benchmarks and surpassing the Llama 1 34B model in reasoning, mathematics, and code generation.

2. **Efficiency**: Mistral 7B is designed to be efficient, balancing high performance with reduced computational costs and inference latency. It demonstrates that a carefully designed language model can deliver high performance while maintaining efficient inference.

3. **Innovative Attention Mechanisms**: Mistral 7B leverages grouped-query attention (GQA) for faster inference and sliding window attention (SWA) to handle sequences of arbitrary length effectively with reduced inference cost. These attention mechanisms contribute to the enhanced performance and efficiency of Mistral 7B.

4. **Ease of Deployment and Fine-Tuning**: Mistral 7B is released under the Apache 2.0 license, accompanied by a reference implementation for easy deployment on cloud platforms. It is also crafted for ease of fine-tuning across a myriad of tasks, making it adaptable for various real-world applications.

5. **Outperforming Competing Models**: Mistral 7B surpasses the Llama 2 13B model in both human and automated benchmarks, showcasing its ability to outperform existing models in various tasks such as chat modeling and following instructions.

Overall, Mistral 7B aims to provide a high-performing language model that is efficient, easy to deploy, and adaptable for a wide range of applications.

In [322]:
display_retrieved_context(context.matches[:2])

Key,Value
text,"Mistral 7B is released under the Apache 2.0 license. This release is accompanied by a reference implementation1 facilitating easy deployment either locally or on cloud platforms such as AWS, GCP, or Azure using the vLLM [17] inference server and SkyPilot 2. Integration with Hugging Face 3 is also streamlined for easier integration. Moreover, Mistral 7B is crafted for ease of fine-tuning across a myriad of tasks. As a demonstration of its adaptability and superior performance, we present a chat model fine-tuned from Mistral 7B that significantly outperforms the Llama 2 13B â Chat model. Mistral 7B takes a significant step in balancing the goals of getting high performance while keeping large language models efficient. Through our work, our aim is to help the community create more affordable, efficient, and high-performing language models that can be used in a wide range of real-world applications. # 2 Architectural details The cat sat on the The cat sat on the window size â_ââ> The cat sat on the Vanilla Attention Sliding Window Attention Effective Context Length"
chunk_id,3.0
primary_category,cs.CL
authors,"Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed"
source,http://arxiv.org/pdf/2310.06825
summary,"We introduce Mistral 7B v0.1, a 7-billion-parameter language model engineered for superior performance and efficiency. Mistral 7B outperforms Llama 2 13B across all evaluated benchmarks, and Llama 1 34B in reasoning, mathematics, and code generation. Our model leverages grouped-query attention (GQA) for faster inference, coupled with sliding window attention (SWA) to effectively handle sequences of arbitrary length with a reduced inference cost. We also provide a model fine-tuned to follow instructions, Mistral 7B -- Instruct, that surpasses the Llama 2 13B -- Chat model both on human and automated benchmarks. Our models are released under the Apache 2.0 license."
published,20231010
title,Mistral 7B
doc_id,2310.06825

Key,Value
text,"Abstract We introduce Mistral 7B, a 7–billion-parameter language model engineered for superior performance and efficiency. Mistral 7B outperforms the best open 13B model (Llama 2) across all evaluated benchmarks, and the best released 34B model (Llama 1) in reasoning, mathematics, and code generation. Our model leverages grouped-query attention (GQA) for faster inference, coupled with sliding window attention (SWA) to effectively handle sequences of arbitrary length with a reduced inference cost. We also provide a model fine-tuned to follow instructions, Mistral 7B – Instruct, that surpasses Llama 2 13B – chat model both on human and automated benchmarks. Our models are released under the Apache 2.0 license. Code: https://github.com/mistralai/mistral-src Webpage: https://mistral.ai/news/announcing-mistral-7b/ # Introduction"
chunk_id,1.0
primary_category,cs.CL
authors,"Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed"
source,http://arxiv.org/pdf/2310.06825
summary,"We introduce Mistral 7B v0.1, a 7-billion-parameter language model engineered for superior performance and efficiency. Mistral 7B outperforms Llama 2 13B across all evaluated benchmarks, and Llama 1 34B in reasoning, mathematics, and code generation. Our model leverages grouped-query attention (GQA) for faster inference, coupled with sliding window attention (SWA) to effectively handle sequences of arbitrary length with a reduced inference cost. We also provide a model fine-tuned to follow instructions, Mistral 7B -- Instruct, that surpasses the Llama 2 13B -- Chat model both on human and automated benchmarks. Our models are released under the Apache 2.0 license."
published,20231010
title,Mistral 7B
doc_id,2310.06825


![section-breakpoint.png](https://i.ibb.co/344JqH3/section-breakpoint.png)

In [324]:
answer, context = rag("What is chain of thoughts?", index, openai)
display_markdown(answer)

A chain of thought refers to a coherent flow of sentences that reveals the premises and conclusion of a reasoning problem. It clearly decomposes a multi-hop reasoning task into intermediate steps instead of solving the task in a black-box way. The chain of thought can be the step-by-step thought process before arriving at the final answer or explanations that come after the answer. In the context of language models, a chain of thought is a series of intermediate reasoning steps that help improve the ability of large language models to perform complex reasoning tasks. It allows models to decompose multi-step problems into intermediate steps, providing a more interpretable window into the behavior of the model and enabling additional computation to be allocated to problems that require more reasoning steps.

In [333]:
answer, context = rag("How to apply LLMs to recommender engines?", index, openai)
display_markdown(answer)

To apply Large Language Models (LLMs) to recommender engines, several key strategies and techniques can be employed based on the context provided:

1. **LLMs as Recommendation Models**:
   - **Zero-Shot Paradigm**: LLMs can be prompted to complete the recommendation task without parameter tuning. This approach involves prompt engineering methods like recency-focused and in-context learning to improve recommendation performance and mitigate model biases.
   - **Instruction Tuning**: Specializing LLMs for personalized recommendations through instruction tuning is another effective method. High-quality instruction data is crucial for adapting LLMs to recommendation tasks. This data can be constructed based on user-item interactions with heuristic templates to enhance instruction diversity.

2. **LLM-Enhanced Recommendation Models**:
   - **Incorporating Universal Knowledge**: LLMs can leverage the universal knowledge encoded in them to enhance traditional recommender systems. This can be achieved through various approaches:
     - Inferring users' potential intentions from historical interaction data and using them to improve item retrieval.
     - Encoding side information of items and users (e.g., descriptions, reviews) using LLMs to provide more informative representations for traditional recommender systems.
     - Adopting distillation-like methods to transfer LLM capacities to improve traditional recommenders by aligning hidden states of LLMs and traditional models through joint training.

3. **LLMs as Recommendation Simulators**:
   - LLMs can also be utilized as recommendation simulators to simulate potential user instructions in various scenarios like product search and personalized recommendations. This approach can help in enhancing the recommendation process by incorporating diverse user behaviors and preferences.

4. **Addressing Challenges**:
   - **Semantic Gap**: Despite the potential of LLMs, challenges such as understanding personalized user behaviors and domain-specific collaborative semantics need to be addressed. Instruction tuning and vocabulary extension with semantic identifiers can help bridge this gap.
   - **Efficiency**: To deploy LLMs effectively in real-world recommender systems, techniques like efficient tuning, quantization methods, and improved context length extension should be explored to enhance inference speed and reduce memory overhead.

By implementing these strategies and addressing the associated challenges, LLMs can be effectively applied to recommender engines to enhance recommendation performance and user experience.

In [334]:
display_retrieved_context(context.matches[:2])

Key,Value
text,"LLM-enhanced Recommendation Models. In addition to instructing LLMs to directly provide recommendations, researchers also propose leveraging the universal knowledge encoded in LLMs to improve traditional recommender systems. Existing approaches in this line can be divided into three main categories. The first category employs LLMs to infer users' potential intention from their historical interaction data. Furthermore, traditional recommendation/search models employ the inferred intentions to improve the retrieval of relevant items [812, 813]. Additionally, several studies explore the use of LLMs as feature encoders. They employ LLMs to encode the side information of items and users (e.g., item's descriptions and user's reviews), thus deriving more informative representations of users and items. These representations are then fed into traditional recommender systems as augmented input [814, 815]. As another alternative approach, several studies [816, 817] adopt a distillation-like way to transfer LLM's capacities (e.g., semantic encoding) to improve traditional recommenders (i.e., small models). Specially, they align the hidden states of LLMs and traditional recommendation models via joint training. After training, since only the enhanced small model will be deployed online, it can avoid the huge overhead of LLMs in online service."
chunk_id,481.0
primary_category,cs.CL
authors,"Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, Ji-Rong Wen"
source,http://arxiv.org/pdf/2303.18223
summary,"Language is essentially a complex, intricate system of human expressions governed by grammatical rules. It poses a significant challenge to develop capable AI algorithms for comprehending and grasping a language. As a major approach, language modeling has been widely studied for language understanding and generation in the past two decades, evolving from statistical language models to neural language models. Recently, pre-trained language models (PLMs) have been proposed by pre-training Transformer models over large-scale corpora, showing strong capabilities in solving various NLP tasks. Since researchers have found that model scaling can lead to performance improvement, they further study the scaling effect by increasing the model size to an even larger size. Interestingly, when the parameter scale exceeds a certain level, these enlarged language models not only achieve a significant performance improvement but also show some special abilities that are not present in small-scale language models. To discriminate the difference in parameter scale, the research community has coined the term large language models (LLM) for the PLMs of significant size. Recently, the research on LLMs has been largely advanced by both academia and industry, and a remarkable progress is the launch of ChatGPT, which has attracted widespread attention from society. The technical evolution of LLMs has been making an important impact on the entire AI community, which would revolutionize the way how we develop and use AI algorithms. In this survey, we review the recent advances of LLMs by introducing the background, key findings, and mainstream techniques. In particular, we focus on four major aspects of LLMs, namely pre-training, adaptation tuning, utilization, and capacity evaluation. Besides, we also summarize the available resources for developing LLMs and discuss the remaining issues for future directions."
published,20230331
title,A Survey of Large Language Models
doc_id,2303.18223

Key,Value
text,"LLMs as Recommendation Models. With specific methods or mechanisms, LLMs can be adapted to serve as recommendation models. Existing work along this line can be generally divided into two main categories. First, some methods prompt LLMs for completing the recommendation task in a zero-shot paradigm (i.e., without parameter tuning) [805, 806]. A series of prompt engineering methods like recency-focused and in-context learning are introduced to improve recommendation performance as well as alleviate the potential model biases [807, 808]. Second, another category of studies aim to specialize LLMs for personalized recommendation through instruction tuning [357, 809]. Specially, high-quality instruction data is key to adapt LLMs to the recommendation tasks, which can be constructed based on user-item interactions with heuristic templates. To further improve the instruction diversity, InstructRec [357] employs self-instruct technique to simulate large amounts of potential user instructions in various scenarios like product search and personalized recommendations. In addition to representing each item by its text description, there is also growing attention on extending LLM's vocabulary with semantic identifiers in recommender systems [810, 811], to incorporate collaborative semantics into LLMs."
chunk_id,480.0
primary_category,cs.CL
authors,"Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, Ji-Rong Wen"
source,http://arxiv.org/pdf/2303.18223
summary,"Language is essentially a complex, intricate system of human expressions governed by grammatical rules. It poses a significant challenge to develop capable AI algorithms for comprehending and grasping a language. As a major approach, language modeling has been widely studied for language understanding and generation in the past two decades, evolving from statistical language models to neural language models. Recently, pre-trained language models (PLMs) have been proposed by pre-training Transformer models over large-scale corpora, showing strong capabilities in solving various NLP tasks. Since researchers have found that model scaling can lead to performance improvement, they further study the scaling effect by increasing the model size to an even larger size. Interestingly, when the parameter scale exceeds a certain level, these enlarged language models not only achieve a significant performance improvement but also show some special abilities that are not present in small-scale language models. To discriminate the difference in parameter scale, the research community has coined the term large language models (LLM) for the PLMs of significant size. Recently, the research on LLMs has been largely advanced by both academia and industry, and a remarkable progress is the launch of ChatGPT, which has attracted widespread attention from society. The technical evolution of LLMs has been making an important impact on the entire AI community, which would revolutionize the way how we develop and use AI algorithms. In this survey, we review the recent advances of LLMs by introducing the background, key findings, and mainstream techniques. In particular, we focus on four major aspects of LLMs, namely pre-training, adaptation tuning, utilization, and capacity evaluation. Besides, we also summarize the available resources for developing LLMs and discuss the remaining issues for future directions."
published,20230331
title,A Survey of Large Language Models
doc_id,2303.18223


![section-breakpoint.png](https://i.ibb.co/344JqH3/section-breakpoint.png)

In [326]:
answer, context = rag(
    "Write side by side summaries between mistral, kosmos, palm. If there is key difference between them, what is it",
    index,
    openai,
    top_k=10,
)
display_markdown(answer)

I don't know the answer to the question about writing side by side summaries between Mistral, Kosmos, and Palm.

In [225]:
display_retrieved_context(context.matches[:5])

Key,Value
authors,"Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed"
chunk_id,12.0
doc_id,2310.06825
primary_category,cs.CL
published,20231010
source,http://arxiv.org/pdf/2310.06825
summary,"We introduce Mistral 7B v0.1, a 7-billion-parameter language model engineered for superior performance and efficiency. Mistral 7B outperforms Llama 2 13B across all evaluated benchmarks, and Llama 1 34B in reasoning, mathematics, and code generation. Our model leverages grouped-query attention (GQA) for faster inference, coupled with sliding window attention (SWA) to effectively handle sequences of arbitrary length with a reduced inference cost. We also provide a model fine-tuned to follow instructions, Mistral 7B -- Instruct, that surpasses the Llama 2 13B -- Chat model both on human and automated benchmarks. Our models are released under the Apache 2.0 license."
text,"Table 2: Comparison of Mistral 7B with Llama. Mistral 7B outperforms Llama 2 13B on all metrics, and approaches the code performance of Code-Llama 7B without sacrificing performance on non-code benchmarks. Size and Efficiency. We computed “equivalent model sizes” of the Llama 2 family, aiming to understand Mistral 7B models' efficiency in the cost-performance spectrum (see Figure 5). When evaluated on reasoning, comprehension, and STEM reasoning (specifically MMLU), Mistral 7B mirrored performance that one might expect from a Llama 2 model with more than 3x its size. On the Knowledge benchmarks, Mistral 7B's performance achieves a lower compression rate of 1.9x, which is likely due to its limited parameter count that restricts the amount of knowledge it can store. Evaluation Differences. On some benchmarks, there are some differences between our evaluation protocol and the one reported in the Llama 2 paper: 1) on MBPP, we use the hand-verified subset 2) on TriviaQA, we do not provide Wikipedia contexts. # Instruction Finetuning"
title,Mistral 7B

Key,Value
authors,"Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed"
chunk_id,9.0
doc_id,2310.06825
primary_category,cs.CL
published,20231010
source,http://arxiv.org/pdf/2310.06825
summary,"We introduce Mistral 7B v0.1, a 7-billion-parameter language model engineered for superior performance and efficiency. Mistral 7B outperforms Llama 2 13B across all evaluated benchmarks, and Llama 1 34B in reasoning, mathematics, and code generation. Our model leverages grouped-query attention (GQA) for faster inference, coupled with sliding window attention (SWA) to effectively handle sequences of arbitrary length with a reduced inference cost. We also provide a model fine-tuned to follow instructions, Mistral 7B -- Instruct, that surpasses the Llama 2 13B -- Chat model both on human and automated benchmarks. Our models are released under the Apache 2.0 license."
text,"Detailed results for Mistral 7B, Llama 2 7B/13B, and Code-Llama 7B are reported in Table 2. Figure 4 compares the performance of Mistral 7B with Llama 2 7B/13B, and Llama 1 34B in different categories. Mistral 7B surpasses Llama 2 13B across all metrics, and outperforms Llama 1 34B on most benchmarks. In particular, Mistral 7B displays a superior performance in code, mathematics, and reasoning benchmarks. (Footnote 4: Since Llama 2 34B was not open-sourced, we report results for Llama 1 34B.) [Figure 4: Mistral 7B vs. LLaMA 2 7B, LLaMA 2 13B, and LLaMA 1 34B; categories: MMLU, Knowledge, Reasoning, Comprehension, AGI Eval, Math, BBH, Code]"
title,Mistral 7B

Key,Value
authors,"Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed"
chunk_id,8.0
doc_id,2310.06825
primary_category,cs.CL
published,20231010
source,http://arxiv.org/pdf/2310.06825
summary,"We introduce Mistral 7B v0.1, a 7-billion-parameter language model engineered for superior performance and efficiency. Mistral 7B outperforms Llama 2 13B across all evaluated benchmarks, and Llama 1 34B in reasoning, mathematics, and code generation. Our model leverages grouped-query attention (GQA) for faster inference, coupled with sliding window attention (SWA) to effectively handle sequences of arbitrary length with a reduced inference cost. We also provide a model fine-tuned to follow instructions, Mistral 7B -- Instruct, that surpasses the Llama 2 13B -- Chat model both on human and automated benchmarks. Our models are released under the Apache 2.0 license."
text,"# 3 Results We compare Mistral 7B to Llama, and re-run all benchmarks with our own evaluation pipeline for fair comparison. We measure performance on a wide variety of tasks categorized as follow: • Commonsense Reasoning (0-shot): Hellaswag [28], Winogrande [21], PIQA [4], SIQA [22], OpenbookQA [19], ARC-Easy, ARC-Challenge [9], CommonsenseQA [24] • World Knowledge (5-shot): NaturalQuestions [16], TriviaQA [15] • Reading Comprehension (0-shot): BoolQ [8], QuAC [7] • Math: GSM8K [10] (8-shot) with maj@8 and MATH [13] (4-shot) with maj@4 • Code: Humaneval [5] (0-shot) and MBPP [2] (3-shot) • Popular aggregated results: MMLU [12] (5-shot), BBH [23] (3-shot), and AGI Eval [29] (3-5-shot, English multiple-choice questions only)"
title,Mistral 7B

Key,Value
authors,"Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed"
chunk_id,3.0
doc_id,2310.06825
primary_category,cs.CL
published,20231010
source,http://arxiv.org/pdf/2310.06825
summary,"We introduce Mistral 7B v0.1, a 7-billion-parameter language model engineered for superior performance and efficiency. Mistral 7B outperforms Llama 2 13B across all evaluated benchmarks, and Llama 1 34B in reasoning, mathematics, and code generation. Our model leverages grouped-query attention (GQA) for faster inference, coupled with sliding window attention (SWA) to effectively handle sequences of arbitrary length with a reduced inference cost. We also provide a model fine-tuned to follow instructions, Mistral 7B -- Instruct, that surpasses the Llama 2 13B -- Chat model both on human and automated benchmarks. Our models are released under the Apache 2.0 license."
text,"Mistral 7B is released under the Apache 2.0 license. This release is accompanied by a reference implementation1 facilitating easy deployment either locally or on cloud platforms such as AWS, GCP, or Azure using the vLLM [17] inference server and SkyPilot 2. Integration with Hugging Face 3 is also streamlined for easier integration. Moreover, Mistral 7B is crafted for ease of fine-tuning across a myriad of tasks. As a demonstration of its adaptability and superior performance, we present a chat model fine-tuned from Mistral 7B that significantly outperforms the Llama 2 13B – Chat model. Mistral 7B takes a significant step in balancing the goals of getting high performance while keeping large language models efficient. Through our work, our aim is to help the community create more affordable, efficient, and high-performing language models that can be used in a wide range of real-world applications. # 2 Architectural details [Figure: Vanilla Attention, Sliding Window Attention, Effective Context Length; example sequence 'The cat sat on the'; window size]"
title,Mistral 7B

Key,Value
authors,"Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed"
chunk_id,14.0
doc_id,2310.06825
primary_category,cs.CL
published,20231010
source,http://arxiv.org/pdf/2310.06825
summary,"We introduce Mistral 7B v0.1, a 7-billion-parameter language model engineered for superior performance and efficiency. Mistral 7B outperforms Llama 2 13B across all evaluated benchmarks, and Llama 1 34B in reasoning, mathematics, and code generation. Our model leverages grouped-query attention (GQA) for faster inference, coupled with sliding window attention (SWA) to effectively handle sequences of arbitrary length with a reduced inference cost. We also provide a model fine-tuned to follow instructions, Mistral 7B -- Instruct, that surpasses the Llama 2 13B -- Chat model both on human and automated benchmarks. Our models are released under the Apache 2.0 license."
text,"Table 3: Comparison of Chat models. Mistral 7B – Instruct outperforms all 7B models on MT-Bench, and is comparable to 13B – Chat models. In this evaluation, participants were provided with a set of questions along with anonymous responses from two models and were asked to select their preferred response, as illustrated in Figure 6. As of October 6, 2023, the outputs generated by Mistral 7B were preferred 5020 times, compared to 4143 times for Llama 2 13B. [Figure residue: four panels comparing Mistral with LLaMA 2; x-axis: Model size (billion parameters); annotations include effective LLaMA size 23B (3.3x) and 38B (5.4x)]"
title,Mistral 7B


![visual-breakpoint.png](https://i.ibb.co/rHVSp3w/visual-breakpoint.png)

## Hands-On Exercises

### Exercise #1 - Query Comprehension

Your task is to figure out how to execute the following query:  

`Write side by side summaries between mistral, kosmos, palm. If there is key difference between them, what is it`

Hints:
1. Request a JSON response from GPT: https://platform.openai.com/docs/api-reference/chat/create
2. Use an LLM to extract information from the query
3. Apply the appropriate RAG techniques to construct the best response


In [332]:
# TODO: Implement query comprehension

# 1. Use an LLM to extract the following information:
#     - Category (summarization, comparison, other, etc.)
#     - Queries to run based on the category
# 2. IF the category is "summarization":
#    - run retrieval on the query to get the context and ask the LLM to summarize the context
# 3. IF the category is "other":
#    - run the rag function pipeline
# 4. IF the category is "comparison":
#    - run multiple retrieval + summarization pipelines and ask the LLM to compare them

# Example 1: user_query = "Write side by side summaries between mistral, kosmos, palm. If there is key difference between them, what is it"
# 1. LLM will return the category as "comparison", and queries to run as ["mistral", "kosmos", "palm"]
# 2. Run the retrieval + summarization pipeline for each query
# 3. Ask llm to compare the summaries, write it side by side and highlight the key differences

# Example 2: user_query = "What are key benefits of Mistral 7B?"
# 1. LLM will return the category as "summarization", and queries to run as ["Key benefits of Mistral 7B"]
# 2. Run the retrieval + summarization pipeline for the query

# Example 3: user_query = "How to fine tune LLAMA 2?"
# 1. LLM will return the category as "other", and queries to run as ["How to fine tune LLAMA 2"]
# 2. Run the rag pipeline for the query
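
As a starting point, the routing described in the comments above can be sketched independently of any API. This is a minimal sketch, not a reference solution: `parse_plan`, `route_query`, and the four callables are hypothetical names, and the JSON shape (`category`, `queries`) is an assumption about what you ask the LLM to return. In the notebook, `classify` would wrap a chat completion call and `answer` would wrap the `rag` function.

```python
import json
from typing import Callable, Dict, List

VALID_CATEGORIES = {"summarization", "comparison", "other"}


def parse_plan(raw_json: str) -> Dict[str, object]:
    """Parse the LLM's JSON reply into a category and a list of queries."""
    plan = json.loads(raw_json)
    category = plan.get("category", "other")
    if category not in VALID_CATEGORIES:
        category = "other"  # fall back to plain RAG on unexpected labels
    queries = [str(q) for q in plan.get("queries", [])]
    return {"category": category, "queries": queries}


def route_query(
    user_query: str,
    classify: Callable[[str], str],             # hypothetical: returns the LLM's JSON string
    summarize: Callable[[str], str],            # hypothetical: retrieval + summarization
    answer: Callable[[str], str],               # hypothetical: plain RAG pipeline
    compare: Callable[[str, List[str]], str],   # hypothetical: LLM comparison of summaries
) -> str:
    """Dispatch the query to the pipeline that matches its category."""
    plan = parse_plan(classify(user_query))
    queries = plan["queries"] or [user_query]
    if plan["category"] == "comparison":
        # One retrieval + summarization pass per entity, then a single comparison.
        summaries = [summarize(q) for q in queries]
        return compare(user_query, summaries)
    if plan["category"] == "summarization":
        return summarize(queries[0])
    return answer(queries[0])
```

Passing the pipelines in as callables keeps the routing logic testable with stubs before any LLM or index calls are wired in.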



### Exercise #2 - Paper listing and filtering

Adapt the RAG pipeline to handle paper listing, i.e., it should be able to answer questions such as:  
`List me papers names and sources that are published in 2023 about e-commerce search techniques`

Hints:
1. Use metadata filtering: https://docs.pinecone.io/guides/data/filter-with-metadata 
2. Use an LLM (prompt engineering) to extract the useful data from the query (search term, date, task, etc.)
3. JSON response from ChatGPT: https://platform.openai.com/docs/api-reference/chat/create
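
For the date part of the hints above, a minimal sketch of the metadata filter might look like the following. It assumes that the `published` metadata field is stored as a number in `YYYYMMDD` form (as the displayed values such as `20231010` suggest); if it is stored as a string, the range operators will not apply and the field would need to be re-indexed as a number. `build_published_filter` is a hypothetical helper name.

```python
from typing import Dict, Optional


def build_published_filter(year: Optional[int] = None) -> Dict[str, dict]:
    """Return a Pinecone-style metadata filter restricting results to one year.

    Assumes `published` is a numeric YYYYMMDD value in the index metadata.
    """
    if year is None:
        return {}  # no date constraint extracted from the query
    return {
        "published": {
            "$gte": year * 10000 + 101,   # January 1 of the year
            "$lte": year * 10000 + 1231,  # December 31 of the year
        }
    }
```

The resulting dictionary can be passed as the `filter` argument of the index query, alongside the embedding of the search term extracted by the LLM.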

## Appendix

### LangChain

Now that we have explored the basic components of RAG, we can use an existing framework to build the same pipeline.  

LangChain: https://www.langchain.com/langchain 

```python
!pip install -qU \
    langchain-pinecone \
    langchain-openai \
    langchain
```

```python
from langchain_pinecone import PineconeVectorStore
from langchain_openai import OpenAIEmbeddings
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQAWithSourcesChain

embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",
    openai_api_key=OPENAI_API_KEY,
)

vectorstore = PineconeVectorStore(
    index=index,
    embedding=embeddings,
    text_key="text",
    pinecone_api_key=PINECONE_API_KEY,
    index_name=INDEX_NAME,
)

llm = ChatOpenAI(openai_api_key=OPENAI_API_KEY, model="gpt-3.5-turbo", temperature=0.0)
langchain_rag = RetrievalQAWithSourcesChain.from_chain_type(
    llm=llm, chain_type="stuff", retriever=vectorstore.as_retriever()
)
```

```python
langchain_rag.invoke("What are all benefits of using Mistral 7B?")
```

![visual-breakpoint.png](https://i.ibb.co/rHVSp3w/visual-breakpoint.png)

### Fine-Tune vs RAG vs Prompt Engineering

![rag-vs-prompt-vs-finetune.png](https://i.ibb.co/C8rvnTY/Screenshot-2024-05-30-at-21-17-45.png)

![visual-breakpoint.png](https://i.ibb.co/rHVSp3w/visual-breakpoint.png)