# RAG Systems
---

This project implements a retrieval-augmented generation (RAG) pipeline over a dataset containing news information from BBC News. The goal is to let the LLM retrieve relevant news details from the dataset and use them to generate better-informed responses. The model used is [llama-3-1-8b-instruct-turbo](https://www.together.ai/models/llama-3-1), which was trained on data up to December 2023. The RAG system lets it draw on information about events that occurred in 2024.

Tasks include:
- Use a query and retrieval function to access relevant data given a query
- Format the data appropriately
- Generate a prompt with the query and the relevant data to feed into the LLM


# Table of Contents
- [ 1 - Introduction](#1)
  - [ 1.1 Importing the necessary libraries](#1-1)
- [ 2 - Loading the dataset](#2)
- [ 3 - Main Functions](#3)
  - [ 3.1 Query news by index function](#3-1)
  - [ 3.2 Retrieve function](#3-2)
  - [ 3.3 Get relevant data](#3-3)
  - [ 3.4 Formatting the relevant data](#3-4)
  - [ 3.5 Generate the final prompt](#3-5)
  - [ 3.6 LLM call](#3-6)


<a id='1-1'></a>
### 1.1 Importing the necessary libraries


In [None]:
from utils import (
    retrieve,
    pprint,
    generate_with_single_input,
    read_dataframe,
    display_widget
)
import unittests

<a id='2'></a>
## 2 - Loading the dataset

---

This project uses the Kaggle dataset [News Headlines 2024](https://www.kaggle.com/datasets/dylanjcastillo/news-headlines-2024), which contains thousands of news headlines and related information from BBC News.


In [None]:
NEWS_DATA = read_dataframe("news_data_dedup.csv")

Let's check the data structure.

In [None]:
pprint(NEWS_DATA[9:11])

[
  {
    "guid": "5dae28f191cfd1047f67c409e616fc3f",
    "title": "Paris's Moulin Rouge loses windmill sails overnight",
    "description": "The cause of the sails' collapse from the roof of the world famous cabaret club is not yet clear.",
    "venue": "BBC",
    "url": "https://www.bbc.co.uk/news/world-europe-68895836",
    "published_at": "2024-04-25",
    "updated_at": "2024-04-26"
  },
  {
    "guid": "d2c3ff79d4e068911d05416ca061cd51",
    "title": "Ukraine uses longer-range US missiles for first time",
    "description": "Missiles secretly delivered this month have been used to strike Russian targets in Crimea, US media say.",
    "venue": "BBC",
    "url": "https://www.bbc.co.uk/news/world-europe-68893196",
    "published_at": "2024-04-25",
    "updated_at": "2024-04-26"
  }
]


The important fields are `title`, `description`, `url` and `published_at`. Together they give the LLM enough context to answer most questions.
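
As a minimal sketch of what keeping only those fields looks like, the snippet below filters one record down to the four keys. The helper `select_fields` and the constant `IMPORTANT_FIELDS` are hypothetical names introduced only for illustration; the record is copied from the preview above.

```python
# Hypothetical helper: keep only the fields the LLM is expected to need.
IMPORTANT_FIELDS = ("title", "description", "url", "published_at")


def select_fields(record: dict, fields: tuple = IMPORTANT_FIELDS) -> dict:
    """Return a copy of `record` restricted to `fields`, skipping missing keys."""
    return {key: record[key] for key in fields if key in record}


# Record copied from the dataset preview above.
record = {
    "guid": "5dae28f191cfd1047f67c409e616fc3f",
    "title": "Paris's Moulin Rouge loses windmill sails overnight",
    "description": "The cause of the sails' collapse from the roof of the world famous cabaret club is not yet clear.",
    "venue": "BBC",
    "url": "https://www.bbc.co.uk/news/world-europe-68895836",
    "published_at": "2024-04-25",
    "updated_at": "2024-04-26",
}

slim_record = select_fields(record)
print(list(slim_record))
```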

<a id='3'></a>
## 3 - Main Functions
---
- `query_news`: Given a list of indices, this function returns all documents corresponding to those indices.
- `retrieve`: Given a query and an integer called `top_k`, this function retrieves the indices of the `top_k` most relevant documents.
- `get_relevant_data`: This function takes a query and a `top_k` value and returns the `top_k` most relevant documents.
- `format_relevant_data`: Given a list of documents, this function creates a formatted string with the document information.


<a id='3-1'></a>
### 3.1 Query news by index function

This helper returns the documents stored at a given list of indices.

In [None]:
def query_news(indices):
    """
    Retrieves elements from a dataset based on specified indices.

    Parameters:
    indices (list of int): A list containing the indices of the desired elements in the dataset.

    Returns:
    list: A list of elements from the dataset corresponding to the indices provided in `indices`.
    """

    output = [NEWS_DATA[index] for index in indices]

    return output

In [None]:
# Fetching some indices
indices = [3, 6, 9]
pprint(query_news(indices))

[
  {
    "guid": "e696224ac208878a5cec8bdc9f97c632",
    "title": "Europe risks dying and faces big decisions - Macron",
    "venue": "BBC",
    "url": "https://www.bbc.co.uk/news/world-europe-68898887",
    "published_at": "2024-04-25",
    "updated_at": "2024-04-26"
  },
  {
    "guid": "4f585bad8f61b715fbafe2f022ab0ae8",
    "title": "Supreme Court divided on whether Trump has immunity",
    "description": "The justices discussed immunity, coups, pardons, Operation Mongoose - and the future of democracy.",
    "venue": "BBC",
    "url": "https://www.bbc.co.uk/news/world-us-canada-68901817",
    "published_at": "2024-04-25",
    "updated_at": "2024-04-26"
  },
  {
    "guid": "5dae28f191cfd1047f67c409e616fc3f",
    "title": "Paris's Moulin Rouge loses windmill sails overnight",
    "description": "The cause of the sails' collapse from the roof of the world famous cabaret club is not yet clear.",
    "venue": "BBC",
    "url": "https://www.bbc.co.uk/news/world-europe-68895836",
    "

<a id='3-2'></a>
### 3.2 Retrieve function

**Parameters:**
- `query`: A string representing the search query for which you want to find the most relevant documents.
- `top_k`: An integer indicating the number of top similar documents to return.

**Output:**
- The function returns a list of indices corresponding to the `top_k` most similar documents in the corpus, based on their similarity scores with the query.

**Call:**

```Python
retrieve(query: str, top_k: int)
```
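
The implementation of `retrieve` lives in `utils` and is not shown in this notebook. As a rough illustration of ranking documents by a similarity score, the sketch below uses bag-of-words cosine similarity. This is an assumed stand-in technique, not the notebook's actual retriever, and the names `tokenize`, `cosine_similarity` and `toy_retrieve` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def toy_retrieve(query: str, corpus: list[str], top_k: int = 5) -> list[int]:
    """Return the indices of the `top_k` corpus entries most similar to `query`."""
    query_vector = Counter(tokenize(query))
    scores = [cosine_similarity(query_vector, Counter(tokenize(doc))) for doc in corpus]
    # Sort indices by descending score; ties keep their original order.
    ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]


# Titles copied from the outputs shown in this notebook.
corpus = [
    "Large tornado seen touching down in Nebraska",
    "Supreme Court divided on whether Trump has immunity",
    "Paris's Moulin Rouge loses windmill sails overnight",
]
print(toy_retrieve("tornado in Nebraska", corpus, top_k=1))  # prints [0]
```

Like `retrieve`, the sketch returns indices rather than documents, so its result could be passed to a lookup function such as `query_news`.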

In [None]:
# Let's test the retrieve function!
indices = retrieve("Concerts in North America", top_k = 1)
print(indices)

[350]


In [None]:
# Now let's query the corresponding news
retrieved_documents = query_news(indices)
pprint(retrieved_documents)

[
  {
    "guid": "927257674585bb6ef669cf2c2f409fa7",
    "title": "\u2018The working class can\u2019t afford it\u2019: the shocking truth about the money bands make on tour",
    "description": "As Taylor Swift tops $1bn in tour revenue, musicians playing smaller venues are facing pitiful fees and frequent losses. Should the state step in to save our live music scene?When you see a band playing to thousands of fans in a sun-drenched festival field, signing a record deal with a major label or playing endlessly from the airwaves, it\u2019s easy to conjure an image of success that comes with some serious cash to boot \u2013 particularly when Taylor Swift has broken $1bn in revenue for her current Eras tour. But looks can be deceiving. \u201cI don\u2019t blame the public for seeing a band playing to 2,000 people and thinking they\u2019re minted,\u201d says artist manager Dan Potts. \u201cBut the reality is quite different.\u201dPost-Covid there has been significant focus on grassroots mus

<a id='3-3'></a>
### 3.3 Get relevant data


In [None]:
def get_relevant_data(query: str, top_k: int = 5) -> list[dict]:
    """
    Retrieve and return the top relevant data items based on a given query.

    This function performs the following steps:
    1. Retrieves the indices of the top 'k' relevant items from a dataset based on the provided `query`.
    2. Fetches the corresponding data for these indices from the dataset.

    Parameters:
    - query (str): The search query string used to find relevant items.
    - top_k (int, optional): The number of top items to retrieve. Default is 5.

    Returns:
    - list[dict]: A list of dictionaries containing the data associated
      with the top relevant items.

    """

    # Retrieve the indices of the top_k relevant items given the query
    relevant_indices = retrieve(query, top_k)

    # Obtain the data related to the items using the indices from the previous step
    relevant_data = query_news(relevant_indices)

    return relevant_data

In [None]:
query = "Greatest storms in the US"
relevant_data = get_relevant_data(query, top_k = 1)
pprint(relevant_data)

[
  {
    "guid": "3ca548fe82c3fcae2c4c0c635d03eb2e",
    "title": "Large tornado seen touching down in Nebraska",
    "description": "Severe and powerful storms have moved across several US states, leaving many experiencing power shortages.",
    "venue": "BBC",
    "url": "https://www.bbc.co.uk/news/world-us-canada-68860070",
    "published_at": "2024-04-26",
    "updated_at": "2024-04-28"
  }
]


**Expected output**
```
[{'guid': '3ca548fe82c3fcae2c4c0c635d03eb2e',
  'title': 'Large tornado seen touching down in Nebraska',
  'description': 'Severe and powerful storms have moved across several US '
                 'states, leaving many experiencing power shortages.',
  'venue': 'BBC',
  'url': 'https://www.bbc.co.uk/news/world-us-canada-68860070',
  'published_at': '2024-04-26',
  'updated_at': '2024-04-28'}]
```

In [None]:
# Run this cell to perform several tests on your function. If you receive "All tests passed!" it means that your solution will likely pass the autograder too.
unittests.test_get_relevant_data(get_relevant_data)

All tests passed!


<a id='3-4'></a>
### 3.4 Formatting the relevant data


<a id='ex02'></a>

Each formatted document should include the following fields:

* News title
* News description
* News published date
* News URL


In [None]:
def format_relevant_data(relevant_data):
    """
    Formats a list of relevant documents into a structured string for use in a RAG system.

    Parameters:
    relevant_data (list): A list with relevant data.

    Returns:
    str: A formatted string with the relevant documents, structured for use in a Retrieval-Augmented Generation (RAG) system.
    """

    # Create a list to store the formatted documents
    formatted_documents = []

    # Iterate over each relevant document.
    for document in relevant_data:

        # Format each document into a structured layout string.
        formatted_document = f""" Title: {document['title']},
        Description: {document['description']},
        Published at: {document['published_at']}\nURL: {document['url']}"""

        # Append the formatted document string to the formatted_documents list
        formatted_documents.append(formatted_document)

    return "\n".join(formatted_documents)

In [None]:
example_data = NEWS_DATA[4:8]

In [None]:
print(format_relevant_data(example_data))

 Title: Prosecutors ask for halt to case against Spain PM's wife,
        Description: Pedro Sánchez is deciding whether to resign after a case against his wife by an anti-corruption group., 
        Published at: 2024-04-25
URL: https://www.bbc.co.uk/news/world-europe-68895727
 Title: WATCH: Would you pay a tourist fee to enter Venice?,
        Description: From Thursday visitors making a trip to the famous city at peak times will be charged a trial entrance fee., 
        Published at: 2024-04-25
URL: https://www.bbc.co.uk/news/world-europe-68898441
 Title: Supreme Court divided on whether Trump has immunity,
        Description: The justices discussed immunity, coups, pardons, Operation Mongoose - and the future of democracy., 
        Published at: 2024-04-25
URL: https://www.bbc.co.uk/news/world-us-canada-68901817
 Title: More than 150 killed as heavy rains pound Tanzania,
        Description: The prime minister warns that El Niño-triggered heavy rains are likely to continue into 
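
The record at index 3 printed in section 3.1 shows no `description` field. Whether that key is truly absent or simply hidden in the printed output is not clear from the notebook, but if a key were missing, `document['description']` would raise a `KeyError`. The sketch below is a hypothetical variant, `format_documents_safely`, that falls back to a placeholder instead. Its layout differs from `format_relevant_data`, so it should not be expected to pass `unittests.test_format_relevant_data`.

```python
def format_documents_safely(relevant_data: list[dict]) -> str:
    """Format documents for a prompt, substituting 'N/A' for missing fields."""
    formatted_documents = []
    for document in relevant_data:
        formatted_documents.append(
            f"Title: {document.get('title', 'N/A')}\n"
            f"Description: {document.get('description', 'N/A')}\n"
            f"Published at: {document.get('published_at', 'N/A')}\n"
            f"URL: {document.get('url', 'N/A')}"
        )
    # Separate documents with a blank line so their boundaries stay visible.
    return "\n\n".join(formatted_documents)


# Record copied from the section 3.1 output, which shows no description.
sample = {
    "guid": "e696224ac208878a5cec8bdc9f97c632",
    "title": "Europe risks dying and faces big decisions - Macron",
    "venue": "BBC",
    "url": "https://www.bbc.co.uk/news/world-europe-68898887",
    "published_at": "2024-04-25",
    "updated_at": "2024-04-26",
}

formatted = format_documents_safely([sample])
print(formatted)
```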

In [None]:
# Test your function!
unittests.test_format_relevant_data(format_relevant_data)

All tests passed!


<a id='3-5'></a>
### 3.5 Generate the final prompt


In [None]:
# EDITABLE CELL

def generate_final_prompt(query, top_k=5, use_rag=True, prompt=None):
    """
    Generates a final prompt based on a user query, optionally incorporating relevant data using retrieval-augmented generation (RAG).

    Args:
        query (str): The user query for which the prompt is to be generated.
        top_k (int, optional): The number of top relevant data pieces to retrieve and incorporate. Default is 5.
        use_rag (bool, optional): A flag indicating whether to use retrieval-augmented generation (RAG)
                                  by including relevant data in the prompt. Default is True.
        prompt (str, optional): A template string for the prompt. It can contain placeholders {query} and {documents}
                                for formatting with the query and formatted relevant data, respectively.

    Returns:
        str: The generated prompt, either consisting solely of the query or expanded with relevant data
             formatted for additional context.
    """
    # If RAG is not being used, return the query unchanged
    if not use_rag:
        return query

    # Retrieve the top_k relevant data pieces based on the query
    relevant_data = get_relevant_data(query, top_k=top_k)

    # Format the retrieved relevant data
    retrieve_data_formatted = format_relevant_data(relevant_data)

    # If no custom prompt is provided, use the default prompt template
    if prompt is None:
        prompt = (
            f"Answer the user query below. There will be provided additional information for you to compose your answer. "
            f"The relevant information provided is from 2024 and it should be added as your overall knowledge to answer the query, "
            f"you should not rely only on this information to answer the query, but add it to your overall knowledge."
            f"Query: {query}\n"
            f"2024 News: {retrieve_data_formatted}"
        )
    else:
        # If a custom prompt is provided, format it with the query and formatted relevant data
        prompt = prompt.format(query=query, documents=retrieve_data_formatted)

    return prompt

In [None]:
print(generate_final_prompt("Tell me about the US GDP in the past 3 years."))

Answer the user query below. There will be provided additional information for you to compose your answer. The relevant information provided is from 2024 and it should be added as your overall knowledge to answer the query, you should not rely only on this information to answer the query, but add it to your overall knowledge.Query: Tell me about the US GDP in the past 3 years.
2024 News:  Title: America's Economy Is No. 1. That Means Trouble,
        Description: If you want a single number to capture America’s economic stature, here it is: This year, the U.S. will account for 26.3% of the global gross domestic product, the highest in almost two decades. That’s based on the latest projections from the International Monetary Fund. According to the IMF, Europe’s share of world GDP has dropped 1.4 percentage points since 2018, and Japan’s by 2.1 points. The U.S. share, by contrast, is up 2.3 points., 
        Published at: 2024-04-26
URL: https://www.wsj.com/articles/americas-economy-is-n
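
The `prompt` argument of `generate_final_prompt` accepts a template with `{query}` and `{documents}` placeholders, which the function fills using `str.format`. The sketch below shows that substitution in isolation. The template wording (`CUSTOM_TEMPLATE`) and the stand-in `documents` string are assumptions made for illustration; only the two placeholder names come from the docstring.

```python
# Hypothetical custom template; only the {query} and {documents} placeholders
# are taken from the `generate_final_prompt` docstring.
CUSTOM_TEMPLATE = (
    "You are a news assistant. Use the 2024 news items below when they are relevant.\n"
    "Query: {query}\n"
    "2024 News:\n{documents}"
)

# Stand-in for the string that `format_relevant_data` would return.
documents = (
    "Title: Large tornado seen touching down in Nebraska\n"
    "Published at: 2024-04-26"
)

# The same substitution that `generate_final_prompt` applies to a custom template.
filled_prompt = CUSTOM_TEMPLATE.format(query="Greatest storms in the US", documents=documents)
print(filled_prompt)
```

In the notebook, the same template could be passed as `generate_final_prompt(query, prompt=CUSTOM_TEMPLATE)`. Because `str.format` treats single braces as placeholders, any literal braces in a template need to be doubled (`{{` and `}}`).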

<a id='3-6'></a>
### 3.6 LLM call


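The `llm_call` function takes the following parameters:

- `query`: the query to be passed to the LLM.
- `top_k`: the number of documents to retrieve when RAG is enabled. Default is 5.
- `use_rag`: a boolean indicating whether to use RAG. This parameter makes it easy to compare responses with and without retrieval.
- `prompt`: an optional prompt template, passed through to `generate_final_prompt`.

The model is not a parameter of `llm_call` as defined below; the notebook uses the Llama 3.1 8B Instruct model named in the introduction.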

In [None]:
def llm_call(query, top_k = 5, use_rag = True, prompt = None):
    """
    Calls the LLM to generate a response based on a query, optionally using retrieval-augmented generation.

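    Args:
        query (str): The user query that will be processed by the language model.
        top_k (int, optional): The number of relevant documents to retrieve when RAG is used. Default is 5.
        use_rag (bool, optional): A flag that indicates whether to use retrieval-augmented generation by
                                  incorporating relevant documents into the prompt. Default is True.
        prompt (str, optional): A custom prompt template passed to `generate_final_prompt`. Default is None.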

    Returns:
        str: The content of the response generated by the language model.
    """


    # Get the prompt with the query + relevant documents
    prompt = generate_final_prompt(query, top_k, use_rag, prompt)

    # Call the LLM
    generated_response = generate_with_single_input(prompt)

    # Get the content
    generated_message = generated_response['content']

    return generated_message

In [None]:
query = "Tell me about the US GDP in the past 3 years."

In [None]:
print(llm_call(query, use_rag = True))

Based on the provided information from 2024, here's an overview of the US GDP in the past 3 years:

1. **2024**: According to the International Monetary Fund (IMF), the US will account for 26.3% of the global gross domestic product (GDP), the highest in almost two decades. The US share of world GDP has increased by 2.3 percentage points since 2018. However, the US economic growth slowed in the first quarter, with GDP expanding at a 1.6% seasonally- and inflation-adjusted annual rate.

2. **2023**: Unfortunately, there is no specific information provided about the US GDP in 2023. However, it can be inferred that the US economy continued to grow, given the IMF's projection of a 2.3 percentage point increase in the US share of world GDP since 2018.

3. **2022**: There is no specific information provided about the US GDP in 2022. However, it can be inferred that the US economy continued to grow, given the IMF's projection of a 2.3 percentage point increase in the US share of world GDP sinc

In [None]:
print(llm_call(query, use_rag = False))

The US GDP (Gross Domestic Product) is a widely used indicator of a country's economic performance. Here's a brief overview of the US GDP for the past 3 years (2022-2024, note: 2024 data is not yet available, so I'll provide data up to 2022 and some projections for 2023 and 2024):

**2022:**

* The US GDP grew by 2.1% in 2022, according to the Bureau of Economic Analysis (BEA).
* The GDP was $25.4 trillion in 2022, up from $24.8 trillion in 2021.
* The growth was driven by consumer spending, which accounted for about 70% of the GDP growth.

**2023 (projected):**

* The US GDP is expected to grow by around 1.5% in 2023, according to the Congressional Budget Office (CBO).
* The GDP is projected to reach around $26.5 trillion in 2023.
* The growth is expected to be driven by consumer spending and business investment.

**2024 (projected):**

* The US GDP is expected to grow by around 2.0% in 2024, according to the CBO.
* The GDP is projected to reach around $27.5 trillion in 2024.
* The gr

In [None]:
display_widget(llm_call)
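
The two calls above, with `use_rag = True` and `use_rag = False`, can also be collected side by side. The sketch below assumes a hypothetical helper `compare_rag` that accepts any callable with the same signature as `llm_call`; `fake_llm_call` is a stand-in so the example runs without the notebook's utilities or an API key.

```python
from typing import Callable


def compare_rag(llm: Callable[..., str], query: str, top_k: int = 5) -> dict[str, str]:
    """Run the same query with and without RAG and return both answers."""
    return {
        "with_rag": llm(query, top_k=top_k, use_rag=True),
        "without_rag": llm(query, top_k=top_k, use_rag=False),
    }


# Stand-in for `llm_call` with the same parameters; it makes no model request.
def fake_llm_call(query, top_k=5, use_rag=True, prompt=None):
    source = "retrieved 2024 news" if use_rag else "training data only"
    return f"Answer to '{query}' using {source}"


answers = compare_rag(fake_llm_call, "Tell me about the US GDP in the past 3 years.")
for mode, answer in answers.items():
    print(f"{mode}: {answer}")
```

In the notebook itself, `compare_rag(llm_call, query)` would issue two real model requests, one per mode.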