## RAG Pipeline
The pipeline answers queries about the Bundesliga 2023/24 season. This example focuses on game weeks 7 to 11, five game weeks in total.

### Pipeline functionality
**1. Generation of Faiss Index**
- Takes in a dictionary with keys `metadata` and `content`
- The `content` is embedded with an OpenAI model. Please ensure your `OPENAI_API_KEY` is set as a system environment variable.
- `metadata`, `content` and `embeddings` are stored in a Faiss index, which is then saved. Once generated, the index can simply be loaded on later runs.
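
The index generation step can be sketched roughly as follows. This is not the `RAGPipeline` implementation: `toy_embed` is a hypothetical stand-in for the OpenAI embedding call, a NumPy matrix with inner-product search stands in for the Faiss index, and the chunk contents are placeholders. It only illustrates the idea of embedding the `content`, keeping `metadata`, `content` and `embeddings` together, and saving and reloading the result.

```python
import hashlib
import json
import os
import tempfile

import numpy as np

DIM = 16  # toy embedding size (assumption); real embedding models use far more dimensions


def toy_embed(text):
    """Hypothetical stand-in for an embedding API call: a deterministic unit vector per text."""
    seed = int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:8], "big")
    vec = np.random.default_rng(seed).standard_normal(DIM)
    return (vec / np.linalg.norm(vec)).astype(np.float32)


def build_index(chunks):
    """Embed each chunk's content and keep metadata, content and embeddings together."""
    return {
        "metadata": [c["metadata"] for c in chunks],
        "content": [c["content"] for c in chunks],
        "embeddings": np.stack([toy_embed(c["content"]) for c in chunks]),
    }


def save_index(index, path):
    """Store the embeddings as an array and the remaining fields as a JSON string."""
    payload = json.dumps({"metadata": index["metadata"], "content": index["content"]})
    np.savez(path, embeddings=index["embeddings"], payload=np.array(payload))


def load_index(path):
    """Reload an index written by save_index, without recomputing any embeddings."""
    with np.load(path) as data:
        embeddings = data["embeddings"]
        payload = json.loads(data["payload"].item())
    return {**payload, "embeddings": embeddings}


def search(index, query, k=2):
    """Return the k chunks whose embeddings have the highest inner product with the query."""
    scores = index["embeddings"] @ toy_embed(query)
    top = np.argsort(-scores)[:k]
    return [(index["content"][i], float(scores[i])) for i in top]


# Placeholder chunks in the expected shape: a dict with `metadata` and `content`.
chunks = [
    {"metadata": {"gameweek": 7}, "content": "placeholder event text A"},
    {"metadata": {"gameweek": 8}, "content": "placeholder event text B"},
    {"metadata": {"gameweek": 9}, "content": "placeholder event text C"},
]

index = build_index(chunks)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_index.npz")
    save_index(index, path)
    restored = load_index(path)

hits = search(restored, "placeholder event text B", k=2)
print(hits[0][0])
```

Querying with the exact text of a stored chunk returns that chunk first, because its embedding is identical to the query embedding.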

**2. Query Answering**

This step implements multiple __advanced RAG__ techniques, such as:
- Query Expansion
- Vector Search
- Reranking
- Reordering
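
As an illustration of the last technique, here is a minimal sketch of one common form of reordering, which places the most relevant documents at the start and end of the context and the least relevant ones in the middle. Whether `RAGPipeline` reorders in exactly this way is an assumption; the function name and document labels are hypothetical.

```python
def reorder_for_long_context(docs_best_first):
    """Reorder documents (given best-first) so the strongest sit at both ends.

    Documents at even ranks fill the front in order; documents at odd ranks
    fill the back in reverse, so the weakest end up in the middle.
    """
    front = docs_best_first[0::2]
    back = docs_best_first[1::2][::-1]
    return front + back


ranked = ["d1", "d2", "d3", "d4", "d5"]  # d1 is the most relevant
reordered = reorder_for_long_context(ranked)
print(reordered)  # ['d1', 'd3', 'd5', 'd4', 'd2']
```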

In [None]:
OPENAI_API_KEY = "" # TODO: PLEASE ENSURE THAT YOUR API KEY IS SET AS A SYSTEM ENVIRONMENT VARIABLE

In [1]:
import numpy as np

def filter_chunks_by_gameweeks(chunks, gameweek_list):
    """
    Given a list of data chunks (each chunk is presumably a dict with 
    chunk["metadata"]["gameweek"] defined), return only those that match 
    the specified gameweeks.
    
    :param chunks: List of chunk dictionaries.
    :param gameweek_list: List (or set) of integers representing the 
                        gameweeks of interest.
    :return: Filtered list of chunks for the supplied gameweeks.
    """
    # Extract all gameweeks
    gws = np.array([chunk["metadata"]["gameweek"] for chunk in chunks])
    
    # Create a boolean mask of which chunks' gameweeks are in gameweek_list
    # (np.isin does not treat a set as a collection, so convert to a list first)
    interested = np.isin(gws, list(gameweek_list))
    
    # Apply the mask to filter out irrelevant chunks
    filtered_chunks = list(np.array(chunks)[interested])
    
    return filtered_chunks
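
For comparison, the same filter can be written without NumPy. The sketch below is an equivalent plain-Python version run on placeholder chunks; the function name and sample contents are made up for illustration.

```python
def filter_by_gameweeks_plain(chunks, gameweeks):
    """Keep only the chunks whose metadata gameweek is in `gameweeks`."""
    wanted = set(gameweeks)  # accepts a list, tuple or set
    return [chunk for chunk in chunks if chunk["metadata"]["gameweek"] in wanted]


# Placeholder chunks in the expected shape.
sample_chunks = [
    {"metadata": {"gameweek": 6}, "content": "event A"},
    {"metadata": {"gameweek": 7}, "content": "event B"},
    {"metadata": {"gameweek": 11}, "content": "event C"},
    {"metadata": {"gameweek": 12}, "content": "event D"},
]

kept = filter_by_gameweeks_plain(sample_chunks, [7, 8, 9, 10, 11])
print([chunk["content"] for chunk in kept])  # ['event B', 'event C']
```

The list comprehension preserves the original order of the chunks and returns the original dictionaries unchanged.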

We are interested in game weeks 7 to 11, so the chunk data is filtered to those game weeks before indexing.

In [2]:
from pipelines.rag import RAGPipeline

regenerate_index = False #TODO: set to False if you want to load an existing index (this is the case for people who do not have access to the data)
index_name = "football_index_711" #TODO: set the index name
gameweeks = [7, 8, 9, 10, 11] #TODO: set the gameweeks to filter the data

# Initialize the pipeline
rag = RAGPipeline(openai_embedding_model="text-embedding-3-small")

if regenerate_index:
    # Load the event chunks - list of dictionaries with entries metadata and content
    chunks = rag.load_json("data/event_chunks/chunks.json")
    
    # Filter chunks by gameweeks
    chunks = filter_chunks_by_gameweeks(chunks, gameweeks)

    # Create the FAISS index - we generate embeddings for each chunk and store them in the index
    rag.create_faiss_index(chunks, index_name=index_name, save_embeddings=True)
else:
    # load the FAISS index
    rag.load_faiss_index(index_name)



In [21]:
# Defining the top k documents to retrieve and rerank
# The pipeline retrieves kwargs["retrieve"] documents per query (6 * kwargs["retrieve"] in total because of query expansion), then reranks them and keeps up to kwargs["top_k"] of the reranked documents

# sort_by: "rerank_score" or "rse_score"
kwargs={"retrieve": 200, "top_k": 250, "threshold": 0.2, "reorder": False, "alpha": 1, "beta": 1}
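
To make the roles of these settings more concrete, the sketch below shows one plausible way the results of several expanded queries could be merged and then cut down with `threshold` and `top_k`. The exact behaviour of `RAGPipeline` is not shown in this notebook, so the function names, the merge rule (keep the best score per chunk) and the scores are all assumptions.

```python
def merge_expanded_results(results_per_query):
    """Merge per-query results, keeping the best score seen for each chunk id."""
    best = {}
    for results in results_per_query:
        for chunk_id, score in results:
            if chunk_id not in best or score > best[chunk_id]:
                best[chunk_id] = score
    return best


def select_top(scores_by_chunk, threshold, top_k):
    """Drop chunks scoring below `threshold`, then keep the `top_k` best."""
    kept = [(cid, s) for cid, s in scores_by_chunk.items() if s >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]


# Made-up (chunk id, score) results for two expanded queries.
results_per_query = [
    [("c1", 0.9), ("c2", 0.15)],
    [("c2", 0.4), ("c3", 0.1)],
]

merged = merge_expanded_results(results_per_query)
selected = select_top(merged, threshold=0.2, top_k=250)
print(selected)  # [('c1', 0.9), ('c2', 0.4)]
```

With a large `top_k` such as 250, the threshold is what effectively limits the context in this sketch; a smaller `top_k` would cap the number of chunks regardless of their scores.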

In [22]:
from IPython.display import Markdown, display

query = "How many goals were scored after the 90th minute during Game Weeks 7 to 11?" #TODO: set the query
# Answer the query
response, context, rerank_scores, similarity_scores, queries = rag.answer_query(query, kwargs=kwargs)
print("User Query:")
display(Markdown(f"**{query}**"))
print("Generated Response:")
display(Markdown(response))

Batches: 100%|██████████| 13/13 [00:00<00:00, 20.65it/s]


User Query:


**How many goals were scored after the 90th minute during Game Weeks 7 to 11?**

Generated Response:


In Game Weeks 7 to 11, the following goals were scored after the 90th minute:

1. Game Week 7: 
   - Hoffenheim 3:2 Werder Bremen (Goal at 91:32)
   - Mainz 2:2 Borussia M.Gladbach (Goal at 95:47)

2. Game Week 9:
   - Mainz 2:2 Bochum (Goals at 96:10)

3. Game Week 10:
   - Bayern 4:2 FC Heidenheim (Goal at 95:8)

4. Game Week 11:
   - Eintracht Frankfurt 2:2 Werder Bremen (Goals at 95:0)
   - Borussia M.Gladbach 4:0 Wolfsburg (Goals at 92:4)

Total goals scored after the 90th minute during Game Weeks 7 to 11: 6 goals.

In [10]:
from IPython.display import Markdown, display

query = "How many goals did Harry Kane score from gameweek 7 to 11? From what field zones did he score these goals?"
# Answer the query
response, context, rerank_scores, similarity_scores, queries = rag.answer_query(query, kwargs=kwargs)
print("User Query:")
display(Markdown(f"**{query}**"))
print("Generated Response:")
display(Markdown(response))

Batches: 100%|██████████| 4/4 [00:00<00:00, 12.03it/s]


User Query:


**How many goals did Harry Kane score from gameweek 7 to 11? From what field zones did he score these goals?**

Generated Response:


Harry Kane scored a total of 5 goals from gameweek 7 to 11. The field zones from which he scored these goals are:

1. Game Week 7: Field zone 12
2. Game Week 9: Field zone 9
3. Game Week 9: Field zone 12
4. Game Week 11: Field zone 12
5. Game Week 11: Field zone 12