# Chat with Data: Retrieval Augmented Generation (RAG) with LlamaIndex and OpenAI API

We will build a simple RAG system that retrieves information about players at EURO 2024, using the LlamaIndex library.

**Data**: CSV file: euro2024_players.csv

This table presents information about football players participating in Euro 2024. It includes the following details for each player:

**Name**: Full name of the player

**Position**: The position they play on the field (e.g., goalkeeper, centre-back)

**Age**: Age of the player

**Club**: The club they currently play for

**Height**: Height of the player in centimeters

**Foot**: Their dominant foot (right or left)

**Caps**: The number of international matches played for their country

**Goals**: The number of goals scored in international matches

**Market Value**: Estimated market value of the player in Euros

**Country**: Country the player represents

#LlamaIndex
https://docs.llamaindex.ai/en/stable/

LlamaIndex, formerly known as GPT Index, is a data framework designed to streamline the development of applications that leverage Large Language Models (LLMs). It provides developers with tools and functionalities for integrating various data sources (e.g., documents, APIs, databases) with LLMs, enabling them to build more intelligent and context-aware applications.


##Chat Engine
###Concept:

A chat engine is a high-level interface for having a conversation with your data (multiple back-and-forth exchanges instead of a single question and answer). Think ChatGPT, but augmented with your knowledge base.

By keeping track of the conversation history, it can answer questions with past context in mind.

Configuring a Chat Engine
https://docs.llamaindex.ai/en/stable/module_guides/deploying/chat_engines/usage_pattern/




#Setup

In [None]:
import sys
import os
import pandas as pd

Install the LlamaIndex library and its OpenAI integration.
The optional flag -q installs "quietly", without printing the details of the installation.

In [None]:
%pip install llama-index-llms-openai -q

In [None]:
!pip install llama-index -q

In [None]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.llms.openai import OpenAI

##Set your OpenAI API key

By default, we use the OpenAI gpt-3.5-turbo model for text generation and text-embedding-ada-002 for retrieval and embeddings. In order to use this, you must have an OPENAI_API_KEY set up as an environment variable. You can obtain an API key by logging into your OpenAI account and creating a new API key. https://platform.openai.com/api-keys

###Obtain an OpenAI API Key
Log in to Your Account:
Go to https://platform.openai.com/login and log in with your credentials.

Navigate to the API Section:
Once logged in, click on your profile icon in the top-right corner and select "API Keys" from the dropdown menu.

Generate a New API Key:
Click the "Create new secret key" button.
Copy the generated API key. You will not be able to view it again, so store it securely in a password manager or a safe document.

In [None]:
# Read the OpenAI API key without echoing it into the notebook output
from getpass import getpass
os.environ['OPENAI_API_KEY'] = getpass("Enter OpenAI API key: ")


## Upload Data from Google Drive

1. Create a new folder in your Google Drive.
2. Upload the euro2024_players.csv file to that folder.
3. Mount Google Drive so that the file is accessible from Google Colab.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
#load data to dataframe to explore
data = pd.read_csv("/content/drive/MyDrive/euro_2024/euro2024_players.csv")

In [None]:
# Print a concise summary of the DataFrame 'data',
# including information about columns, data types
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 623 entries, 0 to 622
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Name         623 non-null    object
 1   Position     623 non-null    object
 2   Age          623 non-null    int64 
 3   Club         623 non-null    object
 4   Height       623 non-null    int64 
 5   Foot         620 non-null    object
 6   Caps         623 non-null    int64 
 7   Goals        623 non-null    int64 
 8   MarketValue  623 non-null    int64 
 9   Country      623 non-null    object
dtypes: int64(5), object(5)
memory usage: 48.8+ KB


In [None]:
#first 5 players
data.head()

Unnamed: 0,Name,Position,Age,Club,Height,Foot,Caps,Goals,MarketValue,Country
0,Marc-André ter Stegen,Goalkeeper,32,FC Barcelona,187,right,40,0,28000000,Germany
1,Manuel Neuer,Goalkeeper,38,Bayern Munich,193,right,119,0,4000000,Germany
2,Oliver Baumann,Goalkeeper,34,TSG 1899 Hoffenheim,187,right,0,0,3000000,Germany
3,Nico Schlotterbeck,Centre-Back,24,Borussia Dortmund,191,left,12,0,40000000,Germany
4,Jonathan Tah,Centre-Back,28,Bayer 04 Leverkusen,195,right,25,0,30000000,Germany


In [None]:
#filter by country
data[data['Country'] == 'Spain']

Unnamed: 0,Name,Position,Age,Club,Height,Foot,Caps,Goals,MarketValue,Country
104,David Raya,Goalkeeper,28,Arsenal FC,183,right,5,0,35000000,Spain
105,Unai Simón,Goalkeeper,26,Athletic Bilbao,190,right,39,0,30000000,Spain
106,Álex Remiro,Goalkeeper,29,Real Sociedad,191,right,1,0,25000000,Spain
107,Robin Le Normand,Centre-Back,27,Real Sociedad,187,right,10,1,40000000,Spain
108,Dani Vivian,Centre-Back,24,Athletic Bilbao,184,right,2,0,25000000,Spain
109,Aymeric Laporte,Centre-Back,30,Al-Nassr FC,191,left,28,1,20000000,Spain
110,Nacho Fernández,Centre-Back,34,Real Madrid,180,right,24,1,3000000,Spain
111,Alejandro Grimaldo,Left-Back,28,Bayer 04 Leverkusen,171,left,3,0,45000000,Spain
112,Marc Cucurella,Left-Back,25,Chelsea FC,173,left,3,0,25000000,Spain
113,Daniel Carvajal,Right-Back,32,Real Madrid,173,right,43,0,12000000,Spain


In [None]:
#filter by club
data[data['Club'] == 'Real Madrid']

Unnamed: 0,Name,Position,Age,Club,Height,Foot,Caps,Goals,MarketValue,Country
5,Antonio Rüdiger,Centre-Back,31,Real Madrid,190,right,69,3,25000000,Germany
15,Toni Kroos,Central Midfield,34,Real Madrid,183,right,109,17,10000000,Germany
110,Nacho Fernández,Centre-Back,34,Real Madrid,180,right,24,1,3000000,Spain
113,Daniel Carvajal,Right-Back,32,Real Madrid,173,right,43,0,12000000,Spain
129,Joselu,Centre-Forward,34,Real Madrid,191,right,10,5,5000000,Spain
145,Luka Modric,Central Midfield,38,Real Madrid,172,right,174,24,6000000,Croatia
250,Jude Bellingham,Attacking Midfield,20,Real Madrid,186,right,29,3,180000000,England
400,Ferland Mendy,Left-Back,29,Real Madrid,180,left,9,0,22000000,France
402,Aurélien Tchouaméni,Defensive Midfield,24,Real Madrid,188,right,31,3,100000000,France
404,Eduardo Camavinga,Central Midfield,21,Real Madrid,182,left,16,1,100000000,France


#Load Documents to Prepare for Indexing Using LlamaIndex

Load the documents (here, our CSV file) to build the VectorStoreIndex. SimpleDirectoryReader loads every file in the folder, so the folder may contain several files, and their formats may vary.

In [None]:
# load data
documents = SimpleDirectoryReader("/content/drive/MyDrive/euro_2024/").load_data()

In [None]:
print(documents)

[Document(id_='6cb049e9-5daa-474b-8627-e367d7a71b97', embedding=None, metadata={'file_path': '/content/drive/MyDrive/euro_2024/euro2024_players.csv', 'file_name': 'euro2024_players.csv', 'file_type': 'text/csv', 'file_size': 49009, 'creation_date': '2025-02-06', 'last_modified_date': '2024-06-08'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, metadata_template='{key}: {value}', metadata_separator='\n', text_resource=MediaResource(embeddings=None, data=None, text="Marc-André ter Stegen, Goalkeeper, 32, FC Barcelona, 187, right, 40, 0, 28000000, Germany\nManuel Neuer, Goalkeeper, 38, Bayern Munich, 193, right, 119, 0, 4000000, Germany\nOliver Baumann, Goalkeeper, 34, TSG 1899 Hoffenheim, 187, right, 0, 0, 3000000, Germany\nNico Schlotterbeck, Centre-Back, 24,

### Build index
Creates a searchable vector index from the loaded documents
 - Converts text into numerical vectors using embeddings
 - Enables efficient semantic search and retrieval
 - Stores document vectors in memory for quick access
 - Used by chat engine to find relevant context when answering questions
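
The idea behind the bullets above can be illustrated with a toy example. This is only a sketch and not how VectorStoreIndex is implemented: the `embed` function below is a stand-in that counts words, whereas a real embedding model returns dense float vectors, and `ToyVectorIndex` is a hypothetical class invented for this illustration. The rows are copied from the dataset preview earlier in the notebook.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': a bag-of-words count vector (an assumption for illustration)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorIndex:
    """Hypothetical in-memory index: stores one vector per text chunk."""

    def __init__(self, texts):
        self.entries = [(text, embed(text)) for text in texts]

    def retrieve(self, query, top_k=2):
        # Rank stored chunks by similarity to the query vector.
        query_vec = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(query_vec, e[1]), reverse=True)
        return [text for text, _ in ranked[:top_k]]


rows = [
    "Manuel Neuer, Goalkeeper, 38, Bayern Munich, 193, right, 119, 0, 4000000, Germany",
    "Toni Kroos, Central Midfield, 34, Real Madrid, 183, right, 109, 17, 10000000, Germany",
    "Jonathan Tah, Centre-Back, 28, Bayer 04 Leverkusen, 195, right, 25, 0, 30000000, Germany",
]
toy_index = ToyVectorIndex(rows)
top_hit = toy_index.retrieve("Which club does Toni Kroos play for?", top_k=1)[0]
print(top_hit)
```

The chat engine then passes the retrieved text to the LLM as context, which is why a question can only be answered well if the relevant rows are among the top results.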

In [None]:
index = VectorStoreIndex.from_documents(documents)

In [None]:
#setup LLM model
llm = OpenAI(model="gpt-3.5-turbo", temperature=0)

#Configuring a Chat Engine

Configuring a chat engine is very similar to configuring a query engine.

##High-Level API

You can directly build and configure a chat engine from an index in 1 line of code:


*chat_engine = index.as_chat_engine(chat_mode="condense_question", verbose=True)*

**Note**: you can access different chat engines by specifying the chat_mode as a kwarg. condense_question corresponds to CondenseQuestionChatEngine, react corresponds to ReActChatEngine, context corresponds to a ContextChatEngine.

**Note**: While the high-level API optimizes for ease of use, it does NOT expose the full range of configurability.


##Available Chat Modes:

**condense_question** - Look at the chat history and re-write the user message to be a query for the index. Return the response after reading the response from the query engine.

**context** - Retrieve nodes from the index using every user message. The retrieved text is inserted into the system prompt, so that the chat engine can either respond naturally or use the context from the query engine.

**condense_plus_context** - A combination of condense_question and context. Look at the chat history and re-write the user message to be a retrieval query for the index. The retrieved text is inserted into the system prompt, so that the chat engine can either respond naturally or use the context from the query engine.

**simple** - A simple chat with the LLM directly, no query engine involved.

**best** - Turn the query engine into a tool, for use with a ReAct data agent or an OpenAI data agent, depending on what your LLM supports. OpenAI data agents require gpt-3.5-turbo or gpt-4 as they use the function calling API from OpenAI.

**react** - Same as best, but forces a ReAct data agent.

**openai** - Same as best, but forces an OpenAI data agent.



### Condense_question
Condense question is a simple chat mode built on top of a query engine over your data.

For each chat interaction:

- first generate a standalone question from conversation context and last message, then
- query the query engine with the condensed question for a response.

This approach is simple, and works for questions directly related to the knowledge base. Since it always queries the knowledge base, it can have difficulty answering meta questions like "what did I ask you before?"
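
The two steps can be sketched in plain Python. This is a minimal illustration of the control flow only, not the library's code: `fake_condense` and `fake_query` are hypothetical stand-ins for the LLM rewrite step and the query engine, and skipping the rewrite when the history is empty is an assumption of this sketch.

```python
def condense_question_chat(message, history, condense_fn, query_fn):
    """Rewrite the message as a standalone question, then query with it."""
    # With no history there is nothing to resolve, so the message is used as-is.
    standalone = condense_fn(history, message) if history else message
    answer = query_fn(standalone)
    history.append((message, answer))
    return standalone, answer


def fake_condense(history, message):
    """Hypothetical stand-in for the LLM call that resolves references."""
    last_question, _ = history[-1]
    subject = "Toni Kroos" if "Toni Kroos" in last_question else "the player"
    return message.replace("his", subject + "'s")


def fake_query(question):
    """Hypothetical stand-in for the query engine over the index."""
    return "34" if "age" in question else "Real Madrid"


history = []
q1, a1 = condense_question_chat(
    "What club does Toni Kroos play for?", history, fake_condense, fake_query
)
q2, a2 = condense_question_chat("What is his age?", history, fake_condense, fake_query)
print(q2, "->", a2)
```

Because every turn ends in a query against the index, a question such as "what did I ask you before?" is also sent to the knowledge base, which has no record of the conversation.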

In [None]:
chat_engine = index.as_chat_engine(chat_mode="condense_question", verbose=True)


###Chat with your data

In [None]:
response = chat_engine.chat("Hello! What club played Toni Kroos during euro 2024?")

Querying with: Hello! What club played Toni Kroos during euro 2024?


In [None]:
print(response)

Real Madrid


In [None]:
response = chat_engine.chat("What is his age?")

Querying with: How old is Toni Kroos, who played for Real Madrid during Euro 2024?


In [None]:
print(response)

34


In [None]:
response = chat_engine.chat("List all players from Real Madrid Club")

Querying with: Can you please list all players from the Real Madrid Club?


In [None]:
print(response)

Joselu, Centre-Forward, 34, Real Madrid, 191, right, 10, 5, 5000000, Spain


In [None]:
response = chat_engine.chat("From which club is David Raum?")

Querying with: Which club did David Raum play for?


In [None]:
print(response)

David Raum played for TSG Hoffenheim.



##Reset conversation state

In [None]:
chat_engine.reset()

##Chat Engine - Condense Plus Context Mode

This is a multi-step chat mode built on top of a retriever over your data.

For each chat interaction:

- First condense a conversation and latest user message to a standalone question
- Then build a context for the standalone question from a retriever,
- Then pass the context along with prompt and user message to LLM to generate a response.

This approach is simple, and works for questions directly related to the knowledge base and general interactions.



Since the context retrieved can take up a large amount of the available LLM context, let's ensure we configure a smaller limit to the chat history!
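
To see what a token limit on the chat history means in practice, here is a rough sketch of a buffer that keeps only the most recent messages that fit a budget. It is an illustration under stated assumptions, not the ChatMemoryBuffer implementation: `ToyChatMemory` is a hypothetical class, and it counts whitespace-separated words instead of using a real tokenizer.

```python
class ToyChatMemory:
    """Hypothetical buffer that returns the newest messages within a token budget."""

    def __init__(self, token_limit):
        self.token_limit = token_limit
        self.messages = []

    @staticmethod
    def count_tokens(text):
        # Assumption: one whitespace-separated word is roughly one token.
        return len(text.split())

    def put(self, role, content):
        self.messages.append((role, content))

    def get(self):
        # Walk backwards from the newest message and stop once the budget is exceeded.
        kept, used = [], 0
        for role, content in reversed(self.messages):
            cost = self.count_tokens(content)
            if used + cost > self.token_limit:
                break
            kept.append((role, content))
            used += cost
        return list(reversed(kept))


memory_demo = ToyChatMemory(token_limit=8)
memory_demo.put("user", "What club does Toni Kroos play for?")
memory_demo.put("assistant", "Real Madrid")
memory_demo.put("user", "What is his age?")
recent = memory_demo.get()
print(recent)
```

With a budget of 8 words the oldest message no longer fits and is dropped, so a smaller limit leaves more room for retrieved context at the cost of forgetting earlier turns sooner.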

In [None]:
from llama_index.core.memory import ChatMemoryBuffer

memory = ChatMemoryBuffer.from_defaults(token_limit=3900)

chat_engine = index.as_chat_engine(
    chat_mode="condense_plus_context",
    memory=memory,
    llm=llm,
    context_prompt=(
        "You are a chatbot, able to have normal interactions, as well as talk"
        " about football players in Euro 2024 based on the provided data.\n"
        "Here are the relevant documents for the context:\n"
        "{context_str}"
        "\nInstruction: Use the previous chat history, or the context above, to interact and help the user."
    ),
    verbose=False,
)


###Chat with your data

In [None]:
response = chat_engine.chat("Hello! What club played Toni Kroos during euro 2024?")

In [None]:
print(response)

Toni Kroos played for Real Madrid during Euro 2024.


In [None]:
response = chat_engine.chat("What is his age?")

In [None]:
print(response)

Toni Kroos was 34 years old during Euro 2024.


In [None]:
response = chat_engine.chat("List all players from Real Madrid Club")

In [None]:
print(response)

Here are the players from Real Madrid Club in Euro 2024:

1. Joselu, Centre-Forward, 34 years old
2. Toni Kroos, Central Midfield, 34 years old


In [None]:
response = chat_engine.chat("From which club is David Raum?")

In [None]:
print(response)

David Raum is not listed in the provided data for Euro 2024 players. If you have any other players in mind or need information about the existing players, feel free to ask!



###Reset conversation state

In [None]:
chat_engine.reset()

In [None]:
response = chat_engine.chat("Hello! What do you know?")

In [None]:
print(response)

Hello! I know a lot of things. Is there anything specific you would like to talk about or learn more about?


###Context chat engine

In [None]:
chat_engine = index.as_chat_engine(chat_mode="context", llm=llm, verbose=True)

## Task 1
Try the other chat engine modes: simple, best, react, openai. Chat with the data and compare the output. Which mode works best for this task? Don't forget to reset the chat engine after each conversation.
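
One possible way to organize the comparison is a small helper that asks every engine the same questions and resets it afterwards. This is a sketch, not part of LlamaIndex: `compare_engines` and `StubEngine` are hypothetical names, and the stub only mimics the `chat()` and `reset()` methods used in this notebook so the helper can be tried without an API key.

```python
def compare_engines(engines, questions):
    """Ask every engine the same questions and reset its state afterwards."""
    results = {}
    for name, engine in engines.items():
        results[name] = [str(engine.chat(question)) for question in questions]
        engine.reset()  # start the next comparison from a clean history
    return results


class StubEngine:
    """Hypothetical stand-in exposing the same chat()/reset() methods."""

    def __init__(self, label):
        self.label = label
        self.turns = 0

    def chat(self, message):
        self.turns += 1
        return f"[{self.label}] turn {self.turns}: {message}"

    def reset(self):
        self.turns = 0


stub_engines = {"simple": StubEngine("simple"), "react": StubEngine("react")}
answers = compare_engines(
    stub_engines,
    ["Which club does Manuel Neuer play for?", "What is his age?"],
)
for mode, replies in answers.items():
    print(mode, replies)
```

In the notebook, the stubs would be replaced by real engines, for example a dictionary built with `index.as_chat_engine(chat_mode=mode, llm=llm)` for each mode you want to compare.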

## Task 2
Test your chat engines. Chat with your data by asking these questions:

**Factual Questions (Direct Retrieval):**

Which club does Manuel Neuer play for?

What is the age of the youngest player in the dataset?

How many international caps does Marc-André ter Stegen have?

Which player has scored the most goals in international matches?

What is the total market value of all players in the dataset?

**Comparative Questions (Simple Inference):**

Which goalkeeper has the higher market value, ter Stegen or Neuer?

Which player is taller, Schlotterbeck or Tah?

Who is older, Baumann or Neuer?

Which centre-back has more international caps?

**Complex Questions (Requires Reasoning):**

Based on the data, who would you consider the most valuable goalkeeper for Germany? Explain your reasoning.

Which centre-back might be a better fit for a team prioritizing defensive experience? Why?

If you were a club manager looking for a young, promising centre-back, who would you be interested in signing? Justify your choice.

**Open-ended Questions (Promotes Generation):**

What potential strengths and weaknesses could this group of players bring to the German national team at Euro 2024?

How might the combination of experience and youth in this dataset impact Germany's performance in the tournament?

**Bonus Question (Tests Out-of-Domain Knowledge):**

Which of these players were part of Germany's World Cup-winning squad in 2014? (This requires external knowledge, as it's not in the provided dataset)
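
To judge the chat engine's answers you need ground truth. For the factual and comparative questions it can be computed directly from the table. The sketch below uses only the five German players shown in the `data.head()` output above (an inline sample, so the numbers apply to that sample and not to the full dataset) and the standard `csv` module.

```python
import csv
import io

# Sample rows copied from the data.head() output earlier in the notebook.
sample_csv = """Name,Position,Age,Club,Height,Foot,Caps,Goals,MarketValue,Country
Marc-André ter Stegen,Goalkeeper,32,FC Barcelona,187,right,40,0,28000000,Germany
Manuel Neuer,Goalkeeper,38,Bayern Munich,193,right,119,0,4000000,Germany
Oliver Baumann,Goalkeeper,34,TSG 1899 Hoffenheim,187,right,0,0,3000000,Germany
Nico Schlotterbeck,Centre-Back,24,Borussia Dortmund,191,left,12,0,40000000,Germany
Jonathan Tah,Centre-Back,28,Bayer 04 Leverkusen,195,right,25,0,30000000,Germany
"""

players = list(csv.DictReader(io.StringIO(sample_csv)))

# Youngest player in the sample.
youngest = min(players, key=lambda p: int(p["Age"]))

# Total market value of the sample, in euros.
total_value = sum(int(p["MarketValue"]) for p in players)

# Centre-back with the most international caps in the sample.
centre_backs = [p for p in players if p["Position"] == "Centre-Back"]
most_capped_cb = max(centre_backs, key=lambda p: int(p["Caps"]))

print(youngest["Name"], total_value, most_capped_cb["Name"])
```

On the full DataFrame loaded earlier, the same checks are one-liners such as `data["Age"].min()` and `data["MarketValue"].sum()`. Aggregations like these span every row, so compare them carefully with what a retrieval-based chat engine returns.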

#Part 2. Unstructured data