## General note: be aware that LLMs are not deterministic, so you may get different results when you rerun this notebook on your own.

### Let's ask the GPT API a simple question about a Premier League match.

In [1]:
question = 'In which premier league match Zaroury was sent off after a foul on Walker?'

In [2]:
from openai import OpenAI
import os
MODEL = 'gpt-4o-mini'
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

In [3]:
def ask_gpt(user_prompt):
    completion = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "user", "content": user_prompt},
        ],
        temperature=0,
        max_tokens=200
    )
    return completion.choices[0].message.content

In [4]:
print(ask_gpt(question))

Zaroury was sent off after a foul on Kyle Walker during the Premier League match between Burnley and Manchester City on September 30, 2023.


### How could this happen? This match didn't take place on that date! Can AI make mistakes?! SOMEBODY CALL 911 !!!!!1111 (we need a RAG here ;)) ... Jokes aside, I think we have a hallucination here. Let's check the official Premier League webpage to verify this information. If you follow the link below, you will see that the match actually took place on August 11th, not September 30th.
Link - https://www.premierleague.com/match/93321
![match date](images/burnley-city-actual.png "match date")

### If you ask the ChatGPT website, it responds with the correct answer, but we can see that the answer was not taken from the model weights; it came from a web search the site issued (see below).

![Chat GPT Web response](images/zaroury-gpt.png "Chat GPT Web response")

### Maybe they use web scraping along with RAG when they detect potential hallucinations :)

### Let's bring RAG into this process. I've found a suitable dataset on [Kaggle](https://www.kaggle.com/): [Premier League commentary](https://www.kaggle.com/datasets/pranavkarnani/english-premier-league-match-commentary). Let's download it and use it as context for the question.

In [5]:
#!/bin/bash
!mkdir data && curl -L -o ./data/english-premier-league-match-commentary.zip \
  https://www.kaggle.com/api/v1/datasets/download/pranavkarnani/english-premier-league-match-commentary && unzip ./data/english-premier-league-match-commentary.zip -d ./data/

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0-:-- --:--:--     0
100 2064k  100 2064k    0     0  1504k      0  0:00:01  0:00:01 --:--:-- 4556k
Archive:  ./data/english-premier-league-match-commentary.zip
  inflating: ./data/23_24_match_details.csv  
  inflating: ./data/23_24_match_stats.csv  


In [6]:
import pandas as pd

In [8]:
match_csv_df = pd.read_csv('./data/23_24_match_details.csv', index_col='id')
match_csv_df.head()

id,Home,Away,Date,Stadium,Attendance,Referee,events,summary
93323,Bournemouth,West Ham,2023-08-12,"Vitality Stadium, Bournemouth",,Robert Jones,Hello and welcome to live coverage of the Prem...,Referee: Peter Bankes. Assistants: Dan Robatha...
93336,Man City,Newcastle,2023-08-19,"Etihad Stadium, Manchester",,Robert Jones,Hello everyone and welcome to live text covera...,"Referee: Robert Jones. Assistants: Ian Hussin,..."
93343,Brentford,Crystal Palace,2023-08-26,"Gtech Community Stadium, Brentford",16997.0,Peter Bankes,Hello and welcome to the live commentary of th...,
93344,Brighton,West Ham,2023-08-26,"American Express Stadium, Falmer",31508.0,Anthony Taylor,Hello and welcome to live coverage of the Prem...,Referee: Anthony Taylor. Assistants: Gary Besw...
93347,Everton,Wolves,2023-08-26,"Goodison Park, Liverpool",38851.0,Craig Pawson,Hello and welcome to live coverage of this Pre...,


### Let's merge all the columns into one information column called `text`.

In [9]:
match_csv_df['text'] = match_csv_df.astype(str).agg(' '.join, axis=1)
match_csv_df.head()

id,Home,Away,Date,Stadium,Attendance,Referee,events,summary,text
93323,Bournemouth,West Ham,2023-08-12,"Vitality Stadium, Bournemouth",,Robert Jones,Hello and welcome to live coverage of the Prem...,Referee: Peter Bankes. Assistants: Dan Robatha...,Bournemouth West Ham 2023-08-12 Vitality Stadi...
93336,Man City,Newcastle,2023-08-19,"Etihad Stadium, Manchester",,Robert Jones,Hello everyone and welcome to live text covera...,"Referee: Robert Jones. Assistants: Ian Hussin,...","Man City Newcastle 2023-08-19 Etihad Stadium, ..."
93343,Brentford,Crystal Palace,2023-08-26,"Gtech Community Stadium, Brentford",16997.0,Peter Bankes,Hello and welcome to the live commentary of th...,,Brentford Crystal Palace 2023-08-26 Gtech Comm...
93344,Brighton,West Ham,2023-08-26,"American Express Stadium, Falmer",31508.0,Anthony Taylor,Hello and welcome to live coverage of the Prem...,Referee: Anthony Taylor. Assistants: Gary Besw...,Brighton West Ham 2023-08-26 American Express ...
93347,Everton,Wolves,2023-08-26,"Goodison Park, Liverpool",38851.0,Craig Pawson,Hello and welcome to live coverage of this Pre...,,"Everton Wolves 2023-08-26 Goodison Park, Liver..."


In [10]:
pd.set_option('display.max_colwidth', 200)
match_df = pd.DataFrame(match_csv_df['text'])
match_df.head()

id,text
93323,"Bournemouth West Ham 2023-08-12 Vitality Stadium, Bournemouth nan Robert Jones Hello and welcome to live coverage of the Premier League clash between Burnley and Manchester City at Turf Moor.\nThe..."
93336,"Man City Newcastle 2023-08-19 Etihad Stadium, Manchester nan Robert Jones Hello everyone and welcome to live text coverage of the Premier League match between Arsenal and Nottingham Forest at the ..."
93343,"Brentford Crystal Palace 2023-08-26 Gtech Community Stadium, Brentford 16997.0 Peter Bankes Hello and welcome to the live commentary of the Premier League clash between Bournemouth and West Ham at..."
93344,"Brighton West Ham 2023-08-26 American Express Stadium, Falmer 31508.0 Anthony Taylor Hello and welcome to live coverage of the Premier League meeting between Brighton and Hove Albion and Luton Tow..."
93347,"Everton Wolves 2023-08-26 Goodison Park, Liverpool 38851.0 Craig Pawson Hello and welcome to live coverage of this Premier League fixture as Everton get their 2023-24 campaign under way against Fu..."


In [11]:
pd.reset_option('display.max_colwidth');

In [12]:
match_df.iloc[0]

text    Bournemouth West Ham 2023-08-12 Vitality Stadi...
Name: 93323, dtype: object

### Our match is first on the list. Let's extract the commentary with other details into a string which will be our context.

In [13]:
first_match = match_df['text'].iloc[0]
print(first_match)

Bournemouth West Ham 2023-08-12 Vitality Stadium, Bournemouth nan Robert Jones Hello and welcome to live coverage of the Premier League clash between Burnley and Manchester City at Turf Moor.
The Premier League is back for a new season, and it gets underway with an intriguing match-up between a newly promoted side and the reigning champions. Burnley could have hardly been given a tougher opening fixture, but they should come into it with confidence after cruising to the Championship title last time out in their first season back in the second tier. Led by Vincent Kompany, a Citizens' legend from his playing career, Burnley reached 101 points and secured promotion with seven games to play, and they will be hoping to get off the mark with a positive result on home soil.
City, meanwhile, claimed a historic treble last term, finally winning their first Champions League title while also taking Premier League and FA Cup glory. It has been a transfer window of change with key players of recen

### This is the part where Zaroury is sent off with a red card

In [14]:
zaroury_red_index = first_match.find('ZAROURY IS SENT OFF!!!')
print(first_match[zaroury_red_index-188:zaroury_red_index+295])

Zaroury brings down Walker with a sliding challenge, receiving the first yellow of the contest. However, VAR is checking for a potential red card, and Zaroury may be in further trouble...
ZAROURY IS SENT OFF!!! The replays show the substitute's right foot catches Walker high on his left leg, and a VAR review rules the tackle is worthy of a red card. Burnley are down to 10 men!
Walker, just involved in the incident that saw Zaroury given his marching orders, makes way for McAtee.


### Let's enhance our query with context containing match details.

In [15]:
context = first_match
question = 'In which premier league match Zaroury was sent off after a foul on Walker?'

### This time we will ask a local Ollama instance, as we are aiming for a RAG that can run even on bare-metal servers in a highly secure, air-gapped (no Internet) environment. You can start Ollama locally by running this command:
```bash
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama_0_4_7 ollama/ollama:0.4.7
```

In [17]:
!docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama_0_4_7 ollama/ollama:0.4.7

886bff33b71e32abeeb1aef99e1f72faf275a9e5bcd67c51261822059113890c


### Let's run a pull command inside the Ollama container to download our target model

In [18]:
!docker exec ollama_0_4_7 ollama pull llama3.2:3b

pulling manifest 
pulling dde5aa3fc5ff... 100% ▕████████████████▏ 2.0 GB                         
pulling 966de95ca8a6... 100% ▕████████████████▏ 1.4 KB                         
pulling fcc5a6bec9da... 100% ▕████████████████▏ 7.7 KB                         
pulling a70ff7e570d9... 100% ▕████████████████▏ 6.0 KB                         
pulling 56bb8bd477a5... 100% ▕████████████████▏   96 B                         
pulling 34bb5ab01051... 100% ▕████████████████▏  561 B                         
verifying sha256 digest 
writing manifest 
success


In [19]:
from ollama import chat, ChatResponse

In [20]:
prompt = f"""Given below context:

```
{context}
```

Answer below question:

```
{question}
```
"""

response: ChatResponse = chat(model='llama3.2:1b', messages=[
  {
    'role': 'user',
    'content': prompt,
  },
])
print(response.message.content)

Zaroury was sent off in the Premier League match against Burnley at Turf Moor after a foul on Walker, with a red card.


### The response looks promising, but let's get the date of the match.

In [21]:
context = first_match
question = 'In which premier league match Zaroury was sent off after a foul on Walker? Anwer with result and date of the match.'

In [23]:
prompt = f"""Given below context:

```
{context}
```

Answer below question:

```
{question}
```
"""

response: ChatResponse = chat(model='llama3.2:1b', messages=[
  {
    'role': 'user',
    'content': prompt,
  },
])
print(response.message.content)

Zaroury was sent off in Manchester City's 3-0 Premier League victory over Burnley at Turf Moor, on October 1, 2022.


### Originally it took over 3 minutes to get a response on my 8-core Mac Pro 2019. Check the resource consumption below (BTW: the last screenshot is from [lazydocker](https://github.com/jesseduffield/lazydocker)):
![](images/activity-monitor.png "")
![](images/cpu-history.png "")
![](images/lazydocker-ollama.png "")
### But the answer is wrong. Where did it get this October from? Let's try to find out...

In [24]:
import re

In [25]:
re.compile('october|oct', re.IGNORECASE).findall(first_match)

[]

In [26]:
re.compile('10', re.IGNORECASE).findall(first_match)

['10', '10', '10']

In [27]:
for match in re.compile('10', re.IGNORECASE).finditer(first_match):
    print(first_match[match.start()-15:match.end()+15])

urnley reached 101 points and se
nham (0-0 in 2010-11, 0-1 in 202
ey are down to 10 men!
Walker, j


### OK, there are some 10s in the text, but none of them suggests October. Let's try the same question with the bigger (3B) model. To do this, we need to pull the bigger model with the command below.

In [28]:
!docker exec ollama_0_4_7 ollama pull llama3.2:3b

pulling manifest 
pulling dde5aa3fc5ff... 100% ▕████████████████▏ 2.0 GB                         
pulling 966de95ca8a6... 100% ▕████████████████▏ 1.4 KB                         
pulling fcc5a6bec9da... 100% ▕████████████████▏ 7.7 KB                         
pulling a70ff7e570d9... 100% ▕████████████████▏ 6.0 KB                         
pulling 56bb8bd477a5... 100% ▕████████████████▏   96 B                         
pulling 34bb5ab01051... 100% ▕████████████████▏  561 B                         
verifying sha256 digest 
writing manifest 
success


In [29]:
prompt = f"""Given below context:

```
{context}
```

Answer below question:

```
{question}
```
"""

response: ChatResponse = chat(model='llama3.2:3b', messages=[
  {
    'role': 'user',
    'content': prompt,
  },
])
print(response.message.content)

The Premier League match in which Zaroury was sent off after a foul on Walker was against Manchester City at Turf Moor. However, Zaroury's sending-off occurred 6 minutes into the additional time period.

I couldn't find any information about Zaroury being sent off with 20 minutes to play and Burnley down by one goal.


### Hmmm..., strange, let's check the Ollama container logs ...
![ollama logs](images/truncated-prompt.png "ollama logs")
### OMG, context was truncated, let's fix it and try again with both Ollama models.
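A rough way to anticipate this kind of truncation is to compare an estimate of the prompt's token count with the context window before sending it. The sketch below assumes about 4 characters per token for English text (a rule of thumb, not the model's real tokenizer). The helper names `estimate_tokens` and `fits_context` are hypothetical, and the 2048 default is only my assumption about Ollama's default `num_ctx`, so check it against your version.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; a real tokenizer will give different numbers."""
    return int(len(text) / chars_per_token) + 1


def fits_context(prompt: str, num_ctx: int = 2048, reserve_for_answer: int = 256) -> bool:
    """Return True if the estimated prompt size still leaves room for the answer."""
    return estimate_tokens(prompt) + reserve_for_answer <= num_ctx


short_prompt = "Who scored first?"
long_prompt = "commentary line. " * 2000  # 34,000 characters of filler text

print(fits_context(short_prompt))                # small prompt fits
print(fits_context(long_prompt))                 # too long for the assumed default window
print(fits_context(long_prompt, num_ctx=16384))  # fits once the window is enlarged
```

If the estimate does not fit, either raise `num_ctx` in `options` or shorten the context before calling the model.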

In [30]:
prompt = f"""Given below context:

```
{context}
```

Answer below question:

```
{question}
```
"""

response: ChatResponse = chat(model='llama3.2:1b', messages=[
  {
    'role': 'user',
    'content': prompt,
  }],
  options={"num_ctx": 4096}
)
print(response.message.content)

Zarur was sent off in the Premier League match against West Ham on September 24, 2023. The match was played at the London Stadium, and Manchester City won 2-0.


In [31]:
prompt = f"""Given below context:

```
{context}
```

Answer below question:

```
{question}
```
"""

response: ChatResponse = chat(model='llama3.2:3b', messages=[
  {
    'role': 'user',
    'content': prompt,
  }],
  options={"num_ctx": 4096}
)
print(response.message.content)

The match where Zaroury was sent off after a foul on Walker is Manchester City vs Burnley, played on 12 August 2023 at Turf Moor. The match ended in a 3-0 win for Manchester City.

Zaroury received a red card and was sent off in the 84th minute of the game.


### OK, it seems we will not get this answer from the Llama 1B model, but 3B sometimes answers "correctly". On the one hand, this is understandable, because the date is only loosely dropped in at the beginning of the match commentary; on the other hand, it can be derived from the text. Let's try the GPT API, this time including the context the same way as for Ollama.

In [32]:
print(prompt)

Given below context:

```
Bournemouth West Ham 2023-08-12 Vitality Stadium, Bournemouth nan Robert Jones Hello and welcome to live coverage of the Premier League clash between Burnley and Manchester City at Turf Moor.
The Premier League is back for a new season, and it gets underway with an intriguing match-up between a newly promoted side and the reigning champions. Burnley could have hardly been given a tougher opening fixture, but they should come into it with confidence after cruising to the Championship title last time out in their first season back in the second tier. Led by Vincent Kompany, a Citizens' legend from his playing career, Burnley reached 101 points and secured promotion with seven games to play, and they will be hoping to get off the mark with a positive result on home soil.
City, meanwhile, claimed a historic treble last term, finally winning their first Champions League title while also taking Premier League and FA Cup glory. It has been a transfer window of change

In [33]:
print(ask_gpt(prompt))

Zaroury was sent off after a foul on Walker in the Premier League match between Burnley and Manchester City, which ended with a result of Burnley 0-3 Manchester City on August 12, 2023.


### OK, we got what we wanted, but of course GPT didn't verify that it should be August 11th. That is fine, because we forced it to use our "ground truth".

In [34]:
def ask_llama(prompt=prompt, model='llama3.2:1b'):
  response: ChatResponse = chat(model=model, messages=[
    {
      'role': 'user',
      'content': prompt,
    }],
    options={"num_ctx": 8096}
  )
  print(response.message.content)

In [35]:
def gen_prompt(question, context=first_match):
    prompt = f"""Given below context:

```
{context}
```

Answer below question:

```
{question}
```
"""
    return prompt

### Let's try a different question with all our models. We would like to get information that it was a "clever header".

In [36]:
question = "In what way Rodri asisted Haland's first goal in opening match for Premier League season 2023/2024?"

In [37]:
rodri_assist = gen_prompt(question)
print(rodri_assist)

Given below context:

```
Bournemouth West Ham 2023-08-12 Vitality Stadium, Bournemouth nan Robert Jones Hello and welcome to live coverage of the Premier League clash between Burnley and Manchester City at Turf Moor.
The Premier League is back for a new season, and it gets underway with an intriguing match-up between a newly promoted side and the reigning champions. Burnley could have hardly been given a tougher opening fixture, but they should come into it with confidence after cruising to the Championship title last time out in their first season back in the second tier. Led by Vincent Kompany, a Citizens' legend from his playing career, Burnley reached 101 points and secured promotion with seven games to play, and they will be hoping to get off the mark with a positive result on home soil.
City, meanwhile, claimed a historic treble last term, finally winning their first Champions League title while also taking Premier League and FA Cup glory. It has been a transfer window of change

In [38]:
ask_llama(rodri_assist)

Rodri helped Haland score the second goal in Manchester City's 3-0 Premier League victory over Burnley at Turf Moor, by delivering a precise free-kick that Haaland slotted home.


### Hmmm... totally wrong, smallest Llama might be too "dumb" for this question, but let's try again.

In [39]:
ask_llama(rodri_assist)

Rodri assisted Haland's first goal in the opening match of the 2023-2024 Premier League season by delivering a deep cross from which Haaland scored.


### Not exactly what we planned. Let's try bigger Llama.

In [40]:
ask_llama(prompt=rodri_assist, model='llama3.2:3b')

Rodri assisted Haaland's first goal with a clever headed assist, where he nodded the ball back across to Haaland, who then smashed home to put City 1-0 up.


### Good answer, but it usually does not return exactly what we planned. Let's try the GPT API.

In [41]:
print(ask_gpt(rodri_assist))

Rodri assisted Haaland's first goal in the opening match of the Premier League season 2023/2024 by nodding the ball back across the box after meeting a deep cross from Kevin De Bruyne. Haaland was positioned in the box to finish the move, smashing the ball home to put Manchester City ahead.


### Not exactly what we planned. Let's try a simple question.

In [42]:
question = "Which two important Citizens defenders missed opening match in season 2023/2024?"

In [43]:
two_defenders = gen_prompt(question)
print(two_defenders)

Given below context:

```
Bournemouth West Ham 2023-08-12 Vitality Stadium, Bournemouth nan Robert Jones Hello and welcome to live coverage of the Premier League clash between Burnley and Manchester City at Turf Moor.
The Premier League is back for a new season, and it gets underway with an intriguing match-up between a newly promoted side and the reigning champions. Burnley could have hardly been given a tougher opening fixture, but they should come into it with confidence after cruising to the Championship title last time out in their first season back in the second tier. Led by Vincent Kompany, a Citizens' legend from his playing career, Burnley reached 101 points and secured promotion with seven games to play, and they will be hoping to get off the mark with a positive result on home soil.
City, meanwhile, claimed a historic treble last term, finally winning their first Champions League title while also taking Premier League and FA Cup glory. It has been a transfer window of change

In [44]:
ask_llama(question)

I'm not aware of the specific teams or leagues you're referring to. Can you please provide more context or information about the Citizens defenders and their matches for the upcoming season 2023/2024? I'll do my best to help.


### At least the smallest Llama without a context doesn't try to hallucinate this time. ++ for that

In [45]:
ask_llama(two_defenders)

John Stones and Ruben Dias both missed the opening match of the season against Burnley at Turf Moor.


### With context it is correct.

In [46]:
print(ask_gpt(question))

In the opening match of the 2023/2024 season, Manchester City missed two important defenders: Ruben Dias and John Stones. Both players were unavailable due to injuries.


### GPT is also correct even without a context.

### OK, we've passed an arbitrarily chosen context, but what if we don't know which context is the right one? Then we need to embed every context, run a similarity check, and use the context that is "closest" to the question. We will do this exercise using Llama models and the [Facebook AI Similarity Search](https://ai.meta.com/tools/faiss/) (FAISS) library.
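The idea can be sketched without any model: turn every context and the question into vectors, then pick the context with the smallest distance to the question. The `toy_embed` function below is a hypothetical stand-in (word counts over a tiny fixed vocabulary), not what Ollama or FAISS do internally; only the nearest-neighbour step mirrors the approach used in the rest of this notebook.

```python
import numpy as np

VOCAB = ["red", "card", "goal", "header", "penalty", "sent", "off"]


def toy_embed(text: str) -> np.ndarray:
    """Hypothetical embedding: count vocabulary words, then L2-normalise."""
    words = text.lower().replace(".", " ").replace("?", " ").split()
    vec = np.array([words.count(w) for w in VOCAB], dtype="float32")
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


contexts = [
    "Haaland scores the first goal after a clever header from Rodri.",
    "Zaroury is sent off. A red card after the VAR review.",
    "A penalty is awarded and converted for the second goal.",
]
context_matrix = np.stack([toy_embed(c) for c in contexts])  # shape (N, D)

query = toy_embed("Who was sent off with a red card?")
# Squared L2 distance from the query to every context vector.
distances = ((context_matrix - query) ** 2).sum(axis=1)
best = int(np.argmin(distances))
print(best, contexts[best])
```

With real embeddings the vectors come from a model and the distance computation is delegated to an index, but the selection logic stays the same.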

In [47]:
from ollama import embed

### Let's try sample embedding

In [48]:
emb = embed(model='llama3.2:1b', input='Fantastic shot from De Bruyne')

### Embedding size

In [49]:
len(emb.embeddings[0])

2048

In [50]:
import faiss

### Embedding size must be used to create FAISS index

In [51]:
faiss_idx = faiss.IndexFlatL2(len(emb.embeddings[0]))

### Let's embed all our commentary texts

In [52]:
def apply_faiss_embeddings(text):
    text = text.replace('\n', ' ')
    embedding = embed(
        model='llama3.2:1b',
        input=text,
        options={"num_ctx": 8096}
    )
    return embedding.embeddings[0]

In [124]:
match_df['embeddings'] = match_df['text'].apply(apply_faiss_embeddings)

### Uffff... it took 900 minutes on my laptop, so we had better persist this.

In [None]:
match_df.to_csv('match_with_embeddings.csv')

### Important note: here I faced an issue with embeddings being enclosed in double quotes in the persisted CSV file, which caused problems when converting to a NumPy array later (the list was treated as a string). To prevent this, `pd.eval` is used as a converter when loading the file - see: https://stackoverflow.com/a/23112008/13182755.
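The effect is easy to reproduce on a tiny frame. The sketch below writes a list-valued column to CSV in memory and reads it back twice, once without and once with a converter. It uses `json.loads` as the converter instead of `pd.eval`, which is my substitution and not what this notebook does; it assumes the stored lists contain only plain numbers.

```python
import io
import json

import pandas as pd

df = pd.DataFrame({"id": [1, 2], "embeddings": [[0.1, -0.2], [0.3, 0.4]]})

buffer = io.StringIO()
df.to_csv(buffer, index=False)  # list cells are written as quoted strings

buffer.seek(0)
raw = pd.read_csv(buffer)
print(type(raw["embeddings"].iloc[0]))     # the cell comes back as a str

buffer.seek(0)
parsed = pd.read_csv(buffer, converters={"embeddings": json.loads})
print(type(parsed["embeddings"].iloc[0]))  # the cell comes back as a list
```

For large tables, a binary format that stores arrays natively would avoid the string round trip altogether, at the cost of a less readable file.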

In [53]:
match_df = pd.read_csv('match_with_embeddings.csv', converters={'embeddings': pd.eval})
match_df.head()

Unnamed: 0,id,text,embeddings
0,93323,Bournemouth West Ham 2023-08-12 Vitality Stadi...,"[0.00982937, -0.015813109, 0.013958775, 0.0210..."
1,93336,"Man City Newcastle 2023-08-19 Etihad Stadium, ...","[0.013961506, -0.010563543, 0.015210669, 0.013..."
2,93343,Brentford Crystal Palace 2023-08-26 Gtech Comm...,"[-0.008094572, 0.024903629, 0.013604184, -0.00..."
3,93344,Brighton West Ham 2023-08-26 American Express ...,"[0.021451954, -0.012039059, 0.014355913, 0.019..."
4,93347,"Everton Wolves 2023-08-26 Goodison Park, Liver...","[-0.010019751, 0.02847876, 0.012722477, -0.007..."


In [54]:
match_df['embeddings'].head()

0    [0.00982937, -0.015813109, 0.013958775, 0.0210...
1    [0.013961506, -0.010563543, 0.015210669, 0.013...
2    [-0.008094572, 0.024903629, 0.013604184, -0.00...
3    [0.021451954, -0.012039059, 0.014355913, 0.019...
4    [-0.010019751, 0.02847876, 0.012722477, -0.007...
Name: embeddings, dtype: object

### Wouldn't it be cool if you could add embeddings to an index just like below?

In [55]:
faiss_idx.add(match_df['embeddings'])

ValueError: not enough values to unpack (expected 2, got 1)

### We need to pass the embeddings to the index as an NxD matrix, where N is the number of embedded entries and D is the embedding size.
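A small sketch of why the Series itself does not work: a pandas Series of lists converts to a one-dimensional object array, so there is no second dimension to unpack (presumably what the `ValueError` above is complaining about). Going through `tolist()` yields a proper two-dimensional array. The `float32` cast is there because FAISS works with 32-bit floats; the tiny 2x3 data is made up for illustration.

```python
import numpy as np
import pandas as pd

embeddings = pd.Series([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])  # N=2 entries, D=3

as_object = embeddings.to_numpy()
print(as_object.shape, as_object.dtype)  # one-dimensional array of Python lists

matrix = np.asarray(embeddings.tolist(), dtype="float32")
print(matrix.shape, matrix.dtype)        # two-dimensional N x D float matrix
```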

In [57]:
import numpy as np

In [58]:
emb_vectors = np.array(match_df['embeddings'].tolist())
emb_vectors.shape

(303, 2048)

In [59]:
faiss_idx.add(emb_vectors)

### Let's now embed a question

In [60]:
question = 'In which premier league match Zaroury was sent off after a foul on Walker?'

In [61]:
query_emb = apply_faiss_embeddings(question)
print(len(query_emb))
print(query_emb[:5])

2048
[-0.0028298623, 0.054887388, 0.001383555, -0.0043045585, 0.04309273]


In [62]:
query_emb_array = np.array(query_emb).reshape(1, -1)
query_emb_array.shape

(1, 2048)

### And find 10 "closest" match commentaries.

In [63]:
distances, indices = faiss_idx.search(query_emb_array, k=10)
print(distances)
print(indices)

[[0.6306199  0.641779   0.64206946 0.6510188  0.65511256 0.66085064
  0.6622771  0.6720246  0.6773503  0.6774502 ]]
[[289 258 168 254 112  72 116 118 218 150]]


In [64]:
match_df.iloc[indices[0][0]].text

'Brighton Arsenal 2024-04-06 American Express Stadium, Falmer 31677.0 John Brooks Hello and welcome to live coverage of Newcastle United v Everton.\nNewcastle are on a four-match unbeaten run and looking to build on their late season momentum, having come from 3-1 down to beat West Ham 4-3 at the weekend. The hosts are in eighth place on 43 points, trailing fifth-placed Aston Villa by 13 points for a Europa League spot. They’ve only lost three times at home this season, while their record against Everton will give them further confidence: the Magpies have won five of their last seven meetings. A win today would see them leapfrog West Ham into sixth place.\nEverton are desperate to bring a halt to a grim run of results, and will likely take inspiration from the fact that their only win in eight midweek Premier League matches was a 3-0 victory over Newcastle in the reverse fixture. The Toffees sit 16th in the Premier League, just three points above the drop zone, and are winless in their

### Not good: the "closest" should be index 0, because the question refers to the first match. Let's see whether a model dedicated to embeddings performs better here.

In [65]:
!docker exec ollama_0_4_7 ollama pull nomic-embed-text:v1.5

pulling manifest 
pulling 970aa74c0a90... 100% ▕████████████████▏ 274 MB                         
pulling c71d239df917... 100% ▕████████████████▏  11 KB                         
pulling ce4a164fc046... 100% ▕████████████████▏   17 B                         
pulling 31df23ea7daa... 100% ▕████████████████▏  420 B                         
verifying sha256 digest 
writing manifest 
success


In [66]:
from ollama import embeddings
import ollama

In [68]:
def apply_faiss_embeddings(text):
    text = text.replace('\n', ' ')
    embedding = embeddings(
        model='nomic-embed-text:v1.5',
        prompt=text,
        options={"num_ctx": 8096}
    )
    return embedding.embedding

### This time around 200 minutes ...

In [None]:
match_df['nomic_embd'] = match_df['text'].apply(apply_faiss_embeddings)

In [None]:
match_df.to_csv('match_with_nomic_emb.csv')

In [69]:
match_df = pd.read_csv('match_with_nomic_emb.csv', converters={'nomic_embd': pd.eval, 'embeddings': pd.eval}, index_col='id')
match_df.head()

id,text,embeddings,nomic_embd
93323,Bournemouth West Ham 2023-08-12 Vitality Stadi...,"[0.00982937, -0.015813109, 0.013958775, 0.0210...","[0.9866039752960205, -0.4991585314273834, -3.1..."
93336,"Man City Newcastle 2023-08-19 Etihad Stadium, ...","[0.013961506, -0.010563543, 0.015210669, 0.013...","[0.1500709056854248, 0.009443949908018112, -2...."
93343,Brentford Crystal Palace 2023-08-26 Gtech Comm...,"[-0.008094572, 0.024903629, 0.013604184, -0.00...","[0.5953104496002197, 0.40444305539131165, -3.0..."
93344,Brighton West Ham 2023-08-26 American Express ...,"[0.021451954, -0.012039059, 0.014355913, 0.019...","[0.36835864186286926, -0.3567712604999542, -3...."
93347,"Everton Wolves 2023-08-26 Goodison Park, Liver...","[-0.010019751, 0.02847876, 0.012722477, -0.007...","[-0.15921032428741455, 0.7213845252990723, -3...."


In [70]:
emb_vectors = np.array(match_df['nomic_embd'].tolist())
emb_vectors.shape

(303, 768)

In [71]:
faiss_idx = faiss.IndexFlatL2(emb_vectors.shape[1])

In [72]:
faiss_idx.add(emb_vectors)

### Let's again try to find 10 "closest" match commentaries (now using `nomic` embeddings).

In [75]:
query_emb = apply_faiss_embeddings(question)
print(len(query_emb))
print(query_emb[:5])

768
[0.786298394203186, -0.3261018395423889, -4.200957775115967, -0.2517128586769104, 1.574384093284607]


In [76]:
query_emb_array = np.array(query_emb).reshape(1, -1)
query_emb_array.shape

(1, 768)

In [77]:
distances, indices = faiss_idx.search(query_emb_array, k=10)
print(distances)
print(indices)

[[267.42175 270.1125  270.14392 270.6881  272.0106  272.3169  272.9947
  275.40152 275.45593 276.5636 ]]
[[281  97 124 103 157 218 121 229 148 176]]


### Hmmm... the "closest" should be index 0, because the question refers to the first match in the dataset. Something is wrong here. Let's check how large a part of the original text we need to embed to get the first match as the "closest".

#### Trying with the most important phrase only ... no luck

In [78]:
q = """ZAROURY IS SENT OFF"""
d, i = faiss_idx.search(np.array(apply_faiss_embeddings(q)).reshape(1, -1), k=10)
print(match_df.iloc[i[0][0]]['text'][:100])

Aston Villa Sheffield Utd 2023-12-22 Villa Park, Birmingham 41651.0 Anthony Taylor Hello and welcome


#### Trying with the first 100 characters of original text ... no luck

In [79]:
q = match_df.iloc[0]['text'][:100]
d, i = faiss_idx.search(np.array(apply_faiss_embeddings(q)).reshape(1, -1), k=10)
print(match_df.iloc[i[0][0]]['text'][:100])

Brighton Bournemouth 2023-09-24 American Express Stadium, Falmer 31617.0 John Brooks nan Referee: Jo


#### Trying with the first 150 characters of original text ... no luck

In [80]:
q = match_df.iloc[0]['text'][:150]
d, i = faiss_idx.search(np.array(apply_faiss_embeddings(q)).reshape(1, -1), k=10)
print(match_df.iloc[i[0][0]]['text'][:100])

Brentford Crystal Palace 2023-08-26 Gtech Community Stadium, Brentford 16997.0 Peter Bankes Hello an


#### Trying with the first 200 characters of original text ... success!

In [81]:
q = match_df.iloc[0]['text'][:200]
d, i = faiss_idx.search(np.array(apply_faiss_embeddings(q)).reshape(1, -1), k=10)
print(match_df.iloc[i[0][0]]['text'][:100])

Bournemouth West Ham 2023-08-12 Vitality Stadium, Bournemouth nan Robert Jones Hello and welcome to 


### Something is definitely wrong in our process. I think a single embedding is too small, relative to the size of the embedded text, to store the information properly. Let's try chunking the text by splitting it on every newline and repeat the whole experiment.

In [82]:
match_df['text_splitted'] = match_df['text'].str.split('\n')
len(match_df.iloc[0]['text_splitted'])

73

In [83]:
match_df = match_df.explode(['text_splitted'])

In [84]:
len(match_df)

28553
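
For reference, the split-and-explode step can be sketched in plain Python as below. This is only an illustration under assumptions: `split_into_chunks` and `toy_texts` are hypothetical names, and unlike the cells above this version skips blank lines, since a blank chunk has nothing to embed.

```python
from typing import Dict, List, Tuple


def split_into_chunks(texts: Dict[int, str]) -> List[Tuple[int, str]]:
    """Split each match text at newlines and return (match_id, chunk) pairs."""
    chunks = []
    for match_id, text in texts.items():
        for line in text.split("\n"):
            # Skip blank lines so that no empty chunk reaches the embedding step.
            if line.strip():
                chunks.append((match_id, line))
    return chunks


# Toy data, not taken from the dataset.
toy_texts = {1: "Kick-off!\n\nGOAL!", 2: "Full time."}
toy_chunks = split_into_chunks(toy_texts)
print(toy_chunks)
```

Each chunk keeps its match id, which plays the same role as the `id` index that `explode` repeats for every row.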

In [85]:
def apply_nomic_embeddings(text):
    # text = text.replace('\n', ' ')
    embedding = embeddings(
        model='nomic-embed-text:v1.5',
        prompt=text,
        options={"num_ctx": 8096}
    )
    return embedding.embedding

In [86]:
match_df = match_df.drop(['text'], axis=1)

In [87]:
match_df.head()

Unnamed: 0_level_0,embeddings,nomic_embd,text_splitted
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
93323,"[0.00982937, -0.015813109, 0.013958775, 0.0210...","[0.9866039752960205, -0.4991585314273834, -3.1...",Bournemouth West Ham 2023-08-12 Vitality Stadi...
93323,"[0.00982937, -0.015813109, 0.013958775, 0.0210...","[0.9866039752960205, -0.4991585314273834, -3.1...","The Premier League is back for a new season, a..."
93323,"[0.00982937, -0.015813109, 0.013958775, 0.0210...","[0.9866039752960205, -0.4991585314273834, -3.1...","City, meanwhile, claimed a historic treble las..."
93323,"[0.00982937, -0.015813109, 0.013958775, 0.0210...","[0.9866039752960205, -0.4991585314273834, -3.1...",BURNLEY (4-4-2): James Trafford; Connor Robert...
93323,"[0.00982937, -0.015813109, 0.013958775, 0.0210...","[0.9866039752960205, -0.4991585314273834, -3.1...","SUBS: Josh Brownhill, Jacob Bruun Larsen, Arij..."


### Embedding took 267 minutes.

In [181]:
match_df['nomic_embd_splt'] = match_df['text_splitted'].apply(apply_nomic_embeddings)

In [182]:
match_df.to_csv('match_embeddings_splt.csv')

In [88]:
match_df = pd.read_csv('match_embeddings_splt.csv', converters={'nomic_embd': pd.eval, 'embeddings': pd.eval, 'nomic_embd_splt': pd.eval}, index_col='id')
match_df.head()

Unnamed: 0_level_0,embeddings,nomic_embd,text_splitted,nomic_embd_splt
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
93323,"[0.00982937, -0.015813109, 0.013958775, 0.0210...","[0.9866039752960205, -0.4991585314273834, -3.1...",Bournemouth West Ham 2023-08-12 Vitality Stadi...,"[0.5638356804847717, -0.14036604762077332, -3...."
93323,"[0.00982937, -0.015813109, 0.013958775, 0.0210...","[0.9866039752960205, -0.4991585314273834, -3.1...","The Premier League is back for a new season, a...","[-0.10109501332044601, -0.08101896196603775, -..."
93323,"[0.00982937, -0.015813109, 0.013958775, 0.0210...","[0.9866039752960205, -0.4991585314273834, -3.1...","City, meanwhile, claimed a historic treble las...","[-0.2390199452638626, 0.10547727346420288, -3...."
93323,"[0.00982937, -0.015813109, 0.013958775, 0.0210...","[0.9866039752960205, -0.4991585314273834, -3.1...",BURNLEY (4-4-2): James Trafford; Connor Robert...,"[0.172206848859787, -0.6128902435302734, -3.61..."
93323,"[0.00982937, -0.015813109, 0.013958775, 0.0210...","[0.9866039752960205, -0.4991585314273834, -3.1...","SUBS: Josh Brownhill, Jacob Bruun Larsen, Arij...","[-1.0973767042160034, 0.5310721397399902, -3.7..."


In [89]:
faiss_nomic_splt_idx = faiss.IndexFlatL2(len(match_df.nomic_embd_splt.iloc[0]))

In [90]:
emb_vectors = np.array(match_df['nomic_embd_splt'].tolist())
emb_vectors.shape

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (28553,) + inhomogeneous part.

### Hmmm... we hit an issue when trying to convert the embeddings to a matrix. The error suggests that the entries may not all have the same length. Let's check it.

In [91]:
match_df[match_df.nomic_embd_splt.apply(lambda x: len(x) != 768)]

Unnamed: 0_level_0,embeddings,nomic_embd,text_splitted,nomic_embd_splt
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
93427,"[0.017350046, -0.014460054, 0.007997568, 0.016...","[0.7821648716926575, -0.6712018847465515, -3.5...",,[]
93526,"[0.004836533, -0.004325786, 0.014474527, 0.020...","[0.1423361450433731, -0.07177310436964035, -3....",,[]
93574,"[0.017602252, -0.006939734, 0.007896192, 0.018...","[-0.27723392844200134, 0.11305049806833267, -2...",,[]
93623,"[0.0011206537, -0.024544625, 0.009935, 0.01932...","[0.19563862681388855, -0.13900986313819885, -3...",,[]
93628,"[-0.0012025437, 0.03419225, 0.0018559819, -0.0...","[0.8460286855697632, 0.7662405967712402, -3.27...",,[]
93628,"[-0.0012025437, 0.03419225, 0.0018559819, -0.0...","[0.8460286855697632, 0.7662405967712402, -3.27...",,[]
93628,"[-0.0012025437, 0.03419225, 0.0018559819, -0.0...","[0.8460286855697632, 0.7662405967712402, -3.27...",,[]
93628,"[-0.0012025437, 0.03419225, 0.0018559819, -0.0...","[0.8460286855697632, 0.7662405967712402, -3.27...",,[]


In [99]:
match_df[match_df.nomic_embd_splt.apply(lambda x: len(x) != 768)].text_splitted.isna()

id
93427    True
93526    True
93574    True
93623    True
93628    True
93628    True
93628    True
93628    True
Name: text_splitted, dtype: bool

### OK, so some of the splits are empty (they show up as NaN above) and they resulted in empty embeddings. Let's discard them when creating the matrix for the FAISS index.

In [100]:
emb_vectors = np.array(match_df[match_df.text_splitted.notna()].nomic_embd_splt.tolist())
emb_vectors.shape

(28545, 768)
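
One thing to keep in mind: after this filtering, the ids returned by FAISS are positions in the filtered matrix, not in `match_df`. Any result that comes after a dropped row would be shifted when used with `match_df.iloc`. A minimal sketch of one way to keep the mapping is below; `keep_valid_embeddings` and the toy data are hypothetical and only illustrate the idea.

```python
from typing import List, Sequence, Tuple


def keep_valid_embeddings(
    embeddings: Sequence[Sequence[float]], dim: int
) -> Tuple[List[List[float]], List[int]]:
    """Keep vectors of length dim and remember each one's original row position."""
    vectors: List[List[float]] = []
    positions: List[int] = []
    for pos, emb in enumerate(embeddings):
        if len(emb) == dim:
            vectors.append(list(emb))
            positions.append(pos)
    return vectors, positions


# Toy rows: positions 1 and 3 have empty embeddings, like the NaN chunks above.
rows = [[0.1, 0.2], [], [0.3, 0.4], [], [0.5, 0.6]]
vectors, positions = keep_valid_embeddings(rows, dim=2)

# An id returned by the index refers to `vectors`; map it back to the original row.
faiss_id = 1
original_row = positions[faiss_id]
print(positions, original_row)
```

An alternative with the same effect would be to drop the empty rows from `match_df` itself before building the index, so that both share the same positions.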

In [101]:
faiss_nomic_splt_idx.add(emb_vectors)

In [102]:
question = 'In which premier league match Zaroury was sent off after a foul on Walker?'

In [103]:
query_emb = apply_nomic_embeddings(question)
print(len(query_emb))
print(query_emb[:5])

768
[0.786298394203186, -0.3261018395423889, -4.200957775115967, -0.2517128586769104, 1.574384093284607]


In [104]:
query_emb_array = np.array(query_emb).reshape(1, -1)
query_emb_array.shape

(1, 768)

In [105]:
distances, indices = faiss_nomic_splt_idx.search(query_emb_array, k=10)
print(distances)
print(indices)

[[164.65869 216.30637 219.08981 220.096   220.09999 222.4321  227.52008
  227.96379 228.98056 230.73674]]
[[   67 22043  7373  8708  6053 24800  3965 24607    66 13706]]
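
A note on reading these numbers: `IndexFlatL2` does an exact, brute-force search and reports squared Euclidean distances, so smaller means closer. The sketch below reproduces that computation in plain Python on toy vectors; `squared_l2` and `brute_force_search` are hypothetical helpers for illustration, not part of FAISS.

```python
from typing import List, Sequence, Tuple


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def brute_force_search(
    vectors: Sequence[Sequence[float]], query: Sequence[float], k: int
) -> Tuple[List[float], List[int]]:
    """Return the k smallest squared distances and the matching vector ids."""
    scored = sorted((squared_l2(v, query), i) for i, v in enumerate(vectors))
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


# Toy 2-dimensional "chunks"; the real vectors have 768 dimensions.
toy_vectors = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
toy_distances, toy_indices = brute_force_search(toy_vectors, query=[1.0, 2.0], k=2)
print(toy_distances, toy_indices)
```

In the real output above, the gap between the first distance (about 165) and the rest (216 and up) is what makes the top result stand out.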


## SUCCESS! Now the similarity search returned the exact chunk of text in which the event from our question occurs.

In [106]:
match_df.iloc[indices[0][0]].text_splitted

"ZAROURY IS SENT OFF!!! The replays show the substitute's right foot catches Walker high on his left leg, and a VAR review rules the tackle is worthy of a red card. Burnley are down to 10 men!"

In [107]:
match_csv_df.loc[int(match_df.iloc[indices[0][0]].name)]

Home                                                Bournemouth
Away                                                   West Ham
Date                                                 2023-08-12
Stadium                           Vitality Stadium, Bournemouth
Attendance                                                  NaN
Referee                                            Robert Jones
events        Hello and welcome to live coverage of the Prem...
summary       Referee: Peter Bankes. Assistants: Dan Robatha...
text          Bournemouth West Ham 2023-08-12 Vitality Stadi...
Name: 93323, dtype: object