# 1) Loading the fiqa-pl corpus

In [9]:
from datasets import load_dataset

ds = load_dataset("clarin-knext/fiqa-pl", "corpus")
ds

DatasetDict({
    corpus: Dataset({
        features: ['_id', 'title', 'text'],
        num_rows: 57638
    })
})

In [10]:
corpus = ds['corpus']
corpus

Dataset({
    features: ['_id', 'title', 'text'],
    num_rows: 57638
})

# 2) Preparing Elasticsearch

Let's define the URL of a local Elasticsearch instance. Because I disabled SSL in the Elasticsearch config, we do not need to point to a certificate.

In [11]:
link = 'http://localhost:9200/'

In [12]:
from elasticsearch import Elasticsearch

es = Elasticsearch(link)
es

<Elasticsearch(['http://localhost:9200'])>

Now let's define the index:
- First, we need two analyzers: one with a synonym filter and one without. Both must include the `lowercase` and `morfologik_stem` filters.
- Second, a filter that defines the synonyms for the Polish month `kwiecień`.
- Third, the mappings, with one field per analyzer.

In [13]:
index_config = {
    "settings": {
        "analysis": {
            "analyzer": {
                "polish_with_synonyms": {
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "synonym_filter",
                        "morfologik_stem"
                    ]
                },
                "polish_without_synonyms": {
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "morfologik_stem"
                    ]
                }
            },
            "filter": {
                "synonym_filter": {
                    "type": "synonym",
                    "synonyms": ["kwiecień, kwi, IV"]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "content_with_synonyms": {"type": "text", "analyzer": "polish_with_synonyms"},
            "content_without_synonyms": {"type": "text", "analyzer": "polish_without_synonyms"}
        }
    }
}


Now we need to prepare the documents for indexing. We loop through the corpus and yield one indexing action per text. The `helpers.bulk()` function sends these actions in batches, so many documents are indexed per API call instead of one request per document.

In [14]:
from elasticsearch import helpers

index_name =  "biore_sie_do_roboty"

def generate_docs(ds):
    for doc in ds:
        yield {
            "_index": index_name,
            "_id": doc["_id"],
            "_source": {
                "content_with_synonyms": doc["text"], 
                "content_without_synonyms": doc["text"]
            }
        }


Let's also delete and recreate the index so that we always start from a fresh one.

In [15]:
es.options(ignore_status=[400, 404]).indices.delete(index=index_name)
es.indices.create(index=index_name, body=index_config)

try:
    success, errors = helpers.bulk(es, generate_docs(corpus))
    
    print(f"Successfully indexed {success} documents")
    if errors:
        print(f"Errors during indexing: {errors}")
        
except Exception as e:
    print(f"Error during bulk indexing: {e}")

Successfully indexed 57638 documents


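Now let's define the queries for `kwiecień` and see how many documents in the corpus match it with and without synonyms. Note that the reported value is the number of matching documents (`hits.total.value`), not the number of individual word occurrences.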

In [16]:
query_with_synonyms = {
    "query": {
        "match": {
            "content_with_synonyms": "kwiecień"
        }
    }
}

query_without_synonyms = {
    "query": {
        "match": {
            "content_without_synonyms": "kwiecień"
        }
    }
}

response = es.search(index=index_name, body=query_with_synonyms)
print("The number of occurrences of the word 'kwiecień' including its synonyms:",response['hits']['total']['value'])
response = es.search(index=index_name, body=query_without_synonyms)
print("The number of occurrences of the word 'kwiecień' without synonyms:",response['hits']['total']['value'])


The number of occurrences of the word 'kwiecień' including its synonyms: 306
The number of occurrences of the word 'kwiecień' without synonyms: 257


# 3) Using the fiqa-pl-qrels dataset

### a) Preparing the QA dataset

In [17]:
ds2 = load_dataset("clarin-knext/fiqa-pl-qrels")
ds2

DatasetDict({
    train: Dataset({
        features: ['query-id', 'corpus-id', 'score'],
        num_rows: 14166
    })
    validation: Dataset({
        features: ['query-id', 'corpus-id', 'score'],
        num_rows: 1238
    })
    test: Dataset({
        features: ['query-id', 'corpus-id', 'score'],
        num_rows: 1706
    })
})

In [18]:
ds3 = load_dataset("clarin-knext/fiqa-pl", "queries")
queries = ds3['queries']
queries

Dataset({
    features: ['_id', 'title', 'text'],
    num_rows: 6648
})

In [19]:
corpus[0]

{'_id': '3',
 'title': '',
 'text': 'Nie mówię, że nie podoba mi się też pomysł szkolenia w miejscu pracy, ale nie możesz oczekiwać, że firma to zrobi. Szkolenie pracowników to nie ich praca – oni tworzą oprogramowanie. Być może systemy edukacyjne w Stanach Zjednoczonych (lub ich studenci) powinny trochę martwić się o zdobycie umiejętności rynkowych w zamian za ich ogromne inwestycje w edukację, zamiast wychodzić z tysiącami zadłużonych studentów i narzekać, że nie są do niczego wykwalifikowani.'}

In [51]:
from collections import defaultdict

corpus_list = []
queries_list = []
query_to_corpus = defaultdict(list)

for query_id, corpus_id in zip(ds2['test']["query-id"], ds2['test']["corpus-id"]):
    query_to_corpus[query_id].append(corpus_id)
    queries_list.append(query_id)
    corpus_list.append(corpus_id)

query_to_corpus = dict(query_to_corpus)
# Corpus ids of the answers relevant to query 8
query_to_corpus[8]

[566392, 65404]

In [52]:
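corpus_dict = defaultdict(list)
# A set makes each membership test O(1) instead of scanning the whole list
relevant_corpus_ids = set(corpus_list)

for item in corpus:
  if int(item['_id']) in relevant_corpus_ids:
    corpus_dict[int(item['_id'])].append(item['text'])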

corpus_dict = dict(corpus_dict)
print(corpus_dict[566392])
print(corpus_dict[65404])

['Poproś o ponowne wystawienie czeku właściwemu odbiorcy.']
['Po prostu poproś współpracownika o podpisanie odwrotu, a następnie zdeponowanie go. Nazywa się to czekiem strony trzeciej i jest całkowicie legalne. Nie zdziwiłbym się, gdyby czek był dłuższy i, jak zawsze, nie dostaniesz pieniędzy, jeśli czek nie zostanie zrealizowany. Teraz możesz mieć problemy, jeśli jest to duża kwota lub nie jesteś zbyt dobrze znany w banku. W takim przypadku możesz poprosić współpracownika o udanie się do banku i zatwierdzenie go przed kasjerem za pomocą dowodu tożsamości. Technicznie nawet nie musisz tam być. Każdy może wpłacić pieniądze na Twoje konto, jeśli ma numer konta. Mógł też po prostu wpłacić go na swoje konto i wypisać czek na firmę.']


In [64]:
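queries_dict = defaultdict(list)
# A set makes each membership test O(1) instead of scanning the whole list
test_query_ids = set(queries_list)

for item in queries:
  if int(item['_id']) in test_query_ids:
    queries_dict[int(item['_id'])].append(item['text'])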

queries_dict = dict(queries_dict)
queries_dict[8]

['Jak zdeponować czek wystawiony na współpracownika w mojej firmie na moje konto firmowe?']

### b) Preparing the lemmatization and synonym indexes

In [54]:
index_config_lamentizer = {
  "settings": {
    "analysis": {
      "analyzer": {
        "polish_with_synonyms_with_lam": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "morfologik_stem",
            "lowercase"
          ]
        },
        "polish_with_synonyms_without_lam": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "morfologik_stem",
          ]
        },
        "polish_without_synonyms_with_lam": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "lowercase",
          ]
        },
        "polish_without_synonyms_without_lam": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
          ]
        }
      },
    }
  },
  "mappings": {
    "properties": {
      "answer_with_synonyms_with_lam": {
        "type": "text",
        "analyzer": "polish_with_synonyms_with_lam"
      },
      "answer_with_synonyms_without_lam": {
        "type": "text",
        "analyzer": "polish_with_synonyms_without_lam"
      },
      "answer_without_synonyms_with_lam": {
        "type": "text",
        "analyzer": "polish_without_synonyms_with_lam"
      },
      "answer_without_synonyms_without_lam": {
        "type": "text",
        "analyzer": "polish_without_synonyms_without_lam"
      }
    }
  }
}
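
Note that the configuration above defines no synonym filter, and `morfologik_stem` appears only in the two `with_synonyms` analyzers. A minimal sketch of a configuration whose analyzers match their names could look like the following. It is an assumption, not the configuration that produced the results in this notebook: the names `make_analyzer`, `variants` and `index_config_matching_names` are hypothetical, and the synonym rule is simply reused from section 2.

```python
# Hypothetical variant of index_config_lamentizer; NOT used for the reported results.
# Assumption: the only synonym rule is the `kwiecień` rule from section 2.

def make_analyzer(use_synonyms, use_lemmatizer):
    """Build an analyzer whose filter chain matches the requested options."""
    filters = ["lowercase"]
    if use_synonyms:
        filters.append("synonym_filter")
    if use_lemmatizer:
        filters.append("morfologik_stem")
    return {"tokenizer": "standard", "filter": filters}

# (use_synonyms, use_lemmatizer) for each of the four variants
variants = {
    "with_synonyms_with_lam": (True, True),
    "with_synonyms_without_lam": (True, False),
    "without_synonyms_with_lam": (False, True),
    "without_synonyms_without_lam": (False, False),
}

index_config_matching_names = {
    "settings": {
        "analysis": {
            "analyzer": {
                f"polish_{name}": make_analyzer(*options)
                for name, options in variants.items()
            },
            "filter": {
                "synonym_filter": {
                    "type": "synonym",
                    "synonyms": ["kwiecień, kwi, IV"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            f"answer_{name}": {"type": "text", "analyzer": f"polish_{name}"}
            for name in variants
        }
    },
}
```

With such a configuration the four fields would differ in exactly the two dimensions their names describe, so the effect of synonyms and of lemmatization could be read off separately.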

In [55]:
index_name_lam = index_name + '_lam'

In [121]:
def generate_docs_lam(ds):
    for doc in ds:
        yield {
            "_index": index_name_lam,
            "_id": doc["_id"],
            "_source": {
                "answer_with_synonyms_with_lam": doc["text"],
                "answer_with_synonyms_without_lam": doc["text"],
                "answer_without_synonyms_with_lam": doc['text'],
                "answer_without_synonyms_without_lam": doc['text']
            }
        }


In [122]:
es.options(ignore_status=[400, 404]).indices.delete(index=index_name_lam)
es.indices.create(index=index_name_lam, body=index_config_lamentizer)

try:
    success, errors = helpers.bulk(es, generate_docs_lam(corpus))
    
    print(f"Successfully indexed {success} documents")
    if errors:
        print(f"Errors during indexing: {errors}")
        
except Exception as e:
    print(f"Error during bulk indexing: {e}")

Successfully indexed 57638 documents


In [123]:
results_lam_syn = {}

indexes_types = [
  "answer_with_synonyms_with_lam",
  "answer_with_synonyms_without_lam",
  "answer_without_synonyms_with_lam",
  "answer_without_synonyms_without_lam"
]
for i in indexes_types:
  results_lam_syn[i] = {}

results_lam_syn

{'answer_with_synonyms_with_lam': {},
 'answer_with_synonyms_without_lam': {},
 'answer_without_synonyms_with_lam': {},
 'answer_without_synonyms_without_lam': {}}

In [124]:
for key, value  in queries_dict.items():
    query_with_synonyms_with_lam = {
        "query": {
            "match": {
                "answer_with_synonyms_with_lam": value[0]
            }
        },
        "size": 5
    }
    query_with_synonyms_without_lam = {
        "query": {
            "match": {
                "answer_with_synonyms_without_lam": value[0]
            }
        },
        "size": 5
    }
    query_without_synonyms_with_lam = {
        "query": {
            "match": {
                "answer_without_synonyms_with_lam": value[0]
            }
        },
        "size": 5
    }
    query_without_synonyms_without_lam = {
        "query": {
            "match": {
                "answer_without_synonyms_without_lam": value[0]
            }
        },
        "size": 5
    }

    queries_temp_list = [query_with_synonyms_with_lam,
                         query_with_synonyms_without_lam,
                         query_without_synonyms_with_lam,
                         query_without_synonyms_without_lam]

    for j in range(len(queries_temp_list)):
        response = es.search(index=index_name_lam, body=queries_temp_list[j])
        response = response['hits']['hits']
        temp_list = []
        for i in response:
            temp_list.append(int(i['_id']))
        results_lam_syn[indexes_types[j]][key] = temp_list
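
The four query bodies above differ only in the field name, so the loop could be shortened with a small helper. The following is a sketch, not the code that was run: `build_match_query` and `top_ids` are hypothetical names, and `search_fn` is assumed to be a callable with the same keyword arguments as `es.search`.

```python
def build_match_query(field, text, size=5):
    """Return a match-query body for a single field (hypothetical helper)."""
    return {"query": {"match": {field: text}}, "size": size}


def top_ids(search_fn, index, field, text, size=5):
    """Run the query through `search_fn` and return the hit ids as ints."""
    response = search_fn(index=index, body=build_match_query(field, text, size))
    return [int(hit["_id"]) for hit in response["hits"]["hits"]]


# Possible use inside the loop over queries_dict (assumes the notebook's variables):
# for field in indexes_types:
#     results_lam_syn[field][key] = top_ids(es.search, index_name_lam, field, value[0])
```

Passing the search function as an argument keeps the helper independent of a running Elasticsearch instance, which makes it easy to check with a stub.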


In [131]:
results_lam_syn[indexes_types[0]][4641]

[376148, 253614, 580025, 497993, 32833]

In [126]:
queries_dict[4641]

['Gdzie powinienem zaparkować mój fundusz na deszczowy dzień / awaryjny?']

NDCG is implemented following the definition at https://en.wikipedia.org/wiki/Discounted_cumulative_gain
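
For reference, with $rel_i$ denoting the relevance of the document at rank $i$ (0 or 1 in this notebook), the definitions from that article are:

$$
\mathrm{DCG}_k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i + 1)}, \qquad \mathrm{nDCG}_k = \frac{\mathrm{DCG}_k}{\mathrm{IDCG}_k}
$$

where $\mathrm{IDCG}_k$ is the $\mathrm{DCG}_k$ of the ideal ordering, that is, the ordering with the most relevant documents first.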

In [184]:
import numpy as np

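def calculate_dcg(documents_relevance, k):
  total = 0  # named `total` to avoid shadowing the built-in sum()
  # a query may return fewer than k hits, so never index past the end of the list
  for index in range(min(k, len(documents_relevance))):
    # rank = index + 1 because Python lists start from 0; the discount is log2(rank + 1)
    total += documents_relevance[index] / np.log2(index + 1 + 1)

  return total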

In [190]:
def calculate_ndcg(results, query_to_corpus, k):
  ndcg_list = []
  for key, items in results.items():
    true_relevance = [1 if i in query_to_corpus[key] else 0 for i in items]
    
    dcg = calculate_dcg(true_relevance, k)
    idcg = calculate_dcg(sorted(true_relevance, reverse=True), k)

    ndcg_list.append(0 if dcg == 0 else dcg / idcg)
  return np.mean(ndcg_list)
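
In `calculate_ndcg` the ideal ranking is obtained by sorting the relevance of the retrieved documents only, so relevant documents that were never retrieved do not lower the score. A common alternative builds the ideal ranking from all documents judged relevant for the query. The sketch below shows that variant under the assumption of binary relevance. The names `dcg_at_k` and `ndcg_at_k_full_ideal` are hypothetical, and the scores reported in this notebook were not computed with it.

```python
import math


def dcg_at_k(relevances, k):
    """DCG over the first k relevance values; ranks start at 1."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k_full_ideal(retrieved_ids, relevant_ids, k):
    """NDCG@k where the ideal ranking uses every relevant document (binary relevance)."""
    relevant = set(relevant_ids)
    gains = [1 if doc_id in relevant else 0 for doc_id in retrieved_ids]
    # The ideal ranking puts as many relevant documents as possible in the top k
    ideal_gains = [1] * min(k, len(relevant))
    idcg = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0
```

For example, if a query has two relevant documents and only one of them is retrieved, at rank 1, `calculate_ndcg` scores the query 1.0, while this variant scores it below 1.0 because the second relevant document is missing.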

In [191]:
dcg = calculate_dcg([3,2,3,0,1,2], 6)
idcg = calculate_dcg(sorted([3,2,3,0,1,2,3,2], reverse=True), 6)
print(dcg/idcg)

0.785002371969948


In [196]:
for j in range(len(results_lam_syn)):
  result = calculate_ndcg(results_lam_syn[indexes_types[j]],query_to_corpus, 5)
  print(f"NDCG@5 for {indexes_types[j]}: {result}")

NDCG@5 for answer_with_synonyms_with_lam: 0.2657322972429154
NDCG@5 for answer_with_synonyms_without_lam: 0.2657322972429154
NDCG@5 for answer_without_synonyms_with_lam: 0.20782902393038719
NDCG@5 for answer_without_synonyms_without_lam: 0.20782902393038719


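As we can see, the two fields named `..._with_synonyms_...` reach a clearly higher NDCG@5 (about 0.27) than the two named `..._without_synonyms_...` (about 0.21). Note, however, that `index_config_lamentizer` defines no synonym filter: the `with_synonyms` analyzers are the only ones that apply `morfologik_stem`, while the `with_lam` variants merely repeat `lowercase`. The gap therefore reflects lemmatization rather than synonyms, and the identical scores within each pair are expected, because the paired analyzers are functionally the same.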

## What are the strengths and weaknesses of regular expressions versus full text search regarding processing of text?

The main advantage of full-text search with Elasticsearch lies in its index. Each document is analyzed and indexed once, after which many different queries can be answered efficiently without rescanning the text. This strength is also its main cost: new or changed documents have to be indexed, deletions have to be reflected in the index, and changing an analyzer requires reindexing the data.

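On the other hand, regex shines in its simplicity. It needs no setup and is straightforward to apply, which makes it a better choice for small amounts of text where the indexing overhead is unnecessary. Its weaknesses are that every search scans the whole text, and that it matches character patterns only, with no built-in lemmatization, synonyms, or relevance ranking.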
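
The difference can be illustrated with a toy example. The sketch below is not how Elasticsearch is implemented. It only contrasts scanning every document with a regex against looking a term up in a prebuilt inverted index. The three documents and the names `regex_search` and `build_inverted_index` are made up for illustration.

```python
import re
from collections import defaultdict

# Toy documents, made up for this illustration
docs = {
    1: "Czek wystawiony w kwietniu",
    2: "Konto firmowe i czek",
    3: "Fundusz awaryjny",
}


def regex_search(pattern, documents):
    """Scan every document on every search; no preparation is needed."""
    compiled = re.compile(pattern, re.IGNORECASE)
    return sorted(doc_id for doc_id, text in documents.items() if compiled.search(text))


def build_inverted_index(documents):
    """Tokenize and lowercase each document once; map token -> set of doc ids."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in re.findall(r"\w+", text.lower()):
            index[token].add(doc_id)
    return index


inverted_index = build_inverted_index(docs)  # paid once, up front

print(regex_search(r"\bczek\b", docs))   # scans all documents
print(sorted(inverted_index["czek"]))    # single dictionary lookup
```

Both calls find documents 1 and 2, but the regex version reads all texts on every query, whereas the index version pays the tokenization cost once and must be updated whenever the documents change.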

## Can an LLM be applied in the context of searching for documents? Justify your answer, excluding the obvious observation that an LLM can be used to formulate the answer.

Yes, an LLM can be used for searching for text in documents:
- It can help develop or optimize regex queries.
- It can extract key entities from text, similar to NER.
- It can search for keywords in a text, much like a regex.