# 1) Loading fiqa-pl corpus

In [2]:
from datasets import load_dataset

ds = load_dataset("clarin-knext/fiqa-pl", "corpus")
ds

DatasetDict({
    corpus: Dataset({
        features: ['_id', 'title', 'text'],
        num_rows: 57638
    })
})

In [3]:
corpus = ds['corpus']
corpus

Dataset({
    features: ['_id', 'title', 'text'],
    num_rows: 57638
})

# 2) Preparing elasticsearch

Let's define the URL of the local Elasticsearch instance. Because SSL is disabled in the Elasticsearch config, we do not need to point to a certificate.

In [4]:
link = 'http://localhost:9200/'

In [5]:
from elasticsearch import Elasticsearch

es = Elasticsearch(link)
es

<Elasticsearch(['http://localhost:9200'])>

Now let's define the index:
- First, we need two analyzers, one with a synonym filter and one without. Both must use the `lowercase` and `morfologik_stem` filters.
- Second, a filter that defines the synonyms for the Polish month `kwiecień`.
- Third, mappings with one field per analyzer.

In [6]:
index_config = {
    "settings": {
        "analysis": {
            "analyzer": {
                "polish_with_synonyms": {
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "synonym_filter",
                        "morfologik_stem"
                    ]
                },
                "polish_without_synonyms": {
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "morfologik_stem"
                    ]
                }
            },
            "filter": {
                "synonym_filter": {
                    "type": "synonym",
                    "synonyms": ["kwiecień, kwi, IV"]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "content_with_synonyms": {"type": "text", "analyzer": "polish_with_synonyms"},
            "content_without_synonyms": {"type": "text", "analyzer": "polish_without_synonyms"}
        }
    }
}
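To make the effect of `synonym_filter` concrete, here is a toy sketch in plain Python. It is not Elasticsearch code: it assumes a simple word-character tokenizer, skips the `morfologik_stem` step entirely, and the helper names `build_synonym_map` and `analyze` are made up for illustration.

```python
import re

SYNONYM_RULE = "kwiecień, kwi, IV"  # same rule as in index_config

def build_synonym_map(rule):
    """Map each lowercased term of the rule to the whole group of equivalents."""
    group = frozenset(term.strip().lower() for term in rule.split(","))
    return {term: group for term in group}

def analyze(text, synonym_map=None):
    """Tokenize and lowercase; optionally expand each token to its synonym group."""
    tokens = re.findall(r"\w+", text.lower())
    if synonym_map is None:
        return [frozenset([token]) for token in tokens]
    return [synonym_map.get(token, frozenset([token])) for token in tokens]

synonym_map = build_synonym_map(SYNONYM_RULE)
with_syn = analyze("Termin: 15 kwi 2020", synonym_map)
without_syn = analyze("Termin: 15 kwi 2020")
print(sorted(with_syn[2]))     # ['iv', 'kwi', 'kwiecień']
print(sorted(without_syn[2]))  # ['kwi']
```

With the synonym map, a text containing only `kwi` also becomes findable under `kwiecień`, so the synonym-enabled field can be expected to match at least as many documents as the plain one.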


Now we need to prepare the documents for indexing. The generator below loops through the corpus and yields one indexing action per text. The `helpers.bulk()` function then sends many actions per request instead of making one API call per document, which makes indexing much faster.

In [7]:
from elasticsearch import helpers

index_name =  "biore_sie_do_roboty"

def generate_docs(ds):
    for doc in ds:
        yield {
            "_index": index_name,
            "_id": doc["_id"],
            "_source": {
                "content_with_synonyms": doc["text"], 
                "content_without_synonyms": doc["text"]
            }
        }
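The batching idea behind bulk indexing can be sketched without a running cluster. The snippet below is an illustration only, not the client's implementation: `chunked` and `toy_actions` are hypothetical helpers, `demo_index` is a made-up index name, and the batch size of 500 is chosen to mirror what I believe is the default `chunk_size` of `helpers.bulk`.

```python
from itertools import islice

def chunked(actions, chunk_size):
    """Yield lists of at most chunk_size actions from any iterable or generator."""
    iterator = iter(actions)
    while True:
        batch = list(islice(iterator, chunk_size))
        if not batch:
            return
        yield batch

def toy_actions(n):
    """Generate n indexing actions shaped like the ones produced by generate_docs."""
    for i in range(n):
        yield {"_index": "demo_index", "_id": str(i), "_source": {"text": f"doc {i}"}}

batches = list(chunked(toy_actions(1200), chunk_size=500))
print([len(batch) for batch in batches])  # [500, 500, 200]
```

Because the actions come from a generator, only one batch has to be held in memory at a time, which is why `generate_docs` yields documents instead of building a full list.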


Let's also delete and recreate the index so that we always start from a fresh one.

In [8]:
es.options(ignore_status=[400, 404]).indices.delete(index=index_name)
es.indices.create(index=index_name, body=index_config)

try:
    success, errors = helpers.bulk(es, generate_docs(corpus))
    
    print(f"Successfully indexed {success} documents")
    if errors:
        print(f"Errors during indexing: {errors}")
        
except Exception as e:
    print(f"Error during bulk indexing: {e}")

Successfully indexed 57638 documents


Now let's define the queries for `kwiecień` and see how many documents in the corpus match it with and without synonyms.

In [9]:
query_with_synonyms = {
    "query": {
        "match": {
            "content_with_synonyms": "kwiecień"
        }
    }
}

query_without_synonyms = {
    "query": {
        "match": {
            "content_without_synonyms": "kwiecień"
        }
    }
}

response = es.search(index=index_name, body=query_with_synonyms)
print("Number of documents matching 'kwiecień' including its synonyms:", response['hits']['total']['value'])
response = es.search(index=index_name, body=query_without_synonyms)
print("Number of documents matching 'kwiecień' without synonyms:", response['hits']['total']['value'])


Number of documents matching 'kwiecień' including its synonyms: 306
Number of documents matching 'kwiecień' without synonyms: 257


# 3) Using fiqa-pl-qrels dataset

### a) Preparing the QA dataset

In [10]:
ds2 = load_dataset("clarin-knext/fiqa-pl-qrels")
ds2

DatasetDict({
    train: Dataset({
        features: ['query-id', 'corpus-id', 'score'],
        num_rows: 14166
    })
    validation: Dataset({
        features: ['query-id', 'corpus-id', 'score'],
        num_rows: 1238
    })
    test: Dataset({
        features: ['query-id', 'corpus-id', 'score'],
        num_rows: 1706
    })
})

In [11]:
ds3 = load_dataset("clarin-knext/fiqa-pl", "queries")
queries = ds3['queries']
queries

Dataset({
    features: ['_id', 'title', 'text'],
    num_rows: 6648
})

In [108]:
corpus[0]

{'_id': '3',
 'title': '',
 'text': 'Nie mówię, że nie podoba mi się też pomysł szkolenia w miejscu pracy, ale nie możesz oczekiwać, że firma to zrobi. Szkolenie pracowników to nie ich praca – oni tworzą oprogramowanie. Być może systemy edukacyjne w Stanach Zjednoczonych (lub ich studenci) powinny trochę martwić się o zdobycie umiejętności rynkowych w zamian za ich ogromne inwestycje w edukację, zamiast wychodzić z tysiącami zadłużonych studentów i narzekać, że nie są do niczego wykwalifikowani.'}

In [115]:
from collections import defaultdict

# Sets give O(1) membership checks in the filtering loops below.
corpus_list = set()
queries_list = set()
query_to_corpus = defaultdict(list)

for query_id, corpus_id in zip(ds2['test']["query-id"], ds2['test']["corpus-id"]):
    query_to_corpus[query_id].append(corpus_id)
    queries_list.add(query_id)
    corpus_list.add(corpus_id)

query_to_corpus = dict(query_to_corpus)
query_to_corpus[8]

[566392, 65404]

In [122]:
corpus_dict = defaultdict(list)

for item in corpus:
  if int(item['_id']) in corpus_list:
    corpus_dict[int(item['_id'])].append(item['text'])

corpus_dict = dict(corpus_dict)
corpus_dict[566392]

['Poproś o ponowne wystawienie czeku właściwemu odbiorcy.']

In [127]:
corpus_dict[65404]

['Po prostu poproś współpracownika o podpisanie odwrotu, a następnie zdeponowanie go. Nazywa się to czekiem strony trzeciej i jest całkowicie legalne. Nie zdziwiłbym się, gdyby czek był dłuższy i, jak zawsze, nie dostaniesz pieniędzy, jeśli czek nie zostanie zrealizowany. Teraz możesz mieć problemy, jeśli jest to duża kwota lub nie jesteś zbyt dobrze znany w banku. W takim przypadku możesz poprosić współpracownika o udanie się do banku i zatwierdzenie go przed kasjerem za pomocą dowodu tożsamości. Technicznie nawet nie musisz tam być. Każdy może wpłacić pieniądze na Twoje konto, jeśli ma numer konta. Mógł też po prostu wpłacić go na swoje konto i wypisać czek na firmę.']

In [126]:
queries_dict = defaultdict(list)

for item in queries:
  if int(item['_id']) in queries_list:
    queries_dict[int(item['_id'])].append(item['text'])

queries_dict = dict(queries_dict)
queries_dict[8]

['Jak zdeponować czek wystawiony na współpracownika w mojej firmie na moje konto firmowe?']

### b) Preparing the lemmatized index

In [78]:
index_config_lamentizer = {
  "settings": {
    "analysis": {
      "analyzer": {
        "polish_with_synonyms_lam": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "synonym_filter",
            "morfologik_stem",
            "lowercase"
          ]
        },
        "polish_without_synonyms_lam": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "morfologik_stem",
            "lowercase"
          ]
        }
      },
      "filter": {
        "synonym_filter": {
          "type": "synonym",
          "synonyms": ["kwiecień, kwi, IV"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "answer_syn": {
        "type": "text",
        "analyzer": "polish_with_synonyms_lam"
      },
      "answer_without_sym": {
        "type": "text",
        "analyzer": "polish_without_synonyms_lam"
      }
    }
  }
}

In [87]:
index_name_lam = index_name + '_lam'

In [128]:
def generate_docs_lam(docs):
    for key, value in docs.items():
        yield {
            "_index": index_name_lam,
            "_id": key,
            "_source": {
                "answer_syn": value,
                "answer_without_sym": value,
            }
        }


In [129]:
es.options(ignore_status=[400, 404]).indices.delete(index=index_name_lam)
es.indices.create(index=index_name_lam, body=index_config_lamentizer)

try:
    success, errors = helpers.bulk(es, generate_docs_lam(corpus_dict))
    
    print(f"Successfully indexed {success} documents")
    if errors:
        print(f"Errors during indexing: {errors}")
        
except Exception as e:
    print(f"Error during bulk indexing: {e}")

Successfully indexed 1706 documents


In [139]:
responses_syn = defaultdict(list)
responses = defaultdict(list)
for key, value  in queries_dict.items():
    query_with_synonyms = {
        "query": {
            "match": {
                "answer_syn": value[0]
            }
        },
        "size": 5
    }

    query_without_synonyms = {
        "query": {
            "match": {
                "answer_without_sym": value[0]
            }
        },
        "size": 5
    }

    response = es.search(index=index_name_lam, body=query_with_synonyms)
    responses_syn[key] = response['hits']['hits']
    response = es.search(index=index_name_lam, body=query_without_synonyms)
    responses[key] = response['hits']['hits']

In [145]:
responses_syn = dict(responses_syn)
responses = dict(responses)
print(queries_dict[4641])
print(responses[4641][0]['_source']['answer_syn'])

['Gdzie powinienem zaparkować mój fundusz na deszczowy dzień / awaryjny?']
['Jak na razie świetne odpowiedzi, więc dodam jeszcze tylko jedną uwagę: płynność. Pieniądze zainwestowane w fundusz inwestycyjny (z wyłączeniem kont emerytalnych z karami za wcześniejszą wypłatę) mają stosunkowo wysoką płynność. Podczas gdy nadwyżka kapitału w twoim domu z powodu wcześniejszej spłaty ma bardzo niską płynność. Mówiąc prościej: jeśli znajdziesz się w rozpaczliwej sytuacji (długotrwałe bezrobocie), lepiej jest spieniężyć fundusz powierniczy, niż próbować szybko sprzedać swój dom i zamieszkać z matką. Płynność staje się mniejszym problemem, jeśli uda Ci się również sfinansować przyzwoity fundusz na deszczowe dni (6-9 miesięcy wydatków na życie).']


In [181]:
from sklearn.metrics import ndcg_score
import numpy as np

def calculate_ndcg(outputs):
    """Mean NDCG@5 over all test queries, based on the hits stored in `outputs`."""
    ndcg5_list = []

    for key in queries_dict:
        hits = outputs[key]
        corpus_ids = [int(hit['_id']) for hit in hits]
        scores = np.asarray([hit['_score'] for hit in hits]).reshape(1, -1)

        # Binary relevance: 1 if the retrieved document is listed for this query in qrels.
        true_relevance = np.asarray(
            [1 if doc_id in query_to_corpus[key] else 0 for doc_id in corpus_ids]
        ).reshape(1, -1)

        # Note: the ideal ranking here is built only from the retrieved hits.
        ndcg5_list.append(ndcg_score(true_relevance, scores, k=5))

    return np.mean(ndcg5_list)
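One thing worth keeping in mind: because `ndcg_score` only sees the retrieved hits, the ideal ranking it normalizes against is the best ordering of those hits, so a relevant document that never made it into the top 5 does not lower the score. A possible alternative, sketched below under the assumption of binary relevance and a linear gain, normalizes against all relevant documents from qrels. The functions `dcg_at_k` and `ndcg_at_k` are illustrative helpers, not part of scikit-learn.

```python
import numpy as np

def dcg_at_k(relevances, k=5):
    """Discounted cumulative gain of the first k relevance values (linear gain)."""
    rel = np.asarray(relevances, dtype=float)[:k]
    if rel.size == 0:
        return 0.0
    discounts = np.log2(np.arange(2, rel.size + 2))  # log2(rank + 1) for ranks 1..k
    return float(np.sum(rel / discounts))

def ndcg_at_k(retrieved_ids, relevant_ids, k=5):
    """NDCG@k where the ideal ranking uses all relevant documents, retrieved or not."""
    relevant = set(relevant_ids)
    gains = [1.0 if doc_id in relevant else 0.0 for doc_id in retrieved_ids[:k]]
    ideal_gains = [1.0] * min(len(relevant), k)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Toy ranking: both relevant documents for query 8 are found, at ranks 1 and 3.
score = ndcg_at_k([566392, 1, 65404], [566392, 65404])
print(round(score, 4))
```

The document ID `1` in the toy ranking is a made-up non-relevant hit; the two relevant IDs are the ones listed for query 8 earlier in the notebook.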

In [186]:
ndcg_lam_synonym = calculate_ndcg(responses_syn)
ndcg_lam_no_synonym = calculate_ndcg(responses)
print(f"NDCG@5 with synonyms: {ndcg_lam_synonym}")
print(f"NDCG@5 without synonyms: {ndcg_lam_no_synonym}")

NDCG@5 with synonyms: 0.516269528610264
NDCG@5 without synonyms: 0.516269528610264


### c) Preparing the index without a lemmatizer

In [225]:
index_config_no_lamentizer = {
  "settings": {
    "analysis": {
      "analyzer": {
        "polish_with_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "synonym_filter",
            "morfologik_stem",
          ]
        },
        "polish_without_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "morfologik_stem"
          ]
        }
      },
      "filter": {
        "synonym_filter": {
          "type": "synonym",
          "synonyms": ["kwiecień, kwi, IV"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "answer_syn": {
        "type": "text",
        "analyzer": "polish_with_synonyms"
      },
      "answer_without_sym": {
        "type": "text",
        "analyzer": "polish_without_synonyms"
      }
    }
  }
}

In [226]:
index_name_no_lam = index_name + '_no_lam_2'
index_name_no_lam

'biore_sie_do_roboty_no_lam_2'

In [227]:
def generate_docs_lam(docs):
    for key, value in docs.items():
        yield {
            "_index": index_name_no_lam,
            "_id": key,
            "_source": {
                "answer_syn": value,
                "answer_without_sym": value,
            }
        }


In [228]:
es.options(ignore_status=[400, 404]).indices.delete(index=index_name_lam)

ObjectApiResponse({'acknowledged': True})

In [229]:
es.options(ignore_status=[400, 404]).indices.delete(index=index_name_no_lam)
es.indices.create(index=index_name_no_lam, body=index_config_no_lamentizer)

try:
    success, errors = helpers.bulk(es, generate_docs_lam(corpus_dict))
    
    print(f"Successfully indexed {success} documents")
    if errors:
        print(f"Errors during indexing: {errors}")
        
except Exception as e:
    print(f"Error during bulk indexing: {e}")

Successfully indexed 1706 documents


In [231]:
responses_syn = defaultdict(list)
responses = defaultdict(list)
for key, value  in queries_dict.items():
    query_with_synonyms = {
        "query": {
            "match": {
                "answer_syn": value[0]
            }
        },
        "size": 5
    }

    query_without_synonyms = {
        "query": {
            "match": {
                "answer_without_sym": value[0]
            }
        },
        "size": 5
    }

    response = es.search(index=index_name_no_lam, body=query_with_synonyms)
    responses_syn[key] = response['hits']['hits']
    response = es.search(index=index_name_no_lam, body=query_without_synonyms)
    responses[key] = response['hits']['hits']

In [233]:
ndcg_no_lam_synonym = calculate_ndcg(responses_syn)
ndcg_no_lam_no_synonym = calculate_ndcg(responses)
print(f"NDCG@5 with synonyms, no lemmatizer: {ndcg_no_lam_synonym}")
print(f"NDCG@5 without synonyms, no lemmatizer: {ndcg_no_lam_no_synonym}")
print(f"NDCG@5 with synonyms, with lemmatizer: {ndcg_lam_synonym}")
print(f"NDCG@5 without synonyms, with lemmatizer: {ndcg_lam_no_synonym}")

NDCG@5 with synonyms, no lemmatizer: 0.516269528610264
NDCG@5 without synonyms, no lemmatizer: 0.516269528610264
NDCG@5 with synonyms, with lemmatizer: 0.516269528610264
NDCG@5 without synonyms, with lemmatizer: 0.516269528610264
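All four scores are identical. One likely reason is visible in the configs: the analyzers in c) still list `morfologik_stem`, so both indexes analyze text the same way, and the synonym rule only concerns `kwiecień`. Assuming the intent of c) was to drop lemmatization, a config without the stemmer could be derived as in the sketch below. The helper `without_filter` and the `example_config` dict are illustrative, and the resulting scores are not shown because this variant was not run here.

```python
import copy

def without_filter(index_config, filter_name="morfologik_stem"):
    """Return a deep copy of index_config with filter_name removed from every analyzer."""
    config = copy.deepcopy(index_config)
    for analyzer in config["settings"]["analysis"]["analyzer"].values():
        analyzer["filter"] = [f for f in analyzer["filter"] if f != filter_name]
    return config

# Trimmed-down copy of the analyzer settings used in the notebook.
example_config = {
    "settings": {
        "analysis": {
            "analyzer": {
                "polish_with_synonyms": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "synonym_filter", "morfologik_stem"],
                },
                "polish_without_synonyms": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "morfologik_stem"],
                },
            },
            "filter": {
                "synonym_filter": {"type": "synonym", "synonyms": ["kwiecień, kwi, IV"]}
            },
        }
    }
}

plain_config = without_filter(example_config)
print(plain_config["settings"]["analysis"]["analyzer"]["polish_with_synonyms"]["filter"])
# ['lowercase', 'synonym_filter']
```

The original config is left untouched, so the lemmatized and non-lemmatized variants can be created side by side from the same base dict.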


## What are the strengths and weaknesses of regular expressions versus full text search regarding processing of text?

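The main advantage of full-text search with Elasticsearch lies in its index. Each document is analyzed and indexed once, so many different queries can then be answered quickly without rescanning the text, and analyzers (lowercasing, lemmatization, synonyms) let a query match inflected or equivalent forms. The index is also its main cost: every added, changed, or deleted document requires an index update, and changing an analyzer requires reindexing.

On the other hand, regular expressions need no setup. A pattern can be run directly on raw text and can express exact character-level constraints, which makes regex a better choice for small amounts of text where the indexing overhead is unnecessary. The weaknesses are that every search scans the whole text, and that inflection or synonyms have to be encoded in the pattern by hand.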
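The trade-off can be illustrated with a small sketch on toy data. The three sentences, the helper names, and the deliberately naive inverted index (lowercasing only, no lemmatization) are assumptions made for this example; a real Elasticsearch index does considerably more.

```python
import re
from collections import defaultdict

# Toy documents, not taken from the indexed corpus as-is.
docs = {
    1: "Czek został wystawiony w kwietniu.",
    2: "Poproś o ponowne wystawienie czeku.",
    3: "Fundusz awaryjny na deszczowy dzień.",
}

def regex_search(pattern, documents):
    """Scan every document on every call; no preprocessing is needed."""
    compiled = re.compile(pattern, flags=re.IGNORECASE)
    return sorted(doc_id for doc_id, text in documents.items() if compiled.search(text))

def build_inverted_index(documents):
    """Tokenize each document once and map every lowercased token to its document IDs."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in re.findall(r"\w+", text.lower()):
            index[token].add(doc_id)
    return index

def index_search(term, index):
    """Look the term up directly instead of rescanning the documents."""
    return sorted(index.get(term.lower(), set()))

inverted = build_inverted_index(docs)
print(regex_search(r"czek\w*", docs))  # [1, 2]
print(index_search("czek", inverted))  # [1]
```

The regex finds the inflected form `czeku` only because the suffix was written into the pattern by hand, while the naive index misses it. Closing that gap at indexing time is the job of a lemmatizing filter such as `morfologik_stem`.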

## Can an LLM be applied in the context of searching for documents? Justify your answer, excluding the obvious observation that an LLM can be used to formulate the answer.

Yes, an LLM can be used when searching for text in documents:
- It can help develop or optimize regex queries.
- It can extract key entities from text, as in NER.
- It can search for keywords, similarly to regex.
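As one possible way to combine these ideas with the index built earlier, keywords proposed by an LLM could be added to the search request as extra `match` clauses. In the sketch below, `suggest_keywords` is a hypothetical stub standing in for a real LLM call (it returns canned keywords), and `build_expanded_query` is an illustrative helper; only the `bool`/`should`/`match` query structure is standard Elasticsearch query DSL.

```python
def suggest_keywords(question):
    """Hypothetical placeholder for an LLM call; returns canned keywords for the demo."""
    canned = {"Jak zdeponować czek?": ["czek", "depozyt", "konto"]}
    return canned.get(question, question.split())

def build_expanded_query(question, field="answer_without_sym", size=5):
    """Build a search body that matches the question or any suggested keyword."""
    clauses = [{"match": {field: question}}]
    clauses += [{"match": {field: keyword}} for keyword in suggest_keywords(question)]
    return {
        "query": {"bool": {"should": clauses, "minimum_should_match": 1}},
        "size": size,
    }

body = build_expanded_query("Jak zdeponować czek?")
print(len(body["query"]["bool"]["should"]))  # 4
```

The resulting `body` has the same shape as the query dicts passed to `es.search` above, so it could be sent the same way once the stub is replaced by an actual model call.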