# Semantic search in Slovenian texts

## Installing the Python libraries

In [1]:
!pip install txtai[all] sentencepiece sacremoses fasttext torch torchvision


## Texts

In [1]:
# Sample data for indexing
data = [
  "Število novorojenih otrok v naši državi se je v tem letu po večletnem zaskrbljujočem nazadovanju končno spet povečalo",
  "Po poplavah v Nemčiji je tudi Slovenija doživela katastrofalno povodenj, ki je zajela tretjino države",
  "Upokojen gradbeni delavec je v loteriji dobil 100.000 evrov",
  "V tem stoletju pričakujemo dvig zračne temperature za 2 stopnji Celzija",
  "V Sudanu je bilo leta 2017 več kot 100.000 smrtnih žrtev",

]

## Semantic indexing

We create vector representations of the texts, which makes semantic search possible.
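
To make the idea of comparing vector representations concrete, here is a minimal sketch in plain Python. It assumes that similarity is measured with the cosine of the angle between two vectors, which is a common choice for sentence embeddings; the three-dimensional vectors and the names `toy_vectors` and `toy_query` are invented for illustration and do not come from any model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings"; real models produce hundreds of dimensions.
toy_vectors = {
    "flood": [0.9, 0.1, 0.0],
    "warming": [0.8, 0.3, 0.1],
    "lottery": [0.0, 0.2, 0.9],
}

# Stands in for an embedded query such as "climate change" (assumed values).
toy_query = [1.0, 0.2, 0.0]

def rank(query_vector, vectors):
    """Return (name, score) pairs sorted from most to least similar."""
    scores = [(name, cosine_similarity(query_vector, v)) for name, v in vectors.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

ranking = rank(toy_query, toy_vectors)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Texts whose vectors point in a similar direction to the query come first in the ranking, regardless of whether they share any words with it.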

In [2]:
from txtai import Embeddings

embeddings = Embeddings(path="cjvt/sloberta-trendi-topics")
embeddings.index(data)

print("Semantic Search Results:")
for query in ["feel good story", "climate change"]:
    uid = embeddings.search(query, 1)[0][0]
    print(f"Query: {query}, Result: {data[uid]}")


Semantic Search Results:
Query: feel good story, Result: V tem stoletju pričakujemo dvig zračne temperature za 2 stopnji Celzija
Query: climate change, Result: Po poplavah v Nemčiji je tudi Slovenija doživela katastrofalno povodenj, ki je zajela tretjino države


Next, we test the same embedding model with Slovenian queries.

In [3]:
from txtai import Embeddings
embeddings = Embeddings(path="cjvt/sloberta-trendi-topics")
embeddings.index(data)

print("Semantic Search Results:")
for query in ["vesel dogodek", "podnebne spremembe"]:
    results = embeddings.search(query, len(data))
    print(f"Query: {query}")
    for uid, score in results:
        # Adjust the threshold value as needed
        if score > 0.72:
            print(data[uid])


Semantic Search Results:
Query: vesel dogodek
Upokojen gradbeni delavec je v loteriji dobil 100.000 evrov
Query: podnebne spremembe
Po poplavah v Nemčiji je tudi Slovenija doživela katastrofalno povodenj, ki je zajela tretjino države
V tem stoletju pričakujemo dvig zračne temperature za 2 stopnji Celzija
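
The cell above keeps every hit whose score exceeds a fixed value. Because a suitable cutoff depends on the model (0.72 here), a relative cutoff can be an alternative. The sketch below shows both variants on invented scores; `filter_by_score`, `filter_near_top`, and `sample_results` are hypothetical names, and the input is assumed to be a list of `(id, score)` pairs like the one iterated over in the cell above.

```python
def filter_by_score(results, threshold):
    """Keep (id, score) pairs whose score exceeds a fixed threshold."""
    return [(uid, score) for uid, score in results if score > threshold]

def filter_near_top(results, margin):
    """Keep pairs whose score lies within `margin` of the best hit."""
    if not results:
        return []
    best = max(score for _, score in results)
    return [(uid, score) for uid, score in results if score >= best - margin]

# Hypothetical search results, sorted by descending score.
sample_results = [(2, 0.81), (0, 0.74), (3, 0.71), (1, 0.55)]

absolute_hits = filter_by_score(sample_results, 0.72)
relative_hits = filter_near_top(sample_results, 0.12)

print("absolute:", absolute_hits)
print("relative:", relative_hits)
```

The relative variant adapts to models whose scores cluster in a narrow band, but it always returns at least the top hit, even when nothing in the corpus is actually relevant.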


### Semantic search in an uploaded text

In the third test we upload a text (e.g. "Zakisljevanje oceanov.txt") and search it for sentences on a given topic (e.g. climate change).

In [31]:
from txtai import Embeddings
from nltk.tokenize import sent_tokenize
import nltk

# Ensure nltk resources are available
nltk.download('punkt')
nltk.download('punkt_tab')

# Filepath to the speech
filepath = "/content/Antrittsrede_von_Donald_Trump_englisch.txt"

# Read the content of the file
with open(filepath, 'r', encoding='utf-8') as file:
    data = file.read()  # Read entire file content

# Sentence tokenization
data = sent_tokenize(data)  # Tokenizes the content into sentences

# Create embeddings instance
embeddings = Embeddings(path="intfloat/multilingual-e5-large")
embeddings.index(data)  # Index the tokenized sentences

print("Semantic Search Results:")
for query in ["gender policy", "illegal immigration", "climate policy"]:
    results = embeddings.search(query, len(data))
    print(f"Query: {query}")
    for uid, score in results:
        # Adjust the threshold value as needed
        if score > 0.795:
            print(data[uid])




Semantic Search Results:
Query: gender policy
As of today, it will henceforth be the official policy of the United States government that there are only two genders, male and female.
under the constitutional rule of law.
This week, I will also end the government policy of trying to socially engineer race and gender into every aspect of public and private life.
Query: illegal immigration
It fails to protect our magnificent, law-abiding American citizens, but provides sanctuary and protection for dangerous criminals, many from prisons and mental institutions, that have illegally entered our country from all over the world.
All illegal entry will immediately be halted and we will begin the process of returning millions and millions of criminal aliens back to the places from which they came.
to collect all tariffs, duties, and revenues.
We have a government that has given unlimited funding to the defense of foreign borders, but refuses to defend American borders or, more importantly, its o
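
The results above include items such as "under the constitutional rule of law." that look like sentence fragments rather than full sentences. Their cause is not visible in the notebook, so the following is only a heuristic sketch: it drops items that begin with a lowercase letter or contain very few words before they are indexed. The function name `drop_fragments` and the `min_words` default are assumptions chosen for illustration.

```python
def drop_fragments(sentences, min_words=4):
    """Discard items that start with a lowercase letter or are very short."""
    kept = []
    for sentence in sentences:
        stripped = sentence.strip()
        if not stripped:
            continue  # skip empty strings
        if stripped[0].islower():
            continue  # likely the tail of a sentence that was split
        if len(stripped.split()) < min_words:
            continue  # too short to carry a searchable statement
        kept.append(stripped)
    return kept

# Two fragments taken from the output above, plus one full sentence from it.
sample = [
    "This week, I will also end the government policy of trying to socially engineer race and gender into every aspect of public and private life.",
    "under the constitutional rule of law.",
    "to collect all tariffs, duties, and revenues.",
]

cleaned = drop_fragments(sample)
print(len(cleaned), "of", len(sample), "items kept")
```

A filter like this would be applied to the list returned by `sent_tokenize` before calling `embeddings.index`. It can also discard legitimate sentences, for example ones that start with a lowercase brand name, so the rules should be checked against the actual corpus.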

### Semantic search across multiple texts

In [38]:
from txtai import Embeddings
import os
from nltk.tokenize import sent_tokenize
import nltk

nltk.download('punkt')
nltk.download('punkt_tab')

directory_path = "/content/"

all_sentences = []

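for filename in sorted(os.listdir(directory_path)):
    file_path = os.path.join(directory_path, filename)
    # Read only plain-text inputs and skip the results file written below,
    # which ends up in the same folder when the working directory is /content/
    if os.path.isfile(file_path) and filename.endswith(".txt") and filename != "output_mixed.txt":
        with open(file_path, 'r', encoding='utf-8') as file:
            file_data = file.read()
            tokenized_sentences = sent_tokenize(file_data)
            all_sentences.extend(tokenized_sentences)

# Remove duplicate sentences while keeping their original order
all_sentences = list(dict.fromkeys(all_sentences))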

embeddings = Embeddings(path="intfloat/multilingual-e5-large")
embeddings.index(all_sentences)

print("Semantic Search Results:")
with open("output_mixed.txt", "w", encoding="utf-8") as f:
    for query in ["gender policy", "illegal immigration", "climate policy"]:
        results = embeddings.search(query, 50)  # Limit number of results
        seen_results = set()  # Track unique results
        f.write(f"Query: {query}\n")
        print(f"Query: {query}")
        for uid, score in results:
            if score > 0.795 and all_sentences[uid] not in seen_results:
                seen_results.add(all_sentences[uid])
                print(all_sentences[uid])
                f.write(all_sentences[uid] + "\n")




Semantic Search Results:
Query: gender policy
As of today, it will henceforth be the official policy of the United States government that there are only two genders, male and female.
under the constitutional rule of law.
This week, I will also end the government policy of trying to socially engineer race and gender into every aspect of public and private life.
Query: illegal immigration
It fails to protect our magnificent, law-abiding American citizens, but provides sanctuary and protection for dangerous criminals, many from prisons and mental institutions, that have illegally entered our country from all over the world.
All illegal entry will immediately be halted and we will begin the process of returning millions and millions of criminal aliens back to the places from which they came.
to collect all tariffs, duties, and revenues.
We have a government that has given unlimited funding to the defense of foreign borders, but refuses to defend American borders or, more importantly, its o
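
When several files are indexed together, it can help to know which file a matching sentence came from. The sketch below shows one way to remove duplicates while remembering the first source file and preserving reading order. It works on an in-memory dictionary so it can run on its own; `collect_unique` and `sample_files` are hypothetical names, and in the notebook the dictionary would be filled from the files read in the loop above.

```python
def collect_unique(sentences_by_file):
    """Map each distinct sentence to the first file it appeared in.

    Python dictionaries preserve insertion order, so iterating over the
    result yields sentences in the order they were first seen.
    """
    first_source = {}
    for filename, sentences in sentences_by_file.items():
        for sentence in sentences:
            first_source.setdefault(sentence, filename)
    return first_source

# Hypothetical file contents after sentence tokenization.
sample_files = {
    "a.txt": ["First sentence.", "Shared sentence."],
    "b.txt": ["Shared sentence.", "Second sentence."],
}

unique = collect_unique(sample_files)
unique_sentences = list(unique)  # this list would be passed to the index

for sentence in unique_sentences:
    print(f"{unique[sentence]}: {sentence}")
```

Since the position of a sentence in `unique_sentences` matches the id returned by the search, the source file of a hit can be looked up with `unique[unique_sentences[uid]]`.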