# Find numbers

in a plain-text corpus.

We are looking for numbers spelled out as words. First, we download the corpora.

* [IlyaGusev/librusec](https://huggingface.co/datasets/IlyaGusev/librusec)
* [IlyaGusev/pikabu](https://huggingface.co/datasets/IlyaGusev/pikabu)

> `pip install datasets zstandard jsonlines pysimdjson` is advised.

The simplest way is to run, e.g., `git clone https://huggingface.co/datasets/IlyaGusev/librusec`.

> Git LFS support has to be enabled for that, though.
> 
> ```
> curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
> sudo apt-get install git-lfs
> git lfs install
> ```

In [None]:
from datasets import load_dataset
from pprint import pprint, pformat
from tqdm.notebook import tqdm
import json

Now we use the most direct approach and just inflect each number word in every possible way.

In [None]:
numbers = [
    'ноль', 'нуль',
    'один', 'два', 'двe', 'три', 'четыре', 'пять', 'шесть', 'семь', 'восемь', 'девять', 'десять',
    'одиннадцать', 'двенадцать', 'тринадцать', 'четырнадцать', 'пятнадцать', 'шестнадцать', 'семнадцать', 'восемнадцать', 'девятнадцать', 'двадцать',
    'тридцать', 'сорок', 'пятьдесят', 'шестьдесят', 'семьдесят', 'восемьдесят', 'девяносто', 'сто',
    'двести', 'триста', 'четыреста', 'пятьсот', 'шестьсот', 'семьсот', 'восемьсот', 'девятьсот',
    'тысяча', 'миллион', 'миллиард', 'триллион',
]

In [None]:
!pip install pymorphy2 pymorphy2-dicts-ru

In [None]:
import pymorphy2
from itertools import chain

morph = pymorphy2.MorphAnalyzer()

def get_lexeme(word):
    # Collect every inflected form from all numeral/noun parses of the word.
    return {
        form.word
        for parsing in morph.parse(word)
        if parsing.tag.POS in ("NUMR", "NOUN")
        for form in parsing.lexeme
    }

get_lexeme("ноль")

This produces some false positives: `семь` inflects to `семью`, which is also a form of `семья`, so we might want to do something about it in the future.
This is an MVP, though, so let it be.

To avoid searching for every inflected form separately, we can find the common leading part of the forms, build the (future) regexp around it, and run a fast substring check (`in`) beforehand.

In [None]:
def get_max_common(words):
    """
    Find a leading part only.

    get_max_common(["мама", "мать", "матриарх"]) -> "ма"
    """
    words = list(words)
    if not words:
        return None
    result = words[0]
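    for word in words[1:]:
        # Count the leading characters shared by `result` and `word`.
        # zip stops at the shorter string, so a shorter word also truncates the result.
        common = 0
        for ch1, ch2 in zip(result, word):
            if ch1 != ch2:
                break
            common += 1
        result = result[:common]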
    return result

get_max_common(get_lexeme("ноль"))
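
As a cross-check, the standard library already has a character-wise common prefix helper, `os.path.commonprefix`, which despite its module works on arbitrary strings. The sketch below wraps it; the name `common_prefix` is hypothetical and not used elsewhere in this notebook.

```python
import os.path


def common_prefix(words):
    """Longest shared leading part of the given words, or None for no words."""
    words = list(words)
    if not words:
        return None
    # commonprefix compares character by character, not path component by component.
    return os.path.commonprefix(words)


print(common_prefix(["мама", "мать", "матриарх"]))
```

It also covers the case where one form is a full prefix of another, e.g. `common_prefix(["мама", "мам"])` gives `"мам"`.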

In [None]:
import re

numbers_data = {}
for number in numbers:
    elem = {
        "word": number,
        "lexeme": get_lexeme(number)
    }
    elem["substr"] = get_max_common(elem["lexeme"])
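    # Alternation of the endings that follow the common prefix, escaped to be safe.
    suffixes = "|".join(re.escape(form[len(elem["substr"]):]) for form in elem["lexeme"])
    elem["regexp"] = re.compile(fr'\b({re.escape(elem["substr"])}(?:{suffixes}))\b')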
    numbers_data[number] = elem
numbers_data["одиннадцать"]
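
To see what such an entry does in practice, here is a minimal sketch with a hand-written set of forms for `ноль` (an assumption for illustration; the notebook derives the real set from pymorphy2). The helper name `find_form` is hypothetical.

```python
import re

# Hand-written forms, for illustration only.
forms = {"ноль", "ноля", "нолю", "нолём", "ноле", "ноли", "нолей", "нолям", "нолями", "нолях"}
prefix = "нол"
suffixes = "|".join(re.escape(form[len(prefix):]) for form in sorted(forms))
pattern = re.compile(rf"\b({re.escape(prefix)}(?:{suffixes}))\b")


def find_form(text):
    # Cheap substring check first; the regexp runs only when the prefix is present.
    if prefix not in text:
        return None
    match = pattern.search(text)
    return match[0] if match else None


print(find_form("Он начал с ноля."))
print(find_form("Это просто нолик."))
```

The trailing `\b` is what rejects `нолик`: the prefix is there, but no listed ending is followed by a word boundary.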

Now let's inspect what we have downloaded so far.

# pikabu

In [None]:
next(iter(load_dataset('/home/jovyan/wdc1/datasets/_WEB20/pikabu', split="train", streaming=True)))

We want to split the texts, as they are too big to fit on the GPU when training an LLM.

We do not want to split into **sentences** for now, as the LLM we are going to train should see more than single sentences.
Combining arbitrary sentences back together is not trivial.

Splitting into paragraphs (like `.split("\n")`) seems to be a good approach.

We do not want Latin letters or digits for now, as we don't know how to normalize them, so we drop any paragraph containing them.
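
To illustrate the filter in isolation, here is a minimal sketch. The helper name `clean_paragraphs` is hypothetical, and skipping empty paragraphs is an extra assumption (they could never contain a match anyway).

```python
import re

# Any Latin letter or any digit disqualifies a paragraph.
regexp_lat_dig = re.compile(r"[a-zA-Z\d]")


def clean_paragraphs(text):
    """Split on newlines and keep only paragraphs free of Latin letters and digits."""
    return [
        paragraph
        for paragraph in text.split("\n")
        if paragraph and not regexp_lat_dig.search(paragraph)
    ]


sample = "Первый абзац\nВторой про iPhone\nТретий, 2020 год\n\nЧетвёртый"
print(clean_paragraphs(sample))
```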

In [None]:
import re


regexp_lat_dig = re.compile(r"[a-zA-Z\d]+")


def get_matches(text):
    texts = text.split("\n")
    result = []
    for text in texts:
        if re.search(regexp_lat_dig, text):
            continue
        matches = []
        for number, elem in numbers_data.items():
            if elem["substr"] and elem["substr"] not in text:
                continue
            if match := re.search(elem["regexp"], text):
                matches.append({"number": number, "place": match.span(), "form": match[0]})
        if matches:
            result.append({
                "text": text,
                "matches": matches
            })
    return result

Now we process the corpus and save the result into a `jsonl` file.

I use multiprocessing as multiprocessing goes brrr.

In [None]:
DATASET_PATH = "/home/jovyan/wdc1/datasets/_WEB20/pikabu"
OUTPUT_PATH = "pikabu.jsonl"

In [None]:
from multiprocessing import Pool, Process, Queue


queue = Queue()


def process_example(example):
    # Runs in a worker process; the module-level queue is inherited on fork (Linux).
    if matches := get_matches(example["text_markdown"]):
        queue.put({
            "index": example["id"],
            "texts": matches
        })


def write(queue):
    # Single writer process: drain the queue until the None sentinel arrives.
    with open(OUTPUT_PATH, "w") as f:
        while True:
            item = queue.get()
            if item is None:
                break
            json.dump(item, f, ensure_ascii=False)
            f.write("\n")


writer = Process(target=write, args=(queue, ))
writer.start()
dataset = load_dataset(DATASET_PATH, split="train", streaming=True)
with Pool(10) as p:
    # imap_unordered streams examples to all workers;
    # Pool.apply would block on every single call.
    for _ in tqdm(p.imap_unordered(process_example, dataset, chunksize=64)):
        pass
    # Let the workers exit cleanly so that their queued items are flushed.
    p.close()
    p.join()
queue.put(None)
writer.join()

# librusec

Paragraphs here are too big, so we split the texts into sentences.

I tried stanza, but it turned out to be too slow.

```
!pip install stanza
import stanza
stanza.download('ru')
nlp = stanza.Pipeline('ru', processors='tokenize')
```

Ended up using spaCy (haha classic).

In [None]:
!pip install -U pip setuptools wheel
!pip install -U spacy
!python -m spacy download ru_core_news_sm

In [None]:
import re
import spacy


nlp_sentencizer = spacy.blank("ru")
nlp_sentencizer.add_pipe("sentencizer")
text = '"В ходе проверочных мероприятий в целях профилактики правонарушений сотрудниками полиции было доставлено для административного разбирательства из центральной части города около 3 тысяч иностранных граждан. Как выяснилось, более 600 мигрантов находятся на территории России с различными нарушениями миграционного законодательства. Все они привлечены к административной ответственности", - отметил собеседник агентства.'
tokens = nlp_sentencizer(text)
[str(sent) for sent in tokens.sents]

Some boilerplate here.
The splitting of text into parts should be factored out, as a function at least.

A nice class would be better, but that depends on whether I repeat this process for some other corpus.
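
One possible shape for that, as a sketch under assumptions: the splitter is passed in as a callable, and the matching loop stays the same. The names `find_numbers` and `split_paragraphs` and the two-entry `patterns` dict are hypothetical stand-ins, not part of this notebook.

```python
import re
from typing import Callable, Iterable

regexp_lat_dig = re.compile(r"[a-zA-Z\d]")

# Toy stand-in for numbers_data: number -> compiled regexp (illustration only).
patterns = {
    "два": re.compile(r"\b(дв(?:а|е|ух|ум|умя))\b"),
    "сто": re.compile(r"\b(ст(?:о|а))\b"),
}


def split_paragraphs(text: str) -> Iterable[str]:
    return text.split("\n")


def find_numbers(text: str, split: Callable[[str], Iterable]) -> list:
    """Same matching loop as get_matches, with the splitter passed in."""
    result = []
    for part in split(text):
        part = str(part)  # spaCy spans and plain strings both end up as str
        if regexp_lat_dig.search(part):
            continue
        matches = [
            {"number": number, "place": match.span(), "form": match[0]}
            for number, regexp in patterns.items()
            if (match := regexp.search(part))
        ]
        if matches:
            result.append({"text": part, "matches": matches})
    return result


print(find_numbers("У меня два кота\nЕму 5 лет\nпрошло сто лет", split_paragraphs))
```

With a sentencizer like the one above, the splitter could be something like `lambda text: nlp_sentencizer(text).sents`, leaving the rest untouched.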

In [None]:
regexp_lat_dig = re.compile(r"[a-zA-Z\d]+")


def get_matches(text):
    nlp_sentencizer.max_length = len(text) + 100
    doc = nlp_sentencizer(text)
    result = []
    for sentence in doc.sents:
        text = str(sentence)
        if re.search(regexp_lat_dig, text):
            continue
        matches = []
        for number, elem in numbers_data.items():
            if elem["substr"] and elem["substr"] not in text:
                continue
            if match := re.search(elem["regexp"], text):
                matches.append({"number": number, "place": match.span(), "form": match[0]})
        if matches:
            result.append({
                "text": text,
                "matches": matches
            })
    return result

In [None]:
DATASET_PATH = "/home/jovyan/wdc1/datasets/_PLAIN/librusec"
OUTPUT_PATH = "librusec.jsonl"

Mostly the same, but boilerplate again, as the key is now `text` rather than `text_markdown`.

Should do some refactoring later.
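
A possible direction for that refactoring, sketched under assumptions: the text key and the matcher become parameters bound with `functools.partial`. Here `toy_get_matches` is a hypothetical stand-in for `get_matches`, and the function returns the record instead of putting it on a queue, which is a design change rather than what the cells in this notebook do.

```python
from functools import partial


def toy_get_matches(text):
    # Hypothetical stand-in for the notebook's get_matches.
    return [{"text": line} for line in text.split("\n") if "два" in line]


def process_example(example, text_key, get_matches):
    """Build one output record, or return None when nothing matched."""
    matches = get_matches(example[text_key])
    if not matches:
        return None
    return {"index": example["id"], "texts": matches}


# One bound function per corpus; only the key differs.
process_pikabu = partial(process_example, text_key="text_markdown", get_matches=toy_get_matches)
process_librusec = partial(process_example, text_key="text", get_matches=toy_get_matches)

print(process_pikabu({"id": 1, "text_markdown": "два кота"}))
```

A `partial` over module-level functions can be pickled, so it should be usable as the worker function of a `multiprocessing.Pool` as well.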

In [None]:
from multiprocessing import Pool, Process, Queue


queue = Queue()


def process_example(example):
    # Runs in a worker process; the module-level queue is inherited on fork (Linux).
    if matches := get_matches(example["text"]):
        queue.put({
            "index": example["id"],
            "texts": matches
        })


def write(queue):
    # Single writer process: drain the queue until the None sentinel arrives.
    with open(OUTPUT_PATH, "w") as f:
        while True:
            item = queue.get()
            if item is None:
                break
            json.dump(item, f, ensure_ascii=False)
            f.write("\n")


writer = Process(target=write, args=(queue, ))
writer.start()
dataset = load_dataset(DATASET_PATH, split="train", streaming=True)
with Pool(10) as p:
    # imap_unordered streams examples to all workers;
    # Pool.apply would block on every single call.
    for _ in tqdm(p.imap_unordered(process_example, dataset, chunksize=64)):
        pass
    # Let the workers exit cleanly so that their queued items are flushed.
    p.close()
    p.join()
queue.put(None)
writer.join()