Materials: https://yarolika00.notion.site/Spot-the-bot-Information-for-Spot-the-bot-b03e2b3e31794caeaa43936d47d88c08

# Komi language

Materials:
* https://komikyv.org/ - source of fiction texts for the corpus
* https://github.com/timarkh/uniparser-grammar-komi-zyrian - morphological analyzer
* https://komi-zyrian.web-corpora.net/ - press corpus; I obtained access to it from the author and may add it later as well
* https://dict.fu-lab.ru/ - online dictionaries

In [1]:
import requests
from bs4 import BeautifulSoup
from tqdm import tqdm
import pandas as pd
import json

In [2]:
def GetTextLinks(page_link):
    main_link = 'https://komikyv.org/'

    text_links = []

    while True:
        soup = BeautifulSoup(session.get(page_link).text, 'html.parser')

        texts_on_page = soup.find_all('td', {'id': 'tht1'})
        for text_on_page in texts_on_page:
            text_link = text_on_page.find('a').attrs['href']
            text_links.append(main_link + text_link)

        next_html = soup.find('li', {'class': 'next'})
        if next_html:
            page_link = main_link + next_html.find('a').attrs['href']
        else:
            break

    return text_links
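`GetTextLinks` concatenates `main_link` (which ends with `/`) and an `href` that starts with `/`, which is why the collected links further down look like `https://komikyv.org//kpv/...`. The site evidently still serves such URLs, but a minimal sketch of a cleaner join could look like this. The helper name `make_absolute` is hypothetical and not part of the notebook.

```python
from urllib.parse import urljoin

MAIN_LINK = 'https://komikyv.org/'

def make_absolute(href, base=MAIN_LINK):
    """Join a site-relative href with the base URL without doubling the slash."""
    return urljoin(base, href)

# href values as they would appear in the page markup (illustrative)
print(make_absolute('/kpv/node/30420'))
print(make_absolute('kpv/library?page=1'))
```

Swapping this in would change the `link` column of every saved file, so it only makes sense before a fresh crawl.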

In [3]:
def GetTexts(text_links):
    text_dicts = []
    for link in tqdm(text_links):
        soup = BeautifulSoup(session.get(link).text, 'html.parser')
        metadata = soup.find('div', {'class': 'bloknot-v'})

        title = soup.find('span', {'class': 'rdf-meta element-hidden'}).attrs['content']

        author = ''
        author_raw = metadata.find('div', {'class': 'field field-name-field-autor field-type-entityreference field-label-above'})
        if author_raw:
            author = author_raw.find('div', {'class': 'field-item even'}).text

        genre = ''
        genre_raw = metadata.find('div', {'class': 'field field-name-field-janr field-type-taxonomy-term-reference field-label-above'})
        if genre_raw:
            genre = genre_raw.find('div', {'class': 'field-item even'}).text

        source = ''
        source_raw = metadata.find('div', {'class': 'field field-name-field-istochnik field-type-taxonomy-term-reference field-label-above'})
        if source_raw:
            source = source_raw.find('div', {'class': 'field-item even'}).text

        place = ''
        place_raw = metadata.find('div', {'class': 'field field-name-field-moydanin field-type-taxonomy-term-reference field-label-above'})
        if place_raw:
            place = place_raw.find('div', {'class': 'field-item even'}).text

        text = soup.find('div', {'class': 'padd'}).get_text()

        text_dict = {'link': link, 'title': title, 'author': author, 'genre': genre, 'place': place, 'source': source, 'text': text}
        text_dicts.append(text_dict)

    return text_dicts

In [4]:
session = requests.session()

## Prose section

In [None]:
prose_first_link = 'https://komikyv.org/kpv/library?field_category_tid_1=Проза&field_perevodl_tid=Нет'
prose_links = GetTextLinks(prose_first_link)
len(prose_links)

1929

In [None]:
with open('proza_links.txt', 'w', encoding='utf-8') as file:
    file.write('\n'.join(prose_links))

In [None]:
prose_dicts = GetTexts(prose_links)

In [None]:
with open('proza_texts.json', 'w', encoding='utf-8') as file:
    json.dump(prose_dicts, file, ensure_ascii=False)

In [161]:
df = pd.read_json(r"komi\proza_texts.json", encoding="UTF-8")
df.head()

Unnamed: 0,link,title,author,genre,place,source,text
0,https://komikyv.org//kpv/content/%D0%B0-%D0%B2...,А вӧрыс ыпъялӧ-сотчӧ,Белых Иван,Висьт,Сыктывкар,Важыс уськӧдчывлӧ вӧтӧн (2005),\n\nА ВӦРЫС ЫПЪЯЛӦ-СОТЧӦ\n\n\t \n\n\tКолӧ жӧ в...
1,https://komikyv.org//kpv/content/%D0%B0-%D0%BA...,А комиясыд медсюсьӧсь,Парма Вань,Висьт,,Винегрет (2011),\n\nА КОМИЯСЫС МЕДСЮСЬӦСЬ\n\nКлуб дорын кино в...
2,https://komikyv.org//kpv/node/30420,А поезд мунӧ...,Афанасьев Евгений,Висьт,,Войвыв кодзув (1999 №7),\n\nА ПОЕЗД МУНӦ...\n\nКольӧм гожӧмӧ ас чужан ...
3,https://komikyv.org//kpv/node/32723,Абу йӧй,Лодыгин Василий,Мойд,,Паса шор (1995),"\n\nАБУ ЙӦЙ\n\nКатшалы, югыд синъяснас потшӧс ..."
4,https://komikyv.org//kpv/contents/abu-moyd-zbyl,"Абу мойд, а збыль",Коданёв Иван,Висьт,,Ягъяслӧн йӧлӧга шы (1984),"\n\nАБУ МОЙД, А ЗБЫЛЬ\n\nКор поезд воис Эжваӧ,..."


In [163]:
# keep only texts longer than 100 words, following the paper by Gromov, ... ....

print(df.shape)
df['words'] = df['text'].apply(lambda x: len(x.split()))
prose_texts = df[df['words'] > 100]
prose_texts.shape

(1929, 8)


(1872, 8)
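The length filter counts whitespace-separated tokens and uses a strict comparison, so a text of exactly 100 words is dropped. A small pure-Python sketch of the same rule, with made-up sample texts and a hypothetical helper name `is_long_enough`:

```python
def is_long_enough(text, min_words=100):
    """True if the text has strictly more than min_words whitespace-separated tokens."""
    return len(text.split()) > min_words

# Toy texts: repeated filler word, not real corpus data
samples = {
    'exactly_100': ' '.join(['кыв'] * 100),
    'just_over': ' '.join(['кыв'] * 101),
    'with_blank_lines': '\n\n' + ' '.join(['кыв'] * 150) + '\n',
}
kept = [name for name, text in samples.items() if is_long_enough(text)]
print(kept)
```

Leading newlines and tabs, which are common in the scraped `text` column, do not affect the count because `str.split()` without arguments collapses all whitespace.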

In [164]:
prose_texts.to_csv("long_prose_texts.csv", index=False, encoding='utf-8')

## Drama section

In [None]:
drama_first_link = 'https://komikyv.org/kpv/library?field_category_tid_1=Драма&field_perevodl_tid=Нет'
drama_links = GetTextLinks(drama_first_link)
len(drama_links)

60

In [None]:
with open('drama_links.txt', 'w', encoding='utf-8') as file:
    file.write('\n'.join(drama_links))

In [None]:
drama_dicts = GetTexts(drama_links)

100%|██████████| 60/60 [00:21<00:00,  2.80it/s]


In [None]:
with open('drama_texts.json', 'w', encoding='utf-8') as file:
    json.dump(drama_dicts, file, ensure_ascii=False)

In [158]:
df = pd.read_json(r"komi\drama_texts.json", encoding="UTF-8")
df.head()

Unnamed: 0,link,title,author,genre,place,source,text
0,https://komikyv.org//kpv/node/31394,Августын,Козлова Елена,Ӧти юкӧна пьеса,,Войвыв кодзув (1988. № 2),\n\nАВГУСТЫН\n\nВорсысьяс:\nКостромин Михаил В...
1,https://komikyv.org//kpv/node/31006,Арт,Нёбдінса Виттор,Ворсантор,,Вабергач (1982),\n\nАРТ\nӦти торъя ворсантор\n\nЙӧз:\n\nВӧрпро...
2,https://komikyv.org//kpv/node/36070,Бурань,Лебедев Михаил,Сьылӧмӧн ворсантор,,Ордым (1929 № 1),\n\nБУРАНЬ\nСьылӧмӧн ворсантор важ коми йӧз ол...
3,https://komikyv.org//kpv/node/39996,Бушкола вояс,Иливапыс,Пьеса,,Бӧрйӧм гижӧдъяс (2009),\n\nБУШКОЛА ВОЯС\n\nДЕЙСТВУЙТӦНЫ:\n\nАндрей Ул...
4,https://komikyv.org//kpv/node/31001,Ва шыр,Нёбдінса Виттор,Теш,,Вабергач (1982),\n\nВА ШЫР\nКык торъя теш\n\nЙӧз:\n\nОТЕЧ ТАРА...


In [159]:
# keep only texts longer than 100 words, following the paper by Gromov, ... ....

print(df.shape)
df['words'] = df['text'].apply(lambda x: len(x.split()))
drama_texts = df[df['words'] > 100]
drama_texts.shape

(60, 7)


(59, 8)

In [160]:
drama_texts.to_csv("long_drama_texts.csv", index=False, encoding='utf-8')

## Folklore section

In [None]:
folklore_first_link = 'https://komikyv.org/kpv/library?field_category_tid_1=Фольклор&field_perevodl_tid=Нет'
folklore_links = GetTextLinks(folklore_first_link)
len(folklore_links)

423

In [None]:
with open('folklore_links.txt', 'w', encoding='utf-8') as file:
    file.write('\n'.join(folklore_links))

In [None]:
folklore_dicts = GetTexts(folklore_links)

100%|██████████| 423/423 [01:11<00:00,  5.92it/s]


In [None]:
with open('folklore_texts.json', 'w', encoding='utf-8') as file:
    json.dump(folklore_dicts, file, ensure_ascii=False)

In [155]:
df = pd.read_json(r"komi\folklore_texts.json", encoding="UTF-8")
df.head()

Unnamed: 0,link,title,author,genre,place,source,text
0,https://komikyv.org//kpv/node/35602,Айлы,,Бӧрдкыв,,Зырянскій край при епископахъ Пермскихъ и Зыря...,"\n\nАЙЛЫ\n\nЮгыд шондіӧй, айӧй!\nВердысь-удысь..."
1,https://komikyv.org//kpv/content/%D0%B0%D0%BB%...,Алӧй ленточка,,Сьыланкыв,,Войвыв кодзув (1993 № 3-4),"\nАЛӦЙ ЛЕНТОЧКА\n\nМенам вӧлі алӧй ленточка,\n..."
2,https://komikyv.org//kpv/node/30219,Аральӧ-пӧльӧ,,Мойд,Лымва,Коми мойдъяс (1991),\n\nАРАЛЬӦ-ПӦЛЬӦ\nВажӧн вӧлі старик гозъя. Нал...
3,https://komikyv.org//kpv/node/30218,Арӧй Дрӧй,,Мойд,Вомын,Коми мойдъяс (1991),\n\nАРӦЙ ДРӦЙ\nОлісны-вылісны крестьянин гозъя...
4,https://komikyv.org//kpv/node/30208,Арсень да чӧртъяс,,Мойд,Вомынбӧж,Коми мойдъяс (1991),\n\nАРСЕНЬ ДА ЧӦРТЪЯС\nВажӧн ӧти сиктын чукӧрт...


In [156]:
# keep only texts longer than 100 words, following the paper by Gromov, ... ....

print(df.shape)
df['words'] = df['text'].apply(lambda x: len(x.split()))
folklore_texts = df[df['words'] > 100]
folklore_texts.shape

(423, 7)


(284, 8)

In [157]:
folklore_texts.to_csv("long_folklore_texts.csv", index=False, encoding='utf-8')

## Poetry section

In [None]:
poetry_first_link = 'https://komikyv.org/kpv/library?field_category_tid_1=Поэзия&field_perevodl_tid=Нет'
poetry_links = GetTextLinks(poetry_first_link)
len(poetry_links)

7673

In [None]:
with open('poetry_links.txt', 'w', encoding='utf-8') as file:
    file.write('\n'.join(poetry_links))

In [None]:
poetry_dicts = GetTexts(poetry_links)

100%|██████████| 7673/7673 [35:17<00:00,  3.62it/s]


In [None]:
with open('poetry_texts.json', 'w', encoding='utf-8') as file:
    json.dump(poetry_dicts, file, ensure_ascii=False)

In [150]:
df = pd.read_json(r"komi\poetry_texts.json", encoding="UTF-8")

In [151]:
df.head()

Unnamed: 0,link,title,author,genre,place,source,text
0,https://komikyv.org//kpv/node/40865,"""А кор ывлаыс ланьтас...""",Касеева Клавдия,Кывбур,,Вын (2022),\n\n* * *\n\nА кор ывлаыс ланьтас\nДа сынӧдас ...
1,https://komikyv.org//kpv/node/40846,"""бабъяс печкӧны...""",Касеева Клавдия,Кывбур,,Вын (2022),\n\n* * *\n\nбабъяс печкӧны\nолӧм вурун.\nстав...
2,https://komikyv.org//kpv/node/40965,"""Босьт, шуӧ, гезсӧ да лок...""",Карманова Ксения,Кывбур,,Вын (2022),"\n\n* * *\n\nБосьт, шуӧ, гезсӧ да лок,\nТэнад ..."
3,https://komikyv.org//kpv/node/40835,"""Быттьӧ кытчӧ он видзӧдлы...""",Касеева Клавдия,Кывбур,,Вын (2022),\n\n* * *\n\nБыттьӧ кытчӧ он видзӧдлы —\nГӧгӧр...
4,https://komikyv.org//kpv/node/40967,"""Войся пемыд енэжын ӧтка кодзув...""",Карманова Ксения,Кывбур,,Вын (2022),\n\n* * *\n\nВойся пемыд енэжын ӧтка кодзув......


In [152]:
# keep only texts longer than 100 words, following the paper by Gromov, ... ....

print(df.shape)
df['words'] = df['text'].apply(lambda x: len(x.split()))
poetry_texts = df[df['words'] > 100]
poetry_texts.shape

(7673, 7)


(1999, 8)

In [153]:
poetry_texts.to_csv("long_poetry_texts.csv", index=False, encoding='utf-8')

## Results

* Prose: 1872
* Drama: 59
* Folklore: 284
* Poetry: 1999

Total: 4214
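A quick arithmetic check of the total, plus the share of each genre in the filtered corpus. The counts are copied from the list above; nothing else is assumed.

```python
counts = {'prose': 1872, 'drama': 59, 'folklore': 284, 'poetry': 1999}
total = sum(counts.values())

# Print each genre with its share of the corpus
for genre, n in counts.items():
    print(f'{genre}: {n} ({n / total:.1%})')
print('total:', total)
```

Prose and poetry dominate, so any per-genre comparison on this corpus will rest on very few drama texts.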

## Merging Komi texts of different genres into a single corpus

In [125]:
import json


with open(r"komi\poetry_texts.json", encoding="UTF-8") as file:
    poetry_texts = json.load(file)

with open(r"komi\drama_texts.json", encoding="UTF-8") as file:
    drama_texts = json.load(file)

with open(r"komi\proza_texts.json", encoding="UTF-8") as file:
    proza_texts = json.load(file)

with open(r"komi\folklore_texts.json", encoding="UTF-8") as file:
    folklore_texts = json.load(file)

In [168]:
import pandas as pd

poetry_texts_df = pd.read_csv(r"komi\long_poetry_texts.csv", encoding="UTF-8")
poetry_texts_df["genre_wider"] = "poetry"

drama_texts_df = pd.read_csv(r"komi\long_drama_texts.csv", encoding="UTF-8")
drama_texts_df["genre_wider"] = "drama"

prose_texts_df = pd.read_csv(r"komi\long_prose_texts.csv", encoding="UTF-8")
prose_texts_df["genre_wider"] = "prose"

folklore_texts_df = pd.read_csv(r"komi\long_folklore_texts.csv", encoding="UTF-8")
folklore_texts_df["genre_wider"] = "folklore"

In [171]:
komi_fiction = pd.concat([poetry_texts_df, drama_texts_df, prose_texts_df, folklore_texts_df])
komi_fiction.head()

Unnamed: 0,link,title,author,genre,place,source,text,words,genre_wider
0,https://komikyv.org//kpv/node/40856,"""Дзурк-дзурк, дзурки-дзурк...""",Касеева Клавдия,Кывбур,,Вын (2022),"\n\n* * *\n\nусьӧмаяслы, видзысьяслы да волысь...",148,poetry
1,https://komikyv.org//kpv/node/40830,"""катша-катша, китш-котш...""",Касеева Клавдия,Кывбур,,Вын (2022),"\n\n* * *\n\nкатша-катша, китш-котш,\nтшӧті-ка...",108,poetry
2,https://komikyv.org//kpv/node/40990,"""Став сьӧлӧмсянь кӧсъя, мед быдӧнлы нимкодя ов...",Карманова Ксения,Кывбур,,Войвыв кодзув (2020. №3),"\n\n* * *\n\nСтав сьӧлӧмсянь кӧсъя, мед быдӧнл...",111,poetry
3,https://komikyv.org//kpv/node/40834,"""тан кодкӧ эм на...""",Касеева Клавдия,Кывбур,,Вын (2022),"\n\n* * *\n\n— тан кодкӧ эм на,\nме кындзи?\nм...",101,poetry
4,https://komikyv.org//kpv/node/40985,"""Тэ ставныскӧд ӧтлаын качӧдчин-лэбин...""",Карманова Ксения,Кывбур,,Вын (2022),\n\n* * *\n\nТэ ставныскӧд ӧтлаын качӧдчин-лэб...,103,poetry


In [173]:
komi_fiction.to_csv(r"komi\komi_fiction.csv", index=False, encoding='utf-8')

## Lemmatization and other morphological analysis of the Komi fiction texts

In [83]:
# !pip3 install uniparser-komi-zyrian

In [42]:
from string import punctuation
punctuation += '«»—'

import pandas as pd
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize
from tqdm import tqdm
tqdm.pandas()

from uniparser_komi_zyrian import KomiZyrianAnalyzer

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ttais\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [195]:
a = KomiZyrianAnalyzer(mode='strict')

In [212]:
# for checking the analysis of specific words
ana = a.analyze_words("Бабаыс идрасис горт гӧгӧр, Отсасис мужиклы ытшкыны, куртны.".split())
ana[1]

[<Wordform object>
 идрасис
 идравны; 3,V,pass,pass_ysj,pst,sg
 идра-с-ис
 STEM-PASS-PST.3SG
 trans_ru	убрать (урожай),
 <Wordform object>
 идрасис
 идравны; 3,V,pass,pass_ysj,pst,sg,tr
 идра-с-ис
 STEM-PASS-PST.3SG
 trans_ru	прибрать; обрядить; убрать (урожай)]

In [97]:
def prepare_komi_text(text, a, punctuation):
    sents_draft = text.split('...')
    sents = []
    for sent_draft in sents_draft:
        sents.extend(sent_tokenize(sent_draft))
    sents_prepared = []
    for sent in sents:
        punct_removed = []
        for word in sent.split():
            word = word.strip(punctuation)
            if word:
                punct_removed.append(word)
        analyses = a.analyze_words(punct_removed)
        pos_dict = {'NUM': '0', 'PRO': '1', 'APRO': '2', 'ADVPRO': '3', 'PN': '4'}
        prepared = []
        for ana in analyses:
            wordform = ana[0].wf
            lemma = ana[0].lemma
            pos = ana[0].gramm.split(",")[0]
            #if the word is a number, we replace it with token for NUM
            if wordform.isdigit():
                prepared.append('0')
            #if the pos of the word is NUM, PRO, APRO, ADVPRO or PN
            #(see gramm tags here https://komi-zyrian.web-corpora.net/)
            #we replace it with the token
            elif pos in pos_dict:
                prepared.append(pos_dict[pos])
            elif lemma != "":
                #we replace word with its lemma
                prepared.append(lemma)
            else:
                #we mark unrecognized wordforms
                prepared.append('_' + wordform)
        sents_prepared.append(" ".join(prepared))
    return "\n".join(sents_prepared)
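In the sample analysis above the tag string is printed as `3,V,pass,...`, so the first comma-separated tag is not always the part of speech. `prepare_komi_text` looks only at the first tag. A hedged sketch of a lookup that scans every tag instead is shown below. Whether the analyzer ever places a person tag before `PRO` is an assumption, the tag string `1,PRO,nom,sg` is made up for illustration, and the helper name `pos_token` is hypothetical.

```python
POS_TOKENS = {'NUM': '0', 'PRO': '1', 'APRO': '2', 'ADVPRO': '3', 'PN': '4'}

def pos_token(gramm, pos_tokens=POS_TOKENS):
    """Return the placeholder for the first tag found in pos_tokens, or None."""
    for tag in gramm.split(','):
        if tag in pos_tokens:
            return pos_tokens[tag]
    return None

print(pos_token('3,V,pass,pass_ysj,pst,sg'))  # tag string from the sample analysis
print(pos_token('1,PRO,nom,sg'))              # made-up tag string, for illustration only
```

If the analyzer always puts the part of speech first for these five classes, the original first-tag check and this scan give the same result.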

In [98]:
komi_fiction_df = pd.read_csv(r"komi\komi_fiction.csv", encoding="UTF-8")

In [99]:
komi_fiction_df['lemmas'] = komi_fiction_df['text'].progress_apply(lambda x: prepare_komi_text(x, a, punctuation))

100%|██████████| 4214/4214 [1:17:26<00:00,  1.10s/it]  


In [100]:
komi_fiction_df.to_csv(r"komi\komi_fiction_prepared.csv", index=False, encoding='utf-8')

In [104]:
komi_fiction_df['unrecognized'] = komi_fiction_df['lemmas'].apply(lambda x: " ".join([word for word in x.split() if "_" in word]))
komi_fiction_df['lemmas_count'] = komi_fiction_df['lemmas'].apply(lambda x: len(x.split()))
komi_fiction_df['unrecognized_count'] = komi_fiction_df['unrecognized'].apply(lambda x: len(x.split()))

In [102]:
# percentage of unrecognized words
round(sum(komi_fiction_df['unrecognized_count']) / sum(komi_fiction_df['lemmas_count']) * 100, 2)

4.64
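The figure above is a pooled ratio: unrecognized tokens are summed over all texts and divided by the total token count, so long texts weigh more than short ones. A pure-Python sketch of the same computation on two toy strings assembled from tokens that appear in this notebook (the function name `unrecognized_share` is hypothetical):

```python
def unrecognized_share(lemmatized_texts):
    """Percentage of tokens containing '_' across all texts, pooled."""
    total = 0
    unrecognized = 0
    for text in lemmatized_texts:
        tokens = text.split()
        total += len(tokens)
        unrecognized += sum('_' in token for token in tokens)
    return round(unrecognized / total * 100, 2)

toy = ['водз на _ШОЙЧЧЫНЫ(Думъяс сьӧкыд', 'кад овны 1 _выліс']
print(unrecognized_share(toy))
```

Averaging the per-text percentages instead would give a different number whenever text lengths differ.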

## Extracting texts for review by a consultant

In [185]:
# print(komi_fiction_df.loc[2043, 'text'])
# print(komi_fiction_df.loc[2043, 'lemmas'])

In [179]:
komi_fiction_df[komi_fiction_df.genre_wider == "drama"].sample(3)

Unnamed: 0,link,title,author,genre,place,source,text,words,genre_wider,lemmas,unrecognized,lemmas_count,unrecognized_count
2031,https://komikyv.org//kpv/node/31007,Моль,Нёбдінса Виттор,Ворсантор,,Вабергач (1982),\n\nМОЛЬ\nӦти торъя ворсантор\n\nЙӧз:\n\nМИКАЙ...,3928,drama,моль 0 торъя ворсантор йӧз Микайлӧ ар 0 коммун...,_18–19 _комсомолеч _54–55 _30–32 _КРИСАН _ЛАБО...,3869,191
2039,https://komikyv.org//kpv/node/35465,Пемыд пармаын (Шыпича),Сук Парма,Либретто,,Тувсов кадын (1941),\n\nПЕМЫД ПАРМАЫН\n(Шыпича)\n\nВит серпаса дра...,7371,drama,пемыд парма шыпича 0 серпас драма ворсны 0\nЯр...,_Шонді-Ныв _Шараф _КЫВВОДЗ _Анук _пӧдсӧн _Анук...,7213,182
2043,https://komikyv.org//kpv/node/39992,Пӧсь сынӧд,Иливапыс,Пьеса,,Бӧрйӧм гижӧдъяс (2009),\n\nПӦСЬ СЫНӦД\n\nВОРСЫСЬЯС:\n\nСавин — трестс...,10840,drama,пӧсь сын ворсны Савин трест представитель\nГуд...,_Красов _Красовлӧн _Кочев _Сокерин _Курыдкашин...,10781,997


## Morphological analysis of unrecognized hyphenated words

In [5]:
komi_fiction_df = pd.read_csv(r"komi\komi_fiction_prepared.csv", encoding="UTF-8")

In [40]:
komi_fiction_df.head(2)

Unnamed: 0,link,title,author,genre,place,source,text,words,genre_wider,lemmas
0,https://komikyv.org//kpv/node/40856,"""Дзурк-дзурк, дзурки-дзурк...""",Касеева Клавдия,Кывбур,,Вын (2022),"\n\n* * *\n\nусьӧмаяслы, видзысьяслы да волысь...",148,poetry,усьны видзысь да воны _Дзурк-дзурк _дзурки-дзу...
1,https://komikyv.org//kpv/node/40830,"""катша-катша, китш-котш...""",Касеева Клавдия,Кывбур,,Вын (2022),"\n\n* * *\n\nкатша-катша, китш-котш,\nтшӧті-ка...",108,poetry,_катша-катша _китш-котш _тшӧті-кадыс _тітш-тот...


In [86]:
def prepare_hyphenated(text, a):
    sents = text.split('\n')
    sents_prepared = []
    pos_dict = {'NUM': '0', 'PRO': '1', 'APRO': '2', 'ADVPRO': '3', 'PN': '4'}
    for sent in sents:
        prepared = []
        for word in sent.split():
            if word[0] == '_' and '-' in word:
                analyses = a.analyze_words(word[1:].split('-'))
                is_unrecognizable = True
                hyphenated_lemmas = []
                for ana in analyses:
                    wordform = ana[0].wf
                    lemma = ana[0].lemma
                    pos = ana[0].gramm.split(",")[0]
                    #if the word is a number, we replace it with token for NUM
                    if wordform.isdigit():
                        hyphenated_lemmas.append('0')
                        is_unrecognizable = False
                    #if the pos of the word is NUM, PRO, APRO, ADVPRO or PN
                    #(see gramm tags here https://komi-zyrian.web-corpora.net/)
                    # we replace it with the token
                    elif pos in pos_dict:
                        hyphenated_lemmas.append(pos_dict[pos])
                        is_unrecognizable = False
                    elif lemma != "":
                        #we replace word with its lemma
                        hyphenated_lemmas.append(lemma)
                        is_unrecognizable = False
                    else:
                        hyphenated_lemmas.append('_' + wordform)
                if is_unrecognizable:
                    #still unrecognized wordforms remain unchanged
                    prepared.append(word)
                else:
                    prepared.extend(hyphenated_lemmas)
            else:
                prepared.append(word)
        sents_prepared.append(" ".join(prepared))
    return "\n".join(sents_prepared)
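The merging rule in `prepare_hyphenated` can be shown without the analyzer. In the sketch below a plain dict stands in for `KomiZyrianAnalyzer`. The dict, its single entry and the function name `resolve_hyphenated` are illustrative assumptions, and the placeholder branches for digits and pronouns are left out for brevity.

```python
def resolve_hyphenated(token, lemma_of):
    """Split an unrecognized hyphenated token and lemmatize the parts that are known.

    lemma_of is a plain dict standing in for the morphological analyzer.
    """
    if not (token.startswith('_') and '-' in token):
        return [token]
    parts = token[1:].split('-')
    resolved = [lemma_of.get(part, '_' + part) for part in parts]
    # If no part was recognized, keep the original token untouched
    if all(item.startswith('_') for item in resolved):
        return [token]
    return resolved

toy_lexicon = {'ОЛІС': 'овны'}  # toy mapping, mirrors one sample row of the data
print(resolve_hyphenated('_ОЛІС-ВЫЛІС', toy_lexicon))
print(resolve_hyphenated('_китш-котш', toy_lexicon))
```

A partly recognized compound is therefore replaced by several tokens, which is why the token count of a text can grow after this step.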

In [None]:
a = KomiZyrianAnalyzer(mode='strict')

In [87]:
komi_fiction_df['lemmas_hyphenated_preprocessed'] = komi_fiction_df['lemmas'].progress_apply(lambda x: prepare_hyphenated(x, a))

100%|██████████| 4214/4214 [05:02<00:00, 13.93it/s] 


In [105]:
komi_fiction_df['unrecognized'] = komi_fiction_df['lemmas_hyphenated_preprocessed'].apply(lambda x: " ".join([word for word in x.split() if "_" in word]))
komi_fiction_df['new_lemmas_count'] = komi_fiction_df['lemmas_hyphenated_preprocessed'].apply(lambda x: len(x.split()))
komi_fiction_df['new_unrecognized_count'] = komi_fiction_df['unrecognized'].apply(lambda x: len(x.split()))

In [107]:
# percentage of unrecognized words
initial_unrecognized_percent = round(sum(komi_fiction_df['unrecognized_count']) / sum(komi_fiction_df['lemmas_count']) * 100, 2)
new_unrecognized_percent = round(sum(komi_fiction_df['new_unrecognized_count']) / sum(komi_fiction_df['new_lemmas_count']) * 100, 2)
print(f"The share of unrecognized words dropped from {initial_unrecognized_percent}% to {new_unrecognized_percent}%.")

The share of unrecognized words dropped from 4.64% to 3.65%.


In [112]:
komi_fiction_df = komi_fiction_df.sample(frac=1).reset_index(drop=True)
komi_fiction_df.head(2)

Unnamed: 0,link,title,author,genre,place,source,text,words,genre_wider,lemmas,lemmas_hyphenated_preprocessed,unrecognized,lemmas_count,unrecognized_count,new_lemmas_count,new_unrecognized_count
0,https://komikyv.org//kpv/contents/olis-vylis-k...,Оліс-выліс Кулӧмдінын,Напалков Виктор,Кывбур,,"О, Енмӧй!.. (2007)",\n\nОЛІС-ВЫЛІС КУЛӦМДІНЫН\n\n\nОг нимсӧ нывкаы...,120,poetry,_ОЛІС-ВЫЛІС Кулӧмдін ог ним нывка индыны\n1 ча...,овны _ВЫЛІС Кулӧмдін ог ним нывка индыны\n1 ча...,_ВЫЛІС _выліс _кодьсӧ _тивкйӧдлӧны,119,4,121,4
1,https://komikyv.org//kpv/node/30620,Водз на шойччыны,Нёбдінса Виттор,Кывбур,,Югыд кодзув (1980),\nВОДЗ НА ШОЙЧЧЫНЫ(Думъяс)\n\nСьӧкыд кад олам ...,145,poetry,водз на _ШОЙЧЧЫНЫ(Думъяс сьӧкыд кад овны 1 мув...,водз на _ШОЙЧЧЫНЫ(Думъяс сьӧкыд кад овны 1 мув...,_ШОЙЧЧЫНЫ(Думъяс,143,1,143,1


In [113]:
komi_fiction_df.to_csv(r"komi\komi_fiction_prepared.csv", index=False, encoding='utf-8')

In [234]:
komi_fiction_df = pd.read_csv(r"komi\komi_fiction_prepared.csv", encoding="UTF-8")

In [235]:
def prepare_unrecognized(text):
    text_prepared = ''
    for sent in str(text).split('\n'):
        sent_prepared = ''
        for word in str(sent).split():
            if word[0] == '_':
                word = word[1:]
            sent_prepared += word
            sent_prepared += ' '
        text_prepared += sent_prepared
        text_prepared += '\n'
    return text_prepared.lower()

In [237]:
# work on a copy so that komi_fiction_df keeps all intermediate columns
komi_fiction_prepared_df = komi_fiction_df.copy()
komi_fiction_prepared_df['preprocessed'] = komi_fiction_prepared_df['lemmas_hyphenated_preprocessed'].apply(prepare_unrecognized)
komi_fiction_prepared_df['preprocessed'] = komi_fiction_prepared_df['preprocessed'].apply(lambda x: x.strip())
komi_fiction_prepared_df['unrecognized'] = komi_fiction_prepared_df['unrecognized'].apply(prepare_unrecognized)
komi_fiction_prepared_df = komi_fiction_prepared_df.drop(columns=['words', 'lemmas', 'lemmas_hyphenated_preprocessed', 'lemmas_count', 'unrecognized_count', 'new_lemmas_count', 'new_unrecognized_count'])
komi_fiction_prepared_df.sample(2)

Unnamed: 0,link,title,author,genre,place,source,text,genre_wider,unrecognized,preprocessed
913,https://komikyv.org//kpv/node/35768,Вӧр гормӧг,Куратова Нина,Повесьт,,Куим небӧгӧ ӧтувтӧм гижӧд чукӧр. T. 1 (2015),\n\nВӦР ГОРМӦГ\n\nА и вывті кӧ нин сьӧкыд лоӧ ...,prose,богу в всё равно витальыс витальыс ванюшыс ван...,вӧрны гормӧг а и вывті кӧ нин сьӧкыд лоны 1 ов...
4188,https://komikyv.org//kpv/node/42817,Сирпи конъяліс,Митерас Ӧльӧш,Висьт,,Войвыв кодзув (1996 №12),\n\nСИРПИ КОНЪЯЛІС\n\n— Пашӧ-ӧ! Па-а-ашӧ!\nӦши...,prose,па ашӧ модялысь модьӧ этайӧс ыхы складникас мо...,сирпи конъявны паш ӧ \nпа а ашӧ \nӧшинь ув вес...


In [238]:
komi_fiction_prepared_df.to_csv(r"komi\literary_komi-zyryan_corpus.csv", index=False, encoding='utf-8')

## Building the Komi fiction corpus as a .txt file

In [239]:
komi_fiction_prepared_df = pd.read_csv(r"komi\literary_komi-zyryan_corpus.csv", encoding="UTF-8")

In [240]:
with open(r'komi\komi_fiction_sent_corpus.txt', 'w', encoding='UTF-8') as file:
    file.write(komi_fiction_prepared_df['preprocessed'].str.replace('\n\n', '\n').str.cat(sep='\n'))
with open(r'komi\komi_fiction_text_corpus.txt', 'w', encoding='UTF-8') as file:
    file.write(komi_fiction_prepared_df['preprocessed'].str.replace('\n\n', ' ').str.replace('\n', ' ').str.cat(sep='\n'))
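The two files differ only in what one line means: a sentence in `komi_fiction_sent_corpus.txt`, a whole text in `komi_fiction_text_corpus.txt`. A pure-Python sketch of the same transformations on two toy texts built from lemmas that occur in this notebook (the function names are hypothetical):

```python
def sentence_corpus(texts):
    """One sentence per line; texts follow one another without a separator."""
    return '\n'.join(text.replace('\n\n', '\n') for text in texts)

def text_corpus(texts):
    """One text per line; sentence breaks inside a text become spaces."""
    return '\n'.join(text.replace('\n\n', ' ').replace('\n', ' ') for text in texts)

# Two toy texts with two sentences each
texts = ['водз на\nсьӧкыд кад', 'пемыд парма\nпӧсь сын']
print(sentence_corpus(texts).count('\n') + 1)  # number of lines in the sentence corpus
print(text_corpus(texts).count('\n') + 1)      # number of lines in the text corpus
```

The later TF-IDF and Word2vec cells read the text-per-line file, which is why they report 4214 documents.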

## SVD on the Komi fiction texts

In [241]:
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse.linalg import svds
import numpy as np

In [242]:
def make_matrix_W_list_of_words(corpus_path, min_df, max_df=None, token_pattern=None, use_idf=True):
  '''
  corpus_path - path to the corpus file, one text per line

  min_df - minimum number (or fraction) of texts in which a word must occur

  max_df - maximum number (or fraction) of texts in which a word may occur;
  if it is None, there is no upper bound

  token_pattern - regular expression that defines a token, e.g. all letters of the language and digits;
  if None, the default TfidfVectorizer pattern is used (tokens of two or more word characters)

  use_idf - bool, whether to apply idf weighting
  '''
  vectorizer_kwargs = {'analyzer': 'word', 'min_df': min_df, 'use_idf': use_idf}
  # max_df was previously accepted but never passed on to the vectorizer
  if max_df is not None:
    vectorizer_kwargs['max_df'] = max_df
  if token_pattern:
    vectorizer_kwargs['token_pattern'] = token_pattern
  vectorizer = TfidfVectorizer(**vectorizer_kwargs)
  with open(corpus_path, 'r', encoding='UTF-8') as corpus_file:
    data_vectorized = vectorizer.fit_transform(corpus_file)
  return data_vectorized, vectorizer.get_feature_names_out()

In [243]:
W, words_list = make_matrix_W_list_of_words(r'komi\komi_fiction_text_corpus.txt', 3)

In [244]:
W.shape

(4214, 22856)

In [245]:
def apply_svd(W, k, output_folder):
  '''
  W - matrix texts x words
  k - the rank of the SVD, must be less than any dimension of W
  '''
  #Apply the SVD function
  u, sigma, vt = svds(W, k)

  #The function does not guarantee that the singular values come in descending order
  #So, we need to sort them by hand
  descending_order_of_inds = np.flip(np.argsort(sigma))
  u = u[:,descending_order_of_inds]
  vt = vt[descending_order_of_inds]
  sigma = sigma[descending_order_of_inds]

  #Checking that sizes are ok
  assert sigma.shape == (k,)
  assert vt.shape == (k, W.shape[1])
  assert u.shape == (W.shape[0], k)

  #Now, we'll save all the matrices in the folder (just in case)
  with open(output_folder + '\\' + str(k) + '_sigma_vt.npy', 'wb') as f:
        np.save(f, np.dot(np.diag(sigma), vt).T)
  with open(output_folder + '\\' +  str(k) + '_sigma.npy', 'wb') as f:
        np.save(f, sigma)
  with open(output_folder + '\\' +  str(k) + '_u.npy', 'wb') as f:
        np.save(f, u)
  with open(output_folder + '\\' +  str(k) + '_vt.npy', 'wb') as f:
        np.save(f, vt)
  return np.dot(np.diag(sigma), vt).T

In [246]:
vv = apply_svd(W, 1000, 'komi')

In [247]:
vv.shape

(22856, 1000)
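The returned matrix has one row per word and `k` columns: row `j` is the `j`-th column of `vt`, scaled by the singular values. A small dense sketch of that layout follows. It uses `np.linalg.svd` on a random toy matrix rather than `scipy.sparse.linalg.svds` on the real TF-IDF matrix, so it only illustrates shapes and ordering.

```python
import numpy as np

rng = np.random.default_rng(0)
W_toy = rng.random((6, 4))   # 6 texts x 4 words, random stand-in for TF-IDF
k = 2

u, sigma, vt = np.linalg.svd(W_toy, full_matrices=False)
# np.linalg.svd already returns singular values in descending order
u, sigma, vt = u[:, :k], sigma[:k], vt[:k]

word_vectors = (np.diag(sigma) @ vt).T   # one row per word, k columns
print(word_vectors.shape)
```

With `svds` the explicit reordering step in `apply_svd` is still needed, because that solver does not promise a descending order.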

In [248]:
def create_dictionary(words_list, vv, output_file):
  dictionary = {}
  for word, vector in zip(words_list, vv):
    dictionary[word] = vector
  np.save(output_file, dictionary)
  return dictionary

In [249]:
dictionary = create_dictionary(words_list, vv, r'komi\svd_dictionary.npy')

In [250]:
dictionary = np.load(r'komi\svd_dictionary.npy', allow_pickle=True)[()]

In [251]:
def convert_text_to_vector(text, dictionary, n, m):
  #split the text into words
  text = text.strip().split()
  #It is the list for all the vectors in text
  text_vectors = []

  #inspect all n-grams in the text
  for i in range(len(text) - n + 1):
      #The list of vectors for words in the n-gram
      gram_vec = []
      #Let's look into every word of the n-gram
      for word in text[i:i+n]:
        #If current word is not in dictionary, we skip this n-gram
        if word not in dictionary:
          gram_vec = []
          break
        vec_ = dictionary[word][:m]
        gram_vec.append(vec_)

      #If the list of vectors is not correct, we skip the n-gram
      if len(gram_vec) != n or len(gram_vec[-1]) != m:
        continue

      text_vectors.append(np.array(gram_vec).flatten())
  return text_vectors

In [252]:
n = 2 #How long is the n-gram
m = 100 #How long is a vector for every word
text = 'усьӧма видзны да воны дзуртны гусь парма чурк важ видзӧдны а 1 чужны челядь 1 нин эз сьывны да бур кыв удж эз ышавны эз гӧтырпу война бӧръя топавны сыв а 1 кольӧма вижавны гижны да 3 кывны тыш усьны тыш а чужны му виччыны да дзик 2 воны корны дитя локны кӧть лым локны дон но кольны 1 3 3 тувччавны бӧръя кокны и 1 оз аддзыны нин 1 кодйыны гу 3 пыдӧстӧм енэж окавны кӧдзавны му да кыа садьмыны а рыт вочасӧн ку 1 кыв кыдз во пыр дзуртны да сёрнитны важ парма чурк 1 кыв но бӧр ог воны но 2 воны май тӧлысь ойдыны юны и лэдзавны кор нэриник бадь пуны а 3 кӧ чужны му выв 1 ковны на ковны и дзик 2 воны бӧръя лым 1 парма воны'
text_vector = convert_text_to_vector(text, dictionary, n, m)

In [253]:
len(text_vector)

87
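The result is shorter than the number of bigrams in the text because any n-gram with an out-of-vocabulary word is skipped. With the default `token_pattern`, `TfidfVectorizer` keeps only tokens of two or more characters, so single-character placeholders such as `1` or `3` never enter the dictionary. Each surviving n-gram becomes one vector of length `n * m`. A pure-Python sketch of the counting logic, with a toy vocabulary and a hypothetical function name `count_usable_ngrams`:

```python
def count_usable_ngrams(tokens, vocabulary, n):
    """Count the n-grams whose words are all present in the vocabulary."""
    usable = 0
    for i in range(len(tokens) - n + 1):
        if all(word in vocabulary for word in tokens[i:i + n]):
            usable += 1
    return usable

tokens = 'важ парма 1 кыв но бӧр'.split()
vocabulary = {'важ', 'парма', 'кыв', 'но', 'бӧр'}  # toy vocabulary; '1' is missing
print(len(tokens) - 2 + 1)                         # all bigrams
print(count_usable_ngrams(tokens, vocabulary, 2))  # bigrams that survive
```

One missing word removes up to `n` n-grams, so placeholder-heavy texts lose a noticeable part of their vectors.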

## Word2vec on the Komi fiction texts

In [254]:
import io
from gensim.models import Word2Vec

In [255]:
def load_corpus(fname):
    # each line of the corpus file is one text; split it into tokens
    with io.open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore') as fin:
        return [line.split() for line in fin]

In [256]:
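def save_dictionary(fname, dictionary, args):
    # text format: header "<vocabulary size> <dimension>", then one word and its vector per line
    length, dimension = args
    # the context manager closes the file, so everything is flushed before it is read back
    with io.open(fname, 'w', encoding='utf-8') as fout:
        fout.write('%d %d\n' % (length, dimension))
        for word in dictionary:
            fout.write('%s %s\n' % (word, ' '.join(map(str, dictionary[word]))))

def load_dictionary(fname):
    with io.open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore') as fin:
        length, dimension = map(int, fin.readline().split())
        dictionary = {}
        for line in fin:
            tokens = line.rstrip().split(' ')
            # materialize the vector; a bare map() is a one-shot iterator in Python 3
            dictionary[tokens[0]] = list(map(float, tokens[1:]))
    return dictionary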

In [257]:
documents = load_corpus(r'komi\komi_fiction_text_corpus.txt')
len(documents)

4214

In [258]:
%%time

dimension = 100
model = Word2Vec(sentences=documents, vector_size=dimension, min_count=3)

CPU times: total: 16.4 s
Wall time: 16.8 s


In [259]:
dictionary = {key : model.wv[key] for key in model.wv.key_to_index}
len(dictionary)

28799

In [260]:
model.wv.most_similar('гӧтырпу')

[('ныланӧй', 0.7535287141799927),
 ('жӧник', 0.7276376485824585),
 ('тиюк', 0.7068349123001099),
 ('рочакань', 0.7066267728805542),
 ('невеста', 0.70535808801651),
 ('унінь', 0.7049809098243713),
 ('зараньӧс', 0.6998177766799927),
 ('сарича', 0.6969175934791565),
 ('парша', 0.692864179611206),
 ('вежай', 0.6888182163238525)]
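The scores returned by `most_similar` are cosine similarities between word vectors. A minimal pure-Python sketch of that measure on toy two-dimensional vectors (not actual model vectors):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # identical direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
print(cosine_similarity([3.0, 4.0], [4.0, 3.0]))  # somewhere in between
```

Scores around 0.7, as in the list above, therefore indicate vectors pointing in broadly the same direction, not near-duplicates.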

In [261]:
save_dictionary(r'komi\cbow_dictionary.txt', dictionary, (len(dictionary), dimension))

In [262]:
dictionary = load_dictionary(r'komi\cbow_dictionary.txt')
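A self-contained round trip through the same text format (header line with vocabulary size and dimension, then one word and its vector per line) on a toy dictionary. The function names `save_vectors` and `load_vectors`, the file name and the toy vectors are illustrative assumptions.

```python
import os
import tempfile

def save_vectors(fname, vectors, dimension):
    """Write a header line, then one word and its vector per line."""
    with open(fname, 'w', encoding='utf-8') as fout:
        fout.write('%d %d\n' % (len(vectors), dimension))
        for word, vector in vectors.items():
            fout.write('%s %s\n' % (word, ' '.join(map(str, vector))))

def load_vectors(fname):
    """Read the format written by save_vectors back into a dict of lists."""
    with open(fname, 'r', encoding='utf-8') as fin:
        length, dimension = map(int, fin.readline().split())
        vectors = {}
        for line in fin:
            tokens = line.rstrip().split(' ')
            vectors[tokens[0]] = [float(value) for value in tokens[1:]]
    assert len(vectors) == length
    return vectors

toy = {'парма': [0.5, -1.25], 'кыв': [2.0, 0.0]}
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'toy_dictionary.txt')
    save_vectors(path, toy, 2)
    restored = load_vectors(path)
print(restored == toy)
```

Checking the loaded vocabulary size against the header is a cheap way to notice a truncated file.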

## Press corpus
in progress...