Resources

A curated list of resources for processing the Slovak language.

Pages

Corpora, datasets, vocabularies

Web

  • Created for training the mT5 model.
  • Contains a 67 GB Slovak part.
  • Available in TensorFlow format and Hugging Face JSON format.
  • Can be downloaded from allenai/allennlp#5056 using Git LFS.
  • automatic POS (SNK)
  • source: web
  • can be downloaded from Clarin
  • deduplicated
  • source: Common Crawl
  • automatic POS (AUT, TreeTagger)
  • source: web
  • no annotation
  • Twitter part

Question Answering

  • Manually annotated clone of SQuAD 2.0
  • Contains "unanswerable questions"
  • 92k items
  • Machine translation of the SQuAD 2.0 database
  • 140k annotated items
  • Slovak version of the Question to Declarative Sentence (QA2D).
  • Machine-translated using DeepL service.
  • https://arxiv.org/abs/2312.10171
  • 70k questions and answers
  • 5 000 yes-no questions
  • Each example is a triplet of (question, passage, answer), with the title of the page as optional additional context.
  • Machine translated
  • Can also be used for summarization
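The (question, passage, answer) triplet layout described above can be sketched as a small record type. This is only an illustration: the field names, the JSON Lines layout and the sample sentences are assumptions, not taken from the dataset itself.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class YesNoExample:
    """One (question, passage, answer) triplet; field names are hypothetical."""
    question: str
    passage: str
    answer: bool
    title: Optional[str] = None  # optional page title used as extra context


def parse_jsonl(text: str) -> List[YesNoExample]:
    """Parse one JSON object per non-empty line into YesNoExample records."""
    examples = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        examples.append(
            YesNoExample(
                question=record["question"],
                passage=record["passage"],
                answer=bool(record["answer"]),
                title=record.get("title"),  # None when the title is absent
            )
        )
    return examples


# Two made-up records, only to show the shape of the data.
SAMPLE = "\n".join([
    json.dumps({"question": "Je Bratislava hlavné mesto Slovenska?",
                "passage": "Bratislava je hlavné mesto Slovenska.",
                "answer": True, "title": "Bratislava"}),
    json.dumps({"question": "Leží Košice pri mori?",
                "passage": "Košice ležia na východe Slovenska.",
                "answer": False}),
])

examples = parse_jsonl(SAMPLE)
```

Keeping the title optional mirrors the description above, where the page title is additional context rather than a required field.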

Morpho-syntactic

  • tokenization, segmentation, UPOS, XPOS (SNK), UD, lemma
  • manual annotation
  • format: CoNLL-U, PDT tagset
  • source: SNK

Reference:

  • Gajdošová, Katarína; Šimková, Mária; et al., 2016, Slovak Dependency Treebank, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University, http://hdl.handle.net/11234/1-1822.

A conversion of the Slovak Dependency Treebank into the Universal Dependencies tagset.

  • GitHub page
  • tokenization, segmentation, UPOS, XPOS (SNK), UD, lemma
  • manual annotation
  • format: CoNLL-U, UD tagset
  • source: SNK

Reference:

  • Zeman, Daniel. (2017). Slovak Dependency Treebank in Universal Dependencies. Journal of Linguistics/Jazykovedný časopis, 68. doi:10.1515/jazcas-2017-0048.
  • tokenization, segmentation, UPOS, XPOS (SNK), UD, lemma
  • format: CoNLL-U
  • source: Slovak UD, SNK
  • form, lemma, POS (SNK)
  • source: SNK
  • form, lemma, POS
  • form, lemma, POS (Multext East)
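Several of the treebanks above are distributed in the CoNLL-U format, which stores one token per line in ten tab-separated columns, with blank lines between sentences and `#` comment lines. A minimal reader, assuming only the standard library, could look like the sketch below. The sample sentence is made up, and its XPOS and FEATS columns are left as `_` rather than guessing SNK tags.

```python
from typing import Dict, List

# The ten tab-separated columns defined by the CoNLL-U format.
COLUMNS = ["id", "form", "lemma", "upos", "xpos",
           "feats", "head", "deprel", "deps", "misc"]


def parse_conllu(text: str) -> List[List[Dict[str, str]]]:
    """Split CoNLL-U text into sentences, each a list of token dicts."""
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment such as "# text = ..."
        fields = line.split("\t")
        if len(fields) != len(COLUMNS):
            raise ValueError(f"expected 10 columns, got {len(fields)}: {line!r}")
        current.append(dict(zip(COLUMNS, fields)))
    if current:
        sentences.append(current)  # file did not end with a blank line
    return sentences


# Illustrative sentence only; not copied from any of the listed corpora.
SAMPLE = "\n".join([
    "# text = Mám psa.",
    "\t".join(["1", "Mám", "mať", "VERB", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["2", "psa", "pes", "NOUN", "_", "_", "1", "obj", "_", "SpaceAfter=No"]),
    "\t".join(["3", ".", ".", "PUNCT", "_", "_", "1", "punct", "_", "_"]),
    "",
])

sentences = parse_conllu(SAMPLE)
for token in sentences[0]:
    print(token["form"], token["lemma"], token["upos"])
```

The sketch keeps every value as a string, so multiword-token ranges such as `1-2` in the ID column pass through unchanged. For real work a maintained CoNLL-U library is likely the safer choice.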

Parallel

  • Corpus of the Šariš dialect
  • 4.7k examples.
  • authors: Viktória Ondrejová and Marek Šuppa
  • 62 languages, 1,782 bitexts
  • Slovak part contains 100 mil. tokens
  • source: Europarl
  • speech, vectors, language
  • automatic POS (SNK)
  • source: Acquis, Europarl, EU-journal, EC-Europa, OPUS
  • automatic POS (SNK)
  • source: Acquis, Europarl, EU-journal, EC-Europa, OPUS
  • sentence aligned, POS
  • Bulgarian, Czech, English, Estonian, Hungarian, Macedonian, Persian, Polish, Romanian, Serbian, Slovak, Slovenian
  • source: "1984" novel

Reference:

Erjavec, Tomaž; et al., 2010, MULTEXT-East "1984" annotated corpus 4.0, Slovenian language resource repository CLARIN.SI, ISSN 2820-4042, http://hdl.handle.net/11356/1043.

  • Parallel web Corpus with Slovak Part
  • 3.3 mil. sentences, English-Slovak
  • Unsupervised processing of Wikipedia to obtain parallel corpora
  • Used LASER embeddings.
  • 85 different languages, 1620 language pairs, 134M parallel sentences, out of which 34M are aligned with English

Semantic textual similarity

  • Machine translated by OPUS-en-sk model
  • Sentence similarity dataset: each example contains two sentences and a floating-point target between 0 and 5, where a higher number means higher similarity. The dataset contains 5 749 training, 1 500 validation and 1 379 test examples.
  • Referenced from this report by J. Agarský.
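Because the target is a graded score between 0 and 5, a model is usually judged by how well its predicted similarities correlate with the gold scores, not by exact match. The sketch below computes cosine similarity between two vectors and the Pearson correlation between gold and predicted scores. Choosing Pearson as the metric is an assumption here (the list does not name an official metric), and the numbers are invented for illustration.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Invented gold scores (0 to 5) and model similarities for three sentence pairs.
gold = [0.0, 2.5, 5.0]
predicted = [0.1, 0.5, 0.9]
score = pearson(gold, predicted)
```

Correlation is insensitive to scale, so predictions in the 0 to 1 range of a cosine similarity can be compared directly with gold scores in the 0 to 5 range.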

Sentiment

  • Unknown/undocumented source
  • positive/negative
  • source: Twitter
  • 3 categories - positive, negative, neutral
  • The dataset contains a total of 1 588 comments in Slovak from various Facebook pages. The texts are annotated with 5 categories.
  • Machine translated
  • Sentiment analysis dataset, binary classification task: positive sentiment, negative sentiment. It includes reviews from 7 categories with positive, neutral and negative sentiment labels.
  • Source: SlovakBERT auxiliary repository by Matúš Pikuliak, Štefan Grivalský, Martin Konôpka, Miroslav Blšták, Martin Tamajka, Viktor Bachratý, Marián Šimko, Pavol Balážik, Michal Trnka, and Filip Uhlárik, 2021
  • Referenced from this report by J. Agarský.
  • CSFD Movie Reviews
  • 25k items

Fact Checking

9.1k Czech, 2.8k Polish and 12.6k Slovak labeled claims with reasoning: demagog.zip (~16.5 MB)

  • Machine-translated facts with evidence represented as references to Wikipedia pages.
  • 350k items

Instructions

Named Entity Recognition

  • 8.48k sentences
  • Annotated by a large language model
  • PER, ORG, LOC annotations
  • 250 rent agreements, annotated for entities such as tenant, landlord, monthly fee, subject.
  • Unknown origin and license.
  • 10k manually annotated items from Wikipedia

Translated version of the original CoNLL-2003 dataset (translated from English to Slovak via Google Translate).

  • automatically annotated Wikipedia for Named Entities
  • massively multilingual
  • Slovak part has 500k sentences.
  • Reference: Al-Rfou, Rami, et al. "Polyglot-NER: Massive multilingual named entity recognition." Proceedings of the 2015 SIAM International Conference on Data Mining. Society for Industrial and Applied Mathematics, 2015.
  • Manually annotated set
  • Diploma thesis at Comenius University
  • PER, ORG, LOC, MISC annotations
  • approx. 7k sentences.
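Datasets with PER, ORG, LOC and MISC annotations are often stored as one tag per token. The sketch below turns such a tag sequence into entity spans, assuming the common BIO scheme (`B-` opens an entity, `I-` continues it, `O` is outside). The list above does not state which tagging scheme each dataset uses, and the example sentence is made up.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Return (label, start, end) spans, end exclusive, from BIO tags."""
    spans: List[Tuple[str, int, int]] = []
    label: Optional[str] = None
    start = 0
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" flushes the last span
        continues = tag.startswith("I-") and label == tag[2:]
        if label is not None and not continues:
            spans.append((label, start, i))  # close the open span
            label = None
        if tag.startswith("B-") or (tag.startswith("I-") and label is None):
            # A stray I- tag without a matching B- opens a new span.
            label, start = tag[2:], i
    return spans


# Invented example: tokens with hypothetical BIO tags.
tokens = ["Ján", "Novák", "študuje", "na", "Univerzite", "Komenského", "v", "Bratislave"]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "I-ORG", "O", "B-LOC"]

spans = bio_to_spans(tags)
entities = [(label, " ".join(tokens[start:end])) for label, start, end in spans]
```

Treating a stray `I-` tag as the start of a new entity is a lenient design choice. A stricter reader could raise an error instead.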

Spelling

  • corpus of spelling errors created from edits in Wikipedia
  • spelling errors are sorted into 5 categories

Wordnet

Summarization

  • Multilingual Summarization Dataset
  • Slovak part has 1.3k rows.
  • 200k news article summaries
  • Reference: Marek Suppa and Jergus Adamec. 2020. A Summarization Dataset of Slovak News Articles. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 6725–6730, Marseille, France. European Language Resources Association.

Models

General Models

  • Contains Floret Word Vectors.
  • The tagger module uses the Slovak National Corpus tagset.
  • The morphological analyzer uses the Universal Dependencies tagset and is trained on the Slovak Dependency Treebank.
  • The lemmatizer is trained on the Slovak Dependency Treebank.
  • The named entity recognizer is trained separately on the WikiAnn database.

Word embeddings

  • source: Wikipedia, Common Crawl
  • source: Common Crawl
  • source: Wikipedia

Document Embeddings

  • Language-agnostic BERT Sentence Encoder (LaBSE) is a BERT-based model trained for sentence embedding for 109 languages.
  • Language agnostic sentence embeddings.
  • Multilingual document embeddings, based on Sentence Transformers.

Transformers

  • Slovak RoBERTa base language model
  • trained on web corpus
  • Slovak BERT by Ardevop SK

Slovak T5 small, created by fine-tuning mT5 small.

  • VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation
  • Facebook's Wav2Vec2 base model pretrained on the 10K unlabeled subset of the VoxPopuli corpus and fine-tuned on the transcribed data in Slovak (sk)
  • multilingual BERT, trained on Wikipedia

Translation models

  • Bidirectional translation models between Slovak and multiple languages
  • Also available for Hugging Face Transformers
  • Contains SentencePiece tokenization models
  • For MarianNMT
  • English, German, Finnish, French, Swedish
  • Transformer models for machine translation
  • Slovak, English, Finnish, Swedish, Spanish, French
  • Flores101: Large-Scale Multilingual Machine Translation
  • Baseline pretrained models for small and large tracks of WMT 21 Large-Scale Multilingual Machine Translation competition.
  • Includes Slovak language
  • For fairseq

Tools

  • Spelling Dictionary
  • List of common names, abbreviations, pejoratives and neologisms.
  • tokenization, segmentation, UPOS, XPOS (SNK), lemmatization, UD
  • models trained on UD
  • implementation in Python/PyTorch, command-line interface, web service interface
  • license: Apache v2.0
  • tokenization, segmentation, UPOS, XPOS (SNK), lemmatization, UD
  • models trained on UD
  • implementation in Python/dyNET, command-line interface, web service interface
  • license: Apache v2.0
  • tokenization, stemming
  • tokenization, segmentation
  • implementation in C++
  • license: GPL v3.0
  • UPOS, UD
  • models trained on UD
  • implementation in Python/PyTorch, command-line interface
  • license: MIT
  • tokenization, segmentation, UPOS, XPOS (SNK), lemmatization, UD
  • models trained on UD
  • implementation in C++, bindings in Java, Python, Perl, C#, command-line interface, web service interface
  • license: MPL v2.0
  • tokenization, stemming, lemmatization, diacritic restoration, POS (SNK), NER
  • web service interface only
  • license: ?
  • tokenization, segmentation, lemmatization, POS (OpenNLP, SNK), UD (CoreNLP), NER
  • web interface at http://nlp.bednarik.top/
  • Swagger REST API
  • implementation in Java/DL4J
  • source code available
  • license: GNU AGPLv3
  • Web-based Visualisation of Slovak word vectors
  • Lemmatization for 25 languages
  • In Python
  • Slovak trained on UDP corpus
