A curated list of resources for processing the Slovak language.
- Slovak resources by Essential Data
- Slovak speech and language processing at KEMT FEI TUKE with tools, demos and language resources.
- Slovak National Corpus
- Created for training the mT5 model.
- Contains a 67 GB Slovak part.
- Available in TensorFlow format and HuggingFace JSON format.
- Can be downloaded from allenai/allennlp#5056 using git LFS.
- automatic POS (SNK)
- source: web
- can be downloaded from Clarin
- deduplicated
- source: Common Crawl
- automatic POS (AUT, TreeTagger)
- source: web
- no annotation
- Twitter part
- Manually annotated clone of SQuAD 2.0
- Contains "unanswerable questions"
- 92k items
- Machine translation of the SQuAD 2.0 database
- 140k annotated items
- Slovak version of the Question to Declarative Sentence (QA2D).
- Machine-translated using DeepL service.
- https://arxiv.org/abs/2312.10171
- 70k questions and answers
- 5 000 yes-no questions
- Each example is a triplet of (question, passage, answer), with the title of the page as optional additional context.
- Machine translated
- Can also be used for summarization
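The (question, passage, answer) triplet with an optional page title, as described above for the yes-no questions, could be represented roughly as follows. This is a minimal sketch: the class name `YesNoExample`, the helper `to_model_input`, the separator string and the sample sentence are all hypothetical and not taken from the dataset itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class YesNoExample:
    """One yes/no QA item: question, supporting passage, boolean answer."""
    question: str
    passage: str
    answer: bool
    title: Optional[str] = None  # page title, optional additional context


def to_model_input(example: YesNoExample, sep: str = " [SEP] ") -> str:
    """Join the text fields into one string for a classifier.

    The separator is an illustrative assumption, not prescribed by the data.
    """
    parts = [example.question]
    if example.title:
        parts.append(example.title)
    parts.append(example.passage)
    return sep.join(parts)


# Made-up example, only to show the shape of one item.
example = YesNoExample(
    question="Je Bratislava hlavné mesto Slovenska?",
    passage="Bratislava je hlavné mesto Slovenska.",
    answer=True,
    title="Bratislava",
)
print(to_model_input(example))
```

Keeping the title optional mirrors the description above: items without a title simply omit that segment from the joined input.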
- tokenization, segmentation, UPOS, XPOS (SNK), UD, lemma
- manual annotation
- format: conllu, PDT tagset
- source: SNK
Reference:
- Gajdošová, Katarína; Šimková, Mária; et al., 2016, Slovak Dependency Treebank, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University, http://hdl.handle.net/11234/1-1822.
A conversion of the Slovak Dependency Treebank into the Universal Dependencies tagset.
- GitHub page
- tokenization, segmentation, UPOS, XPOS (SNK), UD, lemma
- manual annotation
- format: conllu, UD tagset
- source: SNK
Reference:
- Zeman, Daniel. (2017). Slovak Dependency Treebank in Universal Dependencies. Journal of Linguistics/Jazykovedný časopis, 68. doi:10.1515/jazcas-2017-0048.
- tokenization, segmentation, UPOS, XPOS (SNK), UD, lemma
- format: conllu
- source: Slovak UD, SNK
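Several of the treebanks above are distributed in the CoNLL-U format, where each token is one line of ten tab-separated columns and sentences are separated by blank lines. A minimal reader might look like the sketch below. The function name `read_conllu` and the sample sentence are illustrative assumptions; the XPOS column is left as `_` here rather than guessing SNK tags.

```python
from typing import Dict, Iterator, List

# The ten tab-separated columns defined by the CoNLL-U format.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_conllu(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield sentences as lists of token dicts from CoNLL-U text.

    Comment lines (starting with '#') are skipped. Multiword-token ranges
    and empty nodes are kept as ordinary rows, without special handling.
    """
    sentence: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns, got {len(cols)}")
        sentence.append(dict(zip(FIELDS, cols)))
    if sentence:
        yield sentence


# Made-up sample sentence, built with explicit tabs.
SAMPLE = "\n".join([
    "# text = Mám knihu.",
    "\t".join(["1", "Mám", "mať", "VERB", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["2", "knihu", "kniha", "NOUN", "_", "_", "1", "obj", "_", "SpaceAfter=No"]),
    "\t".join(["3", ".", ".", "PUNCT", "_", "_", "1", "punct", "_", "_"]),
    "",
])

for sent in read_conllu(SAMPLE):
    print([(tok["form"], tok["lemma"], tok["upos"]) for tok in sent])
```

For real work, a dedicated CoNLL-U library is likely a better choice; the sketch only shows how the columns map to the annotation layers listed above.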
- form, lemma, POS (SNK)
- source: SNK
- form, lemma, POS
- form, lemma, POS (Multext East)
- Corpus of the Šariš dialect
- 4.7k examples.
- authors: Viktória Ondrejová and Marek Šuppa
- 62 languages, 1,782 bitexts
- Slovak part contains 100 mil. tokens
- source: Europarl
- speech, vectors, language
- automatic POS (SNK)
- source: Acquis, Europarl, EU-journal, EC-Europa, OPUS
- automatic POS (SNK)
- source: Acquis, Europarl, EU-journal, EC-Europa, OPUS
- sentence aligned, POS
- Bulgarian, Czech, English, Estonian, Hungarian, Macedonian, Persian, Polish, Romanian, Serbian, Slovak, Slovenian
- source: "1984" novel
Reference:
Erjavec, Tomaž; et al., 2010, MULTEXT-East "1984" annotated corpus 4.0, Slovenian language resource repository CLARIN.SI, ISSN 2820-4042, http://hdl.handle.net/11356/1043.
- Parallel web Corpus with Slovak Part
- 3.3 mil sentences English-Slovak
- Unsupervised processing of Wikipedia to obtain parallel corpora
- Used LASER embeddings.
- 85 different languages, 1620 language pairs, 134M parallel sentences, out of which 34M are aligned with English
- Machine translated by OPUS-en-sk model
- Sentence similarity dataset. Each item contains two sentences with a floating-point target between 0 and 5, where a higher number means higher similarity. The dataset contains train: 5 749, validation: 1 500 and test: 1 379 examples.
- Referenced from this report by J. Agarský.
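Targets on a 0 to 5 scale, as in the sentence similarity dataset above, are often rescaled to [0, 1] and compared with model similarities using a correlation coefficient. The sketch below shows one way to do that with Pearson correlation. The helper names `rescale` and `pearson` and the toy numbers are assumptions for illustration, not part of the dataset or of any evaluation script.

```python
from math import sqrt
from typing import Sequence


def rescale(score: float, max_score: float = 5.0) -> float:
    """Map a gold similarity score from [0, max_score] to [0, 1]."""
    return score / max_score


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Toy values only: gold scores on the 0-5 scale and hypothetical model outputs.
gold = [0.0, 2.5, 5.0]
predicted = [0.1, 0.5, 0.9]

print(pearson([rescale(g) for g in gold], predicted))
```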
- Unknown/undocumented source
- positive/negative
- source: Twitter
- 3 categories - positive, negative, neutral
- The dataset contains a total of 1 588 comments in Slovak from various Facebook pages. The texts are annotated with 5 categories.
- Machine translated
- Sentiment analysis dataset, binary classification task: positive sentiment, negative sentiment. It includes reviews from 7 categories with positive, neutral and negative sentiment labels.
- Source: SlovakBERT auxiliary repository by Matúš Pikuliak, Štefan Grivalský, Martin Konôpka, Miroslav Blšták, Martin Tamajka, Viktor Bachratý, Marián Šimko, Pavol Balážik, Michal Trnka, and Filip Uhlárik, 2021
- Referenced from this report by J. Agarský.
- CSFD Movie Reviews
- 25k items
9.1k Czech, 2.8k Polish and 12.6k Slovak labeled claims with reasoning: demagog.zip (~16.5 MB)
- Machine-translated facts with evidence represented as references to Wikipedia pages.
- 350k items
- Machine translation of the Stanford Alpaca
- 40k annotations
- 8.48k sentences
- Annotated by a large language model
- PER, ORG, LOC annotations
- 250 rent agreements, annotated for entities such as tenant, landlord, monthly fee, subject.
- Unknown origin and license.
- 10k manually annotated items from Wikipedia
Translated version of the original CoNLL-2003 dataset (translated from English to Slovak via Google Translate).
- automatically annotated Wikipedia for Named Entities
- massively multilingual
- Slovak part has 500k sentences.
- Reference: Al-Rfou, Rami, et al. "Polyglot-NER: Massive multilingual named entity recognition." Proceedings of the 2015 SIAM International Conference on Data Mining. Society for Industrial and Applied Mathematics, 2015.
- download data
- automatic annotation
- source: Wikipedia
- Manually annotated set
- Diploma thesis at Comenius University
- PER, ORG, LOC, MISC annotations
- approx. 7k sentences.
- corpus of spelling errors created from edits in Wikipedia
- spelling errors are sorted into 5 categories
- Multilingual Summarization Dataset
- Slovak part has 1.3k rows.
- 200k of news article summaries
- Reference: Marek Suppa and Jergus Adamec. 2020. A Summarization Dataset of Slovak News Articles. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 6725–6730, Marseille, France. European Language Resources Association.
- Contains Floret Word Vectors.
- Tagger module uses Slovak National Corpus Tagset.
- Morphological analyzer uses Universal dependencies tagset and is trained on Slovak dependency treebank.
- Lemmatizer is trained on Slovak dependency treebank.
- Named entity recognizer is trained separately on WikiAnn database.
- source: Wikipedia, Common Crawl
- source: Common Crawl
- source: Wikipedia
- Language-agnostic BERT Sentence Encoder (LaBSE) is a BERT-based model trained for sentence embedding for 109 languages.
- Language agnostic sentence embeddings.
- Multilingual document embeddings, based on Sentence Transformers.
- Slovak RoBERTa base language model
- trained on web corpus
- Slovak BERT by Ardevop SK
Slovak T5 small, created by fine-tuning mT5 small.
- VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation
- Facebook's Wav2Vec2 base model pretrained on the 10K unlabeled subset of VoxPopuli corpus and fine-tuned on the transcribed data in sk
- multilingual BERT, trained on Wikipedia
- Bidirectional translation models for Slovak for multiple languages
- Also available for HF Transformers
- Contains SentencePiece tokenization models
- For MarianNMT
- English, German, Finnish, French, Swedish
- Multilingual translation model for Fairseq
- Also provides language detection models
- Original Fairseq repository
- HuggingFace Transformers integration - distilled 600M version
- Transformer models for machine translation
- Slovak, English, Finnish, Swedish, Spanish, French
- Multilingual translation model with Slovak support.
- Built for Fairseq
- HuggingFace Transformers model
- Flores101: Large-Scale Multilingual Machine Translation
- Baseline pretrained models for small and large tracks of WMT 21 Large-Scale Multilingual Machine Translation competition.
- Includes Slovak language
- For fairseq
- Spelling Dictionary
- List of common names, abbreviations, pejoratives and neologisms.
- tokenization, segmentation, UPOS, XPOS (SNK), lemmatization, UD
- models trained on UD
- implementation in Python/PyTorch, command-line interface, web service interface
- license: Apache v2.0
- tokenization, segmentation, UPOS, XPOS (SNK), lemmatization, UD
- models trained on UD
- implementation in Python/DyNet, command-line interface, web service interface
- license: Apache v2.0
- tokenization, stemming
- tokenization, segmentation
- implementation in C++
- license: GPL v3.0
- UPOS, UD
- models trained on UD
- implementation in Python/PyTorch, command-line interface
- license: MIT
- tokenization, segmentation, UPOS, XPOS (SNK), lemmatization, UD
- models trained on UD
- implementation in C++, bindings in Java, Python, Perl, C#, command-line interface, web service interface
- license: MPL v2.0
- tokenization, stemming, lemmatization, diacritic restoration, POS (SNK), NER
- web service interface only
- license: ?
- tokenization, segmentation, lemmatization, POS (OpenNLP, SNK), UD (CoreNLP), NER
- web interface at http://nlp.bednarik.top/
- Swagger REST API
- implementation in Java/DL4J
- source code available
- license: GNU AGPLv3
- Web-based Visualisation of Slovak word vectors
- Lemmatization for 25 languages
- In Python
- Slovak trained on UDP corpus