Resources

5 000 yes-no questions
Each example is a triplet of (question, passage, answer), with the title of the page as optional additional context.
Machine translated
Can also be used for summarization

Morpho-syntactic

Slovak Dependency Treebank

tokenization, segmentation, UPOS, XPOS (SNK), UD, lemma
manual annotation
format: conllu, PDT tagset
source: SNK
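
The treebanks in this section are distributed in the CoNLL-U format, where each token is one line of ten tab-separated columns and a blank line ends a sentence. The following is a minimal reading sketch using only the standard library. The sample sentence and the function name `read_conllu` are illustrative assumptions, not taken from the treebank or its tooling, and the XPOS column is left empty rather than guessing an SNK tag.

```python
from typing import Dict, Iterator, List

# The ten tab-separated columns defined by the CoNLL-U format.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_conllu(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield sentences as lists of token dicts; comment lines are skipped."""
    sentence: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():            # a blank line ends the sentence
            if sentence:
                yield sentence
                sentence = []
        elif line.startswith("#"):      # sentence-level metadata
            continue
        else:
            cols = line.split("\t")
            if len(cols) != len(FIELDS):
                raise ValueError(f"expected 10 columns, got {len(cols)}")
            sentence.append(dict(zip(FIELDS, cols)))
    if sentence:                        # file may lack a trailing blank line
        yield sentence


# Hypothetical sample sentence, written by hand for illustration only.
SAMPLE = (
    "# text = Pes spí.\n"
    "1\tPes\tpes\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tspí\tspať\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)

sentences = list(read_conllu(SAMPLE))
for token in sentences[0]:
    print(token["form"], token["lemma"], token["upos"], token["deprel"])
```

For real work, a dedicated CoNLL-U library handles multiword tokens and empty nodes, which this sketch treats as ordinary rows.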

Reference:

Gajdošová, Katarína; Šimková, Mária; et al., 2016, Slovak Dependency Treebank, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University, http://hdl.handle.net/11234/1-1822.

Slovak Universal Dependencies

A conversion of the Slovak Dependency Treebank into the Universal Dependencies tagset.

GitHub page
tokenization, segmentation, UPOS, XPOS (SNK), UD, lemma
manual annotation
format: conllu, UD tagset
source: SNK

Reference:

Zeman, Daniel. (2017). Slovak Dependency Treebank in Universal Dependencies. Journal of Linguistics/Jazykovedný časopis. 68. DOI: 10.1515/jazcas-2017-0048.

Artificial Treebank with Ellipsis

tokenization, segmentation, UPOS, XPOS (SNK), UD, lemma
format: conllu
source: Slovak UD, SNK

Morphological vocabulary

form, lemma, POS (SNK)
source: SNK

Morphological vocabulary

form, lemma, POS

MULTEXT-East free lexicons 4.0

form, lemma, POS (Multext East)

Parallel

ŠarišSet

Corpus of the Šariš dialect
4.7k examples.
authors: Viktória Ondrejová and Marek Šuppa

OpenSubtitles

62 languages, 1,782 bitexts
Slovak part contains 100 mil. tokens

VoxPopuli

source: Europarl
speech, vectors, language

Czech-Slovak Parallel Corpus

automatic POS (SNK)
source: Acquis, Europarl, EU-journal, EC-Europa, OPUS

English-Slovak Parallel Corpus

automatic POS (SNK)
source: Acquis, Europarl, EU-journal, EC-Europa, OPUS

MULTEXT-East "1984" annotated corpus

sentence aligned, POS
Bulgarian, Czech, English, Estonian, Hungarian, Macedonian, Persian, Polish, Romanian, Serbian, Slovak, Slovenian
source: "1984" novel

Reference:

Erjavec, Tomaž; et al., 2010, MULTEXT-East "1984" annotated corpus 4.0, Slovenian language resource repository CLARIN.SI, ISSN 2820-4042, http://hdl.handle.net/11356/1043.

Paracrawl

Parallel web Corpus with Slovak Part
3.3 mil sentences English-Slovak

WikiMatrix

Unsupervised processing of Wikipedia to obtain parallel corpora
Used LASER embeddings.
85 different languages, 1620 language pairs, 134M parallel sentences, out of which 34M are aligned with English

Semantic textual similarity

STSB-sk

Machine translated by OPUS-en-sk model
The sentence similarity dataset pairs two sentences with a floating-point target between 0 and 5, where a higher number means higher similarity. It contains 5 749 train, 1 500 validation and 1 379 test examples.
Referenced from this report by J. Agarský.
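
As a small illustration of the structure described above, the sketch below models one STS example and rescales the 0 to 5 target to 0 to 1, which is a common step before comparing it with a cosine similarity. The field names `sentence1`, `sentence2` and `score`, and the sample pair, are assumptions for illustration and may differ from the actual column names in the dataset.

```python
from dataclasses import dataclass


@dataclass
class STSExample:
    sentence1: str   # assumed field name
    sentence2: str   # assumed field name
    score: float     # gold similarity between 0 and 5


def normalised(example: STSExample) -> float:
    """Map the 0-5 gold score to the 0-1 range."""
    if not 0.0 <= example.score <= 5.0:
        raise ValueError(f"score out of range: {example.score}")
    return example.score / 5.0


# Split sizes as listed in the dataset description.
SPLIT_SIZES = {"train": 5749, "validation": 1500, "test": 1379}
total_examples = sum(SPLIT_SIZES.values())

# Hypothetical pair, not taken from the dataset.
pair = STSExample("Pes spí.", "Pes drieme.", 2.5)
print(normalised(pair), total_examples)
```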

Sentiment

Slovak_sentiment

Unknown/undocumented source
positive/negative

Twitter sentiment for 15 European languages

source: Twitter
3 categories - positive, negative, neutral

SentiGrade

The dataset contains a total of 1 588 comments in Slovak from various Facebook pages. The texts are annotated with 5 categories.

STS2-sk

Machine translated
Sentiment analysis dataset, binary classification task: positive sentiment, negative sentiment. It includes reviews from 7 categories with positive, neutral and negative sentiment labels.
Source: Slovakbert auxiliary repository by Matúš Pikuliak, Štefan Grivalský, Martin Konôpka, Miroslav Blšták, Martin Tamajka, Viktor Bachratý, Marián Šimko, Pavol Balážik, Michal Trnka, and Filip Uhlárik, 2021
Referenced from this report by J. Agarský.

sk csfd movie reviews

CSFD Movie Reviews
25k items

Fact Checking

Demagog

9.1k Czech, 2.8k Polish and 12.6k Slovak labeled claims with reasoning: demagog.zip (~16.5 MB)

qacg-sk

Machine translated facts with evidence represented as references to Wikipedia pages.
350k items

Instructions

SlovAlpaca

Machine translation of the Stanford Alpaca
40k annotations

Named Entity Recognition

Universal NER Slovak

8.48k sentences
Annotated by a large language model
PER, ORG, LOC annotations

ner-rent-sk

250 rent agreements, annotated for entities such as tenant, landlord, monthly fee, subject.
Unknown origin and license.

WikiGold

10k manually annotated items from Wikipedia

ju-bezdek/conll2003-SK-NER

translated version of the original CONLL2003 dataset (translated from English to Slovak via Google Translate)

Polyglot NER

automatically annotated Wikipedia for Named Entities
massively multilingual
Slovak part has 500k sentences.
Reference: Al-Rfou, Rami, et al. "Polyglot-NER: Massive multilingual named entity recognition." Proceedings of the 2015 SIAM International Conference on Data Mining. Society for Industrial and Applied Mathematics, 2015.

Cross-lingual Name Tagging and Linking for 282 Languages

download data
automatic annotation
source: Wikipedia

Contextualized Language Model-based Named Entity Recognition in Slovak Texts

Manually annotated set
Diploma thesis at Comenius University
PER, ORG, LOC, MISC annotations
approx. 7k sentences.

Spelling

CHIBI

corpus of spelling errors created from edits in Wikipedia
spelling errors are sorted into 5 categories

Wordnet

Slovak Wordnet

Summarization

Eur Lex Sum

Multilingual Summarization Dataset
Slovak part has 1.3k rows.

A Summarization Dataset of Slovak News Articles

200k news article summaries
Reference: Marek Suppa and Jergus Adamec. 2020. A Summarization Dataset of Slovak News Articles. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 6725–6730, Marseille, France. European Language Resources Association.

Models

General Models

Slovak Spacy Model

Contains Floret Word Vectors.
The tagger module uses the Slovak National Corpus tagset.
The morphological analyzer uses the Universal Dependencies tagset and is trained on the Slovak Dependency Treebank.
The lemmatizer is trained on the Slovak Dependency Treebank.
The named entity recognizer is trained separately on the WikiAnn database.

Word embeddings

ELMo word embeddings

source: Wikipedia, Common Crawl

fastText word embeddings - Common Crawl

source: Common Crawl

fastText word embeddings - Wikipedia

source: Wikipedia

Document Embeddings

Language Agnostic BERT model

Language-agnostic BERT Sentence Encoder (LaBSE) is a BERT-based model trained for sentence embedding for 109 languages.
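
Sentence encoders such as LaBSE are typically used by embedding sentences and comparing the vectors with cosine similarity. The sketch below shows that comparison step with the standard library only. The `toy_encode` function is a deliberately crude stand-in (letter counts) so the example runs offline; it is not a real sentence embedding, and `best_match` is a hypothetical helper name.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_match(query: str, candidates: List[str],
               encode: Callable[[List[str]], Sequence[Vector]]) -> str:
    """Return the candidate whose embedding is closest to the query embedding."""
    vectors = encode([query] + candidates)
    query_vec, candidate_vecs = vectors[0], vectors[1:]
    scores = [cosine(query_vec, vec) for vec in candidate_vecs]
    return candidates[scores.index(max(scores))]


def toy_encode(sentences: List[str]) -> List[List[float]]:
    """Stand-in encoder: counts of the letters a-z. NOT a real sentence embedding."""
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    return [[float(s.lower().count(ch)) for ch in alphabet] for s in sentences]


print(best_match("pes", ["pes spí", "auto"], toy_encode))
```

With the `sentence-transformers` package installed, the `encode` method of `SentenceTransformer("sentence-transformers/LaBSE")` could be passed in place of `toy_encode`; this requires a model download and is stated here as an assumption rather than a tested setup.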

LASER

Language agnostic sentence embeddings.

E5

Multilingual document embeddings, based on Sentence Transformers.

Transformers

SlovakBert

Slovak RoBERTa base language model
trained on web corpus

sk-bert

Slovak BERT by Ardevop SK

ApoTro/slovak-t5-small

Slovak T5 small, created by fine-tuning mT5 small.

VoxPopuli Slovak model

VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation
Facebook's Wav2Vec2 base model pretrained on the 10K unlabeled subset of VoxPopuli corpus and fine-tuned on the transcribed data in sk

m-BERT

multilingual BERT, trained on Wikipedia

Translation models

Helsinki Opus NLP

Bidirectional translation models between Slovak and multiple languages
Also available for HF Transformers
Contains SentencePiece tokenization models
For MarianNMT
English, German, Finnish, French, Swedish
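
A minimal sketch of using one of these models through HF Transformers follows. It assumes the `transformers` and `sentencepiece` packages are installed and that a checkpoint for the requested pair is published under the `Helsinki-NLP/opus-mt-{src}-{tgt}` naming pattern; not every language pair has one. The helper names `opus_mt_name` and `translate` are hypothetical. The heavy import is done lazily so the module itself loads without the library.

```python
from typing import List


def opus_mt_name(src: str, tgt: str) -> str:
    """Build a model id following the Helsinki-NLP naming pattern.

    The pattern is an assumption about how the checkpoints are named;
    a given pair may not be published.
    """
    return f"Helsinki-NLP/opus-mt-{src}-{tgt}"


def translate(texts: List[str], src: str = "sk", tgt: str = "en") -> List[str]:
    """Translate a batch of sentences; downloads the model on first use."""
    # Imported lazily: requires `transformers`, `sentencepiece` and PyTorch.
    from transformers import MarianMTModel, MarianTokenizer

    name = opus_mt_name(src, tgt)
    tokenizer = MarianTokenizer.from_pretrained(name)
    model = MarianMTModel.from_pretrained(name)
    batch = tokenizer(texts, return_tensors="pt", padding=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)


print(opus_mt_name("sk", "en"))
```

Calling `translate(["Dobrý deň."])` would then return a list with one English string, provided the model can be downloaded.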

NLLB - No Language Left Behind

Multilingual translation model for Fairseq
Also provides language detection models
Original Fairseq REPO
HuggingFace Transformers integration - distilled 600M version

HuggingFace Translation models

Transformer models for machine translation
Slovak, English, Finnish, Swedish, Spanish, French

M2M 100

Multilingual translation model with Slovak support.
Built for Fairseq
HuggingFace Transformers model

Flores

Flores101: Large-Scale Multilingual Machine Translation
Baseline pretrained models for small and large tracks of WMT 21 Large-Scale Multilingual Machine Translation competition.
Includes Slovak language
For fairseq

Tools

Slovak Hunspell

Spelling Dictionary
List of common names, abbreviations, pejoratives and neologisms.

Stanza

tokenization, segmentation, UPOS, XPOS (SNK), lemmatization, UD
models trained on UD
implementation in Python/PyTorch, command-line interface, web service interface
license: Apache v2.0
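
The annotation layers listed for Stanza can be read off its document objects roughly as sketched below. This assumes the `stanza` package is installed and that its Slovak models can be downloaded under the language code `sk`. The helper names are hypothetical, and the `fake_doc` object only imitates the attribute layout so the example runs offline.

```python
from types import SimpleNamespace
from typing import Any, List, Tuple


def build_pipeline() -> Any:
    """Create a Slovak pipeline; downloads models on first use."""
    # Imported lazily: requires the third-party `stanza` package.
    import stanza

    stanza.download("sk")
    return stanza.Pipeline(lang="sk", processors="tokenize,pos,lemma,depparse")


def token_rows(doc: Any) -> List[Tuple]:
    """Flatten a parsed document into (form, lemma, UPOS, XPOS, head, deprel) rows."""
    return [
        (word.text, word.lemma, word.upos, word.xpos, word.head, word.deprel)
        for sentence in doc.sentences
        for word in sentence.words
    ]


# Offline stand-in that mimics the attributes read above; not real model output.
fake_doc = SimpleNamespace(sentences=[SimpleNamespace(words=[
    SimpleNamespace(text="Pes", lemma="pes", upos="NOUN",
                    xpos=None, head=2, deprel="nsubj"),
    SimpleNamespace(text="spí", lemma="spať", upos="VERB",
                    xpos=None, head=0, deprel="root"),
])])

for row in token_rows(fake_doc):
    print(row)
```

With the models available, `token_rows(build_pipeline()("Pes spí."))` would produce the same kind of rows from real annotations.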

NLP-Cube

tokenization, segmentation, UPOS, XPOS (SNK), lemmatization, UD
models trained on UD
implementation in Python/dyNET, command-line interface, web service interface
license: Apache v2.0

Slovak Elasticsearch

tokenization, stemming

Slovak lexer

tokenization, segmentation
implementation in C++
license: GPL v3.0

dl4dp

UPOS, UD
models trained on UD
implementation in Python/PyTorch, command-line interface
license: MIT

UDPipe

tokenization, segmentation, UPOS, XPOS (SNK), lemmatization, UD
models trained on UD
implementation in C++, bindings in Java, Python, Perl, C#, command-line interface, web service interface
license: MPL v2.0

NLP4SK

tokenization, stemming, lemmatization, diacritic restoration, POS (SNK), NER
web service interface only
license: ?

NLP Tools

tokenization, segmentation, lemmatization, POS (OpenNLP, SNK), UD (CoreNLP), NER
web interface at http://nlp.bednarik.top/
Swagger REST API
implementation in Java/DL4J
source code available
license: GNU AGPLv3

Semä

Web-based Visualisation of Slovak word vectors

Simplemma

Lemmatization for 25 languages
In Python
Slovak trained on UDP corpus

slovak-nlp/resources
