A Catalog for Odia Language NLP Resources

The purpose of this catalog is to provide a one-stop reference for researchers looking for Odia NLP resources. This is a collective effort, and any contribution that enriches Odia NLP resources is welcome. All contributors are listed on the CONTRIBUTOR list.

NLP Repositories

TDIL : It provides language applications, resources, and tools for Indian languages including Odia. For Odia, these include a terminology application, a language search engine, a wordnet, an English-Odia parallel text corpus, English-Odia machine-assisted translation, text-to-speech software, and many more.

Text Corpora

Parallel Translation Corpus

OdiEnCorp 2.0 : This dataset contains 97K English-Odia parallel sentences and is used in the WAT2020 Odia-English machine translation task. Paper
OPUS Corpus : It contains parallel sentences pairing Odia with other languages. The data collections are domain-specific and noisy.
OdiEnCorp 1.0 : This dataset contains 30K English-Odia parallel sentences. Paper
IndoWordnet Parallel Corpus : Parallel corpora for Indian-Indian language pairs mined from IndoWordNet glosses and/or examples (6.3 million segments, 18 languages including Odia). Paper
PMIndia : Parallel corpus for En-Indian languages mined from Mann ki Baat speeches of the PM of India. It contains 38K English-Odia parallel sentences. Paper
CVIT PIB : Parallel corpus for En-Indian languages mined from the Press Information Bureau of India website. It contains 60K English-Odia parallel sentences.
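
Since some of the parallel collections above are described as noisy, a cleaning pass is often useful before training. The following is a minimal sketch of one common heuristic, a token-length ratio filter. The function name `filter_parallel` and the threshold `max_ratio` are illustrative assumptions, not part of any corpus listed here, and whitespace tokenization is only a rough proxy.

```python
def filter_parallel(pairs, max_ratio=3.0):
    """Keep sentence pairs whose token-length ratio is at most max_ratio.

    `pairs` is any iterable of (source, target) strings. Tokens are split
    on whitespace, which is a simplifying assumption.
    """
    kept = []
    for src, tgt in pairs:
        src_len = len(src.split())
        tgt_len = len(tgt.split())
        # Drop pairs where either side is empty.
        if src_len == 0 or tgt_len == 0:
            continue
        # Ratio of the longer side to the shorter side.
        ratio = max(src_len, tgt_len) / min(src_len, tgt_len)
        if ratio <= max_ratio:
            kept.append((src.strip(), tgt.strip()))
    return kept


# Tiny in-memory sample; real corpora are usually two line-aligned files.
sample = [
    ("I eat rice .", "ମୁଁ ଭାତ ଖାଏ ।"),
    ("Hello", ""),
    ("one two three four five six seven eight", "ଗୋଟିଏ"),
]
clean = filter_parallel(sample)
print(len(clean), "pair(s) kept")
```

For corpora shipped as two line-aligned files, the same function can be fed with `zip(open(en_path, encoding="utf-8"), open(or_path, encoding="utf-8"))`, where `en_path` and `or_path` are hypothetical paths to the English and Odia sides.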

Monolingual Corpus

EMILLE Corpus : It contains fourteen monolingual corpora for Indian languages including Odia. Manual
OdiEnCorp 1.0 : This dataset contains 221K Odia sentences. Paper
AI4Bharat-IndicNLP Corpus : The text corpus is not currently available (it will be released later). The embeddings were built from 3.5M Odia sentences. Vocabulary frequency files are available. Paper
OSCAR Corpus : It contains around 300K Odia sentences.

Lexical Resources

IndoWordNet : Wordnet for Indian languages including Odia.

POS Tagged corpus

Indian Language Corpora Initiative : It contains parallel annotated corpora in 12 Indian languages including Odia (tourism and health domain).
Odia Treebank : The treebank contains approx. 1082 tokens (100 sentences) in Odia. Paper

Dialect Detection corpus

Odia-Santali Dialect Detection Corpus : This corpus contains text data of Odia and Santali written in Odia script.

Models

Language Model

Language Model : Pretrained Odia Language Model.
BertOdia : BERT-based Odia Language Model.

Word Embedding

FastText (CommonCrawl + Wikipedia) : Pretrained word vectors trained on Common Crawl and Wikipedia using fastText. Select the language "oriya" from the model list.
FastText (Wikipedia) : Pretrained word vectors trained on Wikipedia using fastText. Select the language "oriya" from the model list.
AI4Bharat IndicNLP Project : Pretrained word embeddings for 10 Indian languages including Odia. Paper
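
As an illustration of how such vectors can be used, the sketch below reads the plain-text `.vec` layout that fastText uses for text-format vectors (a header line with vocabulary size and dimension, then one word and its values per line) and compares two words by cosine similarity. It assumes the text format has been downloaded and decompressed; the helper names `load_vec` and `cosine` and the three toy vectors are made up for the example.

```python
import io
import math


def load_vec(handle, limit=None):
    """Read text-format word vectors from an open file-like object.

    Returns (dimension, {word: [float, ...]}). `limit` caps the number of
    words loaded, which helps with large files.
    """
    header = handle.readline().split()
    dim = int(header[1])  # header is "<vocab_size> <dimension>"
    vectors = {}
    for line in handle:
        if limit is not None and len(vectors) >= limit:
            break
        # Split from the right so the last `dim` fields are the values.
        parts = line.rstrip().rsplit(" ", dim)
        if len(parts) != dim + 1:
            continue  # skip malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return dim, vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy 2-dimensional vectors standing in for a real downloaded file.
sample = io.StringIO("3 2\nରାଜା 1.0 0.0\nରାଣୀ 0.8 0.6\nନଦୀ 0.0 1.0\n")
dim, vectors = load_vec(sample)
print(cosine(vectors["ରାଜା"], vectors["ରାଣୀ"]))
```

With a real file, replace the `io.StringIO` object with `open(path, encoding="utf-8")`, where `path` is wherever the vectors were saved.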

Morphanalyzers

IndicNLP Morphanalyzers : Unsupervised morphanalyzers for 10 Indian languages including Odia, learned using Morfessor.

Text Classification

Odia News Article Classification : This dataset contains approximately 19,000 news article headlines collected from Odia news websites. The labeled dataset is split into training and test sets suitable for supervised text classification.
AI4Bharat IndicNLP News Articles : This dataset comprises news articles and their categories for 9 languages including Odia. For Odia, it has 4 classes (business, crime, entertainment, sports), and each class contains 7.5K news articles. The dataset is balanced across classes. Paper
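
When a labeled dataset does not come with a predefined split, a stratified split keeps the class balance the same in both parts. The sketch below is one simple way to do this with the standard library only. It assumes examples are `(text, label)` tuples; the function name `stratified_split` and the placeholder headlines are illustrative and do not reflect the actual file layout of either dataset.

```python
import random
from collections import Counter, defaultdict


def stratified_split(examples, test_fraction=0.2, seed=0):
    """Split (text, label) examples so each label keeps the same test share."""
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    rng = random.Random(seed)  # fixed seed for a reproducible split
    train, test = [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        n_test = round(len(items) * test_fraction)
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test


# Placeholder data using the four Odia classes named above.
labels = ["business", "crime", "entertainment", "sports"]
examples = [(f"headline {i}", label) for label in labels for i in range(10)]
train, test = stratified_split(examples)
print(Counter(label for _, label in test))
```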

Libraries / Tools

Indic NLP Library : A Python-based NLP library for Indian language text processing, including Odia.
Indic-OCR : OCR tools for Indic scripts including Odia. It also supports Ol Chiki (Santali).
Odia Romanization Script : The Perl script "odiaroman" maps Odia script to Latin script.
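
Before passing text to script-specific tools such as those above, it can help to check which script the text is actually written in. The sketch below counts characters by Unicode block, using the Oriya block (U+0B00 to U+0B7F) and the Ol Chiki block (U+1C50 to U+1C7F). The function names are assumptions made for this example and are not part of any listed library.

```python
from collections import Counter

# Unicode block ranges (inclusive).
SCRIPT_RANGES = {
    "odia": (0x0B00, 0x0B7F),      # Unicode "Oriya" block
    "ol_chiki": (0x1C50, 0x1C7F),  # Unicode "Ol Chiki" block
}


def script_counts(text):
    """Count non-whitespace characters per script; unknown ones go to 'other'."""
    counts = Counter()
    for ch in text:
        if ch.isspace():
            continue
        code_point = ord(ch)
        for name, (low, high) in SCRIPT_RANGES.items():
            if low <= code_point <= high:
                counts[name] += 1
                break
        else:
            counts["other"] += 1
    return counts


def dominant_script(text):
    """Return the most frequent script name, or None for empty input."""
    counts = script_counts(text)
    return counts.most_common(1)[0][0] if counts else None


print(dominant_script("ଓଡ଼ିଆ ଭାଷା"))
print(dominant_script("Odia language"))
```

Note that a script check of this kind cannot separate Odia from Santali when both are written in Odia script, as in the dialect detection corpus listed earlier; that task needs a trained classifier.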

Speech Corpora

IIT Madras IndicTTS : The Indic TTS project develops text-to-speech (TTS) synthesis systems for Indian languages including Odia. The database contains spoken sentences/utterances recorded by both male and female native speakers.
LDC-IL : It includes annotated Odia speech corpora with the voices of 450 different native speakers.

Other Indian language NLP Resources

A comprehensive list of Indian language NLP resources can be found in the IndicNLP Catalog.
