Please suggest any other resources you may be aware of. Raise a pull request or an issue to add more resources to the catalog. Put the proposed entry in the following format:
[Wikipedia Dumps](https://dumps.wikimedia.org/)
Add a small, informative description of the dataset and provide links to any paper/article/site documenting the resource. Mention your name too. We would like to acknowledge your contribution to building this catalog in the CONTRIBUTORS list.
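As a rough illustration of the entry format described above, the sketch below builds one catalog line as a markdown link followed by an optional description. The helper name `format_entry` and the `: ` separator between link and description are assumptions made for this example; they are not part of the catalog's tooling.

```python
def format_entry(name: str, url: str, description: str = "") -> str:
    """Build one catalog line: a markdown link plus an optional description.

    Hypothetical helper for illustration only; the ': ' separator is an
    assumption modelled on how existing entries read.
    """
    name, url, description = name.strip(), url.strip(), description.strip()
    if not name or not url:
        raise ValueError("both a resource name and a URL are required")
    link = f"[{name}]({url})"
    return f"{link}: {description}" if description else link


if __name__ == "__main__":
    print(format_entry("Wikipedia Dumps", "https://dumps.wikimedia.org/"))
```

The resulting line can be pasted into a pull request or issue, with the contributor's name mentioned alongside it.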
🆕 Added Evaluation Benchmarks sections
👍 Featured Resources
- 🆕Vakyansh CLSRIL-23: Pretrained wav2vec2 model trained on 10,000 hours of Speech data in 23 Indic Languages (documentation) (experimentation platform).
- 🆕FLORES-101: Human translated evaluation sets for 101 languages released by Facebook. It includes 14 Indic languages. The testsets are n-way parallel.
- 🆕Samanantar Parallel Corpus: Largest parallel corpus for English and 11 Indian languages. It comprises 46m sentence pairs between English-Indian languages and 82m sentence pairs between Indian languages.
- 🆕IndicTrans: Multilingual neural translation models for translation between English and 11 Indian languages. Supports translation between Indian languages as well. A total of 110 translation directions are supported.
- 🆕Itihasa Parallel Corpus: 93k parallel sentences between English and Sanskrit from the Ramayana and Mahabharata.
- AI4Bharat IndicNLPSuite: Text corpora, word embeddings, BERT for Indian languages and NLU resources for Indian languages.
Browse the entire catalog...
🙋Note: Many known resources have not yet been classified into the catalog. They can be found as open issues in the repo.
- Major Indic Language NLP Repositories
- Libraries and Tools
- Evaluation Benchmarks
- Standards
- Text Corpora
- Monolingual Corpus
- Language Identification
- Lexical Resources
- NER Corpora
- Parallel Translation Corpus
- Parallel Transliteration Corpus
- Text Classification
- Textual Entailment/Natural Language Inference
- Paraphrase
- Sentiment, Sarcasm, Emotion Analysis
- Hate Speech and Offensive Comments
- Question Answering
- Dialog
- Discourse
- Information Extraction
- POS Tagged corpus
- Chunk Corpus
- Dependency Parse Corpus
- Co-reference Corpus
- Models
- Speech Corpora
- OCR Corpora
- Multimodal Corpora
- Language Specific Catalogs
- Technology Development for Indian Languages (TDIL)
- Center for Indian Language Technology (CFILT)
- Language Technologies Research Center (LTRC)
- AI4Bharat IndicNLP
- Linguistic Data Consortium For Indian Languages (LDCIL)
- University of Hyderabad - Sanskrit NLP
- National Platform for Language Technology
- Indic NLP Library: Python Library for various Indian language NLP tasks like tokenization, sentence splitting, normalization, script conversion, transliteration, etc
- pyiwn: Python Interface to IndoWordNet
- Indic-OCR : OCR for Indic Scripts
- CLTK: Toolkit for many of the world's classical languages. Support for Sanskrit. Some parts of the Sanskrit library are forked from the Indic NLP Library.
- iNLTK: iNLTK aims to provide out of the box support for various NLP tasks that an application developer might need for Indic languages.
- Sanskrit Coders Indic Transliteration: Script conversion and romanization for Indian languages.
- Smart Sanskrit Annotator: Annotation tool for Sanskrit (paper)
- BNLP: Bengali language processing toolkit with tokenization, embedding, POS tagging and NER support
- CodeSwitch: Language identification, POS tagging, NER and sentiment analysis support for code-mixed data, including Hindi and Nepali
Benchmarks spanning multiple tasks.
- AI4Bharat IndicGLUE: NLU benchmark for 11 languages.
- AI4Bharat IndicNLG Suite: NLG benchmark for 11 languages spanning 5 generation tasks.
- GLUECoS: Hindi-English code-mixed benchmark containing the following tasks: Language Identification (LID), POS Tagging (POS), Named Entity Recognition (NER), Sentiment Analysis (SA), Question Answering (QA), Natural Language Inference (NLI).
- AI4Bharat Text Classification: A compilation of classification datasets for 10 languages.
- WAT 2021 Translation Dataset: Standard train and test sets for translation between English and 10 Indian languages.
- Unicode Standard for Indic Scripts
- AI4Bharat IndicCorp: Contains 8.9 billion tokens from 12 Indian languages (including Indian English).
- Wikipedia Dumps
- Common Crawl
- OSCAR Corpus: Released in 2019, a large-scale processed version of CommonCrawl.
- WMT Common Crawl Dumps: Crawls between 2012 and 2016. Noisy text, needs to be filtered.
- CC-100 Corpus: Facebook CommonCrawl extracted data. They provide scripts for processing CommonCrawl. StatMT has built a replica of the CC-100 corpus using these scripts. You can find it HERE. This corpus also has romanized corpora for some Indian languages.
- WMT NEWS Crawl
- LDCIL Monolingual Corpus
- Charles University Hindi Monolingual Corpus
- Charles University Urdu Monolingual Corpus
- IIT Bombay Hindi Monolingual Corpus
- EMILLE Corpus (multiple Indian languages)
- Janmabhumi Malayalam Corpus
- Leipzig Corpus
- Sanskrit Monolingual and Sandhi-split Corpus
- Lot Of Indic Tweets Corpus: Large Twitter datasets for Telugu (7.9 million) and Hindi (17.6 million), with fastText skipgram and CBOW word vectors for the same.
- CMU Romanized Hinglish Corpus: See THIS PAPER for details.
- JNU-BHLTR Bhojpuri Corpus: Bhojpuri corpus of 45k sentences.
- KMI Magahi Corpus
- KMI Awadhi Corpus
- SMC Malayalam text corpus
- DNLP-Tel Telugu Corpus: Telugu corpus of 280M tokens and 23M sentences.
- VarDial 2018 Language Identification Dataset: 5 languages - Hindi, Braj, Awadhi, Bhojpuri, Magahi.
- IndoWordNet
- IIIT-Hyderabad Word Similarity Database: 7 Indian languages
- Facebook Hindi Analogy Dataset
- MGAD Hindi Analogy dataset
- AI4Bharat Word Frequency Lists: Tokens and their frequencies from the AI4Bharat corpus, a large monolingual corpus.
- Hindi RG-63: Hindi version of the Rubenstein and Goodenough (RG-65) word similarity dataset
- IITB Cognate Datasets: Dataset of Cognates and False Friend Pairs for 12 Indian Languages. (Paper)
- FIRE 2013 AUKBC NER Corpus
- FIRE 2014 AUKBC NER Corpus
- IIT Bombay Marathi NER Corpus
- WikiAnn NER Corpus (Noisy) DOWNLOAD (Old broken LINK)
- IJCNLP 2008 NER Corpus: NER corpora for hi, bn, or, te, ur.
- a-mma NER data
- Samanantar Parallel Corpus: Largest parallel corpus for English and 11 Indian languages. It comprises 46m sentence pairs between English-Indian languages and 82m sentence pairs between Indian languages.
- FLORES-101: Human translated evaluation sets for 101 languages released by Facebook. It includes 14 Indic languages. The testsets are n-way parallel.
- IIT Bombay English-Hindi Parallel Corpus: Largest en-hi parallel corpora in public domain (about 1.5 million segments)
- CVIT-IIITH PIB Multilingual Corpus: Mined from Press Information Bureau for many Indian languages. Contains both English-IL and IL-IL corpora (IL=Indian language).
- CVIT-IIITH Mann ki Baat Corpus: Mined from Indian PM Narendra Modi's Mann ki Baat speeches.
- PMIndia: Parallel corpus for En-Indian languages mined from Mann ki Baat speeches of the PM of India (paper).
- OPUS corpus
- WAT 2018 Parallel Corpus: There may be significant overlap between WAT and OPUS.
- Charles University English-Hindi Parallel Corpus: This is included in the IITB parallel corpus.
- Charles University English-Tamil Parallel Corpus
- Charles University English-Odia Parallel Corpus v1.0
- Charles University English-Odia Parallel Corpus v2.0
- Charles University English-Urdu Religious Parallel Corpus
- Indian Language Corpora Initiative: Available on TDIL portal on request
- IndoWordnet Parallel Corpus: Parallel corpora mined from IndoWordNet gloss and/or examples for Indian-Indian language corpora (6.3 million segments, 18 languages).
- MTurk Indian Parallel Corpus
- TED Parallel Corpus
- JW300 Corpus: Parallel corpus mined from jw.org. Religious text from Jehovah's Witnesses.
- ALT Parallel Corpus: 10k sentences for Bengali, Hindi in parallel with English and many East Asian languages.
- FLORES dataset: English-Sinhala and English-Nepali corpora
- Uka Tarsadia University Corpus: 65k English-Gujarati sentence pairs. Corpus is described in this paper
- NLPC-UoM English-Tamil Corpus: 9k sentences, 24k glossary terms
- Wikititles: from statmt
- JNU-BHLTR Bhojpuri Corpus: English-Bhojpuri corpus of 65k sentences
- EILMT Corpus
- QED Corpus: English-Hindi corpus of 43k sentences from the educational domain.
- WikiMatrix Corpus: Mined from Wikipedia, looks noisy.
- CCMatrix: Parallel corpus mined from CommonCrawl, looks noisy (statmt repo).
- CGNetSwara: Hindi-Gondi parallel corpus (19k sentence pairs)
- MTEnglish2Odia: English-Odia (42k pairs)
- SAP Software Documentation: test and evaluation set for English-Hindi in the software documentation domain [paper]
- BUET English-Bangla Corpus, EMNLP-2020: 2.7M sentences (has overlaps with OPUS)
- CLE Parallel Corpus: Parallel corpus for English, Urdu and Nepali.
- Itihasa Parallel Corpus: 93k parallel sentences between English and Sanskrit from the Ramayana and Mahabharata.
- Dakshina Dataset: The Dakshina dataset is a collection of text in both Latin and native scripts for 12 South Asian languages. Contains an aggregate of around 300k word pairs and 120k sentence pairs.
- BrahmiNet Corpus: 110 language pairs mined from ILCI parallel corpus.
- Xlit-Crowd: Hindi-English Transliteration Corpus created via crowdsourcing.
- Xlit-IITB-Par: Hindi-English Transliteration Corpus mined from parallel translation corpora.
- FIRE 2013 Track on Transliterated Search: Transliteration dataset of native words in Hindi, Bengali and Gujarati.
- NEWS 2018 Shared Task dataset: Transliteration datasets for Kannada, Tamil, Bengali and Hindi created by Microsoft Research India.
- AI4Bharat StoryWeaver Xlit Dataset: Transliteration datasets for Hindi, Maithili & Konkani
- Hindi WikiData Transliteration Pairs: Hindi dataset (90k pairs)
- NotAI-tech English-Telugu: Around 38k word pairs
- BBC news articles classification dataset: 14 class classification
- iNLTK News Headlines classification: Datasets for multiple Indian languages.
- AI4Bharat IndicNLP News Articles: Word embeddings for 10 Indian languages.
- XNLI corpus: Hindi and Urdu test sets and machine translated training sets (from English MultiNLI).
- Amrita University-DPIL Corpus: Sentence level paraphrase identification for four Indian languages (Tamil, Malayalam, Hindi and Punjabi).
- IIT Bombay movie review datasets for Hindi and Marathi
- IIT Patna movie review datasets for Hindi
- IIIT-H LTRC Multi-domain dataset for Telugu
- ACTSA corpus for Telugu
- BHAAV (भाव) Corpus: A Text Corpus for Emotion Analysis from Hindi Stories
- SentiWordNet - SAIL - Hindi, Bangla, Tamil & Telugu
- Dravidian-CodeMix - FIRE 2020 - Tamil & Malayalam
- Bengali Sentiment Analysis - Classification Benchmark, 2020: 8k sentences
- SentNoB: sentiment dataset for Bangla from 3 domains on user comments containing 15k examples (Paper) (Dataset)
- Hate Speech and Offensive Content Identification in Indo-European Languages: (HASOC FIRE-2020)
- An Indian Language Social Media Collection for Hate and Offensive Speech, 2020: Hinglish Tweets and FB Comments collected during Parliamentary Election 2019 of India (Dataset available on request)
- Aggression-annotated Corpus of Hindi-English Code-mixed Data, 2018: Scraped from Facebook (21k) & Twitter (18k) (Paper)
- Did You Offend Me? Classification of Offensive Tweets in Hinglish Language, 2018: 3k tweets (Paper)
- A Dataset of Hindi-English Code-Mixed Social Media Text for Hate Speech Detection, 2018: 4.5k Tweets (Paper)
- Roman Urdu Offensive Language Detection, 2020: 10k tweets, can also be used for Hindi (Paper)
- Bengali Hate Speech - Classification Benchmark, 2020: 1.5k sentences
- Offensive Language Identification in Dravidian Languages, EACL 2021: Tamil, Malayalam, Kannada
- Fear Speech in Indian WhatsApp Groups, 2021
- Facebook Multilingual QA datasets: Contains dev and test sets for Hindi.
- TyDi QA datasets: QA dataset for Bengali and Telugu.
- bAbi 1.2 dataset: Has Hindi version of bAbi tasks in romanized Hindi.
- MMQA dataset: Hindi QA dataset described in this paper
- XQuAD: testset for Hindi QA from human translation of subset of SQuAD v1.1. Described in this paper
- XQA: testset for Tamil QA. Described in this paper
- HindiRC: A Dataset for Reading Comprehension in Hindi containing 127 questions and 24 passages. Described in this paper
- IITH HiDG: A Distractor Generation Dataset for Hindi consisting of 1k/1k/5k (train/validation/test) split. Described in this paper
- Chaii: A Kaggle challenge consisting of 1104 questions in Hindi and Tamil. Moreover, here is a good collection of papers on multilingual Question Answering.
- EventXtract-IL: Event extraction for Tamil and Hindi. Described in this paper.
- [EDNIL-FIRE2020](https://ednilfire.github.io/ednil/2020/index.html): Event extraction for Tamil, Hindi, Bengali, Marathi, English. Described in this paper.
- Indian Language Corpora Initiative
- Universal Dependencies
- IIITH Paninian Treebank: POS annotations for hi, bn, kn, ml and mr.
- Code Mixed Dataset for Hindi, Bengali and Telugu, ICON 2016 shared task
- JNU-BHLTR Bhojpuri Corpus: Bhojpuri corpus of 5000 sentences.
- KMI Magahi Corpus
- KMI Awadhi Corpus
- Indian Language Corpora Initiative
- Indian Languages Treebanking Project: Chunk annotations for hi, bn, kn, ml and mr.
- IIIT Hyderabad Hindi Treebank
- Universal Dependencies
- Universal Dependencies Hindi Treebank
- Universal Dependencies Urdu Treebank
- IIITH Paninian Treebank: Paninian Grammar Framework annotations along with mappings to Stanford dependency annotations for hi, bn, kn, ml and mr.
- Vedic Sanskrit Treebank: 4k Sanskrit dependency treebank (paper)
- AI4Bharat IndicFT: fastText word embeddings for 11 Indian languages.
- FastText CommonCrawl+Wikipedia
- FastText Wikipedia
- Polyglot
- AI4Bharat IndicBERT: Multilingual ALBERT based embeddings spanning 12 languages for Natural Language Understanding (including Indian English).
- AI4Bharat IndicBART: Multilingual mBART based embeddings spanning 12 languages for Natural Language Generation (including Indian English).
- MuRIL: Multilingual mBERT based embeddings spanning 17 languages and their transliterated counterparts for Natural Language Understanding (paper).
- BERT Multilingual: BERT model trained on Wikipedias of many languages (including major Indic languages).
- iNLTK: ULMFit and TransformerXL pre-trained embeddings for many languages trained on Wikipedia and some News articles.
- albert-base-sanskrit: ALBERT-based model trained on Sanskrit Wikipedia.
- RoBERTa-hindi-guj-san: Multilingual RoBERTa like model trained on Hindi, Sanskrit and Gujarati.
- Bangla-BERT-Base: Bengali BERT model trained on Bengali Wikipedia and OSCAR datasets
- AI4Bharat IndicNLP Project: Unsupervised morphanalyzers for 10 Indian languages learnt using morfessor.
- IndicTrans: Multilingual neural translation models for translation between English and 11 Indian languages. Supports translation between Indian languages as well. A total of 110 translation directions are supported.
- Shata-Anuvaadak: 110 language pairs
- LTRC Vanee: Dependency based Statistical MT system from English to Hindi
- AI4Bharat IndicWav2Vec: Multilingual pre-trained models for 40 Indian languages based on Wav2Vec 2.0
- Vakyansh CLSRIL-23: Pretrained wav2vec2 model trained on 10,000 hours of Speech data in 23 Indic Languages (documentation) (experimentation platform).
- arijitx/wav2vec2-large-xlsr-bengali: Pretrained wav2vec2-large-xlsr trained on ~50 hrs (40,000 utterances) of OpenSLR Bengali data. Test WER 32.45% without LM.
- Microsoft Speech Corpus: Speech corpus for Telugu, Tamil and Gujarati.
- Microsoft-IITB Marathi Speech Corpus: 109 hours of speech data collected via crowdsourcing.
- AccentDB: Database of Indian English accents from native speakers in Bangla, Malayalam, Telugu and Oriya.
- IIT Madras TTS database
- BABEL Speech Corpus: includes some Indian languages
- Pratham ASER dataset: Dataset for research on reading level assessment.
- WikiPron: Words and their pronunciations in IPA mined from Wiktionary. Includes Indian languages. (paper)
- CVIT IndicSpeech: TTS data for 3 Indian languages: Malayalam, Bengali and Hindi (24 hours each).
- Google Speech Corpus: TTS data for 6 Indian languages: Malayalam, Marathi, Telugu, Kannada, Gujarati, Tamil (up to 9 hours each). Resources SLR#63-#66, #78-#79. (paper)
- CoVoST 2: Tamil 2 hrs data
- SMC Malayalam Speech Corpus - Download link
- Vāksañcayaḥ Sanskrit Speech Corpus: 78 hours of speech in Sanskrit prose, with speaker-disjoint train, dev and test splits. It also contains additional out-of-domain test data from speakers whose pronunciation is influenced by their L1 (paper).
- English-Hindi Visual Genome: Images captioned in both English and Hindi.
- English-Hindi Flickr 8k: A subset of images from Flickr8k images captioned by native speakers in both English and Hindi. Code and data available here.
Pointers to language-specific NLP resource catalogs