GitHub - zarmeen92/Urdu: Collection of Urdu datasets for POS, NER and NLP tasks

POS dataset

Urdu dataset for POS training. This small dataset can be used to train part-of-speech taggers for the Urdu language. The structure is simple: each line holds a word followed by its tag, i.e.

word TAG
word TAG

The tagset used to build the dataset is taken from Sajjad's Tagset. To get the larger dataset, you need to purchase a license. Contact: virtuoso.irfan@gmail.com
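As a rough illustration of the `word TAG` layout described above, the following sketch parses such lines into (word, tag) pairs. It assumes the word and tag are separated by whitespace and that blank or malformed lines can be skipped; the function name `parse_pos_lines` and the tags in the sample are placeholders for illustration, not names or tags taken from the dataset.

```python
from typing import Iterable, List, Tuple


def parse_pos_lines(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Parse 'word TAG' lines into (word, tag) pairs.

    Assumption: the tag is the last whitespace-separated field on the line.
    """
    pairs = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        parts = line.rsplit(None, 1)  # split once, on the last run of whitespace
        if len(parts) != 2:
            continue  # skip lines that lack either a word or a tag
        word, tag = parts
        pairs.append((word, tag))
    return pairs


# Placeholder sample: TAG1/TAG2 are hypothetical, not real tagset entries.
sample = ["اردو TAG1", "", "ماڈل TAG2"]
print(parse_pos_lines(sample))
```

To read the actual file, pass an open file handle instead of the sample list, for example `parse_pos_lines(open(path, encoding="utf-8"))`, where the path and encoding depend on the file you downloaded.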

UNER Dataset

Happy to announce that the UNER (Urdu Named Entity Recognition) dataset is available for NLP apps. The following NER tags are used in the dataset:

PERSON
LOCATION
ORGANIZATION
DATE
NUMBER
DESIGNATION
TIME

To read more about the dataset, see the paper Urdu NER.

Note

The NER dataset is encoded in UTF-16.
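Because of this encoding, the file has to be opened with an explicit encoding instead of the platform default. The sketch below shows one way to do that in Python. The helper name `read_utf16_lines` and the demo file are hypothetical, and the demo writes its own small file so that it runs without the dataset. If the real file has no byte-order mark, `utf-16-le` or `utf-16-be` may be needed instead of `utf-16`.

```python
import os
import tempfile


def read_utf16_lines(path):
    """Return the non-empty lines of a UTF-16 encoded text file."""
    with open(path, encoding="utf-16") as handle:
        return [line.rstrip("\n") for line in handle if line.strip()]


# Demo only: create a small UTF-16 file, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_ner.txt")  # hypothetical file name
    with open(demo_path, "w", encoding="utf-16") as handle:
        handle.write("پاکستان\nلاہور\n")
    demo_lines = read_utf16_lines(demo_path)

print(demo_lines)
```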

COUNTER (COrpus of Urdu News TExt Reuse) Dataset

This dataset is collected from journalism and can be used for Urdu NLP research. See the COUNTER resource for more information.

Urdu model for SpaCy

An Urdu model for SpaCy is now available. You can use it to build NLP apps easily. Install the package in your working environment:

pip install ur_model-0.0.0.tar.gz

You can use it with the following code:

import spacy
nlp = spacy.load("ur_model")
doc = nlp("میں خوش ہوں کے اردو ماڈل دستیاب ہے۔ ")
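Once a `doc` has been created as above, its tokens can be inspected. The helper below is a minimal sketch that collects each token's text and coarse part-of-speech tag using spaCy's `token.text` and `token.pos_` attributes. The name `token_tags` is hypothetical, and whether `pos_` is populated depends on the pipeline components shipped with `ur_model`, which is an assumption here.

```python
def token_tags(doc):
    """Return (text, coarse POS tag) pairs for every token in a spaCy Doc."""
    return [(token.text, token.pos_) for token in doc]


# Usage with the installed model (requires spaCy and ur_model):
# import spacy
# nlp = spacy.load("ur_model")
# doc = nlp("میں خوش ہوں کے اردو ماڈل دستیاب ہے۔ ")
# for text, pos in token_tags(doc):
#     print(text, pos)
```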

SpaCy

I've also contributed to the popular NLP library SpaCy. You can use it for Urdu word tokenization, POS tagging and other NLP tasks. You can train your own POS model using this dataset.
