StatNLP Datasets

Multilingual Geoquery

A multilingual dataset for Geoquery. Each instance is a sentence annotated with its meaning representation. The corpora in Chinese, Indonesian, Farsi and Swedish were originally released with the paper "Semantic Parsing with Neural Hybrid Trees".

MalwareTextDB

The dataset, in various formats (see the readme for more details), can be found here: MalwareTextDB-1.0.zip (5.5 MB download, 20 MB unzipped). The dataset was originally published in the paper "MalwareTextDB: A Database for Annotated Malware Articles".

Multilingual ATIS

A new multilingual version of the ATIS corpus. The dataset was originally published in the paper "Neural Architectures for Multilingual Semantic Parsing".

NP-annotated SMS dataset

Thanks to Alexander Binder, Jie Yang, Dinh Quang Thinh, as well as 64 undergraduate students for their help in creating the annotations for the NUS SMS Corpus. The annotation guidelines given to the students are also provided.

Chinese Address dataset

The dataset and annotation guidelines are available on GitHub. Thanks to Ali Damo Academy for the Chinese address corpus.

Taobao and Youku NER Dataset

The dataset and annotation guidelines are available on GitHub. Thanks to Ali Damo Academy for the annotations.

statnlp-research/statnlp-datasets
