jesc_small

Small Japanese-English Subtitle Corpus. Sentences are extracted from JESC (Japanese-English Subtitle Corpus) and filtered to those with a length of 4 to 16 words.

Both Japanese and English sentences are tokenized with StanfordNLP (v0.2.0).
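The length filter mentioned above can be sketched roughly as follows. This is only an illustration, not the script used to build the corpus: the function names `within_length` and `filter_pairs` are hypothetical, and it assumes that the 4 and 16 bounds are inclusive, that length is counted in tokens, and that both sides of a pair must satisfy the bound.

```python
def within_length(tokens, min_len=4, max_len=16):
    """Return True if the token count lies in [min_len, max_len] (inclusive; an assumption)."""
    return min_len <= len(tokens) <= max_len


def filter_pairs(pairs, min_len=4, max_len=16):
    """Keep (en_tokens, ja_tokens) pairs where both sides pass the length check.

    Applying the bound to both languages is an assumption; the README does not
    say whether one or both sides were checked.
    """
    return [
        (en, ja)
        for en, ja in pairs
        if within_length(en, min_len, max_len) and within_length(ja, min_len, max_len)
    ]
```

For example, a pair whose English side has three tokens would be dropped under these assumptions, even if the Japanese side has five.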

All texts are encoded in UTF-8. The sentence separator is '\n' and the word separator is ' '.
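Given this format, a minimal loader might look like the sketch below. The helper names `read_tokenized` and `read_parallel` are hypothetical, and the sketch assumes that line N of the `.en` file is the translation of line N of the matching `.ja` file.

```python
def read_tokenized(path):
    """Read a tokenized file: one sentence per line, tokens separated by a single space."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n").split(" ") for line in f]


def read_parallel(en_path, ja_path):
    """Return a list of (en_tokens, ja_tokens) pairs, aligned by line number.

    Line-by-line alignment between the two files is an assumption.
    """
    en = read_tokenized(en_path)
    ja = read_tokenized(ja_path)
    if len(en) != len(ja):
        raise ValueError(f"line count mismatch: {len(en)} vs {len(ja)}")
    return list(zip(en, ja))
```

With the files from this repo in the working directory, `read_parallel("train.en", "train.ja")` would return the training pairs as token lists.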

Additionally, all tokenized data can be downloaded from here.

Corpus statistics

File	#sentences	#words	#vocabulary
train.en	100,000	809,353	29,682
train.ja	100,000	808,157	46,471
dev.en	1,000	8,025	1,827
dev.ja	1,000	8,163	2,340
test.en	1,000	8,057	1,805
test.ja	1,000	8,084	2,306
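Counts like those in the table can be recomputed with a short script. The sketch below is not the script used to produce the table: `corpus_stats` is a hypothetical helper, and it assumes that "#words" is the total number of space-separated tokens and "#vocabulary" is the number of distinct tokens, with no lowercasing or other normalization.

```python
from collections import Counter


def corpus_stats(lines):
    """Return (#sentences, #words, #vocabulary) for an iterable of tokenized lines."""
    n_sentences = 0
    counts = Counter()
    for line in lines:
        n_sentences += 1
        # Split on the single-space word separator and ignore empty tokens.
        counts.update(tok for tok in line.rstrip("\n").split(" ") if tok)
    return n_sentences, sum(counts.values()), len(counts)


def file_stats(path):
    """Compute the same statistics for a UTF-8 file on disk."""
    with open(path, encoding="utf-8") as f:
        return corpus_stats(f)
```

Calling `file_stats("dev.en")` returns a `(sentences, words, vocabulary)` tuple that can be compared against the corresponding table row; small differences would suggest the original counts used a different convention.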

This repo is inspired by small_parallel_enja.

yusugomori/jesc_small

Folders and files

Latest commit

History

Repository files navigation

jesc_small

Corpus statistics

About

Resources

Stars

Watchers

Forks