Download, parse, and filter data from European Parliament Proceedings. Data-ready for The-Pile.

thoppe/The-Pile-EuroParl

The-Pile-EuroParl

Download, parse, and filter the European Parliament Proceedings, data-ready for The-Pile.

Stat Sheet

The Europarl parallel corpus is extracted from the proceedings of the European Parliament. It includes versions in 21 European languages: Romanic (French, Italian, Spanish, Portuguese, Romanian), Germanic (English, Dutch, German, Danish, Swedish), Slavic (Bulgarian, Czech, Polish, Slovak, Slovene), Finno-Ugric (Finnish, Hungarian, Estonian), Baltic (Latvian, Lithuanian), and Greek.

To use this parser, first download the source file

http://www.statmt.org/europarl/v7/europarl.tgz

and unpack it into this directory. The parser looks for all files within the txt subdirectory. Note that the download is slow and may take 12 or more hours.
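The unpack-and-scan step can be sketched as follows. This is a minimal illustration and not the repository's own code: the function names `unpack_archive` and `list_text_files` are hypothetical, and the sketch assumes the archive has already been downloaded and that it unpacks to a top-level txt directory with one subdirectory per language.

```python
import tarfile
from pathlib import Path


def unpack_archive(archive_path, dest_dir="."):
    """Extract a gzip-compressed tar archive (e.g. europarl.tgz) into dest_dir."""
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest_dir)


def list_text_files(root="."):
    """Return every .txt file under <root>/txt, sorted for a stable order."""
    return sorted(Path(root, "txt").rglob("*.txt"))
```

Calling `list_text_files(".")` after unpacking should return paths such as txt/pl/ep-09-10-22-009.txt, which can then be handed to the parsing step.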

The parser removes all tag information and retains only the name. For example, the tag

<SPEAKER ID=77 LANGUAGE="NL" NAME="Pronk">

is reduced to

Pronk
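One way to express this reduction is with two regular expressions. The sketch below is an assumption about how such a step could look, not the parser's actual implementation: `strip_tags` is a hypothetical name, and dropping tags without a NAME attribute entirely is a guess at the intended behavior.

```python
import re

# Matches a SPEAKER tag and captures the value of its NAME attribute.
SPEAKER_TAG = re.compile(r'<SPEAKER\b[^>]*?\bNAME="([^"]*)"[^>]*>')
# Matches any remaining tag (assumption: these are dropped entirely).
OTHER_TAG = re.compile(r"<[^>]+>")


def strip_tags(text):
    """Replace SPEAKER tags with the speaker name and remove all other tags."""
    text = SPEAKER_TAG.sub(r"\1", text)
    return OTHER_TAG.sub("", text)


print(strip_tags('<SPEAKER ID=77 LANGUAGE="NL" NAME="Pronk">'))  # prints: Pronk
```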

Extremely short files (<200 characters) are removed because they do not contain useful language modeling text. A single file, txt/pl/ep-09-10-22-009.txt, fails to open with UTF-8 encoding and is skipped. No other filtering is applied.
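The two filters described above can be sketched in a few lines. The helper name `read_if_usable` is hypothetical, and the sketch assumes the 200-character threshold is measured on the raw file contents; the source does not say whether the length is checked before or after tag removal.

```python
from pathlib import Path

MIN_CHARS = 200  # files shorter than this are dropped


def read_if_usable(path, min_chars=MIN_CHARS):
    """Return the file's text, or None if it is not valid UTF-8 or is too short."""
    try:
        text = Path(path).read_text(encoding="utf-8")
    except UnicodeDecodeError:
        # Files that cannot be decoded as UTF-8 are skipped rather than repaired.
        return None
    if len(text) < min_chars:
        return None
    return text
```

Returning None for both cases keeps the calling loop simple: it can process every path and discard the None results.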

Data source temporarily hosted at https://drive.google.com/file/d/12Q23Y7IKQyjF28xH0Aw6yZaYEx2YIOiB/view?usp=sharing
