Collect dataset #6

summerlight · 2016-03-31T20:19:30Z

We need a script to generate a dataset for our experiments. Our current dataset is from the ALTA-2010 Shared Task. If we need more languages, different annotations, shorter texts, or anything else, we should be able to generate a similar dataset ourselves.

Steps needed:

1. Download Wikipedia dumps. Dumps are named in the format xxwiki, where xx is the language code. All we need here are the "current versions only" dumps.
2. Extract only the text using wikiextractor.
3. Apply the methodology of the paper. You can easily get interlanguage links from the corresponding wiki page (use the BeautifulSoup4 library and find tags with the class "interlanguage-link").
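
For the interlanguage-link part, here is a minimal sketch of the idea. To keep it dependency-free it uses the standard library's `html.parser` rather than BeautifulSoup4; with BeautifulSoup4 the equivalent lookup would be `soup.find_all(class_="interlanguage-link")`. The names `InterlanguageLinkParser`, `interlanguage_links`, and `SAMPLE` are made up for illustration, and the sample markup is a simplified assumption about what the rendered page looks like, so check it against a real page before relying on it.

```python
from html.parser import HTMLParser


class InterlanguageLinkParser(HTMLParser):
    """Collects {language code: URL} from elements marked "interlanguage-link"."""

    def __init__(self):
        super().__init__()
        self.links = {}
        self._inside_item = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "interlanguage-link" in classes:
            # Entering a list item that wraps one interlanguage link.
            self._inside_item = True
        elif tag == "a" and self._inside_item:
            # Assumption: the first <a> inside the item carries the target URL
            # and the language code in hreflang (or lang).
            lang = attrs.get("hreflang") or attrs.get("lang")
            href = attrs.get("href")
            if lang and href:
                self.links[lang] = href
            self._inside_item = False


def interlanguage_links(html):
    """Return a dict mapping language codes to linked page URLs."""
    parser = InterlanguageLinkParser()
    parser.feed(html)
    return parser.links


# Hypothetical, simplified markup for demonstration only.
SAMPLE = """
<ul>
  <li class="interlanguage-link interwiki-de">
    <a href="https://de.wikipedia.org/wiki/Berlin" hreflang="de" lang="de">Deutsch</a>
  </li>
  <li class="interlanguage-link interwiki-fr">
    <a href="https://fr.wikipedia.org/wiki/Berlin" hreflang="fr" lang="fr">Français</a>
  </li>
  <li class="other"><a href="/wiki/Main_Page">Main page</a></li>
</ul>
"""

if __name__ == "__main__":
    for lang, url in sorted(interlanguage_links(SAMPLE).items()):
        print(lang, url)
```

The page HTML itself could be fetched with requests and passed to `interlanguage_links` as a string; that part is left out so the sketch runs offline.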

summerlight · 2016-03-31T20:31:12Z

I committed some utility code to help download Wikipedia dump files. It can also be used for other purposes. You'll need BeautifulSoup4 and requests to run it.
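
For anyone who just wants to see roughly where the dump files live, here is a small sketch that builds a dump URL for a given language. This is not the committed utility, only an illustration. The function name `dump_url` and its parameters are hypothetical, and the default file kind `pages-articles` is an assumption; check the dump index for your wiki to confirm which file is the "current versions only" one you need.

```python
DUMP_HOST = "https://dumps.wikimedia.org"


def dump_url(lang, date="latest", kind="pages-articles"):
    """Build the URL of a Wikipedia dump file for one language.

    lang: language code such as "en" or "ko".
    date: dump date directory, or "latest".
    kind: dump file kind (assumed default; verify against the dump index).
    """
    # Dump directories use underscores where language codes use hyphens.
    wiki = lang.replace("-", "_") + "wiki"
    return f"{DUMP_HOST}/{wiki}/{date}/{wiki}-{date}-{kind}.xml.bz2"


if __name__ == "__main__":
    for code in ("en", "ko"):
        print(dump_url(code))
```

The resulting URL could then be streamed to disk with requests, since the files are large.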

mytony · 2016-04-02T04:06:07Z

Moving on to step 3.

mytony self-assigned this Mar 31, 2016

summerlight closed this as completed Apr 30, 2016
