Collect dataset #6

Closed
summerlight opened this issue Mar 31, 2016 · 2 comments
@summerlight
Owner

We need a script to generate a dataset for experiments. Our current dataset comes from the ALTA-2010 Shared Task. If we ever need more languages, different annotations, shorter texts, or anything else, we should be able to generate a similar dataset ourselves.

Steps needed:

  1. Download Wikipedia dumps. Each Wikipedia's dump is named in the format xxwiki, where xx is the language code. All we need here are the "current versions only" dumps.
  2. Extract only the text using wikiextractor.
  3. Apply the methodology of the paper. Interlanguage links are easy to get from the corresponding wiki page: parse it with the BeautifulSoup4 library and find the tags with the class "interlanguage-link".
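To make the link lookup in step 3 concrete, here is a minimal sketch. It is not code from this repository. It uses only the standard library's `html.parser` instead of BeautifulSoup4 so that it runs without extra dependencies; with BeautifulSoup4 the same lookup would be a search for elements with the class "interlanguage-link". The names `InterlanguageLinkParser`, `interlanguage_links` and `SAMPLE_HTML` are hypothetical, and the sample markup is an assumption about how Wikipedia renders these links (an `<li>` carrying the class, wrapping an `<a>` with `hreflang` and `href`).

```python
from html.parser import HTMLParser


class InterlanguageLinkParser(HTMLParser):
    """Collect language code -> URL pairs from interlanguage-link list items."""

    def __init__(self):
        super().__init__()
        self.links = {}        # language code -> article URL
        self._in_item = False  # True while inside a matching <li>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "li":
            # Only list items carrying the class are interlanguage links.
            self._in_item = "interlanguage-link" in classes
        elif tag == "a" and self._in_item:
            # Assumption: the anchor exposes the language as hreflang or lang.
            lang = attrs.get("hreflang") or attrs.get("lang")
            href = attrs.get("href")
            if lang and href:
                self.links[lang] = href

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_item = False


def interlanguage_links(html):
    """Return a dict mapping language codes to the linked article URLs."""
    parser = InterlanguageLinkParser()
    parser.feed(html)
    return parser.links


# Hand-written sample imitating the assumed page markup.
SAMPLE_HTML = """
<ul>
  <li class="interlanguage-link interwiki-de"><a href="//de.wikipedia.org/wiki/Katze" lang="de" hreflang="de">Deutsch</a></li>
  <li class="interlanguage-link interwiki-fr"><a href="//fr.wikipedia.org/wiki/Chat" lang="fr" hreflang="fr">Français</a></li>
  <li class="other"><a href="/wiki/Help" hreflang="en">Help</a></li>
</ul>
"""

if __name__ == "__main__":
    for lang, url in sorted(interlanguage_links(SAMPLE_HTML).items()):
        print(lang, url)
```

The resulting mapping could then be used to pair each article with its counterparts in other languages; how those pairs feed into the paper's methodology is left open here.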
@summerlight
Owner Author

I committed utility code that helps with downloading Wikipedia dump files. It can also be used for other purposes. You'll need BeautifulSoup4 and requests to run it.
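For reference, a rough sketch of what fetching one dump in step 1 might look like. This is not the committed utility: it uses the standard library rather than requests, and the names `DUMP_BASE`, `dump_url` and `download_dump` are hypothetical. It also assumes the "current versions only" dump is the file published as `pages-meta-current.xml.bz2` under dumps.wikimedia.org, so check the dump index for the language you need before relying on it.

```python
import shutil
import urllib.request

# Assumption: public dump root and naming scheme for the xxwiki dumps.
DUMP_BASE = "https://dumps.wikimedia.org"


def dump_url(lang, date="latest"):
    """Build the URL of the 'current versions only' dump for one language."""
    wiki = f"{lang}wiki"  # e.g. "ko" -> "kowiki"
    return f"{DUMP_BASE}/{wiki}/{date}/{wiki}-{date}-pages-meta-current.xml.bz2"


def download_dump(lang, dest_path, date="latest"):
    """Stream one dump to disk without loading the whole file into memory."""
    with urllib.request.urlopen(dump_url(lang, date)) as response, \
            open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)


if __name__ == "__main__":
    # Printing the URLs is cheap; the download itself can be many gigabytes.
    for code in ("en", "ko", "de"):
        print(dump_url(code))
```

Calling `download_dump("ko", "kowiki.xml.bz2")` would then save the Korean dump locally, ready to be passed to wikiextractor in step 2.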

@mytony mytony self-assigned this Mar 31, 2016
@mytony
Collaborator

mytony commented Apr 2, 2016

Moving on to step 3.
