This repository has been archived by the owner on Jan 3, 2024. It is now read-only.

explore mediawiki parsers instead of parsing HTML directly #58

Open
suyashb95 opened this issue Jul 9, 2020 · 6 comments

Comments

@suyashb95
Owner

suyashb95 commented Jul 9, 2020

Instead of parsing the HTML, use existing MediaWiki parsers (such as mwparserfromhell) as a second stage, since headings, content, tags, comments, etc. are clearly defined and the wikitext content is more compact.
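As a rough illustration of why wikitext is easier to work with: section headings are delimited by matched runs of `=` characters, so even a small stdlib regex sketch can pull them out reliably. (The regex is a stand-in for illustration only; mwparserfromhell's `filter_headings()` does this properly, and the sample wikitext below is hypothetical.)

```python
import re

# Hypothetical sample of Wiktionary-style wikitext; real pages would be
# fetched via the API or read from a dump.
wikitext = """\
==English==
===Noun===
# A challenge, trial.
==Latin==
===Verb===
# I see, perceive; look (at).
"""

# A wikitext heading is a run of 2-6 '=' signs on each side of the title,
# alone on its line; \1 forces the closing run to match the opening one.
HEADING = re.compile(r"^(={2,6})\s*(.*?)\s*\1\s*$", re.MULTILINE)

def extract_headings(text):
    """Return (level, title) pairs for every heading in the wikitext."""
    return [(len(eq), title) for eq, title in HEADING.findall(text)]

print(extract_headings(wikitext))
# → [(2, 'English'), (3, 'Noun'), (2, 'Latin'), (3, 'Verb')]
```

Doing the same from rendered HTML would mean locating the right `<h2>`/`<h3>` tags and stripping edit-section links first.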

@suyashb95 suyashb95 self-assigned this Jul 9, 2020
@suyashb95 suyashb95 changed the title expore mediawiki parsers instead of parsing HTML directly explore mediawiki parsers instead of parsing HTML directly Jul 9, 2020
@ghost

ghost commented Sep 3, 2020

Hello Suyash, I'd like to work on this if I have the time. Some questions:

  • Are you aiming to use mwparserfromhell to read the HTML content parsed by beautifulsoup, or to read the wikitext in Wiktionary XML dump files?
  • What are the prerequisites for you accepting a pull request? (as I see some pull requests haven't been merged)

@suyashb95
Owner Author

Hi @sehwol, thank you for your interest in this! I was planning to use mwparserfromhell to parse the wikitext directly instead of HTML, mainly for the following reasons:

  • I've seen many pages that put the same kind of content in different HTML tags or structures, so I was hoping this approach would be more resilient. Since we won't be dealing with HTML, we won't have to clean it up the way we're doing here

  • If the parser works on wikitext, it'll be easy to make it work with wikitext dumps later on instead of making HTTP calls

The wikitext can be retrieved using Wiktionary's API:
https://en.wiktionary.org/w/api.php?action=parse&page=test&prop=wikitext&formatversion=2&format=json
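For reference, a request like the one above can be assembled with the standard library; a minimal sketch, with the endpoint and parameters taken from the URL shown here:

```python
from urllib.parse import urlencode

API = "https://en.wiktionary.org/w/api.php"

def wikitext_url(page):
    """Build the Wiktionary API URL that returns a page's raw wikitext."""
    params = {
        "action": "parse",
        "page": page,
        "prop": "wikitext",
        "formatversion": 2,
        "format": "json",
    }
    # urlencode also percent-escapes page titles with spaces or unicode.
    return API + "?" + urlencode(params)

print(wikitext_url("test"))
# → https://en.wiktionary.org/w/api.php?action=parse&page=test&prop=wikitext&formatversion=2&format=json
```

The response is JSON with the raw wikitext under `parse.wikitext`, which could then be handed to mwparserfromhell.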

I'll accept a PR if the tests pass and the code looks good to me. The two pending ones aren't really complete, so I haven't merged them yet.

@ghost

ghost commented Sep 4, 2020

Hi Suyash, do the tests all run on your computer?

If I fetch words like "video (Latin)", oldid 50291344, I'm sometimes getting output like this:

...
                    "text": [
                        "Lua error in Module:la-verb at line 747: The parameter \"conj\" is not used by this template.",
                        "I see, perceive; look (at)",
...

Source: https://en.wiktionary.org/wiki/video?printable=yes&oldid=50291344#Verb_2

I'm not sure if Wiktionary just developed a bug or if it's something else.

Edit:
I've started splitting up the tests and adding a bit more logging so people can tell which word and language specifically is failing a test. This does mean that I'm adding parameterized==0.7.4 as a dependency.

split-tests

@suyashb95
Owner Author

Tbh I haven't worked on this project in a while, but I'll take a look at the tests right away. The exception looks like an error on Wiktionary's end that turns up when the wikitext is rendered. Adding parameterized==0.7.4 sounds like a great idea 😊. Do you mind creating a separate issue for this and submitting a PR once you've fixed the tests?

@frankier

I think this is one of the most comprehensive parsers that does this: https://github.com/tatuylonen/wiktextract

@suyashb95
Owner Author

@frankier this looks very promising, thanks for pointing it out!
