This repository has been archived by the owner on Jan 3, 2024. It is now read-only.

explore mediawiki parsers instead of parsing HTML directly #58

Open
suyashb95 opened this issue Jul 9, 2020 · 6 comments

Comments

@suyashb95
Owner

suyashb95 commented Jul 9, 2020

Instead of parsing the HTML, use existing MediaWiki parsers (such as mwparserfromhell) as a second stage, since headings, content, tags, comments, etc. are clearly defined and the wikitext content is more compact.
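As a rough illustration of why wikitext is easier to work with: section headings are delimited by matched runs of `=` characters, so even a small stdlib regex sketch can pull them out reliably. (The regex is a stand-in for illustration only; mwparserfromhell's `filter_headings()` does this properly, and the sample wikitext below is hypothetical.)

```python
import re

# Hypothetical sample of Wiktionary-style wikitext; real pages would be
# fetched via the API or read from a dump.
wikitext = """\
==English==
===Noun===
# A challenge, trial.
==Latin==
===Verb===
# I see, perceive; look (at).
"""

# A wikitext heading is a run of 2-6 '=' signs on each side of the title,
# alone on its line; \1 forces the closing run to match the opening one.
HEADING = re.compile(r"^(={2,6})\s*(.*?)\s*\1\s*$", re.MULTILINE)

def extract_headings(text):
    """Return (level, title) pairs for every heading in the wikitext."""
    return [(len(eq), title) for eq, title in HEADING.findall(text)]

print(extract_headings(wikitext))
# → [(2, 'English'), (3, 'Noun'), (2, 'Latin'), (3, 'Verb')]
```

Doing the same from rendered HTML would mean locating the right `<h2>`/`<h3>` tags and stripping edit-section links first.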

@suyashb95 suyashb95 self-assigned this Jul 9, 2020
@suyashb95 suyashb95 changed the title expore mediawiki parsers instead of parsing HTML directly explore mediawiki parsers instead of parsing HTML directly Jul 9, 2020
@ghost

ghost commented Sep 3, 2020

Hello Suyash, I'd like to work on this if I have the time. Some questions:

  • Are you aiming to use mwparserfromhell to read the HTML content parsed by beautifulsoup, or to read the wikitext in Wiktionary XML dump files?
  • What are the prerequisites for you accepting a pull request? (as I see some pull requests haven't been merged)

@suyashb95
Owner Author

Hi @sehwol, thank you for your interest in this! I was planning to use mwparserfromhell to parse the wikitext directly instead of HTML, mainly for the following reasons:

  • I've seen many pages that put the same kind of content in different HTML tags or structures, so I was hoping this approach would be more resilient. Since we won't be dealing with HTML, we won't have to clean it up the way we're doing here

  • If the parser works on wikitext, it'll be easy to make it work with wikitext dumps later on instead of making HTTP calls

The wikitext can be retrieved using Wiktionary's API:
https://en.wiktionary.org/w/api.php?action=parse&page=test&prop=wikitext&formatversion=2&format=json
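For reference, a request like the one above can be assembled with the standard library; a minimal sketch, with the endpoint and parameters taken from the URL shown here:

```python
from urllib.parse import urlencode

API = "https://en.wiktionary.org/w/api.php"

def wikitext_url(page):
    """Build the Wiktionary API URL that returns a page's raw wikitext."""
    params = {
        "action": "parse",
        "page": page,
        "prop": "wikitext",
        "formatversion": 2,
        "format": "json",
    }
    # urlencode also percent-escapes page titles with spaces or unicode.
    return API + "?" + urlencode(params)

print(wikitext_url("test"))
# → https://en.wiktionary.org/w/api.php?action=parse&page=test&prop=wikitext&formatversion=2&format=json
```

The response is JSON with the raw wikitext under `parse.wikitext`, which could then be handed to mwparserfromhell.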

I'll accept a PR if the tests pass and the code looks good to me. The two pending ones aren't really complete, so I haven't merged them yet.

@ghost

ghost commented Sep 4, 2020

Hi Suyash, do the tests all run on your computer?

If I fetch words like "video (Latin)", oldid 50291344, I'm sometimes getting output like this:

...
                    "text": [
                        "Lua error in Module:la-verb at line 747: The parameter \"conj\" is not used by this template.",
                        "I see, perceive; look (at)",
...

Source: https://en.wiktionary.org/wiki/video?printable=yes&oldid=50291344#Verb_2

I'm not sure if Wiktionary just developed a bug or if it's something else.

Edit:
I've started splitting up the tests and adding a bit more logging so people can tell which word and language specifically is failing a test. This does mean that I'm adding parameterized==0.7.4 as a dependency.

split-tests

@suyashb95
Owner Author

Tbh I haven't worked on this project in a while, but I'll take a look at the tests right away. The exception looks like an error on Wiktionary's end that turns up when the wikitext is rendered. Adding parameterized==0.7.4 sounds like a great idea 😊. Do you mind creating a separate issue for this and submitting a PR once you've fixed the tests?

@frankier

I think this is one of the most comprehensive parsers that does this: https://github.com/tatuylonen/wiktextract

@suyashb95
Owner Author

@frankier this looks very promising, thanks for pointing it out!
