xml title passes through doc object #59

Closed
itaiperi opened this issue Sep 10, 2018 · 12 comments · Fixed by #35

Comments

@itaiperi
Contributor

itaiperi commented Sep 10, 2018

Hi all,
There's a problem with redirection pages.
As it stands, in the worker/index.js file, the wiki page is parsed using the parsePage function.

let page = parsePage(xml);

In it, there's a call to the shouldSkip function, which returns true if the page is a redirection page. In that case, parsePage returns null to the caller (in index.js). Neither shouldSkip nor parsePage checks whether the skip_redirects option is true or false.
As a result, redirection pages are ignored in index.js no matter the value of the skip_redirects option.
Moreover, when I do change the shouldSkip function's return value to false, in order to not skip redirects, the redirection pages are processed like regular pages. This behavior seems unintuitive to me. I think the behavior should be that the redirection page should have a special "redirection" field which should point to the redirected-to page. This can be very helpful, since I'd like to treat the redirection page just like the redirected-to page, in terms of the text of the page, etc., so I'd like to be able to get to the redirected-to page from the redirection page.
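
A minimal sketch of the behaviour proposed above, not the project's actual code. The function names (`redirectTarget`, `shouldSkipPage`, `toRecord`) and the `redirectTo` field are hypothetical; only the `skip_redirects` option name comes from this thread.

```javascript
// Hypothetical sketch: not dumpster-dive's real implementation.
const REDIRECT_RE = /^\s*#redirect\s*\[\[([^\]|#]+)/i;

// Returns the redirect target title, or null if the wikitext is not a redirect.
function redirectTarget(wikitext) {
  const match = REDIRECT_RE.exec(wikitext);
  return match ? match[1].trim() : null;
}

// Skip a redirect only when the caller explicitly asked for it.
function shouldSkipPage(wikitext, options) {
  const isRedirect = redirectTarget(wikitext) !== null;
  return isRedirect && options.skip_redirects === true;
}

// Build the record to store; redirects carry a `redirectTo` field
// pointing at the redirected-to page instead of their own text.
function toRecord(title, wikitext, options) {
  if (shouldSkipPage(wikitext, options)) return null;
  const target = redirectTarget(wikitext);
  return target === null
    ? { title, text: wikitext }
    : { title, redirectTo: target };
}

console.log(toRecord('USA', '#REDIRECT [[United States]]', { skip_redirects: false }));
```

With a record shaped like this, a consumer could follow `redirectTo` to look up the redirected-to page and reuse its text.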

@spencermountain
Owner

hi, I agree completely.
oh, this functionality appears to be missing from the wtf-wikipedia api since a recent refactor. you can see we parse the redirect link here, there just seems to be no good way to reference it right now.

will add this as an issue there, then look at storing redirect pages in some kind of json object with their destination link.

thanks!

@spencermountain
Owner

hey @noodlez04 i've got this supported in 3.6.0
if skip_redirects=false, the redirect destination will be on doc.redirectTo.page as you can see in this test
lemme know if you see anything weird
cheers

@itaiperi
Contributor Author

Hi @spencermountain
I've downloaded the newest version (3.6.1), and worked with the following simplewiki dump: simplewiki-20180901-pages-articles.xml.bz2.
I've tried running dumpster with and without skip_redirects, but in both cases, when I do the following query in mongo: db.pages.findOne({title: "USA"}), it returns null. It doesn't even save the redirect page into mongo.
You can make sure that it is in fact a redirect here: https://simple.wikipedia.org/wiki/USA
What's even weirder, is that I tried to make my own custom function to run with dumpster:

const dumpster = require('dumpster-dive')
const options = {
  file: process.argv[2],
  db: 'simplewiki',
  skip_redirects: false,
  custom: function(doc) {
    // keep internal links only, de-duplicated
    let links = doc.links().filter(link => {
      return !(link.type == 'external')
    }).map(link => link.page)
    links = new Set(links)
    links = [...links]
    if(doc.title() == "USA") {
      // redirectTo() is empty
      console.log(doc.redirectTo())
      console.log(doc.title())
      console.log(doc.text())
    }
    if(Object.keys(doc.redirectTo()).length > 0) {
      // this never prints
      console.log(doc.redirectTo())
    }
    return { title: doc.title(), categories: doc.categories(), text: doc.text(), links: links }
  }
}
dumpster(options, () => console.log('Parsing is Done!'))

And in the first if, the redirectTo() print is an empty object. The second if is never entered (nothing printed).
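
As a side note on the `Object.keys(...)` check above: it throws a TypeError if the value is ever `undefined` or `null`. Below is a small defensive sketch; the helper name `hasRedirectTarget` and the stub documents are hypothetical, and whether `doc.redirectTo()` can actually return `undefined` is an assumption this thread does not confirm.

```javascript
// Hypothetical helper: true only when `value` is a non-empty object.
function hasRedirectTarget(value) {
  return value !== null && typeof value === 'object' && Object.keys(value).length > 0;
}

// Stub documents standing in for whatever the parser hands to `custom`.
const redirectDoc = { redirectTo: () => ({ page: 'United States' }) };
const emptyDoc = { redirectTo: () => ({}) };
const missingDoc = { redirectTo: () => undefined };

for (const doc of [redirectDoc, emptyDoc, missingDoc]) {
  console.log(hasRedirectTarget(doc.redirectTo()));
}
```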

Any ideas?

Thanks!

@aymansalama

I have run it on the whole English Wikipedia dump of September 2018. The total number of articles loaded is 1,274,403. However, as of 18 September 2018, there are 5,718,153 articles in the English Wikipedia, so more than 4 million articles are missing.
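
The gap can be worked out from the two figures quoted above. This is only the arithmetic; it says nothing about why the pages are missing, and the variable names are made up for illustration.

```javascript
// Figures quoted in the comment above.
const loaded = 1274403;
const reported = 5718153;

// Articles not loaded, and the share of reported articles that were loaded.
const missing = reported - loaded;
const loadedShare = (loaded / reported) * 100;

console.log(`missing: ${missing.toLocaleString('en-US')}`);
console.log(`loaded: ${loadedShare.toFixed(1)}% of reported articles`);
```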

@spencermountain
Owner

hey @noodlez04 yeah you're very close, sorry this is not very clear right now.
here's how i'd do it:
https://runkit.com/spencermountain/5ba24e901a4257001154e879

@aymansalama i noticed that. not sure what that's about. i'm gonna add a redirect & disambiguation count to the next release, so that may clear things up, but keep poking around.

@itaiperi
Contributor Author

hey @spencermountain, first of all I'd like to thank you for the really quick responses! Sure makes the whole process much more pleasant.

Second, your answer does not explain why I can't find the USA page in mongo, as I noted in the previous comment.
Third, and probably most important, I am getting a really strange behaviour. I now run the following custom function:

const dumpster = require('dumpster-dive')
const options = {
  file: process.argv[2],
  db: 'simplewiki',
  skip_redirects: false,
  custom: function(doc) {
    // keep internal links only, de-duplicated
    let links = doc.links().filter(link => {
      return !(link.type == 'external')
    }).map(link => link.page)
    links = new Set(links)
    links = [...links]
    if(doc.title() == "USA" || doc.title() == "United States") {
      // prints doc.title() as "United States" for additional pages, on top of the "United States" page
      console.log("******* ", doc.title(), doc.isRedirect(), doc.json().redirectTo, " *******")
      console.log(doc.text())
    }
    if(doc.isRedirect()) {
      // this never prints
      console.log(doc.json().redirectTo)
    }
    return { title: doc.title(), categories: doc.categories(), text: doc.text(), links: links }
  }
}
dumpster(options, () => console.log('Parsing is Done!'))

and get the following output: output.txt

Note that the second if is never entered, again, though surely some pages are redirects.
Also note that USA doesn't print at all, and several pages are printed under the "United States" title, such as https://simple.wikipedia.org/wiki/United_States_at_the_2018_Winter_Paralympics (I validated thanks to the text. it's the third entry that is printed in the output file). However, in the mongodb, the page with the title United States at the 2018 Winter Paralympics does exist, though when printing doc.title() it prints only "United States" as can be seen above.

The page entry of USA as it appears in the dump is as follows:

<page>
  <title>USA</title>
  <ns>0</ns>
  <id>869</id>
  <redirect title="United States" />
  <revision>
    <id>6170530</id>
    <parentid>6168995</parentid>
    <timestamp>2018-06-26T09:15:54Z</timestamp>
    <contributor>
      <username>Auntof6</username>
      <id>22027</id>
    </contributor>
    <comment>Reverted to revision 6148877 by Hiàn: redirect back to correct page. ([[WP:TW|TW]])</comment>
    <model>wikitext</model>
    <format>text/x-wiki</format>
    <text xml:space="preserve">#REDIRECT [[United States]]</text>
    <sha1>ju80gui3dho5bw70q7obfkdmjkt1i7c</sha1>
  </revision>
</page>

And so I don't understand why it doesn't print at all.
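
For reference, the title and redirect target are both available in the `<page>` XML itself, independent of the wikitext. Below is a rough sketch of reading them with regular expressions; `readPageHeader` and `decodeEntities` are hypothetical helpers written for this comment, not functions from dumpster-dive or wtf_wikipedia, and a real XML parser would be more robust.

```javascript
// Hypothetical helper: decode the few XML entities that can appear in a title.
function decodeEntities(s) {
  return s
    .replace(/&quot;/g, '"')
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&amp;/g, '&'); // last, so "&amp;lt;" becomes "&lt;" and not "<"
}

// Pull the <title> text and the <redirect title="..."/> attribute out of one <page> block.
function readPageHeader(pageXml) {
  const title = /<title>([^<]*)<\/title>/.exec(pageXml);
  const redirect = /<redirect\s+title="([^"]*)"\s*\/>/.exec(pageXml);
  return {
    title: title ? decodeEntities(title[1]) : null,
    redirectTo: redirect ? decodeEntities(redirect[1]) : null,
  };
}

// Trimmed copy of the USA entry shown above.
const sample = `<page>
  <title>USA</title>
  <ns>0</ns>
  <id>869</id>
  <redirect title="United States" />
</page>`;

console.log(readPageHeader(sample));
```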

@spencermountain
Owner

ah, sorry. just figured out why the redirects haven't been working last couple releases.
fix landed, release coming soon.

@itaiperi
Contributor Author

Thanks again for your dedication :) Will update how it works for me once the new version is out

@spencermountain
Owner

this should work now in v4.0.0
thanks!

@itaiperi
Contributor Author

itaiperi commented Oct 5, 2018

Hi @spencermountain,
Sorry for the late response.
I've tried working with version 4.0.1.
When I try, in the custom function, to print the doc if its title is USA, nothing is printed, BUT in MongoDB there's an entry with the title attribute USA. Very weird.
Moreover, I tried to "capture" the United States document, by using

const index = doc.text().indexOf("The United States of America (commonly")
if(index > -1) {
  // prints doc.title() as "United States" for additional pages, on top of the "United States" page
  console.log("\n******* ", doc.title(), doc.isRedirect(), doc.json().redirectTo, " *******")
  console.log(doc)
  console.log(doc.text().substring(index, index + 50))
}

And I get the following print:

*******  undefined false undefined  *******
Document { options: {} }
The United States of America (commonly referred to

You can see that index should indeed capture the right doc by looking at this wikipedia page, which does include the text searched for by index (Please note that I'm using simplewiki): https://simple.wikipedia.org/w/index.php?title=United_States&oldid=6237916 (my dump is from September 1st, 2018)

@spencermountain
Owner

yeah, that should work. Looks like the xml title is not getting passed to the doc object properly

@spencermountain changed the title from "Redirection pages not processed correctly" to "xml title passes through doc object" on Oct 9, 2018
@spencermountain
Owner

hey, this works now in 4.0.2, cheers
