xml title passes through doc object #59
Comments
hi, I agree completely. will add this as an issue there, then look at storing redirect pages in some kind of json object with their destination link. thanks!
hey @noodlez04 i've got this supported in 3.6.0
Hi @spencermountain
And in the first if, the redirectTo() print is an empty object. The second if is never entered (nothing printed). Any ideas? Thanks!
I have run it on the whole English Wikipedia dump of Sept 2018. The total number of articles loaded is 1,274,403. However, as of 18 September 2018, there are 5,718,153 articles in the English Wikipedia. So this means that more than 4 million articles are missing.
hey @noodlez04 yeah you're very close, sorry this is not very clear right now. @aymansalama i noticed that. not sure what that's about. i'm gonna add a redirect & disambiguation count to the next release, so that may clear things up, but keep poking around.
hey @spencermountain, first of all I'd like to thank you for the really quick responses! Sure makes the whole process much more pleasant. Second, your answer does not explain why I can't find the USA page in mongo, as I noted in the previous comment.
and get the following output: output.txt Note that the second The page entry of
And so I don't understand why it doesn't print at all.
ah, sorry. just figured out why the redirects haven't been working last couple releases.
Thanks again for your dedication :) Will update how it works for me once the new version is out
this should work now in v4.0.0
Hi @spencermountain,
And I get the following print:
You can see that index should indeed capture the right doc by looking at this wikipedia page, which does include the text searched for by index (Please note that I'm using
yeah, that should work. Looks like the xml title is not getting passed to the doc object properly
hey, this works now in
Hi all,
There's a problem with redirection pages.
As it stands, in the `worker/index.js` file, the wiki page is parsed using the `parsePage` function (dumpster-dive/src/worker/index.js, line 23 in e7c6b83).

In it, there's a call to the `shouldSkip` function, which returns `true` if the page is a redirection page. In such a case, the `parsePage` function returns null to the calling function (in `index.js`). There's no check in the `shouldSkip` or `parsePage` functions of whether the `skip_redirects` option is false or true.

As a result, in `index.js`, pages which are redirection pages are ignored no matter the value of the `skip_redirects` option.

Moreover, when I do change the `shouldSkip` function's return value to false, in order to not skip redirects, the redirection pages are processed like regular pages. This behavior seems unintuitive to me. I think the redirection page should have a special "redirection" field which points to the redirected-to page. This can be very helpful, since I'd like to treat the redirection page just like the redirected-to page, in terms of the text of the page, etc., so I'd like to be able to get to the redirected-to page from the redirection page.
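
To make the request concrete, here is a minimal sketch of the behavior I have in mind. It is not dumpster-dive's actual code: `getRedirectTarget`, `parsePageSketch`, the `isRedirect` and `redirectTo` field names, and the regular expression are all hypothetical, and I'm assuming that redirects should still be skipped unless `skip_redirects` is explicitly set to false.

```javascript
// Hypothetical sketch: none of these names come from dumpster-dive's source.
// Matches wikitext such as "#REDIRECT [[United States]]" and captures the
// target title, stopping before any "|label" or "#section" part.
const REDIRECT_PATTERN = /^\s*#redirect\s*:?\s*\[\[([^\]|#]+)/i;

// Return the redirect target title, or null when the text is not a redirect.
function getRedirectTarget(wikitext) {
  const match = REDIRECT_PATTERN.exec(wikitext || '');
  return match ? match[1].trim() : null;
}

// Build the document to store, or return null when the page is skipped.
function parsePageSketch(page, options = {}) {
  const target = getRedirectTarget(page.text);
  if (target !== null) {
    // Consult the option instead of skipping unconditionally.
    // Assumption: skipping stays the default unless the option is false.
    if (options.skip_redirects !== false) {
      return null;
    }
    // Store the redirect with a pointer to its destination page.
    return { title: page.title, isRedirect: true, redirectTo: target };
  }
  return { title: page.title, isRedirect: false, text: page.text };
}
```

With something like this, a stored redirect document for "USA" would carry `redirectTo: 'United States'`, so the redirected-to page could be looked up by title from the redirection page.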