Add support for revisionID #568

Merged
spencermountain merged 1 commit into spencermountain:dev on Nov 15, 2023

Conversation

dagingaa
Contributor

This change adds initial support for revisionID, passed in through options. It is useful for checking for revision changes between two Wikipedia dumps, such as when running dumpster-dip on a monthly basis to keep a search database up to date (for RAG, for example).

Mostly I just missed having this, and I plan to submit a follow-up PR to dumpster-dip to have it parse the revisionID and pass it in so I can use it.

Note that this change does not yet update the README and types. I will do that, but I wanted to wait for feedback on naming etc. first.
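To make the dump-comparison use case concrete, here is a minimal sketch. It assumes you have already collected a title-to-revisionID Map from each dump; `changedTitles` and the sample values are made up for illustration and are not part of wtf_wikipedia or dumpster-dip.

```javascript
// Hypothetical helper: list the titles whose revision ID differs between
// two dumps, including titles that only exist in the newer dump.
function changedTitles(previous, current) {
  const changed = []
  for (const [title, revisionID] of current) {
    // previous.get() is undefined for new pages, so they count as changed
    if (previous.get(title) !== revisionID) changed.push(title)
  }
  return changed
}

// Made-up sample data standing in for two monthly dumps
const previousDump = new Map([
  ['Toronto Raptors', '100'],
  ['Fubar', '200'],
])
const currentDump = new Map([
  ['Toronto Raptors', '101'],
  ['Fubar', '200'],
  ['Counter-Strike', '300'],
])

console.log(changedTitles(previousDump, currentDump))
// prints [ 'Toronto Raptors', 'Counter-Strike' ]
```

Only the returned titles would need re-indexing. Pages deleted between dumps are not reported by this sketch.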

dagingaa added a commit to dagingaa/dumpster-dip that referenced this pull request Nov 15, 2023
This change adds a feature to extract the revision id from the wikidump markup.

This is useful for checking for changes between two different wikidumps if you're so inclined.

Depends on spencermountain/wtf_wikipedia#568
@dagingaa
Contributor Author

Mostly because regex is utterly unreadable, here's an explanation courtesy of ChatGPT:

Explanation:

  • <revision>: This matches the start of the <revision> tag.
  • [\s\S]*?: This matches any character, including newlines ([\s\S]), as few times as needed (non-greedy, due to *?). This makes the match stop at the first <id> tag that follows the opening <revision> tag.
  • <id>: This matches the start of the <id> tag within the <revision> tag.
  • (\d+): This is a capturing group that matches one or more digits (\d+). This represents the id number.
  • </id>: This matches the end of the <id> tag.

@spencermountain spencermountain changed the base branch from master to dev November 15, 2023 16:58
@spencermountain spencermountain merged commit 505afa4 into spencermountain:dev Nov 15, 2023
@spencermountain
Owner

this is spectacular. Thank you.
I've put this on dev branch, so it can make the next release, which should be in a few days.
I've added TypeScript support for the new method. Feel free to document things as you see fit.
cheers!

spencermountain added a commit that referenced this pull request Nov 15, 2023
@spencermountain spencermountain mentioned this pull request Nov 15, 2023
Merged
@spencermountain
Owner

just kidding - this is released in 10.2.0
will get to updating dumpster-dip this week. thanks for the help!

@spencermountain
Owner

hey, could we also grab revisionID from the api when we do a fetch?
@MarketingPip - wanna take a crack at it?
this is a cool feature.
cheers

@MarketingPip
Contributor

@spencermountain - sure can.

I don't think this messes anything up but - wanna take a look see?

https://en.wikipedia.org/w/api.php?action=query&prop=revisions%7Cpageprops&rvprop=content|ids&maxlag=5&rvslots=main&origin=*&format=json&redirects=true&titles=Toronto_Raptors

Note: the ids prop is added for future reference. I will make a PR in advance, run some tests, and see what else you want to grab. I will get the rev / parent ID. Do you also want an option to search via rev ID?
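For reference, a rough sketch of how that query could be built and how the rev / parent ID could be read from the JSON response. `buildQueryURL` and `readRevisionIDs` are hypothetical names, not the library's internals, and the sample response only mimics the default MediaWiki JSON shape with made-up values.

```javascript
// Hypothetical helper: builds the same query as the URL above
function buildQueryURL(title, domain = 'en.wikipedia.org') {
  const params = new URLSearchParams({
    action: 'query',
    prop: 'revisions|pageprops',
    rvprop: 'content|ids', // 'ids' is what adds revid and parentid
    maxlag: '5',
    rvslots: 'main',
    origin: '*',
    format: 'json',
    redirects: 'true',
    titles: title,
  })
  return `https://${domain}/w/api.php?${params}`
}

// Hypothetical helper: pulls the IDs out of the first page in the response
function readRevisionIDs(response) {
  const pages = (response.query && response.query.pages) || {}
  const page = Object.values(pages)[0]
  const rev = page && page.revisions && page.revisions[0]
  if (!rev) return null
  return { revisionID: rev.revid, parentID: rev.parentid }
}

// Made-up response, trimmed to the fields used here
const sampleResponse = {
  query: {
    pages: {
      72879: {
        title: 'Toronto Raptors',
        revisions: [{ revid: 1185000000, parentid: 1184000000 }],
      },
    },
  },
}

console.log(buildQueryURL('Toronto_Raptors'))
console.log(readRevisionIDs(sampleResponse))
```

`readRevisionIDs` returns null when the page has no revisions, which is what a missing page would look like.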

@MarketingPip
Contributor

@spencermountain - I got most of the work done for getting revisionID. I will leave it to you to build the query for looking up a specific revision (if you decide to support that).

That said, in a junk / play branch I modified the tests' expected results for the Italian and CSGO Wikipedia pages, though I am afraid this will cause failures in future builds once a revision changes and the ID no longer matches. Let me know how you want me to modify the test and I will submit it tomorrow or the next day.

@spencermountain
Owner

ah, perfect. yeah, that's great.
Are you thinking of this?

wtf('Fubar', {revisionID: '372618'})

to fetch an older version?
never thought of that - that would be cool. As long as it doesn't get really complicated - Go for it!

thanks for your help
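One possible shape for that lookup, as a sketch only: the MediaWiki query API accepts a `revids` parameter in place of `titles`, so an older version could be requested by ID. `buildRevisionURL` is a hypothetical name, and nothing here is confirmed to be how wtf_wikipedia would implement the option.

```javascript
// Hypothetical helper: query a specific revision by ID instead of by title
function buildRevisionURL(revisionID, domain = 'en.wikipedia.org') {
  const params = new URLSearchParams({
    action: 'query',
    prop: 'revisions',
    rvprop: 'content|ids',
    rvslots: 'main',
    origin: '*',
    format: 'json',
    revids: String(revisionID), // replaces the usual titles parameter
  })
  return `https://${domain}/w/api.php?${params}`
}

console.log(buildRevisionURL('372618'))
```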

@MarketingPip
Contributor

@spencermountain - I am grabbing current revision ID (but I will see about grabbing a previous version if it doesn't get messy).
