Add feature to extract latest revisionID from markup #3

Merged: 2 commits, Nov 19, 2023

dagingaa
Contributor

This change adds a feature to extract the revision id from the wikidump markup.

This is useful for checking for changes between two different wikidumps if you're so inclined.

Depends on spencermountain/wtf_wikipedia#568
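
As a rough illustration of the idea (not the code in this PR), the revision id can be pulled from a single `<page>` block of a MediaWiki XML dump with a regex, and two dumps can then be compared by id. The function names `extractRevisionID` and `changedTitles` and the sample XML are made up for this sketch, and it assumes a dump with one `<revision>` per page (such as pages-articles); the regex actually used by the library may differ.

```javascript
// Sketch only: hypothetical helpers, not the library's API.

// Return the revision id of a <page> block, or null if none is found.
// The page's own <id> sits before <revision>, so we only look after the
// <revision> tag; <parentid> and the contributor <id> come later or do not
// match the literal "<id>" first.
function extractRevisionID(pageXml) {
  const match = pageXml.match(/<revision>[\s\S]*?<id>(\d+)<\/id>/)
  return match ? Number(match[1]) : null
}

// Given two Maps of title -> revision id (old dump, new dump), list the
// titles whose revision id differs. Pages missing from the old dump are
// reported as changed too.
function changedTitles(oldRevs, newRevs) {
  const changed = []
  for (const [title, rev] of newRevs) {
    if (oldRevs.get(title) !== rev) changed.push(title)
  }
  return changed
}

// Made-up sample page, shaped like a dump entry.
const samplePage = `<page>
  <title>Toronto</title>
  <id>64646</id>
  <revision>
    <id>1185000001</id>
    <parentid>1184000000</parentid>
    <contributor><username>Example</username><id>42</id></contributor>
  </revision>
</page>`

console.log(extractRevisionID(samplePage)) // 1185000001
```

Comparing ids rather than page text keeps the check cheap: only pages reported by `changedTitles` need to be re-parsed.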

Bah, rookie mistake
@spencermountain spencermountain changed the base branch from main to dev November 19, 2023 15:58
@spencermountain spencermountain merged commit e7a2f81 into spencermountain:dev Nov 19, 2023
@spencermountain
Owner

adding this to dev branch, where i've got some half-brained stuff I should really sort out.
Will try to work through it shortly.
Will let you know when it's released
thanks!

@dagingaa
Contributor Author

This is great :D Somehow I didn't get notified about this, will fix that. But thank you! (Also I now realize I added the regex explainer to the wrong place)

@spencermountain
Owner

hey Dag-Inge - I've got your work on the dev branch of this library, and was wondering if you had any other features or changes that you'd like to see made before a breaking release.
I've added some new output modes, and a script to download/unpack the dump.
I've also started adding support for the pageview data, along with your rev data, to be included in the doc at parse-time.
There's also a new ability to add plugins to each worker ahead of time.

Am on-laptop this week, so can't test properly on a full dump. Wanted to hear what sort of changes you'd like to see, as it's a good time for brainstorming.
cheers

@dagingaa
Contributor Author

dagingaa commented Dec 1, 2023

I was about to say pageview data would be amazing! Perhaps lastUpdated would be nice as well?

Just want to say this library has been invaluable in our experiments with Wikipedia data and RAG, so thank you for taking the time to create this!

Let me know if you need any help testing!

(Why oh why don't I get email notifications for this repo :( )

@spencermountain
Owner

lastUpdated is a great idea. will add it this week.
cheers
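
For reference, a last-updated value can be read from the same place as the revision id, since each `<revision>` in a MediaWiki XML dump carries a `<timestamp>` in ISO 8601 form. The sketch below is only an illustration under that assumption; `extractLastUpdated` is a hypothetical name and not necessarily how the library implements it.

```javascript
// Sketch only: hypothetical helper, not the library's API.
// Returns the <timestamp> of the page's revision as a Date, or null if
// it is missing or cannot be parsed.
function extractLastUpdated(pageXml) {
  const match = pageXml.match(/<revision>[\s\S]*?<timestamp>([^<]+)<\/timestamp>/)
  if (!match) return null
  const date = new Date(match[1].trim())
  return Number.isNaN(date.getTime()) ? null : date
}
```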

@spencermountain
Owner

spencermountain commented Dec 28, 2023

.revisionID() and .timestamp() make round-trips now in dumpster-dip 2.0.0 🥳
thank you for your help
