Specify Collection Range in JSON/HTML #1

Closed
BrokenEagle opened this issue Nov 7, 2016 · 4 comments

Comments

@BrokenEagle

I've been trying to validate the data in your reports against my own. In many cases the numbers are similar, but in others they are far apart. (See http://danbooru.donmai.us/forum_topics/13112 for my attempt to compare them against the Nov 2 collection on Isshiki.)

Theoretically, if we process the same data set the same way we should get the same results.

Adding the timestamp range (from and to) in Zulu UTC to the HTML/JSON would help with this. For the HTML, it doesn't even need to be a visible item, as long as it's in the page somewhere.

Adding the ID range would be even better, in case there are some discrepancies with the timestamps (e.g. incorrect timezone used).

Adding both would be the most preferred option, and would precisely delineate the data being analyzed for that particular report.
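
To make the request concrete, here is a minimal sketch of what such a range block could look like in the JSON output. The key names (`collection_range`, `from`, `to`, `min_version_id`, `max_version_id`) and the dates are made up for illustration and are not taken from the actual reports. The ID fields are optional, since they may not always be available.

```python
import json
from datetime import datetime, timezone


def zulu(ts: datetime) -> str:
    """Format a timezone-aware datetime as an ISO 8601 UTC ("Zulu") string."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def collection_range(start, end, min_id=None, max_id=None) -> dict:
    """Build a hypothetical range block describing the data a report covers."""
    # Naive datetimes are rejected so a wrong local timezone cannot slip in.
    if start.tzinfo is None or end.tzinfo is None:
        raise ValueError("timestamps must be timezone-aware")
    block = {"from": zulu(start), "to": zulu(end)}
    # The ID range is only included when both ends are known.
    if min_id is not None and max_id is not None:
        block["min_version_id"] = min_id
        block["max_version_id"] = max_id
    return block


# Illustrative dates only; not the real collection window.
report = {
    "collection_range": collection_range(
        datetime(2016, 10, 3, tzinfo=timezone.utc),
        datetime(2016, 11, 2, tzinfo=timezone.utc),
    )
}
print(json.dumps(report, indent=2))
```

For the HTML, the same two strings could go into a hidden element or a comment, since they only need to be machine-readable.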

This should hopefully eliminate the data range as a source of variability between your report and mine, and any differences at that point should be due to inconsistencies in the data processing (either on your end or on mine).

@r888888888
Owner

It's hard to extract the version IDs used from BigQuery without issuing another scan. I can add the dates though.

@r888888888
Owner

See 8c69e4a.

@r888888888
Owner

I think one reason for the discrepancy in the total counts in post details is that you are probably counting versions, whereas the reportbooru one counts discrete changes. Maybe you are doing something similar for the typed tags?

@BrokenEagle
Author

Yes, that is correct, with a small caveat. My script counts post versions that are not uploads for "Post Changes", and post versions that are uploads for "Uploads". So the two are mutually exclusive.

For the typed tags however...

  • Post Changes:
    • It counts the number of tags in "added_tags" for Add events
    • It counts the number of tags in "removed_tags" for Remove events
  • Uploads:
    • It counts the number of tags in "tags"

For all of the above:

  • It disregards all meta-tags
    • It also disregards the following transient tags:
    • *_request, tagme, commentary, check_commentary, translated, partially_translated, check_translation, annotated, partially_annotated, check_my_note, check_pixiv_source
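
The rules above can be sketched roughly as follows. This is not the actual script, only an illustration under a few assumptions: each post version is a dict, the `is_upload` flag is a hypothetical field, the `tags`, `added_tags` and `removed_tags` fields are already split into lists, and a meta-tag is taken to be any tag containing a colon.

```python
from fnmatch import fnmatchcase

# Transient tags from the list above; "*_request" is treated as a wildcard.
TRANSIENT = [
    "*_request", "tagme", "commentary", "check_commentary", "translated",
    "partially_translated", "check_translation", "annotated",
    "partially_annotated", "check_my_note", "check_pixiv_source",
]


def is_counted(tag: str) -> bool:
    """Return True if the tag should contribute to the typed-tag totals."""
    # Assumption: meta-tags are the ones with a "prefix:value" form.
    if ":" in tag:
        return False
    return not any(fnmatchcase(tag, pattern) for pattern in TRANSIENT)


def tally(versions) -> dict:
    """Count uploads and post changes as mutually exclusive categories."""
    totals = {"uploads": 0, "post_changes": 0, "upload_tags": 0,
              "tags_added": 0, "tags_removed": 0}
    for v in versions:
        if v["is_upload"]:  # hypothetical flag marking the first version
            totals["uploads"] += 1
            totals["upload_tags"] += sum(map(is_counted, v.get("tags", [])))
        else:
            totals["post_changes"] += 1
            totals["tags_added"] += sum(map(is_counted, v.get("added_tags", [])))
            totals["tags_removed"] += sum(map(is_counted, v.get("removed_tags", [])))
    return totals


sample = [
    {"is_upload": True, "tags": ["1girl", "tagme", "rating:s"]},
    {"is_upload": False, "added_tags": ["solo", "source_request"],
     "removed_tags": ["tagme", "long_hair"]},
]
print(tally(sample))
```

Note that this counts one "Post Change" per non-upload version, which is the versions-versus-discrete-changes difference mentioned earlier in the thread.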

I should be able to work with the date/times though. Thanks for adding that!
