Workflow #113

Closed
Daniel-Mietchen opened this issue Nov 23, 2015 · 4 comments

@Daniel-Mietchen
Member

Daniel-Mietchen commented Nov 23, 2015

Here is a short version of the envisaged workflow for the OA Signalling project (components central to the project are marked in bold):

  1. listen to RecentChanges feed across all Wikimedia wikis (cf. event-data-wikipedia-agent)
  2. filter by bibliographic identifier for papers (currently only DOI, long-term also PubMed ID, PMC ID, arXiv ID, JSTOR ID and perhaps others)
  3. check whether paper was cited or uncited (all steps until here are included in CrossRef's live stream of DOI citations in Wikipedia)
  4. handle potential vandalism/ spam, e.g. via Revision scoring
  5. pull paper metadata from suitable source (e.g. from CrossRef/ DataCite for DOIs); Recitation bot does that, and so does Source, M.D.
  6. check whether that paper is available on Wikisource (initially only English, long-term other languages too)
    1. if so, check proper representation of paper and its metadata on Wikisource (as well as on Commons, Wikidata and Wikipedia) and in case of inconsistencies, notify someone (e.g. the original citer and/ or a relevant WikiProject, or simply a tracking page)
    2. if not, check whether that paper is available in JATS (currently only via PubMed Central, but long-term from anywhere); Recitation bot does that
      1. if so, check licensing of the paper
        1. if license is open, convert paper's JATS XML to MediaWiki XML
          1. upload full text to Wikisource (Recitation bot does that)
            1. check for consistency with original (perhaps via fuzzy anchoring?)
          2. upload images and media to Wikimedia Commons (requires duplicate detection - many images and videos already there; Recitation bot does that too; there is an unresolved issue with high-res images); for video or audio files (covered by the Open Access Media Importer), put a copy of the original file onto the Schnittserver
        2. if license is not open, notify OA Button (perhaps via OABOT?)
    3. start or update the Wikidata items for paper and/ or authors as necessary, perhaps even for references cited in the paper (bib2wikidata can upload CSL)
  7. check whether the initial citation that was identified through the RecentChanges stream is pulling bibliographic metadata from Wikidata
    1. if so, purge page to refresh display of citation information
    2. if not, update original citation with licensing/ OA Button info and links to Wikisource, Commons, Wikidata, as necessary
  8. keep track of revisions of cited references via CrossMark and notify someone of retractions etc.
  9. keep track of further citations (of the same cited reference) from within and beyond Wikimedia, e.g. via the DOI Event Tracker and notify someone (including the Cite-o-Meter)
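
Steps 2 and 3 above (filter by DOI, then decide whether a paper was cited or uncited) could be sketched roughly as follows. This is a minimal illustration, not the code behind the CrossRef live stream: the regular expression is a loose heuristic for DOI-like strings, and the function names `extract_dois` and `classify_doi_changes` are hypothetical.

```python
import re

# Loose heuristic for DOI-like strings: "10.", a 4-9 digit registrant
# prefix, "/", then a suffix that stops at whitespace or wikitext delimiters.
# Assumption: this is good enough for filtering, not for strict validation.
DOI_RE = re.compile(r'\b10\.\d{4,9}/[^\s|}\]<>"]+')


def extract_dois(text):
    """Return the set of DOI-like strings in text, lower-cased."""
    return {m.group(0).rstrip('.,;)').lower() for m in DOI_RE.finditer(text)}


def classify_doi_changes(old_text, new_text):
    """Compare two revisions and report which DOIs were added or removed."""
    old, new = extract_dois(old_text), extract_dois(new_text)
    return {"cited": sorted(new - old), "uncited": sorted(old - new)}


old_rev = "See {{cite journal|doi=10.1371/journal.pone.0000001}}."
new_rev = "See {{cite journal|doi=10.1000/xyz123|title=T}}."
changes = classify_doi_changes(old_rev, new_rev)
```

Comparing sets of identifiers per revision keeps the check independent of where in the page the citation moved; other identifier types (PubMed ID, arXiv ID, and so on) would each need their own pattern.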

Most of the components of this workflow already exist but need some tweaking or brushing up to fit our purposes better or to turn the pieces into a pipeline.

We will use this overview to define more detailed tasks.
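
As a starting point for such tasks, the branching in step 6 of the workflow could be written down as a small decision function. This is only a sketch of the control flow as described in the list; the `PaperStatus` fields and the action names are hypothetical placeholders, and the case "not on Wikisource and no JATS available" is left empty because the workflow does not specify it.

```python
from dataclasses import dataclass


@dataclass
class PaperStatus:
    # Hypothetical flags; in practice each would come from its own lookup.
    on_wikisource: bool
    has_jats: bool
    license_open: bool


def plan_actions(paper):
    """Return the ordered list of actions implied by step 6 of the workflow."""
    actions = []
    if paper.on_wikisource:
        # 6.1: paper already there, so only verify its representation.
        actions.append("check_representation")
    elif paper.has_jats:
        # 6.2: JATS available, so the next step depends on the license.
        if paper.license_open:
            actions += [
                "convert_jats_to_mediawiki",
                "upload_text_to_wikisource",
                "upload_media_to_commons",
            ]
        else:
            actions.append("notify_oa_button")
    # 6.3: Wikidata items are started or updated in every case.
    actions.append("update_wikidata")
    return actions


plan = plan_actions(PaperStatus(on_wikisource=False, has_jats=True, license_open=False))
```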

@afandian

This looks great. One current blind spot is text-only DOIs that appear in an article but aren't linked; in those cases the citation should be fixed. This is an extra detail in step 2. It's not difficult to do, but according to Dario Taraborelli there isn't an active project doing it.
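
A rough sketch of how such text-only DOIs might be spotted in wikitext: find DOI-like strings and skip those already sitting in a `doi=` template parameter, a `{{doi|...}}` template, or a `doi.org/` URL. The patterns and the function name `find_unlinked_dois` are assumptions for illustration, not part of any existing tool.

```python
import re

# Loose heuristic for DOI-like strings (assumption, not a strict validator).
DOI_RE = re.compile(r'\b10\.\d{4,9}/[^\s|}\]<>"]+')

# Text immediately before a DOI that indicates it is already linked:
# a "doi=" template parameter, a {{doi|...}} template, or a doi.org URL.
LINKED_PREFIX_RE = re.compile(
    r'(doi\s*=\s*|\{\{\s*doi\s*\|\s*|doi\.org/)$', re.IGNORECASE
)


def find_unlinked_dois(wikitext):
    """Return DOI-like strings that are not preceded by a linking construct."""
    found = []
    for m in DOI_RE.finditer(wikitext):
        # Look at a short window of text directly before the match.
        before = wikitext[max(0, m.start() - 20):m.start()]
        if not LINKED_PREFIX_RE.search(before):
            found.append(m.group(0).rstrip('.,;)'))
    return found


sample = (
    "A {{cite journal|doi=10.1000/linked}} and plain doi:10.1000/bare. "
    "Also [https://doi.org/10.1000/url link]."
)
unlinked = find_unlinked_dois(sample)
```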

The code running the Crossref live stream of citations is open source and exposes a websocket of diffs containing DOIs. I can add any required interfaces if anyone wants to connect other software to it.

@wetneb

wetneb commented Dec 17, 2015

Hi @Daniel-Mietchen, we met at OpenCon and I'm really excited by this project (under this branding or as OABOT). I think it would be great to use not only Wikisource, but also tons of open access repositories (BASE centralizes many of them and has APIs, cc @pietsch).
I have a few questions:

  • I did not know that Wikisource was used for academic papers. How would you query it or harvest it?
  • about adding DOIs to citations that do not provide one: I think it would be really useful to actually add these DOIs to Wikipedia itself (i.e. not keep the DOIs in your own workflow as they would be very useful to many other projects). Do you think such a bot has chances to be accepted on WP:BRFA?
  • same question, but for a bot that would add URLs to {{cite journal}} templates which do not have one yet: the URL would point to an open access repository where the paper is freely available?

I'm currently creating a dataset of all {{cite something}} references on the English WP and will do a few experiments on them to see what proportion of them could be improved with such a pipeline.
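
To give an idea of what such an experiment might look like, here is a minimal sketch that lists {{cite ...}} templates lacking a `doi` or `url` parameter. It uses a simplified regular expression that only handles templates without nested templates or piped wikilinks; a real harvest would need a proper wikitext parser. The function names are hypothetical.

```python
import re

# Simplified matcher: "{{cite <type>|...}}" with no nested braces inside.
# Assumption: nested templates and piped links are out of scope here.
CITE_RE = re.compile(r'\{\{\s*(cite [^|{}]+?)\s*\|([^{}]*)\}\}', re.IGNORECASE)


def parse_cite_templates(wikitext):
    """Return (template_name, params) pairs for each simple cite template."""
    results = []
    for m in CITE_RE.finditer(wikitext):
        params = {}
        for part in m.group(2).split('|'):
            if '=' in part:
                key, value = part.split('=', 1)
                params[key.strip().lower()] = value.strip()
        results.append((m.group(1).lower(), params))
    return results


def missing_params(wikitext, wanted=("doi", "url")):
    """Report cite templates where any of the wanted parameters is absent or empty."""
    report = []
    for name, params in parse_cite_templates(wikitext):
        missing = [w for w in wanted if not params.get(w)]
        if missing:
            report.append((name, params.get("title", ""), missing))
    return report


page = (
    "{{cite journal|title=A|doi=10.1000/a|url=http://example.org/a}} "
    "{{Cite journal |title=B |doi=10.1000/b}} "
    "{{cite book|title=C}}"
)
report = missing_params(page)
```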

@Daniel-Mietchen
Member Author

Hi @wetneb (and @pietsch), thanks for checking in here.

Not sure what precisely you have in mind with using Wikisource and those tons of other repositories, but what we plan to do is to harvest openly licensed scholarly articles from PubMed Central (and eventually also other sources, as long as they provide their articles in JATS) and to upload their full text to Wikisource (and their images and media to Wikimedia Commons), which would then be linked (and ideally deep-linked) from the citation of the scholarly article on Wikipedia. For a list of articles already test-uploaded, see here.

As to your questions,

  • we have no plans to query Wikisource right now, or to harvest from it, though we would be interested in approaches to annotate or mine it, so that the potentially queryable facts could be added to Wikidata, where they would become queryable.
  • we certainly do not plan to keep any DOIs secret in any way — not sure what you mean here. But instead of adding DOIs to individual citations on individual pages in some Wikipedia, we think it would be best to add them (along with all the bibliographic metadata) to Wikidata, and to pull the information into Wikipedia (across languages) and other Wikimedia projects from there. More details on that are here.
  • as for adding URLs to {{cite journal}}, that is something outside our scope for the moment, but at the heart of the OABOT suggestion.
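
For illustration, pulling a paper from Wikidata by its DOI could start from a SPARQL lookup on the DOI property (P356). The sketch below only builds the query string; the function name is hypothetical, and upper-casing the DOI reflects the assumption that DOIs are stored in upper case on Wikidata.

```python
def doi_lookup_query(doi):
    """Build a SPARQL query that finds Wikidata items with the given DOI (P356)."""
    # Assumption: Wikidata stores DOIs in upper case, so normalize first.
    normalized = doi.strip().upper()
    # Refuse characters that would break out of the quoted SPARQL literal.
    if '"' in normalized or '\\' in normalized:
        raise ValueError("unexpected character in DOI: %r" % doi)
    return 'SELECT ?item WHERE { ?item wdt:P356 "%s" . } LIMIT 5' % normalized


query = doi_lookup_query(" 10.1000/xyz123 ")
```

The resulting string could then be sent to the Wikidata Query Service; an empty result would suggest that an item for the paper still needs to be started.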

As for all {{cite something}} references on enwp, that is something @halfak is working on over at python-mwcites — worth a look, along with the initial data from February this year, which could do with an update.

@Daniel-Mietchen
Member Author

Closing this - I have moved the workflow description from above to https://github.com/wpoa/OA-signalling/blob/master/README.md#workflow and opened a new ticket #143 for handling future updates to the workflow.


3 participants