Establish infrastructure to automate the PFOCR pipeline #14

Closed
AlexanderPico opened this issue Nov 13, 2019 · 4 comments
Labels: enhancement (New feature or request), Group 4

@AlexanderPico
Member

making it better able to keep up with the stream of new data being generated, while also back-filling with data from past publications. We will develop a system for automating the construction of the lexicon used in the named entity recognition steps. We will automate the regular normalization and deposition of the PFOCR data into our newly-created Translator API.

@AlexanderPico AlexanderPico added the enhancement New feature or request label Nov 13, 2019
@AlexanderPico AlexanderPico added this to the Segment 1 milestone Nov 13, 2019
@AlexanderPico
Member Author

@ariutta Can you make a flow diagram in draw.io that depicts the major steps in our PFOCR pipeline? We can then annotate (e.g., with fill color) which steps are automated and which are manual. Maybe we can provide a rough percentage estimate of how much of the pipeline is automated by this Segment 1 deadline.

@AlexanderPico
Member Author

@ariutta Another aspect of this is "establishing infrastructure" to automate in the future. Do you have ideas on tools we might want to use for this project to monitor scripts and future automation, e.g., Jenkins? Part of satisfying this Segment 1 aim would simply be to identify and prototype that sort of infrastructure tooling.

@AlexanderPico
Member Author

[Figure: PFOCR Pipeline for BTE]

@ariutta
Member

ariutta commented Mar 10, 2020

As illustrated in the figure above, we propose a pipeline for performing OCR on figures and processing the results to identify entities of interest, such as genes. The items marked as Container can be packaged and deployed using Docker, with inter-container communication handled by an RPC system such as gRPC. The trigger to begin collecting figures will be chosen to be consistent with the rest of the BTE system, whether periodic (e.g., a nightly cron job) or on-demand (e.g., a message queue). The subsequent workflow will be handled by a system such as Pachyderm.

In the classify step, we will use a machine learning model that has been pre-trained on our collection of manually labeled figures (pathway vs. not-pathway). This model will definitely use computer vision to assign labels to figures and may additionally use text, such as the figure caption. We will send the figures classified as pathway through an OCR processor to extract raw text, then use our lexicon(s) and post-processing algorithms to extract entities such as genes from that text. Finally, we will export our results in formats requested by third-party consumers/hosts of our data, such as gene sets by figure in GMT format for Enrichr.
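
To make the last two steps concrete, here is a minimal Python sketch of lexicon matching followed by GMT export. It is an illustration under stated assumptions, not the project's actual code: the tiny `LEXICON`, the function names `extract_genes` and `to_gmt_line`, and the figure identifier are all hypothetical, and naive case-insensitive token lookup stands in for the real lexicon(s) and post-processing algorithms. The only fixed fact relied on is the GMT layout: one gene set per line, with a set name, a description, and then the genes, all tab-separated.

```python
import re
from typing import Dict, Iterable, List, Set

# Hypothetical lexicon: maps an upper-cased surface form seen in OCR text
# to a canonical gene symbol. A real lexicon would be far larger.
LEXICON: Dict[str, str] = {
    "TP53": "TP53",
    "P53": "TP53",
    "MDM2": "MDM2",
    "AKT1": "AKT1",
    "AKT": "AKT1",
}

# Split OCR text into alphanumeric tokens; arrows and punctuation are dropped.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+")


def extract_genes(ocr_text: str, lexicon: Dict[str, str]) -> List[str]:
    """Return canonical gene symbols found in OCR text, in first-seen order."""
    seen: Set[str] = set()
    genes: List[str] = []
    for token in TOKEN_RE.findall(ocr_text):
        symbol = lexicon.get(token.upper())
        if symbol is not None and symbol not in seen:
            seen.add(symbol)
            genes.append(symbol)
    return genes


def to_gmt_line(set_name: str, description: str, genes: Iterable[str]) -> str:
    """Format one gene set as a GMT row: name, description, genes (tab-separated)."""
    return "\t".join([set_name, description, *genes])


if __name__ == "__main__":
    # Made-up OCR output for a made-up figure identifier.
    text = "p53 --> MDM2 -| Akt  (growth factor)"
    genes = extract_genes(text, LEXICON)
    print(to_gmt_line("PMC0000000__fig1", "hypothetical figure", genes))
```

Mapping aliases such as `P53` to one canonical symbol before deduplicating keeps each figure's gene set free of repeats, which matters for downstream enrichment tools that treat every entry in a GMT row as a distinct gene.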

@ariutta ariutta closed this as completed Mar 10, 2020