Establish infrastructure to automate the PFOCR pipeline #14

AlexanderPico · 2019-11-13T01:56:31Z

This work will automate the PFOCR pipeline, making it better able to keep up with the stream of new data being generated while also back-filling with data from past publications. We will develop a system for automating the construction of the lexicon used in the named entity recognition steps. We will also automate the regular normalization and deposition of the PFOCR data into our newly created Translator API.

AlexanderPico · 2020-02-15T01:04:39Z

@ariutta Can you make a flow diagram in draw.io that depicts the major steps in our PFOCR pipeline? We can then annotate (e.g., with fill color) which steps are automated and which are manual. Maybe we can provide a rough percentage estimate of how much of the pipeline is automated by this Segment 1 deadline.

AlexanderPico · 2020-02-15T01:17:55Z

@ariutta Another aspect of this is "establishing infrastructure" to automate in the future. Do you have ideas on tools we might want to use for this project to monitor scripts and future automation, e.g., Jenkins? Part of satisfying this Segment 1 aim would simply be to identify and prototype that sort of infrastructure tooling.

AlexanderPico · 2020-03-09T20:53:38Z

ariutta · 2020-03-10T05:37:12Z

As illustrated in the figure above, we propose a pipeline for performing OCR on figures and processing the results to identify entities of interest, such as genes. The items marked as Container can be packaged and deployed using Docker, with inter-container communication handled by an RPC system like gRPC. The trigger to begin collecting figures will be chosen to be consistent with the rest of the BTE system, whether that is periodic, like a nightly cron job, or on-demand, like a message queue. The subsequent workflow will be handled by a system like Pachyderm.

In the classify step, we will use a machine learning model that has been pre-trained on our collection of manually labeled figures (pathway vs. not-pathway). This model will definitely use computer vision to assign labels to figures and may additionally use text, such as the figure caption. For the figures classified as pathway, we will send them through an OCR processor to extract raw text and use our lexicon(s) and post-processing algorithms to extract entities such as genes from that text. Finally, we will export our results in formats requested by third-party consumers/hosts of our resulting data, such as gene sets by figure in GMT format for Enrichr.
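To make the last two steps more concrete, here is a minimal sketch of lexicon matching followed by GMT export. It is only an illustration, not the actual PFOCR code: the `LEXICON` contents, the figure ID, and the function names `extract_genes` and `to_gmt_line` are all hypothetical, and the real post-processing is likely more involved than a whole-token lookup. The only fixed point is the GMT layout itself: one gene set per line, tab-separated, with a set name, a description, and then the genes.

```python
import re
from typing import Dict, List

# Hypothetical lexicon mapping normalized OCR tokens to gene symbols.
# The real lexicon would be built automatically, as described in this issue.
LEXICON: Dict[str, str] = {
    "TP53": "TP53",
    "P53": "TP53",  # alias resolved to the same symbol
    "EGFR": "EGFR",
    "AKT1": "AKT1",
}


def extract_genes(ocr_text: str, lexicon: Dict[str, str]) -> List[str]:
    """Return unique gene symbols found in raw OCR text, in order of first appearance."""
    # Upper-case the text and split it into alphanumeric tokens.
    tokens = re.findall(r"[A-Za-z0-9]+", ocr_text.upper())
    seen = set()
    genes: List[str] = []
    for token in tokens:
        symbol = lexicon.get(token)
        if symbol is not None and symbol not in seen:
            seen.add(symbol)
            genes.append(symbol)
    return genes


def to_gmt_line(figure_id: str, description: str, genes: List[str]) -> str:
    """Format one gene set as a GMT line: name, description, then genes, tab-separated."""
    return "\t".join([figure_id, description, *genes])


if __name__ == "__main__":
    # Hypothetical OCR output for one figure already classified as a pathway.
    raw_text = "p53 -> EGFR\nAkt1 p53"
    genes = extract_genes(raw_text, LEXICON)
    print(to_gmt_line("example_figure_1", "hypothetical pathway figure", genes))
```

Keeping extraction and formatting as separate functions would let other export formats reuse the same extracted gene list.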

AlexanderPico added the enhancement New feature or request label Nov 13, 2019

AlexanderPico added this to the Segment 1 milestone Nov 13, 2019

AlexanderPico assigned ariutta, AlexanderPico and khanspers Nov 13, 2019

AlexanderPico added the Group 4 label Feb 15, 2020

ariutta closed this as completed Mar 10, 2020
