making it better able to keep up with the stream of new data being generated, while also back-filling with data from past publications. We will develop a system for automating the construction of the lexicon used in the named entity recognition steps. We will automate the regular normalization and deposition of the PFOCR data into our newly-created Translator API.
@ariutta Can you make a flow diagram in draw.io that depicts the major steps in our PFOCR pipeline? We can then annotate (e.g., with fill color) which steps are automated and which are manual. Maybe we can provide a rough percentage estimate of how much of the pipeline is automated by this Segment 1 deadline.
@ariutta Another aspect of this is "establishing infrastructure" to automate in the future. Do you have ideas on tools we might want to use for this project to monitor scripts and future automation, e.g., Jenkins? Part of satisfying this Segment 1 aim would simply be to identify and prototype that sort of infrastructure tooling.
As illustrated in the figure above, we propose a pipeline for performing OCR on figures and processing the results to identify entities of interest, such as genes. The items marked as Container can be packaged and deployed using Docker, with inter-container communication handled by an RPC system such as gRPC. The trigger to begin collecting figures will be chosen to be consistent with the rest of the BTE system, whether that is periodic (like a nightly cron job) or on-demand (like a message queue). The subsequent workflow will be handled by a system like Pachyderm.
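To sketch the orchestration idea, a Pachyderm pipeline spec ties one containerized step to an input data repo; Pachyderm then re-runs the step whenever new data lands in that repo, which fits either a cron-driven or on-demand trigger. The repo, image, and command names below are placeholders, not existing resources:

```json
{
  "pipeline": { "name": "pfocr-classify" },
  "input": {
    "pfs": { "repo": "figures", "glob": "/*" }
  },
  "transform": {
    "image": "pfocr/classify:latest",
    "cmd": ["python", "/app/classify.py", "/pfs/figures", "/pfs/out"]
  }
}
```

Each subsequent step (OCR, entity extraction, export) could be a similar spec whose input repo is the previous step's output, giving us the inter-step data flow without hand-written glue.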
In the classify step, we will use a machine learning model pre-trained on our collection of manually labeled figures (pathway vs. not-pathway). This model will use computer vision to assign labels to figures and may additionally use text, such as the figure caption. Figures classified as pathway will be sent through an OCR processor to extract raw text, and we will use our lexicon(s) and post-processing algorithms to extract entities such as genes from that text. Finally, we will export our results in formats requested by third-party consumers/hosts of our resulting data, such as gene sets by figure in GMT format for Enrichr.
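As a sketch of the export step: GMT is a simple tab-separated format where each line holds a set name, a description, and then the member genes. The function names and the mapping shape below are illustrative, not part of the existing pipeline code:

```python
def to_gmt_line(figure_id, description, genes):
    """Serialize one figure's gene set as a single GMT line.

    GMT format: set name, description, then member genes,
    all tab-separated.
    """
    return "\t".join([figure_id, description, *genes])


def write_gmt(gene_sets, path):
    """Write a {figure_id: (description, [genes])} mapping as a GMT file."""
    with open(path, "w") as f:
        for figure_id, (description, genes) in gene_sets.items():
            f.write(to_gmt_line(figure_id, description, genes) + "\n")
```

Enrichr accepts user-supplied gene set libraries in GMT format, so output like this could be handed off directly.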