This repository contains files and information about step 2 of Kaphta Architecture: Information Extraction, using the R language.

ramongsilva/Information-extraction-from-pubmed-abstracts-sentences-on-polyphenols-anticancer-activity

Information extraction from PubMed abstracts sentences on polyphenols anticancer activity

This repository contains files and information about step 2 of the Kaphta Architecture: Information Extraction. In this step, the PubMed abstracts classified as positive in the previous step (Text Classification) are used as input. Information is extracted from abstract sentences that contain associations between recognized entities. The files used in the NER (named entity recognition) and AR (association recognition) tasks, and their respective results, are listed below.

For more information about this and other steps of the Kaphta Architecture, see the corresponding sections of the Kaphta Web Tool, available at https://portal.ifsuldeminas.edu.br/kaphtawebtool/.

NER (Named entity recognition)

  • ner-pubmed-abstracts-gh.R: R script that performs named entity recognition (NER), using the PubTator API, on the PubMed abstracts classified as positive in the previous step (Text Classification).
  • functions.R: R script with auxiliary functions. Save this file in the same folder as the ner-pubmed-abstracts-gh.R and association-recognition-pubmed-abstracts-gh.R scripts, because both scripts need it to run.
  • db_total_project.db: SQLite database required by all R scripts of the Kaphta Architecture steps. It contains tables with the entity dictionary, the full textual corpus of PubMed abstracts, and the PubMed abstracts classified as positive in text classification. Save this file in the same folder as the ner-pubmed-abstracts-gh.R script, because the script needs it to run.
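
The NER script relies on the PubTator API. As a rough illustration of what working with its output can look like, the sketch below parses annotation lines in the tab-separated PubTator export format (PMID, start offset, end offset, mention, entity type, identifier). It is not taken from ner-pubmed-abstracts-gh.R: the function name `parse_pubtator_annotations`, the example PMID, and the identifiers `ID1` and `ID2` are made up for illustration, and the real script may request a different format and process it differently.

```r
# Minimal sketch (assumption: input lines follow the PubTator text export format).
# Title and abstract rows look like "PMID|t|..." and "PMID|a|...";
# annotation rows are tab-separated: PMID, start, end, mention, type, identifier.
parse_pubtator_annotations <- function(lines) {
  # Drop title/abstract rows, then keep only tab-separated rows
  is_text_row <- grepl("^[0-9]+\\|[ta]\\|", lines)
  ann <- lines[!is_text_row & grepl("\t", lines, fixed = TRUE)]
  fields <- strsplit(ann, "\t", fixed = TRUE)
  # Ignore malformed rows with fewer than five fields
  fields <- fields[lengths(fields) >= 5]
  data.frame(
    pmid    = vapply(fields, `[`, character(1), 1),
    start   = as.integer(vapply(fields, `[`, character(1), 2)),
    end     = as.integer(vapply(fields, `[`, character(1), 3)),
    mention = vapply(fields, `[`, character(1), 4),
    type    = vapply(fields, `[`, character(1), 5),
    stringsAsFactors = FALSE
  )
}

# Made-up example input (hypothetical PMID and identifiers)
example_lines <- c(
  "12345|t|Resveratrol inhibits breast cancer cell growth",
  "12345\t0\t11\tResveratrol\tChemical\tID1",
  "12345\t21\t34\tbreast cancer\tDisease\tID2"
)
annotations <- parse_pubtator_annotations(example_lines)
print(annotations)
```

Returning a data frame with one row per recognized entity keeps the result easy to filter by entity type or to join against a dictionary table.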

AR (Association recognition)

Results of the NER and AR tasks

  • entities-recognized: folder with the files produced by the NER task, containing the named entities (polyphenols, cancers and genes) recognized in the PubMed abstracts classified as positive in the previous step (Text Classification). Save this folder and its files in the same folder as the association-recognition-pubmed-abstracts-gh.R script, because the script needs it to run the association recognition task.
  • entities-associations-sentences-recognized: folder with the files produced by the NER task, containing the sentences in which associations between entities (polyphenols, cancers and genes) were recognized in the PubMed abstracts classified as positive in the previous step (Text Classification). Save this folder and its files in the same folder as the association-recognition-pubmed-abstracts-gh.R script, because the script needs it to run the association recognition task.
  • ner-frequency: folder with files reporting the frequency of the polyphenol, cancer and/or gene entities recognized in the PubMed abstracts classified as positive in the previous step (Text Classification).
  • Rule_associations_recognized.rar: compressed archive produced by the AR task, containing the PubMed abstract sentences in which at least one rule from the rules dictionary was recognized.
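
Because the association recognition script expects functions.R and the two input folders to sit next to it, a quick check before running it can save a failed run. The sketch below is only a suggestion and is not part of the repository: the helper `missing_inputs` is a hypothetical name, and it assumes the working directory is the folder that holds association-recognition-pubmed-abstracts-gh.R.

```r
# Minimal sketch: report which required files or folders are absent from `dir`.
# `missing_inputs` is a hypothetical helper, not a function from functions.R.
missing_inputs <- function(dir, required) {
  required[!file.exists(file.path(dir, required))]
}

# Inputs the README says must be in the same folder as the AR script
required <- c(
  "functions.R",
  "entities-recognized",
  "entities-associations-sentences-recognized"
)

# Assumption: the current working directory is the script's folder
not_found <- missing_inputs(".", required)
if (length(not_found) > 0) {
  message("Missing inputs: ", paste(not_found, collapse = ", "))
} else {
  message("All required inputs found.")
}
```

`file.exists` works for both files and directories, so the same call covers functions.R and the two folders.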

Result of AR task

The table below presents the results of the association recognition task, broken down by category, rule, and sentence type (PC, PG, and P).

Table with the total number of recognized sentence associations for each sentence type
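
Counts of this kind can be derived from a table of recognized sentences with one row per sentence. The sketch below shows one possible way to do it in base R. The data frame layout, the column names `category`, `rule` and `sentence_type`, and all values in the toy data are assumptions made for illustration; they are not the repository's actual data or results.

```r
# Minimal sketch: count recognized sentences per category, rule and sentence type.
# `count_associations` and the column names are hypothetical.
count_associations <- function(df) {
  counts <- as.data.frame(
    table(
      category = df$category,
      rule = df$rule,
      sentence_type = df$sentence_type
    ),
    responseName = "n",
    stringsAsFactors = FALSE
  )
  # Keep only the combinations that actually occur
  counts[counts$n > 0, ]
}

# Toy input with made-up values (not real results)
recognized <- data.frame(
  category = c("category_a", "category_a", "category_b", "category_a"),
  rule = c("rule_1", "rule_1", "rule_2", "rule_1"),
  sentence_type = c("PC", "PG", "PC", "PC"),
  stringsAsFactors = FALSE
)
print(count_associations(recognized))
```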
