<a href="https://colab.research.google.com/github/siliang625/text_mining_health/blob/master/digital_public_health_v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Problem Statement

This notebook uses the NCBI E-utilities `esearch` endpoint to retrieve a list of related article IDs for a search term provided by Alice, then uses `efetch` to extract metadata for those articles (adapted from the Extract_metadata notebook).


In [0]:
#!pip install BeautifulSoup4



In [5]:
%matplotlib inline
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sys
np.set_printoptions(threshold=80, edgeitems=50)
from bs4 import BeautifulSoup
from collections import namedtuple
import codecs
import re
import time
import json
import requests
from urllib.request import urlopen

# Keyword search
We will use 'public health' as a sample search term. Sample query: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pmc&term=public%20health



In [0]:
base_url_key_word = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pmc&term=public%20health"

In [0]:
page_kw = urlopen(base_url_key_word)
soup_kw = BeautifulSoup(page_kw, "xml")

In [0]:
# get text form of everything in the xml file
soup_kw.get_text()

'1960797200\n7227790\n7227436\n7227427\n7227414\n7227382\n7227373\n7226536\n7226223\n7226038\n7225773\n7225723\n7225607\n7225559\n7225364\n7225352\n7225339\n7225331\n7225315\n7225311\n7225297\n public health "public health"[MeSH Terms] OR ("public"[All Fields] AND "health"[All Fields]) OR "public health"[All Fields]   "public health"[MeSH Terms] MeSH Terms 1186646 Y   "public"[All Fields] All Fields 1397349 N   "health"[All Fields] All Fields 3262233 N  AND GROUP OR  "public health"[All Fields] All Fields 969981 N  OR GROUP "public health"[MeSH Terms] OR ("public"[All Fields] AND "health"[All Fields]) OR "public health"[All Fields]'

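By default, the query returns only 20 related article IDs (the default value of esearch's `retmax` parameter); the actual number of related articles is far larger (here, 1960797). The leading `1960797200` in the output above is the `Count`, `RetMax`, and `RetStart` values (1960797, 20, 0) run together.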
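
To page through more than the first 20 IDs, esearch accepts `retstart` (zero-based offset) and `retmax` (IDs per response). The following is a minimal sketch that only builds the page URLs; the helper name `esearch_page_urls` and the page size of 100 are assumptions for illustration, not part of the original notebook.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_page_urls(term, total, page_size=100, db="pmc"):
    """Yield one esearch URL per page of results (hypothetical helper).

    retstart is the zero-based offset of the first ID to return,
    retmax is the number of IDs included in each response.
    """
    for start in range(0, total, page_size):
        params = {"db": db, "term": term, "retstart": start, "retmax": page_size}
        yield ESEARCH_URL + "?" + urlencode(params)


# Build the URLs needed to cover the first 250 matching IDs.
urls = list(esearch_page_urls("public health", total=250, page_size=100))
for url in urls:
    print(url)
```

Each URL can then be fetched with `urlopen` and parsed exactly like the single query above, with a pause between requests.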

In [0]:
# total number of records matching the query
soup_kw.find('Count').text

'1960797'

Let's get the list of related article IDs returned by the query:

In [0]:
id_list = soup_kw.find_all('Id')
return_ids = []
for each_id in id_list:
    print(each_id.text)
    return_ids.append(each_id.text)

7227790
7227436
7227427
7227414
7227382
7227373
7226536
7226223
7226038
7225773
7225723
7225607
7225559
7225364
7225352
7225339
7225331
7225315
7225311
7225297


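How is 'public health' queried in the database? Based on the query translation below, the search term is expanded into three alternatives combined with OR: the MeSH term "public health", the two words 'public' and 'health' both appearing in any field, and the exact phrase 'public health' in any field.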
```
<TranslationStack>
<TermSet>
<Term>"public health"[MeSH Terms]</Term>
<Field>MeSH Terms</Field>
<Count>1186646</Count>
<Explode>Y</Explode>
</TermSet>
<TermSet>
<Term>"public"[All Fields]</Term>
<Field>All Fields</Field>
<Count>1397349</Count>
<Explode>N</Explode>
</TermSet>
<TermSet>
<Term>"health"[All Fields]</Term>
<Field>All Fields</Field>
<Count>3262233</Count>
<Explode>N</Explode>
</TermSet>
<OP>AND</OP>
<OP>GROUP</OP>
<OP>OR</OP>
<TermSet>
<Term>"public health"[All Fields]</Term>
<Field>All Fields</Field>
<Count>969981</Count>
<Explode>N</Explode>
</TermSet>
<OP>OR</OP>
<OP>GROUP</OP>
</TranslationStack>
<QueryTranslation>
"public health"[MeSH Terms] OR ("public"[All Fields] AND "health"[All Fields]) OR "public health"[All Fields]
</QueryTranslation>
```
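The per-term counts in the `TranslationStack` can also be read programmatically. This sketch uses the standard library's `xml.etree.ElementTree` rather than BeautifulSoup, on an abbreviated copy of the stack above (two of the four term sets); the function name `term_counts` is an assumption for illustration.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the TranslationStack shown above.
sample = """
<TranslationStack>
  <TermSet>
    <Term>"public health"[MeSH Terms]</Term>
    <Field>MeSH Terms</Field>
    <Count>1186646</Count>
    <Explode>Y</Explode>
  </TermSet>
  <TermSet>
    <Term>"public health"[All Fields]</Term>
    <Field>All Fields</Field>
    <Count>969981</Count>
    <Explode>N</Explode>
  </TermSet>
  <OP>OR</OP>
</TranslationStack>
""".strip()


def term_counts(xml_text):
    """Return (term, count) pairs for every TermSet in the stack."""
    root = ET.fromstring(xml_text)
    return [
        (term_set.findtext("Term"), int(term_set.findtext("Count")))
        for term_set in root.iter("TermSet")
    ]


pairs = term_counts(sample)
for term, count in pairs:
    print(term, count)
```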



# Full text search
The sample query provided below allows us to extract the full text of an article, as well as some of its metadata (e.g., authors, abstract, references).

Sample query: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=7225297

Let's extract the full content of the sample XML response.

In [0]:
base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id="

In [0]:
pmc_id = "7225297"

In [0]:
target_url = base_url + pmc_id

In [0]:
page = urlopen(target_url)
soup = BeautifulSoup(page, "xml")

In [0]:
# get text form of everything in the xml file
soup.get_text()

"\n\n\n\nFront Pediatr\nFront Pediatr\nFront. Pediatr.\n\nFrontiers in Pediatrics\n\n2296-2360\n\nFrontiers Media S.A.\n\n\n\n7225297\n10.3389/fped.2020.00207\n\n\nPediatrics\n\nPerspective\n\n\n\n\nA Perspective on Management of Limb Fractures in Obese Children: Is It Time for Dedicated Guidelines?\n\n\n\n\nDonati\nFabrizio\n\n\n1\n\n\n\n\n\nCostici\nPier Francesco\n\n\n1\n\n\n\n\nDe Salvatore\nSergio\n\n\n2\n\n\n\n\n\nBurrofato\nAaron\n\n\n1\n\n\n\n\nMicciulli\nEnrico\n\n\n1\n\n\n\n\nMaiese\nAniello\n\n\n3\n\n\n*\n\n\n\n\n\nSantoro\nPaola\n\n\n3\n\n\n\n\n\nLa Russa\nRaffaele\n\n\n3\n\n\n\n\n1Department of General Surgery, Orthopedic Institute, Bambino Gesù Children Hospital, Rome, Italy\n2Department of Orthopaedic and Trauma Surgery, Campus Bio-Medico University of Rome, Rome, Italy\n3Department of Anatomical, Histological, Forensic and Orthopaedic Sciences, Sapienza University of Rome, Rome, Italy\n\n\nEdited by: Marzia Duse, Sapienza University of Rome, Italy\n\n\nReviewed by: Mari

Now, let's extract some metadata from the XML.

### article content

In [0]:
# the <body> element holds the article text; print each paragraph in it
body = soup.find('body')
paragraphs = body.find_all('p')
for p in paragraphs:
    print(p.text)

Limb fractures are the most common injuries in pediatric orthopedics. Early and late complications are often not preventable, even when providing the best treatment; furthermore, these illnesses are often implicated in medico-legal claims. The development of evidence-based guidelines (EBG) is one of the main goals of medical research. Approved guidelines are fundamental to improve medical practice, especially in the management of specific diseases. The quality of guidelines strictly relies on the strength of its scientific evidence, which is often insufficient in pediatric orthopedic publications (1). A wide standardization of treatment is not always possible in medicine since specific conditions often require adjustments to the normal standard of care. One of these conditions is pediatric obesity (defined as a Body Mass Index, BMI, at or above the 95th percentile for children and teens of the same age and sex). Both obesity and polytrauma are major health problems. Despite rate of tra

### Title

In [0]:
title = soup.find('title-group')
article_title = title.find('article-title')
print(article_title.text)

A Perspective on Management of Limb Fractures in Obese Children: Is It Time for Dedicated Guidelines?


### Abstract

In [0]:
soup.find('abstract').text

'\nLimb fractures are the most common injuries in pediatric orthopedics. Early and late complications are often not preventable, even when providing the best treatment; furthermore, these injuries are largely implicated in medico-legal claims. The development of evidence-based guidelines is one of the main goals of medical research. Approved guidelines for diagnosis, treatment, and follow up are fundamental to obtain the best results in medical practice. Guidelines in pediatric traumatology have been developed, even though specific conditions, like obesity, could influence their drafting. The cast and fixation systems usually applied in pediatric fractures provide a growth plate sparing, a satisfying reduction, and good stress resistance, mostly because of a lower bodyweight compared to adults. Several studies suggest that obesity influences the bone quality, the management, and the outcomes in cases of fracture. High body weight increases the risk of trauma, modifies fracture characte

### Author

In [0]:
contrib_group = soup.find('contrib-group')
# keep only contributors tagged as authors (editors and reviewers are skipped)
authors = contrib_group.find_all('contrib', **{'contrib-type': "author"})
for author in authors:
    # surname first, then given names
    print(' '.join([author.find('surname').text, author.find('given-names').text]))

Donati Fabrizio
Costici Pier Francesco
De Salvatore Sergio
Burrofato Aaron
Micciulli Enrico
Maiese Aniello
Santoro Paola
La Russa Raffaele
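
The separate title, abstract, and author lookups above could be gathered into one helper. The sketch below is one possible shape, written with the standard library's `xml.etree.ElementTree` so it runs without network access; the function name `extract_metadata` and the tiny hand-written XML sample are assumptions for illustration, and a real efetch response is much larger and wrapped in additional elements.

```python
import xml.etree.ElementTree as ET


def extract_metadata(article_xml):
    """Return title, abstract, and author names from article XML (hypothetical helper)."""
    root = ET.fromstring(article_xml)
    title = root.find(".//title-group/article-title")
    abstract = root.find(".//abstract")

    authors = []
    # Only contributors marked contrib-type="author" are kept.
    for contrib in root.iterfind(".//contrib-group/contrib[@contrib-type='author']"):
        surname = contrib.findtext("name/surname")
        given = contrib.findtext("name/given-names")
        if surname and given:
            authors.append(f"{surname} {given}")

    return {
        # itertext() also collects text nested in child tags such as <p> or <italic>
        "title": "".join(title.itertext()).strip() if title is not None else None,
        "abstract": "".join(abstract.itertext()).strip() if abstract is not None else None,
        "authors": authors,
    }


# Hand-written sample using the same tag names as the lookups above.
sample_article = """
<article>
  <front>
    <article-meta>
      <title-group><article-title>A Sample Title</article-title></title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <name><surname>Donati</surname><given-names>Fabrizio</given-names></name>
        </contrib>
        <contrib contrib-type="editor">
          <name><surname>Duse</surname><given-names>Marzia</given-names></name>
        </contrib>
      </contrib-group>
      <abstract><p>Short abstract.</p></abstract>
    </article-meta>
  </front>
</article>
""".strip()

metadata = extract_metadata(sample_article)
print(metadata)
```

Returning `None` for a missing title or abstract avoids the `AttributeError` that a bare `.text` access raises when an article lacks that element.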


Now, let's say we want the titles of all articles related to the term 'public health':

In [0]:
mapping = {}
base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id="
for each in return_ids:
    target_url = base_url + each
    page = urlopen(target_url)
    soup = BeautifulSoup(page, "xml")
    title = soup.find('title-group')
    article_title = title.find('article-title')
    #print(article_title.text)
    mapping[each] = article_title.text
    time.sleep(1)
    
mapping

{'7225297': 'A Perspective on Management of Limb Fractures in Obese Children: Is It Time for Dedicated Guidelines?',
 '7225311': 'β-Elemene Reverses the Resistance of p53-Deficient Colorectal Cancer Cells to 5-Fluorouracil by Inducing Pro-death Autophagy and Cyclin D3-Dependent Cycle Arrest',
 '7225315': 'Assessing the Preferences for Criteria in Multi-Criteria Decision Analysis in Treatments for Rare Diseases',
 '7225331': 'Peer Support in Coordination of Physical Health and Mental Health Services for People With Lived Experience of a Serious Mental Illness',
 '7225339': 'Hyperphosphatemic Tumoral Calcinosis: Pathogenesis, Clinical Presentation, and Challenges in Management',
 '7225352': 'Application of the Chinese Version of the BIS/BAS Scales in Participants With a Substance Use Disorder: An Analysis of Psychometric Properties and Comparison With Community Residents',
 '7225364': 'Pangolins Lack IFIH1/MDA5, a Cytoplasmic RNA Sensor That Initiates Innate Immune Defense Upon Coronavir
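
The fixed `time.sleep(1)` in the loop above could be replaced by a small throttle that only waits as long as needed to keep a minimum gap between requests. This is a sketch under assumptions: the class name `Throttle` and the 0.5 second interval are illustrative choices, and the clock and sleep functions are injectable only so the behavior can be checked without real waiting.

```python
import time


class Throttle:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Demonstration with a fake clock so nothing actually sleeps.
fake_time = [0.0]
slept = []


def fake_clock():
    return fake_time[0]


def fake_sleep(seconds):
    slept.append(seconds)
    fake_time[0] += seconds


throttle = Throttle(0.5, clock=fake_clock, sleep=fake_sleep)
throttle.wait()        # first call never waits
fake_time[0] += 0.2    # pretend the request took 0.2 s
throttle.wait()        # waits for the remaining 0.3 s
fake_time[0] += 1.0    # a slow request
throttle.wait()        # no wait needed
print(slept)
```

In the fetching loop, calling `throttle.wait()` before each `urlopen` would take the place of the unconditional sleep.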

# ID converter api 

One academic article might have several IDs:

1. PMID: a plain number, e.g., 23193287.
2. PMCID: includes the ‘PMC’ prefix, e.g., PMC3531190. (In the converter's web form, you may drop the prefix if you select the checkbox for ‘Process as PMCIDs’.)
3. Manuscript ID: includes the relevant prefix, e.g., NIHMS236863 or EMS48932.
4. DOI: the complete string, e.g., 10.1093/nar/gks1195.

Fortunately, NCBI provides a service to convert between these IDs. Let's try it out.




In [1]:
base_url_converter = "https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/"

In [7]:
sample_pmid = "32622273"

In [10]:
def id_convert(pmid):
    """Return the PMCID for the given PMID, or None if it cannot be resolved."""
    base_url_converter = "https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/"
    # build the URL from the pmid argument, not the global sample_pmid
    api_url = base_url_converter + "?ids=" + pmid + "&format=json"
    response = requests.get(api_url)

    if response.status_code != 200:
        print("request failed with status code {}".format(response.status_code))
        return None

    res = json.loads(response.content.decode('utf-8'))
    records = res.get('records', [])
    if not records or 'pmcid' not in records[0]:
        print("this pm_id is not valid in pmc")
        return None

    pmc_id = records[0]['pmcid']
    print("Corresponding pmcid is: {}".format(pmc_id))
    return pmc_id


OK, the sample PMID does not appear to be valid in PMC.
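
The response handling can be checked without calling the service by separating it from the HTTP request. The sketch below assumes only the two fields the function above already relies on (`records` and `pmcid`); the helper name `extract_pmcid` is hypothetical, and both payloads are hand-written illustrations that reuse the example IDs from the list above as placeholders, not recorded API output.

```python
import json
from typing import Optional


def extract_pmcid(payload: dict) -> Optional[str]:
    """Return the PMCID of the first record, or None if there is none."""
    records = payload.get("records") or []
    if not records:
        return None
    return records[0].get("pmcid")


# Hand-written payloads, shaped only after the fields used above.
found = json.loads('{"records": [{"pmid": "23193287", "pmcid": "PMC3531190"}]}')
missing = json.loads('{"records": [{"pmid": "23193287"}]}')
empty = json.loads('{"records": []}')

print(extract_pmcid(found))
print(extract_pmcid(missing))
print(extract_pmcid(empty))
```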


# Conclusion:
- It looks like we are able to use keyword search to find detailed information about the related articles (content, metadata, etc.)
- We need to pause between HTTP calls, even for 20 consecutive calls :c


### TODO
- how to get past the limit on the number of IDs returned by the keyword search?
- encapsulate and refactor the code
- further processing of the article content

# Reference

[1] Beautiful Soup documentation: https://beautiful-soup-4.readthedocs.io/en/latest/index.html?highlight=find#find

[2] ID converter API: https://www.ncbi.nlm.nih.gov/pmc/tools/id-converter-api/

[3] another way to extract metadata, via efetch on the PubMed database: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=32448042