# biopython
- EFetch, official.   
https://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EFetch
- Retrieving papers from PMC with Biopython  
https://roy29fuku.com/natural-language-processing/paper-analysis/retrieve-pmc-full-text-with-biopython/
- Biopython Tutorial and Cookbook  
http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec139
- tqdm   
https://pypi.org/project/tqdm/
- Smart paper search with Semantic Scholar   
https://roy29fuku.com/natural-language-processing/paper-analysis/cut-through-the-clutter-semantic-scholar/
- 19.7. xml.etree.ElementTree — The ElementTree XML API    
https://docs.python.org/2/library/xml.etree.elementtree.html#elementtree-parsing-xml
- scidownl, official doc.    
https://pypi.org/project/scidownl/
- How to print two strings (large text) side by side in python?   
https://stackoverflow.com/questions/53401383/how-to-print-two-strings-large-text-side-by-side-in-python   


In [116]:
from tqdm import tqdm
from Bio import Entrez
import xml.etree.ElementTree as ET
import requests
import re
from googletrans import Translator 

In [4]:
Entrez.email = "aflu.blossompaws@gmail.com"

# Fetch the PMIDs (PubMed IDs)

In [5]:
"""Fetch all PMIDs that match the search term."""
pmids = []
term = 'Randomized+Controlled+Trial[pt]'
db='pubmed'
retmax = 10000
 
handle = Entrez.esearch(db=db,term=term)
record = Entrez.read(handle)
count = int(record['Count'])
 

In [10]:
for retstart in tqdm(range(0, count, retmax)):
    handle = Entrez.esearch(db=db, term=term, retmax=retmax, retstart=retstart)
    record = Entrez.read(handle)
    pmids.extend(record['IdList'])

  6%|▌         | 3/52 [00:07<02:06,  2.57s/it]


KeyboardInterrupt: 

In [11]:
len(pmids)

30000
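The loop above was interrupted after three pages, which is why `pmids` holds 30000 entries, and rerunning the cell would append the same pages again. A minimal sketch of the same `retstart`/`retmax` paging with a local accumulator is shown below. The names `collect_ids`, `fake_search`, and `fake_db` are hypothetical, and the search itself is passed in as a callable so the sketch runs offline.

```python
def collect_ids(search, count, retmax=10000):
    """Collect IDs page by page.

    `search` is any callable taking retstart/retmax and returning
    a mapping with an "IdList" key (hypothetical interface).
    """
    ids = []
    for retstart in range(0, count, retmax):
        record = search(retstart=retstart, retmax=retmax)
        ids.extend(record["IdList"])
    return ids


# Stand-in data and search function so the sketch runs without network access.
fake_db = [str(n) for n in range(25)]


def fake_search(retstart, retmax):
    return {"IdList": fake_db[retstart:retstart + retmax]}


collected = collect_ids(fake_search, count=len(fake_db), retmax=10)
print(len(collected))
```

With Biopython, `search` could be a small wrapper around `Entrez.esearch` and `Entrez.read`, as used in the loop above.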

# Explore a single PubMed record

In [46]:
pmids[7002]

'31467892'

In [6]:
id_ = '31467892'
handle = Entrez.efetch(db=db, id=id_, retmode="xml")
text = handle.read().decode()

In [7]:
print(text)

<?xml version="1.0" ?>
<!DOCTYPE PubmedArticleSet PUBLIC "-//NLM//DTD PubMedArticle, 1st January 2019//EN" "https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_190101.dtd">
<PubmedArticleSet>
<PubmedArticle>
    <MedlineCitation Status="MEDLINE" Owner="NLM">
        <PMID Version="1">31467892</PMID>
        <DateCompleted>
            <Year>2020</Year>
            <Month>01</Month>
            <Day>28</Day>
        </DateCompleted>
        <DateRevised>
            <Year>2020</Year>
            <Month>02</Month>
            <Day>25</Day>
        </DateRevised>
        <Article PubModel="Electronic-eCollection">
            <Journal>
                <ISSN IssnType="Electronic">2314-6141</ISSN>
                <JournalIssue CitedMedium="Internet">
                    <Volume>2019</Volume>
                    <PubDate>
                        <Year>2019</Year>
                    </PubDate>
                </JournalIssue>
                <Title>BioMed research international</Title>
           

## Print all the tags

In [8]:
root = ET.fromstring(text)

In [9]:
def print_all_child(obj, indent=0):
    """Print every tag below obj, indented by nesting depth."""
    for child in obj:
        print(" " * 4 * indent + child.tag)
        print_all_child(child, indent + 1)

In [10]:
print_all_child(root)

PubmedArticle
    MedlineCitation
        PMID
        DateCompleted
            Year
            Month
            Day
        DateRevised
            Year
            Month
            Day
        Article
            Journal
                ISSN
                JournalIssue
                    Volume
                    PubDate
                        Year
                Title
                ISOAbbreviation
            ArticleTitle
            Pagination
                MedlinePgn
            ELocationID
            Abstract
                AbstractText
                AbstractText
                AbstractText
                    i
                    i
                    i
                AbstractText
            AuthorList
                Author
                    LastName
                    ForeName
                    Initials
                    Identifier
                    AffiliationInfo
                        Affiliation
                    AffiliationInfo
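The full tag listing gets long quickly. As a compact alternative, the tags can be counted with `collections.Counter`. This is a sketch on a hypothetical, heavily simplified sample document, not on the real record.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical miniature of a PubMed record.
sample = (
    "<PubmedArticleSet><PubmedArticle><MedlineCitation>"
    "<PMID>1</PMID>"
    "<Article><Abstract>"
    "<AbstractText>a</AbstractText><AbstractText>b</AbstractText>"
    "</Abstract></Article>"
    "</MedlineCitation></PubmedArticle></PubmedArticleSet>"
)
sample_root = ET.fromstring(sample)

# Element.iter() walks the element itself and all descendants.
tag_counts = Counter(el.tag for el in sample_root.iter())
for tag, n in tag_counts.most_common():
    print(tag, n)
```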
           

## Extract the abstract

In [11]:
def xml_print(obj, iter_text):
    """Print tag, attrib, and text of every element named iter_text."""
    for neighbor in obj.iter(iter_text):
        print("#####", neighbor.tag, neighbor.attrib)
        print(neighbor.text)
        print()

def xml_child_print(obj, iter_text):
    """Apply xml_print to each child of every element named iter_text."""
    for neighbor in obj.iter(iter_text):
        for child in neighbor:
            xml_print(child, child.tag)

In [12]:
xml_print(root,"Title")
xml_print(root,"ArticleTitle")
xml_print(root,"ELocationID")
# Tag names are case-sensitive: "abstract" would match nothing here.
xml_child_print(root, "Abstract")


##### Title {}
BioMed research international

##### ArticleTitle {}
The Effect of rTMS over the Different Targets on Language Recovery in Stroke Patients with Global Aphasia: A Randomized Sham-Controlled Study.

##### ELocationID {'EIdType': 'doi', 'ValidYN': 'Y'}
10.1155/2019/4589056

##### AbstractText {'Label': 'Objective', 'NlmCategory': 'UNASSIGNED'}
To evaluate and compare the effects of repetitive transcranial magnetic stimulation (rTMS) over the right pars triangularis of the posterior inferior frontal gyrus (pIFG) and the right posterior superior temporal gyrus (pSMG) in global aphasia following subacute stroke.

##### AbstractText {'Label': 'Methods', 'NlmCategory': 'UNASSIGNED'}
Fifty-four patients with subacute poststroke global aphasia were randomized to 15-day protocols of 20-minute inhibitory 1 Hz rTMS over either the right triangular part of the pIFG (the rTMS-b group) or the right pSTG (the rTMS-w group) or to sham stimulation, followed by 30 minutes of speech and lang
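The tag listing shows that some `AbstractText` elements contain inline `i` children. `neighbor.text` only returns the text before the first child, so such sections are cut short when printed this way. A minimal sketch that collects each labeled section with `itertext()` follows. The sample XML and the function name `abstract_sections` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical abstract with an inline <i> element in the second section.
sample = """<Abstract>
  <AbstractText Label="Objective">To compare two targets.</AbstractText>
  <AbstractText Label="Results">Scores improved (<i>P</i> &lt; 0.05) in both groups.</AbstractText>
</Abstract>"""


def abstract_sections(abstract):
    """Map each Label attribute to the full text of its AbstractText element."""
    sections = {}
    for node in abstract.iter("AbstractText"):
        label = node.get("Label", "")
        # itertext() yields the element's text plus the text and tail of all descendants.
        sections[label] = "".join(node.itertext()).strip()
    return sections


sample_abstract = ET.fromstring(sample)
sections = abstract_sections(sample_abstract)
print(sections["Results"])
```

Unlabeled abstracts would all map to the empty key in this sketch, so a list of pairs may be a better fit for them.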

# Get the PMC ID
PMCID: the identifier of an article in PMC, an archive of full-text biomedical literature.

In [13]:
"""From the PubMed record."""
xml_child_print(root, "ArticleIdList")

##### ArticleId {'IdType': 'pubmed'}
31467892

##### ArticleId {'IdType': 'doi'}
10.1155/2019/4589056

##### ArticleId {'IdType': 'pmc'}
PMC6699349

##### ArticleId {'IdType': 'pubmed'}
12480484

##### ArticleId {'IdType': 'pubmed'}
15817891

##### ArticleId {'IdType': 'pubmed'}
16530137

##### ArticleId {'IdType': 'pubmed'}
17846113

##### ArticleId {'IdType': 'pubmed'}
19004769

##### ArticleId {'IdType': 'pubmed'}
21864891

##### ArticleId {'IdType': 'pubmed'}
22948550

##### ArticleId {'IdType': 'pubmed'}
23213288

##### ArticleId {'IdType': 'pubmed'}
23813984

##### ArticleId {'IdType': 'pubmed'}
23841973

##### ArticleId {'IdType': 'pubmed'}
23867417

##### ArticleId {'IdType': 'pubmed'}
24217362

##### ArticleId {'IdType': 'pubmed'}
24519979

##### ArticleId {'IdType': 'pubmed'}
25036386

##### ArticleId {'IdType': 'pubmed'}
25547773

##### ArticleId {'IdType': 'pubmed'}
25735707

##### ArticleId {'IdType': 'pubmed'}
25972805

##### ArticleId {'IdType': 'pubmed'}
27080074

#####
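The output above mixes the article's own IDs with the PubMed IDs of its references, because `iter("ArticleIdList")` matches every `ArticleIdList` in the tree. A sketch of a narrower lookup follows, assuming the usual PubMed layout in which the article's own list sits directly under `PubmedData` and the reference lists are nested under `ReferenceList/Reference`. The sample XML is a simplified stand-in.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the PubmedData part of a record (assumed layout).
sample = """<PubmedArticleSet><PubmedArticle>
  <PubmedData>
    <ArticleIdList>
      <ArticleId IdType="pubmed">31467892</ArticleId>
      <ArticleId IdType="pmc">PMC6699349</ArticleId>
    </ArticleIdList>
    <ReferenceList><Reference>
      <ArticleIdList><ArticleId IdType="pubmed">12480484</ArticleId></ArticleIdList>
    </Reference></ReferenceList>
  </PubmedData>
</PubmedArticle></PubmedArticleSet>"""

sample_root = ET.fromstring(sample)

# A full path (no ".//") only matches the list directly under PubmedData.
own_path = "./PubmedArticle/PubmedData/ArticleIdList/ArticleId"
own_ids = [a.text for a in sample_root.findall(own_path)]

# ElementTree's XPath subset supports [@attribute='value'] predicates.
own_pmc = sample_root.find(own_path + "[@IdType='pmc']")

print(own_ids)
print(own_pmc.text)
```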

In [14]:
"""From ID converter API"""
tool = "service-root"
email = "aflu.blossompaws@gmail.com"
pmc_ID = "PMC6699349"
url = 'https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/?tool={}&email={}&ids={}&format=json'.format(tool, email, ','.join([id_]))
r = requests.get(url)
j = r.json()  # requests parses the JSON body directly

In [None]:
pmc_ID_converted = j["records"][0]["versions"][0]["pmcid"]
pmc_ID_converted
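Hand-formatted query strings are easy to get wrong. A sketch that builds the same kind of URL with `urllib.parse.urlencode` is shown below. The parameter names are the ones used in the cell above; the function name `idconv_url` and the example `tool` and `email` values are placeholders.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE = "https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/"


def idconv_url(ids, tool, email):
    """Build an ID converter URL; urlencode handles separators and escaping."""
    query = urlencode({
        "tool": tool,
        "email": email,
        "ids": ",".join(ids),
        "format": "json",
    })
    return f"{BASE}?{query}"


url_example = idconv_url(["31467892"], tool="my_tool", email="user@example.com")
print(url_example)

# Round trip: decoding the query string gives back the original values.
print(parse_qs(urlsplit(url_example).query))
```

Passing the same dictionary as `params=` to `requests.get` would achieve the same encoding without building the URL by hand.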

# Get the full text from the PMC ID

In [19]:
handle = Entrez.efetch(db='pmc', id=pmc_ID, rettype='full', retmode='xml')
text = handle.read().decode()

In [31]:
with open("sample.xml","w") as f:
    f.write(text)

In [20]:
print(text)

<?xml version="1.0" ?>
<!DOCTYPE pmc-articleset PUBLIC "-//NLM//DTD ARTICLE SET 2.0//EN" "https://dtd.nlm.nih.gov/ncbi/pmc/articleset/nlm-articleset-2.0.dtd">
<pmc-articleset><article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" article-type="research-article">
  <?properties open_access?>
  <front>
    <journal-meta>
      <journal-id journal-id-type="nlm-ta">Biomed Res Int</journal-id>
      <journal-id journal-id-type="iso-abbrev">Biomed Res Int</journal-id>
      <journal-id journal-id-type="publisher-id">BMRI</journal-id>
      <journal-title-group>
        <journal-title>BioMed Research International</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">2314-6133</issn>
      <issn pub-type="epub">2314-6141</issn>
      <publisher>
        <publisher-name>Hindawi</publisher-name>
      </publisher>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="pmid">31467892</article-id>
      <article-id pub-

In [21]:
root_pmc = ET.fromstring(text)
print_all_child(root_pmc)

article
    front
        journal-meta
            journal-id
            journal-id
            journal-id
            journal-title-group
                journal-title
            issn
            issn
            publisher
                publisher-name
        article-meta
            article-id
            article-id
            article-id
            article-categories
                subj-group
                    subject
            title-group
                article-title
            contrib-group
                contrib
                    contrib-id
                    name
                        surname
                        given-names
                    email
                    xref
                        sup
                    xref
                        sup
                contrib
                    name
                        surname
                        given-names
                    xref
                        sup
                contrib
             

In [49]:
body = root_pmc.find(".//body")
ps = body.findall(".//p")

In [54]:
p = ps[0]

In [59]:
p.text

'Stroke-related aphasia is one of the most common consequences of cerebrovascular diseases and occurs in one-third of acute or subacute stroke patients ['

In [66]:
p[0].tail

']. Global aphasia is one of the most serious and common aphasia types in acute and subacute stroke patients. This type is usually caused by infarction of the left middle cerebral artery. Patients with global aphasia have difficulties with communication, which is affected gravely and comprehensively in the domains of spontaneous speech, auditory comprehension, naming, and repetition. The most important period of language recovery usually occurs in the first to the third month after stroke, which is the key time for neurophysiological restoration and reorganization of the language cortex ['
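The two cells above show how ElementTree splits a paragraph around inline elements: `text` holds the string before the first child, and each child's `tail` holds the string that follows it. A small sketch on a hypothetical paragraph makes the pieces visible and joins them back with `itertext()`.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of a PMC paragraph with two inline citations.
sample_p = ET.fromstring(
    "<p>Stroke patients [<xref>1</xref>]. Global aphasia [<xref>2</xref>].</p>"
)

print(repr(sample_p.text))      # text before the first child
print(repr(sample_p[0].text))   # text inside the first <xref>
print(repr(sample_p[0].tail))   # text between the first and second <xref>

# itertext() walks text and tail strings in document order.
full_text = "".join(sample_p.itertext())
print(full_text)
```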

In [53]:
dir(ps[0])

['__class__',
 '__copy__',
 '__deepcopy__',
 '__delattr__',
 '__delitem__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__len__',
 '__lt__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setitem__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'append',
 'attrib',
 'clear',
 'extend',
 'find',
 'findall',
 'findtext',
 'get',
 'getchildren',
 'getiterator',
 'insert',
 'items',
 'iter',
 'iterfind',
 'itertext',
 'keys',
 'makeelement',
 'remove',
 'set',
 'tag',
 'tail',
 'text']

In [126]:
def print_all_under_obj(obj):
    """Print the text of obj and of all its descendants in document order."""
    print(obj.text)
    # Element.getchildren() was removed in Python 3.9; iterate the element directly.
    for child in obj:
        print_all_under_obj(child)
        print(child.tail)

def constract_list(obj, list_):
    """Append the text and tail strings under obj to list_ in document order."""
    list_.append(obj.text)
    for child in obj:
        constract_list(child, list_)
        list_.append(child.tail)
    return list_

def print_side_by_side(a, b, size=30, space=4):
    """print two long texts side by side. 
    
    Args: 
        a (str) : text
        b (str) : text 
        size (int) : size for print for each line.
        space (int) : space between two texts. 
    """
    while a or b:
        print(a[:size].ljust(size) + " " * space + b[:size])
        a = a[size:]
        b = b[size:]

In [71]:
print_all_under_obj(body)


    

      
1. Introduction

      
Stroke-related aphasia is one of the most common consequences of cerebrovascular diseases and occurs in one-third of acute or subacute stroke patients [
1
]. Global aphasia is one of the most serious and common aphasia types in acute and subacute stroke patients. This type is usually caused by infarction of the left middle cerebral artery. Patients with global aphasia have difficulties with communication, which is affected gravely and comprehensively in the domains of spontaneous speech, auditory comprehension, naming, and repetition. The most important period of language recovery usually occurs in the first to the third month after stroke, which is the key time for neurophysiological restoration and reorganization of the language cortex [
2
].

      
Mounting studies have demonstrated that inhibitory low-frequency repetitive transcranial magnetic stimulation (LF-rTMS) (≤1 Hz) over the unaffected hemisphere can improve language function in poststr

## Strip inline tags before parsing

In [103]:
with open("sample.xml","r") as f:
    text = f.read()
# Inline tags split a paragraph into text/tail pieces; drop the tags but keep their content.
tag_names = ["xref", "ext-link", "italic"]
remove_list = [f"<{tag}.*?>" for tag in tag_names] + [f"</{tag}>" for tag in tag_names]
for k in remove_list:
    text = re.sub(k, "", text)
print(text)

<?xml version="1.0" ?>
<!DOCTYPE pmc-articleset PUBLIC "-//NLM//DTD ARTICLE SET 2.0//EN" "https://dtd.nlm.nih.gov/ncbi/pmc/articleset/nlm-articleset-2.0.dtd">
<pmc-articleset><article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" article-type="research-article">
  <?properties open_access?>
  <front>
    <journal-meta>
      <journal-id journal-id-type="nlm-ta">Biomed Res Int</journal-id>
      <journal-id journal-id-type="iso-abbrev">Biomed Res Int</journal-id>
      <journal-id journal-id-type="publisher-id">BMRI</journal-id>
      <journal-title-group>
        <journal-title>BioMed Research International</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">2314-6133</issn>
      <issn pub-type="epub">2314-6141</issn>
      <publisher>
        <publisher-name>Hindawi</publisher-name>
      </publisher>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="pmid">31467892</article-id>
      <article-id pub-

In [104]:
root_pmc = ET.fromstring(text)
body = root_pmc.find(".//body")

In [114]:
print_all_under_obj(body)


    

      
1. Introduction

      
Stroke-related aphasia is one of the most common consequences of cerebrovascular diseases and occurs in one-third of acute or subacute stroke patients [1]. Global aphasia is one of the most serious and common aphasia types in acute and subacute stroke patients. This type is usually caused by infarction of the left middle cerebral artery. Patients with global aphasia have difficulties with communication, which is affected gravely and comprehensively in the domains of spontaneous speech, auditory comprehension, naming, and repetition. The most important period of language recovery usually occurs in the first to the third month after stroke, which is the key time for neurophysiological restoration and reorganization of the language cortex [2].

      
Mounting studies have demonstrated that inhibitory low-frequency repetitive transcranial magnetic stimulation (LF-rTMS) (≤1 Hz) over the unaffected hemisphere can improve language function in poststroke 


In [115]:
body_list = constract_list(body,[])
body_list


['\n    ',
 '\n      ',
 '1. Introduction',
 '\n      ',
 'Stroke-related aphasia is one of the most common consequences of cerebrovascular diseases and occurs in one-third of acute or subacute stroke patients [1]. Global aphasia is one of the most serious and common aphasia types in acute and subacute stroke patients. This type is usually caused by infarction of the left middle cerebral artery. Patients with global aphasia have difficulties with communication, which is affected gravely and comprehensively in the domains of spontaneous speech, auditory comprehension, naming, and repetition. The most important period of language recovery usually occurs in the first to the third month after stroke, which is the key time for neurophysiological restoration and reorganization of the language cortex [2].',
 '\n      ',
 "Mounting studies have demonstrated that inhibitory low-frequency repetitive transcranial magnetic stimulation (LF-rTMS) (≤1 Hz) over the unaffected hemisphere can improve la

In [128]:
text_en = body_list[-3]
translator = Translator()
trans_jp = translator.translate(text_en, src="en", dest="ja")
text_jp = trans_jp.text

In [130]:
print(text_jp)
print(text_en)

低周波rTMSは失語症患者のリハビリに有益であると多くの研究が報告していますが、rTMSの理想的な刺激部位はわかっていません。右のpIFGとpSTGに適用された低周波rTMSは、亜急性脳卒中後のグローバル失語症の効果的な治療法であると想定できます。 15日間の治療直後でも、LF-rTMSは右pSTGを抑制し、聴覚理解と反復の大幅な増加を促進しましたが、LF-rTMSは右pIFGを抑制し、自発的な発話と反復に明らかに変化をもたらしました。この研究で異なる刺激部位間で観察された機能回復の違いの根底にある神経メカニズムを探索するには、さらなる調査が必要です。
Many studies have reported that low-frequency rTMS is beneficial for rehabilitating patients with aphasia, but the ideal stimulation sites for rTMS are not known. Low-frequency rTMS applied to the right pIFG and pSTG can be assumed to be an effective treatment for global aphasia following subacute stroke. Even immediately after the 15-day treatment, LF-rTMS inhibited the right pSTG and promoted significantly increased gains in auditory comprehension and repetition, whereas LF-rTMS inhibited the right pIFG and apparently caused changes in spontaneous speech and repetition. Further investigations are necessary to explore the neural mechanisms that underlie the differences in functional recovery observed between t
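Picking `body_list[-3]` by hand works for a single paragraph. To process the whole body, the whitespace-only entries (and any `None` from elements without text or tail) have to be filtered out first. A sketch follows in which the translation step is passed in as a callable, so it runs offline. `clean_segments`, `translate_all`, and the sample list are hypothetical, and `str.upper` only stands in for a real translator.

```python
def clean_segments(segments):
    """Drop None and whitespace-only entries and strip the rest."""
    return [s.strip() for s in segments if s and s.strip()]


def translate_all(segments, translate):
    """Return (original, translated) pairs for every non-empty segment."""
    return [(s, translate(s)) for s in clean_segments(segments)]


# Hypothetical list shaped like body_list above.
raw = [
    "\n    ",
    "1. Introduction",
    None,
    "\n      ",
    "Stroke-related aphasia is common.",
    "\n  ",
]

pairs = translate_all(raw, translate=str.upper)  # stand-in for a real translator
for original, translated in pairs:
    print(original, "->", translated)
```

With googletrans, `translate` could be `lambda s: translator.translate(s, src="en", dest="ja").text`, mirroring the call in the cell above.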

In [129]:
print_side_by_side(text_en, text_jp)

Many studies have reported tha    低周波rTMSは失語症患者のリハビリに有益であると多くの研究
t low-frequency rTMS is benefi    が報告していますが、rTMSの理想的な刺激部位はわかっていま
cial for rehabilitating patien    せん。右のpIFGとpSTGに適用された低周波rTMSは、亜
ts with aphasia, but the ideal    急性脳卒中後のグローバル失語症の効果的な治療法であると想定で
 stimulation sites for rTMS ar    きます。 15日間の治療直後でも、LF-rTMSは右pSTG
e not known. Low-frequency rTM    を抑制し、聴覚理解と反復の大幅な増加を促進しましたが、LF-
S applied to the right pIFG an    rTMSは右pIFGを抑制し、自発的な発話と反復に明らかに変
d pSTG can be assumed to be an    化をもたらしました。この研究で異なる刺激部位間で観察された機
 effective treatment for globa    能回復の違いの根底にある神経メカニズムを探索するには、さらな
l aphasia following subacute s    る調査が必要です。
troke. Even immediately after     
the 15-day treatment, LF-rTMS     
inhibited the right pSTG and p    
romoted significantly increase    
d gains in auditory comprehens    
ion and repetition, whereas LF    
-rTMS inhibited the right pIFG    
 and apparently caused changes    
 in spontaneous speech and rep    
etition. Further investigation    
s are necessary to ex
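`print_side_by_side` cuts every row at a fixed character count, so English words are split across rows. A variant that wraps on word boundaries with `textwrap.wrap` and pairs the rows with `itertools.zip_longest` is sketched below. The function name `side_by_side_lines` is hypothetical, and text without spaces (such as the Japanese column) is still cut at `size` characters.

```python
import textwrap
from itertools import zip_longest


def side_by_side_lines(a, b, size=30, space=4):
    """Return rows showing a and b in two columns, wrapped at word boundaries."""
    left = textwrap.wrap(a, width=size)
    right = textwrap.wrap(b, width=size)
    return [
        l.ljust(size) + " " * space + r
        for l, r in zip_longest(left, right, fillvalue="")
    ]


en_sample = "Many studies have reported that low-frequency rTMS is beneficial"
ja_sample = "あ" * 45  # placeholder for text without spaces

rows = side_by_side_lines(en_sample, ja_sample)
for row in rows:
    print(row)
```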

# Using XPath

In [24]:
titles = root_pmc.findall(".//title")
for title in titles:
    print(title.text)

Objective
 Methods
 Results
 Conclusions
1. Introduction
2. Materials and Methods
2.1. Subjects
2.2. Procedure
2.3. Transcranial Magnetic Stimulation
2.4. Sample Size Calculation
2.5. Statistical Analysis
3. Results
3.1. Participant Characteristics
3.2. Treatment Effects
4. Discussion
4.1. Study Limitations
5. Conclusions
Acknowledgments
Data Availability
Additional Points
Conflicts of Interest
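Beyond listing titles, the same path queries can group paragraphs under their section title. The sketch below assumes that sections are `sec` elements with a `title` child and `p` children, as in the JATS format PMC uses. The sample XML and the name `paragraphs_by_section` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of a PMC article body (assumed <sec>/<title>/<p> layout).
sample = """<body>
  <sec><title>1. Introduction</title><p>First paragraph.</p><p>Second [<xref>1</xref>].</p></sec>
  <sec><title>2. Materials and Methods</title>
    <sec><title>2.1. Subjects</title><p>Fifty-four patients.</p></sec>
  </sec>
</body>"""


def paragraphs_by_section(body):
    """Map each section title to the text of its direct <p> children."""
    result = {}
    for sec in body.iter("sec"):
        title = sec.findtext("title", default="").strip()
        # findall("p") without ".//" only returns direct children of this section.
        result[title] = ["".join(p.itertext()).strip() for p in sec.findall("p")]
    return result


by_section = paragraphs_by_section(ET.fromstring(sample))
for title, paragraphs in by_section.items():
    print(title, len(paragraphs))
```

Paragraphs of a subsection are listed under the subsection only, so a parent section with no paragraphs of its own maps to an empty list.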


# Translating text
- [Python] Trying to convert (translate) Japanese data into English with googletrans  
https://96lovefootball.hatenablog.com/entry/2019/02/10/213000   
- Machine translation in Python: the 'translate' package & web scraping   
https://arthur-ai.hatenadiary.jp/entry/2018/11/10/000807   
- List of Python packages for science and technology    
https://www.trifields.jp/pypi-science-and-technology-932  
- watson-streaming 0.0.13    
https://pypi.org/project/watson-streaming/    


In [103]:
translator = Translator()

In [105]:
text = "Stroke-related aphasia is one of the most common consequences of cerebrovascular diseases and occurs in one-third of acute or subacute stroke patients"
trans_jp = translator.translate(text, src="en", dest="ja")

In [107]:
trans_jp.text

'脳卒中関連失語症は脳血管疾患の最も一般的な結果の1つであり、急性または亜急性脳卒中患者の3分の1で発生します'