# PubMed Metadata Query and Analysis

Purpose: Ingest selected XML nodes from pubmed.gov result sets (exported with File > XML)
into pandas dataframes.
Search strategy used in development: https://www.nlm.nih.gov/services/queries/health_literacy.html
```
Original Author: Dan Wendling 
Created on Tue Jul 17 17:38:22 2018 

Editor: Tony Chu
Created Jupyter Notebook version
Modified on Mon Aug 27 2018
```

### NOTES/WARNINGS


* Exported pubmed.gov XML is hierarchical; this script removes the hierarchy, 
arranging the data as you might see it in a relational database, where fields
with repeating entries (author, grant, MeSH, publication type, etc.) 
are placed in their own dataframes; PMID is the "primary key" for pubmedarticle and the
"foreign key" for the other dataframes.
* A small number of records will be left out! Nothing is captured for 
`<PubmedBookArticle>`, for example. The code below captures only journal articles, 
`<PubmedArticle>`.
* Structured abstracts use the NLM structure labels (@NlmCategory); the less
standardized labels in the original source articles are not captured.
* Some elements are simplified, such as getting only the pmc (free full text)
designation from the ArticleId element. More items could be added - this is not 
everything. Each section has a URL for more information.
* MeSH is thrown in here as one column, in case you want to use natural
language processing on TI-AB-MeSH, etc. The MeSH content ends with 
semicolon-space ("; "); code to prevent this is available (somewhere, not here).
* While MeSH has its own "table," it is also included as a concatenated field
in pubmedarticle - in case you want to run NLP on one dataframe.
* Not included here are repeating-entry nodes such as ArticleIdList (except
for grabbing the PubMed Central ID, if available) and History...
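
The primary key / foreign key arrangement described above can be sketched with two toy dataframes. This is a minimal illustration, not part of the original script: the `_demo` names and all values are invented, and only the `PMID` column name comes from the notebook.

```python
import pandas as pd

# Toy stand-ins for the real dataframes; all values are invented.
pubmedarticle_demo = pd.DataFrame({
    "PMID": ["111", "222"],
    "ArticleTitle": ["First title", "Second title"],
})
author_demo = pd.DataFrame({
    "PMID": ["111", "111", "222"],
    "LastName": ["Adams", "Baker", "Clark"],
})

# PMID is unique in the parent table and repeats in the child table,
# so a left merge yields one row per (article, author) pair.
article_author = pubmedarticle_demo.merge(author_demo, on="PMID", how="left")
print(article_author)
```

The same join pattern should apply to the mesh, grant, chemical and pubtype dataframes, since each of them carries PMID.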

### FIXME - RECOMMENDATIONS

* Need one function to re-use code, to reduce chances for error.  
(Fixed - see **cell 3, function trans_df()** )


* Need to represent nulls in the dataframes for all desired features/columns.  
Currently only the tags found in the current XML file are brought into the
dataframe, which is too limiting.  
(Fixed - see **cell 5, function column_template()**)

## SCRIPT CONTENTS


1. Start-up / What to put into place, where
2. Notes on XSLT parsing
3. Create all dataframes needed
4. Explore pubmedarticle
5. Explore author
6. Explore mesh
7. Explore grant
8. Explore chemical
9. Explore pubtype
10. Write out, for next work session, visualization, etc.

Partly based on https://stackoverflow.com/questions/49439081/nested-xml-file-to-pandas-dataframe

Query used for development (limited to articles with NIH grant numbers): https://www.nlm.nih.gov/services/queries/health_literacy.html

### 1. Start-up / What to put into place, where

In [1]:
import lxml.etree as et
import pandas as pd
import numpy as np
import os

# The following commands print the output of every expression in a cell, not just the last one.
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Set working directory
os.chdir('E:/WinPython-64bit-3.4.4.4Qt5/notebooks/pubmed-gov_xml_2_python_dataframe')
localDir = 'pubmed-gov_xml_2_python_dataframe/'

### 2. Notes on XSLT parsing

Adapt XSLT as needed - https://www.nlm.nih.gov/bsd/licensee/data_elements_doc.html 

```
About pubmedarticle - 
CitationStatus is very important! Records pass through multiple phases of 
completeness. If a record does not have CitationStatus = MEDLINE, 
it may not have been complete when you downloaded the record set. 
This means you may have basic information but, usually, NOT grant 
information, MeSH information, etc.
```
https://www.nlm.nih.gov/bsd/licensee/elements_descriptions.html#status_value
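
One way to act on the CitationStatus note is to separate fully indexed records from the rest before analysing grants or MeSH. The sketch below uses an invented three-row dataframe; only the `CitationStatus` column name and its values follow the notebook.

```python
import pandas as pd

# Invented sample; the status values mirror those seen in the notebook output.
status_demo = pd.DataFrame({
    "PMID": ["111", "222", "333"],
    "CitationStatus": ["MEDLINE", "Publisher", "In-Process"],
})

# Records with status MEDLINE are the ones expected to carry MeSH and grant data.
is_complete = status_demo["CitationStatus"] == "MEDLINE"
complete_records = status_demo[is_complete]
pending_records = status_demo[~is_complete]

print(len(complete_records), "complete;", len(pending_records), "possibly incomplete")
```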

About author - 
https://www.nlm.nih.gov/bsd/licensee/elements_descriptions.html#authorlist

About mesh (medical subject heading) -
https://www.nlm.nih.gov/bsd/licensee/elements_descriptions.html#meshheadinglist

About grant -
https://www.nlm.nih.gov/bsd/licensee/elements_descriptions.html#grantlist
https://www.nlm.nih.gov/bsd/grant_acronym.html

About chemical - 
https://www.nlm.nih.gov/bsd/licensee/elements_descriptions.html#chemicallist

About pubtype (publication type) - 
https://www.nlm.nih.gov/bsd/licensee/elements_descriptions.html#publicationtypelist
https://www.ncbi.nlm.nih.gov/books/NBK3827/table/pubmedhelp.T.publication_types/?report=objectonly


In [2]:
# Parse the XSLT stylesheets; change the paths to match your filenames

xsl_pubmedarticle = et.parse("xslt/tbl_pubmedarticle.xsl")
xsl_author = et.parse("xslt/tbl_author.xsl")
xsl_grant = et.parse("xslt/tbl_grant.xsl")
xsl_chemical = et.parse("xslt/tbl_chemical.xsl")
xsl_mesh = et.parse("xslt/tbl_mesh.xsl")
xsl_pubtype = et.parse("xslt/tbl_pubtype.xsl")

### 3. Create all the dataframes needed

In [3]:
# Read in the downloaded XML file
doc = et.parse("xml/pubmed_result.xml")

# the XSL transform function

def trans_df(xsl_type):
    transformer = et.XSLT(xsl_type)
    result = transformer(doc)
    data = []
    for i in result.xpath('/*'):
        inner = {}
        for j in i.xpath('*'):
            inner[j.tag] = j.text
        data.append(inner)
    return pd.DataFrame(data)
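
The .xsl files themselves are not shown in this notebook, so the following is only a sketch of the general technique: an XSLT stylesheet flattens each article into a row element, and the row's children become dataframe columns. The inline XML, the stylesheet, and the `/rows/row` output layout are all assumptions made for illustration.

```python
import lxml.etree as et
import pandas as pd

# Invented two-record input, far smaller than a real PubMed export.
xml_src = b"""<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation Status="MEDLINE">
      <PMID>111</PMID>
      <Article><ArticleTitle>First title</ArticleTitle></Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation Status="Publisher">
      <PMID>222</PMID>
      <Article><ArticleTitle>Second title</ArticleTitle></Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

# Hypothetical stylesheet: one <row> per article, one child element per column.
xsl_src = b"""<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/PubmedArticleSet">
    <rows>
      <xsl:for-each select="PubmedArticle">
        <row>
          <PMID><xsl:value-of select="MedlineCitation/PMID"/></PMID>
          <CitationStatus><xsl:value-of select="MedlineCitation/@Status"/></CitationStatus>
          <ArticleTitle><xsl:value-of select="MedlineCitation/Article/ArticleTitle"/></ArticleTitle>
        </row>
      </xsl:for-each>
    </rows>
  </xsl:template>
</xsl:stylesheet>"""

transform = et.XSLT(et.fromstring(xsl_src))
flat = transform(et.fromstring(xml_src))

# Each <row> becomes a dict keyed by child tag name, then a dataframe row.
records = [{cell.tag: cell.text for cell in row} for row in flat.xpath("/rows/row")]
demo_df = pd.DataFrame(records)
print(demo_df)
```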

In [4]:
pubmedarticle = trans_df(xsl_pubmedarticle)
author = trans_df(xsl_author)
mesh = trans_df(xsl_mesh)
grant = trans_df(xsl_grant)
chemical = trans_df(xsl_chemical)
pubtype = trans_df(xsl_pubtype)

### 4. Explore pubmedarticle

#### Write a function that builds a column template by adding the desired new columns, filled with NaN

In [5]:
# Specify the target dataframe. In this test case, pubmedarticle is the target to be modified.
target = pubmedarticle

# List the columns desired in the new dataframe (it is fine to include any original column).
col_desire = ['PMID','Volume','Col1','Col2'] # In this test case, Col1 & Col2 are the additional/new columns.

def column_template(target, col_desire):
    col_original = list(target.columns)
    col_add = []
    for col in col_desire:
        # Check which column does not exist among the original
        if col not in col_original:
            col_add.append(col)
    # Pass a flat list of column labels; a nested list is not a plain list of names.
    return target.reindex(columns=target.columns.tolist() + col_add, fill_value=np.nan)
    
pubmedarticle = column_template(target, col_desire)
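
To see the column-template idea in isolation, here is a small sketch on an invented dataframe. The helper name `add_missing_columns` is hypothetical; it follows the same `reindex` approach as `column_template()` but is written to run on its own.

```python
import numpy as np
import pandas as pd

def add_missing_columns(df, wanted):
    """Return df with any column from `wanted` that it lacks appended as NaN."""
    missing = [col for col in wanted if col not in df.columns]
    return df.reindex(columns=df.columns.tolist() + missing, fill_value=np.nan)

# Invented sample with only two of the four desired columns.
demo = pd.DataFrame({"PMID": ["111", "222"], "Volume": ["9", None]})
templated = add_missing_columns(demo, ["PMID", "Volume", "Col1", "Col2"])
print(templated.columns.tolist())
```

Existing columns keep their values; only the newly added ones are filled with NaN.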

In [6]:
pubmedarticle.columns

# Select columns, column order
pubmedarticle = pubmedarticle[['PMID', 'CitationStatus', 'PMC', 'ArticleTitle', 'JournalTitle', 'Volume', 'Issue', 'Pagination', 'ArticleDateYear', 'DateCreatedYear', 'Abstract', 'MeSH']]

pubmedarticle.head()

Index(['Abstract', 'ArticleDateType', 'ArticleDateYear', 'ArticleTitle',
       'CitationOwner', 'CitationStatus', 'CitedMedium', 'DateCreated',
       'DateCreatedYear', 'Issue', 'JournalCountry', 'JournalNlmUniqueID',
       'JournalTitle', 'Language', 'MeSH', 'PMC', 'PMID', 'Pagination',
       'PubDate', 'PubModel', 'Volume', 'Col1', 'Col2'],
      dtype='object')

Unnamed: 0,PMID,CitationStatus,PMC,ArticleTitle,JournalTitle,Volume,Issue,Pagination,ArticleDateYear,DateCreatedYear,Abstract,MeSH
0,30125790,Publisher,,A randomized controlled trial of three smartph...,Behaviour research and therapy,109.0,,75-83,2018,2018,Many smartphone applications (apps) for mental...,
1,30124102,Publisher,,An Analysis of Informed Consent Form Readabili...,Journal of empirical research on human researc...,,,1556264618795057,2018,2018,Twenty-two percent of adults in the United Sta...,
2,30123879,PubMed-not-MEDLINE,PMC6092154,The Association of Health Literacy with Breast...,European journal of breast health,14.0,3.0,144-147,2018,2018,UNASSIGNED: The incidence of breast cancer amo...,
3,30123504,PubMed-not-MEDLINE,PMC6088416,Role of gender in the treatment experiences of...,Journal of eating disorders,6.0,,18,2018,2018,UNASSIGNED: Traditionally perceived as a disor...,
4,30123148,PubMed-not-MEDLINE,PMC6085433,Does Fear Increase Search Effort in More Numer...,Frontiers in psychology,9.0,,1203,2018,2018,The aim of this study was to investigate the e...,


In [7]:
# Multiple Medical Subject Headings (MeSH) for a record/row are all put in the same cell, separated by semicolons.

# The following statement prints the full cell content (MeSH here) without truncation.
# On pandas 1.0 and later, pass None instead of -1.
pd.set_option('display.max_colwidth', -1)

test = pubmedarticle[pubmedarticle["MeSH"].notnull()]
test[["PMID","MeSH"]].head()

Unnamed: 0,PMID,MeSH
70,30064415,Adult; Breast Feeding; Complementary Therapies; Decision Making; Female; Health Literacy; Humans; Lactation; Pregnancy; Qualitative Research; Women's Health;
83,30055946,Health Literacy;
106,30019668,Comprehension; Consumer Health Information; Head and Neck Neoplasms; Humans; Internet; Otorhinolaryngologic Diseases; United Kingdom;
120,30005747,"Adult; Choice Behavior; Clinical Trials as Topic; Cognition; Comprehension; Conflict (Psychology); Female; Health Knowledge, Attitudes, Practice; Health Literacy; Humans; Immunologic Factors; Information Dissemination; Intelligence; Male; Middle Aged; Multiple Sclerosis, Relapsing-Remitting; Patient Education as Topic; Patient Participation; Risk Assessment; Risk Factors; Socioeconomic Factors; State Medicine; United Kingdom;"
169,29961935,"Clinical Competence; Comprehension; Diagnostic Techniques and Procedures; Drug Compounding; Education, Medical; Headache; Humans; Learning; Male; Medicine, Kampo; Middle Aged; Physicians; Prescriptions; Students, Medical; Surveys and Questionnaires; Teaching;"
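
The concatenated MeSH strings above end with a trailing separator. A minimal sketch of one way to remove it and split the terms into lists, using an invented series rather than the real `MeSH` column:

```python
import pandas as pd

# Invented sample in the same "term; term; " shape as the MeSH column.
mesh_demo = pd.Series([
    "Adult; Breast Feeding; Health Literacy; ",
    "Health Literacy; ",
    None,
])

# rstrip removes any trailing semicolons and spaces; missing values stay missing.
mesh_clean = mesh_demo.str.rstrip("; ")

# Splitting on "; " is safe for headings that contain commas.
mesh_lists = mesh_clean.str.split("; ")
print(mesh_lists.tolist())
```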


In [8]:
# Article count by journal name
articleCountByJournalName = pubmedarticle['JournalTitle'].value_counts().reset_index()
articleCountByJournalName = articleCountByJournalName.rename(columns={'JournalTitle': 'Count', 'index': 'JournalTitle'})
articleCountByJournalName.head(n=20)

Unnamed: 0,JournalTitle,Count
0,Patient education and counseling,372
1,Journal of health communication,291
2,Journal of general internal medicine,164
3,PloS one,129
4,BMC public health,127
5,Journal of medical Internet research,109
6,Studies in health technology and informatics,104
7,Journal of cancer education : the official journal of the American Association for Cancer Education,101
8,BMC health services research,91
9,Medical decision making : an international journal of the Society for Medical Decision Making,85
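
The `rename(columns={..., 'index': ...})` pattern used in these counting cells relies on `reset_index()` producing a column called `index`, which holds for the pandas version this notebook was written with. From pandas 2.0 on, `value_counts()` names its result `count` and keeps the original column name on the index, so that rename no longer gives the intended headers. A sketch of a form that should behave the same on both, shown on an invented series:

```python
import pandas as pd

# Invented sample of journal titles.
journals = pd.Series(
    ["PloS one", "BMC public health", "PloS one"], name="JournalTitle"
)

# Name the index explicitly, then name the counts column when resetting.
journal_counts = (
    journals.value_counts()
    .rename_axis("JournalTitle")
    .reset_index(name="Count")
)
print(journal_counts)
```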


In [9]:
# Citation status (records are built incrementally)
# https://www.nlm.nih.gov/bsd/mms/medlineelements.html#stat
articleStatus = pubmedarticle['CitationStatus'].value_counts().reset_index()
articleStatus = articleStatus.rename(columns={'CitationStatus': 'Count', 'index': 'CitationStatus'})
articleStatus.head()

Unnamed: 0,CitationStatus,Count
0,MEDLINE,11107
1,PubMed-not-MEDLINE,714
2,In-Data-Review,454
3,Publisher,451
4,In-Process,204


In [10]:
# Count of articles in PubMed Central
print("Available in PubMed Central (free full text)\n{}".format(pubmedarticle['PMC'].notnull().sum()))

Available in PubMed Central (free full text)
3999


### 5. Explore author

In [11]:
author.columns

Index(['Affiliation', 'AuthorListCompleteYN', 'AuthorValidYN',
       'CollectiveName', 'ConstructedPersonName', 'ForeName', 'Identifier',
       'IdentifierSource', 'Initials', 'LastName', 'PMID', 'Suffix'],
      dtype='object')

```
Reduce to what you need, reorder. Will not bring in columns that are not in the
XML; potential list is PMID, AuthorListCompleteYN, AuthorValidYN, LastName,
ForeName, Suffix, Initials, CollectiveName, EqualContrib, ConstructedPersonName,
Affiliation, IdentifierSource, Identifier
```

*FIXME - Useful to have Affiliation, but will cause error if it wasn't in the XML*
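
One possible way around this FIXME is to select columns with `reindex` instead of square-bracket indexing: a column that is absent from the XML then comes back as all-NaN rather than raising a `KeyError`. The dataframe below is invented for illustration.

```python
import pandas as pd

# Invented author table from an export that had no Affiliation tags.
author_no_affil = pd.DataFrame({"PMID": ["111"], "LastName": ["Adams"]})

wanted = ["PMID", "LastName", "Affiliation"]

# reindex keeps the requested order and fills missing columns with NaN.
author_safe = author_no_affil.reindex(columns=wanted)
print(author_safe)
```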

In [12]:
# Select columns, column order
author = author[['PMID', 'AuthorListCompleteYN', 'AuthorValidYN', 'LastName',
                 'ForeName', 'Initials', 'ConstructedPersonName', 'Affiliation']]

# Total person names
print("Total person names found: {}".format(len(author)))

# Total unique person names
print("Total unique person names: {}".format(author['ConstructedPersonName'].nunique()))

Total person names found: 53768
Total unique person names: 36258


In [13]:
# Where are authors based?
authorCountByAffiliation = author['Affiliation'].value_counts().reset_index()
authorCountByAffiliation = authorCountByAffiliation.rename(columns={'Affiliation': 'Author count', 'index': 'Author affiliation'})
authorCountByAffiliation.head(n=20)

Unnamed: 0,Author affiliation,Author count
0,"From the Centre for Global Health, University of Ottawa, Ottawa, Ontario, Canada; University of Lorraine, EA 4360 Apemac, Nancy, France; Cabrini Institute and Monash University, Melbourne, Australia; Bruyère Research Institute, and the Institute of Population Health, University of Ottawa, Ottawa, Ontario, Canada; Faculty of Medicine, Nursing and Health Sciences, Monash University, Australia; Musculoskeletal Statistics Unit, The Parker Institute, Department of Rheumatology, Copenhagen University Hospitals, Bispebjerg and Frederiksberg, Denmark; Department of Internal Medicine, Division of Rheumatology, Maastricht University Medical Center and Caphri Research Institute, Maastricht University, Maastricht, The Netherlands; Quintiles Inc.; Division of Rheumatology, Department of Medicine, Duke University School of Medicine, Durham, North Carolina, USA; Children's Hospital of Eastern Ontario Research Institute, Department of Pediatrics, University of Ottawa, Ottawa, Ontario, Canada; VU Medical Centre, Amsterdam, The Netherlands; University of California at San Francisco, San Francisco, California, USA; Faculty of Health and Applied Sciences, University of the West of England, Bristol, UK; Departments of Medicine and Community Health Sciences, Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada; Department of Internal Medicine II Rheumatology, Clinical Immunology, Osteology, Physical Therapy and Sports Medicine, Schlosspark-Klinik, Teaching Hospital of the Charité, University Medicine Berlin, Berlin, Germany; University of Marmara, Faculty of Medicine, Department of Physical and Rehabilitation Medicine, Rheumatology Clinics, Istanbul, Turkey; Birmingham Veterans Affairs Medical Center and University of Alabama at Birmingham, Birmingham, Alabama; Division of Rheumatology, University of Pennsylvania, Philadelphia, Pennsylvania, USA; Ottawa Hospital Research Institute, Ottawa; University Health Network (UHN)-Toronto Western Hospital, Toronto, 
Ontario",24
1,"Columbia University, New York, NY.",21
2,"Division of Plastic and Reconstructive Surgery, Department of Surgery, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, Massachusetts.",18
3,"Research Division,Institute of Mental Health,Singapore.",17
4,"Pharmaceutical Market Research Group, Health Literacy Initiative Committee, PO Box 1449, Minneola, FL 34755, USA.",16
5,"Research Division, Institute of Mental Health, Singapore.",16
6,"State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou, China.",14
7,"Gerontology Research Center, Faculty of Sport and Health Sciences, Univerisity of Jyvaskyla, P.O. Box 35 (viv 149), 40014, Jyväskylä, Finland.",14
8,"Department of Public Health, University of Copenhagen, Copenhagen, Denmark.",14
9,"Elyse R. Park, Giselle K. Perez, Karen Donelan, Mariel Franklin, Kelly A. Hyland, and Karen A. Kuhlthau, Massachusetts General Hospital; Joel S. Weissman, Brigham and Women's Hospital; Lisa R. Diller and Christopher J. Recklitis, Dana-Farber Cancer Institute, Boston, MA; Anne C. Kirchhoff, Huntsman Cancer Institute, Salt Lake City, UT; Wendy Leisenring, Fred Hutchinson Cancer Research Center, Seattle, WA; Ann C. Mertens, Emory University School of Medicine, Atlanta, GA; James D. Reschovsky, Mathematica Policy Research, Washington, DC; and Gregory T. Armstrong and Leslie L. Robison, St Jude Children's Research Hospital, Memphis, TN.",14


In [14]:
# How many articles are associated with each institution?
# Author affiliation field is not controlled, making this messy/inaccurate!
# The <Identifier> tag cleans this up but is not in most records.
# FIXME - Good candidate for natural language processing.

articleCountByInstitution = author.groupby('Affiliation')['PMID'].nunique().reset_index()
articleCountByInstitution = articleCountByInstitution.rename(columns={'Affiliation': 'Author affiliation', 'PMID': 'PMID count'})
articleCountByInstitution.head(20)

Unnamed: 0,Author affiliation,PMID count
0,"(Department of Pharmacy Practice, South Dakota State University, at the time of the study) Division of Social and Administrative Sciences, University of Wisconsin-Madison . Madison, WI ( United States ). oshiyanbola@pharmacy.wisc.edu.",1
1,"* HEARing Cooperative Research Centre , Melbourne , Australia.",1
2,"*California State University, Long Beach, CA; and †School of Social Work, University of Southern California, Los Angeles, CA.",1
3,"*Cedars-Sinai Center for Outcomes Research and Education (CS-CORE), Los Angeles, California; †Department of Medicine, Cedars-Sinai Medical Center, Los Angeles, California; ‡Department of Medicine, Division of Digestive and Liver Diseases, Cedars-Sinai Medical Center, Los Angeles, California; §Department of Medicine, Division of Health Services Research, Cedars-Sinai Medical Center, Los Angeles, California; and ‖Takeda Pharmaceuticals U.S.A., Inc., Deerfield, Illinois.",1
4,"*Center for Healthcare Organization and Implementation Research, ENRM Veterans Affairs Medical Center, Bedford †Division of Health Informatics and Implementation Science, Quantitative Health Sciences, University of Massachusetts Medical School, Worcester ‡Department of Health Law Policy and Management, Boston University School of Public Health, Boston, MA §Center for Health Equity Research and Promotion, Corporal Michael J. Crescenz/Philadelphia VAMC ∥Division of General Internal Medicine, University of Pennsylvania School of Medicine, Philadelphia, PA ¶Center of Innovation for Complex Chronic Healthcare, Jesse Brown Veterans Affairs Medical Center #Section of Academic Internal Medicine, Department of Medicine, University of Illinois at Chicago College of Medicine, Chicago, IL **Ralph H. Johnson VA Medical Center ††Medical University of South Carolina College of Nursing, Charleston, SC ‡‡Bryant University Communications Department, Smithfield, RI.",1
5,"*Center for Home Care Policy & Research ‡Department of Compliance & Regulatory Affairs, Visiting Nurse Service of NY ∥Division of Geriatrics & Palliative Medicine, Weill Cornell Medical Center, New York †Department of Physical Therapy Education, SUNY Upstate Medical University, Syracuse §Department of Human Development, Cornell University, Ithaca, NY.",1
6,"*Centre for Community Child Health Murdoch, Murdoch Childrens Research Institute, Parkville, Australia; †Department of Paediatrics, University of Melbourne, Parkville, Australia; ‡Department of Psychology, Deakin University, Burwood, Australia; §The Royal Children's Hospital, Parkville, Australia.",1
7,*Corresponding author. E-mail: daniel.apolinario@usp.br.,1
8,*Corresponding author. E-mail: fran.baum@flinders.edu.au.,1
9,*Corresponding author. E-mail: maria.stuttaford@warwick.ac.uk.,1
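
Short of full natural language processing, light normalization already merges some near-duplicates. In the author-count output earlier, "Research Division,Institute of Mental Health,Singapore." and "Research Division, Institute of Mental Health, Singapore." are counted separately only because of spacing. The helper below is a hypothetical sketch, not part of the original script, and it will not resolve multi-institution strings or e-mail-only entries.

```python
import re

def normalize_affiliation(text):
    """Collapse spacing around commas, squeeze whitespace, drop the final period, lowercase."""
    text = re.sub(r"\s*,\s*", ", ", text.strip())
    text = re.sub(r"\s+", " ", text)
    return text.rstrip(".").casefold()

a = normalize_affiliation("Research Division,Institute of Mental Health,Singapore.")
b = normalize_affiliation("Research Division, Institute of Mental Health, Singapore.")
print(a == b)
```

Applied to the real data, the function could be mapped over the Affiliation column before grouping, for example with `author['Affiliation'].dropna().map(normalize_affiliation)`.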


### 6. Explore MeSH

In [15]:
mesh.columns

# Select columns, column order.

mesh = mesh[['PMID', 'DescriptorName', 'MajorDNTopicYN', 'MeshUI', 
             'QualifierName', 'QualifierNameMajorTopicYN', 'QualifierNameUI']]

mesh.head(5)
# Each PMID may be associated with multiple MeSH terms, so a PMID can appear in several rows.

Index(['DescriptorName', 'MajorDNTopicYN', 'MeshUI', 'PMID', 'QualifierName',
       'QualifierNameMajorTopicYN', 'QualifierNameUI'],
      dtype='object')

Unnamed: 0,PMID,DescriptorName,MajorDNTopicYN,MeshUI,QualifierName,QualifierNameMajorTopicYN,QualifierNameUI
0,30064415,Adult,N,D000328,,,
1,30064415,Breast Feeding,N,D001942,,,
2,30064415,Complementary Therapies,Y,D000529,instrumentation,N,Q000295
3,30064415,Decision Making,N,D003657,,,
4,30064415,Female,N,D005260,,,


In [16]:
# Total MeSH entries
print("Total MeSH terms used: {}".format(len(mesh)))

# Total unique MeSH terms
print("Unique MeSH terms: {}".format(mesh['DescriptorName'].nunique()))

# Total unique Qualifier terms used
print("Unique Qualifier terms: {}".format(mesh['QualifierName'].nunique()))

Total MeSH terms used: 148572
Unique MeSH terms: 4809
Unique Qualifier terms: 68


In [17]:
# What MeSH terms are represented?
meshTermCounts = mesh['DescriptorName'].value_counts().reset_index()
meshTermCounts = meshTermCounts.rename(columns={'DescriptorName': 'Article count', 'index': 'MeSH term'})
output = meshTermCounts.head(n=20)
print("The MeSH terms represented, Top 20 (max)\n{}".format(output))

The MeSH terms represented, Top 20 (max)
                                MeSH term  Article count
0   Humans                                 10820        
1   Female                                 6200         
2   Male                                   5485         
3   Adult                                  4228         
4   Middle Aged                            3830         
5   Health Literacy                        3795         
6   Comprehension                          3550         
7   Aged                                   2946         
8   Health Knowledge, Attitudes, Practice  2454         
9   Patient Education as Topic             2330         
10  Surveys and Questionnaires             2032         
11  Adolescent                             1844         
12  Educational Status                     1647         
13  United States                          1596         
14  Young Adult                            1529         
15  Informed Consent                       1524

In [18]:
# For Tableau, create a table of "major" descriptors (the next cell keeps those appearing at least 10 times)
meshMajorTable = mesh.loc[mesh['MajorDNTopicYN'] == 'Y']
meshMajorTable.head(10)

Unnamed: 0,PMID,DescriptorName,MajorDNTopicYN,MeshUI,QualifierName,QualifierNameMajorTopicYN,QualifierNameUI
2,30064415,Complementary Therapies,Y,D000529,instrumentation,N,Q000295
5,30064415,Health Literacy,Y,D057220,,,
7,30064415,Lactation,Y,D007774,,,
10,30064415,Women's Health,Y,D016387,,,
11,30055946,Health Literacy,Y,D057220,,,
14,30019668,Head and Neck Neoplasms,Y,D006258,,,
16,30019668,Internet,Y,D020407,,,
17,30019668,Otorhinolaryngologic Diseases,Y,D010038,,,
21,30005747,Clinical Trials as Topic,Y,D002986,,,
23,30005747,Comprehension,Y,D032882,,,


In [19]:
# From https://stackoverflow.com/questions/30485151/python-pandas-exclude-rows-below-a-certain-frequency-count
meshMajorTable = meshMajorTable.groupby('DescriptorName').filter(lambda x: len(x) >= 10)
meshMajorTable = meshMajorTable[['PMID', 'DescriptorName', 'MajorDNTopicYN']]

meshMajorTable.head(10)

# Below - meshMajorTable will be one of the worksheets added to Excel export file.

Unnamed: 0,PMID,DescriptorName,MajorDNTopicYN
5,30064415,Health Literacy,Y
10,30064415,Women's Health,Y
11,30055946,Health Literacy,Y
16,30019668,Internet,Y
21,30005747,Clinical Trials as Topic,Y
23,30005747,Comprehension,Y
26,30005747,"Health Knowledge, Attitudes, Practice",Y
30,30005747,Information Dissemination,Y
35,30005747,Patient Education as Topic,Y
57,29961935,Teaching,Y


### 7. Explore grant

https://www.nlm.nih.gov/bsd/licensee/elements_descriptions.html#grantlist <br/>
https://www.nlm.nih.gov/bsd/grant_acronym.html

In [20]:
grant.columns

# Select columns, column order
grant = grant[['PMID', 'GrantListCompleteYN', 'GrantID', 'Agency', 'Acronym', 'Country']]

Index(['Acronym', 'Agency', 'Country', 'GrantID', 'GrantListCompleteYN',
       'PMID'],
      dtype='object')

In [21]:
# Total unique grant numbers
print("Total unique grant numbers: {}".format(grant['GrantID'].nunique()))

# What agencies are represented?
grantCountByAgency = grant['Agency'].value_counts().rename_axis('Agency').reset_index(name='Grant count')
grantCountByAgency.head(n=20)

Total unique grant numbers: 4753


Unnamed: 0,Agency,Grant count
0,NCI NIH HHS,923
1,NIDDK NIH HHS,809
2,NIA NIH HHS,668
3,NICHD NIH HHS,540
4,NHLBI NIH HHS,535
5,NCRR NIH HHS,394
6,NIMH NIH HHS,377
7,NCATS NIH HHS,370
8,AHRQ HHS,281
9,NINR NIH HHS,258


### 8. Explore chemical

https://www.nlm.nih.gov/bsd/licensee/elements_descriptions.html#chemicallist

In [22]:
chemical.columns
      
# Select columns, column order
# Choices: PMID, RegistryNumber, UI, NameOfSubstance
chemical = chemical[['PMID', 'RegistryNumber', 'UI', 'NameOfSubstance']]

Index(['NameOfSubstance', 'PMID', 'RegistryNumber', 'UI'], dtype='object')

In [23]:
# Total substance entries
print("Total chemical entries: {}".format(len(chemical)))

# Total unique chemicals
print("Total unique chemicals: {}".format(chemical['UI'].nunique()))

# What chemicals are represented?
chemicalCounts = chemical['NameOfSubstance'].value_counts().reset_index()
chemicalCounts = chemicalCounts.rename(columns={'NameOfSubstance': 'Article count', 'index': 'NameOfSubstance'})
chemicalCounts.head(n=20)

Total chemical entries: 1936
Total unique chemicals: 515


Unnamed: 0,NameOfSubstance,Article count
0,Glycated Hemoglobin A,99
1,Blood Glucose,84
2,Prescription Drugs,75
3,Pharmaceutical Preparations,67
4,Hypoglycemic Agents,64
5,Anti-HIV Agents,53
6,Nonprescription Drugs,52
7,Antihypertensive Agents,40
8,Antineoplastic Agents,35
9,Anti-Bacterial Agents,34


### 9. Explore pubtype (publication type)

https://www.nlm.nih.gov/bsd/licensee/elements_descriptions.html#publicationtypelist <br/>
https://www.ncbi.nlm.nih.gov/books/NBK3827/table/pubmedhelp.T.publication_types/?report=objectonly

In [24]:
pubtype.columns

# Select columns, column order
# Choices: PMID, PublicationType, PublicationTypeUI
pubtype = pubtype[['PMID', 'PublicationType']]

pubtype.head()

Index(['PMID', 'PublicationType', 'PublicationTypeUI'], dtype='object')

Unnamed: 0,PMID,PublicationType
0,30125790,Journal Article
1,30124102,Journal Article
2,30123879,Journal Article
3,30123504,Journal Article
4,30123148,Journal Article


In [25]:
# Total pubtype entries
print("Total pubtype entries found: {}".format(len(pubtype)))

# Total unique pubtypes
print("Total unique pubtype entries: {}".format(pubtype['PublicationType'].nunique()))

# What pubtypes are represented?
pubTypeCount = pubtype['PublicationType'].value_counts().rename_axis('PublicationType').reset_index(name='Article count')
pubTypeCount.head(n=20)

Total pubtype entries found: 22973
Total unique pubtype entries: 54


Unnamed: 0,PublicationType,Article count
0,Journal Article,12207
1,"Research Support, Non-U.S. Gov't",3437
2,"Research Support, N.I.H., Extramural",1432
3,Review,1060
4,Randomized Controlled Trial,747
5,Comparative Study,681
6,"Research Support, U.S. Gov't, P.H.S.",495
7,Multicenter Study,346
8,Comment,338
9,"Research Support, U.S. Gov't, Non-P.H.S.",329


### 10. Write out, for next work session, visualization, etc.

```
Example for Tableau Public
```

```
Year is part of pubmedarticle; the code below creates a table to feed a Year bar 
chart. It uses DateCreatedYear, the year the pubmed.gov record was created, 
because records are built incrementally: all have the date the 
record was created, but some new records don't have the date the article was/will
be published. DateCreatedYear is more inclusive.
```

In [26]:
ArticleListTable = pubmedarticle[['DateCreatedYear', 'PMID', 'ArticleTitle']]

# Row must include DateCreatedYear; reduce columns
yearTable = pubmedarticle[pd.notnull(pubmedarticle['DateCreatedYear'])]
yearTable = yearTable[[ 'DateCreatedYear', 'PMID']]

In [27]:
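# The context manager saves and closes the file on exit
# (writer.save() was removed in pandas 2.0).
with pd.ExcelWriter('DataSource.xlsx') as writer:
    ArticleListTable.to_excel(writer, sheet_name='ArticleList', index=False)
    yearTable.to_excel(writer, sheet_name='DateCreatedYear', index=False)
    meshMajorTable.to_excel(writer, sheet_name='MeshMajor', index=False)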

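Example of pulling the Excel data back in for follow-on work sessions (read_excel returns only the first worksheet unless sheet_name is given):<br/>
ArticleListTable = pd.read_excel('DataSource.xlsx', sheet_name='ArticleList')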