### Example of PubMed Clustering

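This notebook walks through a complete run of the `PubMedClustering` pipeline: fetching articles from PubMed, processing them with MetaMap, clustering them, and inspecting the results.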

*Step 1:* Imports.

In [1]:
import os
from pubmed_clustering import PubMedClustering

*Step 2:* Logging.

In [2]:
import logging

# basicConfig() configures the root logger and returns None, so its result is not assigned.
logging.basicConfig(level=logging.DEBUG, format='%(asctime)s %(levelname)-5s %(message)s')
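
With the root logger configured this way, third-party libraries log through the same handler. The `Dictionary(...)` lines in the pipeline output under Step 4 look like gensim messages. Assuming they come from loggers under the `gensim` namespace, a minimal sketch for quieting them while keeping the project's own messages could be:

```python
import logging

# Configure the root logger as in the cell above.
logging.basicConfig(level=logging.DEBUG, format='%(asctime)s %(levelname)-5s %(message)s')

# Assumption: the dictionary-building messages come from loggers named "gensim.*".
# Raising the level on the parent logger filters INFO/DEBUG from all of its children.
logging.getLogger('gensim').setLevel(logging.WARNING)
```

Child loggers without an explicit level inherit the effective level of their parent, so one call on `gensim` is enough.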

*Step 3:* Initializing the class.

Parameters:
- `pubmed_ids` (required): One of a list of PMID strings, a single string of comma-separated PMIDs, or the path to a text file listing PMIDs. If it is a file path, set `is_file` to `True`.
- `is_file` (optional): Boolean flag indicating that `pubmed_ids` is the path to a text file that should be read. Default: `False`.
- `metamap` (optional): Location of the MetaMap binary. Without MetaMap, pre-processing cannot run and the program terminates. Default: `/opt/public_mm/bin/metamap16`.
- `email` (optional): Email address, needed to query PubMed through the Biopython library. Default: `"Your.Name.Here@example.org"`.
- `labels` (optional): Either a dictionary of PMID -> cluster pairs or the path to a text file with lines of the form `PMID\tlabel`. If it is a file path, set `labels_is_file` to `True`. Default: `None`.
- `labels_is_file` (optional): Boolean flag indicating that `labels` is the path to a text file that should be read. Default: `False`.
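
The three accepted forms of `pubmed_ids` can all be reduced to one list of PMID strings. The sketch below illustrates that idea with a hypothetical helper, `normalize_pmids`. It is not part of the project, and it assumes a PMID file holds IDs separated by newlines or commas.

```python
from typing import List, Union


def normalize_pmids(pubmed_ids: Union[str, List[str]], is_file: bool = False) -> List[str]:
    """Hypothetical helper: return a clean list of PMID strings."""
    if is_file:
        # Assumption: the file holds PMIDs separated by newlines and/or commas.
        with open(pubmed_ids) as handle:
            pubmed_ids = handle.read().replace('\n', ',')
    if isinstance(pubmed_ids, str):
        pubmed_ids = pubmed_ids.split(',')
    # Strip whitespace and drop empty entries.
    return [str(pmid).strip() for pmid in pubmed_ids if str(pmid).strip()]


print(normalize_pmids(['8001324', '8270381']))
print(normalize_pmids('8001324, 8270381'))
```

Both calls print the same two-element list, which is the point of normalizing early: the rest of a pipeline only has to handle one input shape.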

In [3]:
clus1 = PubMedClustering(os.path.join('./data', 'pmids_gold_set_unlabeled.txt'),
                            metamap='/opt/public_mm/bin/metamap16', email="This.Is.My@Email.org", is_file=True,
                            labels=os.path.join('./data', 'pmids_gold_set_labeled.txt'), labels_is_file=True)
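
The labels file passed above is described as having lines of the form `PMID\tlabel`. As a rough sketch of how such a file could be turned into the PMID -> cluster dictionary that `labels` also accepts, assuming one tab-separated pair per line (`parse_labels` is a hypothetical helper, and the sample labels are invented):

```python
import io
from typing import Dict, Iterable


def parse_labels(lines: Iterable[str]) -> Dict[str, str]:
    """Hypothetical helper: build a PMID -> label dict from tab-separated lines."""
    labels = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        pmid, label = line.split('\t', 1)
        labels[pmid.strip()] = label.strip()
    return labels


# Stand-in for an open file; the labels here are made up for illustration.
sample = io.StringIO('8001324\tA\n8270381\tB\n')
labels = parse_labels(sample)
print(labels)
```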

*Step 4:* Running the pipeline.

In [4]:
clus1.run()

2018-11-26 01:08:25,764 INFO  Fetching articles and abstracts from PubMed
2018-11-26 01:08:27,062 INFO  Parsing fetched PubMed articles
2018-11-26 01:08:27,491 INFO  Parsed all PubMed articles
2018-11-26 01:08:27,491 INFO  Processing all documents with MetaMap
2018-11-26 01:08:27,492 INFO  0 of 69 documents processed with MetaMap
2018-11-26 01:08:44,508 INFO  10 of 69 documents processed with MetaMap
2018-11-26 01:09:00,663 INFO  20 of 69 documents processed with MetaMap
2018-11-26 01:09:19,221 INFO  30 of 69 documents processed with MetaMap
2018-11-26 01:09:39,659 INFO  40 of 69 documents processed with MetaMap
2018-11-26 01:10:03,064 INFO  50 of 69 documents processed with MetaMap
2018-11-26 01:10:23,008 INFO  60 of 69 documents processed with MetaMap
2018-11-26 01:10:46,613 INFO  All documents processed with MetaMap
2018-11-26 01:10:46,615 INFO  Creating dictionary
2018-11-26 01:10:46,615 INFO  adding document #0 to Dictionary(0 unique tokens: [])
2018-11-26 01:10:46,618 INFO  built

*Step 5:* Observing results.

In [5]:
for document in clus1.documents:
    print('PMID:', document['pmid'])
    print('Document text:', document['all_text'].strip()) 
    print('Cluster:', clus1.clustering_results_dict[document['pmid']])

PMID: 8001324
Document text: Noonan syndrome. An update and review for the primary pediatrician.
Cluster: 2
PMID: 8270381
Document text: Clinical aspects of renal involvement in Bardet-Biedl syndrome. 
The Bardet-Biedl syndrome (BBS), which consists of polydactyly, obesity, mental retardation, pigmentary retinopathy and hypogonadism has been known since 1922, but due to the great similarity to the clinical manifestations of the Laurence-Moon syndrome (LMS) there is a considerable terminological confusion in the medical literature. An attempt is made at clarifying the problem. Four children from two families have been observed. There were inter- and intrafamilial variabilities of the expression and severity of the particular features, but retinopathy and structural and/or functional abnormalities were found in 100%. The combination of the two can serve as an easy clinical screening for diagnosis of the disease. Renal involvement is considered to be a cardinal feature of the syndrome. Th
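
Because `clustering_results_dict` is indexed by PMID in the loop above, it can be inverted to list the members of each cluster. A minimal sketch with a stand-in dictionary follows. It assumes the attribute behaves like a plain PMID -> cluster mapping, and the entries other than the first are invented for illustration.

```python
from collections import defaultdict

# Stand-in for clus1.clustering_results_dict; values are illustrative only.
clustering_results = {'8001324': 2, '8270381': 0, '1234567': 2}

# Invert the mapping: cluster id -> list of PMIDs assigned to it.
members = defaultdict(list)
for pmid, cluster in clustering_results.items():
    members[cluster].append(pmid)

for cluster in sorted(members):
    print(cluster, members[cluster])
```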

In [6]:
clus1.precision

0.9004329004329004

In [7]:
clus1.recall

0.9288043478260869

In [8]:
clus1.f_measure

0.9143986037281078

In [9]:
clus1.purity

0.9117647058823529
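
The reported `f_measure` is consistent with the harmonic mean of `precision` and `recall`. Purity is commonly defined as the share of documents that fall in the majority class of their cluster. The sketch below uses these standard definitions. How the project itself computes precision and recall is not shown in this notebook, so `purity` and `f_measure` here are illustrative helpers rather than the project's code, and the toy data is invented.

```python
from collections import Counter, defaultdict


def purity(clusters, labels):
    """Standard purity: fraction of documents in their cluster's majority class."""
    by_cluster = defaultdict(list)
    for doc, cluster in clusters.items():
        by_cluster[cluster].append(labels[doc])
    # For each cluster, count the documents carrying its most frequent label.
    majority = sum(Counter(members).most_common(1)[0][1] for members in by_cluster.values())
    return majority / len(clusters)


def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Toy data, invented for illustration.
toy_clusters = {'a': 0, 'b': 0, 'c': 0, 'd': 1, 'e': 1}
toy_labels = {'a': 'x', 'b': 'x', 'c': 'y', 'd': 'y', 'e': 'y'}
print(purity(toy_clusters, toy_labels))  # 0.8

# Precision and recall copied from the cells above.
print(f_measure(0.9004329004329004, 0.9288043478260869))
```

In the toy data, cluster 0 has a majority of two `x` documents and cluster 1 has two `y` documents, so four of five documents count toward purity.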

To run without labels, omit the `labels` and `labels_is_file` arguments:

In [10]:
clus2 = PubMedClustering(os.path.join('./data', 'pmids_test_set_unlabeled.txt'),
                            metamap='/opt/public_mm/bin/metamap16', email="This.Is.My@Email.org", is_file=True)

In [11]:
clus2.run()

2018-11-26 01:10:46,757 INFO  Fetching articles and abstracts from PubMed
2018-11-26 01:10:47,828 INFO  Parsing fetched PubMed articles
2018-11-26 01:10:48,110 INFO  Parsed all PubMed articles
2018-11-26 01:10:48,110 INFO  Processing all documents with MetaMap
2018-11-26 01:10:48,111 INFO  0 of 36 documents processed with MetaMap
2018-11-26 01:11:10,554 INFO  10 of 36 documents processed with MetaMap
2018-11-26 01:11:29,202 INFO  20 of 36 documents processed with MetaMap
2018-11-26 01:11:49,133 INFO  30 of 36 documents processed with MetaMap
2018-11-26 01:12:03,741 INFO  All documents processed with MetaMap
2018-11-26 01:12:03,742 INFO  Creating dictionary
2018-11-26 01:12:03,744 INFO  adding document #0 to Dictionary(0 unique tokens: [])
2018-11-26 01:12:03,745 INFO  built Dictionary(122 unique tokens: ['cancer', 'carcinoma', 'colon', 'colorectal', 'hereditary']...) from 36 documents (total 1401 corpus positions)
2018-11-26 01:12:03,746 INFO  Filtering low occurence words from diction

In [12]:
for document in clus2.documents:
    print('PMID:', document['pmid'])
    print('Document text:', document['all_text'].strip())
    print('Cluster:', clus2.clustering_results_dict[document['pmid']])