### Example of PubMed Clustering

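This notebook walks through an example run of the PubMed clustering pipeline: initializing the class, running it on a list of PMIDs, and inspecting the results.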

*Step 1:* Imports.

In [1]:
import os
from pubmed_clustering import PubMedClustering

*Step 2:* Logging.

In [2]:
import logging

# basicConfig() configures the root logger and returns None, so its result is not assigned
logging.basicConfig(level=logging.DEBUG, format='%(asctime)s %(levelname)-5s %(message)s')
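
To emit your own messages in the same format from later cells, one option is a named logger from the standard `logging` module. This is a minimal sketch; the logger name `notebook_example` is an arbitrary choice for illustration and is not something the project defines.

```python
import logging

# Configure the root logger once; later calls are no-ops if handlers already exist
logging.basicConfig(level=logging.DEBUG, format='%(asctime)s %(levelname)-5s %(message)s')

# Hypothetical logger name, used only for messages written from this notebook
notebook_logger = logging.getLogger('notebook_example')
notebook_logger.info('Logging is configured')
```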

*Step 3:* Initializing the class.

Parameters:
- `pubmed_ids` (required): Either a list of PMID strings, a string of comma-separated PMIDs, or the path to a text file containing a list of PMIDs. If it is a file path, set the `is_file` flag to `True`.
- `is_file` (optional): Boolean flag to specify if `pubmed_ids` is a text file that should be read. Default: `False`.
- `metamap` (optional): Location of MetaMap binary. Without MetaMap, the pre-processing will not run and the program will terminate. Default: `/opt/public_mm/bin/metamap16`.
- `email` (optional): Email address. Required to query PubMed using the Biopython library. Default: `"Your.Name.Here@example.org"`.
- `labels` (optional): Either a dictionary of PMID -> cluster pairs or the path to a text file whose lines have the form `PMID\tlabel`. If it is a text file, set the `labels_is_file` flag to `True`. Default: `None`.
- `labels_is_file` (optional): Boolean flag to specify if `labels` is a text file that should be read. Default: `False`.
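
The accepted input forms can be illustrated with a small sketch. The helpers `normalize_pmids` and `read_labels` are hypothetical and are not part of `pubmed_clustering`; they only mirror the behaviour described in the parameter list, and they assume that a PMID file holds one PMID per line. The PMIDs and labels used here are made up.

```python
import os
import tempfile


def normalize_pmids(pubmed_ids, is_file=False):
    """Return a list of PMID strings from any of the three documented forms.

    Hypothetical helper: assumes a PMID file holds one PMID per line.
    """
    if is_file:
        with open(pubmed_ids) as handle:
            return [line.strip() for line in handle if line.strip()]
    if isinstance(pubmed_ids, str):
        # Comma-separated string of PMIDs
        return [pmid.strip() for pmid in pubmed_ids.split(',') if pmid.strip()]
    # Already a list (or other iterable) of PMIDs
    return [str(pmid).strip() for pmid in pubmed_ids]


def read_labels(labels, labels_is_file=False):
    """Return a {PMID: label} dict from a dict or a tab-separated text file."""
    if labels is None:
        return None
    if not labels_is_file:
        return dict(labels)
    result = {}
    with open(labels) as handle:
        for line in handle:
            line = line.rstrip('\n')
            if not line:
                continue
            pmid, label = line.split('\t', 1)
            result[pmid.strip()] = label.strip()
    return result


# Usage with made-up PMIDs and labels written to a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    labels_path = os.path.join(tmp_dir, 'labels.txt')
    with open(labels_path, 'w') as handle:
        handle.write('111\tcancer\n222\tdiabetes\n')
    example_pmids = normalize_pmids('111, 222')
    example_labels = read_labels(labels_path, labels_is_file=True)
```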

In [3]:
clus1 = PubMedClustering(os.path.join('./data', 'pmids_gold_set_unlabeled.txt'),
                            metamap='/opt/public_mm/bin/metamap16', email="This.Is.My@Email.org", is_file=True,
                            labels=os.path.join('./data', 'pmids_gold_set_labeled.txt'), labels_is_file=True)

*Step 4:* Running the pipeline.

In [None]:
clus1.run()

2018-11-28 12:31:00,374 INFO  Fetching articles and abstracts from PubMed
2018-11-28 12:31:01,899 INFO  Parsing fetched PubMed articles
2018-11-28 12:31:02,288 INFO  Parsed all PubMed articles
2018-11-28 12:31:02,289 INFO  Processing all documents with MetaMap
2018-11-28 12:31:02,289 INFO  0 of 69 documents processed with MetaMap
2018-11-28 12:31:19,270 INFO  10 of 69 documents processed with MetaMap
2018-11-28 12:31:35,258 INFO  20 of 69 documents processed with MetaMap
2018-11-28 12:31:53,245 INFO  30 of 69 documents processed with MetaMap
2018-11-28 12:32:13,376 INFO  40 of 69 documents processed with MetaMap


*Step 5:* Observing results.

In [None]:
# Print each document's PMID, its text, and the cluster it was assigned to
for document in clus1.documents:
    print('PMID:', document['pmid'])
    print('Document text:', document['all_text'].strip())
    print('Cluster:', clus1.clustering_results_dict[document['pmid']])
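
For a per-cluster view, the PMID -> cluster mapping can be inverted. This is a minimal sketch in which a made-up mapping stands in for `clus1.clustering_results_dict`; it assumes that attribute behaves like a plain dictionary keyed by PMID, as the loop above suggests. `group_by_cluster` is a hypothetical helper, not part of the project.

```python
from collections import defaultdict


def group_by_cluster(results):
    """Invert a {PMID: cluster} mapping into {cluster: [PMIDs]}."""
    clusters = defaultdict(list)
    for pmid, cluster in results.items():
        clusters[cluster].append(pmid)
    return dict(clusters)


# Made-up stand-in for clus1.clustering_results_dict
example_results = {'111': 0, '222': 1, '333': 0}
grouped = group_by_cluster(example_results)
for cluster, pmids in sorted(grouped.items()):
    print('Cluster', cluster, '->', ', '.join(pmids))
```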

In [None]:
clus1.precision

In [None]:
clus1.recall

In [None]:
clus1.f_measure

In [None]:
clus1.purity
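
The notebook does not show how these scores are computed. As a point of reference, purity is commonly defined as the fraction of documents that carry the majority gold label of their assigned cluster. The sketch below implements that standard definition on made-up data; `purity_score` is a hypothetical helper, and the project's own computation may differ.

```python
from collections import Counter, defaultdict


def purity_score(predicted, gold):
    """Standard purity: sum of each cluster's majority-label count, divided by N."""
    labels_per_cluster = defaultdict(list)
    for pmid, cluster in predicted.items():
        labels_per_cluster[cluster].append(gold[pmid])
    majority_total = sum(Counter(labels).most_common(1)[0][1]
                         for labels in labels_per_cluster.values())
    return majority_total / len(predicted)


# Made-up clustering results and gold labels
example_predicted = {'111': 0, '222': 0, '333': 0, '444': 1}
example_gold = {'111': 'a', '222': 'a', '333': 'b', '444': 'b'}
example_purity = purity_score(example_predicted, example_gold)
```

A purity of 1.0 means every cluster contains documents from a single gold label. Purity tends to rise as the number of clusters grows, so it is best read alongside the other scores.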

To run without labels, omit `labels` and `labels_is_file`:

In [None]:
clus2 = PubMedClustering(os.path.join('./data', 'pmids_test_set_unlabeled.txt'),
                            metamap='/opt/public_mm/bin/metamap16', email="This.Is.My@Email.org", is_file=True)

In [None]:
clus2.run()

In [None]:
for document in clus2.documents:
    print('PMID:', document['pmid'])
    print('Document text:', document['all_text'].strip())
    print('Cluster:', clus2.clustering_results_dict[document['pmid']])