# Homework 4: Named Entity Recognition [5 points]


### HIDS 7006, Spring 2024

### Due: Wednesday, May 1, 2024 11:59 pm E.T.

This homework will test your knowledge about extracting named entities from biomedical text (PubMed articles) that you learned in lecture 12. Fill in the code/answers for the questions as indicated below.

Please edit this document directly using Jupyter or Google Colab and answer each of the questions in-line as indicated.

Turn in a single document, i.e., the notebook showing all of your code and output for the entire assignment, with each question clearly demarcated. Submit your completed assignment through Canvas. The notebook can be downloaded by clicking the `File` option (top left) and then `Download .ipynb` in the drop-down menu.

## 1. Biomedical Named Entity Extraction  [5 points]

**Dataset**

You have been provided with a list of 100 PMIDs (PubMed Identifiers) in a CSV file. Randomly select 25 PMIDs from the CSV file and extract the biomedical entities (genes, diseases, mutations, and chemicals) mentioned in the selected 25 PMIDs using the PubTator APIs.



**Steps**

1. Read the input `pmids.csv` file (which contains 100 PMIDs) using Pandas or Python file IO operations.

2. Randomly select 25 PMIDs from the input file and store them in a Python list. Hint: If you are using Pandas to read the input file, you can use the `sample` function to randomly select a subset of rows in a DataFrame.

    **Note:** The 25 PMIDs must be selected randomly; do not simply take the first or last 25.
    
    `sample` function documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html

3. Use the PubTator APIs to create a dataframe of the different biomedical entities mentioned in the selected 25 PMIDs. The dataframe should contain **at least** the following columns: `["pmid", "entity_mention", "entity_type", "entity_id"]`

    **Hint:** Please refer to the section "*Extracting biomedical entities from PubMed abstracts*" in the lecture 12 notebook (`12_named_entity_recognization_solved`) for details on how to use the APIs and create the dataframe of entities.

4. What are the top five diseases (by count) detected by PubTator in your list of 25 PMIDs? Based on these top five diseases, what disease topic or query do you think was used to obtain the initial list of PMIDs?


5. What are the top five genes (by count) detected by PubTator in your list of 25 PMIDs?

In [None]:
##Code here; Use as many code cells as required

# 1) Read the input pmids.csv file (which contains 100 PMIDs) using Pandas or Python file IO operations.

In [None]:
import pandas as pd

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
pmids=pd.read_csv('/content/drive/MyDrive/HIDS_7006/Assignment/pmids.csv')
pmids

Unnamed: 0,PMID
0,34534430
1,32846060
2,30980071
3,30718506
4,30413378
...,...
95,25668228
96,25667280
97,25470694
98,25393796
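
Step 1 also allows plain Python file IO instead of Pandas. A minimal sketch of that route, assuming the file has a header row with a `PMID` column as in the output above. The in-memory `sample_csv` text is a stand-in for the real `pmids.csv`, and `read_pmids` is a hypothetical helper name, not part of the lecture code.

```python
import csv
import io


def read_pmids(handle, column="PMID"):
    """Return the values of `column` from an open CSV file handle as strings."""
    reader = csv.DictReader(handle)
    # Skip rows where the column is missing or empty.
    return [row[column].strip() for row in reader if row.get(column)]


# Stand-in for open("pmids.csv"): the first three PMIDs shown in the output above.
sample_csv = io.StringIO("PMID\n34534430\n32846060\n30980071\n")
pmid_list = read_pmids(sample_csv)
print(pmid_list)
```

Reading the values as strings has the side benefit that no later integer-to-string conversion is needed before calling the API.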


# 2) Randomly select 25 PMIDs from the input file and store them in a Python list. Hint: If you are using Pandas to read the input file, you can use the `sample` function to randomly select a subset of rows in a DataFrame.

In [None]:
pmids_25=pmids.sample(n=25, random_state=42)
pmids_25

Unnamed: 0,PMID
83,26027660
53,26771872
70,26515464
45,26899019
44,26918046
39,27022118
22,27876675
80,26124334
10,28602779
0,34534430
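
Two properties of `sample` matter for this step: it draws without replacement by default, so no PMID is selected twice, and a fixed `random_state` makes the draw reproducible. A small sketch that checks both on a toy table; the PMID values here are placeholders, not real identifiers.

```python
import pandas as pd

# Toy stand-in for the 100-row PMID table (placeholder values, not real PMIDs).
toy = pd.DataFrame({"PMID": range(1000, 1100)})

# Same n and random_state on the same data should give the same rows.
first = toy.sample(n=25, random_state=42)
second = toy.sample(n=25, random_state=42)

print(len(first), first["PMID"].is_unique)
print(first["PMID"].tolist() == second["PMID"].tolist())
```

Omitting `random_state` gives a different subset on each run, which is still valid for the assignment but makes the downstream counts harder to reproduce.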


# 3) Use the PubTator APIs to create a dataframe of the different biomedical entities mentioned in the selected 25 PMIDs. The dataframe should contain at least the following columns: ["pmid", "entity_mention", "entity_type", "entity_id"]

Hint: Please refer to the section "Extracting biomedical entities from PubMed abstracts" in the lecture 12 notebook (12_named_entity_recognization_solved) for details on how to use the APIs and create the dataframe of entities.

In [None]:
import requests

def process_pmids(input_pmids, format, bioconcept):
  # Build the request body from the list of PMID strings
  payload = {"pmids": [pmid.strip() for pmid in input_pmids]}

  # Optionally restrict the annotations to a comma-separated list of bioconcepts
  if bioconcept != "":
    payload["concepts"] = bioconcept.split(",")

  # Call the PubTator RESTful export endpoint for the requested output format
  r = requests.post("https://www.ncbi.nlm.nih.gov/research/pubtator-api/publications/export/" + format, json=payload)

  if r.status_code != 200:
    print("[Error]: HTTP code " + str(r.status_code))
    return None
  return r.text
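
The request body that `process_pmids` sends is a small JSON object. The sketch below separates out just that step so the body can be inspected without calling the API. `build_payload` is a hypothetical helper, not part of the lecture code, and the concept names passed in the second call are illustrative values.

```python
def build_payload(input_pmids, bioconcept=""):
    """Build the JSON body for the export request from PMIDs and optional concepts."""
    # str() lets integer PMIDs through, so no separate conversion step is needed.
    payload = {"pmids": [str(pmid).strip() for pmid in input_pmids]}
    if bioconcept:
        payload["concepts"] = bioconcept.split(",")
    return payload


print(build_payload([25765070, " 25891174 "]))
print(build_payload(["25765070"], "gene,disease"))
```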

In [None]:
selected_pmids = pmids_25["PMID"].tolist()
selected_pmids

[26027660,
 26771872,
 26515464,
 26899019,
 26918046,
 27022118,
 27876675,
 26124334,
 28602779,
 34534430,
 28259530,
 27577079,
 26439803,
 27315356,
 25765070,
 30413378,
 26287849,
 26209642,
 28525386,
 27544060,
 26758680,
 25891174,
 27694386,
 26938871,
 26554404]

In [None]:
selected_pmids_str=[str(pmid) for pmid in selected_pmids]

In [None]:
import requests

In [None]:
result = process_pmids(selected_pmids_str, "pubtator", "")

In [None]:
result

'25765070|t|Cancer immunology. Mutational landscape determines sensitivity to PD-1 blockade in non-small cell lung cancer.\n25765070|a|Immune checkpoint inhibitors, which unleash a patient\'s own T cells to kill tumors, are revolutionizing cancer treatment. To unravel the genomic determinants of response to this therapy, we used whole-exome sequencing of non-small cell lung cancers treated with pembrolizumab, an antibody targeting programmed cell death-1 (PD-1). In two independent cohorts, higher nonsynonymous mutation burden in tumors was associated with improved objective response, durable clinical benefit, and progression-free survival. Efficacy also correlated with the molecular smoking signature, higher neoantigen burden, and DNA repair pathway mutations; each factor was also associated with mutation burden. In one responder, neoantigen-specific CD8+ T cell responses paralleled tumor regression, suggesting that anti-PD-1 therapy enhances neoantigen-specific T cell reactivity. Our 

In [None]:
print(result)

25765070|t|Cancer immunology. Mutational landscape determines sensitivity to PD-1 blockade in non-small cell lung cancer.
25765070|a|Immune checkpoint inhibitors, which unleash a patient's own T cells to kill tumors, are revolutionizing cancer treatment. To unravel the genomic determinants of response to this therapy, we used whole-exome sequencing of non-small cell lung cancers treated with pembrolizumab, an antibody targeting programmed cell death-1 (PD-1). In two independent cohorts, higher nonsynonymous mutation burden in tumors was associated with improved objective response, durable clinical benefit, and progression-free survival. Efficacy also correlated with the molecular smoking signature, higher neoantigen burden, and DNA repair pathway mutations; each factor was also associated with mutation burden. In one responder, neoantigen-specific CD8+ T cell responses paralleled tumor regression, suggesting that anti-PD-1 therapy enhances neoantigen-specific T cell reactivity. Our res

In [None]:
def get_entities_from_pubtator_format(pubtator_output):
  entities = []
  lines = pubtator_output.split("\n")
  for line in lines:
    tokens = line.split("\t")
    # Annotation lines have six tab-separated fields; title (|t|), abstract (|a|), and blank lines do not
    if len(tokens) < 6:
      continue
    pmid, start_offset, end_offset, entity_mention, entity_type, entity_id = tokens[0:6]
    entities.append([pmid, start_offset, end_offset, entity_mention, entity_type, entity_id])
  return entities
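
To see what the parser keeps and what it skips, the sketch below runs the same logic on a short hand-built string in PubTator format. The title and two annotation rows are copied from the output in this notebook; the abstract line is shortened for readability, so the offsets no longer line up with it.

```python
def get_entities_from_pubtator_format(pubtator_output):
    entities = []
    for line in pubtator_output.split("\n"):
        tokens = line.split("\t")
        # Title, abstract, and blank lines contain no tabs and are skipped here.
        if len(tokens) < 6:
            continue
        entities.append(tokens[0:6])
    return entities


sample = (
    "25765070|t|Cancer immunology. Mutational landscape determines sensitivity "
    "to PD-1 blockade in non-small cell lung cancer.\n"
    "25765070|a|Immune checkpoint inhibitors are revolutionizing cancer treatment.\n"
    "25765070\t0\t6\tCancer\tDisease\tMESH:D009369\n"
    "25765070\t373\t386\tpembrolizumab\tChemical\tMESH:C582435\n"
    "\n"
)

parsed = get_entities_from_pubtator_format(sample)
for row in parsed:
    print(row)
```

Only the two tab-separated annotation lines come back; every field, including the PMID and the offsets, is a string.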

In [None]:
pubtator_output=process_pmids(selected_pmids_str, "pubtator", "")

In [None]:
pubtator_entities = get_entities_from_pubtator_format(pubtator_output)

In [None]:
pubtator_entities

[['25765070', '0', '6', 'Cancer', 'Disease', 'MESH:D009369'],
 ['25765070', '157', '164', 'patient', 'Species', '9606'],
 ['25765070', '187', '193', 'tumors', 'Disease', 'MESH:D009369'],
 ['25765070', '347', '359', 'lung cancers', 'Disease', 'MESH:D008175'],
 ['25765070', '373', '386', 'pembrolizumab', 'Chemical', 'MESH:C582435'],
 ['25765070', '510', '516', 'tumors', 'Disease', 'MESH:D009369'],
 ['25765070', '843', '849', 'T cell', 'CellLine', 'T cell'],
 ['25765070', '871', '876', 'tumor', 'Disease', 'MESH:D009369'],
 ['25765070', '952', '958', 'T cell', 'CellLine', 'T cell'],
 ['25765070', '1021', '1033', 'lung cancers', 'Disease', 'MESH:D008175'],
 ['25891174', '0', '13', 'Pembrolizumab', 'Chemical', 'MESH:C582435'],
 ['25891174', '50', '61', 'lung cancer', 'Disease', 'MESH:D008175'],
 ['25891174', '114', '137', 'programmed cell death 1', 'Gene', '5133'],
 ['25891174', '139', '143', 'PD-1', 'Gene', '5133'],
 ['25891174', '161', '174', 'pembrolizumab', 'Chemical', 'MESH:C582435'],
 

In [None]:
cols = ["pmid", "start_offset", "end_offset", "entity_mention", "entity_type", "entity_id"]

In [None]:
entity_df = pd.DataFrame(pubtator_entities, columns=cols)

In [None]:
entity_df.head()

Unnamed: 0,pmid,start_offset,end_offset,entity_mention,entity_type,entity_id
0,25765070,0,6,Cancer,Disease,MESH:D009369
1,25765070,157,164,patient,Species,9606
2,25765070,187,193,tumors,Disease,MESH:D009369
3,25765070,347,359,lung cancers,Disease,MESH:D008175
4,25765070,373,386,pembrolizumab,Chemical,MESH:C582435


In [None]:
entity_df.shape

(916, 6)
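
A quick sanity check at this point is whether every sampled PMID came back with at least one annotation. A minimal sketch on toy data, assuming the same column names as `entity_df`; `toy_entities`, `requested`, and `missing_pmids` are hypothetical names. The detail to watch is the type mismatch: the parser yields PMIDs as strings, while the sampled list holds integers.

```python
import pandas as pd

# Two rows copied from the output above; the third requested PMID is
# deliberately absent to show what a gap looks like.
toy_entities = pd.DataFrame(
    [
        ["25765070", "Cancer", "Disease", "MESH:D009369"],
        ["25891174", "PD-1", "Gene", "5133"],
    ],
    columns=["pmid", "entity_mention", "entity_type", "entity_id"],
)
requested = [25765070, 25891174, 26027660]

# Cast the requested integers to strings before comparing with the parsed PMIDs.
returned = set(toy_entities["pmid"])
missing_pmids = [p for p in requested if str(p) not in returned]
print(missing_pmids)
```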

# 4) What are the top five diseases (by count) detected by PubTator in your list of 25 PMIDs? Based on these top five diseases, what disease topic or query do you think was used to obtain the initial list of PMIDs?

In [None]:
disease=entity_df[entity_df['entity_type']=='Disease']
disease

Unnamed: 0,pmid,start_offset,end_offset,entity_mention,entity_type,entity_id
0,25765070,0,6,Cancer,Disease,MESH:D009369
2,25765070,187,193,tumors,Disease,MESH:D009369
3,25765070,347,359,lung cancers,Disease,MESH:D008175
5,25765070,510,516,tumors,Disease,MESH:D009369
7,25765070,871,876,tumor,Disease,MESH:D009369
...,...,...,...,...,...,...
901,34534430,1507,1518,neutropenia,Disease,MESH:D009503
902,34534430,1567,1579,lung disease,Disease,MESH:D008171
904,34534430,1624,1629,death,Disease,MESH:D003643
914,34534430,1910,1915,NSCLC,Disease,MESH:D002289


In [None]:
disease_counts = disease['entity_mention'].value_counts(normalize=False).sort_values(ascending=False)

In [None]:
print("Unique Values and Counts (Descending):")
print(disease_counts)

Unique Values and Counts (Descending):
entity_mention
NSCLC                                              54
lung cancer                                        31
tumor                                              10
tumors                                             10
SCC                                                 8
                                                   ..
Non-Small-Cell Lung Cancer                          1
neutropenia                                         1
squamous cell cancer                                1
Lung Adenocarcinoma and Squamous Cell Carcinoma     1
death                                               1
Name: count, Length: 68, dtype: int64


**- Top 5 diseases are:** <br>
1) NSCLC (Non-Small Cell Lung Cancer) <br>
2) Lung cancer <br>
3) Tumor <br>
4) Tumors <br>
5) SCC (Squamous Cell Carcinoma) <br>
<br>

**Based on these counts, most of the articles concern lung cancer, particularly non-small cell lung cancer (NSCLC), so the initial list of PMIDs was most likely obtained with a query on lung cancer or NSCLC.**
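
Counting by `entity_mention` treats surface variants such as "tumor" and "tumors" as separate diseases even though PubTator assigns them the same `entity_id`. One possible alternative, sketched here on a few rows copied from the output above, is to group by `entity_id` and keep the most frequent mention as a readable label. The names `toy_df` and `by_id` are illustrative only.

```python
import pandas as pd

rows = [
    ["25765070", "Cancer", "Disease", "MESH:D009369"],
    ["25765070", "tumors", "Disease", "MESH:D009369"],
    ["25765070", "lung cancers", "Disease", "MESH:D008175"],
    ["25765070", "tumors", "Disease", "MESH:D009369"],
    ["25765070", "tumor", "Disease", "MESH:D009369"],
    ["25765070", "lung cancers", "Disease", "MESH:D008175"],
    ["25891174", "lung cancer", "Disease", "MESH:D008175"],
]
toy_df = pd.DataFrame(
    rows, columns=["pmid", "entity_mention", "entity_type", "entity_id"]
)

toy_disease = toy_df[toy_df["entity_type"] == "Disease"]

# One row per normalized identifier: total mentions plus the most common spelling.
by_id = (
    toy_disease.groupby("entity_id")
    .agg(
        count=("entity_mention", "size"),
        example_mention=("entity_mention", lambda s: s.value_counts().index[0]),
    )
    .sort_values("count", ascending=False)
)
print(by_id.head(5))
```

On the full dataframe this would merge rows such as "tumor" and "tumors" into a single entry, which can change which diseases make the top five.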



# 5) What are the top five genes (by count) detected by PubTator in your list of 25 PMIDs?

In [None]:
unique_entity_types = entity_df['entity_type'].unique()
print(unique_entity_types)

['Disease' 'Species' 'Chemical' 'CellLine' 'Gene' 'ProteinMutation']


In [None]:
gene=entity_df[entity_df['entity_type']=='Gene']
gene

Unnamed: 0,pmid,start_offset,end_offset,entity_mention,entity_type,entity_id
12,25891174,114,137,programmed cell death 1,Gene,5133
13,25891174,139,143,PD-1,Gene,5133
17,25891174,322,335,PD-1 ligand 1,Gene,29126
18,25891174,337,342,PD-L1,Gene,29126
23,25891174,668,673,PD-L1,Gene,29126
...,...,...,...,...,...,...
897,34534430,854,858,HER2,Gene,2064
906,34534430,1686,1690,HER2,Gene,2064
908,34534430,1752,1756,HER2,Gene,2064
909,34534430,1771,1775,HER2,Gene,2064


In [None]:
gene_counts = gene['entity_mention'].value_counts(normalize=False).sort_values(ascending=False)
print("Unique Values and Counts Genes (Descending):")
print(gene_counts)

Unique Values and Counts Genes (Descending):
entity_mention
ALK                                         54
EGFR                                        49
KRAS                                        35
PD-L1                                       22
MET                                         18
RET                                         15
LKB1                                        13
mTOR                                        12
HER2                                        10
TP53                                        10
ABL1                                        10
CD8                                          8
PIM1                                         8
ROS1                                         6
BRAF                                         6
FGFR1                                        4
anaplastic lymphoma kinase                   3
EXP1                                         3
c-MYC                                        3
STAT3                                        3


**- Top 5 genes are:** <br>
1) ALK <br>
2) EGFR <br>
3) KRAS <br>
4) PD-L1 <br>
5) MET <br>
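
Raw mention counts can be dominated by a single abstract that repeats one gene many times. A complementary view, sketched here on rows copied from the gene output above, is to count how many distinct PMIDs mention each gene alongside the total number of mentions. `toy_genes` and `gene_summary` are illustrative names.

```python
import pandas as pd

rows = [
    ["25891174", "programmed cell death 1", "Gene", "5133"],
    ["25891174", "PD-1", "Gene", "5133"],
    ["25891174", "PD-1 ligand 1", "Gene", "29126"],
    ["25891174", "PD-L1", "Gene", "29126"],
    ["25891174", "PD-L1", "Gene", "29126"],
    ["34534430", "HER2", "Gene", "2064"],
    ["34534430", "HER2", "Gene", "2064"],
    ["34534430", "HER2", "Gene", "2064"],
    ["34534430", "HER2", "Gene", "2064"],
]
toy_genes = pd.DataFrame(
    rows, columns=["pmid", "entity_mention", "entity_type", "entity_id"]
)

# Per gene identifier: total mentions and the number of distinct articles.
gene_summary = (
    toy_genes.groupby("entity_id")
    .agg(mentions=("entity_mention", "size"), n_pmids=("pmid", "nunique"))
    .sort_values("mentions", ascending=False)
)
print(gene_summary)
```

In this toy subset HER2 has the most mentions, but all of them come from one article. Ranking by `n_pmids` instead of `mentions` would answer a slightly different question, namely which genes appear across the most papers.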