In [1]:
!pip install --upgrade biopython spacy nltk pandas networkx matplotlib



<details>
  <summary>💡Explain this code</summary>
This command uses pip (Python’s package installer) to download and install six popular libraries:

biopython → tools for computational biology and bioinformatics.

spacy → advanced natural language processing (NLP).

nltk → natural language toolkit, often used for text preprocessing.

pandas → data manipulation and analysis (especially with tables).

networkx → graph and network analysis (e.g., molecular or linguistic networks).

matplotlib → plotting and visualization library.

</details>
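
To confirm that the installation worked, a quick version check can help. The following is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_versions` is made up for this example and is not part of the notebook.

```python
from importlib.metadata import PackageNotFoundError, version

PACKAGES = ["biopython", "spacy", "nltk", "pandas", "networkx", "matplotlib"]

def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

# Print one line per package so missing ones are easy to spot
for name, ver in installed_versions(PACKAGES).items():
    print(f"{name:12s} {ver or 'NOT INSTALLED'}")
```

Any package reported as `NOT INSTALLED` needs to be installed before the import cell will run.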

In [2]:
import re

import matplotlib.pyplot as plt
import networkx as nx
import nltk
import pandas as pd
import spacy
from Bio import Entrez
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# English stopword list; requires nltk.download('stopwords') to have been run once.
stop = set(stopwords.words('english'))

In [3]:
nlp = spacy.load(r"C:\Users\sayye\Downloads\en_ner_bionlp13cg_md-0.4.0\en_ner_bionlp13cg_md-0.4.0\en_ner_bionlp13cg_md\en_ner_bionlp13cg_md-0.4.0")



<details>
    <summary>💡Explain this code </summary>    

Entrez is a tool that lets you access data from NCBI (National Center for Biotechnology Information) — databases like PubMed, GenBank, and others.
You can use it to search and download biological papers, sequences, or gene information directly from Python.

For example,

```python
from Bio import Entrez
Entrez.email = "youremail@example.com"
handle = Entrez.esearch(db="pubmed", term="COVID-19", retmax=5)
```

That would search PubMed for “COVID-19” and return at most five matching record IDs.

spacy → For advanced natural language processing (tokenization, part-of-speech tagging, named entity recognition, etc.).

re → Python’s regular expression module for text pattern matching and cleaning (like removing symbols or extracting specific text).

pandas as pd → For handling structured data (tables, CSV files, dataframes). The `as pd` part is just a short alias so you can write `pd.DataFrame()` instead of `pandas.DataFrame()`.

word_tokenize (from nltk.tokenize) → A specific function imported from nltk.
It’s used to split a sentence or paragraph into individual words (tokens).

</details>


In [4]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\sayye\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\sayye\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [5]:
Entrez.email = 'sayyedsaniya1010@gmail.com'
Entrez.api_key = None #optional

<details>
    <summary>💡Explain this code </summary>

nltk.download('punkt')

What it is:
punkt is a pre-trained tokenizer model that NLTK uses to split text into sentences and words.

Why you need it:
When you run word_tokenize("Hello world!"), NLTK relies on the Punkt tokenizer under the hood to know where words begin and end, including punctuation and abbreviations.

What it downloads:
A small dataset containing tokenization rules for many languages.

nltk.download('stopwords')

What it is:
This downloads a list of common stopwords — words like “the”, “is”, “and”, “a” — that don’t add much meaning in text analysis.

Why you need it:
When cleaning text for NLP, we often remove these to focus on the meaningful words.
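
As a small illustration of the idea, the sketch below filters a sentence against a tiny hand-written stopword set. The set `TINY_STOPWORDS` and the function `drop_stopwords` are stand-ins invented for this example; the notebook itself uses NLTK's full English list.

```python
# Hypothetical miniature stopword set; NLTK's real English list is much longer.
TINY_STOPWORDS = {"the", "is", "and", "a", "of", "in"}

def drop_stopwords(words, stopwords=TINY_STOPWORDS):
    """Keep only the words whose lowercase form is not a stopword."""
    return [w for w in words if w.lower() not in stopwords]

sentence = "The drug is a potent inhibitor of tumor growth"
print(drop_stopwords(sentence.split()))
# ['drug', 'potent', 'inhibitor', 'tumor', 'growth']
```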

Entrez.email = 'xyz@gmail.com'

This line tells NCBI who you are.

When you use Biopython’s Entrez module to access online databases (like PubMed or GenBank), you’re connecting to NCBI’s servers.
They require every user to provide an email address — not for spam or verification, but so that:

If your script sends too many requests or causes problems, NCBI can contact you.

It helps them track usage responsibly.

Entrez.api_key = 'YOUR_NCBI_API_KEY' (optional but useful)

Now this one is optional, but very helpful.

An API key is like a personal access token that identifies you as a trusted user.
Without it, NCBI allows only about 3 requests per second.
With an API key, you can make up to 10 requests per second, which is a big speed boost if you’re fetching lots of records.

You can get one for free by creating an account on the NCBI website.

So, it’s optional because your code will still work without it — but it’s recommended for:

Heavy data retrieval

Frequent automated access

Avoiding “too many requests” errors.

</details>
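
The request limits described above can be respected in your own loops with a small throttle. This is only a sketch: the `Throttle` class, the `max_requests_per_second` helper, and the `NCBI_API_KEY` environment variable name are assumptions made for this example, not part of Biopython.

```python
import os
import time

def max_requests_per_second(api_key):
    """Limits as described above: 3 requests/s without a key, 10 with one."""
    return 10 if api_key else 3

class Throttle:
    """Sleep as needed so calls are spaced at least 1/rate seconds apart."""

    def __init__(self, rate):
        self.min_interval = 1.0 / rate
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()

# NCBI_API_KEY is an assumed variable name chosen for this sketch.
api_key = os.environ.get("NCBI_API_KEY")
throttle = Throttle(max_requests_per_second(api_key))

for term in ["cancer drug", "kinase inhibitor"]:
    throttle.wait()  # call this before each Entrez request
    print("would query:", term)
```

Reading the key from an environment variable also keeps it out of the notebook itself, which matters if the notebook is shared.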

#skipped reinstallation

In [6]:
def fetch_pubmed_abstracts(query, max_records=10):
    '''Search PubMed for a query and fetch the matching titles and abstracts.

    Returns a list of dicts with keys 'pmid', 'title' and 'abstract'.'''
    # Step 1: get the PubMed IDs that match the query
    h = Entrez.esearch(db='pubmed', term=query, retmax=max_records)
    ids = Entrez.read(h)['IdList']
    h.close()
    if not ids:
        return []
    # Step 2: download the full records for those IDs as XML
    h = Entrez.efetch(db='pubmed', id=','.join(ids), retmode='xml')
    recs = Entrez.read(h)
    h.close()
    out = []
    for art in recs.get('PubmedArticle', []):
        pmid = str(art['MedlineCitation']['PMID'])
        article = art['MedlineCitation']['Article']
        title = article.get('ArticleTitle', '')
        abstract = ''
        if article.get('Abstract'):
            parts = article['Abstract'].get('AbstractText')
            if isinstance(parts, list):
                # Structured abstracts arrive as a list of sections
                texts = []
                for p in parts:
                    if isinstance(p, str):
                        texts.append(p)
                    elif isinstance(p, dict):
                        texts.append(p.get('_', ''))
                abstract = ' '.join(texts)
            elif isinstance(parts, str):
                abstract = parts
            elif isinstance(parts, dict):
                abstract = parts.get('_', '')
        out.append({'pmid': pmid, 'title': str(title), 'abstract': str(abstract)})
    return out
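
Structured abstracts often label their sections (for example BACKGROUND or METHODS), and the function above discards those labels when joining the parts. The sketch below shows one way to keep them. It assumes that parsed elements expose their XML attributes through an `.attributes` dictionary, and it falls back gracefully for plain strings; `join_abstract_parts` and `FakeElement` are hypothetical names used only for this offline demo.

```python
def join_abstract_parts(parts):
    """Join abstract sections, prefixing each with its Label when present."""
    if isinstance(parts, str):
        parts = [parts]
    pieces = []
    for p in parts or []:
        # Assumption: parsed elements carry XML attributes in `.attributes`;
        # plain str objects do not, so fall back to an empty dict.
        label = getattr(p, "attributes", {}).get("Label")
        text = str(p).strip()
        if text:
            pieces.append(f"{label}: {text}" if label else text)
    return " ".join(pieces)

class FakeElement(str):
    """Stand-in for a parsed element, used only to demo the helper offline."""

    def __new__(cls, value, attributes=None):
        obj = super().__new__(cls, value)
        obj.attributes = attributes or {}
        return obj

demo = [
    FakeElement("Kinases drive growth.", {"Label": "BACKGROUND"}),
    "CHK1 inhibitors were screened.",
]
print(join_abstract_parts(demo))
# BACKGROUND: Kinases drive growth. CHK1 inhibitors were screened.
```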

In [7]:
docs = fetch_pubmed_abstracts('cancer drug', max_records=10)
for d in docs:
    print(d['pmid'], '-', d['title'][:120])

41111411 - Pan-Cancer Analyses of Shared and Distinct Gene Expression in 17 Cancers: Rethinking Cancer Classification and Moving Be
41111393 - The Proteostasis Network is a Therapeutic Target in Acute Myeloid Leukemia.
41111358 - BCL-2 inhibition in Waldenström macroglobulinaemia and marginal zone lymphoma.
41111237 - Glutathione-Responsive Polyhomocysteine Derivatives with Ultralow Toxicity toward Therapeutic Delivery.
41111131 - Integrated computational-experimental pipeline for CHK1 inhibitor discovery: structure-based identification of novel che
41111109 - Assessing SMC Complex Function in Replication Fork Progression with DNA Fiber Assays.
41111090 - YAP/TEAD inhibitor VT3989 in solid tumors: a phase 1/2 trial.
41111074 - Comprehensive evaluation of high dose methotrexate therapy: a retrospective observational trial.
41111053 - Nuclear receptor ESRRA promotes ERα-positive breast cancer through dual action on super enhancers and promoters to regul
41111032 - ESMO guidance on the us

In [8]:
def clean_text(txt):
    # Collapse runs of whitespace into single spaces
    t = re.sub(r'\s+', ' ', txt or '').strip()
    # Drop numeric citation markers such as [12]
    t = re.sub(r'\[[0-9]+\]', '', t)
    return t

def remove_stopwords(txt):
    tokens = [w for w in word_tokenize(txt) if re.match(r'\w', w)]
    return ' '.join([w for w in tokens if w.lower() not in stop])

#nlp = spacy.load('en_core_web_sm')

def extract_entities(text):
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]

print(extract_entities(clean_text(docs[0]['abstract'])))

[('Cancer', 'CANCER'), ('pan-cancer', 'CANCER'), ('cancers', 'CANCER'), ('pan-cancer', 'CANCER'), ('cancers', 'CANCER'), ('adrenocortical cancer', 'CANCER'), ('lung cancer', 'CANCER'), ('kidney cancer', 'CANCER'), ('colorectal cancer', 'CANCER'), ('tissue', 'TISSUE'), ('TFs', 'CELL'), ('miR-124-3p', 'GENE_OR_GENE_PRODUCT'), ('miR-7106-5p', 'GENE_OR_GENE_PRODUCT'), ('SP1', 'GENE_OR_GENE_PRODUCT'), ('RELA', 'GENE_OR_GENE_PRODUCT'), ('NF-κB Subunit', 'GENE_OR_GENE_PRODUCT'), ('Nuclear Factor Kappa B Subunit 1', 'GENE_OR_GENE_PRODUCT'), ('NFKB1', 'GENE_OR_GENE_PRODUCT'), ('TFs', 'CELL'), ('Cyclin-Dependent Kinase 2', 'GENE_OR_GENE_PRODUCT'), ('CDK2', 'GENE_OR_GENE_PRODUCT'), ('Histone Deacetylase 1', 'GENE_OR_GENE_PRODUCT'), ('HDAC1', 'GENE_OR_GENE_PRODUCT'), ('ABL', 'GENE_OR_GENE_PRODUCT'), ('Non-Receptor Tyrosine Kinase', 'GENE_OR_GENE_PRODUCT'), ('ABL1', 'GENE_OR_GENE_PRODUCT'), ('cancer', 'CANCER'), ('PI3K-Akt', 'GENE_OR_GENE_PRODUCT'), ('p53', 'GENE_OR_GENE_PRODUCT'), ('SP1', 'GENE_OR
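
A long list of tuples like this is easier to read once it is summarized. The sketch below counts entities per label and lists the unique texts for one label. The `entities` list holds a few pairs copied from the output above; the helper names `count_labels` and `unique_by_label` are made up for this example.

```python
from collections import Counter

# A few (text, label) pairs copied from the output above.
entities = [
    ("lung cancer", "CANCER"),
    ("kidney cancer", "CANCER"),
    ("tissue", "TISSUE"),
    ("SP1", "GENE_OR_GENE_PRODUCT"),
    ("RELA", "GENE_OR_GENE_PRODUCT"),
    ("CDK2", "GENE_OR_GENE_PRODUCT"),
    ("SP1", "GENE_OR_GENE_PRODUCT"),
]

def count_labels(ents):
    """Count how many entity mentions carry each label."""
    return Counter(label for _, label in ents)

def unique_by_label(ents, label):
    """Unique entity texts for one label, in first-seen order."""
    return list(dict.fromkeys(text for text, lab in ents if lab == label))

print(count_labels(entities).most_common())
# [('GENE_OR_GENE_PRODUCT', 4), ('CANCER', 2), ('TISSUE', 1)]
print(unique_by_label(entities, "GENE_OR_GENE_PRODUCT"))
# ['SP1', 'RELA', 'CDK2']
```

In the notebook, the same helpers could be applied directly to the list returned by `extract_entities`.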

In [9]:
TRIGGERS = ['inhibit', 'inhibits', 'inhibiting', 'activate', 'activates', 'bind', 'binds', 'block', 'suppress', 'associated', 'cause', 'causes', 'increase', 'decrease']
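
Because the list contains overlapping forms such as 'inhibit' and 'inhibits', a plain substring test like `t in sentence` reports both for the same sentence, and it also fires on words like "inhibitor". One possible alternative, sketched here with a hypothetical helper `find_triggers`, is to match triggers as whole words with a regular expression.

```python
import re

def find_triggers(sentence, triggers):
    """Return the triggers that occur as whole words, in order of appearance."""
    # Longest alternatives first, so longer forms are tried before their prefixes.
    ordered = sorted(triggers, key=len, reverse=True)
    alternation = "|".join(re.escape(t) for t in ordered)
    pattern = re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)
    return [m.group(0).lower() for m in pattern.finditer(sentence)]

sample = "Imatinib inhibits ABL1, and the inhibitor also binds KIT."
demo_triggers = ["inhibit", "inhibits", "bind", "binds"]

# Substring matching: every overlapping form is reported
print([t for t in demo_triggers if t in sample.lower()])
# ['inhibit', 'inhibits', 'bind', 'binds']

# Whole-word matching: only the forms that actually appear
print(find_triggers(sample, demo_triggers))
# ['inhibits', 'binds']
```

Whole-word matching is stricter, so it may miss inflected forms that are not in the list; which behavior is preferable depends on how noisy the extracted relations turn out to be.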

In [None]:
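def extract_relations(text):
    '''Return co-occurrence "relations": pairs of neighbouring entities
    found in sentences that contain a trigger word.'''
    doc = nlp(text)
    relations = []
    for sent in doc.sents:
        ents = sent.ents
        if len(ents) < 2:
            continue
        sent_l = sent.text.lower()
        for t in TRIGGERS:
            # Substring test: 'inhibit' also matches 'inhibits' and 'inhibitor'
            if t in sent_l:
                # Pair each entity only with the next entity in the sentence
                for e1, e2 in zip(ents, ents[1:]):
                    relations.append({'sentence': sent.text.strip(),
                                      'e1': e1.text,
                                      'e2': e2.text,
                                      'trigger': t})
    return relations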

query = 'cancer drug inhibitor'
docs = fetch_pubmed_abstracts(query, max_records=10)

triplets = []
for d in docs:
    txt = clean_text(d['abstract'])
    rels = extract_relations(txt)
    for r in rels:
        r.update({'pmid': d['pmid'], 'title': d['title']})
        triplets.append(r)

df = pd.DataFrame(triplets)
print('Triplets found', len(df))
display(df.head())
df.to_csv('pubmed_triplets.csv', index = False)

if not df.empty:
    G = nx.DiGraph()
    for _, row in df.iterrows():
        G.add_edge(row['e1'], row['e2'], label = row['trigger'])
    plt.figure(figsize=(6,6))
    pos = nx.spring_layout(G, seed = 2)
    nx.draw(G, pos, with_labels=True, node_size=900, font_size=9)
    nx.draw_networkx_edge_labels(G,pos,edge_labels=nx.get_edge_attributes(G, 'label')) #font_color='red'))
    plt.title('Knowledge Graph')
    plt.show()

else:
    print('No relations detected')
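
A `DiGraph` stores a single `label` per edge, so when the same entity pair appears with several triggers, each later `add_edge` call overwrites the earlier label. The sketch below merges the triggers per pair before any edges are added. The function `merge_edge_labels` is a hypothetical helper, and the `rows` are made-up examples in the same shape as the `triplets` dictionaries.

```python
from collections import defaultdict

def merge_edge_labels(rows):
    """Group triggers by (e1, e2) so each edge gets one combined label."""
    grouped = defaultdict(set)
    for row in rows:
        grouped[(row["e1"], row["e2"])].add(row["trigger"])
    return {pair: ", ".join(sorted(trigs)) for pair, trigs in grouped.items()}

# Made-up rows for illustration only; not real extraction results.
rows = [
    {"e1": "imatinib", "e2": "ABL1", "trigger": "inhibit"},
    {"e1": "imatinib", "e2": "ABL1", "trigger": "bind"},
    {"e1": "p53", "e2": "MDM2", "trigger": "bind"},
]

for (e1, e2), label in merge_edge_labels(rows).items():
    print(f"{e1} -> {e2}: {label}")
# imatinib -> ABL1: bind, inhibit
# p53 -> MDM2: bind
```

In the plotting cell, the merged labels could then be added with `G.add_edge(e1, e2, label=label)` in place of the row-by-row loop.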