<a href="https://colab.research.google.com/github/thevirusoup/Learning-Git/blob/main/Web%20Scrapping/Web_scraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Introduction to Web Scraping

Web scraping is the automated extraction of data from websites using software. In bioinformatics, it is useful for gathering:

- Gene or protein information from databases like NCBI, UniProt, or Ensembl

- Pathway data from KEGG

- Literature data from PubMed

- Taxonomy data from online sources

Why use web scraping in bioinformatics?

- Not every database offers an API.

- Pages can be re-fetched on demand, so extracted data stays up to date.

- Large numbers of records can be collected automatically for data mining and large-scale analyses.


#Basics of HTML for Web Scraping
HTML (HyperText Markup Language) is the standard language used to create and structure webpages. It uses tags to define different types of content, such as headings, paragraphs, tables, and links.



#Example (Double click it to see the html structure)

<!DOCTYPE html>
<html>
  <head>
    <title>Gene Information</title>
  </head>
  <body>
    <h1>BRCA1 Gene</h1>
    <p>This gene encodes a nuclear phosphoprotein...</p>
    <div class="summary">
      <p>BRCA1 is involved in DNA repair mechanisms...</p>
    </div>
    <a href="https://www.ncbi.nlm.nih.gov/gene/672">NCBI Link</a>
  </body>
</html>

#Common HTML Elements Useful in Web Scraping

| Tag         | Meaning              | Description for Scraping |
| ----------- | -------------------- | ------------------------ |
| `<html>`    | HTML root element    | Start of a webpage       |
| `<head>`    | Metadata and scripts | Not commonly scraped     |
| `<body>`    | Main content         | Scraping focus area      |
| `<h1>,<h2>` | Headings             | Often used as titles     |
| `<p>`       | Paragraph            | Used for descriptions    |
| `<div>`     | Division (grouping)  | Common wrapper tag       |
| `<a>`       | Anchor (hyperlink)   | Used to extract links    |
| `<span>`    | Inline container     | Often used with styling  |
| `<table>`   | Table                | Used in structured data  |
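
To make the table concrete, here is a minimal sketch that walks a shortened copy of the example page and collects heading text and link targets. It uses Python's standard-library `html.parser` so it runs before any extra packages are installed; the class name `TagCollector` and the `SAMPLE_HTML` string are illustrative assumptions, not part of the original notebook.

```python
from html.parser import HTMLParser

# Shortened copy of the example page shown earlier
SAMPLE_HTML = """
<html>
  <body>
    <h1>BRCA1 Gene</h1>
    <p>This gene encodes a nuclear phosphoprotein...</p>
    <a href="https://www.ncbi.nlm.nih.gov/gene/672">NCBI Link</a>
  </body>
</html>
"""


class TagCollector(HTMLParser):
    """Collect heading text and link targets while the parser walks the page."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self.links = []
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2"):
            self._in_heading = True
        elif tag == "a":
            href = dict(attrs).get("href")  # attrs is a list of (name, value) pairs
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("h1", "h2"):
            self._in_heading = False

    def handle_data(self, data):
        # Only keep text that appears inside a heading tag
        if self._in_heading and data.strip():
            self.headings.append(data.strip())


collector = TagCollector()
collector.feed(SAMPLE_HTML)
print(collector.headings)
print(collector.links)
```

BeautifulSoup wraps this kind of tag walking behind a simpler search interface, which is why it is the more common choice for scraping.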


Real Example from NCBI Gene Page (simplified) (Double click)

<div class="gene-summary">
  <p>The BRCA1 gene is involved in DNA repair...</p>
</div>

In web scraping, we:

- Look for the `<div>` with class `gene-summary`.

- Extract the `<p>` inside it using BeautifulSoup.
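
The two steps above can be sketched as follows, assuming the `beautifulsoup4` package is installed. The HTML string is the simplified snippet from this section, not the markup of the live NCBI page, so the class name `gene-summary` should be treated as illustrative.

```python
from bs4 import BeautifulSoup

# Simplified snippet from this section (not the real NCBI markup)
html = """
<div class="gene-summary">
  <p>The BRCA1 gene is involved in DNA repair...</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Step 1: locate the <div> by its class attribute
summary_div = soup.find("div", class_="gene-summary")

# Step 2: take the first <p> inside it, guarding against missing elements
paragraph = summary_div.find("p") if summary_div else None
summary_text = paragraph.get_text(strip=True) if paragraph else None

print(summary_text)
```

Checking for `None` at each step matters on real pages, where a layout change can silently remove the element a scraper depends on.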

#Required Python Libraries
| Library         | Purpose                                                |
| --------------- | ------------------------------------------------------ |
| `requests`      | To send HTTP requests to webpages and get HTML content |
| `BeautifulSoup` | To parse HTML content and extract specific data        |


In [None]:
# Install required libraries
!pip install requests
!pip install beautifulsoup4



#Understanding the Target Website

We’ll retrieve gene summary data from NCBI Gene. For example, the page for the human BRCA1 gene is:

https://www.ncbi.nlm.nih.gov/gene/672

This page contains:

- Gene name

- Description

- Organism

- Summary

The practical below fetches the same summary through NCBI's E-utilities API instead of parsing the page HTML.

#1.**Gene or protein information from databases like NCBI**

In [None]:
# Practical 1: fetch the BRCA1 gene summary through NCBI E-utilities
import requests

# Step 1: NCBI ESummary endpoint for the Gene database
gene_id = "672"  # NCBI Gene ID for human BRCA1
url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=gene&id={gene_id}&retmode=json"

# Step 2: Send GET request and stop early on an HTTP error
response = requests.get(url, timeout=30)
response.raise_for_status()
data = response.json()

# Step 3: Extract summary (results are keyed by gene ID)
summary = data["result"][gene_id]["summary"]

# Step 4: Display result
print("BRCA1 Gene Summary from NCBI:")
print(summary)


BRCA1 Gene Summary from NCBI:
This gene encodes a 190 kD nuclear phosphoprotein that plays a role in maintaining genomic stability, and it also acts as a tumor suppressor. The BRCA1 gene contains 22 exons spanning about 110 kb of DNA. The encoded protein combines with other tumor suppressors, DNA damage sensors, and signal transducers to form a large multi-subunit protein complex known as the BRCA1-associated genome surveillance complex (BASC). This gene product associates with RNA polymerase II, and through the C-terminal domain, also interacts with histone deacetylase complexes. This protein thus plays a role in transcription, DNA repair of double-stranded breaks, and recombination. Mutations in this gene are responsible for approximately 40% of inherited breast cancers and more than 80% of inherited breast and ovarian cancers. Alternative splicing plays a role in modulating the subcellular localization and physiological function of this gene. Many alternatively spliced transcript va

Explanation

| Part            | Description                                                    |
| --------------- | -------------------------------------------------------------- |
| `esummary.fcgi` | E-utilities endpoint that returns a summary record for each ID |
| `db=gene`       | Specifies we’re querying the Gene database                     |
| `id=672`        | NCBI Gene ID for human BRCA1                                   |
| `retmode=json`  | We request JSON-formatted data for easy parsing                |
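
As a sketch of how these URL parts fit together, the helpers below build the query string from a dictionary and read the summary defensively. The function names `build_esummary_url` and `extract_summary` and the shortened `sample_payload` are assumptions made for illustration; a real response carries many more fields.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_esummary_url(gene_id, db="gene", retmode="json"):
    """Assemble an ESummary URL from the parts listed in the table."""
    query = urlencode({"db": db, "id": gene_id, "retmode": retmode})
    return f"{EUTILS_BASE}esummary.fcgi?{query}"


def extract_summary(payload, gene_id):
    """Return the summary text, or None when the expected keys are missing."""
    return payload.get("result", {}).get(str(gene_id), {}).get("summary")


# Hypothetical, heavily shortened payload used only to exercise the helper
sample_payload = {
    "result": {
        "672": {"summary": "This gene encodes a 190 kD nuclear phosphoprotein..."}
    }
}

print(build_esummary_url("672"))
print(extract_summary(sample_payload, "672"))
print(extract_summary(sample_payload, "999"))  # unknown ID yields None
```

Using `.get` with defaults avoids a `KeyError` when an ID is not found, which is easier to handle in a loop over many genes.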


#2.**Pathway data from KEGG**

In [None]:
import requests

# KEGG Pathway ID for Glycolysis/Gluconeogenesis in E. coli
pathway_id = "eco00010"

# Step 1: URL to fetch pathway info
url = f"https://rest.kegg.jp/get/{pathway_id}"

# Step 2: Send GET request
response = requests.get(url, timeout=30)

# Step 3: Print plain text data
if response.status_code == 200:
    print(f"KEGG Pathway: {pathway_id} (Glycolysis/Gluconeogenesis in E. coli)\n")
    print(response.text)
else:
    print(f"Failed to retrieve data from KEGG (HTTP {response.status_code}).")


KEGG Pathway: eco00010 (Glycolysis/Gluconeogenesis in E. coli)

ENTRY       eco00010                    Pathway
NAME        Glycolysis / Gluconeogenesis - Escherichia coli K-12 MG1655
DESCRIPTION Glycolysis is the process of converting glucose into pyruvate and generating small amounts of ATP (energy) and NADH (reducing power). It is a central pathway that produces important precursor metabolites: six-carbon compounds of glucose-6P and fructose-6P and three-carbon compounds of glycerone-P, glyceraldehyde-3P, glycerate-3P, phosphoenolpyruvate, and pyruvate [MD:M00001]. Acetyl-CoA, another important precursor metabolite, is produced by oxidative decarboxylation of pyruvate [MD:M00307]. When the enzyme genes of this pathway are examined in completely sequenced genomes, the reaction steps of three-carbon compounds from glycerone-P to pyruvate form a conserved core module [MD:M00002], which is found in almost all organisms and which sometimes contains operon structures in bacterial genomes.
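
The response above is a KEGG flat file: each field name sits in the first 12 columns and continuation lines are indented to the same depth. A minimal sketch of grouping such a record by field is shown below. The helper names `kegg_line` and `parse_kegg_flat` are hypothetical, and the sample record is a shortened stand-in assembled from the output in this notebook.

```python
def kegg_line(key, content):
    """Format one flat-file line: field name padded to 12 columns, then content."""
    return key.ljust(12) + content


def parse_kegg_flat(text):
    """Group a KEGG flat-file record into {field name: [content lines]}."""
    fields = {}
    current = None
    for line in text.splitlines():
        if line.startswith("///"):  # end-of-record marker
            break
        if not line.strip():
            continue
        key, content = line[:12].strip(), line[12:].strip()
        if key:  # a non-empty first column starts a new field
            current = key
            fields.setdefault(current, [])
        if current is not None and content:
            fields[current].append(content)
    return fields


# Shortened sample record (annotations in brackets omitted)
SAMPLE = "\n".join([
    kegg_line("ENTRY", "eco00010                    Pathway"),
    kegg_line("NAME", "Glycolysis / Gluconeogenesis - Escherichia coli K-12 MG1655"),
    kegg_line("GENE", "b0114  aceE; pyruvate dehydrogenase E1 component"),
    kegg_line("", "b0115  aceF; pyruvate dehydrogenase, E2 subunit"),
    "///",
])

fields = parse_kegg_flat(SAMPLE)
print(fields["NAME"])
print(fields["GENE"])
```

Indented sub-fields are treated as separate fields in this sketch, so it is a starting point rather than a full parser for the format.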

In [None]:
# How to find pathway IDs for E. coli:
# list all pathway IDs for the organism code "eco"
url = "https://rest.kegg.jp/list/pathway/eco"
response = requests.get(url, timeout=30)

# Display first 10 entries
print("First 10 Pathways in E. coli (eco):\n")
print("\n".join(response.text.splitlines()[:10]))


First 10 Pathways in E. coli (eco):

eco01100	Metabolic pathways - Escherichia coli K-12 MG1655
eco01110	Biosynthesis of secondary metabolites - Escherichia coli K-12 MG1655
eco01120	Microbial metabolism in diverse environments - Escherichia coli K-12 MG1655
eco01200	Carbon metabolism - Escherichia coli K-12 MG1655
eco01210	2-Oxocarboxylic acid metabolism - Escherichia coli K-12 MG1655
eco01212	Fatty acid metabolism - Escherichia coli K-12 MG1655
eco01230	Biosynthesis of amino acids - Escherichia coli K-12 MG1655
eco01232	Nucleotide metabolism - Escherichia coli K-12 MG1655
eco01250	Biosynthesis of nucleotide sugars - Escherichia coli K-12 MG1655
eco01240	Biosynthesis of cofactors - Escherichia coli K-12 MG1655
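
Each line of this listing is a pathway ID and a name separated by a tab, so it can be split into pairs for searching. The sketch below works on two lines copied from the output above; `parse_kegg_list` is a hypothetical helper name.

```python
def parse_kegg_list(text):
    """Turn KEGG list output (one 'id<TAB>name' pair per line) into tuples."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        pathway_id, _, name = line.partition("\t")
        rows.append((pathway_id.strip(), name.strip()))
    return rows


# Two lines copied from the listing above
sample = (
    "eco01100\tMetabolic pathways - Escherichia coli K-12 MG1655\n"
    "eco01200\tCarbon metabolism - Escherichia coli K-12 MG1655\n"
)

rows = parse_kegg_list(sample)

# Case-insensitive keyword search over pathway names
matches = [pid for pid, name in rows if "carbon" in name.lower()]

print(rows[0])
print(matches)
```

The same keyword search applied to the full listing is a quick way to find an ID such as `eco00010` without browsing the KEGG website.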


#Parse KEGG Pathway Data for E. coli and Save as a Table
We will:

- Request pathway data from KEGG (e.g., eco00010)

- Parse gene information using basic string processing

- Store it as a structured table using pandas

- Export it as a CSV for downstream bioinformatics work

In [None]:
import requests
import pandas as pd

# Step 1: Fetch KEGG pathway data
pathway_id = "eco00010"  # Glycolysis/Gluconeogenesis in E. coli
url = f"https://rest.kegg.jp/get/{pathway_id}"
response = requests.get(url, timeout=30)

if response.status_code == 200:
    raw_text = response.text

    # Step 2: Parse gene section from KEGG flat file format
    lines = raw_text.split("\n")
    genes = []

    in_gene_section = False

    for line in lines:
        if line.startswith("GENE"):
            in_gene_section = True
            content = line[12:]  # Skip "GENE       "
        elif in_gene_section and line.startswith(" " * 12):
            content = line[12:]  # continuation line
        else:
            in_gene_section = False
            continue

        # Parse content, e.g. "b0114  aceE; pyruvate dehydrogenase E1 component [KO:...]"
        parts = content.split(None, 1)  # split at the first run of whitespace
        if len(parts) == 2:
            gene_id, rest = parts
            # The gene symbol, when present, precedes the first semicolon
            if ";" in rest:
                gene_symbol, description = (s.strip() for s in rest.split(";", 1))
            else:
                gene_symbol = ""
                description = rest.strip()
            genes.append((gene_id, gene_symbol, description))

    # Step 3: Create DataFrame
    df = pd.DataFrame(genes, columns=["KEGG Gene ID", "Gene Symbol", "Function Description"])

    # Step 4: Save to CSV
    df.to_csv("Ecoli_Glycolysis_Pathway.csv", index=False)

    # Display first few rows
    print("Extracted pathway genes:\n")
    print(df.head())

else:
    print(f"Failed to retrieve data from KEGG (HTTP {response.status_code}).")


Extracted pathway genes:

  KEGG Gene ID Gene Symbol                          Function Description
0        b0114        aceE   pyruvate dehydrogenase E1 component [KO:...
1        b0115        aceF   pyruvate dehydrogenase, E2 subunit [KO:K...
2        b0116         lpd  lipoamide dehydrogenase [KO:K00382] [EC:1...
3        b0325        yahK   aldehyde reductase, NADPH-dependent [KO:...
4        b0356        frmA   S-(hydroxymethyl)glutathione dehydrogena...
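
The function descriptions still carry bracketed database tags such as `[KO:K00382]`. A minimal sketch for separating those tags from the free text is shown below; the helper name `split_annotations` is hypothetical, and the two sample strings are taken from fragments that appear in this notebook.

```python
import re

# Matches tags of the form [NAME:value ...], e.g. [KO:K00382] or [EC:1.2.1.12]
ANNOTATION = re.compile(r"\[(\w+):([^\]]+)\]")


def split_annotations(description):
    """Separate the free-text function from bracketed tags."""
    # A tag can list several space-separated values, so keep them as a list
    tags = {db: value.split() for db, value in ANNOTATION.findall(description)}
    text = ANNOTATION.sub("", description).strip()
    return text, tags


text1, tags1 = split_annotations("lipoamide dehydrogenase [KO:K00382]")
text2, tags2 = split_annotations(
    "glyceraldehyde-3-phosphate dehydrogenase [EC:1.2.1.12]"
)

print(text1, tags1)
print(text2, tags2)
```

Storing KO and EC identifiers in their own columns makes it easier to join the table with other KEGG or enzyme resources later.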


#**3. Literature mining**
Query PubMed for recent articles related to "E. coli glycolysis" and extract:

- PMID

- Article Title

- Journal

- Year

- First Author

In [None]:
import requests
import xml.etree.ElementTree as ET
import pandas as pd

# Step 1: Define the search query and base URL
search_term = "E. coli glycolysis"
base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

# Step 2: Use ESearch to get a list of PMIDs (params takes care of URL encoding)
search_params = {"db": "pubmed", "term": search_term, "retmax": 10, "retmode": "xml"}
search_response = requests.get(f"{base_url}esearch.fcgi", params=search_params, timeout=30)
search_response.raise_for_status()
search_root = ET.fromstring(search_response.text)

# Extract PMIDs
pmid_list = [id_elem.text for id_elem in search_root.findall(".//Id")]

# Step 3: Use ESummary to get article metadata
summary_params = {"db": "pubmed", "id": ",".join(pmid_list), "retmode": "xml"}
summary_response = requests.get(f"{base_url}esummary.fcgi", params=summary_params, timeout=30)
summary_response.raise_for_status()
summary_root = ET.fromstring(summary_response.text)

# Step 4: Extract article details
articles = []
for docsum in summary_root.findall(".//DocSum"):
    pmid = docsum.find("./Id").text
    title = ""
    journal = ""
    pubdate = ""
    first_author = ""

    for item in docsum.findall("Item"):
        if item.attrib.get("Name") == "Title":
            title = item.text
        elif item.attrib.get("Name") == "Source":
            journal = item.text
        elif item.attrib.get("Name") == "PubDate":
            # PubDate looks like "2025 May 2"; keep only the leading year
            pubdate = (item.text or "").split(" ")[0]
        elif item.attrib.get("Name") == "AuthorList":
            first_author = item[0].text if len(item) > 0 else ""

    articles.append((pmid, title, journal, pubdate, first_author))

# Step 5: Create and display DataFrame
df = pd.DataFrame(articles, columns=["PMID", "Title", "Journal", "Year", "First Author"])
df.to_csv("Ecoli_Glycolysis_Literature.csv", index=False)
df.head()


|     | PMID     | Title                                              | Journal           | Year | First Author |
| --- | -------- | -------------------------------------------------- | ----------------- | ---- | ------------ |
| 0   | 40316502 | Metabolic Engineering of Escherichia coli for ...  | J Agric Food Chem | 2025 | Zhang Q      |
| 1   | 40288656 | Efficient production of ectoine from Jerusalem...  | Bioresour Technol | 2025 | Zhang H      |
| 2   | 40279546 | Kinsenoside-Loaded Microneedle Accelerates Dia...  | Adv Sci (Weinh)   | 2025 | Lu L         |
| 3   | 40270472 | Zonated Copper-Driven Breast Cancer Progressio...  | Adv Sci (Weinh)   | 2025 | Chen L       |
| 4   | 40266004 | Integration of Macrogenomics and Metabolomics:...  | J Agric Food Chem | 2025 | Wang K       |
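
To see the XML shape the loop relies on, here is an offline sketch that parses one hand-written `DocSum` record. The sample is assembled from the first row of the results (with the title left truncated as displayed), the helper name `docsum_to_record` is hypothetical, and a real response contains many more `Item` elements.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the ESummary XML layout; values come from the first result row
SAMPLE_XML = """<eSummaryResult>
  <DocSum>
    <Id>40316502</Id>
    <Item Name="PubDate" Type="Date">2025 May 2</Item>
    <Item Name="Source" Type="String">J Agric Food Chem</Item>
    <Item Name="AuthorList" Type="List">
      <Item Name="Author" Type="String">Zhang Q</Item>
    </Item>
    <Item Name="Title" Type="String">Metabolic Engineering of Escherichia coli for ...</Item>
  </DocSum>
</eSummaryResult>"""


def docsum_to_record(docsum):
    """Convert one <DocSum> element into a flat dictionary."""
    # Index the direct <Item> children by their Name attribute
    fields = {item.get("Name"): item for item in docsum.findall("Item")}

    def text_of(name):
        item = fields.get(name)
        return (item.text or "").strip() if item is not None else ""

    authors = fields.get("AuthorList")
    has_author = authors is not None and len(authors) > 0
    return {
        "PMID": docsum.findtext("Id", default=""),
        "Title": text_of("Title"),
        "Journal": text_of("Source"),
        "Year": text_of("PubDate").split(" ")[0],  # leading token of PubDate
        "First Author": authors[0].text if has_author else "",
    }


root = ET.fromstring(SAMPLE_XML)
record = docsum_to_record(root.find("DocSum"))
print(record)
```

Elements are compared with `is not None` rather than by truthiness, because an XML element without children evaluates as false in `ElementTree`.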


#**4. Taxonomy and Species Information**

Scrape the taxonomy lineage of E. coli from NCBI:

Domain → Phylum → Class → Order → Family → Genus → Species

In [None]:
import requests
import xml.etree.ElementTree as ET
import pandas as pd

# Step 1: E. coli Taxonomy ID
taxid = 562
url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&id={taxid}&retmode=xml"

# Step 2: Fetch and parse XML
response = requests.get(url)
root = ET.fromstring(response.text)

# Step 3: Read the lineage from the top-level record only.
# Each ancestor inside <LineageEx> carries its own <Rank>, so ranks need not be guessed.
record = root.find("Taxon")
if record is None:
    raise ValueError(f"No taxonomy record returned for taxid {taxid}")

lineage = [(t.findtext("Rank"), t.findtext("ScientificName"))
           for t in record.findall("LineageEx/Taxon")]
lineage.append((record.findtext("Rank"), record.findtext("ScientificName")))

# Step 4: Keep the seven standard ranks
# (NCBI has labelled the top level both "domain" and "superkingdom")
standard_ranks = ["domain", "phylum", "class", "order", "family", "genus", "species"]
taxonomy_table = []
for rank, name in lineage:
    rank = "domain" if rank == "superkingdom" else rank
    if rank in standard_ranks:
        taxonomy_table.append((rank.capitalize(), name))

# Step 5: Create and display DataFrame
df = pd.DataFrame(taxonomy_table, columns=["Rank", "Name"])
df.to_csv("Ecoli_Taxonomy_API.csv", index=False)
df


|     | Rank    | Name                |
| --- | ------- | ------------------- |
| 0   | Domain  | Bacteria            |
| 1   | Phylum  | Pseudomonadota      |
| 2   | Class   | Gammaproteobacteria |
| 3   | Order   | Enterobacterales    |
| 4   | Family  | Enterobacteriaceae  |
| 5   | Genus   | Escherichia         |
| 6   | Species | Escherichia coli    |
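
The taxonomy XML nests one `Taxon` element per ancestor inside `LineageEx`, so a recursive search such as `.//Taxon` returns the ancestors as well as the requested organism. The offline sketch below illustrates the difference on a hand-written sample that assumes this layout; it lists only three ancestors, and the rank label `domain` for Bacteria is an assumption about the current NCBI naming.

```python
import xml.etree.ElementTree as ET

# Hand-written, shortened sample in the assumed efetch taxonomy layout
SAMPLE_XML = """
<TaxaSet>
  <Taxon>
    <TaxId>562</TaxId>
    <ScientificName>Escherichia coli</ScientificName>
    <Rank>species</Rank>
    <Lineage>Bacteria; Enterobacteriaceae; Escherichia</Lineage>
    <LineageEx>
      <Taxon><ScientificName>Bacteria</ScientificName><Rank>domain</Rank></Taxon>
      <Taxon><ScientificName>Enterobacteriaceae</ScientificName><Rank>family</Rank></Taxon>
      <Taxon><ScientificName>Escherichia</ScientificName><Rank>genus</Rank></Taxon>
    </LineageEx>
  </Taxon>
</TaxaSet>
"""

root = ET.fromstring(SAMPLE_XML)

# Recursive search: the organism itself plus every nested ancestor
all_taxa = root.findall(".//Taxon")

# Direct child only: the record for the requested organism
record = root.find("Taxon")

# Pair each ancestor with the rank NCBI assigned to it, then add the organism
lineage = [
    (t.findtext("Rank"), t.findtext("ScientificName"))
    for t in record.findall("LineageEx/Taxon")
]
lineage.append((record.findtext("Rank"), record.findtext("ScientificName")))

print(len(all_taxa))
for rank, name in lineage:
    print(f"{rank}: {name}")
```

Reading the rank from each nested element avoids aligning names against a fixed list of ranks, which breaks whenever the lineage contains unranked or extra levels.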


In [None]:
#Step: Install Required Library
!pip install biopython

Collecting biopython
  Downloading biopython-1.85-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (13 kB)
Downloading biopython-1.85-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.3 MB)
Installing collected packages: biopython
Successfully installed biopython-1.85
