<a href="https://colab.research.google.com/github/shivanshr58/COVID-19-Vaccine-Data-Web-Scraping/blob/main/covid_19_vaccine_data_web_scraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# COVID-19 Vaccine Data Web Scraping Initiative

## Introduction
This project uses web scraping to collect COVID-19 vaccine and treatment data from the Milken Institute's COVID-19 tracker. Using a small set of Python libraries, I extract developer names, product descriptions, funder names, clinical trial identifiers, and development phases. The notebook below works through the tracker's antibody treatments section and cleans the result into a single table.

## Project Objectives
- Practice web scraping with Python to collect data from a live website.
- Navigate nested HTML structures using Beautiful Soup.
- Build a dataset of developers, funders, clinical trials, and development phases.
- Use Pandas to structure and clean the scraped data.
- Write a scraping script that can be re-run to pick up the tracker's latest entries.

## Technical Tools
- **Python**: The language used for both scraping and data handling.
- **Beautiful Soup**: Parses HTML and XML documents and lets me search them by tag, id, and class.
- **Requests**: Sends the HTTP request that retrieves the tracker page.
- **Pandas**: Holds the scraped data in a DataFrame for cleaning and analysis.


This project reflects my dedication to learning and showcases my proficiency in web scraping with Python, emphasizing the practical application of tools like Beautiful Soup and Pandas.


# Importing Necessary Libraries

In [None]:
import pandas as pd
from bs4 import BeautifulSoup as soup
import requests

# Scraping

In [None]:
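# getting and parsing the html from the website that has the tracker data
html = requests.get("https://covid-19tracker.milkeninstitute.org/").text
# naming the parser explicitly avoids bs4's "no parser was explicitly specified" warning
parsed = soup(html, "html.parser")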
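
The request above has no timeout and does not check the HTTP status. A minimal sketch of a more defensive version follows. The helper names `fetch_html` and `parse_html` and the 30-second timeout are assumptions for illustration, not part of the original notebook. The last lines parse an inline sample, so the block can be checked without network access.

```python
import requests
from bs4 import BeautifulSoup

TRACKER_URL = "https://covid-19tracker.milkeninstitute.org/"


def fetch_html(url, timeout=30):
    """Download a page and return its HTML, raising on HTTP error codes."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    return response.text


def parse_html(html):
    """Parse HTML with Python's built-in parser."""
    return BeautifulSoup(html, "html.parser")


# Offline check on a made-up snippet; a real run would call
# parse_html(fetch_html(TRACKER_URL)) instead.
sample = parse_html('<div id="treatment_antibodies"><p>ok</p></div>')
section = sample.find("div", id="treatment_antibodies")
print(section.p.text)
```

If the tracker page is unavailable, `raise_for_status()` fails loudly here rather than letting a later `find` return `None`.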

In [None]:
# getting the section of the page that holds the antibody treatments
Antibodies = parsed.find("div", attrs = {"id" : "treatment_antibodies","class":"chart-section for_treatments"})

In [None]:
# getting developer details
Developer = []
for div in Antibodies.find_all("div", class_="is_h5-2 is_developer w-richtext"):
  Developer.append(div.text)

In [None]:
# getting Product_Description
Product_Description = []
for div in Antibodies.find_all("div", class_="is_h5-2 is_treatments w-richtext"):
  Product_Description.append(div.text)
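
The two loops above follow the same pattern: find every `div` with a given class string and keep its text. A minimal sketch of that pattern as one helper follows. The name `collect_text` and the inline sample HTML are assumptions for illustration; only the class strings come from the notebook.

```python
from bs4 import BeautifulSoup


def collect_text(container, class_name):
    """Return the stripped text of every <div> whose class attribute matches class_name."""
    return [
        div.get_text(" ", strip=True)
        for div in container.find_all("div", class_=class_name)
    ]


# Made-up sample that mimics the class names used in the notebook.
SAMPLE_HTML = """
<div id="treatment_antibodies">
  <div class="is_h5-2 is_developer w-richtext"> Developer A </div>
  <div class="is_h5-2 is_treatments w-richtext">Product A</div>
  <div class="is_h5-2 is_developer w-richtext">Developer B</div>
  <div class="is_h5-2 is_treatments w-richtext">Product B</div>
</div>
"""

section = BeautifulSoup(SAMPLE_HTML, "html.parser").find("div", id="treatment_antibodies")
developers = collect_text(section, "is_h5-2 is_developer w-richtext")
products = collect_text(section, "is_h5-2 is_treatments w-richtext")
print(developers, products)
```

A multi-word `class_` string is matched against the full class attribute exactly as written, so the same classes in a different order would not match.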

In [None]:
# getting Funder and Clinical_Trial_Covid details
Funder_Clinical_Trial_Covid = []
for row in Antibodies.find_all("div", class_ = "chart_row-expanded", attrs = {"id": "w-node-f2b970c8-db0b-3c00-b222-aea9198e07fe-d8371e24"}):
  l1 = [i.text for i in row.find_all(class_ = "is_h6 w-richtext")]
  Funder_Clinical_Trial_Covid.append(l1)
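
Collecting each field in its own loop assumes that every row contains every field. An alternative is to parse one row at a time, so that a missing field cannot shift the lists out of alignment. The sketch below assumes a hypothetical wrapper class `chart_row` around each entry, which may not match the tracker's real markup. The trial IDs in the sample are placeholders.

```python
from bs4 import BeautifulSoup

# Made-up rows; "chart_row" is a hypothetical wrapper class.
ROW_HTML = """
<div class="chart_row">
  <div class="is_h5-2 is_developer w-richtext">Developer A</div>
  <div class="chart_row-expanded">
    <div class="is_h6 w-richtext">Unknown</div>
    <div class="is_h6 w-richtext">NCT00000001, NCT00000002</div>
  </div>
</div>
<div class="chart_row">
  <div class="is_h5-2 is_developer w-richtext">Developer B</div>
  <div class="chart_row-expanded"></div>
</div>
"""


def parse_row(row):
    """Extract one record from a single row element, tolerating missing parts."""
    developer = row.find("div", class_="is_h5-2 is_developer w-richtext")
    details = row.find("div", class_="chart_row-expanded")
    funder_trials = []
    if details is not None:
        funder_trials = [
            d.get_text(" ", strip=True)
            for d in details.find_all(class_="is_h6 w-richtext")
        ]
    return {
        "Developer": developer.get_text(" ", strip=True) if developer else None,
        "Funder_Clinical_Trial_Covid": funder_trials,
    }


soup_rows = BeautifulSoup(ROW_HTML, "html.parser").find_all("div", class_="chart_row")
rows = [parse_row(r) for r in soup_rows]
print(rows)
```

A list of per-row dictionaries like `rows` can be passed directly to `pd.DataFrame`.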

In [None]:
# temp = []
# for i in list(filter(lambda x: len(x) ==1, Funder_Clinical_Trial_Covid)):
#   if i not in temp:
#     temp.append(i)

In [None]:
# temp = []
# for i in Funder_Clinical_Trial_Covid:
#   if len(i) == 1 and i != ['Unknown']:
#     print(i)

In [None]:
# getting phase details
Phase = []
for block in Antibodies.find_all("div",attrs = {"id": "w-node-f51a01d3-6a9b-252d-6906-10d394febba9-d8371e24"}):
  Phase.append(block.find(class_ = "is_h5-2 is_stage-indicator").text)

# Creating DataFrame

In [None]:
# creating a dataframe with the scraped data
vaccines_df = pd.DataFrame({
  "Developer": Developer,
  "Product_Description": Product_Description,
  "Funder_Clinical_Trial_Covid": Funder_Clinical_Trial_Covid,
  "Phase": Phase,
})
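
Building a DataFrame from a dictionary of lists only works when all lists have the same length, which here depends on every scraped row having every field. A minimal sketch of an explicit check follows. The helper name `build_frame` is an assumption for illustration, and the sample values are made up.

```python
import pandas as pd


def build_frame(columns):
    """Build a DataFrame from a dict of lists after checking the lists line up."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        # Report every column length so the short column is easy to spot.
        raise ValueError(f"Column lengths differ: {lengths}")
    return pd.DataFrame(columns)


demo = build_frame({
    "Developer": ["Developer A", "Developer B"],
    "Phase": ["clinical", "pre-clinical"],
})
print(demo.shape)
```

The error message names each column with its length, which points at the selector that missed rows.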

In [None]:
vaccines_df

Unnamed: 0,Developer,Product_Description,Funder_Clinical_Trial_Covid,Phase
0,Corvus Pharmaceuticals and Lewis Katz School...,CPI-006,"[Unknown, NCT04734873, NCT04464395]",clinical
1,Alexion Pharmaceuticals. TACTIC-R trial,"Ultomiris (ravulizumab-cwvz), complement inhib...","[Unknown, NCT04369469, EudraCT 2020-001354-22 ...",clinical
2,Assistance Publique - Hopitaux de Paris (Phase...,"Soliris (eculizumab), complement inhibitor","[Unknown, NCT04288713, NCT04346797, NCT0435549...",clinical
3,AstraZeneca; ACCORD trial,"MEDI3506, monoclonal antibody targeting interl...","[UK Government (ACCORD study), EudraCT 2020-00...",clinical
4,BioCon/Equilium,"itolizumab, anti-CD6 IgG1 monoclonal antibody","[Unknown, NCT04475588, NCT04605926]",clinical
...,...,...,...,...
80,Rosalind Franklin Institute/ Oxford University...,Nanobodies from Llamas,[Unknown],pre-clinical
81,Tiziana Life Sciences,"TZLS-501, an anti-interleukin-6 receptor monoc...",[Unknown],pre-clinical
82,University of Texas at Austin/ US National Ins...,Linked nanobody antibody,[Unknown],pre-clinical
83,Virna Therapeutics/ University of Toronto,Neutralizing antibodies,[],pre-clinical


# Data Cleaning / Fixing errors

In [None]:
# Preparing Funder_Clinical_Trial_Covid column for splitting
vaccines_df["Funder_Clinical_Trial_Covid"] = vaccines_df["Funder_Clinical_Trial_Covid"].apply(lambda x: "_".join(x))

"""Some data points have only one element, which is either a funder name or a list of trial IDs. This leads to inconsistency after splitting the column.
 To address this, "Unknown" will be added as the funder where the funder is absent and the data point only contains trial IDs.

For example, when splitting 'NCT04399980, NCT04397497, NCT04447469, NCT04463004, EudraCT 2020-001795-15' (trial IDs only) with "_" as the delimiter,
 the trial IDs would end up under the funder column because the string contains no underscore.
To prevent this, "Unknown_" will be prepended in such cases (and for empty entries) so that the trial IDs are placed in the second column after the split."""
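
The rule described above can also be applied to the scraped lists directly, before any joining. The sketch below is one way to do that under stated assumptions: the function name `split_funder_trials` is hypothetical, and joining any extra elements with ", " is my choice rather than something the notebook does. The three-digit test is the notebook's own heuristic for recognising trial IDs.

```python
import re


def split_funder_trials(items):
    """Return (funder, trials) from a scraped list of zero, one, or more strings."""
    if len(items) >= 2:
        # First element is the funder; the rest are trial IDs.
        return items[0], ", ".join(items[1:])
    if not items:
        # Nothing scraped: unknown funder, no trials.
        return "Unknown", ""
    only = items[0]
    # Heuristic from the notebook: three consecutive digits suggest a trial ID.
    if re.search(r"\d{3}", only):
        return "Unknown", only
    # Otherwise the single element is treated as a funder name.
    return only, None


print(split_funder_trials(["Unknown", "NCT04475588, NCT04605926"]))
print(split_funder_trials(["NCT04429529, NCT04649515"]))
print(split_funder_trials(["UK Government"]))
print(split_funder_trials([]))
```

The heuristic would misclassify a funder whose name contains three consecutive digits. None of the single-element funders listed in this notebook do.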

In [None]:
# before fixing
# here data points with only one element hold either a funder name or trial IDs

vaccines_df[~vaccines_df["Funder_Clinical_Trial_Covid"].str.contains("_")]["Funder_Clinical_Trial_Covid"].unique()

array(['Unknown',
       'NCT04399980, NCT04397497, NCT04447469, NCT04463004, EudraCT 2020-001795-15',
       'NCT04429529, NCT04649515', '', 'UK Government',
       'Biomedical Advanced Research and Development Authority (BARDA)',
       'North Dakota Bioscience Innovation Grant',
       'Department of Defense (DoD)',
       'Canadian Institutes for Health Research (CIHR)'], dtype=object)

In [None]:
# Adding "Unknown" as the funder where it is missing, so the split works properly
import re
# rows whose value has no underscore, i.e. only one element
single = ~vaccines_df["Funder_Clinical_Trial_Covid"].str.contains("_")
# a raw string keeps \d from being read as an invalid escape sequence
vaccines_df.loc[single, "Funder_Clinical_Trial_Covid"] = vaccines_df.loc[single, "Funder_Clinical_Trial_Covid"].apply(lambda x: "Unknown_" + x if re.search(r"\d{3}", x) or x == "" else x)

In [None]:
# after fixing
# now every data point with only one element holds a funder name only
vaccines_df[~vaccines_df["Funder_Clinical_Trial_Covid"].str.contains("_")]["Funder_Clinical_Trial_Covid"].unique()

array(['Unknown', 'UK Government',
       'Biomedical Advanced Research and Development Authority (BARDA)',
       'North Dakota Bioscience Innovation Grant',
       'Department of Defense (DoD)',
       'Canadian Institutes for Health Research (CIHR)'], dtype=object)

In [None]:
# Splitting funder and Clinical_Trial_Covid into separate columns
# n=1 splits at the first underscore only, so the result never has more than two columns
vaccines_df[["Funder","Clinical_Trial_Covid"]] = vaccines_df['Funder_Clinical_Trial_Covid'].str.split('_', n=1, expand=True)
vaccines_df = vaccines_df.drop(columns = "Funder_Clinical_Trial_Covid")  # dropping the Funder_Clinical_Trial_Covid column after the split
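
To see what the split produces for each kind of entry, here is a small sketch on a hand-made frame. The three sample strings mirror the shapes that appear in this notebook after the fix: funder plus trials, funder only, and "Unknown_" with nothing after it. The variable names `demo` and `parts` are assumptions for illustration. Passing `n=1` caps the split at the first underscore, so the result has at most two columns.

```python
import pandas as pd

demo = pd.DataFrame({
    "Funder_Clinical_Trial_Covid": [
        "Unknown_NCT04475588, NCT04605926",  # funder and trials
        "UK Government",                     # funder only, no underscore
        "Unknown_",                          # originally an empty entry
    ]
})

# Split at the first underscore only and expand into two columns.
parts = demo["Funder_Clinical_Trial_Covid"].str.split("_", n=1, expand=True)
demo[["Funder", "Clinical_Trial_Covid"]] = parts
demo = demo.drop(columns="Funder_Clinical_Trial_Covid")
print(demo)
```

Rows without an underscore get a missing value in `Clinical_Trial_Covid`, while the originally empty entry gets an empty string, so both cases may need handling in later analysis.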

# Final Data

In [None]:
vaccines_df.head()

Unnamed: 0,Developer,Product_Description,Phase,Funder,Clinical_Trial_Covid
0,Corvus Pharmaceuticals and Lewis Katz School...,CPI-006,clinical,Unknown,"NCT04734873, NCT04464395"
1,Alexion Pharmaceuticals. TACTIC-R trial,"Ultomiris (ravulizumab-cwvz), complement inhib...",clinical,Unknown,"NCT04369469, EudraCT 2020-001354-22 (TACTIC-R)..."
2,Assistance Publique - Hopitaux de Paris (Phase...,"Soliris (eculizumab), complement inhibitor",clinical,Unknown,"NCT04288713, NCT04346797, NCT04355494, EudraCT..."
3,AstraZeneca; ACCORD trial,"MEDI3506, monoclonal antibody targeting interl...",clinical,UK Government (ACCORD study),EudraCT 2020-001736-95 (ACCORD Trial)
4,BioCon/Equilium,"itolizumab, anti-CD6 IgG1 monoclonal antibody",clinical,Unknown,"NCT04475588, NCT04605926"
