# **WHO Web Scraping**
This notebook scrapes health-related data from the World Health Organization (WHO) website. The objective is to collect structured information on each health topic: its overview, its impact, and WHO's response. The data is then compiled into a table for easy access and analysis.

In [2]:
# import necessary libraries
import pandas as pd
import requests
from bs4 import BeautifulSoup

In [3]:
url = 'https://www.who.int/health-topics/'
url

'https://www.who.int/health-topics/'

In [4]:
# a timeout keeps the request from hanging indefinitely
res = requests.get(url, timeout=30)
res.status_code

200
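
A status code of 200 confirms that this single request worked, but the notebook later requests every topic page in turn. As a hedged sketch (not part of the original notebook), a shared `requests.Session` with a retry policy could make those requests more robust. The `make_session` name, the retry count, the back-off factor, and the list of status codes are illustrative assumptions.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_session(total_retries=3, backoff_factor=0.5):
    """Return a Session that retries transient failures (hypothetical helper)."""
    retry = Retry(
        total=total_retries,
        backoff_factor=backoff_factor,
        # Assumed set of "worth retrying" responses; adjust as needed.
        status_forcelist=(429, 500, 502, 503, 504),
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session


session = make_session()

# Usage (needs network access): pass a timeout and fail loudly on 4xx/5xx.
# res = session.get('https://www.who.int/health-topics/', timeout=30)
# res.raise_for_status()
```

Reusing one session also keeps the connection open between requests, which tends to help when many pages are fetched from the same host.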

In [5]:
# parse the response text as HTML using BeautifulSoup
soup = BeautifulSoup(res.text, 'html.parser')

### ***Scrape the health topics***
Topic names are in `<p class="heading text-underline">` tags.

In [40]:
# get the health topics
topic_span = soup.find_all('p', class_="heading text-underline")
topic_span

[<p class="heading text-underline">Abortion</p>,
 <p class="heading text-underline">Addictive behaviour</p>,
 <p class="heading text-underline">Adolescent health</p>,
 <p class="heading text-underline">Ageing</p>,
 <p class="heading text-underline">Ageism</p>,
 <p class="heading text-underline">Air pollution</p>,
 <p class="heading text-underline">Alcohol</p>,
 <p class="heading text-underline">Anaemia</p>,
 <p class="heading text-underline">Antimicrobial resistance</p>,
 <p class="heading text-underline">Assistive technology</p>,
 <p class="heading text-underline">Biological weapons</p>,
 <p class="heading text-underline">Biologicals</p>,
 <p class="heading text-underline">Blood products</p>,
 <p class="heading text-underline">Blood transfusion safety</p>,
 <p class="heading text-underline">Brain health</p>,
 <p class="heading text-underline">Breastfeeding</p>,
 <p class="heading text-underline">Buruli ulcer  (Mycobacterium ulcerans infection)</p>,
 <p class="heading text-underline">C

In [42]:
# extract the topic names as plain text (without the surrounding tags)
topics = []

for topic in topic_span:
    text = topic.get_text()
    topics.append(text)
topics

['Abortion',
 'Addictive behaviour',
 'Adolescent health',
 'Ageing',
 'Ageism',
 'Air pollution',
 'Alcohol',
 'Anaemia',
 'Antimicrobial resistance',
 'Assistive technology',
 'Biological weapons',
 'Biologicals',
 'Blood products',
 'Blood transfusion safety',
 'Brain health',
 'Breastfeeding',
 'Buruli ulcer  (Mycobacterium ulcerans infection)',
 'Cancer',
 'Cardiovascular diseases',
 'Cervical cancer',
 'Chagas disease (American trypanosomiasis)',
 'Chemical incidents',
 'Chemical safety',
 'Chikungunya',
 'Child growth',
 'Child health',
 "Children's environmental health",
 'Cholera',
 'Chronic respiratory diseases',
 'Climate change',
 'Clinical trials',
 'Commercial determinants of health',
 'Common goods for health',
 'Complementary feeding',
 'Congenital disorders',
 'Contraception',
 'Coronavirus disease (COVID-19)',
 'Crimean-Congo haemorrhagic fever',
 'Deafness and hearing loss',
 'Deliberate events',
 'Dementia',
 'Dengue and severe dengue',
 'Depression',
 'Diabetes',
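
The list above keeps whatever spacing the page uses, so `'Buruli ulcer  (Mycobacterium ulcerans infection)'` carries a double space. A minimal sketch of one way to tidy this, assuming collapsed whitespace is what is wanted; `normalize_whitespace` is a hypothetical helper, not something the notebook defines.

```python
def normalize_whitespace(text):
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return " ".join(text.split())


# Two names taken from the scraped output above.
raw_topics = ['Ageing', 'Buruli ulcer  (Mycobacterium ulcerans infection)']
clean_topics = [normalize_whitespace(t) for t in raw_topics]
print(clean_topics)
```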
 

### ***Scrape links to health topics***
Links are in `<a class="link-container table">` tags.

In [44]:
# get the links to the health topics
link_span = soup.find_all('a', class_="link-container table")
link_span

[<a aria-label="Abortion" class="link-container table" href="https://www.who.int/health-topics/abortion" role="link">
 <div class="table-cell info">
 <div class="date">
 <span class="timestamp">Health interventions</span>
 </div>
 <p class="heading text-underline">Abortion</p>
 </div>
 </a>,
 <a aria-label="Addictive behaviour" class="link-container table" href="https://www.who.int/health-topics/addictive-behaviour" role="link">
 <div class="table-cell info">
 <div class="date">
 <span class="timestamp">Human behaviour</span>
 </div>
 <p class="heading text-underline">Addictive behaviour</p>
 </div>
 </a>,
 <a aria-label="Adolescent health" class="link-container table" href="https://www.who.int/health-topics/adolescent-health" role="link">
 <div class="table-cell info">
 <div class="date">
 <span class="timestamp">Populations and demographics</span>
 </div>
 <p class="heading text-underline">Adolescent health</p>
 </div>
 </a>,
 <a aria-label="Ageing" class="link-container table" href=

In [46]:
# get the link URLs from the anchor tags
links = []

for link in link_span:
    href = link.get('href')  # separate name so the listing-page `url` is not overwritten
    if href:
        links.append(href)
links

['https://www.who.int/health-topics/abortion',
 'https://www.who.int/health-topics/addictive-behaviour',
 'https://www.who.int/health-topics/adolescent-health',
 'https://www.who.int/health-topics/ageing',
 'https://www.who.int/health-topics/ageism',
 'https://www.who.int/health-topics/air-pollution',
 'https://www.who.int/health-topics/alcohol',
 'https://www.who.int/health-topics/anaemia',
 'https://www.who.int/health-topics/antimicrobial-resistance',
 'https://www.who.int/health-topics/assistive-technology',
 'https://www.who.int/health-topics/biological-weapons',
 'https://www.who.int/health-topics/biologicals',
 'https://www.who.int/health-topics/blood-products',
 'https://www.who.int/health-topics/blood-transfusion-safety',
 'https://www.who.int/health-topics/brain-health',
 'https://www.who.int/health-topics/breastfeeding',
 'https://www.who.int/health-topics/buruli-ulcer',
 'https://www.who.int/health-topics/cancer',
 'https://www.who.int/health-topics/cardiovascular-diseases',
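
The topic names and the links are collected by two separate `find_all` calls and matched up later by position. If any anchor lacked an `href` or a heading, the two lists would drift out of step. One possible alternative, sketched here on a small HTML sample copied from the output above, is to read the name and the link from the same `<a>` element. The `sample_html` and `rows` names are illustrative.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the listing-page output shown earlier.
sample_html = """
<a aria-label="Abortion" class="link-container table" href="https://www.who.int/health-topics/abortion" role="link">
 <div class="table-cell info">
  <div class="date"><span class="timestamp">Health interventions</span></div>
  <p class="heading text-underline">Abortion</p>
 </div>
</a>
<a aria-label="Addictive behaviour" class="link-container table" href="https://www.who.int/health-topics/addictive-behaviour" role="link">
 <div class="table-cell info">
  <div class="date"><span class="timestamp">Human behaviour</span></div>
  <p class="heading text-underline">Addictive behaviour</p>
 </div>
</a>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

rows = []
for anchor in sample_soup.find_all("a", class_="link-container table"):
    heading = anchor.find("p", class_="heading text-underline")
    href = anchor.get("href")
    # Skip incomplete entries as a pair, so names and links stay aligned.
    if heading is None or not href:
        continue
    rows.append({"topic": heading.get_text(strip=True), "link": href})

print(rows)
```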

### ***Scrape Overview, Impact, WHO Response***
Section contents are in `<div class="sf_colsOut tabContent">` tags.

In [48]:
# get the Overview, Impacts and WHO Response text for each topic

topic_contents = []

# fetch and parse each topic page in turn
for topic_url in links:
    topic_res = requests.get(topic_url, timeout=30)  # avoid hanging on a stalled request
    topic_soup = BeautifulSoup(topic_res.text, 'html.parser')

    contents_span = topic_soup.find_all('div', class_="sf_colsOut tabContent")

    content = {
        "Overview": "N/A",
        "Impacts": "N/A",
        "WHO Response": "N/A"
    }

    # Map the content divs to sections by position (1st = Overview,
    # 2nd = Impacts, 3rd = WHO Response); missing sections stay "N/A"
    if len(contents_span) > 0:
        content["Overview"] = contents_span[0].get_text(strip=True, separator="\n")
    if len(contents_span) > 1:
        content["Impacts"] = contents_span[1].get_text(strip=True, separator="\n")
    if len(contents_span) > 2:
        content["WHO Response"] = contents_span[2].get_text(strip=True, separator="\n")
    topic_contents.append(content)
topic_contents

[{'Overview': 'WHO defines health as a state of complete physical, mental and social well-being, and not merely the absence of disease or infirmity. Making health for all a reality, and moving towards the progressive realization of human rights, requires that all individuals have access to quality health care, including comprehensive abortion care services – which includes information, management of abortion, and post-abortion care. Lack of access to safe, timely, affordable and respectful abortion care poses a risk to not only the physical, but also the mental and social, well-being of women and girls.\nInduced abortion is a simple and common health-care procedure. Each year, almost half of all pregnancies – 121 million – are unintended; 6 out of 10 unintended pregnancies and 3 out of 10 of all pregnancies end in induced abortion. Abortion is safe when carried out using a method recommended by WHO, appropriate to the pregnancy duration and by someone with the necessary skills. However
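
The loop above assigns sections by position and leaves `"N/A"` where a page has fewer than three content blocks. The same rule can be written as a small function, which makes it easy to check without requesting any pages. This is a sketch only; `map_sections` and `SECTIONS` are hypothetical names, and it assumes (as the notebook does) that the blocks always appear in the order Overview, Impacts, WHO Response.

```python
SECTIONS = ("Overview", "Impacts", "WHO Response")


def map_sections(texts, sections=SECTIONS, missing="N/A"):
    """Pair section names with texts by position, padding short lists."""
    kept = list(texts[:len(sections)])                   # ignore any extra blocks
    padded = kept + [missing] * (len(sections) - len(kept))
    return dict(zip(sections, padded))


# A page that only exposes two content blocks.
print(map_sections(["overview text", "impact text"]))
```

Inside the scraping loop this could be called as `map_sections([d.get_text(strip=True, separator="\n") for d in contents_span])`.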

### ***Store the scraped data in a DataFrame***

In [49]:
WHO_data = pd.DataFrame()

WHO_data['Health_Topics'] = topics
WHO_data['Links'] = links
WHO_data['Overview'] = [content["Overview"] for content in topic_contents]
WHO_data['Impacts'] = [content["Impacts"] for content in topic_contents]
WHO_data['WHO_Response'] = [content["WHO Response"] for content in topic_contents]

WHO_data.head()

| | Health_Topics | Links | Overview | Impacts | WHO_Response |
|---|---|---|---|---|---|
| 0 | Abortion | https://www.who.int/health-topics/abortion | WHO defines health as a state of complete phys... | Restricting access to abortion does not reduce... | Abortion can be safely and effectively perform... |
| 1 | Addictive behaviour | https://www.who.int/health-topics/addictive-be... | Many people around the world are engaged in (v... | Use of the Internet, computers, smartphones an... | Disorders due to addictive behaviours are reco... |
| 2 | Adolescent health | https://www.who.int/health-topics/adolescent-h... | Adolescence is the phase of life between child... | There are more adolescents in the world than e... | WHO supports countries to ensure that their na... |
| 3 | Ageing | https://www.who.int/health-topics/ageing | Every person – in every country in the world –... | Ageing presents both challenges and opportunit... | WHO works with Member States, UN agencies and ... |
| 4 | Ageism | https://www.who.int/health-topics/ageism | Age is one of the first things we notice about... | Ageism has far-reaching impacts on all aspects... | WHO has been requested by its 194 Member State... |
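
Assigning the columns one by one relies on `topics`, `links`, and `topic_contents` all having the same length. A hedged sketch of a variant that checks this first, then counts how many rows still hold the `"N/A"` placeholder in each section. `build_frame` is a hypothetical helper, and the sample section texts are placeholders rather than scraped data.

```python
import pandas as pd


def build_frame(topics, links, contents):
    """Build the table, refusing inputs whose lengths disagree."""
    if not (len(topics) == len(links) == len(contents)):
        raise ValueError(
            f"length mismatch: {len(topics)} topics, "
            f"{len(links)} links, {len(contents)} content records"
        )
    return pd.DataFrame({
        "Health_Topics": topics,
        "Links": links,
        "Overview": [c["Overview"] for c in contents],
        "Impacts": [c["Impacts"] for c in contents],
        "WHO_Response": [c["WHO Response"] for c in contents],
    })


# Placeholder inputs for illustration only.
sample_topics = ["Abortion", "Ageing"]
sample_links = [
    "https://www.who.int/health-topics/abortion",
    "https://www.who.int/health-topics/ageing",
]
sample_contents = [
    {"Overview": "placeholder", "Impacts": "placeholder", "WHO Response": "placeholder"},
    {"Overview": "placeholder", "Impacts": "N/A", "WHO Response": "N/A"},
]

frame = build_frame(sample_topics, sample_links, sample_contents)

# Number of rows per section that were never filled in.
section_columns = ["Overview", "Impacts", "WHO_Response"]
missing_counts = (frame[section_columns] == "N/A").sum()
print(missing_counts)
```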


In [52]:
# exporting the dataset
WHO_data.to_csv('WHO_health_topics.csv', index=False)
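
One detail to keep in mind when loading the file again: by default `pd.read_csv` treats the string `N/A` as a missing value, so the placeholders written above come back as `NaN`. The sketch below shows both behaviours using an in-memory buffer instead of the real file; the small frame is illustrative, not the scraped data.

```python
import io

import pandas as pd

# Tiny stand-in for the exported table.
frame = pd.DataFrame({"Health_Topics": ["Ageing"], "Impacts": ["N/A"]})

buffer = io.StringIO()
frame.to_csv(buffer, index=False)

# Default parsing converts "N/A" into a missing value.
buffer.seek(0)
default_read = pd.read_csv(buffer)

# keep_default_na=False preserves the literal string.
buffer.seek(0)
literal_read = pd.read_csv(buffer, keep_default_na=False)

print(default_read["Impacts"].isna().tolist())
print(literal_read["Impacts"].tolist())
```

Either behaviour can be useful: missing values work with `isna()` and `dropna()`, while the literal string keeps the file identical on a round trip.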