
# Scraping WHO Health Site

This project is one of the activities in the Axia Africa Data Science (Cohort 8) learning program I'm currently enrolled in.

Site: [WHO Health Topics](https://www.who.int/health-topics/#A)

### Objectives
Scrape the following from the site:
1. Health topics
2. Link to each topic
3. Overview of each topic
4. Impact of each topic
5. WHO response to each topic

Then store the scraped data in a dataframe and export the dataframe to CSV, TSV and Excel files.

### Tools Used


* Python
* Pandas Library
* Requests Library
* BeautifulSoup Library

In [None]:
# importing necessary libraries
import pandas as pd
from bs4 import BeautifulSoup
import numpy as np
import requests


In [None]:
# Store the URL of the WHO health topics page, send a GET request and check the response status code (200 means success).
url = 'https://www.who.int/health-topics/#A'
response= requests.get(url)
response.status_code


200
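Checking `status_code` by eye works for a single request, but it is easy to forget once many pages are fetched in a loop. The sketch below wraps the request in a small helper that fails loudly on HTTP errors and sets a timeout. The name `fetch_html`, the optional `session` parameter and the 10-second default are assumptions for illustration, not part of the original notebook.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the body of `url` as text, raising on HTTP error responses.

    `session` is optional: pass a requests.Session to reuse connections,
    or leave it as None to fall back to the module-level requests.get.
    """
    client = session if session is not None else requests
    response = client.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text


# Example usage (performs a real network request, so it is left commented out):
# html = fetch_html('https://www.who.int/health-topics')
```

Without a `timeout`, `requests` can wait indefinitely on a stalled connection, which matters when the same call is repeated for every topic page.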

In [None]:
# To parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')

## Scrape Data

To do this, I will:
* Find the elements for each item in the objectives using their HTML tag and class name.
* Store the scraped data in a list.
* Convert the list to a dataframe.
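The three steps can be tried offline first. The sketch below runs them against a small hand-written snippet that only imitates the listing markup (the `sample_html` string is an assumption, not a copy of the live page) and skips any anchor that lacks the heading tag, since `find` returns `None` when nothing matches.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the topic listing; the third anchor has no heading.
sample_html = """
<a class="link-container table" href="https://www.who.int/health-topics/abortion">
  <p class="heading text-underline">Abortion</p>
</a>
<a class="link-container table" href="https://www.who.int/health-topics/ageing">
  <p class="heading text-underline">Ageing</p>
</a>
<a class="link-container table" href="https://www.who.int/health-topics/unknown"></a>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Step 1: find the elements by tag and class name
rows = []
for anchor in sample_soup.find_all("a", class_="link-container table"):
    heading = anchor.find("p", class_="heading text-underline")
    if heading is None:
        continue  # skip anchors without a topic heading instead of crashing
    # Step 2: store the scraped data in a list
    rows.append({
        "Topic Name": heading.get_text(strip=True),
        "Topic Link": anchor["href"],
    })

# Step 3: convert the list to a dataframe
sample_df = pd.DataFrame(rows)
```

Here `sample_df` has one row per anchor that contains a heading, with the columns `Topic Name` and `Topic Link`.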

### ▶ Get Health Topics and Link to Each Topic

In [None]:
health_topic = soup.find_all('p', class_ = 'heading text-underline')

In [None]:
topic = []
# this for loop iterates over the scraped data, gets the text and stores it in a list
for health in health_topic:
  new_topic = health.get_text()
  topic.append(new_topic)



In [None]:
# locate the topic and the link

topic_nl = []

for tag in soup.find_all('a', class_= 'link-container table'):
  topic_name = tag.find('p', class_='heading text-underline').get_text()
  topic_link = tag['href']
  topic_nl.append({
      'Topic Name': topic_name,
      'Topic Link': topic_link
  }
  )

# to check that it works
topic_nl[0::5]

[{'Topic Name': 'Abortion',
  'Topic Link': 'https://www.who.int/health-topics/abortion'},
 {'Topic Name': 'Ageism',
  'Topic Link': 'https://www.who.int/health-topics/ageism'},
 {'Topic Name': 'Assistive technology',
  'Topic Link': 'https://www.who.int/health-topics/assistive-technology'},
 {'Topic Name': 'Brain health',
  'Topic Link': 'https://www.who.int/health-topics/brain-health'},
 {'Topic Name': 'Cervical cancer',
  'Topic Link': 'https://www.who.int/health-topics/cervical-cancer'},
 {'Topic Name': 'Child growth',
  'Topic Link': 'https://www.who.int/health-topics/child-growth'},
 {'Topic Name': 'Chronic respiratory diseases',
  'Topic Link': 'https://www.who.int/health-topics/chronic-respiratory-diseases'},
 {'Topic Name': 'Complementary feeding',
  'Topic Link': 'https://www.who.int/health-topics/complementary-feeding'},
 {'Topic Name': 'Deafness and hearing loss',
  'Topic Link': 'https://www.who.int/health-topics/hearing-loss'},
 {'Topic Name': 'Diabetes',
  'Topic Link': 

In [None]:
# Convert to a dataframe
data = pd.DataFrame(topic_nl)


### ▶ Get Overview, Impact and WHO Response

These are obtained by:
1. Using a for loop to iterate over the topic_nl list
2. Sending a GET request to the link stored for each topic to fetch its page
3. Parsing the returned HTML content
4. Splitting the content into overview, impact and WHO response
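Steps 3 and 4 can be sketched as one function that takes a page's HTML and returns the three sections. It assumes, as the rest of this notebook does, that the `sf_colsOut tabContent` blocks appear in the order overview, impact, WHO response. The function name `split_sections`, the `SECTION_LABELS` tuple and the `'N/A'` placeholder are illustrative choices.

```python
from bs4 import BeautifulSoup

SECTION_LABELS = ("Overview", "Impact", "WHO Response")


def split_sections(html, missing="N/A"):
    """Map the tab-content blocks of a topic page onto the three section labels.

    Assumes the blocks occur in the same order as SECTION_LABELS; labels
    without a matching block keep the `missing` placeholder.
    """
    page = BeautifulSoup(html, "html.parser")
    blocks = page.find_all("div", class_="sf_colsOut tabContent")
    sections = dict.fromkeys(SECTION_LABELS, missing)
    # zip stops at the shorter sequence, so extra or missing blocks are handled
    for label, block in zip(SECTION_LABELS, blocks):
        sections[label] = block.get_text(strip=True)
    return sections


# Hypothetical page with only two content blocks
demo_html = (
    '<div class="sf_colsOut tabContent"><div class="sf_colsIn"><p>About.</p></div></div>'
    '<div class="sf_colsOut tabContent"><div class="sf_colsIn"><p>Effects.</p></div></div>'
)
demo_sections = split_sections(demo_html)
```

Pairing labels with blocks through `zip` avoids indexing into the list directly, so a page with fewer than three blocks does not raise an `IndexError`.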


In [None]:
# Initial attempt
url = 'https://www.who.int/health-topics/tetanus'
response = requests.get(url)
response.status_code

200

In [None]:
soup = BeautifulSoup(response.text, 'html.parser')

In [None]:
# Reference markup for one content tab (this sample text is from the Abortion topic page):
# <div class="sf_colsOut tabContent" data-sf-element="Tab contents" data-placeholder-label="Tab contents">
#                 <div class="sf_colsIn">
#                     <p>WHO defines health as a state of complete physical, mental and social well-being, and not merely the absence of disease or infirmity. Making health for all a reality, and moving towards the progressive realization of human rights, requires that all individuals have access to quality health care, including comprehensive abortion care services – which includes information, management of abortion, and post-abortion care. Lack of access to safe, timely, affordable and respectful abortion care poses a risk to not only the physical, but also the mental and social, well-being of women and girls.</p><p>Induced abortion is a simple and common health-care procedure. Each year, almost half of all pregnancies – 121 million – are unintended; 6 out of 10 unintended pregnancies and 3 out of 10 of all pregnancies end in induced abortion. Abortion is safe when carried out using a method recommended by WHO, appropriate to the pregnancy duration and by someone with the necessary skills. However, when women with unwanted pregnancies face barriers to obtaining quality abortion, they often resort to unsafe abortion.</p><p>Ensuring that women and girls have access to abortion care that is evidence-based – which includes being safe, respectful and non-discriminatory – is fundamental to meeting the Sustainable Development Goals (SDGs) relating to good health and well-being (SDG3) and gender equality (SDG5).</p>
#                 </div>
#             </div>

In [None]:
content = soup.find('div', class_='sf_colsOut tabContent')
content

<div class="sf_colsOut tabContent" data-placeholder-label="Tab contents" data-sf-element="Tab contents">
<div class="sf_colsIn">
<p>Tetanus is a serious illness contracted through exposure to the spores of the bacterium, <em>Clostridium tetani</em>, which live in soil, saliva, dust and manure. The bacteria can enter the body through a deep cuts, wounds or burns affecting the nervous system. The infection leads to painful muscle contractions, particularly of the jaw and neck muscle, and is commonly known as “lockjaw”.<span style="background-color:transparent;text-align:inherit;text-transform:inherit;white-space:inherit;word-spacing:normal;caret-color:auto;"> </span></p><p>People of all ages can get tetanus but the disease is particularly common and serious in newborn babies and their mothers when the mother is unprotected from tetanus by the vaccine, tetanus toxoid. Tetanus occurring during pregnancy or within 6 weeks of the end of pregnancy is called maternal tetanus, while tetanus occ

In [None]:
content = content.get_text(strip = True)
content

'Tetanus is a serious illness contracted through exposure to the spores of the bacterium,Clostridium tetani, which live in soil, saliva, dust and manure. The bacteria can enter the body through a deep cuts, wounds or burns affecting the nervous system. The infection leads to painful muscle contractions, particularly of the jaw and neck muscle, and is commonly known as “lockjaw”.People of all ages can get tetanus but the disease is particularly common and serious in newborn babies and their mothers when the mother is unprotected from tetanus by the vaccine, tetanus toxoid. Tetanus occurring during pregnancy or within 6 weeks of the end of pregnancy is called maternal tetanus, while tetanus occurring within the first 28 days of life is called neonatal tetanus.The disease remains an important public health problem in many parts of the world, but especially in low-income countries or districts, where immunization coverage is low and unclean birth practices are common. WHO estimates that in
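In the text above some words run together ("bacterium,Clostridium" and "“lockjaw”.People"). This happens because `get_text(strip=True)` strips every text fragment and then joins the fragments with an empty string. One possible workaround, sketched below on a hand-written snippet, is to read each `<p>` separately and normalise its whitespace. The helper name `readable_text` is hypothetical, and the approach assumes the content of interest lives in `<p>` tags, so text held in other tags such as list items would be dropped.

```python
from bs4 import BeautifulSoup


def readable_text(container):
    """Join the paragraphs of `container` with single spaces between them."""
    paragraphs = []
    for p in container.find_all("p"):
        # split() with no arguments collapses any run of whitespace
        text = " ".join(p.get_text().split())
        if text:
            paragraphs.append(text)
    return " ".join(paragraphs)


# Hypothetical snippet shaped like the content block above
sample = BeautifulSoup(
    "<div><p>the bacterium, <em>Clostridium tetani</em>, which live in soil.</p>"
    "<p>People of all ages can get tetanus.</p></div>",
    "html.parser",
)

joined = sample.div.get_text(strip=True)  # fragments glued together
spaced = readable_text(sample.div)        # spacing preserved
```

Comparing `joined` and `spaced` shows the difference: the first loses the space before the italicised name and between the two paragraphs, while the second keeps both.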

In [None]:
# This fetches the overview, the impact and WHO response in 3 divs
contents = soup.find_all('div', class_='sf_colsOut tabContent')
contents

[<div class="sf_colsOut tabContent" data-placeholder-label="Tab contents" data-sf-element="Tab contents">
 <div class="sf_colsIn">
 <p>Tetanus is a serious illness contracted through exposure to the spores of the bacterium, <em>Clostridium tetani</em>, which live in soil, saliva, dust and manure. The bacteria can enter the body through a deep cuts, wounds or burns affecting the nervous system. The infection leads to painful muscle contractions, particularly of the jaw and neck muscle, and is commonly known as “lockjaw”.<span style="background-color:transparent;text-align:inherit;text-transform:inherit;white-space:inherit;word-spacing:normal;caret-color:auto;"> </span></p><p>People of all ages can get tetanus but the disease is particularly common and serious in newborn babies and their mothers when the mother is unprotected from tetanus by the vaccine, tetanus toxoid. Tetanus occurring during pregnancy or within 6 weeks of the end of pregnancy is called maternal tetanus, while tetanus 

In [None]:
# to extract only the text content and append it to a list
full_content = []

for content in contents:
  clean_text = content.get_text(strip=True)
  full_content.append(clean_text)


In [None]:
# label the three sections
headings = {
    'Overview': full_content[0],
    'Impact': full_content[1],
    'WHO Response': full_content[2]
}


## Refactor and Combine Everything

In [None]:
url = 'https://www.who.int/health-topics'

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# locate the topic and the link
topic_nl = []

for tag in soup.find_all('a', class_= 'link-container table'):
  topic_name = tag.find('p', class_='heading text-underline').get_text()
  topic_link = tag['href']
  topic_nl.append({
      'Name': topic_name,
      'Link': topic_link
  })

# collect data for each topic and link
data =[]

for topic in topic_nl:
  print(f"Scraping content for:{topic['Name']}")

  # fetch the topic page through the link
  link = topic['Link']
  topic_response = requests.get(link)
  soup = BeautifulSoup(topic_response.text, 'html.parser')

  # locate the tags for the content (overview, impact, and WHO response)
  contents = soup.find_all('div',  class_='sf_colsOut tabContent')
  full_content = []

  for content in contents:
      clean_text = content.get_text(strip=True)
      full_content.append(clean_text)


  # default every section to 'N/A' in case a page has fewer than three content blocks
  headings = {
    'Overview': 'N/A',
    'Impact': 'N/A',
    'WHO Response': 'N/A'
  }

  # organize the contents into the respective headings
  if len(full_content) > 0:
    headings['Overview'] = full_content[0]
  if len(full_content) > 1:
    headings['Impact'] = full_content[1]
  if len(full_content) > 2:
    headings['WHO Response'] = full_content[2]

  # append one row per topic, even when some sections are missing
  data.append({
    "Health Topics": topic['Name'],
    "Links": topic['Link'],
    "Overview": headings['Overview'],
    "Impacts": headings['Impact'],
    "WHO Response": headings['WHO Response']
  })

# convert to a dataframe
df = pd.DataFrame(data)


Scraping content for:Abortion
Scraping content for:Abuse of older people
Scraping content for:Addictive behaviour
Scraping content for:Adolescent health
Scraping content for:Ageing
Scraping content for:Ageism
Scraping content for:Air pollution
Scraping content for:Alcohol
Scraping content for:Anaemia
Scraping content for:Antimicrobial resistance
Scraping content for:Assistive technology
Scraping content for:Biological weapons
Scraping content for:Biologicals
Scraping content for:Blood products
Scraping content for:Blood transfusion safety
Scraping content for:Brain health
Scraping content for:Breastfeeding
Scraping content for:Buruli ulcer  (Mycobacterium ulcerans infection)
Scraping content for:Cancer
Scraping content for:Cardiovascular diseases
Scraping content for:Cervical cancer
Scraping content for:Chagas disease (American trypanosomiasis)
Scraping content for:Chemical incidents
Scraping content for:Chemical safety
Scraping content for:Chikungunya
Scraping content for:Child growth

In [None]:
df.head()

| | Health Topics | Links | Overview | Impacts | WHO Response |
|---|---|---|---|---|---|
| 0 | Abortion | https://www.who.int/health-topics/abortion | WHO defines health as a state of complete phys... | Restricting access to abortion does not reduce... | Abortion can be safely and effectively perform... |
| 1 | Abuse of older people | https://www.who.int/health-topics/abuse-of-old... | The abuse of older people, also known as elder... | Many strategies have been implemented to preve... | On 15 June 2022, World Elder Abuse Awareness D... |
| 2 | Addictive behaviour | https://www.who.int/health-topics/addictive-be... | Many people around the world are engaged in (v... | Use of the Internet, computers, smartphones an... | Disorders due to addictive behaviours are reco... |
| 3 | Adolescent health | https://www.who.int/health-topics/adolescent-h... | Adolescence is the phase of life between child... | There are more adolescents in the world than e... | WHO supports countries to ensure that their na... |
| 4 | Ageing | https://www.who.int/health-topics/ageing | Every person – in every country in the world –... | Ageing presents both challenges and opportunit... | WHO works with Member States, UN agencies and ... |


In [None]:
# Export to CSV, TSV, and Excel
df.to_csv('WHO Health Topics Webscrape.csv', index = False)
df.to_csv('WHO Health Topics Webscrape.tsv', sep='\t', index=False)
# writing .xlsx files requires an Excel engine such as openpyxl to be installed
df.to_excel('WHO Health Topics Webscrape.xlsx', index=False)
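Whether the row index is written affects what comes back when a file is read again. The sketch below uses an in-memory buffer and a small made-up dataframe (`sample_df` is illustrative, not the scraped data) to show that a TSV written with `index=False` round-trips cleanly, while one written with `index=True` gains an extra `Unnamed: 0` column on reload.

```python
import io

import pandas as pd

# Small stand-in dataframe; the real notebook exports the scraped `df`
sample_df = pd.DataFrame({
    "Health Topics": ["Abortion", "Ageing"],
    "Links": [
        "https://www.who.int/health-topics/abortion",
        "https://www.who.int/health-topics/ageing",
    ],
})

# to_csv returns the file contents as a string when no path is given
tsv_without_index = sample_df.to_csv(sep="\t", index=False)
tsv_with_index = sample_df.to_csv(sep="\t", index=True)

clean = pd.read_csv(io.StringIO(tsv_without_index), sep="\t")
with_extra_column = pd.read_csv(io.StringIO(tsv_with_index), sep="\t")

print(list(clean.columns))
print(list(with_extra_column.columns))
```

Keeping `index=False` on every export gives the three output files the same set of columns.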


#### Conclusion

In this notebook I scraped the WHO Health Topics site, collecting each health topic and its link. I then scraped each link to obtain the overview, impact and WHO response sections, stored the data in a dataframe and exported it to CSV, TSV and Excel files.
