
# Extracting data from nested HTML tags

## Import relevant packages

In [0]:
# Load the packages
import requests
from bs4 import BeautifulSoup

## Get Request

In [0]:
# Defining the url of the site
base_site = "https://en.wikipedia.org/wiki/Indus_Valley_Civilisation"

# Making a get request
response = requests.get(base_site)
response.status_code

200

In [0]:
# Extracting the HTML
html = response.content

# Check that the response is indeed HTML by inspecting the first 100 bytes
html[:100]

b'\n<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<titl'

## Initiating the soup

In [0]:
# Convert the HTML to a BeautifulSoup object, which makes it easier to parse content out of the HTML.
# Use "html.parser", the parser that ships with Python's standard library

soup = BeautifulSoup(html, "html.parser")

## Exporting the HTML to a file

In [0]:
# Exporting the HTML to a file
with open('Wiki_response.html', 'wb') as file:
    file.write(soup.prettify('utf-8'))

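- The 'with' statement is shorthand for a 'try-finally' block: the file is closed automatically when the block ends, even if an error occurs
- open() is the function for opening a file; in write mode it creates the file if it does not exist and overwrites it if it does
- The 'wb' argument sets the mode in which the file is edited: writing ('w') in bytes ('b') format
- .prettify() re-indents the HTML for better readability; passing 'utf-8' makes it return bytes, which matches the 'wb' mode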
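
To make the first bullet concrete, here is a minimal sketch of what the 'with' statement does behind the scenes. The file name `demo.html` and the sample bytes are placeholders chosen for illustration and are not part of the notebook; the file is written to a temporary directory so nothing is left behind.

```python
import os
import tempfile

sample = b"<html><body><p>Hello</p></body></html>"  # placeholder bytes

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.html")  # hypothetical file name

    # Version 1: the 'with' statement closes the file for us
    with open(path, "wb") as file:
        file.write(sample)
    closed_after_with = file.closed

    # Version 2: a roughly equivalent try/finally block
    file = open(path, "wb")
    try:
        file.write(sample)
    finally:
        file.close()  # runs even if write() raises an exception
    closed_after_finally = file.closed

    # Read the bytes back to confirm what ended up on disk
    with open(path, "rb") as file:
        written = file.read()

print(closed_after_with, closed_after_finally, written == sample)
```

Both versions leave the file closed; the 'with' form is simply shorter and harder to get wrong.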

## Practical Examples

In [0]:
# Let's use the variable links, which holds all the 'a' tags on this page
# It is defined here again so that this notebook runs on its own
links = soup.find_all('a')
links

[<a id="top"></a>,
 <a href="/wiki/Wikipedia:Protection_policy#semi" title="This article is semi-protected until June 15, 2022 at 10:46 UTC."><img alt="Page semi-protected" data-file-height="512" data-file-width="512" decoding="async" height="20" src="//upload.wikimedia.org/wikipedia/en/thumb/1/1b/Semi-protection-shackle.svg/20px-Semi-protection-shackle.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/1/1b/Semi-protection-shackle.svg/30px-Semi-protection-shackle.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/1/1b/Semi-protection-shackle.svg/40px-Semi-protection-shackle.svg.png 2x" width="20"/></a>,
 <a class="mw-jump-link" href="#mw-head">Jump to navigation</a>,
 <a class="mw-jump-link" href="#p-search">Jump to search</a>,
 <a class="image" href="/wiki/File:Indus_Valley_Civilization,_Mature_Phase_(2600-1900_BCE).png" title="IVC major sites"><img alt="IVC major sites" data-file-height="890" data-file-width="1200" decoding="async" height="163" src="//upload.wikimedia.o

In [0]:
# Let's choose one link to manipulate
link = links[26]
link

<a href="/wiki/South_Asia" title="South Asia">South Asia</a>

In [0]:
# Get the link's text
link.string

'South Asia'
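
One caveat worth illustrating: `.string` only returns text when a tag has exactly one child. The sketch below uses a small hand-written HTML snippet (an assumption for illustration, modelled on the kinds of links listed above) to compare `.string` with `.get_text()`.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: a plain link, an empty anchor, and a link with nested markup
snippet = """
<a href="/wiki/South_Asia">South Asia</a>
<a id="top"></a>
<a href="/wiki/Harappa"><b>Harappa</b> site</a>
"""

demo_soup = BeautifulSoup(snippet, "html.parser")
plain, empty, nested = demo_soup.find_all("a")

# .string is None unless the tag has exactly one child
strings = [plain.string, empty.string, nested.string]

# .get_text() concatenates all text inside the tag, however deeply nested
texts = [a.get_text() for a in (plain, empty, nested)]

print(strings)
print(texts)
```

When processing many links at once, `.get_text()` is usually the safer choice because it never returns None.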

In [0]:
# Extract the link's URL
link['href']

'/wiki/South_Asia'

In [0]:
# This is a relative URL
# To obtain the absolute URL address we will use urljoin

from urllib.parse import urljoin

In [0]:
# Now we need the address of the current page + the relative URL to compute the full-path URL
base_site

'https://en.wikipedia.org/wiki/Indus_Valley_Civilisation'

In [0]:
relative_url = link['href']
relative_url

'/wiki/South_Asia'

In [0]:
full_url = urljoin(base_site, relative_url)
full_url

'https://en.wikipedia.org/wiki/South_Asia'
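
The way urljoin resolves a link depends on the form of the relative URL. The following sketch runs it on a few typical shapes; the image file name and the `www.example.org` address are placeholders, not links taken from the page.

```python
from urllib.parse import urljoin

base = "https://en.wikipedia.org/wiki/Indus_Valley_Civilisation"

examples = [
    "/wiki/South_Asia",              # root-relative path
    "#mw-head",                      # fragment on the current page
    "Harappa",                       # relative to the current "directory" (/wiki/)
    "//upload.wikimedia.org/a.png",  # scheme-relative (hypothetical file)
    "https://www.example.org/page",  # already absolute (placeholder domain)
]

resolved = [urljoin(base, href) for href in examples]

for href, url in zip(examples, resolved):
    print(f"{href!r:35} -> {url}")
```

Absolute URLs pass through unchanged, so it is safe to apply urljoin to every href without checking its form first.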

## Processing multiple links at once

In [0]:
links

[<a id="top"></a>,
 <a href="/wiki/Wikipedia:Protection_policy#semi" title="This article is semi-protected until June 15, 2022 at 10:46 UTC."><img alt="Page semi-protected" data-file-height="512" data-file-width="512" decoding="async" height="20" src="//upload.wikimedia.org/wikipedia/en/thumb/1/1b/Semi-protection-shackle.svg/20px-Semi-protection-shackle.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/1/1b/Semi-protection-shackle.svg/30px-Semi-protection-shackle.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/1/1b/Semi-protection-shackle.svg/40px-Semi-protection-shackle.svg.png 2x" width="20"/></a>,
 <a class="mw-jump-link" href="#mw-head">Jump to navigation</a>,
 <a class="mw-jump-link" href="#p-search">Jump to search</a>,
 <a class="image" href="/wiki/File:Indus_Valley_Civilization,_Mature_Phase_(2600-1900_BCE).png" title="IVC major sites"><img alt="IVC major sites" data-file-height="890" data-file-width="1200" decoding="async" height="163" src="//upload.wikimedia.o

In [0]:
# Examining the links' addresses
[l.get('href') for l in links]

# Note: writing l['href'] instead of l.get('href') would raise a KeyError for any tag without an href attribute

[None,
 '/wiki/Wikipedia:Protection_policy#semi',
 '#mw-head',
 '#p-search',
 '/wiki/File:Indus_Valley_Civilization,_Mature_Phase_(2600-1900_BCE).png',
 '/wiki/South_Asia',
 '/wiki/Bronze_Age#South_Asia',
 '/wiki/Type_site',
 '/wiki/Harappa',
 '/wiki/Mehrgarh',
 '/wiki/Painted_Grey_Ware_culture',
 '/wiki/Cemetery_H_culture',
 '/wiki/File:Mohenjo-daro.jpg',
 '/wiki/File:Mohenjo-daro.jpg',
 '/wiki/Mohenjo-daro',
 '/wiki/Sindh',
 '/wiki/Pakistan',
 '/wiki/Great_Bath,_Mohenjo-daro',
 '/wiki/Indus_River',
 '/wiki/UNESCO_World_Heritage_Site',
 '/wiki/File:Harappan_small_figures.jpg',
 '/wiki/File:Harappan_small_figures.jpg',
 '/wiki/Zebu',
 '/wiki/Chicken',
 '/wiki/Bronze_Age',
 '/wiki/Civilisation',
 '/wiki/South_Asia',
 '#cite_note-FOOTNOTEWright20091-1',
 '#cite_note-2',
 '/wiki/Ancient_Egypt',
 '/wiki/Mesopotamia',
 '/wiki/Near_East',
 '/wiki/Afghanistan',
 '/wiki/Pakistan',
 '/wiki/India',
 '#cite_note-FOOTNOTEWright2009-3',
 '#cite_note-4',
 '/wiki/Indus_River',
 '/wiki/Ghaggar-Hakra_R

In [0]:
# Notice that some links have no URL (None appears)

# Dropping the links without an href attribute
clean_links = [l for l in links if l.get('href') is not None]
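
The difference between square-bracket access and `.get()` can be checked on a tiny example. The two-link snippet below is made up for illustration. The sketch also shows `find_all('a', href=True)`, an alternative that skips tags without an href at search time.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: one anchor without href, one with
snippet = '<a id="top"></a><a href="#p-search">Jump to search</a>'
demo_soup = BeautifulSoup(snippet, "html.parser")
demo_links = demo_soup.find_all("a")

# Square brackets raise KeyError when the attribute is missing
try:
    demo_links[0]["href"]
    bracket_error = None
except KeyError as err:
    bracket_error = type(err).__name__

# .get() returns None instead of raising
get_result = demo_links[0].get("href")

# href=True keeps only the tags that carry an href attribute
with_href = demo_soup.find_all("a", href=True)

print(bracket_error, get_result, with_href)
```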

In [0]:
# Obtaining the relative URLs
relative_urls = [link.get('href') for link in clean_links]
relative_urls

['/wiki/Wikipedia:Protection_policy#semi',
 '#mw-head',
 '#p-search',
 '/wiki/File:Indus_Valley_Civilization,_Mature_Phase_(2600-1900_BCE).png',
 '/wiki/South_Asia',
 '/wiki/Bronze_Age#South_Asia',
 '/wiki/Type_site',
 '/wiki/Harappa',
 '/wiki/Mehrgarh',
 '/wiki/Painted_Grey_Ware_culture',
 '/wiki/Cemetery_H_culture',
 '/wiki/File:Mohenjo-daro.jpg',
 '/wiki/File:Mohenjo-daro.jpg',
 '/wiki/Mohenjo-daro',
 '/wiki/Sindh',
 '/wiki/Pakistan',
 '/wiki/Great_Bath,_Mohenjo-daro',
 '/wiki/Indus_River',
 '/wiki/UNESCO_World_Heritage_Site',
 '/wiki/File:Harappan_small_figures.jpg',
 '/wiki/File:Harappan_small_figures.jpg',
 '/wiki/Zebu',
 '/wiki/Chicken',
 '/wiki/Bronze_Age',
 '/wiki/Civilisation',
 '/wiki/South_Asia',
 '#cite_note-FOOTNOTEWright20091-1',
 '#cite_note-2',
 '/wiki/Ancient_Egypt',
 '/wiki/Mesopotamia',
 '/wiki/Near_East',
 '/wiki/Afghanistan',
 '/wiki/Pakistan',
 '/wiki/India',
 '#cite_note-FOOTNOTEWright2009-3',
 '#cite_note-4',
 '/wiki/Indus_River',
 '/wiki/Ghaggar-Hakra_River',


In [0]:
# Transforming to absolute path URLs
full_urls = [urljoin(base_site, url) for url in relative_urls]
full_urls

['https://en.wikipedia.org/wiki/Wikipedia:Protection_policy#semi',
 'https://en.wikipedia.org/wiki/Indus_Valley_Civilisation#mw-head',
 'https://en.wikipedia.org/wiki/Indus_Valley_Civilisation#p-search',
 'https://en.wikipedia.org/wiki/File:Indus_Valley_Civilization,_Mature_Phase_(2600-1900_BCE).png',
 'https://en.wikipedia.org/wiki/South_Asia',
 'https://en.wikipedia.org/wiki/Bronze_Age#South_Asia',
 'https://en.wikipedia.org/wiki/Type_site',
 'https://en.wikipedia.org/wiki/Harappa',
 'https://en.wikipedia.org/wiki/Mehrgarh',
 'https://en.wikipedia.org/wiki/Painted_Grey_Ware_culture',
 'https://en.wikipedia.org/wiki/Cemetery_H_culture',
 'https://en.wikipedia.org/wiki/File:Mohenjo-daro.jpg',
 'https://en.wikipedia.org/wiki/File:Mohenjo-daro.jpg',
 'https://en.wikipedia.org/wiki/Mohenjo-daro',
 'https://en.wikipedia.org/wiki/Sindh',
 'https://en.wikipedia.org/wiki/Pakistan',
 'https://en.wikipedia.org/wiki/Great_Bath,_Mohenjo-daro',
 'https://en.wikipedia.org/wiki/Indus_River',
 'https

In [0]:
# Extracting only URLs pointing to Wikipedia (internal URLs)
internal_links = [url for url in full_urls if 'wikipedia.org' in url]
internal_links

['https://en.wikipedia.org/wiki/Wikipedia:Protection_policy#semi',
 'https://en.wikipedia.org/wiki/Indus_Valley_Civilisation#mw-head',
 'https://en.wikipedia.org/wiki/Indus_Valley_Civilisation#p-search',
 'https://en.wikipedia.org/wiki/File:Indus_Valley_Civilization,_Mature_Phase_(2600-1900_BCE).png',
 'https://en.wikipedia.org/wiki/South_Asia',
 'https://en.wikipedia.org/wiki/Bronze_Age#South_Asia',
 'https://en.wikipedia.org/wiki/Type_site',
 'https://en.wikipedia.org/wiki/Harappa',
 'https://en.wikipedia.org/wiki/Mehrgarh',
 'https://en.wikipedia.org/wiki/Painted_Grey_Ware_culture',
 'https://en.wikipedia.org/wiki/Cemetery_H_culture',
 'https://en.wikipedia.org/wiki/File:Mohenjo-daro.jpg',
 'https://en.wikipedia.org/wiki/File:Mohenjo-daro.jpg',
 'https://en.wikipedia.org/wiki/Mohenjo-daro',
 'https://en.wikipedia.org/wiki/Sindh',
 'https://en.wikipedia.org/wiki/Pakistan',
 'https://en.wikipedia.org/wiki/Great_Bath,_Mohenjo-daro',
 'https://en.wikipedia.org/wiki/Indus_River',
 'https
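
A substring test such as `'wikipedia.org' in url` also matches external URLs that merely mention Wikipedia in their path or query string. A stricter variant, sketched here under the assumption that "internal" means the host is wikipedia.org or one of its subdomains, compares the parsed host name instead. The helper name `is_wikipedia_url` and the `example.org` URLs are hypothetical.

```python
from urllib.parse import urlparse

def is_wikipedia_url(url):
    """Return True if the URL's host is wikipedia.org or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return host == "wikipedia.org" or host.endswith(".wikipedia.org")

candidates = [
    "https://en.wikipedia.org/wiki/South_Asia",
    "https://example.org/about/wikipedia.org",  # external, mentions it in the path
    "https://example.org/?ref=wikipedia.org",   # external, mentions it in the query
]

by_substring = [u for u in candidates if "wikipedia.org" in u]
by_host = [u for u in candidates if is_wikipedia_url(u)]

print(len(by_substring), len(by_host))
```

For this page the two filters may well agree; the host check only matters once external links start to contain the string.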

## Extracting data from nested tags

In [0]:
# Our objective now is to extract all links that appear under a section heading,
# in notes such as 'Main article:', 'See also:' or 'Further information:'
# A quick inspection shows that these are contained in div tags with the attribute 'role' set to 'note'

div_notes = soup.find_all("div", {"role": "note"})
div_notes

[<div class="hatnote navigation-not-searchable" role="note">Main article: <a href="/wiki/Periodisation_of_the_Indus_Valley_Civilisation" title="Periodisation of the Indus Valley Civilisation">Periodisation of the Indus Valley Civilisation</a></div>,
 <div class="hatnote navigation-not-searchable" role="note">See also: <a class="mw-redirect" href="/wiki/Neolithic_revolution" title="Neolithic revolution">Neolithic revolution</a></div>,
 <div class="hatnote navigation-not-searchable" role="note">Further information: <a href="/wiki/Indian_mathematics#Prehistory" title="Indian mathematics">Indian mathematics § Prehistory</a></div>,
 <div class="hatnote navigation-not-searchable" role="note">See also: <a href="/wiki/Pottery_in_the_Indian_subcontinent" title="Pottery in the Indian subcontinent">Pottery in the Indian subcontinent</a></div>,
 <div class="hatnote navigation-not-searchable" role="note">Main article: <a href="/wiki/Dancing_Girl_(sculpture)" title="Dancing Girl (sculpture)">Dancing

In [0]:
div_notes[0]

<div class="hatnote navigation-not-searchable" role="note">Main article: <a href="/wiki/Periodisation_of_the_Indus_Valley_Civilisation" title="Periodisation of the Indus Valley Civilisation">Periodisation of the Indus Valley Civilisation</a></div>

In [0]:
# We can apply find() and find_all() to a tag in the same way we do it to the whole document
div_notes[0].find('a')

<a href="/wiki/Periodisation_of_the_Indus_Valley_Civilisation" title="Periodisation of the Indus Valley Civilisation">Periodisation of the Indus Valley Civilisation</a>

In [0]:
# A naive approach to get all links would be to use find
div_links = [div.find('a') for div in div_notes]
div_links

[<a href="/wiki/Periodisation_of_the_Indus_Valley_Civilisation" title="Periodisation of the Indus Valley Civilisation">Periodisation of the Indus Valley Civilisation</a>,
 <a class="mw-redirect" href="/wiki/Neolithic_revolution" title="Neolithic revolution">Neolithic revolution</a>,
 <a href="/wiki/Indian_mathematics#Prehistory" title="Indian mathematics">Indian mathematics § Prehistory</a>,
 <a href="/wiki/Pottery_in_the_Indian_subcontinent" title="Pottery in the Indian subcontinent">Pottery in the Indian subcontinent</a>,
 <a href="/wiki/Dancing_Girl_(sculpture)" title="Dancing Girl (sculpture)">Dancing Girl (sculpture)</a>,
 <a href="/wiki/Lothal" title="Lothal">Lothal</a>,
 <a href="/wiki/Substratum_in_Vedic_Sanskrit" title="Substratum in Vedic Sanskrit">Substratum in Vedic Sanskrit</a>,
 <a href="/wiki/Indus_script" title="Indus script">Indus script</a>,
 <a href="/wiki/Prehistoric_religion" title="Prehistoric religion">Prehistoric religion</a>,
 <a href="/wiki/Vedic_period" title

In [0]:
len(div_links)

13

In [0]:
# However, find() returns only the first link, and some divs have more than 1 link
# This div happens to have just one
div_notes[3]

<div class="hatnote navigation-not-searchable" role="note">See also: <a href="/wiki/Pottery_in_the_Indian_subcontinent" title="Pottery in the Indian subcontinent">Pottery in the Indian subcontinent</a></div>

In [0]:
# This div has 3 links in it
div_notes[6].find_all('a')

[<a href="/wiki/Substratum_in_Vedic_Sanskrit" title="Substratum in Vedic Sanskrit">Substratum in Vedic Sanskrit</a>,
 <a href="/wiki/Harappan_language" title="Harappan language">Harappan language</a>,
 <a href="/wiki/Dravidian_peoples#Origins" title="Dravidian peoples">Origins of Dravidian peoples</a>]

In [0]:
# Therefore we need to use find_all
# Let's use a for loop

# Define initially empty list of links
div_links = []

for div in div_notes:
    anchors = div.find_all('a')
    
    # Need to add every link from anchors to div_links
    for a in anchors:
        div_links.append(a)
    
    # Can use div_links.extend(anchors) instead of the for loop

In [0]:
div_links

[<a href="/wiki/Periodisation_of_the_Indus_Valley_Civilisation" title="Periodisation of the Indus Valley Civilisation">Periodisation of the Indus Valley Civilisation</a>,
 <a class="mw-redirect" href="/wiki/Neolithic_revolution" title="Neolithic revolution">Neolithic revolution</a>,
 <a href="/wiki/Indian_mathematics#Prehistory" title="Indian mathematics">Indian mathematics § Prehistory</a>,
 <a href="/wiki/Pottery_in_the_Indian_subcontinent" title="Pottery in the Indian subcontinent">Pottery in the Indian subcontinent</a>,
 <a href="/wiki/Dancing_Girl_(sculpture)" title="Dancing Girl (sculpture)">Dancing Girl (sculpture)</a>,
 <a href="/wiki/Lothal" title="Lothal">Lothal</a>,
 <a href="/wiki/Meluhha" title="Meluhha">Meluhha</a>,
 <a href="/wiki/Substratum_in_Vedic_Sanskrit" title="Substratum in Vedic Sanskrit">Substratum in Vedic Sanskrit</a>,
 <a href="/wiki/Harappan_language" title="Harappan language">Harappan language</a>,
 <a href="/wiki/Dravidian_peoples#Origins" title="Dravidian

In [0]:
# We now have a complete list
len(div_links)

18
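
The gap between the two counts comes from find() stopping at the first match. The sketch below reproduces the effect on a small hand-written snippet (an assumption for illustration, modelled on the hatnotes above) and writes the nested loop as a single list comprehension.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: one note with a single link, one note with three
snippet = """
<div role="note">Main article: <a href="/wiki/Lothal">Lothal</a></div>
<div role="note">See also:
  <a href="/wiki/Substratum_in_Vedic_Sanskrit">Substratum in Vedic Sanskrit</a>,
  <a href="/wiki/Harappan_language">Harappan language</a> and
  <a href="/wiki/Dravidian_peoples#Origins">Origins of Dravidian peoples</a>
</div>
"""
demo_notes = BeautifulSoup(snippet, "html.parser").find_all("div", {"role": "note"})

# find() yields one link per div, so links after the first are lost
first_only = [div.find("a") for div in demo_notes]

# Nested comprehension: for every div, take every anchor inside it
all_anchors = [a for div in demo_notes for a in div.find_all("a")]

print(len(first_only), len(all_anchors))
```

The comprehension does the same work as the for loop with `append` (or `extend`), just in one expression.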

In [0]:
# Let's get the URLs
note_urls = [urljoin(base_site, l.get('href')) for l in div_links]
note_urls

['https://en.wikipedia.org/wiki/Periodisation_of_the_Indus_Valley_Civilisation',
 'https://en.wikipedia.org/wiki/Neolithic_revolution',
 'https://en.wikipedia.org/wiki/Indian_mathematics#Prehistory',
 'https://en.wikipedia.org/wiki/Pottery_in_the_Indian_subcontinent',
 'https://en.wikipedia.org/wiki/Dancing_Girl_(sculpture)',
 'https://en.wikipedia.org/wiki/Lothal',
 'https://en.wikipedia.org/wiki/Meluhha',
 'https://en.wikipedia.org/wiki/Substratum_in_Vedic_Sanskrit',
 'https://en.wikipedia.org/wiki/Harappan_language',
 'https://en.wikipedia.org/wiki/Dravidian_peoples#Origins',
 'https://en.wikipedia.org/wiki/Indus_script',
 'https://en.wikipedia.org/wiki/Prehistoric_religion',
 'https://en.wikipedia.org/wiki/Vedic_period',
 'https://en.wikipedia.org/wiki/Indo-Aryan_migration',
 'https://en.wikipedia.org/wiki/Bond_event',
 'https://en.wikipedia.org/wiki/4.2_kiloyear_event',
 'https://en.wikipedia.org/wiki/Iron_Age_India',
 'https://en.wikipedia.org/wiki/Indus-Mesopotamia_relations']

In [0]:
len(note_urls)

18
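
The steps of this section can also be folded into one helper. This is a minimal sketch rather than the notebook's own method: the function name `extract_note_urls` is hypothetical, it swaps the nested find_all calls for a CSS selector (assuming a Beautiful Soup version where `select()` is available), and it additionally removes duplicate URLs, which the cells above do not do.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def extract_note_urls(html, base_url):
    """Return absolute URLs of all links inside <div role="note"> elements."""
    soup = BeautifulSoup(html, "html.parser")
    # Anchors with an href that sit anywhere inside a div whose role is "note"
    anchors = soup.select('div[role="note"] a[href]')
    urls = [urljoin(base_url, a["href"]) for a in anchors]
    # dict.fromkeys drops duplicates while keeping the original order
    return list(dict.fromkeys(urls))

# Hypothetical sample: one link outside any note, one link repeated across notes
sample_html = """
<p><a href="/wiki/India">India</a></p>
<div role="note">Main article: <a href="/wiki/Lothal">Lothal</a></div>
<div role="note">See also: <a href="/wiki/Lothal">Lothal</a>,
<a href="/wiki/Meluhha">Meluhha</a></div>
"""

result = extract_note_urls(
    sample_html, "https://en.wikipedia.org/wiki/Indus_Valley_Civilisation"
)
print(result)
```

On the real page it could be called as `extract_note_urls(html, base_site)`; because of the deduplication, the result may be shorter than the 18 entries listed above.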