
# Extracting data from nested HTML tags

## Import relevant packages

In [0]:
# Load the packages
import requests
from bs4 import BeautifulSoup

## Get Request

In [2]:
# Defining the url of the site
base_site = "https://en.wikipedia.org/wiki/Indus_Valley_Civilisation"

# Making a GET request
response = requests.get(base_site)
response.status_code

200
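
Reading `status_code` by eye works in a notebook, but in a script it may be safer to fail loudly on an error response. The following is only a sketch: the helper name `fetch_html` and the 10-second timeout are assumptions for illustration, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Hypothetical helper: return the raw page bytes or raise on failure."""
    # The timeout (in seconds) is an assumed value; without one,
    # requests can wait indefinitely for a slow server
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.content
```

Calling `fetch_html(base_site)` would return the same bytes that `response.content` holds.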

In [3]:
# Extracting the HTML
html = response.content

# Checking that the response is indeed HTML by inspecting its first 100 bytes
html[:100]

b'\n<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<titl'

## Initiating the soup

In [0]:
# Convert the HTML to a BeautifulSoup object so that we can parse content out of it more easily.
# We use "html.parser" because it ships with Python's standard library

soup = BeautifulSoup(html, "html.parser")

## Exporting the HTML to a file

In [0]:
# Exporting the HTML to a file
with open('Wiki_response.html', 'wb') as file:
    file.write(soup.prettify('utf-8'))

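- The 'with' statement closes the file automatically when the block ends, even if an error occurs, much like a 'try-finally' block would
- open() is the function for opening a file; in a write mode it creates the file if it does not exist
- The 'wb' argument is the mode in which the file is opened: write ('w') in binary ('b'), which is needed because we are writing bytes
- .prettify() re-indents the HTML for better readability; passing 'utf-8' makes it return bytes instead of a string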
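
The points above can be checked on a small scale. The sketch below uses a made-up HTML snippet and a temporary directory (both are assumptions for illustration) to show that `prettify()` called with an encoding returns bytes, and that the file is closed once the `with` block ends.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# A tiny, made-up document standing in for the Wikipedia page
sample_html = "<html><body><p>Hello</p></body></html>"
sample_soup = BeautifulSoup(sample_html, "html.parser")

# With an encoding argument, prettify() returns bytes rather than str
pretty_bytes = sample_soup.prettify("utf-8")
print(type(pretty_bytes))

# Write into a temporary directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.html")
    with open(path, "wb") as file:
        file.write(pretty_bytes)
    # The inner with block has ended, so the file is already closed
    file_closed = file.closed
    with open(path, "rb") as file:
        round_trip = file.read()
```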

# Searching and navigating the HTML tree


## Searching - find() and find_all()

In [7]:
soup


<!DOCTYPE html>

<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Indus Valley Civilisation - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"XpyXRQpAAEMAACV6dgwAAABH","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Indus_Valley_Civilisation","wgTitle":"Indus Valley Civilisation","wgCurRevisionId":951742502,"wgRevisionId":951742502,"wgArticleId":46853,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Harv and Sfn template errors","CS1: long volume value","CS1 French-language sources (fr)","CS1 maint: ref=harv","Webarchive template wayback link

In [8]:
# We can search by tag name
# This returns the element with all its contents and nested elements inside
soup.find('head')

<head>
<meta charset="utf-8"/>
<title>Indus Valley Civilisation - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"XpyXRQpAAEMAACV6dgwAAABH","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Indus_Valley_Civilisation","wgTitle":"Indus Valley Civilisation","wgCurRevisionId":951742502,"wgRevisionId":951742502,"wgArticleId":46853,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Harv and Sfn template errors","CS1: long volume value","CS1 French-language sources (fr)","CS1 maint: ref=harv","Webarchive template wayback links","CS1 errors: missing periodical","All articles with incomplete

In [10]:
# If there is no result it returns None
# Note: None is not displayed in IPython unless print() or repr() is used
soup.find('audio')

print(soup.find('audio'))

None
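
Because `find()` returns `None` when nothing matches, chaining an attribute access onto the result would raise an `AttributeError`. A minimal sketch of guarding against that, using a made-up snippet (the HTML string and variable names are illustrative assumptions):

```python
from bs4 import BeautifulSoup

sample_soup = BeautifulSoup("<p>No media here</p>", "html.parser")

audio = sample_soup.find("audio")

# Check for None before touching the result
if audio is None:
    audio_src = None
else:
    audio_src = audio.get("src")

print(audio_src)
```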


In [11]:
# .find() returns only the first such result
soup.find('a')

<a id="top"></a>

In [12]:
# If we want all the results we use find_all() 
links = soup.find_all('a')
links

[<a id="top"></a>,
 <a href="/wiki/Wikipedia:Protection_policy#semi" title="This article is semi-protected until June 15, 2022 at 10:46 UTC."><img alt="Page semi-protected" data-file-height="512" data-file-width="512" decoding="async" height="20" src="//upload.wikimedia.org/wikipedia/en/thumb/1/1b/Semi-protection-shackle.svg/20px-Semi-protection-shackle.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/1/1b/Semi-protection-shackle.svg/30px-Semi-protection-shackle.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/1/1b/Semi-protection-shackle.svg/40px-Semi-protection-shackle.svg.png 2x" width="20"/></a>,
 <a class="mw-jump-link" href="#mw-head">Jump to navigation</a>,
 <a class="mw-jump-link" href="#p-search">Jump to search</a>,
 <a class="image" href="/wiki/File:Indus_Valley_Civilization,_Mature_Phase_(2600-1900_BCE).png" title="IVC major sites"><img alt="IVC major sites" data-file-height="890" data-file-width="1200" decoding="async" height="163" src="//upload.wikimedia.o

In [13]:
# find_all returns a ResultSet, which is a subclass of list
isinstance(links, list)

True

In [14]:
# We must be careful when using find_all()
# If no result is found it returns an empty list
soup.find_all('video')

[]
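
An empty result does not raise an error by itself, but indexing into it does. The sketch below, on a made-up snippet with hypothetical variable names, shows a few ways to handle a possibly empty result.

```python
from bs4 import BeautifulSoup

sample_soup = BeautifulSoup(
    "<a href='/one'>1</a><a href='/two'>2</a>", "html.parser"
)

videos = sample_soup.find_all("video")
anchors = sample_soup.find_all("a")

# An empty result is falsy, so it can be tested directly
has_videos = bool(videos)

# Looping over an empty result simply does nothing
video_sources = [video.get("src") for video in videos]

# Indexing is only safe once we know the result is non-empty
first_href = anchors[0]["href"] if anchors else None
```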

In [15]:
# How many links are on the page?
len(links)

2650

In [0]:
# Usually, we prefer to store the result in a variable
# Let's store the body of a table in a table variable
table = soup.find('tbody')

In [17]:
# Inspect the value of the variable
table

<tbody><tr><td colspan="2" style="text-align:center"><a class="image" href="/wiki/File:Indus_Valley_Civilization,_Mature_Phase_(2600-1900_BCE).png" title="IVC major sites"><img alt="IVC major sites" data-file-height="890" data-file-width="1200" decoding="async" height="163" src="//upload.wikimedia.org/wikipedia/commons/thumb/c/c9/Indus_Valley_Civilization%2C_Mature_Phase_%282600-1900_BCE%29.png/220px-Indus_Valley_Civilization%2C_Mature_Phase_%282600-1900_BCE%29.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/c/c9/Indus_Valley_Civilization%2C_Mature_Phase_%282600-1900_BCE%29.png/330px-Indus_Valley_Civilization%2C_Mature_Phase_%282600-1900_BCE%29.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/c/c9/Indus_Valley_Civilization%2C_Mature_Phase_%282600-1900_BCE%29.png/440px-Indus_Valley_Civilization%2C_Mature_Phase_%282600-1900_BCE%29.png 2x" width="220"/></a></td></tr><tr><th scope="row">Geographical range</th><td><a href="/wiki/South_Asia" title="South Asia">South Asia<

In [18]:
# Inspect the type of the variable
type(table)

bs4.element.Tag

In [19]:
# A tag can be searched in the same way we search the whole document
table.find_all('td')

[<td colspan="2" style="text-align:center"><a class="image" href="/wiki/File:Indus_Valley_Civilization,_Mature_Phase_(2600-1900_BCE).png" title="IVC major sites"><img alt="IVC major sites" data-file-height="890" data-file-width="1200" decoding="async" height="163" src="//upload.wikimedia.org/wikipedia/commons/thumb/c/c9/Indus_Valley_Civilization%2C_Mature_Phase_%282600-1900_BCE%29.png/220px-Indus_Valley_Civilization%2C_Mature_Phase_%282600-1900_BCE%29.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/c/c9/Indus_Valley_Civilization%2C_Mature_Phase_%282600-1900_BCE%29.png/330px-Indus_Valley_Civilization%2C_Mature_Phase_%282600-1900_BCE%29.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/c/c9/Indus_Valley_Civilization%2C_Mature_Phase_%282600-1900_BCE%29.png/440px-Indus_Valley_Civilization%2C_Mature_Phase_%282600-1900_BCE%29.png 2x" width="220"/></a></td>,
 <td><a href="/wiki/South_Asia" title="South Asia">South Asia</a></td>,
 <td><a href="/wiki/Bronze_Age#South_Asia" ti

In [20]:
# Since we used find_all, the result is a list
len(table.find_all('td'))

7
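
Searching from a tag only looks at that tag's descendants, which is a convenient way to narrow a query. A small sketch on a made-up page with two tables (the HTML and names are assumptions for illustration):

```python
from bs4 import BeautifulSoup

sample_html = """
<table id="first"><tbody><tr><td>A</td><td>B</td></tr></tbody></table>
<table id="second"><tbody><tr><td>C</td></tr></tbody></table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first <tbody> in document order
first_body = sample_soup.find("tbody")

# Searching from the tag is limited to its descendants
cells_in_first = [td.get_text() for td in first_body.find_all("td")]

# Searching from the soup covers the whole document
cells_in_page = [td.get_text() for td in sample_soup.find_all("td")]
```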

## Navigating the tree

In [21]:
# A tag's children are stored in a list, accessed with .contents
table.contents

[<tr><td colspan="2" style="text-align:center"><a class="image" href="/wiki/File:Indus_Valley_Civilization,_Mature_Phase_(2600-1900_BCE).png" title="IVC major sites"><img alt="IVC major sites" data-file-height="890" data-file-width="1200" decoding="async" height="163" src="//upload.wikimedia.org/wikipedia/commons/thumb/c/c9/Indus_Valley_Civilization%2C_Mature_Phase_%282600-1900_BCE%29.png/220px-Indus_Valley_Civilization%2C_Mature_Phase_%282600-1900_BCE%29.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/c/c9/Indus_Valley_Civilization%2C_Mature_Phase_%282600-1900_BCE%29.png/330px-Indus_Valley_Civilization%2C_Mature_Phase_%282600-1900_BCE%29.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/c/c9/Indus_Valley_Civilization%2C_Mature_Phase_%282600-1900_BCE%29.png/440px-Indus_Valley_Civilization%2C_Mature_Phase_%282600-1900_BCE%29.png 2x" width="220"/></a></td></tr>,
 <tr><th scope="row">Geographical range</th><td><a href="/wiki/South_Asia" title="South Asia">South Asia</a>

In [22]:
# Number of direct children of the table body
len(table.contents)

7
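
Note that `.contents` holds every direct child, not only tags. In HTML that has line breaks between elements, the whitespace shows up as text nodes too. A sketch on a made-up list (the snippet and names are assumptions) that keeps only the tag children:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

sample_html = "<ul>\n<li>one</li>\n<li>two</li>\n</ul>"
sample_list = BeautifulSoup(sample_html, "html.parser").find("ul")

# .contents holds every direct child, including the newline text nodes
all_children = sample_list.contents

# Keep only the children that are tags
tag_children = [child for child in all_children if isinstance(child, Tag)]

print(len(all_children), len(tag_children))
```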

In [23]:
# Access a particular child by its index
table.contents[1]

<tr><th scope="row">Geographical range</th><td><a href="/wiki/South_Asia" title="South Asia">South Asia</a></td></tr>

In [24]:
# We can also go up the tree with .parent
table.parent

<table class="infobox" style="width:22em"><caption>Indus Valley Civilization</caption><tbody><tr><td colspan="2" style="text-align:center"><a class="image" href="/wiki/File:Indus_Valley_Civilization,_Mature_Phase_(2600-1900_BCE).png" title="IVC major sites"><img alt="IVC major sites" data-file-height="890" data-file-width="1200" decoding="async" height="163" src="//upload.wikimedia.org/wikipedia/commons/thumb/c/c9/Indus_Valley_Civilization%2C_Mature_Phase_%282600-1900_BCE%29.png/220px-Indus_Valley_Civilization%2C_Mature_Phase_%282600-1900_BCE%29.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/c/c9/Indus_Valley_Civilization%2C_Mature_Phase_%282600-1900_BCE%29.png/330px-Indus_Valley_Civilization%2C_Mature_Phase_%282600-1900_BCE%29.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/c/c9/Indus_Valley_Civilization%2C_Mature_Phase_%282600-1900_BCE%29.png/440px-Indus_Valley_Civilization%2C_Mature_Phase_%282600-1900_BCE%29.png 2x" width="220"/></a></td></tr><tr><th scope="row

In [25]:
# table.parent is also a tag
# Thus, we can use .parent on it as well
table.parent.parent

<div class="mw-parser-output"><p class="mw-empty-elt">
</p>
<div class="shortdescription nomobile noexcerpt noprint searchaux" style="display:none">Bronze Age civilisation in South Asia</div>
<p class="mw-empty-elt">
</p>
<table class="infobox" style="width:22em"><caption>Indus Valley Civilization</caption><tbody><tr><td colspan="2" style="text-align:center"><a class="image" href="/wiki/File:Indus_Valley_Civilization,_Mature_Phase_(2600-1900_BCE).png" title="IVC major sites"><img alt="IVC major sites" data-file-height="890" data-file-width="1200" decoding="async" height="163" src="//upload.wikimedia.org/wikipedia/commons/thumb/c/c9/Indus_Valley_Civilization%2C_Mature_Phase_%282600-1900_BCE%29.png/220px-Indus_Valley_Civilization%2C_Mature_Phase_%282600-1900_BCE%29.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/c/c9/Indus_Valley_Civilization%2C_Mature_Phase_%282600-1900_BCE%29.png/330px-Indus_Valley_Civilization%2C_Mature_Phase_%282600-1900_BCE%29.png 1.5x, //upload.wikimedi
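
Chaining `.parent` works for a step or two; to walk all the way up, Beautiful Soup also provides `.parents`, which yields each ancestor in turn. A minimal sketch on a made-up snippet (the HTML and variable names are assumptions for illustration):

```python
from bs4 import BeautifulSoup

sample_html = (
    "<div class='wrapper'><table><tbody>"
    "<tr><td>x</td></tr>"
    "</tbody></table></div>"
)
sample_soup = BeautifulSoup(sample_html, "html.parser")
body = sample_soup.find("tbody")

# One step up, then two steps up
parent_name = body.parent.name
grandparent_name = body.parent.parent.name

# .parents yields every ancestor, ending with the BeautifulSoup object itself
ancestor_names = [ancestor.name for ancestor in body.parents]
```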

In [26]:
# We use .parent to go up the tree
# But what about .children?
table.children

<list_iterator at 0x7fa112461048>

In [27]:
# If we want a list of an element's children, we use table.contents as shown before
# .children is an iterator over that list,
# so we can use it in a for loop to visit each child in turn

for child in table.children:
    print(child)

<tr><td colspan="2" style="text-align:center"><a class="image" href="/wiki/File:Indus_Valley_Civilization,_Mature_Phase_(2600-1900_BCE).png" title="IVC major sites"><img alt="IVC major sites" data-file-height="890" data-file-width="1200" decoding="async" height="163" src="//upload.wikimedia.org/wikipedia/commons/thumb/c/c9/Indus_Valley_Civilization%2C_Mature_Phase_%282600-1900_BCE%29.png/220px-Indus_Valley_Civilization%2C_Mature_Phase_%282600-1900_BCE%29.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/c/c9/Indus_Valley_Civilization%2C_Mature_Phase_%282600-1900_BCE%29.png/330px-Indus_Valley_Civilization%2C_Mature_Phase_%282600-1900_BCE%29.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/c/c9/Indus_Valley_Civilization%2C_Mature_Phase_%282600-1900_BCE%29.png/440px-Indus_Valley_Civilization%2C_Mature_Phase_%282600-1900_BCE%29.png 2x" width="220"/></a></td></tr>
<tr><th scope="row">Geographical range</th><td><a href="/wiki/South_Asia" title="South Asia">South Asia</a></t

## Searching by attributes

In [28]:
# We can search for tags based on their attributes, in addition to their name
soup.find('div', id = 'siteSub')

<div class="noprint" id="siteSub">From Wikipedia, the free encyclopedia</div>

There are two ways in which we can do that:

### 1. Passing attributes as function parameters

In [29]:
# Pass the attributes as keyword arguments
# Notice that since class is a reserved word in Python, we write class_
soup.find_all('a', class_ = 'mw-jump-link')

[<a class="mw-jump-link" href="#mw-head">Jump to navigation</a>,
 <a class="mw-jump-link" href="#p-search">Jump to search</a>]

In [30]:
# We can filter against multiple attributes at once
soup.find('a', class_ = 'mw-jump-link', href = '#p-search')

<a class="mw-jump-link" href="#p-search">Jump to search</a>
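
When several attributes are given, a tag has to satisfy all of them to match. The sketch below illustrates this on a made-up snippet; the third link and the variable names are assumptions added for contrast.

```python
from bs4 import BeautifulSoup

sample_html = """
<a class="mw-jump-link" href="#mw-head">Jump to navigation</a>
<a class="mw-jump-link" href="#p-search">Jump to search</a>
<a class="external" href="#p-search">Another link</a>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Each keyword narrows the match; a tag must satisfy all of them
by_class = sample_soup.find_all("a", class_="mw-jump-link")
by_href = sample_soup.find_all("a", href="#p-search")
by_both = sample_soup.find_all("a", class_="mw-jump-link", href="#p-search")

print(len(by_class), len(by_href), len(by_both))
```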

### 2. Placing the attributes in a dictionary

In [31]:
# By writing the attributes in a dictionary
soup.find('a', attrs={ 'class':'mw-jump-link', 'href':'#p-search' })

<a class="mw-jump-link" href="#p-search">Jump to search</a>
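
The dictionary form is also the practical choice when an attribute name is not a valid Python identifier, such as the `data-file-width` attribute seen on the page's images. A minimal sketch on a made-up snippet (the two `<img>` tags and the variable names are assumptions for illustration):

```python
from bs4 import BeautifulSoup

sample_html = (
    '<img alt="map" data-file-width="1200"/>'
    '<img alt="icon" data-file-width="512"/>'
)
sample_soup = BeautifulSoup(sample_html, "html.parser")

# data-file-width contains hyphens, so it cannot be written as a keyword argument
wide_images = sample_soup.find_all("img", attrs={"data-file-width": "1200"})
alts = [img["alt"] for img in wide_images]

print(alts)
```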

In [32]:
# The dictionary can also be passed as the second positional argument
soup.find('div', {'id' : 'footer'})

<div id="footer" role="contentinfo">
<ul class="" id="footer-info">
<li id="footer-info-lastmod"> This page was last edited on 18 April 2020, at 17:08<span class="anonymous-show"> (UTC)</span>.</li>
<li id="footer-info-copyright">Text is available under the <a href="//en.wikipedia.org/wiki/Wikipedia:Text_of_Creative_Commons_Attribution-ShareAlike_3.0_Unported_License" rel="license">Creative Commons Attribution-ShareAlike License</a><a href="//creativecommons.org/licenses/by-sa/3.0/" rel="license" style="display:none;"></a>;
additional terms may apply.  By using this site, you agree to the <a href="//foundation.wikimedia.org/wiki/Terms_of_Use">Terms of Use</a> and <a href="//foundation.wikimedia.org/wiki/Privacy_policy">Privacy Policy</a>. Wikipedia® is a registered trademark of the <a href="//www.wikimediafoundation.org/">Wikimedia Foundation, Inc.</a>, a non-profit organization.</li>
</ul>
<ul class="" id="footer-places">
<li id="footer-places-privacy"><a class="extiw" href="https://f