# Set-up and Workflow

### Importing the packages

In [25]:
# Load the packages
import requests
from bs4 import BeautifulSoup

### Making a GET request

In [26]:
# Defining the url of the site
base_site = "https://en.wikipedia.org/wiki/Music"

# Making a get request
response = requests.get(base_site)
response.status_code

200

In [27]:
# Extracting the HTML
html = response.content

# Checking that the response is indeed HTML by inspecting the first 100 bytes
html[:100]

b'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title'
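The `b'...'` prefix in the output above shows that `response.content` is a bytes object, not a string. The following is a minimal sketch of how such bytes relate to text. The `sample` value is a short hand-written stand-in (an assumption for illustration), not the real response.

```python
# A short stand-in for the raw bytes returned by response.content (illustrative only)
sample = b'<!DOCTYPE html>\n<html lang="en">\n<head>\n<meta charset="UTF-8"/>'

# Bytes must be decoded to obtain a str; UTF-8 is the charset this page declares
decoded = sample.decode("utf-8")

print(type(sample).__name__)   # bytes
print(type(decoded).__name__)  # str
print(decoded[:15])            # <!DOCTYPE html>
```

BeautifulSoup accepts either bytes or a string, so the decoding step is usually not needed before making the soup.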

### Making the soup

In [28]:
# Convert the HTML to a BeautifulSoup object. This will allow us to parse out content from the HTML more easily.
# Using the "html.parser" parser, as it is included in Python's standard library
soup = BeautifulSoup(html, "html.parser")

### Exporting the HTML to a file

In [29]:
# It is extremely useful to be able to check this file when searching for where some info is located
# or to see how the document was parsed

# Exporting the HTML to a file
with open('Wiki_response.html', 'wb') as file:
    file.write(soup.prettify('utf-8'))


# The 'with' statement works like a 'try-finally' block: the file is closed automatically, even if an error occurs
# open() is the built-in function for opening (or creating) a file
# The 'wb' argument sets the mode: 'w' for writing, 'b' for binary (bytes)
# .prettify('utf-8') returns the HTML re-indented for readability, encoded as UTF-8 bytes
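To make the comparison between `with` and `try-finally` concrete, here is a small sketch that writes the same bytes both ways. The file names and the `data` value are made up for illustration, and a temporary directory is used so nothing is left on disk.

```python
import os
import tempfile

data = b"<html><body>Hello</body></html>"  # illustrative bytes, not the Wikipedia page

# Work in a throwaway directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path_with = os.path.join(tmp_dir, "with_version.html")
    path_manual = os.path.join(tmp_dir, "manual_version.html")

    # Version 1: 'with' closes the file automatically
    with open(path_with, "wb") as file:
        file.write(data)

    # Version 2: roughly what 'with' does for us
    file = open(path_manual, "wb")
    try:
        file.write(data)
    finally:
        file.close()  # runs even if write() raises an exception

    # Read both files back and compare their contents
    with open(path_with, "rb") as f1, open(path_manual, "rb") as f2:
        same_content = f1.read() == f2.read()

print(same_content)  # True
```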

# Searching and navigating the HTML tree

## Searching - find() and find_all()

In [30]:
# The soup variable (BeautifulSoup object) we defined earlier can be seen as representing the whole document
soup

<!DOCTYPE html>

<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Music - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"da60c370-9bc9-465f-8205-dbac1982543e","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Music","wgTitle":"Music","wgCurRevisionId":1119417309,"wgRevisionId":1119417309,"wgArticleId":18839,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles containing Ancient Greek (to 1453)-language text","Webarchive template wayback links","Pages containing links to subscription-only content","Wikipedia articles needing t

In [31]:
# We can search by tag name
# This returns the element with all its contents and nested elements inside
soup.find('head')

<head>
<meta charset="utf-8"/>
<title>Music - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"da60c370-9bc9-465f-8205-dbac1982543e","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Music","wgTitle":"Music","wgCurRevisionId":1119417309,"wgRevisionId":1119417309,"wgArticleId":18839,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles containing Ancient Greek (to 1453)-language text","Webarchive template wayback links","Pages containing links to subscription-only content","Wikipedia articles needing time reference citations from February 2020","CS1: Julian–Gregori

In [32]:
# If there is no result it returns None
# Note: None is not displayed in IPython unless print() or repr() is used
soup.find('video')

In [33]:
# Display the None value
print(soup.find('video'))

None


In [34]:
# Verify the type of the output
type(soup.find('video'))

NoneType

In [35]:
# .find() returns only the first such result
soup.find('a')

<a id="top"></a>

In [36]:
# If we want all the results we use find_all() 
links = soup.find_all('a')
links

[<a id="top"></a>,
 <a href="/wiki/Wikipedia:Protection_policy#semi" title="This article is semi-protected."><img alt="Page semi-protected" data-file-height="512" data-file-width="512" decoding="async" height="20" src="//upload.wikimedia.org/wikipedia/en/thumb/1/1b/Semi-protection-shackle.svg/20px-Semi-protection-shackle.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/1/1b/Semi-protection-shackle.svg/30px-Semi-protection-shackle.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/1/1b/Semi-protection-shackle.svg/40px-Semi-protection-shackle.svg.png 2x" width="20"/></a>,
 <a class="mw-jump-link" href="#mw-head">Jump to navigation</a>,
 <a class="mw-jump-link" href="#searchInput">Jump to search</a>,
 <a class="mw-disambig" href="/wiki/Music_(disambiguation)" title="Music (disambiguation)">Music (disambiguation)</a>,
 <a class="external text" href="https://en.wikipedia.org/w/index.php?title=Music&amp;action=edit">improve it</a>,
 <a href="/wiki/Talk:Music" title="Talk:Music

In [37]:
# find_all returns a list of all results (a ResultSet, which is a subclass of list)
isinstance(links, list)

True

In [38]:
# We must be careful when using find_all()
# If no result is found it returns an empty list
soup.find_all('video')

[]
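The difference between the two "nothing found" results matters in practice: `None` has no attributes, while an empty list can still be looped over safely. A minimal sketch on a tiny hand-written document (the HTML string is an assumption for illustration):

```python
from bs4 import BeautifulSoup

# Tiny hand-written document used only for illustration
doc = BeautifulSoup("<p><a href='/one'>One</a> <a href='/two'>Two</a></p>", "html.parser")

first_link = doc.find("a")          # first match only
all_links = doc.find_all("a")       # every match
no_video = doc.find("video")        # None: nothing matched
no_videos = doc.find_all("video")   # empty list: nothing matched

# Guard against None before touching attributes of a find() result
video_src = no_video.get("src") if no_video is not None else None

# Looping over an empty result is safe; the body simply never runs
video_sources = [v.get("src") for v in no_videos]

print(first_link["href"])  # /one
print(len(all_links))      # 2
print(video_src, video_sources)
```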

In [39]:
# How many links are on the page?
len(links)

2471

In [40]:
# Usually, we prefer to store the result in a variable
# Let's store the body of a table in a table variable
table = soup.find('tbody')

In [41]:
# Inspect the value of the variable
table

<tbody><tr><td class="mbox-image"><div class="mbox-image-div"><img alt="" data-file-height="40" data-file-width="40" decoding="async" height="40" src="//upload.wikimedia.org/wikipedia/en/thumb/b/b4/Ambox_important.svg/40px-Ambox_important.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/b/b4/Ambox_important.svg/60px-Ambox_important.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/b/b4/Ambox_important.svg/80px-Ambox_important.svg.png 2x" width="40"/></div></td><td class="mbox-text"><div class="mbox-text-span"><div class="multiple-issues-text mw-collapsible"><b>This article has multiple issues.</b> Please help <b><a class="external text" href="https://en.wikipedia.org/w/index.php?title=Music&amp;action=edit">improve it</a></b> or discuss these issues on the <b><a href="/wiki/Talk:Music" title="Talk:Music">talk page</a></b>. <small><i>(<a href="/wiki/Help:Maintenance_template_removal" title="Help:Maintenance template removal">Learn how and when to remove these template me

In [42]:
# Inspect the type of the variable
type(table)

bs4.element.Tag

In [43]:
# A tag can be searched in the same way we search the whole document
table.find_all('td')

[<td class="mbox-image"><div class="mbox-image-div"><img alt="" data-file-height="40" data-file-width="40" decoding="async" height="40" src="//upload.wikimedia.org/wikipedia/en/thumb/b/b4/Ambox_important.svg/40px-Ambox_important.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/b/b4/Ambox_important.svg/60px-Ambox_important.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/b/b4/Ambox_important.svg/80px-Ambox_important.svg.png 2x" width="40"/></div></td>,
 <td class="mbox-text"><div class="mbox-text-span"><div class="multiple-issues-text mw-collapsible"><b>This article has multiple issues.</b> Please help <b><a class="external text" href="https://en.wikipedia.org/w/index.php?title=Music&amp;action=edit">improve it</a></b> or discuss these issues on the <b><a href="/wiki/Talk:Music" title="Talk:Music">talk page</a></b>. <small><i>(<a href="/wiki/Help:Maintenance_template_removal" title="Help:Maintenance template removal">Learn how and when to remove these template messages<

In [44]:
# Since we used find_all, the result is a list
len(table.find_all('td'))

6

## Navigating the tree

In [45]:
# A tag's children are stored in a list, accessed with .contents
table.contents

[<tr><td class="mbox-image"><div class="mbox-image-div"><img alt="" data-file-height="40" data-file-width="40" decoding="async" height="40" src="//upload.wikimedia.org/wikipedia/en/thumb/b/b4/Ambox_important.svg/40px-Ambox_important.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/b/b4/Ambox_important.svg/60px-Ambox_important.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/b/b4/Ambox_important.svg/80px-Ambox_important.svg.png 2x" width="40"/></div></td><td class="mbox-text"><div class="mbox-text-span"><div class="multiple-issues-text mw-collapsible"><b>This article has multiple issues.</b> Please help <b><a class="external text" href="https://en.wikipedia.org/w/index.php?title=Music&amp;action=edit">improve it</a></b> or discuss these issues on the <b><a href="/wiki/Talk:Music" title="Talk:Music">talk page</a></b>. <small><i>(<a href="/wiki/Help:Maintenance_template_removal" title="Help:Maintenance template removal">Learn how and when to remove these template messages

In [46]:
len(table.contents)

1

In [47]:
# .contents has only one element (index 0), so index 1 raises an IndexError
table.contents[1]

IndexError: list index out of range

In [51]:
# We can also go up the tree with .parent
table.parent

<table class="box-Multiple_issues plainlinks metadata ambox ambox-content ambox-multiple_issues compact-ambox" role="presentation"><tbody><tr><td class="mbox-image"><div class="mbox-image-div"><img alt="" data-file-height="40" data-file-width="40" decoding="async" height="40" src="//upload.wikimedia.org/wikipedia/en/thumb/b/b4/Ambox_important.svg/40px-Ambox_important.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/b/b4/Ambox_important.svg/60px-Ambox_important.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/b/b4/Ambox_important.svg/80px-Ambox_important.svg.png 2x" width="40"/></div></td><td class="mbox-text"><div class="mbox-text-span"><div class="multiple-issues-text mw-collapsible"><b>This article has multiple issues.</b> Please help <b><a class="external text" href="https://en.wikipedia.org/w/index.php?title=Music&amp;action=edit">improve it</a></b> or discuss these issues on the <b><a href="/wiki/Talk:Music" title="Talk:Music">talk page</a></b>. <small><i>(<a href

In [52]:
# table.parent is also a tag
# Thus, we can use .parent on it as well
table.parent.parent

<div class="mw-parser-output"><div class="shortdescription nomobile noexcerpt noprint searchaux" style="display:none">Form of art using sound</div>
<p class="mw-empty-elt">
</p>
<style data-mw-deduplicate="TemplateStyles:r1033289096">.mw-parser-output .hatnote{font-style:italic}.mw-parser-output div.hatnote{padding-left:1.6em;margin-bottom:0.5em}.mw-parser-output .hatnote i{font-style:normal}.mw-parser-output .hatnote+link+.hatnote{margin-top:-0.5em}</style><div class="hatnote navigation-not-searchable" role="note">For other uses, see <a class="mw-disambig" href="/wiki/Music_(disambiguation)" title="Music (disambiguation)">Music (disambiguation)</a>.</div>
<p class="mw-empty-elt">
</p>
<style data-mw-deduplicate="TemplateStyles:r1097763485">.mw-parser-output .ambox{border:1px solid #a2a9b1;border-left:10px solid #36c;background-color:#fbfbfb;box-sizing:border-box}.mw-parser-output .ambox+link+.ambox,.mw-parser-output .ambox+link+style+.ambox,.mw-parser-output .ambox+link+link+.ambox,.m

In [53]:
# We use .parent to go up the tree
# But what about .children?
table.children

<list_iterator at 0x268131614c0>

In [54]:
# If we want a list of an element's children, we need to use table.contents as shown before
# .children is an iterator over that list, 
# which means we can use it in a for loop to iterate over all the children

for child in table.children:
    print(child)

<tr><td class="mbox-image"><div class="mbox-image-div"><img alt="" data-file-height="40" data-file-width="40" decoding="async" height="40" src="//upload.wikimedia.org/wikipedia/en/thumb/b/b4/Ambox_important.svg/40px-Ambox_important.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/b/b4/Ambox_important.svg/60px-Ambox_important.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/b/b4/Ambox_important.svg/80px-Ambox_important.svg.png 2x" width="40"/></div></td><td class="mbox-text"><div class="mbox-text-span"><div class="multiple-issues-text mw-collapsible"><b>This article has multiple issues.</b> Please help <b><a class="external text" href="https://en.wikipedia.org/w/index.php?title=Music&amp;action=edit">improve it</a></b> or discuss these issues on the <b><a href="/wiki/Talk:Music" title="Talk:Music">talk page</a></b>. <small><i>(<a href="/wiki/Help:Maintenance_template_removal" title="Help:Maintenance template removal">Learn how and when to remove these template messages<
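In the table above the only child is a single `<tr>` tag. In HTML that is formatted with line breaks, the children often include whitespace strings as well as tags. The sketch below uses a small hand-written list (an assumption for illustration) to show this and one way to keep only the tags.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hand-written HTML with newlines between the <li> tags (illustrative only)
doc = BeautifulSoup("<ul>\n<li>a</li>\n<li>b</li>\n</ul>", "html.parser")
ul = doc.find("ul")

# .contents includes the newline strings between the tags
print(len(ul.contents))  # 5

# Keep only the children that are tags, skipping the whitespace strings
tag_children = [child for child in ul.children if isinstance(child, Tag)]
print([child.name for child in tag_children])  # ['li', 'li']
```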

## Searching by attributes

In [55]:
# We can search for tags based on their attributes, in addition to their name
soup.find('div', id = 'siteSub')

<div class="noprint" id="siteSub">From Wikipedia, the free encyclopedia</div>

In [56]:
# There are two ways in which we can do that:

### Passing attributes as function parameters

In [57]:
# By writing them as function parameters
# Notice that since class is a reserved word, we write class_
soup.find_all('a', class_ = 'mw-jump-link')

[<a class="mw-jump-link" href="#mw-head">Jump to navigation</a>,
 <a class="mw-jump-link" href="#searchInput">Jump to search</a>]

In [58]:
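# We can filter against multiple attributes at once
# No 'a' tag here has both class 'mw-jump-link' and href '#p-search', so find() returns None and nothing is displayed
soup.find('a', class_ = 'mw-jump-link', href = '#p-search')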

### Placing the attributes in a dictionary

In [59]:
# By writing the attributes in a dictionary
soup.find('a', attrs={ 'class':'mw-jump-link', 'href':'#p-search' })

In [60]:
soup.find('div', {'id' : 'footer'})
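The two styles are interchangeable for ordinary attributes. The dictionary form becomes necessary when an attribute name is not a valid Python keyword argument, such as one containing a hyphen. A small sketch on hand-written HTML; the `data-section` attribute is hypothetical and chosen only to illustrate the point.

```python
from bs4 import BeautifulSoup

# Hand-written HTML; the data-section attribute is hypothetical
html_doc = """
<div id="footer" class="noprint" data-section="bottom">Footer</div>
<a class="mw-jump-link" href="#mw-head">Jump to navigation</a>
<a class="mw-jump-link" href="#searchInput">Jump to search</a>
"""
doc = BeautifulSoup(html_doc, "html.parser")

# Both calls apply the same two filters and find the same tag
by_kwargs = doc.find("a", class_="mw-jump-link", href="#searchInput")
by_dict = doc.find("a", attrs={"class": "mw-jump-link", "href": "#searchInput"})
print(by_kwargs is by_dict)  # True

# 'data-section' cannot be written as a keyword argument, so attrs= is needed
footer = doc.find("div", attrs={"data-section": "bottom"})
print(footer["id"])  # footer
```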

# Extracting data from the HTML tree

In [61]:
# Let's use some placeholder object to manipulate in the examples below
a = soup.find('a', class_ = 'mw-jump-link')
a

<a class="mw-jump-link" href="#mw-head">Jump to navigation</a>

In [62]:

# We can obtain the name of the tag with the .name attribute
a.name

'a'

## Getting the attribute value

In [63]:
# We can access a tag’s attributes by treating the tag just like a dictionary

In [64]:
# First way
a['href']

'#mw-head'

In [65]:
# Notice how multi-valued attributes, such as class, return a list
a['class']

['mw-jump-link']

In [66]:
# Second way
a.get('href')

'#mw-head'

In [67]:
# Again, class returns a list
a.get('class')

['mw-jump-link']

#### Differences between these methods manifest when the key is missing

In [68]:
# tag['missing-key'] raises a KeyError
# a['id'] would raise a KeyError here, because this tag has no id attribute

In [69]:
# tag.get('missing-key') returns a default value None
a.get('id')

In [70]:
# We can use the repr() function to display special values and characters (None, \n...)
repr(a.get('id'))

'None'

In [71]:
# We can also get all attribute name-value pairs in a dictionary
a.attrs

{'class': ['mw-jump-link'], 'href': '#mw-head'}
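The behaviour with a missing key can be sketched on a single hand-written tag (the HTML string and the fallback value `'no-id'` are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# A single hand-written tag without an id attribute (illustrative only)
doc = BeautifulSoup('<a class="mw-jump-link" href="#mw-head">Jump</a>', "html.parser")
tag = doc.find("a")

# Dictionary-style access raises KeyError when the attribute is missing
try:
    tag["id"]
    outcome = "found"
except KeyError:
    outcome = "KeyError"

# .get() returns None by default, or a fallback value if one is supplied
missing = tag.get("id")
fallback = tag.get("id", "no-id")

print(outcome)   # KeyError
print(missing)   # None
print(fallback)  # no-id
```

Using `.get()` is usually the safer choice when looping over many tags that may or may not carry the attribute.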

## Extracting the text

### .string vs .text

In [72]:
# We can access the raw string of an element by using .string
a.string

'Jump to navigation'

In [73]:
# Alternatively, we can use .text
a.text

'Jump to navigation'

#### They exhibit different behaviour when the element contains more than one distinct string

In [74]:
# Select the second paragraph on the page
# In this snapshot it is an empty placeholder element whose only content is a newline
p = soup.find_all('p')[1]
p

<p class="mw-empty-elt">
</p>

In [75]:
# .text returns everything inside the element
p.text

'\n'

In [76]:
# .string returns None when there is more than 1 string
# Here p contains a single string (a newline), so .string returns it
p.string

'\n'
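Because the paragraph selected above holds a single string, it does not show the difference between the two attributes. The sketch below uses two hand-written paragraphs (an assumption for illustration) where the difference is visible.

```python
from bs4 import BeautifulSoup

# Two hand-written paragraphs: one with mixed content, one with a single nested string
doc = BeautifulSoup(
    "<p id='mixed'>Hello <b>world</b>!</p><p id='single'><b>only</b></p>",
    "html.parser",
)
mixed = doc.find("p", id="mixed")
single = doc.find("p", id="single")

# More than one string inside the element: .string gives None, .text joins everything
print(repr(mixed.string))  # None
print(mixed.text)          # Hello world!

# Exactly one string, even if nested in a child tag: .string returns it
print(single.string)       # only
```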

In [77]:
repr(p.string)

"'\\n'"

In [78]:
p.parent

<div class="mw-parser-output"><div class="shortdescription nomobile noexcerpt noprint searchaux" style="display:none">Form of art using sound</div>
<p class="mw-empty-elt">
</p>
<style data-mw-deduplicate="TemplateStyles:r1033289096">.mw-parser-output .hatnote{font-style:italic}.mw-parser-output div.hatnote{padding-left:1.6em;margin-bottom:0.5em}.mw-parser-output .hatnote i{font-style:normal}.mw-parser-output .hatnote+link+.hatnote{margin-top:-0.5em}</style><div class="hatnote navigation-not-searchable" role="note">For other uses, see <a class="mw-disambig" href="/wiki/Music_(disambiguation)" title="Music (disambiguation)">Music (disambiguation)</a>.</div>
<p class="mw-empty-elt">
</p>
<style data-mw-deduplicate="TemplateStyles:r1097763485">.mw-parser-output .ambox{border:1px solid #a2a9b1;border-left:10px solid #36c;background-color:#fbfbfb;box-sizing:border-box}.mw-parser-output .ambox+link+.ambox,.mw-parser-output .ambox+link+style+.ambox,.mw-parser-output .ambox+link+link+.ambox,.m

In [79]:
# We can stack different operations one after the other
p.parent.text

'Form of art using sound\n\n\nFor other uses, see Music (disambiguation).\n\n\nThis article has multiple issues. Please help improve it or discuss these issues on the talk page. (Learn how and when to remove these template messages)\n\nThis article needs additional citations for verification. Please help improve this article by adding citations to reliable sources. Unsourced material may be challenged and removed.Find sources:\xa0"Music"\xa0–\xa0news\xa0· newspapers\xa0· books\xa0· scholar\xa0· JSTOR (October 2021) (Learn how and when to remove this template message)\nThis article may be too long to read and navigate comfortably. Please consider splitting content into sub-articles, condensing it, or adding subheadings. Please discuss this issue on the article\'s talk page. (May 2022)\n\n (Learn how and when to remove this template message)\n Grooved side of the Voyager Golden Record launched along the Voyager probes to space, which feature music from around the world\nPart of a series 

In [80]:
# print() renders the newline characters, so the text is easier to read
print(p.parent.text)

Form of art using sound


For other uses, see Music (disambiguation).


This article has multiple issues. Please help improve it or discuss these issues on the talk page. (Learn how and when to remove these template messages)

This article needs additional citations for verification. Please help improve this article by adding citations to reliable sources. Unsourced material may be challenged and removed.Find sources: "Music" – news · newspapers · books · scholar · JSTOR (October 2021) (Learn how and when to remove this template message)
This article may be too long to read and navigate comfortably. Please consider splitting content into sub-articles, condensing it, or adding subheadings. Please discuss this issue on the article's talk page. (May 2022)

 (Learn how and when to remove this template message)
 Grooved side of the Voyager Golden Record launched along the Voyager probes to space, which feature music from around the world
Part of a series onPerforming arts
Acrobatics
Ballet


In [81]:
# We can also use .get_text() instead of .text
p.parent.get_text()

'Form of art using sound\n\n\nFor other uses, see Music (disambiguation).\n\n\nThis article has multiple issues. Please help improve it or discuss these issues on the talk page. (Learn how and when to remove these template messages)\n\nThis article needs additional citations for verification. Please help improve this article by adding citations to reliable sources. Unsourced material may be challenged and removed.Find sources:\xa0"Music"\xa0–\xa0news\xa0· newspapers\xa0· books\xa0· scholar\xa0· JSTOR (October 2021) (Learn how and when to remove this template message)\nThis article may be too long to read and navigate comfortably. Please consider splitting content into sub-articles, condensing it, or adding subheadings. Please discuss this issue on the article\'s talk page. (May 2022)\n\n (Learn how and when to remove this template message)\n Grooved side of the Voyager Golden Record launched along the Voyager probes to space, which feature music from around the world\nPart of a series 

In [82]:
print(p.parent.get_text())

Form of art using sound


For other uses, see Music (disambiguation).


This article has multiple issues. Please help improve it or discuss these issues on the talk page. (Learn how and when to remove these template messages)

This article needs additional citations for verification. Please help improve this article by adding citations to reliable sources. Unsourced material may be challenged and removed.Find sources: "Music" – news · newspapers · books · scholar · JSTOR (October 2021) (Learn how and when to remove this template message)
This article may be too long to read and navigate comfortably. Please consider splitting content into sub-articles, condensing it, or adding subheadings. Please discuss this issue on the article's talk page. (May 2022)

 (Learn how and when to remove this template message)
 Grooved side of the Voyager Golden Record launched along the Voyager probes to space, which feature music from around the world
Part of a series onPerforming arts
Acrobatics
Ballet


In [83]:
# We can also extract the whole text of the webpage
# CAUTION: This includes JavaScript text, CSS and other not directly displayed text
print(soup.text)





Music - Wikipedia







































 



Music

From Wikipedia, the free encyclopedia



Jump to navigation
Jump to search
Form of art using sound


For other uses, see Music (disambiguation).


This article has multiple issues. Please help improve it or discuss these issues on the talk page. (Learn how and when to remove these template messages)

This article needs additional citations for verification. Please help improve this article by adding citations to reliable sources. Unsourced material may be challenged and removed.Find sources: "Music" – news · newspapers · books · scholar · JSTOR (October 2021) (Learn how and when to remove this template message)
This article may be too long to read and navigate comfortably. Please consider splitting content into sub-articles, condensing it, or adding subheadings. Please discuss this issue on the article's talk page. (May 2022)

 (Learn how and when to remove this template message)
 Grooved side of the Voyager Golde

### .strings and .stripped_strings

In [84]:
# All strings inside an element can be accessed separately by using the .strings iterator

In [85]:
for s in p.strings:
    print(repr(s))

'\n'


In [86]:
# The extra whitespace can be removed by using the .stripped_strings iterator instead
# Strings that consist only of whitespace are skipped, so nothing is printed for this paragraph
for s in p.stripped_strings:
    print(repr(s))
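Since the paragraph above contains only a newline, here is a sketch on a hand-written paragraph with real text (the HTML strings are assumptions for illustration) that contrasts the two iterators.

```python
from bs4 import BeautifulSoup

# Hand-written paragraph with untidy whitespace (illustrative only)
doc = BeautifulSoup("<p>\n  Hello <b> world </b>\n !</p>", "html.parser")
para = doc.find("p")

raw = list(para.strings)             # strings exactly as they appear in the HTML
clean = list(para.stripped_strings)  # whitespace trimmed, empty strings skipped

print(raw)
print(clean)            # ['Hello', 'world', '!']
print(" ".join(clean))  # Hello world !

# A paragraph holding only whitespace yields no stripped strings at all
empty = BeautifulSoup('<p class="mw-empty-elt">\n</p>', "html.parser").find("p")
print(list(empty.stripped_strings))  # []
```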

# Practical examples

## Links - absolute path URL

In [87]:
# Let's use the variable links we defined a couple of lectures ago for this example
# It contains all the 'a' tags on this page
links

[<a id="top"></a>,
 <a href="/wiki/Wikipedia:Protection_policy#semi" title="This article is semi-protected."><img alt="Page semi-protected" data-file-height="512" data-file-width="512" decoding="async" height="20" src="//upload.wikimedia.org/wikipedia/en/thumb/1/1b/Semi-protection-shackle.svg/20px-Semi-protection-shackle.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/1/1b/Semi-protection-shackle.svg/30px-Semi-protection-shackle.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/1/1b/Semi-protection-shackle.svg/40px-Semi-protection-shackle.svg.png 2x" width="20"/></a>,
 <a class="mw-jump-link" href="#mw-head">Jump to navigation</a>,
 <a class="mw-jump-link" href="#searchInput">Jump to search</a>,
 <a class="mw-disambig" href="/wiki/Music_(disambiguation)" title="Music (disambiguation)">Music (disambiguation)</a>,
 <a class="external text" href="https://en.wikipedia.org/w/index.php?title=Music&amp;action=edit">improve it</a>,
 <a href="/wiki/Talk:Music" title="Talk:Music

In [88]:
# Let's choose one link to manipulate
link = links[26]
link

<a class="internal" href="/wiki/File:The_Sounds_of_Earth_-_GPN-2000-001976.jpg" title="Enlarge"></a>

In [89]:
# Get the link's text
# This link has no text inside it, so .string is None and nothing is displayed
link.string

In [90]:
# Extract the link's URL
link['href']

'/wiki/File:The_Sounds_of_Earth_-_GPN-2000-001976.jpg'

In [91]:
# This is a relative URL
# To obtain the absolute URL address we will use urljoin

from urllib.parse import urljoin

In [92]:
# Now we need the address of the current page + the relative URL to compute the full-path URL
base_site

'https://en.wikipedia.org/wiki/Music'

In [93]:
relative_url = link['href']
relative_url

'/wiki/File:The_Sounds_of_Earth_-_GPN-2000-001976.jpg'

In [94]:
full_url = urljoin(base_site, relative_url)
full_url

'https://en.wikipedia.org/wiki/File:The_Sounds_of_Earth_-_GPN-2000-001976.jpg'
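`urljoin` handles several kinds of `href` values differently. The sketch below resolves four representative forms against the same base address; the specific `href` strings are picked for illustration.

```python
from urllib.parse import urljoin

base = "https://en.wikipedia.org/wiki/Music"

hrefs = [
    "/wiki/Talk:Music",                            # root-relative path
    "#mw-head",                                    # fragment on the current page
    "//scholar.google.com/scholar?q=%22Music%22",  # protocol-relative URL
    "https://www.jstor.org/",                      # already absolute
]

# Map each href to its absolute form
resolved = {href: urljoin(base, href) for href in hrefs}

for href, full in resolved.items():
    print(href, "->", full)
```

A root-relative path replaces the path of the base, a fragment is appended to the base, a protocol-relative URL inherits only the scheme, and an absolute URL is returned unchanged.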

## Processing multiple links at once

In [95]:
# We will work with:
links

[<a id="top"></a>,
 <a href="/wiki/Wikipedia:Protection_policy#semi" title="This article is semi-protected."><img alt="Page semi-protected" data-file-height="512" data-file-width="512" decoding="async" height="20" src="//upload.wikimedia.org/wikipedia/en/thumb/1/1b/Semi-protection-shackle.svg/20px-Semi-protection-shackle.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/1/1b/Semi-protection-shackle.svg/30px-Semi-protection-shackle.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/1/1b/Semi-protection-shackle.svg/40px-Semi-protection-shackle.svg.png 2x" width="20"/></a>,
 <a class="mw-jump-link" href="#mw-head">Jump to navigation</a>,
 <a class="mw-jump-link" href="#searchInput">Jump to search</a>,
 <a class="mw-disambig" href="/wiki/Music_(disambiguation)" title="Music (disambiguation)">Music (disambiguation)</a>,
 <a class="external text" href="https://en.wikipedia.org/w/index.php?title=Music&amp;action=edit">improve it</a>,
 <a href="/wiki/Talk:Music" title="Talk:Music

In [96]:
# Examining the links' addresses
# Note: l['href'] would raise a KeyError for links without an href attribute, so we use l.get('href')
[l.get('href') for l in links]

[None,
 '/wiki/Wikipedia:Protection_policy#semi',
 '#mw-head',
 '#searchInput',
 '/wiki/Music_(disambiguation)',
 'https://en.wikipedia.org/w/index.php?title=Music&action=edit',
 '/wiki/Talk:Music',
 '/wiki/Help:Maintenance_template_removal',
 '/wiki/File:Question_book-new.svg',
 '/wiki/Wikipedia:Verifiability',
 'https://en.wikipedia.org/w/index.php?title=Music&action=edit',
 '/wiki/Help:Referencing_for_beginners',
 '//www.google.com/search?as_eq=wikipedia&q=%22Music%22',
 '//www.google.com/search?tbm=nws&q=%22Music%22+-wikipedia&tbs=ar:1',
 '//www.google.com/search?&q=%22Music%22&tbs=bkt:s&tbm=bks',
 '//www.google.com/search?tbs=bks:1&q=%22Music%22+-wikipedia',
 '//scholar.google.com/scholar?q=%22Music%22',
 'https://www.jstor.org/action/doBasicSearch?Query=%22Music%22&acc=on&wc=on',
 '/wiki/Help:Maintenance_template_removal',
 '/wiki/Wikipedia:Article_size',
 '/wiki/Wikipedia:Splitting',
 '/wiki/Wikipedia:Summary_style',
 '/wiki/Help:Section#Subsections',
 '/wiki/Talk:Music',
 '/wik

In [97]:
# Notice that some links don't have a URL (None appears)

# Dropping the links without an href attribute
clean_links = [l for l in links if l.get('href') is not None]

In [98]:
# Obtaining the relative URLs
relative_urls = [link.get('href') for link in clean_links]
relative_urls

['/wiki/Wikipedia:Protection_policy#semi',
 '#mw-head',
 '#searchInput',
 '/wiki/Music_(disambiguation)',
 'https://en.wikipedia.org/w/index.php?title=Music&action=edit',
 '/wiki/Talk:Music',
 '/wiki/Help:Maintenance_template_removal',
 '/wiki/File:Question_book-new.svg',
 '/wiki/Wikipedia:Verifiability',
 'https://en.wikipedia.org/w/index.php?title=Music&action=edit',
 '/wiki/Help:Referencing_for_beginners',
 '//www.google.com/search?as_eq=wikipedia&q=%22Music%22',
 '//www.google.com/search?tbm=nws&q=%22Music%22+-wikipedia&tbs=ar:1',
 '//www.google.com/search?&q=%22Music%22&tbs=bkt:s&tbm=bks',
 '//www.google.com/search?tbs=bks:1&q=%22Music%22+-wikipedia',
 '//scholar.google.com/scholar?q=%22Music%22',
 'https://www.jstor.org/action/doBasicSearch?Query=%22Music%22&acc=on&wc=on',
 '/wiki/Help:Maintenance_template_removal',
 '/wiki/Wikipedia:Article_size',
 '/wiki/Wikipedia:Splitting',
 '/wiki/Wikipedia:Summary_style',
 '/wiki/Help:Section#Subsections',
 '/wiki/Talk:Music',
 '/wiki/Help:

In [99]:
# Transforming to absolute path URLs
full_urls = [urljoin(base_site, url) for url in relative_urls]
full_urls

['https://en.wikipedia.org/wiki/Wikipedia:Protection_policy#semi',
 'https://en.wikipedia.org/wiki/Music#mw-head',
 'https://en.wikipedia.org/wiki/Music#searchInput',
 'https://en.wikipedia.org/wiki/Music_(disambiguation)',
 'https://en.wikipedia.org/w/index.php?title=Music&action=edit',
 'https://en.wikipedia.org/wiki/Talk:Music',
 'https://en.wikipedia.org/wiki/Help:Maintenance_template_removal',
 'https://en.wikipedia.org/wiki/File:Question_book-new.svg',
 'https://en.wikipedia.org/wiki/Wikipedia:Verifiability',
 'https://en.wikipedia.org/w/index.php?title=Music&action=edit',
 'https://en.wikipedia.org/wiki/Help:Referencing_for_beginners',
 'https://www.google.com/search?as_eq=wikipedia&q=%22Music%22',
 'https://www.google.com/search?tbm=nws&q=%22Music%22+-wikipedia&tbs=ar:1',
 'https://www.google.com/search?&q=%22Music%22&tbs=bkt:s&tbm=bks',
 'https://www.google.com/search?tbs=bks:1&q=%22Music%22+-wikipedia',
 'https://scholar.google.com/scholar?q=%22Music%22',
 'https://www.jstor.

In [100]:
# Extracting only URLs pointing to Wikipedia (internal URLs)
internal_links = [url for url in full_urls if 'wikipedia.org' in url]
internal_links

['https://en.wikipedia.org/wiki/Wikipedia:Protection_policy#semi',
 'https://en.wikipedia.org/wiki/Music#mw-head',
 'https://en.wikipedia.org/wiki/Music#searchInput',
 'https://en.wikipedia.org/wiki/Music_(disambiguation)',
 'https://en.wikipedia.org/w/index.php?title=Music&action=edit',
 'https://en.wikipedia.org/wiki/Talk:Music',
 'https://en.wikipedia.org/wiki/Help:Maintenance_template_removal',
 'https://en.wikipedia.org/wiki/File:Question_book-new.svg',
 'https://en.wikipedia.org/wiki/Wikipedia:Verifiability',
 'https://en.wikipedia.org/w/index.php?title=Music&action=edit',
 'https://en.wikipedia.org/wiki/Help:Referencing_for_beginners',
 'https://en.wikipedia.org/wiki/Help:Maintenance_template_removal',
 'https://en.wikipedia.org/wiki/Wikipedia:Article_size',
 'https://en.wikipedia.org/wiki/Wikipedia:Splitting',
 'https://en.wikipedia.org/wiki/Wikipedia:Summary_style',
 'https://en.wikipedia.org/wiki/Help:Section#Subsections',
 'https://en.wikipedia.org/wiki/Talk:Music',
 'https:
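The substring test above is simple, but it would also match any URL that merely mentions `wikipedia.org` somewhere, for example in a query string. A stricter alternative, sketched here as one possible approach, checks the host name instead. The helper name `is_wikipedia_url` and the `example.com` address are hypothetical.

```python
from urllib.parse import urlparse

def is_wikipedia_url(url):
    """Return True if the URL's host is wikipedia.org or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return host == "wikipedia.org" or host.endswith(".wikipedia.org")

candidates = [
    "https://en.wikipedia.org/wiki/Talk:Music",
    "https://www.google.com/search?as_eq=wikipedia&q=%22Music%22",
    "https://example.com/?ref=wikipedia.org",  # hypothetical URL that fools a substring test
]

internal = [url for url in candidates if is_wikipedia_url(url)]
print(internal)  # ['https://en.wikipedia.org/wiki/Talk:Music']
```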

# Extracting data from nested tags

In [101]:
# Our objective now is to extract all links that can be found under a section heading
# Marked as 'Main article:' or 'See also:'
# By quick inspection, we see that these are contained in div tags with attribute 'role' set to 'note'

div_notes = soup.find_all("div", {"role": "note"})
div_notes

[<div class="hatnote navigation-not-searchable" role="note">For other uses, see <a class="mw-disambig" href="/wiki/Music_(disambiguation)" title="Music (disambiguation)">Music (disambiguation)</a>.</div>,
 <div class="hatnote navigation-not-searchable" role="note">Main article: <a href="/wiki/History_of_music" title="History of music">History of music</a></div>,
 <div class="hatnote navigation-not-searchable" role="note">Further information: <a class="mw-redirect" href="/wiki/Origins_of_music" title="Origins of music">Origins of music</a> and <a href="/wiki/Prehistoric_music" title="Prehistoric music">Prehistoric music</a></div>,
 <div class="hatnote navigation-not-searchable" role="note">Main articles: <a href="/wiki/Music_of_Egypt" title="Music of Egypt">Music of Egypt</a> and <a href="/wiki/Music_of_Greece" title="Music of Greece">Music of Greece</a></div>,
 <div class="hatnote navigation-not-searchable" role="note">Main article: <a href="/wiki/Music_of_Asia" title="Music of Asia">M

In [102]:
div_notes[0]

<div class="hatnote navigation-not-searchable" role="note">For other uses, see <a class="mw-disambig" href="/wiki/Music_(disambiguation)" title="Music (disambiguation)">Music (disambiguation)</a>.</div>

In [103]:
# We can apply find() and find_all() to a tag in the same way we do it to the whole document
div_notes[0].find('a')

<a class="mw-disambig" href="/wiki/Music_(disambiguation)" title="Music (disambiguation)">Music (disambiguation)</a>

In [104]:
# A naive approach to get all links would be to use find
div_links = [div.find('a') for div in div_notes]
div_links

[<a class="mw-disambig" href="/wiki/Music_(disambiguation)" title="Music (disambiguation)">Music (disambiguation)</a>,
 <a href="/wiki/History_of_music" title="History of music">History of music</a>,
 <a class="mw-redirect" href="/wiki/Origins_of_music" title="Origins of music">Origins of music</a>,
 <a href="/wiki/Music_of_Egypt" title="Music of Egypt">Music of Egypt</a>,
 <a href="/wiki/Music_of_Asia" title="Music of Asia">Music of Asia</a>,
 <a class="mw-redirect" href="/wiki/Western_Classical_Music" title="Western Classical Music">Western Classical Music</a>,
 <a href="/wiki/Baroque_music" title="Baroque music">Baroque music</a>,
 <a href="/wiki/Classical_period_(music)" title="Classical period (music)">Classical period (music)</a>,
 <a href="/wiki/Romantic_music" title="Romantic music">Romantic music</a>,
 <a href="/wiki/20th-century_music" title="20th-century music">20th-century music</a>,
 <a href="/wiki/Musical_composition" title="Musical composition">Musical composition</a>,
 

In [105]:
len(div_links)

39

In [106]:
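# However, find() returns only the first link, and some divs contain more than one
# (div_notes[2] in the output above, for example, has two). Let's look at a single div: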
div_notes[6]

<div class="hatnote navigation-not-searchable" role="note">Main article: <a href="/wiki/Baroque_music" title="Baroque music">Baroque music</a></div>

In [107]:
# find_all() returns a list of every link in the div (this particular div contains a single link)
div_notes[6].find_all('a')

[<a href="/wiki/Baroque_music" title="Baroque music">Baroque music</a>]
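
To make the difference between find() and find_all() concrete, here is a short sketch on hand-written HTML. The two divs and the names `two_links`, `no_links`, `first_only` and `all_links` are assumptions made for this example only.

```python
from bs4 import BeautifulSoup

# Hypothetical hatnote with two links, and another one with no links at all
two_links = BeautifulSoup(
    '<div role="note">Main articles: <a href="/wiki/Music_of_Egypt">Music of Egypt</a>'
    ' and <a href="/wiki/Music_of_Greece">Music of Greece</a></div>',
    "html.parser",
).div
no_links = BeautifulSoup('<div role="note">No links here</div>', "html.parser").div

# find() stops at the first match; find_all() collects every match
first_only = two_links.find("a")
all_links = two_links.find_all("a")
print(first_only.text)               # Music of Egypt
print([a.text for a in all_links])   # ['Music of Egypt', 'Music of Greece']

# With no match, find() gives None and find_all() gives an empty list
print(no_links.find("a"), len(no_links.find_all("a")))   # None 0
```

Because find() can return None, calling .get('href') directly on its result fails for divs without links, whereas looping over the result of find_all() simply does nothing in that case.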

In [108]:
# Therefore we need to use find_all
# Let's use a for loop

# Define initially empty list of links
div_links = []

for div in div_notes:
    anchors = div.find_all('a')
    
    # Need to add every link from anchors to div_links
    for a in anchors:
        div_links.append(a)
    
    # Can use div_links.extend(anchors) instead of the for loop
    

In [109]:
div_links

[<a class="mw-disambig" href="/wiki/Music_(disambiguation)" title="Music (disambiguation)">Music (disambiguation)</a>,
 <a href="/wiki/History_of_music" title="History of music">History of music</a>,
 <a class="mw-redirect" href="/wiki/Origins_of_music" title="Origins of music">Origins of music</a>,
 <a href="/wiki/Prehistoric_music" title="Prehistoric music">Prehistoric music</a>,
 <a href="/wiki/Music_of_Egypt" title="Music of Egypt">Music of Egypt</a>,
 <a href="/wiki/Music_of_Greece" title="Music of Greece">Music of Greece</a>,
 <a href="/wiki/Music_of_Asia" title="Music of Asia">Music of Asia</a>,
 <a class="mw-redirect" href="/wiki/Western_Classical_Music" title="Western Classical Music">Western Classical Music</a>,
 <a href="/wiki/Baroque_music" title="Baroque music">Baroque music</a>,
 <a href="/wiki/Classical_period_(music)" title="Classical period (music)">Classical period (music)</a>,
 <a href="/wiki/Romantic_music" title="Romantic music">Romantic music</a>,
 <a href="/wiki/

In [110]:
# We now have a complete list
len(div_links)

48
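
The flattening step in the loop above can be written in several equivalent ways. The sketch below uses plain strings as stand-ins for the anchor tags; the list `anchors_per_div` and the three result names are hypothetical and exist only for this illustration.

```python
# Hypothetical stand-ins for the lists returned by find_all() on three divs
anchors_per_div = [["a1"], ["a2", "a3"], []]

# Version 1: nested for loop with append(), one element at a time
links_append = []
for anchors in anchors_per_div:
    for a in anchors:
        links_append.append(a)

# Version 2: extend() adds all elements of a list in one call
links_extend = []
for anchors in anchors_per_div:
    links_extend.extend(anchors)

# Version 3: a nested list comprehension
links_comp = [a for anchors in anchors_per_div for a in anchors]

print(links_append == links_extend == links_comp)   # True
print(links_comp)                                   # ['a1', 'a2', 'a3']
```

Note that append(anchors) would add the whole list as a single element, which is why the loop uses either append() per element or extend() per list.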

In [111]:
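# Let's get the absolute URLs (the href values above are relative)
from urllib.parse import urljoin  # standard library; harmless if already imported earlier

note_urls = [urljoin(base_site, a.get('href')) for a in div_links]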
note_urls

['https://en.wikipedia.org/wiki/Music_(disambiguation)',
 'https://en.wikipedia.org/wiki/History_of_music',
 'https://en.wikipedia.org/wiki/Origins_of_music',
 'https://en.wikipedia.org/wiki/Prehistoric_music',
 'https://en.wikipedia.org/wiki/Music_of_Egypt',
 'https://en.wikipedia.org/wiki/Music_of_Greece',
 'https://en.wikipedia.org/wiki/Music_of_Asia',
 'https://en.wikipedia.org/wiki/Western_Classical_Music',
 'https://en.wikipedia.org/wiki/Baroque_music',
 'https://en.wikipedia.org/wiki/Classical_period_(music)',
 'https://en.wikipedia.org/wiki/Romantic_music',
 'https://en.wikipedia.org/wiki/20th-century_music',
 'https://en.wikipedia.org/wiki/Musical_composition',
 'https://en.wikipedia.org/wiki/Performance',
 'https://en.wikipedia.org/wiki/Musical_improvisation',
 'https://en.wikipedia.org/wiki/Musical_notation',
 'https://en.wikipedia.org/wiki/Elements_of_music',
 'https://en.wikipedia.org/wiki/Pitch_(music)',
 'https://en.wikipedia.org/wiki/Melody',
 'https://en.wikipedia.org/

In [112]:
len(note_urls)

48
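
For reference, this is roughly how urljoin from the standard library treats the kinds of href values one tends to meet on a page. The variable `base` and the example hrefs are chosen for illustration; only the first one mirrors the links scraped above.

```python
from urllib.parse import urljoin

base = "https://en.wikipedia.org/wiki/Music"

# A root-relative href replaces the whole path of the base URL
print(urljoin(base, "/wiki/History_of_music"))
# https://en.wikipedia.org/wiki/History_of_music

# An href that is already absolute is returned unchanged
print(urljoin(base, "https://example.org/page"))
# https://example.org/page

# A fragment-only href is attached to the base page itself
print(urljoin(base, "#History"))
# https://en.wikipedia.org/wiki/Music#History
```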

# Scraping multiple pages automatically - Extracting all the text from the note URLs

In [113]:
# We will use the links we obtained above
note_urls

['https://en.wikipedia.org/wiki/Music_(disambiguation)',
 'https://en.wikipedia.org/wiki/History_of_music',
 'https://en.wikipedia.org/wiki/Origins_of_music',
 'https://en.wikipedia.org/wiki/Prehistoric_music',
 'https://en.wikipedia.org/wiki/Music_of_Egypt',
 'https://en.wikipedia.org/wiki/Music_of_Greece',
 'https://en.wikipedia.org/wiki/Music_of_Asia',
 'https://en.wikipedia.org/wiki/Western_Classical_Music',
 'https://en.wikipedia.org/wiki/Baroque_music',
 'https://en.wikipedia.org/wiki/Classical_period_(music)',
 'https://en.wikipedia.org/wiki/Romantic_music',
 'https://en.wikipedia.org/wiki/20th-century_music',
 'https://en.wikipedia.org/wiki/Musical_composition',
 'https://en.wikipedia.org/wiki/Performance',
 'https://en.wikipedia.org/wiki/Musical_improvisation',
 'https://en.wikipedia.org/wiki/Musical_notation',
 'https://en.wikipedia.org/wiki/Elements_of_music',
 'https://en.wikipedia.org/wiki/Pitch_(music)',
 'https://en.wikipedia.org/wiki/Melody',
 'https://en.wikipedia.org/

In [None]:
# The objective is to get all the useful text from those Wikipedia pages

# We will do that by extracting all text contained in a paragraph element,
# for all paragraphs on a page,
# for all pages (in note_urls)

In [118]:
# Initialize a list to store the paragraph text of each webpage
par_text = []

# Loop through each URL in note_urls, counting the iterations from 1
for i, url in enumerate(note_urls, start=1):
    
    # Connect to the webpage
    note_resp = requests.get(url)
    
    # Check whether the request was successful
    if note_resp.status_code == 200:            # Everything is OK!
        print('URL #{0}: {1}'.format(i, url))   # print the iteration number and the URL to keep track of our place in the loop
    
    else:                                       # Something is wrong!
        print('Status code {0}: Skipping URL #{1}: {2}'.format(note_resp.status_code, i, url))
        par_text.append([])                     # keep par_text aligned with note_urls
        continue
    
    # Get the HTML from the webpage
    note_html = note_resp.content
    
    # Convert the HTML to a BeautifulSoup object
    note_soup = BeautifulSoup(note_html, 'html.parser')
    
    # Find all "p" tags on the webpage
    note_pars = note_soup.find_all("p")
    
    # Get the text from each "p" tag
    text = [p.text for p in note_pars]
    
    # Append this page's list of paragraph texts to par_text
    par_text.append(text)


URL #1: https://en.wikipedia.org/wiki/Music_(disambiguation)
URL #2: https://en.wikipedia.org/wiki/History_of_music
URL #3: https://en.wikipedia.org/wiki/Origins_of_music
URL #4: https://en.wikipedia.org/wiki/Prehistoric_music
URL #5: https://en.wikipedia.org/wiki/Music_of_Egypt
URL #6: https://en.wikipedia.org/wiki/Music_of_Greece
URL #7: https://en.wikipedia.org/wiki/Music_of_Asia
URL #8: https://en.wikipedia.org/wiki/Western_Classical_Music
URL #9: https://en.wikipedia.org/wiki/Baroque_music
URL #10: https://en.wikipedia.org/wiki/Classical_period_(music)
URL #11: https://en.wikipedia.org/wiki/Romantic_music
URL #12: https://en.wikipedia.org/wiki/20th-century_music
URL #13: https://en.wikipedia.org/wiki/Musical_composition
URL #14: https://en.wikipedia.org/wiki/Performance
URL #15: https://en.wikipedia.org/wiki/Musical_improvisation
URL #16: https://en.wikipedia.org/wiki/Musical_notation
URL #17: https://en.wikipedia.org/wiki/Elements_of_music
URL #18: https://en.wikipedia.org/wiki/P
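
Checking status_code only covers requests that received a reply. A request can also fail with an exception (no connection, timeout, malformed URL), which would stop the loop halfway. The following is a minimal sketch of one way to guard against that; the helper name `fetch_html` and the 10-second default timeout are assumptions for this example, not part of the notebook.

```python
import requests

def fetch_html(url, timeout=10):
    """Return the page's HTML as bytes, or None if the request fails.

    Hypothetical helper, written for illustration only.
    """
    try:
        resp = requests.get(url, timeout=timeout)
    except requests.exceptions.RequestException as err:
        # Covers connection errors, timeouts, malformed URLs, etc.
        print('Request failed for {0}: {1}'.format(url, err))
        return None

    if resp.status_code != 200:
        print('Status code {0} for {1}'.format(resp.status_code, url))
        return None

    return resp.content

# A malformed URL raises an exception before any network traffic happens;
# the helper turns it into None instead of stopping the program
result = fetch_html("not-a-valid-url")
print(result)   # None
```

Inside a scraping loop, a None result could be handled the same way as a non-200 status code: report it and move on to the next URL.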

In [135]:
# Inspecting the result: one list of paragraph strings per page
par_text


[['Music is an art form consisting of sound and silence, expressed through time.\n',
  'Music may also refer to:\n'],
 ['\n',
  'Although definitions of music vary wildly throughout the world, every known culture partakes in it, and it is thus considered a cultural universal. The origins of music remain highly contentious; commentators often relate it to the origin of language, with much disagreement surrounding whether music arose before, after or simultaneously with language. Many theories have been proposed by scholars from a wide range of disciplines, though none have achieved broad approval. Most cultures have their own mythical origins concerning the invention of music, generally rooted in their respective mythological, religious or philosophical beliefs.\n',
  "The music of prehistoric cultures is first firmly dated to c.\u200940,000\xa0BP of the Upper Paleolithic by evidence of bone flutes, though it remains unclear whether or not the actual origins lie in the earlier Middle Pa

In [132]:
# We see that, for each page, we have a list of paragraph strings
# It would be more useful to have all the text of a page as one string, not as a list of strings

# Merging all paragraphs of the first page into one long string
page_text = "".join(par_text[0])
page_text

'Music is an art form consisting of sound and silence, expressed through time.\nMusic may also refer to:\n'

In [141]:
# Let's do that for all pages

# Merging all paragraphs for all pages
page_text = ["".join(text) for text in par_text]

# Inspect the result for some webpage
page_text[0]

'Music is an art form consisting of sound and silence, expressed through time.\nMusic may also refer to:\n'
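
As a side note on the joining step, here is a small sketch of what "".join() does with newline-terminated paragraphs, plus one optional clean-up variant. The list `paragraphs` is a made-up stand-in for one element of par_text, and the clean-up is a suggestion rather than something the notebook does.

```python
# Hypothetical paragraph list in the same shape as one element of par_text
paragraphs = ['\n', 'First paragraph.\n', 'Second paragraph.\n']

# "".join() concatenates the strings exactly as they are
joined = "".join(paragraphs)
print(repr(joined))    # '\nFirst paragraph.\nSecond paragraph.\n'

# Optional clean-up: strip each paragraph, drop the empty ones,
# and join the rest with a single newline
cleaned = "\n".join(p.strip() for p in paragraphs if p.strip())
print(repr(cleaned))   # 'First paragraph.\nSecond paragraph.'
```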

In [146]:
# Inspect the result for another page (index 4 is 'Music of Egypt')
print(page_text[4])

Music has been an integral part of Egyptian culture since antiquity in Egypt. Egyptian music had a significant impact on the development of ancient Greek music, and via the Greeks it was important to early European music well into the Middle Ages. Due to the thousands of years long dominance of Egypt over its neighbors, Egyptian culture, including music and musical instruments, was very influential in the surrounding regions; for instance, the instruments claimed in the Bible to have been played by the ancient Hebrews are all Egyptian instruments as established by Egyptian archaeology. Egyptian modern music is considered as a main core of Middle Eastern and Oriental music as it has a huge influence on the region due to the popularity and huge influence of Egyptian cinema and music industries, owing to the political influence Egypt has on its neighboring countries, as well as Egypt producing the most accomplished musicians and composers in the region, specially in the 20th century, a lo

In [148]:
# Creating a dictionary with the (key, value) pairs being (url, text)
# zip() pairs each URL with the text at the same position; dict() turns those pairs into a dictionary
url_to_text = dict(zip(note_urls, page_text))

url_to_text

{'https://en.wikipedia.org/wiki/Music_(disambiguation)': 'Music is an art form consisting of sound and silence, expressed through time.\nMusic may also refer to:\n',
 'https://en.wikipedia.org/wiki/History_of_music': '\nAlthough definitions of music vary wildly throughout the world, every known culture partakes in it, and it is thus considered a cultural universal. The origins of music remain highly contentious; commentators often relate it to the origin of language, with much disagreement surrounding whether music arose before, after or simultaneously with language. Many theories have been proposed by scholars from a wide range of disciplines, though none have achieved broad approval. Most cultures have their own mythical origins concerning the invention of music, generally rooted in their respective mythological, religious or philosophical beliefs.\nThe music of prehistoric cultures is first firmly dated to c.\u200940,000\xa0BP of the Upper Paleolithic by evidence of bone flutes, tho
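
The dict(zip(...)) idiom is easier to follow on a tiny example. The lists `urls` and `texts` below are hypothetical and deliberately of different lengths, to show that zip() pairs items purely by position.

```python
# Hypothetical short lists standing in for note_urls and page_text
urls = ['https://example.org/a', 'https://example.org/b', 'https://example.org/c']
texts = ['text of a', 'text of b']

# zip() pairs items position by position and stops at the shorter list
pairs = list(zip(urls, texts))
print(pairs)
# [('https://example.org/a', 'text of a'), ('https://example.org/b', 'text of b')]

# dict() turns the (key, value) pairs into a dictionary
mapping = dict(pairs)
print(mapping['https://example.org/b'])     # text of b

# The third URL has no partner, so it is silently left out
print('https://example.org/c' in mapping)   # False
```

Since the pairing is positional, the two lists need the same length and the same order for every URL to end up next to its own text.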

In [149]:
print(url_to_text['https://en.wikipedia.org/wiki/Music_theory'])


Music theory is the study of the practices and possibilities of music. The Oxford Companion to Music describes three interrelated uses of the term "music theory". The first is the "rudiments", that are needed to understand music notation (key signatures, time signatures, and rhythmic notation); the second is learning scholars' views on music from antiquity to the present; the third is a sub-topic of musicology that "seeks to define processes and general principles in music". The musicological approach to theory differs from music analysis "in that it takes as its starting-point not the individual work or performance but the fundamental materials from which it is built."[1]
Music theory is frequently concerned with describing how musicians and composers make music, including tuning systems and composition methods among other topics. Because of the ever-expanding conception of what constitutes music, a more inclusive definition could be the consideration of any sonic phenomena, includin

In [None]:
# A word of caution:
# We have not extracted all of the main content's text,
# as some text may be contained in lists and tables, outside the paragraphs we scraped
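
One possible way to widen the net is to pass a list of tag names to find_all(), which matches any of them. The sketch below runs on a hand-written snippet; `sample_html`, `only_pars` and `wider`, as well as the choice of "li" and "td" as extra tags, are assumptions for illustration and would need adjusting for real pages.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment with text inside and outside of paragraphs
sample_html = """
<p>Intro paragraph.</p>
<ul><li>First item</li><li>Second item</li></ul>
<table><tr><td>Cell</td></tr></table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Paragraphs only, as in the scraping loop
only_pars = [p.text for p in sample_soup.find_all("p")]

# Passing a list of names matches any of those tags, in document order
wider = [tag.text for tag in sample_soup.find_all(["p", "li", "td"])]

print(only_pars)   # ['Intro paragraph.']
print(wider)       # ['Intro paragraph.', 'First item', 'Second item', 'Cell']
```

On real pages these tags can be nested (for example a paragraph inside a list item), in which case the same text would be collected more than once, so the result may need de-duplication.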