# IST652 Advanced Topic Presentation: Beautiful Soup

### Vidushi Mishra, Ruiwei Zhang, Bhargav Konakanchi


![How To Scrape Web Pages with Beautiful Soup and Python 3.jpg](https://community-cdn-digitalocean-com.global.ssl.fastly.net/variants/TnmVb22Ayu65PHezWt2UVJLh/035575f2985fe451d86e717d73691e533a1a00545d7230900ed786341dc3c882)
- Image source: https://www.digitalocean.com/community/tutorials/how-to-scrape-web-pages-with-beautiful-soup-and-python-3

## 1. What is Beautiful Soup? What is it used for?

Beautiful Soup, whose name alludes to the Mock Turtle’s song in Chapter 10 of Lewis Carroll’s Alice’s Adventures in Wonderland, is a Python library for pulling data out of HTML and XML documents, which makes it well suited to quick web scraping projects. The current major version is Beautiful Soup 4, installed as the `beautifulsoup4` package. It runs on Python 3; releases up to 4.9.3 also supported Python 2.7. Given an HTML or XML document, Beautiful Soup builds a parse tree that can be searched and navigated.

In [2]:
pip install requests

Note: you may need to restart the kernel to use updated packages.


In [1]:
pip install beautifulsoup4

Note: you may need to restart the kernel to use updated packages.
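
To make the idea of a parse tree concrete, here is a minimal sketch that parses a small hand-written HTML string (an invented snippet, not the page scraped later in this notebook) and navigates the resulting tree. It assumes `beautifulsoup4` has been installed as shown above.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML document used only for illustration
html = """
<html>
  <head><title>Artists</title></head>
  <body>
    <p class="intro">Names beginning with Z</p>
    <a href="/artist/1">Zadkine, Ossip</a>
  </body>
</html>
"""

# Build the parse tree with Python's built-in parser
soup = BeautifulSoup(html, 'html.parser')

# Tags can be reached as attributes of the tree
page_title = soup.title.string   # text inside <title>
intro_text = soup.p.get_text()   # text inside the first <p>
first_href = soup.a['href']      # attribute lookup on the first <a>

print(page_title)
print(intro_text)
print(first_href)
```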


## 2. Import Beautiful Soup and other required packages

In [8]:
# Import libraries
import requests
from bs4 import BeautifulSoup

## 3. Collecting a Web Page with Requests

Next, we fetch the first web page with Requests. We pass the page’s URL to the `requests.get()` method and assign the response it returns to the variable `page`.

In [9]:
# Collect first page of artists’ list
page = requests.get('https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anZ1.htm')
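
A request can fail or hang, so it can help to check the response before parsing it. The sketch below wraps the call in a hypothetical helper, `fetch_html` (not part of the original notebook), that sets a timeout and raises an error on a 4xx or 5xx status. The 10-second default is an arbitrary assumption.

```python
import requests


def fetch_html(url, timeout=10):
    """Hypothetical helper: return the page body as text, or raise on failure."""
    # timeout stops the call from waiting indefinitely on an unresponsive server
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return response.text
```

Called with the archive URL used in the cell above, `fetch_html` would return the same string that `page.text` holds.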

## 4. Collecting and Parsing a Web Page

We’ll now create a BeautifulSoup object, or a parse tree. This object takes as its arguments the page.text document from Requests (the content of the server’s response) and the name of the parser to use, here Python’s built-in html.parser.

![The content of the web](https://assets.digitalocean.com/articles/eng_python/beautiful-soup/web-page-inspector.png)  
We’ll see first that the table of names is within `<div>` tags where `class="BodyText"`. This is important to note so that we only search for text within this section of the web page. We also notice that the name Zabaglia, Niccola is in a link tag, since the name references a web page that describes the artist. So we will want to reference the `<a>` tag for links. Each artist’s name is a reference to a link.
We’ll use Beautiful Soup’s `find()` and `find_all()` methods to pull the text of the artists’ names from the BodyText `<div>`.


In [14]:
# Create a BeautifulSoup object
soup = BeautifulSoup(page.text, 'html.parser')
# Find the first element whose class is BodyText (the <div> holding the names)
artist_name_list = soup.find(class_='BodyText')
# Collect every <a> tag inside the BodyText div
artist_name_list_items = artist_name_list.find_all('a')
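
The difference between the two methods is easiest to see on a small example. The sketch below uses an invented HTML fragment that only mimics the structure described above (a `BodyText` div containing `<a>` tags); the `href` values and the `Header` div are placeholders.

```python
from bs4 import BeautifulSoup

# Invented fragment mimicking the structure of the artist page
html = """
<div class="Header"><a href="/home">Home</a></div>
<div class="BodyText">
  <a href="/artist/1">Zabaglia, Niccola</a>
  <a href="/artist/2">Zaccone, Fabian</a>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# find() returns the first matching element, or None if nothing matches
body = soup.find(class_='BodyText')
missing = soup.find(class_='NoSuchClass')

# find_all() returns every match inside the element it is called on
links_in_body = body.find_all('a')
links_in_page = soup.find_all('a')

print(len(links_in_body))   # links inside the BodyText div only
print(len(links_in_page))   # every link in the document
print(missing)
```

Calling `find_all()` on the div rather than on `soup` is what restricts the search to the table of names.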

Next, we create a for loop to iterate over all the artist names that we just put into the `artist_name_list_items` variable.

We’ll print these names out with the prettify() method in order to turn the Beautiful Soup parse tree into a nicely formatted Unicode string.

In [15]:
# Create for loop to print out all artists' names
for artist_name in artist_name_list_items:
    print(artist_name.prettify())

<a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=11630">
 Zabaglia, Niccola
</a>
<a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=34202">
 Zaccone, Fabian
</a>
<a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=3475">
 Zadkine, Ossip
</a>
<a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=25135">
 Zaech, Bernhard
</a>
<a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=2298">
 Zagar, Jacob
</a>
<a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=23988">
 Zagroba, Idalia
</a>
<a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=8232">
 Zaidenberg, A.
</a>
<a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=34154">
 Zaidenberg, Arthur
</a>
<a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=4910">
 Zaisinger, Matthäus
</a>
<a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch

## 5. Removing Superfluous Data

In order to remove the bottom links of the page, let’s again right-click and Inspect the DOM. We’ll see that the links on the bottom of the `<div class="BodyText">` section are contained in an HTML table: `<table class="AlphaNav">`. 

We can therefore use Beautiful Soup to find the AlphaNav class and use the `decompose()` method to remove a tag from the parse tree and then destroy it along with its contents.

We’ll use the variable `last_links` to reference these bottom links, then call `decompose()` on it to remove them.

In [16]:
# Remove bottom links
last_links = soup.find(class_='AlphaNav')
last_links.decompose()
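
The effect of `decompose()` can be checked on a small example. The fragment below is invented for illustration and only mirrors the layout described above. The guard against `None` is an added precaution, since `find()` returns `None` when a page has no such table.

```python
from bs4 import BeautifulSoup

# Invented fragment: an artist link followed by a navigation table
html = """
<div class="BodyText">
  <a href="/artist/1">Zabaglia, Niccola</a>
  <table class="AlphaNav"><tr><td><a href="/a">A</a></td></tr></table>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

links_before = len(soup.find(class_='BodyText').find_all('a'))

# Remove the navigation table, if present, together with everything inside it
last_links = soup.find(class_='AlphaNav')
if last_links is not None:
    last_links.decompose()

links_after = len(soup.find(class_='BodyText').find_all('a'))

print(links_before, links_after)
```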

## 6. Pulling the Contents from a Tag

In order to access only the actual artists’ names, we’ll want to target the contents of the `<a>` tags rather than print out the entire link tag.

We can do this with Beautiful Soup’s `.contents`, which will return the tag’s children as a Python list data type.
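
As a small illustration on invented tags, `.contents` returns a list whose items are the tag's direct children. The sketch also shows `get_text()`, a related method that joins all nested text. It is included because `.contents[0]` returns only the first child, which matters if a name ever contains nested markup.

```python
from bs4 import BeautifulSoup

# Invented tags: one with plain text, one with nested markup
html = '<a href="/1">Zao Wou-Ki</a><a href="/2">Zak, <i>Eugène</i></a>'
soup = BeautifulSoup(html, 'html.parser')
plain, nested = soup.find_all('a')

# .contents is a list of direct children
print(plain.contents)         # a single text node
print(len(nested.contents))   # a text node followed by an <i> tag

# .contents[0] is only the first child; get_text() joins all nested text
first_child = nested.contents[0]
full_text = nested.get_text()
print(first_child)
print(full_text)
```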

Let’s revise the for loop so that instead of printing the entire link and its tag, we’ll print the list of children (i.e. the artists’ full names).

In [18]:
artist_name_list = soup.find(class_='BodyText')
artist_name_list_items = artist_name_list.find_all('a')

# Use .contents to pull out the <a> tag’s children
for artist_name in artist_name_list_items:
    names = artist_name.contents[0]
    print(names)

Zabaglia, Niccola
Zaccone, Fabian
Zadkine, Ossip
Zaech, Bernhard
Zagar, Jacob
Zagroba, Idalia
Zaidenberg, A.
Zaidenberg, Arthur
Zaisinger, Matthäus
Zajac, Jack
Zak, Eugène
Zakharov, Gurii Fillipovich
Zakowortny, Igor
Zalce, Alfredo
Zalopany, Michele
Zammiello, Craig
Zammitt, Norman
Zampieri, Domenico
Zampieri, called Domenichino, Domenico
Zanartú, Enrique Antunez
Zanchi, Antonio
Zanetti, Anton Maria
Zanetti Borzino, Leopoldina
Zanetti I, Antonio Maria, conte
Zanguidi, Jacopo
Zanini, Giuseppe
Zanini-Viola, Giuseppe
Zanotti, Giampietro
Zao Wou-Ki


We have received back a list of all the artists’ names available on the first page of the letter Z.

However, what if we want to also capture the URLs associated with those artists? We can extract URLs found within a page’s `<a>` tags by using Beautiful Soup’s `get('href')` method.

From the output of the links above, we know that the entire URL is not being captured, so we will concatenate the link string with the front of the URL string (in this case https://web.archive.org/).

We’ll add these lines to the for loop as well:

In [19]:
for artist_name in artist_name_list_items:
    names = artist_name.contents[0]
    links = 'https://web.archive.org' + artist_name.get('href')
    print(names)
    print(links)

Zabaglia, Niccola
https://web.archive.org/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=11630
Zaccone, Fabian
https://web.archive.org/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=34202
Zadkine, Ossip
https://web.archive.org/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=3475
Zaech, Bernhard
https://web.archive.org/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=25135
Zagar, Jacob
https://web.archive.org/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=2298
Zagroba, Idalia
https://web.archive.org/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=23988
Zaidenberg, A.
https://web.archive.org/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=8232
Zaidenberg, Arthur
https://web.archive.org/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=34154
Zaisinger, Matthäus
https://web.archive.org/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=4910
Zajac, Jac
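
String concatenation works here because every `href` on this page begins with `/`. As an alternative sketch, the standard library's `urljoin()` resolves a link against a base URL and also copes with links that are already absolute. The second `href` below is a made-up example of such a link.

```python
from urllib.parse import urljoin

base = 'https://web.archive.org'

# A root-relative href, as seen in the page's <a> tags
relative_href = '/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=11630'
# A made-up href that is already a full URL
absolute_href = 'https://example.com/artist/1'

relative_link = urljoin(base, relative_href)
absolute_link = urljoin(base, absolute_href)

print(relative_link)
print(absolute_link)   # left as-is, because it is already absolute
```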

## 7. Writing the Data to a CSV File

Collecting data that only lives in a terminal window is not very useful. Comma-separated values (CSV) files allow us to store tabular data in plain text, and they are a common format for spreadsheets and databases.

We’ll create and open a file called z-artist-names.csv in 'w' (write) mode and wrap it in a `csv.writer` object, which we assign to the variable f. We’ll first write the top row headings, Name and Link, which we pass to the `writerow()` method as a list.

Within our for loop, we’ll write each row with the artists' names and their associated links. 


In [43]:
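import csv

# newline='' lets the csv module manage line endings; utf-8 preserves accented names
f = csv.writer(open('z-artist-names.csv', 'w', newline='', encoding='utf-8'))
# Write the header row
f.writerow(['Name', 'Link'])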
for artist_name in artist_name_list_items:
    names = artist_name.contents[0]
    links = 'https://web.archive.org' + artist_name.get('href')
    # Add each artist’s name and associated link to a row
    f.writerow([names, links])
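
To see what `writerow()` produces without touching the disk, the sketch below writes to an in-memory `io.StringIO` buffer instead of a file. The buffer and the single sample row are assumptions made for illustration. Because the artist names contain commas, the `csv` module wraps them in double quotes so that each name stays in one column.

```python
import csv
import io

# In-memory stand-in for a file, used only for illustration
buffer = io.StringIO()
writer = csv.writer(buffer)

header_line = 'Name,Link\r\n'   # what the header row looks like once written
writer.writerow(['Name', 'Link'])

link = 'https://web.archive.org/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=3475'
# writerow() passes back the return value of the underlying write() call,
# which for a text buffer is the number of characters written
chars_written = writer.writerow(['Zadkine, Ossip', link])

csv_text = buffer.getvalue()
print(csv_text)
print(chars_written)
```

In an interactive session, that return value is what appears as a number after each `writerow()` call.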


## 8. Retrieving Related Pages

We have created a program that will pull data from the first page of the list of artists whose last names start with the letter Z. However, the website spreads these artists across 4 pages in total.

To collect all of these pages, we build the list of page URLs with one `for` loop, then fetch and parse each page in a second loop.

In [46]:
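# Build the URLs of all four pages for the letter Z
pages = []
for i in range(1, 5):
    url = 'https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anZ' + str(i) + '.htm'
    pages.append(url)

# Reopen the CSV file in 'w' mode so that page 1 is not written a second time;
# the with block closes the file once every page has been processed
with open('z-artist-names.csv', 'w', newline='', encoding='utf-8') as csv_file:
    f = csv.writer(csv_file)
    f.writerow(['Name', 'Link'])

    for item in pages:
        page = requests.get(item)
        soup = BeautifulSoup(page.text, 'html.parser')

        # Remove the bottom navigation links, if the page has them
        last_links = soup.find(class_='AlphaNav')
        if last_links is not None:
            last_links.decompose()

        artist_name_list = soup.find(class_='BodyText')
        artist_name_list_items = artist_name_list.find_all('a')

        for artist_name in artist_name_list_items:
            names = artist_name.contents[0]
            links = 'https://web.archive.org' + artist_name.get('href')
            # Add each artist's name and associated link to a row
            f.writerow([names, links])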
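
The steps repeated for every page (parse, drop the navigation table, collect names and links) can also be gathered into one function, which makes them testable without a network connection. The sketch below is one possible arrangement, not part of the original notebook: `extract_artists` is a hypothetical helper, the HTML passed to it is an invented fragment that imitates the page structure, and it uses `get_text(strip=True)` in place of `.contents[0]`.

```python
from bs4 import BeautifulSoup

BASE_URL = 'https://web.archive.org'


def extract_artists(html):
    """Hypothetical helper: return (name, link) pairs found in one page of HTML."""
    soup = BeautifulSoup(html, 'html.parser')

    # Drop the bottom navigation table when the page has one
    last_links = soup.find(class_='AlphaNav')
    if last_links is not None:
        last_links.decompose()

    body = soup.find(class_='BodyText')
    if body is None:
        return []

    rows = []
    # href=True skips any <a> tag that has no href attribute
    for artist_name in body.find_all('a', href=True):
        name = artist_name.get_text(strip=True)
        link = BASE_URL + artist_name.get('href')
        rows.append((name, link))
    return rows


# Invented fragment imitating one page of the artist list
sample_html = """
<div class="BodyText">
  <a href="/web/1/artist?artistid=1">Zajac, Jack</a>
  <a href="/web/1/artist?artistid=2">Zalce, Alfredo</a>
  <table class="AlphaNav"><tr><td><a href="/web/1/z2">Z2</a></td></tr></table>
</div>
"""
artists = extract_artists(sample_html)
print(artists)
```

Each page's HTML could then be passed through the same function, with the returned pairs written to the CSV file one row at a time.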
