# Webscraping

If data isn't available through a database or an API, it may still be published on a website. Before scraping, check that you're allowed to: many sites don't like being scraped and say so in their terms of service. You can use requests to download a page and BeautifulSoup to parse its HTML into a tree you can search.

In [1]:
import requests
from bs4 import BeautifulSoup

In [7]:
url = 'http://www.selenayhaven.com/index.html'
result = requests.get(url)

In [8]:
result.status_code

200
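
Checking `status_code` by hand works in a notebook, but in a script it is easy to forget. A minimal sketch of wrapping the request so that a failed download raises an error instead of being parsed silently; the helper name `fetch_html` and the 10-second timeout are my own assumptions, not something from the course:

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML text (hypothetical helper).

    The timeout value is an arbitrary choice for illustration.
    """
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text


# Usage (needs network access, so it is left commented out here):
# html = fetch_html('http://www.selenayhaven.com/index.html')
```

With this in place, a 404 or 500 stops the script at the request rather than producing an empty or misleading parse later on.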

In [22]:
# Parse the downloaded HTML with BeautifulSoup
# The second argument selects the parser; it may need to be changed to 'lxml' for other websites
ps = BeautifulSoup(result.text, "html5lib")
result.close()
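
The parser named in the second argument has to be installed separately in the case of html5lib and lxml, while `html.parser` ships with Python. One possible way to cope with that, sketched under the assumption that you are happy with whichever parser is available: try them in order and fall back when BeautifulSoup reports that one is missing. The helper name `make_soup` and the preference order are hypothetical.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, parsers=("html5lib", "lxml", "html.parser")):
    """Parse markup with the first parser that is installed (hypothetical helper)."""
    for parser in parsers:
        try:
            return BeautifulSoup(markup, parser)
        except FeatureNotFound:
            # This parser library is not installed; try the next one
            continue
    raise RuntimeError("No usable HTML parser is installed")


# A tiny inline page stands in for result.text
soup = make_soup("<html><head><title>The Haven</title></head><body></body></html>")
print(soup.title.text)
```

Different parsers can build slightly different trees from malformed HTML, so for a real project it may be better to pin one parser than to fall back silently.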

In [14]:
print(ps)

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml"><head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>The Haven</title>
<meta content="Selenay" name="author"/>
<meta content="The home for Selenay's Stargate, Atlantis and Buffy fanfiction along with regularly updated multi-fandom fanfic reviews." name="description"/>
<meta content="BtVS, btvs, buffy the vampire slayer, buffy, slayer, b/g, b/w, buffy/giles, buffy/willow, fanfiction, haven, selenay, fanart, reviews, buffy/faith, b/f, fanfic reviews, recs, sg-1, stargate, jack, daniel, jack/daniel, slash, j/d, stargate atlantis, SG:A, teyla/weir, weir/teyla, femslash, f/f, sheppard/mckay, sheppard/rodney, m/m" name="keywords"/>
<link href="layout.css" rel="stylesheet" type="text/css"/>
</head>
<body>

<div><img alt="Header graphici: The Haven" class="head" height="100" src="images/the

In [15]:
# find returns the first tag that matches the search, so searching for "body" returns the page's body tag
body = ps.find(name="body")

In [17]:
print(body)

<body>

<div><img alt="Header graphici: The Haven" class="head" height="100" src="images/thehaven02.gif" width="600"/>
</div>
<div id="sideimage">
<img alt="Index artwork" height="450" src="images/mainpic01.jpg" width="300"/>
</div>
<div id="linkindex">
<h2 class="topindex"><a accesskey="w" href="whatnew.htm" tabindex="2" title="What's New. Access key: W">What's New</a></h2>
<p>Latest additions to The Haven. Last updated: 24 Oct 2010</p>
<h2><a accesskey="v" href="fic_index.php" tabindex="3" title="Haven Fiction: Multi-fandom fanfiction. Access key: V">Haven Fiction</a></h2>
<p>Buffy, Doctor Who, Stargate and Atlantis fanfiction by Selenay.</p>
<h2><a accesskey="r" href="reviews/" tabindex="4" title="Fanfic Reviews. Access key: R">Fanfiction reviews</a></h2>
<p>Fanfiction reviews and recommendations from Selenay and friends</p>
<h2><a accesskey="a" href="random/index.htm" tabindex="6" title="Wallpapers, photos and icons. Access key: A">Art Corner</a></h2>
<p>Wallpapers, icons and conve

In [18]:
# you can use the text property to get the text inside a tag. This is where I wish I'd used an h1 tag!
# Using the h2 brings back the first h2 level tag I used. Silly me.
title = ps.find(name="h2").text

In [19]:
print(title)

What's New


In [20]:
# to get all the tags of a type in one go, use find_all. This returns a list-like ResultSet you can loop over!
sections = ps.find_all(name="h2")
for section in sections:
    print(section.text)

What's New
Haven Fiction
Fanfiction reviews
Art Corner
Links
Dreamwidth
Contact
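
Each of those headings wraps a link, so the same loop can collect the link targets as well as the text. A small sketch using a trimmed copy of the markup printed earlier (attributes other than `href` and `class` are dropped for brevity) and the built-in `html.parser`, so it runs without downloading anything:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the page's link index, used here instead of a live request
html = """
<div id="linkindex">
<h2 class="topindex"><a href="whatnew.htm">What's New</a></h2>
<h2><a href="fic_index.php">Haven Fiction</a></h2>
<h2><a href="reviews/">Fanfiction reviews</a></h2>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

links = []
for heading in soup.find_all("h2"):
    anchor = heading.find("a")
    if anchor is not None:  # skip any heading that has no link inside it
        links.append((anchor.get_text(strip=True), anchor.get("href")))

for text, href in links:
    print(text, "->", href)
```

Using `anchor.get("href")` rather than `anchor["href"]` returns `None` instead of raising an error when a link has no `href` attribute.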


In [29]:
# you can also use tag attributes to select things
top_h2 = ps.find(name="h2", attrs={'class':'topindex'})
print(top_h2)

<h2 class="topindex"><a accesskey="w" href="whatnew.htm" tabindex="2" title="What's New. Access key: W">What's New</a></h2>
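
As far as I know, BeautifulSoup offers a couple of other spellings for the same class lookup: the `class_` keyword argument and CSS selectors through `select_one`. A short sketch comparing the three on a trimmed copy of the markup above; the variable names are my own:

```python
from bs4 import BeautifulSoup

# Two headings from the page, trimmed down to the attributes that matter here
html = (
    '<h2 class="topindex"><a href="whatnew.htm">What\'s New</a></h2>'
    '<h2><a href="fic_index.php">Haven Fiction</a></h2>'
)
soup = BeautifulSoup(html, "html.parser")

by_attrs = soup.find("h2", attrs={"class": "topindex"})  # the form used above
by_keyword = soup.find("h2", class_="topindex")          # class_ avoids Python's reserved word
by_css = soup.select_one("h2.topindex")                  # CSS selector syntax

# Once you have the tag, its child link's attributes read like a dictionary
print(by_attrs.a["href"])
```

The `attrs` dictionary is handy when the attribute name is not a valid Python keyword (for example `data-id`), while the CSS form tends to be shorter for nested selections.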


This was a simple website (and, on reflection, one badly designed for web scraping). In the course, the instructor combined these techniques: he parsed out the body, found the table element with a certain ID, and then looped through its td tags to scrape out the information he needed.
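
I don't have the instructor's page to hand, so here is a rough sketch of that body, table, td pattern on made-up HTML. The table ID `results` and the rows in it are invented purely for illustration.

```python
from bs4 import BeautifulSoup

# Invented example page; the id "results" and the data are hypothetical
html = """
<html><body>
<table id="results">
<tr><th>Name</th><th>Score</th></tr>
<tr><td>Alice</td><td>90</td></tr>
<tr><td>Bob</td><td>85</td></tr>
</table>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Step 1: parse out the body
body = soup.find("body")

# Step 2: find the table with a particular ID
table = body.find("table", attrs={"id": "results"})

# Step 3: loop through the rows and collect the text of each td cell
rows = []
for tr in table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # the header row contains th tags only, so it yields no cells
        rows.append(cells)

print(rows)
```

On a real page it is worth checking that `table` is not `None` before looping, since `find` returns `None` when nothing matches.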