# BeautifulSoup
- great 'screen scraping' package
- tons of interesting data on webpages designed for people, not programs
- makes it easy to extract information from complex web pages and XML documents
- often can figure out what to do by playing interactively
- [doc](http://www.crummy.com/software/BeautifulSoup/)

# Example
# Want to find all the headlines on the front page of the [New York Times](http://nyt.com)
- look at webpage source - html structure is quite complex
- would be very difficult with str.find() or regular expressions
- soup reads in the page of interest, then you can query it
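
The "read the page, then query it" idea can be tried without a network connection. The sketch below is only an illustration: the HTML string is a made-up stand-in for a downloaded page, and it assumes Python's built-in `html.parser` rather than `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page; not real nytimes.com markup.
html = """
<html><body>
  <h2 class="story-heading"><a href="/a.html">First headline</a></h2>
  <div><h2 class="story-heading"><a href="/b.html">Second headline</a></h2></div>
  <p>Not a headline</p>
</body></html>
"""

# 'html.parser' ships with Python, so no extra parser is needed here.
soup = BeautifulSoup(html, "html.parser")

# Query the parsed tree instead of searching the raw string.
headlines = [h2.get_text() for h2 in soup.find_all("h2")]
print(headlines)  # ['First headline', 'Second headline']
```

The nesting depth of each `h2` does not matter to the query, which is what makes this easier than string searching.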

In [1]:
# 'lxml' is an XML parser (it parses HTML too)
# from_encoding tells soup how to decode the raw bytes it reads

import urllib.request

from bs4 import BeautifulSoup
import lxml  # not used directly; fails early if the lxml parser is missing

nf2 = urllib.request.urlopen('http://nyt.com')
sp = BeautifulSoup(nf2, 'lxml', from_encoding='utf-8')
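
`from_encoding` only comes into play when soup is handed raw bytes, as it is when reading from `urlopen()`. A minimal sketch, assuming a made-up byte string in place of the real response and `html.parser` in place of `lxml`:

```python
from bs4 import BeautifulSoup

# Hypothetical page body as raw UTF-8 bytes, like the data read from a response.
raw = "<h2>Trump’s Love for Leaks</h2>".encode("utf-8")

# from_encoding applies to bytes input; a str would already be decoded.
soup = BeautifulSoup(raw, "html.parser", from_encoding="utf-8")

title = soup.h2.get_text()
print(title)                   # the curly apostrophe survives decoding
print(soup.original_encoding)  # the encoding soup settled on
```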

In [2]:
# headlines seem to be contained in 'h2' elements

sp.find_all('h2')[10:20]

[<h2 class="story-heading"><a href="https://www.nytimes.com/2017/02/15/us/politics/leaks-donald-trump.html">After His Election, Trump’s Love for Leaks Quickly Faded</a></h2>,
 <h2 class="story-heading"><i class="icon"></i><a href="https://www.nytimes.com/2017/02/16/us/politics/trump-russia-leaks-twitter.html">Trump Denounces ‘Low-Life Leakers,’ Pledging to Hunt Them Down</a> <time class="timestamp" data-eastern-timestamp="11:37 AM" data-utc-timestamp="1487263060" datetime="2017-02-16">11:37 AM ET</time></h2>,
 <h2 class="story-heading"><i class="icon"></i><a href="https://www.nytimes.com/2017/02/16/us/politics/campaign-over-president-trump-will-hold-a-what-else-campaign-rally.html">4 Weeks Into Term, Trump to Hold Campaign Rally</a> </h2>,
 <h2 class="story-heading"><a href="https://www.nytimes.com/2017/02/15/world/middleeast/benjamin-netanyahu-israel-trump.html">Peace in Mideast Doesn’t Demand 2 States, Trump Says</a></h2>,
 <h2 class="story-heading">
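
The headline elements in this output carry `class="story-heading"`, so the class can narrow the search. This is a sketch on invented markup that only borrows the class names seen in the notebook output; the page's real structure may differ and changes over time.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on class names from the notebook output.
html = """
<h2 class="branding"><a href="/">logo</a></h2>
<h2 class="story-heading"><a href="/1.html">Story one</a></h2>
<h2 class="story-heading"><a href="/2.html">Story two</a></h2>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ (with a trailing underscore) filters on the CSS class attribute.
stories = soup.find_all("h2", class_="story-heading")
print(len(stories))  # 2
```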
                                 

In [3]:
# first 'h2' element

h2 = sp.h2
h2

<h2 class="branding"><a href="http://www.nytimes.com/">
<svg aria-label="The New York Times" class="nyt-logo" height="64" role="img" width="379">
<image alt="The New York Times" border="0" height="64" src="https://a1.nyt.com/assets/homepage/20170201-155716/images/foundation/logos/nyt-logo-379x64.png" width="379" xlink:href="https://a1.nyt.com/assets/homepage/20170201-155716/images/foundation/logos/nyt-logo-379x64.svg"></image>
</svg>
</a></h2>
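
The attribute shortcut `sp.h2` returns only the first matching element, the same one `find('h2')` gives. A small sketch on a made-up snippet, using `html.parser` as an assumption:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with two h2 elements.
html = "<div><h2>first</h2><h2>second</h2></div>"
soup = BeautifulSoup(html, "html.parser")

first = soup.h2          # attribute access: first 'h2' in the document
same = soup.find("h2")   # explicit form of the same lookup

print(first is same)     # True
print(first.get_text())  # first
```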

In [4]:
# can pull the 'a' element out of the 'h2'
# this 'a' element wraps the logo image, not a headline

a = h2.find('a')
a

<a href="http://www.nytimes.com/">
<svg aria-label="The New York Times" class="nyt-logo" height="64" role="img" width="379">
<image alt="The New York Times" border="0" height="64" src="https://a1.nyt.com/assets/homepage/20170201-155716/images/foundation/logos/nyt-logo-379x64.png" width="379" xlink:href="https://a1.nyt.com/assets/homepage/20170201-155716/images/foundation/logos/nyt-logo-379x64.svg"></image>
</svg>
</a>
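
Once an `a` element is in hand, its attributes can be read like dictionary entries. The sketch below uses a simplified, invented version of the branding markup; `title` is just an example of an attribute that is absent.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical version of the branding element.
html = '<h2 class="branding"><a href="http://www.nytimes.com/"><img alt="logo"></a></h2>'
soup = BeautifulSoup(html, "html.parser")

link = soup.h2.find("a")  # searches only inside this h2
print(link["href"])       # dictionary-style access to an attribute
print(link.get("title"))  # .get() returns None when the attribute is missing
```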

In [5]:
# try pulling the first 'a' out of every 'h2' element
# find() returns None when an 'h2' has no link; the rest are mostly headlines

al = [h2.find('a') for h2 in sp.find_all('h2')]
al[:20]

[<a href="http://www.nytimes.com/">
 <svg aria-label="The New York Times" class="nyt-logo" height="64" role="img" width="379">
 <image alt="The New York Times" border="0" height="64" src="https://a1.nyt.com/assets/homepage/20170201-155716/images/foundation/logos/nyt-logo-379x64.png" width="379" xlink:href="https://a1.nyt.com/assets/homepage/20170201-155716/images/foundation/logos/nyt-logo-379x64.svg"></image>
 </svg>
 </a>,
 None,
 None,
 None,
 None,
 None,
 <a href="https://www.nytimes.com/2017/02/16/us/politics/congress-republicans-health-care-infrastructure-taxes.html">Republican Congress, Stuck at Starting Line, Jogs in Place</a>,
 <a href="https://www.nytimes.com/interactive/2017/02/16/us/politics/trump-news-conference-live-analysis.html">Trump to Announce His New Labor Pick</a>,
 None,
 <a href="https://www.nytimes.com/2017/02/15/us/politics/trump-intelligence-agencies-stephen-feinberg.html">Trump Plans to Have an Ally Review the U.S. Spy Agencies</a>,
 <a href="https://www.nyti
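
The `None` entries come from `h2` elements that contain no link, since `find()` returns `None` when nothing matches. A sketch on invented markup showing where they come from and one way to drop them:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one h2 with a link, one without.
html = """
<h2><a href="/1.html">Linked</a></h2>
<h2>No link here</h2>
"""
soup = BeautifulSoup(html, "html.parser")

links = [h2.find("a") for h2 in soup.find_all("h2")]
print(links)  # second entry is None

# Keep only the h2 elements that actually had a link.
present = [a for a in links if a is not None]
print(len(present))  # 1
```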

In [6]:
# pull out the contents (child nodes) of each 'a' link

[a.contents for a in al if a is not None][:30]

[['\n',
  <svg aria-label="The New York Times" class="nyt-logo" height="64" role="img" width="379">
  <image alt="The New York Times" border="0" height="64" src="https://a1.nyt.com/assets/homepage/20170201-155716/images/foundation/logos/nyt-logo-379x64.png" width="379" xlink:href="https://a1.nyt.com/assets/homepage/20170201-155716/images/foundation/logos/nyt-logo-379x64.svg"></image>
  </svg>,
  '\n'],
 ['Republican Congress, Stuck at Starting Line, Jogs in Place'],
 ['Trump to Announce His New Labor Pick'],
 ['Trump Plans to Have an Ally Review the U.S. Spy Agencies'],
 ['After His Election, Trump’s Love for Leaks Quickly Faded'],
 ['Trump Denounces ‘Low-Life Leakers,’ Pledging to Hunt Them Down'],
 ['4 Weeks Into Term, Trump to Hold Campaign Rally'],
 ['Peace in Mideast Doesn’t Demand 2 States, Trump Says'],
 ['Trump’s Nominee for Israel Envoy Apologizes for ‘Hurtful Words’'],
 ['Mick Mulvaney, Trump’s Pick for Budget Director, Is Confirmed'],
 ['A Punishing News Cycle Gives Journali
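
`.contents` is a list of child nodes, which may be strings or further tags, so it is not always the link text. The sketch below contrasts it with `.string` and `get_text()` on two invented links:

```python
from bs4 import BeautifulSoup

# Hypothetical links: one with plain text, one wrapping a span.
html = ('<a href="/x">Plain headline</a>'
        '<a href="/y"><span class="contact">Learn more</span></a>')
soup = BeautifulSoup(html, "html.parser")
plain, nested = soup.find_all("a")

print(plain.contents)     # list holding one string
print(nested.contents)    # list holding one Tag (the span)
print(nested.string)      # follows a single child down to its string
print(nested.get_text())  # text gathered from all descendants
```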

In [7]:
# filter out images: keep only links with exactly one child node

[a.contents for a in al if a is not None and len(a) == 1][:30]

[['Republican Congress, Stuck at Starting Line, Jogs in Place'],
 ['Trump to Announce His New Labor Pick'],
 ['Trump Plans to Have an Ally Review the U.S. Spy Agencies'],
 ['After His Election, Trump’s Love for Leaks Quickly Faded'],
 ['Trump Denounces ‘Low-Life Leakers,’ Pledging to Hunt Them Down'],
 ['4 Weeks Into Term, Trump to Hold Campaign Rally'],
 ['Peace in Mideast Doesn’t Demand 2 States, Trump Says'],
 ['Trump’s Nominee for Israel Envoy Apologizes for ‘Hurtful Words’'],
 ['Mick Mulvaney, Trump’s Pick for Budget Director, Is Confirmed'],
 ['A Punishing News Cycle Gives Journalists Renewed Mission'],
 ['White House Offers Rules to Steady Insurance Markets'],
 ['A Bee Mogul Confronts the Crisis in His Field'],
 [<span class="contact">Learn more</span>],
 ['Your Thursday Briefing'],
 ['California Today: Supporting Trump on Deep Blue Coast'],
 ['\r\n      Listen to ‘The Daily’\r\n    '],
 ['Why You Should Get Around to Writing Your Will'],
 ['How to Navigate a Museum While Travel
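
`len(a)` counts direct children, so a link holding a single `span` passes the `len(a) == 1` test and whitespace around the text is kept. One possible tightening, sketched on invented markup that imitates the cases in this output:

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical links imitating the cases seen in the output above.
html = """
<a href="/1">Headline</a>
<a href="/2"><span class="contact">Learn more</span></a>
<a href="/3">
      Listen to The Daily
    </a>
<a href="/4">
<svg></svg>
</a>
"""
soup = BeautifulSoup(html, "html.parser")
anchors = soup.find_all("a")

# len(tag) is the number of direct children.
print([len(a) for a in anchors])  # [1, 1, 1, 3]

# Keep links whose only child is a string, and trim surrounding whitespace.
titles = [a.get_text(strip=True) for a in anchors
          if len(a) == 1 and isinstance(a.contents[0], NavigableString)]
print(titles)  # ['Headline', 'Listen to The Daily']
```

Whether the `span` link should be dropped depends on the goal; this version treats it as non-headline.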