Scraping Webpages with BeautifulSoup
====================================

Let's try to get a list of the release years of all of Sean Connery's movies!

BeautifulSoup parses webpages you have downloaded and lets you search them for specific HTML elements. You can use this ability to scrape data out of a webpage, or a series of webpages. It is easy to use and tolerant of messy HTML. The [documentation](http://www.crummy.com/software/BeautifulSoup/bs4/doc/) is a handy reference.

Getting the Content
------------------
First you have to grab the content (I like to use [requests](http://docs.python-requests.org/en/latest/) for this):

In [1]:
import requests
r = requests.get('http://www.imdb.com/name/nm0000125/') # let's look at Sean Connery's list of movies
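
The request above assumes the download succeeds. Here is a minimal sketch of a more defensive fetch; the helper name `fetch_html` and the 10-second timeout are illustrative assumptions, not part of the original notebook.

```python
import requests

def fetch_html(url, timeout=10):
    """Download a page and return its HTML text, raising on failure."""
    # timeout stops the call from hanging forever on an unresponsive server
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.exceptions.HTTPError on 4xx/5xx responses
    response.raise_for_status()
    return response.text
```

Calling `fetch_html('http://www.imdb.com/name/nm0000125/')` would then either return the page's HTML or raise a `requests.exceptions.RequestException` subclass that you can catch and report.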

Now you can make your "beautiful soup"! This turns the HTML into a DOM tree that you can navigate with code.

In [2]:
from bs4 import BeautifulSoup
webpage = BeautifulSoup(r.text, "html.parser")

Scraping the Info You Want
------------------------
Now there are a few ways to get content out. For instance, to get the title you can treat the page like an object and access the tag as an attribute:

In [3]:
webpage.title.text

u'Sean Connery - IMDb'

Or you can search for specific tags. This would get all the links (as DOM elements):

In [4]:
len(webpage.find_all('a'))

852
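
Counting the links is rarely the end goal; usually you want their text and targets. The sketch below shows one way to pull those out with `find_all`. It runs against a small hand-written HTML snippet (hypothetical markup and paths, not the real IMDb page) so it works without a network connection.

```python
from bs4 import BeautifulSoup

# A tiny stand-in page; the titles and paths are made up for illustration.
sample_html = """
<html><body>
  <a href="/title/example-1/">Film A</a>
  <a href="/title/example-2/">Film B</a>
  <a name="top">anchor without an href</a>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# href=True keeps only the <a> tags that actually carry an href attribute
links = [(a.get_text(strip=True), a["href"]) for a in soup.find_all("a", href=True)]
print(links)
```

Each element returned by `find_all` behaves like a dictionary for its attributes, which is why `a["href"]` works here.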

Or you can use good old CSS selectors to find the elements that hold the years his movies were made in:

In [5]:
len(webpage.select('div.filmo-row span.year_column'))

325

Of course, we really want to turn this into a list of years, not DOM elements:

In [6]:
raw_year_list = [e.text.strip() for e in webpage.select('div.filmo-row span.year_column')]
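
To see what that selector is matching, it can help to try it on a small snippet. The sketch below reuses the `filmo-row` and `year_column` class names from the selector above, but the surrounding markup and film titles are assumptions made up for illustration. It also pairs each year with a title, which is one reason to select the whole row first.

```python
from bs4 import BeautifulSoup

# Hand-written markup that only imitates the structure the selector expects.
sample_html = """
<div class="filmo-row">
  <span class="year_column"> 1964 </span>
  <b><a href="#">Film A</a></b>
</div>
<div class="filmo-row">
  <span class="year_column"> 1962 </span>
  <b><a href="#">Film B</a></b>
</div>
<div class="other"><span class="year_column">1999</span></div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

films = []
for row in soup.select("div.filmo-row"):
    # select_one returns the first match inside this row, or None
    year = row.select_one("span.year_column").get_text().strip()
    title = row.find("a").get_text().strip()
    films.append((year, title))

print(films)
```

The `1999` span is skipped because it does not sit inside a `div.filmo-row`.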

Cleaning and Analyzing the Data
-----------------------------
Now we can check whether he made any films in a particular year:

In [7]:
'1972' in raw_year_list

True

And we can look for messy data:

In [8]:
[year for year in raw_year_list if not year.isnumeric()]

[u'1974/II',
 u'1956-1959',
 u'1995-2003',
 u'1993-2000',
 u'1999-2000',
 u'1987-1988',
 u'1976-1979',
 u'1975-1979',
 u'2008-2014',
 u'2011-2012',
 u'2006-2009']

And we can remove these messy entries (even though that isn't the best thing to do):

In [9]:
year_list = [year for year in raw_year_list if year.isnumeric()]
','.join(year_list)

u'2012,2005,2003,2003,2000,1999,1998,1998,1996,1996,1995,1995,1994,1993,1992,1991,1991,1990,1990,1989,1989,1988,1988,1987,1986,1986,1984,1983,1982,1982,1981,1981,1979,1979,1979,1977,1976,1976,1975,1975,1974,1974,1973,1971,1971,1970,1969,1969,1969,1968,1967,1966,1966,1965,1965,1964,1964,1964,1963,1962,1962,1961,1961,1961,1961,1961,1960,1960,1960,1960,1959,1959,1959,1958,1958,1958,1958,1957,1957,1957,1957,1957,1957,1957,1957,1957,1956,1956,1956,1954,1954,2012,2003,2000,1999,1996,1995,1993,1992,1967,2008,1962,1959,1967,1988,2012,2012,2009,2009,2009,2008,2008,2008,2007,2007,2007,2006,2006,2006,2006,2006,2006,2005,2005,2005,2005,2004,2004,2004,2003,2003,2003,2003,2003,2002,2002,2002,2002,2002,2002,2002,2001,2000,2000,2000,2000,1999,1999,1999,1999,1999,1999,1999,1998,1998,1998,1998,1998,1998,1997,1997,1996,1996,1996,1996,1996,1995,1995,1995,1995,1993,1992,1990,1990,1989,1989,1988,1988,1987,1987,1987,1987,1987,1987,1986,1983,1983,1983,1979,1977,1976,1975,1975,1975,1975,1972,1971,1970,1969,196
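
Dropping the non-numeric entries throws information away. As an alternative, here is a minimal sketch that normalizes them instead. The helper name `expand_year` is hypothetical, and the choice to expand a range like `1956-1959` into every year it covers is an assumption; you might prefer to keep only the start year.

```python
import re

def expand_year(entry):
    """Turn one raw year string into a list of integer years.

    Assumed formats, based on the messy entries seen above:
      '1972'      -> [1972]
      '1974/II'   -> [1974]              (the suffix after '/' is dropped)
      '1956-1959' -> [1956, 1957, 1958, 1959]
    Anything else yields an empty list.
    """
    match = re.match(r'^(\d{4})(?:-(\d{4}))?(?:/\w+)?$', entry.strip())
    if not match:
        return []
    start = int(match.group(1))
    end = int(match.group(2)) if match.group(2) else start
    return list(range(start, end + 1))

raw = ['1972', '1974/II', '1956-1959', 'unknown']
cleaned = [year for entry in raw for year in expand_year(entry)]
print(cleaned)
```

Returning a list from every call keeps the flattening step uniform, whether an entry produces zero, one, or several years.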

Finally, we can count the entries per year and print a simple text histogram:

In [10]:
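import collections
# count how many entries fall in each year
year_freq = collections.Counter(year_list)
# print one '+' per entry, in chronological order
for year in sorted(year_freq.keys()):
    print(str(year) + ': ' + ('+' * year_freq[year]))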

1954: ++
1956: +++
1957: +++++++++
1958: ++++
1959: ++++
1960: ++++
1961: +++++
1962: +++
1963: +++
1964: +++++
1965: ++++++++++
1966: +++
1967: +++++++
1968: +
1969: ++++
1970: ++
1971: +++
1972: +
1973: +
1974: ++
1975: +++++++
1976: +++
1977: ++
1979: ++++
1981: ++
1982: ++
1983: +++++
1984: ++
1986: +++
1987: ++++++++
1988: +++++++
1989: +++++
1990: ++++
1991: +++
1992: ++++
1993: ++++++
1994: ++
1995: ++++++++++
1996: +++++++++
1997: ++++++
1998: +++++++++
1999: ++++++++++++++
2000: +++++++++++++++++++++++
2001: +
2002: ++++++++++
2003: ++++++++++++++++
2004: ++++++++
2005: ++++++++++
2006: ++++++++++++++
2007: ++++++++++
2008: ++++++
2009: ++++++++
2010: ++
2012: +++++++++
2014: +
2015: +++
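
A per-year histogram can be noisy, so grouping by decade is one way to see the overall shape. This is a small sketch using the same `collections.Counter` approach; the helper name `count_by_decade` and the sample years are made up for illustration.

```python
import collections

def count_by_decade(years):
    """Count entries per decade; years may be strings or integers."""
    decade_freq = collections.Counter()
    for year in years:
        # integer division drops the last digit: 1964 -> 1960
        decade = (int(year) // 10) * 10
        decade_freq[decade] += 1
    return decade_freq

sample_years = ['1962', '1964', '1964', '1971', '1983', '1989']
decades = count_by_decade(sample_years)
for decade in sorted(decades):
    print('{0}s: {1}'.format(decade, '+' * decades[decade]))
```

With the real data you would pass `year_list` in place of `sample_years`.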
