# Scraping kittens.html with BeautifulSoup

If you have not already done so, install bs4 from the command line or by running the cell below. Remember, if you need to run a shell command from within your Jupyter notebook, write the command as you normally would, but put an `!` in front of it (as you see below).

In [2]:
!pip3 install bs4

Collecting bs4
Collecting beautifulsoup4 (from bs4)
  Using cached beautifulsoup4-4.4.1-py3-none-any.whl
Installing collected packages: beautifulsoup4, bs4
Successfully installed beautifulsoup4-4.4.1 bs4-0.0.1


**Remember:** when you import libraries/modules and it looks like "nothing happened", that's a good sign! It means it imported without an error. 

In [2]:
from bs4 import BeautifulSoup

In Foundations, we have been using the `requests` library to talk to websites. Here, we're using a similar tool, `urlopen` from Python's built-in `urllib.request` module, to get data from [kittens.html](http://static.decontextualize.com/kittens.html).

In [3]:
from urllib.request import urlopen
html_str = urlopen("http://static.decontextualize.com/kittens.html").read()

In [4]:
print(html_str)

b'<!doctype html>\n<html>\n\t<head>\n\t\t<title>Kittens!</title>\n\t\t<style type="text/css">\n\t\t\tspan.lastcheckup { font-family: "Courier", fixed; font-size: 11px; }\n\t\t</style>\n\t</head>\n\t<body>\n\t\t<h1>Kittens and the TV Shows They Love</h1>\n\t\t<div class="kitten">\n\t\t\t<h2>Fluffy</h2>\n\t\t\t<div><img src="http://placekitten.com/120/120"></div>\n\t\t\t<ul class="tvshows">\n\t\t\t\t<li>\n\t\t\t\t\t<a href="http://www.imdb.com/title/tt0106145/">Deep Space Nine</a>\n\t\t\t\t</li>\n\t\t\t\t<li>\n\t\t\t\t\t<a href="http://www.imdb.com/title/tt0088576/">Mr. Belvedere</a>\n\t\t\t\t</li>\n\t\t\t</ul>\n\t\t\tLast check-up: <span class="lastcheckup">2014-01-17</span>\n\t\t</div>\n\t\t<div class="kitten">\n\t\t\t<h2>Monsieur Whiskeurs</h2>\n\t\t\t<div><img src="http://placekitten.com/110/110"></div>\n\t\t\t<ul class="tvshows">\n\t\t\t\t<li>\n\t\t\t\t\t<a href="http://www.imdb.com/title/tt0106179/">The X-Files</a>\n\t\t\t\t</li>\n\t\t\t\t<li>\n\t\t\t\t\t<a href="http://www.imdb.co

We get back our response as one long string of bytes (note the `b` prefix at the start of the output). But we need our data to be machine-readable so that we can do cool things with it! Let's use a parser from BeautifulSoup to help us handle that task.
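
The `b'...'` prefix means that `.read()` handed us a `bytes` object rather than a regular `str`. BeautifulSoup accepts either, but if you ever want plain text you can decode the bytes yourself. Here is a small sketch using a short bytes literal as a stand-in for the downloaded page; the choice of UTF-8 is an assumption, since the notebook never states the page's encoding.

```python
# A short bytes literal standing in for the start of the downloaded page.
html_bytes = b'<!doctype html>\n<html>\n\t<head>\n\t\t<title>Kittens!</title>'

# bytes and str are different types; decode() turns bytes into str.
# Assumption: the page is encoded as UTF-8.
html_text = html_bytes.decode('utf-8')

print(type(html_bytes))
print(type(html_text))
print(html_text)  # the \n and \t escapes now show up as real line breaks and tabs
```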

In [5]:
# document is a special variable that contains a thing called an object (rather than a list or string, etc.)
document = BeautifulSoup(html_str, "html.parser")

Note that `document` is now a special kind of variable; it contains an **object**, rather than a list or a string, etc. If we call `type()` on `document`, it just tells us that it is a BeautifulSoup object -- this type of object is specific to the bs4 library.

In [6]:
type(document)

bs4.BeautifulSoup

**A little bit about the Jupyter environment:**
    
+ If you start typing `document.` and then hit TAB, you'll see all the methods you can apply to `document`. 

> Note that `document` is **NOT** a special word; we decided to name our variable `document` when we parsed our html data, but really, we could have named it anything. 

+ Python also has a helpful built-in function, `help()`, that shows you what you can do with an object, including a bs4 object.

In [7]:
help(document)

Help on BeautifulSoup in module bs4 object:

class BeautifulSoup(bs4.element.Tag)
 |  This class defines the basic interface called by the tree builders.
 |  
 |  These methods will be called by the parser:
 |    reset()
 |    feed(markup)
 |  
 |  The tree builder may call these methods from its feed() implementation:
 |    handle_starttag(name, attrs) # See note about return value
 |    handle_endtag(name)
 |    handle_data(data) # Appends to the current data node
 |    endData(containerClass=NavigableString) # Ends the current data node
 |  
 |  No matter how complicated the underlying parser is, you should be
 |  able to build a tree using 'start tag' events, 'end tag' events,
 |  'data' events, and "done with data" events.
 |  
 |  If you encounter an empty-element tag (aka a self-closing tag,
 |  like HTML's <br> tag), call handle_starttag and then
 |  handle_endtag.
 |  
 |  Method resolution order:
 |      BeautifulSoup
 |      bs4.element.Tag
 |      bs4.element.PageElement
 |

## Looking at our document

In [8]:
# .find() returns the first tag that matches the name you pass it;
# the tag it returns is an object
h1_tag = document.find('h1')

In [9]:
# our tag object 
h1_tag

<h1>Kittens and the TV Shows They Love</h1>

In [10]:
h1_tag.string

'Kittens and the TV Shows They Love'

In [11]:
img_tag = document.find('img')

In [12]:
# an <img> tag has no text inside it, so .string is None (and Jupyter shows no output)
img_tag.string

You can treat a **tag object** like a dictionary! You can pull out a tag's attributes the same way we'd look up a value by its key in a dictionary.

In [13]:
img_tag['src']

'http://placekitten.com/120/120'
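
To see the dictionary comparison in a bit more detail, here is a small sketch that parses a one-line snippet copied from kittens.html. The variable names `snippet`, `soup` and `img` are just for illustration. It shows `.get()`, which returns `None` for a missing attribute where square brackets would raise a `KeyError`, and `.attrs`, which holds all of the tag's attributes.

```python
from bs4 import BeautifulSoup

# One line copied from kittens.html, used as a tiny stand-in for the page.
snippet = '<div><img src="http://placekitten.com/120/120"></div>'
soup = BeautifulSoup(snippet, 'html.parser')
img = soup.find('img')

print(img['src'])      # square brackets look up an attribute by name
print(img.get('alt'))  # .get() returns None when the attribute is missing
print(img.attrs)       # all of the tag's attributes, as a dictionary
print(img.string)      # None, because there is no text inside an <img> tag
```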

The `.find()` method is just the start! Tags themselves have a `.find()` method -- this searches only the descendants of that tag, rather than the entire HTML page. There is also `.find_all()`, which returns every match instead of just the first one.
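
Here is a small sketch of that difference, using an abbreviated stand-in for kittens.html (just the two `div` tags and their `h2` headings). Calling `.find()` on the whole document returns the first match on the page, while calling it on a single `div` only looks inside that `div`.

```python
from bs4 import BeautifulSoup

# Abbreviated stand-in for kittens.html: only the divs and their headings.
html = """
<div class="kitten"><h2>Fluffy</h2></div>
<div class="kitten"><h2>Monsieur Whiskeurs</h2></div>
"""
soup = BeautifulSoup(html, 'html.parser')
second_div = soup.find_all('div')[1]

whole_doc = soup.find('h2').string    # searches the entire document
scoped = second_div.find('h2').string # searches only inside the second div

print(whole_doc)  # Fluffy
print(scoped)     # Monsieur Whiskeurs
```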


In [15]:
img_tags = document.find_all('img')

In [16]:
# a ResultSet is a fancy kind of list!
type(img_tags)

bs4.element.ResultSet

In [17]:
img_tags[0]

<img src="http://placekitten.com/120/120"/>

In [18]:
len(img_tags)

2

In [19]:
for tag_obj in img_tags:
    print(tag_obj['src'])

http://placekitten.com/120/120
http://placekitten.com/110/110


In [20]:
h2_tags = document.find_all('h2')

# Just the content in our tags
for item in h2_tags:
    print(item.string)

Fluffy
Monsieur Whiskeurs


In [21]:
# find_all can also take a dictionary of attributes that the tag must match
checkups = document.find_all('span', {'class': 'lastcheckup'})

for items in checkups:
    print(items.string)

2014-01-17
2013-11-02
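
Besides the dictionary form, BeautifulSoup offers a couple of other ways to ask for the same tags: the `class_` keyword argument (with a trailing underscore, because `class` is a reserved word in Python) and CSS selectors via `.select()`. The sketch below runs all three against an abbreviated stand-in for kittens.html; which one to use is mostly a matter of taste.

```python
from bs4 import BeautifulSoup

# Abbreviated stand-in for kittens.html: only the check-up spans.
html = """
<div class="kitten">Last check-up: <span class="lastcheckup">2014-01-17</span></div>
<div class="kitten">Last check-up: <span class="lastcheckup">2013-11-02</span></div>
"""
soup = BeautifulSoup(html, 'html.parser')

by_dict = soup.find_all('span', {'class': 'lastcheckup'})  # the form used above
by_keyword = soup.find_all('span', class_='lastcheckup')   # keyword argument
by_selector = soup.select('span.lastcheckup')              # CSS selector

for tag in by_selector:
    print(tag.string)
```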


We want to print out: 

`Fluffy: 2014-01-17
Monsieur Whiskeurs: 2013-11-02`


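Imagine our HTML document as a **tree structure**: each `div` with the class `kitten` is a branch that contains its own `h2`, `span` and `a` tags, so we can search one branch at a time.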

In [22]:
kittens = document.find_all('div', {'class': 'kitten'})

# for item in kittens:
#     h2_tag = document.find_all('h2') # this is actually searching the ENTIRE document
#     print(h2_tag[0].string)
#     checkup = document.find_all('span')
#     print(checkup[0].string)
  
# Search ONLY the descendants of each div tag

for item in kittens:
    h2_tag = item.find('h2') # this searches only inside the current kitten's div
    print(h2_tag.string)
    checkup = item.find('span')
    print(checkup.string)

Fluffy
2014-01-17
Monsieur Whiskeurs
2013-11-02
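
The loop above prints each name and date on its own line. To get the `Name: date` format we were aiming for, one option is to glue the two strings together before printing. A minimal sketch, again using an abbreviated stand-in for kittens.html (the `lines` list is only there so we can inspect the result afterwards):

```python
from bs4 import BeautifulSoup

# Abbreviated stand-in for kittens.html: names and check-up dates only.
html = """
<div class="kitten"><h2>Fluffy</h2>Last check-up: <span class="lastcheckup">2014-01-17</span></div>
<div class="kitten"><h2>Monsieur Whiskeurs</h2>Last check-up: <span class="lastcheckup">2013-11-02</span></div>
"""
soup = BeautifulSoup(html, 'html.parser')

lines = []
for kitten in soup.find_all('div', {'class': 'kitten'}):
    name = kitten.find('h2').string       # searches only this kitten's div
    checkup = kitten.find('span').string  # same here
    line = name + ': ' + checkup
    lines.append(line)
    print(line)
```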


In [23]:
kittens = document.find_all('div', {'class':'kitten'})

In [24]:
first_kitten = kittens[0]
first_kitten_h2 = first_kitten.find('h2')
print(first_kitten_h2.string)

Fluffy


In [25]:
second_kitten = kittens[1]
second_kitten_h2 = second_kitten.find('h2')
print(second_kitten_h2.string)

Monsieur Whiskeurs


## Aside: joining lists

In [26]:
planets = ['Mercury', 'Venus', 'Earth', 'Mars', 'Jupiter', 'Saturn', 'Uranus', 'Neptune']

In [27]:
separator = ','

In [28]:
separator.join(planets)

'Mercury,Venus,Earth,Mars,Jupiter,Saturn,Uranus,Neptune'

In [29]:
','.join(planets)

'Mercury,Venus,Earth,Mars,Jupiter,Saturn,Uranus,Neptune'

In [30]:
print('&\n'.join(planets))

Mercury&
Venus&
Earth&
Mars&
Jupiter&
Saturn&
Uranus&
Neptune
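
One thing to watch out for: `.join()` only works on lists of strings. If the list holds numbers, Python raises a `TypeError`, and you have to convert each item with `str()` first. A small sketch (the `numbers` list is made up for illustration):

```python
numbers = [1, 2, 3]

# Joining non-strings fails...
try:
    ', '.join(numbers)
except TypeError as error:
    print('join failed:', error)

# ...so convert each item to a string first.
joined = ', '.join(str(number) for number in numbers)
print(joined)  # 1, 2, 3
```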


## Aside: slicing lists

In [31]:
print('&\n'.join(planets[:4]))

Mercury&
Venus&
Earth&
Mars
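
`planets[:4]` means "everything up to, but not including, index 4". A few more slice forms, sketched on the same list:

```python
planets = ['Mercury', 'Venus', 'Earth', 'Mars',
           'Jupiter', 'Saturn', 'Uranus', 'Neptune']

inner = planets[:4]         # the first four items
outer = planets[4:]         # everything from index 4 onwards
last_two = planets[-2:]     # negative indexes count from the end
every_other = planets[::2]  # a third number sets the step size

print(inner)
print(outer)
print(last_two)
print(every_other)
```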


In [32]:
kittens = document.find_all('div', {'class': 'kitten'})

for item in kittens:
    h2_tag = item.find('h2') # this searches only inside the current kitten's div
    a_tags = item.find_all('a')
    print(h2_tag.string)
    
    for a_tag_item in a_tags:
        print('-', a_tag_item.string)

    

Fluffy
- Deep Space Nine
- Mr. Belvedere
Monsieur Whiskeurs
- The X-Files
- Fresh Prince


In [33]:
# building a structure to see our data differently 

kittens = document.find_all('div', {'class': 'kitten'})

for item in kittens:
    h2_tag = item.find('h2') # this searches only inside the current kitten's div
    a_tags = item.find_all('a')
    print(h2_tag.string)
    
    for a_tag_item in a_tags:
        print('-', a_tag_item.string)


Fluffy
- Deep Space Nine
- Mr. Belvedere
Monsieur Whiskeurs
- The X-Files
- Fresh Prince


## Aside: appending items to a list

In [34]:
x = ['a', 'b', 'c', 'd']

In [35]:
x[0]

'a'

In [36]:
x[2]

'c'

In [37]:
x.append('e')

In [38]:
len(x)

5

In [39]:
x[4]

'e'

In [40]:
numbers = [1, 2, 3, 4, 5, 6]

In [41]:
# we want to end up with a list of the squares of those numbers

square = []
for number in numbers:
    s = number * number
    square.append(s)

In [42]:
square

[1, 4, 9, 16, 25, 36]

In [45]:
# list comprehension

squared = [item * item for item in numbers]
squared

[1, 4, 9, 16, 25, 36]
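
A list comprehension can also filter as it goes: add an `if` at the end and only the items that pass the test are kept. A quick sketch with the same `numbers` list:

```python
numbers = [1, 2, 3, 4, 5, 6]

# Keep only the even numbers, then square them.
even_squares = [item * item for item in numbers if item % 2 == 0]
print(even_squares)  # [4, 16, 36]
```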

## Back to Kittens! 

In [44]:
kittens = document.find_all('div', {'class': 'kitten'})

for item in kittens:
    h2_tag = item.find('h2') # this searches only inside the current kitten's div
    a_tags = item.find_all('a')    
    all_shows_str = []
    
    for a_tag_item in a_tags:
        tag_str = a_tag_item.string
        all_shows_str.append(tag_str)
        
    string_with_all_show_names = ', '.join(all_shows_str)
    print(h2_tag.string + ":" + string_with_all_show_names)

Fluffy:Deep Space Nine, Mr. Belvedere
Monsieur Whiskeurs:The X-Files, Fresh Prince
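
The inner loop that fills `all_shows_str` follows the same append-in-a-loop pattern as the squares example, so it could also be written as a list comprehension. A sketch of that variation, using only Fluffy's part of kittens.html as a stand-in for the full page:

```python
from bs4 import BeautifulSoup

# Fluffy's section of kittens.html, used as a small stand-in for the page.
html = """
<div class="kitten">
  <h2>Fluffy</h2>
  <ul class="tvshows">
    <li><a href="http://www.imdb.com/title/tt0106145/">Deep Space Nine</a></li>
    <li><a href="http://www.imdb.com/title/tt0088576/">Mr. Belvedere</a></li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

lines = []
for item in soup.find_all('div', {'class': 'kitten'}):
    h2_tag = item.find('h2')
    # One line replaces the empty list + inner for loop + append
    all_shows_str = [a_tag.string for a_tag in item.find_all('a')]
    lines.append(h2_tag.string + ":" + ', '.join(all_shows_str))

print(lines[0])  # Fluffy:Deep Space Nine, Mr. Belvedere
```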


Now we want to transform the code above so that we save the data that we want as a **list of dictionaries**. 

Let's add comments to our code above to mark where the new steps go, and comment out the `print` line, so that we understand what's happening and can figure out what we need to do to create our own data structure:

In [None]:
# kittens_data = list()
kittens_data = []
kittens = document.find_all('div', {'class': 'kitten'})

for item in kittens:
    h2_tag = item.find('h2') # this searches only inside the current kitten's div
    a_tags = item.find_all('a')    
    all_shows_str = []
    
    for a_tag_item in a_tags:
        tag_str = a_tag_item.string
        all_shows_str.append(tag_str)
        
    # (1) Create a dictionary and add the relevant keys/value pairs
    # (2) Add the dictionary to your list 
    
    string_with_all_show_names = ', '.join(all_shows_str)
    # print(h2_tag.string + ":" + string_with_all_show_names)
    
kittens_data

## Quick Review of Dictionaries

In [46]:
# declaring a dictionary:

x = { 'a': 1, 'b': 2, 'c': 3}

# to get a value out of a dictionary

x['a']

1

In [49]:
x.keys()

dict_keys(['b', 'a', 'c'])

In [51]:
for k in x.keys():
    print(k)

b
a
c


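Note that in older versions of Python (before 3.7), the order of the items/keys in a dictionary is arbitrary, as in the output above! Since Python 3.7, dictionaries keep their keys in the order they were inserted, so the same loop would print `a`, `b`, `c`.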
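
If a loop's output should not depend on whatever order a dictionary happens to use, one option is to sort the keys first. A minimal sketch, reusing the `x` dictionary from above:

```python
x = {'a': 1, 'b': 2, 'c': 3}

# sorted() returns the keys as a list in alphabetical order
ordered_keys = sorted(x)

for k in ordered_keys:
    print(k, x[k])
```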

In [None]:
# Initialize a dictionary
y = {}

## How to gradually build up a dictionary structure

In [None]:
# Build up your dictionary gradually (programmatically)

# target: {1: 1, 2: 4, 3: 9, 4: 16, ..., 10: 100}

squares = {}
for n in range(1, 11):
    # you can add a new value to a dictionary by assigning it as below
    # dict_name[key] = value
    squares[n] = n*n 
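
Just as a loop that appends to a list can be shortened into a list comprehension, a loop that fills a dictionary can be shortened into a dictionary comprehension. The sketch below builds the squares both ways and checks that they match (`squares_short` is a name made up for this example):

```python
# The loop version from above
squares = {}
for n in range(1, 11):
    squares[n] = n * n

# The same dictionary as a dictionary comprehension
squares_short = {n: n * n for n in range(1, 11)}

print(squares_short[4])          # 16
print(squares == squares_short)  # True
```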

In [53]:
# Another example:

names = ['Aaron', 'Bob', 'Caroline', 'Daphne']

# target: {"Aaron": 5, "Bob": 3, "Caroline": 8, "Daphne": 6}

name_length_map = {}
for item in names:
    name_length_map[item] = len(item)

name_length_map

{'Aaron': 5, 'Bob': 3, 'Caroline': 8, 'Daphne': 6}

In [54]:
name_length_map['Bob']

3
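
Once a dictionary is built, two more tools come in handy: `.items()` lets you loop over keys and values together, and `.get()` looks up a key without raising an error if it is missing. A short sketch with the same `name_length_map` (the name `'Eve'` is made up to show a missing key):

```python
name_length_map = {'Aaron': 5, 'Bob': 3, 'Caroline': 8, 'Daphne': 6}

# Loop over key/value pairs at the same time
for name, length in name_length_map.items():
    print(name, length)

# .get() returns None (or a default you choose) for a missing key
missing = name_length_map.get('Eve')
missing_with_default = name_length_map.get('Eve', 0)

# The key with the largest value
longest = max(name_length_map, key=name_length_map.get)
print(longest)  # Caroline
```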

## Back to kittens again! 

Let's use what we learned about gradually building up a dictionary structure to create a list of dictionaries of our kittens.

In [56]:
# This is another way of initializing an empty list:
# kittens_data = list() 

# Our old way:
kittens_data = []
kittens = document.find_all('div', {'class': 'kitten'})

for item in kittens:
    h2_tag = item.find('h2') 
    a_tags = item.find_all('a')    
    all_shows_str = []
    
    for a_tag_item in a_tags:
        tag_str = a_tag_item.string
        all_shows_str.append(tag_str)
        
    # (1) Create a dictionary and add the relevant keys/value pairs
    
    kitten_map = {}
    kitten_map['name'] = h2_tag.string
    kitten_map['tvshows'] = all_shows_str
    
    # (2) Add the dictionary to your list 
    kittens_data.append(kitten_map)
    
kittens_data

[{'name': 'Fluffy', 'tvshows': ['Deep Space Nine', 'Mr. Belvedere']},
 {'name': 'Monsieur Whiskeurs', 'tvshows': ['The X-Files', 'Fresh Prince']}]

In [57]:
for kitten in kittens_data:
    for show in kitten['tvshows']:
        print(show)

Deep Space Nine
Mr. Belvedere
The X-Files
Fresh Prince


In [58]:
len(kittens_data)

2
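
Having the data as a list of dictionaries makes it easy to ask questions of it. For example, here is a sketch that finds every kitten that likes a given show; the data is typed in by hand from the output above so the cell runs on its own, and `fans` is a name made up for this example.

```python
# The structure we built above, typed in by hand
kittens_data = [
    {'name': 'Fluffy', 'tvshows': ['Deep Space Nine', 'Mr. Belvedere']},
    {'name': 'Monsieur Whiskeurs', 'tvshows': ['The X-Files', 'Fresh Prince']},
]

# Names of all kittens whose list of shows includes The X-Files
fans = [kitten['name'] for kitten in kittens_data
        if 'The X-Files' in kitten['tvshows']]

print(fans)  # ['Monsieur Whiskeurs']
```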

### A shorter way to write this code

In [None]:
# An alternate way to create our dictionaries (these lines go inside the for loop)
kitten_map = {'name': h2_tag.string, 
            'tvshows': all_shows_str}

# OR

kittens_data.append({'name': h2_tag.string, 
            'tvshows': all_shows_str})

### Next task: each dictionary should also include each kitten's last checkup date.

In [59]:
kittens_data = []
kittens = document.find_all('div', {'class': 'kitten'})

for item in kittens:
    h2_tag = item.find('h2') 
    a_tags = item.find_all('a')    
    all_shows_str = []
    
    for a_tag_item in a_tags:
        tag_str = a_tag_item.string
        all_shows_str.append(tag_str)
        
    checkup = item.find('span') # get the string with checkup.string
    kittens_data.append({'name': h2_tag.string, 
            'tvshows': all_shows_str, 
             'checkup': checkup.string })
    
kittens_data

[{'checkup': '2014-01-17',
  'name': 'Fluffy',
  'tvshows': ['Deep Space Nine', 'Mr. Belvedere']},
 {'checkup': '2013-11-02',
  'name': 'Monsieur Whiskeurs',
  'tvshows': ['The X-Files', 'Fresh Prince']}]
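
With the checkup dates in our data structure, we can start doing things with them, such as sorting the kittens by when they were last seen. The dates are still strings, so this sketch converts them with `datetime.strptime` first. The data is typed in by hand from the output above, and `checkup_date` is a helper function made up for this example.

```python
from datetime import datetime

# The structure we built above, typed in by hand
kittens_data = [
    {'checkup': '2014-01-17', 'name': 'Fluffy',
     'tvshows': ['Deep Space Nine', 'Mr. Belvedere']},
    {'checkup': '2013-11-02', 'name': 'Monsieur Whiskeurs',
     'tvshows': ['The X-Files', 'Fresh Prince']},
]

def checkup_date(kitten):
    # Turn a string like '2014-01-17' into a real date object
    return datetime.strptime(kitten['checkup'], '%Y-%m-%d')

# Oldest checkup first
by_checkup = sorted(kittens_data, key=checkup_date)

for kitten in by_checkup:
    print(kitten['name'] + ': ' + kitten['checkup'])
```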