# FLIP(01):  Advanced Data Science
**(Module 03: Natural Language Processing)**

---
- Materials in this module include resources collected from various open-source online repositories.
- You are free to use, but NOT allowed to change or distribute this package.

Prepared by and for 
**Student Members** |
2006-2018 [TULIP Lab](http://www.tulip.org.au)

---


# Session 01 - The Stew: Beautiful Soup 4

After you have your ingredients, now what? Now you make them into a stew… a beautiful stew.

Beautiful Soup (BS4) is a parsing library that can use different parsers. A parser is simply a program that can extract data from HTML and XML documents.

Beautiful Soup’s default parser comes from Python’s standard library. It’s flexible and forgiving, but a little slow. The good news is that you can swap out its parser with a faster one if you need the speed.
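
A minimal sketch of swapping parsers, assuming lxml is the faster parser you have in mind. lxml is an optional third-party package, and `make_soup` is a hypothetical helper name, not part of Beautiful Soup. The helper falls back to the standard-library parser when lxml is not installed:

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse with lxml when it is installed, otherwise with html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # Beautiful Soup raises FeatureNotFound when the requested parser is missing.
        return BeautifulSoup(markup, "html.parser")


parser_soup = make_soup("<p>Hello, parser</p>")
print(parser_soup.p.get_text())
```

Different parsers can build slightly different trees from the same markup, so it is worth naming the parser explicitly rather than relying on whatever happens to be installed.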

One advantage of BS4 is its ability to automatically detect encodings. This allows it to gracefully handle HTML documents with special characters.
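
As a rough illustration of that encoding handling, the snippet below passes raw bytes instead of a decoded string. The markup, its ISO-8859-1 encoding, and the name `latin1_bytes` are made up for this example. `original_encoding` reports whichever encoding Beautiful Soup settled on.

```python
from bs4 import BeautifulSoup

# Hypothetical page: bytes encoded as ISO-8859-1, with the charset declared in a <meta> tag.
latin1_bytes = '<meta charset="iso-8859-1"><p>café</p>'.encode("iso-8859-1")

soup = BeautifulSoup(latin1_bytes, "html.parser")

decoded_text = soup.p.string       # the text, decoded to a Unicode string
detected = soup.original_encoding  # the encoding Beautiful Soup decided on
print(decoded_text, detected)
```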

In addition, BS4 can help you navigate a parsed document and find what you need. This makes it quick and painless to build common applications. For example, if you wanted to find all the links in the web page we pulled down earlier, it’s only a few lines:

In [None]:
import requests
page = requests.get('http://examplesite.com')
contents = page.content

In [None]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(contents, 'html.parser')
soup.find_all('a')

In [None]:
from bs4 import BeautifulSoup

In [None]:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

In [None]:
soup = BeautifulSoup(html_doc, 'html.parser')

In [None]:
print(soup.prettify())

Here are some simple ways to navigate that data structure:

In [None]:
soup.title

In [None]:
soup.title.name

In [None]:
soup.title.string

In [None]:
soup.title.parent.name

In [None]:
soup.p

In [None]:
soup.p['class']

In [None]:
soup.a

In [None]:
soup.find_all('a')

In [None]:
soup.find(id='link3')

One common task is extracting all the URLs found within a page’s `<a>` tags:

In [None]:
for link in soup.find_all('a'):
    print(link.get('href'))

Another common task is extracting all the text from a page:

In [None]:
print(soup.get_text())

# Making the soup

To parse a document, pass it into the BeautifulSoup constructor. You can pass in a string or an open filehandle:

In [None]:
from bs4 import BeautifulSoup

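# Parse from an open file handle (assumes index.html exists in the working directory)
with open("index.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')

# Parse from a string; naming the parser avoids a warning about guessing one
soup = BeautifulSoup("<html>data</html>", 'html.parser')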

First, the document is converted to Unicode, and HTML entities are converted to Unicode characters:
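
A small sketch of that conversion, using a made-up one-line document (the variable names are only for illustration):

```python
from bs4 import BeautifulSoup

# The &eacute; entity in the input becomes the single character "é" in the parsed tree.
entity_soup = BeautifulSoup("<p>Sacr&eacute; bleu!</p>", "html.parser")
entity_text = entity_soup.p.get_text()
print(entity_text)
```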

# Kinds of objects

Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. But you’ll only ever have to deal with about four kinds of objects: Tag, NavigableString, BeautifulSoup, and Comment.
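
To make the four kinds concrete, here is a short sketch on a hypothetical one-line document that contains a tag, a text string, and a comment:

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Hypothetical document: one <b> tag holding a text string and a comment.
kinds_soup = BeautifulSoup("<b>bold<!--a note--></b>", "html.parser")

b_tag = kinds_soup.b
text_node, comment_node = b_tag.contents

kinds = [type(obj).__name__ for obj in (kinds_soup, b_tag, text_node, comment_node)]
print(kinds)

# Comment is a specialised NavigableString, and the BeautifulSoup object behaves like a Tag.
print(isinstance(comment_node, NavigableString), isinstance(kinds_soup, Tag))
```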

**Tag**

A Tag object corresponds to an XML or HTML tag in the original document:

In [None]:
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
tag = soup.b
type(tag)

Tags have a lot of attributes and methods, and I’ll cover most of them in Navigating the tree and Searching the tree. For now, the most important features of a tag are its name and attributes.

**Name**

Every tag has a name, accessible as .name:

In [None]:
tag.name

If you change a tag’s name, the change will be reflected in any HTML markup generated by Beautiful Soup:

In [None]:
tag.name = 'blockquote'
tag

**Attributes**

A tag may have any number of attributes. The tag `<b id="boldest">` has an attribute “id” whose value is “boldest”. You can access a tag’s attributes by treating the tag like a dictionary:

In [None]:
tag['id'] = 'boldest'

In [None]:
tag['id']

You can access that dictionary directly as `.attrs`:

In [None]:
tag.attrs

You can add, remove, and modify a tag’s attributes. Again, this is done by treating the tag as a dictionary:

In [None]:
tag['id'] = 'verybold'
tag['another-attribute'] = 1
tag

In [None]:
del tag['id']
del tag['another-attribute']

In [None]:
tag

Now that the attribute has been deleted, `tag['id']` raises a `KeyError`, while `tag.get('id')` returns `None`:

In [None]:
tag['id']

In [None]:
print(tag.get('id'))

**Multi-valued attributes**

HTML 4 defines a few attributes that can have multiple values. HTML 5 removes a couple of them, but defines a few more. The most common multi-valued attribute is *class* (that is, a tag can have more than one CSS class). Others include *rel, rev, accept-charset, headers*, and *accesskey*. Beautiful Soup presents the value(s) of a multi-valued attribute as a list:

In [None]:
css_soup = BeautifulSoup('<p class="body"></p>')
css_soup.p['class']

In [None]:
css_soup = BeautifulSoup('<p class="body strikeout"></p>')
css_soup.p['class']

If an attribute *looks* like it has more than one value, but it’s not a multi-valued attribute as defined by any version of the HTML standard, Beautiful Soup will leave the attribute alone:

In [None]:
id_soup = BeautifulSoup('<p id="my id"></p>')
id_soup.p['id']

When you turn a tag back into a string, multiple attribute values are consolidated:

In [None]:
rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>')
rel_soup.a['rel']

In [None]:
rel_soup.a['rel'] = ['index', 'contents']
print(rel_soup.p)

You can use `get_attribute_list` to get a value that’s always a list, whether or not it’s a multi-valued attribute:

    id_soup.p.get_attribute_list('id')  # ["my id"]
    
If you parse a document as XML, there are no multi-valued attributes:

In [None]:
xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
xml_soup.p['class']

## Comments and other special strings

`Tag`, `NavigableString`, and `BeautifulSoup` cover almost everything you’ll see in an HTML or XML file, but there are a few leftover bits. The only one you’ll probably ever need to worry about is the comment:

In [None]:
markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup)
comment = soup.b.string
type(comment)

The `Comment` object is just a special type of NavigableString:

In [None]:
comment

But when it appears as part of an HTML document, a `Comment` is displayed with special formatting:

In [None]:
print(soup.b.prettify())

Beautiful Soup defines classes for anything else that might show up in an XML document: `CData`, `ProcessingInstruction`, `Declaration`, and `Doctype`. Just like `Comment`, these classes are subclasses of `NavigableString` that add something extra to the string. Here’s an example that replaces the comment with a CDATA block:

In [None]:
from bs4 import CData
cdata = CData("A CDATA block")
comment.replace_with(cdata)

print(soup.b.prettify())

# Navigating the tree

Here’s the “Three sisters” HTML document again:

In [None]:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

In [None]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

I’ll use this as an example to show you how to move from one part of a document to another.

## Going down

Tags may contain strings and other tags. These elements are the tag’s children. Beautiful Soup provides a lot of different attributes for navigating and iterating over a tag’s children.

Note that Beautiful Soup strings don’t support any of these attributes, because a string can’t have children.

### Navigating using tag names

The simplest way to navigate the parse tree is to say the name of the tag you want. If you want the `<head>` tag, just say `soup.head`:

In [None]:
soup.head

In [None]:
soup.title

You can use this trick again and again to zoom in on a certain part of the parse tree. This code gets the first `<b>` tag beneath the `<body>` tag:

In [None]:
soup.body.b

Using a tag name as an attribute will give you only the *first* tag by that name:

In [None]:
soup.a

If you need to get all the `<a>` tags, or anything more complicated than the first tag with a certain name, you’ll need to use one of the methods described in Searching the tree, such as *find_all()*:

In [None]:
soup.find_all('a')

### .contents and .children

A tag’s children are available in a list called `.contents`:

In [None]:
head_tag = soup.head
head_tag

In [None]:
head_tag.contents

In [None]:
title_tag = head_tag.contents[0]
title_tag

In [None]:
title_tag.contents

The `BeautifulSoup` object itself has children. In this case, the `<html>` tag is the child of the `BeautifulSoup` object:

In [None]:
len(soup.contents)

In [None]:
soup.contents[0].name

A string does not have `.contents`, because it can’t contain anything:

In [None]:
text = title_tag.contents[0]
text.contents

Instead of getting them as a list, you can iterate over a tag’s children using the `.children` generator:

In [None]:
for child in title_tag.children:
    print(child)

The `.contents` and `.children` attributes only consider a tag’s direct children. For instance, the `<head>` tag has a single direct child–the `<title>` tag:

In [None]:
head_tag.contents

But the `<title>` tag itself has a child: the string “The Dormouse’s story”. There’s a sense in which that string is also a child of the `<head>` tag. The `.descendants` attribute lets you iterate over all of a tag’s children, recursively: its direct children, the children of its direct children, and so on:

In [None]:
for child in head_tag.descendants:
    print(child)

The `<head>` tag has only one child, but it has two descendants: the `<title>` tag and the `<title>` tag’s child. The `BeautifulSoup` object only has one direct child (the `<html>` tag), but it has a whole lot of descendants:

In [None]:
len(list(soup.children))

In [None]:
len(list(soup.descendants))

If a tag has only one child, and that child is a `NavigableString`, the child is made available as `.string`:

In [None]:
title_tag.string

If a tag’s only child is another tag, and that tag has a `.string`, then the parent tag is considered to have the same `.string` as its child:

In [None]:
head_tag.contents

In [None]:
head_tag.string

If a tag contains more than one thing, then it’s not clear what `.string` should refer to, so `.string` is defined to be `None`:

In [None]:
print(soup.html.string)

If there’s more than one thing inside a tag, you can still look at just the strings. Use the `.strings` generator:

In [None]:
for string in soup.strings:
    print(repr(string))

These strings tend to have a lot of extra whitespace, which you can remove by using the `.stripped_strings` generator instead:

In [None]:
for string in soup.stripped_strings:
    print(repr(string))

## Going up

You can access an element’s parent with the `.parent` attribute. In the example “three sisters” document, the `<head>` tag is the parent of the `<title>` tag:

In [None]:
title_tag = soup.title
title_tag

In [None]:
title_tag.parent

The title string itself has a parent: the `<title>` tag that contains it:

In [None]:
title_tag.string.parent

The parent of a top-level tag like `<html>` is the `BeautifulSoup` object itself:

In [None]:
html_tag = soup.html
type(html_tag.parent)

And the `.parent` of a `BeautifulSoup` object is defined as None:

In [None]:
print(soup.parent)

You can iterate over all of an element’s parents with `.parents`. This example uses `.parents` to travel from an `<a>`  tag buried deep within the document, to the very top of the document:

In [None]:
link = soup.a
link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
for parent in link.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)

## Going sideways

Consider a simple document like this:

In [None]:
# The markup is a single <a> tag with two children, <b> and <c>
sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", 'html.parser')
print(sibling_soup.prettify())

The `<b>` tag and the `<c>` tag are at the same level: they’re both direct children of the same tag. We call them siblings. When a document is pretty-printed, siblings show up at the same indentation level. You can also use this relationship in the code you write.

You can use `.next_sibling` and `.previous_sibling` to navigate between page elements that are on the same level of the parse tree:

In [None]:
sibling_soup.b.next_sibling

In [None]:
sibling_soup.c.previous_sibling

The `<b>` tag has a `.next_sibling`, but no `.previous_sibling`, because there’s nothing before the `<b>` tag on the same level of the tree. For the same reason, the `<c>` tag has a `.previous_sibling` but no `.next_sibling`:

In [None]:
print(sibling_soup.b.previous_sibling)

In [None]:
print(sibling_soup.c.next_sibling)

The strings “text1” and “text2” are not siblings, because they don’t have the same parent:

In [None]:
sibling_soup.b.string

In [None]:
print(sibling_soup.b.string.next_sibling)

In real documents, the `.next_sibling` or `.previous_sibling` of a tag will usually be a string containing whitespace. Going back to the “three sisters” document:

    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>

You might think that the `.next_sibling` of the first `<a>` tag would be the second `<a>` tag. But actually, it’s a string: the comma and newline that separate the first `<a>` tag from the second:

In [None]:
link = soup.a
link

In [None]:
link.next_sibling

The second `<a>` tag is actually the `.next_sibling` of the comma:

In [None]:
link.next_sibling.next_sibling

You can iterate over a tag’s siblings with `.next_siblings` or `.previous_siblings`:

In [None]:
for sibling in soup.a.next_siblings:
    print(repr(sibling))

In [None]:
for sibling in soup.find(id="link3").previous_siblings:
    print(repr(sibling))

Here’s the final `<a>` tag in the “three sisters” document. Its `.next_sibling` is a string: the conclusion of the sentence that was interrupted by the start of the `<a>` tag:

In [None]:
last_a_tag = soup.find("a", id="link3")
last_a_tag

In [None]:
last_a_tag.next_sibling

But the `.next_element` of that `<a>` tag, the thing that was parsed immediately after the `<a>` tag, is not the rest of that sentence: it’s the word “Tillie”:

In [None]:
last_a_tag.next_element

That’s because in the original markup, the word “Tillie” appeared before that semicolon. The parser encountered an `<a>` tag, then the word “Tillie”, then the closing `</a>` tag, then the semicolon and rest of the sentence. The semicolon is on the same level as the `<a>` tag, but the word “Tillie” was encountered first.

The `.previous_element` attribute is the exact opposite of `.next_element`. It points to whatever element was parsed immediately before this one:

In [None]:
last_a_tag.previous_element

In [None]:
last_a_tag.previous_element.next_element

You should get the idea by now. You can use these iterators to move forward or backward in the document as it was parsed:

In [None]:
for element in last_a_tag.next_elements:
    print(repr(element))

# Searching the tree

Beautiful Soup defines a lot of methods for searching the parse tree, but they’re all very similar. I’m going to spend a lot of time explaining the two most popular methods: `find()` and `find_all()`. The other methods take almost exactly the same arguments, so I’ll just cover them briefly.

Once again, I’ll be using the “three sisters” document as an example:

In [None]:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

In [None]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

By passing in a filter to a method like `find_all()`, you can zoom in on the parts of the document you’re interested in.

## Kinds of filters

Before talking in detail about find_all() and similar methods, I want to show examples of different filters you can pass into these methods. These filters show up again and again, throughout the search API. You can use them to filter based on a tag’s name, on its attributes, on the text of a string, or on some combination of these.

### A string

The simplest filter is a string. Pass a string to a search method and Beautiful Soup will perform a match against that exact string. This code finds all the `<b>` tags in the document:

In [None]:
soup.find_all('b')

If you pass in a byte string, Beautiful Soup will assume the string is encoded as UTF-8. You can avoid this by passing in a Unicode string instead.

### A regular expression

If you pass in a regular expression object, Beautiful Soup will filter against that regular expression using its `search()` method. This code finds all the tags whose names start with the letter “b”; in this case, the `<body>` tag and the `<b>` tag:

In [None]:
import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)

This code finds all the tags whose names contain the letter ‘t’:

In [None]:
for tag in soup.find_all(re.compile("t")):
    print(tag.name)

### A list

If you pass in a list, Beautiful Soup will allow a string match against any item in that list. This code finds all the `<a>` tags and all the `<b>` tags:

In [None]:
soup.find_all(["a", "b"])

### True

The value `True` matches everything it can. This code finds all the tags in the document, but none of the text strings:

In [None]:
for tag in soup.find_all(True):
    print(tag.name)

### A function

If none of the other matches work for you, define a function that takes an element as its only argument. The function should return `True` if the argument matches, and `False` otherwise.

Here’s a function that returns `True` if a tag defines the “class” attribute but doesn’t define the “id” attribute:

In [None]:
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

Pass this function into `find_all()` and you’ll pick up all the `<p>` tags:

In [None]:
soup.find_all(has_class_but_no_id)

This function only picks up the `<p>` tags. It doesn’t pick up the `<a>` tags, because those tags define both “class” and “id”. It doesn’t pick up tags like `<html>` and `<title>`, because those tags don’t define “class”.

If you pass in a function to filter on a specific attribute like `href`, the argument passed into the function will be the attribute value, not the whole tag. Here’s a function that finds all `a` tags whose `href` attribute does not match a regular expression:

In [None]:
def not_lacie(href):
    return href and not re.compile("lacie").search(href)
soup.find_all(href=not_lacie)

The function can be as complicated as you need it to be. Here’s a function that returns `True` if a tag is surrounded by string objects:

In [None]:
from bs4 import NavigableString
def surrounded_by_strings(tag):
    return (isinstance(tag.next_element, NavigableString)
            and isinstance(tag.previous_element, NavigableString))

for tag in soup.find_all(surrounded_by_strings):
    print(tag.name)

### The name argument

Pass in a value for `name` and you’ll tell Beautiful Soup to only consider tags with certain names. Text strings will be ignored, as will tags whose names don’t match.

This is the simplest usage:

In [None]:
soup.find_all("title")

Recall from Kinds of filters that the value to `name` can be a string, a regular expression, a list, a function, or the value True.

### The keyword arguments

Any argument that’s not recognized will be turned into a filter on one of a tag’s attributes. If you pass in a value for an argument called `id`, Beautiful Soup will filter against each tag’s ‘id’ attribute:

In [None]:
soup.find_all(id='link2')

If you pass in a value for `href`, Beautiful Soup will filter against each tag’s ‘href’ attribute:

In [None]:
soup.find_all(href=re.compile("elsie"))

You can filter an attribute based on a string, a regular expression, a list, a function, or the value True.
This code finds all tags whose `id` attribute has a value, regardless of what the value is:

In [None]:
soup.find_all(id=True)

You can filter multiple attributes at once by passing in more than one keyword argument:

In [None]:
soup.find_all(href=re.compile("elsie"), id='link1')

Some attributes, like the `data-*` attributes in HTML 5, have names that **can’t** be used as the names of keyword arguments:

In [None]:
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')

In [None]:
# Raises SyntaxError: data-foo is not a valid Python keyword argument name
data_soup.find_all(data-foo="value")

You can use these attributes in searches by putting them into a dictionary and passing the dictionary into `find_all()` as the `attrs` argument:

In [None]:
data_soup.find_all(attrs={"data-foo": "value"})

You can’t use a keyword argument to search for HTML’s ‘name’ attribute, because Beautiful Soup uses the `name` argument to contain the name of the tag itself. Instead, you can give a value to ‘name’ in the `attrs` argument.
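
A short sketch of the difference, using a hypothetical `<input>` tag (the markup and variable names are illustrative only):

```python
from bs4 import BeautifulSoup

name_soup = BeautifulSoup('<input name="email"/>', "html.parser")

# name="email" is read as the tag name, so this looks for <email> tags and finds none.
by_tag_name = name_soup.find_all(name="email")

# Putting the attribute in attrs matches the <input> whose name attribute is "email".
by_attribute = name_soup.find_all(attrs={"name": "email"})
print(by_tag_name, by_attribute)
```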

### Searching by CSS class

It’s very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, “class”, is a reserved word in Python. Using `class` as a keyword argument will give you a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument `class_`:

In [None]:
soup.find_all("a", class_="sister")

As with any keyword argument, you can pass `class_` a string, a regular expression, a function, or `True`:

In [None]:
soup.find_all(class_=re.compile("itl"))

In [None]:
def has_six_characters(css_class):
    return css_class is not None and len(css_class) == 6

In [None]:
soup.find_all(class_=has_six_characters)

Remember that a single tag can have multiple values for its “class” attribute. When you search for a tag that matches a certain CSS class, you’re matching against any of its CSS classes:

In [None]:
css_soup = BeautifulSoup('<p class="body strikeout"></p>')
css_soup.find_all("p", class_="strikeout")

In [None]:
css_soup.find_all("p", class_="body")

You can also search for the exact string value of the `class` attribute:

In [None]:
css_soup.find_all("p", class_="body strikeout")

But searching for variants of the string value won’t work:

In [None]:
css_soup.find_all("p", class_="strikeout body")

If you want to search for tags that match two or more CSS classes, you should use a CSS selector:

In [None]:
css_soup.select("p.strikeout.body")

In older versions of Beautiful Soup, which don’t have the `class_` shortcut, you can use the `attrs` trick mentioned above. Create a dictionary whose value for “class” is the string (or regular expression, or whatever) you want to search for:

In [None]:
soup.find_all("a", attrs={"class": "sister"})

### The `string` argument

With `string` you can search for strings instead of tags. As with `name` and the keyword arguments, you can pass in a string, a regular expression, a list, a function, or the value True. Here are some examples:

In [None]:
soup.find_all(string='Elsie')

In [None]:
soup.find_all(string=["Tillie", "Elsie", "Lacie"])

In [None]:
soup.find_all(string=re.compile("Dormouse"))

In [None]:
def is_the_only_string_within_a_tag(s):
    """Return True if this string is the only child of its parent tag."""
    return (s == s.parent.string)

In [None]:
soup.find_all(string=is_the_only_string_within_a_tag)

Although `string` is for finding strings, you can combine it with arguments that find tags: Beautiful Soup will find all tags whose `.string` matches your value for `string`. This code finds the `<a>` tags whose `.string` is “Elsie”:

In [None]:
soup.find_all("a", string="Elsie")

The `string` argument is new in Beautiful Soup 4.4.0. In earlier versions it was called `text`:

In [None]:
soup.find_all("a", text="Elsie")

### The `limit` argument

`find_all()` returns all the tags and strings that match your filters. This can take a while if the document is large. If you don’t need all the results, you can pass in a number for `limit`. This works just like the LIMIT keyword in SQL. It tells Beautiful Soup to stop gathering results after it’s found a certain number.

There are three links in the “three sisters” document, but this code only finds the first two:

In [None]:
soup.find_all("a", limit=2)

### The `recursive` argument

If you call `mytag.find_all()`, Beautiful Soup will examine all the descendants of `mytag`: its children, its children’s children, and so on. If you only want Beautiful Soup to consider direct children, you can pass in `recursive=False`. See the difference here:

In [None]:
soup.html.find_all("title")

In [None]:
soup.html.find_all("title", recursive=False)

Here’s that part of the document:

    <html><head><title>The Dormouse's story</title></head>

The `<title>` tag is beneath the `<html>` tag, but it’s not directly beneath the `<html>` tag: the `<head>` tag is in the way. Beautiful Soup finds the `<title>` tag when it’s allowed to look at all descendants of the `<html>` tag, but when `recursive=False` restricts it to the `<html>` tag’s immediate children, it finds nothing.

Beautiful Soup offers a lot of tree-searching methods (covered below), and they mostly take the same arguments as `find_all()`: `name`, `attrs`, `string`, `limit`, and the keyword arguments. But the `recursive` argument is different: `find_all()` and `find()` are the only methods that support it. Passing `recursive=False` into a method like `find_parents()` wouldn’t be very useful.

### Calling a tag is like calling `find_all()`

Because `find_all()` is the most popular method in the Beautiful Soup search API, you can use a shortcut for it. If you treat the `BeautifulSoup` object or a `Tag` object as though it were a function, then it’s the same as calling `find_all()` on that object. These two lines of code are equivalent:

In [None]:
soup.find_all("a")
soup("a")

These two lines are also equivalent:

In [None]:
soup.title.find_all(string=True)
soup.title(string=True)

### `find()`

Signature: `find(name, attrs, recursive, string, **kwargs)`

The `find_all()` method scans the entire document looking for results, but sometimes you only want to find one result. If you know a document only has one `<body>` tag, it’s a waste of time to scan the entire document looking for more. Rather than passing in `limit=1` every time you call `find_all`, you can use the `find()` method. These two lines of code are nearly equivalent:

In [None]:
soup.find_all('title', limit=1)

In [None]:
soup.find('title')

The only difference is that `find_all()` returns a list containing the single result, and `find()` just returns the result.

If `find_all()` can’t find anything, it returns an empty list. If `find()` can’t find anything, it returns `None`:

In [None]:
print(soup.find("nosuchtag"))
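
Because of that `None`, chaining straight onto a failed `find()` raises an `AttributeError`. One way to guard against it is sketched below; the document and the `"(no title)"` fallback are made up for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical document with no <title> tag.
guard_soup = BeautifulSoup("<html><head></head><body></body></html>", "html.parser")

# find() returns None here, so check before touching the result.
maybe_title = guard_soup.find("title")
title_text = maybe_title.get_text() if maybe_title is not None else "(no title)"
print(title_text)
```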

Remember the `soup.head.title` trick from Navigating using tag names? That trick works by repeatedly calling `find()`:

In [None]:
soup.head.title

In [None]:
soup.find("head").find("title")

### `find_parents()` and `find_parent()`

Signature: `find_parents(name, attrs, string, limit, **kwargs)`

Signature: `find_parent(name, attrs, string, **kwargs)`

I spent a lot of time above covering `find_all()` and `find()`. The Beautiful Soup API defines ten other methods for searching the tree, but don’t be afraid. Five of these methods are basically the same as `find_all()`, and the other five are basically the same as `find()`. The only differences are in what parts of the tree they search.

First let’s consider `find_parents()` and `find_parent()`. Remember that `find_all()` and `find()` work their way down the tree, looking at a tag’s descendants. These methods do the opposite: they work their way up the tree, looking at a tag’s (or a string’s) parents. Let’s try them out, starting from a string buried deep in the “three sisters” document:

In [None]:
a_string = soup.find(string="Lacie")
a_string

In [None]:
a_string.find_parents("a")

In [None]:
a_string.find_parent("p")

One of the three `<a>` tags is the direct parent of the string in question, so our search finds it. One of the three `<p>` tags is an indirect parent of the string, and our search finds that as well. There’s a `<p>` tag with the CSS class “title” somewhere in the document, but it’s not one of this string’s parents, so we can’t find it with `find_parents()`.

You may have made the connection between `find_parent()` and `find_parents()`, and the `.parent` and `.parents` attributes mentioned earlier. The connection is very strong. These search methods actually use `.parents` to iterate over all the parents, and check each one against the provided filter to see if it matches.
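
That connection can be sketched by hand. The snippet below uses a trimmed-down, hypothetical version of the “three sisters” markup and rebuilds the result of `find_parents("a")` from `.parents`. It illustrates the idea only and is not the library’s actual implementation:

```python
from bs4 import BeautifulSoup

# A trimmed-down, hypothetical version of the "three sisters" markup.
family_soup = BeautifulSoup(
    '<p class="story">Sisters: <a id="link2">Lacie</a></p>', "html.parser"
)
lacie = family_soup.find(string="Lacie")

# Walk upward through .parents and keep only the ancestors named "a".
manual_parents = [parent for parent in lacie.parents if parent.name == "a"]
same_result = manual_parents == lacie.find_parents("a")

ancestor_names = [parent.name for parent in lacie.parents]
print(same_result, ancestor_names)
```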

### `find_next_siblings()` and `find_next_sibling()`

Signature: `find_next_siblings(name, attrs, string, limit, **kwargs)`

Signature: `find_next_sibling(name, attrs, string, **kwargs)`

These methods use `.next_siblings` to iterate over the rest of an element’s siblings in the tree. The `find_next_siblings()` method returns all the siblings that match, and `find_next_sibling()` only returns the first one:

In [None]:
first_link = soup.a
first_link

In [None]:
first_link.find_next_siblings("a")

In [None]:
first_story_paragraph = soup.find("p", "story")
first_story_paragraph.find_next_sibling("p")

### `find_previous_siblings()` and `find_previous_sibling()`

Signature: `find_previous_siblings(name, attrs, string, limit, **kwargs)`

Signature: `find_previous_sibling(name, attrs, string, **kwargs)`

These methods use `.previous_siblings` to iterate over an element’s siblings that precede it in the tree. The `find_previous_siblings()` method returns all the siblings that match, and `find_previous_sibling()` only returns the first one:

In [None]:
last_link = soup.find("a", id="link3")
last_link

In [None]:
last_link.find_previous_siblings("a")

In [None]:
first_story_paragraph = soup.find("p", "story")
first_story_paragraph.find_previous_sibling("p")

### `find_all_next()` and `find_next()`

Signature: `find_all_next(name, attrs, string, limit, **kwargs)`

Signature: `find_next(name, attrs, string, **kwargs)`

These methods use `.next_elements` to iterate over the tags and strings that come after an element in the document. The `find_all_next()` method returns all matches, and `find_next()` only returns the first match:

In [None]:
first_link = soup.a
first_link

In [None]:
first_link.find_all_next(string=True)

In [None]:
first_link.find_next("p")

### `find_all_previous()` and `find_previous()`

Signature: `find_all_previous(name, attrs, string, limit, **kwargs)`

Signature: `find_previous(name, attrs, string, **kwargs)`

These methods use `.previous_elements` to iterate over the tags and strings that come before an element in the document. The `find_all_previous()` method returns all matches, and `find_previous()` only returns the first match:

In [None]:
first_link = soup.a
first_link

In [None]:
first_link.find_all_previous("p")

In [None]:
first_link.find_previous("title")

### CSS selectors

Beautiful Soup supports the most commonly-used CSS selectors. Just pass a string into the `.select()` method of a `Tag` object or the `BeautifulSoup` object itself.

You can find tags:

In [None]:
soup.select("title")

In [None]:
soup.select("p:nth-of-type(3)")

Find tags beneath other tags:

In [None]:
soup.select("body a")

In [None]:
soup.select("html head title")

Find tags directly beneath other tags:

In [None]:
soup.select("head > title")

In [None]:
soup.select("p > a")

In [None]:
soup.select("p > a:nth-of-type(2)")

In [None]:
soup.select("p > #link1")

In [None]:
soup.select("body > a")

Find the siblings of tags:

In [None]:
soup.select("#link1 ~ .sister")

In [None]:
soup.select("#link1 + .sister")

Find tags by CSS class:

In [None]:
soup.select(".sister")

In [None]:
soup.select("[class~=sister]")

Find tags by ID:

In [None]:
soup.select("#link1")

In [None]:
soup.select("a#link2")

Test for the existence of an attribute:

In [None]:
soup.select('a[href]')

Find tags by attribute value:

In [None]:
soup.select('a[href="http://example.com/elsie"]')

In [None]:
soup.select('a[href^="http://example.com/"]')

In [None]:
soup.select('a[href$="tillie"]')

In [None]:
soup.select('a[href*=".com/el"]')

Match language codes:

In [None]:
multilingual_markup = """
 <p lang="en">Hello</p>
 <p lang="en-us">Howdy, y'all</p>
 <p lang="en-gb">Pip-pip, old fruit</p>
 <p lang="fr">Bonjour mes amis</p>
"""

In [None]:
multilingual_soup = BeautifulSoup(multilingual_markup)
multilingual_soup.select('p[lang|=en]')

Find only the first tag that matches a selector:

In [None]:
soup.select_one(".sister")

This is all a convenience for users who know the CSS selector syntax. You can do all this stuff with the Beautiful Soup API. And if CSS selectors are all you need, you might as well use lxml directly: it’s a lot faster, and it supports more CSS selectors. But this lets you combine simple CSS selectors with the Beautiful Soup API.
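
As a rough illustration of the point above, the following sketch runs the same query once with a CSS selector and once with the regular search API, then continues from a selector result with a navigation method. The shortened markup and the names `demo_soup`, `via_css`, `via_api`, and `next_link` are made up for this example.

```python
from bs4 import BeautifulSoup

# A shortened version of the "three sisters" markup (assumed for this sketch).
demo_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
demo_soup = BeautifulSoup(demo_doc, "html.parser")

# CSS selector: <a class="sister"> tags directly beneath <p class="story">.
via_css = demo_soup.select("p.story > a.sister")

# The same query written with find() and find_all().
story = demo_soup.find("p", class_="story")
via_api = story.find_all("a", class_="sister", recursive=False)

print([tag["id"] for tag in via_css])
print(via_css == via_api)

# Combining both styles: select one tag, then navigate with the API.
next_link = demo_soup.select_one("#link1").find_next_sibling("a")
print(next_link["id"])
```

The two result lists compare equal because Beautiful Soup treats tags with the same markup as equal.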

## Modifying the tree

Beautiful Soup’s main strength is in searching the parse tree, but you can also modify the tree and write your changes as a new HTML or XML document.

### Changing tag names and attributes

I covered this earlier, in `Attributes`, but it bears repeating. You can rename a tag, change the values of its attributes, add new attributes, and delete attributes:

In [None]:
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
tag = soup.b

tag.name = "blockquote"
tag['class'] = 'verybold'
tag['id'] = 1
tag

In [None]:
del tag['class']
del tag['id']
tag

### Modifying `.string`

If you set a tag’s `.string` attribute, the tag’s contents are replaced with the string you give:

In [None]:
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)

In [None]:
tag = soup.a
tag.string = "New link text."
tag

<font color = 'red'>Be careful:</font> if the tag contained other tags, they and all their contents will be destroyed.
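
A small sketch of that warning, using the same markup as above: the nested `<i>` tag is reachable before `.string` is assigned and gone afterwards. The names `demo_soup`, `link`, `had_italic_before`, and `has_italic_after` are only for illustration.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
demo_soup = BeautifulSoup(markup, "html.parser")
link = demo_soup.a

# Before the assignment, the nested <i> tag is part of the tree.
had_italic_before = link.i is not None

# Assigning to .string replaces everything inside <a>, including <i>.
link.string = "New link text."
has_italic_after = link.i is not None

print(had_italic_before, has_italic_after)
print(link)
```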

### `append()`

You can add to a tag’s contents with `Tag.append()`. It works just like calling `.append()` on a Python list:

In [None]:
soup = BeautifulSoup("<a>Foo</a>")
soup.a.append("Bar")

In [None]:
soup

In [None]:
soup.a.contents

### `NavigableString()` and `.new_tag()`

If you need to add a string to a document, no problem–you can pass a Python string in to `append()`, or you can call the `NavigableString` constructor:

In [None]:
from bs4 import NavigableString
soup = BeautifulSoup("<b></b>")
tag = soup.b
tag.append("Hello")
new_string = NavigableString(" there")
tag.append(new_string)
tag

In [None]:
tag.contents

If you want to create a comment or some other subclass of `NavigableString`, just call the constructor:

In [None]:
from bs4 import Comment
new_comment = Comment("Nice to see you.")
tag.append(new_comment)
tag

In [None]:
tag.contents
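
The heading of this section also mentions `.new_tag()`, which is the factory method for creating a whole new tag. A minimal sketch, assuming an empty `<b>` tag as the starting point (the names `demo_soup`, `original_tag`, and `new_link` are illustrative):

```python
from bs4 import BeautifulSoup

demo_soup = BeautifulSoup("<b></b>", "html.parser")
original_tag = demo_soup.b

# new_tag() takes the tag name; keyword arguments become attributes.
new_link = demo_soup.new_tag("a", href="http://www.example.com")
original_tag.append(new_link)

# The new tag starts out empty, so give it some text.
new_link.string = "Link text."

print(original_tag)
```

Only the first argument, the tag name, is required.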

### `insert()`

`Tag.insert()` is just like `Tag.append()`, except the new element doesn’t necessarily go at the end of its parent’s `.contents`. It’ll be inserted at whatever numeric position you say. It works just like `.insert()` on a Python list:

In [None]:
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
tag = soup.a

In [None]:
tag.insert(1, "but did not endorse ")
tag

### `insert_before()` and `insert_after()`

The `insert_before()` method inserts a tag or string immediately before something else in the parse tree:

In [None]:
soup = BeautifulSoup("<b>stop</b>")
tag = soup.new_tag("i")
tag.string = "Don't"
soup.b.string.insert_before(tag)
soup.b

The `insert_after()` method inserts a tag or string so that it immediately follows something else in the parse tree:

In [None]:
soup.b.i.insert_after(soup.new_string(" ever "))
soup.b

In [None]:
soup.b.contents

### `clear()`

`Tag.clear()` removes the contents of a tag:

In [None]:
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
tag = soup.a

tag.clear()
tag

### `extract()`

`PageElement.extract()` removes a tag or string from the tree. It returns the tag or string that was extracted:

In [None]:
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
a_tag = soup.a

i_tag = soup.i.extract()

a_tag

In [None]:
i_tag

In [None]:
a_tag

In [None]:
print(i_tag.parent)

At this point you effectively have two parse trees: one rooted at the `BeautifulSoup` object you used to parse the document, and one rooted at the tag that was extracted. You can go on to call `extract` on a child of the element you extracted:

In [None]:
my_string = i_tag.string.extract()
my_string

In [None]:
print(my_string.parent)

In [None]:
i_tag

### `decompose()`

`Tag.decompose()` removes a tag from the tree, then completely destroys it and its contents:

In [None]:
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
a_tag = soup.a

soup.i.decompose()

a_tag

### `replace_with()`

`PageElement.replace_with()` removes a tag or string from the tree, and replaces it with the tag or string of your choice:

In [None]:
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
a_tag = soup.a

new_tag = soup.new_tag("b")
new_tag.string = "example.net"
a_tag.i.replace_with(new_tag)

a_tag

`replace_with()` returns the tag or string that was replaced, so that you can examine it or add it back to another part of the tree.
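
The sketch below keeps the return value of `replace_with()` and later appends the replaced tag back into the tree. The names `demo_soup`, `old_tag`, and `parent_while_detached` are assumptions made for the example.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
demo_soup = BeautifulSoup(markup, "html.parser")

new_tag = demo_soup.new_tag("b")
new_tag.string = "example.net"

# replace_with() hands back the element that was removed from the tree.
old_tag = demo_soup.a.i.replace_with(new_tag)
parent_while_detached = old_tag.parent  # None: the old tag is detached

# The replaced tag can be put back somewhere else in the tree.
demo_soup.a.append(old_tag)

print(old_tag)
print(demo_soup.a)
```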

### `wrap()`

`PageElement.wrap()` wraps an element in the tag you specify. It returns the new wrapper:

In [None]:
soup = BeautifulSoup("<p>I wish I was bold.</p>")
soup.p.string.wrap(soup.new_tag("b"))

In [None]:
soup.p.wrap(soup.new_tag("div"))

### `unwrap()`

`Tag.unwrap()` is the opposite of `wrap()`. It replaces a tag with whatever’s inside that tag. It’s good for stripping out markup:

In [None]:
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
a_tag = soup.a

a_tag.i.unwrap()
a_tag

# Output

## Pretty-printing

The `prettify()` method will turn a Beautiful Soup parse tree into a nicely formatted Unicode string, with each HTML/XML tag on its own line:

In [None]:
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
soup.prettify()

In [None]:
print(soup.prettify())

You can call `prettify()` on the top-level `BeautifulSoup` object, or on any of its `Tag` objects:

In [None]:
print(soup.a.prettify())

## Non-pretty printing

If you just want a string, with no fancy formatting, you can call `str()` (or `unicode()` in Python 2) on a `BeautifulSoup` object, or a `Tag` within it:

In [None]:
str(soup)

## Output formatters

If you give Beautiful Soup a document that contains HTML entities like `&ldquo;`, they’ll be converted to Unicode characters:

In [None]:
soup = BeautifulSoup("&ldquo;Dammit!&rdquo; he said.")
str(soup)

By default, the only characters that are escaped upon output are bare ampersands and angle brackets. These get turned into `&amp;`, `&lt;`, and `&gt;`, so that Beautiful Soup doesn’t inadvertently generate invalid HTML or XML:

In [None]:
soup = BeautifulSoup("<p>The law firm of Dewey, Cheatem, & Howe</p>")
soup.p

In [None]:
soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>')
soup.a

You can change this behavior by providing a value for the `formatter` argument to `prettify()`, `encode()`, or `decode()`. Beautiful Soup recognizes four possible values for `formatter`.

The default is `formatter="minimal"`. Strings will only be processed enough to ensure that Beautiful Soup generates valid HTML/XML:

In [None]:
french = "<p>Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;</p>"
soup = BeautifulSoup(french)
print(soup.prettify(formatter="minimal"))

If you pass in `formatter="html"`, Beautiful Soup will convert Unicode characters to HTML entities whenever possible:

In [None]:
print(soup.prettify(formatter="html"))

If you pass in `formatter=None`, Beautiful Soup will not modify strings at all on output. This is the fastest option, but it may lead to Beautiful Soup generating invalid HTML/XML, as in these examples:

In [None]:
print(soup.prettify(formatter=None))

In [None]:
link_soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>')
print(link_soup.a.encode(formatter=None))

Finally, if you pass in a function for `formatter`, Beautiful Soup will call that function once for every string and attribute value in the document. You can do whatever you want in this function. Here’s a formatter that converts strings to uppercase and does absolutely nothing else:

In [None]:
def uppercase(text):
    # The parameter is named `text` so it does not shadow the built-in `str`.
    return text.upper()

print(soup.prettify(formatter=uppercase))

In [None]:
print(link_soup.a.prettify(formatter=uppercase))

Here’s an example that replaces Unicode characters with HTML entities whenever possible, but also converts all strings to uppercase:

In [None]:
from bs4.dammit import EntitySubstitution
def uppercase_and_substitute_html_entities(text):
    # Uppercase first, then replace Unicode characters with HTML entities.
    return EntitySubstitution.substitute_html(text.upper())

print(soup.prettify(formatter=uppercase_and_substitute_html_entities))

One last caveat: if you create a `CData` object, the text inside that object is always presented exactly as it appears, with no formatting. Beautiful Soup will call the formatter method, just in case you’ve written a custom method that counts all the strings in the document or something, but it will ignore the return value:

In [None]:
from bs4.element import CData
soup = BeautifulSoup("<a></a>")
soup.a.string = CData("one < three")
print(soup.a.prettify(formatter="xml"))

**CDATA Rules** 

XML CDATA sections must follow these rules:

>A CDATA section cannot contain the string `]]>`, because that sequence marks the end of the section.

>CDATA sections cannot be nested.

To run this example yourself, you can also call `prettify()` without the `formatter` argument:

In [None]:
from bs4.element import CData
soup = BeautifulSoup("<a></a>")
soup.a.string = CData("one < three")
print(soup.a.prettify())

### get_text()

If you only want the text part of a document or tag, you can use the `get_text()` method. It returns all the text in a document or beneath a tag, as a single Unicode string:

In [None]:
markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup)

soup.get_text()

In [None]:
soup.i.get_text()

You can specify a string to be used to join the bits of text together:

In [None]:
soup.get_text("|")

You can tell Beautiful Soup to strip whitespace from the beginning and end of each bit of text:

In [None]:
soup.get_text("|", strip = True)

But at that point you might want to use the `.stripped_strings` generator instead, and process the text yourself:

In [None]:
[text for text in soup.stripped_strings]

# Differences between parsers

Beautiful Soup presents the same interface to a number of different parsers, but each parser is different. Different parsers will create different parse trees from the same document. The biggest differences are between the HTML parsers and the XML parsers. Here’s a short document, parsed as HTML:

In [None]:
BeautifulSoup("<a><b /></a>")

Since an empty `<b />` tag is not valid HTML, the parser turns it into a `<b></b>` tag pair.

Here’s the same document parsed as XML (running this requires that you have lxml installed). Note that the empty `<b />` tag is left alone, and that the document is given an XML declaration instead of being put into an `<html>` tag.:

In [None]:
BeautifulSoup("<a><b /></a>", "xml")

There are also differences between HTML parsers. If you give Beautiful Soup a perfectly-formed HTML document, these differences won’t matter. One parser will be faster than another, but they’ll all give you a data structure that looks exactly like the original HTML document.

But if the document is not perfectly-formed, different parsers will give different results. Here’s a short, invalid document parsed using lxml’s HTML parser. Note that the dangling `</p>` tag is simply ignored:

In [None]:
BeautifulSoup("<a></p>", "lxml")

Here’s the same document parsed using html5lib:

In [None]:
BeautifulSoup("<a></p>", "html5lib")

Instead of ignoring the dangling `</p>` tag, html5lib pairs it with an opening `<p>` tag. This parser also adds an empty `<head>` tag to the document.

Here’s the same document parsed with Python’s built-in HTML parser:

In [None]:
BeautifulSoup("<a></p>", "html.parser")
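
To see the three results side by side, a loop like the following can help. It is only a sketch: it assumes that `lxml` and `html5lib` may or may not be installed, and records `None` for any parser that Beautiful Soup cannot find. The name `results` is illustrative.

```python
from bs4 import BeautifulSoup, FeatureNotFound

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup("<a></p>", parser))
    except FeatureNotFound:
        # Raised when the requested parser library is not installed.
        results[parser] = None

for parser, output in results.items():
    print(f"{parser:12} {output}")
```

Because the results differ, naming the parser explicitly in the `BeautifulSoup` constructor keeps a script's behavior the same from one machine to the next.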

# Encodings

Any HTML or XML document is written in a specific encoding like ASCII or UTF-8. But when you load that document into Beautiful Soup, you’ll discover it’s been converted to Unicode:

In [None]:
markup = b"<h1>Sacr\xc3\xa9 bleu!</h1>"  # bytes (UTF-8), so the encoding has to be detected
soup = BeautifulSoup(markup)
soup.h1

In [None]:
soup.h1.string

It’s not magic. (That sure would be nice.) Beautiful Soup uses a sub-library called Unicode, Dammit to detect a document’s encoding and convert it to Unicode. The autodetected encoding is available as the `.original_encoding` attribute of the `BeautifulSoup` object:

In [None]:
soup.original_encoding

Unicode, Dammit guesses correctly most of the time, but sometimes it makes mistakes. Sometimes it guesses correctly, but only after a byte-by-byte search of the document that takes a very long time. If you happen to know a document’s encoding ahead of time, you can avoid mistakes and delays by passing it to the `BeautifulSoup` constructor as `from_encoding`.

Here’s a document written in ISO-8859-8. The document is so short that Unicode, Dammit can’t get a good lock on it, and misidentifies it as ISO-8859-7:

In [None]:
markup = b"<h1>\xed\xe5\xec\xf9</h1>"
soup = BeautifulSoup(markup)
soup.h1

In [None]:
soup.original_encoding

We can fix this by passing in the correct `from_encoding`:

In [None]:
soup = BeautifulSoup(markup, from_encoding="iso-8859-8")
soup.h1

In [None]:
soup.original_encoding

If you don’t know what the correct encoding is, but you know that Unicode, Dammit is guessing wrong, you can pass the wrong guesses in as `exclude_encodings`:

In [None]:
soup = BeautifulSoup(markup, exclude_encodings=["ISO-8859-7"])
soup.h1

In [None]:
soup.original_encoding

## Output encoding

When you write out a document from Beautiful Soup, you get a UTF-8 document, even if the document wasn’t in UTF-8 to begin with. Here’s a document written in the Latin-1 encoding:

In [None]:
markup = b'''
 <html>
  <head>
   <meta content="text/html; charset=ISO-Latin-1" http-equiv="Content-type" />
  </head>
  <body>
   <p>Sacr\xe9 bleu!</p>
  </body>
 </html>
'''

In [None]:
soup = BeautifulSoup(markup)
print(soup.prettify())

Note that the `<meta>` tag has been rewritten to reflect the fact that the document is now in UTF-8.

If you don’t want UTF-8, you can pass an encoding into `prettify()`:

In [None]:
print(soup.prettify("latin-1"))

You can also call `encode()` on the `BeautifulSoup` object, or any element in the soup, just as if it were a Python string:

In [None]:
soup.p.encode("latin-1")

In [None]:
soup.p.encode("utf-8")

Any characters that can’t be represented in your chosen encoding will be converted into numeric XML entity references. Here’s a document that includes the Unicode character SNOWMAN:

In [None]:
markup = u"<b>\N{SNOWMAN}</b>"
snowman_soup = BeautifulSoup(markup)
tag = snowman_soup.b

The SNOWMAN character can be part of a UTF-8 document (it looks like ☃), but there’s no representation for that character in ISO-Latin-1 or ASCII, so it’s converted into `&#9731;` for those encodings:

In [None]:
print(tag.encode("utf-8"))

In [None]:
print(tag.encode("latin-1"))

In [None]:
print(tag.encode("ascii"))

## Unicode, Dammit

You can use Unicode, Dammit without using Beautiful Soup. It’s useful whenever you have data in an unknown encoding and you just want it to become Unicode:

In [None]:
from bs4 import UnicodeDammit
dammit = UnicodeDammit(b"Sacr\xc3\xa9 bleu!")  # bytes in an unknown encoding
print(dammit.unicode_markup)

In [None]:
dammit.original_encoding

Unicode, Dammit’s guesses will get a lot more accurate if you install the `chardet` or `cchardet` Python libraries. The more data you give Unicode, Dammit, the more accurately it will guess. If you have your own suspicions as to what the encoding might be, you can pass them in as a list:

In [None]:
dammit = UnicodeDammit(b"Sacr\xe9 bleu!", ["latin-1", "iso-8859-1"])
print(dammit.unicode_markup)

In [None]:
dammit.original_encoding

### Smart quotes

You can use Unicode, Dammit to convert Microsoft smart quotes to HTML or XML entities:

In [None]:
markup = b"<p>I just \x93love\x94 Microsoft Word\x92s smart quotes</p>"

UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="html").unicode_markup

In [None]:
UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="xml").unicode_markup

You can also convert Microsoft smart quotes to ASCII quotes:

In [None]:
UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="ascii").unicode_markup

Hopefully you’ll find this feature useful, but Beautiful Soup doesn’t use it. Beautiful Soup prefers the default behavior, which is to convert Microsoft smart quotes to Unicode characters along with everything else:

In [None]:
UnicodeDammit(markup, ["windows-1252"]).unicode_markup

### Inconsistent encodings

Sometimes a document is mostly in UTF-8, but contains Windows-1252 characters such as (again) Microsoft smart quotes. This can happen when a website includes data from multiple sources. You can use `UnicodeDammit.detwingle()` to turn such a document into pure UTF-8. Here’s a simple example:

In [None]:
snowmen = (u"\N{SNOWMAN}" * 3)
quote = (u"\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!\N{RIGHT DOUBLE QUOTATION MARK}")
doc = snowmen.encode("utf8") + quote.encode("windows_1252")

This document is a mess. The snowmen are in UTF-8 and the quotes are in Windows-1252. You can display the snowmen or the quotes, but not both:

In [None]:
print(doc)

In [None]:
print(doc.decode("windows-1252"))

`UnicodeDammit.detwingle()` only knows how to handle Windows-1252 embedded in UTF-8 (or vice versa, I suppose), but this is the most common case.
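
The cells above build the mixed-up document but never call `detwingle()` on it. A minimal sketch of the repair step, rebuilding the same `doc` so the block runs on its own (the names `fixed_doc` and `fixed_text` are illustrative):

```python
from bs4 import UnicodeDammit

snowmen = "\N{SNOWMAN}" * 3
quote = "\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!\N{RIGHT DOUBLE QUOTATION MARK}"

# UTF-8 snowmen followed by Windows-1252 smart quotes, as in the cells above.
doc = snowmen.encode("utf8") + quote.encode("windows_1252")

# detwingle() rewrites the Windows-1252 bytes as UTF-8 and returns bytes.
fixed_doc = UnicodeDammit.detwingle(doc)
fixed_text = fixed_doc.decode("utf8")

print(fixed_text)
```

After this step the whole document decodes cleanly as UTF-8, so snowmen and quotes can be displayed together.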


Note that you must know to call `UnicodeDammit.detwingle()` on your data before passing it into `BeautifulSoup` or the `UnicodeDammit` constructor. Beautiful Soup assumes that a document has a single encoding, whatever it might be. If you pass it a document that contains both UTF-8 and Windows-1252, it’s likely to think the whole document is Windows-1252, and the document will come out looking like `â˜ƒâ˜ƒâ˜ƒ“I like snowmen!”`.

`UnicodeDammit.detwingle()` is new in Beautiful Soup 4.1.0.

# Comparing objects for equality

Beautiful Soup says that two `NavigableString` or `Tag` objects are equal when they represent the same HTML or XML markup. In this example, the two `<b>` tags are treated as equal, even though they live in different parts of the object tree, because they both look like `“<b>pizza</b>”`:

In [None]:
markup = "<p>I want <b>pizza</b> and more <b>pizza</b>!</p>"
soup = BeautifulSoup(markup, 'html.parser')
first_b, second_b = soup.find_all('b')
print (first_b == second_b)

In [None]:
print (first_b.previous_element == second_b.previous_element)

If you want to see whether two variables refer to exactly the same object, use `is`:

In [None]:
print(first_b is second_b)

# Copying Beautiful Soup objects

You can use `copy.copy()` to create a copy of any `Tag` or `NavigableString`:

In [None]:
import copy
p_copy = copy.copy(soup.p)
print(p_copy)

The copy is considered equal to the original, since it represents the same markup as the original, but it’s not the same object:

In [None]:
print (soup.p == p_copy)

In [None]:
print (soup.p is p_copy)

The only real difference is that the copy is completely detached from the original Beautiful Soup object tree, just as if `extract()` had been called on it:

In [None]:
print (p_copy.parent)

This is because two different `Tag` objects can’t occupy the same space at the same time.

# Parsing only part of a document

Let’s say you want to use Beautiful Soup to look at a document’s `<a>` tags. It’s a waste of time and memory to parse the entire document and then go over it again looking for `<a>` tags. It would be much faster to ignore everything that wasn’t an `<a>` tag in the first place. The `SoupStrainer` class allows you to choose which parts of an incoming document are parsed. You just create a `SoupStrainer` and pass it in to the `BeautifulSoup` constructor as the `parse_only` argument.

(Note that this feature won’t work if you’re using the html5lib parser. If you use html5lib, the whole document will be parsed, no matter what. This is because html5lib constantly rearranges the parse tree as it works, and if some part of the document didn’t actually make it into the parse tree, it’ll crash. To avoid confusion, in the examples below I’ll be forcing Beautiful Soup to use Python’s built-in parser.)

### `SoupStrainer`

The `SoupStrainer` class takes the same arguments as a typical method from Searching the tree: `name`, `attrs`, `string`, and `**kwargs`. Here are three `SoupStrainer` objects:

In [None]:
from bs4 import SoupStrainer

only_a_tags = SoupStrainer("a")

only_tags_with_id_link2 = SoupStrainer(id="link2")

def is_short_string(string):
    return len(string) < 10

only_short_strings = SoupStrainer(string=is_short_string)

I’m going to bring back the “three sisters” document one more time, and we’ll see what the document looks like when it’s parsed with these three `SoupStrainer` objects:

In [None]:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

In [None]:
print(BeautifulSoup(html_doc, "html.parser", parse_only=only_a_tags).prettify())

In [None]:
print(BeautifulSoup(html_doc, "html.parser", parse_only=only_tags_with_id_link2))

# Troubleshooting

## `diagnose()`

If you’re having trouble understanding what Beautiful Soup does to a document, pass the document into the `diagnose()` function. (New in Beautiful Soup 4.2.0.) Beautiful Soup will print out a report showing you how different parsers handle the document, and tell you if you’re missing a parser that Beautiful Soup could be using:

In [None]:
from bs4.diagnose import diagnose
with open("bad.html") as fp:
    data = fp.read()
diagnose(data)

**Note:** `bad.html` is in the `data` folder.

## Version mismatch problems


* `SyntaxError: Invalid syntax` (on the line `ROOT_TAG_NAME = u'[document]'`): caused by running the Python 2 version of Beautiful Soup under Python 3, without converting the code.
* `ImportError: No module named HTMLParser`: caused by running the Python 2 version of Beautiful Soup under Python 3.
* `ImportError: No module named html.parser`: caused by running the Python 3 version of Beautiful Soup under Python 2.
* `ImportError: No module named BeautifulSoup`: caused by running Beautiful Soup 3 code on a system that doesn’t have BS3 installed, or by writing Beautiful Soup 4 code without knowing that the package name has changed to `bs4`.
* `ImportError: No module named bs4`: caused by running Beautiful Soup 4 code on a system that doesn’t have BS4 installed.

**Note:** All files used in this notebook are in the `data` folder; change the paths as needed to run the code.