# Web Scraping: Part I - The Beautiful Soup Package

```
Prepared for CPSC4300/6300 Section 001 -- Applied Data Science
Xizhou Feng
Clemson University
Fall 2020
```

# Overview

Many data science projects involve retrieving data from multiple data sources. Because web sites are an important data source, being able to extract the necessary information from the Internet quickly and effectively is a must-have skill for today's data science practitioner.

In this notebook, we learn how to use Python to scrape data from web sites. We organize the notebook into two parts:

1. Learn the Beautiful Soup Package
2. Extract the Clemson Tigers roster from ESPN's web site.

After finishing this notebook, you should be able to scrape other websites as well. 

This notebook is compiled based on the *Beautiful Soup Documentation* (URL: https://www.crummy.com/software/BeautifulSoup/bs4/doc/).

# The Structure of a Web Page

Most web pages are written in HTML, which stands for Hyper Text Markup Language. HTML describes the structure of a Web page by putting a series of elements represented by tags into a text file. 

Browsers then use the HTML tags to render the contents of the page. Although browsers do not display these tags on the screen, you can use the "Developer Tools" provided by a modern browser to view the HTML source.

## Exercise 1. Check the contents of a web page

1. Open www.us.gov in a browser.
2. Open the "Developer Tools" in the browser to examine the elements of the web page.

## HTML Document Object Model

The HTML DOM (Document Object Model) abstracts a web page into a tree of objects. For example, the figure on the right represents the DOM tree of the HTML source on the left.

<img style="float: right;" src="https://www.w3schools.com/js/pic_htmltree.gif">

```html
<!DOCTYPE html>
<html>
  <head>
    <title>My Title</title>
  </head>
  
  <body>
    <a href="https://usa.gov">My link</a>
    <h1>My header</h1>
  </body>
</html>
```

From this tree structure, we can find:

+ A document consists of a set of nodes organized in a hierarchical manner.
+ There are at least four types of nodes: 
   + Document node: the root of the DOM tree.
   + Element node: a block of text enclosed by a pair of tags, for example, ```<h1>My header</h1>```.
   + Text node: a leaf node consisting of a block of text without tags.
   + Attribute node: an attribute inside the opening tag, for example, ```href="https://usa.gov"```.
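
To make these node types concrete, the sketch below walks the same HTML with Python's standard-library `html.parser` and records one line per element, attribute, and text node, indented by depth. The `NodePrinter` class is a hypothetical helper written for this illustration; it is not how a browser or Beautiful Soup builds the DOM, and the document node itself (the root) is implied rather than printed.

```python
from html.parser import HTMLParser


class NodePrinter(HTMLParser):
    """Hypothetical helper: record one line per node, indented by tree depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # An opening tag starts an element node; its attributes hang off it.
        self.lines.append("  " * self.depth + f"element: {tag}")
        for name, value in attrs:
            self.lines.append("  " * (self.depth + 1) + f"attribute: {name}={value}")
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

    def handle_data(self, data):
        # Text between tags is a text node; whitespace-only text is skipped here.
        if data.strip():
            self.lines.append("  " * self.depth + f"text: {data.strip()}")


printer = NodePrinter()
printer.feed('<html><head><title>My Title</title></head>'
             '<body><a href="https://usa.gov">My link</a><h1>My header</h1></body></html>')
print("\n".join(printer.lines))
```

The indentation of the printed lines mirrors the parent/child relationships in the figure.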

# The Beautiful Soup Library

Beautiful Soup is a Python library for extracting data out of HTML and XML files. It provides idiomatic ways of navigating, searching, and modifying a DOM tree parsed from a web page. See: https://www.crummy.com/software/BeautifulSoup/.

## Create Beautiful Soup

Here, we consider the following HTML text. First, we assign the text to a variable `html_doc`.

In [8]:
html_doc = """
<!DOCTYPE html>
<html>
  <head>
    <title>My Title</title>
  </head>

  <body>
    <a href="https://usa.gov">My link</a>
    <h1>My header</h1>
  </body>
</html>
"""

Then we import the `BeautifulSoup` class from the `bs4` package.

### Review: Python module and package

A Python module is an organizational unit of Python code. Each module has its own namespace containing arbitrary Python objects. Python code in one module gains access to the code in another module by the process of importing it.

Python has only one type of module object, and all modules are of this type. Considering the large number of modules, Python uses the concept of packages to help organize modules and provide a naming hierarchy. You can think of packages as the directories on a file system and modules as files within directories, although packages and modules need not originate from the file system.

All modules have a name. Subpackage names are separated from their parent package name by dots, akin to Python’s standard attribute access syntax. Thus you might have a module called sys and a package called email, which in turn has a subpackage called email.mime and a module within that subpackage called email.mime.text.

The `import` statement is the most common way to import a module. Below are some examples:

```
import math                       # math imported and bound locally
import os.path                    # os.path imported; the top-level name os is bound locally
import numpy as np                # numpy imported and bound as np
from bs4 import BeautifulSoup     # bs4 imported and bs4.BeautifulSoup bound as BeautifulSoup
```

Executing the basic import statement involves two steps:

1. find a module, loading and initializing it if necessary
1. define a name or names in the local namespace for the scope where the import statement occurs.


The `from` form of the `import` statement involves a slightly more complex process:

1. find the module specified in the from clause, loading and initializing it if necessary;
1. for each of the identifiers specified in the import clauses:
   1. check if the imported module has an attribute by that name
   1. if not, attempt to import a submodule with that name and then check the imported module again for that attribute
   1. if the attribute is not found, ImportError is raised.
   1. otherwise, a reference to that value is stored in the local namespace, using the name in the as clause if it is present, otherwise using the attribute name
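
The steps above can be observed with standard-library modules only. The following is a minimal sketch; `no_such_name` is a deliberately nonexistent name chosen for illustration, and `root` is an arbitrary alias.

```python
import importlib
import math                      # binds the name "math" in this namespace
import os.path                   # binds the top-level name "os"
from math import sqrt as root    # binds only the name "root"

print(math.pi > 3)
print(os.path.basename("/tmp/data.csv"))
print(root(16.0))

# Asking for a name the module does not define raises ImportError.
try:
    from math import no_such_name   # hypothetical name that does not exist
except ImportError as err:
    print("ImportError:", err)

# importlib.import_module performs the "find, load, initialize" step explicitly
# and returns the module object instead of binding a name.
json_module = importlib.import_module("json")
print(json_module.dumps({"a": 1}))
```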





In [9]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

Here `BeautifulSoup` is a class representing a parsed HTML or XML document. Internally, this class defines the basic interface called by the tree builders when converting an HTML/XML document into a data structure.

In Python, every class uses a constructor to initialize an object of its type. The `__init__` method defines how a new object of the class is initialized.
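
As a small illustration of `__init__`, the sketch below defines a hypothetical `Page` class (not part of Beautiful Soup). Calling the class creates an object and passes the arguments to `__init__`, in the same way that `BeautifulSoup(html_doc, 'html.parser')` does.

```python
class Page:
    """Hypothetical class used only to illustrate __init__."""

    def __init__(self, url, title="untitled"):
        # __init__ receives the new object as `self` and fills in its attributes.
        self.url = url
        self.title = title


page = Page("https://usa.gov", title="My Title")   # calling the class runs __init__
print(page.url, page.title)
```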

## Tips: access Python documentation and source code

In IPython and Jupyter, you can use `?` and `??` to access the documentation and the source code of a Python object, respectively.

For example, `?BeautifulSoup` shows the documentation of the `BeautifulSoup` class and `??BeautifulSoup` shows the corresponding source code.
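
The `?` and `??` shortcuts only work inside IPython or Jupyter. In a plain Python script, a rough equivalent is the built-in `help()` function or the standard-library `inspect` module. The sketch below uses `json.dumps` purely as a stand-in object so that it runs without extra packages.

```python
import inspect
import json   # stand-in module; any importable Python object works the same way

doc_text = inspect.getdoc(json.dumps)        # roughly what ? shows as the docstring
source_text = inspect.getsource(json.dumps)  # roughly what ?? shows as the source

print(doc_text.splitlines()[0])
print(source_text.splitlines()[0])
```

Note that `inspect.getsource()` only works for objects implemented in Python whose source file is available.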

In [10]:
?BeautifulSoup

Init signature:
BeautifulSoup(
    markup='',
    features=None,
    builder=None,
    parse_only=None,
    from_encoding=None,
    exclude_encodings=None,
    element_classes=None,
    **kwargs,
)
Docstring:     
A data structure representing a parsed HTML or XML document.

Most of the methods you'll call on a BeautifulSoup object are inherited from
PageElement or Tag.

Internally, this class defines th

In [11]:
??BeautifulSoup

Init signature:
BeautifulSoup(
    markup='',
    features=None,
    builder=None,
    parse_only=None,
    from_encoding=None,
    exclude_encodings=None,
    element_classes=None,
    **kwargs,
)
Source:        
class BeautifulSoup(Tag):
    """A data structure representing a parsed HTML or XML document.

## Kinds of Beautiful Soup Objects

Beautiful Soup transforms an HTML document into a tree of Python objects. Normally, we deal with four kinds of objects when scraping a webpage using Beautiful Soup: 
+ `Tag`: corresponds to an XML or HTML tag in the original document.
+ `NavigableString`: corresponds to a bit of text within a tag.
+ `BeautifulSoup`: represents the document as a whole. 
+ `Comment`: a special type of NavigableString object.

### The BeautifulSoup Objects

When working with a BeautifulSoup object, it helps to know which attributes and methods it has. You can use one of three ways to list all the attributes and methods of an object:

+ the `dir()` built-in function
+ the `__dict__` attribute of the class/object
+ the `inspect` module

In [12]:
dir(soup)

['ASCII_SPACES',
 'DEFAULT_BUILDER_FEATURES',
 'ROOT_TAG_NAME',
 '__bool__',
 '__call__',
 '__class__',
 '__contains__',
 '__copy__',
 '__delattr__',
 '__delitem__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattr__',
 '__getattribute__',
 '__getitem__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setitem__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 '__weakref__',
 '_all_strings',
 '_check_markup_is_url',
 '_decode_markup',
 '_feed',
 '_find_all',
 '_find_one',
 '_is_xml',
 '_lastRecursiveChild',
 '_last_descendant',
 '_linkage_fixer',
 '_most_recent_element',
 '_namespaces',
 '_popToTag',
 '_should_pretty_print',
 'append',
 'attrs',
 'builder',
 'can_be_empty_element',
 'cdata_list_attributes',
 'childGenerator',
 'children',
 'clear',
 'conta

In [13]:
BeautifulSoup.__dict__

mappingproxy({'__module__': 'bs4',
              '__doc__': 'A data structure representing a parsed HTML or XML document.\n\n    Most of the methods you\'ll call on a BeautifulSoup object are inherited from\n    PageElement or Tag.\n\n    Internally, this class defines the basic interface called by the\n    tree builders when converting an HTML/XML document into a data\n    structure. The interface abstracts away the differences between\n    parsers. To write a new tree builder, you\'ll need to understand\n    these methods as a whole.\n\n    These methods will be called by the BeautifulSoup constructor:\n      * reset()\n      * feed(markup)\n\n    The tree builder may call these methods from its feed() implementation:\n      * handle_starttag(name, attrs) # See note about return value\n      * handle_endtag(name)\n      * handle_data(data) # Appends to the current data node\n      * endData(containerClass) # Ends the current data node\n\n    No matter how complicated the underlying p

In [14]:
# List all attributes
import inspect

for a in inspect.getmembers(soup):
    if not a[0].startswith('_'):
        if not inspect.ismethod(a[1]):
            print(a[0])

ASCII_SPACES
DEFAULT_BUILDER_FEATURES
ROOT_TAG_NAME
attrs
builder
can_be_empty_element
cdata_list_attributes
children
contains_replacement_characters
contents
currentTag
current_data
declared_html_encoding
decomposed
descendants
element_classes
hidden
isSelfClosing
is_empty_element
is_xml
known_xml
markup
name
namespace
next
nextSibling
next_element
next_elements
next_sibling
next_siblings
original_encoding
parent
parents
parse_only
parserClass
parser_class
prefix
preserve_whitespace_tag_stack
preserve_whitespace_tags
previous
previousSibling
previous_element
previous_elements
previous_sibling
previous_siblings
string
string_container_stack
strings
stripped_strings
tagStack
text


In [15]:
# List all methods

for a in inspect.getmembers(soup):
    if not a[0].startswith('_'):
        if inspect.ismethod(a[1]):
            print(a[0])

append
childGenerator
clear
decode
decode_contents
decompose
encode
encode_contents
endData
extend
extract
fetchNextSiblings
fetchParents
fetchPrevious
fetchPreviousSiblings
find
findAll
findAllNext
findAllPrevious
findChild
findChildren
findNext
findNextSibling
findNextSiblings
findParent
findParents
findPrevious
findPreviousSibling
findPreviousSiblings
find_all
find_all_next
find_all_previous
find_next
find_next_sibling
find_next_siblings
find_parent
find_parents
find_previous
find_previous_sibling
find_previous_siblings
format_string
formatter_for_name
get
getText
get_attribute_list
get_text
handle_data
handle_endtag
handle_starttag
has_attr
has_key
index
insert
insert_after
insert_before
new_string
new_tag
nextGenerator
nextSiblingGenerator
object_was_parsed
parentGenerator
popTag
prettify
previousGenerator
previousSiblingGenerator
pushTag
recursiveChildGenerator
renderContents
replaceWith
replaceWithChildren
replace_with
replace_with_children
reset
select
select_one
setup
smooth
s

### check the type of the object

In [16]:
type(soup)

bs4.BeautifulSoup

### print the contents of the object

In [17]:
# lazy print
soup


<!DOCTYPE html>

<html>
<head>
<title>My Title</title>
</head>
<body>
<a href="https://usa.gov">My link</a>
<h1>My header</h1>
</body>
</html>

In [18]:
# Pretty print
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   My Title
  </title>
 </head>
 <body>
  <a href="https://usa.gov">
   My link
  </a>
  <h1>
   My header
  </h1>
 </body>
</html>



### Inspect the attributes of the object

In [20]:
soup.name

'[document]'

In [21]:
soup.attrs

{}

In [22]:
soup.contents

['\n',
 'html',
 '\n',
 <html>
 <head>
 <title>My Title</title>
 </head>
 <body>
 <a href="https://usa.gov">My link</a>
 <h1>My header</h1>
 </body>
 </html>,
 '\n']

In [23]:
soup.children

<list_iterator at 0x148b99e50c90>

In [24]:
for child in soup.children:
    print(type(child), child.name, child)

<class 'bs4.element.NavigableString'> None 

<class 'bs4.element.Doctype'> None html
<class 'bs4.element.NavigableString'> None 

<class 'bs4.element.Tag'> html <html>
<head>
<title>My Title</title>
</head>
<body>
<a href="https://usa.gov">My link</a>
<h1>My header</h1>
</body>
</html>
<class 'bs4.element.NavigableString'> None 



### Search for Tags

In [26]:
# Find all matching Tag elements
soup.find_all('h1')

[<h1>My header</h1>]

In [27]:
for c in soup.find_all('h1'):
    print(type(c), c.name, c)

<class 'bs4.element.Tag'> h1 <h1>My header</h1>


In [28]:
# Find the first matching Tag object
soup.find('h1')

<h1>My header</h1>

### The Tag Object

In [29]:
head = soup.head

In [30]:
head

<head>
<title>My Title</title>
</head>

In [31]:
type(head)

bs4.element.Tag

In [32]:
head.name

'head'

In [33]:
head.attrs

{}

In [34]:
head.children

<list_iterator at 0x148b99e3bb50>

In [35]:
head.find(True)

<title>My Title</title>

In [36]:
head.find_all(True)

[<title>My Title</title>]

### The Attribute Object

In [37]:
link = soup.find('a')

In [38]:
link

<a href="https://usa.gov">My link</a>

In [39]:
type(link)

bs4.element.Tag

In [40]:
attrs = link.attrs

In [41]:
attrs

{'href': 'https://usa.gov'}

In [42]:
url = attrs['href']

In [43]:
url

'https://usa.gov'

### The Text Object

In [44]:
link = soup.find('a')
text = link.get_text()
print(text)
print(type(text))

My link
<class 'str'>


## Navigate the tree

### Going down

#### Navigating using tag names

In [45]:
soup.head

<head>
<title>My Title</title>
</head>

In [46]:
soup.title

<title>My Title</title>

In [47]:
soup.a

<a href="https://usa.gov">My link</a>

In [48]:
soup.body.a

<a href="https://usa.gov">My link</a>

#### Navigating using links from a node to its children.

+ `.contents`: an attribute that provides a list of a tag’s children.
+ `.children`: an attribute that provides an iterator over a tag’s children.
+ `.descendants`: an attribute that allows iterating over all of a tag’s children, recursively: its direct children, the children of its direct children, and so on.
+ `.string`: If a tag has only one child, and that child is a NavigableString, the child is made available as `.string`.
+ `.strings`: a generator that yields every string inside a tag; use it when the tag contains more than one string.
+ `.stripped_strings`: a generator that is similar to `strings` but will remove extra whitespace.
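
The following sketch exercises each of these attributes. It assumes a compact copy of the example document with no whitespace between tags, so that no whitespace-only strings show up among the children; the names `demo` and `spaced` are chosen here to avoid overwriting the notebook's `soup`.

```python
from bs4 import BeautifulSoup

# Same structure as html_doc, written without whitespace between tags.
doc = ('<html><head><title>My Title</title></head>'
       '<body><a href="https://usa.gov">My link</a><h1>My header</h1></body></html>')
demo = BeautifulSoup(doc, 'html.parser')
body = demo.body

print(body.contents)                                    # list of direct children
print([child.name for child in body.children])          # ['a', 'h1']
print([type(node).__name__ for node in body.descendants])  # tags and strings, depth first
print(demo.title.string)                                # My Title
print(body.string)                                      # None: body has two children
print(list(body.strings))                               # ['My link', 'My header']

# stripped_strings drops leading/trailing whitespace and whitespace-only strings.
spaced = BeautifulSoup('<p>  one  <b> two </b></p>', 'html.parser')
print(list(spaced.strings))                             # ['  one  ', ' two ']
print(list(spaced.stripped_strings))                    # ['one', 'two']
```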

### Going Up
+ `.parent`: an attribute that returns an element’s parent.
+ `.parents`: an attribute that iterates over all of an element’s ancestors, from its parent up to the top of the tree.
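
A short sketch of going up the tree, again on a compact stand-in document (the variable name `demo` is an assumption of this example):

```python
from bs4 import BeautifulSoup

doc = '<html><head><title>My Title</title></head><body><h1>My header</h1></body></html>'
demo = BeautifulSoup(doc, 'html.parser')

title_text = demo.title.string
print(title_text.parent.name)                  # title: a string's parent is its enclosing tag
print([p.name for p in demo.title.parents])    # ['head', 'html', '[document]']
print(demo.parent)                             # None: the BeautifulSoup object is the root
```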

### Going sideways
+ `.next_sibling`
+ `.previous_sibling`
+ `.next_siblings`
+ `.previous_siblings`
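
A minimal sketch of sideways navigation follows. The first document is written without whitespace between tags; the second shows that, when tags are separated by newlines (as in `html_doc`), the immediate sibling of a tag is a whitespace string rather than the next tag. The names `demo` and `spaced` are assumptions of this example.

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup(
    '<body><a href="https://usa.gov">My link</a><h1>My header</h1><p>End</p></body>',
    'html.parser')
link = demo.a

print(link.next_sibling)                             # <h1>My header</h1>
print(link.previous_sibling)                         # None: <a> is the first child
print([s.name for s in link.next_siblings])          # ['h1', 'p']
print([s.name for s in demo.p.previous_siblings])    # ['h1', 'a'], nearest first

# With whitespace between tags, the next sibling is the whitespace string.
spaced = BeautifulSoup('<body>\n<a>x</a>\n<h1>y</h1>\n</body>', 'html.parser')
print(repr(spaced.a.next_sibling))                   # '\n'
```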

### Going back and forth

+ `.next_element`
+ `.previous_element`
+ `.next_elements`
+ `.previous_elements`
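
These attributes follow the order in which the document was parsed rather than the tree structure. In the sketch below (on a compact stand-in document named `demo`), the element parsed right after the `<a>` tag is the string inside it, not the sibling `<h1>` tag.

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup(
    '<body><a href="https://usa.gov">My link</a><h1>My header</h1></body>',
    'html.parser')
link = demo.a

print(repr(link.next_element))        # 'My link': the string inside <a> was parsed next
print(link.next_sibling)              # <h1>My header</h1>
print(link.string.next_element)       # <h1>My header</h1>
print(demo.h1.previous_element)       # My link
print(len(list(link.next_elements)))  # 3: 'My link', <h1>, 'My header'
```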

## Searching the Tree

Beautiful Soup defines several methods for searching the parse tree. Among these methods, the two most popular methods are:

+ `find_all(name, attrs, recursive, string, limit, **kwargs)`
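   + `find_all()` returns all elements that match a filter, or at most *limit* elements if `limit` is given.
+ `find(name, attrs, recursive, string, **kwargs)`
   + `find()` returns the first element that matches a filter.

`find()` is nearly equivalent to `find_all()` with argument `limit=1`; the difference is that `find()` returns the element itself (or `None` if nothing matches), whereas `find_all()` returns a list.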
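
The relationship between the two methods shows up most clearly in their return types, especially when nothing matches. A minimal sketch on a compact stand-in document (`demo` is a name chosen for this example):

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup(
    '<body><a href="https://usa.gov">My link</a><h1>My header</h1></body>',
    'html.parser')

print(demo.find_all('a', limit=1))   # a list holding one Tag
print(demo.find('a'))                # the Tag itself
print(demo.find_all('table'))        # []   (no match gives an empty list)
print(demo.find('table'))            # None (no match gives None)
```

Because `find()` can return `None`, check the result before accessing attributes such as `.name` or `['href']` on it.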

### Filters

You can pass a filter to a method like `find_all()` to include only the parts of the document you are interested in. Beautiful Soup supports several kinds of filters, which can match on a tag’s name, on its attributes, on the text of a string, or on some combination of these.

+ A string: The simplest filter is a string. Pass a string to a search method and Beautiful Soup will perform a match against that exact string.

In [49]:
links = soup.find_all('a')
links

[<a href="https://usa.gov">My link</a>]

+ A regular expression: If you pass in a regular expression object, Beautiful Soup will filter against that regular expression using its `search()` method.

In [50]:
import re
for tag in soup.find_all(re.compile("^t")):
    print(tag)

<title>My Title</title>


+ A list: If you pass in a list, Beautiful Soup will allow a string match against any item in that list.

In [51]:
for tag in soup.find_all(["a", "h1"]):
    print(tag)

<a href="https://usa.gov">My link</a>
<h1>My header</h1>


+ True: The value True matches everything it can. This code finds all the tags in the document, but none of the text strings.

In [52]:
for tag in soup.find_all(True):
    print(tag.name)

html
head
title
body
a
h1


+ A Function: You can also define a function that takes an element as its only argument and use it as a filter. The function should return True if the argument matches, and False otherwise.

In [53]:
def link_to_gov(tag):
    # Match <a> tags whose href attribute ends with '.gov'.
    return tag.name == 'a' and tag.has_attr('href') and tag['href'].endswith('.gov')

In [54]:
for tag in soup.find_all(link_to_gov):
    print(tag)

<a href="https://usa.gov">My link</a>


### find_all

`find_all(name, attrs, recursive, string, limit, **kwargs)`

#### The name argument
Passing in a value for name tells Beautiful Soup to only consider tags with certain names.

In [55]:
soup.find_all('title')

[<title>My Title</title>]

#### The keyword argument
Any argument that’s not recognized will be turned into a filter on one of a tag’s attributes.

In [56]:
import re
soup.find_all(href=re.compile(r"\.gov$"))   # escape the dot so it matches a literal '.'

[<a href="https://usa.gov">My link</a>]

#### Search by CSS class

It’s very useful to search for a tag that has a certain CSS class. Because the name of the CSS attribute, “class”, is a reserved word in Python, using `class` as a keyword argument will give you a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument `class_`.
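
Since `html_doc` contains no `class` attributes, the sketch below uses a small hypothetical document (the markup, ids, and URLs are made up for illustration) to show `class_` in action.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with CSS classes; not part of html_doc.
demo = BeautifulSoup(
    '<p class="story">Intro '
    '<a class="sister" id="link1" href="http://example.com/elsie">Elsie</a> '
    '<a class="sister external" id="link2" href="http://example.com/lacie">Lacie</a></p>',
    'html.parser')

# class_ matches a tag if ANY of its CSS classes equals the given value.
print([t['id'] for t in demo.find_all('a', class_='sister')])     # ['link1', 'link2']
print([t['id'] for t in demo.find_all(class_='external')])        # ['link2']

# The attrs dictionary is an alternative that avoids the reserved word entirely.
print([t['id'] for t in demo.find_all('a', attrs={'class': 'sister'})])
```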

#### The string argument
With string you can search for strings instead of tags.

In [57]:
soup.find_all(string=re.compile("My"))

['My Title', 'My link', 'My header']

#### The limit argument
The limit argument tells Beautiful Soup to stop gathering results after it’s found a certain number.

In [58]:
soup.find_all(string=re.compile("My"), limit=1)

['My Title']

#### The recursive argument
If you call `mytag.find_all()`, Beautiful Soup will examine all the descendants of `mytag`: its children, its children’s children, and so on. If you only want Beautiful Soup to consider direct children, you can pass in `recursive=False`.

In [59]:
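soup.find_all('title')                    # searches all descendants: [<title>My Title</title>]
soup.find_all('title', recursive=False)   # direct children of soup only; <title> is nested deeper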

[]

#### Shortcut for find_all()
If you treat the BeautifulSoup object or a Tag object as though it were a function, then it’s the same as calling find_all() on that object.

In [64]:
soup.find_all("a")
soup("a")                            # same as the line above
soup.title.find_all(string=True)
soup.title(string=True)              # same as the line above

['My Title']

### find
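Signature: `find(name, attrs, recursive, string, **kwargs)`

`find()` returns a single result: the first matching element, or `None` if nothing matches. `find_all()` with argument `limit=1` finds the same element but returns it wrapped in a list.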

In [65]:
soup.find_all("a")

[<a href="https://usa.gov">My link</a>]

In [66]:
type(soup.find_all("a"))

bs4.element.ResultSet

In [67]:
soup.find("a")

<a href="https://usa.gov">My link</a>

In [68]:
type(soup.find("a"))

bs4.element.Tag

### Other find methods

#### find_parents() and find_parent()

+ `find_parents(name, attrs, string, limit, **kwargs)`
+ `find_parent(name, attrs, string, **kwargs)`

Remember that `find_all()` and `find()` work their way down the tree, looking at a tag’s descendants. The above two methods do the opposite: they work their way up the tree, looking at a tag’s (or a string’s) parents.
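
A minimal sketch of searching upward, using a hypothetical nested document (the `<div>` and `<p>` wrappers are added only for illustration):

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup(
    '<html><body><div><p><a href="https://usa.gov">My link</a></p></div></body></html>',
    'html.parser')
link = demo.a

print(link.find_parent('div').name)                           # div
print([t.name for t in link.find_parents(['div', 'body'])])   # ['div', 'body'], nearest first
print(link.find_parent('table'))                              # None: no such ancestor
print(link.string.find_parent('p').name)                      # p: strings can search upward too
```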

#### Find siblings
+ `find_next_siblings(name, attrs, string, limit, **kwargs)`
+ `find_next_sibling(name, attrs, string, **kwargs)`
+ `find_previous_siblings(name, attrs, string, limit, **kwargs)`
+ `find_previous_sibling(name, attrs, string, **kwargs)`
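
The sibling searches accept the same filters as `find_all()`. The sketch below uses a hypothetical list document to show each of the four methods.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
demo = BeautifulSoup(
    '<ul><li>one</li><li class="x">two</li><li>three</li></ul>',
    'html.parser')
first = demo.li
last = demo.find_all('li')[-1]

print(first.find_next_sibling('li'))                                    # <li class="x">two</li>
print([li.get_text() for li in first.find_next_siblings('li')])         # ['two', 'three']
print(last.find_previous_sibling('li', class_='x').get_text())          # two
print([li.get_text() for li in last.find_previous_siblings('li')])      # ['two', 'one']
```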

#### find next and previous
+ `find_all_next(name, attrs, string, limit, **kwargs)`
+ `find_next(name, attrs, string, **kwargs)`
+ `find_all_previous(name, attrs, string, limit, **kwargs)`
+ `find_previous(name, attrs, string, **kwargs)`
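
These methods search in document order (using `.next_elements` and `.previous_elements`), so they can reach tags that are neither siblings nor descendants. A minimal sketch on a compact copy of the example document, named `demo` here:

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup(
    '<html><head><title>My Title</title></head>'
    '<body><a href="https://usa.gov">My link</a><h1>My header</h1></body></html>',
    'html.parser')
title = demo.title

print(title.find_next('a'))                                  # the <a> tag, found in <body>
print([t.name for t in title.find_all_next(True)])           # ['body', 'a', 'h1']
print(demo.h1.find_previous('title'))                        # <title>My Title</title>
print([t.name for t in demo.h1.find_all_previous(['a', 'title'])])   # ['a', 'title']
```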


## CSS selectors

CSS (Cascading Style Sheets) describes how HTML elements are to be displayed on screen, paper, or in other media. CSS saves a lot of work in web design as it can control the layout of multiple web pages all at once.

In CSS, selectors are patterns used to select the element(s) you want to style. To know more about CSS and CSS selectors, you can read some tutorials provided at the W3Schools website (e.g., https://www.w3schools.com/cssref/css_selectors.asp).

As of version 4.7.0, Beautiful Soup supports most CSS4 selectors via the SoupSieve project. BeautifulSoup has a .select() method which uses SoupSieve to run a CSS selector against a parsed document and return all the matching elements. Tag has a similar method which runs a CSS selector against the contents of a single tag.

Generally speaking, anything you can select with a CSS selector can also be found with `find`, `find_all`, and their derivatives.

+ Find tags

In [69]:
soup.select("title")
soup.select("p:nth-of-type(3)")

[]

+ Find tags under other tags

In [70]:
soup.select("body a")
soup.select("html head title")

[<title>My Title</title>]

+ Find tags directly under other tags

In [72]:
soup.select("head > title"), soup.select("body > a"), soup.select("p > a"), soup.select("p > #link1")

([<title>My Title</title>], [<a href="https://usa.gov">My link</a>], [], [])

+ Find the siblings of tags
   + Find all siblings

In [73]:
soup.select("#link1 ~ .sister")

[]

   + Find the first following sibling

In [74]:
soup.select("#link1 + .sister")

[]

+ Find tags by CSS class

In [75]:
soup.select(".sister")
soup.select("[class~=sister]")

[]

+ Find tags by ID

In [76]:
soup.select("#link1")
soup.select("a#link2")

[]

+ Find tags that match any selector from a list of selectors

In [None]:
soup.select("#link1,#link2")

+ Test for the existence of an attribute

In [None]:
soup.select('a[href]')

+ Find tags by attribute value
   + Equality

In [None]:
soup.select('a[href="http://example.com/elsie"]')

   + Starts with

In [None]:
soup.select('a[href^="http://example.com/"]')

   + Ends with

In [None]:
soup.select('a[href$="tillie"]')

   + Contains a substring

In [None]:
soup.select('a[href*=".com/el"]')

+ Find only the first matching tag with `select_one()`

In [None]:
soup.select_one(".sister")
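
Most of the selectors above return empty lists because `html_doc` has no `<p>` tags, ids, or CSS classes. The ids and classes they reference (`#link1`, `.sister`) appear to come from the "three sisters" example in the Beautiful Soup documentation. The sketch below assumes a shortened version of that document; the names `sisters_doc`, `sisters`, and the helper `ids()` are introduced here for illustration only.

```python
from bs4 import BeautifulSoup

# Shortened version of the "three sisters" example (an assumption of this sketch).
sisters_doc = """
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
"""
sisters = BeautifulSoup(sisters_doc, 'html.parser')


def ids(tags):
    """Hypothetical helper: reduce a list of tags to their id attributes."""
    return [t['id'] for t in tags]


print(ids(sisters.select('.sister')))             # ['link1', 'link2', 'link3']
print(ids(sisters.select('p > #link1')))          # ['link1']
print(ids(sisters.select('#link1 ~ .sister')))    # ['link2', 'link3']
print(ids(sisters.select('#link1 + .sister')))    # ['link2']
print(ids(sisters.select('a[href$="tillie"]')))   # ['link3']
print(ids(sisters.select('a[href*=".com/el"]')))  # ['link1']
print(sisters.select_one('.sister')['id'])        # link1
```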

## Output
+ Pretty-printing
The `prettify()` method will turn a Beautiful Soup parse tree into a nicely formatted Unicode string, with a separate line for each tag and each string.

+ Non-pretty printing
If you just want a string, with no fancy formatting, you can call `str()` on a BeautifulSoup object or a Tag. (`unicode()` is the Python 2 equivalent and does not exist in Python 3.)

+ `get_text()`
If you only want the text part of a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string.
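
The three output styles can be compared side by side. This is a minimal sketch on a compact stand-in document; `markup` and `demo` are names chosen for this example, and the separator `' | '` is arbitrary.

```python
from bs4 import BeautifulSoup

markup = '<body><a href="https://usa.gov">My link</a><h1>My header</h1></body>'
demo = BeautifulSoup(markup, 'html.parser')

print(demo.prettify())                    # one line per tag and per string, indented
print(str(demo))                          # the markup as a plain string, no reformatting
print(demo.get_text())                    # My linkMy header
print(demo.get_text(' | ', strip=True))   # My link | My header
```

Passing a separator and `strip=True` to `get_text()` is often convenient when the text of adjacent tags would otherwise run together.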