#### Four main object types in Beautiful Soup:
- BeautifulSoup object
- Tag object
- NavigableString object
- Comment object
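
As a quick orientation, the sketch below builds one object of each type from a tiny, made-up snippet (the markup and variable names are illustrative assumptions, not part of the document used later) and prints the class name of each.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Hypothetical markup, chosen only so that it contains one object of each type.
markup = "<p>Hello<!-- a hidden note --></p>"
soup = BeautifulSoup(markup, "html.parser")

p = soup.p              # the <p> element: a Tag
text = p.contents[0]    # the text inside <p>: a NavigableString
note = p.contents[1]    # the HTML comment: a Comment

for obj in (soup, p, text, note):
    print(type(obj).__name__)
```

Comment is a special kind of NavigableString, so anything that works on a NavigableString should also work on a Comment.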

In [1]:
from bs4 import BeautifulSoup

##### BeautifulSoup object
Let's define an HTML document to use as our markup. We'll call that document html_doc.

In [2]:
html_doc = '''
<html><head><title>Best Books</title></head>
<body>
<p class='title'><b>DATA SCIENCE FOR DUMMIES</b></p>

<p class='description'>Jobs in data science abound, but few people have the data science skills needed to fill these increasingly important roles in organizations. Data Science For Dummies is the pe
<br><br>
Edition 1 of this book:
        <br>
 <ul>
  <li>Provides a background in data science fundamentals before moving on to working with relational databases and unstructured data and preparing your data for analysis</li>
  <li>Details different data visualization techniques that can be used to showcase and summarize your data</li>
  <li>Explains both supervised and unsupervised machine learning, including regression, model validation, and clustering techniques</li>
  <li>Includes coverage of big data processing tools like MapReduce, Hadoop, Storm, and Spark</li>   
  </ul>
<br><br>
What to do next:
<br>
<a href='http://www.data-mania.com/blog/books-by-lillian-pierson/' class = 'preview' id='link 1'>See a preview of the book</a>,
<a href='http://www.data-mania.com/blog/data-science-for-dummies-answers-what-is-data-science/' class = 'preview' id='link 2'>get the free pdf download,</a> and then
<a href='http://bit.ly/Data-Science-For-Dummies' class = 'preview' id='link 3'>buy the book!</a> 
</p>

<p class='description'>...</p>
'''

In [7]:
# Let's start by looking at the BeautifulSoup constructor.
# If you don't name a parser, the constructor picks the best HTML parser
# installed on your system, so results can differ between environments.
# Let's pick a parser for our constructor explicitly instead.
soup = BeautifulSoup(html_doc, 'html.parser')
# Here 'html.parser' tells the constructor to use Python's built-in HTML parser.
# soup is a BeautifulSoup object: the markup in html_doc transformed into a
# parse tree that represents the structure of the document.
print(soup)


<html><head><title>Best Books</title></head>
<body>
<p class="title"><b>DATA SCIENCE FOR DUMMIES</b></p>
<p class="description">Jobs in data science abound, but few people have the data science skills needed to fill these increasingly important roles in organizations. Data Science For Dummies is the pe
<br/><br/>
Edition 1 of this book:
        <br/>
<ul>
<li>Provides a background in data science fundamentals before moving on to working with relational databases and unstructured data and preparing your data for analysis</li>
<li>Details different data visualization techniques that can be used to showcase and summarize your data</li>
<li>Explains both supervised and unsupervised machine learning, including regression, model validation, and clustering techniques</li>
<li>Includes coverage of big data processing tools like MapReduce, Hadoop, Storm, and Spark</li>
</ul>
<br/><br/>
What to do next:
<br/>
<a class="preview" href="http://www.data-mania.com/blog/books-by-lillian-pierson/" id=
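
The printed tree is not a byte-for-byte copy of the input: the parser normalizes the markup. A minimal sketch on a made-up one-line snippet (not the html_doc above) shows two of the changes visible in the output, double-quoted attribute values and self-closed `<br/>` tags.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with single-quoted attributes and an unclosed <br>.
snippet = "<p class='note'>first line<br>second line</p>"
soup = BeautifulSoup(snippet, "html.parser")

print(type(soup).__name__)
print(soup)
```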

In [12]:
# This is a bit difficult to read because the printed output has no indentation.
# Let's prettify it and print the first 350 characters.
print(soup.prettify()[:350])
# prettify() puts each tag and string on its own line, indented by nesting depth,
# which makes the structure easier to read.

<html>
 <head>
  <title>
   Best Books
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    DATA SCIENCE FOR DUMMIES
   </b>
  </p>
  <p class="description">
   Jobs in data science abound, but few people have the data science skills needed to fill these increasingly important roles in organizations. Data Science For Dummies is the pe
   <br/
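
To see the effect of prettify() without the truncation, here is a small sketch on a made-up two-tag snippet. It compares the compact string form with the prettified form.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, small enough to print in full.
soup = BeautifulSoup("<p><b>Hi</b></p>", "html.parser")

compact = str(soup)       # everything on one line
pretty = soup.prettify()  # one tag or string per line, indented by depth

print(compact)
print(pretty)
print(len(pretty.splitlines()), "lines after prettify()")
```

Note that prettify() returns a plain string, so slicing it, as in `[:350]` above, counts characters rather than tags.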


##### Tag Object
- A tag object represents HTML or XML elements that are present in the original markup document.
- `<a href="www.google.com">`
    - Above is a tag object
    - a is the tag name
    - href is the tag attribute
- Tag objects have two very important features:
    - names
    - attributes
        - You can use attributes to reference, search, and navigate data in a BeautifulSoup parse tree.
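
The two features listed above can be sketched with the `<a>` example. The link text "Google" is an assumption added here so the element has some content.

```python
from bs4 import BeautifulSoup

# The <a> tag from the list above, with hypothetical link text added.
soup = BeautifulSoup('<a href="www.google.com">Google</a>', "html.parser")
tag = soup.a

print(tag.name)     # the tag name
print(tag["href"])  # the value of the href attribute

# Attributes can also be used to search: find the <a> tag with this href.
found = soup.find("a", href="www.google.com")
print(found == tag)
```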

We'll first look at the name attribute.

In [14]:
# Let's create a new BeautifulSoup object, and again call it soup.
# This time we pass the constructor a single <b> element instead of a full document.
# The second argument names the parser: 'html.parser' tells the constructor to
# interpret the string as HTML markup using Python's built-in parser.
# Note: body is not a standard attribute of the <b> tag, so this markup looks
# more like XML than HTML. It is used here only to illustrate names and attributes.
soup = BeautifulSoup('<b body="description">Product Description</b>', 'html.parser')


In [16]:
# Next, we create a tag variable and set it equal to soup.b
tag = soup.b
# soup.b returns the first <b> tag in the parse tree, here the markup we passed in.
type(tag)
# We get back bs4.element.Tag, the type of our tag object.

bs4.element.Tag

In [19]:
# Let's print the tag. It's the HTML we passed into the BeautifulSoup constructor.
print(tag)

<b body="description">Product Description</b>


In [20]:
tag.name
# tag.name returns the string 'b', because that is the name of the tag.

'b'

In [23]:
# To rename the tag from b to bestbooks, assign the new name to tag.name.
tag.name = 'bestbooks'
print(tag)
# Again, note that the result looks more like an XML element than an HTML element.
print(tag.name)

<bestbooks body="description">Product Description</bestbooks>
bestbooks


##### Now, let's look at working with attributes.
A tag can have any number of attributes. You can access a tag's attributes by treating the tag like a dictionary. For example, tag['body'] returns the string 'description', which comes directly from the markup we passed into the BeautifulSoup constructor.

In [24]:
tag['body']

'description'
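
Two details of dictionary-style access are worth a small sketch. The markup below is made up for illustration. Indexing a missing attribute raises KeyError, while .get() returns None or a default you supply. With an HTML parser, multi-valued attributes such as class come back as a list.

```python
from bs4 import BeautifulSoup

# Hypothetical link with two classes and no id attribute.
markup = '<a class="preview external" href="http://example.com">link</a>'
tag = BeautifulSoup(markup, "html.parser").a

print(tag["href"])             # single-valued attribute: a string
print(tag["class"])            # multi-valued attribute: a list of strings
print(tag.get("id"))           # missing attribute: None instead of an error
print(tag.get("id", "no id"))  # missing attribute with a fallback value

try:
    tag["id"]
except KeyError:
    print("tag['id'] raises KeyError because the attribute is missing")
```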

To get a dictionary that contains all of the tag's attributes, use attrs: write the name of the tag, followed by .attrs. Note that attrs is an attribute of the tag, not a method, so it takes no parentheses. As you can see here, we have one attribute called body, and its value is description.

In [27]:
tag.attrs

{'body': 'description'}

You can add an attribute to a tag by assigning a value to a new key on the tag object, just as you would add a key to a dictionary.

In [29]:
tag['id'] = 3
print(tag.attrs)
print(tag)

{'body': 'description', 'id': 3}
<bestbooks body="description" id="3">Product Description</bestbooks>


To delete an attribute from the tag, use del with the tag and the name of the attribute you want to delete.

In [30]:
del tag['body']
del tag['id']
tag

<bestbooks>Product Description</bestbooks>

If we read attrs now, all we get back is an empty dictionary, meaning that we successfully deleted all of our attributes.

In [31]:
tag.attrs

{}
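
If you would rather test for a single attribute than inspect the whole attrs dictionary, has_attr() is one option. The sketch below repeats the add and delete steps on a fresh copy of the same one-tag markup.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b body="description">Product Description</b>', "html.parser")
tag = soup.b

tag["id"] = 3
print(tag.has_attr("id"))  # the attribute exists after assignment

del tag["id"]
print(tag.has_attr("id"))  # and is gone after del
print(tag.attrs)           # only the original body attribute remains
```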

#### Use tags to navigate a parse tree
Now, let's look at how to use tags to navigate a parse tree.

To navigate to a specific portion of the tree, write the name of the tag you're interested in as an attribute of the soup object.

In [32]:
# We'll recreate the parse tree we created earlier
soup = BeautifulSoup(html_doc,'html.parser')

In [34]:
# Now, to retrieve certain tags from within the parse tree, 
# all you have to do is write the name of the tag.
soup.title

<title>Best Books</title>

To pull up the name of the book, let's first see what part of the tree it's located in. It sits in the body tag and, more specifically, within the b tag.

In [35]:
soup.body.b

<b>DATA SCIENCE FOR DUMMIES</b>

To retrieve the first tag in the HTML document that contains a web link:

In [39]:
soup.a

<a class="preview" href="http://www.data-mania.com/blog/books-by-lillian-pierson/" id="link 1">See a preview of the book</a>
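
Accessing a tag by name only ever returns the first match. The sketch below uses a shortened, made-up document (the example.com URLs and ids are assumptions) to show that soup.a gives the same element as find('a'), that find_all('a') collects every link, and that a tag name with no match returns None.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for html_doc.
doc = """
<body>
<p class="title"><b>Some Title</b></p>
<a href="http://example.com/1" id="link1">one</a>
<a href="http://example.com/2" id="link2">two</a>
</body>
"""
soup = BeautifulSoup(doc, "html.parser")

first = soup.a
print(first == soup.find("a"))  # attribute access returns the first match

# Collect the href of every <a> tag, not just the first one.
links = [a["href"] for a in soup.find_all("a")]
print(links)

print(soup.table)  # no <table> in this document
```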

##### NavigableString object

NavigableString objects are the strings contained within a tag object.

For example: `<b>Hello World</b>`
Here, "Hello World" will constitute the navigable string object inside the b tag object.
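
A minimal sketch of this example follows. It reads the string through the tag's .string attribute, shows that a NavigableString behaves like an ordinary Python string, and replaces it in place with replace_with(). The replacement text is an arbitrary choice.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>Hello World</b>", "html.parser")
text = soup.b.string

print(type(text).__name__)    # the class of the string inside the tag
print(isinstance(text, str))  # NavigableString is a subclass of str
plain = str(text)             # convert to a plain Python string if needed

# A NavigableString cannot be edited in place, but it can be swapped out.
text.replace_with("Goodbye World")
print(soup.b)
```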