#### Four main object types in Beautiful Soup:
- BeautifulSoup object
- Tag object
- NavigableString object
- Comment object
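
As a rough illustration of these four types, the sketch below parses a tiny made-up snippet (not the book page used in the rest of this notebook) and prints the type of each piece. The snippet and the names demo_soup, p_tag, text_node and comment_node are assumptions chosen only for this example.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Hypothetical one-line document: a tag holding a string and a comment
snippet = '<p class="note">Hello<!-- hidden remark --></p>'
demo_soup = BeautifulSoup(snippet, 'html.parser')

p_tag = demo_soup.p               # Tag object
text_node = p_tag.contents[0]     # NavigableString object
comment_node = p_tag.contents[1]  # Comment object

# Print the class name of each of the four objects
for obj in (demo_soup, p_tag, text_node, comment_node):
    print(type(obj).__name__)
```

A Comment is a special kind of NavigableString, so it behaves like a string but can be told apart with isinstance.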

In [1]:
from bs4 import BeautifulSoup

##### BeautifulSoup object
Let's define an HTML document that we want to use as markup. We'll call that document html_doc.

In [2]:
# Create a string that holds all the HTML markup
html_doc = '''
<html><head><title>Best Books</title></head>
<body>
<p class='title'><b>DATA SCIENCE FOR DUMMIES</b></p>

<p class='description'>Jobs in data science abound, but few people have the data science skills needed to fill these increasingly important roles in organizations. Data Science For Dummies is the pe
<br><br>
Edition 1 of this book:
        <br>
 <ul>
  <li>Provides a background in data science fundamentals before moving on to working with relational databases and unstructured data and preparing your data for analysis</li>
  <li>Details different data visualization techniques that can be used to showcase and summarize your data</li>
  <li>Explains both supervised and unsupervised machine learning, including regression, model validation, and clustering techniques</li>
  <li>Includes coverage of big data processing tools like MapReduce, Hadoop, Storm, and Spark</li>   
  </ul>
<br><br>
What to do next:
<br>
<a href='http://www.data-mania.com/blog/books-by-lillian-pierson/' class = 'preview' id='link 1'>See a preview of the book</a>,
<a href='http://www.data-mania.com/blog/data-science-for-dummies-answers-what-is-data-science/' class = 'preview' id='link 2'>get the free pdf download,</a> and then
<a href='http://bit.ly/Data-Science-For-Dummies' class = 'preview' id='link 3'>buy the book!</a> 
</p>

<p class='description'>...</p>
'''

In [3]:
# Let's start by looking at the BeautifulSoup constructor.
# If you don't name a parser, the constructor picks the best one installed on your system.
# Let's name a parser explicitly instead.
# Create a BeautifulSoup object from the markup stored in html_doc.
soup = BeautifulSoup(html_doc, 'html.parser')
# 'html.parser' tells the constructor to use Python's built-in HTML parser.
# soup is a BeautifulSoup object: the markup in html_doc transformed into
# a parse tree that represents the structure of the document.
print(soup)


<html><head><title>Best Books</title></head>
<body>
<p class="title"><b>DATA SCIENCE FOR DUMMIES</b></p>
<p class="description">Jobs in data science abound, but few people have the data science skills needed to fill these increasingly important roles in organizations. Data Science For Dummies is the pe
<br/><br/>
Edition 1 of this book:
        <br/>
<ul>
<li>Provides a background in data science fundamentals before moving on to working with relational databases and unstructured data and preparing your data for analysis</li>
<li>Details different data visualization techniques that can be used to showcase and summarize your data</li>
<li>Explains both supervised and unsupervised machine learning, including regression, model validation, and clustering techniques</li>
<li>Includes coverage of big data processing tools like MapReduce, Hadoop, Storm, and Spark</li>
</ul>
<br/><br/>
What to do next:
<br/>
<a class="preview" href="http://www.data-mania.com/blog/books-by-lillian-pierson/" id=

In [4]:
# The printed output is a bit difficult to read because it has no visual structure.
# Let's prettify it and print the first 350 characters.
print(soup.prettify()[:350])
# prettify() puts each tag and string on its own indented line, which makes it a little easier to read.

<html>
 <head>
  <title>
   Best Books
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    DATA SCIENCE FOR DUMMIES
   </b>
  </p>
  <p class="description">
   Jobs in data science abound, but few people have the data science skills needed to fill these increasingly important roles in organizations. Data Science For Dummies is the pe
   <br/


##### Tag Object
- A tag object represents an HTML or XML element present in the original markup document.
- `<a href="www.google.com">`
    - Above is a tag object
    - a is the tag name
    - href is the tag attribute
- Tag objects have two very important features:
    - names
    - attributes
        - You can use attributes to reference, search, and navigate data within a BeautifulSoup parse tree.

We'll first look at the name attribute.

In [5]:
# Let's create a new BeautifulSoup object, and again call it soup.
# This time we pass the constructor a single b element instead of the whole document:
# a b tag with a body attribute set to "description" and the text "Product Description".
# The second argument, 'html', tells the constructor to interpret the markup as HTML.
# Note that body is not a standard attribute of the b tag, so this markup looks
# more like XML than HTML. It is only used here to illustrate names and attributes.
soup = BeautifulSoup('<b body="description">Product Description</b>','html')
# Ignore the parser warning that this call prints.



(Truncated warning output: it suggests changing code like BeautifulSoup(YOUR_MARKUP) to BeautifulSoup(YOUR_MARKUP, "lxml").)


In [6]:
# Next, we will create a tag variable and set it equal to soup.b
tag = soup.b
# soup.b returns the first b tag in the parse tree, here the element we passed in.
type(tag)
# We get back bs4.element.Tag, the type of our tag object.

bs4.element.Tag

In [7]:
# Let's print this tag. It is the HTML we passed into the BeautifulSoup constructor.
print(tag)

<b body="description">Product Description</b>


In [8]:
tag.name
# tag.name returns the string 'b', because that is the name of the tag.

'b'

In [9]:
# If you want to rename the tag from b to bestbooks, assign the new name to tag.name.
tag.name = 'bestbooks'
print(tag)
# Again, note that the result looks more like an XML element than an HTML element.
print(tag.name)

<bestbooks body="description">Product Description</bestbooks>
bestbooks


##### Now, let's look at working with attributes.
A tag can have any number of attributes. You can access a tag's attributes by treating the tag like a dictionary. For example, tag['body'] returns the string 'description', which comes directly from the markup we passed into the BeautifulSoup constructor.

In [10]:
tag['body']

'description'

To get a dictionary that contains all of the tag's attributes, use the attrs attribute. We write the name of our tag and access attrs on it. As you can see here, the tag has one attribute called body, and its value is description.

In [11]:
tag.attrs

{'body': 'description'}

You can add an attribute to a tag by assigning a value to a new attribute key on the tag object, just as you would with a dictionary.

In [12]:
tag['id'] = 3
print(tag.attrs)
print(tag)

{'body': 'description', 'id': 3}
<bestbooks body="description" id="3">Product Description</bestbooks>


To delete an attribute from the tag, use del with the attribute name you want to delete.

In [13]:
del tag['body']
del tag['id']
tag

<bestbooks>Product Description</bestbooks>

If we look at attrs now, all we get back is an empty dictionary, meaning that we successfully deleted all of our attributes.

In [14]:
tag.attrs

{}
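
The dictionary-style access described in this section can be sketched on a separate, made-up link tag. The markup and the names demo_soup and link are assumptions for illustration. The sketch also shows two details not covered above: get returns None for a missing attribute (whereas square brackets would raise a KeyError), and class comes back as a list because it can hold several values.

```python
from bs4 import BeautifulSoup

# Hypothetical link tag with a two-valued class attribute
demo_soup = BeautifulSoup('<a class="preview external" href="/books">Books</a>', 'html.parser')
link = demo_soup.a

print(link['href'])           # dictionary-style access to an existing attribute
print(link.get('title'))      # get() returns None when the attribute is missing
print(link.has_attr('href'))  # True when the attribute is present
print(link['class'])          # class is multi-valued, so it is returned as a list
```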

#### Use tags to navigate a parse tree
Now, let's look at how to use tags to navigate a parse tree.

To navigate to a specific portion of the tree, simply write the name of the tag you're interested in.

In [15]:
# We'll recreate the parse tree we created earlier
soup = BeautifulSoup(html_doc,'html.parser')

In [16]:
# Now, to retrieve certain tags from within the parse tree, 
# all you have to do is write the name of the tag.
soup.title

<title>Best Books</title>

If you want to pull up the name of the book, first look at where it sits in the tree. It is inside the body tag and, more specifically, inside the b tag.

In [17]:
soup.body.b

<b>DATA SCIENCE FOR DUMMIES</b>

To retrieve the first tag in the HTML document that contains a web link:

In [18]:
soup.a

<a class="preview" href="http://www.data-mania.com/blog/books-by-lillian-pierson/" id="link 1">See a preview of the book</a>
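
Accessing a tag by name only ever returns the first match. A minimal sketch of the difference, using a made-up two-link snippet (the markup and variable names are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph containing two links
demo_html = '<p><a href="/one">one</a> and <a href="/two">two</a></p>'
demo_soup = BeautifulSoup(demo_html, 'html.parser')

first_link = demo_soup.a             # first a tag only
same_link = demo_soup.find('a')      # find() returns the same first match
all_links = demo_soup.find_all('a')  # find_all() returns every match as a list
missing = demo_soup.table            # None, because there is no table tag

print(first_link['href'], same_link['href'], len(all_links), missing)
```

Because a missing tag yields None, chaining such as demo_soup.table.tr would raise an AttributeError, so it can be worth checking for None first.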

### NavigableString object

In [21]:
#NavigableString objects are the strings contained within a tag object.
#For example: "<b>Hello World</b>"
#Here, "Hello World" will constitute the navigable string object inside the b tag object.
tag = soup.b
type(tag)

bs4.element.Tag

In [22]:
tag.name

'b'

In [25]:
print(tag)

<b>DATA SCIENCE FOR DUMMIES</b>


In [23]:
# To get the string contained within this tag:
tag.string

'DATA SCIENCE FOR DUMMIES'

In [27]:
type(tag.string)
# You can see that this string is of type NavigableString

bs4.element.NavigableString

In [28]:
nav_string = tag.string
nav_string

'DATA SCIENCE FOR DUMMIES'

In [33]:
# If you want to replace this string with some other string, call the replace_with method
# (replaceWith is an older alias for the same method).
tag.string.replace_with('Replaced Book Title')
print(tag)
# Please note that this replaces the string in place in the parse tree.

<b>Replaced Book Title</b>


In [34]:
soup = BeautifulSoup(html_doc,"html.parser")
soup


<html><head><title>Best Books</title></head>
<body>
<p class="title"><b>DATA SCIENCE FOR DUMMIES</b></p>
<p class="description">Jobs in data science abound, but few people have the data science skills needed to fill these increasingly important roles in organizations. Data Science For Dummies is the pe
<br/><br/>
Edition 1 of this book:
        <br/>
<ul>
<li>Provides a background in data science fundamentals before moving on to working with relational databases and unstructured data and preparing your data for analysis</li>
<li>Details different data visualization techniques that can be used to showcase and summarize your data</li>
<li>Explains both supervised and unsupervised machine learning, including regression, model validation, and clustering techniques</li>
<li>Includes coverage of big data processing tools like MapReduce, Hadoop, Storm, and Spark</li>
</ul>
<br/><br/>
What to do next:
<br/>
<a class="preview" href="http://www.data-mania.com/blog/books-by-lillian-pierson/" id=

In [35]:
# You can get all the strings contained within a parse tree using the stripped_strings generator
for string in soup.stripped_strings:
    print(repr(string))

'Best Books'
'DATA SCIENCE FOR DUMMIES'
'Jobs in data science abound, but few people have the data science skills needed to fill these increasingly important roles in organizations. Data Science For Dummies is the pe'
'Edition 1 of this book:'
'Provides a background in data science fundamentals before moving on to working with relational databases and unstructured data and preparing your data for analysis'
'Details different data visualization techniques that can be used to showcase and summarize your data'
'Explains both supervised and unsupervised machine learning, including regression, model validation, and clustering techniques'
'Includes coverage of big data processing tools like MapReduce, Hadoop, Storm, and Spark'
'What to do next:'
'See a preview of the book'
','
'get the free pdf download,'
'and then'
'buy the book!'
'...'


In [36]:
# stripped_strings ignores strings that consist entirely of whitespace.
# stripped_strings removes whitespace at the beginning and the end of each string.
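
To see what stripped_strings filters out, it can help to compare it with the plain strings generator. The sketch below uses a small made-up list (the markup and names are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical list with extra whitespace around and inside the items
demo_html = '<ul>\n  <li> first item </li>\n  <li>second item</li>\n</ul>'
demo_soup = BeautifulSoup(demo_html, 'html.parser')

raw = list(demo_soup.strings)             # keeps whitespace-only strings and padding
clean = list(demo_soup.stripped_strings)  # drops them and trims each string

print(raw)
print(clean)
```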

##### How to access parent tag objects within a parse tree

In [37]:
# Let's create a new variable called title_tag.
title_tag = soup.title
title_tag

<title>Best Books</title>

In [39]:
# To access the parent of the title tag, use title_tag.parent,
# which returns the title element's parent.
title_tag.parent
# In this case the parent of the title tag is head.

<head><title>Best Books</title></head>

In [40]:
title_tag.parent.name

'head'
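
Going one step further than parent, the parents generator walks all the way up the tree. A minimal sketch on a made-up document (the markup and the names demo_soup and demo_title are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical document with a single title
demo_soup = BeautifulSoup('<html><head><title>Best Books</title></head></html>', 'html.parser')
demo_title = demo_soup.title

# Collect the name of every ancestor, from the closest one outward
ancestor_names = [parent.name for parent in demo_title.parents]
print(ancestor_names)

# The string inside the tag also knows its parent tag
print(demo_title.string.parent.name)
```

The last ancestor is the BeautifulSoup object itself, which reports its name as '[document]'.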

##### So remember, we use NavigableString objects to retrieve the strings within tag objects

In [41]:
r = '''
<html><head><title>Best Books</title></head>
<body>
<p class='title'><b>DATA SCIENCE FOR DUMMIES</b></p>

<p class='description'>Jobs in data science abound, but few people have the data science skills needed to fill these increasingly important roles in organizations. Data Science For Dummies is the pe
<br><br>
Edition 1 of this book:
        <br>
 <ul>
  <li>Provides a background in data science fundamentals before moving on to working with relational databases and unstructured data and preparing your data for analysis</li>
  <li>Details different data visualization techniques that can be used to showcase and summarize your data</li>
  <li>Explains both supervised and unsupervised machine learning, including regression, model validation, and clustering techniques</li>
  <li>Includes coverage of big data processing tools like MapReduce, Hadoop, Storm, and Spark</li>   
  </ul>
<br><br>
What to do next:
<br>
<a href='http://www.data-mania.com/blog/books-by-lillian-pierson/' class = 'preview' id='link 1'>See a preview of the book</a>,
<a href='http://www.data-mania.com/blog/data-science-for-dummies-answers-what-is-data-science/' class = 'preview' id='link 2'>get the free pdf download,</a> and then
<a href='http://bit.ly/Data-Science-For-Dummies' class = 'preview' id='link 3'>buy the book!</a> 
</p>

<p class='description'>...</p>
'''

In [42]:
# This time, let's use the lxml parser
soup = BeautifulSoup(r, 'lxml')
type(soup)

bs4.BeautifulSoup

### Parsing your data

In [45]:
print(soup.prettify()[0:100])

<html>
 <head>
  <title>
   Best Books
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    DA


### Getting data from a parse tree

In [46]:
text_only = soup.get_text()
print(text_only)

Best Books

DATA SCIENCE FOR DUMMIES
Jobs in data science abound, but few people have the data science skills needed to fill these increasingly important roles in organizations. Data Science For Dummies is the pe

Edition 1 of this book:
        

Provides a background in data science fundamentals before moving on to working with relational databases and unstructured data and preparing your data for analysis
Details different data visualization techniques that can be used to showcase and summarize your data
Explains both supervised and unsupervised machine learning, including regression, model validation, and clustering techniques
Includes coverage of big data processing tools like MapReduce, Hadoop, Storm, and Spark


What to do next:

See a preview of the book,
get the free pdf download, and then
buy the book!
...



### Searching and retrieving data from a parse tree

#### Retrieving tags by filtering with name arguments

In [47]:
# To return all the tags that contain HTML list items
# call the find_all method off a soup object
# and then pass in the name of the tag, li.
soup.find_all("li")

[<li>Provides a background in data science fundamentals before moving on to working with relational databases and unstructured data and preparing your data for analysis</li>,
 <li>Details different data visualization techniques that can be used to showcase and summarize your data</li>,
 <li>Explains both supervised and unsupervised machine learning, including regression, model validation, and clustering techniques</li>,
 <li>Includes coverage of big data processing tools like MapReduce, Hadoop, Storm, and Spark</li>]

#### Retrieving tags by filtering with keyword arguments

In [48]:
# To return all of the tags whose id attribute is "link 3":
# find_all treats a keyword argument it does not recognize as a filter on the attribute of that name.
soup.find_all(id="link 3")

[<a class="preview" href="http://bit.ly/Data-Science-For-Dummies" id="link 3">buy the book!</a>]
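
Two related forms are worth knowing: class is a reserved word in Python, so Beautiful Soup uses the keyword class_ instead, and any attribute can also be matched through an attrs dictionary. A minimal sketch on made-up links (the markup and names are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical set of three links with different classes and ids
demo_html = '''
<a class="preview" id="link 1" href="/a">A</a>
<a class="preview" id="link 2" href="/b">B</a>
<a class="buy" id="link 3" href="/c">C</a>
'''
demo_soup = BeautifulSoup(demo_html, 'html.parser')

previews = demo_soup.find_all('a', class_='preview')  # filter on the class attribute
by_id = demo_soup.find_all(attrs={'id': 'link 3'})    # same idea via an attrs dictionary

print([a['href'] for a in previews])
print([a['href'] for a in by_id])
```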

#### Retrieving tags by filtering with string arguments

In [56]:
# Here you filter with a string: find_all returns the tags whose name matches that string exactly.
# We write the name of our soup object
# and call the find_all method on it, passing in the tag name ul.
soup.find_all('ul')

[<ul>
 <li>Provides a background in data science fundamentals before moving on to working with relational databases and unstructured data and preparing your data for analysis</li>
 <li>Details different data visualization techniques that can be used to showcase and summarize your data</li>
 <li>Explains both supervised and unsupervised machine learning, including regression, model validation, and clustering techniques</li>
 <li>Includes coverage of big data processing tools like MapReduce, Hadoop, Storm, and Spark</li>
 </ul>]

#### Retrieving tags by filtering with list of objects

In [59]:
soup.find_all(['ul', 'b'])
# This returns all the tags whose name is ul or b.
# b is listed first because it appears earlier in the document.

[<b>DATA SCIENCE FOR DUMMIES</b>, <ul>
 <li>Provides a background in data science fundamentals before moving on to working with relational databases and unstructured data and preparing your data for analysis</li>
 <li>Details different data visualization techniques that can be used to showcase and summarize your data</li>
 <li>Explains both supervised and unsupervised machine learning, including regression, model validation, and clustering techniques</li>
 <li>Includes coverage of big data processing tools like MapReduce, Hadoop, Storm, and Spark</li>
 </ul>]

#### Retrieving tags by filtering with regular expressions

In [60]:
# To return all of the tags whose names match a regular expression,
# pass a compiled regular expression object as the filter.
# This lists all tags that contain the letter l in their tag name.
import re
l = re.compile('l')
for tag in soup.find_all(l): print(tag.name)

html
title
ul
li
li
li
li
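
Beautiful Soup applies the pattern with a search, so an unanchored pattern matches anywhere in the tag name. The sketch below contrasts that with an anchored pattern on a small made-up document (the markup and names are assumptions for illustration):

```python
import re
from bs4 import BeautifulSoup

# Hypothetical minimal page
demo_html = '<html><head><title>T</title></head><body><ul><li>x</li></ul></body></html>'
demo_soup = BeautifulSoup(demo_html, 'html.parser')

# 'l' anywhere in the tag name
contains_l = [tag.name for tag in demo_soup.find_all(re.compile('l'))]
# '^l' only matches tag names that start with l
starts_with_l = [tag.name for tag in demo_soup.find_all(re.compile('^l'))]

print(contains_l)
print(starts_with_l)
```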


#### Retrieving tags by filtering with a Boolean value

In [53]:
# Here you filter with a Boolean value.
# The find_all method accepts True as a filter, which matches every tag in the parse tree
# but none of the text strings.
# So to print the names of all HTML tags in the soup object,
# we can use the same loop and pass True to find_all.
for tag in soup.find_all(True): print(tag.name)

html
head
title
body
p
b
p
br
br
br
ul
li
li
li
li
br
br
br
a
a
a
p
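
A long list of tag names like this is easier to read as counts. One possible sketch, using collections.Counter on a made-up snippet (the markup and names are assumptions for illustration):

```python
from collections import Counter
from bs4 import BeautifulSoup

# Hypothetical fragment with a few repeated tags
demo_html = '<div><p>a</p><p>b</p><span>c</span></div>'
demo_soup = BeautifulSoup(demo_html, 'html.parser')

# find_all(True) yields every tag, so counting the names gives a tag inventory
tag_counts = Counter(tag.name for tag in demo_soup.find_all(True))
print(tag_counts)
```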


#### Retrieving weblinks by filtering with string objects

In [61]:
# To return all the web links within a parse tree, pass the tag name a to find_all as a string filter.
# find_all('a') collects every a tag in the soup object.
# The loop then gets the href value of each a tag and prints it.
# This is a simple mechanism you can use to scrape web links from a web page.
for link in soup.find_all('a'): print(link.get('href'))

http://www.data-mania.com/blog/books-by-lillian-pierson/
http://www.data-mania.com/blog/data-science-for-dummies-answers-what-is-data-science/
http://bit.ly/Data-Science-For-Dummies


#### Retrieving strings by filtering with regular expressions

In [55]:
# To return all of the strings that match a regular expression,
# pass a compiled regular expression object as the string argument.
soup.find_all(string=re.compile("data"))
# find_all then returns a list of strings from the original web page,
# all of which contain the lowercase text "data" (the match is case-sensitive).

['Jobs in data science abound, but few people have the data science skills needed to fill these increasingly important roles in organizations. Data Science For Dummies is the pe\n',
 'Provides a background in data science fundamentals before moving on to working with relational databases and unstructured data and preparing your data for analysis',
 'Details different data visualization techniques that can be used to showcase and summarize your data',
 'Includes coverage of big data processing tools like MapReduce, Hadoop, Storm, and Spark']
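
The search above is case-sensitive, which is why the all-caps book title is missing from the results. As a sketch, a flag on the pattern makes it case-insensitive, and each matched string's parent gives the enclosing tag. The snippet and names are assumptions for illustration:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment mixing upper- and lower-case "data"
demo_html = '<p>DATA SCIENCE</p><li>your data</li><li>Spark</li>'
demo_soup = BeautifulSoup(demo_html, 'html.parser')

# re.IGNORECASE makes the pattern match DATA as well as data
matches = demo_soup.find_all(string=re.compile('data', re.IGNORECASE))
parent_names = [s.parent.name for s in matches]

print(matches)
print(parent_names)
```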

### Web scraping in action
In the following demonstration, I'm going to show you how to scrape a webpage and then save your results
in an external file.

In [62]:
from bs4 import BeautifulSoup
import urllib.request  # we need urllib.request in order to read in our data from the internet.
import re

In [65]:
#### Scraping a webpage and saving your results
r = urllib.request.urlopen('https://analytics.usa.gov').read()
soup = BeautifulSoup(r,'lxml')
type(soup)

bs4.BeautifulSoup

In [66]:
soup.prettify()[:100]

'<!DOCTYPE html>\n<html lang="en">\n <!-- Initalize title and data source variables -->\n <head>\n  <!--\n'

In [67]:
# Now let's print all the links on this page
for link in soup.find_all('a'):
    print(link.get('href'))

/
#explanation
https://analytics.usa.gov/data/
data/
#top-pages-realtime
#top-pages-7-days
#top-pages-30-days
https://analytics.usa.gov/data/live/all-pages-realtime.csv
https://analytics.usa.gov/data/live/all-domains-30-days.csv
https://www.digitalgov.gov/services/dap/
https://www.digitalgov.gov/services/dap/common-questions-about-dap-faq/#part-4
https://support.google.com/analytics/answer/2763052?hl=en
https://analytics.usa.gov/data/live/second-level-domains.csv
https://analytics.usa.gov/data/live/sites.csv
mailto:DAP@support.digitalgov.gov
https://analytics.usa.gov/data/
https://analytics.usa.gov/developer
mailto:DAP@support.digitalgov.gov
https://github.com/GSA/analytics.usa.gov/issues
https://github.com/GSA/analytics.usa.gov
https://github.com/18F/analytics-reporter
http://www.gsa.gov/
https://www.digitalgov.gov/services/dap/
https://cloud.gov/
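
Several of the hrefs above are relative (for example / or data/). One way to turn them into full URLs is urljoin from the standard library. The sketch below works on a static made-up snippet rather than the live page, and the base URL is assumed to be the address that was scraped:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = 'https://analytics.usa.gov/'  # assumed base: the page that was scraped

# Hypothetical snippet reusing a few hrefs from the output above
demo_html = ('<a href="/">Home</a><a href="data/">Data</a>'
             '<a href="#explanation">About</a><a href="https://cloud.gov/">Cloud</a>')
demo_soup = BeautifulSoup(demo_html, 'html.parser')

# Resolve every href against the base URL; absolute links are left unchanged
absolute_links = [urljoin(base_url, a.get('href')) for a in demo_soup.find_all('a')]
for url in absolute_links:
    print(url)
```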


In [68]:
# Let's find all the a tags whose href attribute matches
# the regular expression ^http, that is, whose link starts with http,
# and print out only those.
for link in soup.find_all('a',attrs={'href':re.compile('^http')}):
    print(link)

<a href="https://analytics.usa.gov/data/">Data</a>
<a href="https://analytics.usa.gov/data/live/all-pages-realtime.csv">Download the full dataset.</a>
<a href="https://analytics.usa.gov/data/live/all-domains-30-days.csv">Download the full dataset.</a>
<a class="external-link" href="https://www.digitalgov.gov/services/dap/">Digital Analytics Program</a>
<a class="external-link" href="https://www.digitalgov.gov/services/dap/common-questions-about-dap-faq/#part-4">does not track individuals</a>
<a class="external-link" href="https://support.google.com/analytics/answer/2763052?hl=en">anonymizes the IP addresses</a>
<a class="external-link" href="https://analytics.usa.gov/data/live/second-level-domains.csv">400 executive branch government domains</a>
<a class="external-link" href="https://analytics.usa.gov/data/live/sites.csv">about 5,700 total websites</a>
<a href="https://analytics.usa.gov/data/">download the data here.</a>
<a href="https://analytics.usa.gov/developer"> API project</a>
<a

It isn't useful to have your results stuck in a Jupyter notebook, so we will now
save them to an external file.
To do that, we're going to create a new text file called parsed_data.txt.

In [72]:
# 'w' tells Python we want to write to this text file; the with block closes the file when done.
with open('parsed_data.txt', 'w') as file:
    for link in soup.find_all('a', attrs={'href': re.compile('^http')}):
        soup_link = str(link)
        print(soup_link)
        file.write(soup_link)

<a href="https://analytics.usa.gov/data/">Data</a>
<a href="https://analytics.usa.gov/data/live/all-pages-realtime.csv">Download the full dataset.</a>
<a href="https://analytics.usa.gov/data/live/all-domains-30-days.csv">Download the full dataset.</a>
<a class="external-link" href="https://www.digitalgov.gov/services/dap/">Digital Analytics Program</a>
<a class="external-link" href="https://www.digitalgov.gov/services/dap/common-questions-about-dap-faq/#part-4">does not track individuals</a>
<a class="external-link" href="https://support.google.com/analytics/answer/2763052?hl=en">anonymizes the IP addresses</a>
<a class="external-link" href="https://analytics.usa.gov/data/live/second-level-domains.csv">400 executive branch government domains</a>
<a class="external-link" href="https://analytics.usa.gov/data/live/sites.csv">about 5,700 total websites</a>
<a href="https://analytics.usa.gov/data/">download the data here.</a>
<a href="https://analytics.usa.gov/developer"> API project</a>
<a

In [73]:
# To find out where the data file was created:
%pwd

'C:\\Users\\AccDEV3\\Desktop\\VS\\Pt\\PythonBasics\\DataScienceET\\WebScrapingWithBeautifulSoup'

In [1]:
# Note that the created file contains the full tags, such as <a href=...>, not just the URLs,
# so you will need to do some data munging to extract clean URLs.
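
One possible way to avoid that cleanup step is to write only the href values, one per line, instead of the whole tag. The sketch below uses a static made-up snippet and a temporary file. The file name parsed_links_demo.txt and the other names are assumptions for illustration:

```python
import os
import re
import tempfile
from bs4 import BeautifulSoup

# Hypothetical snippet with two absolute links and one in-page anchor
demo_html = ('<a href="https://cloud.gov/">cloud.gov</a><a href="#top">Top</a>'
             '<a href="http://www.gsa.gov/">GSA</a>')
demo_soup = BeautifulSoup(demo_html, 'html.parser')

# Hypothetical output location inside the system temp directory
out_path = os.path.join(tempfile.gettempdir(), 'parsed_links_demo.txt')

# Write just the URL of each link that starts with http, one per line
with open(out_path, 'w') as out_file:
    for link in demo_soup.find_all('a', attrs={'href': re.compile('^http')}):
        out_file.write(link['href'] + '\n')

# Read the file back to confirm what was saved
with open(out_path) as in_file:
    saved = in_file.read().splitlines()
print(saved)
```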

In [2]:
import requests
url = 'http://www.imdb.com/search/title'
data = {
    'sort':'num_votes,desc'
}
data['release_date'] = 2017
data['page'] = 1
response = requests.get(url, params=data)  # send the dictionary as URL query parameters
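
To get a feel for what this request looks like, the sketch below builds a query string from the same dictionary using only the standard library, without touching the network. This is an approximation for illustration; the exact URL that requests sends may be encoded slightly differently. The names params and full_url are assumptions.

```python
from urllib.parse import urlencode

url = 'http://www.imdb.com/search/title'
# Same keys and values as the dictionary passed to requests.get above
params = {'sort': 'num_votes,desc', 'release_date': 2017, 'page': 1}

# urlencode turns the dictionary into key=value pairs joined by &
full_url = url + '?' + urlencode(params)
print(full_url)
```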


In [3]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text,'html.parser')
print(soup.prettify()[:1000])

<!DOCTYPE html>
<html xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="app-id=342792525, app-argument=imdb:///?src=mdot" name="apple-itunes-app"/>
  <script type="text/javascript">
   var IMDbTimer={starttime: new Date().getTime(),pt:'java'};
  </script>
  <script>
   if (typeof uet == 'function') {
      uet("bb", "LoadTitle", {wb: 1});
    }
  </script>
  <script>
   (function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);
  </script>
  <title>
   IMDb: Released between 2017-01-01 and 2017-12-31
(Sorted by Number of Votes Descending) - IMDb
  </title>
  <script>
   (function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);
  </script>
  <script>
   if (typeof uet == 'function') {
      uet("be", "LoadTitle", {wb: 1});
    }
  </script>
  <script>
   if (typeof ue