## Use cases
### Examples
### Existing Blog -> New Blog resources page
### E-commerce store automation
### Hydrological analysis
### Emergency resource allocation planning
### Oil and gas production intel

## 4 Beautiful Soup Object Types
### BeautifulSoup object
### Tag object
### NavigableString object
### Comment object
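
The four object types can be seen side by side in a small sketch. The one-line markup and the variable names below are made up for illustration; only the `bs4` classes themselves come from the library.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical one-line document containing a tag, some text, and a comment.
markup = '<p class="note">Hello<!-- hidden note --></p>'
soup = BeautifulSoup(markup, 'html.parser')

tag = soup.p                 # Tag object
text = tag.contents[0]       # NavigableString object
comment = tag.contents[1]    # Comment object (a special kind of NavigableString)

# Print the class name of each object in turn.
for obj in (soup, tag, text, comment):
    print(type(obj).__name__)
```

Because `Comment` is a subclass of `NavigableString`, an `isinstance(obj, Comment)` check is one way to tell comments apart from ordinary text.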

## Working with objects

In [1]:
! pip install BeautifulSoup4



In [2]:
from bs4 import BeautifulSoup

### Looking at a beautiful soup object


In [3]:
html_doc = '''
<html><head><title>Best Books</title></head>
<body>
<p class='title'><b>DATA SCIENCE FOR DUMMIES</b></p>

<p class='description'>Jobs in data science abound, but few people have the data science skills needed to fill these increasingly important roles in organizations. Data Science For Dummies is the pe
<br><br>
Edition 1 of this book:
        <br>
 <ul>
  <li>Provides a background in data science fundamentals before moving on to working with relational databases and unstructured data and preparing your data for analysis</li>
  <li>Details different data visualization techniques that can be used to showcase and summarize your data</li>
  <li>Explains both supervised and unsupervised machine learning, including regression, model validation, and clustering techniques</li>
  <li>Includes coverage of big data processing tools like MapReduce, Hadoop, Storm, and Spark</li>   
  </ul>
<br><br>
What to do next:
<br>
<a href='http://www.data-mania.com/blog/books-by-lillian-pierson/' class = 'preview' id='link 1'>See a preview of the book</a>,
<a href='http://www.data-mania.com/blog/data-science-for-dummies-answers-what-is-data-science/' class = 'preview' id='link 2'>get the free pdf download,</a> and then
<a href='http://bit.ly/Data-Science-For-Dummies' class = 'preview' id='link 3'>buy the book!</a> 
</p>

<p class='description'>...</p>
'''

### By default the constructor will try to figure out what type of parser we need, but in this case we will pick a parser ourselves
### Beautiful Soup transforms the markup into a parse tree, which is a set of linked objects representing the structure of the document

In [4]:
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup)


<html><head><title>Best Books</title></head>
<body>
<p class="title"><b>DATA SCIENCE FOR DUMMIES</b></p>
<p class="description">Jobs in data science abound, but few people have the data science skills needed to fill these increasingly important roles in organizations. Data Science For Dummies is the pe
<br/><br/>
Edition 1 of this book:
        <br/>
<ul>
<li>Provides a background in data science fundamentals before moving on to working with relational databases and unstructured data and preparing your data for analysis</li>
<li>Details different data visualization techniques that can be used to showcase and summarize your data</li>
<li>Explains both supervised and unsupervised machine learning, including regression, model validation, and clustering techniques</li>
<li>Includes coverage of big data processing tools like MapReduce, Hadoop, Storm, and Spark</li>
</ul>
<br/><br/>
What to do next:
<br/>
<a class="preview" href="http://www.data-mania.com/blog/books-by-lillian-pierson/" id=

### We will make the output easier to read by calling prettify()

In [5]:
print (soup.prettify()[0:350])

<html>
 <head>
  <title>
   Best Books
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    DATA SCIENCE FOR DUMMIES
   </b>
  </p>
  <p class="description">
   Jobs in data science abound, but few people have the data science skills needed to fill these increasingly important roles in organizations. Data Science For Dummies is the pe
   <br/
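
The indentation that prettify() prints mirrors the links between objects in the parse tree. A minimal sketch of following those links, assuming a tiny made-up document (the name `tree` is only illustrative):

```python
from bs4 import BeautifulSoup

# Hypothetical two-paragraph document, kept small so the links are easy to follow.
tree = BeautifulSoup('<html><body><p>one</p><p>two</p></body></html>', 'html.parser')

first = tree.p                      # first <p> in the document
parent_name = first.parent.name     # the tag that encloses it
sibling = first.next_sibling        # the node that follows it at the same level
child_names = [child.name for child in tree.body.children]

print(parent_name)   # body
print(sibling)       # <p>two</p>
print(child_names)   # ['p', 'p']
```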


## Tag objects: represent the HTML or XML tags present in the original markup document. Tag objects have two important features: name and attributes

### attributes can be used to reference, search and navigate data by tag in beautiful soup

In [6]:
soup = BeautifulSoup('<b body="description">Product Description</b>', 'html.parser')

tag=soup.b
type(tag)

bs4.element.Tag

In [7]:
print (tag)

<b body="description">Product Description</b>


In [8]:
tag.name

'b'

In [9]:
tag.name = 'bestbooks'
tag

<bestbooks body="description">Product Description</bestbooks>

In [10]:
tag.name

'bestbooks'

### working with attributes

In [11]:
tag['body']

'description'

In [12]:
# returns a dictionary of all the attributes
tag.attrs

{'body': 'description'}

In [13]:
#adding an attribute to a tag, in this case the attribute id
tag['id'] = 3
tag.attrs

{'body': 'description', 'id': 3}

In [14]:
tag

<bestbooks body="description" id="3">Product Description</bestbooks>

In [15]:
#deleting an attribute from a tag, in this case deleting the body and id attribute
del tag['body']
del tag['id']
tag

<bestbooks>Product Description</bestbooks>

In [16]:
# printing again to check that it returns an empty dictionary after the deletion
tag.attrs

{}
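
Two details about attributes are easy to miss: `class` can hold several values, so Beautiful Soup returns it as a list, and indexing a missing attribute raises a `KeyError` while `.get()` returns `None`. A small sketch, using made-up markup and variable names:

```python
from bs4 import BeautifulSoup

# Hypothetical tag with a two-valued class attribute and an id.
demo = BeautifulSoup('<p class="title big" id="intro">Best Books</p>', 'html.parser')
p = demo.p

classes = p['class']       # multi-valued attribute comes back as a list
p_id = p['id']             # single-valued attribute comes back as a string
missing = p.get('lang')    # .get() returns None instead of raising KeyError
has_id = p.has_attr('id')  # explicit membership test

print(classes)   # ['title', 'big']
print(p_id)      # intro
print(missing)   # None
print(has_id)    # True
```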

### Using tags to navigate a tree 

In [17]:
html_doc = '''
<html><head><title>Best Books</title></head>
<body>
<p class='title'><b>DATA SCIENCE FOR DUMMIES</b></p>

<p class='description'>Jobs in data science abound, but few people have the data science skills needed to fill these increasingly important roles in organizations. Data Science For Dummies is the pe
<br><br>
Edition 1 of this book:
        <br>
 <ul>
  <li>Provides a background in data science fundamentals before moving on to working with relational databases and unstructured data and preparing your data for analysis</li>
  <li>Details different data visualization techniques that can be used to showcase and summarize your data</li>
  <li>Explains both supervised and unsupervised machine learning, including regression, model validation, and clustering techniques</li>
  <li>Includes coverage of big data processing tools like MapReduce, Hadoop, Storm, and Spark</li>   
  </ul>
<br><br>
What to do next:
<br>
<a href='http://www.data-mania.com/blog/books-by-lillian-pierson/' class = 'preview' id='link 1'>See a preview of the book</a>,
<a href='http://www.data-mania.com/blog/data-science-for-dummies-answers-what-is-data-science/' class = 'preview' id='link 2'>get the free pdf download,</a> and then
<a href='http://bit.ly/Data-Science-For-Dummies' class = 'preview' id='link 3'>buy the book!</a> 
</p>

<p class='description'>...</p>
'''
soup = BeautifulSoup(html_doc, 'html.parser')

In [18]:
#retrieving tags 
soup.head

<head><title>Best Books</title></head>

In [19]:
#retrieving the title tag
soup.title

<title>Best Books</title>

In [20]:
#isolating the b tag from the body tag to retrieve the title
soup.body.b

<b>DATA SCIENCE FOR DUMMIES</b>

In [21]:
soup.body

<body>
<p class="title"><b>DATA SCIENCE FOR DUMMIES</b></p>
<p class="description">Jobs in data science abound, but few people have the data science skills needed to fill these increasingly important roles in organizations. Data Science For Dummies is the pe
<br/><br/>
Edition 1 of this book:
        <br/>
<ul>
<li>Provides a background in data science fundamentals before moving on to working with relational databases and unstructured data and preparing your data for analysis</li>
<li>Details different data visualization techniques that can be used to showcase and summarize your data</li>
<li>Explains both supervised and unsupervised machine learning, including regression, model validation, and clustering techniques</li>
<li>Includes coverage of big data processing tools like MapReduce, Hadoop, Storm, and Spark</li>
</ul>
<br/><br/>
What to do next:
<br/>
<a class="preview" href="http://www.data-mania.com/blog/books-by-lillian-pierson/" id="link 1">See a preview of the book</a>,
<a cla

In [22]:
soup.ul

<ul>
<li>Provides a background in data science fundamentals before moving on to working with relational databases and unstructured data and preparing your data for analysis</li>
<li>Details different data visualization techniques that can be used to showcase and summarize your data</li>
<li>Explains both supervised and unsupervised machine learning, including regression, model validation, and clustering techniques</li>
<li>Includes coverage of big data processing tools like MapReduce, Hadoop, Storm, and Spark</li>
</ul>

In [23]:
soup.a

<a class="preview" href="http://www.data-mania.com/blog/books-by-lillian-pierson/" id="link 1">See a preview of the book</a>
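
Navigating by tag name only ever returns the first matching tag, and it returns `None` when no such tag exists. A minimal sketch with made-up markup shows how this relates to `find()` and `find_all()`:

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph with two links.
demo = BeautifulSoup('<p><a href="/one">one</a> <a href="/two">two</a></p>', 'html.parser')

first_link = demo.a              # same tag that demo.find('a') returns
same = demo.find('a')
all_links = demo.find_all('a')   # every <a> tag, as a list
missing = demo.table             # no <table> in the document

print(first_link)       # <a href="/one">one</a>
print(len(all_links))   # 2
print(missing)          # None
```

Checking for `None` before chaining further (for example `demo.table.tr`) avoids an `AttributeError` on documents that lack the tag.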

# NavigableString Objects

In [24]:
from bs4 import BeautifulSoup

### Beautiful soup object

In [25]:
soup = BeautifulSoup('<b body="description">Product description</b>', 'html.parser')

### NavigableString objects

In [26]:
tag= soup.b
type(tag)

bs4.element.Tag

In [27]:
tag.name

'b'

In [28]:
tag.string

'Product description'

In [29]:
type(tag.string)

bs4.element.NavigableString

In [30]:
nav_string = tag.string
nav_string

'Product description'

In [31]:
nav_string.replace_with('Null')
tag.string

'Null'
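
One caveat with `.string`: it is only defined when a tag has a single child. When a tag mixes text and nested tags, `.string` is `None` and `get_text()` is the usual fallback. A sketch with made-up markup:

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph that mixes plain text with a nested <b> tag.
demo = BeautifulSoup('<p>Read <b>this</b> book</p>', 'html.parser')

inner = demo.b.string        # <b> has exactly one child, so .string works
outer = demo.p.string        # <p> has three children, so .string is None
full_text = demo.p.get_text()

print(inner)       # this
print(outer)       # None
print(full_text)   # Read this book
```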

#### Working with NavigableString objects

In [32]:
html_doc = '''
<html><head><title>Best Books</title></head>
<body>
<p class='title'><b>DATA SCIENCE FOR DUMMIES</b></p>

<p class='description'>Jobs in data science abound, but few people have the data science skills needed to fill these increasingly important roles in organizations. Data Science For Dummies is the pe
<br><br>
Edition 1 of this book:
        <br>
 <ul>
  <li>Provides a background in data science fundamentals before moving on to working with relational databases and unstructured data and preparing your data for analysis</li>
  <li>Details different data visualization techniques that can be used to showcase and summarize your data</li>
  <li>Explains both supervised and unsupervised machine learning, including regression, model validation, and clustering techniques</li>
  <li>Includes coverage of big data processing tools like MapReduce, Hadoop, Storm, and Spark</li>   
  </ul>
<br><br>
What to do next:
<br>
<a href='http://www.data-mania.com/blog/books-by-lillian-pierson/' class = 'preview' id='link 1'>See a preview of the book</a>,
<a href='http://www.data-mania.com/blog/data-science-for-dummies-answers-what-is-data-science/' class = 'preview' id='link 2'>get the free pdf download,</a> and then
<a href='http://bit.ly/Data-Science-For-Dummies' class = 'preview' id='link 3'>buy the book!</a> 
</p>

<p class='description'>...</p>
'''
soup = BeautifulSoup(html_doc, 'html.parser')

In [33]:
for string in soup.stripped_strings: print(repr(string))

'Best Books'
'DATA SCIENCE FOR DUMMIES'
'Jobs in data science abound, but few people have the data science skills needed to fill these increasingly important roles in organizations. Data Science For Dummies is the pe'
'Edition 1 of this book:'
'Provides a background in data science fundamentals before moving on to working with relational databases and unstructured data and preparing your data for analysis'
'Details different data visualization techniques that can be used to showcase and summarize your data'
'Explains both supervised and unsupervised machine learning, including regression, model validation, and clustering techniques'
'Includes coverage of big data processing tools like MapReduce, Hadoop, Storm, and Spark'
'What to do next:'
'See a preview of the book'
','
'get the free pdf download,'
'and then'
'buy the book!'
'...'


In [34]:
title_tag = soup.title
title_tag

<title>Best Books</title>

In [35]:
title_tag.parent

<head><title>Best Books</title></head>

In [36]:
title_tag.string

'Best Books'

In [37]:
title_tag.string.parent

<title>Best Books</title>
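
Where `.parent` steps up one level, `.parents` walks all the way to the top of the tree. A short sketch on a made-up document; the last entry is the `BeautifulSoup` object itself, whose name is `[document]`:

```python
from bs4 import BeautifulSoup

# Hypothetical minimal document with a title nested two levels deep.
demo = BeautifulSoup('<html><head><title>Best Books</title></head></html>', 'html.parser')

# Collect the name of every ancestor of the <title> tag, innermost first.
ancestors = [parent.name for parent in demo.title.parents]

print(ancestors)   # ['head', 'html', '[document]']
```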

# Working with parsed data in beautiful soup
### Parsing data: an HTML or XML document is passed to the BeautifulSoup() constructor
### The constructor converts the document to Unicode and then parses it with the parser you name, or with the best parser installed if you name none
### Printing data that's in a parse tree
### Searching and retrieving data from parse tree

## Searching and retrieving data
### find_all() method
### searches a tag and its descendants to retrieve tags or strings that match your filters

## Methods for searching and filtering a parse tree
### name argument, keyword argument, string argument, lists, booleans, strings and regular expressions 
### all of these can be passed to find_all(), which returns the matching tags or strings

In [38]:
import pandas as pd

from bs4 import BeautifulSoup

import re

In [39]:
r = '''
<html><head><title>Best Books</title></head>
<body>
<p class='title'><b>DATA SCIENCE FOR DUMMIES</b></p>

<p class='description'>Jobs in data science abound, but few people have the data science skills needed to fill these increasingly important roles in organizations. Data Science For Dummies is the pe
<br><br>
Edition 1 of this book:
        <br>
 <ul>
  <li>Provides a background in data science fundamentals before moving on to working with relational databases and unstructured data and preparing your data for analysis</li>
  <li>Details different data visualization techniques that can be used to showcase and summarize your data</li>
  <li>Explains both supervised and unsupervised machine learning, including regression, model validation, and clustering techniques</li>
  <li>Includes coverage of big data processing tools like MapReduce, Hadoop, Storm, and Spark</li>   
  </ul>
<br><br>
What to do next:
<br>
<a href='http://www.data-mania.com/blog/books-by-lillian-pierson/' class = 'preview' id='link 1'>See a preview of the book</a>,
<a href='http://www.data-mania.com/blog/data-science-for-dummies-answers-what-is-data-science/' class = 'preview' id='link 2'>get the free pdf download,</a> and then
<a href='http://bit.ly/Data-Science-For-Dummies' class = 'preview' id='link 3'>buy the book!</a> 
</p>

<p class='description'>...</p>
'''

### converting html to beautiful soup object

In [40]:
soup = BeautifulSoup(r, 'lxml')
type(soup)

bs4.BeautifulSoup

### Parsing the data, printing the first 100 characters

In [42]:
print (soup.prettify()[0:100])

<html>
 <head>
  <title>
   Best Books
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    DA


### retrieving only the text from the HTML document

In [43]:
text_only = soup.get_text()
print(text_only)

Best Books

DATA SCIENCE FOR DUMMIES
Jobs in data science abound, but few people have the data science skills needed to fill these increasingly important roles in organizations. Data Science For Dummies is the pe

Edition 1 of this book:
        

Provides a background in data science fundamentals before moving on to working with relational databases and unstructured data and preparing your data for analysis
Details different data visualization techniques that can be used to showcase and summarize your data
Explains both supervised and unsupervised machine learning, including regression, model validation, and clustering techniques
Includes coverage of big data processing tools like MapReduce, Hadoop, Storm, and Spark


What to do next:

See a preview of the book,
get the free pdf download, and then
buy the book!
...
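
The blank lines and stray indentation in that output come from whitespace in the markup. `get_text()` accepts `separator` and `strip` arguments that tidy this up. A small sketch on a made-up list:

```python
from bs4 import BeautifulSoup

# Hypothetical list whose items carry extra whitespace.
demo = BeautifulSoup('<ul><li> one </li><li> two </li></ul>', 'html.parser')

raw = demo.get_text()                             # whitespace kept as-is
clean = demo.get_text(separator=' ', strip=True)  # each string stripped, then joined

print(repr(raw))
print(repr(clean))   # 'one two'
```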



### Search and retrieve the data from a parse tree
#### using name arguments

In [44]:
soup.find_all("li")

[<li>Provides a background in data science fundamentals before moving on to working with relational databases and unstructured data and preparing your data for analysis</li>,
 <li>Details different data visualization techniques that can be used to showcase and summarize your data</li>,
 <li>Explains both supervised and unsupervised machine learning, including regression, model validation, and clustering techniques</li>,
 <li>Includes coverage of big data processing tools like MapReduce, Hadoop, Storm, and Spark</li>]

#### filtering with keyword arguments, retrieving the element whose id is "link 3"

In [45]:
soup.find_all(id="link 3")

[<a class="preview" href="http://bit.ly/Data-Science-For-Dummies" id="link 3">buy the book!</a>]
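
Keyword filtering has a few variants worth knowing. Because `class` is a reserved word in Python, Beautiful Soup spells that keyword `class_`; any attribute can also be matched through an `attrs` dictionary, and `limit` caps the number of results. A sketch with a made-up snippet:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet loosely modelled on the book page used in this notebook.
demo_html = '''
<p class="title">Best Books</p>
<p class="description">First</p>
<p class="description">Second</p>
<a class="preview" id="link 1" href="/a">A</a>
'''
demo = BeautifulSoup(demo_html, 'html.parser')

descriptions = demo.find_all('p', class_='description')  # name + class filter
by_attrs = demo.find_all(attrs={'id': 'link 1'})          # dictionary form
first_only = demo.find_all('p', limit=1)                  # stop after one match

print(len(descriptions))    # 2
print(by_attrs[0]['href'])  # /a
print(len(first_only))      # 1
```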

#### retrieve tags using strings, in this case all the ul tags

In [46]:
soup.find_all('ul')

[<ul>
 <li>Provides a background in data science fundamentals before moving on to working with relational databases and unstructured data and preparing your data for analysis</li>
 <li>Details different data visualization techniques that can be used to showcase and summarize your data</li>
 <li>Explains both supervised and unsupervised machine learning, including regression, model validation, and clustering techniques</li>
 <li>Includes coverage of big data processing tools like MapReduce, Hadoop, Storm, and Spark</li>
 </ul>]

#### filtering using list objects 

In [47]:
soup.find_all(['ul', 'b'])

[<b>DATA SCIENCE FOR DUMMIES</b>, <ul>
 <li>Provides a background in data science fundamentals before moving on to working with relational databases and unstructured data and preparing your data for analysis</li>
 <li>Details different data visualization techniques that can be used to showcase and summarize your data</li>
 <li>Explains both supervised and unsupervised machine learning, including regression, model validation, and clustering techniques</li>
 <li>Includes coverage of big data processing tools like MapReduce, Hadoop, Storm, and Spark</li>
 </ul>]

#### filtering using regular expressions, in this case every tag whose name contains the letter l

In [48]:
l = re.compile('l')
for tag in soup.find_all(l): print(tag.name)

html
title
ul
li
li
li
li
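
The pattern is matched anywhere in the tag name, which is why `html` and `title` show up alongside `ul` and `li`. Anchoring the pattern narrows the match. A sketch on a made-up document:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical minimal document.
demo = BeautifulSoup(
    '<html><head><title>T</title></head><body><p><b>x</b></p></body></html>',
    'html.parser',
)

# '^b' only matches tag names that start with the letter b.
starts_with_b = [tag.name for tag in demo.find_all(re.compile('^b'))]

print(starts_with_b)   # ['body', 'b']
```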


#### filtering using boolean values

In [49]:
for tag in soup.find_all(True): print(tag.name)

html
head
title
body
p
b
p
br
br
br
ul
li
li
li
li
br
br
br
a
a
a
p


#### retrieving the href attribute of every a tag

In [50]:
for link in soup.find_all('a'): print(link.get('href'))

http://www.data-mania.com/blog/books-by-lillian-pierson/
http://www.data-mania.com/blog/data-science-for-dummies-answers-what-is-data-science/
http://bit.ly/Data-Science-For-Dummies


#### retrieving strings filtering with regular expressions

In [51]:
soup.find_all(string=re.compile("data"))

['Jobs in data science abound, but few people have the data science skills needed to fill these increasingly important roles in organizations. Data Science For Dummies is the pe\n',
 'Provides a background in data science fundamentals before moving on to working with relational databases and unstructured data and preparing your data for analysis',
 'Details different data visualization techniques that can be used to showcase and summarize your data',
 'Includes coverage of big data processing tools like MapReduce, Hadoop, Storm, and Spark']
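
Two points about string filtering: the match is case-sensitive unless the pattern says otherwise, and the results are `NavigableString` objects, so `.parent` leads back to the enclosing tag. A sketch with made-up markup:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment with "data" in two different casings.
demo = BeautifulSoup(
    '<p>Data science</p><li>big data tools</li><li>clustering</li>',
    'html.parser',
)

case_sensitive = demo.find_all(string=re.compile('data'))
case_insensitive = demo.find_all(string=re.compile('data', re.IGNORECASE))
parent_names = [s.parent.name for s in case_insensitive]

print(case_sensitive)     # ['big data tools']
print(case_insensitive)   # ['Data science', 'big data tools']
print(parent_names)       # ['p', 'li']
```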

# Web scraping and saving result in txt file

In [56]:
from bs4 import BeautifulSoup
import urllib
import urllib.request
import sys
import re

In [59]:
if sys.version_info[0] == 3:
    from urllib.request import urlopen
else:
    # Not Python 3 - today, it is most likely to be Python 2
    # But note that this might need an update when Python 4
    # might be around one day
    from urllib import urlopen
# Fetch the page; the response gets its own name so it does not shadow the URL string
url = 'http://www.python.org'
with urlopen(url) as response:
    r = response.read()
soup = BeautifulSoup(r, "lxml")
type(soup)
type(soup)

bs4.BeautifulSoup

In [61]:
print (soup.prettify()[:100])

<!DOCTYPE html>
<!--[if lt IE 7]>   <html class="no-js ie6 lt-ie7 lt-ie8 lt-ie9">   <![endif]-->
<!-


In [62]:
for link in soup.find_all('a'): print(link.get('href'))

#content
#python-network
/
/psf-landing/
https://docs.python.org
https://pypi.python.org/
/jobs/
/community/
#top
/
/psf/donations/
#site-map
#
javascript:;
javascript:;
javascript:;
#
https://www.facebook.com/pythonlang?fref=ts
https://twitter.com/ThePSF
/community/irc/
/about/
/about/apps/
/about/quotes/
/about/gettingstarted/
/about/help/
http://brochure.getpython.info/
/downloads/
/downloads/
/downloads/source/
/downloads/windows/
/downloads/mac-osx/
/download/other/
https://docs.python.org/3/license.html
/download/alternatives
/doc/
/doc/
/doc/av
https://wiki.python.org/moin/BeginnersGuide
https://devguide.python.org/
https://docs.python.org/faq/
http://wiki.python.org/moin/Languages
http://python.org/dev/peps/
https://wiki.python.org/moin/PythonBooks
/doc/essays/
/community/
/community/survey
/community/diversity/
/community/lists/
/community/irc/
/community/forums/
/psf/annual-report/2019/
/community/workshops/
/community/sigs/
/community/logos/
https://wiki.python.org/moin/
/co
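
Many of the hrefs above are relative (`/jobs/`, `/downloads/`), so they are not usable on their own. One common way to resolve them is `urllib.parse.urljoin` from the standard library. The sketch below works on a made-up offline fragment rather than the live page, and the base URL is an assumption:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Assumed base URL and a hypothetical fragment with relative and absolute links.
base = 'https://www.python.org/'
demo = BeautifulSoup(
    '<a href="/jobs/">Jobs</a>'
    '<a href="https://docs.python.org">Docs</a>'
    '<a href="/downloads/source/">Source</a>',
    'html.parser',
)

# urljoin leaves absolute URLs untouched and resolves relative ones against base.
absolute = [urljoin(base, link.get('href')) for link in demo.find_all('a')]

for address in absolute:
    print(address)
```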

In [65]:
for link in soup.find_all('a', attrs={'href': re.compile("^http")}): print (link)

<a href="https://docs.python.org" title="Python Documentation">Docs</a>
<a href="https://pypi.python.org/" title="Python Package Index">PyPI</a>
<a href="https://www.facebook.com/pythonlang?fref=ts"><span aria-hidden="true" class="icon-facebook"></span>Facebook</a>
<a href="https://twitter.com/ThePSF"><span aria-hidden="true" class="icon-twitter"></span>Twitter</a>
<a href="http://brochure.getpython.info/" title="">Python Brochure</a>
<a href="https://docs.python.org/3/license.html" title="">License</a>
<a href="https://wiki.python.org/moin/BeginnersGuide" title="">Beginner's Guide</a>
<a href="https://devguide.python.org/" title="">Developer's Guide</a>
<a href="https://docs.python.org/faq/" title="">FAQ</a>
<a href="http://wiki.python.org/moin/Languages" title="">Non-English Docs</a>
<a href="http://python.org/dev/peps/" title="">PEP Index</a>
<a href="https://wiki.python.org/moin/PythonBooks" title="">Python Books</a>
<a href="https://wiki.python.org/moin/" title="">Python Wiki</a>


In [70]:
# Write each absolute link to a text file, one per line; the with block closes the file
with open('parsed_data.txt', 'w', encoding='utf-8') as file:
    for link in soup.find_all('a', attrs={'href': re.compile("^http")}):
        soup_link = str(link)
        print (soup_link)
        file.write(soup_link + '\n')

<a href="https://docs.python.org" title="Python Documentation">Docs</a>
<a href="https://pypi.python.org/" title="Python Package Index">PyPI</a>
<a href="https://www.facebook.com/pythonlang?fref=ts"><span aria-hidden="true" class="icon-facebook"></span>Facebook</a>
<a href="https://twitter.com/ThePSF"><span aria-hidden="true" class="icon-twitter"></span>Twitter</a>
<a href="http://brochure.getpython.info/" title="">Python Brochure</a>
<a href="https://docs.python.org/3/license.html" title="">License</a>
<a href="https://wiki.python.org/moin/BeginnersGuide" title="">Beginner's Guide</a>
<a href="https://devguide.python.org/" title="">Developer's Guide</a>
<a href="https://docs.python.org/faq/" title="">FAQ</a>
<a href="http://wiki.python.org/moin/Languages" title="">Non-English Docs</a>
<a href="http://python.org/dev/peps/" title="">PEP Index</a>
<a href="https://wiki.python.org/moin/PythonBooks" title="">Python Books</a>
<a href="https://wiki.python.org/moin/" title="">Python Wiki</a>
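
If the saved links are meant for later analysis, a structured format may be more convenient than raw tag strings. The sketch below writes the href and the link text to a CSV file with the standard `csv` module. The fragment, the column names, and the `parsed_links.csv` file name are all made up for illustration, and the file goes to the system temp directory.

```python
import csv
import os
import tempfile
from bs4 import BeautifulSoup

# Hypothetical offline fragment standing in for a scraped page.
demo = BeautifulSoup(
    '<a href="https://docs.python.org">Docs</a><a href="/jobs/">Jobs</a>',
    'html.parser',
)

# One (href, text) pair per link.
rows = [(a.get('href'), a.get_text(strip=True)) for a in demo.find_all('a')]

# Illustrative output location; newline='' is what the csv module expects.
out_path = os.path.join(tempfile.gettempdir(), 'parsed_links.csv')
with open(out_path, 'w', newline='', encoding='utf-8') as handle:
    writer = csv.writer(handle)
    writer.writerow(['href', 'text'])   # header row
    writer.writerows(rows)

print(out_path)
```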


In [None]:
%pwd