### Scraping using Beautiful Soup

- [Beautiful Soup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

In [14]:
from bs4 import BeautifulSoup
import lxml  # not used directly; confirms the optional lxml parser is installed

In [15]:
# Load the local HTML file
html_file_path = "bs4-start/website.html"

with open(html_file_path, "r", encoding="utf-8") as file:
    # Read the whole file into a single string
    html_content = file.read()

# Display the raw content
html_content

'<!DOCTYPE html>\n<html>\n\n<head>\n\t<meta charset="utf-8">\n\t<title>Angela\'s Personal Site</title>\n</head>\t\n\n<body>\n\t<h1 id="random">Newly-added h1 just for testing</h1>\n\t<h1 id="name">Angela Yu</h1>\n\t<p><em>Founder of <strong><a href="https://www.appbrewery.co/">The App Brewery</a></strong>.</em></p>\n\t<p>I am an iOS and Web Developer. I ❤️ coffee and motorcycles.</p>\n\t<hr>\n\t<h3 class="heading">Books and Teaching</h3>\n\t<ul>\n\t\t<li>The Complete iOS App Development Bootcamp</li>\n\t\t<li>The Complete Web Development Bootcamp</li>\n\t\t<li>100 Days of Code - The Complete Python Bootcamp</li>\n\t</ul>\n\t<hr>\n\t<h3 class="heading">Other Pages</h3>\n\t<a href="https://angelabauer.github.io/cv/hobbies.html">My Hobbies</a>\n\t<a href="https://angelabauer.github.io/cv/contact-me.html">Contact Me</a>\n</body>\n\n</html>'

In [16]:
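# Parse the markup with Beautiful Soup.
# Beautiful Soup can parse both HTML and XML; "html.parser" selects Python's built-in HTML parser.
soup = BeautifulSoup(html_content, "html.parser")

# Alternatively, use lxml as the parser (requires the lxml package):
# soup = BeautifulSoup(html_content, "lxml")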


# The <title> tag of the page
print(soup.title)

# The name of that tag
print(soup.title.name)

# The string inside the <title> tag
print(soup.title.string)

# The whole document, re-indented
print(soup.prettify())


<title>Angela's Personal Site</title>
title
Angela's Personal Site
<!DOCTYPE html>
<html>
 <head>
  <meta charset="utf-8"/>
  <title>
   Angela's Personal Site
  </title>
 </head>
 <body>
  <h1 id="random">
   Newly-added h1 just for testing
  </h1>
  <h1 id="name">
   Angela Yu
  </h1>
  <p>
   <em>
    Founder of
    <strong>
     <a href="https://www.appbrewery.co/">
      The App Brewery
     </a>
    </strong>
    .
   </em>
  </p>
  <p>
   I am an iOS and Web Developer. I ❤️ coffee and motorcycles.
  </p>
  <hr/>
  <h3 class="heading">
   Books and Teaching
  </h3>
  <ul>
   <li>
    The Complete iOS App Development Bootcamp
   </li>
   <li>
    The Complete Web Development Bootcamp
   </li>
   <li>
    100 Days of Code - The Complete Python Bootcamp
   </li>
  </ul>
  <hr/>
  <h3 class="heading">
   Other Pages
  </h3>
  <a href="https://angelabauer.github.io/cv/hobbies.html">
   My Hobbies
  </a>
  <a href="https://angelabauer.github.io/cv/contact-me.html">
   Contact Me
  </a>
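
The parser choice mentioned in the cell above can be made robust to a missing `lxml` install. The following is a minimal sketch, not part of the original notebook: `make_soup` is a hypothetical helper name, and the sample markup is a shortened stand-in for `website.html`. It relies on Beautiful Soup raising `FeatureNotFound` when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup: str) -> BeautifulSoup:
    """Parse markup with lxml if it is installed, otherwise with html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not installed, so fall back to the parser bundled with Python
        return BeautifulSoup(markup, "html.parser")


# Shortened stand-in for the page used in this notebook (an assumption for illustration)
sample_soup = make_soup(
    "<html><head><title>Angela's Personal Site</title></head><body></body></html>"
)
sample_title = sample_soup.title.string
print(sample_title)
```

Both parsers should give the same result for well-formed markup like this; they can differ in how they repair broken HTML.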

In [17]:
# `find_all` returns every tag that matches the given name
print([tag.string for tag in soup.find_all(name="li")])
print([tag.get_text() for tag in soup.find_all(name="li")])

['The Complete iOS App Development Bootcamp', 'The Complete Web Development Bootcamp', '100 Days of Code - The Complete Python Bootcamp']
['The Complete iOS App Development Bootcamp', 'The Complete Web Development Bootcamp', '100 Days of Code - The Complete Python Bootcamp']
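
The two lists above are identical because each `li` holds a single string. The two accessors can differ once a tag has nested children. A small sketch of that difference, using the "Founder of" paragraph from the page as a standalone snippet (the variable names are illustrative only):

```python
from bs4 import BeautifulSoup

# The nested paragraph from the page, copied here so the example is self-contained
snippet = (
    '<p><em>Founder of <strong><a href="https://www.appbrewery.co/">'
    "The App Brewery</a></strong>.</em></p>"
)
soup = BeautifulSoup(snippet, "html.parser")
paragraph = soup.find(name="p")

# `.string` is None here: the <em> inside <p> has several children,
# so there is no single string to return
nested_string = paragraph.string

# `get_text()` joins every piece of text found inside the tag
nested_text = paragraph.get_text()

print(nested_string)
print(nested_text)
```

As a rule of thumb, `get_text()` is the safer choice when the inner structure of a tag is not known in advance.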


In [18]:
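# `find` returns only the first match; keyword arguments filter on attributes
print(soup.find(name="h1", id="random"))

print(soup.find(name="h1", id="name"))

# `class` is a reserved word in Python, so Beautiful Soup uses `class_`
print(soup.find(name="h3", class_="heading"))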

<h1 id="random">Newly-added h1 just for testing</h1>
<h1 id="name">Angela Yu</h1>
<h3 class="heading">Books and Teaching</h3>
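
Note that `find` stops at the first match, which is why only "Books and Teaching" is printed even though two `h3` tags share the `heading` class. A short sketch contrasting `find` with `find_all`, and showing that `find` returns `None` when nothing matches (the markup is a trimmed copy of the page, and the variable names are illustrative):

```python
from bs4 import BeautifulSoup

# Trimmed copy of the relevant tags from the page
html = """
<h1 id="name">Angela Yu</h1>
<h3 class="heading">Books and Teaching</h3>
<h3 class="heading">Other Pages</h3>
"""
soup = BeautifulSoup(html, "html.parser")

# First match only
first_heading = soup.find(name="h3", class_="heading")

# Every match, returned as a list-like result
all_headings = soup.find_all(name="h3", class_="heading")
heading_texts = [heading.get_text() for heading in all_headings]

# No <h2> exists, so `find` returns None; guard before calling methods on it
missing = soup.find(name="h2")
missing_text = missing.get_text() if missing is not None else None

print(first_heading.get_text())
print(heading_texts)
print(missing_text)
```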


In [19]:
# Narrow down to a particular element based on the HTML structure, using CSS selectors

# `select_one` returns the first matching element
# `select` returns a list of all matching elements
print(soup.select_one(selector="p a"))  # an `a` nested anywhere inside a `p`

soup.select_one(selector="#name")  # `#` selects by `id`

soup.select(selector=".heading")  # `.` selects by `class`

<a href="https://www.appbrewery.co/">The App Brewery</a>


[<h3 class="heading">Books and Teaching</h3>,
 <h3 class="heading">Other Pages</h3>]
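
Once a selector has matched a tag, its text and attributes can be read from the result. A minimal sketch, assuming the goal is to collect every link on the page as a (text, URL) pair; the markup is a trimmed copy of the page and the variable names are illustrative:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the link-bearing parts of the page
html = """
<p><em>Founder of <strong><a href="https://www.appbrewery.co/">The App Brewery</a></strong>.</em></p>
<h3 class="heading">Other Pages</h3>
<a href="https://angelabauer.github.io/cv/hobbies.html">My Hobbies</a>
<a href="https://angelabauer.github.io/cv/contact-me.html">Contact Me</a>
"""
soup = BeautifulSoup(html, "html.parser")

# `select("a")` matches every anchor; `.get("href")` reads the attribute
# and returns None instead of raising if the attribute is absent
links = [(anchor.get_text(), anchor.get("href")) for anchor in soup.select("a")]

# The descendant selector only matches the anchor nested inside the <p>
nested_link = soup.select_one("p a")

for text, url in links:
    print(f"{text}: {url}")
```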

- The law generally tends to favor scraping, as long as the data is
    - publicly available
    - not copyrighted
- Data behind authentication or bot checks should not be scraped
    - CAPTCHA
    - reCAPTCHA
- Ethics
    - Prefer an official API when one exists
    - Respect the website owner
    - Check the site's `robots.txt` (site root URL + `/robots.txt`)
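
The `robots.txt` check can be automated with the standard library's `urllib.robotparser`. The sketch below is an illustration under stated assumptions: the rules, the `my-scraper` user agent, and the `example.com` URLs are all made up, and the rules are parsed from an in-memory list so the example runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real file lives at <site root>/robots.txt
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Allow: /",
]

parser = RobotFileParser()
parser.parse(robots_lines)

# "my-scraper" and the example.com URLs are placeholders for illustration
can_fetch_public = parser.can_fetch(
    "my-scraper", "https://example.com/cv/hobbies.html"
)
can_fetch_private = parser.can_fetch(
    "my-scraper", "https://example.com/private/data.html"
)

print(can_fetch_public)
print(can_fetch_private)
```

For a live site, `parser.set_url(...)` followed by `parser.read()` downloads the file instead of parsing a local list.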