## Parsing Data using BeautifulSoup

Passing an HTML document to the `BeautifulSoup` constructor gives us a _parse tree_ that we can use to navigate and search the document. We can then use methods such as `find_all` to retrieve every matching tag along with its descendants. The `find_all` method accepts several kinds of filters (a tag name, keyword arguments, a list, or a regular expression) that narrow down what is retrieved.
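To make the idea concrete before working with a live page, here is a minimal sketch on a small, made-up HTML snippet. The `html_doc` string and the names `demo_soup`, `links` and `list_items` are assumptions for illustration only and are not part of the notebook's data.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, used here instead of a live web page.
html_doc = """
<html><body>
  <h1>Devices</h1>
  <ul>
    <li><a href="/sensors">Sensors</a></li>
    <li><a href="/gateways">Gateways</a></li>
  </ul>
</body></html>
"""

# 'html.parser' is the parser that ships with Python's standard library.
demo_soup = BeautifulSoup(html_doc, 'html.parser')

# find_all returns a list-like ResultSet of every matching tag.
links = demo_soup.find_all('a')
hrefs = [link['href'] for link in links]
print(hrefs)

# Calling find_all on a tag searches only that tag's descendants.
list_items = demo_soup.ul.find_all('li')
print(len(list_items))
```

Each item in the result is itself a tag object, so it can be searched again or indexed like a dictionary to read its attributes.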

In [2]:
from bs4 import BeautifulSoup
import requests

In [3]:
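# Note: this URL is GitHub's page for the file, so the markup below includes GitHub's own page layout
url = 'https://github.com/ms-robot-please/Data-Mania-Demos/blob/master/Iot-2018.html'
r = requests.get(url)
# 'html' lets Beautiful Soup pick an installed HTML parser; name one (e.g. 'html.parser') for reproducible results
soup = BeautifulSoup(r.text, 'html')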

In [5]:
type(soup)

bs4.BeautifulSoup

In [7]:
print(soup.prettify()[0:500])

<!DOCTYPE html>
<html data-a11y-animated-images="system" data-a11y-link-underlines="true" data-color-mode="auto" data-dark-theme="dark" data-light-theme="light" lang="en">
 <head>
  <meta charset="utf-8"/>
  <link href="https://github.githubassets.com" rel="dns-prefetch"/>
  <link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
  <link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
  <link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>


### Getting data from our parse tree

#### Get _just_ the text

In this case the output is not shown, because the amount of text retrieved exceeds the notebook's size limit :)

In [13]:
txt1 = soup.get_text()
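
Since the full output is too large to display, a small sketch may help show what `get_text` returns. The snippet below uses a made-up HTML string (`html_doc`) rather than the page above; the `separator` and `strip` arguments are standard `get_text` options.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for a real page.
html_doc = "<div><h1>Title</h1><p>First <b>bold</b> paragraph.</p></div>"
demo_soup = BeautifulSoup(html_doc, 'html.parser')

# By default the text pieces are concatenated with nothing between them.
plain = demo_soup.get_text()
print(plain)

# separator and strip make the result easier to read.
spaced = demo_soup.get_text(separator=' ', strip=True)
print(spaced)

# For a large document, slice the string to preview only the beginning.
print(plain[:5])
```

The same slicing trick used with `prettify()` earlier works on `txt1`, for example `txt1[0:500]`.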

### Searching and Retrieving data from a parse tree

#### Retrieving tags by filtering with name arguments

In [26]:
soup.find_all('a')[0:5]

[<a class="px-2 py-4 color-bg-accent-emphasis color-fg-on-emphasis show-on-focus js-skip-to-content" href="#start-of-content">Skip to content</a>,
 <a aria-label="Homepage" class="mr-lg-3 color-fg-inherit flex-order-2" data-ga-click="(Logged out) Header, go to homepage, icon:logo-wordmark" href="https://github.com/">
 <svg aria-hidden="true" class="octicon octicon-mark-github" data-view-component="true" height="32" version="1.1" viewbox="0 0 16 16" width="32">
 <path d="M8 0c4.42 0 8 3.58 8 8a8.013 8.013 0 0 1-5.45 7.59c-.4.08-.55-.17-.55-.38 0-.27.01-1.13.01-2.2 0-.75-.25-1.23-.54-1.48 1.78-.2 3.65-.88 3.65-3.95 0-.88-.31-1.59-.82-2.15.08-.2.36-1.02-.08-2.12 0 0-.67-.22-2.2.82-.64-.18-1.32-.27-2-.27-.68 0-1.36.09-2 .27-1.53-1.03-2.2-.82-2.2-.82-.44 1.1-.16 1.92-.08 2.12-.51.56-.82 1.28-.82 2.15 0 3.06 1.86 3.75 3.64 3.95-.23.2-.44.55-.51 1.07-.46.21-1.61.55-2.33-.66-.15-.24-.6-.83-1.23-.82-.67.01-.27.38.01.53.34.19.73.9.82 1.13.16.45.68 1.31 2.69.94 0 .67.01 1.3.01 1.49 0 .21-.15.45-.

#### Retrieving tags by filtering with keyword arguments

In [27]:
soup.find_all(id="search-icon")

[<template id="search-icon">
 <svg aria-hidden="true" class="octicon octicon-search" data-view-component="true" height="16" version="1.1" viewbox="0 0 16 16" width="16">
 <path d="M10.68 11.74a6 6 0 0 1-7.922-8.982 6 6 0 0 1 8.982 7.922l3.04 3.04a.749.749 0 0 1-.326 1.275.749.749 0 0 1-.734-.215ZM11.5 7a4.499 4.499 0 1 0-8.997 0A4.499 4.499 0 0 0 11.5 7Z"></path>
 </svg>
 </template>]
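
Keyword filters are not limited to `id`. The following sketch, on a made-up snippet, shows a few common variations: `class_` (because `class` is a reserved word in Python), the `attrs` dictionary for attribute names that are not valid Python identifiers, and a tag name combined with a keyword. The HTML and variable names here are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup loosely modelled on the output above.
html_doc = """
<template id="search-icon"><svg class="octicon octicon-search"></svg></template>
<template id="code-icon"><svg class="octicon octicon-code"></svg></template>
<a href="https://github.com/" data-role="home">Home</a>
"""
demo_soup = BeautifulSoup(html_doc, 'html.parser')

# Filter on the id attribute, as in the cell above.
by_id = demo_soup.find_all(id='code-icon')

# class_ matches any one of a tag's CSS classes.
by_class = demo_soup.find_all(class_='octicon-code')

# Hyphenated attribute names go in the attrs dictionary.
by_attr = demo_soup.find_all(attrs={'data-role': 'home'})

# A tag name and a keyword filter can be combined.
combined = demo_soup.find_all('template', id='search-icon')

print(len(by_id), len(by_class), len(by_attr), len(combined))
```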

#### Retrieving tags by filtering with string arguments

In [30]:
soup.find_all('template')[0:5]

[<template id="search-icon">
 <svg aria-hidden="true" class="octicon octicon-search" data-view-component="true" height="16" version="1.1" viewbox="0 0 16 16" width="16">
 <path d="M10.68 11.74a6 6 0 0 1-7.922-8.982 6 6 0 0 1 8.982 7.922l3.04 3.04a.749.749 0 0 1-.326 1.275.749.749 0 0 1-.734-.215ZM11.5 7a4.499 4.499 0 1 0-8.997 0A4.499 4.499 0 0 0 11.5 7Z"></path>
 </svg>
 </template>,
 <template id="code-icon">
 <svg aria-hidden="true" class="octicon octicon-code" data-view-component="true" height="16" version="1.1" viewbox="0 0 16 16" width="16">
 <path d="m11.28 3.22 4.25 4.25a.75.75 0 0 1 0 1.06l-4.25 4.25a.749.749 0 0 1-1.275-.326.749.749 0 0 1 .215-.734L13.94 8l-3.72-3.72a.749.749 0 0 1 .326-1.275.749.749 0 0 1 .734.215Zm-6.56 0a.751.751 0 0 1 1.042.018.751.751 0 0 1 .018 1.042L2.06 8l3.72 3.72a.749.749 0 0 1-.326 1.275.749.749 0 0 1-.734-.215L.47 8.53a.75.75 0 0 1 0-1.06Z"></path>
 </svg>
 </template>,
 <template id="file-code-icon">
 <svg aria-hidden="true" class="octicon octico

#### Retrieving tags by filtering with List Objects

In this case we can use a `list` to specify the set of tags that we want to retrieve. A tag is returned if its name matches any item in the list.

In [32]:
soup.find_all(['template', 'a'])[0:5]

[<a class="px-2 py-4 color-bg-accent-emphasis color-fg-on-emphasis show-on-focus js-skip-to-content" href="#start-of-content">Skip to content</a>,
 <a aria-label="Homepage" class="mr-lg-3 color-fg-inherit flex-order-2" data-ga-click="(Logged out) Header, go to homepage, icon:logo-wordmark" href="https://github.com/">
 <svg aria-hidden="true" class="octicon octicon-mark-github" data-view-component="true" height="32" version="1.1" viewbox="0 0 16 16" width="32">
 <path d="M8 0c4.42 0 8 3.58 8 8a8.013 8.013 0 0 1-5.45 7.59c-.4.08-.55-.17-.55-.38 0-.27.01-1.13.01-2.2 0-.75-.25-1.23-.54-1.48 1.78-.2 3.65-.88 3.65-3.95 0-.88-.31-1.59-.82-2.15.08-.2.36-1.02-.08-2.12 0 0-.67-.22-2.2.82-.64-.18-1.32-.27-2-.27-.68 0-1.36.09-2 .27-1.53-1.03-2.2-.82-2.2-.82-.44 1.1-.16 1.92-.08 2.12-.51.56-.82 1.28-.82 2.15 0 3.06 1.86 3.75 3.64 3.95-.23.2-.44.55-.51 1.07-.46.21-1.61.55-2.33-.66-.15-.24-.6-.83-1.23-.82-.67.01-.27.38.01.53.34.19.73.9.82 1.13.16.45.68 1.31 2.69.94 0 .67.01 1.3.01 1.49 0 .21-.15.45-.
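
Note that the results above start with `a` tags even though `'template'` comes first in the list. A short sketch on a made-up snippet suggests why: matches come back in document order, not in the order of the list. The HTML below is an assumption for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet in which an <a> tag appears before the <template> tag.
html_doc = (
    '<p>Intro</p>'
    '<a href="#one">One</a>'
    '<template id="t1"></template>'
    '<a href="#two">Two</a>'
)
demo_soup = BeautifulSoup(html_doc, 'html.parser')

# Any tag whose name is in the list is returned; <p> is skipped.
matches = demo_soup.find_all(['template', 'a'])
names = [tag.name for tag in matches]
print(names)
```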

#### Retrieving tags by filtering with regular expressions

In [35]:
import re
t = re.compile('t')
for tag in soup.find_all(t)[0:10]:
    print(tag.name)

html
meta
script
script
script
script
script
script
script
script
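
The output above includes `html`, `meta` and `script` because Beautiful Soup applies the pattern with a search, so `re.compile('t')` matches any tag name that contains a `t` anywhere. As a rough sketch on a made-up snippet, anchoring the pattern with `^` restricts the match to names that start with `t`. The HTML and variable names are assumptions for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical document with a mix of tag names.
html_doc = (
    "<html><head><title>Demo</title></head>"
    "<body><table><tr><td>1</td></tr></table><p>text</p></body></html>"
)
demo_soup = BeautifulSoup(html_doc, 'html.parser')

# Unanchored: any tag name containing the letter 't'.
contains_t = [tag.name for tag in demo_soup.find_all(re.compile('t'))]
print(contains_t)

# Anchored: only tag names that begin with 't'.
starts_with_t = [tag.name for tag in demo_soup.find_all(re.compile('^t'))]
print(starts_with_t)
```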
