# Parse HTML with BeautifulSoup

## Loading and examining an HTML page

Import
* `requests` for loading web pages
* `math` for math operations
* `bs4`, which provides *BeautifulSoup* for working with HTML

In [1]:
import requests
import bs4
import math

Load the Beautiful Soup home page, then print the first 500 characters and the total length of the result

In [2]:
url = 'http://www.crummy.com/software/BeautifulSoup'
source = requests.get(url).text
print(source[:500])
print(len(source))

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/transitional.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Beautiful Soup: We called him Tortoise because he taught us.</title>
<link rev="made" href="mailto:leonardr@segfault.org">
<link rel="stylesheet" type="text/css" href="/nb/themes/Default/nb.css">
<meta name="Description" content="Beautiful Soup: a library designed for screen-scraping HTML and XML
9321


Is the word `Alice` in the source of the web page?

In [3]:
print('Alice' in source)

False


Count occurrences of the word `Soup`:

In [4]:
print(source.count('Soup'))

45


Find the index of the substring `'foo.com'`.

In [8]:
foo = 'foo.com'
position =  source.find(foo)
print(position)

3528


Quick check that the substring really sits at that position in `source`. Strings can be sliced just like lists.

In [9]:
print(source[position:position + 7])

foo.com


Or use a tidier version that does not hard-code the length:

In [12]:
print(source[position:position + len(foo)])

foo.com
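
One caveat with this pattern: `str.find` returns `-1` when the substring is absent, and slicing from `-1` does not raise an error, it just yields the wrong text. A minimal sketch, using a made-up `page` string as a stand-in for `source`:

```python
# Hypothetical stand-in for the downloaded `source` string.
page = '<a href="http://foo.com">foo</a>'

needle = 'foo.com'
position = page.find(needle)                    # index of the first match
found = page[position:position + len(needle)]   # the matched text itself

missing = page.find('bar.org')                  # -1, because there is no match
wrong_slice = page[missing:missing + 7]         # no error, but not what we want

print(position, found)
print(missing, repr(wrong_slice))

# Check for -1 before slicing.
if missing == -1:
    print('substring not found')
```

Checking the result against `-1` (or testing with `in` first) avoids acting on a meaningless slice.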


## Starting with BeautifulSoup

Getting `BeautifulSoup` object:

In [13]:
soup = bs4.BeautifulSoup(source, "html.parser")

Output raw HTML.
`print(soup)` prints the whole string representation of the `BeautifulSoup` object. Use `print(soup.__str__()[:1000])` (equivalently `print(str(soup)[:1000])`) to first convert the `bs4.BeautifulSoup` object into its string representation and then print only the first 1000 characters.

In [14]:
# print(soup.__str__()[:1000]) - prints the first 1000 characters of the string representation
print(soup) 

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/transitional.dtd">

<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>Beautiful Soup: We called him Tortoise because he taught us.</title>
<link href="mailto:leonardr@segfault.org" rev="made"/>
<link href="/nb/themes/Default/nb.css" rel="stylesheet" type="text/css"/>
<meta content="Beautiful Soup: a library designed for screen-scraping HTML and XML." name="Description"/>
<meta content="Markov Approximation 1.4 (module: leonardr)" name="generator"/>
<meta content="Leonard Richardson" name="author"/>
</head>
<body alink="red" bgcolor="white" link="blue" text="black" vlink="660066">
<img align="right" src="10.1.jpg" width="250"/><br/>
<p>You didn't write that awful page. You're just trying to get some
data out of it. Beautiful Soup is here to help. Since 2004, it's been
saving programmers hours or days of work on quick-turnaround
screen scraping projects.<

Print formatted HTML:

In [15]:
print(soup.prettify())

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/transitional.dtd">
<html>
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <title>
   Beautiful Soup: We called him Tortoise because he taught us.
  </title>
  <link href="mailto:leonardr@segfault.org" rev="made"/>
  <link href="/nb/themes/Default/nb.css" rel="stylesheet" type="text/css"/>
  <meta content="Beautiful Soup: a library designed for screen-scraping HTML and XML." name="Description"/>
  <meta content="Markov Approximation 1.4 (module: leonardr)" name="generator"/>
  <meta content="Leonard Richardson" name="author"/>
 </head>
 <body alink="red" bgcolor="white" link="blue" text="black" vlink="660066">
  <img align="right" src="10.1.jpg" width="250"/>
  <br/>
  <p>
   You didn't write that awful page. You're just trying to get some
data out of it. Beautiful Soup is here to help. Since 2004, it's been
saving programmers hours or days of work on quick-tur

Find all `a` HTML tags

In [16]:
soup.findAll('a')[:10]

[<a href="bs4/download/"><h1>Beautiful Soup</h1></a>,
 <a href="#Download">Download</a>,
 <a href="bs4/doc/">Documentation</a>,
 <a href="#HallOfFame">Hall of Fame</a>,
 <a href="https://code.launchpad.net/beautifulsoup">Source</a>,
 <a href="https://bazaar.launchpad.net/%7Eleonardr/beautifulsoup/bs4/view/head:/NEWS.txt">Changelog</a>,
 <a href="https://groups.google.com/forum/?fromgroups#!forum/beautifulsoup">Discussion group</a>,
 <a href="zine/">Zine</a>,
 <a href="https://tidelift.com/subscription/pkg/pypi-beautifulsoup4?utm_source=pypi-beautifulsoup4&amp;utm_medium=referral&amp;utm_campaign=website">Tidelift subscription</a>,
 <a href="zine/"><i>Tool Safety</i></a>]

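`findAll` matches tag names, not text, so searching for a tag name that does not occur in the page returns an empty list:

In [17]: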
soup.findAll('Soup')

[]

Find the first occurrence of an `a` tag:

In [18]:
first_tag = soup.find('a')

Get the value of the `href` attribute

In [19]:
first_tag.get('href')

'bs4/download/'

Get all HTML links from the web page

In [20]:
tags = soup.findAll('a')
links = []
for tag in tags:
   link = tag.get('href')
   links.append(link)

Alternatively, a more compact list comprehension can be used:

In [21]:
link_list = [l.get('href') for l in soup.findAll('a')]
print(link_list[:5])

['bs4/download/', '#Download', 'bs4/doc/', '#HallOfFame', 'https://code.launchpad.net/beautifulsoup']
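
Most of these `href` values are relative to the page they came from. If absolute URLs are needed, `urllib.parse.urljoin` from the standard library can resolve them. The sketch below assumes the page is served from a base URL with a trailing slash; that slash is an assumption and it matters, as the last line shows:

```python
from urllib.parse import urljoin

# Assumed base URL (note the trailing slash).
base = 'http://www.crummy.com/software/BeautifulSoup/'
hrefs = ['bs4/download/', '#Download',
         'https://code.launchpad.net/beautifulsoup']

# Relative links are resolved against the base, absolute ones are kept.
absolute = [urljoin(base, h) for h in hrefs]
for link in absolute:
    print(link)

# Without the trailing slash the last path segment is replaced instead.
no_slash = urljoin('http://www.crummy.com/software/BeautifulSoup',
                   'bs4/download/')
print(no_slash)
```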


Keep only the external links: loop over all the links and add those starting with `http://` or `https://` to the resulting list

In [22]:
external_links = []
for l in link_list:
   if l[:7] == 'http://' or l[:8] == 'https://':
       external_links.append(l)

TypeError: 'NoneType' object is not subscriptable

In [23]:
%debug

> <ipython-input-22-7323dedaebfc>(3)<module>()
      1 external_links = []
      2 for l in link_list:
----> 3    if l[:7] == 'http://' or l[:8] == 'https://':
      4        external_links.append(l)

ipdb>  exit


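Why?
Some `a` tags have no `href` attribute, so `tag.get('href')` returned `None` for them, and `None` cannot be sliced. Let's make sure that every item we check is actually a string and not `None`.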

In [24]:
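external_links = []  # start from an empty list again after the failed cell
for l in link_list:
    # skip tags without an href (None) and keep only absolute http(s) URLs
    if l is not None and l.startswith(('http://', 'https://')):
        external_links.append(l)
print(external_links[:5])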

['https://code.launchpad.net/beautifulsoup', 'https://bazaar.launchpad.net/%7Eleonardr/beautifulsoup/bs4/view/head:/NEWS.txt', 'https://groups.google.com/forum/?fromgroups#!forum/beautifulsoup', 'https://tidelift.com/subscription/pkg/pypi-beautifulsoup4?utm_source=pypi-beautifulsoup4&utm_medium=referral&utm_campaign=website', 'https://groups.google.com/forum/?fromgroups#!forum/beautifulsoup']
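
The `None` entries can also be avoided at the source. As a sketch of an alternative, `find_all('a', href=True)` (`find_all` is the current name of `findAll`) only returns tags that carry an `href` attribute. The HTML snippet and the names `demo_soup`, `hrefs` and `hrefs_present` are made up for illustration:

```python
import bs4

# Hypothetical snippet: the second anchor has no href attribute.
html = ('<a href="bs4/doc/">Docs</a>'
        '<a name="top">Top</a>'
        '<a href="https://code.launchpad.net/beautifulsoup">Source</a>')
demo_soup = bs4.BeautifulSoup(html, 'html.parser')

# Tag.get returns None when the attribute is missing.
hrefs = [a.get('href') for a in demo_soup.find_all('a')]

# href=True keeps only tags that actually have the attribute,
# so indexing with ['href'] is safe here.
hrefs_present = [a['href'] for a in demo_soup.find_all('a', href=True)]

external = [h for h in hrefs_present
            if h.startswith(('http://', 'https://'))]

print(hrefs)
print(external)
```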


## Parsing DOM tree with BeautifulSoup

Load a string with HTML code into a `bs4.BeautifulSoup` object

In [2]:
import bs4  # needed again if the kernel was restarted

htmlString = "<!DOCTYPE html><html><head>" \
    "<title>This is a title</title></head>" \
    "<body><h3> Test </h3><p>Hello world!</p></body></html>"
tree = bs4.BeautifulSoup(htmlString, "html.parser")  # get BeautifulSoup object

Get the `html` _root node_ of the DOM tree

In [26]:
root_node = tree.html
print(root_node)

<html><head><title>This is a title</title></head><body><h3> Test </h3><p>Hello world!</p></body></html>


Get `head` of the root using `contents`

In [27]:
head = root_node.contents[0]
print(head)

<head><title>This is a title</title></head>


Get `body` from `root`

In [28]:
body = root_node.contents[1]
print(body)

<body><h3> Test </h3><p>Hello world!</p></body>


Alternatively, body can be accessed directly:

In [29]:
body = tree.body
print(body)

<body><h3> Test </h3><p>Hello world!</p></body>
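
Indexing into `contents` works here because the HTML string has no whitespace between tags. With `html.parser`, line breaks between tags become text nodes of their own, so positional indices shift. A small sketch with a made-up, line-broken document (`pretty_html` and `demo_tree` are illustrative names):

```python
import bs4

# Hypothetical document with newlines between the tags.
pretty_html = """<html>
<head><title>This is a title</title></head>
<body><p>Hello world!</p></body>
</html>"""
demo_tree = bs4.BeautifulSoup(pretty_html, 'html.parser')

children = demo_tree.html.contents
# The first child is the newline after <html>, not the <head> tag.
print(repr(children[0]))

# Keep only real tags, or access them by name instead of by position.
tags_only = [c for c in children if isinstance(c, bs4.Tag)]
print([t.name for t in tags_only])
print(demo_tree.html.head.title.string)
```

Accessing children by name (`tree.head`, `tree.body`) is therefore usually more robust than relying on their position.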


## Web scraping case
Scrape the job portal www.indeed.com for information on data science skills.

Load a page from www.indeed.com with posts about Data Scientist job offers and load it into a `BeautifulSoup` object

In [1]:
import requests
import bs4

url = 'http://www.indeed.com/jobs?q=data+scientist&l='
source = requests.get(url).text
bs_tree = bs4.BeautifulSoup(source, 'html.parser')

Check how many job postings we found

In [31]:
job_count_string = bs_tree.find(id = 'searchCount').contents[0]
job_count_string = job_count_string.split()[-2]
print("Search yielded %s hits." % (job_count_string))

Search yielded 32,003 hits.


Note that `job_count_string` is still a string, not an integer, and the `,` separator prevents us from just casting it to `int`.
To convert a string such as `"32,006"` into an integer we
* iterate through all characters of the string, filtering out everything that is not a digit (`filter()` function),
* concatenate elements of the resulting list with digit characters using `str.join()`,
* convert the resulting string into an integer (`int()`).

In [32]:
job_count_string = '32,006'
job_count_digits = list(filter(lambda c: c in '1234567890', job_count_string))
print(job_count_digits)

['3', '2', '0', '0', '6']


In [33]:
job_count = int(''.join(job_count_digits))
print(type(job_count))

<class 'int'>
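
For comparison, here is a shorter way to do the same conversion. It is a sketch, not a replacement for the steps above: `str.replace` handles exactly one known separator, while the `isdigit` variant mirrors the `filter()` approach and drops any non-digit character.

```python
job_count_string = '32,006'

# Option 1: remove the thousands separator directly.
count_a = int(job_count_string.replace(',', ''))

# Option 2: keep only the digit characters, whatever the separator is.
count_b = int(''.join(c for c in job_count_string if c.isdigit()))

print(count_a, count_b)
```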


The website lists only 10 results per page, so we need to scrape them page by page

In [34]:
url = 'http://www.indeed.com/jobs?q=data+scientist'
html_page = requests.get(url).text
bs_tree = bs4.BeautifulSoup(html_page, 'html.parser')

In [35]:
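# num_pages = int(math.ceil(job_count/10.0))  # number of pages needed for all results
num_pages = 3  # only used for the progress message below

base_url = 'http://www.indeed.com'
job_links = []
for i in range(5):  # scrapes 5 pages; use range(num_pages) if you want them all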
    if i%10==0:
        print("Number of pages left: ", num_pages-i)
    url = 'http://www.indeed.com/jobs?q=data+scientist&start=' + str(i*10)
    html_page = requests.get(url).text
    bs_tree = bs4.BeautifulSoup(html_page, 'html.parser')
    
    job_postings = []
    job_posting_candidates = bs_tree.findAll("h2", {"class": "jobtitle"})
    for candidate in job_posting_candidates: 
        job_postings.append(candidate.findChildren()[0].get("href"))
    
    for job in job_postings:     # go after each link
        job_links.append(base_url + job)
        
print("We found a lot of jobs: ", len(job_links))

Number of pages left:  3
We found a lot of jobs:  50
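
The page arithmetic used in the loop can be sketched on its own, without any network access. The helper name `page_url` is hypothetical, and the value of 10 results per page is taken from the text above, not from the site itself:

```python
import math
from urllib.parse import urlencode

job_count = 32006
results_per_page = 10   # assumption, as stated in the text above

# Round up so that a partially filled last page is still counted.
num_pages = math.ceil(job_count / results_per_page)

def page_url(page_index, query='data scientist'):
    """Build the search URL for a zero-based page index."""
    params = urlencode({'q': query,
                        'start': page_index * results_per_page})
    return 'http://www.indeed.com/jobs?' + params

print(num_pages)
print(page_url(0))
print(page_url(2))
```

Building the query string with `urlencode` takes care of escaping spaces and special characters in the search term.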


Count how many of the collected job postings mention each skill:

In [38]:
skill_set = {'hadoop': 0, 'spark': 0}
counter = 0

for link in job_links:
    counter +=1  
    
    try:
        html_page = requests.get(link).text
    except Exception:
        continue  # proceed with next link in case of an error

    html_text = html_page.lower()    
    for key in skill_set.keys():
        # `in` counts each skill at most once per posting,
        # no matter how often the term occurs in the text
        if key in html_text:
            skill_set[key] += 1

    if counter % 5 == 0:
        print("\rJob postings left:", len(job_links) - counter, skill_set,
              end='', flush=True)

Job postings left:  0 {'hadoop': 7, 'spark': 5}
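
A plain substring test also matches inside longer words, for example `spark` in `sparkling`. One possible refinement, sketched here on made-up posting texts, is a regular expression with word boundaries. The function name `count_postings` and the sample data are illustrative only:

```python
import re

skills = ['hadoop', 'spark']
# Hypothetical posting texts standing in for the downloaded pages.
postings = [
    'We use Spark and Hadoop. Spark experience is a must.',
    'Join our sparkling team!',
    'Experience with HADOOP clusters required.',
]

def count_postings(skills, postings):
    """Count postings that mention each skill as a whole word."""
    counts = {skill: 0 for skill in skills}
    for text in postings:
        for skill in skills:
            pattern = r'\b' + re.escape(skill) + r'\b'
            if re.search(pattern, text, re.IGNORECASE):
                counts[skill] += 1
    return counts

# Substring matching, as in the cell above, for comparison.
naive = {s: sum(s in p.lower() for p in postings) for s in skills}
strict = count_postings(skills, postings)

print(naive)
print(strict)
```

The scraped pages are raw HTML, so either approach may also match terms inside markup or scripts rather than in the visible job description.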

## Accessing Twitter API
**Task**: List popular tweets of Donald Trump (with more than 10000 retweets)

Authenticate your app against Twitter with 
* consumer key
* consumer secret
* access token key
* access token secret

In [39]:
import twitter


# Replace the placeholders with the credentials of your own Twitter app;
# never publish real keys in a notebook.
cKey = 'YOUR_CONSUMER_KEY'
cSecret = 'YOUR_CONSUMER_SECRET'
aKey = 'YOUR_ACCESS_TOKEN_KEY'
aSecret = 'YOUR_ACCESS_TOKEN_SECRET'

# create the api object with the python-twitter library
api = twitter.Api(consumer_key=cKey, consumer_secret=cSecret, 
                  access_token_key=aKey, access_token_secret=aSecret, tweet_mode="extended")
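
One common way to keep the four secrets out of the notebook is to read them from environment variables. The sketch below is only an illustration: the variable names in `ENV_NAMES` and the helper `load_credentials` are hypothetical, and the demo uses a fake environment so it runs without real keys.

```python
import os

# Hypothetical environment variable names; choose your own.
ENV_NAMES = {
    'consumer_key': 'TWITTER_CONSUMER_KEY',
    'consumer_secret': 'TWITTER_CONSUMER_SECRET',
    'access_token_key': 'TWITTER_ACCESS_TOKEN_KEY',
    'access_token_secret': 'TWITTER_ACCESS_TOKEN_SECRET',
}

def load_credentials(env=None):
    """Return the credentials as keyword arguments, or fail loudly."""
    env = os.environ if env is None else env
    missing = [name for name in ENV_NAMES.values() if name not in env]
    if missing:
        raise RuntimeError('Missing environment variables: '
                           + ', '.join(missing))
    return {arg: env[name] for arg, name in ENV_NAMES.items()}

# Demo with a fake environment instead of real secrets.
fake_env = {name: 'dummy-' + name.lower() for name in ENV_NAMES.values()}
credentials = load_credentials(fake_env)
print(sorted(credentials))
```

With the real variables set, the result could be passed on as `twitter.Api(**load_credentials(), tweet_mode="extended")`.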

Get the timeline for the user with screen_name `realDonaldTrump`

In [40]:
twitter_statuses = api.GetUserTimeline(
        screen_name = 'realDonaldTrump')
statuses = [t.AsDict() for t in twitter_statuses]

We will consider tweets with a `retweet_count` larger than 10000 as popular. First, look at a single status to see which fields are available:


In [41]:
statuses[0]

{'created_at': 'Thu Nov 15 20:43:42 +0000 2018',
 'favorite_count': 634,
 'full_text': 'It is our sacred duty to support America’s Service Members every single day they wear the uniform – and every day after when they return home as Veterans. Together we will HONOR those who defend us, we will CHERISH those who protect us, and we will celebrate the amazing heroes... https://t.co/kovcIj4fwU',
 'hashtags': [],
 'id': 1063170699500642304,
 'id_str': '1063170699500642304',
 'lang': 'en',
 'media': [{'display_url': 'pic.twitter.com/kovcIj4fwU',
   'expanded_url': 'https://twitter.com/realDonaldTrump/status/1063170699500642304/video/1',
   'id': 1063170305084944384,
   'media_url': 'http://pbs.twimg.com/ext_tw_video_thumb/1063170305084944384/pu/img/F1wBFXepohxi2fOz.jpg',
   'media_url_https': 'https://pbs.twimg.com/ext_tw_video_thumb/1063170305084944384/pu/img/F1wBFXepohxi2fOz.jpg',
   'sizes': {'large': {'h': 720, 'resize': 'fit', 'w': 1280},
    'medium': {'h': 675, 'resize': 'fit', 'w': 1

In [42]:
maybe_interesting = []
for status in statuses:
    if status['retweet_count'] > 10000:
        maybe_interesting.append(status)

# same using `filter`
# maybe_interesting = list(
#    filter(lambda status:
#        status['retweet_count'] > 10000,
#        statuses))
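
The same selection can also be written as a list comprehension. The sketch below runs on made-up dictionaries (`sample_statuses` is a hypothetical stand-in for `statuses`) and uses `dict.get` with a default, on the assumption that a status dictionary may lack the `retweet_count` key:

```python
# Hypothetical stand-ins for the dictionaries returned by AsDict().
sample_statuses = [
    {'full_text': 'first', 'retweet_count': 25000},
    {'full_text': 'second', 'retweet_count': 300},
    {'full_text': 'third'},  # assumed case: key is missing entirely
]

threshold = 10000

# .get(..., 0) treats a missing count as zero instead of raising KeyError.
popular = [s for s in sample_statuses
           if s.get('retweet_count', 0) > threshold]

print(len(popular), popular[0]['full_text'])
```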

Get rid of everything except texts of those tweets and print them.

In [43]:
tweet_texts = list(map(lambda status: 
                       status['full_text'], 
                       maybe_interesting))
for t in tweet_texts[:5]:
    print('######')
    print(t)

######
The only “Collusion” is that of the Democrats with Russia and many others. Why didn’t the FBI take the Server from the DNC? They still don’t have it. Check out how biased Facebook, Google and Twitter are in favor of the Democrats. That’s the real Collusion!
######
Universities will someday study what highly conflicted (and NOT Senate approved) Bob Mueller and his gang of Democrat thugs have done to destroy people. Why is he protecting Crooked Hillary, Comey, McCabe, Lisa Page &amp; her lover, Peter S, and all of his friends on the other side?
######
....care how many lives the ruin. These are Angry People, including the highly conflicted Bob Mueller, who worked for Obama for 8 years. They won’t even look at all of the bad acts and crimes on the other side. A TOTAL WITCH HUNT LIKE NO OTHER IN AMERICAN HISTORY!
######
The inner workings of the Mueller investigation are a total mess. They have found no collusion and have gone absolutely nuts. They are screaming and shouting at peop