ISRC Python Workshop: Scrape web data

___Getting data from Web: Parse by Hand!___

<hr>

@author: Zhiya Zuo

@email: zhiya-zuo@uiowa.edu

---

#### Introduction

Sometimes a service offers an API but no well-written package in the language you prefer (e.g., only Java, no Python library). Even worse, there may be no public API at all, and we have to design a scraper to retrieve the relevant information ourselves. In such cases, we can manually build our own wrapper functions.

##### Preliminary examples

Examples from <a href="https://www.w3schools.com/html/tryit.asp?filename=tryhtml_basic_document" target="_blank">w3schools</a>.

```html
<!DOCTYPE html>
<html>
<body>

<h1>My First Heading</h1>

<p>My 1st paragraph.</p>
<p>My 2nd paragraph.</p>
<p>My 3rd paragraph.</p>

</body>
</html>
```

Save this code to your disk as `sample.html` (or any other name). The cells below read it from `sample-data/sample.html`, so adjust the path to wherever you saved the file. We will use a great library called ___`Beautiful Soup`___ to read the contents from Python. You may also need to install `lxml`, the parser used below for HTML and XML.

In [1]:
## Do the following if you have not
# pip install beautifulsoup4 lxml
from bs4 import BeautifulSoup as Soup

In [2]:
with open("sample-data/sample.html", "r") as sample:
    sample_contents = sample.read()

Read as a plain string, the HTML structure is hard to see. This is where BeautifulSoup comes in really handy!

In [4]:
sample_contents

'<!DOCTYPE html>\n<html>\n<body>\n\n<h1>My First Heading</h1>\n\n<p>My 1st paragraph.</p>\n<p>My 2nd paragraph.</p>\n<p>My 3rd paragraph.</p>\n\n</body>\n</html>\n'

In [6]:
sample_soup = Soup(sample_contents, 'lxml')

Printing the soup with `prettify()` shows the same contents as above, with proper indentation:

In [7]:
print(sample_soup.prettify())

<!DOCTYPE html>
<html>
 <body>
  <h1>
   My First Heading
  </h1>
  <p>
   My 1st paragraph.
  </p>
  <p>
   My 2nd paragraph.
  </p>
  <p>
   My 3rd paragraph.
  </p>
 </body>
</html>



Get the contents of interest: all the `p`'s

_`p` means paragraph in HTML. Check more tag definitions on https://www.w3schools.com._

In [8]:
p_tags = sample_soup.find_all("p")

For each `p` tag, we extract its text.

In [9]:
for p in p_tags:
    print(p.text)

My 1st paragraph.
My 2nd paragraph.
My 3rd paragraph.
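
If you would rather not depend on a file on disk, the same extraction can be sketched against an in-memory string. The helper name `extract_text` is hypothetical, and the built-in `html.parser` backend is used here instead of `lxml` only so that the snippet runs without an extra install.

```python
from bs4 import BeautifulSoup

# The same w3schools sample, kept in a string instead of a file.
SAMPLE_HTML = """<!DOCTYPE html>
<html>
<body>
<h1>My First Heading</h1>
<p>My 1st paragraph.</p>
<p>My 2nd paragraph.</p>
<p>My 3rd paragraph.</p>
</body>
</html>"""


def extract_text(html, tag_name):
    """Return the stripped text of every `tag_name` element in `html`."""
    soup = BeautifulSoup(html, "html.parser")
    return [tag.get_text(strip=True) for tag in soup.find_all(tag_name)]


paragraphs = extract_text(SAMPLE_HTML, "p")
headings = extract_text(SAMPLE_HTML, "h1")
print(headings)
print(paragraphs)
```

Because the tag name is a parameter, the same helper works for headings, links, or any other tag.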


---

#### A real example

As you can see, this is very straightforward. Let's use a real website for illustration. For example, if we are interested in company profiles, we can scrape them from [Google Finance](https://www.google.com/finance). We will use <a href="https://www.google.com/finance?q=NYSE%3AIBM&ei=Ij62WPHgGdSLmAHrja_wAQ" target="_blank">IBM's profile</a> as an example. Off-the-shelf packages may already exist for this kind of data; we use the site here only as an introduction to scraping manually.

To view the underlying structure of a web page, you can use the ___`developer tools`___ in your browser, which show something like the screenshot below. When you hover over an element on the page, the panel highlights the corresponding tags in the HTML source. You will find that the description text is located within a `p` tag.

<img src="http://i.imgur.com/IEl2uyG.png" width="1000">

Recall that [`requests`](http://docs.python-requests.org/) is a convenient package for sending HTTP requests.

In [14]:
import requests

In [27]:
ibm_url = "https://finance.google.com/finance?q=NYSE%3AIBM&ei=lyOaWvCNC8SrjAGGgoewCQ"
r = requests.get(ibm_url)
r.status_code

200

Convert it to a soup object

In [28]:
ibm_soup = Soup(r.text, 'lxml')

Find the corresponding tag. Note that `class_` has a trailing underscore `_`, because `class` is a reserved word in Python.

In [30]:
summary_tag = ibm_soup.find("div", class_="companySummary")
print('--------------------')
print(summary_tag.text)
print('--------------------')

--------------------

International Business Machines Corporation (IBM) is a technology company. The Company operates through five segments: Cognitive Solutions, Global Business Services (GBS), Technology Services & Cloud Platforms, Systems and Global Financing. The Cognitive Solutions segment delivers a spectrum of capabilities, from descriptive, predictive and prescriptive analytics to cognitive systems. Cognitive Solutions includes Watson, a cognitive computing platform that has the ability to interact in natural language, process big data, and learn from interactions with people and computers. The GBS segment provides clients with consulting, application management services and global process services. The Technology Services & Cloud Platforms segment provides information technology infrastructure services. The Systems segment provides clients with infrastructure technologies. The Global Financing segment includes client financing, commercial financing, and remanufacturing and rema
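
One caveat about the lookup above: `find` returns `None` when no tag matches, so `summary_tag.text` raises an `AttributeError` if the site changes its layout or class names. A minimal defensive sketch follows. The helper `find_text` and the inline markup are hypothetical; only the class name `companySummary` comes from the example above.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page; only the class name
# `companySummary` is taken from the example above.
PAGE_HTML = """
<html><body>
  <div class="companySummary"> A technology company. </div>
</body></html>
"""


def find_text(soup, name, css_class):
    """Return the stripped text of the first matching tag, or None if absent."""
    tag = soup.find(name, class_=css_class)
    if tag is None:
        return None
    return tag.get_text(strip=True)


soup = BeautifulSoup(PAGE_HTML, "html.parser")
summary = find_text(soup, "div", "companySummary")
missing = find_text(soup, "div", "noSuchClass")
print(summary)
print(missing)
```

Checking for `None` lets a scraper report a missing element clearly instead of crashing in the middle of a long run.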

With this in mind, you can scrape almost any webpage of interest. Other formats such as <a href="http://www.json.org/" target="_blank">JSON</a> and <a href="https://www.w3.org/XML/" target="_blank">XML</a> share many similarities with HTML, along with a few differences, and their basics are not difficult to learn! (We talked about this in [the previous notebook](https://github.com/zhiyzuo/uiowa-isrc-python/blob/master/5-Getting-Data-Using-APIs.ipynb).)

***But keep in mind that you should scrape politely and with proper permission! To find out whether specific paths/contents are allowed to be scraped, you can check the site's ___`robots.txt`___. For example, <a href="https://www.google.com/robots.txt" target="_blank">here's</a> the permission information set by Google.***
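
Python's standard library can read `robots.txt` rules for you. The sketch below parses a small made-up rule set so that it runs offline; the rules, the `example.com` URLs, and the user agent name `my-workshop-bot` are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration; a real site publishes its own at /robots.txt.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def build_parser(robots_text):
    """Create a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(EXAMPLE_ROBOTS)
allowed_public = parser.can_fetch("my-workshop-bot", "http://example.com/page/2/")
allowed_private = parser.can_fetch("my-workshop-bot", "http://example.com/private/data")
print(allowed_public, allowed_private)
```

For a live site, `parser.set_url(...)` followed by `parser.read()` downloads the rules instead of parsing a string.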

---

#### More than one page

##### Simple case: Pagination

There are situations where the data we need is spread across different pages. We will be using [this website](http://spidyquotes.herokuapp.com/) as an example.

After we click on `next page`, the address bar becomes http://spidyquotes.herokuapp.com/page/2/, which shows that we can jump to any page by changing the number at the end of the URL. Let's try!

In [31]:
quote_url = "http://spidyquotes.herokuapp.com/page/"

Let's use `2`:

In [39]:
quote_2 = quote_url + "2"
quote_2

'http://spidyquotes.herokuapp.com/page/2'
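
The same pattern generalizes to any page number. Here is a small sketch that builds the addresses for several pages at once. The helper `page_url` is hypothetical, it assumes numbering starts at 1, and it keeps the trailing slash seen in the address bar.

```python
QUOTE_PAGE_URL = "http://spidyquotes.herokuapp.com/page/"


def page_url(page):
    """Build the address of one listing page, e.g. page 2 -> .../page/2/"""
    if page < 1:
        # Assumption: the site numbers its pages starting from 1.
        raise ValueError("page numbers start at 1")
    return f"{QUOTE_PAGE_URL}{page}/"


first_three = [page_url(n) for n in range(1, 4)]
for url in first_three:
    print(url)
```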

In [41]:
r = requests.get(quote_2)
r.status_code

200

In [43]:
quote_2_soup = Soup(r.text, 'lxml')

By checking the HTML structure using Chrome's `developer tools`, we find that all the quotes are `div` tags with class `quote`. Within each `quote` `div`, there are 3 elements: quote text, author, and tags:
```html
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“If you can't explain it to a six year old, you don't understand it yourself.”</span>
        <span>by <small class="author" itemprop="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
        </span>
        <div class="tags">
            Tags:
            <meta class="keywords" itemprop="keywords" content="simplicity,understand"> 
            
            <a class="tag" href="/tag/simplicity/page/1/">simplicity</a>
            
            <a class="tag" href="/tag/understand/page/1/">understand</a>
            
        </div>
    </div>
```

With all this in mind, we can build a very simple wrapper function:

In [51]:
import pandas as pd

In [68]:
def parse_quote(quote_div):
    '''Each tag has 3 elements. We just retrieve text and author here'''
    span_tags = quote_div.find_all('span')
    quote_text = span_tags[0].text
    quote_author = span_tags[1].find('small').text
    return {'author': quote_author, 'quote': quote_text}
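
`parse_quote` relies on the position of the `span` tags and skips the third element. As a sketch of an alternative, the hypothetical variant below selects each piece by its class name and also collects the tags. The markup is a trimmed copy of the snippet above, and `html.parser` is used only to avoid the `lxml` dependency.

```python
from bs4 import BeautifulSoup

# Trimmed copy of one quote block from the page.
QUOTE_HTML = """
<div class="quote">
  <span class="text">“If you can't explain it to a six year old, you don't understand it yourself.”</span>
  <span>by <small class="author">Albert Einstein</small></span>
  <div class="tags">
    <a class="tag" href="/tag/simplicity/page/1/">simplicity</a>
    <a class="tag" href="/tag/understand/page/1/">understand</a>
  </div>
</div>
"""


def parse_quote_with_tags(quote_div):
    """Hypothetical variant of parse_quote: select by class and keep the tags."""
    return {
        "author": quote_div.find("small", class_="author").get_text(strip=True),
        "quote": quote_div.find("span", class_="text").get_text(strip=True),
        "tags": [a.get_text(strip=True) for a in quote_div.find_all("a", class_="tag")],
    }


quote_div = BeautifulSoup(QUOTE_HTML, "html.parser").find("div", class_="quote")
record = parse_quote_with_tags(quote_div)
print(record)
```

Selecting by class tends to survive small layout changes better than indexing into a list of tags, at the cost of depending on the class names staying the same.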

In [69]:
# Often, constants are stored in variables with all caps.
QUOTE_URL = "http://spidyquotes.herokuapp.com/page/"
def get_quote(page=1):
    r = requests.get(QUOTE_URL+str(page))
    soup = Soup(r.text, 'lxml')
    quote_div_list = soup.find_all('div', class_='quote')
    quote_df = pd.DataFrame([parse_quote(div) for div in quote_div_list])
    return quote_df

In [70]:
get_quote(2)

| | author | quote |
|---|---|---|
| 0 | Marilyn Monroe | “This life is what you make it. No matter what... |
| 1 | J.K. Rowling | “It takes a great deal of bravery to stand up ... |
| 2 | Albert Einstein | “If you can't explain it to a six year old, yo... |
| 3 | Bob Marley | “You may not be her first, her last, or her on... |
| 4 | Dr. Seuss | “I like nonsense, it wakes up the brain cells.... |
| 5 | Douglas Adams | “I may not have gone where I intended to go, b... |
| 6 | Elie Wiesel | “The opposite of love is not hate, it's indiff... |
| 7 | Friedrich Nietzsche | “It is not a lack of love, but a lack of frien... |
| 8 | Mark Twain | “Good friends, good books, and a sleepy consci... |
| 9 | Allen Saunders | “Life is what happens to us while we are makin... |


We can further write a function that takes a list of pages:

In [79]:
def get_quotes(page_list=(1,)):
    # A tuple default avoids the mutable-default-argument pitfall
    return pd.concat([get_quote(pg) for pg in page_list],
                     axis=0, ignore_index=True)

In [80]:
get_quotes([1,2])

| | author | quote |
|---|---|---|
| 0 | Albert Einstein | “The world as we have created it is a process ... |
| 1 | J.K. Rowling | “It is our choices, Harry, that show what we t... |
| 2 | Albert Einstein | “There are only two ways to live your life. On... |
| 3 | Jane Austen | “The person, be it gentleman or lady, who has ... |
| 4 | Marilyn Monroe | “Imperfection is beauty, madness is genius and... |
| 5 | Albert Einstein | “Try not to become a man of success. Rather be... |
| 6 | André Gide | “It is better to be hated for what you are tha... |
| 7 | Thomas A. Edison | “I have not failed. I've just found 10,000 way... |
| 8 | Eleanor Roosevelt | “A woman is like a tea bag; you never know how... |
| 9 | Steve Martin | “A day without sunshine is like, you know, nig... |
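
When a loop requests many pages, it is considerate to pause between requests and to keep going if one page fails. The sketch below shows one way to do that. The names `fetch_pages` and `fake_fetch` are hypothetical, and the fetcher is passed in as an argument so that the example runs without network access.

```python
import time


def fetch_pages(page_list, fetch_page, delay_seconds=1.0):
    """Call fetch_page(page) for each page, sleeping between requests.

    Pages whose fetch raises an exception are recorded and skipped,
    so one bad page does not discard the others.
    """
    results, failed = {}, []
    for i, page in enumerate(page_list):
        if i > 0:
            time.sleep(delay_seconds)  # pause between consecutive requests
        try:
            results[page] = fetch_page(page)
        except Exception as err:
            failed.append((page, str(err)))
    return results, failed


def fake_fetch(page):
    """Stand-in fetcher for the demo; page 3 simulates a failure."""
    if page == 3:
        raise RuntimeError("simulated network error")
    return [f"quote from page {page}"]


results, failed = fetch_pages([1, 2, 3], fake_fetch, delay_seconds=0)
print(sorted(results))
print(failed)
```

In real use, a function such as `get_quote` could be passed as `fetch_page`, with a delay of a second or more.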


##### Infinite scroll?

In some cases, we may see infinite scroll pages such as Twitter/Facebook. The quote website also offers a great example for us to try scraping on this type of page: http://spidyquotes.herokuapp.com/scroll

When we examine such pages, the first thing to do is look at the ___network___ tab in the `developer tools`:

![Imgur](https://i.imgur.com/Nmna5nc.gif)

We can then see that when we scroll down, there are actually hidden API calls! Therefore, we can call the same API ourselves to ___simulate what the JavaScript is doing___ and get the dataset.

In [86]:
QUOTE_API = "http://spidyquotes.herokuapp.com/api/quotes"

Let's test with page 2. Note that the request method is `GET`, as shown in the `developer tools`.

In [87]:
r = requests.get(QUOTE_API, params={'page': 2})
r.status_code

200
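
The `params` argument asks `requests` to encode the dictionary into the URL's query string. As a rough illustration, the standard library can build the equivalent address by hand; the helper name `with_query` is hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

QUOTE_API = "http://spidyquotes.herokuapp.com/api/quotes"


def with_query(url, params):
    """Append URL-encoded query parameters to a URL that has no query string yet."""
    return f"{url}?{urlencode(params)}"


full_url = with_query(QUOTE_API, {"page": 2})
print(full_url)

# Going the other way: recover the parameters from a full URL,
# e.g. one copied from the network tab.
query = parse_qs(urlsplit(full_url).query)
print(query)
```

Reading the query string of a request seen in the network tab this way is a quick route to the `params` dictionary you need to reproduce it.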

In [91]:
quote_json = r.json()
quote_json.keys()

dict_keys(['has_next', 'page', 'quotes', 'tag', 'top_ten_tags'])

In [93]:
quote_json['quotes']

[{'author': {'goodreads_link': '/author/show/82952.Marilyn_Monroe',
   'name': 'Marilyn Monroe',
   'slug': 'Marilyn-Monroe'},
  'tags': ['friends',
   'heartbreak',
   'inspirational',
   'life',
   'love',
   'sisters'],
  'text': "“This life is what you make it. No matter what, you're going to mess up sometimes, it's a universal truth. But the good part is you get to decide how you're going to mess it up. Girls will be your friends - they'll act like it anyway. But just remember, some come, some go. The ones that stay with you through everything - they're your true best friends. Don't let go of them. Also remember, sisters make the best friends in the world. As for lovers, well, they'll come and go too. And baby, I hate to say it, most of them - actually pretty much all of them are going to break your heart, but you can't give up because if you give up, you'll never find your soulmate. You'll never find that half who makes you whole and that goes for everything. Just because you fai

Therefore, for each page, we can directly parse the resulting JSON instead of parsing the HTML manually!

In [94]:
def get_quote_api(page):
    # Pass the requested page through instead of a hard-coded value
    r = requests.get(QUOTE_API, params={'page': page})
    if r.status_code != 200:
        print('Request failed: %s' % r.text)
        return None
    quotes = r.json()['quotes']
    return pd.DataFrame([{'quote': q['text'], 'author': q['author']['name']} for q in quotes])

In [95]:
get_quote_api(2)

| | author | quote |
|---|---|---|
| 0 | Marilyn Monroe | “This life is what you make it. No matter what... |
| 1 | J.K. Rowling | “It takes a great deal of bravery to stand up ... |
| 2 | Albert Einstein | “If you can't explain it to a six year old, yo... |
| 3 | Bob Marley | “You may not be her first, her last, or her on... |
| 4 | Dr. Seuss | “I like nonsense, it wakes up the brain cells.... |
| 5 | Douglas Adams | “I may not have gone where I intended to go, b... |
| 6 | Elie Wiesel | “The opposite of love is not hate, it's indiff... |
| 7 | Friedrich Nietzsche | “It is not a lack of love, but a lack of frien... |
| 8 | Mark Twain | “Good friends, good books, and a sleepy consci... |
| 9 | Allen Saunders | “Life is what happens to us while we are makin... |


Done!
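
The JSON response also carries a `has_next` key, which suggests a way to collect every page without knowing the page count in advance. The sketch below assumes that `has_next` is a boolean that turns false on the last page and that numbering starts at 1. The names `flatten_quote` and `collect_all_quotes`, and the fake two-page data, are hypothetical.

```python
def flatten_quote(raw):
    """Keep the author name, text, and tags from one API quote record."""
    return {"author": raw["author"]["name"], "quote": raw["text"], "tags": raw["tags"]}


def collect_all_quotes(fetch_json, max_pages=100):
    """Request page 1, 2, ... until the response reports no further page."""
    rows = []
    for page in range(1, max_pages + 1):
        payload = fetch_json(page)
        rows.extend(flatten_quote(q) for q in payload["quotes"])
        if not payload.get("has_next"):  # assumption: falsy on the last page
            break
    return rows


# Fake two-page API so that the sketch runs offline.
FAKE_PAGES = {
    1: {"has_next": True, "page": 1,
        "quotes": [{"author": {"name": "Author A"}, "tags": ["x"], "text": "first"}]},
    2: {"has_next": False, "page": 2,
        "quotes": [{"author": {"name": "Author B"}, "tags": [], "text": "second"}]},
}

rows = collect_all_quotes(lambda page: FAKE_PAGES[page])
print(rows)
```

Against the real site, `fetch_json` could be something like `lambda page: requests.get(QUOTE_API, params={'page': page}).json()`, and the `max_pages` cap guards against an endless loop if the flag never turns false.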

---

#### Conclusion

As we can see, it is not that difficult to parse HTML pages. By parsing the pages manually, we gain more control over how the results are shaped. However, the anti-scraping design of some websites may create obstacles to retrieving valuable information automatically, much like the infinite scroll design above.

Note that the examples we are using here are relatively simple. There are cases where we cannot access the pagination/scroll with `requests` alone. In those cases, [Selenium](http://selenium-python.readthedocs.io/) will save our lives by ___simulating browsers___!

Some more tutorials/tools:

- https://scrapy.org/
- https://www.dataquest.io/blog/web-scraping-tutorial-python/
- https://www.quora.com/Python-programming-language-1/How-is-BeautifulSoup-different-from-Scrapy