# <img style="float: left; padding-right: 100px; width: 300px" src="images/logo.png">AI4SG Bootcamp:



## Module 2D: Data Scraping


**Authors:** Faustine

---

### What is data scraping?
Data scraping is commonly defined as a technique in which a program extracts data from the output of another program or codebase. It automates parts of data aggregation, and its results serve a variety of uses.
The most common form of data scraping is *web scraping*: the process of importing information from a website into a spreadsheet or a local file saved on your computer. It’s one of the most efficient ways to get data from the web and, in some cases, to channel that data to another website.
Data scraping has a vast number of applications: it’s useful in just about any case where data needs to be moved from one place to another.

#### Uses of data scraping include:
<ol start="1">
<li>Research for web content/business intelligence </li>
<li>Pricing for travel booker sites/price comparison sites </li>
<li>Finding sales leads/conducting market research by crawling public data sources (e.g. Yell and Twitter) </li>
<li>Sending product data from an e-commerce site to another online vendor (e.g. Google Shopping)</li>
</ol>

In this lab, we'll scrape Goodreads' Best Books list:

https://www.goodreads.com/list/show/1.Best_Books_Ever?page=1 .

We'll walk through scraping the list pages for the book names and URLs.

Although many programming languages offer libraries for web information retrieval and analysis, 
we will be focusing on the Python data analysis ecosystem given its popularity and capabilities.

# Table of Contents 
<ol start="0">
<li> Learning Goals </li>
<li> Exploring the Web pages and downloading them</li>
<li> Parse the page, extract book urls </li>
<li> Parse a book page, extract book properties </li>
<li> Set up a pipeline for fetching and parsing</li>
</ol>

## Learning Goals

Understand the structure of a web page. Use Beautiful Soup to scrape content from these web pages.

*This lab corresponds to lectures 2, 3 and 4 and maps on to homework 1 and further.*

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)

In [None]:
import time, requests

## 1. Exploring the web pages and downloading them

We're going to look at the structure of Goodreads' best books list. We'll use the Developer Tools in Chrome; Safari and Firefox have similar tools available.

![](images/goodreads1.png)

To fetch this page we use the `requests` module. But are we allowed to do this? Let's check:

https://www.goodreads.com/robots.txt

Yes we are.
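
The same check can also be done in code. The sketch below uses the standard library's `urllib.robotparser`; the helper name `allowed_by_robots` and the sample rules are made up for illustration and are not the real Goodreads rules.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules for illustration only (NOT the real Goodreads file).
SAMPLE_RULES = """\
User-agent: *
Disallow: /private/
"""

list_allowed = allowed_by_robots(SAMPLE_RULES, "https://example.com/list/show/1")
private_allowed = allowed_by_robots(SAMPLE_RULES, "https://example.com/private/data")
print(list_allowed, private_allowed)  # True False
```

To check a live site instead of a string, `RobotFileParser` also offers `set_url(...)` followed by `read()`, which downloads the file before you call `can_fetch`.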

In [3]:
URLSTART="https://www.goodreads.com"
BESTBOOKS="/list/show/1.Best_Books_Ever?page="
url = URLSTART+BESTBOOKS+'1'
print(url)
page = requests.get(url)

https://www.goodreads.com/list/show/1.Best_Books_Ever?page=1


We can see properties of the page. Most relevant are `status_code` and `text`. The former tells us whether the web page was found and, if so, whether the request succeeded. (See lecture notes.)

In [4]:
page.status_code # 200 is good

200
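
If you want to see what a status code means without looking it up, the standard library knows the official phrases. This is a small sketch; the helper name `describe_status` is our own invention.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short human-readable description of an HTTP status code."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Codes that the standard library does not know about.
        phrase = "Unknown status"
    outcome = "success" if 200 <= code < 300 else "not a success"
    return f"{code} {phrase} ({outcome})"


print(describe_status(200))  # 200 OK (success)
print(describe_status(404))  # 404 Not Found (not a success)
```

With `requests` itself, `page.ok` gives a quick boolean check and `page.raise_for_status()` raises an exception for 4xx and 5xx responses.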

In [5]:
page.text[:5000]

'<!DOCTYPE html>\n<html class="desktop\n">\n<head>\n  <title>Best Books Ever (56398 books)</title>\n\n<meta content=\'54,898 books based on 190541 votes: The Hunger Games by Suzanne Collins, Harry Potter and the Order of the Phoenix by J.K. Rowling, To Kill a Mockingbird...\' name=\'description\'>\n<meta content=\'telephone=no\' name=\'format-detection\'>\n<link href=\'https://www.goodreads.com/list/show/1.Best_Books_Ever\' rel=\'canonical\'>\n\n\n\n    <script type="text/javascript"> var ue_t0=window.ue_t0||+new Date();\n </script>\n  <script type="text/javascript">\n    (function(e){var c=e;var a=c.ue||{};a.main_scope="mainscopecsm";a.q=[];a.t0=c.ue_t0||+new Date();a.d=g;function g(h){return +new Date()-(h?0:a.t0)}function d(h){return function(){a.q.push({n:h,a:arguments,t:a.d()})}}function b(m,l,h,j,i){var k={m:m,f:l,l:h,c:""+j,err:i,fromOnError:1,args:arguments};c.ueLogError(k);return false}b.skipTrace=1;e.onerror=b;function f(){c.uex("ld")}if(e.addEventListener){e.addEventListener

Let us write a loop to fetch 2 pages of "best-books" from Goodreads. Notice the use of a format string: `'%02d' % i` is an example of old-style Python format strings. We also sleep for 2 seconds between requests so that we don't overwhelm the server.

In [7]:
URLSTART="https://www.goodreads.com"
BESTBOOKS="/list/show/1.Best_Books_Ever?page="
for i in range(1,3):
    bookpage=str(i)
    stuff=requests.get(URLSTART+BESTBOOKS+bookpage)
    filetowrite="files/page"+ '%02d' % i + ".html"
    print("FTW", filetowrite)
    fd=open(filetowrite,"w")
    fd.write(stuff.text)
    fd.close()
    time.sleep(2)

FTW files/page01.html
FTW files/page02.html


## 2. Parse the page, extract book urls

Notice how we do file input-output, and use Beautiful Soup in the code below. The `with` construct ensures that the file being read is closed, something we do explicitly for the file being written. We look for the elements with class `bookTitle`, extract the URLs, and write them into a file.

In [8]:
from bs4 import BeautifulSoup

In [9]:
bookdict={}
for i in range(1,3):
    books=[]
    stri = '%02d' % i
    filetoread="files/page"+ stri + '.html'
    print("FTW", filetoread)
    with open(filetoread) as fdr:
        data = fdr.read()
    soup = BeautifulSoup(data, 'html.parser')
    for e in soup.select('.bookTitle'):
        books.append(e['href'])
    print(books[:10])
    bookdict[stri]=books
    fd=open("files/list"+stri+".txt","w")
    fd.write("\n".join(books))
    fd.close()

FTW files/page01.html
['/book/show/2767052-the-hunger-games', '/book/show/2.Harry_Potter_and_the_Order_of_the_Phoenix', '/book/show/2657.To_Kill_a_Mockingbird', '/book/show/1885.Pride_and_Prejudice', '/book/show/41865.Twilight', '/book/show/19063.The_Book_Thief', '/book/show/7613.Animal_Farm', '/book/show/11127.The_Chronicles_of_Narnia', '/book/show/30.J_R_R_Tolkien_4_Book_Boxed_Set', '/book/show/18405.Gone_with_the_Wind']
FTW files/page02.html
['/book/show/43763.Interview_with_the_Vampire', '/book/show/153747.Moby_Dick_or_the_Whale', '/book/show/5.Harry_Potter_and_the_Prisoner_of_Azkaban', '/book/show/4989.The_Red_Tent', '/book/show/37435.The_Secret_Life_of_Bees', '/book/show/7171637-clockwork-angel', '/book/show/2187.Middlesex', '/book/show/6.Harry_Potter_and_the_Goblet_of_Fire', '/book/show/16299.And_Then_There_Were_None', '/book/show/49552.The_Stranger']
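
To see what `soup.select` is doing without touching the network, here is a minimal sketch on a tiny hand-written HTML snippet. The snippet, its book names and its URLs are made up; only the class name `bookTitle` comes from the page we scraped.

```python
from bs4 import BeautifulSoup

# A made-up fragment that imitates the shape of the list page.
SAMPLE_HTML = """
<table>
  <tr><td><a class="bookTitle" href="/book/show/1.First_Book">First Book</a></td></tr>
  <tr><td><a class="authorName" href="/author/show/9.Someone">Someone</a></td></tr>
  <tr><td><a class="bookTitle" href="/book/show/2.Second_Book">Second Book</a></td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# "a.bookTitle" is a CSS selector: <a> tags whose class is bookTitle.
links = soup.select("a.bookTitle")
hrefs = [a["href"] for a in links]                 # attribute access
titles = [a.get_text(strip=True) for a in links]   # visible text

print(hrefs)   # ['/book/show/1.First_Book', '/book/show/2.Second_Book']
print(titles)  # ['First Book', 'Second Book']
```

Note that the author link is skipped because its class does not match the selector.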


Here is the first book on the second page:

In [10]:
bookdict['02'][0]

'/book/show/43763.Interview_with_the_Vampire'

Let's go look at the first URLs on both pages.

![](images/goodreads2.png)

## 3. Parse a book page, extract book properties

OK, so now let's dive in and get one of these files and parse it.

In [11]:
furl=URLSTART+bookdict['02'][0]
furl

'https://www.goodreads.com/book/show/43763.Interview_with_the_Vampire'

![](images/goodreads3.png)

In [12]:
fstuff=requests.get(furl)
print(fstuff.status_code)

200


In [13]:
d=BeautifulSoup(fstuff.text, 'html.parser')

In [14]:
d.select("meta[property='og:title']")[0]['content']

'Interview with the Vampire (The Vampire Chronicles, #1)'

Let's get everything we want...

In [17]:
d=BeautifulSoup(fstuff.text, 'html.parser')
print(
"title", d.select_one("meta[property='og:title']")['content'],"\n",
"isbn", d.select("meta[property='books:isbn']")[0]['content'],"\n",
"type", d.select("meta[property='og:type']")[0]['content'],"\n",
"author", d.select("meta[property='books:author']")[0]['content'],"\n",
#"average rating", d.select_one("span.average").text,"\n",
"ratingCount", d.select("meta[itemprop='ratingCount']")[0]["content"],"\n",
#"reviewCount", d.select_one("span.count")["title"]
)

title Interview with the Vampire (The Vampire Chronicles, #1) 
 isbn 9780345476876 
 type books.book 
 author https://www.goodreads.com/author/show/7577.Anne_Rice 
 ratingCount 442591 
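
One thing to watch out for: `d.select(...)[0]` raises an `IndexError` and `d.select_one(...)[...]` raises a `TypeError` when a page does not have the tag we ask for. A possible way to guard against that is sketched below. The helper name `meta_content`, the `"NA"` default and the sample HTML are our own assumptions, not part of the lab.

```python
from bs4 import BeautifulSoup


def meta_content(soup, selector, default="NA"):
    """Return the content attribute of the first match, or default if absent."""
    tag = soup.select_one(selector)
    if tag is None or not tag.has_attr("content"):
        return default
    return tag["content"]


# Made-up page head: it has a title and a type, but no ISBN.
SAMPLE_HEAD = """<html><head>
<meta property="og:title" content="A Made-Up Title">
<meta property="og:type" content="books.book">
</head></html>"""

sample = BeautifulSoup(SAMPLE_HEAD, "html.parser")
title = meta_content(sample, "meta[property='og:title']")
isbn = meta_content(sample, "meta[property='books:isbn']")
print(title, "|", isbn)  # A Made-Up Title | NA
```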



OK, now that we know what to do, let's wrap our fetching into a proper script. So that we don't overwhelm their servers, we will only fetch 5 from each page, but you get the idea...

We'll segue a bit to explore new-style format strings. See https://pyformat.info for more info.

In [18]:
"list{:0>2}.txt".format(3)

'list03.txt'

In [19]:
a = "4"
b = 4
class Four:
    def __str__(self):
        return "Fourteen"
c=Four()

In [20]:
"The hazy cat jumped over the {} and {} and {}".format(a, b, c)

'The hazy cat jumped over the 4 and 4 and Fourteen'
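
For comparison, here are the old-style and new-style spellings of the same zero-padded file name side by side, plus an f-string, which is a third option available in Python 3.6 and later. The variable names are ours.

```python
n = 3

old_style = "list%02d.txt" % n            # printf-style, as in the download loop
new_style = "list{:0>2}.txt".format(n)    # str.format: pad with 0, right-align, width 2
f_string = f"list{n:02d}.txt"             # f-string with the same format spec idea

print(old_style, new_style, f_string)     # list03.txt list03.txt list03.txt

# {} uses str() on its argument; {!r} uses repr() instead.
shown = "{} vs {!r}".format("4", "4")
print(shown)                               # 4 vs '4'
```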

## 4. Set up a pipeline for fetching and parsing

OK, let's get back to the fetching...

In [21]:
fetched=[]
for i in range(1,3):
    with open("files/list{:0>2}.txt".format(i)) as fd:
        counter=0
        for bookurl_line in fd:
            if counter > 4:
                break
            bookurl=bookurl_line.strip()
            stuff=requests.get(URLSTART+bookurl)
            filetowrite=bookurl.split('/')[-1]
            filetowrite="files/"+str(i)+"_"+filetowrite+".html"
            print("FTW", filetowrite)
            # Use a separate name so we don't shadow fd, the list file being read
            fdw=open(filetowrite,"w", encoding='utf-8')
            fdw.write(stuff.text)
            fdw.close()
            fetched.append(filetowrite)
            time.sleep(2)
            counter=counter+1
            
print(fetched)

FTW files/1_2767052-the-hunger-games.html
FTW files/1_2.Harry_Potter_and_the_Order_of_the_Phoenix.html
FTW files/1_2657.To_Kill_a_Mockingbird.html
FTW files/1_1885.Pride_and_Prejudice.html
FTW files/1_41865.Twilight.html
FTW files/2_43763.Interview_with_the_Vampire.html
FTW files/2_153747.Moby_Dick_or_the_Whale.html
FTW files/2_5.Harry_Potter_and_the_Prisoner_of_Azkaban.html
FTW files/2_4989.The_Red_Tent.html
FTW files/2_37435.The_Secret_Life_of_Bees.html
['files/1_2767052-the-hunger-games.html', 'files/1_2.Harry_Potter_and_the_Order_of_the_Phoenix.html', 'files/1_2657.To_Kill_a_Mockingbird.html', 'files/1_1885.Pride_and_Prejudice.html', 'files/1_41865.Twilight.html', 'files/2_43763.Interview_with_the_Vampire.html', 'files/2_153747.Moby_Dick_or_the_Whale.html', 'files/2_5.Harry_Potter_and_the_Prisoner_of_Azkaban.html', 'files/2_4989.The_Red_Tent.html', 'files/2_37435.The_Secret_Life_of_Bees.html']
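
When you rerun a notebook like this, it is kinder to the server to skip pages you already have on disk. Below is a minimal sketch of that idea. The function `fetch_if_missing` and its `fetch` parameter are hypothetical; in the lab you could pass something like `lambda u: requests.get(u).text`. Here we use a fake fetcher so the example runs offline.

```python
import os
import tempfile
import time


def fetch_if_missing(url, path, fetch, delay=2.0):
    """Download url to path unless the file exists. Return True if we downloaded."""
    if os.path.exists(path):
        return False
    text = fetch(url)
    with open(path, "w", encoding="utf-8") as out:
        out.write(text)
    time.sleep(delay)  # pause only after a real download
    return True


calls = []


def fake_fetch(url):
    """Stand-in for a real HTTP request; records which URLs were asked for."""
    calls.append(url)
    return "<html>stub page</html>"


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "page01.html")
    first = fetch_if_missing("https://example.com/page1", target, fake_fetch, delay=0)
    second = fetch_if_missing("https://example.com/page1", target, fake_fetch, delay=0)

print(first, second, len(calls))  # True False 1
```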


OK, we are off to parse each one of the HTML pages we fetched. We have provided the skeleton of the code and the code to parse the year, since it is a bit more complex (see the difference in the screenshots above).

In [22]:
import re
yearre = r'\d{4}'
def get_year(d):
    if d.select_one("nobr.greyText"):
        return d.select_one("nobr.greyText").text.strip().split()[-1][:-1]
    else:
        thetext=d.select("div#details div.row")[1].text.strip()
        rowmatch=re.findall(yearre, thetext)
        if len(rowmatch) > 0:
            rowtext=rowmatch[0].strip()
        else:
            rowtext="NA"
        return rowtext
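
A note on the regular expression: `\d{4}` matches any run of four digits, including the first four digits of a longer number. If that turns out to be a problem, one possible tightening is sketched below. The pattern, the helper name `first_year` and the sample strings are our own and are not taken from Goodreads.

```python
import re

# Four-digit years from 1000 to 2099, standing alone as a whole word.
YEAR_RE = re.compile(r"\b(?:1\d{3}|20\d{2})\b")


def first_year(text, default="NA"):
    """Return the first standalone year found in text, or default."""
    match = YEAR_RE.search(text)
    return match.group(0) if match else default


published = first_year("Published March 1st 2004 by Example Press")
no_year = first_year("ISBN 9780345476876")
loose = re.findall(r"\d{4}", "ISBN 9780345476876")

print(published)  # 2004
print(no_year)    # NA
print(loose)      # ['9780', '3454', '7687']
```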

<div class="exercise"><b>Exercise</b></div>

Your job is to fill in the code to get the genres.

In [None]:
def get_genres(d):
    # your code here


In [None]:

listofdicts=[]
for filetoread in fetched:
    print(filetoread)
    td={}
    with open(filetoread, encoding='utf-8') as fd:  # same encoding we used when writing
        datext = fd.read()
    d=BeautifulSoup(datext, 'html.parser')
    td['title']=d.select_one("meta[property='og:title']")['content']
    td['isbn']=d.select_one("meta[property='books:isbn']")['content']
    td['booktype']=d.select_one("meta[property='og:type']")['content']
    td['author']=d.select_one("meta[property='books:author']")['content']
    td['rating']=d.select_one("span.average").text
    td['ratingCount']=d.select_one("meta[itemprop='ratingCount']")["content"]
    td['reviewCount']=d.select_one("span.count")["title"]
    td['year'] = get_year(d)
    td['file']=filetoread
    glist = get_genres(d)
    td['genres']="|".join(glist)
    listofdicts.append(td)

In [None]:
listofdicts[0]

Finally, let's write all this stuff into a CSV file, which we will use to do analysis.

In [None]:
df = pd.DataFrame.from_records(listofdicts)
df.head()

In [None]:
df.to_csv("files/meta.csv", index=False, header=True)
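
A small caution for when you read the CSV back in: by default pandas guesses column types, so an ISBN with a leading zero becomes a number and the string `"NA"` becomes a missing value. The sketch below shows one way around this, using made-up records and an in-memory buffer instead of `files/meta.csv`.

```python
import io

import pandas as pd

# Made-up records that imitate the shape of listofdicts.
records = [
    {"title": "A Made-Up Title", "isbn": "0123456789", "year": "2004"},
    {"title": "Another Title", "isbn": "9780000000002", "year": "NA"},
]

buffer = io.StringIO()
pd.DataFrame.from_records(records).to_csv(buffer, index=False, header=True)
csv_text = buffer.getvalue()

# Default parsing: isbn is read as an integer, "NA" as a missing value.
as_numbers = pd.read_csv(io.StringIO(csv_text))

# Keep every column as text and leave "NA" alone.
as_text = pd.read_csv(io.StringIO(csv_text), dtype=str, keep_default_na=False)

print(as_numbers["isbn"].tolist())  # [123456789, 9780000000002]
print(as_text["isbn"].tolist())     # ['0123456789', '9780000000002']
print(as_text["year"].tolist())     # ['2004', 'NA']
```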

### Note

We have now used Beautiful Soup to scrape content from the Goodreads website.

Whether or not you intend to use data scraping in your work, it’s advisable to educate yourself on the subject, as it is likely to become even more important in the next few years.

There are now AI-based data scraping tools on the market that use machine learning to keep getting better at recognising inputs that traditionally only humans have been able to interpret, such as images.