# DS-SF-30 | Codealong 03: Databases, Scraping, and APIs; Part C - Scraping and Amazon Product Reviews

> ## This notebook demonstrates how to scrape data from websites (as an educational tool).  This should only be used as a last resort (i.e., when no alternative methods are available).  In all cases, be considerate when scraping data.

In this notebook, we would like to access Amazon's reviews of J.K. Rowling's The Casual Vacancy:
- https://www.amazon.com/Casual-Vacancy-J-K-Rowling/dp/0316228532
- (or with this shorter URL: https://www.amazon.com/dp/0316228532)

(We will use this dataset in our Natural Language Processing class)

Amazon's Product Advertising API used to provide programmatic access to Amazon's product reviews.  However, that functionality has been removed in recent years.  :(

Without a useful API, we will therefore scrape the reviews directly from Amazon's website.

As of December 12, 2016, 5,801 people had reviewed the book.  Amazon displays these reviews in chunks of 10 across 581 pages.  The URL for the first page of the list is:
- https://www.amazon.com/Casual-Vacancy-J-K-Rowling/product-reviews/0316228532/ref=cm_cr_dp_qt_see_all_top?ie=UTF8&reviewerType=avp_only_reviews&showViewpoints=1&sortBy=helpful

> Or this shorter URL: https://www.amazon.com/product-reviews/0316228532?ie=UTF8&reviewerType=all_reviews&showViewpoints=1&sortBy=recent&pageNumber=1
>
> All other pages can be accessed by changing `pageNumber` (ranging from 1 to 581)
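
The page count and the per-page URLs follow directly from the numbers above.  Here is a minimal sketch; the names `REVIEW_COUNT`, `REVIEWS_PER_PAGE`, `URL_TEMPLATE`, and `page_url` are ours for illustration and are not used elsewhere in this notebook.

```python
REVIEW_COUNT = 5801      # number of reviews quoted above
REVIEWS_PER_PAGE = 10    # Amazon lists 10 reviews per page

# Ceiling division with integers only: a partly filled last page still counts.
page_count = (REVIEW_COUNT + REVIEWS_PER_PAGE - 1) // REVIEWS_PER_PAGE

# The shorter URL from above, with pageNumber left as a placeholder.
URL_TEMPLATE = ('https://www.amazon.com/product-reviews/0316228532'
                '?ie=UTF8&reviewerType=all_reviews&showViewpoints=1'
                '&sortBy=recent&pageNumber={}')


def page_url(page_number):
    """Return the review-list URL for one page (hypothetical helper)."""
    return URL_TEMPLATE.format(page_number)


urls = [page_url(n) for n in range(1, page_count + 1)]
print(page_count)
print(urls[0])
```

The last page holds only the single remaining review, which is why 5,801 reviews need 581 pages rather than 580.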

We will scrape all 581 pages in this notebook and save them (this is raw data) for later processing in the next notebook.

In [1]:
import numpy as np
import time
import requests
import os
import gzip
import json

> ## After scraping the first page, we get the following:

(http://docs.python-requests.org/en/master/)

In [2]:
response = requests.get('https://www.amazon.com/dp/product-reviews/0316228532?ie=UTF8&reviewerType=all_reviews&showViewpoints=1&sortBy=recent&pageNumber=1')

In [3]:
response

<Response [503]>

In [4]:
response.status_code

503

> We expected the status code 200, i.e., "OK".  Instead we got 503, which stands for "Service Unavailable".
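
To look up what a numeric status code means, the standard library ships a table of reason phrases.  The sketch below is an aside rather than part of the scraping code, and `describe_status` is a hypothetical helper name; the import fallback only exists so that it runs on both Python 2 and Python 3.

```python
try:
    from http.client import responses   # Python 3
except ImportError:
    from httplib import responses        # Python 2


def describe_status(status_code):
    """Return the code followed by its standard reason phrase."""
    return '{} {}'.format(status_code, responses.get(status_code, 'Unknown'))


ok = describe_status(200)
unavailable = describe_status(503)
print(ok)
print(unavailable)
```

With a live response object, comparing `response.status_code` to `requests.codes.ok` performs the same check without hard-coding 200.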

In [5]:
response.headers['content-type']

'text/html'

In [6]:
response.encoding

'ISO-8859-1'

In [7]:
response.content

'<!--\n        To discuss automated access to Amazon data please contact api-services-support@amazon.com.\n        For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.com/ref=rm_5_sv, or our Product Advertising API at https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html/ref=rm_5_ac for advertising use cases.\n-->\n\n<!doctype html>\n<html>\n<head>\n  <meta charset="utf-8">\n  <meta http-equiv="x-ua-compatible" content="ie=edge">\n  <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">\n  <title>503 Service Unavailable Error</title>\n  <style>html, body {\n    padding: 0;\n    margin: 0\n  }\n\n  img {\n    border: 0\n  }\n\n  #a {\n    background: #232f3e;\n    padding: 11px 11px 11px 192px\n  }\n\n  #b {\n    position: absolute;\n    left: 22px;\n    top: 12px\n  }\n\n  #c {\n    position: relative;\n    max-width: 800px;\n    padding: 0 40px 0 0\n  }\n\n  #e, #f {\n    heigh

In [8]:
response.request.headers

{'Connection': 'keep-alive', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'User-Agent': 'python-requests/2.10.0'}

The requests library (i.e., our "browser") identified itself as "python-requests/2.10.0" (the user-agent), so Amazon detected that the request didn't come from a real user and blocked it.

(https://en.wikipedia.org/wiki/User_agent)

We can get around this issue by using a well-known user-agent.

> ## Take 2

In [9]:
response = requests.get('https://www.amazon.com/dp/product-reviews/0316228532?ie=UTF8&reviewerType=all_reviews&showViewpoints=1&sortBy=recent&pageNumber=1',
                    headers = {'User-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11) AppleWebKit/601.1.32 (KHTML, like Gecko) Version/8.1 Safari/601.1.32'})

In [10]:
response

<Response [200]>

In [11]:
response.status_code

200

> All good now.

In [12]:
response.headers['content-type']

'text/html;charset=UTF-8'

In [13]:
response.encoding

'UTF-8'

> The actual HTML page that would have been displayed in your browser:

In [14]:
response.content



> ## Putting all of this together

In [15]:
reviews = {}

In [16]:
def scrape_page(page_number):
    return requests.get('https://www.amazon.com/dp/product-reviews/0316228532',
                        headers = {'User-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11) AppleWebKit/601.1.32 (KHTML, like Gecko) Version/8.1 Safari/601.1.32'},
                        params = {'ie': 'UTF8',
                                  'reviewerType': 'all_reviews', 'showViewpoints': 1, 'sortBy': 'recent',
                                  'pageNumber': page_number})
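
To see what `scrape_page` actually puts on the wire, the same request can be built and inspected without sending it.  This is a sketch using the `Request`/`prepare()` objects from the requests library; `USER_AGENT`, `request`, and `prepared` are names introduced here for illustration, and page 3 is an arbitrary choice.

```python
import requests

USER_AGENT = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11) '
              'AppleWebKit/601.1.32 (KHTML, like Gecko) '
              'Version/8.1 Safari/601.1.32')

# Build the same request as scrape_page(3), but do not send it.
request = requests.Request(
    'GET',
    'https://www.amazon.com/dp/product-reviews/0316228532',
    headers={'User-agent': USER_AGENT},
    params={'ie': 'UTF8', 'reviewerType': 'all_reviews',
            'showViewpoints': 1, 'sortBy': 'recent', 'pageNumber': 3})
prepared = request.prepare()

print(prepared.url)                    # base URL plus the encoded query string
print(prepared.headers['User-agent'])  # the browser-like user-agent
```

Passing the query string as `params` instead of pasting it into the URL keeps the page number as the only thing that changes between calls.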

In [17]:
def scrape_reviews():
    for page_number in range(1, 582):
        if (page_number in reviews) and (reviews[page_number]['status_code'] == 200):
            continue

        page = scrape_page(page_number)

        print 'page {}: {}'.format(page_number, page.status_code)

        reviews[page_number] = {
            'status_code': page.status_code,
            'content': page.content,
        }

        # Wait a random number of seconds between page requests (Poisson-distributed, mean 15)
        time.sleep(np.random.poisson(15))
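
For a sense of scale, the waits alone add up to a long run.  The rough estimate below uses only the page count and the mean of the Poisson distribution; it ignores the time each request itself takes, and the variable names are ours.

```python
PAGE_COUNT = 581     # pages to scrape, as stated above
MEAN_WAIT = 15       # mean of np.random.poisson(15), in seconds

expected_seconds = PAGE_COUNT * MEAN_WAIT
expected_hours = expected_seconds / 3600.0

print('about {} seconds, or {:.1f} hours'.format(expected_seconds, expected_hours))
```

A full run therefore takes roughly two and a half hours, which is one reason `scrape_reviews()` skips pages that were already fetched successfully.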

In [18]:
scrape_reviews()

page 1: 200
page 2: 200
page 3: 200
page 4: 200
page 5: 200
page 6: 200
page 7: 200
page 8: 200
page 9: 200
page 10: 200
page 11: 200
page 12: 200
page 13: 200
page 14: 200
page 15: 200
page 16: 200
page 17: 200
page 18: 200
page 19: 200
page 20: 200
page 21: 200
page 22: 200
page 23: 200
page 24: 200
page 25: 200
page 26: 200
page 27: 200
page 28: 200
page 29: 200
page 30: 200
page 31: 200
page 32: 200
page 33: 200
page 34: 200
page 35: 200
page 36: 200
page 37: 200
page 38: 200
page 39: 200
page 40: 200
page 41: 200
page 42: 200
page 43: 200
page 44: 200
page 45: 200
page 46: 200
page 47: 200
page 48: 200
page 49: 200
page 50: 200
page 51: 200
page 52: 200
page 53: 200
page 54: 200
page 55: 200
page 56: 200
page 57: 200
page 58: 200
page 59: 200
page 60: 200
page 61: 200
page 62: 200
page 63: 200
page 64: 200
page 65: 200
page 66: 200
page 67: 200
page 68: 200
page 69: 200
page 70: 200
page 71: 200
page 72: 200
page 73: 200
page 74: 200
page 75: 200
page 76: 200
page 77: 200
page 78:

In [19]:
for page_number in reviews:
    if reviews[page_number]['status_code'] == 200:
        continue

    print '{}: {}'.format(page_number, reviews[page_number]['status_code'])

> All pages were returned with a 200/OK status code.  If needed, we could have re-run `scrape_reviews()` to scrape again any pages that came back with a different status code.  We are good to go here.  Let's save these pages.
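
The check above only covers pages that are already in `reviews`.  Below is a small sketch of a stricter check that also reports pages that were never requested; `pages_to_retry` and the toy `sample` dictionary are hypothetical and only mirror the structure used in this notebook.

```python
def pages_to_retry(reviews, last_page):
    """Return page numbers that are missing or did not come back as 200."""
    return [page_number
            for page_number in range(1, last_page + 1)
            if page_number not in reviews
            or reviews[page_number]['status_code'] != 200]


# Toy data: page 2 failed and page 4 was never requested.
sample = {
    1: {'status_code': 200, 'content': '...'},
    2: {'status_code': 503, 'content': '...'},
    3: {'status_code': 200, 'content': '...'},
}
retry = pages_to_retry(sample, 4)
print(retry)
```

An empty list from `pages_to_retry(reviews, 581)` would confirm that every page is present and OK before saving.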

> ## Saving the raw data (pages)

(https://docs.python.org/2/library/json.html)

(https://docs.python.org/2/library/gzip.html)

In [20]:
with gzip.open(os.path.join('..', 'datasets', 'dataset-03-reviews.json.gz'), 'wb') as f:
    f.write(json.dumps(reviews, ensure_ascii = False, indent = 4, sort_keys = True))
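
One thing to keep in mind when loading this file later: JSON object keys are always strings, so the integer page numbers come back as `'1'`, `'2'`, and so on.  The round trip below is a sketch on toy data written to a temporary file (the path and the `sample` dictionary are ours, not the real dataset), encoding explicitly to UTF-8 bytes so that it runs on both Python 2 and Python 3.

```python
import gzip
import json
import os
import tempfile

# Toy stand-in for the `reviews` dictionary; the temporary path is ours.
sample = {1: {'status_code': 200, 'content': '<html>page 1</html>'}}
path = os.path.join(tempfile.mkdtemp(), 'sample-reviews.json.gz')

# Write compressed JSON as UTF-8 bytes.
with gzip.open(path, 'wb') as f:
    f.write(json.dumps(sample, sort_keys=True).encode('utf-8'))

# Read it back.
with gzip.open(path, 'rb') as f:
    loaded = json.loads(f.read().decode('utf-8'))

# JSON object keys are always strings, so restore the integer page numbers.
restored = {int(key): value for key, value in loaded.items()}
print(sorted(loaded.keys()))
print(sorted(restored.keys()))
```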