# DS-SF-36 | 04 | Databases and Scraping | Codealong

## Part B | Scraping and Amazon Product Reviews

> ## This notebook demonstrates how to scrape data from websites (as an educational tool).  This should only be used as a last resort (i.e., when no alternative methods are available).  In all cases, be considerate when scraping data.

Amazon's Product Advertising API used to provide programmatic access to Amazon's product reviews.  However, that functionality has been removed in recent years.  :(

Without a useful API, we will therefore scrape the reviews directly from Amazon's website.

In this notebook, we will scrape Amazon's reviews for the following product:
- https://www.amazon.com/dp/B06XYN5HN7

As of July 5, 2017, this product has 17 reviews, displayed in chunks of 10 across 2 pages.  The URL for the first page of reviews is:
- https://www.amazon.com/product-reviews/B06XYN5HN7?ie=UTF8&reviewerType=all_reviews&showViewpoints=1&sortBy=recent&pageNumber=1

> All other pages can be accessed by changing `pageNumber` (ranging from 1 to 2).
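
As a rough illustration of how the page URLs follow from `pageNumber`, the sketch below builds them with the standard library only.  It assumes Python 3, and `BASE_URL`, `REVIEWS_PER_PAGE`, and `build_review_url` are illustrative names, not part of the original notebook.

```python
# Python 3 sketch; the names below are illustrative, not from the notebook.
import math
from urllib.parse import urlencode

BASE_URL = 'https://www.amazon.com/product-reviews/B06XYN5HN7'
REVIEWS_PER_PAGE = 10  # reviews are displayed in chunks of 10


def build_review_url(page_number):
    """Return the review-list URL for the given page number."""
    params = [
        ('ie', 'UTF8'),
        ('reviewerType', 'all_reviews'),
        ('showViewpoints', 1),
        ('sortBy', 'recent'),
        ('pageNumber', page_number),
    ]
    return BASE_URL + '?' + urlencode(params)


# 17 reviews in chunks of 10 span ceil(17 / 10) = 2 pages
page_count = int(math.ceil(17 / float(REVIEWS_PER_PAGE)))
urls = [build_review_url(n) for n in range(1, page_count + 1)]
```

Only the last query parameter differs between the two URLs, which is why the rest of the request can stay fixed.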

We will scrape both pages in this notebook and save them as raw data for processing in the next notebook.

In [1]:
import numpy as np
import time
import requests
import os
import gzip
import json

> ## After scraping the first page, we get the following:

(http://docs.python-requests.org/en/master/)

In [2]:
response = requests.get('https://www.amazon.com/dp/product-reviews/B06XYN5HN7?ie=UTF8&reviewerType=all_reviews&showViewpoints=1&sortBy=recent&pageNumber=1',
    headers = {'User-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12) AppleWebKit/602.1.43 (KHTML, like Gecko) Version/10.0 Safari/602.1.43'})

In [3]:
response

<Response [200]>

In [4]:
response.status_code

200

> We expect a status code of 200, i.e., "OK".
>
> However, if the request is sent without a custom `User-agent` header, the `requests` library (i.e., the "browser") identifies itself as "python-requests/2.10.0" (the user-agent).  Amazon can then infer that the request didn't come from a real user and block it.  The status code returned is then 503 for "Service Unavailable".
>
> (https://en.wikipedia.org/wiki/User_agent)
>
> We get around this issue by sending a well-known browser user-agent, as in the request above.
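
To make the two status codes discussed here concrete, the sketch below looks up their standard reason phrases.  It assumes Python 3 (the table lives in `http.client` there), and `describe_status` is a hypothetical helper, not something the notebook defines.

```python
# Python 3 sketch; describe_status is a hypothetical helper.
from http.client import responses  # maps status codes to reason phrases


def describe_status(status_code):
    """Return the status code followed by its standard reason phrase."""
    return '{} {}'.format(status_code, responses.get(status_code, 'Unknown'))


ok = describe_status(200)       # what we hope to get back
blocked = describe_status(503)  # what a blocked request would return
```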

In [5]:
response.headers['content-type']

'text/html;charset=UTF-8'

In [6]:
response.encoding

'UTF-8'
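
The encoding reported above comes from the `charset` parameter of the `content-type` header.  As a minimal sketch of that relationship (assuming Python 3 and using only the standard library rather than `requests`), the header can be split and the charset used to decode raw bytes.  `parse_content_type` and the sample bytes are made up for illustration.

```python
# Python 3 sketch; parse_content_type and sample_bytes are illustrative.
from email.message import Message


def parse_content_type(header_value):
    """Split a content-type header into (media type, charset)."""
    msg = Message()
    msg['content-type'] = header_value
    # get_content_charset() returns the charset in lower case
    return msg.get_content_type(), msg.get_content_charset()


media_type, charset = parse_content_type('text/html;charset=UTF-8')

# Made-up bytes standing in for a response body
sample_bytes = u'caf\u00e9'.encode('utf-8')
decoded = sample_bytes.decode(charset)
```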

> This is the raw HTML of the page, as it would have been sent to your browser:

In [7]:
response.content



> ## Putting all of this together

In [8]:
reviews = {}

In [9]:
def scrape_page(page_number):
    return requests.get('https://www.amazon.com/dp/product-reviews/B06XYN5HN7',
                        headers = {'User-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12) AppleWebKit/602.1.43 (KHTML, like Gecko) Version/10.0 Safari/602.1.43'},
                        params = {'ie': 'UTF8',
                                  'reviewerType': 'all_reviews', 'showViewpoints': 1, 'sortBy': 'recent',
                                  'pageNumber': page_number})

In [10]:
def scrape_reviews():
    for page_number in range(1, 3):
        if (page_number in reviews) and (reviews[page_number]['status_code'] == 200):
            continue

        page = scrape_page(page_number)

        print('page {}: {}'.format(page_number, page.status_code))

        reviews[page_number] = {
            'status_code': page.status_code,
            'content': page.content,
        }

        # Wait for a random interval between page requests (exponentially distributed, mean of 10 seconds)
        time.sleep(np.random.exponential(10))

In [11]:
scrape_reviews()

page 1: 200
page 2: 200
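
The pause between requests is drawn from an exponential distribution whose mean is 10 seconds.  The sketch below illustrates that behaviour with the standard library's `random.expovariate`, swapped in for `np.random.exponential` so the example has no dependencies.  Note that `expovariate` takes the rate (1 / mean) rather than the mean.  `MEAN_DELAY_SECONDS` and `random_delay` are illustrative names.

```python
# Sketch using the standard library in place of NumPy; names are illustrative.
import random

MEAN_DELAY_SECONDS = 10.0


def random_delay(rng, mean=MEAN_DELAY_SECONDS):
    """Draw a waiting time (in seconds) from an exponential distribution."""
    # expovariate expects the rate, i.e., 1 / mean
    return rng.expovariate(1.0 / mean)


rng = random.Random(0)  # seeded so the draws are reproducible
delays = [random_delay(rng) for _ in range(10000)]
average_delay = sum(delays) / len(delays)  # close to the mean of 10 seconds
```

Most draws are shorter than the mean, with occasional long pauses, so the request timing looks less regular than a fixed delay would.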


In [12]:
for page_number in reviews:
    if reviews[page_number]['status_code'] == 200:
        continue

    print('{}: {}'.format(page_number, reviews[page_number]['status_code']))

> All pages were returned with a 200/OK status code.  If needed, we could have re-run `scrape_reviews()` to re-scrape only the pages that had a different status code.  We are good to go here.  Let's save these pages.
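
To see why a re-run only touches the pages that failed, here is a small offline sketch of the same skip logic.  It makes no network requests and omits the delay.  `scrape_missing_pages` and `fake_fetch_page` are hypothetical stand-ins, with the fake fetcher arranged so that page 2 fails once before succeeding.

```python
# Offline sketch of the skip-if-already-OK logic; all names are hypothetical.
def scrape_missing_pages(reviews, fetch_page, page_numbers):
    """Fetch only the pages not yet stored with a 200 status code."""
    fetched = []
    for page_number in page_numbers:
        cached = reviews.get(page_number)
        if cached is not None and cached['status_code'] == 200:
            continue  # already scraped successfully

        status_code, content = fetch_page(page_number)
        reviews[page_number] = {'status_code': status_code, 'content': content}
        fetched.append(page_number)
    return fetched


attempts = {}


def fake_fetch_page(page_number):
    """Stand-in for a real request: page 2 fails on its first attempt."""
    attempts[page_number] = attempts.get(page_number, 0) + 1
    if page_number == 2 and attempts[page_number] == 1:
        return 503, ''
    return 200, '<html>page {}</html>'.format(page_number)


reviews_demo = {}
first_run = scrape_missing_pages(reviews_demo, fake_fetch_page, [1, 2])
second_run = scrape_missing_pages(reviews_demo, fake_fetch_page, [1, 2])
```

The first run requests both pages, while the second run requests page 2 only, because page 1 is already stored with a 200 status code.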

> ## Saving the raw data (pages)

- (https://docs.python.org/2/library/json.html)
- (https://docs.python.org/2/library/gzip.html)

In [13]:
with gzip.open(os.path.join('..', 'datasets', 'dataset-04-reviews.json.gz'), 'wb') as f:
    f.write(json.dumps(reviews, ensure_ascii = False, indent = 4, sort_keys = True))
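
For reference, a minimal round trip through gzip-compressed JSON might look like the sketch below.  It assumes Python 3, where the file is opened in text mode and the page content must be text rather than bytes (JSON cannot serialize bytes).  The sample data and temporary path are made up.  One thing to keep in mind when reading the file back: JSON object keys are always strings, so the integer page numbers come back as `'1'`, `'2'`, and so on.

```python
# Python 3 sketch; the sample data and temporary path are made up.
import gzip
import json
import os
import tempfile

pages = {1: {'status_code': 200, 'content': u'<html>caf\u00e9</html>'}}
path = os.path.join(tempfile.mkdtemp(), 'reviews-demo.json.gz')

# Write the dictionary as UTF-8 encoded, gzip-compressed JSON
with gzip.open(path, 'wt', encoding='utf-8') as f:
    json.dump(pages, f, ensure_ascii=False, indent=4, sort_keys=True)

# Read it back; the keys are now strings
with gzip.open(path, 'rt', encoding='utf-8') as f:
    loaded = json.load(f)

# Convert the keys back to integer page numbers
restored = {int(key): value for key, value in loaded.items()}
```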