<h2>Sentiment Analysis</h2>

<h3>Part 1</h3>
We’ll use a book with plenty of reviews, say Learning Python by Mark Lutz, which can be
found at <a>https://www.amazon.com/Learning-Python-5th-Mark-Lutz/dp/1449355730/</a>. Note that this
product has an id of “1449355730,” and that even the URL <a>https://www.amazon.com/product-reviews/1449355730/</a>, without the product name, will work.<br>
If you explore the reviews page, you’ll note that the reviews are paginated. By
browsing to other pages and following along in your browser’s developer tools, you’ll see
that POST requests are being made (by JavaScript) to URLs looking like <a>https://www.amazon.com/ss/customer-reviews/ajax/reviews/get/ref=cm_cr_arp_d_paging_btm_2</a>, with the product id included in the form data, as well as some other form fields
that look relatively easy to spoof. The code below posts to the similar path /hz/reviews-render/ajax/reviews/get/.
<img src='images/c1.jpg'><br>

In [None]:
import requests
from bs4 import BeautifulSoup
review_url = 'https://www.amazon.com/hz/reviews-render/ajax/reviews/get/'
product_id = '1449355730'
session = requests.Session()
session.headers.update({
    # Present ourselves as a regular desktop browser
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                   '(KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36')
})
session.get('https://www.amazon.com/product-reviews/{}/'.format(product_id))
def get_reviews(product_id, page):
    data = {
    'sortBy':'',
    'reviewerType':'all_reviews',
    'formatType':'',
    'mediaType':'',
    'filterByStar':'',
    'pageNumber':page,
    'filterByLanguage':'',
    'filterByKeyword':'',
    'shouldAppend':'undefined',
    'deviceType':'desktop',
    'canShowIntHeader':'undefined',
    'reftag':'cm_cr_getr_d_paging_btm_prev_{}'.format(page),
    'pageSize':10,
    'asin':product_id,
    'scope':'reviewsAjax1'
    }
    r = session.post(review_url + 'ref=' + data['reftag'], data=data)
    return r.text
print(get_reviews(product_id, 1))

One field to note is “scope”. If you explore the reviews page in the browser, you’ll
see that the value of this field is in fact increased for each request, that is, “reviewsAjax1,”
“reviewsAjax2,” and so on. Finally, note that the POST request does not return a full HTML page, but some kind
of hand-encoded result that will be parsed (normally) by JavaScript: a series of JSON-encoded instructions separated by “&&&”. <img src='images/c2.jpg'>
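
The code above always sends “reviewsAjax1” as the scope. If you would rather mimic the increasing value seen in the browser, the form data can be built by a small helper. This is only a sketch: <code>build_form_data</code> and its <code>request_number</code> parameter are hypothetical names introduced here, and whether the server actually checks the scope value is not something the notebook establishes.

```python
def build_form_data(product_id, page, request_number=1):
    """Build the POST form fields for one reviews request.

    request_number is a hypothetical counter (1 for the first AJAX call,
    2 for the second, and so on) used to mimic the increasing "scope" value.
    """
    return {
        'sortBy': '',
        'reviewerType': 'all_reviews',
        'formatType': '',
        'mediaType': '',
        'filterByStar': '',
        'pageNumber': page,
        'filterByLanguage': '',
        'filterByKeyword': '',
        'shouldAppend': 'undefined',
        'deviceType': 'desktop',
        'canShowIntHeader': 'undefined',
        'reftag': 'cm_cr_getr_d_paging_btm_prev_{}'.format(page),
        'pageSize': 10,
        'asin': product_id,
        # Assumption: the scope suffix simply counts the requests made so far
        'scope': 'reviewsAjax{}'.format(request_number),
    }


form = build_form_data('1449355730', page=2, request_number=2)
print(form['scope'], form['pageNumber'])
```

The returned dictionary can be passed as the <code>data</code> argument of <code>session.post</code>, exactly like the hand-written dictionary in <code>get_reviews</code>.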

<h3>Part 2</h3>
Let’s adjust our code to parse the reviews in a structured format. We’ll loop through
all the instructions; convert them using the “json” module; check for “append” entries;
and then use Beautiful Soup to parse the HTML fragment and get the review id, rating, title, and text. We’ll also need a small regular expression to get out the rating, which is set
as a class with a value like “a-star-1” to “a-star-5”. We could use these as is, but simply
getting “1” to “5” might be easier to work with later on, so we already perform a bit of
cleaning here:

In [2]:
import requests
import json
import re
from bs4 import BeautifulSoup

review_url = 'https://www.amazon.com/hz/reviews-render/ajax/reviews/get/'
product_id = '1449355730'
session = requests.Session()
session.headers.update({
    # Present ourselves as a regular desktop browser
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                   '(KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36')
})
session.get('https://www.amazon.com/product-reviews/{}/'.format(product_id))


def parse_reviews(reply):
    reviews = []
    for fragment in reply.split('&&&'):
        if not fragment.strip():
            continue
        json_fragment = json.loads(fragment)
        if json_fragment[0] != 'append':
            continue
        html_soup = BeautifulSoup(json_fragment[2], 'html.parser')
        div = html_soup.find('div', class_='review')
        if not div:
            continue
        review_id = div.get('id')
        title = html_soup.find(class_='review-title').get_text(strip=True)
        review = html_soup.find(class_='review-text').get_text(strip=True)

        # Find and clean the rating:
        review_cls = ' '.join(html_soup.find(class_='review-rating').get('class'))
        # Raw string so that \d is passed to the regex engine unchanged
        rating = re.search(r'a-star-(\d+)', review_cls).group(1)
        reviews.append({
            'review_id': review_id,
            'rating': rating,
            'title': title,
            'review': review})
    return reviews

def get_reviews(product_id, page):
    data = {
    'sortBy':'',
    'reviewerType':'all_reviews',
    'formatType':'',
    'mediaType':'',
    'filterByStar':'',
    'pageNumber':page,
    'filterByLanguage':'',
    'filterByKeyword':'',
    'shouldAppend':'undefined',
    'deviceType':'desktop',
    'canShowIntHeader':'undefined',
    'reftag':'cm_cr_getr_d_paging_btm_prev_{}'.format(page),
    'pageSize':10,
    'asin':product_id,
    'scope':'reviewsAjax1'
    }
    r = session.post(review_url + 'ref=' + data['reftag'], data=data)
    reviews = parse_reviews(r.text)
    return reviews
reviews = get_reviews(product_id, 1)
for review in reviews:
    print(' -', review['rating'], review['title'], review['review_id'])


 - 5 Check out the Index R175ZF0CZU718Q
 - 5 Outstanding introduction to the Python language! R3GM7E8ZWH3TKS
 - 5 Great Service from the book seller R1XW780574N11X
 - 5 The book is long because it's thorough, and it's a quality book RUR7PRSM2BZC
 - 5 huge but well worth it R2KY62EVDQ9636
 - 5 A Mark Lutz Trifecta of Python Winners R16F2OE9239BC
 - 5 Comprehensive R36KASBOZZBOK9
 - 3 Too many words RSX92UJ62C6SF
 - 3 Half python-proselytizing, half real material R1F9L3BP2EW3VE
 - 4 A solid introduction to a fun language RSNM1C0OA575D
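
The rating cleanup inside <code>parse_reviews</code> can also be isolated and tried without any network access. The sketch below is a variation, not the notebook's exact code: <code>extract_rating</code> is a hypothetical helper, it returns an integer instead of a string, and it returns <code>None</code> rather than raising when no star class is present. The class lists passed in are made-up examples.

```python
import re

# Matches class names such as "a-star-1" ... "a-star-5"
STAR_CLASS = re.compile(r'a-star-(\d+)')


def extract_rating(class_list):
    """Return the star rating as an int, or None if no a-star-N class is found."""
    match = STAR_CLASS.search(' '.join(class_list))
    return int(match.group(1)) if match else None


# Made-up class lists, shaped like what Beautiful Soup's .get('class') returns
print(extract_rating(['a-icon', 'a-icon-star', 'a-star-4', 'review-rating']))  # 4
print(extract_rating(['a-icon', 'review-rating']))                             # None
```

Returning <code>None</code> keeps one oddly formatted review from aborting the whole page, at the cost of having to handle missing ratings later on.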


<h3>Part 3</h3>
The only thing left to do is to loop through all the pages, and store the
reviews in a database using the “dataset” library. Luckily, figuring out when to stop
looping is easy: once we do not get any reviews for a particular page, we can stop:

In [4]:
import requests
import json
import re
from bs4 import BeautifulSoup
import dataset

db = dataset.connect('sqlite:///reviews.db')

review_url = 'https://www.amazon.com/hz/reviews-render/ajax/reviews/get/'
product_id = '1449355730'
session = requests.Session()
session.headers.update({
    # Present ourselves as a regular desktop browser
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                   '(KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36')
})
session.get('https://www.amazon.com/product-reviews/{}/'.format(product_id))


def parse_reviews(reply):
    reviews = []
    for fragment in reply.split('&&&'):
        if not fragment.strip():
            continue
        json_fragment = json.loads(fragment)
        if json_fragment[0] != 'append':
            continue
        html_soup = BeautifulSoup(json_fragment[2], 'html.parser')
        div = html_soup.find('div', class_='review')
        if not div:
            continue
        review_id = div.get('id')
        title = html_soup.find(class_='review-title').get_text(strip=True)
        review = html_soup.find(class_='review-text').get_text(strip=True)

        # Find and clean the rating:
        review_cls = ' '.join(html_soup.find(class_='review-rating').get('class'))
        # Raw string so that \d is passed to the regex engine unchanged
        rating = re.search(r'a-star-(\d+)', review_cls).group(1)
        reviews.append({
            'review_id': review_id,
            'rating': rating,
            'title': title,
            'review': review})
    return reviews

def get_reviews(product_id, page):
    data = {
    'sortBy':'',
    'reviewerType':'all_reviews',
    'formatType':'',
    'mediaType':'',
    'filterByStar':'',
    'pageNumber':page,
    'filterByLanguage':'',
    'filterByKeyword':'',
    'shouldAppend':'undefined',
    'deviceType':'desktop',
    'canShowIntHeader':'undefined',
    'reftag':'cm_cr_getr_d_paging_btm_prev_{}'.format(page),
    'pageSize':10,
    'asin':product_id,
    'scope':'reviewsAjax1'
    }
    r = session.post(review_url + 'ref=' + data['reftag'], data=data)
    reviews = parse_reviews(r.text)
    return reviews
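page = 1
while True:
    print('Scraping: ', page)
    reviews = get_reviews(product_id, page)
    # An empty page means we have run out of reviews
    if not reviews:
        break
    for review in reviews:
        print(' -', review['rating'], review['title'], review['review_id'])
        # Insert the review, or update it if this review_id is already stored
        db['reviews'].upsert(review, ['review_id'])
    page += 1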


Scraping:  1
 - 5 Check out the Index R175ZF0CZU718Q
 - 5 Outstanding introduction to the Python language! R3GM7E8ZWH3TKS
 - 5 Great Service from the book seller R1XW780574N11X
 - 5 The book is long because it's thorough, and it's a quality book RUR7PRSM2BZC
 - 5 huge but well worth it R2KY62EVDQ9636
 - 5 A Mark Lutz Trifecta of Python Winners R16F2OE9239BC
 - 5 Comprehensive R36KASBOZZBOK9
 - 3 Too many words RSX92UJ62C6SF
 - 3 Half python-proselytizing, half real material R1F9L3BP2EW3VE
 - 4 A solid introduction to a fun language RSNM1C0OA575D
Scraping:  2


JSONDecodeError: Expecting value: line 1 column 1 (char 0)
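
The run above stops on page 2 with a <code>JSONDecodeError</code>, which means at least one fragment of the reply was not valid JSON. The notebook does not show the raw reply, so the cause is unknown; one plausible case (an assumption) is that the server answered with an HTML page instead of the usual instructions. A more tolerant way to split the reply is sketched below. <code>split_instructions</code> is a hypothetical helper, and the sample reply, including the <code>#cm_cr-review_list</code> selector, is made up for illustration.

```python
import json


def split_instructions(reply):
    """Split a reply on '&&&' and decode each fragment as JSON.

    Returns (instructions, skipped), where skipped counts the fragments
    that could not be decoded instead of letting them raise.
    """
    instructions = []
    skipped = 0
    for fragment in reply.split('&&&'):
        fragment = fragment.strip()
        if not fragment:
            continue
        try:
            instructions.append(json.loads(fragment))
        except json.JSONDecodeError:
            skipped += 1
    return instructions, skipped


# Made-up reply: two valid instructions followed by a non-JSON fragment
sample = '\n&&&\n'.join([
    json.dumps(['update', '#cm_cr-review_list', '']),
    json.dumps(['append', '#cm_cr-review_list', '<div id="R1" class="review"></div>']),
    '<html>Not JSON</html>',
])
instructions, skipped = split_instructions(sample)
print(len(instructions), skipped)  # 2 1
```

If <code>parse_reviews</code> iterated over these decoded instructions, a reply without any usable “append” entries would yield an empty list, so the <code>while</code> loop would stop cleanly instead of crashing. A non-zero <code>skipped</code> count is worth logging, since it may indicate that the scraper is being blocked rather than that the reviews have run out.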