# Scrape Amazon Reviews

Scraping Amazon dog collar reviews for a friend who would like an analysis of which factors might be driving negative reviews. Docker needs to be installed in order to run Splash, which is used to render the Amazon pages before scraping them.

### Imports

In [1]:
import requests
from requests import get
import pandas as pd
from bs4 import BeautifulSoup

### URL to Scrape

In [5]:
# stash away the url in a variable
url = "https://www.amazon.com/Lifetime-Walking-American-Pitbull-Shepherd/product-reviews/B087NQ9V3K/ref=cm_cr_arp_d_viewopt_srt?ie=UTF8&reviewerType=all_reviews&sortBy=recent&pageNumber=1"
reviewlist = []

### Docker workaround using splash

In [7]:
# use Splash to get past the 503 response you would get from trying to scrape without it
r = requests.get('http://localhost:8050/render.html', params={'url': url, 'wait': 2})
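
To see what is actually sent to Splash, it can help to build the request URL by hand. This is a minimal sketch, assuming Splash is listening on `localhost:8050` as in the cell above; `splash_request_url` is a hypothetical helper name, the product URL is a shortened placeholder, and the encoding shown is only roughly what `requests` produces from the `params` dictionary.

```python
from urllib.parse import urlencode

# Assumed local Splash instance, same endpoint as the cell above
SPLASH_ENDPOINT = "http://localhost:8050/render.html"

def splash_request_url(target_url, wait=2):
    """Return the full URL for a GET request to Splash's render.html endpoint."""
    # The target page is passed as the 'url' query parameter and is
    # percent-encoded so that its own '?' and '&' do not get mixed up
    # with the parameters meant for Splash
    query = urlencode({"url": target_url, "wait": wait})
    return f"{SPLASH_ENDPOINT}?{query}"

# Shortened placeholder URL, used for illustration only
target = "https://www.amazon.com/product-reviews/B087NQ9V3K?pageNumber=1"
example = splash_request_url(target)
print(example)
```

The point to notice is that the Amazon URL travels as a single encoded parameter, while `wait` tells Splash how many seconds to let the page render before returning the HTML.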


In [8]:
# make the review soup
soup = BeautifulSoup(r.text, 'html.parser')

In [9]:
# check out the title 
print(soup.title.text)

Amazon.com: Customer reviews: W/W Lifetime Gold Dog Chain Collar Walking Metal Chain Collar with Design Secure Buckle,18K Cuban Link Strong Heavy Duty Chew Proof for Medium Dogs(19MM, 15")


In [10]:
# find all review containers and return them in a list that we can loop through
reviews = soup.find_all('div', {'data-hook': 'review'})

In [19]:
# loop over the reviews to grab the product, title, rating, and review body
for item in reviews:
    # build a dictionary whose keys are the field names and whose values
    # are the data we are scraping
    review = {
        'product': soup.title.text.replace('Amazon.com: Customer reviews:', '').strip(),
        'title': item.find('a', {'data-hook': 'review-title'}).text.strip(),
        'rating': float(item.find('i', {'data-hook': 'review-star-rating'}).text.replace('out of 5 stars', '').strip()),
        'review_body': item.find('span', {'data-hook': 'review-body'}).text.strip(),
    }
    # keep each review instead of overwriting it on the next iteration
    reviewlist.append(review)

### Make everything into useful functions

In [2]:
def get_soup(url):
    r = requests.get('http://localhost:8050/render.html', params={'url': url, 'wait': 2})
    soup = BeautifulSoup(r.text, 'html.parser')
    return soup

In [3]:
reviewlist = []

In [4]:
def get_reviews(soup):
    reviews = soup.find_all('div', {'data-hook': 'review'})
    for item in reviews:
        try:
            review = {
                'product': soup.title.text.replace('Amazon.com: Customer reviews:', '').strip(),
                'title': item.find('a', {'data-hook': 'review-title'}).text.strip(),
                'rating': float(item.find('i', {'data-hook': 'review-star-rating'}).text.replace('out of 5 stars', '').strip()),
                'review_body': item.find('span', {'data-hook': 'review-body'}).text.strip(),
            }
        except (AttributeError, ValueError):
            # a missing element or an unparsable rating: skip this review
            # but keep processing the rest of the page
            continue
        reviewlist.append(review)
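
The rating arrives as text such as `4.0 out of 5 stars`, which the cell above turns into a number by stripping the suffix and calling `float`. A slightly more forgiving alternative is sketched below, assuming the first number in the string is always the star rating; `parse_rating` is a hypothetical helper and not part of the original notebook.

```python
import re

def parse_rating(text):
    """Return the first number in text as a float, or None if there is none."""
    # 'text or ""' lets the function accept None as well as an empty string
    match = re.search(r"\d+(?:\.\d+)?", text or "")
    return float(match.group()) if match else None

print(parse_rating("4.0 out of 5 stars"))
print(parse_rating(""))
```

Returning `None` rather than raising makes it possible to keep a review whose rating could not be read and to decide later whether to drop it.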

In [5]:
soup = get_soup('https://www.amazon.com/Lifetime-Walking-American-Pitbull-Shepherd/product-reviews/B087NQ9V3K/ref=cm_cr_arp_d_viewopt_srt?ie=UTF8&reviewerType=all_reviews&sortBy=recent&pageNumber=1')

In [6]:
get_reviews(soup)

In [7]:
print(len(reviewlist))

10


In [8]:
print(reviewlist[0])

{'product': 'W/W Lifetime Gold Dog Chain Collar Walking Metal Chain Collar with Design Secure Buckle,18K Cuban Link Strong Heavy Duty Chew Proof for Medium Dogs(19MM, 15")', 'title': 'Chain', 'rating': 4.0, 'review_body': 'Heavy chain, good quality. Rott puppy doesn’t mind. I’ll buy another when she outgrows the first one. Only complaint is the locking mechanism. It has two prongs you pinch to lock and one of them often comes out and the chain only stays on because it sort of jams. I have to re-click it fairly often. It has come completely off a couple times too so just be careful and keep your eye out. If your dog doesn’t come when called and the chain falls off outside you may be a little stressed 😅'}


### Deal with pages

In [9]:
for x in range(1, 10):
    soup = get_soup(f'https://www.amazon.com/Lifetime-Walking-American-Pitbull-Shepherd/product-reviews/B087NQ9V3K/ref=cm_cr_arp_d_viewopt_srt?ie=UTF8&reviewerType=all_reviews&sortBy=recent&pageNumber={x}')
    print(f'Getting page: {x}')
    get_reviews(soup)
    print(len(reviewlist))
    # stop once the "next page" button is disabled, which marks the last page
    if soup.find('li', {'class': 'a-disabled a-last'}):
        break
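
The loop above changes only the `pageNumber` query parameter from one request to the next. As a sketch of the same idea without retyping the whole URL, the parameter can be rewritten with `urllib.parse`; `with_page_number` is a hypothetical helper, and the base URL below is a shortened placeholder rather than the full URL used in the notebook.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def with_page_number(url, page):
    """Return url with its pageNumber query parameter set to page."""
    parts = urlsplit(url)
    # Keep every existing query parameter and overwrite (or add) pageNumber
    query = dict(parse_qsl(parts.query))
    query["pageNumber"] = str(page)
    return urlunsplit(parts._replace(query=urlencode(query)))

# Shortened placeholder URL, used for illustration only
base = "https://www.amazon.com/product-reviews/B087NQ9V3K?sortBy=recent&pageNumber=1"
for page in range(1, 4):
    print(with_page_number(base, page))
```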


Getting page: 1
20
Getting page: 2
30
Getting page: 3
40
Getting page: 4
50
Getting page: 5
60
Getting page: 6
70
Getting page: 7
80
Getting page: 8
90
Getting page: 9
100


### Clean up

In [12]:
# imports to help with cleaning
import unicodedata
import re
import json

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

import pandas as pd
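
The notebook imports these cleaning tools but does not show the cleaning step itself. Purely as an illustration of what `unicodedata` and `re` could be used for here, the sketch below lowercases a review, drops non-ASCII characters such as emoji and curly apostrophes, and removes most punctuation. `basic_clean` is a hypothetical function and an assumption about the intended cleaning, not the author's actual code.

```python
import re
import unicodedata

def basic_clean(text):
    """Lowercase text, drop non-ASCII characters, and strip most punctuation."""
    # Decompose accented characters so their ASCII base letters survive
    text = unicodedata.normalize("NFKD", text.lower())
    # Anything without an ASCII form (emoji, curly quotes) is discarded
    text = text.encode("ascii", "ignore").decode("ascii")
    # Keep only letters, digits, straight apostrophes, and whitespace
    text = re.sub(r"[^a-z0-9'\s]", "", text)
    # Collapse runs of whitespace left behind by the removals
    return re.sub(r"\s+", " ", text).strip()

print(basic_clean("Heavy chain, good quality. Rott puppy doesn’t mind 😅"))
```

Because the curly apostrophe has no ASCII equivalent, `doesn’t` becomes `doesnt`. If contractions matter for the analysis, the apostrophe would need to be replaced with a straight one before the ASCII step.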


### Make into dataframe

In [13]:
df = pd.DataFrame(reviewlist)

In [15]:
df.head()


| | product | title | rating | review_body |
|---|---|---|---|---|
| 0 | W/W Lifetime Gold Dog Chain Collar Walking Met... | So cute!!! | 5.0 | So cute! It’s heavier duty than I was anticipa... |
| 1 | W/W Lifetime Gold Dog Chain Collar Walking Met... | Nice quality | 5.0 | Has not changed colors yet. Very comfortable o... |
| 2 | W/W Lifetime Gold Dog Chain Collar Walking Met... | Worth every penny! | 5.0 | |
| 3 | W/W Lifetime Gold Dog Chain Collar Walking Met... | Make sure to measure your dogs neck so it’s a ... | 5.0 | Everything about the chain was good |
| 4 | W/W Lifetime Gold Dog Chain Collar Walking Met... | you can do better | 1.0 | i like the finished everything but this chain ... |


In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   product      100 non-null    object 
 1   title        100 non-null    object 
 2   rating       100 non-null    float64
 3   review_body  100 non-null    object 
dtypes: float64(1), object(3)
memory usage: 3.2+ KB


### Check Response
This was the response when requesting the page directly, before using Splash.

In [3]:
response = get(url)

In [4]:
response

<Response [503]>

Apparently Amazon does not like scrapers. A 503 is not the response we want, so a workaround is needed.