# Assignment 02: Web Scraping
Your Name:  Christopher Truong
Your Class: INST 447  
Your Section: MWF 0101

In [3]:
import requests
from bs4 import BeautifulSoup
import sqlite3
import time

## Are Amazon reviews fake?
Should we trust Amazon reviews? 

That is a question that seems like we could answer if we only had enough data. So, let's collect some data. Your assignment is to scrape the reviews for five (5) similar products and to save those reviews into the SQLite database that I have provided. The database will store:

**Product Table:**  
product_id - int - Primary key (this auto-increments).  
amazon_identifier - text - The identifier for the product.  
product_name - text - The name of the product.  
product_price - text - The price of the product. (Text because sometimes it is a range)  
scraper_name - text - Your name...  

**Review Table:**  
review_id - int - Primary key (this auto-increments).  
review_date - text - The date of the review.  
review_title - text - The title of the review.  
number_of_stars - int - The number of stars the review gave.  
verified_purchase - bool - Was it a "Verified Purchase"?  
review_body - text - The text of the review.  
number_found_helpful - int - The number of people that found the review helpful.  
product_id - int - Foreign key for the product.  

Since you are using my database structure, this means that I can run your code and fill up a single database with all of our data.

**YOU DO NOT HAVE TO DO ANYTHING WITH amazon-page_dump.db IF YOU ARE GETTING RESPONSES FROM AMAZON**  
I have also given you a database named 'amazon-page_dump.db'. If you keep getting errors from Amazon, tell me what the error is, including the status code, the reason given, and what you think the error means. Then use the pages saved in the page_dump table to complete the assignment. You will have to figure out how best to adapt the code framework I gave you to pull out the pages to parse.

**Database: amazon-page_dump.db  
Table: page_dump**  
dump_id - integer - Primary key (this auto-increments)  
amazon_identifier - text - The Amazon product identifier  
page_url - text - The url for the page  
page_html - text - The html of the page  

This database is only for the case where you get nothing but errors from Amazon and cannot access the pages to scrape. **YOU DO NOT HAVE TO DO ANYTHING WITH amazon-page_dump.db IF YOU ARE GETTING RESPONSES FROM AMAZON**
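
If you do need the fallback, one way to adapt the framework is to read the saved HTML out of page_dump instead of calling `requests.get`. The sketch below is only an illustration: the helper name `iter_dumped_pages` is hypothetical, the demo rows are made up, and it assumes the table has exactly the columns listed above.

```python
import sqlite3

def iter_dumped_pages(dump_conn, amazon_identifier):
    """Yield (page_url, page_html) for every saved page of one product.

    Hypothetical helper; assumes the page_dump columns described above.
    """
    cursor = dump_conn.execute(
        'SELECT page_url, page_html FROM page_dump '
        'WHERE amazon_identifier = ? ORDER BY dump_id',
        (amazon_identifier,))
    for page_url, page_html in cursor:
        yield page_url, page_html

# Demo on an in-memory stand-in for amazon-page_dump.db (made-up rows).
demo_conn = sqlite3.connect(':memory:')
demo_conn.execute('''CREATE TABLE page_dump (
    dump_id INTEGER PRIMARY KEY AUTOINCREMENT,
    amazon_identifier TEXT, page_url TEXT, page_html TEXT)''')
demo_conn.executemany(
    'INSERT INTO page_dump (amazon_identifier, page_url, page_html) VALUES (?, ?, ?)',
    [('B06Y4KR6FS', 'https://example.com/page1', '<html>page 1</html>'),
     ('B06Y4KR6FS', 'https://example.com/page2', '<html>page 2</html>'),
     ('B01GXAT0BK', 'https://example.com/other', '<html>other</html>')])
dumped_pages = list(iter_dumped_pages(demo_conn, 'B06Y4KR6FS'))
print(len(dumped_pages))
```

With the real file you would connect with `sqlite3.connect('amazon-page_dump.db')` and pass each `page_html` string to `BeautifulSoup(page_html, 'lxml')` in place of `r.content`.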

In [4]:
# Create the amazon.db database, if it does not exist.
conn = sqlite3.connect('amazon.db')
c = conn.cursor()

In [5]:
# Create the products table 
c.execute('''
    CREATE TABLE IF NOT EXISTS products (
        product_id INTEGER PRIMARY KEY AUTOINCREMENT,
        amazon_identifier TEXT,
        product_name TEXT,
        product_price TEXT,
        scraper_name TEXT
        );
''')
# Create the reviews table
c.execute('''
    CREATE TABLE IF NOT EXISTS reviews (
        review_id INTEGER PRIMARY KEY AUTOINCREMENT,
        review_date TEXT, 
        review_title TEXT, 
        number_of_stars INTEGER, 
        verified_purchase BOOLEAN, 
        review_body TEXT, 
        number_found_helpful INTEGER,
        product_id INTEGER,
        FOREIGN KEY(product_id) REFERENCES products(product_id)
        )
''')

<sqlite3.Cursor at 0x27c7b99b180>

## Your Task

Find 5 similar products on Amazon that have more than 5 reviews each (my example uses hot sauce, so you can't). Your products must be PG.
Grab their product identifiers and replace mine in the list I have below.

The identifier is in the product URL and looks like these:

In [1]:
# product_lists = ['B06Y4KR6FS', 'B01GXAT0BK', 'B01LEX4UPW', 'B01LY0QOPZ', 'B00137QZQW']
product_lists = ['B06Y4KR6FS']

And the URLs for the reviews look like: https://www.amazon.com/product-reviews/B00AIR3Q38/?reviewerType=all_reviews&pageNumber=1

Notice that there is a spot in the URL where the Amazon identifier goes and that there is an argument that sets the number of the page that is accessed.
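
To make the two moving parts concrete, here is a minimal sketch that assembles such a URL by hand with the standard library. The helper name `build_review_url` and the constant `REVIEW_URL_TEMPLATE` are hypothetical and used only for illustration.

```python
from urllib.parse import urlencode

REVIEW_URL_TEMPLATE = 'https://www.amazon.com/product-reviews/%s/'

def build_review_url(amazon_identifier, page_number):
    """Return the reviews URL for one product and one page (hypothetical helper)."""
    # urlencode turns the dictionary into 'key=value&key=value'.
    query = urlencode({'reviewerType': 'all_reviews', 'pageNumber': page_number})
    return REVIEW_URL_TEMPLATE % amazon_identifier + '?' + query

example_url = build_review_url('B00AIR3Q38', 1)
print(example_url)
```

In the notebook itself this step is not needed, because `requests` builds the query string from a dictionary passed as its `params` argument.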

In [2]:
# URL template
url = 'https://www.amazon.com/product-reviews/%s/'
# Default URL arguments
url_args = {'reviewerType': 'all_reviews',
            'pageNumber': 1,
            'sortBy': 'recent'}
# Pretend to be a browser
headers = {'user-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:59.0) Gecko/20100101 Firefox/59.0'}

Now loop through the identifiers and save the specified data into the database.

I have provided a shell. Fill out the bits that have comments that are like:
>#TASK: Get (x) data

In [3]:
def get_reviews(review_soup):
    '''Will return a list of dictionaries with data for each review'''
    reviews = []
    review_tags = [item for item in review_soup.find_all('div', class_='a-section review') if "data-hook" in item.attrs]
    for review_tag in review_tags:
        review = {}
        # The alt text starts with the rating (e.g. '5.0 out of 5 stars'); store it as an int.
        review['stars'] = int(float(review_tag.find('span', class_='a-icon-alt').get_text().split(' ')[0]))
        review['date'] = review_tag.find('span', class_='a-size-base a-color-secondary review-date').get_text().replace('on ', '')
        review['title'] = review_tag.find('a', class_='a-size-base a-link-normal review-title a-color-base a-text-bold').get_text()
        review['body'] = review_tag.find('span', class_='a-size-base review-text').get_text()
        # The helpful-vote tag is missing when nobody has voted, so default to 0.
        review['found_helpful'] = 0
        helpful_tag = review_tag.find('span', class_='a-size-base a-color-secondary cr-vote-text')
        if helpful_tag is not None:
            # Assumes text like '12 people found this helpful' or 'One person found this helpful'.
            first_word = helpful_tag.get_text().strip().split(' ')[0].replace(',', '')
            review['found_helpful'] = int(first_word) if first_word.isdigit() else 1
        # The badge is missing on unverified reviews, so check for None before reading its text.
        verified_tag = review_tag.find('span', class_='a-size-mini a-color-state a-text-bold')
        review['verified'] = verified_tag is not None and verified_tag.get_text().strip() == 'Verified Purchase'
        reviews.append(review)
        print(review)
    
    return reviews
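
The review date has to end up as YYYY-MM-DD and number_found_helpful as an integer, so the scraped strings need some cleanup. The sketch below shows one way to do both with the standard library. The helper names `to_iso_date` and `parse_helpful_count` are hypothetical, and the input formats ('April 3, 2018', '12 people found this helpful', 'One person found this helpful') are assumptions about what the page contains, so check them against the text you actually scrape.

```python
import re
from datetime import datetime

def to_iso_date(date_text):
    """Convert text like 'April 3, 2018' to '2018-04-03' (assumed input format)."""
    return datetime.strptime(date_text.strip(), '%B %d, %Y').strftime('%Y-%m-%d')

def parse_helpful_count(vote_text):
    """Return the helpful-vote count from text like '1,204 people found this helpful'.

    Assumes a single vote is written as 'One person found this helpful'.
    """
    # A leading number, possibly with thousands separators.
    match = re.match(r'\s*(\d[\d,]*)', vote_text)
    if match:
        return int(match.group(1).replace(',', ''))
    # No leading number: either the spelled-out single vote or no votes at all.
    return 1 if vote_text.strip().lower().startswith('one') else 0

print(to_iso_date('April 3, 2018'))
print(parse_helpful_count('12 people found this helpful'))
```

Note that `%B` matches month names in the current locale, so this assumes an English locale.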

In [4]:
# Loop through the Amazon product identifiers
for amazon_identifier in product_lists:
    # Get the first page
    url_args['pageNumber'] = 1
    r = requests.get(url % amazon_identifier, params=url_args, headers=headers)
    print(r.url)
    
    # Check if the request was successful.
    if r.status_code == requests.codes.ok:
        page_soup = BeautifulSoup(r.content, 'lxml')

        
        # TASK: Get the number of the last page and save it into the variable max_page_number
        # note: make sure you handle the cases where there is a last button and when there isn't a last button
        max_page_number = 65
        
        # TASK: Get the product name and save it to the variable below
        product_name = "Apple iPhone SE 16 GB Factory Unlocked, Space Gray (Certified Refurbished)"
        
        # TASK: Get the product price and save to the variable below
        product_price = ''
        
        # TASK: Change this to your name
        scraper_name = 'Chris'
        
        # Try to insert it. If this does not work, then we won't have a product_id to associate the reviews with and shouldn't save them.
        try:
            c.execute('INSERT INTO products (amazon_identifier, product_name, product_price, scraper_name) VALUES (?, ?, ?, ?)', 
                      (amazon_identifier, product_name, product_price, scraper_name))
            # We have to commit the transaction, or it won't be saved.
            conn.commit()
            # Save the last primary key inserted as the product_id
            product_id = c.lastrowid

            # If there are more than 5 pages, then stop at 5. 
            if max_page_number > 5:
                max_page_number = 5
            
            # Loop through all of the pages (1 through max)
            for page in range(1, max_page_number + 1):
                # Set the page number for the url
                url_args['pageNumber'] = page
                # Get the next page of reviews
                r = requests.get(url % amazon_identifier, params=url_args, headers=headers)
                print(r.url)
                # Check if we got a response
                if r.status_code == requests.codes.ok:
                    review_soup = BeautifulSoup(r.content, 'lxml')
                    reviews = get_reviews(review_soup)
                    
                    # TASK: Loop through the reviews (replace the [] with your code)
                    for review in reviews:
                        
                        # TASK: Get the review date - convert it if necessary to YYYY-MM-DD
                        review_date = review['date']
                        
                        # TASK: Get the review title
                        review_title = review['title']
                        
                        # TASK: Get the number of stars - and make sure it is an int.
                        number_of_stars = int(review['stars'])
                        
                        # TASK: Get whether it is a verified purchase or not
                        verified_purchase = review['verified']
                            
                        # TASK: Get the actual text of the review
                        review_body = review['body']
                        
                        # TASK: Get the number of people that found the review helpful
                        number_found_helpful = review['found_helpful']
                        
                        # Try to insert the review into the database. If it doesn't work, then tell us why.
                        try:
                            c.execute('''INSERT INTO reviews 
                                            (product_id, review_date, review_title, number_of_stars, verified_purchase, review_body, number_found_helpful) 
                                         VALUES (?, ?, ?, ?, ?, ?, ?)''', 
                                      (product_id, review_date, review_title, number_of_stars, verified_purchase, review_body, number_found_helpful))
                            conn.commit()    
                        except sqlite3.DatabaseError as err:
                            print('SQL Error: {0}'.format(err))
                else: 
                    print('Error %s for %s on page %s' % (r.status_code, amazon_identifier, page))
                    
                # Slow things down.
                time.sleep(0.5)
        except sqlite3.DatabaseError as err:
            print('SQL Error: {0}'.format(err))
    else:
        print('Error %s for %s' % (r.status_code, amazon_identifier))
    # Slow things down.
    time.sleep(0.5)

NameError: name 'requests' is not defined
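
The first TASK in the loop above asks for the number of the last page, with or without a "last" button. One possible approach, sketched below, is to collect the text of every pagination button and take the largest number among them, falling back to 1 when there are no numbered buttons. The helper name `last_page_number` and the sample labels are hypothetical; how you collect the labels depends on the pagination markup Amazon serves, which is not shown here.

```python
def last_page_number(button_labels):
    """Return the largest page number among pagination button labels.

    Hypothetical helper: labels that are not plain numbers (such as
    'Previous', 'Next' or an ellipsis) are ignored, and a page with
    no numbered buttons counts as a single page.
    """
    numbers = []
    for text in button_labels:
        label = text.strip().replace(',', '')
        if label.isdecimal():
            numbers.append(int(label))
    return max(numbers, default=1)

print(last_page_number(['Previous', '1', '2', '3', '...', '65', 'Next']))
print(last_page_number([]))
```

Taking the maximum avoids relying on the position of any particular button, so the same code covers both the case with a last-page button and the case without one.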

Checking to see if there are products

In [None]:
c.execute('SELECT COUNT(*) FROM products;')
c.fetchone()

Checking to see if there are reviews

In [None]:
c.execute('SELECT COUNT(*) FROM reviews;')
c.fetchone()
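
Beyond the two totals, a per-product breakdown shows whether every product actually received reviews. A sketch, run here against a throwaway in-memory database with made-up rows and a trimmed-down version of the schema so that it does not touch amazon.db; the same SELECT should work on the cursor `c`, given the tables created earlier.

```python
import sqlite3

demo = sqlite3.connect(':memory:')
demo.executescript('''
    CREATE TABLE products (product_id INTEGER PRIMARY KEY AUTOINCREMENT,
                           amazon_identifier TEXT, product_name TEXT);
    CREATE TABLE reviews (review_id INTEGER PRIMARY KEY AUTOINCREMENT,
                          number_of_stars INTEGER, product_id INTEGER,
                          FOREIGN KEY(product_id) REFERENCES products(product_id));
    INSERT INTO products (amazon_identifier, product_name)
        VALUES ('AAA', 'Demo A'), ('BBB', 'Demo B');
    INSERT INTO reviews (number_of_stars, product_id) VALUES (5, 1), (3, 1);
''')
# LEFT JOIN keeps products that have no reviews, so they show up with a count of 0.
per_product = demo.execute('''
    SELECT p.amazon_identifier, COUNT(r.review_id)
    FROM products AS p
    LEFT JOIN reviews AS r ON r.product_id = p.product_id
    GROUP BY p.product_id
    ORDER BY p.product_id
''').fetchall()
print(per_product)
demo.close()
```

A product that appears with a count of 0 was inserted, but none of its review pages were saved.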

Closing the database connection.

In [None]:
conn.close()