### Objectives
The objectives of this notebook are as follows:

1) Understand how to extract data from the internet.
2) Build a Python script that extracts data at regular intervals (automated data scraping).

### About The Data
This notebook scrapes customer reviews of the iPhone XR from Amazon India (https://www.amazon.in/Apple-iPhone-XR-64GB-White/dp/B07JGXM9WN/ref=cm_cr_arp_d_bdcrb_top?ie=UTF8).

In [None]:

## import libraries
import pandas as pd
from bs4 import BeautifulSoup
import requests
import re
import os
from tqdm import tqdm

### Building the Data Scraper

The cell below downloads the page we are scraping and prints its HTML source. By running it, you can find which HTML tags wrap the data we are looking for. In this notebook, that data is:

  1) Review Date.

  2) Review Location.

  3) Review Score.

  4) Review Title.

  5) Review Text.

In [None]:
url = "https://www.amazon.in/Apple-iPhone-XR-64GB-White/product-reviews/B07JGXM9WN/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=1"

page = requests.get(url)
html = BeautifulSoup(page.text, 'html.parser')
print(html.prettify())

<!DOCTYPE doctype html>
<html class="a-no-js" data-19ax5a9jf="dingo" lang="en-in">
 <!-- sp:feature:head-start -->
 <head>
  <script>
   var aPageStart = (new Date()).getTime();
  </script>
  <meta charset="utf-8"/>
  <!-- sp:feature:cs-optimization -->
  <meta content="on" http-equiv="x-dns-prefetch-control"/>
  <link href="https://images-eu.ssl-images-amazon.com" rel="dns-prefetch"/>
  <link href="https://m.media-amazon.com" rel="dns-prefetch"/>
  <link href="https://completion.amazon.com" rel="dns-prefetch"/>
  <!-- sp:feature:aui-assets -->
  <link href="https://images-eu.ssl-images-amazon.com/images/I/11EIQ5IGqaL._RC|012LjolmrML.css,41QEXYSM7bL.css,21qPwhPKAAL.css,01Vctty9pOL.css,017DsKjNQJL.css,01l9iDpr-DL.css,41EWOOlBJ9L.css,11UoGyLuXoL.css,01ElnPiDxWL.css,11QxHU4QYaL.css,01Sp8sB1HiL.css,01IdKcBuAdL.css,01y-XAlI+2L.css,01evdoiemkL.css,01oDR3IULNL.css,31zpKVx8wkL.css,01XPHJk60-L.css,01Jb-VvL4uL.css,21LK7jaicML.css,11L58Qpo0GL.css,21kyTi1FabL.css,01ruG+gDPFL.css,01CFUgsA-YL.css,21

### What are we looking for?
The HTML above is overwhelming, and that's fine. Open [this link](https://www.amazon.in/Apple-iPhone-XR-64GB-White/product-reviews/B07JGXM9WN/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=1) and, for each field you need, copy one sample value from the page (for example, a review title) and search for that exact text in the HTML output. If the data is not text, right-click the element on the web page and click Inspect Element. The browser window then splits to show the page alongside its HTML source. Hovering over an element in the source highlights the matching part of the page, so you can use hover and highlight to locate what you need.

Once you know where the data is, note the HTML tag that contains it. Samples of what I'm looking for are shown in the cell below.

Note: If you have difficulty with Inspect Element, try Google Chrome or Mozilla Firefox on a computer.

In [None]:
# sample of data that will be scraped

# **Review Rating**

# <i class="a-icon a-icon-star a-star-1 review-rating" data-hook="review-star-rating">
# <span class="a-icon-alt">
# 1.0 out of 5 stars
# </span>
# </i>

# **Review Title**

# <a class="a-size-base a-link-normal review-title a-color-base review-title-content a-text-bold" data-hook="review-title" href="/gp/customer-reviews/RZQ9NHUADHI5W?ASIN=B07JGXM9WN">
# <span>
# Def a bad experience
# </span>
# </a>

# **Review Date and Location**

# <span class="a-size-base a-color-secondary review-date" data-hook="review-date">
# Reviewed in India on 21 April 2019
# </span>

# **Review Text**

# <span class="a-size-base review-text review-text-content" data-hook="review-body">
# <span>
# Went with the iPhone XR after over a month of consideration.Amazon was offering some really good discounts and felt the time was right to upgrade having used the iPhone 6 for four seamless years.
# <br/>
# The delivery and the product was perfect
# <br/>
# 3 days in and I started to experience issues
# <br/>
# A.The Phone microphone stopped working as a result I can not use any functionality of the phone no calls/no video recording/no voice recording or voice messages.
# <br/>
# B. The camera in jus the photo mode has also slowed down it takes over 15 secs to click and store a pic to gallery
# <br/>
# I have initiated a replacement option with Amazon within just 3 days of the purchase.in hindsight I now am starting to see why people shop in stores.
# </span>
# </span>
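
The `data-hook` attributes in the samples above are what make each field easy to target. As a minimal sketch, the snippet below parses a condensed copy of that sample markup (the `class` lists are shortened here) and pulls out the rating, title, and date line; the helper `text_of` is a hypothetical name used only for this illustration.

```python
from bs4 import BeautifulSoup

# Condensed copy of the sample markup shown above (class lists shortened).
sample_html = """
<div>
  <i class="a-icon a-icon-star a-star-1 review-rating" data-hook="review-star-rating">
    <span class="a-icon-alt">
      1.0 out of 5 stars
    </span>
  </i>
  <a class="a-size-base a-link-normal review-title" data-hook="review-title" href="/gp/customer-reviews/RZQ9NHUADHI5W?ASIN=B07JGXM9WN">
    <span>
      Def a bad experience
    </span>
  </a>
  <span class="a-size-base a-color-secondary review-date" data-hook="review-date">
    Reviewed in India on 21 April 2019
  </span>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")


def text_of(tag_name, hook):
    """Return the whitespace-normalized text of the first tag with this data-hook."""
    tag = soup.find(tag_name, attrs={"data-hook": hook})
    return " ".join(tag.get_text().split())


rating_text = text_of("i", "review-star-rating")
title_text = text_of("a", "review-title")
date_text = text_of("span", "review-date")

print(rating_text)
print(title_text)
print(date_text)
```

Working on a small, fixed fragment like this makes it possible to check the selectors before sending any requests to the live page.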


### One Page Data Scrape

In [None]:
def grab_review_rating(text):
    '''
    Return rating score as a float.
    '''
    return float(text.replace(' out of 5 stars', '').strip())


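def grab_review_location_and_date(text):
    '''
    Return the review location and date as a (location, date) tuple of strings.
    '''
    # Raw strings keep the regex escapes (\d, \w) from being read as string escapes.
    location = re.sub(r'Reviewed in | on \d{1,2} \w+ \d{4}', '', text).strip()
    date = re.findall(r'\d{1,2} \w+ \d{4}', text)[0]
    return location, date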


def scrape(url):
    '''
    Extract the review title, review date, review location,
    review rating, and review text from an Amazon India review page.

    Returns a pandas DataFrame with one row per review.
    '''
    page = requests.get(url)
    html = BeautifulSoup(page.text, "html.parser")
    review_titles = html.find_all('a', class_='review-title', attrs={'data-hook':'review-title'})
    review_dates_and_locations = html.find_all('span', class_='review-date', attrs={'data-hook':'review-date'})
    review_texts = html.find_all('span', class_='review-text', attrs={'data-hook':'review-body'})
    review_ratings = html.find_all('i', class_='review-rating', attrs={'data-hook':'review-star-rating'})
    data = []

    for title, date_and_location, text, rating in zip(
        review_titles, 
        review_dates_and_locations, 
        review_texts, 
        review_ratings
    ):
        title = ' '.join([i.strip() for i in title.get_text().split()])
        location, date = grab_review_location_and_date(date_and_location.get_text())
        rating = grab_review_rating(rating.get_text())
        text = ' '.join([i.strip() for i in text.get_text().split()])
        data.append([title, date, location, rating, text])

    df = pd.DataFrame(data, columns=['title', 'date', 'location', 'rating', 'text'])
    df['date'] = pd.to_datetime(df['date'])
    
    return df
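
The two parsing helpers can be checked on the sample strings from earlier without touching the network. The sketch below repeats equivalent definitions so that it runs on its own; the constant `DATE_PATTERN` is introduced here only for illustration.

```python
import re

# Day, month name, and four-digit year, e.g. "21 April 2019".
DATE_PATTERN = r"\d{1,2} \w+ \d{4}"


def grab_review_rating(text):
    """Turn '1.0 out of 5 stars' into the float 1.0."""
    return float(text.replace(" out of 5 stars", "").strip())


def grab_review_location_and_date(text):
    """Split 'Reviewed in India on 21 April 2019' into ('India', '21 April 2019')."""
    location = re.sub(r"Reviewed in | on " + DATE_PATTERN, "", text).strip()
    date = re.findall(DATE_PATTERN, text)[0]
    return location, date


rating = grab_review_rating("1.0 out of 5 stars")
location, date = grab_review_location_and_date("Reviewed in India on 21 April 2019")

print(rating, location, date)
```

Note that `re.findall(...)[0]` raises an `IndexError` when the text contains no date in this format, so a change in the page's wording would surface as an exception rather than as silently wrong data.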

In [None]:
# `url` from the earlier cell already ends with pageNumber=1, so pass it as-is.
df = scrape(url)
df

Unnamed: 0,title,date,location,rating,text
0,Used/defective device,2019-09-29,India,1.0,"Amazon is sending defective devices, please do..."
1,Money value product,2019-09-29,India,5.0,Amazing smartphone with best build auality and...
2,Good price at this cost,2019-10-02,India,4.0,Camera picture is yellowish.No fingerprints op...
3,BAD EXPERIENCE,2020-01-02,India,1.0,PRODUCT RECEIVED IN GOOD CONDITION BUT FACE RE...
4,Best for PUBG,2020-03-05,India,5.0,"Just go for it, try buying the 128GB or 256GB ..."
5,Product review,2019-09-25,India,5.0,Osome product but I suggest get iPhone 11 upda...
6,Charger,2020-09-06,India,1.0,"Its has not completed 1 yr yet however, the ch..."
7,Another brilliant product from Apple,2019-11-14,India,5.0,Amazing product as can be expected from Apple....
8,All time great,2019-10-29,India,5.0,Best ever apple phone for value for money. Not...
9,Earphone stopped working,2020-06-01,India,4.0,My headphone stopped working after 7 months ot...


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   title     10 non-null     object        
 1   date      10 non-null     datetime64[ns]
 2   location  10 non-null     object        
 3   rating    10 non-null     float64       
 4   text      10 non-null     object        
dtypes: datetime64[ns](1), float64(1), object(3)
memory usage: 528.0+ bytes


### Building A Multi Page Data Scraper
Now that we have a scraper that extracts data from a single page, extending it to multiple pages is easy. In most cases, the URLs of different pages differ by only a character or so. Compare the URLs below: only the `pageNumber` value at the end changes.


product review: page 1

https://www.amazon.in/Apple-iPhone-XR-64GB-White/product-reviews/B07JGXM9WN/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=1


product review: page 2

https://www.amazon.in/Apple-iPhone-XR-64GB-White/product-reviews/B07JGXM9WN/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=2


product review: page 3

https://www.amazon.in/Apple-iPhone-XR-64GB-White/product-reviews/B07JGXM9WN/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=3
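
Since only the `pageNumber` value changes, the page URLs can be generated rather than typed out. The following is a minimal sketch using the standard library; the names `BASE` and `review_page_url` are hypothetical and chosen only for this example.

```python
from urllib.parse import urlencode

# Everything before the query string, taken from the URLs listed above.
BASE = (
    "https://www.amazon.in/Apple-iPhone-XR-64GB-White/product-reviews/"
    "B07JGXM9WN/ref=cm_cr_arp_d_paging_btm_next_2"
)


def review_page_url(page_number):
    """Build the review-page URL for the given page number."""
    query = urlencode(
        {"ie": "UTF8", "reviewerType": "all_reviews", "pageNumber": page_number}
    )
    return f"{BASE}?{query}"


urls = [review_page_url(n) for n in range(1, 4)]
for page_url in urls:
    print(page_url)
```

Plain string concatenation, as used in the next cell, works equally well here; building the query from a dictionary just makes the varying parameter explicit.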

In [None]:
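# Loop over the review pages and collect one DataFrame per page.

url = "https://www.amazon.in/Apple-iPhone-XR-64GB-White/product-reviews/B07JGXM9WN/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber="

frames = []
for i in tqdm(range(1, 501)):
    frames.append(scrape(url + str(i)))

# DataFrame.append was removed in pandas 2.0, so concatenate once instead.
# Each page keeps its own 0-based index, as before.
df = pd.concat(frames)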
        
df

100%|██████████| 500/500 [03:34<00:00,  2.33it/s]


Unnamed: 0,title,date,location,rating,text
0,"Great phone, if you don't care about OLED scre...",2019-04-12,India,5.0,"Great Phone, I am happy with the phone so far...."
1,Just don’t buy it from amazon totally waste of...,2019-12-09,India,1.0,Your browser does not support HTML5 video. Iph...
2,Worth Every Penny,2019-04-04,India,5.0,When compared with XS only visible difference ...
3,Blindly go for it.,2019-10-03,India,5.0,Your browser does not support HTML5 video. I c...
4,Apple's Perfection at a premium but ....,2019-10-08,India,5.0,Upgraded from 5s to xr after 3 years of 5s use...
...,...,...,...,...,...
5,seriously apple,2019-09-20,India,3.0,720p display at 49?😂still fools buy
6,Good,2019-11-14,India,5.0,Good
7,Prank from apple..lolz,2018-10-27,India,1.0,HD lcd in 2019 @ 77k is a real joke.even the l...
8,Amazing phone,2019-10-27,India,5.0,Been using it from the past couple of months t...


In [None]:

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 890 entries, 0 to 9
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   title     890 non-null    object        
 1   date      890 non-null    datetime64[ns]
 2   location  890 non-null    object        
 3   rating    890 non-null    float64       
 4   text      890 non-null    object        
dtypes: datetime64[ns](1), float64(1), object(3)
memory usage: 41.7+ KB
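
The run above requested 500 pages but collected only 890 reviews, which suggests that most of the requested pages returned no reviews. A minimal sketch of one way to handle this is to stop at the first empty page and pause between requests. Everything here is an assumption for illustration: `collect_pages` and `fake_scrape_page` are hypothetical names, the fake page function stands in for a real scraper, and stopping at the first empty page may be wrong if an empty page means the request was blocked rather than that the reviews ran out.

```python
import time


def collect_pages(scrape_page, max_pages, pause_seconds=1.0, sleep=time.sleep):
    """Call scrape_page(1), scrape_page(2), ... and stop at the first empty page."""
    rows = []
    for page_number in range(1, max_pages + 1):
        page_rows = scrape_page(page_number)
        if not page_rows:
            # Assumption: an empty page means there is nothing further to fetch.
            break
        rows.extend(page_rows)
        sleep(pause_seconds)  # be gentle with the server between requests
    return rows


def fake_scrape_page(page_number):
    """Hypothetical stand-in for a real scraper: three pages of ten reviews each."""
    if page_number > 3:
        return []
    return [f"review {page_number}-{i}" for i in range(10)]


# The sleep function is replaced with a no-op so the example runs instantly.
collected = collect_pages(fake_scrape_page, max_pages=500, sleep=lambda seconds: None)
print(len(collected))
```

In this sketch each page is a list of rows; with a scraper that returns a DataFrame, the emptiness check would use `page_df.empty` instead of `not page_rows`.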


In [None]:
# Import Drive API and authenticate.
from google.colab import drive

# Mount your Drive to the Colab VM.
drive.mount('/gdrive')

# Write the DataFrame to a CSV file on the mounted Drive.
df.to_csv('/gdrive/My Drive/amazon_scraped_data.csv')

Drive already mounted at /gdrive; to attempt to forcibly remount, call drive.mount("/gdrive", force_remount=True).
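
The second objective, scraping at regular intervals, is not shown in the cells above. A minimal sketch of the idea is to call a scraping job repeatedly with a fixed wait between runs. The names `run_at_intervals` and `fake_job` are hypothetical; in practice the job would wrap a scraping call and a CSV write, and a scheduler such as cron may be a better fit than a long-running loop.

```python
import time
from datetime import datetime


def run_at_intervals(job, interval_seconds, runs, sleep=time.sleep):
    """Run job() `runs` times, waiting interval_seconds between consecutive runs."""
    results = []
    for run_index in range(runs):
        results.append((datetime.now(), job()))
        if run_index < runs - 1:
            sleep(interval_seconds)  # no wait is needed after the final run
    return results


calls = []


def fake_job():
    """Hypothetical stand-in for a scraping job: returns how many times it has run."""
    calls.append(len(calls) + 1)
    return calls[-1]


# Record the requested waits instead of actually sleeping for an hour.
waits = []
results = run_at_intervals(fake_job, interval_seconds=3600, runs=3, sleep=waits.append)

print([value for _, value in results])
print(waits)
```

Passing the sleep function as a parameter keeps the loop testable: the example records the waits it would have made rather than blocking the notebook.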
