<h1>
Scraping of the Amazon Reviews
</h1>



# What is web scraping?


- Web scraping is the process of gathering information from the Internet. Even copying and pasting the lyrics of your favorite song is a form of web scraping!

<h1>
    What is the project?
</h1>
<p>
    We scrape product reviews based on a keyword search. For example, if we search for <b>Phones under 20000</b>, we scrape the data (price, rating, description, reviews, etc.) of every product priced below 20000 that appears in the results, along with the reviews of those products. </p>
<h1>
    Why do we need this data?
</h1>
<p>
    Once we have the product data (price, rating, description, reviews, etc.), we can analyze it or use it to build ML models.
</p>
<br>


<h3>
    Flow of the project
</h3>

-> Getting the products list based on the search keyword<br>
-> Getting the product details with the product list we scraped <br>
-> We get the links of the reviews of each product from where we extract the reviews of the product

Before moving on to BeautifulSoup, let us look at the requests library.

We can install the requests library with ***$ python -m pip install requests***

- The requests module allows you to send HTTP requests using Python.

- The HTTP request returns a Response Object with all the response data (content, encoding, status, etc).

- After sending a request, we get a Response object. Its status_code property indicates whether the page was downloaded successfully.

- If status_code == 200, the page downloaded successfully; otherwise there was an issue downloading the page.
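
A minimal sketch of that status check follows. The helper name `fetch_page` and the 10-second timeout are assumptions made for illustration and are not part of the project code.

```python
import requests


def fetch_page(url, timeout=10):
    """Return the Response if the page downloaded successfully, else None."""
    response = requests.get(url, timeout=timeout)
    if response.status_code == 200:
        return response
    # Any other status code (404, 503, ...) means the download had a problem
    print("Request failed with status code", response.status_code)
    return None
```

Calling `fetch_page(url)` gives back an object whose `content` attribute can be handed to BeautifulSoup, or `None` when the request did not succeed.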




# What is BeautifulSoup?
 <p>
Beautiful Soup is a Python library for pulling data out of HTML and XML files. Beautiful Soup transforms a complex HTML document into a complex tree of Python objects.
</p>

To install BeautifulSoup we can use ***$ python -m pip install beautifulsoup4***

### Finding all instances of a tag at once: find_all()

- If we want to extract every instance of a tag, we can use the find_all method, which finds all the instances of that tag on a page.
- For example, to extract all the information in a table, we can use the find_all() function to collect every row.

### Finding a single instance of a tag: find()

- If you only want the first instance of a tag, you can use the find method, which returns a single Tag object (or None if nothing matches).
- In our case we get the product title with the find function.
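
The difference between the two methods can be seen on a small hand-written page. The HTML below is invented for illustration and is not Amazon markup; `html.parser` is Python's built-in parser.

```python
from bs4 import BeautifulSoup

# A small hand-written page used only for illustration
html = """
<html>
  <head><title>Sample product page</title></head>
  <body>
    <table>
      <tr><td>Phone A</td><td>15999</td></tr>
      <tr><td>Phone B</td><td>18499</td></tr>
    </table>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns only the first matching tag (or None if nothing matches)
title = soup.find("title").get_text()

# find_all() returns a list of every matching tag
rows = soup.find_all("tr")
table_data = [[cell.get_text() for cell in row.find_all("td")] for row in rows]

print(title)
print(table_data)
```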



---
#### Things to know

Tags have commonly used names that depend on their position in relation to other tags:

- child: a child is a tag inside another tag. For example, if a page's body contains two p tags, both are children of the body tag.
- parent: a parent is the tag that another tag is inside. The html tag is the parent of the body tag.
- sibling: a sibling is a tag nested inside the same parent as another tag. For example, head and body are siblings, since they're both inside html, and two p tags inside body are siblings of each other.

We use the parent tag in a few places in our code: we first fetch the parent element that contains the required content and then search for the child tags within it. The product description code later in this notebook works this way.
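
These relationships can be checked on a tiny page with one head, one body, and two p tags. The page below is a made-up example, not taken from the project.

```python
from bs4 import BeautifulSoup

# Made-up page: html contains head and body, body contains two p tags
html = """
<html>
<head><title>A simple page</title></head>
<body>
<p>First paragraph</p>
<p>Second paragraph</p>
</body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Fetch the parent first, then search for its children inside it
body = soup.find("body")
paragraphs = body.find_all("p", recursive=False)  # direct children of body only
first_p = paragraphs[0]

parent_name = first_p.parent.name                         # tag that contains the first p
sibling_text = first_p.find_next_sibling("p").get_text()  # the other p inside body
body_sibling = body.find_previous_sibling("head").name    # head sits next to body in html

print(parent_name, sibling_text, body_sibling)
```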

- If we have trouble fetching a page or get error 503 (Service Unavailable), we have to use different user agents.
- We can rotate the user agents when we hit error 503. A list of user agents is given in user_agents.csv.
- More user agents can be found at https://developers.whatismybrowser.com/useragents/explore/software_name/chrome/1
- If Amazon's bot detection is blocking us, we can create a list of proxies and randomly choose an IP address for each request.
- We can use the sleep function to slow down the scraping. In our case we used delays of 5 and 10 seconds.
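
The rotation and delay ideas above can be combined in one retry loop. The sketch below is illustrative only: the function name `fetch_with_rotation`, its parameters, and the 5 to 10 second delay range are assumptions, and `get` stands for any callable with the same interface as `requests.get`.

```python
import random
import time


def fetch_with_rotation(url, user_agents, get, max_attempts=3,
                        delay_range=(5, 10), sleep=time.sleep):
    """Retry a request with a randomly chosen User-Agent on each attempt.

    `get` is any callable like requests.get(url, headers=...); it is passed
    in so the sketch does not depend on a particular HTTP library.
    Returns the first response with status code 200, or None.
    """
    for _ in range(max_attempts):
        headers = {"User-Agent": random.choice(user_agents)}
        response = get(url, headers=headers)
        if response.status_code == 200:
            return response
        # Blocked (for example 503): wait a few seconds before trying again
        sleep(random.uniform(*delay_range))
    return None
```

In this notebook it could be called as `fetch_with_rotation(url, user_agents, requests.get)`.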

In [1]:
import requests
from bs4 import BeautifulSoup
from time import sleep

import pandas as pd # to write the data into CSV file using Dataframe

In [2]:
from google.colab import drive # in case you are working in Google Colab
drive.mount('/content/drive',force_remount=True)

Mounted at /content/drive


# What are user agents?

- A user agent is the software (usually a browser) that acts on behalf of the user when interacting with web content. The user agent string it sends helps the destination server identify which browser, type of device, and operating system is being used.
- Example: **Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:35.0) Gecko/20100101 Firefox/35.0**

### Why should you rotate User Agents?

If you are making a large number of requests while scraping a website, it is a good idea to randomize them. You can make each request look different by changing its exit IP address with rotating proxies and by sending a different set of HTTP headers, so that the requests appear to come from different computers and different browsers.
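
A rough sketch of picking a random user agent and proxy for each request is shown below. The proxy addresses are placeholders from a documentation-only IP range, and the names `USER_AGENTS`, `PROXIES`, and `random_request_options` are hypothetical; only the `headers` and `proxies` keyword arguments belong to `requests.get`.

```python
import random

# Two user agent strings for illustration; a real list would be longer
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:35.0) Gecko/20100101 Firefox/35.0",
]

# Placeholder proxy addresses (assumption): replace with proxies you control
PROXIES = ["http://203.0.113.10:8080", "http://203.0.113.11:8080"]


def random_request_options(user_agents, proxies):
    """Pick a random User-Agent header and a random proxy for one request."""
    proxy = random.choice(proxies)
    return {
        "headers": {"User-Agent": random.choice(user_agents)},
        "proxies": {"http": proxy, "https": proxy},
    }


options = random_request_options(USER_AGENTS, PROXIES)
print(options)
```

The resulting dictionary can be unpacked into a request, for example `requests.get(url, **options)`.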

In [3]:
# In our case we have taken around 20 user agents and we iterate through them.
user_agents = ['Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36',
'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.71 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.83 Safari/537.1']

In [10]:
cookie={} # insert request cookies within {}
# Get the response for the search query, trying each user agent until one succeeds
def getAmazonSearch(search_query):
    url="https://www.amazon.in/s?k="+search_query
    print(url)
    for agent in user_agents:
      header={'User-Agent': agent}
      page=requests.get(url,cookies=cookie,headers=header)
      if page.status_code==200:
          return page
    return None # every user agent was rejected

In [17]:
# ASIN stands for Amazon Standard Identification Number. Almost every product on Amazon has its own ASIN, a unique code used to identify it.
def Searchasin(asin):
    url="https://www.amazon.in/dp/"+asin
    print(url)
    for agent in user_agents:
      header={'User-Agent': agent}
      page=requests.get(url,cookies=cookie,headers=header)
      if page.status_code==200:
          return page
      else:
          continue
    return "Error"

In [15]:
# Function to get the reviews page for the link we pass in
def Searchreviews(review_link):
    url="https://www.amazon.in"+review_link
    print(url)
    # Try each user agent in turn until one request succeeds
    for agent in user_agents:
      header={'User-Agent': agent}
      page=requests.get(url,cookies=cookie,headers=header)
      if page.status_code==200:
          return page
    return "Error"

In [14]:
# Getting all the pages for the search we have done. In our case we searched for iphone all mobiles
def getnextpage(soup):
    # this will return the next page URL, or None when there is no next page
    pages = soup.find('span', {'class': 's-pagination-strip'})
    if pages is None: # no pagination strip, e.g. a single page of results or a blocked request
        return None
    if pages.find('span', {'class': 's-pagination-item s-pagination-next s-pagination-disabled '}):
        return None # the "next" button is disabled on the last page
    next_link = pages.find('a', {'class': 's-pagination-item s-pagination-next s-pagination-button s-pagination-separator'})
    if next_link is None:
        return None
    return 'https://www.amazon.in' + str(next_link['href'])


response=getAmazonSearch('iphone+all+mobiles')
soup=BeautifulSoup(response.content)
urls = []
while True:
    url = getnextpage(soup)
    if not url:
        break
    # try the user agents one by one and stop at the first successful request
    for agent in user_agents:
      header={'User-Agent': agent}
      response=requests.get(url,cookies=cookie,headers=header)
      if response.status_code==200:
          break
    soup=BeautifulSoup(response.content)
    urls.append(url)

https://www.amazon.in/s?k=iphone+all+mobiles


In [38]:
#urls of all the pages for the searched product
data = {'URLS': urls}
df_url = pd.DataFrame(data)
df_url.to_csv('URLs.csv')
urls

['https://www.amazon.in/s?k=iphone+all+mobiles&page=2&qid=1655127255&ref=sr_pg_1',
 'https://www.amazon.in/s?k=iphone+all+mobiles&page=3&qid=1655127256&ref=sr_pg_2',
 'https://www.amazon.in/s?k=iphone+all+mobiles&page=4&qid=1655127258&ref=sr_pg_3',
 'https://www.amazon.in/s?k=iphone+all+mobiles&page=5&qid=1655127259&ref=sr_pg_4',
 'https://www.amazon.in/s?k=iphone+all+mobiles&page=6&qid=1655127260&ref=sr_pg_5',
 'https://www.amazon.in/s?k=iphone+all+mobiles&page=7&qid=1655127261&ref=sr_pg_6',
 'https://www.amazon.in/s?k=iphone+all+mobiles&page=8&qid=1655127262&ref=sr_pg_7',
 'https://www.amazon.in/s?k=iphone+all+mobiles&page=9&qid=1655127263&ref=sr_pg_8',
 'https://www.amazon.in/s?k=iphone+all+mobiles&page=10&qid=1655127265&ref=sr_pg_9',
 'https://www.amazon.in/s?k=iphone+all+mobiles&page=11&qid=1655127266&ref=sr_pg_10',
 'https://www.amazon.in/s?k=iphone+all+mobiles&page=12&qid=1655127267&ref=sr_pg_11',
 'https://www.amazon.in/s?k=iphone+all+mobiles&page=13&qid=1655127268&ref=sr_pg_12

In [15]:
# Getting the ASINs of the products listed on each of the collected URLs
data_asin = []
for url in urls:
  page=requests.get(url,cookies=cookie,headers=header)
  soup=BeautifulSoup(page.content)
  for i in soup.findAll("div",{'class':"s-result-item s-asin sg-col-0-of-12 sg-col-16-of-20 sg-col s-widget-spacing-small sg-col-12-of-16"}):
    data_asin.append(i['data-asin'])

In [16]:
print(data_asin)
print(len(data_asin))

['B09ZBFH4P2', 'B08L5VJYV7', 'B09C877MGL', 'B08L89J9G3', 'B09LCXRXXM', 'B09FLZH6R4', 'B09G93RSZF', 'B09MKPX6RQ', 'B08L5WD9D6', 'B09XB8RRMV', 'B09G9BFKZN', 'B09WN464Q9', 'B09GFNZT24', 'B08444SXZ6', 'B08L5W16HX', 'B09LZ5G39D', 'B09JW2KPRM', 'B07WDKLDRX', 'B09WQY65HN', 'B09MQBS6VW', 'B08L89NSQK', 'B08444Z7QM', 'B08VB57558', 'B09P8RC8HG', 'B094P189P4', 'B09RF1NNQK', 'B09C6BPT7L', 'B09QS9CWLV', 'B09PRDHRKV', 'B08L5S1NT7', 'B09SH9D5PT', 'B08WX6CTP3', 'B09QS9CWLV', 'B08XGDN3TZ', 'B09G94T2NY', 'B09TWDYSWQ', 'B09TWGDY4W', 'B08XJG8MQM', 'B09RMQYHLH', 'B09TWHTBKQ', 'B09SH9D5PT', 'B096VDR283', 'B09QSB9RMH', 'B08ZMVPYH6', 'B09MKPX6RQ', 'B09CTZ1WFP', 'B08L8CPQH7', 'B09T2WPLS1', 'B09PRDHRKV', 'B08L5S1NT7', 'B09WQK2H8F', 'B09XB8GFBQ', 'B07WJWFL2V', 'B08444S68L', 'B08444Z7QM', 'B096VD6RQG', 'B09WRMNDSV', 'B09G93RSZF', 'B09ZBFDQ9F', 'B07WHNJ4ZK', 'B07WDKLRM4', 'B07CHB989V', 'B09MZ67H1H', 'B09WN464Q9', 'B01N2WJX87', 'B089MS8SKL', 'B09HJY4G4Z', 'B07WJV5KPL', 'B09K57BX2N', 'B08ZMVPYH6', 'B09S3X9VXX', 'B09R

In [20]:
# In this cell we scrape the product data and the link to the reviews page of each product
link=[]
Product_title= []
Product_cost = []
Product_rating = []
Product_descrtipion = []
updated_asin = []
Total_ratings = []
print(len(data_asin))
for i in range(len(data_asin)):
    response=Searchasin(data_asin[i])
    print(response)
    if response == "Error":
      continue
    soup=BeautifulSoup(response.content)
    
    title = soup.find("span",{"class":'a-size-large product-title-word-break'})
    if title == None:
      continue
    price = soup.find("span",{"class":'a-offscreen'})
    if price == None:
      continue
    description_response = soup.find("ul",{"class":'a-unordered-list a-vertical a-spacing-mini'}) # Here we first search for the parent tag
    if description_response == None:
      continue
    description = ""
    for desc in description_response.findAll("span",{"class":'a-list-item'}): # then we search the parent's content for the child tags
      description += desc.get_text().strip()+"\n"
    if description == "":
      continue
    rating = soup.find("span",{"class":'a-size-medium a-color-base'})
    if rating==None:
      continue
    total_rating_response = soup.find("div",{"class":'a-row a-spacing-medium averageStarRatingNumerical'})
    if total_rating_response == None:
      continue
    total_rating = total_rating_response.find("span",{"class":'a-size-base a-color-secondary'})
    if total_rating == None:
      continue
    h_ref = soup.find("a",{'data-hook':"see-all-reviews-link-foot"})
    if h_ref == None:
      continue
    link.append(h_ref['href'])
    updated_asin.append(data_asin[i])
    Product_title.append(title.get_text().strip())
    Product_cost.append(price.get_text().strip())
    Product_rating.append(rating.get_text().strip().split()[0])
    Total_ratings.append(total_rating.get_text().strip().split()[0])
    Product_descrtipion.append(description)
    sleep(10)
    

290
https://www.amazon.in/dp/B09ZBFH4P2
<Response [200]>
https://www.amazon.in/dp/B08L5VJYV7
<Response [200]>
https://www.amazon.in/dp/B09C877MGL
<Response [200]>
https://www.amazon.in/dp/B08L89J9G3
<Response [200]>
https://www.amazon.in/dp/B09LCXRXXM
<Response [200]>
https://www.amazon.in/dp/B09FLZH6R4
<Response [200]>
https://www.amazon.in/dp/B09G93RSZF
<Response [200]>
https://www.amazon.in/dp/B09MKPX6RQ
<Response [200]>
https://www.amazon.in/dp/B08L5WD9D6
<Response [200]>
https://www.amazon.in/dp/B09XB8RRMV
<Response [200]>
https://www.amazon.in/dp/B09G9BFKZN
<Response [200]>
https://www.amazon.in/dp/B09WN464Q9
<Response [200]>
https://www.amazon.in/dp/B09GFNZT24
<Response [200]>
https://www.amazon.in/dp/B08444SXZ6
<Response [200]>
https://www.amazon.in/dp/B08L5W16HX
<Response [200]>
https://www.amazon.in/dp/B09LZ5G39D
<Response [200]>
https://www.amazon.in/dp/B09JW2KPRM
<Response [200]>
https://www.amazon.in/dp/B07WDKLDRX
<Response [200]>
https://www.amazon.in/dp/B09WQY65HN
<Respo

In [45]:
# Out of 290 products we only got 66 because Amazon blocked us on the remaining pages. The user agents we used were not sufficient; more can be added
# from  https://developers.whatismybrowser.com/useragents/explore/software_name/chrome/1
print(len(updated_asin))
print(len(Product_descrtipion))
print(len(Product_title))
print(len(Product_rating))
print(len(Total_ratings))

66
66
66
66
66


In [61]:
# Storing the ASINs, product details and links in a dataframe.
data = {'ASIN':updated_asin,
        'Product Title': Product_title,
        'Product Description': Product_descrtipion,
        'Product Rating': Product_rating,
        'Total Ratings' : Total_ratings,
        'Links' : link
        }

 
df = pd.DataFrame(data)
 
print(df.head())

df.to_csv("Product_details.csv")

         ASIN                                      Product Title  \
0  B09ZBFH4P2  realme narzo 50 5G (Hyper Blue, 6GB RAM+128GB ...   
1  B08L5VJYV7                     Apple iPhone 12 (64GB) - White   
2  B09C877MGL  Redmi Note 10T 5G (Chromium White, 6GB RAM, 12...   
3  B08L89J9G3                     Apple iPhone 11 (64GB) - Green   
4  B09LCXRXXM  (Renewed) Realme GT 5G Master Edition (Cosmos ...   

                                 Product Description Product Rating  \
0  Mediatek Dimensity 810 5G powerful Gaming Proc...            3.9   
1                        Front Camera\nDual Camera\n            4.6   
2  Processor: Mediatek Dimensity 700 Octa-core; 7...              4   
3  6.1-inch (15.5 cm diagonal) Liquid Retina HD L...            4.6   
4  This Renewed product is tested to work and loo...            3.8   

  Total Ratings                                              Links  
0            24  /realme-Storage-Dimensity-Processor-Camera/pro...  
1         1,178  /New-Appl

In [31]:
reviews_description=[]
reviews_titles = []
asin_reviews = []
df = pd.read_csv('Product_details.csv')
asin = df['ASIN']
link = df['Links']
for j in range(len(link)):
    print("Processing reviews for Product",j)
    for k in range(10): # read at most 10 review pages per product
      response=Searchreviews(link[j]+'&pageNumber='+str(k))
      if response == "Error": # the request failed, so move on to the next product
        break
      soup=BeautifulSoup(response.content)
      # a disabled "next" button marks the last review page, where we stop
      if not soup.find("li",{'class':"a-disabled a-last"}):
        for i in soup.findAll("span",{'data-hook':"review-body"}):
            reviews_description.append(i.text)
            asin_reviews.append(asin[j])
        for i in soup.findAll("a",{'data-hook':"review-title"}):
            reviews_titles.append(i.text)
      else:
        break
      sleep(5)

Processing reviews for Product 0
https://www.amazon.in/realme-Storage-Dimensity-Processor-Camera/product-reviews/B09ZBFH4P2/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews&pageNumber=0
https://www.amazon.in/realme-Storage-Dimensity-Processor-Camera/product-reviews/B09ZBFH4P2/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews&pageNumber=1
https://www.amazon.in/realme-Storage-Dimensity-Processor-Camera/product-reviews/B09ZBFH4P2/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews&pageNumber=2
Processing reviews for Product 1
https://www.amazon.in/New-Apple-iPhone-12-64GB/product-reviews/B08L5VJYV7/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews&pageNumber=0
https://www.amazon.in/New-Apple-iPhone-12-64GB/product-reviews/B08L5VJYV7/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews&pageNumber=1
https://www.amazon.in/New-Apple-iPhone-12-64GB/product-reviews/B08L5VJYV7/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews&pageNumb

In [32]:
print(len(reviews_description))
print(len(reviews_titles))  
print(len(asin_reviews))

5778
5778
5778


In [36]:
# Storing the reviews in a dataframe along with the ASIN.
data = {'ASIN':asin_reviews,
        'Review Titles': reviews_titles,
        'Review Description': reviews_description
        }

 
# Create DataFrame
df_1 = pd.DataFrame(data)
 
# Print the output.
print(df_1.head())

df_1.to_csv("Product_reviews.csv")

         ASIN                                      Review Titles  \
0  B09ZBFH4P2  \nThis is really a all rounder package with re...   
1  B09ZBFH4P2                               \nScreen protector\n   
2  B09ZBFH4P2                                \nDecent 5G Phone\n   
3  B09ZBFH4P2                                 \nCheap and best\n   
4  B09ZBFH4P2                             \nRealme nazro 50 5G\n   

                                  Review Description  
0  \nThis phone is good for everyone ...Every use...  
1                   \nScreen protector is missing.\n  
2  \nThis is my first Realme phone, Decent 5G at ...  
3  \nEverything is ok. Except the phone is having...  
4  \nFantastic fabulous usage awesome sensor read...  


# Further Exploration

- We can get the related/recommended products for each product.
- We can get the Q&A for all the products and build a simple Q&A system.
- By writing filters, we can keep only the products that match what we searched for.