## Web Scraping Amazon Best Sellers using BeautifulSoup

Web scraping is a valuable skill to have in your data science toolkit. It can be used to collect data for analysis or for training machine learning models. In this tutorial, we will scrape the books listed on Amazon's Best Sellers page using the BeautifulSoup library.

We will also perform some basic data preprocessing, which is usually the first step after scraping.




In [1]:
## load libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import time
from datetime import datetime
import matplotlib.dates as mdates
import matplotlib.ticker as ticker
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests


In [13]:
## get the following data for the amazon best sellers (books)

## title, author, rating, number of reviews, price, link to the book's page, date of scraping

url = 'https://www.amazon.in/gp/bestsellers/books/ref=zg_bs_pg_1?ie=UTF8&pg=1'

## get the html
html = requests.get(url)

## parse the html
soup = BeautifulSoup(html.content, 'html.parser')

## create a list of all the books
books = soup.find_all('div', id='gridItemRoot')


We will be scraping the following information from the Amazon Best Sellers page:

>- Title
>- Author
>- Rating
>- Number of reviews
>- Price
>- Link to the book's page

In [124]:
## extract each field from the first book in the list
title = books[0].find('div', class_="_cDEzb_p13n-sc-css-line-clamp-1_1Fn1y").text
author = books[0].find('a', class_="a-size-small a-link-child").find('div', class_="_cDEzb_p13n-sc-css-line-clamp-1_1Fn1y").text
rating = books[0].find('span', class_="a-icon-alt").text
reviews = books[0].find('span', class_="a-size-small").text.replace(',', '')
price = books[0].find('span', attrs={'class':'p13n-sc-price'}).text[1:] ## remove the rupee symbol
## get book link
link_to_book = books[0].find('a', class_="a-link-normal")['href']

## create the absolute link
base_url = 'https://www.amazon.in'
book_link = base_url + link_to_book

date = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
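
String concatenation works for building the absolute link as long as every `href` is relative and starts with a slash. As a minimal sketch of a more forgiving alternative, the standard library's `urljoin` can resolve the link instead. The `relative_href` value below is a hypothetical placeholder, not a real product path.

```python
from urllib.parse import urljoin

base_url = 'https://www.amazon.in'

# hypothetical relative href, shaped like the ones found on the page
relative_href = '/Example-Book-Title/dp/0000000000/ref=zg_bs_books_1'

# urljoin resolves the href against the base URL and
# leaves an href that is already absolute unchanged
book_link = urljoin(base_url, relative_href)
print(book_link)
```

With this approach the same line keeps working if some items on the page carry a full URL rather than a relative path.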

In [126]:
## create a dataframe

## dictionary
df_dict = {'title':title, 'author':author, 'rating':rating, 'reviews':reviews, 'price':price, 'book_link': book_link, 'date':date}


df=pd.DataFrame(df_dict, index=[0])
df

| | title | author | rating | reviews | price | book_link | date |
|---|---|---|---|---|---|---|---|
| 0 | Atomic Habits: the life-changing million-copy ... | James Clear | 4.6 out of 5 stars | 78722 | 396.0 | https://www.amazon.in/Atomic-Habits-James-Clea... | 2023-06-12 10:43:06 |


In [151]:
## extract data from the first page

book_title = []
book_author = []
book_rating = []
book_reviews = []
book_price = []
date_scraped = []
book_link = []

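## wrap the extraction in a function
## note: the lists above are global, so every call appends to them and
## the returned dataframe contains the rows from all calls made so far

def get_best_sellers_books(container):

    for book in container:
        ## when an element is missing, find() returns None and .text raises AttributeError
        try:
            title = book.find('div', class_="_cDEzb_p13n-sc-css-line-clamp-1_1Fn1y").text
            book_title.append(title)
        except AttributeError:
            book_title.append(np.nan)
        try:
            ## the author name can sit in either of the two line-clamp classes
            author = book.find('a', class_="a-size-small a-link-child").find(
                'div',
                attrs={'class': ("_cDEzb_p13n-sc-css-line-clamp-1_1Fn1y",
                                 "_cDEzb_p13n-sc-css-line-clamp-2_EWgCb")}).text
            book_author.append(author)
        except AttributeError:
            book_author.append(np.nan)
        try:
            rating = book.find('span', class_="a-icon-alt").text
            book_rating.append(rating)
        except AttributeError:
            book_rating.append(np.nan)
        try:
            ## note: when the author is not a link, the first 'a-size-small' span
            ## appears to hold the author name instead of the review count
            ## (see the rows with a missing author in the output)
            reviews = book.find('span', class_="a-size-small").text.replace(',', '')
            book_reviews.append(reviews)
        except AttributeError:
            book_reviews.append(np.nan)

        try:
            ## remove the rupee symbol and the thousands separator
            price = book.find('span', attrs={'class': 'p13n-sc-price'}).text[1:].replace(',', '')
            book_price.append(price)
        except AttributeError:
            book_price.append(np.nan)

        try:
            link_to_book = book.find('a', class_="a-link-normal")['href']
            book_link.append(base_url + link_to_book)
        except (TypeError, KeyError):
            ## no link element, or the element has no href attribute
            book_link.append(np.nan)

        date = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        date_scraped.append(date)

    ## create a dictionary
    df_dict = {'title': book_title, 'author': book_author, 'rating': book_rating,
               'reviews': book_reviews, 'price': book_price, 'book_link': book_link,
               'date': date_scraped}

    ## create a dataframe
    best_sellers_books = pd.DataFrame(df_dict)

    return best_sellers_books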
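
The function repeats the same try/except pattern for every field. One way to shorten it, shown here only as a sketch, is a small helper that runs a lookup and falls back to a default when the element is missing. The name `safe_extract` is hypothetical and not part of BeautifulSoup. The demo uses `None` as a stand-in for a failed `find()` call so that it runs without fetching a page.

```python
def safe_extract(getter, default=None):
    """Return getter(), or default when the element it looks for is missing."""
    try:
        return getter()
    except (AttributeError, TypeError, KeyError):
        # AttributeError: .text accessed on None (find() found nothing)
        # TypeError: ['href'] indexed on None
        # KeyError: the tag exists but has no such attribute
        return default


# stand-in for a failed lookup: find() returns None when nothing matches
missing = None

print(safe_extract(lambda: missing.text))             # None
print(safe_extract(lambda: missing['href'], 'n/a'))   # n/a
print(safe_extract(lambda: 'Atomic Habits'))          # Atomic Habits
```

Inside the loop, a call could then look like `book_rating.append(safe_extract(lambda: book.find('span', class_="a-icon-alt").text, np.nan))`, which keeps all lists the same length without one try block per field.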


In [152]:
## get the data from the first page
best_sellers_books_pg1 = get_best_sellers_books(books)
best_sellers_books_pg1

| | title | author | rating | reviews | price | book_link | date |
|---|---|---|---|---|---|---|---|
| 0 | Atomic Habits: the life-changing million-copy ... | James Clear | 4.6 out of 5 stars | 78722 | 396.0 | https://www.amazon.in/Atomic-Habits-James-Clea... | 2023-06-12 11:05:09 |
| 1 | The Hidden Hindu (Hindi Translation of The Hid... | Akshat Gupta | 4.5 out of 5 stars | 302 | 150.0 | https://www.amazon.in/Hidden-Hindu-Hindi-Trans... | 2023-06-12 11:05:09 |
| 2 | Ikigai: The Japanese secret to a long and happ... | Francesc Miralles | 4.6 out of 5 stars | 48160 | 374.0 | https://www.amazon.in/Ikigai-H%C3%A9ctor-Garc%... | 2023-06-12 11:05:09 |
| 3 | The Psychology of Money | Morgan Housel | 4.6 out of 5 stars | 50257 | 200.0 | https://www.amazon.in/Psychology-Money-Morgan-... | 2023-06-12 11:05:09 |
| 4 | The Hidden Hindu: Science-Fiction meets Indian... | Akshat Gupta | 4.4 out of 5 stars | 1535 | 180.0 | https://www.amazon.in/Hidden-Hindu-Akshat-Gupt... | 2023-06-12 11:05:09 |
| 5 | My First Library: Boxset of 10 Board Books for... | Wonder House Books | 4.5 out of 5 stars | 69946 | 399.0 | https://www.amazon.in/My-First-Library-Boxset-... | 2023-06-12 11:05:09 |
| 6 | Grandma's Bag of Stories: Collection of 20+ Il... | | 4.6 out of 5 stars | Sudha Murty | 157.0 | https://www.amazon.in/Grandmas-Bag-Stories-Sud... | 2023-06-12 11:05:09 |
| 7 | Rich Dad Poor Dad: 25th Anniversary Edit | Robert T. Kiyosaki | 4.5 out of 5 stars | 22486 | 369.0 | https://www.amazon.in/Rich-Dad-Poor-Middle-Ann... | 2023-06-12 11:05:09 |
| 8 | The Power of Your Subconscious Mind | Joseph Murphy | 4.5 out of 5 stars | 74628 | 115.0 | https://www.amazon.in/Power-Your-Subconscious-... | 2023-06-12 11:05:09 |
| 9 | Indian Polity Sixth Revised Edition | M Laxmikanth | 4.5 out of 5 stars | 4497 | 650.0 | https://www.amazon.in/Indian-English-Revised-S... | 2023-06-12 11:05:09 |
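
The scraped values are still raw strings: the rating reads "4.6 out of 5 stars", and in row 6 the reviews column holds an author name instead of a count. The following is a minimal preprocessing sketch using only the standard library. The helper names `parse_rating`, `parse_count`, and `parse_price` are hypothetical, and returning `None` for unreadable values is an assumption about how such rows should be treated.

```python
import re


def parse_rating(text):
    """'4.6 out of 5 stars' -> 4.6; None if no leading number is found."""
    match = re.match(r'\s*(\d+(?:\.\d+)?)', str(text))
    return float(match.group(1)) if match else None


def parse_count(text):
    """'78722' -> 78722; None for non-numeric values such as an author name."""
    cleaned = str(text).replace(',', '').strip()
    return int(cleaned) if re.fullmatch(r'\d+', cleaned) else None


def parse_price(text):
    """'396.00' -> 396.0; None if the value is not a plain number."""
    cleaned = str(text).replace(',', '').strip()
    return float(cleaned) if re.fullmatch(r'\d+(?:\.\d+)?', cleaned) else None


print(parse_rating('4.6 out of 5 stars'))  # 4.6
print(parse_count('78722'))                # 78722
print(parse_count('Sudha Murty'))          # None
print(parse_price('396.00'))               # 396.0
```

Applied to the dataframe, each helper can be mapped over its column, for example `best_sellers_books_pg1['rating'].map(parse_rating)`.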


In [153]:
## get data from the remaining pages

## There is only one additional page
## get the url for the next page
pg2_url = 'https://www.amazon.in/gp/bestsellers/books/ref=zg_bs_pg_2?ie=UTF8&pg=2'

## get the html
html2 = requests.get(pg2_url)

## parse the html
soup2 = BeautifulSoup(html2.content, 'html.parser')

## container
books2 = soup2.find_all('div', id='gridItemRoot')

## append the data from the second page
## the global lists already hold page 1, so this dataframe covers both pages
best_sellers_books_all = get_best_sellers_books(books2)
best_sellers_books_all

| | title | author | rating | reviews | price | book_link | date |
|---|---|---|---|---|---|---|---|
| 0 | Atomic Habits: the life-changing million-copy ... | James Clear | 4.6 out of 5 stars | 78722 | 396.00 | https://www.amazon.in/Atomic-Habits-James-Clea... | 2023-06-12 11:05:09 |
| 1 | The Hidden Hindu (Hindi Translation of The Hid... | Akshat Gupta | 4.5 out of 5 stars | 302 | 150.00 | https://www.amazon.in/Hidden-Hindu-Hindi-Trans... | 2023-06-12 11:05:09 |
| 2 | Ikigai: The Japanese secret to a long and happ... | Francesc Miralles | 4.6 out of 5 stars | 48160 | 374.00 | https://www.amazon.in/Ikigai-H%C3%A9ctor-Garc%... | 2023-06-12 11:05:09 |
| 3 | The Psychology of Money | Morgan Housel | 4.6 out of 5 stars | 50257 | 200.00 | https://www.amazon.in/Psychology-Money-Morgan-... | 2023-06-12 11:05:09 |
| 4 | The Hidden Hindu: Science-Fiction meets Indian... | Akshat Gupta | 4.4 out of 5 stars | 1535 | 180.00 | https://www.amazon.in/Hidden-Hindu-Akshat-Gupt... | 2023-06-12 11:05:09 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | Concept of Physics by H.C Verma Part - II - Se... | H.C. Verma | 4.6 out of 5 stars | 10085 | 330.00 | https://www.amazon.in/Concept-Physics-Part-2-2... | 2023-06-12 11:05:15 |
| 96 | Interact in English Main Course Book (MCB) + L... | | 4.4 out of 5 stars | EXAM360 | 324.00 | https://www.amazon.in/Interact-English-Course-... | 2023-06-12 11:05:15 |
| 97 | R D Sharma Mathematics Class 10 with MCQ in Ma... | R.D. Sharma | 4.5 out of 5 stars | 2530 | 450.00 | https://www.amazon.in/Mathematics-Class-10-Exa... | 2023-06-12 11:05:15 |
| 98 | Verity | Colleen Hoover | 4.3 out of 5 stars | 226395 | 235.00 | https://www.amazon.in/Verity-thriller-that-cap... | 2023-06-12 11:05:15 |
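
The two page URLs used above differ only in the page number, which appears twice in each. As a small sketch, they could be generated from a template instead of being typed by hand. The names `URL_TEMPLATE` and `best_sellers_url` are hypothetical, and the pattern is only inferred from the two URLs in this tutorial, so it may not hold for other pages or categories.

```python
# template inferred from the two URLs used in this tutorial
URL_TEMPLATE = 'https://www.amazon.in/gp/bestsellers/books/ref=zg_bs_pg_{page}?ie=UTF8&pg={page}'


def best_sellers_url(page):
    """Build the best sellers URL for the given page number."""
    return URL_TEMPLATE.format(page=page)


# the best sellers list spans two pages
page_urls = [best_sellers_url(page) for page in (1, 2)]

for page_url in page_urls:
    print(page_url)
```

Each URL in `page_urls` can then be fetched and parsed in a loop, in the same way as the two pages above.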


In [154]:
best_sellers_books_all.to_csv('amazon_best_sellers.csv', index=False)

