# Scraping Amazon Bestselling Books by Genre 

## Introduction

Data is often called the new oil because of its many applications in Machine Learning. Suppose you are asked to create a dataset of the bestselling books on Amazon so that it can be analyzed further for a Machine Learning project. One way of doing it is to copy and paste the value of every attribute you want, one by one, but wouldn't that be tiring and cumbersome?

We can automate this manual process of copying and pasting with web scraping. Web scraping means fetching data from websites automatically with a program, such as a web crawler or a short script.

In this notebook, we'll learn how to scrape Amazon, the world's biggest bookstore, which has more than 48.5 million books available on its website. We'll scrape the top 50 bestselling books in each genre available on Amazon.in.

The tools we are going to use to web scrape Amazon.in are:

- **Python** - We are going to write our code in Python.
- **Requests** - We are going to use the requests library to fetch the web page.
- **BeautifulSoup** - We are going to use it to parse the web page and extract data from it.
- **Pandas** - We are going to use pandas to organize the data extracted with BeautifulSoup and save it in CSV format.

## Outline

Here are the steps we'll follow:
    
- We're going to scrape https://www.amazon.in/gp/bestsellers/books/ 
- We'll first get the list of genres. For each genre we'll get the genre name and the genre page URL.
- For each genre we'll get the top 50 books from the genre page.
- For each book we'll grab the Book Name, Author Name, Stars, Number of Reviews, Book_Type, Price and the Book URL.
- For each genre we'll create a CSV file in the following format:
```
Book Name,Author Name,Stars,Number of Reviews,Book_Type,Price,Book URL
Harry Potter and the Philosopher's Stone,J.K. Rowling,4.7 out of 5 stars,"(36,920)",Kindle,223,https://www.amazon.in/Harry-Potter-Philosophers-Stone-Rowling-ebook/dp/B019PIOJYU/ref=zg_bs_1318158031_1/258-2870564-1318463?pd_rd_i=B019PIOJYU&psc=1
```
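
A value such as the review count contains a comma, so it has to be quoted to stay a single CSV field. The sketch below uses Python's standard `csv` module only to illustrate that quoting; the row is the example shown above, and the variable names are hypothetical. The notebook itself saves its files with pandas.

```python
import csv
import io

header = ["Book Name", "Author Name", "Stars", "Number of Reviews",
          "Book_Type", "Price", "Book URL"]
row = ["Harry Potter and the Philosopher's Stone", "J.K. Rowling",
       "4.7 out of 5 stars", "(36,920)", "Kindle", "223",
       "https://www.amazon.in/Harry-Potter-Philosophers-Stone-Rowling-ebook/dp/"
       "B019PIOJYU/ref=zg_bs_1318158031_1/258-2870564-1318463"
       "?pd_rd_i=B019PIOJYU&psc=1"]

# Write the header and one row into an in-memory buffer instead of a file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerow(row)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back recovers the seven original fields,
# because the field containing a comma was quoted on the way out.
parsed = list(csv.reader(io.StringIO(csv_text)))
```

Without the quotes, a reader would split `(36,920)` into two fields and every later column would shift by one.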



Let's start working on this outline.

## Scrape the list of book genres from Amazon

So, we are going to scrape the page `https://www.amazon.in/gp/bestsellers/books/` and will grab the **Genre Titles** and the **Genre URLs**.

The steps involved in generating the list of book genres:

1. Fetch the `https://www.amazon.in/gp/bestsellers/books/` using the requests library.
2. We'll parse and extract the information using the BeautifulSoup.
3. Finally, we'll convert the data extracted to a pandas dataframe.

### Importing the necessary libraries

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import os

### Scraping the Amazon bestselling books

In [2]:
# Fetching the 'https://www.amazon.in/gp/bestsellers/books/' using the requests library
genre_url = 'https://www.amazon.in/gp/bestsellers/books/'
response = requests.get(genre_url)

In [3]:
# Checking the page request status
response.status_code

200

Status code `200` shows that the request for fetching the web page is successful and we can proceed further to extract the information from the web page.
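
A site may occasionally answer with a status other than `200`, for instance when it is busy or throttles automated requests. As a minimal sketch, assuming a few spaced-out retries are acceptable, the hypothetical helper `fetch_with_retries` below wraps any fetch function (such as `requests.get`) and retries until it sees a `200`. The `FakeResponse` class and `fake_get` function are stand-ins so the example runs without network access.

```python
import time


def fetch_with_retries(fetch, url, attempts=3, delay=1.0):
    """Call fetch(url) until the response has status code 200.

    `fetch` is any callable returning an object with a `status_code`
    attribute, for example `requests.get`.
    """
    last_status = None
    for attempt in range(1, attempts + 1):
        response = fetch(url)
        last_status = response.status_code
        if last_status == 200:
            return response
        if attempt < attempts:
            time.sleep(delay * attempt)  # wait a little longer each time
    raise RuntimeError(
        'Failed to load {} after {} attempts (last status {})'.format(
            url, attempts, last_status))


# Stand-in for requests.get so the sketch runs offline:
# it fails twice with 503 and then succeeds.
class FakeResponse:
    def __init__(self, status_code):
        self.status_code = status_code


fake_statuses = [503, 503, 200]
calls = []


def fake_get(url):
    calls.append(url)
    return FakeResponse(fake_statuses[len(calls) - 1])


demo_response = fetch_with_retries(
    fake_get, 'https://www.amazon.in/gp/bestsellers/books/', delay=0)
print(demo_response.status_code, len(calls))
```

In the notebook this would be called as `fetch_with_retries(requests.get, genre_url)`.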

Now, in `response` object we get all the text which can be fetched from the url `https://www.amazon.in/gp/bestsellers/books/`

In [4]:
# Number of characters in response object
len(response.text)

305042

In [5]:
# Printing the first 1000 characters
response.text[:1000]

'<!doctype html><html lang="en-in" class="a-no-js" data-19ax5a9jf="dingo"><!-- sp:feature:head-start -->\n<head><script>var aPageStart = (new Date()).getTime();</script><meta charset="utf-8"/>\n<!-- sp:end-feature:head-start -->\n\n<!-- sp:feature:cs-optimization -->\n<meta http-equiv=\'x-dns-prefetch-control\' content=\'on\'>\n<link rel="dns-prefetch" href="https://images-eu.ssl-images-amazon.com">\n<link rel="dns-prefetch" href="https://m.media-amazon.com">\n<link rel="dns-prefetch" href="https://completion.amazon.com">\n<!-- sp:end-feature:cs-optimization -->\n\n<!-- sp:feature:aui-assets -->\n<link rel="stylesheet" href="https://images-eu.ssl-images-amazon.com/images/I/11EIQ5IGqaL._RC|01ZTHTZObnL.css,41wZkyTaWoL.css,31Y8m1dzTdL.css,013z33uKh2L.css,017DsKjNQJL.css,0131vqwP5UL.css,41EWOOlBJ9L.css,11TIuySqr6L.css,01ElnPiDxWL.css,11bGSgD5pDL.css,01Dm5eKVxwL.css,01IdKcBuAdL.css,01y-XAlI+2L.css,21N4kUH7pxL.css,01oDR3IULNL.css,41CYNGpGlrL.css,01XPHJk60-L.css,114y0SIP+yL.css,21aPhFy+riL.cs

The raw string in `response.text` is awkward to search directly. So, we'll create a BeautifulSoup object by passing `response.text` to the HTML parser, which turns the text into a tree of tags that we can query.

In [6]:
doc = BeautifulSoup(response.text, 'html.parser')

In [7]:
type(doc)

bs4.BeautifulSoup

Now, we want to extract all the different genre books and the genre urls from the `https://www.amazon.in/gp/bestsellers/books/` page using the `doc` variable, which we can understand as the `html` copy of the webpage we have fetched.

To do this, we will go to the web page, right click on the element we want to grab, and then choose the `inspect` option from the menu that appears.

![](https://i.imgur.com/xeprkah.png)

Next we'll grab the element from which we can extract all the information regarding the Genre Titles and Genre URLs. Now, if we hover over a `div` tag with `class = _p13n-zg-nav-tree-all_style_zg-browse-group__88fbz` then we get all the information which is required.
![](https://i.imgur.com/A8Jmwlj.png)

Let's check how it works.

We'll use the `find_all` method of BeautifulSoup to grab all the `div` tags with `class = _p13n-zg-nav-tree-all_style_zg-browse-group__88fbz` 

In [8]:
div_selection_class = '_p13n-zg-nav-tree-all_style_zg-browse-group__88fbz'
div_tags = doc.find_all('div', class_ = div_selection_class)

In [9]:
len(div_tags) # Check number of div tags that are fetched.

2

Let's check the first div tag.

In [10]:
div_tags[0]

<div class="_p13n-zg-nav-tree-all_style_zg-browse-group__88fbz"><div class="_p13n-zg-nav-tree-all_style_zg-browse-item__1rdKf _p13n-zg-nav-tree-all_style_zg-browse-height-large__1z5B8" role="treeitem"><span class="_p13n-zg-nav-tree-all_style_zg-selected__1SfhQ">Books</span></div><div class="_p13n-zg-nav-tree-all_style_zg-browse-group__88fbz" role="group"><div class="_p13n-zg-nav-tree-all_style_zg-browse-item__1rdKf _p13n-zg-nav-tree-all_style_zg-browse-height-large__1z5B8" role="treeitem"><a href="/gp/bestsellers/books/1318158031">Action &amp; Adventure</a></div><div class="_p13n-zg-nav-tree-all_style_zg-browse-item__1rdKf _p13n-zg-nav-tree-all_style_zg-browse-height-large__1z5B8" role="treeitem"><a href="/gp/bestsellers/books/1318052031">Arts, Film &amp; Photography</a></div><div class="_p13n-zg-nav-tree-all_style_zg-browse-item__1rdKf _p13n-zg-nav-tree-all_style_zg-browse-height-large__1z5B8" role="treeitem"><a href="/gp/bestsellers/books/1318064031">Biographies, Diaries &amp; True A

As we can see, `div_tags[0]` contains a lot of information, which makes it difficult to read. So, let's go back to our web page, `inspect` the element we want to grab once again, and see what else we can try.

![](https://i.imgur.com/zjcujdp.png)

Within the same div tag which we have already extracted, we can find an `a` tag which contains the Genre Title and Genre URL we are looking for.

Now, let's search within `div_tags[0]` by applying the `find_all` method and extract all the `a` tags.

In [11]:
a_tags = div_tags[0].find_all('a')

In [12]:
#Checking how many `a` tags we got
len(a_tags)

34

In [13]:
#Inspecting the first 5 a_tags
a_tags[:5]

[<a href="/gp/bestsellers/books/1318158031">Action &amp; Adventure</a>,
 <a href="/gp/bestsellers/books/1318052031">Arts, Film &amp; Photography</a>,
 <a href="/gp/bestsellers/books/1318064031">Biographies, Diaries &amp; True Accounts</a>,
 <a href="/gp/bestsellers/books/1318068031">Business &amp; Economics</a>,
 <a href="/gp/bestsellers/books/1318073031">Children's &amp; Young Adult</a>]

So, we have **34** different genres. Let's try to extract the genre title and genre url from the first `a` tag.

In [14]:
#By applying '.text' we can extract the genre title out of the tag
genre_title_1 = a_tags[0].text

#Let's print the 1st Genre Title 
print('1st Genre Title:',genre_title_1)


1st Genre Title: Action & Adventure


In [15]:
#By selecting the 'href' we can get the genre url
genre_url_1 = a_tags[0]['href']

#Let's print the 1st Genre URL
print('1st Genre URL:', genre_url_1)

#Let's add 'https://www.amazon.in' to the above url and get the complete working link.

base_url = 'https://www.amazon.in'
print('1st Genre URL:', base_url + genre_url_1)

1st Genre URL: /gp/bestsellers/books/1318158031
1st Genre URL: https://www.amazon.in/gp/bestsellers/books/1318158031
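
String concatenation works here because every `href` starts with `/`. As an alternative sketch, the standard library's `urllib.parse.urljoin` resolves a relative link against the page it came from and leaves absolute links untouched. The variable names below are illustrative only.

```python
from urllib.parse import urljoin

# The page the link was found on, and the relative href taken from the a tag.
page_url = 'https://www.amazon.in/gp/bestsellers/books/'
relative_href = '/gp/bestsellers/books/1318158031'

full_url = urljoin(page_url, relative_href)
print(full_url)

# An href that is already absolute is returned unchanged.
already_absolute = urljoin(page_url, 'https://www.amazon.in/gp/bestsellers/books/1318052031')
print(already_absolute)
```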


Our code seems to work fine, so let's loop over all 34 `a` tags with a for loop and build the lists of genre titles and genre urls.

In [16]:
#Create an empty genre_titles list
genre_titles = []

for i in range(0, len(a_tags)):
    genre_titles.append(a_tags[i].text) #Append the genre_titles to the empty list

#Print the genre_titles
print(genre_titles)

['Action & Adventure', 'Arts, Film & Photography', 'Biographies, Diaries & True Accounts', 'Business & Economics', "Children's & Young Adult", 'Comics & Mangas', 'Computing, Internet & Digital Media', 'Crafts, Home & Lifestyle', 'Crime, Thriller & Mystery', 'Engineering', 'Exam Preparation', 'Fantasy, Horror & Science Fiction', 'Health, Family & Personal Development', 'Health, Fitness & Nutrition', 'Higher Education Textbooks', 'Historical Fiction', 'History', 'Humour', 'Language, Linguistics & Writing', 'Law', 'Literature & Fiction', 'Maps & Atlases', 'Medicine & Health Sciences', 'Politics', 'Reference', 'Religion', 'Romance', 'School Books', 'Science & Mathematics', 'Sciences, Technology & Medicine', 'Society & Social Sciences', 'Sports', 'Textbooks & Study Guides', 'Travel']


So, we got all the **34** genre titles in a genre_titles list. Now, let's create the list for all the genre urls.

In [17]:
#Create an empty genre_urls list
genre_urls = []
base_url = 'http://amazon.in' # Base url

for i in range(0, len(a_tags)):
    genre_urls.append(base_url + a_tags[i]['href']) #Append the full genre url to the list

#Print the genre_urls
print(genre_urls)

['http://amazon.in/gp/bestsellers/books/1318158031', 'http://amazon.in/gp/bestsellers/books/1318052031', 'http://amazon.in/gp/bestsellers/books/1318064031', 'http://amazon.in/gp/bestsellers/books/1318068031', 'http://amazon.in/gp/bestsellers/books/1318073031', 'http://amazon.in/gp/bestsellers/books/1318104031', 'http://amazon.in/gp/bestsellers/books/1318105031', 'http://amazon.in/gp/bestsellers/books/1318118031', 'http://amazon.in/gp/bestsellers/books/1318161031', 'http://amazon.in/gp/bestsellers/books/22960344031', 'http://amazon.in/gp/bestsellers/books/4149751031', 'http://amazon.in/gp/bestsellers/books/1402038031', 'http://amazon.in/gp/bestsellers/books/1318128031', 'http://amazon.in/gp/bestsellers/books/23033693031', 'http://amazon.in/gp/bestsellers/books/4149418031', 'http://amazon.in/gp/bestsellers/books/1318164031', 'http://amazon.in/gp/bestsellers/books/4149493031', 'http://amazon.in/gp/bestsellers/books/1318143031', 'http://amazon.in/gp/bestsellers/books/1318144031', 'http://a

So, we got both the lists of **genre_titles** and **genre_urls**. 
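
The two index-based loops above can also be written as list comprehensions that iterate over the tags directly. The sketch below runs on a small hand-written HTML snippet (a simplified stand-in for the real navigation markup) so that it works offline; the `sample_` names are hypothetical and chosen so that they do not overwrite the notebook's own variables.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Simplified stand-in for the genre navigation markup shown earlier.
sample_html = '''
<div role="group">
  <div role="treeitem"><a href="/gp/bestsellers/books/1318158031">Action &amp; Adventure</a></div>
  <div role="treeitem"><a href="/gp/bestsellers/books/1318052031">Arts, Film &amp; Photography</a></div>
</div>
'''
sample_doc = BeautifulSoup(sample_html, 'html.parser')
sample_a_tags = sample_doc.find_all('a')

page_url = 'https://www.amazon.in/gp/bestsellers/books/'

# One comprehension per list; each iterates over the tags themselves.
sample_titles = [tag.text for tag in sample_a_tags]
sample_urls = [urljoin(page_url, tag['href']) for tag in sample_a_tags]

print(sample_titles)
print(sample_urls)
```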

### Use Pandas library to convert lists to a dataframe

Now, the plan is to create a dataframe of these lists using the pandas library.

We'll form a dataframe by converting `genre_titles`, `genre_urls` lists to a dictionary.

In [18]:
#We'll pass the list as the values to the dictionary keys.

genre_dict = {
    'genre_titles': genre_titles, 
    'genre_urls': genre_urls
}

Now, let's convert this dictionary to a pandas dataframe and name it `genre_df`.

In [19]:
genre_df = pd.DataFrame(genre_dict)
genre_df

Unnamed: 0,genre_titles,genre_urls
0,Action & Adventure,http://amazon.in/gp/bestsellers/books/1318158031
1,"Arts, Film & Photography",http://amazon.in/gp/bestsellers/books/1318052031
2,"Biographies, Diaries & True Accounts",http://amazon.in/gp/bestsellers/books/1318064031
3,Business & Economics,http://amazon.in/gp/bestsellers/books/1318068031
4,Children's & Young Adult,http://amazon.in/gp/bestsellers/books/1318073031
5,Comics & Mangas,http://amazon.in/gp/bestsellers/books/1318104031
6,"Computing, Internet & Digital Media",http://amazon.in/gp/bestsellers/books/1318105031
7,"Crafts, Home & Lifestyle",http://amazon.in/gp/bestsellers/books/1318118031
8,"Crime, Thriller & Mystery",http://amazon.in/gp/bestsellers/books/1318161031
9,Engineering,http://amazon.in/gp/bestsellers/books/22960344031


### Create a csv file with the extracted information

Now, we can also save the `genre_df` dataframe as a csv file.

In [20]:
# genre_df.to_csv('genres.csv', index = None)

We are done with our first step: we have extracted the genre titles and genre urls and can save them as a csv file. The next step is to extract the books information from the genre page.

## Getting information out of a genre page

The roadmap we'll follow to extract the information out of the genre page is:

1. We'll first try to grab the Book Name, Author Name, Stars, Number of Reviews, Book Type, Price and the Book URL from the first genre's url `https://www.amazon.in/gp/bestsellers/books/1318158031`.
2. We'll then apply the `for` loop on the `genre_df` dataframe and extract the information for all the genres.
3. We'll convert the extracted information to a pandas dataframe.
4. Finally, we'll save each dataframe as a csv file.

In [21]:
# genre url is the url of the first genre page
genre_url = 'https://www.amazon.in/gp/bestsellers/books/1318158031'

# Using a request library to fetch the genre_url
response = requests.get(genre_url)

`genre_url` has been fetched in the `response` variable by using the requests library.

In [22]:
# Checking the response status
response.status_code

200

Getting the value `200` as our status code shows that our request is successful.

Now, we'll create the BeautifulSoup object by parsing `response.text` with an HTML parser.

In [23]:
genre_doc = BeautifulSoup(response.text, 'html.parser')

In [24]:
type(genre_doc)

bs4.BeautifulSoup

We are ready to extract the books information from the genre pages. Let's start by opening the first genre page, right clicking on the first book and selecting `inspect`.

![](https://i.imgur.com/0kb79UY.png)

After selecting `inspect` we get
![](https://i.imgur.com/bB0cDjk.png)

From this page we are going to select the `div` tag with `class = zg-grid-general-faceout` which will give us all the information which is needed.

In [25]:
div_selection_class = 'zg-grid-general-faceout'
div_tags = genre_doc.find_all('div', class_ = div_selection_class)

In [26]:
len(div_tags)

50

So, we got a total of **50** `div_tags`. Let's inspect the first tag.

In [27]:
div_tags[0]

<div class="zg-grid-general-faceout"><div><a class="a-link-normal" href="/Harry-Potter-Philosophers-Stone-Rowling-ebook/dp/B019PIOJYU/ref=zg_bs_1318158031_1/000-0000000-0000000?pd_rd_i=B019PIOJYU&amp;psc=1" role="link" tabindex="-1"><div class="a-section a-spacing-mini _p13n-zg-list-grid-desktop_maskStyle_noop__3Xbw5"><img alt="Harry Potter and the Philosopher's Stone" class="a-dynamic-image p13n-sc-dynamic-image p13n-product-image" data-a-dynamic-image='{"https://images-eu.ssl-images-amazon.com/images/I/81DPTref3fL._AC_UL302_SR302,200_.jpg":[302,200],"https://images-eu.ssl-images-amazon.com/images/I/81DPTref3fL._AC_UL604_SR604,400_.jpg":[604,400],"https://images-eu.ssl-images-amazon.com/images/I/81DPTref3fL._AC_UL906_SR906,600_.jpg":[906,600]}' height="200px" src="https://images-eu.ssl-images-amazon.com/images/I/81DPTref3fL._AC_UL302_SR302,200_.jpg" style="max-width:302px;max-height:200px"/></div></a><a class="a-link-normal" href="/Harry-Potter-Philosophers-Stone-Rowling-ebook/dp/B019

Again a lot of information has popped out after inspecting the first `div_tag`. Let's go back and inspect the first book again.
![](https://i.imgur.com/9C97G35.png)

We can clearly see that under the parent div tag with `class = zg-grid-general-faceout` we can find the name of the first book `Harry Potter and the Philosopher's Stone` under a `span`. So, let's extract the span out of the `div_tags[0]`.

In [28]:
div_tags[0].find('span')

<span><div class="_p13n-zg-list-grid-desktop_truncationStyles_p13n-sc-css-line-clamp-1__1Fn1y">Harry Potter and the Philosopher's Stone</div></span>

Let's extract the text out of the span to get the book name.

In [29]:
div_tags[0].find('span').text

"Harry Potter and the Philosopher's Stone"

So, we got our first book name. Now, let's move on to the next item, i.e. the Author's name.

We can again go back to the web page, right click on the Author's name and select `inspect`. We'll land on the tag which contains the Author's name.
![](https://i.imgur.com/0UkNSF2.png)

Now, we can select an `a` tag with `class = a-size-small a-link-child` to get the name of the author from the parent div_tag.

In [30]:
a_selection_tag = 'a-size-small a-link-child'
div_tags[0].find('a', class_ = a_selection_tag)

<a class="a-size-small a-link-child" href="/J-K-Rowling/e/B000AP9A6K/ref=zg_bs_1318158031_bl_1/000-0000000-0000000?pd_rd_i=B019PIOJYU"><div class="_p13n-zg-list-grid-desktop_truncationStyles_p13n-sc-css-line-clamp-1__1Fn1y">J.K. Rowling</div></a>

Let's extract the text out of the tag to get the author's name.

In [31]:
div_tags[0].find('a', class_ = a_selection_tag).text

'J.K. Rowling'

I tried extracting the author's name for the different books on the `Action & Adventure` genre page, and found one book for which the above code did not work. The reason is that the author's name for the book `அகத்தை அடைக்காதே அசுரதேவா.! : Agaththai adaikkaathe asuratheva.! (Tamil Edition)` is not a link, so it cannot be extracted using the `a` tag.

![](https://i.imgur.com/M9PpxGo.png)

Instead, we need to use a `span` tag with `class = a-size-small a-color-base` to extract the Author's name. So, we'll find the `span` tag in the `div_tag` associated with the book `அகத்தை அடைக்காதே அசுரதேவா.! : Agaththai adaikkaathe asuratheva.! (Tamil Edition)`. At the time of writing this notebook, that `span` tag is contained in `div_tags[12]`.

In [34]:
span_selection_class = 'a-size-small a-color-base'
div_tags[12].find('span', class_ = span_selection_class)

<span class="a-size-small a-color-base"><div class="_p13n-zg-list-grid-desktop_truncationStyles_p13n-sc-css-line-clamp-1__1Fn1y">ஜெனிஷா   ரோஸ்</div></span>

Let's extract the text to get the author's name.

In [35]:
span_selection_class = 'a-size-small a-color-base'
div_tags[12].find('span', class_ = span_selection_class).text

'ஜெனிஷா   ரோஸ்'
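
The two cases can be combined into one lookup that tries the linked form first and falls back to the plain-text form. This is a minimal sketch on simplified, hand-written markup; `find_author_tag` and the sample variables are hypothetical names, not part of the notebook.

```python
from bs4 import BeautifulSoup


def find_author_tag(book_tag):
    """Return the author tag: the linked form first, then the plain-text form."""
    author_tag = book_tag.find('a', class_='a-size-small a-link-child')
    if author_tag is None:
        author_tag = book_tag.find('span', class_='a-size-small a-color-base')
    return author_tag


# Simplified stand-ins for the two kinds of book entries seen on the page.
linked_book = BeautifulSoup(
    '<div><a class="a-size-small a-link-child" href="#">'
    '<div>J.K. Rowling</div></a></div>', 'html.parser')
plain_author = 'ஜெனிஷா   ரோஸ்'
plain_book = BeautifulSoup(
    '<div><span class="a-size-small a-color-base">'
    '<div>' + plain_author + '</div></span></div>', 'html.parser')
no_author_book = BeautifulSoup('<div><span>Title only</span></div>', 'html.parser')

print(find_author_tag(linked_book).text)
print(find_author_tag(plain_book).text)
print(find_author_tag(no_author_book))
```

When neither form is present the helper returns `None`, so the caller still has to handle a missing author.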

Now, we have learned how to find the book name and author name of all the books present for the genre `Action & Adventure`.

Similarly, by right clicking on each element we want to extract and selecting the proper tag, we can find the remaining attributes.

So, the next element we'll find is the **book url** and that can be extracted from `a_tag` with `class = a-link-normal`.

In [36]:
# Book Url
a_selection_tag = 'a-link-normal'
div_tags[0].find('a', class_ = a_selection_tag)['href']

'/Harry-Potter-Philosophers-Stone-Rowling-ebook/dp/B019PIOJYU/ref=zg_bs_1318158031_1/000-0000000-0000000?pd_rd_i=B019PIOJYU&psc=1'

We can add a `base_url = https://amazon.in` before the url we have extracted to make it a complete working link.

In [37]:
# Book Url with a complete working link
base_url = 'https://amazon.in'
base_url + div_tags[0].find('a', class_ = a_selection_tag)['href']

'https://amazon.in/Harry-Potter-Philosophers-Stone-Rowling-ebook/dp/B019PIOJYU/ref=zg_bs_1318158031_1/000-0000000-0000000?pd_rd_i=B019PIOJYU&psc=1'

Now, if we open this link in a web browser, it takes us to the page of the book `Harry-Potter-Philosophers-Stone`.

The next thing we can find is the **number of stars** the book has received out of 5. To find the **number of stars**, we are going to right click on the star rating of any book on the `Action & Adventure` genre page.

We can see that the **number of stars** can be extracted from the `span tag` with a `class = a-icon-alt`.

![](https://i.imgur.com/hIAPWVh.png)

In [38]:
# Number of star ratings
span_selection_class = 'a-icon-alt'
div_tags[0].find('span', class_ = span_selection_class)

<span class="a-icon-alt">4.7 out of 5 stars</span>

Let's extract the text from the span tag.

In [39]:
div_tags[0].find('span', class_ = span_selection_class).text

'4.7 out of 5 stars'
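
The rating arrives as text. If a numeric value is needed for analysis, the leading number can be pulled out with a regular expression. This is a sketch that assumes the text always follows the `'<number> out of 5 stars'` pattern seen above; `parse_star_rating` is a hypothetical helper name.

```python
import re


def parse_star_rating(text):
    """Return the leading number of a string like '4.7 out of 5 stars', or None."""
    match = re.match(r'\s*(\d+(?:\.\d+)?)\s+out of\s+5', text)
    return float(match.group(1)) if match else None


print(parse_star_rating('4.7 out of 5 stars'))
print(parse_star_rating('Missing'))
```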

Next we can find the `Number of ratings` given by users.

To find the `Number of ratings` we'll select the `span` tag with a `class = a-size-small`.

In [40]:
# Number of ratings
span_selection_class = 'a-size-small'
div_tags[0].find('span', class_ = span_selection_class)

<span class="a-size-small">37,210</span>

Let's select the `text` to find the `number of ratings`

In [41]:
div_tags[0].find('span', class_ = span_selection_class).text

'37,210'
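
The count is a string with a thousands separator. A small sketch of turning it into an integer, assuming commas are the only separators used; `parse_ratings_count` is a hypothetical helper name.

```python
def parse_ratings_count(text):
    """Convert a string like '37,210' to an int, or return None if it is not a number."""
    digits = text.replace(',', '').strip()
    return int(digits) if digits.isdecimal() else None


print(parse_ratings_count('37,210'))
print(parse_ratings_count('Missing'))
```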

Next we are going to find the `Book Type` which is available on the bestsellers web page.

We can inspect the `Book Type` and select the `span` tag with `class = a-size-small a-color-secondary a-text-normal`.

In [42]:
# Book Type
span_selection_class = 'a-size-small a-color-secondary a-text-normal'
div_tags[0].find('span', class_ = span_selection_class)

<span class="a-size-small a-color-secondary a-text-normal">Kindle Edition</span>

In [43]:
div_tags[0].find('span', class_ = span_selection_class).text

'Kindle Edition'

Isn't it simple? Just inspect the element you want to grab, then select the appropriate tag which contains that element and Voila!! That's the beauty of BeautifulSoup library.

Let's move towards the final item we want to grab and that is the `Price` for which the book is sold on the bestsellers page.

To find the `Price` for which the book is sold again we can check the parent tag `div_tags[0]`.

We can see that the `Price` value is present under the `span` tag with `class = p13n-sc-price`.

![](https://i.imgur.com/SGAXQkR.png)

Let's implement it.

In [44]:
# Book Price
span_selection_class = 'p13n-sc-price'
div_tags[0].find('span', class_ = span_selection_class)

<span class="p13n-sc-price">₹223.00</span>

In [45]:
div_tags[0].find('span', class_ = span_selection_class).text

'₹223.00'
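
The price still carries the currency symbol. The sketch below strips it and converts the rest to a `Decimal`, which avoids floating-point rounding for money. It assumes prices are written with the `₹` symbol and optional comma separators; the value `'₹1,299.00'` is a made-up example, and `parse_price` is a hypothetical helper name.

```python
from decimal import Decimal, InvalidOperation


def parse_price(text):
    """Convert a string like '₹223.00' to a Decimal, or return None if it is not a price."""
    cleaned = text.replace('₹', '').replace(',', '').strip()
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


print(parse_price('₹223.00'))
print(parse_price('₹1,299.00'))  # made-up value to show the comma handling
print(parse_price('Missing'))
```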

## Final Code

We got all the attributes we wanted to create a dataset of the top 50 bestselling books from the different genres. Now, we are going to sum up all the information we have gathered and create functions and helper functions.

We are going to divide the roadmap into two stages:

- First, we are going to collect information about the book genres.
- Second, we are going to use the information collected for each genre and use it to create the dataset with all the book attributes.

### Scraping Amazon Book Genres

- We are going to create a `get_genre_info` function which is going to return the list of `genre_titles` and `genre_urls`.
- We are going to create another function `scrape_genres` which is going to use the `get_genre_info` and create the pandas dataframe using the lists `genre_titles` and `genre_urls`.

In [46]:
def get_genre_info(doc):
    #Selecting a parent tag
    div_selection_class = '_p13n-zg-nav-tree-all_style_zg-browse-group__88fbz'
    genre_div_tags = doc.find_all('div', class_ = div_selection_class)
    
    #Selecting a useful tag out of the parent tag
    genre_tags = genre_div_tags[0].find_all('a')
    
    #List for genre_titles
    genre_titles = []

    for i in range(0, len(genre_tags)):
        genre_titles.append(genre_tags[i].text)
        
    #List for genre_urls
    genre_urls = []
    base_url = 'https://www.amazon.in'
    
    
    for i in range(0, len(genre_tags)):
        genre_urls.append(base_url + genre_tags[i]['href'])
    return genre_titles, genre_urls

In [47]:
def scrape_genres():
    #Using request library to fetch the books genres
    genre_url = 'https://www.amazon.in/gp/bestsellers/books/'
    response = requests.get(genre_url)
    
    #Check the page requests success
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(genre_url))
    
    #Creating a BeautifulSoup object
    doc = BeautifulSoup(response.text, 'html.parser')
    
    #Genre dictionary built from the lists returned by get_genre_info (called once)
    genre_titles, genre_urls = get_genre_info(doc)
    genre_dic = {
        'genre_titles': genre_titles,
        'genre_urls': genre_urls
    }
    return pd.DataFrame(genre_dic)

Let's check how the code works by calling the `scrape_genres` function.

In [48]:
scrape_genres()

Unnamed: 0,genre_titles,genre_urls
0,Action & Adventure,https://www.amazon.in/gp/bestsellers/books/131...
1,"Arts, Film & Photography",https://www.amazon.in/gp/bestsellers/books/131...
2,"Biographies, Diaries & True Accounts",https://www.amazon.in/gp/bestsellers/books/131...
3,Business & Economics,https://www.amazon.in/gp/bestsellers/books/131...
4,Children's & Young Adult,https://www.amazon.in/gp/bestsellers/books/131...
5,Comics & Mangas,https://www.amazon.in/gp/bestsellers/books/131...
6,"Computing, Internet & Digital Media",https://www.amazon.in/gp/bestsellers/books/131...
7,"Crafts, Home & Lifestyle",https://www.amazon.in/gp/bestsellers/books/131...
8,"Crime, Thriller & Mystery",https://www.amazon.in/gp/bestsellers/books/131...
9,Engineering,https://www.amazon.in/gp/bestsellers/books/229...


We have successfully achieved the first stage of our roadmap by getting a dataframe of the genre_titles and the genre_urls.

Let's move on to the next and most important stage, i.e. creating a dataframe of the top 50 bestselling books for each of the genres available on Amazon.

### Scraping Book Name, Author Name, Stars, Number of Reviews, Book_Type, Price, Book URL for top 50 bestselling books of different genres.

First let's create a `genre_books_info` function which will grab all the attributes which are required in our dataset.

In [49]:
def genre_books_info(div_tags):
    # Tag for the Book Name
    Book_Name_tags = div_tags.find('span')
    
    # Tag for the Author's Name
    Author_Name_tags = div_tags.find(['a', 'span'], class_ = ['a-size-small a-link-child', 'a-size-small a-color-base'])
    
    # Book URL
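    # Book_URL is None when the link is absent, so the caller can record 'Missing'
    link_tag = div_tags.find('a', class_ = 'a-link-normal')
    Book_URL = ('https://amazon.in' + link_tag['href']) if link_tag is not None else None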
    
    # Tag for the Book Type
    Book_Type_tags = div_tags.find('span', class_ = 'a-size-small a-color-secondary a-text-normal')
    
    # Tag for Book Price
    Price_tags = div_tags.find('span', class_ = 'p13n-sc-price')
    
    # Tag for the number of stars
    Star_Rating_tags = div_tags.find('span', class_ = 'a-icon-alt')
    
    # Tag for the number of reviews
    Reviews_tags = div_tags.find('div', class_ = 'a-icon-row')
    
    return Book_Name_tags, Author_Name_tags, Book_URL, Book_Type_tags, Price_tags, Star_Rating_tags, Reviews_tags

Let's create a dictionary `genre_books_dict` to hold the lists for these attributes and run a for loop to append all the values to the respective lists.

In [50]:
genre_books_dict = {
    'Book_Name':[],
    'Author_Name':[],
    'Book_URL':[],
    'Book_Type':[],
    'Price':[],
    'Star_Rating':[],
    'Reviews':[]
}

for i in range(0, len(div_tags)):
    genre_info = genre_books_info(div_tags[i])
    
    if genre_info[0] is not None:
        genre_books_dict['Book_Name'].append(genre_info[0].text)
    else:
        genre_books_dict['Book_Name'].append('Missing')
        
    if genre_info[1] is not None:
        genre_books_dict['Author_Name'].append(genre_info[1].text)
    else:
        genre_books_dict['Author_Name'].append('Missing')
        
    if genre_info[2] is not None:
        genre_books_dict['Book_URL'].append(genre_info[2])
    else:
        genre_books_dict['Book_URL'].append('Missing')
        
    if genre_info[3] is not None:
        genre_books_dict['Book_Type'].append(genre_info[3].text)
    else:
        genre_books_dict['Book_Type'].append('Missing')
        
    if genre_info[4] is not None:
        genre_books_dict['Price'].append(genre_info[4].text)
    else:
        genre_books_dict['Price'].append('Missing')
        
    if genre_info[5] is not None:
        genre_books_dict['Star_Rating'].append(genre_info[5].text)
    else:
        genre_books_dict['Star_Rating'].append('Missing')
        
    if genre_info[6] is not None:
        genre_books_dict['Reviews'].append(genre_info[6].find('span', class_ = 'a-size-small').text)
    else:
        genre_books_dict['Reviews'].append('Missing')
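
Most branches of the loop above repeat the same pattern: use the tag's text if the tag was found, otherwise record `'Missing'`. As a sketch, that pattern can be moved into one small helper. `text_or_missing` is a hypothetical name, and `SimpleNamespace` objects stand in for BeautifulSoup tags here so the example runs on its own.

```python
from types import SimpleNamespace


def text_or_missing(tag, default='Missing'):
    """Return tag.text when a tag was found, otherwise the default marker."""
    return tag.text if tag is not None else default


# Stand-ins for tags returned by find(); None means the tag was not found.
sample_tags = {
    'Book_Name': SimpleNamespace(text="Harry Potter and the Philosopher's Stone"),
    'Author_Name': None,
    'Price': SimpleNamespace(text='₹223.00'),
}

sample_dict = {'Book_Name': [], 'Author_Name': [], 'Price': []}
for column, tag in sample_tags.items():
    sample_dict[column].append(text_or_missing(tag))

print(sample_dict)
```

Every column receives exactly one value per book, which keeps the lists the same length when they are turned into a dataframe.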
        

Let's check that our code works by giving the `genre_books_info` function the `div_tags` collected from the first genre page, with the url `https://www.amazon.in/gp/bestsellers/books/1318158031/`

In [51]:
genre_url_1 = 'https://www.amazon.in/gp/bestsellers/books/1318158031/'
response = requests.get(genre_url_1)

In [52]:
genre_doc = BeautifulSoup(response.text, 'html.parser')

We know that the book details we want to scrape are available under the `div_tag` with `class = zg-grid-general-faceout`. Let's use this information to grab the details about the top 50 books from the genre `Action & Adventure`.

In [53]:
div_selection_class = 'zg-grid-general-faceout'
div_tags = genre_doc.find_all('div', class_ = div_selection_class )

In [54]:
genre_books_df = pd.DataFrame(genre_books_dict)
genre_books_df

Unnamed: 0,Book_Name,Author_Name,Book_URL,Book_Type,Price,Star_Rating,Reviews
0,Harry Potter and the Philosopher's Stone,J.K. Rowling,https://amazon.in/Harry-Potter-Philosophers-St...,Kindle Edition,₹223.00,4.7 out of 5 stars,37210
1,The Complete Novels of Sherlock Holmes,Arthur Conan Doyle,https://amazon.in/Complete-Novels-Sherlock-Hol...,Paperback,₹139.00,4.5 out of 5 stars,12585
2,"The Silent Patient: The record-breaking, multi...",Alex Michaelides,https://amazon.in/Silent-Patient-Alex-Michaeli...,Paperback,₹267.00,4.5 out of 5 stars,87693
3,Harry Potter and the Chamber of Secrets,J.K. Rowling,https://amazon.in/Harry-Potter-Chamber-Secrets...,Kindle Edition,₹226.00,4.7 out of 5 stars,30526
4,Baby Touch: Tummy Time,Ladybird,https://amazon.in/Baby-Touch-Tummy-Time/dp/024...,Board book,₹267.00,4.5 out of 5 stars,1016
5,Something I Never Told You,Shravya Bhinder,https://amazon.in/Something-I-Never-Told-You/d...,Paperback,₹164.00,4.3 out of 5 stars,1671
6,Harry Potter and the Prisoner of Azkaban,J.K. Rowling,https://amazon.in/Harry-Potter-Prisoner-Azkaba...,Kindle Edition,₹171.00,4.7 out of 5 stars,23278
7,Harry Potter and the Goblet of Fire,J.K. Rowling,https://amazon.in/Harry-Potter-Goblet-Fire-Row...,Kindle Edition,₹233.00,4.7 out of 5 stars,20685
8,Harry Potter and the Order of the Phoenix,J.K. Rowling,https://amazon.in/Harry-Potter-Order-Phoenix-R...,Kindle Edition,₹233.00,4.7 out of 5 stars,19297
9,Harry Potter and the Deathly Hallows,J.K. Rowling,https://amazon.in/Harry-Potter-Deathly-Hallows...,Kindle Edition,₹233.00,4.7 out of 5 stars,26179


We successfully scraped the complete information for the top 50 bestselling books of the genre `Action & Adventure`, and it looks awesome.

Now, let's write the functions to grab the information of top 50 bestselling books for all the different genres.

1. We are going to write a `get_genre_page` function to download and parse a genre page from its url.
2. We are going to use the function `genre_books_info` written above to grab the info about a book.
3. Now we are going to write a `get_genre_books` function which is going to use `genre_books_info` to create the dataframe of all the different attributes of the books.
4. We are going to write a `scrape_genre` function which is going to take the dataframes from the `get_genre_books` function and save them in CSV format.

In [55]:
def get_genre_page(genre_url):
    #Download Page
    response = requests.get(genre_url)
    #Check page status
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(genre_url))
    #Parsing the page with BeautifulSoup
    genre_doc = BeautifulSoup(response.text, 'html.parser')
    return genre_doc
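
The status check in `get_genre_page` raises on the first non-200 response. A site may occasionally reject or throttle scripted requests, so retrying after a short pause can help. The following is only a minimal sketch: `fetch_with_retries` is a hypothetical helper that is not part of the original notebook, and the number of attempts and the delay are arbitrary assumptions. It accepts any callable that returns an object with a `status_code` attribute, so the demo below runs with a stand-in instead of a real network call.

```python
import time
from types import SimpleNamespace


def fetch_with_retries(fetch, url, attempts=3, delay=2.0, sleep=time.sleep):
    """Call fetch(url) until it returns status 200 or the attempts run out.

    Hypothetical helper: `fetch` is any callable returning an object with
    a `status_code` attribute (for example, requests.get).
    """
    last_status = None
    for attempt in range(1, attempts + 1):
        response = fetch(url)
        if response.status_code == 200:
            return response
        last_status = response.status_code
        if attempt < attempts:
            # Wait a little longer after each failed attempt
            sleep(delay * attempt)
    raise Exception('Failed to load page {} (last status {})'.format(url, last_status))


# Demo with a stand-in fetch function: one failure, then success
fake_codes = iter([503, 200])
fake_fetch = lambda u: SimpleNamespace(status_code=next(fake_codes))
response = fetch_with_retries(fake_fetch, 'https://example.com', sleep=lambda s: None)
print(response.status_code)  # 200
```

In the notebook, the same helper could wrap the real download, for example `fetch_with_retries(requests.get, genre_url)`.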

In [56]:
def genre_books_info(div_tags):
    # Tag for the Book Name
    Book_Name_tags = div_tags.find('span')
    
    # Tag for the Author's Name
    Author_Name_tags = div_tags.find(['a', 'span'], class_ = ['a-size-small a-link-child', 'a-size-small a-color-base'])
    
    # Book URL
    Book_URL = 'https://amazon.in' + div_tags.find('a', class_ = 'a-link-normal')['href']
    
    # Tag for the Book Type
    Book_Type_tags = div_tags.find('span', class_ = 'a-size-small a-color-secondary a-text-normal')
    
    # Tag for Book Price
    Price_tags = div_tags.find('span', class_ = 'p13n-sc-price')
    
    # Tag for the number of stars
    Star_Rating_tags = div_tags.find('span', class_ = 'a-icon-alt')
    
    # Tag for the number of reviews
    Reviews_tags = div_tags.find('div', class_ = 'a-icon-row')
    
    return Book_Name_tags, Author_Name_tags, Book_URL, Book_Type_tags, Price_tags, Star_Rating_tags, Reviews_tags

In [57]:
def get_genre_books(genre_doc):
    #Selecting parent tag which contains the book info
    div_selection_class = 'zg-grid-general-faceout'
    div_tags = genre_doc.find_all('div', class_ = div_selection_class )
    # Creating dictionary of the book attributes
    genre_books_dict = {
    'Book_Name':[],
    'Author_Name':[],
    'Book_URL':[],
    'Book_Type':[],
    'Price':[],
    'Star_Rating':[],
    'Reviews':[]
    }
    #Creating the lists of book attributes
    for i in range(0, len(div_tags)):
        genre_info = genre_books_info(div_tags[i])

        if genre_info[0] is not None:
            genre_books_dict['Book_Name'].append(genre_info[0].text)
        else:
            genre_books_dict['Book_Name'].append('Missing')

        if genre_info[1] is not None:
            genre_books_dict['Author_Name'].append(genre_info[1].text)
        else:
            genre_books_dict['Author_Name'].append('Missing')

        if genre_info[2] is not None:
            genre_books_dict['Book_URL'].append(genre_info[2])
        else:
            genre_books_dict['Book_URL'].append('Missing')

        if genre_info[3] is not None:
            genre_books_dict['Book_Type'].append(genre_info[3].text)
        else:
            genre_books_dict['Book_Type'].append('Missing')

        if genre_info[4] is not None:
            genre_books_dict['Price'].append(genre_info[4].text)
        else:
            genre_books_dict['Price'].append('Missing')

        if genre_info[5] is not None:
            genre_books_dict['Star_Rating'].append(genre_info[5].text)
        else:
            genre_books_dict['Star_Rating'].append('Missing')
        # The inner find is applied here, after the None check, because at the time of writing one book had no reviews available
        if genre_info[6] is not None:
            genre_books_dict['Reviews'].append(genre_info[6].find('span', class_ = 'a-size-small').text)
        else:
            genre_books_dict['Reviews'].append('Missing')
            
    return pd.DataFrame(genre_books_dict)
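
The repeated `if ... is not None` blocks in `get_genre_books` all follow the same pattern, so they could be folded into one small helper. The sketch below is an illustration, not the notebook's method: `text_or_missing`, `build_row` and `FIELDS` are hypothetical names, and it covers only the fields whose value is the tag's text (the URL and the nested review lookup would still need their own handling). Simple stand-in objects with a `.text` attribute are used so that the example runs without downloading a page.

```python
from types import SimpleNamespace

# Hypothetical list of the fields whose value is simply the tag's text
FIELDS = ('Book_Name', 'Author_Name', 'Book_Type', 'Price', 'Star_Rating')


def text_or_missing(tag, default='Missing'):
    """Return the text of a tag, or `default` when the tag is None."""
    return default if tag is None else tag.text


def build_row(tags):
    """Map each field name to the text of its tag (or 'Missing')."""
    return {name: text_or_missing(tag) for name, tag in zip(FIELDS, tags)}


# Stand-ins for BeautifulSoup tags; the price tag is missing here
tags = (
    SimpleNamespace(text='Something I Never Told You'),
    SimpleNamespace(text='Shravya Bhinder'),
    SimpleNamespace(text='Paperback'),
    None,
    SimpleNamespace(text='4.3 out of 5 stars'),
)
row = build_row(tags)
print(row['Book_Type'])  # Paperback
print(row['Price'])      # Missing
```

A list of such row dictionaries can be passed directly to `pd.DataFrame` to build the same kind of table.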


In [58]:
def scrape_genre(genre_url, path):
    #Skipping the scrape if the file already exists
    if os.path.exists(path):
        print('The file {} already exists.. Skipping...'.format(path))
        return
    #Creating the dataframe of book attributes for the genre
    genre_df = get_genre_books(get_genre_page(genre_url))
    #Saving the dataframe in CSV format
    genre_df.to_csv(path, index = None)

We have already written the functions `get_genre_info` and `scrape_genres` to create the dataframe of the genre_titles and the genre_urls. Let's write them once again.

In [59]:
def get_genre_info(doc):
    #Selecting a parent tag
    div_selection_class = '_p13n-zg-nav-tree-all_style_zg-browse-group__88fbz'
    genre_div_tags = doc.find_all('div', class_ = div_selection_class)
    
    #Selecting a useful tag out of the parent tag
    genre_tags = genre_div_tags[0].find_all('a')
    
    #List for genre_titles
    genre_titles = []

    for i in range(0, len(genre_tags)):
        genre_titles.append(genre_tags[i].text)
        
    #List for genre_urls
    genre_urls = []
    base_url = 'https://www.amazon.in'
    
    
    for i in range(0, len(genre_tags)):
        genre_urls.append(base_url + genre_tags[i]['href'])
    return genre_titles, genre_urls

In [60]:
def scrape_genres():
    #Using the requests library to fetch the book genres page
    genre_url = 'https://www.amazon.in/gp/bestsellers/books/'
    response = requests.get(genre_url)
    
    #Checking that the page request succeeded
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(genre_url))
    
    #Creating a BeautifulSoup object
    doc = BeautifulSoup(response.text, 'html.parser')
    
    #Dictionary built from the lists returned by get_genre_info
    genre_titles, genre_urls = get_genre_info(doc)
    genre_dic = {
        'genre_titles': genre_titles,
        'genre_urls': genre_urls
    }
    return pd.DataFrame(genre_dic)

Finally, we'll write a function `scrape_genre_books` which calls `scrape_genres` to create the dataframe containing **genre_titles** and **genre_urls**. It then loops over the rows of that dataframe and passes each genre URL to `scrape_genre`, which scrapes the attributes of every book in the genre and saves them as a CSV file in the **data** folder.

In [61]:
def scrape_genre_books():
    print('Scraping list of book genres')
    # Scraping the genre_titles and genre_urls
    genres_df = scrape_genres()
    # Creating a 'data' folder
    os.makedirs('data', exist_ok = True)
    # Looping to extract the genre_title and the genre_url
    for index, row in genres_df.iterrows():
        print('Scraping bestselling books for the genre {}'.format(row['genre_titles']))
        # Passing the genre_url and the path of the output file to scrape_genre.
        scrape_genre(row['genre_urls'], 'data/{}.csv'.format(row['genre_titles']))
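
The genre title is used directly as the file name, which works for the titles shown here but would break if a title ever contained a path separator such as `/`. One cautious option is to normalise the title first. The sketch below assumes a hypothetical helper `safe_filename` that is not part of the original notebook; the choice of allowed characters is an arbitrary assumption.

```python
import re


def safe_filename(title):
    """Turn a genre title into a conservative file name (hypothetical helper)."""
    # Replace each run of characters other than letters, digits, '-' or '_' with '_'
    name = re.sub(r'[^A-Za-z0-9_-]+', '_', title).strip('_')
    # Fall back to a fixed name if nothing usable is left
    return name or 'untitled'


print(safe_filename("Children's & Young Adult"))             # Children_s_Young_Adult
print(safe_filename('Computing, Internet & Digital Media'))  # Computing_Internet_Digital_Media
```

The path passed to `scrape_genre` could then be built as `'data/{}.csv'.format(safe_filename(row['genre_titles']))`.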
    

Let's check how the code works by calling `scrape_genre_books()`.

In [72]:
scrape_genre_books()

Scraping list of book genres
Scraping bestselling books for the genre Action & Adventure
The file data/Action & Adventure.csv already exists.. Skipping...
Scraping bestselling books for the genre Arts, Film & Photography
The file data/Arts, Film & Photography.csv already exists.. Skipping...
Scraping bestselling books for the genre Biographies, Diaries & True Accounts
The file data/Biographies, Diaries & True Accounts.csv already exists.. Skipping...
Scraping bestselling books for the genre Business & Economics
The file data/Business & Economics.csv already exists.. Skipping...
Scraping bestselling books for the genre Children's & Young Adult
The file data/Children's & Young Adult.csv already exists.. Skipping...
Scraping bestselling books for the genre Comics & Mangas
The file data/Comics & Mangas.csv already exists.. Skipping...
Scraping bestselling books for the genre Computing, Internet & Digital Media
The file data/Computing, Internet & Digital Media.csv already exists.. Skipping.

This is the final output I got in the data folder.
![](https://i.imgur.com/jp67h5V.png)

## References and Future Work

Summary:

- First figure out the webpage to scrape.
- Use the requests library to fetch the webpage.
- Create a BeautifulSoup object to parse the page and extract the information.
- Inspect the elements to be extracted by right-clicking them in the browser and choosing Inspect.
- Select the appropriate tags for the elements to be scraped and use BeautifulSoup to extract them.
- Use Pandas to convert the extracted information into a dataframe.
- Finally, save the dataframe in CSV format.

References:

- https://www.youtube.com/watch?v=RKsLLG-bzEY&ab_channel=Jovian
- https://www.datacamp.com/community/tutorials/amazon-web-scraping-using-beautifulsoup 

Future Work:

- We can preprocess the datasets extracted from Amazon. Preprocessing can include the following:
    1. Replacing any missing values with appropriate values.
    2. Removing duplicated rows.
    3. Taking note of any unusual values.
    4. Adjusting outliers, as they can hamper further analysis of the data.
    
    
- We can analyze the datasets further to answer the following:
    1. Which genre is the most popular, with the highest user ratings?
    2. Which authors have top-selling books in multiple genres?
    3. Which authors have multiple top-selling books in the same genre?
    4. Which 10 books have the highest selling price across all genres?
    5. What is the trend between star rating and number of user ratings?
    6. Which books have both the highest star ratings and the highest number of user ratings?
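
Most of the preprocessing and analysis ideas above need numeric columns, while the scraped values are strings such as `₹139.00`, `4.5 out of 5 stars` and `(36,920)`. As a starting point, here is a minimal sketch of three parsing functions. Their names are hypothetical and not part of the original notebook, and they assume the string formats seen in the output above. Each returns `None` when no number can be found, for example for the `Missing` placeholder.

```python
import re


def parse_price(text):
    """Extract the numeric price from a string such as '₹139.00'."""
    match = re.search(r'\d[\d,]*(?:\.\d+)?', text)
    return float(match.group().replace(',', '')) if match else None


def parse_stars(text):
    """Extract the rating from a string such as '4.5 out of 5 stars'."""
    match = re.match(r'\s*(\d+(?:\.\d+)?)\s+out of', text)
    return float(match.group(1)) if match else None


def parse_reviews(text):
    """Extract the review count from a string such as '12585' or '(36,920)'."""
    digits = re.sub(r'\D', '', text)
    return int(digits) if digits else None


print(parse_price('₹139.00'))             # 139.0
print(parse_stars('4.5 out of 5 stars'))  # 4.5
print(parse_reviews('(36,920)'))          # 36920
print(parse_price('Missing'))             # None
```

With a dataframe loaded from one of the CSV files, these could be applied column by column, for example `df['Price'].map(parse_price)`.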
    