# Scraping Amazon Best Seller Books using Python

## Introduction to web scraping

Web scraping, also called web data mining or web harvesting, is the process of automatically extracting, parsing, downloading and organizing useful information from the web.

The extracted data is typically saved as an Excel spreadsheet or a CSV file, but it can also be saved in other formats, such as a JSON file.

Unlike screen scraping, which only copies pixels displayed on screen, web scraping extracts underlying HTML code and, with it, data stored in a database.
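To make the idea concrete, here is a minimal sketch that uses only Python's standard-library `html.parser` (not the Beautiful Soup approach used later in this article). The `LinkCollector` class and the sample HTML string are illustrative assumptions. The snippet reads the HTML structure, turns each link into a record, and serializes the records as JSON.

```python
import json
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the text and href of every <a href=...> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current_href = None
        self._text_parts = []

    def handle_starttag(self, tag, attrs):
        # remember the href while we are inside an <a> tag
        if tag == 'a':
            self._current_href = dict(attrs).get('href')
            self._text_parts = []

    def handle_data(self, data):
        # only keep text that appears inside a link
        if self._current_href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._current_href is not None:
            title = ''.join(self._text_parts).strip()
            self.links.append({'title': title, 'url': self._current_href})
            self._current_href = None


# hypothetical snippet shaped like one navigation entry of a web page
sample_html = '<div><a href="/gp/bestsellers/books/1318158031">Action &amp; Adventure</a></div>'

collector = LinkCollector()
collector.feed(sample_html)
collector.close()

links_json = json.dumps(collector.links, indent=2)
print(links_json)
```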



![](https://i.imgur.com/nwQiy9e.jpg)

## Introduction to GitHub

GitHub is a cloud-based hosting service for software development projects that use the Git revision control system. It also works as a social network for programmers and is the world's largest coding community. GitHub lets you collaborate by forking projects, sending pull requests, and monitoring development.


GitHub offers free accounts as well as paid plans.


It offers the distributed version control and source code management (SCM) functionality of Git, plus its own features. It provides access control and several collaboration features such as bug tracking, feature requests, task management, and continuous integration for every project.


![](https://i.imgur.com/CKGbGYe.jpg)

#### Project outline

Here are the steps we'll follow:

- We're going to scrape https://www.amazon.in/gp/bestsellers/books/

- We'll first get the list of genres. For each genre we'll get the genre name and the genre page URL. For each genre we'll get the top 50 books from the genre page.

- For each book we'll grab the book name, author name, star rating, number of reviews, edition type, price and the book URL.

- For each genre we'll create a CSV file in the following format:

Book_Name,Author_Name,Book_URL,Edition_Type,Price,Star_Rating,Reviews

Harry Potter and the Philosopher's Stone,J.K. Rowling,https://amazon.in/Harry-Potter-Philosophers-Stone-Rowling-ebook/dp/B019PIOJYU/ref=zg_bs_1318158031_1/000-0000000-0000000?pd_rd_i=B019PIOJYU&psc=1,Kindle Edition,₹299.00,4.7 out of 5 stars,"39,452"

"The Silent Patient: The record-breaking, multimillion copy Sunday Times bestselling thriller and Richard & Judy book club pick",Alex Michaelides,https://amazon.in/Silent-Patient-Alex-Michaelides/dp/1409181634/ref=zg_bs_1318158031_2/000-0000000-0000000?pd_rd_i=1409181634&psc=1,Paperback,₹279.00,4.5 out of 5 stars,"92,969"
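Because book titles and review counts can contain commas, such fields have to be quoted in the CSV file. The sketch below uses the standard-library `csv` module (the article itself writes files with pandas later on) to show a row being written with automatic quoting and read back unchanged. The `FIELDS` and `book` names are illustrative.

```python
import csv
import io

FIELDS = ['Book_Name', 'Author_Name', 'Book_URL', 'Edition_Type',
          'Price', 'Star_Rating', 'Reviews']

# one sample row, taken from the format shown above
book = {
    'Book_Name': ('The Silent Patient: The record-breaking, multimillion copy '
                  'Sunday Times bestselling thriller and Richard & Judy book club pick'),
    'Author_Name': 'Alex Michaelides',
    'Book_URL': ('https://amazon.in/Silent-Patient-Alex-Michaelides/dp/1409181634/'
                 'ref=zg_bs_1318158031_2/000-0000000-0000000?pd_rd_i=1409181634&psc=1'),
    'Edition_Type': 'Paperback',
    'Price': '₹279.00',
    'Star_Rating': '4.5 out of 5 stars',
    'Reviews': '92,969',
}

# write to an in-memory buffer; fields containing commas are quoted automatically
buffer = io.StringIO(newline='')
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerow(book)
csv_text = buffer.getvalue()
print(csv_text)

# reading the text back restores the original values, commas included
rows = list(csv.DictReader(io.StringIO(csv_text, newline='')))
```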


![](https://i.imgur.com/Zx5uZVq.png)


## Tools used to scrape the list of genres from Amazon

- Requests: to download the page
- Beautiful Soup (bs4): to parse and extract information
- pandas: to convert the data to a DataFrame and save it as CSV

## Requests

The Requests library allows you to send HTTP requests using Python. Each request returns a Response object with all the response data (content, encoding, status, etc.).
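A site does not always answer with status code 200 on the first try, so it can help to check `status_code` and retry after a pause. The following is a minimal sketch under stated assumptions: `fetch_with_retries` is a hypothetical helper, and `fake_get` is a stand-in for `requests.get` so the example runs without network access.

```python
import time


def fetch_with_retries(fetch, url, attempts=3, delay_seconds=2.0):
    """Call fetch(url) until the response has status code 200.

    `fetch` is any callable returning an object with a `status_code`
    attribute, for example `requests.get`.
    """
    last_status = None
    for attempt in range(1, attempts + 1):
        response = fetch(url)
        if response.status_code == 200:
            return response
        last_status = response.status_code
        if attempt < attempts:
            time.sleep(delay_seconds)  # pause before trying again
    raise RuntimeError('Failed to load {} after {} attempts (last status {})'
                       .format(url, attempts, last_status))


# Offline demonstration with a stand-in for requests.get (hypothetical):
# it answers 503 twice and then 200.
class FakeResponse:
    def __init__(self, status_code):
        self.status_code = status_code


statuses = iter([503, 503, 200])


def fake_get(url):
    return FakeResponse(next(statuses))


response = fetch_with_retries(fake_get,
                              'https://www.amazon.in/gp/bestsellers/books/',
                              delay_seconds=0)
print(response.status_code)
```

With the real library, the call would look like `fetch_with_retries(requests.get, url)`.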



![](https://i.imgur.com/Ty6gVtb.jpg)

In [4]:
import requests

In [5]:
url='https://www.amazon.in/gp/bestsellers/books/'

In [6]:
response = requests.get(url)

In [7]:
response.status_code

200

In [8]:
len(response.text)

305080

In [9]:
page_content=response.text

In [10]:
page_content[0:500]

'<!doctype html><html lang="en-in" class="a-no-js" data-19ax5a9jf="dingo"><!-- sp:feature:head-start -->\n<head><script>var aPageStart = (new Date()).getTime();</script><meta charset="utf-8"/>\n<!-- sp:end-feature:head-start -->\n\n<!-- sp:feature:cs-optimization -->\n<meta http-equiv=\'x-dns-prefetch-control\' content=\'on\'>\n<link rel="dns-prefetch" href="https://images-eu.ssl-images-amazon.com">\n<link rel="dns-prefetch" href="https://m.media-amazon.com">\n<link rel="dns-prefetch" href="https://completio'

In [11]:
with open('amazon_bestseller.html', 'w', encoding='utf-8') as f:
    f.write(page_content)
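When saving the page, it is safer to name the encoding explicitly. The default encoding of `open()` depends on the platform, and on some systems it cannot represent characters such as the rupee sign. A small sketch, assuming a short stand-in string in place of the real `response.text` and a temporary folder in place of the working directory:

```python
import tempfile
from pathlib import Path

# stand-in for response.text (hypothetical content)
page_content = '<html><body><span class="p13n-sc-price">₹299.00</span></body></html>'

with tempfile.TemporaryDirectory() as folder:
    html_path = Path(folder) / 'amazon_bestseller.html'
    # write and read with the same explicit encoding
    html_path.write_text(page_content, encoding='utf-8')
    saved_content = html_path.read_text(encoding='utf-8')

print(saved_content == page_content)
```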

[Screenshot of the Best Sellers page]

## Use Beautiful Soup to parse and extract information

- Parse and explore the structure of downloaded web pages using Beautiful Soup.

- Use the right properties and methods to extract the required information.

- Create functions to extract from the page into lists and dictionaries.

- (Optional) Use a REST API to acquire additional information if required.

In [12]:
from bs4 import BeautifulSoup
doc = BeautifulSoup(page_content, 'html.parser')

In [13]:
selection_class="_p13n-zg-nav-tree-all_style_zg-browse-item__1rdKf _p13n-zg-nav-tree-all_style_zg-browse-height-large__1z5B8"
genre_title_tag=doc.find_all('div',{ 'class':selection_class})

![](https://i.imgur.com/CJ0a06E.png)

In [14]:
len(genre_title_tag)

35

In [15]:
#genre_title_tag[:3]

In [16]:
genre_title_tag=genre_title_tag[1:]
genre_title_tag[:3]

[<div class="_p13n-zg-nav-tree-all_style_zg-browse-item__1rdKf _p13n-zg-nav-tree-all_style_zg-browse-height-large__1z5B8" role="treeitem"><a href="/gp/bestsellers/books/1318158031">Action &amp; Adventure</a></div>,
 <div class="_p13n-zg-nav-tree-all_style_zg-browse-item__1rdKf _p13n-zg-nav-tree-all_style_zg-browse-height-large__1z5B8" role="treeitem"><a href="/gp/bestsellers/books/1318052031">Arts, Film &amp; Photography</a></div>,
 <div class="_p13n-zg-nav-tree-all_style_zg-browse-item__1rdKf _p13n-zg-nav-tree-all_style_zg-browse-height-large__1z5B8" role="treeitem"><a href="/gp/bestsellers/books/1318064031">Biographies, Diaries &amp; True Accounts</a></div>]
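The long class names in the selector look machine-generated and may change when the page is rebuilt. As an alternative sketch, the `role="treeitem"` attribute visible in the output can be used with a CSS selector to collect each title and URL in one pass. The `get_genres` helper and the sample HTML are assumptions, including the guess that the first "Books" entry holds a `span` instead of a link.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# hypothetical HTML shaped like the navigation tree shown above
sample_html = '''
<div role="group">
  <div role="treeitem"><span>Books</span></div>
  <div role="treeitem"><a href="/gp/bestsellers/books/1318158031">Action &amp; Adventure</a></div>
  <div role="treeitem"><a href="/gp/bestsellers/books/1318052031">Arts, Film &amp; Photography</a></div>
</div>
'''


def get_genres(doc, base_url='https://www.amazon.in'):
    """Return one {'title', 'url'} dict per genre link in the navigation tree."""
    genres = []
    # entries without a link (such as the top-level "Books") are skipped
    for link in doc.select('div[role="treeitem"] a[href]'):
        genres.append({
            'title': link.get_text(strip=True),
            'url': urljoin(base_url, link['href']),
        })
    return genres


genres = get_genres(BeautifulSoup(sample_html, 'html.parser'))
for genre in genres:
    print(genre['title'], genre['url'])
```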

## Extracting titles and URLs of the genres

In [17]:
def get_topic_titles(doc):
    selection_class="_p13n-zg-nav-tree-all_style_zg-browse-item__1rdKf _p13n-zg-nav-tree-all_style_zg-browse-height-large__1z5B8"
    genre_title_tag=doc.find_all('div',{ 'class':selection_class})
    topic_titles = []
    for tag in genre_title_tag:
        topic_titles.append(tag.text)
    return topic_titles

In [18]:
get_topic_titles(doc)

['Books',
 'Action & Adventure',
 'Arts, Film & Photography',
 'Biographies, Diaries & True Accounts',
 'Business & Economics',
 "Children's & Young Adult",
 'Comics & Mangas',
 'Computing, Internet & Digital Media',
 'Crafts, Home & Lifestyle',
 'Crime, Thriller & Mystery',
 'Engineering',
 'Exam Preparation',
 'Fantasy, Horror & Science Fiction',
 'Health, Family & Personal Development',
 'Health, Fitness & Nutrition',
 'Higher Education Textbooks',
 'Historical Fiction',
 'History',
 'Humour',
 'Language, Linguistics & Writing',
 'Law',
 'Literature & Fiction',
 'Maps & Atlases',
 'Medicine & Health Sciences',
 'Politics',
 'Reference',
 'Religion',
 'Romance',
 'School Books',
 'Science & Mathematics',
 'Sciences, Technology & Medicine',
 'Society & Social Sciences',
 'Sports',
 'Textbooks & Study Guides',
 'Travel']

In [19]:
def get_topic_urls(doc):
    selection_class="_p13n-zg-nav-tree-all_style_zg-browse-item__1rdKf _p13n-zg-nav-tree-all_style_zg-browse-height-large__1z5B8"
    genre_url_tag=doc.find_all('div',{ 'class':selection_class})
    topic_urls = []
    base_url='https://www.amazon.in'
    for tag in genre_url_tag:
        link = tag.find('a')
        if link is not None and link.has_attr('href'):
            topic_urls.append(base_url + link['href'])
        else:
            # the top-level "Books" entry has no link
            topic_urls.append('No URL')
    return topic_urls

In [20]:
urls=get_topic_urls(doc)
urls

['No URL',
 'https://www.amazon.in/gp/bestsellers/books/1318158031',
 'https://www.amazon.in/gp/bestsellers/books/1318052031',
 'https://www.amazon.in/gp/bestsellers/books/1318064031',
 'https://www.amazon.in/gp/bestsellers/books/1318068031',
 'https://www.amazon.in/gp/bestsellers/books/1318073031',
 'https://www.amazon.in/gp/bestsellers/books/1318104031',
 'https://www.amazon.in/gp/bestsellers/books/1318105031',
 'https://www.amazon.in/gp/bestsellers/books/1318118031',
 'https://www.amazon.in/gp/bestsellers/books/1318161031',
 'https://www.amazon.in/gp/bestsellers/books/22960344031',
 'https://www.amazon.in/gp/bestsellers/books/4149751031',
 'https://www.amazon.in/gp/bestsellers/books/1402038031',
 'https://www.amazon.in/gp/bestsellers/books/1318128031',
 'https://www.amazon.in/gp/bestsellers/books/23033693031',
 'https://www.amazon.in/gp/bestsellers/books/4149418031',
 'https://www.amazon.in/gp/bestsellers/books/1318164031',
 'https://www.amazon.in/gp/bestsellers/books/4149493031',
 ...]

## Import pandas to create a DataFrame

In [21]:
import pandas as pd

In [22]:
def scrape_topics():
    topics_url = 'https://www.amazon.in/gp/bestsellers/books/'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    doc = BeautifulSoup(response.text, 'html.parser')
    topics_dict = {
        'title': get_topic_titles(doc),
        'url': get_topic_urls(doc)
    }
    return pd.DataFrame(topics_dict)

In [23]:
scrape_topics().drop(0,axis=0) 

| | title | url |
|---|---|---|
| 1 | Action & Adventure | https://www.amazon.in/gp/bestsellers/books/131... |
| 2 | Arts, Film & Photography | https://www.amazon.in/gp/bestsellers/books/131... |
| 3 | Biographies, Diaries & True Accounts | https://www.amazon.in/gp/bestsellers/books/131... |
| 4 | Business & Economics | https://www.amazon.in/gp/bestsellers/books/131... |
| 5 | Children's & Young Adult | https://www.amazon.in/gp/bestsellers/books/131... |
| 6 | Comics & Mangas | https://www.amazon.in/gp/bestsellers/books/131... |
| 7 | Computing, Internet & Digital Media | https://www.amazon.in/gp/bestsellers/books/131... |
| 8 | Crafts, Home & Lifestyle | https://www.amazon.in/gp/bestsellers/books/131... |
| 9 | Crime, Thriller & Mystery | https://www.amazon.in/gp/bestsellers/books/131... |
| 10 | Engineering | https://www.amazon.in/gp/bestsellers/books/229... |


## Extracting information for all genres

In [24]:
genre_url='https://www.amazon.in/gp/bestsellers/books/1318158031'

In [25]:
response=requests.get(genre_url)

In [26]:
genre_doc = BeautifulSoup(response.text, 'html.parser')

In [27]:
div_tags= genre_doc.find_all('div',{'class':"zg-grid-general-faceout"})
#div_tags

In [28]:
import os

# one list per CSV column; filled by the helper functions below
books_dict_genre={
        'Book_Name':[],
        'Author_Name':[],
        'Book_URL':[],
        'Edition_Type':[],
        'Price':[],
        'Star_Rating':[],
        'Reviews':[]
    }    

def get_topic_page(genre_url):
    # download the genre page
    response = requests.get(genre_url)
    # check for a successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(genre_url))
    # parse using BeautifulSoup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc


def genre_books_info(div_tags):
    #extracting book names
    Book_Name_tags =div_tags.find('span')
    #extracting author name of books 
    Author_Name_tags = div_tags.find('a', class_ = 'a-size-small a-link-child')
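    #extracting books urls (None when the link is missing, so book_url() can record 'Missing')
    link_tag = div_tags.find('a', class_ = 'a-link-normal')
    Book_URL = ('https://amazon.in' + link_tag['href']) if link_tag is not None else None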
    #extracting edition type of books
    Edition_Type_tags = div_tags.find('span', class_ = 'a-size-small a-color-secondary a-text-normal')
    #extracting price tag of book 
    Price_tags = div_tags.find('span', class_ = 'p13n-sc-price')
    #extracting star rating of books
    Star_Rating_tags = div_tags.find('span', class_ = 'a-icon-alt')
    #extracting review of books
    Reviews_tags = div_tags.find('span', class_ = 'a-size-small')
    return Book_Name_tags, Author_Name_tags, Book_URL, Edition_Type_tags, Price_tags, Star_Rating_tags, Reviews_tags
    
def book_name(genre_info):
    if genre_info[0] is not None:
        books_dict_genre['Book_Name'].append(genre_info[0].text)
    else:
        books_dict_genre['Book_Name'].append('Missing')
    return books_dict_genre

def author_name(genre_info):
    if genre_info[1] is not None:
        books_dict_genre['Author_Name'].append(genre_info[1].text)
    else:
        books_dict_genre['Author_Name'].append('Missing')
    return books_dict_genre  

def book_url(genre_info):
    if genre_info[2] is not None:
        books_dict_genre['Book_URL'].append(genre_info[2])
    else:
        books_dict_genre['Book_URL'].append('Missing')
    return books_dict_genre

def edition_type(genre_info) :   
    if genre_info[3] is not None:
        books_dict_genre['Edition_Type'].append(genre_info[3].text)
    else:
        books_dict_genre['Edition_Type'].append('Missing')
    return books_dict_genre 


def book_price(genre_info):
    if genre_info[4] is not None:
        books_dict_genre['Price'].append(genre_info[4].text)
    else:
        books_dict_genre['Price'].append('Missing')
    return books_dict_genre

def star_rating(genre_info):
    if genre_info[5] is not None:
        books_dict_genre['Star_Rating'].append(genre_info[5].text)
    else:
        books_dict_genre['Star_Rating'].append('Missing') 
    return books_dict_genre    

def book_reviews(genre_info):
    if genre_info[6] is not None:
        books_dict_genre['Reviews'].append(genre_info[6].text)
    else:
        books_dict_genre['Reviews'].append('Missing')
    return books_dict_genre

def get_genre_books(genre_doc):
    # start from empty columns so books from a previously scraped genre are not carried over
    for column in books_dict_genre.values():
        column.clear()
    div_selection_class = 'zg-grid-general-faceout'
    # each matching div holds the details of one book
    div_tags = genre_doc.find_all('div', class_ = div_selection_class)
    for div_tag in div_tags:
        genre_info = genre_books_info(div_tag)
        book_name(genre_info)
        author_name(genre_info)
        book_url(genre_info)
        edition_type(genre_info)
        book_price(genre_info)
        star_rating(genre_info)
        book_reviews(genre_info)
    return pd.DataFrame(books_dict_genre)

get_genre_books(genre_doc)

| | Book_Name | Author_Name | Book_URL | Edition_Type | Price | Star_Rating | Reviews |
|---|---|---|---|---|---|---|---|
| 0 | Harry Potter and the Philosopher's Stone | J.K. Rowling | https://amazon.in/Harry-Potter-Philosophers-St... | Kindle Edition | ₹299.00 | 4.7 out of 5 stars | 41751 |
| 1 | The Magicians of Mazda | Ashwin Sanghi | https://amazon.in/Magicians-Mazda-Ashwin-Sangh... | Paperback | ₹264.00 | 5.0 out of 5 stars | 7 |
| 2 | The Complete Novels of Sherlock Holmes | Arthur Conan Doyle | https://amazon.in/Complete-Novels-Sherlock-Hol... | Paperback | ₹139.00 | 4.5 out of 5 stars | 13230 |
| 3 | The Magicians Of Mazda | Ashwin Sanghi | https://amazon.in/Magicians-Mazda-Ashwin-Sangh... | Kindle Edition | ₹326.80 | 5.0 out of 5 stars | 7 |
| 4 | The Silent Patient: The record-breaking, multi... | Alex Michaelides | https://amazon.in/Silent-Patient-Alex-Michaeli... | Paperback | ₹285.00 | 4.5 out of 5 stars | 100341 |
| 5 | The Secret Garden | Frances Hodgson Burnett | https://amazon.in/Secret-Garden-Frances-Hodgso... | Paperback | ₹120.00 | 4.4 out of 5 stars | 10185 |
| 6 | Harry Potter and the Chamber of Secrets | J.K. Rowling | https://amazon.in/Harry-Potter-Chamber-Secrets... | Kindle Edition | ₹299.00 | 4.7 out of 5 stars | 33962 |
| 7 | Something I Never Told You | Shravya Bhinder | https://amazon.in/Something-I-Never-Told-You/d... | Paperback | ₹150.00 | 4.3 out of 5 stars | 1765 |
| 8 | Harry Potter and the Order of the Phoenix | J.K. Rowling | https://amazon.in/Harry-Potter-Order-Phoenix-R... | Kindle Edition | ₹299.00 | 4.7 out of 5 stars | 22478 |
| 9 | The Nightingale | Kristin Hannah | https://amazon.in/Nightingale-Kristin-Hannah/d... | Paperback | ₹296.00 | 4.6 out of 5 stars | 76977 |
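The price, star rating and review count are scraped as text such as `₹299.00`, `4.7 out of 5 stars` and `39,452`. If they are needed as numbers for sorting or averaging, they can be converted with a few small helpers. This is a sketch under the assumption that the values follow the formats shown in this article; the function names are hypothetical.

```python
import re


def parse_price(text):
    """'₹299.00' -> 299.0; returns None when no number is present."""
    match = re.search(r'\d[\d,]*(?:\.\d+)?', text)
    return float(match.group().replace(',', '')) if match else None


def parse_rating(text):
    """'4.7 out of 5 stars' -> 4.7; returns None for other text."""
    match = re.match(r'\s*(\d+(?:\.\d+)?) out of 5', text)
    return float(match.group(1)) if match else None


def parse_reviews(text):
    """'39,452' -> 39452; returns None for other text."""
    digits = text.replace(',', '').strip()
    return int(digits) if digits.isdecimal() else None


price = parse_price('₹299.00')
rating = parse_rating('4.7 out of 5 stars')
reviews = parse_reviews('39,452')
print(price, rating, reviews)
```

On a DataFrame, each helper could be applied to its column, for example `df['Price'].map(parse_price)`.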


### Extracting the book name (Harry Potter, Action & Adventure genre)
![](https://i.imgur.com/mwpanA5.png)

### Extracting the author name (Harry Potter, Action & Adventure genre)

![](https://i.imgur.com/8QGVBYl.png)


### Extracting the book URL (Harry Potter, Action & Adventure genre)
![](https://i.imgur.com/ZuD98Ej.png)


### Extracting the edition type (Harry Potter, Action & Adventure genre)
![](https://i.imgur.com/FZmwnmQ.png)

### Extracting the star rating (Harry Potter, Action & Adventure genre)
![](https://i.imgur.com/j6SGsQT.png)

### Extracting the number of reviews (Harry Potter, Action & Adventure genre)

![](https://i.imgur.com/pggNwnm.png)
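Every field follows the same pattern: use the tag's text if the tag was found, otherwise record `Missing`. One way to express that once, sketched here with hypothetical helpers `text_or_missing` and `build_row`, is to build a dictionary per book. `SimpleNamespace` objects stand in for Beautiful Soup tags so the example runs on its own.

```python
from types import SimpleNamespace

COLUMNS = ['Book_Name', 'Author_Name', 'Book_URL', 'Edition_Type',
           'Price', 'Star_Rating', 'Reviews']


def text_or_missing(tag, default='Missing'):
    """Return the stripped text of a tag, or a placeholder when the tag is None."""
    return tag.text.strip() if tag is not None else default


def build_row(genre_info):
    """Turn the 7-item tuple returned by genre_books_info() into one row dict."""
    row = {}
    for name, value in zip(COLUMNS, genre_info):
        if name == 'Book_URL':
            # the URL is already a string, not a tag
            row[name] = value if value is not None else 'Missing'
        else:
            row[name] = text_or_missing(value)
    return row


# stand-ins for the tags found on the page (hypothetical values)
genre_info = (
    SimpleNamespace(text="Harry Potter and the Philosopher's Stone"),
    SimpleNamespace(text='J.K. Rowling'),
    'https://amazon.in/Harry-Potter-Philosophers-Stone-Rowling-ebook/dp/B019PIOJYU/'
    'ref=zg_bs_1318158031_1/000-0000000-0000000?pd_rd_i=B019PIOJYU&psc=1',
    SimpleNamespace(text='Kindle Edition'),
    None,  # pretend the price tag was not found
    SimpleNamespace(text='4.7 out of 5 stars'),
    SimpleNamespace(text='39,452'),
)

row = build_row(genre_info)
print(row)
```

A list of such rows can be passed to `pd.DataFrame(rows)`, which avoids sharing one module-level dictionary between genres.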

In [29]:
def scrape_genre(genre_url, path):
    if os.path.exists(path):
        print('The file {} already exists.. Skipping...'.format(path))
        return
    genre_df = get_genre_books(get_topic_page(genre_url))
    genre_df.to_csv(path, index = None)
    
def scrape_genre_books():
    print('Scraping list of book genres')
    genres_df = scrape_topics()
    genres_df = genres_df.drop(0,axis=0)
    #print(genres_df)
    os.makedirs('data', exist_ok = True)
    for index, row in genres_df.iterrows():
        print('Scraping bestselling books for the genre "{}"'.format(row['title']))
        scrape_genre(row['url'], 'data/{}.csv'.format(row['title']))   

In [32]:
scrape_genre_books()

Scraping list of book genres
Scraping bestselling books for the genre "Action & Adventure"
The file data/Action & Adventure.csv already exists.. Skipping...
Scraping bestselling books for the genre "Arts, Film & Photography"
The file data/Arts, Film & Photography.csv already exists.. Skipping...
Scraping bestselling books for the genre "Biographies, Diaries & True Accounts"
The file data/Biographies, Diaries & True Accounts.csv already exists.. Skipping...
Scraping bestselling books for the genre "Business & Economics"
Scraping bestselling books for the genre "Children's & Young Adult"
Scraping bestselling books for the genre "Comics & Mangas"
Scraping bestselling books for the genre "Computing, Internet & Digital Media"
Scraping bestselling books for the genre "Crafts, Home & Lifestyle"
Scraping bestselling books for the genre "Crime, Thriller & Mystery"
Scraping bestselling books for the genre "Engineering"
Scraping bestselling books for the genre "Exam Preparation"
...
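The genre titles are used directly as file names, so the files contain spaces, commas, ampersands and apostrophes. Most file systems accept these, but they are awkward on the command line, and a `/` in a title would be read as a folder separator. A possible sketch of a safer naming scheme, with `genre_filename` as a hypothetical helper:

```python
import re


def genre_filename(title):
    """Turn a genre title into a file-system-friendly CSV file name."""
    # replace every run of characters other than letters and digits with "_"
    slug = re.sub(r'[^A-Za-z0-9]+', '_', title).strip('_')
    return '{}.csv'.format(slug or 'genre')


filenames = [genre_filename(title) for title in
             ['Action & Adventure', "Children's & Young Adult", 'Crime, Thriller & Mystery']]
print(filenames)
```

The result could replace the title in the path, as in `'data/' + genre_filename(row['title'])`.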

### Created data folder 
![](https://i.imgur.com/35lLQnn.png)


### Files stored in data folder

![](https://i.imgur.com/Xi6IkuN.png)

### The data is stored in CSV format
![](https://i.imgur.com/wj2igmU.png)

## Summary 

### What we have done so far

- Installed and imported the libraries.

- Downloaded and parsed the Best Sellers page's HTML source using Requests and Beautiful Soup to get the genre titles and URLs.

- Extracted the genre names and genre URLs.

- Extracted information from each genre page.

- Combined the information extracted from each page into Python dictionaries.

- Saved the data to CSV files using the pandas library.

At the end of the project, we have one CSV file per genre in the following format:


```
Book_Name,Author_Name,Book_URL,Edition_Type,Price,Star_Rating,Reviews
Harry Potter and the Philosopher's Stone,J.K. Rowling,https://amazon.in/Harry-Potter-Philosophers-Stone-Rowling-ebook/dp/B019PIOJYU/ref=zg_bs_1318158031_1/000-0000000-0000000?pd_rd_i=B019PIOJYU&psc=1,Kindle Edition,₹299.00,4.7 out of 5 stars,"39,452"
"The Silent Patient: The record-breaking, multimillion copy Sunday Times bestselling thriller and Richard & Judy book club pick",Alex Michaelides,https://amazon.in/Silent-Patient-Alex-Michaelides/dp/1409181634/ref=zg_bs_1318158031_2/000-0000000-0000000?pd_rd_i=1409181634&psc=1,Paperback,₹279.00,4.5 out of 5 stars,"92,969"
```


## References 

Here are some links to learn more about the libraries used:

- Requests: https://docs.python-requests.org/en/latest/
- Beautiful Soup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
- W3Schools Python tutorial: https://www.w3schools.com/python/
- pandas (W3Schools): https://www.w3schools.com/python/pandas/default.asp
- itertools.chain (GeeksforGeeks): https://www.geeksforgeeks.org/python-itertools-chain/
