# Scraping Amazon Bestseller Books Using Python

The **Amazon Bestseller Books** page is a popular place to find top-selling books. It covers many genres, such as action & adventure, travel, and romance, and within each genre books are ranked based on rating and the number of buyers. In general, the information for a book includes its name, author, rating, number of reviewers, and URL. These fields are packed inside an HTML tag, and our job is to write functions that extract the information from that tag.
![](https://hips.hearstapps.com/hmg-prod.s3.amazonaws.com/images/best-selling-books-2020-1607638264.png)

**Purpose**: This project shows how to scrape information about the bestseller books (name, author, rating, and URL) in each genre from the Amazon website: https://www.amazon.in/gp/bestsellers/books/. The information is stored in a data frame and saved as one CSV file per category. To keep things organized, a folder is created to hold all the CSV files.
- Along the way, the project introduces the tools used to scrape the bestseller books, such as Python, requests, Beautiful Soup, and pandas.
![](https://imgur.com/adDRt5p)

**Project Outline:**
- Scraped Website: https://www.amazon.in/gp/bestsellers/books/
- Get a list of genres. For each genre, we'll get the genre name and the genre page URL.
- For each genre, we'll get information about the bestseller books from that genre page.
- For each book, we'll get the book name, author name, rating, and the book's URL.
- For example, in the action & adventure genre:
```
Book Name,Author,Rating,URL
Harry Potter and the Philosopher's Stone,J.K. Rowling,4.7 out of 5 stars,https://www.amazon.in/Harry-Potter-Philosophers-Stone-Rowling-ebook/dp/B019PIOJYU/ref=zg_bs_1318158031_1/000-0000000-0000000?pd_rd_i=B019PIOJYU&psc=1
அன்புள்ள மாயவனே (Tamil Edition),ammu yoga,4.3 out of 5 stars,https://www.amazon.in/%E0%AE%85%E0%AE%A9%E0%AF%8D%E0%AE%AA%E0%AF%81%E0%AE%B3%E0%AF%8D%E0%AE%B3-%E0%AE%AE%E0%AE%BE%E0%AE%AF%E0%AE%B5%E0%AE%A9%E0%AF%87-Tamil-ammu-yoga-ebook/dp/B0B6PSKHHJ/ref=zg_bs_1318158031_2/000-0000000-0000000?pd_rd_i=B0B6PSKHHJ&psc=1
"The Silent Patient: The record-breaking, multimillion copy Sunday Times bestselling thriller and Richard & Judy book club pick",Alex Michaelides,4.5 out of 5 stars,https://www.amazon.in/Silent-Patient-Alex-Michaelides/dp/1409181634/ref=zg_bs_1318158031_3/000-0000000-0000000?pd_rd_i=1409181634&psc=1
```

## 1. Scrape the category page from Amazon

**Objective:** Obtain the list of book categories from the bestseller page, along with their URLs.

- **Procedure**:
    - Use requests to download the Amazon Bestseller page.
    - Use Beautiful Soup (bs4) to parse the page and extract the name and URL of each category.
        - Two functions do the extraction: `get_category_name(doc)` and `get_category_url(doc)`.
    - Convert the result to a pandas data frame containing the name and URL of each category page.

**Install and Import necessary libraries:**

In [1]:
!pip install requests pandas beautifulsoup4 --upgrade --quiet

In [2]:
import requests
import pandas as pd
import os
from bs4 import BeautifulSoup

**Function to download the Bestseller Amazon page:**

In [3]:
def get_amazon_page():
    # Download the bestseller page and return it as a parsed BeautifulSoup document
    url = 'https://www.amazon.in/gp/bestsellers/books/'
    response = requests.get(url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(url))
    doc = BeautifulSoup(response.text,'html.parser')
    return doc

In [4]:
doc = get_amazon_page()

In [5]:
type(doc)

bs4.BeautifulSoup

**Some helper functions to parse and extract information of each book category page:**

In [6]:
def get_category_name(cat_doc):
    # Return the name of each category
    select_class = "_p13n-zg-nav-tree-all_style_zg-browse-item__1rdKf _p13n-zg-nav-tree-all_style_zg-browse-height-large__1z5B8"
    class_tags = cat_doc.find_all(class_=select_class)
    categories_tag = class_tags[1:]
    name_categories = []
    for tag in categories_tag:
        name_categories.append(tag.text)
    return name_categories

`get_category_name()` receives the parsed page, finds all tags containing a book category name, and returns the list of category names.

In [7]:
def get_category_url(cat_doc):
    # Return the URL of each category
    select_class = "_p13n-zg-nav-tree-all_style_zg-browse-item__1rdKf _p13n-zg-nav-tree-all_style_zg-browse-height-large__1z5B8"
    class_tags = cat_doc.find_all(class_=select_class)
    categories_tag = class_tags[1:]
    category_urls = []
    base_url = 'https://www.amazon.in'
    for tag in categories_tag:
        a_tag = tag.find('a')
        category_urls.append(base_url + a_tag['href'])
    return category_urls

`get_category_url()` finds the same category tags, extracts the link from each one, and returns the list of category URLs.

- **Explanation:**
    - On the bestseller page, first inspect a category link and find the tag that represents each category.
    - To find the name and link of each category, use `find_all` to find all tags that have the `class`: `_p13n-zg-nav-tree-all_style_zg-browse-item__1rdKf _p13n-zg-nav-tree-all_style_zg-browse-height-large__1z5B8`
    - To get the name, simply use `.text` on each of those tags.
    - To get the link, loop over the tags and, for each one, use `find` to locate the `a` tag and read its `href` attribute. Finally, prepend the `base_url` (`https://www.amazon.in`) to form the complete link.
   
![](https://i.imgur.com/URwK9Dy.png)
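
The extraction steps above can be tried offline on a small HTML fragment. The sketch below is a minimal illustration, not the notebook's actual code: the markup and the class string `nav-item nav-large` are made up to stand in for Amazon's much longer class string, `parse_categories` is a hypothetical helper, and `urljoin` is swapped in for plain string concatenation. The first matching tag is skipped to mirror the `[1:]` slice used above; in this sample it is assumed to be a heading rather than a category.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup standing in for the category sidebar; the real
# Amazon class names are longer and may change at any time.
SAMPLE_HTML = """
<div class="nav-item nav-large"><span>Any Department</span></div>
<div class="nav-item nav-large"><a href="/gp/bestsellers/books/111">Action &amp; Adventure</a></div>
<div class="nav-item nav-large"><a href="/gp/bestsellers/books/222">Travel</a></div>
"""

BASE_URL = "https://www.amazon.in"


def parse_categories(html):
    """Return (name, absolute URL) pairs for every category entry."""
    soup = BeautifulSoup(html, "html.parser")
    # Passing the full class string matches tags whose class attribute
    # is exactly this string; the first match is skipped (assumed heading).
    tags = soup.find_all(class_="nav-item nav-large")[1:]
    categories = []
    for tag in tags:
        a_tag = tag.find("a")
        if a_tag is None or not a_tag.get("href"):
            continue  # entry without a link: nothing to follow
        categories.append((tag.text.strip(), urljoin(BASE_URL, a_tag["href"])))
    return categories


categories = parse_categories(SAMPLE_HTML)
print(categories)
```

Skipping entries that have no `a` tag is a defensive choice added here; the notebook's functions assume every category tag contains a link.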

**Results:**

In [8]:
book_categories = get_category_name(doc)

In [9]:
len(book_categories)

34

In [10]:
book_categories[:5]

['Action & Adventure',
 'Arts, Film & Photography',
 'Biographies, Diaries & True Accounts',
 'Business & Economics',
 "Children's & Young Adult"]

In [11]:
category_urls = get_category_url(doc);
len(category_urls)

34

In [12]:
category_urls[:5]

['https://www.amazon.in/gp/bestsellers/books/1318158031',
 'https://www.amazon.in/gp/bestsellers/books/1318052031',
 'https://www.amazon.in/gp/bestsellers/books/1318064031',
 'https://www.amazon.in/gp/bestsellers/books/1318068031',
 'https://www.amazon.in/gp/bestsellers/books/1318073031']

## 2. Get info of bestseller books in the page of each genre

**Objective**: In this section, we parse each genre page and extract the book name, author, rating, and URL of every bestseller book on it.

- **Procedure**:
    - Use requests to download a category page.
    - Find all tags that contain a book's information (name, author, rating, and URL) and pass them one by one to helper functions.
    - Use helper functions that take the tag of a single book and return its name, author, rating, and URL separately.
    - Convert the result to a pandas data frame containing the information for all books.

**For example:** Let's take the first genre, Action & Adventure. Below is the URL of that category.

In [13]:
print(book_categories[0],category_urls[0])

Action & Adventure https://www.amazon.in/gp/bestsellers/books/1318158031


**Function to download a category page:**

In [14]:
def get_category_page(cat_url):
    # Download a category page and return it as a parsed BeautifulSoup document
    response = requests.get(cat_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(cat_url))
    cat_doc = BeautifulSoup(response.text,'html.parser')
    return cat_doc

 `get_category_page()` is used to download each book category page.

In [15]:
cat_doc = get_category_page(category_urls[0])

In [16]:
type(cat_doc)

bs4.BeautifulSoup

**Function to get info of books:**

In [17]:
def get_category_book(book_doc):
    # Return a data frame with the info of every book on a category page

    # Find the tag of each book
    book_class = 'zg-grid-general-faceout'
    book_tags = book_doc.find_all(class_=book_class)

    # Collect the info of each book column by column
    book_category_dict = {
        'Book Name': [],
        'Author': [],
        'Rating': [],
        'URL': [],
    }

    for book_tag in book_tags:
        name, author, book_rate, link = get_book_category_info(book_tag)
        book_category_dict['Book Name'].append(name)
        book_category_dict['Author'].append(author)
        book_category_dict['Rating'].append(book_rate)
        book_category_dict['URL'].append(link)
    return pd.DataFrame(book_category_dict)

`get_category_book()` finds the tag for each book by specifying the class attribute. The tags are passed one by one to another helper function, `get_book_category_info`, which returns the name, author, rating, and URL of a single book. The function collects these values and returns a data frame with one row per book.

**Helper function that takes a book tag and returns the info for that book:**

In [18]:
def get_book_category_info(book_tag):
    # Take the tag of a single book and return its name, author, rating, and URL
    name = book_name(book_tag)
    author = book_author(book_tag)
    link = url(book_tag)
    # rating() returns 'NA' when the book has no rating
    book_rate = rating(book_tag)
    return name, author, book_rate, link

`get_book_category_info()` receives the tag of a single book. It calls four helper functions to extract the book's name, author, rating, and URL:
 - `book_name` returns the name of the book
 - `book_author` returns the name of the author
 - `rating` returns the rating of the book
 - `url` returns the URL of the book

In [19]:
def book_name(tag):
    # Return the book name from the tag
    span_tag = tag.find('span')
    return span_tag.text

def book_author(tag):
    # Return the author name from the tag
    author_class = 'a-row a-size-small'
    author_tag = tag.find(class_=author_class)
    return author_tag.text

def rating(tag):
    # Return the rating, or 'NA' if the book has no rating
    rate_tag = tag.find(class_='a-icon-alt')
    if rate_tag is None:
        book_rate = 'NA'
    else:
        book_rate = rate_tag.text
    return book_rate

def url(tag):
    # Return the full URL of the book
    base_url = 'https://www.amazon.in'
    a_tag = tag.find('a')
    return base_url + a_tag['href']

- **Explanation of the procedure:**
    - First, for each category, download the category page with `get_category_page`, which fetches the URL with requests and parses the response with bs4.
    - `get_category_book()` then finds the tags containing the info of all books by using `find_all` to find all tags that have the `class`: `zg-grid-general-faceout`.
    - A loop passes the tag of each book to the helper function `get_book_category_info()`, which returns the name, author, rating, and URL of the book.
    - `get_book_category_info()` relies on four helper functions, one per field: `book_name`, `book_author`, `rating`, and `url`.
   
![](https://i.imgur.com/rJG6TPJ.png)
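
The per-book extraction can also be exercised on a hand-written fragment. The sketch below is only an illustration under stated assumptions: the sample markup is invented (it reuses the class names mentioned above but is far simpler than a real Amazon page), and `text_or_na` and `parse_book` are hypothetical helpers. It extends the `'NA'` fallback that `rating` already uses to every field, so that a book card with a missing element does not raise an `AttributeError`.

```python
from bs4 import BeautifulSoup

# Invented sample: two book cards, the second one has no rating element.
SAMPLE_HTML = """
<div class="zg-grid-general-faceout">
  <a href="/dp/AAA"><span>Book One</span></a>
  <div class="a-row a-size-small">Author One</div>
  <span class="a-icon-alt">4.7 out of 5 stars</span>
</div>
<div class="zg-grid-general-faceout">
  <a href="/dp/BBB"><span>Book Two</span></a>
  <div class="a-row a-size-small">Author Two</div>
</div>
"""

BASE_URL = "https://www.amazon.in"


def text_or_na(tag):
    """Return the stripped text of a tag, or 'NA' when the tag is missing."""
    return tag.get_text(strip=True) if tag is not None else "NA"


def parse_book(card):
    """Extract name, author, rating, and URL from one book card."""
    a_tag = card.find("a")
    has_link = a_tag is not None and a_tag.has_attr("href")
    return {
        "Book Name": text_or_na(card.find("span")),
        "Author": text_or_na(card.find(class_="a-row a-size-small")),
        "Rating": text_or_na(card.find(class_="a-icon-alt")),
        "URL": (BASE_URL + a_tag["href"]) if has_link else "NA",
    }


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
books = [parse_book(card) for card in soup.find_all(class_="zg-grid-general-faceout")]
for book in books:
    print(book)
```

A list of dictionaries like `books` can be passed directly to `pd.DataFrame` to obtain the same four columns as `get_category_book()`.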

**Example Result:**

In [20]:
get_category_book(cat_doc)[:5]

Unnamed: 0,Book Name,Author,Rating,URL
0,War of Lanka (Ram Chandra Series Book 4),Amish Tripathi,,https://www.amazon.in/War-Lanka-Ram-Chandra-Bo...
1,Harry Potter and the Philosopher's Stone,J.K. Rowling,4.7 out of 5 stars,https://www.amazon.in/Harry-Potter-Philosopher...
2,The Complete Novels of Sherlock Holmes,Arthur Conan Doyle,4.5 out of 5 stars,https://www.amazon.in/Complete-Novels-Sherlock...
3,"The Silent Patient: The record-breaking, multi...",Alex Michaelides,4.5 out of 5 stars,https://www.amazon.in/Silent-Patient-Alex-Mich...
4,அன்புள்ள மாயவனே (Tamil Edition),ammu yoga,4.3 out of 5 stars,https://www.amazon.in/%E0%AE%85%E0%AE%A9%E0%AF...


## 3. Saving and Organizing data

**Function to scrape info of bestseller books in ALL categories and save one CSV file per category in the `data` folder.**

In [21]:
def scrape_book_bestseller():
    print('Scraping the top bestseller book in AMAZON:')
    category_df = scrape_category()
    
    os.makedirs('data',exist_ok=True)
    for index,row in category_df.iterrows():
        print('Scraping the top bestsellers book for {} genre'.format(row['Book Categories']))
        scrape_category_book(row['URL'],'data/{}.csv'.format(row['Book Categories']))

- The function above is the outermost function, and is the one we call to scrape the book info in ALL categories. It uses two helper functions:
    - `scrape_category()` scrapes the name and URL of each category and returns them as a data frame. It calls `get_amazon_page` to download the bestseller page using requests, then `get_category_name` and `get_category_url` (defined above) to extract the info for each category.

In [22]:
def scrape_category():
    # Return a data frame with the name and URL of each category
    doc = get_amazon_page()
    category_dict = {
        'Book Categories': get_category_name(doc),
        'URL': get_category_url(doc)
    }
    return pd.DataFrame(category_dict)

- `scrape_category_book()` scrapes the books in a single category. It is called inside the loop above, which passes the URL of each category together with the path of its CSV file. Internally it uses `get_category_page` and `get_category_book`.
    - `scrape_category_book` receives the URL of a category and the path where its CSV file should be saved, and first checks whether that path exists. If it does, the CSV file has already been created and there is no need to scrape again. If not, it passes the URL to `get_category_page()`, which downloads the page with requests and parses it with bs4. `get_category_book()` then takes the parsed document, extracts the info of each book on the category page, and builds the data frame. As shown earlier, `get_category_book()` uses another helper function that takes the tag of a single book and extracts its information.
    - This check acts as a checkpoint. While the program is running, fetching a page sometimes fails with an error. When the program is run again, categories whose CSV file already exists are skipped, so the same file is not created twice.

In [23]:
def scrape_category_book(cat_url,path):
    if os.path.exists(path):
        print('The file {} already exists. Skipping...'.format(path))
        return
    book_df = get_category_book(get_category_page(cat_url))
    book_df.to_csv(path, index=False)
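
Because page loads can fail intermittently, one possible complement to the file-existence check is to retry a failed download a few times before giving up. The sketch below is an illustration under stated assumptions rather than part of the original notebook: `fetch_with_retry` and `flaky_fetch` are hypothetical names, the retry count and delay are arbitrary, and the fake fetch function only simulates two failures so the example runs without network access.

```python
import time


def fetch_with_retry(fetch, url, attempts=3, delay=1.0):
    """Call fetch(url), retrying on any exception up to `attempts` times."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception as error:
            last_error = error
            if attempt < attempts:
                # Wait a little longer after each failed attempt
                time.sleep(delay * attempt)
    raise last_error


# Simulated downloader (hypothetical): fails twice, then succeeds.
calls = {"count": 0}


def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise Exception('Failed to load page {}'.format(url))
    return 'page for ' + url


result = fetch_with_retry(flaky_fetch, 'https://example.com/category', delay=0)
print(result, 'after', calls['count'], 'attempts')
```

In the notebook, the same wrapper could be applied as `fetch_with_retry(get_category_page, cat_url)` inside `scrape_category_book`.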

## 4. Put it together - Final Code

#### a. Functions for scraping the categories

In [24]:
def get_category_name(cat_doc):
    # Return the name of each category
    select_class = "_p13n-zg-nav-tree-all_style_zg-browse-item__1rdKf _p13n-zg-nav-tree-all_style_zg-browse-height-large__1z5B8"
    class_tags = cat_doc.find_all(class_=select_class)
    categories_tag = class_tags[1:]
    name_categories = []
    for tag in categories_tag:
        name_categories.append(tag.text)
    return name_categories

def get_category_url(cat_doc):
    # Return the URL of each category
    select_class = "_p13n-zg-nav-tree-all_style_zg-browse-item__1rdKf _p13n-zg-nav-tree-all_style_zg-browse-height-large__1z5B8"
    class_tags = cat_doc.find_all(class_=select_class)
    categories_tag = class_tags[1:]
    category_urls = []
    base_url = 'https://www.amazon.in'
    for tag in categories_tag:
        a_tag = tag.find('a')
        category_urls.append(base_url + a_tag['href'])
    return category_urls

def get_amazon_page():
    # Download the bestseller page and return it as a parsed BeautifulSoup document
    url = 'https://www.amazon.in/gp/bestsellers/books/'
    response = requests.get(url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(url))
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc

def scrape_category():
    # Return a data frame with the name and URL of each category
    doc = get_amazon_page()
    category_dict = {
        'Book Categories': get_category_name(doc),
        'URL': get_category_url(doc)
    }
    return pd.DataFrame(category_dict)

#### b. Functions for scraping the info of each book on a category page

In [32]:
def book_name(tag):
    # Return the book name from the tag
    span_tag = tag.find('span')
    return span_tag.text

def book_author(tag):
    # Return the author name from the tag
    author_class = 'a-row a-size-small'
    author_tag = tag.find(class_=author_class)
    return author_tag.text

def rating(tag):
    # Return the rating, or 'NA' if the book has no rating
    rate_tag = tag.find(class_='a-icon-alt')
    if rate_tag is None:
        book_rate = 'NA'
    else:
        book_rate = rate_tag.text
    return book_rate

def url(tag):
    # Return the full URL of the book
    base_url = 'https://www.amazon.in'
    a_tag = tag.find('a')
    return base_url + a_tag['href']

def get_book_category_info(book_tag):
    # Take the tag of a single book and return its name, author, rating, and URL
    name = book_name(book_tag)
    author = book_author(book_tag)
    link = url(book_tag)
    # rating() returns 'NA' when the book has no rating
    book_rate = rating(book_tag)
    return name, author, book_rate, link

def get_category_book(book_doc):
    # Return a data frame with the info of every book on a category page
    # Find the tag of each book
    book_class = 'zg-grid-general-faceout'
    book_tags = book_doc.find_all(class_=book_class)

    # Collect the info of each book column by column
    book_category_dict = {
        'Book Name': [],
        'Author': [],
        'Rating': [],
        'URL': [],
    }

    for book_tag in book_tags:
        name, author, book_rate, link = get_book_category_info(book_tag)
        book_category_dict['Book Name'].append(name)
        book_category_dict['Author'].append(author)
        book_category_dict['Rating'].append(book_rate)
        book_category_dict['URL'].append(link)
    return pd.DataFrame(book_category_dict)

def get_category_page(cat_url):
    # Download a category page (same function as in section 2, repeated here
    # so that the final code is complete)
    response = requests.get(cat_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(cat_url))
    cat_doc = BeautifulSoup(response.text, 'html.parser')
    return cat_doc

def scrape_category_book(cat_url, path):
    # Accept the URL of a category, scrape the book info, and save it to the given path
    if os.path.exists(path):
        print('The file {} already exists. Skipping...'.format(path))
        return
    book_df = get_category_book(get_category_page(cat_url))
    book_df.to_csv(path, index=False)

#### c. Outer function for scraping the list of categories and the book info for each category page

In [33]:
def scrape_book_bestseller():
    print('Scraping the top bestseller book in AMAZON:')
    category_df = scrape_category()
    
    os.makedirs('data',exist_ok=True)
    for index,row in category_df.iterrows():
        print('Scraping the top bestsellers book for {} genre'.format(row['Book Categories']))
        scrape_category_book(row['URL'],'data/{}.csv'.format(row['Book Categories']))

## 5. Results

**Check the function that scrapes the list of categories:**

In [34]:
scrape_category()

Unnamed: 0,Book Categories,URL
0,Action & Adventure,https://www.amazon.in/gp/bestsellers/books/131...
1,"Arts, Film & Photography",https://www.amazon.in/gp/bestsellers/books/131...
2,"Biographies, Diaries & True Accounts",https://www.amazon.in/gp/bestsellers/books/131...
3,Business & Economics,https://www.amazon.in/gp/bestsellers/books/131...
4,Children's & Young Adult,https://www.amazon.in/gp/bestsellers/books/131...
5,Comics & Mangas,https://www.amazon.in/gp/bestsellers/books/131...
6,"Computing, Internet & Digital Media",https://www.amazon.in/gp/bestsellers/books/131...
7,"Crafts, Home & Lifestyle",https://www.amazon.in/gp/bestsellers/books/131...
8,"Crime, Thriller & Mystery",https://www.amazon.in/gp/bestsellers/books/131...
9,Engineering,https://www.amazon.in/gp/bestsellers/books/229...


**Check the outer function that scrapes the bestseller books for each category and saves them to CSV files.**

In [35]:
scrape_book_bestseller()

Scraping the top bestseller book in AMAZON:
Scraping the top bestsellers book for Action & Adventure genre
Scraping the top bestsellers book for Arts, Film & Photography genre
Scraping the top bestsellers book for Biographies, Diaries & True Accounts genre
Scraping the top bestsellers book for Business & Economics genre
Scraping the top bestsellers book for Children's & Young Adult genre
Scraping the top bestsellers book for Comics & Mangas genre
Scraping the top bestsellers book for Computing, Internet & Digital Media genre
Scraping the top bestsellers book for Crafts, Home & Lifestyle genre
Scraping the top bestsellers book for Crime, Thriller & Mystery genre
Scraping the top bestsellers book for Engineering genre
Scraping the top bestsellers book for Exam Preparation genre
Scraping the top bestsellers book for Fantasy, Horror & Science Fiction genre
Scraping the top bestsellers book for Health, Family & Personal Development genre
Scraping the top bestsellers book for Health, Fitness

**Comment:** Loading the website fails with an error from time to time. When the code is run again, it checks whether the CSV file for a category exists. If so, it skips that category and moves on to the next one. The code works well and seems to scrape all categories. Let's check the CSV files for some categories.

**History genre:**

In [36]:
# Display top 10 bestseller books in history genre
history_df = pd.read_csv('./data/History.csv');
history_df[:10]

Unnamed: 0,Book Name,Author,Rating,URL
0,My Journey: Transforming Dreams into Actions,A.P.J. Abdul Kalam,4.7 out of 5 stars,https://www.amazon.in/My-Journey-Transforming-...
1,Animals Tales From Panchtantra: Timeless Stori...,Wonder House Books,4.5 out of 5 stars,https://www.amazon.in/Animals-Tales-Panchtantr...
2,TINKLE DIGEST 1,ANANT PAI,4.2 out of 5 stars,https://www.amazon.in/TINKLE-DIGEST-1-ANANT-PA...
3,101 Panchatantra Stories for Children: Colourf...,Om Books Editorial Team,4.5 out of 5 stars,https://www.amazon.in/101-Panchatantra-Stories...
4,Sapiens,Yuval Noah Harari,4.7 out of 5 stars,https://www.amazon.in/Sapiens/dp/B079C1B3H6/re...
5,WINGS OF FIRE: AUTOBIOGRAPHY OF ABDUL KALAM,Arun Tiwari,4.6 out of 5 stars,https://www.amazon.in/Wings-Fire-Autobiography...
6,Man's Search For Meaning: The classic tribute ...,Viktor E Frankl,4.5 out of 5 stars,https://www.amazon.in/Mans-Search-Meaning-Vikt...
7,Autobiography of a Yogi,Paramahansa Yogananda,4.6 out of 5 stars,https://www.amazon.in/Autobiography-Yogi-Param...
8,"Three Thousand Stitches: Ordinary People, Extr...",Sudha Murty,4.6 out of 5 stars,https://www.amazon.in/Three-Thousand-Stitches-...
9,The Theory Of Everything,Stephen Hawking,4.6 out of 5 stars,https://www.amazon.in/Theory-Everything-Stephe...


**Romance genre:**

In [37]:
# Display top 10 bestseller books in romance genre
romance_df = pd.read_csv('./data/Romance.csv');
romance_df[:10]

Unnamed: 0,Book Name,Author,Rating,URL
0,இராவணனா? காவலனா? (Tamil Edition),SANAGEETH NOVELS,4.3 out of 5 stars,https://www.amazon.in/%E0%AE%87%E0%AE%B0%E0%AE...
1,காதலாற்றுப்படை : Kathalaachupadai (Tamil Edition),சுஜா சந்திரன்,4.5 out of 5 stars,https://www.amazon.in/%E0%AE%95%E0%AE%BE%E0%AE...
2,தந்தியில்லா வீணை (Tamil Edition),வியனி நாதன்,4.3 out of 5 stars,https://www.amazon.in/%E0%AE%A4%E0%AE%A8%E0%AF...
3,It Ends With Us: A Novel: Volume 1,Colleen Hoover,4.5 out of 5 stars,https://www.amazon.in/Ends-Us-Novel-Colleen-Ho...
4,சண்டமாருதத்தின் காதல் மலரிவள்...! (Tamil Edition),சர ணிகா,4.1 out of 5 stars,https://www.amazon.in/%E0%AE%9A%E0%AE%A3%E0%AF...
5,மோகத்தை வென்றவளே….!! Mohaththai vendravaley..!...,Sri vinitha ஸ்ரீ வினிதா,4.2 out of 5 stars,https://www.amazon.in/%E0%AE%AE%E0%AF%8B%E0%AE...
6,வசியம் வைத்தேன் வந்து விடு: Vasiyam Vaithen Va...,Surya Saravanan,4.7 out of 5 stars,https://www.amazon.in/%E0%AE%B5%E0%AE%9A%E0%AE...
7,காதல் ரதியே!ரதியே!-Kadhal Rathiye!Rathiye! (Ta...,AHILA ISAAC,4.2 out of 5 stars,https://www.amazon.in/%E0%AE%95%E0%AE%BE%E0%AE...
8,என்னையே தந்தேன் உனக்காக!!!: Ennaiye thanthen u...,ஆத்விகா பொம்மு (Aadvika Pommu),4.5 out of 5 stars,https://www.amazon.in/%E0%AE%8E%E0%AE%A9%E0%AF...
9,பாடவா.. புதுப்பாடலை..: Padava Pudhupadalai (Ta...,முத்துலட்சுமி ராகவன் Muthulakshmi Raghavan,3.3 out of 5 stars,https://www.amazon.in/%E0%AE%AA%E0%AE%BE%E0%AE...


**Travel genre:**

In [38]:
# Display top 10 bestseller books in travel genre
travel_df = pd.read_csv('./data/Travel.csv');
travel_df[:10]

Unnamed: 0,Book Name,Author,Rating,URL
0,World Map - Laminated Both Sides,Dreamland Publications,4.3 out of 5 stars,https://www.amazon.in/World-Map-Dreamland-Publ...
1,Mandala: Colouring Books for Adults with Tear ...,Wonder House Books,4.5 out of 5 stars,https://www.amazon.in/Mandala-Colouring-Books-...
2,Colouring Books Super Boxset: Pack of 6 Crayon...,Wonder House Books,4.4 out of 5 stars,https://www.amazon.in/Colouring-Books-Creative...
3,India Map (Laminated Both Sides ) - With New U...,Dreamland Publications,4.4 out of 5 stars,https://www.amazon.in/India-Map-Dreamland-Publ...
4,Masala Lab: The Science of Indian Cooking: The...,Krish Ashok,4.5 out of 5 stars,https://www.amazon.in/Masala-Lab-Science-India...
5,I Wish I Could Tell Her (Order now to get a si...,Ajay K Pandey,4.8 out of 5 stars,https://www.amazon.in/Wish-Could-Tell-Order-si...
6,Zen: The Art of Simple Living,Shunmyo Masuno,4.6 out of 5 stars,https://www.amazon.in/Zen-Simple-Living-Shunmy...
7,Bruised Passports : Travelling the World as Di...,Savi Munjal,4.2 out of 5 stars,https://www.amazon.in/Bruised-Passports-Travel...
8,Penguin Random House Normal People: One millio...,Sally Rooney,4.2 out of 5 stars,https://www.amazon.in/Normal-People-Sally-Roon...
9,Journey Continues: A Sequel To Apprenticed To ...,M. Sri,4.8 out of 5 stars,https://www.amazon.in/Journey-Continues-Sequel...


**Saving Notebook:**

In [39]:
import jovian

In [40]:
jovian.commit()


[jovian] Updating notebook "tylerdnguyen94/scraping-amazon-bestseller-books-official" on https://jovian.ai
[jovian] Committed successfully! https://jovian.ai/tylerdnguyen94/scraping-amazon-bestseller-books-official


'https://jovian.ai/tylerdnguyen94/scraping-amazon-bestseller-books-official'

## 7. Summary

Here is what we've covered in this notebook:
1. Download the Amazon bestseller books page using `requests`.
2. Parse the HTML source code using Beautiful Soup.
3. Extract the name and URL of each category page into a data frame.
4. For each book category:
    - Parse the HTML source code of the category page using Beautiful Soup.
    - Extract the name, author, rating, and URL of each bestseller book.
    - Compile the information into a dictionary.
    - Return the books and their information (name, author, rating, URL) as a data frame.
5. For each category/genre, save the data frame to a separate CSV file named after the category.
6. Loading the website sometimes fails while the code is running. To avoid creating duplicate CSV files, a checkpoint checks whether the CSV file of a category already exists. If so, that category is skipped.
7. To keep the CSV files organized, a folder is created to hold the CSV files of all genres.

## 8. Future Work

- One item of future work is to convert the rating into a number so that we can manipulate the data. For example, we could find the maximum or minimum rating and compute summary statistics.
- Moreover, we could extract the number of reviews and relate it to the star rating. For example, a rating of 4.9/5 based on only a few reviews is not reliable.
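
As a rough sketch of the first idea, the rating strings shown in the results (for example `4.7 out of 5 stars`) could be turned into numbers with a regular expression. The helper name `rating_to_float` is hypothetical, and the pattern assumes every rating follows the "X out of 5 stars" wording seen above; anything else, including `NA`, is mapped to `None`.

```python
import re


def rating_to_float(text):
    """Convert a string like '4.7 out of 5 stars' to 4.7, or None if it does not match."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)\s+out of\s+5", str(text))
    return float(match.group(1)) if match else None


sample_ratings = ['4.7 out of 5 stars', 'NA', '4.3 out of 5 stars']
numeric = [rating_to_float(text) for text in sample_ratings]

# Ignore missing ratings when computing statistics
valid = [value for value in numeric if value is not None]
print(numeric)
print('max:', max(valid), 'min:', min(valid))
```

On a data frame loaded from one of the CSV files, the same helper could be applied with something like `history_df['Rating'].map(rating_to_float)`.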

## 9. Reference

**Reference:**

https://jovian.ai/learn/zero-to-data-analyst-bootcamp/assignment/project-web-scraping-with-python

**Final Saving Notebook:**

In [None]:
jovian.commit(files=['./data'])
