### Data Collection:

In [58]:
import requests # used to send HTTP requests to web servers
from bs4 import BeautifulSoup # parsing HTML and XML documents
import pandas as pd # powerful data manipulation and analysis library
import numpy as np # used for numerical computations in Python


#### 1 - Web Scraping:

We decided to collect data from Amazon's bestselling books by scraping two pages of the bestseller list (each page contains around 50 books).

##### Using web scraping code on Amazon's bestselling books

In [59]:
no_pages = 2
def get_data(pageNo):
    headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0", 
               "Accept-Encoding":"gzip, deflate", 
               "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", 
               "DNT":"1", "Connection":"close", 
               "Upgrade-Insecure-Requests":"1"}

    r = requests.get(f'https://www.amazon.sa/-/en/gp/bestsellers/books/ref=zg_bs_pg_1_books?ie=UTF8&pg={pageNo}&language=en&crid=1MSN01VVU9GYY&qid=1711400365&rnid=12463048031&sprefix=engl+book%2Cstripbooks%2C312&ref=sr_pg_{pageNo}', headers=headers)
    content = r.content
    soup = BeautifulSoup(content, "html.parser")

    # Each bestseller entry sits in its own "faceout" container.
    books = soup.find_all('div', attrs={'class': 'zg-grid-general-faceout'})

    def text_or_null(tag):
        # Return the element's text, or the "Null" placeholder when it is missing.
        return tag.text if tag is not None else "Null"

    alls = []
    for d in books:
        name = d.find('div', attrs={'class':'_cDEzb_p13n-sc-css-line-clamp-1_1Fn1y'})
        price = d.find('span', attrs={'class':'_cDEzb_p13n-sc-price_3mJ9Z'})
        rating = d.find('span', attrs={'class':'a-icon-alt'})
        users_rated = d.find('span', attrs={'aria-hidden':'true'})
        # First 'a-row' div in the card; not guaranteed to be the author row.
        author = d.find('div', attrs={'class':'a-row'})
        format_type = d.find('span', attrs={'class':'a-text-normal'})
        genre = d.find('div', attrs={'class':'a-row a-size-base a-color-base'})
        cover_image = d.find('img', attrs={'class': 'a-dynamic-image p13n-sc-dynamic-image p13n-product-image'})

        # One row per book, in the same order as the DataFrame columns.
        alls.append([
            text_or_null(name),
            text_or_null(price),
            text_or_null(rating),
            text_or_null(users_rated),
            text_or_null(author),
            text_or_null(format_type),
            text_or_null(genre),
            cover_image['src'] if cover_image is not None else "No Image",
        ])

    print(f"Books found : {len(books)}")
    return alls



In [60]:
results = []
for i in range(1, no_pages+1):
    results.append(get_data(i))
flatten = lambda l: [item for sublist in l for item in sublist]
df = pd.DataFrame(flatten(results), columns=[
    'Title',          
    'Price',         
    'Rating',          
    'Num Of Reviews', 
    'Author',         
    'Book Type',      
    'Genre',     
    'Cover Image'     
])

Books found : 30
Books found : 30


We notice that only 30 of the roughly 50 books on each page were extracted, possibly because the server blocked part of the request. We therefore collected the remaining books with a web scraping extension for Google Chrome.

- Web Scraper - Free Web Scraping: https://chromewebstore.google.com/detail/web-scraper-free-web-scra/jnhgnonknehpejjnehehllkliplmbmhn?hl=en
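
One way to test the blocking hypothesis before switching tools is to inspect the response itself. The sketch below is only a rough heuristic: the helper name `diagnose_page`, the `"captcha"` marker, and the default of 50 expected books are assumptions for illustration, not something Amazon documents.

```python
def diagnose_page(status_code, html, expected=50):
    """Classify a bestseller response; every marker used here is an assumption."""
    # A non-200 status means the listing was never served.
    if status_code != 200:
        return "http_error"
    # Assumed marker: a robot-check page is expected to mention "captcha".
    if "captcha" in html.lower():
        return "captcha"
    # Rough count of book containers, based on the class name used in get_data.
    found = html.count("zg-grid-general-faceout")
    if found == 0:
        return "no_books"
    if found < expected:
        return "partial"
    return "ok"
```

Inside `get_data`, it could be called as `diagnose_page(r.status_code, r.text)` right after the request. A `"partial"` result with status 200 would suggest that the missing books are simply absent from the initial HTML (for example, because they are loaded later by a script) rather than that the request was refused outright.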

In [61]:
# check what the collected data looks like
df.head(100)

Unnamed: 0,Title,Price,Rating,Num Of Reviews,Author,Book Type,Genre,Cover Image
0,كتاب التحصيلي علمي 46-47 (2025),SAR 98.00,4.3 out of 5 stars,9,Nasser bin Abdulaziz Al-Abdulkarim,Paperback,Null,https://images-eu.ssl-images-amazon.com/images...
1,El Sharq library المعاصر 9 تاسيس كمي 2/1 ورقي ...,SAR 107.58,4.5 out of 5 stars,226,عماد الجزيري,Paperback,Null,https://images-eu.ssl-images-amazon.com/images...
2,Coloriages mystères Disney Princesses: Colorie...,SAR 109.10,4.7 out of 5 stars,5863,Jérémy Mariez,Paperback,Null,https://images-eu.ssl-images-amazon.com/images...
3,My First Library : Boxset Of 10 Board Books Fo...,SAR 47.00,4.6 out of 5 stars,80669,Wonder House Books,Board book,Null,https://images-eu.ssl-images-amazon.com/images...
4,Null,SAR 65.00,4.7 out of 5 stars,12574,"4.7 out of 5 stars 12,574",Paperback,Null,https://images-eu.ssl-images-amazon.com/images...
5,فاتتني صلاة,SAR 26.00,4.7 out of 5 stars,301,اسلام جمال,Unknown Binding,Null,https://images-eu.ssl-images-amazon.com/images...
6,Atomic Habits: An Easy & Proven Way to Build G...,SAR 89.00,4.8 out of 5 stars,73014,James Clear,Hardcover,Null,https://images-eu.ssl-images-amazon.com/images...
7,Golden Books The Tale of Peter Rabbit,SAR 9.00,4.8 out of 5 stars,1893,Beatrix Potter,Hardcover,Null,https://images-eu.ssl-images-amazon.com/images...
8,White Nights,SAR 19.00,4.6 out of 5 stars,1509,Fyodor Dostoyevsky,Mass Market Paperback,Null,https://images-eu.ssl-images-amazon.com/images...
9,The Psychology of Money: Timeless Lessons on W...,SAR 55.00,4.7 out of 5 stars,20443,Morgan Housel,Paperback,Null,https://images-eu.ssl-images-amazon.com/images...


Since the genre is only listed on each book's own page, we had to collect it manually.
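
A minimal sketch of how hand-collected genres could be attached to the table, assuming they are kept as a title-to-genre lookup. The dictionary `manual_genres`, the example titles, and the genre values are all hypothetical placeholders, not the data actually collected.

```python
import pandas as pd

# Stand-in for the scraped frame; the real `df` has the same two columns plus others.
df = pd.DataFrame({
    "Title": ["Example Book A", "Example Book B", "Example Book C"],
    "Genre": ["Null", "Null", "Null"],
})

# Hypothetical hand-collected lookup (title -> genre); values are illustrative only.
manual_genres = {
    "Example Book A": "Fiction",
    "Example Book B": "Self-help",
}

# Map each title to its genre; titles without an entry keep the "Null" placeholder.
df["Genre"] = df["Title"].map(manual_genres).fillna(df["Genre"])
```

Matching on the title only works when the titles in the lookup are spelled exactly as they were scraped, so truncated or missing titles would need to be handled separately.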

*Results might be different, since we collected them a few days ago and Amazon's bestseller list may have changed slightly.*
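
Because part of the data came from the Chrome extension, the two sources have to be brought into one table at some point. The following is a sketch of one possible way to do that, assuming the extension's export has been loaded into a DataFrame with matching column names; the helper `combine_sources` and the example rows are hypothetical.

```python
import pandas as pd

def combine_sources(scraped, exported):
    """Stack two book tables and drop titles that appear in both."""
    # Align the export to the scraped column order; missing columns become NaN.
    exported = exported.reindex(columns=scraped.columns)
    combined = pd.concat([scraped, exported], ignore_index=True)
    # Keep the first occurrence of each title, so the scraped row wins on ties.
    return combined.drop_duplicates(subset="Title", keep="first").reset_index(drop=True)

# Illustrative rows only; not taken from the real data.
scraped = pd.DataFrame({
    "Title": ["Example Book A", "Example Book B"],
    "Price": ["SAR 10.00", "SAR 20.00"],
})
exported = pd.DataFrame({
    "Price": ["SAR 20.00", "SAR 30.00"],
    "Title": ["Example Book B", "Example Book C"],
})
combined = combine_sources(scraped, exported)
```

Note that deduplicating on `Title` would also collapse every row whose title is the `"Null"` placeholder into one, so such rows are better repaired or set aside first.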

In [62]:
# save the raw data as a CSV file
df.to_csv("amazon_raw_books.csv", index=False)