# 1. Project Overview

### Book Sales Trends

This project studies book data, including the number of reviews and the books that are listed as bestsellers, to uncover key trends such as the most in-demand genres and the factors that attract readers and increase a book’s popularity. The analysis is expected to provide insights that can help publishers and authors improve their marketing strategies and boost the success of their books.

#### The analysis will focus on understanding:

- How do ratings and the number of reviews vary among bestsellers?
- Are certain authors more likely to have their books become bestsellers?
- Does the attractiveness of a book's cover influence its likelihood of becoming a bestseller?
- What genres are most represented among bestsellers?
- What is the relationship between price and bestseller status, and what is the price range of bestseller books?


This project is expected to contribute valuable insights to the publishing industry and help stakeholders make data-driven decisions.

# 2. Data Collecting

## Data sources

#### We collected our data by web scraping from the following online stores:

- **Amazon KSA Online Store:** A global platform with a wide range of books, including international bestsellers.

- **Jarir KSA Online Store:** A leading bookstore in Saudi Arabia, offering both Arabic and English books.

These sources were chosen because they represent a diverse range of books, have a large and diverse audience, and provide relatively complete data. By focusing on bestseller lists, we aim to study the factors that contribute to a book’s success in these markets.

## Data Description

To study the factors influencing bestseller books, we identified key attributes that are likely to have a significant impact on a book’s popularity. After reviewing related studies, research papers, and articles, we referenced the following sources to guide our attribute selection [1][2][3].

#### Based on these references, we selected the following attributes for our dataset:

- **Title:** The name of the book.

- **Price:** The retail price of the book.

- **Rating:** The average customer rating (e.g., out of 5 stars).

- **Num Of Reviews:** The total number of customer reviews.

- **Author:** The name of the author(s).

- **Book Type:** The format of the book (e.g., paperback, hardcover, eBook).

- **Genre:** The category or genre of the book (e.g., fiction, non-fiction, self-help).

- **Cover Image:** The image of the book cover (for visual analysis or reference).

These attributes were chosen because they are commonly associated with a book’s success and can help answer key questions.

## Challenges in Data Collection

Data collection comes with various challenges that can hinder efficiency and accuracy. In our process, which involves web scraping, we faced several key difficulties:

1. **Time-Consuming Process:**
Data collection, especially when using web scraping techniques, requires significant time due to the complexity of extracting and processing data from multiple sources.

2. **Unclear HTML Structure:**
Some essential elements like `<div>` and `<span>` do not have clear or consistent class names, making it difficult to identify and extract the required data efficiently.

3. **Dynamic Content with JavaScript:**
Certain websites load content dynamically using JavaScript, which means that the data may not be visible in the initial HTML source code. This requires additional tools or techniques to handle dynamic content effectively.

4. **Request Limits and Access Restrictions:**
Some data sources impose strict limits on the number of requests that can be made within a specific timeframe, while others require special access permissions or API keys.

5. **Inconsistent Data Availability:**
Some information is available in certain sources but missing in others, leading to incomplete datasets and making it challenging to ensure data consistency and reliability.


## Actions Taken

1.  **Small-Scale Testing Before Full Collection:** We tested the scraping code on a small dataset to ensure accuracy. Once confirmed, we scaled up to collect the full dataset, avoiding repetitive work and saving time. 

2. **Relied on HTML Attributes and Structure:**
To handle unclear or inconsistent class names, we used element IDs or the DOM structure. For elements in arrays with the same class, we relied on their positional consistency to extract data.

3. **Used Selenium for Dynamic Content:**
For JavaScript-loaded content, we implemented Selenium to interact with pages like a browser, ensuring dynamic content was fully loaded before scraping. 

4. **Added Time Delays Between Requests:**
To avoid being blocked, we introduced time delays between requests to simulate natural user behavior, reducing the risk of triggering rate limits. 

5. **Leveraged Ready-Made Scraping Tools:**
Tools like Instant Data Scraper helped us gather initial data efficiently. We collected links to individual pages and accessed them separately to minimize request limits. 

6. **Combined Data from Multiple Sources:**
To address missing or inconsistent data, we merged datasets from different sources, ensuring a more comprehensive and reliable final dataset. 

7. **Conducted Manual Reviews and Validation:**
We manually reviewed samples of scraped data to identify and correct errors, ensuring high data quality and refining our scripts for better accuracy.
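
The delay-based approach from step 4 can be sketched as follows. This is a minimal illustration rather than the project's actual script: `polite_fetch_all`, its parameters, and the delay values are hypothetical names and numbers chosen for the example, and the `fetch` callable stands in for whatever downloads a page (for instance, a wrapper around `requests.get`).

```python
import random
import time


def polite_fetch_all(urls, fetch, base_delay=3.0, jitter=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests.

    base_delay and jitter are illustrative values (assumptions), not tuned
    limits for any particular site. sleep is injectable so the pause can be
    swapped out, for example in tests.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait base_delay plus a random extra of up to `jitter` seconds
            sleep(base_delay + random.uniform(0, jitter))
        results.append(fetch(url))
    return results


# Stand-in fetch function so the sketch runs without network access
pages = polite_fetch_all(
    ["https://example.com/a", "https://example.com/b"],
    fetch=lambda url: f"<html>{url}</html>",
    base_delay=0.1,
    jitter=0.1,
)
print(len(pages))  # 2
```

Adding a small random component to the delay makes the request pattern less uniform than a fixed pause; whether that matters for a given site is an assumption that would need checking against its terms and rate limits.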

## Web Scraping


#### Web Scraping Tools used:
- Web scraper - free web scraping : https://chromewebstore.google.com/detail/web-scraper-free-web-scra/jnhgnonknehpejjnehehllkliplmbmhn?hl=en <br>
<b>Used when we faced server blocking on the Amazon store</b>
- Instant Data Scraper - free web scraping  : https://chromewebstore.google.com/detail/instant-data-scraper/ofaokhiedipichpaobibbnahnkdoiiah <br>
<b>Used to extract bestseller book URLs from Jarir</b>

#### Libraries, Modules, and Methods Used in the Web Scraping Script
- <b>Libraries Used:</b>
1. time → Standard Python library used to add delays between requests.
2. pandas → Handles CSV file operations (reading & saving scraped data).
3. requests → Fetches HTML content from web pages.
4. bs4 (BeautifulSoup) → Parses and extracts static HTML elements.
5. selenium → Automates browser interactions for scraping dynamically loaded content.
- <b>Modules Used (from Selenium):</b>
1. selenium.webdriver → Controls the Chrome browser for web scraping.
2. selenium.webdriver.common.by → Provides mechanisms to locate elements in the HTML (e.g., by class name, ID).
3. selenium.webdriver.support.ui → Contains WebDriverWait for handling dynamically loaded elements.
4. selenium.webdriver.support.expected_conditions (imported as EC) → Defines conditions for checking if elements are present before interacting with them.
- <b>Methods Used:</b>
1. time.sleep(seconds) → Pauses execution to allow page elements to load.
2. pd.read_csv(file, encoding, nrows) → Reads book URLs from a CSV file.
3. pd.concat([df1, df2], ignore_index=True) → Merges scraped data into a single DataFrame.
4. requests.get(url, headers=HEADERS) → Sends an HTTP request to fetch page content.
5. BeautifulSoup(response.text, "html.parser") → Parses the HTML response for static content.
6. driver.get(url) → Loads a webpage using Selenium.
7. WebDriverWait(driver, timeout).until(condition) → Waits for a web element to appear before scraping.
8. EC.presence_of_element_located((By.CLASS_NAME, "tf-rating")) → Checks if an element is present in the DOM.
9. soup.find(tag, class_="class-name") → Finds the first occurrence of an element in the HTML.
10. soup.find_all(tag, class_="class-name") → Finds all matching elements.


### Amazon bestseller books

#### We collected data from Amazon's bestseller books list by scraping two pages (each containing around 50 books)

In [222]:
import requests # used to send HTTP requests to web servers
from bs4 import BeautifulSoup # parsing HTML and XML documents
import pandas as pd # powerful data manipulation and analysis library
import numpy as np # used for numerical computations in Python


In [59]:
no_pages = 2
def get_data(pageNo):
    headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0", 
               "Accept-Encoding":"gzip, deflate", 
               "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", 
               "DNT":"1", "Connection":"close", 
               "Upgrade-Insecure-Requests":"1"}

    r = requests.get(f'https://www.amazon.sa/-/en/gp/bestsellers/books/ref=zg_bs_pg_1_books?ie=UTF8&pg={pageNo}&language=en&crid=1MSN01VVU9GYY&qid=1711400365&rnid=12463048031&sprefix=engl+book%2Cstripbooks%2C312&ref=sr_pg_{pageNo}', headers=headers)
    content = r.content
    soup = BeautifulSoup(content, "html.parser")

    alls = []
    for d in soup.find_all('div', attrs={'class':'zg-grid-general-faceout'}):  # find_all replaces the deprecated findAll alias
        name = d.find('div', attrs={'class':'_cDEzb_p13n-sc-css-line-clamp-1_1Fn1y'})
        price = d.find('span', attrs={'class':'_cDEzb_p13n-sc-price_3mJ9Z'})
        rating = d.find('span', attrs={'class':'a-icon-alt'})
        users_rated = d.find('span', attrs={'aria-hidden':'true'})
        author = d.find('div', attrs={'class':'a-row'})
        format_type = d.find('span', attrs={'class':'a-text-normal'})
        genre = d.find('div', attrs={'class':'a-row a-size-base a-color-base'})
        cover_image = d.find('img', attrs={'class': 'a-dynamic-image p13n-sc-dynamic-image p13n-product-image'})

        # Use the element text when the element was found, otherwise the "Null" placeholder
        fields = [name, price, rating, users_rated, author, format_type, genre]
        all1 = [tag.text if tag is not None else "Null" for tag in fields]

        # The cover image URL is stored in the src attribute rather than in the text
        all1.append(cover_image['src'] if cover_image is not None else "No Image")

        alls.append(all1)
    # One row is appended per book container, so len(alls) is the number of books found
    print(f"Books found : {len(alls)}")
    return alls



In [60]:
results = []
for i in range(1, no_pages+1):
    results.extend(get_data(i))  # extend keeps one flat list of rows across pages
df = pd.DataFrame(results, columns=[
    'Title',
    'Price',
    'Rating',
    'Num Of Reviews',
    'Author',
    'Book Type',
    'Genre',
    'Cover Image'
])

Books found : 30
Books found : 30


<b>We notice here that only 30 of the roughly 50 books on each page were extracted, possibly due to server blocking. For that reason, we collected the remaining books using a web scraping extension for Google Chrome.</b>

- Web scraper - free web scraping : https://chromewebstore.google.com/detail/web-scraper-free-web-scra/jnhgnonknehpejjnehehllkliplmbmhn?hl=en

In [61]:
# Preview the collected data
df.head(100)

| Unnamed: 0 | Title | Price | Rating | Num Of Reviews | Author | Book Type | Genre | Cover Image |
|---|---|---|---|---|---|---|---|---|
| 0 | كتاب التحصيلي علمي 46-47 (2025) | SAR 98.00 | 4.3 out of 5 stars | 9 | Nasser bin Abdulaziz Al-Abdulkarim | Paperback | Null | https://images-eu.ssl-images-amazon.com/images... |
| 1 | El Sharq library المعاصر 9 تاسيس كمي 2/1 ورقي ... | SAR 107.58 | 4.5 out of 5 stars | 226 | عماد الجزيري | Paperback | Null | https://images-eu.ssl-images-amazon.com/images... |
| 2 | Coloriages mystères Disney Princesses: Colorie... | SAR 109.10 | 4.7 out of 5 stars | 5863 | Jérémy Mariez | Paperback | Null | https://images-eu.ssl-images-amazon.com/images... |
| 3 | My First Library : Boxset Of 10 Board Books Fo... | SAR 47.00 | 4.6 out of 5 stars | 80669 | Wonder House Books | Board book | Null | https://images-eu.ssl-images-amazon.com/images... |
| 4 | Null | SAR 65.00 | 4.7 out of 5 stars | 12574 | 4.7 out of 5 stars 12,574 | Paperback | Null | https://images-eu.ssl-images-amazon.com/images... |
| 5 | فاتتني صلاة | SAR 26.00 | 4.7 out of 5 stars | 301 | اسلام جمال | Unknown Binding | Null | https://images-eu.ssl-images-amazon.com/images... |
| 6 | Atomic Habits: An Easy & Proven Way to Build G... | SAR 89.00 | 4.8 out of 5 stars | 73014 | James Clear | Hardcover | Null | https://images-eu.ssl-images-amazon.com/images... |
| 7 | Golden Books The Tale of Peter Rabbit | SAR 9.00 | 4.8 out of 5 stars | 1893 | Beatrix Potter | Hardcover | Null | https://images-eu.ssl-images-amazon.com/images... |
| 8 | White Nights | SAR 19.00 | 4.6 out of 5 stars | 1509 | Fyodor Dostoyevsky | Mass Market Paperback | Null | https://images-eu.ssl-images-amazon.com/images... |
| 9 | The Psychology of Money: Timeless Lessons on W... | SAR 55.00 | 4.7 out of 5 stars | 20443 | Morgan Housel | Paperback | Null | https://images-eu.ssl-images-amazon.com/images... |


<b>Note: since the genre is only listed on each book's own page, we had to collect the genres manually.</b>

*Results might differ from a fresh run, since the data was collected a few days earlier and Amazon's bestseller list might have changed since then.*
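
The scraped values in the preview above are still raw strings such as `SAR 98.00` and `4.3 out of 5 stars`, and review counts can carry thousands separators such as `12,574`. The sketch below shows one way to convert them to numbers before analysis. It assumes these three formats are the only ones that occur; the helper names are hypothetical, and anything unparseable, including the `Null` placeholder, maps to `None`.

```python
import re


def parse_price(text):
    """Convert a string like 'SAR 98.00' to 98.0; return None if no number is found."""
    match = re.search(r"\d+(?:,\d{3})*(?:\.\d+)?", text)
    return float(match.group().replace(",", "")) if match else None


def parse_rating(text):
    """Convert '4.3 out of 5 stars' to 4.3; return None if the pattern is absent."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)\s+out of 5", text)
    return float(match.group(1)) if match else None


def parse_review_count(text):
    """Convert '12,574' to 12574; return None for non-numeric text such as 'Null'."""
    digits = text.replace(",", "").strip()
    return int(digits) if digits.isdecimal() else None


print(parse_price("SAR 98.00"))           # 98.0
print(parse_rating("4.3 out of 5 stars"))  # 4.3
print(parse_review_count("12,574"))        # 12574
```

Returning `None` instead of raising keeps a single malformed row from stopping the whole conversion; the rows that come back as `None` can then be reviewed separately.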

In [62]:
# to save the data as a csv file
df.to_csv("amazon_raw_books.csv", index=False)

### Jarir bestseller books 

#### - We extract the URLs of the Arabic and English bestseller books on Jarir (around 260 books) using the **Instant Data Scraper** extension and save them in the "jarir_bestsellers.csv" file, which the script then reads and loops over.
#### - We also use **Selenium** for elements loaded dynamically by JavaScript, such as "Rating" and "Num Of Reviews".

In [None]:
import time
import pandas as pd
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


# CSV file containing book URLs (read further down, just before the scraping loop)
INPUT_CSV = "jarir_bestsellers.csv"

# Accumulates one row per scraped book
all_books_df = pd.DataFrame()

# Headers to mimic a browser request
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:122.0) Gecko/20100101 Firefox/122.0"
}

# Initialize Selenium WebDriver
driver = webdriver.Chrome()

def get_book_data(url):
    """Scrapes book data from Jarir's website."""
    book_details = {}

    # Scrape static content using Requests + BeautifulSoup
    response = requests.get(url, headers=HEADERS, timeout=30)  # timeout so a stalled request cannot hang the run
    response.encoding = "utf-8"  # Force UTF-8 encoding
    soup = BeautifulSoup(response.text, "html.parser")

    # Extract Title
    book_details["Title"] = soup.find("h2", class_="product-title__title").text.strip() if soup.find("h2", class_="product-title__title") else "Null"

    # Extract Price
    price_container = soup.find("span", class_="price_alignment")
    if price_container:
        value = price_container.find_all("span")[-1].text.strip() if price_container.find_all("span") else "Null"
        book_details["Price"] = value
    else:
        book_details["Price"] = "Null"

    # Use Selenium for dynamically loaded elements (Rating & Reviews)
    driver.get(url)
    time.sleep(3)  # Allow time for JavaScript to load

    # Extract Rating
    try:
        rating_element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CLASS_NAME, "tf-rating"))
        )
        book_details["Rating"] = rating_element.text.strip()
    except Exception:  # e.g. the wait timed out; a bare except would also swallow Ctrl+C
        book_details["Rating"] = "Null"

    # Extract Number of Reviews
    try:
        num_reviews_element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CLASS_NAME, "tf-count"))
        )
        book_details["Num Of Reviews"] = num_reviews_element.text.strip()
    except Exception:  # e.g. the wait timed out; a bare except would also swallow Ctrl+C
        book_details["Num Of Reviews"] = "Null"

    # Extract Author
    author_tag = soup.find("b", string="Author:")
    book_details["Author"] = author_tag.find_next("span", class_="cl-blue").text.strip() if author_tag else "Null"

    # Extract Book Type (Format)
    format_tag = soup.find("b", string="Format:")
    book_details["Book Type"] = format_tag.find_next("span").text.strip() if format_tag else "Null"

    # Extract Genre (Book Classification)
    book_classification = soup.find("b", string="Book classification:")
    if book_classification:
        genres = [span.text.strip() for span in book_classification.find_next("span").find_all("span", class_="cl-blue") if span.text.strip()]
        book_details["Genre"] = ", ".join(genres) if genres else "Null"
    else:
        book_details["Genre"] = "Null"

    # Extract High-Quality Cover Image
    image_tags = soup.find_all("img", class_="image image--contain")
    if len(image_tags) > 1:
        raw_image_url = image_tags[1]["src"]
        # Modify the URL to get better quality (replace width=54 with width=350)
        book_details["Cover Image"] = raw_image_url.replace("width=54", "width=350")
    else:
        book_details["Cover Image"] = "No Image"

    return book_details


# Read URLs from CSV & Scrape Data
df = pd.read_csv(INPUT_CSV, names=["book_link"], encoding="utf-8")  # Pass nrows=... to test on a small subset
for i, url in enumerate(df["book_link"].dropna(), start=1):  # Drop NaN values
    print(i, f"Scraping: {url}")
    book_info = get_book_data(url)

    # Append the scraped data to the DataFrame
    all_books_df = pd.concat([all_books_df, pd.DataFrame([book_info])], ignore_index=True)
    time.sleep(3)  # Delay to prevent request blocking

# Close Selenium WebDriver
driver.quit()

# Drop the URL column (not needed in final output)
all_books_df.drop(columns=["URL"], inplace=True, errors="ignore")



In [None]:
all_books_df.to_csv("jarir_raw_books.csv", index=False)

#### Note: 
We found 10 books in Jarir with null "Rating" and "Num Of Reviews" values even though the site showed actual values for them. We corrected these manually in the CSV file. These 10 books were in the first half of the bestseller list, surrounded by complete data, which justified the manual updates. The other nulls appeared at the end of the list and were confirmed to represent zero reviews.
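
The reasoning in this note, that a missing rating near the top of the list is suspicious while one at the end likely means no reviews, can be expressed as a simple check. The sketch below is illustrative only: `flag_suspicious_nulls` is a hypothetical helper, the rows are made-up placeholders rather than scraped data, and the "first half" cutoff mirrors the rule of thumb described in the note rather than a validated threshold.

```python
def flag_suspicious_nulls(rows, fields=("Rating", "Num Of Reviews"), placeholder="Null"):
    """Return positions of rows in the first half of the list with a missing field.

    rows is a list of dicts kept in bestseller order.
    """
    cutoff = len(rows) / 2
    return [
        position
        for position, row in enumerate(rows)
        if position < cutoff and any(row.get(field) == placeholder for field in fields)
    ]


# Made-up example rows (not real scraped data)
sample = [
    {"Title": "Book A", "Rating": "4.5", "Num Of Reviews": "120"},
    {"Title": "Book B", "Rating": "Null", "Num Of Reviews": "Null"},
    {"Title": "Book C", "Rating": "4.1", "Num Of Reviews": "37"},
    {"Title": "Book D", "Rating": "Null", "Num Of Reviews": "Null"},
]
print(flag_suspicious_nulls(sample))  # [1]
```

Only the flagged positions would then need a manual look on the website; nulls past the cutoff are left alone under the assumption that they reflect books with no reviews.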

### Integration Step

#### After collecting the Amazon and Jarir bestseller books successfully, we integrate them into one CSV file

In [None]:
# Load the two datasets
df_jarir = pd.read_csv("jarir_raw_books.csv", encoding="utf-8")
df_amazon = pd.read_csv("amazon_raw_books.csv", encoding="utf-8")

# Merge (concatenate) them
df_merged = pd.concat([df_jarir, df_amazon], ignore_index=True)

# Save the merged dataset to a new CSV file
merged_filename = "raw_bestseller_books.csv"
df_merged.to_csv(merged_filename, index=False, encoding="utf-8")

# 3. Data Cleaning and Preprocessing

## References

1. A. Alharbi, "Exploring Factors Influencing the Amazon Best-Selling Books Selection Process from 2009 to 2019," ResearchGate, 2024. [Online]. Available: https://www.researchgate.net/publication/382998978_Exploring_Factors_Influencing_the_Amazon_Best-Selling_Books_Selection_Process_from_2009_to_2019.

2. J. Smith and J. Doe, "Using Full-Text Content to Characterize and Identify Best Seller Books," PLOS ONE, May 11, 2023. [Online]. Available: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0302070.

3. L. Johnson and K. Brown, "Analyzing Social Book Reading Behavior on Goodreads and How It Predicts Amazon Best Sellers," ResearchGate, 2018. [Online]. Available: https://www.researchgate.net/publication/327789907_Analyzing_Social_Book_Reading_Behavior_on_Goodreads_and_how_it_predicts_Amazon_Best_Sellers.