# DATA LINKS AND EXPLANATIONS

This notebook documents how the data was collected, the justification for each source, and where each dataset can be found within my project's resources. This transparency helps users and other developers understand the basis of my recommender system and the reliability of its suggestions.

## Data Links

1. https://www.kaggle.com/datasets/mahdiehhajian/users-book-dataset
2. https://www.kaggle.com/datasets/anshtanwar/top-200-trending-books-with-reviews/data?select=customer+reviews.csv
3. https://www.kaggle.com/datasets/jealousleopard/goodreadsbooks
4. https://www.kaggle.com/datasets/mdhamani/goodreads-books-100k


## Dataset Scraped by Me from the Goodreads Website

5. https://www.goodreads.com/list/best_of_year/2023

## Explanations

## 1. Users Book Dataset

https://www.kaggle.com/datasets/mahdiehhajian/users-book-dataset


### About Dataset

This dataset compiles a list of the best-selling books and the books with the best user experience at Kitabarah.


### It contains these CSV files:

- Books.csv
- Ratings.csv
- Users.csv


### Column Descriptions:

**For User Data:**

| Key Feature | Description                                   |
|-------------|-----------------------------------------------|
| User-ID     | A unique identifier for each user.           |
| Location    | The geographical location of the user.       |
| Age         | The age of the user.                         |

**For Book Data:**

| Key Feature          | Description                                                      |
|----------------------|------------------------------------------------------------------|
| ISBN                 | A unique identification number for books.                         |
| Book-Title           | The title of the book.                                           |
| Book-Author          | The author(s) of the book. Multiple authors are delimited by "-". |
| Year-Of-Publication  | The year when the book was published.                            |
| Publisher            | The publisher of the book.                                       |
| Image-URL-S          | Small-sized image URL of the book cover.                         |
| Image-URL-M          | Medium-sized image URL of the book cover.                        |
| Image-URL-L          | Large-sized image URL of the book cover.                         |

**For Book Ratings Data:**

| Key Feature  | Description                                            |
|--------------|--------------------------------------------------------|
| User-ID      | A unique identifier for each user.                     |
| ISBN         | A unique identification number for books.               |
| Book-Rating  | The rating given by the user for a particular book.    |
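The three files share the `User-ID` and `ISBN` keys, so they can be combined into one table. The following is a minimal sketch of that join with pandas, not the project's actual loading code. The rows are made up for illustration and are read from in-memory strings; only the column names come from the tables above. Reading `ISBN` as a string is an assumption made here so that leading zeros are preserved.

```python
import io
import pandas as pd

# Tiny in-memory stand-ins for Users.csv, Books.csv and Ratings.csv.
# All rows are invented; only the column names come from the tables above.
users_csv = io.StringIO(
    'User-ID,Location,Age\n'
    '1,"madrid, spain",34\n'
    '2,"lyon, france",\n'
)
books_csv = io.StringIO(
    "ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher\n"
    "0000000001,Example Book A,Author One,1999,Publisher X\n"
    "0000000002,Example Book B,Author Two,2005,Publisher Y\n"
)
ratings_csv = io.StringIO(
    "User-ID,ISBN,Book-Rating\n"
    "1,0000000001,8\n"
    "2,0000000001,6\n"
    "2,0000000002,9\n"
)

users = pd.read_csv(users_csv)
# Read ISBN as text so that leading zeros are not lost (assumption for this sketch).
books = pd.read_csv(books_csv, dtype={"ISBN": str})
ratings = pd.read_csv(ratings_csv, dtype={"ISBN": str})

# Attach book details to every rating, then attach the rating user's details.
merged = (
    ratings
    .merge(books, on="ISBN", how="inner")
    .merge(users, on="User-ID", how="left")
)

print(merged[["User-ID", "Book-Title", "Book-Rating", "Location"]])
```

With the real files, the `io.StringIO` objects would be replaced by the file paths; the separator and encoding of the actual CSVs may need to be set explicitly.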

## 2. Top 100 Bestselling Book Reviews on Amazon

https://www.kaggle.com/datasets/anshtanwar/top-200-trending-books-with-reviews/data?select=customer+reviews.csv


### About Dataset

This dataset offers an in-depth look at Amazon's top 100 bestselling books, along with their customer reviews, ratings, prices, and other details, providing a window into the world of popular reading. The data was already scraped by the dataset author in November 2023.


### It contains these CSV files:

- Top-100 Trending Books.csv
- customer reviews.csv


### Key Features:

| Key Feature           | Description                                                                                     |
|-----------------------|-------------------------------------------------------------------------------------------------|
| Book Rank             | The ranking of the book among the top 100 Bestselling books on Amazon.                         |
| Book Title            | The title of the book.                                                                          |
| Price                 | The price of the book in USD.                                                                   |
| Rating                | The overall rating of the book, on a scale of 1 to 5.                                           |
| Author                | The author of the book.                                                                         |
| Year of Publication   | The year in which the book was published.                                                       |
| Genre                 | The genre or category to which the book belongs.                                                |
| URL                   | The URL link to the book on Amazon's platform.                                                  |
| Review Title          | The title of the book review.                                                                   |
| Reviewer              | The name of the person who has written a review for the book.                                    |
| Reviewer Rating       | The rating given by the reviewer for the book, on a scale of 1 to 5.                             |
| Review Description    | The text description of the review given.                                                        |
| Is_verified           | Indicates whether the review is verified as a genuine customer review.                            |
| Date                  | The date when the review was posted.                                                             |
| Timestamp             | The timestamp of when the review was posted.                                                     |
| ASIN                  | Amazon Standard Identification Number assigned to products on Amazon.                            |
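As an illustration of how the review columns might be used, the sketch below keeps only verified reviews and averages the reviewer rating per book. It assumes that the review columns carry exactly the names listed in the table and that `ASIN` identifies the reviewed book; neither is confirmed here, and the rows are invented.

```python
import io
import pandas as pd

# Invented sample rows; the column names follow the table above and may
# differ from the headers in the actual "customer reviews.csv" file.
reviews_csv = io.StringIO(
    "ASIN,Reviewer Rating,Is_verified\n"
    "B000000001,5,True\n"
    "B000000001,3,True\n"
    "B000000001,1,False\n"
    "B000000002,4,True\n"
)
reviews = pd.read_csv(reviews_csv)

# Keep only reviews flagged as verified.
verified = reviews[reviews["Is_verified"]]

# Average reviewer rating per book, using ASIN as the grouping key (assumption).
mean_verified_rating = verified.groupby("ASIN")["Reviewer Rating"].mean()

print(mean_verified_rating)
```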


## 3. Goodreads-books

https://www.kaggle.com/datasets/jealousleopard/goodreadsbooks


### About Dataset

A clean dataset of books, built with the Goodreads API.


### It contains this CSV file:

- gr_books.csv


### Key Features: 

| Key Feature        | Description                                                    |
|--------------------|----------------------------------------------------------------|
| bookID             | A unique identification number for each book.                   |
| title              | The name under which the book was published.                   |
| authors            | Names of the authors of the book. Multiple authors are delimited with a hyphen (-). |
| average_rating     | The average of all ratings the book received.                  |
| isbn               | Another unique number to identify the book, the International Standard Book Number. |
| isbn13             | A 13-digit ISBN to identify the book, instead of the standard 10-digit ISBN. |
| language_code      | The primary language of the book. For instance, "eng" stands for English. |
| num_pages          | Number of pages the book contains.                             |
| ratings_count      | Total number of ratings the book received.                     |
| text_reviews_count | Total number of written text reviews the book received.        |
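The `isbn` and `isbn13` columns identify the same book in two formats. If datasets that only share one of the two formats ever need to be matched (an assumption; this document does not state how the datasets are combined), a 10-digit ISBN can be converted to its 13-digit form with the standard rule: prefix `978`, drop the old check digit, and recompute the check digit with alternating weights 1 and 3. The helper name `isbn10_to_isbn13` is hypothetical.

```python
def isbn10_to_isbn13(isbn10: str) -> str:
    """Convert a 10-character ISBN to its 13-digit form (hypothetical helper)."""
    isbn10 = isbn10.replace("-", "").strip()
    if len(isbn10) != 10 or not isbn10[:9].isdigit():
        raise ValueError(f"not a valid ISBN-10 string: {isbn10!r}")

    # Prefix "978" and drop the old check digit (the last character).
    core = "978" + isbn10[:9]

    # ISBN-13 check digit: weights alternate 1, 3, 1, 3, ... over the 12 digits.
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(core))
    check = (10 - total % 10) % 10
    return core + str(check)


print(isbn10_to_isbn13("0306406152"))
```

The function does not verify the original ISBN-10 check digit; it only validates the length and that the first nine characters are digits.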


## 4. GoodReads 100k books

https://www.kaggle.com/datasets/mdhamani/goodreads-books-100k


### About Dataset

This dataset contains the columns generally needed to describe a book.


### It contains this CSV file:

- GoodReads_100k_books.csv


### Key Features:

| Key Feature  | Description                                        |
|--------------|----------------------------------------------------|
| Author       | The name of the author/authors of the book.        |
| Book Format  | The format of the book.                            |
| Description  | The description of the book.                      |
| Genre        | The list of genres related to the book.           |
| Image        | Image link of the book.                            |
| ISBN         | ISBN code of the book.                             |
| ISBN13       | ISBN13 code of the book.                           |
| Link         | The Goodreads link of the book.                    |
| Pages        | Number of pages in the book.                       |
| Rating       | Average rating of the book.                        |


## 5. My Own Web-Scraped Data from Goodreads

### About the Dataset Acquisition Process:

In developing my book recommender system, a key step was sourcing contemporary, popular books so that the recommendations are timely and relevant. To achieve this, I used web scraping to collect data on the 100 best books published in 2023 from the Goodreads website. Goodreads is a well-known platform where readers can find new books, read reviews, and see what's trending in the literary world. Focusing on books published in 2023 keeps the recommendation pool fresh, offering users titles that are current and possibly in high demand.

I designed the web scraping process to adhere to the website's terms of service and robots.txt file, so that the data is collected ethically. I extracted key information such as book titles, authors, ratings, and the total number of ratings, which serve as crucial inputs to the recommender engine.
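One way to check a URL against a site's robots.txt rules is Python's standard `urllib.robotparser` module. The sketch below is not the code from the scraping notebook. To stay self-contained it parses a hypothetical set of rules from a string; these are not Goodreads' actual robots.txt, and the user agent name is also made up.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only, NOT Goodreads' real robots.txt.
sample_robots_txt = """
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(sample_robots_txt.splitlines())

user_agent = "my-book-recommender-bot"  # made-up user agent name
list_url = "https://www.goodreads.com/list/best_of_year/2023"
blocked_url = "https://www.goodreads.com/private/page"

# can_fetch returns True when the rules allow this user agent to request the URL.
allowed = parser.can_fetch(user_agent, list_url)
blocked = parser.can_fetch(user_agent, blocked_url)

print(allowed, blocked)
```

Against the live site, `parser.set_url(...)` followed by `parser.read()` would load the real rules instead of the sample string.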

This dataset underpins part of my recommendation logic: if a user's input matches a title in this up-to-date list, the system recommends a book from this subset, providing suggestions in line with current reading trends.
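The matching rule described above could look roughly like the following sketch. It makes several assumptions that this document does not confirm: titles are compared case-insensitively after trimming whitespace, and the recommended book is the highest-rated other title in the list. The function name `recommend_from_2023` and the sample rows are hypothetical.

```python
from typing import Optional

import pandas as pd

# Invented sample rows using the columns of the scraped DataFrame.
best_2023 = pd.DataFrame(
    {
        "title": ["Example Novel", "Another Story", "Third Tale"],
        "author": ["Author One", "Author Two", "Author Three"],
        "rating": [4.1, 4.5, 3.9],
        "total_ratings": [1200, 800, 4000],
    }
)


def recommend_from_2023(user_title: str, books: pd.DataFrame) -> Optional[str]:
    """Return another title from the list if the input matches one, else None."""
    normalized = books["title"].str.strip().str.lower()
    query = user_title.strip().lower()

    # No match in the 2023 list: this part of the logic does not apply.
    if not (normalized == query).any():
        return None

    # Assumption: suggest the highest-rated book other than the one entered.
    candidates = books[normalized != query]
    if candidates.empty:
        return None
    return candidates.sort_values("rating", ascending=False).iloc[0]["title"]


print(recommend_from_2023("  example novel ", best_2023))
```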

The code for the web scraping process, as well as the initial data inspection and cleaning procedures, can be found in the notebook titled "1. Data Import and Inspection". This notebook details the step-by-step methodology employed in extracting the dataset, the structure of the scraped data, and the initial exploratory data analysis performed to understand the dataset's characteristics.

### Key Features of the resulting DataFrame:


| Key Feature  | Description                                                  |
|--------------|--------------------------------------------------------------|
| title        | Title of the book.                                           |
| author       | Name of the author/authors of the book.                      |
| image_url    | URL of the book cover image.                                 |
| rating       | The average rating the book received.                        |
| total_ratings| Total number of ratings the book received.                   |

(Initially, the "rating" column also included the total number of ratings; I split that information into the separate "rating" and "total_ratings" columns.)
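A split of this kind can be sketched with a regular expression in pandas. The raw text format used below ("4.15 avg rating - 1,234,567 ratings") is an assumption about what the scraped strings looked like; the actual cleaning steps are in the "1. Data Import and Inspection" notebook and may differ.

```python
import pandas as pd

# Invented raw values; the exact text scraped from the page is an assumption.
raw = pd.DataFrame(
    {
        "rating": [
            "4.15 avg rating - 1,234,567 ratings",
            "3.98 avg rating - 845 ratings",
        ]
    }
)

# Capture the average rating and the ratings count; \D* skips whatever
# separator sits between the two parts.
pattern = r"(?P<rating>\d+(?:\.\d+)?)\s*avg rating\D*(?P<total_ratings>[\d,]+)\s*ratings"
parts = raw["rating"].str.extract(pattern)

clean = pd.DataFrame(
    {
        "rating": parts["rating"].astype(float),
        # Remove thousands separators before converting to integers.
        "total_ratings": parts["total_ratings"].str.replace(",", "", regex=False).astype(int),
    }
)

print(clean)
```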