# DATA IMPORT AND INSPECTION

⚠️**NOTE:** This notebook documents my initial data inspection. Because of time constraints, and to avoid complications later on, I decided to work mainly with the "books_df" dataset and the one I scraped myself from the Goodreads website. Many of the other datasets inspected here may not be used.

I'm leaving them here anyway, together with the code, in case I want to use them to improve my book recommender in the future.

## Import Libraries

In [1]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
from IPython.display import Image, display
from IPython.display import HTML

import warnings
warnings.filterwarnings("ignore")

## Load the Data

In [2]:
books_df = pd.read_csv('GoodReads_100k_books.csv')
ratings_df = pd.read_csv('Ratings.csv')
books_df_2 = pd.read_csv('Books.csv')
users_df = pd.read_csv('Users.csv')
top100_df = pd.read_csv('Top-100 Trending Books.csv')
customers_df = pd.read_csv('customer reviews.csv')

# Attempt to read 'gr_books.csv' into a pandas DataFrame.
# on_bad_lines='warn' (pandas >= 1.3) replaces error_bad_lines=False, which was
# removed in pandas 2.0: malformed rows are skipped with a warning instead of
# aborting the load.
try:
    gr_books_df = pd.read_csv('gr_books.csv', on_bad_lines='warn')

# If parsing still fails with a ParserError, print the details of the exception.
except pd.errors.ParserError as e:
    print(f"Error parsing 'gr_books.csv': {e}")

Skipping line 3350: expected 12 fields, saw 13
Skipping line 4704: expected 12 fields, saw 13
Skipping line 5879: expected 12 fields, saw 13
Skipping line 8981: expected 12 fields, saw 13
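
The skipped lines above are rows with more fields than the header declares. As a minimal sketch of that behaviour, the toy CSV below (invented for illustration, not taken from `gr_books.csv`) has one row with an extra field; `on_bad_lines="skip"` assumes pandas 1.3 or newer.

```python
import io

import pandas as pd

# Toy CSV: the header declares 3 fields, the second data row has 4.
csv_text = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n"

# on_bad_lines="skip" silently drops rows with too many fields.
toy_df = pd.read_csv(io.StringIO(csv_text), on_bad_lines="skip")

print(toy_df)
print("Rows kept:", len(toy_df))
```

Comparing the number of rows kept with the number of lines in the file is a quick way to see how much data was dropped.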



## Web Scraping Goodreads Best Books of 2023

The 100 best books published in 2023, according to the Goodreads list.
- For further explanation, see the previous notebook "0. Dataset Links and Explanations".

In [3]:
# The URL of the Goodreads page to scrape
url = 'https://www.goodreads.com/list/best_of_year/2023'

# Send the GET request to the Goodreads URL
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Lists to store the book data
book_titles = []
book_authors = []
book_images = []
book_ratings = []

# Loop over the rows in the table
for row in soup.select('table.tableList tr'):
    # Extracting the book title
    title_element = row.select_one('td:nth-child(3) a span[itemprop="name"]')
    book_titles.append(title_element.get_text(strip=True) if title_element else 'Title Not Found')

    # Extracting the author name
    author_element = row.select_one('td:nth-child(3) .authorName span[itemprop="name"]')
    book_authors.append(author_element.get_text(strip=True) if author_element else 'Author Not Found')

    # Extracting the image source
    image_element = row.select_one('td:nth-child(2) .bookCover')
    book_images.append(image_element['src'] if image_element else 'Image Not Found')

    # Extracting the rating
    rating_element = row.select_one('td:nth-child(3) .minirating')
    book_ratings.append(rating_element.get_text(strip=True) if rating_element else 'Rating Not Found')

# Create a DataFrame with the scraped data
goodreads_df = pd.DataFrame({
    'Title': book_titles,
    'Author': book_authors,
    'Image URL': book_images,
    'Rating': book_ratings
})

# Display the DataFrame
goodreads_df

Unnamed: 0,Title,Author,Image URL,Rating
0,"Fourth Wing (The Empyrean, #1)",Rebecca Yarros,https://i.gr-assets.com/images/S/compressed.ph...,"4.63 avg rating — 858,170 ratings"
1,Happy Place,Emily Henry,https://i.gr-assets.com/images/S/compressed.ph...,"4.06 avg rating — 578,174 ratings"
2,Yellowface,R.F. Kuang,https://i.gr-assets.com/images/S/compressed.ph...,"3.87 avg rating — 230,939 ratings"
3,"Love, Theoretically",Ali Hazelwood,https://i.gr-assets.com/images/S/compressed.ph...,"4.16 avg rating — 242,107 ratings"
4,"Divine Rivals (Letters of Enchantment, #1)",Rebecca Ross,https://i.gr-assets.com/images/S/compressed.ph...,"4.26 avg rating — 164,652 ratings"
...,...,...,...,...
95,"Poverty, by America",Matthew Desmond,https://i.gr-assets.com/images/S/compressed.ph...,"4.30 avg rating — 20,742 ratings"
96,"The Perfumist of Paris (The Jaipur Trilogy, #3)",Alka Joshi,https://i.gr-assets.com/images/S/compressed.ph...,"4.17 avg rating — 19,269 ratings"
97,"A Fire in the Flesh (Flesh and Fire, #3)",Jennifer L. Armentrout,https://i.gr-assets.com/images/S/compressed.ph...,"4.12 avg rating — 22,872 ratings"
98,"Finlay Donovan Jumps the Gun (Finlay Donovan, #3)",Elle Cosimano,https://i.gr-assets.com/images/S/compressed.ph...,"3.82 avg rating — 43,273 ratings"


### Save the scraped DataFrame to a CSV File

In [4]:
goodreads_df.to_csv('goodreads_webscrap.csv', index=False)

## Inspect the first rows and the shape of each dataframe

#### Books

In [5]:
books_df.head()

Unnamed: 0,author,bookformat,desc,genre,img,isbn,isbn13,link,pages,rating,reviews,title,totalratings
0,Laurence M. Hauptman,Hardcover,Reveals that several hundred thousand Indians ...,"History,Military History,Civil War,American Hi...",https://i.gr-assets.com/images/S/compressed.ph...,002914180X,9780000000000.0,https://goodreads.com/book/show/1001053.Betwee...,0,3.52,5,Between Two Fires: American Indians in the Civ...,33
1,"Charlotte Fiell,Emmanuelle Dirix",Paperback,Fashion Sourcebook - 1920s is the first book i...,"Couture,Fashion,Historical,Art,Nonfiction",https://i.gr-assets.com/images/S/compressed.ph...,1906863482,9780000000000.0,https://goodreads.com/book/show/10010552-fashi...,576,4.51,6,Fashion Sourcebook 1920s,41
2,Andy Anderson,Paperback,The seminal history and analysis of the Hungar...,"Politics,History",https://i.gr-assets.com/images/S/compressed.ph...,948984147,9780000000000.0,https://goodreads.com/book/show/1001077.Hungar...,124,4.15,2,Hungary 56,26
3,Carlotta R. Anderson,Hardcover,"""All-American Anarchist"" chronicles the life a...","Labor,History",https://i.gr-assets.com/images/S/compressed.ph...,814327079,9780000000000.0,https://goodreads.com/book/show/1001079.All_Am...,324,3.83,1,All-American Anarchist: Joseph A. Labadie and ...,6
4,Jean Leveille,,"Aujourdâ€™hui, lâ€™oiseau nous invite Ã sa ta...",,https://i.gr-assets.com/images/S/compressed.ph...,2761920813,,https://goodreads.com/book/show/10010880-les-o...,177,4.0,1,Les oiseaux gourmands,1


#### Ratings

In [6]:
ratings_df.head()

Unnamed: 0,User-ID,ISBN,Book-Rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


#### Books 2

In [7]:
books_df_2.head()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


#### Users

In [8]:
users_df.head()

Unnamed: 0,User-ID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


#### Top 100 Trending Books

In [9]:
top100_df.head()

Unnamed: 0,Rank,book title,book price,rating,author,year of publication,genre,url
0,1,"Iron Flame (The Empyrean, 2)",18.42,4.1,Rebecca Yarros,2023,Fantasy Romance,amazon.com/Iron-Flame-Empyrean-Rebecca-Yarros/...
1,2,The Woman in Me,20.93,4.5,Britney Spears,2023,Memoir,amazon.com/Woman-Me-Britney-Spears/dp/16680090...
2,3,My Name Is Barbra,31.5,4.5,Barbra Streisand,2023,Autobiography,amazon.com/My-Name-Barbra-Streisand/dp/0525429...
3,4,"Friends, Lovers, and the Big Terrible Thing: A...",23.99,4.4,Matthew Perry,2023,Memoir,amazon.com/Friends-Lovers-Big-Terrible-Thing/d...
4,5,How to Catch a Turkey,5.65,4.8,Adam Wallace,2018,"Childrens, Fiction",amazon.com/How-Catch-Turkey-Adam-Wallace/dp/14...


#### Customers

In [10]:
customers_df.head()

Unnamed: 0,Sno,book name,review title,reviewer,reviewer rating,review description,is_verified,date,timestamp,ASIN
0,0,The Woman in Me,Unbelievably impressive. Her torn life on paper.,Murderess Marbie,4,I'm only a third way in. Shipped lightening fa...,True,26-10-2023,"Reviewed in the United States October 26, 2023",1668009048
1,1,The Woman in Me,What a heartbreaking story,L J,5,"""There have been so many times when I was scar...",True,06-11-2023,"Reviewed in the United States November 6, 2023",1668009048
2,2,The Woman in Me,Britney you are so invincible! You are an insp...,Jamie,5,The media could not be loaded. I personally ha...,True,01-11-2023,"Reviewed in the United States November 1, 2023",1668009048
3,3,The Woman in Me,"Fast Read, Sad Story",KMG,5,I have been a fan of Britney's music since the...,True,25-10-2023,"Reviewed in the United States October 25, 2023",1668009048
4,4,The Woman in Me,"Buy it, it’s worth the read!",Stephanie Brown,5,"Whether or not you’re a fan, it’s a great read...",True,01-11-2023,"Reviewed in the United States November 1, 2023",1668009048


#### Goodreads - Web Scraped

In [11]:
goodreads_df.head()

Unnamed: 0,Title,Author,Image URL,Rating
0,"Fourth Wing (The Empyrean, #1)",Rebecca Yarros,https://i.gr-assets.com/images/S/compressed.ph...,"4.63 avg rating — 858,170 ratings"
1,Happy Place,Emily Henry,https://i.gr-assets.com/images/S/compressed.ph...,"4.06 avg rating — 578,174 ratings"
2,Yellowface,R.F. Kuang,https://i.gr-assets.com/images/S/compressed.ph...,"3.87 avg rating — 230,939 ratings"
3,"Love, Theoretically",Ali Hazelwood,https://i.gr-assets.com/images/S/compressed.ph...,"4.16 avg rating — 242,107 ratings"
4,"Divine Rivals (Letters of Enchantment, #1)",Rebecca Ross,https://i.gr-assets.com/images/S/compressed.ph...,"4.26 avg rating — 164,652 ratings"


In [12]:
# Check the first 10 images to make sure the scraped URLs work

# Create an HTML string to display images side by side
num_images_to_display = 10
image_html = "<div style='display:flex;'>"

# Iterate through the 'Image URL' column and display the first 10 images
for i, url in enumerate(goodreads_df['Image URL']):
    if i >= num_images_to_display:
        break  # Stop after displaying the first 10 images
    image_html += f"<img src='{url}' style='margin: 5px;'>"

image_html += "</div>"

# Display the HTML string
HTML(image_html)

#### Goodreads

In [13]:
gr_books_df.head()

Unnamed: 0,bookID,title,authors,average_rating,isbn,isbn13,language_code,num_pages,ratings_count,text_reviews_count,publication_date,publisher
0,1,Harry Potter and the Half-Blood Prince (Harry ...,J.K. Rowling/Mary GrandPré,4.57,0439785960,9780439785969,eng,652,2095690,27591,9/16/2006,Scholastic Inc.
1,2,Harry Potter and the Order of the Phoenix (Har...,J.K. Rowling/Mary GrandPré,4.49,0439358078,9780439358071,eng,870,2153167,29221,9/1/2004,Scholastic Inc.
2,4,Harry Potter and the Chamber of Secrets (Harry...,J.K. Rowling,4.42,0439554896,9780439554893,eng,352,6333,244,11/1/2003,Scholastic
3,5,Harry Potter and the Prisoner of Azkaban (Harr...,J.K. Rowling/Mary GrandPré,4.56,043965548X,9780439655484,eng,435,2339585,36325,5/1/2004,Scholastic Inc.
4,8,Harry Potter Boxed Set Books 1-5 (Harry Potte...,J.K. Rowling/Mary GrandPré,4.78,0439682584,9780439682589,eng,2690,41428,164,9/13/2004,Scholastic


### Shape

In [14]:
print("Books Shape: " ,books_df.shape )
print("Books 2 Shape: " ,books_df_2.shape )
print("Ratings Shape: " ,ratings_df.shape )
print("Users Shape: " ,users_df.shape )
print("Top 100 Shape: " ,top100_df.shape )
print("Customers Shape: " ,customers_df.shape )
print("Goodreads Shape: " ,goodreads_df.shape )
print("Goodreads 2 Shape: " ,gr_books_df.shape )

Books Shape:  (100000, 13)
Books 2 Shape:  (271360, 8)
Ratings Shape:  (1149780, 3)
Users Shape:  (278858, 3)
Top 100 Shape:  (100, 8)
Customers Shape:  (920, 10)
Goodreads Shape:  (100, 4)
Goodreads 2 Shape:  (11123, 12)


To find the total number of rows, I simply sum the number of rows in each dataset:

Total Rows = 100,000 + 271,360 + 1,149,780 + 278,858 + 100 + 920 + 100 + 11,123 = 1,812,241 rows

So, the **total number of rows across all the datasets is 1,812,241 rows**.

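This total mixes book records with ratings, users and reviews. Counting only the book-level tables (`books_df`, `books_df_2`, `top100_df`, `goodreads_df` and `gr_books_df`) gives 100,000 + 271,360 + 100 + 100 + 11,123 = 382,683 rows, so I have at most about 380K books rather than 1M. (I still have to check for duplicates to know how many different books I have.)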
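
The totals can be re-derived from the shapes printed above. This is a small sketch with the row counts copied by hand from that output; the name `row_counts` and the choice of which tables count as book-level are my own assumptions.

```python
# Row counts copied from the shape output above.
row_counts = {
    "books_df": 100_000,
    "books_df_2": 271_360,
    "ratings_df": 1_149_780,
    "users_df": 278_858,
    "top100_df": 100,
    "customers_df": 920,
    "goodreads_df": 100,
    "gr_books_df": 11_123,
}

# Assumption: these are the tables where one row describes one book.
book_tables = ["books_df", "books_df_2", "top100_df", "goodreads_df", "gr_books_df"]

total_rows = sum(row_counts.values())
book_rows = sum(row_counts[name] for name in book_tables)

print(f"Total rows: {total_rows:,}")
print(f"Rows in book-level tables: {book_rows:,}")
```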

## Check data types and look for missing values

#### Books

In [15]:
books_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 13 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   author        100000 non-null  object 
 1   bookformat    96772 non-null   object 
 2   desc          93228 non-null   object 
 3   genre         89533 non-null   object 
 4   img           96955 non-null   object 
 5   isbn          85518 non-null   object 
 6   isbn13        88565 non-null   object 
 7   link          100000 non-null  object 
 8   pages         100000 non-null  int64  
 9   rating        100000 non-null  float64
 10  reviews       100000 non-null  int64  
 11  title         99999 non-null   object 
 12  totalratings  100000 non-null  int64  
dtypes: float64(1), int64(3), object(9)
memory usage: 9.9+ MB


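Data types seem mostly correct. One thing to check is `isbn13`: it is stored as object, and the sample rows above show values such as `9780000000000.0`, which suggests the identifiers lost precision before or during import.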

#### Ratings

In [16]:
ratings_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1149780 entries, 0 to 1149779
Data columns (total 3 columns):
 #   Column       Non-Null Count    Dtype 
---  ------       --------------    ----- 
 0   User-ID      1149780 non-null  int64 
 1   ISBN         1149780 non-null  object
 2   Book-Rating  1149780 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 26.3+ MB


Data types seem correct.

#### Books 2

In [17]:
books_df_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271360 entries, 0 to 271359
Data columns (total 8 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   ISBN                 271360 non-null  object
 1   Book-Title           271360 non-null  object
 2   Book-Author          271359 non-null  object
 3   Year-Of-Publication  271360 non-null  object
 4   Publisher            271358 non-null  object
 5   Image-URL-S          271360 non-null  object
 6   Image-URL-M          271360 non-null  object
 7   Image-URL-L          271357 non-null  object
dtypes: object(8)
memory usage: 16.6+ MB


`Year-Of-Publication` is stored as object and should be converted to a numeric or date type.
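
One possible way to convert the year column is sketched below. The series is a toy stand-in for `books_df_2["Year-Of-Publication"]`, and the unparseable entry is invented to show what `errors="coerce"` does; I have not checked which bad values the real column contains.

```python
import pandas as pd

# Toy stand-in for books_df_2["Year-Of-Publication"]; the last value is invented.
years = pd.Series(["2002", "2001", "1991", "not a year"], name="Year-Of-Publication")

# errors="coerce" turns anything unparseable into NaN instead of raising.
years_num = pd.to_numeric(years, errors="coerce")

# The nullable integer dtype keeps whole years while still allowing missing values.
years_int = years_num.astype("Int64")

print(years_int)
print("Unparseable entries:", int(years_int.isna().sum()))
```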

#### Users

In [18]:
users_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278858 entries, 0 to 278857
Data columns (total 3 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   User-ID   278858 non-null  int64  
 1   Location  278858 non-null  object 
 2   Age       168096 non-null  float64
dtypes: float64(1), int64(1), object(1)
memory usage: 6.4+ MB


Data types seem correct.

#### Top 100

In [19]:
top100_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 8 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Rank                 100 non-null    int64  
 1   book title           100 non-null    object 
 2   book price           100 non-null    float64
 3   rating               97 non-null     float64
 4   author               100 non-null    object 
 5   year of publication  100 non-null    int64  
 6   genre                100 non-null    object 
 7   url                  100 non-null    object 
dtypes: float64(2), int64(2), object(4)
memory usage: 6.4+ KB


`year of publication` is stored as int64, which is fine for a year. It only needs converting if I want a true date type.

#### Customers

In [20]:
customers_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 920 entries, 0 to 919
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Sno                 920 non-null    int64 
 1   book name           920 non-null    object
 2   review title        920 non-null    object
 3   reviewer            920 non-null    object
 4   reviewer rating     920 non-null    int64 
 5   review description  920 non-null    object
 6   is_verified         920 non-null    bool  
 7   date                920 non-null    object
 8   timestamp           920 non-null    object
 9   ASIN                920 non-null    object
dtypes: bool(1), int64(2), object(7)
memory usage: 65.7+ KB


`date` and `timestamp` are stored as object and should be converted to a date type.
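
A sketch of how both columns could be parsed, using two rows copied from `customers_df.head()` above. It assumes every `timestamp` value ends with a date written as month name, day, year, which I have only verified on the sample rows.

```python
import pandas as pd

# Two sample rows copied from customers_df.head() above.
sample = pd.DataFrame({
    "date": ["26-10-2023", "06-11-2023"],
    "timestamp": [
        "Reviewed in the United States October 26, 2023",
        "Reviewed in the United States November 6, 2023",
    ],
})

# 'date' is day-month-year, so the format is given explicitly.
sample["date"] = pd.to_datetime(sample["date"], format="%d-%m-%Y")

# Assumption: 'timestamp' always ends with "<Month> <day>, <year>".
date_part = sample["timestamp"].str.extract(r"([A-Za-z]+ \d{1,2}, \d{4})$")[0]
sample["timestamp"] = pd.to_datetime(date_part, format="%B %d, %Y")

print(sample.dtypes)
print(sample)
```

Passing the format explicitly avoids the day/month ambiguity of a value like `06-11-2023`.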

#### Goodreads

In [21]:
goodreads_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Title      100 non-null    object
 1   Author     100 non-null    object
 2   Image URL  100 non-null    object
 3   Rating     100 non-null    object
dtypes: object(4)
memory usage: 3.2+ KB


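`Rating` is stored as text that combines the average rating and the number of ratings in one string. It must be split into a float (average rating) and an int (rating count).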
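
A minimal sketch of that split, using the string format visible in the scraped output. The helper name `parse_rating` and the regular expression are my own; rows where the scraper stored 'Rating Not Found' return `(None, None)`.

```python
import re

# Average rating, then any non-digit separator, then the ratings count.
RATING_RE = re.compile(r"(\d+(?:\.\d+)?)\s+avg rating\D+([\d,]+)\s+ratings?")


def parse_rating(text):
    """Return (average, count), or (None, None) if the text does not match."""
    match = RATING_RE.search(text)
    if match is None:
        return None, None
    average = float(match.group(1))
    count = int(match.group(2).replace(",", ""))
    return average, count


print(parse_rating("4.63 avg rating — 858,170 ratings"))
print(parse_rating("Rating Not Found"))
```

In the notebook this could be applied with `goodreads_df['Rating'].apply(parse_rating)` and the resulting tuples expanded into two new columns.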

#### Goodreads 2

In [22]:
gr_books_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11123 entries, 0 to 11122
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   bookID              11123 non-null  int64  
 1   title               11123 non-null  object 
 2   authors             11123 non-null  object 
 3   average_rating      11123 non-null  float64
 4   isbn                11123 non-null  object 
 5   isbn13              11123 non-null  int64  
 6   language_code       11123 non-null  object 
 7     num_pages         11123 non-null  int64  
 8   ratings_count       11123 non-null  int64  
 9   text_reviews_count  11123 non-null  int64  
 10  publication_date    11123 non-null  object 
 11  publisher           11123 non-null  object 
dtypes: float64(1), int64(5), object(6)
memory usage: 1.0+ MB


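`publication_date` is stored as object and should be converted to a date type. Also note that the `num_pages` column name has leading spaces (`'  num_pages'`), which should be stripped.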
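
Both fixes can be sketched on a toy frame that mimics two `gr_books_df` columns (values copied from the head output above). The month/day/year format is an assumption based on those sample rows.

```python
import pandas as pd

# Toy frame mimicking two gr_books_df columns, including the padded column name.
sample = pd.DataFrame({
    "  num_pages": [652, 870],
    "publication_date": ["9/16/2006", "9/1/2004"],
})

# Remove leading/trailing whitespace from every column name.
sample.columns = sample.columns.str.strip()

# Month/day/year strings; errors="coerce" maps invalid dates to NaT.
sample["publication_date"] = pd.to_datetime(
    sample["publication_date"], format="%m/%d/%Y", errors="coerce"
)

print(sample.dtypes)
```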

### Missing Values

In [23]:
print("Null values in Books:\n" ,books_df.isnull().sum())
print(" ")
print("Null values in Books 2:\n" ,books_df_2.isnull().sum())
print(" ")
print("Null values in Ratings:\n ",ratings_df.isnull().sum())
print(" ")
print("Null values in Users:\n",users_df.isnull().sum())
print(" ")
print("Null values in top100:\n" ,top100_df.isnull().sum())
print(" ")
print("Null values in Customers:\n" ,customers_df.isnull().sum())
print(" ")
print("Null values in Goodreads:\n" ,goodreads_df.isnull().sum())
print(" ")
print("Null values in Goodreads 2:\n" ,gr_books_df.isnull().sum())

Null values in Books:
 author              0
bookformat       3228
desc             6772
genre           10467
img              3045
isbn            14482
isbn13          11435
link                0
pages               0
rating              0
reviews             0
title               1
totalratings        0
dtype: int64
 
Null values in Books 2:
 ISBN                   0
Book-Title             0
Book-Author            1
Year-Of-Publication    0
Publisher              2
Image-URL-S            0
Image-URL-M            0
Image-URL-L            3
dtype: int64
 
Null values in Ratings:
  User-ID        0
ISBN           0
Book-Rating    0
dtype: int64
 
Null values in Users:
 User-ID          0
Location         0
Age         110762
dtype: int64
 
Null values in top100:
 Rank                   0
book title             0
book price             0
rating                 3
author                 0
year of publication    0
genre                  0
url                    0
dtype: int64
 
Null value

This output summarizes the number of missing values in each column of each dataset, which is useful for assessing data quality and planning the data cleaning tasks.

As we can see, several datasets have missing values in some of their columns.

#### Here's a summary of the datasets and respective columns with missing values, along with potential ways to handle them:

**Dataset: Books**
- Columns with missing values: `bookformat`, `desc`, `genre`, `img`, `isbn`, `isbn13`, `title`
- Possible treatments:
  - For columns `bookformat`, `desc`, `genre`, `img`, `isbn`, `isbn13`, I will decide whether these missing values are critical for my analysis. If not critical, I can leave them as they are or consider deleting the corresponding rows.
  - For the `title` column, as it's important and there's only one missing value, I could try manually searching for the missing title and replacing it.

**Dataset: Books 2**
- Columns with missing values: `Book-Author`, `Publisher`, `Image-URL-L`
- Possible treatments:
  - For the `Book-Author` column, as it's important and there's only one missing value, I could search for the missing author manually and replace it.
  - For the `Publisher` and `Image-URL-L` columns, I can decide whether these missing values are critical for my analysis. If not critical, I can leave them as they are or consider deleting the corresponding rows.

**Dataset: Ratings**
- No columns with missing values in this dataset.

**Dataset: Users**
- Columns with missing values: `Age`
- Possible treatments:
  - For the `Age` column, which is missing in 110,762 of 278,858 rows (about 40%), I can delete the rows with missing values, replace the missing values with the mean or median age, or use more advanced imputation techniques if necessary.

**Dataset: top100**
- Columns with missing values: `rating`
- Possible treatments:
  - For the `rating` column, as ratings are critical for the analysis, I may consider deleting rows with missing values or attempting to manually find and replace the missing ratings.

**Dataset: Customers**
- No columns with missing values in this dataset.

**Dataset: Goodreads**
- No columns with missing values in this dataset.

**Dataset: Goodreads 2**
- No columns with missing values in this dataset.

It's important to note that how I handle missing values depends on the context of the analysis and the importance of the missing data to my objectives. I make decisions based on the nature of the data and the question I am trying to answer. Options include deleting rows with missing values, imputing values, manually searching for missing data, or taking other actions as needed.
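
As one illustration of the imputation option, here is a sketch of median imputation for `Age` on a toy stand-in for `users_df`. The ages 18 and 17 come from the head output above, the value 40 is invented, and the `Age_missing` flag column is my own addition.

```python
import pandas as pd

# Toy stand-in for users_df; the last age is invented for the example.
users = pd.DataFrame({
    "User-ID": [1, 2, 3, 4, 5],
    "Age": [None, 18.0, None, 17.0, 40.0],
})

# Keep a flag so real ages can still be told apart from imputed ones.
users["Age_missing"] = users["Age"].isna()

# The median is less sensitive to extreme ages than the mean.
median_age = users["Age"].median()
users["Age"] = users["Age"].fillna(median_age)

print(users)
```

With about 40% of ages missing, keeping the flag column avoids treating a large block of imputed values as if they were observed.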

## Check for the existence of duplicate values

In [24]:
print(books_df.duplicated().sum())
print(ratings_df.duplicated().sum())
print(books_df_2.duplicated().sum())
print(users_df.duplicated().sum())
print(top100_df.duplicated().sum())
print(customers_df.duplicated().sum())
print(goodreads_df.duplicated().sum())
print(gr_books_df.duplicated().sum())

0
0
0
0
0
0
0
0


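None of the datasets contains fully duplicated rows. Note that `duplicated()` only flags rows that match in every column, so the same book could still appear more than once with a different ISBN or a slightly different title.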
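
The difference between full-row duplicates and duplicates on a key column can be seen on a toy frame (titles shortened and rows invented for illustration):

```python
import pandas as pd

# Toy data: no two rows are identical, but one ISBN appears twice.
books = pd.DataFrame({
    "isbn": ["0439785960", "0439785960", "0439358078"],
    "title": [
        "Half-Blood Prince",
        "Harry Potter and the Half-Blood Prince",
        "Order of the Phoenix",
    ],
})

# Default: a row is a duplicate only if every column matches an earlier row.
full_row_dupes = int(books.duplicated().sum())

# subset=... restricts the comparison to the key column(s).
isbn_dupes = int(books.duplicated(subset=["isbn"]).sum())

print("Fully duplicated rows:", full_row_dupes)
print("Rows repeating an earlier ISBN:", isbn_dupes)
```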

## Identify common columns across datasets for integration

In [25]:
# Assign names to the dataframes
books_df.name = "books_df"
books_df_2.name = "books_df_2"
ratings_df.name = "ratings_df"
gr_books_df.name = "gr_books_df"
users_df.name = "users_df"
customers_df.name = "customers_df"
top100_df.name = "top100_df"
goodreads_df.name = "goodreads_df"

# List of the dataframes
dataframes = [books_df, books_df_2, ratings_df, gr_books_df, users_df, customers_df, top100_df, goodreads_df]

# Iterate through the dataframes and print their columns
for df in dataframes:
    print(f"Columns of {df.name}: {list(df.columns)}")

Columns of books_df: ['author', 'bookformat', 'desc', 'genre', 'img', 'isbn', 'isbn13', 'link', 'pages', 'rating', 'reviews', 'title', 'totalratings']
Columns of books_df_2: ['ISBN', 'Book-Title', 'Book-Author', 'Year-Of-Publication', 'Publisher', 'Image-URL-S', 'Image-URL-M', 'Image-URL-L']
Columns of ratings_df: ['User-ID', 'ISBN', 'Book-Rating']
Columns of gr_books_df: ['bookID', 'title', 'authors', 'average_rating', 'isbn', 'isbn13', 'language_code', '  num_pages', 'ratings_count', 'text_reviews_count', 'publication_date', 'publisher']
Columns of users_df: ['User-ID', 'Location', 'Age']
Columns of customers_df: ['Sno', 'book name', 'review title', 'reviewer', 'reviewer rating', 'review description', 'is_verified', 'date', 'timestamp', 'ASIN']
Columns of top100_df: ['Rank', 'book title', 'book price', 'rating', 'author', 'year of publication', 'genre', 'url']
Columns of goodreads_df: ['Title', 'Author', 'Image URL', 'Rating']


This provides a clear list of the columns for each DataFrame, which is helpful for understanding the structure of my data and for reference when working with these datasets in my analysis.

## Identifying common fields using the ISBN 

The function below takes two dataframes (`df1` and `df2`), the names of the columns containing ISBNs in each dataframe (`col1` and `col2`), and the names of the dataframes themselves (`name1` and `name2`). It prints the number of ISBNs the two dataframes have in common, converting the ISBNs to uppercase first so the comparison is not case-sensitive, and returns the set of common ISBNs for further analysis.

ISBNs are unique identifiers for books, so the number of common ISBNs shows how many books two datasets share.

In [26]:
def find_common_isbns(df1, df2, col1, col2, name1, name2):
    """
    Finds and prints the number of common ISBNs between two dataframes.
    
    Parameters:
    df1 (pd.DataFrame): First dataframe containing ISBNs.
    df2 (pd.DataFrame): Second dataframe containing ISBNs.
    col1 (str): Column name in the first dataframe that contains ISBNs.
    col2 (str): Column name in the second dataframe that contains ISBNs.
    name1 (str): Name of the first dataframe.
    name2 (str): Name of the second dataframe.
    
    Returns:
    set: A set of common ISBNs between the two dataframes.
    """
    # Normalize ISBN values to uppercase (the check digit can be 'x' or 'X')
    # and drop missing values so NaN is never counted as a match
    df1_isbns = set(df1[col1].dropna().str.upper())
    df2_isbns = set(df2[col2].dropna().str.upper())
    
    # Find the intersection of ISBNs between the two dataframes
    common_isbns = df1_isbns.intersection(df2_isbns)
    
    # Print the number of common ISBNs
    print(f"Number of common ISBNs between {name1} and {name2}: {len(common_isbns)}")
    
    return common_isbns
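
The comparison above matches ISBN-10 strings only. Since `books_df` and `gr_books_df` also carry an `isbn13` column, converting ISBN-10 values to ISBN-13 could give a second key to match on. The helper below is a hypothetical sketch (the name `isbn10_to_isbn13` is mine) that uses the standard 978 prefix and the ISBN-13 check-digit rule; the two test values are the `isbn`/`isbn13` pairs shown in `gr_books_df.head()`.

```python
def isbn10_to_isbn13(isbn10):
    """Convert an ISBN-10 string to its 978-prefixed ISBN-13 form."""
    cleaned = isbn10.replace("-", "").strip().upper()
    if len(cleaned) != 10 or not cleaned[:9].isdigit():
        raise ValueError(f"Not a valid ISBN-10: {isbn10!r}")
    # The ISBN-10 check digit is dropped and recomputed for ISBN-13.
    core = "978" + cleaned[:9]
    # ISBN-13 check digit: weights alternate 1, 3 over the first 12 digits.
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(core))
    check = (10 - total % 10) % 10
    return core + str(check)


print(isbn10_to_isbn13("0439785960"))
print(isbn10_to_isbn13("043965548X"))
```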

In [27]:
from itertools import combinations

# Define the list of dataframes that have ISBN or isbn columns
dataframes_with_isbn = [books_df, books_df_2, ratings_df, gr_books_df]

# Rename any 'ISBN' column to lowercase 'isbn' so all dataframes share the same column name
for df in dataframes_with_isbn:
    if 'ISBN' in df.columns:
        df.rename(columns={'ISBN': 'isbn'}, inplace=True)

# Generate all possible combinations of dataframes (2 at a time)
combinations_of_dataframes = combinations(dataframes_with_isbn, 2)

# Iterate through all combinations and find common ISBNs
for df1, df2 in combinations_of_dataframes:
    common_isbns = None
    
    if 'isbn' in df1.columns and 'isbn' in df2.columns:
        common_isbns = find_common_isbns(df1, df2, 'isbn', 'isbn', df1.name, df2.name)
    
    if common_isbns is not None:
        print(f"Number of common ISBNs between {df1.name} and {df2.name}: {len(common_isbns)}")
    else:
        print(f"No common ISBNs between {df1.name} and {df2.name}")

Number of common ISBNs between books_df and books_df_2: 1785
Number of common ISBNs between books_df and books_df_2: 1785
Number of common ISBNs between books_df and ratings_df: 1915
Number of common ISBNs between books_df and ratings_df: 1915
Number of common ISBNs between books_df and gr_books_df: 328
Number of common ISBNs between books_df and gr_books_df: 328
Number of common ISBNs between books_df_2 and ratings_df: 269843
Number of common ISBNs between books_df_2 and ratings_df: 269843
Number of common ISBNs between books_df_2 and gr_books_df: 3653
Number of common ISBNs between books_df_2 and gr_books_df: 3653
Number of common ISBNs between ratings_df and gr_books_df: 3855
Number of common ISBNs between ratings_df and gr_books_df: 3855


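Each count appears twice because `find_common_isbns` already prints the result and the loop then prints it again. Each combination is only evaluated once.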
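
Each pair can also be reported exactly once by keeping the dataset names in a dictionary and collecting the counts before printing. This is a sketch on toy ISBN lists; the function name `pairwise_overlap` is hypothetical, and in the notebook the dictionary values would be the ISBN columns, for example `books_df["isbn"]`.

```python
from itertools import combinations


def pairwise_overlap(isbns_by_name):
    """Map each pair of dataset names to the number of ISBNs they share."""
    # Normalize once: drop missing values (None/NaN), trim whitespace and
    # uppercase the 'X' check digit.
    normalized = {
        name: {str(v).strip().upper() for v in values if v is not None and v == v}
        for name, values in isbns_by_name.items()
    }
    return {
        (a, b): len(normalized[a] & normalized[b])
        for a, b in combinations(normalized, 2)
    }


# Toy ISBN lists for illustration only.
toy = {
    "books_df": ["002914180X", "0439785960", None],
    "gr_books_df": ["0439785960", "043965548x"],
    "ratings_df": ["043965548X", "002914180x"],
}

overlap = pairwise_overlap(toy)
for (a, b), n in overlap.items():
    print(f"{a} vs {b}: {n} common ISBNs")
```

Storing the names as dictionary keys also avoids setting a `.name` attribute on each DataFrame.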

The output provides insights into the common ISBNs (International Standard Book Numbers) found between different combinations of dataframes:

1. Between `books_df` and `books_df_2`, there are 1,785 common ISBNs. That is under 2% of the 100,000 rows in `books_df`, so the two dataframes share only a small fraction of their books.

2. Between `books_df` and `ratings_df`, there are 1,915 common ISBNs. Only these books from `books_df` have ratings in `ratings_df`.

3. Between `books_df` and `gr_books_df`, there are 328 common ISBNs, which is a very small overlap.

4. Between `books_df_2` and `ratings_df`, there are 269,843 common ISBNs, which covers almost all of the 271,360 rows in `books_df_2`. This is expected, because both come from the same dataset.

5. Between `books_df_2` and `gr_books_df`, there are 3,653 common ISBNs, roughly a third of the 11,123 rows in `gr_books_df`.

6. Between `ratings_df` and `gr_books_df`, there are 3,855 common ISBNs, so roughly a third of the books in `gr_books_df` have ratings in `ratings_df`.

Overall, the only strong overlap is between `books_df_2` and `ratings_df`. `books_df` shares few ISBNs with the other sources, which matters when deciding which datasets can be integrated by joining on ISBN.

# Next Steps Based on this First Data Inspection

For my data cleaning process, considering the data import and inspection I've done, the following steps are necessary:

1. **Check for Missing Values**: I'll need to handle missing values. Depending on the importance of each column, I could fill them with default values or the mean/mode/median, or remove rows or columns with too many missing values.

2. **Handle Duplicate Values**: No fully duplicated rows were found. I still have to check whether the same book appears more than once under different ISBNs or titles.

3. **Data Types Consistency**: Convert years stored as objects to a numeric or date type, date columns to datetime, and ratings stored as text to integers or floats.

4. **Merging DataFrames**: Merge `books_df_2`, `ratings_df`, and `users_df`, which come from the same dataset, and similarly combine `top100_df` and `customers_df`, which also belong to the same dataset. Ensure that common columns like ISBNs align properly to maintain data integrity.

5. **Use Google Books API**: To enrich my dataset, I will probably use the Google Books API to pull in missing information, matching the ISBNs across the different DataFrames to verify that I'm pulling the correct information for each book.

6. **Outliers and Anomalies**: Look for any outliers or anomalies that could affect the clustering and modeling. Decide on a strategy to handle them, which might include log-transforming, capping, or removing these values.

7. **Textual Data**: For columns with textual data, I will consider whether to use text analysis techniques like sentiment analysis or topic modeling to turn unstructured text into structured data that can be used for clustering.

8. **Normalization or Encoding**: If I end up running a sentiment analysis on the descriptions or synopses of the books, numerical features like 'polarity' and 'subjectivity' may need scaling. Polarity indicates sentiment orientation (positive, negative, neutral), and subjectivity reflects personal opinion, emotion, or judgment. These can be scaled numerically or binned into categories as part of feature engineering for clustering.

When merging datasets and using the API, it's crucial to ensure the data being added is aligned with the existing structure to maintain consistency and relevance for the recommendation system and clustering analysis.
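
A sketch of how the merge of `ratings_df`, `books_df_2` and `users_df` could be checked for alignment. The frames are toy data shaped like those tables, with some values invented, and the column name `book_match` is my own.

```python
import pandas as pd

# Toy rows shaped like ratings_df, books_df_2 and users_df (values partly invented).
ratings = pd.DataFrame({
    "User-ID": [1, 2, 2],
    "ISBN": ["0195153448", "0195153448", "9999999999"],
    "Book-Rating": [5, 0, 8],
})
books = pd.DataFrame({"ISBN": ["0195153448"], "Book-Title": ["Classical Mythology"]})
users = pd.DataFrame({"User-ID": [1, 2], "Age": [None, 18.0]})

# Left joins keep every rating; the indicator column records whether a book matched.
merged = ratings.merge(books, on="ISBN", how="left", indicator="book_match")

# validate raises if a User-ID appears more than once in users.
merged = merged.merge(users, on="User-ID", how="left", validate="many_to_one")

unmatched = int((merged["book_match"] == "left_only").sum())
print(merged)
print("Ratings without book metadata:", unmatched)
```

Counting the `left_only` rows after each join shows how many records would be lost with an inner join.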