<a href="https://colab.research.google.com/github/raushan9jnv/Book-recommendation-system/blob/main/Book_recommendation_system_capstone_project_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# **Problem Statement**
Over the last few decades, with the rise of YouTube, Amazon, Netflix, and many other
web services, recommender systems have become an increasingly large part of our lives. From
e-commerce (suggesting articles that may interest buyers) to online advertising
(showing users content that matches their preferences), recommender systems are
now unavoidable in our daily online journeys.

In general terms, recommender systems are algorithms that suggest relevant
items to users (items being movies to watch, texts to read, products to buy, or anything else,
depending on the industry).

Recommender systems are critical in some industries: when they work well, they can generate
substantial revenue or help a company stand out from its
competitors. The main objective of this project is to build a book recommendation system for users.

# **Datasets Description**

### We are given three datasets, which together form the basis of our recommendation system.

1. **Users.csv :** Contains the users. Note that user IDs (User-ID) have been anonymized and map to
integers. Demographic data is provided (Location, Age) if available. Otherwise, these
fields contain NULL values.

2. **Books.csv :** Books are identified by their respective ISBN. Invalid ISBNs have already been removed
from the dataset. Moreover, some content-based information is given (Book-Title,
Book-Author, Year-Of-Publication, Publisher), obtained from Amazon Web
Services. Note that in the case of several authors, only the first is provided. URLs linking
to cover images are also given, appearing in three different flavors (Image-URL-S,
Image-URL-M, Image-URL-L), i.e., small, medium, large. These URLs point to the
Amazon website.

3. **Ratings.csv :** Contains the book rating information. Ratings (Book-Rating) are either explicit,
expressed on a scale from 1-10 (higher values denoting higher appreciation), or implicit,
expressed by 0.
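
The distinction between explicit and implicit ratings can be sketched on a toy frame. The first two rows below mirror the ratings shown later in this notebook; the other two are made-up values for illustration only.

```python
import pandas as pd

# Toy ratings frame using the column names of Ratings.csv.
# Rows 3 and 4 are hypothetical values, not taken from the dataset.
ratings = pd.DataFrame({
    "User-ID": [276725, 276726, 276727, 276729],
    "ISBN": ["034545104X", "0155061224", "0446520802", "052165615X"],
    "Book-Rating": [0, 5, 0, 3],
})

# A rating of 0 marks an implicit interaction; 1-10 are explicit ratings
explicit = ratings[ratings["Book-Rating"] > 0]
implicit = ratings[ratings["Book-Rating"] == 0]

print(len(explicit), len(implicit))
print(explicit["Book-Rating"].mean())
```

Averaging over all rows would treat the implicit zeros as very low scores, so statistics such as the mean rating are usually computed on the explicit subset only.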

### **Variables of Users.csv**
1. User-ID
2. Location
3. Age

### **Variables of Books.csv**
1. ISBN	
2. Book-Title	
3. Book-Author	
4. Year-Of-Publication	
5. Publisher	
6. Image-URL-S	
7. Image-URL-M	
8. Image-URL-L

### **Variables of Ratings.csv**
1. User-ID
2. ISBN
3. Book-Rating

# **Objective**
The main objective is to create a book recommendation system for users.

# **Let's begin**

---

### **Importing libraries and Files**

In [34]:
# importing libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

In [35]:
file_path = "/content/drive/MyDrive/Almabetter/Capstone project/Book recommendation system capstone project-4/Book Recommendation System/data_book_recommendation/"

In [36]:
#importing books dataset
df_books = pd.read_csv(file_path + 'Books.csv')

In [37]:
#importing ratings dataset
df_ratings = pd.read_csv(file_path + 'Ratings.csv')

In [38]:
#importing users dataset
df_users = pd.read_csv(file_path + 'Users.csv')
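
ISBNs can start with zeros, so they are safest read as text. In the full Books.csv, ISBNs that end in `X` probably force the column to text anyway, but passing `dtype` explicitly is a cheap safeguard. The sketch below uses an in-memory stand-in for the file (two rows and a hypothetical subset of columns) to show the difference.

```python
import io
import pandas as pd

# In-memory stand-in for Books.csv (hypothetical subset of rows and columns)
csv_text = """ISBN,Book-Title,Year-Of-Publication
0195153448,Classical Mythology,2002
0002005018,Clara Callan,2001
"""

# Without a dtype hint, an all-digit ISBN column is parsed as integers
inferred = pd.read_csv(io.StringIO(csv_text))

# Forcing str keeps the leading zeros intact
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"ISBN": str})

print(inferred["ISBN"].tolist())
print(as_text["ISBN"].tolist())
```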

### **A quick look at all three datasets**

Books

In [39]:
# first two rows of books dataset
df_books.head(2)

| | ISBN | Book-Title | Book-Author | Year-Of-Publication | Publisher | Image-URL-S | Image-URL-M | Image-URL-L |
|---|---|---|---|---|---|---|---|---|
| 0 | 0195153448 | Classical Mythology | Mark P. O. Morford | 2002 | Oxford University Press | http://images.amazon.com/images/P/0195153448.0... | http://images.amazon.com/images/P/0195153448.0... | http://images.amazon.com/images/P/0195153448.0... |
| 1 | 0002005018 | Clara Callan | Richard Bruce Wright | 2001 | HarperFlamingo Canada | http://images.amazon.com/images/P/0002005018.0... | http://images.amazon.com/images/P/0002005018.0... | http://images.amazon.com/images/P/0002005018.0... |


In [40]:
# shape of books dataset
df_books.shape

(271360, 8)

Ratings

In [41]:
# first two rows of ratings dataset
df_ratings.head(2)

| | User-ID | ISBN | Book-Rating |
|---|---|---|---|
| 0 | 276725 | 034545104X | 0 |
| 1 | 276726 | 0155061224 | 5 |


In [42]:
# shape of ratings dataset
df_ratings.shape

(1149780, 3)

Users

In [43]:
# first two rows of users dataset
df_users.head(2)

| | User-ID | Location | Age |
|---|---|---|---|
| 0 | 1 | nyc, new york, usa | NaN |
| 1 | 2 | stockton, california, usa | 18.0 |


In [44]:
# shape of user dataset
df_users.shape

(278858, 3)

# **Data Exploration and Preprocessing**

### **Books - Data Exploration**

In [54]:
df_books.columns

Index(['ISBN', 'Book-Title', 'Book-Author', 'Year-Of-Publication', 'Publisher',
       'Image-URL-S', 'Image-URL-M', 'Image-URL-L'],
      dtype='object')

In [55]:
df_books.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271360 entries, 0 to 271359
Data columns (total 8 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   ISBN                 271360 non-null  object
 1   Book-Title           271360 non-null  object
 2   Book-Author          271359 non-null  object
 3   Year-Of-Publication  271360 non-null  object
 4   Publisher            271358 non-null  object
 5   Image-URL-S          271360 non-null  object
 6   Image-URL-M          271360 non-null  object
 7   Image-URL-L          271357 non-null  object
dtypes: object(8)
memory usage: 16.6+ MB


In [53]:
df_books.describe(include = 'all')

| | ISBN | Book-Title | Book-Author | Year-Of-Publication | Publisher | Image-URL-S | Image-URL-M | Image-URL-L |
|---|---|---|---|---|---|---|---|---|
| count | 271360 | 271360 | 271359 | 271360 | 271358 | 271360 | 271360 | 271357 |
| unique | 271360 | 242135 | 102023 | 202 | 16807 | 271044 | 271044 | 271041 |
| top | 1577485157 | Selected Poems | Agatha Christie | 2002 | Harlequin | http://images.amazon.com/images/P/044021145X.0... | http://images.amazon.com/images/P/076791404X.0... | http://images.amazon.com/images/P/068803036X.0... |
| freq | 1 | 27 | 632 | 13903 | 7535 | 2 | 2 | 2 |


**Check for null values**

In [57]:
df_books.isnull().sum()

ISBN                   0
Book-Title             0
Book-Author            1
Year-Of-Publication    0
Publisher              2
Image-URL-S            0
Image-URL-M            0
Image-URL-L            3
dtype: int64

**Check for duplicate rows**

In [63]:
df_books.duplicated().sum()  # no fully duplicated rows

0

**Summary of column types, missing values, and unique counts**

In [45]:
def BooksInfo():
  # per-column summary of df_books: dtype, non-null count, NaN count, NaN share, unique values
  Binfo_df = pd.DataFrame(index=df_books.columns)
  Binfo_df['Datatypes'] =  df_books.dtypes
  Binfo_df['Count of non-null values'] = df_books.count()
  Binfo_df['NaN values'] = df_books.isnull().sum()
  Binfo_df['% NaN Values'] = (Binfo_df['NaN values']/len(df_books)).round(4)*100        # same as df_books.isnull().mean().round(4)*100
  Binfo_df['Unique_count'] = df_books.nunique()
  return Binfo_df
BooksInfo()

| | Datatypes | Count of non-null values | NaN values | % NaN Values | Unique_count |
|---|---|---|---|---|---|
| ISBN | object | 271360 | 0 | 0.0 | 271360 |
| Book-Title | object | 271360 | 0 | 0.0 | 242135 |
| Book-Author | object | 271359 | 1 | 0.0 | 102023 |
| Year-Of-Publication | object | 271360 | 0 | 0.0 | 202 |
| Publisher | object | 271358 | 2 | 0.0 | 16807 |
| Image-URL-S | object | 271360 | 0 | 0.0 | 271044 |
| Image-URL-M | object | 271360 | 0 | 0.0 | 271044 |
| Image-URL-L | object | 271357 | 3 | 0.0 | 271041 |
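
Since the same summary is useful for the ratings and users data as well, one possible variation takes the DataFrame as an argument instead of reading the global `df_books`. This is only a sketch: the function name `summarize`, its column labels, and the toy data are assumptions made for illustration.

```python
import pandas as pd

def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, non-null count, NaN count, NaN %, unique count."""
    return pd.DataFrame({
        "dtype": df.dtypes,
        "non_null": df.count(),
        "nan": df.isnull().sum(),
        "nan_pct": (df.isnull().mean() * 100).round(2),
        "unique": df.nunique(),  # NaN is not counted as a unique value
    })

# Toy frame with two missing publishers (hypothetical values)
toy = pd.DataFrame({
    "isbn": ["0195153448", "0002005018", "0060973129", "0374157065"],
    "publisher": ["Oxford University Press", None, "HarperPerennial", None],
})
summary = summarize(toy)
print(summary)
```

With this shape, `summarize(df_ratings)` or `summarize(df_users)` would produce the same kind of table without a separate function per dataset.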


### **Books - Data Preprocessing**

In [None]:
# drop the three cover-image URL columns
df_books.drop(['Image-URL-S','Image-URL-M','Image-URL-L'],axis=1,inplace= True)

In [13]:
df_books.head(2)

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada


In [16]:
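def colRename(df):
  # normalize column names in place: strip whitespace, lower-case, replace '-' with '_'
  df.columns = df.columns.str.strip().str.lower().str.replace('-','_')
  # return the first two rows as a quick check of the new names
  return df.head(2)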
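
The helper above changes the DataFrame it receives. A non-mutating alternative, sketched here under the hypothetical name `normalize_columns`, returns a renamed copy and leaves the input untouched, which can be easier to reason about when cells are re-run out of order.

```python
import pandas as pd

def normalize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with stripped, lower-cased, underscore-separated column names."""
    return df.rename(columns=lambda c: c.strip().lower().replace("-", "_"))

# Empty toy frame; only the column labels matter here
raw = pd.DataFrame(columns=["User-ID", " Book-Rating ", "ISBN"])
clean = normalize_columns(raw)

print(list(raw.columns))    # original labels are unchanged
print(list(clean.columns))  # normalized labels
```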

### books

In [17]:
df_books.columns

Index(['ISBN', 'Book-Title', 'Book-Author', 'Year-Of-Publication',
       'Publisher'],
      dtype='object')

In [None]:
colRename(df_books)

| | isbn | book_title | book_author | year_of_publication | publisher |
|---|---|---|---|---|---|
| 0 | 0195153448 | Classical Mythology | Mark P. O. Morford | 2002 | Oxford University Press |
| 1 | 0002005018 | Clara Callan | Richard Bruce Wright | 2001 | HarperFlamingo Canada |


In [None]:
df_books.nunique()

isbn                   271360
book_title             242135
book_author            102023
year_of_publication       202
publisher               16807
dtype: int64

In [None]:
df_books.isnull().sum()

isbn                   0
book_title             0
book_author            1
year_of_publication    0
publisher              2
dtype: int64

In [None]:
null_book_author=df_books[df_books['book_author'].isnull()]

In [None]:
null_book_author

| | isbn | book_title | book_author | year_of_publication | publisher |
|---|---|---|---|---|---|
| 187689 | 9627982032 | The Credit Suisse Guide to Managing Your Perso... | NaN | 1995 | Edinburgh Financial Publishing |
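
One common way to handle a handful of missing text fields is to fill them with a placeholder rather than drop the rows. The sketch below shows this on a toy frame; the placeholder `"Unknown"` is an assumption, and the notebook itself may handle these rows differently.

```python
import pandas as pd

# Toy frame: the first row mirrors the record with the missing author
books = pd.DataFrame({
    "isbn": ["9627982032", "0195153448"],
    "book_author": [None, "Mark P. O. Morford"],
    "publisher": ["Edinburgh Financial Publishing", "Oxford University Press"],
})

# Fill only the text columns that can be missing; "Unknown" is an assumed placeholder
filled = books.fillna({"book_author": "Unknown", "publisher": "Unknown"})

print(filled)
```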


In [None]:
df_books.index

RangeIndex(start=0, stop=271360, step=1)

In [None]:
df_books.iloc[100]

isbn                                                0385235941
book_title             Prize Stories, 1987: The O'Henry Awards
book_author                                   William Abrahams
year_of_publication                                       1987
publisher                                      Doubleday Books
Name: 100, dtype: object

In [None]:
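# returns no rows: year_of_publication has dtype object, and this row most likely stores the year as the integer 2001, which does not equal the string '2001'
display(df_books.loc[(df_books['book_author'] == 'Richard Bruce Wright') & (df_books.year_of_publication == '2001')])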

| | isbn | book_title | book_author | year_of_publication | publisher |
|---|---|---|---|---|---|
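
The empty result is consistent with `year_of_publication` being an object column that mixes integers and strings, although that is an inference from the outputs here rather than something the notebook states. A minimal sketch of the problem, and of one way around it with `pd.to_numeric`, on made-up rows:

```python
import pandas as pd

# Toy frame with a mixed int/str year column, as an object column can hold
books = pd.DataFrame({
    "book_author": ["Richard Bruce Wright", "Richard Bruce Wright", "Mark P. O. Morford"],
    "year_of_publication": [2001, "1982", "2002"],
})

# Comparing against the string '2001' misses the integer 2001
by_string = books[books["year_of_publication"] == "2001"]

# Coerce everything to numbers; values that cannot be parsed become NaN
books["year_of_publication"] = pd.to_numeric(books["year_of_publication"], errors="coerce")

by_number = books[
    (books["book_author"] == "Richard Bruce Wright")
    & (books["year_of_publication"] == 2001)
]

print(len(by_string), len(by_number))
```

After coercion the column has a single numeric type, so year filters and later steps such as range checks behave predictably.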


In [None]:
df1=df_books[((df_books['book_author'] == 'Richard Bruce Wright'))]

In [None]:
df1

| | isbn | book_title | book_author | year_of_publication | publisher |
|---|---|---|---|---|---|
| 1 | 0002005018 | Clara Callan | Richard Bruce Wright | 2001 | HarperFlamingo Canada |
| 69226 | 0771597185 | The teacher's daughter | Richard Bruce Wright | 1982 | Macmillan of Canada |


### ratings

In [None]:
colRename(df_ratings)

| | user_id | isbn | book_rating |
|---|---|---|---|
| 0 | 276725 | 034545104X | 0 |
| 1 | 276726 | 0155061224 | 5 |


In [None]:
df_ratings.nunique()

user_id        105283
isbn           340556
book_rating        11
dtype: int64
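
The ratings reference 340,556 distinct ISBNs, while the books data holds only 271,360, so some rated ISBNs cannot have a matching book record. A sketch of how such ratings could be separated with `isin`, on toy frames with made-up rows:

```python
import pandas as pd

# Toy stand-ins for the renamed books and ratings frames
books = pd.DataFrame({"isbn": ["034545104X", "0195153448"]})
ratings = pd.DataFrame({
    "user_id": [276725, 276726, 276727],
    "isbn": ["034545104X", "0155061224", "0195153448"],
    "book_rating": [0, 5, 8],
})

# True where the rated ISBN has a matching book record
known = ratings["isbn"].isin(books["isbn"])
ratings_known = ratings[known]

print(int(known.sum()), int((~known).sum()))
```

Whether to drop the unmatched ratings or keep them depends on the model: title-based recommendations need the book metadata, while a pure user-item matrix does not.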

In [None]:
df_ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1149780 entries, 0 to 1149779
Data columns (total 3 columns):
 #   Column       Non-Null Count    Dtype 
---  ------       --------------    ----- 
 0   user_id      1149780 non-null  int64 
 1   isbn         1149780 non-null  object
 2   book_rating  1149780 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 26.3+ MB


### users

In [None]:
colRename(df_users)

| | user_id | location | age |
|---|---|---|---|
| 0 | 1 | nyc, new york, usa | NaN |
| 1 | 2 | stockton, california, usa | 18.0 |


In [None]:
df_users.nunique()

user_id     278858
location     57339
age            165
dtype: int64
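
With 165 distinct values, the `age` column very likely contains entries outside a plausible human range. One possible cleanup, sketched on toy data, treats such values as missing. The bounds `MIN_AGE` and `MAX_AGE` and the sample ages are assumptions chosen for illustration, not values taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy users frame; ages 244 and 0 are made-up examples of implausible entries
users = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "age": [np.nan, 18.0, 244.0, 0.0],
})

MIN_AGE, MAX_AGE = 5, 100  # assumed plausibility bounds

# between() is False for NaN, so missing ages stay missing
plausible = users["age"].between(MIN_AGE, MAX_AGE)

# Keep plausible ages and set everything else to NaN
users["age_clean"] = users["age"].where(plausible)

print(users)
```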