<a href="https://colab.research.google.com/github/raushan9jnv/Book-recommendation-system/blob/main/Book_recommendation_system_capstone_project_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# **Problem Statement**
Over the last few decades, with the rise of YouTube, Amazon, Netflix, and many other such
web services, recommender systems have taken an ever larger place in our lives. From
e-commerce (suggesting articles that could interest buyers) to online advertising
(suggesting content that matches users' preferences), recommender systems are
now unavoidable in our daily online journeys.

In a very general way, recommender systems are algorithms aimed at suggesting relevant
items to users (items being movies to watch, text to read, products to buy, or anything else
depending on industries).

Recommender systems are critical in some industries: an efficient one can generate a large
amount of income, and it can also be a way to stand out significantly from
competitors. The main objective of this project is to create a book recommendation system for users.

# **Datasets Description**

### We are given three different datasets, which are used as part of our recommendation system.

1. **Users.csv :** Contains the users. Note that user IDs (User-ID) have been anonymized and map to
integers. Demographic data is provided (Location, Age) if available. Otherwise, these
fields contain NULL values.

2. **Books.csv :** Books are identified by their respective ISBN. Invalid ISBNs have already been removed
from the dataset. Moreover, some content-based information is given (Book-Title,
Book-Author, Year-Of-Publication, Publisher), obtained from Amazon Web
Services. Note that in the case of several authors, only the first is provided. URLs linking
to cover images are also given, appearing in three different flavors (Image-URL-S,
Image-URL-M, Image-URL-L), i.e., small, medium, large. These URLs point to the
Amazon website.

3. **Ratings.csv :** Contains the book rating information. Ratings (Book-Rating) are either explicit,
expressed on a scale from 1-10 (higher values denoting higher appreciation), or implicit,
expressed by 0.
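
The distinction between explicit and implicit ratings can be illustrated with a minimal sketch. The toy frame below is made up for illustration; only the column name `Book-Rating` and the meaning of 0 come from the description above.

```python
import pandas as pd

# Toy ratings frame; the values are made up for illustration only.
ratings = pd.DataFrame({
    "User-ID": [1, 1, 2, 3],
    "ISBN": ["A", "B", "A", "C"],
    "Book-Rating": [0, 8, 10, 0],
})

# Explicit ratings carry a 1-10 score; a 0 marks an implicit interaction.
explicit = ratings[ratings["Book-Rating"] > 0]
implicit = ratings[ratings["Book-Rating"] == 0]

print(len(explicit), len(implicit))
```

Keeping the two groups apart matters because a 0 is not a low score; averaging it together with explicit ratings would pull every mean downwards.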

### **Variables of Users.csv**
1. User-ID
2. Location
3. Age

### **Variables of Books.csv**
1. ISBN	
2. Book-Title	
3. Book-Author	
4. Year-Of-Publication	
5. Publisher	
6. Image-URL-S	
7. Image-URL-M	
8. Image-URL-L

### **Variables of Ratings.csv**
  1. User-ID
  2. ISBN
  3. Book-Rating

# **Objective**
The main objective is to create a book recommendation system for users.

# **Let's begin**

---

### **Importing libraries and Files**

In [None]:
# importing libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

In [None]:
file_path = "/content/drive/MyDrive/Almabetter/Capstone project/Book recommendation system capstone project-4/Book Recommendation System/data_book_recommendation/"

In [None]:
#importing books dataset
df_books = pd.read_csv(file_path + 'Books.csv')

In [None]:
#importing ratings dataset
df_ratings = pd.read_csv(file_path + 'Ratings.csv')

In [None]:
#importing users dataset
df_users = pd.read_csv(file_path + 'Users.csv')

### **A quick look at all three datasets**

Books

In [None]:
# first two rows of books dataset
df_books.head(2)

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...


In [None]:
# shape of books dataset
df_books.shape

(271360, 8)

Ratings

In [None]:
# first two rows of ratings dataset
df_ratings.head(2)

Unnamed: 0,User-ID,ISBN,Book-Rating
0,276725,034545104X,0
1,276726,0155061224,5


In [None]:
# shape of ratings dataset
df_ratings.shape

(1149780, 3)

Users

In [None]:
# first two rows of users dataset
df_users.head(2)

Unnamed: 0,User-ID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0


In [None]:
# shape of user dataset
df_users.shape

(278858, 3)

# **Data Exploration and Preprocessing**

### **Books - Data Exploration**

In [None]:
df_books.columns

Index(['ISBN', 'Book-Title', 'Book-Author', 'Year-Of-Publication', 'Publisher',
       'Image-URL-S', 'Image-URL-M', 'Image-URL-L'],
      dtype='object')

In [None]:
df_books.describe(include = 'all')

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
count,271360,271360,271359,271360,271358,271360,271360,271357
unique,271360,242135,102023,202,16807,271044,271044,271041
top,1577485157,Selected Poems,Agatha Christie,2002,Harlequin,http://images.amazon.com/images/P/044021145X.0...,http://images.amazon.com/images/P/076791404X.0...,http://images.amazon.com/images/P/068803036X.0...
freq,1,27,632,13903,7535,2,2,2


**Check for null values**

In [None]:
df_books.isnull().sum()

ISBN                   0
Book-Title             0
Book-Author            1
Year-Of-Publication    0
Publisher              2
Image-URL-S            0
Image-URL-M            0
Image-URL-L            3
dtype: int64

**Check for duplicate values**

In [None]:
df_books.duplicated().sum()  # no duplicate rows

0

 **Important info**

In [None]:
def BooksInfo():
  Binfo_df = pd.DataFrame(index=df_books.columns)
  Binfo_df['Datatypes'] =  df_books.dtypes
  Binfo_df['Count of non-null values'] = df_books.count()
  Binfo_df['NaN values'] = df_books.isnull().sum()
  Binfo_df['% NaN Values'] = (Binfo_df['NaN values']/len(df_books)).round(4)*100        # or df_books.isnull().mean()*100
  Binfo_df['Unique_count'] = df_books.nunique()
  return Binfo_df
BooksInfo()

Unnamed: 0,Datatypes,Count of non-null values,NaN values,% NaN Values,Unique_count
ISBN,object,271360,0,0.0,271360
Book-Title,object,271360,0,0.0,242135
Book-Author,object,271359,1,0.0,102023
Year-Of-Publication,object,271360,0,0.0,202
Publisher,object,271358,2,0.0,16807
Image-URL-S,object,271360,0,0.0,271044
Image-URL-M,object,271360,0,0.0,271044
Image-URL-L,object,271357,3,0.0,271041
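
Since the same summary is needed for each of the three datasets, the helper could take the DataFrame as a parameter. The following is a minimal sketch under that assumption; the name `summarize` and the toy frame are hypothetical and not part of the original notebook.

```python
import pandas as pd

def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, non-null count, NaN count, NaN share (%) and unique count."""
    return pd.DataFrame({
        "Datatypes": df.dtypes,
        "Count of non-null values": df.count(),
        "NaN values": df.isnull().sum(),
        "% NaN Values": (df.isnull().mean() * 100).round(2),
        "Unique_count": df.nunique(),
    })

# Toy frame with one missing value, for illustration only.
toy = pd.DataFrame({"a": [1, 2, None, 4], "b": ["x", "x", "y", "z"]})
summary = summarize(toy)
print(summary)
```

Calling `summarize(df_books)`, `summarize(df_users)` or `summarize(df_ratings)` would then avoid keeping three near-identical functions in sync.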


### **Books - Data Preprocessing**

 **Dropping Columns**

In [None]:
df_books.drop(['Image-URL-S','Image-URL-M','Image-URL-L'],axis=1,inplace= True)

In [None]:
df_books.head(2)

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada


**Renaming the remaining columns - for ease**

In [None]:
# define a function to rename the columns of all three datasets

def colRename(df):
  df.columns = df.columns.str.strip().str.lower().str.replace('-','_')
  return df.head(2)

In [None]:
# successfully renamed our columns

colRename(df_books)

Unnamed: 0,isbn,book_title,book_author,year_of_publication,publisher
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
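
Note that `colRename` changes the DataFrame in place and returns only a two-row preview. Where a side-effect-free variant is preferred, a sketch along these lines could be used; the name `renamed` and the toy column names are assumptions made for illustration.

```python
import pandas as pd

def renamed(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with stripped, lower-cased, underscore-separated column names."""
    out = df.copy()
    out.columns = out.columns.str.strip().str.lower().str.replace("-", "_", regex=False)
    return out

# Toy frame with untidy column names; the original is left untouched.
toy = pd.DataFrame(columns=[" User-ID", "Book-Rating ", "ISBN"])
clean = renamed(toy)
print(list(clean.columns))
```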


### year_of_publication

In [None]:
# look into year of publication unique values
df_books.year_of_publication.unique()

array([2002, 2001, 1991, 1999, 2000, 1993, 1996, 1988, 2004, 1998, 1994,
       2003, 1997, 1983, 1979, 1995, 1982, 1985, 1992, 1986, 1978, 1980,
       1952, 1987, 1990, 1981, 1989, 1984, 0, 1968, 1961, 1958, 1974,
       1976, 1971, 1977, 1975, 1965, 1941, 1970, 1962, 1973, 1972, 1960,
       1966, 1920, 1956, 1959, 1953, 1951, 1942, 1963, 1964, 1969, 1954,
       1950, 1967, 2005, 1957, 1940, 1937, 1955, 1946, 1936, 1930, 2011,
       1925, 1948, 1943, 1947, 1945, 1923, 2020, 1939, 1926, 1938, 2030,
       1911, 1904, 1949, 1932, 1928, 1929, 1927, 1931, 1914, 2050, 1934,
       1910, 1933, 1902, 1924, 1921, 1900, 2038, 2026, 1944, 1917, 1901,
       2010, 1908, 1906, 1935, 1806, 2021, '2000', '1995', '1999', '2004',
       '2003', '1990', '1994', '1986', '1989', '2002', '1981', '1993',
       '1983', '1982', '1976', '1991', '1977', '1998', '1992', '1996',
       '0', '1997', '2001', '1974', '1968', '1987', '1984', '1988',
       '1963', '1956', '1970', '1985', '1978', '1973', '1980'

**The column contains some misleading values: future years such as 2020 and 2050,
publisher names such as 'Gallimard' and 'DK Publishing Inc', and more. It also mixes integers with numeric strings.**

We will handle these below.
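
One way to locate such entries is to attempt a numeric conversion and list whatever fails to parse. A minimal sketch on a toy column follows; the series is constructed for illustration and only mimics the mixed types seen above.

```python
import pandas as pd

# Toy column mixing integers, numeric strings and stray publisher names.
years = pd.Series([2002, "1999", "DK Publishing Inc", 0, "Gallimard", 2050], dtype=object)

# Anything that cannot be parsed as a number becomes NaN.
parsed = pd.to_numeric(years, errors="coerce")

# Entries that were present but failed to parse are the suspicious ones.
non_numeric = years[parsed.isna() & years.notna()]
print(non_numeric.tolist())
```

The same conversion also turns numeric strings such as `'1999'` into numbers, so it handles the mixed integer/string issue at the same time.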

In [None]:
# rows for which year_of_publication is 'DK Publishing Inc' (the fields are shifted by one column)
df_books.loc[df_books.year_of_publication == 'DK Publishing Inc',:]

Unnamed: 0,isbn,book_title,book_author,year_of_publication,publisher
209538,078946697X,"DK Readers: Creating the X-Men, How It All Beg...",2000,DK Publishing Inc,http://images.amazon.com/images/P/078946697X.0...
221678,0789466953,"DK Readers: Creating the X-Men, How Comic Book...",2000,DK Publishing Inc,http://images.amazon.com/images/P/0789466953.0...


In [None]:
# correction for isbn '0789466953' in books data 
df_books.loc[df_books.isbn ==  '0789466953', 'year_of_publication'] = 2000
df_books.loc[df_books.isbn ==  '0789466953', 'book_author'] =  'James Buckley'
df_books.loc[df_books.isbn ==  '0789466953', 'publisher'] = 'DK Publishing Inc'
df_books.loc[df_books.isbn ==  '0789466953', 'book_title'] = 'DK Readers: Creating the X-Men, How Comic Books Come to Life (Level 4: Proficient Readers)'

In [None]:
# correction for isbn '078946697X' in books data 
df_books.loc[df_books.isbn ==  '078946697X', 'year_of_publication'] = 2000
df_books.loc[df_books.isbn ==  '078946697X', 'book_author'] =  "Michael Teitelbaum"
df_books.loc[df_books.isbn ==  '078946697X', 'publisher'] = 'DK Publishing Inc'
df_books.loc[df_books.isbn ==  '078946697X', 'book_title'] = 'DK Readers: Creating the X-Men, How It All Began (Level 4: Proficient Readers)'

In [None]:
# rows for which year_of_publication is 'Gallimard' (the fields are shifted by one column)
df_books.loc[df_books.year_of_publication == 'Gallimard',:]

Unnamed: 0,isbn,book_title,book_author,year_of_publication,publisher
220731,2070426769,"Peuple du ciel, suivi de 'Les Bergers\"";Jean-M...",2003,Gallimard,http://images.amazon.com/images/P/2070426769.0...


In [None]:
# correction for isbn '2070426769' in books data 
df_books.loc[df_books.isbn ==  '2070426769', 'year_of_publication'] = 2003
df_books.loc[df_books.isbn ==  '2070426769', 'book_author'] =  'Jean-Marie Gustave Le Clézio'
df_books.loc[df_books.isbn ==  '2070426769', 'publisher'] = 'Gallimard'
df_books.loc[df_books.isbn ==  '2070426769', 'book_title'] = "Peuple du ciel, suivi de Les Bergers"

In [None]:
# set invalid parsing as NaN
df_books.year_of_publication = pd.to_numeric(df_books.year_of_publication, errors = 'coerce')

In [None]:
# distinct publication years present in the books data
print(sorted(df_books['year_of_publication'].unique()))

[0, 1376, 1378, 1806, 1897, 1900, 1901, 1902, 1904, 1906, 1908, 1909, 1910, 1911, 1914, 1917, 1919, 1920, 1921, 1922, 1923, 1924, 1925, 1926, 1927, 1928, 1929, 1930, 1931, 1932, 1933, 1934, 1935, 1936, 1937, 1938, 1939, 1940, 1941, 1942, 1943, 1944, 1945, 1946, 1947, 1948, 1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2008, 2010, 2011, 2012, 2020, 2021, 2024, 2026, 2030, 2037, 2038, 2050]


 Our data only goes up to 2006, but here we see values greater than 2006. We will treat these as noise.
 
 The year is also 0 in some rows, which is not possible.

In [None]:
# years greater than 2006 or equal to 0 are first set to NaN, then filled with the rounded mean year
df_books.loc[(df_books.year_of_publication > 2006) | (df_books.year_of_publication == 0), 'year_of_publication'] = np.nan
df_books['year_of_publication'] = df_books['year_of_publication'].fillna(round(df_books['year_of_publication'].mean()))

In [None]:
# print(sorted(df_books['year_of_publication'].unique()))

In [None]:
df_books.year_of_publication = df_books.year_of_publication.astype(np.int32)  # cast to int32 so that years carry no decimal part

### **publisher**

In [None]:
# looking for publisher nan values
df_books.loc[df_books.publisher.isnull(), :]

Unnamed: 0,isbn,book_title,book_author,year_of_publication,publisher
128890,193169656X,Tyrant Moon,Elaine Corvidae,2002,
129037,1931696993,Finders Keepers,Linnea Sinclair,2001,


In [None]:
# all nan publisher replaced with 'other' in book dataset.
df_books.loc[(df_books.isbn == '193169656X'), 'publisher'] = 'other'
df_books.loc[(df_books.isbn == '1931696993'), 'publisher'] = 'other'

### book_author

**Handling NaN values**

In [None]:
# looking for book author null row
null_book_author=df_books[df_books['book_author'].isnull()]

In [None]:
null_book_author

Unnamed: 0,isbn,book_title,book_author,year_of_publication,publisher
187689,9627982032,The Credit Suisse Guide to Managing Your Perso...,,1995,Edinburgh Financial Publishing
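
The single missing author could be filled in the same way as the missing publishers. The sketch below assumes the `'other'` placeholder is reused and runs on a toy frame, so it is an illustration rather than a step taken in the original notebook.

```python
import pandas as pd

# Toy books frame with one missing author (illustrative values).
books = pd.DataFrame({
    "isbn": ["111", "222"],
    "book_author": ["Agatha Christie", None],
})

# Assumption: reuse the 'other' placeholder chosen for missing publishers.
books["book_author"] = books["book_author"].fillna("other")

print(books["book_author"].isnull().sum())
```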


In [None]:
df_books.index

RangeIndex(start=0, stop=271360, step=1)

In [None]:
df_books.iloc[100]

isbn                                                0385235941
book_title             Prize Stories, 1987: The O'Henry Awards
book_author                                   William Abrahams
year_of_publication                                       1987
publisher                                      Doubleday Books
Name: 100, dtype: object

In [None]:
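# year_of_publication is now an integer column, so comparing it with the string '2001' matches no rows (use the integer 2001 instead)
display(df_books.loc[(df_books['book_author'] == 'Richard Bruce Wright') & (df_books.year_of_publication == '2001')])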

Unnamed: 0,isbn,book_title,book_author,year_of_publication,publisher
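
The empty result comes from a type mismatch: once the column has been cast to integers, an equality test against a string is false for every row. A small sketch on a toy series (values chosen for illustration) shows the difference:

```python
import pandas as pd

years = pd.Series([2001, 1982], dtype="int32")

# Comparing integers with a string matches nothing.
string_matches = int((years == "2001").sum())

# Comparing with the integer finds the row.
int_matches = int((years == 2001).sum())

print(string_matches, int_matches)
```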


In [None]:
df1=df_books[((df_books['book_author'] == 'Richard Bruce Wright'))]

In [None]:
df1

Unnamed: 0,isbn,book_title,book_author,year_of_publication,publisher
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
69226,771597185,The teacher's daughter,Richard Bruce Wright,1982,Macmillan of Canada


## **Users - Data Exploration**

In [296]:
df_users.columns

Index(['User-ID', 'Location', 'Age'], dtype='object')

In [297]:
df_users.describe(include = 'all')

Unnamed: 0,User-ID,Location,Age
count,278858.0,278858,168096.0
unique,,57339,
top,,"london, england, united kingdom",
freq,,2506,
mean,139429.5,,34.751434
std,80499.51502,,14.428097
min,1.0,,0.0
25%,69715.25,,24.0
50%,139429.5,,32.0
75%,209143.75,,44.0


**Check for null values**

In [298]:
df_users.isnull().sum()

User-ID          0
Location         0
Age         110762
dtype: int64

**Check for duplicate values**

In [299]:
df_users.duplicated().sum()  # no duplicate rows

0

 **Important info**

In [300]:
def UserInfo():
  Uinfo_df = pd.DataFrame(index=df_users.columns)
  Uinfo_df['Datatypes'] =  df_users.dtypes
  Uinfo_df['Count of non-null values'] = df_users.count()
  Uinfo_df['NaN values'] = df_users.isnull().sum()
  Uinfo_df['% NaN Values'] = (Uinfo_df['NaN values']/len(df_users)).round(4)*100        # or df_users.isnull().mean()*100
  Uinfo_df['Unique_count'] = df_users.nunique()
  return Uinfo_df
UserInfo()

Unnamed: 0,Datatypes,Count of non-null values,NaN values,% NaN Values,Unique_count
User-ID,int64,278858,0,0.0,278858
Location,object,278858,0,0.0,57339
Age,float64,168096,110762,39.72,165


## **Users - Data Preprocessing**

**Renaming columns - for ease**

In [301]:
# define a function to rename the columns of all three datasets

def colRename(df):
  df.columns = df.columns.str.strip().str.lower().str.replace('-','_')
  return df.head(2)

In [302]:
# successfully renamed our columns

colRename(df_users)

Unnamed: 0,user_id,location,age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0


### user_id

In [303]:
#unique users id
df_users.user_id.values

array([     1,      2,      3, ..., 278856, 278857, 278858])

### age

In [306]:
# unique age group in ascending order
print(sorted(df_users.age.unique()))

[nan, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0, 35.0, 36.0, 37.0, 38.0, 39.0, 40.0, 41.0, 42.0, 43.0, 44.0, 45.0, 46.0, 47.0, 48.0, 49.0, 50.0, 51.0, 52.0, 53.0, 54.0, 55.0, 56.0, 57.0, 58.0, 59.0, 60.0, 61.0, 62.0, 63.0, 64.0, 65.0, 66.0, 67.0, 68.0, 69.0, 70.0, 71.0, 72.0, 73.0, 74.0, 75.0, 76.0, 77.0, 78.0, 79.0, 80.0, 81.0, 82.0, 83.0, 84.0, 85.0, 86.0, 87.0, 88.0, 89.0, 90.0, 91.0, 92.0, 93.0, 94.0, 95.0, 96.0, 97.0, 98.0, 99.0, 100.0, 101.0, 102.0, 103.0, 104.0, 105.0, 106.0, 107.0, 108.0, 109.0, 110.0, 111.0, 113.0, 114.0, 115.0, 116.0, 118.0, 119.0, 123.0, 124.0, 127.0, 128.0, 132.0, 133.0, 136.0, 137.0, 138.0, 140.0, 141.0, 143.0, 146.0, 147.0, 148.0, 151.0, 152.0, 156.0, 157.0, 159.0, 162.0, 168.0, 172.0, 175.0, 183.0, 186.0, 189.0, 199.0, 200.0, 201.0, 204.0, 207.0, 208.0, 209.0, 210.0, 212.0, 219.0, 220.0, 223.0, 226.0

Here, age should never be NaN, 0, or less than 5.

Some ages are also far too large, going up to 244, which is not possible. We will treat these values as noise.

In [307]:
# set all ages greater than 90 or less than 5 to NaN, then fill the NaN values with the mean age of the users
df_users.loc[(df_users.age > 90)| (df_users.age < 5), 'age'] = np.nan
df_users.age = df_users.age.fillna(df_users.age.mean())
df_users.age = df_users.age.astype(np.int32)  # converted the data type to int32 so that we don't have decimal values like 22.34

In [308]:
# successfully changed
print(sorted(df_users.age.unique()))

[5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90]
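
The mask-then-impute step can be checked on a small example. The helper below is a minimal sketch of the same idea; the name `clean_age`, its `low`/`high` parameters and the toy values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def clean_age(age: pd.Series, low: int = 5, high: int = 90) -> pd.Series:
    """Mask ages outside [low, high], fill gaps with the mean, cast to int32."""
    valid = age.where(age.between(low, high))   # out-of-range and NaN -> NaN
    filled = valid.fillna(valid.mean())
    return filled.astype(np.int32)              # truncates any fractional part

# Toy ages: a missing value, an impossible 0, two plausible ages, an impossible 244.
toy_age = pd.Series([np.nan, 0.0, 18.0, 30.0, 244.0])
print(clean_age(toy_age).tolist())
```

With roughly 40% of ages missing, mean imputation assigns one single value to a large share of users; filling with the median or keeping a separate "age unknown" flag are possible alternatives, depending on how age is used later.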


## **Ratings - Data Exploration**

In [309]:
df_ratings.columns

Index(['User-ID', 'ISBN', 'Book-Rating'], dtype='object')

In [310]:
df_ratings.describe(include = 'all')

Unnamed: 0,User-ID,ISBN,Book-Rating
count,1149780.0,1149780.0,1149780.0
unique,,340556.0,
top,,971880107.0,
freq,,2502.0,
mean,140386.4,,2.86695
std,80562.28,,3.854184
min,2.0,,0.0
25%,70345.0,,0.0
50%,141010.0,,0.0
75%,211028.0,,7.0


**Check for null values**

In [311]:
df_ratings.isnull().sum()

User-ID        0
ISBN           0
Book-Rating    0
dtype: int64

**Check for duplicate values**

In [312]:
df_ratings.duplicated().sum()  # no duplicate rows

0

 **Important info**

In [313]:
def RatingInfo():
  Rinfo_df = pd.DataFrame(index=df_ratings.columns)
  Rinfo_df['Datatypes'] =  df_ratings.dtypes
  Rinfo_df['Count of non-null values'] = df_ratings.count()
  Rinfo_df['NaN values'] = df_ratings.isnull().sum()
  Rinfo_df['% NaN Values'] = (Rinfo_df['NaN values']/len(df_ratings)).round(4)*100        # or df_ratings.isnull().mean()*100
  Rinfo_df['Unique_count'] = df_ratings.nunique()
  return Rinfo_df
RatingInfo()

Unnamed: 0,Datatypes,Count of non-null values,NaN values,% NaN Values,Unique_count
User-ID,int64,1149780,0,0.0,105283
ISBN,object,1149780,0,0.0,340556
Book-Rating,int64,1149780,0,0.0,11


## **Ratings - Data Preprocessing**

**Renaming columns - for ease**

In [314]:
# define a function to rename the columns of all three datasets

def colRename(df):
  df.columns = df.columns.str.strip().str.lower().str.replace('-','_')
  return df.head(2)

In [315]:
# successfully renamed our columns

colRename(df_ratings)

Unnamed: 0,user_id,isbn,book_rating
0,276725,034545104X,0
1,276726,0155061224,5


### ratings distribution