<a href="https://colab.research.google.com/github/sanjayyanadi/Unsupervised-model/blob/main/unsupervised_books_recommendation_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -  **Books Recommendation System**



##### **Project Type**    - Unsupervised
##### **Contribution**    - Individual


# **Problem Statement**


**BUSINESS PROBLEM OVERVIEW**


The traditional book ordering system is a manual and time-consuming process where the customer has to visit a bookstore to search for and purchase books. With tight schedules, finding a specific book is difficult because books are inadequately distributed across bookshops. The buyer also gets no recommendation to help select the right books.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
#importing libraries 
from google.colab import drive
import operator
import pandas as pd
import numpy as np
import math  # numpy.math was only an alias of the standard library module and is gone in NumPy 2.0
import re
import matplotlib.pyplot as plt 
%matplotlib inline
import seaborn as sns
sns.set()
import warnings
warnings.filterwarnings('ignore')
import random
from collections import Counter
from scipy.sparse import csr_matrix
from pandas.api.types import is_numeric_dtype
from sklearn.neighbors import NearestNeighbors
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

### Dataset Loading

In [None]:
# Importing the dataset
drive.mount('/content/drive/')  
df_Books = pd.read_csv('/content/drive/MyDrive/ALMABETTER/CAPSTONE_PROJECT/un_supervised_learning_model/Books.csv')
df_Users = pd.read_csv('/content/drive/MyDrive/ALMABETTER/CAPSTONE_PROJECT/un_supervised_learning_model/Users.csv')
df_Ratings = pd.read_csv('/content/drive/MyDrive/ALMABETTER/CAPSTONE_PROJECT/un_supervised_learning_model/Ratings.csv')

### Dataset First View

In [None]:
# Books Dataset
df_Books.head()

In [None]:
# Users Dataset
df_Users.head()

In [None]:
# Ratings Dataset
df_Ratings.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns 
# books
print("books: ",df_Books.shape)
# Users
print("Users:",df_Users.shape)
# ratings
print("ratings: ",df_Ratings.shape)

### Dataset Information

In [None]:
# Dataset Info
# books
print("Books:")
df_Books.info()
#Users
print("Users: ")
df_Users.info()
# Ratings
print("Ratings: ")
df_Ratings.info()

#### Duplicate Values

In [None]:
# books Dataset Duplicate Value Count
len(df_Books[df_Books.duplicated()])

In [None]:
# Users Dataset Duplicate Value Count
len(df_Users[df_Users.duplicated()])

In [None]:
# Ratings Dataset Duplicate Value Count
len(df_Ratings[df_Ratings.duplicated()])

#### Missing Values/Null Values

In [None]:
# Books
# Missing Values/Null Values Count
print(df_Books.isnull().sum())
# Visualizing the missing values
sns.heatmap(df_Books.isnull())

In [None]:
# Users
# Missing Values/Null Values Count
print(df_Users.isnull().sum())
# Visualizing the missing values
sns.heatmap(df_Users.isnull())

In [None]:
# Ratings
# Missing Values/Null Values Count
print(df_Ratings.isnull().sum())
# Visualizing the missing values
sns.heatmap(df_Ratings.isnull())

### What did you know about your dataset?

The data is the Book-Crossing dataset, split into three files: Books, Users, and Ratings. The goal is to use the ratings that users have given to books to recommend books a reader is likely to enjoy.

The row and column counts of each file are printed by the shape cell above, and the duplicate and null counts show which columns need cleaning. In the wrangling step below, missing values in Book-Author, Publisher and Age are filled, and invalid entries in Year-Of-Publication are replaced.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
# Books
print("books: ",df_Books.columns)
# Users
print("Users: ",df_Users.columns)
# Ratings
print("Ratings: ",df_Ratings.columns)

In [None]:
# Dataset Describe
print(df_Books.describe(include='all'))
print(df_Users.describe(include='all'))
print(df_Ratings.describe(include='all'))

### Variables Description 

The Book-Crossing dataset comprises 3 files.
• Users
Contains the users. Note that user IDs (User-ID) have been anonymized and map to integers. Demographic data is provided (Location, Age) if available. Otherwise, these fields contain NULL-values.

• Books
Books are identified by their respective ISBN. Invalid ISBNs have already been removed from the dataset. Moreover, some content-based information is given (Book-Title, Book-Author, Year-Of-Publication, Publisher), obtained from Amazon Web Services. Note that in case of several authors, only the first is provided. URLs linking to cover images are also given, appearing in three different flavours (Image-URL-S, Image-URL-M, Image-URL-L), i.e., small, medium, large. These URLs point to the Amazon web site.

• Ratings
Contains the book rating information. Ratings (Book-Rating) are either explicit, expressed on a scale from 1-10 (higher values denoting higher appreciation), or implicit, expressed by 0.
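
The explicit/implicit distinction described above can be sketched on a toy frame. The rows here are made up for illustration; only the column names (User-ID, ISBN, Book-Rating) come from the Ratings file.

```python
import pandas as pd

# Toy ratings frame (hypothetical rows) using the Ratings column names.
ratings = pd.DataFrame({
    "User-ID": [1, 1, 2, 3],
    "ISBN": ["A", "B", "A", "C"],
    "Book-Rating": [0, 8, 5, 0],
})

# Explicit ratings are on the 1-10 scale; 0 marks an implicit interaction.
explicit = ratings[ratings["Book-Rating"] > 0]
implicit = ratings[ratings["Book-Rating"] == 0]

print(len(explicit), "explicit,", len(implicit), "implicit")
```

Keeping the two groups apart matters later: averaging a 0 together with real 1-10 scores would drag a book's mean rating down for no good reason.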

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in df_Books.columns.tolist():
  print("No. of unique values in ",i,"is",df_Books[i].nunique(),".")
for i in df_Users.columns.tolist():
  print("No. of unique values in ",i,"is",df_Users[i].nunique(),".")
for i in df_Ratings.columns.tolist():
  print("No. of unique values in ",i,"is",df_Ratings[i].nunique(),".")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
## checking ISBN
flag = 0
k =[]
reg = "[^A-Za-z0-9]"

for x in df_Ratings['ISBN']:
    z = re.search(reg,x)    
    if z:
        flag = 1

if flag == 1:
    print("False")
else:
    print("True")

In [None]:
# removing extra characters from ISBN (from ratings dataset) existing in books dataset
bookISBN = set(df_Books['ISBN'])  # a set makes the membership test in the loop below fast
reg = "[^A-Za-z0-9]" 
for index, row_Value in df_Ratings.iterrows():
    z = re.search(reg, row_Value['ISBN'])    
    if z:
        f = re.sub(reg,"",row_Value['ISBN'])
        if f in bookISBN:
            df_Ratings.at[index , 'ISBN'] = f

In [None]:
## Uppercasing all alphabets in ISBN
df_Ratings['ISBN'] = df_Ratings['ISBN'].str.upper()

In [None]:
#replacing null data from book author
df_Books['Book-Author'] = df_Books['Book-Author'].fillna("Unknown")
df_Books['Book-Author'].isna().sum()

In [None]:
df_Books['Year-Of-Publication'].unique()

In [None]:
# The year column contains some non-numeric strings; coerce them to NaN
df_Books['Year-Of-Publication'] = pd.to_numeric(df_Books['Year-Of-Publication'], errors='coerce')
# Year 0 and years after 2023 are invalid, so treat them as missing too
invalid_year = (df_Books['Year-Of-Publication'] > 2023) | (df_Books['Year-Of-Publication'] == 0)
df_Books.loc[invalid_year, 'Year-Of-Publication'] = np.nan
# Replace missing years with the median of the valid years
df_Books['Year-Of-Publication'] = df_Books['Year-Of-Publication'].fillna(df_Books['Year-Of-Publication'].median())
df_Books['Year-Of-Publication'] = df_Books['Year-Of-Publication'].astype(int)
df_Books['Year-Of-Publication'].isna().sum()

In [None]:
#Replacing null data from publisher
df_Books['Publisher'] = df_Books['Publisher'].fillna('other')
df_Books['Publisher'].isna().sum()

In [None]:
print(sorted(df_Users['Age'].unique()))

In [None]:
# Treat ages above 100 and below 10 as invalid
df_Users.loc[(df_Users['Age'] > 100) | (df_Users['Age'] < 10), 'Age'] = np.nan
# Fill the missing ages with the mean age
df_Users['Age'] = df_Users['Age'].fillna(df_Users['Age'].mean())
df_Users['Age'] = df_Users['Age'].astype(int)

In [None]:
# Drop URL columns
df_Books.drop(['Image-URL-S', 'Image-URL-M', 'Image-URL-L'], axis=1, inplace=True)
df_Books.head()

In [None]:
# merging datasets
df = pd.merge(df_Books, df_Ratings, on='ISBN', how='inner')
df= pd.merge(df, df_Users, on='User-ID', how='inner')

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

In [None]:
plt.figure(figsize=(15,6))
sns.countplot(y="Book-Author", data=df,order=df['Book-Author'].value_counts().index[0:10])
plt.title("Top 10 book authors")

In [None]:
plt.figure(figsize=(15,6))
sns.countplot(y="Publisher", data=df,order=df['Publisher'].value_counts().index[0:10])
plt.title("Top 10 publishers")

In [None]:
plt.figure(figsize=(15,6))
sns.countplot(y="Book-Rating", data=df,order=df['Book-Rating'].value_counts().index[0:10])
plt.title("rating distributions")

In [None]:
## Explicit Ratings
plt.figure(figsize=(8,6))
data = df[df['Book-Rating'] != 0]
sns.countplot(x="Book-Rating", data=data)
plt.title("Explicit Ratings")

In [None]:
plt.figure(figsize=(10,10))
df.Age.hist(bins=[10*i for i in range(1, 10)])     
plt.title('Age Distribution')
plt.xlabel('Age')
plt.ylabel('Count')
plt.show()

In [None]:
plt.figure(figsize=(30,6))
sns.countplot(x="Location", data=df,order=df['Location'].value_counts().index[0:10])
plt.title("No of readers from each city (Top 10)")

In [None]:
plt.figure(figsize=(15,8))
sns.countplot(y="Book-Title", data=df, order=df['Book-Title'].value_counts().index[0:10])
plt.title("Top 10 books by number of ratings")

In [None]:
#Books popular Yearly
years = set()
indices = []
for ind, row in df.iterrows():
    if row['Year-Of-Publication'] in years:
        indices.append(ind)
    else:
        years.add(row['Year-Of-Publication'])

pd.set_option("display.max_rows", None, "display.max_columns", None)

## ***Model Implementation***

In [None]:
## Explicit Ratings Dataset
dataset1 = df[df['Book-Rating'] != 0]
dataset1 = dataset1.reset_index(drop = True)
dataset1.shape

In [None]:
## Implicit Ratings Dataset
dataset2 = df[df['Book-Rating'] == 0]
dataset2 = dataset2.reset_index(drop = True)
dataset2.shape


### ***Popularity Based Filtering***
As the name suggests, a popularity-based recommendation system follows the trend: it recommends the items that are most popular right now. For example, a book that most new users buy is likely to be suggested to a user who has just signed up.

Each book is scored with a weighted average rating:

Weighted Rating (WR) = [vR / (v + m)] + [mC / (v + m)]

where,
v is the number of votes for the book;
m is the minimum number of votes required to be listed in the chart;
R is the average rating of the book; and
C is the mean rating across all books.

Next we compute v, m, R and C.
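
Before computing these on the real data, a toy calculation shows how the formula behaves. This is a minimal sketch; the values of v, R, m and C below are made up for illustration and are not taken from the dataset.

```python
def weighted_rating(v, R, m, C):
    """Weighted rating WR = (v*R + m*C) / (v + m)."""
    return (v * R + m * C) / (v + m)

# Hypothetical numbers, chosen only to show the effect of m and C.
m = 50      # minimum number of votes to be listed
C = 7.0     # mean rating across all books

# Same average rating R, different numbers of votes v.
few_votes = weighted_rating(v=50, R=9.0, m=m, C=C)
many_votes = weighted_rating(v=950, R=9.0, m=m, C=C)

print(few_votes, many_votes)
```

With few votes the score is pulled toward the global mean C, while a book with many votes keeps a score close to its own average R. This keeps a book with a handful of perfect ratings from outranking a widely read one.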

In [None]:
def popular_books(df, n=200):
    # v: number of ratings (votes) per title
    rating_count = df.groupby('Book-Title')['Book-Rating'].count().reset_index()
    rating_count.rename(columns={'Book-Rating':'Number_Of_Votes'}, inplace=True)

    # R: average rating per title
    rating_avg = df.groupby('Book-Title')['Book-Rating'].mean().reset_index()
    rating_avg.rename(columns={'Book-Rating':'Ratings-Average'}, inplace=True)

    popular = rating_count.merge(rating_avg, on='Book-Title')

    # C: mean of the per-title averages; m: vote count at the 95th percentile
    C = popular['Ratings-Average'].mean()
    m = popular['Number_Of_Votes'].quantile(0.95)

    def weighted_rate(x):
        v = x['Number_Of_Votes']
        R = x['Ratings-Average']
        return ((v * R) + (m * C)) / (v + m)

    # Keep only the books with at least m votes
    popular = popular[popular['Number_Of_Votes'] >= m].copy()
    popular['Popularity-Score'] = popular.apply(weighted_rate, axis=1)
    popular = popular.sort_values(by='Popularity-Score', ascending=False)
    return popular[['Book-Title','Ratings-Average','Number_Of_Votes','Popularity-Score']].reset_index(drop=True).head(n)

n = 10
print("Top {} popular books".format(n))
display(popular_books(dataset1, n))

### ***Recommendations Based on the Author and Publisher of a Given Book***

In [None]:
def recom_author_public():
    bookName = input("Name a book for recommendations based on author and publisher: ")
    number = int(input("Enter the number of recommendations required: "))
    if bookName not in dataset1['Book-Title'].unique():
        print("Invalid Book Name!!")
        return
    d = dataset1[dataset1['Book-Title'] == bookName]
    # Candidate books exclude the input title itself
    data = dataset1[dataset1['Book-Title'] != bookName]
    for column, label in [('Book-Author', 'Author'), ('Publisher', 'Publisher')]:
        print("\nBooks by same {}:\n".format(label))
        value = d[column].unique()[0]
        # Highest-rated titles first; if nothing matches, nothing is printed
        matches = data[data[column] == value].sort_values(by='Book-Rating', ascending=False)
        for title in matches['Book-Title'].unique()[:number]:
            print(title)

In [None]:
recom_author_public()

### ***Collaborative Filtering based Recommendation System--(Item-Item Based)***

In [None]:
bookName = input("Enter a book name: ")
number = int(input("Enter number of books to recommend: "))

In [None]:
# Count ratings per title; the column names are set explicitly so the result
# does not depend on how the installed pandas version names value_counts() output
df = dataset1['Book-Title'].value_counts().rename_axis('Book-Title').reset_index(name='Total-Ratings')

df = dataset1.merge(df, left_on = 'Book-Title', right_on = 'Book-Title', how = 'left')
df = df.drop(['Year-Of-Publication','Publisher','Age'], axis=1)

popularity_threshold = 50
popular_book = df[df['Total-Ratings'] >= popularity_threshold]
popular_book = popular_book.reset_index(drop = True)

In [None]:
testdf = pd.DataFrame()
testdf['ISBN'] = popular_book['ISBN']
testdf['Book-Rating'] = popular_book['Book-Rating']
testdf['User-ID'] = popular_book['User-ID']
testdf = testdf[['User-ID','Book-Rating']].groupby(testdf['ISBN'])

In [None]:
listOfDictonaries=[]
indexMap = {}
reverseIndexMap = {}
ptr=0

for groupKey in testdf.groups.keys():
    tempDict={}
    groupDF = testdf.get_group(groupKey)
    for i in range(0,len(groupDF)):
        tempDict[groupDF.iloc[i,0]] = groupDF.iloc[i,1]
    indexMap[ptr]=groupKey
    reverseIndexMap[groupKey] = ptr
    ptr=ptr+1
    listOfDictonaries.append(tempDict)

dictVectorizer = DictVectorizer(sparse=True)
vector = dictVectorizer.fit_transform(listOfDictonaries)
pairwiseSimilarity = cosine_similarity(vector)
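
The cell above represents each book as a sparse vector of user ratings and compares books by cosine similarity. As a rough illustration of what that measure computes, here is a plain-Python sketch on made-up rating dictionaries. The `cosine` helper and the three toy books are hypothetical and stand in for `DictVectorizer` plus `cosine_similarity`; they are not a replacement for them.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse rating vectors stored as dicts."""
    shared = set(a) & set(b)
    dot = sum(a[u] * b[u] for u in shared)
    norm_a = math.sqrt(sum(r * r for r in a.values()))
    norm_b = math.sqrt(sum(r * r for r in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical books: each dict maps a user ID to that user's rating.
book_x = {1: 8, 2: 6}
book_y = {1: 4, 2: 3}   # same users, proportional ratings
book_z = {3: 9}         # no users in common with book_x

sim_xy = cosine(book_x, book_y)
sim_xz = cosine(book_x, book_z)
print(sim_xy, sim_xz)
```

Two books rated by the same users in the same proportions get the maximum similarity, while books with no raters in common get zero. Because cosine similarity ignores vector length, a book rated consistently lower by the same readers can still count as very similar.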

In [None]:
def printBookDetails(bookID):
    print(dataset1[dataset1['ISBN']==bookID]['Book-Title'].values[0])

def getTopRecommandations(bookID, number):
    # Return up to `number` distinct titles most similar to bookID
    collaborative = []
    row = reverseIndexMap[bookID]
    inputTitle = dataset1[dataset1['ISBN']==bookID]['Book-Title'].values[0]
    print("Input Book:")
    printBookDetails(bookID)

    print("\nRECOMMENDATIONS:\n")

    # Walk the books from most to least similar, skipping the input book itself
    for i in np.argsort(pairwiseSimilarity[row])[::-1]:
        if i == row:
            continue
        if len(collaborative) >= number:
            break
        title = dataset1[dataset1['ISBN']==indexMap[i]]['Book-Title'].values[0]
        # Skip other editions of the input title and titles already listed
        if title != inputTitle and title not in collaborative:
            printBookDetails(indexMap[i])
            collaborative.append(title)
    return collaborative

In [None]:
# Look up the ISBN among the popular books, since only those are in reverseIndexMap
popular_isbns = popular_book.loc[popular_book['Book-Title'] == bookName, 'ISBN']
collaborative = getTopRecommandations(popular_isbns.iloc[0], number)

In [None]:
popularity_threshold = 50

user_count = dataset1['User-ID'].value_counts()
data = dataset1[dataset1['User-ID'].isin(user_count[user_count >= popularity_threshold].index)]
rat_count = data['Book-Rating'].value_counts()
data = data[data['Book-Rating'].isin(rat_count[rat_count >= popularity_threshold].index)]

matrix = data.pivot_table(index='User-ID', columns='ISBN', values = 'Book-Rating').fillna(0)

In [None]:
average_rating = pd.DataFrame(dataset1.groupby('ISBN')['Book-Rating'].mean())
average_rating['ratingCount'] = pd.DataFrame(df_Ratings.groupby('ISBN')['Book-Rating'].count())
average_rating.sort_values('ratingCount', ascending=False).head()

In [None]:
isbn = df_Books.loc[df_Books['Book-Title'] == bookName].reset_index(drop = True).iloc[0]['ISBN']
row = matrix[isbn]
correlation = pd.DataFrame(matrix.corrwith(row), columns = ['Pearson Corr'])
corr = correlation.join(average_rating['ratingCount'])

res = corr.sort_values('Pearson Corr', ascending=False).head(number+1)[1:].index
corr_books = pd.merge(pd.DataFrame(res, columns = ['ISBN']), df_Books, on='ISBN')
print("\n Recommended Books: \n")
corr_books

In [None]:
data = (dataset1.groupby(by = ['Book-Title'])['Book-Rating'].count().reset_index().
        rename(columns = {'Book-Rating': 'Total-Rating'})[['Book-Title', 'Total-Rating']])

result = pd.merge(data, dataset1, on='Book-Title')
result = result[result['Total-Rating'] >= popularity_threshold]
result = result.reset_index(drop = True)

matrix = result.pivot_table(index = 'Book-Title', columns = 'User-ID', values = 'Book-Rating').fillna(0)
up_matrix = csr_matrix(matrix)

In [None]:
model = NearestNeighbors(metric = 'cosine', algorithm = 'brute')
model.fit(up_matrix)

distances, indices = model.kneighbors(matrix.loc[bookName].values.reshape(1, -1), n_neighbors = number+1)
print("\nRecommended books:\n")
for i in range(0, len(distances.flatten())):
    if i > 0:
        print(matrix.index[indices.flatten()[i]]) 

In [None]:
def recom_knneighbor():
    data = (dataset1.groupby(by = ['Book-Title'])['Book-Rating'].count().reset_index().
        rename(columns = {'Book-Rating': 'Total-Rating'})[['Book-Title', 'Total-Rating']])

    result = pd.merge(data, dataset1, on='Book-Title')
    result = result[result['Total-Rating'] >= popularity_threshold]
    result = result.reset_index(drop = True)

    matrix = result.pivot_table(index = 'Book-Title', columns = 'User-ID', values = 'Book-Rating').fillna(0)
    up_matrix = csr_matrix(matrix)
    model = NearestNeighbors(metric = 'cosine', algorithm = 'brute')
    model.fit(up_matrix)

    distances, indices = model.kneighbors(matrix.loc[bookName].values.reshape(1, -1), n_neighbors = number+1)
    print("\nRecommended books:\n")
    for i in range(0, len(distances.flatten())):
        if i > 0:
            print(matrix.index[indices.flatten()[i]])

In [None]:
recom_knneighbor()

In [None]:
popularity_threshold = 80
popular_book = df[df['Total-Ratings'] >= popularity_threshold]
popular_book = popular_book.reset_index(drop = True)
popular_book.shape

In [None]:
tf = TfidfVectorizer(ngram_range=(1, 2), min_df = 1, stop_words='english')
tfidf_matrix = tf.fit_transform(popular_book['Book-Title'])
tfidf_matrix.shape

In [None]:
normalized_df = tfidf_matrix.astype(np.float32)
cosine_similarities = cosine_similarity(normalized_df, normalized_df)
cosine_similarities.shape

In [None]:
print("Recommended Books:\n")
isbn = df_Books.loc[df_Books['Book-Title'] == bookName].reset_index(drop = True).iloc[0]['ISBN']
content = []

idx = popular_book.index[popular_book['ISBN'] == isbn].tolist()[0]
similar_indices = cosine_similarities[idx].argsort()[::-1]
similar_items = []
for i in similar_indices:
    if popular_book['Book-Title'][i] != bookName and popular_book['Book-Title'][i] not in similar_items and len(similar_items) < number:
        similar_items.append(popular_book['Book-Title'][i])
        content.append(popular_book['Book-Title'][i])

for book in similar_items:
    print(book)

In [None]:
z = list()
k = float(1/number)
for x in range(number):
      z.append(1-k*x)

dictISBN = {}
for x in collaborative:
      dictISBN[x] = z[collaborative.index(x)]

for x in content:
    if x not in dictISBN:
        dictISBN[x] = z[content.index(x)]
    else:
        dictISBN[x] += z[content.index(x)]

ISBN = dict(sorted(dictISBN.items(),key=operator.itemgetter(1),reverse=True))
w=0
print("Input Book:\n")
print(bookName)
print("\nRecommended Books:\n")
for x in ISBN.keys():
    if w>=number:
        break
    w+=1
    print(x)
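
The cell above merges the collaborative and content-based lists by giving each title a weight that decreases with its rank and summing the weights across both lists. The same idea can be sketched as two small functions. The names `rank_weights` and `combine` and the toy titles are hypothetical, introduced here only for illustration.

```python
def rank_weights(n):
    """Weight 1.0 for the top item, decreasing by 1/n per rank."""
    return [1 - i / n for i in range(n)]

def combine(list_a, list_b, n):
    """Sum the rank weights a title receives from two ranked lists."""
    weights = rank_weights(n)
    scores = {}
    for ranked in (list_a, list_b):
        for rank, title in enumerate(ranked[:n]):
            scores[title] = scores.get(title, 0.0) + weights[rank]
    # Highest combined score first; sorted() is stable, so ties keep insertion order.
    return sorted(scores, key=scores.get, reverse=True)[:n]

# Hypothetical titles standing in for the two recommenders' outputs.
collab = ["Book A", "Book B"]
content_based = ["Book B", "Book C"]
print(combine(collab, content_based, n=2))
```

A title that appears in both lists accumulates weight from each, so it can overtake a title that ranks first in only one list.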