# **Recommendation system Using Collaborative Filtering**

The work I present in this project primarily revolves around the construction of recommendation systems using a collaborative filtering approach. The objective is to provide personalized recommendations to users based on their preferences and similarities with other users or items.


To achieve this, I have employed a memory-based collaborative filtering technique, which encompasses two distinct types: user-based collaborative filtering and item-based collaborative filtering. These methods leverage the collective wisdom of users or items to generate accurate recommendations.


User-based collaborative filtering involves analyzing the behavior and preferences of similar users to make recommendations. By identifying users with similar tastes and preferences, the system suggests items that these users have enjoyed, but the current user has not yet explored. This approach taps into the power of social influence and user communities to enhance the accuracy of recommendations.

On the other hand, item-based collaborative filtering focuses on the similarities between items themselves. It examines the historical preferences of users and the relationships between different items to identify items that are similar to those the user has already shown interest in. By suggesting similar items, this method aims to capture the user's taste and provide diverse and relevant recommendations.


In this project, I have implemented and evaluated both user-based and item-based collaborative filtering algorithms. For the traditional approach, I developed custom algorithms from scratch to compute user or item similarities, generate recommendations, and evaluate system performance. This allowed me to have full control over the implementation details and fine-tune the algorithms based on specific requirements.


In addition to the traditional approach, I also utilized the Surprise library, which offers a comprehensive set of tools and pre-implemented algorithms specifically designed for recommendation systems. By incorporating the Surprise library into my project, I could take advantage of its optimized algorithms, convenient API, and evaluation metrics. This facilitated efficient development and evaluation of recommendation models.


By comparing the performance of both the traditional implementation and the Surprise library approach, I aim to provide insights into the effectiveness of each method in different scenarios. This comprehensive analysis highlights the benefits of leveraging existing libraries like Surprise for efficient recommendation system development, while also showcasing the flexibility and control offered by the traditional implementation approach.
















---



---





# **Dataset**

The Book Recommendation dataset is a comprehensive collection of user ratings and book information, compiled for the purpose of building and evaluating recommendation systems. The dataset consists of multiple components that provide a rich source of data for training and testing recommendation models.


The main components of the dataset include:


**User Ratings**: This portion of the dataset contains information about user preferences and ratings for various books. Each user rating typically includes the user's unique identifier, the book's identifier, and the corresponding rating given by the user. These ratings serve as valuable indicators of user preferences and play a crucial role in the collaborative filtering approach.


**Book Information**: This segment of the dataset contains detailed information about the books themselves. It typically includes attributes such as the book's title, author, genre, publication year, and other relevant metadata. This information helps in understanding the characteristics and features of the books, which can be used in content-based filtering or as additional contextual data in recommendation systems.


The dataset provides a diverse collection of books from various genres, authors, and time periods, allowing for a comprehensive analysis of different recommendation techniques. By combining user ratings with book information, it offers a rich context for building personalized recommendation models.


In this project, the Book Recommendation dataset from Kaggle has been used to train, test, and evaluate collaborative filtering models, both using traditional implementation and leveraging the Surprise library. The dataset's rich collection of user ratings and book attributes enables the exploration of different collaborative filtering approaches and the assessment of their performance in providing accurate and personalized recommendations.


By utilizing this dataset, the project aims to contribute to the field of recommendation systems by providing insights into the effectiveness of collaborative filtering algorithms and their application in real-world book recommendation scenarios.







In [None]:
# Installing and importing all the necessary libraries 
!pip install --upgrade scikit-surprise

In [2]:
from google.colab import files
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors
import missingno as msno
from surprise import Reader, Dataset
from surprise.model_selection import GridSearchCV
from surprise import KNNBasic
from surprise import accuracy
import random
from surprise.model_selection import train_test_split
from surprise import KNNBasic
from sklearn.metrics.pairwise import cosine_similarity
from surprise import KNNWithMeans

In this project, two CSV files from the Book Recommendation dataset were utilized for building the recommendation system: "Books.csv" and "Ratings.csv." These files contain essential information about books and user ratings, respectively.


The "Books.csv" file includes detailed information about the books in the dataset. It typically consists of attributes such as the book's unique identifier, title, author, genre, publication year, and other relevant metadata. This information provides valuable insights into the characteristics and features of the books, enabling better understanding and analysis for recommendation purposes.


The "Ratings.csv" file contains user ratings for the books in the dataset. It typically includes information about the users' unique identifiers, the books they have rated, and the corresponding ratings they have assigned. These ratings serve as the foundation for collaborative filtering techniques, enabling the system to identify user preferences and make personalized recommendations.

In [None]:
books=pd.read_csv("/content/Books.csv")
books.head()

  books=pd.read_csv("/content/Books.csv")


Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


In [None]:
ratings=pd.read_csv("/content/Ratings.csv")
ratings.head()

Unnamed: 0,User-ID,ISBN,Book-Rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


In [None]:
print("Books Shape: " ,books.shape )
print("Ratings Shape: " ,ratings.shape )

Books Shape:  (271360, 8)
Ratings Shape:  (1149780, 3)


In [None]:
# Check for null values in both the datasets.
print("Count of null values in Books:\n" ,books.isnull().sum())

print("                                                        ")
print("                                                        ")
print("Count of values in Ratings:\n ",ratings.isnull().sum())

Count of null values in Books:
 ISBN                   0
Book-Title             0
Book-Author            2
Year-Of-Publication    0
Publisher              2
Image-URL-S            0
Image-URL-M            0
Image-URL-L            3
dtype: int64
                                                        
                                                        
Count of values in Ratings:
  User-ID        0
ISBN           0
Book-Rating    0
dtype: int64


To create a unified dataset for the recommendation system, both the "Books.csv" and "Ratings.csv" files are merged based on a common identifier, such as the book's unique identifier- 'ISBN'. This merging process allows the system to associate the book attributes from "Books.csv" with the corresponding user ratings from "Ratings.csv." By combining these datasets, a comprehensive dataset is formed, which includes both book information and user ratings for building the recommendation system.


By merging the datasets into a single cohesive dataset, the project aims to leverage the combined information to develop accurate and personalized recommendation models. The merged dataset enables the exploration of collaborative filtering algorithms and the evaluation of their performance using real-world book ratings and attributes.

In [None]:
books_data=books.merge(ratings,on="ISBN")
books_data.head()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L,User-ID,Book-Rating
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,2,0
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,8,5
2,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,11400,0
3,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,11676,8
4,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,41385,0


# **Data Cleaning and Preprocessing**

In the data cleaning and preprocessing phase, several operations were performed to ensure the data is in a suitable format for building the recommendation system. The following steps were executed:


1. Creation of a New DataFrame: A new DataFrame called 'df' was created by making a copy of the original 'books_data' DataFrame. This allows for a separate working copy to be used for data cleaning without modifying the original dataset.


2. Removal of Missing Values: Rows with missing values (NaN) were dropped from the 'df' DataFrame using the 'dropna' method. This ensures that the dataset remains consistent and complete. Furthermore, the index of the DataFrame was reset using the 'reset_index' method to maintain a sequential index.


3. Removal of Unnecessary Columns: Specific columns, such as "ISBN," "Year-Of-Publication," "Image-URL-S," and "Image-URL-M," were dropped from the 'df' DataFrame. These columns were deemed unnecessary for the recommendation system, allowing for a more focused and streamlined dataset.


4. Filtering Zero Ratings: Rows in 'df' where the 'Book-Rating' column has a value of 0 were dropped. This was accomplished by selecting the corresponding indices using the expression 'df[df["Book-Rating"] == 0].index' and then dropping them using the 'drop' method with the 'index' parameter. By excluding zero ratings, the system focuses on meaningful ratings that can contribute to accurate recommendations.


5. Preview of Modified DataFrame: Finally, the modified 'df' DataFrame was displayed to provide a glimpse of the cleaned dataset. The 'head' method was used to showcase the initial rows, giving an overview of the transformed data ready for further analysis.


By performing these data cleaning and preprocessing steps, the dataset was refined to ensure data integrity, remove unnecessary information, and focus on relevant ratings. This sets the foundation for building a robust and effective recommendation system based on the processed dataset.

In [None]:
# Create a new DataFrame 'df' by making a copy of 'books_data'
df = books_data.copy()

# Drop rows with missing values (NaN) from 'df'
df.dropna(inplace=True)

# Reset the index of 'df' after dropping the rows
df.reset_index(drop=True, inplace=True)

# Drop the specified columns from 'df' using the 'drop' method
df.drop(columns=["ISBN", "Year-Of-Publication", "Image-URL-S", "Image-URL-M"], axis=1, inplace=True)

# Drop rows from 'df' where the 'Book-Rating' column has a value of 0
df.drop(index=df[df["Book-Rating"] == 0].index, inplace=True)

# Display the first few rows of the modified 'df'
df.head()


Unnamed: 0,Book-Title,Book-Author,Publisher,Image-URL-L,User-ID,Book-Rating
1,Clara Callan,Richard Bruce Wright,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,8,5
3,Clara Callan,Richard Bruce Wright,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,11676,8
5,Clara Callan,Richard Bruce Wright,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,67544,8
8,Clara Callan,Richard Bruce Wright,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,116866,9
9,Clara Callan,Richard Bruce Wright,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,123629,9


# **USER BASED Collaborative Filtering**

The code below showcases a series of operations to generate book recommendations using collaborative filtering:


1. Filtering Users: In order to improve the reliability and accuracy of the recommendation system, the code drops users from the DataFrame 'df' who have voted less than 200 times. By excluding users with a limited voting history, the system focuses on users who have actively engaged with the dataset, providing more reliable recommendations.


2. Creating a User-Item Matrix: To facilitate collaborative filtering, a user-item matrix is created using the 'pivot_table' method. The pivot table is constructed by selecting the appropriate columns from the DataFrame 'df' and specifying 'index='User-ID'', 'columns='Book-Title'', and 'values='Book-Rating''. This transformation results in a matrix-like structure where rows represent users, columns represent book titles, and the values represent book ratings. Missing ratings are filled with 0 to complete the matrix, indicating that the user has not rated that particular book.

    In the user-item matrix:

    **Rows**: Each row in the matrix represents a unique user from the dataset. The 'User-ID' values from the DataFrame 'df' are used as the row index. Each row corresponds to a specific user and contains their ratings for different books.

    **Columns**: Each column in the matrix represents a unique book title from the dataset. The 'Book-Title' values from the DataFrame 'df' are used as the column headers. Each column corresponds to a specific book and contains the ratings given by different users for that book.

    Values: The values in the user-item matrix represent the ratings provided by users for different books. The 'Book-Rating' values from the DataFrame 'df' are used as the values in the matrix. Each cell in the matrix contains a specific user's rating for a particular book





3. Fitting a k-Nearest Neighbors Model: The code fits a k-nearest neighbors (k-NN) model on the user-item matrix. The k-NN algorithm measures the similarity between users based on their voting patterns and identifies neighbors with similar tastes. In this case, the cosine distance metric is used to calculate the similarity between user vectors in the matrix. The choice of k-NN and the cosine distance metric is common in collaborative filtering as they effectively capture user preferences and produce reliable recommendations.


4. Retrieving Neighbors: To retrieve the k-nearest neighbors for a given user, the code defines a function named 'get_neighbors'. This function takes the user-item matrix and the target user as inputs and utilizes the fitted k-NN model to identify users who share similar voting patterns. The similarity is determined by the cosine distance calculated during the model fitting process. Retrieving the k-nearest neighbors forms the basis for generating personalized recommendations by considering the preferences and choices of similar users.


5. Generating Recommendations: Another function named 'generate_recommendations' is defined to generate book recommendations for a given user. This function leverages the k-nearest neighbors obtained from 'get_neighbors' and suggests books that the target user's neighbors have enjoyed but the target user has not yet explored. This collaborative filtering approach taps into the collective wisdom of similar users to provide relevant and diverse recommendations.


6. Random User Selection: As an example, the code randomly selects a user from the user-item matrix and obtains book recommendations for that user using the 'generate_recommendations' function. This random user serves as a demonstration to showcase the recommendation generation process.


7. Retrieving Book Details and Ratings: To provide comprehensive information about the recommended books, the code queries the original DataFrame 'df' to retrieve the book details and average ratings. This step ensures that the recommendations include essential information such as book titles, authors, and average ratings.


8. Creating a DataFrame for Recommendations: The code constructs a DataFrame named 'book_df' that contains the book details and average ratings of the recommended books. This organized representation allows for easy visualization and analysis of the recommended books.

In [None]:
# Drop users who have voted less than 200 times
user_counts = df['User-ID'].value_counts()
df = df[df['User-ID'].isin(user_counts[user_counts > 200].index)]

# Create the user-item matrix
user_item_matrix = df.pivot_table(index='User-ID', columns='Book-Title', values='Book-Rating', fill_value=0)

# Fit the k-NN model
k = 5
knn_model = NearestNeighbors(metric='cosine', algorithm='brute')
knn_model.fit(user_item_matrix.values)

# Function to get the k-nearest neighbors for a given user
def get_neighbors(user_id):
    """
    Retrieve the k-nearest neighbors for a given user based on the fitted k-NN model.

    Args:
        user_id (int): ID of the user for whom neighbors need to be found.

    Returns:
        numpy.ndarray: Array containing the IDs of the k-nearest neighbors.
    """
    user_index = user_item_matrix.index.get_loc(user_id)
    distances, indices = knn_model.kneighbors(user_item_matrix.iloc[user_index, :].values.reshape(1, -1), n_neighbors=k+1)
    neighbor_indices = indices.flatten()[1:]  # Exclude the user's own index
    return user_item_matrix.index[neighbor_indices]

# Function to generate recommendations for a given user
def generate_recommendations(user_id):
    """
    Generate book recommendations for a given user based on collaborative filtering.

    Args:
        user_id (int): ID of the user for whom recommendations need to be generated.

    Returns:
        pandas.Index: Index of the top recommended book titles.
    """
    neighbors = get_neighbors(user_id)
    user_ratings = user_item_matrix.loc[user_id]

    recommendations = []

    for neighbor_id in neighbors:
        neighbor_ratings = user_item_matrix.loc[neighbor_id]
        unrated_books = neighbor_ratings.index[neighbor_ratings == 0]

        if not unrated_books.empty:
            recommendations.extend(unrated_books)

    recommendations = pd.Series(recommendations).value_counts().sort_values(ascending=False)
    return recommendations.index[:5]

# Example usage
target_user_id = random.choice(user_item_matrix.index.tolist())
recommendations = generate_recommendations(target_user_id)

# Get book details and average ratings for the recommendations
book_details = []
for book_title in recommendations:
    book_author = df.loc[df['Book-Title'] == book_title, 'Book-Author'].iloc[0]
    book_ratings = df.loc[df['Book-Title'] == book_title, 'Book-Rating']
    average_rating = np.mean(book_ratings)
    book_details.append({'Book-Title': book_title, 'Book-Author': book_author, 'Book-Avg-Rating': average_rating})

# Create DataFrame with book details
print(f"Book recommendations for user {target_user_id}:")
print("                                 ")
book_df = pd.DataFrame(book_details).reset_index(drop=True)

# Display the DataFrame
book_df



Book recommendations for user 174304:
                                 


Unnamed: 0,Book-Title,Book-Author,Book-Avg-Rating
0,Dark Justice,Jack Higgins,10.0
1,Agyar,Steven Brust,10.0
2,Ahead Of The Game (Mira),Suzann Ledbetter,8.0
3,Young Goodman Brown and Other Tales (Oxford Wo...,Nathaniel Hawthorne,10.0
4,Aimez Vous Brahms?,Francoise Sagan,8.0


# **Using Surpise Library for User Based Collaborative Filtering**

# **Surpise Library**

The code below utilizes the Surprise library, a powerful tool for building collaborative filtering-based recommendation systems. Surprise is a Python scikit-learn library specifically designed for recommendation systems, providing a wide range of algorithms and evaluation metrics.


Here's how the Surprise library typically works:

1. Data Preparation: The library expects data in a specific format, often as a user-item matrix or as a collection of user-item ratings. The data can be loaded from various sources, such as CSV files or Pandas DataFrames. Surprise provides the Reader object to define the rating scale and format of the data.


2. Dataset Creation: The data is then converted into a Surprise Dataset object using the Reader. The Dataset object handles data storage, splitting into training and testing sets, and other preprocessing tasks required for modeling.


3. Algorithm Selection and Configuration: Surprise offers a range of collaborative filtering algorithms, each with its own strengths and characteristics. These algorithms can be chosen based on the specific needs and characteristics of the dataset. The algorithms can also be configured with parameters to fine-tune their behavior.


4. Model Training: The selected algorithm is trained on the training set of the dataset. During training, the algorithm learns the underlying patterns and relationships in the data. For example, in user-based collaborative filtering, the algorithm identifies similar users based on their voting patterns or preferences.


5. Prediction and Recommendation Generation: Once the model is trained, it can make predictions or generate recommendations for unseen data. For example, given a user, the model can predict the rating the user would give to a specific item. These predictions can be used to provide personalized recommendations based on user preferences.


6. Evaluation: The performance of the recommendation model is evaluated using appropriate evaluation metrics. Surprise provides various evaluation metrics, such as Root Mean Squared Error (RMSE) and Mean Average Precision (MAP), to measure the accuracy and quality of the recommendations. These metrics help assess the effectiveness of the model in predicting user preferences.











# **Using surprise libarary in the project**



In the code:




 ***Surprise Reader Object and Dataset Creation:*** 

A Surprise Reader object is created to define the rating scale of the dataset. This allows the system to understand the range and format of the ratings.
The DataFrame 'df' is loaded into a Surprise Dataset object using the Reader. This conversion prepares the data for further processing and analysis within the Surprise library.
The dataset is then split into training and test sets, with 80% of the data allocated for training and 20% for testing. This division allows the model's performance to be evaluated using unseen data.





***Building the User-Based Collaborative Filtering Model:***


The user-based collaborative filtering model, specifically KNNWithMeans, is constructed. This algorithm identifies similar users based on their voting patterns and predicts ratings for items based on those similarities.
The model is trained on the training set, learning the underlying patterns and relationships between users and items in the dataset. This training process enables the model to make accurate predictions on unseen data.




***Generating Recommendations for a Random User:***

A random user from the test set is selected. This user serves as an example to showcase the recommendation generation process.
The code generates the top 5 recommendations for the selected user by iterating over unique book titles. For each book, the model predicts the rating that the user would give.
The recommendations, including the user ID, book title, and predicted rating, are stored in a list.
The recommendations are then sorted based on the predicted rating in descending order. This ensures that the most highly rated books are presented as the top recommendations for the user.





***Evaluating the Model's Performance:***

The model's predictions on the test set are utilized to calculate the Root Mean Squared Error (RMSE). RMSE is a common evaluation metric for recommendation systems, measuring the average difference between the predicted ratings and the actual ratings in the test set.
The RMSE provides an indication of the model's accuracy in predicting user ratings. A lower RMSE value signifies better performance, as it indicates that the model's predictions are closer to the actual ratings.
By utilizing the Surprise library and its collaborative filtering algorithms, the code efficiently generates personalized recommendations based on user preferences and patterns. The evaluation metric, RMSE, quantifies the model's predictive accuracy, allowing for a comprehensive assessment of its performance.






In [None]:
# Create surprise dataset
# The 'Reader' class defines the rating scale for the dataset
reader = Reader(rating_scale=(0, 10))

# Load the DataFrame 'df' into a Surprise Dataset object
data = Dataset.load_from_df(df[['User-ID', 'Book-Title', 'Book-Rating']], reader)

# Split the data into training and test sets
# 80% of the data is used for training and 20% for testing
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)

# Build the user-based collaborative filtering model
# 'KNNWithMeans' is a k-nearest neighbors algorithm that takes into account the mean ratings of users
# 'k=5' specifies that the algorithm will consider 5 nearest neighbors
# 'sim_options={'user_based': True}' indicates that the algorithm will use user-based collaborative filtering
algo = KNNWithMeans(k=5, sim_options={'user_based': True})

# Train the model on the training set
algo.fit(trainset)

# Randomly select a user from the test set
target_user_id = random.choice([uid for (uid, _, _) in testset])

# Generate top 5 recommendations for the user
top_n = []
for book_id in df['Book-Title'].unique():
    # Predict the rating for the target user and each book in the dataset
    prediction = algo.predict(target_user_id, book_id)
    top_n.append((target_user_id, book_id, prediction.est))

# Sort the recommendations by predicted rating in descending order
top_n.sort(key=lambda x: x[2], reverse=True)

# Create a DataFrame from the recommendations
recommendations_df = pd.DataFrame(top_n, columns=['User-ID', 'Book-Title', 'Predicted Rating'])

# Calculate Root Mean Squared Error (RMSE) for the model's predictions on the test set
predictions = algo.test(testset)
rmse = accuracy.rmse(predictions)

# Print the RMSE value
print("RMSE:", rmse)

# Print the top 5 recommendations for the target user
print("Top 5 Recommendations for User", target_user_id)
recommendations_df.head(5)



Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 1.8399
RMSE: 1.839905079055281
Top 5 Recommendations for User 25981


Unnamed: 0,User-ID,Book-Title,Predicted Rating
0,25981,EYES OF DARKNESS,10.0
1,25981,The Road to Oz,10.0
2,25981,The Pool in the Desert (Penguin Short Fiction),10.0
3,25981,Joy Luck Club,10.0
4,25981,Ozma of Oz,10.0


# **ITEM-BASED Collaborative filtering**


The code employs item-based collaborative filtering, a powerful technique for generating personalized recommendations based on the similarity of items. Here's how the code works:


1. Creating the Item-User Matrix: The code starts by constructing an item-user matrix using the pivot_table() function from the pandas library. This matrix captures the ratings provided by different users for different books. Each row represents a book title, each column corresponds to a user ID, and the matrix values denote the ratings assigned by users to books. If a user has not rated a particular book, the value is filled with 0. This matrix provides a structured representation of user preferences and forms the basis for similarity calculations.


2. Sparse Matrix Conversion: To optimize memory usage and enable efficient computations, the item-user matrix is transformed into a sparse Compressed Sparse Row (CSR) matrix using the csr_matrix() function from the scipy.sparse module. Sparse matrices store only non-zero values, resulting in significant memory savings. This conversion ensures that the matrix can be processed by various machine learning algorithms.


3. Initializing the k-Nearest Neighbors Model: The code initializes a k-nearest neighbors (k-NN) model using the NearestNeighbors class from the sklearn.neighbors module. The k-NN model is configured to utilize the cosine similarity metric, which measures the similarity between two vectors by computing the cosine of the angle between them. The brute-force algorithm is employed for finding nearest neighbors based on the similarity of rating patterns. This model will identify items (books) with similar rating patterns, enabling the generation of relevant recommendations.


4. Training the k-Nearest Neighbors Model: The k-NN model is trained on the item-user matrix using the fit() method. During this step, the model computes the nearest neighbors for each item in the matrix, based on the similarity of their rating patterns. By analyzing the ratings given to different books by various users, the model learns the underlying relationships and identifies items that are likely to be preferred by users with similar tastes.


5. Defining Utility Functions: The code defines two utility functions. The first function, get_neighbors(), takes an item index as input and retrieves the indices of its k nearest neighbors using the trained k-NN model. These nearest neighbors are items that exhibit similar rating patterns, making them potentially relevant for generating recommendations. The second function, generate_recommendations(), takes the title of a target item (book) as input. It checks if the item exists in the item-user matrix, retrieves the k nearest neighbors using get_neighbors(), and identifies the users who have not rated the target item. It then creates a list of recommendations by pairing each unrated user with the similar item from the neighbors. The recommendations are stored in a DataFrame and returned as the output.


6. Generating Recommendations: To demonstrate the recommendation generation process, the code randomly selects a target item (book) from the item-user matrix. It calls the generate_recommendations() function to obtain the recommendations for that item. The recommendations, which suggest books that users who liked the target item also enjoyed, are displayed as the final output.


In summary, the code leverages item-based collaborative filtering to generate personalized recommendations. By constructing an item-user matrix, identifying nearest neighbors, and analyzing user rating patterns, the system can suggest items that are likely to be of interest to users based on their preferences. This approach harnesses the collective wisdom of users and the similarity between items to provide relevant and tailored recommendations.








In [None]:
#The line of code below assigns unique item IDs to each book title in the DataFrame 'df'.
#The pd.factorize() function is used to encode the book titles into numeric values.
#The [0] index of the returned tuple contains the encoded values.
#We add 1 to the encoded values to ensure that the item IDs start from 1 instead of 0.
#By assigning unique item IDs, we create a numerical representation of the book titles.



df['Item-ID'] = pd.factorize(df['Book-Title'])[0] + 1

In [None]:

# Create the item-user matrix
item_user_matrix = df.pivot_table(index='Book-Title', columns='User-ID', values='Book-Rating', fill_value=0)

# Convert the item-user matrix to a sparse CSR matrix
item_user_matrix_sparse = csr_matrix(item_user_matrix.values)

# Fit the k-NN model
k = 5
knn_model = NearestNeighbors(metric='cosine', algorithm='brute')
knn_model.fit(item_user_matrix_sparse)

# Function to get the k-nearest neighbors for a given item
def get_neighbors(item_index):
    distances, indices = knn_model.kneighbors(item_user_matrix_sparse[item_index, :].reshape(1, -1), n_neighbors=k+1)
    neighbor_indices = indices.flatten()[1:]  # Exclude the item's own index
    return neighbor_indices

# Function to generate recommendations for a given item
def generate_recommendations(item_title):
    if item_title not in item_user_matrix.index:
        print("Item not found in matrix.")
        return pd.DataFrame(columns=['Similar-Item (Books)', 'User-ID'])
    
    item_index = item_user_matrix.index.get_loc(item_title)
    neighbors = get_neighbors(item_index)
    
    # Recreate item_user_matrix_sparse based on updated item_user_matrix
    item_user_matrix_sparse = csr_matrix(item_user_matrix.values.T)
    
    recommendations = []
    for neighbor_index in neighbors:
        unrated_users = item_user_matrix_sparse[:, neighbor_index].nonzero()[0]
        similar_items = [item_user_matrix.index[neighbor_index]] * len(unrated_users)
        recommendations.extend(zip(similar_items, unrated_users))
    
    recommendations_df = pd.DataFrame(recommendations, columns=['Similar-Item (Books)', 'User-ID'])
    recommendations_df = recommendations_df[recommendations_df['Similar-Item (Books)'] != item_title]
    recommendations_df = recommendations_df.groupby(['Similar-Item (Books)']).head(5)
    return recommendations_df





# Example usage
target_item_title = random.choice(item_user_matrix.index)
recommendations = generate_recommendations(target_item_title)
print("Recommendations for Item (Book):", target_item_title)
# Display the recommendations
recommendations.head(5)


Recommendations for Item (Book): Heat Stroke (Weather Warden Series Book 2)


Unnamed: 0,Similar-Item (Books),User-ID
0,The Angel Whispered Danger: An Augusta Goodnig...,49
1,True North,49
2,"The Grand Crusade (The DragonCrown War Cycle, ...",49
3,True Love (and Other Lies),49
4,Intensive Scare Unit,49


# **Using Surpise Library**

In [None]:
import gc
gc.collect()

74

Working of the Code:


The provided code showcases the implementation of an item-based collaborative filtering recommendation system using the Surprise library. It follows a series of steps to generate personalized recommendations based on the relationships between items in the dataset.


1. Data Sampling: To address the challenges posed by large datasets, the code incorporates a data sampling technique. It randomly selects a subset of the original DataFrame using the sample() function from pandas. By specifying the desired subset size and setting a random seed, the code ensures reproducibility while reducing memory and performance limitations.



2. Converting to Surprise Dataset: The sampled DataFrame is transformed into Surprise's Dataset format. This step involves utilizing the Reader class to define the rating scale, which is specified as a range from 1 to 10. The load_from_df() method of the Dataset class is then employed to load the sampled DataFrame into a Surprise Dataset object.



3. Building the Item-Based Collaborative Filtering Model: The code constructs an item-based collaborative filtering model using the KNNBasic class from Surprise. Configured with cosine similarity and the item-based approach, the model determines the similarity between items based on their rating patterns. The parameter k specifies the number of nearest neighbors to consider in the recommendation process.



4. Model Training: The model is trained on the dataset using the fit() method. This step allows the model to learn the underlying patterns and relationships between items, enabling it to make informed recommendations. The dataset is split into training and test sets using the train_test_split() function from Surprise, ensuring evaluation of the model's performance.



5. Retrieving Neighbors: The code defines a function called get_neighbors() that takes an item ID and the number of neighbors (k) as inputs. This function retrieves the k nearest neighbors for the given item based on the fitted item-based CF model. It performs mapping between the item IDs used in the Surprise dataset's internal representation and the raw IDs used in the DataFrame.



6. Generating Recommendations: Another function called generate_recommendations() is defined to generate recommendations for a given item. It takes the ID of the item as input, retrieves the k nearest neighbors using the get_neighbors() function, and identifies the unrated users for those neighbors. By pairing each unrated user with the similar item (neighbor), a list of recommendations is created. These recommendations are stored in a DataFrame and returned as the output.



7. RMSE Calculation: The code calculates the root mean squared error (RMSE) for the model's predictions on the test set using the test() method of the model. This metric provides an evaluation of the model's accuracy by comparing the predicted ratings with the actual ratings in the test set.



8. Generating Recommendations: Finally, the code randomly selects a target item (book) from the sampled DataFrame and calls the generate_recommendations() function to obtain personalized recommendations for that item. The recommendations, along with the calculated RMSE, are displayed as output.



By leveraging the item-based collaborative filtering approach and the Surprise library, the code effectively analyzes the relationships between items and generates relevant recommendations tailored to individual users' preferences.

In [None]:
# Randomly sample a subset of your DataFrame
subset_size = 20000  # Specify the desired subset size
subset_df = df.sample(n=subset_size, random_state=42)

# Convert the sampled DataFrame to Surprise's Dataset format
reader = Reader(rating_scale=(1, 10))  # Set the rating scale according to your data
data = Dataset.load_from_df(subset_df[['User-ID', 'Book-Title', 'Book-Rating']], reader)

# Build the item-based collaborative filtering model
similarity_options = {'name': 'cosine', 'user_based': False}  # Use cosine similarity and item-based approach
k = 5  # Number of nearest neighbors to consider
model = KNNBasic(k=k, min_k=1, sim_options=similarity_options)

# Train the model on the dataset
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)
model.fit(trainset)

# Function to get the k-nearest neighbors for a given item
def get_neighbors(item_id, k):
    """
    Retrieve the k-nearest neighbors for a given item based on the fitted item-based CF model.

    Args:
        item_id (int): ID of the item for which neighbors need to be found.
        k (int): Number of nearest neighbors to retrieve.

    Returns:
        list: List containing the IDs of the k-nearest neighbors.
    """
    item_inner_id = trainset.to_inner_iid(item_id)
    raw_neighbors = model.get_neighbors(item_inner_id, k=k)
    neighbors = [trainset.to_raw_iid(inner_id) for inner_id in raw_neighbors]
    return neighbors

# Function to generate recommendations for a given item
def generate_recommendations(item_id):
    """
    Generate recommendations for a given item based on the item-based CF model.

    Args:
        item_id (int): ID of the item for which recommendations need to be generated.

    Returns:
        pandas.DataFrame: DataFrame containing the recommendations with columns 'Similar-Item (Books)' and 'User-ID'
    """
    neighbors = get_neighbors(item_id, k)

    recommendations = []
    for neighbor_id in neighbors:
        unrated_users = subset_df.loc[(subset_df['Book-Title'] == neighbor_id) & (subset_df['Item-ID'] != item_id), 'User-ID'].unique()
        similar_items = [neighbor_id] * len(unrated_users)
        recommendations.extend(zip(similar_items, unrated_users))

    return pd.DataFrame(recommendations, columns=['Similar-Item (Books)', 'User-ID'])

# Calculate RMSE
predictions = model.test(testset)
rmse = accuracy.rmse(predictions)

# Example usage
target_item_id = random.choice(subset_df['Book-Title'].unique())
recommendations = generate_recommendations(target_item_id)

# Display the recommendations and RMSE
print("RMSE:", rmse)
print("Recommendations for Item (Book):", target_item_id)
recommendations.head(5)


Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 1.8590
RMSE: 1.8589727157408695
Recommendations for Item (Book): The Conspiracy Club


Unnamed: 0,Similar-Item (Books),User-ID
0,Flawed Light,235105
1,Ragtime in Simla: The Second in the Detective ...,235105
2,The Delicate Storm (Marian Wood Book),98391
3,The Delicate Storm (Marian Wood Book),235105
4,Mind Prey,258185


**Conclusion**


After evaluating the RMSE values from the item-based and user-based collaborative filtering models, it can be concluded that the user-based collaborative filtering model outperforms the item-based model, albeit by a small margin.



The user-based collaborative filtering model achieved an RMSE of 1.8399, while the item-based model obtained an RMSE of 1.8590. Lower RMSE values indicate that the model's predictions closely align with the actual ratings provided by users. Hence, the user-based collaborative filtering model demonstrates slightly better accuracy in predicting user ratings compared to the item-based model.



However, it's important to consider that the choice between item-based and user-based collaborative filtering depends on various factors, including the dataset's characteristics, data sparsity, and the specific recommendation goals. In this particular scenario, the user-based approach appears to yield more accurate results. Nonetheless, it is recommended to further assess the models using additional evaluation metrics and conduct further experiments to validate these findings.



Overall, the RMSE values offer valuable insights into the collaborative filtering models' performance, enabling a comparison of their predictive accuracy. These findings can guide decision-making processes when selecting and implementing recommendation algorithms in practical applications.





