
## Key Concepts in Implementing Collaborative Filtering in Python

1. **Understanding Collaborative Filtering**: Grasp the fundamental concept of collaborative filtering, which leverages user behavior to recommend items. Learn about its types, such as user-based and item-based collaborative filtering.

2. **Data Handling**: Learn to manipulate and process data in Python. Understand how to load, clean, and preprocess the dataset, including handling missing values and normalizing data.

3. **Matrix Factorization Techniques**: Understand matrix factorization methods such as Singular Value Decomposition (SVD), which decompose the user-item rating matrix into low-dimensional user and item factors whose products predict unobserved ratings.

4. **Similarity Metrics**: Learn about different similarity metrics like cosine similarity, Pearson correlation, and Jaccard similarity, which are crucial in comparing user or item profiles.

5. **Building Recommendation Systems**: Learn to build a recommender system, focusing on generating user-item matrices, computing similarities, and making predictions.

6. **Evaluation Metrics**: Understand various evaluation metrics such as Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and Precision at K, which are essential for assessing the performance of your recommendation system.

7. **Handling Sparse Matrices**: Learn techniques to handle sparse matrices efficiently, as collaborative filtering often deals with large, sparse datasets.

8. **Scalability and Performance Issues**: Learn about scalability and performance considerations, such as handling large datasets and improving computational efficiency, which are crucial for real-world applications.
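
The similarity metrics named in item 4 can be sketched in a few lines of plain Python. This is a minimal illustration, not the code used later in this notebook: the function names, the two example users, and the convention of returning 0.0 for an all-zero vector are assumptions made here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention for all-zero vectors
    return dot / (norm_a * norm_b)


def pearson_correlation(a, b):
    """Pearson correlation, computed as the cosine of the mean-centered vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine_similarity([x - mean_a for x in a], [y - mean_b for y in b])


def jaccard_similarity(items_a, items_b):
    """Overlap between two sets of rated items, ignoring the rating values."""
    a, b = set(items_a), set(items_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical ratings of the same four books by two users
alice = [5, 3, 4, 4]
bob = [3, 1, 2, 3]

print(cosine_similarity(alice, bob))
print(pearson_correlation(alice, bob))
print(jaccard_similarity({"b1", "b2", "b3"}, {"b2", "b3", "b4"}))
```

Cosine and Pearson compare rating values over co-rated items, while Jaccard only asks which items two users have in common, so it is the natural choice when only implicit interactions are available.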




In [2]:
import pandas as pd

# Build the file paths (adjust datasets_path to point at your local copy)
datasets_path = '/Users/saip/My Drive/machine-learning-fundamentals/datasets/'
users_file = datasets_path + 'books_crossings/Users.csv'
books_file = datasets_path + 'books_crossings/Books.csv'
ratings_file = datasets_path + 'books_crossings/Ratings.csv'

# Load the data
users = pd.read_csv(users_file, sep=';', encoding='latin-1', low_memory=False)
books = pd.read_csv(books_file, sep=';', encoding='latin-1', low_memory=False)
ratings = pd.read_csv(ratings_file, sep=';', encoding='latin-1', low_memory=False)

# Print the first few rows of each dataframe
print(users.head())
print(books.head())
print(ratings.head())

  User-ID  Age
0       1  NaN
1       2   18
2       3  NaN
3       4   17
4       5  NaN
         ISBN                                              Title  \
0  0195153448                                Classical Mythology   
1  0002005018                                       Clara Callan   
2  0060973129                               Decision in Normandy   
3  0374157065  Flu: The Story of the Great Influenza Pandemic...   
4  0393045218                             The Mummies of Urumchi   

                 Author  Year                Publisher  
0    Mark P. O. Morford  2002  Oxford University Press  
1  Richard Bruce Wright  2001    HarperFlamingo Canada  
2          Carlo D'Este  1991          HarperPerennial  
3      Gina Bari Kolata  1999     Farrar Straus Giroux  
4       E. J. W. Barber  1999   W. W. Norton & Company  
   User-ID        ISBN  Rating
0   276725  034545104X       0
1   276726  0155061224       5
2   276727  0446520802       0
3   276729  052165615X       3
4   

In [3]:
# Explore the data

# Check the number of users, books and ratings
print('Number of users: {}'.format(len(users)))
print('Number of books: {}'.format(len(books)))
print('Number of ratings: {}'.format(len(ratings)))

# Average number of ratings per user
print('Average number of ratings per user: {}'.format(ratings['User-ID'].value_counts().mean()))

# Average number of ratings per book
print('Average number of ratings per book: {}'.format(ratings['ISBN'].value_counts().mean()))

# Sparsity of the rating matrix in %: the share of user-book pairs with no rating
sparsity = (1 - len(ratings) / (len(users) * len(books))) * 100
print('Sparsity of the rating matrix: {:.4f}%'.format(sparsity))

Number of users: 278859
Number of books: 271379
Number of ratings: 1149780
Average number of ratings per user: 10.920851419507423
Average number of ratings per book: 3.376184827164989
Sparsity of the rating matrix: 99.9985%


In [4]:
# count the number of unique users and books in ratings table
n_users = ratings['User-ID'].nunique()
n_books = ratings['ISBN'].nunique()

print('Number of unique users: {}'.format(n_users))
print('Number of unique books: {}'.format(n_books))

# get the size of ratings matrix
ratings_matrix_size = n_users * n_books
print('Size of ratings matrix: {}'.format(ratings_matrix_size))

# Estimate the memory a dense ratings matrix would need, assuming 8 bytes (float64) per rating
ratings_matrix_size_in_bytes = ratings_matrix_size * 8
ratings_matrix_size_in_gb = ratings_matrix_size_in_bytes / (1024**3)
print('Size of ratings matrix in GB: {}'.format(ratings_matrix_size_in_gb))

Number of unique users: 105283
Number of unique books: 340556
Size of ratings matrix: 35854757348
Size of ratings matrix in GB: 267.13875940442085


A dense matrix of this size (about 267 GB) cannot fit into 16 GB of RAM. scikit-surprise avoids the problem by keeping only the observed ratings in memory instead of materializing the full user-by-book matrix.
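
To make the difference concrete, here is a rough sketch of storing only observed ratings, together with a back-of-the-envelope memory comparison. It is not scikit-surprise's internal code: `build_sparse`, `lookup`, and the three sample ratings are hypothetical, and the 16 bytes per rating assume a compact layout with two 32-bit indices and one 64-bit value (Python dicts themselves carry more overhead).

```python
from collections import defaultdict


def build_sparse(triples):
    """Map each user to {item: rating}, storing observed ratings only."""
    by_user = defaultdict(dict)
    for user, item, rating in triples:
        by_user[user][item] = rating
    return by_user


def lookup(by_user, user, item, default=None):
    """Return the stored rating, or `default` if the pair was never rated."""
    # .get() avoids creating empty entries for unknown users
    return by_user.get(user, {}).get(item, default)


# Hypothetical sample ratings (not taken from the dataset)
sample = [("u1", "b1", 5), ("u1", "b2", 0), ("u2", "b2", 8)]
by_user = build_sparse(sample)
print(lookup(by_user, "u1", "b1"))
print(lookup(by_user, "u2", "b1"))

# Counts printed earlier in this notebook
n_users, n_books, n_ratings = 105283, 340556, 1149780

# Dense: one float64 per (user, book) pair
dense_gib = n_users * n_books * 8 / 1024**3
# Compact triples: assumed 4-byte user index + 4-byte item index + 8-byte value
triples_mib = n_ratings * (4 + 4 + 8) / 1024**2

print('Dense matrix: {:.2f} GiB'.format(dense_gib))
print('Observed ratings as triples: {:.2f} MiB'.format(triples_mib))
```

Under these assumptions the observed ratings need on the order of tens of megabytes, compared with hundreds of gigabytes for the dense matrix.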

In [5]:
# Use scikit-surprise to build a recommender system using collaborative filtering

from surprise import SVD
from surprise import Dataset
from surprise import Reader
from surprise.model_selection import cross_validate

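# Wrap the ratings DataFrame in a surprise Dataset; the Reader declares the rating scale.
# Note: in Book-Crossing, a rating of 0 marks an implicit interaction rather than a score.
# This cell keeps those rows and treats 0 as the bottom of the scale.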
reader = Reader(rating_scale=(0, 10))
data = Dataset.load_from_df(ratings[['User-ID', 'ISBN', 'Rating']], reader)

# Use the SVD algorithm.
algo = SVD()

# Run 5-fold cross-validation and print results.
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    3.4999  3.5002  3.5026  3.5046  3.4998  3.5014  0.0019  
MAE (testset)     2.8143  2.8128  2.8147  2.8191  2.8155  2.8153  0.0021  
Fit time          9.14    9.20    9.18    9.42    9.38    9.26    0.11    
Test time         1.03    1.12    1.14    0.88    0.89    1.01    0.11    


{'test_rmse': array([3.49985579, 3.50016134, 3.50256795, 3.50457273, 3.49981268]),
 'test_mae': array([2.81428518, 2.81283864, 2.81468618, 2.81906326, 2.81549236]),
 'fit_time': (9.139102935791016,
  9.198290348052979,
  9.181260108947754,
  9.420634269714355,
  9.377026081085205),
 'test_time': (1.0309600830078125,
  1.1181998252868652,
  1.1370129585266113,
  0.8779547214508057,
  0.890427827835083)}
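
For intuition about what the `SVD` algorithm fits, the sketch below trains the same family of model, a biased matrix factorization with prediction `mu + b_u + b_i + p_u · q_i`, by stochastic gradient descent on a tiny hypothetical dataset. It is an illustration under stated assumptions, not scikit-surprise's implementation: the function names, the toy ratings, and the hyperparameters are all chosen here for readability.

```python
import math
import random


def train_mf(ratings, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Fit global mean, user/item biases, and latent factors with SGD."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    mu = sum(r for _, _, r in ratings) / len(ratings)
    bu = {u: 0.0 for u in users}
    bi = {i: 0.0 for i in items}
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}
    for _ in range(n_epochs):
        for u, i, r in ratings:
            pred = mu + bu[u] + bi[i] + sum(a * b for a, b in zip(p[u], q[i]))
            err = r - pred
            # Gradient step on the L2-regularized squared error
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for f in range(n_factors):
                puf, qif = p[u][f], q[i][f]
                p[u][f] += lr * (err * qif - reg * puf)
                q[i][f] += lr * (err * puf - reg * qif)
    return mu, bu, bi, p, q


def predict(model, u, i):
    """Estimate a rating; unknown users or items fall back to the known terms."""
    mu, bu, bi, p, q = model
    est = mu + bu.get(u, 0.0) + bi.get(i, 0.0)
    if u in p and i in q:
        est += sum(a * b for a, b in zip(p[u], q[i]))
    return est


def rmse(model, ratings):
    """Root mean squared error of the model on the given ratings."""
    se = sum((r - predict(model, u, i)) ** 2 for u, i, r in ratings)
    return math.sqrt(se / len(ratings))


def mae(model, ratings):
    """Mean absolute error of the model on the given ratings."""
    return sum(abs(r - predict(model, u, i)) for u, i, r in ratings) / len(ratings)


# Hypothetical toy ratings on a 0-10 scale (not taken from the dataset)
toy = [
    ("u1", "b1", 8), ("u1", "b2", 6),
    ("u2", "b1", 9), ("u2", "b3", 4),
    ("u3", "b2", 5), ("u3", "b3", 3),
    ("u4", "b1", 7), ("u4", "b3", 2),
]

model = train_mf(toy)
mu = model[0]
# Baseline: always predict the global mean
baseline_rmse = math.sqrt(sum((r - mu) ** 2 for _, _, r in toy) / len(toy))

print('Baseline RMSE: {:.3f}'.format(baseline_rmse))
print('Model RMSE (training data): {:.3f}'.format(rmse(model, toy)))
print('Model MAE (training data): {:.3f}'.format(mae(model, toy)))
```

The errors printed here are measured on the training data and are therefore optimistic; the cross-validation above reports errors on held-out folds, which is the figure to compare across models.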