In [1]:
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix
from sklearn.metrics.pairwise import euclidean_distances
import matplotlib.pyplot as plt

### Dataset: Goodread 10k
Collaborative filtering needs a large dataset of user-item interactions, so we will use a dataset called "Goodread 10k", which contains 10,000 unique books, about 53,000 unique users, and about 1 million user-item ratings.

https://www.kaggle.com/jealousleopard/goodreadsbooks

In [2]:
rating = pd.read_csv("./data/goodread/ratings.csv")
books = pd.read_csv("./data/goodread/books.csv")
books = books[['id','authors','title','average_rating','ratings_count']].rename(columns={'id':'book_id'})

In [3]:
books.sample(20)

Unnamed: 0,book_id,authors,title,average_rating,ratings_count
6716,6717,Rachel Caine,"The Morganville Vampires, Volume 3 (The Morgan...",4.39,13849
8028,8029,Robert Jackson Bennett,"City of Stairs (The Divine Cities, #1)",4.13,11863
8136,8137,Tarryn Fisher,F*ck Love,4.13,12203
3983,3984,Gena Showalter,The Darkest Whisper (Lords of the Underworld #4),4.33,35144
4754,4755,Mark Helprin,Winter's Tale,3.5,19932
4302,4303,Erin Hunter,"Forest of Secrets (Warriors, #3)",4.4,28913
1670,1671,Rainbow Rowell,Carry On,4.19,46197
5086,5087,Lily Koppel,The Astronaut Wives Club,3.26,17837
9172,9173,Indu Sundaresan,"The Twentieth Wife (Taj Mahal Trilogy, #1)",4.03,9873
6886,6887,"Sergei Lukyanenko, Andrew Bromfield",Day Watch (Watch #2),4.05,13812


In [4]:
rating.sample(20)

Unnamed: 0,book_id,user_id,rating
84966,850,31376,3
898263,9101,28302,3
278574,2788,40737,4
702974,7073,36656,5
620927,6236,25847,4
888040,8992,28235,2
375763,3763,11569,4
488213,4894,977,4
844223,8533,38889,4
710333,7148,15563,3


In [5]:
num_users = len(rating['user_id'].unique())
num_books = len(rating['book_id'].unique())
print("Number of user:",num_users)
print("Number of books:",num_books)

Number of user: 53424
Number of books: 10000


![title](./assets/utility-matrix.png)

In [6]:
cor = rating[['user_id','book_id']].values
# -1 because our dataset count index from 1
# but python index count from 0
cor -= 1
val = rating['rating'].values

In [7]:
rating_matrix = coo_matrix((val, (cor[:,0],cor[:,1]))).toarray()

In [8]:
rating_matrix.shape

(53424, 10000)
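
With about 1 million ratings spread over 53424 x 10000 cells, only roughly 0.2% of the utility matrix is non-zero. The sketch below repeats the construction on a tiny, made-up interaction log (the `toy_*` names and values are hypothetical) and shows how to measure the density while staying in sparse CSR format. It illustrates the indexing only and is not the notebook's actual data.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical toy log: parallel arrays of user_id, book_id, rating (ids start at 1)
toy_users = np.array([1, 1, 2, 3, 3])
toy_books = np.array([1, 3, 2, 1, 4])
toy_ratings = np.array([5, 3, 4, 2, 1])

# Shift ids to 0-based row/column indices, then build a 3 x 4 sparse matrix
toy_sparse = coo_matrix(
    (toy_ratings, (toy_users - 1, toy_books - 1)), shape=(3, 4)
).tocsr()

# Fraction of cells that actually hold a rating
density = toy_sparse.nnz / (toy_sparse.shape[0] * toy_sparse.shape[1])

# Dense view, only reasonable for small matrices
toy_dense = toy_sparse.toarray()
print(toy_dense)
print("density:", density)
```

Keeping the matrix in CSR form avoids allocating the zero cells; calling `.toarray()` as in the cell above is convenient but stores every cell.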

## Build a Recommendation Based on Similar Users
![title](./assets/alsoview.png)

In [9]:
def get_book_read_by_user(user_id,with_book_info=False):
    user_books = rating[rating['user_id']==user_id].reset_index(drop=True)
    read_books_id = user_books['book_id'].values
    if not with_book_info:
        return read_books_id
    read_books_info = books[books['book_id'].isin(read_books_id)].reset_index(drop=True)
    read_books_info = pd.merge(user_books,read_books_info,on='book_id')
    return read_books_info

In [10]:
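def get_similar_user(user_id,k=10):
    # user_id starts at 1, but rows of rating_matrix start at 0
    _id = user_id - 1
    # keep only the columns (books) that the target user has rated
    idx = np.where(rating_matrix[_id]!=0)[0]
    sliced_rating_matrix = rating_matrix[:,idx]
    target = sliced_rating_matrix[_id]
    
    # distance from the target user to every user, in one vectorized call
    # (a book that another user has not rated counts as a rating of 0)
    dist = euclidean_distances(target.reshape(1,-1),sliced_rating_matrix)[0]
    
    # closest users first; the target user itself is included (distance 0)
    similar_user = np.argsort(dist)
    similar_user += 1  # back to 1-based user_id
    return similar_user[:k]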

In [11]:
def similar_books(user_1,user_2):
    u1 = get_book_read_by_user(user_1)
    u2 = get_book_read_by_user(user_2)
    return np.intersect1d(u1,u2)

In [12]:
target_user = 412
get_book_read_by_user(412,True)

Unnamed: 0,book_id,user_id,rating,authors,title,average_rating,ratings_count
0,3930,412,4,"Aristotle, J.A.K. Thomson, Jonathan Barnes, Hu...",The Nicomachean Ethics,3.91,19380
1,4640,412,5,Raymond Carver,Cathedral,4.3,18378
2,4908,412,2,Neil Strauss,The Game: Penetrating the Secret Society of Pi...,3.73,16698
3,5122,412,4,Hunter S. Thompson,Fear and Loathing on the Campaign Trail '72,4.1,15306
4,7379,412,3,Steve Martin,The Pleasure of My Company,3.78,12378
5,7771,412,4,Lorrie Moore,Birds of America,4.12,10768
6,8676,412,5,Richard Price,Lush Life,3.69,9478
7,8703,412,4,Adrian Nicole LeBlanc,"Random Family: Love, Drugs, Trouble, and Comin...",4.23,9053
8,8878,412,3,"Gabriel García Márquez, J.S. Bernstein",No One Writes to the Colonel and Other Stories,3.86,10480
9,9533,412,4,Sherman Alexie,Reservation Blues,3.98,9919


In [13]:
get_similar_user(412,k=10)

array([  412, 39032,  3222,  2645,   491,  1861,  8284, 28563, 44496,
       24040])

In [16]:
similar_books(412,2645)

array([5122, 7379, 8676])

In [17]:
get_book_read_by_user(39032,True)

Unnamed: 0,book_id,user_id,rating,authors,title,average_rating,ratings_count
0,34,39032,2,E.L. James,"Fifty Shades of Grey (Fifty Shades, #1)",3.67,1338493
1,106,39032,4,Tina Fey,Bossypants,3.94,506250
2,144,39032,4,Laura Hillenbrand,"Unbroken: A World War II Story of Survival, Re...",4.40,487775
3,190,39032,5,Cheryl Strayed,Wild: From Lost to Found on the Pacific Crest ...,3.96,379872
4,196,39032,5,Chuck Palahniuk,Fight Club,4.20,365349
5,199,39032,5,John Grogan,Marley and Me: Life and Love With the World's ...,4.12,367304
6,236,39032,5,Jon Krakauer,Into Thin Air: A Personal Account of the Mount...,4.11,291258
7,242,39032,4,James Patterson,"Along Came a Spider (Alex Cross, #1)",4.08,311499
8,290,39032,3,Mindy Kaling,Is Everyone Hanging Out Without Me? (And Other...,3.84,290674
9,297,39032,5,Kurt Vonnegut Jr.,Cat's Cradle,4.18,238940


## Matrix Factorization

![title](./assets/matrix-factorization.png)


### Why Matrix Factorization

#### 1. Smaller Size

   For example, with 53424 users and 10000 items (books), our rating matrix has shape $(U,I) = (53424,10000)$. If each cell is a `float64` number (8 bytes):

  $$total\_memory = 53424\cdot10000\cdot8$$
  $$total\_memory = 4273920000\ \text{bytes}$$
  $$total\_memory \approx 4.27\ \text{GB}$$
  
   If we can factorize this $(U,I)$ matrix into $(U,k)\cdot(I,k)^T$ with $k = 32$ (for example), we greatly reduce the memory needed:
   
   $$total\_memory = 53424\cdot32\cdot8 + 10000\cdot32\cdot8$$
   $$total\_memory = 16236544\ \text{bytes}$$
   $$total\_memory \approx 16.2\ \text{MB}$$
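
The arithmetic can be checked in a few lines of Python. The variable names are made up for this sketch, and the sizes assume 8 bytes per value and decimal units (1 GB = 1e9 bytes, 1 MB = 1e6 bytes).

```python
num_users, num_books, k = 53424, 10000, 32
bytes_per_value = 8  # float64

# Full (U, I) matrix versus the two factors (U, k) and (I, k)
dense_bytes = num_users * num_books * bytes_per_value
factor_bytes = (num_users * k + num_books * k) * bytes_per_value

print(dense_bytes, "bytes =", dense_bytes / 1e9, "GB")
print(factor_bytes, "bytes =", factor_bytes / 1e6, "MB")
print("compression ratio:", dense_bytes / factor_bytes)
```

Under these assumptions the factors are a bit more than 260 times smaller than the dense matrix.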


#### 2. Predicting Item Ratings (Matrix Completion)
   Inference becomes much faster. Instead of finding the users most similar to a given target user and then sampling recommendations from those users, the product of the 2 latent matrices directly gives the predicted rating of every user for every item.
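
A minimal sketch of that idea, using random numbers in place of learned factors (the names `P`, `Q` and all sizes are hypothetical): the full prediction table is one matrix product, and the scores for a single user are one vector-matrix product.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 6, 5, 2

# Hypothetical latent factors; in practice they come from the factorization step
P = rng.normal(size=(n_users, k))   # one k-dimensional vector per user
Q = rng.normal(size=(n_items, k))   # one k-dimensional vector per item

full_pred = P @ Q.T                 # predicted score for every (user, item) pair
one_user = P[3] @ Q.T               # scores for user 3 only: n_items * k multiplications

# Items ranked from highest to lowest predicted score for that user
top_items = np.argsort(one_user)[::-1][:3]
print(top_items)
```

Serving one user only needs that user's latent vector and the item matrix, so the full `(U, I)` table never has to be materialized.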
   
   
#### 3. "Meaningful" latent vectors
   
   From an application point of view, matrix factorization can be used to discover latent features underlying the interactions between two different kinds of entities. (Of course, you can consider more than two kinds of entities, in which case you are dealing with tensor factorization, which is more complicated.) One obvious application is predicting ratings in collaborative filtering.
   
   More info: http://nicolas-hug.com/blog/matrix_facto_1

## Singular Value Decomposition (SVD)

![title](./assets/svd.png)

https://en.wikipedia.org/wiki/Singular_value_decomposition

https://research.fb.com/blog/2014/09/fast-randomized-svd/

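Note: The definition of SVD used in recommendation systems often differs from its mathematical definition. There, "SVD" usually refers to a matrix factorization fitted only on the observed ratings, whereas here we apply a truncated SVD to the full utility matrix, with missing ratings stored as 0.  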

$A \approx U\cdot\Sigma\cdot{V^T}$

Where:

   - $A$ is our utility matrix, with shape $(m,n)$

   - $U$ is a matrix with shape $(m,k)$

   - $V$ is a matrix with shape $(n,k)$
   
   - $\Sigma$ is a diagonal matrix with shape $(k,k)$
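
To make the shapes concrete, here is a small sketch that uses NumPy's exact `np.linalg.svd` on a made-up 8 x 5 matrix and keeps only the `k` largest singular values. The toy matrix and the choice `k = 2` are assumptions for illustration; this is not the randomized routine used in the cells below.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical toy utility matrix with m = 8 users and n = 5 items
A = rng.integers(0, 6, size=(8, 5)).astype(float)

# Exact SVD; singular values in s are sorted from largest to smallest
U_full, s, Vt_full = np.linalg.svd(A, full_matrices=False)

k = 2
U_k = U_full[:, :k]          # shape (m, k)
Sigma_k = np.diag(s[:k])     # shape (k, k), diagonal
Vt_k = Vt_full[:k, :]        # shape (k, n), i.e. V has shape (n, k)

A_k = U_k @ Sigma_k @ Vt_k   # rank-k approximation of A
err_k = np.linalg.norm(A - A_k)   # Frobenius norm of the residual

print(U_k.shape, Sigma_k.shape, Vt_k.shape)
print("reconstruction error:", err_k)
```

The residual norm equals the square root of the sum of the squared discarded singular values, which is why keeping the largest ones gives the best rank-$k$ approximation in the Frobenius norm.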

In [18]:
from sklearn.utils.extmath import randomized_svd

In [19]:
U, Sigma, V_T = randomized_svd(rating_matrix,n_components=128)
Sigma = np.diag(Sigma)

In [20]:
print(U.shape)
print(Sigma.shape)
print(V_T.shape)

(53424, 128)
(128, 128)
(128, 10000)


In [31]:
# Interesting user_ids to try (with the pattern each one shows):
# 23122, 42135 (manga)
# 2123 (non-fiction)
# 5231 (author-bias)
# 42152 (cold start)
target_user = 5231
target_user_latent = U[target_user-1].reshape(1,-1)
target_user_latent.shape

(1, 128)

In [32]:
predicted_rating = np.dot(np.dot(target_user_latent,Sigma),V_T)[0]

In [33]:
recommended_books = np.argsort(predicted_rating)[::-1]
recommended_books_rating = np.sort(predicted_rating)[::-1]

recommended_books = recommended_books[:20]
recommended_books_rating = recommended_books_rating[:20]

recommended_books+=1

In [34]:
get_book_read_by_user(target_user,True)

Unnamed: 0,book_id,user_id,rating,authors,title,average_rating,ratings_count
0,2771,5231,5,Catherine Ryan Hyde,When I Found You,3.92,16032
1,3012,5231,5,Donna Mabry,Maude,4.05,28316
2,3429,5231,5,Jennifer McMahon,The Winter People,3.76,27234
3,5135,5231,3,Mary Higgins Clark,"Loves Music, Loves to Dance",3.89,21880
4,5379,5231,3,Rachel Abbott,Sleep Tight,3.98,10894
5,5447,5231,4,Camron Wright,The Rent Collector,4.23,16153
6,5628,5231,2,"María Dueñas, Daniel Hahn",The Time in Between,4.04,10166
7,5738,5231,4,Chris Cleave,Everyone Brave is Forgiven,3.8,16183
8,5758,5231,3,Harlan Coben,Promise Me (Myron Bolitar #8),3.96,18168
9,5801,5231,3,Mary Higgins Clark,While My Pretty One Sleeps,3.88,18846


In [35]:
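# isin() returns rows in `books` order, not ranking order, so align scores by book_id
scores = pd.Series(recommended_books_rating,index=recommended_books)
temp = books[books['book_id'].isin(recommended_books)].copy()
temp.insert(5,'user_predicted_rating',temp['book_id'].map(scores))
temp.sort_values('user_predicted_rating',ascending=False)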

Unnamed: 0,book_id,authors,title,average_rating,ratings_count,user_predicted_rating
3017,3018,Mary Higgins Clark,Where Are the Children?,4.0,34816,1.542324
5013,5014,Mary Higgins Clark,Two Little Girls in Blue,3.85,20727,1.507866
5076,5077,Mary Higgins Clark,All Around the Town,3.94,20029,1.481117
5134,5135,Mary Higgins Clark,"Loves Music, Loves to Dance",3.89,21880,1.474872
5796,5797,Mary Higgins Clark,You Belong To Me,3.85,19131,1.47189
5800,5801,Mary Higgins Clark,While My Pretty One Sleeps,3.88,18846,1.46322
6145,6146,Mary Higgins Clark,Let Me Call You Sweetheart,3.84,17472,1.442045
6303,6304,Mary Higgins Clark,Remember Me,3.91,16963,1.413903
6714,6715,Mary Higgins Clark,A Stranger Is Watching,3.91,15659,1.410484
6741,6742,Mary Higgins Clark,On the Street Where You Live,3.85,16327,1.399965
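
Note that the table above contains books 5135 and 5801, which user 5231 has already rated. One possible way to avoid recommending known books is to mask them before ranking. The sketch below uses made-up arrays (`observed`, `predicted`, and `top_n` are hypothetical) rather than the notebook's `rating_matrix` and `predicted_rating`.

```python
import numpy as np

# Hypothetical toy data for one user: observed ratings (0 = not rated)
# and the predicted score of every book
observed = np.array([0, 5, 0, 3, 0, 0])
predicted = np.array([0.2, 1.5, 0.9, 1.4, 1.1, 0.1])

# Push already-rated books to the bottom of the ranking
masked = np.where(observed != 0, -np.inf, predicted)

top_n = 3
top_idx = np.argsort(masked)[::-1][:top_n]   # 0-based column indices
top_book_ids = top_idx + 1                   # back to 1-based book_id
print(top_book_ids)
```

In the notebook the same mask could presumably be built from the target user's row of `rating_matrix`.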


# Conclusion

### Collaborative Filtering / Matrix Factorization

#### Pros:

- Smaller memory footprint
   
- Efficient query time (after the factorization step is complete)
   
- "Meaningful" latent space, helpful when computing similar users or similar items
   
- Doesn't require any metadata (item title, description, tags, ...). User and item features are derived from the interaction matrix alone.

   
#### Cons:
    
- Popularity bias: Very popular items have more interactions in the utility matrix, so the model tends to recommend them to everyone simply because most people have interacted with them.

- Cold-start problem: Conversely, items with few interactions will rarely get recommended to anyone. The same holds for users: if a user has few interactions, the recommendations for that user will not be accurate.
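
Both issues can be spotted before training by counting interactions per item and per user. The sketch below uses a made-up 4 x 4 matrix, and the threshold `min_interactions = 2` is an arbitrary assumption chosen only for illustration.

```python
import numpy as np

# Hypothetical toy utility matrix: rows = users, columns = items, 0 = no interaction
toy = np.array([
    [5, 4, 0, 0],
    [4, 0, 0, 0],
    [5, 3, 0, 1],
    [3, 0, 0, 0],
])

min_interactions = 2                       # assumed threshold

item_counts = (toy != 0).sum(axis=0)       # interactions per item
user_counts = (toy != 0).sum(axis=1)       # interactions per user

cold_items = np.where(item_counts < min_interactions)[0]
cold_users = np.where(user_counts < min_interactions)[0]
most_popular_item = int(np.argmax(item_counts))

print("cold-start items:", cold_items)
print("cold-start users:", cold_users)
print("most popular item:", most_popular_item)
```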
