# Recommendation System
1. Content-based model<br>
2. Collaborative filtering model<br>
3. Matrix Factorization<br>
4. Deep model

1. Content-based model
- The user-item matrix is built from the "content" of each item, without regard to the relationships between users.
- Items are grouped into clusters according to purpose (content, tags, ...).
- The matrix is built from similarity (correlation) scores.
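
The points above can be sketched roughly as follows. This is a minimal illustration, assuming each item is described by a multi-hot genre vector and that cosine similarity is the similarity measure; neither choice is stated in these notes, and the toy titles and genres are made up.

```python
import numpy as np

# Hypothetical toy catalogue: each row is an item, each column a genre flag
# in the order Action, Comedy, Drama.
item_names = ["Movie A", "Movie B", "Movie C"]
item_features = np.array([
    [1, 0, 1],  # Movie A: Action, Drama
    [1, 0, 0],  # Movie B: Action
    [0, 1, 0],  # Movie C: Comedy
], dtype=float)

def cosine_similarity_matrix(features):
    """Return the item-item cosine similarity matrix."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T

item_sim = cosine_similarity_matrix(item_features)

def most_similar(item_idx, sim, top_k=1):
    """Indices of the top_k items most similar to item_idx, excluding itself."""
    order = np.argsort(sim[item_idx])[::-1]
    return [int(i) for i in order if i != item_idx][:top_k]

print(item_names[most_similar(0, item_sim)[0]])
```

Only item features enter the computation, so no information about other users is needed.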
1.1 Utility matrix
- A matrix whose rows are indexed by user and whose columns are indexed by item (N x M).
- Each entry is the rating a user gave to an item. Because each user rates only one or a few items, the resulting matrix is sparse.
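
As a small illustration of the utility matrix, the sketch below pivots a made-up long-format rating table (the IDs and ratings are hypothetical) and reports how many entries are actually observed.

```python
import pandas as pd

# Hypothetical ratings in long format: one row per (user, item, rating).
toy_ratings = pd.DataFrame({
    "userID":  [1, 1, 2, 3],
    "movieID": [10, 20, 20, 30],
    "rating":  [5, 3, 4, 1],
})

# Rows are users, columns are items; unrated pairs become NaN.
utility = toy_ratings.pivot(index="userID", columns="movieID", values="rating")

n_users, n_items = utility.shape
n_observed = int(utility.notna().sum().sum())
density = n_observed / (n_users * n_items)

print(utility)
print("Observed entries: {} of {} ({:.1%})".format(n_observed, n_users * n_items, density))
```

Even in this toy case more than half of the cells are empty, which is the sparsity mentioned above.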


(to be added)

In [9]:
import os
from pprint import pprint
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.sparse.linalg import svds
from tensorflow.keras import optimizers, losses, metrics
from tensorflow.keras.callbacks import Callback, EarlyStopping, ModelCheckpoint

from utils import train_test_split
# NeuralMF is not exported by model_selection (importing it raises ImportError) and is not used below
from model_selection import SVD, fastSVD, NMF, DeepMF

sns.set(rc={'figure.figsize':(10, 6)})


In [10]:
sns.set(rc={'figure.figsize':(10, 6)})
seed = 123
np.random.seed(seed)

## 1. Load data

### Summary

```
ml-1m
├── users.dat
├── movies.dat
├── ratings.dat
└── README.txt
```

In [3]:
PATH = "ml-1m/"
ratings = pd.read_csv(os.path.join(PATH, 'ratings.dat'), sep='::', names=["userID", "movieID", "rating", "timestamp"], engine='python')
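# movies.dat contains non-UTF-8 characters in some titles, so the encoding is given explicitly
movies = pd.read_csv(os.path.join(PATH, 'movies.dat'), sep='::', names=["movieID", "title", "genres"], engine='python', encoding='latin-1')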
users = pd.read_csv(os.path.join(PATH, 'users.dat'), sep='::', names=["userID", "gender", "age", "occupation", "zipcode"], engine='python')

In [None]:
users

## 2. EDA

### 2.1 ratings.dat

In [None]:
ratings.head()

In [None]:
# distplot is deprecated in recent seaborn releases; histplot draws the same histogram
sns.histplot(ratings["rating"], discrete=True)

In [None]:
ratings['datetime'] = pd.to_datetime(ratings['timestamp'], unit='s')
ratings.drop('timestamp', axis=1, inplace=True)
# distplot is deprecated in recent seaborn releases; histplot also accepts datetime values
sns.histplot(ratings['datetime'])

In [None]:
n_users = len(ratings.userID.unique())
n_items = len(ratings.movieID.unique())
print(n_users)
print(n_items)
print("Coverage : {:.4f}%".format(len(ratings) * 100 / (n_users * n_items)))

### 2.2 movies.dat

In [None]:
movies.head()

In [None]:
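movies_refine = movies.copy()
# Take the year from the trailing "(YYYY)" so that titles such as "2001: A Space Odyssey (1968)" are parsed correctly
movies_refine['year'] = movies_refine['title'].str.extract(r'\((\d{4})\)\s*$', expand=False)
# Drop rows without a year before casting, since NaN cannot be converted to int32
movies_refine = movies_refine.dropna(subset=['year'])
movies_refine['year'] = movies_refine['year'].astype('int32')
# Keep the part of the title before the first parenthesis
movies_refine['title'] = movies_refine['title'].str.extract(r'(^[^\(]+)', expand=False).str.strip()
movies_refine['genres'] = movies_refine['genres'].str.split('|')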

movies_refine

In [None]:
movies_refine.year.unique()
movies_refine.year.value_counts()

In [None]:
genre_count = dict()
for index, series in movies_refine.iterrows():
    for genre in series['genres']:
        genre_count[genre] = genre_count.get(genre, 0) + 1
pprint(genre_count)

### 2.3 users.dat

In [None]:
users.head()

In [None]:
users.gender.value_counts()

In [None]:
users.age.value_counts()

In [None]:
len(users)

In [None]:
len(ratings.userID.unique())

## 3. Matrix Factorization Collaborative Filtering

Addressing the sparse-matrix problem
- Compute the mean of the values in each row, where $v_i$ are the $m$ values in that row:
$E(x) = \frac{1}{m}\sum_{i=1}^{m} v_i $
- Normalize: subtract each row's mean from the values in that row

- Compute the covariance matrix:
(formula)
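
The normalization steps can be sketched in NumPy as below. The matrix values are made up, and because the covariance formula is not given in these notes, the sketch assumes the usual sample covariance between rows, normalized by $m - 1$.

```python
import numpy as np

# Hypothetical dense 3x4 matrix (rows are users, columns are items).
A = np.array([
    [5.0, 3.0, 4.0, 4.0],
    [1.0, 2.0, 1.0, 4.0],
    [4.0, 4.0, 5.0, 3.0],
])
m = A.shape[1]  # number of values per row

row_means = A.sum(axis=1, keepdims=True) / m  # E(x) for each row
centered = A - row_means                      # subtract each row's mean

# Assumption: sample covariance between rows, normalized by (m - 1).
cov = centered @ centered.T / (m - 1)

print(row_means.ravel())
print(cov)
```

With this normalization the result agrees with `np.cov(A)`, which treats each row as one variable.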

SVD:
Decompose the utility matrix into three matrices: user - strength, strength - strength, strength - item.
Example:
$$\begin{bmatrix}
 3 & 2 & 2 \\
 2 & 3 & -2
\end{bmatrix}
=
\begin{bmatrix}
 1/\sqrt{2} & 1/\sqrt{2} \\
 1/\sqrt{2} & -1/\sqrt{2}
\end{bmatrix}
\cdot
\begin{bmatrix}
 5 & 0 & 0 \\
 0 & 3 & 0
\end{bmatrix}
\cdot
\begin{bmatrix}
 1/\sqrt{2} & 1/\sqrt{2} & 0 \\
 1/\sqrt{18} & -1/\sqrt{18} & 4/\sqrt{18} \\
 2/3 & -2/3 & -1/3
\end{bmatrix} $$

Suppose the matrix A (2x3) corresponds to 2 users and 3 items, where each value is the rating a user gave to an item:
- Matrix U (2x2): each column expresses the "strength" associated with the corresponding singular value in $\Sigma$
- Matrix $\Sigma$ (2x3): a diagonal matrix whose singular values are in decreasing order ($\sigma _{1}$ is the strongest one)
- Matrix $V^T$ (3x3): each row expresses how well an item cluster matches the users' ratings
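
The 2x3 example can be checked numerically. The sketch below uses `np.linalg.svd`; the signs of individual singular vectors it returns may differ from a hand computation, since they are only determined up to sign, but the singular values and the product are the same.

```python
import numpy as np

A = np.array([[3.0, 2.0, 2.0],
              [2.0, 3.0, -2.0]])

# full_matrices=True gives U (2x2), the singular values, and V^T (3x3).
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Place the singular values on the diagonal of a 2x3 matrix.
Sigma = np.zeros(A.shape)
Sigma[:len(s), :len(s)] = np.diag(s)

A_rebuilt = U @ Sigma @ Vt
print("singular values:", s)
print("reconstruction matches A:", np.allclose(A_rebuilt, A))
```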

### 3.1 Low rank Approximation (Truncated SVD)

In [None]:
data_mat = np.array(ratings.pivot(index = 'movieID', columns = 'userID', values = 'rating'))
data_mat = np.nan_to_num(data_mat)

In [None]:
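# Center each user's column; `train` was undefined here, the pivoted matrix is data_mat
nor_data_mat = data_mat - np.mean(data_mat, axis = 0)
u, s, vT = svds(nor_data_mat, k = 50)  # keep the 50 largest singular values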

In [None]:
recon_data_mat = u.dot(np.diag(s)).dot(vT)
recon_data_mat
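
To illustrate what truncation does, the sketch below computes rank-k approximations of a small 2x3 matrix. It uses `np.linalg.svd` instead of `svds` on the full rating matrix, purely to keep the example self-contained; the helper name `rank_k_approximation` is made up for this sketch.

```python
import numpy as np

A = np.array([[3.0, 2.0, 2.0],
              [2.0, 3.0, -2.0]])

# np.linalg.svd returns the singular values in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

def rank_k_approximation(U, s, Vt, k):
    """Rebuild the matrix from its k largest singular values only."""
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

errors = {}
for k in (1, 2):
    A_k = rank_k_approximation(U, s, Vt, k)
    errors[k] = np.linalg.norm(A - A_k)  # Frobenius norm of the residual
    print("k = {}: reconstruction error = {:.4f}".format(k, errors[k]))
```

The residual norm equals the square root of the sum of the squared discarded singular values, so keeping more components can only reduce the error.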

In [None]:
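def recommend_movies(pred_matrix, userID, num_recommendations):
    """Recommend movies based on the reconstructed SVD matrix.
    Params:
     - pred_matrix (num_movies, num_users) : reconstructed matrix
     - userID (scalar)
     - num_recommendations (scalar)
    Outputs:
     - user_full : movies the user has already rated, highest rating first
     - recommendations : top predicted movies the user has not rated yet
    """
    # pivot() sorts its index and columns, so row i and column j of the matrix
    # correspond to the i-th smallest movieID and j-th smallest userID in ratings
    movie_ids = np.sort(ratings.movieID.unique())
    user_ids = np.sort(ratings.userID.unique())
    user_col = np.searchsorted(user_ids, userID)
    pred = pd.DataFrame({'movieID': movie_ids, 'pred_rating': pred_matrix[:, user_col]})

    user_data = ratings[ratings.userID == userID]
    user_full = user_data.merge(movies_refine).sort_values(['rating'], ascending=False)
    print('User {0} has already rated {1} movies.'.format(userID, len(user_full)))
    print('Recommending highest {0} predicted ratings movies not already rated.'.format(num_recommendations))

    # Keep unrated movies only, then rank them by predicted rating
    unrated = movies_refine[~movies_refine.movieID.isin(user_full.movieID)]
    recommendations = unrated.merge(pred, on='movieID').sort_values('pred_rating', ascending=False)

    return user_full, recommendations.head(num_recommendations)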

In [None]:
rated, recommended = recommend_movies(recon_data_mat, 1310, 20)
recommended

### 3.2.1 FunkSVD (slow ver.)

In [None]:
model_svd = SVD(ratings, K=10, lambd=.1, lr_rate=0.75, max_iter=100, verbose=True, user_based=1, use_biased=False)

### 3.2.2 FunkSVD (fast ver.)

In [None]:
X_train, X_valid, X_test = train_test_split(ratings, split_ratio=0.7, shuffle=True)

In [None]:
model_svd = fastSVD(K=15, lambd=0.06, lr_rate=0.004, max_iter=50)
model_svd.fit(X_train, X_valid, early_stopping=True, verbose=False, use_biased=True)

In [None]:
model_svd.predict(X_test)

### 3.3 Nonnegative Matrix Factorization

In [None]:
model_nmf = NMF(K=15, lambd_pu=0.4, lambd_qi=0.7, max_iter=10)

In [None]:
model_nmf.fit(X_train, X_valid, early_stopping=False, verbose=True)

## 4. Deep learning

### 4.1 Simple deep model

**Prepare data for DeepCF model**

In [6]:
ratings = ratings.sample(frac=1., random_state=seed)
user_list = ratings['userID'].values.reshape(-1, 1) - 1
item_list = ratings['movieID'].values.reshape(-1, 1) - 1
rating_list = ratings['rating'].values.reshape(-1, 1)
max_user_id = ratings.userID.unique().max()
max_item_id = ratings.movieID.unique().max()

**Create and compile model**

In [None]:
model_deepmf = DeepMF(max_user_id, max_item_id, K=100)

In [None]:
callbacks = [EarlyStopping('val_loss', patience=2), 
             ModelCheckpoint('weights.h5', save_best_only=True)]
model_deepmf.compile(loss='mse', optimizer='adam')

In [None]:
history = model_deepmf.fit([user_list, item_list], rating_list, epochs=30,
                           validation_split=.3, verbose=2, callbacks=callbacks)

**Load trained model for prediction phase**

In [9]:
model_deepmf.load_weights('weights.h5')

**Predict rating score given userID and itemID**

In [7]:
test_id = 69

# .copy() so that the new column is not written to a slice of ratings
user_ratings = ratings[ratings.userID == test_id][['userID', 'movieID', 'rating']].copy()
user_ratings['pred_rating'] = user_ratings.apply(lambda x: model_deepmf.pred(test_id, x['movieID']), axis=1)
user_ratings.sort_values(by='rating', ascending=False).merge(movies).head(10)

|   | userID | movieID | rating | pred_rating | title | genres |
|---|---|---|---|---|---|---|
| 0 | 69 | 296 | 5 | 4.220314 | Pulp Fiction (1994) | Crime\|Drama |
| 1 | 69 | 1120 | 5 | 3.404435 | People vs. Larry Flynt, The (1996) | Drama |
| 2 | 69 | 1041 | 5 | 3.478866 | Secrets & Lies (1996) | Drama |
| 3 | 69 | 17 | 5 | 3.519573 | Sense and Sensibility (1995) | Drama\|Romance |
| 4 | 69 | 1704 | 5 | 4.000617 | Good Will Hunting (1997) | Drama |
| 5 | 69 | 2858 | 5 | 3.948442 | American Beauty (1999) | Comedy\|Drama |
| 6 | 69 | 1747 | 5 | 3.698985 | Wag the Dog (1997) | Comedy\|Drama |
| 7 | 69 | 1810 | 5 | 3.477646 | Primary Colors (1998) | Drama |
| 8 | 69 | 2890 | 5 | 4.122479 | Three Kings (1999) | Drama\|War |
| 9 | 69 | 431 | 5 | 3.340197 | Carlito's Way (1993) | Crime\|Drama |


**Recommend 10 movies given userID**

In [30]:
test_id = 96

rated_movie = ratings.loc[ratings.userID == test_id]['movieID']
# .copy() avoids pandas' SettingWithCopyWarning when the new column is added
recom = movies.loc[~movies.movieID.isin(rated_movie)].copy()
recom['pred_rating'] = recom.apply(lambda x: model_deepmf.pred(test_id, x['movieID']), axis=1)
recom.sort_values(by='pred_rating', ascending=False).head(10)


|   | movieID | title | genres | pred_rating |
|---|---|---|---|---|
| 2425 | 2494 | Last Days, The (1998) | Documentary | 4.190029 |
| 1287 | 1307 | When Harry Met Sally... (1989) | Comedy\|Romance | 4.11003 |
| 1676 | 1725 | Education of Little Tree, The (1997) | Drama | 4.103101 |
| 2707 | 2776 | Marcello Mastroianni: I Remember Yes, I Rememb... | Documentary | 4.062382 |
| 2849 | 2918 | Ferris Bueller's Day Off (1986) | Comedy | 4.047093 |
| 1180 | 1198 | Raiders of the Lost Ark (1981) | Action\|Adventure | 4.0359 |
| 1656 | 1704 | Good Will Hunting (1997) | Drama | 4.029357 |
| 1873 | 1942 | All the King's Men (1949) | Drama | 4.015693 |
| 1156 | 1172 | Cinema Paradiso (1988) | Comedy\|Drama\|Romance | 4.015368 |
| 3023 | 3092 | Chushingura (1962) | Drama | 4.000569 |


### 4.2 Neural Collaborative Filtering

In [4]:
from model_selection import Embedding, MLP