### Tasks
* Create a Jupyter notebook.
* Load the data (which you can find in MovieTweetings).
* Use sklearn.decomposition.NMF to create latent vectors for each movie.
* Save the vectors in the following format (for user userid, who should have content_id1 and content_id3 recommended, with the predicted ratings being value1 and value2 respectively): userid content_id1:value1 content_id3:value2
    * For example, for user 100 (this is only a top-4 rec; the list should contain 10-20),
        * 100 1375666:1.420 0482571:0.232 1457767:0.158 1130884:0.113
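
As a minimal sketch of the output format described above, one line per user could be rendered like this. The helper name `format_rec_line` and its `decimals` parameter are assumptions made for illustration; they are not part of the project template.

```python
def format_rec_line(user_id, recs, decimals=3):
    """Render one user's recommendations as 'userid content_id:value ...'.

    `recs` is an iterable of (content_id, predicted_rating) pairs. It is
    sorted here so that the highest prediction comes first.
    """
    ordered = sorted(recs, key=lambda pair: pair[1], reverse=True)
    parts = [f"{content_id}:{value:.{decimals}f}" for content_id, value in ordered]
    return " ".join([str(user_id)] + parts)


# Content ids are kept as strings so that leading zeros survive.
line = format_rec_line(100, [("0482571", 0.232), ("1375666", 1.42)])
print(line)  # 100 1375666:1.420 0482571:0.232
```

Keeping the content ids as strings is a design choice in this sketch: an id such as 0482571 loses its leading zero if it is stored as an integer.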

* Locate the recsys API template and verify that it works with your implementation (/live-project/recs/non_negative_mf_recommender.py):
    * In the __init__ method, check if the implementation can load your trained vectors.
    * In the recommend_items method, return a recommendation for the user. Use the vectors loaded in the __init__ method.
* Start the MovieGeek site.
    * Find a user with a taste similar to yours by looking through users in the analytics part. This is user_id 100: http://0.0.0.0:8010/analytics/user/100/.
    * Look at the recommendations your algorithm provides.
* Write a report that describes
    * how you implemented your algorithm
    * how you trained the model
    * what you think of the result
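
For the recommender step in the task list above (load the saved file in __init__, serve from it in recommend_items), a rough sketch could look like the following. The class name `PrecomputedNMFRecs`, the `num` parameter, and the idea of keeping everything in a dictionary are assumptions for illustration; the actual template in non_negative_mf_recommender.py may use a different interface.

```python
import os
import tempfile


class PrecomputedNMFRecs:
    """Hypothetical stand-in for the recommender class in the API template."""

    def __init__(self, path):
        # Map user id (as a string) to a list of (content_id, predicted value) pairs
        self.recs = {}
        with open(path) as handle:
            for line in handle:
                fields = line.split()
                if not fields:
                    continue  # skip blank lines
                user_id, pairs = fields[0], fields[1:]
                self.recs[user_id] = [
                    (content_id, float(value))
                    for content_id, value in (pair.split(":", 1) for pair in pairs)
                ]

    def recommend_items(self, user_id, num=6):
        # Unknown users get an empty list rather than an error
        return self.recs.get(str(user_id), [])[:num]


# Tiny round trip using a temporary file in the space-separated format above
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "nnmf_recs.csv")
    with open(path, "w") as handle:
        handle.write("100 1375666:1.420 0482571:0.232\n")
    recommender = PrecomputedNMFRecs(path)

print(recommender.recommend_items(100, num=1))  # [('1375666', 1.42)]
```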

### Questions prior to eval
* I understand normalization when it is used for feature scaling. In the code you use mean normalization with a plus 1 in the denominator to coerce the matrix to be positive for the non-negative matrix factorization. In the text, you say that normalizing ratings makes them comparable. Could you elaborate on this? Is the idea of normalization to even out the effects of people who rate on different scales, e.g. one person gives ratings of 1-5 while another typically gives ratings of 4-9?

* In the NMF example, how did you choose n_components=100? In the text you mention running SVD and selecting the number of columns that capture 90% of the sum of the Sigma matrix.

* init='nndsvda' is excellent, but it averages across X; would it be better if each zero slot were filled by sampling from a distribution of that movie's ratings?
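
To make the second question concrete, here is a small sketch of the "90% of the sum of Sigma" rule on a toy matrix. The function name `components_for_energy` and the toy values are assumptions; this is my reading of the rule, not necessarily how n_components=100 was actually chosen.

```python
import numpy as np


def components_for_energy(matrix, threshold=0.9):
    """Smallest k whose top-k singular values reach `threshold` of their total sum."""
    # compute_uv=False returns only the singular values, sorted in descending order
    singular_values = np.linalg.svd(matrix, compute_uv=False)
    cumulative = np.cumsum(singular_values) / singular_values.sum()
    # First position where the cumulative share reaches the threshold
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(singular_values))


# Toy matrix with singular values 6, 3 and 1 (cumulative shares 0.6, 0.9, 1.0)
toy = np.diag([6.0, 3.0, 1.0])
print(components_for_energy(toy, threshold=0.85))  # 2
```

A dense SVD like this is not practical for the full ratings matrix. For a large sparse matrix, scipy.sparse.linalg.svds can compute only the leading singular values, although the total sum then has to be estimated or bounded some other way.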

In [1]:
import pickle
import os
import re

import pandas as pd
import numpy as np
from sklearn.decomposition import NMF
import matplotlib.pyplot as plt
from scipy.sparse import coo_matrix
from tqdm import tqdm

from sklearn.model_selection import train_test_split

from sklearn.metrics import mean_squared_error

In [2]:
# Data from https://github.com/sidooms/MovieTweetings
# user_id::movie_id::rating::rating_timestamp. 
df = pd.read_csv("MovieTweetings-master/latest/ratings.dat", 
                 sep="::", engine="python",
                 names=["user_id", "movie_id", "rating", "rating_timestamp"] )

df['rating_timestamp'] = pd.to_datetime(df['rating_timestamp'], unit='s')
df.head()

Unnamed: 0,user_id,movie_id,rating,rating_timestamp
0,1,114508,8,2013-10-05 21:00:50
1,2,75314,1,2020-07-23 01:42:04
2,2,102926,9,2020-05-22 11:46:56
3,2,114369,10,2020-08-16 05:22:27
4,2,118715,8,2020-07-29 07:13:18


In [3]:
# drop all users who only rated 1 movie, they aren't helpful for recommendations;
#  they can't be used to find similar items
original_total = len(df)
df = df.groupby("user_id").filter(lambda x: len(x.movie_id) > 1)
print(f"Total number of ratings made by users who have made more than one review: {len(df):,}")
print(f"One review users: {original_total - len(df):,}")
print(f"Number of distinct users with more than one review: {len(set(df['user_id'].tolist())):,}")

Total number of ratings made by users who have made more than one review: 858,457
One review users: 29,995
Number of distinct users with more than one review: 39,329


In [4]:
# Early attempt; cut out ratings from users with many
# df100plus = df.groupby("user_id").filter(lambda x: len(x.movie_id) >= 200) 
# df100plus['user_id'] = df100plus['user_id'].astype('category')
# len(df100plus), len(set(df100plus['user_id'].tolist()))
# 1701
# df100plus['user_id'].unique() 
# df100plus[df100plus['user_id'] ==722]

In [5]:
# Helper functions
strip_parens = re.compile(r"\s+\(.*\)")
text ="In My Room (2020)"
# strip_parens.sub("", text)
def drop_parens(text):
    return strip_parens.sub("", text)
def extract_year(text):
    return text[text.rfind("(") + 1 : text.rfind(")")]
# extract_year(text)
# extract_year('Remélem legközelebb sikerül meghalnod:) (2018)')

In [6]:
# movies.dat
# Contains the items (i.e., movies) that were rated in the tweets,
# together with their genre metadata in the following 
# format: movie_id::movie_title (movie_year)::genre|genre|genre. For example:

# 0110912::Pulp Fiction (1994)::Crime|Thriller

mdf = pd.read_csv("MovieTweetings-master/latest/movies.dat", 
                 sep="::", engine="python",
                 names=["movie_id", "movie_title", "genres"] )
# Assign the result back instead of calling fillna(inplace=True) on a column accessor
mdf['genres'] = mdf['genres'].fillna('')
mdf['title'] = mdf.movie_title.apply(drop_parens)
mdf['movie_year'] = mdf.movie_title.apply(extract_year)
mdf.movie_year = mdf.movie_year.astype('int')
mdf['genre_list'] = mdf.genres.apply(lambda x: x.split("|"))
del mdf['movie_title']
mdf.head()

Unnamed: 0,movie_id,genres,title,movie_year,genre_list
0,8,Documentary|Short,Edison Kinetoscopic Record of a Sneeze,1894,"[Documentary, Short]"
1,10,Documentary|Short,La sortie des usines Lumière,1895,"[Documentary, Short]"
2,12,Documentary|Short,The Arrival of a Train,1896,"[Documentary, Short]"
3,25,,The Oxford and Cambridge University Boat Race,1895,[]
4,91,Short|Horror,Le manoir du diable,1896,"[Short, Horror]"


In [7]:
# user_ids = list(sorted(set(rdf['user_id'].tolist())))

In [8]:
# movie_ids = list(sorted(set(rdf['movie_id'].tolist())))

In [9]:
# print(len(movie_ids), movie_ids[-1])
# first stab was: (36380, 12920708)

In [10]:
# movie_indices = dict(zip( movie_ids,   range(len(movie_ids)) ))

In [11]:
np.seterr('raise')

def normalize(x):
    """Mean-normalize each user's ratings against all of their other ratings.
    We add 1 to the result (not to the denominator) to coerce the matrix to be non-negative.
    """
    x = x.astype(float)
    x_sum = x.sum()
    x_num = x.astype(bool).sum()
    
    if x_num == 1 or x.std() == 0 or x_sum == 0 or x_num == 0:
        return 0.0
    x_mean = x_sum / x_num
    result = (x - x_mean) / (x.max() - x.min()) + 1 
    # we add one so that non-negative numbers are passed to NMF
    return result

df['rating'] = df['rating'].astype(float)

# normalize the ratings?
df['avg'] = df.groupby('user_id')['rating'].transform(lambda x: normalize(x))
df['avg'] = df['avg'].astype(float)

df['user_id'] = df['user_id'].astype('category')
df['movie_id'] = df['movie_id'].astype('category')
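
To check the intuition behind the normalization question, here is a toy sketch with two made-up users who rate on different scales. It uses a plain mean rather than the sum divided by the non-zero count used in normalize above, which is a simplification; the user ids and ratings are invented.

```python
import pandas as pd

# User "a" rates between 1 and 5, user "b" between 4 and 9
ratings = pd.DataFrame({
    "user_id": ["a", "a", "a", "b", "b", "b"],
    "rating": [1.0, 3.0, 5.0, 4.0, 6.5, 9.0],
})


def mean_normalize(x):
    """(x - mean) / (max - min) + 1, which lies between 0 and 2."""
    spread = x.max() - x.min()
    if spread == 0:
        return x * 0.0  # a user with identical ratings carries no preference signal
    return (x - x.mean()) / spread + 1


ratings["normalized"] = ratings.groupby("user_id")["rating"].transform(mean_normalize)
print(ratings)
```

Both users end up with 0.5, 1.0 and 1.5 for their lowest, middle and highest rating, so a 5 from user "a" and a 9 from user "b" are treated as the same strength of preference.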

In [12]:
train, test = train_test_split(df, test_size=.15)

print(f"original train test sizes: {len(train):,},  {len(test):,}") 

train = train.groupby("user_id").filter(lambda x: len(x.movie_id) > 1)
# test = test.groupby("user_id").filter(lambda x: len(x.movie_id) > 1)
print(f"Train set size: {len(train):,} test set before trimming {len(test):,} ")

test = test[test['user_id'].isin(train['user_id'].unique())]
print(f"test set size: {len(test):,}")  

df = train

original train test sizes: 729,688,  128,769
Train set size: 727,183 test set before trimming 128,769 
test set size: 125,469


In [13]:
df.head()

Unnamed: 0,user_id,movie_id,rating,rating_timestamp,avg
508140,39980,1392190,5.0,2015-08-23 14:15:52,0.634921
528410,41436,6966692,8.0,2019-03-04 16:54:14,1.134683
583058,45641,4076916,4.0,2017-11-29 17:23:42,0.897181
633102,49577,1258972,5.0,2015-08-03 12:59:09,0.748971
289053,22972,3001638,10.0,2014-09-06 17:29:53,1.5


In [14]:
df.sample(20)

Unnamed: 0,user_id,movie_id,rating,rating_timestamp,avg
506866,39895,8579674,7.0,2020-02-13 14:36:40,1.1
38527,2782,2239822,6.0,2017-08-01 18:19:32,0.583333
266033,21097,2338151,8.0,2018-03-03 20:13:00,1.205128
650220,50719,1343092,7.0,2014-04-06 17:48:16,1.004127
690906,54040,2388771,9.0,2018-12-09 18:21:35,1.173611
550691,43101,1972779,6.0,2015-08-15 17:54:18,0.701266
180512,14207,2294449,10.0,2014-10-18 23:25:31,1.057613
590806,46353,2334896,5.0,2014-02-24 09:55:11,0.957965
757461,59521,110912,9.0,2013-04-08 05:58:53,1.192837
345425,27116,3369806,7.0,2015-11-15 01:20:44,1.125828


In [15]:
coo = coo_matrix((df['avg'].astype(float), # normalized ratings
                  (df['movie_id'].cat.codes.copy(), # rows
                   df['user_id'].cat.codes.copy()))) # columns
csr = coo.tocsr()
print(f"ratings are between {csr.min()} and {csr.max()}")

ratings are between 0.0 and 1.9090909090909083
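
The mapping from category codes to matrix positions can be seen on a toy frame. In this sketch the shape is passed explicitly so that ids missing from a split would still get a row or column. That is an assumption about what is wanted here, and it differs from the cell above, which lets coo_matrix infer the shape from the largest code present.

```python
import pandas as pd
from scipy.sparse import coo_matrix

toy = pd.DataFrame({
    "user_id": [7, 7, 42],
    "movie_id": [110912, 1375666, 110912],
    "value": [1.5, 0.5, 1.0],
})
toy["user_id"] = toy["user_id"].astype("category")
toy["movie_id"] = toy["movie_id"].astype("category")

# cat.codes gives each distinct id a dense index: 0 .. n_categories - 1
movie_rows = toy["movie_id"].cat.codes.to_numpy()
user_cols = toy["user_id"].cat.codes.to_numpy()

matrix = coo_matrix(
    (toy["value"].to_numpy(), (movie_rows, user_cols)),
    shape=(len(toy["movie_id"].cat.categories), len(toy["user_id"].cat.categories)),
).tocsr()

print(matrix.toarray())  # rows are movies, columns are users
```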


In [16]:
consumed = df.groupby("user_id")['movie_id'].apply(list)

def get_consumed_movies(inx):
    return consumed.loc[inx]

In [17]:
movies = dict(enumerate(df['movie_id'].cat.categories))
users = dict(enumerate(df['user_id'].cat.categories))

users2inx = {v:k for k,v in users.items()}
movie2inx = {v:k for k,v in movies.items()}

In [18]:
model = NMF(n_components=100, init='nndsvda', random_state=0, verbose=False)

In [19]:
W = model.fit_transform(csr)

In [20]:
H = model.components_.T

In [21]:
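def predict_recs(user_inx, take = 10):
    """Return the highest-scoring movies for a user as {movie_id: predicted value}.

    Note: this keeps take * 2 items and does not exclude movies the user already rated.
    """
    user_vec = H[user_inx]
    predictions = np.dot(user_vec, W.T)  # one score per movie
    top = np.argsort(predictions)[::-1][:(take * 2)]
    return {movies.get(r, "0"): predictions[r] for r in top}

def pred_rating(user_id, movie_id):
    """Predicted (normalized) rating: dot product of the user and movie latent vectors."""
    user_vec = H[users2inx[user_id]]
    movie_vec = W[movie2inx[movie_id]]
    return np.dot(user_vec, movie_vec)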
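
The get_consumed_movies helper is defined above but predict_recs does not use it, so already-rated movies can show up in the recommendations. One possible way to exclude them is sketched below on toy factors. The function `top_unseen` and its arguments are hypothetical; they are not part of the notebook.

```python
import numpy as np


def top_unseen(user_vec, item_vecs, seen_indices, take=10):
    """Score every item for one user, then drop the items the user has already rated."""
    scores = item_vecs @ user_vec  # one score per item (a fresh array)
    seen = np.array(list(seen_indices), dtype=int)
    scores[seen] = -np.inf  # push consumed items to the bottom
    top = np.argsort(scores)[::-1][:take]
    # Skip masked entries in case fewer than `take` unseen items exist
    return [(int(i), float(scores[i])) for i in top if np.isfinite(scores[i])]


item_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
user_vec = np.array([2.0, 1.0])

# Raw scores are 2.0, 1.0 and 1.5; item 0 has already been rated
print(top_unseen(user_vec, item_vecs, seen_indices={0}))  # [(2, 1.5), (1, 1.0)]
```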

In [24]:
yhat =[]
for idx, row in test.iterrows():
    uid = row['user_id']
    mid = row['movie_id']
    yhat.append( pred_rating(uid, mid))


In [26]:
mean_squared_error(test['avg'].tolist(), yhat)

0.892855886793963
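
The row-by-row loop over test.iterrows() could probably be replaced with one vectorized computation. The sketch below uses toy factor matrices and assumes the user and movie ids have already been turned into row indices (for example through users2inx and movie2inx); the name `batch_predict` is made up.

```python
import numpy as np


def batch_predict(user_factors, item_factors, user_rows, item_rows):
    """Row-wise dot product between the selected user and item vectors."""
    return np.einsum("ij,ij->i", user_factors[user_rows], item_factors[item_rows])


user_factors = np.array([[1.0, 0.0], [0.0, 2.0]])
item_factors = np.array([[1.0, 1.0], [3.0, 0.5]])

user_rows = np.array([0, 1, 1])  # which user made each test rating
item_rows = np.array([1, 0, 1])  # which movie each test rating is about

preds = batch_predict(user_factors, item_factors, user_rows, item_rows)
targets = np.array([3.0, 1.0, 1.0])
mse = float(np.mean((preds - targets) ** 2))
print(preds, mse)
```

Taking the square root of the MSE gives an error in the same units as the normalized ratings, which is easier to compare against the 0 to 2 range of the avg column.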

In [24]:
recs = [(user_id, predict_recs(inx)) for inx, user_id in tqdm(users.items())]

100%|██████████| 39329/39329 [02:10<00:00, 301.78it/s]


In [25]:
def lookup_preds(row, user_inx=3):
    # The user vector does not change per movie, so look it up once
    user_vec = H[user_inx]
    return [np.dot(user_vec, W[movie2inx[r]]) for r in row]

In [26]:
def compare_ratings_predictions(user_id):
    
    # Copy the slice so that adding a column does not trigger SettingWithCopyWarning
    user_x = df.loc[df['user_id'] == user_id].copy()

    user_x["predictions"] = lookup_preds(user_x['movie_id'], users2inx[user_id])
    return user_x.sort_values('predictions', ascending=False)

compare_ratings_predictions(100)


Unnamed: 0,user_id,movie_id,rating,rating_timestamp,avg,predictions
1287,100,7286456,10.0,2019-10-05 00:52:03,1.394636,1.41213
1285,100,6806448,6.0,2019-08-09 04:06:49,0.950192,0.145258
1281,100,6105098,5.0,2019-07-24 08:58:28,0.83908,0.139432
1288,100,7349950,7.0,2019-09-07 02:03:50,1.061303,0.11553
1275,100,5164214,7.0,2019-11-09 01:45:52,1.061303,0.109325
1290,100,7798634,7.0,2019-11-22 19:28:06,1.061303,0.100432
1284,100,6450804,10.0,2019-11-01 02:21:48,1.394636,0.073968
1273,100,3741700,7.0,2019-06-09 18:40:18,1.061303,0.07186
1293,100,8364368,9.0,2019-07-12 00:53:47,1.283525,0.066279
1278,100,5606664,10.0,2019-11-09 23:32:08,1.394636,0.06453


In [27]:
import csv

save_path = "nnmf_recs.csv"
with open(save_path, 'w', newline='') as csvfile:
    rec_writer = csv.writer(csvfile, delimiter=' ',
                            quotechar='|', quoting=csv.QUOTE_MINIMAL)
    for rec in tqdm(recs):
        rec_writer.writerow([rec[0]] + ['{}:{}'.format(k,v) for k,v in rec[1].items()])

100%|██████████| 39329/39329 [00:01<00:00, 19774.32it/s]
