
# Matrix Factorisation for Recommendation

This notebook shows how matrix factorisation can be used to recommend beers: the review data is preprocessed, then SVD and NMF models from the Surprise library are fitted and evaluated.
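
As a rough illustration of the idea, the sketch below factorises a tiny made-up rating matrix with plain stochastic gradient descent in NumPy. It is not the Surprise implementation used later in this notebook (Surprise's `SVD` also learns a global mean and user/item biases); the toy ratings, learning rate and regularisation values are assumptions chosen only for the example.

```python
import numpy as np

# Toy ratings as (user index, item index, rating); all values are made up.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
n_users, n_items, n_factors = 3, 3, 2

rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(n_users, n_factors))  # user latent factors
Q = rng.normal(scale=0.1, size=(n_items, n_factors))  # item latent factors


def rmse(P, Q):
    """RMSE of the dot-product predictions on the observed ratings."""
    errs = [r - P[u] @ Q[i] for u, i, r in ratings]
    return float(np.sqrt(np.mean(np.square(errs))))


rmse_before = rmse(P, Q)

lr, reg = 0.05, 0.02  # assumed learning rate and L2 regularisation
for epoch in range(200):
    for u, i, r in ratings:
        err = r - P[u] @ Q[i]          # prediction error for this rating
        pu = P[u].copy()               # keep the old user vector for the item update
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * pu - reg * Q[i])

rmse_after = rmse(P, Q)
print(rmse_before, rmse_after)
```

Each predicted rating is the dot product of a user vector and an item vector, so unseen (user, beer) pairs can be scored with `P[u] @ Q[i]` once the factors are learnt.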

## Part 1: Data Preprocessing

In [25]:
# based on Google Colab 
# python 3
!pip install scikit-surprise



In [0]:
import pandas as pd
import numpy as np
import os
import itertools as it
import matplotlib.pyplot as plt
import gzip
import pickle
from matplotlib.ticker import FormatStrFormatter
# train_test_split comes from Surprise (it splits a Surprise Dataset, not arrays)
from surprise.model_selection import train_test_split
from surprise import NMF, Reader, Dataset, SVD, accuracy, KNNWithMeans

In [27]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


### 1.1 Load Data

In [0]:
# load metadata from Google Drive
# save as a list of records (one list of field values per review)
with gzip.open('/content/gdrive/My Drive/Final Project/Beeradvocate.txt.gz', 'r') as f:
    rb_file = f.readlines()

data = []
row_out = []

for i in rb_file:
    row = i.decode('utf-8', errors='replace')
    # a blank line marks the end of one review
    if row == '\n':
        data.append(row_out)
        row_out = []
        continue
    cat, field = row.split(":", 1)
    # remove trailing whitespace (including the newline)
    field = field.rstrip()
    row_out.append(field)
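
The parsing loop above can be tried without the Drive file. The sketch below runs the same idea on a small in-memory gzip sample. The sample text and its `beer/name` and `review/overall` keys are made up for illustration, and this version streams the lines and strips whitespace on both sides of each value, which differs slightly from the cell above.

```python
import gzip
import io

# Made-up sample in a "key: value" layout, records separated by blank lines.
sample = (
    "beer/name: Example Ale\n"
    "review/overall: 4.0\n"
    "\n"
    "beer/name: Example Stout\n"
    "review/overall: 3.5\n"
    "\n"
)
raw = gzip.compress(sample.encode('utf-8'))

records = []
current = []
with gzip.open(io.BytesIO(raw), 'rb') as f:
    for line in f:  # stream line by line instead of readlines()
        row = line.decode('utf-8', errors='replace')
        if row == '\n':  # blank line closes the current record
            records.append(current)
            current = []
            continue
        key, value = row.split(':', 1)
        current.append(value.strip())

print(records)
```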

In [0]:
# convert list to dataframe
column_names = ['beer_name', 'beer_beerId', 'beer_brewer', 'beer_ABV', 'beer_style', 
                'review_appearance', 'review_aroma', 'review_palate', 'review_taste', 
                'review_overall', 'review_time', 'review_profileName', 'review_text']

df = pd.DataFrame.from_records(data, columns=column_names)

In [30]:
# descriptive 
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1586614 entries, 0 to 1586613
Data columns (total 13 columns):
beer_name             1586614 non-null object
beer_beerId           1586614 non-null object
beer_brewer           1586614 non-null object
beer_ABV              1586614 non-null object
beer_style            1586614 non-null object
review_appearance     1586614 non-null object
review_aroma          1586614 non-null object
review_palate         1586614 non-null object
review_taste          1586614 non-null object
review_overall        1586614 non-null object
review_time           1586614 non-null object
review_profileName    1586614 non-null object
review_text           1586614 non-null object
dtypes: object(13)
memory usage: 157.4+ MB


In [31]:
df.head(3)

Unnamed: 0,beer_name,beer_beerId,beer_brewer,beer_ABV,beer_style,review_appearance,review_aroma,review_palate,review_taste,review_overall,review_time,review_profileName,review_text
0,Sausa Weizen,47986,10325,5.0,Hefeweizen,2.5,2.0,1.5,1.5,1.5,1234817823,stcules,A lot of foam. But a lot.\tIn the smell some ...
1,Red Moon,48213,10325,6.2,English Strong Ale,3.0,2.5,3.0,3.0,3.0,1235915097,stcules,"Dark red color, light beige foam, average.\tI..."
2,Black Horse Black Beer,48215,10325,6.5,Foreign / Export Stout,3.0,2.5,3.0,3.0,3.0,1235916604,stcules,"Almost totally black. Beige foam, quite compa..."


In [32]:
print('The full dataset includes:')
print('%d unique beers;' % df.beer_beerId.nunique())
print('%d unique users;' % df.review_profileName.nunique())
print('and %d reviews in total.' % df.shape[0])

The full dataset includes:
66055 unique beers;
33388 unique users;
and 1586614 reviews in total.


#### Users Description

In [0]:
num_review_byuser = df.review_profileName.value_counts()
freq_list_user = np.array(list(dict(num_review_byuser).values()))

There are 33388 unique users; about 28000 of them have fewer than 30 reviews.
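
A count like this can be read off the `value_counts()` result directly. The sketch below uses a handful of hypothetical reviewer names and a threshold of 3 instead of 30, purely to keep the example small.

```python
import pandas as pd

# Hypothetical reviewer names; each entry stands for one review.
reviews = pd.Series(['ann', 'ann', 'ann', 'bob', 'cat', 'cat'])
counts = reviews.value_counts()          # reviews per user

threshold = 3                            # the text above uses 30
n_below = int((counts < threshold).sum())  # users with fewer than `threshold` reviews
share_below = n_below / counts.size

print(n_below, share_below)
```

On the real data the same expression would be applied to `num_review_byuser` with a threshold of 30.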

#### Beer Description

In [0]:
num_review_bybeer = df.beer_name.value_counts()
freq_list_beer = np.array(list(dict(num_review_bybeer).values()))

### 1.2 Subset from Metadata

In [35]:
# subset data for collaborative filtering
df1 = df[['beer_name', 'beer_beerId', 'review_profileName', 'review_overall', 'review_time']]
print('Original data size: %s' % str(df1.shape)) 

# remove NA 
df1 = df1[pd.notnull(df1.beer_name) & pd.notnull(df1.review_profileName) & pd.notnull(df1.review_overall)]
# remove blanks
df1 = df1.loc[df1.review_profileName != '']
df1 = df1.loc[df1.beer_name != '']
df1 = df1.loc[df1.review_overall != '']
print('After removing NAs and blanks: %s' % str(df1.shape)) 

# drop duplicate (beer, user) pairs, keeping the first row after sorting by review_time descending
# note: review_time is still a string at this point, so the sort is lexicographic;
# convert it with pd.to_numeric first if a strict chronological order is needed
df1['beer_user_pair'] = df1.beer_name + df1.review_profileName
df1 = df1.sort_values(by=['review_time'], ascending=False).drop_duplicates(subset=['beer_user_pair'])
print('After drop duplicate user-item pairs (only keep the latest rating), data size: %s' % str(df1.shape))

# convert review ratings to numeric
df1.review_overall = pd.to_numeric(df1.review_overall)

Original data size: (1586614, 5)
After removing NAs and blanks: (1586266, 5)
After drop duplicate user-item pairs (only keep the latest rating), data size: (1561405, 6)
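
Since `df.info()` reports `review_time` as an `object` column, it may be worth checking how the sort order affects which duplicate is kept. The toy frame below is hypothetical; it contrasts sorting the timestamps as strings with sorting them as numbers before `drop_duplicates`.

```python
import pandas as pd

# Hypothetical duplicate pair: the same user rated the same beer twice.
toy = pd.DataFrame({
    'pair': ['a', 'a'],
    'review_time': ['999999652', '1234817823'],  # Unix timestamps stored as strings
    'rating': [3.0, 4.5],
})

# Sorting strings compares character by character, so '9...' ranks above '1...'.
by_string = (toy.sort_values(by='review_time', ascending=False)
                .drop_duplicates(subset='pair'))

# Converting to numbers first gives a chronological order.
toy_num = toy.assign(review_time=pd.to_numeric(toy.review_time))
by_number = (toy_num.sort_values(by='review_time', ascending=False)
                    .drop_duplicates(subset='pair'))

print(by_string.rating.iloc[0], by_number.rating.iloc[0])
```

In this toy case the string sort keeps the older rating, while the numeric sort keeps the newer one.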


In [36]:
df1.head(3)

Unnamed: 0,beer_name,beer_beerId,review_profileName,review_overall,review_time,beer_user_pair
581215,Pete's Wicked Strawberry Blonde,381,bk3nj,3.0,999999652,Pete's Wicked Strawberry Blonde bk3nj
1023623,Fiji Bitter,1480,Mark,4.0,999980551,Fiji Bitter Mark
1077899,Wolaver's India Pale Ale,399,bcm119,3.5,999903142,Wolaver's India Pale Ale bcm119


#### Remove users and beers with few reviews

In [37]:
# subset three columns; copy so the new columns are not set on a slice of df1
cf = df1[['review_profileName', 'beer_name', 'review_overall']].copy()

# attach each user's review count (computed on the full dataset)
cnt_user = dict(num_review_byuser)
cf['user_freq'] = [cnt_user.get(x) for x in cf.review_profileName]

# attach each beer's review count (computed on the full dataset)
cnt_beer = dict(num_review_bybeer)
cf['beer_freq'] = [cnt_beer.get(x) for x in cf.beer_name]

# keep users with more than 10 reviews
# and beers with more than 5 reviews
cf = cf.loc[cf.user_freq > 10]
cf = cf.loc[cf.beer_freq > 5]


In [38]:
print('After removing bottom users and beers,')
print('%d unique beers;' % cf.beer_name.nunique())
print('%d unique users;' % cf.review_profileName.nunique())
print('and %d reviews in total.' % cf.shape[0])

After removing bottom users and beers,
18925 unique beers;
10189 unique users;
and 1423856 reviews in total.


In [39]:
cf.head(3)

Unnamed: 0,review_profileName,beer_name,review_overall,user_freq,beer_freq
581215,bk3nj,Pete's Wicked Strawberry Blonde,3.0,45,298
1023623,Mark,Fiji Bitter,4.0,532,8
1077899,bcm119,Wolaver's India Pale Ale,3.5,175,257


## Part 2: Prediction Models

In [0]:
# define functions to generate prediction dataframe
# get_Iu and get_Ui are borrowed from Surprise library
def get_Iu(uid):
    """Return the number of items rated by given user
    Args:
        uid: The raw id of the user.
    Returns:
        The number of items rated by the user.
    """
    try:
        return len(trainset.ur[trainset.to_inner_uid(uid)])
    except ValueError:  # user was not part of the trainset
        return 0
    
def get_Ui(iid):
    """Return the number of users that have rated given item
    Args:
        iid: The raw id of the item.
    Returns:
        The number of users that have rated the item.
    """
    try:
        return len(trainset.ir[trainset.to_inner_iid(iid)])
    except ValueError:  # item was not part of the trainset
        return 0

# customized function to get predictions
def get_pred_df(pred):
    """Turn a list of Surprise predictions into a DataFrame with per-row diagnostics."""
    pred_df = pd.DataFrame(pred, columns=['uid', 'iid', 'rui', 'est', 'details'])
    pred_df['Iu'] = pred_df.uid.apply(get_Iu)  # items rated by the user in trainset
    pred_df['Ui'] = pred_df.iid.apply(get_Ui)  # users who rated the item in trainset
    pred_df['err'] = abs(pred_df.est - pred_df.rui)  # absolute error

    # append review counts from the full dataset
    pred_df['user_freq'] = [cnt_user.get(x) for x in pred_df.uid]
    pred_df = pred_df.sort_values(by=['uid'])

    return pred_df

In [0]:
# train test split
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(cf[['review_profileName', 'beer_name', 'review_overall']], reader)
trainset, testset = train_test_split(data, test_size=.33)

### 2.1 SVD

In [42]:
# fit
svd = SVD(n_factors = 30, lr_all = 0.01, reg_all = 0.05)
svd.fit(trainset)

# test 
svd_pred = svd.test(testset)
accuracy.rmse(svd_pred, verbose=True)

# collect the test-set predictions in a dataframe
svd_pred_df = get_pred_df(svd_pred)
svd_pred_df.head()

RMSE: 0.5907


Unnamed: 0,uid,iid,rui,est,details,Iu,Ui,err,user_freq
66699,0110x011,Good Mojo,5.0,4.488709,{'was_impossible': False},83,5,0.511291,139
365936,0110x011,AleSmith Speedway Stout - Barrel Aged,3.5,4.537887,{'was_impossible': False},83,136,1.037887,139
144918,0110x011,Tovarish Imperial Espresso Stout,4.0,4.31627,{'was_impossible': False},83,10,0.31627,139
246970,0110x011,Sanctification,5.0,4.514253,{'was_impossible': False},83,291,0.485747,139
348056,0110x011,Arctic Devil Barley Wine,4.0,4.342719,{'was_impossible': False},83,145,0.342719,139
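
For reference, RMSE is the square root of the mean squared difference between the true rating (`rui`) and the estimate (`est`). The sketch below computes it by hand on two made-up pairs; it is only meant to show the formula, not to reproduce `accuracy.rmse` exactly.

```python
import math

# (true rating, estimated rating) pairs; made-up numbers
pairs = [(4.0, 3.0), (2.0, 2.0)]

squared_errors = [(r - est) ** 2 for r, est in pairs]
rmse = math.sqrt(sum(squared_errors) / len(squared_errors))

print(rmse)
```

The `err` column in the table above holds the absolute error per row, so squaring it, averaging, and taking the square root gives the same quantity for any subset of rows.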


In [0]:
# predict on (almost) the whole dataset: a new split with 99% of the ratings as test set
# note: this test set overlaps the ratings svd was trained on
trainset_s, testset_s = train_test_split(data, test_size=.99)

svd_pred_s = svd.test(testset_s)
svd_pred_df_s = get_pred_df(svd_pred_s)
with open("/content/gdrive/My Drive/Final Project/svd_rec.sav", 'wb') as f:
    pickle.dump(svd_pred_df_s, f)

### 2.2 Non-negative Matrix Factorization (NMF)

In [44]:
# fit
nmf = NMF(n_factors = 25, n_epochs = 50, reg_pu = 0.1, reg_qi = 0.1)
nmf.fit(trainset)

# test 
nmf_pred = nmf.test(testset)
nmf_rmse = accuracy.rmse(nmf_pred, verbose=True) 
print('Test RMSE of NMF is %s' % round(nmf_rmse, 3))

# collect the test-set predictions in a dataframe
nmf_pred_df = get_pred_df(nmf_pred)
nmf_pred_df.head()

RMSE: 0.5980
Test RMSE of NMF is 0.598


Unnamed: 0,uid,iid,rui,est,details,Iu,Ui,err,user_freq
66699,0110x011,Good Mojo,5.0,4.548092,{'was_impossible': False},83,5,0.451908,139
365936,0110x011,AleSmith Speedway Stout - Barrel Aged,3.5,4.595024,{'was_impossible': False},83,136,1.095024,139
144918,0110x011,Tovarish Imperial Espresso Stout,4.0,4.332879,{'was_impossible': False},83,10,0.332879,139
246970,0110x011,Sanctification,5.0,4.551182,{'was_impossible': False},83,291,0.448818,139
348056,0110x011,Arctic Devil Barley Wine,4.0,4.301966,{'was_impossible': False},83,145,0.301966,139


In [0]:
# predict on (almost) the whole dataset, using the 99% split from the SVD section
nmf_pred_s = nmf.test(testset_s)
nmf_pred_df_s = get_pred_df(nmf_pred_s)
# save the whole-dataset predictions
with open("/content/gdrive/My Drive/Final Project/nmf_rec.sav", 'wb') as f:
    pickle.dump(nmf_pred_df_s, f)
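
One possible use of a saved prediction table is to rank beers per user by estimated rating. The sketch below is an assumption about how that could be done, not a step from this notebook; the `top_n` helper and the toy rows are hypothetical, and only the `uid`, `iid` and `est` column names follow `get_pred_df`.

```python
import pandas as pd

# Toy prediction table with the same column names as get_pred_df's output.
pred = pd.DataFrame({
    'uid': ['u1', 'u1', 'u1', 'u2', 'u2'],
    'iid': ['beer_a', 'beer_b', 'beer_c', 'beer_a', 'beer_c'],
    'est': [4.2, 3.1, 4.8, 2.5, 3.9],
})


def top_n(pred_df, n=2):
    """Return the n rows with the highest estimate for each user (hypothetical helper)."""
    ranked = pred_df.sort_values(by=['uid', 'est'], ascending=[True, False])
    return ranked.groupby('uid').head(n)


top = top_n(pred, n=2)
print(top)
```

In practice the table would first be filtered to beers the user has not rated yet, since the saved predictions cover ratings that already exist.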