
## Explanation of Matrix Factorization

We want to estimate a vector of length $k$ for each beer and each user in our data set: $b_i$ for beer $i$ and $u_j$ for user $j$. The vector $b_i$ represents the extent to which beer $i$ possesses each of the latent factors that users care about, and $u_j$ similarly represents the extent to which user $j$ is interested in those factors.

Taking the dot product of these two vectors gives an estimate of the rating that a specific user would give a specific beer.

\begin{equation}
\hat{r}_{ij} = b_i^T u_j
\end{equation}

Assuming our initial rating matrix $R$ has dimensions $m \times n$ ($m$ users and $n$ beers), we can stack these vectors into two matrices and write the factorization in matrix form.

\begin{align}
\hat{R} &= U B^T
\end{align}
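
As a small illustration of the matrix form, the NumPy sketch below builds random factor matrices and checks that one entry of $\hat{R}$ equals the dot product of the matching user and beer vectors. The sizes and the random values are assumptions chosen only for the example, not values from the data set.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 4, 3, 2              # toy sizes (assumed): users, beers, latent factors
U = rng.normal(size=(m, k))    # one row of latent factors per user
B = rng.normal(size=(n, k))    # one row of latent factors per beer

R_hat = U @ B.T                # (m x n) matrix of predicted ratings

# Entry (j, i) of R_hat is the dot product of user j's and beer i's factors
j, i = 2, 1
single_prediction = B[i] @ U[j]
```

Rows of `R_hat` correspond to users and columns to beers, which matches the layout of the rating matrix built in the data preparation step.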

where $U$ is an $m \times k$ matrix that represents each user's association with the latent factors and $B$ is an $n \times k$ matrix that represents each beer's association with the latent factors. We use gradient descent to estimate the matrices $U$ and $B$. Our loss/objective function for each rating can be defined as:

\begin{align}
l_{ij} = (R_{ij} - b_i^T u_j)^2
\end{align}

Summing over all observed ratings, the overall objective function becomes:

\begin{align}
\min\limits_{U,B} \sum_{(i,j) \in R} (R_{ij} - b_i^T u_j)^2
\end{align}
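
The sum runs only over ratings that were actually observed. A minimal sketch of that computation is shown below, assuming (as the later split code does with `R>0`) that a zero entry marks a missing rating. The function name `observed_squared_error` and the toy matrices are hypothetical.

```python
import numpy as np

def observed_squared_error(R, U, B):
    """Sum of squared errors over the observed (non-zero) entries of R."""
    mask = R > 0                      # assumption: 0 marks a missing rating
    residual = (R - U @ B.T) * mask   # unobserved entries contribute nothing
    return float(np.sum(residual ** 2))

R_toy = np.array([[5.0, 0.0],
                  [0.0, 3.0]])        # rows are users, columns are beers
U_toy = np.array([[1.0], [1.0]])      # k = 1 latent factor per user
B_toy = np.array([[4.0], [2.0]])      # k = 1 latent factor per beer

loss = observed_squared_error(R_toy, U_toy, B_toy)
```

Here the two observed ratings are predicted as 4 and 2, so each contributes a squared error of 1.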

Adding a regularization term to prevent overfitting, we have:

\begin{align}
\min\limits_{U,B} \sum_{(i,j) \in R} (R_{ij} - b_i^T u_j)^2 + \beta(\|B\|^2 + \|U\|^2)
\end{align}
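
To make the optimization concrete, here is a hedged sketch of full-batch gradient descent on a regularized objective of this kind. It assumes the common squared-norm penalty $\beta(\|B\|^2 + \|U\|^2)$ and a zero-means-missing convention. The helper names, toy ratings, learning rate and $\beta$ are illustrative assumptions, and this is plain NumPy rather than the TensorFlow setup the notebook imports.

```python
import numpy as np

def regularized_loss(R, U, B, beta):
    """Squared error on observed entries plus an L2 penalty on both factors."""
    mask = R > 0                                  # assumption: 0 = not rated
    err = (R - U @ B.T) * mask
    return float(np.sum(err ** 2) + beta * (np.sum(U ** 2) + np.sum(B ** 2)))

def gradient_step(R, U, B, beta, lr):
    """One full-batch gradient descent update of U and B."""
    mask = R > 0
    err = (R - U @ B.T) * mask
    grad_U = -2 * err @ B + 2 * beta * U          # d(loss)/dU
    grad_B = -2 * err.T @ U + 2 * beta * B        # d(loss)/dB
    return U - lr * grad_U, B - lr * grad_B

rng = np.random.default_rng(0)
R_toy = np.array([[5.0, 3.0, 0.0],
                  [4.0, 0.0, 1.0],
                  [0.0, 2.0, 4.0],
                  [1.0, 0.0, 5.0]])               # made-up ratings, users x beers
k = 2
U = rng.normal(scale=0.1, size=(R_toy.shape[0], k))
B = rng.normal(scale=0.1, size=(R_toy.shape[1], k))

loss_before = regularized_loss(R_toy, U, B, beta=0.02)
for _ in range(500):
    U, B = gradient_step(R_toy, U, B, beta=0.02, lr=0.01)
loss_after = regularized_loss(R_toy, U, B, beta=0.02)
```

On real data a stochastic or mini-batch variant is usually preferable, since the dense residual matrix can be large.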




In [1]:
import pandas as pd 
import numpy as np
from scipy.spatial.distance import cosine
from scipy.sparse import csr_matrix, find,lil_matrix
import tensorflow as tf
import pickle
import time

## Data Preparation

In [None]:
df=pd.read_csv("final_data.csv")
dfc=df[["score_overall","user_id","beer_names","beer_id","brewery_name"]].dropna()


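# Keep only beers and users that have at least 50 ratings each
beers=dfc.groupby('beer_id').count().query("score_overall >=50").index
users=dfc.groupby('user_id').count().query("score_overall >=50").index
# Combine both conditions in one mask; chaining two boolean selections triggers a pandas reindexing warning
df_filtered=dfc[dfc.beer_id.isin(beers) & dfc.user_id.isin(users)]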

# Map raw ids to contiguous 0-based indices (row = user, column = beer)
users=pd.factorize(df_filtered.user_id)[0]
beers=pd.factorize(df_filtered.beer_id)[0]
index_to_userid=dict(zip(users,df_filtered.user_id))
index_into_beerid=dict(zip(beers,df_filtered.beer_id))

# Dense rating matrix of shape (users, beers); 0 means "not rated"
R=np.zeros((len(index_to_userid),len(index_into_beerid)))
# Vectorized fill that replaces the slow row-by-row iterrows loop
R[users,beers]=df_filtered.score_overall.values

# Human-readable beer label for each beer index
index_into_beername=dict(zip(beers, df_filtered.beer_names+ " by "+ df_filtered.brewery_name ))

R=np.nan_to_num(R)
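
The indexing step relies on `pd.factorize`, which assigns integer codes in order of first appearance. The toy example below, with made-up ids and scores, sketches how those codes place each rating in a users-by-beers matrix.

```python
import numpy as np
import pandas as pd

# Made-up ratings purely for illustration
toy = pd.DataFrame({
    "user_id": ["u9", "u3", "u9", "u7"],
    "beer_id": [101, 101, 205, 205],
    "score_overall": [4.5, 3.0, 2.5, 5.0],
})

# factorize returns (integer codes, unique values in order of first appearance)
user_codes, user_ids = pd.factorize(toy.user_id)
beer_codes, beer_ids = pd.factorize(toy.beer_id)

# Rows are users, columns are beers, 0 means "not rated"
R_toy = np.zeros((len(user_ids), len(beer_ids)))
R_toy[user_codes, beer_codes] = toy.score_overall.to_numpy()
```

Row 0 of `R_toy` belongs to `u9`, the first user seen, and holds that user's ratings for both beers.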

### Test and Training Set Creation

In [None]:

## Create the test and training sets, and save the beer ids and names for use in the recommendation app

np.random.seed(seed=42)
#z_i,z_j,_=find(R>0)
z_i,z_j=np.where(R>0)
all_data=np.arange(len(z_i))
test=np.random.choice(len(all_data),size=len(all_data)//10,replace=False)
train=all_data[np.isin(all_data,test,invert=True)]

## Test and training sets for TensorFlow: R_test keeps only the held-out ratings, R keeps only the training ratings
R_test=R.copy()
R_test[z_i[train],z_j[train]]=0
R[z_i[test],z_j[test]]=0



# Save the index-to-name mapping; the with block closes the file even if dump fails
with open("beers.pickle","wb") as pickle_out:
    pickle.dump(index_into_beername, pickle_out)
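
A quick sanity check for a hold-out split like the one above is that no rating appears in both matrices and that together they reproduce the original. The sketch below shows this on synthetic ratings; the matrix size, the sparsity level and the names `R_full`, `R_train` and `R_test_toy` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic ratings from 1 to 5, with roughly 40% knocked out as "unrated"
R_full = rng.integers(1, 6, size=(20, 10)).astype(float)
R_full[rng.random(R_full.shape) < 0.4] = 0

# Positions of the observed ratings, then a 10% hold-out without replacement
rows, cols = np.where(R_full > 0)
idx = np.arange(len(rows))
test_idx = rng.choice(idx, size=len(idx) // 10, replace=False)
train_idx = np.setdiff1d(idx, test_idx)

R_train = R_full.copy()
R_train[rows[test_idx], cols[test_idx]] = 0      # hide the test ratings
R_test_toy = R_full.copy()
R_test_toy[rows[train_idx], cols[train_idx]] = 0 # keep only the test ratings

overlap = np.count_nonzero((R_train > 0) & (R_test_toy > 0))
recombined = R_train + R_test_toy
```

If `overlap` is zero and `recombined` equals `R_full`, the two matrices partition the observed ratings.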
