In [None]:
# general imports
import pandas as pd
import numpy as np
import os
from IPython.display import display
pd.options.display.max_columns = None

from util.cloud_connection import bucket_connection
from src.recommender import recommender


recommender = recommender.Recommender()

# Enrich sparse matrix through SVD

Motivated by [this blog post](http://nicolas-hug.com/blog/matrix_facto_4), we try to enrich the sparse user-meal matrix with singular value decomposition (SVD) and then compute the missing entries from the resulting user and meal vectors.
We do not use the `surprise` framework here, however, because its documentation is lacking.
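
As a minimal sketch of the idea (not the code used in this notebook), the missing entries can be filled from the dot products of user and meal vectors. The matrices `R`, `P` and `Q` below are made-up toy values, and treating `0` as "not rated" is an assumption of this sketch.

```python
import numpy as np

# Toy user-meal matrix; 0 marks a missing rating (assumption for this sketch).
R = np.array([[5.0, 0.0, 1.0],
              [4.0, 0.0, 1.0],
              [0.0, 2.0, 5.0]])

# Hypothetical user and meal vectors with 2 latent factors each.
P = np.array([[2.0, 0.1], [1.7, 0.1], [0.3, 2.0]])   # one row per user
Q = np.array([[2.4, 0.2], [0.5, 1.0], [0.3, 2.4]])   # one row per meal

# Dense estimate for every (user, meal) pair.
R_hat = P @ Q.T

# Keep the known ratings and fill only the gaps with estimates.
R_filled = np.where(R > 0, R, R_hat)
print(R_filled)
```

Whether the known ratings are kept or replaced by their estimates is a design choice; the cells below use the estimates everywhere.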

From the `sklearn.decomposition.TruncatedSVD` [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html):

```
"Dimensionality reduction using truncated SVD (aka LSA).

This transformer performs linear dimensionality reduction by means of truncated singular value decomposition (SVD). Contrary to PCA, this estimator does not center the data before computing the singular value decomposition. This means it can work with scipy.sparse matrices efficiently.

In particular, truncated SVD works on term count/tf-idf matrices as returned by the vectorizers in sklearn.feature_extraction.text. In that context, it is known as latent semantic analysis (LSA).

This estimator supports two algorithms: a fast randomized SVD solver, and a “naive” algorithm that uses ARPACK as an eigensolver on (X * X.T) or (X.T * X), whichever is more efficient."
```
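
To illustrate the point about sparse input, here is a small sketch that feeds a `scipy.sparse` matrix directly into `TruncatedSVD`. The random matrix, its shape and the number of components are arbitrary stand-ins, not the recommender's data.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Random stand-in for a sparse rating matrix: 20 users x 15 meals, about 70% empty.
rng = np.random.default_rng(0)
dense = rng.integers(0, 6, size=(20, 15)).astype(float)
dense[rng.random(dense.shape) < 0.7] = 0.0
X = sparse.csr_matrix(dense)

# The sparse matrix is passed as-is; no centering and no conversion to dense.
svd = TruncatedSVD(n_components=5, random_state=0)
X_reduced = svd.fit_transform(X)

print(X_reduced.shape)        # one 5-dimensional vector per user
print(svd.components_.shape)  # one row per component, one column per meal
```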

In [None]:
from sklearn.decomposition import TruncatedSVD
import matplotlib.pyplot as plt

SVD_K = 52    # number of SVD components to compute

df = recommender.user_similarity['cosine']

svd = TruncatedSVD(n_components=SVD_K)
df = pd.DataFrame(svd.fit_transform(df))

# cumulative share of variance explained by the first n components
vars_accum_sums = np.cumsum(svd.explained_variance_ratio_)
plt.plot(vars_accum_sums)
plt.show()

The curve above shows the accumulated variance explained by the first n most impactful components.

We can see that the first 10 components only cover approx. 70% of total variance, while the first 20 components already cover approx. 90%. After 30 components, most of the variance is covered.
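
Such thresholds can also be read off the cumulative sum directly instead of from the plot. The sketch below assumes a made-up array `ratios` standing in for `svd.explained_variance_ratio_`; the helper `components_needed` is a hypothetical name introduced here.

```python
import numpy as np

# Hypothetical explained-variance ratios, standing in for svd.explained_variance_ratio_.
ratios = np.array([0.30, 0.20, 0.15, 0.10, 0.08, 0.07, 0.05, 0.03, 0.02])
cumulative = np.cumsum(ratios)

def components_needed(cumulative, threshold):
    """Smallest number of leading components whose cumulative ratio reaches threshold."""
    if threshold > cumulative[-1]:
        raise ValueError("threshold is not reached by the available components")
    # searchsorted returns the first index with cumulative[index] >= threshold
    return int(np.searchsorted(cumulative, threshold)) + 1

print(components_needed(cumulative, 0.70))
print(components_needed(cumulative, 0.85))
```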

This is an issue for visualization purposes, as the first 1-3 components alone are not expressive enough.

In [None]:
n_factors = 30  # number of factors
alpha = .01  # learning rate
n_epochs = 200  # number of iterations of the SGD procedure

class sparseSVD:
    
    def __init__(self, data):
        self.p = np.random.normal(0, .1, (data['n_users'], n_factors))
        self.q = np.random.normal(0, .1, (data['n_items'], n_factors))

    def SGD(self, data):
        '''Learn the vectors p_u and q_i with SGD.
           data is a dataset containing all ratings + some useful info (e.g. number
           of items/users).
        '''


        # Randomly initialize the user and item factors.
        p = np.random.normal(0, .1, (data['n_users'], n_factors))
        q = np.random.normal(0, .1, (data['n_items'], n_factors))

        # Optimization procedure
        for _ in range(n_epochs):
            for u, i, r_ui in data['all_ratings']:
                # Prediction error for this known rating
                err = r_ui - np.dot(p[u], q[i])
                # Update vectors p_u and q_i (q_i is updated with the already-updated p_u)
                p[u] += alpha * err * q[i]
                q[i] += alpha * err * p[u]
        
        self.p = p
        self.q = q
            
    def estimate(self, u, i):
        '''Estimate rating of user u for item i.'''
        return np.dot(self.p[u], self.q[i])


df = recommender.df_user_item
data = {
    'n_users': df.shape[0],
    'n_items': df.shape[1],
    'all_ratings': [(u,i, df.iloc[u,i]) for u in range(df.shape[0]) for i in range(df.shape[1]) if df.iloc[u,i] > 0]
}

svd = sparseSVD(data)
svd.SGD(data)

# svd.estimate(51,76)
# data['all_ratings']

# for idx, row in df.iterrows():
#     for i in row.size:


# Estimate every (user, meal) pair at once; entry [u, i] equals svd.estimate(u, i)
new_pred = svd.p @ svd.q.T
#         if df.iloc[u,i] > 0:
#             print("{},{}:\t actual={} \t predicted={}".format(u,i,df.iloc[u,i],svd.estimate(u,i)))
df_new_pred = pd.DataFrame(new_pred, index=recommender.df_user_item.index, columns=recommender.df_user_item.columns)

df_new_pred.head()


In [None]:
recommender.df_user_item.head()

With 200 epochs, our SVD already approximates the known ratings well (e.g. user 1 rating 57 at 0.978 ≈ 1.0, user 4 rating 160 at 5.000124 ≈ 5.0). However, users with only a few, predominantly low ratings (e.g. two ratings of 1.0) are assigned only negative ratings.

We conclude that SVD learns user-wide and food-wide preferences which do not suffice to enrich sparse rows/columns of this matrix.
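
The two observations above can also be checked numerically rather than by eye. The following is a sketch on made-up toy arrays, not on the notebook's results; `known_rating_rmse` and `all_negative_users` are hypothetical helper names, and `0` is again assumed to mean "not rated".

```python
import numpy as np

def known_rating_rmse(actual, predicted):
    """RMSE over the entries that carry a known rating (actual > 0)."""
    mask = actual > 0
    return float(np.sqrt(np.mean((actual[mask] - predicted[mask]) ** 2)))

def all_negative_users(predicted):
    """Row positions of users whose predicted ratings are all negative."""
    return np.flatnonzero((predicted < 0).all(axis=1))

# Toy data: the second user has a single low rating and only negative predictions.
actual = np.array([[1.0, 0.0, 5.0],
                   [0.0, 1.0, 0.0]])
predicted = np.array([[0.9, 2.0, 5.1],
                      [-0.3, -0.1, -0.2]])

print(known_rating_rmse(actual, predicted))
print(all_negative_users(predicted))
```

Applied to the notebook's data, the inputs would be `recommender.df_user_item.values` and `df_new_pred.values`, assuming both frames share the same row and column order.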