# Exploring User Clustering Representations to identify user structure

The goal of this exploration is not so much to understand how the provided user rating data is distributed (average number of ratings per user, number of users with more than 10 ratings, etc.) as to assess how expressive the ratings are for actual user profiling.

To this end, we prepare a _user-food-matrix_ and apply dimensionality reduction to it to see how well the users separate in low-dimensional (2D) space.

Subsequently, a _user-user-matrix_ is constructed to compare user profile similarities (a kind of second-moment comparison), in contrast to the (very sparse) user-food-matrix.

In [None]:
# general imports
import pandas as pd
import numpy as np
import os
from util.cloud_connection import bucket_connection
from IPython.display import display
pd.options.display.max_columns = None

# constants
RATING_CSV = 'data/rating_normalized.csv'

In [None]:
df_ratings = pd.read_csv(RATING_CSV)
df = bucket_connection.get_meals()

First we merge the normalized (primary) title into the rating table, essentially looking up `m_id`, to see which meals the ratings actually belong to.

In [None]:
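# Look up each rating's primary title via m_id (left join keeps every rating row).
# drop_duplicates guards against repeated m_id entries in the meals table.
df_ratings = df_ratings.merge(df[['m_id', 'title_prim']].drop_duplicates('m_id'),
                              on='m_id', how='left')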

df_ratings.head()

## Generate (sparse) user-item-matrix

Now build a (pretty _sparse_) user-(primary)-meal matrix across all primary meal titles, where each user's ratings are entered at the corresponding positions.

Note: A user may have rated meals with the same primary title at different times. These ratings are averaged (`np.mean`) and entered as a single rating.
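To make the averaging step concrete, here is a minimal sketch on made-up toy data (the users, titles and ratings below are illustrative assumptions, not taken from the actual data set). It shows how `pivot_table` collapses repeated ratings of the same primary title into their mean and how unrated meals end up as 0.

```python
import pandas as pd

# Toy ratings (hypothetical): user u1 rated "pasta" twice.
toy_ratings = pd.DataFrame({
    "user": ["u1", "u1", "u2", "u2"],
    "title_prim": ["pasta", "pasta", "pasta", "salad"],
    "rating": [2.0, 4.0, 5.0, 1.0],
})

# Duplicate (user, title_prim) pairs are averaged; missing pairs become 0.
toy_user_item = toy_ratings.pivot_table(index="user",
                                        columns="title_prim",
                                        values="rating",
                                        aggfunc="mean").fillna(0)
print(toy_user_item)
```

Note that after `fillna(0)` a 0 stands for "not rated", which is why the notebook counts non-zero entries to measure sparsity.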

In [None]:
df_user_item = df_ratings.pivot_table(index="user",
                                      columns="title_prim",
                                      values="rating",
                                      aggfunc=np.mean).fillna(0)


print("Shape: {}, Size: {}".format(df_user_item.shape, df_user_item.size))
print("Non-zero entries: {}".format(np.count_nonzero(df_user_item)))
print("... making up {}% of entries".format(round((np.count_nonzero(df_user_item)/df_user_item.size)*100, 4)))

df_user_item

## Compute user similarities & generate user-user-matrix

... and subsequently briefly evaluate each generated matrix

In [None]:
from sklearn.metrics.pairwise import pairwise_distances

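user_user_mats = []
for metric in ['correlation', 'cosine', 'dice', 'jaccard']:
    # Similarity = 1 - distance. 'dice' and 'jaccard' are boolean metrics, so
    # sklearn only sees rated / not rated for them, not the rating values.
    # Named df_sim so the meals DataFrame `df` loaded above is not overwritten.
    df_sim = pd.DataFrame(1 - pairwise_distances(df_user_item, metric=metric))
    user_user_mats.append({'metric': metric, 'df': df_sim})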
    
def print_stats(mat):
    metric, df = mat['metric'], mat['df']
    print("--> Simi-fct.: {}".format(metric))
    print("Shape: {}, Size: {}".format(df.shape, df.size))
    print(df.iloc[:7,:7].round(4))
    print("\n> Row-sums: {}".format(list(df.sum().round(2))))
    print("> Total-sum:\t{}".format(round(df.sum().sum(), 4)))
    entries_zero = df.size - np.count_nonzero(df)
    print("\n> Zero-Entries:\t{}%\t({} entries)".format(round(entries_zero/df.size*100, 2), entries_zero))
    entries_pos = len(np.where(df > 0.0)[0])
    print("> Positive:\t{}%\t({} entries)".format(round(entries_pos/df.size*100, 2), entries_pos))
    entries_neg = len(np.where(df < 0.0)[0])
    print("> Negative:\t{}%\t({} entries)".format(round(entries_neg/df.size*100, 2), entries_neg))
    entries_greater_dot5 = len(np.where(df > 0.5)[0])
    print("> Greater 0.5:\t{}%\t({} entries)".format(round(entries_greater_dot5/df.size*100, 2), entries_greater_dot5))

In [None]:
print_stats(user_user_mats[0])

### (Pearson) Correlation - Evaluation

Only 8 values are bigger than 0.5 (after subtracting the 53 values of 1.0 on the diagonal of the matrix).

With the (Pearson) correlation coefficient we get a dense matrix, but it has a lot of negative values. We need to check how those affect the further recommendation steps.
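The following toy sketch (made-up ratings, not the real data) illustrates where the negative values come from: two users who rated completely different meals have a cosine similarity of exactly 0, but because Pearson correlation first subtracts each user's mean over all meals (including the zero-filled ones), the same pair gets a negative correlation.

```python
import numpy as np

# Two hypothetical users who rated disjoint meals (0 = not rated).
X = np.array([[5.0, 0.0, 0.0, 0.0],
              [0.0, 5.0, 0.0, 0.0]])

# Cosine similarity: dot product of the length-normalized rows.
norms = np.linalg.norm(X, axis=1, keepdims=True)
cosine_sim = (X / norms) @ (X / norms).T

# Pearson similarity between rows (what 1 - 'correlation' distance yields).
pearson_sim = np.corrcoef(X)

print("cosine :", cosine_sim[0, 1])
print("pearson:", pearson_sim[0, 1])
```

So with zero-filled missing ratings, a negative correlation does not necessarily signal opposing tastes; it can simply reflect that two users rated different meals.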

In [None]:
print_stats(user_user_mats[1])

### Cosine - Evaluation

We have one user whose cosine similarities to all users, including himself, sum to 1. Since the self-similarity is already 1, this user has no similarity with any other user.

The user-item matrix is 6.3% nonzero, whereas the user-user similarity matrix is 25% nonzero.
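A small sketch on toy data (hypothetical ratings, assuming all actual ratings are positive) shows why this happens: the cosine similarity of two users is nonzero exactly when they have rated at least one meal in common, so the user-user matrix is denser than the user-item matrix, and a user without any co-rated meal gets a row sum of 1.

```python
import numpy as np

# Toy user-item matrix (hypothetical); the third user shares no meal with anyone.
X = np.array([[4.0, 0.0, 3.0, 0.0],
              [0.0, 5.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 2.0]])

# Number of co-rated meals per user pair.
rated = (X != 0).astype(int)
overlap = rated @ rated.T

# Cosine similarity from length-normalized rows.
X_norm = X / np.linalg.norm(X, axis=1, keepdims=True)
cosine_sim = X_norm @ X_norm.T

print(overlap)
print(cosine_sim.round(3))
print("row sums:", cosine_sim.sum(axis=1).round(3))
```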

In [None]:
print_stats(user_user_mats[2])

### Dice - Evaluation

670 values are bigger than 0, while 6 (59-53) are bigger than 0.5.

With the Dice similarity, one user has no similarity to any other user. It's the same user as with the cosine similarity.

In [None]:
print_stats(user_user_mats[3])

### Jaccard - Evaluation

670 values are bigger than 0, while no values are bigger than 0.5.
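That Dice and Jaccard agree on the number of positive entries but not on the number above 0.5 follows from how the two are related: for sets of rated meals $A$ and $B$, Dice is $D = 2|A \cap B| / (|A| + |B|)$ and Jaccard is $J = |A \cap B| / |A \cup B|$, which gives $J = D / (2 - D)$. Both are therefore zero for the same pairs, $J \le D$ always holds, and $J > 0.5$ requires $D > 2/3$. A minimal sketch with two hypothetical users (helper names are illustrative):

```python
import numpy as np

def dice_sim(a, b):
    """Dice coefficient of two boolean 'rated' vectors."""
    inter = np.sum(a & b)
    return 2.0 * inter / (np.sum(a) + np.sum(b))

def jaccard_sim(a, b):
    """Jaccard coefficient of two boolean 'rated' vectors."""
    return np.sum(a & b) / np.sum(a | b)

# Two hypothetical users who each rated three meals, two of them in common.
a = np.array([1, 1, 1, 0, 0], dtype=bool)
b = np.array([1, 1, 0, 1, 0], dtype=bool)

d, j = dice_sim(a, b), jaccard_sim(a, b)
print("Dice:", d, "Jaccard:", j, "D / (2 - D):", d / (2 - d))
```

Here the pair exceeds 0.5 under Dice but not under Jaccard.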

### --> Intermediate summary:

So far we've computed a sparse _user-food-matrix_ and 4 _user-user-matrices_ according to 4 different similarity functions ('Pearson Correlation', 'Cosine', 'Dice' and 'Jaccard').

As each row of these matrices represents one user's preference profile, we now reduce their dimensionality to 2 and plot the result on the 2D plane to inspect clusters and the positioning of these profiles.

## Dimensionality Reduction and Visualization

Next, we (possibly) standardize the matrix values and apply PCA & t-SNE to see some first results.
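As a rough illustration of what the optional standardization and the PCA step do, here is a NumPy-only sketch on random toy data. The helper names `standardize` and `pca_2d` are hypothetical and not part of the notebook; the notebook itself uses `sklearn.decomposition.PCA`, which centers the data but does not scale it.

```python
import numpy as np

def standardize(X):
    """Column-wise z-scoring; constant columns are mapped to zero."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return (X - mean) / std

def pca_2d(X):
    """Project the rows of X onto the first two principal components via SVD."""
    X_centered = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:2].T

# Toy matrix (assumption): 6 users, 4 meals, last column constant.
rng = np.random.default_rng(0)
X_toy = rng.random((6, 4))
X_toy[:, 3] = 1.0

Z = pca_2d(standardize(X_toy))
print(Z.round(3))
```

Whether standardizing is desirable here is a judgment call: it gives rarely rated meals the same weight as frequently rated ones.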

In [None]:
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

ANNOTATE_THRESH_NUM = 10

"""Preprocessing"""

# from sklearn.preprocessing import StandardScaler

# vals = df_user_item.values
# x = StandardScaler().fit_transform(vals)

# # pd.DataFrame(x)

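X_names = ['User-Food filled with 0s',
           'User-Food filled with 2.5s']
# Missing ratings were stored as 0 by fillna(0) above, so a 0 is treated as
# "not rated" here and replaced by 2.5 for the second variant.
X_all = [df_user_item.values,
         df_user_item.replace(0, 2.5).values]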
for user_user_mat in user_user_mats:
    X_names.append('User-User ({})'.format(user_user_mat['metric']))
    X_all.append(user_user_mat['df'].values)

# Number of ratings per user: count the non-zero entries in each row.
users_num_ratings = np.count_nonzero(df_user_item.values, axis=1)
# print(users_num_ratings)

"""Dimensionality-reduction to 2D."""

nc, dfs = 2, []
for X in X_all:
    # One PCA projection followed by t-SNE at several perplexities per matrix.
    pca = PCA(n_components=nc)
    dfs_X = [{'title': 'PCA',
              'df': pd.DataFrame(pca.fit_transform(X), columns=['c1', 'c2'])}]
    for perplexity in (3, 5, 10):
        tsne = TSNE(n_components=nc, perplexity=perplexity)
        dfs_X.append({'title': 't-SNE; perp={}'.format(perplexity),
                      'df': pd.DataFrame(tsne.fit_transform(X), columns=['c1', 'c2'])})
    dfs.append(dfs_X)

"""Visualization."""

fig = plt.figure(figsize = (16,24))

print('Number of components (dim.): {}'.format(nc))
for i, dfs_X in enumerate(dfs):
    print('Row {}: "{}"'.format(i+1, X_names[i]))
    for j, df_type in enumerate(dfs_X):
        ax = plt.subplot(len(dfs), len(dfs_X), i*len(dfs_X)+j+1)
#         ax.set_xlabel('Component 1', fontsize = 15)
#         ax.set_ylabel('Component 2', fontsize = 15)
        ax.set_title(df_type['title'], fontsize = 25)
        ax.grid()
        scatter = ax.scatter(df_type['df'].loc[:, 'c1'], 
                             df_type['df'].loc[:, 'c2'],
                             s = 50)
#         for k,r in enumerate(users_num_ratings):
#             if r > ANNOTATE_THRESH_NUM:
#                 ax.annotate('User {0} ({1} ratings)'.format(k,r),
#                             (df_type['df'].loc[k, 'c1'], df_type['df'].loc[k, 'c2']))

[...]

We conclude that PCA places users with a _representative enough_ number of ratings towards the outside, away from the cluster of users with few ratings, with a tendency towards different directions for different users.

t-SNE, on the other hand, maximizes the difference between users with many ratings, as it places them outside a cluster of similarly distributed few-rating users. This intuitively makes sense, as t-SNE aims to reproduce the neighbourhood structure of the high-dimensional space as closely as possible in the low-dimensional space (frequently pictured as a system of _springs_). This positions the samples that differ in only a single dimension mostly uniformly on the inside and then stochastically distributes the more complexly positioned points on the outside.
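The visual impression that users with many ratings end up on the outside could be checked numerically. The sketch below is one possible way, under stated assumptions: `radial_distance` is a hypothetical helper, and the embedding and rating counts are made-up toy values standing in for a 2D result and `users_num_ratings`. It correlates each user's distance from the (median) center of the embedding with the number of ratings.

```python
import numpy as np

def radial_distance(embedding):
    """Euclidean distance of each 2D point from the coordinate-wise median."""
    center = np.median(embedding, axis=0)
    return np.linalg.norm(embedding - center, axis=1)

# Toy 2D embedding (assumption): three central points and two outliers.
embedding = np.array([[0.0, 0.0],
                      [0.1, 0.0],
                      [0.0, -0.1],
                      [3.0, 4.0],
                      [-5.0, 0.0]])
num_ratings = np.array([1, 2, 1, 12, 15])  # hypothetical rating counts

radii = radial_distance(embedding)
corr = np.corrcoef(num_ratings, radii)[0, 1]
print("radii:", radii.round(2))
print("correlation with number of ratings:", round(corr, 3))
```

A clearly positive correlation on the real embeddings would support the conclusion above; for t-SNE the absolute distances should be read with care, since only the neighbourhood structure is meaningful.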