# Exploring User Clustering Representations 

The goal of this exploration is not to describe how the provided user rating data is distributed (average number of ratings per user, number of users with more than 10 ratings, etc.) but to gauge how expressive the ratings are for actual user profiling.

To this end, we build a user-food matrix and apply dimensionality reduction to it to see how well the users separate in low-dimensional (2D) space.

## General preparation

In [None]:
# general imports
import pandas as pd
import numpy as np

from IPython.display import display
pd.options.display.max_columns = None

# constants
MEAL_CSV = 'data/meal_manually_cleaned.csv'
RATING_CSV = 'data/rating.csv'

In [None]:
from util.normalizer.food_title_normalizer import FoodNormalizer

df_ratings = pd.read_csv(RATING_CSV)

normalizer = FoodNormalizer('../../'+MEAL_CSV)    # takes relative path from view of normalizer class as arg ...
normalizer.assign_norm_titles()

# normalizer.meal_df.head()
# df_ratings.head()

First we merge the normalizer's (primary) title into the ratings table, essentially a lookup by `m_id`, to see which meals the ratings actually belong to.

In [None]:
df = normalizer.meal_df
df_ratings = df_ratings.assign(
    title_prim=[df.loc[(df['m_id']==m_id),'title_prim'].to_string(index=False) 
                for m_id in df_ratings.loc[:,'m_id']])

df_ratings.head()
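
The same `m_id` lookup can also be expressed as a single vectorized `map`. The following is only a sketch on made-up toy data: the frames `meals` and `ratings` and the helper name `title_by_id` are hypothetical stand-ins for `normalizer.meal_df` and `df_ratings`.

```python
import pandas as pd

# Toy stand-ins for normalizer.meal_df and df_ratings (all values are made up).
meals = pd.DataFrame({'m_id': [1, 2, 3],
                      'title_prim': ['pasta', 'pizza', 'salad']})
ratings = pd.DataFrame({'user': ['a', 'a', 'b'],
                        'm_id': [2, 1, 2],
                        'rating': [4, 3, 5]})

# Build an m_id -> title_prim lookup once, then map every rating row through it.
title_by_id = meals.drop_duplicates('m_id').set_index('m_id')['title_prim']
ratings = ratings.assign(title_prim=ratings['m_id'].map(title_by_id))
```

Note that with this approach an `m_id` missing from the meal table would yield `NaN` rather than a string, so such rows would have to be handled before sorting the titles.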

## Building up Sparse Matrix

Now we build a (pretty _sparse_) user-by-(primary)-meal matrix across all (primary) meal titles, where each user's ratings are entered at the corresponding positions.

Note: A user may have rated meals with the same primary title at different times. These ratings are averaged and entered as a single rating.

In [None]:
def get_rating(ratings, food_title):
    """Searches for the food in a user's rating records `[{'title_prim': ..., 'rating': ...}, ...]`.
    
    This accounts for multiple ratings of meals having the same title_prim by taking the average
    over all respective ratings. The food is expected to occur at least once in `ratings`.
    """
    rating, n = 0, 0
    for r in ratings:
        if r['title_prim'] == food_title:
            rating += r['rating']
            n += 1
    return rating / n

title_prim_header = sorted(list(set(df_ratings.loc[:,'title_prim'])))
users = sorted(set(df_ratings.loc[:,'user']))

user_food_ratings_matrix, user_food_ratings_matrix_meaned = [], []
for usr in users:
    user_foods_rated = df_ratings.loc[(df_ratings['user']==usr),['title_prim','rating']].to_dict('records')
    user_foods = set([food_rated['title_prim'] for food_rated in user_foods_rated])    # subset of title_prim_header
    # variant 1: unrated foods are filled with 0
    food_rating_vector = [0 if food not in user_foods else get_rating(user_foods_rated, food) 
                          for food in title_prim_header]
    user_food_ratings_matrix.append(food_rating_vector)
    # variant 2: unrated foods are filled with 2.5
    food_rating_vector_meaned = [2.5 if food not in user_foods else get_rating(user_foods_rated, food) 
                                 for food in title_prim_header]
    user_food_ratings_matrix_meaned.append(food_rating_vector_meaned)
df_user_food_ratings_matrix = pd.DataFrame(user_food_ratings_matrix, index=users, columns=title_prim_header)
df_user_food_ratings_matrix_meaned = pd.DataFrame(user_food_ratings_matrix_meaned, index=users, columns=title_prim_header)
    
# df_user_food_ratings_matrix

We have now obtained a __#users by #uniquePrimaryMeals__ -matrix.
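
For comparison, a matrix of this shape, including the averaging of repeated ratings, can be sketched with `pivot_table`. This is an illustrative alternative on made-up data, not the construction used in this notebook; the names `ratings`, `matrix`, `zero_filled`, and `mid_filled` are hypothetical.

```python
import pandas as pd

# Made-up ratings: user 'a' rated 'pizza' twice.
ratings = pd.DataFrame({
    'user': ['a', 'a', 'a', 'b'],
    'title_prim': ['pizza', 'pizza', 'salad', 'pasta'],
    'rating': [4, 2, 5, 3],
})

# Duplicate (user, title_prim) pairs are averaged via aggfunc='mean';
# cells without any rating come out as NaN.
matrix = ratings.pivot_table(index='user', columns='title_prim',
                             values='rating', aggfunc='mean')

# The two fill strategies explored in this notebook.
zero_filled = matrix.fillna(0)
mid_filled = matrix.fillna(2.5)
```

Keeping the unrated cells as `NaN` until the last step makes it easy to switch between fill values without rebuilding the matrix.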

## Dim. Red. and Visualization

Next, we apply PCA and t-SNE to the matrix to see some first results. Standardizing the matrix values beforehand is prepared in the cell below but currently commented out.
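
To make the two steps concrete, standardization followed by a 2D PCA projection can be sketched with NumPy alone. This is a hand-rolled illustration using an SVD, not the scikit-learn calls used in this notebook; the helpers `standardize` and `pca_2d` and the toy matrix are hypothetical.

```python
import numpy as np

def standardize(X):
    """Scale every column to zero mean and unit variance (constant columns stay 0)."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return (X - X.mean(axis=0)) / std

def pca_2d(X):
    """Project the rows of X onto their first two principal components."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T

# Made-up 0-filled user-by-food matrix (4 users, 3 foods).
X = np.array([[5, 0, 0],
              [0, 4, 0],
              [0, 0, 3],
              [4, 4, 0]])
Z = standardize(X)
coords = pca_2d(Z)  # one (c1, c2) pair per user
```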

In [None]:
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

ANNOTATE_THRESH_NUM = 10

"""Preprocessing"""

# from sklearn.preprocessing import StandardScaler

# vals = df_user_food_ratings_matrix.values
# x = StandardScaler().fit_transform(vals)

# # pd.DataFrame(x)

X_names = ['User ratings missing filled with 0s', 'User ratings missing filled with 2.5s']
X_all = df_user_food_ratings_matrix.values, df_user_food_ratings_matrix_meaned.values
users_num_ratings = [np.sum([1 if r != 0 else 0 for r in user_ratings]) 
                     for user_ratings in user_food_ratings_matrix]
# print(users_num_ratings)

"""Dimensionality-reduction to 2D."""

dfs = []
for X in X_all:
    # one PCA projection plus t-SNE at several perplexities per input matrix
    pca = PCA(n_components=2)
    dfs_X = [{'title': 'PCA, n=2',
              'df': pd.DataFrame(pca.fit_transform(X), columns=['c1', 'c2'])}]
    for perp in (3, 5, 10):
        tsne = TSNE(n_components=2, perplexity=perp)
        dfs_X.append({'title': 't-SNE, n=2, perp={}'.format(perp),
                      'df': pd.DataFrame(tsne.fit_transform(X), columns=['c1', 'c2'])})
    dfs.append(dfs_X)

"""Visualization."""

fig = plt.figure(figsize = (16,12))

for i, dfs_X in enumerate(dfs):
    print('Row {}: "{}"'.format(i+1, X_names[i]))
    for j, df_type in enumerate(dfs_X):
        ax = plt.subplot(len(dfs), len(dfs_X), i*len(dfs_X)+j+1)
#         ax = plt.subplot(1, 2, 1)
#         ax.set_xlabel('Principal Component 1', fontsize = 15)
#         ax.set_ylabel('Principal Component 2', fontsize = 15)
        ax.set_title(df_type['title'], fontsize = 25)
        ax.grid()
        scatter = ax.scatter(df_type['df'].loc[:, 'c1'], 
                             df_type['df'].loc[:, 'c2'],
                             s = 50)
        for k,r in enumerate(users_num_ratings):
            if r > ANNOTATE_THRESH_NUM:
                ax.annotate('User {0} ({1} ratings)'.format(k,r),
                            (df_type['df'].loc[k, 'c1'], df_type['df'].loc[k, 'c2']))

We conclude that PCA places users with a _representative enough_ number of ratings towards the outside, away from the cluster of few-rating users, with a tendency towards different directions for different users.

t-SNE, on the other hand, emphasizes the differences between the many-rating users, as it places them outside a cluster of similarly distributed few-rating users. This intuitively makes sense, as t-SNE aims to keep points that are close in high-dimensional space close in the low-dimensional embedding (often pictured as a system of _springs_). Samples that differ in only a single dimension therefore end up spread mostly uniformly in the interior, while the more complexly positioned points are distributed stochastically around the outside.
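
The observation about few-rating users can be checked on a small made-up example: in a 0-filled matrix, users who each rated a single (different) food with the same score are all equally far from each other, while a user with many ratings sits farther away from all of them. The matrix and variable names below are hypothetical and only serve as illustration.

```python
import numpy as np

# Made-up 0-filled rating matrix: rows 0-3 rated one (different) food each,
# row 4 rated every food.
X = np.array([
    [4, 0, 0, 0, 0],
    [0, 4, 0, 0, 0],
    [0, 0, 4, 0, 0],
    [0, 0, 0, 4, 0],
    [3, 5, 2, 4, 5],
], dtype=float)

# Pairwise Euclidean distances between all users via broadcasting.
diff = X[:, None, :] - X[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))

few = dist[:4, :4][np.triu_indices(4, k=1)]  # distances among few-rating users
many = dist[4, :4]                           # many-rating user vs. the others
```

Here all entries of `few` are identical, which gives a neighborhood-based method little structure to separate these users, whereas every entry of `many` is larger.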

Now try setting unknown values to `2.5` instead of `0`: