<a href="https://colab.research.google.com/github/rajivsam/arangomlFeatureStore/blob/master/examples/feature_store_consumer_data_analyst.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## Overview
This notebook shows how a data analyst can use the embeddings associated with a movie ratings dataset to understand user preferences and run ad-hoc analysis on users. The same kind of analysis can be applied to the movie (item) embeddings. The notebook connects to the feature store that contains the user embeddings. To obtain the connection information, execute the last cell of the notebook "feature_store_producer_DS.ipynb". The connection information shown here is representative and provided for illustration only; replace it with the connection information that is valid for your session.

## Install the prerequisite packages

In [None]:
!pip install -i https://test.pypi.org/simple/ arangomlFeatureStore
!pip install  pyArango python-arango PyYAML==5.2 scikit-surprise hdbscan

## Provide the connection information to the Feature Store
__Note: THIS IS REPRESENTATIVE AND PROVIDED FOR ILLUSTRATION. REPLACE WITH INFORMATION VALID FOR YOUR SESSION__ 

In [None]:
connection_info_producer_fs = {'dbName': 'TUTpoaywuvsyv9bogx0e3vdnr',
 'edge_col': 'entity-feature-value',
 'entity_col': 'entity',
 'feature_value_col': 'feature-value',
 'graph_name': 'feature_store_graph',
 'hostname': 'tutorials.arangodb.cloud',
 'password': 'TUTc9fby27ixqcidev68cmxbc',
 'port': 8529,
 'protocol': 'https',
 'replication_factor': 3,
 'username': 'TUTu6kipkgspt01shbmdi3gg9'}

## Set up the arangomlFeatureStore for use in Colab

In [None]:
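import arangomlFeatureStore as p
import numpy as np
import sys
import json
import pandas as pd

# p.__path__ is a list of directories, and sys.path entries must be strings,
# so take the first directory instead of adding the list itself.
package_dir = list(p.__path__)[0]
print(f"Feature store at {package_dir}")
sys.path.insert(0, package_dir)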


In [None]:
# Resolve the install directory from the imported package instead of hard-coding a Python version.
!chmod -R 777 {p.__path__[0]}

In [None]:
from arangomlFeatureStore.feature_store_admin import FeatureStoreAdmin
from arango.database import StandardDatabase

## Connect to the FeatureStore specified by the connection information

In [None]:
fa = FeatureStoreAdmin(conn_config = connection_info_producer_fs)

In [None]:
fs = fa.get_feature_store()

## Retrieve the Item embeddings and User embeddings associated with the tags _NMF-item-embeddings_ and _NMF-user-embeddings_ respectively

In [None]:
item_embs = fs.get_featureset_with_tag('tag', 'NMF-item-embeddings')

In [None]:
user_embs = fs.get_featureset_with_tag('tag', 'NMF-user-embeddings')

## Convert the JSON-encoded embeddings to numeric Python lists (unmarshalling from JSON)

In [None]:
emblist = [json.loads(user['embedding']) for user in user_embs]

In [None]:
uidlist = [user['_key'] for user in user_embs]
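
The conversion above can be sketched on its own, assuming each document stores its embedding as a JSON-encoded string under the `embedding` key and its identifier under `_key`. The sample documents below are made up for illustration and are not taken from the feature store.

```python
import json

# Hypothetical documents shaped like the ones the feature store is assumed to return.
sample_docs = [
    {"_key": "user-1", "embedding": "[0.1, 0.2, 0.3, 0.4, 0.5]"},
    {"_key": "user-2", "embedding": "[0.5, 0.4, 0.3, 0.2, 0.1]"},
]

# json.loads turns each JSON string into a list of Python floats.
sample_embs = [json.loads(doc["embedding"]) for doc in sample_docs]
# Keep the keys in the same order so row i of the embeddings belongs to sample_keys[i].
sample_keys = [doc["_key"] for doc in sample_docs]

print(sample_keys)
print(sample_embs)
```

Keeping the key list aligned with the embedding list is what later lets a row index be translated back into a user identifier.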

## Set up a Pandas DataFrame with User Embeddings

In [None]:
df_user_embs = pd.DataFrame(emblist, columns = ["dim_" + str(i+1) for i in range(5)])

## Cluster user embeddings with hdbscan

In [None]:
import hdbscan
clusterer = hdbscan.HDBSCAN(min_samples=5)

In [None]:
clusterer.fit(df_user_embs)

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt


In [None]:
from sklearn.manifold import TSNE

In [None]:
plot_kwds = {'alpha' : 0.25, 's' : 10, 'linewidths':0}
fig = plt.figure(figsize=(11.7,8.27))
projection = TSNE().fit_transform(df_user_embs)
plt.scatter(*projection.T, **plot_kwds)

In [None]:
color_palette = sns.color_palette('tab10')
fig = plt.figure(figsize=(11.7,8.27))
cluster_colors = [color_palette[x] if x >= 0
                  else (0.5, 0.5, 0.5)
                  for x in clusterer.labels_]
cluster_member_colors = [sns.desaturate(x, p) for x, p in
                         zip(cluster_colors, clusterer.probabilities_)]
plt.scatter(*projection.T, s=20, linewidth=0, c=cluster_member_colors, alpha=0.25)

In [None]:
np.unique(clusterer.labels_)

In [None]:
type(clusterer.labels_)

## Clustering Observations
1. User preferences group into two clusters.
2. Some users have preferences that do not align with either cluster. HDBSCAN assigns these users the label -1, which marks them as noise.
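
To put numbers on these observations, the cluster sizes and the share of noise points can be counted from the label array. The sketch below uses a small made-up label array in place of `clusterer.labels_`, so the counts it prints are illustrative only.

```python
import numpy as np

# Hypothetical labels standing in for clusterer.labels_; -1 marks noise points.
labels = np.array([0, 0, 1, -1, 1, 0, -1, 1, 1])

# Count how many points fall under each label.
values, counts = np.unique(labels, return_counts=True)
cluster_sizes = {int(v): int(c) for v, c in zip(values, counts)}

# Fraction of points that were not assigned to any cluster.
noise_fraction = cluster_sizes.get(-1, 0) / labels.size

print(cluster_sizes)
print(f"noise fraction: {noise_fraction:.2f}")
```

In the notebook, replacing `labels` with `clusterer.labels_` would report the sizes of the actual user clusters.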

## Running Ad-hoc Analysis
We can use a similarity search library such as FAISS to run ad-hoc similarity searches over the user embeddings. The following cells install FAISS and then use it to find the users most similar to a given user.
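
As a rough illustration of what an exact L2 similarity search computes, here is a brute-force NumPy sketch. The function name `brute_force_l2_search` and the random toy embeddings are assumptions made for this example; they are not part of FAISS or of the feature store.

```python
import numpy as np

def brute_force_l2_search(vectors, queries, k):
    """Return (squared L2 distances, row indices) of the k nearest rows for each query."""
    # Pairwise differences between every query and every stored vector.
    diffs = queries[:, None, :] - vectors[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)
    # Indices of the k smallest distances per query, in ascending order.
    nearest = np.argsort(sq_dists, axis=1)[:, :k]
    return np.take_along_axis(sq_dists, nearest, axis=1), nearest

# Toy data: 20 random 5-dimensional embeddings (illustrative, not real user data).
rng = np.random.default_rng(0)
toy_embs = rng.random((20, 5)).astype("float32")

# Query with row 5; slicing with 5:6 keeps the query two-dimensional.
dists, idx = brute_force_l2_search(toy_embs, toy_embs[5:6], k=4)
print(idx[0])
print(dists[0])
```

The query row is its own nearest neighbor at distance zero. A flat L2 index performs the same exhaustive comparison, which is practical for a small collection of embeddings like this one.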

## Install FAISS

In [None]:
!apt install libomp-dev
!pip install faiss-cpu

In [None]:
import faiss      

## Set up a FAISS index

In [None]:
EMB_SIZE = 5
index = faiss.IndexFlatL2(EMB_SIZE)  

In [None]:
emb_array = np.stack(emblist)

In [None]:
emb_array = emb_array.astype('float32')

In [None]:
index.add(emb_array)                  # add vectors to the index
print(index.ntotal)


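## Search for users similar to user #6 (row index 5, since arrays are zero-based) using FAISS

In [None]:
k = 4                                # we want to see 4 nearest neighbors
query = emb_array[5:6]               # row 5 is the sixth user; the slice keeps the query 2-D
D, I = index.search(query, k)        # D: squared L2 distances, I: row indices into emb_array
print(I)                             # sanity check: the first neighbor should be row 5 itself
print(D)                             # sanity check: its distance should be 0
print([uidlist[i] for i in I[0]])    # map row indices back to user keys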