<a href="https://colab.research.google.com/github/rajivsam/arangomlFeatureStore/blob/master/examples/feature_store_producer_DS.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Overview
This notebook shows how an application or model that produces entity embeddings can store them in the arangomlFeatureStore. Downstream applications, such as recommender systems, can then use the stored embeddings, as can analysts extracting insights from the data. Companion notebooks illustrate those uses. Here, a matrix factorization model produces embeddings for the user and item entities of the ml-100k dataset. Each section below is labeled with the task it performs.



## Clone the repository to get the data


In [None]:
!git clone -b master --single-branch https://github.com/rajivsam/arangomlFeatureStore.git
!rsync -av  interactive_tutorials/notebooks/data  ./ --

## Install required packages

In [None]:
!pip install -i https://test.pypi.org/simple/ arangomlFeatureStore
!pip install  pyArango python-arango PyYAML==5.2 numpy scikit-surprise

## Create a Dataset entity for the Recommender package (Surprise)

In [None]:
import os
from surprise import BaselineOnly
from surprise import Dataset
from surprise import Reader
from surprise.model_selection import cross_validate

# path to dataset file
file_path = os.path.expanduser('/content/arangomlFeatureStore/data/ml-100k/u.data')

# As we're loading a custom dataset, we need to define a reader. In the
# movielens-100k dataset, each line has the following format:
# 'user item rating timestamp', separated by '\t' characters.
reader = Reader(line_format='user item rating timestamp', sep='\t')

data = Dataset.load_from_file(file_path, reader=reader)

# We can now use this dataset as we please, e.g. calling cross_validate
cross_validate(BaselineOnly(), data, verbose=True)
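
To make the `line_format` and `sep` arguments concrete, here is a minimal sketch of how one tab-separated line in the 'user item rating timestamp' layout breaks into fields. The `Rating` type and `parse_line` helper are hypothetical names used only for illustration; this is not Surprise's own parser.

```python
from typing import NamedTuple


class Rating(NamedTuple):
    """One parsed rating record (hypothetical type for this sketch)."""
    user: str
    item: str
    rating: float
    timestamp: str


def parse_line(line: str, sep: str = "\t") -> Rating:
    """Split one 'user item rating timestamp' line into typed fields."""
    user, item, rating, timestamp = line.rstrip("\n").split(sep)
    # Ids stay as strings; only the rating is converted to a number.
    return Rating(user, item, float(rating), timestamp)


# A sample line in the same layout as the dataset file.
sample = "196\t242\t3\t881250949\n"
parsed = parse_line(sample)
print(parsed)
```

Keeping the ids as strings mirrors how the raw user and item ids are concatenated into document keys later in this notebook.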


## Add the arangomlFeatureStore to the Colab module search path

In [None]:
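import sys
import arangomlFeatureStore as p

# p.__path__ is a list of directories, while sys.path entries must be strings,
# so add the package directory itself rather than the list.
pkg_dir = p.__path__[0]
print(f"Feature store at {pkg_dir}")
if pkg_dir not in sys.path:
  sys.path.insert(0, pkg_dir)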
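
A package's `__path__` attribute is a list-like collection of directories, whereas each `sys.path` entry is expected to be a single string. The sketch below demonstrates the pattern with the standard-library `json` package standing in for arangomlFeatureStore (an assumption made so the example runs without the feature store installed).

```python
import json  # stand-in package from the standard library
import sys

# __path__ is list-like, so take its first directory as a plain string.
package_dirs = list(json.__path__)
first_dir = package_dirs[0]

# Add the directory only once, so re-running the cell does not grow sys.path.
if first_dir not in sys.path:
    sys.path.insert(0, first_dir)

print(f"Package directory: {first_dir}")
```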

In [None]:
!chmod -R 777 /usr/local/lib/python3.7/dist-packages/arangomlFeatureStore


## Create the FeatureStore on Oasis

In [None]:
from arangomlFeatureStore.feature_store_admin import FeatureStoreAdmin
from arango.database import StandardDatabase

In [None]:
fa = FeatureStoreAdmin()

## Develop an NMF Recommender model with 5 factors

In [None]:
from surprise import NMF
from surprise import Dataset
from surprise.model_selection import cross_validate


# Use the NMF algorithm.
algo = NMF(n_factors=5)

# Run 5-fold cross-validation and print results.
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)
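
With its default (unbiased) settings, Surprise's NMF estimates a rating as the dot product of a non-negative user factor vector and a non-negative item factor vector. The sketch below shows that computation for 5 factors in plain Python; the factor values are made up for illustration and `predict_rating` is a hypothetical helper, not part of Surprise.

```python
def predict_rating(user_factors, item_factors):
    """Estimate a rating as the dot product of two equal-length factor vectors."""
    if len(user_factors) != len(item_factors):
        raise ValueError("factor vectors must have the same length")
    return sum(p * q for p, q in zip(user_factors, item_factors))


# Made-up, non-negative 5-factor vectors for one user and one item.
p_u = [0.5, 1.0, 0.0, 2.0, 0.25]
q_i = [1.0, 0.5, 3.0, 1.0, 4.0]

estimate = predict_rating(p_u, q_i)
print(estimate)
```

These per-user and per-item vectors are the embeddings that the later sections write to the feature store.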

## Create User Entity
Users are represented by their id and rating history.

In [None]:
um_ratings = {}
for uid, iid, rating, timestamp in data.raw_ratings:
  # setdefault creates the per-user dict on first sight of a user
  # and still records that first rating.
  um_ratings.setdefault(uid, {})[iid] = rating
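
The grouping of ratings by user can be sketched on toy data as follows. The tuples imitate the `(uid, iid, rating, timestamp)` shape of the raw ratings; the ids and values are invented. Every rating, including a user's first one, should end up in that user's dict.

```python
from collections import defaultdict

# Invented records in (uid, iid, rating, timestamp) order.
raw_ratings = [
    ("u1", "i1", 4.0, "t1"),
    ("u1", "i2", 3.0, "t2"),
    ("u2", "i1", 5.0, "t3"),
]

# defaultdict(dict) creates the inner dict the first time a user appears.
ratings_by_user = defaultdict(dict)
for uid, iid, rating, _timestamp in raw_ratings:
    ratings_by_user[uid][iid] = rating

# Convert to a plain dict so it serializes and compares like ordinary data.
ratings_by_user = dict(ratings_by_user)
print(ratings_by_user)
```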

## Trained Model has embeddings for User and Item

In [None]:
from surprise.model_selection import train_test_split
#data = Dataset.load_builtin('ml-100k')

# sample random trainset and testset
# test set is made of 25% of the ratings.
trainset, testset = train_test_split(data, test_size=.25)

# Use the NMF algorithm with 5 factors, as in the cross-validation above.
algo = NMF(n_factors=5)

# Train the algorithm on the trainset, and predict ratings for the testset
algo.fit(trainset)
predictions = algo.test(testset)
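
Each element of `predictions` carries the true rating (`r_ui`) and the estimate (`est`), so an error metric such as RMSE follows directly. Surprise provides `accuracy.rmse(predictions)` for this; the sketch below spells out the arithmetic on a stand-in `Prediction` namedtuple with invented values, so it runs without Surprise installed.

```python
import math
from collections import namedtuple

# Stand-in with the same field names as Surprise's prediction records.
Prediction = namedtuple("Prediction", ["uid", "iid", "r_ui", "est", "details"])


def rmse(predictions):
    """Root mean squared error between true ratings and estimates."""
    squared_errors = [(p.r_ui - p.est) ** 2 for p in predictions]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Invented predictions: two are off by one rating point, one is exact.
toy_predictions = [
    Prediction("u1", "i1", 4.0, 3.0, {}),
    Prediction("u2", "i1", 2.0, 3.0, {}),
    Prediction("u1", "i2", 5.0, 5.0, {}),
]

toy_rmse = rmse(toy_predictions)
print(f"RMSE: {toy_rmse:.4f}")
```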

## The feature store interface has the functionality to write embeddings to the feature store

In [None]:
db = fa.db
fs = fa.get_feature_store()

## Write user, item, user embeddings and item embeddings to the feature store
__NOTE: The user and item embeddings for this data are tagged with the labels NMF-user-embeddings and NMF-item-embeddings, respectively. Consumer applications use these tags to retrieve the embeddings.__

In [None]:
# ENTITY_COLL = cfg['arangodb']['entity_col']
import json

for id in trainset.all_users():
  ruid = trainset.to_raw_uid(id)
  ratings_for_ruid = um_ratings[ruid] 
  user_data = {'_key': 'user-' + str(ruid), 'ratings': ratings_for_ruid}
  user_info = fs.add_entity(user_data)
  user_embedding = json.dumps(algo.pu[id].tolist())
  value_data = {'_key': 'user-' + str(ruid), 'embedding': user_embedding}
  emb_info = fs.add_value(value_data)
  edoc = {'_from': user_info['_key'],'_to': emb_info['_key'], 'tag': 'NMF-user-embeddings'}
  edge_info = fs.link_entity_feature_value(edoc)
  #print(f"uid: {ruid}, embedding: {algo.pu[id]}")

In [None]:
for id in trainset.all_items():
  riid = trainset.to_raw_iid(id) 
  item_data = {'_key': 'item-'+str(riid), 'type': 'item' }
  item_info = fs.add_entity(item_data)
  item_embedding = json.dumps(algo.qi[id].tolist())
  value_data = {'_key': 'item-'+str(riid), 'embedding': item_embedding}
  emb_info = fs.add_value(value_data)
  edoc = {'_from': item_info['_key'],'_to': emb_info['_key'], 'tag': 'NMF-item-embeddings'}
  edge_info = fs.link_entity_feature_value(edoc)
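
The loops above write three documents per entity: the entity itself, a value holding the JSON-encoded embedding, and an edge that links the two and carries the tag. The sketch below builds those three shapes in memory so the layout is easy to inspect. `build_documents` is a hypothetical helper, the embedding values are invented, and the edge here reuses the input keys, which assumes the feature store returns the same `_key` it was given.

```python
import json


def build_documents(prefix, raw_id, embedding, tag):
    """Return (entity, value, edge) dicts shaped like the ones written above."""
    key = f"{prefix}-{raw_id}"
    entity_doc = {"_key": key}
    # The embedding is stored as a JSON string, not as a native list.
    value_doc = {"_key": key, "embedding": json.dumps(list(embedding))}
    edge_doc = {"_from": entity_doc["_key"], "_to": value_doc["_key"], "tag": tag}
    return entity_doc, value_doc, edge_doc


entity, value, edge = build_documents(
    "item", "242", [0.1, 0.2, 0.3, 0.4, 0.5], "NMF-item-embeddings"
)

# A consumer would decode the stored string to get the vector back.
recovered = json.loads(value["embedding"])
print(edge)
print(recovered)
```

Because the embedding is stored as a JSON string, a consumer needs `json.loads` to recover the numeric vector before using it.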

In [None]:
fa2 = FeatureStoreAdmin(conn_config=fa.cfg['arangodb'])

In [None]:
fa.db_name == fa2.db_name

## Connection information for the feature store can be obtained as shown below
Note: Consumer applications use the connection information returned by the cell below to connect to the feature store that holds the embeddings.

In [None]:
fa.cfg['arangodb']
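
One way to hand this connection information to a consumer notebook is to save it as JSON and reload it there. The sketch below is only an illustration: the dictionary keys and values are placeholders rather than the actual contents of `fa.cfg['arangodb']`, and `save_config` and `load_config` are hypothetical helpers. Real connection details may include credentials, so such a file should be handled accordingly.

```python
import json
import os
import tempfile

# Placeholder standing in for fa.cfg['arangodb']; the real keys may differ.
conn_config = {"hostname": "example-host", "dbName": "example-db"}


def save_config(config, path):
    """Write the connection dictionary to a JSON file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(config, handle, indent=2)


def load_config(path):
    """Read the connection dictionary back from a JSON file."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = os.path.join(tmp_dir, "feature_store_conn.json")
    save_config(conn_config, config_path)
    reloaded = load_config(config_path)

print(reloaded)
```

A consumer could then pass the reloaded dictionary as `conn_config` when constructing `FeatureStoreAdmin`, as done with `fa2` earlier in this notebook.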