<a href="https://colab.research.google.com/github/rajivsam/arangomlFeatureStore/blob/master/examples/feature_store_producer_DS.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Overview
This notebook shows how an application or model that produces embeddings for entities can store them in the arangomlFeatureStore. Downstream applications, such as recommender systems, can then consume these embeddings, and analysts can use them to extract insights from the data. Separate notebooks illustrate those consumer applications. Here, a matrix factorization model produces embeddings for the user and item entities of the ml-100k dataset, and each section is labeled with the task it performs.



## Clone the repository to get the data


In [1]:
!git clone -b master --single-branch https://github.com/rajivsam/arangomlFeatureStore.git
!rsync -av arangomlFeatureStore/data ./

Cloning into 'arangomlFeatureStore'...
remote: Enumerating objects: 178, done.
remote: Counting objects: 100% (178/178), done.
remote: Compressing objects: 100% (121/121), done.
remote: Total 178 (delta 75), reused 147 (delta 44), pack-reused 0
Receiving objects: 100% (178/178), 7.59 MiB | 5.64 MiB/s, done.
Resolving deltas: 100% (75/75), done.


## Install required packages

In [2]:
!pip install -i https://test.pypi.org/simple/ arangomlFeatureStore
!pip install  pyArango python-arango PyYAML==5.2 numpy scikit-surprise

Looking in indexes: https://test.pypi.org/simple/
Collecting arangomlFeatureStore
  Downloading https://test-files.pythonhosted.org/packages/5e/93/2794112f2222124281a3fee90f41c64432df6a7eb8dedd2952c01d4a6992/arangomlFeatureStore-0.0.7.8-py3-none-any.whl (9.1 kB)
Installing collected packages: arangomlFeatureStore
Successfully installed arangomlFeatureStore-0.0.7.8
Collecting pyArango
  Downloading pyArango-2.0.1.tar.gz (50 kB)
Collecting python-arango
  Downloading python_arango-7.3.1-py3-none-any.whl (96 kB)
Collecting PyYAML==5.2
  Downloading PyYAML-5.2.tar.gz (265 kB)
Collecting scikit-surprise
  Downloading scikit-surprise-1.1.1.tar.gz (11.8 MB)
Collecting datetime
  Downloading DateTime-4.4-py2.py3-none-any.whl (51 kB)

## Create a Dataset entity for the Recommender package (Surprise)

In [3]:
import os
from surprise import BaselineOnly
from surprise import Dataset
from surprise import Reader
from surprise.model_selection import cross_validate

# path to dataset file
file_path = os.path.expanduser('/content/arangomlFeatureStore/data/ml-100k/u.data')

# As we're loading a custom dataset, we need to define a reader. In the
# movielens-100k dataset, each line has the following format:
# 'user item rating timestamp', separated by '\t' characters.
reader = Reader(line_format='user item rating timestamp', sep='\t')

data = Dataset.load_from_file(file_path, reader=reader)

# We can now use this dataset as we please, e.g. calling cross_validate
cross_validate(BaselineOnly(), data, verbose=True)


Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Evaluating RMSE, MAE of algorithm BaselineOnly on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9428  0.9406  0.9437  0.9410  0.9531  0.9442  0.0046  
MAE (testset)     0.7480  0.7472  0.7493  0.7459  0.7531  0.7487  0.0025  
Fit time          0.41    0.45    0.45    0.46    0.45    0.45    0.02    
Test time         0.10    0.18    0.10    0.19    0.11    0.14    0.04    


{'fit_time': (0.40720152854919434,
  0.4526515007019043,
  0.45453453063964844,
  0.4598708152770996,
  0.45254087448120117),
 'test_mae': array([0.74798566, 0.74716782, 0.74932606, 0.74586924, 0.7530946 ]),
 'test_rmse': array([0.94276898, 0.9405954 , 0.94365128, 0.94104506, 0.95313324]),
 'test_time': (0.10293388366699219,
  0.17961359024047852,
  0.10064840316772461,
  0.19076323509216309,
  0.10516953468322754)}
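
To make the `line_format` passed to the `Reader` concrete, here is a minimal sketch that parses lines of the same shape using only the standard library. The two sample lines are illustrative values rather than lines read from the file, and `Rating` and `parse_rating_line` are hypothetical names, not part of Surprise.

```python
from typing import NamedTuple


class Rating(NamedTuple):
    user: str
    item: str
    rating: float
    timestamp: str


def parse_rating_line(line: str, sep: str = "\t") -> Rating:
    """Split one 'user item rating timestamp' line into typed fields."""
    user, item, rating, timestamp = line.rstrip("\n").split(sep)
    # Ids and timestamp are kept as strings here; only the rating is numeric.
    return Rating(user, item, float(rating), timestamp)


# Illustrative lines in the tab-separated layout described above.
sample_lines = ["196\t242\t3\t881250949\n", "186\t302\t3\t891717742\n"]
parsed = [parse_rating_line(line) for line in sample_lines]
print(parsed[0])
```

Keeping the ids as strings in this sketch matches how the later cells build document keys from raw ids with `str(...)`.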

## Add the arangomlFeatureStore to the Colab module search path

In [4]:
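import sys
import arangomlFeatureStore as p

# p.__path__ is a list of directories; sys.path entries must be strings,
# so add the package directory itself rather than the whole list.
print(f"Feature store at {p.__path__}")
sys.path.insert(0, p.__path__[0])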

Feature store at ['/usr/local/lib/python3.7/dist-packages/arangomlFeatureStore']


In [5]:
!chmod -R 777 /usr/local/lib/python3.7/dist-packages/arangomlFeatureStore


## Create the FeatureStore on Oasis

In [6]:
from arangomlFeatureStore.feature_store_admin import FeatureStoreAdmin
from arango.database import StandardDatabase

In [7]:
fa = FeatureStoreAdmin()

## Develop an NMF Recommender model with 5 factors

In [8]:
from surprise import NMF
from surprise import Dataset
from surprise.model_selection import cross_validate


# Use the NMF algorithm.
algo = NMF(n_factors=5)

# Run 5-fold cross-validation and print results.
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm NMF on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.0647  1.0659  1.0554  1.0585  1.0640  1.0617  0.0041  
MAE (testset)     0.8786  0.8762  0.8660  0.8700  0.8752  0.8732  0.0046  
Fit time          4.25    4.28    6.48    4.15    4.10    4.65    0.91    
Test time         0.14    0.22    0.41    0.13    0.21    0.22    0.10    


{'fit_time': (4.251261472702026,
  4.276087045669556,
  6.475595951080322,
  4.153860569000244,
  4.100719690322876),
 'test_mae': array([0.8786172 , 0.87620586, 0.86600728, 0.86995216, 0.87515295]),
 'test_rmse': array([1.06471003, 1.06593107, 1.05535377, 1.05850428, 1.06403628]),
 'test_time': (0.1383523941040039,
  0.21628379821777344,
  0.4054992198944092,
  0.1268014907836914,
  0.2091057300567627)}
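
The two cross-validation runs can be compared directly from the per-fold scores they return. The sketch below copies the `test_rmse` values printed in this notebook and summarizes them with the standard library; `summarize` is a hypothetical helper, and the use of the population standard deviation is an assumption chosen because it reproduces the `Std` column shown above.

```python
from statistics import mean, pstdev

# Per-fold test RMSE values copied from the two cross_validate outputs in this notebook.
baseline_rmse = [0.94276898, 0.9405954, 0.94365128, 0.94104506, 0.95313324]
nmf_rmse = [1.06471003, 1.06593107, 1.05535377, 1.05850428, 1.06403628]


def summarize(scores):
    """Return (mean, population standard deviation) of the per-fold scores."""
    return mean(scores), pstdev(scores)


baseline_mean, baseline_std = summarize(baseline_rmse)
nmf_mean, nmf_std = summarize(nmf_rmse)

print(f"BaselineOnly RMSE: {baseline_mean:.4f} +/- {baseline_std:.4f}")
print(f"NMF (5 factors) RMSE: {nmf_mean:.4f} +/- {nmf_std:.4f}")
print(f"Difference (NMF - BaselineOnly): {nmf_mean - baseline_mean:.4f}")
```

In this run the 5-factor NMF model has a higher RMSE than `BaselineOnly`, which is worth keeping in mind when judging the quality of the embeddings it produces.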

## Create User Entity
Users are represented by their id and rating history.

In [9]:
um_ratings = {}
for uid, iid, rating, timestamp in data.raw_ratings:
  # setdefault creates the per-user dict on first sight of a user,
  # so that user's first rating is stored instead of being dropped.
  um_ratings.setdefault(uid, {})[iid] = rating
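
As an illustration of the structure this cell is meant to build, the sketch below groups ratings by user with `collections.defaultdict`. The tuples are hypothetical values shaped like the `(uid, iid, rating, timestamp)` entries the loop unpacks; they are not taken from the dataset.

```python
from collections import defaultdict

# Hypothetical (uid, iid, rating, timestamp) tuples, shaped like the entries the loop unpacks.
raw_ratings = [
    ("196", "242", 3.0, "881250949"),
    ("196", "393", 4.0, "881251863"),
    ("186", "302", 3.0, "891717742"),
]

# defaultdict creates the inner dict the first time a user id is seen,
# so a user's first rating is recorded like any other.
history = defaultdict(dict)
for uid, iid, rating, _timestamp in raw_ratings:
    history[uid][iid] = rating

print(dict(history))
```

Each user id maps to a dictionary of item id to rating, which is the value later stored in the `ratings` field of the user entity.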

## Trained Model has embeddings for User and Item

In [10]:
from surprise.model_selection import train_test_split
#data = Dataset.load_builtin('ml-100k')

# sample random trainset and testset
# test set is made of 25% of the ratings.
trainset, testset = train_test_split(data, test_size=.25)

# We'll use the NMF algorithm with 5 latent factors.
algo = NMF(n_factors=5)

# Train the algorithm on the trainset, and predict ratings for the testset
algo.fit(trainset)
predictions = algo.test(testset)
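
After fitting, the model's user factors are available as `algo.pu` and its item factors as `algo.qi`; these are the embeddings written to the feature store below. With NMF's default unbiased setting, the estimated rating for a known user and item is the dot product of the two vectors, before any clipping to the rating scale. The sketch below shows that arithmetic on made-up 5-dimensional vectors, not on values from this model, and `predicted_rating` is a hypothetical helper.

```python
def predicted_rating(user_factors, item_factors):
    """Dot product of a user embedding and an item embedding."""
    if len(user_factors) != len(item_factors):
        raise ValueError("embeddings must have the same number of factors")
    return sum(p * q for p, q in zip(user_factors, item_factors))


# Made-up non-negative factors standing in for algo.pu[u] and algo.qi[i].
p_u = [0.5, 1.0, 0.0, 2.0, 1.5]
q_i = [1.0, 0.5, 3.0, 0.5, 1.0]

estimate = predicted_rating(p_u, q_i)
print(estimate)  # prints 3.5
```

This is why storing both sets of embeddings is sufficient for a consumer to score user and item pairs without access to the trained model object.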

## The feature store interface has the functionality to write embeddings to the feature store

In [11]:
db = fa.db
fs = fa.get_feature_store()

## Write user, item, user embeddings and item embeddings to the feature store
__NOTE: The user and item embeddings for this dataset are tagged with the labels NMF-user-embeddings and NMF-item-embeddings, respectively. Consumer applications use these tags to retrieve the embeddings.__

In [12]:
import json

# ENTITY_COLL = cfg['arangodb']['entity_col']
user_list = []            # entity documents, one per user
user_emb_list = []        # feature-value documents holding the embeddings
user_emb_assoc_list = []  # edge documents linking each user to its embedding
for inner_uid in trainset.all_users():
  ruid = trainset.to_raw_uid(inner_uid)
  ratings_for_ruid = um_ratings[ruid]
  user_data = {'_key': 'user-' + str(ruid), 'ratings': ratings_for_ruid}
  user_list.append(user_data)
  #user_info = fs.add_entity(user_data)
  # algo.pu is indexed by the trainset's inner user id
  user_embedding = json.dumps(algo.pu[inner_uid].tolist())
  value_data = {'_key': 'user-' + str(ruid), 'embedding': user_embedding}
  user_emb_list.append(value_data)
  #emb_info = fs.add_value(value_data)
  edoc = {'_from': user_data['_key'], '_to': value_data['_key'], 'tag': 'NMF-user-embeddings'}
  user_emb_assoc_list.append(edoc)
  #edge_info = fs.link_entity_feature_value(edoc)
  #print(f"uid: {inner_uid}, embedding: {algo.pu[inner_uid]}")

In [13]:
fs.add_entity_bulk(user_list)
fs.add_value_bulk(user_emb_list)
fs.link_entity_feature_value_bulk(user_emb_assoc_list)
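
The three bulk calls above write an entity document, a feature-value document, and an edge document that links the two and carries the tag. The sketch below builds one such triple in plain Python, with no feature store connection, to show the shapes involved. `build_documents` is a hypothetical helper; the key prefix, field names, and tag mirror the cell above, and decoding with `json.loads` is an assumption about how a consumer would read an embedding that was stored as a JSON string.

```python
import json


def build_documents(raw_id, embedding, prefix, tag):
    """Build the entity, value, and edge documents for one embedding."""
    key = f"{prefix}-{raw_id}"
    entity = {"_key": key}
    # The embedding is serialized to a JSON string, as in the notebook.
    value = {"_key": key, "embedding": json.dumps(list(embedding))}
    # The edge links the entity to its value and carries the retrieval tag.
    edge = {"_from": entity["_key"], "_to": value["_key"], "tag": tag}
    return entity, value, edge


entity, value, edge = build_documents(
    "196", [0.1, 0.2, 0.3], "user", "NMF-user-embeddings"
)

# A consumer would decode the string back into a list of floats.
decoded = json.loads(value["embedding"])
print(edge["tag"], decoded)
```

Because the entity and its value share the same `_key`, the tag on the edge is what distinguishes one kind of embedding from another.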

In [14]:
item_list = []            # entity documents, one per item
item_emb_list = []        # feature-value documents holding the embeddings
item_emb_assoc_list = []  # edge documents linking each item to its embedding
for inner_iid in trainset.all_items():
  riid = trainset.to_raw_iid(inner_iid)
  item_data = {'_key': 'item-' + str(riid), 'type': 'item'}
  item_list.append(item_data)
  #item_info = fs.add_entity(item_data)
  # algo.qi is indexed by the trainset's inner item id
  item_embedding = json.dumps(algo.qi[inner_iid].tolist())
  value_data = {'_key': 'item-' + str(riid), 'embedding': item_embedding}
  item_emb_list.append(value_data)
  #emb_info = fs.add_value(value_data)
  edoc = {'_from': item_data['_key'], '_to': value_data['_key'], 'tag': 'NMF-item-embeddings'}
  #edge_info = fs.link_entity_feature_value(edoc)
  item_emb_assoc_list.append(edoc)

In [15]:
fs.add_entity_bulk(item_list)
fs.add_value_bulk(item_emb_list)
fs.link_entity_feature_value_bulk(item_emb_assoc_list)

In [16]:
fa2 = FeatureStoreAdmin(conn_config=fa.cfg['arangodb'])

In [17]:
fa.db_name == fa2.db_name

True

## Connection information for the feature store can be obtained as shown below
Note: Consumer applications use the connection information returned by the cell below to connect to the feature store that holds the embeddings.

In [18]:
fa.cfg['arangodb']

{'dbName': 'TUThyeakkvs3t80rlwqg9tnipi',
 'edge_col': 'entity-feature-value',
 'entity_col': 'entity',
 'feature_value_col': 'feature-value',
 'graph_name': 'feature_store_graph',
 'hostname': 'tutorials.arangodb.cloud',
 'password': 'TUTiupac2s53i2kmnfuo5foq',
 'port': 8529,
 'protocol': 'https',
 'replication_factor': 3,
 'username': 'TUTfprz11d3r4omnmbrvidia'}
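
Within this notebook, a second admin object is created from the same dictionary with `FeatureStoreAdmin(conn_config=fa.cfg['arangodb'])`. As a complementary sketch, the code below assembles a host URL from a dictionary of this shape and shows how it could be handed to python-arango's `ArangoClient`. `build_hosts_url` and `connect` are hypothetical helpers, the credential values are placeholders, and `connect` is defined but not called because it needs network access and a live database.

```python
def build_hosts_url(cfg):
    """Assemble 'protocol://hostname:port' from a connection config dict."""
    return f"{cfg['protocol']}://{cfg['hostname']}:{cfg['port']}"


def connect(cfg):
    """Open the database with python-arango (imported lazily; needs network access)."""
    from arango import ArangoClient

    client = ArangoClient(hosts=build_hosts_url(cfg))
    return client.db(cfg["dbName"], username=cfg["username"], password=cfg["password"])


# Placeholder values; substitute the dictionary printed above.
example_cfg = {
    "protocol": "https",
    "hostname": "tutorials.arangodb.cloud",
    "port": 8529,
    "dbName": "<dbName>",
    "username": "<username>",
    "password": "<password>",
}
print(build_hosts_url(example_cfg))
```

Passing the whole dictionary between producer and consumer keeps the collection and graph names (`entity_col`, `feature_value_col`, `edge_col`, `graph_name`) together with the credentials.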