If you haven't already run DataPipelineDemo, do that first. This notebook is a deep dive into how the backend scales as the amount of data grows.

In [25]:
# Standard library imports
import sys
from pathlib import Path

# Third party imports
import pandas as pd

In [26]:
# Add the parent directory to the import path so the `app` package resolves
sys.path.append("..")
from app.projection.projection_datastore_factory import ProjectionDatastoreFactory
from app.recommendation.similarity.similarity_engine_factory import SimilarityEngineFactory

We set a shard size based on the number of entries we want on a single shard. For demonstration, let's pick 25 so that several shards fit in memory within a single program.

In [27]:
max_shard_size = 25

We'll show that the number of shards created grows as we upload larger datasets to a projection datastore.

In [28]:
for i in range(10):
    # Build a dataset of max_shard_size * i one-dimensional vectors keyed by id
    projection = {str(j): [float(j)] for j in range(max_shard_size * i)}
    movie_indices = {str(j): j for j in range(max_shard_size * i)}
    # Upload it to a fresh in-memory datastore and count the resulting shards
    projection_datastore = ProjectionDatastoreFactory(in_memory=True).build()
    projection_datastore.upload(projection, movie_indices)
    print(f"Number of shards created: {len(projection_datastore.get_shards())}")

Number of shards created: 0
Number of shards created: 1
Number of shards created: 2
Number of shards created: 3
Number of shards created: 4
Number of shards created: 5
Number of shards created: 6
Number of shards created: 7
Number of shards created: 8
Number of shards created: 9
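
The counts printed above are consistent with one shard per `max_shard_size` entries, i.e. `ceil(n / max_shard_size)` shards for `n` entries. The sketch below reproduces that arithmetic with plain dictionaries. `split_into_shards` is a hypothetical helper written for illustration; it is not part of `ProjectionDatastoreFactory`, and the real datastore may split its data differently.

```python
import math


def split_into_shards(entries, max_shard_size):
    """Split a dict into a list of dicts holding at most max_shard_size items each.

    Hypothetical helper for illustration only; not the datastore's own logic.
    """
    if max_shard_size <= 0:
        raise ValueError("max_shard_size must be positive")
    items = list(entries.items())
    return [
        dict(items[start:start + max_shard_size])
        for start in range(0, len(items), max_shard_size)
    ]


max_shard_size = 25
shard_counts = []
for i in range(10):
    projection = {str(j): [float(j)] for j in range(max_shard_size * i)}
    shards = split_into_shards(projection, max_shard_size)
    # The shard count matches the ceiling of entries / shard size
    assert len(shards) == math.ceil(len(projection) / max_shard_size)
    shard_counts.append(len(shards))

print(shard_counts)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Under this assumption a dataset that is one entry over a multiple of 25 needs one extra, mostly empty shard, which is why adding a single key to the largest dataset can push the count up by one.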


Using the largest dataset produced by the loop above, plus a special "_average" vector computed ahead of time, we'll show how projection shards are converted into similarity shards that the backend can use for closest-neighbor lookups.

In [34]:
# Add the precomputed "_average" vector to the largest dataset and re-upload it
projection["_average"] = [0.0]
movie_indices["_average"] = len(projection)
projection_datastore.upload(projection, movie_indices)

# Build a similarity engine on top of the projection datastore
similarity_engine = SimilarityEngineFactory(projection_datastore).build()
average_vector = similarity_engine.find_average_vector()
print(f'"_average" vector: {average_vector}')
print(f"index of closest neighbor: {similarity_engine.get_closest_neighbor(average_vector)}")

"_average" vector: [0.0]
index of closest neighbor: 0
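
To make the lookup above concrete, here is a minimal sketch of a brute-force closest-neighbor search that scans every shard in turn. It is an assumption about how such a lookup could work, not the implementation behind `SimilarityEngineFactory`. The function `closest_neighbor`, the Euclidean distance metric, and the choice to skip the "_average" key during the search are all hypothetical.

```python
import math


def closest_neighbor(shards, indices, query, exclude=("_average",)):
    """Return the index of the stored vector closest to query, or None if empty.

    Hypothetical brute-force search using Euclidean distance; keys listed in
    `exclude` are skipped (an assumption, not confirmed library behavior).
    """
    best_key, best_distance = None, math.inf
    for shard in shards:
        for key, vector in shard.items():
            if key in exclude:
                continue
            distance = math.dist(vector, query)
            if distance < best_distance:
                best_key, best_distance = key, distance
    return None if best_key is None else indices[best_key]


# Rebuild the largest dataset from the notebook and add the "_average" vector
max_shard_size = 25
projection = {str(j): [float(j)] for j in range(max_shard_size * 9)}
movie_indices = {str(j): j for j in range(max_shard_size * 9)}
projection["_average"] = [0.0]
movie_indices["_average"] = len(projection)

# Split into plain-dict shards of at most max_shard_size entries
items = list(projection.items())
shards = [
    dict(items[start:start + max_shard_size])
    for start in range(0, len(items), max_shard_size)
]

# Look up the stored "_average" vector, then search all shards for its neighbor
average_vector = next(s["_average"] for s in shards if "_average" in s)
print(closest_neighbor(shards, movie_indices, average_vector))  # 0
print(closest_neighbor(shards, movie_indices, [41.7]))  # 42
```

A full scan like this costs time proportional to the total number of entries, but each shard can be searched independently, so the per-shard results could be computed separately and then merged.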
