
Commit

first unit test
veekaybee committed Jun 21, 2023
1 parent caeb472 commit 3d4f7de
Showing 7 changed files with 32 additions and 20 deletions.
21 changes: 9 additions & 12 deletions README.md
@@ -42,25 +42,22 @@ Since the project is actively in exploration and development, there are a lot of

 + `src` - where all the code is
-+ `api` - Flask server that calls the model, includes a search endpoint. Eventually will be rewritten in Go (for performance reasons)
-+ `models` - The actual models including Word2Vec and BERT. Right now in production only BERT gets called from the API.
-+ `notebooks` - Exploration and development of the input data, various concepts, algorithms, etc. The best resource there [is this notebook](https://github.com/veekaybee/viberary/blob/main/notebooks/05_duckdb_0.7.1.ipynb), which covers the end-to-end workflow of starting with raw data, processing in DuckDB, learning a Word2Vec embeddings model, and storing and querying those embeddings in Redis Search. This is the solution I'm working towards for the first baseline production model.
-+ models:
-  + `word2vec` - Word2Vec implemented in PyTorch. I did this before I implemented Word2Vec in Gensim to learn about PyTorch idioms and paradigms. [Annotated output is here.](https://colab.research.google.com/gist/veekaybee/a40d8f37dd99eda2e6d03f4c10671674/cbow.ipynb)
++ `datagen` - Data generated for feeding into Word2Vec and for generating embeddings, plus the code used to generate those embeddings (run on a Paperspace GPU instance)
++ `models` - The actual models, including Word2Vec and BERT.
+  + `bert` - Right now in production only BERT gets called from the API. The `bert` directory includes an indexer that indexes embeddings generated in `datagen` into a Redis instance. Redis and the Flask app talk to each other through an app running via `docker-compose` and the `Dockerfile` for the main app instance.
+  + `word2vec` - Word2Vec implemented in PyTorch. I did this before I implemented Word2Vec in Gensim to learn about PyTorch idioms and paradigms. [Annotated output is here.](https://colab.research.google.com/gist/veekaybee/a40d8f37dd99eda2e6d03f4c10671674/cbow.ipynb)
+  + There are also some utilities for data-directory access and I/O operations, and a separate indexer that indexes titles into Redis for easy retrieval by the application
++ `notebooks` - Exploration and development of the input data, various concepts, algorithms, etc. The best resource there [is this notebook](https://github.com/veekaybee/viberary/blob/main/notebooks/05_duckdb_0.7.1.ipynb), which covers the end-to-end workflow of starting with raw data, processing in DuckDB, learning a Word2Vec embeddings model, and storing and querying those embeddings in Redis Search. This is the solution I eventually turned into the application directory structure.
++ `docs` - This serves and rebuilds viberary.pizza
++ `api` - Me starting to learn Go for what will eventually be the production-grade server (ported from Flask)


## Relevant Literature and Bibliography

+ ["Towards Personalized and Semantic Retrieval: An End-to-End Solution for E-commerce Search via Embedding Learning"](https://arxiv.org/abs/2006.02282)
+ ["PinnerSage"](https://arxiv.org/abs/2007.03634)
+ ["Making Machine Learning Easy with Embeddings"](https://mlsys.org/Conferences/doc/2018/115.pdf)
-+ ["Research Rabbit Collection"](https://www.researchrabbitapp.com/collection/public/R6DO98QNZP)
++ ["My Research Rabbit Collection"](https://www.researchrabbitapp.com/collection/public/R6DO98QNZP)
+ My [paper on embeddings and its bibliography](https://vickiboykis.com/what_are_embeddings/index.html)

## Input Data Sample

2 changes: 1 addition & 1 deletion requirements.txt
@@ -5,4 +5,4 @@ pyarrow==12.0.0
 redis==4.5.3
 sentence_transformers==2.2.2
 tqdm==4.64.1
-
+fakeredis
11 changes: 4 additions & 7 deletions src/models/bert/knn_search.py
@@ -78,12 +78,9 @@ def top_knn(
         return scored_results

     def rescore(self, result_list: List) -> List:
-        """Takes a ranked list and returns ordinal scores for each
+        """Takes a ranked list of tuples
+        Each tuple contains (index, cosine similarity, book title)
+        and returns ordinal scores for each
         cosine similarity for UI
         """
-        ranked_list = []
-
-        for index, val in enumerate(result_list):
-            ranked_list.append((val[2], index))
-
-        return ranked_list
+        return [(val[2], index) for index, val in enumerate(result_list)]
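Pulled out of its class, the refactored `rescore` is a one-line transform from `(index, cosine similarity, title)` tuples to `(title, ordinal rank)` pairs. A standalone sketch of the same logic:

```python
from typing import List, Tuple

def rescore(result_list: List[Tuple[int, float, str]]) -> List[Tuple[str, int]]:
    # Drop the raw cosine-similarity scores and replace them with each
    # item's ordinal rank, keeping only the title for display in the UI
    return [(val[2], index) for index, val in enumerate(result_list)]

print(rescore([(0, 0.888, "dogs"), (1, 0.777, "cats"), (2, 0.666, "birds")]))
# [('dogs', 0), ('cats', 1), ('birds', 2)]
```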
Empty file.
Binary file not shown.
Binary file not shown.
18 changes: 18 additions & 0 deletions src/models/bert/tests/test_ranker.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
import pytest
from fakeredis import FakeRedis

from models.bert.knn_search import KNNSearch


@pytest.fixture
def redis_mock():
return FakeRedis()


def test_rescore(redis_mock):
result_list = [(0, 0.888, "dogs"), (1, 0.777, "cats"), (2, 0.666, "birds")]
expected_list = [("dogs", 0), ("cats", 1), ("birds", 2)]

rescore = KNNSearch(redis_mock).rescore(result_list)

assert rescore == expected_list
