
Commit

first unit test
veekaybee committed Jun 21, 2023
1 parent caeb472 commit 3d4f7de
Showing 7 changed files with 32 additions and 20 deletions.
21 changes: 9 additions & 12 deletions README.md
@@ -42,25 +42,22 @@ Since the project is actively in exploration and development, there are a lot of

 + `src` - where all the code is
-+ `api` - Flask server that calls the model, includes a search endpoint. Eventually will be rewritten in Go (for performance reasons)
-+ `models` - The actual models including Word2Vec and BERT. Right now in production only BERT gets called from the API.
-+ `notebooks` - Exploration and development of the input data, various concepts, algorithms, etc. The best resource there [is this notebook](https://github.com/veekaybee/viberary/blob/main/notebooks/05_duckdb_0.7.1.ipynb), which covers the end-to-end workflow of starting with raw data, processing in DuckDB, learning a Word2Vec embeddings model, and storing and querying those embeddings in Redis Search. This is the solution I'm working towards for the first baseline production model.
-+ models:
-  + `word2vec` - Word2Vec implemented in PyTorch. I did this before I implemented Word2Vec in Gensim to learn about PyTorch idioms and paradigms. [Annotated output is here.](https://colab.research.google.com/gist/veekaybee/a40d8f37dd99eda2e6d03f4c10671674/cbow.ipynb)
++ `datagen` - Data generated for feeding into Word2Vec and for generating embeddings, plus the code used to generate those embeddings (run on a Paperspace GPU instance)
++ `models` - The actual models, including Word2Vec and BERT.
+  + `bert` - Right now in production only BERT gets called from the API. The `bert` directory includes an indexer that indexes embeddings generated in `datagen` into a Redis instance. Redis and the Flask app talk to each other through an app running via `docker-compose` and the `Dockerfile` for the main app instance.
+  + `word2vec` - Word2Vec implemented in PyTorch. I did this before I implemented Word2Vec in Gensim to learn about PyTorch idioms and paradigms. [Annotated output is here.](https://colab.research.google.com/gist/veekaybee/a40d8f37dd99eda2e6d03f4c10671674/cbow.ipynb)
+  + There are also some utilities for data-directory access and I/O operations, and a separate indexer that indexes titles into Redis for easy retrieval by the application
++ `notebooks` - Exploration and development of the input data, various concepts, algorithms, etc. The best resource there [is this notebook](https://github.com/veekaybee/viberary/blob/main/notebooks/05_duckdb_0.7.1.ipynb), which covers the end-to-end workflow of starting with raw data, processing in DuckDB, learning a Word2Vec embeddings model, and storing and querying those embeddings in Redis Search. This is the solution I eventually turned into the application directory structure.
++ `docs` - This serves and rebuilds viberary.pizza
++ `api` - Me starting to learn Go for what will eventually be the production-grade server (ported from Flask)


## Relevant Literature and Bibliography

+ ["Towards Personalized and Semantic Retrieval: An End-to-End Solution for E-commerce Search via Embedding Learning"](https://arxiv.org/abs/2006.02282)
+ ["PinnerSage"](https://arxiv.org/abs/2007.03634)
+ ["Making Machine Learning Easy with Embeddings"](https://mlsys.org/Conferences/doc/2018/115.pdf)
-+ ["Research Rabbit Collection"](https://www.researchrabbitapp.com/collection/public/R6DO98QNZP)
++ ["My Research Rabbit Collection"](https://www.researchrabbitapp.com/collection/public/R6DO98QNZP)
+ My [paper on embeddings and its bibliography](https://vickiboykis.com/what_are_embeddings/index.html)

## Input Data Sample

2 changes: 1 addition & 1 deletion requirements.txt
@@ -5,4 +5,4 @@ pyarrow==12.0.0
 redis==4.5.3
 sentence_transformers==2.2.2
 tqdm==4.64.1
-
+fakeredis
11 changes: 4 additions & 7 deletions src/models/bert/knn_search.py
@@ -78,12 +78,9 @@ def top_knn(
         return scored_results

     def rescore(self, result_list: List) -> List:
-        """Takes a ranked list and returns ordinal scores for each
+        """Takes a ranked list of tuples
+        Each tuple contains (index, cosine similarity, book title)
+        and returns ordinal scores for each
         cosine similarity for UI
         """
-        ranked_list = []
-
-        for index, val in enumerate(result_list):
-            ranked_list.append((val[2], index))
-
-        return ranked_list
+        return [(val[2], index) for index, val in enumerate(result_list)]
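Pulled out of its class, the refactored `rescore` is a one-line transform from `(index, cosine similarity, title)` tuples to `(title, ordinal rank)` pairs. A standalone sketch of the same logic:

```python
from typing import List, Tuple

def rescore(result_list: List[Tuple[int, float, str]]) -> List[Tuple[str, int]]:
    # Drop the raw cosine-similarity scores and replace them with each
    # item's ordinal rank, keeping only the title for display in the UI
    return [(val[2], index) for index, val in enumerate(result_list)]

print(rescore([(0, 0.888, "dogs"), (1, 0.777, "cats"), (2, 0.666, "birds")]))
# [('dogs', 0), ('cats', 1), ('birds', 2)]
```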
Empty file.
Binary file not shown.
Binary file not shown.
18 changes: 18 additions & 0 deletions src/models/bert/tests/test_ranker.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
import pytest
from fakeredis import FakeRedis

from models.bert.knn_search import KNNSearch


@pytest.fixture
def redis_mock():
return FakeRedis()


def test_rescore(redis_mock):
result_list = [(0, 0.888, "dogs"), (1, 0.777, "cats"), (2, 0.666, "birds")]
expected_list = [("dogs", 0), ("cats", 1), ("birds", 2)]

rescore = KNNSearch(redis_mock).rescore(result_list)

assert rescore == expected_list
