This repository has been archived by the owner on Aug 9, 2023. It is now read-only.

Keras vectorizer build_embedding_matrix additions #155

Merged
merged 5 commits into master from add-glove-embeddings on Oct 15, 2020

Conversation

lizgzil
Contributor

@lizgzil lizgzil commented Oct 12, 2020

Description

Fixing https://github.com/wellcometrust/WellcomeML/issues/35

I can't actually see any usage of the "build_embedding_matrix" function in other projects, so I'm hoping setting embeddings_path=None doesn't break anything.

I've set it up so that the function can take a gensim word vector name if the local path isn't given. I can change the logic if this isn't ideal, though; I wasn't sure whether it'd be better to just give a default model name.
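A minimal sketch of that branching (illustrative only; the function name and exact error handling below are assumptions, not necessarily the merged code):

import os

import gensim.downloader
import numpy as np


def load_embeddings_index(embeddings_path=None, word_vectors=None):
    # Sketch: a local GloVe-style text file takes precedence; otherwise
    # fall back to a named gensim model such as "glove-twitter-25".
    if embeddings_path is not None:
        if not os.path.exists(embeddings_path):
            raise FileNotFoundError(f"Incorrect local embeddings path: {embeddings_path}")
        embeddings_index = {}
        with open(embeddings_path) as f:
            for line in f:
                word, coefs = line.split(maxsplit=1)
                embeddings_index[word] = np.fromstring(coefs, "f", sep=" ")
        return embeddings_index
    return gensim.downloader.load(word_vectors)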

Checklist

  • Added link to Github issue or Trello card
  • Added tests

@lizgzil lizgzil marked this pull request as ready for review October 13, 2020 10:29
@codecov-io

codecov-io commented Oct 13, 2020

Codecov Report

Merging #155 into master will increase coverage by 0.66%.
The diff coverage is 82.75%.


@@            Coverage Diff             @@
##           master     #155      +/-   ##
==========================================
+ Coverage   79.11%   79.77%   +0.66%     
==========================================
  Files          39       39              
  Lines        1958     1978      +20     
==========================================
+ Hits         1549     1578      +29     
+ Misses        409      400       -9     
Impacted Files Coverage Δ
wellcomeml/ml/keras_vectorizer.py 90.19% <82.75%> (+35.35%) ⬆️


        embeddings_index[word] = coefs
        emb_dim = len(coefs)
else:
    logger.error("Incorrect local embeddings path")
Contributor

Should we log an error or throw an exception? Does the pipeline work if you get None back?

Contributor Author

Yeah, I think you're right; it should really throw an exception and stop.
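i.e. something along these lines (a sketch of the agreed change, not the merged diff):

if os.path.exists(embeddings_path):
    ...  # read the embeddings file as before
else:
    raise FileNotFoundError(f"Incorrect local embeddings path: {embeddings_path}")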

logger.error(
    "Incorrect GenSim word vector model name, try e.g. 'glove-twitter-25'"
)
return
Contributor

Good catch, but same consideration as above: maybe we want to throw our own exception here.

word, coefs = line.split(maxsplit=1)
coefs = np.fromstring(coefs, "f", sep=" ")
embeddings_index[word] = coefs

def build_embedding_matrix(self, embeddings_path=None, word_vectors=None):
Contributor

I wonder whether the two arguments should be one; check the from_pretrained argument in transformers: https://huggingface.co/transformers/main_classes/model.html#transformers.PreTrainedModel.from_pretrained.

So the idea would be that the user passes either a path to embeddings stored locally or the name of a pretrained embedding. Those embeddings are downloaded and cached, so the second time they do not need to be downloaded again.
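A sketch of that single-argument dispatch (the function name is hypothetical, not the WellcomeML API):

import os

import gensim.downloader
import numpy as np


def load_pretrained(pretrained):
    # An existing local path means "load from disk"; any other string is
    # treated as a gensim model name, downloaded once and then served from
    # gensim's local cache (~/gensim-data).
    if os.path.exists(pretrained):
        with open(pretrained) as f:
            return dict(
                (word, np.fromstring(coefs, "f", sep=" "))
                for word, coefs in (line.split(maxsplit=1) for line in f)
            )
    return gensim.downloader.load(pretrained)

With this shape, load_pretrained("glove-twitter-25") and load_pretrained("models/glove.6B.100d.txt") go through the same entry point, as in transformers.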

Contributor

@nsorros nsorros left a comment

That is a great addition @lizgzil. There is a discussion point on whether we want to have one argument that receives either a path or a name and, if it receives a name, downloads and caches the embeddings. I think this would be a nice approach that follows ideas elsewhere in our library as well.

…le path or a name of a gensim model, get rid of logger and just throw errors
@lizgzil
Contributor Author

lizgzil commented Oct 14, 2020

@nsorros I've made those changes. It looks like when you download a gensim model it will cache it automatically and get the file from there going forward, so I didn't need to write anything for this.

"Gensim has a gensim.downloader module for programmatically accessing this data. The module leverages a local cache that ensures data is downloaded at most once." https://radimrehurek.com/gensim/auto_examples/howtos/run_downloader_api.html

num_words = len(self.tokenizer.word_index) + 1

embedding_matrix = np.zeros((num_words, emb_dim))
for word, i in self.tokenizer.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if local_embeddings:
Contributor

Is that needed? Why not use the try/except block for both cases?

Contributor Author

I think it is because the get command won't work for the gensim embeddings case.
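i.e. (a fragment sketch of the two branches being discussed, with word, i, and the matrices as in the diff above): a plain dict returns None from .get for out-of-vocabulary words, while gensim's KeyedVectors raises KeyError, so the two cases need different guards:

if local_embeddings:
    # embeddings_index is a plain dict; .get returns None for OOV words
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
else:
    # word_vectors is a gensim KeyedVectors; indexing raises KeyError for OOV
    try:
        embedding_matrix[i] = word_vectors[word]
    except KeyError:
        pass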

Contributor

@nsorros nsorros left a comment

LGTM, one possible consideration.

@lizgzil lizgzil merged commit a59b55b into master Oct 15, 2020
@lizgzil lizgzil deleted the add-glove-embeddings branch October 15, 2020 10:22