Keras vectorizer build_embedding_matrix additions #155
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master     #155      +/-   ##
==========================================
+ Coverage   79.11%   79.77%   +0.66%
==========================================
  Files          39       39
  Lines        1958     1978      +20
==========================================
+ Hits         1549     1578      +29
+ Misses        409      400       -9
Continue to review full report at Codecov.
wellcomeml/ml/keras_vectorizer.py
Outdated
        embeddings_index[word] = coefs
        emb_dim = len(coefs)
else:
    logger.error("Incorrect local embeddings path")
Should we log an error or throw an exception? Does the pipeline work if you get None back?
Yeah, I think you're right, it should really throw an exception and stop.
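A minimal sketch of the exception route for the local path case (the exception type and message here are suggestions, not necessarily what the PR ends up with):

```python
import os

# Assumption: embeddings_path is the local embeddings argument from
# the diff above. Raising stops the pipeline immediately instead of
# letting it continue and build an empty embedding matrix.
if not os.path.exists(embeddings_path):
    raise FileNotFoundError(f"Incorrect local embeddings path: {embeddings_path}")
```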
wellcomeml/ml/keras_vectorizer.py
Outdated
logger.error(
    "Incorrect GenSim word vector model name, try e.g. 'glove-twitter-25'"
)
return
Good catch, but same consideration as above: maybe we want to throw our own exception here.
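Same idea for the gensim branch. The gensim downloader raises ValueError for names it does not recognise, so one option is to re-raise with a friendlier message (a sketch; the variable name is illustrative):

```python
import gensim.downloader

try:
    word_vectors = gensim.downloader.load(word_vectors_name)  # hypothetical variable name
except ValueError:
    raise ValueError(
        "Incorrect gensim word vector model name "
        f"'{word_vectors_name}', try e.g. 'glove-twitter-25'"
    )
```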
wellcomeml/ml/keras_vectorizer.py
Outdated
    word, coefs = line.split(maxsplit=1)
    coefs = np.fromstring(coefs, "f", sep=" ")
    embeddings_index[word] = coefs

def build_embedding_matrix(self, embeddings_path=None, word_vectors=None):
I wonder whether the two arguments should be one; check the from_pretrained argument in transformers: https://huggingface.co/transformers/main_classes/model.html#transformers.PreTrainedModel.from_pretrained.
The idea would be that the user passes either a path to embeddings stored locally or the name of a pretrained embedding. Those embeddings are downloaded and cached, so the second time they do not need to be downloaded again.
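A sketch of the single-argument dispatch being suggested here (the function name, helper, and filesystem check are illustrative assumptions, not the PR's code):

```python
import os

import gensim.downloader


def load_word_vectors(path_or_name):
    # If the argument is a file on disk, treat it as local embeddings;
    # otherwise assume it names a pretrained gensim model, which the
    # downloader fetches and caches on first use.
    if os.path.exists(path_or_name):
        return load_local_embeddings(path_or_name)  # hypothetical local loader
    return gensim.downloader.load(path_or_name)
```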
That is a great addition @lizgzil. There is a discussion point on whether we want to have one argument that receives either a path or a name, and if it receives a name it downloads and caches the embeddings. I think this would be a nice approach that follows ideas elsewhere in our library as well.
…le path or a name of a gensim model, get rid of logger and just throw errors
@nsorros I've made those changes. It looks like when you download a gensim model it will cache it automatically and get the file from there going forward, so I didn't need to write anything for this. "Gensim has a gensim.downloader module for programmatically accessing this data. The module leverages a local cache that ensures data is downloaded at most once." https://radimrehurek.com/gensim/auto_examples/howtos/run_downloader_api.html
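For reference, that caching behaviour in use; the first call downloads the vectors to the local cache (~/gensim-data by default) and later calls read them straight from disk:

```python
import gensim.downloader as api

# Downloads on first run, loads from the local cache afterwards
word_vectors = api.load("glove-twitter-25")
print(word_vectors.most_similar("cat", topn=3))
```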
num_words = len(self.tokenizer.word_index) + 1

embedding_matrix = np.zeros((num_words, emb_dim))
for word, i in self.tokenizer.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if local_embeddings:
Is that needed? Why not use the try/except block for both cases?
I think it is because the get command won't work for the gensim embeddings case.
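To make the distinction concrete: the local branch builds a plain dict, which supports .get, while gensim's KeyedVectors has no .get and raises KeyError for out-of-vocabulary words, hence the two code paths. A sketch, assuming word_vectors is a gensim KeyedVectors object:

```python
# Local embeddings: embeddings_index is a plain dict, so .get
# returns None for out-of-vocabulary words.
embedding_vector = embeddings_index.get(word)

# Gensim embeddings: indexing an out-of-vocabulary word raises
# KeyError, so a try/except is needed instead.
try:
    embedding_vector = word_vectors[word]
except KeyError:
    embedding_vector = None
```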
LGTM, one possible consideration.
Description
Fixing https://github.com/wellcometrust/WellcomeML/issues/35
I can't actually see any usage of the "build_embedding_matrix" function in other projects, so I'm hoping setting embeddings_path=None doesn't break anything. I've set it so that the function can take a gensim word vector name if the local path isn't given, but I can change the logic if this isn't ideal; I wasn't sure whether it'd be better to just give a default model name.
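For context, a rough sketch of how the pieces above could fit together; this is reconstructed from the diff fragments in the conversation, not the exact merged code:

```python
import os

import gensim.downloader
import numpy as np


def build_embedding_matrix(self, embeddings_path=None, word_vectors=None):
    local_embeddings = embeddings_path is not None
    if local_embeddings:
        if not os.path.exists(embeddings_path):
            raise FileNotFoundError(f"Incorrect local embeddings path: {embeddings_path}")
        # Parse a GloVe-style text file: one word followed by its vector per line
        embeddings_index = {}
        with open(embeddings_path) as f:
            for line in f:
                word, coefs = line.split(maxsplit=1)
                coefs = np.fromstring(coefs, "f", sep=" ")
                embeddings_index[word] = coefs
                emb_dim = len(coefs)
    else:
        # Downloaded and cached by gensim on first use
        vectors = gensim.downloader.load(word_vectors)
        emb_dim = vectors.vector_size

    num_words = len(self.tokenizer.word_index) + 1
    embedding_matrix = np.zeros((num_words, emb_dim))
    for word, i in self.tokenizer.word_index.items():
        if local_embeddings:
            embedding_vector = embeddings_index.get(word)
            if embedding_vector is not None:
                embedding_matrix[i] = embedding_vector
        else:
            try:
                embedding_matrix[i] = vectors[word]
            except KeyError:
                pass
    return embedding_matrix
```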
Checklist