This repository has been archived by the owner on Aug 9, 2023. It is now read-only.

Keras vectorizer build_embedding_matrix additions #155

Merged
merged 5 commits into master from add-glove-embeddings on Oct 15, 2020

Conversation

lizgzil
Contributor

@lizgzil lizgzil commented Oct 12, 2020

Description

Fixing https://github.com/wellcometrust/WellcomeML/issues/35

I can't actually see any usage of the "build_embedding_matrix" function in other projects, so I'm hoping setting embeddings_path=None doesn't break anything.

I've set it up so that the function can take a gensim word vector name if the local path isn't given. I can change the logic if this isn't ideal, though; I wasn't sure whether it'd be better to just give a default model name.
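A minimal sketch of that branching (illustrative only; the function name and exact error handling below are assumptions, not necessarily the merged code):

import os

import gensim.downloader
import numpy as np


def load_embeddings_index(embeddings_path=None, word_vectors=None):
    # Sketch: a local GloVe-style text file takes precedence; otherwise
    # fall back to a named gensim model such as "glove-twitter-25".
    if embeddings_path is not None:
        if not os.path.exists(embeddings_path):
            raise FileNotFoundError(f"Incorrect local embeddings path: {embeddings_path}")
        embeddings_index = {}
        with open(embeddings_path) as f:
            for line in f:
                word, coefs = line.split(maxsplit=1)
                embeddings_index[word] = np.fromstring(coefs, "f", sep=" ")
        return embeddings_index
    return gensim.downloader.load(word_vectors)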

Checklist

  • Added link to Github issue or Trello card
  • Added tests

@lizgzil lizgzil marked this pull request as ready for review October 13, 2020 10:29
@codecov-io

codecov-io commented Oct 13, 2020

Codecov Report

Merging #155 into master will increase coverage by 0.66%.
The diff coverage is 82.75%.


@@            Coverage Diff             @@
##           master     #155      +/-   ##
==========================================
+ Coverage   79.11%   79.77%   +0.66%     
==========================================
  Files          39       39              
  Lines        1958     1978      +20     
==========================================
+ Hits         1549     1578      +29     
+ Misses        409      400       -9     
Impacted Files Coverage Δ
wellcomeml/ml/keras_vectorizer.py 90.19% <82.75%> (+35.35%) ⬆️


        embeddings_index[word] = coefs
        emb_dim = len(coefs)
else:
    logger.error("Incorrect local embeddings path")
Contributor

Should we log an error or throw an exception? Does the pipeline work if you get None back?

Contributor Author

Yeah, I think you're right; it should really throw an exception and stop.
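i.e. something along these lines (a sketch of the agreed change, not the merged diff):

if os.path.exists(embeddings_path):
    ...  # read the embeddings file as before
else:
    raise FileNotFoundError(f"Incorrect local embeddings path: {embeddings_path}")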

logger.error(
    "Incorrect GenSim word vector model name, try e.g. 'glove-twitter-25'"
)
return
Contributor

Good catch, but same consideration as above: maybe we want to throw our own exception here.

word, coefs = line.split(maxsplit=1)
coefs = np.fromstring(coefs, "f", sep=" ")
embeddings_index[word] = coefs

def build_embedding_matrix(self, embeddings_path=None, word_vectors=None):
Contributor

I wonder whether the two arguments should be one; check the from_pretrained argument in transformers: https://huggingface.co/transformers/main_classes/model.html#transformers.PreTrainedModel.from_pretrained.

So the idea would be that the user passes either a path to embeddings stored locally or the name of a pretrained embedding. Those embeddings are downloaded and cached, so the second time they do not need to be downloaded again.
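A sketch of that single-argument dispatch (the function name is hypothetical, not the WellcomeML API):

import os

import gensim.downloader
import numpy as np


def load_pretrained(pretrained):
    # An existing local path means "load from disk"; any other string is
    # treated as a gensim model name, downloaded once and then served from
    # gensim's local cache (~/gensim-data).
    if os.path.exists(pretrained):
        with open(pretrained) as f:
            return dict(
                (word, np.fromstring(coefs, "f", sep=" "))
                for word, coefs in (line.split(maxsplit=1) for line in f)
            )
    return gensim.downloader.load(pretrained)

With this shape, load_pretrained("glove-twitter-25") and load_pretrained("models/glove.6B.100d.txt") go through the same entry point, as in transformers.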

Contributor

@nsorros nsorros left a comment

That is a great addition @lizgzil. There is a discussion point on whether we want to have one argument that receives either a path or a name and, if it receives a name, downloads and caches the embeddings. I think this would be a nice approach that follows ideas elsewhere in our library as well.

…le path or a name of a gensim model, get rid of logger and just throw errors
@lizgzil
Contributor Author

lizgzil commented Oct 14, 2020

@nsorros I've made those changes. It looks like when you download a gensim model it will cache it automatically and get the file from there going forward, so I didn't need to write anything for this.

"Gensim has a gensim.downloader module for programmatically accessing this data. The module leverages a local cache that ensures data is downloaded at most once." https://radimrehurek.com/gensim/auto_examples/howtos/run_downloader_api.html

num_words = len(self.tokenizer.word_index) + 1

embedding_matrix = np.zeros((num_words, emb_dim))
for word, i in self.tokenizer.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if local_embeddings:
Contributor

Is that needed? Why not use the try/except block for both cases?

Contributor Author

I think it is because the get command won't work for the gensim embeddings case.
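i.e. (a fragment sketch of the two branches being discussed, with word, i, and the matrices as in the diff above): a plain dict returns None from .get for out-of-vocabulary words, while gensim's KeyedVectors raises KeyError, so the two cases need different guards:

if local_embeddings:
    # embeddings_index is a plain dict; .get returns None for OOV words
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
else:
    # word_vectors is a gensim KeyedVectors; indexing raises KeyError for OOV
    try:
        embedding_matrix[i] = word_vectors[word]
    except KeyError:
        pass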

Contributor

@nsorros nsorros left a comment

LGTM, one possible consideration.

@lizgzil lizgzil merged commit a59b55b into master Oct 15, 2020
@lizgzil lizgzil deleted the add-glove-embeddings branch October 15, 2020 10:22