
uSIF vs Averaging #10

Closed
fros1y opened this issue Mar 3, 2020 · 1 comment

Comments


fros1y commented Mar 3, 2020

I noticed that you are calculating sentence embedding using an average of the individual word vectors when performing clustering, etc. Did you happen to evaluate whether SIF or uSIF would be advantageous over averaging?
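For context on what SIF-style weighting changes relative to plain averaging: instead of giving every word equal weight, SIF down-weights frequent words by a/(a + p(w)) before averaging (and, in the full method, additionally removes the first principal component across the corpus). The toy embeddings and unigram probabilities below are invented purely for illustration; they are not from this repository:

```python
import numpy as np

# Sketch of SIF-style weighted averaging. The embeddings and word
# frequencies here are made up for demonstration; a real setup would
# load trained vectors and corpus unigram probabilities.
a = 1e-3  # smoothing constant commonly used for SIF

word_emb = {"good": np.array([1.0, 0.0]), "movie": np.array([0.0, 1.0])}
word_freq = {"good": 0.001, "movie": 0.0005}  # unigram probabilities p(w)

def sif_embed(doc):
    # Weight each in-vocabulary word by a / (a + p(w)), then average.
    # Full SIF would also subtract the first principal component
    # computed over all document vectors.
    vecs = [word_emb[w] * (a / (a + word_freq[w])) for w in doc if w in word_emb]
    return np.mean(vecs, axis=0)

print(sif_embed(["good", "movie"]))
```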

yumeng5 (Owner) commented Mar 3, 2020

Hi,

Thanks for the question. Are you referring to the following functions in the evaluation code?

def get_avg_emb(vec_file, text):
    ...  # body not included in the original comment

def calc_rep(docs, word_emb):
    emb = [np.array([word_emb[w] for w in doc if w in word_emb]) for doc in docs]
    emb = np.array([np.average(vec, axis=0) for vec in emb])
    return emb

For baselines that produce document/sentence embeddings (like SIF and JoSE), we directly take their document/sentence embeddings as features for clustering/classification. The above functions (averaged word embedding) are used to produce sentence embeddings only for word embedding baselines (word2vec) that cannot naturally learn sentence representations. They are actually not used anywhere in the evaluation code (I should have deleted them to avoid confusion).
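To make the averaging step concrete, here is a minimal, self-contained sketch of how `calc_rep` behaves; the `word_emb` dictionary and documents are toy values invented for this example, not data from the repository:

```python
import numpy as np

# Toy word embeddings, invented for illustration. A real evaluation
# would load these from a trained word2vec (or similar) vector file.
word_emb = {
    "good": np.array([1.0, 0.0]),
    "movie": np.array([0.0, 1.0]),
    "bad": np.array([-1.0, 0.0]),
}

def calc_rep(docs, word_emb):
    # For each document, stack the vectors of in-vocabulary words,
    # then average them into a single document vector.
    emb = [np.array([word_emb[w] for w in doc if w in word_emb]) for doc in docs]
    emb = np.array([np.average(vec, axis=0) for vec in emb])
    return emb

# Out-of-vocabulary tokens (like "unknownword") are simply skipped.
docs = [["good", "movie"], ["bad", "movie", "unknownword"]]
reps = calc_rep(docs, word_emb)
print(reps.shape)  # (2, 2): one averaged vector per document
```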

Please let me know if you have any further questions!

Best,
Yu

@fros1y fros1y closed this as completed Mar 9, 2020