<a href="https://colab.research.google.com/github/wadaka0821/nlp-tutorial/blob/main/questions/3_5_word2vec_question.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Learning distributed representations with Word2Vec

In [None]:
!pip install datasets
!pip install gensim



In [None]:
from datasets import load_dataset
import nltk
import gensim
nltk.download('punkt')

seed = 42

dataset = load_dataset("ACL-OCL/acl-anthology-corpus")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!



In [None]:
# Extract only the abstracts
abstracts = [dataset['train'][i]['abstract'] for i in range(len(dataset['train']))]
# Tokenize into words (lowercased; empty abstracts are skipped)
abstracts_tokenized = [nltk.word_tokenize(abst.lower()) for abst in abstracts if abst]

In [None]:
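# sg selects the training algorithm: 0 = CBOW, 1 = Skip-gram
# gensim >= 4.0 names the dimension argument vector_size (it was size in gensim 3.x)
model = gensim.models.word2vec.Word2Vec(abstracts_tokenized, vector_size=100, window=5, seed=seed, sg=1)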

In [None]:
model.wv.vectors.shape

(31142, 100)

# Trying out the distributed representations

In [None]:
model.wv.most_similar(positive=['pytorch'])

[('tensorflow', 0.8910528421401978),
 ('sockeye', 0.7975236177444458),
 ('tensor2tensor', 0.7924074530601501),
 ('huggingface', 0.7872991561889648),
 ('fasthan', 0.7814316749572754),
 ('opennmt', 0.7688520550727844),
 ('opendial', 0.7673929333686829),
 ('allennlp', 0.7662742733955383),
 ('soa', 0.7642906904220581),
 ('toolkits', 0.7589478492736816)]

In [None]:
model.wv.most_similar('generation')

[('generator', 0.7199503183364868),
 ('graph-to-text', 0.7176738977432251),
 ('controllable', 0.7165037393569946),
 ('data-to-text', 0.7153257131576538),
 ('lyric', 0.7152193188667297),
 ('concept-to-text', 0.7050998210906982),
 ('datato-text', 0.7002928853034973),
 ('infilling', 0.6993923187255859),
 ('data-totext', 0.6926992535591125),
 ('multi-round', 0.6896508932113647)]

In [None]:
model.wv.most_similar('supervised')

[('semi-supervised', 0.822796642780304),
 ('lightly', 0.789624810218811),
 ('unsupervised', 0.7633551955223083),
 ('weaklysupervised', 0.7631798982620239),
 ('semisupervised', 0.7614290714263916),
 ('distantly', 0.7483733296394348),
 ('weakly-supervised', 0.7467080354690552),
 ('distantly-supervised', 0.7452856302261353),
 ('fully-supervised', 0.7378839254379272),
 ('weakly', 0.7285276055335999)]

In [None]:
model.wv.most_similar('rnn')

[('lstm', 0.8546769022941589),
 ('recurrent', 0.8355414271354675),
 ('gru', 0.8175690174102783),
 ('lstm-based', 0.782768726348877),
 ('transformer', 0.774260401725769),
 ('lrn', 0.7730235457420349),
 ('bilstm', 0.7727881073951721),
 ('feed-forward', 0.7693912386894226),
 ('bilstms', 0.768124520778656),
 ('self-attentive', 0.7655966281890869)]

In [None]:
model.wv.most_similar('movie')

[('imdb', 0.7453122735023499),
 ('hotel', 0.7329555749893188),
 ('synopses', 0.7246863842010498),
 ('restaurant', 0.7226597666740417),
 ('hotels', 0.7176419496536255),
 ('movies', 0.7162580490112305),
 ('podcast', 0.7027246952056885),
 ('rotten', 0.7026669979095459),
 ('yelp', 0.7025178074836731),
 ('reservation', 0.6888513565063477)]

In [None]:
model.wv.most_similar('sentiment')

[('polarity', 0.8630881309509277),
 ('aspect-category', 0.8002216219902039),
 ('aspect-level', 0.7968421578407288),
 ('aspect-based', 0.7959100604057312),
 ('target-dependent', 0.7881436944007874),
 ('subjectivity', 0.7761756181716919),
 ('opinion', 0.7596176862716675),
 ('aspectbased', 0.7589598298072815),
 ('multi-aspect', 0.745111882686615),
 ('aspectlevel', 0.7416306138038635)]

In [None]:
model.wv.most_similar('corpus')

[('corpora', 0.743595540523529),
 ('subcorpus', 0.7178256511688232),
 ('collection', 0.6810144186019897),
 ('srcmf', 0.6712679862976074),
 ('sub-corpus', 0.6642714738845825),
 ('treebank', 0.6488476991653442),
 ('hansard', 0.6455675363540649),
 ('sinorama', 0.6404972076416016),
 ('semcor', 0.6375545263290405),
 ('l1-l2', 0.6342868804931641)]

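## Question 1
---
The examples above output the words most similar to an input word. Distributed representations also let you add and subtract the meanings of words. Pick a few words, add and subtract their vectors, and check what the result looks like.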

## Question 2
---
What happens when you give the word2vec model an unknown (out-of-vocabulary) word?
Also, look into ways of handling unknown words and try one of them out.

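## Question 3
---
1) word2vec produces distributed representations of words. How, then, can you obtain a distributed representation of a sentence or document? Some methods use word2vec and some do not; look into both if you can.   
2) Implement one of the word2vec-based methods you found in 1) and try it out.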