# Pre-trained Word2Vec Embeddings

- Provides 3 million pre-trained Word2Vec word vectors
- [English/Korean Word2Vec practice](https://wikidocs.net/50739)

In [3]:
!wget -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"

--2021-02-24 01:44:14--  https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.112.253
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.112.253|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1647046227 (1.5G) [application/x-gzip]
Saving to: ‘GoogleNews-vectors-negative300.bin.gz’


2021-02-24 01:44:31 (95.2 MB/s) - ‘GoogleNews-vectors-negative300.bin.gz’ saved [1647046227/1647046227]



In [4]:
!gzip -d GoogleNews-vectors-negative300.bin.gz

In [5]:
!ls -l

total 3558856
-rw-r--r-- 1 root root 3644258522 Mar  5  2015 GoogleNews-vectors-negative300.bin
drwxr-xr-x 1 root root       4096 Feb 22 14:38 sample_data


In [6]:
import gensim

# Load Google's pre-trained Word2Vec model.
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

In [7]:
print(model.vectors.shape) # check the size of the model

(3000000, 300)


In [None]:
# The model is 3,000,000 x 300: 3 million words, each represented by a 300-dimensional vector.
# To see why the file is larger than 3 GB:
# 3 million words * 300 features * 4 bytes/feature = 3.6e9 bytes (~3.6 GB, or ~3.35 GiB)
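
The size estimate above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes every vector component is stored as a 4-byte float32, as in the calculation above; the variable names are illustrative, and the file size is the one reported by `ls -l`.

```python
# Estimate the raw size of the embedding matrix, assuming 4-byte float32 values.
vocab_size = 3_000_000
dim = 300
bytes_per_value = 4

matrix_bytes = vocab_size * dim * bytes_per_value
print(matrix_bytes)                    # 3600000000
print(round(matrix_bytes / 10**9, 2))  # 3.6  (decimal GB)
print(round(matrix_bytes / 2**30, 2))  # 3.35 (GiB)

# Size of GoogleNews-vectors-negative300.bin as listed by `ls -l`.
file_bytes = 3_644_258_522

# Whatever is left over is not vector data (presumably the header and the
# stored word strings; that breakdown is an assumption, not measured here).
print(file_bytes - matrix_bytes)       # 44258522
```

The "3.35" figure is therefore the size in GiB (2^30 bytes), while the same matrix is 3.6 GB in decimal units.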

In [14]:
# Compute the similarity between two words
print(model.similarity('this', 'is'))
print(model.similarity('post', 'book'))

0.40797037
0.057204384
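
The scores above are cosine similarities between the two word vectors. As a rough illustration of what that measure computes, here is a minimal pure-Python sketch; the helper `cosine_similarity` and the toy 3-dimensional vectors are made up for this example and are not taken from the model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors (hypothetical, for illustration only).
a = [1.0, 0.0, 0.0]
b = [1.0, 1.0, 0.0]
c = [0.0, 0.0, 2.0]

print(cosine_similarity(a, a))  # same direction
print(cosine_similarity(a, b))  # 45 degrees apart
print(cosine_similarity(a, c))  # orthogonal vectors
```

Because the measure depends only on direction, scaling a vector (as with `c`) does not change the result.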


In [9]:
print(model['book']) # print the vector for the word 'book'

[ 0.11279297 -0.02612305 -0.04492188  0.06982422  0.140625    0.03039551
 -0.04370117  0.24511719  0.08740234 -0.05053711  0.23144531 -0.07470703
  0.21875     0.03466797 -0.14550781  0.05761719  0.00671387 -0.00701904
  0.13183594 -0.25390625  0.14355469 -0.140625   -0.03564453 -0.21289062
 -0.24804688  0.04980469 -0.09082031  0.14453125  0.05712891 -0.10400391
 -0.19628906 -0.20507812 -0.27539062  0.03063965  0.20117188  0.17382812
  0.09130859 -0.10107422  0.22851562 -0.04077148  0.02709961 -0.00106049
  0.02709961  0.34179688 -0.13183594 -0.078125    0.02197266 -0.18847656
 -0.17480469 -0.05566406 -0.20898438  0.04858398 -0.07617188 -0.15625
 -0.05419922  0.01672363 -0.02722168 -0.11132812 -0.03588867 -0.18359375
  0.28710938  0.01757812  0.02185059 -0.05664062 -0.01251221  0.01708984
 -0.21777344 -0.06787109  0.04711914 -0.00668335  0.08544922 -0.02209473
  0.31835938  0.01794434 -0.02246094 -0.03051758 -0.09570312  0.24414062
  0.20507812  0.05419922  0.29101562  0.03637695  0.04

In [15]:
# Vector to word: look up the word closest to a given vector
book = model.get_vector('book')  # word_vec() is deprecated in newer gensim releases
model.most_similar(positive=[book], topn=1)

[('book', 1.0)]
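
Mapping a vector back to a word amounts to a nearest-neighbor search over the vocabulary. The sketch below shows the idea on a tiny hand-made vocabulary; `toy_vectors`, `cosine`, and `nearest_words` are hypothetical names introduced for this example, and the real model searches 3 million 300-dimensional vectors instead.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vocabulary (values invented for illustration).
toy_vectors = {
    "book":  [0.9, 0.1, 0.0],
    "novel": [0.8, 0.3, 0.1],
    "car":   [0.0, 0.2, 0.9],
}

def nearest_words(query, vectors, topn=1):
    """Return the topn (word, similarity) pairs closest to the query vector."""
    scored = [(word, cosine(query, vec)) for word, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

print(nearest_words(toy_vectors["book"], toy_vectors, topn=2))
```

When the query is a raw vector rather than a word, nothing is excluded from the search, so the vector's own word comes back first with a similarity of about 1.0.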

In [16]:
# Word analogy: king - man + woman ≈ queen
model.most_similar(positive=['king', 'woman'], negative=['man'], topn=3)

[('queen', 0.7118192911148071),
 ('monarch', 0.6189674139022827),
 ('princess', 0.5902431011199951)]
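
The analogy query can be read as vector arithmetic followed by a nearest-neighbor search. The following is a simplified sketch of that idea, not gensim's actual implementation: it assumes the input vectors are normalized to unit length before being combined and that the query words are excluded from the results. The 2-dimensional `toy` vectors (axis 0 loosely "royalty", axis 1 loosely "gender") and the helpers `unit` and `analogy` are invented for illustration.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def analogy(vectors, positive, negative, topn=3):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, unit(vectors[word]))]
    for word in negative:
        target = [t - x for t, x in zip(target, unit(vectors[word]))]
    target = unit(target)

    query_words = set(positive) | set(negative)
    scored = [
        (word, sum(t * x for t, x in zip(target, unit(vec))))
        for word, vec in vectors.items()
        if word not in query_words  # skip the query words themselves
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Hypothetical hand-made vectors, not values from the model.
toy = {
    "king":   [1.0, 1.0],
    "queen":  [1.0, -1.0],
    "man":    [0.0, 1.0],
    "woman":  [0.0, -1.0],
    "prince": [1.0, 0.8],
    "apple":  [-0.9, 0.1],
}

print(analogy(toy, positive=["king", "woman"], negative=["man"], topn=2))
```

With these toy values, "queen" ranks first. Excluding the query words matters: without that filter, an input word such as "woman" would be returned among the nearest neighbors.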

## Korean

In [56]:
# https://github.com/wkentaro/gdown
!pip install gdown



In [64]:
!gdown "https://drive.google.com/uc?id=0B0ZXk88koS2KbDhXdWg1Q2RydlU"

Downloading...
From: https://drive.google.com/uc?id=0B0ZXk88koS2KbDhXdWg1Q2RydlU
To: /content/ko.zip
80.6MB [00:00, 121MB/s] 


In [67]:
!unzip ko.zip

Archive:  ko.zip
  inflating: ko.bin                  
  inflating: ko.tsv                  


In [69]:
!ls -l

total 3770440
-rw-r--r-- 1 root root 3644258522 Mar  5  2015 GoogleNews-vectors-negative300.bin
-rw------- 1 root root   50697568 Dec 21  2016 ko.bin
-rw------- 1 root root   85362829 Dec 21  2016 ko.tsv
-rw-r--r-- 1 root root   80596565 Feb 24 02:20 ko.zip
drwxr-xr-x 1 root root       4096 Feb 22 14:38 sample_data


In [68]:
!rm -rf ko

In [70]:
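import gensim

# ko.bin is a saved gensim Word2Vec model (not the word2vec binary format used above),
# so it is loaded with Word2Vec.load(). It was saved with an older gensim release,
# so loading it may require gensim 3.x.
model = gensim.models.Word2Vec.load('ko.bin')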

In [71]:
result = model.wv.most_similar("강아지")
print(result)

[('고양이', 0.7290452718734741), ('거위', 0.7185635566711426), ('토끼', 0.7056223154067993), ('멧돼지', 0.6950401067733765), ('엄마', 0.6934334635734558), ('난쟁이', 0.6806551218032837), ('한마리', 0.6770296096801758), ('아가씨', 0.6750352382659912), ('아빠', 0.6729634404182434), ('목걸이', 0.6512460708618164)]
