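# Word Embedding
- Represents words as vectors learned by training an artificial neural network
- Produces a dense representation, in contrast to one-hot encoding (a sparse representation)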
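
To make the contrast concrete, here is a minimal sketch that builds both representations for the same word. The four-word vocabulary and the numbers in `embedding_table` are made up for illustration; real embedding values are learned during training, not written by hand.

```python
# Hypothetical toy vocabulary; the words and indices are assumptions for illustration.
vocab = {"king": 0, "queen": 1, "man": 2, "woman": 3}

def one_hot(word, vocab):
    """Sparse representation: a vector of vocabulary size with a single 1."""
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec

# Made-up dense embedding table (one row per word, 2 dimensions).
# Real values would come from training.
embedding_table = [
    [0.8, 0.1],
    [0.7, 0.9],
    [0.2, 0.1],
    [0.1, 0.9],
]

def dense(word, vocab, table):
    """Dense representation: look up the row of the table for this word."""
    return table[vocab[word]]

print(one_hot("man", vocab))                 # length grows with the vocabulary
print(dense("man", vocab, embedding_table))  # length is fixed by the embedding dimension
```

The one-hot vector has as many entries as there are words in the vocabulary, while the dense vector keeps a small fixed length regardless of vocabulary size.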

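## Word2Vec
A training method that represents the meaning of a word as a distributed representation, spread across the multiple dimensions of a vector.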

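### CBOW (Continuous Bag of Words)
- Predicts the center word from the surrounding (context) words
- Window: the number of context words considered on each side of the center word, so a window of size n uses up to 2n context words
- A neural network with a single hidden layer (the projection layer); it has no activation function and acts as a lookup table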
![nn](https://wikidocs.net/images/page/22660/word2vec_renew_2.PNG)
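=> M: embedding vector dimension, V: vocabulary size\
=> W (V × M) and W' (M × V) are learned from the data\
=> Each row vector of W is the embedding vector of one word\
=> The projection layer computes the average of the context word vectors\
- After training, either the rows of W alone, or both W and W', are used as the embedding vectors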
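
The forward pass described above can be sketched in plain Python as follows. This is an illustrative sketch, not the gensim implementation: the sizes `V` and `M`, the random initial weights, and the context indices are all assumed toy values, and the training step that would update `W` and `W_prime` is omitted. The softmax over the output scores follows the standard CBOW formulation.

```python
import math
import random

random.seed(0)

V, M = 6, 3  # vocabulary size and embedding dimension (toy values, assumed)

# W: V x M input-side weights (rows are embeddings); W_prime: M x V output-side weights.
W = [[random.uniform(-0.5, 0.5) for _ in range(M)] for _ in range(V)]
W_prime = [[random.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(M)]

def cbow_forward(context_ids, W, W_prime):
    """Return a probability distribution over the vocabulary for the center word."""
    m = len(W[0])
    v = len(W_prime[0])
    # Projection layer: multiplying a one-hot vector by W is just a row lookup,
    # and the looked-up context vectors are averaged (no activation function).
    hidden = [sum(W[i][d] for i in context_ids) / len(context_ids) for d in range(m)]
    # Output layer: scores = hidden x W', followed by softmax.
    scores = [sum(hidden[d] * W_prime[d][j] for d in range(m)) for j in range(v)]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Context word indices around a hypothetical center word (window = 2).
probs = cbow_forward([0, 1, 3, 4], W, W_prime)
print(len(probs), round(sum(probs), 6))
```

The result has one probability per vocabulary word; training would push the probability of the true center word toward 1.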

### Skip-gram
- Predicts the surrounding (context) words from the center word
- In terms of performance, Skip-gram is generally reported to outperform CBOW
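
The two methods differ mainly in how training examples are formed from the same sliding window. The sketch below shows that difference on a toy sentence; the function names `skipgram_pairs` and `cbow_examples` and the sentence are hypothetical, chosen only for illustration.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs for every position in the token list."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window):
    """Return (context_words, center) examples: the same windows, grouped per center."""
    examples = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        examples.append((context, center))
    return examples

sentence = ["the", "fat", "cat", "sat"]
print(skipgram_pairs(sentence, window=1))
print(cbow_examples(sentence, window=1))
```

Skip-gram produces one training pair per (center, context) combination, whereas CBOW produces one example per center word with all of its context words grouped together.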

## Hands-on Practice

### Building an English Word2Vec Model

In [1]:
from pathlib import Path
import re
import urllib.request
import zipfile
from lxml import etree
from nltk.tokenize import word_tokenize, sent_tokenize

## data download
data_dir = Path('C:/Users/sinjy/jupyter_notebook/github/data') / 'english_word2vec'
data_dir.mkdir(parents=True, exist_ok=True)  # also create missing parent directories

urllib.request.urlretrieve("https://raw.githubusercontent.com/ukairia777/tensorflow-nlp-tutorial/main/09.%20Word%20Embedding/dataset/ted_en-20160408.xml", 
                           filename=data_dir / "ted_en-20160408.xml")

(WindowsPath('C:/Users/sinjy/jupyter_notebook/github/data/english_word2vec/ted_en-20160408.xml'),
 <http.client.HTTPMessage at 0x22b4e424a88>)

In [2]:
## preprocessing
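# Parse the XML and keep only the text inside the <content> tags
with open(data_dir / "ted_en-20160408.xml", 'r', encoding='UTF8') as targetXML:
    target_text = etree.parse(targetXML)
parse_text = '\n'.join(target_text.xpath('//content/text()'))
# Remove parenthesized annotations such as (Laughter) or (Applause)
content_text = re.sub(r'\([^)]*\)', '', parse_text)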

sent_text = sent_tokenize(content_text)
sent_text

["Here are two reasons companies fail: they only do more of the same, or they only do what's new.",
 'To me the real, real solution to quality growth is figuring out the balance between two activities: exploration and exploitation.',
 'Both are necessary, but it can be too much of a good thing.',
 'Consider Facit.',
 "I'm actually old enough to remember them.",
 'Facit was a fantastic company.',
 'They were born deep in the Swedish forest, and they made the best mechanical calculators in the world.',
 'Everybody used them.',
 'And what did Facit do when the electronic calculator came along?',
 'They continued doing exactly the same.',
 'In six months, they went from maximum revenue ... and they were gone.',
 'Gone.',
 'To me, the irony about the Facit story is hearing about the Facit engineers, who had bought cheap, small electronic calculators in Japan that they used to double-check their calculators.',
 'Facit did too much exploitation.',
 'But exploration can go wild, too.',
 'A few

In [3]:
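normalized_text = []
for string in sent_text:
    # Lowercase and replace every run of non-alphanumeric characters with a space
    tokens = re.sub(r"[^a-z0-9]+", " ", string.lower())
    normalized_text.append(tokens)

# Tokenize each normalized sentence into a list of words
result = [word_tokenize(sentence) for sentence in normalized_text]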
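
To see what the cleaning steps do to a single sentence, here is a small self-contained sketch that applies the same two regular expressions. The helper name `clean_and_tokenize` is hypothetical, and `str.split()` stands in for `word_tokenize` so the example runs without NLTK; on text already reduced to lowercase letters, digits, and spaces the two should give the same tokens.

```python
import re

def clean_and_tokenize(text):
    # Drop parenthesized annotations such as "(Laughter)".
    text = re.sub(r'\([^)]*\)', '', text)
    # Lowercase and replace runs of non-alphanumeric characters with a space.
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    # str.split() is an assumption standing in for nltk's word_tokenize.
    return text.split()

print(clean_and_tokenize("I'm actually old enough to remember them. (Laughter)"))
# ['i', 'm', 'actually', 'old', 'enough', 'to', 'remember', 'them']
```

Note that the apostrophe is treated like any other punctuation, so a contraction such as "I'm" is split into the two tokens `i` and `m`.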

In [4]:
len(result)

273380

In [9]:
## training
from gensim.models import Word2Vec, KeyedVectors

model = Word2Vec(sentences=result, vector_size=100, window=5, min_count=5, workers=4, 
                sg=0) # vector_size: embedding vector dimension, sg=0: CBOW, sg=1: Skip-gram
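
Of the parameters above, `min_count` is easy to overlook: gensim ignores words whose total frequency is below it. The sketch below imitates that filtering on a made-up toy corpus; `build_vocab` is a hypothetical helper, not a gensim function.

```python
from collections import Counter

def build_vocab(sentences, min_count):
    """Keep only the words that occur at least min_count times in the corpus."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, count in counts.items() if count >= min_count}

# Hypothetical toy corpus of tokenized sentences.
toy_corpus = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog"]]
print(sorted(build_vocab(toy_corpus, min_count=2)))
# ['cat', 'the']
```

With `min_count=5` on the TED corpus, words seen fewer than five times get no embedding, so looking them up in the trained model raises a `KeyError`.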

In [10]:
model_result = model.wv.most_similar('man')
model_result

[('woman', 0.8571733832359314),
 ('guy', 0.8187478184700012),
 ('lady', 0.7912018895149231),
 ('boy', 0.7416342496871948),
 ('girl', 0.7373136281967163),
 ('soldier', 0.7339044213294983),
 ('gentleman', 0.728788435459137),
 ('kid', 0.6981143951416016),
 ('surgeon', 0.6497038006782532),
 ('writer', 0.6434357166290283)]
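
The scores returned by `most_similar` are cosine similarities between embedding vectors. A minimal sketch of that ranking, using made-up 2-dimensional vectors rather than the trained model, might look like this (the `most_similar` function here is a simplified stand-in, not gensim's implementation):

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, vectors, topn=3):
    """Rank all other words by cosine similarity to `word`."""
    target = vectors[word]
    scored = [(other, cosine_similarity(target, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up 2-dimensional vectors, for illustration only.
toy_vectors = {
    "man": [1.0, 0.2],
    "woman": [0.9, 0.3],
    "guy": [1.0, 0.0],
    "banana": [-0.2, 1.0],
}
print(most_similar("man", toy_vectors))
```

Words whose vectors point in nearly the same direction score close to 1, while unrelated directions score near 0.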

In [11]:
## save & load
model.wv.save_word2vec_format(data_dir / 'eng_w2v')
loaded_model = KeyedVectors.load_word2vec_format(data_dir / 'eng_w2v')
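
By default, `save_word2vec_format` writes a plain-text file: a header line with the vocabulary size and vector dimension, followed by one line per word containing the word and its vector values. The sketch below parses that layout by hand to show what the file holds; `parse_word2vec_text` is a hypothetical helper and the sample lines are made up.

```python
def parse_word2vec_text(lines):
    """Parse the plain-text word2vec format into a {word: vector} dict."""
    lines = iter(lines)
    # Header line: "<vocabulary size> <vector dimension>".
    vocab_size, dim = (int(x) for x in next(lines).split())
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for {word!r}, got {len(values)}")
        vectors[word] = values
    if len(vectors) != vocab_size:
        raise ValueError("header vocabulary size does not match the number of rows")
    return vectors

# Made-up file content in the same layout (2 words, 3 dimensions).
sample = ["2 3", "man 0.1 0.2 0.3", "woman 0.2 0.1 0.4"]
vectors = parse_word2vec_text(sample)
print(vectors["man"])
# [0.1, 0.2, 0.3]
```

In practice `KeyedVectors.load_word2vec_format` does this work; the sketch only illustrates why the saved file can be opened and inspected in a text editor.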

## Pre-trained Word2Vec Embeddings
When training data is scarce, embedding vectors that were trained in advance can be used instead.
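
One common way to use pre-trained vectors is to copy them into an embedding matrix for the task's own vocabulary. The following is a minimal sketch under assumed inputs: `word_index` is a hypothetical task vocabulary, `pretrained` is a plain dict of made-up vectors standing in for loaded embeddings, and leaving missing words as zero rows is just one possible policy.

```python
def build_embedding_matrix(word_index, pretrained, dim):
    """Row i holds the pre-trained vector of the word with index i, or zeros if missing."""
    matrix = [[0.0] * dim for _ in range(len(word_index))]
    missing = []
    for word, idx in word_index.items():
        vector = pretrained.get(word)
        if vector is None:
            missing.append(word)  # out-of-vocabulary: row stays all zeros
        else:
            matrix[idx] = list(vector)
    return matrix, missing

# Hypothetical task vocabulary and made-up "pre-trained" vectors.
word_index = {"man": 0, "woman": 1, "facit": 2}
pretrained = {"man": [0.1, 0.2], "woman": [0.2, 0.1]}
matrix, missing = build_embedding_matrix(word_index, pretrained, dim=2)
print(matrix)
print(missing)
```

Tracking the `missing` list is useful for checking how much of the task vocabulary the pre-trained vectors actually cover.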