# Word Embedding

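- **Word Embedding** is a technique that converts each word into a fixed-dimensional vector, trained so that the vectors reflect semantic similarity between words.
- It is used in natural language processing to process and understand sentences.
- Once words are expressed as numbers, downstream tasks such as sentiment extraction also become possible.
- Related words can be clustered and represented as vectors in a multi-dimensional space; this is the process of mapping words or sentences into a vector space.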

**Embedding Matrix Example**

*All vector values in the table below are obtained through machine learning.*

| Dimension | Man (5391) | Woman (9853) | King (4914) | Queen (7157) | Apple (456) | Orange (6257) |
|-----------|------------|--------------|-------------|--------------|-------------|---------------|
| Gender    | -1         | 1            | -0.95       | 0.97         | 0.00        | 0.01          |
| Royalty   | 0.01       | 0.02         | 0.93        | 0.95         | -0.01       | 0.00          |
| Age       | 0.03       | 0.02         | 0.7         | 0.69         | 0.03        | -0.02         |
| Food      | 0.04       | 0.01         | 0.02        | 0.01         | 0.95        | 0.97          |

<br>

*The same table, transposed so that each row is one word's vector:*

| Word          | Gender | Royalty | Age    | Food   |
|---------------|--------|---------|--------|--------|
| Man (5391)    | -1.00  | 0.01   | 0.03   | 0.04   |
| Woman (9853)  | 1.00   | 0.02   | 0.02   | 0.01   |
| King (4914)   | -0.95  | 0.93   | 0.70   | 0.02   |
| Queen (7157)  | 0.97   | 0.95   | 0.69   | 0.01   |
| Apple (456)   | 0.00   | -0.01  | 0.03   | 0.95   |
| Orange (6257) | 0.01   | 0.00   | -0.02  | 0.97   |
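The effect of such vectors can be checked with a small sketch. It copies the toy values from the table above and compares words with cosine similarity; the helper name `cosine_similarity` and the lowercase dictionary keys are illustrative choices, not part of the original notebook.

```python
import numpy as np

# Toy vectors copied from the table above (one value per row of the first table).
embeddings = {
    "man":    np.array([-1.00,  0.01,  0.03, 0.04]),
    "woman":  np.array([ 1.00,  0.02,  0.02, 0.01]),
    "king":   np.array([-0.95,  0.93,  0.70, 0.02]),
    "queen":  np.array([ 0.97,  0.95,  0.69, 0.01]),
    "apple":  np.array([ 0.00, -0.01,  0.03, 0.95]),
    "orange": np.array([ 0.01,  0.00, -0.02, 0.97]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1 = same direction, 0 = orthogonal)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two foods point in almost the same direction.
print(cosine_similarity(embeddings["apple"], embeddings["orange"]))
# A person and a food share almost no direction.
print(cosine_similarity(embeddings["man"], embeddings["apple"]))
```

The first value comes out close to 1 and the second close to 0, which is the sense in which semantically related words end up "near" each other.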

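- **Reflects semantic similarity**
  - Each word is represented as a fixed-size real-valued vector, and words with similar meanings are located close together in the vector space.
  - For example, "king" and "queen" are often used in similar contexts, so they are placed close to each other in the vector space.

- **Dense vector**
  - Unlike BoW, DTM, and TF-IDF, Word Embedding maps words to low-dimensional dense vectors that carry rich semantic information despite the low dimensionality.
  - The vector dimension is typically set to a few hundred at most, for example 100 or 300.

- **Reflects context information**
  - Word Embedding infers the meaning of a word by learning from the words that surround it.
  - For example, "bank" appearing near "river" is evidence for the "riverbank" sense, while "bank" near "money" is evidence for the "financial institution" sense. Note that a static embedding such as Word2Vec still stores a single vector per word, so both senses are blended into that one vector.

- **Learned vectors**
  - Word Embedding produces vectors by learning associations between words from large text corpora.
  - In contrast, BoW and TF-IDF are simple count-based (rule-based) vectorization methods.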

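### Sparse Representation | Distributed Representation
- A one-hot vector produced by one-hot encoding has a 1 only at the word's index and 0 everywhere else.
- Representing data with vectors or matrices in which most values are 0 is called a sparse representation.
- A drawback of sparse representation is that it cannot express meaningful similarity between word vectors.
- To address this, a distributed representation is used, which encodes the meaning of a word as a vector in a multi-dimensional space.
- Vectorizing the semantic similarity between words with a distributed representation is called word embedding, and the resulting vector is called an embedding vector.
- **One-hot encoding → sparse representation**
- **Word embedding → distributed representation**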

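**Distributed Representation**
- Distributed representation is based on the distributional hypothesis.
- The hypothesis states that "words that appear in similar contexts have similar meanings."
- For example, the word '강아지' (puppy) often appears together with words such as '귀엽다' (cute), '예쁘다' (pretty), and '애교' (charm); when vectorized, these words end up with similar vector values.
- A distributed representation spreads the meaning of a word across multiple dimensions.
- Unlike a one-hot vector, it does not need as many dimensions as the vocabulary size, so the dimensionality is relatively low.
- For example, if the vocabulary size is 10,000 and the index of '강아지' is 4 (counting from 0), the one-hot vector is:

- **강아지 = [0 0 0 0 1 0 0 ... 0]** (followed by 9,995 zeros in total)
- A vector embedded with Word2Vec, however, is independent of the vocabulary size and holds as many real values as the configured dimension:
- **강아지 = [0.2 0.3 0.5 0.7 0.2 ... 0.2]**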
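The contrast between the two representations can be sketched with NumPy. The vocabulary size and the index 4 follow the example above; the second word's index (7), the embedding dimension (100), and the random dense values are assumptions made only for illustration, since real embedding values come from training.

```python
import numpy as np

vocab_size = 10_000
dog_index = 4      # index of '강아지' from the example above
other_index = 7    # hypothetical index of some other word

def one_hot(index, size):
    """Sparse representation: all zeros except a single 1 at the word's index."""
    vec = np.zeros(size)
    vec[index] = 1.0
    return vec

dog_sparse = one_hot(dog_index, vocab_size)
other_sparse = one_hot(other_index, vocab_size)

# One-hot vectors of two different words are always orthogonal,
# so their dot product carries no similarity information.
print(dog_sparse @ other_sparse)

# Distributed representation: a short real-valued vector.
# Random numbers stand in for learned values here.
rng = np.random.default_rng(0)
embedding_dim = 100
dog_dense = rng.normal(size=embedding_dim)
print(dog_sparse.shape, dog_dense.shape)
```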

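**In summary,**
- A sparse representation describes a word in a high-dimensional space where each dimension stands alone, whereas a distributed representation spreads the meaning of a word over several dimensions of a low-dimensional space.
- This makes it possible to compute meaningful similarity between word vectors; Word2Vec is a representative training method.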

### Embedding Vector Visualization: wevi
https://ronxin.github.io/wevi/

### Word2Vec
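- A word embedding method developed at Google in 2013
- One of the earliest widely adopted neural embedding models (earlier neural language models also learned word vectors)
- Trained automatically on very large corpora
    - It can be regarded as self-supervised learning
    - Labels are derived from the raw data itself and then used for supervised training
- Example: the two words in bold appear in the same context, so the model learns similar vectors for them
    - Swore allegiance to the **isageum (이사금)**.
    - Swore allegiance to the **king (왕)**.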

**Word2Vec variants by training method**
1. CBOW: predicts the center word from the context words
2. Skip-gram: predicts the context words from the center word

##### CBOW (Continuous Bag of Words)  
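- CBOW takes one-hot vectors as input, but each one-hot vector merely points to a position (the word's index); it does not encode any information about the vocabulary itself.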

**Example:**

> The fat cat sat on the mat  

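The main task of CBOW here is to predict the word 'sat' in the given sentence.
- **Center word:** the word to predict ('sat')
- **Context words:** the words used for the prediction

The range that determines how many words before and after the center word are considered is called the **window**.
For example, with a window size of 2 and the center word 'sat', the two preceding words (fat, cat) and the two following words (on, the) are used as input.
With a window size of n, up to 2n context words are used (fewer near the start or end of a sentence). Moving the window across the text to generate training samples is called a **sliding window**.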
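The sliding-window procedure can be sketched in a few lines of plain Python. The function name `cbow_samples` is hypothetical and this is only an illustration of how (context, center) training samples are formed, not how any particular library implements it.

```python
def cbow_samples(tokens, window=2):
    """Slide a window over the tokens and collect (context words, center word) samples."""
    samples = []
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]      # up to `window` words before the center
        right = tokens[i + 1:i + 1 + window]     # up to `window` words after the center
        samples.append((left + right, center))
    return samples

tokens = "The fat cat sat on the mat".split()
for context, center in cbow_samples(tokens, window=2):
    print(context, "->", center)
```

For the center word 'sat' the context is `['fat', 'cat', 'on', 'the']`, while the first word 'The' only has the two words to its right as context.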

![](https://wikidocs.net/images/page/22660/%EB%8B%A8%EC%96%B4.PNG)


**Training process**

CBOW is structured so that embedding vectors are learned as network weights. The weights start at random values and are optimized through backpropagation.

![](https://wikidocs.net/images/page/22660/word2vec_renew_1.PNG)

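Word2Vec uses a shallow neural network with only one hidden layer.
Two main weight matrices are learned:

1. **Projection layer:**
   - It has no activation function and performs a lookup-table operation.
   - The weight W between the input layer and the projection layer is a V × M matrix, where **V is the vocabulary size and M is the vector dimension**.
   - After training, each row of W is taken as the M-dimensional embedding vector of a word.
   - For example, if the vector dimension is set to 5, each word's embedding vector is 5-dimensional.

2. **Output layer:**
   - The weight W' between the projection layer and the output layer is an M × V matrix.
   - The two matrices W and W' are independent of each other (W' is not the transpose of W) and are initialized with random values before training.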
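Why the projection layer is called a lookup table can be seen in a short NumPy sketch: multiplying a one-hot vector by W returns exactly one row of W. The sizes V = 7 and M = 5 and the word index 2 are arbitrary values chosen for illustration.

```python
import numpy as np

V, M = 7, 5                      # vocabulary size and embedding dimension (illustrative)
rng = np.random.default_rng(0)
W = rng.normal(size=(V, M))      # input-to-projection weights, randomly initialized

index = 2                        # hypothetical index of some word
x = np.zeros(V)
x[index] = 1.0                   # one-hot input vector

via_matmul = x @ W               # matrix product of the one-hot vector with W
via_lookup = W[index]            # simply reading row `index` of W

print(np.allclose(via_matmul, via_lookup))
```

Because the two results are identical, implementations can skip the multiplication and index into W directly.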

![](https://wikidocs.net/images/page/22660/word2vec_renew_3.PNG)


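**Prediction process**
1. CBOW averages the vectors looked up for the context words, then multiplies the average by the output-layer weight W'.
2. The result is passed through the **softmax** function, which turns it into a probability for each word of being the center word.
3. The resulting prediction (score vector) is compared with the one-hot vector of the actual target, and the loss is computed with the **cross-entropy** function.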

![](https://wikidocs.net/images/page/22660/word2vec_renew_5.PNG)

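Loss function:

$$
cost(\hat{y}, y) = -\sum_{j=1}^{V} y_{j} \log(\hat{y}_{j})
$$

Here, $\hat{y}_{j}$ is the predicted probability for word $j$, $y_{j}$ is the actual value (the $j$-th element of the target one-hot vector), and $V$ is the vocabulary size.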
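The prediction steps and the loss can be traced end to end with NumPy. This is a minimal sketch under assumed values: the sizes (V = 7, M = 5), the word indices, and the name `W_out` (standing for W') are illustrative, and the weights are random rather than trained.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

V, M = 7, 5
rng = np.random.default_rng(0)
W = rng.normal(size=(V, M))        # input -> projection weights (V x M)
W_out = rng.normal(size=(M, V))    # projection -> output weights W' (M x V)

context_ids = [1, 2, 4, 5]         # hypothetical indices of the context words
center_id = 3                      # hypothetical index of the center word

h = W[context_ids].mean(axis=0)    # step 1: average of the looked-up vectors, shape (M,)
scores = h @ W_out                 # step 1: multiply by W', shape (V,)
y_hat = softmax(scores)            # step 2: probabilities over the vocabulary

y = np.zeros(V)
y[center_id] = 1.0                 # one-hot vector of the actual center word
loss = -np.sum(y * np.log(y_hat))  # step 3: cross-entropy

print(y_hat.sum(), loss)
```

Since y is one-hot, the sum collapses to a single term: the loss equals the negative log of the probability assigned to the true center word.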


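**Training result**
- The weights W and W' are learned through backpropagation.
- After training, each row of W can be used as a word's embedding vector, or both W and W' can be combined to produce the embedding vectors.
- Because CBOW predicts the center word from its context words, it learns semantic relationships between words effectively.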
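How backpropagation updates W and W' can be illustrated with a single-sample gradient step written in NumPy. This is a simplified sketch using the full softmax and plain SGD; the function name `cbow_step`, the learning rate, the sizes, and the word indices are all assumptions, and practical implementations use faster approximations.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cbow_step(W, W_out, context_ids, center_id, lr=0.1):
    """Run one SGD step on one (context, center) sample; return the loss before the update."""
    h = W[context_ids].mean(axis=0)            # projection: average of context vectors
    y_hat = softmax(h @ W_out)                 # predicted distribution over the vocabulary
    loss = -np.log(y_hat[center_id])           # cross-entropy against the one-hot target

    grad_scores = y_hat.copy()
    grad_scores[center_id] -= 1.0              # dL/dscores = y_hat - y
    grad_W_out = np.outer(h, grad_scores)      # gradient for W', shape (M, V)
    grad_h = W_out @ grad_scores               # gradient flowing back to the projection, shape (M,)

    W_out -= lr * grad_W_out
    for i in context_ids:                      # each context row contributed 1/len(context) to h
        W[i] -= lr * grad_h / len(context_ids)
    return loss

rng = np.random.default_rng(0)
V, M = 7, 5
W = rng.normal(scale=0.1, size=(V, M))
W_out = rng.normal(scale=0.1, size=(M, V))

losses = [cbow_step(W, W_out, [1, 2, 4, 5], 3) for _ in range(200)]
print(losses[0], losses[-1])
```

Repeating the step on the same sample drives the loss down, which shows both matrices being adjusted in place.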

##### Skip-gram
- Skip-gram predicts the context words from the center word.
- With a window size of 2, the dataset is constructed as follows.

![](https://wikidocs.net/images/page/22660/skipgram_dataset.PNG)

![](https://wikidocs.net/images/page/22660/word2vec_renew_6.PNG)

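- Since the context words are predicted from a single center word, there is no step that averages vectors in the projection layer.
- Comparisons reported in several papers suggest that Skip-gram generally performs better than CBOW.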
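The Skip-gram dataset can be sketched the same way as the CBOW one, except that each sample is a single (center, context) pair. The function name `skipgram_pairs` is hypothetical and the code only illustrates the pairing rule for a window size of 2.

```python
def skipgram_pairs(tokens, window=2):
    """Collect one (center word, context word) pair per neighbor inside the window."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:                          # skip the center word itself
                pairs.append((center, tokens[j]))
    return pairs

tokens = "The fat cat sat on the mat".split()
pairs = skipgram_pairs(tokens, window=2)
print([pair for pair in pairs if pair[0] == "sat"])
```

For 'sat' this yields four separate training pairs, one for each of fat, cat, on, and the, where CBOW would form a single sample from the same window.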

In [27]:
# !pip install gensim

##### English Word Embedding

- Data acquisition and preprocessing

In [28]:
import gdown

url = "https://drive.google.com/uc?id=1DCgLPJsfyLGZ99lB-aF8EvpKIWSZYgp4"
output = "ted_en.xml"

# gdown.download(url, output)

In [29]:
from lxml import etree
import re
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords

In [30]:
# Parse the XML data
with open('ted_en.xml', 'r', encoding='utf-8') as f:
    xml = etree.parse(f)

contents = xml.xpath('//content/text()')    # text under the <content> tags
# contents[:5]

corpus = '\n'.join(contents)
print(len(corpus))

# Use a regex to remove parenthesized annotations such as (Laughter) and (Applause)
corpus = re.sub(r'\([^)]*\)', '', corpus)
print(len(corpus))

24222849
24062319


In [31]:
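# Preprocessing (tokenization / case normalization / stopword removal)
# Note: the NLTK tokenizers and stopword list need the 'punkt' (or 'punkt_tab' on recent NLTK versions)
# and 'stopwords' resources to be downloaded beforehand
sentences = sent_tokenize(corpus)

preprocessed_sentences = []
en_stopwords = set(stopwords.words('english'))    # set for fast membership checks

for sentence in sentences:
    sentence = sentence.lower()
    sentence = re.sub(r'[^a-z0-9]', ' ', sentence)    # replace everything except lowercase letters and digits with a space
    tokens = word_tokenize(sentence)
    tokens = [token for token in tokens if token not in en_stopwords]
    preprocessed_sentences.append(tokens)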

preprocessed_sentences[:5]

[['two', 'reasons', 'companies', 'fail', 'new'],
 ['real',
  'real',
  'solution',
  'quality',
  'growth',
  'figuring',
  'balance',
  'two',
  'activities',
  'exploration',
  'exploitation'],
 ['necessary', 'much', 'good', 'thing'],
 ['consider', 'facit'],
 ['actually', 'old', 'enough', 'remember']]

- Training the embedding model

In [32]:
from gensim.models import Word2Vec

model = Word2Vec(
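    sentences=preprocessed_sentences, # corpus (list of token lists)
    vector_size=100,                  # embedding vector dimension
    sg=0,                             # training algorithm (0=CBOW, 1=Skip-gram)
    window=5,                         # context window size (n words on each side)
    min_count=5                       # minimum frequency (words occurring fewer than n times are dropped)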
)

model.wv.vectors.shape

(21462, 100)

In [33]:
import pandas as pd

pd.DataFrame(model.wv.vectors, index=model.wv.index_to_key).head(10)

| word   | 0         | 1         | 2         | ... | 98        | 99        |
|--------|-----------|-----------|-----------|-----|-----------|-----------|
| one    | -0.507626 | -0.442687 | -1.167928 | ... | 0.692131  | 0.848384  |
| people | -1.427202 | 0.52937   | -0.490544 | ... | -1.717081 | 0.895945  |
| like   | -0.207078 | -0.13114  | -0.775172 | ... | 0.840729  | -0.541419 |
| know   | -0.78223  | 0.082809  | -0.382945 | ... | 0.746383  | -0.223859 |
| going  | -1.130339 | 0.754498  | 0.083723  | ... | -1.113422 | 0.255285  |
| think  | -0.50907  | -0.412272 | 0.692741  | ... | 0.042337  | -0.116391 |
| see    | 0.257161  | -0.129006 | 0.482975  | ... | 0.427333  | 0.14972   |
| would  | 0.399128  | 0.47384   | 0.767682  | ... | -1.01382  | -1.074842 |
| really | -2.467759 | -0.681253 | -0.160718 | ... | -0.406554 | -0.113943 |
| get    | -2.240158 | -1.401547 | -1.398121 | ... | -0.912927 | 0.555738  |


In [34]:
# Save the trained embedding vectors (word2vec format)
model.wv.save_word2vec_format('ted_en_w2v')

In [35]:
# Load the saved embedding vectors
from gensim.models import KeyedVectors

load_model = KeyedVectors.load_word2vec_format('ted_en_w2v')

- Computing similarity

In [36]:
model.wv.most_similar('man')
# model.wv.most_similar('abracadabra')    # raises KeyError when the word is not in the embedding vocabulary

[('woman', 0.9017196893692017),
 ('daughter', 0.7984911203384399),
 ('girl', 0.7933292984962463),
 ('lady', 0.7810963988304138),
 ('boy', 0.777847409248352),
 ('son', 0.7718625664710999),
 ('father', 0.7517499327659607),
 ('grandfather', 0.7490587830543518),
 ('grandmother', 0.7343164086341858),
 ('sister', 0.7308599352836609)]

In [None]:
load_model.most_similar('man')    # Word2Vec.wv is a KeyedVectors instance, so the results match

[('woman', 0.9017196893692017),
 ('daughter', 0.7984911203384399),
 ('girl', 0.7933292984962463),
 ('lady', 0.7810963988304138),
 ('boy', 0.777847409248352),
 ('son', 0.7718625664710999),
 ('father', 0.7517499327659607),
 ('grandfather', 0.7490587830543518),
 ('grandmother', 0.7343164086341858),
 ('sister', 0.7308599352836609)]

In [37]:
model.wv.similarity('man', 'husband')

0.7067673

In [38]:
model.wv['man']

array([ 0.5571891 , -0.24520032,  1.2072326 ,  1.6726993 , -0.9124343 ,
       -0.20811436, -0.53903896,  1.4176455 , -0.46985295, -0.9732159 ,
        0.10648848,  0.69618565, -0.18309641,  0.5128085 ,  0.9872698 ,
       -0.5602139 ,  0.7148856 ,  0.09455778, -1.0495578 , -0.03309458,
        0.7899805 ,  0.9872811 , -0.12219412, -0.5620203 ,  0.4235168 ,
        0.40387183, -0.8787664 , -0.6562188 , -0.2712405 ,  1.2257813 ,
       -1.1729699 , -1.162402  ,  0.01518461, -1.4339955 , -0.10510466,
        1.0898159 , -0.22241019, -0.42248482,  0.72029775, -0.03143048,
        1.1148102 ,  0.28222924,  0.74960387,  0.5037726 ,  1.87225   ,
        0.10271132, -0.89295954,  0.4083665 ,  0.10096797, -0.36101657,
        0.80612326, -0.4279562 ,  0.20755748, -0.9981928 ,  0.17306508,
        0.57417077,  0.25590542,  0.44442192, -0.23800258, -0.1565756 ,
       -0.05000425, -0.7322015 , -1.7507703 ,  1.1291964 , -0.98587924,
        0.78607404,  0.04865063,  0.25423446,  0.54226995,  1.55

- Embedding visualization

https://projector.tensorflow.org/

- embedding vector (tensor) file (.tsv)
- metadata file (.tsv)

In [41]:
!python -m gensim.scripts.word2vec2tensor --input ted_en_w2v --output ted_en_w2v

2025-02-20 12:06:26,054 - word2vec2tensor - INFO - running c:\Users\USER\anaconda3\envs\pystudy_env\Lib\site-packages\gensim\scripts\word2vec2tensor.py --input ted_en_w2v --output ted_en_w2v
2025-02-20 12:06:26,054 - keyedvectors - INFO - loading projection weights from ted_en_w2v
2025-02-20 12:06:27,173 - utils - INFO - KeyedVectors lifecycle event {'msg': 'loaded (21462, 100) matrix of type float32 from ted_en_w2v', 'binary': False, 'encoding': 'utf8', 'datetime': '2025-02-20T12:06:26.914735', 'gensim': '4.3.3', 'python': '3.12.8 | packaged by Anaconda, Inc. | (main, Dec 11 2024, 16:48:34) [MSC v.1929 64 bit (AMD64)]', 'platform': 'Windows-11-10.0.26100-SP0', 'event': 'load_word2vec_format'}
2025-02-20 12:06:27,903 - word2vec2tensor - INFO - 2D tensor file saved to ted_en_w2v_tensor.tsv
2025-02-20 12:06:27,903 - word2vec2tensor - INFO - Tensor metadata file saved to ted_en_w2v_metadata.tsv
2025-02-20 12:06:27,904 - word2vec2tensor - INFO - finished running word2vec2tensor.py
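To make the two files more concrete, the sketch below writes the same kind of pair by hand for a toy dictionary: one TSV with tab-separated vector values per row, and one metadata file with the matching word on each row. The helper `write_projector_files` and the toy vectors are hypothetical; only the `_tensor.tsv` and `_metadata.tsv` suffixes are taken from the log above, and this is not gensim's own implementation.

```python
import csv
import os
import tempfile

# Hypothetical toy vectors standing in for a trained model's word vectors.
toy_vectors = {"man": [0.1, 0.2, 0.3], "woman": [0.2, 0.1, 0.4]}

def write_projector_files(vectors, prefix):
    """Write <prefix>_tensor.tsv (one vector per row) and <prefix>_metadata.tsv (one word per row)."""
    tensor_path = f"{prefix}_tensor.tsv"
    metadata_path = f"{prefix}_metadata.tsv"
    with open(tensor_path, "w", encoding="utf-8", newline="") as tensor_file, \
         open(metadata_path, "w", encoding="utf-8", newline="") as metadata_file:
        writer = csv.writer(tensor_file, delimiter="\t", lineterminator="\n")
        for word, vector in vectors.items():
            writer.writerow(vector)              # row i of the tensor file ...
            metadata_file.write(word + "\n")     # ... corresponds to row i of the metadata file
    return tensor_path, metadata_path

out_dir = tempfile.mkdtemp()
tensor_path, metadata_path = write_projector_files(toy_vectors, os.path.join(out_dir, "toy_w2v"))

with open(tensor_path, encoding="utf-8") as f:
    print(f.read())
with open(metadata_path, encoding="utf-8") as f:
    print(f.read())
```

The row order is what ties the two files together, so both must list the words in the same sequence.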


##### Korean Word Embedding
- NSMC (Naver Sentiment Movie Corpus)

In [42]:
import numpy as np
import pandas as pd
import urllib.request
from konlpy.tag import Okt

In [None]:
# Download the data
urllib.request.urlretrieve(
    "https://raw.githubusercontent.com/e9t/nsmc/master/ratings.txt",
    filename="naver_movie_ratings.txt"
)

('naver_movie_ratings.txt', <http.client.HTTPMessage at 0x24750a24ef0>)

In [None]:
# Create the DataFrame
ratings_df = pd.read_csv('naver_movie_ratings.txt', sep='\t')

In [None]:
# Check for missing values and drop the affected rows
display(ratings_df.isnull().sum())

ratings_df = ratings_df.dropna(how='any')

id          0
document    8
label       0
dtype: int64

In [47]:
ratings_df['document'][200:300]

200    많은 생각을 할 수 있는 영화~ 시간여행류의 스토리를 좋아하는 사람이라면 빠트릴 수...
201    고소한 19 정말 재미있게 잘 보고 있습니다^^ 방송만 보면 털털하고 인간적이신 것...
202                                                  가연세
203                         goodgoodgoodgoodgoodgoodgood
204                                           이물감. 시 같았다
                             ...                        
295                                   박력넘치는 스턴트 액션 평작이다!
296                                      엄청 재미있다 명작이다 ~~
297    나는 하정우랑 개그코드가 맞나보다 엄청 재밌게봤네요 특히 단발의사샘 장면에서 계속 ...
298                                                적당 ㅎㅎ
299                                    배경이 이쁘고 캐릭터도 귀엽네~
Name: document, Length: 100, dtype: object

In [48]:
# Remove every character that is not Hangul, a digit, or whitespace
ratings_df['document'] = ratings_df['document'].replace(r'[^0-9가-힣ㄱ-ㅎㅏ-ㅣ\s]', '', regex=True)

In [49]:
# Preprocessing
from tqdm import tqdm    # progress bar

okt = Okt()
ko_stopwords = ['은', '는', '이', '가', '을', '를', '와', '과', '들', '도',
                '부터', '까지', '에', '나', '너', '그', '걔', '얘']    # Korean stopwords

preprocessed_data = []

for sentence in tqdm(ratings_df['document']):
    tokens = okt.morphs(sentence, stem=True)
    tokens = [token for token in tokens if token not in ko_stopwords]
    preprocessed_data.append(tokens)

100%|██████████| 199992/199992 [08:59<00:00, 370.65it/s]


In [50]:
model = Word2Vec(
    sentences=preprocessed_data,
    vector_size=100,
    window=5,
    min_count=5,
    sg=0    # CBOW
)

model.wv.vectors.shape

(16841, 100)

In [51]:
model.wv.most_similar('극장')

[('영화관', 0.9396284222602844),
 ('틀어주다', 0.8187416195869446),
 ('케이블', 0.7927759289741516),
 ('학교', 0.7759482264518738),
 ('티비', 0.7276421785354614),
 ('투니버스', 0.695606529712677),
 ('개봉관', 0.6917786002159119),
 ('인터넷', 0.683275043964386),
 ('메가박스', 0.6818107962608337),
 ('방금', 0.6805811524391174)]

In [54]:
model.wv.similarity('김혜수', '박서준')

0.7329322

In [56]:
# Save the model
model.wv.save_word2vec_format('naver_movie_ratings_w2v')

In [57]:
!python -m gensim.scripts.word2vec2tensor --input naver_movie_ratings_w2v --output naver_movie_ratings_w2v

2025-02-20 12:42:01,986 - word2vec2tensor - INFO - running c:\Users\USER\anaconda3\envs\pystudy_env\Lib\site-packages\gensim\scripts\word2vec2tensor.py --input naver_movie_ratings_w2v --output naver_movie_ratings_w2v
2025-02-20 12:42:01,986 - keyedvectors - INFO - loading projection weights from naver_movie_ratings_w2v
2025-02-20 12:42:03,001 - utils - INFO - KeyedVectors lifecycle event {'msg': 'loaded (16841, 100) matrix of type float32 from naver_movie_ratings_w2v', 'binary': False, 'encoding': 'utf8', 'datetime': '2025-02-20T12:42:02.718774', 'gensim': '4.3.3', 'python': '3.12.8 | packaged by Anaconda, Inc. | (main, Dec 11 2024, 16:48:34) [MSC v.1929 64 bit (AMD64)]', 'platform': 'Windows-11-10.0.26100-SP0', 'event': 'load_word2vec_format'}
2025-02-20 12:42:03,576 - word2vec2tensor - INFO - 2D tensor file saved to naver_movie_ratings_w2v_tensor.tsv
2025-02-20 12:42:03,576 - word2vec2tensor - INFO - Tensor metadata file saved to naver_movie_ratings_w2v_metadata.tsv
2025-02-20 12:42: