# Word2Vec

* Word2Vec is a neural-network based method to learn *word embeddings*.
* It converts words into numerical vectors that capture semantic relationships (e.g., synonyms and analogies).
* It is based on the idea that __similar words appear in similar contexts__ (words with similar meanings are more likely to occur in similar contexts).
* Word2Vec has two main approaches:
    - Continuous Bag of Words (CBOW): Predicts the center word based on its surrounding words.
    - Skip-gram: Predicts surrounding words based on the center word.
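
The difference between the two approaches is easiest to see in the training examples each one builds from a sentence. The following is a minimal sketch, not gensim's implementation: the helper `context_pairs` and the toy sentence are made up for illustration, and refinements used in practice are ignored.

```python
def context_pairs(tokens, window=2):
    """Yield (center, context_words) for every position in a tokenized sentence."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right


sentence = ["alice", "was", "very", "tired"]  # toy sentence, illustration only

# CBOW: the context words are the input, the center word is the target.
cbow_examples = [
    (context, center) for center, context in context_pairs(sentence, window=1)
]

# Skip-gram: the center word is the input, each context word is a separate target.
skipgram_examples = [
    (center, ctx)
    for center, context in context_pairs(sentence, window=1)
    for ctx in context
]

print(cbow_examples)
print(skipgram_examples)
```

With the same window, CBOW produces one example per position, while Skip-gram produces one example per (center, context) pair.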

In [2]:
"""
    Gensim is a widely used library for topic modeling (which we do not cover in this course).
    It also supports various word embedding models,
     which is sufficient for the demonstrations in this notebook.

    Reference: https://radimrehurek.com/gensim/
"""

!pip show gensim  # `pip install gensim` if not installed

Name: gensim
Version: 4.3.3
Summary: Python framework for fast Vector Space Modelling
Home-page: https://radimrehurek.com/gensim/
Author: Radim Rehurek
Author-email: me@radimrehurek.com
License: LGPL-2.1-only
Location: /usr/local/lib/python3.10/dist-packages
Requires: numpy, scipy, smart-open
Required-by: 


In [3]:
import os
import re
import nltk
import gensim

from gensim.utils import simple_preprocess  # helps preprocessing and tokenization
from nltk.tokenize import sent_tokenize     # used for splitting long documents into sentences.

In [4]:
nltk.download('punkt')  # download the 'punkt' tokenizer models

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## 1. Data: [Alice’s Adventures in Wonderland](https://www.gutenberg.org/files/11/11-0.txt)

### 1-1. Load the corpus

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
%cd "/content/drive/MyDrive/Colab Notebooks/2024-2 딥러닝/E. Vector Representations of Text/"

/content/drive/MyDrive/Colab Notebooks/2024-2 딥러닝/E. Vector Representations of Text


In [7]:
%ls

'Alice’s Adventures in Wonderland_text file.txt'   word2vec_model2.model   word2vec_model.model
 word2vec_from_scratch_using_gensim.ipynb          word2vec_model3


In [8]:
# txt_file = "./Alice’s Adventures in Wonderland_text file.txt"
txt_file = "/content/drive/My Drive/Colab Notebooks/2024-2 딥러닝/E. Vector Representations of Text/Alice’s Adventures in Wonderland_text file.txt"
assert os.path.isfile(txt_file), "Make sure you fix the path to the file."

In [9]:
with open(txt_file, 'r', encoding='utf-8') as file:
    corpus = file.readlines()  # read every line of the file into a list: [line1, line2, ...]
    corpus = corpus[30:3380]   # keep only lines 31 through 3380 (1-based)

corpus = ''.join(corpus)       # concatenate the lines into one string; each line keeps its trailing '\n'
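
The slice-and-join step can be checked on a small in-memory list. This is only an illustrative sketch; the `lines` list is made up and stands in for the result of `readlines()`.

```python
# Stand-in for file.readlines(): every element ends with a newline.
lines = ["header\n", "first line\n", "second line\n", "footer\n"]

# Slicing is zero-based and excludes the stop index: [1:3] keeps items 1 and 2.
body = lines[1:3]

# Joining with an empty separator keeps each line's trailing newline.
text = "".join(body)

print(repr(text))
```

Because the newlines survive the join, they still have to be replaced during preprocessing.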

In [10]:
print(corpus[:100])  # first 100 characters

CHAPTER I.
Down the Rabbit-Hole


Alice was beginning to get very tired of sitting by her sister on 


In [11]:
print(corpus[-100:])  # last 100 characters

nd a pleasure in all
their simple joys, remembering her own child-life, and the happy summer
days.




### 1-2. Preprocess the corpus

In [12]:
# [^...] matches any character NOT listed: letters (a-zA-Z), whitespace (\s), and . , ? ! - are kept; everything else is removed.
corpus = re.sub(r"[^a-zA-Z\s.,?!-]", '', corpus)
corpus = corpus.replace('\n', ' ')      # replace newlines with spaces
corpus = re.sub(r"\s\s+", " ", corpus)  # collapse two or more consecutive whitespace characters into one space (e.g., 'the    fox' -> 'the fox')

In [13]:
print(corpus)  # it's still a very long string
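
To see what the three cleaning steps do, it helps to run them on a short string. The sketch below assumes the intent is to keep only letters, whitespace, and the punctuation `. , ? ! -`; the `sample` string is made up for illustration.

```python
import re

sample = 'She said, "It\'s 3 o\'clock!"\nAnd   then   left.'

# Keep only letters, whitespace, and the punctuation . , ? ! -
cleaned = re.sub(r"[^a-zA-Z\s.,?!-]", "", sample)
# Turn newlines into spaces, then collapse runs of whitespace.
cleaned = cleaned.replace("\n", " ")
cleaned = re.sub(r"\s\s+", " ", cleaned)

print(cleaned)
```

Note that apostrophes and quotation marks are deleted rather than replaced by spaces, so contractions are fused into a single token. This is why entries such as `dont` and `im` show up in the vocabulary later on.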



### 1-3. Split corpus into sentences

In [16]:
# from nltk.tokenize import sent_tokenize
# If this raises a LookupError, run nltk.download('punkt_tab') first.

sentences = sent_tokenize(corpus)  # split the corpus into sentences at sentence-ending punctuation (periods, question marks, exclamation marks)
print(len(sentences))

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


1625


## Word-level tokenization, punctuation removal, and lowercasing

In [17]:
"""
    We still need to:
        - tokenize each sentence into a list of words,
        - remove the punctuations,
        - and lowercase each word.
"""

sentences[:10]  # we still need to remove the punctuations, and lowercase the words

['CHAPTER I.',
 'Down the Rabbit-Hole Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, and what is the use of a book, thought Alice without pictures or conversations?',
 'So she was considering in her own mind as well as she could, for the hot day made her feel very sleepy and stupid, whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her.',
 'There was nothing so very remarkable in that nor did Alice think it so very much out of the way to hear the Rabbit say to itself, Oh dear!',
 'Oh dear!',
 'I shall be late!',
 'when she thought it over afterwards, it occurred to her that she ought to have wondered at this, but at the time it all seemed quite natural but when the Rabbit actually took a watch out of i

### 1-4. Tokenize sentences into a list of words.

In [18]:
# from gensim.utils import simple_preprocess -> tokenizes into words, removes punctuation, and lowercases in one call.
# By default it also drops tokens shorter than 2 characters, so 'a' and 'I' disappear.

tokenized_sentences = [simple_preprocess(sent) for sent in sentences]  # e.g., "This is a test sentence." -> ['this', 'is', 'test', 'sentence']

In [19]:
for i, sent in enumerate(tokenized_sentences):
    print(f"({i:>2})", sent)  # i:>2 -> right-align the index in a field of width 2
    if i == 30:  # stop after the first 31 sentences (indices 0-30)
        break

( 0) ['chapter']
( 1) ['down', 'the', 'rabbit', 'hole', 'alice', 'was', 'beginning', 'to', 'get', 'very', 'tired', 'of', 'sitting', 'by', 'her', 'sister', 'on', 'the', 'bank', 'and', 'of', 'having', 'nothing', 'to', 'do', 'once', 'or', 'twice', 'she', 'had', 'peeped', 'into', 'the', 'book', 'her', 'sister', 'was', 'reading', 'but', 'it', 'had', 'no', 'pictures', 'or', 'conversations', 'in', 'it', 'and', 'what', 'is', 'the', 'use', 'of', 'book', 'thought', 'alice', 'without', 'pictures', 'or', 'conversations']
( 2) ['so', 'she', 'was', 'considering', 'in', 'her', 'own', 'mind', 'as', 'well', 'as', 'she', 'could', 'for', 'the', 'hot', 'day', 'made', 'her', 'feel', 'very', 'sleepy', 'and', 'stupid', 'whether', 'the', 'pleasure', 'of', 'making', 'daisy', 'chain', 'would', 'be', 'worth', 'the', 'trouble', 'of', 'getting', 'up', 'and', 'picking', 'the', 'daisies', 'when', 'suddenly', 'white', 'rabbit', 'with', 'pink', 'eyes', 'ran', 'close', 'by', 'her']
( 3) ['there', 'was', 'nothing', 'so'
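
The output above shows that `'CHAPTER I.'` became `['chapter']` and that the article `a` is missing from sentence 1. Both follow from `simple_preprocess` discarding very short tokens by default. The function below is a rough pure-Python stand-in written for illustration (the name `approx_simple_preprocess` is hypothetical and it only handles ASCII letters); it is not gensim's code.

```python
import re


def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Rough stand-in: lowercase, split on non-letters, drop very short or long tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tok for tok in tokens if min_len <= len(tok) <= max_len]


print(approx_simple_preprocess("CHAPTER I."))
print(approx_simple_preprocess("What is the use of a book, thought Alice?"))
```

If single-character words matter for a task, the length limits are the first thing to check.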

## 2. Word2Vec

### 2-1. Model Training

In [20]:
from gensim.models import Word2Vec

# Train the Word2Vec model
model = Word2Vec(
    sentences=tokenized_sentences,  # word-tokenized sentences
    vector_size=100,                # dimensionality of the embedding vectors (both input and output embeddings)
    window=5,                       # context window size
    min_count=2,                    # ignore words that occur fewer than this many times
    workers=4,                      # number of worker threads used to parallelize training
    sg=0,                           # CBOW=0, Skip-gram=1
    epochs=300,                     # the larger the number of epochs, the longer the training
)

# Check the vocabulary size
print("Vocabulary size:", len(model.wv.key_to_index))

Vocabulary size: 1476
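
The vocabulary contains only words that pass the `min_count` threshold, so it is smaller than the number of distinct tokens in the corpus. A minimal sketch of that filtering, using a made-up toy corpus and plain `collections.Counter` rather than gensim's own bookkeeping:

```python
from collections import Counter

# Toy corpus, illustration only.
toy_sentences = [
    ["the", "rabbit", "ran"],
    ["the", "queen", "ran"],
    ["the", "hatter"],
]
min_count = 2

counts = Counter(word for sent in toy_sentences for word in sent)

# Keep words that occur at least `min_count` times, most frequent first.
kept = [word for word, freq in counts.most_common() if freq >= min_count]

print(kept)
```

Words that occur only once (`rabbit`, `queen`, `hatter` here) get no embedding at all, and looking them up in the trained model raises a `KeyError`.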


### 2-2. Vocabulary

#### Looking up the index of a word

In [21]:
"""
    In NLP with deep learning, it is standard to represent
    the vocabulary as a Python dictionary, where the keys are
    the 'words', and the values are integer indices.
"""

model.wv.key_to_index  # dictionary

{'the': 0,
 'and': 1,
 'to': 2,
 'she': 3,
 'it': 4,
 'of': 5,
 'said': 6,
 'alice': 7,
 'in': 8,
 'you': 9,
 'was': 10,
 'that': 11,
 'as': 12,
 'her': 13,
 'at': 14,
 'on': 15,
 'with': 16,
 'all': 17,
 'had': 18,
 'but': 19,
 'for': 20,
 'so': 21,
 'be': 22,
 'not': 23,
 'very': 24,
 'what': 25,
 'this': 26,
 'little': 27,
 'they': 28,
 'he': 29,
 'out': 30,
 'its': 31,
 'is': 32,
 'one': 33,
 'down': 34,
 'up': 35,
 'his': 36,
 'if': 37,
 'about': 38,
 'then': 39,
 'no': 40,
 'were': 41,
 'like': 42,
 'know': 43,
 'them': 44,
 'went': 45,
 'herself': 46,
 'would': 47,
 'again': 48,
 'have': 49,
 'do': 50,
 'when': 51,
 'could': 52,
 'or': 53,
 'there': 54,
 'thought': 55,
 'off': 56,
 'time': 57,
 'me': 58,
 'queen': 59,
 'into': 60,
 'see': 61,
 'how': 62,
 'your': 63,
 'did': 64,
 'well': 65,
 'who': 66,
 'king': 67,
 'dont': 68,
 'my': 69,
 'began': 70,
 'now': 71,
 'by': 72,
 'an': 73,
 'im': 74,
 'turtle': 75,
 'way': 76,
 'mock': 77,
 'gryphon': 78,
 'quite': 79,
 'hatter': 8

In [22]:
"""
    Q) What is the index of the word 'was'?
"""

model.wv.key_to_index['was']

10

#### Retrieving the word at a given index

In [23]:
"""
    Sometimes, we wish to retrieve a word by its index.
    The trick is to store the words in a list.
"""

model.wv.index_to_key

['the',
 'and',
 'to',
 'she',
 'it',
 'of',
 'said',
 'alice',
 'in',
 'you',
 'was',
 'that',
 'as',
 'her',
 'at',
 'on',
 'with',
 'all',
 'had',
 'but',
 'for',
 'so',
 'be',
 'not',
 'very',
 'what',
 'this',
 'little',
 'they',
 'he',
 'out',
 'its',
 'is',
 'one',
 'down',
 'up',
 'his',
 'if',
 'about',
 'then',
 'no',
 'were',
 'like',
 'know',
 'them',
 'went',
 'herself',
 'would',
 'again',
 'have',
 'do',
 'when',
 'could',
 'or',
 'there',
 'thought',
 'off',
 'time',
 'me',
 'queen',
 'into',
 'see',
 'how',
 'your',
 'did',
 'well',
 'who',
 'king',
 'dont',
 'my',
 'began',
 'now',
 'by',
 'an',
 'im',
 'turtle',
 'way',
 'mock',
 'gryphon',
 'quite',
 'hatter',
 'think',
 'are',
 'their',
 'just',
 'some',
 'much',
 'go',
 'say',
 'which',
 'thing',
 'here',
 'only',
 'first',
 'head',
 'more',
 'voice',
 'rabbit',
 'get',
 'never',
 'come',
 'got',
 'looked',
 'must',
 'after',
 'such',
 'round',
 'him',
 'why',
 'two',
 'over',
 'came',
 'tone',
 'duchess',
 'mouse

In [24]:
"""
    Q) What is the word at index 10 of our vocabulary (the 11th word, since indexing starts at 0)?
"""

model.wv.index_to_key[10]

'was'
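
The two structures are inverses of each other: the list maps index to word, and the dictionary maps word to index. A small sketch using the first four vocabulary entries shown above (the variable names mirror the gensim attributes, but these are ordinary Python objects):

```python
# Position in the list is the index of the word.
index_to_key = ["the", "and", "to", "she"]

# The dictionary is built by inverting the list.
key_to_index = {word: i for i, word in enumerate(index_to_key)}

print(key_to_index["to"])  # word -> index
print(index_to_key[2])     # index -> word

# Round trip: looking up a word's index and then the word at that index returns the word.
round_trip_ok = all(index_to_key[key_to_index[w]] == w for w in index_to_key)
print(round_trip_ok)
```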

### 2-3. Word embedding vectors

In [25]:
model.wv  # a KeyedVectors object, gensim's own container for the learned word vectors

<gensim.models.keyedvectors.KeyedVectors at 0x7bdf0126be20>

In [26]:
"""
    Q) What is the word vector for 'was'?
        - Treat 'model.wv' as a python dictionary.
        - Use the string 'was' as the key, and it will return
          the word embedding vector that the model has learned.
"""

model.wv['was']  # embedding vector of 'was'; it has 100 dimensions, as set by vector_size

array([-0.7648241 , -2.171586  ,  0.81657743, -0.28726956, -0.2485494 ,
        0.19138649, -1.0530378 , -1.0284163 , -2.361386  , -0.4315796 ,
        0.34746924,  0.7861353 , -0.3775395 , -2.0935822 ,  1.0322199 ,
        0.89117694,  0.5005209 ,  0.23650514, -1.0501304 ,  1.9605324 ,
        2.3270943 , -1.2471851 ,  2.1703086 ,  0.73370785,  2.02952   ,
        0.47547966,  0.72619236,  0.4540573 ,  1.3711619 ,  1.5690695 ,
        0.27779254,  0.85145515,  0.17296651, -1.150265  ,  0.59570485,
       -0.57433444, -0.18433835,  0.43712854,  2.3970835 ,  0.26640704,
        0.12668718,  0.54101205,  0.79475313,  1.8195338 , -1.5858712 ,
       -1.119407  ,  1.4717599 , -0.01747941, -0.92666125,  0.451088  ,
       -1.3933736 , -1.7236149 , -2.9459114 , -2.19814   ,  0.42909905,
       -1.247871  ,  0.5416361 , -1.330222  ,  2.2865705 , -1.4958233 ,
       -2.2918012 ,  0.97478276,  0.27958277, -0.44752672, -0.83434486,
       -1.3614043 , -1.0304728 ,  0.5324266 ,  1.3017192 , -3.29

In [27]:
model.wv['alice'].shape

(100,)

### 2-4. Identifying similar vectors

In [28]:
"""
    Q) What are the 5 most similar words to `alice`?
        - The 'most_similar' method returns a list of tuples,
          where each tuple is a (word, cosine_similarity) pair.
"""

model.wv.most_similar('alice', topn=5)  # returns a list of (word, cosine similarity) tuples

[('she', 0.5439764857292175),
 ('it', 0.4308808445930481),
 ('but', 0.3920069634914398),
 ('unhappy', 0.3268224000930786),
 ('that', 0.2968774437904358)]
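
The scores in these tuples are cosine similarities between embedding vectors. The sketch below shows the computation and the ranking in plain Python; the 2-dimensional `toy_vectors` are invented for illustration and have nothing to do with the trained model, and the `most_similar` helper here is a simplified stand-in, not gensim's method.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(query, vectors, topn=2):
    """Rank every other word by cosine similarity to the query word."""
    scores = [
        (word, cosine_similarity(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up 2-dimensional vectors, illustration only.
toy_vectors = {
    "alice": [1.0, 0.0],
    "she": [2.0, 0.0],
    "queen": [1.0, 1.0],
    "tea": [0.0, 3.0],
}

ranking = most_similar("alice", toy_vectors, topn=2)
print(ranking)
```

Cosine similarity depends only on direction, not length, which is why `she` scores highest against `alice` here even though its vector is twice as long.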

In [29]:
"""
    Q) What are the 10 most similar words to 'time'?
"""

model.wv.most_similar('time', topn=10)

[('minutes', 0.4379795491695404),
 ('pool', 0.3897824287414551),
 ('silence', 0.3810037076473236),
 ('finished', 0.3289918601512909),
 ('once', 0.30491042137145996),
 ('moment', 0.30395081639289856),
 ('paper', 0.3030426800251007),
 ('back', 0.2929920554161072),
 ('noise', 0.28808119893074036),
 ('concert', 0.2783009111881256)]

In [30]:
"""
    Q) What are the 10 most similar words to 'face'? (topn defaults to 10 when omitted)
"""

model.wv.most_similar('face')

[('beating', 0.4483812749385834),
 ('eyes', 0.4196746349334717),
 ('folded', 0.4189411997795105),
 ('turn', 0.3936787247657776),
 ('violently', 0.3767057955265045),
 ('looked', 0.3761976361274719),
 ('paws', 0.37475699186325073),
 ('chin', 0.35099056363105774),
 ('afterwards', 0.346349835395813),
 ('arms', 0.34370121359825134)]

### 2-5. Loading and Saving Models

In [31]:
# Set the path where the model is saved (<path>/<model name>.model); the .model extension keeps it from being confused with other files.
model.save("/content/drive/My Drive/Colab Notebooks/2024-2 딥러닝/E. Vector Representations of Text/word2vec_model.model")

In [32]:
loaded_model = Word2Vec.load("/content/drive/My Drive/Colab Notebooks/2024-2 딥러닝/E. Vector Representations of Text/word2vec_model.model")  # load

In [33]:
loaded_model.wv['was']

array([-0.7648241 , -2.171586  ,  0.81657743, -0.28726956, -0.2485494 ,
        0.19138649, -1.0530378 , -1.0284163 , -2.361386  , -0.4315796 ,
        0.34746924,  0.7861353 , -0.3775395 , -2.0935822 ,  1.0322199 ,
        0.89117694,  0.5005209 ,  0.23650514, -1.0501304 ,  1.9605324 ,
        2.3270943 , -1.2471851 ,  2.1703086 ,  0.73370785,  2.02952   ,
        0.47547966,  0.72619236,  0.4540573 ,  1.3711619 ,  1.5690695 ,
        0.27779254,  0.85145515,  0.17296651, -1.150265  ,  0.59570485,
       -0.57433444, -0.18433835,  0.43712854,  2.3970835 ,  0.26640704,
        0.12668718,  0.54101205,  0.79475313,  1.8195338 , -1.5858712 ,
       -1.119407  ,  1.4717599 , -0.01747941, -0.92666125,  0.451088  ,
       -1.3933736 , -1.7236149 , -2.9459114 , -2.19814   ,  0.42909905,
       -1.247871  ,  0.5416361 , -1.330222  ,  2.2865705 , -1.4958233 ,
       -2.2918012 ,  0.97478276,  0.27958277, -0.44752672, -0.83434486,
       -1.3614043 , -1.0304728 ,  0.5324266 ,  1.3017192 , -3.29

In [34]:
loaded_model.wv.key_to_index['was']

10

In [35]:
loaded_model.wv[10]

array([-0.7648241 , -2.171586  ,  0.81657743, -0.28726956, -0.2485494 ,
        0.19138649, -1.0530378 , -1.0284163 , -2.361386  , -0.4315796 ,
        0.34746924,  0.7861353 , -0.3775395 , -2.0935822 ,  1.0322199 ,
        0.89117694,  0.5005209 ,  0.23650514, -1.0501304 ,  1.9605324 ,
        2.3270943 , -1.2471851 ,  2.1703086 ,  0.73370785,  2.02952   ,
        0.47547966,  0.72619236,  0.4540573 ,  1.3711619 ,  1.5690695 ,
        0.27779254,  0.85145515,  0.17296651, -1.150265  ,  0.59570485,
       -0.57433444, -0.18433835,  0.43712854,  2.3970835 ,  0.26640704,
        0.12668718,  0.54101205,  0.79475313,  1.8195338 , -1.5858712 ,
       -1.119407  ,  1.4717599 , -0.01747941, -0.92666125,  0.451088  ,
       -1.3933736 , -1.7236149 , -2.9459114 , -2.19814   ,  0.42909905,
       -1.247871  ,  0.5416361 , -1.330222  ,  2.2865705 , -1.4958233 ,
       -2.2918012 ,  0.97478276,  0.27958277, -0.44752672, -0.83434486,
       -1.3614043 , -1.0304728 ,  0.5324266 ,  1.3017192 , -3.29

#### Difference between `wv` and `wv.vectors`
`wv` is a high-level object that handles lookup and metadata, whereas `wv.vectors` is simply the array that stores the vectors.

In [36]:
loaded_model.wv.vectors[10]

array([-0.7648241 , -2.171586  ,  0.81657743, -0.28726956, -0.2485494 ,
        0.19138649, -1.0530378 , -1.0284163 , -2.361386  , -0.4315796 ,
        0.34746924,  0.7861353 , -0.3775395 , -2.0935822 ,  1.0322199 ,
        0.89117694,  0.5005209 ,  0.23650514, -1.0501304 ,  1.9605324 ,
        2.3270943 , -1.2471851 ,  2.1703086 ,  0.73370785,  2.02952   ,
        0.47547966,  0.72619236,  0.4540573 ,  1.3711619 ,  1.5690695 ,
        0.27779254,  0.85145515,  0.17296651, -1.150265  ,  0.59570485,
       -0.57433444, -0.18433835,  0.43712854,  2.3970835 ,  0.26640704,
        0.12668718,  0.54101205,  0.79475313,  1.8195338 , -1.5858712 ,
       -1.119407  ,  1.4717599 , -0.01747941, -0.92666125,  0.451088  ,
       -1.3933736 , -1.7236149 , -2.9459114 , -2.19814   ,  0.42909905,
       -1.247871  ,  0.5416361 , -1.330222  ,  2.2865705 , -1.4958233 ,
       -2.2918012 ,  0.97478276,  0.27958277, -0.44752672, -0.83434486,
       -1.3614043 , -1.0304728 ,  0.5324266 ,  1.3017192 , -3.29
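
The relationship between the wrapper object and the raw array can be pictured with a miniature class. This is a hypothetical sketch for illustration only (`ToyKeyedVectors` is a made-up name, and the real `KeyedVectors` stores a NumPy array and offers far more functionality).

```python
class ToyKeyedVectors:
    """Hypothetical miniature of a keyed-vector store; not gensim's implementation."""

    def __init__(self, words, vectors):
        self.index_to_key = list(words)
        self.key_to_index = {w: i for i, w in enumerate(self.index_to_key)}
        self.vectors = vectors  # plain list of rows, one row per word

    def __getitem__(self, key):
        # Accept either a word (str) or an integer index.
        index = self.key_to_index[key] if isinstance(key, str) else key
        return self.vectors[index]


# Made-up words and 2-dimensional vectors.
toy_wv = ToyKeyedVectors(["the", "and"], [[0.1, 0.2], [0.3, 0.4]])

by_word = toy_wv["and"]        # lookup through the wrapper, by word
by_index = toy_wv[1]           # lookup through the wrapper, by index
raw_row = toy_wv.vectors[1]    # direct access to the underlying storage

print(by_word, by_index, raw_row)
```

All three expressions return the same row, which mirrors how `wv['was']`, `wv[10]`, and `wv.vectors[10]` print the same array in the cells above.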