## Skip-gram and Negative Sampling 

## Setup

In [2]:
import io
import re
import string
import tensorflow as tf
import tqdm

from tensorflow.keras import Model
from tensorflow.keras.layers import Dot, Embedding, Flatten
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

In [3]:
SEED = 42
AUTOTUNE = tf.data.AUTOTUNE

### Vectorize an example sentence

Consider the following sentence:    
`The wide road shimmered in the hot sun.`

Tokenize the sentence:

In [4]:
sentence = "The wide road shimmered in the hot sun"
tokens = list(sentence.lower().split())
print(tokens)
print(len(tokens))

['the', 'wide', 'road', 'shimmered', 'in', 'the', 'hot', 'sun']
8


Create a vocabulary to save mappings from tokens to integer indices.

In [5]:
vocab, index = {}, 1  # start indexing from 1
vocab['<pad>'] = 0  # add a padding token
for token in tokens:
  if token not in vocab:
    vocab[token] = index
    index += 1
vocab_size = len(vocab)
print(vocab)

{'<pad>': 0, 'the': 1, 'wide': 2, 'road': 3, 'shimmered': 4, 'in': 5, 'hot': 6, 'sun': 7}


Create an inverse vocabulary to save mappings from integer indices to tokens.

In [6]:
inverse_vocab = {index: token for token, index in vocab.items()}
print(inverse_vocab)

{0: '<pad>', 1: 'the', 2: 'wide', 3: 'road', 4: 'shimmered', 5: 'in', 6: 'hot', 7: 'sun'}


Vectorize your sentence.


In [7]:
example_sequence = [vocab[word] for word in tokens]
print(example_sequence)

[1, 2, 3, 4, 5, 1, 6, 7]


### Generate skip-grams from one sentence

The `tf.keras.preprocessing.sequence` module provides useful functions that simplify data preparation for Word2Vec. You can use the `tf.keras.preprocessing.sequence.skipgrams` to generate skip-gram pairs from the `example_sequence` with a given `window_size` from tokens in the range `[0, vocab_size)`.

Note: `negative_samples` is set to `0` here as batching negative samples generated by this function requires a bit of code. You will use another function to perform negative sampling in the next section.


In [8]:
window_size = 2
positive_skip_grams, _ = tf.keras.preprocessing.sequence.skipgrams(
      example_sequence,
      vocabulary_size=vocab_size,
      window_size=window_size,
      negative_samples=0,  # 0 = do not draw negative samples here
      shuffle=True)  # alternative: shuffle=False


print(positive_skip_grams)
print(len(positive_skip_grams))

[[4, 5], [6, 5], [3, 5], [2, 3], [5, 1], [6, 7], [1, 2], [1, 4], [5, 6], [1, 6], [6, 1], [4, 3], [3, 2], [3, 1], [1, 7], [1, 3], [7, 1], [2, 4], [7, 6], [1, 5], [4, 1], [4, 2], [5, 3], [3, 4], [2, 1], [5, 4]]
26


Take a look at a few positive skip-grams.

In [9]:
for target, context in positive_skip_grams[:5]:
  print(f"({target}, {context}): ({inverse_vocab[target]}, {inverse_vocab[context]})")

(4, 5): (shimmered, in)
(6, 5): (hot, in)
(3, 5): (road, in)
(2, 3): (wide, road)
(5, 1): (in, the)
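
To make the windowing concrete, here is a minimal plain-Python sketch of how positive pairs can be enumerated for the example sequence. The function name `positive_skip_grams_sketch` is hypothetical, and skipping index `0` (padding) as either target or context is an assumption about how the Keras function treats padding. It produces pairs in sequence order rather than shuffled.

```python
def positive_skip_grams_sketch(sequence, window_size):
    """Return every (target, context) pair within window_size positions."""
    pairs = []
    for i, target in enumerate(sequence):
        if target == 0:  # assumption: index 0 is padding and is skipped
            continue
        start = max(0, i - window_size)
        stop = min(len(sequence), i + window_size + 1)
        for j in range(start, stop):
            if j != i and sequence[j] != 0:
                pairs.append((target, sequence[j]))
    return pairs


example_sequence = [1, 2, 3, 4, 5, 1, 6, 7]
pairs = positive_skip_grams_sketch(example_sequence, window_size=2)
print(len(pairs))   # 26, the same count as printed above
print(pairs[:4])    # [(1, 2), (1, 3), (2, 1), (2, 3)]
```

Words near the edges of the sentence have fewer neighbors, which is why eight tokens with a window of 2 give 26 pairs rather than 32.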


### Negative sampling for one skip-gram 

The `skipgrams` function returns all positive skip-gram pairs by sliding over a given window span. To produce additional skip-gram pairs that serve as negative samples for training, you need to sample random words from the vocabulary. Use the `tf.random.log_uniform_candidate_sampler` function to sample `num_ns` negative samples for a given target word in a window. You can call the function on one skip-gram's target word and pass the context word as the true class to exclude it from being sampled.


Key point: *num_ns* (number of negative samples per positive context word) between [5, 20] is [shown to work](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) best for smaller datasets, while *num_ns* between [2,5] suffices for larger datasets. 

In [32]:
# Get target and context words for one positive skip-gram.
target_word, context_word = positive_skip_grams[0]

# Set the number of negative samples per positive context.
num_ns = 4

context_class = tf.reshape(tf.constant(context_word, dtype="int64"), (1, 1))
negative_sampling_candidates, _, _ = tf.random.log_uniform_candidate_sampler(
    true_classes=context_class,  # the positive class (context_class), to be excluded from the negatives
    num_true=1,  # each positive skip-gram has 1 positive context class
    num_sampled=num_ns,  # number of negative context words to sample
    unique=True,  # all the negative samples should be unique (sampling without replacement)
    range_max=vocab_size,  # pick index of the samples from [0, vocab_size)
    seed=SEED,  # seed for reproducibility
    name="negative_sampling"  # name of this operation
)
print(negative_sampling_candidates)
print([inverse_vocab[index.numpy()] for index in negative_sampling_candidates])

# num_sampled=num_ns candidates are drawn at random, excluding context_class.

tf.Tensor([86 30 36 42], shape=(4,), dtype=int64)
['must', 'thy', 'we', 'do']
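
The sampler's documentation gives the probability of class `k` as `(log(k + 2) - log(k + 1)) / log(range_max + 1)`, so low indices (frequent words) are drawn more often. The following is a plain-Python sketch of that distribution, not the TensorFlow implementation. The helper names are hypothetical, drawing unique samples by rejecting repeats is an assumption, and the results will not match TensorFlow's random stream.

```python
import math
import random


def log_uniform_probs(range_max):
    """P(k) = (log(k + 2) - log(k + 1)) / log(range_max + 1), k in [0, range_max)."""
    denom = math.log(range_max + 1)
    return [(math.log(k + 2) - math.log(k + 1)) / denom for k in range(range_max)]


def sample_unique(probs, num_sampled, rng):
    """Draw num_sampled distinct indices by rejecting repeated draws."""
    chosen = []
    while len(chosen) < num_sampled:
        k = rng.choices(range(len(probs)), weights=probs)[0]
        if k not in chosen:
            chosen.append(k)
    return chosen


probs = log_uniform_probs(range_max=8)
print([round(p, 3) for p in probs])  # probabilities decrease with the index
negatives = sample_unique(probs, num_sampled=4, rng=random.Random(42))
print(negatives)
```

The probabilities telescope to exactly 1 over `[0, range_max)`, which is why the denominator is `log(range_max + 1)`.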


### Construct one training example

For a given positive `(target_word, context_word)` skip-gram, you now also have `num_ns` negative sampled context words that do not appear in the window size neighborhood of `target_word`. Batch the `1` positive `context_word` and `num_ns` negative context words into one tensor. This produces a set of positive skip-grams (labelled as `1`) and negative samples (labelled as `0`) for each target word.

In [33]:
# Add a dimension so you can use concatenation (on the next step).
negative_sampling_candidates = tf.expand_dims(negative_sampling_candidates, 1)  # shape (num_ns,) -> (num_ns, 1), e.g. (4,) -> (4, 1)

# Concat positive context word with negative sampled words.
context = tf.concat([context_class, negative_sampling_candidates], 0)

# Label first context word as 1 (positive) followed by num_ns 0s (negative).
label = tf.constant([1] + [0]*num_ns, dtype="int64")

# Squeeze target to a scalar (shape ()) and context and label to (num_ns+1,).
target = tf.squeeze(target_word)
context = tf.squeeze(context)
label = tf.squeeze(label)

print(target)
print(context)
print(label)

tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor([ 5 86 30 36 42], shape=(5,), dtype=int64)
tf.Tensor([1 0 0 0 0], shape=(5,), dtype=int64)


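Take a look at the context and the corresponding labels for the target word from the skip-gram example above. The cell numbers (32 to 34) indicate that these cells were executed after `vocab_size` and `inverse_vocab` had been redefined for the Shakespeare corpus later in the notebook, so the sampled indices and the words printed here come from that 4096-token vocabulary. With the eight-token example vocabulary, index 4 is `shimmered` and index 5 is `in`.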

In [34]:
print(f"target_index    : {target}")
print(f"target_word     : {inverse_vocab[target_word]}")
print(f"context_indices : {context}")
print(f"context_words   : {[inverse_vocab[c.numpy()] for c in context]}")
print(f"label           : {label}")

target_index    : 4
target_word     : to
context_indices : [ 5 86 30 36 42]
context_words   : ['i', 'must', 'thy', 'we', 'do']
label           : [1 0 0 0 0]


## Compile all steps into one function


### Skip-gram Sampling table 

In [13]:
sampling_table = tf.keras.preprocessing.sequence.make_sampling_table(size=10)
print(sampling_table)
print((sampling_table).sum())


[0.00315225 0.00315225 0.00547597 0.00741556 0.00912817 0.01068435
 0.01212381 0.01347162 0.01474487 0.0159558 ]
0.09530464658339152


`sampling_table[i]` denotes the probability of sampling the i-th most common word in a dataset. The function assumes a [Zipf's distribution](https://en.wikipedia.org/wiki/Zipf%27s_law) of the word frequencies for sampling.

Key point: The `tf.random.log_uniform_candidate_sampler` already assumes that the vocabulary frequency follows a log-uniform (Zipf's) distribution. Using this distribution-weighted sampling also helps approximate the Noise Contrastive Estimation (NCE) loss with simpler loss functions for training a negative sampling objective.
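
The values printed for `make_sampling_table(size=10)` can be reproduced with a short plain-Python sketch. The formula and constants below (a Zipf-style inverse frequency estimate, `gamma = 0.577`, and a default `sampling_factor` of `1e-5`) are assumptions about the Keras implementation, chosen because they match the printed table. The function name is hypothetical.

```python
import math


def make_sampling_table_sketch(size, sampling_factor=1e-5):
    gamma = 0.577  # approximation of the Euler-Mascheroni constant
    table = []
    for rank in range(size):
        rank = max(rank, 1)  # rank 0 is treated like rank 1
        # Estimated inverse frequency of the word at this rank under Zipf's law.
        inv_fq = rank * (math.log(rank) + gamma) + 0.5 - 1.0 / (12.0 * rank)
        f = sampling_factor * inv_fq
        table.append(min(1.0, f / math.sqrt(f)))
    return table


table = make_sampling_table_sketch(10)
print([round(p, 8) for p in table])
print(sum(table))
```

Because the entries are capped at 1 and grow with rank, rarer words are more likely to be kept than very frequent ones.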

### Generate training data

Compile all the steps described above into a function that can be called on a list of vectorized sentences obtained from any text dataset. Notice that the sampling table is built before sampling skip-gram word pairs. You will use this function in the later sections.

In [14]:
# Generates skip-gram pairs with negative sampling for a list of sequences
# (int-encoded sentences) based on window size, number of negative samples
# and vocabulary size.
def generate_training_data(sequences, window_size, num_ns, vocab_size, seed):
  # Elements of each training example are appended to these lists.
  targets, contexts, labels = [], [], []

  # Build the sampling table for vocab_size tokens.
  sampling_table = tf.keras.preprocessing.sequence.make_sampling_table(vocab_size)

  # Iterate over all sequences (sentences) in dataset.
  for sequence in tqdm.tqdm(sequences):

    # Generate positive skip-gram pairs for a sequence (sentence).
    positive_skip_grams, _ = tf.keras.preprocessing.sequence.skipgrams(
          sequence,
          vocabulary_size=vocab_size,
          sampling_table=sampling_table,
          window_size=window_size,
          negative_samples=0)

    # Iterate over each positive skip-gram pair to produce training examples
    # with positive context word and negative samples.
    for target_word, context_word in positive_skip_grams:
      context_class = tf.expand_dims(
          tf.constant([context_word], dtype="int64"), 1)
      negative_sampling_candidates, _, _ = tf.random.log_uniform_candidate_sampler(
          true_classes=context_class,
          num_true=1,
          num_sampled=num_ns,
          unique=True,
          range_max=vocab_size,
          seed=seed,
          name="negative_sampling")


      # Build context and label vectors (for one target word)
      negative_sampling_candidates = tf.expand_dims(
          negative_sampling_candidates, 1)

      context = tf.concat([context_class, negative_sampling_candidates], 0)
      label = tf.constant([1] + [0]*num_ns, dtype="int64")

      # Append each element from the training example to global lists.
      targets.append(target_word)
      contexts.append(context)
      labels.append(label)

  return targets, contexts, labels

## Prepare training data for Word2Vec

With an understanding of how to work with one sentence for a skip-gram negative sampling based Word2Vec model, you can proceed to generate training examples from a larger list of sentences!

### Download text corpus


You will use a text file of Shakespeare's writing for this tutorial. Change the following line to run this code on your own data.

In [15]:
path_to_file = tf.keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')

Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt


Read text from the file and take a look at the first few lines. 

In [16]:
with open(path_to_file) as f: 
  lines = f.read().splitlines()
for line in lines[:20]:
  print(line)

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.


Use the non-empty lines to construct a `tf.data.TextLineDataset` object for the next steps.

In [17]:
# Keep only non-empty lines: casting the string length to bool gives False for length 0.
text_ds = tf.data.TextLineDataset(path_to_file).filter(lambda x: tf.cast(tf.strings.length(x), bool))

# Print the first four lines.
for element in text_ds.take(4).as_numpy_iterator():
  print(element)


b'First Citizen:'
b'Before we proceed any further, hear me speak.'
b'All:'
b'Speak, speak.'


- The `b` prefix marks a byte string, which is how TensorFlow represents string values.

### Vectorize sentences from the corpus

You can use the `TextVectorization` layer to vectorize sentences from the corpus. Learn more about using this layer in this [Text Classification](https://www.tensorflow.org/tutorials/keras/text_classification) tutorial. Notice from the first few sentences above that the text needs to be in one case and punctuation needs to be removed. To do this, define a `custom_standardization` function that can be used in the `TextVectorization` layer.

In [18]:
# Now, create a custom standardization function to lowercase the text and
# remove punctuation.
def custom_standardization(input_data):
  lowercase = tf.strings.lower(input_data)
  # Strip punctuation from the lowercased text.
  return tf.strings.regex_replace(lowercase,
                                  '[%s]' % re.escape(string.punctuation), '')


# Define the vocabulary size and number of words in a sequence.
vocab_size = 4096
sequence_length = 10

# Use the text vectorization layer to normalize, split, and map strings to
# integers. Set output_sequence_length to pad all samples to the same length.
vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    max_tokens=vocab_size,
    output_mode='int',
    output_sequence_length=sequence_length)
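
For intuition, the same standardization can be sketched on an ordinary Python string with the standard `re` and `string` modules. The name `standardize_sketch` is hypothetical, and this version works on `str` values rather than tensors.

```python
import re
import string

# Character class matching any ASCII punctuation character.
PUNCTUATION_PATTERN = re.compile("[%s]" % re.escape(string.punctuation))


def standardize_sketch(line):
    """Lowercase a line and delete every ASCII punctuation character."""
    return PUNCTUATION_PATTERN.sub("", line.lower())


print(standardize_sketch("We know't, we know't."))  # we knowt we knowt
print(standardize_sketch("Resolved. resolved."))    # resolved resolved
```

Note that apostrophes are removed too, so `know't` becomes the single token `knowt`.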

Call `adapt` on the text dataset to create vocabulary.


In [19]:
vectorize_layer.adapt(text_ds.batch(1024))  # build the vocabulary, 1024 lines per batch because the corpus is long

Once the state of the layer has been adapted to represent the text corpus, the vocabulary can be accessed with `get_vocabulary()`. This function returns a list of all vocabulary tokens sorted (descending) by their frequency. 

In [20]:
# Save the created vocabulary for reference.
inverse_vocab = vectorize_layer.get_vocabulary()
print(len(inverse_vocab))
print(inverse_vocab[:20])   # first 20 entries, ordered by frequency

4096
['', '[UNK]', 'the', 'and', 'to', 'i', 'of', 'you', 'my', 'a', 'that', 'in', 'is', 'not', 'for', 'with', 'me', 'it', 'be', 'your']
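
The layout of this vocabulary (padding at index 0, `[UNK]` at index 1, then tokens by descending frequency) and the fixed-length integer encoding can be illustrated with a small plain-Python sketch. The helper names and the toy corpus are hypothetical, and how the real layer breaks ties between equally frequent tokens is not modeled here.

```python
from collections import Counter


def build_vocabulary(lines, max_tokens):
    """Index 0 is padding, index 1 is unknown, then tokens by descending count."""
    counts = Counter(token for line in lines for token in line.split())
    most_common = [token for token, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common


def encode(line, vocabulary, sequence_length):
    """Map tokens to indices, then truncate or zero-pad to sequence_length."""
    index_of = {token: i for i, token in enumerate(vocabulary)}
    ids = [index_of.get(token, 1) for token in line.split()][:sequence_length]
    return ids + [0] * (sequence_length - len(ids))


toy_lines = ["speak speak", "hear me speak", "hear me", "hear speak"]
toy_vocab = build_vocabulary(toy_lines, max_tokens=5)
print(toy_vocab)                                            # ['', '[UNK]', 'speak', 'hear', 'me']
print(encode("hear me speak now", toy_vocab, sequence_length=6))  # [3, 4, 2, 1, 0, 0]
```

The out-of-vocabulary token `now` maps to index 1, and the trailing zeros are padding.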


The `vectorize_layer` can now be used to generate vectors for each element in the `text_ds`.

In [21]:
# Vectorize the data in text_ds.
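# prefetch: overlaps preprocessing with downstream execution to cut compute time
#           (the next batch is loaded while map is running).
# map(fn): applies fn to each element of the dataset.
# unbatch: splits the batches back into individual elements.
text_vector_ds = text_ds.batch(1024).prefetch(AUTOTUNE).map(vectorize_layer).unbatch()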
print(text_vector_ds)

<_UnbatchDataset element_spec=TensorSpec(shape=(10,), dtype=tf.int64, name=None)>


### Obtain sequences from the dataset

You now have a `tf.data.Dataset` of integer-encoded sentences. To prepare the dataset for training a Word2Vec model, flatten the dataset into a list of sentence vector sequences. This step is required because you iterate over each sentence in the dataset to produce positive and negative examples.

Note: Since the `generate_training_data()` defined earlier uses non-TF python/numpy functions, you could also use a `tf.py_function` or `tf.numpy_function` with `tf.data.Dataset.map()`.

In [22]:
sequences = list(text_vector_ds.as_numpy_iterator())
print(len(sequences))

32777


Take a look at a few examples from `sequences`.


In [23]:
for seq in sequences[:5]:
  print(f"{seq} => {[inverse_vocab[i] for i in seq]}")

[ 89 270   0   0   0   0   0   0   0   0] => ['first', 'citizen', '', '', '', '', '', '', '', '']
[138  36 982 144 673 125  16 106   0   0] => ['before', 'we', 'proceed', 'any', 'further', 'hear', 'me', 'speak', '', '']
[34  0  0  0  0  0  0  0  0  0] => ['all', '', '', '', '', '', '', '', '', '']
[106 106   0   0   0   0   0   0   0   0] => ['speak', 'speak', '', '', '', '', '', '', '', '']
[ 89 270   0   0   0   0   0   0   0   0] => ['first', 'citizen', '', '', '', '', '', '', '', '']


### Generate training examples from sequences

`sequences` is now a list of int-encoded sentences. Call the `generate_training_data()` function defined earlier to generate training examples for the Word2Vec model. To recap, the function iterates over each word from each sequence to collect positive and negative context words. The lengths of `targets`, `contexts`, and `labels` should be the same, representing the total number of training examples.

In [24]:
targets, contexts, labels = generate_training_data(
    sequences=sequences,
    window_size=2,
    num_ns=4,
    vocab_size=vocab_size,
    seed=SEED)

print(len(targets), len(contexts), len(labels))

100%|██████████| 32777/32777 [00:16<00:00, 2015.48it/s]

64777 64777 64777





### Configure the dataset for performance

To perform efficient batching for the potentially large number of training examples, use the `tf.data.Dataset` API. After this step, you would have a `tf.data.Dataset` object of `(target_word, context_word), (label)` elements to train your Word2Vec model!
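
As a quick check on what batching does to the 64777 examples generated above, the sketch below counts full batches for a batch size of 1024 when incomplete batches are dropped, as the `drop_remainder=True` setting in this section does. The function name `count_batches` is hypothetical.

```python
def count_batches(num_examples, batch_size, drop_remainder):
    """Return (number of batches, examples left out of any batch)."""
    full, leftover = divmod(num_examples, batch_size)
    if drop_remainder or leftover == 0:
        return full, leftover
    return full + 1, 0


num_batches, dropped = count_batches(64777, 1024, drop_remainder=True)
print(num_batches, dropped)  # 63 265
```

Dropping the remainder keeps every batch at the same static shape, at the cost of leaving 265 examples out of each pass.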

In [25]:
BATCH_SIZE = 1024  # batch size
BUFFER_SIZE = 10000  # buffer size used for shuffling
# Combine the per-example slices into one training dataset.
# Input: (targets (1), contexts (1 + 4)); output: labels (5).
dataset = tf.data.Dataset.from_tensor_slices(((targets, contexts), labels))
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
print(dataset)

<BatchDataset element_spec=((TensorSpec(shape=(1024,), dtype=tf.int32, name=None), TensorSpec(shape=(1024, 5, 1), dtype=tf.int64, name=None)), TensorSpec(shape=(1024, 5), dtype=tf.int64, name=None))>


Add `cache()` and `prefetch()` to improve performance.

In [26]:
# cache and prefetch both serve to reduce compute time.
dataset = dataset.cache().prefetch(buffer_size=AUTOTUNE)
print(dataset)

<PrefetchDataset element_spec=((TensorSpec(shape=(1024,), dtype=tf.int32, name=None), TensorSpec(shape=(1024, 5, 1), dtype=tf.int64, name=None)), TensorSpec(shape=(1024, 5), dtype=tf.int64, name=None))>


## Model and Training

The Word2Vec model can be implemented as a classifier to distinguish between true context words from skip-grams and false context words obtained through negative sampling. You can perform a dot product between the embeddings of target and context words to obtain predictions for labels and compute loss against true labels in the dataset.

### Subclassed Word2Vec Model

Use the [Keras Subclassing API](https://www.tensorflow.org/guide/keras/custom_layers_and_models) to define your Word2Vec model with the following layers:


* `target_embedding`: A `tf.keras.layers.Embedding` layer which looks up the embedding of a word when it appears as a target word. The number of parameters in this layer is `(vocab_size * embedding_dim)`.
* `context_embedding`: Another `tf.keras.layers.Embedding` layer which looks up the embedding of a word when it appears as a context word. The number of parameters in this layer is the same as in `target_embedding`, i.e. `(vocab_size * embedding_dim)`.
* `dots`: A `tf.keras.layers.Dot` layer that computes the dot product of target and context embeddings from a training pair.
* `flatten`: A `tf.keras.layers.Flatten` layer to flatten the results of `dots` layer into logits.

With the subclassed model, you can define the `call()` function that accepts `(target, context)` pairs which can then be passed into their corresponding embedding layer. Reshape the `context_embedding` to perform a dot product with `target_embedding` and return the flattened result.

Key point: The `target_embedding` and `context_embedding` layers can be shared as well. You could also use a concatenation of both embeddings as the final Word2Vec embedding.

In [27]:
class Word2Vec(Model):
  def __init__(self, vocab_size, embedding_dim):
    super(Word2Vec, self).__init__()
    # Embedding: maps an integer index to a dense vector.
    # target_embedding: the layer that embeds target words.
    self.target_embedding = Embedding(vocab_size,
                                      embedding_dim,
                                      input_length=1,
                                      name="w2v_embedding")
    # context_embedding: the layer that embeds context words.
    self.context_embedding = Embedding(vocab_size,
                                       embedding_dim,
                                       input_length=num_ns+1)  # 5 context words per example
    self.dots = Dot(axes=(3, 1))  # dot product of the two embedding vectors
    self.flatten = Flatten()  # flatten the dot products into one row per example

  def call(self, pair):
    target, context = pair
    word_emb = self.target_embedding(target)
    context_emb = self.context_embedding(context)
    dots = self.dots([context_emb, word_emb])
    return self.flatten(dots)

  # target  : (1024,)      -> word_emb    : (1024, 128)
  # context : (1024, 5, 1) -> context_emb : (1024, 5, 1, 128)
  # dots    : (1024, 5, 1)
  # flatten : (1024, 5), matching the shape of the labels, e.g. [1, 0, 0, 0, 0]
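
For a single training example, the computation in `call()` reduces to one dot product per context word. The sketch below shows this with plain Python lists. The toy vectors use an embedding dimension of 3 instead of 128, and all names and values are made up for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def logits_for_example(target_vector, context_vectors):
    """One logit per context word: its embedding dotted with the target's."""
    return [dot(context_vector, target_vector) for context_vector in context_vectors]


# Toy embeddings with embedding_dim = 3 (the notebook uses 128).
target_vector = [1.0, 0.0, 2.0]
context_vectors = [
    [0.5, 1.0, 0.5],   # positive context word
    [0.0, 1.0, 0.0],   # negative sample
    [-1.0, 0.0, 0.0],  # negative sample
    [0.0, 0.0, -1.0],  # negative sample
    [1.0, 1.0, 1.0],   # negative sample
]
logits = logits_for_example(target_vector, context_vectors)
print(logits)  # [1.5, 0.0, -1.0, -2.0, 3.0]
```

The five logits line up with the five label entries, so training pushes the first one up and the remaining four down.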

### Define loss function and compile model


In [28]:
# Create the model and set the loss function and optimizer.
embedding_dim = 128
word2vec = Word2Vec(vocab_size, embedding_dim)
word2vec.compile(optimizer='adam',
                 loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
                 metrics=['accuracy'])
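
With labels of the form `[1, 0, 0, 0, 0]`, categorical cross-entropy on logits amounts to the negative log of the softmax probability assigned to the positive context word. Here is a plain-Python sketch for one example, with a hypothetical function name and made-up logits. It illustrates the formula only and is not the Keras implementation.

```python
import math


def softmax_cross_entropy(logits, label):
    """Cross-entropy between a one-hot style label and softmax(logits)."""
    shift = max(logits)  # subtract the max for numerical stability
    log_norm = shift + math.log(sum(math.exp(x - shift) for x in logits))
    return -sum(y * (x - log_norm) for x, y in zip(logits, label))


label = [1, 0, 0, 0, 0]
uniform_loss = softmax_cross_entropy([0.0, 0.0, 0.0, 0.0, 0.0], label)
confident_loss = softmax_cross_entropy([4.0, 0.0, 0.0, 0.0, 0.0], label)
print(uniform_loss)    # log(5), about 1.609
print(confident_loss)  # smaller, because the positive logit dominates
```

An untrained model with near-equal logits therefore starts around `log(num_ns + 1)`, which is a useful sanity check on the first reported loss.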

You can also define a callback to log training statistics for TensorBoard; the training cell below runs without one.

Train the model with `dataset` prepared above for some number of epochs.

In [29]:
# Start training.
word2vec.fit(dataset, epochs=30)

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<keras.callbacks.History at 0x7f0b40160810>

## Embedding lookup and analysis

Obtain the weights from the model using `get_layer()` and `get_weights()`. The `get_vocabulary()` function provides the vocabulary to build a metadata file with one token per line. 
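
A metadata file with one token per line can be written with a simple loop over the vocabulary. The sketch below uses in-memory `io.StringIO` buffers and toy data in place of `get_vocabulary()` and `get_weights()[0]`. Writing a companion tab-separated vectors file and skipping the padding entry at index 0 are assumptions, and the function name is hypothetical.

```python
import io


def write_embedding_files(vocabulary, weights, vectors_file, metadata_file):
    """Write one tab-separated vector and one token per line, skipping index 0."""
    for index, token in enumerate(vocabulary):
        if index == 0:
            continue  # assumption: index 0 is the padding token and is left out
        vectors_file.write("\t".join(str(x) for x in weights[index]) + "\n")
        metadata_file.write(token + "\n")


# Toy stand-ins for the real vocabulary and weight matrix.
toy_vocab = ["", "[UNK]", "the"]
toy_weights = [[0.0, 0.0], [0.1, -0.2], [0.3, 0.4]]
vectors_out, metadata_out = io.StringIO(), io.StringIO()
write_embedding_files(toy_vocab, toy_weights, vectors_out, metadata_out)
print(metadata_out.getvalue())
print(vectors_out.getvalue())
```

To write to disk instead, pass file objects opened in text mode in place of the two buffers.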

In [35]:
word2vec.get_layer('w2v_embedding').get_weights()

[array([[-0.02513342,  0.03719174, -0.01053281, ...,  0.03800357,
          0.0245422 , -0.03417538],
        [-0.1177707 ,  0.17044562, -0.45756817, ..., -0.53201795,
         -0.02582205,  0.17690413],
        [-0.09348251,  0.15607134, -0.6326013 , ...,  0.06154591,
         -0.478574  ,  0.08689316],
        ...,
        [-0.16618694,  0.2946829 ,  0.06358801, ..., -0.07948469,
          0.09585428,  0.21353276],
        [ 0.12524194, -0.18285824,  0.30420834, ..., -0.04559448,
          0.01384903,  0.198292  ],
        [ 0.03209055,  0.04292702,  0.09761377, ...,  0.19488823,
         -0.15908404,  0.01632194]], dtype=float32)]

In [36]:
# Inspect an embedding vector.
weights = word2vec.get_layer('w2v_embedding').get_weights()[0]
vocab = vectorize_layer.get_vocabulary()

print(vocab[2])
print(weights[2])

the
[-9.34825093e-02  1.56071335e-01 -6.32601321e-01  3.39980006e-01
  2.87253737e-01  1.89381614e-01 -5.52521288e-01  1.38086498e-01
  2.59725213e-01  1.16115145e-01  4.35260445e-01 -6.46997169e-02
  4.03349757e-01 -4.66348737e-01  5.06261468e-01 -2.10431874e-01
  2.88908511e-01  3.80315371e-02 -1.76113024e-01  1.89207762e-01
  7.86827877e-02  1.32178798e-01 -1.63359448e-01  2.75066979e-02
  9.54768360e-02 -2.41752863e-01 -2.36765802e-01 -1.66875854e-01
  2.35400215e-01  4.95101184e-01  3.64230841e-01  2.62189180e-01
  3.49348456e-01  1.25922695e-01 -1.04805298e-01 -9.91582945e-02
  2.16945872e-01 -1.75027430e-01 -6.66567162e-02 -1.27007617e-02
  4.43117283e-02 -1.78878501e-01  2.13328272e-01  2.45017916e-01
 -2.63587683e-01 -2.08952576e-01 -4.22608793e-01  1.21816173e-01
  1.05944306e-01 -2.37400040e-01  1.41065598e-01  1.23708472e-01
 -2.96245575e-01  9.95864198e-02  2.94463933e-01  1.22015357e-01
 -4.57480490e-01  9.56135914e-02 -1.99018434e-01  1.77657291e-01
  1.88235957e-02 -1.1