First we have to convert text into numbers so the machine can understand it:

1. Create a vocabulary for our words and assign a unique number to each one.

    Issue: the numbers are arbitrary, so they don't capture relationships between words.
2. One-hot encoding: represent each word as a vector with a 1 at that word's position and 0 everywhere else.

    Same issue as the first option, and it is computationally inefficient because every vector is as long as the vocabulary.
3. Word embedding: represent each word by a vector of learned features, so that comparing vectors can tell us that apple and banana are similar.
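
The limitation of the first two options can be sketched in a few lines of NumPy. The three-word vocabulary below is a hypothetical example, not data from these notes. Integer ids carry no meaning, and any two distinct one-hot vectors are orthogonal, so neither encoding can say that apple is closer to banana than to car.

```python
import numpy as np

# Hypothetical toy vocabulary, chosen only for illustration.
words = ["apple", "banana", "car"]

# Option 1: assign each word a unique integer id.
word_to_id = {word: idx for idx, word in enumerate(words)}

# Option 2: one-hot vectors, one dimension per vocabulary word.
one_hot_vectors = np.eye(len(words), dtype=int)

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

apple = one_hot_vectors[word_to_id["apple"]]
banana = one_hot_vectors[word_to_id["banana"]]
car = one_hot_vectors[word_to_id["car"]]

# Every pair of distinct one-hot vectors has similarity 0,
# so "apple" is no closer to "banana" than it is to "car".
print(cosine(apple, banana), cosine(apple, car))
```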

# word embedding

Convert each word into a feature vector.

Techniques for turning text into vectors include TF-IDF, which is computed from word counts, and Word2Vec, which is learned from data.

Embeddings are not handcrafted; they are learned during neural network training.

Techniques:
1. Supervised learning
2. Self-supervised learning
   1. Word2Vec
   2. GloVe
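
Word2Vec counts as self-supervised because its training pairs come from the text itself, with no manual labels. Here is a minimal sketch of how skip-gram style (center, context) pairs could be generated. The function name `skipgram_pairs` and the window size are illustrative assumptions, and real Word2Vec training adds steps such as negative sampling that are left out.

```python
def skipgram_pairs(tokens, window=1):
    """Return (center, context) pairs for every token and its neighbours."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs("will go again".split(), window=1)
print(pairs)
# [('will', 'go'), ('go', 'will'), ('go', 'again'), ('again', 'go')]
```

A model trained to predict the context word from the center word ends up with an embedding for each word as a by-product.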

# supervised learning

Take an NLP problem and try to solve it. As a side effect of solving it, you get word embeddings.

We pad the inputs so that every review has the same length, which gives the network a fixed input size.

In [2]:
import numpy as np
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Embedding

In [47]:
reviews = ['nice food',
        'amazing restaurant',
        'too good',
        'just loved it!',
        'will go again',
        'horrible food',
        'never go there',
        'poor service',
        'poor quality',
        'needs improvement']

In [48]:
sentiment = np.array([1,1,1,1,1,0,0,0,0,0]) # first five are positive reviews and next five are negative reviews

In [49]:
one_hot('amazing restaurant', 30)  # hashes each word to an integer index in [1, 29]; despite the name, it does not return one-hot vectors

[4, 22]
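
Despite its name, `one_hot` returns integer indices: each word is hashed into the range 1 to n-1. The sketch below imitates that idea with a stable MD5 hash. `hash_index` and `encode` are hypothetical helpers, and their indices are not expected to match the Keras output above, since the hash function may differ. Because many words share a small number of slots, two different words can land on the same index.

```python
import hashlib

def hash_index(word, n):
    """Map a word to an integer in [1, n - 1] using a stable hash."""
    digest = hashlib.md5(word.lower().encode("utf-8")).hexdigest()
    return int(digest, 16) % (n - 1) + 1

def encode(text, n):
    """Split on whitespace and hash every word."""
    return [hash_index(word, n) for word in text.split()]

encoded = encode("amazing restaurant", 30)
print(encoded)  # two integers between 1 and 29

# The same word always maps to the same index.
print(encode("poor service", 40)[0] == encode("poor quality", 40)[0])
```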

In [95]:
vocab_size = 40

encoded_reviews = [one_hot(a, vocab_size) for a in reviews]
encoded_reviews

[[27, 12],
 [2, 21],
 [14, 2],
 [39, 4, 36],
 [7, 17, 29],
 [22, 12],
 [15, 17, 10],
 [3, 14],
 [3, 29],
 [27, 25]]

In [96]:
# Pad the reviews to a common length, since some have fewer words than others

max_length = 3

padded_reviews = pad_sequences(encoded_reviews, maxlen=max_length, padding='post')
padded_reviews  # every review now has length 3; 0 marks padding

array([[27, 12,  0],
       [ 2, 21,  0],
       [14,  2,  0],
       [39,  4, 36],
       [ 7, 17, 29],
       [22, 12,  0],
       [15, 17, 10],
       [ 3, 14,  0],
       [ 3, 29,  0],
       [27, 25,  0]])
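
What `padding='post'` did above can be sketched in plain Python. `pad_post` is a hypothetical helper, not the Keras implementation. Its truncation rule is an assumption and is never triggered here, because no review is longer than three words.

```python
def pad_post(sequences, maxlen, value=0):
    """Right-pad each sequence with `value` until it has `maxlen` items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]  # assumption: drop extra items from the end
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

# First five encoded reviews from the output above.
encoded = [[27, 12], [2, 21], [14, 2], [39, 4, 36], [7, 17, 29]]
print(pad_post(encoded, maxlen=3))
# [[27, 12, 0], [2, 21, 0], [14, 2, 0], [39, 4, 36], [7, 17, 29]]
```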

In [97]:
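embedding_vector_size = 4  # each word index is mapped to a 4-dimensional vector

model = Sequential()
# Embedding: lookup table of shape (vocab_size, embedding_vector_size)
model.add(Embedding(vocab_size, embedding_vector_size, input_length=max_length, name='embedding'))
# Flatten: (3 words x 4 features) -> 12 values per review
model.add(Flatten())
# Dense: a single sigmoid unit for positive/negative sentiment
model.add(Dense(1, activation='sigmoid'))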

In [98]:
x = padded_reviews
y = sentiment

In [99]:
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

In [100]:
model.summary()

Model: "sequential_9"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 3, 4)              160       
                                                                 
 flatten_9 (Flatten)         (None, 12)                0         
                                                                 
 dense_9 (Dense)             (None, 1)                 13        
                                                                 
Total params: 173
Trainable params: 173
Non-trainable params: 0
_________________________________________________________________
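
The parameter counts in the summary follow directly from the sizes chosen above. The arithmetic is reproduced here as a small check; the variable names are illustrative.

```python
vocab_size = 40
embedding_dim = 4
max_length = 3

embedding_params = vocab_size * embedding_dim  # one 4-d vector per word index
flatten_units = max_length * embedding_dim     # 3 words x 4 features
dense_params = flatten_units * 1 + 1           # 12 weights plus one bias

print(embedding_params, flatten_units, dense_params,
      embedding_params + dense_params)
# 160 12 13 173
```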


In [101]:
model.fit(x, y, epochs=100, verbose=0)

<keras.callbacks.History at 0x2d2298096a0>

In [102]:
loss, acc = model.evaluate(x, y)  # evaluated on the training data itself, so this shows fit, not generalization
acc

1.0

In [103]:
# What we are really interested in is the word embeddings, i.e. the weights of the embedding layer

In [104]:
weights = model.get_layer('embedding').get_weights()[0]

In [105]:
len(weights)

40

In [106]:
weights[27]  # index 27 is shared by 'nice' and 'needs' in encoded_reviews

array([-0.08301078, -0.06225186,  0.11192101,  0.11618899], dtype=float32)

In [107]:
weights[2]  # index 2 is shared by 'amazing' and 'good' in encoded_reviews

array([-0.12627634, -0.13584265,  0.08930933,  0.12957135], dtype=float32)
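
One common way to compare two learned vectors is cosine similarity. The sketch below uses the two vectors printed above. Keep in mind that in the encoded reviews index 27 is shared by 'nice' and 'needs' and index 2 by 'amazing' and 'good', so with only ten reviews and colliding indices the result says little about word meaning.

```python
import numpy as np

# Embedding vectors copied from the outputs above.
vec_27 = np.array([-0.08301078, -0.06225186, 0.11192101, 0.11618899])
vec_2 = np.array([-0.12627634, -0.13584265, 0.08930933, 0.12957135])

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 1.0 means same direction."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

similarity = cosine_similarity(vec_27, vec_2)
print(round(similarity, 3))
```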