# What is Word Embedding?
* **Representing text as numbers:** Machine learning models take vectors (arrays of numbers) as input. When working with text, the first thing we must do is come up with a strategy to convert strings to numbers (that is, to "vectorize" the text) before feeding it to the model.

In general, there are three strategies for doing so:
* **One-hot encodings**:

![alt text](https://github.com/tensorflow/docs/blob/master/site/en/tutorials/text/images/one-hot.png?raw=1)

* **Encode each word with a unique number**:

  * the -> 0
  * cat -> 1
  * mat -> 2
  * on -> 3

* **Word embeddings**: a dense vector representation using floating-point values. These values are trainable parameters, so they are updated while the model is trained.

![alt text](https://github.com/tensorflow/docs/blob/master/site/en/tutorials/text/images/embedding2.png?raw=1)
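The three strategies can be compared on a toy vocabulary. The following is a minimal pure-Python sketch, not part of the original notebook: the four-word vocabulary is the mapping listed above, while `embedding_dim`, the random seed and the randomly filled `embedding_table` are illustrative assumptions (in a real model the table values are learned during training).

```python
import random

# Toy vocabulary taken from the mapping above: word -> unique integer
vocab = {"the": 0, "cat": 1, "mat": 2, "on": 3}

def one_hot_vector(word):
    """Return a sparse vector with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec

def integer_encode(sentence):
    """Replace each word with its unique integer."""
    return [vocab[w] for w in sentence.split()]

# Hypothetical embedding table: one dense float vector per word.
# Values are random here; a model would adjust them during training.
embedding_dim = 4  # assumed size, chosen only for illustration
rng = random.Random(0)
embedding_table = [[rng.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
                   for _ in vocab]

def embed(sentence):
    """Look up the dense vector of each word in the sentence."""
    return [embedding_table[i] for i in integer_encode(sentence)]

print(one_hot_vector("cat"))                  # [0, 1, 0, 0]
print(integer_encode("the cat on the mat"))   # [0, 1, 3, 0, 2]
print(len(embed("the cat")), len(embed("the cat")[0]))  # 2 4
```

The one-hot vector grows with the vocabulary and is almost entirely zeros, the integer code is compact but imposes an arbitrary ordering, and the embedding keeps a short dense vector per word.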


# References:

* [TF word embedding tutorial](https://www.tensorflow.org/tutorials/text/word_embeddings)
* [Word Embedding Example](https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/)

* [Tokenization](https://www.kdnuggets.com/2020/03/tensorflow-keras-tokenization-text-data-prep.html)



In [None]:
from numpy import array
import tensorflow as tf 
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, SpatialDropout1D, Dropout, Convolution1D
from tensorflow.keras.layers import Flatten,  LSTM, GlobalMaxPooling1D

from tensorflow.keras.layers import Embedding
import numpy as np
import pandas as pd
import urllib.request  # loadFile() below calls urllib.request.urlopen

---
# Corpus

In [None]:
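def loadFile(url):
  """Download a UTF-8 text file and return its non-empty lines as a list of statements."""
  stms = []
  with urllib.request.urlopen(url) as file:
    for line in file:
      line = line.decode("utf-8")
      # Skip blank or near-empty lines (two characters or fewer, newline included)
      if len(line) > 2:
        stms.append(line.strip())
  return stms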


In [None]:
urlWish = 'https://raw.githubusercontent.com/kmkarakaya/ML_tutorials/master/data/dua.txt'
urlCurse= 'https://raw.githubusercontent.com/kmkarakaya/ML_tutorials/master/data/beddua.txt'

wish = loadFile(urlWish)
curse = loadFile(urlCurse)

totalWish=len(wish)
print('totalWish: ',totalWish) 
totalCurse = len(curse)
print('totalCurse: ',totalCurse)         

totalWish:  177
totalCurse:  801


In [None]:
curse= curse[:totalWish]
totalCurse = len(curse)
print('totalCurse: ',totalCurse) 

totalCurse:  177


In [None]:
testWish= int(totalWish* 0.1)
testCurse = int(totalCurse * 0.1)
print('testWish ', testWish)
print('testCurse ', testCurse)

trainDocs= wish[:-testWish]+curse[:-testCurse]
testDocs= wish[-testWish:]+curse[-testCurse:]
print(len(trainDocs)) 
print(len(testDocs)) 

trainLabels = np.concatenate((np.ones(totalWish-testWish),np.zeros(totalCurse-testCurse)), axis=0) 
testLabels = np.concatenate((np.ones(testWish),np.zeros(testCurse)), axis=0) 

print(len(trainLabels)) 
print(len(testLabels))

testWish  17
testCurse  17
320
34
320
34


In [None]:
allDocs= trainDocs + testDocs
print(allDocs)
print(len(allDocs))

['Acı yüzü görmeyesin.', 'Allah kimseyi aç açık bırakmasın.', 'Allah’ım beni affet.', 'Allah’ım affet.', 'Afiyet  şeker olsun.', 'Ağzını hayra aç.', 'Allah ayrılık vermesin.', 'Allah dert vermesin.', 'Allah acı vermesin.', 'Babanın canına rahmet.', 'Annenin canına rahme', 'Bahtın açık olsun.', 'Yolun açık olsun.', 'Şansın açık olsun.', 'Başına devlet kuşu kona.', 'Başına devlet kuşu konsun', 'Bereketi Allah’tan olsun.', 'Beytullaha yüz süresin.', 'Bolluğun başından aşa.', 'Ciğer acısı görmeyesin.', 'Çıran her daim yakılı kalsın.', 'Çift  çubuk sahibi olasın.', 'Dal  budak salasın.', 'Damatlığını da görürüz inşallah.', 'Darlık yüzü görmeyesin.', 'Yokluk yüzü görmeyesin.', 'Ekenin doğuranın eksik olmasın.', 'Ermişlerden olasın.', 'Evladınla binbir yaşa.', 'Ahrette Fatma anamıza komşu olasın.', 'Geçmiş olsun.', 'Gurbet yüzü görmeyesin.', 'Hatır soranların çok olsun.', 'El öpenlerin çok olsun', 'Hayırlı  uğurlu olsun.', 'Hızır yoldaşın olsun.', 'İki cihanda aziz ol.', 'İyi yolculuklar.', '

---
# Tokenize the corpus

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
# Fit the tokenizer on the whole corpus (train + test) so that every word gets an index
tokenizer = Tokenizer()
tokenizer.fit_on_texts(allDocs)

document_count = tokenizer.document_count
vocab_size = len(tokenizer.word_index)

# Encode all sentences as integer sequences
allDocs_sequences = tokenizer.texts_to_sequences(allDocs)

# Get the maximum sequence length over the whole corpus
max_length = max(len(x) for x in allDocs_sequences)

# Get the word index (word -> integer id)
word_index = tokenizer.word_index
print("Corpus Summary")
print("Word index:", word_index)
print("document count  :", document_count)
print("vocabulary size :", vocab_size)
print("Maximum length of the statements :", max_length)

Corpus Summary
Word index: {'allah': 1, 'olsun': 2, 'versin': 3, 'adın': 4, 'görmeyesin': 5, 'olasın': 6, 'ola': 7, 'başına': 8, 'bereketi': 9, 'seni': 10, 'batsın': 11, 'yüzü': 12, 'bol': 13, 'taş': 14, 'elin': 15, 'gelesin': 16, 'gele': 17, 'açık': 18, 'dert': 19, 'ol': 20, 'altın': 21, 'su': 22, 'beladan': 23, 'etsin': 24, 'bir': 25, 'kurusun': 26, 'kara': 27, 'sanın': 28, 'kalsın': 29, 'inşallah': 30, 'olmasın': 31, 'hayırlı': 32, 'ömrün': 33, 'uzun': 34, 'gibi': 35, 'kazadan': 36, 'görmesin': 37, 'gelsin': 38, 'ağzın': 39, 'vermesin': 40, 'canına': 41, 'da': 42, 'yokluk': 43, 'çok': 44, 'kısmetin': 45, 'senden': 46, 'esirgesin': 47, 'halil': 48, 'i̇brahim': 49, 'korusun': 50, 'toprak': 51, 'göre': 52, 'bağışlasın': 53, 'sana': 54, 'gözün': 55, 'çıksın': 56, 'kör': 57, 'rahmet': 58, 'kuşu': 59, 'kona': 60, 'daim': 61, 'komşu': 62, 'aziz': 63, 'muhtaç': 64, 'nasibin': 65, 'analı': 66, 'babalı': 67, 'büyütsün': 68, 'ne': 69, 'varsa': 70, 'iki': 71, 'muradın': 72, 'nur': 73, 'anan': 7
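To make the word index above easier to read: with its default settings, the Keras `Tokenizer` lowercases the text, removes punctuation, splits on whitespace, and assigns indices starting at 1 in order of decreasing word frequency (index 0 is left free for padding). The following is a simplified pure-Python sketch of that idea, not the library's implementation. The helper names `tokenize`, `build_word_index` and `encode_texts`, the `FILTERS` string and the three sample sentences are illustrative assumptions.

```python
from collections import Counter

# Approximation of the Tokenizer's default filter characters (assumed here)
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def tokenize(text):
    """Lowercase, replace filtered characters with spaces, split on whitespace."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    return text.lower().translate(table).split()

def build_word_index(texts):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(word for text in texts for word in tokenize(text))
    # most_common() orders by descending count; ties keep first-seen order
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

def encode_texts(texts, word_index):
    """Encode each text as a list of word indices, skipping unknown words."""
    return [[word_index[w] for w in tokenize(t) if w in word_index] for t in texts]

sample = ["Allah dert vermesin.", "Allah acı vermesin.", "Geçmiş olsun."]
sample_index = build_word_index(sample)
print(sample_index)
# {'allah': 1, 'vermesin': 2, 'dert': 3, 'acı': 4, 'geçmiş': 5, 'olsun': 6}
print(encode_texts(sample, sample_index))
# [[1, 3, 2], [1, 4, 2], [5, 6]]
```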

In [None]:
# Encode training data sentences into sequences
train_sequences = tokenizer.texts_to_sequences(trainDocs)

# Pad the training sequences
train_padded = pad_sequences(train_sequences, padding='post', truncating='post', maxlen=max_length)

# Output the results of our work
print("Train Doc Summary")
print("\nTraining sequences:\n", train_sequences)
print("\nPadded training sequences:\n", train_padded[:5])
print("\nPadded training shape:", train_padded.shape)
print("Training sequences data type:", type(train_sequences))
print("Padded Training sequences data type:", type(train_padded))

Train Doc Summary

Training sequences:
 [[94, 12, 5], [1, 186, 95, 18, 187], [96, 188, 97], [96, 97], [189, 190, 2], [191, 192, 95], [1, 193, 40], [1, 19, 40], [1, 94, 40], [194, 41, 58], [195, 41, 196], [98, 18, 2], [99, 18, 2], [100, 18, 2], [8, 101, 59, 60], [8, 101, 59, 197], [9, 198, 2], [199, 200, 201], [202, 203, 204], [205, 206, 5], [207, 208, 61, 209, 29], [210, 211, 212, 6], [213, 214, 215], [216, 42, 217, 30], [218, 12, 5], [43, 12, 5], [219, 220, 221, 31], [222, 6], [223, 224, 225], [226, 227, 228, 62, 6], [229, 2], [230, 12, 5], [231, 232, 44, 2], [102, 103, 44, 2], [32, 233, 2], [234, 235, 2], [104, 236, 63, 20], [237, 105], [32, 105], [238, 239, 5], [240, 2], [241, 20], [242, 243, 244, 245, 246], [106, 61, 7], [247, 248, 249], [106, 61, 2], [250, 64, 251], [1, 252, 64, 253], [65, 45, 13, 7], [65, 45, 13, 2], [45, 13, 2], [65, 13, 2], [107, 254, 255, 256], [108, 257, 21, 258], [33, 34, 45, 259, 2], [22, 35, 63, 20], [260, 261, 10, 262], [32, 263], [32, 264], [109, 265, 9,

In [None]:
# Encode test data sentences into sequences
test_sequences = tokenizer.texts_to_sequences(testDocs)

# Pad the test sequences
test_padded = pad_sequences(test_sequences, padding='post', truncating='post', maxlen=max_length)

# Output the results of our work
print("Test Doc Summary")
print("\nTest sequences:\n", test_sequences)
print("\nPadded test sequences:\n", test_padded[:5])
print("\nPadded test shape:", test_padded.shape)
print("Test sequences data type:", type(test_sequences))
print("Padded Test sequences data type:", type(test_padded))

Test Doc Summary

Test sequences:
 [[630, 184, 2], [184, 2], [631, 632, 6], [633, 634], [635, 636, 637, 638, 7], [639, 640, 641, 83, 2], [33, 34, 7], [33, 34, 2], [34, 115], [642, 643, 7], [1, 54, 644], [1, 3], [1, 80, 52, 3], [185, 84, 645], [185, 84, 646], [647, 648, 5], [43, 5], [649, 650, 6], [4, 651, 60], [4, 85], [4, 11, 30], [4, 652, 169, 172], [4, 27, 136, 17], [4, 182, 653], [4, 654, 655, 17], [21, 4, 656, 7], [4, 28, 11], [4, 28, 657, 2], [4, 28, 27, 17], [4, 28, 26], [658, 659, 167], [39, 660, 177], [39, 661], [39, 662]]

Padded test sequences:
 [[630 184   2   0   0   0   0   0]
 [184   2   0   0   0   0   0   0]
 [631 632   6   0   0   0   0   0]
 [633 634   0   0   0   0   0   0]
 [635 636 637 638   7   0   0   0]]

Padded test shape: (34, 8)
Test sequences data type: <class 'list'>
Padded Test sequences data type: <class 'numpy.ndarray'>
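The padded output above follows from `padding='post'` and `truncating='post'`: zeros are appended after the word ids until the row reaches `maxlen`, and longer rows would be cut at the end. A minimal pure-Python sketch of that behaviour (the helper name `pad_post` is hypothetical; the real `pad_sequences` returns a NumPy array and supports more options):

```python
def pad_post(sequences, maxlen, value=0):
    """Pad each sequence at the end with `value`, or cut it at the end, to length `maxlen`."""
    padded = []
    for seq in sequences:
        trimmed = list(seq[:maxlen])  # truncating='post' keeps the first maxlen ids
        padded.append(trimmed + [value] * (maxlen - len(trimmed)))  # padding='post' appends zeros
    return padded

# First two test sequences from the output above, padded to the corpus maximum of 8
print(pad_post([[630, 184, 2], [184, 2]], maxlen=8))
# [[630, 184, 2, 0, 0, 0, 0, 0], [184, 2, 0, 0, 0, 0, 0, 0]]
```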


---
# Model 1: Vanilla Deep NN

In [None]:
#@title ENTER EPOCH 
epochs =  100#@param {type:"integer"}


In [None]:
# define the model
model1 = Sequential()
model1.add(Dense(8, input_shape=(max_length,)))  # the 8 integer word ids are fed directly as input; the corpus is only integer-encoded
#model.add(Flatten())
model1.add(Dense(64, activation='relu'))
model1.add(Dense(128, activation='relu'))
model1.add(Dense(64, activation='relu'))
model1.add(Dense(32, activation='relu'))
model1.add(Dense(1, activation='sigmoid'))
# compile the model
model1.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# summarize the model
print(model1.summary())
# fit the model
model1.fit(train_padded, trainLabels, epochs=epochs, verbose=0)
# evaluate the model
loss, accuracy = model1.evaluate(test_padded, testLabels, verbose=0)
print('Accuracy: %f' % (accuracy*100))

Model: "sequential_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_15 (Dense)             (None, 8)                 72        
_________________________________________________________________
dense_16 (Dense)             (None, 64)                576       
_________________________________________________________________
dense_17 (Dense)             (None, 128)               8320      
_________________________________________________________________
dense_18 (Dense)             (None, 64)                8256      
_________________________________________________________________
dense_19 (Dense)             (None, 32)                2080      
_________________________________________________________________
dense_20 (Dense)             (None, 1)                 33        
Total params: 19,337
Trainable params: 19,337
Non-trainable params: 0
__________________________________________________
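The `Param #` column in the summary above can be checked by hand: a `Dense` layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. A short sketch (the helper name `dense_params` is ours, not a Keras function):

```python
def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out

# Layer widths of model1: 8 input ids -> 8 -> 64 -> 128 -> 64 -> 32 -> 1
widths = [8, 8, 64, 128, 64, 32, 1]
per_layer = [dense_params(a, b) for a, b in zip(widths, widths[1:])]
print(per_layer)       # [72, 576, 8320, 8256, 2080, 33]
print(sum(per_layer))  # 19337
```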

---
# Model 2: Deep NN with Word Embedding





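The `Embedding` layer signature is shown here; the arguments set explicitly in this notebook are `input_dim`, `output_dim` and `input_length`:

    tf.keras.layers.Embedding(
        input_dim, output_dim, embeddings_initializer='uniform',
        embeddings_regularizer=None, activity_regularizer=None,
        embeddings_constraint=None, mask_zero=False, input_length=None, **kwargs
    )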

In [None]:
input_dim = vocab_size+1
output_dim = 8

# define the model
model2 = Sequential()
model2.add(Embedding(input_dim, output_dim, input_length=max_length, name= 'embeded'))
model2.add(Flatten())
model2.add(Dense(32, activation='relu'))
model2.add(Dense(1, activation='sigmoid'))

# compile the model
model2.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# summarize the model
print(model2.summary())

# fit the model
model2.fit(train_padded, trainLabels, epochs=epochs, verbose=0)

# evaluate the model
loss, accuracy = model2.evaluate(test_padded, testLabels, verbose=0)
print('Accuracy: %f' % (accuracy*100))

Model: "sequential_7"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embeded (Embedding)          (None, 8, 8)              5304      
_________________________________________________________________
flatten_3 (Flatten)          (None, 64)                0         
_________________________________________________________________
dense_21 (Dense)             (None, 32)                2080      
_________________________________________________________________
dense_22 (Dense)             (None, 1)                 33        
Total params: 7,417
Trainable params: 7,417
Non-trainable params: 0
_________________________________________________________________
None
Accuracy: 85.294116
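The shapes and the embedding parameter count in the summary above can be traced with a small stand-in for the layer. This pure-Python sketch only illustrates the lookup and is not the Keras implementation: the table is filled with random values instead of learned ones, and `input_dim = 663` is read off the summary (5304 parameters divided by 8 columns, that is, the vocabulary size plus one row for the padding id 0).

```python
import random

input_dim, output_dim = 663, 8   # table size read off the summary above

# Stand-in for the trainable embedding matrix: random values instead of learned ones
rng = random.Random(0)
table = [[rng.uniform(-0.05, 0.05) for _ in range(output_dim)] for _ in range(input_dim)]

# Two padded test sequences from earlier in the notebook
batch = [[630, 184, 2, 0, 0, 0, 0, 0],
         [184, 2, 0, 0, 0, 0, 0, 0]]

# Embedding: every word id selects one row of the table -> (batch, 8, 8)
embedded = [[table[word_id] for word_id in row] for row in batch]
# Flatten: concatenate the 8 vectors of a sentence into one vector -> (batch, 64)
flattened = [[x for vec in sentence for x in vec] for sentence in embedded]

print(input_dim * output_dim)                                # 5304 trainable values
print(len(embedded), len(embedded[0]), len(embedded[0][0]))  # 2 8 8
print(len(flattened), len(flattened[0]))                     # 2 64
```

Note that the padding id 0 also selects a row, so padded positions contribute a (trainable) vector of their own unless masking is enabled.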


---
# Model 3: Word Embedding + LSTM

In [None]:
input_dim = vocab_size+1
output_dim = 8

# define the model
model3 = Sequential()
model3.add(Embedding(input_dim, output_dim, input_length=max_length, name= 'embeded'))
model3.add(SpatialDropout1D(0.25))
model3.add(LSTM(16, return_sequences=True))
model3.add(LSTM(8))
model3.add(Dropout(0.25))
model3.add(Dense(1, activation='sigmoid'))

# compile the model
model3.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# summarize the model
print(model3.summary())

# fit the model
model3.fit(train_padded, trainLabels, epochs=epochs, verbose=0)

# evaluate the model
loss, accuracy = model3.evaluate(test_padded, testLabels, verbose=0)
print('Accuracy: %f' % (accuracy*100))


Model: "sequential_8"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embeded (Embedding)          (None, 8, 8)              5304      
_________________________________________________________________
spatial_dropout1d_1 (Spatial (None, 8, 8)              0         
_________________________________________________________________
lstm_2 (LSTM)                (None, 8, 16)             1600      
_________________________________________________________________
lstm_3 (LSTM)                (None, 8)                 800       
_________________________________________________________________
dropout_3 (Dropout)          (None, 8)                 0         
_________________________________________________________________
dense_23 (Dense)             (None, 1)                 9         
Total params: 7,713
Trainable params: 7,713
Non-trainable params: 0
____________________________________________________
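The LSTM parameter counts above follow from the four gates of an LSTM cell, each with input weights, recurrent weights and a bias: `4 * ((n_in + units) * units + units)`. A quick check (the helper name `lstm_params` is ours):

```python
def lstm_params(n_in, units):
    """Four gates, each with input weights, recurrent weights and a bias vector."""
    return 4 * ((n_in + units) * units + units)

print(lstm_params(8, 16))  # 1600: the first LSTM reads 8-dimensional embeddings
print(lstm_params(16, 8))  # 800: the second LSTM reads the 16-dimensional outputs of the first
```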

---
# Model 4: Word Embedding + Convolution1D

In [None]:
input_dim = vocab_size+1
output_dim = 8

# define the model
model4 = Sequential()
model4.add(Embedding(input_dim, output_dim, input_length=max_length, name= 'embeded'))
model4.add(Dropout(0.50))
model4.add(Convolution1D(16,3))
model4.add(Convolution1D(16,5))
model4.add(GlobalMaxPooling1D())
model4.add(Dropout(0.50))
model4.add(Dense(16, activation='relu'))
model4.add(Dense(1, activation='sigmoid'))

# compile the model
model4.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# summarize the model
print(model4.summary())

# fit the model
model4.fit(train_padded, trainLabels, epochs=epochs, verbose=0)

# evaluate the model
loss, accuracy = model4.evaluate(test_padded, testLabels, verbose=0)
print('Accuracy: %f' % (accuracy*100))

Model: "sequential_9"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embeded (Embedding)          (None, 8, 8)              5304      
_________________________________________________________________
dropout_4 (Dropout)          (None, 8, 8)              0         
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 6, 16)             400       
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 2, 16)             1296      
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 16)                0         
_________________________________________________________________
dropout_5 (Dropout)          (None, 16)                0         
_________________________________________________________________
dense_24 (Dense)             (None, 16)               
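The shapes in the (truncated) summary above can be reproduced by hand. With the default `'valid'` padding and a stride of 1, a `Conv1D` layer shortens the sequence to `length - kernel_size + 1`, and it holds `kernel_size * channels_in * filters` weights plus one bias per filter. A small sketch with hypothetical helper names:

```python
def conv1d_out_length(length, kernel_size):
    """Output length for 'valid' padding and stride 1."""
    return length - kernel_size + 1

def conv1d_params(kernel_size, channels_in, filters):
    """One kernel of shape (kernel_size, channels_in) per filter, plus a bias each."""
    return kernel_size * channels_in * filters + filters

after_first = conv1d_out_length(8, 3)             # 8 time steps, kernel size 3
after_second = conv1d_out_length(after_first, 5)  # kernel size 5
print(after_first, after_second)                  # 6 2
print(conv1d_params(3, 8, 16), conv1d_params(5, 16, 16))  # 400 1296
```

Because the sentences are at most 8 tokens long, two unpadded convolutions already reduce the sequence to 2 positions before the global max pooling.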

---
# More models: Embedding + Conv1D + LSTM + Attention



*Do it yourself :)*

---
# Some free text



* 'gözlerin dert görmesin'
* 'gözlerin görmesin'
*  'gün görmesin'
* 'yüzün gün görmesin'
* 'ellerin dert görmesin'
* 'dert görmesin'
* 'gözlerin kör olsun'
* 'hayırlı olsun'
* 'hayır olmasın'
* 'belanı göresin'
* 'belanı görmeyesin'
* 'kör ol inşallah'
* 'mutlu ol inşallah'
* 'cennetlik ol inşallah'
* 'toprak ol inşallah'
* 'kısmetin bol olsun inşallah'
* 'kısmetin yok olsun inşallah'

In [None]:
#@title Enter your statement
statement = "belan\u0131 g\xF6rmeyesin" #@param {type:"string"}
#'gözlerin dert görmesin'
#'gözlerin görmesin'
# 'gün görmesin'
#'yüzün gün görmesin'
# 'dert görmesin'
#'gözlerin kör olsun'
#'hayırlı olsun'
#'hayır olmasın'
#'belanı göresin'
#'belanı görmeyesin'
#'kör ol inşallah'
#'mutlu ol inşallah'
#'cennetlik ol inşallah'
#'toprak ol inşallah'
#'kısmetin bol olsun inşallah'
#'kısmetin yok olsun inşallah'

myTest = [statement]
myTestEncoded = tokenizer.texts_to_sequences(myTest)
print(myTestEncoded)
# Pad the encoded statement to the same length as the training data
myTestPadded = pad_sequences(myTestEncoded, padding='post', truncating='post', maxlen=max_length)
print(myTestPadded)

# Run each model once; a sigmoid output above 0.5 is read as 'Wish', otherwise as 'Curse'
for name, model in [("Deep NN model ", model1), ("Word Embedding ", model2),
                    ("Word Embedding + LSTM ", model3), ("Word Embedding + Conv1D ", model4)]:
  prob = model.predict(myTestPadded)[0][0]
  print(name, 'Wish' if prob > 0.5 else 'Curse', prob)

[[541, 5]]
[[541   5   0   0   0   0   0   0]]
Deep NN model  Wish 0.88761014
Word Embedding  Curse 0.0026264114
Word Embedding + LSTM  Curse 0.0010906169
Word Embedding + Conv1D  Curse 0.008143903


---
# Visualize the embedding

## Save the word vectors and words 

In [None]:
e= model3.get_layer(name='embeded')
weights = e.get_weights()[0]
print(weights.shape) # shape: (vocab_size + 1, embedding_dim); row 0 belongs to the padding id

(663, 8)


In [None]:
import io
file_vec = 'vecs_'+str(epochs)+'.tsv'
file_meta= 'meta_'+str(epochs)+'.tsv'

# Write one tab-separated vector per line, and the matching word on the same line of the metadata file
with io.open(file_vec, 'w', encoding='utf-8') as out_v, io.open(file_meta, 'w', encoding='utf-8') as out_m:
  for word, index in tokenizer.word_index.items():
    vec = weights[index] # indices start at 1; row 0 is skipped, it's padding.
    out_m.write(word + "\n")
    out_v.write('\t'.join([str(x) for x in vec]) + "\n")

## Download the two files

In [None]:
try:
  from google.colab import files
except ImportError:
   pass
else:
  files.download(file_vec)
  files.download(file_meta)



## Open http://projector.tensorflow.org/