# Basic Definitions

Deep learning for natural language processing is pattern recognition applied to textual data. Text must be preprocessed before machine learning techniques can be applied to it. Typically, the text is broken down into characters, words, or $n$-grams. An $n$-gram is a sequence of $n$ (or fewer) consecutive characters or words extracted from a sentence. The units into which text is broken down are called _tokens_, and the process of breaking text down into tokens is called _tokenization_. Each token is then assigned a numeric vector according to some encoding scheme. The major schemes are:
* one-hot encoding of tokens
* token embedding (typically applied to words, in which case it is called word embedding)
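
To make the notion of an $n$-gram concrete, here is a minimal sketch that lists every run of up to $n$ consecutive words in a sentence. The function name `word_ngrams` and the lowercase-and-split-on-whitespace preprocessing are assumptions made for this illustration only.

```python
from typing import List, Tuple


def word_ngrams(sentence: str, n: int) -> List[Tuple[str, ...]]:
    """Return every run of 1..n consecutive words, shortest runs first."""
    words = sentence.lower().split()
    grams = []
    for size in range(1, n + 1):
        # Slide a window of `size` words across the sentence.
        for start in range(len(words) - size + 1):
            grams.append(tuple(words[start:start + size]))
    return grams


grams = word_ngrams("The cat sat on the mat", 2)
print(len(grams))  # 6 single words + 5 word pairs = 11
print(grams[-1])   # ('the', 'mat')
```

Collecting such runs without regard to their order in the sentence gives a "bag" of $n$-grams, which can then be indexed like any other set of tokens.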

## One Hot Encoding

This consists of assigning a unique integer index to every word from a finite vocabulary of size $N$. The vector associated with the $i$th word from this vocabulary is a bit vector that has a $1$ in position $i$ and zeros elsewhere. 
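
As a small illustration of the definition above, the sketch below builds one-hot vectors with NumPy for a made-up four-word vocabulary. The vocabulary and the helper name `one_hot` are assumptions for this example, and indices start at 0 here for simplicity.

```python
import numpy as np

# Toy vocabulary, made up for illustration; index i belongs to vocabulary[i].
vocabulary = ["cat", "dog", "mat", "sat"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}


def one_hot(word: str) -> np.ndarray:
    """Return a bit vector with a single 1 at the word's index."""
    vec = np.zeros(len(vocabulary), dtype=np.int8)
    vec[word_to_index[word]] = 1
    return vec


print(one_hot("mat"))  # [0 0 1 0]
```

The vector length equals the vocabulary size $N$, which is why one-hot encodings become very wide and sparse for realistic vocabularies.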


## Basic Tokenization

In [1]:
import numpy as np
import re

from typing import Dict, List


def tokenize(sample: str, drop_singletons: bool = False) -> List[str]:
    """Split a string into alphanumeric tokens."""
    # Replace every run of non-alphanumeric characters with a single space.
    without_special_chars = re.sub(r'[^A-Za-z0-9]+', ' ', sample)
    # split() without arguments ignores leading/trailing whitespace
    # and never yields empty strings.
    words = without_special_chars.split()

    if drop_singletons:
        # Drop one-character tokens such as "a", "I", or the "m" left over from "I'm".
        words = [w for w in words if len(w) > 1]

    return words


def create_token_dict(samples: List[str]) -> Dict[str, int]:
    """Map each distinct token to an integer index, starting at 1."""
    token_dict = {}
    for sample in samples:
        for w in tokenize(sample, drop_singletons=True):
            if w not in token_dict:
                token_dict[w] = len(token_dict) + 1
    return token_dict

In [2]:
create_token_dict(["Consider the number  90**5.", "This a large number!", "I'm fine with smaller ones:)"])

{'Consider': 1,
 'the': 2,
 'number': 3,
 '90': 4,
 'This': 5,
 'large': 6,
 'fine': 7,
 'with': 8,
 'smaller': 9,
 'ones': 10}

In [3]:
from keras.preprocessing.text import Tokenizer


samples = ['The cat sat on the mat.', 'The dog ate my homework.']

tokenizer = Tokenizer(num_words=1000)

# Build the word index
tokenizer.fit_on_texts(samples)

# Turn each string into a list of integer indices
sequences = tokenizer.texts_to_sequences(samples)

# One row per sample, with a 1 at every word index that occurs in that sample
one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

Found 9 unique tokens.


In [4]:
one_hot_results

array([[0., 1., 1., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.]])
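
Note that the matrix above has one row per sample rather than one row per word: each row marks which word indices occur in that sample. The following pure-Python sketch is a rough approximation of that behaviour, not the Keras implementation. It assumes that text is lowercased, that punctuation is dropped, and that indices are assigned from 1 in order of decreasing frequency; the width `num_words = 10` is chosen only so that the whole matrix is readable.

```python
from collections import Counter
import re

import numpy as np

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# Lowercase and keep alphabetic runs only (a simplification for this sketch).
tokenized = [re.findall(r'[a-z]+', s.lower()) for s in samples]

# The most frequent word gets index 1; index 0 is left unused.
counts = Counter(w for words in tokenized for w in words)
ranked = sorted(counts, key=lambda w: -counts[w])  # stable: ties keep first-seen order
word_index = {w: i + 1 for i, w in enumerate(ranked)}

num_words = 10  # hypothetical small width, for readability
matrix = np.zeros((len(samples), num_words))
for row, words in enumerate(tokenized):
    for w in words:
        matrix[row, word_index[w]] = 1.0

print(word_index)
print(matrix)
```

Each row is the element-wise maximum of the one-hot vectors of the words in that sample, so word order and word counts are lost in this representation.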

## Word Embeddings

A word embedding is a map from a vocabulary $V$ to $\mathbf{R}^n$, where typically $n \ll |V|$. Compared with one-hot vectors, which are sparse and have dimension $|V|$, word embeddings give a compact, dense representation, and they are learned so that similar words have vectors that are close to each other. There is no universally "good" embedding: an embedding that suits movie-review classification may not suit the classification of scientific documents, because the relative importance of word pairs differs between the two domains. It is therefore often reasonable to learn a new word embedding for each new task.
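
One common way to quantify "close to each other" is cosine similarity. The sketch below uses three hand-picked toy vectors, not learned embeddings, purely to show the computation; the words and the numbers are made up for this example.

```python
import numpy as np

# Hand-picked toy vectors for illustration only, NOT learned embeddings.
embeddings = {
    "good": np.array([0.9, 0.1, 0.0]),
    "great": np.array([0.8, 0.2, 0.1]),
    "awful": np.array([-0.7, 0.1, 0.2]),
}


def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between u and v; 1 means same direction."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


sim_close = cosine_similarity(embeddings["good"], embeddings["great"])
sim_far = cosine_similarity(embeddings["good"], embeddings["awful"])
print(sim_close, sim_far)
```

With these toy vectors the first similarity is close to 1 and the second is negative. Whether a trained embedding places related words together in this way depends on the task and data it was trained on.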

In [5]:
from keras.layers import Embedding

# number of possible tokens = 1000 = 1 + maximum word index; dimensionality = 64
embedding_layer = Embedding(1000, 64)

In [6]:
# Loading the IMDB data for use with the embedding layer

from keras.datasets import imdb
from keras import preprocessing

# Keep only the 10000 most frequent words, so word indices lie in [0 ... 9999]
max_features = 10000
# Cut each review to its last 20 words (pad_sequences truncates from the front by default)
maxlen = 20


(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
x_train = preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = preprocessing.sequence.pad_sequences(x_test, maxlen=maxlen)
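
With its default arguments, `pad_sequences` both pads and truncates at the front of each sequence, so with `maxlen=20` a long review keeps its last 20 word indices and a short one is left-padded with zeros. The helper below, `pad_or_truncate_pre`, is a hypothetical NumPy stand-in written for this illustration; it mimics that default behaviour for a single sequence and is not the Keras function.

```python
import numpy as np


def pad_or_truncate_pre(sequence, maxlen, value=0):
    """Keep the last `maxlen` items; left-pad shorter sequences with `value`.

    Assumes maxlen >= 1.
    """
    kept = list(sequence)[-maxlen:]
    return np.array([value] * (maxlen - len(kept)) + kept)


print(pad_or_truncate_pre([1, 2, 3, 4, 5, 6], maxlen=4))  # [3 4 5 6]
print(pad_or_truncate_pre([7, 8], maxlen=4))              # [0 0 7 8]
```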



In [7]:
y_train[0:3]

array([1, 0, 0])

In [8]:
type(y_train)

numpy.ndarray

In [9]:
np.unique(y_train)

array([0, 1])

### Keras Embedding Layer


The main arguments of the `Embedding` layer are:
* `input_dim`: Integer. Size of the vocabulary, i.e. maximum integer index + 1.
* `output_dim`: Integer. Dimension of the dense embedding.
* `input_length`: Length of the input sequences, when it is constant. This argument is required if `Flatten` and then `Dense` layers are connected after the embedding layer (without it, the shape of the dense outputs cannot be computed).

In [10]:
from keras.models import Sequential
from keras.layers import Flatten, Dense


model = Sequential()
# input_dim = size of the input vocabulary
# output_dim = size of the embedded vectors
# input_length = size of each input sequence
model.add(Embedding(input_dim=max_features, output_dim=8, input_length=maxlen))

model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', 
              loss='binary_crossentropy', 
              metrics=['acc'])

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 20, 8)             80000     
_________________________________________________________________
flatten (Flatten)            (None, 160)               0         
_________________________________________________________________
dense (Dense)                (None, 1)                 161       
Total params: 80,161
Trainable params: 80,161
Non-trainable params: 0
_________________________________________________________________
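
The shapes and parameter counts in the summary above can be checked by hand. The short calculation below uses the same values as the model (`max_features = 10000`, `output_dim = 8`, `maxlen = 20`); the variable names are chosen for this sketch.

```python
max_features = 10000  # vocabulary size (input_dim)
embedding_dim = 8     # output_dim
maxlen = 20           # input_length

# One 8-dimensional vector per vocabulary index.
embedding_params = max_features * embedding_dim
# Flatten turns each (20, 8) output into a single vector.
flatten_units = maxlen * embedding_dim
# One weight per flattened unit plus one bias for the single output unit.
dense_params = flatten_units * 1 + 1

print(embedding_params, flatten_units, dense_params, embedding_params + dense_params)
# 80000 160 161 80161
```

Almost all of the parameters sit in the embedding matrix, which grows linearly with both the vocabulary size and the embedding dimension.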


In [11]:
history = model.fit(x_train, y_train, epochs=10, batch_size=32, validation_split=0.2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [12]:
history.history.keys()

dict_keys(['loss', 'acc', 'val_loss', 'val_acc'])

In [13]:
model.layers

[<tensorflow.python.keras.layers.embeddings.Embedding at 0x7fde3c4d4be0>,
 <tensorflow.python.keras.layers.core.Flatten at 0x7fde3c51c550>,
 <tensorflow.python.keras.layers.core.Dense at 0x7fde3c4de460>]

In [14]:
model.layers[0].get_weights()[0]

array([[-0.06403131,  0.02865895,  0.01554053, ...,  0.08405918,
        -0.07743204,  0.06030514],
       [-0.05036338,  0.05972435, -0.01678254, ...,  0.01711121,
        -0.09139732, -0.06354951],
       [ 0.03289519, -0.02895788,  0.10627425, ...,  0.0610166 ,
        -0.02239222,  0.0299639 ],
       ...,
       [ 0.01399802,  0.00061152,  0.01824389, ...,  0.02553368,
        -0.04194051, -0.02597152],
       [-0.03236458, -0.00690295, -0.0175093 , ..., -0.00018737,
        -0.02844385,  0.00599523],
       [-0.02778889, -0.04072574,  0.00177168, ..., -0.00709735,
        -0.00107574, -0.03337316]], dtype=float32)

In [15]:
model.layers[0].get_weights()[0][0]

array([-0.06403131,  0.02865895,  0.01554053,  0.08978625, -0.00389895,
        0.08405918, -0.07743204,  0.06030514], dtype=float32)

In [16]:
model.layers[0].get_weights()[0].shape

(10000, 8)
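
The weight matrix has one row per word index, so applying the embedding layer to a batch of integer sequences amounts to a row lookup. The NumPy sketch below illustrates this with a random stand-in for the learned weights and a made-up batch of indices; it also checks that the lookup agrees with multiplying a one-hot vector by the matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, maxlen = 10000, 8, 20

# Random stand-in for the learned (10000, 8) weight matrix.
weights = rng.normal(scale=0.05, size=(vocab_size, dim)).astype(np.float32)
# Made-up batch of two padded reviews, each a sequence of 20 word indices.
batch = rng.integers(0, vocab_size, size=(2, maxlen))

# Integer-array indexing selects one row of `weights` per word index.
embedded = weights[batch]
print(embedded.shape)  # (2, 20, 8)

# The same vector is obtained from a one-hot vector times the matrix.
index = batch[0, 0]
one_hot = np.zeros(vocab_size, dtype=np.float32)
one_hot[index] = 1.0
print(np.allclose(one_hot @ weights, embedded[0, 0]))  # True
```

This is why an embedding layer can be seen as a dense layer applied to one-hot inputs, implemented as a lookup to avoid the large sparse product.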