## Basic Definitions

Deep learning for natural language processing is pattern recognition applied to textual data. Text must be preprocessed before machine learning techniques can be applied to it. Typically, the text is broken down into characters, words, or $n$-grams. An $n$-gram is a sequence of $n$ consecutive characters or words extracted from a sentence; in practice, all such sequences of length $n$ or fewer are often collected together. The units into which the text is broken down are called _tokens_, and the process of breaking text into tokens is called _tokenization_. Each token is then assigned a numeric vector according to some encoding scheme. The major schemes are:
* one-hot encoding of tokens
* token embedding (when applied to words, this is called word embedding)
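
To make the notion of an $n$-gram concrete, here is a minimal sketch that extracts word $n$-grams from a sentence. The helper names `word_ngrams` and `ngrams_up_to` are hypothetical and chosen only for this illustration; the sentence is split on whitespace, which is a simplifying assumption rather than a full tokenizer.

```python
from typing import List


def word_ngrams(sentence: str, n: int) -> List[str]:
    """Return every run of exactly n consecutive words in the sentence."""
    words = sentence.split()
    # range() is empty when the sentence has fewer than n words
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]


def ngrams_up_to(sentence: str, n: int) -> List[str]:
    """Return all word k-grams for k = 1..n (the 'n or fewer' reading)."""
    grams = []
    for k in range(1, n + 1):
        grams.extend(word_ngrams(sentence, k))
    return grams


print(word_ngrams("the cat sat on the mat", 2))
# ['the cat', 'cat sat', 'sat on', 'on the', 'the mat']
```

Character $n$-grams could be obtained in the same way by slicing the string itself instead of the list of words.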

### One-Hot Encoding

One-hot encoding assigns a unique integer index to every word in a finite vocabulary of size $N$. The vector associated with the $i$th word of this vocabulary is a binary vector of length $N$ that has a $1$ in position $i$ and zeros elsewhere.
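
The definition can be sketched directly in NumPy. The three-word vocabulary and the helper `one_hot` below are hypothetical, and positions are counted from 0 here purely for convenience; other conventions start from 1 or reserve index 0.

```python
import numpy as np

vocabulary = ['cat', 'dog', 'mat']  # hypothetical vocabulary, N = 3
N = len(vocabulary)

# Assign each word a unique integer index (0-based in this sketch)
word_to_index = {word: i for i, word in enumerate(vocabulary)}


def one_hot(word: str) -> np.ndarray:
    """Return a length-N vector with a 1 at the word's index and 0 elsewhere."""
    vector = np.zeros(N, dtype=int)
    vector[word_to_index[word]] = 1
    return vector


print(one_hot('dog'))  # [0 1 0]
```

Each vector has exactly one nonzero entry, so its length grows with the vocabulary size.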


## Basic Tokenization

In [1]:
import numpy as np
import re

from typing import List


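def tokenize(sample: str, drop_singletons: bool = False) -> List[str]:
    # Replace every run of non-alphanumeric characters with a single space
    neither_char_nor_number = r'[^A-Za-z0-9]+'
    without_special_chars = re.sub(neither_char_nor_number, ' ', sample)
    # split() without arguments ignores leading/trailing whitespace and
    # returns [] (rather than ['']) for an empty string
    words = without_special_chars.split()

    if drop_singletons:
        # Discard one-character tokens such as 'a' or the 'm' left over from "I'm"
        words = [w for w in words if len(w) > 1]

    return words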

def create_token_dict(samples: List[str]) -> dict:
    # Map each distinct token to an integer index, starting from 1,
    # in order of first appearance
    token_dict = dict()
    for sample in samples:
        words = tokenize(sample, drop_singletons=True)
        for w in words:
            if w not in token_dict:
                token_dict[w] = len(token_dict) + 1
    return token_dict

In [2]:
create_token_dict(["Consider the number  90**5.", "This a large number!", "I'm fine with smaller ones:)"])

{'Consider': 1,
 'the': 2,
 'number': 3,
 '90': 4,
 'This': 5,
 'large': 6,
 'fine': 7,
 'with': 8,
 'smaller': 9,
 'ones': 10}

In [4]:
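# Note: this Tokenizer class is marked as deprecated in later TensorFlow 2.x
# releases, whose documentation recommends the TextVectorization layer instead
from keras.preprocessing.text import Tokenizer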


samples = ['The cat sat on the mat.', 'The dog ate my homework.']

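# Restricts the encoding to the most frequent words, capped by num_words
tokenizer = Tokenizer(num_words=1000)

# Builds the word index
tokenizer.fit_on_texts(samples)

# Turns strings into lists of integer indices
sequences = tokenizer.texts_to_sequences(samples)

# Binary matrix with one row per sample and num_words columns:
# entry [i, j] is 1 if the word with index j occurs in sample i
one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')

# Recovers the word index that was computed
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))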

Found 9 unique tokens.


In [5]:
one_hot_results

array([[0., 1., 1., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.]])
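
Each row of this matrix marks which words occur in the corresponding sample, so a row can contain several ones. The following is a rough sketch of how such a matrix could be built by hand with NumPy. It is not the Keras implementation: the regular expression only approximates the default filtering, words are indexed in order of first appearance rather than by frequency, and the matrix has only as many columns as needed instead of `num_words`.

```python
import re

import numpy as np

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# Lowercase and keep alphanumeric runs (an approximation of the Keras defaults)
tokenized = [re.findall(r'[a-z0-9]+', s.lower()) for s in samples]

# Index words from 1 in order of first appearance; column 0 stays unused
word_index = {}
for words in tokenized:
    for w in words:
        word_index.setdefault(w, len(word_index) + 1)

# One row per sample, one column per index; set 1.0 where the word occurs
matrix = np.zeros((len(samples), len(word_index) + 1))
for row, words in enumerate(tokenized):
    for w in words:
        matrix[row, word_index[w]] = 1.0

print(matrix[:, :3])
```

For these two samples, the first three columns agree with the output shown above: both sentences contain "the" (index 1), and only the first contains "cat" (index 2).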