In [None]:
# Deep learning for text and sequences
# Chapter 6 of Deep Learning with Python

In [None]:
# 6.1 Working with text data
# Text is one of the most widespread forms of sequence data. It can be understood as either a sequence of characters or a
# sequence of words, but it’s most common to work at the level of words. The deep-learning sequence-processing models 
# introduced in the following sections can use text to produce a basic form of natural-language understanding, sufficient for
# applications including document classification, sentiment analysis, author identification, and even question-answering (QA) 
# (in a constrained context). Of course, keep in mind throughout this chapter that none of these deep-learning models truly 
# understand text in a human sense; rather, these models can map the statistical structure of written language, which is 
# sufficient to solve many simple textual tasks. Deep learning for natural-language processing is pattern recognition
# applied to words, sentences, and paragraphs, in much the same way that computer vision is pattern recognition applied to pixels.

In [None]:
# Like all other neural networks, deep-learning models don’t take raw text as input: they only work with numeric tensors.
# Vectorizing text is the process of transforming text into numeric tensors. This can be done in multiple ways:
# ** Segment text into words, and transform each word into a vector.
# ** Segment text into characters, and transform each character into a vector.
# ** Extract n-grams of words or characters, and transform each n-gram into a vector. N-grams are overlapping groups of multiple consecutive words or characters.
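
In [None]:
# A minimal sketch of the n-gram option listed above, assuming plain whitespace tokenization. The helper name
# extract_ngrams is hypothetical (it is not taken from the book). It slides a window of n consecutive items over a
# sequence, so the same function handles words (a list of strings) and characters (a plain string).

```python
def extract_ngrams(tokens, n):
    """Return the overlapping groups of n consecutive items found in tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = 'The cat sat on the mat.'

# Word-level 2-grams (bigrams): each pair of neighbouring words.
word_bigrams = extract_ngrams(sentence.split(), 2)

# Character-level 3-grams (trigrams): each run of three neighbouring characters.
char_trigrams = extract_ngrams('cat sat', 3)

print(word_bigrams)
print(char_trigrams)
```

# Because the windows overlap, a sequence of k tokens yields k - n + 1 n-grams (and none when k < n).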

In [None]:
# Collectively, the different units into which you can break down text (words, characters, or n-grams) are called tokens, 
# and breaking text into such tokens is called tokenization.
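
In [None]:
# A rough sketch of tokenization at the word and character level. The function names tokenize_words and tokenize_chars
# are hypothetical, and both lowercasing and the regular expression used to drop punctuation are assumptions made for
# illustration. Real tokenizers make many more decisions (hyphens, numbers, Unicode, and so on).

```python
import re


def tokenize_words(text):
    """Lowercase the text and return runs of letters, digits, or apostrophes as word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def tokenize_chars(text):
    """Return every character, including spaces and punctuation, as its own token."""
    return list(text)


text = 'The cat sat on the mat.'
word_tokens = tokenize_words(text)
char_tokens = tokenize_chars(text)

print(word_tokens)
print(char_tokens[:7])
```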

In [None]:
# 6.1.1 One-hot encoding of words and characters
# One-hot encoding is the most common, most basic way to turn a token into a vector. You saw it in action in the initial IMDB 
# and Reuters examples in chapter 3 (done with words, in that case). It consists of associating a unique integer index with
# every word and then turning this integer index i into a binary vector of size N (the size of the vocabulary); the vector is 
# all zeros except for the ith entry, which is 1.

In [6]:
# Listing 6.1 Word-level one-hot encoding (toy example)

import numpy as np

# Initial data: one entry per sample (in this example, a sample is a sentence, but it could be an entire document)
samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# Builds an index of all tokens in the data
token_index = {}
for sample in samples:
    # Tokenizes the samples via the split method. In real life, you’d also
    # strip punctuation and special characters from the samples.
    for word in sample.split():
        if word not in token_index:
            # Assigns a unique index to each unique word. Note that you don’t
            # attribute index 0 to anything.
            token_index[word] = len(token_index) + 1

# Just to view the Token Index
print(token_index)

# Vectorizes the samples. You’ll only consider the first max_length words in each sample.
max_length = 10

# This is where we will store the results.
results = np.zeros(shape=(len(samples), max_length, max(token_index.values()) + 1))

for i, sample in enumerate(samples):
    for j, word in enumerate(sample.split()[:max_length]):
        index = token_index[word]  # Every word was indexed above, so a direct lookup is safe.
        results[i, j, index] = 1

print(results)

{'The': 1, 'cat': 2, 'sat': 3, 'on': 4, 'the': 5, 'mat.': 6, 'dog': 7, 'ate': 8, 'my': 9, 'homework.': 10}
[[[0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]

 [[0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]]
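
In [None]:
# The same idea applies at the character level. What follows is a sketch in the spirit of Listing 6.1, not the book's
# own listing: using string.printable as the character vocabulary, the names char_index and char_results, and
# max_length = 50 are all assumptions made for illustration.

```python
import string

import numpy as np

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# Assumption: all printable ASCII characters serve as the vocabulary.
characters = string.printable

# Indices start at 1, so index 0 is not attributed to any character.
char_index = {char: i + 1 for i, char in enumerate(characters)}

# Only the first max_length characters of each sample are considered.
max_length = 50

# One (max_length x vocabulary size + 1) matrix per sample.
char_results = np.zeros((len(samples), max_length, max(char_index.values()) + 1))

for i, sample in enumerate(samples):
    for j, char in enumerate(sample[:max_length]):
        char_results[i, j, char_index[char]] = 1

print(char_results.shape)
```

# Rows past the end of a sample stay all zeros, just like the unused word positions in the output above.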
