# Representations & Text Processing

Two weeks ago we had a short look at image processing through the convolutional neural network, while next week we will look at recurrent neural networks and how they can be used for processing sequential data including text. 

This week we aren't going to look at a new specific architecture type. Instead we are going to step back and consider as a whole the issue of how our data is captured or represented by these networks, and ask what impact that has on issues like efficient data processing and re-use.

Some notes in this section - including some images - are based on the excellent tutorials on convolutional network use for Natural Language Processing that you can find online here:
http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/ as well as the follow-up with some implementation notes: http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/

The notes here are more compact than the original tutorial. Please refer to the original source for more detail if you need it. 

Let's start by importing some bits and pieces to get that out of the way. 

In [1]:
import numpy as np
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation, Flatten, Embedding, Conv1D, GlobalMaxPooling1D, GlobalAveragePooling1D
from tensorflow.keras.datasets import imdb
import tensorflow as tf

## Representing the Meaning of Words

As we will be primarily looking at text this week, let's begin with its representation options. This question of 'text representation' is not only important for Neural Network approaches to text processing tasks, but is equally important to text processing in Machine Learning in general. 

The challenge of representing text for Machine Learning tasks is a very broad research topic. Here we unfortunately don't have time to look at the particular motivations and problems with many approaches, but we will cover the key approaches that are used in practice. 

We need a way of representing text, but text presents us with some challenges. Words have variable length, i.e., one word can have length 1, e.g., 'a', while another word can be just a little bit longer, e.g., 'pneumonoultramicroscopicsilicovolcanoconiosis'. Worse still, sentences themselves have variable length. Obviously the string type in a computer can capture a sentence in a way that is great for word processing tasks and similar operations, but it isn't a good representation for Machine Learning. Instead we need a representation that is a bit more predictable in terms of the length of our inputs. 

### Bag of Word Count Matrices 

To achieve this, the most basic form of Text Representation is the so-called Bag of Words style approach. Bag of Words (BOW) representations were very popular for many years and are typical of what was used in classical Information Retrieval and Text Classification Tasks. The basic features of a BOW approach are: 

 1. Each input text document is represented by a vector of values
 2. The vector length corresponds to the number of entries in our vocabulary
 3. Each entry in the vector records whether (or how often) a given word occurred in the input text 
 
The image below summarises this approach. 

<!-- bow.png --> 
<img width="600" src="https://drive.google.com/uc?id=1geHOCUF3eT5_Nwjt6tbso1a18GQJo_I6"/>
 
Each row in our table corresponds to an individual document. Depending on the task at hand, the document could be just a sentence, or a paragraph, or indeed, a document, such as a webpage or book. For a given row, each cell represents whether a given word was found in the document. In basic approaches, this might just be a binary flag that says whether the word was in the document or not. Alternatively it can be a count of the number of times that the word is found in the document. 

The number of columns is equal to the size of our vocabulary. This is something that we need to know in advance for our task. In practice it is often impossible to know this perfectly -- there will be new words that crop up in our test data or at run time that we have not seen before -- so we might represent new unseen words with a special symbol such as UNK, which stands for 'unknown'. 
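To make the row and column layout concrete, here is a minimal sketch of a count matrix built with NumPy. The toy documents, the `bow_counts` helper and the choice to put UNK in column 0 are all assumptions made for illustration; real pipelines would normally use a library vectoriser.

```python
import numpy as np

# Toy training documents, invented for this illustration.
train_docs = ["the cat sat on the mat", "the dog sat", "a cat and a dog"]

# Vocabulary fixed in advance from the training documents, plus an UNK slot.
vocab = ["UNK"] + sorted({w for d in train_docs for w in d.split()})
col = {w: j for j, w in enumerate(vocab)}

def bow_counts(docs):
    """Return a (num_docs, vocab_size) matrix of word counts."""
    m = np.zeros((len(docs), len(vocab)), dtype=int)
    for i, d in enumerate(docs):
        for w in d.split():
            # Words outside the vocabulary are counted in the UNK column.
            m[i, col.get(w, col["UNK"])] += 1
    return m

counts = bow_counts(train_docs)
binary = (counts > 0).astype(int)        # presence/absence variant
unseen = bow_counts(["the zebra sat"])   # 'zebra' is not in the vocabulary

print(vocab)
print(counts)
print(unseen)
```

Each row has the same length regardless of how long the document was, which is exactly the predictability we were looking for.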

Even in English, words can have many different forms, e.g., eat, eats, eaten, ate. This presents a design challenge for BOW style representations. Do we represent such similar but different words with multiple columns, or do we create a single column that captures what we call the stem or basic form of a word? There is no perfect answer to this question, and this will again often be a task-dependent decision. By the way, if we think this problem is difficult in English, it is far more difficult in a morphologically rich language such as Turkish. 

This BOW style approach leads to sparse vectors per document where most entries in a given vector are 0s. In other words, if our vocabulary has say 171,000 entries (the approximate size of the Oxford dictionary), and the sentence we want to encode has say 17 words, then our vector will have at most 17 non-zero entries and an awful lot of 0s. 

This Bag of Words approach works surprisingly well for a range of text classification tasks. By limiting the size of our vocabulary and essentially ignoring rarely seen words, we can get very reasonable results for many text classification tasks by simply feeding our sparse vector representations into basic classifiers like Logistic Regression and Naive Bayes classifiers. This is exactly how early spam classifiers were originally designed.  

### Word Vectors 

While BOW representations are useful, they are now very far from the state of the art. There are many problems with them:

 1. We lose information on the ordering of words. The representation is a list of words, typically in alphabetical order, and an indication of whether each word was present. The sentences 'John gave Mary 500 Euro' and 'Mary gave John 500 Euro' have the exact same BOW representation. 
 2. The vectors are long. For real-world tasks the vectors are very long -- and mostly empty -- which is in practice not a great representation for feeding into classifiers. 
 3. There is no relationship between different columns. From a semantics (or meaning) perspective, the document represented by the short BOW representation 001000 is as close to 010000 as it is to 000001.  
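Points 1 and 3 are easy to check for ourselves. The sketch below uses only the two example sentences and the three short vectors from the list; splitting on whitespace is a simplifying assumption.

```python
from collections import Counter
import numpy as np

s1 = "John gave Mary 500 Euro"
s2 = "Mary gave John 500 Euro"

# A bag of words only records which words occur and how often,
# so the two sentences are indistinguishable.
same_bag = Counter(s1.split()) == Counter(s2.split())
print(same_bag)

# The three short vectors from point 3.
a = np.array([0, 0, 1, 0, 0, 0])
b = np.array([0, 1, 0, 0, 0, 0])
c = np.array([0, 0, 0, 0, 0, 1])

# Every pair of distinct one-hot vectors is the same distance apart.
dist_ab = np.linalg.norm(a - b)
dist_ac = np.linalg.norm(a - c)
print(dist_ab, dist_ac)
```

Both distances come out equal, so the representation itself carries no notion of one word being 'closer' in meaning to another.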

What we need are representations of words that overcome these limitations. 

To do this, the first thing we might do is stop storing a sentence in a single vector. Instead we switch to representing every word as a single vector. Thus we represent an input document as a full 2D array rather than just a vector as in the BOW case above. 

In these matrices each row captures the meaning of a single word, and the columns are then the features that are used to represent the meaning of the individual words. We can visualize this approach with the simple example below. 

<!-- word_vectors.png --> 
<img width="600" src="https://drive.google.com/uc?id=1gvmNOsIqc2h3Z5tYjcV-6OhzykYFqzeg"/>

The major advantage of this approach over our BOW representation is that it retains ordering information on our input text. The key disadvantage of this approach is that we end up needing a two-dimensional array to represent each individual input document (sentence or full webpage etc.). Also, the number of rows in our representation of the document is now variable and depends on the length of the sentence. In practice though we can assume a maximum sentence length of, say, 256 words, and pad out shorter sentences with empty vectors up to the end of the array. 

We have said nothing yet about the contents of our word vectors. In fact how we capture meaning in these word vectors can vary considerably from one approach to another. In a basic approach we take a similar approach to our Bag of Words representation and once again use a one-hot-encoding against vocab features. In this approach each word maps to a single active feature with all other features being inactive. Thus for a 10 word sentence represented with features taken from a 10,000 word vocabulary, we would end up with a 10x10000 matrix. In the visualization below I leave 0s unmarked for clarity. 

<!-- word_vectors_ohe.png --> 
<img width="600" src="https://drive.google.com/uc?id=1h6-OKjv8aucMIXRoA7ciQvucqZsYYkh6"/>
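As a rough sketch of the one-hot word matrix just described, the code below encodes a short sentence as one row per word position and pads the remaining rows with zeros. The five-word vocabulary, the maximum length of 8 and the `one_hot_document` helper are assumptions chosen to keep the output small.

```python
import numpy as np

vocab = ["cat", "is", "mat", "on", "the"]     # toy vocabulary
index = {w: j for j, w in enumerate(vocab)}
max_len = 8                                   # assumed maximum document length

def one_hot_document(text):
    """One row per word position, one column per vocabulary entry."""
    words = text.split()[:max_len]
    m = np.zeros((max_len, len(vocab)), dtype=int)
    for i, w in enumerate(words):
        m[i, index[w]] = 1
    return m   # rows beyond the document length stay all-zero (padding)

doc = one_hot_document("the cat is on the mat")
print(doc)

# Unlike a bag of words, the word order can be read back from the rows.
recovered = [vocab[j] for j in doc[:6].argmax(axis=1)]
print(recovered)
```

With a realistic vocabulary the matrix has the same structure, only with thousands of columns per row instead of five.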


### Distributed Word Vectors

While the approach above can help since the order of words in sentences is now left intact, it isn't enough yet. The representation is still highly sparse and hence difficult to process for large vocabulary sizes. We can get around this problem a little bit by using specialist sparse matrix representations to encode the sparse matrix in a more compact way, but the sparse encoding in itself can make it more difficult to train. The second huge limitation is that our representation fails to account for any similarities between words. As indicated above with my 001000 example, the words 'dog' and 'labrador' are as close to each other in the data space as the words 'dog' and 'deprive'. 

To overcome these limitations, an alternative approach is to instead use a distributed representation where words are represented in a lower-dimensional feature space (e.g., 128 dimensions) where individual features are real valued (unlike the binary features used in a word identity matrix). This compacted representation is widely known as a distributed representation or word embedding and is a key tool in neural text processing. Keep in mind that each feature in a distributed word embedding does not necessarily have to 'mean' anything. It can be an aspect of meaning that is well defined, or it may only make sense when we combine it with other features. 

We can illustrate this approach as follows: 

<!-- word_vectors_dist.png --> 
<img width="600" src="https://drive.google.com/uc?id=1h0vrvcNw_0HBec3yifGCLeKA48asVWpf"/>


Within the embedding space related words are located near each other. For example we would expect the words 'dog' and 'labrador' to be located much closer to each other in our 128 dimensional space than we would expect the words 'dog' and 'deprive'. Generally speaking, related words are clustered together. 
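Closeness in an embedding space is commonly measured with cosine similarity. The sketch below uses hand-made 4-dimensional vectors purely for illustration; the numbers are invented and are not taken from any trained model.

```python
import numpy as np

# Invented 4-dimensional embeddings, for illustration only.
emb = {
    "dog":      np.array([0.9, 0.8, 0.1, 0.0]),
    "labrador": np.array([0.8, 0.9, 0.2, 0.1]),
    "deprive":  np.array([0.0, 0.1, 0.9, 0.8]),
}

def cosine(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_related = cosine(emb["dog"], emb["labrador"])
sim_unrelated = cosine(emb["dog"], emb["deprive"])
print(sim_related, sim_unrelated)
```

With real embeddings we would hope to see the same pattern: a high similarity for the related pair and a much lower one for the unrelated pair.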

We can visualise these clusterings at a coarse grain by projecting the words in our embedding space into 2D:


<!-- word_vectors_colab.png --> 
<img width="600" src="https://drive.google.com/uc?id=1grGp1yvd0uDvT9ceDIPFJFyDoLFgBVzv"/>
<!-- Sourced from http://sebastianruder.com/content/images/2016/04/word_embeddings_colah.png" -->

If we zoom in to a particular section of our embedding space we will find similar words co-located together:

<!-- word_vectors_WordTSNE.png --> 
<img width="600" src="https://drive.google.com/uc?id=1gj_8v1PWi5q2Kc6fhKmjeZMzJYySUi2-"/>

Keep in mind that any 2D visualization of our embedding space is a projection out from our n-dimensional embedding space. 

Projecting from our n-dimensional embedding space into 2D for visualization has been a big research topic over the years. t-SNE is one well-regarded tool for making these projections, but there are many other options. 
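As one of the simpler alternatives, the sketch below projects vectors into 2D with PCA computed through an SVD. Note that this is PCA rather than t-SNE, and that the random matrix is only a stand-in for real learned embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(50, 128))   # stand-in for 50 learned word vectors

# PCA via SVD: centre the data, then keep the two directions of largest variance.
centred = embeddings - embeddings.mean(axis=0)
_, _, vt = np.linalg.svd(centred, full_matrices=False)
points_2d = centred @ vt[:2].T            # shape (50, 2), ready for a scatter plot

print(points_2d.shape)
```

PCA is a linear projection, so it tends to preserve large-scale structure, whereas t-SNE is non-linear and is usually chosen when we care about keeping local neighbourhoods together.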

## Learning Distributed Representations

There are many ways in which we could learn a compacted distributed word embedding for text representation. 

The easiest way to create a distributed representation of a word is simply to feed a one-hot encoding of the word into a network and then take the first hidden layer after training and call that a vector representation. In other words, the first hidden layer in any network where the input vector is a raw BOW style encoding can be thought of as an embedding. Importantly, while the input space for the input layer is theoretically huge, we know that the input vectors are one-hot encoded and hence are very limited in their space. We can take advantage of this fact and use the input (the raw text encoding) to 'look up' saved embedding values from the first hidden layer. 

This is exactly what the standard Embedding layer of an implementation like Keras is doing, i.e., the embeddings are generated from a weight matrix for a hidden layer. However at runtime, rather than computing the embeddings directly from a layer which could be very big, we can just look up the embedding values from a table that is indexed by the input word. 
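The equivalence between 'multiply a one-hot vector by the weight matrix' and 'look up a row of the weight matrix' can be checked directly. This is a minimal NumPy sketch with a randomly initialised matrix `W` standing in for trained weights; the sizes are arbitrary.

```python
import numpy as np

vocab_size, embedding_dim = 10, 4
rng = np.random.default_rng(0)
W = rng.normal(size=(vocab_size, embedding_dim))  # weights of the first hidden layer

word_id = 7
one_hot = np.zeros(vocab_size)
one_hot[word_id] = 1.0

via_matmul = one_hot @ W   # what a bias-free dense layer would compute
via_lookup = W[word_id]    # what an embedding lookup does instead

print(np.allclose(via_matmul, via_lookup))
```

The results are identical, but the lookup touches a single row of the table rather than multiplying through the entire vocabulary.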

The advantage of this approach is that it is very easy to implement, and conceptually it is not very different to just plugging the text into the network. However, since the embeddings are specific to your own trained application, they tend not to be semantically very rich, and are instead usually very oriented towards your own particular task. So for example, if you are predicting whether a review is positive or not, the embeddings are likely to cluster words into positive, negative, and then probably a group that doesn't really tell you much. 

Generally more useful embeddings are generated by different types of non-application specific tasks that can try to capture the full richness of the semantics. To illustrate, let us consider the use of the autoencoder. An autoencoder is a relatively simple architecture which is used to take an input, squeeze it into a more compact representation, and then reconstruct the input in the final layer. No training labels are needed for this type of architecture, and the representations generated (particularly by so-called variational autoencoders) are quite good. 

Using this approach, we might for example try using the layer of values at the centre of the autoencoder as a representation / embedding: 

<!-- auto-embed.png --> 
<img width="600" src="https://drive.google.com/uc?id=1gQnTGyuJNZPQuhO7NcsZgHuK3HD6BdPG"/>

where X is the one-hot-encoding of a word in our vocabulary and Z is our embedding that we can then use whenever we need a distributed representation. 

In practice this simple autoencoder method does not work well for the problem at hand, but fortunately a lot of work has gone into devising other methods to learn compact word embeddings. Possibly the best known method is still the Word2Vec approach from Mikolov et al., which learns word embeddings in a supervised learning process between target words and their contexts, with the training labels taken from the text itself. Actually there are two variant models in Word2Vec. The first variant, continuous bag of words, learns the embeddings in a task where the target word is to be predicted from a context (set of surrounding words). The second variant, skip-gram, learns the word embeddings in a task where context words are to be predicted from a single target word. 
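The difference between the two variants is easiest to see in the training examples they generate from the same sentence. The sketch below is only an illustration of how inputs and labels are paired up; the `context_windows` helper and the window size of 2 are assumptions, and details of the real Word2Vec training procedure are left out.

```python
def context_windows(tokens, window=2):
    """Yield (target, context_words) for every position in the sentence."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right

tokens = "the quick brown fox jumps".split()

# CBOW: the whole context is the input, the target word is the label.
cbow_examples = [(context, target) for target, context in context_windows(tokens)]

# Skip-gram: the target word is the input, each context word is a separate label.
skipgram_examples = [(target, c)
                     for target, context in context_windows(tokens)
                     for c in context]

print(cbow_examples[2])
print(skipgram_examples[:4])
```

In both cases no human labelling is needed: the 'labels' are just words that happen to sit next to each other in the corpus.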

Whether we train using the continuous bag of words or skip-gram variants, we still achieve as a side effect a hidden layer which can be used subsequently as a word embedding. These two variants are illustrated below:

<!-- word2vec.png --> 
<img width="600" src="https://drive.google.com/uc?id=1gmiOr-gTnCMMhixxmWXebtKZksPzUeFX"/>

For anyone interested in the specifics of learning Word2Vec style embeddings, the TensorFlow website has a nice tutorial that will guide you through the theory and the practical code for learning Word2Vec style embeddings. 

It should be noted that while Word2Vec is still an interesting example illustrating the general approach of mapping words to embedding vectors, technologies that map entire sentences to vectors based on transformer architecture variants (discussed later) are now the standard for serious text processing. 

## Using Word Embeddings

Embeddings can be used in one of 2 ways in practice:

 1. We learn embeddings for the sake of learning embeddings with a Word2Vec model or similar, and then put these vectors to work in a specific task. 
 2. We learn word embeddings on the fly for a given task.
 
While Option 1 is almost always the more correct way of dealing with embeddings, Option 2 is still encountered very frequently in code. 

Keras / Tensorflow provides an awful lot of the machinery for us to do this. There is a very short and elegant tutorial on the subject on the Tensorflow website here:
https://www.tensorflow.org/tutorials/text/word_embeddings

The code below roughly follows the same layout in that tutorial, but we use a different example to give an alternative perspective on the design. The example we work from is based on this code:
https://keras.io/examples/imdb_cnn/

Tensorflow provides an Embedding layer for your networks. This Embedding layer is a very specialist layer that has all the machinery to both train and apply embeddings built into it. Specifically, the Embedding layer takes the input text, which we will provide in a format similar to our 'word vectors' above, and learns a mapping from it to a real valued space. 

Let's set up our example and import some text that we will use to illustrate. 

In [2]:
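max_features = 5000   # vocabulary size: keep only the 5,000 most frequent words
maxlen = 400          # every review is padded or truncated to 400 tokens
embedding_dims = 50   # not used below; the models use embedding_size instead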

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)

Here we have loaded Keras's inbuilt wrapper for the IMDB dataset and used it to generate training and validation / testing data. The data is split between inputs (x_) and labels (y_). Each input is a review encoded as a list of integer word indices, and each label is 0 or 1. 

Next we need to take the inputs and pad them out to all be the same length. This is essential for processing them. 

In [3]:
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
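To see what the padding step does to individual reviews, here is a small NumPy sketch that mimics what we understand to be the default behaviour of `pad_sequences`, namely padding and truncating at the start of each sequence. The `pad_pre` helper is hypothetical and is not part of Keras.

```python
import numpy as np

def pad_pre(seqs, maxlen, value=0):
    """Left-pad (and left-truncate) integer sequences to a fixed length."""
    out = np.full((len(seqs), maxlen), value, dtype=int)
    for i, s in enumerate(seqs):
        s = s[-maxlen:]                  # keep the last maxlen tokens
        out[i, maxlen - len(s):] = s     # zeros go in front
    return out

# One short and one over-long 'review', as lists of word indices.
padded = pad_pre([[5, 6], [1, 2, 3, 4, 5, 6, 7]], maxlen=5)
print(padded)
```

Under these defaults the zeros sit in front of a short review rather than after it, and an over-long review keeps its final words.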

Let's start building up our model / network. We can use the Embedding layer in a simple Sequential model. (On the Sequential model, note that Keras provides two overall model architectures, sequential and functional. Sequential models assume a straight feedforward design and are easy to code by stacking layers on top of each other. Functional models can be far more intricate with information flowing in far more arbitrary designs, but are slightly more complex to code as a result.) 

In [4]:
model = Sequential()

The first layer of our network will be our Embedding layer. When we define the Embedding layer, we need to specify the expected size of the raw non-distributed vectors (i.e., the vocabulary size, max_features), and then the target embedding dimensionality. We will go with an embedding size of just 8 to begin with. This means that each of our words will be encoded simply as a vector of 8 numbers after the first layer. 

In [5]:
embedding_size=8 
model.add(Embedding(max_features,
                    embedding_size,
                    input_length=maxlen))

The Embedding layer will make the mapping from the non-distributed input text to our numerical embeddings. After the embedding layer we have an output of dimension Batch_Size x Max_Document_Length x Embedding_Size. The Batch_Size here is just a batch size in the normal sense. In other words we can train on multiple instances of our training data in parallel. The Max_Document_Length is our expected maximum number of words in each review. Remember that the input encoder will pad shorter documents out so that the tensor size is always as expected -- even for much shorter sentences.

At this point there are many different ways in which we could deal with our text. In the simple case we can feed this whole document representation into a Dense Layer. Every unit in this dense layer has access to the embedding values for each word in our document. This unfortunately will also include access to all those 0 padded entries between the actual document length and the max document length. 

In [6]:
model.add(Flatten())
model.add(Dense(200, activation='relu'))
model.add(Dense(1))

Once that is done we can think of our representation just as we would any other layer in a network. We can now follow it with a number of different layers as required before we eventually finish up with an output layer. 

We can now compile and train the model in the usual way. 

In [7]:
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])

hist = model.fit(x_train, y_train,
          batch_size=32,
          epochs=4,
          validation_data=(x_test, y_test))

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4
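Note that the final `Dense(1)` layer has no activation, so the model outputs a raw score (a logit) rather than a probability, which is why the loss is created with `from_logits=True`. The NumPy sketch below shows how such scores relate to probabilities and to the loss value; the logits and labels are made up, and `bce_from_logits` is a hypothetical helper rather than the Keras implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_from_logits(y_true, logits):
    """Mean binary cross-entropy computed directly from raw scores."""
    # Numerically stable form: max(z, 0) - z*y + log(1 + exp(-|z|))
    z, y = np.asarray(logits, float), np.asarray(y_true, float)
    return float(np.mean(np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z)))))

logits = np.array([2.0, -1.0, 0.0])   # made-up raw outputs of Dense(1)
labels = np.array([1, 0, 1])

probs = sigmoid(logits)               # probabilities of the positive class
preds = (logits > 0).astype(int)      # a logit above 0 means probability above 0.5
loss = bce_from_logits(labels, logits)

print(probs, preds, loss)
```

When using the trained model for prediction we would therefore pass its outputs through a sigmoid before reading them as probabilities.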


We can have a look at the summary of the model to see how many parameters are needed in practice now to run this model. 

In [8]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 400, 8)            40000     
                                                                 
 flatten (Flatten)           (None, 3200)              0         
                                                                 
 dense (Dense)               (None, 200)               640200    
                                                                 
 dense_1 (Dense)             (None, 1)                 201       
                                                                 
Total params: 680,401
Trainable params: 680,401
Non-trainable params: 0
_________________________________________________________________
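The parameter counts in the summary can be reproduced by hand, which is a useful check that we understand what each layer stores. The variable names below mirror the settings used earlier; `hidden` is just a local name for the 200 units of the first dense layer.

```python
max_features, maxlen, embedding_size, hidden = 5000, 400, 8, 200

embedding_params = max_features * embedding_size   # one 8-value row per word
flat_width = maxlen * embedding_size               # 400 words x 8 values each
dense_params = flat_width * hidden + hidden        # weights + biases
output_params = hidden * 1 + 1                     # weights + bias of Dense(1)

total = embedding_params + dense_params + output_params
print(embedding_params, flat_width, dense_params, output_params, total)
```

Almost all of the parameters sit in the first dense layer, because every one of its 200 units is connected to every embedding value of every word position.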


For those feeling adventurous, consider how to implement a version of this embedding approach that doesn't use the Embedding layer but just uses a dense layer instead. It will have many more parameters. 

## Very Simple Sentence Embeddings

As an alternative to just flattening the whole embedded document and passing it straight to a dense layer, we do have other options. One option that is widely applied is to simplify our representation a little by taking an average of all our word embeddings. In other words we end up with an averaged embedding across the whole document. What this means in practice is that we are assuming that the meaning of a document is equal to the averaged meaning of all the words in the document. We can implement this with a GlobalAveragePooling1D layer. It should be noted that this will throw away the information we wanted to keep on the ordering of words in our representations, but we will come back to better ways of handling this later. 

In [9]:
model = Sequential()
model.add(Embedding(max_features,
                    embedding_size,
                    input_length=maxlen))
model.add(GlobalAveragePooling1D())
model.add(Dense(200, activation='relu'))
model.add(Dense(1))

model.summary()

model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])

hist = model.fit(x_train, y_train,
          batch_size=32,
          epochs=4,
          validation_data=(x_test, y_test))

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 400, 8)            40000     
                                                                 
 global_average_pooling1d (G  (None, 8)                0         
 lobalAveragePooling1D)                                          
                                                                 
 dense_2 (Dense)             (None, 200)               1800      
                                                                 
 dense_3 (Dense)             (None, 1)                 201       
                                                                 
Total params: 42,001
Trainable params: 42,001
Non-trainable params: 0
_________________________________________________________________
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


Note that (a) the number of parameters that we have to use in our network is considerably smaller because we averaged the embeddings within a document; (b) as a result, training was much faster; and (c) we often achieve a comparable result! 
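The averaging performed by the pooling layer amounts to taking a mean over the word axis. The NumPy sketch below uses a random matrix as a stand-in for one embedded document, and also confirms the earlier point that averaging discards word order.

```python
import numpy as np

rng = np.random.default_rng(0)
doc = rng.normal(size=(6, 8))        # 6 word positions, 8 embedding values each

doc_vector = doc.mean(axis=0)        # one 8-value vector for the whole document

# Shuffling the words leaves the average unchanged: ordering is lost.
shuffled = doc[rng.permutation(len(doc))]
same = np.allclose(doc_vector, shuffled.mean(axis=0))

# Parameter count of the first dense layer in the averaged model.
dense_params = 8 * 200 + 200

print(doc_vector.shape, same, dense_params)
```

Because the dense layer now sees 8 inputs instead of 3,200, its parameter count drops to the 1,800 reported in the summary above.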

## Convolving Through Word Embeddings 

In the work above we threw away any processing that was dependent on the order of words in our documents. In practice we want to keep this information. One way in which we can make use of it in a sensible way is through the application of CNNs. 

What we will be doing is allowing local feature detectors that maybe look at 4 words at a time to convolve through a complete document. This way our feature detectors are able to look for local features in the text, i.e., combinations of 4 words together that have a specific benefit and meaning in our text processing task. 

In our image processing tasks our feature detectors were convolved across the length and breadth of our input image. A key difference with convolution when applied to text is that we do not convolve across our columns (i.e., the input features). Instead we convolve only down the vertical, i.e., through the words. This makes a lot of sense as the purpose of convolution is to pick up on local features in the data where there is global invariance. For our input representation features we do not have global invariance. However where we are scanning down through the words in sequence it makes a lot of sense to look for features across adjacent words or words which are close by each other. 

This approach to applying a convolution to an input document is illustrated below. Note that this image comes from the blog post here:
 http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/ 

<!-- conv_text.png --> 
<img width="600" src="https://drive.google.com/uc?id=1ghFyeqjfTt9eAVskp1todD1agxLFT3M5"/>

In our image processing examples we stuck with a consistent sliding window size for each convolution in a given layer. In text processing, however, we often want to use convolutional windows of different lengths that operate over say 2, 3, 4 or even more words. For each window length we will typically have a number of filters, i.e., we allow training of multiple features at each permitted window length. These convolutions in turn produce our standard convolved features which can then be pooled and combined in the usual way. 

In the illustration above our word embeddings are of dimension 5 and we have a maximum document length of 7 with an input document of length 7. Six convolution features (kernels or filters) are being used in this case. Two of these have a convolution width of 2, two have a convolution width of 3, and two have a convolution width of 4. Convolutional application of these filters to our input results in 6 separate feature maps. Since we do not convolve horizontally across our input, note that our feature maps are 1 dimensional. 

Max pooling can then optionally be used to reduce the size of our feature maps. In the illustration above an extreme form of max pooling is used where all feature maps are reduced to a single value for each feature map. 
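The setup in the illustration can be sketched in a few lines of NumPy. The random document and kernels below are stand-ins for learned values, the `conv_through_words` helper is hypothetical, and the use of a ReLU with no bias term is a simplifying assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
doc_len, emb_dim = 7, 5
doc = rng.normal(size=(doc_len, emb_dim))        # stand-in for an embedded sentence

def conv_through_words(doc, kernel):
    """Slide a (width, emb_dim) kernel down the word axis only."""
    width = kernel.shape[0]
    steps = doc.shape[0] - width + 1             # 'valid' convolution
    out = np.array([np.sum(doc[i:i + width] * kernel) for i in range(steps)])
    return np.maximum(out, 0.0)                  # ReLU

# Two randomly initialised kernels for each of the widths 2, 3 and 4.
kernels = [rng.normal(size=(w, emb_dim)) for w in (2, 3, 4) for _ in range(2)]

feature_maps = [conv_through_words(doc, k) for k in kernels]
map_lengths = [len(fm) for fm in feature_maps]

# Extreme max pooling: one number per feature map.
pooled = np.array([fm.max() for fm in feature_maps])

print(map_lengths)
print(pooled.shape)
```

Each kernel spans the full embedding width, so it can only move down through the words, and wider kernels produce shorter feature maps. After pooling, the document is summarised by six numbers regardless of its length.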

### Applying Convolution to Embeddings 

In the code below we apply the idea of convolutions to our training case. We assume a fixed filter width of 3. This is in contrast to the image above where filters of length 2, 3 and 4 were investigated. We use 25 different kernel instances, which is many more than in the image above, and we follow the convolution with a flatten step and a fully connected hidden layer of 200 units. 
 

In [10]:
model = Sequential()
model.add(Embedding(max_features,
                    embedding_size,
                    input_length=maxlen))
model.add(Conv1D(25,3,padding='valid',activation='relu',strides=1))
model.add(Flatten())
model.add(Dense(200, activation='relu'))
model.add(Dense(1))

model.summary()

model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])

hist = model.fit(x_train, y_train,
          batch_size=32,
          epochs=4,
          validation_data=(x_test, y_test))

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_2 (Embedding)     (None, 400, 8)            40000     
                                                                 
 conv1d (Conv1D)             (None, 398, 25)           625       
                                                                 
 flatten_1 (Flatten)         (None, 9950)              0         
                                                                 
 dense_4 (Dense)             (None, 200)               1990200   
                                                                 
 dense_5 (Dense)             (None, 1)                 201       
                                                                 
Total params: 2,031,026
Trainable params: 2,031,026
Non-trainable params: 0
_________________________________________________________________
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4
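As before, the numbers in the summary can be worked out by hand. The sketch below reproduces the convolution output length and the parameter counts from the settings used in the code cell; the variable names are just local labels.

```python
max_features, maxlen, embedding_size = 5000, 400, 8
filters, width, hidden = 25, 3, 200

embedding_params = max_features * embedding_size           # 40,000
conv_params = width * embedding_size * filters + filters   # 625
conv_steps = maxlen - width + 1                            # 398 positions ('valid' padding)
flat_width = conv_steps * filters                          # 9,950
dense_params = flat_width * hidden + hidden                # 1,990,200
output_params = hidden + 1                                 # 201

total = embedding_params + conv_params + dense_params + output_params
print(conv_params, conv_steps, flat_width, dense_params, total)
```

The convolution itself is tiny at 625 parameters. It is the flatten step, which keeps all 398 positions of all 25 feature maps, that makes the following dense layer so large.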


This model does have a lot of parameters and easily overfits. Fortunately though we can simplify a little bit again by using a GlobalAveragePooling1D layer and get a similar performance.

In [11]:
model = Sequential()
model.add(Embedding(max_features,
                    embedding_size,
                    input_length=maxlen))
model.add(Conv1D(25,3,padding='valid',activation='relu',strides=1))
model.add(GlobalAveragePooling1D())
model.add(Dense(200, activation='relu'))
model.add(Dense(1))

model.summary()

model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])

hist = model.fit(x_train, y_train,
          batch_size=32,
          epochs=4,
          validation_data=(x_test, y_test))

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_3 (Embedding)     (None, 400, 8)            40000     
                                                                 
 conv1d_1 (Conv1D)           (None, 398, 25)           625       
                                                                 
 global_average_pooling1d_1   (None, 25)               0         
 (GlobalAveragePooling1D)                                        
                                                                 
 dense_6 (Dense)             (None, 200)               5200      
                                                                 
 dense_7 (Dense)             (None, 1)                 201       
                                                                 
Total params: 46,026
Trainable params: 46,026
Non-trainable params: 0
__________________________________________________

From the code above we can see that the CNN based method in this case does not necessarily do much better than the averaged embeddings case. The point here though is to illustrate that while we often think of particular architectures being associated with particular data types, these rules are not hard and fast. 
Generally the results are expected to improve though if we use a couple of different types of convolutions, i.e., convolutions of length 2, 3, 4 and 5. 

## True Sentence Embeddings

True sentence embeddings will try to encode the meaning of a sentence into a single vector, but unlike our approach above, they don't just average over a lot of individual word vectors. Sentence embeddings are mostly built from transformer models. We will come back to this architecture in detail after Easter to understand how it is trained and what it does. For now though we can just assume them to be a black box that will output a vector representation of a sentence that captures the meaning of the sentence -- up to and including word orderings. 



## Using Pre-Trained Word Embeddings

In many cases we do not need to learn word embeddings from scratch and can instead use pre-trained word embeddings that have been learned from large text corpora using the skip-gram method or similar. Generally the workflow for using these pre-trained embeddings is to download pre-computed vectors for all of the words in a large vocabulary. We then look up the word vector for a particular word in a lookup table as needed in our task. 

There are many trained embedding sets available which vary in terms of: (a) the size and nature of the raw text data they were trained on and (b) the specific training method used to learn the embeddings. You can for example download a set of word vectors for 3 million words and phrases trained on roughly 100 billion words from the Google News dataset; understandably enough, the download is big. Incidentally, the embeddings in that model are 300 dimensions wide. 
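
A quick back-of-the-envelope calculation shows why such a set is large. It assumes each value is stored as a 4-byte float32, which is an assumption about the storage format rather than something stated above.

```python
num_vectors = 3_000_000   # words and phrases in the vocabulary
dims = 300                # dimensions per embedding
bytes_per_value = 4       # assumption: float32 storage

total_bytes = num_vectors * dims * bytes_per_value
print(f"{total_bytes / 1e9:.1f} GB")  # 3.6 GB
```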

### TensorFlow and Word Embeddings

TensorFlow Hub gives us a lot of the plumbing we need to quickly start working with pre-trained embeddings (and other types of pre-trained models). We can simply download a set of representations that have already been built by others and use them directly as an embedding layer, just as we did above. The difference is that the training of these embeddings has already been done for us, and we can often leave the embeddings as they are. Sometimes, though, it makes sense to fine-tune these embeddings during our own training process. 

TensorFlow Hub makes it very easy to download pre-trained embeddings and put them directly to use. For example, the page below describes how to load an embedding module and use it to compute a distributed embedding for one sentence at a time. 

https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1


To illustrate, let's import TensorFlow Hub and ask it to load the Swivel word embeddings. 

In [12]:
import tensorflow_hub as hub

print("loading embedding")
embed = hub.load("https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1")


loading embedding


We can illustrate the use of the embeddings by just feeding some text in and seeing what we get out. 

In [13]:
embeddings = embed(["cat is on the mat", "dog is in the fog", "dog"])
print(embeddings)

tf.Tensor(
[[ 0.8666395   0.35917717  0.00579667  0.681002   -0.54226625  0.22343189
  -0.38796625  0.62195706  0.22117122 -0.48538068 -1.2674141   0.886369
  -0.32849073 -0.13924702 -0.53327686  0.5739708  -0.05905761  0.13629246
  -1.1718255  -0.31494334]
 [ 0.9602181   0.62520486  0.06261905  0.37425604  0.24782333 -0.39351934
  -0.7418429   0.56599647 -0.26197797 -0.69016844 -0.76565284  0.71412426
  -0.4537978  -0.50701594 -0.8499377   0.8917156  -0.30278975  0.2149126
  -1.1098894  -0.46719775]
 [ 1.1263883  -0.46177042 -0.8531583   0.5697219  -0.04634653  0.00869457
  -0.41134015  1.0862297   0.9390011   0.53587663 -0.964659    0.9846872
  -0.5436216  -0.459042   -1.0998259   0.37084442 -0.05279565  0.2736311
  -0.54693335 -0.20116976]], shape=(3, 20), dtype=float32)


Note that this particular embedding model gives us back one embedding per sentence rather than one embedding per word. 

To use the embedding layer within a network, we simply have to parameterise it appropriately and then directly add it to a model as illustrated below. 

In [14]:
hub_layer = hub.KerasLayer("https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1", output_shape=[20],
                           input_shape=[], dtype=tf.string)

m2 = tf.keras.Sequential()
m2.add(hub_layer)
m2.add(tf.keras.layers.Dense(16, activation='relu'))
m2.add(tf.keras.layers.Dense(1, activation='sigmoid'))

m2.summary()

Instructions for updating:
Lambda fuctions will be no more assumed to be used in the statement where they are used, or at least in the same block. https://github.com/tensorflow/tensorflow/issues/56089


Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 keras_layer (KerasLayer)    (None, 20)                400020    
                                                                 
 dense_8 (Dense)             (None, 16)                336       
                                                                 
 dense_9 (Dense)             (None, 1)                 17        
                                                                 
Total params: 400,373
Trainable params: 353
Non-trainable params: 400,020
_________________________________________________________________


As an assignment, make use of a pre-trained embedding layer to train a classifier on the IMDB data. How does performance compare against embeddings learned on the fly? Do you have to rework the data in any way? 

## Pre-Trained Models & Transfer Learning

We have just seen that it is possible to use pre-trained word embeddings rather than having to learn the embeddings on the fly for our current application. This is an important point in Deep Learning: we have some modularity in our designs and try to take advantage of pre-trained models when possible. 

To understand this, consider that a neural network is just an architecture (design) and a collection of parameters that instantiate that design. Networks can in general be saved to disk and loaded back up whenever we like. This is essential in allowing us to save the model as we go along (just in case anything causes training to crash), but it is also essential in allowing a degree of re-use across models. 
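
A minimal sketch of saving and restoring a network with Keras. The tiny model and the `toy_model.keras` file name are hypothetical, and the `.keras` file format assumes a reasonably recent TensorFlow version.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

# A tiny throwaway model: the architecture plus randomly initialised parameters.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

x = np.random.rand(5, 4).astype("float32")
before = model.predict(x, verbose=0)

# Save architecture and weights to a single file, then load them back.
path = os.path.join(tempfile.mkdtemp(), "toy_model.keras")
model.save(path)
restored = tf.keras.models.load_model(path)

# The restored network should reproduce the original predictions.
after = restored.predict(x, verbose=0)
print(np.allclose(before, after))
```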

### Reuse in Image Processing

While re-use is common in text embedding models, it is arguably even more common in image processing. 

All the major computer vision architectures are available in pre-trained form. In most of these cases the networks have been pre-trained using the ImageNet dataset. This is a very large collection of labelled images that has been used for many years to train neural networks. Keras, for example, provides very convenient wrappers that allow us to load a pre-trained instance of one of these networks and use it within our own model. 

Note that this idea of re-use (from a software or design perspective) is very close to the idea of Transfer Learning from a strict machine learning perspective. Transfer Learning generally refers to the idea of learning a model within one domain, and then applying that model to a new domain. What we are trying to achieve is a transfer of knowledge from one application to another, which is different from a transfer of data. 

In Transfer Learning we typically run one background training process once on a VERY large dataset to build a good re-usable model that can then be applied to new domains that typically have less labelled data available. We usually do not copy over the whole network, however. Instead we often leave the final few layers of the original network behind. We refer to these as the head of the network, and typically they are closely tied to the original application that was used to train the network. For example, if the original network was trained for image classification over 100 different classes, these final layers will be very focused on producing the softmax output that maps directly to those 100 classes. 

The assumption here is that while the head of the network is very focused on the specific task, the earlier layers in the network will be more general purpose and will in fact be re-usable in other domains. 

In our new application network we typically have to create a new 'head' that is specific to our own application. This head is trained on our new target data, while the lower layers of the network are those that came from our original source network. 
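
A minimal sketch of this pattern, using `MobileNetV2` from `tf.keras.applications` as one possible source network. The function name `build_transfer_model`, the input size and the single-layer head are illustrative assumptions. The original classification head is dropped with `include_top=False` and a new softmax head is placed on top of the frozen base.

```python
import tensorflow as tf


def build_transfer_model(num_classes, weights="imagenet", input_shape=(96, 96, 3)):
    """Frozen pre-trained base plus a new task-specific head (illustrative sketch)."""
    # include_top=False leaves the original ImageNet head behind;
    # pooling="avg" turns the final feature maps into one vector per image.
    base = tf.keras.applications.MobileNetV2(
        input_shape=input_shape, include_top=False, weights=weights, pooling="avg")
    base.trainable = False  # keep the imported weights fixed

    # Inputs are assumed to be already preprocessed for MobileNetV2.
    inputs = tf.keras.Input(shape=input_shape)
    features = base(inputs, training=False)

    # New head: the only part that is trained on the target data.
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(features)
    return tf.keras.Model(inputs, outputs)
```

Calling `build_transfer_model(num_classes=10)` downloads the ImageNet weights on first use. Passing `weights=None` gives randomly initialised layers, which is only useful for checking the wiring.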

### Training Policy and Fine-Tuning

For the layers imported into our target network we have a choice about how to use them. At one extreme we can leave the parameters in those layers exactly as they were (freeze); at the other we can allow the training process on the new application to modify these weights for the new domain (fine-tune). 

In practice there are many different strategies for fine-tuning available, but these often involve initially designing your new architecture with imported weights that cannot change. Then, after a period of training, we allow those weights to be adjusted during our training process.
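
The two-phase schedule can be sketched on a toy model. The `imported_base` block here is a randomly initialised stand-in for real imported layers, and the data, learning rates and epoch counts are illustrative assumptions. Note that Keras needs the model to be recompiled after `trainable` is changed.

```python
import numpy as np
import tensorflow as tf

# Toy data standing in for the new target task.
rng = np.random.default_rng(0)
x = rng.normal(size=(64, 8)).astype("float32")
y = (x.sum(axis=1, keepdims=True) > 0).astype("float32")

# Stand-in for imported layers (hypothetical; in practice these are pre-trained).
base = tf.keras.Sequential(
    [tf.keras.Input(shape=(8,)), tf.keras.layers.Dense(16, activation="tanh")],
    name="imported_base")
model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    base,
    tf.keras.layers.Dense(1, activation="sigmoid", name="new_head"),
])
w_start = base.get_weights()

# Phase 1: freeze the imported layers and train only the new head.
base.trainable = False
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="binary_crossentropy")
model.fit(x, y, epochs=2, batch_size=16, verbose=0)
w_frozen = base.get_weights()

# Phase 2: unfreeze, recompile with a much smaller learning rate, and fine-tune.
base.trainable = True
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
              loss="binary_crossentropy")
model.fit(x, y, epochs=2, batch_size=16, verbose=0)
w_finetuned = base.get_weights()
```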

What do you suppose is the advantage to this graduated fine tuning process versus allowing full weight editing from the beginning? 

### Pre-Training and Text Processing

In text processing the reused network is often a complex sentence-level embedder such as Google's Universal Sentence Encoder, BERT or similar. We will return to the use of these networks in the context of Transformers after the Easter break. 