<a href="https://colab.research.google.com/github/rahiakela/transfer-learning-for-natural-language-processing/blob/main/3-shallow-transfer-learning-for-nlp/1_semi_supervised_learning_with_pretrained_word_embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Semi-supervised Learning with Pretrained Word Embeddings

In this notebook, we will cover some prominent shallow transfer learning approaches and concepts. This allows us to explore some major themes in transfer learning, while doing so in the context of relatively simple models of the class of eventual interest, i.e., shallow neural networks.

Roughly speaking, transfer learning methods are categorized by whether transfer occurs between different languages, tasks or data domains. These categories are usually referred to as cross-lingual learning, multi-task learning and domain adaptation, respectively.

<img src='https://github.com/rahiakela/img-repo/blob/master/transfer-learning-for-natural-language-processing/shallow-transfer-learning.png?raw=1' width='800'/>

The methods we will look at here all involve components that are neural networks in one way or another. However, these neural networks do not have many layers, which is why the label “shallow” is appropriate to describe this collection of methods.

We start with a common form of semi-supervised learning that employs pretrained word embeddings such as word2vec, which produce a single vector per word regardless of context.

We revisit the IMDB movie review sentiment classification. Recall that this example is concerned with classifying movie reviews from IMDB into positive or negative sentiments expressed. It is a prototypical sentiment analysis example that has been used widely in the literature to study many algorithms. We combine feature vectors generated by pretrained word embeddings for each review with some traditional machine learning classification methods, namely random forests and logistic regression.

We then demonstrate that using higher-level embeddings, which vectorize bigger sections of text (at the sentence, paragraph or document level), can lead to improved performance. The general idea is to vectorize text and then apply a traditional machine learning classification method to the resulting vectors.

<img src='https://github.com/rahiakela/img-repo/blob/master/transfer-learning-for-natural-language-processing/semi-supervised-learning.png?raw=1' width='800'/>

**Multi-task learning**

Subsequently, we introduce the reader to multi-task learning. We demonstrate how one can train a single system to perform multiple tasks simultaneously, namely email spam classification and IMDB movie review sentiment analysis.

There are several potential benefits to multi-task learning. By training a single machine learning model for multiple tasks, a shared representation is learned on a larger and more varied collection of data from the combined data pool, which can lead to performance improvements. Moreover, it has been widely observed that this shared representation generalizes better to tasks beyond those it was trained on, and this improvement can be achieved without any increase in model size.

Specifically, we focus on shallow neural multi-task learning, where a single additional dense layer, as well as a classification layer, is trained for each specific task in the setup. The tasks also share a layer between them, a setup typically referred to as hard-parameter sharing.
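
As a rough illustration of hard-parameter sharing, the sketch below runs a forward pass through one shared layer followed by a task-specific dense layer and classification layer per task. It is a minimal NumPy sketch with random, untrained weights. The layer sizes, the task names and the helper names (`dense`, `predict`) are assumptions made for the example, and no training loop is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, w, b):
    """Fully connected layer followed by a ReLU activation."""
    return np.maximum(x @ w + b, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical sizes: input features, shared layer width, per-task layer width
n_features, n_shared, n_task = 300, 64, 16

# One set of shared weights, used by every task (the "hard" sharing)
shared = (rng.normal(size=(n_features, n_shared)) * 0.1, np.zeros(n_shared))

# Each task gets its own dense layer and its own classification layer
heads = {}
for task in ("spam", "sentiment"):
    heads[task] = {
        "dense": (rng.normal(size=(n_shared, n_task)) * 0.1, np.zeros(n_task)),
        "out": (rng.normal(size=(n_task, 1)) * 0.1, np.zeros(1)),
    }

def predict(x, task):
    h = dense(x, *shared)                # shared representation
    h = dense(h, *heads[task]["dense"])  # task-specific dense layer
    w, b = heads[task]["out"]
    return sigmoid(h @ w + b)            # task-specific binary classifier

batch = rng.normal(size=(4, n_features))
spam_prob = predict(batch, "spam")
sent_prob = predict(batch, "sentiment")
print(spam_prob.shape, sent_prob.shape)  # (4, 1) (4, 1)
```

During training, gradients from both tasks would update the `shared` weights, while each head would only be updated by its own task.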

**Domain adaptation**

Assume that we are given one source domain, which can be defined as a particular distribution of data for a specific task, and a classifier that has been trained to perform well on data in that domain for that task. The goal of domain adaptation is to modify, or adapt, data in a different target domain in such a way that the pretrained knowledge from the source domain can aid
learning in the target domain. We apply a simple autoencoding approach to “project” samples in the target domain into the source domain feature space.

An autoencoder is a system that learns to reconstruct inputs with very high accuracy, typically by encoding them into an efficient latent representation and learning to decode the said representation efficiently. They have traditionally been heavily used in model reduction applications, since the latent representation is often of smaller dimension than the original space
from which the encoding happens, and the said dimension value can also be picked to strike the right balance of computational efficiency and accuracy.

In the extreme scenario, improvements can be obtained without using any labeled data in the target domain for training. This is typically referred to as zero-shot domain adaptation.


## Setup

In [1]:
import numpy as np  # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import random
import re
import os
import time

from gensim.models import FastText, KeyedVectors
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier      # random forest classifier library
from sklearn.metrics import accuracy_score

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

import matplotlib.pyplot as plt
from IPython.display import HTML

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Download IMDB Movie Review Dataset

In [None]:
%%shell

wget -q "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
tar xzf aclImdb_v1.tar.gz

rm -rf aclImdb_v1.tar.gz
rm -rf aclImdb/train/unsup

Let's download the pretrained fastText word embedding from [Kaggle](https://www.kaggle.com/yangjia1991/jigsaw#)

In [None]:
from google.colab import files
files.upload() # upload kaggle.json file

Saving kaggle.json to kaggle.json



In [None]:
%%shell

mkdir -p ~/.kaggle
mv kaggle.json ~/.kaggle/
ls ~/.kaggle
chmod 600 /root/.kaggle/kaggle.json

# download dataset from kaggle
kaggle datasets download -d yangjia1991/jigsaw/wiki.en.vec
unzip -qq jigsaw.zip

mv: cannot stat 'kaggle.json': No such file or directory
kaggle.json
Downloading jigsaw.zip to /content
 11% 258M/2.37G [00:03<00:30, 75.6MB/s]
User cancelled operation




## Preprocessing IMDB Movie Data

Before proceeding, we must decide how many samples to draw from each class. We must also decide the maximum number of tokens per review, and the maximum length of each token. This is done by setting the following overarching hyperparameters.

In [6]:
n_sample = 1000   # number of samples to generate in each class
maxtokens = 200    # the maximum number of tokens per document
maxtokenlen = 100  # the maximum length of each token

### Tokenization

Let’s proceed by defining a function that tokenizes text by splitting it into words.

In [4]:
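def tokenize(row):
  if row is None or row == "":   # compare by value; `is ""` tests object identity
    tokens = []                  # no text, so no tokens
  else:
    tokens = row.split(" ")[:maxtokens]   # split on spaces and keep at most maxtokens tokens
  return tokens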
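
To see what this tokenizer does to a concrete string, here is a small self-contained sketch. It restates `tokenize` compactly and uses a deliberately small `maxtokens` value, chosen only for this demonstration.

```python
maxtokens = 5  # hypothetical small cap, just for this demonstration

def tokenize(row):
    if row is None or row == "":
        return []
    return row.split(" ")[:maxtokens]

tokens = tokenize("This movie was surprisingly good, I loved it!")
print(tokens)            # ['This', 'movie', 'was', 'surprisingly', 'good,']

# Splitting on a single space keeps punctuation attached and yields
# empty tokens wherever two spaces are adjacent.
print(tokenize("a  b"))  # ['a', '', 'b']
```

Punctuation that stays attached to words (such as `good,`) and empty tokens are dealt with by the later cleaning steps.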

### Remove punctuation and unnecessary characters

**In order to ensure that classification is done based on language content only, we have to remove punctuation marks and other non-word characters from the reviews.** We do this by employing regular expressions with the Python `re` library. We also normalize words by turning them into lower case.

In [3]:
def reg_expressions(row):
  tokens = []
  try:
    for token in row:
      token = token.lower()          # make all characters lower case
      token = re.sub(r"[\W\d]", "", token)   # strip non-word characters and digits
      token = token[:maxtokenlen]    # truncate all tokens to hyperparameter maxtokenlen
      tokens.append(token)
  except Exception:                  # e.g. row is not iterable or a token is not a string
    token = ""
    tokens.append(token)
  return tokens
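
The effect of the pattern `[\W\d]` is easiest to see on a few sample tokens. The short sketch below applies the same lower-casing and substitution to some made-up strings.

```python
import re

samples = ["Great!!", "10/10", "don't", "<br", "snake_case"]
cleaned = [re.sub(r"[\W\d]", "", s.lower()) for s in samples]
print(cleaned)  # ['great', '', 'dont', 'br', 'snake_case']
```

Note that a token consisting only of digits and punctuation becomes an empty string rather than disappearing, and that the underscore survives because `\W` treats it as a word character.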

### Stop-word removal

Stop-words are also removed. Stop-words are words that are very common in text but offer little useful information for classifying it. Words such as "is", "and", "the" and "are" are examples of stop-words. The NLTK library contains a list of English stop-words (127 in the release referenced here; the exact count varies across NLTK versions) that can be used to filter our tokenized strings.

In [None]:
stop_words = stopwords.words("english")
print(stop_words)

In [None]:
# print(stop_words) # see default stop-words
# it may be beneficial to drop negation words from the removal list, as they can change the positive/negative meaning
# of a sentence
# stop_words.remove("no")
# stop_words.remove("nor")
# stop_words.remove("not")

In [5]:
def stop_word_removal(row):
  # Keep non-empty tokens that are not stop-words; return a list rather than a lazy filter object
  tokens = [token for token in row if token and token not in stop_words]

  return tokens
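
The following sketch shows the filtering on a toy token list. It uses a small hand-written stop-word set (an assumption for the example, so that it runs without downloading the NLTK corpus) and a compact list-returning variant of the function.

```python
# Hypothetical subset of stop-words, standing in for the NLTK list
stop_words = {"is", "and", "the", "are", "this", "a"}

def stop_word_removal(row):
    return [token for token in row if token and token not in stop_words]

tokens = ["this", "is", "", "a", "great", "movie", "The"]
print(stop_word_removal(tokens))  # ['great', 'movie', 'The']
```

The capitalized `The` is kept because the comparison is case-sensitive and the stop-word list is lower case, so the order in which lower-casing and stop-word removal are applied matters.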

## Semi-supervised Learning with fastText Embedding Vectors

The concept of word embeddings is central to the field of NLP. It is a name given to a collection of techniques which produce a set of vectors of real numbers for each word that needs to be analyzed. A major consideration in word embedding design is the dimension of the vector generated. Bigger vectors generally can achieve better representation capability of words within
a language and thereby better performance on many tasks, while naturally being more expensive computationally.

As was outlined, this important sub-area of NLP research has a rich history originating with the term-vector model of information retrieval in the 1960s. This culminated with pretrained shallow neural-network-based techniques such as fastText, GloVe and word2vec. The latter came in several variants in the mid-2010s, including:

- Continuous Bag of Words (CBOW)
- Skip-Gram

Both CBOW and Skip-Gram are extracted from shallow neural networks that were trained for various goals. **Skip-Gram attempts to predict words neighboring any target word in a sliding window, while CBOW attempts to predict the target word given the neighbors.**
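
The difference between the two objectives can be made concrete by listing the training examples each one would draw from a sentence. The sketch below is illustrative only: the function names and the window size are assumptions, and real implementations add details such as subsampling and negative sampling.

```python
def skipgram_pairs(tokens, window=2):
    """(target, neighbor) pairs: the model predicts each neighbor from the target."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_pairs(tokens, window=2):
    """(neighbors, target) pairs: the model predicts the target from its neighbors."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, target))
    return pairs

sentence = ["the", "movie", "was", "great"]
print(skipgram_pairs(sentence, window=1))
# [('the', 'movie'), ('movie', 'the'), ('movie', 'was'),
#  ('was', 'movie'), ('was', 'great'), ('great', 'was')]
print(cbow_pairs(sentence, window=1))
# [(['movie'], 'the'), (['the', 'was'], 'movie'),
#  (['movie', 'great'], 'was'), (['was'], 'great')]
```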

GloVe, which stands for “Global Vectors”, attempts to extend word2vec by incorporating global information into the embeddings. It optimizes the embeddings such that the dot product between two word vectors reflects the logarithm of the number of times the words co-occur, with the goal of making the resulting vectors more interpretable.
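
For reference, the GloVe training objective is usually written as a weighted least-squares problem. The notation here is introduced for this note only: $X_{ij}$ is the number of times word $j$ occurs in the context of word $i$, $w_i$ and $\tilde{w}_j$ are the word and context vectors, $b_i$ and $\tilde{b}_j$ are bias terms, $V$ is the vocabulary size and $f$ is a weighting function that limits the influence of very frequent pairs.

$$
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
$$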

**fastText**

The fastText technique attempts to enhance word2vec by applying the Skip-Gram method to character n-grams (rather than word n-grams), thereby being able to handle previously unseen words.
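
The sub-word idea can be sketched in a few lines. The example below extracts character n-grams with boundary markers and builds a vector for an unseen word by averaging the vectors of the n-grams it shares with known words. It is a toy illustration under stated assumptions: the n-gram vectors are random, the names `char_ngrams` and `oov_vector` are hypothetical, the default n-gram range is assumed, and the real library hashes n-grams into buckets and differs in its aggregation details.

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    marked = f"<{word}>"
    return [marked[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(marked) - n + 1)]

print(char_ngrams("where", n_min=3, n_max=3))
# ['<wh', 'whe', 'her', 'ere', 're>']

# Toy n-gram table: random vectors for the trigrams of a few "known" words
rng = np.random.default_rng(0)
dim = 4
known_words = ["where", "here", "there"]
ngram_vectors = {g: rng.normal(size=dim)
                 for w in known_words for g in char_ngrams(w, 3, 3)}

def oov_vector(word):
    """Average the vectors of the word's trigrams that exist in the table."""
    grams = [g for g in char_ngrams(word, 3, 3) if g in ngram_vectors]
    if not grams:
        return None  # nothing in common with any known word
    return np.mean([ngram_vectors[g] for g in grams], axis=0)

vec = oov_vector("nowhere")  # shares 'whe', 'her', 'ere', 're>' with known words
print(vec.shape)             # (4,)
```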

To reiterate, **fastText is known for its ability to handle out-of-vocabulary words, which comes from it having been designed to embed sub-word character n-grams** (versus entire words, as is the case with word2vec). **This enables it to build up embeddings for previously unseen words by aggregating the embeddings of their constituent character n-grams. That comes at the expense of a larger pretrained embedding, and higher computing resource requirements and cost.**

For these reasons, we elect to use the fastText framework as the representative pretrained word embedding computing method in this notebook, albeit with the word2vec input format. This allows us to keep the computing cost lower, making the exercise easier for the reader, while also showcasing how out-of-vocabulary issues would be handled and providing a solid experience platform from which the reader can venture into sub-word embeddings.

## Load pre-trained fastText embedding

Once the embedding is available, it can be loaded using the following code snippet.

In [None]:
start = time.time()
fasttext_embedding = KeyedVectors.load_word2vec_format("wiki.en.vec")
end = time.time()

print("Loading the embedding took %d seconds" % (end - start))

Loading a large embedding like this one can be slow. In practice, it is therefore not uncommon to load the embedding once into memory and then serve access to it, for example through a Flask service, for as long as it is needed.


Since our embedding of choice, loaded in the word2vec format, does not handle out-of-vocabulary words out of the box, the next thing we do is develop a methodology for addressing this situation. The simplest approach is to skip any such words. Since looking up a missing word raises an error (a `KeyError` in gensim's `KeyedVectors`), we use a `try`/`except` block to catch these errors without interrupting execution. The function below assumes that you are given a pretrained embedding that behaves like a dictionary, with words as keys and corresponding vectors as values, and an input list of words in a review.

In [2]:
def handle_out_of_vocab(embedding, in_text):
  out = None
  for word in in_text:    # Loop through every word
    try:
      tmp = embedding[word]              # Extract corresponding embedding vector and enforce “row shape”
      tmp = tmp.reshape(1, len(tmp))

      if out is None:     # Handle edge case of the first vector and an empty out array
        out = tmp
      else:
        out = np.concatenate((out, tmp), axis=0)    # Concatenate row embedding vector to output Numpy array
    except KeyError:     # Out-of-vocabulary word: skip it and continue with the next word
      pass

  return out
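
A toy run makes the behavior on unknown words visible. The sketch below uses a plain dictionary as a stand-in for the loaded embedding, and a variant of the lookup (the hypothetical `lookup_known`) that collects rows in a list and stacks them once at the end, which avoids repeated array concatenation. It is meant as an illustration, not as a replacement for the function above.

```python
import numpy as np

# Toy stand-in for a pretrained embedding: word -> vector
toy_embedding = {
    "good": np.array([1.0, 0.0, 0.0]),
    "movie": np.array([0.0, 1.0, 0.0]),
}

def lookup_known(embedding, words):
    rows = []
    for word in words:
        try:
            rows.append(embedding[word])
        except KeyError:   # out-of-vocabulary word: skip it
            continue
    # One row per known word, or None if every word was unknown
    return np.vstack(rows) if rows else None

known = lookup_known(toy_embedding, ["good", "zzzz", "movie"])
print(known.shape)                            # (2, 3)
print(lookup_known(toy_embedding, ["zzzz"]))  # None
```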

Before applying this function to the reviews, we must decide how to combine, or aggregate, the embedding vectors for individual words in a review into a single vector representing the entire review. It has been found in practice that the heuristic of simply averaging the word vectors works as a very strong baseline. Since the embeddings were trained in a way that ensures that similar words are closer to each other in the resulting vector space, it makes intuitive sense that their average would represent the average meaning of the collection. The averaging baseline for summarization/aggregation is often recommended as a first step in embedding bigger sections of text from word embeddings.

The following function calls `handle_out_of_vocab` on every review in the corpus, averages the output and concatenates the resulting vectors into a single 2-dimensional NumPy array. The rows of this array correspond to aggregated-by-averaging embedding vectors for each review.

In [7]:
def assemble_embedding_vectors(data):
  out = None
  for item in data:    # Loop through every IMDB review
    tmp = handle_out_of_vocab(fasttext_embedding, item)     # Extract embedding vectors for every word in review, making sure to handle out-of-vocab words
    if tmp is not None:
      dim = tmp.shape[1]
      if out is not None:
        vec = np.mean(tmp, axis=0)                  # Average word vectors in each review
        vec = vec.reshape((1, dim))
        out = np.concatenate((out, vec), axis=0)    # Concatenate average row vector to output Numpy array
      else:
        out = np.mean(tmp, axis=0).reshape((1, dim))
    else:
      pass     # Every-word-out-of-vocab edge case: the review is skipped and contributes no row to the output

  return out
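
One consequence of skipping reviews in which every word is out-of-vocabulary is that the output can have fewer rows than there are reviews, so the labels need to be filtered in the same way. The sketch below shows one possible way to keep the two aligned. The function name `average_embeddings` and the toy data are assumptions for the example, and the function assumes at least one review contains a known word.

```python
import numpy as np

def average_embeddings(embedding, reviews):
    """Return one averaged vector per usable review, plus the indices kept."""
    vectors, kept = [], []
    for idx, words in enumerate(reviews):
        rows = [embedding[w] for w in words if w in embedding]
        if rows:                       # skip reviews with no known words
            vectors.append(np.mean(rows, axis=0))
            kept.append(idx)
    return np.vstack(vectors), np.array(kept)

toy = {"good": np.array([1.0, 0.0]), "bad": np.array([0.0, 1.0])}
reviews = [["good", "good", "bad"], ["zzzz"], ["bad"]]
labels = np.array([1, 0, 0])

X, kept = average_embeddings(toy, reviews)
y = labels[kept]     # drop the label of the skipped review as well
print(X.shape)       # (2, 2)
print(kept)          # [0 2]
```

The first row of `X` is the mean of two `good` vectors and one `bad` vector, that is, roughly `[0.667, 0.333]`.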

## Putting It All Together To Assemble Dataset

Having obtained and loaded the pre-trained embedding, let’s look back at the IMDB movie review classification example, which we will be analyzing in this section.

In the earlier bag-of-words approach, we generated a simple representation that counts the occurrence frequencies of possible word tokens in each review, and then used the resulting vectors as numerical features for further machine learning tasks.

Here, instead of the bag-of-words representation, we extract the corresponding vectors from the pretrained embedding.



In [8]:
# shuffle raw data first
def unison_shuffle_data(data, header):
    p = np.random.permutation(len(header))
    data = data[p]
    header = np.asarray(header)[p]
    return data, header

# load data in appropriate form
def load_data(path):
  data, sentiments = [], []
  for folder, sentiment in (("neg", 0), ("pos", 1)):
    folder = os.path.join(path, folder)
    for name in os.listdir(folder):    # Go through every file in current folder
      with open(os.path.join(folder, name), "r") as reader:
        text = reader.read()
      # Apply tokenization, stopword analysis routines
      text = tokenize(text)
      text = stop_word_removal(text)
      text = reg_expressions(text)
      # Track corresponding text and sentiment labels
      data.append(text)
      sentiments.append(sentiment)
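  # Convert to Numpy array (dtype=object because reviews have different numbers of tokens)
  data_np = np.array(data, dtype=object)
  # Shuffle the array, not the list: a Python list cannot be indexed with a permutation array
  data, sentiments = unison_shuffle_data(data_np, sentiments)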

  return data, sentiments
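
The point of shuffling "in unison" is that each review must stay paired with its label. The short check below illustrates this on a toy dataset; the four reviews and the seed are made up for the example, and the function is restated compactly so the block runs on its own.

```python
import numpy as np

def unison_shuffle_data(data, header):
    p = np.random.permutation(len(header))   # one permutation for both arrays
    return data[p], np.asarray(header)[p]

np.random.seed(0)
# Token lists of different lengths need dtype=object to form a 1-D array
reviews = np.array([["bad"], ["awful", "film"], ["great"], ["loved", "it"]],
                   dtype=object)
labels = [0, 0, 1, 1]
original = {tuple(r): l for r, l in zip(reviews, labels)}

shuffled_reviews, shuffled_labels = unison_shuffle_data(reviews, labels)

# Every review still carries the label it started with
still_paired = all(original[tuple(r)] == l
                   for r, l in zip(shuffled_reviews, shuffled_labels))
print(still_paired)  # True
```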

In [None]:
train_path = os.path.join("aclImdb", "train")
test_path = os.path.join("aclImdb", "test")
raw_data, raw_header = load_data(train_path)

print(raw_data.shape)
print(len(raw_header))

In [None]:
# Subsample required number of samples
random_indices = np.random.choice(range(len(raw_header)), size=(n_sample * 2,), replace=False)
data_train = raw_data[random_indices]
header = raw_header[random_indices]

print("DEBUG::data_train::")
print(data_train)

Display sentiments and their frequencies in the dataset, to ensure it is roughly balanced between classes.

In [None]:
unique_elements, counts_elements = np.unique(header, return_counts=True)

print("Sentiments and their frequencies:")
print(unique_elements)
print(counts_elements)

We can now assemble embedding vectors for the whole dataset using the function call:

In [None]:
embedding_vectors = assemble_embedding_vectors(data_train) 
print(embedding_vectors)

These can now be used as feature vectors for the logistic regression and random forest classifiers.

As the very last step of preparing the sentiment dataset for training by our baseline classifiers, we split it into independent training and testing or validation sets. This will allow us to evaluate the performance of the classifier on a set of data that was not used for training, an important thing
to ensure in machine learning practice. We elect to use 70% of the data for training, and 30% for testing/validation afterwards.

In [None]:
data = embedding_vectors

idx = int(0.7 * data.shape[0])


# 70% of data for training
train_x = data[:idx, :]
train_y = header[:idx]

# remaining 30% for testing
test_x = data[idx:, :]
test_y = header[idx:]

print("train_x/train_y list details, to make sure it is of the right form:")
print(len(train_x))
print(train_x[:5])
print(len(train_y))
print(train_y[:5])

## Logistic Regression Classifier