<a href="https://colab.research.google.com/github/rahiakela/transfer-learning-for-natural-language-processing/blob/main/3-shallow-transfer-learning-for-nlp/3_multi_task_learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Multi-Task Learning

In this notebook, we will cover some prominent shallow transfer learning approaches and concepts. This allows us to explore some major themes in transfer learning, while doing so in the context of relatively simple models of the class of eventual interest, i.e., shallow neural networks.

Roughly speaking, transfer learning approaches are categorized by whether transfer occurs between different languages, tasks, or data domains. These categories are usually referred to as cross-lingual learning, multi-task learning, and domain adaptation, respectively.

<img src='https://github.com/rahiakela/img-repo/blob/master/transfer-learning-for-natural-language-processing/shallow-transfer-learning.png?raw=1' width='800'/>

The methods we will look at here all involve components that are neural networks in one way or another. These neural networks do not have many layers, which is why the label “shallow” is appropriate for this collection of methods.

A common form of semi-supervised learning employs pretrained word embeddings such as word2vec, which produce a single vector per word, regardless of context.

We revisit the IMDB movie review sentiment classification. Recall that this example is concerned with classifying movie reviews from IMDB into positive or negative sentiments expressed. It is a prototypical sentiment analysis example that has been used widely in the literature to study many algorithms. We combine feature vectors generated by pretrained word embeddings for each review with some traditional machine learning classification methods, namely random forests and logistic regression.

We then demonstrate that using higher-level embeddings, which vectorize bigger sections of text (at the sentence level, paragraph level, or document level), can lead to improved performance. The general idea is to vectorize the text and then apply a traditional machine learning classification method to the resulting vectors.

<img src='https://github.com/rahiakela/img-repo/blob/master/transfer-learning-for-natural-language-processing/semi-supervised-learning.png?raw=1' width='800'/>
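One common way to turn per-word vectors into a single feature vector for a review is to average them. The sketch below illustrates this under stated assumptions: `toy_embeddings` and its values are made up for illustration (they are not real word2vec output), and `average_vector` is a hypothetical helper, not a function from this notebook.

```python
# Toy 3-dimensional "word embeddings"; the values are illustrative assumptions.
toy_embeddings = {
    "good":  [1.0, 0.0, 2.0],
    "movie": [0.0, 2.0, 0.0],
    "bad":   [-1.0, 0.0, -2.0],
}

def average_vector(tokens, embeddings, dim=3):
    """Average the vectors of in-vocabulary tokens; return zeros if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the values of each dimension across all known tokens
    return [sum(column) / len(known) for column in zip(*known)]

# "unknownword" is out of vocabulary and is simply skipped
review_vector = average_vector(["good", "movie", "unknownword"], toy_embeddings)
print(review_vector)  # [0.5, 1.0, 1.0]
```

The resulting fixed-length vector can be passed to any traditional classifier. Sentence-level embeddings such as sent2vec replace the averaging step with a model trained to embed the whole text.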

**Multi-task learning**

Subsequently, we introduce the reader to multi-task learning. We demonstrate how one can train a single system simultaneously to perform multiple tasks, email spam classification and IMDB movie review sentiment analysis. 

There are several potential benefits to multi-task learning. By training a single machine learning model for multiple tasks, a shared representation is learned on a larger and more varied collection of data from the combined data pool, which can lead to performance improvements. Moreover, it has been widely observed that this shared representation generalizes better to tasks beyond those it was trained on, and this improvement can be achieved without any increase in model size.

Specifically, we focus on shallow neural multitask learning, where a single additional dense layer, as well as a classification layer, is trained
for each specific task in the setup. Different tasks also share a layer between them, a setup typically referred to as hard-parameter sharing.

**Domain adaptation**

Assume that we are given one source domain, which can be defined as a particular distribution of data for a specific task, and a classifier that has been trained to perform well on data in that domain for that task. The goal of domain adaptation is to modify, or adapt, data in a different target domain in such a way that the pretrained knowledge from the source domain can aid
learning in the target domain. We apply a simple autoencoding approach to “project” samples in the target domain into the source domain feature space.

An autoencoder is a system that learns to reconstruct inputs with very high accuracy, typically by encoding them into an efficient latent representation and learning to decode the said representation efficiently. They have traditionally been heavily used in model reduction applications, since the latent representation is often of smaller dimension than the original space
from which the encoding happens, and the said dimension value can also be picked to strike the right balance of computational efficiency and accuracy.
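To make the encode/decode idea concrete, here is a minimal sketch of a *linear* autoencoder computed in closed form with the singular value decomposition (equivalent to PCA). This is a stand-in chosen for brevity, not a trained neural autoencoder; the synthetic data and the helper names `encode`, `decode`, and `reconstruction_error` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 100 samples in 5 dimensions that actually lie on a 2-D subspace.
latent = rng.normal(size=(100, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing

mean = X.mean(axis=0)
# Rows of vt are the principal directions of the centered data.
_, _, vt = np.linalg.svd(X - mean, full_matrices=False)

def encode(x, k):
    """Project centered inputs onto the top-k directions (the latent code)."""
    return (x - mean) @ vt[:k].T

def decode(z, k):
    """Map a k-dimensional latent code back to the original space."""
    return z @ vt[:k] + mean

def reconstruction_error(k):
    """Mean squared error between the data and its reconstruction."""
    return float(np.mean((decode(encode(X, k), k) - X) ** 2))

for k in (1, 2):
    print(k, reconstruction_error(k))
```

Because the toy data has only two underlying degrees of freedom, a latent dimension of 2 reconstructs it almost exactly while 1 does not, which illustrates the trade-off between latent size and accuracy described above.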

In the extreme scenario, improvements can be obtained without using any labeled data from the target domain for training. This is typically referred to as zero-shot domain adaptation.


## Setup

In [None]:
import tensorflow as tf

print(tf.__version__)

In [None]:
# install sent2vec
!pip install git+https://github.com/epfml/sent2vec

In [None]:
import numpy as np  # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import random
import re
import os
import time

from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, Dropout, concatenate

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

import sent2vec

import matplotlib.pyplot as plt
from IPython.display import HTML

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Download IMDB Movie Review Dataset

In [None]:
%%shell

wget -q "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
tar xzf aclImdb_v1.tar.gz

rm -rf aclImdb_v1.tar.gz
rm -rf aclImdb/train/unsup



Let's download the pretrained sent2vec embedding from [Kaggle](https://www.kaggle.com/maxjeblick/sent2vec)

In [None]:
from google.colab import files
files.upload() # upload kaggle.json file

Saving kaggle.json to kaggle.json



In [None]:
%%shell

mkdir -p ~/.kaggle
mv kaggle.json ~/.kaggle/
ls ~/.kaggle
chmod 600 /root/.kaggle/kaggle.json

# download word embeddings from kaggle
kaggle datasets download -d maxjeblick/sent2vec/wiki_unigrams.bin
unzip -qq sent2vec.zip
rm -rf sent2vec.zip

# download dataset from kaggle
kaggle datasets download -d wcukierski/enron-email-dataset
unzip -qq enron-email-dataset.zip

kaggle datasets download -d rtatman/fraudulent-email-corpus
unzip -qq fraudulent-email-corpus.zip

rm -rf enron-email-dataset.zip fraudulent-email-corpus.zip

kaggle.json
Downloading sent2vec.zip to /content
100% 4.42G/4.43G [02:04<00:00, 66.3MB/s]
100% 4.43G/4.43G [02:04<00:00, 38.2MB/s]




In [None]:
import email  # standard-library package for parsing e-mail messages

def extract_messages(df):
  messages = []
  for item in df["message"]:
    # Return a message object structure from a string
    e = email.message_from_string(item)
    # get message body
    message_body = e.get_payload()
    messages.append(message_body)
  print("Successfully retrieved message body from e-mails!")
  return messages
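As a quick illustration of what `email.message_from_string` and `get_payload` do, the sketch below parses a single hand-written message. The message text and addresses are hypothetical and are not taken from either corpus.

```python
import email

# Hypothetical raw message: headers, a blank line, then the body.
raw_message = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Lunch\n"
    "\n"
    "Are you free at noon?\n"
)

parsed = email.message_from_string(raw_message)
print(parsed["Subject"])      # header lookup by name
print(parsed.get_payload())   # body text for a simple, non-multipart message
```

Note that for multipart messages `get_payload()` returns a list of sub-messages rather than a string, so the bodies collected by `extract_messages` are not guaranteed to all be plain strings.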

## Preprocessing IMDB Data

Before proceeding, we must decide how many samples to draw from each class. We must also decide the maximum number of tokens per document, and the maximum length of each token. This is done by setting the following overarching hyperparameters.

In [None]:
n_sample = 1000    # number of samples to generate in each class
maxtokens = 200    # the maximum number of tokens per document
maxtokenlen = 100  # the maximum length of each token

### Tokenization

Let’s proceed by defining a function that tokenizes text by splitting it into words.

In [None]:
def tokenize(row):
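  if row is None or row == "":   # compare by value; `is ""` tests identity and is unreliable
    tokens = []                  # return an empty token list for missing text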
  else:
    tokens = str(row).split(" ")[:maxtokens]
  return tokens
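The following sketch shows the behavior of this kind of whitespace tokenizer on two short inputs. `tokenize_demo` is a hypothetical stand-alone variant that takes the token limit as a parameter instead of reading the global `maxtokens`.

```python
def tokenize_demo(text, max_tokens):
    """Split on single spaces and keep at most max_tokens tokens."""
    if text is None or text == "":
        return []
    return str(text).split(" ")[:max_tokens]

# Truncation: only the first three tokens are kept
print(tokenize_demo("this movie was surprisingly good", max_tokens=3))
# Splitting on a single space yields an empty token for consecutive spaces
print(tokenize_demo("two  spaces", max_tokens=5))
```

The empty token in the second example is one reason empty strings are filtered out during stop-word removal later on.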

### Remove punctuation and unnecessary characters

**In order to ensure that classification is done based on language content only, we have to remove punctuation marks and other non-word characters from the text.** We do this with regular expressions from Python's `re` module. We also normalize words by turning them into lower case.

In [None]:
def reg_expressions(row):
  tokens = []
  try:
    for token in row:
      token = token.lower()          # make all characters lower case
      token = re.sub(r"[\W\d]", "", token)
      token = token[:maxtokenlen]    # truncate all tokens to hyperparameter maxtokenlen
      tokens.append(token)
  except:
    token = ""
    tokens.append(token)
  return tokens
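The effect of the `[\W\d]` pattern is easiest to see on a few sample tokens. The sketch below uses a hypothetical stand-alone helper, `clean_tokens`, that applies the same three steps (lower-casing, character removal, truncation) without the `try`/`except` wrapper.

```python
import re

def clean_tokens(tokens, max_token_len=100):
    """Lowercase, strip non-word characters and digits, then truncate."""
    cleaned = []
    for token in tokens:
        token = re.sub(r"[\W\d]", "", token.lower())
        cleaned.append(token[:max_token_len])
    return cleaned

print(clean_tokens(["It's", "GREAT!!", "10/10", "opera.<br"]))
# ['its', 'great', '', 'operabr']
```

Two side effects are worth noting: a token made only of digits and punctuation becomes an empty string, and leftover HTML such as `<br` is fused with the neighboring word, which explains tokens like `operabr` in the processed reviews.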

### Stop-word removal

Stop-words are also removed. Stop-words are words that are very common in text but offer little useful information for classifying it. Words such as *is*, *and*, *the*, and *are* are examples of stop-words. The NLTK library contains a list of English stop-words (127 in older releases; the exact count depends on the NLTK version) that can be used to filter our tokenized strings.

In [None]:
stop_words = stopwords.words("english")
print(stop_words)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [None]:
# print(stop_words) # see default stop-words
# it may be beneficial to drop negation words from the removal list, as they can change the positive/negative meaning
# of a sentence
# stop_words.remove("no")
# stop_words.remove("nor")
# stop_words.remove("not")

In [None]:
def stop_word_removal(row):
  # Keep tokens that are non-empty and not stop-words; return a list rather than a lazy filter object
  return [token for token in row if token and token not in stop_words]
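The sketch below shows stop-word filtering on a toy sentence, and what changes when negation words are kept. `demo_stop_words` is a small hypothetical list used so the example runs without downloading NLTK data, and `remove_stop_words` is an illustrative helper.

```python
# Small hypothetical stop-word list; NLTK's real list is much longer.
demo_stop_words = {"the", "is", "a", "not", "and"}
negations = {"no", "nor", "not"}

def remove_stop_words(tokens, stop_words):
    """Drop empty tokens and tokens found in stop_words."""
    return [t for t in tokens if t and t not in stop_words]

tokens = ["the", "plot", "is", "not", "good", ""]
print(remove_stop_words(tokens, demo_stop_words))              # ['plot', 'good']
print(remove_stop_words(tokens, demo_stop_words - negations))  # ['plot', 'not', 'good']
```

With the default list, "not good" collapses to "good", which flips the apparent sentiment. Keeping negations avoids this at the cost of slightly longer token lists.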

### Load pre-trained sent2vec embedding

Quite naturally, just as in the case of the pretrained word embeddings, the next step is to obtain the pretrained sent2vec sentence embedding to be loaded by the particular implementation/framework installed.

We choose the smallest 600-dimensional embedding `wiki_unigrams.bin`, approximately 5 Gigabytes in size, which captures just the unigram information on Wikipedia.

Now let's load the pre-trained embedding.

In [None]:
# load sent2vec embedding
model = sent2vec.Sent2vecModel()

start = time.time()
model.load_model("wiki_unigrams.bin")
end = time.time()

print("Loading the sent2vec embedding took %d seconds" % (end - start))

Loading the sent2vec embedding took 5 seconds


### Extract corresponding vectors from the pretrained sentence embedding

Next, we define a function to generate vectors for a collection of reviews. It is essentially a simpler form of the function presented in Listing 3.2 for pretrained word embeddings – it is simpler as we do not need to worry about out-of-vocabulary words.

In [None]:
def assemble_embedding_vectors(data):
  vectors = []
  for item in data:    # Loop through every document (e.g., IMDB review)
    # Embed the whole document as a single vector; no out-of-vocabulary handling is needed here
    vec = model.embed_sentence(" ".join(item))
    if vec is not None:    # Edge case handling
      vectors.append(vec)
  # Stack the row vectors into one Numpy array (None if nothing was embedded)
  return np.concatenate(vectors, axis=0) if vectors else None

### Preparing and assembling Dataset

In [None]:
# shuffle raw data first
def unison_shuffle_data(data, header):
    p = np.random.permutation(len(header))
    data = data[p]
    header = np.asarray(header)[p]
    return data, header

# load data in appropriate form
def load_data(path):
  data, sentiments = [], []
  for folder, sentiment in (("neg", 0), ("pos", 1)):
    folder = os.path.join(path, folder)
    for name in os.listdir(folder):    # Go through every file in current folder
      with open(os.path.join(folder, name), "r") as reader:
        text = reader.read()
      # Apply tokenization, stopword analysis routines
      text = tokenize(text)
      text = stop_word_removal(text)
      text = reg_expressions(text)
      # Track corresponding text and sentiment labels
      data.append(text)
      sentiments.append(sentiment)
  # Convert to Numpy array
  #print(data)
  data_np = np.array(data, dtype=object)  # token lists differ in length, so store them as objects
  #print(data_np[:10])
  data, sentiments = unison_shuffle_data(data_np, sentiments)

  return data, sentiments
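The point of shuffling "in unison" is that the same permutation is applied to the data and to the labels, so every sample keeps its own label. The sketch below checks this on toy data; `unison_shuffle` is a hypothetical variant that accepts a seed for reproducibility, and the sample names are made up.

```python
import numpy as np

def unison_shuffle(data, labels, seed=None):
    """Apply one random permutation to both arrays so pairs stay aligned."""
    rng = np.random.default_rng(seed)
    p = rng.permutation(len(labels))
    return np.asarray(data)[p], np.asarray(labels)[p]

data = np.array(["neg_a", "neg_b", "pos_a", "pos_b"])
labels = np.array([0, 0, 1, 1])
shuffled_data, shuffled_labels = unison_shuffle(data, labels, seed=0)

# Each text should still carry its own label after shuffling.
for text, label in zip(shuffled_data, shuffled_labels):
    print(text, label)
```

Shuffling the two arrays with separate calls would break this pairing, because each call would draw a different permutation.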

In [None]:
train_path = os.path.join("aclImdb", "train")
test_path = os.path.join("aclImdb", "test")
raw_data, raw_header = load_data(train_path)

print(raw_data.shape)
print(len(raw_header))

(25000,)
25000




In [None]:
# Subsample required number of samples
random_indices = np.random.choice(range(len(raw_header)), size=(n_sample * 2,), replace=False)
data_train = raw_data[random_indices]
header = raw_header[random_indices]

print("DEBUG::data_train::")
print(data_train[:10])

DEBUG::data_train::
[list(['i', 'found', 'good', 'movie', 'pass', 'time', 'chance', 'historical', 'value', 'the', 'portrayal', 'cleopatra', 'reminded', 'cheap', 'soap', 'operabr', 'br', 'the', 'twist', 'facts', 'is', 'funny', 'she', 'gave', 'birth', 'feeding', 'people', 'o', 'please', 'a', 'pregnant', 'queen', 'egypt', 'especially', 'one', 'would', 'bother', 'go', 'one', 'room', 'reason', 'they', 'tried', 'make', 'appear', 'saint', 'gods', 'sake', 'and', 'way', 'tried', 'justify', 'murdering', 'sister', 'beyond', 'descriptionbr', 'br', 'cleopatra', 'greatest', 'politician', 'time', 'her', 'decisions', 'based', 'anything', 'feelings', 'morals', 'she', 'everything', 'two', 'reasons', 'power', 'selfpreservation', 'she', 'borne', 'family', 'straggle', 'survival', 'something', 'well', 'anything', 'stood', 'way', 'either', 'murdered', 'her', 'brothers', 'sister', 'seduced', 'ceasar', 'mark', 'anthonybr', 'br', 'unfortunately', 'octavian', 'powerful', 'kill', 'too', 'gay', 'seduced', 'so', 'e

Display sentiments and their frequencies in the dataset, to ensure it is roughly balanced between classes.

In [None]:
unique_elements, counts_elements = np.unique(header, return_counts=True)

print("Sentiments and their frequencies:")
print(unique_elements)
print(counts_elements)

Sentiments and their frequencies:
[0 1]
[ 994 1006]


We can now use the `assemble_embedding_vectors` function to extract sent2vec embedding vectors for each review.

In [None]:
embedding_vectors = assemble_embedding_vectors(data_train) 
print(embedding_vectors)

[[ 0.0327281  -0.18146102  0.08980507 ... -0.07412156  0.03698216
   0.13855496]
 [-0.15479232 -0.04556673  0.10688065 ...  0.00971753 -0.1102825
   0.0852626 ]
 [ 0.03766685  0.04746525  0.05767557 ...  0.0738261  -0.18938291
   0.30567428]
 ...
 [ 0.02219635 -0.0454625  -0.24661975 ... -0.01034764  0.04574064
   0.19974674]
 [-0.06565411  0.02747259  0.06684896 ...  0.11156784 -0.00790572
   0.07958621]
 [ 0.0733537  -0.10498576  0.03755949 ...  0.00296096  0.02171134
   0.28900686]]


These can now be used as feature vectors for the same logistic regression and random forest classifiers.

As the very last step of preparing the sentiment dataset for training by our baseline classifiers, we split it into independent training and testing or validation sets. This will allow us to evaluate the performance of the classifier on a set of data that was not used for training, an important thing
to ensure in machine learning practice. We elect to use 70% of the data for training, and 30% for testing/validation afterwards.

In [None]:
data = embedding_vectors
del embedding_vectors

idx = int(0.7 * data.shape[0])


# 70% of data for training
train_x = data[:idx, :]
train_y = header[:idx]

# remaining 30% for testing
test_x = data[idx:, :]
test_y = header[idx:]

print("train_x/train_y list details, to make sure it is of the right form:")
print(len(train_x))
print(train_x[:5])
print(len(train_y))
print(train_y[:5])

train_x/train_y list details, to make sure it is of the right form:
1400
[[ 0.0327281  -0.18146102  0.08980507 ... -0.07412156  0.03698216
   0.13855496]
 [-0.15479232 -0.04556673  0.10688065 ...  0.00971753 -0.1102825
   0.0852626 ]
 [ 0.03766685  0.04746525  0.05767557 ...  0.0738261  -0.18938291
   0.30567428]
 [-0.04202437  0.01947553 -0.04474144 ... -0.0817741  -0.07621752
   0.14238654]
 [-0.02722309 -0.00216819 -0.04766743 ...  0.07337526 -0.05352686
   0.22470234]]
1400
[0 1 0 0 1]


## Preprocessing Email Spam Data

### Loading and Visualizing the Fraudulent Email Corpus

In [None]:
filepath = "./fradulent_emails.txt"
with open(filepath, "r", encoding="latin1") as file:
  data = file.read()

In [None]:
fraud_emails = data.split("From r")
del data

print("Successfully loaded {} spam emails!".format(len(fraud_emails)))

In [None]:
fraud_bodies = extract_messages(pd.DataFrame(fraud_emails, columns=["message"]))
del fraud_emails

fraud_bodies_df = pd.DataFrame(fraud_bodies[1:])
del fraud_bodies

fraud_bodies_df.head()

### Loading and Visualizing the Enron Corpus

In [None]:
filepath = "./emails.csv"

# Read the enron data into a pandas.DataFrame called emails
emails = pd.read_csv(filepath)
print("Successfully loaded {} rows and {} columns!".format(emails.shape[0], emails.shape[1]))
print(emails.head())

In [None]:
# take a closer look at the first email
print(emails.loc[0]["message"])

In [None]:
bodies = extract_messages(emails)

# no longer needed, get rid of them
del emails

In [None]:
# extract random 10000 enron email bodies for building dataset
bodies_df = pd.DataFrame(random.sample(bodies, 10000))
# these are huge, no longer needed, get rid of them
del bodies 

# expand default pandas display options to make emails more clearly visible when printed
pd.set_option("display.max_colwidth", 300)
# you could do print(bodies_df.head()), but Jupyter displays this nicer for pandas DataFrames
bodies_df.head()

### Preparing and assembling Dataset

We are now going to put all these functions together to build the single dataset representing both classes. Most methods expect this dataset to be a Numpy array in order to process it, so we convert it to that form after combining the emails.

Now, putting all the preprocessing steps together we assemble our dataset...

In [None]:
# Convert everything to lower-case, truncate to maxtokens and truncate each token to maxtokenlen

# Apply predefined processing functions
enron_emails = bodies_df.iloc[:, 0].apply(tokenize)
enron_emails = enron_emails.apply(stop_word_removal)
enron_emails = enron_emails.apply(reg_expressions)
# sample the right number of emails from each class.
enron_emails = enron_emails.sample(n_sample)

del bodies_df

# Apply predefined processing functions
spam_emails = fraud_bodies_df.iloc[:, 0].apply(tokenize)
spam_emails = spam_emails.apply(stop_word_removal)
spam_emails = spam_emails.apply(reg_expressions)
# sample the right number of emails from each class.
spam_emails = spam_emails.sample(n_sample)

del fraud_bodies_df

# convert to Numpy array, spam first so the order matches the labels created below
raw_data = pd.concat([spam_emails, enron_emails], axis=0).values

Now, let’s take a peek at the result to make sure things are proceeding as expected:

In [None]:
print("Shape of combined data is:", raw_data.shape)
print("Data is:")
print(raw_data)

We see that the resulting array has divided the text into word units, as we intended to.

Let’s create the headers (labels) corresponding to these emails, consisting of `n_sample=1000` spam emails followed by `n_sample=1000` non-spam emails:

In [None]:
categories = ["spam", "notspam"]
header = ([1] * n_sample)
header.extend(([0] * n_sample)) 

We are now ready to convert these into numerical vectors.

In [None]:
embedding_vectors = assemble_embedding_vectors(raw_data)
print(embedding_vectors)

In [None]:
# shuffle raw data first
def unison_shuffle_data(data, header):
  p = np.random.permutation(len(header))
  data = data[p,:]
  header = np.asarray(header)[p]
  return data, header

In [None]:
raw_data, header = unison_shuffle_data(embedding_vectors, header)

# split into independent 70% training and 30% testing sets
idx = int(0.7 * raw_data.shape[0])  # get 70% index value

# 70% of data for training
train_x2 = raw_data[:idx, :]
train_y2 = header[:idx]

# remaining 30% for testing
test_x2 = raw_data[idx:, :]
test_y2 = header[idx:]

print("train_x2/train_y2 list details, to make sure they are of the right form:")
print(len(train_x2))
print(train_x2)
print(len(train_y2))
print(train_y2[:5])

## Multi-Task Learning

Traditionally, **machine learning algorithms have been trained to perform a single task at a time, with the data collected and trained on independently for each separate task.** This is somewhat antithetical to the way humans and other animals learn, where training for multiple tasks occurs simultaneously and information from training on one task may inform and accelerate the learning of other tasks. **This additional information may improve performance not just on the tasks currently being trained on, but also on future tasks, sometimes even when no labeled data is available for those future tasks. This scenario of transfer learning with no labeled data in the target domain is often referred to as zero-shot transfer learning.**

In machine learning, multi-task learning has historically appeared in a number of settings, from multi-objective optimization to L2 and other forms of regularization (which can themselves be framed as a form of multi-objective optimization).

<img src='https://github.com/rahiakela/img-repo/blob/master/transfer-learning-for-natural-language-processing/neural-multi-task-learning.png?raw=1' width='800'/>

In the other prominent type of neural multi-task learning, soft parameter sharing, all tasks have their own layers/parameters, which are not shared. Instead, they are encouraged to be similar via various constraints imposed on the task-specific layers across the various tasks.
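One simple constraint of this kind is a penalty on the distance between the weights of corresponding task-specific layers, added to the training loss. The sketch below assumes a squared L2 distance; the text above does not specify the form of the constraint, so this choice, the `strength` parameter, and the weight values are all illustrative assumptions.

```python
import numpy as np

def soft_sharing_penalty(weights_a, weights_b, strength=0.1):
    """Squared L2 distance between two tasks' weights, scaled by strength."""
    return strength * float(np.sum((weights_a - weights_b) ** 2))

# Hypothetical weights of the same layer in two task-specific networks
task1_weights = np.array([[1.0, 2.0], [3.0, 4.0]])
task2_weights = np.array([[1.0, 2.5], [2.0, 4.0]])

# Penalty = 0.1 * (0.5**2 + 1.0**2); it is zero only when the weights are identical
print(soft_sharing_penalty(task1_weights, task2_weights))
```

A larger `strength` pulls the task-specific layers closer together, approaching hard parameter sharing in the limit, while a value of zero trains the tasks independently.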



## Problem Setup and Shallow Neural Single-Task Baseline

Let's now consider the setup again with two tasks only: the first task is IMDB movie review classification and the second is email spam classification.

<img src='https://github.com/rahiakela/img-repo/blob/master/transfer-learning-for-natural-language-processing/neural-multi-task-hard-parameter-sharing.png?raw=1' width='800'/>
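The structure of hard parameter sharing can be sketched as a forward pass in plain NumPy: one layer whose weights are used by both tasks, followed by a task-specific dense layer and a task-specific sigmoid classifier. The weights here are random and untrained, and the layer sizes, task names, and helper functions are assumptions made for illustration; this is not the Keras model trained in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, shared_dim, task_dim = 600, 8, 4  # hypothetical layer sizes

def init(n_in, n_out):
    """Return randomly initialized weights and zero biases for a dense layer."""
    return rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out)

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

shared = init(input_dim, shared_dim)   # one set of weights used by both tasks
heads = {
    # per task: (task-specific dense layer, classification layer)
    task: (init(shared_dim, task_dim), init(task_dim, 1))
    for task in ("imdb", "spam")
}

def predict(x, task):
    """Run the shared layer, then the layers belonging to the given task."""
    h = relu(x @ shared[0] + shared[1])
    (w_dense, b_dense), (w_out, b_out) = heads[task]
    h = relu(h @ w_dense + b_dense)
    return sigmoid(h @ w_out + b_out)

batch = rng.normal(size=(3, input_dim))   # stand-in for three sent2vec vectors
print(predict(batch, "imdb").shape, predict(batch, "spam").shape)
```

During training, gradients from both tasks would update `shared`, while each entry in `heads` would only be updated by its own task's loss.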

Before proceeding, we must decide how the inputs to the resulting neural network will be converted into numbers for analysis. One popular choice is to encode the input at the character level using one-hot encoding, where each character is replaced by a sparse vector of dimension equal to the total number of possible characters. This vector contains 1 in the column corresponding to the character and 0 elsewhere. The following figure illustrates this method to help the reader visualize the process of one-hot encoding.

<img src='https://github.com/rahiakela/img-repo/blob/master/transfer-learning-for-natural-language-processing/one-hot-encoding-characters.png?raw=1' width='800'/>
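Character-level one-hot encoding can be sketched in a few lines. The character set below (lowercase letters plus space) is an assumption chosen to keep the example small, and `one_hot_encode` is an illustrative helper.

```python
import numpy as np

alphabet = "abcdefghijklmnopqrstuvwxyz "   # hypothetical character set (27 symbols)
char_to_index = {ch: i for i, ch in enumerate(alphabet)}

def one_hot_encode(text):
    """Return a (len(text), len(alphabet)) matrix with a single 1 per row."""
    encoded = np.zeros((len(text), len(alphabet)), dtype=np.int8)
    for row, ch in enumerate(text):
        encoded[row, char_to_index[ch]] = 1
    return encoded

matrix = one_hot_encode("cab")
print(matrix.shape)     # (3, 27)
print(matrix[:, :3])    # columns for 'a', 'b', 'c'
```

Each row has exactly one non-zero entry, so the representation grows with both the text length and the alphabet size. A character outside `alphabet` raises a `KeyError` in this sketch; a practical encoder would need an explicit policy for such characters.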

Before proceeding to the exact two-task setup, we perform another
baseline. We use the IMDB movie classification task as the only one present, to see how the task-specific shallow neural classifier compares with the model from the previous section.


In [None]:
input_shape = (len(train_x[0]), )

# Input must match the dimension of the sent2vec vectors
sent2vec_vectors = Input(shape=input_shape)
# Dense neural layer trained on top of the sent2vec vectors
dense = Dense(512, activation="relu")(sent2vec_vectors)
# Apply dropout to reduce overfitting
dense = Dropout(0.3)(dense)

# Output indicates a single binary classifier - is review “positive” or “negative”?
output = Dense(1, activation="sigmoid")(dense)

model = Model(inputs=sent2vec_vectors, outputs=output)

In [None]:
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
history = model.fit(train_x, train_y, validation_data=(test_x, test_y), batch_size=32, epochs=10, shuffle=True)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


We found that the validation accuracy of this classifier was about 82% at the hyperparameter values specified.

This is higher than the baseline of bag-of-words combined with logistic regression, and approximately equal to sent2vec combined with logistic regression.

## Logistic Regression

Logistic regression models the relationship between a categorical output variable and a set of input variables by estimating probabilities with the logistic function. Here we assume a single input variable $x$ and a single binary output variable $y$ with associated probability $P(y=1)=p$.
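For a single input, the standard form of the model is $p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$, where $\beta_0$ and $\beta_1$ are the intercept and slope learned from data (the $\beta$ notation is introduced here for illustration). The sketch below evaluates this function with hypothetical coefficients and applies the usual 0.5 decision threshold.

```python
import math

def logistic_probability(x, beta0, beta1):
    """P(y=1 | x) under a one-variable logistic regression model."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))

# Hypothetical coefficients, chosen only for illustration.
beta0, beta1 = -1.0, 2.0
for x in (0.0, 0.5, 2.0):
    p = logistic_probability(x, beta0, beta1)
    # Predict class 1 when the estimated probability is at least 0.5
    print(x, round(p, 3), int(p >= 0.5))
```

The decision boundary sits where $\beta_0 + \beta_1 x = 0$, which is $x = 0.5$ for these coefficients.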

Now, let’s go ahead and build our classifier using the popular library scikit-learn.

In [None]:
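from sklearn.linear_model import LogisticRegression

def fit(train_x, train_y):
  # Logistic regression classifier with scikit-learn's default settings
  model = LogisticRegression()
  # Let fitting errors surface instead of silently returning an unfitted model
  model.fit(train_x, train_y)
  return model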

In [None]:
model = fit(train_x, train_y)

predicted_labels = model.predict(test_x)
print("DEBUG::The logistic regression predicted labels are::")
print(predicted_labels)

DEBUG::The logistic regression predicted labels are::
[0 0 1 0 1 1 1 1 0 0 0 1 1 1 1 1 1 0 1 0 1 1 0 0 0 1 1 1 0 1 0 0 0 0 0 1 1
 1 0 1 1 1 0 1 0 0 1 0 0 0 1 0 1 0 1 1 0 1 0 0 0 0 0 1 0 0 0 1 1 0 1 1 1 0
 0 1 0 1 0 1 1 1 1 0 0 0 1 1 1 0 1 0 0 1 1 0 1 0 1 1 0 1 1 0 1 1 1 1 1 1 1
 1 1 1 0 0 1 0 0 0 1 0 0 1 1 0 1 1 1 1 0 1 0 1 1 0 0 1 0 0 0 1 1 1 0 0 1 0
 0 1 1 0 0 0 0 1 1 0 0 0 0 0 0 1 0 1 1 0 0 0 1 1 0 0 0 1 0 1 0 0 1 1 0 0 1
 0 0 1 1 1 0 1 0 1 1 0 1 0 0 0 1 1 0 0 1 0 0 1 1 0 0 1 0 1 0 1 1 0 1 1 0 1
 1 0 0 1 0 1 1 0 0 0 0 1 0 1 1 1 0 0 1 0 0 0 1 1 1 1 0 1 0 1 1 1 0 1 1 1 1
 1 0 1 1 1 1 0 1 1 0 1 0 1 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 1 0 1 0 1 0
 1 1 1 1 0 0 1 0 0 0 1 0 0 0 0 0 1 1 0 1 1 0 0 1 1 0 0 1 1 1 0 1 0 0 1 1 0
 1 1 1 0 1 0 0 0 0 1 1 0 0 1 1 1 0 0 0 1 1 1 0 1 0 0 0 0 1 0 0 0 0 1 1 1 1
 1 1 1 1 1 1 1 1 0 1 1 1 0 0 0 1 1 1 0 0 0 1 0 1 1 1 1 0 0 0 1 1 0 0 1 0 0
 1 1 0 1 0 0 1 0 0 0 1 0 1 1 1 1 0 0 0 0 1 1 1 0 0 0 0 0 1 1 0 0 1 1 0 0 0
 1 1 1 0 0 0 1 1 1 0 0 1 0 1 0 1 0 1 1 0 1 1 1

In [None]:
from sklearn.metrics import accuracy_score

acc_score = accuracy_score(test_y, predicted_labels)
print("The logistic regression accuracy score is::")
print(acc_score)

The logistic regression accuracy score is::
0.815


## Random Forests

Random Forests (RFs) provide a practical machine learning method for applying decision trees. The method involves generating a very large number of specialized trees and ensembling their outputs. RFs are extremely flexible and widely applicable, which often makes them the second algorithm practitioners try after logistic regression for baselining.
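The ensembling step can be illustrated with a toy majority vote. In the sketch below, three hand-written rules stand in for trained decision trees; the rules, thresholds, and the `forest_predict` helper are all hypothetical, and a real random forest learns its trees from bootstrapped samples and random feature subsets.

```python
from collections import Counter

# Three hypothetical "trees", each a simple rule on a 2-feature sample.
def tree_a(sample):
    return 1 if sample[0] > 0.5 else 0

def tree_b(sample):
    return 1 if sample[1] > 0.2 else 0

def tree_c(sample):
    return 1 if sample[0] + sample[1] > 1.0 else 0

def forest_predict(sample, trees):
    """Return the label that most trees voted for."""
    votes = Counter(tree(sample) for tree in trees)
    return votes.most_common(1)[0][0]

trees = [tree_a, tree_b, tree_c]
print(forest_predict((0.9, 0.0), trees))  # votes 1, 0, 0 -> 0
print(forest_predict((0.9, 0.3), trees))  # votes 1, 1, 1 -> 1
```

Individual trees can disagree, as in the first example, but the vote smooths out their individual errors.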

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Create a random forest classifier. By convention, clf means 'classifier'
clf = RandomForestClassifier(n_jobs=1, random_state=0)

# Train the classifier to learn how the training features relate to the training labels (positive or negative review)
start_time = time.time()
clf.fit(train_x, train_y)
end_time = time.time()
print("Training the Random Forest Classifier took %3d seconds"%(end_time-start_time))

Training the Random Forest Classifier took   2 seconds


In [None]:
predicted_labels = clf.predict(test_x)
print("DEBUG::The RF predicted labels are::")
print(predicted_labels)

DEBUG::The RF predicted labels are::
[0 0 1 0 1 1 1 1 0 0 0 1 1 1 1 1 1 0 1 0 1 1 1 0 0 1 1 1 0 0 0 0 0 0 0 1 1
 1 0 1 1 1 0 1 0 0 1 0 1 0 0 0 1 0 1 1 0 1 0 0 0 0 0 0 0 0 0 1 0 1 1 1 1 0
 1 0 0 1 0 1 1 1 1 1 0 1 1 0 1 0 1 0 0 1 0 0 1 1 0 0 0 1 0 0 1 1 0 0 1 0 1
 1 1 1 0 0 1 0 1 0 1 0 0 1 1 0 1 0 1 1 0 1 0 1 1 0 0 1 0 1 0 1 1 0 0 0 1 1
 1 1 1 0 1 1 0 0 1 0 0 0 0 1 0 1 0 1 1 0 0 1 1 1 0 0 0 1 0 1 1 0 1 1 0 0 1
 0 0 1 0 1 1 1 0 1 1 0 1 0 0 0 1 0 1 1 1 0 0 1 1 0 0 1 0 1 0 0 1 0 0 0 0 1
 1 1 0 0 0 1 1 1 0 0 0 1 1 1 1 0 0 0 0 1 0 0 1 0 1 1 0 1 1 1 0 1 0 1 1 1 1
 0 0 1 1 1 1 0 1 1 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 1 0 1 1 0 1 0 1 1
 1 1 0 1 0 0 1 0 0 0 1 0 0 0 0 1 0 1 0 1 1 0 0 1 0 1 0 1 1 0 0 1 0 1 1 1 0
 1 1 1 1 1 1 0 0 0 1 1 0 0 1 0 1 1 0 0 1 1 1 0 1 0 1 0 1 1 0 0 1 0 1 1 1 1
 1 0 0 1 1 1 0 1 1 1 1 1 0 0 0 1 1 1 0 0 1 1 0 1 1 1 0 0 1 0 1 1 0 0 1 0 0
 1 1 1 1 0 0 0 0 0 0 0 0 1 0 1 1 0 0 1 0 1 0 1 0 0 1 0 1 1 1 1 0 1 1 0 0 0
 1 1 1 1 0 0 1 1 1 0 0 1 0 1 0 0 1 1 1 0 1 1 1 1 1 1 1 0 0 1 1 

In [None]:
acc_score = accuracy_score(test_y, predicted_labels)
print("DEBUG::The RF testing accuracy score is::")
print(acc_score)

DEBUG::The RF testing accuracy score is::
0.745


## Conclusions

This yields accuracy scores of roughly 82% and 75% (0.815 and 0.745) for the logistic regression and random forest classifiers respectively (at the same hyperparameter values as in the previous section). The score for the logistic regression classifier combined with sent2vec is an improvement on the corresponding bag-of-words baseline values, reported as 63% and 67% respectively.