<a href="https://colab.research.google.com/github/rahiakela/practical-natural-language-processing/blob/chapter-4-text-classification/2_word_embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using Neural Embeddings in Text Classification

Neural networks give us another family of feature engineering techniques: word embeddings, character embeddings, and document embeddings. The advantage of embedding-based features is that they produce a dense, low-dimensional representation instead of the sparse, high-dimensional structure of BoW/TF-IDF and similar features. There are different ways of designing and using features based on neural embeddings.
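To make the sparse-versus-dense contrast concrete, here is a minimal sketch on a toy corpus. The sentences and the 4-dimensional embedding values are invented for illustration; real models such as Word2vec use a few hundred dimensions.

```python
# Toy corpus, invented for illustration.
corpus = [
    "good case excellent value",
    "great phone",
    "bad battery life",
    "terrible service",
]

# Bag-of-words: one dimension per distinct word in the corpus.
vocab = sorted({word for sentence in corpus for word in sentence.split()})

def bow_vector(sentence):
    """Count how often each vocabulary word occurs in the sentence."""
    words = sentence.split()
    return [words.count(term) for term in vocab]

bow_vectors = [bow_vector(sentence) for sentence in corpus]
zero_entries = sum(vector.count(0) for vector in bow_vectors)
total_entries = len(bow_vectors) * len(vocab)
print(f"BoW: {len(vocab)} dimensions, {zero_entries} of {total_entries} entries are zero")

# Dense embeddings: a fixed, small dimension that does not grow with the vocabulary.
# These 4-dimensional values are made up; they are not from a trained model.
toy_embeddings = {
    "good": [0.9, 0.1, 0.3, -0.2],
    "bad": [-0.8, 0.2, 0.4, -0.1],
}
print(f"Embedding: {len(toy_embeddings['good'])} dimensions, all entries non-zero")
```

The BoW dimension grows with every new word in the corpus, while the embedding dimension stays fixed.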

## Setup

In [4]:
#basic imports
import os
from time import time

#pre-processing imports
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from string import punctuation

#imports related to modeling
import numpy as np
from gensim.models import Word2Vec, KeyedVectors
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

In [21]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## Word Embeddings

Words and n-grams have long been the primary features in text classification. Different ways of vectorizing words have been proposed, and we have already used one such representation, the word counts produced by CountVectorizer.

In the past few years, neural network–based architectures have become popular for “learning” word representations, which are known as “word embeddings.”

We will see an example of how to use a pre-trained Word2vec model for doing feature extraction and performing text classification.

Let’s now take a look at how to use word embeddings as features for text classification. We’ll use the sentiment-labelled sentences dataset from the UCI repository (http://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences), consisting of 1,500 positive-sentiment and 1,500 negative-sentiment sentences from Amazon, Yelp, and IMDB.

Let us first combine the three separate data files into one using the following Unix command:

`
cat amazon_cells_labelled.txt imdb_labelled.txt yelp_labelled.txt > sentiment_sentences.txt
`
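Where a Unix shell is not available, the same concatenation can be done in Python. This is a minimal sketch; `concatenate_files` is a hypothetical helper, not part of the original notebook, and it assumes the three data files sit in the working directory.

```python
import os
import shutil

def concatenate_files(input_paths, output_path):
    """Write the contents of each input file to output_path, in order."""
    with open(output_path, "wb") as out_file:
        for path in input_paths:
            with open(path, "rb") as in_file:
                shutil.copyfileobj(in_file, out_file)

input_files = ["amazon_cells_labelled.txt", "imdb_labelled.txt", "yelp_labelled.txt"]
# Only run when all three data files are present in the working directory.
if all(os.path.exists(path) for path in input_files):
    concatenate_files(input_files, "sentiment_sentences.txt")
```

Copying in binary mode keeps the bytes unchanged, which matches what `cat` does.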

For the pre-trained embedding model, we will use the Google News vectors: https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM

Let us get started!

In [15]:
!wget https://raw.githubusercontent.com/practical-nlp/practical-nlp/master/Ch4/Data/sentiment%20labelled%20sentences/amazon_cells_labelled.txt
!wget https://raw.githubusercontent.com/practical-nlp/practical-nlp/master/Ch4/Data/sentiment%20labelled%20sentences/imdb_labelled.txt
!wget https://raw.githubusercontent.com/practical-nlp/practical-nlp/master/Ch4/Data/sentiment%20labelled%20sentences/yelp_labelled.txt
!cat amazon_cells_labelled.txt imdb_labelled.txt yelp_labelled.txt > sentiment_sentences.txt

--2020-10-01 11:54:17--  https://raw.githubusercontent.com/practical-nlp/practical-nlp/master/Ch4/Data/sentiment%20labelled%20sentences/amazon_cells_labelled.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 58226 (57K) [text/plain]
Saving to: ‘amazon_cells_labelled.txt’


2020-10-01 11:54:17 (6.95 MB/s) - ‘amazon_cells_labelled.txt’ saved [58226/58226]

--2020-10-01 11:54:17--  https://raw.githubusercontent.com/practical-nlp/practical-nlp/master/Ch4/Data/sentiment%20labelled%20sentences/imdb_labelled.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 

In [13]:
!wget https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
!gunzip GoogleNews-vectors-negative300.bin.gz

--2020-10-01 11:46:50--  https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.46.134
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.46.134|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1647046227 (1.5G) [application/x-gzip]
Saving to: ‘GoogleNews-vectors-negative300.bin.gz’


2020-10-01 11:47:36 (34.4 MB/s) - ‘GoogleNews-vectors-negative300.bin.gz’ saved [1647046227/1647046227]



In [16]:
# Load the pre-trained word2vec model and the dataset
data_path = "."
path_to_model = os.path.join(data_path, "GoogleNews-vectors-negative300.bin")
training_data_path = os.path.join(data_path, "sentiment_sentences.txt")

# Load W2V model. This will take some time.
%time w2v_model = KeyedVectors.load_word2vec_format(path_to_model, binary=True)
print("done loading Word2Vec")

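# Read the texts and their categories (sentiment labels).
# Each line of the file holds a sentence and its label, separated by a tab.
texts = []
cats = []
with open(training_data_path) as fh:
  for line in fh:
    # Strip the trailing newline so the label is "0" or "1", not "1\n".
    text, sentiment = line.rstrip().split("\t")
    texts.append(text)
    cats.append(sentiment)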



CPU times: user 37.6 s, sys: 6.22 s, total: 43.9 s
Wall time: 1min 1s
done loading Word2Vec


In [17]:
# Inspect the model
# Note: `.vocab` is the gensim 3.x API; gensim 4.0+ replaced it with `.key_to_index`.
word2vec_vocab = w2v_model.vocab.keys()
word2vec_vocab_lower = [item.lower() for item in word2vec_vocab]
print(len(word2vec_vocab))

3000000


In [18]:
# Inspect the dataset
print(len(cats), len(texts))
print(texts[1])
print(cats[1])

3000 3000
Good case, Excellent value.
1



In [19]:
# Preprocess the text.
def preprocess_corpus(texts):
  mystopwords = set(stopwords.words("english"))
  def remove_stops_digits(tokens):
    # Nested function that lowercases tokens and drops stopwords, digits, and punctuation.
    # Lowercase first so that capitalized stopwords such as "The" are also removed.
    lowered = [token.lower() for token in tokens]
    return [token for token in lowered if token not in mystopwords and not token.isdigit() and token not in punctuation]

  # Tokenize each text with NLTK's word_tokenize, then filter the tokens with the function above.
  return [remove_stops_digits(word_tokenize(text)) for text in texts]

In [22]:
texts_processed = preprocess_corpus(texts)
print(len(cats), len(texts_processed))
print(texts_processed[1])
print(cats[1])

3000 3000
['good', 'case', 'excellent', 'value']
1



The Word2vec model we loaded is a large model that can be seen as a dictionary where the keys are words in the vocabulary and the values are their learned embedding representations. Given a query word, if the word’s embedding is present in the dictionary, the model returns that embedding.

How do we use this pre-learned embedding to represent features? There are multiple ways of doing this. A simple approach is just to average the embeddings of the individual words in a text.

In [27]:
# Creating a feature vector for each sentence by averaging the embeddings of its words
def embedding_feats(list_of_lists):
  DIMENSION = 300
  zero_vector = np.zeros(DIMENSION)
  feats = []
  for tokens in list_of_lists:
    feat_for_this = np.zeros(DIMENSION)
    count_for_this = 0
    for token in tokens:
      if token in w2v_model:
        feat_for_this += w2v_model[token]
        count_for_this += 1
    if count_for_this != 0:
      feats.append(feat_for_this / count_for_this)
    else:
      # No token had an embedding: use the zero vector to avoid dividing by zero.
      feats.append(zero_vector)
  return feats

In [28]:
train_vectors = embedding_feats(texts_processed)
print(len(train_vectors))

3000




Note that the function uses embeddings only for the words that are present in the dictionary and ignores the words for which embeddings are absent. Also, note that the code above produces a single vector with DIMENSION (= 300) components for each text.
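The behaviour on out-of-vocabulary words is easiest to see on a toy example. The sketch below uses plain Python lists and a made-up 3-dimensional dictionary in place of the Word2vec model; `average_embedding` is a hypothetical helper, and returning the zero vector when no word is known is an assumption of this sketch.

```python
# Hypothetical 3-dimensional embeddings, invented for illustration.
toy_model = {
    "good": [1.0, 0.0, 2.0],
    "value": [3.0, 2.0, 0.0],
}
DIMENSION = 3

def average_embedding(tokens, model, dimension):
    """Average the vectors of the tokens found in model; skip the rest."""
    known = [model[token] for token in tokens if token in model]
    if not known:
        # No token has an embedding: fall back to the zero vector (an assumption).
        return [0.0] * dimension
    return [sum(component) / len(known) for component in zip(*known)]

print(average_embedding(["good", "value"], toy_model, DIMENSION))  # [2.0, 1.0, 1.0]
print(average_embedding(["good", "zzz"], toy_model, DIMENSION))    # [1.0, 0.0, 2.0]
print(average_embedding(["zzz"], toy_model, DIMENSION))            # [0.0, 0.0, 0.0]
```

The unknown token "zzz" does not affect the average, and a text with no known words still yields a vector of the expected length.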

When trained with a logistic regression classifier, these features gave a classification accuracy of 81% on our dataset. Considering that
we just used an existing word embeddings model and followed only basic preprocessing steps, this is a great model to have as a baseline!

In [31]:
# Take any classifier (LogisticRegression here) and train/test it like before.
classifier = LogisticRegression(random_state=1234)
train_data, test_data, train_cats, test_cats = train_test_split(train_vectors, cats)
print(len(train_data), len(train_cats))
classifier.fit(train_data, train_cats)

2250 2250


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=1234, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [34]:
print(len(test_data), len(test_cats))
print("Accuracy: ", classifier.score(test_data, test_cats))
preds = classifier.predict(test_data)
print(classification_report(test_cats, preds))

750 750



Not bad. With little effort, we got 81% accuracy. That's a great starting model to have!

In order to decide whether to train our own embeddings or use pre-trained embeddings, a good rule of thumb is to compute the vocabulary overlap. If the overlap between the vocabulary of our custom domain and that of pre-trained word embeddings is greater than 80%, pre-trained word embeddings tend to give good results in text classification.
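One way to compute this overlap is sketched below. The text does not define the overlap precisely, so the definition here (the fraction of distinct corpus words that have a pre-trained embedding) is an assumption, and `vocabulary_overlap` is a hypothetical helper.

```python
def vocabulary_overlap(corpus_tokens, embedding_vocab):
    """Fraction of distinct corpus words that have a pre-trained embedding."""
    corpus_vocab = {token for tokens in corpus_tokens for token in tokens}
    if not corpus_vocab:
        return 0.0
    return len(corpus_vocab & set(embedding_vocab)) / len(corpus_vocab)

# Toy data, invented for illustration.
toy_corpus = [["good", "case", "excellent", "value"], ["bad", "earpiece"]]
toy_embedding_vocab = {"good", "case", "excellent", "value", "bad"}

overlap = vocabulary_overlap(toy_corpus, toy_embedding_vocab)
print(f"{overlap:.0%}")  # 5 of 6 distinct words are covered: 83%
```

With the objects from this notebook, the call might look like `vocabulary_overlap(texts_processed, word2vec_vocab)`.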