<a href="https://colab.research.google.com/github/rahiakela/practical-natural-language-processing/blob/chapter-4-text-classification/4_document_embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using Neural Embeddings in Text Classification

As we already know, neural networks can be used for feature engineering, producing word embeddings, character embeddings, and document embeddings. The advantage of embedding-based features is that they give a dense, low-dimensional feature representation instead of the sparse, high-dimensional structure of BoW/TF-IDF and similar features. There are different ways of designing and using features based on neural embeddings.
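To make the sparse-versus-dense contrast concrete, here is a minimal sketch in plain Python that builds bag-of-words vectors for a tiny corpus and measures how many entries are zero. The three example texts and the helper name `bow_vector` are illustrative assumptions, not part of the dataset used below.

```python
# Toy corpus (made up for illustration only).
corpus = [
    "i love this sunny day",
    "i am worried about the exam",
    "the day was fine",
]

# One dimension per distinct word in the corpus.
vocabulary = sorted({word for text in corpus for word in text.split()})

def bow_vector(text):
    """Return raw term counts for `text`, one entry per vocabulary word."""
    tokens = text.split()
    return [tokens.count(word) for word in vocabulary]

bow_matrix = [bow_vector(text) for text in corpus]

# Fraction of zero entries across all document vectors.
total_entries = len(bow_matrix) * len(vocabulary)
zero_entries = sum(row.count(0) for row in bow_matrix)
sparsity = zero_entries / total_entries

print(f"vocabulary size (BoW dimensions): {len(vocabulary)}")
print(f"fraction of zero entries: {sparsity:.2f}")
```

Even with three short texts, more than half of the entries are zero. On a real corpus the vocabulary grows to thousands of words while each tweet still uses only a handful, so the zero fraction approaches 1. A learned embedding of fixed size has no such structural zeros.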

## Setup

In [10]:
import pandas as pd
import nltk
nltk.download('stopwords')

from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Document Embeddings

In the Doc2vec embedding scheme, we learn a direct representation for the entire document (sentence/paragraph) rather than each word. Just as we used word and character embeddings as features for performing text classification, we can also use Doc2vec as a feature representation mechanism. 

Since there are no existing pretrained models that work with the latest version of Doc2vec, let’s see how to build our own Doc2vec model and use it for text classification.

We’ll use a dataset called “Sentiment Analysis: Emotion in Text” from [figureeight.com](https://appen.com/), which contains 40,000 tweets labeled with 13 labels signifying different emotions. Let’s take the three most frequent labels in this dataset—neutral, worry, happiness—and build a text classifier for classifying new tweets into one of these three classes.

In [2]:
# downloading data
!wget -P DATAPATH https://raw.githubusercontent.com/practical-nlp/practical-nlp/master/Ch4/Data/Sentiment%20and%20Emotion%20in%20Text/train_data.csv
!wget -P DATAPATH https://raw.githubusercontent.com/practical-nlp/practical-nlp/master/Ch4/Data/Sentiment%20and%20Emotion%20in%20Text/test_data.csv
!ls -lah DATAPATH

--2020-10-09 10:01:09--  https://raw.githubusercontent.com/practical-nlp/practical-nlp/master/Ch4/Data/Sentiment%20and%20Emotion%20in%20Text/train_data.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2479133 (2.4M) [text/plain]
Saving to: ‘DATAPATH/train_data.csv’


2020-10-09 10:01:10 (6.00 MB/s) - ‘DATAPATH/train_data.csv’ saved [2479133/2479133]

--2020-10-09 10:01:10--  https://raw.githubusercontent.com/practical-nlp/practical-nlp/master/Ch4/Data/Sentiment%20and%20Emotion%20in%20Text/test_data.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 20

In [3]:
#Load the dataset and explore.
filepath = "DATAPATH/train_data.csv"
df = pd.read_csv(filepath)
print(df.shape)
df.head()

(30000, 2)


| | sentiment | content |
|---|---|---|
| 0 | empty | @tiffanylue i know i was listenin to bad habi... |
| 1 | sadness | Layin n bed with a headache ughhhh...waitin o... |
| 2 | sadness | Funeral ceremony...gloomy friday... |
| 3 | enthusiasm | wants to hang out with friends SOON! |
| 4 | neutral | @dannycastillo We want to trade with someone w... |


In [4]:
df['sentiment'].value_counts()

worry         7433
neutral       6340
sadness       4828
happiness     2986
love          2068
surprise      1613
hate          1187
fun           1088
relief        1021
empty          659
enthusiasm     522
boredom        157
anger           98
Name: sentiment, dtype: int64

In [5]:
# Let us take the top 3 categories and leave out the rest.
shortlist = ["neutral", "happiness", "worry"]
df_subset = df[df["sentiment"].isin(shortlist)]
df_subset.shape

(16759, 2)

## Text pre-processing

Traditional tokenizers may not work well with tweets: they tend to split smileys, hashtags, Twitter handles, etc., into multiple tokens. Such specialized needs prompted a lot of research into NLP for Twitter in the recent past, which resulted in several pre-processing options for tweets. One such solution is TweetTokenizer, implemented in the NLTK library in Python.

Tweets are different. Some things to consider:

- Removing @mentions, and perhaps URLs
- Using the NLTK TweetTokenizer instead of a regular tokenizer
- Removing stopwords and numbers, as usual
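The pre-processing code in this notebook strips handles through the tokenizer but does not remove URLs. As one possible way to handle both before tokenization, here is a minimal sketch using only the standard `re` module. The two patterns and the helper name `strip_urls_and_mentions` are assumptions for illustration. They are deliberately simple and will not cover every form a URL or handle can take (for example, the mention pattern would also match part of an email address).

```python
import re

# Simple patterns (assumptions, not an exhaustive grammar for tweets).
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
MENTION_PATTERN = re.compile(r"@\w+")

def strip_urls_and_mentions(text):
    """Remove URLs and @mentions, then collapse leftover whitespace."""
    text = URL_PATTERN.sub(" ", text)
    text = MENTION_PATTERN.sub(" ", text)
    return " ".join(text.split())

# Made-up example tweet.
sample = "@friend check this out http://example.com/page so happy :)"
cleaned = strip_urls_and_mentions(sample)
print(cleaned)
```

Note that the emoticon at the end of the sample survives the cleanup, which matters here because emoticons may carry emotional signal.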

In [6]:
# strip_handles removes personal information such as twitter handles, which don't
# contribute to emotion in the tweet. preserve_case=False converts everything to lowercase.
tweeter = TweetTokenizer(strip_handles=True, preserve_case=False)
mystopwords = set(stopwords.words("english"))

# Function to tokenize tweets, remove stopwords and numbers.
# Keeping punctuation and emoticon symbols could be relevant for this task!
def preprocess_corpus(texts):

  def remove_stops_digits(tokens):
    # Nested function that removes stopwords and digits from a list of tokens
    return [token for token in tokens if token not in mystopwords and not token.isdigit()]

  # This return statement below uses the above function to process twitter tokenizer output further.
  return [remove_stops_digits(tweeter.tokenize(content)) for content in texts]

In [7]:
# df_subset contains only the three categories we chose.
mydata = preprocess_corpus(df_subset["content"])
mycats = df_subset["sentiment"]
print(len(mydata), len(mycats))

16759 16759


## Training Doc2vec classifier

The next step in this process is to train a Doc2vec model to learn tweet representations. Ideally, any large dataset of tweets will work for this step. However, since we don’t have such a ready-made corpus, we’ll split our dataset into train-test and use the training data for learning the Doc2vec representations. 

The first part of this process involves converting the data into a format readable by the Doc2vec implementation, which can be done using the TaggedDocument class. It’s used to represent a document as a list of tokens, followed by a “tag,” which in its simplest form can be just the filename or ID of the document. 

Doc2vec by itself can also be used as a nearest neighbor classifier for both multiclass and multilabel classification problems. We’ll leave this as an exploratory exercise for the reader.
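As a starting point for the nearest neighbor idea, here is a minimal sketch in plain Python: a new document vector receives the majority label among its k most similar labeled vectors. The 3-dimensional vectors, the labels, and the helper names `cosine_similarity` and `knn_predict` are made up for illustration. In practice the vectors would come from a trained Doc2vec model, and this is only one possible way to set up such a classifier.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def knn_predict(query, vectors, labels, k=3):
    """Label `query` by majority vote among its k most similar vectors."""
    ranked = sorted(
        range(len(vectors)),
        key=lambda i: cosine_similarity(query, vectors[i]),
        reverse=True,
    )
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Made-up "document vectors"; real ones would come from Doc2vec.
toy_vectors = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.9, 0.2],
    [0.0, 0.8, 0.3],
    [0.1, 0.2, 0.9],
]
toy_labels = ["happiness", "happiness", "worry", "worry", "neutral"]

prediction = knn_predict([0.85, 0.15, 0.05], toy_vectors, toy_labels, k=3)
print(prediction)
```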

Let’s now see how to train a Doc2vec classifier for tweets.

In [9]:
# Split data into train and test, following the usual process
train_data, test_data, train_cats, test_cats = train_test_split(mydata, mycats, random_state=1234)

# prepare training data in doc2vec format
train_doc2vec = [TaggedDocument(words=d, tags=[str(i)]) for i, d in enumerate(train_data)]

# Train a doc2vec model to learn tweet representations. Use only training data!!
model = Doc2Vec(vector_size=50, alpha=0.025, min_count=5, dm=1, epochs=100)
model.build_vocab(train_doc2vec)
model.train(train_doc2vec, total_examples=model.corpus_count, epochs=model.epochs)
model.save("d2v.model")
print("Model Saved")

Model Saved




Doc2vec’s infer_vector function can be used to infer the vector representation for a given text using a pre-trained model. Inference is itself a short, stochastic training run for the new document, so the inferred vectors can differ each time we extract them. For this reason, to get a more stable representation, we run inference for multiple iterations (the steps argument, named epochs in gensim 4.0 and later).

Let’s use the learned model to infer features for our data and train a logistic regression classifier:

In [12]:
# Infer the feature representation for training and test data using the trained model
model = Doc2Vec.load("d2v.model")

# infer in multiple steps to get a stable representation.
# Note: gensim 4.0 and later name this argument `epochs` instead of `steps`.
train_vectors = [model.infer_vector(list_of_tokens, steps=50) for list_of_tokens in train_data]
test_vectors = [model.infer_vector(list_of_tokens, steps=50) for list_of_tokens in test_data]

# Use any regular classifier like logistic regression
myclass = LogisticRegression(class_weight="balanced")  # because classes are not balanced. 
myclass.fit(train_vectors, train_cats)

preds = myclass.predict(test_vectors)
print(classification_report(test_cats, preds))



              precision    recall  f1-score   support

   happiness       0.33      0.53      0.40       713
     neutral       0.46      0.53      0.49      1595
       worry       0.60      0.39      0.47      1882

    accuracy                           0.47      4190
   macro avg       0.46      0.48      0.46      4190
weighted avg       0.50      0.47      0.47      4190
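The two averaged F1 rows can be approximately reproduced from the per-class rows. The sketch below copies the rounded per-class F1 scores and supports from the report and computes both averages by hand. The dictionary name `report_rows` is an assumption for illustration. Because the inputs are already rounded to two decimals, the results can differ slightly from what scikit-learn prints, which averages the unrounded scores.

```python
# Per-class F1 and support, copied from the report above (rounded to 2 decimals).
report_rows = {
    "happiness": {"f1": 0.40, "support": 713},
    "neutral": {"f1": 0.49, "support": 1595},
    "worry": {"f1": 0.47, "support": 1882},
}

total_support = sum(row["support"] for row in report_rows.values())

# Macro average: every class counts equally.
macro_f1 = sum(row["f1"] for row in report_rows.values()) / len(report_rows)

# Weighted average: each class counts in proportion to its support.
weighted_f1 = (
    sum(row["f1"] * row["support"] for row in report_rows.values()) / total_support
)

print(f"total support: {total_support}")
print(f"macro F1 (from rounded inputs):    {macro_f1:.3f}")
print(f"weighted F1 (from rounded inputs): {weighted_f1:.3f}")
```

The weighted average is pulled toward the scores of the larger classes (worry and neutral), while the macro average gives the small happiness class the same influence as the others.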



Now, the performance of this model seems rather poor: it achieves a weighted-average F1 score of 0.47 on a reasonably large corpus with only three classes. There are a couple of interpretations for this poor result.

First, unlike full news articles or even well-formed sentences, tweets contain very little data per instance. 

Further, people write with wide variety in spelling and syntax when they tweet, and there are a lot of emoticons in different forms. Our feature representation should be able to capture such aspects. While tuning the algorithms by searching a large parameter space for the best model may help, an alternative could be to explore problem-specific feature representations.

An important point to keep in mind when using Doc2vec is the same as for fastText: if we have to use Doc2vec for feature representation, we have to store the model that learned the representation. While it’s not typically as bulky as fastText, it’s also not as fast to train. Such trade-offs need to be considered and compared before we make a deployment decision.