# Document Vectorization (Doc2Vec)

In this notebook, we learn document embeddings with Doc2Vec and use them as features for text classification. We use the "Sentiment and Emotion in Text" dataset from [Kaggle](https://www.kaggle.com/c/sa-emotions/data).

"In a variation on the popular task of sentiment analysis, this dataset contains labels for the emotional content (such as happiness, sadness, and anger) of texts. Hundreds to thousands of examples across 13 labels. A subset of this data is used in an experiment we uploaded to Microsoft’s Cortana Intelligence Gallery."


## Setup

### Imports

In [1]:
import pandas as pd
import os, subprocess
import nltk
nltk.download('stopwords')
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/vishalgupta/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Loading Data

In [2]:
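#Download the data if it is not already present locally
DATA_PATH = "Data"
os.makedirs(DATA_PATH, exist_ok=True) #curl cannot write into a directory that does not exist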
TRAIN_URL = "https://raw.githubusercontent.com/practical-nlp/practical-nlp/master/Ch4/Data/Sentiment%20and%20Emotion%20in%20Text/train_data.csv"
TRAIN_FILE_PATH = os.path.join(DATA_PATH, TRAIN_URL.split('/')[-1])
if not os.path.exists(TRAIN_FILE_PATH):
    process = subprocess.run("curl %s --output %s"%(TRAIN_URL, TRAIN_FILE_PATH), shell=True, check=True, stdout=subprocess.PIPE, universal_newlines=True)

TEST_URL = "https://raw.githubusercontent.com/practical-nlp/practical-nlp/master/Ch4/Data/Sentiment%20and%20Emotion%20in%20Text/test_data.csv"
TEST_FILE_PATH = os.path.join(DATA_PATH, TEST_URL.split('/')[-1])
if not os.path.exists(TEST_FILE_PATH):
    process = subprocess.run("curl %s --output %s"%(TEST_URL, TEST_FILE_PATH), shell=True, check=True, stdout=subprocess.PIPE, universal_newlines=True)

train_df = pd.read_csv(TRAIN_FILE_PATH)    
test_df = pd.read_csv(TEST_FILE_PATH)

## EDA

In [3]:
print ("Training Data shape: %s"%str(train_df.shape))
train_df.head()

Training Data shape: (30000, 2)


| | sentiment | content |
|---|---|---|
| 0 | empty | @tiffanylue i know i was listenin to bad habi... |
| 1 | sadness | Layin n bed with a headache ughhhh...waitin o... |
| 2 | sadness | Funeral ceremony...gloomy friday... |
| 3 | enthusiasm | wants to hang out with friends SOON! |
| 4 | neutral | @dannycastillo We want to trade with someone w... |


In [4]:
train_df['sentiment'].value_counts()

worry         7433
neutral       6340
sadness       4828
happiness     2986
love          2068
surprise      1613
hate          1187
fun           1088
relief        1021
empty          659
enthusiasm     522
boredom        157
anger           98
Name: sentiment, dtype: int64

Let us take the top 5 categories and leave out the rest.

In [5]:
sentiment_shortlist = train_df['sentiment'].value_counts().nlargest(5).index
train_df_subset = train_df[train_df['sentiment'].isin(sentiment_shortlist)]
print("train_df shape after filtering top 5 sentiments : %s"%str(train_df_subset.shape))

train_df shape after filtering top 5 sentiments : (23655, 2)


## Pre-processing
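Tweets differ from standard prose, so they need tweet-specific pre-processing:
- Remove @mentions and URLs. Note that the `TweetTokenizer` options used below strip only handles; URLs remain as tokens.
- Use the NLTK `TweetTokenizer` instead of a general-purpose tokenizer.
- Remove stopwords and numbers.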
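
The list mentions URLs, but `strip_handles=True` only removes @mentions. One possible way to drop URLs is a regular-expression pass over the raw text before tokenization. The sketch below is illustrative only: `strip_urls` and `URL_PATTERN` are hypothetical names that are not part of the notebook, and the pattern is deliberately simple (it catches `http://`, `https://` and `www.` prefixes, not every valid URL).

```python
import re

# Hypothetical helper (not part of the original notebook).
# Deliberately simple pattern: anything starting with http://, https:// or www.
# up to the next whitespace character.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")

def strip_urls(text):
    """Return text with URL-like substrings removed and whitespace collapsed."""
    without_urls = URL_PATTERN.sub(" ", text)
    return " ".join(without_urls.split())

sample = "loving this song http://example.com/track so much"
print(strip_urls(sample))
```

If used, such a helper would run before tokenization, for example `tweeter.tokenize(strip_urls(content))`, so that URL fragments never reach the stopword filter.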

In [6]:
#strip_handles removes personal information such as twitter handles, which don't
#contribute to emotion in the tweet. preserve_case=False converts everything to lowercase.
tweeter = TweetTokenizer(strip_handles=True,preserve_case=False)
mystopwords = set(stopwords.words("english"))

#Function to tokenize tweets, remove stopwords and numbers. 
#Keeping punctuations and emoticon symbols could be relevant for this task!
def preprocess_corpus(texts):
    def remove_stops_digits(tokens):
        #Nested function that removes stopwords and digits from a list of tokens
        return [token for token in tokens if token not in mystopwords and not token.isdigit()]
    #This return statement below uses the above function to process twitter tokenizer output further. 
    return [remove_stops_digits(tweeter.tokenize(content)) for content in texts]

In [7]:
X = preprocess_corpus(train_df_subset['content'])
y = train_df_subset['sentiment']
print("Number of rows in X : %d"%len(X))

Number of rows in X : 23655


In [8]:
#Split data into train and test, following the usual process
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1234)
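
With neither `test_size` nor `train_size` given, `train_test_split` holds out 25% of the samples and rounds the test count up. The arithmetic below is a sketch of that sizing rule, not a call into scikit-learn, applied to the 23,655 tweets kept above; `split_sizes` is an illustrative helper name.

```python
import math

def split_sizes(n_samples, test_size=0.25):
    """Mirror the default sizing rule: the test count is rounded up, train gets the rest."""
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test

n_train, n_test = split_sizes(23655)
print(n_train, n_test)
```

The resulting test count of 5,914 matches the total support shown in the classification report in the Evaluation section. The split here is not stratified; passing `stratify=y` would keep the class proportions equal across the two parts.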

## Training

### Vectorizing Documents

In [9]:
#Prepare training data in doc2vec format:
train_doc2vec = [TaggedDocument(d, tags=[str(i)]) for i, d in enumerate(X_train)]

#Train a doc2vec model to learn tweet representations. Use only training data!!
model = Doc2Vec(vector_size=50, alpha=0.025, min_count=5, dm=1, epochs=100)
model.build_vocab(train_doc2vec)
model.train(train_doc2vec, total_examples=model.corpus_count, epochs=model.epochs)
model.save("d2v.model")
print("Model Saved")

#Infer the feature representation for training and test data using the trained model
# model= Doc2Vec.load("d2v.model")

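#Generate document vectors for the training and test tweets.
#Note: gensim 4.0 removed the `steps` argument of infer_vector in favour of `epochs`.
X_train_vecs = [model.infer_vector(list_of_tokens, epochs=50) for list_of_tokens in X_train]
X_test_vecs = [model.infer_vector(list_of_tokens, epochs=50) for list_of_tokens in X_test]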

Model Saved


### Training the Classifier

In [10]:
#Use any regular classifier like logistic regression
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(class_weight="balanced") #because classes are not balanced. 
logreg.fit(X_train_vecs, y_train)

LogisticRegression(class_weight='balanced')
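
`class_weight="balanced"` sets each class weight to `n_samples / (n_classes * count_of_class)`, so rarer classes contribute more to the loss. The sketch below applies that formula to the label counts printed in the EDA section. Treat it as an illustration under a stated assumption: those counts describe the full five-class subset before the train/test split, so the weights actually used during fitting will differ slightly.

```python
# Label counts for the five retained classes, copied from the value_counts() output.
# Assumption: these are pre-split counts, used only to illustrate the formula.
label_counts = {
    "worry": 7433,
    "neutral": 6340,
    "sadness": 4828,
    "happiness": 2986,
    "love": 2068,
}

n_samples = sum(label_counts.values())
n_classes = len(label_counts)

# "balanced" heuristic: n_samples / (n_classes * count_of_class)
balanced_weights = {
    label: n_samples / (n_classes * count)
    for label, count in label_counts.items()
}

for label, weight in balanced_weights.items():
    print(f"{label:<10} {weight:.3f}")
```

The most frequent class (`worry`) ends up with a weight below 1 and the least frequent (`love`) with the largest weight, which is the intended rebalancing effect.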

## Evaluation

In [11]:
from sklearn.metrics import classification_report, confusion_matrix

y_test_preds = logreg.predict(X_test_vecs)
print(classification_report(y_test, y_test_preds))

              precision    recall  f1-score   support

   happiness       0.28      0.27      0.28       768
        love       0.21      0.43      0.28       501
     neutral       0.38      0.46      0.42      1614
     sadness       0.33      0.34      0.33      1183
       worry       0.45      0.22      0.30      1848

    accuracy                           0.34      5914
   macro avg       0.33      0.35      0.32      5914
weighted avg       0.36      0.34      0.33      5914
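
The two summary rows of the report differ only in how the per-class scores are averaged: the macro average is a plain mean over classes, while the weighted average weights each class by its support. The sketch below recomputes both F1 averages from the rounded values printed above, along with the accuracy a majority-class predictor would reach on this test set. Because the inputs are already rounded to two decimals, results may differ from the report in the last digit.

```python
# (f1-score, support) per class, copied from the classification report above.
report = {
    "happiness": (0.28, 768),
    "love": (0.28, 501),
    "neutral": (0.42, 1614),
    "sadness": (0.33, 1183),
    "worry": (0.30, 1848),
}

total_support = sum(support for _, support in report.values())

# Macro average: every class counts equally.
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)

# Weighted average: each class counts in proportion to its support.
weighted_f1 = sum(f1 * support for f1, support in report.values()) / total_support

# Accuracy obtained by always predicting the most frequent test class.
majority_baseline = max(support for _, support in report.values()) / total_support

print(f"macro F1:          {macro_f1:.2f}")
print(f"weighted F1:       {weighted_f1:.2f}")
print(f"majority baseline: {majority_baseline:.2f}")
```

The recomputed averages agree with the report (0.32 and 0.33). The majority-class baseline is about 0.31, so the 0.34 accuracy is only slightly better than always predicting `worry`.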

