# Word2Vec for Text Classification

This short notebook shows how to use a pre-trained Word2Vec model to extract features for text classification.

We will use the Sentiment Labelled Sentences dataset from the UCI Machine Learning Repository:
http://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences

The dataset consists of 1500 positive and 1500 negative sentences from Amazon, Yelp, and IMDB. It ships as three separate data files, which the data-loading cell below combines into one with the Unix `cat` command.

Let us get started!

## Setup

### Imports

In [1]:
#basic imports
import os, subprocess
from time import time

#pre-processing imports
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from string import punctuation

#imports related to modeling
import numpy as np
import pandas as pd
from gensim.models import Word2Vec, KeyedVectors
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

### Loading Data & Word2Vec Model

In [2]:
import zipfile

#Download Training data
DATA_PATH = "Data"
TRAIN_ZIP_URL = "http://archive.ics.uci.edu/ml/machine-learning-databases/00331/sentiment%20labelled%20sentences.zip"
TRAIN_ZIP_PATH = os.path.join(DATA_PATH, TRAIN_ZIP_URL.split('/')[-1].replace("%20"," "))
TRAIN_FOLDER_PATH = TRAIN_ZIP_PATH.replace('.zip','')
TRAIN_DATA_PATH = os.path.join(TRAIN_FOLDER_PATH, 'sentiment_sentences.txt')

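#Make sure the download folder exists before curl writes into it
os.makedirs(DATA_PATH, exist_ok=True)
if not os.path.exists(TRAIN_ZIP_PATH):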
    process = subprocess.run('curl "%s" --output "%s"'%(TRAIN_ZIP_URL, TRAIN_ZIP_PATH), shell=True, check=True, stdout=subprocess.PIPE, universal_newlines=True)

if not os.path.exists(TRAIN_FOLDER_PATH):
    with zipfile.ZipFile(TRAIN_ZIP_PATH, 'r') as zip_ref:
        zip_ref.extractall(DATA_PATH)

if not os.path.exists(TRAIN_DATA_PATH):
    subprocess.run('cd "%s" && cat amazon_cells_labelled.txt imdb_labelled.txt yelp_labelled.txt > sentiment_sentences.txt'%TRAIN_FOLDER_PATH, shell=True, check=True, stdout=subprocess.PIPE, universal_newlines=True)        
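
The `cat` call above relies on a Unix shell. For environments without one, the same concatenation can be sketched in plain Python. The helper name `combine_files` is hypothetical, and unlike `cat` this version adds a newline after any file that lacks a trailing one, so that two sentences never end up on the same line.

```python
from pathlib import Path

def combine_files(source_paths, target_path):
    """Concatenate text files into target_path, one after another."""
    with open(target_path, "w", encoding="utf-8") as out:
        for path in source_paths:
            text = Path(path).read_text(encoding="utf-8")
            out.write(text)
            # Unlike `cat`, make sure the next file starts on a fresh line.
            if text and not text.endswith("\n"):
                out.write("\n")
```

With the notebook's variables, a call might look like `combine_files([os.path.join(TRAIN_FOLDER_PATH, name) for name in ("amazon_cells_labelled.txt", "imdb_labelled.txt", "yelp_labelled.txt")], TRAIN_DATA_PATH)`.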

In [3]:
#Download Word2Vec model
WORD2VEC_URL = "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"
WORD2VEC_PATH = os.path.join(DATA_PATH, WORD2VEC_URL.split('/')[-1])
if not os.path.exists(WORD2VEC_PATH):
    process = subprocess.run('curl "%s" --output "%s"'%(WORD2VEC_URL, WORD2VEC_PATH), shell=True, check=True, stdout=subprocess.PIPE, universal_newlines=True)

In [4]:
#Load W2V model. This will take some time. 
w2v_model = KeyedVectors.load_word2vec_format(WORD2VEC_PATH, binary=True)
print('done loading Word2Vec')

#Read the text data and its categories (cats).
#Each line of the file holds a sentence and its label, separated by a tab.
texts = []
cats = []
with open(TRAIN_DATA_PATH) as fh:
    for line in fh:
        text, sentiment = line.split("\t")
        texts.append(text.strip())
        cats.append(sentiment.strip())
data_df = pd.DataFrame({"text":texts,'label':cats})

done loading Word2Vec
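
The reading loop above assumes every line contains exactly one tab. A slightly more defensive parser, sketched below under the assumption that blank or unlabelled lines may be skipped, splits on the last tab only. The function name `parse_labelled_lines` and the third sample line are made up for illustration.

```python
def parse_labelled_lines(lines):
    """Split 'sentence<TAB>label' lines into parallel lists of texts and labels."""
    texts, labels = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # Split on the last tab only, so a tab inside the sentence does not break parsing.
        text, sep, label = line.rpartition("\t")
        if not sep:
            continue  # no tab found: not a labelled line
        texts.append(text.strip())
        labels.append(label.strip())
    return texts, labels

sample = ["Great for the jawbone.\t1\n", "\n", "Not good at all.\t0\n"]
sample_texts, sample_labels = parse_labelled_lines(sample)
print(sample_texts)   # ['Great for the jawbone.', 'Not good at all.']
print(sample_labels)  # ['1', '0']
```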


## EDA

In [5]:
data_df.head()

| | text | label |
|---|---|---|
| 0 | So there is no way for me to plug it in here i... | 0 |
| 1 | Good case, Excellent value. | 1 |
| 2 | Great for the jawbone. | 1 |
| 3 | Tied to charger for conversations lasting more... | 0 |
| 4 | The mic is great. | 1 |


In [6]:
data_df.label.value_counts()

1    1500
0    1500
Name: label, dtype: int64

In [7]:
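# Inspect the model
# Note: gensim 4.x exposes the vocabulary as key_to_index; on gensim 3.x use w2v_model.vocab instead
word2vec_vocab = w2v_model.key_to_index.keys()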
word2vec_vocab_lower = [item.lower() for item in word2vec_vocab]
print(len(word2vec_vocab))

3000000


## Preprocessing

In [8]:
#preprocess the text.
mystopwords = set(stopwords.words("english"))
def preprocess_corpus(text):
    #Tokenize the text, drop stopwords, digits and punctuation, then lowercase the remaining tokens.
    #Note: the stopword check runs before lowercasing, so capitalized stopwords such as "The" are kept.
    tokens = word_tokenize(text)
    keep_token = lambda token: token not in mystopwords and not token.isdigit() and token not in punctuation
    return [token.lower() for token in tokens if keep_token(token)]

data_df['tokens'] = data_df['text'].apply(preprocess_corpus)
data_df.head()

| | text | label | tokens |
|---|---|---|---|
| 0 | So there is no way for me to plug it in here i... | 0 | [so, way, plug, us, unless, i, go, converter] |
| 1 | Good case, Excellent value. | 1 | [good, case, excellent, value] |
| 2 | Great for the jawbone. | 1 | [great, jawbone] |
| 3 | Tied to charger for conversations lasting more... | 0 | [tied, charger, conversations, lasting, minute... |
| 4 | The mic is great. | 1 | [the, mic, great] |
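
In the output above, "The" and "So" survive as "the" and "so" because tokens are compared against the lowercase stopword list before they are lowercased. One possible variant lowercases first. The sketch below is self-contained and rests on two assumptions: a tiny made-up stopword set (`TOY_STOPWORDS`) stands in for NLTK's English list, and a simple regex tokenizer stands in for `word_tokenize`. Whether this change affects the classifier's accuracy is not measured here.

```python
import re
from string import punctuation

# Tiny illustrative stopword list (an assumption); the notebook uses NLTK's English list.
TOY_STOPWORDS = {"the", "is", "so", "i", "for", "to", "it", "in"}

def preprocess_lowercase_first(text, stopwords=TOY_STOPWORDS):
    # Lowercase before filtering so that "The" and "the" are treated alike.
    # The regex yields runs of word characters or single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    return [t for t in tokens
            if t not in stopwords and not t.isdigit() and t not in punctuation]

print(preprocess_lowercase_first("The mic is great."))  # ['mic', 'great']
```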


## Training

### Vectorizing Documents
- Average the vectors of all tokens in a document to get the document vector

In [9]:
# Create one feature vector per sentence by averaging the embeddings of its tokens (tokens missing from the model are skipped)
WORD2VEC_DIMENSION = 300
def get_doc_vec(tokens):
    token_vecs = np.array([w2v_model[token] for token in tokens if token in w2v_model])
    doc_vec = token_vecs.mean(axis=0) if len(token_vecs)>0 else np.zeros(WORD2VEC_DIMENSION)
    return doc_vec

data_df['text_vec'] = data_df['tokens'].apply(get_doc_vec)
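
To make the averaging concrete, here is a small worked example with made-up 3-dimensional vectors in place of the 300-dimensional Word2Vec embeddings. The names `toy_vectors` and `toy_doc_vec` are hypothetical. The logic mirrors `get_doc_vec`: unknown tokens are skipped, and a document with no known tokens maps to a zero vector.

```python
import numpy as np

# Toy 3-dimensional embeddings (made-up values); the real model has 300 dimensions.
toy_vectors = {
    "good": np.array([1.0, 0.0, 2.0]),
    "case": np.array([3.0, 2.0, 0.0]),
}

def toy_doc_vec(tokens, vectors, dim=3):
    # Keep only the tokens that have an embedding.
    known = [vectors[t] for t in tokens if t in vectors]
    # Element-wise mean over the known token vectors, or zeros if none are known.
    return np.mean(known, axis=0) if known else np.zeros(dim)

print(toy_doc_vec(["good", "case", "jawbone"], toy_vectors))  # [2. 1. 1.]
print(toy_doc_vec(["jawbone"], toy_vectors))                  # [0. 0. 0.]
```

Note that averaging discards word order, so two sentences containing the same words in a different order get the same document vector.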

### Training the Classifier

In [10]:
#Take any classifier (LogisticRegression here) and train/test it as before.
classifier = LogisticRegression(random_state=1234)
train_data, test_data, train_cats, test_cats = train_test_split(data_df['text_vec'].apply(lambda x:x.tolist()), data_df['label'])
classifier.fit(train_data.to_list(), train_cats)

LogisticRegression(random_state=1234)

### Evaluation

In [11]:
pred = lambda text : classifier.predict([get_doc_vec(preprocess_corpus(text))])
print("Accuracy: ", classifier.score(test_data.to_list(), test_cats))
preds = classifier.predict(test_data.to_list())
print(classification_report(test_cats, preds))

Accuracy:  0.8573333333333333
              precision    recall  f1-score   support

           0       0.84      0.88      0.86       371
           1       0.88      0.83      0.86       379

    accuracy                           0.86       750
   macro avg       0.86      0.86      0.86       750
weighted avg       0.86      0.86      0.86       750
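
For readers unfamiliar with the columns of the report, the sketch below computes precision, recall, and F1 for one class by hand. It uses made-up labels, not the notebook's test set, and the helper name `binary_report` is hypothetical. The `support` column is simply the number of true examples of each class.

```python
def binary_report(y_true, y_pred, positive="1"):
    """Return (precision, recall, f1) for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels for illustration only.
toy_true = ["1", "1", "1", "0", "0", "0"]
toy_pred = ["1", "1", "0", "1", "0", "0"]
precision, recall, f1 = binary_report(toy_true, toy_pred)
print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.67 0.67 0.67
```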



With little effort, we got about 86% accuracy. That's a great starting model to have!

In [12]:
pred('Enjoyed the show. Will try again!')

array(['1'], dtype=object)

In [13]:
pred('Service was unsatisfactory!')

array(['0'], dtype=object)