# Creating model with embedding-based feature extraction approach

Instead of using BoW-based features to created the "word embeddings", understood as word-vector representations, we will use a pre-trained neural embedding model included in Word2Vec library, specifically one from Googlem, ...negative300.bins. The idea is to use these word embbedings as features to classify out text.  

Notice the difference between word vectorization, not done this time since we are using the Google word embeddings, and feature vectorization, which we'll do with our raw text by averaging all word embeddings for all sentences (function embedding_feats)

## Loading and Exploring dataset

In [1]:
SEED=42

In [2]:
from google.colab import files
files.upload()

!mkdir -p ~/.kaggle
!mv kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

Saving kaggle.json to kaggle.json


In [3]:
import kaggle

dataset_name = 'marklvl/sentiment-labelled-sentences-data-set'
destination_path = 'archive'

kaggle.api.dataset_download_files(dataset_name, path=destination_path, unzip=True)

dataset_name = 'leadbest/googlenewsvectorsnegative300'
kaggle.api.dataset_download_files(dataset_name, path=destination_path, unzip=True)


In [4]:
file_paths = ['archive/sentiment labelled sentences/amazon_cells_labelled.txt', 'archive/sentiment labelled sentences/imdb_labelled.txt', 'archive/sentiment labelled sentences/yelp_labelled.txt']
output_file = 'archive/sentiment_sentences.txt'

with open(output_file, 'w') as outfile:
    for file_path in file_paths:
        with open(file_path, 'r') as infile:
            outfile.write(infile.read())


In [5]:
import os

data_path= "archive"

training_data_path = os.path.join(data_path, "sentiment_sentences.txt")
training_data_path

'archive/sentiment_sentences.txt'

In [6]:
#Read text data, cats.
#the file path consists of tab separated sentences and cats.
texts = []
cats = []

fh = open(training_data_path)
for line in fh:
    text, sentiment = line.split("\t")
    texts.append(text)
    cats.append(sentiment)

print(texts[1])
print(cats[1])
print(len(texts), len(cats))

Good case, Excellent value.
1

3000 3000


In [7]:
import gensim.downloader as api
from gensim.models import Word2Vec, KeyedVectors

path_to_model = f'{data_path}/GoogleNews-vectors-negative300.bin'

# loading the model word2vec
%time w2v_model = KeyedVectors.load_word2vec_format(path_to_model, binary=True)
print('done loading Word2Vec')

CPU times: user 43.1 s, sys: 9.03 s, total: 52.2 s
Wall time: 1min 3s
done loading Word2Vec


In [8]:
#Inspecting the model
word2vec_vocab = w2v_model.key_to_index.keys()
word2vec_vocab_lower = [item.lower for item in word2vec_vocab]
print(len(word2vec_vocab_lower))
list(w2v_model.key_to_index.keys())[:14]

3000000


['</s>',
 'in',
 'for',
 'that',
 'is',
 'on',
 '##',
 'The',
 'with',
 'said',
 'was',
 'the',
 'at',
 'not']

In [9]:
from string import punctuation
from nltk.tokenize import word_tokenize
import nltk
from nltk.corpus import stopwords

nltk.download('punkt')
nltk.download('stopwords')

# preprocess the text
def preprocess_corpus(texts):
  mystopwords = set(stopwords.words("english"))
  def remove_stop_digits(tokens):
    # lowercase, remove stop words and digits from a list of tokens
    return [token.lower() for token in tokens if token.lower() not in mystopwords and not token.isdigit() and token not in punctuation]
  return [remove_stop_digits(word_tokenize(text)) for text in texts]

text_processed = preprocess_corpus(texts)
print(len(cats), len(text_processed))
print(text_processed[1])
print(cats[1])

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


3000 3000
['good', 'case', 'excellent', 'value']
1



In [10]:
import numpy as np
# Now that we have our text preprocessed in the form of list of lists (sentences),
# we create the feature vectors by averaging all embeddings for all sentences
# another way to say it, we are creating a unique feature vector for each list of tokens,
# where tokens are the individual words and the list is the sentence
def embedding_feats(list_of_lists):
  DIMENSION=300
  zero_vector=np.zeros(DIMENSION)
  count_for_this= 0 + 1e-5
  feats=[]
  for tokens in list_of_lists:
    feat_for_this=np.zeros(DIMENSION)
    for token in tokens:
      if token in w2v_model:
        feat_for_this+=w2v_model[token]
    if(count_for_this != 0):
      feats.append(feat_for_this/count_for_this)
    else:
      feats.append(zero_vector)
  return feats

train_vectors=embedding_feats(text_processed)
print(len(train_vectors))

3000


In [11]:
# we take a classiffier and create a model to train/test it with our vectors

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

classifier = LogisticRegression(random_state=SEED, max_iter=200)
train_data, test_data, train_cats, test_cats = train_test_split(train_vectors, cats)
classifier.fit(train_data, train_cats)
print("Accuracy: ", classifier.score(test_data, test_cats))

preds=classifier.predict(test_data)
print(classification_report(test_cats, preds))

Accuracy:  0.816
              precision    recall  f1-score   support

          0
       0.84      0.81      0.82       398
          1
       0.80      0.82      0.81       352

    accuracy                           0.82       750
   macro avg       0.82      0.82      0.82       750
weighted avg       0.82      0.82      0.82       750

