<a href="https://colab.research.google.com/github/zt55699/Dataset/blob/main/SENG474_Word2Vec.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Loading dataset

In [49]:
%rm -rf IMDB-Sentiment
!git clone https://github.com/zt55699/IMDB-Sentiment.git
%cd IMDB-Sentiment/
%ls

Cloning into 'IMDB-Sentiment'...
remote: Enumerating objects: 11, done.[K
remote: Counting objects: 100% (11/11), done.[K
remote: Compressing objects: 100% (9/9), done.[K
remote: Total 11 (delta 2), reused 7 (delta 1), pack-reused 0[K
Unpacking objects: 100% (11/11), done.
/content/IMDB-Sentiment/IMDB-Sentiment
labeledTrainData.tsv  SENG474_Word2Vec.ipynb  unlabeledTrainData.tsv
README.md             testData.tsv


In [50]:
import pandas as pd

# Read data from files 
train_data = pd.read_csv( "labeledTrainData.tsv", header=0, delimiter="\t", quoting=3 )
test_data = pd.read_csv( "testData.tsv", header=0, delimiter="\t", quoting=3 )
unlabeled_train = pd.read_csv( "unlabeledTrainData.tsv", header=0, delimiter="\t", quoting=3 )

In [51]:
print(train_data.head)
print(test_data.head)

<bound method NDFrame.head of               id  sentiment                                             review
0       "5814_8"          1  "With all this stuff going down at the moment ...
1       "2381_9"          1  "\"The Classic War of the Worlds\" by Timothy ...
2       "7759_3"          0  "The film starts with a manager (Nicholas Bell...
3       "3630_4"          0  "It must be assumed that those who praised thi...
4       "9495_8"          1  "Superbly trashy and wondrously unpretentious ...
...          ...        ...                                                ...
24995   "3453_3"          0  "It seems like more consideration has gone int...
24996   "5064_1"          0  "I don't believe they made this film. Complete...
24997  "10905_3"          0  "Guy is a loser. Can't get girls, needs to bui...
24998  "10194_3"          0  "This 30 minute documentary Buñuel made in the...
24999   "8478_8"          1  "I saw this movie as a child and it broke my h...

[25000 rows x 3 colum

# Data Cleaning 

Gensim preprocessing doc: https://radimrehurek.com/gensim/parsing/preprocessing.html

In [52]:
import gensim.parsing.preprocessing as gp
from gensim.parsing.preprocessing import preprocess_string
from gensim.parsing.preprocessing import remove_stopwords

# Cast words to lower case; remove HTML tags, puctuation, numbers, short words and meaningless stopwords
# Use Porter Stemming e.g. treat "go", "going", and "went" as the same word
# Remove stop words here, which are the noise for later average vector
FILTERS = [lambda x: x.lower(), gp.strip_tags, gp.strip_punctuation, 
           gp.strip_multiple_whitespaces, gp.strip_short, gp.stem_text, 
           gp.remove_stopwords, gp.strip_numeric] # maybe not remove number as well

# clean a sentence, return a list of words
def clean_sentence(raw_sentence):
  return preprocess_string(raw_sentence, FILTERS)

r1 = train_data["review"][0]
print("Before: ", r1)
print("After: ", clean_sentence(r1))

Before:  "With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it fi

# Data Pre-processing

Word2Vec expects single sentences as inputs, each one as a list of words. 

In [53]:
from gensim.summarization.textcleaner import split_sentences

# split a review by sentences, return a list of sentences, for each is a list of words
def split_review (raw_review):
  raw_sentences = split_sentences(raw_review)
  clean_sentences = []
  for s in raw_sentences:
    if len(s) > 0:
      clean_sentences.append( clean_sentence(s))
  return clean_sentences

print("Before: ", r1)
print("After: ", split_review(r1))

Before:  "With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it fi

In [54]:
# prepare input data for Word2Vec (takes couple minutes):
all_sentences = []  

print(f'Parsing {len(train_data["review"])} sentences from training set...')
train_size = len(train_data["review"])
for i in range (0, train_size):
    # report progress
    progress = (i+1)/train_size *100
    if( progress%20 == 0 ):
        print(f'   {progress}%')  
    all_sentences += split_review( train_data["review"][i])

print(f'Parsing {len(unlabeled_train["review"])} sentences from unlabeled set...')
unlabel_size = len(unlabeled_train["review"])
for i in range (0, unlabel_size):
    # report progress
    progress = (i+1)/unlabel_size *100
    if( progress%20 == 0 ):
        print(f'   {progress}%')  
    all_sentences += split_review(unlabeled_train["review"][i])

Parsing 25000 sentences from training set...
   20.0%
   40.0%
   60.0%
   80.0%
   100.0%
Parsing 50000 sentences from unlabeled set...
   20.0%
   40.0%
   60.0%
   80.0%
   100.0%


In [55]:
print("Total:", len(all_sentences), "sentences")
print(all_sentences[0])

Total: 792761 sentences
['thi', 'stuff', 'moment', 'start', 'listen', 'hi', 'music', 'watch', 'odd', 'documentari', 'watch', 'wiz', 'watch', 'moonwalk']


# Word Embedding

## 1. Word2Vec 

In [56]:
# Output messages for training
from gensim.models import word2vec
import logging
import sys

logging.basicConfig(
    format='%(asctime)s [%(levelname)s] %(name)s - %(message)s',
    level=logging.INFO,
    datefmt='%Y-%m-%d %H:%M:%S',
    stream=sys.stdout,
)
log = logging.getLogger('notebook')

# parameters
num_features = 300    # Word vector dimensionality                      
min_word_count = 40   # Minimum word count                        
num_workers = 4       # Number of threads to run in parallel
context = 10          # Context window size                                                                                    
downsampling = 1e-3   # Downsample setting for frequent words

# model training
word2vec_model = word2vec.Word2Vec(all_sentences, workers=num_workers, 
            size=num_features, min_count = min_word_count, 
            window = context, sample = downsampling)

word2vec_model.init_sims(replace=True) # internally calculates unit-length normalized vectors

2021-03-12 00:07:44 [INFO] gensim.models.word2vec - collecting all words and their counts
2021-03-12 00:07:44 [INFO] gensim.models.word2vec - PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2021-03-12 00:07:44 [INFO] gensim.models.word2vec - PROGRESS: at sentence #10000, processed 111865 words, keeping 12374 word types
2021-03-12 00:07:44 [INFO] gensim.models.word2vec - PROGRESS: at sentence #20000, processed 223633 words, keeping 17075 word types
2021-03-12 00:07:44 [INFO] gensim.models.word2vec - PROGRESS: at sentence #30000, processed 332303 words, keeping 20504 word types
2021-03-12 00:07:44 [INFO] gensim.models.word2vec - PROGRESS: at sentence #40000, processed 443349 words, keeping 23280 word types
2021-03-12 00:07:44 [INFO] gensim.models.word2vec - PROGRESS: at sentence #50000, processed 553570 words, keeping 25640 word types
2021-03-12 00:07:44 [INFO] gensim.models.word2vec - PROGRESS: at sentence #60000, processed 662342 words, keeping 27632 word types
2021-0

In [58]:
# save model to drive for later use OPTIONAL
from google.colab import drive
drive.mount('/content/drive')

def save_model(model, name):
  model_save_name = f'{name}({num_features},{min_word_count},{context})'
  path = f"/content/drive/MyDrive/{model_save_name}" 
  model.save(path)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [59]:
save_model(word2vec_model, "Word2Vec")

2021-03-12 00:10:17 [INFO] gensim.utils - saving Word2Vec object under /content/drive/MyDrive/Word2Vec(300,40,10), separately None
2021-03-12 00:10:17 [INFO] gensim.utils - not storing attribute vectors_norm
2021-03-12 00:10:17 [INFO] gensim.utils - not storing attribute cum_table
2021-03-12 00:10:19 [INFO] gensim.utils - saved /content/drive/MyDrive/Word2Vec(300,40,10)


load trained Word2Vec model

In [None]:
# load trained model OPTIONAL
from gensim.models import Word2Vec
word2vec_model = Word2Vec.load("Word2Vec(300,40,10)")
word2vec_model.trainables.syn1neg.shape

In [60]:
word2vec_model.wv["man"] # word vec

array([-0.00790779, -0.00668373,  0.01135651,  0.05575463, -0.04172793,
       -0.02518707, -0.02800363,  0.01221512,  0.01416656, -0.03569898,
       -0.05336002,  0.00683461,  0.09737637, -0.01613286, -0.02946621,
        0.01263367,  0.08575792,  0.03461728, -0.03028729, -0.05556952,
       -0.0234124 ,  0.04086632,  0.02783649, -0.06073735,  0.06068921,
        0.01167279,  0.08543117, -0.04580932,  0.02434501, -0.02314274,
        0.08592816,  0.01512182,  0.06692058,  0.01390861, -0.08533301,
       -0.06960441, -0.00121154,  0.03295327, -0.00071125, -0.06146927,
       -0.04905044, -0.10266563, -0.04576875,  0.10780175,  0.06847   ,
        0.09984921, -0.00564327,  0.04405795, -0.0087141 ,  0.06850781,
       -0.00908569, -0.03732553, -0.00051461,  0.00761091, -0.01683275,
       -0.01396136, -0.14533119, -0.06501181, -0.10504609,  0.08648138,
       -0.00426058, -0.07482792, -0.07241637, -0.03257763,  0.00865732,
        0.09045486, -0.10897283,  0.04535472,  0.08304214, -0.03

In [61]:
word2vec_model.wv.most_similar("woman")

[('women', 0.6129238605499268),
 ('giovanna', 0.5904942750930786),
 ('prostitut', 0.5829073786735535),
 ('ladi', 0.5744413733482361),
 ('loretta', 0.5725628733634949),
 ('lass', 0.563113808631897),
 ('widow', 0.5628523826599121),
 ('housewif', 0.5491349101066589),
 ('whore', 0.544928789138794),
 ('husband', 0.5417135953903198)]

## 2. Doc2Vec


Gensim Doc2Vec doc https://radimrehurek.com/gensim/models/doc2vec.html

Doc2Vec preprocessing

In [62]:
import gensim.utils
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# prepare input data for Doc2Vec (takes couple minutes):
taggeddoc = []  

print(f'Parsing {len(train_data["review"])} sentences from training set...')
train_size = len(train_data["review"])
for i in range (0, train_size):
    # report progress
    progress = (i+1)/train_size *100
    if( progress%20 == 0 ):
        print(f'   {progress}%')  
    clean_r = clean_sentence( train_data["review"][i])
    taggeddoc.append(TaggedDocument(gensim.utils.to_unicode(str.encode(' '.join(clean_r))).split(),str(i)))

print(f'Parsing {len(unlabeled_train["review"])} sentences from unlabeled_train set...')
unlabeled_train_size = len(unlabeled_train["review"])
for i in range (0, unlabeled_train_size):
    # report progress
    progress = (i+1)/unlabeled_train_size *100
    if( progress%20 == 0 ):
        print(f'   {progress}%')  
    clean_r = clean_sentence( unlabeled_train["review"][i])
    taggeddoc.append(TaggedDocument(gensim.utils.to_unicode(str.encode(' '.join(clean_r))).split(),str(i)))

Parsing 25000 sentences from training set...
   20.0%
   40.0%
   60.0%
   80.0%
   100.0%
Parsing 50000 sentences from unlabeled_train set...
   20.0%
   40.0%
   60.0%
   80.0%
   100.0%


Doc2Vec training

In [63]:
import logging
import sys

# output messages
logging.basicConfig(
    format='%(asctime)s [%(levelname)s] %(name)s - %(message)s',
    level=logging.INFO,
    datefmt='%Y-%m-%d %H:%M:%S',
    stream=sys.stdout,
)
log = logging.getLogger('notebook')

# parameters
num_features = 300    # Word vector dimensionality                      
min_word_count = 40   # Minimum word count                        
num_workers = 4       # Number of threads to run in parallel
context = 10          # Context window size                                                                                    
downsampling = 1e-3   # Downsample setting for frequent words

# model training
doc2vec_model = Doc2Vec(taggeddoc, workers=num_workers, 
                        vector_size= num_features, min_count = min_word_count, 
                        window = context, sample = downsampling)


2021-03-12 00:13:57 [INFO] gensim.models.doc2vec - collecting all words and their counts
2021-03-12 00:13:57 [INFO] gensim.models.doc2vec - PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
2021-03-12 00:13:58 [INFO] gensim.models.doc2vec - PROGRESS: at example #10000, processed 1169086 words (2282547/s), 34946 word types, 10 tags
2021-03-12 00:13:58 [INFO] gensim.models.doc2vec - PROGRESS: at example #20000, processed 2322701 words (2387794/s), 46493 word types, 10 tags
2021-03-12 00:13:59 [INFO] gensim.models.doc2vec - PROGRESS: at example #30000, processed 3474171 words (2842950/s), 57058 word types, 10 tags
2021-03-12 00:13:59 [INFO] gensim.models.doc2vec - PROGRESS: at example #40000, processed 4634266 words (2856544/s), 66247 word types, 10 tags
2021-03-12 00:13:59 [INFO] gensim.models.doc2vec - PROGRESS: at example #50000, processed 5809170 words (2877870/s), 74232 word types, 10 tags
2021-03-12 00:14:00 [INFO] gensim.models.doc2vec - PROGRESS: at example #6

In [64]:
save_model(doc2vec_model, "Doc2Vec")

2021-03-12 00:17:24 [INFO] gensim.utils - saving Doc2Vec object under /content/drive/MyDrive/Doc2Vec(300,40,10), separately None
2021-03-12 00:17:25 [INFO] gensim.utils - saved /content/drive/MyDrive/Doc2Vec(300,40,10)


In [65]:
doc2vec_model.wv.most_similar("woman")

2021-03-12 00:17:41 [INFO] gensim.models.keyedvectors - precomputing L2-norms of word weight vectors


[('women', 0.5527838468551636),
 ('loretta', 0.5242778658866882),
 ('ladi', 0.5215874910354614),
 ('husband', 0.48164546489715576),
 ('mathieu', 0.47373396158218384),
 ('conchita', 0.47359904646873474),
 ('nubil', 0.46975177526474),
 ('seduc', 0.46744436025619507),
 ('bisexu', 0.46443110704421997),
 ('whore', 0.462283194065094)]

# Building Feature Set

In [84]:
# same clean process apply to input data, keep consistent with the keys in encoding models vc dic
clean_reviews = []

for review in train_data["review"]:
  clean_reviews.append(clean_sentence(review))

get the feature set by averaging the word vectors in a single review

In [85]:
import numpy as np

# take a list of words as input, return average vector
def get_average_vec(review, model, n_features = num_features):
    vectorized = [model.wv[word] for word in review if word in model.wv.vocab]
    total = len(vectorized)
    sum_v = np.sum(vectorized, axis=0)
    average_v = np.divide(sum_v, total)
    return average_v

In [86]:
def to_feature_set(model, method, reviews):
  f_set = []
  train_size = len(reviews)
  print(f'Processing {train_size} training reviews...')
  
  for i in range (0, train_size):
    # report progress
    progress = (i+1)/train_size *100
    if( progress%20 == 0 ):
        print(f'   {progress}%')  

    avg_v = method(reviews[i], model)
    f_set.append(avg_v)
  return f_set

Same preprocessing as word2vec to keep input data format uniform

In [87]:
word2vec_train_reviews = to_feature_set(word2vec_model, get_average_vec, clean_reviews)

Processing 25000 training reviews...
   20.0%
   40.0%
   60.0%
   80.0%
   100.0%


In [88]:
doc2vec_train_reviews = to_feature_set(doc2vec_model, get_average_vec, clean_reviews)  

Processing 25000 training reviews...
   20.0%
   40.0%
   60.0%
   80.0%
   100.0%


# Classifier Modeling

In [89]:
# splitting train test sets
from sklearn.model_selection import train_test_split

X_train1, X_test1, y_train1, y_test1 = train_test_split(word2vec_train_reviews, train_data["sentiment"], 
                                                    test_size=0.2, random_state=42)

X_train2, X_test2, y_train2, y_test2 = train_test_split(doc2vec_train_reviews, train_data["sentiment"], 
                                                    test_size=0.2, random_state=42)

## Random forest



In [90]:
from sklearn.ensemble import RandomForestClassifier as rfc

RF = rfc(n_estimators=100)

# train
RF1 = RF.fit(X_train1, y_train1)
print("Word2Vec encoding Test accuracy:" ,RF.score(X_test1, y_test1))

RF2 = RF.fit(X_train2, y_train2)
print("Doc2Vec encoding Test accuracy:" ,RF.score(X_test2, y_test2))

# predict
#result = RF1.predict(y_train)


Word2Vec encoding Test accuracy: 0.8384
Doc2Vec encoding Test accuracy: 0.8524


## SVM

In [91]:
# may take an hour for runing this block
from sklearn.svm import SVC

C = 10

svc_clf = SVC(kernel='linear', C=C, probability=True, random_state=0)

svc_clf1 = svc_clf.fit(X_train1, y_train1)
print("Word2Vec encoding Test accuracy:" ,svc_clf1.score(X_test1, y_test1))

svc_clf2 = svc_clf.fit(X_train2, y_train2)
print("Doc2Vec encoding Test accuracy:" ,svc_clf2.score(X_test2, y_test2))

Word2Vec encoding Test accuracy: 0.8728
DocVec encoding Test accuracy: 0.88


## Bayes