<a href="https://colab.research.google.com/github/zt55699/IMDB-Sentiment/blob/main/SENG474_Word2Vec.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Loading dataset

In [53]:
%rm -rf IMDB-Sentiment
!git clone https://github.com/zt55699/IMDB-Sentiment.git
%cd IMDB-Sentiment/
%ls

Cloning into 'IMDB-Sentiment'...
remote: Enumerating objects: 8, done.
remote: Counting objects: 100% (8/8), done.
remote: Compressing objects: 100% (6/6), done.
remote: Total 8 (delta 1), reused 8 (delta 1), pack-reused 0
Unpacking objects: 100% (8/8), done.
/content/IMDB-Sentiment/IMDB-Sentiment/IMDB-Sentiment
labeledTrainData.tsv  README.md  testData.tsv  unlabeledTrainData.tsv


In [54]:
import pandas as pd

# Read data from files 
train_data = pd.read_csv( "labeledTrainData.tsv", header=0, delimiter="\t", quoting=3 )
test_data = pd.read_csv( "testData.tsv", header=0, delimiter="\t", quoting=3 )
unlabeled_train = pd.read_csv( "unlabeledTrainData.tsv", header=0, delimiter="\t", quoting=3 )

In [138]:
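# Note: `.head` without parentheses prints the bound method's repr (a summary of
# the whole frame); call `.head()` to print only the first five rows
print(train_data.head)
print(test_data.head)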

<bound method NDFrame.head of               id  sentiment                                             review
0       "5814_8"          1  "With all this stuff going down at the moment ...
1       "2381_9"          1  "\"The Classic War of the Worlds\" by Timothy ...
2       "7759_3"          0  "The film starts with a manager (Nicholas Bell...
3       "3630_4"          0  "It must be assumed that those who praised thi...
4       "9495_8"          1  "Superbly trashy and wondrously unpretentious ...
...          ...        ...                                                ...
24995   "3453_3"          0  "It seems like more consideration has gone int...
24996   "5064_1"          0  "I don't believe they made this film. Complete...
24997  "10905_3"          0  "Guy is a loser. Can't get girls, needs to bui...
24998  "10194_3"          0  "This 30 minute documentary Buñuel made in the...
24999   "8478_8"          1  "I saw this movie as a child and it broke my h...

[25000 rows x 3 colum

# Data Cleaning 

Gensim preprocessing doc: https://radimrehurek.com/gensim/parsing/preprocessing.html

In [79]:
import gensim.parsing.preprocessing as gp
from gensim.parsing.preprocessing import preprocess_string

# Lowercase; strip HTML tags, punctuation, repeated whitespace, short words (< 3 characters) and digits
# Apply Porter stemming, e.g. treat "watch", "watching", and "watched" as the same word
# Stop words are kept here because Word2Vec relies on the broader context of the sentence
FILTERS = [lambda x: x.lower(), gp.strip_tags, gp.strip_punctuation, 
           gp.strip_multiple_whitespaces, gp.strip_short, gp.stem_text, 
           gp.strip_numeric] # maybe keep numbers as well

# clean a sentence, return a list of words
def clean_sentence(raw_sentence):
  return preprocess_string(raw_sentence, FILTERS)

r1 = train_data["review"][0]
print("Before: ", r1)
print("After: ", clean_sentence(r1))

Before:  "With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it fi
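
To make the order of the cleaning steps concrete, here is a minimal standard-library sketch of a filter chain. The helper names (`strip_tags`, `strip_punctuation`, `strip_short`, `apply_filters`, `TOY_FILTERS`) are hypothetical stand-ins written for illustration, not the gensim implementations, and stemming is left out.

```python
import re

# Hypothetical stand-ins for the gensim filters: each takes and returns a string.
def strip_tags(text):
    # Replace anything that looks like an HTML tag with a space.
    return re.sub(r"<[^>]+>", " ", text)

def strip_punctuation(text):
    # Replace every character that is neither a word character nor whitespace.
    return re.sub(r"[^\w\s]", " ", text)

def strip_short(text, minsize=3):
    # Drop words shorter than `minsize` characters.
    return " ".join(w for w in text.split() if len(w) >= minsize)

TOY_FILTERS = [str.lower, strip_tags, strip_punctuation, strip_short]

def apply_filters(text, filters):
    # Apply each filter in order, then split the cleaned string into tokens.
    for f in filters:
        text = f(text)
    return text.split()

tokens = apply_filters("Visually impressive!<br /><br />It is all about MJ.", TOY_FILTERS)
print(tokens)
```

Because the filters run in sequence, their order matters: here tags are removed before punctuation, otherwise the angle brackets would be stripped first and the tag names would survive as words.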

# Data Pre-processing

Word2Vec expects single sentences as inputs, each one as a list of words. 
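
As a rough illustration of that input shape, the sketch below turns one review into a list of sentences, each a list of word tokens. The function `toy_split_review` and its regex-based splitting rule are assumptions made for this example; they are much cruder than the gensim sentence splitter used in this notebook.

```python
import re

def toy_split_review(review):
    # Naive rule (an assumption): a sentence ends at '.', '!' or '?' followed by whitespace.
    raw_sentences = re.split(r"(?<=[.!?])\s+", review.strip())
    # Each sentence becomes a list of lowercase word tokens.
    sentences = [re.findall(r"[a-z']+", s.lower()) for s in raw_sentences]
    # Skip sentences that contain no tokens at all.
    return [s for s in sentences if s]

corpus = toy_split_review("I saw this movie. It broke my heart!")
print(corpus)
```

The result is a list of lists, which is the structure that gets accumulated over all reviews before training.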

In [81]:
from gensim.summarization.textcleaner import split_sentences

# Split a review into sentences; return a list of sentences, each a list of words
def split_review (raw_review):
  raw_sentences = split_sentences(raw_review)
  clean_sentences = []
  for s in raw_sentences:
    if len(s) > 0:
      clean_sentences.append( clean_sentence(s))
  return clean_sentences

print("Before: ", r1)
print("After: ", split_review(r1))

Before:  "With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it fi

In [89]:
# Prepare input data for Word2Vec (takes a couple of minutes):
all_sentences = []  

print(f'Parsing {len(train_data["review"])} sentences from training set...')
train_size = len(train_data["review"])
for i in range (0, train_size):
    # report progress
    progress = (i+1)/train_size *100
    if( progress%20 == 0 ):
        print(f'   {progress}%')  
    all_sentences += split_review( train_data["review"][i])

print(f'Parsing {len(unlabeled_train["review"])} sentences from unlabeled set...')
unlabel_size = len(unlabeled_train["review"])
for i in range (0, unlabel_size):
    # report progress
    progress = (i+1)/unlabel_size *100
    if( progress%20 == 0 ):
        print(f'   {progress}%')  
    all_sentences += split_review(unlabeled_train["review"][i])

Parsing 25000 sentences from training set...
   20.0%
   40.0%
   60.0%
   80.0%
   100.0%
Parsing 50000 sentences from unlabeled set...
   20.0%
   40.0%
   60.0%
   80.0%
   100.0%


In [98]:
print("Total:", len(all_sentences), "sentences")
print(all_sentences[0])

Total: 792761 sentences
['with', 'all', 'thi', 'stuff', 'go', 'down', 'the', 'moment', 'with', 'start', 'listen', 'hi', 'music', 'watch', 'the', 'odd', 'documentari', 'here', 'and', 'there', 'watch', 'the', 'wiz', 'and', 'watch', 'moonwalk', 'again']


# Word2Vec Training

In [116]:
# Output messages for training
from gensim.models import word2vec
import logging
import sys

logging.basicConfig(
    format='%(asctime)s [%(levelname)s] %(name)s - %(message)s',
    level=logging.INFO,
    datefmt='%Y-%m-%d %H:%M:%S',
    stream=sys.stdout,
)
log = logging.getLogger('notebook')

# parameters
num_features = 300    # Word vector dimensionality                      
min_word_count = 40   # Minimum word count                        
num_workers = 4       # Number of threads to run in parallel
context = 10          # Context window size                                                                                    
downsampling = 1e-3   # Downsample setting for frequent words

# model training
model = word2vec.Word2Vec(all_sentences, workers=num_workers, 
            size=num_features, min_count = min_word_count, 
            window = context, sample = downsampling)

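# Precompute unit-length (L2-normalized) vectors; replace=True discards the raw
# vectors to save memory, so the model cannot be trained further afterwards
model.init_sims(replace=True)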

Training model...
2021-03-11 09:48:39 [INFO] gensim.models.word2vec - collecting all words and their counts
2021-03-11 09:48:39 [INFO] gensim.models.word2vec - PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2021-03-11 09:48:39 [INFO] gensim.models.word2vec - PROGRESS: at sentence #10000, processed 176663 words, keeping 12579 word types
2021-03-11 09:48:39 [INFO] gensim.models.word2vec - PROGRESS: at sentence #20000, processed 353909 words, keeping 17283 word types
2021-03-11 09:48:39 [INFO] gensim.models.word2vec - PROGRESS: at sentence #30000, processed 526263 words, keeping 20712 word types
2021-03-11 09:48:39 [INFO] gensim.models.word2vec - PROGRESS: at sentence #40000, processed 703043 words, keeping 23488 word types
2021-03-11 09:48:39 [INFO] gensim.models.word2vec - PROGRESS: at sentence #50000, processed 877247 words, keeping 25849 word types
2021-03-11 09:48:40 [INFO] gensim.models.word2vec - PROGRESS: at sentence #60000, processed 1049749 words, keeping 2784
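
The `min_word_count = 40` setting means that words seen fewer than 40 times in the corpus are ignored during training. A small standard-library sketch of that pruning idea follows; `build_vocab` and `toy_sentences` are hypothetical names used only for illustration, and the threshold is lowered to 2 so the toy data shows an effect.

```python
from collections import Counter

def build_vocab(sentences, min_count):
    # Count every token across all sentences.
    counts = Counter(word for sentence in sentences for word in sentence)
    # Keep only words that occur at least `min_count` times.
    return {word for word, n in counts.items() if n >= min_count}

toy_sentences = [
    ["watch", "the", "film"],
    ["the", "film", "wa", "odd"],
    ["watch", "the", "wiz"],
]
vocab = build_vocab(toy_sentences, min_count=2)
print(sorted(vocab))
```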

In [121]:
# save model to drive for later use OPTIONAL
from google.colab import drive
drive.mount('/content/drive')

model_save_name = f'Word2Vec({num_features},{min_word_count},{context})'

#!ls /content/drive/MyDrive

path = f"/content/drive/MyDrive/{model_save_name}" 
model.save(path)

Mounted at /content/drive
2021-03-11 10:15:53 [INFO] gensim.utils - saving Word2Vec object under /content/drive/MyDrive/Word2Vec(300,40,10), separately None
2021-03-11 10:15:53 [INFO] gensim.utils - not storing attribute vectors_norm
2021-03-11 10:15:53 [INFO] gensim.utils - not storing attribute cum_table
2021-03-11 10:15:54 [INFO] gensim.utils - saved /content/

Load the trained Word2Vec model

In [None]:
# load trained model OPTIONAL
from gensim.models import Word2Vec
model = Word2Vec.load("/content/drive/MyDrive/Word2Vec(300,40,10)")  # same path used by model.save
model.trainables.syn1neg.shape

In [114]:
model.wv.most_similar("woman")



[('ladi', 0.6253357529640198),
 ('girl', 0.5801959037780762),
 ('man', 0.5739419460296631),
 ('widow', 0.566925048828125),
 ('prostitut', 0.5554071664810181),
 ('women', 0.5501986145973206),
 ('her', 0.5369620323181152),
 ('daughter', 0.5232342481613159),
 ('housewif', 0.5221171379089355),
 ('waitress', 0.5202633142471313)]
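
The scores in this list are cosine similarities between word vectors. A minimal sketch of that ranking with made-up three-dimensional vectors is shown below; `toy_vectors` and `toy_most_similar` are hypothetical and only illustrate the computation, not how gensim implements it internally.

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the two vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors for illustration only.
toy_vectors = {
    "woman": [1.0, 2.0, 0.0],
    "girl": [2.0, 4.0, 0.0],   # same direction as "woman"
    "car": [0.0, 0.0, 3.0],    # orthogonal to "woman"
}

def toy_most_similar(word, vectors, topn=2):
    # Score every other word against `word` and sort by descending similarity.
    scores = [(other, cosine_similarity(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

ranked = toy_most_similar("woman", toy_vectors)
print(ranked)
```

Since cosine similarity depends only on direction, vectors that have been normalized to unit length can be compared with a plain dot product.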

In [123]:
model.wv["man"] # word vec

array([-0.06165772,  0.02414589,  0.0058367 , -0.00124058,  0.08935915,
        0.07299625, -0.04081979,  0.08348368,  0.06279206,  0.0464688 ,
        0.04704444, -0.00869097,  0.04606373, -0.08394785,  0.01960148,
       -0.05242679, -0.01590137, -0.04255367,  0.0136599 ,  0.03912215,
        0.07074215, -0.02785238, -0.01252544, -0.0279937 , -0.07397655,
       -0.06079627, -0.07359461, -0.08928838,  0.03222402,  0.00256313,
        0.01830097, -0.04056092,  0.0269219 ,  0.02067096, -0.13578847,
        0.04455758, -0.01085038,  0.04109224, -0.08153564,  0.02400051,
       -0.06374152,  0.08122293, -0.04296341, -0.08774059, -0.01360485,
        0.00888975,  0.00342898,  0.01010613,  0.01431305,  0.02722558,
       -0.02642804,  0.05952154,  0.00172234,  0.07008486,  0.12017436,
       -0.10503765,  0.01424578, -0.07151203,  0.02760548, -0.02071025,
       -0.02047265,  0.00166217,  0.02733372,  0.02464361, -0.01315925,
        0.01201706, -0.06160785, -0.03325102,  0.11356603,  0.03

# Build Feature Set

Build the feature set by averaging the word vectors of all in-vocabulary words in each review, giving one fixed-length vector per review.

In [133]:
import numpy as np

# Take a list of words as input, return the average of their word vectors
def get_average_vec(review, n_features=num_features):
    vectorized = [model.wv[word] for word in review if word in model.wv.vocab]
    if not vectorized:
        # No in-vocabulary words: return a zero vector instead of dividing by zero
        return np.zeros(n_features, dtype=np.float32)
    return np.mean(vectorized, axis=0)
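
As a worked example of the averaging step, the sketch below uses a tiny made-up embedding table and plain Python lists. The names `average_vector` and `toy_embeddings` are hypothetical, and returning a zero vector when no word is known is a choice made for this sketch.

```python
def average_vector(words, embeddings, n_features):
    # Keep only the vectors of words that exist in the embedding table.
    known = [embeddings[w] for w in words if w in embeddings]
    if not known:
        # Assumption for this sketch: fall back to a zero vector.
        return [0.0] * n_features
    # Average each dimension across the known word vectors.
    return [sum(column) / len(known) for column in zip(*known)]

# Made-up two-dimensional embeddings for illustration only.
toy_embeddings = {"good": [1.0, 0.0], "film": [0.0, 1.0]}

avg = average_vector(["good", "film", "zzz"], toy_embeddings, n_features=2)
print(avg)  # "zzz" is out of vocabulary and is skipped
```

Out-of-vocabulary words do not count toward the denominator, so the average is taken over two vectors here, not three.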

Apply the same preprocessing that was used for Word2Vec training so that the review tokens match the model's vocabulary.

In [135]:
clean_train_reviews = []  

print(f'Processing {len(train_data["review"])} training reviews...')
train_size = len(train_data["review"])
for i in range (0, train_size):
    # report progress
    progress = (i+1)/train_size *100
    if( progress%20 == 0 ):
        print(f'   {progress}%')  
    avg_v = get_average_vec(clean_sentence(train_data["review"][i]))
    clean_train_reviews.append(avg_v)

'''
clean_test_reviews = [] 

print(f'Processing {len(test_data["review"])} testing reviews...')
test_size = len(test_data["review"])
for i in range (0, test_size):
    # report progress
    progress = (i+1)/test_size *100
    if( progress%20 == 0 ):
        print(f'   {progress}%')  
    avg_v = get_average_vec(clean_sentence(test_data["review"][i]))
    clean_test_reviews.append(avg_v)
'''

Processing 25000 training reviews...
   20.0%
   40.0%
   60.0%
   80.0%
   100.0%
Processing 25000 testing reviews...
   20.0%
   40.0%
   60.0%
   80.0%
   100.0%


# Classifier Modeling

## Random forest



In [140]:
# splitting train test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(clean_train_reviews, train_data["sentiment"], test_size=0.2, random_state=42)
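
To show what the 80/20 split with a fixed seed amounts to, here is a rough standard-library sketch. `toy_train_test_split` is a hypothetical helper; it mimics the idea of a seeded shuffle followed by a cut, and is not scikit-learn's actual algorithm, so it will not reproduce the same partition.

```python
import random

def toy_train_test_split(X, y, test_size=0.2, seed=42):
    # Shuffle the row indices with a seeded generator so the split is repeatable.
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    # The first `n_test` shuffled indices form the test set, the rest the training set.
    n_test = int(round(len(X) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

# Toy data: the feature of row i is [i] and its label is i % 2.
toy_X = [[i] for i in range(10)]
toy_y = [i % 2 for i in range(10)]
Xtr, Xte, ytr, yte = toy_train_test_split(toy_X, toy_y)
print(len(Xtr), len(Xte))
```

Fixing the seed keeps the held-out reviews the same between runs, which makes accuracy figures from different classifiers comparable.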

In [142]:
from sklearn.ensemble import RandomForestClassifier as rfc

RF = rfc(n_estimators=100)

# train
RF = RF.fit(X_train, y_train)

print("Test accuracy:" ,RF.score(X_test, y_test))

# predict
#result = RF.predict(X_test)


Test accuracy: 0.8316


## SVM

## Bayes