## Kaggle - Word2Vec Tutorial

In [1]:
import pandas as pd

# Read the data
train = pd.read_csv("labeledTrainData.tsv", header=0, delimiter="\t", quoting=3)
test = pd.read_csv("testData.tsv", header=0, delimiter="\t", quoting=3)

# Word2Vec can learn from unlabeled data, so the 50,000 additional reviews
# without labels in unlabeledTrainData.tsv can be used as well
unlabeled_train = pd.read_csv("unlabeledTrainData.tsv", header=0, delimiter="\t", quoting=3)
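
The `quoting=3` argument above can look opaque. It is the integer value of `csv.QUOTE_NONE`, which tells the parser to treat double quotes as ordinary characters. The sketch below shows the effect with the standard-library `csv` module; the sample row is made up for illustration and is not taken from the dataset.

```python
import csv
import io

# Hypothetical sample in the same tab-separated layout as labeledTrainData.tsv.
sample = 'id\tsentiment\treview\n"5814_8"\t1\t"He said "wow" and left."\n'

# csv.QUOTE_NONE is the constant behind quoting=3: quotes are not interpreted.
reader = csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = list(reader)
header, first = rows[0], rows[1]

print(csv.QUOTE_NONE)  # the integer passed as quoting=3
print(first)           # fields keep their literal double quotes
```

Because reviews contain unbalanced quotation marks, leaving quotes uninterpreted avoids rows being merged or split incorrectly.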

In [2]:
unlabeled_train.shape

(50000, 2)

In [3]:
test.shape

(25000, 2)

In [4]:
train.shape

(25000, 3)

In [5]:
print("Unlabeled_train Data --- ", list(unlabeled_train.columns.values))
print("Train Data --- ", list(train.columns.values))
print("Test Data --- ", list(test.columns.values))

Unlabeled_train Data ---  ['id', 'review']
Train Data ---  ['id', 'sentiment', 'review']
Test Data ---  ['id', 'review']


In total there are 100,000 reviews: 25,000 labeled training reviews, 25,000 test reviews, and 50,000 unlabeled reviews.

#### Prepare the data for input to Word2Vec
**Word2Vec expects its input as a list of lists: each sentence is a list of words.**

To split a paragraph into sentences, use NLTK's punkt tokenizer.

In [6]:
import nltk.data
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

In [7]:
from sentence_splitting import review_to_sentences
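
The `sentence_splitting` module is a local helper that is not shown in this notebook. A minimal sketch of what it might contain follows. The function bodies, the regex-based HTML cleanup, and the `NaiveSentenceTokenizer` class are all assumptions made so the example runs without NLTK data; the real helper may clean the text differently.

```python
import re

def review_to_wordlist(review, remove_stopwords=False, stopwords=frozenset()):
    """Turn one raw review (or sentence) into a list of lowercase words."""
    text = re.sub(r"<[^>]+>", " ", review)   # crude HTML tag removal (assumed stand-in)
    text = re.sub(r"[^a-zA-Z]", " ", text)   # keep letters only
    words = text.lower().split()
    if remove_stopwords:
        words = [w for w in words if w not in stopwords]
    return words

def review_to_sentences(review, tokenizer, remove_stopwords=False):
    """Split a review into sentences; each sentence becomes a list of words."""
    sentences = []
    for raw_sentence in tokenizer.tokenize(review.strip()):
        if raw_sentence:
            sentences.append(review_to_wordlist(raw_sentence, remove_stopwords))
    return sentences

class NaiveSentenceTokenizer:
    """Hypothetical stand-in for the punkt tokenizer: splits after . ! or ?"""
    def tokenize(self, text):
        return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

demo = review_to_sentences("I loved it. <br />Would watch again!",
                           NaiveSentenceTokenizer())
print(demo)
```

Any object with a `tokenize(text)` method can be passed in, so the punkt tokenizer loaded above would slot into the same call.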

In [8]:
sentences = []  # Initialize an empty list of sentences

In [12]:
print("Parsing sentences from training set")
for review in train["review"]:
    sentences += review_to_sentences(review, tokenizer)

Parsing sentences from training set




In [13]:
print("Parsing sentences from unlabeled set")
for review in unlabeled_train["review"]:
    sentences += review_to_sentences(review, tokenizer)

Parsing sentences from unlabeled set




In [14]:
print(len(sentences))

795553


In [20]:
print(sentences[0])

['with', 'all', 'this', 'stuff', 'going', 'down', 'at', 'the', 'moment', 'with', 'mj', 'i', 've', 'started', 'listening', 'to', 'his', 'music', 'watching', 'the', 'odd', 'documentary', 'here', 'and', 'there', 'watched', 'the', 'wiz', 'and', 'watched', 'moonwalker', 'again']


### Training and Saving The Model
- With the list of nicely parsed sentences, we're ready to train the model.

In [23]:
# Import the built-in logging module and configure it so that Word2Vec 
# creates nice output messages
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [25]:
# Set values for various parameters
num_features = 300    # Word vector dimensionality
min_word_count = 40   # Minimum word count
num_workers = 4       # Number of threads to run in parallel
context = 10          # Context window size     
downsampling = 1e-3   # Downsample setting for frequent words
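
To give a feel for the `downsampling` setting: in the original word2vec C implementation, each occurrence of a frequent word is kept with a probability that shrinks as the word's count grows. The sketch below reproduces that formula under the assumption that gensim's `sample` parameter behaves the same way; the function name and the toy counts are made up for illustration.

```python
import math

def keep_probability(word_count, total_words, sample=1e-3):
    """Probability of keeping one occurrence of a word during training.

    Follows the subsampling formula of the original word2vec C code;
    treating gensim's `sample` as equivalent is an assumption.
    """
    threshold = sample * total_words
    p = (math.sqrt(word_count / threshold) + 1) * (threshold / word_count)
    return min(1.0, p)

# Toy corpus of one million words (hypothetical counts).
frequent = keep_probability(50_000, 1_000_000)  # a very common word
rare = keep_probability(500, 1_000_000)         # a rarer word
print(frequent, rare)
```

Rare words are always kept, while very common words are dropped most of the time, which speeds up training and reduces their dominance in context windows.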

In [27]:
# Initialize and train the model
from gensim.models import word2vec
print("Model is training and money is rolling af")
# Note: gensim 4.0+ renames the `size` argument to `vector_size`
model = word2vec.Word2Vec(sentences, workers=num_workers, size=num_features,
                          min_count=min_word_count, window=context, sample=downsampling)
print("Model is trained...Fuck yeah")

2019-05-10 19:56:24,958 : INFO : collecting all words and their counts
2019-05-10 19:56:24,959 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types


Model is training and money is rolling af


2019-05-10 19:56:25,043 : INFO : PROGRESS: at sentence #10000, processed 226019 words, keeping 17775 word types
2019-05-10 19:56:25,135 : INFO : PROGRESS: at sentence #20000, processed 452117 words, keeping 24945 word types
2019-05-10 19:56:25,209 : INFO : PROGRESS: at sentence #30000, processed 671351 words, keeping 30027 word types
2019-05-10 19:56:25,285 : INFO : PROGRESS: at sentence #40000, processed 897864 words, keeping 34346 word types
2019-05-10 19:56:25,352 : INFO : PROGRESS: at sentence #50000, processed 1117056 words, keeping 37759 word types
2019-05-10 19:56:25,422 : INFO : PROGRESS: at sentence #60000, processed 1338402 words, keeping 40716 word types
2019-05-10 19:56:25,491 : INFO : PROGRESS: at sentence #70000, processed 1561619 words, keeping 43321 word types
2019-05-10 19:56:25,562 : INFO : PROGRESS: at sentence #80000, processed 1780986 words, keeping 45713 word types
2019-05-10 19:56:25,631 : INFO : PROGRESS: at sentence #90000, processed 2004982 words, keeping 4812

2019-05-10 19:56:30,684 : INFO : PROGRESS: at sentence #730000, processed 16331861 words, keeping 118954 word types
2019-05-10 19:56:30,756 : INFO : PROGRESS: at sentence #740000, processed 16552718 words, keeping 119665 word types
2019-05-10 19:56:30,819 : INFO : PROGRESS: at sentence #750000, processed 16771272 words, keeping 120295 word types
2019-05-10 19:56:30,897 : INFO : PROGRESS: at sentence #760000, processed 16990689 words, keeping 120929 word types
2019-05-10 19:56:30,973 : INFO : PROGRESS: at sentence #770000, processed 17217953 words, keeping 121702 word types
2019-05-10 19:56:31,049 : INFO : PROGRESS: at sentence #780000, processed 17448128 words, keeping 122402 word types
2019-05-10 19:56:31,128 : INFO : PROGRESS: at sentence #790000, processed 17675202 words, keeping 123066 word types
2019-05-10 19:56:31,173 : INFO : collected 123504 word types from a corpus of 17798519 raw words and 795553 sentences
2019-05-10 19:56:31,174 : INFO : Loading a fresh vocabulary
2019-05-10

2019-05-10 19:57:22,145 : INFO : EPOCH 3 - PROGRESS: at 8.67% examples, 546885 words/s, in_qsize 7, out_qsize 0
2019-05-10 19:57:23,158 : INFO : EPOCH 3 - PROGRESS: at 13.19% examples, 553040 words/s, in_qsize 7, out_qsize 0
2019-05-10 19:57:24,171 : INFO : EPOCH 3 - PROGRESS: at 17.75% examples, 556067 words/s, in_qsize 7, out_qsize 0
2019-05-10 19:57:25,190 : INFO : EPOCH 3 - PROGRESS: at 22.20% examples, 555761 words/s, in_qsize 7, out_qsize 0
2019-05-10 19:57:26,192 : INFO : EPOCH 3 - PROGRESS: at 26.62% examples, 557002 words/s, in_qsize 7, out_qsize 0
2019-05-10 19:57:27,198 : INFO : EPOCH 3 - PROGRESS: at 31.14% examples, 558567 words/s, in_qsize 7, out_qsize 0
2019-05-10 19:57:28,198 : INFO : EPOCH 3 - PROGRESS: at 35.56% examples, 558331 words/s, in_qsize 7, out_qsize 0
2019-05-10 19:57:29,211 : INFO : EPOCH 3 - PROGRESS: at 39.96% examples, 558361 words/s, in_qsize 7, out_qsize 0
2019-05-10 19:57:30,223 : INFO : EPOCH 3 - PROGRESS: at 44.38% examples, 558365 words/s, in_qsize

2019-05-10 19:58:27,969 : INFO : EPOCH 5 - PROGRESS: at 95.17% examples, 547713 words/s, in_qsize 7, out_qsize 0
2019-05-10 19:58:28,980 : INFO : EPOCH 5 - PROGRESS: at 99.61% examples, 548516 words/s, in_qsize 7, out_qsize 0
2019-05-10 19:58:29,036 : INFO : worker thread finished; awaiting finish of 3 more threads
2019-05-10 19:58:29,038 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-05-10 19:58:29,046 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-05-10 19:58:29,052 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-05-10 19:58:29,054 : INFO : EPOCH - 5 : training on 17798519 raw words (12749083 effective words) took 23.2s, 548911 effective words/s
2019-05-10 19:58:29,061 : INFO : training on a 88992595 raw words (63746972 effective words) took 117.3s, 543649 effective words/s


Model is trained...Fuck yeah


In [28]:
# If you don't plan to train the model any further, calling 
# init_sims will make the model much more memory-efficient.
model.init_sims(replace=True)

2019-05-10 22:09:02,562 : INFO : precomputing L2-norms of word weight vectors
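
The "precomputing L2-norms" message refers to scaling every word vector to unit length. With `replace=True` the unit-length vectors overwrite the raw ones, which is why memory use drops and further training is no longer sensible. A minimal pure-Python sketch of the normalization, using made-up two-dimensional vectors:

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (zero vectors are left alone)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)

# Hypothetical toy vectors; the real model uses 300 dimensions.
toy_vectors = {"good": [3.0, 4.0], "bad": [-2.0, 0.0]}
normalized = {word: l2_normalize(vec) for word, vec in toy_vectors.items()}
print(normalized)
```

Once vectors have unit length, the cosine similarity between two words reduces to a plain dot product.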


In [29]:
# Create a model name and save the model for later use;
# to load it later, use Word2Vec.load()
model_name = "300features_40minwords_10context"
model.save(model_name)

2019-05-10 22:10:25,450 : INFO : saving Word2Vec object under 300features_40minwords_10context, separately None
2019-05-10 22:10:25,515 : INFO : not storing attribute vectors_norm
2019-05-10 22:10:25,553 : INFO : not storing attribute cum_table
2019-05-10 22:10:26,095 : INFO : saved 300features_40minwords_10context


### Check out the created model 

In [30]:
model.wv.doesnt_match("man woman child kitchen".split())


'kitchen'

In [31]:
model.wv.doesnt_match("france england germany berlin".split())


'berlin'

In [32]:
model.wv.doesnt_match("paris berlin london austria".split())


'paris'
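
To see how an odd-one-out query like the ones above can be answered, here is a small sketch of the idea: normalize each word vector, average them, and return the word least similar to that average. The toy vectors are invented for illustration, and the function is a simplified stand-in rather than gensim's actual code.

```python
import math

def unit(vector):
    """Return the vector scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def doesnt_match(words, vectors):
    """Return the word whose vector is least similar to the mean direction."""
    units = [unit(vectors[w]) for w in words]
    dim = len(units[0])
    mean = unit([sum(u[i] for u in units) / len(units) for i in range(dim)])
    sims = [sum(a * b for a, b in zip(u, mean)) for u in units]
    return words[sims.index(min(sims))]

# Hypothetical 2-D vectors: three "people" words point one way, "kitchen" another.
toy = {
    "man": [1.0, 0.1],
    "woman": [0.9, 0.2],
    "child": [0.8, 0.3],
    "kitchen": [0.1, 1.0],
}
odd_one = doesnt_match(["man", "woman", "child", "kitchen"], toy)
print(odd_one)
```

When no word clearly stands apart from the mean, the answer can be surprising, as the 'paris' result shows.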

In [34]:
model.wv.most_similar("man")

[('woman', 0.6361821889877319),
 ('lady', 0.5945104360580444),
 ('lad', 0.5891189575195312),
 ('monk', 0.5783478617668152),
 ('chap', 0.527835488319397),
 ('millionaire', 0.5215615034103394),
 ('guy', 0.5157663226127625),
 ('farmer', 0.5132026672363281),
 ('soldier', 0.5089080929756165),
 ('men', 0.5038425922393799)]

In [36]:
model.wv.most_similar("queen")

[('princess', 0.6379745602607727),
 ('maid', 0.5924339890480042),
 ('victoria', 0.5898520946502686),
 ('bride', 0.5822241902351379),
 ('showgirl', 0.5775257349014282),
 ('mistress', 0.5619137287139893),
 ('eva', 0.5530508756637573),
 ('countess', 0.5480133295059204),
 ('fatale', 0.54304039478302),
 ('starlet', 0.5383342504501343)]

In [37]:
model.wv.most_similar("awful")

[('terrible', 0.7853543162345886),
 ('horrible', 0.7502487897872925),
 ('atrocious', 0.7373061180114746),
 ('horrendous', 0.7108439207077026),
 ('abysmal', 0.7098503112792969),
 ('dreadful', 0.7081030607223511),
 ('appalling', 0.6880141496658325),
 ('horrid', 0.6760132312774658),
 ('lousy', 0.6248980760574341),
 ('laughable', 0.6150200366973877)]

Our model is capable of distinguishing differences in meaning, although not perfectly: in the last `doesnt_match` query it picked 'paris' rather than 'austria'. **The model has learned semantic relationships between words.**
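
The scores returned by `most_similar` are cosine similarities between word vectors. The following sketch ranks neighbours the same way on a handful of invented two-dimensional vectors; the function names and values are illustrative assumptions, not the library's implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(word, vectors, topn=3):
    """Rank every other word by cosine similarity to `word`, highest first."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical toy vectors.
toy = {
    "awful": [1.0, 0.0],
    "terrible": [0.9, 0.1],
    "lousy": [0.6, 0.5],
    "great": [-1.0, 0.2],
}
neighbours = most_similar("awful", toy)
print(neighbours)
```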

According to the docs, the Word2Vec model holds a feature vector for each word in the vocabulary, stored in a NumPy array called "syn0". In newer gensim versions the same array is available as "model.wv.vectors".

In [43]:
type(model.wv.syn0)



numpy.ndarray

In [45]:
model.wv.syn0.shape



(16490, 300)

In [47]:
type(model.wv.vectors)

numpy.ndarray

In [49]:
model.wv.vectors.shape

(16490, 300)

The number of rows is the number of words in the model's vocabulary, and the number of columns is the size of the feature vector (num_features = 300).
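
The vocabulary has 16,490 rows even though the training log reports 123,504 word types collected. The gap is presumably explained by `min_count=40`: words seen fewer than 40 times are dropped before training. A minimal sketch of that pruning step, with made-up sentences and a hypothetical `build_vocab` helper:

```python
from collections import Counter

def build_vocab(sentences, min_count):
    """Keep only words that occur at least `min_count` times in the corpus."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return sorted(word for word, count in counts.items() if count >= min_count)

# Hypothetical toy corpus in the same list-of-lists format used for training.
toy_sentences = [
    ["the", "movie", "was", "great"],
    ["the", "plot", "was", "thin"],
    ["the", "end"],
]
vocab = build_vocab(toy_sentences, min_count=2)
print(vocab)
```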

In [52]:
model.wv["flower"]



array([ 0.00740212,  0.10584512,  0.00791013,  0.01596658,  0.02414292,
        0.02893355,  0.1469221 ,  0.03602852,  0.00566807, -0.04403942,
       -0.03476333,  0.07901409, -0.0525446 ,  0.05537491, -0.01916643,
        0.08392595,  0.05564789,  0.00716173,  0.05965544, -0.01869686,
        0.0145442 ,  0.06206767,  0.01221299, -0.07976333, -0.01082874,
       -0.03657145,  0.01086357,  0.01661276, -0.02401651, -0.00501237,
        0.00330291,  0.01660081, -0.11201243, -0.03290961,  0.00296581,
       -0.07590261,  0.01477253,  0.03373786, -0.12277917,  0.04188901,
       -0.01258932,  0.0175688 ,  0.03103006,  0.00069398, -0.03225007,
       -0.03792301,  0.05510093,  0.07353526,  0.01316158,  0.04571639,
        0.07401988,  0.00230472, -0.06641226, -0.00203784, -0.09438307,
        0.00041785, -0.03966195, -0.04849317,  0.04449907, -0.05610603,
        0.14267004,  0.04641591,  0.01891815, -0.04306285, -0.05566777,
        0.0195535 ,  0.05449059, -0.07047797,  0.06007987, -0.02

Each word is represented as a vector in 300-dimensional space.

### From Words To Paragraphs, Attempt 1: Vector Averaging

Calculate average feature vectors for the training and test sets.

In [57]:
from cleanData_word2vec import review_to_wordlist
clean_train_reviews = []
for review in train["review"]:
    clean_train_reviews.append(review_to_wordlist(review, remove_stopwords=True))

In [63]:
from vecAveraging import getAvgFeatureVecs
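
The `vecAveraging` module is another local helper that is not shown. A rough sketch of the averaging idea follows: for each review, average the vectors of the words the model knows and skip the rest. The snake_case function names, the plain-dict vocabulary, and the zero-vector fallback for reviews with no known words are all assumptions made to keep the example self-contained.

```python
def make_feature_vec(words, word_vectors, num_features):
    """Average the vectors of all in-vocabulary words of one review."""
    total = [0.0] * num_features
    known = 0
    for word in words:
        vec = word_vectors.get(word)
        if vec is None:          # word was pruned or never seen: skip it
            continue
        total = [t + x for t, x in zip(total, vec)]
        known += 1
    # Assumed fallback: a review with no known words maps to the zero vector.
    return [t / known for t in total] if known else total

def get_avg_feature_vecs(reviews, word_vectors, num_features):
    """Return one averaged feature vector per review."""
    return [make_feature_vec(r, word_vectors, num_features) for r in reviews]

# Hypothetical toy vocabulary with 2-D vectors.
toy_vectors = {"good": [1.0, 0.0], "film": [0.0, 1.0]}
averaged = get_avg_feature_vecs([["good", "film", "zzz"], ["zzz"]], toy_vectors, 2)
print(averaged)
```

Every review ends up with a vector of the same length regardless of how many words it contains, which is what lets a standard classifier consume it.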

In [64]:
trainDataVecs = getAvgFeatureVecs( clean_train_reviews, model, num_features )

Review 0.0 of 25000
Review 1000.0 of 25000
Review 2000.0 of 25000
Review 3000.0 of 25000
Review 4000.0 of 25000
Review 5000.0 of 25000
Review 6000.0 of 25000
Review 7000.0 of 25000
Review 8000.0 of 25000
Review 9000.0 of 25000
Review 10000.0 of 25000
Review 11000.0 of 25000
Review 12000.0 of 25000
Review 13000.0 of 25000
Review 14000.0 of 25000
Review 15000.0 of 25000
Review 16000.0 of 25000
Review 17000.0 of 25000
Review 18000.0 of 25000
Review 19000.0 of 25000
Review 20000.0 of 25000
Review 21000.0 of 25000
Review 22000.0 of 25000
Review 23000.0 of 25000
Review 24000.0 of 25000


In [66]:
print("Creating average feature vecs for test reviews")
clean_test_reviews = []
for review in test["review"]:
    clean_test_reviews.append(review_to_wordlist(review,remove_stopwords=True))

testDataVecs = getAvgFeatureVecs(clean_test_reviews, model, num_features)

Creating average feature vecs for test reviews
Review 0.0 of 25000
Review 1000.0 of 25000
Review 2000.0 of 25000
Review 3000.0 of 25000
Review 4000.0 of 25000
Review 5000.0 of 25000
Review 6000.0 of 25000
Review 7000.0 of 25000
Review 8000.0 of 25000
Review 9000.0 of 25000
Review 10000.0 of 25000
Review 11000.0 of 25000
Review 12000.0 of 25000
Review 13000.0 of 25000
Review 14000.0 of 25000
Review 15000.0 of 25000
Review 16000.0 of 25000
Review 17000.0 of 25000
Review 18000.0 of 25000
Review 19000.0 of 25000
Review 20000.0 of 25000
Review 21000.0 of 25000
Review 22000.0 of 25000
Review 23000.0 of 25000
Review 24000.0 of 25000


Use the averaged review vectors to train a random forest.

In [67]:
# Fit a random forest to the training data, using 100 trees
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier( n_estimators = 100 )



In [68]:
print("Fitting a random forest to labeled training data...")
forest = forest.fit( trainDataVecs, train["sentiment"] )

Fitting a random forest to labeled training data...


In [69]:
# Test & extract results 
result = forest.predict(testDataVecs)

In [70]:
# Write the test results 
output = pd.DataFrame(data={"id":test["id"], "sentiment":result})
output.to_csv("Word2Vec_AverageVectors.csv", index=False, quoting=3)

This approach underperformed Bag of Words by a few percentage points.