# Word2Vec exploration
This notebook explores word2vec and is based on [this problem set (in Japanese)](http://www.cl.ecei.tohoku.ac.jp/nlp100/).  
To keep the computation manageable, we'll use a 1/100 sample of English Wikipedia articles with more than 400 words, as of January 12, 2015. We'll focus on country names here.  
The contents include the following:  

- Normalize data and construct the corpus
- Construct word vectors from scratch with dimensionality reduction
- Construct another set of word vectors with Google's word2vec
- Compare the two vector sets on some country-related analogies

## Country name normalizer

To begin with, since we're going to deal with country names, we want a normalized country name list so that multi-word names are not split into separate tokens.  
We are going to use data from [here](https://mledoze.github.io/countries/) (thanks to mledoze and all the contributors for their work).

In [1]:
# Create a normalizer that does the following:
# - Extract common data from countries json data
# - Convert all country names to lower case
# - Replace spaces in country names with _

from __future__ import print_function
import json
import urllib2
import re

In [2]:
class CountryNameNormalizer:
  def __init__(self):
    country_json = urllib2.urlopen("https://raw.githubusercontent.com/mledoze/countries/master/dist/countries.json").read()
    country_names = [record["name"]["common"] for record in json.loads(country_json)]
    self._normalizer_dict = {
      country.lower() : re.sub(" ", "_", country.lower())
      for country in country_names
    }

  def normalize_line(self, line):
    for k, v in self._normalizer_dict.items():
      # Escape the name so characters such as "." or "(" are matched literally
      line = re.sub(re.escape(k), v, line)
    return line
  
# Test the normalizer
sample_sentence = "The United States is in Northern America."
print(CountryNameNormalizer().normalize_line(sample_sentence.lower()))

the united_states is in northern america.
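
The normalizer above runs one substitution per country name on every line, which may account for a good share of the corpus generation time reported later. A minimal sketch of a single-pass alternative follows. It assumes a small hard-coded sample list in place of the downloaded JSON; `SAMPLE_COUNTRIES` and `build_normalizer` are hypothetical names, not part of the notebook.

```python
import re

# Hypothetical sample; the notebook loads the full list from countries.json.
SAMPLE_COUNTRIES = ["United States", "United Kingdom", "South Korea", "Oman"]

def build_normalizer(country_names):
    """Return a function that joins multi-word country names with underscores."""
    # Single-word names need no rewriting, so only multi-word names are kept.
    multi_word = sorted(
        (name.lower() for name in country_names if " " in name),
        key=len,
        reverse=True,  # try longer names first when names overlap
    )
    # One compiled alternation: re.escape guards against regex metacharacters
    # and \b keeps a name from matching inside a longer word.
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(name) for name in multi_word) + r")\b"
    )

    def normalize_line(line):
        return pattern.sub(lambda m: m.group(0).replace(" ", "_"), line)

    return normalize_line

normalize_line = build_normalizer(SAMPLE_COUNTRIES)
normalized = normalize_line("the united states is in northern america.")
```

Here `normalized` matches the output of the notebook's own normalizer for the same sentence, while each line is scanned once instead of once per country.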


## Corpus generation

Now let's load the data and normalize it to generate the corpus. The country name normalization above is applied as part of this step.

In [3]:
# Only needs to run the first time (creates ./data and downloads the sample)
!mkdir data
!curl http://www.cl.ecei.tohoku.ac.jp/nlp100/data/enwiki-20150112-400-r100-10576.txt.bz2 -o ./data/enwiki-20150112-400-r100-10576.txt.bz2

mkdir: data: File exists
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 21.0M  100 21.0M    0     0  5437k      0  0:00:03  0:00:03 --:--:-- 5437k


In [4]:
import bz2
# We'll save the corpus to a file in case we need to start from the beginning again.

def clean_word(word):
  result = word.lower()
  # Strip leading and trailing punctuation
  result = re.sub(r"^[ $,.:;?!`+<>'\"\[\]()*%#&\n]+", "", result)
  result = re.sub(r"[ $,.:;?!`+<>'\"\[\]()*%#&\n]+$", "", result)
  # Collapse every run of digits into a single 0
  result = re.sub("[0-9]+", "0", result)
  return result

def generate_corpus(infile, outfile):
  country_normalizer = CountryNameNormalizer()

  i = 0
  with bz2.BZ2File(infile, "r") as fin, open(outfile, "w") as fout:
    for line in fin:
      i += 1
      cleaned = " ".join([clean_word(word) for word in line.decode("utf-8").split(" ")])
      cleaned = country_normalizer.normalize_line(cleaned)
      if cleaned != "":
        print(cleaned.encode("utf-8"), file=fout)

data_file = "./data/enwiki-20150112-400-r100-10576.txt.bz2"
corpus_file = "./data/normalized_corpus.txt"

In [5]:
%%time
# -*- coding:utf-8 -*-

# generate_corpus writes to corpus_file and returns nothing
generate_corpus(data_file, corpus_file)

CPU times: user 2h 9min 1s, sys: 3min 33s, total: 2h 12min 35s
Wall time: 2h 18min 37s


## Build the vectors set from scratch

With the corpus ready, we want to build the vector set. To do that we:

- Build the co-occurrence matrix, weighted by PPMI (Positive Pointwise Mutual Information)
- Reduce the dimensionality of the matrix to form the final vector set

Each entry $X_{tc}$ of the matrix is a PPMI value:
$$ X_{tc} = \mathrm{PPMI}(t, c) = \max\left(\log\frac{N \cdot f(t,c)}{f(t,*) \cdot f(*,c)}, 0\right)$$
This applies only when $f(t,c) \geq 10$; otherwise $X_{tc} = 0$.

Where:

- $f(t,c)$ is the number of times word $t$ co-occurs with context word $c$
- $f(t,*)$ is the number of co-occurrence pairs in which $t$ is the target word
- $f(*,c)$ is the number of co-occurrence pairs in which $c$ is the context word
- $N$ is the total number of co-occurrence pairs
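
To make the definitions above concrete, here is a minimal, self-contained sketch that counts co-occurrence pairs in a toy sentence and computes PPMI from them. The function names `count_pairs` and `ppmi`, the `min_count` parameter, and the toy sentence are assumptions for illustration; they are not part of the notebook's pipeline.

```python
import math
from collections import Counter

def count_pairs(tokens, k=4):
    """Count (target, context) pairs within k words on either side."""
    f_tc, f_t, f_c = Counter(), Counter(), Counter()
    for i, target in enumerate(tokens):
        for j in range(max(i - k, 0), min(i + k + 1, len(tokens))):
            if i == j:
                continue
            f_tc[(target, tokens[j])] += 1  # f(t, c)
            f_t[target] += 1                # f(t, *)
            f_c[tokens[j]] += 1             # f(*, c)
    return f_tc, f_t, f_c

def ppmi(t, c, f_tc, f_t, f_c, min_count=10):
    """PPMI(t, c), or 0 when the pair is seen fewer than min_count times."""
    pair_count = f_tc[(t, c)]
    if pair_count < min_count:
        return 0.0
    n = sum(f_tc.values())  # N, the total number of pairs
    return max(math.log(float(n) * pair_count / (f_t[t] * f_c[c])), 0.0)

tokens = "a b a b c d".split()
f_tc, f_t, f_c = count_pairs(tokens, k=1)
# min_count=1 because the toy sentence is far too short for a threshold of 10
score_ab = ppmi("a", "b", f_tc, f_t, f_c, min_count=1)
score_ac = ppmi("a", "c", f_tc, f_t, f_c, min_count=1)
```

With `k=1` the toy sentence yields $N = 10$ pairs, $f(a,b) = 3$, $f(a,*) = 3$ and $f(*,b) = 4$, so $\mathrm{PPMI}(a,b) = \log(30/12) = \log 2.5 \approx 0.92$. The pair (a, c) never occurs, so its score is 0.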

In [6]:
# The co-occurrence window covers 4 words before and after each target word
from collections import Counter, OrderedDict
from sklearn.decomposition import TruncatedSVD
from scipy import sparse
import numpy as np
import pickle

t_encode_file = "./data/t_encode.pickle"
vectors_file = "./data/vectors.pickle"

def corpus2vectors(corpus_path, t_encode_path, vectors_path, k=4, batch_size=100000):
  buffer_tc, buffer_t, buffer_c = [], [], []
  batch = 0
  counter_tc = Counter()
  counter_t = Counter()
  counter_c = Counter()

  with open(corpus_path, "r") as fi:
    for line in fi:
      splited = line.decode("utf-8").rstrip("\n").split(" ")
      # Skip single word without pairing
      if len(splited) == 1:
        continue
      for i, token in enumerate(splited):
        for j in range(max(i-k,0), min(i+k+1, len(splited))):
          if i == j:
            continue
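          # Buffer the pair, then flush the buffers into the counters once they are full
          buffer_tc.append(" ".join([token, splited[j]]))
          buffer_t.append(token)
          buffer_c.append(splited[j])
          batch += 1
          if batch >= batch_size:
            counter_tc.update(buffer_tc)
            counter_t.update(buffer_t)
            counter_c.update(buffer_c)
            buffer_tc, buffer_t, buffer_c = [], [], []
            batch = 0

  # Flush whatever is left in the buffers after the last line
  counter_tc.update(buffer_tc)
  counter_t.update(buffer_t)
  counter_c.update(buffer_c)
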
  #   Pre-calculate for the model
  t_size = len(counter_t)
  c_size = len(counter_c)
  t_encoder = OrderedDict({key : i for i, key in enumerate(counter_t.keys())})
  c_encoder = OrderedDict({key : i for i, key in enumerate(counter_c.keys())})
  n = sum(counter_tc.values())
  with open(t_encode_path, "wb") as fopen:
    pickle.dump(t_encoder, fopen)

  # Fill a sparse matrix with the PPMI of every pair seen at least 10 times
  sparse_m = sparse.lil_matrix((t_size, c_size))
  for pair, count in counter_tc.items():
    if count < 10:
      continue
    t, c = pair.split(" ")
    ppmi = max(np.log(n) + np.log(count) - np.log(counter_t[t]) - np.log(counter_c[c]), 0)
    sparse_m[t_encoder[t], c_encoder[c]] = ppmi

  # Use TruncatedSVD to reduce the dimension
  decomposer = TruncatedSVD(n_components = 300)
  decomposed = decomposer.fit_transform(sparse_m)
  with open(vectors_path, "wb") as fopen:
    pickle.dump(decomposed, fopen)


In [7]:
%%time

corpus2vectors(corpus_file, t_encode_file, vectors_file)

CPU times: user 12min 39s, sys: 54.2 s, total: 13min 33s
Wall time: 12min 53s


## Play around with the vector sets

In [8]:
%%time
import pickle

with open(vectors_file, "rb") as fo:
  vectors = pickle.load(fo)
with open(t_encode_file, "rb") as fo:
  t_enc = pickle.load(fo)

def acquire_vec(word):
  if word not in t_enc:
    print("can't find the word {}".format(word))
    return
  return vectors[t_enc[word]].reshape(1, -1)

CPU times: user 18.9 s, sys: 5.71 s, total: 24.6 s
Wall time: 25.8 s


In [9]:
acquire_vec("united_states")

array([[ 6.16281422, -5.97383436,  1.0581208 , -3.26995636,  2.68860027,
         0.3521841 , -1.97630152, -1.72122963,  5.17808816, -2.46909246,
        -3.2041226 , -1.42568934,  0.12640769, -3.90115325,  0.04068044,
         0.65825148,  2.17531384,  0.89453377,  0.93187425, -1.08951865,
         0.20583662, -1.21054511, -1.10067284, -0.51607978,  1.46278278,
         0.57653813, -0.26861058, -1.9193653 ,  0.07108861, -0.34940509,
         0.63357496, -1.8316893 , -0.69995353,  2.17526527, -1.47570393,
         0.03694056, -0.68575864,  3.96423008, -0.39579198,  0.50706906,
        -0.04697827,  0.8331905 ,  0.48540395,  1.26561296,  0.06427495,
        -0.56282367,  0.01594325, -1.00673919, -0.04219863, -2.04557386,
         0.33909519, -1.41376507,  0.21739398,  0.17391872,  0.41068706,
        -0.45960995,  1.03457755, -0.68804824,  0.85563211, -0.4170566 ,
        -0.53620743, -0.18710836, -0.87377082,  1.05999815,  0.11985704,
         1.23148534, -0.12812664, -0.52435822, -0.4

In [10]:
# What about the cosine similarity between United States and U.S.

from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(acquire_vec("united_states"), acquire_vec("u.s"))

array([[ 0.81341673]])
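
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on direction and not on magnitude. A minimal pure-Python sketch of that definition follows; the helper name `cosine` and the toy vectors are assumptions for illustration, and the notebook itself relies on scikit-learn's `cosine_similarity`.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine([1.0, 0.0], [0.0, 3.0])      # perpendicular vectors
```

Parallel vectors score 1 (up to floating-point rounding) and perpendicular vectors score 0, which is why a value near 0.81 suggests the two words occur in fairly similar contexts.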

In [11]:
import pandas as pd

In [12]:
# What about some other pairs?

comparison_set = [
  ("united_states", "u.s"),
  ("united_states", "usa"),
  ("united_kingdom", "britain"),
  ("netherlands", "holland"),
]
countries_similarity_scratch = pd.DataFrame(
  {"FromScratchSimilarity" : [
    cosine_similarity(acquire_vec(pair[0]), acquire_vec(pair[1]))[0][0]
    for pair in comparison_set
  ]},
  index=[" & ".join(pair) for pair in comparison_set]
)
countries_similarity_scratch

| Pair | FromScratchSimilarity |
| --- | --- |
| united_states & u.s | 0.813417 |
| united_states & usa | 0.383201 |
| united_kingdom & britain | 0.830735 |
| netherlands & holland | 0.012669 |


Hmm, United States vs. U.S. gives a fairly good result, but the Netherlands vs. Holland score does not look right.  
Let's look at the 10 most similar words for some countries.

In [13]:
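def top_n_similar(vectors, t_enc, x, n=10):
  # x is either a word (looked up in t_enc) or an already computed vector
  if isinstance(x, str):
    if x not in t_enc:
      print("Can't find word {}".format(x))
      return
    vec = acquire_vec(x)
  else:
    vec = x
  sim_vect = cosine_similarity(vec, vectors).reshape(-1,)
  # Skip the single best match (for a word query, the word itself) and keep the next n.
  # Note that for a vector query this also drops the best candidate.
  top_indices = sim_vect.argsort()[-1-n : -1][::-1]
  wanted = set(top_indices)
  words_list = {v : k for k, v in t_enc.items() if v in wanted}
  return [(words_list[i].encode("utf-8"), sim_vect[i]) for i in top_indices]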

In [14]:
%%time
result = top_n_similar(vectors, t_enc, "united_kingdom")

CPU times: user 1.97 s, sys: 592 ms, total: 2.56 s
Wall time: 2.53 s


In [15]:
top10_uk_scratch = pd.DataFrame(
  result,
  index = range(1, 11),
  columns=["FromScratchSimilarCountry", "FromScratchSimilarity"]
)
top10_uk_scratch

| Rank | FromScratchSimilarCountry | FromScratchSimilarity |
| --- | --- | --- |
| 1 | netherlands | 0.857399 |
| 2 | italy | 0.838267 |
| 3 | britain | 0.830735 |
| 4 | germany | 0.804151 |
| 5 | france | 0.798871 |
| 6 | télévisions | 0.77378 |
| 7 | belgium | 0.759755 |
| 8 | ireland | 0.732287 |
| 9 | spain | 0.712897 |
| 10 | australia | 0.706132 |


Somehow, the Netherlands comes out as the country most similar to the UK.

In [16]:
%%time
result = top_n_similar(vectors, t_enc, "united_states")

CPU times: user 1.88 s, sys: 571 ms, total: 2.45 s
Wall time: 2.41 s


In [17]:
top10_us_scratch = pd.DataFrame(
  result,
  index = range(1, 11),
  columns=["FromScratchSimilarCountry", "FromScratchSimilarity"]
)
top10_us_scratch

| Rank | FromScratchSimilarCountry | FromScratchSimilarity |
| --- | --- | --- |
| 1 | u.s | 0.813417 |
| 2 | us | 0.541306 |
| 3 | canada | 0.445503 |
| 4 | reserve | 0.431682 |
| 5 | bureau | 0.425039 |
| 6 | census | 0.423169 |
| 7 | europe | 0.42278 |
| 8 | meteorology | 0.415249 |
| 9 | navy | 0.397493 |
| 10 | marine | 0.388942 |


This is somewhat odd: apart from u.s, us, and canada, united_states is closest to words such as reserve, bureau, and census rather than to other countries.  
We'll compare this to Google's word2vec later.  
For now, let's try one more thing: analogy.
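
The idea behind an analogy query is vector arithmetic: if the offset from a capital to its country is roughly the same for every pair, then country A minus capital A plus capital B should land near country B. The sketch below illustrates this with invented 3-D toy vectors. `TOY_VECTORS`, `cosine`, and `analogy` are hypothetical names, and the vectors are made up so that the offset holds exactly. Unlike the notebook's `top_n_similar`, this sketch excludes the query words from the candidates, which is also what gensim's `most_similar` does.

```python
import math

# Invented toy vectors: axes 0 and 1 encode the country, axis 2 marks capitals.
TOY_VECTORS = {
    "spain":  [1.0, 0.0, 0.0],
    "madrid": [1.0, 0.0, 1.0],
    "greece": [0.0, 1.0, 0.0],
    "athens": [0.0, 1.0, 1.0],
    "italy":  [0.7, 0.7, 0.0],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(positive, negative, vectors, topn=3):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    query_words = set(positive) | set(negative)
    scored = [
        (word, cosine(target, vec))
        for word, vec in vectors.items()
        if word not in query_words  # never return the query words themselves
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# spain - madrid + athens = [0, 1, 0], which is exactly the toy vector for greece
ranking = analogy(["spain", "athens"], ["madrid"], TOY_VECTORS)
```

In this toy setup `ranking` puts greece first, followed by italy. Real embeddings are much noisier, so the expected word is not guaranteed to rank first.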

In [18]:
target_vec = acquire_vec("spain") - acquire_vec("madrid") + acquire_vec("athens")
result = top_n_similar(vectors, t_enc, target_vec)

In [19]:
top10_greece_analogy_scratch = pd.DataFrame(
  result,
  index = range(1, 11),
  columns=["FromScratchSimilarCountry", "FromScratchSimilarity"]
)
top10_greece_analogy_scratch

| Rank | FromScratchSimilarCountry | FromScratchSimilarity |
| --- | --- | --- |
| 1 | italy | 0.852897 |
| 2 | sweden | 0.847748 |
| 3 | germany | 0.836689 |
| 4 | belgium | 0.824082 |
| 5 | austria | 0.819144 |
| 6 | netherlands | 0.818202 |
| 7 | france | 0.812697 |
| 8 | télévisions | 0.789071 |
| 9 | denmark | 0.787519 |
| 10 | norway | 0.746578 |


In [20]:
target_vec = acquire_vec("canada") - acquire_vec("ottawa") + acquire_vec("washington")
result = top_n_similar(vectors, t_enc, target_vec)

In [21]:
top10_us_analogy_scratch = pd.DataFrame(
  result,
  index = range(1, 11),
  columns=["FromScratchSimilarCountry", "FromScratchSimilarity"]
)
top10_us_analogy_scratch

| Rank | FromScratchSimilarCountry | FromScratchSimilarity |
| --- | --- | --- |
| 1 | canada | 0.745063 |
| 2 | australia | 0.546372 |
| 3 | usa | 0.542821 |
| 4 | united_kingdom | 0.484901 |
| 5 | america | 0.470903 |
| 6 | ontario | 0.47082 |
| 7 | europe | 0.464094 |
| 8 | united_states | 0.462584 |
| 9 | d.c | 0.461587 |
| 10 | new_zealand | 0.459321 |


## Using word2vec library

Here comes Google's word2vec. We'll use the implementation from the gensim package.

In [22]:
%%time

from gensim.models import word2vec

with open(corpus_file, "rt") as fi:
  result = []
  for line in fi:
    result.append(line.decode("utf-8").rstrip("\n").split(" "))  
  w2v_model = word2vec.Word2Vec(result, size=300, min_count=2)

CPU times: user 4min 9s, sys: 6.77 s, total: 4min 15s
Wall time: 1min 43s


We'll now go through the same set of questions as with the from-scratch model.

In [23]:

countries_similarity_w2v = pd.DataFrame(
  {"Word2VecSimilarity" : [
    w2v_model.wv.similarity(pair[0], pair[1])
    for pair in comparison_set
  ]},
  index=[" & ".join(pair) for pair in comparison_set]
)
pd.concat((countries_similarity_scratch, countries_similarity_w2v), axis=1)

| Pair | FromScratchSimilarity | Word2VecSimilarity |
| --- | --- | --- |
| united_states & u.s | 0.813417 | 0.856122 |
| united_states & usa | 0.383201 | 0.540755 |
| united_kingdom & britain | 0.830735 | 0.575966 |
| netherlands & holland | 0.012669 | 0.446076 |


Unlike in the from-scratch model, Netherlands and Holland are now noticeably more similar (0.45 versus 0.01), which is closer to what we expected.

In [24]:
top10_uk_w2v = pd.DataFrame(
  w2v_model.wv.most_similar(positive=["united_kingdom"], topn=10),
  index = range(1, 11),
  columns=["Word2VecSimilarCountry", "Word2VecSimilarity"]
)
pd.concat((top10_uk_scratch, top10_uk_w2v), axis=1)

| Rank | FromScratchSimilarCountry | FromScratchSimilarity | Word2VecSimilarCountry | Word2VecSimilarity |
| --- | --- | --- | --- | --- |
| 1 | netherlands | 0.857399 | uk | 0.788296 |
| 2 | italy | 0.838267 | canada | 0.785179 |
| 3 | britain | 0.830735 | netherlands | 0.715695 |
| 4 | germany | 0.804151 | united_states | 0.700951 |
| 5 | france | 0.798871 | new_zealand | 0.699565 |
| 6 | télévisions | 0.77378 | sweden | 0.690162 |
| 7 | belgium | 0.759755 | australia | 0.670177 |
| 8 | ireland | 0.732287 | europe | 0.655038 |
| 9 | spain | 0.712897 | hong_kong | 0.651373 |
| 10 | australia | 0.706132 | philippines | 0.651057 |


In [25]:
top10_us_w2v = pd.DataFrame(
  w2v_model.wv.most_similar(positive=["united_states"], topn=10),
  index = range(1, 11),
  columns=["Word2VecSimilarCountry", "Word2VecSimilarity"]
)
pd.concat((top10_us_scratch, top10_us_w2v), axis=1)

| Rank | FromScratchSimilarCountry | FromScratchSimilarity | Word2VecSimilarCountry | Word2VecSimilarity |
| --- | --- | --- | --- | --- |
| 1 | u.s | 0.813417 | u.s | 0.856122 |
| 2 | us | 0.541306 | united_kingdom | 0.700951 |
| 3 | canada | 0.445503 | canada | 0.667973 |
| 4 | reserve | 0.431682 | philippines | 0.618144 |
| 5 | bureau | 0.425039 | uk | 0.60841 |
| 6 | census | 0.423169 | us | 0.605478 |
| 7 | europe | 0.42278 | europe | 0.595408 |
| 8 | meteorology | 0.415249 | new_zealand | 0.588238 |
| 9 | navy | 0.397493 | australia | 0.581822 |
| 10 | marine | 0.388942 | taiwan | 0.579102 |


So far both models seem to generate reasonable results. Let's see what they propose for the Netherlands.

In [26]:
top10_nl_scratch = pd.DataFrame(
  top_n_similar(vectors, t_enc, "netherlands"),
  index = range(1, 11),
  columns=["FromScratchSimilarCountry", "FromScratchSimilarity"]
)
top10_nl_w2v = pd.DataFrame(
  w2v_model.wv.most_similar(positive=["netherlands"], topn=10),
  index = range(1, 11),
  columns=["Word2VecSimilarCountry", "Word2VecSimilarity"]
)
pd.concat((top10_nl_scratch, top10_nl_w2v), axis=1)

| Rank | FromScratchSimilarCountry | FromScratchSimilarity | Word2VecSimilarCountry | Word2VecSimilarity |
| --- | --- | --- | --- | --- |
| 1 | italy | 0.920498 | belgium | 0.872092 |
| 2 | belgium | 0.915938 | italy | 0.837967 |
| 3 | télévisions | 0.888087 | sweden | 0.828278 |
| 4 | germany | 0.869561 | norway | 0.812825 |
| 5 | france | 0.862236 | finland | 0.808494 |
| 6 | spain | 0.858098 | spain | 0.7901 |
| 7 | united_kingdom | 0.857399 | austria | 0.788743 |
| 8 | austria | 0.83908 | chile | 0.782123 |
| 9 | sweden | 0.829271 | brazil | 0.771756 |
| 10 | denmark | 0.811705 | denmark | 0.767548 |


Sadly, though both models find other nearby European countries, neither has Holland in the top 10. Perhaps Netherlands and Holland do not appear in similar enough contexts in our corpus, or Holland is simply not frequent enough.
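
One way to probe the frequency hypothesis is to count how often each word occurs in the normalized corpus. A minimal sketch follows, assuming space-separated tokens as in the corpus file. The helper `word_frequencies` is hypothetical and the sample lines are invented, so the counts say nothing about the real corpus.

```python
from collections import Counter

def word_frequencies(lines, words):
    """Count how often each of the given words occurs in space-separated lines."""
    wanted = set(words)
    counts = Counter()
    for line in lines:
        counts.update(tok for tok in line.rstrip("\n").split(" ") if tok in wanted)
    return {word: counts[word] for word in words}

# Invented sample lines standing in for the normalized corpus
sample_lines = [
    "amsterdam is the capital of the netherlands\n",
    "holland is a region of the netherlands\n",
]
freqs = word_frequencies(sample_lines, ["netherlands", "holland"])
```

On the real data, the open handle of the corpus file could be passed in place of `sample_lines`. A low count for holland would matter for the from-scratch model in particular, since pairs seen fewer than 10 times get a PPMI of 0.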

In [27]:
top10_greece_analogy_w2v = pd.DataFrame(
  w2v_model.wv.most_similar(positive=["spain", "athens"], negative=["madrid"], topn=10),
  index = range(1, 11),
  columns=["Word2VecSimilarCountry", "Word2VecSimilarity"]
)
pd.concat((top10_greece_analogy_scratch, top10_greece_analogy_w2v), axis=1)


| Rank | FromScratchSimilarCountry | FromScratchSimilarity | Word2VecSimilarCountry | Word2VecSimilarity |
| --- | --- | --- | --- | --- |
| 1 | italy | 0.852897 | greece | 0.79676 |
| 2 | sweden | 0.847748 | egypt | 0.762369 |
| 3 | germany | 0.836689 | italy | 0.73777 |
| 4 | belgium | 0.824082 | portugal | 0.730268 |
| 5 | austria | 0.819144 | russia | 0.729629 |
| 6 | netherlands | 0.818202 | austria | 0.720083 |
| 7 | france | 0.812697 | syria | 0.714533 |
| 8 | télévisions | 0.789071 | germany | 0.713075 |
| 9 | denmark | 0.787519 | norway | 0.71076 |
| 10 | norway | 0.746578 | hungary | 0.703329 |


In [28]:
top10_us_analogy_w2v = pd.DataFrame(
  w2v_model.wv.most_similar(positive=['canada', 'washington'], negative=['ottawa'], topn=10),
  index = range(1, 11),
  columns=["Word2VecSimilarCountry", "Word2VecSimilarity"]
)
pd.concat((top10_us_analogy_scratch, top10_us_analogy_w2v), axis=1)


| Rank | FromScratchSimilarCountry | FromScratchSimilarity | Word2VecSimilarCountry | Word2VecSimilarity |
| --- | --- | --- | --- | --- |
| 1 | canada | 0.745063 | united_states | 0.630832 |
| 2 | australia | 0.546372 | u.s | 0.570035 |
| 3 | usa | 0.542821 | australia | 0.542224 |
| 4 | united_kingdom | 0.484901 | china | 0.532506 |
| 5 | america | 0.470903 | taiwan | 0.507008 |
| 6 | ontario | 0.47082 | america | 0.504899 |
| 7 | europe | 0.464094 | us | 0.503777 |
| 8 | united_states | 0.462584 | europe | 0.498085 |
| 9 | d.c | 0.461587 | united_kingdom | 0.494034 |
| 10 | new_zealand | 0.459321 | north_korea | 0.466008 |


# Conclusion

Here we have built a set of word vectors from scratch and compared it with Google's word2vec model. Both give plausible country similarities, but in the tables above only word2vec ranks the expected country first in the two analogies. Word2vec also trains much faster: 1 min 43 s of wall time versus 12 min 53 s for the basic from-scratch model.