# Word Embeddings
A word embedding is a dense vector representation of a word that captures something about its meaning.

Word embeddings are an improvement over simpler bag-of-words encoding schemes, such as word counts and frequencies, which produce large, sparse vectors (mostly 0 values) that describe documents but not the meaning of the words.
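
To make the contrast concrete, here is a minimal sketch of a count-based bag-of-words encoding. The two-document corpus and the helper `bag_of_words` are made up for illustration. Each vector has one position per vocabulary word, so with a realistic vocabulary of tens of thousands of words almost every entry would be 0.

```python
from collections import Counter

# Toy corpus (hypothetical, for illustration only)
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

# Build a fixed vocabulary: one vector position per distinct word
vocab = sorted({word for doc in docs for word in doc.split()})

def bag_of_words(doc, vocab):
    """Return a count vector with one entry per vocabulary word."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

bow_vectors = [bag_of_words(doc, vocab) for doc in docs]
for doc, vec in zip(docs, bow_vectors):
    print(vec, "<-", doc)
```

The vectors record which words occur and how often, but nothing in them says that "cat" and "dog" are more alike than "cat" and "mat".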


Word embeddings work by using an algorithm to train a set of fixed-length dense and continuous-valued vectors based on a large corpus of text. Each word is represented by a point in the embedding space and these points are learned and moved around based on the words that surround the target word.
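
The idea of learning from "the words that surround the target word" can be sketched as generating (target, context) pairs from a sliding window. This is only an illustration of the windowing step: the function name `context_pairs` is hypothetical, and real training algorithms add further details that are omitted here.

```python
def context_pairs(tokens, window=2):
    """Yield (target, context) pairs for every word within `window` positions."""
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                yield target, tokens[j]

tokens = ["this", "is", "the", "second", "sentence"]
pairs = list(context_pairs(tokens, window=1))
print(pairs)
```

A larger window produces more pairs per target word and lets more distant words influence each vector.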

Gensim is an open-source Python library for natural language processing, with a focus on topic modeling. It also provides the Word2Vec implementation used below.

In [2]:
!pip install gensim

Collecting gensim
  Downloading https://files.pythonhosted.org/packages/33/33/df6cb7acdcec5677ed130f4800f67509d24dbec74a03c329fcbf6b0864f0/gensim-3.4.0-cp36-cp36m-manylinux1_x86_64.whl (22.6MB)
Collecting smart-open>=1.2.1 (from gensim)
  Downloading https://files.pythonhosted.org/packages/4b/69/c92661a333f733510628f28b8282698b62cdead37291c8491f3271677c02/smart_open-1.5.7.tar.gz
Collecting boto>=2.32 (from smart-open>=1.2.1->gensim)
  Downloading https://files.pythonhosted.org/packages/bd/b7/a88a67002b1185ed9a8e8a6ef15266728c2361fcb4f1d02ea331e4c7741d/boto-2.48.0-py2.py3-none-any.whl (1.4MB)

The Word2Vec class takes several parameters that control training:

* size: (default 100) The number of dimensions of the embedding, i.e. the length of the dense vector that represents each token (word). This parameter was renamed vector_size in Gensim 4.0.
* window: (default 5) The maximum distance between a target word and the words around it.
* min_count: (default 5) The minimum number of occurrences a word needs to be included in training; words that occur fewer times are ignored.
* workers: (default 3) The number of threads to use while training.
* sg: (default 0 or CBOW) The training algorithm, either CBOW (0) or skip-gram (1).

In [3]:
import gensim

In [4]:
from gensim.models import Word2Vec

In [5]:
# define training data
sentences = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
             ['this', 'is', 'the', 'second', 'sentence'],
             ['yet', 'another', 'sentence'],
             ['one', 'more', 'sentence'],
             ['and', 'the', 'final', 'sentence']]

In [6]:
# train model
model = Word2Vec(sentences, min_count=1)
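
Setting min_count=1 matters for a corpus this small. The sketch below reproduces only the frequency filter in plain Python, not Gensim's actual vocabulary-building code; `surviving_vocab` is a hypothetical helper written for this illustration.

```python
from collections import Counter

sentences = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
             ['this', 'is', 'the', 'second', 'sentence'],
             ['yet', 'another', 'sentence'],
             ['one', 'more', 'sentence'],
             ['and', 'the', 'final', 'sentence']]

# Count how often each token occurs across the whole corpus
counts = Counter(word for sentence in sentences for word in sentence)

def surviving_vocab(counts, min_count):
    """Hypothetical helper: words that occur at least `min_count` times."""
    return [word for word, n in counts.items() if n >= min_count]

print(len(surviving_vocab(counts, min_count=1)))  # all 14 distinct words
print(surviving_vocab(counts, min_count=5))       # only 'sentence' occurs 5 times
```

With the default of 5, every word except 'sentence' would be dropped from this toy corpus before training.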

In [7]:
# summarize the trained model
print(model)

Word2Vec(vocab=14, size=100, alpha=0.025)


In [8]:
# summarize vocabulary
words = list(model.wv.vocab)

In [9]:
words


['this',
 'is',
 'the',
 'first',
 'sentence',
 'for',
 'word2vec',
 'second',
 'yet',
 'another',
 'one',
 'more',
 'and',
 'final']

In [10]:
# access vector for one word
print(model.wv['more'])

[ -1.26956205e-03   2.57847086e-03   1.65186683e-03  -9.67474247e-04
  -2.91275186e-03   3.36612720e-04   4.88209585e-03  -6.89524109e-04
  -4.36244905e-03  -7.72329164e-04  -2.60435883e-03   1.22760213e-03
   6.53339303e-05  -6.38498983e-04   4.95865988e-03   4.83098580e-03
  -2.26933672e-03  -9.35183431e-04   3.18531558e-04   2.61861645e-03
   4.66634566e-03   1.30437769e-03  -4.60971985e-03  -1.08207052e-03
  -1.08986674e-03  -2.25137151e-03  -6.19933009e-04   4.91061900e-03
   3.18462937e-03  -3.05437273e-03   1.08086923e-03   4.16298350e-03
   3.49295023e-03   1.44492800e-03  -1.67568040e-03   3.06357979e-04
   2.44697090e-03  -3.58970091e-03   2.65719206e-03  -2.52905069e-03
   1.29277422e-03  -3.96614335e-03  -3.76847107e-03  -3.30755860e-03
  -4.37811203e-03  -3.52004333e-03   3.73974303e-03  -3.17658111e-03
  -3.21462424e-03   3.59716569e-03   4.67127468e-03  -1.19730353e-03
   4.84050810e-03  -2.85812770e-03   2.72097322e-03   2.15429906e-03
  -4.55529941e-03   4.55523515e-03

  


# Sentence Vector 

In [12]:
import numpy as np

In [13]:
# Processing sentences is not as simple as with spaCy: look up each word's vector individually
vectors = np.array([model.wv[x] for x in "this is the second sentence".split()])

  


In [14]:
# sum the word vectors element-wise to get a single vector for the sentence
final_sent = vectors.sum(axis=0)

In [15]:
final_sent.shape

(100,)
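
Summing keeps the sentence vector at the same length as a word vector, but its magnitude grows with the number of words. Averaging is a common alternative. The sketch below compares the two on made-up 3-dimensional vectors rather than the trained model; `sentence_vector` and `toy_vectors` are hypothetical names, and skipping unknown tokens is an assumption about how out-of-vocabulary words should be handled.

```python
import numpy as np

# Toy 3-dimensional "word vectors" (made-up values, not from the trained model)
toy_vectors = {
    "this": np.array([1.0, 0.0, 2.0]),
    "is": np.array([0.0, 1.0, 0.0]),
    "short": np.array([1.0, 1.0, 1.0]),
}

def sentence_vector(tokens, vectors, pooling="sum"):
    """Pool the vectors of known tokens into one sentence vector."""
    # Tokens missing from `vectors` are skipped (an assumption of this sketch)
    stacked = np.array([vectors[t] for t in tokens if t in vectors])
    if pooling == "mean":
        return stacked.mean(axis=0)
    return stacked.sum(axis=0)

tokens = "this is short".split()
summed = sentence_vector(tokens, toy_vectors, pooling="sum")
averaged = sentence_vector(tokens, toy_vectors, pooling="mean")
print(summed)
print(averaged)
```

Both results point in the same direction, so cosine similarity between sentences is unaffected by the choice; only distance-based comparisons differ.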

In [16]:
final_sent

array([ 0.00512553,  0.00753538,  0.00380937,  0.00256127,  0.00202511,
        0.00061104, -0.0025634 ,  0.00628405, -0.00568757,  0.00039765,
        0.00158886,  0.00034985,  0.0066004 , -0.00320944,  0.0014687 ,
        0.00163751, -0.00615395, -0.00619273, -0.00590252,  0.00210619,
       -0.00065075,  0.00527864,  0.00335825,  0.00392724, -0.00287339,
       -0.00676278,  0.00879901,  0.00044252,  0.00448595, -0.00441785,
       -0.00582735, -0.00516625,  0.00115112, -0.00347584,  0.00864894,
       -0.00132769,  0.00957755,  0.00143966, -0.00264998,  0.0181527 ,
        0.00238072, -0.00591915,  0.00801951,  0.00775982,  0.00110044,
       -0.00528615, -0.00633104, -0.00577243,  0.00081442,  0.01009321,
        0.00216704,  0.00101184,  0.00538189,  0.00527494,  0.0009698 ,
       -0.00703135, -0.00447499, -0.00160621, -0.00802077, -0.00338456,
       -0.0043207 ,  0.0060369 ,  0.01693478,  0.00941087, -0.00774644,
        0.00334842, -0.00954389,  0.00098362, -0.00267814,  0.00