# Word Embeddings
A word embedding is a dense vector representation of a word that captures something about its meaning.

Word embeddings improve on simpler bag-of-words encoding schemes, such as word counts and frequencies, which produce large, sparse vectors (mostly 0 values) that describe documents but not the meaning of the words.
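To make the contrast concrete, the following minimal sketch builds bag-of-words count vectors for two toy sentences. The sentences and the helper name `count_vector` are illustrative assumptions, not part of this notebook's data. Each vector has one slot per vocabulary word, so most entries become zero as the vocabulary grows.

```python
from collections import Counter

# Two toy documents, used only for illustration.
docs = ["the movie was great", "the plot was dull and the acting was flat"]

# The vocabulary is every distinct word across all documents.
vocab = sorted({word for doc in docs for word in doc.split()})

def count_vector(doc, vocab):
    """Return one count per vocabulary word (a bag-of-words vector)."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [count_vector(doc, vocab) for doc in docs]
for doc, vec in zip(docs, vectors):
    zeros = vec.count(0)
    print(f"{doc!r}: {vec} ({zeros} of {len(vec)} entries are zero)")
```

The vectors record which words occur, but nothing in them says that "great" and "dull" are related as opposites or that "movie" and "plot" belong to the same topic.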


Word embeddings work by using an algorithm to train a set of fixed-length dense and continuous-valued vectors based on a large corpus of text. Each word is represented by a point in the embedding space and these points are learned and moved around based on the words that surround the target word.
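The idea of "the words that surround the target word" can be sketched in a few lines. The helper below (a hypothetical `context_pairs`, not a Gensim function) lists the (target, context) pairs that a window of a given size produces. Word2Vec-style algorithms learn from pairs like these, although real implementations add details such as subsampling of frequent words that are omitted here.

```python
def context_pairs(tokens, window=2):
    """Yield (target, context) pairs within `window` positions of each target."""
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                yield target, tokens[j]

tokens = "the sea rescue sequences are terrific".split()
pairs = list(context_pairs(tokens, window=1))
print(pairs[:4])
```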

Gensim is an open source Python library for natural language processing, with a focus on topic modeling. Its Word2Vec class takes the following main parameters:

* size: (default 100) The number of dimensions of the embedding, i.e. the length of the dense vector that represents each token (word). Gensim 4.0 and later call this parameter vector_size.
* window: (default 5) The maximum distance between a target word and the words around it.
* min_count: (default 5) The minimum number of occurrences a word needs to be included in training; words that occur fewer times are ignored.
* workers: (default 3) The number of threads to use while training.
* sg: (default 0 or CBOW) The training algorithm, either CBOW (0) or skip-gram (1).
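As a sketch of how these parameters are passed, the helper below collects them into keyword arguments. The function name `word2vec_params` is hypothetical. The one API detail it encodes is that Gensim 4.0 renamed `size` to `vector_size`, so the key depends on the installed version.

```python
def word2vec_params(gensim_major_version, dim=100):
    """Build Word2Vec keyword arguments using the defaults listed above."""
    # Gensim 4.0 renamed `size` to `vector_size`; older releases use `size`.
    dim_key = "vector_size" if gensim_major_version >= 4 else "size"
    return {
        dim_key: dim,    # number of embedding dimensions
        "window": 5,     # max distance between target and context words
        "min_count": 5,  # ignore words seen fewer than 5 times
        "workers": 3,    # training threads
        "sg": 0,         # 0 = CBOW, 1 = skip-gram
    }

params = word2vec_params(gensim_major_version=3)
print(params)
```

With Gensim installed, these could be passed as `Word2Vec(sentences, **params)`, where `sentences` is a list of token lists.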

In [1]:
import gensim



In [2]:
from gensim.models import Word2Vec

In [3]:
from gensim.models import KeyedVectors

In [4]:
import pandas as pd

In [6]:
# load the review data
df = pd.read_csv('reviews.csv', encoding='latin-1')

In [7]:
df.shape

(11, 5)

In [8]:
df.head(3)

Unnamed: 0.1,Unnamed: 0,type,review,label,file
0,0,test,Once again Mr. Costner has dragged out a movie...,neg,0_2.txt
1,1,test,This is an example of why the majority of acti...,neg,10000_4.txt
2,2,test,"First of all I hate those moronic rappers, who...",neg,10001_1.txt


In [None]:
#load test data



In [29]:
def tokenizer(text):
    return text.split()

In [9]:
df['tokenzied']=df["review"].str.split(" ")

Gensim's Word2Vec expects a list of sentences, each of which is a list of words. Here each review is treated as a single sentence: the review text is split on spaces, and the resulting column of token lists is passed to the model.
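Splitting on spaces keeps punctuation attached to words, so `care.` and `care` end up as separate vocabulary entries. One possible alternative, shown here as an assumption rather than what this notebook does, is a small regex tokenizer that lowercases the text and keeps only runs of letters, digits, and apostrophes. The name `simple_tokenize` is hypothetical.

```python
import re

# Runs of lowercase letters, digits, or apostrophes count as one token.
TOKEN_RE = re.compile(r"[a-z0-9']+")

def simple_tokenize(text):
    """Lowercase `text` and return its tokens with punctuation removed."""
    return TOKEN_RE.findall(text.lower())

review = "I just did not care. Should we care?"
print(review.split(" "))        # punctuation stays attached
print(simple_tokenize(review))  # punctuation removed, lowercased
```

Whether lowercasing helps depends on the task; it merges `The` and `the` but also loses the distinction between some proper nouns and common words.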

In [10]:
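# min_count=1 keeps every word; with only 11 reviews the default of 5 would drop most of the vocabulary
model = Word2Vec(df['tokenzied'], min_count=1)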

In [11]:
print(model)

Word2Vec(vocab=998, size=100, alpha=0.025)


In [12]:
# summarize vocabulary (Gensim 3.x; in Gensim 4.0+ use model.wv.key_to_index instead of model.wv.vocab)
words = list(model.wv.vocab)

In [13]:
words

['Once',
 'again',
 'Mr.',
 'Costner',
 'has',
 'dragged',
 'out',
 'a',
 'movie',
 'for',
 'far',
 'longer',
 'than',
 'necessary.',
 'Aside',
 'from',
 'the',
 'terrific',
 'sea',
 'rescue',
 'sequences,',
 'of',
 'which',
 'there',
 'are',
 'very',
 'few',
 'I',
 'just',
 'did',
 'not',
 'care',
 'about',
 'any',
 'characters.',
 'Most',
 'us',
 'have',
 'ghosts',
 'in',
 'closet,',
 'and',
 "Costner's",
 'character',
 'realized',
 'early',
 'on,',
 'then',
 'forgotten',
 'until',
 'much',
 'later,',
 'by',
 'time',
 'care.',
 'The',
 'we',
 'should',
 'really',
 'is',
 'cocky,',
 'overconfident',
 'Ashton',
 'Kutcher.',
 'problem',
 'he',
 'comes',
 'off',
 'as',
 'kid',
 'who',
 'thinks',
 "he's",
 'better',
 'anyone',
 'else',
 'around',
 'him',
 'shows',
 'no',
 'signs',
 'cluttered',
 'closet.',
 'His',
 'only',
 'obstacle',
 'appears',
 'to',
 'be',
 'winning',
 'over',
 'Costner.',
 'Finally',
 'when',
 'well',
 'past',
 'half',
 'way',
 'point',
 'this',
 'stinker,',
 'tells

In [14]:
# access vector for one word (word vectors live on model.wv; indexing the model directly is deprecated)
print(model.wv['watch'])

[-0.00156634 -0.00057168 -0.00056892  0.00391186  0.00436611  0.00322857
 -0.00285753  0.00339262 -0.00220805 -0.0044377  -0.00116733 -0.00061375
 -0.00084118  0.00282253 -0.00306073  0.000721   -0.0030427  -0.00392674
  0.00339725 -0.00382695  0.0028149  -0.00345134  0.00334856 -0.00295137
  0.00306975 -0.00414636  0.00220857  0.00433798 -0.00164621 -0.00213806
  0.00295529 -0.0021785  -0.00255844  0.00091168  0.00101862  0.00272277
  0.00301214 -0.00261265 -0.00067677  0.00396885  0.00278577 -0.00067683
 -0.00230345 -0.004754   -0.00036882  0.00427084  0.00039569  0.00404209
  0.00088599 -0.00014652  0.00323671  0.00173736 -0.00414521 -0.00461659
 -0.00116143 -0.00495172 -0.00017505  0.00150349  0.00447745 -0.00216613
  0.00084042  0.00191833  0.00112578 -0.00336055  0.00395462 -0.0004212
 -0.00318817 -0.00248725  0.00030055 -0.00273889 -0.00323885 -0.00201435
 -0.00131316 -0.00255054 -0.00454028 -0.00400675 -0.00124169 -0.00212296
  0.00267673  0.00488541 -0.0041331   0.00363043 -0.

  


# Sentence Vector 

In [15]:
import numpy as np

In [16]:
df.head(3)

Unnamed: 0.1,Unnamed: 0,type,review,label,file,tokenzied
0,0,test,Once again Mr. Costner has dragged out a movie...,neg,0_2.txt,"[Once, again, Mr., Costner, has, dragged, out,..."
1,1,test,This is an example of why the majority of acti...,neg,10000_4.txt,"[This, is, an, example, of, why, the, majority..."
2,2,test,"First of all I hate those moronic rappers, who...",neg,10001_1.txt,"[First, of, all, I, hate, those, moronic, rapp..."


In [17]:
def vectorizer(text):
    # look up the embedding of each token; the result has shape (n_tokens, 100)
    return np.array([model.wv[x] for x in text])
    

In [18]:
# apply the vectorizer function to all tokenized reviews
df['vec_text'] = df['tokenzied'].apply(vectorizer)

  


In [27]:
df['vec_text'][0].shape

(168, 100)

In [30]:
df['vec_text'][0]

array([[ 0.0027712 , -0.00352015, -0.00263619, ...,  0.00078959,
        -0.00291353,  0.00162354],
       [ 0.00293399,  0.00423996,  0.00057422, ..., -0.00442938,
        -0.00162873, -0.00278063],
       [-0.00201099,  0.00433001,  0.003913  , ...,  0.00496293,
        -0.00131217,  0.00086428],
       ...,
       [ 0.00052247, -0.00117149, -0.00333994, ..., -0.00072355,
        -0.00052909, -0.00175135],
       [ 0.00342983, -0.00228086,  0.00462438, ..., -0.00275732,
        -0.00103558, -0.0016434 ],
       [-0.00175752, -0.00313165,  0.00180346, ...,  0.00290053,
         0.00070109, -0.00389416]], dtype=float32)

In [28]:
df['vec_text'][1].shape

(234, 100)

In [20]:
# sum the word vectors of each review into a single 100-dimensional sentence vector
df['sent_vec'] = [v.sum(axis=0) for v in df['vec_text']]

In [24]:
df['sent_vec']

0     [0.09312847, -0.094630785, 0.22852522, 0.05958...
1     [0.09245081, -0.09774694, 0.13850689, 0.094530...
2     [0.06079231, -0.13670747, 0.2628679, -0.002581...
3     [0.08619446, -0.13721795, 0.23525174, -0.03831...
4     [0.119398855, -0.14799763, 0.18988867, 0.04880...
5     [0.017543841, -0.03240566, 0.10874356, 0.03739...
6     [-0.015945364, -0.036358036, 0.18017143, 0.004...
7     [0.045363635, -0.05529844, 0.11734583, 0.03458...
8     [0.06777747, -0.08085461, 0.13584462, 0.001552...
9     [-0.0030489878, -0.050781287, 0.07091034, -0.0...
10    [0.055961557, -0.047435254, 0.055158243, 0.017...
Name: sent_vec, dtype: object

In [31]:
df['sent_vec'][1].shape

(100,)
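Summing word vectors, as `sent_vec` does, makes the magnitude of the sentence vector grow with review length. A common alternative, shown below as a sketch on made-up 3-dimensional vectors, is to average instead and to skip words that have no vector. The `toy_vectors` lookup and the `sentence_vector` helper are illustrative assumptions, not part of this notebook's model.

```python
import numpy as np

# Made-up embeddings standing in for a trained model's word vectors.
toy_vectors = {
    "good": np.array([1.0, 0.0, 2.0]),
    "movie": np.array([0.0, 1.0, 0.0]),
    "bad": np.array([-1.0, 0.0, -2.0]),
}

def sentence_vector(tokens, vectors, dim=3, average=False):
    """Combine the vectors of known tokens by sum (default) or mean."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)  # no known words: fall back to a zero vector
    stacked = np.vstack(known)
    return stacked.mean(axis=0) if average else stacked.sum(axis=0)

tokens = ["good", "good", "movie", "unseen"]
print(sentence_vector(tokens, toy_vectors))                # sum
print(sentence_vector(tokens, toy_vectors, average=True))  # mean
```

Both variants return a vector with the same number of dimensions as the word vectors, so either can feed the feature table built in the next cells.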

In [32]:
X = pd.DataFrame(df['sent_vec'].tolist())

In [33]:
X.shape

(11, 100)

In [34]:
X

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,0.093128,-0.094631,0.228525,0.059582,-0.066349,0.108213,-0.069594,0.037511,-0.042918,-0.0177,...,0.12498,0.115464,-0.010149,-0.020738,0.072868,-0.09851,0.094066,0.039837,0.006789,-0.097372
1,0.092451,-0.097747,0.138507,0.09453,-0.017681,0.089813,-0.092549,0.036472,-0.021062,-0.058262,...,0.060163,0.191369,0.116722,-0.04354,0.039515,-0.146225,-0.011857,-0.014735,-0.030818,-0.090125
2,0.060792,-0.136707,0.262868,-0.002581,-0.056489,0.181052,-0.137023,-0.051748,0.067173,-0.002639,...,0.066746,0.240909,0.106708,0.032597,0.101189,-0.132757,0.009021,-0.061532,0.005246,-0.047295
3,0.086194,-0.137218,0.235252,-0.038314,-0.045759,0.219932,-0.143653,0.051413,0.009665,-0.110976,...,0.016176,0.259386,0.095933,-0.007485,0.089147,-0.194201,-0.001812,-0.078791,-0.058275,-0.114373
4,0.119399,-0.147998,0.189889,0.048805,-0.122236,0.096643,-0.113627,0.122275,-0.056252,-0.035505,...,0.014434,0.180051,0.113348,-0.054137,0.09774,-0.058026,0.014517,-0.059281,-0.090645,-0.036854
5,0.017544,-0.032406,0.108744,0.037399,-0.049013,0.022647,-0.017196,-0.031546,-0.013009,0.020487,...,0.022639,0.097475,0.059172,-0.040099,0.05429,-0.043206,0.058571,-0.045362,0.038615,-0.021555
6,-0.015945,-0.036358,0.180171,0.004195,0.00918,0.104153,-0.047563,-0.015972,0.012728,-0.02913,...,0.144182,0.158345,0.087966,-0.077947,0.031161,-0.180363,0.052373,0.003523,0.038061,0.021573
7,0.045364,-0.055298,0.117346,0.03458,-0.019521,0.129401,-0.115185,-0.011725,-0.000977,-0.02752,...,0.056123,0.147282,0.044409,-0.021009,0.084365,-0.072133,0.051046,-0.02951,-0.015222,0.008379
8,0.067777,-0.080855,0.135845,0.001552,-0.004141,0.118693,-0.104664,0.043767,0.023022,-0.008967,...,0.106571,0.121945,0.137576,0.034291,0.076052,-0.052747,0.015969,-0.073503,0.003654,-0.085034
9,-0.003049,-0.050781,0.07091,-0.048655,-0.062328,0.015332,-0.080659,0.028116,-0.022204,-0.070941,...,0.088179,0.088444,0.049958,-0.018261,0.104775,-0.129718,0.068119,-0.054084,-0.020926,0.007447


In [35]:
final_df=df.merge(X, how='outer', left_index=True, right_index=True)

In [36]:
final_df.head(3)

Unnamed: 0.1,Unnamed: 0,type,review,label,file,tokenzied,vec_text,sent_vec,0,1,...,90,91,92,93,94,95,96,97,98,99
0,0,test,Once again Mr. Costner has dragged out a movie...,neg,0_2.txt,"[Once, again, Mr., Costner, has, dragged, out,...","[[0.0027712008, -0.003520147, -0.00263619, -0....","[0.09312847, -0.094630785, 0.22852522, 0.05958...",0.093128,-0.094631,...,0.12498,0.115464,-0.010149,-0.020738,0.072868,-0.09851,0.094066,0.039837,0.006789,-0.097372
1,1,test,This is an example of why the majority of acti...,neg,10000_4.txt,"[This, is, an, example, of, why, the, majority...","[[-0.004457866, 0.00029172233, -0.00085859944,...","[0.09245081, -0.09774694, 0.13850689, 0.094530...",0.092451,-0.097747,...,0.060163,0.191369,0.116722,-0.04354,0.039515,-0.146225,-0.011857,-0.014735,-0.030818,-0.090125
2,2,test,"First of all I hate those moronic rappers, who...",neg,10001_1.txt,"[First, of, all, I, hate, those, moronic, rapp...","[[-0.0026348098, -0.004647377, -0.0045004115, ...","[0.06079231, -0.13670747, 0.2628679, -0.002581...",0.060792,-0.136707,...,0.066746,0.240909,0.106708,0.032597,0.101189,-0.132757,0.009021,-0.061532,0.005246,-0.047295


In [37]:
# drop the text and intermediate columns, keeping the metadata and the 100 feature columns
final_df = final_df.drop(columns=['review', 'tokenzied', 'vec_text', 'sent_vec'])

In [42]:
final_df.shape

(11, 104)
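The steps from a column of vectors to one column per dimension can be seen on a toy frame. This sketch uses made-up data and assumes only pandas and NumPy. It expands the vector column with `pd.DataFrame(series.tolist())`, attaches the result with `join` (which aligns on the index, much like the index-based merge above), and drops the original vector column.

```python
import numpy as np
import pandas as pd

# Toy frame with one label column and one column of 3-dimensional vectors.
toy = pd.DataFrame({
    "label": ["neg", "pos"],
    "sent_vec": [np.array([0.1, 0.2, 0.3]), np.array([0.4, 0.5, 0.6])],
})

# One row per review, one column per embedding dimension.
features = pd.DataFrame(toy["sent_vec"].tolist(), index=toy.index)

# Attach the feature columns and drop the original vector column.
wide = toy.join(features).drop(columns=["sent_vec"])
print(wide.shape)
print(list(wide.columns))
```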

In [43]:
final_df.head(3)

Unnamed: 0.1,Unnamed: 0,type,label,file,0,1,2,3,4,5,...,90,91,92,93,94,95,96,97,98,99
0,0,test,neg,0_2.txt,0.093128,-0.094631,0.228525,0.059582,-0.066349,0.108213,...,0.12498,0.115464,-0.010149,-0.020738,0.072868,-0.09851,0.094066,0.039837,0.006789,-0.097372
1,1,test,neg,10000_4.txt,0.092451,-0.097747,0.138507,0.09453,-0.017681,0.089813,...,0.060163,0.191369,0.116722,-0.04354,0.039515,-0.146225,-0.011857,-0.014735,-0.030818,-0.090125
2,2,test,neg,10001_1.txt,0.060792,-0.136707,0.262868,-0.002581,-0.056489,0.181052,...,0.066746,0.240909,0.106708,0.032597,0.101189,-0.132757,0.009021,-0.061532,0.005246,-0.047295
