# **Sense2vec**

Sense2vec is a neural network model that generates vector space representations of words from large corpora. It is an extension of the well-known word2vec algorithm. Sense2vec creates embeddings for "senses" rather than for word tokens. A sense is a word combined with a label, i.e. information that represents the context in which the word is used. This label can be a POS tag, polarity, entity name, dependency tag, etc.

For more background, see [this guide to sense2vec](https://analyticsindiamag.com/guide-to-sense2vec-contextually-keyed-word-vectors-for-nlp/).

## **Installing Dependencies**

The sense2vec package integrates seamlessly with spaCy. Let’s play with this model.

In [2]:
!python -m pip install pip --upgrade --user -q --no-warn-script-location
!python -m pip install numpy pandas seaborn matplotlib scipy statsmodels scikit-learn nltk gensim --user -q --no-warn-script-location

In [3]:
!python -m pip install spacy --user -q
!python -m pip install sense2vec --user -q
!python -m spacy download en_core_web_sm --user -q

2021-10-28 14:19:02.460800: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-10-28 14:19:02.460864: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [None]:
!wget https://github.com/explosion/sense2vec/releases/download/v1.0.0/s2v_reddit_2015_md.tar.gz

In [None]:
!tar -xzvf s2v_reddit_2015_md.tar.gz

In [None]:
import IPython
IPython.Application.instance().kernel.do_shutdown(True)

## **Standalone Usage**

Getting started with this package is extremely easy. Standalone usage is as follows.

We can get the embedding of a sense, i.e. a word along with its label, by using "token + '|' + label" as the key.
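
As a minimal sketch of this key format, the hypothetical helpers below (make_sense_key and split_sense_key are illustrative names, not part of the sense2vec API) join a phrase and a label with '|'. Replacing spaces with underscores is an assumption, based on the multi-word keys in the pre-trained vectors being joined with '_'.

```python
def make_sense_key(phrase: str, label: str) -> str:
    """Build a key such as 'apple|NOUN' from a phrase and its label."""
    # Assumption: multi-word phrases are stored with underscores instead of spaces
    return phrase.replace(" ", "_") + "|" + label


def split_sense_key(key: str) -> tuple:
    """Split a key back into (phrase, label) at the last '|'."""
    phrase, label = key.rsplit("|", 1)
    return phrase.replace("_", " "), label


print(make_sense_key("apple", "NOUN"))
print(make_sense_key("Apple", "ORG"))
print(split_sense_key("machine_learning|NOUN"))
```

The same token can therefore map to several keys, one per label, which is what lets the fruit and the company have separate vectors.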

In [6]:
from sense2vec import Sense2Vec
# from_disk expects a local directory, not a URL. The path is assumed to be the
# s2v_old/ folder extracted from the tarball above (the same path used later on).
s2v = Sense2Vec().from_disk("s2v_old/")
query = "apple|NOUN"
assert query in s2v
vector = s2v[query]
vector.shape

In [None]:
s2v.most_similar('apple|NOUN')

In [None]:
s2v.most_similar('Apple|ORG')

Subtracting the vector for man from the vector for king and adding the vector for woman gives a vector that is very close to the vector for queen. These vectors capture the semantic information well. This is not surprising, as even word2vec models these relationships. Let’s look at the most similar senses for polysemic words.

In [None]:
import numpy as np
# king - man + woman should land close to queen
x=s2v['king|NOUN']-s2v['man|NOUN']+s2v['woman|NOUN']
y=s2v['queen|NOUN']
def cosine_similarity(x,y):
    # Dot product divided by the product of the two Euclidean norms
    return np.dot(x,y)/(np.linalg.norm(x)*np.linalg.norm(y))
cosine_similarity(x,y)
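
To make the similarity measure concrete, here is a small sketch on hand-made toy vectors (the numbers are illustrative, not real embeddings). It uses the same formula as the cell above: the dot product divided by the product of the vector norms.

```python
import numpy as np


def cosine(a, b):
    """Cosine of the angle between two vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


same_direction = cosine([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine([1.0, 0.0], [0.0, 3.0])       # perpendicular vectors
opposite = cosine([1.0, 1.0], [-1.0, -1.0])       # opposite directions
print(same_direction, orthogonal, opposite)
```

The score depends only on direction, not on length, so a value near 1 for the king and queen vectors means the two point almost the same way.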

In [None]:
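from collections import Counter
import matplotlib.pyplot as plt
# Keys look like "some_phrase|LABEL"; drop the label before counting tokens
grams=[len(i.rsplit('|',1)[0].split('_')) for i in s2v.keys()]
c=Counter(grams)
c=sorted(c.items())
c=list(zip(*c))
# Plot frequency against the actual token count rather than the list index
plt.plot(c[0],c[1])
plt.xlabel('Token Length')
plt.ylabel('Frequency')
plt.show()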

## **Usage as a spacy component**

Sense disambiguation for polysemic words

These embeddings capture context very well, but the user is responsible for providing a label along with the token to select the right vector.

The sense2vec package can infer these labels when given a spaCy document object. The following is an example of this kind of usage.

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")
s2v = nlp.add_pipe("sense2vec")
s2v.from_disk("s2v_old/")


Sense2vec can be added as a component to the spaCy pipeline. We can initialize this model with random values and train it, or we can load a pre-trained model and update it according to our needs.

In [None]:
doc = nlp('Power resides where men believe it resides. It’s a trick, a shadow on the wall.')
doc._.s2v_phrases

In [None]:
doc[-2]._.s2v_most_similar(3)

That’s it: we can get embeddings, similar phrases, etc. for all the supported phrases in this document.

The spaCy pipeline has POS tagger and named entity recognizer components before the sense2vec component. The sense2vec component uses the results from these components to create word senses.

In [None]:
for i in doc:
  try:
    print(i,i.pos_,'\n',i._.s2v_most_similar(3))
  except ValueError as e:
    # A token and POS tag combination that is not in the keyed vectors raises a ValueError, so catch it and skip the token
    pass
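
An alternative to catching the exception is to test for the key before looking it up, as the standalone example does with its membership assert. The sketch below illustrates the pattern with a plain dict standing in for the vector table; toy_vectors and lookup are hypothetical names, and the vectors are made up.

```python
import numpy as np

# Toy stand-in for the keyed vectors: sense key -> vector (values are made up)
toy_vectors = {
    "power|NOUN": np.array([0.1, 0.3]),
    "shadow|NOUN": np.array([0.7, 0.2]),
}


def lookup(vectors, text, pos):
    """Return the vector for text|pos, or None when the sense is unknown."""
    key = text.replace(" ", "_") + "|" + pos
    return vectors[key] if key in vectors else None


for text, pos in [("power", "NOUN"), ("resides", "VERB")]:
    vec = lookup(toy_vectors, text, pos)
    print(text, pos, "found" if vec is not None else "missing")
```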

## **Loading Data**

In [None]:
def get_subjects(x):
  subjects=[]
  with open(x,'r') as f:
      for line in f.readlines():
          if line.startswith('Subject:'):
              line=line.replace('Subject:',' ')
              line=line.replace('Re:',' ')
              line=line.strip()
              if len(line)<15:
                  continue
              subjects.append(line)
  return list(set(subjects))
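
To see what this filter keeps, the sketch below runs the same steps on a few made-up lines written to a temporary file. The sample lines and the compact extract_subjects helper are illustrative assumptions that mirror the function above.

```python
import os
import tempfile


def extract_subjects(path):
    """Collect unique subject lines of at least 15 characters."""
    subjects = set()
    with open(path, "r") as f:
        for line in f:
            if line.startswith("Subject:"):
                # Drop the header name and the reply marker, then trim whitespace
                line = line.replace("Subject:", " ").replace("Re:", " ").strip()
                if len(line) >= 15:
                    subjects.add(line)
    return sorted(subjects)


sample = (
    "From: someone@example.com\n"
    "Subject: Re: Gun control in the USA\n"
    "Subject: Hi\n"
    "Subject: Gun control in the USA\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)

result = extract_subjects(tmp.name)
os.remove(tmp.name)
print(result)
```

The reply and the original collapse into a single entry, and the very short subject is dropped.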

In [None]:
mideast_subjects=get_subjects('talk.politics.mideast.txt')
gun_subjects=get_subjects('talk.politics.guns.txt')

In [None]:
import pandas as pd
df=pd.DataFrame(mideast_subjects+gun_subjects,columns=['subjects'])
df['topic']=['mideast']*len(mideast_subjects)+['guns']*len(gun_subjects)
print(df.shape)
df['topic'].value_counts()

## **Generating Word2vec embeddings**

In [None]:
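import gensim
import numpy as np
from tqdm import tqdm
# Word2Vec expects tokenised sentences (lists of strings), not raw strings,
# so tokenise with spaCy first to match the lookups below
sentences=[[j.text.lower() for j in nlp(i)] for i in tqdm(df['subjects'].values)]
# gensim 4 renamed size to vector_size and moved vector lookups to model.wv
model = gensim.models.Word2Vec(sentences, min_count=1,vector_size=128,workers=4)
X_s=[]
random_vectors=dict()
for tokens in sentences:
  x=np.zeros(128)
  for tok in tokens:
      try:
        random_vectors[tok]=model.wv[tok]
      except KeyError:
        # Unknown token: fall back to a random vector
        random_vectors[tok]=np.random.rand(128)
      x+=random_vectors[tok]
  X_s.append(x)
y=df['topic'].values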
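
The loop above represents each subject line as the sum of its token vectors, falling back to a random vector for unknown tokens. Here is a minimal sketch of that idea on a toy vocabulary; the vectors, the sentence_vector helper and the 4-dimensional size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 4  # toy size; the notebook uses 128

# Toy embedding table (values are made up)
toy_embeddings = {
    "gun": np.array([1.0, 0.0, 0.0, 0.0]),
    "control": np.array([0.0, 1.0, 0.0, 0.0]),
}
fallback = {}  # remembers the random vector assigned to each unknown token


def sentence_vector(tokens):
    """Sum the token vectors, using a cached random vector for unknown tokens."""
    total = np.zeros(DIM)
    for tok in tokens:
        if tok in toy_embeddings:
            total += toy_embeddings[tok]
        else:
            if tok not in fallback:
                fallback[tok] = rng.random(DIM)
            total += fallback[tok]
    return total


known = sentence_vector(["gun", "control"])
mixed = sentence_vector(["gun", "laws"])
print(known, mixed)
```

Caching the fallback means an unknown token gets the same vector every time it appears, so repeated words still contribute consistently.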

In [None]:
random_vectors.keys()

## **Generating Sense2vec Embeddings**

In [None]:
from tqdm.notebook import tqdm
import numpy as np
X=[]
for i in tqdm(df['subjects'].values):
  doc=nlp(i)
  x=np.zeros(128)
  for token in doc:
    try:
      x+=token._.s2v_vec
    except (ValueError,TypeError,KeyError) as e:
      # No sense2vec vector for this token: reuse the vector stored in the word2vec step
      x+=random_vectors[token.text.lower()]
      # Alternative: keep one random vector per sense instead
      # sense=token.text+'|'+token.pos_
      # try:
      #   x+=random_vectors[sense]
      # except:
      #   random_vectors[sense]=np.random.rand(128)
      #   x+=random_vectors[sense]
  X.append(x)
y=df['topic'].values

In [None]:
X_s=np.array(X_s)
X=np.array(X)
X_s.shape,X.shape,y.shape

## **Simple DNN model**

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization
# Small feed-forward classifier, trained below on the sense2vec sentence vectors
model=Sequential()

model.add(Dense(64,input_dim=128,activation='relu'))
model.add(BatchNormalization())
model.add(Dense(32,activation='relu'))
model.add(BatchNormalization())
model.add(Dense(1,activation='sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer= tf.keras.optimizers.Adam(),
              metrics=['accuracy'])

model.summary()
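
As a sanity check on the summary, the parameter counts of this network can be worked out by hand. The sketch below assumes the standard Keras conventions: a Dense layer has inputs × units weights plus one bias per unit, and BatchNormalization keeps four values per feature, of which the moving mean and moving variance are non-trainable.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def batchnorm_params(n_features):
    """Gamma, beta, moving mean and moving variance per feature."""
    return 4 * n_features


layer_params = [
    dense_params(128, 64),
    batchnorm_params(64),
    dense_params(64, 32),
    batchnorm_params(32),
    dense_params(32, 1),
]
total = sum(layer_params)
non_trainable = 2 * 64 + 2 * 32  # moving statistics of the two BatchNormalization layers
trainable = total - non_trainable
print(layer_params, total, trainable, non_trainable)
```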

In [None]:
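class_map={'mideast':0,'guns':1}
Y=np.array(list(map(lambda x:class_map[x],y)))
# validation_split holds out the last 20% of rows before shuffling, and the rows are
# ordered by topic, so shuffle once with a fixed seed before fitting
perm=np.random.RandomState(0).permutation(len(Y))
history=model.fit(X[perm], Y[perm], epochs=30, batch_size=16,validation_split=0.2)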
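
Keras takes the validation_split fraction from the end of the arrays, before any shuffling. The sketch below uses made-up labels sorted by class, as the DataFrame built earlier is, to show why the row order matters for such a split; the class counts are illustrative.

```python
import numpy as np

# Made-up labels sorted by class, like a frame built as mideast rows followed by guns rows
labels = np.array([0] * 60 + [1] * 40)
n_val = int(len(labels) * 0.2)

# Holding out the last 20% of the sorted labels yields a single class
tail = labels[-n_val:]

# Shuffling first gives a hold-out set drawn from the whole data set
perm = np.random.RandomState(0).permutation(len(labels))
shuffled_tail = labels[perm][-n_val:]

print(np.unique(tail), np.unique(shuffled_tail))
```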

In [None]:
import matplotlib.pyplot as plt
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.show()

In [None]:
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.show()

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization
# Fresh model with the same architecture, trained below on the word2vec sentence vectors
model=Sequential()

model.add(Dense(64,input_dim=128,activation='relu'))
model.add(BatchNormalization())
model.add(Dense(32,activation='relu'))
model.add(BatchNormalization())
model.add(Dense(1,activation='sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer= tf.keras.optimizers.Adam(),
              metrics=['accuracy'])

model.summary()

In [None]:
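class_map={'mideast':0,'guns':1}
Y=np.array(list(map(lambda x:class_map[x],y)))
# validation_split holds out the last 20% of rows before shuffling, and the rows are
# ordered by topic, so shuffle once with a fixed seed before fitting
perm=np.random.RandomState(0).permutation(len(Y))
history_s=model.fit(X_s[perm], Y[perm], epochs=30, batch_size=16,validation_split=0.2)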

In [None]:
plt.plot(history_s.history['loss'])
plt.plot(history_s.history['val_loss'])
plt.show()

In [None]:
import matplotlib.pyplot as plt
import matplotlib
# 'normal' is not a font family name; use a generic family instead
font = {'family' : 'sans-serif',
        'size'   : 22}
matplotlib.rc('font', **font)
plt.figure(figsize=(15,8))
plt.plot(history_s.history['accuracy'],label='word2vec_train')
plt.plot(history_s.history['val_accuracy'],label='word2vec_validation')
plt.plot(history.history['accuracy'],label='sense2vec_train')
plt.plot(history.history['val_accuracy'],label='sense2vec_validation')
plt.legend(bbox_to_anchor=(0.7, 0.7))
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
# Tighten the layout once the figure is drawn, and save before showing it
plt.tight_layout()
plt.savefig('foo.png')
plt.show()

# **Related Articles:**

> * [Guide to Sense2vec](https://analyticsindiamag.com/guide-to-sense2vec-contextually-keyed-word-vectors-for-nlp/)

> * [Download Twitter Data and Analyze](https://analyticsindiamag.com/hands-on-guide-to-download-analyze-and-visualize-twitter-data/)

> * [Sentiment Analysis using LSTM](https://analyticsindiamag.com/how-to-implement-lstm-rnn-network-for-sentiment-analysis/)

> * [VADER Sentiment Analysis](https://analyticsindiamag.com/sentiment-analysis-made-easy-using-vader/)

> * [Polyglot](https://analyticsindiamag.com/hands-on-tutorial-on-polyglot-python-toolkit-for-multilingual-nlp-applications/)

> * [Textblob](https://analyticsindiamag.com/lets-learn-textblob-quickstart-a-python-library-for-processing-textual-data/)
