<a href="https://colab.research.google.com/github/ruman-shaikh/NLP_Project_Grp_1/blob/master/SBERT_Exploration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SBERT Exploration
This notebook explores how SBERT (Sentence-BERT), the sentence transformer, works.

In [None]:
%pip install -U sentence-transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


### Quickstart

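 - Sentences are passed as a list of strings.
 - The model encodes them into embeddings.

The output embeddings for all tokens of a sentence are averaged (mean pooling) to yield a fixed-size vector.

This means every sentence maps to a vector of the same length, however many tokens it contains, so passing the complete dataset returns a single array with one fixed-length row per sentence.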
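
To make the averaging step concrete, here is a minimal, dependency-free sketch of mean pooling over token vectors. The helper `mean_pool` and the toy 4-dimensional token vectors are hypothetical illustrations, not part of the sentence-transformers API; the assumption is that the real model applies the same idea to its own token embeddings and skips padding positions.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("at least one token must be unmasked")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy example: two "sentences" with different token counts, padded to length 3.
short_sentence = [[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0], [0.0, 0.0, 0.0, 0.0]]
short_mask = [1, 1, 0]  # the last position is padding
long_sentence = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0], [2.0, 2.0, 2.0, 2.0]]
long_mask = [1, 1, 1]

pooled_short = mean_pool(short_sentence, short_mask)
pooled_long = mean_pool(long_sentence, long_mask)

# Both results have the same length even though the token counts differ.
print(pooled_short)  # [2.0, 3.0, 4.0, 5.0]
print(pooled_long)   # [1.0, 1.0, 1.0, 1.0]
```

The length of the pooled vector depends only on the width of the token vectors, never on how many tokens the sentence has.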

In [None]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')

#The sentences we would like to encode
sentences = ['This framework generates embeddings for each input sentence',
    'Sentences are passed as a list of string.', 
    'The quick brown fox jumps over the lazy dog.']

#Sentences are encoded by calling model.encode()
sentence_embeddings = model.encode(sentences)

#Print the embeddings
for sentence, embedding in zip(sentences, sentence_embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")

Sentence: This framework generates embeddings for each input sentence
Embedding: [-1.37173869e-02 -4.28515635e-02 -1.56286228e-02  1.40537461e-02
  3.95537652e-02  1.21796295e-01  2.94334088e-02 -3.17523777e-02
  3.54959406e-02 -7.93140680e-02  1.75878406e-02 -4.04369719e-02
  4.97259721e-02  2.54912619e-02 -7.18699619e-02  8.14968720e-02
  1.47071050e-03  4.79627214e-02 -4.50335704e-02 -9.92174968e-02
 -2.81769242e-02  6.45046085e-02  4.44670431e-02 -4.76217046e-02
 -3.52952778e-02  4.38671745e-02 -5.28565943e-02  4.33040143e-04
  1.01921469e-01  1.64072644e-02  3.26996557e-02 -3.45986895e-02
  1.21339085e-02  7.94871375e-02  4.58342163e-03  1.57778263e-02
 -9.68207140e-03  2.87626218e-02 -5.05806357e-02 -1.55794173e-02
 -2.87907124e-02 -9.62278899e-03  3.15556526e-02  2.27349363e-02
  8.71449560e-02 -3.85027602e-02 -8.84718895e-02 -8.75496492e-03
 -2.12343410e-02  2.08923835e-02 -9.02078152e-02 -5.25732413e-02
 -1.05638523e-02  2.88311206e-02 -1.61454454e-02  6.17841119e-03
 -1.23234

In [None]:
sentence_embeddings.shape, type(sentence_embeddings)

((3, 384), numpy.ndarray)

### Comparing Sentence Similarities

The sentences (texts) are mapped such that sentences with similar meanings are close in vector space. One common method to measure the similarity in vector space is to use cosine similarity. 

In [None]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-MiniLM-L6-v2')

#Sentences are encoded by calling model.encode()
emb1 = model.encode("This is a red cat with a hat.")
emb2 = model.encode("Have you seen my red cat?")

cos_sim = util.cos_sim(emb1, emb2)
print("Cosine-Similarity:", cos_sim)

Cosine-Similarity: tensor([[0.6153]])
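
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it by hand on toy 3-dimensional vectors; `cosine_similarity` is a hypothetical helper written only for illustration, whereas the notebook itself relies on `util.cos_sim`.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
opposite = cosine_similarity([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])

print(round(same_direction, 6))  # 1.0
print(round(orthogonal, 6))      # 0.0
print(round(opposite, 6))        # -1.0
```

The score ranges from -1 (opposite directions) through 0 (orthogonal) to 1 (same direction), so the 0.6153 printed above suggests the two cat sentences are related but not equivalent in meaning.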


In [None]:
emb1.shape, emb2.shape

((384,), (384,))

Figure out a way to store the embeddings on Drive.

In [None]:
#The sentences we would like to encode
sentences = ['This framework generates embeddings for each input sentence',
    'Sentences are passed as a list of string.', 
    'The quick brown fox jumps over the lazy dog.',]

#Sentences are encoded by calling model.encode()
sentence_embeddings = model.encode(sentences)

In [None]:
from google.colab import drive
import pandas as pd
drive.mount('/gdrive')
%cd /gdrive/Shareddrives/NMA_DL_Dolma_1/Datasets/Sentiment140


Drive already mounted at /gdrive; to attempt to forcibly remount, call drive.mount("/gdrive", force_remount=True).
/gdrive/Shareddrives/NMA_DL_Dolma_1/Datasets/Sentiment140


In [None]:
#header_list = ["polarity", "id", "date", "query", "user", "text"]
df=pd.read_csv("trainingclean.csv",encoding='latin-1')
#df.head()


In [None]:
print(type(df['text']))
data=df['text'].tolist()

<class 'pandas.core.series.Series'>


In [None]:
sentence_embeddings = model.encode(data)

In [None]:
type(sentence_embeddings)
from google.colab import drive
drive.mount('/gdrive')
%cd /gdrive/Shareddrives/NMA_DL_Dolma_1/Datasets/Sentiment140

import numpy
#savetxt writes the embeddings as plain text, one row of values per sentence
numpy.savetxt("sentence_embeddings_traindataset.txt",sentence_embeddings)

Drive already mounted at /gdrive; to attempt to forcibly remount, call drive.mount("/gdrive", force_remount=True).
/gdrive/Shareddrives/NMA_DL_Dolma_1/Datasets/Sentiment140


In [None]:
import pickle
with open('embeddings.pkl', "wb") as fOut:
    pickle.dump({'sentences': data, 'embeddings': sentence_embeddings}, fOut, protocol=pickle.HIGHEST_PROTOCOL)

In [None]:
"""
for sentence, embedding in zip(data, sentence_embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")
"""

'\nfor sentence, embedding in zip(data, sentence_embeddings):\n    print("Sentence:", sentence)\n    print("Embedding:", embedding)\n    print("")\n'

In [None]:
%pwd

'/content'

In [None]:
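#Assumes Drive is mounted at /content/drive in this session; the cells above mount it at /gdrive, so adjust the prefix to match
path = "/content/drive/Shareddrives/NMA_DL_Dolma_1/Datasets/Sentiment140/embeddings.pkl"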

In [None]:
import pickle

with open(path, "rb") as fIn:
    stored_data = pickle.load(fIn)
    stored_sentences = stored_data['sentences']
    stored_embeddings = stored_data['embeddings']




In [None]:
print(type(stored_data), stored_data.keys())

<class 'dict'> dict_keys(['sentences', 'embeddings'])


In [None]:
print(type(stored_data['embeddings'][0]))

stored_data['embeddings'][0].shape

<class 'numpy.ndarray'>


(384,)
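
Because the pickle stores sentences and embeddings as two parallel sequences, it can be worth checking after loading that they still line up. The following is a minimal sketch of such a round trip, using a temporary file and made-up 3-dimensional vectors in place of real embeddings. The helper names `save_embeddings` and `load_embeddings` are hypothetical and not part of any library.

```python
import os
import pickle
import tempfile


def save_embeddings(path, sentences, embeddings):
    """Pickle sentences and embeddings together, using the same keys as the notebook."""
    with open(path, "wb") as f_out:
        pickle.dump({"sentences": sentences, "embeddings": embeddings},
                    f_out, protocol=pickle.HIGHEST_PROTOCOL)


def load_embeddings(path):
    """Load the pickle and confirm there is exactly one embedding per sentence."""
    with open(path, "rb") as f_in:
        stored = pickle.load(f_in)
    if len(stored["sentences"]) != len(stored["embeddings"]):
        raise ValueError("sentences and embeddings are out of sync")
    return stored["sentences"], stored["embeddings"]


# Placeholder data standing in for real sentences and model embeddings.
toy_sentences = ["first sentence", "second sentence"]
toy_embeddings = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "embeddings.pkl")
    save_embeddings(toy_path, toy_sentences, toy_embeddings)
    loaded_sentences, loaded_embeddings = load_embeddings(toy_path)

print(loaded_sentences == toy_sentences)    # True
print(loaded_embeddings == toy_embeddings)  # True
```

Loading a pickle can execute arbitrary code, so this approach is best kept to files you wrote yourself.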