# RoBERTa Document Embeddings

Document embeddings represent each text as a fixed-length numeric vector produced by a pretrained language model, in this case RoBERTa-large (through the `all-roberta-large-v1` sentence-transformers model). Each vector has 1024 dimensions. These are features the model learned from a large corpus of text, not principal components in the PCA sense. We can convert our text data into these vectors and use them to differentiate text styles for text comprehension.
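
To make "differentiating texts by their vectors" concrete, here is a minimal sketch of cosine similarity, the usual way to compare two embeddings. The 4-dimensional vectors and the helper name `cosine_sim` are made up for illustration; the real embeddings in this notebook have 1024 dimensions.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional stand-ins for real 1024-dimensional embeddings (made-up values).
doc_a = np.array([0.9, 0.1, 0.0, 0.2])
doc_b = np.array([0.8, 0.2, 0.1, 0.3])    # points in a similar direction to doc_a
doc_c = np.array([-0.1, 0.9, 0.7, -0.3])  # points in a different direction

sim_ab = cosine_sim(doc_a, doc_b)
sim_ac = cosine_sim(doc_a, doc_c)
print(f"a vs b: {sim_ab:.3f}")
print(f"a vs c: {sim_ac:.3f}")
```

Texts whose embeddings point in similar directions score close to 1, while unrelated texts score near 0 or below. The `cosine_similarity` function imported from scikit-learn in the setup below computes the same quantity for whole matrices at once.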

## Setup

In [131]:
# !pip install -U sentence-transformers
# !pip install torch  # the PyTorch package is published on PyPI as `torch`
# !pip install pytorch-lightning
# !pip install typing-extensions --upgrade
# !pip install tensorboard

In [1]:
import pandas as pd
import numpy as np

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import datetime

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

## RoBERTa Embedding Extraction

In [1033]:
# Loading raw tweet data
data = pd.read_csv('data/Schmidhuber_follower_tweets.csv', index_col=0)

In [1034]:
# extracting tweets to a list of documents
docs = data.tweet.tolist()

In [1035]:
# Computing sentence embeddings with all-roberta-large-v1
# Takes hours to run this cell!
start = datetime.datetime.now()
model = SentenceTransformer('sentence-transformers/all-roberta-large-v1')
embeddings = model.encode(docs)

end = datetime.datetime.now()
elapsed = end - start
# Note: the timer covers model loading and encoding (inference); no training happens here
print('Training took a total of {}'.format(elapsed))

Training took a total of 2:18:13.869240
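
Because a single `encode` call over all documents runs for hours, it can help to process the list in chunks so progress is visible. The following is a sketch only, not what was run above: `encode_in_chunks` and `StubModel` are hypothetical names, and the stub merely stands in for the real model so the example runs without a download. `batch_size` is an existing parameter of `SentenceTransformer.encode`.

```python
import numpy as np

def encode_in_chunks(model, docs, chunk_size=1000, batch_size=32):
    """Encode a non-empty list of docs chunk by chunk; return one (n_docs, dim) array.

    `model` is any object with a SentenceTransformer-style `encode` method.
    """
    parts = []
    for start in range(0, len(docs), chunk_size):
        chunk = docs[start:start + chunk_size]
        parts.append(np.asarray(model.encode(chunk, batch_size=batch_size)))
        done = min(start + chunk_size, len(docs))
        print(f"encoded {done}/{len(docs)} documents")
    return np.vstack(parts)


class StubModel:
    """Hypothetical stand-in so the sketch runs without downloading a model."""

    def encode(self, sentences, batch_size=32):
        # Fake 2-dimensional "embedding": [text length, 1.0]
        return np.array([[len(s), 1.0] for s in sentences], dtype=float)


toy_docs = ["a", "bb", "ccc", "dddd", "eeeee"]
toy_out = encode_in_chunks(StubModel(), toy_docs, chunk_size=2)
print(toy_out.shape)
```

With the real model, the call would be `encode_in_chunks(model, docs)`. Each chunk's array could also be written to disk inside the loop, so an interrupted run does not lose everything.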


In [1036]:
# Check shape to see if it worked - new data should have 1024 features
embeddings.shape

(17092, 1024)

In [1037]:
# Saving our embeddings
pd.DataFrame(embeddings).to_csv('embeddings.csv')
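
The small sketch below shows what the CSV round trip does to the data, using a random toy matrix in place of the real embeddings (the file name and values are illustrative only). `to_csv` writes the row index as an unnamed first column, so the file should be read back with `index_col=0`, and the column labels come back as strings rather than integers.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Small random stand-in for the real (17092, 1024) embedding matrix.
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(5, 8)).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "toy_embeddings.csv")
    pd.DataFrame(toy_embeddings).to_csv(csv_path)
    # index_col=0 restores the row index instead of creating an "Unnamed: 0" column
    reloaded = pd.read_csv(csv_path, index_col=0)

print(reloaded.shape)
print(reloaded.columns[:3].tolist())
```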

This pipeline was run twice to collect embeddings from our Twitter data: once for the accounts Andrew Ng follows, and once for a few thousand accounts that follow Schmidhuber AI.

## Combining document embedding data

Here we'll combine all of our document embeddings into one larger dataset, which then gets fed into our main notebook.

In [1046]:
# Combining datasets - note that we are appending Schmidhuber's followers to the bottom of the Ng embeddings dataset.
# This order needs to stay consistent across data prep pipelines
emb = pd.read_csv('data/ng_embeddings.csv', index_col=0)
sch_emb = pd.read_csv('data/schmid_follower_embeddings.csv', index_col=0)
# DataFrame.append was removed in pandas 2.0; pd.concat with ignore_index=True
# stacks the rows in the same order and renumbers the index from 0
emb = pd.concat([emb, sch_emb], ignore_index=True)
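
Since downstream steps rely on the row order, a quick sanity check on the stacked result can be useful. This is a minimal sketch on toy tables with made-up values, using `pd.concat` (the replacement for `DataFrame.append`, which was removed in pandas 2.0); the names `ng_toy` and `sch_toy` are illustrative stand-ins for the two embedding files.

```python
import pandas as pd

# Toy stand-ins for the two embedding tables (values are made up).
ng_toy = pd.DataFrame([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
sch_toy = pd.DataFrame([[0.7, 0.8], [0.9, 1.0]])

# Stack the second table under the first and renumber the index from 0.
combined = pd.concat([ng_toy, sch_toy], ignore_index=True)

# Rows 0..n_ng-1 come from the first table, the remaining rows from the second.
n_ng = len(ng_toy)
assert len(combined) == n_ng + len(sch_toy)
assert combined.iloc[:n_ng].equals(ng_toy)
assert combined.iloc[n_ng:].reset_index(drop=True).equals(sch_toy)
print(combined.index.tolist())
```

Keeping the length of the first table (`n_ng` here) around makes it possible to recover which rows belong to which source later on.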

In [1047]:
emb

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 1014 | 1015 | 1016 | 1017 | 1018 | 1019 | 1020 | 1021 | 1022 | 1023 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.005032 | -0.030947 | -0.006062 | 0.047324 | -0.055165 | -0.057167 | -0.030257 | -0.031478 | 0.055101 | 0.022555 | ... | -0.010593 | 0.060685 | 0.012057 | 0.050567 | -0.049132 | -0.028009 | -0.038905 | -0.025987 | 0.004756 | -0.034641 |
| 1 | -0.017376 | -0.004529 | -0.010203 | 0.036250 | -0.015020 | 0.017352 | -0.022352 | -0.022748 | 0.032929 | 0.027165 | ... | 0.004835 | 0.057194 | 0.040557 | 0.032773 | -0.031219 | -0.043274 | -0.021079 | -0.008907 | 0.009224 | -0.024085 |
| 2 | -0.042825 | -0.020346 | -0.008057 | 0.014741 | -0.019099 | -0.007700 | -0.026105 | -0.042662 | 0.036060 | 0.011172 | ... | 0.024967 | 0.033303 | 0.043724 | 0.027738 | -0.062944 | -0.028252 | -0.003424 | 0.017833 | 0.011276 | 0.048103 |
| 3 | -0.029136 | -0.005314 | -0.010270 | 0.025464 | -0.048004 | -0.013484 | -0.025751 | -0.014573 | 0.034354 | 0.032357 | ... | -0.015721 | 0.045449 | 0.053799 | 0.016852 | -0.042974 | 0.010857 | -0.026327 | 0.011204 | 0.011505 | -0.007473 |
| 4 | -0.003711 | -0.025502 | 0.000929 | 0.010299 | -0.059911 | -0.034888 | 0.006409 | -0.052834 | 0.039980 | 0.018004 | ... | -0.005870 | 0.020939 | 0.026160 | 0.017902 | -0.059797 | -0.011837 | -0.018208 | 0.003696 | 0.029591 | 0.014005 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 63028 | 0.033232 | -0.011005 | -0.022178 | 0.011304 | -0.031818 | 0.014010 | 0.007922 | -0.035789 | 0.045437 | 0.034156 | ... | -0.004720 | 0.012849 | 0.029085 | 0.038972 | -0.076218 | -0.001516 | -0.020062 | -0.001641 | -0.001599 | -0.023842 |
| 63029 | 0.020552 | 0.017289 | -0.019154 | -0.013074 | -0.035824 | -0.012427 | 0.039592 | -0.045618 | -0.032530 | 0.044708 | ... | -0.056745 | 0.031100 | 0.024852 | 0.045971 | -0.035301 | 0.019500 | -0.039873 | -0.011782 | -0.006620 | 0.030537 |
| 63030 | 0.019945 | 0.026646 | -0.028530 | 0.012934 | -0.051110 | 0.043488 | -0.024095 | -0.031269 | -0.003725 | -0.031914 | ... | -0.019941 | -0.032977 | 0.036366 | 0.017026 | -0.040759 | -0.003935 | 0.059830 | 0.008402 | 0.001396 | 0.009081 |
| 63031 | 0.009577 | -0.034958 | -0.016604 | 0.007820 | -0.013977 | 0.014787 | 0.039922 | -0.021331 | 0.009933 | -0.029053 | ... | -0.025755 | -0.037073 | 0.016418 | -0.007736 | -0.030964 | 0.064336 | -0.009551 | -0.006510 | 0.027406 | -0.011280 |


In [1048]:
# Saving our combined embeddings
emb.to_csv('combined_document_embeddings.csv')

:)