### Snorkel Validation 1
In this notebook, we will use a modified version of `snorkel_original` with 3 label functions and a sample of 200 research articles. Our goal will be to validate the snorkel labeling functionality with semantic similarity at a smaller scale. 

In [1]:
from snorkel.labeling import labeling_function
from snorkel.labeling.model import LabelModel
from snorkel.labeling import PandasLFApplier
import pandas as pd
from sentence_transformers import SentenceTransformer, util

  from tqdm.autonotebook import tqdm, trange


In [37]:
df_defi = pd.read_excel("data/research_defs.xlsx")
df = pd.read_csv("data/text-classification-train.csv")
df.head()

Unnamed: 0,ID,TITLE,ABSTRACT,Computer Science,Physics,Mathematics,Statistics,Quantitative Biology,Quantitative Finance
0,1,Reconstructing Subject-Specific Effect Maps,Predictive models allow subject-specific inf...,1,0,0,0,0,0
1,2,Rotation Invariance Neural Network,Rotation invariance and translation invarian...,1,0,0,0,0,0
2,3,Spherical polyharmonics and Poisson kernels fo...,We introduce and develop the notion of spher...,0,0,1,0,0,0
3,4,A finite element approximation for the stochas...,The stochastic Landau--Lifshitz--Gilbert (LL...,0,0,1,0,0,0
4,5,Comparative study of Discrete Wavelet Transfor...,Fourier-transform infra-red (FTIR) spectra o...,1,0,0,1,0,0


In our initial proof of concept stage, we will solely look at computer science. Since this is a multi-label classification problem, we would need to perform the below steps for all of the fields, but we will begin with just computer science.

In [86]:
df_labeled = pd.DataFrame()
df_labeled['abstract'] = df['ABSTRACT'][:200] # Make the labeled only the first 200

computer_science = 1 # Identified as a CS research article
ABSTAIN = 0 # Not identified as a CS research article

i=0

In our labeling functions, we hardcode the index to search as 0, which corresponds to the definition for computer science. Note that the threshold in each of these is 0.5, which is somewhat arbitrary.

In [87]:
def lf_def_1_score(x):
  i = 0
  embedder = SentenceTransformer('multi-qa-distilbert-cos-v1')
  list_key = df_defi['DefinitionGPT'].iloc[i] # Definition of field
  def_embedding = embedder.encode(list_key, convert_to_tensor=True)
        # Convert field definition into vector space (tensor) 
  corpus_embeddings = embedder.encode(x, convert_to_tensor=True)
  score = util.pytorch_cos_sim(def_embedding, corpus_embeddings)[0]
  return score

def lf_def_2_score(x):
  i = 0
  embedder = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
  list_key = df_defi['Definition'].iloc[i]
  def_embedding = embedder.encode(list_key, convert_to_tensor=True)
  corpus_embeddings = embedder.encode(x, convert_to_tensor=True)  
  score = util.pytorch_cos_sim(def_embedding, corpus_embeddings)[0]
  return score

def lf_def_3_score(x):
  i = 0
  embedder = SentenceTransformer('bert-base-nli-stsb-mean-tokens')
  list_key = df_defi['Definition'].iloc[i]
  def_embedding = embedder.encode(list_key, convert_to_tensor=True)
  corpus_embeddings = embedder.encode(x, convert_to_tensor=True)  
  score = util.pytorch_cos_sim(def_embedding, corpus_embeddings)[0]
  return score

Generate semantic similarity scores and store them in a dataframe

In [88]:
df_scores = pd.DataFrame()
df_scores['abstract'] = df_labeled['abstract']

In [90]:
df_scores['LF_1_CS'] = df_scores['abstract'].apply(lf_def_1_score)



In [115]:
def convert_tensor(x):
    return x.numpy()[0]

In [105]:
df_scores['LF_2_CS'] = df_scores['abstract'].apply(lf_def_2_score)



In [106]:
df_scores['LF_3_CS'] = df_scores['abstract'].apply(lf_def_3_score)



In [116]:
df_scores['LF_2_CS'] = df_scores['LF_2_CS'].apply(convert_tensor)

In [117]:
df_scores['LF_3_CS'] = df_scores['LF_3_CS'].apply(convert_tensor)

In [142]:
@labeling_function()
def lf_def_1(x):
  i = 0
  embedder = SentenceTransformer('multi-qa-distilbert-cos-v1')
  list_key = df_defi['DefinitionGPT'].iloc[i] # Definition of field
  def_embedding = embedder.encode(list_key, convert_to_tensor=True)
        # Convert field definition into vector space (tensor) 
  corpus_embeddings = embedder.encode(x, convert_to_tensor=True)
  score = util.pytorch_cos_sim(def_embedding, corpus_embeddings)[0]
    
  if score > 0.5:
   return computer_science  
  return ABSTAIN


@labeling_function()
def lf_def_2(x):
  i = 0
  embedder = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
  list_key = df_defi['Definition'].iloc[i]
  def_embedding = embedder.encode(list_key, convert_to_tensor=True)
  corpus_embeddings = embedder.encode(x, convert_to_tensor=True)  
  score = util.pytorch_cos_sim(def_embedding, corpus_embeddings)[0]
    
  if score > 0.5:
   return computer_science  
  return ABSTAIN

@labeling_function()
def lf_def_3(x):
  i = 0
  embedder = SentenceTransformer('bert-base-nli-stsb-mean-tokens')
  list_key = df_defi['Definition'].iloc[i]
  def_embedding = embedder.encode(list_key, convert_to_tensor=True)
  corpus_embeddings = embedder.encode(x, convert_to_tensor=True)  
  score = util.pytorch_cos_sim(def_embedding, corpus_embeddings)[0]

  if score > 0.5:
   return computer_science  
  return ABSTAIN

In [119]:
# Define the set of labeling functions (LFs)
lfs = [lf_def_1, lf_def_2, lf_def_3]

We now define labels (CS or not CS) for the first 200 items:

In [134]:
df_labeled

Unnamed: 0,abstract
0,Predictive models allow subject-specific inf...
1,Rotation invariance and translation invarian...
2,We introduce and develop the notion of spher...
3,The stochastic Landau--Lifshitz--Gilbert (LL...
4,Fourier-transform infra-red (FTIR) spectra o...
...,...
195,We relate the concepts used in decentralized...
196,Time-varying network topologies can deeply i...
197,A long-standing obstacle to progress in deep...
198,We study the band structure topology and eng...


In [141]:
# Apply the LFs to the unlabeled training data
applier = PandasLFApplier(lfs)
L_train = applier.apply(df_labeled)

  0%|▍                                                                               | 1/200 [00:00<00:00, 1002.94it/s]


ValueError: Can only compare identically-labeled Series objects

In [62]:
L_train

array([[0, 0, 0],
       [0, 0, 0],
       [0, 0, 0],
       [0, 0, 0],
       [0, 0, 0],
       [0, 0, 1],
       [0, 0, 0],
       [0, 0, 0],
       [0, 0, 0],
       [0, 0, 0],
       [0, 0, 1],
       [0, 0, 0],
       [0, 0, 0],
       [0, 0, 0],
       [0, 0, 0],
       [0, 0, 1],
       [0, 0, 0],
       [0, 0, 0],
       [0, 0, 0],
       [0, 0, 1],
       [0, 0, 0],
       [0, 0, 0],
       [0, 0, 0],
       [0, 0, 0],
       [0, 0, 1],
       [0, 0, 0],
       [0, 0, 0],
       [0, 0, 0],
       [0, 0, 0],
       [0, 0, 0],
       [0, 0, 0],
       [0, 0, 0],
       [0, 0, 0],
       [0, 0, 0],
       [0, 0, 0],
       [0, 0, 0],
       [0, 0, 0],
       [0, 0, 0],
       [0, 0, 0],
       [0, 0, 0]])