<a href="https://colab.research.google.com/github/tmnestor/weak_supervision/blob/main/wre_snorkling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [17]:
# #@title colab setup steps (takes about 12 minutes)
# from google.colab import drive
# # drive.mount('/content/drive')
# drive.mount('/content/drive')
# ![[ -d "/content/drive/MyDrive/Colab Notebooks/weak_supervision/" ]] && rm -rf "/content/drive/MyDrive/Colab Notebooks/weak_supervision/"
# %mkdir '/content/drive/MyDrive/Colab Notebooks/weak_supervision/'
# %cd '/content/drive/MyDrive/Colab Notebooks/weak_supervision/'
# !git clone https://github.com/tmnestor/weak_supervision.git .
# !pip3 install -r requirements.txt
# %load_ext google.colab.data_table
# !pip3 install colabcode

In [18]:
import snorkel
print(f"{snorkel.__version__=}")

snorkel.__version__='0.9.9'


## Load IHC_dict (keys in Cr_Expns_order)
Two case statements in IHC - 2021 v19 20221014 have been recoded programmatically:
- the second appearance of BH has been recoded as BH1 because it has different logic, but both keys map to 9
- the second appearance of Y has been recoded as Y1 because it has different logic, but both keys map to 96
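
The recoding described above can be sketched as follows. This is only an illustration of the idea, not the script that produced `IHC_dict`; the helper name `recode_duplicates` and the sample key list are assumptions.

```python
def recode_duplicates(keys):
    """Rename repeated keys by appending a counter: BH, BH -> BH, BH1."""
    seen = {}
    recoded = []
    for key in keys:
        count = seen.get(key, 0)
        # first appearance keeps its name, later ones get a numeric suffix
        recoded.append(key if count == 0 else f"{key}{count}")
        seen[key] = count + 1
    return recoded


# hypothetical key sequence containing the two repeated codes
print(recode_duplicates(["A", "BH", "Y", "BH", "Y"]))
# ['A', 'BH', 'Y', 'BH1', 'Y1']
```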


The IHC has two separate hierarchies that map 101 LFs to 99 labels:
- One hierarchy for Cr_Expns_Descn_Txt
- Another hierarchy for Trvl_Expns_Descn_Txt, Self_Educn_Expns_Descn_Txt and WRE_Othr_Expns_Descn_Txt

N.B. "DL: Other_Work_Sites" was commented out of the IHC on 20190213, so it is not processed further in this script.


In [19]:
import json
import re
infile = 'data/IHC_dict.txt'

# read text description of case statement dictionary from file
with open(infile, 'r') as f:
    dictionary_as_text = f.read()
# dictionary_as_text
# reconstruct the case statement dictionary from its text description
IHC_dict = json.loads(dictionary_as_text)

wre_codes = IHC_dict.keys()
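
For orientation, the sketch below shows the entry shape that the rest of the notebook appears to rely on: each code maps to pipe-separated `like_any` and `not_like_all` strings plus a `regexp` field. The keywords are taken from the BG sample shown later in this notebook; the empty `regexp` value and the exact file layout are assumptions.

```python
import json

# hypothetical one-entry excerpt; the real file holds one entry per WRE code
sample_text = """
{
  "BG": {
    "like_any": "decline|WRITE-OFF|LVP|write down",
    "not_like_all": "PARKING|non dep",
    "regexp": ""
  }
}
"""

sample_dict = json.loads(sample_text)

# keyword lists are stored as pipe-separated strings
in_words = sample_dict["BG"]["like_any"].split("|")
out_words = sample_dict["BG"]["not_like_all"].split("|")

print(in_words)   # ['decline', 'WRITE-OFF', 'LVP', 'write down']
print(out_words)  # ['PARKING', 'non dep']
```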


## Generate Synthetic Data

In [20]:
import random
def trim_and_split(s1, s2 = None):
  s3 = s1 + '|' + s2 if s2 else s1
  return s3

def get_synthetic_data(wre_code, SAMPLE_SIZE=20):
    """
    The first half (SAMPLE_SIZE//2) of the synthetic data is drawn from the
    in_words ('like_any' keywords) for the wre_code.
    The second half (SAMPLE_SIZE//2) is drawn from the out_words
    ('not_like_all' keywords); codes without out_words get in_words only.
    Padding with additional rubbish text is currently disabled (commented out).
    """
    
    rubbish_text = '''Lorem Ipsum is simply dummy text of the printing and 
  typesetting industry. Lorem Ipsum has been the industry's standard dummy 
  text ever since the 1500s, when an unknown printer took a galley of type 
  and scrambled it to make a type specimen book. It has survived not only 
  five centuries, but also the leap into electronic typesetting, remaining 
  essentially unchanged. It was popularised in the 1960s with the release of 
  Letraset sheets containing Lorem Ipsum passages, and more recently with 
  desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.
  Lorem ipsum dolor sit amet, consectetur adipiscing elit. Cras tristique commodo vulputate. 
  Morbi pulvinar iaculis ligula, sit amet tristique leo gravida nec. Proin eu mi a sapien 
  ornare vehicula. Maecenas in ligula eu est cursus placerat. Vestibulum non aliquam felis, 
  sit amet porta metus. Quisque auctor nec eros ac rhoncus. Quisque feugiat ut elit eu iaculis. 
  Nam vitae sagittis sem. Quisque posuere nisl lectus. Nullam ultrices ante ac libero consectetur 
  interdum. Phasellus sed ante tempus, porta sapien ut, dictum ligula. Pellentesque semper lacus 
  nec nisl dapibus, nec maximus libero commodo. Vivamus dignissim metus mi.
  In tempor, justo at placerat vulputate, ex turpis lobortis velit, et sollicitudin orci sem dapibus 
  massa. Aenean ac erat molestie, vulputate neque at, placerat turpis. Ut pharetra et purus id 
  porttitor. Suspendisse id diam ipsum. Quisque tincidunt bibendum purus ac ornare. Quisque viverra 
  eget libero fermentum sollicitudin. Nulla quis quam a mi porta maximus. Suspendisse quis ante vitae 
  felis pulvinar euismod sed vel nibh. Donec ornare mi id vehicula mattis. Etiam laoreet erat at ante 
  venenatis vehicula. Mauris elementum lorem sed vestibulum iaculis. Donec justo lacus, dapibus 
  interdum magna ut, molestie euismod enim. Curabitur ultrices malesuada ligula, ac consequat quam 
  tempus sed. Duis quis urna rhoncus ligula pellentesque laoreet. Phasellus fringilla ullamcorper orci.
  Suspendisse et gravida justo. Aenean faucibus maximus eleifend. Phasellus sit amet risus et felis 
  faucibus volutpat. Ut mollis varius ipsum at consequat. Donec hendrerit pretium ante et iaculis. 
  Proin iaculis pretium ultricies. Donec sagittis est eu lacus pretium, quis volutpat neque maximus. 
  Vivamus est enim, dignissim ornare libero ac, ultrices aliquam orci. Quisque mollis massa dapibus 
  urna facilisis, in sagittis dui ornare. Donec laoreet, mauris vitae efficitur elementum, lacus justo 
  vulputate quam, eget congue diam enim pretium turpis. Etiam id consequat quam.
  Aliquam ac elementum urna. Proin augue augue, bibendum eget neque rhoncus, ultrices interdum nisl. 
  Proin sit amet viverra purus. Proin eu auctor elit. Proin efficitur, risus ac cursus iaculis, felis 
  turpis pulvinar est, at egestas enim libero in erat. Quisque lobortis risus iaculis efficitur efficitur. 
  Vestibulum imperdiet elit vel euismod mattis. Mauris mollis ligula risus, finibus viverra lectus posuere 
  vitae. Mauris luctus vulputate lorem sed ullamcorper.
  Integer tempor, est nec pharetra euismod, odio justo facilisis dolor, a rutrum tellus enim ut nisl. 
  Sed eu rhoncus orci. Quisque mattis nulla lectus, at dapibus metus pulvinar eget. Phasellus nec porttitor 
  tortor, sit amet iaculis enim. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nam a luctus 
  lectus, elementum porta massa. Fusce euismod varius orci dictum pulvinar. Donec accumsan mauris et mi 
  dignissim, nec egestas quam porta. Etiam commodo rutrum tincidunt. Phasellus eget turpis lacus. 
  Donec vel orci felis. Praesent iaculis sapien a pharetra commodo. Proin cursus libero vel libero 
  suscipit, eu interdum odio porta. Donec sagittis libero at aliquam pulvinar. Morbi interdum venenatis 
  urna nec auctor. Mauris suscipit lacus rutrum placerat pulvinar.'''
    rubbish_words = rubbish_text.split()

    in_list = IHC_dict[wre_code]['like_any'].split('|')
    in_keywords = random.choices(in_list, k=SAMPLE_SIZE)

    if bool(IHC_dict[wre_code]['not_like_all']):
      out_list = IHC_dict[wre_code]['not_like_all'].split('|')
      out_keywords = random.choices(out_list, k=SAMPLE_SIZE//2)

    # rubbish_words = random.choices(rubbish_words, k=SAMPLE_SIZE//2)
    # additional_claim_text = rubbish_words+out_keywords

    # second half of the synthetic data includes exclusionary claims
    # synthetic_data = []
    # for item in zip(in_keywords, additional_claim_text):
    #     # muddy legitimate WRE claims with additional text
    #     synthetic_data.append(' '.join(item))

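    # build the sample from the SAMPLE_SIZE argument (not the global DATASET_SIZE):
    # half in_words and half out_words, or in_words only when there are no out_words
    if bool(IHC_dict[wre_code]['not_like_all']):
        synthetic_data = random.choices(in_list, k=SAMPLE_SIZE//2) + \
                         random.choices(out_list, k=SAMPLE_SIZE//2)
    else:
        synthetic_data = random.choices(in_list, k=SAMPLE_SIZE)

    return synthetic_data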

test_codes = ["DV","BG","A","B","E","CI","BS","DK","C","BP","DM","BH1","ZZZ","DO","DN","DP","DQ","DR","DS"]

WRE_CODE = random.choice(test_codes)

print("*"*80)
print(f"Test Code: {WRE_CODE}")
print("*"*80)


DATASET_SIZE = 20
synthetic_data = get_synthetic_data(WRE_CODE, DATASET_SIZE)


********************************************************************************
Test Code: BG
********************************************************************************
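
Because the test code and the keywords are drawn at random, each run of the cell above produces a different sample. A minimal sketch of a reproducible variant is given below; the helper name `sample_half_and_half` and the `seed` parameter are assumptions and are not part of the notebook.

```python
import random


def sample_half_and_half(in_list, out_list, size, seed=0):
    """Draw size//2 in-words followed by the remaining out-words, reproducibly."""
    rng = random.Random(seed)  # private generator, leaves the global state alone
    half = size // 2
    return rng.choices(in_list, k=half) + rng.choices(out_list, k=size - half)


in_list = ["decline", "LVP", "write down"]
out_list = ["PARKING", "non dep"]

data = sample_half_and_half(in_list, out_list, size=6, seed=42)
print(data)
```

Calling the function again with the same seed returns the same list, which makes a failing test code easier to revisit.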


In [21]:
synthetic_data

['decline',
 'WRITE-OFF',
 'derp of',
 'D',
 'DECLINE VALUE IN',
 'LVP',
 'LVP',
 'write down',
 'decl',
 'DEDUCTIBLE ADJUSTMENT',
 'PARKING',
 'PARKING',
 'PARKING',
 'PARKING',
 'PARKING',
 'non dep',
 'PARKING',
 'non dep',
 'PARKING',
 'non dep']

In [22]:
# Load EDA Pkgs
import pandas as pd

# store text in dataframe
snorkel_df = pd.DataFrame({'text':synthetic_data})
snorkel_df.head(10)

Unnamed: 0,text
0,decline
1,WRITE-OFF
2,derp of
3,D
4,DECLINE VALUE IN
5,LVP
6,LVP
7,write down
8,decl
9,DEDUCTIBLE ADJUSTMENT


## Set up IHC_dict key mapping

In [23]:
code_dict = {'ABSTAIN':-1}
inverse_code_dict = {-1: 'ABSTAIN'}
# labels are assigned numerical values in alphabetical order
# keys Y and Y1 share the same value, as do BH and BH1
i = -1
for k in sorted(IHC_dict.keys()):
    if k not in ['BH1', 'Y1']:
        i += 1
    print(f"{k}:{i}")
    code_dict[k] = i
    if k not in ['BH1', 'Y1']:
        inverse_code_dict[i] = k

A:0
AM:1
B:2
BA:3
BB:4
BC:5
BD:6
BF:7
BG:8
BH:9
BH1:9
BI:10
BJ:11
BK:12
BL:13
BM:14
BN:15
BO:16
BP:17
BQ:18
BR:19
BS:20
BT:21
BU:22
BV:23
BW:24
BX:25
BY:26
BZ:27
C:28
CA:29
CB:30
CC:31
CD:32
CE:33
CF:34
CG:35
CH:36
CI:37
CJ:38
CK:39
CL:40
CM:41
CN:42
CO:43
CP:44
CQ:45
CR:46
CS:47
CT:48
CU:49
CV:50
CW:51
CX:52
CY:53
CZ:54
D:55
DA:56
DB:57
DC:58
DD:59
DE:60
DF:61
DG:62
DH:63
DI:64
DJ:65
DK:66
DM:67
DN:68
DO:69
DP:70
DQ:71
DR:72
DS:73
DT:74
DV:75
E:76
F:77
G:78
H:79
I:80
J:81
K:82
L:83
M:84
N:85
O:86
P:87
Q:88
R:89
S:90
T:91
U:92
V:93
W:94
X:95
Y:96
Y1:96
Z:97
ZZZ:98
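
The loop above relies on `BH1` and `Y1` sorting immediately after `BH` and `Y`. An alternative sketch that makes the aliasing explicit is shown below; the function name `build_code_maps` and the `aliases` argument are assumptions introduced for illustration.

```python
def build_code_maps(keys, aliases):
    """Number the canonical keys alphabetically, then point each alias at its target."""
    code_dict = {"ABSTAIN": -1}
    inverse_code_dict = {-1: "ABSTAIN"}
    canonical = sorted(k for k in keys if k not in aliases)
    for i, key in enumerate(canonical):
        code_dict[key] = i
        inverse_code_dict[i] = key
    # aliases share the value of their target and never appear in the inverse map
    for alias, target in aliases.items():
        code_dict[alias] = code_dict[target]
    return code_dict, inverse_code_dict


# small hypothetical subset of the real keys
codes, inverse = build_code_maps(
    ["A", "BH", "BH1", "BI", "Y", "Y1"],
    aliases={"BH1": "BH", "Y1": "Y"},
)
print(codes)
print(inverse)
```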


## The Labelling Function Template

In [24]:
import re
from snorkel.labeling import LabelingFunction,PandasLFApplier

def preprocess_rgx(s1, s2 = None):
  s = trim_and_split(s1, s2) if bool(s2) else s1
  s = re.sub(r"\s\s+", " ", s).split("|")
  rgx="|".join(["\\b"+re.escape(i)+"\\b" for i in s])
  return rgx

# the labeling Function Factory
def make_keyword_lf(wre_code):
    
    rgx = preprocess_rgx(IHC_dict[wre_code]['like_any'],IHC_dict[wre_code]['regexp'])
    inword_search_ptrn = re.compile(rgx, flags=re.IGNORECASE).fullmatch
    
    if bool(IHC_dict[wre_code]['not_like_all']):
        rgx = preprocess_rgx(IHC_dict[wre_code]['not_like_all'])
        outword_search_ptrn = re.compile(rgx, flags=re.IGNORECASE).fullmatch
    else:
        outword_search_ptrn = None

    label = code_dict[wre_code]

    return LabelingFunction(
        name=f"keyword_{wre_code}",
        f=keyword_lookup,
        resources=dict(inword_search_ptrn = inword_search_ptrn, 
                       outword_search_ptrn=outword_search_ptrn, 
                       label=label),
    )

def keyword_lookup(x, inword_search_ptrn, outword_search_ptrn, label):
    # collapse runs of whitespace before matching
    search_text = re.sub(r"\s\s+", " ", x.text)
    if bool(outword_search_ptrn):
        # label only when an in-word matches and no out-word matches
        if inword_search_ptrn(search_text) and not outword_search_ptrn(search_text):
            return label
    elif inword_search_ptrn(search_text):
        return label
    return code_dict['ABSTAIN']
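
The cell above binds each compiled pattern's `fullmatch` method, so a text is labelled only when the whole string equals one of the keywords. The sketch below contrasts `fullmatch` with `search` on a hypothetical two-keyword list; `build_keyword_regex` is an illustrative stand-in for `preprocess_rgx`, not a replacement for it.

```python
import re


def build_keyword_regex(keywords):
    """Compile 'a|b c' into a case-insensitive pattern with word boundaries."""
    parts = [r"\b" + re.escape(k) + r"\b" for k in keywords.split("|")]
    return re.compile("|".join(parts), flags=re.IGNORECASE)


ptrn = build_keyword_regex("decline|write down")

print(bool(ptrn.fullmatch("DECLINE")))           # True: whole text is a keyword
print(bool(ptrn.fullmatch("decline in value")))  # False: extra text after the keyword
print(bool(ptrn.search("decline in value")))     # True: keyword found inside the text
print(bool(ptrn.search("declined")))             # False: \b blocks partial words
```

If claim descriptions are expected to contain text around the keywords, `search` may be the closer fit; that choice depends on the intended IHC semantics.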

In [25]:
# wre_codes = [A, AM, B, ..., X, Y, Z, ZZZ]
wre_codes = IHC_dict.keys()
# labels = wre_codes 

# Define the labelling functions
labelling_functions = {}
# for wre_code in labels:
for wre_code in wre_codes:
    # print(f"{wre_code=}")
    labelling_functions[wre_code] = make_keyword_lf(wre_code)

# lfs = [labelling_functions['A']] #DELETE

# assemble all 101 labelling functions into a list 
lfs = list(labelling_functions.values())

applier = PandasLFApplier(lfs=lfs)

L_train = applier.apply(df = snorkel_df)

snork_preds = pd.DataFrame(L_train, columns=list(wre_codes))

snork_preds = pd.concat([snorkel_df, snork_preds], axis=1)

snork_preds.head(10)

100%|██████████| 20/20 [00:00<00:00, 2042.66it/s]


Unnamed: 0,text,DV,BG,A,B,E,CI,BS,DK,C,...,BP,DM,BH1,ZZZ,DO,DN,DP,DQ,DR,DS
0,decline,-1,8,-1,-1,-1,-1,-1,-1,-1,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
1,WRITE-OFF,-1,8,-1,-1,-1,-1,-1,-1,-1,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
2,derp of,-1,8,-1,-1,-1,-1,-1,-1,-1,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
3,D,-1,8,-1,-1,-1,-1,-1,-1,-1,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
4,DECLINE VALUE IN,-1,8,-1,-1,-1,-1,-1,-1,-1,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
5,LVP,-1,8,-1,-1,-1,-1,-1,-1,-1,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
6,LVP,-1,8,-1,-1,-1,-1,-1,-1,-1,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
7,write down,-1,8,-1,-1,-1,-1,-1,-1,-1,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
8,decl,-1,8,-1,-1,-1,-1,-1,-1,-1,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
9,DEDUCTIBLE ADJUSTMENT,-1,8,-1,-1,-1,-1,-1,-1,-1,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1


## Check whether WRE_CODE is being labelled correctly

In [26]:
print("*"*80)
print(f"these should be: {code_dict[WRE_CODE]}")
print("*"*80)
snork_preds[['text', WRE_CODE]].head(10)

********************************************************************************
these should be: 8
********************************************************************************


Unnamed: 0,text,BG
0,decline,8
1,WRITE-OFF,8
2,derp of,8
3,D,8
4,DECLINE VALUE IN,8
5,LVP,8
6,LVP,8
7,write down,8
8,decl,8
9,DEDUCTIBLE ADJUSTMENT,8


In [27]:
# these should be label -1 for WRE_CODE
# because exclusionary terms are present in the text
print("*"*80)
print(f"these should be: {code_dict['ABSTAIN']}")
print("*"*80)
snork_preds[['text', WRE_CODE]].tail(10)

********************************************************************************
these should be: -1
********************************************************************************


Unnamed: 0,text,BG
10,PARKING,-1
11,PARKING,-1
12,PARKING,-1
13,PARKING,-1
14,PARKING,-1
15,non dep,-1
16,PARKING,-1
17,non dep,-1
18,PARKING,-1
19,non dep,-1


## Apply Labelling functions according to IHC hierarchy

In [28]:
import numpy as np

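def get_prediction(row):
    # return the code of the first labelling function (in column order)
    # that did not abstain, or 'ABSTAIN' if every function abstained
    for j in row:
        if j != -1:
            return inverse_code_dict[j]
    return inverse_code_dict[-1]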

prediction = [get_prediction(row) for row in snork_preds.iloc[:, 1:].to_numpy()]
# prediction

pd.DataFrame({'text':synthetic_data, 'prediction':prediction})


Unnamed: 0,text,prediction
0,decline,BG
1,WRITE-OFF,BG
2,derp of,BG
3,D,BG
4,DECLINE VALUE IN,BG
5,LVP,BG
6,LVP,BG
7,write down,BG
8,decl,BG
9,DEDUCTIBLE ADJUSTMENT,BG
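
The first-match rule can also be written with `next` over a generator. The sketch below uses a hypothetical three-entry excerpt of `inverse_code_dict` (the values 8 and 37 follow the mapping printed earlier) and made-up label rows.

```python
ABSTAIN = -1

# hypothetical excerpt of inverse_code_dict
inverse = {ABSTAIN: "ABSTAIN", 8: "BG", 37: "CI"}


def first_vote(row, abstain=ABSTAIN):
    """Return the first non-abstain label in the row, or abstain if there is none."""
    return next((v for v in row if v != abstain), abstain)


# each row holds the outputs of the labelling functions in hierarchy order
rows = [
    [-1, 8, -1, 37],   # two functions fire, the earlier one wins
    [-1, -1, -1, -1],  # nothing fires
    [37, 8, -1, -1],   # order decides, not the numeric value
]
print([inverse[first_vote(r)] for r in rows])
# ['BG', 'ABSTAIN', 'CI']
```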


## Evaluate Labelling Model Performance

In [29]:
# from snorkel.labeling.model import LabelModel
# # Train the label model and compute the training labels
# label_model = LabelModel(cardinality=99, verbose=True)
# label_model.fit(L_train, n_epochs=10)
# snorkel_df["prediction"] = label_model.predict(L=L_train, tie_break_policy="abstain")

In [30]:
# import numpy as np
# y = np.repeat(np.array([code_dict[WRE_CODE],code_dict['ABSTAIN']]), DATASET_SIZE//2)
# results = label_model.score(L_train, y, metrics=["accuracy", "coverage"])
# print(f"Model Accuracy = {results['accuracy']}")
# print(f"Model Coverage = {results['coverage']}")
# should be code_dict[WRE_CODE]
# snorkel_df[['text', 'prediction']].head(10)
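
While the `LabelModel` cells are commented out, accuracy and coverage can be checked by hand. The sketch below is one simple way to define them in plain Python; Snorkel's own `score` method may treat abstains differently, so the numbers are not guaranteed to match. The expected labels mirror the layout of the synthetic BG sample (10 rows of code 8, then 10 abstains).

```python
ABSTAIN = -1


def coverage(preds, abstain=ABSTAIN):
    """Fraction of rows that received a non-abstain label."""
    return sum(p != abstain for p in preds) / len(preds)


def accuracy(preds, expected):
    """Fraction of rows whose prediction equals the expected label."""
    return sum(p == e for p, e in zip(preds, expected)) / len(preds)


expected = [8] * 10 + [ABSTAIN] * 10
preds = [8] * 10 + [ABSTAIN] * 10  # illustrative predictions, not model output

print(coverage(preds))            # 0.5
print(accuracy(preds, expected))  # 1.0
```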

### A generator pipeline that returns first match from IHC hierarchy

In [31]:
import re

def preprocess_rgx(s1, s2 = None):
  s = trim_and_split(s1, s2) if bool(s2) else s1
  s = re.sub(r"\s\s+", " ", s).split("|")
  rgx="|".join(["\\b"+re.escape(i)+"\\b" for i in s])
  return rgx

# the labeling Function Factory
def make_keyword_lf(wre_code):
    
    rgx = preprocess_rgx(IHC_dict[wre_code]['like_any'],IHC_dict[wre_code]['regexp'])
    inword_search_ptrn = re.compile(rgx, flags=re.IGNORECASE).fullmatch
    
    if bool(IHC_dict[wre_code]['not_like_all']):
        rgx = preprocess_rgx(IHC_dict[wre_code]['not_like_all'])
        outword_search_ptrn = re.compile(rgx, flags=re.IGNORECASE).fullmatch
    else:
        outword_search_ptrn = None

    label = code_dict[wre_code]
    # print(f"{wre_code}=")

    def keyword_lookup(x):
        # collapse runs of whitespace before matching
        search_text = re.sub(r"\s\s+", " ", x)
        # print(f"{search_text}=")
        if bool(outword_search_ptrn):
            # label only when an in-word matches and no out-word matches
            if inword_search_ptrn(search_text) and not outword_search_ptrn(search_text):
                return label
        elif inword_search_ptrn(search_text):
            return label
        return code_dict['ABSTAIN']
    
    return keyword_lookup

# labels = wre_codes 
labels = wre_codes

# def dbg_printer(search_text):
#     print(f(search_text))
#     return f(search_text)

def get_first_labelmatch(search_text):
    # lazily build and evaluate each labelling function in IHC hierarchy order,
    # calling each one only once per text
    results = (make_keyword_lf(wre_code)(search_text) for wre_code in labels)
    # return the first non-abstaining label, or ABSTAIN if none matches
    return next((r for r in results if r != code_dict['ABSTAIN']), code_dict['ABSTAIN'])
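
The cell above rebuilds and recompiles the labelling functions for every text it is given. A possible variant, sketched below under simplified assumptions, builds the functions once and keeps only the evaluation lazy. The keywords, the labels 0 and 8, and the names `make_lf`, `LFS` and `first_label` are illustrative, and the exclusion logic is omitted for brevity.

```python
import re

ABSTAIN = -1


def make_lf(keywords, label):
    """Return a function that yields label when the whole text equals a keyword."""
    ptrn = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)

    def lf(text):
        return label if ptrn.fullmatch(text) else ABSTAIN

    return lf


# built once, in hierarchy order
LFS = [make_lf(["parking"], 0), make_lf(["decline", "lvp"], 8)]


def first_label(text, lfs=LFS):
    votes = (lf(text) for lf in lfs)  # evaluated lazily, stops at the first hit
    return next((v for v in votes if v != ABSTAIN), ABSTAIN)


print(first_label("LVP"))      # 8
print(first_label("Parking"))  # 0
print(first_label("coffee"))   # -1
```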


In [32]:
print("*"*80)
print(f"the first 10 should be: {code_dict[WRE_CODE]}")
print("*"*80)
test_df = pd.DataFrame({'text':synthetic_data})
test_df['label'] = test_df['text'].map(get_first_labelmatch)
test_df

********************************************************************************
the first 10 should be: 8
********************************************************************************


Unnamed: 0,text,label
0,decline,8
1,WRITE-OFF,8
2,derp of,8
3,D,8
4,DECLINE VALUE IN,8
5,LVP,8
6,LVP,8
7,write down,8
8,decl,8
9,DEDUCTIBLE ADJUSTMENT,8
