# Labelbox - Spring 2021




This notebook documents the Labelbox integration built for the Qualitative Analysis project for the IBM Watson in the Watt CI. This semester's goal was to test triplet-based learning through the Labelbox platform with a custom labeling interface, which lets researchers label triplets of data quickly. The notebook explains how to set up the Labelbox data. Its main sections are:


*   Generating Labelbox-compatible triplet data from a text dataset
*   Training a tokenizer
*   Creating an active learning pipeline for Labelbox



## General Notes:

- For the purposes of this document, I changed most filenames to \<filename\>_example so that none of the original files would be overwritten. If you intend to pick up and use this code, it would be best practice to rename these files to something more descriptive.
- json.loads() seems to frequently throw "invalid escape character" errors with this data, likely due to the way it was formatted originally. This can often be easily fixed with a .replace() function to get rid of the invalid character. However, I recommend using a more robust data cleaning method that can ensure no invalid characters are present.
- Similar escape character issues are present during GraphQL queries, which can also be resolved using .replace(). Again, it would be best practice to use a more robust data cleaning method.
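
As one possible "more robust" approach to the escape errors mentioned above, the sketch below doubles any backslash that does not start a valid JSON escape before parsing. `repair_json_escapes` is a hypothetical helper, not part of the original project code, and the sample string is made up. It assumes the only problem is stray backslashes; a malformed `\u` sequence would still fail.

```python
import json
import re

# Characters that may legally follow a backslash inside a JSON string.
VALID_JSON_ESCAPES = set('"\\/bfnrtu')

def repair_json_escapes(raw):
    """Double every backslash that does not begin a valid JSON escape."""
    def fix(match):
        pair = match.group(0)
        # Keep valid escapes such as \n or \" untouched.
        if pair[1] in VALID_JSON_ESCAPES:
            return pair
        # Otherwise escape the backslash itself so it is read literally.
        return "\\" + pair
    # Consume backslash pairs left to right so an escaped backslash stays one unit.
    return re.sub(r"\\(.)", fix, raw, flags=re.DOTALL)

# Made-up input: "\%" is not a valid JSON escape, "\n" is.
raw = r'{"text": "5\% off \n now"}'
parsed = json.loads(repair_json_escapes(raw))
print(parsed["text"])
```

Unlike a chain of `.replace()` calls, this does not need a separate rule for each offending character.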

-Quinn Hubbarth (quinn.hubbarth@gmail.com)

# Imports and installs

In [45]:
!pip install tokenizers
!pip install labelbox
from labelbox import Client, Dataset, LabelingFrontend, Project, DataRow
from labelbox import schema
import pandas as pd
from typing import List
import numpy as np
import json
from itertools import chain, islice

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing
from tokenizers.normalizers import Sequence, Lowercase, Strip
from sklearn.model_selection import train_test_split



In [4]:
try:
    from google.colab import drive
    drive.mount('/content/gdrive')
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

Mounted at /content/gdrive


# Generating Labelbox-compatible triplet data from a text dataset

In [10]:
# Define a datapath based on the structure of your Google Drive. One datapath might look like this:
dataPath = 'gdrive/My Drive/Semester 6/UPIC_WitW/data/'

# Read in a dataset of your choosing. In this instance, we used Amazon Fine Food Reviews, which could look like this:
database = pd.read_csv(dataPath + "Reviews.csv")
database.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...


In [22]:
# To prepare the data for separating into triplets, we select the Summary and Text columns from the database, combine them, and make them safe for JSON & HTML formatting.
# clean() drops double quotes and swaps curly braces for parentheses, since the export step below rewrites every quote and brace in the JSON output.
def clean(column):
    return column.str.replace('"', '', regex=False).str.replace("{", "(", regex=False).str.replace("}", ")", regex=False)

reviews = pd.DataFrame("<b>" + clean(database["Summary"]) + ":<br \\/> </b>" + clean(database["Text"]))

# To separate the data into triplets, we generate random indices into the data. "indices" is an array with one row per triplet, each row holding 3 random indexes.
indices = np.random.randint(0, len(reviews), (len(reviews) // 3, 3))

# These indices are then stored in a Pandas dataframe. The column names are passed as a list so that their order is fixed.
reviews_indices = pd.DataFrame(indices, columns=["A_id", "B_id", "C_id"])
reviews_indices.head()

Unnamed: 0,A_id,B_id,C_id
0,479638,561281,467604
1,331429,96362,146627
2,322464,193204,52168
3,98227,110668,374978
4,484180,305031,494406
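
Note that `np.random.randint` samples with replacement, so a triplet can occasionally contain the same review twice. If that matters, a minimal sketch of drawing three distinct indices per triplet is shown below. `sample_distinct_triplets` and its `seed` parameter are hypothetical names, and the sketch uses the standard library `random` module rather than NumPy.

```python
import random

def sample_distinct_triplets(num_items, num_triplets, seed=None):
    """Return num_triplets lists of three distinct indices in [0, num_items)."""
    if num_items < 3:
        raise ValueError("need at least three items to form a triplet")
    rng = random.Random(seed)
    # random.sample draws without replacement, so indices within a triplet never repeat.
    return [rng.sample(range(num_items), 3) for _ in range(num_triplets)]

# Small made-up sizes for illustration.
triplets = sample_distinct_triplets(num_items=10, num_triplets=4, seed=0)
print(triplets)
```

The resulting list of lists could be passed to `pd.DataFrame(triplets, columns=["A_id", "B_id", "C_id"])` in place of `indices`. Different triplets may still share reviews.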


In [23]:
# Creating reviews_triplets uses the indices to pull data from the reviews dataframe, and concatenates the triplets of reviews onto the triplets of indices.

reviews_triplets = pd.concat([pd.DataFrame(reviews_indices.apply(lambda x: x.apply(lambda y: reviews[0][y]))).rename(columns = {"A_id":"A", "B_id": "B", "C_id": "C"}), reviews_indices], axis=1)
reviews_triplets.head()

Unnamed: 0,A,B,C,A_id,B_id,C_id
0,<b>coffee that will wake you up!:<br \/> </b>I...,<b>Hemp it Up!!!:<br \/> </b>I love this stuff...,<b>Biggest disappointment ever:<br \/> </b>I d...,479638,561281,467604
1,"<b>Not bad, but the original Gold Bears are be...",<b>A real treat!!:<br \/> </b>This jam is abso...,<b>Melitta Coffee Pods:<br \/> </b>tastes ok b...,331429,96362,146627
2,<b>not good:<br \/> </b>I ordered this because...,<b>Really great - price outrageous:<br \/> </b...,<b>Yummy Delicious Sunbutter:<br \/> </b>Thing...,322464,193204,52168
3,<b>Very convenient:<br \/> </b>I have a confes...,<b>Why is the decaf almost twice as expensive?...,"<b>Teeny, tiny amount:<br \/> </b>I haven't ac...",98227,110668,374978
4,"<b>coffee:<br \/> </b>I love this coffee, and ...",<b>Love Crunch Granola:<br \/> </b>If you like...,<b>Perhaps I don't love green tea as much as t...,484180,305031,494406


In [24]:
# Now, we need to translate this data into the format Labelbox expects. This involves replacing certain characters with valid characters for JSON strings.
# We also choose how many triplets to export. You can export as many triplets as there are rows in reviews_triplets.
# To export the data, we use the pandas function .to_json, which takes a parameter "orient". The closest orient to the Labelbox requirements is "records", but some formatting replacements still need to be made.

num_to_export = 1000
output_file = "reviews_triplets_example.json"

# Use a context manager so the file is closed even if the write fails.
with open(dataPath + output_file, "w") as out_file:
    out_file.write(reviews_triplets.head(num_to_export).to_json(orient="records").replace('"', "\\" + '"').replace("{", '{\"data\": \"{\\"compare\\":{').replace("}", '}}\"}').replace("},", "},\n"))

  


TODO FOR THIS SECTION:

- Clean up some of the .replace() functions with a more sophisticated way to clean up the data, if possible
- Add more parameters in the data gathering function so users can easily swap out datasets
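
One possible way to address the first TODO item is to let `json.dumps` handle the escaping instead of chained `.replace()` calls. The sketch below assumes the target shape is a list of `{"data": "<JSON string>"}` objects, where the inner string encodes `{"compare": {...}}`, which is what the export cell above produces. `to_labelbox_records` is a hypothetical helper name, and the sample record is made up for illustration.

```python
import json

def to_labelbox_records(rows):
    """Wrap each triplet dict in the {"data": ...} envelope as a JSON string."""
    # json.dumps escapes quotes, backslashes, and control characters for us.
    return [{"data": json.dumps({"compare": row})} for row in rows]

# Made-up triplet; in the notebook the rows could come from
# json.loads(reviews_triplets.head(num_to_export).to_json(orient="records")).
rows = [{"A": 'He said "great" {really}', "B": "ok", "C": "bad",
         "A_id": 7, "B_id": 8, "C_id": 9}]

exported = json.dumps(to_labelbox_records(rows))

# Reading the file back mirrors the json.loads(...)['compare'] pattern used later on.
round_trip = json.loads(json.loads(exported)[0]["data"])["compare"]
print(round_trip["A"])
```

With this approach the review text would not need its quotes and braces stripped beforehand, since nothing is rewritten by string replacement.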

# Training a tokenizer

In [25]:
# Specify a dataPath and database, similarly to the data generation style
dataPath = 'gdrive/My Drive/Semester 6/UPIC_WitW/data/'
database = pd.read_csv(dataPath + "Reviews.csv")

# Set other settings for the tokenizer and tokenizer output
min_freq=3
vocab_size = 20000

save_path = dataPath + "reviews_example.txt"
save_path_tokenizer = dataPath + "tokenizer%i-minf%i_example.json" % (vocab_size, min_freq)

In [29]:
# When creating a tokenizer, we thought it would be useful to make sure that none of the existing triplets were in the test set.
# So, we created a method of getting the indices from the exported files in the above tutorial.
# Then, we make sure to exclude this data from the test set, and add it back into the training set.

# Read in the original exported dataset.
reviews_json = pd.read_json(dataPath + 'reviews_triplets_example.json')
index_list = []

# 1 is added to each index because the "Id" column in the data starts at 1, not 0.
for row in reviews_json['data']:
  # Parse each record once, then add all three triplet indices to index_list.
  compare = json.loads(row)['compare']
  index_list.extend(compare[key] + 1 for key in ('A_id', 'B_id', 'C_id'))

# Export these ids in case they are useful later.
pd.DataFrame(index_list).to_csv(dataPath + 'reviews_triplets_example_indices.csv')

In [30]:
# Remove the triplets we found before doing the train-test split.
# data_no_triplets is the subset of database whose "Id" does not appear in index_list.
in_triplets = database["Id"].isin(index_list)
data_no_triplets = database[~in_triplets]
train_set, test_set = train_test_split(data_no_triplets, train_size = .9)

# Add the removed triplet data (up to 3000 documents) back to the train set.
# pd.concat returns a new DataFrame, so the result must be assigned.
train_set = pd.concat([train_set, database[in_triplets]])

# Save train and test data
train_set.to_csv(dataPath + "train_example.csv")
test_set.to_csv(dataPath + "test_example.csv")
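
Since the point of this split is to keep the labeled triplets out of the test set, a quick check after splitting can catch mistakes. A minimal sketch, with `find_leaked_ids` as a hypothetical helper name and made-up ids:

```python
def find_leaked_ids(test_ids, triplet_ids):
    """Return the sorted ids that appear in both the test set and the triplets."""
    return sorted(set(test_ids) & set(triplet_ids))

# Made-up ids; in the notebook this would be find_leaked_ids(test_set["Id"], index_list).
leaked = find_leaked_ids(test_ids=[4, 8, 15, 16], triplet_ids=[23, 42, 8])
print(leaked)  # → [8]
```

An empty list means no triplet document ended up in the test set.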

In [34]:
# Reload train data from disk
train_data = pd.read_csv(dataPath + "train_example.csv")

# Load and prep the data
data = pd.DataFrame(train_data["Summary"] + ": "  + train_data["Text"])

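# melt() reshapes the single-column DataFrame into long format; its 'value' column holds the text as a Series.
# dropna() then removes rows where Summary or Text was missing (the concatenation yields NaN for those).
text = data.melt()['value'].dropna()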

# Save the text data to specified save path
text.to_csv(save_path, index=None, header=False)

# Create specific special tokens
special_toks=["<unk>", "<s>", "</s>", "<pad>", "<mask>"]

In [35]:
# Tokenizer Training function, and tokenizer training
def create_tokenizer(vocab_size, min_freq, special_toks, save_path_text, save_path_tokenizer):
    tokenizer = Tokenizer(BPE())
    tokenizer.normalizer = Sequence([Strip()])
    tokenizer.pre_tokenizer = Whitespace()
    trainer = BpeTrainer(special_tokens=special_toks, vocab_size=vocab_size, min_frequency=min_freq)
    
    tokenizer.train([save_path_text], trainer)
    
    tokenizer.post_processor = TemplateProcessing(
        single="<s> $A </s>",
        special_tokens=[
            ("<s>", tokenizer.token_to_id("<s>")),
            ("</s>", tokenizer.token_to_id("</s>"))
        ],
    )
    
    tokenizer.save(save_path_tokenizer)
    
    return tokenizer

#Create tokenizer
tokenizer = create_tokenizer(vocab_size, min_freq, special_toks, save_path, save_path_tokenizer)

In [36]:
# Testing the tokenizer
sentence = "hello, y'all! [noise] how are \0xg you 😁 ?"
print(tokenizer.encode(sentence).tokens)
print(tokenizer.encode(sentence).ids)

['<s>', 'hello', ',', 'y', "'", 'all', '!', '[', 'noise', ']', 'how', 'are', 'x', 'g', 'you', '?', '</s>']
[1, 18119, 17, 94, 12, 242, 6, 64, 10636, 66, 629, 204, 93, 76, 197, 36, 2]


# Creating an active learning pipeline for Labelbox

The motivation for the Active Learning Pipeline was based largely on this tutorial:
https://labelbox.com/blog/active-learning-with-uncertainty-sampling/

Also, the Labelbox GraphQL documentation was very useful to figure out how to use some of these functions:
https://docs.labelbox.com/graphql-api/en/index-en#labeling-parameters


In [None]:
# Define datapath and read in the exported triplets.
dataPath = 'gdrive/My Drive/Semester 6/UPIC_WitW/data/'
reviews_json = pd.read_json(dataPath + 'reviews_triplets_example.json')

# Register Labelbox client, project, and dataset using the Labelbox API.
LABELBOX_API_KEY = "get your own API Key"
project_name = "Qualitative Analysis Triplets"

client = Client(LABELBOX_API_KEY)
projects = client.get_projects(where=Project.name == project_name)
project = next(iter(projects))

# Fetching the dataset by name may not work if several datasets share the same name,
# so make sure only one dataset with this name exists.
# You can attach multiple datasets to a project at a time, and in principle the code should be able to work with several.
# dataset_name must match the name of the dataset uploaded to Labelbox. "reviews_triplets_example.json" is not an uploaded dataset, so replace it with your own dataset's name before running this.
dataset_name = "reviews_triplets_example.json"
dataset = next(iter(client.get_datasets(where=Dataset.name == dataset_name)))

In [None]:
# This code sets the external_ids of any datarows within the dataset that do not have an external_id.
# WARNING: THIS CODE CAN MODIFY DATA YOU HAVE UPLOADED TO THE CLOUD.
# I recommend you modify this code to make it safer and unable to modify data in a negative way.
# For example, in the GraphQL query, it might be possible to not pass row_data, which would prevent row_data from getting deleted for any reason.
# The purpose of this is to ensure that each datarow has an external_id (which is <A_id>_<B_id>_<C_id>) so we can set the priority of these datarows

for i in iter(dataset.data_rows()):
  if i.external_id is None:
    compare = json.loads(i.row_data.replace("\\", "\\\\").replace("\x13", " ").replace("\x10", " "))

    row_data = i.row_data.replace('"', "\\" + '"').replace("\\T", "T").replace("\\i", "i").replace(
                "\u0013", " ").replace("\\c", "c").replace("\\M", "M").replace("\\P", "P").replace(
                "\\d", "d").replace("\\<", "").replace("\\ ", " ").replace("\\w", "w").replace("\\\\", "\\").replace(
                "\\s", "s").replace("\\2", "2").replace("\\5", "5").replace("\\S", "S").replace("\u0010", " ")
    ex_id = str(compare["compare"]["A_id"]) + "_" + str(compare["compare"]["B_id"]) + "_" + str(compare["compare"]["C_id"])
    
    client.execute(
        f'''
        mutation UpdateDataRow {{
            updateDataRow( 
                where: {{ 
                  id: "{i.uid}" 
                }},
                data: {{
                    externalId: "{ex_id}",
                    rowData: "{row_data}"
                }}
            ) {{ 
              id 
            }}
        }}
        '''
    )
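
As a possible alternative to the long `.replace()` chain, the sketch below builds the mutation text with `json.dumps`. It relies on the assumption that a JSON-encoded string is also a valid GraphQL string literal (to my knowledge the two use the same backslash escapes). `build_update_mutation` is a hypothetical helper and the ids are placeholders. It only constructs the query text and does not call Labelbox.

```python
import json

def build_update_mutation(uid, external_id, row_data):
    """Build the UpdateDataRow mutation text with safely quoted string values."""
    # json.dumps adds the surrounding quotes and escapes embedded quotes,
    # backslashes, and control characters.
    return f'''
    mutation UpdateDataRow {{
        updateDataRow(
            where: {{ id: {json.dumps(uid)} }},
            data: {{
                externalId: {json.dumps(external_id)},
                rowData: {json.dumps(row_data)}
            }}
        ) {{
            id
        }}
    }}
    '''

# Placeholder values for illustration only.
sample_row_data = '{"compare": {"A": "say \\"hi\\"", "A_id": 1}}'
query = build_update_mutation("placeholder-uid", "1_2_3", sample_row_data)
print(query)
```

The resulting string could then be passed to `client.execute(...)` as in the cell above, subject to the same warning about modifying uploaded data.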

In [None]:
# This code adds an external_ids column to a Pandas dataframe, given a local copy of the reviews triplets json file.

def external_id_from_data(data):
    # Parse the record once, applying the same escape fixes as above, then join the three ids as <A_id>_<B_id>_<C_id>.
    compare = json.loads(data.replace("\\", "\\\\").replace("\x13", " ").replace("\x10", " "))["compare"]
    return "_".join(str(compare[key]) for key in ("A_id", "B_id", "C_id"))

reviews_triplets = pd.read_json(dataPath + "reviews_triplets_example.json")
reviews_triplets["external_ids"] = reviews_triplets["data"].apply(external_id_from_data)

# Then, we can add the uids, which are ids created by Labelbox and used for active learning.

def update_with_uid(
        dataframe: pd.core.frame.DataFrame,
        dataset: schema.dataset.Dataset) -> pd.core.frame.DataFrame:
    """Add uid column for tracking labelbox's id of the same datarow in our dataframe.
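    Args:
        dataframe: dataframe to augment; must have an external_ids column
        dataset: Labelbox dataset whose data rows supply the external_id to uid mapping
    Returns:
        The same dataframe, modified in place, with a new uid column.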
    """    
    external_uid_map = {
        str(data_row.external_id): data_row.uid
        for data_row in dataset.data_rows()
    }
    dataframe['uid'] = dataframe.external_ids.map(external_uid_map)
    return dataframe

update_with_uid(reviews_triplets, dataset)

# After that, we have two more columns for our reviews_triplets dataframe, both of which are very useful to us:
reviews_triplets.head()

In [None]:
# Now, we can begin to set the priority of different data points.

# Set a random priority for a proof-of-concept.
def set_priority():
  return np.random.randint(1, 100)



# This code was largely adapted from the Active Learning Tutorial linked above.
# Currently, the priority is simply set to i, meaning the first datarows have the highest priority in labeling.
# Replacing i with set_priority() under the "priority:" parameter would set it to a random number between 1 and 100.
# This code appears to work, but has been difficult to test.

project_id = project.uid
uids = reviews_triplets.uid

# priority_data_rows = (
#     f'{{dataRow: {{id: "{uid}"}}, priority: {i}, numLabels: 1}}'
#     for i, uid in enumerate(uids)
# )

def gen_priority_data_rows():
  for i, uid in enumerate(uids):
    print(uid)
    # Only the first six datarows are prioritized in this proof-of-concept.
    if i > 5:
      break
    yield f'''{{dataRow: {{id: "{uid}"}}, priority: {i}, numLabels: 1}}'''
priority_data_rows = gen_priority_data_rows()

#data_rows = chain(priority_data_rows, rest_data_rows)
data_rows = priority_data_rows

def batches(iterable, size):
    iterator = iter(iterable)
    for first in iterator:
        yield chain([first], islice(iterator, size - 1))

for batch in batches(data_rows, size=999):
    response = client.execute(
        f'''
        mutation setLabelingParameterOverrides {{
          project(where: {{ id: "{project_id}" }}) {{
            setLabelingParameterOverrides(data: [
                {','.join(batch)}
            ]) {{
              success
            }}
          }}
        }}
        '''
    )
    assert not response.get('errors')
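
To make the behavior of the `batches` helper above more concrete: it lazily splits any iterable into consecutive chunks of at most `size` items, so a long list of overrides is sent as several smaller mutations. A small standalone demonstration on made-up data (the function is repeated so the snippet runs on its own):

```python
from itertools import chain, islice

def batches(iterable, size):
    iterator = iter(iterable)
    for first in iterator:
        # Each batch is a lazy iterator; consume it fully before requesting the next one.
        yield chain([first], islice(iterator, size - 1))

chunks = [list(batch) for batch in batches(range(7), size=3)]
print(chunks)  # → [[0, 1, 2], [3, 4, 5], [6]]
```

In the loop above, `','.join(batch)` plays the role of `list(batch)` here and consumes each batch completely.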

In [None]:
# To unset all of the labeling parameter overrides (mainly the labeling priority), run this:
response = client.execute(f'''
    mutation UnsetAllLabelingParameterOverrides {{

      project(where: {{id: "{project.uid}" }}) {{
        unsetAllLabelingParameterOverrides {{
            success
            deletedCount
        }}
      }}
    }}
    '''
  )


In [None]:
# To check the labeling parameter overrides, run this:
client.execute(f'''
    query PullLabelingParameterOverrides {{

      project(where: {{id: "{project.uid}" }}) {{
        labelingParameterOverrides {{
            id
            priority
        }}
      }}
    }}
    '''
  )


TODO:
- Test if the labeling parameter overrides are actually working in the Labelbox interface.
- If these tests pass, it could be used in a model where the most uncertain datapoints are prompted to the researcher.
- Currently, it's been hard for us to test this, because it appears that the uids for the labelingParameterOverrides we pull are completely different uids than the uids in the dataset we're using. This raises a few questions about how we could test this method.
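
If the overrides turn out to work, the uncertainty-based ordering mentioned above could be sketched as follows. The uncertainty scores are made up and would in practice come from a model, and `uncertainty_to_overrides` is a hypothetical helper. It assumes, as in the priority cell above, that a lower `priority` number means the datarow is labeled sooner.

```python
def uncertainty_to_overrides(uncertainty_by_uid):
    """Rank datarows from most to least uncertain and emit override strings."""
    ranked = sorted(uncertainty_by_uid.items(), key=lambda item: item[1], reverse=True)
    # Priority 1 goes to the most uncertain datarow (assumes lower number = labeled first).
    return [
        f'{{dataRow: {{id: "{uid}"}}, priority: {rank}, numLabels: 1}}'
        for rank, (uid, _score) in enumerate(ranked, start=1)
    ]

# Made-up uids and scores for illustration.
overrides = uncertainty_to_overrides({"uid_a": 0.10, "uid_b": 0.95, "uid_c": 0.50})
print(overrides[0])  # → {dataRow: {id: "uid_b"}, priority: 1, numLabels: 1}
```

The resulting strings have the same shape as those produced by `gen_priority_data_rows`, so they could be fed to `batches` in its place.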