# Remarks

THANK YOU FOR ALL THE UPVOTES!

- This notebook was created purely to help Moth fix the tokenization of punctuation in https://www.kaggle.com/datasets/alejopaullier/pii-external-dataset/data?select=pii_dataset.csv
- It was later expanded to also reformat the dataset by PJMathematician: https://www.kaggle.com/datasets/pjmathematician/pii-detection-dataset-gpt

- As you can see from the version numbers, I went through several versions to try to get these into the correct format - if you see any mistakes, please reach out!

- Thank you to Raja for creating this application, which helps to find mistakes relatively fast: https://www.kaggle.com/competitions/pii-detection-removal-from-educational-data/discussion/475646#2645266

In [1]:
import ast
import pandas as pd
import string
from Levenshtein import distance as le

In [2]:
from spacy.lang.en import English
en_tokenizer = English().tokenizer

def tokenize_with_spacy(text, tokenizer=en_tokenizer):
    tokenized_text = tokenizer(text)
    tokens = [token.text for token in tokenized_text]
    trailing_whitespace = [bool(token.whitespace_) for token in tokenized_text]
    return {'tokens': tokens, 'trailing_whitespace': trailing_whitespace}
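
The two lists returned by `tokenize_with_spacy` are enough to rebuild the original string, which gives a cheap check on any retokenized document. A minimal sketch, using hand-written tokens instead of real spaCy output and assuming each `True` flag stands for a single space; `rebuild_text` is a hypothetical helper that is not part of this notebook:

```python
def rebuild_text(tokens, trailing_whitespace):
    """Join tokens, adding one space after every token flagged True."""
    return "".join(
        tok + (" " if ws else "")
        for tok, ws in zip(tokens, trailing_whitespace)
    )

# hand-written example in the same shape as the dict returned above
example = {
    "tokens": ["My", "name", "is", "Homer", "Pavlov", ","],
    "trailing_whitespace": [True, True, True, True, False, False],
}

rebuilt = rebuild_text(example["tokens"], example["trailing_whitespace"])
print(rebuilt)
```

If the rebuilt string differs from the `text` column, the tokens and whitespace flags of that document are out of sync.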

In [3]:
new_ = pd.read_csv("/kaggle/input/pii-external-dataset/pii_dataset.csv")
# the columns contain string representations of lists - ast.literal_eval parses them back into lists
new_["tokens"] = new_.tokens.apply(ast.literal_eval)
new_["trailing_whitespace"] = new_.trailing_whitespace.apply(ast.literal_eval)
new_["labels"] = new_.labels.apply(ast.literal_eval)

# unprocessed versions for comparison later
og=new_.copy(deep=True)
ex = new_.explode(["tokens","trailing_whitespace","labels"])

new_.shape

(4434, 16)

In [4]:
# moth check: Masako Mitsubishi NAME_STUDENT

In [5]:
label_map={
    "O":"O",
    "B-NAME_STUDENT":"NAME_STUDENT",
    "I-NAME_STUDENT":"NAME_STUDENT",
    "B-URL_PERSONAL":"URL_PERSONAL",
    "I-URL_PERSONAL":"URL_PERSONAL",
    "B-EMAIL":"EMAIL",
    "I-EMAIL":"EMAIL",
    "B-ID_NUM":"ID_NUM",
    "I-ID_NUM":"ID_NUM",
    "B-USERNAME":"USERNAME",
    "I-USERNAME":"USERNAME",
    "B-PHONE_NUM":"PHONE_NUM",
    "I-PHONE_NUM":"PHONE_NUM",
    "B-STREET_ADDRESS":"STREET_ADDRESS",
    "I-STREET_ADDRESS":"STREET_ADDRESS"
}
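
The mapping above simply drops the `B-` / `I-` prefix from each BIO label. As a sketch of the same idea without the explicit dictionary, `strip_bio_prefix` below is a hypothetical helper (not used in this notebook) that does the same for any label. Unlike the dictionary, it will not raise a `KeyError` on an unexpected label, which may or may not be what you want:

```python
def strip_bio_prefix(label):
    """Drop a leading 'B-' or 'I-' and leave 'O' untouched."""
    if label.startswith(("B-", "I-")):
        return label[2:]
    return label

bio_labels = ["O", "B-NAME_STUDENT", "I-NAME_STUDENT", "O", "B-EMAIL"]
flat_labels = [strip_bio_prefix(l) for l in bio_labels]
print(flat_labels)
```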

In [6]:
new_[new_.document.str.contains("45507")]

Unnamed: 0,document,text,tokens,trailing_whitespace,labels,prompt,prompt_id,name,email,phone,job,address,username,url,hobby,len
10,45507bc9-7b42-49fd-8fa3-031cdaade706,"In the bustling city of Alexandria, amidst the...","[In, the, bustling, city, of, Alexandria,, ami...","[True, True, True, True, True, True, True, Tru...","[O, O, O, O, O, O, O, O, O, O, O, O, O, B-NAME...",\n Write a fictional semi-formal biography ...,0,Shankar Yu,shankar-yu@outlook.net,(13) 98824-5547,attorney,5036 Jericho Street,,,Mixology,347


In [7]:
print(new_[new_.document.str.contains("45507")].text.iloc[0])

In the bustling city of Alexandria, amidst the clamor of daily life, I, Shankar Yu, emerged into this world, blessed with an unyielding spirit and a profound curiosity that would shape my destiny. From my humble beginnings at 5036 Jericho Street, I embarked on a journey that would propel me to new heights, leaving an indelible mark on the tapestry of human endeavor. The call of knowledge resonated within me from a tender age. I spent countless hours poring over books, devouring every morsel of information that came my way. The allure of the unknown beckoned me forth, igniting a passion for exploration and discovery that would never wane. As fate would have it, my path intersected with a group of kindred spirits, fellow seekers of truth and enlightenment. Together, we embarked on a quest for answers, traversing the globe in search of ancient wisdom and hidden knowledge. Our journey took us to remote corners of the world, where we encountered diverse cultures and traditions, each offerin

In [8]:
def get_labels_pj_format(row):
    # wanted format: {"NAME_STUDENT":["Valentin", "Werner"]}
    labeldict={}
    labels_ = [label_map[l] for l in row["labels"]]
    tokens_ = row["tokens"]
    for t, l in zip(tokens_, labels_):
        # we do not need to collect O-Labels
        if l!="O":
            # remove trailing punctuation if it exists (this is copied from my old approach);
            # the length check avoids an IndexError on tokens that consist only of punctuation
            while len(t) > 0 and t[-1] in string.punctuation:
                t=t[:-1]
            # append to the list for this label, creating the list if it does not exist yet
            labeldict.setdefault(l, []).append(t)
    return labeldict

def pj_to_pj(label_in):
    labeldict={}
    for label, pii, in label_in.items():
        pii_list=[]
        for info in pii:
            pii_list += tokenize_with_spacy(info)["tokens"]
        clean_pii_list=[]
        for token in pii_list:
            # this string.punctuation filters everything except "(" which might be an opening token in phone numbers
            while len(token) > 0 and token[-1] in "".join(string.punctuation.split("(")): 
                token=token[:-1]
            if len(token) > 0:
                clean_pii_list+=[token]
        labeldict[label] = clean_pii_list
    return labeldict

labels_=[]
for i, row in new_.iterrows():
    labeldict=get_labels_pj_format(row)
    labeldict=pj_to_pj(labeldict)
    labels_.append(labeldict)

In [9]:
labels_[10]

{'NAME_STUDENT': ['Shankar', 'Yu', 'Shankar', 'Yu'],
 'STREET_ADDRESS': ['5036', 'Jericho', 'Street', '5036', 'Jericho', 'Street'],
 'PHONE_NUM': ['(', '13', '98824', '5547'],
 'EMAIL': ['shankar-yu@outlook.net']}
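
The output above keeps `(` for the phone number but drops `)` and `-`. This follows from the stripping expression in `pj_to_pj`, which removes every trailing punctuation character except `(`. A small sketch of that step on hand-written tokens; `STRIPPABLE` and `strip_trailing_punctuation` are names introduced here for illustration only:

```python
import string

# every punctuation character except "(", same expression as in pj_to_pj
STRIPPABLE = "".join(string.punctuation.split("("))

def strip_trailing_punctuation(token):
    """Remove trailing punctuation, but never an opening parenthesis."""
    while len(token) > 0 and token[-1] in STRIPPABLE:
        token = token[:-1]
    return token

raw_tokens = ["Yu,", "Street,", "(", ")", "-", "5547."]
cleaned = [strip_trailing_punctuation(t) for t in raw_tokens]
# tokens that were pure punctuation are now empty and get dropped
kept = [t for t in cleaned if len(t) > 0]
print(kept)
```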

In [10]:
new_[new_.document=="f0fd1c48-cee0-4c40-92b1-9ea3be46d7e5"]

Unnamed: 0,document,text,tokens,trailing_whitespace,labels,prompt,prompt_id,name,email,phone,job,address,username,url,hobby,len
3436,f0fd1c48-cee0-4c40-92b1-9ea3be46d7e5,"Homer Pavlov, a seasoned businessman known for...","[Homer, Pavlov,, a, seasoned, businessman, kno...","[True, True, True, True, True, True, True, Tru...","[B-NAME_STUDENT, I-NAME_STUDENT, O, O, O, O, O...",\n Homer Pavlov is a businessman. Write abo...,2,Homer Pavlov,homer_pavlov7000@msn.gov,(562) 527-6711,businessman,811 River Dell Drive,,,Gymnastics,371


## Retokenize with spacy
- This gets the tokens into the exact format wanted by the challenge providers. However, the labels were assigned to the old tokens, so they may no longer line up with the tokens after retokenization.

In [11]:
tokens = [tokenize_with_spacy(r["text"]) for idx, r in new_.iterrows()]
tokens[0].keys()

dict_keys(['tokens', 'trailing_whitespace'])

## Re-assemble labels

In [12]:
# This loop is very inefficient, but it takes 0.3 seconds - so who cares

new_=[]
for tok, l in zip(tokens, labels_):
    
    # these will just be forwarded to the final result, as we do not change these
    t = tok["tokens"]
    ws = tok["trailing_whitespace"]
    
    # Create "O" label as standard value to overwrite on specific indices
    new_labels=["O"]*len(t)
    
    # Find entities from labels_ in the text
    for ent_type, ent_list in l.items():
        for ent_ in ent_list:
            # find occurrences of tagged entities in the list
            # - this assumes that entities do not contain common words such as "the"
            indices = [i for i, x in enumerate(t) if x == ent_]
            for i in indices:
                # overwrite "O" label with correct label
                new_labels[i] = ent_type
    new_.append({"tokens":t, "trailing_whitespace":ws, "labels":new_labels})
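
The matching in the loop above is an exact, case-sensitive string comparison, so every occurrence of an entity token gets the label, no matter where it appears. A toy sketch of that behaviour on an invented sentence; `label_by_exact_match` is a hypothetical stand-alone version of the loop:

```python
def label_by_exact_match(tokens, labeldict):
    """Give every token that equals a known entity token that entity's label."""
    labels = ["O"] * len(tokens)
    for ent_type, ent_list in labeldict.items():
        for ent in ent_list:
            for i, tok in enumerate(tokens):
                if tok == ent:
                    labels[i] = ent_type
    return labels

tokens = ["Will", "Smith", "said", "he", "will", "call", "Will", "."]
labels = label_by_exact_match(tokens, {"NAME_STUDENT": ["Will", "Smith"]})
for tok, lab in zip(tokens, labels):
    print(tok, "-", lab)
```

The lowercase "will" stays `O`, while the last "Will" is tagged whether or not it refers to the person. This is the reason for the assumption that entities do not contain common words.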

In [13]:
# As we only labelled words, but not the punctuation in between these words, we need to fill the gaps
new_2=[]
punctuation = [p for p in string.punctuation]
for r, labeldict in zip(new_, labels_):
    
    sandwich_on_comma = ["STREET_ADDRESS"]
    # again these are just forwarded
    t = r["tokens"]
    ws = r["trailing_whitespace"]
    # again, these may get overwritten
    label = r["labels"]
    new_labels=["O"]*len(label)
    for i, l in enumerate(label):
        # get prior label if possible
        if i != 0: prior_label=label[i-1]
        else: prior_label="O"

        # get next label
        if i+1 < len(label): next_label=label[i+1]
        elif i+1 == len(label): next_label="O"
        
        # skip filler / list words that split multiple entities
        if (t[i] == "and" and l == "O") or (t[i] == "or" and l == "O"):
            new_labels[i] = "O"
        elif (t[i] == "." and l == "O" and prior_label=="NAME_STUDENT"):
            new_labels[i] = "O"
        # "(" might be labeled as phone num, which is only correct if more phone num follows
        elif t[i] == "(" and l == "PHONE_NUM" and next_label!="PHONE_NUM":
            new_labels[i] = "O"
        elif t[i] == "(" and l != "PHONE_NUM":
            # print(l, t[i], t[i+1]) # this is whenever we have something like "my name is Valentin (Valle)"
            new_labels[i] = "O"
        elif prior_label == "EMAIL" and t[i] == "to":
            new_labels[i] = "O"
        elif (prior_label == "NAME_STUDENT" or prior_label == "O") and t[i] == "'s":
            new_labels[i] = "O"
            
        # only street addresses should contain commas - this avoids labelling sandwiches
        # which chain multiple entities, such as "Valentin Werner, Thomas Müller, and Manuel Neuer"
        # As these should be three separate entities
        elif t[i] == "," and prior_label not in sandwich_on_comma:
            new_labels[i] = "O"
            
        # replace if we got a sandwich ("LABEL"-"O"-"LABEL", such as "Berlin" - "," - "Germany")
        elif prior_label == next_label and prior_label != "O":
            new_labels[i] = prior_label
        elif l != "O":
            new_labels[i] = l
        else:
            new_labels[i] = "O"
    
    new_2.append({"tokens":t, "trailing_whitespace":ws, "labels":new_labels})
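
To make the sandwich logic easier to follow, here is a reduced sketch that keeps only two of the rules from the cell above: the comma rule and the generic "LABEL"-"O"-"LABEL" rule. `fill_gaps` is a hypothetical simplification, not a replacement for the full cell, and the token lists are invented:

```python
def fill_gaps(tokens, labels, sandwich_on_comma=("STREET_ADDRESS",)):
    """Relabel gap tokens that sit between two tokens of the same entity type."""
    filled = []
    for i, (tok, lab) in enumerate(zip(tokens, labels)):
        prior = labels[i - 1] if i > 0 else "O"
        nxt = labels[i + 1] if i + 1 < len(labels) else "O"
        if tok == "," and prior not in sandwich_on_comma:
            # commas only join street addresses, never lists of names
            filled.append("O")
        elif prior == nxt and prior != "O":
            # sandwich: take over the label of both neighbours
            filled.append(prior)
        else:
            filled.append(lab)
    return filled

address = fill_gaps(
    ["22", "Gallatin", "Street", ",", "Berlin"],
    ["STREET_ADDRESS"] * 3 + ["O", "STREET_ADDRESS"],
)
names = fill_gaps(
    ["Valentin", "Werner", ",", "Thomas", "Müller"],
    ["NAME_STUDENT", "NAME_STUDENT", "O", "NAME_STUDENT", "NAME_STUDENT"],
)
phone = fill_gaps(["527", "-", "6711"], ["PHONE_NUM", "O", "PHONE_NUM"])
print(address, names, phone, sep="\n")
```

The comma inside the address is filled, the comma between the two names stays `O`, and the hyphen inside the phone number is filled.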

In [14]:
# Turn labels into BIO Labels
new_bio=[]
for i, r in enumerate(new_2):
    
    # again, these are just forwarded
    t = r["tokens"]
    ws = r["trailing_whitespace"]
    # again, these might get overwritten
    label = r["labels"]

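    # keep track of last label to identify when to use B or I
    # note: the inner loop reuses `i`, so "doc_prelim" below stores the index of the document's
    # last token rather than the document index - the column is dropped again later
    last_label="O"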
    for i, l in enumerate(label):
        if l != last_label and l != "O":
            label[i] = "B-"+l
        elif l == last_label and last_label != "O":
            label[i] = "I-"+l
        last_label = l
    new_bio.append({"doc_prelim":i,"tokens":t, "trailing_whitespace":ws, "labels":label})
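
The B/I assignment above can be tried in isolation. A minimal sketch on an invented label sequence; `to_bio` is a hypothetical helper that returns a new list instead of modifying the labels in place:

```python
def to_bio(flat_labels):
    """Prefix the first token of each run with 'B-' and the rest with 'I-'."""
    bio = []
    last_label = "O"
    for label in flat_labels:
        if label == "O":
            bio.append("O")
        elif label != last_label:
            bio.append("B-" + label)
        else:
            bio.append("I-" + label)
        last_label = label
    return bio

flat = ["O", "NAME_STUDENT", "NAME_STUDENT", "O", "NAME_STUDENT", "STREET_ADDRESS"]
bio = to_bio(flat)
print(bio)
```

One limitation of this scheme: two different entities of the same type that directly follow each other are merged into a single B/I run, because only the label type is compared.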

In [15]:
new_ = pd.DataFrame(new_bio)
new_ = new_.explode(["tokens", "trailing_whitespace", "labels"])
new_.shape # note that this produces even more tokens than my prior approach

(1570222, 4)

In [16]:
new_.head(25)

Unnamed: 0,doc_prelim,tokens,trailing_whitespace,labels
0,410,My,True,O
0,410,name,True,O
0,410,is,True,O
0,410,Aaliyah,True,B-NAME_STUDENT
0,410,Popova,False,I-NAME_STUDENT
0,410,",",True,O
0,410,and,True,O
0,410,I,True,O
0,410,am,True,O
0,410,a,True,O


In [17]:
# Add doc id and text
new_ = pd.merge(new_.reset_index(names="doc_"), og.reset_index(names="doc_")[["doc_","document","text"]], on="doc_", how="left").drop(columns=["doc_","doc_prelim"])
# reorder columns to logical order
new_ = new_[["document","text","tokens","trailing_whitespace","labels"]]

## Sanity checks
- Same number of labels
- Does the number of extra tokens add up?

In [18]:
new_[(new_.tokens.str.contains(r"\)")) & ~(new_.labels.isin(["I-PHONE_NUM","O"]))]

Unnamed: 0,document,text,tokens,trailing_whitespace,labels
169854,6716411b-4770-4945-bb6c-05c963f984f8,Yukio Ma is an experienced electrical engineer...,),True,I-STREET_ADDRESS


In [19]:
# Since we only check for punctuation, this looks fine to me with some ratio of spacy tokenization patterns that are not punctuation
print(
    "new shape: ", 
    new_.shape[0], 
    "- old shape + punctuation: ", 
    ex.shape[0] + new_[(new_.tokens.isin([p for p in string.punctuation]))].shape[0]
)

new shape:  1570222 - old shape + punctuation:  1544255
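
For reference, the gap between the two printed numbers can be worked out directly. The sketch below only redoes the arithmetic on the values printed above; the variable names are mine:

```python
new_token_count = 1570222        # rows after retokenizing with spaCy
old_plus_punctuation = 1544255   # old rows + punctuation-only tokens

# tokens that are neither old tokens nor single punctuation characters
unexplained = new_token_count - old_plus_punctuation
share = unexplained / new_token_count
print(unexplained, f"{share:.2%}")
```

The remainder is small relative to the total, which fits the explanation of spaCy splits that are not plain punctuation.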


In [20]:
og[og.document=="3e47e235-6526-4555-875b-c908499e5d33"]

Unnamed: 0,document,text,tokens,trailing_whitespace,labels,prompt,prompt_id,name,email,phone,job,address,username,url,hobby,len
216,3e47e235-6526-4555-875b-c908499e5d33,"My name is Gabriel Fischer. I'm a plumber, and...","[My, name, is, Gabriel, Fischer., I'm, a, plum...","[True, True, True, True, True, True, True, Tru...","[O, O, O, B-NAME_STUDENT, I-NAME_STUDENT, O, O...",\n Gabriel Fischer is a plumber. Write a fi...,3,Gabriel Fischer,gabriel.fischer@gmail.org,+27 67 568 4444,plumber,1636 Briarview Court,,,Wood burning,259


In [21]:
labels_[216]

{'NAME_STUDENT': ['Gabriel', 'Fischer'],
 'EMAIL': ['gabriel.fischer@gmail.org'],
 'STREET_ADDRESS': ['1636', 'Briarview', 'Court']}

In [22]:
x = new_[new_.document == "4f9113a4-e372-47a3-b892-dfdc4aded681"]
for t,l in zip(x.tokens, x.labels):
    print(t, "-", l)

My - O
name - O
is - O
Joseph - B-NAME_STUDENT
Martinez - I-NAME_STUDENT
, - O
and - O
I - O
have - O
the - O
pleasure - O
of - O
sharing - O
my - O
life - O
's - O
journey - O
with - O
you - O
. - O
I - O
reside - O
at - O
22 - B-STREET_ADDRESS
Gallatin - I-STREET_ADDRESS
Street - I-STREET_ADDRESS
Northeast - I-STREET_ADDRESS
, - O
where - O
I - O
've - O
had - O
the - O
privilege - O
of - O
calling - O
home - O
for - O
many - O
years - O
. - O
My - O
life - O
has - O
been - O
a - O
tapestry - O
of - O
experiences - O
, - O
both - O
joyous - O
and - O
challenging - O
, - O
shaping - O
me - O
into - O
the - O
person - O
I - O
am - O
today - O
. - O
From - O
a - O
tender - O
age - O
, - O
I - O
was - O
captivated - O
by - O
the - O
written - O
word - O
. - O
I - O
spent - O
hours - O
immersing - O
myself - O
in - O
books - O
, - O
exploring - O
realms - O
unknown - O
and - O
unlocking - O
the - O
secrets - O
of - O
the - O
world - O
. - O
My - O
passion - O
for - O
language - O
and - O


## Turn to aggregated csv - for your convenience

In [23]:
# unexplode columns to lists (technically you can skip this step and save it exploded instead)
fixed = (
    new_.groupby("document")
    .agg(
        {
            # every row of a document carries the same text, so keep the first one
            "text": "first",
            "tokens": lambda x: x.tolist(),
            "trailing_whitespace": lambda x: x.tolist(),
            "labels": lambda x: x.tolist(),
        }
    )
    .reset_index()
)

In [24]:
fixed.to_csv("pii_dataset_fixed.csv", index = False)
fixed.shape

(4434, 5)

## Also save as JSON - as this is the main format

In [25]:
json_format=[]
for idx, row in fixed.iterrows():
    doc=row["document"]
    text=row["text"]
    tokens=row["tokens"]
    ws=row["trailing_whitespace"]
    labels=row["labels"]
    
    json_format.append(
        {
            "document":doc,
            "full_text":text,
            "tokens":tokens,
            "trailing_whitespace":ws,
            "labels":labels
        }
    )
import json
# the context manager closes the file even if json.dump raises
with open("pii_dataset_fixed.json", "w") as out_file:
    json.dump(json_format, out_file)
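
After saving, it may be worth loading the JSON again and checking that the three lists of each document still have the same length. A sketch on a single invented record written to a temporary file; the record, the path, and `is_aligned` are all made up for illustration:

```python
import json
import os
import tempfile

records = [
    {
        "document": "toy-doc",
        "full_text": "Hi Homer.",
        "tokens": ["Hi", "Homer", "."],
        "trailing_whitespace": [True, False, False],
        "labels": ["O", "B-NAME_STUDENT", "O"],
    }
]

# write and read back, the same way the notebook file would be consumed
path = os.path.join(tempfile.mkdtemp(), "toy.json")
with open(path, "w") as f:
    json.dump(records, f)
with open(path) as f:
    loaded = json.load(f)

def is_aligned(rec):
    """True if tokens, whitespace flags and labels have equal length."""
    return len(rec["tokens"]) == len(rec["trailing_whitespace"]) == len(rec["labels"])

all_aligned = all(is_aligned(r) for r in loaded)
print(all_aligned)
```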

## Second Dataset 

In [26]:
# As my first approach was based on the format of column "1" of this dataset, this does not need as much pre-processing
ai_data = pd.read_csv('/kaggle/input/pii-detection-dataset-gpt/ai_data.csv')
ai_data.columns=["text","labeldict"]
ai_data.head()

# However, as you can see in row 4 for NAME_STUDENT, the dictionary is not consistent:
# some entries hold several words (e.g. 'Aaron Smith'), which would probably break our approach.
# Therefore we have to split such entries into single tokens

Unnamed: 0,text,labeldict
0,"In today's modern world, where technology has ...","{'NAME_STUDENT': ['Richard', 'Chang'], 'EMAIL'..."
1,"In today's modern world, where technology has ...","{'NAME_STUDENT': [], 'EMAIL': ['tamaramorrison..."
2,Janice: A Student with a Unique Identity\n\nIn...,"{'NAME_STUDENT': ['Janice'], 'EMAIL': ['laura5..."
3,Christian is a student who goes by the usernam...,"{'NAME_STUDENT': ['Christian'], 'EMAIL': [], '..."
4,"In today's modern world, where technology has ...","{'NAME_STUDENT': ['Aaron Smith', 'Fischer', 'T..."


In [27]:
def pj_to_pj(row):
    # wanted format: {"NAME_STUDENT":["Valentin", "Werner"]}
    labeldict={}
    for label, pii, in ast.literal_eval(row["labeldict"]).items():
        pii_list=[]
        for info in pii:
            pii_list += tokenize_with_spacy(info)["tokens"]

        clean_pii_list=[]
        for token in pii_list:
            # this string.punctuation filters everything except "(" which might be an opening token in phone numbers
            while len(token) > 0 and token[-1] in "".join(string.punctuation.split("(")): 
                token=token[:-1]
            if len(token) > 0:
                clean_pii_list+=[token]
        labeldict[label] = clean_pii_list
    return labeldict

labels_=[]
for i, row in ai_data.iterrows():
    labeldict=pj_to_pj(row)
    labels_.append(labeldict)

In [28]:
tokens = [tokenize_with_spacy(r["text"]) for idx, r in ai_data.iterrows()]
tokens[0].keys()

dict_keys(['tokens', 'trailing_whitespace'])

In [29]:
# This loop is very inefficient, but it takes 0.3 seconds - so who cares

new_=[]
for tok, l in zip(tokens, labels_):
    
    # these will just be forwarded to the final result, as we do not change these
    t = tok["tokens"]
    ws = tok["trailing_whitespace"]
    
    # Create "O" label as standard value to overwrite on specific indices
    new_labels=["O"]*len(t)
    
    # Find entities from labels_ in the text
    for ent_type, ent_list in l.items():
        for ent_ in ent_list:
            # find occurrences of tagged entities in the list
            # - this assumes that entities do not contain common words such as "the"
            indices = [i for i, x in enumerate(t) if x == ent_]
            for i in indices:
                # overwrite "O" label with correct label
                new_labels[i] = ent_type
    new_.append({"tokens":t, "trailing_whitespace":ws, "labels":new_labels})

In [30]:
# As we only labelled words, but not the punctuation in between these words, we need to fill the gaps
new_2=[]
punctuation = [p for p in string.punctuation]
for r, labeldict in zip(new_, labels_):
    
    sandwich_on_comma = ["STREET_ADDRESS"]
    # again these are just forwarded
    t = r["tokens"]
    ws = r["trailing_whitespace"]
    # again, these may get overwritten
    label = r["labels"]
    new_labels=["O"]*len(label)
    for i, l in enumerate(label):
        # get prior label if possible
        if i != 0: prior_label=label[i-1]
        else: prior_label="O"

        # get next label
        if i+1 < len(label): next_label=label[i+1]
        elif i+1 == len(label): next_label="O"
        
        # skip filler / list words that split multiple entities
        if (t[i] == "and" and l == "O") or (t[i] == "or" and l == "O"):
            new_labels[i] = "O"
        elif (t[i] == "." and l == "O" and prior_label=="NAME_STUDENT"):
            new_labels[i] = "O"
        # "(" might be labeled as phone num, which is only correct if more phone num follows
        elif t[i] == "(" and l == "PHONE_NUM" and next_label!="PHONE_NUM":
            new_labels[i] = "O"
        elif t[i] == "(" and l != "PHONE_NUM":
            # print(l, t[i-2:i+5]) # this is whenever we have something like "my name is Valentin (Valle)"
            new_labels[i] = "O"
        elif prior_label == "EMAIL" and t[i] == "to":
            new_labels[i] = "O"
        elif (prior_label == "NAME_STUDENT" or prior_label == "O") and t[i] == "'s":
            new_labels[i] = "O"
        # only street addresses should contain commas - this avoids labelling sandwiches
        # which chain multiple entities, such as "Valentin Werner, Thomas Müller, and Manuel Neuer"
        # As these should be three separate entities
        elif t[i] == "," and prior_label not in sandwich_on_comma:
            new_labels[i] = "O"
            
        # replace if we got a sandwich ("LABEL"-"O"-"LABEL", such as "Berlin" - "," - "Germany")
        elif prior_label == next_label and prior_label != "O":
            new_labels[i] = prior_label
        elif l != "O":
            new_labels[i] = l
        else:
            new_labels[i] = "O"
    
    new_2.append({"tokens":t, "trailing_whitespace":ws, "labels":new_labels})

In [31]:
# Turn labels into BIO Labels
new_bio=[]
for i, r in enumerate(new_2):
    
    # again, these are just forwarded
    t = r["tokens"]
    ws = r["trailing_whitespace"]
    # again, these might get overwritten
    label = r["labels"]

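    # keep track of last label to identify when to use B or I
    # note: the inner loop reuses `i`, so "doc_prelim" below stores the index of the document's
    # last token rather than the document index - the column is dropped again later
    last_label="O"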
    for i, l in enumerate(label):
        if l != last_label and l != "O":
            label[i] = "B-"+l
        elif l == last_label and last_label != "O":
            label[i] = "I-"+l
        last_label = l
    new_bio.append({"doc_prelim":i,"tokens":t, "trailing_whitespace":ws, "labels":label})

In [32]:
new_ = pd.DataFrame(new_bio)
new_ = new_.explode(["tokens", "trailing_whitespace", "labels"])
new_.shape # note that this produces even more tokens than my prior approach

(1019274, 4)

In [33]:
new_.head(25)

Unnamed: 0,doc_prelim,tokens,trailing_whitespace,labels
0,432,In,True,O
0,432,today,False,O
0,432,'s,True,O
0,432,modern,True,O
0,432,world,False,O
0,432,",",True,O
0,432,where,True,O
0,432,technology,True,O
0,432,has,True,O
0,432,become,True,O


In [34]:
new_ = new_.reset_index(names="doc_")
new_["document"] = new_.doc_.apply(lambda x: f"pj_{x}")

# get text from original 
new_ = pd.merge(new_, ai_data.reset_index(names="doc_")[["doc_","text"]], on="doc_", how="left")
new_=new_.drop(columns=["doc_prelim","doc_"])
new_=new_[["document","text","tokens","trailing_whitespace","labels"]]
new_.head()

Unnamed: 0,document,text,tokens,trailing_whitespace,labels
0,pj_0,"In today's modern world, where technology has ...",In,True,O
1,pj_0,"In today's modern world, where technology has ...",today,False,O
2,pj_0,"In today's modern world, where technology has ...",'s,True,O
3,pj_0,"In today's modern world, where technology has ...",modern,True,O
4,pj_0,"In today's modern world, where technology has ...",world,False,O


## Sanity checks
- only one for this one
- please tell me if you see problems with this dataframe!!

In [35]:
# Same number of distinct labels as before? If not, check the difference - is it correct?
new_.labels.nunique()

12

In [36]:
target = [
    'B-EMAIL', 'B-ID_NUM', 'B-NAME_STUDENT', 'B-PHONE_NUM', 
    'B-STREET_ADDRESS', 'B-URL_PERSONAL', 'B-USERNAME', 'I-ID_NUM', 
    'I-NAME_STUDENT', 'I-PHONE_NUM', 'I-STREET_ADDRESS', 'I-URL_PERSONAL'
]

for l in new_.labels.unique():
    if l not in target:
        print(l)

for l in target:
    if l not in new_.labels.unique():
        print(l)

O
I-URL_PERSONAL
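
The two loops above amount to a set difference in each direction. A compact sketch of the same check; the `observed` set is reconstructed from the printed output rather than read from the dataframe, so treat it as an assumption:

```python
target = {
    "B-EMAIL", "B-ID_NUM", "B-NAME_STUDENT", "B-PHONE_NUM",
    "B-STREET_ADDRESS", "B-URL_PERSONAL", "B-USERNAME", "I-ID_NUM",
    "I-NAME_STUDENT", "I-PHONE_NUM", "I-STREET_ADDRESS", "I-URL_PERSONAL",
}

# assumption: the labels found in the data, rebuilt from the output above
observed = (target - {"I-URL_PERSONAL"}) | {"O"}

unexpected = sorted(observed - target)  # in the data, but not in target
missing = sorted(target - observed)     # in target, but not in the data
print(unexpected, missing)
```

Here "O" is expected, since `target` only lists entity labels, so the only real finding is that no token received `I-URL_PERSONAL`.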


## Again, aggregated df for your convenience

In [37]:
# unexplode columns to lists (technically you can skip this step and save it exploded instead)
fixed = (
    new_.groupby("document")
    .agg(
        {
            # every row of a document carries the same text, so keep the first one
            "text": "first",
            "tokens": lambda x: x.tolist(),
            "trailing_whitespace": lambda x: x.tolist(),
            "labels": lambda x: x.tolist(),
        }
    )
    .reset_index()
)

In [38]:
fixed.to_csv("moredata_dataset_fixed.csv", index = False)
fixed.shape

(2000, 5)

## Also save as JSON

In [39]:
json_format=[]
for idx, row in fixed.iterrows():
    doc=row["document"]
    text=row["text"]
    tokens=row["tokens"]
    ws=row["trailing_whitespace"]
    labels=row["labels"]
    
    json_format.append(
        {
            "document":doc,
            "full_text":text,
            "tokens":tokens,
            "trailing_whitespace":ws,
            "labels":labels
        }
    )
import json
# the context manager closes the file even if json.dump raises
with open("moredata_dataset_fixed.json", "w") as out_file:
    json.dump(json_format, out_file)

In [40]:
len(json_format[20]["tokens"])

463