# spaCy dataset creation

This notebook takes train and test datasets (of type `List[InputSample]`)
and transforms them into two structures consumed by spaCy:
1. spaCy JSON (see https://spacy.io/api/annotation#json-input)
2. spaCy pickle files, each holding a list of the form `[(full_text, {"entities": [(start, end, type), ...]}), ...]`.  
See more details here: https://spacy.io/api/annotation#json-input

JSON is used for spaCy's CLI trainer. 
Pickle is used for fine-tuning using the logic in [../models/spacy_retrain.py](../models/spacy_retrain.py)
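
To make the pickle structure concrete, here is a minimal sketch of a single training example in that shape. The sentence, the label names, and the `make_example` helper are hypothetical and only illustrate the layout; they are not part of `presidio_evaluator`.

```python
from typing import Dict, List, Tuple

# One example: (full_text, {"entities": [(start, end, type), ...]})
SpacyExample = Tuple[str, Dict[str, List[Tuple[int, int, str]]]]


def make_example(text: str, mentions: List[Tuple[str, str]]) -> SpacyExample:
    """Build one example from (substring, label) pairs (hypothetical helper)."""
    entities = []
    for substring, label in mentions:
        # Uses the first occurrence; raises ValueError if the substring is absent
        start = text.index(substring)
        entities.append((start, start + len(substring), label))
    return (text, {"entities": entities})


example = make_example(
    "My name is Dana and I live in Lisbon",
    [("Dana", "PERSON"), ("Lisbon", "LOCATION")],
)
train_data = [example]

# Each span is a character offset pair, so slicing the text recovers the entity
for text, annotations in train_data:
    for start, end, label in annotations["entities"]:
        print(label, repr(text[start:end]))
```

The offsets are character positions with an exclusive end, so `text[start:end]` returns the entity text.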

In [1]:
from presidio_evaluator.data_generator import read_synth_dataset
%reload_ext autoreload

In [None]:
DATA_DATE = 'November 12 2019'

In [None]:
data_path = "../data/generated_{}_{}.json"

train_samples = read_synth_dataset(data_path.format("train",DATA_DATE))
print("Read {} samples".format(len(train_samples)))

For training, keep only sentences with entities:

In [4]:
train_tagged = [sample for sample in train_samples if len(sample.spans)>0]
print("Kept {} samples after removal of non-tagged samples".format(len(train_tagged)))

Evaluate training set's entities

In [None]:
print("Entities found in training set:")
entities = []
for sample in train_tagged:
    entities.extend(sample.tags)
set(entities)
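
A set shows which labels occur but not how often. As a rough sketch, `collections.Counter` gives the frequency of each label. It assumes each sample exposes `tags` as a list of per-token labels, as the loop above suggests; the `SimpleNamespace` objects and label values are hypothetical stand-ins for `InputSample`.

```python
from collections import Counter
from types import SimpleNamespace

# Hypothetical stand-ins for InputSample objects; only `tags` is modelled
samples = [
    SimpleNamespace(tags=["O", "PERSON", "O"]),
    SimpleNamespace(tags=["LOCATION", "LOCATION", "O"]),
]

# Count every tag across all samples
tag_counts = Counter(tag for sample in samples for tag in sample.tags)

for tag, count in tag_counts.most_common():
    print(tag, count)
```

With the real data, replacing `samples` with `train_tagged` should give the same kind of summary, and can reveal labels that are too rare to train on.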

Create spaCy dataset (option 2: pickle)

In [None]:
from presidio_evaluator import InputSample
import pickle

spacy_train = InputSample.create_spacy_dataset(train_tagged)


In [7]:
# Collect the entity type (third element of each span tuple) across all samples
entities_spacy = [x[1]['entities'] for x in spacy_train]
entities_spacy_flat = []
for samp in entities_spacy:
    for ent in samp:
        entities_spacy_flat.append(ent[2])
set(entities_spacy_flat)

Create spaCy dataset (option 1: JSON)

In [8]:
from presidio_evaluator import InputSample
spacy_train_json = InputSample.create_spacy_json(train_tagged)

Quick evaluation of samples

In [9]:
[sample[0] for sample in spacy_train[:100]]

In [10]:
spacy_train_json[0]['paragraphs'][0]['sentences']
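
The cell above indexes into `['paragraphs'][0]['sentences']`. Assuming the output follows the spaCy JSON training format linked at the top of this notebook, a document might look roughly like the simplified sketch below. The text, labels, and the `iter_token_tags` helper are hypothetical, and the exact fields emitted by `create_spacy_json` are not shown in this notebook.

```python
# Hand-written, simplified document; real output likely carries more token fields
doc = {
    "id": 0,
    "paragraphs": [
        {
            "raw": "Dana lives in Lisbon",
            "sentences": [
                {
                    "tokens": [
                        {"id": 0, "orth": "Dana", "ner": "U-PERSON"},
                        {"id": 1, "orth": "lives", "ner": "O"},
                        {"id": 2, "orth": "in", "ner": "O"},
                        {"id": 3, "orth": "Lisbon", "ner": "U-LOCATION"},
                    ]
                }
            ],
        }
    ],
}


def iter_token_tags(document):
    """Yield (token text, NER tag) pairs from one document (hypothetical helper)."""
    for paragraph in document["paragraphs"]:
        for sentence in paragraph["sentences"]:
            for token in sentence["tokens"]:
                yield token["orth"], token["ner"]


pairs = list(iter_token_tags(doc))
print(pairs)
```

Unlike the pickle structure, which stores character offsets, this format attaches one tag to each token.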

Dump the training set to pickle and JSON files, respectively

In [11]:
import pickle
import json
with open("../data/train.pickle", 'wb') as handle:
    pickle.dump(spacy_train,handle, protocol=pickle.HIGHEST_PROTOCOL)

with open("../data/train.json","w") as f:
    json.dump(spacy_train_json,f)
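
After writing the files, it can be worth reading them back to confirm that nothing was lost. The sketch below round-trips small hypothetical stand-ins for `spacy_train` and `spacy_train_json` through a temporary directory, so it does not touch `../data`.

```python
import json
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-ins for `spacy_train` and `spacy_train_json`
spacy_data = [
    ("Dana lives in Lisbon", {"entities": [(0, 4, "PERSON"), (14, 20, "LOCATION")]})
]
json_data = [{"id": 0, "paragraphs": [{"raw": "Dana lives in Lisbon", "sentences": []}]}]

with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = Path(tmp_dir) / "train.pickle"
    json_path = Path(tmp_dir) / "train.json"

    # Write both files the same way the notebook does
    with open(pickle_path, "wb") as handle:
        pickle.dump(spacy_data, handle, protocol=pickle.HIGHEST_PROTOCOL)
    with open(json_path, "w") as f:
        json.dump(json_data, f)

    # Read them back
    with open(pickle_path, "rb") as handle:
        loaded_pickle = pickle.load(handle)
    with open(json_path) as f:
        loaded_json = json.load(f)

print(loaded_pickle == spacy_data, loaded_json == json_data)
```

Pickle preserves the tuples in the entity spans, whereas JSON would turn tuples into lists on load. Only unpickle files you created yourself or otherwise trust.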
       

Create JSON and pickle files for the test dataset

In [12]:
test_samples = read_synth_dataset(data_path.format("test",DATA_DATE))
print("Read {} samples".format(len(test_samples)))

In [13]:
spacy_test = InputSample.create_spacy_dataset(test_samples)
spacy_test_json = InputSample.create_spacy_json(test_samples)
print(spacy_test[14])

Dump the test set to pickle and JSON files, respectively

In [14]:
import json
import pickle
with open("../data/test.pickle", 'wb') as handle:
    pickle.dump(spacy_test,handle, protocol=pickle.HIGHEST_PROTOCOL)
    
with open("../data/test.json","w") as f:
    json.dump(spacy_test_json,f)
       