# Data preparation for classifier training

This notebook shows how to prepare the data for classifier training.

    Input:
        Delimited text file (CSV in the example below) with a label column and a text column. The column names and the delimiter are configured in section 1.
    Output:
        .json files (each with a train/test split) with embeddings obtained from different pre-trained embedding models:
            1) word-level fastText embeddings: model cc.en.300.bin
                (https://fasttext.cc/docs/en/crawl-vectors.html)
            2) sentence-level transformer embeddings: model all-mpnet-base-v2
                (https://www.sbert.net/docs/pretrained_models.html#model-overview)
            3) sentence-level transformer embeddings: model all-distilroberta-v1
                (https://www.sbert.net/docs/pretrained_models.html#model-overview)
            4) sentence-level BERT cased embeddings: model BERT-Base, Cased
                (https://github.com/google-research/bert#pre-trained-models)
            5) sentence-level BERT uncased embeddings: model BERT-Base, Uncased
                (https://github.com/google-research/bert#pre-trained-models)
                
        (.._traintest.json files contain train/test split without embedding vectors)
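
As a rough orientation, the records in these files can be pictured as below. This is a minimal sketch based on the keys used later in the notebook (`train_data`, `test_data`, `sentence`, `class`, `sentence_vectorized`). The example sentences, the 3-dimensional vectors and the helper name `to_xy` are made up for illustration, and the vectors are assumed to be plain lists of floats.

```python
import json

# Hand-made stand-in for one of the embedding files; real vectors are much longer.
sample = {
    "train_data": [
        {"sentence": "very good soup", "class": "5", "sentence_vectorized": [0.1, 0.2, 0.3]},
        {"sentence": "too much salt", "class": "2", "sentence_vectorized": [0.0, 0.5, 0.1]},
    ],
    "test_data": [
        {"sentence": "nice easy recipe", "class": "4", "sentence_vectorized": [0.3, 0.3, 0.9]},
    ],
}

def to_xy(records):
    """Split a list of records into feature vectors and labels (hypothetical helper)."""
    X = [r["sentence_vectorized"] for r in records]
    y = [r["class"] for r in records]
    return X, y

# Round-trip through JSON text, the way a file written with json.dump is read back.
loaded = json.loads(json.dumps(sample, ensure_ascii=False))
X_train, y_train = to_xy(loaded["train_data"])
X_test, y_test = to_xy(loaded["test_data"])
print(len(X_train), len(X_test), y_train)
```

A classifier would then typically be fitted on `X_train`, `y_train` and evaluated on `X_test`, `y_test`.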
            
  

## 1. Filtering text examples by length and syntactical structure

*** First we tokenize the text examples. Then we parse them, skipping lines that contain more than 6 tokens (those lines are excluded from further processing).
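
The length rule can be sketched on its own, without the tokeniser or the parser. The function below is an illustration, not the notebook's code: the name `shorten` and its parameters are hypothetical, the input is assumed to be already tokenised and space-separated, and the `first_sentence` flag mirrors the `firstsentence` option that the notebook sets further down.

```python
import re

MAX_TOKENS = 6  # same limit as in the notebook

def shorten(tok_text, first_sentence=False, max_tokens=MAX_TOKENS):
    """Return the text to keep, or None if the example should be skipped."""
    if len(tok_text.split()) <= max_tokens:
        return tok_text
    if first_sentence:
        # Keep everything before the first '.', '!' or '?'.
        # The .strip() is an addition of this sketch.
        head = re.sub(r'^([^.!?]+).+', r'\1', tok_text).strip()
        if len(head.split()) <= max_tokens:
            return head
    return None

print(shorten("very good soup"))
print(shorten("great recipe ! will make it again soon"))
print(shorten("great recipe ! will make it again soon", first_sentence=True))
```

The second call returns `None` because the text has 8 tokens. The third keeps only `great recipe`, the part before the first sentence-ending punctuation mark.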

In [1]:
import sys
import os
import json
import csv
import re
import subprocess

In [2]:
from lambeq import SpacyTokeniser
from lambeq import BobcatParser

In [3]:
tokeniser = SpacyTokeniser()
bobcat_parser = BobcatParser()  # runs on CPU; use BobcatParser(device=0) instead if a GPU is available

In [4]:
datadir = "../../data/datasets"
dsName='RAW_interactions'

Change these variables to match your dataset!
Specify whether the first row of your dataset contains the field names, and give the names of the classification field and the text field. Also specify the field delimiter.

In [5]:
fieldnamesinfile=True
classfield="rating"
txtfield="review"
firstsentence=False  # if True, a text longer than 6 tokens is cut to its first sentence and kept if that fits the limit
delimiter=','
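
How these settings reach `csv.DictReader` can be tried on in-memory data first. This is a small sketch with a made-up row; the field names match the values set above. When there is no header row, the names passed as `fieldnames` are assigned to the columns in order, so this path assumes the file has exactly a class column followed by a text column.

```python
import csv
import io

with_header = io.StringIO('rating,review\n5,"Great, easy recipe"\n')
without_header = io.StringIO('5,"Great, easy recipe"\n')

# Header row present: field names are taken from the first line.
rows_a = list(csv.DictReader(with_header, delimiter=',', quotechar='"'))

# No header row: names are supplied explicitly, in column order.
rows_b = list(csv.DictReader(without_header, delimiter=',',
                             fieldnames=["rating", "review"], quotechar='"'))

print(rows_a[0]["rating"], rows_a[0]["review"])
print(rows_b[0] == rows_a[0])
```

The quoted comma inside the review stays part of the text field in both cases.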

In [6]:
input_file = f"{datadir}/{dsName}.csv"
output_file = f"{datadir}/withtags_{dsName}.tsv"

In [7]:
with open(input_file, encoding="utf8", newline='') as csvfile, open(output_file, "w", encoding="utf8") as tsvfile:
        if fieldnamesinfile != False: 
            news_reader = csv.DictReader(csvfile, delimiter=delimiter, quotechar='"')
        else:
            news_reader = csv.DictReader(csvfile, delimiter=delimiter, fieldnames = [classfield, txtfield], quotechar='"')        
        processed_summaries = set()
        norm_process_params = ["perl", "../../data/data_processing/scripts/normalize-punctuation.perl","-b","-l", "en"]
        norm_process = subprocess.Popen(norm_process_params, stdin=subprocess.PIPE, stdout=subprocess.PIPE, close_fds=True)
        for row in news_reader:
            score = row[classfield]
            if score == '0':
                continue
            summary = row[txtfield].replace("\n"," ").replace("\t"," ").replace("\r"," ")
            summary = re.sub(r'@\S+ ', '', summary)  # drop @mentions
            norm_process.stdin.write(summary.encode('utf-8'))
            norm_process.stdin.write('\n'.encode('utf-8'))
            norm_process.stdin.flush()
            norm_summary = norm_process.stdout.readline().decode("utf-8").rstrip()
            tok_summary = " ".join(tokeniser.tokenise_sentences([norm_summary])[0])

            if len(tok_summary.split())>6:
                if firstsentence==True:
                    tok_summary = re.sub(r'^([^.!?]+).+', r'\1', tok_summary)  # keep the text before the first . ! or ?
                    if len(tok_summary.split())>6:
                        continue
                else:
                    continue
            if tok_summary in processed_summaries:
                continue
            sent_type = ''
            processed_summaries.add(tok_summary)
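            try:
                # Flatten the parse tree JSON into a bracketed string of CCG types
                result = bobcat_parser.sentence2tree(tok_summary).to_json()
                str1 = re.sub(r"'(rule|text)':\s[\"'][^\s]+[\"']", '', str(result))
                str1 = re.sub(r"'(type|children)':\s+", '', str1)
                str1 = re.sub(r"[{},']", '', str1)
                str1 = re.sub(r'\s+([\[\]])', r'\1', str1)
                sent_type = str1
            except Exception:
                # Parsing failed: keep the example but leave its tag empty
                sent_type = ''
            print("{0}\t{1}\t{2}".format(score, tok_summary,sent_type),file=tsvfile)
        # Shut down the Perl normaliser once all rows are processed
        norm_process.stdin.close()
        norm_process.wait()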

The output is a filtered 3-column file containing the class, the text and the syntactic structure of the text.
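
To see what the third column holds, the regular-expression chain from the cell above can be applied to a hand-written tree. The dict below only imitates the shape that the parser's JSON output is assumed to have (`type`, `rule`, `text`, `children`); it was not produced by Bobcat, and `flatten_tree` is a hypothetical name.

```python
import re

def flatten_tree(tree_json):
    """Reduce a nested parse-tree dict to a bracketed string of category labels."""
    s = str(tree_json)
    s = re.sub(r"'(rule|text)':\s[\"'][^\s]+[\"']", "", s)  # drop rule names and words
    s = re.sub(r"'(type|children)':\s+", "", s)             # drop the remaining keys
    s = re.sub(r"[{},']", "", s)                            # drop braces, commas, quotes
    s = re.sub(r"\s+([\[\]])", r"\1", s)                    # no whitespace before brackets
    return s

# Toy tree for a two-word phrase: a modifier (n/n) applied to a noun n.
toy_tree = {
    "type": "n", "rule": "FA",
    "children": [
        {"type": "(n/n)", "rule": "L", "text": "good"},
        {"type": "n", "rule": "L", "text": "soup"},
    ],
}
print(flatten_tree(toy_tree))
```

For this toy input the result is `n[(n/n)   n]`: the parent category followed by its children in brackets. The run of three spaces is left over from the removed keys and commas.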

*** Next we choose the text examples whose syntactic structure belongs to a prespecified set.
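
The notebook does not say how the set of structures was put together. One possible way, shown here purely as an assumption, is to count how often each tag occurs in the 3-column file and pick the most frequent ones. The rows below are made up and stand in for the real file.

```python
import csv
import io
from collections import Counter

# Stand-in for the 3-column file: class, text, tag (made-up rows).
sample_tsv = io.StringIO(
    "5\tfresh tasty soup\tn[(n/n)   n[(n/n)   n]]\n"
    "4\tnice easy recipe\tn[(n/n)   n[(n/n)   n]]\n"
    "5\tgood soup\tn[(n/n)   n]\n"
    "3\tunparsable text\t\n"   # parsing failed, tag left empty
)

reader = csv.reader(sample_tsv, delimiter="\t", quotechar='"')
# Count non-empty tags only.
tag_counts = Counter(row[2] for row in reader if len(row) == 3 and row[2])

for tag, count in tag_counts.most_common():
    print(count, tag)
```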

In [8]:
input_file = f"{datadir}/withtags_{dsName}.tsv"
output_file = f"{datadir}/{dsName}.tsv"
tags_file = f"{datadir}/tags.txt"

In [9]:
with open(tags_file, encoding="utf8") as f:  # one syntactic structure per line
    filterlist = f.read().splitlines()
print(filterlist)

['n[(n/n)   n[(n/n)   n]]', 'n[(n/n)   n[(n/n)   n[(n/n)   n]]]', 'n[n[n[(n/n)   n]] (n\\\\n)[((n\\\\n)/n)   n[n[(n/n)   n]]]]', 'n[(n/n)[((n/n)/(n/n))   (n/n)] n]', 'n[n[n[(n/n)   n]] (n\\\\n)[((n\\\\n)/n)   n[(n/n)   n]]]']
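
The runs of four backslashes in this printout are probably an escaping effect rather than part of the category names. A likely explanation, sketched below with a hand-made value, is that the tags are built from the string form of a dict, which doubles every backslash, and printing a list doubles them once more.

```python
category = "(n\\n)"                    # the category (n\n): one backslash character
as_written = str({"type": category})   # string form of a dict, as used when building a tag
tag = as_written[len("{'type': '"):-len("'}")]

print(tag)     # two backslash characters
print([tag])   # four backslash characters in the list's printed form
```

So the entries in the tags file need the doubled form for the comparison with the `Tag` column to succeed.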


In [10]:
  
with open(input_file, "r", encoding="utf8") as ifile, open(output_file, "w", encoding="utf8") as ofile:
    tsv_reader = csv.DictReader(ifile, fieldnames=['Class','Txt','Tag'], delimiter="\t", quotechar='"')
    for item in tsv_reader:
        if item['Tag'] in filterlist:
            print("{0}\t{1}\t{2}".format(item['Class'],item['Txt'],item['Tag']),file=ofile)

The output is a 3-column file, filtered by syntax, containing the class, the text and the syntactic structure of the text.

## 2. Splitting examples into train and test sets and acquiring embedding vectors

In [11]:
from sklearn.model_selection import train_test_split
from typing import List, Tuple, Dict
sys.path.append("../../data/data_processing/data_vectorisation/")
from Embeddings import Embeddings
from collections import defaultdict

In [12]:
def unpack_data(data: List[Tuple[str, str]]) -> List[Dict[str, str]]:
    return [{
        "sentence": sentence,
        "class": sentence_type,
    } for sentence, sentence_type in data]

In [13]:
input_file = f"{datadir}/{dsName}.tsv"

In [14]:
dataset: List[Tuple[str, str]] = []
datasettag={}

In [15]:
with open(input_file, "r", encoding="utf-8") as f:
    for line in f:
        cols=line.split('\t')
        if len(cols) == 3:
            sent = cols[1].rstrip()
            dataset.append((sent, cols[0].rstrip()))
            datasettag[sent] = cols[2].rstrip()


In [16]:
classes = [item[1] for item in dataset]
classes

['5',
 '5',
 '5',
 '5',
 '3',
 '5',
 '5',
 '5',
 '5',
 '5',
 '5',
 '4',
 '5',
 '5',
 '4',
 '5',
 '5',
 '5',
 '5',
 '5',
 '5',
 '5',
 '5',
 '4',
 '5',
 '5',
 '5',
 '5',
 '5',
 '5',
 '5',
 '5',
 '5',
 '5',
 '5',
 '5',
 '5',
 '3',
 '5',
 '5',
 '5',
 '5',
 '3',
 '5',
 '5',
 '5',
 '4',
 '4',
 '5',
 '5',
 '4',
 '4',
 '5',
 '5',
 '5',
 '4',
 '5',
 '4',
 '5',
 '5',
 '5',
 '5',
 '5',
 '3',
 '5',
 '5',
 '5',
 '5',
 '5',
 '5',
 '5',
 '5',
 '4',
 '5',
 '4',
 '5',
 '5',
 '5',
 '1',
 '5',
 '5',
 '4',
 '5',
 '5',
 '5',
 '5',
 '5',
 '5',
 '5',
 '4',
 '5',
 '5',
 '5',
 '4',
 '2',
 '5',
 '5',
 '5',
 '3',
 '5',
 '5',
 '5',
 '5',
 '5',
 '4',
 '5',
 '5',
 '5',
 '1',
 '1',
 '4',
 '4',
 '1',
 '4',
 '4',
 '4',
 '5',
 '5',
 '3',
 '4',
 '5',
 '5',
 '5',
 '5',
 '5',
 '5',
 '5',
 '5',
 '5',
 '5',
 '5',
 '5',
 '5',
 '5',
 '5',
 '5',
 '5',
 '5',
 '5',
 '5',
 '5',
 '5',
 '5',
 '5',
 '5',
 '4',
 '5',
 '3',
 '5',
 '5',
 '5',
 '5',
 '5',
 '4',
 '4',
 '5',
 '5',
 '5',
 '5',
 '2',
 '5',
 '3',
 '5',
 '5',
 '5',
 '5',
 '5'

In [17]:
train_data, test_data = train_test_split(dataset, train_size=0.9, random_state=1, stratify=classes)

In [18]:
my_result = defaultdict(list)
for element in test_data:
    my_result[element[1]].append(element[0])

my_result = dict(my_result)
result_dictionary = dict()

for key in my_result:
    result_dictionary[key] = len(list(set(my_result[key]))) / len(test_data)
print(f"*** Proportion of classes in {len(test_data)} examples of test data ***")
print(json.dumps(result_dictionary, indent=4, sort_keys=True))

*** Proportion of classes in 20 examples of test data ***
{
    "1": 0.05,
    "3": 0.05,
    "4": 0.15,
    "5": 0.75
}


In [19]:
my_result = defaultdict(list)
for element in train_data:
    my_result[element[1]].append(element[0])

my_result = dict(my_result)
result_dictionary = dict()

for key in my_result:
    result_dictionary[key] = len(list(set(my_result[key]))) / len(train_data)
print(f"*** Proportion of classes in {len(train_data)} examples of train data ***")
print(json.dumps(result_dictionary, indent=4, sort_keys=True))

*** Proportion of classes in 177 examples of train data ***
{
    "1": 0.02824858757062147,
    "2": 0.01694915254237288,
    "3": 0.05649717514124294,
    "4": 0.14689265536723164,
    "5": 0.751412429378531
}
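
The same proportions can be computed more compactly with `collections.Counter`. This is an alternative sketch, not the notebook's code: `class_proportions` is a hypothetical helper and the pairs are made up. It counts every example, whereas the cells above count distinct sentences per class. The two agree as long as no sentence is repeated, which the deduplication in section 1 is meant to ensure.

```python
from collections import Counter

def class_proportions(pairs):
    """Return {label: share of examples} for a list of (sentence, label) pairs."""
    counts = Counter(label for _, label in pairs)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

toy_split = [("good soup", "5"), ("great cake", "5"), ("nice bread", "5"), ("too salty", "2")]
print(class_proportions(toy_split))
```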


In [20]:
test_data = unpack_data(test_data)

In [21]:
train_data = unpack_data(train_data)

In [22]:
# Attach the syntactic tag to every example in both splits
for item in train_data:
    item["tag"] = datasettag[item["sentence"]]
for item in test_data:
    item["tag"] = datasettag[item["sentence"]]

In [23]:
with open(f"{datadir}/{dsName}_traintest.json", "w", encoding="utf-8") as f:
        json.dump({"train_data": train_data, "test_data": test_data}, f, indent=1, ensure_ascii=False)

In [24]:
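def ObtainEmbeddings(train_data, test_data, key, path, embtype):
    # Adds a "sentence_vectorized" entry to every item in place (overwriting any
    # previous one) and writes both splits to {datadir}/{dsName}_{key}.json.
    vectorizer = Embeddings(path=path,embtype=embtype)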
        
    cnt = 0
    print(f"\n*** Getting vectors for {len(train_data)} examples of train data ***", end='\n')
    for item in train_data:
        item["sentence_vectorized"] = vectorizer.getEmbeddingVector(item["sentence"])
        cnt = cnt + 1
        if cnt % 50 == 0:
            print (str(cnt),end=' ')
                
    cnt = 0
    print(f"\n*** Getting vectors for {len(test_data)} examples of test data ***", end='\n')
    for item in test_data:
        item["sentence_vectorized"] = vectorizer.getEmbeddingVector(item["sentence"])
        cnt = cnt + 1
        if cnt % 50 == 0:
            print (str(cnt),end=' ')
        
    with open(f"{datadir}/{dsName}_{key}.json", "w", encoding="utf-8") as f:
        json.dump({"train_data": train_data, "test_data": test_data}, f, indent=2, ensure_ascii=False)

In [25]:
ObtainEmbeddings(train_data, test_data, 'FASTTEXT', 'cc.en.300.bin', 'fasttext')

cc.en.300.bin loaded!

*** Getting vectors for 177 examples of train data ***
50 100 150 
*** Getting vectors for 20 examples of test data ***


In [26]:
ObtainEmbeddings(train_data, test_data, 'all-mpnet-base', 'all-mpnet-base-v2', 'transformer')

all-mpnet-base-v2 loaded!

*** Getting vectors for 177 examples of train data ***
50 100 150 
*** Getting vectors for 20 examples of test data ***


In [27]:
ObtainEmbeddings(train_data, test_data, 'all-distilroberta', 'all-distilroberta-v1', 'transformer')

all-distilroberta-v1 loaded!

*** Getting vectors for 177 examples of train data ***
50 100 150 
*** Getting vectors for 20 examples of test data ***


In [28]:
ObtainEmbeddings(train_data, test_data, 'BERT_UNCASED', 'bert-base-uncased', 'bert')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


bert-base-uncased loaded!

*** Getting vectors for 177 examples of train data ***
50 100 150 
*** Getting vectors for 20 examples of test data ***


In [29]:
ObtainEmbeddings(train_data, test_data, 'BERT_CASED', 'bert-base-cased', 'bert')

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


bert-base-cased loaded!

*** Getting vectors for 177 examples of train data ***
50 100 150 
*** Getting vectors for 20 examples of test data ***
