# Data preparation for classifier training

This notebook shows how to prepare the data for classifier training.

    Input: 
        Text file with delimiter-separated columns (comma-separated in this example) containing a label column and a text column.
    Output: 
        .json files (with train, test and development splits) with embeddings obtained from the different pre-trained embedding models:
            1) word-level fastText embeddings: model cc.en.300.bin
                (https://fasttext.cc/docs/en/crawl-vectors.html)
            2) sentence-level transformer embeddings: model all-mpnet-base-v2
                (https://www.sbert.net/docs/pretrained_models.html#model-overview)
            3) sentence-level transformer embeddings: model all-distilroberta-v1
                (https://www.sbert.net/docs/pretrained_models.html#model-overview)
            4) sentence-level BERT cased embeddings: model BERT-Base, Cased
                (https://github.com/google-research/bert#pre-trained-models)
            5) sentence-level BERT uncased embeddings: model BERT-Base, Uncased
                (https://github.com/google-research/bert#pre-trained-models)
                
        (.._filtered_train.tsv, .._filtered_test.tsv and .._filtered_dev.tsv files contain the splits without embedding vectors)
            
  

## 1. Filtering text examples by length and syntactical structure

*** First we tokenize the text examples. Then we parse them, ignoring lines that contain more than 6 tokens (those lines are excluded from further processing).

In [143]:
import sys
import os
import json
import csv
import re
import subprocess

In [144]:
from lambeq import SpacyTokeniser
from lambeq import BobcatParser

In [145]:
tokeniser = SpacyTokeniser()
bobcat_parser = BobcatParser()  # use BobcatParser(device=0) instead if a GPU is available

In [203]:
datadir = "../../data/datasets"
dsName='amazonreview_train'

Change the variables according to your dataset!
Specify whether your dataset has field names in the first row, and give the names of the classification field and the text field. Also specify the field delimiter symbol.

In [204]:
fieldnamesinfile=True
classfield="rating"
txtfield="review"
firstsentence=False #Try to process only the first sentence for the texts longer than 6 tokens
delimiter=','
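
When `firstsentence` is `True`, a text longer than six tokens is cut at the first sentence-ending punctuation mark and kept only if the remainder fits the limit. A minimal sketch of that idea follows. The helper names `first_sentence` and `fits` are hypothetical, and the sketch uses `re.match` rather than the `re.sub` call in the notebook, so a text without `.`, `!` or `?` is returned whole.

```python
import re

MAX_TOKENS = 6  # same limit as in the notebook


def first_sentence(tokenised: str) -> str:
    """Return the text before the first '.', '!' or '?' (hypothetical helper)."""
    match = re.match(r"[^.!?]+", tokenised)
    return match.group(0).strip() if match else ""


def fits(tokenised: str) -> bool:
    """True if the whitespace-separated token count is within the limit."""
    return len(tokenised.split()) <= MAX_TOKENS


text = "I love this phone . It works great every single day !"
short = first_sentence(text)
print(short, fits(text), fits(short))
```

Here the full text has twelve tokens and would be skipped, while its first sentence has four and would be kept.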

In [6]:
input_file = f"{datadir}/{dsName}.csv"
output_file = f"{datadir}/{dsName}_alltrees.tsv"

In [None]:
with open(input_file, encoding="utf8", newline='') as csvfile, open(output_file, "w", encoding="utf8") as tsvfile:
        if fieldnamesinfile != False: 
            news_reader = csv.DictReader(csvfile, delimiter=delimiter, quotechar='"')
        else:
            news_reader = csv.DictReader(csvfile, delimiter=delimiter, fieldnames = [classfield, txtfield], quotechar='"')        
        processed_summaries = set()
        norm_process_params = ["perl", "../../data/data_processing/scripts/normalize-punctuation.perl","-b","-l", "en"]
        norm_process = subprocess.Popen(norm_process_params, stdin=subprocess.PIPE, stdout=subprocess.PIPE, close_fds=True)
        for row in news_reader:
            score = row[classfield]
            if score == '0':
                continue
            summary = row[txtfield].replace("\n"," ").replace("\t"," ").replace("\r"," ")
            summary = re.sub(r'@[^\s]+ ', '', summary)  # drop @mentions
            norm_process.stdin.write(summary.encode('utf-8'))
            norm_process.stdin.write('\n'.encode('utf-8'))
            norm_process.stdin.flush()
            norm_summary = norm_process.stdout.readline().decode("utf-8").rstrip()
            tok_summary = " ".join(tokeniser.tokenise_sentences([norm_summary])[0])

            if len(tok_summary.split())>6:
                if firstsentence==True:
                    tok_summary = re.sub(r'^([^.!?]+).+', r'\1', tok_summary)  # keep only the first sentence
                    if len(tok_summary.split())>6:
                        continue
                else:
                    continue
            if tok_summary in processed_summaries:
                continue
            sent_type = ''
            processed_summaries.add(tok_summary)
            try:
                sent_type = 's'
                result = bobcat_parser.sentence2tree(tok_summary).to_json()
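                # Reduce the JSON tree to a compact bracketed string of types:
                # drop the 'rule' and 'text' entries, then the remaining keys and punctuation
                str1 = re.sub(r'\'(rule|text)\':\s[\"\'][^\s]+[\"\']', '', str(result))
                str1 = re.sub(r'\'(type|children)\':\s+', '', str1)
                str1 = re.sub(r'[\{\},\']', '', str1)
                str1 = re.sub(r'\s+([\[\]])', r'\1', str1)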
                sent_type = str1
            except Exception:  # parsing failed: keep the example with an empty structure
                sent_type = ''
            print("{0}\t{1}\t{2}".format(score, tok_summary,sent_type),file=tsvfile)

The output is a filtered 3-column file containing the class, the text and the syntactic structure of the text.
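
The length and duplicate checks in the loop above can be isolated from the parsing step. The following is a minimal sketch of just those two checks, assuming the texts are already tokenised and space-joined; the function name `filter_examples` is hypothetical.

```python
from typing import Iterable, Iterator


def filter_examples(tokenised_texts: Iterable[str], max_tokens: int = 6) -> Iterator[str]:
    """Yield each text once, skipping those with more than max_tokens tokens."""
    seen = set()
    for text in tokenised_texts:
        if len(text.split()) > max_tokens:
            continue  # too long to be parsed
        if text in seen:
            continue  # exact duplicate of an earlier example
        seen.add(text)
        yield text


examples = [
    "great product",
    "great product",
    "this one is far too long to be kept here",
    "not worth the money",
]
kept = list(filter_examples(examples))
print(kept)
```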

*** Next we choose text examples according to a set of prespecified syntactic structures.

In [205]:
input_file = f"{datadir}/{dsName}_alltrees.tsv"
output_file = f"{datadir}/{dsName}_filtered.tsv"
tags_file = f"{datadir}/validtrees.txt"

In [206]:
with open(tags_file, encoding="utf8") as tfile:
    filterlist = tfile.read().splitlines()
print(filterlist)

['n[(n/n)   n[(n/n)   n]]', 'n[(n/n)   n[(n/n)   n[(n/n)   n]]]', 'n[n[n[(n/n)   n]] (n\\\\n)[((n\\\\n)/n)   n[n[(n/n)   n]]]]', 'n[(n/n)[((n/n)/(n/n))   (n/n)] n]', 'n[n[n[(n/n)   n]] (n\\\\n)[((n\\\\n)/n)   n[(n/n)   n]]]']


In [207]:
  
with open(input_file, "r", encoding="utf8") as ifile, open(output_file, "w", encoding="utf8") as ofile:
    tsv_reader = csv.DictReader(ifile, fieldnames=['Class','Txt','Tag'], delimiter="\t", quotechar='"')
    for item in tsv_reader:
        if item['Tag'] in filterlist:
            print("{0}\t{1}\t{2}".format(item['Class'],item['Txt'],item['Tag']),file=ofile)

The output is a 3-column file, filtered by syntax, containing the class, the text and the syntactic structure of the text.
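
The same filter can be tried on in-memory data. The sketch below is an illustration under a few assumptions: it keeps the valid structures in a `set` for constant-time lookup, reads with `csv.QUOTE_NONE` so that quote characters inside a review are treated as ordinary text, and uses a hypothetical function name `filter_by_tag`. The second sample row and its tag are invented for the example.

```python
import csv
import io

# One structure taken from the printed filter list; a set gives fast membership tests
valid_tags = {"n[(n/n)   n[(n/n)   n]]"}

rows_in = (
    "2\tvery good phone\tn[(n/n)   n[(n/n)   n]]\n"
    "1\tit broke\ts[n (s\\n)]\n"  # invented row with a tag that is not in the set
)


def filter_by_tag(src, dst, tags):
    """Copy the 3-column rows whose third column is in tags; return how many were kept."""
    reader = csv.reader(src, delimiter="\t", quoting=csv.QUOTE_NONE)
    kept = 0
    for row in reader:
        if len(row) == 3 and row[2] in tags:
            dst.write("\t".join(row) + "\n")
            kept += 1
    return kept


src = io.StringIO(rows_in)
dst = io.StringIO()
n_kept = filter_by_tag(src, dst, valid_tags)
print(n_kept)
print(dst.getvalue(), end="")
```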

## 2. Splitting examples into train, test and development sets and acquiring embedding vectors

In [208]:
from sklearn.model_selection import train_test_split
from typing import List, Tuple, Dict
sys.path.append("../../data/data_processing/data_vectorisation/")
from Embeddings import Embeddings
from collections import defaultdict

In [209]:
def unpack_data(data: List[Tuple[str, str]]) -> List[Dict[str, str]]:
    return [{
        "sentence": sentence,
        "class": sentence_type,
    } for sentence, sentence_type in data]

In [210]:
input_file = f"{datadir}/{dsName}_filtered.tsv"

In [211]:
dataset: List[Tuple[str, str]] = []
datasettag={}

In [212]:
with open(input_file, "r", encoding="utf-8") as f:
    for line in f:
        cols=line.split('\t')
        if len(cols) == 3:
            sent = cols[1].rstrip()
            dataset.append((sent, cols[0].rstrip()))
            datasettag[sent] = cols[2].rstrip()


In [213]:
classes = [item[1] for item in dataset]
classes

['2', '2', '2', '1', '2', '1', '2', '1', '2', '1', ...]

In [214]:
train_data, tmp_data = train_test_split(dataset, train_size=0.8, random_state=1, stratify=classes)

In [215]:
classes = [item[1] for item in tmp_data]

In [216]:
print(classes)

['2', '2', '1', '2', '2', '2', '2', '1', '2', '2', '2', '2', '2', '2', '2', '2', '2', '1', '1', '2', '1', '1', '1', '2', '2', '1', '1', '2', '2', '2', '1', '2', '2', '2', '2', '2', '1', '2', '2', '2', '2', '2', '2', '2', '1', '1', '1', '2', '2', '1', '2', '2', '2', '2', '1', '1', '2', '1', '1', '2', '1', '2', '2', '2', '1', '2', '2', '2', '1', '2', '2', '2', '1', '2', '2', '1', '1', '2', '2', '1', '2', '1', '2', '1', '1', '2', '2', '2', '2', '2', '2', '2', '2', '2', '1', '2', '2', '2', '2', '2', '2', '2', '2', '2', '2', '2', '2', '2', '2', '2', '1', '1', '1', '2', '2', '2', '2', '2', '2', '2', '2', '2', '1', '1', '1', '1', '2', '2', '2', '2', '2', '2', '2', '2', '2', '1', '2', '2', '1', '2', '2', '1', '1', '2', '1', '1', '2', '2', '2', '2', '2', '2', '1', '2', '2', '2', '1', '2', '1', '2', '2', '2', '1', '2', '2', '1', '2', '1', '2', '2', '2', '1', '1', '2', '2', '2', '2', '2', '2', '2', '2', '2', '1', '2', '2', '2', '2', '1', '1', '1', '1', '2', '1', '2', '1', '2', '2', '2', '2', '2',

In [217]:
test_data, dev_data = train_test_split(tmp_data, train_size=0.5, random_state=1, stratify=classes)
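
The two calls above give an 80/10/10 split: 80% goes to train, and the remaining 20% is halved into test and development sets, each time preserving class proportions. The sketch below illustrates that arithmetic with the standard library only. It is not scikit-learn's algorithm, the function name `stratified_split` is hypothetical, and the data is synthetic.

```python
import random
from collections import defaultdict


def stratified_split(items, labels, first_fraction, seed=1):
    """Split items into two lists so that each label keeps roughly the same share."""
    by_label = defaultdict(list)
    for item, label in zip(items, labels):
        by_label[label].append(item)
    rng = random.Random(seed)
    first, second = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        cut = round(len(group) * first_fraction)
        first.extend(group[:cut])
        second.extend(group[cut:])
    return first, second


# Synthetic data: 100 examples of class "1" and 200 of class "2"
data = [(f"sentence {i}", "1" if i % 3 == 0 else "2") for i in range(300)]

train, rest = stratified_split(data, [c for _, c in data], 0.8)
test, dev = stratified_split(rest, [c for _, c in rest], 0.5)
print(len(train), len(test), len(dev))
```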

In [218]:
my_result = defaultdict(list)
for element in test_data:
    my_result[element[1]].append(element[0])

my_result = dict(my_result)
result_dictionary = dict()

for key in my_result:
    result_dictionary[key] = len(list(set(my_result[key]))) / len(test_data)
print(f"*** Proportion of classes in {len(test_data)} examples of test data ***")
print(json.dumps(result_dictionary, indent=4, sort_keys=True))

*** Proportion of classes in 20435 examples of test data ***
{
    "1": 0.3457793002202104,
    "2": 0.6542206997797896
}


In [219]:
my_result = defaultdict(list)
for element in train_data:
    my_result[element[1]].append(element[0])

my_result = dict(my_result)
result_dictionary = dict()

for key in my_result:
    result_dictionary[key] = len(list(set(my_result[key]))) / len(train_data)
print(f"*** Proportion of classes in {len(train_data)} examples of train data ***")
print(json.dumps(result_dictionary, indent=4, sort_keys=True))

*** Proportion of classes in 163483 examples of train data ***
{
    "1": 0.34184594116819483,
    "2": 0.6581540588318051
}
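
The class proportions can also be computed with `collections.Counter`. This is a sketch of an alternative, not the cells above: it counts every example, whereas the notebook counts unique sentences per class, so the two should agree when no sentence is repeated. The function name `class_proportions` and the sample data are hypothetical.

```python
from collections import Counter


def class_proportions(examples):
    """Map each class label to its share of the (sentence, label) examples."""
    counts = Counter(label for _, label in examples)
    total = sum(counts.values())
    return {label: count / total for label, count in sorted(counts.items())}


sample = [
    ("good phone", "2"),
    ("bad case", "1"),
    ("nice screen", "2"),
    ("love it", "2"),
]
proportions = class_proportions(sample)
print(proportions)
```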


In [220]:
test_data = unpack_data(test_data)

In [221]:
train_data = unpack_data(train_data)

In [222]:
dev_data = unpack_data(dev_data)

In [223]:
with open(f"{datadir}/{dsName}_filtered_train.tsv", "w", encoding="utf-8") as f:
    for item in train_data:
        item["tag"] = datasettag[item["sentence"]]
        f.write(f'{item["class"]}\t{item["sentence"]}\t{item["tag"]}\n')

In [224]:
with open(f"{datadir}/{dsName}_filtered_test.tsv", "w", encoding="utf-8") as f:
    for item in test_data:
        item["tag"] = datasettag[item["sentence"]]
        f.write(f'{item["class"]}\t{item["sentence"]}\t{item["tag"]}\n')

In [225]:
with open(f"{datadir}/{dsName}_filtered_dev.tsv", "w", encoding="utf-8") as f:
    for item in dev_data:
        item["tag"] = datasettag[item["sentence"]]
        f.write(f'{item["class"]}\t{item["sentence"]}\t{item["tag"]}\n')

In [226]:
def ObtainEmbeddings(train_data, test_data, dev_data, key, path, embtype):
    vectorizer = Embeddings(path=path,embtype=embtype)
        
    cnt = 0
    print(f"\n*** Getting vectors for {len(train_data)} examples of train data ***", end='\n')
    for item in train_data:
        item["sentence_vectorized"] = vectorizer.getEmbeddingVector(item["sentence"])
        cnt = cnt + 1
        if cnt % 50 == 0:
            print (str(cnt),end=' ')
                
    cnt = 0
    print(f"\n*** Getting vectors for {len(test_data)} examples of test data ***", end='\n')
    for item in test_data:
        item["sentence_vectorized"] = vectorizer.getEmbeddingVector(item["sentence"])
        cnt = cnt + 1
        if cnt % 50 == 0:
            print (str(cnt),end=' ')
            
    cnt = 0
    print(f"\n*** Getting vectors for {len(dev_data)} examples of development data ***", end='\n')
    for item in dev_data:
        item["sentence_vectorized"] = vectorizer.getEmbeddingVector(item["sentence"])
        cnt = cnt + 1
        if cnt % 50 == 0:
            print (str(cnt),end=' ')
        
    with open(f"{datadir}/{dsName}_{key}.json", "w", encoding="utf-8") as f:
        json.dump({"train_data": train_data, "test_data": test_data, "dev_data": dev_data}, f, indent=2, ensure_ascii=False)
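
The three loops in `ObtainEmbeddings` do the same work on different splits, so they could be folded into one loop over a dictionary of splits. The sketch below shows that shape under stated assumptions: `DummyVectorizer` is a stand-in for the notebook's `Embeddings` class and returns a toy two-number vector, and `add_vectors` is a hypothetical helper. As in the notebook, the items are modified in place.

```python
import json
from typing import Dict, List


class DummyVectorizer:
    """Stand-in for the notebook's Embeddings class (hypothetical)."""

    def getEmbeddingVector(self, sentence: str) -> List[float]:
        # Toy "embedding": character count and token count
        return [float(len(sentence)), float(len(sentence.split()))]


def add_vectors(splits: Dict[str, List[dict]], vectorizer) -> None:
    """Attach a vector to every item of every split, in place."""
    for items in splits.values():
        for item in items:
            item["sentence_vectorized"] = vectorizer.getEmbeddingVector(item["sentence"])


splits = {
    "train_data": [{"sentence": "good phone", "class": "2"}],
    "test_data": [{"sentence": "bad case", "class": "1"}],
    "dev_data": [{"sentence": "love it", "class": "2"}],
}
add_vectors(splits, DummyVectorizer())

# Same top-level keys as the .json files written by the notebook
serialised = json.dumps(splits, indent=2, ensure_ascii=False)
print(serialised)
```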

In [227]:
ObtainEmbeddings(train_data, test_data, dev_data, 'FASTTEXT', 'cc.en.300.bin', 'fasttext')

cc.en.300.bin loaded!

*** Getting vectors for 163483 examples of train data ***



In [228]:
ObtainEmbeddings(train_data, test_data, dev_data, 'all-mpnet-base', 'all-mpnet-base-v2', 'transformer')

all-mpnet-base-v2 loaded!

*** Getting vectors for 163483 examples of train data ***



In [229]:
ObtainEmbeddings(train_data, test_data, dev_data, 'all-distilroberta', 'all-distilroberta-v1', 'transformer')

all-distilroberta-v1 loaded!

*** Getting vectors for 163483 examples of train data ***



In [230]:
ObtainEmbeddings(train_data, test_data, dev_data, 'BERT_UNCASED', 'bert-base-uncased', 'bert')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


bert-base-uncased loaded!

*** Getting vectors for 163483 examples of train data ***



In [231]:
ObtainEmbeddings(train_data, test_data, dev_data, 'BERT_CASED', 'bert-base-cased', 'bert')

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


bert-base-cased loaded!

*** Getting vectors for 163483 examples of train data ***

