# Plan
- Parsed the data from a file containing URLs.
- Extracted the data from the file and converted it into JSON format.
- Annotated the data using Label Studio.
- Downloaded the annotated data.
- Split the annotated data into training and development (dev) sets.
- Finally, trained models on the training data.
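
The extraction and JSON conversion steps above might look roughly like the sketch below. It assumes the page texts are already available as plain strings and that each task stores its text under `data.text`, which is the key the conversion function further down reads. The helper name `build_tasks` and the sample texts are hypothetical, not taken from the original project.

```python
import json


def build_tasks(texts):
    """Wrap each non-empty text in a {"data": {"text": ...}} task object."""
    return [{"data": {"text": t.strip()}} for t in texts if t and t.strip()]


# Hypothetical page texts; in the real project these come from the parsed URLs.
page_texts = [
    "Oak dining table, seats six. ",
    "",
    "Velvet armchair in dark green",
]

tasks = build_tasks(page_texts)

# Serialize the tasks; the resulting string can be written to a .json file for annotation.
tasks_json = json.dumps(tasks, ensure_ascii=False, indent=2)
print(tasks_json)
```

Empty strings are dropped so that no blank tasks reach the annotation step.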

Function for converting the Label Studio export to spaCy format and splitting it into training and dev sets

In [None]:
import spacy
from spacy.tokens import DocBin
import json
import random


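def convert_data_to_spacy_format(input_file, train_output_file, dev_output_file, split_ratio=0.8):
    """Convert a Label Studio JSON export into spaCy DocBin files for training and evaluation.

    split_ratio is the fraction of items written to train_output_file; the rest go to dev_output_file.
    """
    nlp = spacy.blank("en")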
    train_doc_bin = DocBin()
    dev_doc_bin = DocBin()

    with open(input_file, 'r', encoding='utf-8') as f:
        data = json.load(f)

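    # Shuffle before splitting so both sets are drawn from the same distribution
    # (call random.seed(...) first if a reproducible split is needed)
    random.shuffle(data)

    # Items with an index below split_index go to the training set
    split_index = int(len(data) * split_ratio)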

    for idx, item in enumerate(data):
        text = item['data']['text']
        entities = []

        for annotation in item['annotations']:
            for result in annotation['result']:
                # Only labeled span results carry 'start', 'end' and 'labels';
                # skip other result types instead of failing with a KeyError
                value = result.get('value', {})
                if 'labels' in value:
                    entities.append((value['start'], value['end'], value['labels'][0]))

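        doc = nlp.make_doc(text)
        ents = []

        for start, end, label in entities:
            # char_span returns None when the offsets do not align with token boundaries
            span = doc.char_span(start, end, label=label)
            if span is not None:
                ents.append(span)

        # doc.ents raises an error on overlapping spans, so drop duplicates and overlaps first
        doc.ents = spacy.util.filter_spans(ents)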

        if idx < split_index:
            train_doc_bin.add(doc)
        else:
            dev_doc_bin.add(doc)

    train_doc_bin.to_disk(train_output_file)
    dev_doc_bin.to_disk(dev_output_file)


convert_data_to_spacy_format("data/furniture_data.json", "train.spacy", "dev.spacy", split_ratio=0.8)
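
Because `doc.char_span` returns `None` for misaligned offsets and `doc.ents` does not accept overlapping spans, it can help to inspect the export before converting it. The following is a minimal standard-library sketch, assuming the export structure that the function above reads (`annotations`, then `result`, then `value` with `start`, `end` and `labels`). The sample item, the `PRODUCT` label and the helper names are hypothetical.

```python
# Hypothetical export item mirroring the keys the conversion function reads.
sample_export = [
    {
        "data": {"text": "Oak dining table with matching oak chairs"},
        "annotations": [
            {
                "result": [
                    {"value": {"start": 0, "end": 16, "labels": ["PRODUCT"]}},
                    {"value": {"start": 4, "end": 16, "labels": ["PRODUCT"]}},
                    {"value": {"start": 31, "end": 41, "labels": ["PRODUCT"]}},
                ]
            }
        ],
    }
]


def extract_entities(item):
    """Collect (start, end, label) tuples from one exported task."""
    entities = []
    for annotation in item.get("annotations", []):
        for result in annotation.get("result", []):
            value = result.get("value", {})
            if "labels" in value:
                entities.append((value["start"], value["end"], value["labels"][0]))
    return entities


def find_overlaps(entities):
    """Return pairs of spans that share at least one character."""
    ordered = sorted(entities)
    overlaps = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second[0] >= first[1]:
                break  # sorted by start, so later spans cannot overlap `first`
            overlaps.append((first, second))
    return overlaps


for item in sample_export:
    spans = extract_entities(item)
    for first, second in find_overlaps(spans):
        print("Overlap:", first, second)
```

In the sample item, the span for "Oak dining table" overlaps the span for "dining table", so one pair is reported.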

spaCy CPU-optimized pipeline with vectors = "en_core_web_md". This model is used for the site because it combines good performance and precision with a small model size.

In [3]:
! python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy -g 0

[i] Saving to output directory: output
[i] Using GPU: 0
[+] Initialized pipeline
[i] Pipeline: ['tok2vec', 'ner']
[i] Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     55.83    0.00    0.00    0.00    0.00
  0     200         51.27   1817.63   16.57   31.87   11.20    0.17
  1     400         13.36    967.30   45.63   61.44   36.29    0.46
  2     600         18.74    794.85   68.73   68.73   68.73    0.69
  3     800         25.11    739.95   71.31   75.98   67.18    0.71
  4    1000         36.70    685.57   72.07   80.48   65.25    0.72
  6    1200         38.41    601.47   74.33   73.76   74.90    0.74
  8    1400         65.36    586.20   78.03   83.33   73.36    0.78
 10    1600         64.48    447.28   73.87   83.82   66.02    0.74
 13    1800         94.72    450.



spaCy transformer pipeline based on "roberta-base". This model is too big for the site.

In [7]:
! python -m spacy train base_config_2.cfg --output ./output_2 --paths.train ./train.spacy --paths.dev ./dev.spacy -g 0

[i] Saving to output directory: output_1
[i] Using GPU: 0
[+] Initialized pipeline
[i] Pipeline: ['transformer', 'ner']
[i] Initial learn rate: 0.0
E    #       LOSS TRANS...  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  -------------  --------  ------  ------  ------  ------
  0       0          97.49     94.96    1.72    0.89   29.73    0.02
  2     200       17255.22  22057.62   62.38   52.51   76.83    0.62
  5     400        1194.21   2098.01   76.37   74.81   77.99    0.76
  8     600         636.04   1091.26   73.10   78.07   68.73    0.73
 10     800         451.60    692.18   76.06   76.06   76.06    0.76
 13    1000         311.35    500.91   79.22   79.69   78.76    0.79
 16    1200         215.97    361.57   78.14   82.13   74.52    0.78
 18    1400         190.80    322.11   79.19   83.05   75.68    0.79
 21    1600         153.94    260.92   78.34   76.47   80.31    0.78
 24    1800       

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Token indices sequence length is longer than the specified maximum sequence length for this model (561 > 512). Running this sequence through the model will result in indexing errors


spaCy transformer pipeline based on "roberta-base", second run with a different config (base_config_3.cfg). This model is too big for the site.

In [1]:
! python -m spacy train base_config_3.cfg --output ./output_3 --paths.train ./train.spacy --paths.dev ./dev.spacy -g 0

[+] Created output directory: output_2
[i] Saving to output directory: output_2
[i] Using GPU: 0
[+] Initialized pipeline
[i] Pipeline: ['transformer', 'ner']
[i] Initial learn rate: 0.0
E    #       LOSS TRANS...  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  -------------  --------  ------  ------  ------  ------
  0       0          34.00     60.32    1.74    0.90   28.96    0.02
  2     200        9058.36  20002.66   23.80   34.56   18.15    0.24
  5     400        1564.59   3047.22   68.67   77.29   61.78    0.69
  8     600         769.34   1657.35   76.48   81.30   72.20    0.76
 10     800         571.34   1142.62   75.49   77.33   73.75    0.75
 13    1000         451.80    869.98   77.97   76.10   79.92    0.78
 16    1200         351.81    668.13   75.64   77.64   73.75    0.76
 18    1400         282.50    530.48   78.14   78.29   77.99    0.78
 21    1600         217.37    418.4

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Token indices sequence length is longer than the specified maximum sequence length for this model (561 > 512). Running this sequence through the model will result in indexing errors


spaCy transformer pipeline based on "distilbert-base-uncased". This model offers the best balance of size and accuracy, but the additional libraries it needs at runtime are too large to deploy on the site.

In [1]:
! python -m spacy train base_config_4.cfg --output ./output_4 --paths.train ./train.spacy --paths.dev ./dev.spacy -g 0

[+] Created output directory: output_3
[i] Saving to output directory: output_3
[i] Using GPU: 0
[+] Initialized pipeline
[i] Pipeline: ['transformer', 'ner']
[i] Initial learn rate: 0.0
E    #       LOSS TRANS...  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  -------------  --------  ------  ------  ------  ------
  0       0          86.43    102.60    1.58    0.82   26.25    0.02
  2     200       19681.46  22923.91   64.69   78.89   54.83    0.65
  5     400        1123.88   1524.90   66.97   81.67   56.76    0.67
  8     600         450.18    620.73   77.27   83.11   72.20    0.77
 10     800         293.34    404.34   76.98   79.18   74.90    0.77
 13    1000         242.80    328.66   79.47   78.28   80.69    0.79
 16    1200         208.23    295.23   77.37   73.36   81.85    0.77
 18    1400         184.00    284.76   79.01   78.11   79.92    0.79
 21    1600         136.37    223.1

Token indices sequence length is longer than the specified maximum sequence length for this model (727 > 512). Running this sequence through the model will result in indexing errors
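
To compare the four runs, the best visible checkpoint of each can be picked by ENTS_F. The sketch below copies the (step, ENTS_F) pairs from the training tables above; rows that are cut off in the output are left out, so later checkpoints could change the picture. The dictionary keys and the helper name `best_checkpoint` are only illustrative.

```python
# (step, ENTS_F) pairs copied from the visible rows of each training log.
runs = {
    "config.cfg (tok2vec, en_core_web_md vectors)": [
        (0, 0.00), (200, 16.57), (400, 45.63), (600, 68.73), (800, 71.31),
        (1000, 72.07), (1200, 74.33), (1400, 78.03), (1600, 73.87),
    ],
    "base_config_2.cfg (roberta-base)": [
        (0, 1.72), (200, 62.38), (400, 76.37), (600, 73.10), (800, 76.06),
        (1000, 79.22), (1200, 78.14), (1400, 79.19), (1600, 78.34),
    ],
    "base_config_3.cfg (roberta-base)": [
        (0, 1.74), (200, 23.80), (400, 68.67), (600, 76.48), (800, 75.49),
        (1000, 77.97), (1200, 75.64), (1400, 78.14),
    ],
    "base_config_4.cfg (distilbert-base-uncased)": [
        (0, 1.58), (200, 64.69), (400, 66.97), (600, 77.27), (800, 76.98),
        (1000, 79.47), (1200, 77.37), (1400, 79.01),
    ],
}


def best_checkpoint(rows):
    """Return the (step, ents_f) pair with the highest ENTS_F."""
    return max(rows, key=lambda row: row[1])


for name, rows in runs.items():
    step, ents_f = best_checkpoint(rows)
    print(f"{name}: best ENTS_F {ents_f:.2f} at step {step}")
```

On these visible rows the CPU pipeline peaks at 78.03 and the best transformer run at 79.47, a gap of about 1.4 points.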


To improve the model's accuracy, we could use larger, more powerful transformer models, collect more training data, and move to a web hosting service that offers more compute and memory.