# Preparing the Track 2 Submissions for the Consistency Metric

The implementation of the consistency metric was tailored to the Track 1 format, in which texts are split into small chunks (one or several sentences), each paired with a small dictionary containing only the terms that occur in that chunk. Track 2 data come in a different format (long passages with document-level dictionaries), so we convert the Track 2 submissions to the Track 1 format.

We do this with the `DocPreprocessor` module, which takes a submission file, splits the source and translated chapters into paragraphs (by default on the `\n\n` delimiter), checks that the paragraphs are aligned, and then assigns each paragraph the relevant subset of the global dictionary.
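The three steps can be illustrated with a minimal sketch. It is not the `DocPreprocessor` code: the function names `split_paragraphs` and `align_and_assign_terms`, the row layout, and the toy data are assumptions made for illustration, and term lookup is a naive case-insensitive substring match, whereas the actual module may match terms differently (for example on lemmas).

```python
def split_paragraphs(text, delimiter="\n\n"):
    """Split a chapter into non-empty, stripped paragraphs."""
    return [p.strip() for p in text.split(delimiter) if p.strip()]


def align_and_assign_terms(source, translation, global_terms, delimiter="\n\n"):
    """Pair paragraphs by position and attach the terms found in each source paragraph."""
    src_pars = split_paragraphs(source, delimiter)
    tgt_pars = split_paragraphs(translation, delimiter)
    # Alignment check: both sides must yield the same number of paragraphs.
    if len(src_pars) != len(tgt_pars):
        raise ValueError(
            f"paragraph count mismatch: {len(src_pars)} source vs {len(tgt_pars)} translated"
        )
    rows = []
    for src, tgt in zip(src_pars, tgt_pars):
        # Keep only the dictionary entries whose source term occurs in this paragraph
        # (hypothetical matching rule: case-insensitive substring search).
        local_terms = {s: t for s, t in global_terms.items() if s.lower() in src.lower()}
        rows.append({"source": src, "translation": tgt, "terms": local_terms})
    return rows


# Toy example (invented data, not taken from the shared task).
source = "The neural network was trained.\n\nThe loss function converged."
translation = "神经网络已训练。\n\n损失函数收敛了。"
global_terms = {"neural network": "神经网络", "loss function": "损失函数"}
rows = align_and_assign_terms(source, translation, global_terms)
```

Each resulting row then resembles a Track 1 item: a short source chunk, its translation, and a dictionary restricted to the terms that occur in that chunk.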

In [2]:
from docpreprocessor import DocPreprocessor
import os



In [3]:
dp = DocPreprocessor()

2025-09-22 11:37:09 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json: 434kB [00:00, 48.9MB/s]
2025-09-22 11:37:11 INFO: Downloaded file to /Users/ksemen/stanza_resources/resources.json
2025-09-22 11:37:11 INFO: Loading these models for language: en (English):
| Processor | Package           |
---------------------------------
| tokenize  | combined          |
| mwt       | combined          |
| lemma     | combined_nocharlm |

2025-09-22 11:37:11 INFO: Using device: cpu
2025-09-22 11:37:11 INFO: Loading: tokenize
2025-09-22 11:37:11 INFO: Loading: mwt
2025-09-22 11:37:11 INFO: Loading: lemma
2025-09-22 11:37:11 INFO: Done loading processors!


## Main loop

The function `make_preprocessing_round` takes a folder containing all Track 2 submissions, transforms them, and saves the results to the `track2_aligned` folder.

**NB**: although the input files are JSONL, the preprocessor writes its output as TSV!
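To make the format change concrete, here is a minimal standard-library sketch of a JSONL-to-TSV conversion. The helper `jsonl_to_tsv` and the column names `source` and `translation` are assumptions for illustration only; the columns that `DocPreprocessor` actually writes are not shown in this notebook.

```python
import csv
import io
import json


def jsonl_to_tsv(jsonl_text, columns):
    """Convert JSONL text (one JSON object per line) into TSV text."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=columns, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        # Missing keys become empty cells; keys not listed in `columns` are dropped.
        writer.writerow({col: record.get(col, "") for col in columns})
    return out.getvalue()


# Toy records (invented data).
jsonl_text = (
    '{"source": "Hello", "translation": "你好"}\n'
    '{"source": "World", "translation": "世界"}\n'
)
tsv_text = jsonl_to_tsv(jsonl_text, ["source", "translation"])

# Reading the result back requires the tab delimiter.
read_back = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
```

Downstream code that expects JSONL will therefore need a tab-separated reader, such as `csv.DictReader(..., delimiter="\t")`, for the preprocessed files.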

In [4]:
def make_preprocessing_round(folder):
    # Initialize the preprocessing module
    dp = DocPreprocessor()
    # Take all en-zh JSONL submissions from the folder
    files = [f for f in os.listdir(folder) if f.endswith('.jsonl') and 'enzh' in f]
    # List for files that fail to process
    # (stays empty while the try/except below is commented out)
    error_files = []
    for file in files:
        print(f'parsing {file}')
        # The NMT baselines (MADLAD, NLLB) were run in 'noterm' mode:
        # make sure their dictionaries are imported from the 'proper' mode
        local_proper_terms = 'MADLAD' in file or 'NLLB' in file

        # File names follow the pattern system.year.pair.mode.jsonl
        system, year, pair, mode, _ = file.split('.')
        # The zh-en direction has one-to-many dictionary entries; ignore them
        clear_1tomany = pair == 'zhen'
        print(f'processing {file}...')
        # try:
        dp.load(file)
        dp.split()
        dp.retrieve_terms(clear_1tomany=clear_1tomany, local_proper_terms=local_proper_terms)
        print(dp.stats())
        dp.save()
        # except:
        #     error_files.append(file)
        print('=================================')
    return error_files

In [5]:
make_preprocessing_round('data/submissions/track2/')

2025-09-22 11:37:16 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json: 434kB [00:00, 45.2MB/s]
2025-09-22 11:37:16 INFO: Downloaded file to /Users/ksemen/stanza_resources/resources.json
2025-09-22 11:37:16 INFO: Loading these models for language: en (English):
| Processor | Package           |
---------------------------------
| tokenize  | combined          |
| mwt       | combined          |
| lemma     | combined_nocharlm |

2025-09-22 11:37:16 INFO: Using device: cpu
2025-09-22 11:37:16 INFO: Loading: tokenize
2025-09-22 11:37:16 INFO: Loading: mwt
2025-09-22 11:37:16 INFO: Loading: lemma
2025-09-22 11:37:16 INFO: Done loading processors!


parsing organizers_gpt-4_1-nano.2021.enzh.random.jsonl
processing organizers_gpt-4_1-nano.2021.enzh.random.jsonl...
organizers_gpt-4_1-nano.2021.enzh.random.jsonl: mean±std: 0.8433517568653389±0.1003823369340355; 7 cases with score < 0.5
None
df shape before saving: (540, 8)
df shape after saving: (430, 8)
parsing organizers_gpt-4_1-nano.2015.enzh.random.jsonl
processing organizers_gpt-4_1-nano.2015.enzh.random.jsonl...
organizers_gpt-4_1-nano.2015.enzh.random.jsonl: mean±std: 0.7888031019272317±0.21642896148336624; 40 cases with score < 0.5
None
df shape before saving: (343, 8)
df shape after saving: (285, 8)
parsing organizers_MADLAD.2019.enzh.noterm.jsonl
processing organizers_MADLAD.2019.enzh.noterm.jsonl...
organizers_MADLAD.2019.enzh.noterm.jsonl: mean±std: 0.8451003675260359±0.10440058627972772; 6 cases with score < 0.5
None
df shape before saving: (443, 8)
df shape after saving: (352, 8)
parsing organizers_gpt-4_1-nano.2015.enzh.proper.jsonl
processing organizers_gpt-4_1-nano.2

[]
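The file names in the log follow the `system.year.pair.mode.jsonl` pattern that `make_preprocessing_round` unpacks with `file.split('.')`. The sketch below isolates that step and the two flags derived from the name. The `SubmissionConfig` tuple and the `parse_submission_name` helper are hypothetical names introduced here, and the explicit check on the number of parts is an added assumption: the original loop would instead raise an unpacking error on a name with extra or missing dots.

```python
from typing import NamedTuple


class SubmissionConfig(NamedTuple):
    """Configuration derived from a submission file name (hypothetical container)."""
    system: str
    year: str
    pair: str
    mode: str
    clear_1tomany: bool
    local_proper_terms: bool


def parse_submission_name(filename):
    """Parse 'system.year.pair.mode.jsonl' and derive the preprocessing flags."""
    parts = filename.split(".")
    if len(parts) != 5 or parts[-1] != "jsonl":
        raise ValueError(f"unexpected submission file name: {filename!r}")
    system, year, pair, mode, _ = parts
    return SubmissionConfig(
        system=system,
        year=year,
        pair=pair,
        mode=mode,
        # One-to-many dictionary entries are ignored in the zh-en direction.
        clear_1tomany=(pair == "zhen"),
        # NMT baselines take their dictionaries from the 'proper' mode.
        local_proper_terms=("MADLAD" in filename or "NLLB" in filename),
    )


config = parse_submission_name("organizers_MADLAD.2019.enzh.noterm.jsonl")
```

Because the split is on every dot, the system name itself must not contain one; this is presumably why the model name in the log is written as `gpt-4_1-nano`.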