# Data preprocessing: VESPA
The VESPA dataset is not a native language identification (NLI) benchmark, so it has to be preprocessed before it can be used for NLI. This Python notebook carries out the following steps:
1. Select only the texts whose author has exactly one of the five majority L1s (Dutch, French, Norwegian, Spanish, and Swedish), which filters out texts written by bilingual speakers or by speakers of other languages
2. Clean the texts by removing XML-tagged items and citations

Finally, we split the dataset into a training set and a test set: 10 randomly sampled texts per L1 form the test set (50 texts), and the remaining 647 texts form the training set.

Please download the VESPA dataset (https://corpora.uclouvain.be/catalog/en/corpus/vespa) before running this notebook. 

In [None]:
# Import libraries
import os
import pandas as pd
import re

In [4]:
# Define filepaths
dir_path = "./VESPA/texts" # change this to the directory that holds the texts of the full dataset

file_name = './data/VESPA/VESPA-subset.csv'
file_name_cleaned = './data/VESPA/VESPA-subset-cleaned.csv'
file_name_test = './data/VESPA/VESPA-test.csv'
file_name_train = './data/VESPA/VESPA-train.csv'

files = os.listdir(dir_path) # all file names in corpus
print(len(files))

941


## Only select files with one L1 out of 5 majority classes

In [18]:
# get all the files that only have one out of five L1s
metadata_fp = './VESPA/metadata.csv'
df = pd.read_csv(metadata_fp) # load dataset
native_langs = df['Native language']
filenums = df['Text ID']
# print(set(native_langs))
subset = []
subset_files = []
subset_l1 = []
l1_classes = ['Dutch', 'French', 'Norwegian', 'Spanish', 'Swedish'] 
l1_classes_mapping = {'Dutch': 'DUT', 
                     'French': 'FRE', 
                     'Norwegian': 'NOR', 
                     'Spanish': 'SPA', 
                     'Swedish': 'SWE'}
for filenum, l1 in zip(filenums, native_langs):
    if l1 in l1_classes: 
        subset_files.append(filenum)
        subset_l1.append(l1_classes_mapping[l1])
        with open(os.path.join(dir_path, f'{filenum}.txt'), 'r') as rf: # read from dir_path rather than a hard-coded copy of it
            text = rf.read()
            subset.append(text) 

<section> Abstract </section> English noun phrases (NP) which include degree modified adjectives show some interesting variation of the position of the indefinite article. A particularly salient pattern is displayed in <mentioned> This is anticipated to be more common a scenario than fleas spreading bubonic plague </mentioned> (BoE, BU-NX022521). The present paper is based on a study of utterances where this pattern was used even though a canonical word order would have been possible. Such constructs are referred to as the Optional Postposed Indefinite Article Noun Phrase (OPIANP) and have been collected from the British National Corpus (BNC) and Collins Word Banks Online : English Corpus (BoE). The central question is whether there is semantic motivation for this postposition of the indefinite article. The results suggest that there is such motivation, namely that the OPIANP could be an extension of a more frequent construction identified as the Postposed Indefinite Article Noun Phras

In [None]:
# combine the selection into one csv file with three columns: 'filename', 'text' and 'language'
selected_df = pd.DataFrame({'filename': subset_files, 
                           'text': subset, 
                           'language': subset_l1})

selected_df.head()
print(selected_df['text'][0])
selected_df.to_csv(file_name, encoding='utf-8', index=False)
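
The selection loop above can also be written with pandas filtering. The following is a minimal sketch on a toy metadata table, not the notebook's own method: the four rows are invented for illustration, and the `'Dutch; French'` entry is only an assumption about how a bilingual author might be recorded.

```python
import pandas as pd

# Toy metadata; these rows are invented for illustration only.
meta = pd.DataFrame({
    'Text ID': ['T001', 'T002', 'T003', 'T004'],
    'Native language': ['Dutch', 'German', 'Swedish', 'Dutch; French'],
})

l1_codes = {'Dutch': 'DUT', 'French': 'FRE', 'Norwegian': 'NOR',
            'Spanish': 'SPA', 'Swedish': 'SWE'}

# Keep rows whose native language is exactly one of the five classes;
# 'German' and the (hypothetical) multi-language entry are dropped.
kept = meta[meta['Native language'].isin(list(l1_codes))].copy()

# Map the full language name to its three-letter class label.
kept['language'] = kept['Native language'].map(l1_codes)
print(kept[['Text ID', 'language']])
```

Because the comparison is an exact string match, any metadata value that is not literally one of the five names is excluded, which is the same behaviour as the `if l1 in l1_classes` test in the loop.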

## Clean texts to remove tagged items and citations

In [37]:
# clean texts in subset to remove tagged items and citations
# df = pd.read_csv(file_name_cleaned)

texts = selected_df['text'].tolist()
cleaned_texts = []
for ind, text in enumerate(texts):
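    # remove tagged items (opening tag, content and closing tag), then any leftover closing tags
    text = re.sub(r'<[^>]+>.*?</[^>]+>', '', text)
    text = re.sub(r'</.*?>', '', text)
    
    # remove in-text citations by removing anything between parentheses
    # (note: this also removes parenthesised text that is not a citation)
    text = re.sub(r'\(.*?\)', '', text)
    
    # collapse the whitespace left behind by the removals and trim both ends
    text = re.sub(r'\s+', ' ', text).strip()
    cleaned_texts.append(text)
    if ind == 3:
        print(text) # preview one cleaned text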

selected_df.pop('text')
selected_df['text'] = cleaned_texts
selected_df.to_csv(file_name_cleaned, encoding='utf-8', index=False)

Multinational corporations operating in Sweden often use English as their official corporate language. The employees are expected to communicate using English both internally and with external business contacts. English used for communication between people with different mother tongues is commonly referred to as ELF, English as a Lingua Franca, and when used in business contexts it is referred to as Business English or BELF, Business English as a Lingua Franca. This study was conducted to explore how Business English is used in the pharmaceutical sector in Sweden and what elements of Business English are challenging or necessary for successful communication. In the study five informants were interviewed about their experiences. The study showed that the informants use Business English for all types of communication and are comfortable with English as a lingua franca yet often switch over to Swedish if there are only Swedish speakers present. It was also found that clear, somewhat simp
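
To make the effect of the cleaning regexes concrete, the sketch below applies the same four substitutions to two short strings. The helper name `clean_text` and both sample strings are hypothetical and exist only for this illustration.

```python
import re

def clean_text(text):
    """Apply the same substitutions as the cleaning cell, in the same order."""
    text = re.sub(r'<[^>]+>.*?</[^>]+>', '', text)  # tagged items, tags included
    text = re.sub(r'</.*?>', '', text)              # leftover closing tags
    text = re.sub(r'\(.*?\)', '', text)             # anything in parentheses
    return re.sub(r'\s+', ' ', text).strip()        # normalise whitespace

# Invented sample with a section tag, a tagged example and a citation.
sample = ('<section> Abstract </section> A pattern is shown in '
          '<mentioned> more common a scenario </mentioned> (BoE, 2005). It is rare.')
print(clean_text(sample))

# Invented sample in which a tagged item spans a line break.
multiline = '<quote> line one\nline two </quote> rest'
print(clean_text(multiline))
```

The first call shows that the content of tagged items is removed along with the tags, which can leave a gap before punctuation. The second call shows a possible caveat: `.` does not match a newline by default, so a tagged item that spans several lines is not removed and its opening tag stays in the text. Whether this occurs in the VESPA files is not established here; if it does, passing `flags=re.DOTALL` to the first substitution is one option to consider.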

## Take random sample to form test set

In [30]:
# make random sample of 10 samples per L1
df = pd.read_csv(file_name_cleaned)
df = df.groupby("language").sample(n=10, random_state=1) # test set randomly sampled
df.to_csv(file_name_test, encoding='utf-8', index=False) # output into csv file

# training set: every text whose filename is not in the test set
df_original = pd.read_csv(file_name_cleaned)
is_test = df_original['filename'].isin(df['filename'])
train_df = df_original.loc[~is_test, ['filename', 'text', 'language']].reset_index(drop=True)

train_df.to_csv(file_name_train, encoding='utf-8', index=False) # output training set

In [32]:
print(len(df)) 
print(len(train_df)) 

50
647
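
A quick sanity check for a split like this is to confirm that the two sets do not overlap and that the test set is balanced across classes. The sketch below does so on a toy table; the rows are invented, and `n=2` stands in for the `n=10` used in the notebook.

```python
import pandas as pd

# Toy stand-in for the cleaned subset: 3 classes x 4 texts (invented rows).
toy = pd.DataFrame({
    'filename': [f'T{i:03d}' for i in range(12)],
    'language': ['DUT', 'FRE', 'SWE'] * 4,
    'text': ['placeholder text'] * 12,
})

# Same sampling call as in the notebook, with n=2 instead of n=10.
test_df = toy.groupby('language').sample(n=2, random_state=1)
train_df = toy[~toy['filename'].isin(test_df['filename'])]

# No text may appear in both sets, and together they must cover everything.
assert set(test_df['filename']).isdisjoint(train_df['filename'])
assert len(test_df) + len(train_df) == len(toy)

# Each class should contribute the same number of test texts.
print(test_df['language'].value_counts().to_dict())
```

The same two assertions can be run on the real test and training frames; with the sizes printed above, they should account for all 697 texts in the cleaned subset.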
