# Appendix A: Multilingual Dataset of Type-Annotated Question Pairs (QTyTP)


This appendix reports on the collection and processing of a multilingual question pair dataset developed for research on cross-lingual semantic similarity and the representation of interrogative intent in pretrained language models. The dataset contains parallel question pairs in five languages: Afrikaans, Arabic, English, Indonesian, and Marathi. Questions were extracted from the NLLB corpus (Schwenk et al. 2020; Tiedemann 2012) using language-specific patterns and aligned based on their translations. English sentences were annotated with linguistic features capturing information type (e.g., modality, quantification, cleft) and question type (e.g., polar, wh-question, alternative, conditional) using rule-based approaches. Preprocessing steps, including filtering, deduplication, and language balancing, were applied to ensure data quality and composition.

The result is a dataset containing ~100k question pairs suited to investigating the relationships between language-specific features, semantic similarity, and the encoding of questions in multilingual models. The collection and processing pipeline described here is customisable to some extent and allows for varying sizes and dataset properties. Note, however, that the collection script requires local access to the NLLB alignment files and monolingual corpora, which are large and can be difficult to handle efficiently. Overall, this resource aims to support the development and evaluation of cross-lingual approaches to sentence processing tasks and to focus on the linguistic factors shaping the representation of interrogative semantics.

In the final step, the annotated dataset is encoded with scikit-learn's `MultiLabelBinarizer` (MLB). Because each of the two feature columns stores a list of labels per question, the MLB transformer converts each list into a binary indicator vector with one column per label. This kind of multi-label encoding is a common way to annotate data samples that can carry several features of different kinds. For more information, consult the documentation (https://scikit-learn.org/1.5/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html#sklearn.preprocessing.MultiLabelBinarizer).


## Data collection

The `QuestionDetector` class uses regular expressions to detect questions across five languages: English, Marathi, Arabic, Indonesian, and Afrikaans. The class initializes with language-specific pattern lists that contain regular expressions for three main types of question indicators: WH-question words (like "who," "what," "where" in English and their translations in other languages), auxiliary verb patterns for yes/no questions, and question marks (including the Arabic question mark '؟'), plus the sentence-final particles का and काय for Marathi. For every language except Marathi, a sentence that matches any pattern is kept only if it also ends in a question mark.

In [1]:
import json
import re


"""
This script defines the patterns for data collection from NLLB source, target, and alignment file.
It essentially searches both the source language and target language monolingual corpora files by line index, and appends them to a new json object holding original index for cross reference and the pair of sentence.

The patterns were collected with the help of Claude 3.5 Sonnet, by giving it prompts to recognize the question words in a select sample of sentences from the data.

"""

class QuestionDetector:
    def __init__(self):

        # English question patterns
        self.en_patterns = [
            r'^(Who|What|Where|When|Why|How|Which|Whose|Whom)\b',  # WH question key words
            r'^(Do|Does|Did|Is|Are|Was|Were|Have|Has|Had|Can|Could|Should|Would|Will)\b',  # polar questions
            r'\?$'  # Question mark at end
        ]
        
        # Marathi question patterns
        self.mr_patterns = [
            r'^(कोण|काय|कुठे|केव्हा|का|कसे|कोणता|कोणाचे|कोणाला)\b',  # WH questions
            r'\?$',  # Question mark
            r'(का|काय)$'  # Question particles at end
        ]
        
        # Arabic question patterns
        self.ar_patterns = [
            r'^(من|ما|ماذا|أين|متى|لماذا|كيف|أي|لمن|هل)\b',  # WH questions and هل
            r'\?$',  # Question mark
            r'؟$'    # Arabic question mark
        ]
        
        # Indonesian question patterns
        self.id_patterns = [
            r'^(Siapa|Apa|Dimana|Kapan|Mengapa|Bagaimana|Yang mana|Kepada siapa)\b',  # WH questions
            r'^Apakah\b',  # Yes/No questions
            r'\?$'  # Question mark
        ]
        
        # Afrikaans question patterns
        self.af_patterns = [
            r'^(Wie|Wat|Waar|Wanneer|Hoekom|Hoe|Watter|Aan wie)\b',  # WH questions
            r'^(Is|Het|Sal|Kan|Moet|Wil|Mag)\b',  # Verb-initial questions
            r'\?$'  # Question mark
        ]
        
        # Map language codes to their patterns
        self.lang_patterns = {
            'en': self.en_patterns,
            'mr': self.mr_patterns,
            'ar': self.ar_patterns,
            'id': self.id_patterns,
            'af': self.af_patterns
        }


    def is_question(self, text, language):
        """Check if a sentence from the source file is a question in the specified language."""

        if not text or language not in self.lang_patterns:  # data validation
            return False

        patterns = self.lang_patterns[language]  # regex patterns for this language

        # Try each pattern in turn; the first match decides the outcome
        for pattern in patterns:
            if re.search(pattern, text):
                # Additional validation: outside Marathi, a match only counts
                # if the sentence also ends with a question mark
                if language != 'mr':  # Marathi may end in a question particle instead
                    return text.rstrip().endswith(('?', '؟'))
                return True
        return False


def extract_question_pairs(source_file: str, target_file: str, language: str, limit: int = 2000000):

    """
    Extracts question-translation pairs from line-aligned source and target language files.
    The limit controls the size of the dataset, i.e. the number of lines in the monolingual files to search through.

    Takes:
        source_file: Path to the source language file.
        target_file: Path to the target language file (English).
        language: The language code of the source file.
        limit: The maximum number of lines to read from each file.

    Returns:
        A dict mapping the line index of each detected question in the source file
        to a dict holding the source sentence and its target translation.
    """

    detector = QuestionDetector()
    question_pairs = {} # empty dict for storing pairs and indices

    try:
        with open(source_file, 'r', encoding='utf-8') as sf, open(target_file, 'r', encoding='utf-8') as tf:
            # Stream both files in parallel instead of loading them fully into memory
            for i, (source_line, target_line) in enumerate(zip(sf, tf)):
                if i >= limit:  # stop after `limit` lines
                    break
                source_text = source_line.strip()
                if detector.is_question(source_text, language):  # keep only detected questions
                    # Store source and translation under the original line index
                    question_pairs[i] = {"source": source_text, "target": target_line.strip()}



    except FileNotFoundError: # Error handling for misplaced files
        print(f"Error: {source_file} or {target_file} not found.")
        return {}
    except Exception as e:
        print(f"An error occurred: {e}")
        return {}

    return question_pairs



def main():
    """
    Main function to process language files and extract question pairs. If you're working with other language pairs, change the file names in the language_pairs dict.
    """ 
    
    language_pairs = {
        'af': 'NLLB.af-en',
        'ar': 'NLLB.ar-en',
        'id': 'NLLB.en-id',
        'mr': 'NLLB.en-mr'
    }

    all_question_pairs = {} # empty dict to compile all language specific results


    for lang, file_base in language_pairs.items(): # loop through your language files, make sure the path is correct for your corpus files
        source_file = f"{file_base}.{lang}"
        target_file = f"{file_base}.en"  # English target
        output_file = f"question_pairs_{lang}.json"

        print(f"Processing {lang.upper()}...")
        try:
            question_pairs = extract_question_pairs(source_file, target_file, lang) # call the main collection function, takes three inputs, saves output to question_pairs
            all_question_pairs[lang] = question_pairs

            with open(output_file, 'w', encoding='utf-8') as outfile: # creates new output file  
                json.dump(question_pairs, outfile, ensure_ascii=False, indent=2)

            print(f"Saved {len(question_pairs)} question-translation pairs to {output_file}")

        except Exception as e:
            print(f"Error processing {lang}: {str(e)}")


    combined_output_file = "all_question_pairs.json"
    with open(combined_output_file, 'w', encoding='utf-8') as outfile:
        json.dump(all_question_pairs, outfile, ensure_ascii=False, indent=2)

    print(f"\nSaved all question-translation pairs to {combined_output_file}")

if __name__ == "__main__":
    main()

Processing AF...
Saved 78069 question-translation pairs to question_pairs_af.json
Processing AR...
Saved 41697 question-translation pairs to question_pairs_ar.json
Processing ID...
Saved 104867 question-translation pairs to question_pairs_id.json
Processing MR...
Saved 189036 question-translation pairs to question_pairs_mr.json

Saved all question-translation pairs to all_question_pairs.json
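
The decision rule in `is_question` can be tried out on a few sentences without the corpus files. The following is a minimal sketch, assuming a reduced pattern set: `PATTERNS` and `looks_like_question` are hypothetical names introduced here, and the patterns are only a small subset of those in `QuestionDetector`.

```python
import re

# Hypothetical, reduced subset of the QuestionDetector patterns (illustration only)
PATTERNS = {
    "en": [r"^(Who|What|Where|When|Why|How)\b", r"\?$"],
    "ar": [r"^(من|ما|هل)\b", r"\?$", r"؟$"],
    "mr": [r"\?$", r"(का|काय)$"],
}


def looks_like_question(text, language):
    """Mirror the rule of is_question: a pattern must match, and outside
    Marathi the sentence must also end with a question mark."""
    text = text.strip()
    if not text or language not in PATTERNS:
        return False
    if not any(re.search(p, text) for p in PATTERNS[language]):
        return False
    if language != "mr":
        return text.endswith(("?", "؟"))
    return True


print(looks_like_question("What time is it", "en"))  # WH word but no question mark
print(looks_like_question("It is late?", "en"))      # question mark only
print(looks_like_question("هل أنت بخير؟", "ar"))      # Arabic question mark
print(looks_like_question("तू येणार का", "mr"))       # Marathi particle, no question mark
```

In this sketch, only the first English sentence is rejected: a WH word at the start is not enough outside Marathi, whereas a final question mark alone is sufficient.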


## Data framing

This module creates a DataFrame from the JSON dictionaries of questions. The `create_dataframe_from_json` function reads the JSON files of question pairs for each language (Afrikaans, Arabic, Indonesian, and Marathi) and converts them into a single pandas DataFrame. For each pair, it stores the original line index, the source question, and the target translation, tags the row with the source language, and adds two empty lists as placeholders for the two feature types.


In [2]:
import json
import pandas as pd

def create_dataframe_from_json(json_filepaths, languages):
    """
    Creates a pandas DataFrame from JSON files of question pairs.
    Adds a language tag and two placeholder columns, feature1 and feature2 (initialized as empty lists).

    Takes:
        json_filepaths: A list of filepaths to the JSON files.
        languages: A list of language codes corresponding to the JSON files.
    """

    data = []  # one dict per question pair

    # Iterate over each language file and append its pairs to the data list
    for filepath, lang in zip(json_filepaths, languages):
        with open(filepath, 'r', encoding='utf-8') as f:
            question_pairs = json.load(f)
        for idx, pair in question_pairs.items():
            # The dict below shows the structure of our dataset
            data.append({
                'index': idx,  # line index in the original corpus file
                'source': pair.get('source', ''),
                'target': pair.get('target', ''),
                'language': lang,  # tag from ['af', 'ar', 'id', 'mr']
                'feature1': [],  # placeholder for information type
                'feature2': []   # placeholder for question type
            })

    df = pd.DataFrame(data)
    return df


def main():

    filepaths = [
        "question_pairs_af.json",  # make sure paths are correct
        "question_pairs_ar.json",
        "question_pairs_id.json",
        "question_pairs_mr.json",
    ]
    languages = ['af', 'ar', 'id', 'mr']



    df = create_dataframe_from_json(filepaths, languages)

        
    if df is not None:

        print(df.head())  # Print first few rows for inspection

        
        df.to_csv("nllb_all_questionpairs.csv", index=False, encoding='utf-8') # save to csv format for easy loading into llm
        print("DataFrame saved to nllb_all_questionpairs.csv")

if __name__ == "__main__":
    main()



  index                                             source  \
0    37  Hoekom het jy nie meer tyd om my te sien nie?'...   
1    54  Wat is die grootste openbaring van geloof en h...   
2    75  In 2011 die wêreld-ekonomie sal groei ten kost...   
3    89        "Wat as dit iets ernstigs is, selfs kanker?   
4    92  Die National Youth Leadership Forum: moet u gaan?   

                                              target language feature1  \
0  Why don't you have more time to see me?" and "...       af       []   
1  What is the greatest revelation of faith, and ...       af       []   
2  In 2011 the world economy will grow at the exp...       af       []   
3      "What if it's something serious; even cancer?       af       []   
4  The National Youth Leadership Forum: Should Yo...       af       []   

  feature2  
0       []  
1       []  
2       []  
3       []  
4       []  
DataFrame saved to nllb_all_questionpairs.csv
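
To make the record structure concrete without the full JSON files, the sketch below flattens a tiny, made-up dictionary of pairs into row dicts. The sample sentences and the `to_records` helper are hypothetical, and converting the index back to an integer is an addition of this sketch rather than part of the cell above.

```python
import json

# Hypothetical miniature of question_pairs_af.json (the sentences are made up)
pairs = {
    37: {"source": "Hoekom?", "target": "Why?"},
    54: {"source": "Wat is dit?", "target": "What is it?"},
}

# JSON object keys are always strings, so the integer line indices
# come back as "37" and "54" after a dump/load round trip
loaded = json.loads(json.dumps(pairs, ensure_ascii=False))


def to_records(question_pairs, lang):
    """Flatten {index: {source, target}} into one dict per row."""
    return [
        {
            "index": int(idx),  # restore the numeric line index
            "source": pair.get("source", ""),
            "target": pair.get("target", ""),
            "language": lang,
            "feature1": [],  # a fresh placeholder list for every row
            "feature2": [],
        }
        for idx, pair in question_pairs.items()
    ]


records = to_records(loaded, "af")
print(list(loaded.keys()))
print(records[0])
```

A list of such row dicts is the shape that `pd.DataFrame(data)` receives in `create_dataframe_from_json`.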


## Annotation rules
This section annotates the DataFrame of question pairs with linguistic features. It defines a function `annotate_features` that searches the English target sentences for patterns of two kinds: feature1 identifies the "Information Type" present in the text (modality, quantification, comparison, negation, cleft sentences), while feature2 classifies the "Question Type" (polar, wh-question, alternative, conditional). A sentence can receive several information types but at most one question type, because the question-type rules are checked in a fixed order and the first match wins. The features are extracted with regular expressions; for the polar check, the sentence is first passed word by word through the Porter stemmer, and the pattern is matched against the stemmed sentence.

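Examples of the intended annotations (the rule-based code below does not reproduce every label; for instance, example 2 begins with a modal and is therefore labelled polar rather than alternative):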

1. Isn't the majority of them broken? [quantification] [polar]
2. Would you rather have pizza or sushi? [modality] [alternative]
3. Which one is more expensive? [comparison] [wh-question]

In [3]:
import pandas as pd
import re
import nltk
from nltk.stem import PorterStemmer


def annotate_features(df):

    """Annotates the DataFrame with information type and question type features."""

    # Information Type (feature1)
    """ searches for words that are indicative of the kind of propositions that are the focus of the sentence"""

    def get_information_type(text):

        text = text.lower()  # Lowercase for case-insensitivity
        features = []

        # Modality
        if re.search(r"\b(can|could|should|would|will|may|might|must)\b", text):
            features.append("modality")

        # Quantification
        if re.search(r"\b(how much|how many|some|all|any|few|many|several|most|none)\b", text):
            features.append("quantification")

        # Comparison
        if re.search(r"\b(more|less|better|worse|bigger|smaller|than|as|equal|similar|different)\b", text):
            features.append("comparison")

        #cleft sentences
        cleft_pattern_wh = r"it'?s?\b.*\b(that|who|which|where|when|why|how)\b"  # Note the '?' after 'it' and 's'
        if re.search(cleft_pattern_wh, text, re.IGNORECASE):
            features.append("cleft")

        # Negation
        if re.search(r"(\bnot|n't|\bno|\bnever|\bnobody|\bnothing|\bnowhere|\bneither|\bnor)\b", text):
            features.append("negation")

        return features

    # Question Type (feature2)

    def get_question_type(text):
        text = text.lower()  # Lowercase
        stemmer = PorterStemmer()

        text_stemmed = " ".join([stemmer.stem(word) for word in text.split()])
        
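        # Polar: the stemmed sentence starts with an auxiliary or modal
        # (stems such as "doe", "wa", and "ha" stand for does, was, and has)
        if re.search(r"^(?:do|doe|did|is|are|wa|does|has|were|have|ha|had|can|could|should|would|will|mai|might|must)\b", text_stemmed):
            return ["polar"]
        # WH: a question word anywhere in the sentence; there is no word boundary
        # on the left, so words ending in one (e.g. "somehow") also match
        elif re.search(r"(who|what|which|where|when|why|how|whose|whom)\b", text):
            return ["wh-question"]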
        elif re.search(r"\bor\b", text):
            return ["alternative"]
        elif re.search(r"\bif\b", text):
            return ["conditional"]
        return []
    

    df['feature1'] = df['target'].apply(get_information_type) # update the empty columns with the results of running type checkers
    df['feature2'] = df['target'].apply(get_question_type)

    return df


def main():  # main function loads the csv file, annotates it and saves to output

    input_csv = "nllb_all_questionpairs.csv" # check the output of the framing cell
    output_csv = "nllb_annotated_questionpairs.csv" # note down the filepath for filtering step later

    try: # check correct loading from CSV
        # The placeholder feature columns are overwritten by annotate_features,
        # so they do not need to be parsed (or eval'ed) on load
        df = pd.read_csv(input_csv)
        df = annotate_features(df)
        print(df.head()) # Print the first few examples
        df.to_csv(output_csv, index=False, encoding='utf-8')
        print(f"Annotated data saved to {output_csv}")

    except FileNotFoundError:
        print(f"Error: {input_csv} not found.")
    except Exception as e:
        print(f"An error occurred: {e}")


if __name__ == "__main__":
    main()

   index                                             source  \
0     37  Hoekom het jy nie meer tyd om my te sien nie?'...   
1     54  Wat is die grootste openbaring van geloof en h...   
2     75  In 2011 die wêreld-ekonomie sal groei ten kost...   
3     89        "Wat as dit iets ernstigs is, selfs kanker?   
4     92  Die National Youth Leadership Forum: moet u gaan?   

                                              target language  \
0  Why don't you have more time to see me?" and "...       af   
1  What is the greatest revelation of faith, and ...       af   
2  In 2011 the world economy will grow at the exp...       af   
3      "What if it's something serious; even cancer?       af   
4  The National Youth Leadership Forum: Should Yo...       af   

                                           feature1       feature2  
0  [modality, quantification, comparison, negation]  [wh-question]  
1                                                []  [wh-question]  
2                      
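
Because the question-type rules are checked in a fixed order, a sentence that satisfies several rules receives only the first matching label. The sketch below illustrates this ordering under simplifying assumptions: it skips stemming, uses a shorter keyword list, and puts word boundaries on both sides of the WH words, so `RULES` and `question_type` are hypothetical stand-ins rather than the `get_question_type` function itself.

```python
import re

# Simplified stand-in for get_question_type: no stemming, fewer keywords
RULES = [
    ("polar", r"^(do|does|did|is|are|was|were|can|could|should|would|will)\b"),
    ("wh-question", r"\b(who|what|which|where|when|why|how)\b"),
    ("alternative", r"\bor\b"),
    ("conditional", r"\bif\b"),
]


def question_type(text):
    """Return the label of the first rule that matches, or an empty list."""
    text = text.lower()
    for label, pattern in RULES:
        if re.search(pattern, text):
            return [label]  # first matching rule wins
    return []


for question in [
    "Would you rather have pizza or sushi?",
    "Which one is more expensive?",
    "Tea or coffee?",
    "You are coming?",
]:
    print(question, question_type(question))
```

Under these rules the first sentence is labelled polar even though it contains "or", and the last one receives no label at all, which is the kind of row the final filter removes.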

## Final filter
The cell below reads the file 'nllb_annotated_questionpairs.csv', converts the feature1 and feature2 columns back into lists, filters out rows with an empty list in either column, and saves the filtered data to a new file ('qtytp-all.csv').


In [7]:
import ast
import pandas as pd

def filter_feature_annotations(input_file, output_file):
    """
    Filters a CSV file of question pairs, keeping only rows where both features are present.
    Rows with an empty annotation in either feature column are removed.

    Takes two files:
        input_file: Path to the input CSV file
        output_file: Path to the output CSV file
    """
    try:
        df = pd.read_csv(input_file)
    except FileNotFoundError:
        print(f"Error: File '{input_file}' not found.")
        return
    except pd.errors.EmptyDataError:
        print("Error: Input CSV file is empty.")
        return
    except Exception as e:
        print(f"An error occurred while reading the CSV file: {e}")
        return

    # The CSV stores each feature list as a string. Convert the cells back to lists and treat
    # missing values ('nan') and empty lists as []. The NaN handling follows this Stack Overflow post:
    # https://stackoverflow.com/questions/44061607/pandas-lambda-function-with-nan-support
    def parse_list(value):
        text = str(value)
        # ast.literal_eval only accepts Python literals, unlike eval, which executes arbitrary code
        return ast.literal_eval(text) if text not in ('nan', '[]') else []

    df['feature1'] = df['feature1'].apply(parse_list)
    df['feature2'] = df['feature2'].apply(parse_list)

    # filter for rows with non-empty features, this is an interesting control factor to play around with
    df_filtered = df[
        (df['feature1'].apply(len) > 0) & 
        (df['feature2'].apply(len) > 0)
    ].copy()

   
    try:
        df_filtered.to_csv(output_file, index=False)
        print(f"\nFiltered data saved to {output_file}")
    except Exception as e:
        print(f"Error writing to output file: {e}")

# change file paths to your needs

if __name__ == "__main__":
    input_csv = 'nllb_annotated_questionpairs.csv'
    output_csv = 'qtytp-all.csv'
    filter_feature_annotations(input_csv, output_csv)



Filtered data saved to qtytp-all.csv
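
The reason the feature columns need converting is that a list written to CSV is stored as its string representation. The sketch below shows the round trip with the standard `csv` module as a stand-in for `DataFrame.to_csv` and `pd.read_csv`; the sample rows and the `parse_list` helper are hypothetical.

```python
import ast
import csv
import io

# Hypothetical annotated rows; the second one has no annotations
rows = [
    {"target": "Why can't you stay?", "feature1": ["modality", "negation"], "feature2": ["wh-question"]},
    {"target": "You are coming?", "feature1": [], "feature2": []},
]

# Write to an in-memory CSV; each list is stored as its string representation
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["target", "feature1", "feature2"])
writer.writeheader()
writer.writerows(rows)

buffer.seek(0)
loaded = list(csv.DictReader(buffer))

raw = loaded[0]["feature1"]
print(type(raw).__name__, raw)  # the list has become a string
print(list(raw)[:3])            # iterating over it yields characters


def parse_list(cell):
    """Parse a stringified list safely; treat empty cells as an empty list."""
    return ast.literal_eval(cell) if cell else []


parsed = [
    {**row, "feature1": parse_list(row["feature1"]), "feature2": parse_list(row["feature2"])}
    for row in loaded
]

# Keep only rows where both feature lists are non-empty
kept = [row for row in parsed if row["feature1"] and row["feature2"]]
print(len(kept), kept[0]["feature1"])
```

Any later step that reads the CSV again has to repeat this parsing, because the saved file once more contains strings rather than lists.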


## Dataset statistics


In [8]:
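import ast
import pandas as pd

def analyze_question_pairs(csv_file):
    """
    Analyzes question pairs from a CSV file.

    Takes csv_file: path to the CSV file.

    Returns a dictionary containing the analysis results. Returns None if there's an error.
    """

    try: # reads the filtered questions file, takes output of previous cell block
        # Parse the stringified feature lists back into Python lists; without this,
        # iterating over a cell yields single characters instead of feature labels
        df = pd.read_csv(csv_file, converters={'feature1': ast.literal_eval, 'feature2': ast.literal_eval})
    except Exception as e:
        print(f"An error occurred while reading the CSV file: {e}")
        return None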


    results = {}
    languages = df['language'].unique() # list of languages [id_1, id_2, id_3, ..]

    for lang in languages:
        lang_df = df[df['language'] == lang]
        results[lang] = {
            'pairs_count': len(lang_df),
            'fully_annotated': len(lang_df[(lang_df['feature1'].apply(len) > 0) & (lang_df['feature2'].apply(len) > 0)])}
        
    overall = {
        'pairs_count': len(df),
        'fully_annotated': len(df[(df['feature1'].apply(len) > 0) & (df['feature2'].apply(len) > 0)])}
    
    feature_counts = {}
    for col in ['feature1', 'feature2']:
        for _, row in df.iterrows():
            for feature in row[col]:
                # Create the entry on first sight, then update the per-language and total counts
                counts = feature_counts.setdefault(feature, {'total': 0, 'per_language': {}})
                counts['per_language'][row['language']] = counts['per_language'].get(row['language'], 0) + 1
                counts['total'] += 1
    
    results['overall'] = overall
    results['feature_counts'] = feature_counts
    return results




analysis_results = analyze_question_pairs('qtytp-all.csv')
if analysis_results:
    print(analysis_results)

{'af': {'pairs_count': 23881, 'fully_annotated': 23881},
 'ar': {'pairs_count': 19699, 'fully_annotated': 19699},
 'id': {'pairs_count': 50584, 'fully_annotated': 50584},
 'mr': {'pairs_count': 41352, 'fully_annotated': 41352},
 'overall': {'pairs_count': 135516, 'fully_annotated': 135516},
 'feature_counts': ...

The `feature_counts` part of the recorded output is cut off. Its keys are single characters such as `[`, `'`, `m`, and `o` rather than feature labels, which shows that the feature columns were still stored as strings when this output was produced, so those counts are not meaningful as feature statistics.
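
Once the feature columns hold real lists, the per-feature counts can be tallied directly. The sketch below is one possible way to do this with `collections.Counter`; the three sample rows are hypothetical and only illustrate the shape of the result.

```python
from collections import Counter, defaultdict

# Hypothetical rows with already-parsed feature lists
rows = [
    {"language": "af", "feature1": ["modality", "negation"], "feature2": ["wh-question"]},
    {"language": "af", "feature1": ["negation"], "feature2": ["polar"]},
    {"language": "mr", "feature1": ["modality"], "feature2": ["polar"]},
]

total = Counter()                     # label -> count over all languages
per_language = defaultdict(Counter)   # language -> (label -> count)

for row in rows:
    labels = row["feature1"] + row["feature2"]
    total.update(labels)
    per_language[row["language"]].update(labels)

print(dict(total))
print({lang: dict(counts) for lang, counts in per_language.items()})
```

A `Counter` returns 0 for labels it has not seen, which avoids the explicit "create the entry first" bookkeeping of a nested dict.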

## Multi Label Binary Feature Encoding

This step takes the feature columns, which contain lists of feature names, and transforms them into a binary matrix with `MultiLabelBinarizer` (MLB). Each new column holds a binary value for the presence of one label. The encoding is wrapped in the `prepare_multilabel_encoding` function defined below. For more information about MLB and its `fit_transform` method, see the documentation [https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html#sklearn.preprocessing.MultiLabelBinarizer.fit_transform]

In [4]:
import ast
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer


def prepare_multilabel_encoding(df):
    """
    Prepares multi-label encodings for the 'feature1' and 'feature2' columns in our dataset.

    Takes the final filtered df with 'feature1' and 'feature2' columns containing lists of strings.

    Returns a pandas df with additional binary columns, one per label (the multi-label encoding).
    """
    # Features loaded from CSV are string representations of lists; convert them to actual lists if needed
    for col in ['feature1', 'feature2']:
        if isinstance(df[col].iloc[0], str):
            df[col] = df[col].apply(ast.literal_eval)  # parses literals only, unlike eval


    # Feature 1 MLB encoding
    mlb_feature1 = MultiLabelBinarizer() # create an instance of the MLB class
    encoded_feature1 = mlb_feature1.fit_transform(df['feature1']) # fit transform
    encoded_feature1_df = pd.DataFrame(encoded_feature1, columns=[f"f1_{label}" for label in mlb_feature1.classes_])
    df = pd.concat([df.reset_index(drop=True), encoded_feature1_df.reset_index(drop=True)], axis=1)

    # repeat steps above
    mlb_feature2 = MultiLabelBinarizer()
    encoded_feature2 = mlb_feature2.fit_transform(df['feature2'])
    encoded_feature2_df = pd.DataFrame(encoded_feature2, columns=[f"f2_{label}" for label in mlb_feature2.classes_])
    df = pd.concat([df.reset_index(drop=True), encoded_feature2_df.reset_index(drop=True)], axis=1)

    return df

if __name__ == '__main__':
   
    data = pd.read_csv("qtytp-all.csv")

    # apply the multi-label encoding to a copy of the data
    encoded_data = prepare_multilabel_encoding(data.copy()) 

    # save the encoded data to a new CSV file
    output_csv_path = "qtytp-all-encoded.csv" # change this if you want to save it somewhere else
    encoded_data.to_csv(output_csv_path, index=False, encoding='utf-8')
    print(f"Encoded data saved to: {output_csv_path}")


    print(encoded_data.head())

Encoded data saved to: qtytp-all-encoded.csv
   index                                             source  \
0     37  Hoekom het jy nie meer tyd om my te sien nie?'...   
1    141  En met wie het die Mond van JaHWeH gespreek, d...   
2    168  Weet u ooit hoekom ons nie 'n Duitsland kan we...   
3    243  Dink jy dit sou so suksesvol gewees het as Piz...   
4    281  Hoekom kan jy nie net oor hulle loop nie, weet...   

                                              target language  \
0  Why don't you have more time to see me?" and "...       af   
1  Who is he to whom the mouth of Yahweh has spok...       af   
2        Do you ever know why we can't be a Germany?       af   
3  Do you think it would have been as successful ...       af   
4  Why can't you just walk over them, you know, l...       af   

                                           feature1       feature2  f1_cleft  \
0  [modality, quantification, comparison, negation]  [wh-question]         0   
1                        
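
To make the encoded columns concrete, the sketch below reproduces the idea of multi-label binarization in plain Python on three made-up rows. It is an illustration of the output format, not the scikit-learn implementation; `binarize` is a hypothetical helper, and the alphabetical ordering of the label columns is chosen to mirror the `f1_cleft` column appearing first in the output above.

```python
def binarize(label_lists, prefix):
    """Return (column_names, matrix) with one 0/1 column per distinct label."""
    classes = sorted({label for labels in label_lists for label in labels})
    columns = [f"{prefix}_{label}" for label in classes]
    matrix = [[int(label in labels) for label in classes] for labels in label_lists]
    return columns, matrix


# Hypothetical feature1 annotations for three questions
feature1 = [["modality", "negation"], ["cleft"], []]

columns, matrix = binarize(feature1, "f1")
print(columns)
for row in matrix:
    print(row)
```

Each row of the matrix lines up with one question, and a row of zeros corresponds to a question with no label in that feature column.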

## References

Schwenk, Holger, Guillaume Wenzek, Sergey Edunov, Edouard Grave, and Armand Joulin. 2020. "CCMatrix: Mining Billions of High-Quality Parallel Sentences on the Web." arXiv preprint arXiv:1911.04944.

Tiedemann, Jörg. 2012. "Parallel Data, Tools and Interfaces in OPUS." In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), 2214–18. Istanbul, Turkey: European Language Resources Association (ELRA). https://opus.nlpl.eu/NLLB/corpus/version/NLLB