# Filtering "Article X" Entries from ArabAcquis to Reduce Noise and Enhance Data Quality
This Colab notebook presents a Python script that refines the ArabAcquis dataset by filtering out entries that serve mainly as structural references, such as "Article X". Such lines are rarely content-rich sentences, and they add noise and redundancy to the corpus. The script loads a JSON file containing the original English text, the reference Arabic text, and machine translations from the Google API and GPT. It then iterates over the entries and uses the regular expression `Article \d+` to check whether the 'english', 'arabic', 'googleAPI_translated', or 'GPT_translated' field starts with this pattern. If any of these four fields matches, the entire entry is excluded.

The primary purpose of this filtering is to reduce noise and redundancy. By removing these non-content-rich, structural markers, the script produces a cleaner dataset focused on more semantically meaningful sentences.

* **Input file:** ArabAcquis_translated_withGoogleAPIandGPT.json (JSON file containing English sentences, original Arabic references, and machine translations from both Google API and GPT)
* **Output file:** ArabAcquis_TranslatedAndfiltered.json (JSON file containing entries from the input file except those where 'english', 'arabic', 'googleAPI_translated', or 'GPT_translated' fields start with "Article X")
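The start-of-field check described above can be sketched as follows. This is a minimal illustration, assuming the four field names listed in this notebook; the helper name `starts_with_article` and the two sample entries are invented for demonstration and are not taken from the dataset. Because `re.match` only matches at the beginning of a string, a sentence that merely mentions an article further along is kept.

```python
import re

# Same pattern the notebook describes: the literal word "Article", a space, then digits.
ARTICLE_PATTERN = re.compile(r'Article \d+')
FIELDS = ('english', 'arabic', 'googleAPI_translated', 'GPT_translated')

def starts_with_article(entry):
    """Return True if any of the four fields begins with 'Article <number>'."""
    # Hypothetical helper name; re.match anchors at position 0 of each field.
    return any(ARTICLE_PATTERN.match(entry[field]) for field in FIELDS)

# Invented sample entries, for illustration only.
samples = [
    {'english': 'Article 1',
     'arabic': 'المادة 1',
     'googleAPI_translated': 'المادة 1',
     'GPT_translated': 'المادة 1'},
    {'english': 'This sentence refers to Article 3 in the middle.',
     'arabic': 'تشير هذه الجملة إلى المادة 3.',
     'googleAPI_translated': 'تشير هذه الجملة إلى المادة 3.',
     'GPT_translated': 'تشير هذه الجملة إلى المادة 3.'},
]

kept = [entry for entry in samples if not starts_with_article(entry)]
print(len(kept))  # 1: only the second entry survives
```

Note that the pattern is written in English, so in this sketch it is the 'english' field that triggers the exclusion; an Arabic field would only match if it literally began with the Latin-script text "Article" followed by a number.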

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
%cd /content/drive/MyDrive/ColabData/ArabAcquis Dataset

/content/drive/MyDrive/ColabData/ArabAcquis Dataset


In [None]:
import json
import re

def load_json(file_path):
    """Load data from a JSON file."""
    with open(file_path, 'r', encoding='utf-8') as file:
        data = json.load(file)
    return data

def save_json(data, file_path):
    """Save data to a JSON file."""
    with open(file_path, 'w', encoding='utf-8') as file:
        json.dump(data, file, ensure_ascii=False, indent=4)

def filter_articles(data):
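    """Drop entries where any of the four text fields starts with 'Article <number>'."""
    filtered_data = []
    # re.match anchors at the start of the string, so only a leading "Article N" is caught.
    article_pattern = re.compile(r'Article \d+')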

    for item in data:
        if not (article_pattern.match(item['english']) or
                article_pattern.match(item['arabic']) or
                article_pattern.match(item['googleAPI_translated']) or
                article_pattern.match(item['GPT_translated'])):
            filtered_data.append(item)

    return filtered_data

def main():
    input_file = 'ArabAcquis_translated_withGoogleAPIandGPT.json'  # The input JSON file
    output_file = 'ArabAcquis_TranslatedAndfiltered.json'  # The output file

    # Load the JSON data
    data = load_json(input_file)

    # Filter out 'Article X' entries
    filtered_data = filter_articles(data)

    # Save the filtered data
    save_json(filtered_data, output_file)
    print(f"Filtered data saved to {output_file}")

if __name__ == "__main__":
    main()


Filtered data saved to ArabAcquis_TranslatedAndfiltered.json


In [None]:
import json

with open('ArabAcquis_TranslatedAndfiltered.json', 'r', encoding='utf-8') as datafile:
  data = json.load(datafile)

len(data)

3582
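
As a sanity check on the filter, it can help to see which fields actually trigger an exclusion. The sketch below is an assumption-laden illustration rather than part of the original script: `count_matches_per_field` is a hypothetical helper, and the toy entries are invented. In practice it would be called on the list loaded from the input file.

```python
import re

ARTICLE_PATTERN = re.compile(r'Article \d+')
FIELDS = ('english', 'arabic', 'googleAPI_translated', 'GPT_translated')

def count_matches_per_field(entries):
    """Count, for each field, how many entries begin with 'Article <number>'."""
    counts = {field: 0 for field in FIELDS}
    for entry in entries:
        for field in FIELDS:
            # A missing field is treated as an empty string (an assumption of this sketch).
            if ARTICLE_PATTERN.match(entry.get(field, '')):
                counts[field] += 1
    return counts

# Invented toy entries, for illustration only.
toy_entries = [
    {'english': 'Article 12',
     'arabic': 'المادة 12',
     'googleAPI_translated': 'Article 12',  # pretend the translator left it untranslated
     'GPT_translated': 'المادة 12'},
    {'english': 'This sentence mentions Article 5 in the middle.',
     'arabic': 'تذكر هذه الجملة المادة 5.',
     'googleAPI_translated': 'تذكر هذه الجملة المادة 5.',
     'GPT_translated': 'تذكر هذه الجملة المادة 5.'},
]

counts = count_matches_per_field(toy_entries)
print(counts)
```

Comparing these per-field counts with the difference between the input and output lengths gives a quick indication of whether the filter removed what was intended.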

In [None]:
data[0:20]

[{'english': 'THE COUNCIL OF THE EUROPEAN ECONOMIC COMMUNITY,',
  'arabic': 'مجلس الجماعة الاقتصادية الأوروبية',
  'googleAPI_translated': 'إن مجلس الجماعة الاقتصادية الأوروبية،',
  'GPT_translated': 'مجلس الاتحاد الاقتصادي الأوروبي'},
 {'english': 'Whereas the adoption of a common transport policy involves inter alia laying down common rules for the international carriage of goods by road to or from the territory of a Member State or passing across the territory of one or more Member States;',
  'arabic': 'حيث أن اعتماد سياسة نقل مشتركة تنطوي من بين أمور أخرى على وضع قواعد مشتركة للنقل الدولي للبضائع عن طريق البر أو من أراضي دولة عضو أو المرور عبر أراضي واحدة من الدول الأعضاء أو أكثر؛',
  'googleAPI_translated': 'ولما كان اعتماد سياسة نقل مشتركة يتضمن، في جملة أمور، وضع قواعد مشتركة للنقل الدولي للبضائع عن طريق البر من وإلى أراضي دولة عضو أو المرور عبر أراضي دولة عضو واحدة أو أكثر؛',
  'GPT_translated': 'بينما ينطوي اعتماد سياسة نقل مشتركة على وضع قواعد مشتركة لنقل البضائع عبر الطرق ا