# BEST CATALAN DATASET

This script aims to create the best dataset for the Catalan language. The dataset is built by translating the `realnewslike`
variant of the well-known C4 dataset into Catalan, resulting in around 25 GB of news-article text. The motivation is the
scarcity of Catalan datasets in the NLP community and the need for a good dataset to train models in this language.

The dataset is downloaded in batches of 1 GB (configurable through the `max_size_gb` parameter) to avoid memory issues during the translation and training process.
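
The batching rule described above can be illustrated on its own, without any download. The sketch below is a minimal, hypothetical helper (`batch_by_bytes` is not part of the notebook) that groups texts until their combined UTF-8 size reaches a threshold. It uses a tiny threshold and toy strings so the behaviour is easy to follow.

```python
from typing import Iterable, Iterator, List


def batch_by_bytes(texts: Iterable[str], max_bytes: int) -> Iterator[List[str]]:
    """Yield lists of texts; each list is closed once its UTF-8 size reaches max_bytes.

    Hypothetical helper for illustration only: it mirrors the size check used
    by the loader in this notebook but works on any iterable of strings.
    """
    batch: List[str] = []
    size = 0
    for text in texts:
        batch.append(text)
        size += len(text.encode("utf-8"))  # count bytes, not characters
        if size >= max_bytes:
            yield batch
            batch, size = [], 0
    if batch:  # leftover texts that never reached the threshold
        yield batch


docs = ["aaaa", "bb", "cccc", "d"]
batches = list(batch_by_bytes(docs, max_bytes=5))
print(batches)  # [['aaaa', 'bb'], ['cccc', 'd']]
```

Note that a batch can exceed `max_bytes` by up to the size of its last text, which is why the resulting files are only approximately 1 GB each.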

In [1]:
import datasets

class C4NewsBatchLoader:
    def __init__(self, split='train'):
        """
        Initialize the C4NewsBatchLoader with a specific split.
        
        Parameters:
        split (str): The split of the dataset to load (train or validation; C4 has no test split).
        """
        self.dataset = datasets.load_dataset('allenai/c4', 'realnewslike', split=split, streaming=True)
        self.dataset_iter = iter(self.dataset)
        self.bytes_written = 0
        self.file_count = 1

    def _save_to_file(self, text, file_count):
        """
        Save text to a file.
        
        Parameters:
        text (str): The text to save.
        file_count (int): The current file count for naming the file.
        """
        file_name = f"../data/CA_realnewslike{file_count}.txt"
        print(f"Saving to {file_name}")
        with open(file_name, 'w', encoding='utf-8') as f:
            f.write(text)

    def split_to_files(self, max_size_gb=1):
        """
        Split the dataset into files of approximately max_size_gb GB each.
        
        Parameters:
        max_size_gb (int): The maximum size of each file in GB.
        """
        max_size_bytes = max_size_gb * 1024**3  # Convert GB to bytes
        current_text = []

        try:
            while True:
                example = next(self.dataset_iter)
                text = example['text'] + '\n'  # trailing newline keeps consecutive articles from running together
                current_text.append(text)
                self.bytes_written += len(text.encode('utf-8'))

                if self.bytes_written >= max_size_bytes:
                    self._save_to_file(''.join(current_text), self.file_count)
                    self.file_count += 1
                    current_text = []
                    self.bytes_written = 0

        except StopIteration:
            if current_text:
                self._save_to_file(''.join(current_text), self.file_count)

# Example usage:
batch_loader = C4NewsBatchLoader(split='train')
batch_loader.split_to_files(max_size_gb=1)

Resolving data files:   0%|          | 0/1024 [00:00<?, ?it/s]

Resolving data files:   0%|          | 0/512 [00:00<?, ?it/s]

Saving to ../CA_realnewslike1.txt
Saving to ../CA_realnewslike2.txt
Saving to ../CA_realnewslike3.txt
Saving to ../CA_realnewslike4.txt
Saving to ../CA_realnewslike5.txt
Saving to ../CA_realnewslike6.txt
Saving to ../CA_realnewslike7.txt
Saving to ../CA_realnewslike8.txt
Saving to ../CA_realnewslike9.txt
Saving to ../CA_realnewslike10.txt
Saving to ../CA_realnewslike11.txt


'(ReadTimeoutError("HTTPSConnectionPool(host='huggingface.co', port=443): Read timed out. (read timeout=10)"), '(Request ID: 32fc9d2e-4f47-4fd5-a81f-63b62b358d4d)')' thrown while requesting GET https://huggingface.co/datasets/allenai/c4/resolve/1588ec454efa1a09f29cd18ddd04fe05fc8653a2/realnewslike/c4-train.00176-of-00512.json.gz
Retrying in 1s [Retry 1/5].
'(ReadTimeoutError("HTTPSConnectionPool(host='cdn-lfs.huggingface.co', port=443): Read timed out. (read timeout=10)"), '(Request ID: 80724b84-fb89-4f53-9f1c-cdb92721fad5)')' thrown while requesting GET https://huggingface.co/datasets/allenai/c4/resolve/1588ec454efa1a09f29cd18ddd04fe05fc8653a2/realnewslike/c4-train.00176-of-00512.json.gz
Retrying in 2s [Retry 2/5].


Saving to ../CA_realnewslike12.txt
Saving to ../CA_realnewslike13.txt
Saving to ../CA_realnewslike14.txt
Saving to ../CA_realnewslike15.txt
Saving to ../CA_realnewslike16.txt
Saving to ../CA_realnewslike17.txt
Saving to ../CA_realnewslike18.txt
Saving to ../CA_realnewslike19.txt
Saving to ../CA_realnewslike20.txt
Saving to ../CA_realnewslike21.txt
Saving to ../CA_realnewslike22.txt
Saving to ../CA_realnewslike23.txt
Saving to ../CA_realnewslike24.txt
Saving to ../CA_realnewslike25.txt


KeyboardInterrupt: 
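
The loader above keeps a full batch in memory before writing it. One possible alternative, sketched here under assumptions, is to write each document to disk as it arrives and rotate to a new file once the size limit is reached. The function name `write_rotating_files`, the `prefix` default, and the choice to end each document with a newline are all hypothetical and not taken from the notebook. The example runs on toy strings in a temporary directory rather than on the real stream.

```python
import os
import tempfile
from typing import Iterable, List


def write_rotating_files(texts: Iterable[str], out_dir: str, max_bytes: int,
                         prefix: str = "CA_realnewslike") -> List[str]:
    """Write texts to numbered files, starting a new file once max_bytes is reached.

    Hypothetical sketch: only one document is held in memory at a time.
    Returns the list of file paths that were created.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths: List[str] = []
    handle = None
    written = 0
    try:
        for text in texts:
            if handle is None:
                # Open the next numbered file lazily, only when there is data for it.
                path = os.path.join(out_dir, f"{prefix}{len(paths) + 1}.txt")
                handle = open(path, "w", encoding="utf-8", newline="\n")
                paths.append(path)
                written = 0
            record = text + "\n"
            handle.write(record)
            written += len(record.encode("utf-8"))
            if written >= max_bytes:
                handle.close()
                handle = None
    finally:
        # Close the last file even if the input iterator raises or is interrupted.
        if handle is not None:
            handle.close()
    return paths


with tempfile.TemporaryDirectory() as tmp:
    files = write_rotating_files(["aaaa", "bb", "cccc", "d"], tmp, max_bytes=6)
    print([os.path.basename(p) for p in files])
    # ['CA_realnewslike1.txt', 'CA_realnewslike2.txt']
```

With the streaming dataset from this notebook, the `texts` argument could be a generator such as `(example['text'] for example in dataset)`. Because every document is flushed to the open file, an interruption like the one above loses at most the document being written rather than a whole in-memory batch.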