In [3]:
import os
import json
import cv2
import PIL
import pickle
import pandas as pd

### Merge for M^2E

Link to dataset [$M^2E$](https://www.modelscope.cn/datasets/Wente47/M2E/files).

To get the `jsonl` files and the dataset, open `Files and versions` and download the metadata and data files.

Alternatively, according to their README, the dataset can be downloaded with `MsDataset` (I didn't try this, so it may not work):

```python
from modelscope.msdatasets import MsDataset
ds = MsDataset.load('Wente47/M2E', subset_name='default', split='train')
print(ds[0])
```
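
Before running the merge below, it can help to peek at a few records of a downloaded `jsonl` file. This is a minimal sketch, assuming each line is one JSON object with the `name` and `tex` keys that `merge_m2e` reads; `peek_jsonl` is a hypothetical helper and the demo records are made up.

```python
import json
import tempfile
from pathlib import Path


def peek_jsonl(jsonl_file: str, n: int = 3) -> list:
    """Return the first n parsed records of a JSONL file, skipping blank lines."""
    records = []
    with open(jsonl_file, "r", encoding="utf-8") as f:
        for line in f:
            if len(records) >= n:
                break
            line = line.strip()
            if not line:
                continue
            records.append(json.loads(line))
    return records


# Demo on a tiny stand-in file; the record contents are invented for illustration.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.jsonl"
    sample.write_text(
        '{"name": "0001.jpg", "tex": "x^2 + y^2"}\n'
        "\n"
        '{"name": "0002.jpg", "tex": "\\\\frac{a}{b}"}\n',
        encoding="utf-8",
    )
    first_records = peek_jsonl(str(sample), n=2)

for record in first_records:
    print(sorted(record.keys()), record["tex"])
```

With the real data, call `peek_jsonl('m2e_train.jsonl')` and check that the keys match what the merge function expects.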


In [None]:
def merge_m2e(img_dir: str, dataset_mapping: dict) -> None:
    """
    Processes multiple JSONL files to create CSV files with image paths and LaTeX labels.
    
    Parameters:
        img_dir (str): The directory containing images.
        dataset_mapping (dict): A dictionary where keys are JSONL file names and values are the corresponding output CSV file names.
    """
    for jsonl_file, csv_file in dataset_mapping.items():
        img_labels = []

        # Read each line from the JSONL file
        with open(jsonl_file, 'r', encoding='utf-8') as f:
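            for line in f:
                line = line.strip()
                if not line:
                    continue  # Skip blank lines, which json.loads would reject

                # Parse JSON line
                data = json.loads(line)
                image_name = data.get("name")
                tex_label = data.get("tex")
                if image_name is None or tex_label is None:
                    print(f"Warning: Record without 'name' or 'tex' in {jsonl_file}. Skipping.")
                    continue

                # Construct the full image path
                image_path = os.path.join(img_dir, image_name)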
                
                # Check if the image exists in the directory
                if os.path.isfile(image_path):
                    img_labels.append([image_path, tex_label])
                else:
                    print(f"Warning: Image {image_name} not found in directory {img_dir}. Skipping.")

        # Create a DataFrame from the collected data
        df = pd.DataFrame(img_labels, columns=['Image Path', 'Label'])
        
        # Save the DataFrame to a CSV file without index
        df.to_csv(csv_file, index=False)
        print(f"Saved CSV file: {csv_file}")

# Usage: set img_dir to the M^2E image directory
merge_m2e(
    img_dir='PATH_TO_M2E_IMAGE',
    dataset_mapping={
        'm2e_train.jsonl': 'm2e_train.csv',
        'm2e_val.jsonl':   'm2e_val.csv',
        'm2e_test.jsonl':  'm2e_test.csv'
    }
)


### Merge for ICDAR

The link to [ICDAR](https://ai.100tal.com/icdar).

**Alert**: This merge is only for the training dataset. The testing dataset can be accessed through the same link.

The downloaded files contain the image directory and a `train_labels.txt` that maps each image to its $\LaTeX$ label.
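
The line format of `train_labels.txt` is not documented here, so the merge function below accepts either a tab or a space between the image name and the label. A minimal sketch of that splitting rule on its own, assuming image names contain no spaces; `split_label_line` is a hypothetical helper and the sample lines are made up.

```python
from typing import Optional, Tuple


def split_label_line(line: str) -> Optional[Tuple[str, str]]:
    """Split one label line into (image_name, label); return None if no delimiter is found."""
    line = line.strip()
    # Prefer a tab; fall back to the first space, since LaTeX labels contain spaces themselves.
    for delimiter in ("\t", " "):
        if delimiter in line:
            image_name, label = line.split(delimiter, 1)
            return image_name, label.strip()
    return None


# Invented sample lines covering the three cases.
samples = [
    "train_0.jpg\t\\frac { 1 } { 2 }\n",
    "train_1.jpg x ^ { 2 }\n",
    "broken_line\n",
]
parsed = [split_label_line(s) for s in samples]
for item in parsed:
    print(item)
```

Splitting only once (`maxsplit=1`) keeps the spaces inside the label intact.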

In [None]:
def merge_icdar(label_file, image_dir, save_file):
    """
    Reads a label file containing image names and corresponding labels, 
    matches the image names to the actual images in the provided directory, 
    and saves the matched directory paths along with their labels into a CSV file.

    Parameters:
        label_file (str): Path to the label text file containing image names and LaTeX labels.
        image_dir (str): Directory containing the image files.
        save_file (str): Path to the output CSV file where the results will be saved.
    """

    # Read from the img-label mapping file
    with open(label_file, 'r', encoding='utf-8') as f:
        data = f.readlines()

    img_labels = []
    
    # Process each line in the label file
    for line in data:
        # Strip any leading/trailing whitespace/newlines
        line = line.strip()
        
        # Check for tab or space separation and split accordingly
        if '\t' in line:
            image_name, label = line.split('\t', 1)  # Split by the first tab
        elif ' ' in line:
            image_name, label = line.split(' ', 1)   # Split by the first space
        else:
            # If there's no delimiter, skip this line or handle as needed
            continue
        
        # Construct the full image path by checking the directory
        image_path = os.path.join(image_dir, image_name)
         
        # Check if the image file exists in the directory
        if os.path.isfile(image_path):
            img_labels.append([image_path, label])
        else:
            print(f"Warning: Image {image_name} not found in directory {image_dir}. Skipping.")

    # Create a DataFrame from the list of image-label pairs
    df = pd.DataFrame(img_labels, columns=['Image Path', 'Label'])

    # Save the DataFrame to a CSV file without index
    df.to_csv(save_file, index=False)

merge_icdar(label_file='PATH_TO_THE_TRAIN_LABELS_TXT', image_dir='PATH_TO_THE_IMAGE_DIRECTORY', save_file='PATH_TO_THE_SAVE_CSV')

### Merge for HMER and CROHME

Link to [HMER](https://disk.pku.edu.cn/anyshare/en-us/link/AAF10CCC4D539543F68847A9010C607139?_tb=none&expires_at=1970-01-01T08%3A00%3A00%2B08%3A00&item_type=&password_required=false&title=HMER%20Dataset&type=anonymous).

Link to [CROHME](https://disk.pku.edu.cn/anyshare/en-us/link/AAF10CCC4D539543F68847A9010C607139?_tb=none&expires_at=1970-01-01T08%3A00%3A00%2B08%3A00&item_type=&password_required=false&title=HMER%20Dataset&type=anonymous).


The unzipped directory structures are:
```bash
hmer % tree  
.
├── hmer_dictionary.txt
├── subset
│   ├── easy.json
│   ├── hard.json
│   └── medium.json
├── test
│   ├── caption.txt
│   └── images.pkl
└── train
    ├── caption.txt
    └── images.pkl
```

```bash
crohme % tree
.
├── 2014
│   ├── caption.txt
│   └── images.pkl
├── 2016
│   ├── caption.txt
│   └── images.pkl
├── 2019
│   ├── caption.txt
│   └── images.pkl
├── crohme_dictionary.txt
└── train
    ├── caption.txt
    └── images.pkl
```

The images are stored in the `.pkl` file as a dictionary that maps each file name to a pixel array:
```json
'train_31988.jpg': array([[141, 141, 143, ..., 152, 152, 152],
       [141, 141, 143, ..., 152, 152, 152],
       [144, 144, 143, ..., 153, 153, 153],
       ...,
       [144, 144, 144, ..., 149, 149, 149],
       [145, 145, 144, ..., 149, 149, 149],
       [145, 145, 144, ..., 149, 149, 149]], dtype=uint8)}
```

Use the Python package `pickle` to load or extract them.

Same as before, only the training data is processed. In the CROHME dataset, I treat the images in 2014, 2016, and 2019 as testing datasets, since each of them contains only about 1-2k images, while the train folder has about 8-9k.
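
The split sizes can be checked by counting the caption lines per folder. This is a rough sketch, assuming every non-blank line of a `caption.txt` describes exactly one image; `count_captions` is a hypothetical helper, and the demo runs on a throwaway directory with made-up captions rather than the real dataset.

```python
import tempfile
from pathlib import Path


def count_captions(root: str, splits: list) -> dict:
    """Count the non-blank lines of <root>/<split>/caption.txt for each split."""
    counts = {}
    for split in splits:
        caption_file = Path(root) / split / "caption.txt"
        with open(caption_file, "r", encoding="utf-8") as f:
            counts[split] = sum(1 for line in f if line.strip())
    return counts


# Demo on a temporary tree that mimics the CROHME layout shown above.
with tempfile.TemporaryDirectory() as tmp:
    for split, n_lines in {"train": 5, "2014": 2}.items():
        split_dir = Path(tmp) / split
        split_dir.mkdir()
        lines = [f"img_{i}.jpg\tx _ {{ {i} }}" for i in range(n_lines)]
        (split_dir / "caption.txt").write_text("\n".join(lines) + "\n", encoding="utf-8")
    demo_counts = count_captions(tmp, ["train", "2014"])

print(demo_counts)
```

On the real data, something like `count_captions('PATH_TO_CROHME', ['train', '2014', '2016', '2019'])` should give the per-folder counts.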

Also, I wrote a `comb_dictionary` function to merge the dictionaries of the two datasets into a combined one.

In [None]:
" Extract function "
def extract_img(pkl_file: str, save_dir: str):
    """
    Extracts images from a pickle file and saves them as image files in the specified directory.
    
    Parameters:
        pkl_file (str): Path to the pickle file containing image data.
        save_dir (str): Directory where the extracted images will be saved.
    """
    # Load the pickle file
    with open(pkl_file, 'rb') as f:
        data = pickle.load(f)
    
    # Ensure the output directory exists
    os.makedirs(save_dir, exist_ok=True)
    
    # Iterate through the dictionary items
    for image_name, image_array in data.items():
        # Construct the full path for saving the image
        output_path = os.path.join(save_dir, image_name)
        
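        # Save the image using OpenCV; imwrite returns False if writing fails
        if cv2.imwrite(output_path, image_array):
            print(f"Saved image: {output_path}")
        else:
            print(f"Warning: Could not write {output_path}.")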

extract_img(pkl_file='PATH_TO_THE_TRAIN_PKL', save_dir='DIRECTORY_TO_STORE_THE_IMG')

In [None]:
" Merge function "
def merge_hmer_or_crohme(images: str, caption: str, extract_img_dir: str, save_file: str):
    """
    Optimized function to read images from a pickle file and labels from a text file,
    then saves the matched image paths with their corresponding LaTeX labels into a CSV file.

    Parameters:
        images (str): Path to the pickle (.pkl) file whose keys are the image file names.
        caption (str): Path to the caption text file containing image-label pairs.
        extract_img_dir (str): Directory containing the images extracted with `extract_img`.
        save_file (str): Path to the output CSV file where the results will be saved.
    """
    
    # Load the pickle file; iterating over the loaded dictionary yields the image file names
    with open(images, 'rb') as f:
        image_list = pickle.load(f)

    # Create a dictionary for fast lookup (O(1) on average): key is the base filename, value is the stored name
    image_dict = {os.path.basename(img_path): img_path for img_path in image_list}

    # Read all lines from the caption file
    with open(caption, 'r', encoding='utf-8') as f:
        data = f.readlines()
    
    img_labels = []

    # Process each line in the caption file
    for line in data:
        # Strip any leading/trailing whitespace/newlines
        line = line.strip()
        
        # Check for tab separation and split accordingly
        if '\t' in line:
            image_name, label = line.split('\t', 1)  # Split by the first tab
        else:
            # Skip if improperly formatted
            continue
        
        # Use the dictionary to find the matching image path
        if image_name in image_dict:
            image_path = os.path.join(extract_img_dir, image_dict[image_name])
            if os.path.isfile(image_path):
                img_labels.append([image_path, label])
            else:
                print(f"Warning: Image {image_name} not found in directory {extract_img_dir}. Skipping.")
        else:
            print(f"Warning: Image {image_name} not found in the pickle file. Skipping.")

    # Create a DataFrame from the list of image-label pairs
    df = pd.DataFrame(img_labels, columns=['Image Path', 'Label'])

    # Save the DataFrame to a CSV file without index
    df.to_csv(save_file, index=False)
    print(f"CSV file saved as {save_file}")
    
merge_hmer_or_crohme(images='PATH_TO_THE_TRAIN_PKL', caption='PATH_TO_THE_TRAIN_CAPTION', extract_img_dir='EXTRACTED_IMG_DIRECTORY', save_file='SAVE_CSV_FILE')

In [None]:
" Merge two dictionary file "
def comb_dictionary(dict1: str, dict2: str, save_dir: str):
    """
    Combines two dictionaries from text files and saves the combined dictionary to a new file.
    
    Parameters:
        dict1 (str): Path to the first dictionary text file.
        dict2 (str): Path to the second dictionary text file.
        save_dir (str): Path of the output text file for the combined dictionary.
    """
    # Read the entries from the first dictionary file
    with open(dict1, 'r', encoding='utf-8') as f1:
        dict1_entries = {line.strip() for line in f1 if line.strip()}  # Using a set to ensure uniqueness

    # Read the entries from the second dictionary file
    with open(dict2, 'r', encoding='utf-8') as f2:
        dict2_entries = {line.strip() for line in f2 if line.strip()}  # Using a set to ensure uniqueness

    # Combine both sets and sort them
    combined_entries = sorted(dict1_entries | dict2_entries)

    # Save the combined entries to the output file
    with open(save_dir, 'w', encoding='utf-8') as f_out:
        for entry in combined_entries:
            f_out.write(entry + '\n')
    
    print(f"Combined dictionary saved to: {save_dir}")

comb_dictionary(dict1='PATH_TO_CROHME_DICTIONARY', dict2='PATH_TO_HMER_DICTIONARY', save_dir='PATH_TO_COMBINED_DICTIONARY')

### im2latex_100k

Link to [im2latex_100k](https://huggingface.co/datasets/yuntian-deng/im2latex-100k).

The dataset is stored in `parquet` files; simply put, each file is like a pandas DataFrame. It has 3 columns: `formula`, `filename`, and `image`. It can be read with pandas as follows:

```python
# Read the Parquet file into a DataFrame
df = pd.read_parquet(parquet_file)
```

The `df` looks like:
```bash
                                             formula        filename                                              image
0  \widetilde \gamma _ { \mathrm { h o p f } } \s...  66667cee5b.png  {'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHD...
1  ( { \cal L } _ { a } g ) _ { i j } = 0 , \ \ \...  1cbb05a562.png  {'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHD...
2  S _ { s t a t } = 2 \pi \sqrt { N _ { 5 } ^ { ...  ed164cc822.png  {'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHD...
```

A more visual view is available through the [Dataset Viewer](https://huggingface.co/datasets/yuntian-deng/im2latex-100k) on the Hugging Face page.

This is a pretty straightforward mapping from image to $\LaTeX$, so I didn't write any merge functions for this one, as all the information can be accessed directly.

Since the images in `df` are stored as bytes, one way to convert them to PNG or another format is (`PIL` package required):

```python
from PIL import Image
import io

# Example byte data (shortened for clarity, you would use your full byte string)
image_data = b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x01@\x00\x00\x00@\x08\x02\x00\x00\x00\xe4\x859I\x00\x00\x15\xa9IDATx\x9c...'

# Convert byte data to a BytesIO object
image_stream = io.BytesIO(image_data)

# Open the image using PIL
image = Image.open(image_stream)

# Show the image (optional, will open a window with the image)
image.show()

# Save the image to a file (optional)
image.save("output_image.png")
```
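
If the same `Image Path` / `Label` pairs as in the other datasets are wanted, the stored bytes can also be written to disk directly, because they are already PNG-encoded. This is only a sketch, assuming each record has the `formula`, `filename`, and `image` fields shown above, with the raw data under `image['bytes']`; `export_images` is a hypothetical helper, and the demo records carry a fake payload that is not a decodable image.

```python
import os
import tempfile

# Every PNG file starts with this fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def export_images(records, out_dir: str) -> list:
    """Write each record's image bytes to out_dir and return (path, formula) pairs."""
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for record in records:
        raw = record["image"]["bytes"]
        if not raw.startswith(PNG_SIGNATURE):
            print(f"Warning: {record['filename']} does not look like a PNG. Skipping.")
            continue
        out_path = os.path.join(out_dir, record["filename"])
        with open(out_path, "wb") as f:
            f.write(raw)
        saved.append((out_path, record["formula"]))
    return saved


# Made-up records for the demo; with the real data, pass df.to_dict("records").
demo_records = [
    {"formula": "x ^ { 2 }", "filename": "a.png",
     "image": {"bytes": PNG_SIGNATURE + b"fake-payload"}},
    {"formula": "y", "filename": "b.png", "image": {"bytes": b"not-a-png"}},
]
with tempfile.TemporaryDirectory() as tmp:
    exported = export_images(demo_records, tmp)
    exported_names = [os.path.basename(path) for path, _ in exported]
    first_size = os.path.getsize(exported[0][0])

print(exported_names, first_size)
```

The returned pairs can be passed to `pd.DataFrame(..., columns=['Image Path', 'Label'])` to get a CSV in the same shape as the other merges.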
