## Download Datasets

This notebook handles the initial data download and preparation for our project.

It downloads and processes several datasets from the Hugging Face Hub. These datasets contain transliterations, translations, and images of ancient Sumerian tablets, which we use to train our neural machine translation models.

**Dataset Downloads:**
- SumTablets: Core dataset containing Sumerian transliterations
- SumTablets_English: Sumerian transliterations paired with English translations
- SumTablets_English-augmented: Enhanced dataset with additional translations
- SumTablets_Photos: Visual data of the tablets (handled in chunks due to size)

Corresponding datasets are joined on tablet ID to create unified training, validation, and test sets.

**Purpose:**
This preparatory work lays the foundation for our subsequent natural language processing tasks, in particular the machine translation models that translate from Sumerian to English. Keeping the train, validation, and test splits separate allows for proper model training and evaluation.
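
To make the join on tablet ID concrete, here is a minimal sketch on toy data. The ids and text values are made up for illustration; only the `id` key and the use of an inner `pd.merge` mirror what the notebook does later.

```python
import pandas as pd

# Toy stand-ins for the real CSVs; ids and values are invented for illustration.
transliterations = pd.DataFrame({
    "id": ["P001", "P002", "P003"],
    "transliteration": ["1(diš) gu₄", "pisan dub-ba", "lugal kal-ga"],
})
translations = pd.DataFrame({
    "id": ["P002", "P003", "P999"],
    "translation": ["Basket-of-tablets", "strong king", "unmatched row"],
})

# An inner join keeps only the ids that appear in both tables.
joined = pd.merge(transliterations, translations, on="id", how="inner")
print(joined["id"].tolist())  # ['P002', 'P003']
```

Because the join is inner, the merged set can never have more distinct tablets than the smaller of the two inputs.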


### Imports

In [3]:
import pandas as pd
import polars as pl
from datasets import load_dataset, load_dataset_builder
import os

# create a directory to store the datasets (no error if it already exists)
os.makedirs("../datasets", exist_ok=True)

### Download Datasets

#### SumTablets

In [4]:
splits = {'train': 'train.csv', 'validation': 'validation.csv', 'test': 'test.csv'}
train = pd.read_csv("hf://datasets/colesimmons/SumTablets/" + splits["train"])
test = pd.read_csv("hf://datasets/colesimmons/SumTablets/" + splits["test"])
validation = pd.read_csv("hf://datasets/colesimmons/SumTablets/" + splits["validation"])

# write out the dataset to a CSV file
train.to_csv("../datasets/SumTablets_train.csv", index=False)
test.to_csv("../datasets/SumTablets_test.csv", index=False)
validation.to_csv("../datasets/SumTablets_validation.csv", index=False)
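
The three read/write pairs above follow one pattern, so the paths could be generated instead of typed out. The sketch below only builds the strings; `split_paths` is a hypothetical helper, not part of the notebook or of any library.

```python
def split_paths(dataset_name, splits, owner="colesimmons", out_dir="../datasets"):
    """Return {split: (remote_url, local_csv_path)} for each split (hypothetical helper)."""
    base = f"hf://datasets/{owner}/{dataset_name}/"
    return {
        split: (base + filename, f"{out_dir}/{dataset_name}_{split}.csv")
        for split, filename in splits.items()
    }

splits = {"train": "train.csv", "validation": "validation.csv", "test": "test.csv"}
paths = split_paths("SumTablets", splits)
print(paths["train"])

# Possible use (requires network access, so it is left commented out):
# for remote, local in paths.values():
#     pd.read_csv(remote).to_csv(local, index=False)
```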

#### SumTablets_English-augmented

In [None]:
df = pd.read_parquet("hf://datasets/colesimmons/SumTablets_English-augmented/data/train-00000-of-00001.parquet")

# write out the dataset to a CSV file
df.to_csv("../datasets/SumTablets_English-augmented.csv", index=False)

#### SumTablets_English

In [None]:
train = pd.read_csv("hf://datasets/colesimmons/SumTablets_English/" + splits["train"])
test = pd.read_csv("hf://datasets/colesimmons/SumTablets_English/" + splits["test"])
validation = pd.read_csv("hf://datasets/colesimmons/SumTablets_English/" + splits["validation"])

# write out the dataset to a CSV file
train.to_csv("../datasets/SumTablets_English_train.csv", index=False)
test.to_csv("../datasets/SumTablets_English_test.csv", index=False)
validation.to_csv("../datasets/SumTablets_English_validation.csv", index=False)

#### SumTablets_Photos

In [None]:
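def save_image(row, PIL_object=False):
    # Save the image as images/<id>.png, creating the directory if needed
    os.makedirs("images", exist_ok=True)
    # Use double quotes outside and single quotes inside: nesting the same
    # quote type in an f-string is a syntax error before Python 3.12
    with open(f"images/{row['id']}.png", "wb") as f:
        if not PIL_object:
            # row['image'] is a dict holding the raw image bytes
            f.write(row["image"]["bytes"])
        else:
            # row['image'] is a PIL image object
            row["image"].save(f, format="PNG")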

def download_chunk(dataset_name, config, split, start, end, cols_to_remove=None):
    ds = load_dataset(f'colesimmons/{dataset_name}', config, split=split, streaming=True)

    batch = []
    for i, example in enumerate(ds):
        if i >= end:
            break
        if i >= start:
            batch.append(example)
    
    df = pd.DataFrame(batch)
    
    # iterrows() yields (index, row) pairs; only the row is needed
    for _, row in df.iterrows():
        save_image(row, PIL_object=True)

    if cols_to_remove:
        df.drop(columns=cols_to_remove, inplace=True)
    
    df.to_csv(f"../datasets/{dataset_name}_{split}_{start}_{end}.csv", index=False)
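
The `enumerate` loop in `download_chunk` selects the items whose index falls in `[start, end)`. As a sketch of an equivalent, `itertools.islice` expresses the same selection in one call. `take_range` and the generator standing in for the streamed dataset are illustrative names, not part of the notebook.

```python
from itertools import islice

def take_range(iterable, start, end):
    """Return the items with index in [start, end) and stop reading after that."""
    return list(islice(iterable, start, end))

# A generator stands in for the streamed dataset here.
fake_stream = ({"id": f"P{i:06d}"} for i in range(1_000_000))
batch = take_range(fake_stream, 3, 6)
print([row["id"] for row in batch])  # ['P000003', 'P000004', 'P000005']
```

Like the original loop, `islice` still has to step past the first `start` items, so it saves code rather than download time.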

Download the test and validation splits.

In [None]:
splits = {'train': 'data/train_lineart.parquet', 'validation': 'data/validation_lineart.parquet', 'test': 'data/test_lineart.parquet'}
test = pl.read_parquet('hf://datasets/colesimmons/SumTablets_Photos/' + splits['test'])
print('test read')
validation = pl.read_parquet('hf://datasets/colesimmons/SumTablets_Photos/' + splits['validation'])
print('validation read')

# write out the dataset to a CSV file, excluding the image column
test_without_images = test.drop('image')
validation_without_images = validation.drop('image')

# Write the non-image data to CSV, in the same directory as the other datasets
test_without_images.write_csv('../datasets/SumTablets_Photos_test.csv')
validation_without_images.write_csv('../datasets/SumTablets_Photos_validation.csv')

# Create directory to save images
if not os.path.exists('images'):
    os.makedirs('images')

for row in test.iter_rows(named=True):
    save_image(row)
print('test images saved')

for row in validation.iter_rows(named=True):
    save_image(row)
print('validation images saved')
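
As a sketch of the byte-writing path used for the Parquet rows, the helper below writes one row's image bytes to `<id>.png` with `pathlib`. `save_bytes` is a hypothetical name, and the row is fake data shaped like the `row['image']['bytes']` access in `save_image`.

```python
import tempfile
from pathlib import Path

def save_bytes(row, out_dir):
    """Write row['image']['bytes'] to <out_dir>/<id>.png and return the path."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # safe if the directory exists
    path = out_dir / f"{row['id']}.png"
    path.write_bytes(row["image"]["bytes"])
    return path

# Fake row: the bytes are a placeholder, not a real PNG.
fake_row = {"id": "P000001", "image": {"bytes": b"not-a-real-png"}}
with tempfile.TemporaryDirectory() as tmp:
    written = save_bytes(fake_row, Path(tmp) / "images")
    saved_name = written.name
    saved_size = written.stat().st_size
print(saved_name, saved_size)
```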

The train split is too big, so we download only some chunks of it.

In [None]:
download_chunk("SumTablets_Photos", "lineart_only", "train", 0, 10000, cols_to_remove=["image", "image_type"])
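
If more than one chunk is needed, the `(start, end)` pairs can be generated rather than written by hand. This is a minimal sketch; `chunk_bounds` is a hypothetical helper and the total of 25,000 rows is a made-up number, not the real size of the split.

```python
def chunk_bounds(total, chunk_size):
    """Yield (start, end) pairs that cover range(total) in steps of chunk_size."""
    for start in range(0, total, chunk_size):
        yield start, min(start + chunk_size, total)

bounds = list(chunk_bounds(25_000, 10_000))
print(bounds)  # [(0, 10000), (10000, 20000), (20000, 25000)]
```

Each pair could be passed as the `start` and `end` arguments of `download_chunk`. Note that `download_chunk` restarts the stream from the first example on every call, so later chunks iterate over the earlier rows again before collecting their own.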

#### Merge Datasets

In [None]:
# Merge the train splits of SumTablets and SumTablets_English on id
train_sumtablets = pd.read_csv("../datasets/SumTablets_train.csv")
train_sumtablets_english = pd.read_csv("../datasets/SumTablets_English_train.csv")

merged_train = pd.merge(train_sumtablets, train_sumtablets_english, on='id', how='inner')

print(train_sumtablets.shape)
print(train_sumtablets_english.shape)
print(merged_train.shape)

merged_train.head()

(82452, 6)
(1907, 5)
(1907, 10)


Unnamed: 0,id,period_x,genre_x,transliteration_x,glyph_names,glyphs,period_y,genre_y,transliteration_y,translation
0,P514378,Ur III,Administrative,<SURFACE>\n5(diš) sila₃ kaš 5(diš) sila₃ ninda...,<SURFACE> \n 5(DIŠ) SILA₃ BI 5(DIŠ) SILA₃ GAR ...,<SURFACE>\n𒐊𒋡𒁉𒐊𒋡𒃻𒐊𒂆𒋧\n𒐈𒂆𒉌𒈫𒂆𒉀\n𒌨𒀭𒎏𒆤\n𒐊𒋡𒁉𒐊𒋡𒃻𒐊𒂆\n...,Ur III,Administrative,\n5(diš) sila₃ kaš 5(diš) sila₃ ninda 5(diš) g...,"5 sila3 beer, 5 sila3 bread, 5 shekels garlic,..."
1,P416427,Ur III,Administrative,<SURFACE>\nla₂-ia₃...1(barig) 2(diš) sila₃ dab...,<SURFACE> \n LAL NI...DIŠ MIN SILA₃ |EŠ₂.ŠE| \...,<SURFACE>\n𒇲𒉌...𒁹𒈫𒋡𒂠𒊺\n𒈗𒌫𒊏𒉌\n...𒌨𒄑𒇀𒌉𒀉𒀀𒋛𒇻\n𒐈𒋡𒌨𒀭...,Ur III,Administrative,\nla₂-ia₃...1(barig) 2(diš) sila₃ dabin\nlugal...,Repaid arrears: 1 barig 2 sila3 of dabin-flour...
2,P102320,Ur III,Administrative,<SURFACE>\n1(diš) gu₄ niga\nu₄ 1(u)-kam\nki ab...,<SURFACE> \n DIŠ GUD ŠE \n UD U |HI×BAD| \n KI...,<SURFACE>\n𒁹𒄞𒊺\n𒌓𒌋𒄰\n𒆠𒀊𒁀𒊷𒂵𒋫\n𒀀𒄷𒉿𒅕\n𒉌𒆪\n<SURFAC...,Ur III,Administrative,\n1(diš) gu₄ niga\nu₄ 1(u)-kam\nki ab-ba-sa₆-g...,"1 ox, grain-fed,\n10th day,\nfrom Abbasaga\nAḫ..."
3,P424401,Ur III,Administrative,<SURFACE>\n1(diš) e₂-ki\niti dal-ta\n1(diš) lu...,<SURFACE> \n DIŠ E₂ KI \n |UD×(U.U.U)| RI TA \...,<SURFACE>\n𒁹𒂍𒆠\n𒌗𒊑𒋫\n𒁹𒈗𒍏𒁀𒀭\n𒌗𒋗𒆰𒈾𒋫\n𒁹<unk>𒄒𒆷\n𒌗...,Ur III,Administrative,\n1(diš) e₂-ki\niti dal-ta\n1(diš) lugal-da₅-b...,"1 Eki,\nfrom the month “Flight,”\n1 Lugaldaban..."
4,P131770,Ur III,Administrative,<SURFACE>\n4(diš) 1/2(diš) gin₂ 1(u) 2(diš) še...,<SURFACE> \n 4(DIŠ) MAŠ DUN₃@g U MIN ŠE KU₃ UD...,<SURFACE>\n𒐉𒈦𒂆𒌋𒈫𒊺𒆬𒌓\n𒋛𒉌𒌈\n<BLANK_SPACE>\n𒐠𒐗𒄩𒊕𒉽...,Ur III,Administrative,\n4(diš) 1/2(diš) gin₂ 1(u) 2(diš) še ku₃-babb...,"4 1/2 shekels, 12 grains of silver,\nthe remai..."
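
The `_x` and `_y` suffixes in the output above appear because both inputs have `period`, `genre`, and `transliteration` columns. One way to avoid them, sketched here on toy data, is to select only the columns needed from the right-hand table before merging. Which `transliteration` column to keep is a modeling choice; this sketch keeps the left one.

```python
import pandas as pd

# Toy frames with an overlapping non-key column ("period"); values are invented.
left = pd.DataFrame({
    "id": ["P001", "P002"],
    "period": ["Ur III", "Ur III"],
    "transliteration": ["a", "b"],
})
right = pd.DataFrame({
    "id": ["P001", "P002"],
    "period": ["Ur III", "Ur III"],
    "translation": ["A", "B"],
})

# Keep only the key and the new column from the right side, so no suffixes are added.
merged = pd.merge(left, right[["id", "translation"]], on="id", how="inner")
print(list(merged.columns))  # ['id', 'period', 'transliteration', 'translation']
```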


In [None]:
# Merge the test splits of SumTablets and SumTablets_English on id
test_sumtablets = pd.read_csv("../datasets/SumTablets_test.csv")
test_sumtablets_english = pd.read_csv("../datasets/SumTablets_English_test.csv")

merged_test = pd.merge(test_sumtablets, test_sumtablets_english, on='id', how='inner')

print(test_sumtablets.shape)
print(test_sumtablets_english.shape)
print(merged_test.shape)

merged_test.head()

(4577, 6)
(113, 5)
(113, 10)


Unnamed: 0,id,period_x,genre_x,transliteration_x,glyph_names,glyphs,period_y,genre_y,transliteration_y,translation
0,P458667,Ur III,Administrative,<SURFACE>\n<COLUMN>\n{d}i-bi₂{d}suen\nlugal ka...,<SURFACE> \n <COLUMN> \n AN I NE AN |EN.ZU| \n...,<SURFACE>\n<COLUMN>\n𒀭𒄿𒉈𒀭𒂗𒍪\n𒈗𒆗𒂵\n𒈗𒋀𒀊𒆠𒈠\n𒈗𒀭𒌒𒁕𒇹...,Ur III,Administrative,\n\n{d}i-bi₂{d}suen\nlugal kal-ga\nlugal uri₅{...,"Ibbi-Suen,\nstrong king,\nking of Ur,\nking of..."
1,P101242,Ur III,Administrative,<SURFACE>\n1(diš) dug dida 5(diš) sila₃ kaš sa...,<SURFACE> \n DIŠ DUG |BI.U₂.SA| 5(DIŠ) SILA₃ B...,<SURFACE>\n𒁹𒂁𒁉𒌑𒊓𒐊𒋡𒁉𒅆𒂟\n𒑏𒃻𒈫𒂆𒉌𒈫𒂆𒉀\n𒐈𒋻𒐈𒊓𒋧\n<unk>𒀭...,Ur III,Administrative,\n1(diš) dug dida 5(diš) sila₃ kaš sag₁₀\n1(ba...,"1 jug of common wort, 5 sila3 fine beer,\n1 ba..."
2,P110413,Ur III,Administrative,<SURFACE>\npisan dub-ba\nše-ba siki-ba\ngiri₃-...,<SURFACE> \n GA₂ DUB BA \n ŠE BA SIK₂ BA \n GI...,<SURFACE>\n𒂷𒁾𒁀\n𒊺𒁀𒋠𒁀\n𒄊𒋧𒂵𒊮𒌷\n𒉌𒅅\n...\n<SURFACE...,Ur III,Administrative,\npisan dub-ba\nše-ba siki-ba\ngiri₃-se₃-ga ša...,Basket-of-tablets:\nxxx\nxxx\nxxx\nxxx
3,P104927,Ur III,Administrative,<SURFACE>\npisan dub-ba\ntag-tag-ga\nudu zu₂-s...,<SURFACE> \n GA₂ DUB BA \n TAG TAG GA \n LU KA...,<SURFACE>\n𒂷𒁾𒁀\n𒋳𒋳𒂵\n𒇻𒅗𒋛𒅗𒋢𒈾\n𒅇𒋠𒁉\n<SURFACE>\n𒇻...,Ur III,Administrative,\npisan dub-ba\ntag-tag-ga\nudu zu₂-si-ka su-n...,"Basket-of-tablets:\nwoven goods,\nsheep of ivo..."
4,P331059,Ur III,Administrative,<SURFACE>\npisan dub-ba\ngu₄\ne₂ {d}nin-gir₂-s...,<SURFACE> \n GA₂ DUB BA \n GUD \n E₂ AN |SAL.T...,<SURFACE>\n𒂷𒁾𒁀\n𒄞\n𒂍𒀭𒎏𒄈𒋢\n𒂍𒀭𒎏𒁯𒀀\n𒂍𒀭𒌉𒍣\n𒂍𒀭𒅅𒄋\n𒂍...,Ur III,Administrative,\npisan dub-ba\ngu₄\ne₂ {d}nin-gir₂-su\ne₂ {d}...,Basket-of-tablets:\nxxx\nxxx\nxxx\nxxx\nxxx\nxxx
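
The printed shapes are the only check on these merges. As a hedged sketch of a stricter check, `pd.merge` accepts `validate` to assert that ids are unique on both sides and `indicator` to report where each row came from. The toy frames below are invented for illustration.

```python
import pandas as pd

left = pd.DataFrame({"id": ["P001", "P002", "P003"], "transliteration": ["a", "b", "c"]})
right = pd.DataFrame({"id": ["P002", "P003", "P004"], "translation": ["B", "C", "D"]})

# validate="one_to_one" raises pandas.errors.MergeError if either side repeats an id.
# indicator=True adds a "_merge" column with values left_only, right_only, or both.
checked = pd.merge(left, right, on="id", how="outer", validate="one_to_one", indicator=True)
counts = checked["_merge"].value_counts().to_dict()
print(counts)
```

The `both` count is the number of rows an inner join would return, and a non-zero `right_only` count would point to translated tablets that are missing from the other table's split.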


In [None]:
# Merge the validation splits of SumTablets and SumTablets_English on id
validation_sumtablets = pd.read_csv("../datasets/SumTablets_validation.csv")
validation_sumtablets_english = pd.read_csv("../datasets/SumTablets_English_validation.csv")

merged_validation = pd.merge(validation_sumtablets, validation_sumtablets_english, on='id', how='inner')

print(validation_sumtablets.shape)
print(validation_sumtablets_english.shape)
print(merged_validation.shape)

merged_validation.head()

(4577, 6)
(107, 5)
(54, 14)
