In [201]:
# Minimum line length in characters (the trailing newline is counted too).
MIN_SENTENCE_LENGTH = 100
INPUT_FILE = 'data/train.en'
OUTPUT_FILE = 'data/english_translated_akkadian_corpus.txt'

In [202]:
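# Read the corpus (assumed to be UTF-8) and close the file handle when done.
with open(INPUT_FILE, 'r', encoding='utf-8') as f:
    data = f.readlines()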

In [203]:
data[:5]

['Precious scion of Baltil (Aššur), beloved of the god(dess) (DN and) Šērūa, ... , creation of the goddess Ninmena, who (... ) ... for the dominion of the lands, (... ) who grew up to be king, ... (... ) governor, (... ) ... , the one who increases voluntary offerings for ... , ... (... ) of emblems, (5) powerful male, light of all of his people, lord of (... ) all rulers ... , the one who overwhelms his foes, valiant man, the one who destroys (... ) enemies, who cuts (straight) through interlocking mountains like a (taut) string and ... \n',
 'warrior ... who made ... bow down at his feet ... , who put ... to the sword (lit. “weapon”), ... circumspect ... ,  \n',
 '... he made ... kiss his feet ... mountains ... in/of battle ... he (a god) made my weapon/rule greater than all of those/the kings who sit on (royal) daises, (5) ... circumspect ... , ... exalted lion-dragon, ... inhabited world.\n',
 'I adorned them (statues of the gods) and they (the gods) went (back) to their land. I re

In [204]:
len(data)

50478

In [205]:
# Keep only lines of at least MIN_SENTENCE_LENGTH characters (newline included).
data = [line for line in data if len(line) >= MIN_SENTENCE_LENGTH]

In [206]:
len(data)

13281

In [207]:
# Drop every line that contains an ellipsis ('...').
data = [d for d in data if '...' not in d]
len(data)

11042
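
The two filters above can also be folded into one helper that reports how many lines survive each step. This is a minimal sketch only; `filter_lines` and the toy input are hypothetical and not part of the original notebook.

```python
def filter_lines(lines, min_length):
    """Apply the length and ellipsis filters; return kept lines and per-step counts."""
    long_enough = [line for line in lines if len(line) >= min_length]
    no_ellipsis = [line for line in long_enough if '...' not in line]
    counts = {
        'input': len(lines),
        'after_length_filter': len(long_enough),
        'after_ellipsis_filter': len(no_ellipsis),
    }
    return no_ellipsis, counts


# Toy input (hypothetical); the notebook itself uses MIN_SENTENCE_LENGTH = 100.
sample = ['short\n', 'a long line with a gap ... in it\n', 'a long line without any gap\n']
kept, counts = filter_lines(sample, min_length=10)
print(kept)
print(counts)
```

Keeping the counts next to the filtered list makes it easier to see which step removes the most data.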

In [208]:
data[:5]

['I adorned them (statues of the gods) and they (the gods) went (back) to their land. I rebuilt those cities. I built a city on top of a tell (lit. “a heaped-up ruin mound”) called Ḫumut. I built (and) completed (it) from its foundations to its parapets. Inside (it), I founded a palace for my royal residence. I named it Kār-Aššur, set up the weapon of (the god) Aššur, my lord, therein, (and) settled the people of (foreign) lands conquered by me therein. I imposed upon them tax (and) tribute, (and) considered them as inhabitants of Assyria. \n',
 'From their sheep levy, which I take annually, I apportioned 240 sheep as a gift to (the god) Aššur, my lord. From those Arameans whom I deported, (10) I distributed (and) settled \n',
 'thousand (to) the province of the land Barḫa(l)zi, (and) 5,000 (to) the province of the land Mazamua. \n',
 'I united them, considered them as inhabitants of Assyria, (and) imposed the yoke of (the god) Aššur, my lord, upon them as Assyrians. (As for) the aband

In [209]:
def remove_duplicate_lines_preserving_order(data):
    # dict keys are unique and keep insertion order (Python 3.7+),
    # so this keeps the first occurrence of each line.
    return list(dict.fromkeys(data))

In [210]:
data = remove_duplicate_lines_preserving_order(data)
len(data)

9720
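
To see which lines are repeated, a tally can be run on the list before it is deduplicated. The sketch below is an illustration under assumptions: `most_duplicated` is a hypothetical helper and the sample list stands in for the filtered `data`.

```python
from collections import Counter


def most_duplicated(lines, top=3):
    """Return (line, count) pairs for the most frequent lines that occur more than once."""
    counts = Counter(lines)
    return [(line, n) for line, n in counts.most_common(top) if n > 1]


# Toy input (hypothetical), standing in for the filtered `data` list.
sample = ['a', 'b', 'a', 'c', 'b', 'a']
duplicates = most_duplicated(sample)
unique = list(dict.fromkeys(sample))  # order-preserving deduplication
print(duplicates)
print(unique)
```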

In [211]:
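# The third-party `regex` package is needed for \p{L} (any Unicode letter),
# which the standard-library `re` module does not support.
import regex as re

# Textual notes such as (text: "their"); removed entirely.
TEXTUAL_NOTE_REGEX = r'\s?\(text:?\s\"(.*?)\"\)'

# Bracketed words such as (and) or <watch>; the brackets are dropped
# and the enclosed words are kept.
TRANSLATION_NOTE_REGEX = r'[\(<]([\p{L}\s,]+?)[\)>]'
TRANSLATION_NOTE_REPLACEMENT = r'\1'

# Parenthesized notes containing a digit, such as (10); removed entirely.
GENERIC_PARENTHESES_NOTE_REGEX = r'\s?\(([^\)]*\d[^\)]*)\)'

# Literal glosses such as (lit. "between rivers"); reduced to (between rivers).
LITERAL_NOTE_REGEX = r'\(lit\.:?\s\"(.*?)\"\)'
LITERAL_NOTE_REPLACEMENT = r'(\1)'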


def process_line(line):
    """Strip editorial markup from one translated line."""
    line = line.strip()
    # Normalize curly quotes so the note patterns can match straight quotes.
    line = line.replace('“', '"')
    line = line.replace('”', '"')

    # Applied in this order: remove textual notes, unwrap bracketed words,
    # remove digit-bearing notes, then simplify literal glosses.
    line = re.sub(TEXTUAL_NOTE_REGEX, '', line)
    line = re.sub(TRANSLATION_NOTE_REGEX, TRANSLATION_NOTE_REPLACEMENT, line)
    line = re.sub(GENERIC_PARENTHESES_NOTE_REGEX, '', line)
    line = re.sub(LITERAL_NOTE_REGEX, LITERAL_NOTE_REPLACEMENT, line)

    return line
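
If the third-party `regex` dependency is unwanted, the letter-only pattern can be approximated with the standard library. This is a sketch under assumptions, not the notebook's method: `STDLIB_TRANSLATION_NOTE_REGEX` and `unwrap_translation_notes` are hypothetical names, and `[^\W\d_]` is only roughly equivalent to `\p{L}` (it can also match a few non-decimal numeric characters).

```python
import re as std_re  # standard library; the notebook itself uses `regex`

# Approximation of [\p{L}\s,]: [^\W\d_] matches word characters that are
# neither decimal digits nor underscores, i.e. mostly letters.
STDLIB_TRANSLATION_NOTE_REGEX = r'[(<]((?:[^\W\d_]|[\s,])+?)[)>]'


def unwrap_translation_notes(line):
    """Drop the brackets around letter-only notes, keeping the enclosed words."""
    return std_re.sub(STDLIB_TRANSLATION_NOTE_REGEX, r'\1', line)


sample = 'called Ḫumut (and) named it <while making> (10) he (Marduk-apla-iddina)'
result = unwrap_translation_notes(sample)
print(result)
```

Notes containing digits or hyphens, such as `(10)` and `(Marduk-apla-iddina)`, are left untouched by this pattern.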

In [212]:
process_line("I built a city on top of a tell (lit. “a heaped-up ruin mound”) called Ḫumut (and) named it Kār-Aššur. I settled the people of (foreign) lands conquered by me therein (and) placed a eunuch of mine over them.")

'I built a city on top of a tell (a heaped-up ruin mound) called Ḫumut and named it Kār-Aššur. I settled the people of foreign lands conquered by me therein and placed a eunuch of mine over them.'

In [213]:
process_line("I smashed the land Bīt-Šilāni in its entirety like a pot. I destroyed the city Sarrabānu, its (text: “their”) great royal city, (making it) like a tell after the Deluge and I plundered it. (10) I impaled Nabû-ušabši, their king, before the gate of his city <while making> (the people of) his land <watch>. I carried off his wife, his sons, his daughters, his possessions, (and) the treasures of his palace. ")

'I smashed the land Bīt-Šilāni in its entirety like a pot. I destroyed the city Sarrabānu, its great royal city, making it like a tell after the Deluge and I plundered it. I impaled Nabû-ušabši, their king, before the gate of his city while making the people of his land watch. I carried off his wife, his sons, his daughters, his possessions, and the treasures of his palace.'

In [214]:
process_line("In my thirteen regnal year, in the month Ayyāru (II), I got my (chariot) teams ready in Šuanna (Babylon), prepared my (military) camp ... Before my (arrival), he (Marduk-apla-iddina) evacuated the cities Bīt-Zabidāya, Iqbi-Bēl, Ḫursaggalla, ... , carried off as booty the people of (the cities) Ur, ... , Kissik, Nēmed-Laguda, (and) ... , (375) and brought (them) into the city Dūr-Yakīn... He then strengthened its enclosure walls (and), moving back a distance of (one) measuring rope from the front of its main wall, he made a moat two hundred cubits wide; he made (the moat) one and a half nindanu deep and reached ground water. He cut a channel from the Euphrates River, (thereby) making (its water) flow (in)to its meadowland. He (thus) filled the city’s flatlands, where battles (are fought), with water and cut the bridges. Together with his allies (and) his battle troops, he pitched his royal tent in a bend of the river (lit.: “between rivers”) like a crane and set up his (military) camp.")

'In my thirteen regnal year, in the month Ayyāru II, I got my chariot teams ready in Šuanna Babylon, prepared my military camp ... Before my arrival, he (Marduk-apla-iddina) evacuated the cities Bīt-Zabidāya, Iqbi-Bēl, Ḫursaggalla, ... , carried off as booty the people of the cities Ur, ... , Kissik, Nēmed-Laguda, and ... , and brought them into the city Dūr-Yakīn... He then strengthened its enclosure walls and, moving back a distance of one measuring rope from the front of its main wall, he made a moat two hundred cubits wide; he made the moat one and a half nindanu deep and reached ground water. He cut a channel from the Euphrates River, thereby making its water flow into its meadowland. He thus filled the city’s flatlands, where battles are fought, with water and cut the bridges. Together with his allies and his battle troops, he pitched his royal tent in a bend of the river (between rivers) like a crane and set up his military camp.'

In [215]:
process_line("I repaired the woeful desecrated state of the gods and goddess who lived in it, who had been displaced by floods and storm, and whose appearances had become dim; I made their dimmed appearance bright, cleaned their dirty garments, (and) had them permanently installed on their daises. (As for) the šēdus, lamassus, (and) rābiṣu-demons of the temple, I repaired their dilapidated part(s), (and) I (re)stationed them ")

'I repaired the woeful desecrated state of the gods and goddess who lived in it, who had been displaced by floods and storm, and whose appearances had become dim; I made their dimmed appearance bright, cleaned their dirty garments, and had them permanently installed on their daises. As for the šēdus, lamassus, and rābiṣu-demons of the temple, I repaired their dilapidated parts, and I restationed them'

In [216]:
process_line("matter. They were afflicted by thieving (and) murdering. They were stealing from the poor (and) giving to the mighty; there was oppression (and) the taking of bribes in the city. Every day, without ceasing, they stole goods from each other, a son (i 15') cursed his father in the street, a slave ")

'matter. They were afflicted by thieving and murdering. They were stealing from the poor and giving to the mighty; there was oppression and the taking of bribes in the city. Every day, without ceasing, they stole goods from each other, a son cursed his father in the street, a slave'

In [217]:
data = [process_line(d) for d in data]

In [218]:
data[:5]

['I adorned them statues of the gods and they the gods went back to their land. I rebuilt those cities. I built a city on top of a tell (a heaped-up ruin mound) called Ḫumut. I built and completed it from its foundations to its parapets. Inside it, I founded a palace for my royal residence. I named it Kār-Aššur, set up the weapon of the god Aššur, my lord, therein, and settled the people of foreign lands conquered by me therein. I imposed upon them tax and tribute, and considered them as inhabitants of Assyria.',
 'From their sheep levy, which I take annually, I apportioned 240 sheep as a gift to the god Aššur, my lord. From those Arameans whom I deported, I distributed and settled',
 'thousand to the province of the land Barḫalzi, and 5,000 to the province of the land Mazamua.',
 'I united them, considered them as inhabitants of Assyria, and imposed the yoke of the god Aššur, my lord, upon them as Assyrians. As for the abandoned settlements on the periphery of my land that had become 

In [219]:
data[-5:]

['I waited at the well Ṣumūa, which is located between the well Makiru and the well Gallabu, for one whole day in the month Duʾūzu. Opposite the well Ṣumūa were four pens which did not hold any sheep. I knew, however, that the well Ṣumūa',
 'I inflicted this defeat by the power of the gods Šamaš and Marduk, Adad and Apla-Adad, the great gods, my lords. Anyone in the future who comes forward and should ask the elders of his land and the elders of the land of Laqû: "Is it true that Ninurta-kudurrī-uṣur, governor of the land of Sūḫu and the land of Mari,',
 'they allowed me to trample my enemy under my feet. No one in the future who comes forward should say: "How did Ninurta-kudurrī-uṣur inflict this defeat?" By the gods Adad and Apla-Adad',
 'The palace of Enamḫe-zēra-ibni, governor of the land of Sūḫu and the land of Mari, which is located in the district of the city Raʾil,',
 'I, Ninurta-kudurrī-uṣur, governor of the land of Sūḫu and the land of Mari, son of Šamaš-rēša-uṣur, ditto gove

In [220]:
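# Write one cleaned line per row; the context manager closes the file.
with open(OUTPUT_FILE, 'w', encoding='utf-8') as f:
    print('\n'.join(data), file=f)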