The dataset conversion to ladder format #6

Open
tamohannes opened this issue Dec 25, 2020 · 2 comments

Comments

@tamohannes

tamohannes commented Dec 25, 2020

@thompsonb, I'm trying to replicate the work in your paper, in particular the results in Table 1.
How did you convert the dataset in the "bleualign_data" directory to hunalign's ladder-style format?
Is there a script for that, or did you do it manually?

@thompsonb
Owner

I converted from hunalign ladder-style to the bleualign format. I believe this is the code I used:

def reformat(ladder_file, src_len, tgt_len):
    """Convert a hunalign ladder-style file to a list of (source indices, target indices) tuples."""
    alignments = []
    current_alignment = ([], [])
    prev_a1, prev_a2 = None, None

    # Each line starts with two tab-separated integers: source index, target index.
    with open(ladder_file, 'r', encoding='utf-8') as f:
        for line in f:
            fields = line.strip().split('\t')
            a1, a2 = int(fields[0]), int(fields[1])

            # Start a new alignment only when both indices change at once.
            if a1 != prev_a1 and a2 != prev_a2 and current_alignment != ([], []):
                alignments.append(current_alignment)
                current_alignment = ([], [])

            current_alignment[0].append(a1)
            current_alignment[1].append(a2)
            prev_a1, prev_a2 = a1, a2

    # Note: the alignment still open when the loop ends is not appended here;
    # any of its indices below src_len / tgt_len are emitted as unaligned below.

    # Deduplicate and sort the indices inside each alignment.
    alignments2 = []
    xx, yy = [], []
    for a1, a2 in alignments:
        x1 = sorted(set(a1))
        x2 = sorted(set(a2))
        alignments2.append((x1, x2))  # tuple of lists
        xx.extend(x1)
        yy.extend(x2)

    # add deletions/insertions (*not* in order)
    xx, yy = set(xx), set(yy)
    for x in range(src_len):
        if x not in xx:
            alignments2.append(([x], []))
    for y in range(tgt_len):
        if y not in yy:
            alignments2.append(([], [y]))

    return alignments2
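
To make the grouping step easier to follow, here is a minimal sketch that applies the same rule to an in-memory list of index pairs. The helper name `group_links` and the toy input are hypothetical and not part of the repository. It assumes each ladder line carries one (source index, target index) pair, as the parsing in `reformat` implies. Unlike `reformat` as posted, it also appends the last open group.

```python
def group_links(pairs):
    """Group consecutive (src, tgt) pairs that share a source or target index.

    Hypothetical helper mirroring the first loop of reformat(), but reading
    from a list instead of a file and keeping the final group.
    """
    groups = []
    current = ([], [])
    prev_src, prev_tgt = None, None
    for src, tgt in pairs:
        # A new group starts only when both indices change at once.
        if src != prev_src and tgt != prev_tgt and current != ([], []):
            groups.append(current)
            current = ([], [])
        current[0].append(src)
        current[1].append(tgt)
        prev_src, prev_tgt = src, tgt
    if current != ([], []):
        groups.append(current)  # assumption: keep the last open group
    # Deduplicate and sort the indices within each group.
    return [(sorted(set(s)), sorted(set(t))) for s, t in groups]


# Assumed toy input: source 0 with target 0, source 1 with targets 1 and 2,
# sources 2 and 3 with target 3.
pairs = [(0, 0), (1, 1), (1, 2), (2, 3), (3, 3)]
print(group_links(pairs))
# -> [([0], [0]), ([1], [1, 2]), ([2, 3], [3])]
```

A pair that repeats either the previous source or the previous target index extends the current group, which is how 1-to-many and many-to-1 alignments are formed.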

@tamohannes
Author

@thompsonb, could you please upload the corpora and their corresponding alignment files to the repo?
