Ambiguity in stage 1 #2

Closed
suvanbalu opened this issue Aug 17, 2021 · 6 comments

Comments

@suvanbalu

suvanbalu commented Aug 17, 2021

In stage 1, it is stated that we should preprocess the English and Tamil files and align each Tamil sentence above its English sentence.

But we ran into a problem while splitting: a Tamil sentence can correspond to two or more English sentences, and vice versa. Programmatic splitting is therefore somewhat complex, and manual splitting is also difficult because some text files contain many lines.

So, instead of aligning sentence by sentence, can we align file by file and push to the data repository? @GokulNC @visu

@cmrajan
Contributor

cmrajan commented Aug 17, 2021

Machine translation models are tuned for sentence-to-sentence translation and don't work on long sentences (if you feed in a paragraph, it is still handled at the sentence level internally), hence sentence alignment between source and target is required for training. Also, BLEU scoring works correctly only on sentences.
Can you please post the source and target sentences where there's an issue, so we can provide appropriate suggestions?
@GokulNC

@suvanbalu
Author

suvanbalu commented Aug 18, 2021

image

First we cleaned the dataset and tried to split using the reference numbers (1) in the Tamil file and the corresponding full stops in the English text, since the English text has no reference numbers, but we didn't get accurate results.

So we split both files on full stops instead. In most cases we got the same number of sentences from both files, but the English and Tamil texts were still not properly aligned.

The problem was that the Tamil texts contain unnecessary full stops.

One such example is documented in the PDF below:
Issue stage1.pdf

@GokulNC @cmrajan
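
The split-and-compare approach described in this comment can be sketched roughly as follows. This is only an illustration, not the script used by the author: the function names and the placeholder strings (`T1`, `E1`, ...) are hypothetical, and the splitting is deliberately naive (it treats every `.` as a sentence boundary).

```python
def split_on_full_stops(text):
    """Naively split text at every full stop and drop empty pieces."""
    parts = [p.strip() for p in text.split(".")]
    return [p for p in parts if p]


def compare_counts(tamil_text, english_text):
    """Split both texts and report whether the sentence counts agree."""
    tamil = split_on_full_stops(tamil_text)
    english = split_on_full_stops(english_text)
    return len(tamil) == len(english), tamil, english


# Placeholder data: the Tamil side has one extra full stop inside "T2".
same, tamil, english = compare_counts("T1. T2a.T2b. T3.", "E1. E2. E3.")
print(same, len(tamil), len(english))
```

Note that equal counts alone do not prove alignment: an extra full stop on one side and a missing one elsewhere can cancel out, which matches the misalignment reported above.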

@cmrajan
Contributor

cmrajan commented Aug 18, 2021

Thanks @suvanbalu for the detailed report.

Yes, there are instances where sentences align 1:m or m:1 between the source and target text. In such cases, please drop the corresponding sentences on both sides.

From your example, the sentences will be aligned as below:

[1]சௌதி சொன்னார், "எந்த இடத்திற்கும் தன் இச்சைப்படி செல்லக்கூடிய அந்தப் பெரும்பலம்பொருந்திய பறவையானவன், தனது தாயின் இருப்பிடம் சென்று கடற்கரையில் இறங்கினான்.
"Sauti said, 'Then that bird of great strength and energy and capable of going at will to every place repaired to his mother's side on the other shore of the great ocean.

[2]அங்கே வினதை பந்தயத்தில் தோல்வியுற்று, அடிமையாகச் சோகத்துடன் வாழ்ந்து வந்தாள்.
Thither lived Vinata in affliction, defeated in wager and put into a state of slavery.

[3]ஒருமுறை கத்ரு, வினதையை அழைத்தாள்.வினதை அவளை விழுந்து வணங்கி எழுந்ததும், கத்ரு அவளது மகனின் முன்னிலையிலேயே, “ஓ மென்மையான வினதையே, கடலுக்கு நடுவிலே, யாரும் அணுகமுடியாத ஓர் இடத்திலே, அழகானதும், இன்பம் தருவதுமான பாம்புகளின் வசிப்பிடம் ஒன்று இருகிறது.
Once Kadru calling Vinata who had prostrated herself before the former, addressed her these words in the presence of her son, 'O gentle Vinata, there is in the midst of the ocean, in a remote quarter, a delightful and fair region inhabited by the Nagas.

[4]என்னை அங்கே தூக்கிச் செல்வாயாக” என்றாள்.
Bear me thither!'

[5]இப்படிச் சொன்னதும், அந்த அழகான இறகுகளுடைய பறவையின் தாய், (தனது தோளில்) பாம்புகளின் தாயை சுமந்து சென்றாள்.
At this that mother of the bird of fair feathers bore (on her shoulders) the mother of the snakes.

In the above, sentence [3] should be dropped; such cases can be ignored for now. Also notice that the text between {} was removed from the Tamil, as it is extra, while the text in () was retained, as it is also present in the English version. Sorry for not stating this clearly in the rules.
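
As a minimal sketch of the cleanup rules just described (remove `{}` text, keep `()` text, drop the pairs chosen for removal, and place each Tamil sentence above its English sentence): the helper names, the `drop` parameter, and the blank line between pairs are assumptions made for illustration, not a format prescribed by the project.

```python
import re


def strip_braced(text):
    """Remove non-nested {...} spans; text in (...) is left untouched."""
    without = re.sub(r"\{[^{}]*\}", "", text)
    # Collapse whitespace runs left behind by the removal.
    return re.sub(r"\s{2,}", " ", without).strip()


def format_pairs(tamil, english, drop=()):
    """Return text with each Tamil sentence above its English sentence.

    `drop` holds 1-based indices of pairs to skip (e.g. 1:m alignments).
    """
    if len(tamil) != len(english):
        raise ValueError("sentence lists must have the same length")
    lines = []
    for i, (ta, en) in enumerate(zip(tamil, english), start=1):
        if i in drop:
            continue
        lines.append(strip_braced(ta))
        lines.append(en.strip())
        lines.append("")  # blank line between pairs (assumed layout)
    return "\n".join(lines)


# Placeholder sentences; pair 2 is dropped, as sentence [3] is in the example above.
print(format_pairs(["t1", "t2 {extra}", "t3 (kept)"], ["e1", "e2", "e3"], drop={2}))
```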

@suvanbalu
Author

Should we drop those cases manually? A program for dropping those sentences would be complex, as we don't have a proper reference for dropping and splitting.

@cmrajan

@cmrajan
Contributor

cmrajan commented Aug 18, 2021

Yes, you can drop them manually. There's no requirement that it should be done programmatically.
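
If a little tooling is wanted to support the manual pass, one possible heuristic is to flag tentatively paired sentences whose counts of sentence-ending punctuation differ, and then inspect only those by hand. This is a rough sketch under that assumption; `flag_for_review` and the placeholder pairs are hypothetical, and the heuristic will miss some cases and raise false alarms on abbreviations or quoted speech.

```python
import re

# Characters treated as sentence terminators (an assumption for this sketch).
TERMINATORS = re.compile(r"[.!?]")


def flag_for_review(pairs):
    """Return 1-based indices of (tamil, english) pairs whose terminator counts differ."""
    flagged = []
    for i, (ta, en) in enumerate(pairs, start=1):
        if len(TERMINATORS.findall(ta)) != len(TERMINATORS.findall(en)):
            flagged.append(i)
    return flagged


# Placeholder pairs: the second Tamil entry contains an extra full stop.
pairs = [("T1.", "E1."), ("T2a.T2b.", "E2."), ("T3.", "E3!")]
print(flag_for_review(pairs))
```

The flagged indices are only candidates; the decision to drop a pair stays with the person reviewing it.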

@suvanbalu
Author

Thank you @cmrajan
