Ambiguity in stage 1 #2

suvanbalu · 2021-08-17T08:05:37Z

Stage 1 says we should preprocess the English and Tamil files and place each Tamil sentence above its English sentence.

We ran into a problem while splitting: a single Tamil sentence often corresponds to two or more English sentences, and vice versa. That makes programmatic splitting complicated, and manual splitting is also impractical because some text files have many lines.

So instead of aligning sentence by sentence, can we align file by file and push to the data repository? @GokulNC @visu

cmrajan · 2021-08-17T19:25:25Z

Machine translation models are tuned for sentence-to-sentence translation and don't work well on long inputs (if you feed in a paragraph, it is still handled at sentence level internally), so sentence alignment between source and target is required for training. Also, BLEU scoring works correctly only on sentences.
Could you please post the source and target sentences where there is an issue, so we can provide appropriate suggestions?
@GokulNC

suvanbalu · 2021-08-18T05:13:00Z

First we cleaned the dataset and tried to split using the reference numbers (1) in the Tamil file and the corresponding full stops in the English text, since the English text has no reference numbers. That did not give accurate results.

So we split both files on full stops instead. In most cases we got the same number of sentences from both files, but the English and Tamil texts were still not properly aligned.

The cause was unnecessary full stops in the Tamil text.

One such example is documented in the pdf below
Issue stage1.pdf

@GokulNC @cmrajan
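
A minimal sketch of the full-stop approach described above, and of why it misaligns. It assumes each passage is available as a plain string; the function names and the placeholder strings are hypothetical and are not taken from the project's scripts. It splits both sides on full stops and reports whether the sentence counts agree, which is the cheapest signal that a stray full stop is present.

```python
import re


def split_on_full_stops(text):
    """Split text after every full stop, keeping the full stop on each piece."""
    # The lookbehind splits right after a '.', even when no space follows it.
    parts = re.split(r"(?<=\.)\s*", text)
    return [p.strip() for p in parts if p.strip()]


def compare_counts(tamil_text, english_text):
    """Return (tamil_count, english_count, counts_match) for one passage."""
    ta = split_on_full_stops(tamil_text)
    en = split_on_full_stops(english_text)
    return len(ta), len(en), len(ta) == len(en)


if __name__ == "__main__":
    # Placeholder strings standing in for one Tamil/English passage pair.
    tamil_like = "Kadru called Vinata.Vinata bowed and Kadru spoke to her."
    english = "Kadru, calling Vinata who had bowed, spoke to her."
    print(compare_counts(tamil_like, english))
```

Matching counts do not prove correct alignment, as noted above, so a check like this can only flag passages for manual review.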

cmrajan · 2021-08-18T06:14:11Z

Thanks @suvanbalu for the detailed report.

Yes, there are instances where sentences align 1:m or m:1 between the source and target text. In such cases, please drop the sentences involved in the one-to-many (or many-to-one) alignment.

From your example, the sentences will be aligned as below

[1]சௌதி சொன்னார், "எந்த இடத்திற்கும் தன் இச்சைப்படி செல்லக்கூடிய அந்தப் பெரும்பலம்பொருந்திய பறவையானவன், தனது தாயின் இருப்பிடம் சென்று கடற்கரையில் இறங்கினான்.
"Sauti said, 'Then that bird of great strength and energy and capable of going at will to every place repaired to his mother's side on the other shore of the great ocean.

[2]அங்கே வினதை பந்தயத்தில் தோல்வியுற்று, அடிமையாகச் சோகத்துடன் வாழ்ந்து வந்தாள்.
Thither lived Vinata in affliction, defeated in wager and put into a state of slavery.

[3]ஒருமுறை கத்ரு, வினதையை அழைத்தாள்.வினதை அவளை விழுந்து வணங்கி எழுந்ததும், கத்ரு அவளது மகனின் முன்னிலையிலேயே, “ஓ மென்மையான வினதையே, கடலுக்கு நடுவிலே, யாரும் அணுகமுடியாத ஓர் இடத்திலே, அழகானதும், இன்பம் தருவதுமான பாம்புகளின் வசிப்பிடம் ஒன்று இருகிறது.
Once Kadru calling Vinata who had prostrated herself before the former, addressed her these words in the presence of her son, 'O gentle Vinata, there is in the midst of the ocean, in a remote quarter, a delightful and fair region inhabited by the Nagas.

[4]என்னை அங்கே தூக்கிச் செல்வாயாக” என்றாள்.
Bear me thither!'

[5]இப்படிச் சொன்னதும், அந்த அழகான இறகுகளுடைய பறவையின் தாய், (தனது தோளில்) பாம்புகளின் தாயை சுமந்து சென்றாள்.
At this that mother of the bird of fair feathers bore (on her shoulders) the mother of the snakes.

In the above, sentence [3] should be dropped; such cases can be ignored for now. Also notice that text between {} was removed from the Tamil because it is extra, while text in () was retained because it also appears in the English version. Sorry for not stating this clearly in the rules.
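
The cleanup rules above could be sketched roughly as follows. This is only an illustration under stated assumptions: the pairs are already held as (Tamil, English) tuples, the indices to drop were chosen by hand, and the helper names `strip_braced` and `interleave` are hypothetical. The exact output file layout expected by the data repository is not specified here.

```python
import re


def strip_braced(text):
    """Remove {...} spans and collapse leftover whitespace; (...) is kept."""
    without_braces = re.sub(r"\{[^{}]*\}", "", text)
    return re.sub(r"\s+", " ", without_braces).strip()


def interleave(pairs, drop):
    """Build output lines with each Tamil sentence above its English one.

    pairs: list of (tamil, english) tuples.
    drop:  set of 1-based pair numbers to skip (the 1:m or m:1 cases).
    """
    lines = []
    for number, (tamil, english) in enumerate(pairs, start=1):
        if number in drop:
            continue
        lines.append(strip_braced(tamil))
        lines.append(english.strip())
    return lines


if __name__ == "__main__":
    # Placeholder pairs; pair 2 stands in for a one-to-many case.
    pairs = [("ta-1 {extra note}", "en-1"), ("ta-2a.ta-2b", "en-2"), ("ta-3 (kept)", "en-3 (kept)")]
    for line in interleave(pairs, drop={2}):
        print(line)
```

Passing the indices to drop by hand keeps the manual judgement about 1:m and m:1 cases separate from the mechanical cleanup.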

suvanbalu · 2021-08-18T06:46:14Z

Should we drop those cases manually? A program for dropping those sentences would be complex, since we don't have a reliable reference for dropping and splitting.

@cmrajan

cmrajan · 2021-08-18T14:56:05Z

Yes, you can drop them manually. There's no requirement that it should be done programmatically.

suvanbalu · 2021-08-19T11:47:57Z

Thank you @cmrajan

suvanbalu closed this as completed Aug 19, 2021
