Ambiguity in stage 1 #2
Comments
Machine translation models are tuned for sentence-to-sentence translation and don't work well on long inputs (if you feed in a paragraph, it is still handled at sentence level internally), so sentence alignment between source and target is required for training. Also, BLEU scoring works correctly only on sentences.
First we cleaned the dataset and tried to split using the reference numbers (1) in the Tamil file and the corresponding full stops in the English texts, since the English texts have no reference numbers, but the results were not accurate. So we split both files at full stops instead. In most cases we got the same number of sentences from both files, but the English and Tamil texts were still not properly aligned. The cause was unnecessary full stops in the Tamil texts. One such example is documented in the pdf below.
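The full-stop splitting and count comparison described above can be sketched roughly as follows. This is only an illustration, not the script actually used: the function names `split_on_full_stops` and `compare_counts` are made up here, and the splitter is deliberately naive (it also breaks on abbreviations and decimal numbers).

```python
import re


def split_on_full_stops(text):
    """Naively split text into sentences at every full stop (hypothetical helper)."""
    # Split at each '.', swallowing any whitespace that follows it.
    parts = re.split(r"\.\s*", text)
    # Drop empty pieces, e.g. the one after the final full stop.
    return [p.strip() for p in parts if p.strip()]


def compare_counts(tamil_text, english_text):
    """Return (tamil_count, english_count, counts_match) for two texts."""
    ta = split_on_full_stops(tamil_text)
    en = split_on_full_stops(english_text)
    return len(ta), len(en), len(ta) == len(en)
```

Matching counts are only a necessary check. A stray full stop on one side can be offset by a merged sentence elsewhere, so the totals can agree while individual pairs are still misaligned, which is the situation reported here.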
Thanks @suvanbalu for the detailed report. Yes, there are instances where sentences are aligned 1:m or m:1 between the source and target text. In such cases, please drop the corresponding sentences wherever one sentence maps to many or vice versa. From your example, the sentences will be aligned as below:
[1]சௌதி சொன்னார், "எந்த இடத்திற்கும் தன் இச்சைப்படி செல்லக்கூடிய அந்தப் பெரும்பலம்பொருந்திய பறவையானவன், தனது தாயின் இருப்பிடம் சென்று கடற்கரையில் இறங்கினான்.
[2]அங்கே வினதை பந்தயத்தில் தோல்வியுற்று, அடிமையாகச் சோகத்துடன் வாழ்ந்து வந்தாள்.
[3]ஒருமுறை கத்ரு, வினதையை அழைத்தாள்.வினதை அவளை விழுந்து வணங்கி எழுந்ததும், கத்ரு அவளது மகனின் முன்னிலையிலேயே, “ஓ மென்மையான வினதையே, கடலுக்கு நடுவிலே, யாரும் அணுகமுடியாத ஓர் இடத்திலே, அழகானதும், இன்பம் தருவதுமான பாம்புகளின் வசிப்பிடம் ஒன்று இருகிறது.
[4]என்னை அங்கே தூக்கிச் செல்வாயாக” என்றாள்.
[5]இப்படிச் சொன்னதும், அந்த அழகான இறகுகளுடைய பறவையின் தாய், (தனது தோளில்) பாம்புகளின் தாயை சுமந்து சென்றாள்.
In the above, sentence [3] should be dropped. We can ignore them for now. Also notice that the text between
Should we drop those cases manually? A program for dropping those sentences would be complex, as we don't have a proper reference for dropping and splitting.
Yes, you can drop them manually. There's no requirement that it should be done programmatically.
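Manual dropping is fine, but if the manually reviewed alignment is recorded as groups of sentences, the "keep only 1:1" rule could be applied with a few lines of code. The sketch below assumes a made-up data layout (a list of `(tamil_sentences, english_sentences)` tuples) and a made-up function name `keep_one_to_one`; the strings are placeholders, not text from the dataset.

```python
def keep_one_to_one(groups):
    """Return (tamil, english) pairs from groups that are aligned exactly 1:1.

    `groups` is an assumed layout: each item is a tuple of
    (list of Tamil sentences, list of English sentences) judged to correspond.
    """
    pairs = []
    for tamil_sentences, english_sentences in groups:
        # Drop 1:m and m:1 groups; keep only clean one-to-one alignments.
        if len(tamil_sentences) == 1 and len(english_sentences) == 1:
            pairs.append((tamil_sentences[0], english_sentences[0]))
    return pairs


# Placeholder data: the third group is 1:2, so it is dropped.
groups = [
    (["ta-1"], ["en-1"]),
    (["ta-2"], ["en-2"]),
    (["ta-3"], ["en-3a", "en-3b"]),
    (["ta-4"], ["en-4"]),
]
aligned = keep_one_to_one(groups)
```

The grouping itself still has to come from manual review; this only automates the final filtering step.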
Thank you @cmrajan
In stage 1 it is stated that we should preprocess the English and Tamil files and align each Tamil sentence above its English sentence.
But we faced a problem while splitting, because a single Tamil sentence often corresponded to two or more English sentences, and vice versa. So programmatic splitting is somewhat complex, and manual splitting is also difficult because some text files contain many lines.
So instead of aligning sentence by sentence, can we align file by file and push to the data repository? @GokulNC @visu