Create OpenAI Gym Environment #1
Interface

The initial version of the environment is registered with the environment-id. If a token has a label other than the
Clean text

Quotation mark

The Lang-8 dataset seems to use `` instead of " for quotation marks. Our processing script will replace `` with " and normalize other characters.

```python
raw_text = 'The title is `` closer `` .'
text = clean_text(raw_text)  # 'The title is " closer " .'
```

Ellipsis

The Lang-8 dataset contains many sentences with an ellipsis (. . .). An ellipsis marks the omission of a word or words, so some of these examples are incomplete sentences that do not make much sense. We can therefore remove the 35,947 examples (approx. 3% of the total data) containing an ellipsis.
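`clean_text` is the project's processing routine; its implementation is not shown here, but a minimal sketch of the replacement it describes might look like this (the extra whitespace normalization is an assumption):

```python
import re

def clean_text(text):
    # Replace LaTeX-style `` and '' quotes with plain double quotes
    text = text.replace("``", '"').replace("''", '"')
    # Collapse repeated whitespace (assumed normalization step)
    return re.sub(r"\s+", " ", text).strip()
```

The "normalize other characters" step would extend this with further character mappings.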
Data Preparation

Data Format

The training datasets are available in the M2 format. The example below is a sample from the Lang-8 training dataset with 4 annotations.
Our goal is to process these data from the M2 format to generate a JSON file with the input text and its references, as shown below:

```json
{
  "text": "So , I think if we have to go somewhere on foot , we must put our hat .",
  "references": [
    "So , I think if we have to go somewhere on foot , we must put on our hat .",
    "So , I think when we have to go somewhere on foot , we must put on our hats ."
  ]
}
```
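The M2 format stores a tokenized source sentence (an S line) plus span-based edits (A lines), and applying each annotator's edits to the source yields one reference. A simplified sketch of that application step, assuming edits are already parsed into `(start, end, replacement)` token spans and ignoring noop/detection-only edits:

```python
def apply_edits(src_tokens, edits):
    """Apply M2-style edits, given as (start, end, replacement) token spans."""
    tokens = list(src_tokens)
    offset = 0  # shift caused by earlier insertions/deletions
    for start, end, repl in sorted(edits):
        repl_tokens = repl.split() if repl not in ("", "-NONE-") else []
        tokens[start + offset : end + offset] = repl_tokens
        offset += len(repl_tokens) - (end - start)
    return " ".join(tokens)

src = "So , I think if we have to go somewhere on foot , we must put our hat .".split()
# Insert "on" before token 16 ("our"), the kind of edit shown above
print(apply_edits(src, [(16, 16, "on")]))
```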
Data Cleaning

We perform the following data cleaning techniques while converting the data from M2 to JSON.

Filter based on the number of tokens

In the Lang-8 dataset, there are some very short sentences.
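Together with the long-sentence filter described next, this length check can be sketched as follows (the bounds are assumed placeholders, not the project's actual values):

```python
MIN_TOKENS, MAX_TOKENS = 2, 64  # assumed bounds

def keep_by_length(tokens):
    # Keep an example only if its token count is within the assumed bounds
    return MIN_TOKENS <= len(tokens) <= MAX_TOKENS
```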
Similarly, we would also like to filter out very long sentences, as they can cause huge GPU usage spikes during batch training. We remove an example if its length falls outside these bounds.

Filter based on proper reference sentence

In the English language, a proper sentence follows this rule:
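The rule text is not reproduced here; a typical proper-sentence rule (an assumption, since the project's exact conditions are not shown) is that the sentence starts with a capital letter and ends with terminal punctuation:

```python
def is_proper_sentence(text):
    # Assumed rule: first character uppercase, last token is . ! or ?
    tokens = text.split()
    return bool(tokens) and tokens[0][0].isupper() and tokens[-1] in {".", "!", "?"}
```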
If one of the references does not fulfil these conditions, we discard the example.

Filter based on source-reference similarity

In the Lang-8 training dataset, some edits are so extreme that even a human may not be able to recover the reference sentence from the given source text.
If we apply the edits to the source text above, we get the following reference:

```json
{
  "text": "I think a few days later I can get right .",
  "references": [
    "I think in a few days I will be fine . ( \" can get right \" sounds awkward and unclear )"
  ]
}
```
These sorts of examples can be filtered out by checking the similarity between the source and reference tokens.

Filter based on ellipsis in source

The Lang-8 dataset contains many sentences with an ellipsis (. . .). An ellipsis marks the omission of a word or words, so some of these examples are incomplete sentences that do not make much sense. We therefore remove any example containing an ellipsis.

Other cleaning steps

We perform the following further cleaning steps during the conversion:
Parenthetical texts

Parenthetical texts are used to give extra context information, such that removing them should not make the sentence grammatically incorrect. In the Lang-8 dataset, there are some edits that add parenthetical elements, as in this example (lines 4579 - 4582):

It would be unreasonable to expect a model to correct the text by adding parenthetical elements as in the example above. To deal with this issue, we remove the parenthetical elements from all the texts.
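Stripping parenthetical elements can be done with a simple regular expression; a sketch that handles only non-nested parentheses (an assumed simplification):

```python
import re

def remove_parentheticals(text):
    # Drop non-nested ( ... ) spans, then tidy the spacing
    text = re.sub(r"\([^()]*\)", " ", text)
    return re.sub(r"\s+", " ", text).strip()
```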
Text-based OpenAI Gym Environment
Data Format
A JSON file with the source sentence and a list of target sentences.
Remove emojis: the M2 file does not contain any emojis.

Episode
Reward function
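The episode and reward details are not spelled out here; purely as an illustration, a text-based environment following the Gym reset/step convention might be shaped as below. Every name, the single-step episode, and the exact-match reward are assumptions, and a real version would subclass `gym.Env` and be registered under the environment-id:

```python
import random

class GECEnv:
    """Sketch of a Gym-style text-correction environment (assumed design)."""

    def __init__(self, examples):
        # examples: list of {"text": ..., "references": [...]} records
        self.examples = examples
        self.current = None

    def reset(self):
        self.current = random.choice(self.examples)
        return self.current["text"]  # observation is the source sentence

    def step(self, action):
        # Assumed reward: 1.0 if the proposed correction matches any reference
        reward = 1.0 if action in self.current["references"] else 0.0
        done = True  # single-step episode (assumption)
        return self.current["text"], reward, done, {}
```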