Create OpenAI Gym Environment #1

RajK853 · 2022-08-01T08:11:52Z

Text-based OpenAI Gym Environment

Data Format

A JSON file with source and list of target sentences.

Script to convert M2 into JSON
Data cleaning:
- Normalize characters (both source and references)
- ~~Remove emojis~~ M2 file does not have any emojis
- Fix typos using pyspellchecker (source sentences only; the references are hopefully already correct)
Data filtering:
- Number of tokens in the original sentences
- References ending properly
- Mean similarity between the original and reference sentences
Generate metadata [Optional]
- Number of sentences
- Number of different references per sentence

Episode

Reset
- Select the state
  - Source and target sentences
Take a step
- Calculate the reward
  - Generate intermediate tokens
  - Compute GLEU scores with target sentences
  - Subtract \epsilon from the reward
- Obtain the new state
- Check episode termination
  - N-steps attempted
  - Token length condition
  - All keep actions condition
- Return new state, reward and done
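
The episode loop above can be sketched as a small Gym-like class. This is a minimal sketch under stated assumptions, not the actual environment: the label set (`$DELETE`, `$REPLACE_x`, `$APPEND_x`), the class name `TextEnvSketch`, the parameters `max_steps`, `max_tokens`, `epsilon`, and the exact form of the token length condition are all hypothetical. The scorer is passed in as `score_fn` and stands in for the GLEU computation.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

KEEP = "$KEEP"
Tokens = List[str]


def apply_labels(tokens: Sequence[str], labels: Sequence[str]) -> Tokens:
    """Apply one label per token. The label set here is an assumption."""
    output: Tokens = []
    for token, label in zip(tokens, labels):
        if label == "$DELETE":
            continue
        if label.startswith("$REPLACE_"):
            output.append(label[len("$REPLACE_"):])
        elif label.startswith("$APPEND_"):
            output.extend([token, label[len("$APPEND_"):]])
        else:  # $KEEP, or an unknown label treated like $KEEP
            output.append(token)
    return output


class TextEnvSketch:
    """Gym-like reset/step interface without depending on the gym package."""

    def __init__(
        self,
        data: List[Dict],
        score_fn: Callable[[Tokens, List[Tokens]], float],
        max_steps: int = 5,
        max_tokens: int = 50,
        epsilon: float = 0.01,
        seed: int = 0,
    ) -> None:
        self.data = data
        self.score_fn = score_fn  # stand-in for the GLEU scorer
        self.max_steps = max_steps
        self.max_tokens = max_tokens
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.tokens: Tokens = []
        self.references: List[Tokens] = []
        self.steps = 0

    def reset(self) -> Tokens:
        # Select the state: one source sentence and its target sentences.
        example = self.rng.choice(self.data)
        self.tokens = example["text"].split()
        self.references = [ref.split() for ref in example["references"]]
        self.steps = 0
        return list(self.tokens)

    def step(self, labels: Sequence[str]) -> Tuple[Tokens, float, bool]:
        if len(labels) != len(self.tokens):
            raise ValueError("Expected exactly one label per token")
        # Generate intermediate tokens, score them, subtract epsilon.
        new_tokens = apply_labels(self.tokens, labels)
        reward = self.score_fn(new_tokens, self.references) - self.epsilon
        self.tokens = new_tokens
        self.steps += 1
        # Termination: step budget, token length, or all-$KEEP actions.
        done = bool(
            self.steps >= self.max_steps
            or not self.tokens
            or len(self.tokens) > self.max_tokens
            or all(label == KEEP for label in labels)
        )
        return list(self.tokens), reward, done
```

A policy would call `reset()` once and then `step(labels)` until `done` is true; an all-`$KEEP` action ends the episode because the agent signals that no further edit is needed.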

Reward function

$$r(s_t, a_t) = r_{\text{gleu}} + r_{\text{delay}} + r_{\text{invalid-label}}$$
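
One way to read the three reward terms is sketched here. Several things are assumptions: the GLEU variant (this is the n-gram overlap form, i.e. matches divided by the larger of the hypothesis and reference n-gram totals, best over references; the GEC-specific GLEU that also looks at the source is not shown), the delay term being a constant `-epsilon` per step, and the invalid-label term being a fixed penalty per invalid label. The names `total_reward`, `epsilon` and `invalid_penalty` and their default values are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def ngram_counts(tokens: Sequence[str], max_n: int = 4) -> Counter:
    """Count all n-grams of order 1..max_n in a token sequence."""
    counts: Counter = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def sentence_gleu(hypothesis: Sequence[str],
                  references: List[Sequence[str]],
                  max_n: int = 4) -> float:
    """Overlap-based GLEU: min(precision, recall), best over references."""
    hyp_counts = ngram_counts(hypothesis, max_n)
    hyp_total = sum(hyp_counts.values())
    best = 0.0
    for reference in references:
        ref_counts = ngram_counts(reference, max_n)
        overlap = sum((hyp_counts & ref_counts).values())
        denominator = max(hyp_total, sum(ref_counts.values()))
        if denominator > 0:
            best = max(best, overlap / denominator)
    return best


def total_reward(hypothesis: Sequence[str],
                 references: List[Sequence[str]],
                 num_invalid_labels: int = 0,
                 epsilon: float = 0.01,
                 invalid_penalty: float = 0.1) -> float:
    """r = r_gleu + r_delay + r_invalid-label (penalty values are assumed)."""
    r_gleu = sentence_gleu(hypothesis, references)
    r_delay = -epsilon                               # constant cost per step
    r_invalid = -invalid_penalty * num_invalid_labels
    return r_gleu + r_delay + r_invalid
```

The constant delay term makes every extra step slightly costly, so an agent is nudged to reach a good correction in as few steps as possible.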

RajK853 · 2022-08-02T13:15:12Z

Interface

The initial version of the environment is registered with the environment ID gec-v0. It uses ANSI escape codes to render the current state with highlighted text.

If a token has a label other than the $KEEP label, the token and its reward value are highlighted in green and its label is highlighted in red.

$KEEP labels are not shown beside their tokens.
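
The highlighting rules can be sketched with plain ANSI escape codes. The layout (reward in parentheses, label in square brackets) and the function name `render_tokens` are assumptions for illustration; only the colour rules come from the description above.

```python
from typing import Sequence

GREEN = "\033[32m"
RED = "\033[31m"
RESET = "\033[0m"
KEEP = "$KEEP"


def render_tokens(tokens: Sequence[str],
                  labels: Sequence[str],
                  rewards: Sequence[float]) -> str:
    """Render tokens; non-$KEEP tokens get a green token/reward and a red label."""
    parts = []
    for token, label, reward in zip(tokens, labels, rewards):
        if label == KEEP:
            parts.append(token)  # $KEEP labels are not shown
        else:
            parts.append(
                f"{GREEN}{token} ({reward:+.2f}){RESET} {RED}[{label}]{RESET}"
            )
    return " ".join(parts)
```

Printing the returned string in an ANSI-capable terminal shows the colours; in a plain log file the escape sequences appear as raw characters.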

RajK853 · 2022-08-02T18:19:01Z

Clean text

Quotation mark

The Lang-8 dataset seems to use `` instead of " for quotation marks.

...
Line 798 S The title is `` closer `` .
Line 799 A -1 -1|||noop|||-NONE-|||REQUIRED|||-NONE-|||
...

Our processing script will replace `` with " and normalize other characters.

raw_text = 'The title is `` closer `` .'
text = clean_text(raw_text)                # 'The title is " closer " .'
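
A possible shape for `clean_text` is sketched here. The actual processing script is not shown in this issue, so the replacement table and the use of Unicode NFKC normalization are assumptions.

```python
import unicodedata

# Hypothetical replacement table: Penn Treebank style quotes and curly quotes.
REPLACEMENTS = {
    "``": '"',
    "''": '"',
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
}


def clean_text(text: str) -> str:
    """Normalize characters and collapse repeated whitespace."""
    text = unicodedata.normalize("NFKC", text)
    for old, new in REPLACEMENTS.items():
        text = text.replace(old, new)
    return " ".join(text.split())
```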

Ellipsis

The Lang-8 dataset contains many sentences with an ellipsis (. . .).

...
Line 4012672 S For example , racing games , action games , puzzle games and more . . .
Line 4012673 A -1 -1|||noop|||-NONE-|||REQUIRED|||-NONE-|||0
...

An ellipsis marks the omission of one or more words. Some of these examples are therefore incomplete sentences that do not make much sense, so we can remove the 35,947 examples (approx. 3% of the total data) that contain an ellipsis.

RajK853 · 2022-08-07T17:19:52Z

Data Preparation

Data Format

The training datasets are available in the M2 format.

The example below is a sample from the Lang-8 training dataset with edits from 4 annotators (IDs 0 to 3).

S So , I think if we have to go somewhere on foot , we must put our hat .
A 16 16|||M:PREP|||on|||REQUIRED|||-NONE-|||0
A 16 16|||M:PREP|||on|||REQUIRED|||-NONE-|||1
A 4 5|||R:OTHER|||when|||REQUIRED|||-NONE-|||2
A 16 16|||M:PREP|||on|||REQUIRED|||-NONE-|||2
A 17 18|||R:NOUN:NUM|||hats|||REQUIRED|||-NONE-|||2
A 16 16|||M:PREP|||on|||REQUIRED|||-NONE-|||3

Our goal is to process these data from the M2 format to generate a JSON file with input text and its references as shown below.

{
    "text" : "So , I think if we have to go somewhere on foot , we must put our hat .",
    "references": [
      "So , I think if we have to go somewhere on foot , we must put on our hat .",
      "So , I think when we have to go somewhere on foot , we must put on our hats ."
    ]
  }

Note that we get only 2 distinct references from the 4 annotators because the edits from annotators 0, 1 and 3 produce exactly the same reference (the first one).
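
The conversion can be sketched as follows. This is a minimal sketch, not the project's script: the helper names `parse_m2_block`, `apply_edits` and `m2_block_to_example` are hypothetical, and it assumes that a `noop` annotation yields the unchanged source as a reference.

```python
from typing import Dict, List, Sequence, Tuple

Edit = Tuple[int, int, str]  # (start, end, correction)


def parse_m2_block(lines: Sequence[str]) -> Tuple[List[str], Dict[int, List[Edit]]]:
    """Split one M2 block into source tokens and edits grouped by annotator."""
    source: List[str] = []
    edits: Dict[int, List[Edit]] = {}
    for line in lines:
        if line.startswith("S "):
            source = line[2:].split()
        elif line.startswith("A "):
            fields = line[2:].split("|||")
            start, end = (int(x) for x in fields[0].split())
            annotator = int(fields[-1])
            annotator_edits = edits.setdefault(annotator, [])
            if fields[1] != "noop":  # noop: annotator made no change
                annotator_edits.append((start, end, fields[2]))
    return source, edits


def apply_edits(tokens: Sequence[str], edits: Sequence[Edit]) -> List[str]:
    """Apply token-span edits left to right, tracking the index shift."""
    output = list(tokens)
    offset = 0
    for start, end, correction in sorted(edits, key=lambda e: (e[0], e[1])):
        new_tokens = correction.split()
        output[start + offset:end + offset] = new_tokens
        offset += len(new_tokens) - (end - start)
    return output


def m2_block_to_example(lines: Sequence[str]) -> dict:
    """Build {"text", "references"} with duplicate references removed."""
    source, edits = parse_m2_block(lines)
    references: List[str] = []
    for annotator in sorted(edits):
        reference = " ".join(apply_edits(source, edits[annotator]))
        if reference not in references:
            references.append(reference)
    return {"text": " ".join(source), "references": references}


example_block = [
    "S So , I think if we have to go somewhere on foot , we must put our hat .",
    "A 16 16|||M:PREP|||on|||REQUIRED|||-NONE-|||0",
    "A 16 16|||M:PREP|||on|||REQUIRED|||-NONE-|||1",
    "A 4 5|||R:OTHER|||when|||REQUIRED|||-NONE-|||2",
    "A 16 16|||M:PREP|||on|||REQUIRED|||-NONE-|||2",
    "A 17 18|||R:NOUN:NUM|||hats|||REQUIRED|||-NONE-|||2",
    "A 16 16|||M:PREP|||on|||REQUIRED|||-NONE-|||3",
]
example = m2_block_to_example(example_block)
```

Edits are applied in order of their start index while an offset tracks how much earlier insertions or deletions have shifted the token positions, since M2 spans always refer to the original source tokens.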

Data Cleaning

We perform the following data cleaning techniques while converting the data from M2 to JSON:

Filter based on the number of tokens

In the Lang-8 dataset, there are some short sentences as shown below:

Line 370 S Why ?
Line 371 A -1 -1|||noop|||-NONE-|||REQUIRED|||-NONE-|||0

Similarly, we would also like to filter out very long sentences, as they can cause large spikes in GPU usage during batch training.

We keep an example only if
$$N_{min} \lt N_{token} \lt N_{max}$$
where
$N_{token} = \text{Number of tokens}$,
$N_{min} = \text{Minimum number of tokens}$,
$N_{max} = \text{Maximum number of tokens}$
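
The token-count condition can be sketched as a predicate that returns True for examples to keep. The default thresholds `n_min=3` and `n_max=50` are placeholders, not values taken from the project.

```python
def within_token_limits(text: str, n_min: int = 3, n_max: int = 50) -> bool:
    """Return True if N_min < N_token < N_max for a whitespace-tokenized text."""
    n_tokens = len(text.split())
    return n_min < n_tokens < n_max
```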

Filter based on proper reference sentence

In English, a proper sentence follows these rules:

The starting token is capitalized.
The sentence ends with one of the following tokens: ., !, ?, "

If any of the references does not fulfil the above conditions, we discard the example.
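
A minimal sketch of this check on whitespace-tokenized text follows. The function names are hypothetical, and "capitalized" is interpreted here as "the first character of the first token is uppercase", which would reject a sentence that opens with a quotation mark.

```python
from typing import Sequence

VALID_END_TOKENS = {".", "!", "?", '"'}


def is_proper_sentence(text: str) -> bool:
    """Check the capitalized-start and valid-end-token rules."""
    tokens = text.split()
    if not tokens:
        return False
    return tokens[0][0].isupper() and tokens[-1] in VALID_END_TOKENS


def all_references_proper(references: Sequence[str]) -> bool:
    """An example is kept only if every reference is a proper sentence."""
    return all(is_proper_sentence(reference) for reference in references)
```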

Filter based on source-reference similarity

In the Lang-8 training dataset, some edits are so extreme that even a human may not be able to obtain the reference sentence from the given source text.

Line 11217 S I think a few days later I can get right .
Line 11218 A 2 2|||M:PREP|||in|||REQUIRED|||-NONE-|||0
Line 11219 A 4 5|||R:NOUN|||daysI|||REQUIRED|||-NONE-|||0
Line 11220 A 5 7|||R:OTHER|||will be fine . ( ``|||REQUIRED|||-NONE-|||0
Line 11221 A 10 11|||R:OTHER|||`` sounds awkward and unclear )|||REQUIRED|||-NONE-|||0

If we apply the edits to the source text above, we get the following reference:

{
    "text": "I think a few days later I can get right .",
    "references": [
        "I think in a few daysI will be fine . ( \" can get right \" sounds awkward and unclear )"
    ]
}

Note that the "days" and "I" tokens are merged in the reference because of a faulty annotation: the annotator forgot to put whitespace between them in the following edit.
Line 11219 A 4 5|||R:NOUN|||daysI|||REQUIRED|||-NONE-|||0

These sorts of examples can be filtered out by checking the similarity between the source and reference tokens as follows:
$$\frac{1}{N_{refs}} \sum_{i=1}^{N_{refs}} similarity(tokens_{source}, tokens_{reference_i}) \ge S_{min}$$
where
$N_{refs} = \text{Number of references}$
$S_{min} = \text{Minimum similarity value}$
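
The similarity function itself is not specified here. The sketch below assumes the token-level ratio from Python's `difflib.SequenceMatcher` as a stand-in, and the threshold `s_min=0.6` is a placeholder.

```python
from difflib import SequenceMatcher
from typing import Sequence


def token_similarity(source: Sequence[str], reference: Sequence[str]) -> float:
    """Similarity in [0, 1] between two token sequences (assumed metric)."""
    return SequenceMatcher(None, list(source), list(reference), autojunk=False).ratio()


def mean_similarity(text: str, references: Sequence[str]) -> float:
    """Average similarity between the source and each reference."""
    source_tokens = text.split()
    scores = [token_similarity(source_tokens, ref.split()) for ref in references]
    return sum(scores) / len(scores)


def passes_similarity_filter(text: str, references: Sequence[str],
                             s_min: float = 0.6) -> bool:
    """Keep the example only if the mean similarity reaches s_min."""
    return bool(references) and mean_similarity(text, references) >= s_min
```

With this stand-in metric, the heavily rewritten example from line 11217 scores well below the placeholder threshold, while a reference that differs by a single inserted token stays close to 1.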

Filter based on ellipsis in source

The Lang-8 dataset contains many sentences with an ellipsis (. . .).

...
Line 4012672 S For example , racing games , action games , puzzle games and more . . .
Line 4012673 A -1 -1|||noop|||-NONE-|||REQUIRED|||-NONE-|||0
...

An ellipsis marks the omission of one or more words. Some of these examples are therefore incomplete sentences that do not make much sense, so we remove any example that contains an ellipsis.
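
Detecting an ellipsis in tokenized text can be sketched with a regular expression. The pattern is an assumption: it treats three or more dots, with or without whitespace between them, or the single Unicode ellipsis character, as an ellipsis.

```python
import re

# Three or more dots, optionally separated by whitespace, or the "…" character.
ELLIPSIS_PATTERN = re.compile(r"\.(?:\s*\.){2,}|\u2026")


def has_ellipsis(text: str) -> bool:
    """Return True if the text contains an ellipsis."""
    return ELLIPSIS_PATTERN.search(text) is not None
```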

Other cleanings

We perform the following further cleaning steps during the conversion:

Clean the source and reference texts by normalizing characters, such as ’ to ' or '' to ".
Correct spelling errors in the source text before generating the references.

RajK853 · 2022-08-13T13:47:56Z

Parenthetical texts

Parenthetical texts give extra context, so removing them should not make the sentence grammatically incorrect.

Meena studied (all night) for the grammar test.
Meena studied for the grammar test.

In the Lang-8 dataset, some edits add parenthetical elements, as in this example (lines 4579 - 4582):

text = For example , today I ordered some clothes on the internet shop !
reference = For example , today I ordered some clothes online ( you do n't say " internet shop " ) .

It would be unreasonable to expect a model to correct the text by adding parenthetical elements as in the above example. To deal with this issue, we remove the parenthetical elements from all the texts.

text = For example , today I ordered some clothes on the internet shop !
reference = For example , today I ordered some clothes online ( you do n't say " internet shop " ) .
cleaned = For example , today I ordered some clothes online .
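
Removing parenthetical elements can be sketched with a regular expression applied repeatedly, so that nested parentheses are also handled. The function name `remove_parentheticals` is hypothetical, and unbalanced parentheses are left untouched in this sketch.

```python
import re

# An innermost "( ... )" group together with the whitespace before it.
PARENTHETICAL_PATTERN = re.compile(r"\s*\([^()]*\)")


def remove_parentheticals(text: str) -> str:
    """Strip parenthetical elements, innermost first, then tidy whitespace."""
    previous = None
    while previous != text:
        previous = text
        text = PARENTHETICAL_PATTERN.sub("", text)
    return " ".join(text.split())
```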

RajK853 self-assigned this Aug 1, 2022

RajK853 added the documentation label Aug 2, 2022

RajK853 closed this as completed Aug 7, 2022
