Create OpenAI Gym Environment #1

Closed
RajK853 opened this issue Aug 1, 2022 · 4 comments

RajK853 commented Aug 1, 2022

Text-based OpenAI Gym Environment

Data Format

A JSON file containing each source sentence and a list of its target sentences.

  • Script to convert M2 into JSON
  • Data cleaning:
    • Normalize characters (both source and references)
    • Remove emojis (not needed: the M2 file does not contain any emojis)
    • Fix typos using pyspellchecker (source sentences only; the references are assumed to be correct already)
  • Data filtering:
    • Number of tokens in the original sentences
    • References ending properly
    • Mean similarity between the original and reference sentences
  • Generate metadata [Optional]
    • Number of sentences
    • Number of different references per sentence

Episode

  • Reset
    • Select the state
      • Source and target sentences
  • Take a step
    • Calculate the reward
      • Generate intermediate tokens
      • Compute GLEU scores with target sentences
      • Subtract \epsilon from the reward
    • Obtain the new state
    • Check episode termination
      • N-steps attempted
      • Token length condition
      • All keep actions condition
    • Return new state, reward and done
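
The three termination checks listed above could be sketched as follows. This is only a minimal sketch: the function name, the default limits and the inclusive length bounds are assumptions for illustration and are not taken from the environment code.

```python
KEEP_LABEL = "$KEEP"


def is_episode_done(n_steps, tokens, labels,
                    max_steps=5, min_tokens=1, max_tokens=50):
    """Return True if any of the three termination conditions holds.

    All default values are hypothetical placeholders.
    """
    # 1. N-steps attempted
    if n_steps >= max_steps:
        return True
    # 2. Token length condition (sentence became too short or too long)
    if not (min_tokens <= len(tokens) <= max_tokens):
        return True
    # 3. All keep actions condition (the agent left every token unchanged)
    if all(label == KEEP_LABEL for label in labels):
        return True
    return False
```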

Reward function

$$r(s_t, a_t) = r_{gleu} + r_{delay} + r_{invalid-label}$$
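
A minimal sketch of how the three terms might be combined is shown below. It assumes that $r_{delay}$ is the constant $-\epsilon$ from the episode outline, that $r_{gleu}$ is the GLEU score itself, and that $r_{invalid-label}$ is a fixed penalty per invalid label. The parameter names and default values are hypothetical.

```python
def compute_reward(gleu_score, epsilon=0.01,
                   n_invalid_labels=0, invalid_penalty=0.1):
    """Sum the three reward terms (all defaults are assumptions)."""
    r_gleu = gleu_score                                   # GLEU against the targets
    r_delay = -epsilon                                    # small cost for every step
    r_invalid_label = -invalid_penalty * n_invalid_labels # penalty for invalid labels
    return r_gleu + r_delay + r_invalid_label
```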

@RajK853 RajK853 self-assigned this Aug 1, 2022

RajK853 commented Aug 2, 2022

Interface

The initial version of the environment is registered with the environment-id gec-v0. It uses ANSI to visualize the current state with highlighted texts as shown below:

[image: ANSI rendering of the current state with highlighted tokens]

If a token has a label other than the $KEEP label, that token and its reward value are highlighted in green, and its label is highlighted in red.

$KEEP labels are not shown beside their tokens.

@RajK853 RajK853 added the documentation Improvements or additions to documentation label Aug 2, 2022

RajK853 commented Aug 2, 2022

Clean text

Quotation mark

The Lang-8 dataset seems to use `` instead of " for quotation marks.

...
Line 798 S The title is `` closer `` .
Line 799 A -1 -1|||noop|||-NONE-|||REQUIRED|||-NONE-|||
...

Our processing script will replace `` with " and normalize other characters.

raw_text = 'The title is `` closer `` .'
text = clean_text(raw_text)                # 'The title is " closer " .'
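
The body of clean_text is not shown here, so the following is only a sketch of what it might look like. The replacement table and the use of NFKC normalization are assumptions.

```python
import unicodedata

# Hypothetical replacement table: PTB-style and curly quotes to plain ASCII quotes
REPLACEMENTS = {
    "``": '"',
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
}


def clean_text(text):
    """Normalize characters and collapse repeated whitespace."""
    text = unicodedata.normalize("NFKC", text)
    for old, new in REPLACEMENTS.items():
        text = text.replace(old, new)
    return " ".join(text.split())
```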

Ellipsis

The Lang-8 dataset contains many sentences with an ellipsis (. . .).

...
Line 4012672 S For example , racing games , action games , puzzle games and more . . .
Line 4012673 A -1 -1|||noop|||-NONE-|||REQUIRED|||-NONE-|||0
...

An ellipsis marks the omission of one or more words, so some of these examples are incomplete sentences that do not make much sense. We can therefore remove the 35,947 examples (approx. 3% of the total data) that contain an ellipsis.


RajK853 commented Aug 7, 2022

Data Preparation

Data Format

The training datasets are available in the M2 format.

The example below is a sample from the Lang-8 training dataset with edits from 4 annotators (ids 0 to 3).

S So , I think if we have to go somewhere on foot , we must put our hat .
A 16 16|||M:PREP|||on|||REQUIRED|||-NONE-|||0
A 16 16|||M:PREP|||on|||REQUIRED|||-NONE-|||1
A 4 5|||R:OTHER|||when|||REQUIRED|||-NONE-|||2
A 16 16|||M:PREP|||on|||REQUIRED|||-NONE-|||2
A 17 18|||R:NOUN:NUM|||hats|||REQUIRED|||-NONE-|||2
A 16 16|||M:PREP|||on|||REQUIRED|||-NONE-|||3

Our goal is to process these data from the M2 format to generate a JSON file with input text and its references as shown below.

{
    "text": "So , I think if we have to go somewhere on foot , we must put our hat .",
    "references": [
        "So , I think if we have to go somewhere on foot , we must put on our hat .",
        "So , I think when we have to go somewhere on foot , we must put on our hats ."
    ]
}

Note that we get only 2 different references from the 4 annotators because the edits from annotators 0, 1 and 3 produce the same reference (the first one).
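
The conversion of one M2 block could be sketched as below. This is an illustrative sketch and not the actual conversion script: the function names are hypothetical, and it assumes that every A line ends with the annotator id and that the token offsets refer to the original source tokens.

```python
def apply_edits(source_tokens, edits):
    """Apply (start, end, correction_tokens) edits to the source tokens."""
    tokens = list(source_tokens)
    # Apply from right to left so that earlier token offsets stay valid
    for start, end, correction in sorted(edits, key=lambda e: (e[0], e[1]), reverse=True):
        tokens[start:end] = correction
    return " ".join(tokens)


def m2_block_to_example(block):
    """Convert one M2 block (an S line and its A lines) into a dict."""
    lines = [line for line in block.strip().splitlines() if line]
    source_tokens = lines[0][2:].split()            # drop the leading "S "
    edits_by_annotator = {}
    for line in lines[1:]:
        fields = line[2:].split("|||")              # drop the leading "A "
        start, end = (int(x) for x in fields[0].split())
        annotator = int(fields[-1])
        annotator_edits = edits_by_annotator.setdefault(annotator, [])
        if fields[1] != "noop":                     # noop: reference equals source
            annotator_edits.append((start, end, fields[2].split()))
    references = []
    for annotator in sorted(edits_by_annotator):
        reference = apply_edits(source_tokens, edits_by_annotator[annotator])
        if reference not in references:             # keep unique references only
            references.append(reference)
    return {"text": " ".join(source_tokens), "references": references}
```

The returned dictionaries can be collected in a list and written out with json.dump.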

Data Cleaning

We perform the following data cleaning techniques while converting the data from M2 to JSON:

Filter based on the number of tokens

In the Lang-8 dataset, there are some short sentences as shown below:

Line 370 S Why ?
Line 371 A -1 -1|||noop|||-NONE-|||REQUIRED|||-NONE-|||0

Similarly, we would also like to filter out very long sentences, as they can cause large GPU usage spikes during batch training.

We keep an example only if
$$N_{min} \lt N_{token} \lt N_{max}$$
where
$N_{token} = \text{Number of tokens}$,
$N_{min} = \text{Minimum number of tokens}$,
$N_{max} = \text{Maximum number of tokens}$
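
As a sketch, the length check can be written as a small predicate. The threshold values below are placeholders and not the values used for the dataset.

```python
def has_valid_length(text, n_min=3, n_max=50):
    """Return True if the whitespace-token count lies strictly between the bounds."""
    n_tokens = len(text.split())
    return n_min < n_tokens < n_max
```

With these placeholder thresholds, the short sentence "Why ?" (2 tokens) would be filtered out.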

Filter based on proper reference sentence

In the English language, a proper sentence follows these rules:

  1. The starting token is capitalized.
  2. The sentence ends with one of the following tokens: ., !, ?, "

If any of the references does not fulfil both conditions, we discard the example.
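
The two rules can be checked per reference as in the sketch below. The function names are hypothetical, and the texts are assumed to be tokenized with whitespace as in the M2 files.

```python
END_TOKENS = {".", "!", "?", '"'}


def is_proper_sentence(text):
    """Check the capitalization and end-token rules on a tokenized sentence."""
    tokens = text.split()
    if not tokens:
        return False
    return tokens[0][0].isupper() and tokens[-1] in END_TOKENS


def has_proper_references(references):
    """An example is kept only if every reference is a proper sentence."""
    return all(is_proper_sentence(reference) for reference in references)
```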

Filter based on source-reference similarity

In the Lang-8 training dataset, some edits are so extreme that even a human may not be able to obtain the reference sentence based on the given source text.

Line 11217 S I think a few days later I can get right .
Line 11218 A 2 2|||M:PREP|||in|||REQUIRED|||-NONE-|||0
Line 11219 A 4 5|||R:NOUN|||daysI|||REQUIRED|||-NONE-|||0
Line 11220 A 5 7|||R:OTHER|||will be fine . ( ``|||REQUIRED|||-NONE-|||0
Line 11221 A 10 11|||R:OTHER|||`` sounds awkward and unclear )|||REQUIRED|||-NONE-|||0

If we apply the edits to the source text above, we get the following reference:

{
    "text": "I think a few days later I can get right .",
    "references": [
        "I think in a few daysI will be fine . ( \" can get right \" sounds awkward and unclear )"
    ]
}

Please note that the "days" and "I" tokens are merged together in the reference because of the faulty annotation in the edit where the annotator forgot to put whitespace between them.

Line 11219 A 4 5|||R:NOUN|||daysI|||REQUIRED|||-NONE-|||0

These sorts of examples can be filtered out by checking the mean similarity between the source and reference tokens. We keep an example only if
$$\frac{1}{N_{refs}} \sum_{i=1}^{N_{refs}} \text{similarity}(tokens_{source}, tokens_{reference_i}) \ge S_{min}$$
where
$N_{refs} = \text{Number of references}$,
$S_{min} = \text{Minimum similarity value}$
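
The similarity function is not specified here. The sketch below assumes the token-level ratio from difflib.SequenceMatcher as a stand-in, and the default value of s_min is a placeholder.

```python
from difflib import SequenceMatcher


def token_similarity(source, reference):
    """Similarity in [0, 1] between two whitespace-tokenized sentences."""
    matcher = SequenceMatcher(None, source.split(), reference.split())
    return matcher.ratio()


def mean_similarity(source, references):
    """Mean similarity between the source and all of its references."""
    scores = [token_similarity(source, reference) for reference in references]
    return sum(scores) / len(scores)


def is_similar_enough(source, references, s_min=0.6):
    """Keep the example only if the mean similarity reaches s_min."""
    return mean_similarity(source, references) >= s_min
```

With this stand-in metric, the extreme example above scores well below an example where only a single token is inserted.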

Filter based on ellipsis in source

The Lang-8 dataset contains many sentences with an ellipsis (. . .).

...
Line 4012672 S For example , racing games , action games , puzzle games and more . . .
Line 4012673 A -1 -1|||noop|||-NONE-|||REQUIRED|||-NONE-|||0
...

An ellipsis marks the omission of one or more words, so some of these examples are incomplete sentences that do not make much sense. We therefore remove any example containing an ellipsis.
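
Detecting the ellipsis can be sketched with a regular expression. The pattern below is an assumption: it matches three dots with optional whitespace between them and also the single Unicode ellipsis character.

```python
import re

# Matches ". . .", "..." and the single-character ellipsis
ELLIPSIS_PATTERN = re.compile(r"\.\s*\.\s*\.|\u2026")


def has_ellipsis(text):
    """Return True if the text contains an ellipsis."""
    return ELLIPSIS_PATTERN.search(text) is not None
```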

Other cleanings

We perform the following further cleaning steps during the conversion:

  1. Clean the source and reference texts by normalizing characters, for example replacing `` with ".
  2. Correct the spelling errors in the source text before generating the references.

@RajK853 RajK853 closed this as completed Aug 7, 2022

RajK853 commented Aug 13, 2022

Parenthetical texts

Parenthetical texts give extra context information, so removing them should not make the sentence grammatically incorrect.

Meena studied (all night) for the grammar test.
Meena studied for the grammar test.

In the Lang-8 dataset, some edits add parenthetical elements, as in this example (lines 4579 - 4582):

text = For example , today I ordered some clothes on the internet shop !
reference = For example , today I ordered some clothes online ( you do n't say " internet shop " ) .

It would be unreasonable to expect a model to correct the text by adding parenthetical elements as in the above example. To deal with this issue, we remove the parenthetical elements from all the texts.

text = For example , today I ordered some clothes on the internet shop !
reference = For example , today I ordered some clothes online ( you do n't say " internet shop " ) .
cleaned = For example , today I ordered some clothes online .
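
Removing the parenthetical elements could be sketched as below. The function name and the regular expression are assumptions. The loop handles nested parentheses by removing the innermost pair first.

```python
import re

# Matches an innermost "( ... )" group without nested parentheses
PAREN_PATTERN = re.compile(r"\([^()]*\)")


def remove_parentheticals(text):
    """Remove all parenthetical elements and collapse the leftover whitespace."""
    previous = None
    while previous != text:          # repeat until nothing more is removed
        previous = text
        text = PAREN_PATTERN.sub(" ", text)
    return " ".join(text.split())
```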