# Automated Grammatical Error Correction using BERT and GPT2
In this notebook, we'll look at word probabilities, sentence probabilities and how both can assist with improved spell checking and automatic grammatical error correction.
This notebook fleshes out, expands and clarifies some ideas mentioned in:
* `The Unreasonable Effectiveness of Transformer Language Models in Grammatical Error Correction` by Dimitris Alikaniotis, Vipul Raheja [https://arxiv.org/abs/1906.01733]
* `BERT has a Mouth, and It Must Speak: BERT as a Markov Random Field Language Model` by Alex Wang, Kyunghyun Cho [https://arxiv.org/pdf/1902.04094.pdf]

### Key Points:
* "We extract the probability of a sentence from BERT, by iteratively masking every word in the sentence and then summing the log probabilities. While this approach is far from ideal, it has been shown (Wang and Cho, 2019) that it approximates the log-likelihood of a sentence."

Log probabilities are summed, rather than raw probabilities multiplied, to avoid floating-point underflow.
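
To make the underflow point concrete, here is a minimal sketch using only the standard library. The probabilities are made up for illustration (400 tokens at 0.01 each); they are not taken from any model.

```python
import math

# Hypothetical per-token probabilities for a long text: 400 tokens at 0.01 each.
probs = [0.01] * 400

# Multiplying the probabilities underflows: 1e-800 is far below the smallest
# positive float64, so the running product collapses to exactly 0.0.
product = math.prod(probs)

# Summing the logs keeps the same information in an easily representable range.
log_sum = sum(math.log(p) for p in probs)

print(product)
print(log_sum)
```

Once the product has collapsed to zero, two different sentences can no longer be compared, whereas their log sums still can.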

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import logging
LOG = logging.getLogger("probas")
logging.basicConfig(level=logging.ERROR)
logging.disable(logging.INFO)
import json
from pprint import pprint
from itertools import chain

import requests
from transformers import BertTokenizer, BertForMaskedLM, GPT2Tokenizer, GPT2LMHeadModel

# Add parent dir, so we can access our common code
import os, sys, inspect
currentdir = os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe())))
parentdir = os.path.dirname(currentdir)
sys.path.insert(0,parentdir) 

from mlyoucanuse.bert_fun import (
    get_alternate_words, 
    get_word_probabilities,
    get_word_in_sentence_probability, 
    sum_log_probabilities)
from mlyoucanuse.gpt2_fun import predict_next_token

### We'll use the free version of LanguageTool to spot problem areas:
Open a terminal, install Docker, and then run:

`docker pull erikvl87/languagetool`

`docker run --rm -p 8010:8010 erikvl87/languagetool`

This exposes the grammar-check endpoint at `http://localhost:8010/v2/check`, which the code below calls.

#### The paid version is undoubtedly worth it.

We'll call the service and check for problems to fix using the following function:

In [3]:
def check_sentence(sentence):
    """Helper method to check sentences using languagetool"""
    # Pass a dict so requests form-encodes the text (handles '&', '+', '%', etc.)
    res = requests.post('http://localhost:8010/v2/check',
                        data={'language': 'en-US', 'text': sentence})
    obj = json.loads(res.content.decode('utf-8'))
    return obj.get('matches')
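
Text sent to the endpoint as form data needs percent-encoding; otherwise a character such as `&` would be read as the start of a new parameter. The sketch below shows what a correctly encoded body looks like, using only the standard library. The helper name `build_check_body` is hypothetical and exists only for this illustration.

```python
from urllib.parse import urlencode

def build_check_body(sentence, language="en-US"):
    """Return a percent-encoded form body with the language and text fields."""
    return urlencode({"language": language, "text": sentence})

# The ampersand is encoded as %26, so it cannot be mistaken for a field separator.
body = build_check_body("Tom & Jerry")
print(body)
```

Passing a dict to the `data=` argument of `requests.post` performs the same form encoding automatically.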

In [4]:
bert_tokenizer = BertTokenizer.from_pretrained('bert-large-cased')
bert_model = BertForMaskedLM.from_pretrained("bert-large-cased-whole-word-masking")
bert_model.eval()
logging.disable(logging.NOTSET)
LOG.info('Done!')

Some weights of the model checkpoint at bert-large-cased-whole-word-masking were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


# The following example sentence has an error; can you spot it?
## `I am looking forway to see you.`

In [5]:
error_sent_1 = 'I am looking forway to see you.' 

pprint(check_sentence(error_sent_1))

[{'context': {'length': 6,
              'offset': 13,
              'text': 'I am looking forway to see you.'},
  'contextForSureMatch': 0,
  'ignoreForIncompleteSentence': False,
  'length': 6,
  'message': 'Possible spelling mistake found.',
  'offset': 13,
  'replacements': [{'value': 'Norway'},
                   {'value': 'foray'},
                   {'value': 'for way'}],
  'rule': {'category': {'id': 'TYPOS', 'name': 'Possible Typo'},
           'description': 'Possible spelling mistake',
           'id': 'MORFOLOGIK_RULE_EN_US',
           'issueType': 'misspelling'},
  'sentence': 'I am looking forway to see you.',
  'shortMessage': 'Spelling mistake',
  'type': {'typeName': 'Other'}}]
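
LanguageTool reports the problem as a character `offset` and `length`, while the BERT helpers in this notebook take a word index. A minimal sketch of the conversion follows, assuming words are separated by single spaces; the function name `offset_to_word_index` is hypothetical.

```python
def offset_to_word_index(sentence, offset):
    """Map a character offset to the index of the space-separated word containing it."""
    position = 0
    for index, word in enumerate(sentence.split(" ")):
        # A word occupies [position, position + len(word)); the +1 below skips the space.
        if position <= offset < position + len(word):
            return index
        position += len(word) + 1
    raise ValueError(f"offset {offset} does not fall inside a word")

match = {"offset": 13, "length": 6}  # values copied from the LanguageTool output above
sentence = "I am looking forway to see you."
word_index = offset_to_word_index(sentence, match["offset"])
print(word_index, sentence.split(" ")[word_index])
```

Sentences with tabs or repeated spaces would need a tokenizer that tracks character spans instead of this simple split.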


# The 4th word is the problem; let's mask it and see BERT's top 5 predictions

In [6]:
get_alternate_words(sentence=error_sent_1, 
                    word_index=3,
                    bert_tokenizer=bert_tokenizer,
                    bert_model=bert_model,
                    top=5)

(('up', 0.554697573184967),
 ('forward', 0.2861657440662384),
 ('down', 0.039398763328790665),
 ('happy', 0.017491035163402557),
 ('back', 0.01694393716752529))
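
The implementation of `get_alternate_words` is not shown in this notebook. Conceptually, a masked language model emits one score (logit) per vocabulary entry at the masked position, and a softmax followed by a top-k selection turns those scores into a ranked list like the one above. The sketch below illustrates only that last step, with made-up logits and a hypothetical helper name.

```python
import math

def top_k_from_logits(logits, k=3):
    """Softmax a {token: logit} mapping and return the k most probable tokens."""
    highest = max(logits.values())
    # Subtract the maximum before exponentiating for numerical stability.
    exps = {token: math.exp(value - highest) for token, value in logits.items()}
    total = sum(exps.values())
    probs = {token: value / total for token, value in exps.items()}
    return tuple(sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k])

# Made-up logits for the masked position; a real model scores its whole vocabulary.
toy_logits = {"up": 4.1, "forward": 3.4, "down": 1.5, "happy": 0.7, "Norway": -9.0}
top = top_k_from_logits(toy_logits, k=3)
print(top)
```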

# Our winner is in second place; perhaps BERT can look at the whole context and give us a better decision

In [7]:
tmp = error_sent_1.split()
tmp[3] = 'up'
res = get_word_probabilities(' '.join(tmp), bert_tokenizer=bert_tokenizer, bert_model=bert_model)
tmp, res, sum_log_probabilities(res)

(['I', 'am', 'looking', 'up', 'to', 'see', 'you.'],
 (('I', ('I',), (0.9983968138694763,)),
  ('am', ('am',), (0.08720280230045319,)),
  ('looking', ('looking',), (0.006682963576167822,)),
  ('up', ('up',), (0.554697573184967,)),
  ('to', ('to',), (0.9951606392860413,)),
  ('see', ('see',), (0.24864201247692108,)),
  ('you', ('you',), (0.004596360959112644,)),
  ('.', ('.',), (0.9333391785621643,))),
 -14.88671847175187)

In [8]:
tmp = error_sent_1.split()
tmp[3] = 'forward'
res = get_word_probabilities(' '.join(tmp), bert_tokenizer=bert_tokenizer, bert_model=bert_model)
tmp, res, sum_log_probabilities(res)

(['I', 'am', 'looking', 'forward', 'to', 'see', 'you.'],
 (('I', ('I',), (0.9980271458625793,)),
  ('am', ('am',), (0.4386550784111023,)),
  ('looking', ('looking',), (0.9980376362800598,)),
  ('forward', ('forward',), (0.2861657440662384,)),
  ('to', ('to',), (0.9999135732650757,)),
  ('see', ('see',), (0.0016449571121484041,)),
  ('you', ('you',), (0.07751502841711044,)),
  ('.', ('.',), (0.9046356081962585,))),
 -11.146798981701433)

## Notice: although `up` had a higher predicted probability than `forward` at the masked position, each word choice influences the probabilities of all the other words. Scored over the whole sentence, BERT prefers the correct word, `forward`.
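
The sentence totals above can be reproduced from the per-word probabilities by summing natural logs over every subword. The function below is a stand-in written for this illustration (named `sum_log_probs` to keep it distinct from the imported `sum_log_probabilities`, whose source is not shown here); the data is copied from the two preceding outputs.

```python
import math

def sum_log_probs(word_probabilities):
    """Sum natural-log probabilities over every subword of every word."""
    return sum(
        math.log(p)
        for _word, _tokens, probs in word_probabilities
        for p in probs
    )

up_result = (
    ("I", ("I",), (0.9983968138694763,)),
    ("am", ("am",), (0.08720280230045319,)),
    ("looking", ("looking",), (0.006682963576167822,)),
    ("up", ("up",), (0.554697573184967,)),
    ("to", ("to",), (0.9951606392860413,)),
    ("see", ("see",), (0.24864201247692108,)),
    ("you", ("you",), (0.004596360959112644,)),
    (".", (".",), (0.9333391785621643,)),
)
forward_result = (
    ("I", ("I",), (0.9980271458625793,)),
    ("am", ("am",), (0.4386550784111023,)),
    ("looking", ("looking",), (0.9980376362800598,)),
    ("forward", ("forward",), (0.2861657440662384,)),
    ("to", ("to",), (0.9999135732650757,)),
    ("see", ("see",), (0.0016449571121484041,)),
    ("you", ("you",), (0.07751502841711044,)),
    (".", (".",), (0.9046356081962585,)),
)

up_total = sum_log_probs(up_result)
forward_total = sum_log_probs(forward_result)
print(up_total, forward_total)
```

The `forward` sentence wins mainly because `looking` becomes far more likely next to it (about 0.998 versus 0.007), which outweighs the lower probability of `forward` itself.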

## Let's check LanguageTool's suggestions against BERT

In [9]:
tmp = error_sent_1.split()
tmp[3] = 'foray'
res = get_word_probabilities(' '.join(tmp), bert_tokenizer=bert_tokenizer, bert_model=bert_model)
tmp, res, sum_log_probabilities(res)

(['I', 'am', 'looking', 'foray', 'to', 'see', 'you.'],
 (('I', ('I',), (0.9969447255134583,)),
  ('am', ('am',), (0.17091619968414307,)),
  ('looking', ('looking',), (0.0023975924123078585,)),
  ('foray', ('for', '##ay'), (0.07495932281017303, 2.0115612642257474e-07)),
  ('to', ('to',), (0.9413219094276428,)),
  ('see', ('see',), (0.031592097133398056,)),
  ('you', ('you',), (0.3327822685241699,)),
  ('.', ('.',), (0.9425782561302185,))),
 -30.487647787328548)

In [10]:
tmp = error_sent_1.split()
tmp[3] = 'for way'
res = get_word_probabilities(' '.join(tmp), bert_tokenizer=bert_tokenizer, bert_model=bert_model)
tmp, res, sum_log_probabilities(res)

(['I', 'am', 'looking', 'for way', 'to', 'see', 'you.'],
 (('I', ('I',), (0.9954105019569397,)),
  ('am', ('am',), (0.17129291594028473,)),
  ('looking', ('looking',), (0.7401918768882751,)),
  ('for', ('for',), (3.599639057938475e-06,)),
  ('way', ('way',), (0.0035734777338802814,)),
  ('to', ('to',), (0.9701237678527832,)),
  ('see', ('see',), (0.04050043597817421,)),
  ('you', ('you',), (0.16742444038391113,)),
  ('.', ('.',), (0.9485463500022888,))),
 -25.31554102184821)

In [11]:
tmp = error_sent_1.split()
tmp[3] = 'Norway'
res = get_word_probabilities(' '.join(tmp), bert_tokenizer=bert_tokenizer, bert_model=bert_model)
tmp, res, sum_log_probabilities(res)

(['I', 'am', 'looking', 'Norway', 'to', 'see', 'you.'],
 (('I', ('I',), (0.998480498790741,)),
  ('am', ('am',), (0.14284664392471313,)),
  ('looking', ('looking',), (0.00012327726290095598,)),
  ('Norway', ('Norway',), (1.3437646551039961e-08,)),
  ('to', ('to',), (0.9193994402885437,)),
  ('see', ('see',), (0.009753890335559845,)),
  ('you', ('you',), (0.008466791361570358,)),
  ('.', ('.',), (0.9369243383407593,))),
 -38.62466457959271)
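
Putting the five runs side by side makes the comparison easier to read. This sketch simply ranks the sentence scores copied from the cells above; the dictionary and variable names are illustrative.

```python
# Sentence scores (summed log probabilities) copied from the cells above.
candidate_scores = {
    "up": -14.88671847175187,
    "forward": -11.146798981701433,
    "foray": -30.487647787328548,
    "for way": -25.31554102184821,
    "Norway": -38.62466457959271,
}

# Higher (closer to zero) means the sentence is more plausible to the model.
ranked = sorted(candidate_scores.items(), key=lambda item: item[1], reverse=True)
best_word, best_score = ranked[0]

for word, score in ranked:
    print(f"{word:>8}  {score:8.2f}")
```

All three LanguageTool replacements score below both of BERT's top suggestions for this sentence.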

### BERT falls back to subword tokenization for out-of-vocabulary (OOV) words, i.e. words that are not in its vocabulary as whole tokens
* Anybody want to guess where this will make for pain and suffering?

In [12]:
bert_tokenizer.tokenize('foray')

['for', '##ay']

In [13]:
bert_tokenizer.encode('foray', add_special_tokens=False)

[1111, 4164]
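
For intuition, the split above can be reproduced with a greedy longest-match-first procedure in the style of WordPiece. This is a simplified sketch over a toy vocabulary (an assumption made for the example), not the tokenizer's actual code.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword split, in the style of WordPiece."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, shrinking until one is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Toy vocabulary (an assumption); the real vocabulary is far larger.
toy_vocab = {"for", "##ay", "forward", "Norway", "way"}
print(wordpiece("foray", toy_vocab))
print(wordpiece("forward", toy_vocab))
```

Because every subword's probability enters the sentence sum, a word split into several pieces (like `foray`) contributes several log terms, which can lower its sentence score relative to single-token alternatives.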

# GPT2 can also predict the best match:

In [14]:
logging.disable(logging.INFO)
gpt2_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
gpt2_model = GPT2LMHeadModel.from_pretrained('gpt2')
gpt2_model.eval()
logging.disable(logging.NOTSET)

## `I am looking`...

In [15]:
predict_next_token('I am looking ', gpt2_model, gpt2_tokenizer)

(('forward', 0.3665640652179718),
 ('for', 0.35346919298171997),
 ('to', 0.08423731476068497))

# GPT2 may be more accurate because it was trained on a lot more data; however:
* it can only predict next token
* it doesn't provide a masked word prediction like BERT

# Let's examine some common classes of grammatical errors:
* Real word spelling errors; i.e. not OOV
* Errors involving a missing word
* Errors of an extra word
* Errors of agreement
* Errors of verb form

# Real word spelling errors (i.e. not OOV), e.g.
## `We can order then directly from the web.`

In [15]:
err_7 = "We can order then directly from the web."
err_7c = "We can order them directly from the web."

pprint(check_sentence(err_7))

[]


## LanguageTool doesn't detect this error; let's pretend we know where it is

In [16]:
get_alternate_words(sentence=err_7,
                    word_index=3,
                    bert_tokenizer=bert_tokenizer,
                    bert_model=bert_model,
                    top=5)

(('them', 0.2877809405326843),
 ('it', 0.27451109886169434),
 ('these', 0.0369507297873497),
 ('everything', 0.03518752008676529),
 ('this', 0.029018063098192215))

In [17]:
err_7_res = get_word_probabilities(err_7, bert_tokenizer=bert_tokenizer, bert_model=bert_model)
err_7, err_7_res, sum_log_probabilities(err_7_res) 

('We can order then directly from the web.',
 (('We', ('We',), (0.17043927311897278,)),
  ('can', ('can',), (0.14545994997024536,)),
  ('order', ('order',), (0.0012560535687953234,)),
  ('then', ('then',), (3.108142118435353e-05,)),
  ('directly', ('directly',), (0.0045493245124816895,)),
  ('from', ('from',), (0.46224740147590637,)),
  ('the', ('the',), (0.9203201532363892,)),
  ('web', ('web',), (9.851193681242876e-06,)),
  ('.', ('.',), (0.9614834785461426,))),
 -38.57057261585004)

In [18]:
err_7c_res = get_word_probabilities(err_7c, bert_tokenizer=bert_tokenizer, bert_model=bert_model)
err_7c, err_7c_res, sum_log_probabilities(err_7c_res)

('We can order them directly from the web.',
 (('We', ('We',), (0.12232314050197601,)),
  ('can', ('can',), (0.6350147724151611,)),
  ('order', ('order',), (0.0019470882834866643,)),
  ('them', ('them',), (0.2877809405326843,)),
  ('directly', ('directly',), (0.5957993865013123,)),
  ('from', ('from',), (0.5842000842094421,)),
  ('the', ('the',), (0.8040723204612732,)),
  ('web', ('web',), (0.0001781903556548059,)),
  ('.', ('.',), (0.9475058913230896,))),
 -20.002181347529007)

# Another real word spelling error (i.e. not OOV):
## `Yoga brings peace and vitality to you life.`

In [19]:
# real word spelling error; i.e. not OOV
err_6 = "Yoga brings peace and vitality to you life."
err_6c = "Yoga brings peace and vitality to your life."

pprint(check_sentence(err_6))

[]


In [20]:
err_6_res = get_word_probabilities(err_6, bert_tokenizer=bert_tokenizer, bert_model=bert_model)
err_6c_res = get_word_probabilities(err_6c, bert_tokenizer=bert_tokenizer, bert_model=bert_model)

get_alternate_words(sentence=err_6, word_index=6, bert_tokenizer=bert_tokenizer, bert_model=bert_model, top=5)

(('human', 0.436617910861969),
 ('daily', 0.28348585963249207),
 ('all', 0.12590579688549042),
 ('everyday', 0.04189103841781616),
 ('a', 0.03219813108444214))

In [21]:
get_word_in_sentence_probability(sentence=err_6,
                                 word='you',
                                 bert_model=bert_model,
                                 bert_tokenizer=bert_tokenizer,
                                 word_index=6)

(1.3105238849675516e-06,)

In [22]:
get_word_in_sentence_probability(sentence=err_6, 
                                 word='your',       
                                 bert_model=bert_model,
                                 bert_tokenizer=bert_tokenizer, 
                                 word_index=6)

(0.0052108122035861015,)

In [23]:
err_6, err_6_res, sum_log_probabilities(err_6_res)

('Yoga brings peace and vitality to you life.',
 (('Yoga', ('Yoga',), (3.730556272785179e-05,)),
  ('brings', ('brings',), (0.7759060859680176,)),
  ('peace', ('peace',), (0.03733048215508461,)),
  ('and', ('and',), (0.9945646524429321,)),
  ('vitality',
   ('vital', '##ity'),
   (0.006993272807449102, 0.04066557437181473)),
  ('to', ('to',), (0.5615279674530029,)),
  ('you', ('you',), (1.3105238849675516e-06,)),
  ('life', ('life',), (9.634750313125551e-05,)),
  ('.', ('.',), (0.9477788805961609,))),
 -45.33202755876794)

In [24]:
err_6c, err_6c_res, sum_log_probabilities(err_6c_res)

('Yoga brings peace and vitality to your life.',
 (('Yoga', ('Yoga',), (0.0001000980701064691,)),
  ('brings', ('brings',), (0.9058755040168762,)),
  ('peace', ('peace',), (0.11302768439054489,)),
  ('and', ('and',), (0.9984421133995056,)),
  ('vitality', ('vital', '##ity'), (0.02805033139884472, 0.10872020572423935)),
  ('to', ('to',), (0.6823605895042419,)),
  ('your', ('your',), (0.0052108122035861015,)),
  ('life', ('life',), (0.21406103670597076,)),
  ('.', ('.',), (0.9949281215667725,))),
 -24.468423042951493)

# Errors involving a missing word
(challenging to assess, but technically possible)
## `I'm not sure what I'm up tomorrow.`

In [25]:
err_2 = "I'm not sure what I'm up tomorrow."
err_2c = "I'm not sure what I'm up to tomorrow."

pprint(check_sentence(err_2))

[]


In [26]:
err_2_res = get_word_probabilities(err_2, bert_tokenizer=bert_tokenizer, bert_model=bert_model)
err_2, err_2_res, sum_log_probabilities(err_2_res)

("I'm not sure what I'm up tomorrow.",
 (('I', ('I',), (0.9999278783798218,)),
  ("'m", ("'", 'm'), (0.9940817952156067, 0.9981948733329773)),
  ('not', ('not',), (0.9955098628997803,)),
  ('sure', ('sure',), (0.9942660927772522,)),
  ('what', ('what',), (0.29661574959754944,)),
  ('I', ('I',), (0.9999932050704956,)),
  ("'m", ("'", 'm'), (0.03428991511464119, 0.00015710253501310945)),
  ('up', ('up',), (2.1657100660377182e-05,)),
  ('tomorrow', ('tomorrow',), (2.44925679737662e-08,)),
  ('.', ('.',), (0.9749104380607605,))),
 -41.65538870204863)

In [27]:
err_2c_res = get_word_probabilities(err_2c, bert_tokenizer=bert_tokenizer, bert_model=bert_model)
err_2c, err_2c_res, sum_log_probabilities(err_2c_res)

("I'm not sure what I'm up to tomorrow.",
 (('I', ('I',), (0.9999313354492188,)),
  ("'m", ("'", 'm'), (0.9962669014930725, 0.9988389611244202)),
  ('not', ('not',), (0.9957733750343323,)),
  ('sure', ('sure',), (0.9943144917488098,)),
  ('what', ('what',), (0.9879710078239441,)),
  ('I', ('I',), (0.9999955892562866,)),
  ("'m", ("'", 'm'), (0.8611570000648499, 0.9070966243743896)),
  ('up', ('up',), (0.5622091889381409,)),
  ('to', ('to',), (0.5460034608840942,)),
  ('tomorrow', ('tomorrow',), (0.0015979717718437314,)),
  ('.', ('.',), (0.9823225140571594,))),
 -7.911865799294704)

# Another missing word example:
## `I am psychologist.`

In [28]:
err_3 = "I am psychologist."
err_3c = "I am a psychologist."

pprint(check_sentence(err_3))

[]


In [29]:
err_3_res = get_word_probabilities(err_3, bert_tokenizer=bert_tokenizer, bert_model=bert_model)
err_3, err_3_res, sum_log_probabilities(err_3_res)

('I am psychologist.',
 (('I', ('I',), (0.9840011596679688,)),
  ('am', ('am',), (0.7002142667770386,)),
  ('psychologist', ('psychologist',), (4.375794560473878e-06,)),
  ('.', ('.',), (0.8807179927825928,))),
 -12.83893734289548)

In [30]:
err_3c_res = get_word_probabilities(err_3c, bert_tokenizer=bert_tokenizer, bert_model=bert_model)
err_3c, err_3c_res, sum_log_probabilities(err_3c_res)

('I am a psychologist.',
 (('I', ('I',), (0.9992541670799255,)),
  ('am', ('am',), (0.7013963460922241,)),
  ('a', ('a',), (0.9869096875190735,)),
  ('psychologist', ('psychologist',), (0.0009734426275826991,)),
  ('.', ('.',), (0.9416044354438782,))),
 -7.363446689242413)
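
A missing word has no position to mask, so one option is to generate candidate sentences by inserting a filler word at each gap and then score every candidate with the sentence-probability approach shown above. The sketch below covers only the candidate generation; the function name and the small filler list are assumptions made for illustration.

```python
def insertion_candidates(sentence, filler_words):
    """Yield every sentence formed by inserting one filler word at one gap."""
    words = sentence.split()
    # Stop before the final gap so the sentence-ending token stays last.
    for gap in range(len(words)):
        for filler in filler_words:
            yield " ".join(words[:gap] + [filler] + words[gap:])

# A small, hand-picked filler list (an assumption, not taken from the papers).
fillers = ["a", "the", "to"]
candidates = list(insertion_candidates("I am psychologist.", fillers))
print(len(candidates))
print("I am a psychologist." in candidates)
```

The number of candidates grows with sentence length times the size of the filler list, so each sentence requires many model calls.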

# Errors of an extra word, e.g.:
## `Why is do they appear in this particular section?`

In [31]:
err_4 = "Why is do they appear in this particular section?"
err_4c = "Why do they appear in this particular section?"

pprint(check_sentence(err_4))

[{'context': {'length': 2,
              'offset': 7,
              'text': 'Why is do they appear in this particular section?...'},
  'contextForSureMatch': -1,
  'ignoreForIncompleteSentence': True,
  'length': 2,
  'message': 'Consider using either the past participle “done” or the present '
             'participle “doing” here.',
  'offset': 7,
  'replacements': [{'value': 'done'}, {'value': 'doing'}],
  'rule': {'category': {'id': 'GRAMMAR', 'name': 'Grammar'},
           'description': "Agreement: 'been' or 'was' + past tense",
           'id': 'BEEN_PART_AGREEMENT',
           'issueType': 'grammar',
           'sourceFile': 'grammar.xml',
           'subId': '9'},
  'sentence': 'Why is do they appear in this particular section?',
  'shortMessage': 'Possible agreement error',
  'type': {'typeName': 'Other'}}]


In [32]:
err_4_res = get_word_probabilities(err_4, bert_tokenizer=bert_tokenizer, bert_model=bert_model)
err_4, err_4_res, sum_log_probabilities(err_4_res)

('Why is do they appear in this particular section?',
 (('Why', ('Why',), (0.16757825016975403,)),
  ('is', ('is',), (5.6386539654340595e-05,)),
  ('do', ('do',), (7.702327025072009e-07,)),
  ('they', ('they',), (0.20910966396331787,)),
  ('appear', ('appear',), (0.2699648439884186,)),
  ('in', ('in',), (0.9779276847839355,)),
  ('this', ('this',), (0.2772179841995239,)),
  ('particular', ('particular',), (0.07984210550785065,)),
  ('section', ('section',), (0.0006671780720353127,)),
  ('?', ('?',), (0.9757105708122253,))),
 -39.690535928477864)

In [33]:
err_4c_res = get_word_probabilities(err_4c, bert_tokenizer=bert_tokenizer, bert_model=bert_model)
err_4c, err_4c_res, sum_log_probabilities(err_4c_res)

('Why do they appear in this particular section?',
 (('Why', ('Why',), (0.7214864492416382,)),
  ('do', ('do',), (0.4936559796333313,)),
  ('they', ('they',), (0.36473700404167175,)),
  ('appear', ('appear',), (0.25730466842651367,)),
  ('in', ('in',), (0.9865928292274475,)),
  ('this', ('this',), (0.5284489989280701,)),
  ('particular', ('particular',), (0.09415006637573242,)),
  ('section', ('section',), (0.0003871993103530258,)),
  ('?', ('?',), (0.9983170032501221,))),
 -14.270858777541996)
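
When the position of an extra word is unknown, candidates can be produced by dropping each word in turn and scoring what remains. A minimal sketch of the candidate generation follows; the function name is hypothetical.

```python
def deletion_candidates(sentence):
    """Return every distinct sentence formed by dropping exactly one word."""
    words = sentence.split()
    variants = (" ".join(words[:i] + words[i + 1:]) for i in range(len(words)))
    # dict.fromkeys removes duplicates while preserving order.
    return list(dict.fromkeys(variants))

candidates = deletion_candidates("Why is do they appear in this particular section?")
print(len(candidates))
print("Why do they appear in this particular section?" in candidates)
```

Each candidate would then be scored and the highest-scoring sentence kept.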

# Duplicate word errors, e.g.:
## `Is our youth really in in such a state of disrepair?`

In [34]:
err_5 = "Is our youth really in in such a state of disrepair?"
err_5c = "Is our youth really in such a state of disrepair?"

pprint(check_sentence(err_5))

[{'context': {'length': 5,
              'offset': 20,
              'text': 'Is our youth really in in such a state of disrepair?'},
  'contextForSureMatch': 1,
  'ignoreForIncompleteSentence': False,
  'length': 5,
  'message': 'Possible typo: you repeated a word',
  'offset': 20,
  'replacements': [{'value': 'in'}],
  'rule': {'category': {'id': 'MISC', 'name': 'Miscellaneous'},
           'description': "Word repetition (e.g. 'will will')",
           'id': 'ENGLISH_WORD_REPEAT_RULE',
           'issueType': 'duplication'},
  'sentence': 'Is our youth really in in such a state of disrepair?',
  'shortMessage': 'Word repetition',
  'type': {'typeName': 'Other'}}]


## Here, LanguageTool's suggestion is excellent, and this kind of error would be hard to detect via BERT alone

In [35]:
err_5_res = get_word_probabilities(err_5, bert_tokenizer=bert_tokenizer, bert_model=bert_model)
err_5, err_5_res, sum_log_probabilities(err_5_res)

('Is our youth really in in such a state of disrepair?',
 (('Is', ('Is',), (0.529249906539917,)),
  ('our', ('our',), (0.12770050764083862,)),
  ('youth', ('youth',), (1.3720389688387513e-05,)),
  ('really', ('really',), (0.005016029812395573,)),
  ('in', ('in',), (7.473444566130638e-05,)),
  ('in', ('in',), (0.024649903178215027,)),
  ('such', ('such',), (0.8042277097702026,)),
  ('a', ('a',), (0.9959418177604675,)),
  ('state', ('state',), (0.9993483424186707,)),
  ('of', ('of',), (0.9995388984680176,)),
  ('disrepair',
   ('di', '##s', '##re', '##pair'),
   (0.13680821657180786,
    0.03776433318853378,
    0.007090019062161446,
    0.012891368940472603)),
  ('?', ('?',), (0.9985206723213196,))),
 -47.18102059567564)

In [36]:
err_5c_res = get_word_probabilities(err_5c, bert_tokenizer=bert_tokenizer, bert_model=bert_model)
err_5c, err_5c_res, sum_log_probabilities(err_5c_res)

('Is our youth really in such a state of disrepair?',
 (('Is', ('Is',), (0.9524368047714233,)),
  ('our', ('our',), (0.1444408893585205,)),
  ('youth', ('youth',), (7.70058704802068e-06,)),
  ('really', ('really',), (0.11454037576913834,)),
  ('in', ('in',), (0.9997240900993347,)),
  ('such', ('such',), (0.9997158646583557,)),
  ('a', ('a',), (0.9996241331100464,)),
  ('state', ('state',), (0.9997543692588806,)),
  ('of', ('of',), (0.9999294281005859,)),
  ('disrepair',
   ('di', '##s', '##re', '##pair'),
   (0.07161868363618851,
    0.023905493319034576,
    0.008383141830563545,
    0.009917513467371464)),
  ('?', ('?',), (0.9991294741630554,))),
 -31.691813388194507)
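
The two sentences differ in length, and a summed log probability gains one more negative term for every extra token, so raw totals are not strictly comparable. One common adjustment (an assumption here, not something the notebook's helpers do) is to compare the average log probability per token instead.

```python
def mean_log_probability(total_log_prob, token_count):
    """Average log probability per scored token."""
    return total_log_prob / token_count

# Totals copied from the two cells above; counts are the number of subword
# probabilities listed in each output (15 with the repeated 'in', 14 without).
with_repeat = mean_log_probability(-47.18102059567564, 15)
without_repeat = mean_log_probability(-31.691813388194507, 14)
print(round(with_repeat, 3), round(without_repeat, 3))
```

For this pair the corrected sentence still scores higher after normalization, so the preference is not just an artifact of it being shorter.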

# Errors of Agreement, e.g.:
## `I awaits your response.`

In [37]:
err_8 = "I awaits your response."
err_8c = "I await your response."

pprint(check_sentence(err_8))

[{'context': {'length': 6, 'offset': 2, 'text': 'I awaits your response.'},
  'contextForSureMatch': 4,
  'ignoreForIncompleteSentence': True,
  'length': 6,
  'message': 'Possible agreement error — use the base form here.',
  'offset': 2,
  'replacements': [{'value': 'await'}],
  'rule': {'category': {'id': 'GRAMMAR', 'name': 'Grammar'},
           'description': 'base form after I/you/we/they',
           'id': 'BASE_FORM',
           'issueType': 'grammar',
           'sourceFile': 'grammar.xml',
           'subId': '2'},
  'sentence': 'I awaits your response.',
  'shortMessage': '',
  'type': {'typeName': 'Other'}}]


## Another excellent LanguageTool suggestion!

In [38]:
err_8_res = get_word_probabilities(err_8, bert_tokenizer=bert_tokenizer, bert_model=bert_model)
err_8, err_8_res, sum_log_probabilities(err_8_res)

('I awaits your response.',
 (('I', ('I',), (0.0030388347804546356,)),
  ('awaits',
   ('a', '##wai', '##ts'),
   (0.0018924882169812918, 0.5225626230239868, 2.2834132323623635e-05)),
  ('your', ('your',), (0.006523116026073694,)),
  ('response', ('response',), (0.03266255930066109,)),
  ('.', ('.',), (0.9765116572380066,))),
 -31.880106013770572)

In [39]:
err_8c_res = get_word_probabilities(err_8c, bert_tokenizer=bert_tokenizer, bert_model=bert_model)
err_8c, err_8c_res, sum_log_probabilities(err_8c_res)

('I await your response.',
 (('I', ('I',), (0.6984617710113525,)),
  ('await',
   ('a', '##wai', '##t'),
   (0.0018924882169812918, 0.5225626230239868, 0.011545822024345398)),
  ('your', ('your',), (0.016748905181884766,)),
  ('response', ('response',), (0.014317264780402184,)),
  ('.', ('.',), (0.9831327795982361,))),
 -20.09190233717632)

# Another error of agreement:
## `The first of these scientist begin in January.`

In [40]:
err_9 = "The first of these scientist begin in January."
err_9c = "The first of these scientists begin in January."

pprint(check_sentence(err_9))

[{'context': {'length': 15,
              'offset': 13,
              'text': 'The first of these scientist begin in January.'},
  'contextForSureMatch': 8,
  'ignoreForIncompleteSentence': True,
  'length': 15,
  'message': 'The plural demonstrative ‘these’ does not agree with the '
             'singular noun ‘scientist’.',
  'offset': 13,
  'replacements': [{'value': 'this scientist'}, {'value': 'these scientists'}],
  'rule': {'category': {'id': 'GRAMMAR', 'name': 'Grammar'},
           'description': "'this' vs. 'these'",
           'id': 'THIS_NNS',
           'issueType': 'grammar',
           'sourceFile': 'grammar.xml',
           'subId': '4'},
  'sentence': 'The first of these scientist begin in January.',
  'shortMessage': 'Grammatical problem: use ‘this’',
  'type': {'typeName': 'Other'}}]


In [41]:
err_9_res = get_word_probabilities(err_9, bert_tokenizer=bert_tokenizer, bert_model=bert_model)
err_9, err_9_res, sum_log_probabilities(err_9_res)

('The first of these scientist begin in January.',
 (('The', ('The',), (0.9965754151344299,)),
  ('first', ('first',), (0.003906280733644962,)),
  ('of', ('of',), (0.9852672815322876,)),
  ('these', ('these',), (0.06135643273591995,)),
  ('scientist', ('scientist',), (1.1261241938953503e-09,)),
  ('begin', ('begin',), (0.01121385209262371,)),
  ('in', ('in',), (0.9632464647293091,)),
  ('January', ('January',), (0.0035626052413135767,)),
  ('.', ('.',), (0.929932713508606,))),
 -39.19693931416358)

In [42]:
err_9c_res = get_word_probabilities(err_9c, bert_tokenizer=bert_tokenizer, bert_model=bert_model)
err_9c, err_9c_res, sum_log_probabilities(err_9c_res)

('The first of these scientists begin in January.',
 (('The', ('The',), (0.9947475790977478,)),
  ('first', ('first',), (0.0005034923669882119,)),
  ('of', ('of',), (0.9047554731369019,)),
  ('these', ('these',), (0.05980180203914642,)),
  ('scientists', ('scientists',), (8.274631113636133e-07,)),
  ('begin', ('begin',), (0.0018082300666719675,)),
  ('in', ('in',), (0.8767003417015076,)),
  ('January', ('January',), (0.0017494859639555216,)),
  ('.', ('.',), (0.9212959408760071,))),
 -37.39832367384093)

# One of LanguageTool's suggested corrections is not so good:
## `The first of this scientist begin in January.`

In [43]:
err_9clt = "The first of this scientist begin in January."
err_9clt_res = get_word_probabilities(err_9clt, bert_tokenizer=bert_tokenizer, bert_model=bert_model)
err_9clt, err_9clt_res, sum_log_probabilities(err_9clt_res)

('The first of this scientist begin in January.',
 (('The', ('The',), (0.935041606426239,)),
  ('first', ('first',), (6.021699391567381e-06,)),
  ('of', ('of',), (0.07830952852964401,)),
  ('this', ('this',), (0.0009151458507403731,)),
  ('scientist', ('scientist',), (2.6872328362514963e-07,)),
  ('begin', ('begin',), (0.011763887479901314,)),
  ('in', ('in',), (0.9169101119041443,)),
  ('January', ('January',), (0.0047429585829377174,)),
  ('.', ('.',), (0.9340659379959106,))),
 -46.70917113698211)

# Errors with Verb Form:
## `Brent would often became stunned by resentment.`

In [44]:
err_10 = "Brent would often became stunned by resentment."
err_10c = "Brent would often become stunned by resentment."

pprint(check_sentence(err_10))

[{'context': {'length': 6,
              'offset': 18,
              'text': 'Brent would often became stunned by resentment.'},
  'contextForSureMatch': -1,
  'ignoreForIncompleteSentence': True,
  'length': 6,
  'message': 'The modal verb ‘would’ requires the verb’s base form.',
  'offset': 18,
  'replacements': [{'value': 'become'}],
  'rule': {'category': {'id': 'GRAMMAR', 'name': 'Grammar'},
           'description': 'Non-infinitive verb after modal verbs',
           'id': 'MD_BASEFORM',
           'issueType': 'grammar',
           'sourceFile': 'grammar.xml',
           'subId': '2'},
  'sentence': 'Brent would often became stunned by resentment.',
  'shortMessage': 'Grammatical problem: use the base form',
  'type': {'typeName': 'Other'}}]


## LanguageTool flags this one and suggests `become`; let's see whether BERT agrees

In [45]:
err_10_res = get_word_probabilities(err_10, bert_tokenizer=bert_tokenizer, bert_model=bert_model)
err_10, err_10_res, sum_log_probabilities(err_10_res)

('Brent would often became stunned by resentment.',
 (('Brent', ('Brent',), (5.3277675760909915e-05,)),
  ('would', ('would',), (0.0011291676200926304,)),
  ('often', ('often',), (0.0006896409904584289,)),
  ('became', ('became',), (1.8070495571009815e-05,)),
  ('stunned', ('stunned',), (0.0001664818119024858,)),
  ('by', ('by',), (0.32945454120635986,)),
  ('resentment', ('resentment',), (7.654621731489897e-05,)),
  ('.', ('.',), (0.942987322807312,))),
 -54.174096802403675)

In [46]:
err_10c_res = get_word_probabilities(err_10c, bert_tokenizer=bert_tokenizer, bert_model=bert_model)
err_10c, err_10c_res, sum_log_probabilities(err_10c_res)

('Brent would often become stunned by resentment.',
 (('Brent', ('Brent',), (6.463535100920126e-05,)),
  ('would', ('would',), (0.38164862990379333,)),
  ('often', ('often',), (0.02204342558979988,)),
  ('become', ('become',), (0.03031110018491745,)),
  ('stunned', ('stunned',), (8.590563083998859e-05,)),
  ('by', ('by',), (0.3531121015548706,)),
  ('resentment', ('resentment',), (5.8304620324634016e-05,)),
  ('.', ('.',), (0.9311017990112305,))),
 -38.14543291502301)

# Another verb form error:
## `I having mostly been moving flat.`

In [47]:
err_11 = "I having mostly been moving flat."
err_11c = "I have mostly been moving flat."

pprint(check_sentence(err_11))


In [48]:
err_11_res = get_word_probabilities(err_11, bert_tokenizer=bert_tokenizer, bert_model=bert_model)
err_11, err_11_res, sum_log_probabilities(err_11_res)

('I having mostly been moving flat.',
 (('I', ('I',), (0.004502556286752224,)),
  ('having', ('having',), (6.764967838535085e-05,)),
  ('mostly', ('mostly',), (0.005808720365166664,)),
  ('been', ('been',), (0.6068595051765442,)),
  ('moving', ('moving',), (0.00011972850188612938,)),
  ('flat', ('flat',), (1.0321228728571441e-05,)),
  ('.', ('.',), (0.8743120431900024,))),
 -41.29804043209969)

In [49]:
err_11c_res = get_word_probabilities(err_11c, bert_tokenizer=bert_tokenizer, bert_model=bert_model)
err_11c, err_11c_res, sum_log_probabilities(err_11c_res)

('I have mostly been moving flat.',
 (('I', ('I',), (0.09664551168680191,)),
  ('have', ('have',), (0.4320402443408966,)),
  ('mostly', ('mostly',), (0.0061361235566437244,)),
  ('been', ('been',), (0.9096378087997437,)),
  ('moving', ('moving',), (0.00013921161007601768,)),
  ('flat', ('flat',), (9.656163456384093e-06,)),
  ('.', ('.',), (0.9592146277427673,))),
 -28.833282879516847)

# Areas for further investigation
* Detecting errors is difficult!
    * Can anomaly detection be used to detect errors?
        * e.g. anomalies in the sequence of word probabilities for a sentence
    * Research seems to indicate that anomaly/error thresholds have to be tuned per domain.
* Good datasets of true grammatical errors are hard to find; best to collect your own if you can.
* Machine-generated grammatical error datasets are difficult to get right, but some augmentation is possible.
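
As a starting point for the anomaly-detection idea, here is a naive sketch that flags any word whose least likely subword falls below a fixed threshold. The threshold value and function name are assumptions; the probabilities are copied from the `We can order then directly from the web.` cell earlier in the notebook.

```python
def flag_low_probability_words(word_probabilities, threshold=1e-4):
    """Return words whose least likely subword falls below the threshold."""
    return [
        word
        for word, _tokens, probs in word_probabilities
        if min(probs) < threshold
    ]

err_7_probs = (
    ("We", ("We",), (0.17043927311897278,)),
    ("can", ("can",), (0.14545994997024536,)),
    ("order", ("order",), (0.0012560535687953234,)),
    ("then", ("then",), (3.108142118435353e-05,)),
    ("directly", ("directly",), (0.0045493245124816895,)),
    ("from", ("from",), (0.46224740147590637,)),
    ("the", ("the",), (0.9203201532363892,)),
    ("web", ("web",), (9.851193681242876e-06,)),
    (".", (".",), (0.9614834785461426,)),
)
flagged = flag_low_probability_words(err_7_probs)
print(flagged)
```

The real error (`then`) is flagged, but so is the perfectly correct `web`: rare content words can look as anomalous as mistakes, which is one reason thresholds need tuning.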

# That's all for now!