# Exp011: Corpus Rule Detection
To validate the rule detectors externally, each rule is searched for in the corpus and its top-scoring hits are annotated by hand. A rule is marked as reliably detected if the precision among its annotated hits exceeds 80%. If fewer than 5 of the first 10 hits are correct, the rule is skipped; otherwise, 10 more hits are annotated.
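
The stopping rule described above can be sketched as a small function. This is a minimal, dependency-free illustration rather than the notebook's implementation: the name `annotation_decision` and its parameters are hypothetical, and the thresholds (10 and 20 annotations, 40% and 80% precision, both compared strictly) are read off the annotation loop and the `criterion` function in this notebook.

```python
def annotation_decision(labels, first_stage=10, total=20,
                        continue_precision=0.4, accept_precision=0.8):
    """Return 'continue', 'skip', 'accept' or 'reject' for one rule.

    labels: list of booleans, True if the annotator judged the hit correct.
    All names and defaults here are assumptions made for illustration.
    """
    n = len(labels)
    precision = sum(labels) / n if n else 0.0
    if n < first_stage:
        # always annotate the first batch of hits
        return "continue"
    if n < total:
        # from the first batch on, drop the rule unless precision is above 40%
        return "continue" if precision > continue_precision else "skip"
    # all hits annotated: the rule passes only if precision is above 80%
    return "accept" if precision > accept_precision else "reject"
```

With 10 annotations, 4 correct hits give a precision of exactly 0.4, which is not above the threshold, so the rule is skipped; 5 correct hits let annotation continue.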

In [25]:
import pandas as pd
import numpy as np
from torch.utils.data import DataLoader, TensorDataset
import os
import random
import sys
sys.path.append('../source')
import models
import data
import importlib
importlib.reload(models)

<module 'models' from '/mnt/qb/work/meurers/mpb672/grammarctg/experiments/../source/models.py'>

Load datasets

In [31]:
# English Grammar Profile (EGP) rules with their example sentences
egp_examples = pd.read_json("../data/egp_examples.json")
# load corpus sentences and prepare dataloader
sentences = data.get_mixed_sentences(20000)
encoded_inputs = models.bert_tokenizer(sentences, return_tensors='pt', max_length=64, padding='max_length', truncation=True)
dataset = TensorDataset(encoded_inputs['input_ids'], encoded_inputs['attention_mask'])
dataloader = DataLoader(dataset, batch_size=64, shuffle=False)
# output dataset
output_path = '../data/coded_corpus_hits.json'
coded_instances = pd.DataFrame(columns=['#', 'sentence', 'correct', 'score', 'max_token']) if not os.path.exists(output_path) else pd.read_json(output_path)

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:26<00:00,  8.86s/it]


For each rule:
- train a classifier on the existing dataset
- search the corpus with it
- show the rule and its corpus hits to the user
- ask for user input until the precision is clear enough

In [19]:
def get_trained_classifer(rule, egp_examples):
    classifier = models.RuleDetector(models.bert_encoder)
    positive = rule['augmented_examples']
    negative = rule['augmented_negative_examples']
    others = [example for sublist in egp_examples.loc[egp_examples['#'] != rule['#'], 'augmented_examples'].to_list() for example in sublist]
    
    dataset = data.get_dataset(positive, negative, others, models.bert_tokenizer, 64) 
    train_dataloader, val_dataloader = data.get_loaders(dataset)
    
    optimizer, accuracy = models.train(classifier, train_dataloader, val_dataloader, num_epochs=3)
    return classifier

In [48]:
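def criterion(coded_instances, min_precision=0.8, num_rules=1):
    # share of hits annotated as correct, per rule
    correct_per_rule = coded_instances.groupby('#')['correct'].mean()
    # True once more than `num_rules` rules exceed `min_precision` (both comparisons are strict)
    return len(correct_per_rule[correct_per_rule > min_precision]) > num_rules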
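
To make the behaviour of `criterion` concrete, here is a dependency-free restatement of the same per-rule precision count on toy data. The rule IDs and labels are invented for illustration, and `count_precise_rules` is a hypothetical helper, not part of the notebook.

```python
from collections import defaultdict

def count_precise_rules(annotations, min_precision=0.8):
    """Count rules whose share of correct hits is strictly above min_precision.

    annotations: iterable of (rule_id, correct) pairs.
    """
    per_rule = defaultdict(list)
    for rule_id, correct in annotations:
        per_rule[rule_id].append(bool(correct))
    return sum(1 for labels in per_rule.values()
               if sum(labels) / len(labels) > min_precision)

# toy data, invented for illustration
toy = ([(1, True)] * 9 + [(1, False)]      # rule 1: precision 0.9
       + [(2, True)] * 4 + [(2, False)]    # rule 2: precision 0.8, not above the threshold
       + [(3, False)] * 5)                 # rule 3: precision 0.0

n_precise = count_precise_rules(toy)
# criterion(..., num_rules=k) corresponds to n_precise > k
print(n_precise, n_precise > 0, n_precise > 1)
```

Because both comparisons are strict, a call with `num_rules=19` only returns `True` once 20 rules pass, and a rule with exactly 80% precision does not count.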

In [49]:
while not criterion(coded_instances, num_rules=19):
    rule = egp_examples.sample(1).iloc[0] # sample random rule
    if rule['#'] in coded_instances['#'].values: continue # skip rules that are already coded (`in` on a bare Series tests the index, not the values)
    print(f"{rule['type']}: {rule['Can-do statement']} ({rule['SuperCategory']}: {rule['SubCategory']})")
    print(rule['Example'])
    
    classifier = get_trained_classifer(rule, egp_examples)
    scores, tokens = models.score_corpus(classifier, dataloader, max_positive=250, max_batches=1250)
    results = [(score, token, sample) for score, token, sample in zip(scores, tokens, sentences[:len(scores)]) if score > 0.5]
    sorted_results = iter(sorted(results, key=lambda x: x[0], reverse=True))
    while len(coded_instances[coded_instances['#'] == rule['#']]) < 20:
        if len(coded_instances[coded_instances['#'] == rule['#']]) >= 10: # from 10 instances on, give up on the rule unless its precision is above 40%
            if not criterion(coded_instances[coded_instances['#'] == rule['#']], 0.4, 0): break
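        hit = next(sorted_results, None)
        if hit is None: break # no corpus hits left for this rule
        score, token, sample = hit
        user_response = input(f"{sample}") # enter 2 if the hit is correct, anything else (e.g. 1) otherwise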
        new_row = pd.DataFrame({'#': [rule['#']],
             'sentence': [sample],
             'correct': [user_response == '2'],
             'score': [score],
             'max_token': [token]})
        coded_instances = pd.concat([coded_instances, new_row], ignore_index=True)
        coded_instances.to_json(output_path)

FORM/USE: Can use the negative question form as a persuasion strategy. (PRESENT: present simple)
Don't you just hate taking the bus to school every morning? well I have a perfect solution for you, a great bike with good brakes, a bell and lights. it is a great bike but since I never use it I would be happy to sell it to you. 

Don't you find that when you are having a shower or bath, you occasionally run out of water?


100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [00:01<00:00, 17.61it/s]


Training loss: 0.3339063185453415
Accuracy: 0.985


100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [00:01<00:00, 18.05it/s]


Training loss: 0.08209181800484658
Accuracy: 0.995


100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [00:01<00:00, 17.98it/s]


Training loss: 0.03029802121222019
Accuracy: 1.0


 46%|█████████████████████████████████████████████████████████████▌                                                                        | 574/1250 [00:48<00:57, 11.80it/s]


Alcohol, don't you think alcohol is not ideal to take as brunch? 2
But don't you think it looks rather unfashionable? 2
Don't you think that would be a little weird? 2
But don't you hate it when you step on something sharp? 2
We eat a lot of chicken, pork and beef.You eat those meat a lot in your country too, don't you? 1
Don't you know they've already broken up? 1
I think Shanghai is getting more and more expensive, don't you? 1
I think Shanghai is getting more and more expensive, don't you? 1
Don't you know that umbrella is expensive? 2
Tom Cruise may be handsome, but I think he's a bit crazy, don't you? 1
Yes, I am kidding.But don't you know only professors and students with disabilitiescan apply for parking permits? 1
"Don't you have to be a good bowler to bowl in a league?" 1
Well, don't you? 1
You eat those meat a lot in your country too, don't you? 1
You eat those meat a lot in your country too, don't you? 1
Don't you tell me you speak Spanish, too? 1
You do know all the control

USE: Can use 'that's all' to end a letter.  (PRONOUNS: demonstratives)
That's all for now. 

Well I think that's all.


100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [00:01<00:00, 19.60it/s]


Training loss: 0.6039760363101959
Accuracy: 0.9


100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [00:01<00:00, 19.79it/s]


Training loss: 0.3996618437767029
Accuracy: 0.955


100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [00:01<00:00, 19.63it/s]


Training loss: 0.178541961312294
Accuracy: 0.97


 66%|████████████████████████████████████████████████████████████████████████████████████████▌                                             | 826/1250 [01:10<00:36, 11.68it/s]


You can work really hard for a couple of days, then that's all. 1
But that's all. 2
You're just a chicken that's all. 2
Yes, that's all. 2
Good, that's all then. 2
Good, that's all then. 2
I used to play the piano a bit that's all. 2
That's all for today. 2
That will be all. 2
Here is the agenda for the meeting and that's all for the messages. 2
That is all, it would take me years haha, It is smart to use protection while learning. 2
But that's not all. 1
That's all right then. 2
And is that all? 1
That will be all for now. 2
Yeah, is that all? 1
Oh, is that all? 1
Oh, is that all? 1
No, that's all right. 1
I 'm the one person who can make this country great again, that's all I know, he told reporters Saturday. 1


USE: Can use 'these' as a pronoun to refer to something with immediate relevance which has already been mentioned.  ►  noun phrases  ►  pronouns: demonstrative (PRONOUNS: demonstratives)
He is very clever and generous, and these are the things that I like most about him. 

There are a few interesting and funny programmes like The Simpsons, Password or José Mota's hour. These are the only programmes I like watching. 

I think that we have similar taste about things like these.


100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [00:01<00:00, 18.47it/s]


Training loss: 0.5731527447700501
Accuracy: 0.925


100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [00:01<00:00, 18.72it/s]


Training loss: 0.3150770592689514
Accuracy: 0.955


100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [00:01<00:00, 18.58it/s]


Training loss: 0.18169493973255157
Accuracy: 0.985


 67%|█████████████████████████████████████████████████████████████████████████████████████████▋                                            | 837/1250 [01:12<00:35, 11.59it/s]


I worked on the new line for many months, and these are the elements I hope I managed to incorporate into them. 2
These are the kinds of practical actions that inspire me in my job as we work to make the best science available to understand and reduce the impact of extreme events on families, communities and businesses. 2
These are just some of the ways his company changed everything – for better or worse. 2
Those are the ones I get the most personally invested in. 1
These are the important things for our society when these people are released.” I spoke to a former prisoner who now runs a social enterprise called X-Cons Sweden. 2
And this man is involved in getting investment from UK to Africa, and he was very excited about Telepathy, that it would be a way of educating people about Africa, of showing them other people’s point of view.” 
This is Iguchi’s fondest hope – that seeing somebody else’s literal point of view will help you to see their metaphorical point of view. 1
These are t

FORM: Can form a cleft construction beginning with 'it' to emphasise the subject of the main clause. ► clauses (PRONOUNS: subject/ object)
It was my father who took all this away from me.


100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [00:01<00:00, 19.41it/s]


Training loss: 0.37615697741508486
Accuracy: 1.0


100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [00:01<00:00, 19.69it/s]


Training loss: 0.08096974745392799
Accuracy: 1.0


100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [00:01<00:00, 19.80it/s]


Training loss: 0.029563990235328675
Accuracy: 1.0


 30%|███████████████████████████████████████▌                                                                                              | 369/1250 [00:30<01:13, 11.91it/s]


It is these factors that led the Oregon Resilience Plan to conclude that in a worst-case scenario, 10,000 people could be killed, and large areas of Oregon could be without transportation, electricity, water and sewer for several months to several years. 2
It was in 1963 that their enormous fame first came about with the Beatlemania 2
Crowe spent much of his life living in Australia and the United States, but it was in New Zealand where he first became involved in the entertainment industry. 2
It was not until the 15th century that “breakfast” came into use in written English to describe a morning meal, which literally means to break the fasting period of the prior night 2
When he finds Princess Leia's message to Obi-Wan Kenobi inside the robot R2D2, it is 'the call to adventure' that starts the hero on his journey. 2
It was on the Tiananmen rostrum where Chairman Mao formally proclaimed the founding of the People's Republic. 2
It was in the 10th century that the actual word "pizza"  w

FORM/USE: Can use 'be' + 'not' + adjective + 'that-' clause to make an assertion less direct. (MODALITY: expressions with be)
[talking about a town near a dump where people cannot open their windows] I am not certain that they have got used to it. 

[talking about distractions while studying] It's not likely that you'll make progress. 

As you can see, I'm really in favour of this plan but I'm not sure that the council has anticipated everything.


100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [00:01<00:00, 19.20it/s]


Training loss: 0.3671160113811493
Accuracy: 0.99


100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [00:01<00:00, 19.55it/s]


Training loss: 0.10389510571956634
Accuracy: 0.995


100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [00:01<00:00, 19.49it/s]


Training loss: 0.04626576133072376
Accuracy: 0.995


 11%|██████████████▏                                                                                                                       | 132/1250 [00:11<01:33, 11.91it/s]


It's not clear how weapons, fighter jets and ammunition flowing into Syria will affect the fighting there, much less the heretofore unsuccessful attempts to forge a negotiated settlement. 1
I'm not sure, They say humans need different social experiences to learn their own culture and to survive which most do learn in homeschooling. 1
Yes, the cats we have are domesticated, small, and typically furry, but I'm really not sure when they were. 1
Well it's not surprising that it's the world's largest restaurant chain by revenue, so you are not the only one eating a lot of fast food 1
As for "The Canadian", I'm not sure of it's first operating year. 1
I'm not sure if we've got any now.Books of that kind are on this shelf.Well, I'm afraid we've sold out. 1
Much as I love them personally, I do n't sell things like saris -- and I never would -- it's not a style that would appeal to the tastes of my particular customers. 1
Well, I'll celebrate Hanukah soon, but that's not as important to us as C

USE: Can use the present simple with a wide range of reporting verbs, especially in academic contexts, including 'demonstrate', 'illustrate'. (PRESENT: present simple)
The popularity of this TV game in Russia clearly demonstrates the nature of human fears and dreams. 

The 2 charts illustrate the number of employees, and the trends in profit for three factories, namely the factories located in London, Leeds, and Bristol, which belong to the same company, during the year 2003.


100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [00:01<00:00, 18.55it/s]


Training loss: 0.5689870703220368
Accuracy: 0.82


 12%|████████████████▌                                                                                                                         | 3/25 [00:00<00:01, 17.89it/s]


KeyboardInterrupt: 

In [52]:
correct_per_rule = coded_instances.groupby('#')['correct'].mean()
len(correct_per_rule[correct_per_rule > 0.8])

10