<a href="https://colab.research.google.com/github/felixbmuller/nlp-commonsense/blob/main/NLP_Commonsense_Assignment_2_KB_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP Commonsense Assignment 2 - Knowledge Base Model

## Setup

In [None]:
!pip install -q transformers datasets torch torchvision
!apt install git-lfs >/dev/null




In [None]:
from huggingface_hub import notebook_login

notebook_login()

Login successful
Your token has been saved to /root/.huggingface/token
Authenticated through git-credential store but this isn't the helper defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub. Run the following command in your terminal in case you want to set this credential helper as the default

git config --global credential.helper store


In [None]:
!git clone https://github.com/felixbmuller/nlp-commonsense.git --depth 1

Cloning into 'nlp-commonsense'...
remote: Enumerating objects: 28, done.
remote: Counting objects: 100% (28/28), done.
remote: Compressing objects: 100% (24/24), done.
remote: Total 28 (delta 3), reused 12 (delta 1), pack-reused 0
Unpacking objects: 100% (28/28), done.


In [None]:
from datasets import load_dataset, load_metric
import pandas as pd
import transformers

print(transformers.__version__)

model_checkpoint = "bert-base-uncased"
batch_size = 16

datasets = load_dataset("super_glue", "copa")

4.16.2



Downloading and preparing dataset super_glue/copa (download: 42.96 KiB, generated: 119.62 KiB, post-processed: Unknown size, total: 162.57 KiB) to /root/.cache/huggingface/datasets/super_glue/copa/1.0.2/d040c658e2ddef6934fdd97deb45c777b6ff50c524781ea434e7219b56a428a7...



Dataset super_glue downloaded and prepared to /root/.cache/huggingface/datasets/super_glue/copa/1.0.2/d040c658e2ddef6934fdd97deb45c777b6ff50c524781ea434e7219b56a428a7. Subsequent calls will reuse this data.



## Setup and Test Knowledge Base

In [None]:
%cd /content/nlp-commonsense/src/
!git pull

/content/nlp-commonsense/src
Already up to date.


In [None]:
%load_ext autoreload
%autoreload 2

import utils
import process_examples
import find_shortest_path
import renderer as R
import qa_preprocessing as QA

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Unzipping corpora/omw-1.4.zip.


In [None]:
conceptnet = utils.load_conceptnet(load_compressed=True)

KeyboardInterrupt: ignored

In [None]:
example = datasets["train"][0]

example

{'choice1': 'The sun was rising.',
 'choice2': 'The grass was cut.',
 'idx': 0,
 'label': 0,
 'premise': 'My body cast a shadow over the grass.',
 'question': 'cause'}

In [None]:
print(process_examples.extract_terms(example["premise"]))
print(process_examples.extract_terms(example["choice1"]))
print(process_examples.extract_terms(example["choice2"]))
print(find_shortest_path.find_word_path('body', 'sun', conceptnet))
print(find_shortest_path.find_word_path('body', 'sun', conceptnet, renderer=None))

{'my body', 'grass', 'cast', 'shadow', 'body'}
{'wa', 'rising', 'sun'}
{'grass', 'cut', 'wa'}
body <--RelatedTo-- sun
[182090, 1539020]


In [None]:
R.render_path_natural([], conceptnet)

('', [])

In [None]:
R.render_path_natural([182090, 1539020], conceptnet)

('sun is like body.', [0.909])

In [None]:
print(QA.get_knowledge_for_example(example["premise"], example["choice1"], conceptnet, max_paths=100))
print(QA.get_knowledge_for_example(example["premise"], example["choice1"], conceptnet, max_paths=3))

grass is like side. side is like wa. grass is in the context of slang. rising is in the context of slang. grass is like plant. sun is like plant. cast is like rise. rising and rise have similar meanings. iron can be cast . sun has iron. shadow is like sun. wash is like body. wash and wa have similar meanings. dyke is like body. dyke is like rising. sun is like body.
sun is like body. shadow is like sun. cast is like rise. rising and rise have similar meanings.


## Preprocessing the data

In [None]:
from tqdm.notebook import tqdm

In [None]:
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)


In [None]:
ending_names = ["choice1", "choice2"]

QUESTION_MAP = {
    "cause": "What was the cause of this?",
    "effect": "What happened as a RESULT?",
}

MAX_PATHS = 3 # only take the three most relevant knowledge paths into account 

def preprocess_function(examples):
    # Repeat the premise and question once per answer choice. For each
    # repetition, prepend knowledge from the knowledge base describing
    # connections between the premise and that answer choice. The question
    # type (cause/effect) is also taken into account.
    first_sentences = [
                       [f"{QA.get_knowledge_for_example(f'{context} {question}', c1, conceptnet, MAX_PATHS)} {context} {QUESTION_MAP[question]}", 
                        f"{QA.get_knowledge_for_example(f'{context} {question}', c2, conceptnet, MAX_PATHS)} {context} {QUESTION_MAP[question]}"] 
                       for context, question, c1, c2 in zip(
                           tqdm(examples["premise"]), 
                           examples["question"], 
                           examples["choice1"], 
                           examples["choice2"]
                           )
                       ]
    # Grab all second sentences possible for each context.
    second_sentences = [[c1, c2] 
                        for c1, c2 in zip(examples["choice1"], examples["choice2"])]
    
    # Flatten everything
    first_sentences = sum(first_sentences, [])
    second_sentences = sum(second_sentences, [])

    if len(first_sentences) != len(second_sentences):
        raise ValueError("lengths don't match")
    
    # Tokenize
    tokenized_examples = tokenizer(first_sentences, second_sentences, truncation=True)
    # Un-flatten
    return {k: [v[i:i+2] for i in range(0, len(v), 2)] for k, v in tokenized_examples.items()}

This function works with one or several examples. For several examples, it returns a list of lists of lists for each key: a list over all examples (2 in the test below), then a list over the answer choices (2 for COPA), and finally a list of input IDs (of varying length, since no padding has been applied yet):
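The flatten/un-flatten step can be illustrated in isolation. The following is a minimal sketch with made-up token IDs; `flat_input_ids`, `NUM_CHOICES` and `unflatten` are hypothetical names used only for illustration and are not part of the notebook's code.

```python
# Toy stand-in for tokenizer output: one entry per (example, choice) pair,
# flattened in the order ex0/choice1, ex0/choice2, ex1/choice1, ex1/choice2.
flat_input_ids = [[101, 7, 102], [101, 8, 9, 102], [101, 5, 102], [101, 6, 102]]
NUM_CHOICES = 2  # COPA has two answer choices per premise


def unflatten(flat, num_choices):
    """Regroup a flat list into chunks of `num_choices` consecutive items."""
    return [flat[i:i + num_choices] for i in range(0, len(flat), num_choices)]


nested = unflatten(flat_input_ids, NUM_CHOICES)

# Number of examples, choices per example, token counts of the first example
print(len(nested), len(nested[0]), [len(x) for x in nested[0]])  # 2 2 [3, 4]
```

The outer list indexes examples, the middle list indexes the two choices, and the inner lists hold unpadded token IDs.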

### Test Tokenizer and Preprocessing

In [None]:
tokenizer("Hello, this one sentence!", "And this sentence goes with it.")

{'input_ids': [101, 7592, 1010, 2023, 2028, 6251, 999, 102, 1998, 2023, 6251, 3632, 2007, 2009, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [None]:
examples = datasets["train"][:2]
features = preprocess_function(examples)

print(features.keys())
print(len(features["input_ids"]), len(features["input_ids"][0]), [len(x) for x in features["input_ids"][0]])


dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
2 2 [46, 38]


To check that nothing went wrong when grouping all possibilities and then un-flattening, let's have a look at the decoded inputs for a given example:

In [None]:
len(datasets["train"]), len(datasets["test"]), len(datasets["validation"])

(400, 500, 100)

In [None]:
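idx = 3
# `features` above covers only the first two examples, so preprocess enough rows to reach `idx`
features = preprocess_function(datasets["train"][:idx + 1])
[tokenizer.decode(features["input_ids"][idx][i]) for i in range(2)]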

['[CLS] high is the opposite of short. exacta is like runner. forecast is like exacta. cool off is like cause. cool off is like temperature. the runner wore shorts. what was the cause of this? [SEP] the forecast predicted high temperatures. [SEP]',
 '[CLS] runner is like run. run is like cause. runner is like florida. florida has beach. the runner wore shorts. what was the cause of this? [SEP] she planned to run along the beach. [SEP]']

We can compare it to the ground truth:

In [None]:
datasets["train"][3]

{'choice1': 'The forecast predicted high temperatures.',
 'choice2': 'She planned to run along the beach.',
 'idx': 3,
 'label': 0,
 'premise': 'The runner wore shorts.',
 'question': 'cause'}

### Apply Preprocessing to the Whole Dataset

Applying the preprocessing, including querying the knowledge base, takes around 15 seconds per example. To avoid lengthy calculations on every execution, this section allows saving and retrieving the results via Google Drive. We do not preprocess the test set, as it is not needed.
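The save/retrieve idea can be sketched as a small compute-once helper. This is only an illustration under stated assumptions: it uses the standard-library `pickle` module and a temporary directory instead of `joblib` and Google Drive, and `load_or_compute`, `expensive_preprocessing` and `cache_path` are hypothetical names that do not appear in the notebook.

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute_fn):
    """Return the cached result at `path`, computing and saving it if missing."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute_fn()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result


calls = []  # records how often the expensive step actually runs


def expensive_preprocessing():
    # Stand-in for the slow knowledge-base lookup plus tokenization
    calls.append(1)
    return {"input_ids": [[[101, 102], [101, 102]]]}


cache_path = os.path.join(tempfile.mkdtemp(), "copa_val_demo.pkl")
first = load_or_compute(cache_path, expensive_preprocessing)   # computes and saves
second = load_or_compute(cache_path, expensive_preprocessing)  # loads from disk
```

The cells in this section do the same thing manually: they dump the results with `joblib` when `use_gdrive` is set and reload them in a separate cell.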

In [None]:
import joblib
import pyarrow as pa
from datasets import Dataset, DatasetDict, concatenate_datasets

use_gdrive = False

In [None]:
# Mount google drive
# You can skip this if you don't want to load/save intermediate results from/to
# Google drive

from google.colab import drive
drive.mount('/content/drive')

use_gdrive=True

Mounted at /content/drive


In [None]:
encoded_val = preprocess_function(datasets["validation"])
if use_gdrive:
    joblib.dump(encoded_val, "../../drive/MyDrive/nlp-commonsense/copa_val.joblib")


['../../drive/MyDrive/nlp-commonsense/copa_val.joblib']

In [None]:
encoded_train = preprocess_function(datasets["train"])
if use_gdrive:
    joblib.dump(encoded_train, "../../drive/MyDrive/nlp-commonsense/copa_train.joblib")


['../../drive/MyDrive/nlp-commonsense/copa_train.joblib']

In [None]:
if use_gdrive:
    encoded_val = joblib.load("../../drive/MyDrive/nlp-commonsense/copa_val.joblib")
    encoded_train = joblib.load("../../drive/MyDrive/nlp-commonsense/copa_train.joblib")

In [None]:
train_ds = Dataset(pa.Table.from_pydict(encoded_train))
val_ds = Dataset(pa.Table.from_pydict(encoded_val))

In [None]:
# merge tokenizer output with labels from the original dataset
train_ds = concatenate_datasets([train_ds, datasets["train"]], split="train", axis=1)
val_ds = concatenate_datasets([val_ds, datasets["validation"]], split="validation", axis=1)


In [None]:
encoded_datasets = DatasetDict(
    train=train_ds,
    validation=val_ds)

**Add Sorting**

The following code can be used to sort the datasets according to the average number of tokens (average is needed because each datapoint contains two sequences, one for choice 1 and one for choice 2). As this gave worse results, I did not use this in the final solution.

In [None]:
def avg_input_lens(batch):
    vals = [(len(v[0]) + len(v[1]))/2 for v in batch["input_ids"]]
    return {"avg_input_len": vals}

# Uncomment to apply sorting
#encoded_datasets = encoded_datasets.map(avg_input_lens, batched=True)
#encoded_datasets = encoded_datasets.sort("avg_input_len")

In [None]:
s0 = pd.Series(len(encoded_datasets["train"]["input_ids"][i][0]) for i in range(400))
s1 = pd.Series(len(encoded_datasets["train"]["input_ids"][i][1]) for i in range(400))

len_df = pd.DataFrame({"input_ids0": s0, "input_ids1": s1})

In [None]:
len_df

Unnamed: 0,input_ids0,input_ids1
0,46,38
1,53,63
2,60,59
3,52,42
4,41,41
...,...,...
395,54,58
396,62,66
397,49,49
398,68,51


In [None]:
encoded_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'premise', 'choice1', 'choice2', 'question', 'idx', 'label'],
        num_rows: 400
    })
    validation: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'premise', 'choice1', 'choice2', 'question', 'idx', 'label'],
        num_rows: 100
    })
})

## Fine-tuning the model

In [None]:
from transformers import AutoModelForMultipleChoice, TrainingArguments, Trainer
from dataclasses import dataclass
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
from typing import Optional, Union
import torch
import numpy as np

model = AutoModelForMultipleChoice.from_pretrained(model_checkpoint)

model_name = model_checkpoint.split("/")[-1]
args = TrainingArguments(
    f"{model_name}-finetuned-copa-kb",
    evaluation_strategy = "epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=True,
)


@dataclass
class DataCollatorForMultipleChoice:
    """
    Data collator that will dynamically pad the inputs for multiple choice received.
    """

    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None

    def __call__(self, features):
        label_name = "label" if "label" in features[0].keys() else "labels"
        labels = [feature.pop(label_name) for feature in features]
        batch_size = len(features)
        num_choices = len(features[0]["input_ids"])
        flattened_features = [[{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features]
        flattened_features = sum(flattened_features, [])
        
        batch = self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )
        
        # Un-flatten
        batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
        # Add back labels
        batch["labels"] = torch.tensor(labels, dtype=torch.int64)
        return batch


def compute_metrics(eval_predictions):
    predictions, label_ids = eval_predictions
    preds = np.argmax(predictions, axis=1)
    return {"accuracy": (preds == label_ids).astype(np.float32).mean().item()}

trainer = Trainer(
    model,
    args,
    train_dataset=encoded_datasets["train"],
    eval_dataset=encoded_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=DataCollatorForMultipleChoice(tokenizer),
    compute_metrics=compute_metrics,
)

loading configuration file https://huggingface.co/bert-base-uncased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/3c61d016573b14f7f008c02c4e51a366c67ab274726fe2910691e2a761acf43e.37395cee442ab11005bcd270f3c34464dc1704b715b5d7d52b1a461abe3b9e4e
Model config BertConfig {
  "_name_or_path": "bert-base-uncased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.16.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading weights file https://huggingface.co/bert-base-uncased/re

When called on a list of examples, the collator flattens all the inputs, attention masks, etc. into big lists that it passes to the `tokenizer.pad` method. This returns a dictionary of big tensors (of shape `(batch_size * 2) x seq_length`, since COPA has two choices per example) that we then un-flatten.
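The flatten, pad, and un-flatten sequence can be sketched without tensors. The snippet below is a simplified pure-Python illustration, not the collator itself: it pads with a fixed `pad_id` of 0 to the longest sequence in the batch, and `pad_and_unflatten` and `toy` are hypothetical names introduced here.

```python
def pad_and_unflatten(features, pad_id=0):
    """`features` is a list of examples, each a list of per-choice token-id lists."""
    batch_size = len(features)
    num_choices = len(features[0])
    # Flatten: one sequence per (example, choice) pair
    flat = [seq for example in features for seq in example]
    # Pad every sequence on the right to the longest one in the batch
    max_len = max(len(seq) for seq in flat)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in flat]
    # Un-flatten back to batch_size x num_choices x max_len
    return [padded[i * num_choices:(i + 1) * num_choices] for i in range(batch_size)]


toy = [
    [[101, 5, 102], [101, 5, 6, 7, 102]],
    [[101, 8, 102], [101, 9, 102]],
]
toy_batch = pad_and_unflatten(toy)
print(toy_batch[0][0])  # [101, 5, 102, 0, 0]
```

With two examples and two choices each, the flattened list holds four sequences, all padded to length 5 before being regrouped.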

### Test Collator

We can check that this data collator works on a list of features. We just have to make sure to remove all features that are not inputs accepted by our model (something the `Trainer` will do automatically for us later):

In [None]:
accepted_keys = ["input_ids", "attention_mask", "label"]
features = [{k: v for k, v in encoded_datasets["train"][i].items() if k in accepted_keys} for i in range(10)]
batch = DataCollatorForMultipleChoice(tokenizer)(features)

Again, all this flattening and un-flattening is a potential source of errors, so let's run another sanity check on our inputs:

In [None]:
[tokenizer.decode(batch["input_ids"][8][i].tolist()) for i in range(2)]

['[CLS] malpractice is like physician. malpractice is like patient. effect is in the context of law. lawsuit is in the context of law. the physician misdiagnosed the patient. what happened as a result? [SEP] the patient filed a malpractice lawsuit against the physician. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD]',
 '[CLS] patient is people. people has information. malpractice is like physician. malpractice is like patient. effect is in the context of physic. physic is like physician. the physician misdiagnosed the patient. what happened as a result? [SEP] the patient disclosed confidential information to the physician. [SEP]']

In [None]:
encoded_datasets["train"][8]

{'attention_mask': [[1,
   1,
   1,
   1,
   1,
   1,
   1,
   1,
   1,
   1,
   1,
   1,
   1,
   1,
   1,
   1,
   1,
   1,
   1,
   1,
   1,
   1,
   1,
   1,
   1,
   1,
   1,
   1,
   1,
   1,
   1,
   1,
   1,
   1,
   1,
   1,
   1,
   1,
   1,
   1,
   1],
  [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
 'avg_input_len': 32.0,
 'choice1': 'I ran away.',
 'choice2': 'I apologized to him.',
 'idx': 177,
 'input_ids': [[101,
   7985,
   2003,
   2108,
   1012,
   2185,
   2003,
   2066,
   2108,
   1012,
   3466,
   2003,
   2066,
   1998,
   1012,
   2185,
   2003,
   2066,
   1998,
   1012,
   1045,
   18856,
   18163,
   6588,
   19030,
   2046,
   1996,
   7985,
   1012,
   2054,
   3047,
   2004,
   1037,
   2765,
   1029,
   102,
   1045,
   2743,
   2185,
   1012,
   102],
  [101,
   1045,
   18856,
   18163,
   6588,
   19030,
   2046,
   1996,
   7985,
   1012,
   2054,
   3047,
   2004,
   1037,
   2765,
   1029,
   102,
   1045,
   17806,
   20

### Run Training

In [None]:
trainer.train()

#model.push_to_hub("felixbmuller/bert-base-uncased-finetuned-copa")

The following columns in the training set  don't have a corresponding argument in `BertForMultipleChoice.forward` and have been ignored: choice1, premise, question, idx, choice2.
***** Running training *****
  Num examples = 400
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 75


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.683946,0.54
2,No log,0.657442,0.61
3,No log,0.631907,0.61


The following columns in the evaluation set  don't have a corresponding argument in `BertForMultipleChoice.forward` and have been ignored: choice1, premise, question, idx, choice2.
***** Running Evaluation *****
  Num examples = 100
  Batch size = 16
The following columns in the evaluation set  don't have a corresponding argument in `BertForMultipleChoice.forward` and have been ignored: choice1, premise, question, idx, choice2.
***** Running Evaluation *****
  Num examples = 100
  Batch size = 16
The following columns in the evaluation set  don't have a corresponding argument in `BertForMultipleChoice.forward` and have been ignored: choice1, premise, question, idx, choice2.
***** Running Evaluation *****
  Num examples = 100
  Batch size = 16


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=75, training_loss=0.619866943359375, metrics={'train_runtime': 85.674, 'train_samples_per_second': 14.007, 'train_steps_per_second': 0.875, 'total_flos': 89802285776832.0, 'train_loss': 0.619866943359375, 'epoch': 3.0})

## Evaluate the Model


In [None]:
predictions, label_ids, metrics = trainer.predict(encoded_datasets["validation"], metric_key_prefix="val")

The following columns in the test set  don't have a corresponding argument in `BertForMultipleChoice.forward` and have been ignored: choice1, premise, question, idx, choice2.
***** Running Prediction *****
  Num examples = 100
  Batch size = 16


In [None]:
metrics

{'val_accuracy': 0.6100000143051147,
 'val_loss': 0.6319071054458618,
 'val_runtime': 2.3038,
 'val_samples_per_second': 43.407,
 'val_steps_per_second': 3.039}

In [None]:
val = pd.DataFrame(datasets["validation"])
val["label_ids"] = label_ids
val["pred0"] = predictions[:, 0]
val["pred1"] = predictions[:, 1]
val["pred_label"] = np.argmax(predictions, axis=1)

Sanity check to ensure that the predictions behave the way I expect them to:

In [None]:
joblib.dump(val, "../../drive/MyDrive/nlp-commonsense/bert-base-uncased-finetuned-copa-kb-validation-results.joblib")

['../../drive/MyDrive/nlp-commonsense/bert-base-uncased-finetuned-copa-kb-validation-results-sorted.joblib']

In [None]:
import joblib
val = joblib.load("/content/drive/MyDrive/nlp-commonsense/bert-base-uncased-finetuned-copa-kb-validation-results.joblib")

In [None]:
val.head(20)

Unnamed: 0,premise,choice1,choice2,question,idx,label,label_ids,pred0,pred1,pred_label
0,The man turned on the faucet.,The toilet filled with water.,Water flowed from the spout.,effect,0,1,1,-0.598694,-0.795718,0
1,The girl found a bug in her cereal.,She poured milk in the bowl.,She lost her appetite.,effect,1,1,1,-0.732909,-0.161618,1
2,The woman retired.,She received her pension.,She paid off her mortgage.,effect,2,0,0,-0.498049,-0.611967,0
3,I wanted to conserve energy.,I swept the floor in the unoccupied room.,I shut off the light in the unoccupied room.,effect,3,1,1,-0.45046,-0.559087,0
4,The hamburger meat browned.,The cook froze it.,The cook grilled it.,cause,4,1,1,-0.432068,-0.590761,0
5,I doubted the salesman's pitch.,I turned his offer down.,He persuaded me to buy the product.,effect,5,0,0,-0.530533,-0.499672,1
6,I decided to stay home for the night.,The forecast called for storms.,My friends urged me to go out.,cause,6,0,0,-0.084215,-0.541973,0
7,My eyes became red and puffy.,I was sobbing.,I was laughing.,cause,7,0,0,-0.27661,-0.47217,0
8,The flame on the candle went out.,I blew on the wick.,I put a match to the wick.,cause,8,0,0,-0.531293,-0.403665,1
9,The man drank heavily at the party.,He had a headache the next day.,He had a runny nose the next day.,effect,9,0,0,-0.451781,-0.528218,0


In [None]:
wrong_samples = val[val.label !=  val.pred_label]
wrong_samples.sample(25, random_state=42)

Unnamed: 0,premise,choice1,choice2,question,idx,label,label_ids,pred0,pred1,pred_label
52,The detective revealed an anomaly in the case.,He finalized his theory.,He scrapped his theory.,effect,52,1,1,-0.244382,-0.404704,0
28,The girl refused to eat her vegetables.,Her father told her to drink her milk.,Her father took away her dessert.,effect,28,1,1,-0.495977,-0.629269,0
17,The kidnappers released the hostage.,They accepted ransom money.,They escaped from jail.,cause,17,0,0,-0.198481,0.116452,1
54,The child learned how to read.,He began attending school.,He skipped a grade in school.,cause,54,0,0,-0.593733,-0.321729,1
8,The flame on the candle went out.,I blew on the wick.,I put a match to the wick.,cause,8,0,0,-0.531293,-0.403665,1
98,The computer was expensive to fix.,I got it repaired.,I bought a new one.,effect,98,1,1,-0.502778,-0.556291,0
38,The bride got cold feet before the wedding.,The wedding guests brought gifts.,She called the wedding off.,effect,38,1,1,-0.612523,-0.770669,0
94,The girl wanted to wear earrings.,She got her ears pierced.,She got a tattoo.,effect,94,0,0,-0.494019,-0.333437,1
62,The lock opened.,I turned the key in the lock.,I made a duplicate of the key.,cause,62,0,0,-0.43981,-0.435905,1
14,The player caught the ball.,Her teammate threw it to her.,Her opponent tried to intercept it.,cause,14,0,0,-0.63181,-0.360854,1


## Calculate t-test

In [8]:
baseline = {
    "P": [91, 70, 65, 52, 98],
    "C": [38, 49, 97, 10, 36, 4, 55],
    "U": [73, 25, 26, 3, 42, 30, 9, 89],
    "E": [35, 8],
    "R": [82, 14, 86]
}

kb_model = {
    "P": [52, 28, 98, 62, 83, 0],
    "C": [38, 55, 10, 63],
    "U": [94, 27, 19, 30, 71, 25, 3, 33],
    "E": [54, 8, 35, 59],
    "R": [14, 82, 17],
}

In [9]:
baseline_vec = {k: [(1 if i in v else 0) for i in range(100)] for k, v in baseline.items()}
kb_model_vec = {k: [(1 if i in v else 0) for i in range(100)] for k, v in kb_model.items()}

In [10]:
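# Each dictionary lists 25 example indices in total, so multiplying a
# category count by 4 gives that category's share in percent.
print({k: sum(v)*4 for k, v in baseline_vec.items()})
print({k: sum(v)*4 for k, v in kb_model_vec.items()})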

{'P': 20, 'C': 28, 'U': 32, 'E': 8, 'R': 12}
{'P': 24, 'C': 16, 'U': 32, 'E': 16, 'R': 12}


In [11]:
from scipy.stats import ttest_rel, ttest_ind

In [13]:
for k in baseline.keys():
    print(f"{k}: stat, p_value: {ttest_rel(baseline_vec[k], kb_model_vec[k])}")

P: stat, p_value: Ttest_relResult(statistic=-0.3763388118272598, pvalue=0.7074703580131823)
C: stat, p_value: Ttest_relResult(statistic=1.3470946333202294, pvalue=0.18102514023295704)
U: stat, p_value: Ttest_relResult(statistic=0.0, pvalue=1.0)
E: stat, p_value: Ttest_relResult(statistic=-1.4214106244380287, pvalue=0.15833990565972564)
R: stat, p_value: Ttest_relResult(statistic=0.0, pvalue=1.0)
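To make the numbers above easier to follow, here is a sketch that recomputes the paired t statistic for category `C` by hand, using only the standard library. The helper `paired_t_statistic` and the `*_c` variable names are hypothetical; the index lists are copied from the `baseline` and `kb_model` dictionaries above. The statistic is the mean of the per-example differences divided by its standard error.

```python
import math


def paired_t_statistic(a, b):
    """t statistic of a paired t-test: mean difference over its standard error."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    # Unbiased sample variance of the differences
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n)


baseline_c = [38, 49, 97, 10, 36, 4, 55]
kb_model_c = [38, 55, 10, 63]

# Indicator vectors over the 100 validation examples
baseline_c_vec = [1 if i in baseline_c else 0 for i in range(100)]
kb_model_c_vec = [1 if i in kb_model_c else 0 for i in range(100)]

t_c = paired_t_statistic(baseline_c_vec, kb_model_c_vec)
print(round(t_c, 4))  # 1.3471
```

Only the examples on which the two models disagree contribute to the statistic: four indices appear only in the baseline list and one only in the KB model list, giving a mean difference of 0.03 over 100 examples.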
