<a href="https://colab.research.google.com/github/usc-isi-i2/kgtk-aaai2023/blob/main/04.2_IdentifyMoralFoundationsInText.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Identify Moral Foundations in text**

In this notebook we use a pre-trained model to identify moral foundations in text. The text can be anything, but it should preferably be about as long as an average tweet, since that is what the model was trained on. In this example we use Telegram messages, as these are publicly available and relatively easy to export from the desktop Telegram client.

Notes:

*   we are using the `bert-base-uncased` tokenizer
*   weights are loaded from the `model_weights.pkl` file
*   data used is the `translated_messages.json` file

---

***Please make sure that you have the GPU runtime selected for this notebook***

    - select Runtime -> Change runtime type -> Hardware accelerator -> GPU 

# **GPU setup**
    - check GPU availability
    - setup torch to use GPU device

In [None]:
#@title Check gpu availability

!nvidia-smi


from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('\n\nYour runtime has {:.1f} gigabytes of available RAM\n\n'.format(ram_gb))

Sun Feb  5 22:14:34 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   59C    P0    28W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
#@title Setup torch to use GPU device

import torch

SEED = 7 #@param {type: "slider", min: 0, max: 100}


# get a count of how many GPU devices are available to us
num_gpu_devices = torch.cuda.device_count()
print("There {} {} GPU device{}".format(
    'is' if num_gpu_devices == 1 else 'are',
    num_gpu_devices,
    '' if num_gpu_devices == 1 else 's' 
))


# manually set the seed when using the gpu
if num_gpu_devices > 0:
    torch.cuda.manual_seed_all(SEED)


# Set the device to use gpu or cpu
if torch.cuda.is_available():
    device = torch.device("cuda:0")
    print("Great, using the GPU!")
else:
    device = torch.device("cpu")
    print("Not great, using the CPU!")
    raise Exception('Check if you have the GPU runtime selected')

There is 1 GPU device
Great, using the GPU!


# **Model**
    - install transformers
    - define labels
    - define model class
    - mount the drive
    - load weights into model

In [None]:
#@title Install the transformers library

%%time
%%capture

!pip install transformers

CPU times: user 59.5 ms, sys: 23.6 ms, total: 83.1 ms
Wall time: 10.7 s


In [None]:
#@title Define moral foundation labels

moral_foundation_labels = [
    'care',
    'harm',
    'fairness',
    'cheating',
    'loyalty',
    'betrayal',
    'authority',
    'subversion',
    'sanctity',
    'degradation',
    'non-moral',
]
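
The ten moral labels above form five virtue/vice pairs (care/harm, fairness/cheating, and so on), with `non-moral` as an eleventh catch-all class. As a minimal sketch, the pairing can be made explicit as shown below. The `foundation_pairs` mapping and the `foundation_of` helper are illustrative names and not part of the notebook.

```python
moral_foundation_labels = [
    'care', 'harm',
    'fairness', 'cheating',
    'loyalty', 'betrayal',
    'authority', 'subversion',
    'sanctity', 'degradation',
    'non-moral',
]

# every label except the trailing 'non-moral' belongs to a (virtue, vice) pair
moral_labels = moral_foundation_labels[:-1]

# hypothetical mapping: the virtue name is used as the name of the foundation
foundation_pairs = {
    virtue: (virtue, vice)
    for virtue, vice in zip(moral_labels[0::2], moral_labels[1::2])
}


def foundation_of(label):
    """Return the foundation a label belongs to, or None for 'non-moral'."""
    for foundation, pair in foundation_pairs.items():
        if label in pair:
            return foundation
    return None


print(foundation_pairs)
print(foundation_of('betrayal'))
```

A mapping like this can be handy when virtue and vice scores should be reported together per foundation.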

In [None]:
#@title Define model class

import pickle
import torch
import transformers


class BERTClass(torch.nn.Module):

    def __init__(self, num_classes):
        super(BERTClass, self).__init__()
        self.l1 = transformers.BertModel.from_pretrained('bert-base-uncased')
        self.dropout = torch.nn.Dropout(0.5)
        self.classifier = torch.nn.Linear(768, num_classes)

    def forward(self, ids, mask, token_type_ids):
        output_1 = self.l1(input_ids=ids, attention_mask=mask, token_type_ids=token_type_ids)
        pooler = self.dropout(output_1.pooler_output)
        output = self.classifier(pooler)
        return output, output_1.last_hidden_state[:, 0, :]

    def save_bert(self, save_path):
        torch.save(self.l1.state_dict(), save_path)

    def save_model(self, save_path):
        with open(save_path, 'wb') as file:
            pickle.dump(self, file)

In [None]:
#@title Mount Google Drive

%%time


from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive
CPU times: user 814 ms, sys: 225 ms, total: 1.04 s
Wall time: 18 s


In [None]:
#@title Load the model

%%time
%%capture

import torch


model_weights_folder_path = "/content/drive/My Drive/KGTK Tutorial/models/"
model_weights_filename = 'model_weights.pkl'
model_weights_file = model_weights_folder_path + model_weights_filename


model = BERTClass(len(moral_foundation_labels))
# map_location moves the stored tensors straight onto the selected device
model.load_state_dict(torch.load(model_weights_file, map_location=device))
model.to(device)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


CPU times: user 3.88 s, sys: 2.45 s, total: 6.34 s
Wall time: 21.3 s


# **Data**
    - load the data
    - preprocess the data
        - generate sentences from messages

In [None]:
#@title Load the data

%%time

import json


# load data from the google drive
data_folder_path = "/content/drive/My Drive/KGTK Tutorial/data/"
data_filename = 'translated_messages.json'
with open(data_folder_path + data_filename, "r") as data_file:
    data = json.load(data_file)


# check how many messages are in that data file
print('{} messages in the data file'.format(len(data['messages'])))

32 messages in the data file
CPU times: user 5.31 ms, sys: 1.49 ms, total: 6.8 ms
Wall time: 787 ms
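
The cells in this notebook rely on only two things in the data file: a top-level `messages` list, and a `translation.text` string inside each message. Below is a minimal sketch of that shape. The message text is invented, and the sketch makes no claim about the other fields a real Telegram export contains.

```python
import json

# hypothetical, minimal stand-in for translated_messages.json
sample = '''
{
    "messages": [
        {"translation": {"text": "First sentence. Second sentence."}},
        {"translation": {"text": "Another message."}}
    ]
}
'''

data_sketch = json.loads(sample)

# the only fields the later cells read
texts = [message['translation']['text'] for message in data_sketch['messages']]

print('{} messages in the sample'.format(len(data_sketch['messages'])))
```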


In [None]:
#@title Generate a list of sentences

%%time

from tqdm import tqdm

import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize


# split messages up into sentences
for message in tqdm(data['messages']):
    message['sentences'] = sent_tokenize(message['translation']['text'])


# combine all sentences in a single list
sentences = [sentence for message in data['messages'] for sentence in message['sentences']]


# check how many sentences there are in total
print('{} sentences'.format(len(sentences)))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
100%|██████████| 32/32 [00:00<00:00, 2882.06it/s]

82 sentences
CPU times: user 606 ms, sys: 140 ms, total: 746 ms
Wall time: 1.79 s
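
Flattening the sentences into one list drops the link back to the message each sentence came from. The notebook relies on iterating in the same order later to recover it. As an alternative sketch, an explicit index can be kept alongside the flat list. The `sentence_to_message` list and the toy data are assumptions for illustration.

```python
# toy stand-in for data['messages'] after sentence splitting
messages = [
    {'sentences': ['A one.', 'A two.']},
    {'sentences': ['B one.']},
]

sentences_sketch = []      # flat list, in the same order as the notebook builds it
sentence_to_message = []   # sentence_to_message[i] is the message index of sentence i

for message_index, message in enumerate(messages):
    for sentence in message['sentences']:
        sentences_sketch.append(sentence)
        sentence_to_message.append(message_index)

print(list(zip(sentence_to_message, sentences_sketch)))
```

With such an index, per-sentence scores could be grouped or averaged per message without depending on loop order.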





# **Processing**
    - set batch size config variable
    - handle tokenization
    - handle validation

In [None]:
#@title Set VALID_BATCH_SIZE

VALID_BATCH_SIZE = 4 #@param {type: "slider", min: 1, max: 10}

In [None]:
#@title Handle tokenization

%%time
%%capture


import torch
from transformers import BertTokenizer
from torch.utils.data import DataLoader
from torch.utils.data import TensorDataset


def handle_tokenize(texts, tokenizer, labels=None):
    encoding = tokenizer(texts, return_tensors='pt', padding=True, truncation=True)
    ids = encoding['input_ids']  # default max_seq 512
    mask = encoding['attention_mask']
    token_type_ids = encoding['token_type_ids']

    if labels is not None:
        targets = torch.tensor(labels)
        return TensorDataset(ids, mask, token_type_ids, targets)
    else:
        return TensorDataset(ids, mask, token_type_ids)


tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
test_set = handle_tokenize(texts=sentences, tokenizer=tokenizer)
testing_loader = DataLoader(test_set, batch_size=VALID_BATCH_SIZE, shuffle=False, num_workers=4)

CPU times: user 223 ms, sys: 11.6 ms, total: 235 ms
Wall time: 5.37 s
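
With `padding=True`, the tokenizer pads every sentence in the call to the length of the longest one and returns an attention mask that marks real tokens with 1 and padding with 0. The sketch below imitates that behaviour in plain Python on made-up token ids. `pad_batch` is a hypothetical helper, and the pad id of 0 is an assumption chosen to match the `[PAD]` token of `bert-base-uncased`.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to a common length and build the attention mask."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return input_ids, attention_mask


# made-up token ids standing in for two tokenized sentences of different length
toy_sequences = [
    [101, 7592, 102],
    [101, 7592, 2088, 999, 102],
]

toy_ids, toy_mask = pad_batch(toy_sequences)
for ids_row, mask_row in zip(toy_ids, toy_mask):
    print(ids_row, mask_row)
```

Because `handle_tokenize` passes all sentences in a single call, every row is padded to the longest sentence in the whole data set, not just the longest in its batch.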


In [None]:
#@title Handle validation

%%time
%%capture


import numpy as np
from torch.utils.data import DataLoader


def handle_validation(val_loader, model):
    model.eval()
    fin_outputs = []
    fin_embeddings = []

    with torch.no_grad():
        for step, batch in enumerate(val_loader):
            batch = [r.to(device) for r in batch]

            if len(batch) == 4:
                # labelled data: the labels are not needed for inference
                ids, mask, token_type_ids, _ = batch
            elif len(batch) == 3:
                ids, mask, token_type_ids = batch
            else:
                raise ValueError('Expected 3 or 4 tensors per batch, got {}'.format(len(batch)))

            outputs, embeddings = model(ids, mask, token_type_ids)

            # big_val, big_idx = torch.max(outputs.data, dim=1)
            fin_outputs.append(outputs.cpu().detach().numpy())
            fin_embeddings.append(embeddings.cpu().detach().numpy())

    fin_outputs = np.concatenate(fin_outputs, axis=0)
    fin_embeddings = np.concatenate(fin_embeddings, axis=0)

    return fin_outputs, fin_embeddings


testing_loader = DataLoader(test_set, batch_size=VALID_BATCH_SIZE, shuffle=False, num_workers=4)
MF_outputs, _ = handle_validation(testing_loader, model)
MF_outputs = torch.nn.functional.softmax(torch.Tensor(MF_outputs), dim=-1)
MF_outputs = MF_outputs.numpy()

CPU times: user 1.13 s, sys: 594 ms, total: 1.72 s
Wall time: 4.09 s
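
The classifier returns one raw score (logit) per label, and the softmax call above turns each row into probabilities that sum to 1. A minimal sketch of that step, and of picking the highest-scoring label, follows. It uses plain Python and made-up logits, and the helper names `softmax` and `top_label` are illustrative and not part of the notebook.

```python
import math

# shortened label list, for illustration only
labels = ['care', 'harm', 'non-moral']


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_label(probs, labels):
    """Return the label with the highest probability and that probability."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


example_logits = [0.5, 1.0, 3.0]  # hypothetical model outputs for one sentence
probs = softmax(example_logits)
label, score = top_label(probs, labels)

print(label, round(score, 3))
```

Applied to a row of `MF_outputs`, the same idea reduces the eleven scores to a single predicted label per sentence.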


# **Output**
    - print sentences
    - print moral foundation scores

In [None]:
#@title Print moral foundation scores

# Use NumpyEncoder to convert numpy data to list
# Error: Object of type int64 is not JSON serializable

import json
from tqdm import tqdm


class NumpyEncoder(json.JSONEncoder):
    """ Custom encoder for numpy data types """
    def default(self, obj):
        # the abstract NumPy base types cover every concrete width and avoid
        # the np.float_ and np.complex_ aliases that NumPy 2.0 removed
        if isinstance(obj, np.integer):
            return int(obj)

        elif isinstance(obj, np.floating):
            return float(obj)

        elif isinstance(obj, np.complexfloating):
            return {'real': obj.real, 'imag': obj.imag}

        elif isinstance(obj, np.ndarray):
            return obj.tolist()

        elif isinstance(obj, np.bool_):
            return bool(obj)

        elif isinstance(obj, np.void):
            return None

        return json.JSONEncoder.default(self, obj)


index = 0
for message in tqdm(data['messages']):
    for sentence in message['sentences']:
        print(sentence)
        print(json.dumps(
            dict(zip(moral_foundation_labels, MF_outputs[index])),
            indent=4,
            ensure_ascii=False,
            cls=NumpyEncoder,
        ))
        print()
        index += 1

100%|██████████| 32/32 [00:00<00:00, 827.10it/s]

The controversy over the comedy video, starring three children acting a scene of Mozambican traffic police, continues today, with the journalists' union trading accusations with the provincial prosecutor in Manica.
{
    "care": 0.004591815173625946,
    "harm": 0.031666070222854614,
    "fairness": 0.014475807547569275,
    "cheating": 0.08223871141672134,
    "loyalty": 0.004306383430957794,
    "betrayal": 0.025487428531050682,
    "authority": 0.0037580952048301697,
    "subversion": 0.01861969567835331,
    "sanctity": 0.0021741881500929594,
    "degradation": 0.0076093487441539764,
    "non-moral": 0.805072546005249
}

Zitamar has subtitled the video in English for our readers to see what the fuss is about, below.
{
    "care": 0.002934803254902363,
    "harm": 0.003654651576653123,
    "fairness": 0.002017510123550892,
    "cheating": 0.002186475321650505,
    "loyalty": 0.0017890357412397861,
    "betrayal": 0.0012521755415946245,
    "authority": 0.0009945179335772991,
    "su


