# Checking Embeddings of Terms (Noun/Verb/Adj/etc.) from Tagged Wordnet Gloss

I discovered there's a more active fork of wordnet and bumped this analysis over to that.

## Get preprocessed wordnet categoricals data, extract part of speech columns

In [None]:
import pandas as pd
df = pd.read_csv("https://huggingface.co/datasets/segyges/openwordnet-categoricals/resolve/main/openwordnet-categoricals.csv", keep_default_na=False)

In [None]:
df['members'].isna().sum()

0

In [None]:
df.head()

Unnamed: 0,members,a,n,r,s,v,adj.all,adj.pert,adj.ppl,adv.all,...,verb.consumption,verb.contact,verb.creation,verb.emotion,verb.motion,verb.perception,verb.possession,verb.social,verb.stative,verb.weather
0,.22,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,.22 caliber,1,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,.22 calibre,1,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3,.22-caliber,1,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,.22-calibre,1,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
df_pos = df[['members', 'a', 'n', 'r', 's', 'v']]

In [None]:
df_pos.head()

Unnamed: 0,members,a,n,r,s,v
0,.22,0,1,0,0,0
1,.22 caliber,1,0,0,0,0
2,.22 calibre,1,0,0,0,0
3,.22-caliber,1,0,0,0,0
4,.22-calibre,1,0,0,0,0


In [None]:
df_pos.shape

(153361, 6)

Helper functions copy-pasted from notebooks/Pythia-12B Embedding Analysis.ipynb

In [None]:
!git lfs clone https://huggingface.co/jstephencorey/pythia-12b-embeddings.git ./embeds/

          with new flags from 'git clone'

'git clone' has been updated in upstream Git to have comparable
speeds to 'git lfs clone'.
Cloning into './embeds'...
remote: Enumerating objects: 10, done.
remote: Counting objects: 100% (6/6), done.
remote: Compressing objects: 100% (6/6), done.
remote: Total 10 (delta 1), reused 0 (delta 0), pack-reused 4
Unpacking objects: 100% (10/10), 3.26 KiB | 222.00 KiB/s, done.


## Preprocessing tokenizer

We need to get tokens corresponding to wordnet terms. This is slightly complex.

In [None]:
import torch
import torch.nn as nn
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(device)
import matplotlib.pyplot as plt
import numpy as np
embedding_filename = "./embeds/pythia-12b.pth"
embedding_layer = torch.load(embedding_filename, map_location=torch.device(device))
embedding_weights = embedding_layer.weight.data.cpu().numpy()
print(embedding_weights)

cpu
[[ 0.00230217 -0.00296211  0.00490189 ... -0.00183678  0.00023139
  -0.00206184]
 [ 0.01398468 -0.00607681  0.02134705 ... -0.00453568 -0.0137558
  -0.00044656]
 [ 0.00622177 -0.01074982  0.00720215 ...  0.00630188  0.00068235
  -0.02189636]
 ...
 [-0.01255035 -0.00182629 -0.0049057  ... -0.00560379  0.00989532
   0.01010895]
 [ 0.00396347  0.00630188 -0.01152802 ... -0.00132656 -0.01219177
   0.00511551]
 [ 0.00543213 -0.00561523 -0.01293182 ...  0.00473022 -0.00455093
  -0.0066452 ]]


In [None]:
type(embedding_layer)

In [None]:
from transformers import AutoTokenizer
model_name = "EleutherAI/pythia-12b"
tokenizer = AutoTokenizer.from_pretrained(model_name)



In [None]:
[key for key in tokenizer.vocab.keys() if "dog" in key.lower()]

['ĠDog', 'Ġdog', 'Ġendogenous', 'Ġdogs', 'dog', 'ĠDogs', 'Dog']

In [None]:
[ord(key[0]) for key in tokenizer.vocab.keys() if "dog" in key.lower()]

[288, 288, 288, 288, 100, 288, 68]

We have capitalized and uncapitalized forms, plural and singular forms, and forms with and without the oddball leading `Ġ` (code point 288), which marks a token that is preceded by a space.

Tokenizers are cursed, and this is evil.
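
Where the 288 comes from can be sketched without downloading anything. The table below follows the byte-to-character scheme used by GPT-2-style byte-level BPE; that the Pythia tokenizer uses the same scheme is an assumption here, though the `Ġ` entries in its vocab suggest it. `bytes_to_unicode` and `to_vocab_form` are illustrative helper names, not part of the notebook.

```python
def bytes_to_unicode():
    """Byte -> printable-character table in the style of GPT-2's byte-level BPE."""
    # Bytes that are already printable keep their own code point.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    # Every other byte (space, control characters, ...) is shifted past 255.
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


BYTE_TABLE = bytes_to_unicode()


def to_vocab_form(text):
    """Render raw text the way it would be spelled inside the vocab."""
    return "".join(BYTE_TABLE[b] for b in text.encode("utf-8"))


print(ord(BYTE_TABLE[ord(" ")]))  # 288
print(to_vocab_form(" dog"))      # Ġdog
print(to_vocab_form("Ġdog"))      # the literal Ġ turns into two other characters
```

Under this scheme, a real space in the input becomes `Ġ` in the vocab, while a literal `Ġ` typed into the input is treated as ordinary text and spelled with its own two bytes.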

In [None]:
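def get_targets(data):
  X, y, terms = [], [], []
  for i, row in data.iterrows():
    lowercased = row['members'].lower()
    capitalized = row['members'].capitalize()
    # chr(288) is 'Ġ', the vocab's marker for a leading space.
    # Caveat: tokenizer() takes raw text, so a literal 'Ġ' is probably encoded as
    # its own bytes rather than as a leading space; ' ' + word is the usual way
    # to reach the 'Ġword' tokens.
    leading_space = chr(288)
    options = [lowercased, capitalized, leading_space + lowercased, leading_space + capitalized]
    for option in options:
      ids = tokenizer(option, add_special_tokens=False, padding=False)['input_ids']
      # Keep only surface forms that map to exactly one token.
      # Terms without case (e.g. '0') produce identical options and are added twice.
      if len(ids) == 1:
        # Look up that token's embedding vector.
        token = embedding_layer(torch.tensor(ids, device=device)).detach().cpu().squeeze().numpy()
        X.append(token)
        # Labels: the five part-of-speech indicator columns (a, n, r, s, v).
        y.append(row.drop('members').values)
        terms.append(option)
  return X, y, terms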
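
The same variant search can be sketched as a direct vocab lookup, which also avoids adding a term twice when its lowercase and capitalized forms coincide. This is a minimal sketch under assumptions: `single_token_variants` is a hypothetical helper, `toy_vocab` stands in for `tokenizer.vocab` with made-up ids, and being present in the vocab is not a guarantee that the tokenizer emits that token for the text.

```python
def single_token_variants(term, vocab):
    """Return the case / leading-space variants of `term` that are vocab entries."""
    lowercased = term.lower()
    capitalized = term.capitalize()
    marker = "Ġ"  # vocab spelling of a leading space
    candidates = [lowercased, capitalized, marker + lowercased, marker + capitalized]
    # dict.fromkeys drops duplicates while preserving order.
    unique = list(dict.fromkeys(candidates))
    return [c for c in unique if c in vocab]


# Toy stand-in for tokenizer.vocab, built from the "dog" entries listed above plus "0".
toy_vocab = {
    tok: i
    for i, tok in enumerate(
        ["ĠDog", "Ġdog", "Ġendogenous", "Ġdogs", "dog", "ĠDogs", "Dog", "0"]
    )
}

print(single_token_variants("dog", toy_vocab))  # ['dog', 'Dog', 'Ġdog', 'ĠDog']
print(single_token_variants("0", toy_vocab))    # ['0']
```

Swapping this in would change which rows end up in `X` and `y`, so the counts recorded in this notebook would no longer apply.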

In [None]:
X, y, terms = get_targets(df_pos)

In [None]:
X[:3], y[:3], terms[:3], len(X), len(y), len(terms)

([array([ 0.00043392, -0.02088928, -0.00630188, ..., -0.00092554,
          0.0048027 ,  0.02067566], dtype=float32),
  array([ 0.00043392, -0.02088928, -0.00630188, ..., -0.00092554,
          0.0048027 ,  0.02067566], dtype=float32),
  array([ 0.0116806 , -0.01187897,  0.00033426, ...,  0.00914001,
          0.0019331 ,  0.00630951], dtype=float32)],
 [array([0, 1, 0, 1, 0], dtype=object),
  array([0, 1, 0, 1, 0], dtype=object),
  array([0, 1, 0, 1, 0], dtype=object)],
 ['0', '0', '1'],
 7023,
 7023,
 7023)

In [None]:
np.vstack(X).shape

(7023, 5120)

In [None]:
y = [arr.astype(int) for arr in y]

## Start training stuff

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1337)
len(X_train), len(X_test), len(y_train), len(y_test)

(5267, 1756, 5267, 1756)

In [None]:
from sklearn.multioutput import MultiOutputClassifier
from sklearn.linear_model import LogisticRegression

# Kept for later reference because we might want to throw it back in
clf = MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(X_train, y_train)
clf

In [None]:
y_pred_proba = clf.predict_proba(X_test)

From here on down are, I think, the *good* metrics that we *definitely* want; precision/recall/F1 are cool but incomplete.
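
One way to see why thresholded metrics are incomplete: a classifier can rank every positive above every negative and still score zero recall if none of its probabilities reach 0.5. The sketch below uses made-up toy scores and plain-Python helpers (`recall_at_threshold` and `average_precision` are illustrative, not the scikit-learn functions; the average precision here assumes no tied scores).

```python
def recall_at_threshold(y_true, scores, threshold=0.5):
    """Fraction of positives whose score reaches the threshold."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    hits = sum(1 for s in positives if s >= threshold)
    return hits / len(positives)


def average_precision(y_true, scores):
    """Mean of precision@rank taken at the rank of each positive (no ties assumed)."""
    ranked = sorted(zip(scores, y_true), reverse=True)
    n_pos = sum(y_true)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / n_pos


# Toy data: both positives are ranked first, but every score is below 0.5.
y_true = [1, 1, 0, 0, 0]
scores = [0.40, 0.30, 0.20, 0.10, 0.05]

print(recall_at_threshold(y_true, scores))  # 0.0
print(average_precision(y_true, scores))    # 1.0
```

The ranking-based number is unaffected by where the cutoff sits, which is why average precision and the AUCs are worth reporting alongside recall.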

In [None]:
import pandas as pd
import numpy as np
from sklearn.metrics import average_precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_recall_curve, roc_auc_score, roc_curve, auc

def eval_preds(y_test, y_pred):
    """
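    Takes in (y_test, y_pred), where y_pred holds positive-class probabilities, and outputs per-class average precision, per-class recall at a 0.5 threshold, per-class PR_AUC and ROC_AUC, and micro-averaged average precision and recall.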
    But Gyges ran out of steam so Clyde here taking over to do what i do best. ctrl+c ctrl+v code and take credit.
    :)
    """

    n_classes = 5

    # Convert y_test and y_pred to numpy arrays if they are not already
    y_test = np.array(y_test)
    y_pred = np.array(y_pred)
    y_pred_onehot = (y_pred >= 0.5).astype(int)

    # Average precision score for all classes
    micro_precision = average_precision_score(y_test, y_pred, average='micro')

    # Calculate average precision score for each class separately
    classwise_avg_precision = average_precision_score(y_test, y_pred, average=None)

    # Calculate recall score for each class separately
    classwise_recall = recall_score(y_test, y_pred_onehot, average=None)

    # Calculate micro-averaged recall
    micro_recall = recall_score(y_test, y_pred_onehot, average='micro')

    # PR AUC + ROC AUC
    # Calculate Precision-Recall curve and area under the curve (PR AUC) for each class separately
    precision = dict()
    recall = dict()
    pr_auc = dict()

    for i in range(n_classes):  # Assuming n_classes is the number of classes
        precision[i], recall[i], _ = precision_recall_curve(y_test[:, i], y_pred[:, i])
        pr_auc[i] = auc(recall[i], precision[i])

    # Calculate ROC curve and area under the curve (ROC AUC) for each class separately
    fpr = dict()
    tpr = dict()
    roc_auc = dict()

    for i in range(n_classes):  # Assuming n_classes is the number of classes
        fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_pred[:, i])
        roc_auc[i] = roc_auc_score(y_test[:, i], y_pred[:, i])

    df_dict = {"avg precision per class": list(classwise_avg_precision),
               "avg recall per class": list(classwise_recall),
               "PR AUC": list(pr_auc.values()),
               "ROC AUC": list(roc_auc.values()),
               "micro precision": [micro_precision],
               "micro recall": [micro_recall]}
    return(df_dict)

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

classifier = MultiOutputClassifier(make_pipeline(StandardScaler(), SVC(kernel='sigmoid', probability=True)))

# predict_proba returns one (n_samples, 2) array per label; column 1 is P(label == 1).
y_pred_proba = classifier.fit(X_train, y_train).predict_proba(X_test)
positive_probs = np.stack([proba[:, 1] for proba in y_pred_proba], axis=1)  # (n_samples, n_labels)
results = eval_preds(y_test, positive_probs)

In [None]:
results

[{'avg precision per class': [0.6586016975656442,
   0.9753253577314361,
   0.6986351407517849,
   0.6883862047231798,
   0.7479282166952557],
  'avg recall per class': [0.5789473684210527,
   0.945031712473573,
   0.5590551181102362,
   0.5607235142118863,
   0.6028368794326241],
  'PR AUC': [0.6577229053887013,
   0.9753178081995846,
   0.6973017443262605,
   0.6877151415576875,
   0.7476867421923145],
  'ROC AUC': [0.9290122393680537,
   0.9126960725884193,
   0.9206072997781354,
   0.8737804429193494,
   0.8561323420914846],
  'micro precision': [0.8892837660115932],
  'micro recall': [0.7710091743119266]},
 {'avg precision per class': [0.6359689800849596,
   0.9631206957870838,
   0.5559651222270277,
   0.7449500564987556,
   0.6901984941598085],
  'avg recall per class': [0.0043859649122807015,
   0.9901338971106413,
   0.0,
   0.14470284237726097,
   0.0],
  'PR AUC': [0.633440362140717,
   0.9630386871262635,
   0.5520140935140702,
   0.7443696658162436,
   0.689329930266571],


In [None]:
device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps"
    if torch.backends.mps.is_available()
    else "cpu"
)
print(f"Using {device} device")

Using cpu device


In [None]:
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import numpy as np


class PredictingNeuralNetwork(nn.Module):
    def __init__(self, dropout_prob=0.5):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(5120, 256),
            nn.ReLU(),
            nn.Dropout(p=dropout_prob),  # Applying dropout
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Dropout(p=dropout_prob),  # Applying dropout
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Dropout(p=dropout_prob),  # Applying dropout
            nn.Linear(256, 5),
            nn.Sigmoid()
        )

    def forward(self, x):
        return self.layers(x)

In [None]:
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import numpy as np

# Assuming you have your data X_train, y_train, X_test, and y_test where X_train, X_test are of shape (N, 5120) and y_train, y_test are of shape (N, 5)
# Convert your data to PyTorch tensors
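# Stack the lists of arrays first; building a tensor straight from a list of ndarrays is slow.
X_train = torch.tensor(np.array(X_train), dtype=torch.float32)
y_train = torch.tensor(np.array(y_train), dtype=torch.float32)
X_test = torch.tensor(np.array(X_test), dtype=torch.float32)
y_test = torch.tensor(np.array(y_test), dtype=torch.float32)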

# Create DataLoaders for training and testing
train_dataset = TensorDataset(X_train, y_train)
train_loader = DataLoader(train_dataset, batch_size=1024, shuffle=True)

test_dataset = TensorDataset(X_test, y_test)
test_loader = DataLoader(test_dataset, batch_size=1024, shuffle=False)  # No need to shuffle the test data

# Define your model
model = PredictingNeuralNetwork()

# Define the loss function
criterion = nn.BCELoss()

# Define the optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=0.001, weight_decay=15)

# Training loop
num_epochs = 100
for epoch in range(num_epochs):
    # Training
    model.train()  # Set the model to training mode
    train_loss = 0.0
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()
        train_loss += loss.item() * inputs.size(0)
    train_loss /= len(train_loader.dataset)

    # Testing
    model.eval()  # Set the model to evaluation mode
    test_loss = 0.0
    with torch.no_grad():
        for inputs, targets in test_loader:
            outputs = model(inputs)
            loss = criterion(outputs, targets)
            test_loss += loss.item() * inputs.size(0)
    test_loss /= len(test_loader.dataset)

    # Print training and test losses for this epoch
    print(f"Epoch {epoch+1}/{num_epochs}, Train Loss: {train_loss}, Test Loss: {test_loss}")



Epoch 1/100, Train Loss: 0.683758481315673, Test Loss: 0.6642460734806191
Epoch 2/100, Train Loss: 0.6424453808841701, Test Loss: 0.5863002051253525
Epoch 3/100, Train Loss: 0.5340377872133246, Test Loss: 0.4775540587027687
Epoch 4/100, Train Loss: 0.45933618447869784, Test Loss: 0.4357839479940627
Epoch 5/100, Train Loss: 0.4131454370077045, Test Loss: 0.40768428992030287
Epoch 6/100, Train Loss: 0.3823464495171083, Test Loss: 0.38084079810590027
Epoch 7/100, Train Loss: 0.35775881607346455, Test Loss: 0.3718527076863482
Epoch 8/100, Train Loss: 0.34602904229796044, Test Loss: 0.3673063986393748
Epoch 9/100, Train Loss: 0.3384610795139565, Test Loss: 0.3680144356975251
Epoch 10/100, Train Loss: 0.33165369884597073, Test Loss: 0.36416142717309313
Epoch 11/100, Train Loss: 0.32712374664086286, Test Loss: 0.3649584336677282
Epoch 12/100, Train Loss: 0.32233608100589456, Test Loss: 0.36548876314337
Epoch 13/100, Train Loss: 0.3175627965729762, Test Loss: 0.36572979909955505
Epoch 14/100, 

In [None]:
model.eval()

with torch.no_grad():
  # Sigmoid outputs, shape (n_samples, 5); kept as a tensor here.
  y_pred_probs = model(X_test)

# eval_preds works on numpy arrays, so convert both tensors.
eval_preds(y_test.numpy(), y_pred_probs.numpy())

In [None]:
y_pred_probs.shape

torch.Size([1756, 5])