<a href="https://colab.research.google.com/github/tanwljamie/Automatic-KG-Construction/blob/main/KGE_ESZSL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')
root_dir = '/content/gdrive/MyDrive/caltech_birds-master/Caltech-Birddata_full/CUB_200_2011/'

# Ampligraph

In [None]:
!pip install ampligraph; 

In [None]:
%tensorflow_version 1.x 
import numpy as np
import pandas as pd
import ampligraph

ampligraph.__version__

In [65]:
import requests

from ampligraph.datasets import load_from_csv, load_from_ntriples, load_from_rdf
X = load_from_ntriples('/content/gdrive/MyDrive/caltech_birds-master/','all_type_labelgraph_31jul.nt')


In [66]:
entities = np.unique(np.concatenate([X[:, 0], X[:, 2]]))[:200]
print(entities)

['1-D extent^^<http://www.w3' '2-D shape^^<http://www.w3'
 '<http://example.org/AcadianFlycatcher>'
 '<http://example.org/AmericanCrow>'
 '<http://example.org/AmericanGoldfinch>'
 '<http://example.org/AmericanPipit>'
 '<http://example.org/AmericanRedstart>'
 '<http://example.org/AmericanThreetoedWoodpecker>'
 '<http://example.org/AnnaHummingbird>' '<http://example.org/ArticTern>'
 '<http://example.org/BairdSparrow>'
 '<http://example.org/BaltimoreOriole>' '<http://example.org/BankSwallow>'
 '<http://example.org/BarnSwallow>'
 '<http://example.org/BaybreastedWarbler>'
 '<http://example.org/BeltedKingfisher>' '<http://example.org/BewickWren>'
 '<http://example.org/BlackTern>'
 '<http://example.org/BlackandwhiteWarbler>'
 '<http://example.org/BlackbilledCuckoo>'
 '<http://example.org/BlackcappedVireo>'
 '<http://example.org/BlackfootedAlbatross>'
 '<http://example.org/BlackthroatedBlueWarbler>'
 '<http://example.org/BlackthroatedSparrow>'
 '<http://example.org/BlueGrosbeak>' '<http://exam

In [67]:
relations = np.unique(X[:, 1])
relations

array(['<http://purl.obolibrary.org/obo/UBERON_0000022>',
       '<http://purl.obolibrary.org/obo/UBERON_0000023>',
       '<http://purl.obolibrary.org/obo/UBERON_0000033>',
       '<http://purl.obolibrary.org/obo/UBERON_0000128>',
       '<http://purl.obolibrary.org/obo/UBERON_0000180>',
       '<http://purl.obolibrary.org/obo/UBERON_0000310>',
       '<http://purl.obolibrary.org/obo/UBERON_0000341>',
       '<http://purl.obolibrary.org/obo/UBERON_0000916>',
       '<http://purl.obolibrary.org/obo/UBERON_0000970>',
       '<http://purl.obolibrary.org/obo/UBERON_0000974>',
       '<http://purl.obolibrary.org/obo/UBERON_0000978>',
       '<http://purl.obolibrary.org/obo/UBERON_0001137>',
       '<http://purl.obolibrary.org/obo/UBERON_0001443>',
       '<http://purl.obolibrary.org/obo/UBERON_0001456>',
       '<http://purl.obolibrary.org/obo/UBERON_0001467>',
       '<http://purl.obolibrary.org/obo/UBERON_0001567>',
       '<http://purl.obolibrary.org/obo/UBERON_0001684>',
       '<http:

In [68]:
from ampligraph.evaluation import train_test_split_no_unseen 

num_test = int(len(X) * (20 / 100))
data = {}
data['train'], data['test'] = train_test_split_no_unseen(X, test_size=num_test, seed=0, allow_duplication=False) 

In [69]:
# ComplEx is the model instantiated below, so import it here
from ampligraph.latent_features import ComplEx

Let's go through the parameters to understand what's going on:

- **`k`** : the dimensionality of the embedding space.
- **`eta`** ($\eta$) : the number of negative (false) triples generated at training time for each positive (true) triple.
- **`batches_count`** : the number of batches the training set is split into during the training loop. If you run into low-memory issues, setting this to a higher number may help.
- **`epochs`** : the number of epochs to train the model for.
- **`optimizer`** : the Adam optimizer, with the learning rate set via the *optimizer_params* kwarg (3e-5 in the cell below).
- **`loss`** : the loss function; the cell below uses the multiclass negative log-likelihood (`multiclass_nll`).
- **`regularizer`** : $L_p$ regularization, with $p$ and $\lambda$ set via the *regularizer_params* kwarg ($p=3$, $\lambda$ = 1e-5 in the cell below).

Now we can instantiate the model:


In [70]:
amp = ComplEx(batches_count=99, 
                seed=0, 
                epochs=500, 
                k=150, 
                eta=5,
                optimizer='adam', 
                optimizer_params={'lr':3e-5}, #1e-3
                loss='multiclass_nll', 
                regularizer='LP', 
                regularizer_params={'p':3, 'lambda':1e-5}, 
                verbose=True)

## Filtering negatives

AmpliGraph aims to follow scikit-learn's ease-of-use design philosophy and simplify everything down to **`fit`**, **`evaluate`**, and **`predict`** functions. 

However, there are some knowledge-graph-specific steps we must take to ensure our model can be trained and evaluated correctly. The first of these is defining the filter that ensures no *negative* statements generated by the corruption procedure are actually positives. The filter is the concatenation of the train and test sets, which here is simply the full triple set `X`. When negative triples are generated by the corruption strategy, they can then be checked against this set so that true statements are not counted as negatives.


In [71]:
positives_filter = X
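
To make the filtering idea concrete, here is a minimal pure-Python sketch of object-side corruption for a single triple, followed by removal of any corruption that is a known positive. The toy triples and the helper name `filtered_object_corruptions` are illustrative assumptions; this is not how AmpliGraph implements the step internally.

```python
# Toy knowledge graph; these triples are made up for illustration only.
toy_positives = {
    ("AmericanCrow", "hasPart", "beak"),
    ("AmericanCrow", "hasPart", "wing"),
    ("BarnSwallow", "hasPart", "wing"),
}
toy_entities = sorted({e for s, _, o in toy_positives for e in (s, o)})

def filtered_object_corruptions(triple, entities, positives):
    """Replace the object with every other entity, then drop known true triples."""
    s, p, o = triple
    candidates = [(s, p, e) for e in entities if e != o]
    return [c for c in candidates if c not in positives]

toy_negatives = filtered_object_corruptions(
    ("AmericanCrow", "hasPart", "beak"), toy_entities, toy_positives
)
print(toy_negatives)
```

Without the filter, `("AmericanCrow", "hasPart", "wing")` would be treated as a negative even though it is a true statement in the toy graph.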

## Fitting the model

Once you run the next cell the model will train. 

On a modern laptop this should take ~3 minutes (although your mileage may vary, especially if you've changed any of the hyper-parameters above).

In [72]:
import tensorflow as tf
tf.logging.set_verbosity(tf.logging.ERROR)

amp.fit(data['train'], early_stopping = False)

Average ComplEx Loss:   0.257889: 100%|██████████| 500/500 [03:19<00:00,  2.51epoch/s]


In [73]:
from ampligraph.evaluation import evaluate_performance

In [None]:
ranks = evaluate_performance(data['test'], 
                             model=amp, 
                             filter_triples=positives_filter,   # Corruption strategy filter defined above 
                             use_default_protocol=True, # corrupt subj and obj separately while evaluating
                             verbose=True)

In [None]:
from ampligraph.evaluation import mr_score, mrr_score, hits_at_n_score

mrr = mrr_score(ranks)
print("MRR: %.2f" % (mrr))

hits_10 = hits_at_n_score(ranks, n=10)
print("Hits@10: %.2f" % (hits_10))
hits_3 = hits_at_n_score(ranks, n=3)
print("Hits@3: %.2f" % (hits_3))
hits_1 = hits_at_n_score(ranks, n=1)
print("Hits@1: %.2f" % (hits_1))
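
As a reference for what these scores mean, the sketch below computes the mean reciprocal rank and Hits@N from a list of ranks with plain NumPy. The function names `mrr` and `hits_at_n` and the ranks in `toy_ranks` are hypothetical and only serve the illustration; the notebook itself relies on the AmpliGraph functions imported above.

```python
import numpy as np

def mrr(ranks):
    """Mean reciprocal rank: the average of 1/rank over all ranks."""
    ranks = np.asarray(ranks, dtype=float)
    return float(np.mean(1.0 / ranks))

def hits_at_n(ranks, n):
    """Fraction of ranks that are less than or equal to n."""
    ranks = np.asarray(ranks)
    return float(np.mean(ranks <= n))

toy_ranks = [1, 2, 4, 10, 50]  # hypothetical ranks, 1 is the best possible
print("MRR: %.2f" % mrr(toy_ranks))
print("Hits@10: %.2f" % hits_at_n(toy_ranks, 10))
print("Hits@1: %.2f" % hits_at_n(toy_ranks, 1))
```

Both metrics lie in (0, 1], and higher is better.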

---
# Extract KGE

In [76]:
import numpy as np
from ampligraph.latent_features import ComplEx
kg_embeddings = amp.get_embeddings(entities)
print(kg_embeddings.shape)

(200, 200)


In [77]:
df = pd.read_excel('/content/gdrive/MyDrive/caltech_birds-master/aab_bird_desc.xlsx')
color = df['Family']
order = df['Order']

In [None]:
import numpy as np
from sklearn.manifold import TSNE
import seaborn as sns
X_embedded = TSNE(n_components=2, perplexity = 20, learning_rate='auto',init='random').fit_transform(kg_embeddings)
sns.set(rc={'figure.figsize':(10,8)})
# Pass x and y as keywords; recent seaborn versions no longer accept them positionally
sns.scatterplot(x=X_embedded[:,0], y=X_embedded[:,1], hue=order, legend=False)

# ESZSL

In [80]:
import numpy as np
import os
import scipy.io
from sklearn.metrics import classification_report,confusion_matrix

In [81]:
#Set the dataset folder name here to run on a different dataset.
dataset = 'CUB'

From the .mat files extract all the features from resnet and the attribute splits. 
- The res101 contains features and the corresponding labels.
- att_splits contains the different splits for trainval, train, val and test set.

In [82]:
res101 = scipy.io.loadmat('/content/gdrive/MyDrive/caltech_birds-master/CUB/res101.mat')
#res101 = models.resnet50(pretrained=True)
att_splits = scipy.io.loadmat('/content/gdrive/MyDrive/caltech_birds-master/CUB/att_splits.mat')

In [None]:
res101

In [None]:
att_splits

In [90]:
#Using the correct naming conventions to get the locations
trainval_loc = 'trainval_loc'
train_loc = 'train_loc'
val_loc = 'val_loc'
test_loc = 'test_unseen_loc'

We need the ground-truth class label of every example in the train, val, trainval and test sets, selected according to the split locations provided.
This example uses the `CUB` dataset, which has 200 unique classes overall.

In [91]:
labels = res101['labels']
labels_train = labels[np.squeeze(att_splits[train_loc]-1)] #get labels for training
labels_val = labels[np.squeeze(att_splits[val_loc]-1)]
labels_trainval = labels[np.squeeze(att_splits[trainval_loc]-1)]
labels_test = labels[np.squeeze(att_splits[test_loc]-1)]


In [92]:
labels_train[:10,:]

array([[151],
       [151],
       [151],
       [151],
       [151],
       [151],
       [151],
       [151],
       [151],
       [151]], dtype=uint8)

In [93]:
unique_labels = np.unique(labels)
unique_labels

array([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,
        14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,
        27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,  39,
        40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,  52,
        53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,  65,
        66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,  78,
        79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,  91,
        92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103, 104,
       105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117,
       118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130,
       131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143,
       144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156,
       157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169,
       170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 18

In a typical zero-shot learning scenario, there are no overlapping classes between the training and testing phases, i.e., the train classes are completely different from the test classes. Let us verify that no classes overlap between the train and test splits.
- During training phase we have `z` classes
- During the testing phase we have `z'` classes

In [94]:
train_labels_seen = np.unique(labels_train)
val_labels_unseen = np.unique(labels_val)
trainval_labels_seen = np.unique(labels_trainval)
test_labels_unseen = np.unique(labels_test)
print(train_labels_seen)

[  1   2   5   6   8   9  10  11  13  14  15  16  17  18  20  22  23  24
  25  26  27  28  30  31  33  35  38  39  41  43  46  47  48  51  54  57
  59  60  61  63  65  66  67  70  73  74  75  76  78  81  85  86  89  90
  93  94  96  97  99 103 106 107 109 112 114 118 119 123 126 127 128 131
 132 134 136 144 147 149 151 153 154 156 162 164 165 169 172 177 178 180
 181 183 188 190 194 196 197 198 199 200]


In [95]:
print("Number of overlapping classes between train and val:",len(set(train_labels_seen).intersection(set(val_labels_unseen))))
print("Number of overlapping classes between trainval and test:",len(set(trainval_labels_seen).intersection(set(test_labels_unseen))))

Number of overlapping classes between train and val: 0
Number of overlapping classes between trainval and test: 0


In [96]:
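# Relabel every split to contiguous 0-based class indices
# (the index of each original label in the sorted unique labels of that split).
# Distinct loop names avoid overwriting the `labels` array loaded above.
for new_label, old_label in enumerate(train_labels_seen):
    labels_train[labels_train == old_label] = new_label
for new_label, old_label in enumerate(val_labels_unseen):
    labels_val[labels_val == old_label] = new_label
for new_label, old_label in enumerate(trainval_labels_seen):
    labels_trainval[labels_trainval == old_label] = new_label
for new_label, old_label in enumerate(test_labels_unseen):
    labels_test[labels_test == old_label] = new_label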

#used for relabeling     
print(train_labels_seen)
print(test_labels_unseen)

[  1   2   5   6   8   9  10  11  13  14  15  16  17  18  20  22  23  24
  25  26  27  28  30  31  33  35  38  39  41  43  46  47  48  51  54  57
  59  60  61  63  65  66  67  70  73  74  75  76  78  81  85  86  89  90
  93  94  96  97  99 103 106 107 109 112 114 118 119 123 126 127 128 131
 132 134 136 144 147 149 151 153 154 156 162 164 165 169 172 177 178 180
 181 183 188 190 194 196 197 198 199 200]
[  7  19  21  29  34  36  50  56  62  68  69  72  79  80  87  88  91  95
  98 100 104 108 116 120 122 124 125 129 139 141 142 150 152 157 159 160
 166 167 171 174 176 179 182 185 187 189 191 192 193 195]
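
The relabeling above maps each original class ID to its index in the sorted unique labels of its split. A vectorized sketch of the same mapping, shown on made-up labels, uses `np.unique` together with `np.searchsorted`. The array `toy_labels` is an assumption for illustration and is not taken from the dataset.

```python
import numpy as np

# Hypothetical labels with the same (m, 1) uint8 layout as labels_train
toy_labels = np.array([[151], [7], [151], [42], [7]], dtype=np.uint8)

# Sorted unique class IDs of this split
toy_classes = np.unique(toy_labels)

# For each label, find its position in the sorted unique IDs (0-based class index)
toy_relabeled = np.searchsorted(toy_classes, toy_labels)

print(toy_classes)
print(toy_relabeled.ravel())
```

Because the result is written to a new array, the mapping does not depend on the order in which labels are processed.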


Let us denote the features available at the training stage by $X \in \mathbb{R}^{d \times m}$, where $d$ is the dimensionality
of the data and $m$ is the number of instances. We are using ResNet features extracted from the `CUB` dataset.

In [97]:
X_features = res101['features']
train_vec = X_features[:,np.squeeze(att_splits[train_loc]-1)]
val_vec = X_features[:,np.squeeze(att_splits[val_loc]-1)]
trainval_vec = X_features[:,np.squeeze(att_splits[trainval_loc]-1)]
test_vec = X_features[:,np.squeeze(att_splits[test_loc]-1)]

In [98]:
print("Features for train:", train_vec.shape)
print("Features for val:", val_vec.shape)
print("Features for trainval:", trainval_vec.shape)
print("Features for test:", test_vec.shape)

Features for train: (2048, 5875)
Features for val: (2048, 2946)
Features for trainval: (2048, 7057)
Features for test: (2048, 2967)


#### Normalize the vectors

In [99]:
def normalization(vec,mean,std):
    sol = vec - mean
    sol1 = sol/std
    return sol1

In [100]:
train_mean = train_vec.mean(axis=1, keepdims=True)
train_std = np.std(train_vec, axis=1, keepdims = True)
trainval_mean = trainval_vec.mean(axis=1, keepdims = True)
trainval_std = np.std(trainval_vec, axis=1, keepdims=True)

train_vec = normalization(train_vec, train_mean, train_std)
val_vec = normalization(val_vec, train_mean, train_std)

trainval_vec = normalization(trainval_vec, trainval_mean, trainval_std)
test_vec = normalization(test_vec, trainval_mean, trainval_std)

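Each class in the dataset has an attribute description of dimension $a$. Stacking these vectors column-wise gives the `Signature matrix` $S \in [0, 1]^{a \times z}$ for the $z$ training classes, and $S \in [0, 1]^{a \times z'}$ for the $z'$ test classes. In this notebook the KG embeddings take the place of the attribute vectors, so the entries are real-valued rather than confined to $[0, 1]$.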

In [101]:
signature = kg_embeddings.transpose() #the kge gets sorted here; transpose to match the shape of the original attributes
train_sig = signature[:,(train_labels_seen)-1]
val_sig = signature[:,(val_labels_unseen)-1]
trainval_sig = signature[:,(trainval_labels_seen)-1]
test_sig = signature[:,(test_labels_unseen)-1]
print(train_sig.shape)


(200, 100)
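
The column indexing above assumes that column `j` of `signature` belongs to CUB class `j+1`. That may not hold when the entity list comes from `np.unique`, which sorts lexicographically and can include non-class entries (the first two entities printed earlier are literals, not bird URIs). One way to make the order explicit is to build the entity list from the class names. The sketch below assumes class names in the `001.Black_footed_Albatross` style and URIs that drop the underscores, as the printed entities suggest; the helper `class_name_to_uri` and the two-item `toy_classes` list are hypothetical.

```python
def class_name_to_uri(class_name, base="http://example.org/"):
    """Map e.g. '001.Black_footed_Albatross' to '<http://example.org/BlackfootedAlbatross>'."""
    name = class_name.split(".", 1)[1]            # drop the numeric prefix
    return "<%s%s>" % (base, name.replace("_", ""))

# Illustrative subset; in practice this list would hold all class names in label order
toy_classes = ["001.Black_footed_Albatross", "029.American_Crow"]
ordered_entities = [class_name_to_uri(c) for c in toy_classes]
print(ordered_entities)

# Assumed usage (not executed here): rows would then follow the class order
# kg_embeddings = amp.get_embeddings(ordered_entities)
```

Whether every class name maps cleanly to a URI in the graph depends on how the triples were generated, so checking for missing entities before extracting embeddings is advisable.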


This is a signature matrix, where each entry gives the strength of an attribute for the corresponding class.
For instance, if the classes are `horse` and `zebra` and the attributes are [Domestic_animal, 4_legged, carnivore], the matrix could look like:

```
 Horse      Zebra
[0.00354613 0.        ] Domestic_animal
[0.13829921 0.20209503] 4_legged
[0.06560347 0.04155225] carnivore
```

In [102]:
print(train_sig[3:6,:2])

[[ 0.09150717  0.01566118]
 [-0.1190209  -0.15038627]
 [ 0.16520041 -0.05978232]]


In [103]:
print("Signature for train:", train_sig.shape)
print("Signature for val:", val_sig.shape)
print("Signature for trainval:", trainval_sig.shape)
print("Signature for test:", test_sig.shape)

Signature for train: (200, 100)
Signature for val: (200, 50)
Signature for trainval: (200, 150)
Signature for test: (200, 50)


In [104]:
#params for train and val set
m_train = labels_train.shape[0]
n_val = labels_val.shape[0]
z_train = len(train_labels_seen)
z1_val = len(val_labels_unseen)

#params for trainval and test set
m_trainval = labels_trainval.shape[0]
n_test = labels_test.shape[0]
z_trainval = len(trainval_labels_seen)
z1_test = len(test_labels_unseen)

The ground truth is a one-hot encoded vector

In [105]:
#ground truth for train and val set
gt_train = np.zeros((m_train, z_train))
gt_train[np.arange(m_train), np.squeeze(labels_train)] = 1

#ground truth for trainval and test set
gt_trainval = np.zeros((m_trainval, z_trainval))
gt_trainval[np.arange(m_trainval), np.squeeze(labels_trainval)] = 1

In [106]:
gt_train[:1,:100]

array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0.]])

In [107]:
#train set
d_train = train_vec.shape[0]
a_train = train_sig.shape[0]

#Weights
V = np.zeros((d_train,a_train))

In [113]:
#trainval set
d_trainval = trainval_vec.shape[0]
a_trainval = trainval_sig.shape[0]
W = np.zeros((d_trainval,a_trainval))

#Note: These hyper-parameters were found using the code snippet available below
gamm1 = 2
alph1 = 0

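The closed-form, one-line solution proposed by ESZSL is shown below. In the code, `10**alph1` plays the role of γ (the regularizer on the feature term) and `10**gamm1` plays the role of λ (the regularizer on the signature term).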
```
V = inverse(XX' + γI) XYS' inverse(SS' + λI)
```
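
The formula can be checked on small random data. The sketch below is one possible implementation under stated assumptions: the function name `eszsl_fit`, the toy shapes, and the use of `np.linalg.solve` in place of explicit (pseudo-)inverses are choices made for this illustration, not the notebook's own code.

```python
import numpy as np

def eszsl_fit(X, Y, S, gamma, lam):
    """Closed form V = (XX' + gamma*I)^-1 X Y S' (SS' + lam*I)^-1.

    X: (d, m) features, Y: (m, z) one-hot labels, S: (a, z) signatures.
    """
    d, a = X.shape[0], S.shape[0]
    left = X @ X.T + gamma * np.eye(d)     # (d, d), regularized feature Gram matrix
    right = S @ S.T + lam * np.eye(a)      # (a, a), regularized signature Gram matrix
    middle = X @ Y @ S.T                   # (d, a)
    # Solve linear systems instead of forming the inverses explicitly
    V = np.linalg.solve(left, middle)      # left^-1 @ middle
    V = np.linalg.solve(right.T, V.T).T    # (left^-1 @ middle) @ right^-1
    return V

rng = np.random.default_rng(0)
toy_X = rng.normal(size=(5, 40))           # d=5 features, m=40 samples
toy_S = rng.normal(size=(3, 4))            # a=3 signature dims, z=4 classes
toy_Y = np.eye(4)[rng.integers(0, 4, 40)]  # (m, z) one-hot ground truth
toy_V = eszsl_fit(toy_X, toy_Y, toy_S, gamma=1.0, lam=1.0)
print(toy_V.shape)
```

With positive regularizers both matrices are symmetric positive definite, so solving the systems gives the same result as the pseudo-inverse while typically being cheaper and numerically more stable.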



In [114]:
part_1_test = np.linalg.pinv(np.matmul(trainval_vec, trainval_vec.transpose()) + (10**alph1)*np.eye(d_trainval))
part_0_test = np.matmul(np.matmul(trainval_vec,gt_trainval),trainval_sig.transpose())
part_2_test = np.linalg.pinv(np.matmul(trainval_sig, trainval_sig.transpose()) + (10**gamm1)*np.eye(a_trainval))

W = np.matmul(np.matmul(part_1_test,part_0_test),part_2_test)

At the inference stage, the predicted class is
```
argmax(x'VS)
```
where S is the signature matrix of the test set, i.e. of the unseen classes.
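
A shape-level sketch of this rule on random data may help. All arrays below are hypothetical stand-ins (random weights, signatures, and features), so the predictions carry no meaning beyond showing the dimensions involved.

```python
import numpy as np

rng = np.random.default_rng(1)
toy_V = rng.normal(size=(5, 3))         # (d, a) weights, random here instead of learned
toy_S_test = rng.normal(size=(3, 2))    # (a, z') signatures of the unseen classes
toy_x = rng.normal(size=(5, 6))         # (d, n) test features, one column per sample

toy_scores = toy_x.T @ toy_V @ toy_S_test   # (n, z') compatibility scores
toy_preds = toy_scores.argmax(axis=1)       # index of the best unseen class per sample
print(toy_scores.shape, toy_preds.shape)
```

Each row of the score matrix compares one sample against every unseen class, and the argmax over that row picks the class.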

In [115]:
#predictions
outputs_1 = np.matmul(np.matmul(test_vec.transpose(),W),test_sig)
preds_1 = np.array([np.argmax(output) for output in outputs_1])

In [None]:
cm = confusion_matrix(labels_test, preds_1)
cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
avg = sum(cm.diagonal())/len(test_labels_unseen)
print("Top-1 mean per-class accuracy (%):", avg*100)
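
The quantity computed from the normalized confusion matrix is the mean of the per-class accuracies, which differs from plain accuracy when classes have different numbers of samples. A small sketch with made-up labels shows the difference; the helper name `mean_per_class_accuracy` is an assumption for this example.

```python
import numpy as np

def mean_per_class_accuracy(y_true, y_pred):
    """Average, over the classes present in y_true, of the fraction predicted correctly."""
    y_true = np.ravel(y_true)
    y_pred = np.ravel(y_pred)
    classes = np.unique(y_true)
    per_class = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(per_class))

toy_true = np.array([0, 0, 0, 0, 1, 1])   # class 0 has four samples, class 1 has two
toy_pred = np.array([0, 0, 0, 1, 1, 0])

print(mean_per_class_accuracy(toy_true, toy_pred))  # (3/4 + 1/2) / 2 = 0.625
print(float(np.mean(toy_true == toy_pred)))          # plain accuracy, 4/6
```

Averaging per class keeps large classes from dominating the score.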

------------------------------------------------------------------------------------------------
The code snippet below can be used to find the best hyper-parameters using the train and val sets.

In [112]:
accu = 0.10
alph1 = 3
gamm1 = 1
for alpha in range(-3, 4):
    for gamma in range(-3,4):
        #One line solution
        part_1 = np.linalg.pinv(np.matmul(train_vec, train_vec.transpose()) + (10**alpha)*np.eye(d_train))
        part_0 = np.matmul(np.matmul(train_vec,gt_train),train_sig.transpose())
        part_2 = np.linalg.pinv(np.matmul(train_sig, train_sig.transpose()) + (10**gamma)*np.eye(a_train))

        V = np.matmul(np.matmul(part_1,part_0),part_2)
        #print(V)

        #predictions
        outputs = np.matmul(np.matmul(val_vec.transpose(),V),val_sig)
        preds = np.array([np.argmax(output) for output in outputs])

        #print(accuracy_score(labels_val,preds))
        cm = confusion_matrix(labels_val, preds)
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        avg = sum(cm.diagonal())/len(val_labels_unseen)
        #print("Avg:", avg, alpha, gamma)

        if avg > accu:
            accu = avg
            alph1 = alpha
            gamm1 = gamma
print(alph1, gamm1)

3 1
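
One caveat when reading this output: `accu` starts at 0.10 and `alph1`, `gamm1` are preset to 3 and 1, so the same pair would be printed if no grid point exceeded the threshold. A sketch of a search that starts from negative infinity and also returns the best score removes that ambiguity. The names `grid_search` and `toy_score` are hypothetical, and `toy_score` is only a stand-in for "fit on train, score on val".

```python
import itertools
import numpy as np

def grid_search(score_fn, exponents=range(-3, 4)):
    """Return (best_score, alpha_exp, gamma_exp) over a grid of powers of 10."""
    best = (-np.inf, None, None)
    for alpha, gamma in itertools.product(exponents, repeat=2):
        score = score_fn(10.0 ** alpha, 10.0 ** gamma)
        if score > best[0]:
            best = (score, alpha, gamma)
    return best

# Hypothetical score surface that peaks at exponents (1, -2)
def toy_score(reg_features, reg_signatures):
    return -(np.log10(reg_features) - 1) ** 2 - (np.log10(reg_signatures) + 2) ** 2

best_score, best_alpha, best_gamma = grid_search(toy_score)
print(best_score, best_alpha, best_gamma)
```

Reporting the best score alongside the exponents makes it easy to see whether the search actually improved on a baseline.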
