# Paper title: Geometric properties of PalmTree instruction embeddings
## Target IJCNN, deadline: End of January


## **Overview**
*   Geometry of embeddings
*   Pairwise correlations
*   Find which embeddings are closer to which
*   Analogy relationships or outlier detection: see how the embeddings relate different instructions
*   Analyze how effective basic block search is
*   Improve it by encoding pairs of instructions, or more instructions at a time
*   Use deep metric learning to improve distance matching/basic block search

**Pairwise correlations**
*   Compute correlations between the embeddings and plot them as a histogram
*   Relies on inner products between pairs of vectors
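
A minimal sketch of the pairwise-correlation step, assuming the embeddings are the rows of a NumPy array. The random matrix and the helper name `pairwise_cosine` are placeholders for illustration and are not part of the notebook.

```python
import numpy as np

def pairwise_cosine(embeddings):
    """Return the cosine similarity of every unordered pair of rows (i < j)."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit.T                      # inner products of unit vectors
    upper = np.triu_indices(len(embeddings), k=1)
    return sims[upper]

# Placeholder data standing in for real instruction embeddings.
rng = np.random.default_rng(0)
demo_embeddings = rng.normal(size=(50, 128))
demo_sims = pairwise_cosine(demo_embeddings)

# Bin the similarities; the counts can then be drawn with plt.bar or plt.hist.
counts, bin_edges = np.histogram(demo_sims, bins=20, range=(-1.0, 1.0))
```

Taking only the upper triangle avoids counting each pair twice and drops the trivial self-similarities of 1.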

**Analyze**
*   call, ret, mov, etc. are the most important instructions in malware analysis
*   Pick common instructions and look at their embeddings to see which ones are closer
*   Prediction: "none of the embeddings will be close / they will be perpendicular"
*   Look at the vectors for the constants 0, 1, 2, etc.

**Analogy query**
*   K nearest neighbors
*   Look at the vectors in each layer as the input goes through the model, to see the analogy-query performance per layer
*   Take the vector for king, subtract the vector for man and add the vector for woman. Find the vector closest to the result and see whether it is queen
*   Outlier detection: look at each layer
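
The analogy query above can be sketched as follows. The 2-D word vectors are made up for illustration, and `nearest_neighbors` is a hypothetical helper, not a PalmTree function. For instructions, the same arithmetic would be applied to rows of the embedding matrix.

```python
import numpy as np

def nearest_neighbors(query, vectors, k=1, exclude=()):
    """Return indices of the k rows of `vectors` most cosine-similar to `query`."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = unit @ (query / np.linalg.norm(query))
    for i in exclude:                 # never return the query terms themselves
        scores[i] = -np.inf
    return np.argsort(-scores)[:k]

# Toy vocabulary (made up): axis 0 ~ "royalty", axis 1 ~ "gender".
words = ["king", "queen", "man", "woman", "apple"]
vecs = np.array([[1.0, 1.0], [1.0, -1.0], [0.0, 1.0], [0.0, -1.0], [-1.0, 0.0]])
idx = {w: i for i, w in enumerate(words)}

# king - man + woman, then look up the closest remaining vector.
query = vecs[idx["king"]] - vecs[idx["man"]] + vecs[idx["woman"]]
best = nearest_neighbors(query, vecs, k=1,
                         exclude=[idx["king"], idx["man"], idx["woman"]])
answer = words[best[0]]
```

Excluding the three query terms is a common convention for this kind of test. Without it, the nearest neighbor is often one of the inputs.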

**Plot of embedding norms**
*   Norm is inversely related to the frequency of tokens
*   The most frequent tokens had the smallest norm in BERT
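
One way to check the norm/frequency relationship is to correlate each token's L2 norm with its log-frequency. The sketch below uses synthetic embeddings constructed so that frequent tokens get short vectors. The counts, the embeddings, and the helper name `norm_frequency_correlation` are all assumptions for illustration.

```python
import numpy as np

def norm_frequency_correlation(embeddings, counts):
    """Return per-token L2 norms and their Pearson correlation with log-frequency."""
    norms = np.linalg.norm(embeddings, axis=1)
    log_freq = np.log(np.asarray(counts, dtype=float))
    return norms, float(np.corrcoef(norms, log_freq)[0, 1])

# Synthetic data built so that frequent tokens get short vectors.
rng = np.random.default_rng(0)
token_counts = np.array([1000, 300, 50, 5])
directions = rng.normal(size=(4, 8))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
toy_embeddings = directions / np.log(token_counts)[:, None]

norms, corr = norm_frequency_correlation(toy_embeddings, token_counts)
```

A negative `corr` on real data would support the inverse relationship. A scatter plot of `norms` against the log counts shows the same thing visually.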

**For outliers, see accuracy of outliers from layer to layer**
*   Given a sequence, see how it changes through the layers
*   Look at the number of layers in the model

**Basic Block Search**
*   Analyze how effectively the embeddings characterize similarities between basic blocks
*   Average cosine distance is used as the similarity measure
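
A minimal sketch of basic block search, assuming "average" means mean-pooling the per-instruction embeddings of a block and then comparing blocks by cosine similarity. That reading is an assumption, and the function names and toy vectors are hypothetical.

```python
import numpy as np

def block_embedding(instruction_embeddings):
    """Mean-pool the per-instruction embeddings of one basic block."""
    return np.mean(instruction_embeddings, axis=0)

def block_similarity(block_a, block_b):
    """Cosine similarity between the mean-pooled embeddings of two blocks."""
    a, b = block_embedding(block_a), block_embedding(block_b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def search(query_block, candidate_blocks):
    """Return the index of the most similar candidate and all scores."""
    scores = [block_similarity(query_block, c) for c in candidate_blocks]
    return int(np.argmax(scores)), scores

# Toy blocks: each row is one (made-up) instruction embedding.
query = np.array([[1.0, 0.0], [0.0, 1.0]])
candidates = [np.array([[1.0, 0.0], [1.0, 0.0]]), np.array([[1.0, 1.0]])]
best, scores = search(query, candidates)
```

Mean pooling ignores instruction order, so two blocks with the same instructions in a different order get identical embeddings.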

**Possible improvement**
*   Given a basic block, the model encodes one instruction at a time and then takes the average
*   This is a problem for instructions that depend on others
*   Figure out how to improve it, possibly by encoding a pair of instructions at a time
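
One way to form the instruction pairs is a sliding window over adjacent instructions. The helper `adjacent_pairs` and the `" ; "` separator below are hypothetical. Whether the pretrained PalmTree tokenizer accepts such joined input has not been verified here.

```python
def adjacent_pairs(block, separator=" ; "):
    """Join each instruction with its successor so dependent pairs are encoded together.

    `block` is a list of instruction strings in program order. Blocks with
    fewer than two instructions are returned unchanged.
    """
    if len(block) < 2:
        return list(block)
    return [separator.join(pair) for pair in zip(block, block[1:])]

# Made-up basic block in the space-separated form used for encoding.
toy_block = ["mov eax 0x1", "add eax ebx", "ret"]
pairs = adjacent_pairs(toy_block)
```

Each string in `pairs` would then be encoded as one unit and the results averaged, in place of averaging single-instruction embeddings.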

**Metric learning** "Deep Metric Learning to Rank"
*   If dataset is available, apply FastAP
*   See if it improves 







## Initialize Colab & Google Drive

In [1]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [2]:
!pip install bert-pytorch



In [3]:
!pip install pytorch-metric-learning



In [4]:
% cd /content/drive/MyDrive/Colab\ Notebooks/PalmTree-Trained/
% ls

/content/drive/.shortcut-targets-by-id/1LMg3kN9NvUURy3zB07RjiMQNzN0TPyMm/PalmTree-Trained
 add_cosine_sim_sorted_final.pdf
 add_cosine_sim_sorted.pdf
 agg_plot.pdf
 BINOP.pdf
 CALL.pdf
 CMOV.pdf
 CMP.pdf
 config.py
[0m[01;34m'Cosine Distribution Data'[0m/
 CSET.pdf
'Embedding Distribuitions.pdf'
 eval_utils.py
 FP.pdf
 hex1_cosine_sim_sorted_final.pdf
 hex1_cosine_sim_sorted.pdf
 histogram_final.pdf
 histogram.pdf
 how2use.py
 ins_cosine_sim_sorted_final.pdf
 ins_cosine_sim_sorted.pdf
 ins_MOV_cosine_sim_sorted.pdf
 Inter-BINOP.pdf
 Inter-CALL.pdf
 Inter-CMOV.pdf
 Inter-CMP.pdf
 Inter-CSET.pdf
 Inter-FP.pdf
 Inter-IntraBINOP.pdf
 Inter-IntraCALL.pdf
 Inter-IntraCMOV.pdf
 Inter-IntraCMP.pdf
 Inter-IntraCSET.pdf
 Inter-IntraFP.pdf
 Inter-IntraJMP.pdf
 Inter-IntraMOV.pdf
 Inter-IntraSHIFT.pdf
 Inter-IntraUNARY.pdf
 Inter-JMP.pdf
 Inter-MOV.pdf
 Inter-SHIFT.pdf
 Inter-UNARY.pdf
 Intra-BINOP.pdf
 Intra-CALL.pdf
 Intra-CMOV.pdf
 Intra-CMP.pdf
 Intra-CSET.pdf
 Intra-FP.pdf
 Intra-JMP.pdf
 

## Datasets Used for Intrinsic Evaluation

In [5]:
import pickle
ground_truth_file = "intrinsic_eval/opcode.pkl"
with open(ground_truth_file, 'rb') as f:
    instruction_set = pickle.load(f)

type(instruction_set)
num_of_ins = 0
for ins in instruction_set:
  print(ins, len(instruction_set[ins]) ,instruction_set[ins])
  num_of_ins += len(instruction_set[ins])
print('number of instructions:' , num_of_ins)
print(type(instruction_set['BINOP']))

MOV 4206 {'mov,eax,0x3c', 'mov,rcx,qword [ rsp + rax - 0x8 ]', 'mov,qword [ rsp + 0x50 ],0x0', 'mov,r14,qword [ rbp - 0xd0 ]', 'mov,ebx,dword [ esp + 0x38 ]', 'mov,dword [ ebp - 0x54 ],ecx', 'mov,qword [ rsp + 0x118 ],rax', 'mov,r14,qword [ rsi ]', 'mov,rax,qword [ rsp + 0xc0 ]', 'mov,dword [ ebp - 0x10c ],eax', 'mov,dword [ ebp - 0x14 ],esi', 'mov,dword [ esp + 0xc ],edi', 'mov,dword [ rsp + 0x48 ],0x2', 'mov,rbp,r15', 'mov,rbp,rax', 'mov,qword [ r13 + 0x18 ],rcx', 'mov,byte [ rbp - 0x330 ],r9b', 'mov,byte [ r14 ],0x23', 'mov,rbx,qword [ rdx ]', 'mov,byte [ rbp - 0x330 ],r14b', 'mov,dword [ esp ],string', 'mov,dword [ ebp - 0xa0 ],esi', 'mov,dword [ ebp - 0x34 ],0x0', 'mov,eax,dword [ rbp - 0x140 ]', 'mov,r14d,0x1', 'mov,rax,qword [ rbx + 0x68 ]', 'mov,ecx,dword [ eax + 0x4 ]', 'mov,dword [ rbp + 0x4 ],ebx', 'mov,eax,0x11', 'push,qword [ rsp + 0x60 ]', 'mov,byte [ ecx + 0x5 ],dh', 'mov,dword [ ebp - 0x108 ],0x1', 'mov,eax,0xa', 'mov,ecx,dword [ r13 ]', 'mov,eax,dword [ rdx + 0x18 ]', 

## Import Libraries and Load the PalmTree Model

In [6]:
#Import Libraries and Load the PalmTree model
import os
from config import *
from scipy.ndimage import gaussian_filter1d  # the scipy.ndimage.filters namespace is deprecated
import vocab

import random
import time
import matplotlib.pyplot as plt
from sklearn.metrics import pairwise_distances
from multiprocessing.dummy import Pool as ThreadPool
from torch import nn, optim

from torch.autograd import Variable
import torch
import numpy as np
import eval_utils as utils
import re
import pickle

usable_model = utils.UsableTransformer(model_path="./palmtree/transformer.ep19", vocab_path="./palmtree/vocab")

#Read the opcode.pkl file and initialize offsets for Instruction groups
opcode_group_file = "./intrinsic_eval/opcode.pkl"

with open(opcode_group_file, 'rb') as f:
    instruction_set = pickle.load(f)

instruction_groups = dict()

#print('instruction_set is a',type(instruction_set))
offset = 0
for ins in instruction_set:
  ins_list = instruction_set[ins]
  instruction_groups[ins] = []
  instruction_groups[ins].append(offset) #start_index
  instruction_groups[ins].append(len(ins_list)+offset-1) #end_index
  offset += len(ins_list)
#num_of_ins += len(instruction_set[ins])
#print('number of instructions:' , num_of_ins)


#instruction_groups keeps a mapping of an instruction category(i.e. MOV) and start & end indices in that category
for key, val in instruction_groups.items():
    print(key, val)

Loading Vocab ./palmtree/vocab
Vocab Size:  6631
MOV [0, 4205]
BINOP [4206, 5948]
CALL [5949, 5973]
CMP [5974, 7245]
JMP [7246, 7285]
SHIFT [7286, 7438]
CSET [7439, 7482]
CMOV [7483, 7674]
UNARY [7675, 7699]
FP [7700, 7702]
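
The offset table above maps each opcode group to an inclusive `[start, end]` range of global indices. A small sketch of building such a table and looking up the group of an index with binary search is shown below. The toy groups and the helper names `build_groups` and `group_of` are hypothetical, and every group is assumed to be non-empty.

```python
from bisect import bisect_right

def build_groups(instruction_set):
    """Map each group name to its inclusive [start, end] range of global indices."""
    groups, offset = {}, 0
    for name, members in instruction_set.items():
        groups[name] = [offset, offset + len(members) - 1]
        offset += len(members)
    return groups

def group_of(index, groups):
    """Return the name of the group whose range contains `index`."""
    names = list(groups)
    starts = [groups[n][0] for n in names]
    pos = bisect_right(starts, index) - 1
    if pos < 0 or index > groups[names[pos]][1]:
        raise IndexError(index)
    return names[pos]

# Toy data (made up); lists are used so that the order is stable.
toy = {
    "MOV": ["mov eax 0x1", "mov ebx eax", "mov ecx 0x2"],
    "CALL": ["call address"],
    "JMP": ["jmp address", "jae address"],
}
toy_groups = build_groups(toy)
```

For example, `group_of(3, toy_groups)` returns `"CALL"` because that group occupies the single index 3.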


## Compute Embeddings and Save to FILE [SKIP if FILE exists]

In [None]:
#Generate Embeddings for Instructions Grouped by Opcode, and pickle the embeddings [The Time Consuming Step]
#Skip This Block if the 'opcode_instructions_embeddings.pkl' file already EXISTS
embeddings_list = []

for ins in instruction_set:
  ins_list = instruction_set[ins]
  for each_ins in ins_list:
    #Get Embeddings for a single Instruction
    each_ins = each_ins.replace(","," ")
    list_of_ins = []
    list_of_ins.append(each_ins)
    embd_matrix = usable_model.encode(list_of_ins) #have to pass a list here
    embd = embd_matrix.flatten()
    embeddings_list.append(embd) 

#dump the list of embeddings to a pickle file for later use
with open('opcode_instructions_embeddings.pkl', 'wb') as f:
    pickle.dump(embeddings_list, f)


print("# of instructions in embeddings_list", len(embeddings_list))

KeyboardInterrupt: ignored

## Load the Embeddings from pickle file and Analyze

In [7]:
#Unpickle the embeddings, and check it out 
with open('opcode_instructions_embeddings.pkl', 'rb') as f:
    embeddings_list = pickle.load(f)

In [8]:
#check the shape of the embeddings 
all_embeddings = np.array(embeddings_list)
print(all_embeddings.shape)
#print('Sample Embedding:')
for i in range(0, 0):
  print(all_embeddings[i])

(7703, 128)


## Utility Methods

In [9]:
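#Look up an instruction in instruction_set from its global index,
#using the start/end offsets stored in instruction_groups.
#Do We Really Need this?
#Note: the groups appear to be Python sets (see the printed output above), so list(...) order can differ between sessions.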
def fetch_instruction(_index):

  for key, val in instruction_groups.items():
    start_index = instruction_groups[key][0]
    end_index = instruction_groups[key][1]
    if(_index>=start_index and _index<=end_index):
      idx = _index-start_index
      size = len(instruction_set[key])
      #print('Group:', key,"Index:", idx,"Size:",size)
      if(idx>=size):
        print("Error!")
        return ""
      return list(instruction_set[key])[idx]


#Testing fetch_instruction(_index) method
ins = fetch_instruction(7250)
print(ins)

jae,address


In [10]:
#Get the embedding for an instruction by its row index in the embedding matrix
#Use an index taken from outlier_check_list
def GetInstructionEmbedding(embd_matrix, ins_index):
  size = embd_matrix.shape[0]
  if(size <= ins_index):
    print("Wrong Index!")
    return
  else:
    return embd_matrix[ins_index]


In [11]:
#This function is taken from the PalmTree author's Notebook
def find_outliner(embeddings):
    result = pairwise_distances(embeddings, embeddings, metric='cosine')
    result = result.sum(axis=0) #alternative: use min/max instead of sum
    return np.argmax(result)
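
A worked example of the same outlier rule (the item with the largest summed cosine distance to the others), written with NumPy only so it runs without scikit-learn. The toy vectors and the names `cosine_distance_matrix` and `find_outlier_index` are illustrative, not part of the PalmTree code.

```python
import numpy as np

def cosine_distance_matrix(x):
    """Pairwise cosine distances (1 - cosine similarity) between the rows of x."""
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    return 1.0 - unit @ unit.T

def find_outlier_index(x):
    """Index of the row with the largest total cosine distance to all rows."""
    return int(np.argmax(cosine_distance_matrix(x).sum(axis=0)))

# Four vectors pointing roughly along the x-axis and one along the y-axis.
group = np.array([
    [1.0, 0.1],
    [1.0, 0.0],
    [0.9, 0.1],
    [1.0, 0.2],
    [-0.2, 1.0],   # the planted outlier, placed last as in the test samples
])
predicted = find_outlier_index(group)
```

With five candidates per group, uniform random guessing would pick the planted outlier 20% of the time, which is the baseline for the accuracy figures.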

## Creating Groups of Instruction Sets with One Outlier

In [12]:
#Randomly pick two instruction groups
#Pick 4 instructions from one group and 1 instruction from the Other [use dictionary instruction_groups for this purpose]
#put the outlier at the end
def CreateTestSamples(sample_size):
  
  random.seed(time.time())
  result_matrix = []

  while (len(result_matrix) < sample_size):
      outliner_key, inliner_key = random.sample(list(instruction_groups.keys()), 2) #random.sample needs a sequence, not a dict view
      outliner_choice = random.randint(instruction_groups[outliner_key][0], instruction_groups[outliner_key][1]) 
      #outliner_ins = fetch_instruction(outliner_key, outliner_choice)
      outlier_check_list = []
      counter = 0
      while len(outlier_check_list) < 4:
          inliner_choice = random.randint(instruction_groups[inliner_key][0], instruction_groups[inliner_key][1])
          #inliner_ins = fetch_instruction(inliner_key, inliner_choice)        
          if inliner_choice not in outlier_check_list: #find a different instruction to insert to the set.
              outlier_check_list.append(inliner_choice)
          else:
              counter = counter+1
          if counter >= 100: #Give up after 100 attempts
              break

      if len(outlier_check_list) < 4:
          # print("fail, choose another outlier")
          continue
      else:
          outlier_check_list.append(outliner_choice)   
          result_matrix.append(outlier_check_list)
      #do this until sample_size rows are created

  result_matrix = np.array(result_matrix)
  print(result_matrix.shape)
  return result_matrix


## Testing Accuracy of PalmTree Model

In [13]:
#print instructions and check outliers from the result_matrix
def test_accuracy(start, finish, result_matrix, verbose=False):
  #Keep an accuracy score
  total = finish-start
  accurate = 0
  for i in range(start,finish):
    check_embeddings = []
    check_row = result_matrix[i]
    if(verbose == True):
      print('[',end="")
    for j in check_row:
      if(verbose == True):
        print(fetch_instruction(j),end="; ")
      check_embeddings.append(all_embeddings[j])
    if(verbose == True):
      print(']', end='\t')
    result = find_outliner(check_embeddings)
    if(verbose == True):
      print('Outlier is at',result)
    if result == 4:
      accurate+=1

  #Print out accuracy
  accuracy = accurate/total*100
  return accuracy

In [14]:
#Testing the accuracy of the PalmTree model's embeddings at detecting outliers
#np.set_printoptions(threshold=np.inf)
sample_size = 50000
start = 0
finish = sample_size
#Do this for 10 Iterations, then take mean and STDev
acc_lst = []
for i in range(10):
  result_matrix = CreateTestSamples(sample_size)
  accuracy = test_accuracy(start, finish, result_matrix, verbose=False)
  acc_lst.append(accuracy)
#print(result_matrix[start:finish])

(50000, 5)
(50000, 5)
(50000, 5)
(50000, 5)
(50000, 5)
(50000, 5)
(50000, 5)
(50000, 5)
(50000, 5)
(50000, 5)


In [15]:
print(acc_lst)

[68.012, 68.202, 67.906, 68.232, 68.104, 67.812, 68.12, 68.06, 68.24, 67.962]


In [16]:
#Compute avg and SD
import statistics

avg = statistics.mean(acc_lst)
sd = statistics.pstdev(acc_lst)
print('Avg(%):', avg, 'Std Dev:',sd)

Avg(%): 68.065 Std Dev: 0.13585359767043215


## Deep Metric Learning with Triplet Loss

In [17]:
#print instructions and check outliers from the result_matrix
#Use the Embeddings learned by Metric Learning Model here
def test_accuracy_metric(start, finish, result_matrix, verbose=False):
  #Keep an accuracy score
  labels = torch.Tensor([0, 0, 0, 0, 1])
  total = finish-start
  accurate = 0
  testing_loss = 0
  for i in range(start,finish):
    check_embeddings = []
    check_row = result_matrix[i]
    if(verbose == True):
      print('[',end="")
    for j in check_row:
      if(verbose == True):
        print(fetch_instruction(j),end="; ")
      check_embeddings.append(all_metric_embeddings[j])
    if(verbose == True):
      print(']', end='\t')
    result = find_outliner(check_embeddings)
    if(verbose == True):
      print('Outlier is at',result)
    if result == 4:
      accurate+=1

  #Print out accuracy
  accuracy = accurate/total*100
  return accuracy

In [18]:
# Embedding network trained with triplet loss on the first 500 of the generated samples (see train_data below)
from pytorch_metric_learning import losses, reducers
import statistics

class Embedding(nn.Module):
  def __init__(self, input_dim, hidden_dim, output_dim):
    super(Embedding, self).__init__()

    self.l1 = nn.Linear(input_dim, hidden_dim)
    self.l2 = nn.Linear(hidden_dim, hidden_dim)
    self.l3 = nn.Linear(hidden_dim, output_dim)
    self.relu = nn.ReLU()

  def forward(self, triplets):

    out = self.l1(triplets)
    out = self.relu(out)
    out = self.l2(out)
    out = self.relu(out)
    out = self.l3(out)

    return out

# Begin training
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = Embedding(128, 256, 128).to(device)
learning_rate = .001
reducer = reducers.MeanReducer()
criterion = losses.TripletMarginLoss(margin=1.2, reducer=reducer, triplets_per_anchor='all') # pass by keyword: the reducer is not the second positional parameter
optimizer = optim.SGD(model.parameters(), lr=learning_rate, momentum=.9)
epochs = 25
labels = torch.Tensor([0, 0, 0, 0, 1])

train_data = result_matrix[0:500, :]

training_losses = []
testing_losses = []
training_accuracy = []
testing_accuracy = []
max_acc_list = []
min_acc_list = []
max_loss_list = []
min_loss_list = []

for epoch in range(epochs):
  running_loss = 0
  training = True
  for i, (p1, p2, p3, p4, outlier) in enumerate(train_data):
    p1 = GetInstructionEmbedding(all_embeddings, p1)
    p2 = GetInstructionEmbedding(all_embeddings, p2)
    p3 = GetInstructionEmbedding(all_embeddings, p3)
    p4 = GetInstructionEmbedding(all_embeddings, p4)
    outlier = GetInstructionEmbedding(all_embeddings, outlier)

    data_embeddings = torch.Tensor([p1, p2, p3, p4, outlier]).to(device)

    embeddings = model(data_embeddings)
    optimizer.zero_grad()
    loss = criterion(embeddings, labels)
    print("iteration {} loss: {}".format(i, loss.item()))
    running_loss += loss.item()

    loss.backward()
    optimizer.step()

#torch.save(model.state_dict(), "/content/drive/MyDrive/Colab Notebooks/PalmTree-Trained/model.pth")



Streaming output truncated to the last 5000 lines.
iteration 0 loss: 0.0
iteration 1 loss: 0.024864554405212402
iteration 2 loss: 0.0
iteration 3 loss: 0.0
iteration 4 loss: 0.0
iteration 5 loss: 0.0
iteration 6 loss: 0.0
iteration 7 loss: 0.0
iteration 8 loss: 0.0
iteration 9 loss: 0.0
iteration 10 loss: 0.0
iteration 11 loss: 0.0
iteration 12 loss: 0.0
iteration 13 loss: 0.0
iteration 14 loss: 0.0
iteration 15 loss: 0.0
iteration 16 loss: 0.0
iteration 17 loss: 0.0
iteration 18 loss: 0.0
iteration 19 loss: 0.0
iteration 20 loss: 0.0
iteration 21 loss: 0.0
iteration 22 loss: 0.0
iteration 23 loss: 0.0
iteration 24 loss: 0.0
iteration 25 loss: 0.0
iteration 26 loss: 0.11145206540822983
iteration 27 loss: 0.0
iteration 28 loss: 0.0
iteration 29 loss: 0.0
iteration 30 loss: 0.0
iteration 31 loss: 0.0
iteration 32 loss: 0.0
iteration 33 loss: 0.0
iteration 34 loss: 0.0
iteration 35 loss: 0.03452610969543457
iteration 36 loss: 0.0
iteration 37 loss: 0.0
iteration 38 loss: 0.0
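
To make the per-iteration loss concrete, here is a sketch of the triplet margin loss, max(d(a, p) - d(a, n) + margin, 0), over all valid triplets of one five-item group. It uses plain Euclidean distance without normalization, which is an assumption. The library's default distance and reduction may differ, and `triplet_losses` is a hypothetical helper.

```python
import numpy as np
from itertools import permutations

def triplet_losses(embeddings, labels, margin=1.2):
    """Hinge loss for every (anchor, positive, negative) triplet in one group."""
    losses = []
    n = len(labels)
    for a, p in permutations(range(n), 2):
        if labels[a] != labels[p]:
            continue                      # anchor and positive must share a label
        for neg in range(n):
            if labels[neg] == labels[a]:
                continue                  # negative must have a different label
            d_ap = np.linalg.norm(embeddings[a] - embeddings[p])
            d_an = np.linalg.norm(embeddings[a] - embeddings[neg])
            losses.append(max(d_ap - d_an + margin, 0.0))
    return np.array(losses)

labels = [0, 0, 0, 0, 1]                  # four inliers, outlier last
inliers = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1]]
far = np.array(inliers + [[3.0, 3.0]])    # outlier well beyond the margin
near = np.array(inliers + [[0.5, 0.5]])   # outlier inside the margin

far_losses = triplet_losses(far, labels)
near_losses = triplet_losses(near, labels)
```

With four inliers and one outlier there are 4 x 3 = 12 triplets, since the outlier has no positive partner and cannot be an anchor. A loss of zero for a group means every one of its triplets already satisfies the margin.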

In [19]:
# Using all_embeddings to create a list for the new metric learning embeddings
print(np.shape(all_embeddings))
metric_embeddings = np.empty(all_embeddings.shape)
with torch.no_grad():
  for i, vector in enumerate(all_embeddings):
    vector = torch.Tensor(vector).to(device) #same device as the model
    new_embedding = model(vector)
    new_embedding = new_embedding.cpu().numpy()
    metric_embeddings[i] = new_embedding

print(np.shape(metric_embeddings))

(7703, 128)
(7703, 128)


In [20]:
#dump the list of newly created embeddings to a pickle file for later use
with open('opcode_instructions_metric_embeddings.pkl', 'wb') as f:
    pickle.dump(metric_embeddings, f)

## Load Embeddings Learned by Deep Metric Learning

In [21]:
#Unpickle the Metric embeddings, and check it out 
with open('opcode_instructions_metric_embeddings.pkl', 'rb') as f:
    all_metric_embeddings = pickle.load(f)

In [22]:
all_metric_embeddings.shape

(7703, 128)

## Testing Accuracy of Metric Learning Model

In [23]:
#Use all_metric_embeddings to recalculate the accuracy on the intrinsic evaluation (outlier detection)
#Testing the accuracy of the metric learning model's embeddings at detecting outliers
#np.set_printoptions(threshold=np.inf)
sample_size = 50000
start = 0
finish = sample_size
#Do this for 10 Iterations, then take mean and STDev
acc_lst = []
for i in range(10):
  result_matrix = CreateTestSamples(sample_size)
  accuracy = test_accuracy_metric(start, finish, result_matrix, verbose=False)
  acc_lst.append(accuracy)


(50000, 5)
(50000, 5)
(50000, 5)
(50000, 5)
(50000, 5)
(50000, 5)
(50000, 5)
(50000, 5)
(50000, 5)
(50000, 5)


In [24]:
print(acc_lst)

[98.38, 98.434, 98.538, 98.458, 98.614, 98.59599999999999, 98.438, 98.488, 98.56, 98.408]


In [25]:
#Compute avg and SD
import statistics

avg = statistics.mean(acc_lst)
sd = statistics.pstdev(acc_lst)
print('Avg(%):', avg, 'Std Dev:',sd)

Avg(%): 98.4914 Std Dev: 0.07712872357299827


# Further Readings & References

**Papers to read**
*   Geometry of BERT: http://vigir.missouri.edu/~gdesouza/Research/Conference_CDs/IEEE_WCCI_2020/IJCNN/Papers/N-21493.pdf
*   PalmTree: https://arxiv.org/pdf/2103.03809.pdf
