<a href="https://colab.research.google.com/github/summermccune/Tokenization-Testing-for-Malware-Data/blob/main/HMM2Vec/HMM2Vec_embedding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
#timer
!pip install ipython-autotime
%load_ext autotime

## Imports

In [2]:
from hmmlearn import hmm
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
import pickle
# from google.colab import drive
# drive.mount('/content/drive')

time: 21.4 s (started: 2024-07-10 12:22:06 -07:00)


## Reading in df and splitting data

In [3]:
df = pd.read_pickle('UnigramFilteredOpcodes.pkl')
opcodes_df = pd.read_csv('MostCommonOpcodes.csv')
#keep a random 25% sample of the dataset
df = df.sample(frac=0.25)

time: 2min 13s (started: 2024-07-10 12:22:28 -07:00)


## Creating numerical representation for the opcodes

In [4]:
#map each opcode in the dataset to an integer ID (its row position in MostCommonOpcodes.csv)
opcode_to_number = {opcode: count for count, opcode in enumerate(opcodes_df['Opcodes'])}

print(opcode_to_number)

{'add': 0, 'mov': 1, 'push': 2, 'pop': 3, 'inc': 4, 'xchg': 5, 'call': 6, 'or': 7, 'dec': 8, 'cmp': 9, 'xor': 10, 'sub': 11, 'and': 12, 'adc': 13, 'sbb': 14, 'lea': 15, 'test': 16, 'out': 17, 'in': 18, 'jmp': 19, 'movl': 20, 'int3': 21, 'ret': 22, 'imul': 23, 'je': 24, 'nop': 25, 'lods': 26, 'stos': 27, 'scas': 28, 'lret': 29, 'jne': 30}
time: 47 ms (started: 2024-07-10 12:24:41 -07:00)


## Function for converting samples into the numerical representations

In [5]:
def opcodes_to_numbers(columns):
  #appends one list of integer opcode IDs per sample to the global opcode_sequences list,
  #which must be created before this function is called (see Main)
  for sample in columns:
    temp = []
    for opcode in sample:
      temp.append(opcode_to_number[opcode])
    opcode_sequences.append(temp)

time: 15 ms (started: 2024-07-10 12:24:41 -07:00)
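A side-effect-free variant of the conversion step can be easier to test. The following is a minimal sketch, not the notebook's implementation: `encode_samples` and `toy_mapping` are hypothetical names, and the three-opcode mapping is a cut-down stand-in for `opcode_to_number`.

```python
def encode_samples(samples, mapping):
    """Return one list of integer IDs per sample, without touching any global state."""
    # A KeyError here means a sample contains an opcode that is missing from the mapping.
    return [[mapping[opcode] for opcode in sample] for sample in samples]


# Toy stand-in for opcode_to_number (hypothetical, for illustration only).
toy_mapping = {'add': 0, 'mov': 1, 'push': 2}

encoded = encode_samples([['mov', 'push', 'mov'], ['add']], toy_mapping)
print(encoded)  # [[1, 2, 1], [0]]
```

Returning the encoded sequences instead of appending to a global list means re-running the cell cannot silently duplicate samples.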


## Function for training the models

Notes about the implementation (from the paper):

As mentioned above, we train HMMs using the hmmlearn library [7] and **we select the
highest scoring model based on multiple random restarts**. The precise number of random
restarts is determined by the length of the opcode sequence—for **shorter sequences in the
range of 1000 to 5000 opcodes, we use 100 restarts; otherwise we select the best model
based on 50 random restarts**. The B matrix of the highest-scoring model is then converted
to a one-dimensional vector.

In [6]:
def train_hmm_models(opcodes,n_states,n_restarts):
  #fits one CategoricalHMM per opcode sequence and returns the fitted models
  #n_restarts is accepted but not used yet: each model is fit once
  #the restart loop from the paper is kept below, commented out, for reference
  hmm_models = []
  for opcode_seq in opcodes:
    model = hmm.CategoricalHMM(n_components=n_states, n_iter=100)
    opcode_seq = np.array(opcode_seq)
    #hmmlearn expects a column vector of shape (n_samples, 1)
    model.fit(opcode_seq.reshape(-1, 1))
    hmm_models.append(model)
    # best_model = None
    # best_score = -np.inf
    # for i in range(n_restarts):
    #   model = hmm.MultinomialHMM(n_components=n_states, n_iter=100)
    #   opcode_seq = np.array(opcode_seq)
    #   model.fit(opcode_seq.reshape(-1, 1))

    #   #check if the model has a higher score than the current best model
    #   score = model.score(opcode_seq.reshape(-1, 1))
    #   if score > best_score:
    #     best_model = model
    #     best_score = score
    # hmm_models.append(best_model)

  return hmm_models

time: 0 ns (started: 2024-07-10 12:24:41 -07:00)
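The restart scheme quoted from the paper (fit several randomly initialised models and keep the highest-scoring one) can be sketched independently of hmmlearn. This is a minimal sketch under stated assumptions: `restarts_for_length`, `best_of_restarts` and `toy_fit` are hypothetical names, and `toy_fit` is a stand-in for fitting an HMM with a given seed and returning its log-likelihood. The length rule follows the quoted text literally (100 restarts for 1000 to 5000 opcodes, otherwise 50), which differs slightly from the `<= 5000` check used in Main.

```python
import random


def restarts_for_length(seq_len):
    # Rule as quoted from the paper: 100 restarts for 1000-5000 opcodes, otherwise 50.
    return 100 if 1000 <= seq_len <= 5000 else 50


def best_of_restarts(fit_once, n_restarts, seed=0):
    """Call fit_once(seed) -> (model, score) n_restarts times and keep the best score."""
    rng = random.Random(seed)
    best_model, best_score = None, float('-inf')
    for _ in range(n_restarts):
        model, score = fit_once(rng.randrange(2**31))
        if score > best_score:
            best_model, best_score = model, score
    return best_model, best_score


def toy_fit(seed):
    # Hypothetical stand-in for "fit a model from a random start and score it".
    guess = random.Random(seed).uniform(-10, 10)
    return guess, -(guess - 3.0) ** 2


model, score = best_of_restarts(toy_fit, restarts_for_length(2500))
print(model, score)
```

In the notebook, `fit_once` would wrap the model construction and `fit` call from `train_hmm_models` and return the value of `model.score(...)`. Choosing the restart count per sequence, rather than once from the first sample, matches the quoted description more closely.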


## Function for converting the matrix to feature vectors

Notes on the implementation:

To obtain the HMM2Vec features, we convert the B matrix of a trained HMM into
vector form. A subtle point that arises in this conversion process is that the order of the
hidden states in the B matrix need not be consistent across different models. **Since we only
have N = 2 hidden states in our experiments, this means that the order of the rows of the
corresponding B matrices may not agree between different models**. To account for this
possibility, **we determine the hidden state that has the highest probability with respect to
the mov opcode and we deem this to be the first half of the HMM2Vec feature vector, with
the other row of the B matrix being the second half of the vector**. Since mov is by far the
most frequent opcode, this will yield a consistent ordering of the hidden states.

In [7]:
def b_matrix_to_features(hmm_models, max_feature_length):
  hmm2vec_features = []
  for model in hmm_models:
    #determine the hidden state that has the highest probability with respect to the mov opcode
    mov_index = np.argmax(model.emissionprob_[:, opcode_to_number['mov']])

    #deem this to be the first half of the HMM2Vec feature vector, with the other row of the B matrix being the second half of the vector
    sorted_indices = [mov_index, 1 - mov_index]
    sorted_bmatrices = model.emissionprob_[sorted_indices]

    # Flatten the rearranged B matrix to create HMM2Vec feature vector
    feature_vector = sorted_bmatrices.flatten()

    # pad or truncate feature_vector to ensure consistent length
    if len(feature_vector) < max_feature_length:
      feature_vector = np.pad(feature_vector, (0, max_feature_length - len(feature_vector)), mode='constant')
    elif len(feature_vector) > max_feature_length:
      feature_vector = feature_vector[:max_feature_length]

    hmm2vec_features.append(feature_vector)

  return hmm2vec_features


time: 15 ms (started: 2024-07-10 12:24:41 -07:00)
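One point worth checking in the padding step: if a model's B matrix has fewer columns than expected (for example because a sample never contains the highest-numbered opcodes, which is an assumption about how the symbol count is inferred), padding after flattening shifts the second state's entries out of their opcode columns. A possible alternative, sketched below with NumPy only, pads each row before flattening. `b_to_vector`, `anchor_col` and `n_obs` are hypothetical names, and the 2x3 matrices are toy data, not emission matrices from the notebook.

```python
import numpy as np


def b_to_vector(B, anchor_col, n_obs):
    """Order the two rows of B by their probability for anchor_col,
    pad each row with zeros to n_obs columns, then flatten."""
    B = np.asarray(B, dtype=float)
    # The state with the highest probability for the anchor symbol goes first.
    first = int(np.argmax(B[:, anchor_col]))
    ordered = B[[first, 1 - first]]
    # Assumes n_obs >= number of columns in B; padding per row keeps columns aligned.
    padded = np.zeros((2, n_obs))
    padded[:, :ordered.shape[1]] = ordered
    return padded.flatten()


# Two toy B matrices that differ only in the order of their hidden states.
B_a = np.array([[0.7, 0.2, 0.1],
                [0.1, 0.6, 0.3]])
B_b = B_a[::-1]

vec_a = b_to_vector(B_a, anchor_col=1, n_obs=4)
vec_b = b_to_vector(B_b, anchor_col=1, n_obs=4)
print(np.array_equal(vec_a, vec_b))  # True
```

Both toy matrices map to the same vector, which is the consistency property the quoted note asks for, and each state's zero padding sits at the end of its own half of the vector.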


## Main

In [8]:
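#number of hidden states N in each HMM (the paper uses N = 2)
#this is independent of the number of malware families (4 in this dataset)
n_states = 2

#number of unique opcodes; not used directly, but 2 * n_obs = 62 is the feature length passed to b_matrix_to_features
n_obs = 31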

#converting opcodes to numbers
opcode_sequences = []
opcodes_to_numbers(df['Opcodes'])

time: 14 s (started: 2024-07-10 12:24:41 -07:00)


In [9]:
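#one HMM per sample (CategoricalHMM worked; MultinomialHMM did not)
#n_restarts follows the paper's rule but is based on the first sample's length only, and train_hmm_models does not use it yet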
n_restarts = 100 if len(opcode_sequences[0]) <= 5000 else 50
hmm_models = train_hmm_models(opcode_sequences,n_states,n_restarts)

#extract feature vectors
hmm2vec_features = b_matrix_to_features(hmm_models, 62)

time: 1h 7min 15s (started: 2024-07-10 12:24:55 -07:00)


In [10]:
#save results as pkl (pickle is already imported in the Imports cell)
with open('hmm2vec_features.pkl', 'wb') as f:
    pickle.dump(hmm2vec_features, f)

time: 15 ms (started: 2024-07-10 13:32:13 -07:00)


In [12]:
#split data
X_train, X_test, y_train, y_test = train_test_split(hmm2vec_features, df['MalwareType'], test_size=0.2, random_state=42)

#train SVM
svm = SVC(kernel='linear')
svm.fit(X_train, y_train)

#predict
y_pred = svm.predict(X_test)

#evaluate
accuracy = np.mean(y_pred == y_test)
print("Accuracy:", accuracy)

#confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)


Accuracy: 0.735
Confusion Matrix:
[[29 12  1  4]
 [ 8 39  3  5]
 [ 1  2 52  0]
 [ 9  6  2 27]]
time: 62 ms (started: 2024-07-10 13:33:35 -07:00)
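Overall accuracy hides how unevenly the four classes are classified. As a rough sketch, per-class recall and precision can be read off the confusion matrix printed above, using scikit-learn's convention that rows are true classes and columns are predicted classes. The class names are not shown in the output, so the sketch refers to classes by row index only.

```python
# Confusion matrix copied from the output above (rows: true class, columns: predicted class).
cm = [
    [29, 12, 1, 4],
    [8, 39, 3, 5],
    [1, 2, 52, 0],
    [9, 6, 2, 27],
]

n_classes = len(cm)
total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(n_classes))
accuracy = correct / total  # 147 / 200, matching the reported 0.735

# Recall: correct predictions divided by the number of true samples in that class (row sum).
recall = [cm[i][i] / sum(cm[i]) for i in range(n_classes)]
# Precision: correct predictions divided by the number of predictions of that class (column sum).
precision = [cm[i][i] / sum(row[i] for row in cm) for i in range(n_classes)]

print("Accuracy:", accuracy)
for i in range(n_classes):
    print(f"class {i}: recall={recall[i]:.3f} precision={precision[i]:.3f}")
```

The third class (row index 2) is recognised far more reliably than the others, while the first and last rows are most often confused with the first two columns.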
