### Install the Python SDK

The Python SDK for the Gemini API is contained in the [`google-generativeai`](https://pypi.org/project/google-generativeai/) package. Install the dependency using pip:

https://colab.research.google.com/github/google/generative-ai-docs/blob/main/site/en/tutorials/python_quickstart.ipynb#scrollTo=G-zBkueElVEO

https://ai.google.dev/tutorials/python_quickstart


In [1]:
!pip install -q -U google-generativeai




### Import packages

Import the necessary packages.

In [2]:
import os
import json
import pandas as pd
import time
import re
import csv
import concurrent.futures
import pathlib
import textwrap

import google.generativeai as genai

from IPython.display import display
from IPython.display import Markdown


def to_markdown(text):
  text = text.replace('•', '  *')
  return Markdown(textwrap.indent(text, '> ', predicate=lambda _: True))

# Used to securely store your API key
from google.colab import userdata

### Set up your API key

Before you can use the Gemini API, you must first obtain an API key. If you don't already have one, create a key with one click in Google AI Studio.

<a class="button button-primary" href="https://makersuite.google.com/app/apikey" target="_blank" rel="noopener noreferrer">Get an API key</a>


In Colab, add the key to the secrets manager under the "🔑" in the left panel. Give it the name `GOOGLE_API_KEY`.

Once you have the API key, pass it to the SDK. You can do this in two ways:

* Put the key in the `GOOGLE_API_KEY` environment variable (the SDK will automatically pick it up from there).
* Pass the key to `genai.configure(api_key=...)`
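
The two options above can be combined in one small helper. This is a minimal sketch that uses only the standard library; `resolve_api_key` is a hypothetical name and is not part of the SDK.

```python
import os


def resolve_api_key(explicit_key=None, env_var="GOOGLE_API_KEY"):
    """Return the explicit key if one is given, else the environment variable's value."""
    key = explicit_key or os.environ.get(env_var)
    if not key:
        # Fail early with a clear message instead of sending unauthenticated requests.
        raise RuntimeError(f"No API key found; set {env_var} or pass one explicitly.")
    return key
```

The returned value can then be passed to `genai.configure(api_key=...)`.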


In [3]:
# Or use `os.getenv('GOOGLE_API_KEY')` to fetch an environment variable.
GOOGLE_API_KEY=userdata.get('GOOGLE_API_KEY')

genai.configure(api_key=GOOGLE_API_KEY)

## List models

Now you're ready to call the Gemini API. Use `list_models` to see the available Gemini models:

* `gemini-pro`: optimized for text-only prompts.
* `gemini-pro-vision`: optimized for text-and-images prompts.

In [4]:
for m in genai.list_models():
  if 'generateContent' in m.supported_generation_methods:
    print(m.name)

models/gemini-1.0-pro
models/gemini-1.0-pro-001
models/gemini-1.0-pro-latest
models/gemini-1.0-pro-vision-latest
models/gemini-1.5-flash
models/gemini-1.5-flash-001
models/gemini-1.5-flash-latest
models/gemini-1.5-pro
models/gemini-1.5-pro-001
models/gemini-1.5-pro-latest
models/gemini-pro
models/gemini-pro-vision


Note: For detailed information about the available models, including their capabilities and rate limits, see [Gemini models](https://ai.google.dev/models/gemini). There are options for requesting [rate limit increases](https://ai.google.dev/docs/increase_quota). The rate limit for Gemini-Pro models is 60 requests per minute (RPM).

The `genai` package also supports the PaLM family of models, but only the Gemini models support the generic, multimodal capabilities of the `generateContent` method.
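
When many rules are sent in a loop, the requests-per-minute limit quoted above has to be respected on the client side. The following is a minimal sketch of one way to space calls out; `MinIntervalLimiter` is a hypothetical helper, not part of the SDK, and the default of 60 RPM is only the figure mentioned in the note and may differ for your account.

```python
import time


class MinIntervalLimiter:
    """Spaces calls at least 60/rpm seconds apart (roughly `rpm` calls per minute)."""

    def __init__(self, rpm=60, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / rpm
        self._clock = clock      # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None        # time of the previous call, None before the first one

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `limiter.wait()` immediately before each `model.generate_content(...)` request keeps consecutive requests at least one interval apart.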

## Generate text from text inputs

For text-only prompts, use the `gemini-pro` model:

In [5]:
# Create the model
# See https://ai.google.dev/api/python/google/generativeai/GenerativeModel
generation_config = {
  "temperature": 1,
  "top_p": 0.95,
  "top_k": 64,
  "max_output_tokens": 8192,
  "response_mime_type": "text/plain",
}
safety_settings = [
  {
    "category": "HARM_CATEGORY_HARASSMENT",
    "threshold": "BLOCK_NONE",
  },
  {
    "category": "HARM_CATEGORY_HATE_SPEECH",
    "threshold": "BLOCK_NONE",
  },
  {
    "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT",
    "threshold": "BLOCK_NONE",
  },
  {
    "category": "HARM_CATEGORY_DANGEROUS_CONTENT",
    "threshold": "BLOCK_NONE",
  },
]

model = genai.GenerativeModel(
  model_name="gemini-1.0-pro",
  safety_settings=safety_settings
  #generation_config=generation_config,
)


#model = genai.GenerativeModel('models/gemini-1.0-pro')

### Data:
Our data is taken from 162 Snort rules that have already been manually labeled with MITRE ATT&CK techniques.

In [6]:
!git clone https://github.com/trickdeath0/Labeling_IDS_to_MITRE.git

Cloning into 'Labeling_IDS_to_MITRE'...
remote: Enumerating objects: 351, done.
remote: Counting objects: 100% (351/351), done.
remote: Compressing objects: 100% (252/252), done.
remote: Total 351 (delta 169), reused 267 (delta 86), pack-reused 0
Receiving objects: 100% (351/351), 7.04 MiB | 12.16 MiB/s, done.
Resolving deltas: 100% (169/169), done.


In [6]:
data = pd.read_csv('/content/Labeling_IDS_to_MITRE/Semester_B/01 stratification/test_data_fix.csv') # Our experiment
print(data.head())
rules_list = data['Rule']
true_labels = data['technique ids']

#print(data['Sid'][0+41])
print(f"\n{len(data)=}")

     Sid                                  URL       technique ids  \
0  50094  https://snort.org/rule_docs/1-50094           ['T1187']   
1  38563  https://snort.org/rule_docs/1-38563           ['T1056']   
2    976    https://snort.org/rule_docs/1-976           ['T1204']   
3   1129   https://snort.org/rule_docs/1-1129           ['T1218']   
4  27967  https://snort.org/rule_docs/1-27967  ['T1505', 'T1219']   

                                                Rule  
0  alert tcp any $HTTP_PORTS -> any any ( msg:"IN...  
1  alert tcp $EXTERNAL_NET $HTTP_PORTS -> $HOME_N...  
2  alert tcp $EXTERNAL_NET any -> $HTTP_SERVERS $...  
3  alert tcp $EXTERNAL_NET any -> $HTTP_SERVERS $...  
4  alert tcp $EXTERNAL_NET any -> $HOME_NET $HTTP...  

len(data)=300
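
In the output above, each `technique ids` cell is printed as a list literal such as `['T1505', 'T1219']`. Assuming the CSV stores these cells as strings (which is what `pd.read_csv` would produce), they need to be converted to real lists before they can be compared with model answers. A minimal sketch, where `parse_technique_ids` is a hypothetical helper:

```python
import ast


def parse_technique_ids(cell):
    """Convert a cell such as "['T1505', 'T1219']" into a list of ID strings."""
    # literal_eval only evaluates Python literals, so it is safer than eval().
    value = ast.literal_eval(cell) if isinstance(cell, str) else cell
    return [str(t).strip() for t in value]
```

For example, `data['technique ids'].apply(parse_technique_ids)` would yield one list of IDs per rule.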


In [7]:
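def clean_response(text):
    # Accept either a plain string or the Markdown object returned by to_markdown()
    raw = getattr(text, "data", text)
    # Remove the ">" quote markers added by to_markdown() and surrounding whitespace.
    # Note: this removes every ">" character, including those inside quoted rule text.
    raw = raw.replace(">", "").strip()
    # Strip the Markdown code fences the model may wrap around its JSON
    raw = raw.replace("```json", "")
    raw = raw.replace("```", "")
    return raw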
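
Even after cleaning, a model reply is not guaranteed to be valid JSON. As a fallback, the technique IDs can be pulled out of the raw text with a regular expression, since ATT&CK IDs have the form `T` plus four digits, optionally followed by a three-digit sub-technique suffix. This is a sketch under that assumption; `extract_technique_ids` is a hypothetical helper.

```python
import re

# Matches technique IDs such as T1056 and sub-technique IDs such as T1059.001.
TECHNIQUE_ID = re.compile(r"\bT\d{4}(?:\.\d{3})?\b")


def extract_technique_ids(response_text):
    """Return the distinct technique IDs found in the text, in order of appearance."""
    seen = []
    for match in TECHNIQUE_ID.findall(response_text):
        if match not in seen:
            seen.append(match)
    return seen
```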


# **Zero Shot (ZS):**
At this stage, the LLM receives a prompt that does not include the list of MITRE ATT&CK techniques, so the results reflect only the knowledge the model acquired during training. The model is asked to classify each rule into techniques based on the content of the rule.

In [21]:
def ZS(snort_rule):

  prompt = f"""Rule: {snort_rule}
  Return a MITRE technique ID (with quotation marks) that related to the rule"""

  message = model.generate_content(prompt)

  return message
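
To judge an answer, the predicted technique IDs for a rule can be compared with its manually assigned labels. The sketch below computes per-rule precision, recall, and F1 over sets of IDs. It is an illustrative assumption, not necessarily the metric used in this experiment, and `score_rule` is a hypothetical name.

```python
def score_rule(predicted, actual):
    """Return (precision, recall, f1) for one rule, treating the labels as sets."""
    predicted, actual = set(predicted), set(actual)
    hits = len(predicted & actual)  # IDs that are both predicted and correct
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(actual) if actual else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1
```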

# **Prompting without techniques guide and without example (WTGWE):**
At this stage, the LLM receives a prompt that does not include the list of MITRE ATT&CK techniques, so the results reflect only the knowledge the model acquired during training. The model is asked to classify each rule into techniques based on the content of the rule.


**Prompt 1:**

      prompt = f"""You are an information security expert. Your task is to label IDS rules for MITRE ATT&CK techniques based on your cybersecurity knowledge. For the task, you are going to get a single Snort IDS rule and you will need to return the most relevant techniques from MITRE ATT&CK that are related to the rule.
      Try to search based on keywords and based on the knowledge you have. For each technique include the following information as JSON in this order: 'Sid', 'Technique ID', 'Technique Name', 'Quotes', 'Explanation'.
      Note: The value of the citation field should contain quotation marks from the data sets relevant to the mapped technique are the main reason you chose this technique to be correct. The value of the explanation should be your explanation of why you decided to give the technique and how it relates to the rule. The technique ID should be the official MITRE technique ID.
      Please don't write anything but the JSON. Rule: {snort_rule}"""


**Prompt 2:**

      prompt2 = f"""I'm going to give you a Snort rule. Read the Snort rule carefully, because I'm going to given you a task about it. Here is the Snort rule: <snort_rule>{snort_rule}</snort_rule>

      First, find the techniques from MITRE ATT&CK that are most relevant to the Snort rule.

      Then, answer the task, for each technique include the following information as JSON in this order: 'Sid', 'Technique ID', 'Technique Name', 'Quotes', 'Explanation'.

      Note: The value of the citation field should contain quotation marks from the data sets relevant to the mapped technique are the main reason you chose this technique to be correct. The value of the explanation should be your explanation of why you decided to give the technique and how it relates to the rule. The technique ID should be the official MITRE technique ID.

      Thus, the format of your overall response should look like what's shown between the <examples></examples> tags. Make sure to follow the formatting and spacing exactly.


      <examples>
        [
          "sid": "2274",
          "Technique ID": "T1110",
          "Technique name": "Brute Force",
          "Quotes": [
            "\"PROTOCOL-POP login brute force attempt\"",
            "track by_dst,count 30,seconds 30"
          ],
          "Explanation": "The rule is looking for excessive \"USER\" commands within a short period of time, which are common indicators of brute-force attacks targeting the POP3 service."
        ]
        </examples>

        Do not include anything besides write the JSON.
        """


**Prompt 3:**

        prompt3 = f"""You work in a company that deals with information security, your role in the company is to label techniques from MITRE ATT&CK to the rules of IDS systems. The labeling between a rule and a technique indicates that the attacker operated with a technique that you found to be suitable for the rule that alerted the IDS system. Now we will test your knowledge labeling IDS rules for MITRE ATT&CK techniques. For your task, you're going to have a single Snort IDS rule and you'll need to label the most relevant techniques from MITRE ATT&CK associated with the rule. From the rule you receive, your labeling should be based on your knowledge and the information found within the 'msg' in the rule received. For each technique you call the rule, include the following information as JSON format in this order: 'Sid', 'Technique ID', 'Technique Name', 'Quotes', 'Explanation'.  Note: The value of the 'Quotes' field should contain quotation marks from the data sets relevant to the mapped technique. The value of the 'Explanation' should be your explanation of why you decided to give the technique and how it relates to the rule. The 'Technique ID' should be the official MITRE technique ID.
        Please don't write anything but the JSON. Rule: {snort_rule}"""


**Prompt 4:**

        prompt4 = f'''You are going to receive a Snort rule and your task is to find as many MITRE ATT&CK techniques as possible that are associated with the rule. Note: You should categorize the techniques to 1 or 2. Technique of type 1 is a technique that you can associate with the rule directly based on the rule. Technique of type 2 is a technique that can be associated with the rule indirectly, based on your knowledge and understanding. The categorization value should be the value 1 or 2, based on the explanation given above. The quotes field value should contain quotes from the rules data that are relevant to the technique mapped and they are the main reason you believe the mapping to this technique is correct. The explanation's value should be your explanation for why you decided to give the technique and how it is associated with the rule. The technique id should be the official MITRE technique id. For each technique include the following information as JSON: sid, Technique id, Technique name, Categorization, Quotes, Explanation. After each rule I will provide you with, answer according to the provided format. Please do not write anything else but the JSON. Rule: {snort_rule}'''


In [None]:
def WTGWE(snort_rule):

  prompt = f"""You are an information security expert. Your task is to label IDS rules for MITRE ATT&CK techniques based on your cybersecurity knowledge. For the task, you are going to get a single Snort IDS rule and you will need to return the most relevant techniques from MITRE ATT&CK that are related to the rule.
  Try to search based on keywords and based on the knowledge you have. For each technique include the following information as JSON in this order: 'Sid', 'Technique ID', 'Technique Name', 'Quotes', 'Explanation'.
  Note: The value of the citation field should contain quotation marks from the data sets relevant to the mapped technique are the main reason you chose this technique to be correct. The value of the explanation should be your explanation of why you decided to give the technique and how it relates to the rule. The technique ID should be the official MITRE technique ID.
  Please don't write anything but the JSON. Rule: {snort_rule}"""

  message = model.generate_content(prompt)

  return message
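
The notebook imports `concurrent.futures`, which suggests that the prompting functions are applied to many rules in parallel. The following is a minimal sketch of how that could look; `label_rules` is a hypothetical helper, and the worker count is an assumption that should stay within the API rate limit.

```python
import concurrent.futures


def label_rules(rules, label_fn, max_workers=4):
    """Apply label_fn to every rule in parallel and return results in input order."""

    def safe(rule):
        # Return the exception instead of raising, so one failure does not stop the run.
        try:
            return label_fn(rule)
        except Exception as exc:
            return exc

    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map preserves the order of the input iterable.
        return list(pool.map(safe, rules))
```

For example, `label_rules(rules_list, WTGWE)` would return one response (or exception) per rule.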


# **Prompting without techniques guide and with 1 example (WTG1E):**
At this stage, the LLM receives a prompt that does not include the list of MITRE ATT&CK techniques, so the results reflect only the knowledge the model acquired during training. The model is asked to classify each rule into techniques based on the content of the rule.

In addition, the prompt includes one worked example (one-shot).

In [None]:
def WTG1E(snort_rule):

  prompt = f"""Q: You are an information security expert. Your task is to label IDS rules for MITRE ATT&CK techniques based on your cybersecurity knowledge. For the task, you are going to get a single Snort IDS rule and you will need to return the most relevant techniques from MITRE ATT&CK that are related to the rule.
    Try to search based on keywords and based on the knowledge you have. For each technique include the following information as JSON in this order: 'Sid', 'Technique ID', 'Technique Name', 'Quotes', 'Explanation'.
    Note: The value of the citation field should contain quotation marks from the data sets relevant to the mapped technique are the main reason you chose this technique to be correct. The value of the explanation should be your explanation of why you decided to give the technique and how it relates to the rule. The technique ID should be the official MITRE technique ID.
    Please don't write anything but the JSON. Rule: "alert tcp $EXTERNAL_NET $HTTP_PORTS -> $HOME_NET any ( msg:""MALWARE-CNC Win.Trojan.GateKeylogger fake 404 response""; flow:to_client,established; http_stat_code; content:""200""; http_stat_msg; content:""OK""; pkt_data; content:"">404 Not Found<"",fast_pattern,nocase; content:"" requested URL / was not found ""; metadata:impact_flag red,ruleset community; service:http; T1056; classtype:trojan-activity; sid:38563; rev:4; )"
    A: [
        "sid": "38563",
        "Technique ID": "T1056",
        "Technique name": "Input Capture",
        "Quotes": "\"Input Capture techniques involve intercepting and capturing user input data, such as keystrokes, to obtain sensitive information. The rule indicates the presence of a Trojan (GateKeylogger) that mimics a '404 Not Found' error to disguise its communication with a command and control server, which is a common method used by keyloggers to stealthily capture input data.\"",
        "Explanation": "This event is generated when activity relating to malware is detected."
    ]

    Q: You are an information security expert. Your task is to label IDS rules for MITRE ATT&CK techniques based on your cybersecurity knowledge. For the task, you are going to get a single Snort IDS rule and you will need to return the most relevant techniques from MITRE ATT&CK that are related to the rule.
    Try to search based on keywords and based on the knowledge you have. For each technique include the following information as JSON in this order: 'Sid', 'Technique ID', 'Technique Name', 'Quotes', 'Explanation'.
    Note: The value of the citation field should contain quotation marks from the data sets relevant to the mapped technique are the main reason you chose this technique to be correct. The value of the explanation should be your explanation of why you decided to give the technique and how it relates to the rule. The technique ID should be the official MITRE technique ID.
    Please don't write anything but the JSON. Rule: {snort_rule}
    A: """

  message = model.generate_content(prompt)

  return message

# **Prompting without techniques guide and with 2 example (WTG2E):**
At this stage, the LLM receives a prompt that does not include the list of MITRE ATT&CK techniques, so the results reflect only the knowledge the model acquired during training. The model is asked to classify each rule into techniques based on the content of the rule.

In addition, the prompt includes two worked examples (two-shot).

In [None]:
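def WTG2E(snort_rule):
  # NOTE: the second example below appears to reuse the rule text of sid 38563,
  # while its answer refers to sid 23934 (T1190).
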
  prompt = f"""Q: You are an information security expert. Your task is to label IDS rules for MITRE ATT&CK techniques based on your cybersecurity knowledge. For the task, you are going to get a single Snort IDS rule and you will need to return the most relevant techniques from MITRE ATT&CK that are related to the rule.
    Try to search based on keywords and based on the knowledge you have. For each technique include the following information as JSON in this order: 'Sid', 'Technique ID', 'Technique Name', 'Quotes', 'Explanation'.
    Note: The value of the citation field should contain quotation marks from the data sets relevant to the mapped technique are the main reason you chose this technique to be correct. The value of the explanation should be your explanation of why you decided to give the technique and how it relates to the rule. The technique ID should be the official MITRE technique ID.
    Please don't write anything but the JSON. Rule: "alert tcp $EXTERNAL_NET $HTTP_PORTS -> $HOME_NET any ( msg:""MALWARE-CNC Win.Trojan.GateKeylogger fake 404 response""; flow:to_client,established; http_stat_code; content:""200""; http_stat_msg; content:""OK""; pkt_data; content:"">404 Not Found<"",fast_pattern,nocase; content:"" requested URL / was not found ""; metadata:impact_flag red,ruleset community; service:http; T1056; classtype:trojan-activity; sid:38563; rev:4; )"
    A: [
        "sid": "38563",
        "Technique ID": "T1056",
        "Technique name": "Input Capture",
        "Quotes": "\"Input Capture techniques involve intercepting and capturing user input data, such as keystrokes, to obtain sensitive information. The rule indicates the presence of a Trojan (GateKeylogger) that mimics a '404 Not Found' error to disguise its communication with a command and control server, which is a common method used by keyloggers to stealthily capture input data.\"",
        "Explanation": "This event is generated when activity relating to malware is detected."
    ]

    Q: You are an information security expert. Your task is to label IDS rules for MITRE ATT&CK techniques based on your cybersecurity knowledge. For the task, you are going to get a single Snort IDS rule and you will need to return the most relevant techniques from MITRE ATT&CK that are related to the rule.
    Try to search based on keywords and based on the knowledge you have. For each technique include the following information as JSON in this order: 'Sid', 'Technique ID', 'Technique Name', 'Quotes', 'Explanation'.
    Note: The value of the citation field should contain quotation marks from the data sets relevant to the mapped technique are the main reason you chose this technique to be correct. The value of the explanation should be your explanation of why you decided to give the technique and how it relates to the rule. The technique ID should be the official MITRE technique ID.
    Please don't write anything but the JSON. Rule: "alert tcp $EXTERNAL_NET $HTTP_PORTS -> $HOME_NET any ( msg:""MALWARE-CNC Win.Trojan.GateKeylogger fake 404 response""; flow:to_client,established; http_stat_code; content:""200""; http_stat_msg; content:""OK""; pkt_data; content:"">404 Not Found<"",fast_pattern,nocase; content:"" requested URL / was not found ""; metadata:impact_flag red,ruleset community; service:http; T1056; classtype:trojan-activity; sid:38563; rev:4; )"
    A: [
        "sid": "23934",
        "Technique ID": "T1190",
        "Technique name": "Exploit Public-Facing Application",
        "Quotes": "\"Exploit Public-Facing Application techniques involve targeting vulnerabilities in externally facing applications to gain unauthorized access or execute arbitrary code. This rule detects an attempted blind SQL injection attack on the Symantec Web Gateway's 'blocked.php' page, which is a common method attackers use to exploit web applications by manipulating SQL queries.\"",
        "Explanation": "SQL injection vulnerability in the management console in Symantec Web Gateway 5.0.x before 5.0.3.18 allows remote attackers to execute arbitrary SQL commands via unspecified vectors, related to a "blind SQL injection" issue."
    ]

    Q: You are an information security expert. Your task is to label IDS rules for MITRE ATT&CK techniques based on your cybersecurity knowledge. For the task, you are going to get a single Snort IDS rule and you will need to return the most relevant techniques from MITRE ATT&CK that are related to the rule.
    Try to search based on keywords and based on the knowledge you have. For each technique include the following information as JSON in this order: 'Sid', 'Technique ID', 'Technique Name', 'Quotes', 'Explanation'.
    Note: The value of the citation field should contain quotation marks from the data sets relevant to the mapped technique are the main reason you chose this technique to be correct. The value of the explanation should be your explanation of why you decided to give the technique and how it relates to the rule. The technique ID should be the official MITRE technique ID.
    Please don't write anything but the JSON. Rule: {snort_rule}
    A: """


  message = model.generate_content(prompt)

  return message
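
The one-shot and two-shot prompts follow the same pattern: the instruction and a solved rule are repeated as `Q:`/`A:` pairs, and the final pair leaves the answer open. That pattern can be generated from a list of examples instead of being pasted by hand. A minimal sketch, with `build_few_shot_prompt` as a hypothetical helper:

```python
def build_few_shot_prompt(instruction, examples, snort_rule):
    """Build a Q/A prompt from (rule, answer) example pairs plus the rule to label."""
    blocks = []
    for rule, answer in examples:
        # Each solved example repeats the instruction, then shows the expected answer.
        blocks.append(f"Q: {instruction} Rule: {rule}\nA: {answer}")
    # The final block ends with an empty answer for the model to complete.
    blocks.append(f"Q: {instruction} Rule: {snort_rule}\nA: ")
    return "\n\n".join(blocks)
```

Passing zero, one, or two example pairs reproduces the structure of the zero-, one-, and two-shot variants.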

# **Prompting with techniques guide and without example (TGWE):**
In the next step, we provide the LLM with the list of all MITRE ATT&CK techniques, so that the model considers every existing technique, including rarely used ones. Each entry includes the technique ID, the technique name, and its description. Because of the model's input-size limit, the techniques are supplied in batches. After each batch, we ask the model to select the matching techniques from that batch (if any exist), and finally we merge the model's answers for each individual rule.


Pre-collect the technique data for TG

In [None]:
def recursive_enter(path: str, file_list: list = None) -> list:
    """Load every .json file found directly inside `path` (subfolders are not visited)."""
    if file_list is None:
        file_list = []

    try:
        # List the directory without changing the working directory
        for item in os.listdir(path):
            full_path = os.path.join(path, item)

            if full_path.endswith(".json"):
                with open(full_path) as f:
                    file_list.append(json.load(f))

    except Exception as e:
        print(f"An error occurred: {e}")

    return file_list

tacticFolder = "/content/Labeling_IDS_to_MITRE/Semester_A/Extract data from MITRE ATTACK/techniques_split"
file_list = []
MITRE_Technique = recursive_enter(tacticFolder, file_list)
print(len(MITRE_Technique))
os.chdir("/content/")
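
The batching and merging described above can be sketched as follows. This is an illustration under stated assumptions: `label_fn` is a hypothetical callable that takes a rule and one batch of techniques and returns a list of technique IDs, and the batch size of 50 is arbitrary.

```python
def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def label_in_batches(snort_rule, techniques, label_fn, batch_size=50):
    """Query label_fn once per batch and merge the answers without duplicates."""
    merged = []
    for batch in batched(techniques, batch_size):
        for technique_id in label_fn(snort_rule, batch):
            if technique_id not in merged:
                merged.append(technique_id)
    return merged
```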

In [None]:
def TGWE(snort_rule, techniques):

  prePrompt = f"""You are an information security expert. Now I will provide you information about techniques from MITRE ATT&CK, you will use the information for a task you will receive later. Do not reply to the information you receive."""

  dataPrompt = f"The information:\n {str(techniques)}"

  response_data = f"""Your task is to label IDS rules for MITRE ATT&CK techniques based on the information I have provided you. For the task, you are going to get a single Snort IDS rule and you will need to return the most relevant techniques associated with the rule from the information I provided you only.
   Note 1: There is not necessarily a suitable technique in the information, return a technique if and only if it has an unambiguous relationship to the provided rule, if not return an empty JSON. For each technique include the following information as JSON in this order: 'Sid', 'Technique ID', 'Technique Name', 'Quotes', 'Explanation'.
   Note 2: The value of the citation field should contain quotation marks from the data sets relevant to the mapped technique are the main reason you chose this technique to be correct. The value of the explanation should be your explanation of why you decided to give the technique and how it relates to the rule. The technique ID should be the official MITRE technique ID.
   Please don't write anything but the JSON. Rule: {snort_rule}"""

  tg_data_list = [prePrompt, dataPrompt, response_data]

  try:
    response = model.generate_content(tg_data_list)
  except Exception:
    # If the request fails, fall back to a request that yields an empty JSON answer
    response = model.generate_content("Return empty JSON {}")

  return response
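
The fallback in `TGWE` replaces a failed request with an empty answer. An alternative, sketched here as an assumption rather than as the notebook's method, is to retry the same request a few times with exponentially growing pauses, which can help when a failure is caused by a temporary rate limit. `call_with_retries` is a hypothetical helper.

```python
import time


def call_with_retries(fn, *args, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(*args); on failure wait base_delay * 2**attempt and try again."""
    for attempt in range(retries):
        try:
            return fn(*args)
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error to the caller
            sleep(base_delay * 2 ** attempt)
```

For example, `call_with_retries(model.generate_content, tg_data_list)` would attempt the request up to three times.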


# **Prompting with techniques guide and with 1 example (TG1E):**
In the next step, we provide the LLM with the list of all MITRE ATT&CK techniques, so that the model considers every existing technique, including rarely used ones. Each entry includes the technique ID, the technique name, and its description. Because of the model's input-size limit, the techniques are supplied in batches. After each batch, we ask the model to select the matching techniques from that batch (if any exist), and finally we merge the model's answers for each individual rule.

In addition, the prompt includes one worked example (one-shot).

In [None]:
def TG1E(snort_rule, techniques):

  prePrompt = f"""You are an information security expert. Now I will provide you information about techniques from MITRE ATT&CK, you will use the information for a task you will receive later. Do not reply to the information you receive."""

  dataPrompt = f"The information:\n {str(techniques)}"

  response_data = f"""Q: You are an information security expert. Your task is to label IDS rules for MITRE ATT&CK techniques based on your cybersecurity knowledge. For the task, you are going to get a single Snort IDS rule and you will need to return the most relevant techniques from MITRE ATT&CK that are related to the rule.
    Try to search based on keywords and based on the knowledge you have. For each technique include the following information as JSON in this order: 'Sid', 'Technique ID', 'Technique Name', 'Quotes', 'Explanation'.
    Note: The value of the citation field should contain quotation marks from the data sets relevant to the mapped technique are the main reason you chose this technique to be correct. The value of the explanation should be your explanation of why you decided to give the technique and how it relates to the rule. The technique ID should be the official MITRE technique ID.
    Please don't write anything but the JSON. Rule: "alert tcp $EXTERNAL_NET $HTTP_PORTS -> $HOME_NET any ( msg:""MALWARE-CNC Win.Trojan.GateKeylogger fake 404 response""; flow:to_client,established; http_stat_code; content:""200""; http_stat_msg; content:""OK""; pkt_data; content:"">404 Not Found<"",fast_pattern,nocase; content:"" requested URL / was not found ""; metadata:impact_flag red,ruleset community; service:http; T1056; classtype:trojan-activity; sid:38563; rev:4; )"
    A: [
        "sid": "38563",
        "Technique ID": "T1056",
        "Technique name": "Input Capture",
        "Quotes": "\"Input Capture techniques involve intercepting and capturing user input data, such as keystrokes, to obtain sensitive information. The rule indicates the presence of a Trojan (GateKeylogger) that mimics a '404 Not Found' error to disguise its communication with a command and control server, which is a common method used by keyloggers to stealthily capture input data.\"",
        "Explanation": "This event is generated when activity relating to malware is detected."
    ]

    Q: You are an information security expert. Your task is to label IDS rules for MITRE ATT&CK techniques based on your cybersecurity knowledge. For the task, you are going to get a single Snort IDS rule and you will need to return the most relevant techniques from MITRE ATT&CK that are related to the rule.
    Try to search based on keywords and based on the knowledge you have. For each technique include the following information as JSON in this order: 'Sid', 'Technique ID', 'Technique Name', 'Quotes', 'Explanation'.
    Note: The value of the citation field should contain quotation marks from the data sets relevant to the mapped technique are the main reason you chose this technique to be correct. The value of the explanation should be your explanation of why you decided to give the technique and how it relates to the rule. The technique ID should be the official MITRE technique ID.
    Please don't write anything but the JSON. Rule: {snort_rule}
    A: """

  tg_data_list = [prePrompt, dataPrompt, response_data]

  try:
    response = model.generate_content(tg_data_list)
  except Exception:
    # If the request fails, fall back to a request that yields an empty JSON answer
    response = model.generate_content("Return empty JSON {}")

  return response


# **Prompting with techniques guide and with 2 example (TG2E):**
In the next step, we provide the LLM with the list of all MITRE ATT&CK techniques, so that the model considers every existing technique, including rarely used ones. Each entry includes the technique ID, the technique name, and its description. Because of the model's input-size limit, the techniques are supplied in batches. After each batch, we ask the model to select the matching techniques from that batch (if any exist), and finally we merge the model's answers for each individual rule.

In addition, the prompt includes two worked examples (two-shot).

In [None]:
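def TG2E(snort_rule, techniques):
  # NOTE: the second example below appears to reuse the rule text of sid 38563,
  # while its answer refers to sid 23934 (T1190).
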
  prePrompt = f"""You are an information security expert. Now I will provide you information about techniques from MITRE ATT&CK, you will use the information for a task you will receive later. Do not reply to the information you receive."""

  dataPrompt = f"The information:\n {str(techniques)}"

  response_data = f"""Q: You are an information security expert. Your task is to label IDS rules for MITRE ATT&CK techniques based on your cybersecurity knowledge. For the task, you are going to get a single Snort IDS rule and you will need to return the most relevant techniques from MITRE ATT&CK that are related to the rule.
    Try to search based on keywords and based on the knowledge you have. For each technique include the following information as JSON in this order: 'Sid', 'Technique ID', 'Technique Name', 'Quotes', 'Explanation'.
    Note: The value of the citation field should contain quotation marks from the data sets relevant to the mapped technique are the main reason you chose this technique to be correct. The value of the explanation should be your explanation of why you decided to give the technique and how it relates to the rule. The technique ID should be the official MITRE technique ID.
    Please don't write anything but the JSON. Rule: "alert tcp $EXTERNAL_NET $HTTP_PORTS -> $HOME_NET any ( msg:""MALWARE-CNC Win.Trojan.GateKeylogger fake 404 response""; flow:to_client,established; http_stat_code; content:""200""; http_stat_msg; content:""OK""; pkt_data; content:"">404 Not Found<"",fast_pattern,nocase; content:"" requested URL / was not found ""; metadata:impact_flag red,ruleset community; service:http; T1056; classtype:trojan-activity; sid:38563; rev:4; )"
    A: [
        "sid": "38563",
        "Technique ID": "T1056",
        "Technique name": "Input Capture",
        "Quotes": "\"Input Capture techniques involve intercepting and capturing user input data, such as keystrokes, to obtain sensitive information. The rule indicates the presence of a Trojan (GateKeylogger) that mimics a '404 Not Found' error to disguise its communication with a command and control server, which is a common method used by keyloggers to stealthily capture input data.\"",
        "Explanation": "This event is generated when activity relating to malware is detected."
    ]

    Q: You are an information security expert. Your task is to label IDS rules for MITRE ATT&CK techniques based on your cybersecurity knowledge. For the task, you are going to get a single Snort IDS rule and you will need to return the most relevant techniques from MITRE ATT&CK that are related to the rule.
    Try to search based on keywords and based on the knowledge you have. For each technique include the following information as JSON in this order: 'Sid', 'Technique ID', 'Technique Name', 'Quotes', 'Explanation'.
    Note: The value of the citation field should contain quotation marks from the data sets relevant to the mapped technique are the main reason you chose this technique to be correct. The value of the explanation should be your explanation of why you decided to give the technique and how it relates to the rule. The technique ID should be the official MITRE technique ID.
    Please don't write anything but the JSON. Rule: "alert tcp $EXTERNAL_NET $HTTP_PORTS -> $HOME_NET any ( msg:""MALWARE-CNC Win.Trojan.GateKeylogger fake 404 response""; flow:to_client,established; http_stat_code; content:""200""; http_stat_msg; content:""OK""; pkt_data; content:"">404 Not Found<"",fast_pattern,nocase; content:"" requested URL / was not found ""; metadata:impact_flag red,ruleset community; service:http; T1056; classtype:trojan-activity; sid:38563; rev:4; )"
    A: [
        "sid": "23934",
        "Technique ID": "T1190",
        "Technique name": "Exploit Public-Facing Application",
        "Quotes": "\"Exploit Public-Facing Application techniques involve targeting vulnerabilities in externally facing applications to gain unauthorized access or execute arbitrary code. This rule detects an attempted blind SQL injection attack on the Symantec Web Gateway's 'blocked.php' page, which is a common method attackers use to exploit web applications by manipulating SQL queries.\"",
        "Explanation": "SQL injection vulnerability in the management console in Symantec Web Gateway 5.0.x before 5.0.3.18 allows remote attackers to execute arbitrary SQL commands via unspecified vectors, related to a "blind SQL injection" issue."
    ]

    Q: You are an information security expert. Your task is to label IDS rules for MITRE ATT&CK techniques based on your cybersecurity knowledge. For the task, you are going to get a single Snort IDS rule and you will need to return the most relevant techniques from MITRE ATT&CK that are related to the rule.
    Try to search based on keywords and based on the knowledge you have. For each technique include the following information as JSON in this order: 'Sid', 'Technique ID', 'Technique Name', 'Quotes', 'Explanation'.
    Note: The value of the citation field should contain quotation marks from the data sets relevant to the mapped technique are the main reason you chose this technique to be correct. The value of the explanation should be your explanation of why you decided to give the technique and how it relates to the rule. The technique ID should be the official MITRE technique ID.
    Please don't write anything but the JSON. Rule: {snort_rule}
    A: """

  tg_data_list = [prePrompt, dataPrompt, response_data]

  try:
    response = model.generate_content(tg_data_list)
  except Exception as e:
    # Fall back to an empty-JSON request so the caller still receives a response object
    print(f"generate_content failed: {e}")
    response = model.generate_content("Return empty JSON {}")
  return response


# Write to csv

In [32]:
def write_csv_ZS(filename, rule_dict):
  # Define the field names
  field_names = ["Technique_id", "True_labels"]

  # Open the CSV file in write mode (truncating any existing content)
  with open(filename, "w", newline="") as csvfile: # "prompting_without_techniques_guide.csv"
      # Create a DictWriter object with the specified field names
      writer = csv.DictWriter(csvfile, fieldnames=field_names)

      # Write the header row
      writer.writeheader()

      # Extract relevant data from each item and write it as a dictionary
      counter = 0
      for key, value in rule_dict.items():
        text = clean_response(value)
        technique_ids = []

        if "'Sid" in text:
          # Define a regex pattern to switch single quotes to double quotes
          pattern = re.compile(r"((^|\s)'((?:[^'\\]|\\.)*)'(?=[\s.,:;!?)]))|(:\s*'((?:[^'\\]|\\.)+)')")
          # Switch single quotes to double quotes
          text = pattern.sub(lambda x: x.group().replace("'", '"'), text)

          pattern = re.compile(r'"\S+"[\s\.]|\s"[\w\s]*"\s')
          text = re.sub(pattern, "", text)

        # Extracting "TXXXX" numbers using regular expression
        technique_ids = re.findall(r'[\'\"](T\d+(?:\.\d+)?)', text)

        # Extracting "Sid"
        match = re.search(r'[\'\"][sS]id[\'\"]: [\'\"](\d+)[\'\"]', text)
        if match:
            sid_number = match.group(1)


        # Assuming each item has all necessary fields:
        insertRow = {
            "Technique_id": technique_ids,  # Handle potential absence
            "True_labels": true_labels[counter],
        }
        writer.writerow(insertRow)
        counter += 1
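
The writer functions in this section recover technique IDs and the Sid from free-form model text with regular expressions. The sketch below runs that extraction on its own. The reply string is a hypothetical example shaped like the few-shot answers, and the Sid pattern uses `[sS]` for the first letter.

```python
import re

# A quote followed by T, digits, and an optional ".NNN" sub-technique suffix
TECHNIQUE_RE = re.compile(r'[\'"](T\d+(?:\.\d+)?)')
# "Sid" or "sid" as a quoted key, followed by a quoted number
SID_RE = re.compile(r'[\'"][sS]id[\'"]: [\'"](\d+)[\'"]')

# Hypothetical model reply, for illustration only
reply = '{"Sid": "38563", "Technique ID": "T1056.001", "Technique Name": "Keylogging"}'

technique_ids = TECHNIQUE_RE.findall(reply)
sid_match = SID_RE.search(reply)
sid = sid_match.group(1) if sid_match else None
print(sid, technique_ids)
```

Keys such as `"Technique ID"` are not matched, because the pattern requires at least one digit directly after the `T`.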


Write without techniques guide

In [29]:
def write_csv_WTG(filename, rule_dict):
  # Define the field names
  field_names = ["Sid", "Response", "Technique_id", "True_labels"]

  # Open the CSV file in write mode (truncating any existing content)
  with open(filename, "w", newline="") as csvfile: # "prompting_without_techniques_guide.csv"
      # Create a DictWriter object with the specified field names
      writer = csv.DictWriter(csvfile, fieldnames=field_names)

      # Write the header row
      writer.writeheader()

      # Extract relevant data from each item and write it as a dictionary
      counter = 0
      for key, value in rule_dict.items():
        text = clean_response(value)
        technique_ids = []

        if "'Sid" in text:
          # Define a regex pattern to switch single quotes to double quotes
          pattern = re.compile(r"((^|\s)'((?:[^'\\]|\\.)*)'(?=[\s.,:;!?)]))|(:\s*'((?:[^'\\]|\\.)+)')")
          # Switch single quotes to double quotes
          text = pattern.sub(lambda x: x.group().replace("'", '"'), text)

          pattern = re.compile(r'"\S+"[\s\.]|\s"[\w\s]*"\s')
          text = re.sub(pattern, "", text)

        # Extracting "TXXXX" numbers using regular expression
        technique_ids = re.findall(r'[\'\"](T\d+(?:\.\d+)?)', text)

        # Extracting "Sid"
        match = re.search(r'[\'\"][sS]id[\'\"]: [\'\"](\d+)[\'\"]', text)
        # Fall back to the dictionary key (the rule's Sid) when the response does not state one
        sid_number = match.group(1) if match else key


        # Assuming each item has all necessary fields:
        insertRow = {
            "Sid": sid_number,
            "Response": text,
            "Technique_id": technique_ids,  # Handle potential absence
            "True_labels": true_labels[counter],
        }
        writer.writerow(insertRow)
        counter += 1


Write with techniques guide

In [None]:
import csv
import re

headersCSV_TG = ["Sid", "Response", "Technique_id", "True_labels"]

def init_file(fileName):
  # Initial write to csv with header
  with open(fileName, 'w', newline='') as csvfile: # 'prompting_with_techniques_guide.csv'
      writer = csv.DictWriter(csvfile, fieldnames=headersCSV_TG)
      writer.writeheader()

def appendToCSV(rows_data, counter, fileName) -> None:
    '''
    rows_data -> {213: [<IPython.core.display.Markdown object>, <IPython.core.display.Markdown object>, ...]}
    '''
    # Open the CSV file in append mode to add new rows
    with open(fileName, 'a', newline='') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=headersCSV_TG)

        # Loop through each row and write data
        for row, value in rows_data.items():
            technique_ids = []
            response_text = ""
            for i in value:
                text = clean_response(i)
                response_text += text
                #print(text)

                try:
                    # Extracting "TXXXX" numbers using regular expression
                    technique_ids.extend(re.findall(r': [\'\"](T\d+(?:\.\d+)?)', text))
                except Exception as e:
                    print(f"Error extracting technique IDs: {e}")

            insertRow = {
                "Sid": row,
                "Response": response_text,
                "Technique_id": technique_ids,
                "True_labels": true_labels[counter]
            }

            # Write the row to the CSV file
            writer.writerow(insertRow)
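
Because `Technique_id` and `True_labels` are written as Python lists, `csv` stores their string representation, and they have to be parsed again when the file is read back. A minimal sketch of that round trip, using an in-memory buffer and a hypothetical row instead of a real file:

```python
import ast
import csv
import io

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Technique_id", "True_labels"])
writer.writeheader()
# Hypothetical row: predicted and true technique IDs for one rule
writer.writerow({"Technique_id": ["T1056", "T1071"], "True_labels": ["T1056"]})

buffer.seek(0)
row = next(csv.DictReader(buffer))
raw = row["Technique_id"]        # a string, not a list
parsed = ast.literal_eval(raw)   # back to a real list
print(type(raw).__name__, parsed)
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for reading these columns.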


# WTG - Generic

In [25]:
def WTG(functionName, rules_list):
  rule_dict = {}
  max_retries = 3  # Maximum number of retries

  for index, rule in enumerate(rules_list):
      retries = 0
      time.sleep(4)
      print(f"------------- {index} -------------")
      while retries < max_retries:
          try:
              res = functionName(rule)
              text = res.text
              # Check if the text contains the desired pattern
              t_numbers = re.findall(r'[\'\"](T\d+(?:\.\d+)?)', text)
              if t_numbers:  # If the pattern is found
                  rule_dict[data['Sid'][index]] = to_markdown(text)
                  break  # Break out of the retry loop if successful
              else:
                  print("Desired pattern not found in the text. Retrying...")
                  retries += 1
                  time.sleep(1)  # Wait for a short duration before retrying
          except Exception as e:
              print(f"An error occurred: {e}")
              retries += 1
              if retries < max_retries:
                  print(f"Retrying... ({retries}/{max_retries})")
                  time.sleep(1)  # Wait for a short duration before retrying
              else:
                  print("Max retries reached. Unable to process this rule.")


  return rule_dict
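
The retry logic in `WTG` can be isolated into a small helper. The sketch below is a simplified variant under stated assumptions, not the notebook's code: `call_with_retries` and `flaky_ask` are hypothetical names, and `flaky_ask` stands in for a model call that fails once before answering.

```python
import re
import time

def call_with_retries(ask, rule, max_retries=3, delay=0.0):
    """Call ask(rule) until the reply contains a technique ID or retries run out."""
    for attempt in range(1, max_retries + 1):
        try:
            text = ask(rule)
            if re.search(r'[\'"]T\d+(?:\.\d+)?', text):
                return text
        except Exception as exc:  # a failed call counts as one attempt
            print(f"attempt {attempt} failed: {exc}")
        time.sleep(delay)
    return None

# Hypothetical stand-in for a model call: fails once, then answers
calls = {"n": 0}

def flaky_ask(rule):
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("temporary error")
    return '{"Technique ID": "T1190"}'

result = call_with_retries(flaky_ask, "alert tcp ...")
print(calls["n"], result)
```

Returning `None` after the last attempt lets the caller decide whether to skip the rule or record an empty answer.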


# TG - Generic

In [None]:
def tg_split_data(functionName, rules_list_index, index, fileName):
  # functionName: prompt function taking (rule, batch), e.g. TG2E
  # index: position in `data` of the first rule in rules_list_index

  for rule in rules_list_index:
    sid = data['Sid'][index]
    print(f"index {index} \t Sid: {sid}")

    tg_dict = {sid: []}
    for count, batch in enumerate(MITRE_Technique): # 11 files
      time.sleep(2)
      res = functionName(rule, batch)
      try:
        tg_dict[sid].append(to_markdown(res.text))
      except Exception:
        # Accessing res.text can raise (e.g. when the response has no text part); store an empty JSON instead
        tg_dict[sid].append(to_markdown("{}"))
      print(f"~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{count}~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~")

    # Write to csv
    appendToCSV(tg_dict, index, fileName)
    index += 1

# Run Experiments

without data

In [33]:
# Without Example Without Techniques Guide
# rule_dict_ZS = WTG(ZS, rules_list)
write_csv_ZS("zero_shot.csv", rule_dict_ZS)


# # Without Example Without Techniques Guide
# rule_dict_WTG = WTG(WTGWE, rules_list)
# write_csv_WTG("prompting_without_techniques_guide_zero_shot.csv", rule_dict_WTG)


# # With 1 Example Without Techniques Guide
# rule_dict_WTG1E = WTG(WTG1E, rules_list)
# write_csv_WTG("prompting_without_techniques_guide_one_shot.csv", rule_dict_WTG1E)


# # With 2 Examples Without Techniques Guide
# rule_dict_WTG2E = WTG(WTG2E, rules_list)
# write_csv_WTG("prompting_without_techniques_guide_two_shot.csv", rule_dict_WTG2E)

with data

In [None]:
# Without Example With Techniques Guide
init_file('prompting_with_techniques_guide_zero_shot.csv')

rule_dict_TGWE_01 = rules_list[:100] # index 0-99
tg_split_data(TGWE, rule_dict_TGWE_01, 0, 'prompting_with_techniques_guide_zero_shot.csv')

# rule_dict_TGWE_02 = rules_list[100:200] # index 100-199
# tg_split_data(TGWE, rule_dict_TGWE_02, 100, 'prompting_with_techniques_guide_zero_shot.csv')

# rule_dict_TGWE_03 = rules_list[200:] # index 200-299
# tg_split_data(TGWE, rule_dict_TGWE_03, 200, 'prompting_with_techniques_guide_zero_shot.csv')



# With 1 Example With Techniques Guide
init_file('prompting_with_techniques_guide_one_shot.csv')

rule_dict_TG1E_01 = rules_list[:100] # index 0-99
tg_split_data(TG1E, rule_dict_TG1E_01, 0, 'prompting_with_techniques_guide_one_shot.csv')

# rule_dict_TG1E_02 = rules_list[100:200] # index 100-199
# tg_split_data(TG1E, rule_dict_TG1E_02, 100, 'prompting_with_techniques_guide_one_shot.csv')

# rule_dict_TG1E_03 = rules_list[200:] # index 200-299
# tg_split_data(TG1E, rule_dict_TG1E_03, 200, 'prompting_with_techniques_guide_one_shot.csv')



# With 2 Examples With Techniques Guide
init_file('prompting_with_techniques_guide_two_shot.csv')

rule_dict_TG2E_01 = rules_list[:100] # index 0-99
tg_split_data(TG2E, rule_dict_TG2E_01, 0, 'prompting_with_techniques_guide_two_shot.csv')

# rule_dict_TG2E_02 = rules_list[100:200] # index 100-199
# tg_split_data(TG2E, rule_dict_TG2E_02, 100, 'prompting_with_techniques_guide_two_shot.csv')

# rule_dict_TG2E_03 = rules_list[200:] # index 200-299
# tg_split_data(TG2E, rule_dict_TG2E_03, 200, 'prompting_with_techniques_guide_two_shot.csv')






# rule_dict_WTG = rules_list[:5] # index 0-1
# #rules_list2 = rules_list[2:4] # index 2-3


# tg_split_data(TGWE, rules_list_index, index, fileName)


# tg_split_data(TGWE, rules_list1, 0)
# # tg_split_data(rules_list2, 2)

# Evaluation


*   Precision
*   Recall
*   F1
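
For a single rule, these metrics compare the set of predicted technique IDs with the set of true ones. The sketch below works through one case. `rule_scores` and the sample labels are hypothetical and only illustrate the set-based definitions.

```python
def rule_scores(true_ids, predicted_ids):
    """Set-based precision, recall and F1 for one rule."""
    true_set, pred_set = set(true_ids), set(predicted_ids)
    hits = len(true_set & pred_set)
    precision = hits / len(pred_set) if pred_set else 0.0
    recall = hits / len(true_set) if true_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

# Hypothetical labels: one of two predictions is correct, one true label is missed
p, r, f = rule_scores(["T1056", "T1071"], ["T1056", "T1190"])
print(p, r, f)
```

Here one of the two predicted techniques is correct (precision 0.5) and one of the two true techniques is found (recall 0.5), so F1 is also 0.5.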



In [34]:
import ast
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix

def evaluation(true_labels, predicted_labels):

  recall = []
  precision = []
  f1 = []


  for i in range(len(true_labels)):
    trueList = ast.literal_eval(true_labels[i])
    predList = ast.literal_eval(predicted_labels[i])
    # Keep only the parent technique ('TXXXX') of each predicted ID
    predList = [item.split('.')[0] for item in predList]
    intersection = set(trueList).intersection(set(predList))

    # Rules with no predicted technique are skipped and do not count toward the averages
    if (len(predList) != 0):
      recall.append(len(intersection) / len(set(trueList)))
      precision.append(len(intersection) / len(set(predList)))
      # Use the values just appended: once a rule has been skipped, index i no longer matches these lists
      try:
        f1.append((2 * precision[-1] * recall[-1]) / (recall[-1] + precision[-1]))
      except ZeroDivisionError:
        f1.append(0)

  # Avg.
  average_recall = sum(recall) / len(recall)
  average_precision = sum(precision) / len(precision)
  # F1 is computed from the averaged precision and recall, not as the mean of the per-rule F1 values
  average_f1 = (2 * average_recall * average_precision) / (average_recall + average_precision)

  print("Metric    |   Score")
  print("-------------------")
  print(f"Precision |   {average_precision:.2f}")
  print(f"Recall    |   {average_recall:.2f}")
  print(f"F1 Score  |   {average_f1:.2f}")

#### ZS

In [35]:
loadData = pd.read_csv("zero_shot.csv")
true_labels_ZS = loadData['True_labels']
predicted_labels = loadData['Technique_id']

evaluation(true_labels_ZS, predicted_labels)


Metric    |   Score
-------------------
Precision |   0.08
Recall    |   0.08
F1 Score  |   0.08


#### WTGWE

In [None]:
loadData = pd.read_csv("prompting_without_techniques_guide_zero_shot.csv")
true_labels_WTGWE = loadData['True_labels']
predicted_labels = loadData['Technique_id']

evaluation(true_labels_WTGWE, predicted_labels)


Metric    |   Score
-------------------
Precision |   0.09
Recall    |   0.10
F1 Score  |   0.09


#### WTG1E

In [None]:
loadData = pd.read_csv("prompting_without_techniques_guide_one_shot.csv")
true_labels_WTG1E = loadData['True_labels']
predicted_labels = loadData['Technique_id']

evaluation(true_labels_WTG1E, predicted_labels)

#### WTG2E

In [None]:
loadData = pd.read_csv("prompting_without_techniques_guide_two_shot.csv")
true_labels_WTG2E = loadData['True_labels']
predicted_labels = loadData['Technique_id']

evaluation(true_labels_WTG2E, predicted_labels)


#### TGWE

In [None]:
loadData = pd.read_csv('prompting_with_techniques_guide_zero_shot.csv')
true_labels_TGWE = loadData['True_labels']
predicted_labels = loadData['Technique_id']

evaluation(true_labels_TGWE, predicted_labels)

#### TG1E

In [None]:
loadData = pd.read_csv('prompting_with_techniques_guide_one_shot.csv')
true_labels_TG1E = loadData['True_labels']
predicted_labels = loadData['Technique_id']

evaluation(true_labels_TG1E, predicted_labels)

#### TG2E

In [None]:
loadData = pd.read_csv('prompting_with_techniques_guide_two_shot.csv')
true_labels_TG2E = loadData['True_labels']
predicted_labels = loadData['Technique_id']

evaluation(true_labels_TG2E, predicted_labels)

# Visualization Data

In [None]:
!pip install tabulate
import pandas as pd
from sklearn.metrics import precision_score, recall_score, f1_score
from tabulate import tabulate


# List of file names
file_names = [
    'Nir_Dataset_162_without_techniques.csv', 'Nir_Dataset_162_with_techniques.csv', # Nir data
    'Our_Dataset_300_without_techniques_and_zero_shot.csv', 'Our_Dataset_300_with_techniques_zero_shot.csv', # Zero shot
    'Our_Dataset_300_without_techniques_and_one_shot.csv', # One shot
    'Our_Dataset_300_without_techniques_and_two_shot.csv', 'Our_Dataset_300_with_techniques_two_shot.csv'      # Two shot
]

# Load all files into a dictionary of DataFrames
data_frames = {file: pd.read_csv(file) for file in file_names}

# # Optionally, display the first few rows of each DataFrame
# for name, df in data_frames.items():
#     print(f"\n{name}:\n", df.head())




In [None]:
# Loop over each file, load the data, and evaluate
for file in file_names:
    try:
        # Load the data
        loadData = pd.read_csv(file)

        # Extract the true and predicted labels
        true_labels = loadData['True_labels']
        predicted_labels = loadData['Technique_id']

        # Perform evaluation
        print(f'\nEvaluating {file}')
        evaluation(true_labels, predicted_labels)

    except KeyError as e:
        print(f'Error: Column {e} not found in {file}')
    except Exception as e:
        print(f'An error occurred while processing {file}: {e}')


Evaluating Nir_Dataset_162_without_techniques.csv
Metric    |   Score
-------------------
Precision |   0.14
Recall    |   0.12
F1 Score  |   0.13

Evaluating Nir_Dataset_162_with_techniques.csv
Metric    |   Score
-------------------
Precision |   0.09
Recall    |   0.24
F1 Score  |   0.13

Evaluating Our_Dataset_300_without_techniques_and_zero_shot.csv
Metric    |   Score
-------------------
Precision |   0.09
Recall    |   0.10
F1 Score  |   0.09

Evaluating Our_Dataset_300_with_techniques_zero_shot.csv
Metric    |   Score
-------------------
Precision |   0.09
Recall    |   0.16
F1 Score  |   0.12

Evaluating Our_Dataset_300_without_techniques_and_one_shot.csv
Metric    |   Score
-------------------
Precision |   0.09
Recall    |   0.11
F1 Score  |   0.10

Evaluating Our_Dataset_300_without_techniques_and_two_shot.csv
Metric    |   Score
-------------------
Precision |   0.07
Recall    |   0.07
F1 Score  |   0.07

Evaluating Our_Dataset_300_with_techniques_two_shot.csv
Metric    |

In [None]:
def evaluate_metrics(true_labels, predicted_labels):
    # Note: both columns hold stringified lists, so scikit-learn treats each distinct
    # string as one class; a prediction counts as correct only on an exact string match.
    precision = precision_score(true_labels, predicted_labels, average='macro', zero_division=0)
    recall = recall_score(true_labels, predicted_labels, average='macro', zero_division=0)
    f1 = f1_score(true_labels, predicted_labels, average='macro', zero_division=0)
    return precision, recall, f1

results = []

for file in file_names:
    try:
        loadData = pd.read_csv(file)
        true_labels = loadData['True_labels']
        predicted_labels = loadData['Technique_id']

        precision, recall, f1 = evaluate_metrics(true_labels, predicted_labels)

        results.append({
            'filename': file,
            'Precision': f"{precision:.2f}",
            'Recall': f"{recall:.2f}",
            'F1 Score': f"{f1:.2f}"
        })

    except KeyError as e:
        print(f'Error: Column {e} not found in {file}')
    except Exception as e:
        print(f'An error occurred while processing {file}: {e}')

results_df = pd.DataFrame(results)
print(tabulate(results_df, headers='keys', tablefmt='grid'))