## A tutorial to identify evidence entities from a cyber incident report

The cyber incident report records a conversation between an IT Security Specialist and an Employee. The conversation describes an email phishing attack scenario.

### Goal
- Become familiar with [DSPy: Declarative Self-improving Language Programs, pythonically](https://github.com/stanfordnlp/dspy).
    - DSPy is a framework for algorithmically optimizing LM prompts and weights.
    - The framework for programming—not prompting—foundation models
- Identify a list of evidence entities
- Identify a list of relationships between entities

### Step 1: Download libraries and files for the lab
- Make sure you download the necessary libraries and files.
- All downloaded and saved files are located in the `content` folder if you are using Google Colab.

In [2]:
# uncomment the commands to download libraries and files
#!pip install python-dotenv
#!pip install dspy-ai
#!pip install graphviz
# !wget https://raw.githubusercontent.com/frankwxu/digital-forensics-lab/main/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/conversation.txt

import dspy
import os
import openai
import json
from dotenv import load_dotenv
from IPython.display import display

### Step 2: Configure DSPy with OpenAI
- You `MUST` have an OpenAI API key
- Load the OpenAI API key from the `openai_api_key.txt` file
- Or, hard-code your OpenAI API key

In [3]:
def set_dspy():
    # ==============set OpenAI environment=========
    # Path to your API key file
    key_file_path = "openai_api_key.txt"

    # Load the API key from the file
    with open(key_file_path, "r") as file:
        openai_api_key = file.read().strip()

    # Set the API key as an environment variable
    os.environ["OPENAI_API_KEY"] = openai_api_key
    openai.api_key = os.environ["OPENAI_API_KEY"]
    turbo = dspy.OpenAI(model="gpt-3.5-turbo", max_tokens=2000, temperature=0.5)
    dspy.settings.configure(lm=turbo)
    # ==============end of set OpenAI environment=========
    return turbo


def set_dspy_hardcode_openai_key():
    os.environ["OPENAI_API_KEY"] = (
        "sk-proj-yourapikeyhere"
    )
    openai.api_key = os.environ["OPENAI_API_KEY"]
    turbo = dspy.OpenAI(model="gpt-3.5-turbo",  temperature=0, max_tokens=2000)
    dspy.settings.configure(lm=turbo)
    return turbo

# Provide `openai_api_key.txt` containing your OpenAI API key
turbo = set_dspy()
# Optionally, hard-code your OpenAI API key inside set_dspy_hardcode_openai_key()
# turbo=set_dspy_hardcode_openai_key()
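
Step 1 imports `load_dotenv` but neither function above uses it. As a minimal sketch of a third option, the helper below prefers a key that is already in the environment and only then falls back to the key file. The name `read_api_key` and its parameters are hypothetical and are not part of DSPy or the lab files.

```python
import os
from pathlib import Path


def read_api_key(env_var="OPENAI_API_KEY", key_file="openai_api_key.txt"):
    """Return the API key from the environment, falling back to a key file.

    Hypothetical helper for illustration; not part of DSPy or the lab files.
    """
    # Prefer a key that is already exported in the environment
    key = os.environ.get(env_var, "").strip()
    if key:
        return key
    # Otherwise read the key file used by set_dspy()
    path = Path(key_file)
    if path.is_file():
        return path.read_text(encoding="utf-8").strip()
    raise RuntimeError(f"No API key found: set {env_var} or create {key_file}")
```

If the key is stored in a `.env` file, calling `load_dotenv()` before `read_api_key()` should place it in the environment first. This keeps the key out of the notebook itself, which matters when the notebook is shared.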

### Step 3: Load the cyber incident report (e.g., conversation)

In [4]:
def load_text_file(file_path):
    """
    Load a text file and return its contents as a string.

    Parameters:
    file_path (str): The path to the text file.

    Returns:
    str: The contents of the text file.
    """
    try:
        with open(file_path, "r") as file:
            contents = file.read()
        return contents
    except FileNotFoundError:
        return "File not found."
    except Exception as e:
        return f"An error occurred: {e}"

conversation = load_text_file("conversation.txt")
print(conversation)

Alice: Hey Bob, I just got a strange email from support@banksecure.com. It says I need to verify my account details urgently. The subject line was "Urgent: Verify Your Account Now". The email looks suspicious to me.

Bob: Hi Alice, that does sound fishy. Can you forward me the email? I’ll take a look at the headers to see where it came from.

Alice: Sure, forwarding it now.

Bob: Got it. Let’s see... The email came from IP address 192.168.10.45, but the domain banksecure.com is not their official domain. It's actually registered to someone in Russia.

Alice: That’s definitely not right. Should I be worried?

Bob: We should investigate further. Did you click on any links or download any attachments?

Alice: I did click on a link that took me to a page asking for my login credentials. I didn't enter anything though. The URL was http://banksecure-verification.com/login.

Bob: Good call on not entering your details. Let’s check the URL. This domain was just registered two days ago. It’s hi
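
Note that `load_text_file` returns the message `"File not found."` as an ordinary string, so a missing file would be sent to the LLM as if it were the conversation. A hedged alternative is to raise instead. The function name `load_text_file_strict` is hypothetical, and the UTF-8 encoding is an assumption based on the curly apostrophes in the conversation.

```python
from pathlib import Path


def load_text_file_strict(file_path):
    """Return the file contents, raising if the file cannot be found.

    Hypothetical variant of load_text_file, shown for illustration only.
    """
    path = Path(file_path)
    if not path.is_file():
        raise FileNotFoundError(f"Conversation file not found: {file_path}")
    # Assumption: the file is UTF-8 encoded (it contains curly apostrophes)
    return path.read_text(encoding="utf-8")
```

With this variant, a wrong path stops the notebook at the loading step rather than producing a confusing answer later.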

### Step 4: Tell an LLM `WHAT` the inputs/outputs are by defining a DSPy Signature

- A signature is one of the basic building blocks in DSPy's prompt programming
- It is a declarative specification of input/output behavior of a DSPy module
    - Think about a function signature
- It lets you tell the LLM what it needs to do.
    - You don't need to specify how the LLM should be asked to do it.
- The following signature identifies a list of evidence based on the conversation
    - Inherit from `dspy.Signature`
    - Exactly `ONE` input, e.g., the conversation
    - Exactly `ONE` output, e.g., a list of evidence entities

### The following `EvidenceIdentifier` is equivalent to 

```
Identify evidence entities from a conversation ....
---
Follow the following format.
Question: a conversation between an IT Security Specialist and Employee
Answer: a list of evidence, including ...
---
Question: {a new unseen conversation}
Answer: write your answer here
```
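
To make the mapping concrete, the template above can be mimicked in plain Python. This is only an illustration of how a docstring and two field descriptions could fill the three sections. It is not how DSPy builds prompts internally, and `render_prompt` is a hypothetical name.

```python
def render_prompt(instructions, input_desc, output_desc, question):
    """Assemble a prompt with the three sections of the template above.

    Illustration only: this is NOT DSPy's internal implementation.
    """
    sections = [
        # Section 1: task description (the signature's docstring)
        instructions,
        # Section 2: format, built from the field descriptions
        "Follow the following format.\n\n"
        f"Question: {input_desc}\n"
        f"Answer: {output_desc}",
        # Section 3: the new, unseen input with an empty answer slot
        f"Question: {question}\nAnswer:",
    ]
    return "\n\n---\n\n".join(sections)


prompt = render_prompt(
    instructions="Identify evidence entities from a conversation.",
    input_desc="a conversation between an IT Security Specialist and Employee",
    output_desc="a list of evidence as a Python dictionary",
    question="Alice: I got an email from support@banksecure.com.",
)
print(prompt)
```

The point of a signature is that you write only the three descriptive strings; assembling them into a prompt is left to the framework.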


In [5]:
class EvidenceIdentifier(dspy.Signature):
    """Identify evidence entities from a conversation between an IT Security Specialist and an Employee."""

    question = dspy.InputField(
        desc="a conversation between an IT Security Specialist and Employee."
    )
    answer = dspy.OutputField(
        desc="a list of evidence, including but not limited to emails, IP addresses, URLs, File names, timestamps, etc, in the conversation as a Python dictionary. For example, {evidence type: evidence value, ...}"
    )

### Step 5: Tell an LLM `HOW` to generate an answer in a function

Generates and saves evidence from a conversation using a specified signature.

#### Parameters:
- `signature` (dspy.Signature): The signature defining the input and output structure for evidence identification.
- `conversation` (str): The conversation text to analyze for evidence.
- `output_file` (str): The file path where the identified evidence will be saved as JSON.

#### Returns:
None. The function saves the result to a file and prints a confirmation message.

#### Notes:
- This function uses `dspy.Predict` to process the conversation and identify evidence.
- The result is saved as a formatted JSON file.
- The function prints the result to the console and saves it to the specified file.

In [6]:
def generate_answer(signature, conversation, output_file):
    # Build a DSPy module from the signature, then call it on the conversation
    predictor = dspy.Predict(signature)
    answer = predictor(question=conversation).answer

    # Parse before opening the file so a parse error does not leave an empty file
    result = json.loads(answer)
    print(result)
    with open(output_file, "w") as json_file:
        json.dump(result, json_file, indent=4)
    print(f"The evidence has been saved to the file {output_file}")
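
The output description asks for "a Python dictionary", so the model may reply with single-quoted keys or wrap its answer in a Markdown code fence, and `json.loads` would reject both. The sketch below is one possible tolerant parser. `parse_llm_dict` is a hypothetical helper, not part of DSPy, and it does not cover every malformed reply.

```python
import ast
import json

FENCE = "`" * 3  # Markdown code-fence marker


def parse_llm_dict(text):
    """Parse a dictionary out of an LLM answer.

    Hypothetical helper: tries strict JSON first, then a Python literal.
    """
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line and everything from the closing fence on
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
        cleaned = cleaned.rsplit(FENCE, 1)[0].strip()
    try:
        result = json.loads(cleaned)
    except json.JSONDecodeError:
        # Fall back to a Python literal such as {'IP Address': '192.168.10.45'}
        result = ast.literal_eval(cleaned)
    if not isinstance(result, dict):
        raise ValueError("Expected a dictionary of evidence")
    return result
```

`ast.literal_eval` only evaluates literals, so it is a safer fallback than `eval` for text that comes from a model.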

### Step 6: Execute the function above and generate entities with three inputs
- Which signature: `EvidenceIdentifier`
- What input: conversation
- Where to save results: the name of output file

In [7]:
output_file = "01_output_entity.txt"
generate_answer(
    EvidenceIdentifier, conversation, 
    output_file,
)

{'Email': {'From': 'support@banksecure.com', 'Subject': 'Urgent: Verify Your Account Now', 'Content': 'strange email asking to verify account details urgently'}, 'IP Address': '192.168.10.45', 'Domain': 'banksecure.com', 'URLs': ['http://banksecure-verification.com/login', 'http://banksecure-verification.com/account-details'], 'File': {'Name': 'AccountDetails.exe', 'Creation Time': '10:20 AM', 'MD5 Hash': 'e99a18c428cb38d5f260853678922e03'}, 'Timestamps': {'Visited at 10:15 AM': 'http://banksecure-verification.com/login', 'Visited at 10:17 AM': 'http://banksecure-verification.com/account-details'}}
The evidence has been saved to the file 01_output_entity.txt
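
Extracted values can be given a quick format check before they are used further. The following is a minimal sketch using only the standard library. `check_evidence` is a hypothetical helper, and it checks format only, not whether the evidence is genuine.

```python
import ipaddress
import re
from urllib.parse import urlparse


def check_evidence(ip, md5, url):
    """Return simple format checks for three evidence values.

    Hypothetical helper; it validates format only, not authenticity.
    """
    try:
        address = ipaddress.ip_address(ip)
        ip_valid, ip_private = True, address.is_private
    except ValueError:
        ip_valid, ip_private = False, False
    parsed = urlparse(url)
    return {
        "ip_valid": ip_valid,
        "ip_private": ip_private,
        # An MD5 digest is written as exactly 32 hexadecimal characters
        "md5_valid": re.fullmatch(r"[0-9a-fA-F]{32}", md5) is not None,
        "url_valid": parsed.scheme in ("http", "https") and bool(parsed.netloc),
    }


report = check_evidence(
    ip="192.168.10.45",
    md5="e99a18c428cb38d5f260853678922e03",
    url="http://banksecure-verification.com/login",
)
print(report)
```

For the values in this scenario, `ip_private` is `True` because `192.168.10.45` lies in a private address range, which an investigator would likely want to flag for an address reported as the origin of an external email.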


### Step 7: Inspect the last prompt sent to the LLM

You want to check:
- Prompt Description Section: Description in the signature
- Format Section: `Follow the following format.`
- Result Section: Question (scenario) and Answer (entities) section

In [8]:
turbo.inspect_history(n=1)




Identify evidence entities from a conversation between an IT Security Specialist and an Employee.

---

Follow the following format.

Question: a conversation between an IT Security Specialist and Employee.
Answer: a list of evidence, inlcuding but not limited to emaile, IP address, URL, File name, timestamps, etc, in the conversation as a Python dictionary. For example, {evidence type: evidence value, ...}

---

Question: Alice: Hey Bob, I just got a strange email from support@banksecure.com. It says I need to verify my account details urgently. The subject line was "Urgent: Verify Your Account Now". The email looks suspicious to me. Bob: Hi Alice, that does sound fishy. Can you forward me the email? I’ll take a look at the headers to see where it came from. Alice: Sure, forwarding it now. Bob: Got it. Let’s see... The email came from IP address 192.168.10.45, but the domain banksecure.com is not their official domain. It's actually registered to someone in Russia. Alice: That’s de


## A tutorial to identify `evidence relationships` from a cyber incident report

The cyber incident report records a conversation between an IT Security Specialist and an Employee. The conversation describes an email phishing attack scenario.

### Goal
- In addition to a list of evidence entities, we want to identify a list of `relationships` between entities

### Step 1: Define a signature that identifies a list of `relationships` in the conversation

It is important to note that:
- There is `ONE` input
    - Cyber incident conversation
- There are `TWO` outputs:
    - a list of entities
    - a list of relationships

In [9]:
class EvidenceRelationIdentifier(dspy.Signature):
    """Identify evidence entities and their relationships from a conversation between an IT Security Specialist and an Employee."""

    question = dspy.InputField(
        desc="a conversation between an IT Security Specialist and an Employee."
    )

    answer_relations: str = dspy.OutputField(
        desc="relationships between evidence entities. Output in JSON format: {Relationship name: evidence -> evidence, ...}."
    )
    
    answer_evidence : str = dspy.OutputField(
        desc="a list of evidence types and values, including but not limited to email, IP address, URL, File name, timestamps, etc, identified from the conversation. Output in JSON format: {evidence type: evidence value, ...}"
    )

### Step 2: A function that can receive two outputs

We have to revise the function `generate_answer()` so that it can receive two outputs. The following function `generate_answers()` receives two outputs from an LLM (e.g., OpenAI):
- a list of entities
- a list of relationships

In [10]:
# deal with multiple output fields
def generate_answers(
    signature, conversation, output_file, attributes_to_extract=("answer",)
):
    predictor = dspy.Predict(signature)
    result = predictor(question=conversation)  # Call the module
    print(result)

    # Collect every requested attribute under its own key so the file
    # holds a single valid JSON document rather than concatenated objects
    extracted = {}
    for attr in attributes_to_extract:
        if hasattr(result, attr):
            extracted[attr] = json.loads(getattr(result, attr))
        else:
            print(f"Warning: Attribute '{attr}' not found in the result.")

    # Write the answers to the JSON file
    with open(output_file, "w") as json_file:
        json.dump(extracted, json_file, indent=4)

    print(f"The evidence has been saved to the file {output_file}")

### Step 3: Execute code to generate evidence and relations
- Input 1: Signature: `EvidenceRelationIdentifier`
- Input 2: a conversation
- Output 1: a file that saves entities and relations
- Output 2: a list of entities and relations

In [11]:
output_file = "01_output_entity_relation.txt"
generate_answers(
    EvidenceRelationIdentifier,
    conversation,
    output_file,
    ["answer_evidence", "answer_relations"],
)

Prediction(
    answer_relations='{\n  "Email Header Analysis": "IP Address -> Domain",\n  "URL Analysis": "URL -> Domain",\n  "Browser History Analysis": "URL -> Timestamp",\n  "File Analysis": "File Name -> Timestamp, File Name -> MD5 Hash",\n  "Malware Analysis": "MD5 Hash -> Malware Database"\n}',
    answer_evidence='{\n  "Email Sender": "support@banksecure.com",\n  "Email Subject": "Urgent: Verify Your Account Now",\n  "IP Address": "192.168.10.45",\n  "Domain": "banksecure.com",\n  "Domain Registration": "Russia",\n  "URL": "http://banksecure-verification.com/login",\n  "URL Registration Date": "Two days ago",\n  "File Name": "AccountDetails.exe",\n  "File Creation Timestamp": "10:20 AM",\n  "MD5 Hash": "e99a18c428cb38d5f260853678922e03"\n}'
)
The evidence has been saved to the file 01_output_entity_relation.txt
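
Each relationship value in the output above has the form `source -> target`, sometimes with several pairs separated by commas. As a rough sketch, the pairs can be split into edges and rendered as Graphviz DOT text. Both function names are hypothetical, and the sketch assumes entity names contain no double quotes or commas.

```python
import json


def relations_to_edges(relations):
    """Turn {name: "A -> B, C -> D"} into (source, target, name) tuples."""
    edges = []
    for name, spec in relations.items():
        for pair in spec.split(","):
            if "->" not in pair:
                continue  # skip fragments that are not "source -> target"
            source, target = (part.strip() for part in pair.split("->", 1))
            edges.append((source, target, name))
    return edges


def edges_to_dot(edges):
    """Render the edges as Graphviz DOT text (names are not escaped)."""
    lines = ["digraph evidence {"]
    for source, target, name in edges:
        lines.append(f'    "{source}" -> "{target}" [label="{name}"];')
    lines.append("}")
    return "\n".join(lines)


# Two entries copied from the answer_relations output above
answer_relations = json.loads(
    '{"URL Analysis": "URL -> Domain",'
    ' "File Analysis": "File Name -> Timestamp, File Name -> MD5 Hash"}'
)
edges = relations_to_edges(answer_relations)
dot_text = edges_to_dot(edges)
print(dot_text)
```

With the `graphviz` package from Step 1 installed, passing `dot_text` to `graphviz.Source` should display the graph in the notebook.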


In [None]:
turbo.inspect_history(n=1)