Test set:

Leaky:

1) torch-dockerfile:
    * api-key in line 92
    * GCP TOKEN in line 104

2) data.csv:
    * PII leaks in every row.

3) payment-processor.js:
    * lines 1-4: credit card details
    * lines 7-9: credit details
    * lines 16-17: credit card details

4) DatabaseConnectios.cs:
    * lines 7-10: credit card details
    * line 57: PII

Not leaky:
1) linalg-utils_01.py
2) utils_01.js
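
The notes above can double as ground truth for scoring a prompt. A minimal sketch, assuming the line ranges are transcribed from the list above and that the model's reported line numbers have already been extracted into a list. `EXPECTED` and `line_recall` are hypothetical names, and data.csv is left out because its notes give no line numbers.

```python
# Ground-truth leak locations, transcribed from the test-set notes above.
# (Hypothetical structure; data.csv is omitted because no line numbers are given.)
EXPECTED = {
    "torch-dockerfile": {92, 104},
    "payment-processor.js": set(range(1, 5)) | set(range(7, 10)) | set(range(16, 18)),
    "DatabaseConnectios.cs": set(range(7, 11)) | {57},
    "linalg-utils_01.py": set(),
    "utils_01.js": set(),
}


def line_recall(file_name, reported_lines):
    """Return the fraction of expected leaky lines that the model reported.

    Files with no expected leaks score 1.0 only if nothing was reported.
    """
    expected = EXPECTED[file_name]
    reported = set(reported_lines)
    if not expected:
        return 1.0 if not reported else 0.0
    return len(expected & reported) / len(expected)


# Example: the model found the API key line but not the GCP token line.
print(line_recall("torch-dockerfile", [92]))
```

This only measures whether the known leaky lines were reported. It does not penalise extra reported lines in leaky files, which would need a separate precision measure.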

In [1]:
from openai import OpenAI

client = OpenAI(
    base_url='http://0.0.0.0:8000/v1',
    api_key='ollama',  # required, but unused
)

def get_completion(prompt):
    """Send a single user message to the local server and return the reply text."""
    response = client.chat.completions.create(
        model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",
        temperature=0,
        max_tokens=1000,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

get_completion('Hello.')

APIConnectionError: Connection error.

In [21]:
prompt = """
Your task is to analyse code and text files for leaks of confidential information. This includes:
* Secrets: for example, API keys and tokens.
* PII data: for example, names, emails, dates of birth, addresses.

Please analyse the file in the triple quotes (''' ''') for leaks.

If the file contains no leaks, then respond:

'No leaks found.'

If the file contains leaks then please list the leaks one by one.

File Name: 

{file_name}

Content:

'''
{file_content}
'''
"""

In [28]:
import os

def run_test_set(prompt):
    """Run the prompt against every file in the test-set folder and print each response."""
    folder = 'test-set'
    for file_name in os.listdir(folder):
        with open(os.path.join(folder, file_name), 'r') as f:
            content = f.read()
        formatted_prompt = prompt.format(file_name=file_name, file_content=content)
        print("FILE: ", file_name, "\n\nMODEL RESPONSE:\n")
        print(get_completion(formatted_prompt), "\n")


run_test_set(prompt=prompt)

FILE:  linalg-utils.py 

MODEL RESPONSE:

No leaks found. 

FILE:  data.csv 

MODEL RESPONSE:

I've analyzed the file and found some leaks of confidential information.

The following PII data (Personally Identifiable Information) is leaked:

* Names: John Smith, Emily Johnson, Michael Williams, Sarah Davis, David Thompson
* Emails: john.smith@example.com, emilyj@email.net, mike.williams@company.org, sadavis88@internet.com, dthompson@gmail.com
* Phone numbers: 555-0123, 555-9876, 555-4567, 555-1234, 555-7890
* Addresses: 123 Main St, Anytown, CA 12345; 456 Oak Rd, Someville, NY 67890; 789 Maple Ln, Othercity, TX 24680; 159 Pine Ave, Anothertown, FL 36925; 753 Cedar St, Yetanotherplace, IL 14789

These pieces of information can be used to identify and contact the individuals involved. 

FILE:  torch-dockerfile 

MODEL RESPONSE:

After analyzing the file, I found no leaks of confidential information. The file appears to be a Dockerfile for building a PyTorch image, and it does not contain

In [31]:
# Not bad, but the responses are too verbose and the model reported no leaks in the
# Dockerfile, missing the API key and the GCP token.
# Let's try a new prompt:

prompt = """
Your task is to analyse code and text files for leaks of confidential information. This is your sole task. Please refrain from giving general advice or summarising what the file is doing.

Please analyse the file in the triple quotes (''' ''') for leaks of sensitive information.

Look for the following types of sensitive information:
- API keys, access tokens, and secret credentials
- Private SSH keys
- Database usernames, passwords, and connection strings
- Personally Identifiable Information (PII) such as names, email addresses, phone numbers, or social security numbers
- Financial information like credit card numbers or bank account details
- Confidential business data, including proprietary algorithms or business plans
- Hardcoded passwords in source code 
- Private configuration files with sensitive settings
- Sensitive comments or TODO notes revealing vulnerabilities or private information
- Email addresses and private communication

<OUTPUT FORMAT>:

For each instance of sensitive data found, provide the outputs in JSON format with the following keys:
file_name: [Name of the file containing sensitive data]
line_number: [Line number where the sensitive data is located]
type_of_data: [Type of sensitive data]
description: [Brief description of the sensitive data]

Collect all instances of sensitive data found in one list of dictionaries.

If the file contains no leaks of sensitive information, then simply respond with an empty list:

[]

Do not write anything else. Only return either a list of JSON objects if there are leaks or an empty list if there are no leaks.

'''
<FILE NAME>:

{file_name}

<FILE CONTENT>:

{file_content}
'''
"""
run_test_set(prompt)


FILE:  linalg-utils.py 

MODEL RESPONSE:

[] 

FILE:  data.csv 

MODEL RESPONSE:

[
    {
        "file_name": "data.csv",
        "line_number": 1,
        "type_of_data": "Personally Identifiable Information (PII)",
        "description": "Names and email addresses"
    },
    {
        "file_name": "data.csv",
        "line_number": 2,
        "type_of_data": "Personally Identifiable Information (PII)",
        "description": "Phone numbers and physical addresses"
    },
    {
        "file_name": "data.csv",
        "line_number": 3,
        "type_of_data": "Personally Identifiable Information (PII)",
        "description": "Names, email addresses, and physical addresses"
    },
    {
        "file_name": "data.csv",
        "line_number": 4,
        "type_of_data": "Personally Identifiable Information (PII)",
        "description": "Phone numbers and physical addresses"
    },
    {
        "file_name": "data.csv",
        "line_number": 5,
        "type_of_data": "Personally Iden
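
The prompt asks the model for a `line_number`, but the file content is passed in without any numbering, so the model has to count lines on its own. One possible aid, sketched here as an assumption rather than something tested in this notebook, is to prefix each line with its number before formatting the prompt. `number_lines` is a hypothetical helper and the sample text is made up.

```python
def number_lines(content):
    """Prefix every line of `content` with its 1-based line number."""
    lines = content.splitlines()
    width = len(str(len(lines)))  # pad so the numbers line up
    return "\n".join(f"{i:>{width}}: {line}" for i, line in enumerate(lines, start=1))


# Made-up sample content, only to show the output shape.
sample = "FROM python:3.11\nENV API_KEY=xxxx\nRUN pip install torch"
print(number_lines(sample))
```

In `run_test_set`, this could be applied as `file_content=number_lines(content)`.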

In [45]:
# This is looking good, but it's still missing the GCP token. Let's try once more:

prompt = """
Your task is to analyse code and text files for leaks of confidential information. This is your sole task. Please refrain from giving general advice or summarising what the file is doing.

Please analyse the file in the triple quotes (''' ''') for leaks of sensitive information.

Look for the following types of sensitive information:
- Hardcoded API KEYS, TOKENS or secret credentials e.g. aws_access_token: AKIAXXXXXXXXXXXXXXX, azure_subscription_key: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
- Private SSH keys
- Database usernames, passwords, and connection strings
- Personally Identifiable Information (PII) such as names, email addresses, phone numbers, or social security numbers
- Financial information like credit card numbers or bank account details
- Hardcoded passwords in source code 
- Private configuration files with sensitive settings
- Sensitive comments or TODO notes revealing vulnerabilities or private information
- Email addresses and private communication

The FILE CONTENT may contain multiple instances of sensitive information, separated by many lines. It is essential that you find all instances of these leaks.

Think through this step by step:

If the file contains leaks of sensitive information:

1) Count the number of leaks found.

2) For each instance of sensitive data found, provide the outputs in JSON format with the following keys:
file_name: [Name of the file containing sensitive data]
line_number: [Line number where the sensitive data is located]
type_of_data: [Type of sensitive data]
description: [Brief description of the sensitive data]

3) Collect all instances of sensitive data found in one larger JSON object with the following metadata:
file name: [the name of the file: This is under the <FILE NAME>: row]
sensitive data count: [the number of leaks found in the file.]
sensitive data: a list of the JSON objects containing the sensitive data created in step 2 above.

If the file contains no leaks of sensitive information, then simply respond with an empty JSON object.

Do not write anything else.


'''
FILE NAME:

{file_name}

CONTENT:

{file_content}
'''
"""
run_test_set(prompt)



FILE:  linalg-utils.py 

MODEL RESPONSE:

```
{
  "file_name": "linalg-utils.py",
  "sensitive_data_count": 0,
  "sensitive_data": []
}
``` 

FILE:  data.csv 

MODEL RESPONSE:

{
    "file name": "data.csv",
    "sensitive data count": 6,
    "sensitive data": [
        {
            "file_name": "data.csv",
            "line_number": 1,
            "type_of_data": "Personally Identifiable Information (PII)",
            "description": "Name: John Smith, Date of Birth: 01/15/1985, Email: john.smith@example.com, Phone Number: 555-0123, Address: 123 Main St, Anytown, CA 12345"
        },
        {
            "file_name": "data.csv",
            "line_number": 2,
            "type_of_data": "Personally Identifiable Information (PII)",
            "description": "Name: Emily Johnson, Date of Birth: 09/22/1992, Email: emilyj@email.net, Phone Number: 555-9876, Address: 456 Oak Rd, Someville, NY 67890"
        },
        {
            "file_name": "data.csv",
            "line_number": 3,
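
The responses above are not uniform: one is wrapped in a code fence and the other is bare JSON. A minimal sketch of a tolerant parser, assuming the reply contains at most one JSON value. `parse_model_json` is a hypothetical helper, and it returns None when the text is not valid JSON (for example, a reply that was cut off).

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence marker


def parse_model_json(response_text):
    """Parse JSON from a model reply, tolerating a surrounding code fence.

    Returns the parsed object, or None if the reply is not valid JSON.
    """
    text = response_text.strip()
    # Drop an opening fence (optionally followed by a language tag) and a closing fence.
    text = re.sub(r"^" + FENCE + r"[a-zA-Z]*\s*", "", text)
    text = re.sub(r"\s*" + FENCE + r"$", "", text)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


fenced = FENCE + '\n{"file_name": "linalg-utils.py", "sensitive_data_count": 0, "sensitive_data": []}\n' + FENCE
print(parse_model_json(fenced))
print(parse_model_json('{"file name": "data.csv", "sensitive data": ['))  # cut-off reply
```

Returning None for unparseable replies keeps "the model found nothing" (an empty list or object) distinct from "the reply could not be read".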
  

In [40]:
prompt.format(file_name='lol')

KeyError: '\n        "file_name"'
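
The `KeyError` above is what `str.format` raises when the template contains literal braces, such as an inline JSON example: everything between `{` and `}` is read as a field name. Presumably the prompt at that point included such an example. A minimal sketch of two common workarounds, doubling the literal braces or using `string.Template`. The template text here is illustrative and not the notebook's prompt.

```python
from string import Template

# Illustrative template with a literal JSON example; not the notebook's actual prompt.
broken = 'Example output: {"file_name": "a.py"}\nFile: {file_name}'
try:
    broken.format(file_name="lol")
except KeyError as err:
    print("KeyError:", err)  # the JSON key is parsed as a format field

# Workaround 1: double the literal braces so str.format leaves them alone.
escaped = 'Example output: {{"file_name": "a.py"}}\nFile: {file_name}'
print(escaped.format(file_name="lol"))

# Workaround 2: string.Template uses $name placeholders, so braces need no escaping.
tmpl = Template('Example output: {"file_name": "a.py"}\nFile: $file_name')
print(tmpl.substitute(file_name="lol"))
```

Doubling braces keeps the existing `prompt.format(...)` call in `run_test_set` unchanged, whereas `string.Template` would require switching that call to `substitute`.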