Test set:

Leaky:

1) torch-dockerfile:
    * API key in line 92
    * GCP token in line 104

2) data.csv:
    * PII leaks in every row.

3) payment-processor.js:
    * lines 1-4: credit card details
    * lines 7-9: credit details
    * lines 16-17: credit card details

4) DataBaseConnection.cs:
    * lines 7-10: credit card details
    * line 57: PII

Not leaky:
1) linalg-utils_01.py
2) utils_01.js
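
The notes above can also be written down as data, which makes it easier to compare model replies against them later. This is a minimal sketch rather than part of the original notebook: `EXPECTED_LEAKS` and `is_leaky` are hypothetical names, the file names are spelled as they appear in the model output, and the line ranges are copied from the notes.

```python
# Expected findings per file, transcribed from the test-set notes.
# "lines" is an inclusive (start, end) range; None means every row is affected.
EXPECTED_LEAKS = {
    "torch-dockerfile": [
        {"lines": (92, 92), "type": "API key"},
        {"lines": (104, 104), "type": "GCP token"},
    ],
    "data.csv": [
        {"lines": None, "type": "PII"},
    ],
    "payment-processor.js": [
        {"lines": (1, 4), "type": "credit card details"},
        {"lines": (7, 9), "type": "credit details"},
        {"lines": (16, 17), "type": "credit card details"},
    ],
    "DataBaseConnection.cs": [
        {"lines": (7, 10), "type": "credit card details"},
        {"lines": (57, 57), "type": "PII"},
    ],
    "linalg-utils_01.py": [],
    "utils_01.js": [],
}


def is_leaky(file_name):
    """Return True if the notes list at least one leak for this file."""
    return bool(EXPECTED_LEAKS[file_name])


leaky = sorted(name for name in EXPECTED_LEAKS if is_leaky(name))
print(leaky)
```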

In [1]:
from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1',  # local Ollama server (OpenAI-compatible endpoint)
    api_key='ollama',  # required, but unused
)

def get_completion(prompt):
    """Send a single user message to the local model and return the reply text."""
    response = client.chat.completions.create(
        model="mistral",
        temperature=0,
        max_tokens=1000,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

get_completion('Hello.')

" Hello! How can I help you today? I'm here to answer any questions you might have or engage in a friendly conversation on a topic of your choice. Let me know what's on your mind and we'll go from there.\n\nIf you're looking for some lighthearted banter, how about we talk about our favorite hobbies or share some funny stories? Or if you have a more serious question, I'll do my best to provide accurate and helpful information. Let me know what you prefer!\n\nAlso, if there's something specific you'd like to discuss related to the prompt of this conversation, feel free to bring that up as well. I'm here to help in any way I can.\n\nSo, what do you say? Shall we get started?"

In [2]:
prompt = """
Your task is to analyse code and text files for leaks of confidential information. This includes:
* Secrets: For example: api keys and tokens.
* PII data: For example: names, emails, date of birth, addresses.

Please analyse the file in the triple quotes (''' ''') for leaks.

If the file contains no leaks, then respond:

'No leaks found.'

If the file contains leaks then please list the leaks one by one.

File Name: 

{file_name}

Content:

'''
{file_content}
'''
"""

In [3]:
import os

def run_test_set(prompt):
    """Run the prompt against every file in the test-set folder and print each reply."""
    folder = 'test-set'
    for file_name in os.listdir(folder):
        with open(os.path.join(folder, file_name), 'r') as f:
            content = f.read()
        formatted_prompt = prompt.format(file_name=file_name, file_content=content)
        print("FILE: ", file_name, "\n\nMODEL RESPONSE:\n")
        print(get_completion(formatted_prompt), "\n")


run_test_set(prompt=prompt)

FILE:  utils_01.js 

MODEL RESPONSE:

 This code snippet contains several helper functions that can be used for various purposes in JavaScript development. Here's a brief description of each function:

1. `debounce` and `throttle`: These are higher-order functions used to control the rate at which a function is executed. The difference between them is that `debounce` delays the execution of a function until a certain amount of time has passed since the last call, while `throttle` limits the number of times a function can be called within a given time frame.
2. `formatDate`: This function formats a JavaScript Date object into a specified string format using placeholders like 'mm' for month, 'dd' for day, etc.
3. `getTimestamp`: This function returns the current timestamp in milliseconds since January 1, 1970.
4. `generateUUID`: This function generates a unique Universal Unique Identifier (UUID) as a string.
5. `isValidEmail`, `isValidUrl`, and `isValidPhoneNumber`: These functions valid
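
The first prompt asks for the literal reply 'No leaks found.' on clean files, so a run can be scored roughly at file level by checking for that phrase. The sketch below is an illustration, not the notebook's own evaluation: `flags_leak` and `score` are hypothetical helpers, and the example replies are invented stand-ins for real model output.

```python
NO_LEAK_REPLY = "No leaks found."


def flags_leak(response):
    """Treat any reply that lacks the agreed no-leak phrase as a leak report."""
    return NO_LEAK_REPLY.lower() not in response.lower()


def score(responses, leaky_files):
    """Count file-level true/false positives and negatives.

    responses: dict mapping file name -> model reply text
    leaky_files: set of file names that are known to contain leaks
    """
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for file_name, response in responses.items():
        predicted = flags_leak(response)
        actual = file_name in leaky_files
        if predicted and actual:
            counts["tp"] += 1
        elif predicted and not actual:
            counts["fp"] += 1
        elif not predicted and actual:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


# Invented example replies, for illustration only.
example_responses = {
    "utils_01.js": "This code snippet contains several helper functions.",
    "data.csv": "1. Names and email addresses appear in every row.",
    "linalg-utils_01.py": "No leaks found.",
}
print(score(example_responses, leaky_files={"data.csv"}))
```

Under this rule a reply that summarises the file instead of answering, like the visible part of the `utils_01.js` output above, counts as a false positive.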

In [4]:
# The last prompt produced unnecessary output (it listed the functions in utils_01.js),
# missed the tokens in torch-dockerfile, and missed all the PII in data.csv.
# Let's try a new prompt:

prompt = """
Your task is to analyse code and text files for leaks of confidential information. This is your sole task. Please refrain from giving general advice or summarising what the file is doing.

Please analyse the file in the triple backticks (''' ''') for leaks of sensitive information

Look for the following types of sensitive information:
- API keys, access tokens, and secret credentials
- Private SSH keys
- Database usernames, passwords, and connection strings
- Personally Identifiable Information (PII) such as names, email addresses, phone numbers, or social security numbers
- Financial information like credit card numbers or bank account details
- Confidential business data, including proprietary algorithms or business plans
- Hardcoded passwords in source code 
- Private configuration files with sensitive settings
- Sensitive comments or TODO notes revealing vulnerabilities or private information
- Email addresses and private communication

Only mention explicit sensitive information leaks where the data is explicitly visible as strings. 

Do not identify function calls as data leaks unless there are explicitly exposed data leaks within them.

<OUTPUT FORMAT>:

For each instance of sensitive data found, provide the outputs in JSON format with the following keys:
file_name: [Name of the file containing sensitive data]
line_number: [Line number where the sensitive data is located]
type_of_data: [Type of sensitive data]
description: [Brief description of the sensitive data]

Collect all instances of sensitive data found in one list of dictionaries.

If the file contains no leaks of sensitive information, then simply respond with an empty list:

[]

Do not write anything else. Only return either a list of JSON objects if there are leaks or an empty list if there are no leaks.

```
<FILE NAME>:

{file_name}

<FILE CONTENT>:

{file_content}
'''
"""
run_test_set(prompt)


FILE:  utils_01.js 

MODEL RESPONSE:

 This is a JavaScript code snippet that contains several helper functions. These functions can be used to perform various tasks such as merging arrays, truncating strings, converting strings to different cases, formatting dates, generating unique identifiers, checking if values are valid emails or phone numbers, and more. The functions are designed to make common programming tasks easier and more efficient. 

FILE:  DataBaseConnection.cs 

MODEL RESPONSE:

 [
  {
    "file\_name": "DataBaseConnection.cs",
    "line\_number": 8,
    "type\_of\_data": "Database connection string",
    "description": "Explicitly exposed database connection string"
  },
  {
    "file\_name": "DataBaseConnection.cs",
    "line\_number": 12,
    "type\_of\_data": "Credit card number",
    "description": "Explicitly exposed credit card number"
  },
  {
    "file\_name": "DataBaseConnection.cs",
    "line\_number": 13,
    "type\_of\_data": "CVV",
    "description": "Expli

In [5]:
# This is looking good, but it's still missing the GCP token. Let's try once more:

prompt = """
Your task is to analyse code and text files for leaks of confidential information. This is your sole task. Please refrain from giving general advice or summarising what the file is doing.

Please analyse the file in the triple backticks (''' ''') for leaks of sensitive information

Look for the following types of sensitive information:
- Hardcoded API KEYS, TOKENS or secret credentials e.g. aws_access_token: AKIAXXXXXXXXXXXXXXX, azure_subscription_key: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
- Private SSH keys
- Database usernames, passwords, and connection strings
- Personally Identifiable Information (PII) such as names, email addresses, phone numbers, or social security numbers
- Financial information like credit card numbers or bank account details
- Hardcoded passwords in source code 
- Private configuration files with sensitive settings
- Sensitive comments or TODO notes revealing vulnerabilities or private information
- Email addresses and private communication

The FILE CONTENT may contain multiple instances of sensitive information, separated by many lines. It is essential that you find all instances of these leaks.

Think through this step by step:

If the file contains leaks of sensitive information:

1) Count the number of leaks found.

2) For each instance of sensitive data found, provide the outputs in JSON format with the following keys:
file_name: [Name of the file containing sensitive data]
line_number: [Line number where the sensitive data is located]
type_of_data: [Type of sensitive data]
description: [Brief description of the sensitive data]

3) Collect all instances of sensitive data found in one larger JSON object with the following metadata:
file name: [the name of the file: This is under the <FILE NAME>: row]
sensitive data count: [the number of leaks found in the file.]
sensitive data: a list of the JSON objects containing the sensitive data created in step 2 above.

If the file contains no leaks of sensitive information, then simply respond with an empty JSON object.

Do not write anything else.


```
<FILE NAME>:

{file_name}

<CONTENT>:

{file_content}
'''
"""
run_test_set(prompt)
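
Because the prompt asks for JSON only, replies can be checked mechanically instead of by eye. The earlier `DataBaseConnection.cs` reply escapes underscores as `\_`, which is not valid JSON, so a parser has to undo that first. The sketch below assumes every backslash-underscore pair is such a stray escape; `parse_leak_report` is a hypothetical helper and the sample reply is a shortened stand-in for real output.

```python
import json


def parse_leak_report(raw_reply):
    """Parse a model reply as JSON, returning None when it is not valid JSON.

    Assumption: every backslash-underscore pair is a stray Markdown-style
    escape, so it is replaced with a plain underscore before parsing.
    """
    cleaned = raw_reply.strip().replace("\\_", "_")
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        return None


# Shortened stand-in for a reply with escaped underscores.
reply = ' [\n  {"file\\_name": "DataBaseConnection.cs", "line\\_number": 8}\n]'
print(parse_leak_report(reply))
# prints [{'file_name': 'DataBaseConnection.cs', 'line_number': 8}]

print(parse_leak_report("Here's the code you provided, formatted for easier reading:"))
# prints None
```

A reply that is prose or code rather than JSON, like the `utils_01.js` reply that follows, yields `None`, which is itself a useful signal that the model ignored the output format.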



FILE:  utils_01.js 

MODEL RESPONSE:

 Here's the code you provided, formatted for easier reading:

```javascript
// Helper functions

function removeDuplicates(array) {
  return [...new Set(array)];
}

function countOccurrences(array, value) {
  return array.filter((v) => v === value).length;
}

function mergeArrays(array1, array2) {
  const mergedArray = [...array1, ...array2];
  return removeDuplicates(mergedArray);
}

function flattenArray(array) {
  return array.reduce((acc, val) => acc.concat(Array.isArray(val) ? flattenArray(val) : val), []);
}

function toCamelCase(str) {
  return str
    .replace(/(?:^\w|[A-Z]|\b\w)/g, (word, index) => {
      return index === 0 ? word.toLowerCase() : word.toUpperCase();
    })
    .replace(/\s+/g, '');
}

function toKebabCase(str) {
  return str
    .replace(/([a-z0-9])([A-Z])/g, '$1-$2')
    .toLowerCase();
}

function truncateString(str, maxLength) {
  if (str.length <= maxLength) {
    return str;
  }
  return `${str.slice(0, maxLength)}..