# Using OpenAI to detect data leaks in codebases

In [1]:
from dotenv import load_dotenv
from openai import OpenAI
load_dotenv()

True
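
The `True` above indicates that `load_dotenv()` found a `.env` file and loaded values from it. By default the `OpenAI()` client reads its key from the `OPENAI_API_KEY` environment variable, so it can be worth confirming the variable is present without printing its value. A minimal sketch; the helper name `api_key_is_set` is hypothetical and not part of either library:

```python
import os

def api_key_is_set(env=None, name="OPENAI_API_KEY"):
    """Return True if the named variable is present and non-empty."""
    # Fall back to the real process environment when no mapping is given.
    env = os.environ if env is None else env
    return bool(env.get(name, "").strip())

# Reports only whether the key exists, never the key itself.
print(api_key_is_set())
```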

In [2]:

client = OpenAI()

completion = client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "system", "content": "You are a poetic assistant, skilled in explaining complex programming concepts with creative flair."},
    {"role": "user", "content": "Compose a poem that explains the concept of recursion in programming."}
  ]
)

print(completion.choices[0].message)

ChatCompletionMessage(content="In the realm of code and bytes so dear,\nLies a magical concept, both subtle and clear.\nRecursion, a loop of a different kind,\nA function that calls itself, intertwined.\n\nLike a mirror reflecting its own reflection,\nRepeating iterations with such perfection.\nA dance of algorithms, a graceful spin,\nUnlocking solutions, diving deep within.\n\nPassing down through layers, like roots of a tree,\nUnraveling mysteries for all to see.\nA cycle of self-reference, ever true,\nIn the world of programming, a powerful brew.\n\nSo trust in recursion, its pattern and flow,\nFor in its elegance, solutions grow.\nA recursive chant for those who dare,\nTo grasp the beauty hidden in the code's lair.", role='assistant', function_call=None, tool_calls=None)


In [11]:
# here's an example

system_prompt = """
You are a security analyst and your role is to scan through the files in GitHub repos to identify leaks of confidential information. Examples of these are below:

Names and Emails:
"Identify instances of names and email addresses that may be related to customers or employees.
Look for patterns like 'john.doe@example.com' or 'Jane Doe <jane.doe@example.com>'.
Consider names and emails mentioned in code comments, variable names, or string literals."

Dates of Birth:
"Detect any code snippets or comments that contain dates of birth in formats such as DD/MM/YYYY, MM/DD/YYYY or YYYY-MM-DD.
Look for patterns like '1990-02-28' or '02/28/1990'.
Consider dates mentioned in code comments, variable names, or string literals."

Residential Addresses:
"Find any references to residential addresses, including street names, cities, states, or zip codes.
Look for patterns like '123 Main St, Anytown, CA 12345' or 'Anytown, CA 12345'.
Consider addresses mentioned in code comments, variable names, or string literals."

Passwords:
"Identify hardcoded passwords, API keys, or authentication credentials that may be exposed.
Look for patterns like 'password: mysecretpassword' or 'api_key: 1234567890abcdef'.
Consider credentials mentioned in code comments, variable names, or string literals."

Company Accounts Data:
"Identify data relating to company accounts or sensitive internal information."

"""

example1 = """
def fn(x):
  # squares a number
  # used by Donaldson's actuarial team to calculate the SCR
  # the SCR was 4.2bn in 2021
  # use the api_key 1234ajfalklk to access
  # the function is written in python
  return x**2
"""
output1 = """
Here's an example output:
The code leaks information about the company:
1) It reveals that "Donaldson's" actuarial team uses this function.
2) It reveals information about the company accounts, i.e. that the SCR was 4.2bn in 2021.

"""

completion = client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": 'Please could you look for security leaks in the code base below?'},
    {"role": "assistant", "content": 'Of course.'},
    {"role": "user", "content": example1},
  ]
)

print(completion.choices[0].message.content)


In the code snippet provided, there are a few potential security concerns that can be identified:

1. Company Account Data: 
   - The mention of "SCR" (Solvency Capital Requirement) and its specific value of "4.2bn in 2021" could be considered sensitive company financial information related to insurance calculations. Revealing such specific financial data publicly could pose a risk.
   - Reference to an "api_key 1234ajfalklk" raises concerns about potential exposure of an API key used to access certain resources. Hardcoding API keys in code can lead to security vulnerabilities if this information is publicly accessible.

It's important to avoid sharing specific financial details and API keys in code that is publicly accessible, as this can potentially lead to unauthorized access or misuse of sensitive information. It's recommended to review and remove such specific details from the code before sharing it publicly.
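
Note that `output1` is defined in the cell above but never passed to the model, so that run is effectively zero-shot. One way to use it as a worked example is to place `example1` and `output1` as an earlier user/assistant turn, followed by the code that actually needs analysing. A minimal sketch under that assumption; `build_few_shot_messages` is a hypothetical helper, and the strings in the demo call are placeholders:

```python
def build_few_shot_messages(system_prompt, example_code, example_answer, target_code):
    """Build a chat history with one worked example before the real input."""
    return [
        {"role": "system", "content": system_prompt},
        # Worked example: a snippet with known leaks and the answer we want.
        {"role": "user", "content": example_code},
        {"role": "assistant", "content": example_answer},
        # The code we actually want the model to analyse.
        {"role": "user", "content": target_code},
    ]

messages = build_few_shot_messages(
    "You are a security analyst.",
    "# use the api_key 1234ajfalklk to access",
    "1) The comment exposes an API key.",
    "print('hello')",
)
for m in messages:
    print(m["role"])
```

The resulting list can be passed as `messages=` to `client.chat.completions.create`.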


In [13]:
with open('repo1.txt', 'r') as f:
  example2 = f.read()
  
completion = client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": """
    Please could you look for security leaks in the code base below? I have separated files using two rows of #s, with the file name between them.
    For instance:
    the start of the .gitignore file is indicated as follows:
    
    ################################################################################
    # repos/repo1/.gitignore
    ################################################################################
    .line1
    .line2
    etc.

    Please could you reference the file when analysing for data leaks.

    """},
    {"role": "assistant", "content": 'Of course.'},
    {"role": "user", "content": example2},
  ]
)

print(completion.choices[0].message.content)

Based on the analysis of the code base in the provided files, here are the potential data leaks identified:

1. **No significant data leaks**: The code appears to be primarily focused on a simple Streamlit app for downloading images from a search query. The code is well-structured and does not contain any obvious leaks of confidential information such as names, emails, dates of birth, residential addresses, or passwords.

2. **Sensitive Information**: One potential consideration is the handling of image URLs and potentially downloaded images. It's important to ensure that the image URLs do not inadvertently expose any sensitive information or violate any usage rights. Additionally, best practices should be followed to securely handle the downloading and storage of images to prevent any data breaches.

3. **Security best practices**: While the code reviewed looks clean, it's always important to follow security best practices such as input validation, sanitization, and output encoding to
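
The cell above reads `repo1.txt`, a single text file in which each source file is preceded by a banner giving its path. The notebook does not show how that file was produced. A minimal sketch of one way to build it, assuming 80-character rows of `#` and skipping the `.git` directory; `flatten_repo` is a hypothetical helper, and the temporary directory stands in for a real repository:

```python
import os
import tempfile

BANNER = "#" * 80  # assumed banner width

def flatten_repo(root):
    """Concatenate every file under root, each preceded by a banner with its path."""
    parts = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune .git in place so os.walk does not descend into it.
        dirnames[:] = sorted(d for d in dirnames if d != ".git")
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # errors="replace" keeps the walk going if a file is not valid UTF-8.
            with open(path, "r", encoding="utf-8", errors="replace") as f:
                body = f.read()
            parts.append(f"{BANNER}\n# {path}\n{BANNER}\n{body}")
    return "\n".join(parts)

# Demo on a throwaway directory containing one file.
with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "app.py"), "w", encoding="utf-8") as f:
        f.write("print('hi')\n")
    flat = flatten_repo(root)
print(flat)
```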

In [18]:
simple_prompt = """
You are a security analyst and your role is to scan through the files in GitHub repos to identify leaks of confidential information.

Please look for emails, names of people, addresses, passwords, API keys and sensitive business data like account details.
Please flag anything that looks like it could be a password or API key, as there should be no examples of this anywhere in the code or comments.
Please also identify any names or other personally identifiable information anywhere in the code. The only case where this is acceptable is the names of the authors of the repository.
"""

with open('repo1_with_leaks.txt', 'r') as f:
  example2 = f.read()
  
completion = client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "system", "content": simple_prompt},
    {"role": "user", "content": """
    Please could you look for security leaks in the code base below? I have separated files using two rows of #s, with the file name between them.
    For instance, the .gitignore file is indicated as follows:
    
    ################################################################################
    # repos/repo1/.gitignore
    ################################################################################

    Please could you reference the file when analysing for data leaks.

    """},
    {"role": "assistant", "content": 'Of course.'},
    {"role": "user", "content": example2},
  ]
)

print(completion.choices[0].message.content)

I have identified some potential security leaks in the code base:

1. In `app.py`, there is a variable `STREAMLIT_API_KEY = 'ABCC3241Q9_SK'` which appears to be an API key. API keys should not be hardcoded in the code, especially in publicly accessible repositories.

2. In the file `data.txt`, there are entries that contain personal information such as names and email addresses. This type of personally identifiable information (PII) should not be exposed in code or shared publicly.

It is important to review and address these security concerns to prevent any potential data breaches or unauthorized access.
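
The last two cells repeat the same request with a different system prompt and input file. A minimal sketch of wrapping that call in one function so prompts and inputs can be swapped easily; `scan_for_leaks` and the fake client are hypothetical, and the fake exists only so the example runs without an API key or network access:

```python
from types import SimpleNamespace

def scan_for_leaks(client, system_prompt, instructions, repo_text, model="gpt-3.5-turbo"):
    """Send the flattened repository text to the model and return its reply text."""
    completion = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": instructions},
            {"role": "assistant", "content": "Of course."},
            {"role": "user", "content": repo_text},
        ],
    )
    return completion.choices[0].message.content

class _FakeCompletions:
    """Stand-in that mimics only the response attributes used above."""
    def create(self, model, messages):
        reply = f"received {len(messages)} messages for {model}"
        message = SimpleNamespace(content=reply)
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])

fake_client = SimpleNamespace(chat=SimpleNamespace(completions=_FakeCompletions()))
result = scan_for_leaks(
    fake_client,
    "You are a security analyst.",
    "Please look for security leaks in the code below.",
    "print('hello')",
)
print(result)
```

With a real key configured, the `client = OpenAI()` object from the earlier cells can be passed in place of `fake_client`.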
