Apply current prompt to test set

Files in 01 contain leaks:

1) torch-dockerfile:
    * line 92: API key
    * line 104: GCP token

2) data.csv:
    * PII leaks in every row.

3) payment-processor.js:
    * lines 1-4: credit card details
    * lines 7-9: credit card details
    * lines 16-17: credit card details

4) DataBaseConnection.cs:
    * lines 7-10: credit card details
    * line 57: PII

Files in 02 do not contain leaks.
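
To compare model output against these notes automatically, the labels can be written down as data. The following is only a sketch: the names `EXPECTED_LEAKS`, `expected_lines` and `score` are made up for illustration, the line ranges are transcribed from the list above, and the example call uses the line numbers the model reported for DataBaseConnection.cs (6, 11, 12).

```python
# Ground truth transcribed from the notes above; ranges are inclusive.
# "all" marks a file in which every row counts as a leak.
EXPECTED_LEAKS = {
    "torch-dockerfile": [(92, 92), (104, 104)],
    "data.csv": "all",
    "payment-processor.js": [(1, 4), (7, 9), (16, 17)],
    "DataBaseConnection.cs": [(7, 10), (57, 57)],
}


def expected_lines(file_name):
    """Return the set of labelled lines, or None if every line is labelled."""
    spans = EXPECTED_LEAKS.get(file_name, [])  # unknown files (folder 02) have no leaks
    if spans == "all":
        return None
    return {line for start, end in spans for line in range(start, end + 1)}


def score(file_name, reported_lines):
    """Split the reported line numbers into hits, false alarms and misses."""
    expected = expected_lines(file_name)
    reported = set(reported_lines)
    if expected is None:
        return {"hits": sorted(reported), "false_alarms": [], "missed": []}
    return {
        "hits": sorted(reported & expected),
        "false_alarms": sorted(reported - expected),
        "missed": sorted(expected - reported),
    }


print(score("DataBaseConnection.cs", [6, 11, 12]))
```

Exact line matching is strict; a tolerance of a line or two may be more appropriate if the labels and the model count lines slightly differently.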

In [10]:
from langchain_community.llms import HuggingFaceEndpoint
from langchain_core.prompts import ChatPromptTemplate

from dotenv import load_dotenv, find_dotenv

import os

# load api keys
load_dotenv(find_dotenv())

model = "mistralai/Mistral-7B-Instruct-v0.2"

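def get_prompt_template(prompt='{query}'):
    # from_messages expects a list of messages; a bare ('human', prompt) tuple
    # would be read as two separate human messages: "human" and the prompt
    prompt_template = ChatPromptTemplate.from_messages(
        [('human', prompt)],
    )
    return prompt_template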

llm = HuggingFaceEndpoint(
    repo_id=model,
    temperature=0.01,
    max_new_tokens=1024,
    model_kwargs=dict(max_length=1024, token=os.environ.get('HUGGINGFACEHUB_API_TOKEN')),
)
chain = get_prompt_template() | llm


print(chain.invoke({"query": "What is your favourite condiment?"}))

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /home/sos00/.cache/huggingface/token
Login successful

Assistant: I don't have a favorite condiment as I don't consume food or condiments. I'm here to help answer questions and provide information. However, I can tell you that many people enjoy condiments like ketchup, mustard, soy sauce, hot sauce, or mayonnaise. The favorite can vary greatly depending on personal preference.


In [33]:
with open('../prompts/prompt-alternative.txt', 'r') as f:
    prompt = f.read()
print(f"{len(prompt)=}")

chain = get_prompt_template(prompt) | llm

len(prompt)=3211
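
The chain is invoked with `file_name` and `file_content` keys, so the prompt file presumably contains those two placeholders, and any literal JSON braces in it would need to be doubled. A quick sanity check could look like the sketch below. It assumes plain `str.format`-style placeholders; `template_fields` and `sample_prompt` are hypothetical, and the sample text is a stand-in, not the content of prompt-alternative.txt.

```python
from string import Formatter


def template_fields(template):
    """Return the placeholder names used in a str.format-style template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


# Hypothetical stand-in for the prompt file; the real text is not shown here.
sample_prompt = 'File {file_name}:\n{file_content}\nAnswer as {{"sensitive data": []}}'

required = {"file_name", "file_content"}
missing = required - template_fields(sample_prompt)
print(missing or "all placeholders present")
```

Running the same check on the real `prompt` string before building the chain would catch a missing or misspelled placeholder early.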


In [34]:
from langchain_community.document_loaders import TextLoader, NotebookLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

CHUNK_SIZE = 2000

splitter = RecursiveCharacterTextSplitter(chunk_size = CHUNK_SIZE, chunk_overlap=1)

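def is_notebook(fp):
    # are you a notebook? (the original comparison had no return and
    # compared the extension without its dot against '.ipynb')
    return fp.endswith('.ipynb')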

def load_and_split_file(fp):
    # takes a file path, loads it and splits it
    if is_notebook(fp):
        doc = NotebookLoader(fp).load()
    else:
        doc = TextLoader(fp).load()
    docs = splitter.split_documents(doc)
    return docs
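
One consequence of splitting: the model only sees one chunk at a time, so any line number it reports is presumably counted from the start of that chunk rather than from the start of the file. The sketch below shows one way to map such a number back, given the character offset at which the chunk starts. As far as I know, the splitter can record that offset in `doc.metadata["start_index"]` when constructed with `add_start_index=True`; treat that as an assumption to verify. The helper names are hypothetical.

```python
def chunk_start_line(full_text, start_index):
    """1-based line number at which the chunk beginning at start_index starts."""
    return full_text.count("\n", 0, start_index) + 1


def to_file_line(full_text, start_index, line_in_chunk):
    """Convert a 1-based line number inside a chunk to a line number in the file."""
    return chunk_start_line(full_text, start_index) + line_in_chunk - 1


text = "a\nb\nc\nd\n"
start = text.index("c")  # pretend the second chunk starts at the line "c"
print(chunk_start_line(text, start), to_file_line(text, start, 2))
```

For chunk 0 the offset is zero, so reported line numbers are unchanged; the correction only matters for files longer than `CHUNK_SIZE` characters.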

In [42]:
import json

from langchain_core.output_parsers import JsonOutputParser

folder = 'test-set'
json_parser = JsonOutputParser()

def run_test_set():
    # walk the test set and send every chunk of every file through the chain
    for root, dirs, files in os.walk(folder):
        for file_name in files:
            full_path = os.path.join(root, file_name)
            docs = load_and_split_file(fp=full_path)
            for chunk_idx, doc in enumerate(docs):
                response = chain.invoke({"file_name": file_name, "file_content": doc.page_content})
                # parse the model's JSON reply and pretty-print it
                response = json.dumps(json_parser.invoke(response), indent=2)

                print("FILE:", full_path, "Chunk:", chunk_idx, "\nMODEL RESPONSE:")
                print(response, "\n")

run_test_set()

FILE: test-set/01/DataBaseConnection.cs Chunk: 0 
MODEL RESPONSE:
{
  "file name": "DataBaseConnection.cs",
  "file description": "A C# script for connecting to a database and processing payments.",
  "sensitive data count": 3,
  "sensitive data": [
    {
      "line_number": 6,
      "type_of_data": "Database connection string",
      "description": "Connection string for the database.",
      "sensitive_data": "Server=myServerAddress;Database=myDataBase;User Id=myUsername;Password=myPassw0rd;"
    },
    {
      "line_number": 11,
      "type_of_data": "Credit Card Number",
      "description": "A credit card number.",
      "sensitive_data": "5555555555554444"
    },
    {
      "line_number": 12,
      "type_of_data": "CVV",
      "description": "A CVV number.",
      "sensitive_data": "123"
    }
  ]
} 

FILE: test-set/01/data.csv Chunk: 0 
MODEL RESPONSE:
{
  "file name": "data.csv",
  "file description": "A CSV file containing user data.",
  "sensitive data count": 5,
  "sensiti