<img src="images/devin_ai_image.jpg" alt="Devin" width="300" height="300" style="float:left; margin-right: 40px; margin-bottom: 20px;" />

# Devin AI Notebook
* #### This notebook provides methods in Python for retrieval augmented generation to:
1. Extract data from a csv to create embeddings to build a knowledge base.
2. Prompt AI via a custom interface using only the knowledge base.

###### Note: Built for Devin to do Devin things

<div style="clear: both;"></div>

### Prerequisites
1. Make sure Python is accessible on your path
2. An OpenAI API key
3. Built/tested with Python 3.9
4. Assumes the user's IDE is Anaconda, but it will work with any IDE that supports .ipynb (notebook) files with minor updates to the steps
* #### Note: This workbook assumes minimal Python proficiency, so the steps have been designed accordingly

# Table of Contents
1. Get Files on Local Machine
2. Setup / Launch .venv
3. Initial library install
4. Import Stuff
5. Create embeddings
6. Prompt AI 

## **Step 1: Get the files onto your local machine. There are many ways to do this, but one of the easiest is to download the code as a zip file and extract it into your preferred directory.**

## **Step 2: Create a new .venv (a virtual environment, preferably Python 3.9) so that you're running a clean setup that is not impacted by other projects. There is plenty of content on the web about how to do this. Just make sure you’re running that new environment before you open the notebook.**
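
As a quick sanity check before opening the notebook, the sketch below reports the running Python version and whether the interpreter is inside a virtual environment. The helper names `in_virtual_env` and `python_version_ok` are made up for this illustration. The check relies on `sys.prefix` differing from `sys.base_prefix`, which holds for environments created with the standard `venv` module; a conda environment will typically report `False` here even though it is isolated.

```python
import sys


def in_virtual_env() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix points at the base Python installation.
    return sys.prefix != sys.base_prefix


def python_version_ok(minimum=(3, 9)) -> bool:
    """Return True when the running Python is at least `minimum` (major, minor)."""
    return sys.version_info[:2] >= minimum


print(f"Python {sys.version_info.major}.{sys.version_info.minor}")
print(f"Inside a virtual environment: {in_virtual_env()}")
print(f"Meets the 3.9 baseline: {python_version_ok()}")
```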

## **Step 3: You’ll only need to run the script below {!pip install -r requirements.txt} during the initial install**

In [None]:
# Install packages from requirements.txt
!pip install -r requirements.txt


## **Step 4: Load Important Stuff (needed each time you open again)**

In [None]:
# @title

import ipywidgets as widgets
from ipywidgets import AppLayout, Button, GridspecLayout, Layout
import numpy as np
from numpy import dot
import pandas as pd
import requests
import openai
import json
from openai import OpenAI
import ast
from io import BytesIO
from IPython.display import clear_output
import tiktoken # counting tokens, used by openai
import os
import re
import time
import threading

clear_output()

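## **Note: If you've already created an embedding file, you can skip the embedding run in Step 5 and go to Step 6. Step 6 still uses the infosec_encode_folder variable and the encode_text_with_curl_method function defined in Step 5, so make sure both are defined in your session.**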

## **Step 5: Create embeddings**

> **Action**: Create an input folder and an output folder in your preferred location, then copy their paths into the infosec_od_folder and infosec_encode_folder variables below (make sure you use / as the path separator)
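
If you would rather create the folders from Python than by hand, a minimal sketch using `pathlib` is shown here. The helper name `ensure_folder` is hypothetical, and the example writes into a temporary directory so that it is safe to run anywhere; in practice you would pass your own input and output paths.

```python
import tempfile
from pathlib import Path


def ensure_folder(path_str: str) -> Path:
    """Create the folder (and any missing parents) if needed and return it."""
    folder = Path(path_str)
    # exist_ok=True makes the call safe to repeat on every notebook run
    folder.mkdir(parents=True, exist_ok=True)
    return folder


# Demonstration in a throwaway base directory (an assumption for this example)
with tempfile.TemporaryDirectory() as base:
    od_folder = ensure_folder(f"{base}/infosec_od_folder")
    encode_folder = ensure_folder(f"{base}/infosec_encode")
    # as_posix() yields forward slashes, matching the paths used in this notebook
    print(od_folder.as_posix())
    print(encode_folder.as_posix())
```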

In [None]:


# Using the paths as specified without modification
infosec_od_folder = 'C:/Users/Downloads/infosec_od_folder'
infosec_encode_folder = 'C:/Users/Downloads/infosec_encode'

# Check that the input folder exists and list its contents
if os.path.exists(infosec_od_folder):
    print(f"Found folder: {infosec_od_folder}")
    print("Contents of 'infosec_od_folder':")
    print(os.listdir(infosec_od_folder))
else:
    print(f"The 'infosec_od_folder' path does not exist: {infosec_od_folder}")

# Check that the output folder exists and list its contents
if os.path.exists(infosec_encode_folder):
    print(f"Found folder: {infosec_encode_folder}")
    print("Contents of 'infosec_encode_folder':")
    print(os.listdir(infosec_encode_folder))
else:
    print(f"The 'infosec_encode_folder' path does not exist: {infosec_encode_folder}")


## **Step 5.1: Load your csv file into a dataframe via the code below**

> **Action**: Run the cell below

In [None]:
# List to hold DataFrames
dataframes = []

# Check if the folder exists and read all CSV files
if os.path.exists(infosec_od_folder):
    for filename in os.listdir(infosec_od_folder):
        file_path = os.path.join(infosec_od_folder, filename)

        # Check if the file is a CSV file
        if filename.endswith('.csv'):
            try:
                # Load the CSV file into a DataFrame
                df = pd.read_csv(file_path)
                dataframes.append(df)
                print(f"Loaded {filename} into a DataFrame.")
            except Exception as e:
                print(f"Error loading {filename}: {e}")
else:
    print(f"The folder '{infosec_od_folder}' does not exist.")

# Combine all DataFrames if needed
combined_df = pd.concat(dataframes, ignore_index=True) if dataframes else None

# Display the combined DataFrame or notify if empty
if combined_df is not None:
    print("Combined DataFrame from all csv files:")
    print(combined_df)
else:
    print("No csv files were loaded.")

## **Step 5.2: Preprocess the dataframe by:**
1. For each row, use the field names and values to build a string that stays below 7,000 tokens.
2. Rows with more than 7,000 tokens are split into segments. Each segment starts with the first num_fields fields (the num_fields parameter of the preprocess_df function) as header content, then adds further fields/content/tokens until 7,000 tokens is reached.
3. Preprocessing continues on that row, starting each new segment with the same header content and adding additional fields/content/tokens until the row is complete, at which point the process starts again for the next row.
4. The preprocess_df function outputs a dataframe with an embedding_text column; a token_count field that counts the tokens for each row is then added.
5. Update the GPT_MODEL, embedding_model and api_key variables as needed to enable the embedding of the text.
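
The segmentation described in the list above can be illustrated on a single toy record. This is a simplified sketch rather than the notebook's implementation: `count_tokens` is a stand-in that counts whitespace-separated words (the notebook counts tokens with tiktoken), `segment_row` is a hypothetical helper, and the limit is set to 12 instead of 7,000 so the split is visible.

```python
def count_tokens(text: str) -> int:
    """Stand-in token counter: whitespace-separated words, not real tokens."""
    return len(text.split())


def segment_row(row: dict, num_fields: int, limit: int) -> list:
    """Split one record into segments that each start with the header fields."""
    columns = list(row)
    header = ", ".join(f"{col}: {row[col]}" for col in columns[:num_fields])
    header_tokens = count_tokens(header)

    segments = []
    text, tokens = header, header_tokens
    for col in columns[num_fields:]:
        field_text = f", {col}: {row[col]}"
        field_tokens = count_tokens(field_text)
        # Close the current segment when this field would push it over the limit
        if tokens + field_tokens > limit:
            segments.append(text)
            text, tokens = header, header_tokens
        text += field_text
        tokens += field_tokens
    segments.append(text)  # whatever remains is the last segment for this row
    return segments


row = {
    "id": "42",
    "title": "Patch notice",
    "summary": "apply the vendor fix",
    "impact": "low risk overall",
}
for segment in segment_row(row, num_fields=2, limit=12):
    print(segment)
```

With these inputs the record is split into two segments, and both begin with the same `id` and `title` header, so each embedded chunk still identifies the record it came from.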

> **Action**: Update the variables below as needed

In [None]:
GPT_MODEL = "gpt-4o-2024-05-13" # Update to your preferred model (make sure its context window is around 100k tokens or more, or adjust token_max in the code below)
embedding_model = "text-embedding-3-large" # update your model as needed
first_important_fields = 2 # indicates the first X number of fields in the csv assuming they represent the facts of that record 
api_key = 'X' 
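
Rather than pasting the key into the notebook, one option is to read it from an environment variable. The sketch below assumes the key is stored in `OPENAI_API_KEY`; the helper name `load_api_key` is hypothetical, and you can pass a different variable name if your setup uses one.

```python
import os


def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Read the API key from an environment variable, failing early if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running the notebook.")
    return key


# Usage (uncomment once the variable is set in your environment):
# api_key = load_api_key()
```

Failing early here gives a clearer message than a 401 response from the API later in the notebook.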

In [None]:


def num_tokens_from_string(string: str, encoding_name: str) -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens


def encode_text_with_curl_method(text, model="text-embedding-3-large", api_key=api_key):
    try:
        headers = {
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}"
        }

        data = {
            "input": text,
            "model": model
        }

        response = requests.post(
            "https://api.openai.com/v1/embeddings",
            headers=headers,
            data=json.dumps(data)
        )

        if response.status_code == 200:
            return response.json()['data'][0]['embedding']
        else:
            print(f"Error: {response.status_code}, {response.text}")
            return None
    except Exception as e:
        print(f"An error occurred: {e}")
        return None


def preprocess_df(df, num_fields, encoding_name="cl100k_base"):
    """
    Processes the DataFrame by splitting rows into multiple segments if they exceed a specified token limit.

    Parameters:
    df (pd.DataFrame): Input DataFrame to be processed.
    num_fields (int): Number of fields to use as headers for each segment.
    encoding_name (str): Encoding name for tokenization (default is "cl100k_base").

    Returns:
    pd.DataFrame: A single DataFrame with each row segmented into a 'text' column, not exceeding 7,000 tokens.
    """

    # Convert all values in the DataFrame to strings
    df = df.astype(str)
    text_dataframes = []

    for _, row in df.iterrows():
        row_texts = []
        accumulated_text = ""
        accumulated_tokens = 0

        # Initial header content using the specified number of fields
        header_content = ', '.join([f"{col}: {row[col]}" for col in df.columns[:num_fields]])
        header_tokens = num_tokens_from_string(header_content, encoding_name)

        # Start the segment with header content
        accumulated_text += header_content
        accumulated_tokens += header_tokens

        # Process remaining fields in the row after the header fields
        for col in df.columns[num_fields:]:
            # Create text for the current field
            field_text = f", {col}: {row[col]}"

            # Count tokens for the current field
            field_tokens = num_tokens_from_string(field_text, encoding_name)

            # Check if adding this field exceeds the token limit
            if accumulated_tokens + field_tokens > 7000:
                # Save accumulated text as a row in the output DataFrame
                row_texts.append(accumulated_text)

                # Reset accumulated text and tokens, start new segment with header
                accumulated_text = header_content
                accumulated_tokens = header_tokens

            # Add the current field text to the accumulated segment
            accumulated_text += field_text
            accumulated_tokens += field_tokens

        # Add any remaining accumulated text as the last segment for this row
        if accumulated_text:
            row_texts.append(accumulated_text)

        # Create individual DataFrames for each segment and add to the list
        row_dfs = pd.DataFrame({"embedding_text": row_texts})
        text_dataframes.append(row_dfs)

    # Concatenate all the individual row DataFrames into a single DataFrame
    final_text_df = pd.concat(text_dataframes, ignore_index=True)
    return final_text_df


processed_df = preprocess_df(combined_df,first_important_fields)
processed_df['token_count'] = processed_df.apply(lambda x: num_tokens_from_string(x.embedding_text, encoding_name="cl100k_base"), axis=1)
processed_df['embedding'] = processed_df.apply(lambda x: encode_text_with_curl_method(x.embedding_text, model=embedding_model, api_key=api_key), axis=1)
print(f'{processed_df.token_count.sum()} <- Sum count of all tokens')
processed_df.to_csv(infosec_encode_folder+"/embeddings.csv", index=False)
processed_df.head(3)
#
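
Writing the dataframe to CSV stores each embedding list as text, which is why Step 6 parses the embedding column with `ast.literal_eval` after loading. The round trip can be seen in miniature below. This sketch uses the standard-library `csv` module and an in-memory buffer as a stand-in for the pandas calls in the notebook, and the sample row is made up.

```python
import ast
import csv
import io

# A made-up row shaped like the notebook's output: text plus an embedding list
rows = [{"embedding_text": "id: 1, title: demo", "embedding": [0.1, -0.2, 0.3]}]

# Write to an in-memory CSV; the list is stored as its string representation
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["embedding_text", "embedding"])
writer.writeheader()
writer.writerows(rows)

# Read it back: every value is now a plain string
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
raw = loaded[0]["embedding"]

# literal_eval safely turns the string back into a list of floats
vector = ast.literal_eval(raw)
print(type(raw).__name__, type(vector).__name__)
```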

## **Step 6: Run the cell below and wait for the workbook to load the requirements and then scroll to the bottom of the workbook and enter your question into the "Human" box. Then press the "Send Message" button.**


In [None]:
GPT_MODEL = "gpt-4o-2024-05-13" # Update to your preferred model (make sure its context window is around 100k tokens or more, or adjust token_max in the code below)
embedding_model = "text-embedding-3-large" # update your model as needed
api_key = 'X' 

In [9]:
# @title

# Load the embeddings CSV file into a pandas DataFrame
embeddings_df = pd.read_csv(infosec_encode_folder+"/embeddings.csv")

embeddings_df['embedding'] = embeddings_df.apply(lambda x: ast.literal_eval(x.embedding), axis=1)

client = openai.OpenAI(api_key=api_key)


# Helper functions
def json_gpt(input: str):
    # API endpoint URL
    api_url = 'https://api.openai.com/v1/chat/completions'

    # Request headers
    headers = {
        'Content-Type': 'application/json',
        'Authorization': f'Bearer {api_key}'
    }

    # Request data gpt-4-1106-preview     gpt-4-0613 this one works
    data = {
        'model': GPT_MODEL,
        'messages': [
            {"role": "system", "content": "Output only valid JSON"},
            {"role": "user", "content": input}
                    ]
    }

    # Make the API call
    response = requests.post(api_url, json=data, headers=headers)

    # Process the API response
    if response.status_code != 200:
        raise RuntimeError(f"Chat completion failed: {response.status_code}, {response.text}")

    print('success')
    result = response.json()
    chatgpt_response = result['choices'][0]['message']['content'].strip()
    # Strip an optional Markdown code fence (``` or ```json) without touching the JSON body
    cleaned_str = re.sub(r'^```(?:json)?\s*|\s*```$', '', chatgpt_response)
    return json.loads(cleaned_str)




def message_preprocess(user_input, embeddings_df):
    #
    HA_INPUT = f"""
    Generate a hypothetical answer to the user's question. This answer will be used to rank search results.
    Pretend you have all the information you need to answer, but don't use any actual facts. Instead, use placeholders
    like THING is derived by, or THING was sourced by from PLACE.

    User question: {user_input}

    Format: {{"hypotheticalAnswer": "hypothetical answer text"}}
    """
    # get a hypothetical answer
    hypothetical_answer = json_gpt(HA_INPUT)["hypotheticalAnswer"] +" " + user_input

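    # now get the embeddings for that answer, passing the current model and key explicitly
    # (the function's default api_key is bound when the function is defined)
    hypothetical_answer_embedding = encode_text_with_curl_method(hypothetical_answer, model=embedding_model, api_key=api_key)
    if hypothetical_answer_embedding is None:
        raise RuntimeError("Could not embed the hypothetical answer; check the API key and embedding model.")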

    embeddings_df['cosine_similarities'] = embeddings_df.apply(lambda x: dot(hypothetical_answer_embedding, x.embedding), axis=1)
    embeddings_df = embeddings_df.sort_values(by='cosine_similarities', ascending=False)
    embeddings_df['rolling_count'] = embeddings_df.token_count.cumsum()

    # now get the top N number of records and return a string with the updated text that will
    # be used for downstream processing
    token_max = 125000  # alternative: 16000
    half = token_max * .5

    # get only the targeted amount of tokens for processing
    target_tokens = embeddings_df[embeddings_df.rolling_count <= half]

    formatted_top_results = [
        {
            "text": text.embedding_text,
        }
        for text in target_tokens.itertuples()
    ]

    ANSWER_INPUT = f"""
    Generate an answer to the user's question using only the given search results.
    search results: {formatted_top_results}
    user question: {user_input}

    Include as much information as possible in the answer.
    It is important to explain the how and why with your answer.
    Reference only the information in search results.
    If the search results do not contain content that directly addresses the
    question then respond by indicating the content lacks the required details to respond to the question.
    """

    # Create the final JSON structure
    final_json = {
        'text': formatted_top_results
    }

    # Specify the path for the JSON file
    path = 'temp.json'

    # Write to the JSON file
    with open(path, 'w', encoding='utf-8') as file:
        json.dump(final_json, file, ensure_ascii=False, indent=4)

    return ANSWER_INPUT


def results_completion(ANSWER_INPUT, api_key):
    # API endpoint URL
    api_url = 'https://api.openai.com/v1/chat/completions'

    # Request headers
    headers = {
        'Content-Type': 'application/json',
        'Authorization': f'Bearer {api_key}'
    }

    data = {
        'model': GPT_MODEL,
        'messages': [{'role': 'system', 'content': '''You are a security analyst referencing documentation regarding
                                                      information security updates.
                                                      It is important to use only the provided content to answer, drawing on the closest relevant content.
                                                      Make sure you cite your sources by providing a reference using the initial part of the referenced content.'''},
                     {'role': 'user', 'content': ANSWER_INPUT}]
    }

    # Make the API call
    response = requests.post(api_url, json=data, headers=headers)

    # Process the API response
    if response.status_code == 200:
        result = response.json()
        chatgpt_response = result['choices'][0]['message']['content']
        return chatgpt_response
    else:
        # Return the error text so the chat window shows it instead of "None"
        chatgpt_response = f"Error: {response.status_code}, {response.text}"
        print(chatgpt_response)
        return chatgpt_response




# Function to submit message to the assistant
def submit_message(user_message, embeddings_df):

    # get the preprocessed user message
    processed_prompt = message_preprocess(user_message, embeddings_df)

    return results_completion(processed_prompt, api_key)






###########
#
#  Code below is to work the UI
#
###########

chat_output = widgets.Textarea(
    value='Hi, I am here to answer your questions',
    placeholder='Hi, I am here to answer your questions',
    description='Assistant:',
    disabled=True,
    layout=Layout(width="80%", align_items='stretch')
)




chat_input = widgets.Text(
    value='Start chatting by typing here',
    placeholder='Start chatting by typing here',
    description='Human:',
    disabled=False,
    layout=Layout(width="80%")
)

def submit_message_and_update_output():
    global chat_output
    response = submit_message(chat_input.value, embeddings_df)

    # Once the response is received, update the chat_output
    chat_output.value = f"{chat_output.value}\n\n{response}"

def on_send_message_button_clicked(b):
    global chat_output
    chat_output.value = f"{chat_output.value}\n\nThinking... (could take a minute or two)"

    # Run submit_message in a separate thread
    message_thread = threading.Thread(target=submit_message_and_update_output)
    message_thread.start()

    # While the thread is running, periodically update the chat_output
    while message_thread.is_alive():
        time.sleep(3)  # Wait 3 seconds before updating the chat_output
        chat_output.value = f"{chat_output.value}\n\nStill waiting..."

    message_thread.join()  # Wait for the thread to finish



send_message_button = widgets.Button(
    description='▶️ Send Message',
    disabled=False,
    button_style='',
    tooltip='Send Message',
    layout=Layout(width="80%", display='flex', align_items='flex-start')
)


# Attach the event handler to the button
send_message_button.on_click(on_send_message_button_clicked)


chat_tab = GridspecLayout(100, 100, height='800px')
chat_tab[:4, :] = chat_input
chat_tab[4:8, :] = send_message_button
chat_tab[8:, :] = chat_output


tabs = widgets.Tab()
tabs.children = [chat_tab]
tabs.set_title(0, 'Chat')
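
The retrieval step in `message_preprocess` ranks every stored segment by the dot product between its embedding and the embedding of the hypothetical answer, then keeps the best-ranked segments until a token budget is used up. A simplified, pandas-free sketch of that idea follows. The helper names and the tiny two-dimensional vectors are made up for illustration, and treating the dot product as cosine similarity assumes the embedding vectors are unit length.

```python
def dot_product(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def select_within_budget(query_vec, records, token_budget):
    """Rank records by similarity to the query and keep the top ones that fit the budget."""
    ranked = sorted(
        records,
        key=lambda rec: dot_product(query_vec, rec["embedding"]),
        reverse=True,
    )
    selected, running_total = [], 0
    for rec in ranked:
        running_total += rec["token_count"]
        # Stop once the cumulative token count exceeds the budget
        if running_total > token_budget:
            break
        selected.append(rec["text"])
    return selected


records = [
    {"text": "A", "embedding": [0.6, 0.8], "token_count": 40},
    {"text": "B", "embedding": [1.0, 0.0], "token_count": 50},
    {"text": "C", "embedding": [0.0, 1.0], "token_count": 30},
]
print(select_within_budget([1.0, 0.0], records, token_budget=100))
```

Here "B" ranks first, "A" second, and "C" is dropped because adding it would push the running total past the budget of 100.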




# Run UI

In [10]:
tabs
