In [1]:
import json
import pandas as pd
import os

### Load domain and subdomain labels
Load the domain and subdomain labels from the JSON files with the helper functions below. These labels are used to build the prompts sent to the LLM: the domain labels are the classes for the `expert` column, and the subdomain labels are the classes for the `sub_expert` column.
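
As a rough illustration of the layout these cells assume, the sketch below uses a tiny inline sample instead of the real files. The nesting (lists of single-key dicts) is inferred from how the notebook indexes the data; the `"placeholder"` values and the variable names are hypothetical, since the actual values stored in the files are not shown here.

```python
import json

# Hypothetical miniature versions of the two label files.
# Only the nesting mirrors the notebook; the "placeholder" values are made up.
domain_sample = json.loads(
    '{"Items": [{"Application": "placeholder"}, {"Cloud": "placeholder"}]}'
)
subdomain_sample = json.loads(
    '{"Application": [{"Integration": "placeholder"}, {"Plugin Management": "placeholder"}],'
    ' "Cloud": [{"Virtualization": "placeholder"}]}'
)

# Each list item is a dict with a single key, and that key is the label.
sample_domains = [list(item)[0] for item in domain_sample["Items"]]
sample_subdomains = {
    domain: [list(item)[0] for item in subdomain_sample[domain]]
    for domain in sample_domains
}

print(sample_domains)
print(sample_subdomains)
```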

In [2]:
with open('./data/domain_labels.json', 'r') as f:
    domain_json = json.load(f) 
with open('./data/subdomain_labels.json', 'r') as f:
    subdomain_json = json.load(f)

In [3]:
type(list(subdomain_json['Application'][0].keys())[0]) #use these cells to explore the json data if need be.

str

In [4]:
print(list(subdomain_json['Application'][0].keys())[0])

Integration


In [5]:
def create_domains_list(domain_json):
    #domain_json["Items"] is a list of single-key dicts; each key is a domain label.
    return [list(item.keys())[0] for item in domain_json["Items"]]

In [6]:
def create_subdomains_dict(subdomain_json, domains):
    #Maps each domain to its subdomain labels (the single key of each item dict).
    return {
        domain: [list(item.keys())[0] for item in subdomain_json[domain]]
        for domain in domains
    }

In [7]:
domains = create_domains_list(domain_json)

In [8]:
subdomains = create_subdomains_dict(subdomain_json, domains)

In [9]:
print(domains)

['Application', 'Application Performance Manager', 'Big Data', 'Cloud', 'Computer Graphics', 'Data Structure', 'Databases', 'DevOps', 'Error Handling', 'Event Handling', 'Geographic Information System', 'Input-Output', 'Interpreter', 'Internationalization', 'Logic', 'Language', 'Logging', 'Machine Learning', 'Microservices/Services', 'Multimedia', 'Multi-Thread', 'Natural Language Processing', 'Network', 'Operating System', 'Parser', 'Search', 'Security', 'Setup', 'User Interface', 'Utility', 'Test']


In [10]:
for domain in domains:
    print(f"Domain: {domain}\nSubdomains({len(subdomains[domain])}): {', '.join(subdomains[domain])}\n")

Domain: Application
Subdomains(6): Integration, Plugin Management, User Customization, App Configuration, Version Control, Compatibility Checks

Domain: Application Performance Manager
Subdomains(6): Performance Monitoring, Resource Allocation, Error Detection, Load Balancing, Traffic Management, Diagnostic Tools

Domain: Big Data
Subdomains(6): Data Processing, Data Storage, Data Analysis, Real-Time Processing, Batch Processing, Data Visualization

Domain: Cloud
Subdomains(6): Resource Management, Virtualization, Scalability Solutions, Cloud Security, Data Migration, Service Configuration

Domain: Computer Graphics
Subdomains(6): Image Rendering, Animation, Modeling, Texture Mapping, Visual Effects, Graphics Optimization

Domain: Data Structure
Subdomains(6): Linear Structures, Tree Structures, Graph Structures, Data Sorting, Search Algorithms, Data Manipulation

Domain: Databases
Subdomains(6): Query Execution, Transaction Management, Schema Design, Database Security, Backup and Reco

### Setup OpenAI API Key
Verify that the API key works by sending a single short request. The test call below uses only 14 tokens, so it is extremely cheap.
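
Since `os.getenv` returns `None` when the variable is unset, it can help to fail early with a clear message before any client is created. The following is a minimal sketch of such a check; `require_api_key` and its `env` parameter are hypothetical names introduced here for illustration, and the demo key is not a real credential.

```python
import os

def require_api_key(name="OPENAI_API_KEY", env=None):
    """Return the key stored under `name`, or raise if it is missing or blank."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Environment variable {name} is not set")
    return key

# Usage with an explicit mapping so the example runs without a real key.
demo_key = require_api_key(env={"OPENAI_API_KEY": "sk-demo-not-a-real-key"})
print(len(demo_key) > 0)
```

In the notebook itself, calling the helper without `env` would read from the real environment.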

In [11]:
api_key = os.getenv("OPENAI_API_KEY")

In [12]:
from langchain_openai import ChatOpenAI

In [13]:
test_llm = ChatOpenAI(temperature=0, api_key=api_key)

In [14]:
response = test_llm.invoke("Test prompt, only reply True") #if no error is thrown, api key works

In [15]:
print(response)

content='True' additional_kwargs={'refusal': None} response_metadata={'token_usage': {'completion_tokens': 1, 'prompt_tokens': 13, 'total_tokens': 14, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'id': 'chatcmpl-BpfYUKMOskgyZUG9ZRHTe9Axefba7', 'service_tier': 'default', 'finish_reason': 'stop', 'logprobs': None} id='run--be8cd9fc-af1a-4ede-afb7-5c0f6d95a214-0' usage_metadata={'input_tokens': 13, 'output_tokens': 1, 'total_tokens': 14, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}}


In [16]:
print(response.content)

True


In [17]:
print(response.response_metadata)

{'token_usage': {'completion_tokens': 1, 'prompt_tokens': 13, 'total_tokens': 14, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'id': 'chatcmpl-BpfYUKMOskgyZUG9ZRHTe9Axefba7', 'service_tier': 'default', 'finish_reason': 'stop', 'logprobs': None}


In [18]:
print(type(response.response_metadata['token_usage']['completion_tokens']))

<class 'int'>


### Generate Prompts
We start with a base prompt. The list of labels is appended to it, and a placeholder is left for the library name.\
Be sure to practice and test the prompts in the free browser-based version.\
Prompts will need to be updated when the model is updated or replaced.
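
A minimal sketch of this append-then-fill pattern, using a short hypothetical label list. The helper name `build_prompt` is an assumption for illustration, as is the escaping step: with `%`-style formatting, a label containing a literal `%` would otherwise break the later substitution of the library name.

```python
def build_prompt(base_prompt, labels):
    """Append labels to a %-style template, keeping its placeholders usable."""
    # Double any literal '%' in the labels so only the template's own
    # placeholders are treated as format specifiers later.
    escaped = [label.replace("%", "%%") for label in labels]
    return base_prompt + ", ".join(escaped)

base = 'Categorize the library "%s" into only one of the domains: '
template = build_prompt(base, ["Logging", "Network", "100% Coverage"])

# The library name is filled in once per call.
print(template % "example.library")
```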

In [19]:
domain_base_prompt = "Categorize the library \"%s\" into only one of the domains, without explaining the result: "

In [20]:
domain_prompt = domain_base_prompt + ", ".join(domains)

In [21]:
print(domain_prompt % "example.library")

Categorize the library "example.library" into only one of the domains, without explaining the result: Application, Application Performance Manager, Big Data, Cloud, Computer Graphics, Data Structure, Databases, DevOps, Error Handling, Event Handling, Geographic Information System, Input-Output, Interpreter, Internationalization, Logic, Language, Logging, Machine Learning, Microservices/Services, Multimedia, Multi-Thread, Natural Language Processing, Network, Operating System, Parser, Search, Security, Setup, User Interface, Utility, Test


In [22]:
subdomain_base_prompt = "Given that the domain is \"%s\", categorize the library \"%s\" into one of the subdomains, without explaining the result: "

In [23]:
def generate_subdomain_prompt(base_prompt, domain):
    if domain not in domains: 
        #Error check should be performed outside this function but this will minimize cost if something goes wrong.
        #The fallback keeps a "%s" placeholder because callers format the returned prompt with the library name.
        return "For the library \"%s\", only reply PromptError"
    
    prompt = base_prompt + ", ".join(subdomains[domain])
    return prompt % (domain, "%s")
    

In [24]:
print(generate_subdomain_prompt(subdomain_base_prompt, "Test") % "example.library")

Given that the domain is "Test", categorize the library "example.library" into one of the subdomains, without explaining the result: Unit Testing, Integration Testing, Performance Testing, Security Testing, Usability Testing, Regression Testing


### Setup dataframe
Import `API_specific.csv`, exported from PostgreSQL.\
`small_df` holds a small sample of its rows to test with.

In [25]:
columns = ['general', 'specific', 'api_name_fk', 'expert', 'sub_expert']
API_specific = pd.read_csv("./data/input/API_specific.csv", header="infer", names=columns)

In [26]:
print(API_specific.shape)
API_specific.head()

(2531, 5)


Unnamed: 0,general,specific,api_name_fk,expert,sub_expert
0,r,r,r.fileformat,,
1,r,r,r.util,,
2,r,fetcher,r.fetcher.citation.semanticscholar,,
3,r,r,r.fetcher,,
4,r,fetcher,r.fetcher.citation,,


In [27]:
elements = [20 * (i + 1) for i in range(10)]
print(elements)

[20, 40, 60, 80, 100, 120, 140, 160, 180, 200]


In [28]:
small_df = API_specific.iloc[elements]

In [29]:
small_df

Unnamed: 0,general,specific,api_name_fk,expert,sub_expert
20,gui,preferences,org.jabref.gui.preferences.JabRefGuiPreferences,,
40,dev,langchain4j,dev.langchain4j.data.document.DefaultDocument,,
60,logic,l10n,org.jabref.logic.l10n.Localization,,
80,util,util,java.util.Optional,,
100,javafx,scene,javafx.scene.input.ClipboardContent,,
120,javafx,scene,javafx.scene.layout.Background,,
140,logic,preview,org.jabref.logic.preview.PreviewLayout,,
160,javafx,scene,javafx.scene.Group,,
180,io,github,io.github.adr.linked.ADR,,
200,javafx,scene,javafx.scene.Scene,,


### Test Calls
These stub functions return fixed labels without calling the API. Use them to understand how the dataframe is filled in and the general structure of the LLM calls.

In [30]:
info = {'calls_made': 0}

def test_domain_call(api_name, printing):
    prompt = domain_prompt % api_name
    response_content = "Application"
    
    if(printing):
        #prints various info from the response
        print("api name: " + api_name)
        print("label: " + response_content)
        print("total tokens: ")
        print()
        
    info['calls_made'] += 1
    return response_content

def test_subdomain_call(api_name, domain, printing):
    subdomain_prompt = generate_subdomain_prompt(subdomain_base_prompt, domain)
    prompt = subdomain_prompt % api_name
    response_content = "Application Subdomain"
    
    if(printing):
        print("api name: " + api_name)
        print("domain: " + domain)
        print("label: " + response_content)
        print("total tokens: ")
        print()
    
    info['calls_made'] += 1
    return response_content

def generate_test_labels(small_df, printing=True):
    df = small_df.head(10).copy() #ensures small size; the copy avoids writing to a slice of the original
    info['calls_made'] = 0
    
    #assign whole columns so the string labels replace the empty (NaN) columns cleanly
    df['expert'] = df['api_name_fk'].apply(test_domain_call, printing=printing)
    df['sub_expert'] = df.apply(lambda row: test_subdomain_call(row['api_name_fk'], row['expert'], printing), axis=1)
    
    print("Finished.")
    print("Calls made: " + str(info['calls_made']))
    return df

In [31]:
test_labeled = generate_test_labels(small_df, printing=False)

Finished.
Calls made: 20


In [32]:
test_labeled

Unnamed: 0,general,specific,api_name_fk,expert,sub_expert
20,gui,preferences,org.jabref.gui.preferences.JabRefGuiPreferences,Application,Application Subdomain
40,dev,langchain4j,dev.langchain4j.data.document.DefaultDocument,Application,Application Subdomain
60,logic,l10n,org.jabref.logic.l10n.Localization,Application,Application Subdomain
80,util,util,java.util.Optional,Application,Application Subdomain
100,javafx,scene,javafx.scene.input.ClipboardContent,Application,Application Subdomain
120,javafx,scene,javafx.scene.layout.Background,Application,Application Subdomain
140,logic,preview,org.jabref.logic.preview.PreviewLayout,Application,Application Subdomain
160,javafx,scene,javafx.scene.Group,Application,Application Subdomain
180,io,github,io.github.adr.linked.ADR,Application,Application Subdomain
200,javafx,scene,javafx.scene.Scene,Application,Application Subdomain


In [33]:
test_labeled.to_csv("./data/output/testlabels.csv", header=False, index=False)

### LLM Calls
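
The functions in this section use the model's reply as the label and treat any domain that is not in `domains` as an error. A reply that differs only in case, surrounding whitespace, or a trailing period would therefore be rejected. The sketch below shows one possible way to normalize replies before that check; `normalize_label` is a hypothetical helper that is not part of the notebook, and the label list is a shortened sample.

```python
def normalize_label(reply, allowed):
    """Map a raw model reply onto one of the allowed labels, or return None."""
    # Trim whitespace, quotes, and a trailing period the model may add.
    cleaned = reply.strip().strip('"\'').rstrip(".").strip()
    # Compare case-insensitively but return the canonical spelling.
    lookup = {label.lower(): label for label in allowed}
    return lookup.get(cleaned.lower())

sample_allowed = ["User Interface", "Utility", "Data Structure"]
print(normalize_label(" user interface.\n", sample_allowed))
print(normalize_label("Something else", sample_allowed))
```

Returning `None` for unknown replies leaves the decision to the caller, which could count it as an error in the same way an unrecognized domain is counted.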

In [34]:
llm = ChatOpenAI(model = "gpt-4o-mini-2024-07-18", temperature=0, api_key=api_key)

In [35]:
info = {"overall_tokens": 0, "errors": 0} #using dict gets around global variable issues

def domain_call(api_name, printing):
    prompt = domain_prompt % api_name
    response = llm.invoke(prompt)
    
    if(printing):
        meta_data = response.usage_metadata
        #print("prompt: " + prompt)
        print("api name: " + api_name)
        print("label: " + response.content)
        print("output tokens: " + str(meta_data['output_tokens']))
        print("total tokens: " + str(meta_data['total_tokens']))
        print()
        
    info['overall_tokens'] += response.usage_metadata['total_tokens']
    
    return response.content


def subdomain_call(api_name, domain, printing):
    if domain not in domains:
        info['errors'] += 1
        print("DOMAIN NOT FOUND. api: " + api_name + " domain: " + domain)
        return "DomainError"
    
    subdomain_prompt = generate_subdomain_prompt(subdomain_base_prompt, domain)
    prompt = subdomain_prompt % api_name
    response = llm.invoke(prompt)
    
    if(printing):
        meta_data = response.usage_metadata
        #print("prompt: " + prompt)
        print("api name: " + api_name)
        print("domain: " + domain)
        print("subdomain: " + response.content)
        print("output tokens: " + str(meta_data['output_tokens']))
        print("total tokens: " + str(meta_data['total_tokens']))
        print()
        
    info['overall_tokens'] += response.usage_metadata['total_tokens']
    
    return response.content


def generate_labels(small_df, printing=True):
    df = small_df.head(10).copy() #ensures small size; the copy avoids writing to a slice of the original
    
    info['overall_tokens'] = 0 #reset info
    info['errors'] = 0
    
    #make llm calls; assign whole columns so the string labels replace the empty (NaN) columns cleanly
    df['expert'] = df['api_name_fk'].apply(domain_call, printing=printing)
    df['sub_expert'] = df.apply(lambda row: subdomain_call(row['api_name_fk'], row['expert'], printing), axis=1)
    
    print("Finished.")
    print("Tokens used overall: " + str(info['overall_tokens']))
    print("Errors encountered: " + str(info['errors']))
    print("Model: " + llm.model_name)
    
    return df

In [36]:
labeled = generate_labels(small_df)

api name: org.jabref.gui.preferences.JabRefGuiPreferences
label: User Interface
output tokens: 2
total tokens: 123

api name: dev.langchain4j.data.document.DefaultDocument
label: Data Structure
output tokens: 2
total tokens: 121

api name: org.jabref.logic.l10n.Localization
label: Internationalization
output tokens: 2
total tokens: 121

api name: java.util.Optional
label: Utility
output tokens: 1
total tokens: 114

api name: javafx.scene.input.ClipboardContent
label: User Interface
output tokens: 2
total tokens: 119

api name: javafx.scene.layout.Background
label: User Interface
output tokens: 2
total tokens: 117

api name: org.jabref.logic.preview.PreviewLayout
label: User Interface
output tokens: 2
total tokens: 121

api name: javafx.scene.Group
label: User Interface
output tokens: 2
total tokens: 116

api name: io.github.adr.linked.ADR
label: Utility
output tokens: 1
total tokens: 119

api name: javafx.scene.Scene
label: User Interface
output tokens: 2
total tokens: 116

api name: o

In [37]:
labeled

Unnamed: 0,general,specific,api_name_fk,expert,sub_expert
20,gui,preferences,org.jabref.gui.preferences.JabRefGuiPreferences,User Interface,Interaction Design
40,dev,langchain4j,dev.langchain4j.data.document.DefaultDocument,Data Structure,Data Manipulation
60,logic,l10n,org.jabref.logic.l10n.Localization,Internationalization,Localization
80,util,util,java.util.Optional,Utility,Data Conversion
100,javafx,scene,javafx.scene.input.ClipboardContent,User Interface,Interaction Design
120,javafx,scene,javafx.scene.layout.Background,User Interface,Layout Design
140,logic,preview,org.jabref.logic.preview.PreviewLayout,User Interface,Layout Design
160,javafx,scene,javafx.scene.Group,User Interface,Layout Design
180,io,github,io.github.adr.linked.ADR,Utility,Data Conversion
200,javafx,scene,javafx.scene.Scene,User Interface,Layout Design


In [38]:
labeled.to_csv("./data/output/small_labels.csv", header=False, index=False)

Make sure this .csv can be imported to PostgreSQL.
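
One way to sanity-check the file before importing is to confirm that every row has the expected number of fields. The sketch below is an assumption about what such a check could look like: it uses an inline two-row sample instead of the real file, `find_bad_rows` is a hypothetical helper, and the five-column order is taken from the `columns` list defined earlier. Because the file is written with `header=False`, the target table's column order has to match.

```python
import csv
import io

EXPECTED_COLUMNS = ['general', 'specific', 'api_name_fk', 'expert', 'sub_expert']

def find_bad_rows(csv_text, expected_width=len(EXPECTED_COLUMNS)):
    """Return (row_number, field_count) for every row with the wrong width."""
    reader = csv.reader(io.StringIO(csv_text))
    return [(i, len(row)) for i, row in enumerate(reader, start=1)
            if len(row) != expected_width]

sample = (
    "util,util,java.util.Optional,Utility,Data Conversion\n"
    "javafx,scene,javafx.scene.Group,User Interface\n"  # missing sub_expert
)
print(find_bad_rows(sample))
```

An empty result means every row has the expected width. Labels that contain commas are quoted by `to_csv`, and `csv.reader` handles that quoting when counting fields.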