## Using a Large Language Model & Vector Database for Categorization

### Set OPENAI_API_KEY before running Jupyter
You need an API key from: https://platform.openai.com/account/api-keys

    export OPENAI_API_KEY="secret key from site"

In [1]:
import sys
print(sys.version)

3.10.11 (main, Apr  5 2023, 14:15:30) [GCC 7.5.0]


Log in to OpenAI with OPENAI_API_KEY

In [2]:
import os
import openai

# get API key from top-right dropdown on OpenAI website
openai.api_key = os.getenv("OPENAI_API_KEY") or "OPENAI_API_KEY"

# openai.Engine.list()  # check we have authenticated
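
With the `or "OPENAI_API_KEY"` fallback above, a missing variable only shows up later as an authentication error. A minimal sketch of a fail-early check follows; `require_env` is a hypothetical helper, not part of the notebook or of the `openai` library.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

It could be used as `openai.api_key = require_env("OPENAI_API_KEY")`, and the same way for the Pinecone variables.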

Connect to Pinecone with PINECONE_API_KEY and PINECONE_ENVIRONMENT

In [4]:
import pinecone
from tqdm import tqdm

api_key = os.getenv("PINECONE_API_KEY") or "PINECONE_API_KEY"
# find your environment next to the api key in pinecone console
env = os.getenv("PINECONE_ENVIRONMENT") or "PINECONE_ENVIRONMENT"

pinecone.init(api_key=api_key, environment=env)
# pinecone.whoami()

Create a sample embedding so that we know the embedding length

In [6]:
embed_model = "text-embedding-ada-002"

res = openai.Embedding.create(
    input=["This is sample test that will determine the length"],
    engine=embed_model
)

embedding_length = len(res['data'][0]['embedding'])
embedding_length

1536

In [9]:
index_name = 'openai-fpi-categorization'

# Create the index if it doesn't exist already
if index_name not in pinecone.list_indexes():
    # if does not exist, create index
    pinecone.create_index(
        index_name,
        dimension=embedding_length,
        metric='cosine'
    )
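
The index is created with `metric='cosine'`, so matches are ranked by the angle between embedding vectors rather than by their length. A small pure-Python sketch of that score, for illustration only (Pinecone computes it server-side; `cosine_similarity` is a name made up here):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # same direction, different length
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # orthogonal vectors
```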

In [10]:
# connect to index
index = pinecone.Index(index_name)
# view index stats 
index_stats = index.describe_index_stats()
index_stats

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

## Populate the Vector Database
### Key Imports

In [11]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

### Import our Labeled Text into a DataFrame

In [12]:
# Read the CSV File into a dataframe

labeled_data = pd.read_csv('Federal_Program_Inventory_Pilot_Data.csv',encoding='cp1252',
                           usecols=['Agency','Program Name','Program Description','Mission/Purpose',
                                    'Recipients','Beneficiaries','Associated Categories'])

# Remove duplicate rows
labeled_data = labeled_data.drop_duplicates().reset_index(drop=True)

# Get the categories
categories = pd.read_csv('Federal_Program_Inventory_Pilot_Data.csv',encoding='cp1252',
                         usecols=['Category'])
categories = categories.drop_duplicates().reset_index(drop=True)
categories = [r["Category"] for _, r in categories.iterrows()]
categories

['Broadband',
 'Economic Development',
 'Opioid Epidemic Response',
 'STEM Education',
 'Workforce Development',
 'Native American',
 'Flood Risk',
 'A.I. R&D/Quantum R&D',
 'Global Health',
 'Homelessness',
 'HIV/AIDS',
 'Transportation Infrastructure']

### Let's create our comparison string and our category list

In [13]:
# Let's join the Program Name, Agency, Mission/Purpose, Program Description, Recipients,
# and Beneficiaries into a single text field (make it human readable for the LLM)
labeled_data['text'] = ("Program Name: " + labeled_data['Program Name'] + "\n" +
                        "Agency: " + labeled_data['Agency'] + "\n" +
                        "Mission/Purpose: " + labeled_data['Mission/Purpose'] + "\n" +
                        "Program Description: " + labeled_data['Program Description'] + "\n" +
                        "Recipients: " + labeled_data['Recipients'] + "\n" +
                        "Beneficiaries: " + labeled_data['Beneficiaries'])

# Let's generate the list of categories for each entry
labeled_data['categories'] = labeled_data['Associated Categories'].str.split('; ')
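
One thing to watch with the concatenation above: in pandas, adding a string to a missing value (NaN) yields NaN, so a row with any empty field ends up with no `text` at all. A hedged sketch of a per-record alternative that skips missing fields; `FIELDS`, `build_text`, and `split_categories` are hypothetical names, and whether the pilot data actually contains missing fields is not something the notebook shows.

```python
FIELDS = ["Program Name", "Agency", "Mission/Purpose",
          "Program Description", "Recipients", "Beneficiaries"]


def build_text(record):
    """Join the labeled fields of one program into a readable block, skipping missing values."""
    lines = []
    for field in FIELDS:
        value = record.get(field)
        # Only non-blank strings are kept; None and NaN (a float) count as missing
        if isinstance(value, str) and value.strip():
            lines.append(f"{field}: {value.strip()}")
    return "\n".join(lines)


def split_categories(raw):
    """Split a ';'-separated category string into a clean list ([] if missing)."""
    if not isinstance(raw, str):
        return []
    return [part.strip() for part in raw.split(";") if part.strip()]
```

Applied row by row, for example with `labeled_data.apply(lambda row: build_text(row.to_dict()), axis=1)`, this would produce the same layout as the cell above for complete rows.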

In [15]:
print(labeled_data['text'][2])

Program Name: POWER Initiative
Agency: Appalachian Regional Commission
Mission/Purpose: Area and regional development
Program Description: The POWER (Partnerships for Opportunity and Workforce and Economic Revitalization) Initiative helps communities and regions that have been affected by job losses in coal mining, coal power plant operations, and coal-related supply chain industries due to the changing economics of America’s energy production. The POWER Initiative funds projects that cultivate economic diversity, enhance job training and re-employment opportunities, create jobs in existing or new industries, attract new sources of investment, and strengthen the Region’s broadband infrastructure.
Recipients: U.S. State Government (including the District of Columbia); Public/State Controlled Institution; Nonprofit with 501(c)(3) Status; Domestic Local Government (includes territories unless otherwise specified)
Beneficiaries: State; Rural; Other public institution/organization; Public n

In [17]:
# Set up arrays of labeled data and calculate the embeddings
labeled_text = labeled_data['text'].tolist()
labeled_categories = labeled_data['categories'].tolist()
embeddings = openai.Embedding.create(input=labeled_text, engine=embed_model)

In [18]:
if (index_stats.total_vector_count == 0):
    to_upsert = [('fpi-'+str(i), embeddings['data'][i]['embedding'], 
                  {'text':labeled_text[i], 'categories':labeled_categories[i]})
                 for i in range(len(labeled_text))]
    # Insert in groups of 50
    for i in range(0, len(to_upsert), 50):
        print(i)
        index.upsert(vectors=to_upsert[i:i+50])

0
50
100
150
200
250
300
350
400
450
500
550
600
650
700
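
The batching pattern used for the upsert can be factored into a small generator. This is just a sketch; `chunked` is a hypothetical helper name.

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

The loop above would then read `for batch in chunked(to_upsert, 50): index.upsert(vectors=batch)`. With 711 vectors that is 15 batches, matching the fifteen offsets printed above.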


In [19]:
# view index stats 
index_stats = index.describe_index_stats()
index_stats

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 711}},
 'total_vector_count': 711}

In [20]:
# Let's try to search our index!
query = " economic and community development, empowering Appalachian communities to work with their state governments"
qe = openai.Embedding.create(input=[query], engine=embed_model)
res = index.query(qe['data'][0]['embedding'], top_k=5, include_metadata=True)
# res

In [21]:
# res

In [22]:
labeled_text[24]

'Program Name: Planning Assistance to States\nAgency: Corps of Engineers--Civil Works\nMission/Purpose: Water resources\nProgram Description: The Corps uses this funding to provide technical assistance to states, local governments, Indian tribes, and regional and interstate water resources authorities to assist them in their water resources planning efforts.  The Corps would use the requested funds for work related to flood risk management.  The Corps also is able under this program to provide technical analysis to support a broader effort by a state, regional, or interstate authority that is evaluating options involving a range of issues across a large watershed.\nRecipients: U.S. Federal Government\nBeneficiaries: State; U.S. Citizen; U.S. Territories; Local'

In [67]:
# Let's try to search our index!
key = 168
query = labeled_text[key]
qe = openai.Embedding.create(input=[query], engine=embed_model)
res = index.query(qe['data'][0]['embedding'], top_k=5, include_metadata=True)
# res

In [68]:
query = ("You are natural language processing categorization tool.\n" +
         "Given examples of nearest-neighbor US grant programs and their categories. You will assign " +
         "zero to many categories to a new grant program.\n" +
         "The category choices are ['" + "','".join(categories) + "'])\n" +
         "The following are pre-categorized programs and their known categories:\n\n")
for match in res['matches']:
    if (match['id'] != f"fpi-{key}"):  # EXCLUDE THE ACTUAL PROGRAM
        query += (match['metadata']['text'] + "\n\n" +
                  "Categories: " + ', '.join(match['metadata']['categories']) + "\n\n" +
                  "---" + "\n\n")
query += ("Please use the above categorized programs to assign categories to the following program description: \n\n" +
          "Program text: " + labeled_text[key] + "\n\n" +
          "Output a JSON object with four entries: \n" +
          " 1) 'keywords' a string listing important keywords found in the program description that impacts your decision\n" +
          " 2) 'reasoning' a string describing your reasoning and justification of the categorizations\n" +
          " 3) 'categories' an array of strings with your categories (the choices are ['" + "','".join(categories) + "'])\n" +
          " 4) 'confidence' a number from 0 to 100 representing your confidence (100 being the most confident)\n\n" +
          " Please note that a program may not fit into any of the categories.")

print(query)

You are natural language processing categorization tool.
Given examples of nearest-neighbor US grant programs and their categories. You will assign zero to many categories to a new grant program.
The category choices are ['Broadband','Economic Development','Opioid Epidemic Response','STEM Education','Workforce Development','Native American','Flood Risk','A.I. R&D/Quantum R&D','Global Health','Homelessness','HIV/AIDS','Transportation Infrastructure'])
The following are pre-categorized programs and their known categories:

Program Name: Student Support and Academic Enrichment State Grants
Agency: Department of Education
Mission/Purpose: Elementary, secondary, and vocational education
Program Description: The program supports formula grants that are intended to improve academic achievement by increasing the capacity of States and local educational agencies (LEAs) to provide students with access to a well-rounded education and improve school conditions and use of technology.
Recipients: U.

In [69]:
# now query GPT 3.5
chat = openai.ChatCompletion.create(
    model='gpt-3.5-turbo',
    temperature=0,
    messages=[
        {"role": "user", "content" : query}
    ]
)

print(chat['choices'][0]['message']['content'])

{
  "keywords": "low-income students, postsecondary education",
  "reasoning": "Based on the program description, the focus of the grant program is to increase the number of low-income students who are prepared for and succeed in postsecondary education. This aligns with the mission/purpose of higher education. Additionally, the beneficiaries include Education (0-8) and Education (9-12), indicating a focus on students at various educational levels.",
  "categories": ["STEM Education"],
  "confidence": 80
}


In [70]:
labeled_categories[key]

['Native American']

In [326]:
import json

# Gets the categories, excluding key
# openai, index, categories, embed_model are global
def llm_vd_get_categories(text, key, top_k=5):
    # First get the related hits from Pinecone
    qe = openai.Embedding.create(input=[text], engine=embed_model)
    res = index.query(qe['data'][0]['embedding'], top_k=top_k, include_metadata=True)
    # Construct the query
    query = ("You are natural language processing categorization tool.\n" +
             "Given examples of nearest-neighbor US grant programs and their categories. You will assign " +
             "zero to many categories to a new grant program.\n" +
             "The category choices are ['" + "','".join(categories) + "'])\n" +
             "The following are pre-categorized programs and their known categories:\n\n")
    for match in res['matches']:
        if (match['id'] != f"fpi-{key}"):  # EXCLUDE THE ACTUAL PROGRAM
            query += (match['metadata']['text'] + "\n\n" +
                      "Categories: " + ', '.join(match['metadata']['categories']) + "\n\n" +
                      "---" + "\n\n")
    query += ("Please use the above categorized programs to assign categories to the following program description: \n\n" +
              "Program text: " + text + "\n\n" +  # use the text argument rather than the global list
              "Output a JSON object with four entries: \n" +
              " 1) 'keywords' a string listing important keywords found in the program description that impacts your decision\n" + 
              " 2) 'reasoning' a string describing your reasoning and justification of the categorizations\n" + 
              " 3) 'categories' an array of strings with your categories (the choices are ['" + "','".join(categories) + "'])\n" +
              " 4) 'confidence' a number from 0 to 100 representing your confidence (100 being the most confident)\n\n" +
              " Please note that a program may not fit into any of the categories.")
    # print(query)
    chat = openai.ChatCompletion.create(
        model='gpt-3.5-turbo',
        messages=[{"role": "user", "content" : query}]
    )
    return(json.loads(chat['choices'][0]['message']['content']))
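
The `json.loads` call above raises an exception if the model wraps its JSON in extra prose or returns something malformed, which would stop a long batch run. A defensive sketch, assuming a hypothetical helper named `parse_llm_json`:

```python
import json


def parse_llm_json(content):
    """Parse a JSON object from a model reply, tolerating text around it.

    Returns None when no valid JSON object can be recovered.
    """
    try:
        parsed = json.loads(content)
    except json.JSONDecodeError:
        # Fall back to the outermost {...} span, e.g. when the reply adds prose or code fences
        start, end = content.find("{"), content.rfind("}")
        if start == -1 or end <= start:
            return None
        try:
            parsed = json.loads(content[start:end + 1])
        except json.JSONDecodeError:
            return None
    return parsed if isinstance(parsed, dict) else None
```

A caller could treat a `None` result as "retry" or "record as uncategorized"; which is better depends on how the results are scored.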

In [35]:
llm_vd_get_categories(labeled_text[24],24)

{'keywords': 'technical assistance, water resources planning, flood risk management',
 'reasoning': 'Based on the keywords found in the program description, it is clear that the program focuses on providing technical assistance for water resources planning, specifically related to flood risk management.',
 'categories': ['Flood Risk'],
 'confidence': 100}

In [395]:
found_categories=[]
if os.path.exists("found_categories.json"):
    with open("found_categories.json","r") as infile:
        found_categories = json.load(infile)
        
# found_categories

In [131]:
# The following loop is resumable (it skips programs that already have saved results),
# so we can calculate the categories for all of the FPI programs in batches
for i in range(0,len(labeled_text)):
    if (i >= len(found_categories)):
        chat_result = llm_vd_get_categories(labeled_text[i],i)
        chat_result["key"] = i # Record key just in case
        print(chat_result)
        found_categories.append(chat_result)
        with open("found_categories.json","w") as outfile:
            outfile.write(json.dumps(found_categories))
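
The load-then-resume pattern in the two cells above can be sketched as one reusable function. `run_with_checkpoint` is a hypothetical name, and writing through a temporary file is an addition made here so that an interrupted run cannot leave a half-written checkpoint.

```python
import json
import os


def run_with_checkpoint(items, process, path):
    """Apply `process(item, i)` to each item, saving all results after every step.

    Re-running resumes at the first item that has no saved result.
    """
    results = []
    if os.path.exists(path):
        with open(path, "r") as infile:
            results = json.load(infile)
    for i in range(len(results), len(items)):
        results.append(process(items[i], i))
        # Write to a temporary file, then swap it into place
        tmp_path = path + ".tmp"
        with open(tmp_path, "w") as outfile:
            json.dump(results, outfile)
        os.replace(tmp_path, path)
    return results
```

Here `process` would be something like `llm_vd_get_categories`, and `path` would be `"found_categories.json"`.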
   

In [132]:
# Count false positives, false negatives, true positives and true negatives
fp_total = 0
fn_total = 0
tp_total = 0
tn_total = 0
total_categories = len(set(categories))
for fc in found_categories:
    i = fc["key"]
    real_categories = set(labeled_categories[i])
    assigned_categories = set(categories).intersection(set(fc["categories"]))
#    print(f"{i}: {real_categories} -> {assigned_categories}")
#    print(f" TP={len(real_categories.intersection(assigned_categories))}")
    tp = len(real_categories.intersection(assigned_categories))
    tp_total += tp
#    print(f" FP={len(assigned_categories.difference(real_categories))}")
    fp = len(assigned_categories.difference(real_categories))
    fp_total += fp
#    print(f" FN={len(real_categories.difference(assigned_categories))}")
    fn = len(real_categories.difference(assigned_categories))
    fn_total += fn
#    print(f" TN={total_categories - tp - fp - fn}")
    tn = total_categories - tp - fp - fn
    tn_total += tn
#    print(f" {tp_total} {fp_total} {fn_total} {tn_total}")
    

In [133]:
accuracy = (tp_total+tn_total)/(tp_total+tn_total+fp_total+fn_total)
precision = tp_total/(tp_total + fp_total)
recall = tp_total/(tp_total + fn_total)
f1 = 2.0 / ((1/precision) + (1/recall))
print(f"Accuracy: {accuracy*100}")
print(f"Precision: {precision*100}")
print(f"Recall: {recall*100}")
print(f"f1: {f1*100}")


Accuracy: 96.24941397093296
Precision: 79.31844888366626
Recall: 82.41758241758241
f1: 80.83832335329342
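
The counting above treats every (program, category) pair as one yes/no decision. A compact sketch of the same bookkeeping, with a guard against division by zero for categories that are never assigned; `multilabel_counts` and `summarize` are hypothetical names.

```python
def multilabel_counts(real, assigned, all_labels):
    """Return (tp, fp, fn, tn) for one item, one yes/no decision per label.

    Assumes every label in `real` also appears in `all_labels`.
    """
    labels = set(all_labels)
    real, assigned = set(real), set(assigned) & labels
    tp = len(real & assigned)
    fp = len(assigned - real)
    fn = len(real - assigned)
    tn = len(labels) - tp - fp - fn
    return tp, fp, fn, tn


def summarize(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1, using 0.0 where a ratio is undefined."""
    def ratio(num, den):
        return num / den if den else 0.0
    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
    }
```

For the program with key 168 shown earlier (labeled `['Native American']`, assigned `['STEM Education']`) and the 12 categories, this gives one false positive, one false negative, and ten true negatives. The many true negatives are why accuracy is so much higher than precision and recall.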


In [134]:
# Calculate and display Accuracy, Precision, and Recall for each category
precision_data = []
fp_total = 0
fn_total = 0
tp_total = 0
tn_total = 0
for c in categories:
    in_cat = [(c in lc) for lc in labeled_categories]
    assigned_cat = [(c in fc["categories"]) for fc in found_categories]
    tp = ([(a and b) for a, b in zip(in_cat, assigned_cat)]).count(True)
    tn = ([(not(a) and not(b)) for a, b in zip(in_cat, assigned_cat)]).count(True)
    fp = ([(not(a) and b) for a, b in zip(in_cat, assigned_cat)]).count(True)
    fn = ([(a and not(b)) for a, b in zip(in_cat, assigned_cat)]).count(True)
    accuracy = (tp+tn)/(tp+tn+fp+fn)
    precision = tp/(tp + fp)
    recall = tp/(tp + fn)
    f1 = 2.0 / ((1/precision) + (1/recall))
    precision_data.append([c,tp,tn,fp,fn,accuracy,precision,recall,f1])
    tp_total += tp
    tn_total += tn
    fp_total += fp
    fn_total += fn
    
accuracy = (tp_total+tn_total)/(tp_total+tn_total+fp_total+fn_total)
precision = tp_total/(tp_total + fp_total)
recall = tp_total/(tp_total + fn_total)
f1 = 2.0 / ((1/precision) + (1/recall))
precision_data.append(["All Categories",tp_total,tn_total,fp_total,fn_total,accuracy,precision,recall,f1])

df = pd.DataFrame(precision_data, columns = ['Category','True Pos','True Neg','False Pos',
                                             'False Neg','Accuracy','Precision','Recall','F1'])
df

| | Category | True Pos | True Neg | False Pos | False Neg | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Broadband | 28 | 665 | 9 | 9 | 0.974684 | 0.756757 | 0.756757 | 0.756757 |
| 1 | Economic Development | 41 | 639 | 27 | 4 | 0.956399 | 0.602941 | 0.911111 | 0.725664 |
| 2 | Opioid Epidemic Response | 35 | 660 | 5 | 11 | 0.977496 | 0.875 | 0.76087 | 0.813953 |
| 3 | STEM Education | 72 | 600 | 23 | 16 | 0.945148 | 0.757895 | 0.818182 | 0.786885 |
| 4 | Workforce Development | 62 | 597 | 35 | 17 | 0.926864 | 0.639175 | 0.78481 | 0.704545 |
| 5 | Native American | 225 | 406 | 17 | 63 | 0.887482 | 0.929752 | 0.78125 | 0.849057 |
| 6 | Flood Risk | 40 | 658 | 8 | 5 | 0.981716 | 0.833333 | 0.888889 | 0.860215 |
| 7 | A.I. R&D/Quantum R&D | 52 | 638 | 7 | 14 | 0.970464 | 0.881356 | 0.787879 | 0.832 |
| 8 | Global Health | 25 | 677 | 9 | 0 | 0.987342 | 0.735294 | 1.0 | 0.847458 |
| 9 | Homelessness | 31 | 667 | 11 | 2 | 0.981716 | 0.738095 | 0.939394 | 0.826667 |


In [139]:
categories

['Broadband',
 'Economic Development',
 'Opioid Epidemic Response',
 'STEM Education',
 'Workforce Development',
 'Native American',
 'Flood Risk',
 'A.I. R&D/Quantum R&D',
 'Global Health',
 'Homelessness',
 'HIV/AIDS',
 'Transportation Infrastructure']

In [146]:
# Export key information to a spreadsheet for analysis
data = [[i, lt, sorted(lc), sorted(fc["categories"]), fc["keywords"], 
         fc["reasoning"]] for (i, lt, lc, fc) in 
        zip(range(0,len(labeled_text)), labeled_text, labeled_categories, found_categories)]
                                   
df = pd.DataFrame(data, columns = ['idx','Text','real_categories','found_categories','keywords','reasoning'])
df.to_csv("llm_vd_results.csv")

In [8]:
# When we are done, delete the index
# pinecone.delete_index(index_name)

## How Accurate is Just the Nearest Neighbor (Vector Database)?

In [103]:
# Gets the categories from the nearest neighbor
def vd_get_categories(text, key):
    # Get the related hits from Pinecone (top_k=2 because the best hit may be the program itself)
    qe = openai.Embedding.create(input=[text], engine=embed_model)
    res = index.query(qe['data'][0]['embedding'], top_k=2, include_metadata=True)
    c = []  # Default in case no other program is returned
    for match in res['matches']:
        if (match['id'] != f"fpi-{key}"):  # EXCLUDE THE ACTUAL PROGRAM
            c = match['metadata']['categories'] # Grab the categories from the first match
            break
    return c
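
The same leave-one-out lookup can be sketched without a vector database, which is handy for checking the logic on a few hand-made vectors. `nearest_neighbor_categories` is a hypothetical name, and the sketch assumes unit-length vectors so that the dot product equals cosine similarity.

```python
def nearest_neighbor_categories(query_vector, query_id, records):
    """Return the categories of the most similar record other than `query_id`.

    `records` is a list of (id, vector, categories) tuples with unit-length vectors.
    Returns [] if there is no other record.
    """
    best_score, best_categories = None, []
    for record_id, vector, categories in records:
        if record_id == query_id:
            continue  # leave the query program itself out
        score = sum(q * v for q, v in zip(query_vector, vector))
        if best_score is None or score > best_score:
            best_score, best_categories = score, categories
    return best_categories
```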


In [121]:
found_vd_categories=[]
if os.path.exists("found_vd_categories.json"):
    with open("found_vd_categories.json","r") as infile:
        found_vd_categories = json.load(infile)

In [122]:
# The following loop is resumable (it skips programs that already have saved results),
# so we can calculate the categories for all of the FPI programs in batches
for i in range(0,len(labeled_text)):
    if (i >= len(found_vd_categories)):
        cat = vd_get_categories(labeled_text[i],i)
        print(f"{i} {cat}")
        found_vd_categories.append(cat)
        with open("found_vd_categories.json","w") as outfile:
            outfile.write(json.dumps(found_vd_categories))

In [123]:
# Calculate and display Accuracy, Precision, and Recall for each category
precision_data = []
fp_total = 0
fn_total = 0
tp_total = 0
tn_total = 0
for c in categories:
    in_cat = [(c in lc) for lc in labeled_categories]
    assigned_cat = [(c in fc) for fc in found_vd_categories]
    tp = ([(a and b) for a, b in zip(in_cat, assigned_cat)]).count(True)
    tn = ([(not(a) and not(b)) for a, b in zip(in_cat, assigned_cat)]).count(True)
    fp = ([(not(a) and b) for a, b in zip(in_cat, assigned_cat)]).count(True)
    fn = ([(a and not(b)) for a, b in zip(in_cat, assigned_cat)]).count(True)
    accuracy = (tp+tn)/(tp+tn+fp+fn)
    precision = tp/(tp + fp)
    recall = tp/(tp + fn)
    f1 = 2.0 / ((1/precision) + (1/recall))
    precision_data.append([c,tp,tn,fp,fn,accuracy,precision,recall,f1])
    tp_total += tp
    tn_total += tn
    fp_total += fp
    fn_total += fn
    
accuracy = (tp_total+tn_total)/(tp_total+tn_total+fp_total+fn_total)
precision = tp_total/(tp_total + fp_total)
recall = tp_total/(tp_total + fn_total)
f1 = 2.0 / ((1/precision) + (1/recall))
precision_data.append(["All Categories",tp_total,tn_total,fp_total,fn_total,accuracy,precision,recall,f1])

df = pd.DataFrame(precision_data, columns = ['Category','True Pos','True Neg','False Pos',
                                             'False Neg','Accuracy','Precision','Recall','F1'])
df

| | Category | True Pos | True Neg | False Pos | False Neg | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Broadband | 27 | 666 | 8 | 10 | 0.974684 | 0.771429 | 0.72973 | 0.75 |
| 1 | Economic Development | 38 | 655 | 11 | 7 | 0.974684 | 0.77551 | 0.844444 | 0.808511 |
| 2 | Opioid Epidemic Response | 35 | 657 | 8 | 11 | 0.973277 | 0.813953 | 0.76087 | 0.786517 |
| 3 | STEM Education | 60 | 601 | 22 | 28 | 0.929677 | 0.731707 | 0.681818 | 0.705882 |
| 4 | Workforce Development | 50 | 610 | 22 | 29 | 0.92827 | 0.694444 | 0.632911 | 0.662252 |
| 5 | Native American | 259 | 382 | 41 | 29 | 0.901547 | 0.863333 | 0.899306 | 0.880952 |
| 6 | Flood Risk | 43 | 658 | 8 | 2 | 0.985935 | 0.843137 | 0.955556 | 0.895833 |
| 7 | A.I. R&D/Quantum R&D | 51 | 625 | 20 | 15 | 0.950774 | 0.71831 | 0.772727 | 0.744526 |
| 8 | Global Health | 21 | 685 | 1 | 4 | 0.992968 | 0.954545 | 0.84 | 0.893617 |
| 9 | Homelessness | 28 | 671 | 7 | 5 | 0.983122 | 0.8 | 0.848485 | 0.823529 |


## Let's Try to Improve the Model
Can the LLM analyze what it got wrong and develop a list of "things to consider"? Let's iterate over the incorrect answers and try to develop this list!

In [222]:
def get_things_to_consider(incorrect_assignments, lessons_so_far=""):
    # Construct the query
    query = ("You are natural language processing categorization tool that is trying to improve past results.\n" +
             "Your goal is to analyze incorrect category assignments and figure out a series of \n"+
             "'lessons learned' that you can apply to future categorizations. (For example 'If X\n" +
             "appears in the description, we probably should assign Y')\n\n")
    if (lessons_so_far != ""):
        query += f"The lessons learned we have so far are:\n\n{lessons_so_far}\n\n"

    query += "The following is a collection of program data, the 'correct' assignment, and your assignment:\n\n"
    for i in incorrect_assignments:
        query += i[1] + "\n\n"
        query += "Correct Categories: " + ", ".join(i[2]) + "\n"
        query += "Your Categories: " + ", ".join(i[3]) + "\n\n---\n\n"

    # Ask for an updated list if we already have lessons, otherwise for a first list
    if (lessons_so_far != ""):
        query += ("Please repeat the lessons we have so far with any updates or additions you think will \n" +
                  "help. If you have nothing new to add, please simply repeat the lessons we have above.\n")
    else:
        query += ("Please output any lessons learned you think will \n" +
                  "help. If you have nothing to add, simply reply with an empty string.\n")
    query += ("\n\nPlease create a full and complete lessons learned list (not referencing past lists) " +
              " and limited to the categories: " + ", ".join(categories) + "\n")
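    print(query)
    # Early exit: the prompt is only printed (for pasting into ChatGPT by hand),
    # so the API call below never runs while this return is in place
    return ""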
    chat = openai.ChatCompletion.create(
        model='gpt-3.5-turbo-16k',
        messages=[{"role": "user", "content" : query}]
    )
    return(chat['choices'][0]['message']['content'])

In [157]:
# Let's reconstruct the full list of results
data = [[i, lt, sorted(lc), sorted(fc["categories"]), fc["keywords"], 
         fc["reasoning"]] for (i, lt, lc, fc) in 
        zip(range(0,len(labeled_text)), labeled_text, labeled_categories, found_categories)]

In [158]:
# Let's pull out the rows that are wrong
incorrect_assignments = [d for d in data if d[2] != d[3]]

In [209]:
lessons = ""

In [237]:
len(incorrect_assignments)

242

In [405]:
# Note: I manually ran the following in batches of 50 to review the results in ChatGPT-3
# I reran the following in batches of 20 and pasted them into ChatGPT-4
# lessons = get_things_to_consider(incorrect_assignments[221:242],lessons)
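
The manual batch-by-batch procedure described above amounts to folding the incorrect assignments into a running lessons string. A sketch of that loop, assuming a hypothetical `accumulate_lessons` helper and a `refine` callable that stands in for `get_things_to_consider` (which would first need its early `return ""` removed to actually call the API):

```python
def accumulate_lessons(incorrect, refine, batch_size=20, lessons=""):
    """Feed misclassified items to `refine` in batches, carrying the lessons text forward.

    `refine(batch, lessons)` must return the updated lessons string.
    """
    for start in range(0, len(incorrect), batch_size):
        batch = incorrect[start:start + batch_size]
        updated = refine(batch, lessons)
        # Keep the previous lessons if nothing usable came back
        if updated and updated.strip():
            lessons = updated
    return lessons
```

Reviewing each intermediate result by hand, as the notebook does, is still worthwhile, since nothing here checks that a later batch has not dropped an earlier lesson.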

In [239]:
lessons = """
1. "Broadband": Assign this category to programs that mention the development or enhancement of broadband or Information Technology (IT) infrastructure, or programs aimed at making internet access affordable or universally available. If a program facilitates remote health care services (telehealth), it should also be categorized under "Broadband". But remember, a program's involvement with technology does not automatically classify it under this category. 

2. "Economic Development": This category is applicable for programs primarily aimed at improving economic conditions or developing business and commercial capacity. However, avoid categorizing programs that primarily focus on community services, environmental protection, or health under this category, even though they might indirectly contribute to economic development.

3. "Opioid Epidemic Response": This category should be assigned to programs explicitly focusing on addressing the opioid crisis, including treatment, prevention, and comprehensive strategies for opioid misuse. Programs that mention substance misuse as a peripheral issue should not be categorized under this category.

4. "STEM Education": Programs mentioning "education", "training", or "student" in connection with Science, Technology, Engineering, and Mathematics subjects should be categorized as "STEM Education". However, not all education or training programs necessarily fall under this category.

5. "Workforce Development": Programs mentioning job training, re-employment opportunities, enhancing workforce productivity, or facilitating worker training and education should be categorized under "Workforce Development". Nonetheless, not every program that benefits workers or the unemployed should be assigned this category.

6. "Native American": Assign this category to programs specifically serving "Indian/Native American Tribal Government", "Indian Tribes", "Native American Organizations", "Federally Recognized Indian Tribal Governments", or specific native communities. Do not assign this category just because a program may indirectly benefit Native American communities.

7. "Flood Risk": Programs that involve "hydrologic studies", "water resources projects", "storm events", or specifically mention "flood control" should be categorized under "Flood Risk". However, general environmental or infrastructure programs should not necessarily fall under this category. 

8. "A.I. R&D/Quantum R&D": This category should be assigned to programs explicitly mentioning the development or application of advanced technologies like Artificial Intelligence, Machine Learning, or Quantum Research and Development. Avoid assigning this category solely based on the involvement of technology or research.

9. "Global Health": Programs addressing health concerns at a global level, including pandemics, should be categorized under "Global Health". However, programs addressing specific diseases or health crises should not be categorized under this category if they do not have a clear global health focus.

10. "Homelessness": Programs directly addressing the issue of homelessness or substandard housing, aiming to improve living conditions or build new houses, should be categorized under "Homelessness". However, programs addressing issues related to homelessness, like substance misuse or economic hardship, should not necessarily be assigned this category.

11. "HIV/AIDS": Programs directly aimed at the prevention, treatment, or management of HIV/AIDS should be categorized under "HIV/AIDS". 

12. "Transportation Infrastructure": Programs that mention the development or maintenance of transportation infrastructure, like roads or bridges, should be categorized under "Transportation Infrastructure". However, the transportation aspect should not override other primary focuses such as STEM education or workforce development.

Each program should be analyzed in its entirety, considering its primary mission and objectives, and the specific context provided in the program description and beneficiary/recipient information. The mere presence of certain terms doesn't guarantee that a category applies. Always consider the primary purpose of the program.
"""

## Redo the Experiment with Lessons Learned
Let's rerun our categorization, but this time incorporating our lessons learned.

In [349]:
# ChatGPT-3.5 version
lessons = """Lessons learned:

1. Keywords such as "broadband" and "internet connectivity" in the program description often indicate the "Broadband" category.
2. "Economic Development" can be inferred when program descriptions mention initiatives to stimulate local economies or industry growth.
3. Programs that aim to support indigenous communities or organizations typically fall into the "Native American" category.
4. The "Workforce Development" category is often associated with program descriptions that mention job training, employment opportunities, or support for individuals in improving their skills.
5. Programs focusing on science, technology, engineering, or mathematics education often fall into the "STEM Education" category.
6. The "Flood Risk" category can be determined when programs address issues related to flood control, hazard forecasting, or emergency response in coastal or waterway areas.
7. When program descriptions mention substance abuse treatment, recovery services, or workforce re-entry after treatment, these often fall under "Opioid Epidemic Response".
8. When program descriptions mention road, bridge, or waterfront development projects, these often categorize under "Transportation Infrastructure".
9. Program descriptions mentioning international health assistance, child nutrition, or support for low-income countries often indicate the "Global Health" category.
10. Programs described as services for individuals experiencing homelessness or related support programs usually categorize as "Homelessness".
11. Programs with efforts related to HIV prevention, care, treatment, and support usually fall under the "HIV/AIDS" category.
12. The "A.I. R&D/Quantum R&D" category can be inferred when program descriptions focus on the development of new AI or quantum technologies or research.
"""

In [352]:
# ChatGPT-4 version
lessons = """Lessons learned:
1. "Broadband": Assign this category to programs that mention the development or enhancement of broadband or Information Technology (IT) infrastructure, or programs aimed at making internet access affordable or universally available. If a program facilitates remote health care services (telehealth), it should also be categorized under "Broadband".
2. "Economic Development": This category is applicable for programs primarily aimed at improving economic conditions or developing business and commercial capacity. 
3. "Opioid Epidemic Response": This category should be assigned to programs explicitly focusing on addressing the opioid crisis, including treatment, prevention, and comprehensive strategies for opioid misuse. 
4. "STEM Education": Programs mentioning "education", "training", or "student" in connection with Science, Technology, Engineering, and Mathematics subjects should be categorized as "STEM Education". 
5. "Workforce Development": Programs mentioning job training, re-employment opportunities, enhancing workforce productivity, or facilitating worker training and education should be categorized under "Workforce Development". 
6. "Native American": Assign this category to programs specifically serving "Indian/Native American Tribal Government", "Indian Tribes", "Native American Organizations", "Federally Recognized Indian Tribal Governments", or specific native communities. 
7. "Flood Risk": Programs that involve "hydrologic studies", "water resources projects", "storm events", or specifically mention "flood control" should be categorized under "Flood Risk". However, general environmental or infrastructure programs should not necessarily fall under this category. 
8. "A.I. R&D/Quantum R&D": This category should be assigned to programs explicitly mentioning the development or application of advanced technologies like Artificial Intelligence, Machine Learning, or Quantum Research and Development. Avoid assigning this category solely based on the involvement of technology or research.
9. "Global Health": Programs addressing health concerns at a global level, including pandemics, should be categorized under "Global Health". 
10. "Homelessness": Programs directly addressing the issue of homelessness or substandard housing, aiming to improve living conditions or build new houses, should be categorized under "Homelessness". 
11. "HIV/AIDS": Programs directly aimed at the prevention, treatment, or management of HIV/AIDS should be categorized under "HIV/AIDS". 
12. "Transportation Infrastructure": Programs that mention the development or maintenance of transportation infrastructure, like roads or bridges, should be categorized under "Transportation Infrastructure". However, the transportation aspect should not override other primary focuses such as STEM education or workforce development.
"""


In [381]:
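# Categorize `text` using nearest-neighbor examples from Pinecone plus the lessons learned above.
# The program's own vector (id "fpi-<key>") is excluded from the examples.
# openai, index, categories, embed_model, and lessons are globals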
def llm_vd_get_categories_wll(text, key, top_k=5):
    # First get the related hits from Pinecone
    qe = openai.Embedding.create(input=[text], engine=embed_model)
    res = index.query(qe['data'][0]['embedding'], top_k=top_k, include_metadata=True)
    # Construct the query
    query = ("You are a natural language processing categorization tool.\n" +
             "Given examples of nearest-neighbor US grant programs and their categories, you will assign " +
             "zero to many categories to a new grant program.\n" +
             "The category choices are ['" + "','".join(categories) + "']\n" +
             "The following are pre-categorized programs and their known categories:\n\n")
    for match in res['matches']:
        if (match['id'] != f"fpi-{key}"):  # EXCLUDE THE ACTUAL PROGRAM
            query += (match['metadata']['text'] + "\n\n" +
                      "Categories: " + ', '.join(match['metadata']['categories']) + "\n\n" +
                      "---" + "\n\n")
    query += ("Please use the above categorized programs to assign categories. " +
              "In addition, " +
              "consider the following lessons learned from past mistakes: \n" + lessons + "\n\n" +
              "Categorize the following program definition: \n\n" +
              "Program text: " + text + "\n\n" +  # same text that was embedded for the Pinecone query
              "Output a JSON object (do not output any other text) with four entries: \n" +
              " 1) 'clues' First, list CLUES (i.e. keywords, phrases, contextual information, semantic relations, semantic meaning, tones, references) that support the categorization\n" +
              " 2) 'reasoning' Second, deduce the diagnostic REASONING process from premises (i.e., clues, input) that supports the categorization\n" +
              " 3) 'categories' an array of strings with your categories based upon the clues and reasoning (the choices are ['" + "','".join(categories) + "'])\n" +
              " 4) 'confidence' a number from 0 to 100 representing your confidence (100 being the most confident)\n\n")
    # print(query)
    chat = openai.ChatCompletion.create(
        model='gpt-3.5-turbo',
        messages=[{"role": "user", "content" : query}]
    )
    # The reply must be a bare JSON object; json.loads raises otherwise
    return json.loads(chat['choices'][0]['message']['content'])
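
The function above passes the model's reply straight to `json.loads`, which raises if the model wraps the JSON in extra prose, and it does not check that the returned categories come from the allowed list. The following is a minimal sketch of a more tolerant parser, assuming the reply contains exactly one JSON object. `parse_llm_reply` and `sample_reply` are hypothetical names used for illustration and are not part of the notebook.

```python
import json

def parse_llm_reply(reply, allowed_categories):
    """Extract the JSON object spanning the first '{' to the last '}' in a
    model reply and drop any category not in the allowed list.
    Hypothetical helper, not part of the original notebook."""
    start = reply.find("{")
    end = reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in reply")
    result = json.loads(reply[start:end + 1])
    allowed = set(allowed_categories)
    # Keep only categories the prompt actually offered
    result["categories"] = [c for c in result.get("categories", []) if c in allowed]
    return result

# Made-up reply: extra text around the JSON and one category outside the list
sample_reply = ('Here is the result:\n'
                '{"clues": ["rural internet"], "reasoning": "...", '
                '"categories": ["Broadband", "Space"], "confidence": 80}\n'
                'Done.')
parsed = parse_llm_reply(sample_reply, ["Broadband", "Flood Risk"])
print(parsed["categories"])  # ['Broadband']
```

A reply with no JSON object still raises `ValueError`, so a caller that retries on any exception would simply ask the model again.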

In [None]:
llm_vd_get_categories_wll(labeled_text[10],10)

In [359]:
llm_vd_get_categories(labeled_text[6],6)

In [368]:
# labeled_categories[24]

In [399]:
found_categories=[]
if os.path.exists("found_categories_with_lessons.json"):
    with open("found_categories_with_lessons.json","r") as infile:
        found_categories = json.load(infile)

In [400]:
# The following loop is resumable: results are saved after every program, so rerunning the cell
# continues from the first FPI program not yet in found_categories. A failed call is retried after 5 seconds.
for i in range(0,len(labeled_text)):
    if (i >= len(found_categories)):
        while True:
            try:
                chat_result = llm_vd_get_categories_wll(labeled_text[i],i)
                break
            except Exception as e:
                print(e)
                time.sleep(5)
        chat_result["key"] = i # Record key just in case
        # print(chat_result)
        print(f"{i}")
        found_categories.append(chat_result)
        with open("found_categories_with_lessons.json","w") as outfile:
            outfile.write(json.dumps(found_categories))
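
The loop above retries a failed call forever with a fixed 5 second pause, so a persistent error (for example an invalid API key) would never end. Below is a minimal sketch of a bounded retry with exponential backoff. `call_with_retries`, its parameters, and the `flaky` stand-in are hypothetical names for illustration only.

```python
import time

def call_with_retries(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn() until it succeeds, doubling the wait after each failure.
    Re-raises the last exception once max_attempts is exhausted.
    Hypothetical helper, not part of the original notebook."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as e:
            if attempt == max_attempts - 1:
                raise
            delay = base_delay * (2 ** attempt)
            print(f"attempt {attempt + 1} failed ({e}); retrying in {delay}s")
            sleep(delay)

# Stand-in for the API call: fails twice, then succeeds
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return {"categories": []}

waits = []
result = call_with_retries(flaky, sleep=waits.append)  # record waits instead of sleeping
print(result, waits)  # {'categories': []} [1.0, 2.0]
```

In the loop, the call could then be written as `call_with_retries(lambda: llm_vd_get_categories_wll(labeled_text[i], i))`, assuming the helper is defined in an earlier cell.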
   

In [401]:
# found_categories = []
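
The evaluation cell that follows tallies true/false positives and negatives per category and derives accuracy, precision, recall, and F1 from them. As a compact restatement, here is a sketch of those formulas in one function. `binary_metrics` is a hypothetical name, and the sketch assumes the notebook's convention that a ratio with a zero denominator is reported as 1. F1 is written in the equivalent form 2·TP / (2·TP + FP + FN).

```python
def binary_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion counts.
    Assumed convention: a ratio with a zero denominator is reported as 1.
    Hypothetical helper, not part of the original notebook."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 1.0
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    denom = 2 * tp + fp + fn
    # Equals the harmonic mean of precision and recall; 0 when tp == 0 but errors exist
    f1 = 2 * tp / denom if denom else 1.0
    return accuracy, precision, recall, f1

# Counts taken from the "Broadband" row of the results table
acc, prec, rec, f1 = binary_metrics(tp=31, tn=666, fp=8, fn=6)
print(round(acc, 6), round(prec, 6), round(rec, 6), round(f1, 6))
```

With the Broadband counts, the values agree with that row of the table to six decimal places.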

In [402]:
# Calculate and display Accuracy, Precision, Recall, and F1 for each category
precision_data = []
fp_total = 0
fn_total = 0
tp_total = 0
tn_total = 0
num = len(found_categories)
for c in categories:
    in_cat = [(c in lc) for lc in labeled_categories[0:num]]
    assigned_cat = [(c in fc["categories"]) for fc in found_categories[0:num]]
    tp = ([(a and b) for a, b in zip(in_cat, assigned_cat)]).count(True)
    tn = ([(not(a) and not(b)) for a, b in zip(in_cat, assigned_cat)]).count(True)
    fp = ([(not(a) and b) for a, b in zip(in_cat, assigned_cat)]).count(True)
    fn = ([(a and not(b)) for a, b in zip(in_cat, assigned_cat)]).count(True)
    if (tp+tn+fp+fn) > 0:
        accuracy = (tp+tn)/(tp+tn+fp+fn)
    else:
        accuracy = 1
    if (tp + fp) > 0:
        precision = tp/(tp + fp)
    else:
        precision = 1
    if (tp + fn) > 0:
        recall = tp/(tp + fn)
    else:
        recall = 1
    if (precision > 0) and (recall > 0):
        f1 = 2.0 / ((1/precision) + (1/recall))
    else:
        # Reached only when tp == 0 with at least one fp or fn, so the harmonic mean is 0
        f1 = 0
    precision_data.append([c,tp,tn,fp,fn,accuracy,precision,recall,f1])
    tp_total += tp
    tn_total += tn
    fp_total += fp
    fn_total += fn
    
accuracy = (tp_total+tn_total)/(tp_total+tn_total+fp_total+fn_total)
precision = tp_total/(tp_total + fp_total)
recall = tp_total/(tp_total + fn_total)
f1 = 2.0 / ((1/precision) + (1/recall))
precision_data.append(["All Categories",tp_total,tn_total,fp_total,fn_total,accuracy,precision,recall,f1])

df = pd.DataFrame(precision_data, columns = ['Category','True Pos','True Neg','False Pos',
                                             'False Neg','Accuracy','Precision','Recall','F1'])
df

Unnamed: 0,Category,True Pos,True Neg,False Pos,False Neg,Accuracy,Precision,Recall,F1
0,Broadband,31,666,8,6,0.980309,0.794872,0.837838,0.815789
1,Economic Development,37,629,37,8,0.936709,0.5,0.822222,0.621849
2,Opioid Epidemic Response,34,658,7,12,0.973277,0.829268,0.73913,0.781609
3,STEM Education,68,593,30,20,0.929677,0.693878,0.772727,0.731183
4,Workforce Development,62,588,44,17,0.914205,0.584906,0.78481,0.67027
5,Native American,224,403,20,64,0.881857,0.918033,0.777778,0.842105
6,Flood Risk,39,654,12,6,0.974684,0.764706,0.866667,0.8125
7,A.I. R&D/Quantum R&D,49,635,10,17,0.962025,0.830508,0.742424,0.784
8,Global Health,25,674,12,0,0.983122,0.675676,1.0,0.806452
9,Homelessness,27,662,16,6,0.969058,0.627907,0.818182,0.710526
