## Using a Large Language Model and a Vector Database for Categorization

### Set OPENAI_API_KEY before running Jupyter
You need an API key from https://platform.openai.com/account/api-keys. Export it in the shell that launches Jupyter:

    export OPENAI_API_KEY="secret key from site"

In [1]:
import sys
print(sys.version)

3.10.11 (main, Apr  5 2023, 14:15:30) [GCC 7.5.0]


Log in to OpenAI with OPENAI_API_KEY

In [2]:
import os
import openai

# get API key from top-right dropdown on OpenAI website
openai.api_key = os.getenv("OPENAI_API_KEY") or "OPENAI_API_KEY"

# openai.Engine.list()  # check we have authenticated

Connect to Pinecone with PINECONE_API_KEY and PINECONE_ENVIRONMENT

In [4]:
import pinecone
from tqdm import tqdm

api_key = os.getenv("PINECONE_API_KEY") or "PINECONE_API_KEY"
# find your environment next to the api key in pinecone console
env = os.getenv("PINECONE_ENVIRONMENT") or "PINECONE_ENVIRONMENT"

pinecone.init(api_key=api_key, environment=env)
# pinecone.whoami()
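
The `or "PINECONE_API_KEY"` fallback above substitutes a placeholder string when the variable is unset, so a missing key only surfaces later as an authentication error. A minimal sketch of failing fast instead is shown here; `require_env` is a hypothetical helper for illustration and is not part of the OpenAI or Pinecone clients.

```python
import os

def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or blank."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value

# Example usage (commented out so the sketch runs without real credentials):
# api_key = require_env("PINECONE_API_KEY")
# env = require_env("PINECONE_ENVIRONMENT")
```

The same helper could be used for `OPENAI_API_KEY` in the earlier cell.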

Create a sample embedding so that we know the embedding length

In [5]:
embed_model = "text-embedding-ada-002"

res = openai.Embedding.create(
    input=["This is sample test that will determine the length"],
    engine=embed_model
)

embedding_length = len(res['data'][0]['embedding'])
embedding_length

1536

In [6]:
index_name = 'openai-fpi-categorization'

# Create the index if it doesn't exist already
if index_name not in pinecone.list_indexes():
    # if does not exist, create index
    pinecone.create_index(
        index_name,
        dimension=embedding_length,
        metric='cosine'
    )

In [7]:
# connect to index
index = pinecone.Index(index_name)
# view index stats 
index_stats = index.describe_index_stats()
index_stats

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 711}},
 'total_vector_count': 711}

## Populate the Vector Database
### Key Imports

In [8]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

### Import our Labeled Text into a DataFrame

In [9]:
# Read the CSV File into a dataframe

labeled_data = pd.read_csv('Federal_Program_Inventory_Pilot_Data.csv',encoding='cp1252',
                           usecols=['Agency','Program Name','Program Description','Mission/Purpose',
                                    'Recipients','Beneficiaries','Associated Categories'])

# Remove duplicate rows
labeled_data = labeled_data.drop_duplicates().reset_index(drop=True)

# Get the categories
categories = pd.read_csv('Federal_Program_Inventory_Pilot_Data.csv',encoding='cp1252',
                         usecols=['Category'])
categories = categories.drop_duplicates().reset_index(drop=True)
categories = [r["Category"] for _, r in categories.iterrows()]
categories

['Broadband',
 'Economic Development',
 'Opioid Epidemic Response',
 'STEM Education',
 'Workforce Development',
 'Native American',
 'Flood Risk',
 'A.I. R&D/Quantum R&D',
 'Global Health',
 'Homelessness',
 'HIV/AIDS',
 'Transportation Infrastructure']

### Let's create our comparison string and our category list

In [10]:
# Let's join the Program Name, Agency, Mission/Purpose, Program Description, Recipients, and Beneficiaries
# into a single text field (make it human readable for the LLM)
labeled_data['text'] = ("Program Name: " + labeled_data['Program Name'] + "\n" +
                        "Agency: " + labeled_data['Agency'] + "\n" +
                        "Mission/Purpose: " + labeled_data['Mission/Purpose'] + "\n" +
                        "Program Description: " + labeled_data['Program Description'] + "\n" +
                        "Recipients: " + labeled_data['Recipients'] + "\n" +
                        "Beneficiaries: " + labeled_data['Beneficiaries'])

# Let's generate the list of categories for each entry
labeled_data['categories'] = labeled_data['Associated Categories'].str.split('; ')
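
One thing to watch with the concatenation above: in pandas, adding strings to a missing value yields a missing value, so a row with any empty field would end up with no `text` at all. Whether the pilot data contains such rows is not shown here. The following is a minimal sketch of a tolerant alternative using plain dictionaries; `build_text`, `split_categories`, and `is_missing` are hypothetical helpers, and skipping empty fields is a design choice of this sketch rather than the notebook's behavior.

```python
import math

FIELDS = ["Program Name", "Agency", "Mission/Purpose",
          "Program Description", "Recipients", "Beneficiaries"]

def is_missing(value) -> bool:
    """True for None or a float NaN (how pandas represents empty cells)."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def build_text(row: dict) -> str:
    """Join the known fields as 'Field: value' lines, skipping missing fields."""
    lines = []
    for field in FIELDS:
        value = row.get(field)
        if not is_missing(value):
            lines.append(f"{field}: {value}")
    return "\n".join(lines)

def split_categories(raw) -> list:
    """Split a ';'-separated category string, tolerating extra spaces and missing values."""
    if is_missing(raw):
        return []
    return [c.strip() for c in str(raw).split(";") if c.strip()]

# With a DataFrame, one possible usage (not run here) would be:
# labeled_data['text'] = labeled_data.apply(lambda r: build_text(r.to_dict()), axis=1)
```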

In [11]:
print(labeled_data['text'][2])

Program Name: POWER Initiative
Agency: Appalachian Regional Commission
Mission/Purpose: Area and regional development
Program Description: The POWER (Partnerships for Opportunity and Workforce and Economic Revitalization) Initiative helps communities and regions that have been affected by job losses in coal mining, coal power plant operations, and coal-related supply chain industries due to the changing economics of America’s energy production. The POWER Initiative funds projects that cultivate economic diversity, enhance job training and re-employment opportunities, create jobs in existing or new industries, attract new sources of investment, and strengthen the Region’s broadband infrastructure.
Recipients: U.S. State Government (including the District of Columbia); Public/State Controlled Institution; Nonprofit with 501(c)(3) Status; Domestic Local Government (includes territories unless otherwise specified)
Beneficiaries: State; Rural; Other public institution/organization; Public n

In [12]:
# Set up arrays of labeled data and calculate the embeddings
labeled_text = labeled_data['text'].tolist()
labeled_categories = labeled_data['categories'].tolist()
embeddings = openai.Embedding.create(input=labeled_text, engine=embed_model)

In [13]:
if (index_stats.total_vector_count == 0):
    to_upsert = [('fpi-'+str(i), embeddings['data'][i]['embedding'], 
                  {'text':labeled_text[i], 'categories':labeled_categories[i]})
                 for i in range(len(labeled_text))]
    # Insert in groups of 50
    for i in range(0, len(to_upsert), 50):
        print(i)
        index.upsert(vectors=to_upsert[i:i+50])
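
The slicing loop above can be factored into a small generator so the batch size lives in one place. This is only a sketch: `chunked` is a hypothetical helper, and the batch size of 50 is simply the value used in this notebook, not a documented limit.

```python
from typing import Iterator, List, Sequence

def chunked(items: Sequence, size: int) -> Iterator[List]:
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])

# 711 items in batches of 50 gives 14 full batches plus one batch of 11
batch_sizes = [len(batch) for batch in chunked(list(range(711)), 50)]
print(len(batch_sizes), batch_sizes[-1])  # 15 11

# Possible usage with the index (not run here):
# for batch in chunked(to_upsert, 50):
#     index.upsert(vectors=batch)
```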

In [14]:
# view index stats 
index_stats = index.describe_index_stats()
index_stats

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 711}},
 'total_vector_count': 711}

In [15]:
# Let's try searching our index!
query = " economic and community development, empowering Appalachian communities to work with their state governments"
qe = openai.Embedding.create(input=[query], engine=embed_model)
res = index.query(qe['data'][0]['embedding'], top_k=5, include_metadata=True)
# res
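
For intuition about what a `top_k` query with `metric='cosine'` computes, here is a brute-force sketch over a toy set of two-dimensional vectors. It is an illustration only: the real index holds 1536-dimensional embeddings and a hosted vector database may use approximate search rather than the exhaustive scan shown here. The ids and vectors are made up.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query, vectors, k=5):
    """Return the k (id, score) pairs most similar to `query`, best first."""
    scored = [(vid, cosine_similarity(query, vec)) for vid, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy data: made-up ids and vectors, for illustration only
toy = {"fpi-0": [1.0, 0.0], "fpi-1": [0.6, 0.8], "fpi-2": [0.0, 1.0]}
matches = top_k([1.0, 0.0], toy, k=2)
print(matches)
```

Note that a stored program queried with its own text comes back as its own best match, which is why later cells skip the match whose id equals the query's key.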

In [17]:
key = 6
print(labeled_text[key])

Program Name: AmeriCorps State & National (Competitive)
Agency: Corporation for National and Community Service
Mission/Purpose: Elementary, secondary, and vocational education
Program Description: AmeriCorps State and National matches individuals with organizations that see service as solution to local, regional, and national challenges. There are thousands of opportunities in locations across the country to serve with nonprofits, schools, public agencies, tribes, and community and faith-based groups. Most AmeriCorps grant funding goes to the State Service Commissions, which in turn award grants to organizations to respond to local needs.
Recipients: U.S. State Government (including the District of Columbia); Nonprofit without 501(c)(3) IRS Status; Nonprofit with 501(c)(3) Status; Private Educational Institution, Nonprofit; Public/State Controlled Institution; Special District Governments or Interstate; Domestic Local Government (includes territories unless otherwise specified); Indian

In [18]:
# Let's try searching our index!
query = labeled_text[key]
qe = openai.Embedding.create(input=[query], engine=embed_model)
res = index.query(qe['data'][0]['embedding'], top_k=5, include_metadata=True)
# res

In [19]:
query = ("You are natural language processing categorization tool.\n" +
         "Given examples of nearest-neighbor US grant programs and their categories. You will assign " +
         "zero to many categories to a new grant program.\n" +
         "The category choices are ['" + "','".join(categories) + "'])\n" +
         "The following are pre-categorized programs and their known categories:\n\n")
for match in res['matches']:
    if (match['id'] != f"fpi-{key}"):  # EXCLUDE THE ACTUAL PROGRAM
        query += (match['metadata']['text'] + "\n\n" +
                  "Categories: " + ', '.join(match['metadata']['categories']) + "\n\n" +
                  "---" + "\n\n")
query += ("Please use the above categorized programs to assign categories to the following program description: \n\n" +
          "Program text: " + labeled_text[key] + "\n\n" +
          "Output a JSON object (do not output any other text) with four entries: \n" +
          " 1) 'clues' First, list CLUES (i.e. Keywords, phrases, contextual information, semantic relations, semantic meaning, tones, references) that support the categorization\n" + 
          " 2) 'reasoning' Second, deduce the diagnostic REASONING process from premises (i.e., clues, input) that supports the categorization\n" + 
          " 3) 'categories' an array of strings with your categories based upon the clues and reasoning (the choices are ['" + "','".join(categories) + "'])\n" +
          " 4) 'confidence' a number from 0 to 100 representing your confidence (100 being the most confident)""\n\n")

print(query)

You are natural language processing categorization tool.
Given examples of nearest-neighbor US grant programs and their categories. You will assign zero to many categories to a new grant program.
The category choices are ['Broadband','Economic Development','Opioid Epidemic Response','STEM Education','Workforce Development','Native American','Flood Risk','A.I. R&D/Quantum R&D','Global Health','Homelessness','HIV/AIDS','Transportation Infrastructure'])
The following are pre-categorized programs and their known categories:

Program Name: AmeriCorps State & National (Formula)
Agency: Corporation for National and Community Service
Mission/Purpose: Elementary, secondary, and vocational education
Program Description: AmeriCorps State and National matches individuals with organizations that see service as solution to local, regional, and national challenges. There are thousands of opportunities in locations across the country to serve with nonprofits, schools, public agencies, tribes, and comm

In [20]:
# now query GPT 3.5
chat = openai.ChatCompletion.create(
    model='gpt-3.5-turbo',
    temperature=0,
    n=1,
    messages=[
        {"role": "user", "content" : query}
    ]
)

#print(chat['choices'][0]['message']['content'])
print('\n\n'.join([c['message']['content'] for c in chat['choices']]))

{
  "clues": [
    "AmeriCorps State and National",
    "matches individuals with organizations",
    "serve with nonprofits, schools, public agencies, tribes, and community and faith-based groups",
    "grant funding goes to the State Service Commissions"
  ],
  "reasoning": "The program description mentions that AmeriCorps State and National matches individuals with organizations to serve with nonprofits, schools, public agencies, tribes, and community and faith-based groups. It also states that grant funding goes to the State Service Commissions. Based on these clues, it can be deduced that the program focuses on providing opportunities for individuals to serve in various sectors and that the funding is distributed through state service commissions.",
  "categories": [
    "STEM Education",
    "Workforce Development"
  ],
  "confidence": 80
}


In [21]:
# now query GPT 4
chat = openai.ChatCompletion.create(
    model='gpt-4',
    temperature=0,
    n=1,
    messages=[
        {"role": "user", "content" : query}
    ]
)

#print(chat['choices'][0]['message']['content'])
print('\n\n'.join([c['message']['content'] for c in chat['choices']]))

{
  "clues": ["AmeriCorps State and National", "Elementary, secondary, and vocational education", "serve with nonprofits, schools, public agencies, tribes, and community and faith-based groups", "Indian/Native American Tribal Government"],
  "reasoning": "The program is similar to the AmeriCorps State & National (Formula) program, which is categorized under STEM Education. It also involves serving with tribes and Indian/Native American Tribal Government, which is a clue for the Native American category. The program's mission and purpose also align with the STEM Education category.",
  "categories": ["STEM Education", "Native American"],
  "confidence": 90
}


In [451]:
labeled_categories[key]

['Native American', 'STEM Education']

In [452]:
import json

# Gets the categories, excluding key
# openai, index, categories, embed_model are global
def llm_vd_get_categories(text, key, top_k=5):
    # First get the related hits from Pinecone
    qe = openai.Embedding.create(input=[text], engine=embed_model)
    res = index.query(qe['data'][0]['embedding'], top_k=top_k, include_metadata=True)
    # Construct the query
    query = ("You are natural language processing categorization tool.\n" +
             "Given examples of nearest-neighbor US grant programs and their categories. You will assign " +
             "zero to many categories to a new grant program.\n" +
             "The category choices are ['" + "','".join(categories) + "'])\n" +
             "The following are pre-categorized programs and their known categories:\n\n")
    for match in res['matches']:
        if (match['id'] != f"fpi-{key}"):  # EXCLUDE THE ACTUAL PROGRAM
            query += (match['metadata']['text'] + "\n\n" +
                      "Categories: " + ', '.join(match['metadata']['categories']) + "\n\n" +
                      "---" + "\n\n")
    query += ("Please use the above categorized programs to assign categories to the following program description: \n\n" +
              "Program text: " + text + "\n\n" +
              "Output a JSON object (do not output any other text) with four entries: \n" +
              " 1) 'clues' First, list CLUES (i.e. Keywords, phrases, contextual information, semantic relations, semantic meaning, tones, references) that support the categorization\n" + 
              " 2) 'reasoning' Second, deduce the diagnostic REASONING process from premises (i.e., clues, input) that supports the categorization\n" + 
              " 3) 'categories' an array of strings with your categories based upon the clues and reasoning (the choices are ['" + "','".join(categories) + "'])\n" +
              " 4) 'confidence' a number from 0 to 100 representing your confidence (100 being the most confident)""\n\n")
    # print(query)
    chat = openai.ChatCompletion.create(
        model='gpt-3.5-turbo',
        temperature=0,
        messages=[{"role": "user", "content" : query}]
    )
    return(json.loads(chat['choices'][0]['message']['content']))

In [454]:
llm_vd_get_categories(labeled_text[6],6)

{'clues': ['AmeriCorps State and National',
  'matches individuals with organizations',
  'serve with nonprofits, schools, public agencies, tribes, and community and faith-based groups',
  'AmeriCorps grant funding goes to the State Service Commissions'],
 'reasoning': 'The program description mentions that AmeriCorps State and National matches individuals with organizations to serve with nonprofits, schools, public agencies, tribes, and community and faith-based groups. It also states that AmeriCorps grant funding goes to the State Service Commissions. Based on these clues, it can be deduced that the program focuses on providing opportunities for individuals to serve in various sectors and that the funding is distributed through state commissions.',
 'categories': ['STEM Education', 'Workforce Development'],
 'confidence': 80}
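
The function above passes the model's reply straight to `json.loads`, which fails if the model wraps the JSON in extra text, and nothing stops the model from returning a category outside the allowed list. A minimal defensive sketch follows; `parse_llm_json` is a hypothetical helper, and silently dropping unknown categories is an assumption of this sketch (logging them might be preferable).

```python
import json

def parse_llm_json(content: str, allowed_categories: list) -> dict:
    """Extract the outermost JSON object from `content` and keep only allowed categories."""
    start = content.find("{")
    end = content.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("No JSON object found in model output")
    result = json.loads(content[start:end + 1])
    raw = result.get("categories", [])
    if not isinstance(raw, list):
        raw = []
    # Drop anything that is not one of the known category names
    result["categories"] = [c for c in raw if c in allowed_categories]
    return result

reply = 'Sure!\n{"categories": ["STEM Education", "Made Up"], "confidence": 80}\nDone.'
parsed = parse_llm_json(reply, ["STEM Education", "Workforce Development"])
print(parsed["categories"])  # ['STEM Education']
```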

In [30]:
found_categories=[]
if os.path.exists("found_categories.json"):
    with open("found_categories.json","r") as infile:
        found_categories = json.load(infile)
        
# found_categories

In [540]:
import time

# The following loop is resumable: results are saved after every program, so rerunning the cell
# picks up where the previous run stopped and the FPI programs can be processed in batches
for i in range(0,len(labeled_text)):
    if (i >= len(found_categories)):
        error_count = 0  # consecutive failures for this program
        while True:
            try:
                chat_result = llm_vd_get_categories(labeled_text[i],i)
                break
            except Exception as e:
                print(e)
                error_count += 1
                if error_count >= 3:
                    raise Exception("Three errors in a row")
                time.sleep(5)
        chat_result["key"] = i # Record key just in case
        # print(chat_result)
        print(f"{i}")
        found_categories.append(chat_result)
        with open("found_categories.json","w") as outfile:
            outfile.write(json.dumps(found_categories))
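
The retry logic in the loop above can also be written as a reusable wrapper. This is a sketch under assumptions: `call_with_retries` is a hypothetical helper, and the doubling delay is a common convention rather than anything the API requires. The `sleep` parameter exists only so the wait can be stubbed out when experimenting.

```python
import time

def call_with_retries(fn, max_attempts=3, base_delay=5.0, sleep=time.sleep):
    """Call fn() until it succeeds, waiting base_delay, 2*base_delay, ... between attempts."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:  # broad on purpose: API and JSON errors are both retried
            last_error = exc
            if attempt < max_attempts:
                sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"Gave up after {max_attempts} attempts") from last_error

# Possible usage inside the loop (not run here):
# chat_result = call_with_retries(lambda: llm_vd_get_categories(labeled_text[i], i))
```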
   

In [457]:
# Calculate and display Accuracy, Precision, and Recall for each category
precision_data = []
fp_total = 0
fn_total = 0
tp_total = 0
tn_total = 0
for c in categories:
    in_cat = [(c in lc) for lc in labeled_categories]
    assigned_cat = [(c in fc["categories"]) for fc in found_categories]
    tp = ([(a and b) for a, b in zip(in_cat, assigned_cat)]).count(True)
    tn = ([(not(a) and not(b)) for a, b in zip(in_cat, assigned_cat)]).count(True)
    fp = ([(not(a) and b) for a, b in zip(in_cat, assigned_cat)]).count(True)
    fn = ([(a and not(b)) for a, b in zip(in_cat, assigned_cat)]).count(True)
    accuracy = (tp+tn)/(tp+tn+fp+fn)
    precision = tp/(tp + fp)
    recall = tp/(tp + fn)
    f1 = 2.0 / ((1/precision) + (1/recall))
    precision_data.append([c,tp,tn,fp,fn,accuracy,precision,recall,f1])
    tp_total += tp
    tn_total += tn
    fp_total += fp
    fn_total += fn
    
accuracy = (tp_total+tn_total)/(tp_total+tn_total+fp_total+fn_total)
precision = tp_total/(tp_total + fp_total)
recall = tp_total/(tp_total + fn_total)
f1 = 2.0 / ((1/precision) + (1/recall))
precision_data.append(["All Categories",tp_total,tn_total,fp_total,fn_total,accuracy,precision,recall,f1])

df = pd.DataFrame(precision_data, columns = ['Category','True Pos','True Neg','False Pos',
                                             'False Neg','Accuracy','Precision','Recall','F1'])
df

| | Category | True Pos | True Neg | False Pos | False Neg | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Broadband | 27 | 666 | 8 | 10 | 0.974684 | 0.771429 | 0.72973 | 0.75 |
| 1 | Economic Development | 41 | 633 | 33 | 4 | 0.947961 | 0.554054 | 0.911111 | 0.689076 |
| 2 | Opioid Epidemic Response | 34 | 659 | 6 | 12 | 0.974684 | 0.85 | 0.73913 | 0.790698 |
| 3 | STEM Education | 69 | 592 | 31 | 19 | 0.929677 | 0.69 | 0.784091 | 0.734043 |
| 4 | Workforce Development | 64 | 584 | 48 | 15 | 0.911392 | 0.571429 | 0.810127 | 0.670157 |
| 5 | Native American | 235 | 404 | 19 | 53 | 0.898734 | 0.925197 | 0.815972 | 0.867159 |
| 6 | Flood Risk | 42 | 659 | 7 | 3 | 0.985935 | 0.857143 | 0.933333 | 0.893617 |
| 7 | A.I. R&D/Quantum R&D | 54 | 634 | 11 | 12 | 0.967651 | 0.830769 | 0.818182 | 0.824427 |
| 8 | Global Health | 25 | 673 | 13 | 0 | 0.981716 | 0.657895 | 1.0 | 0.793651 |
| 9 | Homelessness | 29 | 669 | 9 | 4 | 0.981716 | 0.763158 | 0.878788 | 0.816901 |
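
The metrics cell divides by `tp + fp` and `tp + fn`, which raises `ZeroDivisionError` for any category that is never predicted or never labeled. A small sketch of the same formulas with a guarded division is shown below, checked against the Broadband row above. Returning 0.0 for an undefined ratio is a convention chosen for this sketch, and `safe_div` and `prf` are hypothetical helper names.

```python
def safe_div(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

def prf(tp, fp, fn):
    """Precision, recall, and F1 from true-positive, false-positive, false-negative counts."""
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    # Equivalent to the harmonic mean 2 / (1/precision + 1/recall) when both are nonzero
    f1 = safe_div(2 * tp, 2 * tp + fp + fn)
    return precision, recall, f1

# Broadband row: tp=27, fp=8, fn=10
p, r, f1 = prf(tp=27, fp=8, fn=10)
print(round(p, 6), round(r, 6), round(f1, 6))  # 0.771429 0.72973 0.75
```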


In [139]:
categories

['Broadband',
 'Economic Development',
 'Opioid Epidemic Response',
 'STEM Education',
 'Workforce Development',
 'Native American',
 'Flood Risk',
 'A.I. R&D/Quantum R&D',
 'Global Health',
 'Homelessness',
 'HIV/AIDS',
 'Transportation Infrastructure']

In [146]:
# Export key information to a spreadsheet for analysis
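# The prompt asks for 'clues'; fall back to 'keywords' in case the saved results use that key
data = [[i, lt, sorted(lc), sorted(fc["categories"]), fc.get("clues", fc.get("keywords")), 
         fc["reasoning"]] for (i, lt, lc, fc) in 
        zip(range(0,len(labeled_text)), labeled_text, labeled_categories, found_categories)]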
                                   
df = pd.DataFrame(data, columns = ['idx','Text','real_categories','found_categories','keywords','reasoning'])
df.to_csv("llm_vd_results.csv")

In [8]:
# When we are done, delete the index
# pinecone.delete_index(index_name)

## How Accurate is Just the Nearest Neighbor (Vector Database)?

In [103]:
# Gets the categories from the nearest neighbor
def vd_get_categories(text, key):
    # Get the related hits from Pinecone
    qe = openai.Embedding.create(input=[text], engine=embed_model)
    res = index.query(qe['data'][0]['embedding'], top_k=2, include_metadata=True)
    c = []  # fall back to no categories if every match is the program itself
    for match in res['matches']:
        if (match['id'] != f"fpi-{key}"):  # EXCLUDE THE ACTUAL PROGRAM
            c = match['metadata']['categories'] # Grab the categories from the first match
            break
    return(c)


In [121]:
found_vd_categories=[]
if os.path.exists("found_vd_categories.json"):
    with open("found_vd_categories.json","r") as infile:
        found_vd_categories = json.load(infile)

In [122]:
# The following loop is resumable: results are saved after every program, so rerunning the cell
# picks up where the previous run stopped and the FPI programs can be processed in batches
for i in range(0,len(labeled_text)):
    if (i >= len(found_vd_categories)):
        cat = vd_get_categories(labeled_text[i],i)
        print(f"{i} {cat}")
        found_vd_categories.append(cat)
        with open("found_vd_categories.json","w") as outfile:
            outfile.write(json.dumps(found_vd_categories))

In [123]:
# Calculate and display Accuracy, Precision, and Recall for each category
precision_data = []
fp_total = 0
fn_total = 0
tp_total = 0
tn_total = 0
for c in categories:
    in_cat = [(c in lc) for lc in labeled_categories]
    assigned_cat = [(c in fc) for fc in found_vd_categories]
    tp = ([(a and b) for a, b in zip(in_cat, assigned_cat)]).count(True)
    tn = ([(not(a) and not(b)) for a, b in zip(in_cat, assigned_cat)]).count(True)
    fp = ([(not(a) and b) for a, b in zip(in_cat, assigned_cat)]).count(True)
    fn = ([(a and not(b)) for a, b in zip(in_cat, assigned_cat)]).count(True)
    accuracy = (tp+tn)/(tp+tn+fp+fn)
    precision = tp/(tp + fp)
    recall = tp/(tp + fn)
    f1 = 2.0 / ((1/precision) + (1/recall))
    precision_data.append([c,tp,tn,fp,fn,accuracy,precision,recall,f1])
    tp_total += tp
    tn_total += tn
    fp_total += fp
    fn_total += fn
    
accuracy = (tp_total+tn_total)/(tp_total+tn_total+fp_total+fn_total)
precision = tp_total/(tp_total + fp_total)
recall = tp_total/(tp_total + fn_total)
f1 = 2.0 / ((1/precision) + (1/recall))
precision_data.append(["All Categories",tp_total,tn_total,fp_total,fn_total,accuracy,precision,recall,f1])

df = pd.DataFrame(precision_data, columns = ['Category','True Pos','True Neg','False Pos',
                                             'False Neg','Accuracy','Precision','Recall','F1'])
df

| | Category | True Pos | True Neg | False Pos | False Neg | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Broadband | 27 | 666 | 8 | 10 | 0.974684 | 0.771429 | 0.72973 | 0.75 |
| 1 | Economic Development | 38 | 655 | 11 | 7 | 0.974684 | 0.77551 | 0.844444 | 0.808511 |
| 2 | Opioid Epidemic Response | 35 | 657 | 8 | 11 | 0.973277 | 0.813953 | 0.76087 | 0.786517 |
| 3 | STEM Education | 60 | 601 | 22 | 28 | 0.929677 | 0.731707 | 0.681818 | 0.705882 |
| 4 | Workforce Development | 50 | 610 | 22 | 29 | 0.92827 | 0.694444 | 0.632911 | 0.662252 |
| 5 | Native American | 259 | 382 | 41 | 29 | 0.901547 | 0.863333 | 0.899306 | 0.880952 |
| 6 | Flood Risk | 43 | 658 | 8 | 2 | 0.985935 | 0.843137 | 0.955556 | 0.895833 |
| 7 | A.I. R&D/Quantum R&D | 51 | 625 | 20 | 15 | 0.950774 | 0.71831 | 0.772727 | 0.744526 |
| 8 | Global Health | 21 | 685 | 1 | 4 | 0.992968 | 0.954545 | 0.84 | 0.893617 |
| 9 | Homelessness | 28 | 671 | 7 | 5 | 0.983122 | 0.8 | 0.848485 | 0.823529 |


## Let's Try to Improve the Model
Can the LLM analyze what it got wrong and develop a list of "things to consider"? Let's iterate over the incorrect answers and try to develop this list!

In [222]:
def get_things_to_consider(incorrect_assignments, lessons_so_far=""):
    # Construct the query
    query = ("You are a natural language processing categorization tool that is trying to improve past results.\n" +
             "Your goal is to analyze incorrect category assignments and figure out a series of \n"+
             "'lessons learned' that you can apply to future categorizations. (For example 'If X\n" +
             "appears in the description, we probably should assign Y')\n\n")
    if (lessons_so_far != ""):
        query += f"The lessons learned we have so far are:\n\n{lessons_so_far}\n\n"

    query += "The following is a collection of program data, the 'correct' assignment, and your assignment:\n\n"
    for i in incorrect_assignments:
        query += i[1] + "\n\n"
        query += "Correct Categories: " + ", ".join(i[2]) + "\n"
        query += "Your Categories: " + ", ".join(i[3]) + "\n\n---\n\n"

    # Ask for updated lessons, or for an initial list if there are none yet
    if (lessons_so_far != ""):
        query += ("Please repeat the lessons we have so far with any updates or additions you think will \n" +
                  "help. If you have nothing new to add, please simply repeat the lessons we have above.\n")
    else:
        query += ("Please output any lessons learned you think will \n" +
                  "help. If you have nothing to add, simply reply with an empty string.\n")
    query += ("\n\nPlease create a full and complete lessons learned list (not referencing past lists) " +
              " and limited to the categories: " + ", ".join(categories) + "\n")
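    # Print the prompt so it can be pasted into ChatGPT by hand (see the batch notes below);
    # this early return leaves the API call that follows unreachable until it is removed
    print(query)
    return ""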
    chat = openai.ChatCompletion.create(
        model='gpt-3.5-turbo-16k',
        messages=[{"role": "user", "content" : query}]
    )
    return(chat['choices'][0]['message']['content'])

In [157]:
# Let's reconstruct the full list of results
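# The prompt asks for 'clues'; fall back to 'keywords' in case the saved results use that key
data = [[i, lt, sorted(lc), sorted(fc["categories"]), fc.get("clues", fc.get("keywords")), 
         fc["reasoning"]] for (i, lt, lc, fc) in 
        zip(range(0,len(labeled_text)), labeled_text, labeled_categories, found_categories)]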

In [158]:
# Let's pull out columns that are wrong
incorrect_assignments = [d for d in data if d[2] != d[3]]
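
Before asking the LLM for lessons, it can help to see which categories are most often missed and which are most often added by mistake. The sketch below assumes rows shaped like the entries of `data` (index, text, real categories, found categories, ...); `error_breakdown` is a hypothetical helper, and the two example rows are made up.

```python
from collections import Counter

def error_breakdown(rows):
    """Count, per category, how often it was missed and how often it was wrongly added."""
    missed, extra = Counter(), Counter()
    for row in rows:
        real, found = set(row[2]), set(row[3])
        missed.update(real - found)   # labeled but not assigned (false negatives)
        extra.update(found - real)    # assigned but not labeled (false positives)
    return missed, extra

# Made-up rows for illustration only
example_rows = [
    [0, "text a", ["Broadband", "Native American"], ["Broadband"]],
    [1, "text b", ["Homelessness"], ["Economic Development", "Homelessness"]],
]
missed, extra = error_breakdown(example_rows)
print(missed.most_common(), extra.most_common())
```

With the real data, `error_breakdown(incorrect_assignments)` would give the per-category counts.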

In [209]:
lessons = ""

In [237]:
len(incorrect_assignments)

242

In [405]:
# Note: I manually ran the following in batches of 50 to review the results in ChatGPT-3
# I reran the following in batches of 20 and pasted them into ChatGPT-4
# lessons = get_things_to_consider(incorrect_assignments[221:242],lessons)
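
The manual batching described in the comments above amounts to folding each batch of mistakes into a running list of lessons. A sketch of that loop is given here, assuming a callable `ask(batch, lessons)` with the same argument order as `get_things_to_consider`; `refine_lessons` is a hypothetical name, and the batch size of 20 mirrors the comment above. An empty reply keeps the previous lessons, which matters because the function as written prints the prompt and returns an empty string.

```python
def refine_lessons(items, ask, batch_size=20, lessons=""):
    """Feed `items` to `ask` in batches, carrying the latest lessons text forward."""
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        updated = ask(batch, lessons)
        # Keep the previous lessons if the reply is empty or whitespace
        if updated and updated.strip():
            lessons = updated
    return lessons

# Possible usage (not run here):
# lessons = refine_lessons(incorrect_assignments, get_things_to_consider)
```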

In [239]:
lessons = """
1. "Broadband": Assign this category to programs that mention the development or enhancement of broadband or Information Technology (IT) infrastructure, or programs aimed at making internet access affordable or universally available. If a program facilitates remote health care services (telehealth), it should also be categorized under "Broadband". But remember, a program's involvement with technology does not automatically classify it under this category. 

2. "Economic Development": This category is applicable for programs primarily aimed at improving economic conditions or developing business and commercial capacity. However, avoid categorizing programs that primarily focus on community services, environmental protection, or health under this category, even though they might indirectly contribute to economic development.

3. "Opioid Epidemic Response": This category should be assigned to programs explicitly focusing on addressing the opioid crisis, including treatment, prevention, and comprehensive strategies for opioid misuse. Programs that mention substance misuse as a peripheral issue should not be categorized under this category.

4. "STEM Education": Programs mentioning "education", "training", or "student" in connection with Science, Technology, Engineering, and Mathematics subjects should be categorized as "STEM Education". However, not all education or training programs necessarily fall under this category.

5. "Workforce Development": Programs mentioning job training, re-employment opportunities, enhancing workforce productivity, or facilitating worker training and education should be categorized under "Workforce Development". Nonetheless, not every program that benefits workers or the unemployed should be assigned this category.

6. "Native American": Assign this category to programs specifically serving "Indian/Native American Tribal Government", "Indian Tribes", "Native American Organizations", "Federally Recognized Indian Tribal Governments", or specific native communities. Do not assign this category just because a program may indirectly benefit Native American communities.

7. "Flood Risk": Programs that involve "hydrologic studies", "water resources projects", "storm events", or specifically mention "flood control" should be categorized under "Flood Risk". However, general environmental or infrastructure programs should not necessarily fall under this category. 

8. "A.I. R&D/Quantum R&D": This category should be assigned to programs explicitly mentioning the development or application of advanced technologies like Artificial Intelligence, Machine Learning, or Quantum Research and Development. Avoid assigning this category solely based on the involvement of technology or research.

9. "Global Health": Programs addressing health concerns at a global level, including pandemics, should be categorized under "Global Health". However, programs addressing specific diseases or health crises should not be categorized under this category if they do not have a clear global health focus.

10. "Homelessness": Programs directly addressing the issue of homelessness or substandard housing, aiming to improve living conditions or build new houses, should be categorized under "Homelessness". However, programs addressing issues related to homelessness, like substance misuse or economic hardship, should not necessarily be assigned this category.

11. "HIV/AIDS": Programs directly aimed at the prevention, treatment, or management of HIV/AIDS should be categorized under "HIV/AIDS". 

12. "Transportation Infrastructure": Programs that mention the development or maintenance of transportation infrastructure, like roads or bridges, should be categorized under "Transportation Infrastructure". However, the transportation aspect should not override other primary focuses such as STEM education or workforce development.

Each program should be analyzed in its entirety, considering its primary mission and objectives, and the specific context provided in the program description and beneficiary/recipient information. The mere presence of certain terms doesn't guarantee that a category applies. Always consider the primary purpose of the program.
"""

## Redo the Experiment with Lessons Learned
Let's rerun our categorization, this time incorporating our lessons learned.

In [349]:
# ChatGPT-3.5 version
lessons = """Lessons learned:

1. Keywords such as "broadband" and "internet connectivity" in the program description often indicate the "Broadband" category.
2. "Economic Development" can be inferred when program descriptions mention initiatives to stimulate local economies or industry growth.
3. Programs that aim to support indigenous communities or organizations typically fall into the "Native American" category.
4. The "Workforce Development" category is often associated with program descriptions that mention job training, employment opportunities, or support for individuals in improving their skills.
5. Programs focusing on science, technology, engineering, or mathematics education often fall into the "STEM Education" category.
6. The "Flood Risk" category can be determined when programs address issues related to flood control, hazard forecasting, or emergency response in coastal or waterway areas.
7. When program descriptions mention substance abuse treatment, recovery services, or workforce re-entry after treatment, these often fall under "Opioid Epidemic Response".
8. When program descriptions mention road, bridge, or waterfront development projects, these often categorize under "Transportation Infrastructure".
9. Program descriptions mentioning international health assistance, child nutrition, or support for low-income countries often indicate the "Global Health" category.
10. Programs described as services for individuals experiencing homelessness or related support programs usually categorize as "Homelessness".
11. Programs with efforts related to HIV prevention, care, treatment, and support usually fall under the "HIV/AIDS" category.
12. The "A.I. R&D/Quantum R&D" category can be inferred when program descriptions focus on the development of new AI or quantum technologies or research.
"""

In [352]:
# ChatGPT-4 version
lessons = """Lessons learned:
1. "Broadband": Assign this category to programs that mention the development or enhancement of broadband or Information Technology (IT) infrastructure, or programs aimed at making internet access affordable or universally available. If a program facilitates remote health care services (telehealth), it should also be categorized under "Broadband".
2. "Economic Development": This category is applicable for programs primarily aimed at improving economic conditions or developing business and commercial capacity. 
3. "Opioid Epidemic Response": This category should be assigned to programs explicitly focusing on addressing the opioid crisis, including treatment, prevention, and comprehensive strategies for opioid misuse. 
4. "STEM Education": Programs mentioning "education", "training", or "student" in connection with Science, Technology, Engineering, and Mathematics subjects should be categorized as "STEM Education". 
5. "Workforce Development": Programs mentioning job training, re-employment opportunities, enhancing workforce productivity, or facilitating worker training and education should be categorized under "Workforce Development". 
6. "Native American": Assign this category to programs specifically serving "Indian/Native American Tribal Government", "Indian Tribes", "Native American Organizations", "Federally Recognized Indian Tribal Governments", or specific native communities. 
7. "Flood Risk": Programs that involve "hydrologic studies", "water resources projects", "storm events", or specifically mention "flood control" should be categorized under "Flood Risk". However, general environmental or infrastructure programs should not necessarily fall under this category. 
8. "A.I. R&D/Quantum R&D": This category should be assigned to programs explicitly mentioning the development or application of advanced technologies like Artificial Intelligence, Machine Learning, or Quantum Research and Development. Avoid assigning this category solely based on the involvement of technology or research.
9. "Global Health": Programs addressing health concerns at a global level, including pandemics, should be categorized under "Global Health". 
10. "Homelessness": Programs directly addressing the issue of homelessness or substandard housing, aiming to improve living conditions or build new houses, should be categorized under "Homelessness". 
11. "HIV/AIDS": Programs directly aimed at the prevention, treatment, or management of HIV/AIDS should be categorized under "HIV/AIDS". 
12. "Transportation Infrastructure": Programs that mention the development or maintenance of transportation infrastructure, like roads or bridges, should be categorized under "Transportation Infrastructure". However, the transportation aspect should not override other primary focuses such as STEM education or workforce development.
"""


In [503]:
# ChatGPT-4 version converted to "If X then assign the category Y"
lessons = """
1. If the program mentions "Indian/Native American Tribal Government", "Indian Tribes", "Native American Organizations", "Federally Recognized Indian Tribal Governments", or specific native communities, then assign the category "Native American".
2. If the program mentions the development or enhancement of broadband or IT infrastructure, or aims at making internet access affordable or universally available, or facilitates remote health care services (telehealth), then assign the category "Broadband".
3. If the program is primarily aimed at improving economic conditions or developing business and commercial capacity, then assign the category "Economic Development".
4. If the program is explicitly focusing on addressing the opioid crisis, including treatment, prevention, and comprehensive strategies for opioid misuse, then assign the category "Opioid Epidemic Response".
5. If the program mentions "education", "training", or "student" in connection with Science, Technology, Engineering, and Mathematics subjects, then assign the category "STEM Education".
6. If the program mentions job training, re-employment opportunities, enhancing workforce productivity, or facilitating worker training and education, then assign the category "Workforce Development".
7. If the program involves "hydrologic studies", "water resources projects", "storm events", or specifically mentions "flood control", but is not a general environmental or infrastructure program, then assign the category "Flood Risk".
8. If the program explicitly mentions the development or application of advanced technologies like Artificial Intelligence, Machine Learning, or Quantum Research and Development, but not solely based on the involvement of technology or research, then assign the category "A.I. R&D/Quantum R&D".
9. If the program addresses health concerns at a global level, including pandemics, then assign the category "Global Health".
10. If the program is directly addressing the issue of homelessness or substandard housing, and aims to improve living conditions or build new houses, then assign the category "Homelessness".
11. If the program is directly aimed at the prevention, treatment, or management of HIV/AIDS, then assign the category "HIV/AIDS".
12. If the program mentions the development or maintenance of transportation infrastructure, like roads or bridges, and its primary focus is transportation infrastructure, not overridden by other primary focuses such as STEM education or workforce development, then assign the category "Transportation Infrastructure".
"""

In [532]:
import json

# Re-categorizes a program using the earlier response (categories and reasoning) plus the lessons learned
# openai, categories, lessons are global
def llm_vd_get_categories_wll(text, previous_response, temperature=0):
    # Construct the query
    query = ("You are a natural language processing categorization tool.\n" +
             "The category choices are ['" + "','".join(categories) + "']\n" +
             "This is the program definition you are categorizing: \n\n" +
             "Program text: " + text + "\n\n" +
             "From nearest neighbor programs (determined by keywords), in the past " +
             "you assigned it the categories: " + ', '.join(previous_response['categories']) + "\n" +
             "and used the reasoning: " + str(previous_response['reasoning']) + "\n\n" +
             "Apply the following lessons learned: \n" + lessons + "\n\n" +
             "Output a JSON object (do not output any other text) with five entries: \n" +
             " 1) 'applicable lessons learned' For each lesson learned state if it applies\n" +
             " 2) 'lessons learned categories' List the categories for the applied lessons learned \n" +
             " 3) 'reasoning' Third, deduce the diagnostic REASONING process from premises (i.e., clues, input) that supports the categorization\n" +
             " 4) 'categories' an array of strings with your categories based upon the clues and reasoning (the choices are ['" + "','".join(categories) + "'])\n" +
             " 5) 'confidence' a number from 0 to 100 representing your confidence (100 being the most confident)\n\n")
    # print(query)
    chat = openai.ChatCompletion.create(
        model='gpt-3.5-turbo',
        temperature=temperature,
        messages=[{"role": "user", "content" : query}]
    )
    return(json.loads(chat['choices'][0]['message']['content']))
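
The function above hands the model's reply directly to `json.loads`, which raises an error whenever the model adds any text around the JSON object. A minimal sketch of a more tolerant parser follows. It assumes the reply contains a single top-level JSON object, and `parse_json_reply` is a hypothetical helper, not something the notebook defines.

```python
import json

def parse_json_reply(reply: str) -> dict:
    """Parse a chat reply that is expected to contain one JSON object."""
    try:
        # Fast path: the reply is exactly the JSON object we asked for
        return json.loads(reply)
    except json.JSONDecodeError:
        # Fallback: keep only the text between the first '{' and the last '}'
        start = reply.find("{")
        end = reply.rfind("}")
        if start == -1 or end <= start:
            raise  # nothing that looks like an object, so re-raise the original error
        return json.loads(reply[start:end + 1])

# Illustrative reply with extra text around the JSON (made up for this example)
reply = 'Here is the result:\n{"categories": ["Broadband"], "confidence": 80}\nDone.'
parsed = parse_json_reply(reply)
print(parsed["categories"])
```

If the fallback also fails, the error still propagates, so the retry logic in the batch loop would continue to work as before.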

In [528]:
llm_vd_get_categories_wll(labeled_text[200],found_categories[200])
# found_categories[81]

{'applicable lessons learned': {'1': True,
  '2': False,
  '3': False,
  '4': False,
  '5': True,
  '6': True,
  '7': False,
  '8': False,
  '9': False,
  '10': False,
  '11': False,
  '12': False},
 'lessons learned categories': ['Native American',
  'STEM Education',
  'Workforce Development'],
 'reasoning': "Lesson 1 applies because the program does not mention any Native American-related terms. Lesson 2 does not apply because the program does not mention anything related to broadband or IT infrastructure. Lesson 3 does not apply because the program does not focus on improving economic conditions or developing business and commercial capacity. Lesson 4 does not apply because the program does not explicitly address the opioid crisis. Lesson 5 applies because the program mentions 'employment' and 'career goals' in connection with individuals with disabilities. Lesson 6 applies because the program mentions 'vocational rehabilitation services' and 'competitive integrated employment'. Le
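
In the output above, the 'lessons learned categories' list lines up with the lessons flagged as applicable (1, 5, and 6) under the numbering of the "If X then assign the category Y" lessons, even though the reasoning text for lesson 1 contradicts its flag. Because each lesson names exactly one category, that list could be derived from the flags instead of being taken from the model. The sketch below assumes the lesson order of the if-then version; `LESSON_CATEGORY` and `categories_from_lessons` are hypothetical names introduced only for illustration.

```python
# Lesson number -> category, following the order of the "If X then ..." lessons
LESSON_CATEGORY = {
    "1": "Native American",
    "2": "Broadband",
    "3": "Economic Development",
    "4": "Opioid Epidemic Response",
    "5": "STEM Education",
    "6": "Workforce Development",
    "7": "Flood Risk",
    "8": "A.I. R&D/Quantum R&D",
    "9": "Global Health",
    "10": "Homelessness",
    "11": "HIV/AIDS",
    "12": "Transportation Infrastructure",
}

def categories_from_lessons(applicable: dict) -> list:
    """Return the categories of every lesson flagged True, in lesson order."""
    return [LESSON_CATEGORY[k] for k in sorted(applicable, key=int) if applicable[k]]

# Same flags as the sample output: lessons 1, 5 and 6 apply
flags = {str(n): n in (1, 5, 6) for n in range(1, 13)}
derived = categories_from_lessons(flags)
print(derived)
```

Comparing `derived` with the model's own 'lessons learned categories' entry is one cheap way to flag responses whose flags and categories disagree.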

In [510]:
llm_vd_get_categories(labeled_text[7],7)

{'clues': ['AmeriCorps State and National',
  'matches individuals with organizations',
  'serve with nonprofits, schools, public agencies, tribes, and community and faith-based groups',
  'AmeriCorps grant funding goes to the State Service Commissions'],
 'reasoning': 'The program description mentions that AmeriCorps State and National matches individuals with organizations to serve with nonprofits, schools, public agencies, tribes, and community and faith-based groups. It also states that AmeriCorps grant funding goes to the State Service Commissions. Based on these clues, it can be deduced that the program focuses on providing service opportunities and grants to various organizations.',
 'categories': ['STEM Education', 'Native American'],
 'confidence': 100}

In [511]:
labeled_text[7]

'Program Name: AmeriCorps State & National (Formula)\nAgency: Corporation for National and Community Service\nMission/Purpose: Elementary, secondary, and vocational education\nProgram Description: AmeriCorps State and National matches individuals with organizations that see service as solution to local, regional, and national challenges. There are thousands of opportunities in locations across the country to serve with nonprofits, schools, public agencies, tribes, and community and faith-based groups. Most AmeriCorps grant funding goes to the State Service Commissions, which in turn award grants to organizations to respond to local needs.\nRecipients: U.S. State Government (including the District of Columbia); Nonprofit without 501(c)(3) IRS Status; Nonprofit with 501(c)(3) Status; Private Educational Institution, Nonprofit; Public/State Controlled Institution; Special District Governments or Interstate; Domestic Local Government (includes territories unless otherwise specified); India

In [515]:
found_categories_wll=[]
if os.path.exists("found_categories_with_lessons.json"):
    with open("found_categories_with_lessons.json","r") as infile:
        found_categories_wll = json.load(infile)

In [536]:
import json
import time

# Resumable loop: results are saved after every program, so rerunning the cell
# continues from the first program that has not been categorized yet
error_count = 0
for i in range(len(found_categories_wll), len(labeled_text)):
    while True:
        try:
            # Temperature rises by 0.1 after each consecutive error
            chat_result = llm_vd_get_categories_wll(labeled_text[i], found_categories[i], error_count * 0.1)
            error_count = 0
            break
        except Exception as e:
            print(e)
            error_count += 1
            if error_count > 3:
                raise Exception("More than three errors in a row")
            time.sleep(5)
    chat_result["key"] = i  # Record key just in case
    # print(chat_result)
    print(f"{i}")
    found_categories_wll.append(chat_result)
    with open("found_categories_with_lessons.json", "w") as outfile:
        outfile.write(json.dumps(found_categories_wll))
   

689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
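
The batch loop above combines two ideas: checkpoint the results after every item, and retry a failing item a limited number of times. A minimal, self-contained sketch of that pattern is shown here with a simulated worker in place of the OpenAI call. `run_resumable` and `flaky_worker` are hypothetical names, and the zero default delay is chosen only so the example runs quickly (the notebook sleeps 5 seconds between retries).

```python
import json
import os
import tempfile
import time

def run_resumable(items, worker, checkpoint_path, max_errors=3, delay=0.0):
    """Process items in order, saving results after each one so a rerun resumes."""
    results = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "r") as infile:
            results = json.load(infile)
    # Start at the first item that has no saved result yet
    for i in range(len(results), len(items)):
        errors = 0
        while True:
            try:
                result = worker(items[i], errors)
                break
            except Exception as exc:
                errors += 1
                if errors >= max_errors:
                    raise RuntimeError(f"{max_errors} errors in a row on item {i}") from exc
                time.sleep(delay)
        results.append(result)
        with open(checkpoint_path, "w") as outfile:
            json.dump(results, outfile)
    return results

calls = {"n": 0}

def flaky_worker(item, attempt):
    """Simulated worker: fails once on item 'b', then succeeds."""
    calls["n"] += 1
    if item == "b" and attempt == 0:
        raise ValueError("simulated transient failure")
    return {"item": item, "attempt": attempt}

path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
first = run_resumable(["a", "b"], flaky_worker, path)
second = run_resumable(["a", "b", "c"], flaky_worker, path)  # only "c" is new work
print(len(first), len(second), calls["n"])
```

The second call processes only the new item, which mirrors how rerunning the notebook cell picks up after an interruption.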


In [537]:
# found_categories = []

In [544]:
# Calculate and display Accuracy, Precision, Recall, and F1 for each category
precision_data = []
fp_total = 0
fn_total = 0
tp_total = 0
tn_total = 0
num = len(found_categories_wll)
for c in categories:
    in_cat = [(c in lc) for lc in labeled_categories[0:num]]
    assigned_cat = [(c in fc["categories"]) for fc in found_categories_wll[0:num]]
    tp = ([(a and b) for a, b in zip(in_cat, assigned_cat)]).count(True)
    tn = ([(not(a) and not(b)) for a, b in zip(in_cat, assigned_cat)]).count(True)
    fp = ([(not(a) and b) for a, b in zip(in_cat, assigned_cat)]).count(True)
    fn = ([(a and not(b)) for a, b in zip(in_cat, assigned_cat)]).count(True)
    if (tp+tn+fp+fn) > 0:
        accuracy = (tp+tn)/(tp+tn+fp+fn)
    else:
        accuracy = 1
    if (tp + fp) > 0:
        precision = tp/(tp + fp)
    else:
        precision = 1
    if (tp + fn) > 0:
        recall = tp/(tp + fn)
    else:
        recall = 1
    if (precision > 0) and (recall > 0):
        f1 = 2.0 / ((1/precision) + (1/recall))
    else:
        f1 = 0  # no true positives, so the harmonic mean is 0
    precision_data.append([c,tp,tn,fp,fn,accuracy,precision,recall,f1])
    tp_total += tp
    tn_total += tn
    fp_total += fp
    fn_total += fn
    
accuracy = (tp_total+tn_total)/(tp_total+tn_total+fp_total+fn_total)
precision = tp_total/(tp_total + fp_total)
recall = tp_total/(tp_total + fn_total)
f1 = 2.0 / ((1/precision) + (1/recall))
precision_data.append(["All Categories",tp_total,tn_total,fp_total,fn_total,accuracy,precision,recall,f1])

df = pd.DataFrame(precision_data, columns = ['Category','True Pos','True Neg','False Pos',
                                             'False Neg','Accuracy','Precision','Recall','F1'])
df

| | Category | True Pos | True Neg | False Pos | False Neg | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Broadband | 31 | 662 | 12 | 6 | 0.974684 | 0.72093 | 0.837838 | 0.775 |
| 1 | Economic Development | 39 | 618 | 48 | 6 | 0.924051 | 0.448276 | 0.866667 | 0.590909 |
| 2 | Opioid Epidemic Response | 32 | 662 | 3 | 14 | 0.97609 | 0.914286 | 0.695652 | 0.790123 |
| 3 | STEM Education | 57 | 578 | 45 | 31 | 0.893108 | 0.558824 | 0.647727 | 0.6 |
| 4 | Workforce Development | 59 | 530 | 102 | 20 | 0.828411 | 0.36646 | 0.746835 | 0.491667 |
| 5 | Native American | 222 | 315 | 108 | 66 | 0.755274 | 0.672727 | 0.770833 | 0.718447 |
| 6 | Flood Risk | 37 | 643 | 23 | 8 | 0.956399 | 0.616667 | 0.822222 | 0.704762 |
| 7 | A.I. R&D/Quantum R&D | 36 | 640 | 5 | 30 | 0.950774 | 0.878049 | 0.545455 | 0.672897 |
| 8 | Global Health | 7 | 680 | 6 | 18 | 0.966245 | 0.538462 | 0.28 | 0.368421 |
| 9 | Homelessness | 28 | 668 | 10 | 5 | 0.978903 | 0.736842 | 0.848485 | 0.788732 |
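
As a worked check of how each row is derived, the sketch below recomputes the Broadband row from its four counts. `classification_metrics` is a hypothetical helper written for this example. It follows the notebook's convention of reporting 1 when a denominator is empty, and it uses the equivalent closed form F1 = 2·TP / (2·TP + FP + FN) for the harmonic mean of precision and recall.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return (accuracy, precision, recall, f1) from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 1.0
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    denom = 2 * tp + fp + fn
    # Equals 2 / (1/precision + 1/recall) whenever both are positive
    f1 = 2 * tp / denom if denom else 1.0
    return accuracy, precision, recall, f1

# Counts taken from the Broadband row of the table
acc, prec, rec, f1 = classification_metrics(tp=31, tn=662, fp=12, fn=6)
print(round(acc, 6), round(prec, 6), round(rec, 6), round(f1, 6))
# 0.974684 0.72093 0.837838 0.775
```

Accuracy stays high for every category because true negatives dominate the counts, so precision, recall, and F1 are the more informative columns here.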


In [548]:
# Export key information to a spreadsheet for analysis
data = [[i, lt, sorted(lc), sorted(fc["categories"]), fc["clues"], 
         fc["reasoning"], sorted(fcw["categories"]), fcw["reasoning"]] for (i, lt, lc, fc, fcw) in 
        zip(range(0,len(labeled_text)), labeled_text, labeled_categories, found_categories, found_categories_wll)]
                                   
df = pd.DataFrame(data, columns = ['idx','Text','real_categories','found_categories','keywords','reasoning','ll_found','ll_reasoning'])
df.to_csv("llm_vd_results.csv")

## Redo the Experiment with ChatGPT 4

In [24]:
import json

# Gets the categories from GPT-4 using nearest-neighbor examples, excluding the program identified by key
# openai, index, categories, embed_model are global
def llm_vd_get_categories_4(text, key, temperature=0, top_k=5):
    # First get the related hits from Pinecone
    qe = openai.Embedding.create(input=[text], engine=embed_model)
    res = index.query(qe['data'][0]['embedding'], top_k=top_k, include_metadata=True)
    # Construct the query
    query = ("You are a natural language processing categorization tool.\n" +
             "Given examples of nearest-neighbor US grant programs and their categories, you will assign " +
             "zero to many categories to a new grant program.\n" +
             "The category choices are ['" + "','".join(categories) + "']\n" +
             "The following are pre-categorized programs and their known categories:\n\n")
    for match in res['matches']:
        if (match['id'] != f"fpi-{key}"):  # EXCLUDE THE ACTUAL PROGRAM
            query += (match['metadata']['text'] + "\n\n" +
                      "Categories: " + ', '.join(match['metadata']['categories']) + "\n\n" +
                      "---" + "\n\n")
    query += ("Please use the above categorized programs to assign categories to the following program description: \n\n" +
              "Program text: " + text + "\n\n" +
              "Output a JSON object (do not output any other text) with four entries: \n" +
              " 1) 'clues' First, list CLUES (i.e. Keywords, phrases, contextual information, semantic relations, semantic meaning, tones, references) that support the categorization\n" +
              " 2) 'reasoning' Second, deduce the diagnostic REASONING process from premises (i.e., clues, input) that supports the categorization\n" +
              " 3) 'categories' an array of strings with your categories based upon the clues and reasoning (the choices are ['" + "','".join(categories) + "'])\n" +
              " 4) 'confidence' a number from 0 to 100 representing your confidence (100 being the most confident)\n\n")
    # print(query)
    chat = openai.ChatCompletion.create(
        model='gpt-4',
        temperature=temperature,
        messages=[{"role": "user", "content" : query}]
    )
    return(json.loads(chat['choices'][0]['message']['content']))
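
The key step in `llm_vd_get_categories_4` is leave-one-out prompting: the nearest neighbors become labeled examples, and the match whose id is `fpi-{key}` is skipped so the program never sees its own label. The sketch below isolates that step as a pure function that can be checked without calling OpenAI or Pinecone. `build_neighbor_prompt` is a hypothetical helper, the prompt is shortened, and the match data is made up for illustration (only the `id` / `metadata` layout follows the notebook).

```python
def build_neighbor_prompt(categories, matches, text, exclude_id=None):
    """Build a few-shot prompt from neighbor matches, skipping exclude_id."""
    choices = "['" + "','".join(categories) + "']"
    parts = [
        "You are a natural language processing categorization tool.",
        "The category choices are " + choices,
        "The following are pre-categorized programs and their known categories:",
        "",
    ]
    for match in matches:
        if match["id"] == exclude_id:
            continue  # never show the program its own label
        parts.append(match["metadata"]["text"])
        parts.append("Categories: " + ", ".join(match["metadata"]["categories"]))
        parts.append("---")
    parts.append("Program text: " + text)
    return "\n".join(parts)

# Made-up matches in the same shape as the entries of res['matches']
matches = [
    {"id": "fpi-6", "metadata": {"text": "Program under test", "categories": ["STEM Education"]}},
    {"id": "fpi-1", "metadata": {"text": "Example rural internet program", "categories": ["Broadband"]}},
]
prompt = build_neighbor_prompt(["Broadband", "STEM Education"], matches,
                               "Program under test", exclude_id="fpi-6")
print(prompt)
```

Keeping prompt construction separate from the API calls also makes it easier to inspect exactly what the model was shown for any given program.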

In [26]:
llm_vd_get_categories_4(labeled_text[6],6,0)

{'clues': ['AmeriCorps State and National',
  'Elementary, secondary, and vocational education',
  'serve with nonprofits, schools, public agencies, tribes, and community and faith-based groups',
  'Indian/Native American Tribal Government'],
 'reasoning': "The program is similar to the AmeriCorps State & National (Formula) and AmeriCorps VISTA programs, which are categorized as STEM Education and Native American respectively. The program's mission and purpose align with education, and it serves with tribes and Native American Tribal Government, which suggests the Native American category.",
 'categories': ['STEM Education', 'Native American'],
 'confidence': 90}

In [27]:
found_categories_4=[]
if os.path.exists("found_categories_4.json"):
    with open("found_categories_4.json","r") as infile:
        found_categories_4 = json.load(infile)
        
# found_categories

In [38]:
import time

# Resumable loop: results are saved after every program, so rerunning the cell
# continues from the first program that has not been categorized yet
error_count = 0
for i in range(len(found_categories_4), len(labeled_text)):
    while True:
        try:
            # Temperature rises by 0.1 after each consecutive error
            chat_result = llm_vd_get_categories_4(labeled_text[i], i, error_count * 0.1)
            error_count = 0
            break
        except Exception as e:
            print(e)
            error_count += 1
            if error_count > 3:
                raise Exception("More than three errors in a row")
            time.sleep(5)
    chat_result["key"] = i  # Record key just in case
    # print(chat_result)
    print(f"{i}")
    found_categories_4.append(chat_result)
    with open("found_categories_4.json", "w") as outfile:
        outfile.write(json.dumps(found_categories_4))

114
115
116
...
185
186
The server is overloaded or not ready yet.
187
188
...
351
352
3

In [39]:
# Calculate and display Accuracy, Precision, Recall, and F1 for each category for ChatGPT-4
precision_data = []
fp_total = 0
fn_total = 0
tp_total = 0
tn_total = 0
num = len(found_categories_4)
for c in categories:
    in_cat = [(c in lc) for lc in labeled_categories[0:num]]
    assigned_cat = [(c in fc["categories"]) for fc in found_categories_4[0:num]]
    tp = ([(a and b) for a, b in zip(in_cat, assigned_cat)]).count(True)
    tn = ([(not(a) and not(b)) for a, b in zip(in_cat, assigned_cat)]).count(True)
    fp = ([(not(a) and b) for a, b in zip(in_cat, assigned_cat)]).count(True)
    fn = ([(a and not(b)) for a, b in zip(in_cat, assigned_cat)]).count(True)
    if (tp+tn+fp+fn) > 0:
        accuracy = (tp+tn)/(tp+tn+fp+fn)
    else:
        accuracy = 1
    if (tp + fp) > 0:
        precision = tp/(tp + fp)
    else:
        precision = 1
    if (tp + fn) > 0:
        recall = tp/(tp + fn)
    else:
        recall = 1
    if (precision > 0) and (recall > 0):
        f1 = 2.0 / ((1/precision) + (1/recall))
    else:
        f1 = 0  # no true positives, so the harmonic mean is 0
    precision_data.append([c,tp,tn,fp,fn,accuracy,precision,recall,f1])
    tp_total += tp
    tn_total += tn
    fp_total += fp
    fn_total += fn
    
accuracy = (tp_total+tn_total)/(tp_total+tn_total+fp_total+fn_total)
precision = tp_total/(tp_total + fp_total)
recall = tp_total/(tp_total + fn_total)
f1 = 2.0 / ((1/precision) + (1/recall))
precision_data.append(["All Categories",tp_total,tn_total,fp_total,fn_total,accuracy,precision,recall,f1])

df = pd.DataFrame(precision_data, columns = ['Category','True Pos','True Neg','False Pos',
                                             'False Neg','Accuracy','Precision','Recall','F1'])
df

| | Category | True Pos | True Neg | False Pos | False Neg | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Broadband | 31 | 666 | 8 | 6 | 0.980309 | 0.794872 | 0.837838 | 0.815789 |
| 1 | Economic Development | 44 | 638 | 28 | 1 | 0.959212 | 0.611111 | 0.977778 | 0.752137 |
| 2 | Opioid Epidemic Response | 31 | 662 | 3 | 15 | 0.974684 | 0.911765 | 0.673913 | 0.775 |
| 3 | STEM Education | 75 | 600 | 23 | 13 | 0.949367 | 0.765306 | 0.852273 | 0.806452 |
| 4 | Workforce Development | 65 | 598 | 34 | 14 | 0.932489 | 0.656566 | 0.822785 | 0.730337 |
| 5 | Native American | 267 | 365 | 58 | 21 | 0.888889 | 0.821538 | 0.927083 | 0.871126 |
| 6 | Flood Risk | 42 | 659 | 7 | 3 | 0.985935 | 0.857143 | 0.933333 | 0.893617 |
| 7 | A.I. R&D/Quantum R&D | 56 | 631 | 14 | 10 | 0.966245 | 0.8 | 0.848485 | 0.823529 |
| 8 | Global Health | 25 | 682 | 4 | 0 | 0.994374 | 0.862069 | 1.0 | 0.925926 |
| 9 | Homelessness | 28 | 675 | 3 | 5 | 0.988748 | 0.903226 | 0.848485 | 0.875 |
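
To see where the two runs differ most, the F1 columns of the two result tables can be compared directly. The sketch below copies the ten F1 values visible in each table (the gpt-3.5-turbo run with lessons learned, and the gpt-4 run with nearest-neighbor examples) and ranks the categories by change. The dictionary names are illustrative, and because the runs differ in both model and prompt, the differences cannot be attributed to the model alone.

```python
# F1 per category, copied from the two result tables
f1_gpt35_lessons = {
    "Broadband": 0.775, "Economic Development": 0.590909,
    "Opioid Epidemic Response": 0.790123, "STEM Education": 0.6,
    "Workforce Development": 0.491667, "Native American": 0.718447,
    "Flood Risk": 0.704762, "A.I. R&D/Quantum R&D": 0.672897,
    "Global Health": 0.368421, "Homelessness": 0.788732,
}
f1_gpt4 = {
    "Broadband": 0.815789, "Economic Development": 0.752137,
    "Opioid Epidemic Response": 0.775, "STEM Education": 0.806452,
    "Workforce Development": 0.730337, "Native American": 0.871126,
    "Flood Risk": 0.893617, "A.I. R&D/Quantum R&D": 0.823529,
    "Global Health": 0.925926, "Homelessness": 0.875,
}

# Positive delta means the gpt-4 run scored higher
deltas = {c: round(f1_gpt4[c] - f1_gpt35_lessons[c], 6) for c in f1_gpt35_lessons}
for category, delta in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{category:28s} {delta:+.3f}")
```

On these rows the gpt-4 run scores higher in nine of the ten categories, with the largest gain in Global Health and a small drop in Opioid Epidemic Response.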
