## Using a Large Language Model and a Vector Database for Categorization

### Set OPENAI_API_KEY before running Jupyter
Create an API key at https://platform.openai.com/account/api-keys, then export it in the shell that launches Jupyter:

    export OPENAI_API_KEY="secret key from site"
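
The cells in this notebook fall back to placeholder strings when a variable is unset, so a missing key only shows up later as an authentication error. A minimal sketch of an up-front check follows; `missing_env_vars` and `REQUIRED_VARS` are hypothetical names introduced here, not part of the notebook.

```python
import os

# Variables the notebook reads with os.getenv()
REQUIRED_VARS = ["OPENAI_API_KEY", "PINECONE_API_KEY", "PINECONE_ENVIRONMENT"]

def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]

missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Running this in the first cell makes a forgotten `export` visible before any API call is attempted.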

In [1]:
import sys
print(sys.version)

3.10.11 (main, Apr  5 2023, 14:15:30) [GCC 7.5.0]


Authenticate with OpenAI using OPENAI_API_KEY

In [2]:
import os
import openai

# get API key from top-right dropdown on OpenAI website
openai.api_key = os.getenv("OPENAI_API_KEY") or "OPENAI_API_KEY"

# openai.Engine.list()  # check we have authenticated

Connect to Pinecone with PINECONE_API_KEY and PINECONE_ENVIRONMENT

In [4]:
import pinecone
from tqdm import tqdm

api_key = os.getenv("PINECONE_API_KEY") or "PINECONE_API_KEY"
# find your environment next to the api key in pinecone console
env = os.getenv("PINECONE_ENVIRONMENT") or "PINECONE_ENVIRONMENT"

pinecone.init(api_key=api_key, environment=env)
# pinecone.whoami()

Create a sample embedding so that we know the embedding length

In [6]:
embed_model = "text-embedding-ada-002"

res = openai.Embedding.create(
    input=["This is a sample text that will determine the length"],
    engine=embed_model
)

embedding_length = len(res['data'][0]['embedding'])
embedding_length

1536

In [9]:
index_name = 'openai-fpi-categorization'

# Create the index if it doesn't exist already
if index_name not in pinecone.list_indexes():
    # if does not exist, create index
    pinecone.create_index(
        index_name,
        dimension=embedding_length,
        metric='cosine'
    )
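
The index above is created with `metric='cosine'`, so matches are ranked by the cosine of the angle between embedding vectors rather than by their distance. As a rough illustration of that measure (a plain-Python sketch, not Pinecone's implementation), with `cosine_similarity` as a hypothetical helper:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # same direction, different length
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # orthogonal vectors
```

Because the measure ignores vector length, two texts whose embeddings point the same way score as similar regardless of magnitude.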

In [10]:
# connect to index
index = pinecone.Index(index_name)
# view index stats 
index_stats = index.describe_index_stats()
index_stats

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

## Populate the Vector Database
### Key Imports

In [11]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

### Import our Labeled Text into a DataFrame

In [12]:
# Read the CSV File into a dataframe

labeled_data = pd.read_csv('Federal_Program_Inventory_Pilot_Data.csv',encoding='cp1252',
                           usecols=['Agency','Program Name','Program Description','Mission/Purpose',
                                    'Recipients','Beneficiaries','Associated Categories'])

# Remove duplicate rows
labeled_data = labeled_data.drop_duplicates().reset_index(drop=True)

# Get the categories
categories = pd.read_csv('Federal_Program_Inventory_Pilot_Data.csv',encoding='cp1252',
                         usecols=['Category'])
categories = categories.drop_duplicates().reset_index(drop=True)
categories = [r["Category"] for _, r in categories.iterrows()]
categories

['Broadband',
 'Economic Development',
 'Opioid Epidemic Response',
 'STEM Education',
 'Workforce Development',
 'Native American',
 'Flood Risk',
 'A.I. R&D/Quantum R&D',
 'Global Health',
 'Homelessness',
 'HIV/AIDS',
 'Transportation Infrastructure']

### Let's create our comparison string and our category list

In [13]:
# Let's join the Program Name, Agency, Mission/Purpose, Program Description, Recipients,
# and Beneficiaries into a single text field (make it human readable for the LLM)
labeled_data['text'] = ("Program Name: " + labeled_data['Program Name'] + "\n" +
                        "Agency: " + labeled_data['Agency'] + "\n" +
                        "Mission/Purpose: " + labeled_data['Mission/Purpose'] + "\n" +
                        "Program Description: " + labeled_data['Program Description'] + "\n" +
                        "Recipients: " + labeled_data['Recipients'] + "\n" +
                        "Beneficiaries: " + labeled_data['Beneficiaries'])

# Let's generate the list of categories for each entry
labeled_data['categories'] = labeled_data['Associated Categories'].str.split('; ')

In [15]:
print(labeled_data['text'][2])

Program Name: POWER Initiative
Agency: Appalachian Regional Commission
Mission/Purpose: Area and regional development
Program Description: The POWER (Partnerships for Opportunity and Workforce and Economic Revitalization) Initiative helps communities and regions that have been affected by job losses in coal mining, coal power plant operations, and coal-related supply chain industries due to the changing economics of America’s energy production. The POWER Initiative funds projects that cultivate economic diversity, enhance job training and re-employment opportunities, create jobs in existing or new industries, attract new sources of investment, and strengthen the Region’s broadband infrastructure.
Recipients: U.S. State Government (including the District of Columbia); Public/State Controlled Institution; Nonprofit with 501(c)(3) Status; Domestic Local Government (includes territories unless otherwise specified)
Beneficiaries: State; Rural; Other public institution/organization; Public n

In [17]:
# Set up arrays of labeled data and calculate the embeddings
labeled_text = labeled_data['text'].tolist()
labeled_categories = labeled_data['categories'].tolist()
embeddings = openai.Embedding.create(input=labeled_text, engine=embed_model)

In [18]:
if (index_stats.total_vector_count == 0):
    to_upsert = [('fpi-'+str(i), embeddings['data'][i]['embedding'], 
                  {'text':labeled_text[i], 'categories':labeled_categories[i]})
                 for i in range(len(labeled_text))]
    # Insert in groups of 50
    for i in range(0, len(to_upsert), 50):
        print(i)
        index.upsert(vectors=to_upsert[i:i+50])

0
50
100
150
200
250
300
350
400
450
500
550
600
650
700
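
The upsert loop above slices the vector list into groups of 50. The same pattern can be factored into a small generator; `chunked` is a hypothetical helper, and the list of ids below only stands in for the real `(id, embedding, metadata)` tuples.

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]

# 711 placeholder ids, matching the number of programs in this notebook
vectors = [f"fpi-{i}" for i in range(711)]
batches = list(chunked(vectors, 50))
print(len(batches), len(batches[-1]))  # prints: 15 11
```

Fifteen batches correspond to the fifteen offsets printed above (0 through 700), with the last batch holding the remaining 11 vectors.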


In [19]:
# view index stats 
index_stats = index.describe_index_stats()
index_stats

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 711}},
 'total_vector_count': 711}

In [20]:
# Let's try searching our index!
query = " economic and community development, empowering Appalachian communities to work with their state governments"
qe = openai.Embedding.create(input=[query], engine=embed_model)
res = index.query(qe['data'][0]['embedding'], top_k=5, include_metadata=True)
# res

In [21]:
# res

In [22]:
labeled_text[24]

'Program Name: Planning Assistance to States\nAgency: Corps of Engineers--Civil Works\nMission/Purpose: Water resources\nProgram Description: The Corps uses this funding to provide technical assistance to states, local governments, Indian tribes, and regional and interstate water resources authorities to assist them in their water resources planning efforts.  The Corps would use the requested funds for work related to flood risk management.  The Corps also is able under this program to provide technical analysis to support a broader effort by a state, regional, or interstate authority that is evaluating options involving a range of issues across a large watershed.\nRecipients: U.S. Federal Government\nBeneficiaries: State; U.S. Citizen; U.S. Territories; Local'

In [67]:
# Let's try searching our index with one of the labeled programs as the query
key = 168
query = labeled_text[key]
qe = openai.Embedding.create(input=[query], engine=embed_model)
res = index.query(qe['data'][0]['embedding'], top_k=5, include_metadata=True)
# res

In [68]:
query = ("You are a natural language processing categorization tool.\n" +
         "Given examples of nearest-neighbor US grant programs and their categories, you will assign " +
         "zero to many categories to a new grant program.\n" +
         "The category choices are ['" + "','".join(categories) + "']\n" +
         "The following are pre-categorized programs and their known categories:\n\n")
for match in res['matches']:
    if (match['id'] != f"fpi-{key}"):  # EXCLUDE THE ACTUAL PROGRAM
        query += (match['metadata']['text'] + "\n\n" +
                  "Categories: " + ', '.join(match['metadata']['categories']) + "\n\n" +
                  "---" + "\n\n")
query += ("Please use the above categorized programs to assign categories to the following program description: \n\n" +
          "Program text: " + labeled_text[key] + "\n\n" +
          "Output a JSON object with four entries: \n" +
          " 1) 'keywords' a string listing important keywords found in the program description that impact your decision\n" +
          " 2) 'reasoning' a string describing your reasoning and justification of the categorizations\n" +
          " 3) 'categories' an array of strings with your categories (the choices are ['" + "','".join(categories) + "'])\n" +
          " 4) 'confidence' a number from 0 to 100 representing your confidence (100 being the most confident)\n\n" +
          " Please note that a program may not fit into any of the categories.")

print(query)

You are a natural language processing categorization tool.
Given examples of nearest-neighbor US grant programs and their categories, you will assign zero to many categories to a new grant program.
The category choices are ['Broadband','Economic Development','Opioid Epidemic Response','STEM Education','Workforce Development','Native American','Flood Risk','A.I. R&D/Quantum R&D','Global Health','Homelessness','HIV/AIDS','Transportation Infrastructure']
The following are pre-categorized programs and their known categories:

Program Name: Student Support and Academic Enrichment State Grants
Agency: Department of Education
Mission/Purpose: Elementary, secondary, and vocational education
Program Description: The program supports formula grants that are intended to improve academic achievement by increasing the capacity of States and local educational agencies (LEAs) to provide students with access to a well-rounded education and improve school conditions and use of technology.
Recipients: U.

In [69]:
# Now query GPT-3.5 (temperature=0 keeps the output as repeatable as possible)
chat = openai.ChatCompletion.create(
    model='gpt-3.5-turbo',
    temperature=0,
    messages=[
        {"role": "user", "content" : query}
    ]
)

print(chat['choices'][0]['message']['content'])

{
  "keywords": "low-income students, postsecondary education",
  "reasoning": "Based on the program description, the focus of the grant program is to increase the number of low-income students who are prepared for and succeed in postsecondary education. This aligns with the mission/purpose of higher education. Additionally, the beneficiaries include Education (0-8) and Education (9-12), indicating a focus on students at various educational levels.",
  "categories": ["STEM Education"],
  "confidence": 80
}


In [70]:
labeled_categories[key]

['Native American']

In [34]:
import json

# Gets the categories for `text`, excluding the program stored under `key`
# openai, index, categories, embed_model are global
def llm_vd_get_categories(text, key, top_k=5):
    # First get the related hits from Pinecone
    qe = openai.Embedding.create(input=[text], engine=embed_model)
    res = index.query(qe['data'][0]['embedding'], top_k=top_k, include_metadata=True)
    # Construct the query
    query = ("You are a natural language processing categorization tool.\n" +
             "Given examples of nearest-neighbor US grant programs and their categories, you will assign " +
             "zero to many categories to a new grant program.\n" +
             "The category choices are ['" + "','".join(categories) + "']\n" +
             "The following are pre-categorized programs and their known categories:\n\n")
    for match in res['matches']:
        if (match['id'] != f"fpi-{key}"):  # EXCLUDE THE ACTUAL PROGRAM
            query += (match['metadata']['text'] + "\n\n" +
                      "Categories: " + ', '.join(match['metadata']['categories']) + "\n\n" +
                      "---" + "\n\n")
    # Use the `text` argument (not labeled_text[key]) so the function also works for unlabeled text
    query += ("Please use the above categorized programs to assign categories to the following program description: \n\n" +
              "Program text: " + text + "\n\n" +
              "Output a JSON object with four entries: \n" +
              " 1) 'keywords' a string listing important keywords found in the program description that impact your decision\n" +
              " 2) 'reasoning' a string describing your reasoning and justification of the categorizations\n" +
              " 3) 'categories' an array of strings with your categories (the choices are ['" + "','".join(categories) + "'])\n" +
              " 4) 'confidence' a number from 0 to 100 representing your confidence (100 being the most confident)\n\n" +
              " Please note that a program may not fit into any of the categories.")
    # print(query)
    chat = openai.ChatCompletion.create(
        model='gpt-3.5-turbo',
        messages=[{"role": "user", "content" : query}]
    )
    # The reply is expected to be a bare JSON object; json.loads raises otherwise
    return json.loads(chat['choices'][0]['message']['content'])
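
The function above passes the model's reply straight to `json.loads`, which raises if the model wraps the JSON in extra prose. A hedged sketch of a more tolerant parser follows; `parse_llm_json` is a hypothetical helper, and the sample reply is made up for illustration.

```python
import json

def parse_llm_json(reply, allowed_categories):
    """Parse the text between the first '{' and the last '}' in `reply`
    and keep only categories that appear in `allowed_categories`."""
    start = reply.find("{")
    end = reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in reply")
    parsed = json.loads(reply[start:end + 1])
    raw = parsed.get("categories")
    if not isinstance(raw, list):
        raw = []  # treat a missing or malformed field as "no categories"
    allowed = set(allowed_categories)
    parsed["categories"] = [c for c in raw if c in allowed]
    return parsed

# Made-up reply with surrounding prose and one category outside the allowed list
sample = 'Result:\n{"keywords": "flood", "categories": ["Flood Risk", "Water"], "confidence": 90}\nDone.'
print(parse_llm_json(sample, ["Flood Risk", "Broadband"])["categories"])  # prints: ['Flood Risk']
```

Filtering against the allowed list mirrors the `set(categories).intersection(...)` step used later when scoring the results.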

In [35]:
llm_vd_get_categories(labeled_text[24],24)

{'keywords': 'technical assistance, water resources planning, flood risk management',
 'reasoning': 'Based on the keywords found in the program description, it is clear that the program focuses on providing technical assistance for water resources planning, specifically related to flood risk management.',
 'categories': ['Flood Risk'],
 'confidence': 100}

In [130]:
found_categories=[]
if os.path.exists("found_categories.json"):
    with open("found_categories.json","r") as infile:
        found_categories = json.load(infile)
        
# found_categories

In [131]:
# The following loop is resumable: it skips programs already saved in found_categories.json,
# so the categories for all of the FPI programs can be calculated over several runs
for i in range(0,len(labeled_text)):
    if (i >= len(found_categories)):
        chat_result = llm_vd_get_categories(labeled_text[i],i)
        chat_result["key"] = i # Record key just in case
        print(chat_result)
        found_categories.append(chat_result)
        with open("found_categories.json","w") as outfile:
            outfile.write(json.dumps(found_categories))
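
The load-then-loop pattern above can be sketched as one reusable function. This is an illustration under assumptions: `run_resumable` is a hypothetical helper, the lambda stands in for the categorization call, and the write-then-rename step is an addition that the notebook does not use.

```python
import json
import os
import tempfile

def run_resumable(items, process, checkpoint_path):
    """Process items in order, saving results after each one so a rerun can resume."""
    results = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "r") as infile:
            results = json.load(infile)
    for i in range(len(results), len(items)):
        results.append(process(items[i], i))
        # Write to a temporary file first, then rename, so an interruption
        # mid-write cannot leave a truncated checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as outfile:
            json.dump(results, outfile)
        os.replace(tmp_path, checkpoint_path)
    return results

# Usage with a stand-in for the categorization call
with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "found_categories.json")
    first = run_resumable(["a", "b"], lambda text, i: {"key": i}, path)
    second = run_resumable(["a", "b", "c"], lambda text, i: {"key": i}, path)
    print(len(first), len(second))  # prints: 2 3
```

On the second call only the third item is processed, because the first two results are read back from the checkpoint file.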
   

In [132]:
# Count false positives, false negatives, true positives and true negatives
fp_total = 0
fn_total = 0
tp_total = 0
tn_total = 0
total_categories = len(set(categories))
for fc in found_categories:
    i = fc["key"]
    real_categories = set(labeled_categories[i])
    assigned_categories = set(categories).intersection(set(fc["categories"]))
#    print(f"{i}: {real_categories} -> {assigned_categories}")
#    print(f" TP={len(real_categories.intersection(assigned_categories))}")
    tp = len(real_categories.intersection(assigned_categories))
    tp_total += tp
#    print(f" FP={len(assigned_categories.difference(real_categories))}")
    fp = len(assigned_categories.difference(real_categories))
    fp_total += fp
#    print(f" FN={len(real_categories.difference(assigned_categories))}")
    fn = len(real_categories.difference(assigned_categories))
    fn_total += fn
#    print(f" TN={total_categories - tp - fp - fn}")
    tn = total_categories - tp - fp - fn
    tn_total += tn
#    print(f" {tp_total} {fp_total} {fn_total} {tn_total}")
    

In [133]:
accuracy = (tp_total+tn_total)/(tp_total+tn_total+fp_total+fn_total)
precision = tp_total/(tp_total + fp_total)
recall = tp_total/(tp_total + fn_total)
f1 = 2.0 / ((1/precision) + (1/recall))
print(f"Accuracy: {accuracy*100}")
print(f"Precision: {precision*100}")
print(f"Recall: {recall*100}")
print(f"f1: {f1*100}")


Accuracy: 96.24941397093296
Precision: 79.31844888366626
Recall: 82.41758241758241
f1: 80.83832335329342
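
The counts above treat every (program, category) pair as one yes/no decision and pool them before computing the metrics. A small worked example with made-up labels may make the set arithmetic easier to follow; `confusion_counts` is a hypothetical helper that mirrors the loop's logic.

```python
def confusion_counts(true_labels, predicted_labels, all_labels):
    """Return (tp, fp, fn, tn) for one item in a multi-label setting."""
    true_set = set(true_labels)
    pred_set = set(predicted_labels) & set(all_labels)  # ignore unknown labels
    tp = len(true_set & pred_set)   # assigned and correct
    fp = len(pred_set - true_set)   # assigned but not in the real labels
    fn = len(true_set - pred_set)   # real labels that were missed
    tn = len(set(all_labels)) - tp - fp - fn  # everything else
    return tp, fp, fn, tn

# Made-up example with four possible categories
all_labels = ["Broadband", "Flood Risk", "Homelessness", "STEM Education"]
tp, fp, fn, tn = confusion_counts(["Broadband", "Flood Risk"],
                                  ["Flood Risk", "STEM Education"],
                                  all_labels)
print(tp, fp, fn, tn)        # prints: 1 1 1 1
print(tp / (tp + fp))        # precision, prints: 0.5
print(tp / (tp + fn))        # recall, prints: 0.5
```

Because most categories do not apply to most programs, true negatives dominate the pooled counts, which is why accuracy is much higher than precision or recall in the results above.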


In [134]:
# Calculate and display Accuracy, Precision, and Recall for each category
precision_data = []
fp_total = 0
fn_total = 0
tp_total = 0
tn_total = 0
for c in categories:
    in_cat = [(c in lc) for lc in labeled_categories]
    assigned_cat = [(c in fc["categories"]) for fc in found_categories]
    tp = ([(a and b) for a, b in zip(in_cat, assigned_cat)]).count(True)
    tn = ([(not(a) and not(b)) for a, b in zip(in_cat, assigned_cat)]).count(True)
    fp = ([(not(a) and b) for a, b in zip(in_cat, assigned_cat)]).count(True)
    fn = ([(a and not(b)) for a, b in zip(in_cat, assigned_cat)]).count(True)
    accuracy = (tp+tn)/(tp+tn+fp+fn)
    precision = tp/(tp + fp)
    recall = tp/(tp + fn)
    f1 = 2.0 / ((1/precision) + (1/recall))
    precision_data.append([c,tp,tn,fp,fn,accuracy,precision,recall,f1])
    tp_total += tp
    tn_total += tn
    fp_total += fp
    fn_total += fn
    
accuracy = (tp_total+tn_total)/(tp_total+tn_total+fp_total+fn_total)
precision = tp_total/(tp_total + fp_total)
recall = tp_total/(tp_total + fn_total)
f1 = 2.0 / ((1/precision) + (1/recall))
precision_data.append(["All Categories",tp_total,tn_total,fp_total,fn_total,accuracy,precision,recall,f1])

df = pd.DataFrame(precision_data, columns = ['Category','True Pos','True Neg','False Pos',
                                             'False Neg','Accuracy','Precision','Recall','F1'])
df

|   | Category | True Pos | True Neg | False Pos | False Neg | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Broadband | 28 | 665 | 9 | 9 | 0.974684 | 0.756757 | 0.756757 | 0.756757 |
| 1 | Economic Development | 41 | 639 | 27 | 4 | 0.956399 | 0.602941 | 0.911111 | 0.725664 |
| 2 | Opioid Epidemic Response | 35 | 660 | 5 | 11 | 0.977496 | 0.875 | 0.76087 | 0.813953 |
| 3 | STEM Education | 72 | 600 | 23 | 16 | 0.945148 | 0.757895 | 0.818182 | 0.786885 |
| 4 | Workforce Development | 62 | 597 | 35 | 17 | 0.926864 | 0.639175 | 0.78481 | 0.704545 |
| 5 | Native American | 225 | 406 | 17 | 63 | 0.887482 | 0.929752 | 0.78125 | 0.849057 |
| 6 | Flood Risk | 40 | 658 | 8 | 5 | 0.981716 | 0.833333 | 0.888889 | 0.860215 |
| 7 | A.I. R&D/Quantum R&D | 52 | 638 | 7 | 14 | 0.970464 | 0.881356 | 0.787879 | 0.832 |
| 8 | Global Health | 25 | 677 | 9 | 0 | 0.987342 | 0.735294 | 1.0 | 0.847458 |
| 9 | Homelessness | 31 | 667 | 11 | 2 | 0.981716 | 0.738095 | 0.939394 | 0.826667 |
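
The per-category cell divides by `tp + fp` and `tp + fn` directly, which raises `ZeroDivisionError` for a category that is never assigned or never present. One possible guard is sketched below; `safe_ratio` and `category_metrics` are hypothetical helpers, and F1 is written in its equivalent count form 2·TP / (2·TP + FP + FN).

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None

def category_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from confusion counts."""
    return {
        "accuracy": safe_ratio(tp + tn, tp + tn + fp + fn),
        "precision": safe_ratio(tp, tp + fp),
        "recall": safe_ratio(tp, tp + fn),
        "f1": safe_ratio(2 * tp, 2 * tp + fp + fn),
    }

# Counts from the Broadband row above
broadband = category_metrics(tp=28, tn=665, fp=9, fn=9)
print(round(broadband["precision"], 6))  # prints: 0.756757

# A category with no real and no assigned programs yields None instead of an error
print(category_metrics(tp=0, tn=711, fp=0, fn=0)["precision"])  # prints: None
```

Returning `None` keeps undefined values visible in the resulting DataFrame instead of silently reporting them as zero.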


In [139]:
categories

['Broadband',
 'Economic Development',
 'Opioid Epidemic Response',
 'STEM Education',
 'Workforce Development',
 'Native American',
 'Flood Risk',
 'A.I. R&D/Quantum R&D',
 'Global Health',
 'Homelessness',
 'HIV/AIDS',
 'Transportation Infrastructure']

In [146]:
# Export key information to a spreadsheet for analysis
data = [[i, lt, sorted(lc), sorted(fc["categories"]), fc["keywords"], 
         fc["reasoning"]] for (i, lt, lc, fc) in 
        zip(range(0,len(labeled_text)), labeled_text, labeled_categories, found_categories)]
                                   
df = pd.DataFrame(data, columns = ['idx','Text','real_categories','found_categories','keywords','reasoning'])
df.to_csv("llm_vd_results.csv")

In [8]:
# When we are done, delete the index
# pinecone.delete_index(index_name)

## How Accurate is Just the Nearest Neighbor (Vector Database)?

In [103]:
# Gets the categories from the nearest neighbor
def vd_get_categories(text, key):
    # Get the related hits from Pinecone; top_k=2 because the closest hit
    # is normally the program itself, which is skipped below
    qe = openai.Embedding.create(input=[text], engine=embed_model)
    res = index.query(qe['data'][0]['embedding'], top_k=2, include_metadata=True)
    c = []  # Default when no other program is returned
    for match in res['matches']:
        if (match['id'] != f"fpi-{key}"):  # EXCLUDE THE ACTUAL PROGRAM
            c = match['metadata']['categories'] # Grab the categories from the first match
            break
    return c
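
The baseline above copies the categories of the closest other program. The same leave-one-out idea can be sketched without Pinecone, using a brute-force scan over toy vectors; `nearest_neighbor_categories` and the two-dimensional vectors are illustrative assumptions, not the notebook's data.

```python
import math

def nearest_neighbor_categories(query_vec, query_id, records):
    """Return the categories of the most similar record, skipping the query itself.

    `records` is a list of (id, vector, categories) tuples.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

    best_score, best_categories = None, []
    for record_id, vector, categories in records:
        if record_id == query_id:
            continue  # leave-one-out: never match a program against itself
        score = cosine(query_vec, vector)
        if best_score is None or score > best_score:
            best_score, best_categories = score, categories
    return best_categories

# Toy two-dimensional "embeddings" for illustration only
records = [
    ("fpi-0", [1.0, 0.0], ["Broadband"]),
    ("fpi-1", [0.9, 0.1], ["Broadband", "Economic Development"]),
    ("fpi-2", [0.0, 1.0], ["Flood Risk"]),
]
print(nearest_neighbor_categories([1.0, 0.0], "fpi-0", records))
# prints: ['Broadband', 'Economic Development']
```

The query vector is identical to `fpi-0`, so without the id check it would simply retrieve its own labels, which is why both this sketch and the notebook exclude the program being scored.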


In [121]:
found_vd_categories=[]
if os.path.exists("found_vd_categories.json"):
    with open("found_vd_categories.json","r") as infile:
        found_vd_categories = json.load(infile)

In [122]:
# The following loop is resumable: it skips programs already saved in found_vd_categories.json,
# so the categories for all of the FPI programs can be calculated over several runs
for i in range(0,len(labeled_text)):
    if (i >= len(found_vd_categories)):
        cat = vd_get_categories(labeled_text[i],i)
        print(f"{i} {cat}")
        found_vd_categories.append(cat)
        with open("found_vd_categories.json","w") as outfile:
            outfile.write(json.dumps(found_vd_categories))

In [123]:
# Calculate and display Accuracy, Precision, and Recall for each category
precision_data = []
fp_total = 0
fn_total = 0
tp_total = 0
tn_total = 0
for c in categories:
    in_cat = [(c in lc) for lc in labeled_categories]
    assigned_cat = [(c in fc) for fc in found_vd_categories]
    tp = ([(a and b) for a, b in zip(in_cat, assigned_cat)]).count(True)
    tn = ([(not(a) and not(b)) for a, b in zip(in_cat, assigned_cat)]).count(True)
    fp = ([(not(a) and b) for a, b in zip(in_cat, assigned_cat)]).count(True)
    fn = ([(a and not(b)) for a, b in zip(in_cat, assigned_cat)]).count(True)
    accuracy = (tp+tn)/(tp+tn+fp+fn)
    precision = tp/(tp + fp)
    recall = tp/(tp + fn)
    f1 = 2.0 / ((1/precision) + (1/recall))
    precision_data.append([c,tp,tn,fp,fn,accuracy,precision,recall,f1])
    tp_total += tp
    tn_total += tn
    fp_total += fp
    fn_total += fn
    
accuracy = (tp_total+tn_total)/(tp_total+tn_total+fp_total+fn_total)
precision = tp_total/(tp_total + fp_total)
recall = tp_total/(tp_total + fn_total)
f1 = 2.0 / ((1/precision) + (1/recall))
precision_data.append(["All Categories",tp_total,tn_total,fp_total,fn_total,accuracy,precision,recall,f1])

df = pd.DataFrame(precision_data, columns = ['Category','True Pos','True Neg','False Pos',
                                             'False Neg','Accuracy','Precision','Recall','F1'])
df

|   | Category | True Pos | True Neg | False Pos | False Neg | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Broadband | 27 | 666 | 8 | 10 | 0.974684 | 0.771429 | 0.72973 | 0.75 |
| 1 | Economic Development | 38 | 655 | 11 | 7 | 0.974684 | 0.77551 | 0.844444 | 0.808511 |
| 2 | Opioid Epidemic Response | 35 | 657 | 8 | 11 | 0.973277 | 0.813953 | 0.76087 | 0.786517 |
| 3 | STEM Education | 60 | 601 | 22 | 28 | 0.929677 | 0.731707 | 0.681818 | 0.705882 |
| 4 | Workforce Development | 50 | 610 | 22 | 29 | 0.92827 | 0.694444 | 0.632911 | 0.662252 |
| 5 | Native American | 259 | 382 | 41 | 29 | 0.901547 | 0.863333 | 0.899306 | 0.880952 |
| 6 | Flood Risk | 43 | 658 | 8 | 2 | 0.985935 | 0.843137 | 0.955556 | 0.895833 |
| 7 | A.I. R&D/Quantum R&D | 51 | 625 | 20 | 15 | 0.950774 | 0.71831 | 0.772727 | 0.744526 |
| 8 | Global Health | 21 | 685 | 1 | 4 | 0.992968 | 0.954545 | 0.84 | 0.893617 |
| 9 | Homelessness | 28 | 671 | 7 | 5 | 0.983122 | 0.8 | 0.848485 | 0.823529 |
