## Using a Large Language Model and a Vector Database for Categorization

### Set OPENAI_API_KEY before running Jupyter
Create an API key at https://platform.openai.com/account/api-keys, then export it in the shell that will launch Jupyter:

    export OPENAI_API_KEY="secret key from site"

In [1]:
import sys
print(sys.version)

3.10.11 (main, Apr  5 2023, 14:15:30) [GCC 7.5.0]


Authenticate with OpenAI using OPENAI_API_KEY

In [3]:
import os
import openai

# get API key from top-right dropdown on OpenAI website
openai.api_key = os.getenv("OPENAI_API_KEY") or "OPENAI_API_KEY"

# openai.Engine.list()  # check we have authenticated
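
The `or "OPENAI_API_KEY"` fallback above passes a placeholder string to the client when the variable is missing, so the failure only shows up at the first API call. A minimal sketch of a stricter alternative follows; `require_env` is a hypothetical helper name introduced here, not part of the `openai` or `pinecone` packages.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`; raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Possible usage in the cells above (assumes the variables were exported
# before Jupyter was started):
# openai.api_key = require_env("OPENAI_API_KEY")
# api_key = require_env("PINECONE_API_KEY")
```

Failing at this point gives a clearer error message than an authentication error raised later by the API.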

Connect to Pinecone with PINECONE_API_KEY and PINECONE_ENVIRONMENT

In [5]:
import pinecone
from tqdm import tqdm

api_key = os.getenv("PINECONE_API_KEY") or "PINECONE_API_KEY"
# find your environment next to the api key in pinecone console
env = os.getenv("PINECONE_ENVIRONMENT") or "PINECONE_ENVIRONMENT"

pinecone.init(api_key=api_key, environment=env)
# pinecone.whoami()

Create a sample embedding so that we know the embedding length

In [7]:
embed_model = "text-embedding-ada-002"

res = openai.Embedding.create(
    input=["This is sample test that will determine the length"],
    engine=embed_model
)

embedding_length = len(res['data'][0]['embedding'])
embedding_length

1536

In [9]:
index_name = 'openai-fpi-categorization'

# Create the index if it doesn't exist already
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        index_name,
        dimension=embedding_length,
        metric='cosine'
    )
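
The index is created with `metric='cosine'`, so matches are ranked by the angle between embedding vectors rather than by their length. As a reminder of what that metric computes, here is a small pure-Python sketch; it is an illustration only, not Pinecone's implementation, and `cosine_similarity` is a name introduced here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Vectors pointing the same way score 1 regardless of magnitude;
# orthogonal vectors score 0.
same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
print(same_direction, orthogonal)
```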

In [10]:
# connect to index
index = pinecone.Index(index_name)
# view index stats 
index_stats = index.describe_index_stats()
index_stats

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

## Populate the Vector Database
### Key Imports

In [12]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

### Import our Labeled Text into a DataFrame

In [16]:
# Read the CSV File into a dataframe

labeled_data = pd.read_csv('Federal_Program_Inventory_Pilot_Data.csv',encoding='cp1252',
                           usecols=['Agency','Program Name','Program Description','Mission/Purpose',
                                    'Recipients','Associated Categories'])

# Remove duplicate rows
labeled_data = labeled_data.drop_duplicates().reset_index(drop=True)

# Get the categories
categories = pd.read_csv('Federal_Program_Inventory_Pilot_Data.csv',encoding='cp1252',
                         usecols=['Category'])
categories = categories.drop_duplicates().reset_index(drop=True)
categories = [r["Category"] for _, r in categories.iterrows()]
categories

['Broadband',
 'Economic Development',
 'Opioid Epidemic Response',
 'STEM Education',
 'Workforce Development',
 'Native American',
 'Flood Risk',
 'A.I. R&D/Quantum R&D',
 'Global Health',
 'Homelessness',
 'HIV/AIDS',
 'Transportation Infrastructure']

### Let's create our comparison string and our category list

In [17]:
# Let's join the Agency, Program Name, Program Description, Mission/Purpose, and Recipients 
# into a single Text field 
labeled_data['text'] = labeled_data[['Agency','Program Name','Program Description','Mission/Purpose',
                                    'Recipients']].agg(' '.join, axis=1)

# Let's generate the list of categories for each entry
labeled_data['categories'] = labeled_data['Associated Categories'].str.split('; ')
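
To make the two derived columns concrete, the sketch below applies the same idea to a single record using only the standard library. The record is hypothetical, and `build_text` and `split_categories` are names introduced here. Unlike the pandas cell, this version skips missing fields instead of failing on them and trims surrounding whitespace, which is an assumption about how gaps in the CSV should be handled.

```python
FIELDS = ["Agency", "Program Name", "Program Description",
          "Mission/Purpose", "Recipients"]


def build_text(row):
    """Join the descriptive fields of one record into a single string."""
    parts = []
    for field in FIELDS:
        value = row.get(field)
        # Skip None, NaN (a float) and blank strings.
        if isinstance(value, str) and value.strip():
            parts.append(value.strip())
    return " ".join(parts)


def split_categories(raw):
    """Split a ';'-separated category string into a clean list."""
    if not isinstance(raw, str):
        return []
    return [c.strip() for c in raw.split(";") if c.strip()]


# Hypothetical record, for illustration only.
row = {
    "Agency": "Example Agency",
    "Program Name": "Example Program",
    "Program Description": "Provides example grants.",
    "Mission/Purpose": None,
    "Recipients": "Local governments",
    "Associated Categories": "Broadband; Economic Development",
}
print(build_text(row))
print(split_categories(row["Associated Categories"]))
```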

In [26]:
# Set up arrays of labeled data and calculate the embeddings
labeled_text = labeled_data['text'].tolist()
labeled_categories = labeled_data['categories'].tolist()
embeddings = openai.Embedding.create(input=labeled_text, engine=embed_model)

In [31]:
# Only upsert when the index is empty, so re-running this cell does not upload the vectors again
if index_stats.total_vector_count == 0:
    to_upsert = [('fpi-'+str(i), embeddings['data'][i]['embedding'], 
                  {'text':labeled_text[i], 'categories':labeled_categories[i]})
                 for i in range(len(labeled_text))]
    # Insert in groups of 50
    for i in range(0, len(to_upsert), 50):
        print(i)
        index.upsert(vectors=to_upsert[i:i+50])

0
50
100
150
200
250
300
350
400
450
500
550
600
650
700
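
The upsert loop above sends the vectors in slices of 50. The slicing can be factored into a small generator, sketched below; `chunked` is a hypothetical helper name, and the list of integers stands in for the 711 `(id, embedding, metadata)` tuples.

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Stand-in for the 711 vectors built in the cell above.
fake_vectors = list(range(711))
batches = list(chunked(fake_vectors, 50))
print(len(batches), len(batches[-1]))  # prints: 15 11

# Possible usage with the real data:
# for batch in chunked(to_upsert, 50):
#     index.upsert(vectors=batch)
```

The 15 batches correspond to the 15 offsets (0 through 700) printed by the loop, with the last batch holding the remaining 11 vectors.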


In [32]:
# view index stats 
index_stats = index.describe_index_stats()
index_stats

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 711}},
 'total_vector_count': 711}

In [34]:
# Let's try to search our index!
query = " economic and community development, empowering Appalachian communities to work with their state governments"
qe = openai.Embedding.create(input=[query], engine=embed_model)
res = index.query(qe['data'][0]['embedding'], top_k=5, include_metadata=True)
# res

In [35]:
res

{'matches': [{'id': 'fpi-2',
              'metadata': {'categories': ['Broadband',
                                          'Economic Development',
                                          'Opioid Epidemic Response',
                                          'Workforce Development'],
                           'text': 'Appalachian Regional Commission POWER '
                                   'Initiative The POWER (Partnerships for '
                                   'Opportunity and Workforce and Economic '
                                   'Revitalization) Initiative helps '
                                   'communities and regions that have been '
                                   'affected by job losses in coal mining, '
                                   'coal power plant operations, and '
                                   'coal-related supply chain industries due '
                                   'to the changing economics of America’s '
                              

In [85]:
labeled_text[24]

'Corps of Engineers--Civil Works Planning Assistance to States The Corps uses this funding to provide technical assistance to states, local governments, Indian tribes, and regional and interstate water resources authorities to assist them in their water resources planning efforts.  The Corps would use the requested funds for work related to flood risk management.  The Corps also is able under this program to provide technical analysis to support a broader effort by a state, regional, or interstate authority that is evaluating options involving a range of issues across a large watershed. Water resources U.S. Federal Government'

In [115]:
# Query the index with an already-labeled program (key) so the result can be checked against its known categories
key = 72
query = labeled_text[key]
qe = openai.Embedding.create(input=[query], engine=embed_model)
res = index.query(qe['data'][0]['embedding'], top_k=5, include_metadata=True)
# res

In [116]:
query = ("You are a natural language processing categorization tool. " + 
         "The following are pre-categorized programs and their known categories:\n\n")
for match in res['matches']:
    if (match['id'] != f"fpi-{key}"):  # EXCLUDE THE ACTUAL PROGRAM
        query += ("Program text: " + match['metadata']['text'] + "\n\n" +
                  "Categories: " + ', '.join(match['metadata']['categories']) + "\n\n" +
                  "---" + "\n\n")
query += ("Please use the above categorized programs to assign categories to the following program description: \n\n" +
          "Program text: " + labeled_text[key] + "\n\n" +
          "Output a JSON object with four entries: \n" +
          " 1) 'keywords' a string listing important keywords found in the program description that impact your decision\n" + 
          " 2) 'reasoning' a string describing your reasoning and justification of the categorizations\n" + 
          " 3) 'categories' an array of strings with your categories (the choices are ['" + "','".join(categories) + "'])\n" +
          " 4) 'confidence' a number from 0 to 100 representing your confidence (100 being the most confident)\n\n" +
          " Please note that a program may not fit into any of the categories.")

print(query)

You are a natural language processing categorization tool. The following are pre-categorized programs and their known categories:

Program text: Department of Agriculture Regional Food System Partnership Grants supports partnerships that connect public and private resources to plan and develop local or regional food systems Agricultural research and services Indian/Native American Tribal Government

Categories: Native American

---

Program text: Department of Agriculture Tribal College This program provides funding to 1994 Land Grant Institutions (Tribal Colleges) to make capital improvements to their educational facilities and to purchase equipment Community development Indian/Native American Tribal Government

Categories: Native American

---

Program text: Department of Agriculture Intermediary Relending Program The purpose of this program is to provide loans to intermediaries that establish revolving loan programs to provide loans to ultimate recipients for business lending. Area 
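
The prompt assembled above follows a few-shot pattern: retrieved neighbors with their known categories come first, then the program to classify. A sketch of the same assembly as a reusable function is shown below. `build_prompt` and its parameters are names introduced here, the instruction wording is abbreviated relative to the cell above, and `matches` is assumed to be a list of dicts shaped like the `res['matches']` entries shown earlier.

```python
def build_prompt(matches, program_text, categories, exclude_id=None):
    """Assemble a few-shot categorization prompt from nearest-neighbor matches."""
    parts = [
        "You are a natural language processing categorization tool. "
        "The following are pre-categorized programs and their known categories:\n\n"
    ]
    for match in matches:
        if match["id"] == exclude_id:
            continue  # do not show the model the program it is asked to classify
        meta = match["metadata"]
        parts.append(
            "Program text: " + meta["text"] + "\n\n"
            "Categories: " + ", ".join(meta["categories"]) + "\n\n"
            "---\n\n"
        )
    parts.append(
        "Please use the above categorized programs to assign categories "
        "to the following program description:\n\n"
        "Program text: " + program_text + "\n\n"
        "Output a JSON object with the entries 'keywords', 'reasoning', "
        "'categories' (choices: " + ", ".join(categories) + ") "
        "and 'confidence' (0 to 100).\n\n"
        "Please note that a program may not fit into any of the categories."
    )
    return "".join(parts)


# Possible usage with the notebook's variables:
# query = build_prompt(res['matches'], labeled_text[key], categories,
#                      exclude_id=f"fpi-{key}")
```

Passing `exclude_id` keeps the leave-one-out check explicit: when the query text is itself in the index, its own labeled entry is dropped from the examples.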

In [117]:
# Send the assembled prompt to gpt-3.5-turbo as a single user message
chat = openai.ChatCompletion.create(
    model='gpt-3.5-turbo',
    messages=[
        {"role": "user", "content" : query}
    ]
)

print(chat['choices'][0]['message']['content'])

{
  "keywords": "local agriculture, regional food business, intermediaries, producer to consumer marketing, agricultural products",
  "reasoning": "Based on the keywords found in the program description, it is clear that this program focuses on supporting local and regional food business enterprises, intermediaries, and increasing access to and availability of locally and regionally produced agricultural products. This aligns with the categories of Economic Development and Native American, as the program aims to promote economic growth in the agriculture sector and supports the development of local and regional food systems, which could include Native American communities.",
  "categories": ["Economic Development", "Native American"],
  "confidence": 95
}
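
The reply is plain text that happens to contain JSON, so it is worth parsing and validating before using it. The sketch below assumes the model may wrap the object in extra prose and that any category outside the allowed list should be rejected; `parse_categorization` is a hypothetical helper name.

```python
import json


def parse_categorization(raw, allowed_categories):
    """Extract the JSON object from a model reply and validate its categories."""
    start = raw.find("{")
    end = raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in the reply")
    result = json.loads(raw[start:end + 1])
    predicted = result.get("categories", [])
    if not isinstance(predicted, list):
        raise ValueError("'categories' must be a list")
    unknown = [c for c in predicted if c not in allowed_categories]
    if unknown:
        raise ValueError(f"unexpected categories: {unknown}")
    return result


# Possible usage with the notebook's variables:
# result = parse_categorization(chat['choices'][0]['message']['content'], categories)
# result['categories'], result['confidence']
```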


In [118]:
labeled_categories[key]

['Native American']
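
Here the model returned one extra category ("Economic Development") on top of the labeled one. One simple way to quantify such a comparison is set-based precision and recall, sketched below; `category_scores` is a name introduced here and this scoring choice is an assumption, not something the notebook itself computes.

```python
def category_scores(predicted, actual):
    """Return (precision, recall) for two collections of category labels."""
    predicted, actual = set(predicted), set(actual)
    if not predicted and not actual:
        return 1.0, 1.0  # both agree that no category applies
    overlap = predicted & actual
    precision = len(overlap) / len(predicted) if predicted else 0.0
    recall = len(overlap) / len(actual) if actual else 0.0
    return precision, recall


print(category_scores(["Economic Development", "Native American"],
                      ["Native American"]))  # prints: (0.5, 1.0)
```

For this program, every labeled category was found (recall 1.0), but only half of the predicted categories were correct (precision 0.5).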