## NAICS Code Generation

This notebook demonstrates how to use SuperPipe to generate North American Industry Classification System (NAICS) codes for businesses based on name and address.

We are given a list of business names and addresses. Each record looks like this:

```
{
  "name": "",
  "street": "",
  "city": "",
  "state": "",
  "zip": {}
}
```

The objective is to assign each business the NAICS code that identifies its industry. For example, the NAICS code for the business above might be:

`311811 - Retail Bakeries`

The challenge is choosing the correct code for each business out of the many industries that the NAICS taxonomy covers.
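
NAICS codes are hierarchical: the first two digits identify the sector, and each additional digit narrows the classification until the full six-digit code names a national industry. The sketch below splits a code into those prefixes. The helper name `naics_levels` is hypothetical and is not part of SuperPipe.

```python
def naics_levels(code):
    """Split a six-digit NAICS code into its hierarchy prefixes."""
    digits = str(code)
    if len(digits) != 6 or not digits.isdigit():
        raise ValueError(f"expected a six-digit NAICS code, got {code!r}")
    return {
        "sector": digits[:2],
        "subsector": digits[:3],
        "industry_group": digits[:4],
        "naics_industry": digits[:5],
        "national_industry": digits,
    }


levels = naics_levels(311811)
print(levels["sector"], levels["industry_group"])  # prints: 31 3118
```

Comparing prefixes like these is one way to tell a near miss (right sector, wrong industry) from a completely wrong prediction.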


### Approach

We'll implement the following multi-step approach:

1. Run a Google search for the company's name and address

2. Feed the name and the search results from Step 1 into an LLM and ask it for the 3 most likely NAICS codes

3. Feed the name, search results and the 3 most likely NAICS codes from Step 2 into an LLM and ask it for the most likely NAICS code along with its reasoning.
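
Before bringing in SuperPipe, the three steps above can be sketched as plain function composition. Everything here is an illustrative stand-in: `classify_business` and the `fake_*` callables are hypothetical, and the business and returned values are made up so the sketch runs without network or LLM access.

```python
def classify_business(business, search, pick_top3, pick_best):
    # Step 1: look the business up by name and address.
    results = search(f"{business['name']} {business['address']}")
    # Step 2: shortlist three candidate codes from the search results.
    candidates = pick_top3(business["name"], results)
    # Step 3: choose one candidate and keep the reasoning alongside it.
    return pick_best(business["name"], results, candidates)


# Stub callables standing in for the search API and the two LLM calls.
def fake_search(query):
    return [{"title": "Example Bakery", "snippet": "Fresh bread baked daily"}]


def fake_top3(name, results):
    return [311811, 722515, 445291]


def fake_best(name, results, candidates):
    return {"result": candidates[0], "thinking": "Snippet describes a bakery."}


answer = classify_business(
    {"name": "Example Bakery", "address": "1 Main St"},
    fake_search, fake_top3, fake_best,
)
print(answer["result"])
```

Each later stage receives the outputs of the earlier ones, which is the same data flow the pipeline steps below express through shared row fields.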


First, import the libraries and load the data and the taxonomy.

In [1]:
import pandas as pd
import json
from pydantic import BaseModel, Field
from typing import List
from superpipe import *

# Load the input records; `df` is what the pipeline is applied to at the end.
df = pd.read_csv('./data.csv')


### Defining the Pipeline using SuperPipe

Now, let's define the first step of the pipeline, which uses a Google SERP library to search for the business. The query is built from the `name` and `address` fields of the input object.

We do this by using SuperPipe's built-in `SERPEnrichmentStep`. You could also easily build your own using a `CustomStep` instead.

In [2]:
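def shorten(x):
    # Parse the raw SERP response once and keep only what the LLM needs.
    data = json.loads(x)
    short = []
    kgi = data.get('knowledgeGraph')
    if kgi is None:
        # No knowledge graph in the response: keep None as a placeholder.
        short.append(kgi)
    else:
        # Keep only the knowledge-graph fields that describe the business.
        kg = {
            'title': kgi.get('title'),
            'type': kgi.get('type'),
            'description': kgi.get('description'),
        }
        short.append(kg)
    organic = data.get('organic')
    if organic is None:
        return None
    # Keep the top three organic results.
    for o in organic[:3]:
        short.append({
            'title': o['title'],
            'snippet': o['snippet'],
            'link': o['link']
        })
    return short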


def serp_prompt(row):
    # Build the search query; the address fields are read from row['address'].
    name = row['name']
    street = row['address']['street']
    city = row['address']['city']
    state = row['address']['state']
    zip_code = row['address']['zip']  # avoid shadowing the built-in zip()
    return f"Review for {name} located at {street} {city} {state} {zip_code}"


serp_step = steps.SERPEnrichmentStep(
  prompt=serp_prompt,
  postprocess=shorten,
  name="serp")
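
To make the effect of the post-processing concrete, here is a condensed, self-contained variant applied to a made-up response. It is not the notebook's `shorten`: the name `summarize_serp` is hypothetical, and unlike `shorten` it skips a missing knowledge graph and returns an empty list when there are no organic results. The only assumption about the response is the `knowledgeGraph` and `organic` keys used above.

```python
import json


def summarize_serp(raw):
    """Keep the knowledge-graph summary and the top three organic hits."""
    data = json.loads(raw)
    summary = []
    kg = data.get("knowledgeGraph")
    if kg is not None:
        summary.append({key: kg.get(key) for key in ("title", "type", "description")})
    for hit in data.get("organic", [])[:3]:
        summary.append({key: hit.get(key) for key in ("title", "snippet", "link")})
    return summary


# Made-up response for illustration only.
raw = json.dumps({
    "knowledgeGraph": {"title": "Example Bakery", "type": "Bakery", "rating": 4.5},
    "organic": [
        {"title": f"Result {i}", "snippet": "...",
         "link": f"https://example.com/{i}", "position": i}
        for i in range(1, 6)
    ],
})
for item in summarize_serp(raw):
    print(item)
```

Trimming the response this way keeps the prompt for the LLM steps short: one knowledge-graph entry plus three results instead of the full payload.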

The second step of the pipeline takes the business name and the search results and feeds them into an LLM to get the 3 most likely NAICS codes. We create this step using `LLMStep`.

An `LLMStep` instance takes a Pydantic model and a prompt generator function as arguments. The Pydantic model specifies the structure of the output (every `LLMStep` produces structured output). The prompt generator function builds the prompt from the input row.

In [3]:
def top3_codes_prompt(row):
    return f"""You are given a business name and a list of Google search results about a company.
    Return an array of the top 3 most likely NAICS business codes this company falls into. Only use codes in the 2022 taxonomy.

    Company name: {row['name']}
    Search results:
    {row['serp']}
    """


class Top3Codes(BaseModel):
    top3_codes: List[int] = Field(
        description="The top 3 most likely NAICS codes")


top3_codes_step = steps.LLMStep(
  model=models.gpt4,
  prompt=top3_codes_prompt,
  out_schema=Top3Codes,
  name="top3_codes")
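
The prompt asks the model to stay within the 2022 taxonomy, but the schema only guarantees a list of integers. One possible safeguard is to filter the returned codes against the taxonomy afterwards. This is a sketch under assumptions: `keep_known_codes` is a hypothetical helper, and the `taxonomy` set is a tiny illustrative subset rather than the full code list.

```python
def keep_known_codes(codes, taxonomy):
    """Return the codes found in the taxonomy, in order and without duplicates."""
    seen = set()
    kept = []
    for code in codes:
        if code in taxonomy and code not in seen:
            seen.add(code)
            kept.append(code)
    return kept


# Tiny illustrative subset; a real run would load the full 2022 code list.
taxonomy = {311811, 311812, 722515}
print(keep_known_codes([311811, 999999, 311811], taxonomy))  # prints: [311811]
```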

The third step feeds the business name, the search results and the 3 most likely NAICS codes into an LLM to get the single most likely NAICS code and the reasoning behind it. Again, we create this step using `LLMStep`.

In [4]:
def top_code_prompt(row):
    return f"""
You are given a business name and a list of Google search results about a company.
You are given 3 possible NAICS codes it could be -- pick the best one and explain your thinking.

Company name: {row['name']}
NAICS options: {row['top3_codes']}
Search results:
{row['serp']}
"""


class TopCode(BaseModel):
    result: int = Field(description="The best NAICS code")
    thinking: str = Field(
        description="The thought process for why this is the best NAICS code")


top1_code_step = steps.LLMStep(
  model=models.gpt4,
  prompt=top_code_prompt,
  out_schema=TopCode,
  name="top1_code")
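
Because the final step is asked to choose among three options, it can be useful to check that its answer really is one of them. A minimal sketch, assuming each row exposes the `top3_codes` and `result` fields used above; `is_consistent` is a hypothetical helper and the rows are made up.

```python
def is_consistent(row):
    """True when the final code is one of the shortlisted candidates."""
    return row["result"] in row["top3_codes"]


rows = [
    {"top3_codes": [311811, 311812, 722515], "result": 311811},
    {"top3_codes": [311811, 311812, 722515], "result": 445110},
]
print([is_consistent(r) for r in rows])  # prints: [True, False]
```

Rows that fail this check are good candidates for manual review, since the model answered outside the options it was given.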

We're done defining the steps. Finally, we define an evaluation function: a simple equality check between the predicted `result` and the ground-truth `NAICS` column in the dataset. Then we define a `Pipeline` and run it.

In [None]:
def evaluate(row):
    return row['result'] == row['NAICS']

naics_coder = pipeline.Pipeline(
  steps=[
    serp_step,
    top3_codes_step,
    top1_code_step], 
  evaluation_fn=evaluate)

naics_coder.apply(df)
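
One caveat with an `==` check: the predicted `result` is an integer, so the comparison only succeeds when the ground-truth column holds integers too. If the codes were read as strings, every row would count as wrong. The sketch below normalizes both sides to strings and computes overall accuracy. The helper names `codes_match` and `accuracy` are hypothetical, and the rows are made up.

```python
def codes_match(row):
    """Compare predicted and expected codes as stripped strings."""
    return str(row["result"]).strip() == str(row["NAICS"]).strip()


def accuracy(rows):
    """Fraction of rows whose predicted code matches the ground truth."""
    rows = list(rows)
    if not rows:
        return 0.0
    return sum(codes_match(r) for r in rows) / len(rows)


rows = [
    {"result": 311811, "NAICS": "311811"},  # int vs. str, same code
    {"result": 722515, "NAICS": 722511},    # different codes
]
print(accuracy(rows))  # prints: 0.5
```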