# Semantic search with FAISS (TensorFlow)

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [1]:
!pip install datasets evaluate transformers[sentencepiece]
!pip install faiss-gpu

Collecting datasets
  Downloading datasets-2.15.0-py3-none-any.whl (521 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m521.2/521.2 kB[0m [31m6.5 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting evaluate
  Downloading evaluate-0.4.1-py3-none-any.whl (84 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.1/84.1 kB[0m [31m11.2 MB/s[0m eta [36m0:00:00[0m
Collecting pyarrow-hotfix (from datasets)
  Downloading pyarrow_hotfix-0.6-py3-none-any.whl (7.9 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m115.3/115.3 kB[0m [31m7.5 MB/s[0m eta [36m0:00:00[0m
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m134.8/134.8 kB[0m [31m8.7 MB/s[0m eta [36m0:00:00[0m
Collecting responses<0.19 (from evaluate)
  Downloading 

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from datasets import Dataset

In [3]:
df = pd.read_csv(r'soc_202311261432.csv')
len(df)

59121

In [4]:
X_train, X_test, y_train, y_test = train_test_split(df['JOB_DUTIES'], df['SOC_CODE'], test_size=0.2, random_state=42)

In [5]:
df_train = [{'text': text, 'label': label} for text, label in zip(X_train, y_train)]
df_test = [{'text': text, 'label': label} for text, label in zip(X_test, y_test)]

In [6]:
ds_train = Dataset.from_dict({key: [item[key] for item in df_train] for key in df_train[0]})
ds_test = Dataset.from_dict({key: [item[key] for item in df_test] for key in df_test[0]})

ds_train

Dataset({
    features: ['text', 'label'],
    num_rows: 47296
})

In [7]:
# ds_train_trim = ds_train.select(range(20000))
# ds_train_trim

In [8]:
# columns = issues_dataset.column_names
# columns_to_keep = ["title", "body", "html_url", "comments"]
# columns_to_remove = set(columns_to_keep).symmetric_difference(columns)
# issues_dataset = issues_dataset.remove_columns(columns_to_remove)
# issues_dataset

In [10]:
ds_train.set_format("pandas")
df = ds_train[:]

In [11]:
# df["comments"][0].tolist()
df['text'].tolist()

['Tasks include: Operate vehicles and powered equipment, such as mowers, tractors, twin-axle vehicles, snow blowers, chainsaws, electric clippers, sod cutters, and pruning saws. Mow or edge lawns. Shovel snow and salt walks, driveways, or parking lots. Care for established lawns by mulching, aerating, weeding, grubbing, removing thatch, or trimming or edging around flower beds, walks, or walls.  Use hand tools, such as shovels, rakes, pruning saws, hedge or brush trimmers, or axes. Prune or trim trees, shrubs, or hedges. Gather and remove litter. Maintain or repair tools, equipment, or structures, such as building, greenhouses, fences, or benches, using hand or power tools. Mix and spray fertilizers, herbicides, or insecticides onto grass, shrubs, or trees.',
 'Prepare construction site where concrete will be poured. Assist in pouring of concrete for finishers to complete. Use hand and power tools and carry various construction materials. Dig and level earth and gravel. Clean up constr

In [12]:
text_df = df.explode("text", ignore_index=True)
text_df.head()

Unnamed: 0,text,label
0,Tasks include: Operate vehicles and powered eq...,37-3011.00
1,Prepare construction site where concrete will ...,47-2061.00
2,Perform a variety of food preparation duties o...,35-2021.00
3,***RFI Response***\n \n- Prep Station\n ...,35-2014.00
4,"Greet, register and assign rooms to guests. Is...",43-4081.00


In [13]:
text_dataset = Dataset.from_pandas(text_df)
text_dataset

Dataset({
    features: ['text', 'label'],
    num_rows: 47296
})

In [14]:
text_dataset = text_dataset.map(
    lambda x: {"text_length": len(x["text"].split())}
)

Map:   0%|          | 0/47296 [00:00<?, ? examples/s]

In [15]:
text_dataset = text_dataset.filter(lambda x: x["text_length"] > 5)
text_dataset

Filter:   0%|          | 0/47296 [00:00<?, ? examples/s]

Dataset({
    features: ['text', 'label', 'text_length'],
    num_rows: 47248
})

In [16]:
# def concatenate_text(examples):
#     return {
#         "text": examples["title"]
#         + " \n "
#         + examples["body"]
#         + " \n "
#         + examples["comments"]
#     }


# comments_dataset = comments_dataset.map(concatenate_text)

In [17]:
from transformers import AutoTokenizer, TFAutoModel

model_ckpt = "sentence-transformers/multi-qa-mpnet-base-dot-v1"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = TFAutoModel.from_pretrained(model_ckpt, from_pt=True)

tokenizer_config.json:   0%|          | 0.00/363 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/438M [00:00<?, ?B/s]

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFMPNetModel: ['embeddings.position_ids']
- This IS expected if you are initializing TFMPNetModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFMPNetModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFMPNetModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFMPNetModel for predictions without further training.


In [18]:
def cls_pooling(model_output):
    return model_output.last_hidden_state[:, 0]

In [19]:
def get_embeddings(text_list):
    encoded_input = tokenizer(
        text_list, padding=True, truncation=True, return_tensors="tf"
    )
    encoded_input = {k: v for k, v in encoded_input.items()}
    model_output = model(**encoded_input)
    return cls_pooling(model_output)

In [20]:
embedding = get_embeddings(text_dataset["text"][0])
embedding.shape

TensorShape([1, 768])

In [26]:
embeddings_dataset = text_dataset.map(
    lambda x: {"embeddings": get_embeddings(x["text"]).numpy()[0]}
)



Map:   0%|          | 0/47248 [00:00<?, ? examples/s]

In [30]:
embeddings_dataset

Dataset({
    features: ['text', 'label', 'text_length', 'embeddings'],
    num_rows: 47248
})

In [42]:
import joblib

# Save the dataset along with the Faiss index
joblib.dump(embeddings_dataset, "embeddings_dataset.joblib")

['embeddings_dataset.joblib']

In [None]:
# embeddings_dataset.add_faiss_index(column="embeddings")

In [32]:
# !pip install supabase

Collecting supabase
  Downloading supabase-2.3.0-py3-none-any.whl (15 kB)
Collecting gotrue<3.0,>=1.3 (from supabase)
  Downloading gotrue-2.1.0-py3-none-any.whl (42 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m42.7/42.7 kB[0m [31m1.9 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting httpx<0.25.0,>=0.24.0 (from supabase)
  Downloading httpx-0.24.1-py3-none-any.whl (75 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m75.4/75.4 kB[0m [31m10.0 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting postgrest<0.14.0,>=0.10.8 (from supabase)
  Downloading postgrest-0.13.0-py3-none-any.whl (19 kB)
Collecting realtime<2.0.0,>=1.0.0 (from supabase)
  Downloading realtime-1.0.2-py3-none-any.whl (8.9 kB)
Collecting storage3<0.8.0,>=0.5.3 (from supabase)
  Downloading storage3-0.7.0-py3-none-any.whl (15 kB)
Collecting supafunc<0.4.0,>=0.3.1 (from supabase)
  Downloading supafunc-0.3.1-py3-none-any.whl (6.1 kB)
Collecting httpcore<0.18.0,>=0.15.0 (from httpx<0.25.

In [33]:
import os
from supabase import create_client, Client

os.environ["SUPABASE_URL"] = os.environ['SUPABASE_URL']
os.environ["SUPABASE_KEY"] = os.environ['SUPABASE_KEY']

url: str = os.environ.get("SUPABASE_URL")
key: str = os.environ.get("SUPABASE_KEY")
supabase: Client = create_client(url, key)

In [34]:
# !pip install vecs

Collecting vecs
  Downloading vecs-0.4.2.tar.gz (21 kB)
  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
Collecting pgvector==0.1.* (from vecs)
  Downloading pgvector-0.1.8-py2.py3-none-any.whl (7.6 kB)
Collecting psycopg2-binary==2.9.* (from vecs)
  Downloading psycopg2_binary-2.9.9-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.0 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.0/3.0 MB[0m [31m23.3 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting flupy==1.* (from vecs)
  Downloading flupy-1.2.0.tar.gz (12 kB)
  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
Collecting deprecated==1.2.* (from vecs)
  Downloading Deprecated-1.2.14-py2.py3-none-any.whl (9.6 kB)
Building wheels for collected packages: vecs, flupy

In [37]:
import vecs

DB_CONNECTION = os.environ['SUPABASE_POSTGRE']

# create vector store client
vx = vecs.create_client(DB_CONNECTION)

In [None]:
data, count = supabase.table('TEST').insert([{col: row[col] for col in embeddings_dataset.column_names} for row in embeddings_dataset]).execute()

In [None]:
supabase.table('TEST').select("*").execute()

In [27]:
# question = "Perform variety of attending duties at traveling carnival.  Set up, tear-down, operate amusement food concessions.\nThe OFLC ETA requested detail on specific portions of these job duties.  \nFood Concessions set up & tear down:  Mobile food concessions are typically mounted in a trailer.  A supervisor would position the trailer(s) at a specific location on the grounds, unhitch the power unit from the trailer.  The awnings covering the windows during transit would be raised & secured.  Items such as trash cans, screens, tables that may be positioned outside of the trailer but carried inside of the trailer during transit would be manually moved from inside the trailer to outside.  Trailer would be cleaned, sanitized & stocked with supplies for the event.  Any counters, guidance railings, signage, decorations would be positioned outside of the trailer. Condiment dispensers, napkin dispensers & trash containers would be set up outside of the trailer. Typically the fair or event maintains the tables & chairs for patrons, but is some instances the worker may set up a limited number of chairs & tables for patron use.  Teardown would simply be these duties being handled in the reverse order & items being stored & secured for transit to the next location.\nTo clarify the portion of the job duties that includes operate mobile food concessions stand:   On a carnival midway, when there is a mobile food concessions, a stand is limited to selling only one or two specific items, such as cotton candy, popcorn, turkey legs, roasted corn, or other specialty foods.  The food is prepared in a production line, where an individual may only perform one task, such as measuring corn & oil into a popper.  The next individual would salt & bag.  The next individual would choose correct bag as per customer order & hand to teller.  The next individual would have taken order, taken money, made change & then hands order to client."
# question = "Responsible for servicing, cleaning, and supplying restrooms, performing routine maintenance activities, notifying management of need for repairs, cleaning snow or debris from sidewalk, and performing heavy cleaning duties such as cleaning floors, shampooing rugs, washing walls and glass, and removing rubbish."
question = "DUTIES include: apply caulk, sealants, or other agents to installed surfaces; apply grout between joints of bricks or tiles, using grouting trowels; arrange or store materials, machines, tools, or equipment; clean installation surfaces, equipment, tools, work sites, or storage areas, using water, chemical solutions, oxygen lances, or polishing machines; correct surface imperfections or fill chipped, cracked, or broken bricks or tiles, using fillers, adhesives, or grouting materials; cut materials to specified sizes for installation, using power saws or tile cutters; erect scaffolding or other installation structures; locate and supply materials to masons for installation; mix mortar, plaster, and grout, manually or using machines, according to standard formulas; modify material moving, mixing, grouting, grinding, polishing, or cleaning procedures, according to installation or material requirements; move or position materials such as marble slabs, assisting team members using cranes, hoists, or dollies; provide assistance in the preparation, installation, repair, or rebuilding of tile, brick, or stone surfaces; remove damaged tile, brick, or mortar, and clean or prepare surfaces, using pliers, hammers, chisels, drills, wire brushes, or metal wire anchors; remove excess grout or residue from tile or brick joints, using sponges or trowels; transport materials, tools, or machines to installation sites, manually or using conveyance equipment."
question_embedding = get_embeddings([question]).numpy()
question_embedding.shape

scores, samples = embeddings_dataset.get_nearest_examples(
    "embeddings", question_embedding, k=5
)

samples_df = pd.DataFrame.from_dict(samples)
samples_df["scores"] = scores
samples_df.sort_values("scores", ascending=False, inplace=True)

MissingIndex: ignored

In [None]:
for _, row in samples_df.iterrows():
    print(f"SOC: {row.label}")
    print(f"SCORE: {row.scores}")
    print(f"TEXT: {row.text}")
    print("=" * 50)
    print()

In [None]:
import joblib

# Save the dataset along with the Faiss index
joblib.dump(embeddings_dataset, "embeddings_dataset.joblib")

['embeddings_dataset.joblib']

In [None]:
loaded_dataset = joblib.load("embeddings_dataset.joblib")

question = "DUTIES include: apply caulk, sealants, or other agents to installed surfaces; apply grout between joints of bricks or tiles, using grouting trowels; arrange or store materials, machines, tools, or equipment; clean installation surfaces, equipment, tools, work sites, or storage areas, using water, chemical solutions, oxygen lances, or polishing machines; correct surface imperfections or fill chipped, cracked, or broken bricks or tiles, using fillers, adhesives, or grouting materials; cut materials to specified sizes for installation, using power saws or tile cutters; erect scaffolding or other installation structures; locate and supply materials to masons for installation; mix mortar, plaster, and grout, manually or using machines, according to standard formulas; modify material moving, mixing, grouting, grinding, polishing, or cleaning procedures, according to installation or material requirements; move or position materials such as marble slabs, assisting team members using cranes, hoists, or dollies; provide assistance in the preparation, installation, repair, or rebuilding of tile, brick, or stone surfaces; remove damaged tile, brick, or mortar, and clean or prepare surfaces, using pliers, hammers, chisels, drills, wire brushes, or metal wire anchors; remove excess grout or residue from tile or brick joints, using sponges or trowels; transport materials, tools, or machines to installation sites, manually or using conveyance equipment."
question_embedding = get_embeddings([question]).numpy()
question_embedding.shape

scores, samples = loaded_dataset.get_nearest_examples(
    "embeddings", question_embedding, k=5
)

samples_df = pd.DataFrame.from_dict(samples)
samples_df["scores"] = scores
samples_df.sort_values("scores", ascending=False, inplace=True)

for _, row in samples_df.iterrows():
    print(f"SOC: {row.label}")
    print(f"SCORE: {row.scores}")
    print(f"TEXT: {row.text}")
    print("=" * 50)
    print()

SOC: 47-3011.00
SCORE: 14.699073791503906
TEXT: Assist masons and finishers as needed. Loading and unloading materials. Gathering and removing litter. Provide assistance in the preparation, installation, repair, or rebuilding of tile, brick, or stone surfaces. Locate and supply materials to masons for installation, following drawings or numbered sequences. Arrange or store materials, machines, tools, or equipment. May use hand or power tools or equipment or vehicles to complete these tasks.

SOC: 47-2022.00
SCORE: 14.39815616607666
TEXT: Duties may include: Provide assistance in the preparation for installation or repair of bricks, tiles, or other stone surfaces. Move, position, and transport materials, tools, or machines to installation sites, manually using dollies, hoists, cranes, or conveyance equipment. Erect scaffolding. Lay out wall patterns or foundations, using straight edge, ruler, or staked lines. Use hand tools to shape, trim, face, and cut bricks, tiles, marble or other st

In [None]:
df_train[0:5]

[{'text': 'Tasks include: Operate vehicles and powered equipment, such as mowers, tractors, twin-axle vehicles, snow blowers, chainsaws, electric clippers, sod cutters, and pruning saws. Mow or edge lawns. Shovel snow and salt walks, driveways, or parking lots. Care for established lawns by mulching, aerating, weeding, grubbing, removing thatch, or trimming or edging around flower beds, walks, or walls.  Use hand tools, such as shovels, rakes, pruning saws, hedge or brush trimmers, or axes. Prune or trim trees, shrubs, or hedges. Gather and remove litter. Maintain or repair tools, equipment, or structures, such as building, greenhouses, fences, or benches, using hand or power tools. Mix and spray fertilizers, herbicides, or insecticides onto grass, shrubs, or trees.',
  'label': '37-3011.00'},
 {'text': 'Prepare construction site where concrete will be poured. Assist in pouring of concrete for finishers to complete. Use hand and power tools and carry various construction materials. Dig

In [None]:
df_train[657]

{'text': 'Perform variety of attending duties at traveling carnival.  Set up, tear-down, operate amusement rides, food concessions, game concessions and/or novelty concessions.\n\nThe OFLC ETA requested detail on specific portions of these job duties.  \n\nAmusement Rides set up & tear down:  Mobile amusement rides are trailer mounted.  A supervisor would position the trailer(s) at a specific location on the grounds, unhitch the power unit from the trailer, & move the power unit away from the ride.  All of the pieces of the ride would travel on the same trailer(s) & be located proximate to their position when the ride is in operation.   Work would be performed by individual workers as members of a team, with some tasks being performed individually & some collectively.  Restraints holding pieces of the ride while in transit would be released.  Ride platform (if any) would be lowered & leveled.  Track or railing (if any) would be positioned & connected.  Sweeps, supports, bars, pins woul

In [None]:
abc = pd.DataFrame(df_train)[0:2000]
abc[abc['label'] == '35-3023.00']

Unnamed: 0,text,label
113,TAKE ORDERS AND ACCEPT PAYMENT FROM CUSTOMERS ...,35-3023.00
155,Duties include simple cooking and selling of f...,35-3023.00
171,Perform variety of attending duties at mobile ...,35-3023.00
263,Perform duties which combine both food prepara...,35-3023.00
408,Perform duties which combine both food prepara...,35-3023.00
448,Take orders and serve food over a counter. Acc...,35-3023.00
517,EMPLOYEES MAY BE ASSIGNED TO WORK IN VARIOUS A...,35-3023.00
535,Perform variety of attending duties at mobile ...,35-3023.00
647,Crew member for fast food restaurant. Tasks: m...,35-3023.00
657,Perform variety of attending duties at traveli...,35-3023.00


In [None]:
df_test[0:5]

[{'text': 'Ability to smooth & finish surfaces of poured concrete, such as floors, walks, sidewalks, or curbs using a variety of hand & power tools. Align forms for sidewalks, curbs or gutters. May perform other duties and tasks per SOC Code 47-2051.00.',
  'label': '47-2051.00'},
 {'text': "Attends to the overall care of thoroughbred race horses including feeding, watering,\nmaintenance of stalls and tack, cleaning, brushing, trimming of horses, disinfecting stalls and\nbedding, administration of meds as directed, inspection of horses' physical condition. Will lift\nlegs and clean horses' feet and apply liniments and bandages to legs as required. Will care for\n1-5 horses at a time, including hot walking and tacking up. Work day is split: 5am11am,3pm-5pm. Days off rotates.",
  'label': '39-2021.00'},
 {'text': 'Responsible for servicing, cleaning, and supplying restrooms, performing routine maintenance activities, notifying management of need for repairs, cleaning snow or debris from 