## Finding YAML Manifest Samples Using Embeddings

We are going to read all the sample manifests under the current directory, compute an embedding for each one, and then search them with a natural-language query.

First, we need to find all the manifests in the directory and store their paths.

In [12]:
import os

directory = os.curdir

yaml_files = []

# Walk the tree and collect the path of every file ending in .yaml.
for dirpath, dirnames, filenames in os.walk(directory):
    for filename in filenames:
        if filename.endswith('.yaml'):
            yaml_files.append(os.path.join(dirpath, filename))
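The same traversal can also be sketched with `pathlib`. The helper name `find_manifests` and the decision to also accept the `.yml` extension are assumptions made for illustration; the cell above only collects `.yaml` files.

```python
from pathlib import Path


def find_manifests(root=".", suffixes=(".yaml", ".yml")):
    """Return the sorted paths of all files under root whose suffix is in suffixes.

    Hypothetical helper: the name and the default suffixes are illustrative only.
    """
    root_path = Path(root)
    return sorted(
        str(p)
        for p in root_path.rglob("*")  # recurse into every subdirectory
        if p.is_file() and p.suffix in suffixes
    )
```

Calling `find_manifests(os.curdir, suffixes=(".yaml",))` should collect the same set of files as the loop above, although in sorted order rather than in `os.walk` order.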

Let's see how many files we have.

In [2]:
print(len(yaml_files))

142


Now we can read each manifest, skipping comment lines, and collect them all in a list.

In [3]:
all_manifests = []
for filepath in yaml_files:
    with open(filepath) as f:
        # Drop full-line comments and normalise any stray carriage returns.
        kept_lines = [line.replace("\r", "\n") for line in f if not line.startswith("#")]
    whole_manifest = "".join(kept_lines)
    # Keep the path next to the text so search results can point back to the file.
    all_manifests.append({"manifest": whole_manifest, "filepath": filepath})

Next we use OpenAI to compute an embedding for each manifest and store the results in a pandas DataFrame. This can take a bit of time.

In [4]:
import openai
from openai.embeddings_utils import get_embedding
import pandas as pd

openai.api_key_path = "/Users/sebgoa/.openaikey"
df = pd.DataFrame(all_manifests)
df['yaml_embedding'] = df['manifest'].apply(lambda x: get_embedding(x, engine='text-embedding-ada-002'))
df.head()

| | manifest | filepath | yaml_embedding |
|---|---|---|---|
| 0 | `\n\napiVersion: v1\nkind: Secret\nmetadata:\n ...` | `./targets/googlecloudfirestore/100-secret.yaml` | `[0.0009237080230377614, -0.005223845597356558,...` |
| 1 | `\napiVersion: targets.triggermesh.io/v1alpha1\...` | `./targets/googlecloudfirestore/200-target.yaml` | `[-0.02706410177052021, -0.011596946977078915, ...` |
| 2 | `\napiVersion: eventing.knative.dev/v1\nkind: T...` | `./targets/googlecloudfirestore/300-trigger.yaml` | `[-0.02558640018105507, -0.023556580767035484, ...` |
| 3 | `\napiVersion: v1\nkind: Secret\nmetadata:\n n...` | `./targets/sendgrid/100-secret.yaml` | `[0.0017921671969816089, -0.021822964772582054,...` |
| 4 | `\napiVersion: targets.triggermesh.io/v1alpha1\...` | `./targets/sendgrid/200-target.yaml` | `[-0.014343353919684887, -0.036025792360305786,...` |
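Because computing the embeddings can take a while, it may be worth caching the DataFrame on disk between runs. The following is only a sketch: `load_or_compute` and its `compute` callback are hypothetical names, not part of the notebook. Pickle is used here so that each embedding comes back as a Python list rather than a string.

```python
import os

import pandas as pd


def load_or_compute(cache_path, compute):
    """Return the DataFrame stored at cache_path, building and saving it if missing.

    Hypothetical helper: compute is any zero-argument callable that returns
    the DataFrame holding the manifests and their embeddings.
    """
    if os.path.exists(cache_path):
        # Cache hit: skip the slow embedding step entirely.
        return pd.read_pickle(cache_path)
    df = compute()
    df.to_pickle(cache_path)
    return df
```

In this notebook, `compute` would wrap the DataFrame construction and the `get_embedding` call from the cell above. Note that the cache is never invalidated, so it has to be deleted by hand when the manifests change.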


With the embeddings in hand, we can now search: we embed a natural-language query, compute its cosine similarity to every manifest, and sort the results by that score. Setting `n=1` returns only the manifest most similar to the query.

In [11]:
from openai.embeddings_utils import cosine_similarity

def search_functions(df, code_query, n=3, pprint=True, n_lines=50):
    # Embed the query with the same model used for the manifests.
    embedding = get_embedding(code_query, engine='text-embedding-ada-002')
    df['similarities'] = df.yaml_embedding.apply(lambda x: cosine_similarity(x, embedding))

    # Keep the n manifests most similar to the query.
    res = df.sort_values('similarities', ascending=False).head(n)
    if pprint:
        for _, row in res.iterrows():
            print(row.filepath + "  score=" + str(round(row.similarities, 3)))
            # Print at most n_lines lines of each manifest.
            print("\n".join(row.manifest.split("\n")[:n_lines]))
            print('-'*70)
    return res

res = search_functions(df, 'what does a AWS kinesis source looks like', n=1)

./sources/awskinesissource.yaml  score=0.843


apiVersion: sources.triggermesh.io/v1alpha1
kind: AWSKinesisSource
metadata:
  name: sample
spec:
  arn: arn:aws:kinesis:us-west-2:123456789012:stream/triggermeshtest

  auth:
    credentials:
      accessKeyID:
        valueFromSecret:
          name: awscreds
          key: aws_access_key_id
      secretAccessKey:
        valueFromSecret:
          name: awscreds
          key: aws_secret_access_key

  sink:
    ref:
      apiVersion: eventing.knative.dev/v1
      kind: Broker
      name: default

----------------------------------------------------------------------
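
To make the ranking step less opaque, here is a minimal sketch of cosine similarity and top-n selection written with NumPy. The functions `cosine_sim` and `top_n` and the toy vectors are illustrative assumptions. The formula is the standard definition, which is assumed here to match what the `cosine_similarity` helper computes.

```python
import numpy as np


def cosine_sim(a, b):
    """Cosine of the angle between vectors a and b (1.0 means same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def top_n(query_vec, doc_vecs, n=1):
    """Return (index, score) pairs for the n vectors most similar to query_vec."""
    scores = [cosine_sim(query_vec, v) for v in doc_vecs]
    # argsort is ascending, so reverse it to put the best match first.
    order = np.argsort(scores)[::-1][:n]
    return [(int(i), scores[i]) for i in order]


# Toy 3-dimensional "embeddings"; real embedding vectors are much longer.
docs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
best = top_n([1.0, 0.1, 0.0], docs, n=2)
```

With these toy vectors, the first document points in nearly the same direction as the query and therefore ranks first, followed by the third. This is the same ordering logic that `sort_values('similarities', ascending=False).head(n)` applies to the DataFrame.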
