# Generate a K8s dataset for LLMs

This notebook reverse-engineers a plausible prompt for each of the Kubernetes YAML manifests published as examples on the Kubernetes website. It does this by calling the ChatGPT API with the following prompt:
```
f"Explain in 3 sentences and start your explanation with 'Write YAML that':\n```yaml\n{YAML_FILE}\n```"
```
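The template above can be kept in one place by wrapping it in a small helper. This is a minimal sketch, not part of the original notebook: the name `build_prompt` and the `FENCE` constant are hypothetical, and the sample manifest text is made up for illustration.

```python
# Three backticks, built programmatically so this example does not nest fences.
FENCE = "`" * 3


def build_prompt(yaml_text: str) -> str:
    """Return the prompt that asks the model to describe one manifest."""
    return (
        "Explain in 3 sentences and start your explanation with "
        f"'Write YAML that':\n{FENCE}yaml\n{yaml_text}\n{FENCE}"
    )


# Hypothetical two-line manifest, used only to show the resulting string.
prompt = build_prompt("apiVersion: v1\nkind: Pod")
print(prompt)
```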

In [None]:
! git clone https://github.com/kubernetes/website.git

Cloning into 'website'...
remote: Enumerating objects: 352325, done.
remote: Counting objects: 100% (161/161), done.
remote: Compressing objects: 100% (107/107), done.
remote: Total 352325 (delta 82), reused 114 (delta 54), pack-reused 352164
Receiving objects: 100% (352325/352325), 399.73 MiB | 21.71 MiB/s, done.
Resolving deltas: 100% (256710/256710), done.
Updating files: 100% (8097/8097), done.


In [None]:
! ls website/content/en/examples

access	     concepts	  examples.go	    priority-and-fairness  service
admin	     configmap	  examples_test.go  README.md		   tls
application  controllers  pods		    secret		   windows
audit	     debug	  policy	    security


In [None]:
from pathlib import Path

examples_dir = Path("website/content/en/examples")

# Recursively collect every YAML manifest under the examples directory.
example_manifests = list(examples_dir.rglob("*.yaml"))

print(example_manifests[0])
len(example_manifests)

website/content/en/examples/windows/deploy-hyperv.yaml


283
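To see how the collected manifests are spread across the example folders, the paths can be grouped by their first directory below the examples root. The sketch below is illustrative only: `count_by_top_level_dir` is a hypothetical helper, and apart from `windows/deploy-hyperv.yaml` the sample paths are invented so the code runs without the cloned repository.

```python
from collections import Counter
from pathlib import Path


def count_by_top_level_dir(root: Path, manifests: list[Path]) -> Counter:
    """Count manifests by the first directory component below `root`."""
    counts = Counter()
    for manifest in manifests:
        relative = manifest.relative_to(root)
        # Files that sit directly in `root` are grouped under ".".
        top = relative.parts[0] if len(relative.parts) > 1 else "."
        counts[top] += 1
    return counts


root = Path("website/content/en/examples")
sample = [
    root / "windows" / "deploy-hyperv.yaml",
    root / "pods" / "hypothetical-a.yaml",  # invented path
    root / "pods" / "nested" / "hypothetical-b.yaml",  # invented path
]
counts = count_by_top_level_dir(root, sample)
print(counts)
```

In the notebook, the same function could be called with `example_manifests` in place of `sample`.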

In [None]:
! pip install --upgrade openai

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting openai
  Downloading openai-0.27.8-py3-none-any.whl (73 kB)
Collecting aiohttp (from openai)
  Downloading aiohttp-3.8.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB)
Collecting multidict<7.0,>=4.5 (from aiohttp->openai)
  Downloading multidict-6.0.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (114 kB)
Collecting async-timeout<5.0,>=4.0.0a3 (from aiohttp->openai)
  Downloading async_timeout-4.0.2-py3-none-any.whl (5.8 kB)
Collecting yarl<2.0,>=1.0 (from aiohttp->openai)
  Downloadin

In [None]:
import openai
from getpass import getpass
openai.api_key = getpass('OpenAI API Key: ')

OpenAI API Key: ··········


In [None]:
models = openai.Model.list()
models.data[0].id

'whisper-1'

In [None]:
chat_completion = openai.ChatCompletion.create(model="gpt-3.5-turbo-0613", messages=[{"role": "user", "content": "Hello world"}])

# print the chat completion
print(chat_completion.choices[0].message.content)

Hello! How can I assist you today?


In [None]:
# Read the first manifest to use as a sample input.
content = example_manifests[0].read_text(encoding="utf-8")

print(content)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: iis
spec:
  selector:
    matchLabels:
      app: iis
  replicas: 3
  template:
    metadata:
      labels:
        app: iis
      annotations:
        experimental.windows.kubernetes.io/isolation-type: hyperv
    spec:
      containers:
      - name: iis
        image: microsoft/iis
        ports:
        - containerPort: 80




In [None]:
def get_instruction(content: str) -> str:
  chat_completion = openai.ChatCompletion.create(
      model="gpt-3.5-turbo-0613",
      messages=[{
          "role": "user",
          "content": f"Explain in 3 sentences and start your explanation with 'Write YAML that':\n```yaml\n{content}\n```"
          }])

  return chat_completion.choices[0].message.content


# Display the generated instruction for the sample manifest
get_instruction(content)

'Write YAML that specifies a Kubernetes Deployment resource named "iis" in the apps/v1 API version. The deployment should create 3 replicas of a container based on the microsoft/iis image. The container should expose port 80 and have the label "app: iis" attached to it. Additionally, an annotation specifying hyperv as the isolation type for experimental Windows containers should be added to the Deployment\'s template metadata.'
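The prompt asks for a fixed opening and three sentences, but the model is not guaranteed to comply: the response above runs to four sentences. A cheap check on the opening phrase can flag responses that drift from the requested format before they enter the dataset. The following is a minimal sketch; `has_expected_prefix` and `EXPECTED_PREFIX` are hypothetical names, not part of the notebook.

```python
EXPECTED_PREFIX = "Write YAML that"


def has_expected_prefix(instruction: str) -> bool:
    """Return True if the instruction opens with the phrase the prompt asked for."""
    # Leading whitespace is ignored so a stray newline does not cause a rejection.
    return instruction.lstrip().startswith(EXPECTED_PREFIX)


print(has_expected_prefix('Write YAML that specifies a Kubernetes Deployment resource named "iis".'))
print(has_expected_prefix("Here is some YAML that defines a Pod."))
```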

In [None]:
k8s_dataset = []
for path in example_manifests[0:2]:
  with path.open(mode="r", encoding="utf-8") as manifest:
    content = manifest.read()
    k8s_dataset.append({"prompt": get_instruction(content), "completion": f"```yaml\n{content}\n```"})

k8s_dataset

[{'prompt': 'Write YAML that defines a Deployment resource with the API version of apps/v1 and specifies that it creates instances of the kind Deployment. The metadata section defines the name of the deployment as "iis". The spec section sets the selector to match labels with the value of "app: iis" and specifies 3 replicas. Inside the template section, the metadata section adds labels and annotations, including specifying the isolation type as hyperv for Windows environments. Finally, the spec section inside the template defines a container named "iis" with the image of microsoft/iis and exposes port 80.',
  'completion': '```yaml\napiVersion: apps/v1\nkind: Deployment\nmetadata:\n  name: iis\nspec:\n  selector:\n    matchLabels:\n      app: iis\n  replicas: 3\n  template:\n    metadata:\n      labels:\n        app: iis\n      annotations:\n        experimental.windows.kubernetes.io/isolation-type: hyperv\n    spec:\n      containers:\n      - name: iis\n        image: microsoft/iis\n

In [None]:
print(k8s_dataset[0]["completion"])

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: iis
spec:
  selector:
    matchLabels:
      app: iis
  replicas: 3
  template:
    metadata:
      labels:
        app: iis
      annotations:
        experimental.windows.kubernetes.io/isolation-type: hyperv
    spec:
      containers:
      - name: iis
        image: microsoft/iis
        ports:
        - containerPort: 80

```
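Each completion wraps the manifest in a fenced `yaml` block. If the raw manifest is needed later, for example to parse or validate it, the fence can be removed again. The helper below is a sketch under that assumption; `strip_yaml_fence` is a hypothetical name, and it only undoes the exact wrapping used when the dataset entries are built.

```python
# Three backticks, built programmatically so this example does not nest fences.
FENCE = "`" * 3


def strip_yaml_fence(completion: str) -> str:
    """Remove the surrounding yaml fence; return the input unchanged if absent."""
    prefix = f"{FENCE}yaml\n"
    suffix = f"\n{FENCE}"
    if completion.startswith(prefix) and completion.endswith(suffix):
        return completion[len(prefix):-len(suffix)]
    return completion


manifest = "apiVersion: apps/v1\nkind: Deployment\n"
wrapped = f"{FENCE}yaml\n{manifest}\n{FENCE}"
print(strip_yaml_fence(wrapped) == manifest)
```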


In [None]:
import json
with open('k8s-instructions.jsonl', 'w') as outfile:
    for entry in k8s_dataset:
        json.dump(entry, outfile)
        outfile.write('\n')

! cat k8s-instructions.jsonl

{"prompt": "Write YAML that defines a Deployment resource with the API version of apps/v1 and specifies that it creates instances of the kind Deployment. The metadata section defines the name of the deployment as \"iis\". The spec section sets the selector to match labels with the value of \"app: iis\" and specifies 3 replicas. Inside the template section, the metadata section adds labels and annotations, including specifying the isolation type as hyperv for Windows environments. Finally, the spec section inside the template defines a container named \"iis\" with the image of microsoft/iis and exposes port 80.", "completion": "```yaml\napiVersion: apps/v1\nkind: Deployment\nmetadata:\n  name: iis\nspec:\n  selector:\n    matchLabels:\n      app: iis\n  replicas: 3\n  template:\n    metadata:\n      labels:\n        app: iis\n      annotations:\n        experimental.windows.kubernetes.io/isolation-type: hyperv\n    spec:\n      containers:\n      - name: iis\n        image: microsoft/ii

In [None]:
! rm k8s-instructions.jsonl

In [None]:
import logging

k8s_dataset = []
for path in example_manifests:
  with path.open(mode="r", encoding="utf-8") as manifest:
    content = manifest.read()
    try:
      k8s_dataset.append({"prompt": get_instruction(content), "completion": f"```yaml\n{content}\n```"})
    except Exception:
      logging.exception(f"Exception occurred while calling OpenAI. Skipping content: {content}")


f"examples: {len(example_manifests)} dataset:{len(k8s_dataset)}"

ERROR:root:Exception occurred while calling OpenAI. Skipping content: apiVersion: v1
kind: ConfigMap
data:
  containers.input.conf: |-
    # This configuration file for Fluentd is used
    # to watch changes to Docker log files that live in the
    # directory /var/lib/docker/containers/ and are symbolically
    # linked to from the /var/log/containers directory using names that capture the
    # pod name and container name. These logs are then submitted to
    # Google Cloud Logging which assumes the installation of the cloud-logging plug-in.
    #
    # Example
    # A line in the Docker log file might look like this JSON:
    #
    # {"log":"2014/09/25 21:15:03 Got request with path wombat\\n",
    #  "stream":"stderr",
    #   "time":"2014-09-25T21:15:03.499185026Z"}
    #
    # The record reformer is used to write the tag to focus on the pod name
    # and the Kubernetes container name. For example a Docker container's logs
    # might be in the directory:
    #  /var/lib/docker/co

'examples: 283 dataset:282'
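One of the 283 manifests was skipped, and the truncated log does not show why the call failed. If the cause was transient (a timeout or rate limit, for instance), retrying with a growing delay might recover it; if the request itself was rejected, retrying would not help. The sketch below shows a generic retry wrapper under that assumption. `call_with_retries` and its parameters are hypothetical, and the `flaky` function only stands in for an API call.

```python
import time


def call_with_retries(func, *args, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(*args), retrying on any exception with exponential backoff."""
    for attempt in range(attempts):
        try:
            return func(*args)
        except Exception:
            # Re-raise once the final attempt has failed.
            if attempt == attempts - 1:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)


# Stand-in for an API call that fails twice and then succeeds.
state = {"calls": 0}


def flaky(value):
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("transient failure")
    return value * 2


delays = []
result = call_with_retries(flaky, 21, sleep=delays.append)
print(result, delays)
```

In the notebook, the call inside the loop could then be written as `call_with_retries(get_instruction, content)`.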

In [None]:
import json
with open('k8s-instructions.jsonl', 'w') as outfile:
    for entry in k8s_dataset:
        json.dump(entry, outfile)
        outfile.write('\n')

! ls -lash k8s-instructions.jsonl
! wc -l k8s-instructions.jsonl

268K -rw-r--r-- 1 root root 267K Jun 17 03:39 k8s-instructions.jsonl
282 k8s-instructions.jsonl
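Before the file is used for training, it can be read back to confirm that every line parses and carries both fields. This is a minimal sketch that assumes each line is a JSON object; `load_jsonl` is a hypothetical helper, and the demo writes a throwaway file so the code runs without the real dataset.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_KEYS = {"prompt", "completion"}


def load_jsonl(path):
    """Load a JSONL file and check that each record has the required keys."""
    records = []
    with open(path, encoding="utf-8") as infile:
        for line_number, line in enumerate(infile, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = REQUIRED_KEYS - record.keys()
            if missing:
                raise ValueError(f"line {line_number} is missing {sorted(missing)}")
            records.append(record)
    return records


# Demo on a temporary file with one made-up record.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.jsonl"
    demo_path.write_text(
        json.dumps({"prompt": "p", "completion": "c"}) + "\n", encoding="utf-8"
    )
    demo_records = load_jsonl(demo_path)

print(len(demo_records))
```

Calling `load_jsonl('k8s-instructions.jsonl')` in the notebook should return one record per line reported by `wc -l`.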
