# Chatbot for ArgoCD Operator

This is an attempt to use an AI-powered chatbot to interact with the argocd-operator codebase.

## Why?

I was brainstorming ideas to automate CVE triage. The idea of a system that understands a CVE, analyzes it in a codebase to check if it is affected, and provides patches, sounded interesting, so I decided to give it a try.

## How?

LLMs(Large Language Model) are generally trained on general-purpose language understanding and generation. However they can be used to to perform domain specific tasks by providing them domain knowledge. This can be done using fine-tuning or RAG(Retrieval-augmented generation).

I decided to used RAG technique for providing operator specific knowledge to LLM as fine-tuning requires creation of a structured dataset.

### Chatbot working

The argocd-operator specific knowledge is stored into a vector database which is retrieved based on provided query and passed as context along with query to LLM to generate an answer.

#### Data indexing

The operator knowledge i.e source code & github issues are are converted into vector embeddings and stored into database.

![rag_indexing](https://python.langchain.com/assets/images/rag_indexing-8160f90a90a33253d0154659cf7d453f.png ) (Source: python.langchain.com)

#### Data retrieval

The chatbot will retrieve the data using vector similarity for provided query and pass it as context to the LLM to answer the query. 
![rag_retrieval_generation](https://python.langchain.com/assets/images/rag_retrieval_generation-1046a4668d6bb08786ef73c56d4f228a.png) (Source: python.langchain.com)

## Implementation

I have used `Langchain` framework to create the chatbot.  
The LLM models used are
1. BERT based `sentence-t5-base-nlpl-code_search_net` from [huggingfaces](https://huggingface.co/krlvi/sentence-t5-base-nlpl-code_search_net).  
   This model has been trained on the with the code_search_net dataset and is used to generate embeddings to store in vector db.
3. GPT based `gpt4all-falcon-q4_0.gguf` from [gpt4all.io](https://gpt4all.io/index.html).  
   This model is used for conversations and answering questions.

Refer [Chat Results](#chat-results) section for result.

### Data indexing

The argocd-operator code and github issues are stored under `data/` directory in this repository. The github data is scrapped using [ghi-scraper](https://github.com/agateau/ghi-scraper) and processed using `json-to-text.py` helper script from this repository.

#### Load raw data

In [1]:
from langchain.docstore.document import Document
import os

folder_name = "data"

documents = []
for root, dirs, files in os.walk(folder_name):
    for file in files:
        try:
            with open(os.path.join(root, file), "r", encoding="utf-8") as o:
                code = o.readlines()
                d = Document(page_content="\n".join(code), metadata={"source": os.path.join(root, file)})
                documents.append(d)
        except UnicodeDecodeError:
            print("couldn't load ", file)
            pass

couldn't load  icon.png
couldn't load  logo.png
couldn't load  login_via_keycloak.png
couldn't load  login_with_openshift.png
couldn't load  Keycloak_Manageaccount.png
couldn't load  Keycloak_ChangePassword.png


In [2]:
len(documents)

783

#### Store data in vector db

In [3]:
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

hfemb = HuggingFaceEmbeddings(model_name="krlvi/sentence-t5-base-nlpl-code_search_net")
persist_directory = "db"
db = Chroma.from_documents(documents, hfemb, persist_directory=persist_directory)
db.persist()

### Data retrieval

Load the vector db with operator knowledge and create a langchain retriever.

#### Load vector db

In [4]:
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

hfemb = HuggingFaceEmbeddings(model_name="krlvi/sentence-t5-base-nlpl-code-x-glue")

persist_directory = "db"
db = Chroma(persist_directory=persist_directory, embedding_function=hfemb)

#### Vector data retriever

In [5]:
#retriever = db.as_retriever(search_type="similarity_score_threshold", search_kwargs={"score_threshold": .2, "k": 5})

retriever = db.as_retriever()

### Chat LLM model

The model `gpt4all-falcon-q4_0.gguf` is downloaded from [gpt4all.io](https://gpt4all.io/index.html) website and stored in `models` directory in this repository. 

In [6]:
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationSummaryMemory

from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.llms import GPT4All

# Callbacks support token-wise streaming
callbacks = [StreamingStdOutCallbackHandler()]

# Verbose is required to pass to the callback manager
llm = GPT4All(model="models/gpt4all-falcon-q4_0.gguf", callbacks=callbacks, verbose=False)

#memory = ConversationSummaryMemory(
#    llm=llm, memory_key="chat_history", return_messages=True
#)

### Setup Langchain

Create Conversation chain for chatbot using GPT4All chat model and vector db as retriever

In [7]:
from langchain.chains import ConversationalRetrievalChain

qa_chain = ConversationalRetrievalChain.from_llm(llm=llm, retriever=retriever, return_source_documents=True, max_tokens_limit=2046)

### Query helper

In [8]:
def query(q):
    result = qa_chain({"question":q, "chat_history":[]})
    #print(f"Answer: {result['answer']}")
    print(f"\n\nSources\n{[x.metadata['source'] for x in result['source_documents']]}")

## Chat Results

In [9]:
query("What do you know about argocd-operator?")

Token indices sequence length is longer than the specified maximum sequence length for this model (2909 > 1024). Running this sequence through the model will result in indexing errors


 argocd-operator is an operator in OKD that manages ArgoCD deployments and updates. It provides a way to manage ArgoCD deployments, including creating, updating, and deleting deployments. The operator also handles the creation of resources such as pods, services, and secrets for each deployment.

Sources
['data/issues/1081.txt']


In [10]:
query("What is the source code langauge of argocd-operator?")

 The source code language of argocd-operator is Go.

Sources
['data/issues/274.txt', 'data/issues/153.txt', 'data/issues/1001.txt']


In [11]:
query("How to use this operator?")

 To use the ArgoCD Operator, you need to have a Kubernetes cluster running and have the operator installed. You can then create an ArgoCD Operator instance using the following command:

```
kubectl apply -f https://raw.githubusercontent.com/argoproj-labs/argocd-operator/master/examples/operator/argo-cd-operator/v0.0.9/deploy/kustomization.yaml
```

This will create an ArgoCD Operator instance in your Kubernetes cluster. You can then use the operator to manage your ArgoCD deployments and releases.

To deploy a

Sources
['data/issues/256.txt', 'data/issues/49.txt', 'data/issues/79.txt', 'data/code/hack/test.sh']


In [12]:
query("How to install it on kubernetes?")

 To install the ArgoCD Operator on Kubernetes, you can follow these steps:

1. Install the operator by running `make deploy`.
2. Verify that the operator is installed by checking the status of the pods using `kubectl get pods -n argocd-operator-system`.
3. Create a new ArgoCD resource by running `argocd create <resource_name>`.
4. Verify that the resource was created by checking the status of the resource using `kubectl get resources -n argocd-operator-system`.
5. Repeat steps 3-4 for each additional resource you want to create.
6. Once all resources have been created, verify that the operator is running and has no errors by checking the status of the pods using `kubectl get pods -n argocd-operator-system`.
7. To remove the operator from the cluster, run `make undeploy`.
8. Verify that the operator has been removed by checking the status of the pods using `kubectl get pods -n argocd-operator-system`.

Sources
['data/code/docs/install/manual.md']


In [14]:
query("does argocd-operator source as go.mod file?")

 Yes, the argocd-operator project sources from a `go.mod` file.

Sources
['data/issues/461.txt']


In [16]:
query("Tell me the go version in go.mod of this project")

 The go version in go.mod for this project is 1.16.

Sources
[]


In [18]:
query("What is go version in this project?")

 The go version used in this project is 1.20.

Sources
['data/code/Dockerfile', 'data/code/.github/workflows/codegen.yaml']


In [20]:
query("Write a patch to update go version to 1.21 for this project?")

 To update the go version to 1.21, you can modify the `env` section of the `jobs` configuration in the `check-go-modules` job. Here's an example of how you can do it:
```yaml
- name: Update go version
  uses: actions/setup-go@v1
  with:
    go-version: 1.21
```
This will update the go version to 1.21 for the `check-go-modules` job. You can repeat this step for the `check-sdk-codegen` job as well.

Sources
['data/code/.github/workflows/codegen.yaml']


In [28]:
query("Write a patch to update k8s.io/api in go.mod file in argocd-operator to v1.29?")



To update the k8s.io/api in go.mod file in argocd-operator to v1.29, you can follow these steps:

1. Open the `argocd-operator` directory in your terminal and navigate to the `cmd/k8s/apis` directory.
2. Find the `api.go` file and open it in a text editor.
3. Search for the line that starts with `//+k8s: k8s.io/api v1.29`.
4. Replace this line with the following:
```go
//+k8s: k8s.io/api v1.29
```
5. Save and close the file.
6. Run `make` to rebuild the argocd-operator package.
7. Verify that the k8s.io/api version in the go.mod file has been updated

Sources
['data/issues/461.txt', 'data/issues/565.txt', 'data/issues/462.txt']


In [21]:
query("Which sso does argocd-operator support?")

 Currently, argocd-operator supports both dex and keycloak SSO providers.

Sources
['data/issues/153.txt', 'data/issues/274.txt', 'data/issues/119.txt', 'data/issues/483.txt']


In [22]:
query("Does it support dex?If yes, can you share how to configure it?")

 Yes, ArgoCD supports Dex. To configure it, you need to add the following volume mount to your ArgoCD deployment:



```
- name: google-groups-credentials
  secret:
    defaultMode: 420
    secretName: argocd-google-groups-credentials
```

This will mount a secret volume that contains the Google Groups credentials JSON. You can then use this JSON to authenticate and query Google Workspace directory for a user's groups.

You can also configure the ArgoCD operator to use Dex by setting the `DISABLE_DEX` variable to `true`. This will disable the deployment of the bundled Dex server.

Sources
['data/issues/278.txt', 'data/issues/559.txt', 'data/issues/142.txt']


In [25]:
query("How to configure rbac with argocd-operator?")

 To configure RBAC with argocd-operator, you need to follow these steps:

1. Create a new role or patch an existing role bound to the SA to add RBAC permissions for the argocd-operator service account.
2. Add the RBAC role to the argocd-operator service account.
3. Apply the hack for the service account.
4. Once you apply the hack for the service account, you can list the HorizontalPodAutoscalers resource in the API group "autoscaling".
5. You can also configure RBAC with argocd-operator by using a custom role instead of the default role.
6. To do this, create a new role and add RBAC permissions for the argocd-operator service account.
7. Apply the hack for the service account.
8. Once you apply the hack for the service account, you can list the HorizontalPodAutoscalers resource in the API group "autoscaling".
9. You can also configure RBAC with argocd-operator by using a custom role instead of the default role.
10. To do this, create a new role and add RBAC permissions for the argoc



In [29]:
query("What is namespace vs cluster scope ArgoCD in operator?")

 Namespace-scoped ArgoCD in the operator means that the ArgoCD instance will be managed by the operator and will only watch for changes in the specified namespace. This allows for a single operator to manage multiple ArgoCD instances, each with their own namespace. Cluster-scoped ArgoCD in the operator means that the ArgoCD instance will be managed by the operator and will watch for changes across all namespaces. This allows for a single operator to manage multiple ArgoCD instances, each with their own namespace.

Sources
['data/issues/665.txt', 'data/issues/107.txt']


In [30]:
query("How to make a ArgoCD cluster scope in operator?")

 To create a ArgoCD cluster scope in the operator, you need to define the `clusterScope` field in the `operator.yaml` file. This field should contain the `apiVersion`, `kind`, and `metadata` fields for the ArgoCD resource that you want to manage.

For example, if you want to create a new ArgoCD resource with the name `my-argocd-resource`, you can define it in the `operator.yaml` file as follows:
```
apiVersion: argoproj.io/v1alpha1
kind: ArgoCD
metadata:
  name: my-argocd-resource
```
Once you have defined the `clusterScope` field, you can run the operator to create the ArgoCD resource and manage it using the `operator-sdk`.

Sources
['data/issues/92.txt', 'data/issues/339.txt']


In [31]:
query("What is the latest api version of ArgoCD CRD?")

 The latest api version of ArgoCD CRD is `v1beta1`.

Sources
['data/issues/119.txt', 'data/issues/1090.txt', 'data/code/examples/argocd-basic.yaml', 'data/code/examples/argocd-autoscale.yaml']


In [33]:
query("Provide a sample for v1beta1 ArgoCD CRD?")



Here is an example of a sample ArgoCD CRD for v1beta1:
```yaml
apiVersion: argoproj.io/v1beta1
kind: ArgoCD
metadata:
  name: example-argocd
spec:
  server:
    resources:
      limits:
        cpu: 500m
        memory: 256Mi
      requests:
        cpu: 125m
        memory: 128Mi
  repo:
    resources:
      limits:
        cpu: 1,000m
        memory: 512Mi
      requests:
        cpu: 250m
        memory: 256Mi
  ha:
    enabled: false
    resources:
      limits:
        cpu: 500m
        memory: 256Mi
      requests:
        cpu: 250m
        memory: 128Mi
  redis:
    resources:
      limits:
        cpu: 500m
        memory: 256Mi
      requests:
        cpu: 250m
        memory: 128Mi
  sso:
    provider: dex
    dex:
      resources:
        limits:
          cpu: 500m
          memory: 256Mi
        requests:
          cpu: 250m
          memory: 128Mi
  controller:
   

Sources
['data/code/examples/argocd-basic.yaml', 'data/code/config/samples/argoproj.io_v1alpha1_argocdexp

## Conclusion

This was a simple experiment, so I didn't expect exceptional results. The chatbot performed adequately on QA tasks, considering the limited amount of unstructured data, RAG technique, and models used. However, for code/instruction generation tasks, the results were not that good. Further exploration, such as using domain-specific LLM models or fine-tuning, might yield better outcomes.

## References 

- https://towardsdatascience.com/code-understanding-on-your-own-hardware-dd38c4f266d6
- https://python.langchain.com/docs/use_cases/question_answering/
- https://python.langchain.com/docs/use_cases/question_answering/code_understanding
- https://huggingface.co/krlvi/sentence-t5-base-nlpl-code_search_net
- https://python.langchain.com/docs/integrations/llms/gpt4all
- https://www.youtube.com/watch?v=DYOU_Z0hAwo
- https://www.youtube.com/watch?v=9ISVjh8mdlA