Initial attempt at Travis smoke tests (#40)
* Initial attempt at Travis smoke tests

* Uses `oc cluster up` instead of minishift to keep things simpler
* Now queries only the last 3s of metrics from the smoketest container
  * The test can be re-run
* Test watches job status with a timeout to determine success or failure (sketched below)
  * I'd prefer something more robust
  * Still trying to find exactly how to wait for success or EXPLICIT failure
  * Will need a timeout either way, so not sure how important it is TBH
* Gobs of info output whether your test passes OR fails, for easy comparison
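
A minimal sketch of that watch, assuming a Job named `saf-smoketest` (per the label table below) and a five-minute budget; the real test script may differ:

    # Hypothetical watch-with-timeout; the job name and timings are assumptions.
    timeout=300; interval=3; elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
      succeeded=$(oc get job saf-smoketest -o jsonpath='{.status.succeeded}')
      failed=$(oc get job saf-smoketest -o jsonpath='{.status.failed}')
      [ "$succeeded" = "1" ] && { echo "Smoke test passed"; exit 0; }
      [ -n "$failed" ] && { echo "Smoke test failed"; exit 1; }
      sleep "$interval"; elapsed=$((elapsed + interval))
    done
    echo "Timed out waiting for smoke test" >&2; exit 1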

* Extra tweaks before PR

* Fixed the deployment to pass tests and did some label cleanup (#39)

* Some label cleanup

* Removed redundant ServiceMonitor
* Fixed 'alertmanager' route
  * No backends matched the label and it seemed redundant
  * Removed it and pointed the route to `alertmanager-operated` instead
* Documented the label set that we should stick to
* Eliminated several non-conformant labels
* Fixed some markdown lint
* Moved some waits from travis.yml to deploy.sh because humans need to wait too
  * Also improved them to be more robust
* Fixed wait for prometheus operator to actually wait for CRDs to be usable
* I've found two places where the labels intersect with the operator

  * Node labels
    * https://github.com/redhat-service-assurance/smart-gateway-operator/blob/7386c50807c09fb2229a35c9063c464213539078/roles/smartgateway/tasks/main.yml#L52
    * This needs a node labelled as `application: sa-telemetry`, `node: white`
    * Which is what we do here: https://github.com/redhat-service-assurance/telemetry-framework/blob/3562d4492a22fdd784c7689fd06cc50835d0a779/deploy/quicktest_upstream.sh#L17
    * I propose changing this to `app: smartgateway`, `sa-affinity: white`
    * Do we even need this? Why is the smartgateway dependent on a specifically
      labelled node when other components are not?
    * If necessary, we can handle this in a new PR because of the dependency

  * ServiceMonitor labels
    * ServiceMonitor `smartgateway: white` label: https://github.com/redhat-service-assurance/smart-gateway-operator/blob/7386c50807c09fb2229a35c9063c464213539078/roles/smartgateway/tasks/main.yml#L84
    * Prometheus `smartgateway: white` selector: https://github.com/redhat-service-assurance/telemetry-framework/blob/3562d4492a22fdd784c7689fd06cc50835d0a779/deploy/service-assurance/prometheus/prometheus.yaml#L35
    * I propose changing these to `app: smart-gateway`

* The other labels used in the operator appear to be self-contained
  * Should be able to fix them there without touching this repo
  * Pod `app: prometheus-white` label: https://github.com/redhat-service-assurance/smart-gateway-operator/blob/7386c50807c09fb2229a35c9063c464213539078/roles/smartgateway/tasks/main.yml#L25
  * Service `smartgateway: white` label: https://github.com/redhat-service-assurance/smart-gateway-operator/blob/7386c50807c09fb2229a35c9063c464213539078/roles/smartgateway/tasks/main.yml#L65
  * ServiceMonitor `smartgateway: white` selector: https://github.com/redhat-service-assurance/smart-gateway-operator/blob/7386c50807c09fb2229a35c9063c464213539078/roles/smartgateway/tasks/main.yml#L88
  * Note that the two above are actually not related to the ServiceMonitor
    label & Prometheus selector mentioned in the previous section
  * I propose making all of these `app: smart-gateway`, `sa-affinity: white` (sketch below)
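
A hypothetical sketch of the proposed scheme, for illustration only (none of this is implemented yet):

    # Node labels, per the node-label proposal above (hypothetical):
    oc label node localhost app=smartgateway sa-affinity=white --overwrite
    # Uniform selection of smart-gateway objects under the proposed labels:
    oc get pods,services,servicemonitors -l app=smart-gateway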

* Removed sa-affinity label
csibbitt committed Jul 18, 2019
1 parent 3562d44 commit b1d8e19
Showing 17 changed files with 225 additions and 71 deletions.
16 changes: 16 additions & 0 deletions .travis.yml
@@ -0,0 +1,16 @@
language: minimal
sudo: required

before_install:
- tests/install-and-run-ocp.sh

install:
- sudo add-apt-repository ppa:ansible/ansible-2.8 -y
- sudo apt-get update
- sudo apt-get install -y ansible openssl wget

script:
- cd deploy
- ./quickstart_upstream.sh
- cd ../tests
- ./smoketest.sh
53 changes: 46 additions & 7 deletions deploy/README.md
@@ -1,4 +1,4 @@
# Deployment using Operators
# Service Assurance Framework Deployment using Operators

This directory contains sample configurations for deploying the Telemetry
Framework, leveraging Operators for the deployment. The contents here are
@@ -19,7 +19,7 @@ currently a work in progress.
> methods. It's possible our issues will be resolved with the migration to the
> Operator Lifecycle Manager as well.
# Quickstart (Minishift)
## Quickstart (Minishift)

The following is a quickstart guide on deploying SAF into a minishift-created
OpenShift environment. It will allow for SAF to be started for development
@@ -54,8 +54,7 @@ purposes, and is not intended for production environments.
./deploy.sh DELETE
watch -n10 oc get all


# Routes and Certificates
## Routes and Certificates

In order to get the remote QDR connections through the OpenShift router, we
need to use TLS/SSL certificates. The following two commands will first create
@@ -74,7 +73,7 @@ the OpenShift route to Passthrough mode to port 5671.

oc create secret tls qdr-white-cert --cert=qdr-server-certs/tls.crt --key=qdr-server-certs/tls.key
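
For reference, a passthrough route of the kind described above could be created like this (a sketch; the actual route manifest in this repo may differ):

    oc create route passthrough qdr-white --service=qdr-white --port=5671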

# Importing ImageStreams
## Importing ImageStreams

In order to better distinguish between upstream and downstream locations of
images, we've made use of
@@ -84,7 +83,7 @@ To import the downstream container images into the local registry, run the
`./import-downstream.sh` script which will configure the appropriate Image
Streams for the Service Assurance Framework components.
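
Once imported, the resulting streams can be verified with something like (project name assumed):

    oc get imagestreams -n sa-telemetry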

# Generating Appropriate Manifests
## Generating Appropriate Manifests

The manifests provided here are used to request the appropriate state within
Kubernetes to allow for the Service Assurance Framework to exist as intended.
@@ -116,7 +115,7 @@ environment variables like so:
-e "imagestream_namespace=$(oc project --short)" \
deploy_builder.yml

# Instantiating Service Assurance Framework
## Instantiating Service Assurance Framework

After executing the above prerequisite steps, we need to patch a node (or
nodes) to allow for the scheduling of the Smart Gateway by the Operator. To do
@@ -127,3 +126,43 @@ that, run the following command:
Then simply run the `deploy.sh` script. You will need to follow the
instructions during the script as it will pause waiting for the successful
completion at a couple of steps.
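
The exact command is collapsed in this diff; it presumably matches the node patch used in `quicktest_upstream.sh`, along these lines (assumed):

    oc patch node <node> -p '{"metadata":{"labels":{"application": "sa-telemetry", "node": "white"}}}'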

## Internals

Sections covering implementation details that are helpful for developers.

### Labels

Here is a data dictionary of our labels, not including auto-generated ones.

Currently we are in the process of refining the label set, so this will
temporarily document the old vs. the new. This will become canonical when the
work is complete.

#### BEFORE

| **Label Key** | **On Types** | **Values** | **Notes** |
|-----------------------|--------------------------------|----------------|------------|
| alertmanager | Pod | sa | This comes from prometheus-operator |
| app | Pod, Service, DeploymentConfig | alertmanager, prometheus-operator, prometheus, prometheus-white, sa-telemetry-alertmanager | Prometheus has the label app=prometheus, and the smart-gateway has app=prometheus-white. We should standardize this to name all components by their proper names and not use it for additional metadata (white) |
| application | Pod, Service, ReplicaSet | qdr-white | This comes from qdr-operator |
| name | Pod, ReplicaSet, Job, SmartGateway, Qdr | qdr-operator, saf-smoketest, smart-gateway-operator, white, qdr-white | There is already metadata.name. We should standardize this to 'app' |
| operated-alertmanager | Service | true | These come from prometheus-operator |
| operated-prometheus | Service | true | These come from prometheus-operator |
| prometheus | Pod, StatefulSet | white, prometheus-sa-telemetry | The 'white' value should move to 'sa-affinity' |
| qdr_cr | Pod, Service, ReplicaSet | qdr-white | We should standardize this to 'app' but remove extra metadata (white) |
| sa-app | Pod | prometheus-white | We should standardize this to 'app' but remove extra metadata (white) |
| sa-app-white | ServiceMonitor | prometheus-white | We should standardize this to 'app' but remove extra metadata (white) |
| sa-telemetry-app-white| ServiceMonitor | prometheus-white | We should standardize this to 'app' but remove extra metadata (white) |
| smartgateway | Service | white | |

#### AFTER (Proposed)

| **Label Key** | **On Types** | **Values** | **Notes** |
|-----------------------|--------------------------------|----------------|------------|
| alertmanager | Pod | sa | This comes from prometheus-operator |
| app | Pod, Service, DeploymentConfig | alertmanager, prometheus, prometheus-operator, qdr, qdr-operator, smart-gateway, smart-gateway-operator | Primary way to identify a specific component |
| application | Pod, Service, ReplicaSet | qdr-white | This comes from qdr-operator |
| operated-alertmanager | Service | true | These come from prometheus-operator |
| operated-prometheus | Service | true | These come from prometheus-operator |
| qdr_cr | Pod, Service, ReplicaSet | qdr-white | Where does this come from? |
84 changes: 76 additions & 8 deletions deploy/deploy.sh
@@ -13,7 +13,6 @@ echo " * [OK] Switched to sa-telemetry project"
# setup our default method
method="CREATE"


# checking if we're deleting or creating
if [[ "$1" != "" ]]; then
if [[ "$1" != "CREATE" && "$1" != "DELETE" ]]; then
@@ -50,17 +49,24 @@ declare -a application_list=(
'service-assurance/prometheus/service_account.yaml'
'service-assurance/prometheus/role.yaml'
'service-assurance/prometheus/rolebinding.yaml'
'service-assurance/prometheus/service_monitor.yaml'
'service-assurance/prometheus/prometheus.yaml'
'service-assurance/prometheus/route.yaml'
'service-assurance/prometheusrules/prometheusrules.yaml'
'service-assurance/alertmanager/service_account.yaml'
'service-assurance/alertmanager/secret.yaml'
'service-assurance/alertmanager/alertmanager.yaml'
'service-assurance/alertmanager/service.yaml'
'service-assurance/alertmanager/route.yaml'
)

declare -a crds_to_wait_for=(
'alertmanagers.monitoring.coreos.com'
'prometheuses.monitoring.coreos.com'
'prometheusrules.monitoring.coreos.com'
'qdrs.interconnectedcloud.github.io'
'servicemonitors.monitoring.coreos.com'
'smartgateways.smartgateway.infra.watch'
)

create() {
object_list=("$@")
# shellcheck disable=SC2068
@@ -73,17 +79,79 @@ delete() {
oc delete --wait=true ${object_list[@]/#/-f }
}

wait_for_crds() {
while true; do
not_ready=0
# shellcheck disable=SC2068
for crd in ${crds_to_wait_for[@]}; do
echo -n "Checking if '${crd}' is Established..."
estab=$(oc get crd "${crd}" -o jsonpath='{.status.conditions[?(@.type=="Established")].status}')
if [ "${estab}" != "True" ]; then
echo "Not Established"
not_ready=1
break
fi
echo "Established"
done
if [ ${not_ready} -eq 0 ]; then
break
fi
echo "Still waiting on CRDs..."
sleep 3;
done

# "there is a race in Kubernetes that the CRD creation finished but the API
# is not actually available"
# https://github.com/coreos/prometheus-operator/issues/1866#issuecomment-419191907
#
# This code (ideally this whole function) should go away when we add better
# failure handling in the operator
# https://github.com/redhat-service-assurance/smart-gateway-operator/issues/6
echo 'Confirming we can instantiate a ServiceMonitor'
until oc create -f - <<EOSM
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: dummy-race-condition-checker
spec:
endpoints:
- port: "11111"
selector:
matchLabels:
      dummy-race-condition-checker: "true"
EOSM
do
sleep 3;
done
oc delete servicemonitor/dummy-race-condition-checker

# Nothing above actually works to solve the problem, so instead of sleep 300
# we force the SGO to restart, which DOES solve the problem
echo 'Restarting SGO to clear API condition'
oc delete pod -l app=smart-gateway-operator
}

# create the objects
if [ "$method" == "CREATE" ]; then
echo " * [ii] Creating the operators" ; create "${operator_list[@]}"
echo ""
echo "+--------------------------------------------------------+"
echo "| Waiting for prometheus-operator deployment to complete |"
echo "+--------------------------------------------------------+"
echo "+---------------------------------------------------+"
echo "| Waiting for CRDs to become established in the API |"
echo "+---------------------------------------------------+"
echo ""
oc rollout status dc/prometheus-operator
oc get pods
wait_for_crds
echo " * [ii] Creating the application" ; create "${application_list[@]}"
echo " * [ii] Waiting for QDR deployment to complete"
until oc rollout status deployment.apps/qdr-white; do sleep 3; done
echo " * [ii] Waiting for prometheus deployment to complete"
until oc rollout status statefulset.apps/prometheus-white; do sleep 3; done
echo " * [ii] Waiting for smart-gateway deployment to complete"
until oc rollout status deploymentconfig.apps.openshift.io/white-smartgateway; do sleep 3; done
echo " * [ii] Waiting for all pods to show Ready"
while oc get pods -o 'jsonpath={..status.conditions[?(@.type=="Ready")].status}' | grep False; do
oc get pods
sleep 3
done
fi

# delete the objects
@@ -1,7 +1,7 @@
#!/bin/sh
minishift start
oc login -u system:admin
oc new-project sa-telemetry

openssl req -new -x509 -batch -nodes -days 11000 \
-subj "/O=io.interconnectedcloud/CN=qdr-white.sa-telemetry.svc.cluster.local" \
-out qdr-server-certs/tls.crt \
@@ -21,7 +21,6 @@ oc patch node localhost -p '{"metadata":{"labels":{"application": "sa-telemetry"

# deploy the environment (requires interaction)
./deploy.sh
watch -n5 oc get pods

# teardown the environment when done (requires interaction)
echo "Your environment should now be ready."
@@ -6,11 +6,11 @@ spec:
replicas: 1
selector:
matchLabels:
name: qdr-operator
app: qdr-operator
template:
metadata:
labels:
name: qdr-operator
app: qdr-operator
spec:
serviceAccountName: qdr-operator
containers:
@@ -6,11 +6,11 @@ spec:
replicas: 1
selector:
matchLabels:
name: smart-gateway-operator
app: smart-gateway-operator
template:
metadata:
labels:
name: smart-gateway-operator
app: smart-gateway-operator
spec:
serviceAccountName: smart-gateway-operator
containers:
2 changes: 1 addition & 1 deletion deploy/service-assurance/alertmanager/route.yaml
@@ -5,6 +5,6 @@ metadata:
spec:
to:
kind: Service
name: alertmanager
name: alertmanager-operated
weight: 100
wildcardPolicy: None
16 changes: 0 additions & 16 deletions deploy/service-assurance/alertmanager/service.yaml

This file was deleted.

7 changes: 2 additions & 5 deletions deploy/service-assurance/prometheus/prometheus.yaml
@@ -1,8 +1,6 @@
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
labels:
prometheus: prometheus-sa-telemetry
name: white
namespace: sa-telemetry
spec:
@@ -18,15 +16,14 @@ spec:
port: metrics
podMetadata:
labels:
sa-app: prometheus-white
app: prometheus
replicas: 1
resources:
requests:
memory: 400Mi
ruleSelector:
matchLabels:
prometheus: prometheus-sa-telemetry
role: prometheus-rulefiles
role: sa-telemetry
securityContext:
nonroot: true
serviceAccountName: prometheus-sa
17 changes: 0 additions & 17 deletions deploy/service-assurance/prometheus/service_monitor.yaml

This file was deleted.

@@ -4,8 +4,7 @@ metadata:
creationTimestamp: null
name: sa-telemetry-rules
labels:
prometheus: prometheus-sa-telemetry
role: prometheus-rulefiles
role: sa-telemetry
spec:
groups:
- interval: 30s
3 changes: 0 additions & 3 deletions tests/README.md
@@ -51,6 +51,3 @@ These are some things that would make this better:
* Would like to actually test via the AMQP+TLS interface as the system boundary
instead of directly to the internal AMQP broker
* Option to do internal vs. external
* Looks for just any metrics, so you have to reset prometheus to re-test.
* Would be better to check for metrics specifically from the test harness
specifically in the testing timeframe
16 changes: 16 additions & 0 deletions tests/install-and-run-ocp.sh
@@ -0,0 +1,16 @@
#!/bin/sh
#set -e

# OC command line tools
OC_VER=v3.11.0
OC_HASH=0cbc58b
OC_NAME="openshift-origin-client-tools-${OC_VER}-${OC_HASH}-linux-64bit"
wget https://github.com/openshift/origin/releases/download/${OC_VER}/${OC_NAME}.tar.gz
tar -xvzf ${OC_NAME}.tar.gz
sudo mv ${OC_NAME}/oc /usr/local/bin/

# Start the containerized openshift
sudo sed -i "s/DOCKER_OPTS=\"/DOCKER_OPTS=\"--insecure-registry=172.30.0.0\/16 /g" /etc/default/docker
sudo cat /etc/default/docker
sudo service docker restart
oc cluster up --public-hostname="$(hostname)" #--base-dir /var/lib/minishift
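
# A sanity check like the following (an assumption, not part of this commit)
# would confirm the API answers before the Travis install phase continues:
#   oc login -u system:admin && oc get nodes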
7 changes: 7 additions & 0 deletions tests/minimal-collectd.conf
@@ -1,5 +1,12 @@
Interval 1

LoadPlugin "logfile"
<Plugin "logfile">
LogLevel "debug"
File stdout
Timestamp true
</Plugin>

LoadPlugin cpu
LoadPlugin amqp1
