# Amazon CloudWatch

<img src="../_assets/aws_service_icons/cloudwatch.svg" width="80" alt="Amazon CloudWatch">

## Goals
- Understand what CloudWatch is
- Know what it’s practically used for
- See a simple **SDK pseudo-code** example (not executed)


## What is CloudWatch?

**Amazon CloudWatch** is AWS’s primary **monitoring + observability** service.

At a high level, CloudWatch helps you:
- Collect **metrics** (time-series measurements) from AWS services and your own applications
- Collect and search **logs** (CloudWatch Logs)
- Create **alarms** that notify or trigger actions when something crosses a threshold
- Build **dashboards** for operational visibility

Most AWS services publish built-in metrics automatically (e.g., CPU utilization, request counts, error rates). You can also publish **custom metrics** for application-level signals (e.g., inference latency, queue depth).


## What is CloudWatch used for (practically)?

CloudWatch is commonly used to answer operational questions like:
- “Is the system healthy right now?” (dashboards + key metrics)
- “Did something break?” (alarms + notifications)
- “Why did it break?” (logs + correlation with metrics)
- “Is performance getting worse over time?” (trend analysis, percentiles)

Typical use cases:
- **Alerting**: notify on high error rates, latency spikes, low disk space, elevated throttling
- **Autoscaling signals**: drive scaling decisions based on CPU, queue depth, custom metrics
- **Troubleshooting**: correlate deploys/incidents with metrics and log traces
- **Compliance/operations**: retain logs with defined retention, build audit-style dashboards

In ML systems (examples):
- Monitor **training jobs** (duration, resource usage, failures)
- Monitor **inference endpoints** (p50/p95 latency, error rate, saturation)
- Track **data quality / drift** signals as custom metrics (where appropriate)


## Core concepts (minimum you should know)

- **Metrics**: time-series data points (e.g., `Latency`, `5xxErrorRate`).
- **Namespace**: logical grouping for metrics (AWS uses namespaces like `AWS/Lambda`; you can define your own).
- **Dimensions**: key/value labels that slice a metric (e.g., `FunctionName=...`, `Model=...`).
- **Statistics / percentiles**: summarize many samples per period (e.g., average, max, p95).
- **Alarms**: evaluate a metric over time and trigger actions (e.g., notify via SNS).
- **Dashboards**: visualize metrics for a service or system.
- **CloudWatch Logs**: log groups/streams, retention policies, and search/query (Logs Insights).
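
To make the "statistics per period" idea concrete, here is a minimal local sketch (no AWS calls) that groups raw samples into fixed periods and summarizes each one. The helper names `percentile` and `summarize` and the sample values are hypothetical, and the nearest-rank percentile used here is an assumption for illustration; CloudWatch's own percentile computation may differ in detail.

```python
import math
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def percentile(values, p):
    """Nearest-rank percentile of a non-empty list; p is in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples, period_seconds=60):
    """Group (timestamp, value) samples into fixed periods and summarize each."""
    buckets = defaultdict(list)
    for ts, value in samples:
        epoch = int(ts.timestamp())
        # Align each sample to the start of its period
        buckets[epoch - epoch % period_seconds].append(value)
    return {
        datetime.fromtimestamp(start, tz=timezone.utc): {
            "SampleCount": len(vals),
            "Average": sum(vals) / len(vals),
            "Maximum": max(vals),
            "p95": percentile(vals, 95),
        }
        for start, vals in sorted(buckets.items())
    }


# Hypothetical latency samples (ms): four fast requests and one slow outlier
base = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
latencies = [100.0, 110.0, 120.0, 130.0, 900.0]
samples = [(base + timedelta(seconds=10 * i), v) for i, v in enumerate(latencies)]
summary = summarize(samples)
for period_start, stats in summary.items():
    print(period_start.isoformat(), stats)
```

In this sample the average lands at 272 ms, a latency that no request actually experienced, while the maximum and p95 both point at the slow request. That gap is why percentiles are usually preferred for latency.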


## Using CloudWatch with an SDK (pseudo-code)

This is **illustrative pseudo-code** showing a typical workflow:
1) publish a custom metric,
2) create an alarm,
3) query recent datapoints.

**Note**: requires AWS credentials/permissions and a region; add retries/logging/error handling in production.

```python
# PSEUDO-CODE (do not run as-is): these are real boto3 calls that need
# AWS credentials and create real resources (a custom metric and an alarm).
from datetime import datetime, timedelta, timezone

import boto3

region = "us-east-1"

# CloudWatch client (metrics + alarms)
cw = boto3.client("cloudwatch", region_name=region)

# 1) Publish a custom application metric
# Example: inference latency in milliseconds
namespace = "DemoApp"
metric_name = "InferenceLatencyMs"
dims = [{"Name": "Model", "Value": "recommender-v1"}]

cw.put_metric_data(
    Namespace=namespace,
    MetricData=[
        {
            "MetricName": metric_name,
            "Dimensions": dims,
            "Timestamp": datetime.now(timezone.utc),
            "Value": 123.4,
            "Unit": "Milliseconds",
        }
    ],
)

# 2) Create an alarm: notify an SNS topic if p95 > 200 ms
#    for 3 consecutive 60-second periods (about 3 minutes)
alarm_name = "demo-inference-latency-p95-high"
sns_topic_arn = "arn:aws:sns:us-east-1:123456789012:ops-alerts"  # placeholder ARN

cw.put_metric_alarm(
    AlarmName=alarm_name,
    Namespace=namespace,
    MetricName=metric_name,
    Dimensions=dims,
    Period=60,                   # seconds per evaluation period
    EvaluationPeriods=3,         # number of periods that must breach
    Threshold=200.0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="missing",  # the default handling of missing datapoints
    ExtendedStatistic="p95",     # percentile statistic instead of Average/Maximum
    AlarmActions=[sns_topic_arn],
)

# 3) Query recent datapoints (for quick debugging / basic reporting)
end = datetime.now(timezone.utc)
start = end - timedelta(minutes=10)

resp = cw.get_metric_statistics(
    Namespace=namespace,
    MetricName=metric_name,
    Dimensions=dims,
    StartTime=start,
    EndTime=end,
    Period=60,
    Statistics=["Average", "Maximum"],
)

# Datapoints are not guaranteed to come back in time order, so sort them
datapoints = sorted(resp.get("Datapoints", []), key=lambda d: d["Timestamp"])
print(datapoints)
```
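
To show how `Period`, `EvaluationPeriods`, `Threshold`, and `ComparisonOperator` fit together, the following is a simplified local model of the alarm above (no AWS calls). It is a sketch under assumptions, not CloudWatch's actual evaluation logic: the function `evaluate_alarm` is hypothetical, it assumes every period produced a datapoint, and it ignores the real service's more nuanced handling of missing data.

```python
def evaluate_alarm(p95_by_period, threshold=200.0, evaluation_periods=3):
    """Return a simplified alarm state from per-period p95 values (oldest first).

    Assumption: one value per 60-second period, with no missing periods.
    """
    recent = p95_by_period[-evaluation_periods:]
    if len(recent) < evaluation_periods:
        # Not enough periods yet to decide either way
        return "INSUFFICIENT_DATA"
    # "GreaterThanThreshold" is strict: a value equal to the threshold does not breach
    if all(value > threshold for value in recent):
        return "ALARM"
    return "OK"


# Hypothetical per-minute p95 latencies in milliseconds
print(evaluate_alarm([150.0, 210.0, 220.0, 230.0]))  # last three all breach
print(evaluate_alarm([210.0, 220.0, 200.0]))         # 200.0 is not > 200.0
print(evaluate_alarm([250.0, 260.0]))                # only two periods so far
```

Requiring several consecutive breaching periods trades a slower alert for fewer false alarms from a single noisy minute.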


## Pitfalls & quick tips

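- Keep **dimension cardinality** under control: each unique combination of dimension values is a separate metric, so high-cardinality dimensions (e.g., user or request IDs) get expensive and noisy.
- Set **log retention** intentionally (log groups keep data indefinitely by default, which can become costly).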
- Prefer percentiles (e.g., **p95**) for latency over averages when tail behavior matters.
- Treat alarms as part of a system: route notifications (SNS) to the right on-call path and avoid alert fatigue.
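
The cardinality tip can be checked locally before publishing anything. The sketch below (no AWS calls) counts how many distinct metrics a set of datapoints would produce, treating each unique combination of namespace, metric name, and dimension values as one metric. The function `count_metric_series` and the `RequestId` dimension are hypothetical, chosen only to illustrate the effect.

```python
def count_metric_series(datapoints):
    """Count distinct (namespace, metric name, dimensions) combinations."""
    series = set()
    for namespace, metric_name, dimensions in datapoints:
        # frozenset makes the dimension dict hashable and order-independent
        series.add((namespace, metric_name, frozenset(dimensions.items())))
    return len(series)


# Low cardinality: 1000 datapoints spread over two model names
by_model = [
    ("DemoApp", "InferenceLatencyMs", {"Model": f"recommender-v{i % 2 + 1}"})
    for i in range(1000)
]

# High cardinality: the same datapoints, but also tagged with a per-request ID
by_request = [
    ("DemoApp", "InferenceLatencyMs", {"Model": "recommender-v1", "RequestId": str(i)})
    for i in range(1000)
]

print(count_metric_series(by_model))
print(count_metric_series(by_request))
```

Identifiers like request IDs usually belong in logs, where they can be searched, and not in metric dimensions.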

## References
- AWS Docs: Amazon CloudWatch (metrics, alarms, dashboards)
- AWS Docs: CloudWatch Logs + Logs Insights
