# Exploring Data from Active Resources 
In this notebook, you'll use the `/resources` endpoint of the Vantage API to retrieve and analyze your _active resources_. Active resources are any type of provider-based resource, such as an Amazon EC2 instance, that is currently accruing costs. This analysis will look at costs for resources across provider, region, and resource type.

## Prerequisites
Ensure the following libraries are installed: `requests`, `pandas` ([documentation](https://pandas.pydata.org/docs/index.html)), `matplotlib` ([documentation](https://matplotlib.org/)), and `seaborn` ([documentation](https://seaborn.pydata.org/)).

In [None]:
import requests
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import time

## `/resources` API Endpoint
The `/resources` [endpoint](https://vantage.readme.io/reference/getreportresources) returns a JSON array of all resources within a specific [Resource Report](https://docs.vantage.sh/active_resources) or workspace. The `resource_report_token` variable represents the unique token for a Resource Report in Vantage. For this lab, use the **All Active Resources** report that's automatically provided in your Vantage account. 
1. Navigate to the [Resource Reports page](https://console.vantage.sh/resources) in Vantage.
2. Select the **All Active Resources** report.
3. In the URL, copy the report token (e.g., in `https://console.vantage.sh/resources/prvdr_rsrc_rprt_a12f345345aad1ac`, copy `prvdr_rsrc_rprt_a12f345345aad1ac`).
4. Replace the `<TOKEN>` placeholder below with the token you just copied.

In [None]:
url = "https://api.vantage.sh/v2/resources"
params = {
    "resource_report_token": "<TOKEN>", 
    "include_cost": "true"
}
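
If you would rather not copy the token by hand, the last path segment of the report URL can be pulled out programmatically. This is a minimal sketch; `extract_report_token` is a hypothetical helper (not part of the Vantage API), and it assumes the token is always the final path segment, as in the example URL in step 3.

```python
from urllib.parse import urlparse


def extract_report_token(report_url: str) -> str:
    """Return the last path segment of a Resource Report console URL."""
    # Drop any trailing slash so the final segment is never empty
    path = urlparse(report_url).path.rstrip("/")
    return path.rsplit("/", 1)[-1]


# Example URL taken from step 3 above
report_url = "https://console.vantage.sh/resources/prvdr_rsrc_rprt_a12f345345aad1ac"
report_token = extract_report_token(report_url)
print(report_token)
```

The result could then be assigned to `params["resource_report_token"]` in place of the `<TOKEN>` placeholder.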

### Vantage API Token
Create a [Vantage API token](https://vantage.readme.io/reference/authentication). Export it as the `VANTAGE_API_TOKEN` environment variable within this session.

In [None]:
vantage_token = os.getenv("VANTAGE_API_TOKEN")
if vantage_token is None:
    raise ValueError("Set VANTAGE_API_TOKEN as an environment variable.")

headers = {
    "accept": "application/json",
    "authorization": f"Bearer {vantage_token}"
}

## API Call and Pagination
The `/resources` API response is paginated, with about 20 resource records returned per page. The response provides the following `links` for pagination:

```
{
  "links": {
    "self": "https://api.vantage.sh/v2/resources?resource_report_token=prvdr_rsrc_rprt_a12f345345aad1ac&include_cost=true",
    "first": "https://api.vantage.sh/v2/resources?resource_report_token=prvdr_rsrc_rprt_a12f345345aad1ac&include_cost=true&page=1",
    "next": "https://api.vantage.sh/v2/resources?resource_report_token=prvdr_rsrc_rprt_a12f345345aad1ac&include_cost=true&page=2",
    "last": "https://api.vantage.sh/v2/resources?resource_report_token=prvdr_rsrc_rprt_a12f345345aad1ac&include_cost=true&page=100",
    "prev": null
  },
```
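
The way these `links` drive pagination can be sketched offline, without calling the API. In the example below, `FAKE_PAGES`, `fetch_all`, and the `example.test` URLs are all made up for illustration; only the `links.next` and `resources` keys come from the response shape shown in this notebook.

```python
# Hypothetical in-memory stand-in for the API: URL -> parsed JSON body
FAKE_PAGES = {
    "https://example.test/resources?page=1": {
        "links": {"next": "https://example.test/resources?page=2"},
        "resources": [{"token": "a"}, {"token": "b"}],
    },
    "https://example.test/resources?page=2": {
        "links": {"next": None},  # null `next` marks the last page
        "resources": [{"token": "c"}],
    },
}


def fetch_all(start_url, fetch_page):
    """Follow `links.next` until it is null, collecting every resource."""
    collected = []
    next_url = start_url
    while next_url:
        body = fetch_page(next_url)
        collected.extend(body["resources"])
        next_url = body["links"].get("next")
    return collected


resources = fetch_all("https://example.test/resources?page=1", FAKE_PAGES.__getitem__)
tokens = [resource["token"] for resource in resources]
print(tokens)
```

Passing the page fetcher in as a function keeps the loop logic separate from the HTTP call, so the same loop could be exercised against canned data before it is pointed at the real endpoint.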

The API is also rate limited to 20 requests per minute. The loop below retrieves every page and handles [rate limiting](https://vantage.readme.io/reference/rate-limiting): when the `X-RateLimit-Remaining` header reaches `0`, it sleeps for the number of seconds given by `X-RateLimit-Reset` before sending the next request.

In [None]:
# Create a list to collect the data across all pages
all_data = []
next_url = url
# Query parameters are only needed on the first request; each `next` link already includes them
next_params = params

# Loop through pagination to retrieve all pages
while next_url:
    response = requests.get(next_url, headers=headers, params=next_params)
    if response.status_code != 200:
        print(f"Error: {response.status_code}")
        break

    data = response.json()
    all_data.extend(data["resources"])

    # Follow the `next` link; it is null on the last page, which ends the loop
    next_url = data["links"].get("next")
    next_params = None

    # Handle rate limiting, as the API is limited to 20 requests per minute
    if response.headers.get("X-RateLimit-Remaining") == "0":
        reset_time = int(response.headers.get("X-RateLimit-Reset", 60))
        print(f"Rate limit hit. Sleeping for {reset_time} seconds...")
        time.sleep(reset_time)
    else:
        time.sleep(1)
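
The arithmetic behind the rate limit is worth spelling out. At 20 requests per minute, requests can be spaced no closer than 3 seconds apart on average, so a report with 100 pages (as in the sample `last` link above) needs at least 5 minutes to download. The helpers below are hypothetical and only illustrate that calculation; they are not part of the Vantage API.

```python
def min_interval_seconds(requests_per_minute: int) -> float:
    """Smallest average gap between requests that stays within the limit."""
    return 60.0 / requests_per_minute


def estimated_minutes(total_pages: int, requests_per_minute: int) -> float:
    """Lower bound on download time when one request fetches one page."""
    return total_pages / requests_per_minute


interval = min_interval_seconds(20)
duration = estimated_minutes(100, 20)
print(f"Minimum spacing: {interval} seconds")
print(f"Estimated time for 100 pages: {duration} minutes")
```

One possible design choice, under these assumptions, is to sleep for the fixed interval between every request instead of reacting to the rate-limit headers. That trades a slightly slower download for never hitting the limit.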

## Store Data in `pandas` Dataframe
The API response includes a number of fields. A single resource `token` can appear in multiple records, one for each cost `category`. For example, the following resource has one record for `Data Transfer` costs and another for `API Request` costs, both with the same `token`:

```
  "resources": [
    {
      "token": "prvdr_rsrc_1ba2e3aa45678f9f",
      "uuid": "arn:aws:kms:us-east-1:12345678901:key/1234ab0d-56a7-89a3-45ab-89ab45ab1e34",
      "type": "aws_cloudfront_distribution",
      "label": "1234ab0d-56a7-89a3-45ab-89ab45ab1e34",
      "metadata": null,
      "account_id": "12345678901",
      "billing_account_id": "12345678901",
      "provider": "aws",
      "region": "us-east-1",
      "costs": [
        {
          "category": "Data Transfer",
          "amount": "0.0000899936"
        }
      ],
      "created_at": "2023-05-22T19:43:33.264Z"
    },
    {
      "token": "prvdr_rsrc_1ba2e3aa45678f9f",
      "uuid": "arn:aws:kms:us-east-1:12345678901:key/1234ab0d-56a7-89a3-45ab-89ab45ab1e34",
      "type": "aws_cloudfront_distribution",
      "label": "1234ab0d-56a7-89a3-45ab-89ab45ab1e34",
      "metadata": null,
      "account_id": "12345678901",
      "billing_account_id": "12345678901",
      "provider": "aws",
      "region": "us-east-1",
      "costs": [
        {
          "category": "API Request",
          "amount": "0.0000987564"
        }
      ],
      "created_at": "2023-05-23T19:43:33.264Z"
    },
    ...
```

The `pandas` dataframe below keeps the `uuid`, `type`, `provider`, `region`, `token`, `label`, and `account_id` fields for each record. Because `amount` and `category` are nested under `costs`, `record_path='costs'` flattens that list so that each cost entry becomes its own row. The `record_prefix='cost_'` argument adds `cost_` in front of each nested column name, producing the `cost_amount` and `cost_category` columns that are referenced later.

In [None]:
df = pd.json_normalize(
    all_data, 
    record_path='costs', 
    meta=['uuid', 'type', 'provider', 'region', 'token', 'label', 'account_id'],
    record_prefix='cost_'
)
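
To see what `json_normalize` produces, it can help to run it on a small sample first. The sketch below uses a trimmed copy of the two example records above (only `token`, `type`, `provider`, and `costs` are kept), so `sample` and `sample_df` are illustrative names rather than part of the main analysis.

```python
import pandas as pd

# Trimmed version of the two example records shown earlier
sample = [
    {
        "token": "prvdr_rsrc_1ba2e3aa45678f9f",
        "type": "aws_cloudfront_distribution",
        "provider": "aws",
        "costs": [{"category": "Data Transfer", "amount": "0.0000899936"}],
    },
    {
        "token": "prvdr_rsrc_1ba2e3aa45678f9f",
        "type": "aws_cloudfront_distribution",
        "provider": "aws",
        "costs": [{"category": "API Request", "amount": "0.0000987564"}],
    },
]

# Each entry in `costs` becomes a row; meta fields are repeated alongside it
sample_df = pd.json_normalize(
    sample,
    record_path="costs",
    meta=["token", "type", "provider"],
    record_prefix="cost_",
)
print(sample_df.columns.tolist())
print(sample_df)
```

Note that `cost_amount` is still a string at this point, which is why the next step converts it before summing.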

Convert `cost_amount` from a string to a `float` so that costs can be summed numerically. The `total_cost_df` dataframe then groups the records by `token` to give a total cost per resource.

In [None]:
df['cost_amount'] = df['cost_amount'].astype(float)
total_cost_df = df.groupby('token')['cost_amount'].sum().reset_index()
total_cost_df = total_cost_df.sort_values(by='cost_amount', ascending=False)
total_cost_df.head()
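
`astype(float)` raises a `ValueError` if a value cannot be parsed as a number. The source data shown here contains only numeric strings, so the following is a defensive sketch under the assumption that a malformed amount (the made-up `"n/a"` below) could appear. `pd.to_numeric` with `errors="coerce"` turns such values into `NaN`, which `sum()` skips by default.

```python
import pandas as pd

# Made-up data; "n/a" stands in for a hypothetical malformed amount
raw = pd.DataFrame({
    "token": ["t1", "t1", "t2"],
    "cost_amount": ["0.25", "0.50", "n/a"],
})

# Unparseable values become NaN instead of raising an error
raw["cost_amount"] = pd.to_numeric(raw["cost_amount"], errors="coerce")

totals = raw.groupby("token")["cost_amount"].sum()
print(totals)
```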

## Exploring High-Costing Resources
Now that you have the data, you can explore different visualizations using `matplotlib`. This visualization looks at the top cost-contributing resource types across all providers. A new dataframe groups by `type` and sums the `cost_amount` for each `type`.

In [None]:
type_cost_df = df.groupby('type')['cost_amount'].sum().reset_index()
# create table for visual
top_types = type_cost_df.sort_values(by='cost_amount', ascending=False).head(5)
print(top_types)

# Plot top resource types by cost
plt.figure(figsize=(10, 6))
plt.bar(top_types['type'], top_types['cost_amount'], color='coral')
plt.xlabel('Resource Type')
plt.ylabel('Total Cost')
plt.title('Top 5 Cost-Contributing Resource Types')
plt.xticks(rotation=45)
plt.show()

### Filter to a Specific Provider
If you want to see top-costing resources for only one provider (e.g., `azure` or `aws`), you can filter the results from the dataframe.

In [None]:
provider_filtered_df = df[df['provider'].str.lower() == 'azure']
type_cost_df = provider_filtered_df.groupby('type')['cost_amount'].sum().reset_index()
# create table for visual
top_types = type_cost_df.sort_values(by='cost_amount', ascending=False).head(5)
print(top_types)

# Plot top resource types by cost
plt.figure(figsize=(10, 6))
plt.bar(top_types['type'], top_types['cost_amount'], color='coral')
plt.xlabel('Resource Type')
plt.ylabel('Total Cost')
plt.title('Top 5 Cost-Contributing Resource Types (Azure)')
plt.xticks(rotation=45)
plt.show()
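
The two cells above repeat the same group, sort, and truncate steps. One way to avoid the duplication is a small helper, sketched below. `top_costs` is a hypothetical function name, and the `demo` frame uses made-up type names and amounts; only the `provider`, `type`, and `cost_amount` column names come from this notebook.

```python
import pandas as pd


def top_costs(frame, by, n=5, provider=None):
    """Sum cost_amount per `by` value and return the n largest totals."""
    if provider is not None:
        # Case-insensitive match, as in the provider filter above
        frame = frame[frame["provider"].str.lower() == provider.lower()]
    return (
        frame.groupby(by)["cost_amount"]
        .sum()
        .sort_values(ascending=False)
        .head(n)
        .reset_index()
    )


# Made-up data for illustration
demo = pd.DataFrame({
    "provider": ["aws", "aws", "azure", "azure"],
    "type": ["type_a", "type_b", "type_c", "type_c"],
    "cost_amount": [4.0, 1.0, 2.0, 0.5],
})

all_top = top_costs(demo, "type", n=2)
azure_top = top_costs(demo, "type", provider="azure")
print(all_top)
print(azure_top)
```

The returned frame has the same shape as `top_types`, so it could be passed to the same `plt.bar` call.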

## Exploring Top-Costing Regions
This visualization shows total costs by region across all providers. You can also filter to a single provider, as in the previous section.

In [None]:
region_cost_df = df.groupby('region')['cost_amount'].sum().reset_index()
# create table for visual
top_regions = region_cost_df.sort_values(by='cost_amount', ascending=False).head(5)
print(top_regions)

# Plot total cost for every region (the table above lists only the top 5)
plt.figure(figsize=(10, 6))
plt.bar(region_cost_df['region'], region_cost_df['cost_amount'], color='skyblue')
plt.xlabel('Region')
plt.ylabel('Total Cost')
plt.title('Total Cost by Region')
plt.xticks(rotation=45)
plt.show()
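
Filtering the region view to a single provider follows the same pattern as the provider filter used earlier. The sketch below uses made-up amounts and an illustrative set of regions; `demo` and `aws_regions` are hypothetical names.

```python
import pandas as pd

# Made-up data for illustration
demo = pd.DataFrame({
    "provider": ["aws", "aws", "aws", "azure"],
    "region": ["us-east-1", "us-west-2", "us-east-1", "eastus"],
    "cost_amount": [1.0, 3.0, 0.5, 2.0],
})

# Keep one provider, then total the cost per region
aws_regions = (
    demo[demo["provider"].str.lower() == "aws"]
    .groupby("region")["cost_amount"]
    .sum()
    .sort_values(ascending=False)
)
print(aws_regions)
```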


## Exploring Heatmap of Provider and Resource Types
This heatmap uses `matplotlib` and `seaborn`. It is built from a pivot table with one row per provider and one column per resource type, limited to the 10 highest-costing resource types in the dataset. Darker cells represent greater costs for that provider and resource type.

In [None]:
heatmap_data = df.pivot_table(values='cost_amount', index='provider', columns='type', aggfunc='sum').fillna(0)

# Keep only the top 10 highest-cost types for readability
top_types = df.groupby('type')['cost_amount'].sum().nlargest(10).index
heatmap_data = heatmap_data[top_types]

plt.figure(figsize=(16, 10))
sns.heatmap(heatmap_data, cmap='YlGnBu', annot=True, fmt=".4f", cbar_kws={'label': 'Total Cost'})
plt.xlabel('Type')
plt.ylabel('Provider')
plt.xticks(rotation=45, ha='right')  # Rotate labels
plt.title('Total Cost Distribution by Provider and Top Resource Types')
plt.show()
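
The pivot and column-selection steps can be checked on a small frame before plotting. The sketch below uses made-up type names and amounts, and keeps the top 2 types instead of 10 so the result is easy to read.

```python
import pandas as pd

# Made-up data for illustration
demo = pd.DataFrame({
    "provider": ["aws", "aws", "azure", "azure"],
    "type": ["type_a", "type_b", "type_a", "type_c"],
    "cost_amount": [5.0, 1.0, 2.0, 0.25],
})

# One row per provider, one column per type; missing combinations become 0
pivot = demo.pivot_table(
    values="cost_amount", index="provider", columns="type", aggfunc="sum"
).fillna(0)

# Rank types by total cost across all providers and keep the largest two
top = demo.groupby("type")["cost_amount"].sum().nlargest(2).index
pivot = pivot[top]
print(pivot)
```

Selecting columns with the ranked index also orders them from highest to lowest total cost, which is why the heatmap columns read left to right by cost.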

### Filtered Provider Heatmap
If your data skews heavily toward one provider (e.g., AWS), you could filter out that provider and view the heatmap for all other providers. Update the condition that defines `filtered_df` to include or exclude specific providers.

In [None]:
# Filter data to include all providers except 'aws'
filtered_df = df[df['provider'].str.lower() != 'aws']

heatmap_data = filtered_df.pivot_table(values='cost_amount', index='provider', columns='type', aggfunc='sum').fillna(0)

# Keep only the top 10 highest-cost types across the filtered data for readability
top_types = filtered_df.groupby('type')['cost_amount'].sum().nlargest(10).index
heatmap_data = heatmap_data[top_types]

plt.figure(figsize=(16, 10))
sns.heatmap(heatmap_data, cmap='YlGnBu', annot=True, fmt=".4f", cbar_kws={'label': 'Total Cost'})
plt.xlabel('Type')
plt.ylabel('Provider')
plt.xticks(rotation=45, ha='right')  # Rotate labels
plt.title('Total Cost Distribution by Provider (Excluding AWS) and Top Resource Types')
plt.show()

## Next Steps
Your original dataframe also includes other columns, such as `cost_category` and `account_id`. Consider creating other analyses to look at resource costs across these columns.
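
As a starting point, the same `groupby` pattern used throughout this notebook applies to these columns. The sketch below runs on made-up account IDs and amounts; the category column is named `cost_category` because of the `record_prefix='cost_'` argument used when the dataframe was built.

```python
import pandas as pd

# Made-up data for illustration
demo = pd.DataFrame({
    "account_id": ["111", "111", "222"],
    "cost_category": ["Data Transfer", "API Request", "Data Transfer"],
    "cost_amount": [1.0, 0.5, 2.0],
})

# Total cost per cost category, highest first
by_category = (
    demo.groupby("cost_category")["cost_amount"].sum().sort_values(ascending=False)
)

# Total cost per account
by_account = demo.groupby("account_id")["cost_amount"].sum()

print(by_category)
print(by_account)
```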