# Blob life cycle - Research notebook

## Introduction

This notebook provides a structured approach to analyze your blob access patterns. By understanding these patterns, you can craft lifecycle policies that effectively reduce storage costs.

## Prerequisites

Before you begin, ensure you have the following data points available in the same storage account that's being used as your data lake:

- Blob Inventory Snapshot: This should be in parquet format.
- Blob Diagnostic Logs: Specifically, the 'Read' file logs.

## Connecting to Your Data Lake

In the Python cell provided, you'll need to input the appropriate values for your data lake storage account and key. This will establish a connection to your data lake.

This notebook provides the key steps to analyze your blob access patterns, with the goal of creating lifecycle policies that reduce storage costs.

In [None]:
## Inventory storage

invent_storage_account_name = "<your data lake storage account>"
invent_storage_account_key = "< your key>"

spark.conf.set("fs.azure.account.key.{0}.dfs.core.windows.net".format(invent_storage_account_name), invent_storage_account_key)

## Reading Diagnostic Logs from Parquet Files

In this section, we will load the diagnostic logs, which have been copied via Azure Data Factory, into a DataFrame for analysis.

Ensure the data you're reading is organized in a structured format. For this example, we're reading logs for the year 2023, but you can modify the path to suit your requirements.

In [None]:
lake_container = 'inventorylake'
# Set the path to your parquet log files, note that with 'y=2023/*/*/*/*/' you read all logs of 2023
diag_log_path = 'diag_logs/resourceId=/subscriptions/11acf2e0-XXXX-XXXX-86ef-37c53a9XXXXX/resourceGroups/ext-lake/providers/Microsoft.Storage/storageAccounts/targetextsa/blobServices/default/y=2023/*/*/*/*/'
# Define the path to your parquet files so spark can read them
path = f"abfss://{lake_container}@{invent_storage_account_name}.dfs.core.windows.net/{diag_log_path}"

# Read the parquet files into a DataFrame
df = spark.read.parquet(path)
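
To read a narrower window than a full year, the wildcard segments of the path can be replaced with concrete partition values. The sketch below assumes the log folders follow a `y=/m=/d=/h=/m=` layout. The path above only shows `y=2023` followed by four wildcards, so the segment names and the helper `build_log_glob` are assumptions to check against your own container.

```python
from typing import Optional

def build_log_glob(prefix: str, year: int,
                   month: Optional[int] = None,
                   day: Optional[int] = None) -> str:
    """Build a glob for the log folders under `prefix` (hypothetical helper)."""
    # Assumed layout: y=YYYY/m=MM/d=DD/h=HH/m=MM/ ; levels left unset become '*'
    month_part = f"m={month:02d}" if month is not None else "*"
    day_part = f"d={day:02d}" if day is not None else "*"
    # Hour and minute folders are always wildcarded in this sketch
    return f"{prefix}/y={year}/{month_part}/{day_part}/*/*/"

# Placeholder prefix, not a real resource path
prefix = "diag_logs/<resource path>/blobServices/default"
print(build_log_glob(prefix, 2023))           # whole year, same shape as the path above
print(build_log_glob(prefix, 2023, month=8))  # a single month
```

The returned string can be substituted for `diag_log_path` before the `abfss://` path is built.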


### Summary of Activities Over the Selected Period

To gain a quick overview of the various operations performed in your storage account, we'll employ the groupBy operation. This will help us understand the distribution of different activities.

It's crucial to note that for analyzing access patterns, our primary focus will be on the 'GetBlob' and 'ReadFile' operations. By setting truncate=False, we ensure that the complete content of each field is displayed, allowing for a thorough understanding of each operation.

In [None]:
# Group the DataFrame by operation names and count the occurrences of each operation
df.groupBy("operationName").count().show(truncate=False)

These are the results of the groupBy operation on the diagnostic logs of our experimental storage account:

|operationName              |count|
|---------------------------|-----|
|BlobPreflightRequest       |84   |
|ListFilesystemDir          |393  |
|GetBlobServiceProperties   |55   |
|GetBlobMetadata            |7    |
|GetBlob                    |664  |
|ListBlobs                  |3148 |
|GetPathAccessControl       |6    |
|GetContainerProperties     |1318 |
|ListContainers             |793  |
|GetPathStatus              |857  |
|GetBlockList               |150  |
|GetBlobProperties          |647  |
|GetContainerServiceMetadata|45   |
|ReadFile                   |682  |
|GetFilesystemProperties    |3    |
|ListFilesystems            |5    |
|GetContainerACL            |2    |

#### Analysis of Operations:

From the table above, we can derive a few insights:

- High Frequency Operations: The ```ListBlobs``` and ```GetContainerProperties``` operations are the most frequent, with 3148 and 1318 counts respectively. This indicates regular listing and retrieval activities on the blobs and container properties.
- Primary Access Patterns: The ```GetBlob``` and ```ReadFile``` operations, which are crucial for our analysis, have been performed 664 and 682 times respectively. This shows a balanced distribution of both blob retrieval and file reading activities.
- Least Frequent Operations: Operations like ```GetFilesystemProperties```, ```ListFilesystems```, and ```GetContainerACL``` have the lowest counts, suggesting they are not regular activities in the current environment.

Understanding the distribution of these operations will help in formulating effective lifecycle policies that reflect real usage patterns.

### Filtering by Time, Specific Operations, and Blob Name

For a more targeted analysis, it's essential to narrow down our dataset. We'll focus on the operations ```ReadFile``` and ```GetBlob``` as these are central to understanding blob access patterns. Furthermore, we'll eliminate platform-related operations to ensure that our insights are based solely on user-based activities.

In the upcoming cell, we'll generate a new DataFrame (```final_df```) which will consist of three fields:

- ```time```: Timestamp of the operation
- ```operationName```: The name of the operation (either ```ReadFile``` or ```GetBlob```)
- ```blob_name```: The name of the blob being accessed

In [None]:
from pyspark.sql.functions import col, regexp_extract, from_unixtime, unix_timestamp

# Define the format of your time column
time_format = "yyyy-MM-dd'T'HH:mm:ss.SSSSSSS'Z'"
# Filtering out platform operations
filtered_df = df.filter(~(col("uri").contains("$accountmetadata") | 
                          col("uri").contains("$logs") | 
                          col("uri").contains("%24logs")))
# Filtering out all non related operations
filtered_df = filtered_df.filter(col("operationName").isin("GetBlob", "ReadFile"))
# Regular expression pattern to extract desired blob_name
pattern = r'(?:https?:\/\/[^\/]+\/)([^?]+)'

# Extract blob_name using regexp_extract
df_with_blob_name = filtered_df.withColumn("blob_name", regexp_extract(col("uri"), pattern, 1))
# Parse the time string and reformat it; from_unixtime returns a 'yyyy-MM-dd HH:mm:ss' string, so fractional seconds are dropped
df_with_datetime = df_with_blob_name.withColumn("time", from_unixtime(unix_timestamp(col("time"), time_format)))

# Select the relevant columns for our final DataFrame
final_df = df_with_datetime.select("time", "operationName", "blob_name")

final_df.show(truncate=False)
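
To see what the regular expression captures, the same pattern can be tried in plain Python outside Spark. This is a minimal sketch; the sample URI and the helper name `extract_blob_name` are made up for illustration.

```python
import re

# Same pattern as in the cell above: skip scheme and host, capture everything up to '?'
PATTERN = re.compile(r'(?:https?:\/\/[^\/]+\/)([^?]+)')

def extract_blob_name(uri: str) -> str:
    """Return the path after the host, without the query string ('' if no match)."""
    match = PATTERN.search(uri)
    return match.group(1) if match else ""

# Hypothetical URI, for illustration only
sample = "https://targetextsa.blob.core.windows.net/data/2023/file.parquet?sv=2021-08-06"
print(extract_blob_name(sample))  # data/2023/file.parquet
```

Note that the captured value starts with the container name (`data` in this sample), which matters later when `blob_name` is joined against the inventory `Name` column.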

### Analyzing Access Patterns

#### Determining the Last Access Time

Azure Storage does provide an option to track the last access time for blobs, but it's not enabled by default. To bridge this gap, we can derive the last access time for each blob using our diagnostic logs. In the subsequent cell, we'll group the data by each blob and identify the most recent access timestamp for it.

In [None]:
from pyspark.sql.functions import max

# Group by blob_name and aggregate to get the latest timestamp for each blob
result_df = final_df.groupBy("blob_name").agg(max("time").alias("last_access_timestamp"))

# Show the result
result_df.show(truncate=False)
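
The grouping step can be pictured with a small plain-Python sketch. It assumes, as in the cell that builds `final_df`, that timestamps are strings in `yyyy-MM-dd HH:mm:ss` form, which sort chronologically when compared as text. The function name and sample rows are illustrative only.

```python
def last_access(events):
    """Map each blob name to its latest timestamp string.

    `events` is an iterable of (blob_name, timestamp) pairs where the timestamp
    is a 'yyyy-MM-dd HH:mm:ss' string, so string comparison matches time order.
    """
    latest = {}
    for blob_name, timestamp in events:
        if blob_name not in latest or timestamp > latest[blob_name]:
            latest[blob_name] = timestamp
    return latest

# Made-up rows for illustration
rows = [
    ("data/a.parquet", "2023-03-01 10:15:00"),
    ("data/a.parquet", "2023-06-20 08:00:00"),
    ("data/b.parquet", "2023-01-05 23:59:59"),
]
print(last_access(rows))
```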


## Analyzing the Blob Inventory 

In the next steps, we will focus on reading and processing the blob inventory stored as a parquet file. This process involves:

1. Loading the blob inventory data.
2. Filtering out platform-related entries.
3. Selecting the relevant fields. (For optimization, it's a good practice to only retrieve necessary fields during the initial read.)
4. Converting timestamps from the long data type into a more human-readable format.

In [None]:
from pyspark.sql.functions import col,from_unixtime
# the inventory files are in this container (consider moving later to the lake)
container = 'inventory'
# Define the path to your parquet files
inventory_file = "parquet-all.parquet"
path = f"abfss://{container}@{invent_storage_account_name}.dfs.core.windows.net/2023/08/27/07-02-03/parquet-all/{inventory_file}"

# Read the parquet files into a DataFrame
_df = spark.read.parquet(path)
# First, filter out platform-related rows: drop rows where Name starts with "$logs"
filtered_inventory = _df.filter(~col("Name").startswith("$logs"))
# We need only these fields:
new_df = filtered_inventory.select("Name","Creation-Time", "Last-Modified", "Content-Length")
# Readable date time format
date_format = "yyyy-MM-dd HH:00:00"

# Convert Creation-Time and Last-Modified to the desired format
inventory_df = new_df.withColumn("Creation-Time", from_unixtime(col("Creation-Time") / 1000, date_format))
inventory_df = inventory_df.withColumn("Last-Modified", from_unixtime(col("Last-Modified") / 1000, date_format))

# Show the result
inventory_df.show(truncate=False)
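
The timestamp conversion above divides the millisecond value by 1000 and formats it with the minutes and seconds fixed at `00`. A minimal plain-Python sketch of the same arithmetic follows. It uses UTC explicitly, which is an assumption: Spark's `from_unixtime` formats in the session time zone.

```python
from datetime import datetime, timezone

def ms_to_hour_string(epoch_ms: int) -> str:
    """Format epoch milliseconds as 'YYYY-MM-DD HH:00:00' (truncated to the hour, UTC)."""
    dt = datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)
    # Minutes and seconds are literal zeros, mirroring the "yyyy-MM-dd HH:00:00" format
    return dt.strftime("%Y-%m-%d %H:00:00")

# 3,723,000 ms is 1 hour, 2 minutes and 3 seconds after the epoch
print(ms_to_hour_string(3_723_000))  # 1970-01-01 01:00:00
```

Because of the truncation, two events within the same clock hour end up with identical timestamp strings.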

### Merging the Inventory and Access Logs 

To gain a comprehensive view of blob events, we'll merge data from both the blob inventory (```inventory_df```) and the access logs (```final_df```).

Approach:

1. Expand the ```inventory_df```: For each blob, create separate entries for both the creation and update events, along with their associated timestamps.
2. Join DataFrames: Merge ```inventory_df``` and ```final_df``` based on the blob names.
3. Extract Relevant Events: Utilize the operationName column and the timestamp difference between ```Creation-Time``` and ```Last-Modified``` to determine the nature of the event (e.g., creation, update, read).
4. Filter and Construct the Final DataFrame: Present the results in a structured manner, highlighting the event type, blob name, and associated timestamps.

In [None]:
from pyspark.sql.functions import col, lit, when, min, max


# Create a DataFrame for creation events
creation_df = inventory_df.select(
    col("Name"),
    col("Content-Length"),
    col("Creation-Time").alias("event_time"),
    lit("creation").alias("eventType")
)

# Create a DataFrame for update events
update_df = inventory_df.select(
    col("Name"),
    col("Content-Length"),
    col("Last-Modified").alias("event_time"),
    lit("update").alias("eventType")
)

# Combine the two DataFrames using union
inventory_expanded = creation_df.union(update_df)

# Remove rows that indicate same time for creation and update:
# Group by Name and aggregate to check if creation and update times are the same

agg_df = inventory_expanded.groupBy("Name", "Content-Length")\
                           .agg(min("event_time").alias("min_time"), 
                                max("event_time").alias("max_time"))\
                           .withColumn("eventType", 
                                       when(col("min_time") == col("max_time"), lit("creation")).otherwise(lit("update")))


# Select the appropriate columns based on the condition
_result = agg_df.withColumn("event_time", 
                            when(col("eventType") == "creation", col("min_time"))
                            .otherwise(col("max_time"))
)

_result = _result.select("Name", "event_time", "eventType", "Content-Length")

# Join with final_df (we created it in previous step from the logs). The join is done on the blob name.
joined_df = final_df.join(_result, final_df.blob_name == _result.Name, how="outer")

# If operationName is "ReadFile" or "GetBlob", update the eventType
final_result = joined_df.withColumn("eventType", 
                                    when(col("operationName").isin(["ReadFile", "GetBlob"]), col("operationName"))
                                    .otherwise(col("eventType"))
)

# Select the desired columns and order by the name of the blob and the event time
final_result = final_result.select("event_time", "eventType", "Name", "Content-Length").orderBy("Name", "event_time")

final_result.show(truncate=False)
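
The event-labelling rule used in this cell can be summarized in a few lines of plain Python. This is a sketch of the decision logic only, with a hypothetical function name, not a replacement for the Spark code.

```python
def classify_event(creation_time, last_modified, operation_name=None):
    """Return the event type the merge step assigns to a row.

    A read operation from the logs takes precedence; otherwise the row is a
    'creation' when both inventory timestamps are equal, and an 'update' if not.
    """
    if operation_name in ("ReadFile", "GetBlob"):
        return operation_name
    return "creation" if creation_time == last_modified else "update"

print(classify_event("2023-08-01 10:00:00", "2023-08-01 10:00:00"))             # creation
print(classify_event("2023-08-01 10:00:00", "2023-08-03 14:00:00"))             # update
print(classify_event("2023-08-01 10:00:00", "2023-08-01 10:00:00", "GetBlob"))  # GetBlob
```

Since the inventory timestamps were truncated to the hour earlier, a blob modified within the same hour it was created would be labelled `creation` under this rule.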

### Calculating Distinct Blobs in the Inventory

To understand the scale of our data, let's determine the total number of unique blobs present in our inventory.

In [None]:
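from pyspark.sql.functions import countDistinct

# Calculate the total number of distinct blobs (inventory_expanded holds the creation and update rows built from inventory_df)
total_blobs_inventory = inventory_expanded.agg(countDistinct("Name").alias("total_blobs_inventory")).collect()[0]["total_blobs_inventory"]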

# Print the result
print("Total number of distinct blobs in inventory_df:", total_blobs_inventory)

Total number of distinct blobs in inventory_df: 83213


### Analyzing Storage Behavior

By grouping and counting events in the ```final_result``` DataFrame, we can gain insights into the predominant actions taken on the storage. Let's break down the number of each event type, including creation, update, and read actions.

In [None]:
from pyspark.sql.functions import count

# Group by eventType and count the occurrences
grouped_result = final_result.groupBy("eventType").agg(count("Name").alias("count"))

# Display the result
grouped_result.show(truncate=False)


The table below shows the distribution of different event types for our experimental storage:

|eventType|count|
|---------|-----|
|GetBlob  |7    |
|creation |83177|
|update   |27   |
|ReadFile |29   |

From the results, it's evident that the dominant operation on this storage is blob creation, accounting for __99.92%__ of all events. This highlights a potential area for optimization, as understanding such patterns can inform better storage management strategies.
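
The share quoted above can be checked directly from the counts in the table. A short sketch of the arithmetic:

```python
# Event counts copied from the table above
event_counts = {"GetBlob": 7, "creation": 83177, "update": 27, "ReadFile": 29}

total_events = sum(event_counts.values())
creation_share = 100 * event_counts["creation"] / total_events

print(total_events)              # 83240
print(f"{creation_share:.2f}%")  # 99.92%
```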

### Analyzing File Access Patterns

Understanding the frequency and timing of file access can be crucial for optimizing storage costs and management. One insightful way to analyze this is by calculating the duration from when a file was created (or last updated) to its most recent access time. This can help identify patterns of blob usage over time.

In this section, we will:

1. Filter the dataset to consider only the "GetBlob" and "ReadFile" events.
2. Convert the event time from a string to an actual timestamp.
3. Exclude rows whose event time falls within the last 30 days (recent events may not reflect longer-term patterns).
4. Group the data by event type and count the distinct blobs accessed for each.


Let's start by calculating how many blobs were accessed during the period covered by our logs:

In [None]:
from pyspark.sql.functions import to_timestamp, current_date, datediff, countDistinct

# Convert event_time from string to timestamp
filtered_df = final_result.withColumn("event_time", to_timestamp("event_time", "yyyy-MM-dd HH:mm:ss"))

# Keep only rows whose eventType is "GetBlob" or "ReadFile"
filtered_df = filtered_df.filter(col("eventType").isin(["GetBlob", "ReadFile"]))

# Keep rows whose event_time is more than 30 days before the current date (the threshold could be made a parameter)
filtered_df = filtered_df.filter(datediff(current_date(), col("event_time")) > 30)

# Group by eventType and count distinct blobs
grouped_result = filtered_df.groupBy("eventType").agg(countDistinct("Name").alias("num_blobs"))

# Display the result
grouped_result.show(truncate=False)


#### Interpretation of Results:

From the data, it's evident that there's limited access to the blobs over the extended period covered by our logs:

|eventType|num_blobs|
|---------|---------|
|GetBlob  |4        |
|ReadFile |5        |

This indicates that a majority of blobs, once created or updated, aren't accessed frequently. Such insights can be crucial when deciding on blob lifecycle policies.

### Grouping Blob Access into Bins

To further refine our understanding of blob access patterns, we can group blobs into bins based on the number of days since their creation or last update. This will allow us to see if there are common periods after which blobs are accessed.

In the next cell:

1. We compute the difference between the current date and each blob's event time.
2. We group blobs into 5-day bins (e.g., 0-4 days, 5-9 days, etc.).
3. We count the number of unique blobs in each bin.


Let's see how the blobs are distributed across these bins:

In [None]:
from pyspark.sql.functions import floor,collect_list, countDistinct, datediff, current_date

# Assuming last_access_time is the current date for simplicity. 
# You can replace current_date() with the appropriate column or value if you have a specific last access time.
_filtered_df = filtered_df.withColumn("days_diff", datediff(current_date(), col("event_time")))

# Bucket the blobs based on the time difference
# Using floor function to create buckets. Each bucket will have a range of 5 days (2 days on either side of a central value).
_filtered_df = _filtered_df.withColumn("bucket", floor(col("days_diff") / 5))

# Count the number of distinct blobs in each bucket
bucket_counts = _filtered_df.groupBy("bucket").agg(countDistinct("Name").alias("num_blobs"))

# Convert the bucket number to denote the access pattern in days
bucket_counts = bucket_counts.withColumn("days", (col("bucket") * 5) + 2) # Adding 2 to get the central value

# Select the desired columns and order by days
bucket_counts = bucket_counts.select("days", "num_blobs").orderBy("days")

# Display the result
bucket_counts.show(truncate=False)
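
The bucketing arithmetic is easier to follow on a few concrete numbers. The sketch below mirrors the `floor(days_diff / 5)` and `bucket * 5 + 2` steps in plain Python; the function name `bin_center` is illustrative.

```python
def bin_center(days_diff: int, width: int = 5) -> int:
    """Return the central day of the bin that `days_diff` falls into."""
    bucket = days_diff // width          # same as floor(days_diff / width) for non-negative values
    return bucket * width + width // 2   # for width 5 this adds 2, the middle of the bin

for days in (45, 47, 49, 50, 254):
    print(days, "->", bin_center(days))
# 45, 47 and 49 all map to 47; 50 maps to 52; 254 maps to 252
```

A reported value of 47 therefore stands for any difference from 45 to 49 days, not exactly 47.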

#### Interpretation of Blob Access Bins

The table below depicts the distribution of blob access based on the number of days since their creation or last update:


|days|num_blobs|
|----|---------|
|47  |3        |
|52  |1        |
|57  |1        |
|62  |2        |
|252 |2        |

__From the table:__

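- The ```days``` column is the central value of each 5-day bin of elapsed days, measured from the blob's creation or last update (its event time) to the current date, which the code uses as a stand-in for the last access time.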
- The ```num_blobs``` column signifies how many blobs fit into each respective time bin.

__Insights:__

- A total of 9 blobs (3 + 1 + 1 + 2 + 2) fall into these bins, with elapsed times ranging from roughly 47 to 252 days.
- Within this timeframe, there are 5 distinct periods (or bins) when blobs were accessed.

This binning approach allows us to understand typical access patterns and could inform decisions on how long to retain blobs in higher-cost storage tiers before moving them to archival or cooler storage, thereby potentially realizing cost savings.

### Analyzing the Percentage of Blob Access

While understanding the distribution of blob access in bins is insightful, it's also important to gauge the proportion of blobs accessed relative to the total number of blobs in storage. By calculating this percentage, we can get a better sense of how active or dormant our storage is over a specified timeframe.

In the following cell, we will:

1. Compute the total number of distinct blobs in the inventory.
2. Incorporate this total into our previously created bin distribution table.
3. Calculate the percentage of blobs accessed for each bin, providing a clearer perspective on the intensity of access.

Let's view the distribution as a percentage of the total:

In [None]:
from pyspark.sql.functions import sum, round

# Calculate the total number of distinct blobs from inventory_df
total_blobs_inventory = inventory_expanded.agg(countDistinct("Name").alias("total_blobs_inventory")).collect()[0]["total_blobs_inventory"]

# Add the total_blobs to the bucket_counts DataFrame
bucket_counts = bucket_counts.withColumn("total_blobs", lit(total_blobs_inventory))

# Calculate the percentage of blobs accessed for each bucket
bucket_counts = bucket_counts.withColumn("percentage_accessed", 
                                         round((col("num_blobs") / col("total_blobs")) * 100, 3).cast("double"))

# Display the result
bucket_counts.show(truncate=False)


#### Interpretation of Blob Access as a Percentage:

The table below showcases the ratio of each blob access group compared to the total number of blobs:

|days|num_blobs|total_blobs|percentage_accessed|
|----|---------|-----------|-------------------|
|47  |3        |83213      |0.004              |
|52  |1        |83213      |0.001              |
|57  |1        |83213      |0.001              |
|62  |2        |83213      |0.002              |
|252 |2        |83213      |0.002              |

Although the storage account under investigation provides a limited view of access patterns, it still offers insights into a few common access behaviors. The percentages indicate a relatively low activity in terms of accessing the blobs post their creation or last update, suggesting that most of the data remains dormant over extended periods.
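
The percentages in the table can be reproduced from the bin counts and the inventory total. The sketch below also adds up the bins to give the overall share of accessed blobs, a figure that follows from the table but is not shown in it.

```python
TOTAL_BLOBS = 83213                             # distinct blobs in the inventory
bins = {47: 3, 52: 1, 57: 1, 62: 2, 252: 2}     # days -> num_blobs, from the table above

# Percentage per bin, rounded to 3 decimals as in the Spark cell
percentages = {days: round(100 * n / TOTAL_BLOBS, 3) for days, n in bins.items()}

# Overall share of blobs that appear in any bin
accessed_total = sum(bins.values())
overall = round(100 * accessed_total / TOTAL_BLOBS, 3)

print(percentages)               # {47: 0.004, 52: 0.001, 57: 0.001, 62: 0.002, 252: 0.002}
print(accessed_total, overall)   # 9 0.011
```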

### Outcome

The analysis above has led to the identification of various "buckets" or clusters of blobs that exhibit similar access patterns. By discerning the percentage of total blobs in each cluster, we can make more informed decisions regarding the ideal lifecycle policies for data storage.
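
As an illustration of where such findings might lead, the sketch below builds a lifecycle management policy document as a Python dictionary and prints it as JSON. The rule name and the 30 and 90 day thresholds are placeholders chosen for the example, not values derived from this analysis; pick thresholds from your own bin distribution.

```python
import json

# Hypothetical thresholds, for illustration only
COOL_AFTER_DAYS = 30
ARCHIVE_AFTER_DAYS = 90

policy = {
    "rules": [
        {
            "enabled": True,
            "name": "tier-down-dormant-blobs",  # hypothetical rule name
            "type": "Lifecycle",
            "definition": {
                "filters": {"blobTypes": ["blockBlob"]},
                "actions": {
                    "baseBlob": {
                        # Move blobs to cooler tiers based on days since last modification
                        "tierToCool": {"daysAfterModificationGreaterThan": COOL_AFTER_DAYS},
                        "tierToArchive": {"daysAfterModificationGreaterThan": ARCHIVE_AFTER_DAYS},
                    }
                },
            },
        }
    ]
}

print(json.dumps(policy, indent=2))
```

The rule is keyed on modification time because this notebook derives access information from logs rather than from last access time tracking, which is not enabled by default.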

### Size Calculations

It's not just the number of blobs that's important; the size of these blobs can also have a significant impact on storage costs and decisions about data lifecycle management. In this section, we'll delve into the size of the blobs to discern the overall storage footprint, both for accessed and non-accessed blobs.

In [None]:
from pyspark.sql.functions import sum

# Conversion factor from bytes to MB
bytes_to_mb = 1 / (2**20)

# Calculate the total size of all blobs in MB
total_size_mb = inventory_df.agg((sum("Content-Length") * bytes_to_mb).alias("total_size_mb")).collect()[0]["total_size_mb"]

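# Calculate the total size of accessed blobs in MB
# Note: filtered_df holds one row per read event, so a blob read several times is summed several times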
accessed_size_mb = filtered_df.agg((sum("Content-Length") * bytes_to_mb).alias("accessed_size_mb")).collect()[0]["accessed_size_mb"]

# Calculate the size of blobs that were not accessed in MB
not_accessed_size_mb = total_size_mb - accessed_size_mb

print(f"Total size of all blobs (MB): {total_size_mb:.2f}")
print(f"Total size of accessed blobs (MB): {accessed_size_mb:.2f}")
print(f"Size of blobs not accessed (MB): {not_accessed_size_mb:.2f}")
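
Because the read-event rows repeat a blob's size once per read, it can be useful to compare the plain sum with a sum that counts each blob once. The following plain-Python sketch shows the difference on made-up data; the function names are illustrative, and in Spark the same effect could be obtained by dropping duplicate `Name` rows before summing.

```python
BYTES_PER_MB = 2**20

def summed_per_event_mb(read_events):
    """Sum sizes over every read event (a blob read twice counts twice)."""
    return sum(size for _, size in read_events) / BYTES_PER_MB

def summed_per_blob_mb(read_events):
    """Sum sizes counting each blob name once, however often it was read."""
    size_by_blob = {}
    for name, size in read_events:
        size_by_blob[name] = size  # repeated reads overwrite instead of adding
    return sum(size_by_blob.values()) / BYTES_PER_MB

# Made-up events: blob "a" (1 MB) read twice, blob "b" (2 MB) read once
events = [("a", 2**20), ("a", 2**20), ("b", 2**21)]
print(summed_per_event_mb(events))  # 4.0
print(summed_per_blob_mb(events))   # 3.0
```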


#### Insights on Blob Sizes

These are the results from the cell above:

```
Total size of all blobs (MB): 8863.65
Total size of accessed blobs (MB): 5019.23
Size of blobs not accessed (MB): 3844.42
```

Although creation accounts for 99.92% of events, the accessed blobs add up to 5019.23 MB of the 8863.65 MB total (about 57%), which suggests that the few blobs that are read tend to be large. One caveat: the accessed total is summed over the rows of ```filtered_df```, which contains one row per read event, so a blob read several times contributes its size more than once and the figure is best read as an upper bound. Either way, lifecycle policies should weigh both access frequency and blob size, and size-based criteria could be a useful factor in fine-tuning them.

## Conclusion

Our analysis has illuminated the access patterns of blobs in the storage account and highlighted the significance of considering blob sizes. Armed with this knowledge, organizations can craft more efficient storage strategies, potentially leading to cost savings and optimized data management. As storage requirements evolve and data grows, continually revisiting and refining these strategies will ensure that storage is both cost-effective and aligned with operational needs.

Remember, while this analysis provides a foundation, real-world scenarios may require deeper dives and considerations tailored to specific use cases. Always adapt and iterate based on the unique characteristics of your data environment.