Merged
4 changes: 2 additions & 2 deletions .github/workflows/medium-publish.yml
@@ -21,13 +21,13 @@ jobs:
echo ::set-output name=file_name::$(echo ${{ steps.all_changed_files.outputs.added_modified }} | sed "s/.*medium\///;s/\.md.*//")
- name: Publish draft to Medium
id: medium_response
-uses: InfraWay/post-medium-action@v1.2.0
+uses: InfraWay/post-medium-action@v1.3.0
with:
app_id: ${{ secrets.MEDIUM_APP_ID }}
app_secret: ${{ secrets.MEDIUM_APP_SECRET }}
access_token: ${{ secrets.MEDIUM_ACCESS_TOKEN }}
markdown_file: .medium/${{ steps.medium_markdown_file.outputs.file_name }}.md
-base_url: https://thecloudbee.blog/
+base_url: https://thecloudbee.blog
- run: |
echo 'Published to Medium @'
echo ${{ steps.medium_response.outputs.url }}
60 changes: 60 additions & 0 deletions .medium/2021-02-15-s3-data-in-athena.md
@@ -0,0 +1,60 @@
---
slug: s3-data-in-athena
tags: [athena, awscloud]
...

# How to Smartly Query Your Data in S3 Using Athena?

Take decisions using your big data stored in S3 without running ETL jobs. Save cost by SMARTLY partitioning the data.

![2021-02-15/head.png](https://www.thecloudbee.blog/assets/images/2021-02-15/head.png)

## AWS Athena Pricing

Athena is a powerful service built to query data in S3. It suits semi-structured and unstructured data. You pay only for the amount of data scanned during query execution.

Since we pay for the amount of data scanned, we can make a difference by compressing the data, partitioning it, or storing it in a smarter format.

It goes without saying that big data stored in S3 should be compressed. Further, we can save the data in a columnar format such as Parquet, which makes Athena queries faster and cheaper. The drawback of the columnar approach is that the data must have a defined schema.

Lastly, partitions are the game-changer. Compare querying a data set that holds several months of data in one place with querying one that is partitioned by day: in the second case, a query for a single day scans only that day's data.
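
To make the difference concrete, here is a rough sketch of the arithmetic in Python. The price of 5 USD per TB scanned and the volume of 10 GiB of logs per day are assumptions for illustration only; check the current Athena pricing for your region before relying on the numbers.

```python
# Rough cost comparison: scanning 90 days of logs vs. a single daily partition.
# ASSUMPTION: 5 USD per TB scanned (verify against current Athena pricing).
# ASSUMPTION: 10 GiB of logs per day (hypothetical volume).

TB = 1024 ** 4  # bytes in one terabyte (binary)


def estimate_cost_usd(bytes_scanned: int, usd_per_tb: float = 5.0) -> float:
    """Return the approximate query cost for the given number of scanned bytes."""
    return bytes_scanned / TB * usd_per_tb


daily_bytes = 10 * 1024 ** 3              # one day of logs
full_scan = estimate_cost_usd(90 * daily_bytes)  # unpartitioned: scan everything
one_day = estimate_cost_usd(daily_bytes)         # partitioned: scan one day only

print(f"90-day scan: ${full_scan:.4f}")
print(f"1-day scan:  ${one_day:.4f}")
print(f"ratio:       {full_scan / one_day:.0f}x")
```

The ratio between the two costs equals the number of days skipped, which is why a filter on the partition column matters more as the data set grows.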

## Partitioning Data in Athena

In order to run a query in Athena, we must define a table with its columns and partitions. Athena supports different types of data: semi-structured data (like JSON and XML) and unstructured data (like CSV, logs, and text with delimiters).

It is ideal to save data in S3 day by day. This way we can partition on the date and query any subset with fewer resources.

[https://gist.github.com/thecloudbee/d98a0883167e923c892db9d12d855764](https://gist.github.com/thecloudbee/d98a0883167e923c892db9d12d855764)

[https://gist.github.com/thecloudbee/00edec18d88d9e87c6f92cb52e62c34e](https://gist.github.com/thecloudbee/00edec18d88d9e87c6f92cb52e62c34e)

Consider the folder structure and the sample DHCP IP allocation log above. We can define a `CREATE TABLE` query as follows.

[https://gist.github.com/thecloudbee/a0337c1a3c2c08ad0ccb29198ca1fabe](https://gist.github.com/thecloudbee/a0337c1a3c2c08ad0ccb29198ca1fabe)

`RegexSerDe` uses a regular expression (regex) to serialize/deserialize.
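
The idea can be sketched in plain Python: each capture group of the regex becomes one column of the row, in order. The log format, the pattern, and the column names below are hypothetical and are not taken from the gists above.

```python
import re
from typing import Optional

# HYPOTHETICAL log format: "<timestamp> <event> <ip> <mac>"
COLUMNS = ("timestamp", "event", "ip", "mac")

# One capture group per column; (?:...) is non-capturing and adds no column.
LOG_PATTERN = re.compile(
    r"^(\S+)\s+(\S+)\s+(\d{1,3}(?:\.\d{1,3}){3})\s+([0-9A-Fa-f:]{17})$"
)


def parse_line(line: str) -> Optional[dict]:
    """Map one log line to a column dict, or None if the line does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    return dict(zip(COLUMNS, match.groups()))


row = parse_line("2021-02-15T10:32:01Z DHCPACK 192.168.1.23 aa:bb:cc:dd:ee:ff")
print(row)
```

A line that does not match the pattern yields no usable columns, so it is worth testing the regex against real sample lines before creating the table.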

Finally, we can create partitions using the query below.

[https://gist.github.com/thecloudbee/af6486d4fd161124d558b39c33bd62fc](https://gist.github.com/thecloudbee/af6486d4fd161124d558b39c33bd62fc)
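
For a run of consecutive days, statements of this kind can be generated rather than typed by hand. The sketch below is illustrative and is not the query from the gist above: the table name `dhcp_logs`, the bucket `s3://example-bucket/firewall`, and the `<prefix>/<YYYY-MM-DD>/` layout are hypothetical. The partition column `date` is wrapped in backticks as a precaution, since it is a reserved word in Athena DDL.

```python
from datetime import date, timedelta
from typing import List


def add_partition_statements(
    table: str, s3_prefix: str, start: date, days: int
) -> List[str]:
    """Build one ALTER TABLE ... ADD PARTITION statement per day."""
    statements = []
    for offset in range(days):
        day = (start + timedelta(days=offset)).isoformat()  # e.g. 2021-02-15
        statements.append(
            f"ALTER TABLE {table} ADD IF NOT EXISTS "
            f"PARTITION (`date` = '{day}') "
            f"LOCATION '{s3_prefix}/{day}/'"
        )
    return statements


# HYPOTHETICAL table name and bucket, for illustration only.
stmts = add_partition_statements(
    "dhcp_logs", "s3://example-bucket/firewall", date(2021, 2, 15), 3
)
for stmt in stmts:
    print(stmt)
```

Each generated statement can then be run as its own Athena query.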

## Querying the Data

Let's write a query to get all the IPs assigned to a MAC address on a given date.

[https://gist.github.com/thecloudbee/f45a109f69022afed0b1d2252701db0a](https://gist.github.com/thecloudbee/f45a109f69022afed0b1d2252701db0a)

This is a simple SQL-like query. Note that the `WHERE` clause includes the partition column `date`. This limits the data that has to be scanned to the firewall folder for the specified date.

Athena also supports other kinds of queries, such as JOIN, GROUP BY, and so on.

## Conclusion

Athena is a powerful yet simple tool that can help you query your data in S3. It might not be the fastest way to query big data, but it has the upside of being able to query raw data.

Athena uses Presto behind the scenes, which is an interesting distributed query engine for big data.

---

*Originally published at [https://www.thecloudbee.blog](https://www.thecloudbee.blog/s3-data-in-athena/) on .*
18 changes: 17 additions & 1 deletion README.md
@@ -1 +1,17 @@
-# amrojsandhu.github.io
+# amrojsandhu.github.io

## Build

```shell script
bundle exec jekyll server --watch --incremental
```

## Test

Add the following to _config.yml

```yaml
# Filtering Content
future : true
unpublished : true
```
4 changes: 2 additions & 2 deletions _posts/2021-02-08-es-shard-sizes.md
@@ -6,7 +6,7 @@ categories: elasticsearch
image: assets/images/2021-02-08/head.png
description: "Shards are the heart of Elasticsearch. This blog takes the understanding of shards further to link it with performance."
featured: true
-hidden: false
+hidden: true
rating: 0
---

@@ -19,7 +19,7 @@ The search power of Elasticsearch revolves around the shard size. An index has m

> One shard is searched by a single thread.

-![2021-02-08/1.png]({{ site.baseurl }}/assets/images/2021-02-08/1.png){: .half-image }
+![2021-02-08/1.png]({{ site.baseurl }}/assets/images/2021-02-08/1.png){: .center-image }

Both indexes store the same amount of data but with different shard sizes. The left index has smaller shards than the right one, so more aggregations have to be performed on it. But in the latter index, the search time per shard is greater, as each shard stores double the amount of data.

57 changes: 57 additions & 0 deletions _posts/2021-02-15-s3-data-in-athena.md
@@ -0,0 +1,57 @@
---
layout: post
title: "How to Smartly Query Your Data in S3 Using Athena?"
author: amroj
categories: [athena, awscloud]
image: assets/images/2021-02-15/head.png
description: "Take decisions using your big data stored in S3 without running ETL jobs. Save cost by SMARTLY partitioning the data."
featured: true
hidden: true
rating: 0
---

## AWS Athena Pricing

Athena is a powerful service built to query data in S3. It suits semi-structured and unstructured data. You pay only for the amount of data scanned during query execution.

Since we pay for the amount of data scanned, we can make a difference by compressing the data, partitioning it, or storing it in a smarter format.

It goes without saying that big data stored in S3 should be compressed. Further, we can save the data in a columnar format such as Parquet, which makes Athena queries faster and cheaper. The drawback of the columnar approach is that the data must have a defined schema.

Lastly, partitions are the game-changer. Compare querying a data set that holds several months of data in one place with querying one that is partitioned by day: in the second case, a query for a single day scans only that day's data.

## Partitioning Data in Athena

In order to run a query in Athena, we must define a table with its columns and partitions. Athena supports different types of data: semi-structured data (like JSON and XML) and unstructured data (like CSV, logs, and text with delimiters).

It is ideal to save data in S3 day by day. This way we can partition on the date and query any subset with fewer resources.

{% gist thecloudbee/d98a0883167e923c892db9d12d855764 %}

{% gist thecloudbee/00edec18d88d9e87c6f92cb52e62c34e %}

Consider the folder structure and the sample DHCP IP allocation log above. We can define a `CREATE TABLE` query as follows.

{% gist thecloudbee/a0337c1a3c2c08ad0ccb29198ca1fabe %}

`RegexSerDe` uses a regular expression (regex) to serialize/deserialize.

Finally, we can create partitions using the query below.

{% gist thecloudbee/af6486d4fd161124d558b39c33bd62fc %}

## Querying the Data

Let's write a query to get all the IPs assigned to a MAC address on a given date.

{% gist thecloudbee/f45a109f69022afed0b1d2252701db0a %}

This is a simple SQL-like query. Note that the `WHERE` clause includes the partition column `date`. This limits the data that has to be scanned to the firewall folder for the specified date.

Athena also supports other kinds of queries, such as JOIN, GROUP BY, and so on.

## Conclusion

Athena is a powerful yet simple tool that can help you query your data in S3. It might not be the fastest way to query big data, but it has the upside of being able to query raw data.

Athena uses Presto behind the scenes, which is an interesting distributed query engine for big data.
Binary file added assets/images/2021-02-15/head.png