thecloudbee · github-actions · Feb 15, 2021 · Feb 14, 2021 · Feb 14, 2021 · Feb 14, 2021
diff --git a/.github/actions/post-medium-action/node_modules/@actions/http-client/actions.png b/.github/actions/post-medium-action/node_modules/@actions/http-client/actions.png
diff --git a/.github/workflows/medium-publish.yml b/.github/workflows/medium-publish.yml
@@ -21,13 +21,13 @@ jobs:
           echo ::set-output name=file_name::$(echo ${{ steps.all_changed_files.outputs.added_modified }} | sed "s/.*medium\///;s/\.md.*//")
       - name: Publish draft to Medium
         id: medium_response
-        uses: InfraWay/post-medium-action@v1.2.0
+        uses: InfraWay/post-medium-action@v1.3.0
         with:
           app_id: ${{ secrets.MEDIUM_APP_ID }}
           app_secret: ${{ secrets.MEDIUM_APP_SECRET }}
           access_token: ${{ secrets.MEDIUM_ACCESS_TOKEN }}
           markdown_file: .medium/${{ steps.medium_markdown_file.outputs.file_name }}.md
-          base_url: https://thecloudbee.blog/
+          base_url: https://thecloudbee.blog
       - run: |
           echo 'Published to Medium @'
           echo ${{ steps.medium_response.outputs.url }}
diff --git a/.medium/2021-02-15-s3-data-in-athena.md b/.medium/2021-02-15-s3-data-in-athena.md
@@ -0,0 +1,60 @@
+---
+slug: s3-data-in-athena
+tags: [athena, awscloud]
+...
+
+# How to Smartly Query Your Data in S3 Using Athena?
+
+Make decisions using your big data stored in S3 without running ETL jobs. Save cost by SMARTLY partitioning the data.
+
+![2021-02-15/head.png](https://www.thecloudbee.blog/assets/images/2021-02-15/head.png)
+
+## AWS Athena Pricing
+
+Athena is a powerful service built to query data in S3. It suits semi-structured and unstructured data. You pay only for the amount of data scanned during query execution.
+
+Since we pay for the amount of data scanned, we can reduce the cost by compressing the data, partitioning it, or storing it in a smart format.
+
+It goes without saying that big data stored in S3 should be compressed. Further, we can save the data in a columnar format such as Parquet, which makes Athena queries faster and cheaper. The drawback of the latter approach is that the data must have a defined schema.
+
+Lastly, partitions are the game-changers. Compare querying a data set that holds several months of data in one place with querying one that is partitioned by day: in the second case, a query for a single day scans only that day's data.
+
+## Partitioning Data in Athena
+
+In order to run a query on Athena, we must define a table with columns and partitions. Athena supports different types of data: semi-structured data (like JSON and XML) and unstructured data (like CSV, logs, and text with delimiters).
+
+It is ideal to save data day-wise in S3. This way we can partition on the date and query any subset while scanning less data.
+
+[https://gist.github.com/thecloudbee/d98a0883167e923c892db9d12d855764](https://gist.github.com/thecloudbee/d98a0883167e923c892db9d12d855764)
+
+[https://gist.github.com/thecloudbee/00edec18d88d9e87c6f92cb52e62c34e](https://gist.github.com/thecloudbee/00edec18d88d9e87c6f92cb52e62c34e)
+
+Consider the folder structure and the sample DHCP IP allocation log shown above. We can define a `CREATE TABLE` query as follows.
+
+[https://gist.github.com/thecloudbee/a0337c1a3c2c08ad0ccb29198ca1fabe](https://gist.github.com/thecloudbee/a0337c1a3c2c08ad0ccb29198ca1fabe)
+
+`RegexSerDe` uses a regular expression (regex) to serialize/deserialize the data; each capturing group in the regex maps to one column of the table.
+
+Finally, we can create partitions using the query below.
+
+[https://gist.github.com/thecloudbee/af6486d4fd161124d558b39c33bd62fc](https://gist.github.com/thecloudbee/af6486d4fd161124d558b39c33bd62fc)
+
+## Querying the Data
+
+Let's write a query to get all the IPs assigned to a MAC address on a given date.
+
+[https://gist.github.com/thecloudbee/f45a109f69022afed0b1d2252701db0a](https://gist.github.com/thecloudbee/f45a109f69022afed0b1d2252701db0a)
+
+This is a simple SQL-like query. Note that the `WHERE` clause includes the partition column `date`. This limits the data that has to be scanned to the firewall folder for the specified date.
+
+Athena also supports common SQL constructs such as JOIN and GROUP BY.
+
+## Conclusion
+
+Athena is a powerful yet simple tool that can help you query your data in S3. It might not be the fastest way to query big data, but it has the upside of being able to query raw data in place.
+
+Athena uses Presto behind the scenes, which is an interesting distributed query engine for big data.
+
+---
+
+*Originally published at [https://www.thecloudbee.blog](https://www.thecloudbee.blog/s3-data-in-athena/) on .*
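
As an illustration of the pricing argument in the post above, here is a minimal Python sketch comparing the cost of scanning a whole data set with the cost of scanning a single daily partition. The data volumes and the price per TiB are hypothetical inputs chosen for illustration, not figures from the post; check the current Athena pricing page for real rates.

```python
GIB = 1024 ** 3
TIB = 1024 ** 4


def scan_cost_usd(bytes_scanned: int, usd_per_tib: float) -> float:
    """Cost of one query under a pay-per-data-scanned model."""
    return bytes_scanned / TIB * usd_per_tib


# Hypothetical data set: 90 days of logs at 10 GiB per day.
daily_bytes = 10 * GIB
days_stored = 90
price = 5.0  # assumed USD per TiB scanned, for illustration only

# Without partitions, a query for one day still scans every day's data.
full_scan = scan_cost_usd(daily_bytes * days_stored, price)
# With day-wise partitions, the same query scans only one day's folder.
one_day = scan_cost_usd(daily_bytes, price)

print(f"full scan:     ${full_scan:.4f}")
print(f"one partition: ${one_day:.4f}")
print(f"ratio:         {full_scan / one_day:.0f}x")
```

Under these assumptions the saving scales linearly with the number of daily partitions the query can skip.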
diff --git a/README.md b/README.md
@@ -1 +1,17 @@
-# amrojsandhu.github.io
+# amrojsandhu.github.io
+
+## Build
+
+```shell script
+bundle exec jekyll server --watch  --incremental
+```
+
+## Test
+
+To preview future-dated and unpublished posts locally, add the following to `_config.yml`:
+
+```yaml
+# Filtering Content
+future              : true
+unpublished         : true
+```
diff --git a/_posts/2021-02-08-es-shard-sizes.md b/_posts/2021-02-08-es-shard-sizes.md
@@ -6,7 +6,7 @@ categories: elasticsearch
 image: assets/images/2021-02-08/head.png
 description: "Shards are the heart of Elasticsearch. This blog takes the understanding of shards further to link it with performance."
 featured: true
-hidden: false
+hidden: true
 rating: 0
 ---
 
@@ -19,7 +19,7 @@ The search power of Elasticsearch revolves around the shard size. An index has m
 
 > One shard is searched by a single thread.
 
-![2021-02-08/1.png]({{ site.baseurl }}/assets/images/2021-02-08/1.png){: .half-image }
+![2021-02-08/1.png]({{ site.baseurl }}/assets/images/2021-02-08/1.png){: .center-image }
 
 Both indexes store the same amount of data but with different shard sizes. The left index has smaller shards than the right one, so more per-shard results have to be aggregated for the left one. In the latter index, however, search time per shard is greater, as each shard stores double the amount of data.
 

diff --git a/_posts/2021-02-15-s3-data-in-athena.md b/_posts/2021-02-15-s3-data-in-athena.md
@@ -0,0 +1,57 @@
+---
+layout: post
+title:  "How to Smartly Query Your Data in S3 Using Athena?"
+author: amroj
+categories: [athena, awscloud]
+image: assets/images/2021-02-15/head.png
+description: "Make decisions using your big data stored in S3 without running ETL jobs. Save cost by SMARTLY partitioning the data."
+featured: true
+hidden: true
+rating: 0
+---
+
+## AWS Athena Pricing
+
+Athena is a powerful service built to query data in S3. It suits semi-structured and unstructured data. You pay only for the amount of data scanned during query execution.
+
+Since we pay for the amount of data scanned, we can reduce the cost by compressing the data, partitioning it, or storing it in a smart format.
+
+It goes without saying that big data stored in S3 should be compressed. Further, we can save the data in a columnar format such as Parquet, which makes Athena queries faster and cheaper. The drawback of the latter approach is that the data must have a defined schema.
+
+Lastly, partitions are the game-changers. Compare querying a data set that holds several months of data in one place with querying one that is partitioned by day: in the second case, a query for a single day scans only that day's data.
+
+## Partitioning Data in Athena
+
+In order to run a query on Athena, we must define a table with columns and partitions. Athena supports different types of data: semi-structured data (like JSON and XML) and unstructured data (like CSV, logs, and text with delimiters).
+
+It is ideal to save data day-wise in S3. This way we can partition on the date and query any subset while scanning less data.
+
+{% gist thecloudbee/d98a0883167e923c892db9d12d855764 %}
+
+{% gist thecloudbee/00edec18d88d9e87c6f92cb52e62c34e %}
+
+Consider the folder structure and the sample DHCP IP allocation log shown above. We can define a `CREATE TABLE` query as follows.
+
+{% gist thecloudbee/a0337c1a3c2c08ad0ccb29198ca1fabe %}
+
+`RegexSerDe` uses a regular expression (regex) to serialize/deserialize the data; each capturing group in the regex maps to one column of the table.
+
+Finally, we can create partitions using the query below.
+
+{% gist thecloudbee/af6486d4fd161124d558b39c33bd62fc %}
+
+## Querying the Data
+
+Let's write a query to get all the IPs assigned to a MAC address on a given date.
+
+{% gist thecloudbee/f45a109f69022afed0b1d2252701db0a %}
+
+This is a simple SQL-like query. Note that the `WHERE` clause includes the partition column `date`. This limits the data that has to be scanned to the firewall folder for the specified date.
+
+Athena also supports common SQL constructs such as JOIN and GROUP BY.
+
+## Conclusion
+
+Athena is a powerful yet simple tool that can help you query your data in S3. It might not be the fastest way to query big data, but it has the upside of being able to query raw data in place.
+
+Athena uses Presto behind the scenes, which is an interesting distributed query engine for big data.
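
The day-wise partitioning step described in the post above could be scripted roughly as follows. This is a sketch under assumptions: the table name `dhcp_logs`, the bucket `example-bucket`, and the `firewall/<YYYY-MM-DD>/` folder layout are hypothetical stand-ins, since the actual layout lives in the linked gists. The code only builds `ALTER TABLE ... ADD PARTITION` statements as strings and does not call Athena.

```python
from datetime import date, timedelta
from typing import List


def partition_ddl(table: str, base_uri: str, day: date) -> str:
    """Build a statement that registers one day's folder as a partition."""
    day_str = day.isoformat()  # e.g. "2021-02-15"
    location = f"{base_uri.rstrip('/')}/{day_str}/"
    # `date` is a reserved word in DDL, so it is quoted with backticks.
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (`date` = '{day_str}') LOCATION '{location}'"
    )


def daily_partitions(table: str, base_uri: str, start: date, days: int) -> List[str]:
    """One statement per consecutive day, beginning at `start`."""
    return [
        partition_ddl(table, base_uri, start + timedelta(days=i))
        for i in range(days)
    ]


# Hypothetical table and bucket names, for illustration only.
statements = daily_partitions(
    "dhcp_logs", "s3://example-bucket/firewall", date(2021, 2, 14), 2
)
for statement in statements:
    print(statement)
```

Generating the statements from a date range keeps the S3 folder name and the partition value in sync, which is what lets a `WHERE` clause on `date` limit the scan to a single folder.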
diff --git a/assets/images/2021-02-15/head.png b/assets/images/2021-02-15/head.png