# EDA Using Athena
Because the full dataset is too large to load into a pandas DataFrame, we use AWS Athena to explore it.

Docs References: https://docs.aws.amazon.com/athena/latest/ug/what-is.html

Athena integrates with the AWS Glue Data Catalog, which offers a persistent metadata store for your data in Amazon S3. This allows you to create tables and query data in Athena based on a central metadata store available throughout your Amazon Web Services account and integrated with the ETL and data discovery features of AWS Glue.

#### Error:
- **The Parquet files do not share a single schema: a few columns have different data types in different files. A SQL query raises an error if it selects or processes a column whose data type differs between files.**
- **You need to write a Glue ETL job that resolves these type mismatches and cleans the data; after that, you will be able to analyse the data with Athena.**
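
One way to see which columns a Glue job would have to fix is to compare the per-file schemas before querying. The sketch below is a minimal illustration, not the method used in this notebook: `find_conflicting_columns` is a hypothetical helper that takes a mapping of file name to `{column: dtype}` and returns the columns whose type differs between files. The file names and types in the usage lines are invented for illustration; in practice the mapping could be filled from the Parquet metadata of each file (for example with `wr.s3.read_parquet_metadata`, which is assumed here rather than shown).

```python
from collections import defaultdict


def find_conflicting_columns(schemas):
    """Return {column: {dtype: [files]}} for columns whose dtype differs between files.

    `schemas` maps a file name to a {column_name: dtype_string} dict.
    """
    seen = defaultdict(lambda: defaultdict(list))
    for file_name, columns in schemas.items():
        for column, dtype in columns.items():
            seen[column][dtype].append(file_name)
    # Keep only the columns that were observed with more than one dtype.
    return {
        column: {dtype: sorted(files) for dtype, files in by_type.items()}
        for column, by_type in seen.items()
        if len(by_type) > 1
    }


# Hypothetical per-file schemas, for illustration only.
example_schemas = {
    "file_a.parquet": {"vendorid": "bigint", "airport_fee": "double"},
    "file_b.parquet": {"vendorid": "int", "airport_fee": "double"},
}
conflicts = find_conflicting_columns(example_schemas)
print(conflicts)
```

Columns that appear in the result are the ones an ETL job would need to cast to a single type.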

#### Notes:
- If you have moved the S3 data/files to a new location, Athena queries will not return any results. Set the new location by running `ALTER TABLE table_name SET LOCATION 's3://uri/'` with wr.athena.start_query_execution.
- Use **wr.athena.read_sql_query** when you expect results from the query.
- Use **wr.athena.start_query_execution** when you want to run DDL (statements that define or modify database structure, such as tables and schemas).
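
The notes above can be sketched as two small helpers. This is a hedged illustration under assumptions: `build_set_location_sql` and `is_ddl` are hypothetical names, not part of awswrangler, and the keyword list used to recognise DDL is a simplification. The first helper rejects malformed URIs (an S3 URI must start with `s3://`); the second suggests which of the two awswrangler calls fits a given statement.

```python
DDL_KEYWORDS = ("ALTER", "CREATE", "DROP", "MSCK")  # simplified, not exhaustive


def build_set_location_sql(table_name, s3_uri):
    """Build an ALTER TABLE ... SET LOCATION statement for a moved dataset."""
    if not s3_uri.startswith("s3://"):
        raise ValueError(f"Expected an S3 URI starting with 's3://', got: {s3_uri!r}")
    return f"ALTER TABLE {table_name} SET LOCATION '{s3_uri}'"


def is_ddl(sql):
    """Return True if the statement starts with one of the DDL keywords above."""
    stripped = sql.strip()
    first_word = stripped.split(None, 1)[0].upper() if stripped else ""
    return first_word in DDL_KEYWORDS


# Hypothetical bucket and prefix, for illustration only.
ddl = build_set_location_sql("raw_data", "s3://example-bucket/raw_data/")
print(ddl)
print(is_ddl(ddl))
print(is_ddl("select * from raw_data limit 10"))
```

A statement for which `is_ddl` returns `True` would go to `wr.athena.start_query_execution`; a statement that returns rows would go to `wr.athena.read_sql_query`.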

In [1]:
!pip install awswrangler



In [2]:
import pandas as pd
import boto3
import awswrangler as wr
import sagemaker



sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml


In [3]:
boto_session = boto3.session.Session()
s3_client = boto_session.client('s3')

In [None]:
sagemaker.get_execution_role()

## Change the location of s3 data if you have moved it.

In [None]:
query = """
    ALTER TABLE raw_data
    SET LOCATION 's3://sagemaker-us-east-1-205930620783/NYC_Taxi_Prediction/data/raw_data/'
"""
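# start_query_execution returns the query execution ID (a string), not a DataFrame
query_execution_id = wr.athena.start_query_execution(sql=query, database='nyc_taxi_data')
query_execution_id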

### Get sample of 10 rows

In [13]:
query = "select * from raw_data limit 10"
df = wr.athena.read_sql_query(sql=query, database='nyc_taxi_data')
df

### Get By Year Summary

In [14]:
# Amounts are divided by 1,000,000, so they are reported in millions
query = """
    SELECT 
        year(tpep_pickup_datetime) AS year, 
        COUNT(*) as total_trips, 
        ROUND(SUM(total_amount) / 1000000, 0) as total_amount,
        ROUND(SUM(fare_amount) / 1000000, 0) as fare_amount, 
        ROUND(SUM(tip_amount) / 1000000, 0) as tip_amount
    FROM raw_data 
    GROUP BY year(tpep_pickup_datetime)
    ORDER BY 1
"""
df = wr.athena.read_sql_query(sql=query, database='nyc_taxi_data')
df

2025-07-03 06:49:15,602	INFO worker.py:1821 -- Started a local Ray instance.


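| | year | total_trips | total_amount | fare_amount | tip_amount |
|---|---|---|---|---|---|
| 0 | 2001 | 15 | 0.0 | 0.0 | 0.0 |
| 1 | 2002 | 478 | 0.0 | 0.0 | 0.0 |
| 2 | 2003 | 33 | 0.0 | 0.0 | 0.0 |
| 3 | 2004 | 1 | 0.0 | 0.0 | 0.0 |
| 4 | 2007 | 1 | 0.0 | 0.0 | 0.0 |
| 5 | 2008 | 366 | 0.0 | 0.0 | 0.0 |
| 6 | 2009 | 744 | 0.0 | 0.0 | 0.0 |
| 7 | 2010 | 1 | 0.0 | 0.0 | 0.0 |
| 8 | 2011 | 4 | 0.0 | 0.0 | 0.0 |
| 9 | 2012 | 1 | 0.0 | 0.0 | 0.0 |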


In [38]:
### Number of Missing Data
columns = ['vendorid','tpep_pickup_datetime','tpep_dropoff_datetime','passenger_count','trip_distance','ratecodeid','store_and_fwd_flag','pulocationid','dolocationid','payment_type','fare_amount','extra','mta_tax','tip_amount','tolls_amount','improvement_surcharge','total_amount','congestion_surcharge','airport_fee','cbd_congestion_fee']	


'COUNT(vendorid) AS vendorid , COUNT(tpep_pickup_datetime) AS tpep_pickup_datetime , COUNT(tpep_dropoff_datetime) AS tpep_dropoff_datetime , COUNT(passenger_count) AS passenger_count , COUNT(trip_distance) AS trip_distance , COUNT(ratecodeid) AS ratecodeid , COUNT(store_and_fwd_flag) AS store_and_fwd_flag , COUNT(pulocationid) AS pulocationid , COUNT(dolocationid) AS dolocationid , COUNT(payment_type) AS payment_type , COUNT(fare_amount) AS fare_amount , COUNT(extra) AS extra , COUNT(mta_tax) AS mta_tax , COUNT(tip_amount) AS tip_amount , COUNT(tolls_amount) AS tolls_amount , COUNT(improvement_surcharge) AS improvement_surcharge , COUNT(total_amount) AS total_amount , COUNT(congestion_surcharge) AS congestion_surcharge , COUNT(airport_fee) AS airport_fee , COUNT(cbd_congestion_fee) AS cbd_congestion_fee'
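
Since this section is about missing data, a variant that reports the share of NULLs per column directly may be easier to read. The sketch below only builds the SQL string; `build_missing_pct_query` is a hypothetical helper, and it assumes that Athena performs integer division on two `bigint` operands, which is why the numerator is multiplied by `100.0`. `COUNT(col)` counts non-null values, so `COUNT(*) - COUNT(col)` is the number of missing ones.

```python
def build_missing_pct_query(columns, table):
    """Build a query returning the percentage of NULL values for each column."""
    expressions = [
        # COUNT(*) - COUNT(col) is the number of rows where col is NULL.
        f"ROUND(100.0 * (COUNT(*) - COUNT({col})) / COUNT(*), 2) AS {col}_missing_pct"
        for col in columns
    ]
    select_list = ",\n        ".join(["COUNT(*) AS total_rows"] + expressions)
    return f"SELECT\n        {select_list}\n    FROM {table}"


# A short subset of the column list above keeps the printed query readable.
sample_query = build_missing_pct_query(["vendorid", "passenger_count"], "raw_data")
print(sample_query)
```

The resulting string could be passed to `wr.athena.read_sql_query(sql=sample_query, database='nyc_taxi_data')` in the same way as the other queries in this notebook.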

In [44]:
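# Note: each column below is the percentage of rows that are NOT null.
# COUNT returns a bigint, so the division is integer division: the percentage
# is truncated and ROUND has no further effect.
query = f"""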
    SELECT 
        COUNT(*) as total_rows,
        {' , '.join(['ROUND(100*COUNT({}) / COUNT(*)) AS {}'.format(col, col) for col in columns])}
    FROM raw_data
"""
print(query)
df = wr.athena.read_sql_query(sql=query, database='nyc_taxi_data')
df


    SELECT 
        COUNT(*) as total_rows,
        ROUND(100*COUNT(vendorid) / COUNT(*)) AS vendorid , ROUND(100*COUNT(tpep_pickup_datetime) / COUNT(*)) AS tpep_pickup_datetime , ROUND(100*COUNT(tpep_dropoff_datetime) / COUNT(*)) AS tpep_dropoff_datetime , ROUND(100*COUNT(passenger_count) / COUNT(*)) AS passenger_count , ROUND(100*COUNT(trip_distance) / COUNT(*)) AS trip_distance , ROUND(100*COUNT(ratecodeid) / COUNT(*)) AS ratecodeid , ROUND(100*COUNT(store_and_fwd_flag) / COUNT(*)) AS store_and_fwd_flag , ROUND(100*COUNT(pulocationid) / COUNT(*)) AS pulocationid , ROUND(100*COUNT(dolocationid) / COUNT(*)) AS dolocationid , ROUND(100*COUNT(payment_type) / COUNT(*)) AS payment_type , ROUND(100*COUNT(fare_amount) / COUNT(*)) AS fare_amount , ROUND(100*COUNT(extra) / COUNT(*)) AS extra , ROUND(100*COUNT(mta_tax) / COUNT(*)) AS mta_tax , ROUND(100*COUNT(tip_amount) / COUNT(*)) AS tip_amount , ROUND(100*COUNT(tolls_amount) / COUNT(*)) AS tolls_amount , ROUND(100*COUNT(improvement_surchar