# Introduction to AWS SDK for Pandas (awswrangler)

In this session, we'll explore how to use the **AWS SDK for Pandas**, also known as `awswrangler`, to simplify interactions with AWS services like S3. This library provides a higher-level interface compared to Boto3, making it easier to perform data engineering tasks directly in Python.

We'll start with a brief refresher on using Boto3 to interact with S3 and then move on to using `awswrangler` to achieve the same tasks more efficiently.

**If you get stuck, the best place to look is the documentation: https://aws-sdk-pandas.readthedocs.io/en/stable/**

## Installation and Setup

You can install awswrangler with `uv pip install`, `pip install`, or `conda install`:

```shell
uv pip install awswrangler
```
or
```shell
pip install awswrangler
```

Once installed, make sure you have created your `.env` file with the following information:
```
AWS_ACCESS_KEY_ID=<YOURKEY>
AWS_SECRET_ACCESS_KEY=<YOURSECRETKEY>
AWS_DEFAULT_REGION=us-east-2
```

If you have not installed python-dotenv yet, you can do so using:
```shell
uv pip install python-dotenv
```

In [2]:
# load your credentials
from dotenv import load_dotenv
load_dotenv()


True
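
The `True` above only indicates that python-dotenv loaded something; it does not confirm that all three variables are set. A minimal sketch of a sanity check using only the standard library. The helper name `missing_aws_vars` is hypothetical and is not part of awswrangler or python-dotenv:

```python
import os

# The three variables the .env file above is expected to define
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_DEFAULT_REGION")


def missing_aws_vars(env=None):
    """Return the required variable names that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_aws_vars()
if missing:
    print("Missing from the environment:", ", ".join(missing))
else:
    print("All AWS variables are set.")
```

Run it right after `load_dotenv()`. If anything is reported as missing, check the spelling of the keys in your `.env` file.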

## Getting to know awswrangler

In [3]:
# import awswrangler
import awswrangler as wr

Similar to pandas, awswrangler comes with many reader and writer functions, such as `wr.s3.read_csv` and `wr.s3.read_parquet`.

In [9]:
# test if you can read from a private bucket

df = wr.s3.read_csv('s3://techcatalyst-raw/stocks/GOOG.csv')
df.head()

|   | Date | Open | High | Low | Close | Volume |
|---|------|------|------|-----|-------|--------|
| 0 | 1/2/2025 16:00:00 | 191.49 | 193.2 | 188.71 | 190.63 | 17545162 |
| 1 | 1/3/2025 16:00:00 | 192.73 | 194.5 | 191.35 | 193.13 | 12874957 |
| 2 | 1/6/2025 16:00:00 | 195.15 | 199.56 | 195.06 | 197.96 | 19483323 |
| 3 | 1/7/2025 16:00:00 | 198.27 | 202.14 | 195.94 | 196.71 | 16966760 |
| 4 | 1/8/2025 16:00:00 | 193.95 | 197.64 | 193.75 | 195.39 | 14335341 |


In [10]:
# the returned object is actually a pandas DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 140 entries, 0 to 139
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Date    140 non-null    object 
 1   Open    140 non-null    float64
 2   High    140 non-null    float64
 3   Low     140 non-null    float64
 4   Close   140 non-null    float64
 5   Volume  140 non-null    int64  
dtypes: float64(4), int64(1), object(1)
memory usage: 6.7+ KB


You now have a pandas DataFrame loaded into memory.
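
The `df.info()` output shows `Date` stored as `object` (plain strings). If you need real timestamps, one option is to convert the column after loading. A minimal sketch, assuming every value follows the month/day/year layout seen in the sample rows; `parse_trade_date` and `with_parsed_dates` are hypothetical helper names:

```python
from datetime import datetime

# Assumed layout, based on sample values such as "1/2/2025 16:00:00"
DATE_FORMAT = "%m/%d/%Y %H:%M:%S"


def parse_trade_date(text):
    """Parse a single Date string into a datetime object."""
    return datetime.strptime(text, DATE_FORMAT)


def with_parsed_dates(df):
    """Return a copy of df whose Date column is datetime64 (requires pandas)."""
    import pandas as pd  # imported lazily; only needed for DataFrames

    out = df.copy()
    out["Date"] = pd.to_datetime(out["Date"], format=DATE_FORMAT)
    return out


print(parse_trade_date("1/2/2025 16:00:00"))
```

With the DataFrame from the cell above you would call `df = with_parsed_dates(df)` and then confirm the new dtype with `df.info()`.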

In AWS Glue, you worked with the Glue Catalog, Glue databases, and Glue tables. If you recall, you used a Glue Crawler to crawl the files in an S3 bucket; the crawler then populated the metadata in a Glue table inside a Glue database that you had to define manually.

Let's inspect which Glue databases are available using `awswrangler`:

In [None]:
databases = wr.catalog.databases()
print(databases)

           Database Description
0  awswrangler_test            
1             my_db            


These are databases I had created. You will need to create your own database now.

In [None]:
name = # ENTER YOUR NAME HERE
database_name = f"{name}_db"
wr.catalog.create_database(database_name)
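
If you rerun the cell above, the call fails because the database already exists. A small sketch of a rerun-safe version, assuming your awswrangler version supports the `exist_ok` argument of `wr.catalog.create_database` (check the documentation for your version). The helpers `make_database_name` and `ensure_database` are hypothetical names, and the name-cleaning rule (lowercase letters, digits, underscores) is a conservative assumption rather than an official requirement:

```python
import re


def make_database_name(name):
    """Build '<name>_db' using only lowercase letters, digits, and underscores."""
    cleaned = re.sub(r"[^a-z0-9]+", "_", name.strip().lower()).strip("_")
    if not cleaned:
        raise ValueError("name must contain at least one letter or digit")
    return f"{cleaned}_db"


def ensure_database(database_name):
    """Create the Glue database unless it already exists (requires AWS access)."""
    import awswrangler as wr  # imported lazily so the helper above works offline

    wr.catalog.create_database(database_name, exist_ok=True)


print(make_database_name("Tatwan"))
```

You would then call `ensure_database(make_database_name(name))` instead of `wr.catalog.create_database(database_name)`.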

In [7]:
# Confirm it was created 
wr.catalog.databases()

|   | Database | Description |
|---|----------|-------------|
| 0 | awswrangler_test |  |
| 1 | my_db |  |
| 2 | tatwan_db |  |


Inspect your database to see whether it contains any tables:

In [17]:
wr.catalog.tables(database=database_name)

|   | Database | Table | Description | TableType | Columns | Partitions |
|---|----------|-------|-------------|-----------|---------|------------|


awswrangler allows you to easily write your pandas DataFrame to S3 in several formats (e.g., CSV, Parquet, JSON). You can also specify a Glue database and table so that it writes the metadata to the Glue Catalog. Let's see how that works:

In [None]:
wr.s3.to_parquet(
    df=df,
    path=f"s3://techcatalyst-raw/{name}/", # write to the techcatalyst-raw bucket under your folder name (or it would create a new folder)
    dataset=True, 
    database=database_name, # the name of the database you just created in AWS Glue 
    table= #YOUR CODE, # pick a table name for example YOURNAME_STOCK
    mode='overwrite'
    )

{'paths': ['s3://techcatalyst-raw/tatwan/36385087362744ee918c24a23b82e878.snappy.parquet'],
 'partitions_values': {}}

The output above confirms that the Parquet file was written to the specified S3 bucket. Let's check whether a table was created in AWS Glue. Log in to the AWS console (web), change the region to us-east-2, and navigate to AWS Glue. Check that the database was created, then verify that the table exists inside that database.

We can also do this via awswrangler:

In [19]:
wr.catalog.tables(database=database_name)

|   | Database | Table | Description | TableType | Columns | Partitions |
|---|----------|-------|-------------|-----------|---------|------------|
| 0 | tatwan_db | tatwan_stock |  | EXTERNAL_TABLE | date, open, high, low, close, volume |  |


In [20]:
# You can also do a search for tables by name
wr.catalog.tables(name_contains="stock")

|   | Database | Table | Description | TableType | Columns | Partitions |
|---|----------|-------|-------------|-----------|---------|------------|
| 0 | tatwan_db | tatwan_stock |  | EXTERNAL_TABLE | date, open, high, low, close, volume |  |


To view the contents of a specific AWS Glue table with awswrangler, you read the data from the underlying data store (often S3) using the table's metadata in the Glue Catalog. Athena works the same way: it relies on Glue tables to locate and interpret the data.

In [24]:
df = wr.s3.read_parquet_table(database=database_name, table='TATWAN_STOCK')

# Display the DataFrame's first few rows
df.head()

|   | date | open | high | low | close | volume |
|---|------|------|------|-----|-------|--------|
| 0 | 1/2/2025 16:00:00 | 191.49 | 193.2 | 188.71 | 190.63 | 17545162 |
| 1 | 1/3/2025 16:00:00 | 192.73 | 194.5 | 191.35 | 193.13 | 12874957 |
| 2 | 1/6/2025 16:00:00 | 195.15 | 199.56 | 195.06 | 197.96 | 19483323 |
| 3 | 1/7/2025 16:00:00 | 198.27 | 202.14 | 195.94 | 196.71 | 16966760 |
| 4 | 1/8/2025 16:00:00 | 193.95 | 197.64 | 193.75 | 195.39 | 14335341 |
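
When a table has many columns, you often need only a few of them. `wr.s3.read_parquet_table` accepts a `columns` argument for that. A minimal sketch; the wrapper `read_table_columns` and its `reader` parameter are hypothetical and exist only so the helper can be exercised without AWS access:

```python
def read_table_columns(database, table, columns, reader=None):
    """Read only the requested columns of a Glue-cataloged Parquet table."""
    if reader is None:
        import awswrangler as wr  # imported lazily; requires AWS credentials

        reader = wr.s3.read_parquet_table
    # columns is converted to a list so tuples and other iterables work too
    return reader(database=database, table=table, columns=list(columns))
```

For the stock table above, `read_table_columns(database_name, "tatwan_stock", ["date", "close"])` would return a DataFrame with just those two columns.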


In [None]:
# you can request just the column types of a table using get_table_types
wr.catalog.get_table_types(database=database_name, table='TATWAN_STOCK')

{'date': 'string',
 'open': 'double',
 'high': 'double',
 'low': 'double',
 'close': 'double',
 'volume': 'bigint'}

In [None]:
# you can obtain the column types, partition flags, and any comments using the table method
wr.catalog.table(database=database_name, table='TATWAN_STOCK')

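|   | Column Name | Type | Partition | Comment |
|---|-------------|------|-----------|---------|
| 0 | date | string | False |  |
| 1 | open | double | False |  |
| 2 | high | double | False |  |
| 3 | low | double | False |  |
| 4 | close | double | False |  |
| 5 | volume | bigint | False |  |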


In [50]:
# Get the Glue metadata as a generator that contains dictionaries for all tables in your database
table_details = wr.catalog.get_tables(database=database_name)

next(table_details)

{'Name': 'tatwan_stock',
 'DatabaseName': 'tatwan_db',
 'Description': 'This is my stock table.',
 'CreateTime': datetime.datetime(2025, 7, 31, 14, 15, 56, tzinfo=tzlocal()),
 'UpdateTime': datetime.datetime(2025, 7, 31, 14, 31, 21, tzinfo=tzlocal()),
 'Retention': 0,
 'StorageDescriptor': {'Columns': [{'Name': 'date',
    'Type': 'string',
    'Comment': 'Trading Date'},
   {'Name': 'open', 'Type': 'double', 'Comment': 'Opening Price'},
   {'Name': 'high', 'Type': 'double'},
   {'Name': 'low', 'Type': 'double'},
   {'Name': 'close', 'Type': 'double', 'Comment': 'Closing Price'},
   {'Name': 'volume', 'Type': 'bigint'}],
  'Location': 's3://techcatalyst-raw/tatwan/',
  'InputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat',
  'OutputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat',
  'Compressed': True,
  'NumberOfBuckets': -1,
  'SerdeInfo': {'SerializationLibrary': 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe',
   'Para

Glue tables allow us to add additional metadata, such as a description of the data and comments about each column.

In [None]:
# Example adding additional metadata information 

desc = "This is my stock table."

param = {"source": "Google", "class": "stock"}

comments = {
    "Date": "Trading Date",
    "Open": "Opening Price",
    "Close": "Closing Price"
}

wr.s3.to_parquet(
    df=df,
    path=f"s3://techcatalyst-raw/{name}/",  # writes under your own folder name
    dataset=True,
    database=database_name,
    table=#YOURTABLE NAME,
    mode='overwrite',
    glue_table_settings=wr.typing.GlueTableSettings(description=desc, parameters=param, columns_comments=comments),
    )

{'paths': ['s3://techcatalyst-raw/tatwan/8cc73735d3b24af5b1ddfbaf692cf595.snappy.parquet'],
 'partitions_values': {}}
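
A typo in a key of the `comments` dictionary is easy to miss. The sketch below flags comment keys that do not match any column. It compares names case-insensitively, which is an assumption based on the outputs in this notebook (the `Date` comment ends up on the `date` column); `unmatched_comment_keys` is a hypothetical helper, not an awswrangler function:

```python
def unmatched_comment_keys(comments, columns):
    """Return the comment keys that match no column name (case-insensitive)."""
    known = {str(col).lower() for col in columns}
    return [key for key in comments if key.lower() not in known]


comments = {"Date": "Trading Date", "Open": "Opening Price", "Close": "Closing Price"}
columns = ["date", "open", "high", "low", "close", "volume"]
print(unmatched_comment_keys(comments, columns))
```

In the notebook you would pass `df.columns` as the second argument before calling `wr.s3.to_parquet`.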

In [None]:
# you can obtain the column types, partition flags, and any comments using the table method
wr.catalog.table(database=database_name, table=#YOURNAME_STOCK)

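|   | Column Name | Type | Partition | Comment |
|---|-------------|------|-----------|---------|
| 0 | date | string | False | Trading Date |
| 1 | open | double | False | Opening Price |
| 2 | high | double | False |  |
| 3 | low | double | False |  |
| 4 | close | double | False | Closing Price |
| 5 | volume | bigint | False |  |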


In [45]:
# you can get all tables in all databases
tables = wr.catalog.get_tables()
tables

<generator object get_tables at 0x790ac6b8c8e0>

`get_tables()` returns a `generator`, so you need to iterate over it to extract specific information.

In [71]:
# one quick way to view the content is converting into a list
tables = wr.catalog.get_tables()
list(tables)

[{'Name': 'my_table',
  'DatabaseName': 'my_db',
  'Description': 'This is my stock table.',
  'CreateTime': datetime.datetime(2025, 7, 31, 8, 41, 35, tzinfo=tzlocal()),
  'UpdateTime': datetime.datetime(2025, 7, 31, 8, 45, 16, tzinfo=tzlocal()),
  'Retention': 0,
  'StorageDescriptor': {'Columns': [{'Name': 'date',
     'Type': 'string',
     'Comment': 'Trading Date'},
    {'Name': 'open', 'Type': 'double', 'Comment': 'Opening Price'},
    {'Name': 'high', 'Type': 'double'},
    {'Name': 'low', 'Type': 'double'},
    {'Name': 'close', 'Type': 'double', 'Comment': 'Closing Price'},
    {'Name': 'volume', 'Type': 'bigint'}],
   'Location': 's3://techcatalyst-raw/tatwan/',
   'InputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat',
   'OutputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat',
   'Compressed': True,
   'NumberOfBuckets': -1,
   'SerdeInfo': {'SerializationLibrary': 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'

To get specific content, loop through the generator object. As the output above shows, each item is a dictionary of key-value pairs, some of which are themselves nested dictionaries.

In [72]:
tables = wr.catalog.get_tables()
for info in tables:
    print(info.get('Name'), info.get('DatabaseName'), info.get('TableType'))

my_table my_db EXTERNAL_TABLE
test my_db EXTERNAL_TABLE
tatwan_stock tatwan_db EXTERNAL_TABLE
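
Notice that every cell calls `wr.catalog.get_tables()` again: a generator can be consumed only once. A minimal sketch that collects what you need in a single pass; `summarize_tables` is a hypothetical helper, and the sample dictionary is a trimmed copy of the output shown earlier:

```python
def summarize_tables(tables):
    """Return one (database, table, column_count) tuple per Glue table dict."""
    summary = []
    for info in tables:
        columns = info.get("StorageDescriptor", {}).get("Columns", [])
        summary.append((info.get("DatabaseName"), info.get("Name"), len(columns)))
    return summary


# Trimmed sample standing in for the items yielded by wr.catalog.get_tables()
sample = [
    {
        "Name": "tatwan_stock",
        "DatabaseName": "tatwan_db",
        "StorageDescriptor": {
            "Columns": [
                {"Name": "date", "Type": "string"},
                {"Name": "open", "Type": "double"},
            ]
        },
    }
]

generator = iter(sample)
print(summarize_tables(generator))  # first pass yields the tables
print(summarize_tables(generator))  # second pass is empty: already consumed
```

With real data you would call `summarize_tables(wr.catalog.get_tables())` and keep the returned list for later use.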


In [None]:
# to get column metadata from the generator you will need to go inside the `StorageDescriptor` key
tables = wr.catalog.get_tables()
for info in tables:
    print(info.get('Name'), info.get('StorageDescriptor'))
# you can observe that we have a dictionary within a dictionary

my_table {'Columns': [{'Name': 'date', 'Type': 'string', 'Comment': 'Trading Date'}, {'Name': 'open', 'Type': 'double', 'Comment': 'Opening Price'}, {'Name': 'high', 'Type': 'double'}, {'Name': 'low', 'Type': 'double'}, {'Name': 'close', 'Type': 'double', 'Comment': 'Closing Price'}, {'Name': 'volume', 'Type': 'bigint'}], 'Location': 's3://techcatalyst-raw/tatwan/', 'InputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat', 'OutputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat', 'Compressed': True, 'NumberOfBuckets': -1, 'SerdeInfo': {'SerializationLibrary': 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe', 'Parameters': {'serialization.format': '1'}}, 'BucketColumns': [], 'SortColumns': [], 'Parameters': {'CrawlerSchemaDeserializerVersion': '1.0', 'compressionType': 'snappy', 'classification': 'parquet', 'typeOfData': 'file'}, 'StoredAsSubDirectories': False}
test {'Columns': [{'Name': 'date', 'Type': 'string'}, {'Name': 'ope

In [64]:
# to get column metadata from the generator you will need to go inside the `StorageDescriptor` key
tables = wr.catalog.get_tables()

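for info in tables:
    for cols in info['StorageDescriptor']['Columns']:
        # each column dict has 'Name', 'Type', and optionally 'Comment';
        # 'Columns' is not one of its keys, so the last value prints None
        # (use cols.get('Comment') to see the column comments instead)
        print(cols.get('Name'), cols.get('Type'), cols.get('Columns'))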

date string None
open double None
high double None
low double None
close double None
volume bigint None
date string None
open double None
high double None
low double None
close double None
volume bigint None
date string None
open double None
high double None
low double None
close double None
volume bigint None
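
To see the column comments rather than `None`, read the `Comment` key of each column dictionary. A minimal sketch that gathers them per table; `column_comments` is a hypothetical helper, and the sample is a trimmed copy of the `tatwan_stock` metadata shown earlier:

```python
def column_comments(tables):
    """Map 'database.table' to {column_name: comment} for commented columns."""
    result = {}
    for info in tables:
        key = f"{info['DatabaseName']}.{info['Name']}"
        columns = info.get("StorageDescriptor", {}).get("Columns", [])
        # columns without a comment simply have no 'Comment' key
        result[key] = {c["Name"]: c["Comment"] for c in columns if c.get("Comment")}
    return result


# Trimmed sample standing in for the items yielded by wr.catalog.get_tables()
sample_tables = [
    {
        "Name": "tatwan_stock",
        "DatabaseName": "tatwan_db",
        "StorageDescriptor": {
            "Columns": [
                {"Name": "date", "Type": "string", "Comment": "Trading Date"},
                {"Name": "open", "Type": "double", "Comment": "Opening Price"},
                {"Name": "high", "Type": "double"},
                {"Name": "low", "Type": "double"},
                {"Name": "close", "Type": "double", "Comment": "Closing Price"},
                {"Name": "volume", "Type": "bigint"},
            ]
        },
    }
]

print(column_comments(sample_tables))
```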


To list all of the buckets available, use `wr.s3.list_buckets()`:

In [7]:
wr.s3.list_buckets()

['amsu-aws-s3-assessment',
 'amsu-s3-cf-website',
 'andrea-lm-code-bucket',
 'apiproject-build-bucket-su05297',
 'austin-lambda-s3',
 'aws-athena-query-results-us-east-1-535146832369',
 'aws-athena-query-results-us-east-1-535146832369nh',
 'aws-athena-query-results-us-east-1-blakeliu',
 'aws-athena-query-results-us-east-1-etl-aa',
 'aws-athena-query-results-us-east-2-535146832369',
 'aws-athena-query-results-us-east-2-535146832369-kb',
 'aws-cloudtrail-logs-535146832369-01f575cd',
 'aws-cloudtrail-logs-535146832369-9d19bcae',
 'aws-logs-535146832369-us-east-1',
 'aws-sam-cli-managed-default-samclisourcebucket-1oz2tqbsdli3',
 'aws-sam-cli-managed-default-samclisourcebucket-mgdplk7mifi7',
 'aws-sam-cli-managed-default-samclisourcebucket-t2x2wb9il3p1',
 'aws-sam-cli-managed-default-samclisourcebucket-yh6y67cvomwp',
 'axel-di-student14',
 'axel-di-student2',
 'axel-di-student25',
 'axel-di-student8',
 'axel-di-team2',
 'backend-lambda-java-test',
 'barb-lm-code-bucket',
 'bored-api-reactjs

To list all objects under a specific bucket or prefix, use `wr.s3.list_objects()`:

In [93]:
wr.s3.list_objects('s3://techcatalyst-raw/stage/')

['s3://techcatalyst-raw/stage/yellow_tripdata.csv',
 's3://techcatalyst-raw/stage/yellow_tripdata.json',
 's3://techcatalyst-raw/stage/yellow_tripdata.parquet']
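
Since `list_objects` returns a plain Python list of paths, you can post-process it with ordinary Python. A minimal sketch that groups the paths by file extension; `group_by_extension` is a hypothetical helper. I believe `wr.s3.list_objects` also accepts a `suffix` argument for filtering on the AWS side, but check the documentation for your version:

```python
import posixpath
from collections import defaultdict


def group_by_extension(paths):
    """Group S3 paths by their lowercase file extension."""
    groups = defaultdict(list)
    for path in paths:
        _, ext = posixpath.splitext(path)
        groups[ext.lower() or "(none)"].append(path)
    return dict(groups)


# The listing returned by the cell above
listed = [
    "s3://techcatalyst-raw/stage/yellow_tripdata.csv",
    "s3://techcatalyst-raw/stage/yellow_tripdata.json",
    "s3://techcatalyst-raw/stage/yellow_tripdata.parquet",
]

for ext, items in group_by_extension(listed).items():
    print(ext, len(items))
```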

To download a specific file to your local machine, use `wr.s3.download()`:

In [77]:
wr.s3.download(path='s3://techcatalyst-raw/stocks/GOOG.csv', local_file='./new_file.csv')

To upload a file from your local machine to S3, use `wr.s3.upload()`:

In [None]:
your_name = # ENTER A NAME
file_name = #ENTER YOUR FILE NAME YOU WANT TO UPLOAD
wr.s3.upload(local_file='new_file.csv', path=f's3://techcatalyst-raw/{your_name}/uploads/{file_name}')

Let's double-check that it was uploaded:

In [98]:
wr.s3.list_objects(f's3://techcatalyst-raw/{your_name}/uploads/')

['s3://techcatalyst-raw/tatwan/uploads/test.csv']
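
Instead of listing the folder, you can check for the single object you just uploaded with `wr.s3.does_object_exist`. A minimal sketch; `build_upload_path` and `upload_and_verify` are hypothetical helpers, and the `uploads` folder layout simply mirrors the path used above:

```python
def build_upload_path(bucket, user, file_name):
    """Build s3://<bucket>/<user>/uploads/<file_name> without doubled slashes."""
    parts = [bucket.strip("/"), user.strip("/"), "uploads", file_name.strip("/")]
    if not all(parts):
        raise ValueError("bucket, user, and file_name must be non-empty")
    return "s3://" + "/".join(parts)


def upload_and_verify(local_file, s3_path):
    """Upload a local file and report whether it now exists in S3."""
    import awswrangler as wr  # imported lazily; requires AWS credentials

    wr.s3.upload(local_file=local_file, path=s3_path)
    return wr.s3.does_object_exist(s3_path)


print(build_upload_path("techcatalyst-raw", "tatwan", "test.csv"))
```

In the notebook, `upload_and_verify('new_file.csv', build_upload_path('techcatalyst-raw', your_name, file_name))` should return `True` once the upload succeeds.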

## Exercise (using Glue Catalog and Athena)

You can execute an Athena query directly using the `wr.athena.read_sql_query()` function. This function runs your SQL query and returns the results as a pandas DataFrame.

In [None]:
# --- 1. Define Configuration ---

# create a new Glue Database it should be YOURNAME_TAXI
db_name =  # YOUR CODE

# Create a table it should be YOURNAME_TRIPDATA
table_name =  #YOUR CODE

# s3_path_directory should point to the directory in the bucket that holds the file (the main directory)
s3_path_directory = #YOUR CODE

# s3_path_file should be the full path to the actual file
s3_path_file = # YOUR CODE

# --- 2. Get the Schema from the Parquet File Metadata ---

# if you ran into issues, the line below cleans things up so you can rerun the cell
wr.catalog.delete_table_if_exists(database=db_name, table=table_name)

# Create the new Glue database first based on the db_name you created
# YOUR CODE

# This function extracts the schema from our file and returns a tuple: (columns_types, partitions_types). The columns_types dictionary is the schema we need.
columns_types, partitions_types = wr.s3.read_parquet_metadata(path=s3_path_file)
print("Successfully read schema from Parquet file.")

Successfully read schema from Parquet file.


Inspect the result and make sure the column types are what you expect:

In [40]:
columns_types

{'VendorID': 'int',
 'tpep_pickup_datetime': 'timestamp',
 'tpep_dropoff_datetime': 'timestamp',
 'passenger_count': 'double',
 'trip_distance': 'double',
 'RatecodeID': 'double',
 'store_and_fwd_flag': 'string',
 'PULocationID': 'int',
 'DOLocationID': 'int',
 'payment_type': 'bigint',
 'fare_amount': 'double',
 'extra': 'double',
 'mta_tax': 'double',
 'tip_amount': 'double',
 'tolls_amount': 'double',
 'improvement_surcharge': 'double',
 'total_amount': 'double',
 'congestion_surcharge': 'double',
 'Airport_fee': 'double'}
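
With 19 columns, checking the types by eye is error-prone. A minimal sketch that compares the dictionary against the types you expect and reports only the differences; `schema_mismatches` is a hypothetical helper, and the `expected` dictionary below is made up purely for illustration:

```python
def schema_mismatches(actual, expected):
    """Return {column: (expected_type, actual_type)} for columns that differ.

    actual_type is None when the column is missing from the actual schema.
    """
    return {
        col: (want, actual.get(col))
        for col, want in expected.items()
        if actual.get(col) != want
    }


# A few entries copied from the columns_types output above
columns_types_sample = {
    "VendorID": "int",
    "passenger_count": "double",
    "payment_type": "bigint",
}

# Hypothetical expectations, for illustration only
expected = {"VendorID": "int", "passenger_count": "bigint", "tip_amount": "double"}

print(schema_mismatches(columns_types_sample, expected))
```

An empty dictionary means the schema matches your expectations; in the notebook you would pass the full `columns_types` dictionary as the first argument.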

In [None]:
# --- 3. Create the Glue Table with the Explicit Schema ---
wr.catalog.create_parquet_table(
    database= , # pass the database name
    table= , # pass the table name
    path= , # use the directory here 
    columns_types=  ,  # Pass the schema here
    partitions_types= # pass the partition types
)
print(f"Table '{table_name}' created successfully in database '{db_name}'.")

Table 'TATWAN_TRIPDATA' created successfully in database 'TATWAN_TAXI'.


In [None]:
# --- 4. Now you can query it ---
query = f"SELECT * FROM {table_name} LIMIT 5"

# you will use read_sql_query from athena from awswrangler
# https://aws-sdk-pandas.readthedocs.io/en/3.12.1/stubs/awswrangler.athena.read_sql_query.html
df = # YOUR CODE 
print("\nQuery Results:")
print(df)


Query Results:
   vendorid tpep_pickup_datetime tpep_dropoff_datetime  passenger_count  \
0         1  2024-04-01 00:02:40   2024-04-01 00:30:42              0.0   
1         2  2024-04-01 00:56:02   2024-04-01 01:05:09              1.0   
2         1  2024-04-01 00:08:32   2024-04-01 00:10:24              1.0   
3         2  2024-04-01 00:41:12   2024-04-01 00:55:29              1.0   
4         2  2024-04-01 00:48:42   2024-04-01 01:05:30              1.0   

   trip_distance  ratecodeid store_and_fwd_flag  pulocationid  dolocationid  \
0           5.20         1.0                  N           161             7   
1           1.06         1.0                  N           137           164   
2           0.70         1.0                  N           236           263   
3           5.60         1.0                  N           264           264   
4           3.55         1.0                  N           186           236   

   payment_type  fare_amount  extra  mta_tax  tip_amount  

## Summary
Hopefully by now you have come to appreciate awswrangler.

At its core, it acts as a high-level bridge, extending the power of the popular Pandas library to the AWS cloud environment. Its primary goal is to eliminate the need for extensive "boilerplate" code that data scientists and engineers would otherwise have to write using lower-level libraries like Boto3 to interact with services like S3, Redshift, and Athena.

__It is an official AWS project and is part of the AWS Professional Services portfolio.__

__The Problem It Solves__

Imagine you are a data scientist or data engineer working in Python. Your workflow often looks like this:

1. __Get Data__: You need to read a large CSV or Parquet file from an Amazon S3 bucket into a Pandas DataFrame.
2. __Analyze & Transform__: You perform your analysis, clean the data, and generate new features using Pandas.
3. __Store Results__: You need to save your transformed DataFrame back to S3 or load it into a data warehouse like Amazon Redshift for others to use.

Without AWS Wrangler, this process involves many manual steps:
* Using `Boto3` to connect to S3.
* Downloading the file locally or streaming it into memory.
* Parsing the file format (e.g., CSV, Parquet) into a DataFrame.
* For writing, you'd have to serialize the DataFrame into a specific format (like a CSV string or Parquet file).
* Then, use Boto3 again to upload this new file to S3.
* All of this requires handling AWS credentials, sessions, and potential errors.

__AWS Wrangler simplifies this entire workflow into single-line function calls.__
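
To make the contrast concrete, here is a hedged side-by-side sketch of reading one CSV file both ways. The function names are hypothetical, error handling is omitted, and the Boto3 version is only one of several ways to do this:

```python
import io
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc or not parsed.path.strip("/"):
        raise ValueError(f"not a valid s3://bucket/key URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def read_csv_with_boto3(uri):
    """Manual route: fetch the object's bytes, then parse them with pandas."""
    import boto3
    import pandas as pd

    bucket, key = split_s3_uri(uri)
    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()
    return pd.read_csv(io.BytesIO(body))


def read_csv_with_awswrangler(uri):
    """awswrangler route: a single call that returns a DataFrame."""
    import awswrangler as wr

    return wr.s3.read_csv(uri)


print(split_s3_uri("s3://techcatalyst-raw/stocks/GOOG.csv"))
```

Both readers should return the same DataFrame for the same file; the awswrangler version just hides the URI parsing, the download, and the byte handling.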