## Intro

[Amazon S3](https://aws.amazon.com/s3/getting-started/) is the object storage service for AWS. Many types of files, such as CSV, Parquet, and text files, can be stored in [S3 buckets](https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingBucket.html). You can think of an S3 bucket as a container that holds files.

There are many times when you need to read a file from S3 and analyze it locally. In this post I will cover how to create a [Pandas](https://github.com/pandas-dev/pandas) dataframe from a CSV file stored in S3.

---

In this example, I have a bucket called ```s3-demo-bucket-09132021``` and a csv file in the bucket called ```supermarket_sales.csv```.

![File in s3 bucket details](../../static/img/aws_s3_pandas_read_09132021/file_in_s3_bucket.png "File in s3 bucket details")

I will show two ways of accessing the file: [boto3](https://github.com/boto/boto3) and [smart_open](https://github.com/RaRe-Technologies/smart_open).

---

## boto3

[Boto3](https://github.com/boto/boto3) is the official AWS SDK for Python. Install it with pip:

```pip install boto3```

In [1]:
import pandas as pd
import boto3

Get your AWS access and secret keys from the [IAM console](https://console.aws.amazon.com/iamv2/home#/users).

We are hardcoding the keys here for simplicity. Later on I will show a better alternative.

In [2]:
AWS_ACCESS_KEY_ID = 'my_id'
AWS_SECRET_ACCESS_KEY = 'my_secret'

Create a [```boto3 client```](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/clients.html) with your access and secret keys. The client allows us to read and write files in S3 buckets.

In [3]:
s3 = boto3.client(
    's3',
    aws_access_key_id=AWS_ACCESS_KEY_ID,
    aws_secret_access_key=AWS_SECRET_ACCESS_KEY)

Retrieve the file object with the [```get_object```](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.get_object) function. You need to specify the bucket that contains the CSV file and the key of the file. The key is the object's full name within the bucket, including any folder prefix; for a file at the top level of the bucket it is just the file name. You can confirm the key by looking at the file details in the S3 console.

In [4]:
bucket = 's3-demo-bucket-09132021'
key = 'supermarket_sales.csv'

obj = s3.get_object(Bucket=bucket, Key=key)
obj

{'ResponseMetadata': {'RequestId': '9SWH9APKY10DGJ2M',
  'HostId': 'O+GQgt/2Jix2U3TG/Hl+3kjdFsTHu1Tg0yCPmd1vfzJW6EefpzK0aviq9ICt6jOTmbrUDhQzNLY=',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amz-id-2': 'O+GQgt/2Jix2U3TG/Hl+3kjdFsTHu1Tg0yCPmd1vfzJW6EefpzK0aviq9ICt6jOTmbrUDhQzNLY=',
   'x-amz-request-id': '9SWH9APKY10DGJ2M',
   'date': 'Tue, 14 Sep 2021 03:23:38 GMT',
   'last-modified': 'Tue, 14 Sep 2021 00:57:44 GMT',
   'etag': '"3b4cf0aacbb90f82ea912d76d08a7702"',
   'accept-ranges': 'bytes',
   'content-type': 'text/csv',
   'server': 'AmazonS3',
   'content-length': '131528'},
  'RetryAttempts': 0},
 'AcceptRanges': 'bytes',
 'LastModified': datetime.datetime(2021, 9, 14, 0, 57, 44, tzinfo=tzutc()),
 'ContentLength': 131528,
 'ETag': '"3b4cf0aacbb90f82ea912d76d08a7702"',
 'ContentType': 'text/csv',
 'Metadata': {},
 'Body': <botocore.response.StreamingBody at 0x7fde778f7850>}
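The response shown above is an ordinary Python dictionary, so its metadata can be checked before the body is read. The sketch below is a minimal illustration and not part of the original notebook: ```summarize_response``` is a hypothetical helper, and the dictionary passed to it is a hand-written stand-in that copies three values from the output above, so it runs without an AWS connection.

```python
def summarize_response(response):
    """Return a one-line summary of an S3 get_object response dict."""
    status = response['ResponseMetadata']['HTTPStatusCode']
    size_kib = response['ContentLength'] / 1024
    return f"status={status} type={response['ContentType']} size={size_kib:.1f} KiB"


# Stand-in for the real response: only the fields used above, copied from the output.
sample_response = {
    'ResponseMetadata': {'HTTPStatusCode': 200},
    'ContentLength': 131528,
    'ContentType': 'text/csv',
}

summary = summarize_response(sample_response)
print(summary)
```

With a live client, the same helper could be called on the ```obj``` returned by ```get_object```.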

The returned ```obj``` is a dictionary containing metadata about our CSV file. We are interested in the ```'Body'``` entry, which holds a stream of the file's contents. We can pass that stream directly into pandas's [```read_csv```](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) function to create a dataframe.

In [5]:
df = pd.read_csv(obj['Body'])
df.head()

| | Invoice ID | Branch | City | Customer type | Gender | Product line | Unit price | Quantity | Tax 5% | Total | Date | Time | Payment | cogs | gross margin percentage | gross income | Rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 750-67-8428 | A | Yangon | Member | Female | Health and beauty | 74.69 | 7 | 26.1415 | 548.9715 | 1/5/2019 | 13:08 | Ewallet | 522.83 | 4.761905 | 26.1415 | 9.1 |
| 1 | 226-31-3081 | C | Naypyitaw | Normal | Female | Electronic accessories | 15.28 | 5 | 3.82 | 80.22 | 3/8/2019 | 10:29 | Cash | 76.4 | 4.761905 | 3.82 | 9.6 |
| 2 | 631-41-3108 | A | Yangon | Normal | Male | Home and lifestyle | 46.33 | 7 | 16.2155 | 340.5255 | 3/3/2019 | 13:23 | Credit card | 324.31 | 4.761905 | 16.2155 | 7.4 |
| 3 | 123-19-1176 | A | Yangon | Member | Male | Health and beauty | 58.22 | 8 | 23.288 | 489.048 | 1/27/2019 | 20:33 | Ewallet | 465.76 | 4.761905 | 23.288 | 8.4 |
| 4 | 373-73-7910 | A | Yangon | Normal | Male | Sports and travel | 86.31 | 7 | 30.2085 | 634.3785 | 2/8/2019 | 10:37 | Ewallet | 604.17 | 4.761905 | 30.2085 | 5.3 |


We now have a local dataframe that we can explore!
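One caveat worth knowing: ```obj['Body']``` is a stream, and a stream is consumed as it is read, so a second ```pd.read_csv(obj['Body'])``` on the same response finds nothing left. The sketch below illustrates this with ```io.BytesIO``` standing in for the ```StreamingBody``` (an assumption made so the example runs offline; unlike ```BytesIO```, the real object cannot be rewound), and shows one way to keep the bytes around for reuse.

```python
import io

# Stand-in for obj['Body']: a small in-memory stream instead of a real S3 response.
body = io.BytesIO(b"Invoice ID,Branch\n750-67-8428,A\n")

first_read = body.read()   # consumes the whole stream
second_read = body.read()  # nothing is left to read

print(len(first_read), len(second_read))

# Keeping the raw bytes lets you build a fresh file-like object whenever needed,
# for example pd.read_csv(io.BytesIO(first_read)).
reusable = io.BytesIO(first_read)
header = reusable.readline().decode().strip()
print(header)
```

If the file is small enough to hold in memory, reading the body once and wrapping the bytes this way avoids a second request to S3.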

---

### Optional:  Create a shared credential file

It is not safe to directly hardcode your AWS credentials in a program! 

One solution is to store the ```AWS_ACCESS_KEY_ID``` and ```AWS_SECRET_ACCESS_KEY``` in a [shared credential file](https://boto3.amazonaws.com/v1/documentation/api/1.9.42/guide/configuration.html#shared-credentials-file).

Create ```~/.aws/credentials``` with the contents:

```
[default]
aws_access_key_id=my_id
aws_secret_access_key=my_secret
```

You can now create a ```boto3 client``` without hardcoding the credentials. If you are using a Jupyter notebook, you may need to restart the kernel.

In [6]:
bucket = 's3-demo-bucket-09132021'
key = 'supermarket_sales.csv'

s3 = boto3.client('s3')
obj = s3.get_object(Bucket=bucket, Key=key)
df = pd.read_csv(obj['Body'])
df.head()

| | Invoice ID | Branch | City | Customer type | Gender | Product line | Unit price | Quantity | Tax 5% | Total | Date | Time | Payment | cogs | gross margin percentage | gross income | Rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 750-67-8428 | A | Yangon | Member | Female | Health and beauty | 74.69 | 7 | 26.1415 | 548.9715 | 1/5/2019 | 13:08 | Ewallet | 522.83 | 4.761905 | 26.1415 | 9.1 |
| 1 | 226-31-3081 | C | Naypyitaw | Normal | Female | Electronic accessories | 15.28 | 5 | 3.82 | 80.22 | 3/8/2019 | 10:29 | Cash | 76.4 | 4.761905 | 3.82 | 9.6 |
| 2 | 631-41-3108 | A | Yangon | Normal | Male | Home and lifestyle | 46.33 | 7 | 16.2155 | 340.5255 | 3/3/2019 | 13:23 | Credit card | 324.31 | 4.761905 | 16.2155 | 7.4 |
| 3 | 123-19-1176 | A | Yangon | Member | Male | Health and beauty | 58.22 | 8 | 23.288 | 489.048 | 1/27/2019 | 20:33 | Ewallet | 465.76 | 4.761905 | 23.288 | 8.4 |
| 4 | 373-73-7910 | A | Yangon | Normal | Male | Sports and travel | 86.31 | 7 | 30.2085 | 634.3785 | 2/8/2019 | 10:37 | Ewallet | 604.17 | 4.761905 | 30.2085 | 5.3 |



An alternative approach is to set the keys as [environment variables](https://boto3.amazonaws.com/v1/documentation/api/1.9.42/guide/configuration.html#environment-variables).
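As a rough sketch of the environment-variable route: boto3 looks for ```AWS_ACCESS_KEY_ID``` and ```AWS_SECRET_ACCESS_KEY``` in the process environment, so a quick check before creating the client can give a clearer error than a failed request. ```missing_aws_env_vars``` is a hypothetical helper written for this illustration, and the values used below are placeholders.

```python
import os

REQUIRED_VARS = ('AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY')


def missing_aws_env_vars(environ=os.environ):
    """Return the names of the required AWS variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]


# Placeholder values for the demo; in practice, export these in your shell
# rather than setting them from Python.
demo_env = {'AWS_ACCESS_KEY_ID': 'my_id'}
missing_before = missing_aws_env_vars(demo_env)
print(missing_before)

demo_env['AWS_SECRET_ACCESS_KEY'] = 'my_secret'
missing_after = missing_aws_env_vars(demo_env)
print(missing_after)
```

When the returned list is empty, ```boto3.client('s3')``` can be called with no key arguments, just as with the shared credential file.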

---

## smart_open

[smart_open](https://github.com/RaRe-Technologies/smart_open) is another way to read files from S3. More specifically, smart_open lets you read files from different storage sources such as S3, GCS, and Azure Blob Storage through one interface, which makes it a good fit if you work across several storage sources.

Since we are only working with S3 in this example, we can install the s3 dependencies with pip:

```pip install smart_open[s3]```

You can also install all optional dependencies if necessary:

```pip install smart_open[all]```


In [7]:
from smart_open import smart_open
import pandas as pd

Get your AWS access and secret keys from the [IAM console](https://console.aws.amazon.com/iamv2/home#/users).

We are hardcoding the keys here for simplicity. Later on I will show a better alternative.

In [8]:
AWS_ACCESS_KEY_ID = 'my_id'
AWS_SECRET_ACCESS_KEY = 'my_secret'

Create an S3 file path with the format: s3://AWS_ACCESS_KEY_ID:AWS_SECRET_ACCESS_KEY@bucket/key

In [9]:
bucket = 's3-demo-bucket-09132021'
key = 'supermarket_sales.csv'

path = f's3://{AWS_ACCESS_KEY_ID}:{AWS_SECRET_ACCESS_KEY}@{bucket}/{key}'

In this case, the ```smart_open``` function returns a Reader object, which can be passed to pandas's read_csv function.

In [10]:
f = smart_open(path)
print(type(f))
f

<class 'smart_open.s3.Reader'>


smart_open.s3.Reader(bucket='s3-demo-bucket-09132021', key='supermarket_sales.csv', version_id=None, buffer_size=131072, line_terminator=b'\n')

In [11]:
df = pd.read_csv(smart_open(path))
df.head()

| | Invoice ID | Branch | City | Customer type | Gender | Product line | Unit price | Quantity | Tax 5% | Total | Date | Time | Payment | cogs | gross margin percentage | gross income | Rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 750-67-8428 | A | Yangon | Member | Female | Health and beauty | 74.69 | 7 | 26.1415 | 548.9715 | 1/5/2019 | 13:08 | Ewallet | 522.83 | 4.761905 | 26.1415 | 9.1 |
| 1 | 226-31-3081 | C | Naypyitaw | Normal | Female | Electronic accessories | 15.28 | 5 | 3.82 | 80.22 | 3/8/2019 | 10:29 | Cash | 76.4 | 4.761905 | 3.82 | 9.6 |
| 2 | 631-41-3108 | A | Yangon | Normal | Male | Home and lifestyle | 46.33 | 7 | 16.2155 | 340.5255 | 3/3/2019 | 13:23 | Credit card | 324.31 | 4.761905 | 16.2155 | 7.4 |
| 3 | 123-19-1176 | A | Yangon | Member | Male | Health and beauty | 58.22 | 8 | 23.288 | 489.048 | 1/27/2019 | 20:33 | Ewallet | 465.76 | 4.761905 | 23.288 | 8.4 |
| 4 | 373-73-7910 | A | Yangon | Normal | Male | Sports and travel | 86.31 | 7 | 30.2085 | 634.3785 | 2/8/2019 | 10:37 | Ewallet | 604.17 | 4.761905 | 30.2085 | 5.3 |


---

### Optional: Use boto3 Session with smart_open

It is not safe to directly hardcode your AWS credentials in a program! 

One solution is to use a ```boto3 session```. This assumes that you followed the previous optional instructions and either created the [shared credential file](https://boto3.amazonaws.com/v1/documentation/api/1.9.42/guide/configuration.html#shared-credentials-file) or set [environment variables](https://boto3.amazonaws.com/v1/documentation/api/1.9.42/guide/configuration.html#environment-variables) for the AWS keys.

In [12]:
from smart_open import smart_open
import pandas as pd
import boto3

Create a ```boto3 session```. We already created the shared credential file, so there is no need to hardcode any keys.

In [13]:
session = boto3.Session()
client = session.client('s3')

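Create a path that contains only the bucket and key, then call pandas's read_csv. Because the path carries no keys, smart_open falls back to boto3's default credential lookup (the shared credential file or environment variables).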

In [14]:
bucket = 's3-demo-bucket-09132021'
key = 'supermarket_sales.csv'
path = f's3://{bucket}/{key}'


df = pd.read_csv(smart_open(path))
df.head()

| | Invoice ID | Branch | City | Customer type | Gender | Product line | Unit price | Quantity | Tax 5% | Total | Date | Time | Payment | cogs | gross margin percentage | gross income | Rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 750-67-8428 | A | Yangon | Member | Female | Health and beauty | 74.69 | 7 | 26.1415 | 548.9715 | 1/5/2019 | 13:08 | Ewallet | 522.83 | 4.761905 | 26.1415 | 9.1 |
| 1 | 226-31-3081 | C | Naypyitaw | Normal | Female | Electronic accessories | 15.28 | 5 | 3.82 | 80.22 | 3/8/2019 | 10:29 | Cash | 76.4 | 4.761905 | 3.82 | 9.6 |
| 2 | 631-41-3108 | A | Yangon | Normal | Male | Home and lifestyle | 46.33 | 7 | 16.2155 | 340.5255 | 3/3/2019 | 13:23 | Credit card | 324.31 | 4.761905 | 16.2155 | 7.4 |
| 3 | 123-19-1176 | A | Yangon | Member | Male | Health and beauty | 58.22 | 8 | 23.288 | 489.048 | 1/27/2019 | 20:33 | Ewallet | 465.76 | 4.761905 | 23.288 | 8.4 |
| 4 | 373-73-7910 | A | Yangon | Normal | Male | Sports and travel | 86.31 | 7 | 30.2085 | 634.3785 | 2/8/2019 | 10:37 | Ewallet | 604.17 | 4.761905 | 30.2085 | 5.3 |
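If you would rather hand smart_open the client from cell 13 explicitly, recent smart_open releases (5.0 and later, as far as I know) accept it through ```transport_params```. The sketch below assumes such a version and uses ```smart_open.open```, the non-deprecated counterpart of the ```smart_open``` function. ```s3_uri``` and ```read_s3_csv``` are hypothetical helper names, and the imports sit inside the function so the URI helper can be tried without the libraries installed.

```python
def s3_uri(bucket, key):
    """Build an S3 URI that carries no credentials."""
    return f's3://{bucket}/{key}'


def read_s3_csv(bucket, key, client):
    """Read a CSV from S3 into a dataframe using an explicit boto3 client."""
    import pandas as pd
    from smart_open import open as s3_open

    # Assumes smart_open >= 5.0, where the S3 client goes in transport_params.
    with s3_open(s3_uri(bucket, key), 'rb', transport_params={'client': client}) as f:
        return pd.read_csv(f)


uri = s3_uri('s3-demo-bucket-09132021', 'supermarket_sales.csv')
print(uri)
```

With a session in hand, the call would look like ```df = read_s3_csv(bucket, key, boto3.Session().client('s3'))```.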


---

## Conclusion

You can now:

- Read a CSV file in S3 with boto3 and pandas
- Read a CSV file in S3 with smart_open and pandas

Resources

- https://aws.amazon.com/s3/getting-started/
- https://github.com/boto/boto3 
- https://github.com/RaRe-Technologies/smart_open 
- https://github.com/pandas-dev/pandas