# Using Boto3 to Share Data Between Python and AWS S3 Buckets

## Introduction
Boto3 is the AWS SDK for Python, and it connects Python scripts with Amazon Web Services' (AWS) S3 storage service. S3 stores data as objects, such as an image file's binary data, and organizes them into buckets. Boto3 lets a script specify what to save, where to save it and how to label it. Before using Boto3, you need to install it and configure the AWS credentials for the account that will access S3 (usually your own account or your company's). This can be done like this:

pip install boto3

Then create a file at the path "~/.aws/credentials" that looks like this, with your own keys in place of the placeholders and without quotes around the values:

[default]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY
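As a quick sanity check of the file format described above, here is a minimal sketch that only inspects the file locally and never contacts AWS. The helper name `profile_is_configured` is hypothetical, and treating the credentials file as a standard INI file is an assumption of this sketch rather than a description of how boto3 itself parses it.

```python
import configparser
import os


def profile_is_configured(path, profile="default"):
    """Return True if `profile` in the credentials file defines both keys."""
    # Interpolation is disabled so that '%' characters in a secret are harmless
    parser = configparser.ConfigParser(interpolation=None)
    # read() silently skips files that do not exist, leaving the parser empty
    parser.read(os.path.expanduser(path))
    if not parser.has_section(profile):
        return False
    required = ("aws_access_key_id", "aws_secret_access_key")
    return all(parser.has_option(profile, name) for name in required)
```

Calling `profile_is_configured("~/.aws/credentials")` before creating any clients gives an early hint when the `[default]` section or one of the two keys is missing.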

## Buckets
#### Bucket Names
Data in S3 is stored in containers called buckets, which can be accessed through Python code or through the S3 section of the AWS web console. Importing Boto3, creating an S3 client object, creating an S3 resource object, listing all of the S3 buckets and then reading each bucket's name (which we'll just print for now) looks like this:

In [175]:
import boto3
import sys

# Credentials passed explicitly to boto3.client() are the first source boto3 tries.
# Replace the placeholders with your own keys, and never publish real keys.
client = boto3.client(
    's3',
    aws_access_key_id='YOUR_ACCESS_KEY',
    aws_secret_access_key='YOUR_SECRET_KEY'
    )

# This does not need to be hard coded, though, if the credentials file is configured properly
s3 = boto3.resource('s3')

for bucket in s3.buckets.all():
    print(bucket.name)

my_df_bucket
pds-tutorial
pds-tutorial-config-bucket
pds-tutorial-create-bucket


#### Creating Buckets

Creating and deleting buckets is very simple, but absolutely crucial. There are two ways to create a bucket: a simple create, or a create with a specific configuration. The configuration is most often used to set the location constraint, which is the AWS region where the bucket is stored. The tricky thing when coming up with a bucket name is that all S3 users share a single namespace. So whatever bucket name you come up with, it has to be unique across every AWS account, not just your own!

In [176]:
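from botocore.exceptions import ClientError

try:
    # Simple create: no region is specified
    bucket1 = s3.create_bucket(Bucket="pds-tutorial-create-bucket")
    # Configured create: LocationConstraint names the region that hosts the bucket
    bucket2 = s3.create_bucket(
        Bucket="pds-tutorial-config-bucket",
        CreateBucketConfiguration={"LocationConstraint": "us-west-1"})
except ClientError:
    # Raised, for example, when the name is already taken
    print("Bucket already exists!")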

Bucket already exists!
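Since a create call fails for names S3 does not accept, it can help to screen a candidate name locally first. The sketch below is a simplified reading of the current general-purpose bucket naming rules (3 to 63 characters, lowercase letters, digits, dots and hyphens, no adjacent dots, not shaped like an IP address). The helper name `looks_like_valid_bucket_name` is hypothetical, and passing this check says nothing about whether the name is still available.

```python
import re

# Simplified pattern: 3-63 characters, lowercase letters, digits, dots and
# hyphens only, starting and ending with a letter or digit
_NAME_PATTERN = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")
_IP_PATTERN = re.compile(r"\d{1,3}(\.\d{1,3}){3}")


def looks_like_valid_bucket_name(name):
    """Rough local check of a candidate bucket name (hypothetical helper)."""
    if not _NAME_PATTERN.fullmatch(name):
        return False
    if ".." in name:
        return False  # adjacent dots are not allowed
    if _IP_PATTERN.fullmatch(name):
        return False  # names formatted like an IP address are rejected
    return True
```

Names with underscores, such as the `my_df_bucket` bucket used in this tutorial, fail this check: they date from older, looser rules and would likely be rejected for a newly created bucket today.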


#### Bucket Keys and Objects
Each piece of data saved to a bucket is stored under a key, which is a string. The key identifies the object and can double as a way to further subdivide a bucket. So, if we know that ‘pds-tutorial’ is the name of one of our buckets, we can save an object to it. Saving an object takes the key we want to identify the data with and the data itself, which becomes the body of the object. For a text file on our local computer, where the arguments to upload_file are the local file name, the bucket name and the key, that looks like this:

In [177]:
client.upload_file("text.txt", "pds-tutorial", "pds-text")
client.upload_file("text.txt", "pds-tutorial-create-bucket", "pds-text")
client.upload_file("text.txt", "pds-tutorial-config-bucket", "pds-text")

Note that if the bucket name we pass is not actually the name of a bucket, we will get an error. The easiest way to deal with this is to wrap the line or lines that access the bucket in a try/except block. In general, it looks like this:

In [178]:
try:
    client.upload_file("text.txt", "pds-tutorial", "pds-text")
except Exception:
    print("Unexpected error:", sys.exc_info()[0])

There are other ways to upload objects, each with its own nuance. The first example below uploads an open file-like object rather than a file path. The second example uploads a file along with a dictionary of extra arguments that contains a metadata dictionary. The metadata dictionary stores whatever key/value pairs you want to keep alongside the data.

In [179]:
# Example 1
try:
    # Upload a file-like object to pds-tutorial at key pds-text2
    with open("text.txt", "rb") as f:
        client.upload_fileobj(f, "pds-tutorial", "pds-text2")
except Exception:
    print("Unexpected error:", sys.exc_info()[0])

In [180]:
# Example 2
try:
    # Upload text.txt to pds-tutorial at key pds-text-args
    client.upload_file(
        "text.txt", "pds-tutorial", "pds-text-args",
        ExtraArgs={"Metadata": {"mykey": "myvalue"}}
    )
except Exception:
    print("Unexpected error:", sys.exc_info()[0])
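To confirm that the metadata from Example 2 was stored, we can read it back. This is a minimal sketch that assumes `client` is an S3 client like the one created earlier; the helper name `read_user_metadata` is hypothetical. It relies on the client's `head_object` call, which returns an object's headers, including its user metadata, without downloading the body.

```python
def read_user_metadata(client, bucket_name, key):
    """Return the user-defined metadata dictionary stored with an object."""
    # head_object fetches an object's headers without downloading its body
    response = client.head_object(Bucket=bucket_name, Key=key)
    return response.get("Metadata", {})
```

For the upload above, the call would be `read_user_metadata(client, "pds-tutorial", "pds-text-args")`.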
    

#### Deleting Buckets
But what if we realize we don't want all these buckets? A bucket has to be empty before it can be deleted, so we first delete every key in the buckets we no longer want and then delete the buckets themselves. This can be done like so:

In [181]:
print("Bucket list starts as: ")
for bucket in s3.buckets.all():
    print(bucket.name)

bucket1 = s3.Bucket("pds-tutorial-create-bucket")
bucket2 = s3.Bucket("pds-tutorial-config-bucket")
# A bucket must be empty before it can be deleted, so remove its keys first
for key in bucket1.objects.all():
    print("Deleting key ", key, " from bucket1")
    key.delete()
print("Deleting bucket1")
bucket1.delete()

for key in bucket2.objects.all():
    print("Deleting key ", key, " from bucket2")
    key.delete()
print("Deleting bucket2")
bucket2.delete()

print("Bucket list is: ")
for bucket in s3.buckets.all():
    print(bucket.name)

Bucket list starts as: 
my_df_bucket
pds-tutorial
pds-tutorial-config-bucket
pds-tutorial-create-bucket
Deleting key  s3.ObjectSummary(bucket_name='pds-tutorial-create-bucket', key=u'pds-text')  from bucket1
Deleting bucket1
Deleting key  s3.ObjectSummary(bucket_name='pds-tutorial-config-bucket', key=u'pds-text')  from bucket2
Deleting bucket2
Bucket list is: 
my_df_bucket
pds-tutorial
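The delete-keys-then-delete-bucket pattern above can be wrapped in a small helper. This is a sketch, not the only way to do it: the name `empty_and_delete` is hypothetical, and it assumes `bucket` is a resource object such as `s3.Bucket("pds-tutorial-create-bucket")`. It uses the batch `delete()` action on the objects collection instead of deleting one key at a time. A bucket with versioning enabled would need extra cleanup that this sketch does not cover.

```python
def empty_and_delete(bucket):
    """Delete every object in `bucket`, then the bucket itself."""
    # The objects collection offers a batch delete() that removes keys in
    # bulk requests instead of one request per key
    bucket.objects.all().delete()
    # Only an empty bucket can be deleted
    bucket.delete()
```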


#### Head Bucket
Before working with a bucket, it helps to check that it exists. This is especially helpful in larger projects where the code interacts with a single bucket many times. We can call the head_bucket function on the S3 client instance we created and set a boolean to False if the call fails with a 404 error. This does, however, require another package import: botocore. Then, any time we access S3, a simple boolean check tells us whether it is safe to place an object in that bucket. Placing an object after such a check can also use a different syntax, which appears in the Saving Formatted Data section. The check looks like this:

In [182]:
import botocore

exists = True
try:
    client.head_bucket(Bucket='random-bucket')
except botocore.exceptions.ClientError as e:
    # If a client error is thrown, then check that it was a 404 error.
    # If it was a 404 error, then the bucket does not exist.
    error_code = int(e.response['Error']['Code'])
    if error_code == 404:
        exists = False
        
print(exists)

True
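Note that the check above leaves `exists` set to True for every error other than a 404, including a 403, which S3 returns when a bucket exists but belongs to an account we cannot access. As a sketch of how the two cases could be told apart, the hypothetical helper below maps the numeric code taken from `e.response['Error']['Code']` to a short label. The labels themselves are made up for illustration.

```python
def describe_head_bucket_code(error_code):
    """Map the error code from a failed head_bucket call to a short label."""
    code = int(error_code)
    if code == 404:
        return "missing"
    if code == 403:
        return "exists but access is denied"
    return "unknown error {}".format(code)
```

Inside the except block, `describe_head_bucket_code(e.response['Error']['Code'])` would then say whether the bucket name is free or merely owned by someone else.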


#### Bucket Subdivisions
A larger project is also where we are likely to want our data subdivided. For example, a Python script that crawls a website could organize data by when it was posted, using a key of the form “year/month/day/time/file_name”. S3 itself stores keys in a flat namespace, but the web console displays the slash-separated parts as folders. This would be done like this:

In [183]:
try:
    client.upload_file("text.txt", "pds-tutorial", "2016/10/19/2300/pds-text")
except Exception:
    print("Unexpected error:", sys.exc_info()[0])
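Rather than typing the date parts by hand, the key can be built from a timestamp. The sketch below assumes the crawler has the posting time as a `datetime`; the helper name `build_dated_key` is hypothetical.

```python
from datetime import datetime


def build_dated_key(posted_at, file_name):
    """Build a 'year/month/day/time/file_name' key from a datetime."""
    # %H%M gives the 24-hour time without a separator, e.g. 2300
    return "{}/{}".format(posted_at.strftime("%Y/%m/%d/%H%M"), file_name)


key = build_dated_key(datetime(2016, 10, 19, 23, 0), "pds-text")
print(key)  # 2016/10/19/2300/pds-text
```

The resulting string can be passed as the third argument to `client.upload_file`, exactly like the hand-written key above.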

#### Bucket Info Accessing
The console then shows the data divided into year folders and further subdivided into month, day and time folders, with the file name as the data's identifying name. The data can then be downloaded and its contents written to a local file like this:

In [184]:
# Download the object in pds-tutorial with the key 2016/10/19/2300/pds-text into a file-like object
with open("tmp.txt", "wb") as f:
    client.download_fileobj("pds-tutorial", "2016/10/19/2300/pds-text", f)
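Subdivided keys also make it possible to find everything stored under one "folder". This is a minimal sketch that assumes `client` is an S3 client like the one created earlier; the helper name `keys_under_prefix` is hypothetical. It uses the client's `list_objects_v2` call with a `Prefix` and follows the continuation token, since a single response is limited to 1000 keys.

```python
def keys_under_prefix(client, bucket_name, prefix):
    """Collect every key in the bucket that starts with `prefix`."""
    keys = []
    kwargs = {"Bucket": bucket_name, "Prefix": prefix}
    while True:
        page = client.list_objects_v2(**kwargs)
        # "Contents" is absent when no key matches the prefix
        keys.extend(item["Key"] for item in page.get("Contents", []))
        if not page.get("IsTruncated"):
            return keys
        # More keys remain, so ask for the next page with the returned token
        kwargs["ContinuationToken"] = page["NextContinuationToken"]
```

For example, `keys_under_prefix(client, "pds-tutorial", "2016/10/")` would list every object saved for October 2016.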

And those are some of the basics of Boto3! But what if we want to store something more structured than the contents of a file, such as formatted data? We can do that, too! Let's make a new bucket and add some more complex data to it. You'll notice that we add this data in a different way: we use the s3 resource object more heavily to save it.

### Saving Formatted Data

In [185]:
import pandas as pd

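# Given the path to a csv file, creates a pandas DataFrame
def format_data(csv_file):
    # DataFrame.from_csv was removed in pandas 1.0; read_csv with these
    # arguments reproduces its defaults (first column as a date-parsed index)
    return pd.read_csv(csv_file, index_col=0, parse_dates=True)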

# Given bucket_name which is a string and df which is a pandas DataFrame
# Saves each column in df as an object (with the key as the column name) in the bucket if a bucket with that name exists. 
def save_data(bucket_name, df):
    exists = True
    try:
        s3.meta.client.head_bucket(Bucket=bucket_name)
    except botocore.exceptions.ClientError as e:
        exists = False
    if exists:  
        for column in df:
            bod = df[column].to_string()
            s3.Object(bucket_name, column).put(Body=bod)
            
s3.create_bucket(Bucket='my_df_bucket')
df = format_data("tweets.csv")
save_data("my_df_bucket", df)

#### S3 Objects
The most notable difference here is the use of s3's Object function, which creates an object in bucket_name with the column name as the key; calling put on it then stores the given body in that object. It's important to recognize that each column had to be converted to a string first, because the body of an s3 Object has to be a string, byte array or file-like object. This does limit how the data is saved, but what matters most is that the DataFrame's contents are now stored somewhere they can be accessed later. To demonstrate this, let's iterate through the keys and print them out.

In [186]:
bucket = s3.Bucket("my_df_bucket")
for key in bucket.objects.all():
    print(key.key)

created_at
favorite_count
retweet_count
text
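Listing the keys shows that the columns were saved, and reading one back shows what was stored. This is a sketch under two assumptions: `s3_resource` is a resource like the `s3` object above, and the string body was stored as UTF-8 text. The helper name `read_object_text` is hypothetical.

```python
def read_object_text(s3_resource, bucket_name, key, encoding="utf-8"):
    """Return the body of an S3 object decoded to a string."""
    response = s3_resource.Object(bucket_name, key).get()
    # "Body" is a streaming object; read() returns the raw bytes
    return response["Body"].read().decode(encoding)
```

Calling `read_object_text(s3, "my_df_bucket", "text")` would return the string produced by `to_string()` for the text column, not a DataFrame column, so it would still need to be parsed before further analysis.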


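You can also set an object's metadata after it has been added to the bucket. S3 does not edit metadata in place, so the object is copied onto itself with its metadata replaced. This is most helpful when tagging information after it's been added to s3, but is good to know regardless. It can be done like this:

In [187]:
counter = 0
for key in bucket.objects.all():
    # put() without a Body would overwrite the data, so copy the object onto itself
    source = {'Bucket': key.bucket_name, 'Key': key.key}
    key.copy_from(CopySource=source, Metadata={'counter_val': str(counter)}, MetadataDirective='REPLACE')
    counter += 1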

### CORS Configuration
Using our bucket object, we can also configure the bucket's cross-origin resource sharing (CORS). A CORS configuration specifies which HTTP methods and which origins may be used to access the bucket from a web browser. It is also a very straightforward bit of code, which helps. It's done like so:

In [188]:
cors = bucket.Cors()
config = {
    'CORSRules': [
        {
            'AllowedMethods': ['GET'],
            'AllowedOrigins': ['*']
        }
    ]
}
cors.put(CORSConfiguration=config)
# Remove the CORS configuration again; this call returns the response shown below
cors.delete()

{'ResponseMetadata': {'HTTPHeaders': {'date': 'Fri, 04 Nov 2016 03:28:29 GMT',
   'server': 'AmazonS3',
   'x-amz-id-2': 'VykKBVP+KK74Z6aGTKlBUnAEZpBUk/WZkCMaZUY5p+P+3b3pX588z7MgFsqXYx+4vW3rZdj86Ss=',
   'x-amz-request-id': '58546DBAE400FA35'},
  'HTTPStatusCode': 204,
  'HostId': 'VykKBVP+KK74Z6aGTKlBUnAEZpBUk/WZkCMaZUY5p+P+3b3pX588z7MgFsqXYx+4vW3rZdj86Ss=',
  'RequestId': '58546DBAE400FA35',
  'RetryAttempts': 0}}
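When several buckets need similar rules, the configuration dictionary can be generated instead of written out by hand. The sketch below builds the same structure that is passed to `cors.put` above; the helper name `build_cors_config` is hypothetical, and the optional `MaxAgeSeconds` entry is included only as an illustration of one more rule field.

```python
def build_cors_config(methods, origins, max_age_seconds=None):
    """Build a single-rule CORS configuration dictionary."""
    rule = {"AllowedMethods": list(methods), "AllowedOrigins": list(origins)}
    if max_age_seconds is not None:
        # Optional: how long a browser may cache the preflight response
        rule["MaxAgeSeconds"] = max_age_seconds
    return {"CORSRules": [rule]}


config = build_cors_config(["GET"], ["*"])
```

The resulting `config` matches the one written out above, so it can be passed straight to `cors.put(CORSConfiguration=config)`.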

### Conclusion
And those are the essentials of using boto3 with Amazon's S3 service! For more examples (including some of the examples shown here) and information about the other AWS resources boto3 supports, you can always check out the documentation here: https://media.readthedocs.org/pdf/boto3/latest/boto3.pdf