# Using AWS S3 and AWS Lambda for Data Science
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/9/93/Amazon_Web_Services_Logo.svg/320px-Amazon_Web_Services_Logo.svg.png">

## Introduction
This tutorial will introduce you to AWS S3 and AWS Lambda. More specifically, I will walk you through the AWS software development kit for Python called Boto 3 as well as the Python [function handler](https://docs.aws.amazon.com/lambda/latest/dg/python-programming-model-handler-types.html) for AWS Lambda.

AWS S3 is a storage service running on AWS, Amazon's cloud computing service. This service is particularly useful in the context of Data Science and especially Big Data as it will allow us to easily manipulate large amounts of data without having to store them locally on our own machine.

AWS Lambda is a [serverless computing](https://en.wikipedia.org/wiki/Serverless_computing) service also provided by Amazon. It lets you run code in the cloud without having to set up or manage any server. Our code is executed only when needed, and AWS Lambda takes care of scaling automatically. Because of this, AWS Lambda suits functions that are called a few times as well as functions that are called thousands of times per second. We are only charged while our code is running. In this tutorial we will use AWS Lambda to act on data stored in S3.

Once we know how to store data in the cloud using S3 and how to manipulate it using AWS Lambda, we could run a whole data science project in the cloud!

## Setting Up an AWS Account
In order to follow this tutorial, you will need an AWS account. Please follow this [link](https://aws.amazon.com/free) in case you don't already have one. Then sign in to the [console](https://signin.aws.amazon.com) to check that your account works properly.

The console allows you to access all of the AWS services, including S3 and Lambda. It is a web interface for all of Amazon's cloud products.

## AWS S3
Let's start with S3. We will use Boto 3, the Python API for AWS. Please note that most of what we will do here can also be achieved through the AWS console. However, this is not the goal here: we are trying to become familiar with the Boto 3 library.

### Installing and Configuring Boto 3
First run `pip install boto3` on your machine to install the latest version of Boto 3 and make sure the following Python import works properly.

In [1]:
import boto3

Now that we have successfully installed Boto 3, we need to configure it with our AWS account. There are multiple ways to do this, all described in the [documentation](http://boto3.readthedocs.io/en/latest/guide/configuration.html) (feel free to consult it if you wish to connect to AWS in a different way from the one presented here).

The recommended way to connect to AWS in Python is to use the [AWS CLI](https://aws.amazon.com/cli/). However, in this tutorial, we will simply provide our credentials as method parameters. Hard-coding confidential information this way is considered bad practice, but we do it here because it suits the Jupyter notebook format of this tutorial.

Before we can provide our credentials, we need to generate them. Following best practices, we will create a user with a restricted set of permissions for Boto 3 to connect with:
* Go to the [following page](https://console.aws.amazon.com/iam/home?region=us-east-1#/users$new?step=details) to add a new user.
* Provide a name and make sure to check the "Programmatic access" option.
* Click on "Next: Permissions".
* Select "Attach existing policies directly" and add "AmazonS3FullAccess", "AWSLambdaFullAccess" as well as "AmazonSNSFullAccess".
* Click on "Next: Review" and then "Create user".

The new user's credentials are now available to you. Please provide these values in the following variables if you wish to follow along. You can also provide your Andrew ID (or any personal identifier) to personalize any name that Amazon requires to be completely unique.

In [2]:
AWS_ACCESS_KEY_ID = ''
AWS_SECRET_ACCESS_KEY = ''
ANDREW_ID = ''
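
To avoid typing the keys into the notebook, the two credential variables above could instead be filled from environment variables. The following is a minimal sketch: `load_credentials` is a hypothetical helper, and it assumes the keys were exported under the names `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` before the notebook was started (Boto 3 also looks for these two names on its own when no credentials are passed explicitly).

```python
import os


def load_credentials(environ=None):
    """Return (access_key_id, secret_access_key) read from environment variables.

    `environ` defaults to os.environ; passing a plain dict makes testing easy.
    """
    if environ is None:
        environ = os.environ
    key_id = environ.get('AWS_ACCESS_KEY_ID', '')
    secret = environ.get('AWS_SECRET_ACCESS_KEY', '')
    if not key_id or not secret:
        raise RuntimeError('AWS credentials are not set in the environment.')
    return key_id, secret


# Example with a stand-in environment (placeholder values, not real keys)
fake_env = {'AWS_ACCESS_KEY_ID': 'AKIAEXAMPLE', 'AWS_SECRET_ACCESS_KEY': 'example-secret'}
key_id, secret = load_credentials(fake_env)
print(key_id)
```

In the notebook, `AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY = load_credentials()` would then take the place of the hard-coded strings.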

### Interacting with AWS S3
In order to interact with S3 using Boto 3 we can use two different objects: `client` and `resource`. `client` calls AWS's service API directly and provides low-level access. `resource` is a higher-level representation of the actions available in S3 and usually provides a more concise and elegant way to perform them.
Here we will see how to create both of these objects.

In [3]:
s3 = boto3.resource('s3', aws_access_key_id = AWS_ACCESS_KEY_ID, aws_secret_access_key = AWS_SECRET_ACCESS_KEY)
s3_client = boto3.client('s3', aws_access_key_id = AWS_ACCESS_KEY_ID, aws_secret_access_key = AWS_SECRET_ACCESS_KEY)

Now that we have the necessary objects to send calls to AWS S3, we will first create a bucket. A bucket is where we will store key-value pairs corresponding to files. You can think of it as a directory where our files will be stored. You can create several buckets if you wish. As much as possible, we will try to use both `client` and `resource` in order to compare them. The result should be identical whether we use one or the other.

For this step and all the following ones, you can check on the [S3 page](https://s3.console.aws.amazon.com/s3) that everything is working properly.

In [4]:
# Using resource:
bucket_name_r = ANDREW_ID + '.15388-bucket-resource'
s3.create_bucket(Bucket = bucket_name_r)

# Using client:
bucket_name_c = ANDREW_ID + '.15388-bucket-client'
s3_client.create_bucket(Bucket = bucket_name_c)

{'Location': '/my-andrew-id.15388-bucket-client',
 'ResponseMetadata': {'HTTPHeaders': {'content-length': '0',
   'date': 'Sat, 31 Mar 2018 22:24:25 GMT',
   'location': '/my-andrew-id.15388-bucket-client',
   'server': 'AmazonS3',
   'x-amz-id-2': 'ayTrn50oyBU4CHWU5VSubLOq2p22Y7M7lWVpT9MzAuwAgZFVXne8Jy4ZT55pCOSQ88mOZ5INUtQ=',
   'x-amz-request-id': 'E8A46D95CF4669A1'},
  'HTTPStatusCode': 200,
  'HostId': 'ayTrn50oyBU4CHWU5VSubLOq2p22Y7M7lWVpT9MzAuwAgZFVXne8Jy4ZT55pCOSQ88mOZ5INUtQ=',
  'RequestId': 'E8A46D95CF4669A1',
  'RetryAttempts': 0}}
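
Bucket names are shared across all AWS accounts, which is why the names above are prefixed with a personal identifier. S3 also restricts their format (3 to 63 characters; lowercase letters, digits, dots and hyphens; starting and ending with a letter or digit), and outside of `us-east-1`, `create_bucket` expects a `CreateBucketConfiguration` with a `LocationConstraint`. The sketch below checks a name against a simplified version of these rules and builds the keyword arguments for `create_bucket`. The helper names and the example region are assumptions made for illustration.

```python
import re

# Simplified check: the full S3 rules cover additional cases (e.g. IP-like names)
BUCKET_NAME_RE = re.compile(r'^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$')


def is_valid_bucket_name(name):
    """Return True if `name` passes a simplified version of the S3 naming rules."""
    return bool(BUCKET_NAME_RE.match(name)) and '..' not in name


def create_bucket_kwargs(name, region='us-east-1'):
    """Build keyword arguments for create_bucket (client or resource)."""
    if not is_valid_bucket_name(name):
        raise ValueError('Invalid bucket name: ' + name)
    kwargs = {'Bucket': name}
    # us-east-1 is the default region, so no location constraint is added for it
    if region != 'us-east-1':
        kwargs['CreateBucketConfiguration'] = {'LocationConstraint': region}
    return kwargs


print(create_bucket_kwargs('my-andrew-id.15388-bucket-client'))
print(create_bucket_kwargs('my-andrew-id.15388-bucket-eu', region='eu-west-1'))
print(is_valid_bucket_name('My_Bucket'))
```

The result can be unpacked into either call, for example `s3_client.create_bucket(**create_bucket_kwargs(bucket_name_c))`.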

We notice that when creating a bucket, both `resource` and `client` use the same function. However, this is generally not the case as we will see when trying to list all the buckets created so far.

In [6]:
# Using resource:
print('Using resource:')
for bucket in s3.buckets.all():
    print(bucket.name)

print('\n')
    
# Using client:
response = s3_client.list_buckets()
print('Using client:')
for bucket in response['Buckets']:
    print(bucket['Name'])


Using resource:
my-andrew-id.15388-bucket-client
my-andrew-id.15388-bucket-resource


Using client:
my-andrew-id.15388-bucket-client
my-andrew-id.15388-bucket-resource


Our buckets are empty for the moment. We can change this by uploading a file. Files in S3 are simply key-value pairs where the key is a file name and the value is the file content. We will see how to upload a file to the two buckets we have created.

In [7]:
file_name = 'aws_lambda_s3_tutorial.ipynb'

# Using resource:
# The with-statement closes the file once the upload has finished
with open(file_name, 'rb') as file_content:
    response_resource = s3.Object(bucket_name_r, file_name).put(Body = file_content)

# Using client:
s3_client.upload_file(file_name, bucket_name_c, file_name)            

When uploading using `put`, we stored the response generated by that call. Most operations in Boto 3 return such a response, which can be used to fetch information about the operation.

In [8]:
# Most operations return a response that allows us to gather information
print(str(response_resource) + '\n')

# Some fields are more interesting than others
if response_resource['ResponseMetadata']['HTTPStatusCode'] == 200:
    time = str(response_resource['ResponseMetadata']['HTTPHeaders']['date'])
    print('File successfully added on ' + time + '.')
else:
    print('An error occurred')

{'ResponseMetadata': {'RequestId': 'BBDCC9BDCC27A265', 'HostId': 'Rr+WyMeKIUEmLCRAh96wFuio1unDAc6Ei3ZguziDnSkRKEHYqs1WccFrBrVsqp0EEOOxI6HeWBM=', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amz-id-2': 'Rr+WyMeKIUEmLCRAh96wFuio1unDAc6Ei3ZguziDnSkRKEHYqs1WccFrBrVsqp0EEOOxI6HeWBM=', 'x-amz-request-id': 'BBDCC9BDCC27A265', 'date': 'Sat, 31 Mar 2018 22:25:40 GMT', 'etag': '"61cff2c5c3a8f3505f1803f1792ddc11"', 'content-length': '0', 'server': 'AmazonS3'}, 'RetryAttempts': 0}, 'ETag': '"61cff2c5c3a8f3505f1803f1792ddc11"'}

File successfully added on Sat, 31 Mar 2018 22:25:40 GMT.
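
Since an S3 object is just a key with a value, the value does not have to come from a file on disk. The following sketch stores a small string and reads it back using the `put_object` and `get_object` client calls. The helper names are made up for this example, and the functions take the client as an argument so that they can be tried with `s3_client` from above.

```python
def put_text(client, bucket, key, text):
    """Store a string as the value of `key` in `bucket`."""
    return client.put_object(Bucket=bucket, Key=key, Body=text.encode('utf-8'))


def get_text(client, bucket, key):
    """Fetch the value stored under `key` and decode it as UTF-8 text."""
    response = client.get_object(Bucket=bucket, Key=key)
    # 'Body' is a file-like streaming object; read() returns bytes
    return response['Body'].read().decode('utf-8')
```

For instance, `put_text(s3_client, bucket_name_c, 'scores.csv', 'name,score\nalice,3\n')` followed by `get_text(s3_client, bucket_name_c, 'scores.csv')` should return the same string.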


Now let's see how we can list the files that have been uploaded.

In [10]:
# Using resource:
print('Using resource:')
for bucket in s3.buckets.all():
    print(bucket.name + ':')
    for key in bucket.objects.all():
        print('  ' + key.key)
        
print('\n')        
        
# Using client:
print('Using client:')
response = s3_client.list_buckets()
for bucket in response['Buckets']:
    bucket_name = bucket['Name']
    print(bucket_name + ':')
    # 'Contents' is absent from the response when the bucket is empty
    for key in s3_client.list_objects_v2(Bucket = bucket_name).get('Contents', []):
        print('  ' + key['Key'])

Using resource:
my-andrew-id.15388-bucket-client:
  aws_lambda_s3_tutorial.ipynb
my-andrew-id.15388-bucket-resource:
  aws_lambda_s3_tutorial.ipynb


Using client:
my-andrew-id.15388-bucket-client:
  aws_lambda_s3_tutorial.ipynb
my-andrew-id.15388-bucket-resource:
  aws_lambda_s3_tutorial.ipynb


Interestingly, when we ran the previous cell, `client` was noticeably slower than `resource` for listing all files, although timings like this depend on the network and can vary between runs. The notation for `resource` is also more concise.
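
One caveat for larger buckets: a single `list_objects_v2` call returns at most 1,000 keys, so the `client` loop above would stop early on a big bucket (the `resource` collection pages through the results for us). Below is a minimal sketch of following the continuation token by hand, where `list_all_keys` is a hypothetical helper name.

```python
def list_all_keys(client, bucket):
    """Return every key in `bucket`, following continuation tokens."""
    keys = []
    kwargs = {'Bucket': bucket}
    while True:
        response = client.list_objects_v2(**kwargs)
        # 'Contents' is missing from the response when no keys are returned
        for obj in response.get('Contents', []):
            keys.append(obj['Key'])
        if not response.get('IsTruncated'):
            return keys
        # Ask for the next page on the following call
        kwargs['ContinuationToken'] = response['NextContinuationToken']
```

It could be called as `list_all_keys(s3_client, bucket_name_c)`.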

Now, let's see how to delete a bucket. Before deleting any bucket, we need to remove all of its keys.

In [11]:
# Using resource:
# First we find our bucket
bucket = None
for b in s3.buckets.all():
    if b.name == bucket_name_r:
        bucket = b
        
# Now we delete all keys and the bucket
if bucket is not None:
    for key in bucket.objects.all():
        key.delete()
    bucket.delete()

# Using client:
# We need to check that the bucket exists first
# (a separate loop variable keeps the result from being overwritten)
bucket_name = None
response = s3_client.list_buckets()
for b in response['Buckets']:
    if b['Name'] == bucket_name_c:
        bucket_name = b['Name']
        
if bucket_name is not None:
    # 'Contents' is absent from the response when the bucket is empty
    for key in s3_client.list_objects(Bucket = bucket_name).get('Contents', []):
        s3_client.delete_object(Bucket = bucket_name, Key = key['Key'])
    s3_client.delete_bucket(Bucket = bucket_name)
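
Deleting keys one request at a time becomes slow for large buckets. The client also offers `delete_objects`, which accepts up to 1,000 keys per request. The sketch below batches a list of keys accordingly; `chunked` and `delete_keys` are hypothetical helper names.

```python
def chunked(items, size):
    """Yield successive slices of `items` containing at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def delete_keys(client, bucket, keys, batch_size=1000):
    """Delete `keys` from `bucket` in batches; return the number of requests sent."""
    requests_sent = 0
    for batch in chunked(list(keys), batch_size):
        client.delete_objects(
            Bucket=bucket,
            Delete={'Objects': [{'Key': key} for key in batch]},
        )
        requests_sent += 1
    return requests_sent
```

With `resource`, a similar batch deletion is available as `bucket.objects.all().delete()`.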

We are now familiar enough with the Boto 3 API to perform all the important operations related to S3.

## AWS Lambda
### Creating our function
To create an AWS Lambda function, go to this [link](https://console.aws.amazon.com/lambda/home?region=us-east-1#/create). From here:
* Give a name to your function (`15388-tutorial` for example).
* Select Python 3.6 as the runtime.
* Select "Create a custom role". This will take you to another page where you can provide a name, leave the rest as it is, and click "Allow".
* Now go back to the Lambda creation page and click on "Create function".

You are now on the Lambda management console. From here you can tweak different things, including the permissions of your function and the amount of memory and compute time available to it. Go to the "Function code" section, where you can edit your Lambda function. We are given this template:

In [12]:
def lambda_handler(event, context):
    # TODO implement
    return 'Hello from Lambda'

The `event` argument contains a dictionary describing the event that caused the function to run. There are several ways to trigger the function. One of the easier ones is AWS SNS, which we will not cover in this tutorial. You can find more information about SNS [here](https://aws.amazon.com/sns/getting-started/).
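
The exact shape of `event` depends on what triggered the function. As one illustration, when a function is triggered by an S3 upload notification (a trigger that is not configured in this tutorial), the bucket name and object key are nested under a `Records` list. The handler below is a sketch under that assumption; `s3_event_handler` is a hypothetical name and the sample event is trimmed to the fields the handler reads.

```python
from urllib.parse import unquote_plus


def s3_event_handler(event, context):
    """Return the [bucket, key] pairs named in an S3 notification event."""
    uploaded = []
    for record in event.get('Records', []):
        bucket = record['s3']['bucket']['name']
        # Object keys arrive URL-encoded (e.g. spaces become '+')
        key = unquote_plus(record['s3']['object']['key'])
        uploaded.append([bucket, key])
    return uploaded


# Trimmed-down sample event for local experimentation
sample_event = {
    'Records': [
        {'s3': {'bucket': {'name': 'my-bucket'},
                'object': {'key': 'data/my+file.csv'}}}
    ]
}
print(s3_event_handler(sample_event, None))
```

Calling the handler locally with a hand-written dictionary, as done here, is a convenient way to try out the parsing logic before deploying it.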

### Interacting with S3 from our lambda function
Let's create two buckets as we have learned. We will use them to get familiar with using Boto 3 inside of AWS Lambda.

In [13]:
lambda_bucket_1 = '15388-lambda-bucket1-' + ANDREW_ID
lambda_bucket_2 = '15388-lambda-bucket2-' + ANDREW_ID

s3.create_bucket(Bucket = lambda_bucket_1)
s3.create_bucket(Bucket = lambda_bucket_2)

s3.Bucket(name='15388-lambda-bucket2-my-andrew-id')

Click on the "Test" button at the top-left of the Lambda management console. From here we will provide the following: 
`{"bucket1": "", "bucket2": "", "key_id": "", "secret_key_id": ""}` where we fill in the first two fields with the names of the two newly created buckets and the remaining two with our credentials.

Next time you press "Test", the function will be called with the provided input in `event`.
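
The same test can also be triggered from the notebook rather than from the console. The sketch below builds the test event and passes it to the `invoke` call of a Boto 3 Lambda client. `build_test_event` and `invoke_function` are hypothetical helpers, and the bucket names in the example are placeholders.

```python
import json


def build_test_event(bucket1, bucket2, key_id, secret_key_id):
    """Return the test event as a dictionary with the four expected fields."""
    return {'bucket1': bucket1, 'bucket2': bucket2,
            'key_id': key_id, 'secret_key_id': secret_key_id}


def invoke_function(lambda_client, function_name, event):
    """Invoke a Lambda function synchronously and return its decoded result."""
    response = lambda_client.invoke(
        FunctionName=function_name,
        InvocationType='RequestResponse',
        Payload=json.dumps(event).encode('utf-8'),
    )
    # 'Payload' is a streaming body holding the JSON-encoded return value
    return json.loads(response['Payload'].read())


print(json.dumps(build_test_event('bucket-a', 'bucket-b', '', '')))
```

Assuming the function was named `15388-tutorial` and created in `us-east-1`, a client could be built with `boto3.client('lambda', region_name = 'us-east-1', aws_access_key_id = AWS_ACCESS_KEY_ID, aws_secret_access_key = AWS_SECRET_ACCESS_KEY)` and passed to `invoke_function` together with the function name and the event.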

Let's look at what a full example would look like (you can access the logs from calls to `print` by clicking on `logs` after each execution):

In [14]:
import json
import os
import boto3

def lambda_handler(event, context):
    # Fetch the names from the event
    bucket1_name = event['bucket1']
    print('bucket1_name: ' + bucket1_name)
    bucket2_name = event['bucket2']
    print('bucket2_name: ' + bucket2_name)
    
    # We will use resource as it leads to cleaner code
    s3 = boto3.resource('s3', aws_access_key_id = event['key_id'], aws_secret_access_key = event['secret_key_id'])
    
    # Let's acquire the objects
    bucket1 = None
    bucket2 = None
    for bucket in s3.buckets.all():
        if bucket.name == bucket1_name:
            print('Found ' + bucket1_name + '!')
            bucket1 = bucket
        elif bucket.name == bucket2_name:
            print('Found ' + bucket2_name + '!')
            bucket2 = bucket
    
    # Check that they exist
    if bucket1 is None or bucket2 is None:
        return 'Could not find both buckets in S3.'
            
    # We can store the files in lists
    file1_list = []
    file2_list = []
    for key in bucket1.objects.all():
        file1_list.append(key.key)
    for key in bucket2.objects.all():
        file2_list.append(key.key)
            
    print(bucket1_name + ' contains ' + str(file1_list))
    print(bucket2_name + ' contains ' + str(file2_list))
    
    # Here we download the files to the local server's temp directory
    try:
        for f in file1_list:
            s3.Bucket(bucket1_name).download_file(f, '/tmp/' + f)
        for f in file2_list:
            s3.Bucket(bucket2_name).download_file(f, '/tmp/' + f)
    except Exception as e:
        return str(e)
    
    # We check that the files have been properly downloaded
    local_files = []
    for f in os.listdir("/tmp/"):
        local_files.append(f)
    
    return 'The following files were downloaded ' + str(local_files)

Now that we know how to download the files locally, we can use everything we learned during the course to gather insights from them.
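
As a small illustration of that last step, the sketch below summarizes a CSV file once it has been downloaded. It assumes, purely for the sake of example, that the bucket contained a comma-separated file with a header row. `summarize_csv` is a hypothetical helper that only relies on the Python standard library.

```python
import csv
import os
import tempfile


def summarize_csv(path):
    """Return the column names and the number of data rows of a CSV file."""
    with open(path, newline='') as f:
        reader = csv.reader(f)
        header = next(reader, [])
        row_count = sum(1 for _ in reader)
    return {'columns': header, 'rows': row_count}


# Local demonstration with a temporary file standing in for '/tmp/<key>'
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'scores.csv')
    with open(path, 'w', newline='') as f:
        f.write('name,score\nalice,3\nbob,5\n')
    print(summarize_csv(path))
```

Inside the Lambda function, the same helper could be applied to each path under `/tmp/` after the download loop.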

## Further resources
* [AWS S3 documentation](https://aws.amazon.com/documentation/s3/)
* [AWS Lambda documentation](https://aws.amazon.com/lambda/getting-started/)
* [AWS SNS documentation](https://aws.amazon.com/documentation/sns/)