# Reading a file from S3
The Simple Storage Service (S3) is the AWS service that provides object storage. In short, it's like a web-accessible file system. The service has some limitations, but we will not focus on those here.

# Overview
In this workbook we will look at how to pull a file from S3 and read its contents using Python.

1. Install the boto3 library
2. Connect to AWS
3. Upload a file to our S3 bucket
4. Download a file from our S3 bucket
5. Read the downloaded file using Python

# Step 1: Install the boto3 library
Boto3 is the Python library (the AWS SDK for Python) that lets Python code interact with the various web services that AWS provides. To use the library we must first install it. We will install it using pip, the package installer for Python.

In [1]:
!pip install boto3

Collecting boto3
  Downloading https://files.pythonhosted.org/packages/a2/23/ccd621dc954c95094caaa57d603bcb911a38dcd4c0989f236016e4fb7aa3/boto3-1.9.241-py2.py3-none-any.whl (128kB)
Collecting botocore<1.13.0,>=1.12.241 (from boto3)
  Downloading https://files.pythonhosted.org/packages/f7/8e/6007b25ec8604a47ca9f2d357f9747f7909cd5b73cd6f33d87b2415c4fd9/botocore-1.12.241-py2.py3-none-any.whl (5.7MB)
Collecting jmespath<1.0.0,>=0.7.1 (from boto3)
  Downloading https://files.pythonhosted.org/packages/83/94/7179c3832a6d45b266ddb2aac329e101367fbdb11f425f13771d27f225bb/jmespath-0.9.4-py2.py3-none-any.whl
Collecting s3transfer<0.3.0,>=0.2.0 (from boto3)
  Downloading https://files.pythonhosted.org/packages/16/8a/1fc3dba0c4923c2a76e1ff0d52b305c44606da63f718d14d3231e21c51b0/s3transfer-0.2.1-py2.py3-none-any.whl (70kB)
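Once pip finishes, it can be useful to confirm which version of boto3 the notebook kernel actually sees. The following is a minimal sketch using only the standard library (Python 3.8 or later is assumed); the helper name `installed_version` is made up for this example.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string, or None when boto3 is not installed in this kernel
print("boto3:", installed_version("boto3"))
```

If this prints `None` right after a successful install, the kernel is probably using a different Python environment than the one pip installed into.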

# Step 2: Connecting to AWS
We will need to authenticate as a user in AWS to gain access to the various resources. The AIML Development VM has an IAM role associated with it, which is granted permission to the S3 bucket.

The IAM policy JSON was as follows:

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "PermissionForObjectOperations",
            "Effect": "Allow",
            "Action": [
                "s3:*"
            ],
            "Resource": [
                "arn:aws:s3:::ai-enthusiasts-machine-learning-datasets",
                "arn:aws:s3:::ai-enthusiasts-machine-learning-datasets/*"
            ]
        }
    ]
}
```

I then attached the policy to the ml-users group.
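To make explicit what this policy grants, the JSON can be loaded with Python's `json` module and summarized statement by statement. This is only an illustrative sketch: the helper name `summarize_policy` is hypothetical, and the policy text is copied from the block above.

```python
import json

# The policy document shown above, copied verbatim
POLICY_JSON = """
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "PermissionForObjectOperations",
            "Effect": "Allow",
            "Action": ["s3:*"],
            "Resource": [
                "arn:aws:s3:::ai-enthusiasts-machine-learning-datasets",
                "arn:aws:s3:::ai-enthusiasts-machine-learning-datasets/*"
            ]
        }
    ]
}
"""


def _as_list(value):
    """IAM allows a single string where a list is expected; normalize to a list."""
    return [value] if isinstance(value, str) else list(value)


def summarize_policy(policy_text):
    """Return one (effect, actions, resources) tuple per policy statement."""
    statements = json.loads(policy_text)["Statement"]
    # A policy may also contain a single statement object instead of a list
    if isinstance(statements, dict):
        statements = [statements]
    return [
        (s["Effect"], _as_list(s.get("Action", [])), _as_list(s.get("Resource", [])))
        for s in statements
    ]


for effect, actions, resources in summarize_policy(POLICY_JSON):
    print(effect, actions, "on", resources)
```

The two resource entries matter: the first ARN covers the bucket itself (for example, listing it), while the second, ending in `/*`, covers the objects inside it.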

In [3]:
# Connect to AWS and list contents of the AIML S3 bucket
import boto3
s3 = boto3.resource('s3')
bucketName = "machine-learning-us-east-2"
bucket = s3.Bucket(bucketName)
for bucket_object in bucket.objects.all():
    print(bucket_object)
    break

s3.ObjectSummary(bucket_name='machine-learning-us-east-2', key='Installation Media/')
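The loop above stops after the first object. To look at a handful of objects, optionally under a folder-like prefix, boto3 resource collections offer `filter(Prefix=...)` and `limit(...)`. Here is a small sketch; the helper name `first_keys` is hypothetical, and it takes the `Bucket` resource as an argument so that nothing touches AWS until it is called.

```python
def first_keys(bucket, prefix="", limit=5):
    """Return up to `limit` object keys from a boto3 `s3.Bucket` resource.

    `prefix` narrows the listing to keys that start with the given string.
    """
    # Only apply the prefix filter when one was actually given
    objects = bucket.objects.filter(Prefix=prefix) if prefix else bucket.objects.all()
    return [obj.key for obj in objects.limit(limit)]


# Usage with the `bucket` object created in the cell above (needs AWS credentials):
# print(first_keys(bucket, prefix="Installation Media/"))
```

Limiting the collection keeps the call cheap on large buckets, since boto3 stops paging through results once enough objects have been returned.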


# Step 3: Upload resource to AWS
We will store a CSV file named "nasdaq_2019.csv" in an S3 bucket named "ai-enthusiasts-machine-learning-datasets" in our AWS account.

In [4]:
# Define a function to upload a file to S3
import logging
import boto3
from botocore.exceptions import ClientError


def upload_file(file_name, bucket, object_name=None):
    """Upload a file to an S3 bucket

    :param file_name: File to upload
    :param bucket: Bucket to upload to
    :param object_name: S3 object name. If not specified then file_name is used
    :return: True if file was uploaded, else False
    """

    # If S3 object_name was not specified, use file_name
    if object_name is None:
        object_name = file_name

    # Upload the file
    s3_client = boto3.client('s3')
    try:
        s3_client.upload_file(file_name, bucket, object_name)
    except ClientError as e:
        logging.error(e)
        return False
    return True
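The `object_name` argument is the S3 key, and S3 keys conventionally use forward slashes as the folder delimiter regardless of the local operating system. `os.path.join` would insert backslashes on Windows, so a dedicated helper for building keys can be safer. The sketch below is one way to do it; the name `s3_key` and the `datasets/` prefix are made up for illustration.

```python
def s3_key(*parts):
    """Join path-like parts into an S3 key using forward slashes."""
    # Drop empty parts so an empty prefix does not produce a leading slash
    cleaned = [part.strip("/") for part in parts if part and part.strip("/")]
    return "/".join(cleaned)


print(s3_key("", "nasdaq_2019.csv"))           # prints: nasdaq_2019.csv
print(s3_key("datasets/", "nasdaq_2019.csv"))  # prints: datasets/nasdaq_2019.csv
```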

In [15]:
# Import a library to assist opening a file
import os

# Define some parameters for our upload
bucket_name = "ai-enthusiasts-machine-learning-datasets"
bucket_path = ""
file_name = "nasdaq_2019.csv"
full_bucket_file_path = os.path.join(bucket_path, file_name)
full_file_path = os.path.join(os.getcwd(), file_name)

if not os.path.exists(full_file_path):
    raise FileNotFoundError("Local file not found: {0}".format(full_file_path))

# Create an S3 Client
s3 = boto3.client('s3')

# Do the upload
with open(full_file_path, "rb") as f:
    s3.upload_fileobj(f, bucket_name, full_bucket_file_path)

ClientError: An error occurred (AccessDenied) when calling the CreateMultipartUpload operation: Access Denied
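The error above names the failing operation (`CreateMultipartUpload`) and the error code (`AccessDenied`), but this workbook does not establish why access was denied. When debugging a failure like this, it can help to pull those fields out of the exception explicitly. Below is a hedged sketch: `describe_client_error` is a hypothetical helper that relies on the `response` and `operation_name` attributes that botocore's `ClientError` carries.

```python
def describe_client_error(error):
    """Summarize a botocore ClientError as 'Operation failed: Code (Message)'."""
    details = error.response.get("Error", {})
    code = details.get("Code", "Unknown")
    message = details.get("Message", "")
    operation = getattr(error, "operation_name", "UnknownOperation")
    return "{0} failed: {1} ({2})".format(operation, code, message)


# Typical use around the upload call (requires boto3 and AWS credentials):
# try:
#     s3.upload_fileobj(f, bucket_name, full_bucket_file_path)
# except ClientError as e:
#     print(describe_client_error(e))
```

Checking the error code also makes it possible to treat `AccessDenied` differently from, say, a missing bucket, rather than logging every failure the same way.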

# Step 4: Download a file from S3
We have stored a CSV file named "nasdaq_2019.csv" in an S3 bucket named "ai-enthusiasts-machine-learning-datasets" in our AWS account.

In [14]:
# Specify where the file will be downloaded to.
# The os library lets us query the operating system for filesystem information.

os_file_path = os.path.join(os.path.abspath(os.sep), "tmp", full_bucket_file_path)

# Download a file from S3 bucket using the S3 client
#    https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-examples.html
#

if not os.path.isfile(os_file_path):
    print("Downloading file '{0}' from S3".format(full_bucket_file_path))
    s3 = boto3.client('s3')
    s3.download_file(bucket_name, full_bucket_file_path, os_file_path)
    print("Done!")
else:
    print("File already exists. skipping download.")

File already exists. skipping download.
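The download-if-missing logic in the cell above can be wrapped in a reusable function. The version below is a sketch under a few assumptions: the name `download_if_missing` is hypothetical, the S3 client is passed in rather than created inside, and the destination directory is created if it does not exist yet.

```python
import os


def download_if_missing(s3_client, bucket_name, key, destination):
    """Download s3://bucket_name/key to `destination` unless it already exists.

    Returns True if a download happened, False if the file was already there.
    """
    if os.path.isfile(destination):
        return False
    # Make sure the target directory exists before boto3 writes the file
    parent = os.path.dirname(destination)
    if parent:
        os.makedirs(parent, exist_ok=True)
    s3_client.download_file(bucket_name, key, destination)
    return True


# Usage with the names from the cells above (requires boto3 and AWS credentials):
# downloaded = download_if_missing(boto3.client("s3"), bucket_name,
#                                  full_bucket_file_path, os_file_path)
```

Passing the client in keeps the function easy to exercise with a stand-in object, and the boolean return value tells the caller whether the cached copy was used.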


# Step 5: Read the data from the filesystem using Python
We want to be able to access the data, so we need to load it into memory. There are many ways to do this; here we will simply read the first few lines of the file.

In [12]:
with open(os_file_path) as f:
    # Iterate lazily so only the first six lines are read, not the whole file
    for i, line in enumerate(f):
        if i >= 6:
            break
        print(line.rstrip("\n"))

"ticker","interval","date","open","high","low","close","volume"
"AABA","D","2019-07-01","70.9","71.52","70.325","70.57","10234800"
"AAL","D","2019-07-01","33.14","33.6632","32.5301","32.88","8995100"
"AAME","D","2019-07-01","2.43","2.43","2.4","2.4","500"
"AAOI","D","2019-07-01","10.7","10.89","10.01","10.18","883100"
"AAON","D","2019-07-01","50.57","50.985","48.56","49.73","180200"

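Since the file is a quoted CSV with a header row, the standard library's `csv` module can turn each line into a dictionary keyed by column name. The following is a minimal sketch; `read_rows` is a hypothetical helper, and two lines copied from the output above stand in for the downloaded file.

```python
import csv
import io
from itertools import islice


def read_rows(text_stream, limit=5):
    """Return the first `limit` data rows of a CSV stream as dictionaries."""
    reader = csv.DictReader(text_stream)
    return list(islice(reader, limit))


# Two lines copied from the output above stand in for the downloaded file
sample = io.StringIO(
    '"ticker","interval","date","open","high","low","close","volume"\n'
    '"AABA","D","2019-07-01","70.9","71.52","70.325","70.57","10234800"\n'
)
for row in read_rows(sample):
    # Values arrive as strings, so numeric columns need an explicit conversion
    print(row["ticker"], row["date"], float(row["close"]))  # prints: AABA 2019-07-01 70.57
```

With the real file, the same function could be called as `read_rows(open(os_file_path, newline=""))`, ideally inside a `with` block so the file is closed afterwards.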