# Using SageMaker
SageMaker Studio is the hub for Amazon's machine learning services. Within SageMaker Studio, we can write Python code to import and prepare our data, create and train machine learning models, and automatically deploy endpoints for those models. If you have used Jupyter before, SageMaker Studio is essentially a Jupyter notebook IDE that has access to the rest of the AWS services. Since you are already familiar with Google Colab, using SageMaker shouldn't be too difficult.

In this file, you'll learn how to save data from external sources to the local file system associated with your SageMaker Studio account. You will also learn how to save these files to, and retrieve them from, an S3 bucket.


# Getting Data and Saving it Locally
We can retrieve data from a variety of sources, as you've seen in previous classes. It is important to recognize that SageMaker has a local file system of its own. To get more technical: when we created the SageMaker Studio environment, AWS provisioned a file system for us, and each time we launch SageMaker Studio, it mounts that file system. So, we can simply retrieve and save our data locally (like we've done before).

If you would like, you can simply upload a file to your SageMaker file system. However, if you are using AWS, it is likely that you have a lot of data and so rather than saving it here, you might as well upload it directly into S3 where other AWS ML services can also get access.

As you've seen previously, we can also get data directly from a file that is posted on the internet. The pandas read_csv function can read directly from a URL.

### Importing a CSV from a URL

In [None]:
# Read CSV from a URL
import pandas as pd

data = pd.read_csv('http://www.ishelp.info/data/insurance.csv')
data.head(5)

Once our data is loaded, we usually take time to analyze and evaluate the dataset. We can use the same methods that we saw in earlier chapters of the book. Once we are done cleaning and transforming the data, we can easily save it to the local file system by calling the to_csv method on the data frame.

In [None]:
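# Save to local file system (index=False leaves out the row-number column,
# so reading the file back later does not add an extra unnamed column)
data.to_csv('insurance.csv', index=False)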

Once you run the code block above, the file will appear on the local file system in the same directory as your Python notebook. Sometimes it takes a few seconds for SageMaker Studio to refresh before the file appears.

If you double click on the file in the local file system, SageMaker Studio will open your imported data file so that you can explore the contents.

### Importing JSON from an API

Just for fun, let's import some JSON from an API. To keep it simple, this one doesn't require an API key, but based on previous classes, you could make a more complicated request.

In [None]:
import requests
import json 

# '.get' refers to the type of request: GET, POST, or many others but those two are most common
response = requests.get("https://api.coinlore.com/api/tickers/") 

# parse the JSON response body into a Python object
json_data = json.loads(response.text)

# save to the local file system (the with block closes the file for us)
with open('coins.json', 'w') as f:
    json.dump(json_data['data'], f)

# display the data (a notebook cell only shows its last expression)
json_data['data']
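
If you want to confirm that a JSON file was written correctly, one option is to read it back and compare it with what you saved. The following is a minimal sketch of that round trip. The records in `sample_coins` and the file name `coins_sample.json` are made up for illustration and are not real API output.

```python
import json
import os
import tempfile

# Hypothetical sample records standing in for json_data['data']
sample_coins = [
    {"symbol": "BTC", "name": "Bitcoin", "rank": 1},
    {"symbol": "ETH", "name": "Ethereum", "rank": 2},
]

# Write to a temporary directory so the example does not overwrite coins.json
path = os.path.join(tempfile.mkdtemp(), "coins_sample.json")

# json.dump writes straight to the file; the with block closes it on exit
with open(path, "w") as f:
    json.dump(sample_coins, f, indent=2)

# Read the file back and check that nothing was lost along the way
with open(path) as f:
    reloaded = json.load(f)

print(len(reloaded), "records reloaded")
print(reloaded == sample_coins)
```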


In many cases, we might just save this JSON to a local file. However, we might also want to convert it into a data frame to make it easier to process in our models.

In [None]:
#convert the coins list in our JSON object to a pandas data frame
coins = pd.DataFrame(json_data['data'])

# then saving out is super simple (index=False leaves out the row-number column)
coins.to_csv('coins.csv', index=False)

In [None]:
#display the coins data
coins

# Saving Files to S3

Amazon S3 is one of AWS's major services. As you likely know, S3 is an object store that allows you to store enormous amounts of data. S3 organizes objects (files) in containers called 'buckets.' Each object can be up to 5 terabytes in size, and you can have an unlimited number of objects in any S3 bucket.

### Advantages of S3
Amazon S3 is intentionally built with a minimal feature set that focuses on simplicity and robustness. Following are some of the advantages of using Amazon S3:

* Creating buckets – Create and name a bucket that stores data. Buckets are the fundamental containers in Amazon S3 for data storage.
* Storing data – Store an infinite amount of data in a bucket. Upload as many objects as you like into an Amazon S3 bucket. Each object can contain up to 5 TB of data. Each object is stored and retrieved using a unique developer-assigned key.
* Downloading data – Download your data or enable others to do so. Download your data anytime you like, or allow others to do the same.
* Permissions – Grant or deny access to others who want to upload or download data into your Amazon S3 bucket. Grant upload and download permissions to three types of users. Authentication mechanisms can help keep data secure from unauthorized access.
* Standard interfaces – Use standards-based REST and SOAP interfaces designed to work with any internet-development toolkit.

[Source: AWS S3 User Guide](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html)
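
Because every object is addressed by a bucket name plus a key, an S3 path such as 's3://bucket/folder/file.csv' can be split into those two parts. What looks like a folder is simply the leading portion of the key. Here is a small sketch of that split; `split_s3_uri` and the bucket name `example-bucket` are hypothetical names used only for illustration.

```python
def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into (bucket, key)."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError("not an S3 URI: {!r}".format(uri))
    # Everything up to the first slash is the bucket; the rest is the key
    bucket_name, _, key = uri[len(prefix):].partition("/")
    if not bucket_name:
        raise ValueError("missing bucket name: {!r}".format(uri))
    return bucket_name, key

bucket_name, key = split_s3_uri("s3://example-bucket/projectdata/coins.csv")
print(bucket_name)  # example-bucket
print(key)          # projectdata/coins.csv
```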

### Installing the S3 Python Library
We can save files to and retrieve files from S3 using the _s3fs_ Python library. As you may have guessed, this library allows us to interact with S3 like we would any other file system. The following code installs that library on our SageMaker Studio's container so that it is available to all of our scripts.

In [None]:
# %pip install -q 's3fs'

### More code libraries

When you use SageMaker Studio, you launch it from an account that can access many other AWS services without some of the complexities (e.g., authentication/authorization) that you would have to handle in your own Python environment. Because you are running within your own AWS account, code written in SageMaker Studio can use existing libraries that provide information about your current session, including the default S3 bucket that is assigned to your SageMaker environment.

You can also override that default and specify your own S3 bucket if you want to organize your data differently.

In [None]:
#import the libraries that we will use
import s3fs
import sagemaker

# Get SageMaker session & default S3 bucket
sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket() 
# replace with your own bucket if you have one 
#bucket = "PutYourBucketNameHere" #e.g., sagemaker-us-east-1-427325791960
s3 = sagemaker_session.boto_session.resource('s3')

print(bucket)

To upload a file to an S3 bucket, we can use the _upload_file_ method off of the s3 bucket. The following code will upload...

In [None]:
s3.Bucket(bucket).upload_file("coins.csv","projectdata/coins.csv")

If we are going to be writing a lot of files to S3, it helps to wrap the upload in a small function that builds the object key from the directory and the file name.

In [None]:
#create a function to save files to S3
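def save_to_s3(filename, s3bucket, s3directory):
    # build the object key, e.g. "projectdata/coins.csv"
    key = "{}/{}".format(s3directory, filename)
    # upload to the bucket that was passed in, not the global variable
    return s3.Bucket(s3bucket).upload_file(filename, key)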


Let's now use our function to upload the _coins.json_ file and also the _insurance.csv_ file.

In [None]:
save_to_s3("insurance.csv",bucket,"projectdata")
save_to_s3("coins.json",bucket,"projectdata")
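
The key format used above works when the file name has no directory in front of it. If you ever pass a local path such as 'exports/coins.csv', you may want only the base name to end up in the key. One possible way to build the key is sketched below; `build_s3_key` is a hypothetical helper, not part of boto3 or SageMaker.

```python
import os
import posixpath

def build_s3_key(s3directory, filename):
    """Join an S3 'directory' prefix with the base name of a local file."""
    # Keep only the file's base name so local folders do not leak into the key
    basename = os.path.basename(filename)
    # Drop stray leading/trailing slashes so the key never contains '//'
    prefix = s3directory.strip("/")
    # S3 keys always use forward slashes, so join with posixpath
    return posixpath.join(prefix, basename) if prefix else basename

print(build_s3_key("projectdata", "coins.csv"))           # projectdata/coins.csv
print(build_s3_key("projectdata/", "exports/coins.csv"))  # projectdata/coins.csv
```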

# Retrieving Files from S3

There are two main ways to import or ingest your data stored on S3 so that you can use it in building your machine learning models:
* Copy the data to your SageMaker Studio and then load it from there
* Use pre-built packages that work with S3

### Option 1 - Copy Data to SageMaker Studio, Then Use it
AWS has a Command Line Interface (CLI) that you can use to copy your data from S3 to your SageMaker instance. Then, you can load your data from the local file system. This works well for small and medium-sized data files. More info about the CLI can be found [here](https://docs.aws.amazon.com/cli/latest/reference/s3/cp.html).

In [None]:
#copy data to your sagemaker instance using AWS CLI
# the first parameter is the bucket name and folder in the bucket, the second parameter is the local directory name
!aws s3 cp s3://$bucket/projectdata "projectdata" --recursive
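
If you would rather stay in Python than shell out to the CLI, a similar copy can be sketched with the boto3 bucket resource. `download_prefix` is a hypothetical helper. It assumes you pass in a bucket resource such as `s3.Bucket(bucket)` from the earlier cell, and it only relies on that resource's `objects.filter` and `download_file` methods.

```python
import os

def download_prefix(bucket_resource, prefix, local_dir):
    """Download every object under `prefix` into `local_dir`, keeping subfolders."""
    prefix = prefix.strip("/") + "/"
    downloaded = []
    for obj in bucket_resource.objects.filter(Prefix=prefix):
        # Path of the object relative to the prefix, e.g. "raw/insurance.csv"
        relative = obj.key[len(prefix):]
        # Skip "folder" placeholder keys, which are empty or end in a slash
        if not relative or relative.endswith("/"):
            continue
        target = os.path.join(local_dir, *relative.split("/"))
        # Create any local subfolders before writing the file
        os.makedirs(os.path.dirname(target) or ".", exist_ok=True)
        bucket_resource.download_file(obj.key, target)
        downloaded.append(target)
    return downloaded

# In the notebook, with `s3` and `bucket` defined as in the earlier cell:
# download_prefix(s3.Bucket(bucket), "projectdata", "projectdata")
```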

### Option 2 - Use Pre-built Packages That Work with S3
For large datasets, or simply to keep all of your data in one place, you can use pre-built packages to access your files in S3 directly without copying them to your local file system. When the _s3fs_ library is installed, _pandas_ can read from an S3 file path (i.e., 's3://') much like it can use other prefixes (e.g., 'file://', 'https://', 'ftp://') to access files stored locally or through remote services. More information can be found [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html).

In [None]:
s3_file = "s3://{}/{}/{}".format(bucket, "projectdata", "coins.json")
print(s3_file)

In [None]:
df_json = pd.read_json(s3_file, orient='records')
df_json.head()

We can also read other files, like CSV files, directly from their location on S3. You can hardcode the URL if you would like.

In [None]:
#example hard coded URL
#df_csv = pd.read_csv("s3://sagemaker-us-east-1-427325791960/projectdata/insurance.csv")
    
#since we are dynamically specifying our bucket, we will format the string and pass in the values    
df_csv = pd.read_csv("s3://{}/{}/{}".format(bucket, "projectdata", "insurance.csv"))
df_csv.head()

Whenever possible, I recommend using this second method for referencing your model data. With this approach you aren't constantly shifting data back and forth between environments.

### Use the S3 File System Directly
Sometimes you will want to interact with your files on S3 (e.g., get a list of all files that need to be processed). The _s3fs_ library is a Pythonic file interface to S3. It builds on top of botocore. The top-level class S3FileSystem holds connection information and allows typical file-system style operations like cp, mv, ls, glob, etc., as well as put/get of local files to/from S3.

In [None]:
fs = s3fs.S3FileSystem()
data_s3fs_location = "s3://{}/{}/".format(bucket, "projectdata")
# List all files under the projectdata folder of the bucket
fs.ls(data_s3fs_location)

In [None]:
# open it directly with s3fs
data_s3fs_location = "s3://{}/{}/{}".format(bucket, "projectdata", "insurance.csv") # S3 URL
with fs.open(data_s3fs_location) as f:
    # the file is comma-separated, so the default separator is what we want
    print(pd.read_csv(f, nrows=2))

Then, you could process each file. In this case, we simply output a few lines. More likely, you would load each file one by one into a data frame and process the records in all of the files.
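
As a rough sketch of that loop, the helpers below list the CSV files under a location and hand each open file to a function of your choice. `csv_paths` and `process_each` are hypothetical names. The sketch assumes `fs` behaves like the `s3fs.S3FileSystem` object created above, using only its `ls` and `open` methods.

```python
import posixpath

def csv_paths(fs, location):
    """Return the paths under `location` that end in .csv, sorted by name."""
    return sorted(p for p in fs.ls(location) if p.lower().endswith(".csv"))

def process_each(fs, location, handler):
    """Open every CSV under `location` and pass the open file to `handler`."""
    results = {}
    for path in csv_paths(fs, location):
        with fs.open(path, "rb") as f:
            # Store each result under the file's base name, e.g. "coins.csv"
            results[posixpath.basename(path)] = handler(f)
    return results

# In the notebook, with `fs`, `bucket`, and `pd` defined as in the cells above:
# frames = process_each(fs, "s3://{}/projectdata/".format(bucket), pd.read_csv)
# all_rows = pd.concat(list(frames.values()), ignore_index=True)
```

Passing the reader in as `handler` keeps the loop independent of the file type, so the same pattern could be reused with a different filter and `pd.read_json`.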