# Cloud Storage basics / AWS S3

### Pedagogic goals

- Discover Cloud storage
- Understand the fundamentals of AWS S3
- Learn to upload and retrieve files from S3 Cloud Storage with Python and `boto3`

## 1. Introduction

Cloud storage refers to the concept of storing any kind of data somewhere accessible on the Internet, without worrying about its actual location.

Cloud storage emerged in the 1990s, when the Internet made it easy to share data from one machine to another. The big shift came in 2006, when Amazon launched its popular S3 solution. The underlying technology typically relies on many virtual servers (sometimes hosted in different regions) to achieve the best scalability and availability. To the end user, however, this technology acts as a black box, and most of the time interacting with a Cloud storage solution is as simple as interacting with data located on a classical file system.

For legal or security reasons, Cloud storage providers let you choose the country or geographic region where your data is stored. They also let you choose whether your data should be publicly available, stored encrypted, automatically deleted after some period of time, etc.
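
As a sketch of what these options can look like with AWS S3 and `boto3` (both covered later in this lesson), the snippet below builds the request parameters for creating a bucket in a chosen region and for deleting its objects automatically after a number of days. The helper names, bucket name and region are hypothetical; the dictionaries follow the shapes expected by the S3 client's `create_bucket` and `put_bucket_lifecycle_configuration` calls.

```python
def bucket_creation_params(bucket_name, region):
    """Build keyword arguments for the S3 client's create_bucket call."""
    params = {"Bucket": bucket_name}
    # us-east-1 is the default region and must not be passed as a constraint.
    if region != "us-east-1":
        params["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return params


def expiration_rule(days, prefix=""):
    """Build a lifecycle configuration that deletes objects after `days` days."""
    return {
        "Rules": [
            {
                "ID": f"expire-after-{days}-days",
                "Filter": {"Prefix": prefix},  # empty prefix = every object
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }


def create_bucket_with_expiry(s3_client, bucket_name, region, expire_days):
    """Create a bucket in `region` whose objects expire after `expire_days`."""
    s3_client.create_bucket(**bucket_creation_params(bucket_name, region))
    s3_client.put_bucket_lifecycle_configuration(
        Bucket=bucket_name,
        LifecycleConfiguration=expiration_rule(expire_days),
    )
```

With `boto3` installed and AWS credentials configured, a call could look like `create_bucket_with_expiry(boto3.client("s3", region_name="eu-west-3"), "my-example-bucket", "eu-west-3", 30)`.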

### Reasons

There are two main reasons why it may be a good idea to store data in Cloud storage.

The first one is organizational, and is the same as for any other cloud service: your provider (whether it's Amazon, Microsoft, Google...) takes care of maintenance, redundancy and availability, so you don't have to worry about anything other than downloading or uploading your data. You pay for the storage space and the network traffic you use, not for the physical machines. The extra cost is accepted in exchange for peace of mind.

The other reason is more technical. By storing your data in the Cloud, you make it available to any machine that may need it. With local storage, you would need to duplicate the data on the file system of every machine that needs it. By contrast, a Cloud solution frees you from worrying about data accessibility: all you need is a machine connected to the Internet, which is usually a given anyway!

In distributed computing (such as when you use Spark), this second reason matters even more: if your computation is split across several machines (we call them *nodes*), a Cloud solution spares you the trouble and inefficiency of duplicating your data on every node.

## 2. Cloud storage providers

Today, there are many Cloud storage providers to choose from. Each has its idiosyncrasies, but most share the same basic service offering. The choice depends on the rest of your infrastructure (for instance, if your team already works with Google Cloud Platform, then Google Cloud Storage is the natural choice), the optimizations you are looking for, the geographical location of the data centers, the price, etc.

Here is the list of the biggest Cloud storage providers. In the rest of this lesson, we will exclusively work with Amazon S3, the most popular solution.

- Amazon Web Services S3
- Google Cloud Storage
- Microsoft Azure Blob Storage
- IBM Cloud Object Storage
- DigitalOcean Spaces

### Alternatives

There are also alternatives that are not Cloud storage per se but, depending on your needs, may be enough for your use case. This topic is beyond the scope of this lecture, but here are some examples if you want to dig further.

- HDFS

The *Hadoop Distributed File System* provides a framework to store large amounts of data across different machines, using a more standard "file system" paradigm, as opposed to Cloud storage, which is most of the time "object-based" (more on that later).

- NFS

The Network File System standard is an older way of storing data accessible over a network. Its main limitation is that it is not suited to handling huge amounts of data.

## 3. AWS S3

S3, or *Simple Storage Service*, is the most commonly used Cloud storage solution in the professional world. The service was launched by Amazon Web Services in 2006.

### S3 basics

There are three main S3 concepts that you must absolutely understand today: buckets, objects and keys.

An S3 **bucket** is a unique resource (the equivalent of a folder) where you store your **objects** (the equivalent of files), each under a given **key** (the equivalent of the file's name). An object consists of data and its descriptive metadata, such as its size, encoding, etc.

S3 stores data using *object storage*, as opposed to the file storage you are used to on your laptop (folders containing named files). To better understand the difference, have a look at [this video](https://www.youtube.com/watch?v=zfA7EeblmZI), or check out the resources listed at the end of this lecture. The key benefits of the object storage paradigm are virtually unlimited scalability, the ability to easily spread data across several regions, and much better retrieval performance when dealing with big files (above 1 GB).
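
To make the difference with a file system concrete, here is a toy in-memory model of a bucket. It is only an illustration, not how S3 is implemented: the `ToyBucket` class, its method names and the sample keys are all hypothetical. It shows that a bucket behaves like a flat mapping from keys to objects (data plus metadata), and that a "folder" is merely a prefix shared by several keys.

```python
class ToyBucket:
    """A flat key -> object mapping, mimicking the object storage model."""

    def __init__(self, name):
        self.name = name
        self._objects = {}  # no nested folders: one flat dictionary

    def put(self, key, data, content_type="application/octet-stream"):
        """Store `data` (bytes) under `key`, along with some metadata."""
        self._objects[key] = {
            "data": data,
            "metadata": {"size": len(data), "content_type": content_type},
        }

    def get(self, key):
        """Return the bytes stored under `key`."""
        return self._objects[key]["data"]

    def list_keys(self, prefix=""):
        """List keys starting with `prefix`, the closest thing to a folder."""
        return sorted(k for k in self._objects if k.startswith(prefix))


bucket = ToyBucket("my-example-bucket")
bucket.put("reports/2023/sales.csv", b"id,amount\n1,42\n", "text/csv")
bucket.put("reports/2024/sales.csv", b"id,amount\n1,57\n", "text/csv")
bucket.put("logo.png", b"\x89PNG")

print(bucket.list_keys("reports/"))
# -> ['reports/2023/sales.csv', 'reports/2024/sales.csv']
```

Note that `reports/2023/` is never created as such: the slashes are just characters in the key.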

You can store data in S3 under different conditions, which influence the cost of your storage. For instance, if you want to store a tremendous amount of data for a very long period of time and won't need to access it very often, you can choose [S3 Glacier](https://aws.amazon.com/glacier/). Its cost structure is optimized for longevity over accessibility. In exchange for this tradeoff, Amazon offers a better price per GB.
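
As an illustration, the storage class can be chosen per object at upload time. The sketch below assumes a `boto3` S3 client is passed in as `s3_client`; the function name, bucket and key are hypothetical, while `StorageClass` is the parameter that `put_object` accepts for this purpose.

```python
def archive_object(s3_client, bucket, key, data, storage_class="GLACIER"):
    """Upload `data` (bytes) directly into the given storage class."""
    return s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=data,
        StorageClass=storage_class,  # e.g. "STANDARD" or "GLACIER"
    )
```

With credentials configured, a call could look like `archive_object(boto3.client("s3"), "my-example-bucket", "backups/2020.tar.gz", payload)`, where `payload` holds the bytes to archive.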

The most basic (and most commonly used) S3 configuration is more expensive but offers better accessibility. Note that you don't pay only for storage space, but also for data transfer (to and from your buckets). The overall cost structure of AWS S3 is a bit complex, so Amazon lets you estimate your costs with a [cost calculator](https://calculator.s3.amazonaws.com/index.html).
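
To see how storage and transfer add up, here is a deliberately simplified estimate. The function name is hypothetical and the prices are placeholders chosen for easy arithmetic, not actual AWS rates; the real pricing has more components (request charges, for instance), so rely on the calculator for real figures.

```python
def estimate_monthly_cost(stored_gb, transferred_gb,
                          price_per_gb_stored, price_per_gb_transferred):
    """Rough monthly estimate: storage cost plus data transfer cost."""
    storage_cost = stored_gb * price_per_gb_stored
    transfer_cost = transferred_gb * price_per_gb_transferred
    return storage_cost + transfer_cost


# Placeholder prices for illustration only, NOT actual AWS rates.
cost = estimate_monthly_cost(
    stored_gb=500,
    transferred_gb=100,
    price_per_gb_stored=0.02,
    price_per_gb_transferred=0.10,
)
print(f"Estimated monthly cost: ${cost:.2f}")
```

With these placeholder values, storage and transfer each contribute half of the total, which shows why transfer should not be ignored when budgeting.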

### Interaction with S3

The permission system is highly flexible, allowing you to reproduce any complex organization policy: who is allowed to add, delete or modify objects in a given bucket, etc. S3 also provides a versioning system that, once enabled, lets you retrieve any previous version of an object. Finally, for security purposes, you can choose to store your data fully encrypted with the encryption key of your choice.
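
A minimal sketch of the versioning and encryption features with `boto3` could look as follows. It assumes an S3 client is passed in as `s3_client`; the function names are hypothetical, while `put_bucket_versioning`, `put_object` and `list_object_versions` are existing S3 client calls. Only encryption with S3-managed keys is shown here.

```python
def enable_versioning(s3_client, bucket):
    """Turn on versioning so that overwritten objects keep their history."""
    s3_client.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={"Status": "Enabled"},
    )


def put_encrypted(s3_client, bucket, key, data):
    """Upload `data` with server-side encryption using S3-managed keys."""
    return s3_client.put_object(
        Bucket=bucket, Key=key, Body=data, ServerSideEncryption="AES256"
    )


def list_version_ids(s3_client, bucket, key):
    """Return the version ids recorded for `key` (first page of results only)."""
    response = s3_client.list_object_versions(Bucket=bucket, Prefix=key)
    return [
        v["VersionId"] for v in response.get("Versions", []) if v["Key"] == key
    ]
```

To encrypt with a key of your choice instead, `put_object` also accepts `ServerSideEncryption="aws:kms"` together with `SSEKMSKeyId`.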

There are three ways to work with S3: directly via the web console, using the [AWS CLI](https://aws.amazon.com/cli/) in a terminal, or with the official and unofficial [SDKs](https://aws.amazon.com/tools/) (Software Development Kits) available on the Internet.

SDKs exist for dozens of programming languages, including Python, which makes it easy to interface with S3. `boto3` is the official open-source library for working with AWS from Python, and you can install it with a simple `pip install boto3`. Note that `boto3` is not designed specifically for S3, but for all AWS resources.
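
As a first taste, here is a minimal sketch of uploading and retrieving a file with `boto3`. It assumes `boto3` is installed and that AWS credentials are already configured on your machine; the wrapper function names, the bucket name and the file paths are hypothetical, while `upload_file` and `download_file` are methods of the `boto3` S3 client.

```python
def make_s3_client():
    """Create an S3 client; needs boto3 installed and AWS credentials set up."""
    import boto3

    return boto3.client("s3")


def upload(s3_client, local_path, bucket, key):
    """Upload the local file at `local_path` to `bucket` under `key`."""
    s3_client.upload_file(Filename=local_path, Bucket=bucket, Key=key)


def download(s3_client, bucket, key, local_path):
    """Download the object stored under `key` in `bucket` to `local_path`."""
    s3_client.download_file(Bucket=bucket, Key=key, Filename=local_path)
```

A typical session would then be `s3 = make_s3_client()`, followed by `upload(s3, "data.csv", "my-example-bucket", "raw/data.csv")` and `download(s3, "my-example-bucket", "raw/data.csv", "data_copy.csv")`. Passing the client as an argument keeps the helpers easy to test with a stand-in object.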

### Key take-away

- Core object: the **bucket**
- Can contain any kind of data, any size
- Built-in versioning and encryption features
- Flexible and editable permission policies, on any object within the bucket
- SDKs to automate data storage processing with dozens of programming languages

## 4. Extra resources

- [Working with AWS S3](https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingBucket.html) (official docs)
- [Boto3 S3 documentation](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html)
- [Object Storage VS File Storage](https://cloudian.com/blog/object-storage-vs-file-storage/)

**Cloud storage alternatives**

- [What is HDFS?](https://www.ibm.com/analytics/hadoop/hdfs)
- [Network File System](https://en.wikipedia.org/wiki/Network_File_System) (Wikipedia)
