# <center>Big Data for Engineers &ndash; Exercises &ndash; Solution</center>
## <center>Spring 2023 &ndash; Week 2 &ndash; ETH Zurich</center>

## Exercise 1: Storage devices

In this exercise, we want to understand the differences between [SSD](https://en.wikipedia.org/wiki/Solid-state_drive), [HDD](https://en.wikipedia.org/wiki/Hard_disk_drive), and [SDRAM](https://en.wikipedia.org/wiki/Synchronous_dynamic_random-access_memory) in terms of __capacity__, __speed__ and __price__. 

### Task 1
Fill in the table below by visiting your local online hardware store and choosing, for each device type, the device with the largest available capacity while optimizing for read/write speed.
For instance, you can visit Digitec.ch to explore the prices on [SSDs](https://www.digitec.ch/en/s1/producttype/ssd-545?tagIds=76), [HDDs](https://www.digitec.ch/en/s1/producttype/hard-drives-36?tagIds=76), and 
[SDRAMs](https://www.digitec.ch/en/s1/producttype/memory-2?tagIds=76). 
You are free to use any other website for filling the table. 

| Storage Device | Maximum capacity, GB | Price, CHF/GB  | Read speed, GB/s | Write speed, GB/s | Link |
| --------------:| --------------------:| --------------:|-----------------:|------------------:|------|
| HDD            |                      |                |                  |                   |&nbsp;|
| SSD            |                      |                |                  |                   |&nbsp;|
| DRAM           |                      |                |                  |                   |&nbsp;|


### Task 2
Answer the following questions:
1. Which type of storage device above is the cheapest?
2. Which type of storage device above is the fastest in terms of read speed?

### Solution
Looking at digitec.ch, we complete the table as follows:

| Storage Device | Maximum capacity, GB | Price, CHF/GB  | Read speed, GB/s | Write speed, GB/s | Link |
| --------------:| --------------------:| --------------:|-----------------:|------------------:|------|
| HDD            |        20000 (20 TB) |        0.02345 |             0.29 |              0.29 |[Link](https://www.digitec.ch/en/s1/product/seagate-ironwolf-pro-20-tb-35-cmr-hard-drives-17728311?supplier=406802)|
| SSD            |      30720 (30.7 TB) |          0.232 |              2.1 |               1.7 |[Link](https://www.digitec.ch/en/s1/product/samsung-enterprise-pm1643-30720-gb-25-ssd-10110860?supplier=406802)|
| DRAM           |                128** |          7.109 |             ~60* |              ~48* |[Link](https://www.digitec.ch/en/s1/product/kingston-ddr4-3200mhz-lrdimm-quad-rank-module-1-x-128gb-lr-dimm-memory-17550606?supplier=406802)|

*RAM speeds are usually specified in MT/s (megatransfers per second) rather than GB/s. Actual transfer rates in GB/s also depend on the CPU and motherboard, and are usually measured empirically rather than given by the specification.

**The new DDR5 standard offers much higher speeds, but a single module is still limited to 32 GB of capacity.

1. HDDs are the cheapest (per GB) of the storage devices listed.
2. DRAM is the fastest of the storage devices listed.
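The two answers can be checked with a few lines of arithmetic. The sketch below hard-codes the values from the solution table above (the dictionary layout and function name are our own choices, not part of the exercise) and additionally estimates the total price and the time needed to read each device once from start to finish.

```python
# Values copied from the solution table; DRAM speeds are approximate.
devices = {
    "HDD": {"capacity_gb": 20000, "chf_per_gb": 0.02345, "read_gb_s": 0.29},
    "SSD": {"capacity_gb": 30720, "chf_per_gb": 0.232, "read_gb_s": 2.1},
    "DRAM": {"capacity_gb": 128, "chf_per_gb": 7.109, "read_gb_s": 60},
}


def summarize(devices):
    """Derive total price (CHF) and full sequential read time (hours)."""
    rows = {}
    for name, d in devices.items():
        rows[name] = {
            "total_chf": d["capacity_gb"] * d["chf_per_gb"],
            "full_read_hours": d["capacity_gb"] / d["read_gb_s"] / 3600,
        }
    return rows


# Task 2: lowest price per GB and highest read speed.
cheapest = min(devices, key=lambda n: devices[n]["chf_per_gb"])
fastest = max(devices, key=lambda n: devices[n]["read_gb_s"])

for name, row in summarize(devices).items():
    print(f"{name}: {row['total_chf']:.2f} CHF, "
          f"{row['full_read_hours']:.2f} h to read once")
print("cheapest per GB:", cheapest, "| fastest read:", fastest)
```

Under these numbers, reading the whole HDD once takes roughly 19 hours, while reading the whole DRAM module takes about 2 seconds, which illustrates why capacity and throughput need to be considered together.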

## Exercise 2: HTTP

HTTP is the underlying protocol used by the World Wide Web. It defines how messages are formatted and transmitted, and what actions Web servers and browsers should take in response to various commands.

The HTTP protocol is based on requests (usually made by a client, e.g., your web browser) and responses (usually made by a server hosting a particular website or application).

When we visit websites, our browser makes a multitude of HTTP requests to retrieve all the resources needed to render the webpage! If you're curious, you can try keeping the 'Network' tab of your browser's 'Developer Tools' open while visiting several websites.

#### Example Request:
> <font color="#990000">POST</font> <font color="blue">/path/index.html</font> <font color="#bf9000">HTTP/1.1</font>  
<font color="green">Host: www.example.com  
User-Agent: Mozilla/4.0</font><br>
BookId=3131&Author=Asimov

<font color="#990000">This fragment indicates the HTTP method used</font><br>
<font color="blue">This indicates the relative path of the resource</font><br>
<font color="#bf9000">This is the HTTP version</font><br>
<font color="green">These are the headers of the request (1 per line)</font><br>
<font>This is the body of the request</font>
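To make the structure of the request concrete, here is a minimal sketch that splits the example request into the five parts described above. It assumes the on-the-wire form of HTTP/1.1, where lines end with `\r\n` and an empty line separates the headers from the body; the function name `parse_request` is our own and this is not a full HTTP parser.

```python
from urllib.parse import parse_qs

# The example request as it would appear on the wire.
RAW_REQUEST = (
    "POST /path/index.html HTTP/1.1\r\n"
    "Host: www.example.com\r\n"
    "User-Agent: Mozilla/4.0\r\n"
    "\r\n"
    "BookId=3131&Author=Asimov"
)


def parse_request(raw):
    """Split a raw request into method, path, version, headers, and body."""
    head, _, body = raw.partition("\r\n\r\n")
    request_line, *header_lines = head.split("\r\n")
    method, path, version = request_line.split(" ")
    headers = dict(line.split(": ", 1) for line in header_lines)
    return method, path, version, headers, body


method, path, version, headers, body = parse_request(RAW_REQUEST)
print(method, path, version)
print(headers)
print(parse_qs(body))  # the body here is form-encoded key=value pairs
```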


#### Example Response
> <font color="#bf9000">HTTP/1.1</font> <font color="#990000">200 OK</font>  
<font color="green">Date: Tue, 25 Sep 2018 09:48:34 GMT  
Content-Type: text/html; charset=UTF-8  
Content-Length: 138  
</font>
&lt;html> &lt;head> &lt;title>An Example Page&lt;/title>
&lt;/head> &lt;body> Hello World, this is a very simple
HTML document. &lt;/body> &lt;/html>

<font color="#990000">This fragment indicates the status code of the response</font><br>
<font color="#bf9000">This is the HTTP version</font><br>
<font color="green">These are the headers of the response (1 per line)</font><br>
<font>This is the body of the response</font>

### HTTP Methods

Consider a well-designed object storage service providing a REST API implemented over HTTP.

1. Which HTTP method allows the retrieval of objects on the server? What do we mean by saying that it should be side-effect free?  
2. Which HTTP methods allow the insertion and deletion of objects from the server, respectively?
3. Which other generic method allows for sending information and/or receiving results?

### Solution

1. GET. This means that it should cause no modifications on the server.
2. PUT, DELETE
3. POST
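The roles of GET, PUT, and DELETE can be illustrated with a toy object store built on Python's standard library. This is only a sketch under our own assumptions: objects live in an in-memory dictionary, the status codes are one reasonable choice, and the names `ObjectHandler`, `call`, and `demo` are hypothetical. It is not how any real object storage service is implemented.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

STORE = {}  # in-memory "object store": request path -> bytes
# Opener that ignores any configured proxy, since we only talk to localhost.
OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


class ObjectHandler(BaseHTTPRequestHandler):
    def _reply(self, status, body=b""):
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        # Retrieval only: never modifies STORE (side-effect free).
        if self.path in STORE:
            self._reply(200, STORE[self.path])
        else:
            self._reply(404)

    def do_PUT(self):
        # Insert (or overwrite) the object stored under this path.
        length = int(self.headers.get("Content-Length", 0))
        STORE[self.path] = self.rfile.read(length)
        self._reply(200)

    def do_DELETE(self):
        existed = STORE.pop(self.path, None) is not None
        self._reply(204 if existed else 404)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def call(method, url, data=None):
    """Send one request and return (status code, response body)."""
    req = urllib.request.Request(url, data=data, method=method)
    try:
        with OPENER.open(req, timeout=10) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        return err.code, err.read()


def demo():
    """GET (missing), PUT, GET, DELETE, GET (missing again)."""
    server = HTTPServer(("127.0.0.1", 0), ObjectHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = f"http://127.0.0.1:{server.server_address[1]}/exercise-2/hello.txt"
    try:
        return [
            call("GET", url),
            call("PUT", url, b"hello"),
            call("GET", url),
            call("DELETE", url),
            call("GET", url),
        ]
    finally:
        server.shutdown()
        server.server_close()
```

Calling `demo()` should yield the status sequence 404, 200, 200, 204, 404: the object only exists between the PUT and the DELETE, and the GET requests never change the stored state.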

## Exercise 3: Storage

### Object Storage, Scalability

1. What are the four most important traits of Object Storage that allow scalability?
2. What are the two ways through which you can scale beyond one machine?

#### Solution
1. Black box objects, key-value model, flexible metadata, commodity hardware.
2. Horizontal scalability (more nodes); vertical scalability (more powerful nodes)
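As a rough illustration of why the key-value model with black-box objects lends itself to horizontal scaling, the sketch below spreads objects over several "nodes" by hashing the key. Everything here is an assumption made for illustration (the class `ObjectStore`, the helper `node_for`, and simple modulo placement); real services use more elaborate placement schemes.

```python
import hashlib


def node_for(key, num_nodes):
    """Pick a node index deterministically from the object's key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


class ObjectStore:
    """Toy store: each node is a dict mapping key -> (data, metadata)."""

    def __init__(self, num_nodes):
        self.nodes = [{} for _ in range(num_nodes)]

    def put(self, key, data, **metadata):
        # The store never looks inside `data`: it is a black box of bytes.
        self.nodes[node_for(key, len(self.nodes))][key] = (data, metadata)

    def get(self, key):
        return self.nodes[node_for(key, len(self.nodes))][key]


store = ObjectStore(num_nodes=4)
store.put("eth_zurich.jpeg", b"\xff\xd8...", content_type="image/jpeg")
data, metadata = store.get("eth_zurich.jpeg")
print(metadata)
```

Because only the key determines where an object lives, no node needs to know about the others' contents, which is what makes adding more commodity machines straightforward in principle.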

### Azure Blob Storage vs Amazon S3

For each question, give the answer for both Azure Blob Storage and Amazon S3.

1. How are objects identified?
- S3: Bucket, Object ID
- Azure: Account + Container + Blob
2. What kind of objects can you create?

#### Solution
* Azure
    1. Account ID + Container ID + Blob ID
    2. 3 types of blobs: BlockBlob, PageBlob, AppendBlob
* S3
    1. Bucket ID + Object ID
    2. Blackbox objects

## Exercise 4: Setting up a Distributed Object Storage

In this section, we will see how to set up an Object Storage instance. Instead of using a commercial solution such as Microsoft Azure or Amazon S3, we will use a freely available solution, *MinIO*.


### Step 1: Setting up MinIO

1. First, log into the MinIO portal hosted locally at [localhost:9001](http://localhost:9001) with your username and password. The credentials specified in the docker-compose file are username `admin` and password `supersecret`.

<img src="images/login.png" width=800/>

2. The object store is currently completely empty. We will next create a new bucket to add files to. Go to the *Buckets* tab from the left side pane and create a new bucket.

<img src="images/new_bucket.png" width=800/>

We will be using `exercise-2` as the bucket name. After the bucket is created, we can view more details about it by clicking on it.

<img src="images/new_bucket_info.png" width=800/>

### Step 2: Uploading Files
3. We can finally start uploading files to our storage server. Go to the *Object Browser* tab and click on the newly created bucket. 
      
<img src="images/uploading_file.png" width=800/>

Click on `Upload -> Upload File`. You can upload any file of your choice. There is also a sample image in the *test_image* folder for this step.

    
After uploading, your window should look something like this. You can click on the uploaded file to view more details:

<img src="images/view_file_info.png" width=800/>

4. Now that we have uploaded the file, we can generate a link to share/download this file. Click on *Share* in the right side pane. This should bring up a pop up screen as shown below:

<img src="images/share_file.png" width=500/>

Note that the link contains a lot of characters after the file name. These carry security information such as the *Validity Period*, *Signature*, *Signing Algorithm* & *Credentials*. Try opening this link in a separate browser tab. If the file type you uploaded can be rendered by your browser, you should be able to view the uploaded file.
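To see what these extra characters encode, you can take a share link apart. The sketch below assumes the link follows the S3 presigned-URL format (Signature Version 4 query parameters such as `X-Amz-Expires`); the URL is a made-up example with a dummy access key and signature, not a real link, and the function name is our own.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical share link: key, date, and signature are dummy values.
SHARE_LINK = (
    "http://localhost:9000/exercise-2/eth_zurich.jpeg"
    "?X-Amz-Algorithm=AWS4-HMAC-SHA256"
    "&X-Amz-Credential=EXAMPLEKEY%2F20230804%2Fus-east-1%2Fs3%2Faws4_request"
    "&X-Amz-Date=20230804T154505Z"
    "&X-Amz-Expires=43200"
    "&X-Amz-SignedHeaders=host"
    "&X-Amz-Signature=0123abcd"
)


def describe_share_link(url):
    """Extract bucket, object, and the security-related query parameters."""
    parts = urlsplit(url)
    params = {name: values[0] for name, values in parse_qs(parts.query).items()}
    bucket, _, key = parts.path.lstrip("/").partition("/")
    return {
        "bucket": bucket,
        "object": key,
        "algorithm": params.get("X-Amz-Algorithm"),
        "credential": params.get("X-Amz-Credential"),
        "issued_at": params.get("X-Amz-Date"),
        "valid_for_seconds": int(params["X-Amz-Expires"]),
        "signature": params.get("X-Amz-Signature"),
    }


for field, value in describe_share_link(SHARE_LINK).items():
    print(f"{field}: {value}")
```

Pasting your own share link into `SHARE_LINK` lets you check how long it remains valid. If your MinIO version produces links in a different format, the parameter names will not match and the function will need adjusting.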

### Step 3: Accessing Buckets

5. Next, let's try to download the file from the object storage server. Edit the link below to include the bucket name and file name (with the extension) you used in previous steps.

In [18]:
!wget http://host.docker.internal:9000/exercise-2/eth_zurich.jpeg

--2023-08-04 15:45:05--  http://host.docker.internal:9000/exercise-2/eth_zurich.jpeg
Resolving proxy.ethz.ch (proxy.ethz.ch)... 129.132.202.155
Connecting to proxy.ethz.ch (proxy.ethz.ch)|129.132.202.155|:3128... connected.
Proxy request sent, awaiting response... 307 Temporary Redirect
Location: https://proxybd.ethz.ch/cgi-bin/login.pl?url=http%3A%2F%2Fhost.docker.internal%3A9000%2Fexercise-2%2Feth_zurich.jpeg [following]
--2023-08-04 15:45:05--  https://proxybd.ethz.ch/cgi-bin/login.pl?url=http%3A%2F%2Fhost.docker.internal%3A9000%2Fexercise-2%2Feth_zurich.jpeg
Connecting to proxy.ethz.ch (proxy.ethz.ch)|129.132.202.155|:3128... connected.
Proxy request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘eth_zurich.jpeg.1’

eth_zurich.jpeg.1       [ <=>                ]   1.15K  --.-KB/s    in 0.005s  

2023-08-04 15:45:05 (251 KB/s) - ‘eth_zurich.jpeg.1’ saved [1182]



The server should have responded with an `HTTP 403 Forbidden` error. This is because buckets are configured to be *Private* by default. Therefore, we need either a proper access key or a complete share link to access the file.

6. In the left-hand menu, go to the *Buckets* menu and edit your bucket to have `Public` access policy.

<img src="images/update_bucket_access.png" width=800/>

Let's try downloading the file again:

In [19]:
!wget http://host.docker.internal:9000/exercise-2/eth_zurich.jpeg

--2023-08-04 15:45:19--  http://host.docker.internal:9000/exercise-2/eth_zurich.jpeg
Resolving proxy.ethz.ch (proxy.ethz.ch)... 129.132.202.155
Connecting to proxy.ethz.ch (proxy.ethz.ch)|129.132.202.155|:3128... connected.
Proxy request sent, awaiting response... 307 Temporary Redirect
Location: https://proxybd.ethz.ch/cgi-bin/login.pl?url=http%3A%2F%2Fhost.docker.internal%3A9000%2Fexercise-2%2Feth_zurich.jpeg [following]
--2023-08-04 15:45:19--  https://proxybd.ethz.ch/cgi-bin/login.pl?url=http%3A%2F%2Fhost.docker.internal%3A9000%2Fexercise-2%2Feth_zurich.jpeg
Connecting to proxy.ethz.ch (proxy.ethz.ch)|129.132.202.155|:3128... connected.
Proxy request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘eth_zurich.jpeg.2’

eth_zurich.jpeg.2       [ <=>                ]   1.15K  --.-KB/s    in 0.01s   

2023-08-04 15:45:19 (117 KB/s) - ‘eth_zurich.jpeg.2’ saved [1182]



The file should now have downloaded into your current working directory.
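The same download can be attempted from Python, which makes the status code explicit. This is a minimal sketch using only the standard library; the helper names `object_url` and `try_download` are our own, and the opener is built without proxy support on the assumption that a request to a local host name should not be routed through an external proxy.

```python
import urllib.error
import urllib.request

# Opener that ignores any proxy configured in the environment.
NO_PROXY = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def object_url(endpoint, bucket, key):
    """Build the plain (unsigned) URL of an object in a bucket."""
    return f"{endpoint.rstrip('/')}/{bucket}/{key}"


def try_download(url, dest):
    """Return the HTTP status code; write the body to dest only on success."""
    try:
        with NO_PROXY.open(url, timeout=10) as resp:
            status = resp.status
            body = resp.read()
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 403 while the bucket is still private
    with open(dest, "wb") as out:
        out.write(body)
    return status


url = object_url("http://host.docker.internal:9000", "exercise-2", "eth_zurich.jpeg")
print(url)
# With the MinIO server running, you could then call:
# print(try_download(url, "eth_zurich.jpeg"))
```

With a private bucket, the call should return 403 and leave no file behind; after switching the access policy to `Public`, it should return 200 and save the image.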

In this exercise, we set up a locally hosted object storage server using MinIO. MinIO is an Amazon S3-compatible object storage solution. Later in this course, we will use Microsoft Azure for Object Storage. While there are differences between these services, you will see that a number of concepts such as Buckets (Containers in Azure), Access Keys & Access Policies translate well between them.


## What's next:

1. *Public MinIO server*: MinIO provides a public *play* server at [https://play.min.io](https://play.min.io). You can explore the steps described above on a server that is not hosted on your machine. Note that any files uploaded to this server are considered public and unprotected, so be careful what you upload there. For more information, see [https://min.io/docs/minio/container/index.html](https://min.io/docs/minio/container/index.html).

2. *MinIO console client*: MinIO also comes with a console client to communicate with the servers. You can use this client to practice querying object storage servers and using access keys. For more information, see the documentation at [https://min.io/docs/minio/container/administration/minio-console.html#minio-console](https://min.io/docs/minio/container/administration/minio-console.html#minio-console).

3. *Microsoft Azure / Amazon S3*: These solutions usually provide students with a number of free credits when creating new accounts. You can also try getting some credits on these services to try out these solutions.  