# Mastering Applied Skills in Management, Analytics and Entrepreneurship

## DATA COLLECTION TECHNIQUES
## Part II. Load from object storage

__NOTE:__ run this notebook in the `Data Science environment`.

### 1. Libraries and credentials

See the [boto3 documentation](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) for details on the library.

In [None]:
# optional: upgrade pandas and fsspec
# (section 7 also requires the s3fs package)
!pip install -U pandas
!pip install -U fsspec

In [None]:
import os
import sys
import json
import boto3
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', None)

In [None]:
def access_data(file_path):
    """
    Reads JSON data from a file.
    The JSON object is returned
    as a Python dictionary.

    Keyword arguments:
      file_path: path to a file with JSON data

    """
    with open(file_path) as file:
        # use a separate name so the function is not shadowed
        data = json.load(file)
    return data


creds = access_data(file_path='data/access_bucket.json')
print(creds.keys())
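
The cells below read three fields from the credentials file: `aws_access_key_id`, `aws_secret_access_key` and `bucket_name`. As a minimal sketch, a loader can check for them up front so that a missing field fails early with a clear message. The name `load_credentials` and the constant `REQUIRED_KEYS` are hypothetical helpers, not part of the original notebook.

```python
import json

# Fields that the later cells of this notebook look up in the credentials file
REQUIRED_KEYS = {'aws_access_key_id', 'aws_secret_access_key', 'bucket_name'}


def load_credentials(file_path):
    """Read the JSON credentials file and verify the expected fields exist."""
    with open(file_path) as file:
        data = json.load(file)
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise KeyError(f'missing credential fields: {sorted(missing)}')
    return data
```

It could be used in place of `access_data`, for example `creds = load_credentials('data/access_bucket.json')`.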

### 2. Session and client for loading

In [None]:
session = boto3.session.Session()
s3 = session.client(
    service_name='s3',
    aws_access_key_id=creds['aws_access_key_id'],
    aws_secret_access_key=creds['aws_secret_access_key'],
    endpoint_url='https://storage.yandexcloud.net'
)

In [None]:
DEMO_BUCKET = creds['bucket_name']
print('bucket for master classes Fall 2024:', DEMO_BUCKET)

In [None]:
# `list_objects` returns at most 1000 objects per call,
# so this listing is complete only for small buckets
all_files = [obj['Key'] for obj
             in s3.list_objects(Bucket=DEMO_BUCKET).get('Contents', [])]
print('files in storage:', all_files)
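
For buckets with more than 1000 objects, a single listing call is not enough. The sketch below uses the boto3 paginator for `list_objects_v2` to collect keys from every page. The function name `list_all_keys` is a hypothetical helper, and it assumes the storage endpoint supports `list_objects_v2`.

```python
def list_all_keys(client, bucket, prefix=''):
    """Return every object key in `bucket`, following pagination."""
    paginator = client.get_paginator('list_objects_v2')
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # 'Contents' is absent when a page (or the whole bucket) is empty
        keys.extend(obj['Key'] for obj in page.get('Contents', []))
    return keys
```

With the client created above, it would be called as `all_files = list_all_keys(s3, DEMO_BUCKET)`.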

### 3. Load data from the storage

In [None]:
file_to_load = all_files[-1]
print('file to load:', file_to_load)
get_object_response = s3.get_object(
    Bucket=DEMO_BUCKET,
    Key=file_to_load
)

In [None]:
get_object_response

In [None]:
df = pd.read_csv(get_object_response['Body'], sep=',')
df.info()
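
The `Body` of the response is a stream, so its bytes can be consumed only once: a second `read` on the same object returns nothing. If the same object has to be parsed more than once, one option is to copy it into memory first. The helper name `body_to_buffer` is hypothetical, and the approach assumes the file fits in memory.

```python
import io


def body_to_buffer(body):
    """Read a one-shot stream fully and return a seekable in-memory copy."""
    return io.BytesIO(body.read())
```

For example, `buffer = body_to_buffer(get_object_response['Body'])` can be passed to `pd.read_csv(buffer)`, and `buffer.seek(0)` rewinds it for another read.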

In [None]:
df.describe().T

In [None]:
df.head()

### 4. Use of the data

In [None]:
plt.figure(figsize=(4, 4))
df['Sex'].hist()
plt.title('Male vs Female statistics')
plt.show()

In [None]:
plt.figure(figsize=(12, 4))
df['Age'].hist(bins=40)
plt.title("Passengers' age distribution")
plt.show()

### 5. It can be any kind of data

In [None]:
file_to_load = all_files[7]
print('file to load:', file_to_load)
get_object_response = s3.get_object(
    Bucket=DEMO_BUCKET,
    Key=file_to_load
)

In [None]:
img = get_object_response['Body']

In [None]:
img

In [None]:
from PIL import Image

In [None]:
plt.figure(figsize=(12, 8))
img = Image.open(img)
plt.imshow(np.array(img))
plt.show()

### 6. Data can be in a public-access bucket

In [None]:
DEMO_BUCKET_PUB = 'miba-master-classes-public'

#### 6.1. Table data

In [None]:
df = pd.read_csv(
    f'https://storage.yandexcloud.net/{DEMO_BUCKET_PUB}/vgsales.csv'
)
df.head()

In [None]:
df.describe().T

#### 6.2. Images

An image in a public-access bucket can be embedded in HTML or Markdown:

In [None]:
![Public Image](https://storage.yandexcloud.net/miba-master-classes-public/picture_pub.jpg)

It can also be opened directly at the URL https://storage.yandexcloud.net/miba-master-classes-public/picture_pub.jpg
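
The public URLs in this section follow the pattern `endpoint/bucket/key`. A small sketch of building such a URL is shown here. The function name `public_url` is a hypothetical helper, and percent-encoding the key with `quote` is an added precaution for keys that contain spaces or other special characters.

```python
from urllib.parse import quote

ENDPOINT = 'https://storage.yandexcloud.net'


def public_url(bucket, key, endpoint=ENDPOINT):
    """Build the public URL of an object; '/' in the key is kept as-is."""
    return f'{endpoint}/{bucket}/{quote(key)}'
```

For example, `public_url(DEMO_BUCKET_PUB, 'vgsales.csv')` gives the address read by `pd.read_csv` in section 6.1.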

### 7. Using s3fs-supported pandas API

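Pandas can read [directly from object storage](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) when the credentials and the endpoint are passed through the `storage_options` parameter. This requires the `s3fs` package to be installed.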

In [None]:
file_to_load = all_files[1]
# print the same path that is passed to pandas
s3_path = f's3://{DEMO_BUCKET}/{file_to_load}'
print(s3_path)
df = pd.read_csv(
    s3_path,
    storage_options={
        'key': creds['aws_access_key_id'],
        'secret': creds['aws_secret_access_key'],
        'client_kwargs': {'endpoint_url': 'https://storage.yandexcloud.net'}
    }
)

df.head()
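
The same `storage_options` dictionary is accepted by other pandas readers and writers that take this parameter, so it may be convenient to build it once. This is only a sketch: `make_storage_options` is a hypothetical helper, and the default endpoint is the one used throughout this notebook.

```python
def make_storage_options(creds, endpoint_url='https://storage.yandexcloud.net'):
    """Build the storage_options dict that pandas forwards to s3fs."""
    return {
        'key': creds['aws_access_key_id'],
        'secret': creds['aws_secret_access_key'],
        'client_kwargs': {'endpoint_url': endpoint_url},
    }
```

It would then be passed as `pd.read_csv(path, storage_options=make_storage_options(creds))`.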

## <font color='red'>LAB WORK #2</font>

Your home assignment for this part:
1. Choose any file you want (keep it small)
2. Upload it to the S3 bucket
3. Verify that the file is in the S3 storage

__HINT:__ use [this manual](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-uploading-files.html)

In [None]:
### YOUR CODE HERE ###