# Analysis of S3 objects using Python

**Data Set**
[Kaggle Financial Data Set](https://www.kaggle.com/borismarjanovic/price-volume-data-for-all-us-stocks-etfs/downloads/price-volume-data-for-all-us-stocks-etfs.zip/3)
_Note:_ Only "aapl" and "ge" were uploaded.

This demo shows the following:
* List objects in an S3 bucket
* Use the Python boto3 package to connect to S3 programmatically and process objects
* Use pandas to calculate the size of the folders
* Load the files into a dataframe

------------
## List the S3 buckets

```console
!aws s3 ls --profile fin-demo
2019-09-13 11:16:34 rsdg-fin-demo-price-eu-west-2
2019-09-13 11:16:34 rsdg-fin-demo-reference-eu-west-2
2019-09-13 11:16:34 rsdg-fin-demo-transaction-eu-west-2
2019-09-12 23:18:23 rsdg-s3-bucket-fin-demo
```

****
## List the files in the **rsdg-s3-bucket-fin-demo** bucket

In [None]:
!aws s3 ls s3://rsdg-s3-bucket-fin-demo/ --profile fin-demo

****
## Use the boto3 package to query the S3 bucket

In [None]:
import boto3

_session = boto3.Session(profile_name='fin-demo')
s3 = _session.client('s3')
s3.list_objects_v2(Bucket='rsdg-s3-bucket-fin-demo')

****
## Process the boto3 response to build a dictionary of keys and sizes

In [None]:
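def get_s3_keys(bucket):
    """Return a dictionary mapping each key in an S3 bucket to its size in bytes."""
    _keys = {}
    # list_objects_v2 returns at most 1,000 keys per call, so page through the results
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket):
        # 'Contents' is absent when a page holds no objects (for example, an empty bucket)
        for obj in page.get('Contents', []):
            _keys[obj['Key']] = obj['Size']
    return _keys

keysAndSizes = get_s3_keys('rsdg-s3-bucket-fin-demo')
print(keysAndSizes)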

****
## Load the dictionary into a pandas dataframe

In [None]:
import pandas as pd
pdKeysAndSizes = pd.DataFrame(list(keysAndSizes.items()))
pdKeysAndSizes.columns = ['FileName', 'Size']
pdKeysAndSizes

****
## Use pandas to count the number of files and total size of files

In [None]:
print("Number of files: %s; Total size of files (MB): %s" % (pdKeysAndSizes.count()['FileName'],
                                                        pdKeysAndSizes.sum(axis=0)['Size']/1024/1024
                                                       ))
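
The overview mentions sizing folders as well as the whole bucket. A minimal sketch of that step, assuming keys use `/` as a folder separator; the sample keys, the `folder_sizes` helper, and the `(root)` label are hypothetical and stand in for the real `keysAndSizes` dictionary:

```python
import pandas as pd

# Hypothetical key -> size mapping, shaped like the keysAndSizes dictionary.
sample_keys_and_sizes = {
    "prices/aapl.us.txt": 2048,
    "prices/ge.us.txt": 1024,
    "reference/tickers.csv": 512,
    "readme.txt": 100,
}


def folder_sizes(keys_and_sizes):
    """Return total bytes and file count per top-level prefix ("folder")."""
    df = pd.DataFrame(list(keys_and_sizes.items()), columns=["FileName", "Size"])
    # Keys without a "/" sit at the bucket root, so give them a placeholder label.
    df["Folder"] = df["FileName"].apply(
        lambda key: key.split("/", 1)[0] if "/" in key else "(root)"
    )
    return df.groupby("Folder")["Size"].agg(TotalBytes="sum", FileCount="count")


sizes = folder_sizes(sample_keys_and_sizes)
print(sizes)
```

Passing the real `keysAndSizes` dictionary to `folder_sizes` should give the same breakdown for the demo bucket; if every file sits at the bucket root, all rows fall under the single `(root)` label.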

## Sample the file to understand the structure

In [None]:
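import smart_open as so

# Note: smart_open 5.0 and later expect transport_params=dict(client=s3) instead of session
# Print the first nine lines of the file, then stop reading
for ii, line in enumerate(so.open('s3://rsdg-s3-bucket-fin-demo/aapl.us.txt', transport_params=dict(session=_session))):
    if ii >= 9:
        break
    print(line)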
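
The same sampling idea can be written with `itertools.islice`, which stops reading after a fixed number of lines. This is a small sketch, not the notebook's own code: the `head` helper and the in-memory sample text are hypothetical, and an `io.StringIO` object stands in for the stream returned by `so.open(...)`, which is iterated line by line in the same way.

```python
import io
from itertools import islice


def head(lines, n=5):
    """Return the first n lines of any iterable of lines, without trailing newlines."""
    return [line.rstrip("\n") for line in islice(lines, n)]


# Hypothetical stand-in for the S3 object; the column names are made up for illustration.
sample = io.StringIO("Date,Open,Close\n2019-01-02,1.0,2.0\n2019-01-03,2.0,3.0\n")
for line in head(sample, n=2):
    print(line)
```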

## Load the "aapl" file into a dataframe

In [None]:

equity_df = pd.read_csv(so.open('s3://rsdg-s3-bucket-fin-demo/aapl.us.txt', transport_params=dict(session= _session) ))
equity_df['Stock']='aapl.us.txt'.replace('.txt','')
equity_df

## Load all the files into a dataframe

In [None]:
frames = []

for key in keysAndSizes:
    file = 's3://rsdg-s3-bucket-fin-demo/' + key
    single_equity_df = pd.read_csv(so.open(file, transport_params=dict(session=_session)))
    single_equity_df['Stock'] = key.replace('.txt', '')
    frames.append(single_equity_df)

# DataFrame.append was removed in pandas 2.0; concatenate the collected frames once instead
combined_equity_df = pd.concat(frames, ignore_index=True)

print(combined_equity_df)
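
To check the combining logic without touching S3, the same steps can be run against in-memory text. This is a sketch under stated assumptions: the `combine` helper, the sample rows, and the `Date` and `Close` column names are hypothetical and are not taken from the actual files.

```python
import io

import pandas as pd

# Hypothetical file contents keyed by object name; the columns are assumed for illustration.
sample_files = {
    "aapl.us.txt": "Date,Close\n2019-01-02,10.0\n2019-01-03,11.0\n",
    "ge.us.txt": "Date,Close\n2019-01-02,5.0\n",
}


def combine(files):
    """Read each CSV, tag its rows with the stock name, and concatenate once."""
    frames = []
    for key, text in files.items():
        df = pd.read_csv(io.StringIO(text), parse_dates=["Date"])
        df["Stock"] = key.replace(".txt", "")
        frames.append(df)
    return pd.concat(frames, ignore_index=True)


combined = combine(sample_files)
# Per-stock average close, as a quick check that rows were tagged correctly.
print(combined.groupby("Stock")["Close"].mean())
```

Collecting the frames in a list and calling `pd.concat` once avoids copying the growing dataframe on every iteration.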


### Back to readme
* [00-Setup](/markdown/setup.md) 
* [01-Process S3 using python](https://nbviewer.jupyter.org/github/satishrsdg/aws-finance-analytics-demo/blob/master/jupyter-lab/process_s3_files.ipynb?flush_cache=true)
* [02-Visualization and Analytics](./02_Visualization_and_Analytics.ipynb)
* [03-Risk Analytics](./03_Risk_Analytics.ipynb)
* [04-Exploring Firehose, Athena and QuickSight](./04_Exploring_Kinesis_Firehose.ipynb)
* [05-Athena and QuickSight](./05_Athena_Quicksight.ipynb)
* [06-SageMaker to run the notebooks](./06_Sagemaker_jupyterlab.ipynb)
* [07-Transform stream data using Lambda](./07_Transform_lambda.ipynb)
* [08-Move data to Redshift using Glue](./08_Glue_Redshift.ipynb)
* [09-CI/CD Terraform with Travis CI](./09_Integrating_terraform_travisci.ipynb)