#### Data Collection and Compression

We collect and process data from the CoinMarketCap API.


In [13]:
%matplotlib inline

import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

N_COINS = 30

df = pd.read_json(f'https://api.coinmarketcap.com/v1/ticker/?limit={N_COINS}')
df = df.infer_objects()
df.head(10)


| | 24h_volume_usd | available_supply | id | last_updated | market_cap_usd | max_supply | name | percent_change_1h | percent_change_24h | percent_change_7d | price_btc | price_usd | rank | symbol | total_supply |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7982860000 | 16829450 | bitcoin | 1517055866 | 186116887550 | 21000000.0 | Bitcoin | -1.0 | 3.92 | -13.27 | 1.0 | 11059.0 | 1 | BTC | 16829450 |
| 1 | 2995630000 | 97241313 | ethereum | 1517055852 | 103596033236 | | Ethereum | -0.17 | 4.48 | -5.89 | 0.096974 | 1065.35 | 2 | ETH | 97241313 |
| 2 | 1206530000 | 38739142811 | ripple | 1517055841 | 46348672633 | 100000000000.0 | Ripple | -0.46 | 0.42 | -24.23 | 0.000109 | 1.19643 | 3 | XRP | 99993093880 |
| 3 | 453246000 | 16934538 | bitcoin-cash | 1517055857 | 26706273674 | 21000000.0 | Bitcoin Cash | -1.11 | 3.55 | -17.1 | 0.14355 | 1577.03 | 4 | BCH | 16934538 |
| 4 | 631236000 | 25927070538 | cardano | 1517055859 | 15601070228 | 45000000000.0 | Cardano | -0.62 | 3.93 | -13.67 | 5.5e-05 | 0.601729 | 5 | ADA | 31112483745 |
| 5 | 493381000 | 17868055883 | stellar | 1517055843 | 11070743668 | | Stellar | -0.32 | 7.61 | 17.95 | 5.6e-05 | 0.619583 | 6 | XLM | 103629819514 |
| 6 | 316366000 | 54944008 | litecoin | 1517055841 | 9693551582 | 84000000.0 | Litecoin | -0.44 | 3.2 | -14.12 | 0.016059 | 176.426 | 7 | LTC | 54944008 |
| 7 | 1021610000 | 631116954 | eos | 1517055855 | 8906259337 | 1000000000.0 | EOS | -0.92 | 4.38 | 1.63 | 0.001285 | 14.1119 | 8 | EOS | 900000000 |
| 8 | 258869000 | 65000000 | neo | 1517055850 | 8836880000 | | NEO | -0.06 | 4.73 | -7.03 | 0.012375 | 135.952 | 9 | NEO | 100000000 |
| 9 | 110000000 | 8999999999 | nem | 1517055846 | 7633988999 | | NEM | -0.17 | 7.06 | -26.55 | 7.7e-05 | 0.848221 | 10 | XEM | 8999999999 |


We want to compress this data by removing irrelevant or redundant features. This minimizes our disk usage (which will be useful for future scalability) and lets us present the information in a more useful, human-readable format.

1. Remove ```name``` and ```id```. Instead, use ```symbol```.
2. Remove ```max_supply```, ```total_supply```, and ```available_supply```. The max supply of BTC will always be ```2.1e7```, so we don't need to store it again every 5 minutes. Total and available supply are both irrelevant.
3. Remove ```price_btc```. It is a redundant field that can be closely approximated by dividing a coin's ```price_usd``` by BTC's ```price_usd```.
4. Reorder remaining fields to a more human-readable format.
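
As a quick check on step 3, the sketch below re-derives `price_btc` from `price_usd` for two rows copied from the output above. The small `prices` frame and the `btc_usd` variable are illustrative names, not part of the pipeline. The derived ETH value (about 0.0963) is close to, but not identical to, the API-reported 0.096974, so the recomputation should be treated as an approximation.

```python
import pandas as pd

# Two rows copied from the ticker output above (price_usd only).
prices = pd.DataFrame({
    'symbol': ['BTC', 'ETH'],
    'price_usd': [11059.0, 1065.35],
})

# Price of one BTC in USD, taken from the same snapshot.
btc_usd = prices.loc[prices['symbol'] == 'BTC', 'price_usd'].iloc[0]

# Re-derive price_btc as the ratio of each coin's USD price to BTC's USD price.
prices['price_btc'] = prices['price_usd'] / btc_usd
print(prices)
```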

This is my proposed improvement:

In [12]:
df = pd.read_json(f'https://api.coinmarketcap.com/v1/ticker/?limit={N_COINS}')
df = df.infer_objects()

df = df[['rank', 'symbol', 'price_usd', 'market_cap_usd', '24h_volume_usd',
         'percent_change_1h', 'percent_change_24h', 'percent_change_7d']]

df.head(10)

| | rank | symbol | price_usd | market_cap_usd | 24h_volume_usd | percent_change_1h | percent_change_24h | percent_change_7d |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | BTC | 11059.0 | 186116887550 | 7982860000 | -1.0 | 3.92 | -13.27 |
| 1 | 2 | ETH | 1065.35 | 103596033236 | 2995630000 | -0.17 | 4.48 | -5.89 |
| 2 | 3 | XRP | 1.19643 | 46348672633 | 1206530000 | -0.46 | 0.42 | -24.23 |
| 3 | 4 | BCH | 1577.03 | 26706273674 | 453246000 | -1.11 | 3.55 | -17.1 |
| 4 | 5 | ADA | 0.601729 | 15601070228 | 631236000 | -0.62 | 3.93 | -13.67 |
| 5 | 6 | XLM | 0.619583 | 11070743668 | 493381000 | -0.32 | 7.61 | 17.95 |
| 6 | 7 | LTC | 176.426 | 9693551582 | 316366000 | -0.44 | 3.2 | -14.12 |
| 7 | 8 | EOS | 14.1119 | 8906259337 | 1021610000 | -0.92 | 4.38 | 1.63 |
| 8 | 9 | NEO | 135.952 | 8836880000 | 258869000 | -0.06 | 4.73 | -7.03 |
| 9 | 10 | XEM | 0.848221 | 7633988999 | 110000000 | -0.17 | 7.06 | -26.55 |
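
To get a rough sense of the disk savings, one option is to compare the CSV size of the full and reduced frames. The sketch below does this for the first two rows of the raw output. The `csv_bytes` helper is a hypothetical name introduced here, and the ratio on two rows is only indicative of what a full snapshot would give.

```python
import pandas as pd

# First two ticker rows, copied from the raw output earlier in this section.
full = pd.DataFrame([
    {'24h_volume_usd': 7982860000, 'available_supply': 16829450, 'id': 'bitcoin',
     'last_updated': 1517055866, 'market_cap_usd': 186116887550, 'max_supply': 21000000.0,
     'name': 'Bitcoin', 'percent_change_1h': -1.0, 'percent_change_24h': 3.92,
     'percent_change_7d': -13.27, 'price_btc': 1.0, 'price_usd': 11059.0,
     'rank': 1, 'symbol': 'BTC', 'total_supply': 16829450},
    {'24h_volume_usd': 2995630000, 'available_supply': 97241313, 'id': 'ethereum',
     'last_updated': 1517055852, 'market_cap_usd': 103596033236, 'max_supply': None,
     'name': 'Ethereum', 'percent_change_1h': -0.17, 'percent_change_24h': 4.48,
     'percent_change_7d': -5.89, 'price_btc': 0.096974, 'price_usd': 1065.35,
     'rank': 2, 'symbol': 'ETH', 'total_supply': 97241313},
])

# Columns kept by the reduced format, in display order.
KEEP = ['rank', 'symbol', 'price_usd', 'market_cap_usd', '24h_volume_usd',
        'percent_change_1h', 'percent_change_24h', 'percent_change_7d']
reduced = full[KEEP]


def csv_bytes(frame):
    """Size in bytes of the frame when serialized as CSV without the index."""
    return len(frame.to_csv(index=False).encode('utf-8'))


full_bytes, reduced_bytes = csv_bytes(full), csv_bytes(reduced)
print(f'full: {full_bytes} B, reduced: {reduced_bytes} B, '
      f'saved: {1 - reduced_bytes / full_bytes:.0%}')
```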


We will call this API every 5 minutes and store each result in its own time-stamped CSV.
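
A minimal sketch of that collection loop follows. The file-naming scheme, the `snapshot_path`, `save_snapshot`, and `collect_forever` helpers, and the `POLL_SECONDS` constant are all assumptions made for illustration; the actual collector may differ. The caller passes the fetch step in as a function, so the sketch does not depend on the API being reachable.

```python
import time
from datetime import datetime, timezone
from pathlib import Path

import pandas as pd

POLL_SECONDS = 5 * 60  # five-minute interval from the text


def snapshot_path(out_dir, when):
    """Build a path like <out_dir>/ticker_20180127T122426Z.csv (naming is an assumption)."""
    return Path(out_dir) / f"ticker_{when.strftime('%Y%m%dT%H%M%SZ')}.csv"


def save_snapshot(df, out_dir, when=None):
    """Write one DataFrame to its own timestamped CSV and return the path."""
    when = when or datetime.now(timezone.utc)
    path = snapshot_path(out_dir, when)
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    return path


def collect_forever(fetch, out_dir):
    """Call fetch() every POLL_SECONDS and save each result; runs until interrupted."""
    while True:
        save_snapshot(fetch(), out_dir)
        time.sleep(POLL_SECONDS)
```

Here `fetch` would be a function that returns the reduced ticker DataFrame built in the cell above. Because `time.sleep` starts counting only after each save finishes, the interval drifts slightly; a scheduler such as cron would be one alternative.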

###### Global Data

The API also provides access to global data.

In [10]:
# Use a name that does not shadow the standard-library json module
global_data = requests.get('https://api.coinmarketcap.com/v1/global/').json()
gdf = pd.DataFrame([global_data]).infer_objects()

gdf.head()

| | active_assets | active_currencies | active_markets | bitcoin_percentage_of_market_cap | last_updated | total_24h_volume_usd | total_market_cap_usd |
|---|---|---|---|---|---|---|---|
| 0 | 567 | 896 | 8165 | 34.48 | 1517054667 | 23363370000.0 | 538665500000.0 |


I threw out ```last_updated``` and ```bitcoin_percentage_of_market_cap```, and re-ordered the remaining fields.

In [11]:
gdf = gdf[['total_market_cap_usd', 'total_24h_volume_usd', 
           'active_markets', 'active_currencies', 'active_assets']]

gdf.head()

| | total_market_cap_usd | total_24h_volume_usd | active_markets | active_currencies | active_assets |
|---|---|---|---|---|---|
| 0 | 538665500000.0 | 23363370000.0 | 8165 | 896 | 567 |


Each API call adds only a single line of data, so data compression is not an issue here.
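
One way to accumulate these single rows is to append each one to a running CSV, writing the header only once. This is a sketch under assumptions: the text does not say whether global rows share a file, and the `append_global_row` helper and the `collected_at` column are hypothetical. The extra column is there because `last_updated` was dropped, so the rows would otherwise carry no timestamp.

```python
import os
import tempfile

import pandas as pd

COLUMNS = ['total_market_cap_usd', 'total_24h_volume_usd',
           'active_markets', 'active_currencies', 'active_assets']


def append_global_row(row, path, collected_at):
    """Append one row of global data to a CSV, writing the header only once."""
    frame = pd.DataFrame([row])[COLUMNS]
    # Hypothetical extra column recording when the row was collected.
    frame.insert(0, 'collected_at', collected_at)
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    frame.to_csv(path, mode='a', header=is_new, index=False)


# Values copied from the global output above.
row = {'total_market_cap_usd': 538665500000.0, 'total_24h_volume_usd': 23363370000.0,
       'active_markets': 8165, 'active_currencies': 896, 'active_assets': 567}

path = os.path.join(tempfile.mkdtemp(), 'global.csv')
append_global_row(row, path, collected_at=1517054667)
append_global_row(row, path, collected_at=1517054967)  # five minutes later
print(pd.read_csv(path))
```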

#### Setting up an AWS Instance 

`TODO: Write resources about the cloud, why and how we're using it. Homework assignment for either Will or Law.`

We are running an EC2 instance for collecting and storing data as described above. Each API response is stored in its own timestamped file.

`TODO: Document AWS access, etc.`