# Scraping coinmarketcap for ICO prices
By Leon Yin<br>
Github: [yinleon](#TODO)<br>
Twitter: [@leonyin](#TODO)<br>
Updated: 2017-11-12

This is a notebook that explains how to build a scraper that collects a [table of ICO stats](https://coinmarketcap.com/all/views/all/) from coinmarketcap.com.

While we're at it, we create a few helpful metadata columns, make numerical values machine-readable, and perform some simple data analysis. Because this is a Jupyter Notebook, you can run it on your own machine :)


View this on Github [here](#TODO) or NBViewer [here](#TODO).<br>
The scraper is available as a Python script [here](#TODO).<br>
The hourly data is openly available on Amazon S3 [here](#TODO).

If you like this project please help support it by contributing time to make it better, or donating to help [pay for hosting](#TODO).

### Table of Contents
1. [Scraping Data with Requests and Beautiful Soup](#scrape)
2. [Cleaning Data with Pandas](#clean)
3. [Analysis with Pandas](#analysis)
4. [Next Steps](#next-steps)

In [1]:
import os
import re
import datetime
import requests

import numpy as np
import pandas as pd
from bs4 import BeautifulSoup

In [2]:
root_dir = '../'
table_url = 'https://coinmarketcap.com/all/views/all/'
table_id = 'currencies-all'
today = datetime.datetime.now()

## Scraping a Website with Requests and Beautiful Soup <a id='scrape'></a>
Let's visit the coinmarketcap website programmatically using the requests package...

In [3]:
r = requests.get(table_url)

Among other things, `r` contains the html content of the page we visited.

In [4]:
r.content[:300]

b'<!DOCTYPE html>\n<!--[if lt IE 7]>      <html class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->\n<!--[if IE 7]>         <html class="no-js lt-ie9 lt-ie8"> <![endif]-->\n<!--[if IE 8]>         <html class="no-js lt-ie9"> <![endif]-->\n<!--[if gt IE 8]><!--> \n<html class="no-js" lang="en"> <!--<![endif]-->'
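The request above assumes the site answers quickly and successfully. For a scraper that runs unattended, it may help to fail loudly instead. Here is a minimal sketch, where `fetch_html`, its `get` parameter, and the 10-second timeout are hypothetical choices of mine rather than part of the original scraper:

```python
import requests


def fetch_html(url, get=requests.get, timeout=10):
    """Return the raw bytes of `url`, raising on HTTP errors.

    `fetch_html`, `get`, and `timeout=10` are illustrative names and values.
    `get` is injectable so the function can be tried without a network call.
    """
    response = get(url, timeout=timeout)  # give up if the server is too slow
    response.raise_for_status()           # raise on 4xx/5xx status codes
    return response.content


# Usage (performs a real network request, so it is left commented out):
# html = fetch_html('https://coinmarketcap.com/all/views/all/')
```

The returned bytes can be passed to BeautifulSoup the same way `r.content` is.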

BeautifulSoup is the de facto package (still?) for parsing HTML content.<br>
We can pass the HTML from `r` to it and get back a parsable object.

In [5]:
soup = BeautifulSoup(r.content, 'lxml')

This new object (`soup`) comes in handy because we can isolate sections of the HTML page using `soup.find()`.<br>
The section we are after is the element that contains the ICO data.<br>
We can use the `Inspect` feature in Chrome to identify the ID (in this case `currencies-all`) of this table.

<img src='../media/find_element.png'></img>

I stored the ID for this table as a variable `table_id`.

In [6]:
table_id

'currencies-all'
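To see what `soup.find()` does without hitting the live site, here is a small sketch on a made-up HTML snippet. The snippet and the `sample_*` names are assumptions for illustration, and it uses Python's built-in `'html.parser'` rather than `'lxml'` so that it has no extra dependency:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page: two tables, only one has the id we want.
sample_html = """
<html><body>
  <table id="other"><tr><td>ignore me</td></tr></table>
  <table id="currencies-all">
    <tr><th>Name</th><th>Symbol</th></tr>
    <tr><td>Bitcoin</td><td>BTC</td></tr>
    <tr><td>Ethereum</td><td>ETH</td></tr>
  </table>
</body></html>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# find() returns the first matching element, or None if nothing matches.
sample_table = sample_soup.find('table', {'id': 'currencies-all'})

# Turn each <tr> into a list of cell strings.
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])]
    for tr in sample_table.find_all('tr')
]
print(rows)
```

Because `find()` returns `None` when the ID is missing, checking for `None` is a cheap way to notice that the site's layout has changed.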

Let's isolate the table element using `table_id`, and read it into a Pandas dataframe.

In [7]:
html_tbl = str(soup.find('table',{'id': table_id}))
df = pd.read_html(html_tbl, index_col=0)[0]

We can get a peek at the top 5 ICOs.

In [8]:
df.head()

Unnamed: 0_level_0,Name,Symbol,Market Cap,Price,Circulating Supply,Volume (24h),% 1h,% 24h,% 7d
#,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
1,BTC Bitcoin,BTC,"$101,127,463,297",$6063.97,16676775,"$8,482,620,000",0.83%,2.02%,-17.44%
2,ETH Ethereum,ETH,"$30,032,324,679",$313.84,95693412,"$1,729,130,000",-1.08%,1.89%,6.03%
3,BCH Bitcoin Cash,BCH,"$20,857,585,264",$1241.51,16800175,"$6,682,870,000",-7.00%,-31.28%,94.33%
4,XRP Ripple,XRP,"$7,681,647,599",$0.199360,"38,531,538,922 *","$253,269,000",-0.82%,-2.93%,-1.56%
5,DASH Dash,DASH,"$3,366,095,503",$437.99,7685290,"$515,079,000",-2.31%,27.63%,61.30%


Now that we have everything in Pandas, we can do some extra janitorial work, and analysis.<br>
If you've never used Pandas, you're in for a treat!

## Cleaning Data with Pandas <a id='clean'></a>
The data we have is straight from the HTML table:<br>
it is human-readable, but not machine-readable.

Let's use Pandas to clean up the data (stored in a DataFrame `df`)...<br>
As a first step, we can add a timestamp for context.

In [9]:
df['scrape_timestamp'] = today

### Renaming Columns
Let's make the columns more descriptive, by including the unit in each column (USD).<br>
We can do this by replacing column names using a key-value store (a dictionary)

In [10]:
col_name_w_currency = {
    'Market Cap' : 'market_cap_usd',
    'Price' : 'price_usd',
    'Volume (24h)': 'volume_24h_usd',
}

and a function that operates on each column name.

In [11]:
def clean_up_col(col):
    '''
    Adds currency unit to relevant column names,
    replaces spaces with underscores,
    replaces % symbols with "percent_change",
    and returns the updated column in lower case.
    '''
    col = col_name_w_currency.get(col, col)
    col = col.replace(' ', '_')
    col = col.replace('%', 'percent_change')
    return col.lower()

Now let's iterate through each column name in `df`, and apply `clean_up_col` to each.<br>
For reference: `[x for x in some_iterator]` is called a list comprehension, which is a compact way to build a list with a for loop.

In [12]:
df.columns = [clean_up_col(c) for c in df.columns]
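To make the comparison with a for loop concrete, here is a small sketch that runs `clean_up_col` on a hand-picked list of raw column names. The mapping and function are restated from the cells above so the block runs on its own; `raw_cols` is a sample I chose for illustration:

```python
col_name_w_currency = {
    'Market Cap': 'market_cap_usd',
    'Price': 'price_usd',
    'Volume (24h)': 'volume_24h_usd',
}


def clean_up_col(col):
    """Same logic as the notebook's function, repeated for a standalone run."""
    col = col_name_w_currency.get(col, col)
    col = col.replace(' ', '_')
    col = col.replace('%', 'percent_change')
    return col.lower()


raw_cols = ['Name', 'Market Cap', 'Volume (24h)', '% 1h']  # illustrative sample

# The explicit for loop...
cleaned_loop = []
for c in raw_cols:
    cleaned_loop.append(clean_up_col(c))

# ...and the equivalent list comprehension.
cleaned_comp = [clean_up_col(c) for c in raw_cols]

print(cleaned_comp)
# ['name', 'market_cap_usd', 'volume_24h_usd', 'percent_change_1h']
```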

In [13]:
df.dtypes

name                          object
symbol                        object
market_cap_usd                object
price_usd                     object
circulating_supply            object
volume_24h_usd                object
percent_change_1h             object
percent_change_24h            object
percent_change_7d             object
scrape_timestamp      datetime64[ns]
dtype: object

In [14]:
df.head(4)

Unnamed: 0_level_0,name,symbol,market_cap_usd,price_usd,circulating_supply,volume_24h_usd,percent_change_1h,percent_change_24h,percent_change_7d,scrape_timestamp
#,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
1,BTC Bitcoin,BTC,"$101,127,463,297",$6063.97,16676775,"$8,482,620,000",0.83%,2.02%,-17.44%,2017-11-12 23:51:38.894699
2,ETH Ethereum,ETH,"$30,032,324,679",$313.84,95693412,"$1,729,130,000",-1.08%,1.89%,6.03%,2017-11-12 23:51:38.894699
3,BCH Bitcoin Cash,BCH,"$20,857,585,264",$1241.51,16800175,"$6,682,870,000",-7.00%,-31.28%,94.33%,2017-11-12 23:51:38.894699
4,XRP Ripple,XRP,"$7,681,647,599",$0.199360,"38,531,538,922 *","$253,269,000",-0.82%,-2.93%,-1.56%,2017-11-12 23:51:38.894699


Notice that some values in `circulating_supply` have an asterisk (used to denote that the currency is not minable).<br>
We can convert this feature into a new column by leveraging the Pandas DataFrame `apply` method,<br>
which applies any function (anonymous or declared) to each column (`axis=0`) or each row (`axis=1`).

In [15]:
def is_minable(row):
    '''
    Check if `circulating_supply` contains an asterisk.
    This function operates on each row of the dataframe.
    If the ICO is not minable, we'll find an asterisk and return 0.
    
    Note:
    When we apply a function across rows (axis=1),
    each row is passed in as a Series that can be indexed by column name, like a dictionary.
    '''
    circulating_supply = row['circulating_supply']
    
    if '*' in circulating_supply:
        return 0
    
    else:
        return 1

In [16]:
df['is_minable'] = df.apply(is_minable, axis=1)
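Row-wise `apply` calls a Python function once per row, which is fine for a table this size. As an alternative, the same flag can be computed with Pandas string methods on the whole column at once. This is a sketch on a two-row sample of my own, not the code the scraper uses:

```python
import pandas as pd

# Illustrative sample: one value without an asterisk and one with.
sample = pd.DataFrame({
    'symbol': ['BTC', 'XRP'],
    'circulating_supply': ['16676775', '38,531,538,922 *'],
})

# regex=False treats '*' as a literal character instead of a regex operator.
has_asterisk = (
    sample['circulating_supply']
    .astype(str)  # guards against values that were parsed as numbers
    .str.contains('*', regex=False)
)

# An asterisk marks "not minable", so flip the boolean and store it as 0/1.
sample['is_minable'] = (~has_asterisk).astype(int)
print(sample['is_minable'].tolist())  # [1, 0]
```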

In [17]:
df.head(4)

Unnamed: 0_level_0,name,symbol,market_cap_usd,price_usd,circulating_supply,volume_24h_usd,percent_change_1h,percent_change_24h,percent_change_7d,scrape_timestamp,is_minable
#,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1,BTC Bitcoin,BTC,"$101,127,463,297",$6063.97,16676775,"$8,482,620,000",0.83%,2.02%,-17.44%,2017-11-12 23:51:38.894699,1
2,ETH Ethereum,ETH,"$30,032,324,679",$313.84,95693412,"$1,729,130,000",-1.08%,1.89%,6.03%,2017-11-12 23:51:38.894699,1
3,BCH Bitcoin Cash,BCH,"$20,857,585,264",$1241.51,16800175,"$6,682,870,000",-7.00%,-31.28%,94.33%,2017-11-12 23:51:38.894699,1
4,XRP Ripple,XRP,"$7,681,647,599",$0.199360,"38,531,538,922 *","$253,269,000",-0.82%,-2.93%,-1.56%,2017-11-12 23:51:38.894699,0


### This table is now more human readable, but problematic for machines
Why? Because there are dollar signs, commas, asterisks, and percent signs in numeric values.<br>
This causes most computers (and Pandas) to view numeric values as strings!

In [18]:
df.dtypes

name                          object
symbol                        object
market_cap_usd                object
price_usd                     object
circulating_supply            object
volume_24h_usd                object
percent_change_1h             object
percent_change_24h            object
percent_change_7d             object
scrape_timestamp      datetime64[ns]
is_minable                     int64
dtype: object

We can remove these symbols using regular expressions.<br>
Below is a dictionary of regular expressions we can use to weed out symbols

In [19]:
replace_symbols = {
    r'  [*]' : '',    # two spaces followed by an asterisk
    r'[\$,%*]' : '',  # dollar signs, commas, percent signs, asterisks
    r'[?]' : np.nan,  # question marks become a null value
    'Low Vol' : 0,    # low volume is simplified as zero...
}

Pandas `replace` operates on all columns and all rows.<br>
The coolest aspects of this function are that
1. it can take a dictionary as an input,
2. it can implement regular expressions, and
3. it can operate inplace.

In [20]:
df.replace(replace_symbols, regex=True, inplace=True)

In [21]:
df.head(4)

Unnamed: 0_level_0,name,symbol,market_cap_usd,price_usd,circulating_supply,volume_24h_usd,percent_change_1h,percent_change_24h,percent_change_7d,scrape_timestamp,is_minable
#,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1,BTC Bitcoin,BTC,101127463297,6063.97,16676775,8482620000,0.83,2.02,-17.44,2017-11-12 23:51:38.894699,1
2,ETH Ethereum,ETH,30032324679,313.84,95693412,1729130000,-1.08,1.89,6.03,2017-11-12 23:51:38.894699,1
3,BCH Bitcoin Cash,BCH,20857585264,1241.51,16800175,6682870000,-7.0,-31.28,94.33,2017-11-12 23:51:38.894699,1
4,XRP Ripple,XRP,7681647599,0.19936,38531538922,253269000,-0.82,-2.93,-1.56,2017-11-12 23:51:38.894699,0
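As far as I can tell, the cleaned values are still strings at this point, and they only become floats once the CSV is read back in later on. If you want numeric columns right away, one option is to strip the symbols and convert explicitly with `pd.to_numeric`. The helper name `to_number` and the sample rows are assumptions for this sketch:

```python
import pandas as pd

# Illustrative sample using the kinds of values seen in the scraped table.
raw = pd.DataFrame({
    'price_usd': ['$6063.97', '$0.199360', '?'],
    'circulating_supply': ['16676775', '38,531,538,922 *', '?'],
    'volume_24h_usd': ['$8,482,620,000', 'Low Vol', '?'],
})


def to_number(col):
    """Strip display symbols from a string column and convert it to floats."""
    cleaned = (
        col.astype(str)
        .str.replace('Low Vol', '0', regex=False)   # mirror the notebook: low volume -> 0
        .str.replace(r'[\$,%*\s]', '', regex=True)  # drop $ , % * and whitespace
    )
    # Anything that still isn't a number (such as '?') becomes NaN.
    return pd.to_numeric(cleaned, errors='coerce')


clean = raw.apply(to_number)  # runs to_number once per column
print(clean)
print(clean.dtypes)
```

Using `errors='coerce'` is a design choice: unexpected placeholders turn into missing values instead of stopping the scrape.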


This looks good to me! Let's write this clean dataset to a csv.<br>
A best practice in data engineering is to create a function to programmatically generate file paths and directories.

In [22]:
def create_filename(root_dir, today):
    '''
    This function creates the filename,
    and creates the directory for the file if the directory doesn't exist.
    '''
    f_template = '{year}/{month}/{day}/{hour}/market_cap_USD_{time}.csv.gz'
    f = f_template.format(year = today.year,
                          month= today.month,
                          day  = today.day,
                          hour = today.strftime('%H'),
                          time = today.strftime('%H:%M:%S'))
    
    f_out = os.path.join(root_dir, f)
    
    dir_out = os.path.dirname(f_out)
    os.makedirs(dir_out, exist_ok=True)
    
    return f_out

In [23]:
file = create_filename(root_dir, today)
file

'./2017/11/12/23/market_cap_USD_23:51:38.csv.gz'
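For comparison, here is a sketch of the same idea with `pathlib`. It is a hypothetical variant, not the function used above, and it makes two choices of its own: the month, day, and hour are zero-padded via `strftime`, and the time uses hyphens instead of colons, since colons are not allowed in Windows filenames.

```python
import datetime
import tempfile
from pathlib import Path


def create_filename_pathlib(root_dir, when):
    """Hypothetical pathlib variant: build the path and create its parent folder."""
    relative = when.strftime('%Y/%m/%d/%H/market_cap_USD_%H-%M-%S.csv.gz')
    f_out = Path(root_dir) / relative
    f_out.parent.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return f_out


# Try it in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    example = create_filename_pathlib(tmp, datetime.datetime(2017, 11, 12, 23, 51, 38))
    parent_exists = example.parent.is_dir()
    relative_path = example.relative_to(tmp).as_posix()

print(relative_path)  # 2017/11/12/23/market_cap_USD_23-51-38.csv.gz
```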

In [24]:
df.to_csv(file, index=False, compression='gzip')

## Let's do some analysis <a id='analysis'></a>
How does the data look?

In [25]:
df = pd.read_csv(file, compression='gzip')

In [26]:
df.head()

Unnamed: 0,name,symbol,market_cap_usd,price_usd,circulating_supply,volume_24h_usd,percent_change_1h,percent_change_24h,percent_change_7d,scrape_timestamp,is_minable
0,BTC Bitcoin,BTC,101127500000.0,6063.97,16676780.0,8482620000.0,0.83,2.02,-17.44,2017-11-12 23:51:38.894699,1
1,ETH Ethereum,ETH,30032320000.0,313.84,95693410.0,1729130000.0,-1.08,1.89,6.03,2017-11-12 23:51:38.894699,1
2,BCH Bitcoin Cash,BCH,20857590000.0,1241.51,16800180.0,6682870000.0,-7.0,-31.28,94.33,2017-11-12 23:51:38.894699,1
3,XRP Ripple,XRP,7681648000.0,0.19936,38531540000.0,253269000.0,-0.82,-2.93,-1.56,2017-11-12 23:51:38.894699,0
4,DASH Dash,DASH,3366096000.0,437.99,7685290.0,515079000.0,-2.31,27.63,61.3,2017-11-12 23:51:38.894699,1


Monetary values and percentages are now floats!

In [27]:
df.dtypes

name                   object
symbol                 object
market_cap_usd        float64
price_usd             float64
circulating_supply    float64
volume_24h_usd        float64
percent_change_1h     float64
percent_change_24h    float64
percent_change_7d     float64
scrape_timestamp       object
is_minable              int64
dtype: object
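One thing the dtypes above show is that `scrape_timestamp` comes back from the CSV as `object` (plain text) rather than a datetime. If that matters for your analysis, `read_csv` can parse it on load via `parse_dates`. A small round-trip sketch with made-up rows and a temporary file:

```python
import datetime
import os
import tempfile

import pandas as pd

# Illustrative two-row frame with the same timestamp column as the notebook.
sample = pd.DataFrame({
    'symbol': ['BTC', 'ETH'],
    'price_usd': [6063.97, 313.84],
    'scrape_timestamp': datetime.datetime(2017, 11, 12, 23, 51, 38),
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.csv.gz')
    sample.to_csv(path, index=False, compression='gzip')

    # Default read: the timestamp column is loaded as text.
    as_text = pd.read_csv(path, compression='gzip')
    # With parse_dates, the column is converted to datetimes while loading.
    as_dates = pd.read_csv(path, compression='gzip',
                           parse_dates=['scrape_timestamp'])

print(as_text['scrape_timestamp'].dtype)
print(as_dates['scrape_timestamp'].dtype)
```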

We can get a big picture of what's going on:


In [28]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
market_cap_usd,989.0,196376000.0,3430574000.0,4.0,88379.0,855871.0,8600916.0,101127500000.0
price_usd,1277.0,253.6656,5668.109,1.5e-08,0.00222,0.02699,0.274574,175919.0
circulating_supply,989.0,5723924000.0,50305770000.0,0.078264,4882231.0,22092100.0,108835000.0,1055692000000.0
volume_24h_usd,1256.0,16749760.0,311597100.0,0.0,0.0,1643.5,58939.5,8482620000.0
percent_change_1h,1179.0,0.5570483,10.21335,-68.36,-0.17,0.7,0.94,246.39
percent_change_24h,1195.0,12.38731,218.3082,-92.02,-6.475,-0.37,5.05,6580.41
percent_change_7d,1189.0,4.098057,156.8465,-98.95,-23.01,-10.45,7.46,4973.09
is_minable,1278.0,0.4749609,0.4995681,0.0,0.0,0.0,1.0,1.0


There are aggregation functions we can use to calculate the total market cap:

In [29]:
df['market_cap_usd'].sum()

194215909708.0
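Keep in mind that `describe()` reported only 989 non-null market caps out of 1278 rows, so this total quietly skips the ICOs with a missing market cap. The sketch below, on three made-up values, shows how `sum()` treats missing data:

```python
import numpy as np
import pandas as pd

# Illustrative series: two known market caps and one missing value.
caps = pd.Series([101127463297.0, np.nan, 30032324679.0], name='market_cap_usd')

total = caps.sum()                     # NaN is skipped by default (skipna=True)
strict_total = caps.sum(skipna=False)  # NaN if any value is missing
n_missing = caps.isna().sum()          # how many values were left out

print(total, strict_total, n_missing)
```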

We can also find which ICOs have dropped by 60% or more over the past week:

In [30]:
df_losers = df[df['percent_change_7d'] <= -60]
df_losers.head()

Unnamed: 0,name,symbol,market_cap_usd,price_usd,circulating_supply,volume_24h_usd,percent_change_1h,percent_change_24h,percent_change_7d,scrape_timestamp,is_minable
187,BQ bitqy,BQ,14271249.0,0.006137,2325626000.0,42124.0,2.72,-51.57,-76.45,2017-11-12 23:51:38.894699,0
199,MCAP MCAP,MCAP,12938259.0,0.337852,38295640.0,684875.0,0.67,-10.78,-74.6,2017-11-12 23:51:38.894699,0
369,DIME Dimecoin,DIME,2610862.0,5e-06,537179000000.0,1922.0,1.55,-42.23,-74.56,2017-11-12 23:51:38.894699,1
428,NYC NewYorkCoin,NYC,1407373.0,1.1e-05,129032800000.0,6220.0,1.46,-21.66,-73.8,2017-11-12 23:51:38.894699,1
452,GRE Greencoin,GRE,1035818.0,0.000294,3527281000.0,2081.0,0.71,69.54,-77.56,2017-11-12 23:51:38.894699,0
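The filter works because the comparison produces a boolean Series (a mask), and indexing with it keeps only the rows where the mask is `True`. Here is a sketch on a few rows copied from the tables above that also sorts the result so the biggest drops come first:

```python
import pandas as pd

# A handful of rows taken from the outputs above.
sample = pd.DataFrame({
    'symbol': ['BTC', 'BQ', 'GRE', 'MCAP'],
    'percent_change_7d': [-17.44, -76.45, -77.56, -74.6],
})

mask = sample['percent_change_7d'] <= -60  # one True/False per row
losers = sample[mask].sort_values('percent_change_7d')  # most negative first

print(mask.tolist())              # [False, True, True, True]
print(losers['symbol'].tolist())  # ['GRE', 'BQ', 'MCAP']
```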


Since we don't care about EVERY ICO, we can filter the dataframe by relevant symbols.

In [31]:
watchlist = [
    'LTC',
    'BTC',
    'NEO'
]

In [32]:
df_w = df[df['symbol'].isin(watchlist)]
df_w

Unnamed: 0,name,symbol,market_cap_usd,price_usd,circulating_supply,volume_24h_usd,percent_change_1h,percent_change_24h,percent_change_7d,scrape_timestamp,is_minable
0,BTC Bitcoin,BTC,101127500000.0,6063.97,16676775.0,8482620000.0,0.83,2.02,-17.44,2017-11-12 23:51:38.894699,1
5,LTC Litecoin,LTC,3231123000.0,60.05,53805782.0,324139000.0,-0.05,-0.74,10.1,2017-11-12 23:51:38.894699,1
7,NEO NEO,NEO,1784991000.0,27.46,65000000.0,60386900.0,0.23,1.39,3.91,2017-11-12 23:51:38.894699,0


We can also calculate values in BTC by dividing each USD price by the USD price of BTC.

In [33]:
btc_price = df[df['symbol'] == 'BTC']['price_usd'].iloc[0]
btc_price

6063.9700000000003

In [34]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
df_w = df_w.assign(price_btc=df_w['price_usd'] / btc_price)


In [35]:
df_w[['name', 'symbol', 'price_usd', 'price_btc']].head()

Unnamed: 0,name,symbol,price_usd,price_btc
0,BTC Bitcoin,BTC,6063.97,1.0
5,LTC Litecoin,LTC,60.05,0.009903
7,NEO NEO,NEO,27.46,0.004528


The data we just scraped is also [available in BTC](https://coinmarketcap.com/coins/views/all/#BTC), rather than USD.<br>
However, that table is rendered using JavaScript, <br>
so it can't be scraped unless we use a browser automation tool such as Selenium.

## Conclusions <a id='next-steps'></a>
Having programmatic access to ICO prices is a first step for many applications.<br>
Please use this information responsibly!

Here are some next steps:
- Do this for BTC units.
- Host the data on S3 or another open source outlet with programmatic access.
- Host a cloud instance that generates this dataset at a regular interval.
- Analyze the effect of BTC's price on alt-coins.

I think there is some good software to be written.
This is a passion project. If you find it helpful, any suggestions, time, or donations help!

Wallet locations for donations:
<a id='TODO'>todo</a>