# Intro
The purpose of this work is to build a dataset of all bitcoin blocks together with historical bitcoin prices.

There are several ways to get block information, e.g.:

- The number of blocks per month can be queried directly from the bitcoin blockchain with Python/RPC/bitcoind: https://github.com/jgarzik/python-bitcoinrpc
- For that we would need a Bitcoin full node, so instead we load the CSV file from the GitHub page
- Here we can select which data to extract from the blockchain: https://blockchair.com/bitcoin/blocks
- Some further info: https://bitcoin.stackexchange.com/questions/73186/csv-file-of-every-block-timestamp-in-btc-history


Concerning bitcoin prices, I found this: 
- https://www.reddit.com/r/algotrading/comments/b543yn/made_a_webapp_to_get_price_and_indicator_as_csv/


# Download blocks from Blockchair

I tried to follow the export instructions by Marcel Burger here: https://medium.com/burgercrypto-com/building-a-bitcoin-dataset-b2f526d667ce

Unfortunately, as of February 23, 2020, I cannot find the "Export" function that Marcel refers to, so I am extracting the block data from the dumps offered by Blockchair here: https://gz.blockchair.com/bitcoin/blocks/
It is quite cumbersome, but in the end we get one CSV file with one line per block. Pretty cool, I should say.

*Warning*: The following code will download a large amount of data to your computer (ca. 400MB), so make sure you have enough disk space (and patience) for the download.
The good thing is that you only need to run this process once; after that you can update your dataset incrementally by adjusting the start and end dates.
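
To make the incremental update a bit easier, here is a minimal sketch that lists the days for which no dump file exists yet in the download directory. The function name `missing_dates` is hypothetical (it is not part of the notebook), and the default prefix and suffix are assumed to match the file names used in the download code below.

```python
from datetime import date, timedelta
from pathlib import Path


def missing_dates(download_dir, start_date, end_date,
                  prefix="blockchair_bitcoin_blocks_", suffix=".tsv.gz"):
    """Return the dates in [start_date, end_date) with no dump file on disk.

    Hypothetical helper: prefix and suffix are assumptions that mirror
    the file naming used for the Blockchair dumps in this notebook.
    """
    folder = Path(download_dir)
    missing = []
    for n in range((end_date - start_date).days):
        day = start_date + timedelta(days=n)
        # date objects support strftime-style format specs in f-strings
        if not (folder / f"{prefix}{day:%Y%m%d}{suffix}").exists():
            missing.append(day)
    return missing
```

The returned list could then drive the download loop instead of the full date range, so that only new days are fetched.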

In [None]:
# Let's first install all modules used elsewhere in this code
# !pip install pandas

In [None]:
import urllib.request
import gzip
import shutil
from datetime import timedelta, date


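def daterange(start_date, end_date):
    # Yield every date from start_date up to, but not including, end_date
    for n in range((end_date - start_date).days):
        yield start_date + timedelta(n)

def unpack(filename_gz, filename_unpacked):
    # Decompress a .gz file; report the problem instead of stopping the whole run
    try:
        with gzip.open(filename_gz, 'rb') as f_in:
            with open(filename_unpacked, 'wb') as f_out:
                shutil.copyfileobj(f_in, f_out)
    except Exception as err:
        print("ERROR - Couldn't unpack the file " + filename_gz + ": " + str(err))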


BLOCKS_DOWNLOAD_DIR = './blocks/'
START_DATE = date(2016, 1, 1) # originally: 20090103
END_DATE = date(2020, 2, 23) # 20220223
FILENAME_PREFIX = 'blockchair_bitcoin_blocks_'
FILENAME_SUFIX = '.tsv.gz'
URL_PREFIX = 'https://gz.blockchair.com/bitcoin/blocks/'

In [None]:
import os

# Make sure the download directory exists before writing into it
os.makedirs(BLOCKS_DOWNLOAD_DIR, exist_ok=True)

for single_date in daterange(START_DATE, END_DATE):
    day = single_date.strftime("%Y%m%d")
    filename = FILENAME_PREFIX + day + FILENAME_SUFIX
    url = URL_PREFIX + filename
    print("Downloading: " + url)
    try:
        urllib.request.urlretrieve(url, BLOCKS_DOWNLOAD_DIR + filename)
        print("File downloaded: " + url)
        print("--> Unpacking file ...")
        unpack(BLOCKS_DOWNLOAD_DIR + filename, BLOCKS_DOWNLOAD_DIR + FILENAME_PREFIX + day + '.tsv')
        print("--> File unpacked")
    except IOError:
        print("ERROR - Couldn't download file: " + url)

## Merge all individual TSV files into one CSV file

In [None]:
import csv

ALL_BLOCKS_TSV = 'all_blocks.tsv'
ALL_BLOCKS_CSV = 'all_blocks.csv'
START_DATE = date(2009, 1, 3)  # first dump file; it provides the header, so it must be one of the downloaded files
START_DATE_PLUS_ONE = date(2009, 1, 4)
END_DATE = date(2020, 2, 23)  # 20220223 # The end date is not included in the range iteration

with open(BLOCKS_DOWNLOAD_DIR + ALL_BLOCKS_TSV, "w") as fout:
    # First file: copy it completely, including the header row
    fin = BLOCKS_DOWNLOAD_DIR + FILENAME_PREFIX + START_DATE.strftime("%Y%m%d") + '.tsv'
    try:
        print('Trying to create headers and initial contents from file ' + fin)
        with open(fin) as f:
            for line in f:
                # Normalize so every row ends with exactly one newline
                fout.write(line.rstrip('\n') + '\n')
    except IOError:
        print("Problem with file " + fin)

    # Now the rest: skip each header row and append only the data rows
    for single_date in daterange(START_DATE_PLUS_ONE, END_DATE):
        fin = BLOCKS_DOWNLOAD_DIR + FILENAME_PREFIX + single_date.strftime("%Y%m%d") + '.tsv'
        try:
            print('Trying to append contents of file ' + fin)
            with open(fin) as f:
                f.readline()  # skip the header
                for line in f:
                    fout.write(line.rstrip('\n') + '\n')
            print('... Success!')
        except IOError:
            print("... Problem with file " + fin)

    
# Convert to CSV
# Read the tab-delimited file and write the comma-delimited file row by row,
# so the whole dataset never has to fit in memory (comma is the default delimiter).
# quotechar=None is used because an empty quotechar is rejected since Python 3.11.
with open(BLOCKS_DOWNLOAD_DIR + ALL_BLOCKS_TSV, 'r', newline='') as fin, \
        open(BLOCKS_DOWNLOAD_DIR + ALL_BLOCKS_CSV, 'w', newline='') as fout:
    cr = csv.reader(fin, delimiter='\t')
    cw = csv.writer(fout, quotechar=None, quoting=csv.QUOTE_NONE, escapechar='\\')
    cw.writerows(cr)
    
print('COOL! CSV file with all blocks created in ' + BLOCKS_DOWNLOAD_DIR + ALL_BLOCKS_CSV)
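
As a quick sanity check of the merged file, the number of blocks per month (mentioned in the intro) can be computed directly from the CSV. This is only a sketch under assumptions: the function name `blocks_per_month` is hypothetical, and I assume the Blockchair dump has a column called `time` with timestamps of the form `YYYY-MM-DD hh:mm:ss`. Check the header of your own file and adjust `time_column` if needed.

```python
import csv
from collections import Counter


def blocks_per_month(csv_path, time_column="time"):
    """Count rows per 'YYYY-MM' month, based on the timestamp column.

    Assumption: time_column holds timestamps starting with 'YYYY-MM'.
    """
    counts = Counter()
    with open(csv_path, newline="") as f:
        # Same dialect as the writer above: no quoting, backslash as escape
        reader = csv.DictReader(f, quoting=csv.QUOTE_NONE, escapechar="\\")
        for row in reader:
            timestamp = row.get(time_column)
            if timestamp:  # skip incomplete rows
                counts[timestamp[:7]] += 1
    return dict(sorted(counts.items()))
```

Calling `blocks_per_month(BLOCKS_DOWNLOAD_DIR + ALL_BLOCKS_CSV)` should return a dictionary with one entry per month; a month with a suspiciously low count may hint at a dump file that failed to download.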

## Clean up
If it makes sense to you, uncomment and run the following lines to delete the individual files, both .tsv and .tsv.gz.

*Note*: If you feel more comfortable deleting the files manually from your file explorer, go ahead, that's easy.

In [None]:
# from pathlib import Path
# for p in Path(BLOCKS_DOWNLOAD_DIR).glob(FILENAME_PREFIX + "*"):
#     p.unlink()

# Download bitcoin prices

Blockchain.com offers a CSV download for the desired date range. The only drawback is that the price data points are spaced 3 days apart.

https://www.blockchain.com/charts/market-price?timespan=all

Here is the CSV download:

https://api.blockchain.info/charts/market-price?timespan=all&format=csv
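
The download can also be done from Python. The following is a minimal sketch, not a tested pipeline: the function names `parse_prices` and `download_prices` are hypothetical, and I assume that each row of the CSV has the form `timestamp,price` with timestamps like `2009-01-03 00:00:00`. Rows that do not match this shape (for example a header line) are skipped.

```python
import csv
import io
import urllib.request
from datetime import datetime

PRICES_URL = "https://api.blockchain.info/charts/market-price?timespan=all&format=csv"


def parse_prices(csv_text):
    """Parse 'timestamp,price' rows into a list of (datetime, float) tuples.

    Assumption: timestamps use the format 'YYYY-MM-DD hh:mm:ss'.
    """
    prices = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # blank or incomplete line
        try:
            when = datetime.strptime(row[0], "%Y-%m-%d %H:%M:%S")
            prices.append((when, float(row[1])))
        except ValueError:
            continue  # header row or malformed line
    return prices


def download_prices(url=PRICES_URL):
    """Fetch the CSV from the given URL and parse it (needs network access)."""
    with urllib.request.urlopen(url) as response:
        return parse_prices(response.read().decode("utf-8"))
```

Calling `download_prices()` returns the parsed price series; keeping the parsing separate from the download makes it easy to re-run on a CSV that was saved manually from the link above.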


* Here too, I tried to follow the export instructions by Marcel Burger: https://medium.com/burgercrypto-com/building-a-bitcoin-dataset-b2f526d667ce
However, the links in the Reddit post he mentions no longer work.