My primary motivation is to draw some basic inferences from the Bitcoin blockchain by performing a statistical analysis of various fundamental factors affecting the network.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime
from statsmodels.graphics.tsaplots import plot_acf
import seaborn as sns
import numpy as np
import statsmodels.api as sm
%matplotlib inline


#Reading data from the input directory
bitcoin_data = pd.read_csv('../input/bitcoin_dataset.csv', header=0, parse_dates=['Date'])
bitcoin_data['Year'] = bitcoin_data['Date'].dt.year
bitcoin_data['Month'] = bitcoin_data['Date'].dt.month
bitcoin_data.head(3)

First, we visualize when the blockchain network saw growth in the total number of transactions.

In [None]:
plt.plot(bitcoin_data['Date'], bitcoin_data['btc_n_transactions_total'])
plt.show()

Next, we look at the increase in the processing power of miners (hash rate) across the entire network as a function of time.

In [None]:
plt.plot(bitcoin_data['Date'], bitcoin_data['btc_hash_rate'])
plt.show()

The graphs above suggest that the network started gaining users and mining power from 2015 onwards, so we focus on activity from 2015 onwards.
We filter the dataset accordingly.

In [None]:
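# .copy() makes the filtered frame independent, so the column assignments below
# do not trigger pandas' SettingWithCopyWarning
bitcoin_data = bitcoin_data.loc[bitcoin_data['Date'] > datetime(2015,1,1)].copy()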

# Transform the total number of transactions into a scale of million
bitcoin_data['btc_n_transactions_total'] = bitcoin_data['btc_n_transactions_total']/1000000

# The dataset has btc_miners_revenue, which is basically the total value in bitcoin earned by miners.
# From the perspective of a user who wants to transact in bitcoin, however, we define another
# parameter that measures the average cost incurred by the user per transaction.
# Note that btc_n_transactions_total was already rescaled to millions above.
bitcoin_data['Avg_Txn_Fee'] = bitcoin_data['btc_transaction_fees']/bitcoin_data['btc_n_transactions_total']


Next, we plot the relationships among several variables:

In [None]:
sns.pairplot(bitcoin_data[bitcoin_data.columns[[8,9,10,11,13,24]]],hue='Year',palette='afmhot')

**Key observations**
1. Median confirmation time for a transaction (btc_median_confirmation_time) shows a roughly exponential relationship with the average number of transactions per block (btc_n_transactions_per_block).
2. Hash rate and difficulty level of the blockchain have a strong linear relationship. This is expected, since a larger hash rate results in faster mining of blocks and the difficulty level is then adjusted accordingly. Please refer to the Bitcoin Core reference docs for more details.
3. Median confirmation time also exhibits a relationship with btc_transaction_fees.
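
The linear link in observation 2 can be made concrete. The sketch below assumes the standard Bitcoin relation that a block at difficulty D takes about D * 2**32 hashes on average, together with the 600-second target block interval. The function name and the difficulty values are hypothetical and chosen only for illustration; they are not taken from the dataset.

```python
# Sketch: hash rate implied by a difficulty level, assuming
# expected_hashes_per_block = difficulty * 2**32 and a 600-second
# target block interval. Names and inputs are illustrative only.
TARGET_BLOCK_SECONDS = 600
HASHES_PER_DIFFICULTY_UNIT = 2 ** 32


def implied_hash_rate(difficulty, block_seconds=TARGET_BLOCK_SECONDS):
    """Return the hashes per second needed to find one block per interval."""
    return difficulty * HASHES_PER_DIFFICULTY_UNIT / block_seconds


# Doubling the difficulty doubles the implied hash rate (a linear relation).
low = implied_hash_rate(1e11)
high = implied_hash_rate(2e11)
print(low, high, high / low)
```

Because the difficulty is retargeted so that blocks keep arriving at the target interval, the two series should move together, which is what the pair plot shows.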

As a next step, we explore points 1 and 3 in more detail.

In [None]:
# Median Txn time vs. Log(no. of transactions per block)
# x and y are passed as keyword arguments (required by seaborn 0.12 and later)
sns.lmplot(x='btc_n_transactions_per_block', y='btc_median_confirmation_time',
           data=pd.concat([bitcoin_data['btc_median_confirmation_time'],
                           np.log(bitcoin_data['btc_n_transactions_per_block']),
                           bitcoin_data['Year']], axis=1),
           hue='Year', fit_reg=False)

plt.xlabel('Log(No. of transactions/block)')
plt.ylabel('Median Time')

In [None]:
# Median Txn time vs Avg fee per transaction
# x and y are passed as keyword arguments (required by seaborn 0.12 and later)
sns.lmplot(x='Avg_Txn_Fee', y='btc_median_confirmation_time',
           data=pd.concat([bitcoin_data['btc_median_confirmation_time'],
                           bitcoin_data['Year'], bitcoin_data['Avg_Txn_Fee']], axis=1),
           hue='Year', fit_reg=False)

plt.xlabel('Average Transaction fees')
plt.ylabel('Median Time')

We see two outliers in the 2016 data: a median confirmation time close to 30 and an average transaction fee above 2.5.
The other parameters for these observations are well within their usual ranges, and there was no significant news on the corresponding dates. Hence, we rule out the possibility that these values are genuine and remove these data points.

In [None]:
bitcoin_data_2015 = bitcoin_data.loc[bitcoin_data['Year']==2015]
bitcoin_data_2016 = bitcoin_data.loc[bitcoin_data['Year']==2016]
bitcoin_data_2017 = bitcoin_data.loc[bitcoin_data['Year']==2017]

bitcoin_data_2016 = bitcoin_data_2016.loc[bitcoin_data_2016['btc_median_confirmation_time'] < 25]
bitcoin_data_2016 = bitcoin_data_2016.loc[bitcoin_data_2016['Avg_Txn_Fee'] < 2.5]


In [None]:
# Let's check the correlation between btc_n_transactions_per_block and Avg_Txn_Fee
# for each year, using the per-year frames so the 2016 outliers stay excluded
for year_data in (bitcoin_data_2015, bitcoin_data_2016, bitcoin_data_2017):
    print(np.corrcoef(year_data['btc_n_transactions_per_block'],
                      year_data['Avg_Txn_Fee'])[0, 1])

The two variables are only weakly correlated, so multicollinearity is unlikely to be a problem and we can include both in a multiple regression.
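
One way to put a number on this, sketched here as an aside and not computed in this kernel: with exactly two regressors, the variance inflation factor (VIF) of either one equals 1 / (1 - r^2), where r is their pairwise correlation. The helper name and the r values below are hypothetical and are not taken from the dataset.

```python
# Sketch: variance inflation factor for a regression with exactly two
# regressors, where VIF = 1 / (1 - r**2) and r is their correlation.
# The function name and the sample r values are illustrative assumptions.
def vif_two_regressors(r):
    """Return the VIF shared by both regressors given their correlation r."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return 1.0 / (1.0 - r * r)


# A weak correlation barely inflates the coefficient variances,
# while a strong one inflates them several times over.
for r in (0.1, 0.5, 0.9):
    print(r, round(vif_two_regressors(r), 2))
```

A VIF close to 1 means the coefficient standard errors are hardly affected by the overlap between the two predictors.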

In [None]:
# Regression year 2015
# Median confirmation time ~ log(no. of transactions/block) + average transaction fee
# .copy() avoids pandas' SettingWithCopyWarning when the log column is added
reg_data_2015 = bitcoin_data_2015[['btc_median_confirmation_time', 'btc_n_transactions_per_block',
                                   'Avg_Txn_Fee']].copy()
reg_data_2015['log_txn_block'] = np.log(reg_data_2015['btc_n_transactions_per_block'])
reg_data_2015 = reg_data_2015.drop('btc_n_transactions_per_block', axis=1)
reg_data_2015_exog = sm.add_constant(reg_data_2015[['Avg_Txn_Fee', 'log_txn_block']], prepend=False)
model_2015 = sm.OLS(reg_data_2015['btc_median_confirmation_time'],reg_data_2015_exog)
model_2015.fit().summary()

The fit is poor, as the low R-squared statistic shows.

In [None]:
# Regression year 2016
# Median confirmation time ~ log(no. of transactions/block) + average transaction fee
# .copy() avoids pandas' SettingWithCopyWarning when the log column is added
reg_data_2016 = bitcoin_data_2016[['btc_median_confirmation_time', 'btc_n_transactions_per_block', 'Avg_Txn_Fee']].copy()
reg_data_2016['log_txn_block'] = np.log(reg_data_2016['btc_n_transactions_per_block'])
reg_data_2016 = reg_data_2016.drop('btc_n_transactions_per_block', axis=1)
reg_data_2016_exog = sm.add_constant(reg_data_2016[['Avg_Txn_Fee', 'log_txn_block']], prepend=False)
model_2016 = sm.OLS(reg_data_2016['btc_median_confirmation_time'],reg_data_2016_exog)
model_2016.fit().summary()

**Key observations**
* Avg_Txn_Fee has a significant positive coefficient. This suggests that miners select transactions into the current block rather greedily, so a transaction carrying only an average fee waits longer, on average, to be accepted into a block.
* This is corroborated by a research paper from the Bitfury Group, which says that miners first order transactions by decreasing fee density (transaction fee / transaction size) and then select them.
* An increase in the number of transactions within a block increases the block size, so the block takes more time to validate and more time to propagate over the network. This is known as the impedance of a block.
* This is reflected in the positive coefficient for log_txn_block.
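
The greedy selection described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the exact policy of any mining software: the `Txn` type, the function name, the sample mempool and the block size limit are all hypothetical.

```python
# Sketch: greedy block building by decreasing fee density (fee / size).
# Txn, select_by_fee_density and the sample values are illustrative only.
from typing import List, NamedTuple


class Txn(NamedTuple):
    txid: str
    fee: float  # fee offered by the transaction
    size: int   # transaction size in bytes


def select_by_fee_density(mempool: List[Txn], max_block_size: int) -> List[Txn]:
    """Pick transactions in order of decreasing fee density until the block is full."""
    ordered = sorted(mempool, key=lambda t: t.fee / t.size, reverse=True)
    chosen, used = [], 0
    for txn in ordered:
        # Skip any transaction that no longer fits in the remaining space.
        if used + txn.size <= max_block_size:
            chosen.append(txn)
            used += txn.size
    return chosen


mempool = [Txn("a", 5000, 250), Txn("b", 1000, 250),
           Txn("c", 9000, 500), Txn("d", 300, 300)]
block = select_by_fee_density(mempool, max_block_size=800)
print([t.txid for t in block])
```

In this toy mempool the two low-density transactions are left for a later block, which is the kind of delay the regression coefficient hints at.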

In [None]:
# Regression year 2017
# Median confirmation time ~ log(no. of transactions/block) + average transaction fee
# .copy() avoids pandas' SettingWithCopyWarning when the log column is added
reg_data_2017 = bitcoin_data_2017[['btc_median_confirmation_time', 'btc_n_transactions_per_block', 'Avg_Txn_Fee']].copy()
reg_data_2017['log_txn_block'] = np.log(reg_data_2017['btc_n_transactions_per_block'])
reg_data_2017 = reg_data_2017.drop('btc_n_transactions_per_block', axis=1)
reg_data_2017_exog = sm.add_constant(reg_data_2017[['Avg_Txn_Fee', 'log_txn_block']], prepend=False)
model_2017 = sm.OLS(reg_data_2017['btc_median_confirmation_time'],reg_data_2017_exog)
model_2017.fit().summary()

* The coefficients of both independent variables are still positive, but Avg_Txn_Fee is no longer statistically significant, probably because the data are more dispersed, as the charts above show.
* In both 2016 and 2017 the Durbin-Watson statistic is around 2, which indicates little to no first-order autocorrelation among the residuals.
* log_txn_block emerges as a viable predictor of the median confirmation time (btc_median_confirmation_time) in both years.
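
For reference, the Durbin-Watson statistic is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. The plain-Python sketch below uses made-up residual sequences, not the residuals of the models above, to show why values near 2 point to little first-order autocorrelation.

```python
# Sketch: Durbin-Watson statistic computed from a list of residuals.
# The residual sequences below are made up for illustration.
def durbin_watson(residuals):
    """Return sum((e_t - e_{t-1})**2) / sum(e_t**2)."""
    numerator = sum((residuals[i] - residuals[i - 1]) ** 2
                    for i in range(1, len(residuals)))
    denominator = sum(e * e for e in residuals)
    return numerator / denominator


# Residuals that drift slowly give a value near 0 (positive autocorrelation).
print(durbin_watson([1.0, 2.0, 3.0, 4.0]))
# Residuals that flip sign every step give a value above 2 (negative autocorrelation).
print(durbin_watson([1.0, -1.0, 1.0, -1.0]))
```

The statistic ranges from 0 to 4, with 2 corresponding to no first-order autocorrelation.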

In [None]:
# Time Series analysis of log of avg. no transactions per block over time.
plt.plot(bitcoin_data['Date'], np.log(bitcoin_data['btc_n_transactions_per_block']))
plt.xticks(rotation=45)
plt.xlabel('Date')
plt.ylabel('Log(Avg_n_transactions)')

The series exhibits a trend. Let's see whether it is time-dependent or whether the series values depend on previous ones. We consider data before August 2017, since the Bitcoin blockchain underwent a hard fork at that point and split into Bitcoin Cash (BCH) and Bitcoin (BTC), hence the sudden drop in the parameter after 2017-07. We want to isolate our time series from extraneous events that could be considered equivalent to a regime change.

In [None]:
# Keep the result as a one-column DataFrame and copy it so later assignments are safe
block_txn = bitcoin_data.loc[bitcoin_data['Date'] < datetime(2017,7,31), ['btc_n_transactions_per_block']].copy()
block_txn['log_txn_block'] = np.log(block_txn['btc_n_transactions_per_block'])
block_txn = block_txn.drop('btc_n_transactions_per_block', axis=1)

block_txn['time'] = block_txn.index - block_txn.index[0] + 1
block_txn_exog = sm.add_constant(block_txn[['time']], prepend=False)
model_txn_blk = sm.OLS(block_txn['log_txn_block'], block_txn_exog)
results_txn_blk = model_txn_blk.fit()

plot_acf(results_txn_blk.resid, lags=50)

We regress log(no. of transactions per block) = m*t + c, where t is time, and check whether the residuals are autocorrelated. As shown above, the autocorrelations are significant at lags 1 and 2 and periodically at lags 7, 14, 21, etc. This indicates seasonality in the series. Next, we regress the series against its lag 1 and lag 7 values.
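
To see why a weekly pattern produces spikes at lags 7, 14, 21 and so on, here is a small plain-Python sketch of the sample autocorrelation. It uses a synthetic sine wave with a period of 7 as a stand-in for weekly seasonality; the function name and the series are illustrative assumptions, not the residuals plotted above.

```python
# Sketch: sample autocorrelation of a synthetic series with a 7-step cycle.
# sample_acf and the sine-wave series are illustrative assumptions.
import math


def sample_acf(series, lag):
    """Return the sample autocorrelation of `series` at the given lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    denominator = sum(d * d for d in dev)
    numerator = sum(dev[t] * dev[t - lag] for t in range(lag, n))
    return numerator / denominator


# Ten full weekly cycles of a sine wave.
weekly = [math.sin(2 * math.pi * t / 7) for t in range(70)]
for lag in (1, 3, 7, 14):
    print(lag, round(sample_acf(weekly, lag), 3))
```

The autocorrelation peaks at multiples of 7 and turns negative about half a cycle away, the same shape a weekly pattern leaves in an ACF plot.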

In [None]:
block_txn['log_txn_block_lag_7'] = block_txn['log_txn_block'].shift(7)
block_txn['log_txn_block_lag_1'] = block_txn['log_txn_block'].shift(1)

block_txn = block_txn.dropna()
model_txn_blk = sm.OLS(block_txn['log_txn_block'], block_txn[['log_txn_block_lag_7', 'log_txn_block_lag_1']])
results_txn_blk = model_txn_blk.fit()
plot_acf(results_txn_blk.resid, lags=50)

Some autocorrelation still persists at a few lags. However, adding more variables increases the risk of overfitting, so lags should be selected with care. Overall, the model proves a good fit. We may also want to test its predictive accuracy on future data.
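
A minimal sketch of such an out-of-sample check, assuming a chronological split and one-step-ahead forecasts: fit the two-lag regression (no intercept, as above) on the early part of the series and measure the RMSE on the later part. The function names are hypothetical, and the series is synthetic and noise-free so that the example runs without the dataset; on real data the error would not be close to zero.

```python
# Sketch: chronological hold-out test for y[t] = a*y[t-1] + b*y[t-7].
# Function names and the synthetic series are illustrative assumptions.
import math


def fit_two_lag_ols(y, lag_a=1, lag_b=7):
    """Least-squares fit of y[t] = a*y[t-lag_a] + b*y[t-lag_b], no intercept."""
    saa = sab = sbb = say = sby = 0.0
    for t in range(max(lag_a, lag_b), len(y)):
        xa, xb = y[t - lag_a], y[t - lag_b]
        saa += xa * xa
        sab += xa * xb
        sbb += xb * xb
        say += xa * y[t]
        sby += xb * y[t]
    # Solve the 2x2 normal equations with Cramer's rule.
    det = saa * sbb - sab * sab
    return (say * sbb - sab * sby) / det, (saa * sby - sab * say) / det


def holdout_rmse(y, split, lag_a=1, lag_b=7):
    """Fit on y[:split], then score one-step-ahead forecasts on y[split:]."""
    a, b = fit_two_lag_ols(y[:split], lag_a, lag_b)
    errors = [y[t] - (a * y[t - lag_a] + b * y[t - lag_b])
              for t in range(split, len(y))]
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Synthetic series that follows the two-lag recursion exactly.
y = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 3.5]
for t in range(7, 120):
    y.append(0.5 * y[t - 1] + 0.4 * y[t - 7])

a, b = fit_two_lag_ols(y[:90])
print(a, b, holdout_rmse(y, split=90))
```

Splitting by time and not at random keeps future observations out of the training set, which matters for an autoregressive model.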

Future studies should focus on modelling the probability of orphaned blocks as a function of other network variables.
An analysis of miners' incentives can be done by engineering further features from our data.
I will post an update on this soon!