Our goal in this notebook is not to predict, as accurately as possible, the price of bitcoin tomorrow. Instead we want to see how we can use a machine learning algorithm called Random Forest to create a model that can predict bitcoin prices using historical data on bitcoin supply and demand.


### How Random Forest Works

Random Forest will make use of decision trees to understand how these factors affected bitcoin prices in the past. Simply put, a decision tree is a flowchart-like mapping of inputs and outputs.

![](https://46gyn61z4i0t1u1pnq2bbk2e-wpengine.netdna-ssl.com/wp-content/uploads/2018/07/what-is-a-decision-tree.png)
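
To make the flowchart idea concrete, here is a minimal hand-written sketch of one tiny regression tree. The function name `toy_tree_predict`, the thresholds and the leaf prices are all invented for illustration; a real tree learns its splits from the training data.

```python
def toy_tree_predict(total_bitcoins, difficulty):
    """Follow a hypothetical flowchart of yes/no questions down to a leaf price (USD)."""
    # All thresholds and leaf values below are invented for illustration only.
    if total_bitcoins < 10_000_000:
        return 5.0          # few coins mined so far
    if difficulty < 1e10:
        return 250.0        # more coins mined, moderate mining difficulty
    return 1_200.0          # many coins mined and high mining difficulty


print(toy_tree_predict(total_bitcoins=8_000_000, difficulty=1e6))    # 5.0
print(toy_tree_predict(total_bitcoins=16_000_000, difficulty=5e11))  # 1200.0
```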

The patterns captured by these decision trees dictate our price prediction model. The more data we feed the Random Forest algorithm, the more opportunities it has to find new patterns and verify existing ones. It then averages the predictions of the individual trees to produce a more reliable prediction. To test how the predicted values fare against the actual values, we will set aside some data points from our dataset that the model never sees during training.
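
The averaging step can be sketched in plain Python. The three "trees" below are hypothetical hand-written rules (real trees are learned from data), and `forest_predict` is an illustrative name rather than a library function.

```python
from statistics import mean

# Three hypothetical "trees": each maps a difficulty value to a price guess.
# The rules are invented for illustration only.
def tree_a(difficulty):
    return 100.0 if difficulty < 1e9 else 900.0

def tree_b(difficulty):
    return 150.0 if difficulty < 5e9 else 1_100.0

def tree_c(difficulty):
    return 80.0 if difficulty < 1e8 else 700.0

def forest_predict(difficulty, trees=(tree_a, tree_b, tree_c)):
    """Average the individual tree predictions, as a regression forest does."""
    return mean(tree(difficulty) for tree in trees)

# The trees answer 900, 150 and 700, so the forest predicts about 583.33.
print(forest_predict(2e9))
```

Because each tree makes different mistakes, the average tends to be steadier than any single tree's guess.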

Before we do that, let’s first see what’s in our dataset.

### Quick Data Exploration

Our dataset has 2,920 rows. These represent all the days between February 23, 2010 and May 19, 2017.

In [None]:
import pandas as pd
import os
df = pd.read_csv("/kaggle/input/cryptocurrencypricehistory/bitcoin_dataset.csv", parse_dates=['Date'])
df.shape

We are specifically interested in how the following factors affect the average market price of bitcoin across major bitcoin exchanges (btc_market_price):
1. Total number of bitcoins that have already been mined (btc_total_bitcoins)
1. A relative measure of how difficult it is to find a new bitcoin block (btc_difficulty)
1. Total number of unique addresses used on the Bitcoin blockchain (btc_n_unique_addresses) 
1. Total value of coinbase block rewards paid to miners (btc_rewards)

When I was reading the data dictionary, I got confused about the difference between btc_miners_revenue and btc_transaction_fees. They sound like they overlap:
> - btc_miners_revenue : Total value of coinbase block rewards and transaction fees paid to miners.
> - btc_transaction_fees : The total value of all transaction fees paid to miners.

If I only want the coinbase block rewards paid to miners, and assuming that miners' revenue includes transaction fees, I have to create a new column by taking the difference between the two columns:



In [None]:
df['btc_rewards'] = df['btc_miners_revenue'] - df['btc_transaction_fees']

Below is a summary of the minimum, maximum and average values for each factor we're interested in, as well as their 25th, 50th and 75th percentile values.

In [None]:
interest = ["btc_market_price","btc_total_bitcoins", "btc_difficulty", "btc_rewards", "btc_transaction_fees", "btc_n_unique_addresses"]
df2 = df[interest]
df2.describe()
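
For readers unfamiliar with percentiles, here is a small sketch of what the rows of such a summary mean, using only the standard library. The `prices` list is hypothetical and not taken from the dataset.

```python
from statistics import mean, quantiles

# Hypothetical daily prices, used only to show what each summary row means.
prices = [10.0, 20.0, 30.0, 40.0, 50.0]

# method="inclusive" interpolates linearly between data points, the same
# kind of interpolation pandas uses by default for its percentiles.
q25, q50, q75 = quantiles(prices, n=4, method="inclusive")

summary = {
    "min": min(prices),
    "25%": q25,   # a quarter of the values lie at or below this
    "50%": q50,   # the median
    "75%": q75,   # three quarters of the values lie at or below this
    "max": max(prices),
    "mean": mean(prices),
}
print(summary)
```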

According to this data, bitcoin prices rose sharply overall between 2010 and 2017, though not without dips along the way:

In [None]:
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

sns.lineplot(data=df2.btc_market_price, label = "Price of Bitcoin Over the Years")

#### We are specifically interested in the effects of these factors on bitcoin prices:

* Total number of bitcoins that have already been mined (btc_total_bitcoins). 
There are only 21 million bitcoins that can ever be mined. By May 19, 2017, there were already ~16.3 million bitcoins mined.


In [None]:
sns.lineplot(data=df2.btc_total_bitcoins, label = "Total Bitcoins Mined Over the Years")
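
The 21 million cap follows from the protocol's reward schedule, which is not part of this dataset: the block reward started at 50 bitcoins and halves every 210,000 blocks. A minimal sketch of that arithmetic (the function name `total_supply_btc` is just illustrative):

```python
SATOSHIS_PER_BTC = 100_000_000   # smallest unit of bitcoin
BLOCKS_PER_HALVING = 210_000

def total_supply_btc():
    """Sum the block rewards over every halving era until the reward hits zero."""
    reward = 50 * SATOSHIS_PER_BTC  # initial block reward, in satoshis
    total = 0
    while reward > 0:
        total += reward * BLOCKS_PER_HALVING
        reward //= 2  # the reward halves (rounding down) every 210,000 blocks
    return total / SATOSHIS_PER_BTC

print(total_supply_btc())  # just under 21 million
```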

* Total value of all transaction fees paid to miners (btc_transaction_fees). When a person spends a bitcoin, that transaction will need to be verified by a set of computers owned by “miners” in the Bitcoin network. Transaction fees are required in order to have these transactions processed by miners. The spender may specify how much fee to add but miners can only confirm 1MB worth of transactions for each batch or block. If the number of transactions waiting to confirm exceeds what can fit in 1 block, a miner can choose to confirm the transactions with the highest bitcoin fees.


In [None]:
sns.lineplot(data=df2.btc_transaction_fees, label = "Transaction Fees Over the Years")
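
The fee-priority behaviour described above can be sketched as a simple greedy selection. This is a simplification under stated assumptions: the transactions, their fees and sizes are hypothetical, `select_transactions` is an illustrative name, and real miners typically rank by fee per byte rather than by total fee.

```python
def select_transactions(mempool, block_limit_bytes=1_000_000):
    """Greedily fill a block with the highest-fee transactions that still fit."""
    chosen, used = [], 0
    # Highest fee first (a simplification of how miners prioritise).
    for tx in sorted(mempool, key=lambda tx: tx["fee"], reverse=True):
        if used + tx["size"] <= block_limit_bytes:
            chosen.append(tx["id"])
            used += tx["size"]
    return chosen

# Hypothetical pending transactions (fees in BTC, sizes in bytes).
mempool = [
    {"id": "a", "fee": 0.0001, "size": 400_000},
    {"id": "b", "fee": 0.0005, "size": 500_000},
    {"id": "c", "fee": 0.0003, "size": 450_000},
    {"id": "d", "fee": 0.0002, "size": 100_000},
]
print(select_transactions(mempool))  # ['b', 'c']
```

Transactions "a" and "d" have to wait for a later block, which is why spenders in a hurry attach higher fees.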

* Total value of coinbase block rewards paid to miners (btc_rewards). To mine bitcoin means to contribute to verifying bitcoin transaction information. This process takes a lot of computing power, and miners are rewarded with newly created bitcoins for each block they successfully add (on top of the transaction fees).


In [None]:
sns.lineplot(data=df2.btc_rewards, label = "Total Value of Block Rewards Paid to Miners Over the Years")

* A relative measure of how difficult it is to find a new block or to successfully verify a batch of transactions (btc_difficulty). Bitcoin’s mining difficulty is designed to adjust every 2,016 blocks (roughly every two weeks) so that a new block keeps being found about every ten minutes. This means that the more computing power miners bring to the network, the harder it becomes to mine the next block.


In [None]:
sns.lineplot(data=df2.btc_difficulty, label = "Difficulty Over the Years")
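
A simplified sketch of the difficulty adjustment rule follows. The constants come from the Bitcoin protocol rather than from this dataset, the function name `retarget` is illustrative, and the real implementation works on a "target" value with a few extra details that are skipped here.

```python
TARGET_SECONDS = 2016 * 10 * 60  # 2,016 blocks at ten minutes each (two weeks)

def retarget(old_difficulty, actual_seconds):
    """Scale difficulty so the next 2,016 blocks take about two weeks."""
    # The protocol limits each adjustment to a factor of four either way.
    clamped = min(max(actual_seconds, TARGET_SECONDS / 4), TARGET_SECONDS * 4)
    return old_difficulty * TARGET_SECONDS / clamped

# Blocks arrived twice as fast as intended (one week instead of two),
# so difficulty doubles.
print(retarget(1_000.0, actual_seconds=7 * 24 * 3600))  # 2000.0
```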

* Total number of unique addresses used on the Bitcoin blockchain (btc_n_unique_addresses). We treat this as a rough proxy for the number of bitcoin wallets in use over the years, although a single wallet can hold many addresses.


In [None]:
sns.lineplot(data=df2.btc_n_unique_addresses, label = "Unique Addresses Over the Years")

### Yearly Growth Rates
A quick look at the yearly growth rate of each factor suggests how it relates to bitcoin prices.
For example, when the number of bitcoin wallets increased by 150% from 2013 to 2014, prices rose by almost 270%. In May 2012, the total number of bitcoins in the market was up 45% from the previous year, while prices were down 30%. So far this follows elementary economics: an increase in demand drives prices up, while an increase in supply drives them down.

In [None]:
# df2 was built without the Date column, so rebuild it from df with Date as the index
df2 = df.set_index('Date')[interest]
# Keep one snapshot per year (every May 19 from 2010 to 2017)
snapshot_dates = pd.to_datetime([f"{year}-05-19" for year in range(2010, 2018)])
df3 = df2[df2.index.isin(snapshot_dates)]
# Year-over-year change of each column, shown as percentages
df4 = df3.pct_change()
df4.style.format("{:.2%}")
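
The growth rates are plain year-over-year percentage changes, (current - previous) / previous. A minimal sketch with hypothetical prices (not taken from the dataset):

```python
def pct_change(values):
    """Return the fractional change of each value relative to the previous one."""
    changes = [None]  # the first value has nothing to compare against
    for previous, current in zip(values, values[1:]):
        changes.append((current - previous) / previous)
    return changes

# Hypothetical yearly prices in USD.
prices = [100.0, 70.0, 140.0]
print(pct_change(prices))  # [None, -0.3, 1.0], i.e. -30% and then +100%
```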

However, a closer look at the regression plots suggests that even though the total number of bitcoins in the market kept increasing, this was not enough to drive the price down. Meanwhile, all the other factors (mining difficulty, coinbase block rewards, transaction fees and the growing number of bitcoin users, proxied by btc_n_unique_addresses) share a positive relationship with btc_market_price, as expected.

In [None]:
# pairplot creates its own figure, so no separate plt.figure() call is needed
sns.pairplot(df2,
             x_vars = ["btc_total_bitcoins", "btc_difficulty", "btc_rewards", "btc_transaction_fees", "btc_n_unique_addresses"],
             y_vars = ["btc_market_price"], kind = 'reg')
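
Each regression plot fits a straight line through the scatter of points, and the sign of its slope tells us whether the relationship is positive or negative. A minimal sketch of that fit, using hypothetical numbers and an illustrative function name:

```python
def least_squares_slope(xs, ys):
    """Slope of the best-fit straight line through the points (xs, ys)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance

# Hypothetical values: price rises as the number of addresses rises.
addresses = [1.0, 2.0, 3.0, 4.0]
prices = [10.0, 30.0, 50.0, 70.0]
print(least_squares_slope(addresses, prices))  # 20.0, a positive relationship
```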

A seaborn scatter plot gives a more detailed view of how market prices have moved in relation to all five factors over the years. We can see that most of the movement is concentrated in the green, purple and pink data points, which represent the years 2013, 2016 and 2017.


In [None]:
df2.reset_index(inplace=True)

In [None]:
# Colour each point by the year it belongs to
df2['Year'] = df2['Date'].dt.year
# pairplot creates its own figure; resize it afterwards
sns.pairplot(df2,
             x_vars = ["btc_total_bitcoins", "btc_difficulty", "btc_rewards", "btc_transaction_fees", "btc_n_unique_addresses"],
             y_vars = ["btc_market_price"], kind = 'scatter', hue = 'Year', palette="husl").fig.set_size_inches(16,4)

## Applying Random Forest
To train a Random Forest on our dataset, we first assign the X and y variables. X holds all the columns that we assume to be factors affecting the price, and y is the price itself.

In [None]:
features = ["btc_total_bitcoins", "btc_difficulty", "btc_rewards", "btc_transaction_fees", "btc_n_unique_addresses"]
X = df2[features]
y = df2['btc_market_price']

Then we split the dataset into a training dataset and a testing (or validation) dataset. We need this so we can test how close our predictions are to the actual prices.

In [None]:
from sklearn.model_selection import train_test_split
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
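
By default, train_test_split shuffles the rows and holds out 25% of them for validation. The idea can be sketched in plain Python; `simple_split` is an illustrative helper, and it will not reproduce the exact rows that scikit-learn picks.

```python
import random

def simple_split(rows, test_fraction=0.25, seed=1):
    """Shuffle the rows, then hold out `test_fraction` of them for validation."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_train = len(shuffled) - round(len(shuffled) * test_fraction)
    return shuffled[:n_train], shuffled[n_train:]

# Row numbers stand in for the 2,920 days in the dataset.
train, val = simple_split(range(2920))
print(len(train), len(val))  # 2190 730
```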

We declare that we’re using Random Forest to create a model:


In [None]:
from sklearn.ensemble import RandomForestRegressor
rf_model = RandomForestRegressor(random_state=1)

We now feed the training data to the Random Forest algorithm so it can learn the patterns and build its decision trees.

In [None]:
rf_model.fit(train_X, train_y)
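
One detail of how the forest gets its variety: by default, each tree is fitted on a bootstrap sample, meaning rows drawn at random from the training data with replacement. A minimal sketch of that sampling step, with row numbers standing in for training days and `bootstrap_sample` as an illustrative helper name:

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows at random, with replacement."""
    return rng.choices(rows, k=len(rows))

rng = random.Random(1)
training_rows = list(range(10))  # stand-ins for ten training days

# Each tree would be fitted on its own resampled copy of the training data,
# so some days appear several times and others not at all.
for tree_number in range(3):
    print(tree_number, sorted(bootstrap_sample(training_rows, rng)))
```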

Once we have a model, we can call predict to predict the prices of bitcoin on our testing or validation data. That is, having trained the model on train_X and train_y, we now want to know whether it can output numbers close to val_y when given val_X. Remember that we split the dataset earlier into a training dataset (train_X, train_y) and a testing dataset (val_X, val_y).

In [None]:
rf_pred = rf_model.predict(val_X)

To compare the model's predictions with the actual prices for the last 5 rows of the dataset (note that some of these rows may have been part of the training split), we can call:

In [None]:
print(rf_model.predict(X.tail()))
df2['btc_market_price'].tail()

Let’s see how far off the predicted values are from the actual values using MAE (Mean Absolute Error). This number tells us how far off our predictions are on average.

In [None]:
from sklearn.metrics import mean_absolute_error
rf_val_mae = mean_absolute_error(val_y,rf_pred)
rf_val_mae
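
For reference, MAE is simply the average of the absolute differences between actual and predicted values. A minimal sketch with hypothetical prices (not the model's actual output):

```python
def mean_absolute_error_manual(actual, predicted):
    """Average of the absolute differences between actual and predicted values."""
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(errors) / len(errors)

# Hypothetical prices in USD.
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]

# The errors are 10, 10 and 30, so the MAE is 50 / 3, about 16.67.
print(mean_absolute_error_manual(actual, predicted))
```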

**Not bad! MAE is telling us that on average, our bitcoin price prediction is off by $41 from the actual prices.**

*fin*