<a href="https://colab.research.google.com/github/sunshineluyao/UTXO/blob/main/UTXO_Data_analysis_Task_4only.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import numpy as np
import pandas as pd
import datetime

# Import Data from Google Drive and Data Wrangling

In [128]:
# Importing drive method from colab for accessing google drive
from google.colab import drive

In [129]:
# Mounting drive
# This will require authentication : Follow the steps as guided
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Note: the data are read from a CSV file stored in the Google Drive folder "UTXO". The file contains every UTXO generated on or before 2010-12-31. For each UTXO it records:


*   the value of the UTXO (in satoshi, the native unit: 1 BTC = 10^8 satoshi)
*   the date on which the UTXO was generated
*   the date on which the UTXO was spent, or "NaN" if it was still unspent on 2020-10-12




In [130]:
import pandas as pd
df_2010=pd.read_csv('/content/drive/My Drive/UTXO/joint_2010.csv',index_col='Unnamed: 0')
df_2010.head()

Unnamed: 0,value,block_date,spent_block_date
0,5000000000,2009-01-03,
21553,5000000000,2009-01-09,2009-01-12
1,5000000000,2009-01-09,
2,5000000000,2009-01-09,
3,5000000000,2009-01-09,


Express the UTXO value in bitcoin (BTC): $\mathrm{UTXO} = \mathrm{value}/10^{8}$

In [131]:
# Divide by 10**8 (exactly representable) rather than multiplying by the inexact float 10**(-8)
df_2010['UTXO'] = df_2010['value'] / 10**8
df_2010.head()

Unnamed: 0,value,block_date,spent_block_date,UTXO
0,5000000000,2009-01-03,,50.0
21553,5000000000,2009-01-09,2009-01-12,50.0
1,5000000000,2009-01-09,,50.0
2,5000000000,2009-01-09,,50.0
3,5000000000,2009-01-09,,50.0


In [132]:
# Drop the raw `value` column and replace the old index with a fresh 0..n-1 RangeIndex
df_2010 = df_2010.reset_index(drop=True)
df_2010 = df_2010.drop(columns='value')
df_2010.head()

Unnamed: 0,block_date,spent_block_date,UTXO
0,2009-01-03,,50.0
1,2009-01-09,2009-01-12,50.0
2,2009-01-09,,50.0
3,2009-01-09,,50.0
4,2009-01-09,,50.0


In [133]:
df_2010.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 137525 entries, 0 to 137524
Data columns (total 3 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   block_date        137525 non-null  object 
 1   spent_block_date  115972 non-null  object 
 2   UTXO              137525 non-null  float64
dtypes: float64(1), object(2)
memory usage: 3.1+ MB


Convert block_date and spent_block_date to datetime objects

In [134]:
# The dates are stored as YYYY-MM-DD, so the format string must use dashes, not slashes
df_2010['block_date'] = pd.to_datetime(df_2010['block_date'], format='%Y-%m-%d')
df_2010['spent_block_date'] = pd.to_datetime(df_2010['spent_block_date'], format='%Y-%m-%d')
df_2010.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 137525 entries, 0 to 137524
Data columns (total 3 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   block_date        137525 non-null  datetime64[ns]
 1   spent_block_date  115972 non-null  datetime64[ns]
 2   UTXO              137525 non-null  float64       
dtypes: datetime64[ns](2), float64(1)
memory usage: 3.1 MB
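
The loading, unit conversion, and date parsing steps above could also be done in a single pass. The following is a minimal sketch, not the notebook's own method: it rebuilds the five rows printed by the first `df_2010.head()` as in-memory CSV text (the names `sample_csv` and `sample` are illustrative) and lets `pd.read_csv` parse the dates directly.

```python
import io
import pandas as pd

# The five rows shown by the first df_2010.head(), written out as CSV text.
# The empty first header cell mirrors the unnamed index column of the file.
sample_csv = io.StringIO(
    ",value,block_date,spent_block_date\n"
    "0,5000000000,2009-01-03,\n"
    "21553,5000000000,2009-01-09,2009-01-12\n"
    "1,5000000000,2009-01-09,\n"
    "2,5000000000,2009-01-09,\n"
    "3,5000000000,2009-01-09,\n"
)

# parse_dates converts both date columns while reading; empty cells become NaT
sample = pd.read_csv(
    sample_csv,
    index_col=0,
    parse_dates=["block_date", "spent_block_date"],
)

# Convert satoshi to BTC, drop the raw column, and renumber the rows
sample["UTXO"] = sample["value"] / 10**8
sample = sample.drop(columns="value").reset_index(drop=True)
```

On the real file the same call would take the Drive path instead of `sample_csv`.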


# Task 4: Calculate the Distribution for UTXO (Age Distribution of the Bitcoins That Are Still Alive, i.e. Unspent)

## We first try to solve the problem for one date and check if the algorithm works

In [135]:
df_2010.head()

Unnamed: 0,block_date,spent_block_date,UTXO
0,2009-01-03,NaT,50.0
1,2009-01-09,2009-01-12,50.0
2,2009-01-09,NaT,50.0
3,2009-01-09,NaT,50.0
4,2009-01-09,NaT,50.0


In [136]:
from datetime import datetime

In [137]:
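# Build the daily snapshot dates; the first one (2009-01-09) is used for the single-date test below.
# Note: the name `range` shadows Python's built-in range() for the rest of the notebook.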
range=pd.date_range(start='2009-01-09', end='2010-12-31')
size=np.size(range)


In [138]:
now=range[0]

In [139]:
col=range[0].strftime("%Y-%m-%d")

We first drop the UTXOs that have already been spent on or before `now` = 2009-01-09

In [140]:
df_2010=df_2010.drop(df_2010[df_2010['spent_block_date']<=now].index)
df_2010.head()

Unnamed: 0,block_date,spent_block_date,UTXO
0,2009-01-03,NaT,50.0
1,2009-01-09,2009-01-12,50.0
2,2009-01-09,NaT,50.0
3,2009-01-09,NaT,50.0
4,2009-01-09,NaT,50.0


In [141]:
df_2010[col]=now
df_2010.head()

Unnamed: 0,block_date,spent_block_date,UTXO,2009-01-09
0,2009-01-03,NaT,50.0,2009-01-09
1,2009-01-09,2009-01-12,50.0,2009-01-09
2,2009-01-09,NaT,50.0,2009-01-09
3,2009-01-09,NaT,50.0,2009-01-09
4,2009-01-09,NaT,50.0,2009-01-09


In [142]:
df_2010[col]=df_2010[col]-df_2010['block_date']
df_2010.head()

Unnamed: 0,block_date,spent_block_date,UTXO,2009-01-09
0,2009-01-03,NaT,50.0,6 days
1,2009-01-09,2009-01-12,50.0,0 days
2,2009-01-09,NaT,50.0,0 days
3,2009-01-09,NaT,50.0,0 days
4,2009-01-09,NaT,50.0,0 days


In [143]:
# Convert the timedelta column to an integer number of days
df_2010[col] = df_2010[col].dt.days
df_2010.head()

Unnamed: 0,block_date,spent_block_date,UTXO,2009-01-09
0,2009-01-03,NaT,50.0,6
1,2009-01-09,2009-01-12,50.0,0
2,2009-01-09,NaT,50.0,0
3,2009-01-09,NaT,50.0,0
4,2009-01-09,NaT,50.0,0


In [144]:
df_2010.tail()

Unnamed: 0,block_date,spent_block_date,UTXO,2009-01-09
137520,2010-12-31,2017-07-30,0.05,-721
137521,2010-12-31,2017-07-30,0.05,-721
137522,2010-12-31,2017-10-16,5.23,-721
137523,2010-12-31,2017-12-15,0.05,-721
137524,2010-12-31,2019-10-16,0.05,-721


In [150]:
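# A negative age means the UTXO did not exist yet on this date: mark it as missing.
# Use a real NaN (not the string 'NaN'); the column becomes float as a result.
df_2010[col] = df_2010[col].where(df_2010[col] >= 0)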
df_2010.tail()

Unnamed: 0,block_date,spent_block_date,UTXO,2009-01-09
137520,2010-12-31,2017-07-30,0.05,
137521,2010-12-31,2017-07-30,0.05,
137522,2010-12-31,2017-10-16,5.23,
137523,2010-12-31,2017-12-15,0.05,
137524,2010-12-31,2019-10-16,0.05,


In [151]:
df_2010.head()

Unnamed: 0,block_date,spent_block_date,UTXO,2009-01-09
0,2009-01-03,NaT,50.0,6
1,2009-01-09,2009-01-12,50.0,0
2,2009-01-09,NaT,50.0,0
3,2009-01-09,NaT,50.0,0
4,2009-01-09,NaT,50.0,0
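
The single-date steps above can be collected into one helper. This is a minimal sketch under the same convention as the cells above (a UTXO spent on or before `now` counts as spent); the function name `utxo_ages_on` and the four-row `toy` frame are hypothetical and only loosely modeled on the rows shown in the outputs.

```python
import pandas as pd

def utxo_ages_on(df, now):
    """Age in days, on date `now`, of every UTXO that exists and is still unspent."""
    now = pd.Timestamp(now)
    created = df["block_date"] <= now
    # Comparisons with NaT are False, so never-spent UTXOs are not flagged as spent
    spent = df["spent_block_date"] <= now
    alive = df.loc[created & ~spent]
    return (now - alive["block_date"]).dt.days

# Hypothetical toy data in the same layout as df_2010
toy = pd.DataFrame({
    "block_date": pd.to_datetime(
        ["2009-01-03", "2009-01-09", "2009-01-09", "2010-12-31"]),
    "spent_block_date": pd.to_datetime(
        [None, "2009-01-12", None, "2017-07-30"]),
    "UTXO": [50.0, 50.0, 50.0, 0.05],
})

ages_first_day = utxo_ages_on(toy, "2009-01-09")
ages_after_spend = utxo_ages_on(toy, "2009-01-12")
```

Unlike the cells above, the helper returns a new Series and leaves the input frame unchanged, which makes it easier to call repeatedly for different dates.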


Now we can iterate over every date in `range = pd.date_range(start='2009-01-09', end='2010-12-31')` (with `size = np.size(range)`) to get the result for all dates.

Hint: the final DataFrame will have 722 more columns, one per date.

Hint: remember to drop the rows that were spent on or before "now" (the date of the column currently being calculated) to reduce the data size.
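
One possible way to carry out the iteration is sketched below. It is an assumption-laden sketch, not the intended solution: the function name `age_table` and the `toy` frame are hypothetical, and instead of dropping spent rows as the hint suggests, it masks them with NaN so that the columns of earlier dates keep their values.

```python
import pandas as pd

def age_table(df, dates):
    """Append one age column (in days) per snapshot date; NaN where the UTXO is not alive."""
    columns = {}
    for now in dates:
        age = (now - df["block_date"]).dt.days
        # Alive = already created and not spent on or before `now`
        alive = (df["block_date"] <= now) & ~(df["spent_block_date"] <= now)
        columns[now.strftime("%Y-%m-%d")] = age.where(alive)
    # Build all new columns first and attach them in one concat
    return pd.concat([df, pd.DataFrame(columns)], axis=1)

# Hypothetical toy data in the same layout as df_2010
toy = pd.DataFrame({
    "block_date": pd.to_datetime(
        ["2009-01-03", "2009-01-09", "2009-01-09", "2010-12-31"]),
    "spent_block_date": pd.to_datetime(
        [None, "2009-01-12", None, "2017-07-30"]),
    "UTXO": [50.0, 50.0, 50.0, 0.05],
})

# A short four-day window keeps the example small
result = age_table(toy, pd.date_range(start="2009-01-09", end="2009-01-12"))

# The full window used in this notebook spans 722 daily snapshots
n_dates = len(pd.date_range(start="2009-01-09", end="2010-12-31"))
```

Collecting the columns in a dictionary and concatenating once avoids inserting hundreds of columns one at a time. If memory is a concern, dropping rows that are already spent, as the hint suggests, shrinks the frame at the cost of losing those rows' earlier ages.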