<a href="https://colab.research.google.com/github/sunshineluyao/UTXO/blob/main/UTXO_Data_analysis_Task_4only.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [86]:
import numpy as np
import pandas as pd
import datetime

# Import Data from Google Drive and Data Wrangling

In [87]:
# Importing drive method from colab for accessing google drive
from google.colab import drive

In [88]:
# Mount Google Drive
# This requires authentication: follow the steps as prompted
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Note: Read the data from the CSV in Drive. The CSV file is expected to be in the Google Drive folder "UTXO". The data contains all UTXOs generated by 2010-12-31. The information includes


*   the value of the UTXO (in the native coin unit, i.e. satoshis)
*   the date on which the UTXO was generated
*   the date on which the UTXO was spent; "NaN" if not spent by 2020-10-12




In [89]:
import pandas as pd
df_2010=pd.read_csv('/content/drive/My Drive/UTXO/joint_2010.csv',index_col='Unnamed: 0')
df_2010.head()

Unnamed: 0,value,block_date,spent_block_date
0,5000000000,2009-01-03,
21553,5000000000,2009-01-09,2009-01-12
1,5000000000,2009-01-09,
2,5000000000,2009-01-09,
3,5000000000,2009-01-09,


Generate the UTXO value in bitcoin units: $UTXO = value/10^{8}$

In [90]:
# Divide by the exact integer 10**8 rather than multiplying by the inexact float 10**(-8)
df_2010['UTXO'] = df_2010['value'] / 10**8
df_2010.head()

Unnamed: 0,value,block_date,spent_block_date,UTXO
0,5000000000,2009-01-03,,50.0
21553,5000000000,2009-01-09,2009-01-12,50.0
1,5000000000,2009-01-09,,50.0
2,5000000000,2009-01-09,,50.0
3,5000000000,2009-01-09,,50.0
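
As a quick sanity check of the conversion above, the following sketch applies the same $value/10^{8}$ rule to a few sample amounts. The sample values and the names `SATOSHI_PER_BTC` and `sample` are assumptions made for illustration; they are not read from the CSV.

```python
import pandas as pd

SATOSHI_PER_BTC = 10**8  # 1 BTC = 100,000,000 satoshis

# Hypothetical sample amounts in satoshis (illustrative only)
sample = pd.DataFrame({'value': [5_000_000_000, 5_000_000, 523_000_000]})

# Dividing by an exact integer avoids the small rounding error that
# multiplying by the inexact float 10**(-8) can introduce
sample['UTXO'] = sample['value'] / SATOSHI_PER_BTC
print(sample)
```
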


In [91]:
# Drop the raw value column and reset the index
df_2010 = df_2010.reset_index(drop=True)
df_2010 = df_2010.drop(columns='value')
df_2010.head()

Unnamed: 0,block_date,spent_block_date,UTXO
0,2009-01-03,,50.0
1,2009-01-09,2009-01-12,50.0
2,2009-01-09,,50.0
3,2009-01-09,,50.0
4,2009-01-09,,50.0


In [92]:
df_2010.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 137525 entries, 0 to 137524
Data columns (total 3 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   block_date        137525 non-null  object 
 1   spent_block_date  115972 non-null  object 
 2   UTXO              137525 non-null  float64
dtypes: float64(1), object(2)
memory usage: 3.1+ MB


Convert block_date and spent_block_date to datetime objects

In [93]:
# The dates in the CSV are dash-separated (e.g. 2009-01-03), so the format must use dashes
df_2010['block_date'] = pd.to_datetime(df_2010['block_date'], format='%Y-%m-%d')
df_2010['spent_block_date'] = pd.to_datetime(df_2010['spent_block_date'], format='%Y-%m-%d')
df_2010.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 137525 entries, 0 to 137524
Data columns (total 3 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   block_date        137525 non-null  datetime64[ns]
 1   spent_block_date  115972 non-null  datetime64[ns]
 2   UTXO              137525 non-null  float64       
dtypes: datetime64[ns](2), float64(1)
memory usage: 3.1 MB
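
The conversion can be tried in isolation on a small frame. This is a minimal sketch, assuming dash-separated dates as in the CSV preview; the two sample rows and the name `sample` are made up for illustration. It also shows that a missing spend date becomes `NaT`, which is how unspent UTXOs appear after conversion.

```python
import pandas as pd

# Hypothetical rows mimicking the CSV layout; None marks an unspent UTXO
sample = pd.DataFrame({
    'block_date': ['2009-01-03', '2009-01-09'],
    'spent_block_date': [None, '2009-01-12'],
})

for column in ['block_date', 'spent_block_date']:
    sample[column] = pd.to_datetime(sample[column], format='%Y-%m-%d')

# Missing spend dates become NaT, which isna() detects
unspent = sample['spent_block_date'].isna()
print(sample.dtypes)
print(unspent.tolist())
```
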


# Task 4: Calculate the Age Distribution of UTXOs (Bitcoin Age Distribution for the Bitcoin That Is Still Alive, i.e. Unspent)

In [94]:
df_2010.head()

Unnamed: 0,block_date,spent_block_date,UTXO
0,2009-01-03,NaT,50.0
1,2009-01-09,2009-01-12,50.0
2,2009-01-09,NaT,50.0
3,2009-01-09,NaT,50.0
4,2009-01-09,NaT,50.0


In [95]:
from datetime import datetime

In [96]:
# Daily evaluation dates from 2009-01-09 through 2010-12-31; UTXO ages are computed as of each date
duration=pd.date_range(start='2009-01-09', end='2010-12-31')
size=np.size(duration)
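
A short check of how many evaluation dates this range produces, as a sketch using only `pd.date_range`; the name `n_days` is introduced here for illustration.

```python
import pandas as pd

duration = pd.date_range(start='2009-01-09', end='2010-12-31')

# date_range includes both endpoints and defaults to daily frequency
n_days = len(duration)
print(n_days, duration[0].date(), duration[-1].date())
```
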


### Generate the UTXO dataframe with zeros

In [97]:
# Start the result frame with one row per evaluation date
df_UTXO = pd.DataFrame({'date': duration})
df_UTXO.head()

Unnamed: 0,date
0,2009-01-09
1,2009-01-10
2,2009-01-11
3,2009-01-12
4,2009-01-13


In [98]:
df_UTXO.insert(1,'< 1d',0)
df_UTXO.insert(2,'1d ~ 1m',0)
df_UTXO.insert(3,'1m ~ 1q',0)
df_UTXO.insert(4,'1q ~ 6m',0)
df_UTXO.insert(5,'6m ~ 1y',0)
df_UTXO.insert(6,'1y ~ 2y',0)
df_UTXO.insert(7,'2y ~ 3y',0)
df_UTXO.insert(8,'3y ~ 4y',0)
df_UTXO.insert(9,'4y ~ 5y',0)
df_UTXO.insert(10, '> 5y', 0)
df_UTXO.head()

Unnamed: 0,date,< 1d,1d ~ 1m,1m ~ 1q,1q ~ 6m,6m ~ 1y,1y ~ 2y,2y ~ 3y,3y ~ 4y,4y ~ 5y,> 5y
0,2009-01-09,0,0,0,0,0,0,0,0,0,0
1,2009-01-10,0,0,0,0,0,0,0,0,0,0
2,2009-01-11,0,0,0,0,0,0,0,0,0,0
3,2009-01-12,0,0,0,0,0,0,0,0,0,0
4,2009-01-13,0,0,0,0,0,0,0,0,0,0


## We first calculate the first row of the UTXO table and check whether it is correct. If so, we can use a for loop to fill in the rest

In [99]:
now=duration[0]

In [100]:
col=now.strftime("%Y-%m-%d")

We first drop the UTXOs that have been spent on or before `now` = 2009-01-09

In [101]:
df_2010=df_2010.drop(df_2010[df_2010['spent_block_date']<=now].index)
df_2010.head()

Unnamed: 0,block_date,spent_block_date,UTXO
0,2009-01-03,NaT,50.0
1,2009-01-09,2009-01-12,50.0
2,2009-01-09,NaT,50.0
3,2009-01-09,NaT,50.0
4,2009-01-09,NaT,50.0


In [102]:
df_2010[col]=now
df_2010.head()

Unnamed: 0,block_date,spent_block_date,UTXO,2009-01-09
0,2009-01-03,NaT,50.0,2009-01-09
1,2009-01-09,2009-01-12,50.0,2009-01-09
2,2009-01-09,NaT,50.0,2009-01-09
3,2009-01-09,NaT,50.0,2009-01-09
4,2009-01-09,NaT,50.0,2009-01-09


In [103]:
df_2010[col]=df_2010[col]-df_2010['block_date']
df_2010.head()

Unnamed: 0,block_date,spent_block_date,UTXO,2009-01-09
0,2009-01-03,NaT,50.0,6 days
1,2009-01-09,2009-01-12,50.0,0 days
2,2009-01-09,NaT,50.0,0 days
3,2009-01-09,NaT,50.0,0 days
4,2009-01-09,NaT,50.0,0 days


In [104]:
df_2010[col]=df_2010[col].map(lambda x:x.days)
df_2010.head()

Unnamed: 0,block_date,spent_block_date,UTXO,2009-01-09
0,2009-01-03,NaT,50.0,6
1,2009-01-09,2009-01-12,50.0,0
2,2009-01-09,NaT,50.0,0
3,2009-01-09,NaT,50.0,0
4,2009-01-09,NaT,50.0,0
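
The three steps above (fill a column with `now`, subtract `block_date`, extract the days) can also be written as one vectorized expression. This is a minimal sketch on made-up creation dates; the column name `age_days` is an assumption for illustration. A negative age means the UTXO has not been created yet on the evaluation date.

```python
import pandas as pd

# Hypothetical creation dates (illustrative only)
sample = pd.DataFrame({
    'block_date': pd.to_datetime(['2009-01-03', '2009-01-09', '2010-12-31']),
})
now = pd.Timestamp('2009-01-09')

# Subtracting a datetime column from a Timestamp yields timedeltas;
# .dt.days extracts the whole days as integers
sample['age_days'] = (now - sample['block_date']).dt.days
print(sample)
```
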


In [105]:
df_2010[col]=pd.to_numeric(df_2010[col])

In [106]:
df_2010.tail()

Unnamed: 0,block_date,spent_block_date,UTXO,2009-01-09
137520,2010-12-31,2017-07-30,0.05,-721
137521,2010-12-31,2017-07-30,0.05,-721
137522,2010-12-31,2017-10-16,5.23,-721
137523,2010-12-31,2017-12-15,0.05,-721
137524,2010-12-31,2019-10-16,0.05,-721


In [107]:
df_2010.head()

Unnamed: 0,block_date,spent_block_date,UTXO,2009-01-09
0,2009-01-03,NaT,50.0,6
1,2009-01-09,2009-01-12,50.0,0
2,2009-01-09,NaT,50.0,0
3,2009-01-09,NaT,50.0,0
4,2009-01-09,NaT,50.0,0


In [108]:
df_UTXO.columns

Index(['date', '< 1d', '1d ~ 1m', '1m ~ 1q', '1q ~ 6m', '6m ~ 1y', '1y ~ 2y',
       '2y ~ 3y', '3y ~ 4y', '4y ~ 5y', '> 5y'],
      dtype='object')

In [109]:
df_UTXO.loc[(df_UTXO['date']==col),'< 1d']=df_2010[df_2010[col]==0]['UTXO'].sum()
df_UTXO.loc[(df_UTXO['date']==col),'1d ~ 1m']=df_2010[(df_2010[col]>0) & (df_2010[col]<30)]['UTXO'].sum()
df_UTXO.loc[(df_UTXO['date']==col),'1m ~ 1q']=df_2010[(df_2010[col]>=30) & (df_2010[col]<91)]['UTXO'].sum()
df_UTXO.loc[(df_UTXO['date']==col),'1q ~ 6m']=df_2010[(df_2010[col]>=91) & (df_2010[col]<182)]['UTXO'].sum()
df_UTXO.loc[(df_UTXO['date']==col),'6m ~ 1y']=df_2010[(df_2010[col]>=182) & (df_2010[col]<365)]['UTXO'].sum()
df_UTXO.loc[(df_UTXO['date']==col),'1y ~ 2y']=df_2010[(df_2010[col]>=365) & (df_2010[col]<365*2)]['UTXO'].sum()
df_UTXO.loc[(df_UTXO['date']==col),'2y ~ 3y']=df_2010[(df_2010[col]>=365*2) & (df_2010[col]<365*3)]['UTXO'].sum()
df_UTXO.loc[(df_UTXO['date']==col),'3y ~ 4y']=df_2010[(df_2010[col]>=365*3) & (df_2010[col]<365*4)]['UTXO'].sum()
df_UTXO.loc[(df_UTXO['date']==col),'4y ~ 5y']=df_2010[(df_2010[col]>=365*4) & (df_2010[col]<365*5)]['UTXO'].sum()
df_UTXO.loc[(df_UTXO['date']==col),'> 5y']=df_2010[(df_2010[col]>=365*5)]['UTXO'].sum()
df_UTXO.head()


Unnamed: 0,date,< 1d,1d ~ 1m,1m ~ 1q,1q ~ 6m,6m ~ 1y,1y ~ 2y,2y ~ 3y,3y ~ 4y,4y ~ 5y,> 5y
0,2009-01-09,700.0,50.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2009-01-10,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2009-01-11,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,2009-01-12,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,2009-01-13,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
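
The ten assignments above could be condensed with `pd.cut`, which assigns each age to a left-closed bin. This is a sketch of an alternative, not the approach used in this notebook; the sample ages and amounts and the names `edges`, `bucket` and `totals` are assumptions for illustration. The bin edges mirror the thresholds used above for integer ages.

```python
import numpy as np
import pandas as pd

labels = ['< 1d', '1d ~ 1m', '1m ~ 1q', '1q ~ 6m', '6m ~ 1y',
          '1y ~ 2y', '2y ~ 3y', '3y ~ 4y', '4y ~ 5y', '> 5y']
# Left-closed bin edges in days, mirroring the thresholds used above
edges = [0, 1, 30, 91, 182, 365, 365*2, 365*3, 365*4, 365*5, np.inf]

# Hypothetical ages (days) and amounts (BTC); the negative age stands for
# a UTXO that does not exist yet on the evaluation date
sample = pd.DataFrame({'age': [0, 0, 6, 45, 400, -10],
                       'UTXO': [50.0, 50.0, 50.0, 1.5, 2.0, 9.0]})

bucket = pd.cut(sample['age'], bins=edges, right=False, labels=labels)
# Rows whose age falls outside every bin get NaN and are left out of the sum
totals = sample.groupby(bucket, observed=False)['UTXO'].sum()
print(totals)
```

With `observed=False`, every label appears in `totals`, so empty buckets show up as 0.0 rather than being omitted.
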


In [110]:
# Remember to drop the temporary age column each time after a row of df_UTXO has been filled in
df_2010=df_2010.drop([col],axis=1)

## Now we can iterate over `duration = pd.date_range(start='2009-01-09', end='2010-12-31')`, with `size = np.size(duration)`, to get the result for all dates

Hint: the final pandas DataFrame will have 722 more columns

Hint: remember to drop the rows for UTXOs that have already been spent by `now`, the date currently being calculated, to reduce the data size

In [111]:
for i in range(size):
  now=duration[i]
  col=now.strftime("%Y-%m-%d")


In [112]:
now

Timestamp('2010-12-31 00:00:00', freq='D')

In [113]:
col

'2010-12-31'
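
The loop above only updates `now` and `col`; the per-date calculation still has to be filled in. One possible completion is sketched below on a tiny made-up dataset. It is not the notebook's finished solution: the function name `age_distribution`, the sample rows, and the use of `pd.cut` instead of the ten explicit filters are all assumptions for illustration. It relies on the dates being in ascending order, so that rows dropped for one date stay dropped for all later dates.

```python
import numpy as np
import pandas as pd

LABELS = ['< 1d', '1d ~ 1m', '1m ~ 1q', '1q ~ 6m', '6m ~ 1y',
          '1y ~ 2y', '2y ~ 3y', '3y ~ 4y', '4y ~ 5y', '> 5y']
EDGES = [0, 1, 30, 91, 182, 365, 365*2, 365*3, 365*4, 365*5, np.inf]

def age_distribution(utxos, dates):
    """Sum unspent UTXO value per age bucket for each date in `dates` (ascending)."""
    rows = []
    alive = utxos
    for now in dates:
        # Drop UTXOs spent on or before `now`; NaT (never spent) compares False
        alive = alive[~(alive['spent_block_date'] <= now)]
        age = (now - alive['block_date']).dt.days
        # Ages below 0 (not yet created) fall outside every bin and are ignored
        bucket = pd.cut(age, bins=EDGES, right=False, labels=LABELS)
        totals = alive.groupby(bucket, observed=False)['UTXO'].sum()
        rows.append(totals.to_numpy())
    return pd.DataFrame(np.vstack(rows), columns=LABELS,
                        index=pd.Index(dates, name='date'))

# Hypothetical UTXOs (illustrative only)
utxos = pd.DataFrame({
    'block_date': pd.to_datetime(['2009-01-03', '2009-01-09',
                                  '2009-01-09', '2009-01-11']),
    'spent_block_date': pd.to_datetime([None, '2009-01-12', None, None]),
    'UTXO': [50.0, 50.0, 50.0, 25.0],
})
dates = pd.date_range(start='2009-01-09', end='2009-01-12')
result = age_distribution(utxos, dates)
print(result[['< 1d', '1d ~ 1m']])
```

Collecting the bucket totals per date and building the result frame once at the end avoids adding a temporary age column to the source data on every iteration.
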