# Bitcoin Analysis

Bitcoin is a cryptocurrency, a type of digital currency, created in January 2009. There are no physical bitcoins, only balances kept on a public ledger that everyone has transparent access to. All bitcoin transactions are verified by a massive amount of computing power. Bitcoins are not issued or backed by any banks or governments.<br>
The bitcoin system is a collection of computers (also referred to as "nodes" or "miners") that all run bitcoin's code and store its blockchain. Metaphorically, a blockchain can be thought of as a collection of blocks. In each block is a collection of transactions. Because all the computers running the blockchain have the same list of blocks and transactions, and can transparently see these new blocks being filled with new bitcoin transactions, no one can cheat the system.
<br>
Anyone, whether they run a bitcoin "node" or not, can see these transactions occurring live. In order to achieve a nefarious act, a bad actor would need to operate 51% of the computing power that makes up bitcoin. Bitcoin has around 12,000 nodes, as of January 2021, and this number is growing.
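
To make the "collection of blocks" picture concrete, here is a minimal sketch of a hash-linked chain. It is a toy model and not bitcoin's actual block format: the helpers `block_hash`, `make_block` and `is_valid` and the transaction strings are hypothetical, and real bitcoin adds proof-of-work, Merkle trees and digital signatures on top.

```python
import hashlib
import json

def block_hash(index, transactions, prev_hash):
    """Hash the block contents together with the previous block's hash."""
    payload = json.dumps([index, transactions, prev_hash]).encode()
    return hashlib.sha256(payload).hexdigest()

def make_block(index, transactions, prev_hash):
    """Build a block whose hash commits to its contents and its predecessor."""
    return {
        "index": index,
        "transactions": transactions,
        "prev_hash": prev_hash,
        "hash": block_hash(index, transactions, prev_hash),
    }

def is_valid(chain):
    """Valid if every stored hash matches its contents and links to the previous block."""
    for i, block in enumerate(chain):
        expected = block_hash(block["index"], block["transactions"], block["prev_hash"])
        if block["hash"] != expected:
            return False
        if i > 0 and block["prev_hash"] != chain[i - 1]["hash"]:
            return False
    return True

genesis = make_block(0, ["coinbase -> alice: 50"], "0" * 64)
second = make_block(1, ["alice -> bob: 5"], genesis["hash"])
chain = [genesis, second]
print(is_valid(chain))   # True

# Rewriting an old transaction changes that block's hash, so the check fails
chain[0]["transactions"] = ["coinbase -> mallory: 50"]
print(is_valid(chain))   # False
```

Because every node can rerun a check like this on its own copy, a silently edited block is easy to detect.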

So we will look at the market trends and ask whether it is still worth investing in bitcoin.

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
%cd /kaggle/input/

In [None]:
import pandas as pd 
import numpy as np
from datetime import datetime
import requests
from time import sleep
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score,mean_squared_error

We use the Kaggle dataset, which can be found [here](https://www.kaggle.com/mczielinski/bitcoin-historical-data).

In [None]:
data=pd.read_csv('bitcoin-historical-data/bitstampUSD_1-min_data_2012-01-01_to_2021-03-31.csv')

In [None]:
data.info()

In [None]:
data

The output above shows that there are many null values, so let's check how many there are:

In [None]:
for col in data.columns:                
    print("Number of null values in col ",col," is: ",data[col].isnull().sum()) ## checking total null values in each col

Each column has 1243608 null values. <br>
The nulls might sit at different indexes in each column, so we need to check whether entire rows are null or whether the nulls are scattered across different cells. <br>

I use masking, which generates a boolean value for each index position that is null.
Example:

|Open|High|
|---|---|
|4.39|4.39|
|nan|nan|
|nan|nan|
|58714.31|58714.31|
|...|...|

The mask for each column would look like:
* Open: [false,true,true,false,...]
* High: [false,true,true,false,...]

Taking the `and` of all these column masks gives a single boolean mask that is `true` only for rows where every column is `null`.

=> Open & High : [false,true,true,false,...]
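
The masking idea can be sketched on a toy frame. The frame below is an assumption for illustration (its last row is deliberately only partly null, unlike the table above), and `isnull().all(axis=1)` is shown as an equivalent one-liner to the column-by-column `and`.

```python
import numpy as np
import pandas as pd

# Toy data: rows 1 and 2 are entirely null, row 3 is null in one column only
toy = pd.DataFrame({
    "Open": [4.39, np.nan, np.nan, 58714.31],
    "High": [4.39, np.nan, np.nan, np.nan],
})

# Column-by-column AND of the null masks
mask = toy["Open"].isnull() & toy["High"].isnull()

# Equivalent one-liner: a row counts only if every column is null
mask_all = toy.isnull().all(axis=1)

print(mask.tolist())     # [False, True, True, False]
print(int(mask.sum()))   # 2 fully null rows
```

The partly null last row is excluded, which is exactly the distinction the check is meant to make.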


In [None]:
m1=data.Open.isnull() #bool value for null value for Open col
for col in data.columns[2:]: # iterate over the col and find mask 
    m2=data[col].isnull()
    m1=m1&m2        # take 'and' operator for bool mask columns

print(m1.sum())

There are 1243608 fully null rows, which matches the 1243608 null values found in each column, so the nulls come only from rows that are entirely empty. The dataset has 4857377 rows in total, so the missing rows are about one quarter of the data.
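
As a quick check of the "one quarter" figure, the arithmetic can be written out. This assumes that 4857377 is the total row count reported by `data.info()`.

```python
null_rows = 1243608     # rows where every price/volume column is null
total_rows = 4857377    # assumed total number of rows in the dataset

fraction = null_rows / total_rows
print(f"{fraction:.1%} of rows are empty")   # 25.6% of rows are empty
```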

In [None]:
nn_df=data[data.Open.isnull()].copy() ## getting the null rows (copy, so a column can be added to it safely)

In [None]:
index=nn_df.index   ##getting the index value of null rows 

In [None]:
nn_df['dates'] = nn_df['Timestamp'].apply(lambda d: datetime.fromtimestamp(int(d)).strftime('%Y-%m-%d'))

In [None]:
nn_df

There are many null values:
- If we fill them with 0, we cannot find a proper trend across the missing periods.
- If we fill them with the mean, the mean is taken over both very high and very low prices, so the filled rows could show drastic jumps in the trend, and they make up about 1/4th of the data.
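
Both concerns can be seen on a toy series. The values below are made up for illustration: two missing points sit in a low-price period, and one later point is far higher.

```python
import numpy as np
import pandas as pd

# Hypothetical prices: a gap while the price is about 10-12, then a much higher value
prices = pd.Series([10.0, np.nan, np.nan, 12.0, 1000.0])

zero_filled = prices.fillna(0)              # gap looks like a crash to zero
mean_filled = prices.fillna(prices.mean())  # gap looks like a spike to the global mean

print(zero_filled.tolist())
print(mean_filled.round(2).tolist())
```

Neither filled series resembles the neighbouring prices of 10 and 12, which is why recovering the real values is preferable when it is possible.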

So we need to find out where the data was collected from. While searching, I found this [link](https://stackoverflow.com/questions/29425894/scraping-data-from-bitcoincharts), but querying that source for the dates of the null rows returns no values, which suggests that data for those dates was not recorded or is not available.<br>
I then looked for the origin of the dataset and found in the Kaggle dataset description that the data was collected from [bitcoincharts](https://bitcoincharts.com/charts).

I wrote a separate `extract.ipynb` notebook to extract data from the above site using Selenium.

In [None]:
uniq_dates=list(nn_df['dates'].unique()) ##getting list of dates for null rows

In [None]:
data.iloc[index[-1]]

The scraping notebook stores its output in `bitcoinunix.csv`.

In [None]:
scrapdata=pd.read_csv('bitcoinscrap/bitcoinunix.csv',index_col=[0]) ## reading the scraped data

In [None]:
scrapdata

In [None]:
scrapdata.isna().sum()    # checking whether there are any null values

In [None]:
df=pd.concat([data,scrapdata],ignore_index=True)  ## appending the scraped data to the previous data

In [None]:
df.dropna(inplace=True) # dropping the null rows that came with the previous data

After appending the scraped data, we need to sort the rows by Unix time (`Timestamp`).
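
The combine, drop and sort steps can be sketched on two tiny frames. The frames and their values are hypothetical. `pd.concat` is used here because `DataFrame.append` was removed in pandas 2.0.

```python
import numpy as np
import pandas as pd

# Original data with one empty minute, plus a hypothetical recovered row for it
original = pd.DataFrame({"Timestamp": [60, 120, 180], "Open": [4.39, np.nan, 4.50]})
scraped = pd.DataFrame({"Timestamp": [120], "Open": [4.45]})

combined = (
    pd.concat([original, scraped], ignore_index=True)  # stack the two sources
      .dropna()                                        # drop the original empty row
      .sort_values("Timestamp", ignore_index=True)     # restore chronological order
)
print(combined)
```

Without the sort, the recovered row would stay at the end of the frame, out of time order.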

In [None]:
df.sort_values('Timestamp',ignore_index=True,inplace=True)

In [None]:
df['timestamp'] = pd.to_datetime(df['Timestamp'],unit='s') #converting the unix time to readable format

In [None]:
# extracting year, month and date from the readable time format
df['year'] = df['timestamp'].dt.year
# df['day'] = df['timestamp'].dt.day
df['month'] = df['timestamp'].dt.month
# df['minute'] = df['timestamp'].dt.strftime('%M')
# df['hours'] = df['timestamp'].dt.strftime('%H')
df['date'] = df['timestamp'].dt.strftime('%Y-%m-%d')
# df['seconds']=df['timestamp'].dt.strftime('%S')

In [None]:
df.index = df.timestamp  ## changing the index values according to timestamp

We want to do a time series analysis, and there are a great many data points, so we resample them by `month`. For more details check [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.resample.html)
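
As a small illustration of what resampling does, the sketch below averages four made-up observations into monthly values. It uses the month-start alias `"MS"`, an assumption chosen because the month-end alias changed from `'M'` to `'ME'` in recent pandas releases. The only difference is the date each month is labelled with.

```python
import pandas as pd

# Hypothetical irregular observations spread over two months
idx = pd.to_datetime([
    "2021-01-01 00:00", "2021-01-15 12:00",
    "2021-02-01 00:00", "2021-02-20 06:00",
])
toy = pd.DataFrame({"Weighted_Price": [100.0, 200.0, 300.0, 500.0]}, index=idx)

# One row per month, holding the mean of all observations in that month
monthly = toy.resample("MS").mean()
print(monthly)
```

January averages to 150.0 and February to 400.0, so millions of per-minute rows collapse to one point per month.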

In [None]:
df_month = df.resample('M').mean(numeric_only=True)  # monthly mean of the numeric columns only ('date' is a string column)

We will look at how the `volume` and `price` of bitcoin have varied from 2011 onwards.

In [None]:
# PLOTS
fig = plt.figure(figsize=[20, 20])
plt.suptitle('Bitcoin price and volume mean in USD', fontsize=22)
##plot for mean volume vs months
plt.subplot(311) ## position of plot 
plt.plot(df_month['Volume_(BTC)'], '-', label='By Months')  ## plot for monthly mean volume
plt.xlabel("Year")
plt.ylabel("volume of bitcoin")
plt.title("Volume mean according to months")

##plot for mean price vs months
plt.subplot(312)
plt.plot(df_month.Weighted_Price, '-', label='By Months',color='c')    ## plot for the prices
plt.title("Price mean according to months")
plt.xlabel("Year")
plt.ylabel("Price of bitcoin")

##plot for mean currency volume vs months
plt.subplot(313)
plt.plot(df_month['Volume_(Currency)'], '-', label='By Months',color='g')
plt.title("currency vol mean according to months")
plt.xlabel("Year")
plt.ylabel("volume of Currency flow")

plt.show()

- The first plot shows that the traded volume of bitcoin rose in the early years, up to about 2015-16, and declined afterwards.<br>
Bitcoin is decentralised, and as more people join the network and start mining, the price may rise. Roughly every 4 years (every 210,000 blocks) the block reward paid to miners is halved, so fewer new bitcoins enter circulation over time; see the blockchain documentation for details.
- The 2nd graph shows that prices rise year over year, with peaks in 2018 and 2021. The most recent halving was in 2020, and it may be that more people came to see bitcoin as a good investment because it kept rising year after year, which brought in more money and pushed prices higher still. Demand and supply trends give more detail. The same holds for the 3rd plot.

Since the price has risen so sharply in recent years, is bitcoin still safe to trade?

In [None]:
price_diff = df["Weighted_Price"].diff()        ##finding the difference of price for each row
ax=price_diff.plot(figsize=(20,6),title='Variation in bitcoin price')
ax.set_ylabel("price difference")

Comparing this plot with the earlier price trend plot, we can say that as the market price of bitcoin has risen, the variation between consecutive prices has grown too, i.e. the falls in price are as large as the rises.
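
One way to tell whether larger swings are simply a consequence of a higher price level is to compare absolute and relative changes. The sketch below uses made-up prices and `pct_change`, which is not part of the analysis above, purely to illustrate the distinction.

```python
import pandas as pd

# Hypothetical prices: a 10% move at a low price level and at a high one
prices = pd.Series([100.0, 110.0, 10000.0, 11000.0])

abs_change = prices.diff()         # change in USD, grows with the price level
rel_change = prices.pct_change()   # change as a fraction of the previous price

print(abs_change.tolist())
print(rel_change.round(3).tolist())
```

The first and last moves differ by a factor of 100 in USD terms but are both 10% moves, so a widening `diff` plot does not by itself mean that relative volatility has increased.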

In [None]:
df.boxplot(column='Volume_(BTC)',by = 'year', figsize=(10,10))

This shows that 2014 had the highest bitcoin volume in the market and that volume has decreased continuously since then, although the average volume was very low throughout. Anyone who wants to trade bitcoin in future should look at the market cap: the price is clearly rising while the volume keeps falling, so an investor needs to consider whether trading will still be feasible later.

In [None]:
df.boxplot(column='Volume_(Currency)',by = 'year', figsize=(10,10))

The highest-volume trades were made in 2021, yet the average currency volume is still very low. This suggests that many investors were average investors who prefer safe trading or fear losses, while the very large trades probably came from big investors with market knowledge or experience in crypto trading.

As there is a drastic change in recent years, let's look at them more closely.

In [None]:
yeardata=df[df.year>=2020]

In [None]:
##plot for price variation in year 2020-21
ax=yeardata[yeardata.year==2020].plot(kind='scatter', x='month', y='High',color = 'cyan',label='2020',figsize=(20,15))
yeardata[yeardata.year==2021].plot(kind='scatter', x='month', y='High',color = 'violet',ax=ax,label='2021')
plt.xlabel('Month')              
plt.legend()
plt.ylabel('Highest prices of days in months')

plt.title('Month - Highest Price of day Scatter Plot(2020-2021)') 

The plot shows a drastic increase in the price of bitcoin, and some prediction [reports](https://investorplace.com/2021/06/bitcoin-price-prediction-2021-why-btc-could-hit-100k-by-year-end/) also mention that the price might hit 100K USD by the end of the year.

In [None]:
dat=df.resample('D').mean(numeric_only=True)  # daily mean of the numeric columns

In [None]:
plt.figure(figsize=(200,20))
plt.plot(dat['Open'],'-',color='g')
plt.plot(dat['Close'],'+',color='k')
plt.show()

In [None]:
dfdate=df.groupby(['date']).max() #taking the max values for unique dates
ax=dfdate.plot(kind="scatter", y="Low",x='timestamp', alpha=0.3, color= "red",figsize=(100,10),label='Low')
dfdate.plot(kind="line", y="High",x='timestamp', alpha=0.3, color= "blue",ax=ax,label='High')
plt.xlabel("Date")
plt.ylabel("Price")
plt.legend()
plt.title("Daily Low and High prices")
plt.show()

The Low and High prices follow almost the same trend, with a linear, directly proportional relation between them. This shows that the two differed little on any given date up to the end of 2020, while in 2021 a significant difference between them can be observed. <br>Let's look at some more relationships among prices and volumes.

In [None]:
colors=['b','g','r','c','m','y','k','brown']
j=1
plt.figure(figsize=(30,25))
for col in df.columns[1:5]:
  for _ in range(3):
    if j%3==1:
      cmp='Volume_(BTC)'
    elif j%3==2:
      cmp='Volume_(Currency)'
    else:
      cmp='Weighted_Price'
    plt.subplot(4,3,j)
    plt.plot(df[col],df[cmp], color=colors[j%8])
    plt.xlabel(col+" values")
    plt.ylabel(cmp+" of Bitcoins")
    plt.title( cmp+" values  vs  "+col+" of BTC")
    j+=1



- The Volume_(BTC) charts (column 1) all show the same trend: large amounts of bitcoin were traded when prices were low, and the volume decreases as the price increases.
<br>i.e. $Volume\_BTC  \quad  \propto \quad \frac{1}{Prices}$
- The Volume of Currency varies with the prices.
- The trading price (`Weighted_Price`) has a linear relation with the closing, open, high and low prices.

In [None]:
import seaborn as sns

In [None]:
ax=sns.histplot(df.Close,bins=40) # plotting histogram (distplot is deprecated in seaborn)
ax.set_title('Frequency of different Closing prices')
ax.set(xlabel='Close price range', ylabel='Frequency of Close price')
plt.show()


In [None]:
ax=sns.histplot(df.High,kde=True,bins=40) # histogram with a kernel density estimate overlaid
ax.set_title('Frequency of different Highest prices')
ax.set(xlabel='High price range', ylabel='Frequency of Highest price')
plt.show()


The histograms show how the number of trades varies with price: as the price increases, the number of trades decreases. We have also seen that prices differ a lot from day to day, meaning that they fall as sharply as they rise, so trading in the higher price range may be quite difficult for less experienced traders.

In [None]:
df.boxplot(column=['High', 'Volume_(BTC)', 'Weighted_Price'])

In [None]:
monthly = df.groupby('month')[['Open', 'Close']].mean()  # average Open/Close per calendar month
X = monthly.index.to_numpy()  # month numbers, so the bars line up with the ticks
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
ax.bar(X + 0.00, monthly['Open'], color = 'b', width = 0.25,label='Open Price')
ax.bar(X + 0.25, monthly['Close'], color = 'g', width = 0.25,label='Close price')
ax.set_xticks(X)
ax.set_title("Avg Open and close price according to months")
ax.set_xlabel("Months")
ax.legend()
ax.set_ylabel("Avg price month wise")

The average price was higher during the winter season, between December and March. However, the price of bitcoin rose sharply in 2021, which may skew the averages, so we need to look at the years before 2021 before concluding which period is best for investing.

In [None]:
mn=df[df.year<2021]

In [None]:
monthly = mn.groupby('month')[['Open', 'Close']].mean()  # average Open/Close per calendar month, years before 2021
X = monthly.index.to_numpy()  # month numbers, so the bars line up with the ticks
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
ax.bar(X + 0.00, monthly['Open'], color = 'b', width = 0.25,label='Open Price')
ax.bar(X + 0.25, monthly['Close'], color = 'g', width = 0.25,label='Close price')
ax.set_xticks(X)
ax.set_title("Avg Open and close price according to months (before 2021)")
ax.set_xlabel("Months")
ax.legend()
ax.set_ylabel("Avg price month wise")

The plot shows that before 2021 the price of bitcoin rose after the 7th-8th month and fell again at the start of the year. 2021 was similar, with the price falling after March 2021. So for long-term or safe trading, one could invest in the summer and square off at the start of the new year.

Next we predict the bitcoin price, but first let's look at the correlation among the attributes.

In [None]:
plt.figure(figsize = (20,10))
ax=sns.heatmap(df.corr(numeric_only=True), annot = True,square=True,linewidths=.5,vmin=-1, vmax=1, center= 0)

The `Weighted_Price` column is directly correlated with the open, close, high and low prices. It also correlates well with the timestamp, and somewhat less with the volume of BTC. We should not use the attributes `High`, `Low`, `Open` and `Close` as features, because they are directly correlated with `Weighted_Price` (the target feature).
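
The feature screening described here can be sketched as a simple filter on the correlation with the target. The toy frame and the 0.95 cut-off are assumptions for illustration, not values taken from the heatmap.

```python
import pandas as pd

# Hypothetical data: Close and High move in lockstep with the target, volume does not
toy = pd.DataFrame({
    "Close": [10.0, 20.0, 30.0, 40.0, 50.0],
    "High": [11.0, 21.0, 31.0, 41.0, 51.0],
    "Volume_(BTC)": [5.0, 1.0, 4.0, 2.0, 3.0],
    "Weighted_Price": [10.5, 20.5, 30.5, 40.5, 50.5],
})

target = "Weighted_Price"
threshold = 0.95  # assumed cut-off for "directly correlated"

# Correlation of every other column with the target
corr = toy.corr()[target].drop(target)

# Columns that nearly duplicate the target and would leak it into the model
leaky = corr[corr.abs() > threshold].index.tolist()
print(leaky)   # ['Close', 'High']
```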


Since this is time series data, we will perform time series price prediction on it.
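
The modelling cells that follow turn the daily price series into sliding windows: each sample holds the previous `timestep` values and the target is the next value. A minimal NumPy sketch of that idea, with a hypothetical helper name `make_windows` and a short window for readability:

```python
import numpy as np

def make_windows(series, timestep):
    """Split a 1-D series into (samples, timestep, 1) inputs and next-step targets."""
    series = np.asarray(series, dtype=float)
    # Window i holds the `timestep` values that precede position i
    x = np.array([series[i - timestep:i] for i in range(timestep, len(series))])
    y = series[timestep:]  # the value right after each window
    return x.reshape(-1, timestep, 1), y

x, y = make_windows(np.arange(10), timestep=3)
print(x.shape, y.shape)      # (7, 3, 1) (7,)
print(x[0].ravel(), y[0])    # [0. 1. 2.] 3.0
```

The trailing dimension of size 1 is the single feature (price) that recurrent layers expect.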

In [None]:
df

In [None]:
from keras.models import Sequential
from keras.layers import Dense, SimpleRNN,Dropout,Flatten,LSTM
from sklearn.preprocessing import StandardScaler
from keras.callbacks import ModelCheckpoint

In [None]:
df.head()

In [None]:
df['Date']=pd.to_datetime(df['Timestamp'],unit='s').dt.date


In [None]:
X=df.groupby('Date')['Weighted_Price'].max()

In [None]:
X.shape

In [None]:
train_size = int(len(X)*0.85)

train_data = X[0:train_size]
test_data = X[train_size:]

In [None]:
train_data=np.array(train_data)
train_data=train_data.reshape(train_data.shape[0],1)

In [None]:
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
train_data=scaler.fit_transform(train_data)

In [None]:
timestep=30
x_train=[]
y_train=[]

for i in range(timestep,train_data.shape[0]):
    x_train.append(train_data[i-timestep:i,0])
    y_train.append(train_data[i,0])

x_train,y_train=np.array(x_train),np.array(y_train)
x_train=x_train.reshape(x_train.shape[0],x_train.shape[1],1) #reshaped for RNN
print("x_train shape= ",x_train.shape)
print("y_train shape= ",y_train.shape)

In [None]:
reg=Sequential()

reg.add(SimpleRNN(128,activation='relu',return_sequences=True,input_shape=(x_train.shape[1],1)))
reg.add(Dropout(0.3))
reg.add(SimpleRNN(256,return_sequences=True,activation='relu'))
reg.add(Dropout(0.3))
reg.add(SimpleRNN(64,return_sequences=True,activation='relu'))
reg.add(Dropout(0.3))
reg.add(Flatten())
reg.add(Dense(1))


reg.compile(optimizer='adam',loss='mean_squared_error')


In [None]:

val=reg.fit(x_train,y_train,epochs=100,batch_size=32,validation_split=0.1)

In [None]:
reg.save('/kaggle/working/timestamp_priceRNN.h5')

In [None]:
training_loss = val.history['loss']
val_loss = val.history['val_loss']

plt.plot(training_loss,label='training_loss')
plt.plot(val_loss,label='val_loss')
plt.legend()
plt.title('Visualising loss for RNN',fontsize=18)
plt.xlabel('Epochs',fontsize=15)
plt.ylabel('Loss',fontsize=15)
plt.show()

In [None]:
inputs=X[len(X)-len(test_data)-timestep:]
inputs=inputs.values.reshape(-1,1)
inputs=scaler.transform(inputs)

In [None]:
x_test=[]
y_test=[]
for i in range(timestep,inputs.shape[0]):
    x_test.append(inputs[i-timestep:i,0])
    y_test.append(inputs[i,0])
x_test=np.array(x_test)
y_test=np.array(y_test)
x_test=x_test.reshape(x_test.shape[0],x_test.shape[1],1)

In [None]:
pred = reg.predict(x_test)
rnn_pred=scaler.inverse_transform(pred)

In [None]:
data_test=np.array(test_data)
data_test=data_test.reshape(len(data_test),1)

In [None]:
plt.figure(figsize = (20,7))
plt.plot(data_test,'-')
plt.plot(rnn_pred,'-')
plt.xlabel('Time(days)')
plt.ylabel('Price')
plt.title('Price vs Time (using SimpleRNN)')
plt.legend(['Actual price', 'Predicted price'])
plt.show()

In [None]:
lstm=Sequential()

lstm.add(LSTM(64,input_shape=(x_train.shape[1],1),activation="relu"))


lstm.add(Dense(1))

lstm.compile(loss="mean_squared_error",optimizer="adam")

hist=lstm.fit(x_train,y_train,epochs=100,batch_size=32,validation_split=0.1)

In [None]:
lstm.save('/kaggle/working/timeseries_price_LSTM.h5')

In [None]:
training_loss = hist.history['loss']
val_loss = hist.history['val_loss']

plt.plot(training_loss,label='training_loss')
plt.plot(val_loss,label='val_loss')
plt.legend()
plt.title('Visualising loss for LSTM',fontsize=18)
plt.xlabel('Epochs',fontsize=15)
plt.ylabel('Loss',fontsize=15)
plt.show()

In [None]:
# inputs=X[len(X)-len(test_data)-timestep:]
# inputs=inputs.values.reshape(-1,1)
# inputs=scaler.transform(inputs)

In [None]:
# x_test=[]
# y_test=[]
# for i in range(timestep,inputs.shape[0]):
#     x_test.append(inputs[i-timestep:i,0])
#     y_test.append(inputs[i,0])
# x_test=np.array(x_test)
# y_test=np.array(y_test)
# x_test=x_test.reshape(x_test.shape[0],x_test.shape[1],1)

In [None]:
pred = lstm.predict(x_test)
lstm_pred=scaler.inverse_transform(pred)

In [None]:
# data_test=np.array(test_data)
# data_test=data_test.reshape(len(data_test),1)

In [None]:
# y_test and pred are both on the standardized scale, so the MSE is in scaled units
print('MSE : ' + str(mean_squared_error(y_test, pred)))
lstm_score = r2_score(y_test,pred)
print("R2 Score of LSTM model = ",lstm_score)

In [None]:
plt.figure(figsize = (20,7))
plt.plot(data_test,'-')
plt.plot(lstm_pred,'-')
plt.xlabel('Time')
plt.ylabel('Price')
plt.title(' Price vs Time (using LSTM)')
plt.legend(['Actual price', 'Predicted price'])
plt.show()

In [None]:
plt.figure(figsize = (20,7))
plt.plot(data_test,'-',label='Actual price')
plt.plot(rnn_pred,'-',label='RNN Predicted price')
plt.plot(lstm_pred,'-',label='LSTM Predicted price')
plt.xlabel('Time')
plt.ylabel('Price')
plt.title(' Price vs Time (SimpleRNN vs LSTM)')
plt.legend()
plt.show()

In [None]:
prediction=x_test[-1]

In [None]:
for i in range(30):
    kl=prediction[i:timestep+i].reshape(1,timestep,1)
    prediction=np.append(prediction,lstm.predict(kl),axis=0)

In [None]:
prediction=scaler.inverse_transform(prediction)

In [None]:
plt.figure(figsize = (20,7))
plt.plot(prediction[30:],'-',label='Predicted price ')

plt.xlabel('Days')
plt.ylabel('Price')
plt.title(' Predicted price of next 30 days ')
plt.legend()
plt.show()

## Conclusion:
- Anyone trading bitcoin needs to watch prices daily: upward and downward swings are of similar size, so positions must be squared off at the right time to make a profit.
- Miners or investors who traded bitcoin in or before 2014 would have more wealth now, because volume was high and prices were low, so their percentage profit is larger than for those who entered in 2021.
- The highest-value trades were made in 2021, but the average was still very low, which suggests that most investors are not aggressive traders and may fear losses or lack domain knowledge.
- The average currency volume was highest in 2021 even though the data covers only 3 months of that year, which shows that a huge amount of trading occurred in a single quarter, and it may increase by the end of the year.
- The growth in prices at the start of 2021 is almost exponential, and some reports suggest the price might hit 100K USD by the end of the year, so a trader could make a good profit by then.
- For a particular day, the difference between the highest and lowest price is almost negligible compared with the overall change over a month or year, which shows that significant price changes build up over periods longer than 24 hours.
- The best season to profit from BTC is winter, as the price of bitcoin tends to rise during that period of the year.