## Data collection strategy:

Choose a data source that provides historical cryptocurrency price data, such as Binance, Coinbase, or CoinMarketCap. This notebook uses Yahoo Finance through the yfinance package.
Use an API or a web scraping technique to collect the historical price data for Bitcoin, Ethereum, and Litecoin over a specified time period. The cells below walk through the steps for Bitcoin (BTC-USD).


### Import the required libraries

In [1]:
import pandas as pd
import numpy as np
import yfinance as yf
from yahoofinancials import YahooFinancials
from datetime import datetime



In [15]:
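tickerSymbol = 'BTC-USD'
tickerData = yf.Ticker(tickerSymbol)
current_date = datetime.today().strftime('%Y-%m-%d')

# Daily bars from 2020-01-01 up to (but not including) today.
# 'interval' sets the bar size; 'period' is not needed when start/end are given.
tickerDf = tickerData.history(interval='1d', start='2020-01-01', end=current_date)
tickerDf.reset_index(inplace=True)
# Only has an effect if the index was unnamed and came back as 'index'
tickerDf.rename(columns={'index': 'Date'}, inplace=True)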


print(tickerDf.head())

                       Date         Open         High          Low   
0 2020-01-01 00:00:00+00:00  7194.892090  7254.330566  7174.944336  \
1 2020-01-02 00:00:00+00:00  7202.551270  7212.155273  6935.270020   
2 2020-01-03 00:00:00+00:00  6984.428711  7413.715332  6914.996094   
3 2020-01-04 00:00:00+00:00  7345.375488  7427.385742  7309.514160   
4 2020-01-05 00:00:00+00:00  7410.451660  7544.497070  7400.535645   

         Close       Volume  Dividends  Stock Splits  
0  7200.174316  18565664997        0.0           0.0  
1  6985.470215  20802083465        0.0           0.0  
2  7344.884277  28111481032        0.0           0.0  
3  7410.656738  18444271275        0.0           0.0  
4  7411.317383  19725074095        0.0           0.0  
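The cell above downloads Bitcoin only. A minimal sketch of how the same call could be repeated for Ethereum and Litecoin follows. The helper name collect_prices and its fetch parameter are hypothetical, introduced here for illustration, and the sketch assumes Yahoo Finance lists the other two coins as ETH-USD and LTC-USD.

```python
import pandas as pd


def collect_prices(symbols, start, end, fetch=None):
    """Download daily prices for several tickers and stack them in one table.

    fetch(symbol, start, end) must return a DataFrame indexed by date.
    When it is omitted, yfinance is used, as in the cell above.
    """
    if fetch is None:
        import yfinance as yf  # imported lazily so the helper also works offline with a custom fetch

        def fetch(symbol, start, end):
            return yf.Ticker(symbol).history(interval="1d", start=start, end=end)

    frames = []
    for symbol in symbols:
        frame = fetch(symbol, start, end).reset_index()
        frame.insert(0, "Symbol", symbol)  # remember which coin each row belongs to
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)
```

Calling collect_prices(['BTC-USD', 'ETH-USD', 'LTC-USD'], '2020-01-01', current_date) should return one long table with a Symbol column, which can then be saved or split per coin.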


### Save the file:

Save the data in a suitable format, such as a CSV file, for further analysis.

In [16]:
tickerDf.to_csv('output1.csv', index=False)
print(tickerDf.shape)
tickerDf

(1271, 8)


Unnamed: 0,Date,Open,High,Low,Close,Volume,Dividends,Stock Splits
0,2020-01-01 00:00:00+00:00,7194.892090,7254.330566,7174.944336,7200.174316,18565664997,0.0,0.0
1,2020-01-02 00:00:00+00:00,7202.551270,7212.155273,6935.270020,6985.470215,20802083465,0.0,0.0
2,2020-01-03 00:00:00+00:00,6984.428711,7413.715332,6914.996094,7344.884277,28111481032,0.0,0.0
3,2020-01-04 00:00:00+00:00,7345.375488,7427.385742,7309.514160,7410.656738,18444271275,0.0,0.0
4,2020-01-05 00:00:00+00:00,7410.451660,7544.497070,7400.535645,7411.317383,19725074095,0.0,0.0
...,...,...,...,...,...,...,...,...
1266,2023-06-20 00:00:00+00:00,26841.664062,28388.968750,26668.791016,28327.488281,22211859147,0.0,0.0
1267,2023-06-21 00:00:00+00:00,28311.310547,30737.330078,28283.410156,30027.296875,33346760979,0.0,0.0
1268,2023-06-22 00:00:00+00:00,29995.935547,30495.998047,29679.158203,29912.281250,20653160491,0.0,0.0
1269,2023-06-23 00:00:00+00:00,29896.382812,31389.539062,29845.214844,30695.468750,24115570085,0.0,0.0


In [17]:
df = pd.read_csv("output1.csv", index_col=False)

In [18]:
df

Unnamed: 0,Date,Open,High,Low,Close,Volume,Dividends,Stock Splits
0,2020-01-01 00:00:00+00:00,7194.892090,7254.330566,7174.944336,7200.174316,18565664997,0.0,0.0
1,2020-01-02 00:00:00+00:00,7202.551270,7212.155273,6935.270020,6985.470215,20802083465,0.0,0.0
2,2020-01-03 00:00:00+00:00,6984.428711,7413.715332,6914.996094,7344.884277,28111481032,0.0,0.0
3,2020-01-04 00:00:00+00:00,7345.375488,7427.385742,7309.514160,7410.656738,18444271275,0.0,0.0
4,2020-01-05 00:00:00+00:00,7410.451660,7544.497070,7400.535645,7411.317383,19725074095,0.0,0.0
...,...,...,...,...,...,...,...,...
1266,2023-06-20 00:00:00+00:00,26841.664062,28388.968750,26668.791016,28327.488281,22211859147,0.0,0.0
1267,2023-06-21 00:00:00+00:00,28311.310547,30737.330078,28283.410156,30027.296875,33346760979,0.0,0.0
1268,2023-06-22 00:00:00+00:00,29995.935547,30495.998047,29679.158203,29912.281250,20653160491,0.0,0.0
1269,2023-06-23 00:00:00+00:00,29896.382812,31389.539062,29845.214844,30695.468750,24115570085,0.0,0.0
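One detail to keep in mind when reading the CSV back: a plain read_csv returns the Date column as text rather than as datetimes. The sketch below uses a small made-up frame and an in-memory buffer instead of output1.csv to show the difference that the parse_dates argument makes.

```python
import io

import pandas as pd

# Toy frame standing in for tickerDf (values are made up)
sample = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-01", "2020-01-02"], utc=True),
    "Close": [7200.17, 6985.47],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
sample.to_csv(buffer, index=False)

# Plain read: 'Date' comes back as text
buffer.seek(0)
as_text = pd.read_csv(buffer)

# parse_dates restores 'Date' to a datetime column
buffer.seek(0)
as_dates = pd.read_csv(buffer, parse_dates=["Date"])

print(as_text["Date"].dtype, as_dates["Date"].dtype)
```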


### Features to use in the dataset:

1) The closing price of each cryptocurrency on a daily basis, which is typically the most important feature for price prediction models.

2) The opening price, high price, and low price of each cryptocurrency on a daily basis, which could provide additional information for prediction models.

3) The trading volume of each cryptocurrency on a daily basis, which could also be used as a feature in the model.

4) Drop the Dividends and Stock Splits columns. They describe events that occur for publicly traded companies, not for cryptocurrencies, so they are always zero here.

In [19]:

df = pd.DataFrame(tickerDf)

df = df.drop(['Dividends', 'Stock Splits'], axis=1)
df.head()

Unnamed: 0,Date,Open,High,Low,Close,Volume
0,2020-01-01 00:00:00+00:00,7194.89209,7254.330566,7174.944336,7200.174316,18565664997
1,2020-01-02 00:00:00+00:00,7202.55127,7212.155273,6935.27002,6985.470215,20802083465
2,2020-01-03 00:00:00+00:00,6984.428711,7413.715332,6914.996094,7344.884277,28111481032
3,2020-01-04 00:00:00+00:00,7345.375488,7427.385742,7309.51416,7410.656738,18444271275
4,2020-01-05 00:00:00+00:00,7410.45166,7544.49707,7400.535645,7411.317383,19725074095


### Removing features:

If there are too many missing values or if a feature doesn't provide much predictive power, it could be removed from the dataset. For example, if the trading volume data is missing for a large number of days, it may not be useful to include in the dataset.

In [20]:
# Drop rows (not columns) that have more than two missing values:
# thresh is the minimum number of non-missing values a row needs to be kept
df = df.dropna(thresh=df.shape[1] - 2)
print(df.shape)
df

(1271, 6)


Unnamed: 0,Date,Open,High,Low,Close,Volume
0,2020-01-01 00:00:00+00:00,7194.892090,7254.330566,7174.944336,7200.174316,18565664997
1,2020-01-02 00:00:00+00:00,7202.551270,7212.155273,6935.270020,6985.470215,20802083465
2,2020-01-03 00:00:00+00:00,6984.428711,7413.715332,6914.996094,7344.884277,28111481032
3,2020-01-04 00:00:00+00:00,7345.375488,7427.385742,7309.514160,7410.656738,18444271275
4,2020-01-05 00:00:00+00:00,7410.451660,7544.497070,7400.535645,7411.317383,19725074095
...,...,...,...,...,...,...
1266,2023-06-20 00:00:00+00:00,26841.664062,28388.968750,26668.791016,28327.488281,22211859147
1267,2023-06-21 00:00:00+00:00,28311.310547,30737.330078,28283.410156,30027.296875,33346760979
1268,2023-06-22 00:00:00+00:00,29995.935547,30495.998047,29679.158203,29912.281250,20653160491
1269,2023-06-23 00:00:00+00:00,29896.382812,31389.539062,29845.214844,30695.468750,24115570085
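The dropna call above filters rows. To remove a whole feature (column) that is mostly empty, as described at the start of this section, one option is to measure the share of missing values per column and keep only the columns below a cut-off. This is a sketch on made-up data; the 0.5 cut-off and the variable names are assumptions, not values taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data: 'Volume' is missing on 3 of 4 days (values are made up)
toy = pd.DataFrame({
    "Close": [7200.0, 6985.0, 7345.0, 7411.0],
    "Volume": [np.nan, np.nan, 2.8e10, np.nan],
})

max_missing_share = 0.5  # assumed cut-off; tune it for your data

# Fraction of missing values in each column
missing_share = toy.isna().mean()

# Keep only the columns whose missing share is within the cut-off
keep = missing_share[missing_share <= max_missing_share].index
reduced = toy[keep]

print(missing_share)
print(reduced.columns.tolist())
```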


### Timestamp:

Many price feeds report timestamps in Unix time, which is the number of seconds that have elapsed since January 1, 1970 (UTC). Python's datetime module or pandas can convert Unix time to a human-readable datetime.

The dataset above is already in a human-readable datetime format, so the cell below keeps only the calendar date and adds a date column holding the equivalent Unix timestamp.

In [21]:
# Keep only the calendar day; formatting as text also drops the UTC offset
df['Date'] = pd.to_datetime(df['Date']).dt.strftime('%Y-%m-%d')
df['Date'] = pd.to_datetime(df['Date'])
# Unix timestamp: whole seconds elapsed since 1970-01-01
df['date'] = (df['Date'] - pd.Timestamp("1970-01-01")) // pd.Timedelta('1s')
df

Unnamed: 0,Date,Open,High,Low,Close,Volume,date
0,2020-01-01,7194.892090,7254.330566,7174.944336,7200.174316,18565664997,1577836800
1,2020-01-02,7202.551270,7212.155273,6935.270020,6985.470215,20802083465,1577923200
2,2020-01-03,6984.428711,7413.715332,6914.996094,7344.884277,28111481032,1578009600
3,2020-01-04,7345.375488,7427.385742,7309.514160,7410.656738,18444271275,1578096000
4,2020-01-05,7410.451660,7544.497070,7400.535645,7411.317383,19725074095,1578182400
...,...,...,...,...,...,...,...
1266,2023-06-20,26841.664062,28388.968750,26668.791016,28327.488281,22211859147,1687219200
1267,2023-06-21,28311.310547,30737.330078,28283.410156,30027.296875,33346760979,1687305600
1268,2023-06-22,29995.935547,30495.998047,29679.158203,29912.281250,20653160491,1687392000
1269,2023-06-23,29896.382812,31389.539062,29845.214844,30695.468750,24115570085,1687478400
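Going in the other direction, from Unix time back to a readable date, could look like the sketch below. It uses the first two values of the date column above; the variable names are illustrative.

```python
from datetime import datetime, timezone

import pandas as pd

unix_seconds = 1577836800  # first value in the 'date' column above

# Standard library: interpret the seconds as UTC
as_datetime = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)

# pandas: convert a whole column at once
as_series = pd.to_datetime(pd.Series([1577836800, 1577923200]), unit="s")

print(as_datetime.isoformat())
print(as_series.dt.strftime("%Y-%m-%d").tolist())
```

Passing tz=timezone.utc matters: without it, fromtimestamp uses the machine's local time zone, so the same number can print as a different day.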


### Missing values:

1) If there are missing values in the dataset, one common imputation technique is to fill in the missing values with the mean or median value for that feature.

2) Another technique is to use forward or backward filling, where missing values are filled in with the previous or next available value in the dataset. 

In [7]:
print(df.isna().sum())

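# df.dropna(inplace=True)                                    # drop rows with any missing value
# df.fillna(df.mean(numeric_only=True), inplace=True)        # fill with the column mean
# df.fillna(df.median(numeric_only=True), inplace=True)      # fill with the column median
# df.ffill(inplace=True)                                     # forward fill
# df.bfill(inplace=True)                                     # backward fill
# df.interpolate(inplace=True)                               # linear interpolation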

Open      0
High      0
Low       0
Close     0
Volume    0
dtype: int64


There are no missing values in this dataset. If your data does contain missing values, uncomment one of the lines above to handle them.
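To see how these options differ, here is a small sketch on a made-up series with two gaps. The numbers are invented for illustration and are not taken from the Bitcoin data.

```python
import numpy as np
import pandas as pd

# Toy closing prices with two gaps (values are made up)
close = pd.Series([10.0, np.nan, 14.0, np.nan, 18.0])

mean_filled = close.fillna(close.mean())      # every gap gets the overall mean
median_filled = close.fillna(close.median())  # every gap gets the overall median
forward_filled = close.ffill()                # carry the previous value forward
backward_filled = close.bfill()               # pull the next value backward
interpolated = close.interpolate()            # straight line between neighbours

print(pd.DataFrame({
    "mean": mean_filled,
    "ffill": forward_filled,
    "bfill": backward_filled,
    "interpolate": interpolated,
}))
```

For price prediction, forward filling is usually the safer default because it only uses past values. Backward filling and interpolation both borrow from future rows, which can leak information into a model.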