## Data Pipeline


1. Define a function to retrieve data from the data source (e.g., Yahoo Finance or Binance) using an API or web scraping tool. This function should accept the relevant parameters, such as start and end dates and the cryptocurrencies to retrieve.


2. Clean the retrieved data by removing any irrelevant columns, handling missing values, and converting the timestamp to human-readable format if necessary.


3. Save the cleaned data to a master dataset, either by appending to an existing file or creating a new one. This dataset should be stored in a format that is easily accessible for future use, such as a CSV or database.


4. Create a function to retrieve and clean data for a specific date range or cryptocurrency, passed in as parameters. This function is used for live predictions and for testing the model on specific data.


5. Define a function to preprocess the data for training the model. This function should handle any additional feature engineering or scaling necessary for the model to use the data effectively.


6. Finally, use the processed data to train and test the model, then use the functions defined in steps 4 and 5 to make live predictions on new data.


Note that the specific implementation of these steps will depend on the data source and the specific requirements of the project.
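
Step 3 (saving cleaned data to a master dataset) could be sketched as follows. This is a minimal sketch rather than the project's implementation: the function name `append_to_master`, the file name `master.csv`, and the choice to de-duplicate on `Date` while keeping the most recently fetched row are all assumptions made for illustration.

```python
import os
import tempfile

import pandas as pd


def append_to_master(new_rows: pd.DataFrame, master_path: str) -> pd.DataFrame:
    """Merge new_rows into the CSV at master_path, keeping one row per Date.

    Hypothetical helper: the name and the de-duplication rule are assumptions.
    """
    if os.path.exists(master_path):
        master = pd.read_csv(master_path, parse_dates=["Date"])
        combined = pd.concat([master, new_rows], ignore_index=True)
    else:
        combined = new_rows.copy()

    # When a date appears twice, keep the row that was fetched last.
    combined = (
        combined.drop_duplicates(subset="Date", keep="last")
        .sort_values("Date")
        .reset_index(drop=True)
    )
    combined.to_csv(master_path, index=False)
    return combined


# Tiny hand-made frames standing in for two overlapping downloads.
first = pd.DataFrame(
    {"Date": pd.to_datetime(["2023-01-01", "2023-01-02"]), "Close": [100.0, 101.0]}
)
second = pd.DataFrame(
    {"Date": pd.to_datetime(["2023-01-02", "2023-01-03"]), "Close": [101.5, 102.0]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "master.csv")  # illustrative file name
    append_to_master(first, path)
    master = append_to_master(second, path)

print(master)
```

De-duplicating on `Date` makes the function safe to re-run over overlapping date ranges, which is likely to happen when the same download is repeated daily.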

In [2]:
import yfinance as yf
import pandas as pd

## 1) Retrieve data from the data source

In [5]:
def get_crypto_data(ticker_symbol, start_date, end_date, save_file=False):
    """Download daily price history for ticker_symbol from Yahoo Finance."""
    ticker_data = yf.Ticker(ticker_symbol)
    # interval (not period) sets the bar size; start/end set the date range
    crypto_data = ticker_data.history(interval='1d', start=start_date, end=end_date)

    # Move the dates out of the index into a regular 'Date' column,
    # whether or not the data is also written to disk
    crypto_data = crypto_data.reset_index()

    if save_file:
        filename = f"{ticker_symbol.replace('-', '_')}_data.csv"
        crypto_data.to_csv(filename, index=False)
        print(f"Data saved to {filename}")

    return crypto_data


btc_data = get_crypto_data('BTC-USD', '2010-01-01', '2023-05-10', True)
print(btc_data.head())

Data saved to BTC_USD_data.csv
                       Date        Open        High         Low       Close   
0 2014-09-17 00:00:00+00:00  465.864014  468.174011  452.421997  457.334015  \
1 2014-09-18 00:00:00+00:00  456.859985  456.859985  413.104004  424.440002   
2 2014-09-19 00:00:00+00:00  424.102997  427.834991  384.532013  394.795990   
3 2014-09-20 00:00:00+00:00  394.673004  423.295990  389.882996  408.903992   
4 2014-09-21 00:00:00+00:00  408.084991  412.425995  393.181000  398.821014   

     Volume  Dividends  Stock Splits  
0  21056800        0.0           0.0  
1  34483200        0.0           0.0  
2  37919700        0.0           0.0  
3  36863600        0.0           0.0  
4  26580100        0.0           0.0  


In [6]:
df = btc_data.copy()  # btc_data is already a DataFrame; copy it so later steps leave the original untouched
df

,Date,Open,High,Low,Close,Volume,Dividends,Stock Splits
0,2014-09-17 00:00:00+00:00,465.864014,468.174011,452.421997,457.334015,21056800,0.0,0.0
1,2014-09-18 00:00:00+00:00,456.859985,456.859985,413.104004,424.440002,34483200,0.0,0.0
2,2014-09-19 00:00:00+00:00,424.102997,427.834991,384.532013,394.795990,37919700,0.0,0.0
3,2014-09-20 00:00:00+00:00,394.673004,423.295990,389.882996,408.903992,36863600,0.0,0.0
4,2014-09-21 00:00:00+00:00,408.084991,412.425995,393.181000,398.821014,26580100,0.0,0.0
...,...,...,...,...,...,...,...,...
3152,2023-05-05 00:00:00+00:00,28851.480469,29668.908203,28845.509766,29534.384766,17936566518,0.0,0.0
3153,2023-05-06 00:00:00+00:00,29538.859375,29820.126953,28468.966797,28904.623047,15913866714,0.0,0.0
3154,2023-05-07 00:00:00+00:00,28901.623047,29157.517578,28441.367188,28454.978516,11301355486,0.0,0.0
3155,2023-05-08 00:00:00+00:00,28450.457031,28663.271484,27310.134766,27694.273438,19122903752,0.0,0.0


## 2) Clean the retrieved data 

In [7]:
import pandas as pd

def clean_data(df):
    """Drop unused columns and missing rows, normalise dates, index by Unix time."""
    df = df.drop(columns=['Dividends', 'Stock Splits'], errors='ignore')
    df = df.dropna().copy()

    # The dates may arrive as the index or as a 'Date' column; use a column either way
    if 'Date' not in df.columns:
        df = df.reset_index()

    # Keep only the calendar date (drops the time of day and the timezone)
    df['Date'] = pd.to_datetime(pd.to_datetime(df['Date']).dt.strftime('%Y-%m-%d'))

    # Convert the dates to Unix timestamps (seconds since 1970-01-01)
    df['date'] = (df['Date'] - pd.Timestamp("1970-01-01")) // pd.Timedelta('1s')
    df = df.set_index('date')
    return df


df1 = clean_data(df)
print(df1.head())

                 Date        Open        High         Low       Close    Volume
date                                                                           
1410912000 2014-09-17  465.864014  468.174011  452.421997  457.334015  21056800
1410998400 2014-09-18  456.859985  456.859985  413.104004  424.440002  34483200
1411084800 2014-09-19  424.102997  427.834991  384.532013  394.795990  37919700
1411171200 2014-09-20  394.673004  423.295990  389.882996  408.903992  36863600
1411257600 2014-09-21  408.084991  412.425995  393.181000  398.821014  26580100
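
The date-to-Unix-timestamp conversion can be checked in isolation on a couple of hand-typed dates. The snippet below is a small sketch, assuming the dates arrive as UTC strings in the same form as the `Date` values printed for the downloaded frame; the variable names are illustrative only.

```python
import pandas as pd

# Two dates typed in by hand, in the same form as the downloaded 'Date' values.
dates = pd.Series(["2014-09-17 00:00:00+00:00", "2014-09-18 00:00:00+00:00"])

# Parse as UTC, then drop the timezone so the subtraction below
# is between two timezone-naive values.
parsed = pd.to_datetime(dates, utc=True).dt.tz_localize(None)

# Whole seconds elapsed since 1970-01-01.
unix_seconds = (parsed - pd.Timestamp("1970-01-01")) // pd.Timedelta("1s")

print(unix_seconds.tolist())  # [1410912000, 1410998400]
```

Consecutive daily rows should differ by exactly 86400 seconds. If every row shows the same timestamp, the conversion was most likely applied to something other than the real dates, such as an integer row index.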


## 3) Present the retrieved data

In [76]:
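def get_price(df, date):
    """Return the cleaned row for `date`, or a message if no row exists."""
    cleaned_df = clean_data(df).set_index('Date')

    # Normalise the lookup key so stray whitespace or a time of day cannot cause a miss
    date = pd.Timestamp(str(date).strip()).normalize()

    if date in cleaned_df.index:
        return cleaned_df.loc[date]
    return "Price data not available for the given date."


print(get_price(df, '2022-05-09'))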

Open      3.406002e+04
High      3.422207e+04
Low       3.029695e+04
Close     3.029695e+04
Volume    6.335549e+10
Name: 2022-05-09 00:00:00, dtype: float64
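
Step 5 of the pipeline (preprocessing for training) might look roughly like the sketch below. Everything in it is an assumption made for illustration: the function name `preprocess`, the `window` parameter, the two engineered features (daily return and a rolling mean of `Close`), and min-max scaling as the scaling method. The sample rows are the first five `Close` and `Volume` values from the download shown earlier.

```python
import pandas as pd


def preprocess(df: pd.DataFrame, window: int = 3) -> pd.DataFrame:
    """Add two illustrative features, then min-max scale every column.

    Hypothetical helper: the feature choice and scaling method are assumptions.
    """
    out = df[["Close", "Volume"]].copy()
    out["return"] = out["Close"].pct_change()              # day-over-day change
    out["close_ma"] = out["Close"].rolling(window).mean()  # rolling mean of Close
    out = out.dropna()  # the first rows have no return / rolling mean yet

    # Min-max scaling to [0, 1]. In a real pipeline the min and max should be
    # computed on the training split only and reused for test and live data.
    col_min, col_max = out.min(), out.max()
    return (out - col_min) / (col_max - col_min)


# First five rows of the BTC-USD download shown earlier.
sample = pd.DataFrame(
    {
        "Close": [457.334015, 424.440002, 394.795990, 408.903992, 398.821014],
        "Volume": [21056800, 34483200, 37919700, 36863600, 26580100],
    }
)

features = preprocess(sample)
print(features)
```

With `window=3`, the first two rows are dropped because they have no rolling mean, so five input rows yield three feature rows. A column whose values are all identical would divide by zero in the scaling step and come out as `NaN`, so such columns would need separate handling.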
