# Predicting Bitcoin Price with Supervised Learning Methods

In cryptocurrency markets, traditional technical analysis (TA) studies historical price charts and trading volume data to identify potential future price movements. However, because of the market's inherent volatility and non-linear relationships, traditional methods often fall short.

This project aims to develop and evaluate supervised learning models for predicting the Bitcoin price. The focus is on assessing how effectively Linear Regression and Support Vector Machines (SVM) capture historical price patterns and predict future price movements.

All source code for this project is available in this Git repository: https://github.com/zac4j/btc-price-prediction

## Data Preparation

### Data Description

I will use the [5 Years of Cryptocurrency Historical Prices](https://www.kaggle.com/datasets/mjdskaggle/5-years-of-crypto-data-as-of-632024/data) dataset. It contains historical price information for some of the top cryptocurrencies by market capitalization; this project uses only the Bitcoin data. The price history covers the five years up to **June 03, 2024**.

| Factor | Description |
|:-------|:------------|
| Date   | Date of observation |
| Open   | Opening price on the given day |
| High   | Highest price on the given day |
| Low    | Lowest price on the given day |
| Close  | Closing price on the given day |
| Volume | Volume of transactions on the given day |

### Data Initialization

Loading and Previewing Bitcoin Historical Price Data

In [None]:
import pandas as pd
import numpy as np

# Load Bitcoin csv data
filename = 'data/BTC-USD.csv'
df = pd.read_csv(filename,parse_dates=['Date'],index_col='Date')

Display Data Frame Information

In [None]:
from IPython.display import display, Markdown

display(Markdown(f"### DataFrame "))
display(df.head())
df.info()

Based on the above information, we can observe the following attributes of the DataFrame:

- It contains no null values, so no missing-data handling is needed at this stage.
- It contains two closing price features, **Close** and **Adj Close**. For the price prediction, we will use *Adj Close* (Adjusted Closing Price).
- The **Date** column represents the date of observation, also serving as the index of the DataFrame.
- The other columns are numeric values:
  - **Open**: The opening price of Bitcoin for the given day in USD.
  - **High**: The highest price of Bitcoin for the given day in USD.
  - **Low**: The lowest price of Bitcoin for the given day in USD.
  - **Close**: The closing price of Bitcoin for the given day in USD.
  - **Adj Close**: The adjusted closing price of Bitcoin for the given day in USD.
  - **Volume**: The trading volume of Bitcoin for the given day.

## Data Cleaning

### Data Format Consistency Check

In [None]:
from pandas import DataFrame
def check_format_consistent(df: DataFrame) -> None:
    """
    Check and display the format consistency of a given DataFrame.

    Parameters:
        df: The DataFrame to inspect.

    """
    display(Markdown("### Data Format Consistency:"))
    # Check data types
    display(Markdown(f"- Data Types: {df.dtypes}"))
    # Check for missing values
    display(Markdown(f"- Null Values: {df.isna().sum()}"))
    # Check for unique values
    display(Markdown(f"- Unique Values: {df.nunique()}"))

check_format_consistent(df)

Based on the output above, the DataFrame has a consistent format.

### Outlier Detection

In [None]:
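def detect_outlier(df: DataFrame) -> dict:
    """
    Count outliers in each numerical column using the 1.5 * IQR rule.

    Returns a dict with the outlier count and outlier percentage per column.
    """
    # Pick numerical columns
    num_cols = df.select_dtypes(include=['number']).columns

    # Per-column outlier counts and percentages
    num_outliers = {}
    pct_outliers = {}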

    # Data statistics summary
    desc = df.describe()
    for col in num_cols:
        # Get quartiles and IQR
        q1 = desc.loc['25%', col]
        q3 = desc.loc['75%', col]
        iqr = q3 - q1
        
        # Define the outlier range
        outlier_range = 1.5 * iqr

        # Count the number of outliers
        num_outliers[col] = ((df[col] > q3 + outlier_range) | (df[col] < q1 - outlier_range)).sum()

        # Compute the percentage of outliers
        percentage = (num_outliers[col] / len(df[col])) * 100
        pct_outliers[col] = "{:.2f}%".format(percentage)

    outlier_info = {
        'numerical_outliers': num_outliers,
        'percent_outliers': pct_outliers
    }
    return outlier_info

outlier_info = detect_outlier(df)
print(outlier_info)

Based on the observation above, only the Volume feature has outliers. For now, I prefer to retain them, because substituting new Volume values could introduce bias relative to the prices.
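
To make the 1.5 × IQR rule used by `detect_outlier` concrete, here is a minimal sketch on a made-up series. The values in `toy_volume` are invented for illustration and are not taken from the Bitcoin dataset.

```python
import pandas as pd

# Made-up daily volumes; the last value is deliberately extreme.
toy_volume = pd.Series([10, 12, 11, 13, 12, 11, 100], name="Volume")

# Quartiles and interquartile range (same default interpolation as describe()).
q1 = toy_volume.quantile(0.25)
q3 = toy_volume.quantile(0.75)
iqr = q3 - q1

# Values outside [q1 - 1.5 * IQR, q3 + 1.5 * IQR] are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
is_outlier = (toy_volume < lower_fence) | (toy_volume > upper_fence)

print(f"fences: [{lower_fence}, {upper_fence}]")
print(f"outliers: {toy_volume[is_outlier].tolist()}")
```

Because the fences depend only on the quartiles, a handful of extreme trading days does not move them much, which is why heavy-tailed features such as volume tend to be flagged.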

## Exploratory Data Analysis

### Correlation of features

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
def display_correlation_heatmap(df: DataFrame):
    # Pick numerical data
    numeric_data = df.select_dtypes(include=[np.number])
    # Get features correlation
    corr_matrix = numeric_data.corr()

    # Find the most correlated pair features
    corr_matrix_value = corr_matrix.mask(corr_matrix == 1.0).stack().idxmax()
    print(f'The most correlated feature pair is {corr_matrix_value}, with the value of {corr_matrix.loc[corr_matrix_value]} ')

    # Plot correlation heatmap
    plt.figure(figsize=(10,8))
    sns.heatmap(corr_matrix, annot=True, cmap='coolwarm',fmt='.2f')
    plt.title('Bitcoin Features Correlation Heatmap')
    plt.show()

display_correlation_heatmap(df)

Based on the correlation heatmap above, the OHLC features are strongly positively correlated with one another (coefficients close to 1).
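
When two columns are nearly identical, their correlation can be exactly 1.0, so masking on `== 1.0` may hide a genuine pair along with the diagonal. A possible alternative, sketched here on synthetic data, is to blank out only the diagonal before searching for the maximum. The `toy` frame and its columns are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: Close tracks Open closely, Volume is unrelated noise.
rng = np.random.default_rng(0)
base = rng.normal(size=200).cumsum()
toy = pd.DataFrame({
    "Open": base,
    "Close": base + rng.normal(scale=0.1, size=200),
    "Volume": rng.normal(size=200),
})

corr = toy.corr()

# Blank out only the diagonal (self-correlation), keeping every real pair.
values = corr.to_numpy().copy()
np.fill_diagonal(values, np.nan)

# Locate the largest remaining coefficient, ignoring the NaN diagonal.
row, col = np.unravel_index(np.nanargmax(values), values.shape)
best_pair = (corr.index[row], corr.columns[col])
print(f"Most correlated pair: {best_pair}, r = {values[row, col]:.4f}")
```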

### Price Movements and Trends

#### The OHLC (Open, High, Low, Close) prices historical trend

In [None]:
from matplotlib.colors import rgb2hex
import plotly.graph_objects as go
def plot_prices(features:list[str]):
    # Define a color palette for the features
    palette = sns.color_palette('Paired', n_colors=len(features))
    hex_palette = [rgb2hex(color) for color in palette]
    # Define line style for the features
    dashes = ['solid','dash', 'dot', 'dashdot']
    fig = go.Figure()

    # Add one line trace per feature, each with its own color and dash style
    for feature, color, dash in zip(features, hex_palette, dashes):
        fig.add_trace(go.Scatter(x=df.index, y=df[feature], mode='lines', name=feature, line=dict(color=color,dash=dash)))
    
    fig.update_layout(
            title="Bitcoin OHLC Prices Over Time",
            xaxis_title='Date',
            yaxis_title='Price',
            template='plotly_dark',
            autosize=True,
            height=600,
        )
        
    fig.show()
# OHLC Features
features = ["Open","High","Low","Close"]
plot_prices(features)

Based on the chart above, we can observe that:

Even though prices fluctuate significantly over time, the long-term trend is one of growth.

Bearish Phases:

- 2014-2015: Prices decline significantly.
- 2018-2019: Another bearish phase with a downward trend.

Bullish Phases:

- 2017: A strong bullish phase for Bitcoin.
- 2020-2021: Another bullish period with notable price growth.

#### Relationship between opening and closing price

In [None]:
from scipy.stats import linregress
import plotly.express as px

# Calculate given features regression statistics
def calculate_regression_stats(x, y):
    """
    Calculate regression statistics for the given data.

    This function computes the slope, intercept, coefficient of determination (R-squared),
    p-value, and standard error of the regression line for the provided x and y data.

    Parameters:
    x (array-like): The independent variable data.
    y (array-like): The dependent variable data.

    Returns:
    dict: A dictionary containing the following keys and their corresponding values:
        - 'slope': The slope of the regression line.
        - 'intercept': The intercept of the regression line.
        - 'r_squared': The coefficient of determination (R-squared) value.
        - 'p_value': The p-value for the slope.
        - 'std_err': The standard error of the regression line.
    """
    slope, intercept, r_value, p_value, std_err = linregress(x, y)
    return {
        'slope': slope,
        'intercept': intercept,
        'r_squared': r_value**2,
        'p_value': p_value,
        'std_err': std_err
    }

# Define meaningful pairs of numerical features
features = [('Open', 'Close')]

# Create scatter plots for each pair of numerical features for each selected coin
for feature_x, feature_y in features:
    # Filter data for the specific coin
    coin_data = df

    # Create scatter plot
    fig = px.scatter(
        coin_data, x=feature_x, y=feature_y, title=f'{feature_x} vs {feature_y} (Bitcoin)',
        labels={feature_x: feature_x, feature_y: feature_y},
        template='plotly_dark', opacity=0.5
    )

    # Calculate regression statistics
    stats = calculate_regression_stats(coin_data[feature_x], coin_data[feature_y])
    # print(stats)

    # Conditionally add the regression line if R² is above a threshold and p-value is below a threshold
    if stats['r_squared'] > 0.5 and stats['p_value'] < 0.05:
        fig.add_trace(
            go.Scatter(
                x=coin_data[feature_x], y=stats['slope']*coin_data[feature_x] + stats['intercept'],
                mode='lines', name=f"y = {stats['slope']:.2f}x + {stats['intercept']:.2f}",
                line=dict(color='red')
            )
        )
    else:
        print(f"The relationship between {feature_x} and {feature_y} for Bitcoin is not significant.")

    fig.show()


Based on the chart above, we can observe that:

The opening price (Open) and the closing price (Close) have a positive linear relationship: as the opening price increases, the closing price increases as well.

### Volume Change Over Time

In [None]:
def plot_volume():
    feature_volume = "Volume"
    fig = go.Figure()
    # Add a single line trace for the trading volume
    fig.add_trace(go.Scatter(x=df.index, y=df[feature_volume], mode='lines', name=feature_volume, line=dict(color='blue')))
    
    fig.update_layout(
            title="Bitcoin Volume Change Over Time",
            xaxis_title='Date',
            yaxis_title='Volume',
            template='plotly_dark',
            autosize=True,
            height=600,
        )
        
    fig.show()
# Plot volume movement chart
plot_volume()

Based on the volume chart, the highest trading volume occurred in March 2021, which marks the period when enthusiasm in the cryptocurrency market was at its peak.

### Correlation between Volume and Price

In [None]:
# Define meaningful pairs of numerical features
feature_pairs = [
    ('Open', 'Volume'),
    ('Close', 'Volume')
]

# Create scatter plots for each pair of features
for feature_x, feature_y in feature_pairs:
    # Create scatter plot
    fig = px.scatter(
        df, x=feature_x, y=feature_y, title=f'{feature_x} vs {feature_y} (Bitcoin)',
        labels={feature_x: feature_x, feature_y: feature_y},
        template='plotly_dark', opacity=0.5
    )

    # Calculate regression statistics
    stats = calculate_regression_stats(df[feature_x], df[feature_y])

    # Conditionally add the regression line if R² is above a threshold and p-value is below a threshold
    if stats['r_squared'] > 0.5 and stats['p_value'] < 0.05:
        fig.add_trace(
            go.Scatter(
                x=df[feature_x], y=stats['slope']*df[feature_x] + stats['intercept'],
                mode='lines', name=f"y = {stats['slope']:.2f}x + {stats['intercept']:.2f}",
                line=dict(color='red')
            )
        )
    else:
        print(f"The relationship between {feature_x} and {feature_y} for Bitcoin is not significant.")

    # Show the plot
    fig.show()

Based on the charts above, we can observe that:

- The opening price ("Open") and the trading volume ("Volume") are positively correlated.
- The closing price ("Close") and the trading volume ("Volume") are positively correlated as well.

This suggests that price and trading volume tend to move together, although correlation alone does not show which one drives the other.

### Investment Returns

In [None]:
# Calculate daily returns and cumulative returns
df['return'] = df['Close'].pct_change()
df['Cumulative_Return'] = (1 + df['return']).cumprod() - 1

# Create a Plotly figure
fig = go.Figure()
    
# Add trace for cumulative returns
fig.add_trace(go.Scatter(
    x=df.index, 
    y=df['Cumulative_Return'], 
    mode='lines', 
    name='Cumulative Return', 
    line=dict(color='green')
))

# Update layout
fig.update_layout(
    title='Cumulative Return for Bitcoin',
    xaxis_title='Date',
    yaxis_title='Cumulative Return',
    template='plotly_dark',  # Set the dark theme
    height=600,  # Adjust height as needed
)

# Show the plot
fig.show()

From the cumulative return chart above, we can observe that:

- From October 2020 to November 2021, Bitcoin generated notably high returns.
- From December 2022 to the present, Bitcoin generated its highest returns.
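
As a quick check of the cumulative return formula used above, the sketch below applies the same `pct_change` and `cumprod` steps to a tiny invented price series (`toy_close` is illustrative, not real data). The final cumulative return should match the simple ratio of last price to first price minus one.

```python
import pandas as pd

# Invented closing prices for four consecutive days.
toy_close = pd.Series([100.0, 110.0, 99.0, 120.0])

# Daily return: (today - yesterday) / yesterday; the first row is NaN.
daily_return = toy_close.pct_change()

# Compounding the daily returns gives the return relative to the first day.
cumulative_return = (1 + daily_return).cumprod() - 1

# Cross-check: last price over first price, minus one.
total_return = toy_close.iloc[-1] / toy_close.iloc[0] - 1

print(cumulative_return.round(4).tolist())
print(f"total return: {total_return:.4f}")
```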

## Models

In the previous sections we completed data processing and feature engineering. In this section we focus on using different supervised learning algorithms to create and train models and then make predictions of the Bitcoin price.

### Bitcoin Price Prediction Using Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Use previously processed data
data = df
# N is the prediction horizon: predict the price N days ahead.
N = 5
# Target: the Adj Close (Adjusted Close Price) N days in the future
data['Prediction'] = data['Adj Close'].shift(-N)

# Split data into features (X) and target (y)
# Feature: use Adjusted Closing Price as train feature
X = data[['Adj Close']][:-N]
# Target: Predict future prices
y = data['Prediction'][:-N]

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and fit the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict the target on the test set
y_pred = model.predict(X_test)

# Evaluate model performance
mse = mean_squared_error(y_test, y_pred)
rmse = mse ** 0.5
r2 = r2_score(y_test, y_pred)
print(f'Linear Regression Model got MSE: {mse}, RMSE: {rmse}, R²: {r2}')

# Make predictions on the projection set
X_projection = data[['Adj Close']][-N:]
print(X_projection)
y_projection = model.predict(X_projection)
print(f'Linear Regression Model predicted BTC next {N} days price are: {y_projection}')

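# Index the projected prices by the N calendar days that follow the last observation
future_dates = pd.date_range(start=data.index[-1] + pd.Timedelta(days=1), periods=N, freq='D')
projection = pd.Series(y_projection, index=future_dates, name='Prediction')

# Draw the historical adjusted close together with the projected prices
fig = go.Figure()
fig.add_trace(go.Scatter(x=data.index, y=data['Adj Close'],
            mode='lines',
            name='Adj Close',
            line=dict(color='green',dash='solid')))
fig.add_trace(go.Scatter(x=projection.index, y=projection,
            mode='lines',
            name='Prediction',
            line=dict(color='orange',dash='dot')))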

fig.update_layout(title="Bitcoin Price Data Prediction",
            xaxis_title='Date',
            yaxis_title='Price',
            template='plotly_dark',
            autosize=True,
            height=600)
        
fig.show()


The model reaches an MSE of 6210016.296432966 with an R² of 0.9802407135806924, which looks reasonable so far. Let's see how other models perform.
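
One thing worth checking with time series data is how the split is made: a shuffled split mixes future days into the training set, which can make the scores look better than a real forecast would be. The sketch below is not the split used above. It shows, on a synthetic random-walk price series (the `toy` frame is an assumption for illustration), how `shuffle=False` keeps the most recent 20% of days as the test set.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic random-walk prices standing in for 'Adj Close'.
rng = np.random.default_rng(42)
dates = pd.date_range("2023-01-01", periods=300, freq="D")
toy = pd.DataFrame({"Adj Close": 20000 + rng.normal(0, 300, size=300).cumsum()}, index=dates)

# Same construction as above: the target is the price N days ahead.
N = 5
toy["Prediction"] = toy["Adj Close"].shift(-N)
X = toy[["Adj Close"]][:-N]
y = toy["Prediction"][:-N]

# shuffle=False preserves time order: train on the past, test on the most recent days.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

model = LinearRegression().fit(X_train, y_train)
r2_chronological = r2_score(y_test, model.predict(X_test))
print(f"Train ends {X_train.index.max().date()}, test starts {X_test.index.min().date()}")
print(f"Chronological split R²: {r2_chronological:.4f}")
```

Scores from a chronological split are usually lower than those from a shuffled split, but they are closer to what the model would achieve on genuinely unseen future days.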

### Bitcoin Price Prediction Using SVM

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

# Use previously processed data, dropping rows with missing values
# (dropna() returns a new DataFrame, so df itself is left unchanged)
data = df.dropna()

X = data[['Open', 'High', 'Low']][:-1]
# Define and fit the scaler
scaler = StandardScaler()
scaled_features = scaler.fit_transform(X)

# Target: closing price, aligned row-for-row with the features above
# (no shift is applied, so each row pairs a day's Open/High/Low with that same day's Close)
target_variable = data['Close'][:-1]

X_train, X_test, y_train, y_test = train_test_split(scaled_features, target_variable, test_size=0.2, random_state=42)

# Define and train the SVR model
model = SVR(kernel='rbf', C=191, gamma=0.1)
model.fit(X_train, y_train)

# Make predictions on the testing set
y_predicted = model.predict(X_test)

# Calculate MSE and R-squared
mse = mean_squared_error(y_test, y_predicted)
r2 = r2_score(y_test, y_predicted)

print(f"SVM Model1 got Mean Squared Error: {mse:.2f}")
print(f"SVM Model1 got R-squared: {r2:.6f}")

# Predict from the most recent row, which was held out of training
X_projection = data[['Open', 'High', 'Low']][-1:]
scaled_X = scaler.transform(X_projection)
predict_price = model.predict(scaled_X)
print(f"SVM Model1 predicted BTC next day price: {predict_price}")

With an arbitrarily chosen kernel, C, and gamma, the model reaches an R-squared of about 0.97, but its MSE is higher than that of the linear regression model, so it is not yet a good predictor. Fortunately, we can use GridSearchCV to identify better parameters and improve prediction quality.

In [None]:
# Use GridSearchCV to grab the best parameters
c_range = [i for i in range(1, 200, 10)]
gamma_range = [0.1, 0.3, 0.5, 0.7, 0.9]
epsilon_range = [0.01, 0.1, 1.0]
kernel = ['linear', 'rbf']
param_grid = dict(gamma=gamma_range, C=c_range, kernel=kernel,epsilon=epsilon_range)
grid = GridSearchCV(model, param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)

print("best parameters: ", grid.best_params_)
# For a regressor, best_score_ is the mean cross-validated R², not an accuracy
print("best cross-validated R²: ", grid.best_score_)

With the help of GridSearchCV, we got the best parameters {'C': 191, 'epsilon': 1.0, 'gamma': 0.1, 'kernel': 'linear'} (note that gamma has no effect with the linear kernel). Let's create a new model with these parameters and make the prediction again.

In [None]:
# Create model2 with given best parameters
model2 = SVR(kernel='linear', C=191, gamma=0.1, epsilon = 1.0)
model2.fit(X_train, y_train)

y_predicted = model2.predict(X_test)

# Calculate MSE and R-squared
mse = mean_squared_error(y_test, y_predicted)
r2 = r2_score(y_test, y_predicted)

print(f"SVM Model2 Mean Squared Error: {mse:.2f}")
print(f"SVM Model2 R-squared: {r2:.6f}")

predict_price = model2.predict(scaled_X)
print(f"SVM Model2 predicted BTC next day price: {predict_price}")

With the best parameters, we got an R-squared of 0.998649 and an MSE of 427034.38, which indicates a better model than the previous one. The predicted BTC price is 68738.20440099, which looks reasonable.
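
The SVM cells above pair each row's Open, High, and Low with the Close of the same row. If a true next-day forecast is wanted, one possible adjustment is to shift the target by one row so that each day's features line up with the following day's Close. The sketch below illustrates the alignment on a tiny invented frame; `toy` and the `Next_Close` column are hypothetical names introduced here.

```python
import pandas as pd

# Invented OHLC rows for four consecutive days.
toy = pd.DataFrame(
    {
        "Open": [100.0, 102.0, 101.0, 105.0],
        "High": [103.0, 104.0, 106.0, 108.0],
        "Low": [99.0, 100.0, 100.0, 104.0],
        "Close": [102.0, 101.0, 105.0, 107.0],
    },
    index=pd.date_range("2024-01-01", periods=4, freq="D"),
)

# shift(-1) moves tomorrow's Close onto today's row.
toy["Next_Close"] = toy["Close"].shift(-1)

# The last row has no "tomorrow" yet, so it cannot be used for training.
supervised = toy.dropna(subset=["Next_Close"])
X = supervised[["Open", "High", "Low"]]
y = supervised["Next_Close"]

print(pd.concat([X, y], axis=1))
```

With this alignment, the final row of `toy` (which has features but no known target) is the natural input for predicting the next, not yet observed, day.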

## Results and Analysis

### Model Evaluation Results Analysis

Based on the output above, we obtained the following model evaluation results:

|          Model          |      MSE     |    RMSE    | R-squared |
|:-----------------------:|:------------:|:----------:|:---------:|
| Linear Regression Model | 6,210,016.29 | 2,491.99   | 0.9802    |
| SVM Model1              | 9,724,326.35 | ~3,118.39  | 0.9692    |
| SVM Model2              | 427,034.38   | ~653.48    | 0.9986    |

Note: SVM RMSE values are estimated based on MSE (RMSE = sqrt(MSE)).
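
The RMSE column can be reproduced directly from the reported MSE values. The short sketch below does the arithmetic with the standard library only; the dictionary simply restates the MSE figures reported above.

```python
import math

# MSE values as reported by the three models above.
mse_by_model = {
    "Linear Regression Model": 6210016.296432966,
    "SVM Model1": 9724326.35,
    "SVM Model2": 427034.38,
}

# RMSE is the square root of MSE, which puts the error back in USD.
rmse_by_model = {name: math.sqrt(mse) for name, mse in mse_by_model.items()}

for name, rmse in rmse_by_model.items():
    print(f"{name}: RMSE = {rmse:,.2f}")
```

Because RMSE is expressed in the same unit as the price, it is easier to judge against the price level than MSE.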

#### Linear Regression Result

For the Linear Regression model, the high R-squared value (>0.98) suggests a strong correlation between the predicted and actual values. However, the large MSE and RMSE indicate significant absolute errors in the predictions.

#### SVM Results

The two SVM results offer contrasting outcomes:

- SVM Model1: As with linear regression, the R-squared value suggests a good fit, but the high MSE indicates large prediction errors.
- SVM Model2: This model achieved the lowest MSE and the highest R-squared value, which suggests potentially more accurate predictions.

This also indicates that GridSearchCV is a great help in tuning hyperparameters to improve predictions.
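
One refinement worth considering when tuning: if the scaler is fitted on all rows before the split, the cross-validation folds see statistics from data they are later scored on. Wrapping the scaler and the SVR in a `Pipeline` lets `GridSearchCV` refit the scaler inside each fold. The following is a minimal sketch on synthetic data, not the tuning run reported above; the arrays and the small parameter grid are assumptions chosen to keep it fast.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for (Open, High, Low) -> Close.
rng = np.random.default_rng(0)
open_ = 100 + rng.normal(0, 1, size=150).cumsum()
high = open_ + rng.uniform(0, 2, size=150)
low = open_ - rng.uniform(0, 2, size=150)
close = (high + low) / 2 + rng.normal(0, 0.2, size=150)
X = np.column_stack([open_, high, low])

# The scaler is part of the estimator, so it is refitted on each training fold.
pipe = Pipeline([("scaler", StandardScaler()), ("svr", SVR())])

# Pipeline parameters are addressed as <step name>__<parameter>.
param_grid = {
    "svr__kernel": ["linear", "rbf"],
    "svr__C": [1, 10, 100],
    "svr__epsilon": [0.1, 1.0],
}
search = GridSearchCV(pipe, param_grid=param_grid, cv=5)
search.fit(X, close)

print("best parameters:", search.best_params_)
print(f"mean cross-validated R²: {search.best_score_:.4f}")
```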

### Overall Observation

- It's challenging to say which model or method performs best based on the limited data provided. However, SVM Model2 shows promise, with the lowest MSE and the highest R-squared value.
- High R-squared values alone can be misleading, especially for financial time series data. It's crucial to consider the absolute errors for a more realistic assessment of prediction accuracy.
- All models likely suffer from limitations in capturing the non-linear nature of Bitcoin price movement.
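
To illustrate why a high R-squared alone can mislead on price series, the sketch below scores a naive "tomorrow equals today" forecast on a synthetic random walk. The series and its parameters are invented for illustration. Even though this forecast carries no predictive insight, its R-squared is typically very close to 1, because prices on consecutive days are close to each other relative to the overall spread of the series.

```python
import numpy as np

# Synthetic random-walk prices (invented; step size of about 500 USD per day).
rng = np.random.default_rng(7)
prices = 30000 + rng.normal(0, 500, size=2000).cumsum()

actual = prices[1:]
naive_forecast = prices[:-1]  # "tomorrow equals today"

# R² = 1 - SS_res / SS_tot, computed by hand.
ss_res = np.sum((actual - naive_forecast) ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# RMSE keeps the error in price units.
rmse = np.sqrt(np.mean((actual - naive_forecast) ** 2))

print(f"Naive forecast R²: {r2:.4f}")
print(f"Naive forecast RMSE: {rmse:.2f}")
```

Comparing a model's RMSE against such a naive baseline is one way to tell whether it adds information beyond the previous day's price.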

## Conclusion

This project explored the potential of supervised learning algorithms, specifically linear regression and SVM, for predicting Bitcoin prices. While the project contributes to the ongoing investigation of machine learning in financial forecasting, the results highlight the challenges associated with this task.

The evaluation metrics suggest that linear models may not be sufficient for capturing the complex, non-linear dynamics of cryptocurrency markets. While they achieved high R-squared values, indicating a good fit to the data, the large absolute errors (MSE, RMSE) reveal limitations in accurately predicting future prices.

This finding underscores the importance of considering the inherent non-linearity of cryptocurrency price movements. Future efforts in this domain could benefit from exploring alternative models like Long Short-Term Memory (LSTM) networks, which are specifically designed for time series data and can potentially learn and model these non-linear relationships more effectively.

Furthermore, incorporating additional features beyond the OHLC and volume data, such as news sentiment analysis or technical indicators, might further improve the accuracy of predictions.

By acknowledging the limitations of the current approach and outlining potential avenues for improvement, this project lays the groundwork for further exploration of advanced machine learning techniques for Bitcoin price forecasting. It emphasizes the need for continuous learning and experimentation in this ever-evolving field.


## References

The following papers and projects provided great help during the design and implementation of this project:

- [Stock price prediction based on financial statements using SVM](https://gvpress.com/journals/IJHIT/vol9_no2/5.pdf)
- [Crypto-currency price prediction using decision tree and regression techniques](https://ieeexplore.ieee.org/abstract/document/8862585/?casa_token=CYP8qUONrC0AAAAA:1jOuWfcsjRj08mJivToycTitTiMQndn9FmxZIVgNiaJRd_jB7T2VYl1BO_HekUnpjI6kG1_hOQ)
- [5 Years of Crypto Data as of 6/3/2024](https://www.kaggle.com/datasets/mjdskaggle/5-years-of-crypto-data-as-of-632024/data)
- [EDA for Cryptocurrency](https://www.kaggle.com/code/yannaktb/eda-for-cryptocurrency)
- [Dogecoin Price Prediction](https://www.kaggle.com/code/iahhel/dogecoin-price-prediction-xgboost-gridsearchcv)
- [Prediction of cryptocurrency price](https://www.kaggle.com/code/aishwarya2210/prediction-of-cryptocurrency-price/notebook)
- [Exploring and Predicting Cryptocurrencies](https://www.kaggle.com/code/syedanwarafridi/exploring-and-predicting-cryptocurrencies/notebook)