# Prediction

In this notebook, we are trying to predict the close price of Monero or XMR cryptocurrency in one day by designing a model using other characteristics of this currency in the previous day.

## Import Libraries

In [1]:
# to get the all subsets of a set for implement exhaustive feature selection
from itertools import chain, combinations

# to download data of the cryptocurrency
import yfinance as yf

# to preprocess the data, make, train and evaluate the model
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
import sklearn.metrics as metrics

## Download Data

In [2]:
# download data of the cryptocurrency

df_xmr = yf.download(tickers="XMR-USD", period="max", interval="1d", start="2023-01-01", end="2023-10-09")
df_xmr

[*********************100%%**********************]  1 of 1 completed


Unnamed: 0_level_0,Open,High,Low,Close,Adj Close,Volume
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2023-01-01,147.309662,148.931030,146.437485,148.576935,148.576935,36453347
2023-01-02,148.582184,149.623535,147.943558,147.943558,147.943558,47050925
2023-01-03,147.933929,149.027832,147.628860,148.487930,148.487930,48662135
2023-01-04,148.466995,152.488983,148.342621,150.743652,150.743652,83915181
2023-01-05,150.790253,155.921738,150.769043,155.921738,155.921738,78049428
...,...,...,...,...,...,...
2023-10-04,147.168442,150.702347,145.940781,150.469055,150.469055,59400400
2023-10-05,150.474197,151.328369,148.565491,149.623718,149.623718,55704972
2023-10-06,149.623337,152.669296,148.641647,151.992264,151.992264,49535004
2023-10-07,151.988235,155.247528,151.100983,155.212143,155.212143,61159796


The downloaded dataset has date indices, so this problem is actually a time series problem. But in this notebook, we intend to solve this problem without considering the nature of its time series and analyze and check the final result obtained.

In [3]:
df_xmr.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 281 entries, 2023-01-01 to 2023-10-08
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Open       281 non-null    float64
 1   High       281 non-null    float64
 2   Low        281 non-null    float64
 3   Close      281 non-null    float64
 4   Adj Close  281 non-null    float64
 5   Volume     281 non-null    int64  
dtypes: float64(5), int64(1)
memory usage: 15.4 KB


## Data Preprocessing

In [4]:
actual_class = (
    (df_xmr.loc["2023-01-02":, "Close"].to_numpy() > df_xmr.loc["2023-01-01":"2023-10-07", "Close"]).astype(int)
)
df_xmr = df_xmr.drop("2023-10-08")
df_xmr["Price increase (in the next day)"] = actual_class

### Feature Selection

We use exhaustive feature selection method. In this method, the performance of all subsets of the set of columns are evaluated and the subset that gives us the best score is selected. In fact, this method, like brute force, checks all possible modes, and returns the mode with the best performance. Here we use the cross validation score to evaluate our model and finally, using the selected features, we get the ``f1_score`` of our model using test set.

In [5]:
# Define feature columns
feature_cols = ['Open', 'High', 'Low', 'Adj Close', 'Volume']

def powerset(iterable):
    """a function for return all subsets of an iterable
    (ref: https://stackoverflow.com/questions/1482308/how-to-get-all-subsets-of-a-set-powerset)

    Args:
        iterable: collection of some data

    Returns:
        all subsets of the iterable
    """
    return chain.from_iterable(
        combinations(iterable, r) for r in range(len(iterable) + 1)
    )


# feature selection using exhaustively feature selection
# we use cross validation method to select the best features
max_cv_score = 0  # the maximum cross validation score
selected_features = None  # the best features
for features in list(powerset(feature_cols))[
    1:
]:
    # select the required feature(s) and label
    X = df_xmr.loc[:"2023-09-07", list(features)]  # feature(s)
    y = df_xmr.loc[:"2023-09-07", "Price increase (in the next day)"]  # target

    # scaling the values of the features with StandardScaler
    X = StandardScaler().fit_transform(X)

    # create random forest model
    model = RandomForestClassifier(n_estimators=100, random_state=42)

    # calculate score of the model for train set using cross validation method
    # ref: https://scikit-learn.org/stable/modules/cross_validation.html#computing-cross-validated-metrics
    cv_scores = cross_val_score(model, X, y, cv=5)
    # store the best result
    if cv_scores.mean() > max_cv_score:
        # update max_cv_score variable
        max_cv_score = cv_scores.mean()
        # update selected features variable
        selected_features = list(features)

# print the best features for random forest and maximum cv_score
print("selected feature(s):", selected_features)
print("cross validation score of the model:", max_cv_score)

selected feature(s): ['Volume']
cross validation score of the model: 0.56


## Create and Train Model with Best Features and Get Results

In the following code blocks, we create our model and train it with the data of the best features and classify each day by predicted value of our trained model. Finally, we calculate the accuracy, precision, f1-score, recall and auc score, and also plot the confusion matrix in the last part.

In [6]:
# Split features and target variable for training, validation, and test sets
X_train, y_train = df_xmr.loc[:"2023-09-07", list(features)], df_xmr.loc[:"2023-09-07", 'Price increase (in the next day)']
X_test, y_test = df_xmr.loc["2023-09-08":, list(features)], df_xmr.loc["2023-09-08":, 'Price increase (in the next day)']

# Train the random forest model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

In [7]:
# Make predictions on the test set
test_predictions = model.predict(X_test)

# Evaluate model performance on the test set
test_accuracy = metrics.accuracy_score(y_test, test_predictions)
test_precision = metrics.precision_score(y_test, test_predictions)
test_recall = metrics.recall_score(y_test, test_predictions)
test_f1 = metrics.f1_score(y_test, test_predictions)
test_auc = metrics.roc_auc_score(y_test, test_predictions)
test_confusion = metrics.confusion_matrix(y_test, test_predictions)

# Print evaluation metrics and confusion matrix for the test set
print('\nTest Set Performance:')
print(f'Test Accuracy: {test_accuracy}')
print(f'Test Precision: {test_precision}')
print(f'Test Recall: {test_recall}')
print(f'Test F1 Score: {test_f1}')
print(f'Test AUC: {test_auc}')
print(f'Test Confusion Matrix: {test_confusion}')

# cm_display = metrics.ConfusionMatrixDisplay(
#     confusion_matrix=metrics.confusion_matrix, display_labels=[False, True]
# )
# cm_display.plot()


Test Set Performance:
Test Accuracy: 0.5333333333333333
Test Precision: 0.5909090909090909
Test Recall: 0.7222222222222222
Test F1 Score: 0.65
Test AUC: 0.4861111111111111
Test Confusion Matrix: [[ 3  9]
 [ 5 13]]


In [8]:
test_predictions

array([1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 1, 1, 1, 0, 1, 0, 0])