# Loading Feature-Engineered Data for Model Training

Before starting model training, we first load the feature-engineered dataset prepared in the previous step.  
This dataset contains all the processed features necessary for training the predictive models.


In [4]:
import pandas as pd

# Define the path to the feature-engineered dataset
file_path = r"C:\Users\ACER\OneDrive\Documents\my codess\Data-Analytics-Assignment\Crypto-Liquidity-Prediction-ML-Project\data\processed\crypto_data_feature_engineered.csv"

# Loading the dataset
df = pd.read_csv(file_path)

# Displaying the first few rows to verify successful loading
print(df.head())

             coin symbol         price     1h    24h     7d    24h_volume  \
0         Bitcoin    BTC  40859.460000  0.022  0.030  0.055  3.539076e+10   
1   Origin Dollar   OUSD      0.993428  0.001 -0.002  0.001  8.863360e+05   
2  Iron Bank EURO  IBEUR      1.080000  0.000 -0.004  0.009  9.525810e+04   
3       Prometeus   PROM      7.960000  0.017  0.008  0.015  1.069360e+06   
4    MaidSafeCoin   MAID      0.294920  0.023  0.010  0.045  3.041720e+03   

        mkt_cap        date   price_MA_2d  market_cap_MA_2d  volatility_score  \
0  7.709915e+11  2022-03-16           NaN               NaN             0.008   
1  1.503384e+08  2022-03-16  20430.226714      3.855709e+11             0.003   
2  1.300442e+08  2022-03-16      1.036714      1.401913e+08             0.004   
3  1.302007e+08  2022-03-16      4.520000      1.301224e+08             0.009   
4  1.327759e+08  2022-03-16      4.127460      1.314883e+08             0.013   

   liquidity_ratio  
0         0.045903  
1       

# 03. Model Training and Evaluation

### 3.1 Model Selection

We are predicting a continuous variable (liquidity ratio), so we use **Regression models**.
We start with **Linear Regression** as a simple baseline before trying more complex models.

### 3.2 Preparing Data for Model Training

In this step, we prepare the dataset for building and evaluating a machine learning model. The features (independent variables) and the target (dependent variable) are separated. We then split the dataset into training and testing subsets — the training data is used to train the model, and the testing data is used to evaluate how well the model performs on unseen data.
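Estimators such as `LinearRegression` raise an error on `NaN` inputs, so a quick missing-value check on the selected columns is a reasonable guard before splitting. The sketch below rebuilds a few values from the `df.head()` output above into a small frame and counts missing values per column. The names `preview` and `feature_cols` are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Values copied from the df.head() output shown earlier (five rows).
preview = pd.DataFrame({
    'price': [40859.46, 0.993428, 1.08, 7.96, 0.29492],
    '24h_volume': [3.539076e10, 8.863360e05, 9.525810e04, 1.069360e06, 3.041720e03],
    'mkt_cap': [7.709915e11, 1.503384e08, 1.300442e08, 1.302007e08, 1.327759e08],
    'price_MA_2d': [np.nan, 20430.226714, 1.036714, 4.52, 4.12746],
    'volatility_score': [0.008, 0.003, 0.004, 0.009, 0.013],
})

# Columns that are later used as model inputs.
feature_cols = ['price', '24h_volume', 'mkt_cap', 'volatility_score']

# Count missing values in every column.
missing_per_column = preview.isna().sum()

# True only if none of the model inputs contains a missing value.
features_are_complete = not preview[feature_cols].isna().any().any()

print(missing_per_column)
print("Features complete:", features_are_complete)
```

In these five rows only `price_MA_2d` has a gap, and that column is not used as a feature. Running the same check on the full `df` would show whether any rows need to be dropped or imputed first.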

In [10]:
# Separate features and target for the multi-coin crypto dataset
from sklearn.model_selection import train_test_split

# Selecting correct features based on your dataset
X = df[['price', '24h_volume', 'mkt_cap', 'volatility_score']]
y = df['liquidity_ratio']

# Splitting data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print("Shape of training data:", X_train.shape)
print("Shape of testing data:", X_test.shape)


Shape of training data: (793, 4)
Shape of testing data: (199, 4)


### 3.3 Model Training

In this step, we train a machine learning model using the training dataset `(X_train, y_train)` that we prepared earlier. The model will learn the underlying patterns and relationships between the input features (like price, volume, market cap, volatility) and the target variable `(liquidity_ratio)`.
For this example, we'll use a Linear Regression model — a basic and widely used algorithm for regression tasks that assumes a linear relationship between inputs and output.


In [13]:
# Importing the Linear Regression model from scikit-learn
from sklearn.linear_model import LinearRegression

# Initializing the Linear Regression model
model = LinearRegression()

# Training (fit) the model using the training data
model.fit(X_train, y_train)

# Printing the model coefficients, which is usually not required
# but can be useful for understanding the model
print("Model Coefficients:", model.coef_)
print("Model Intercept:", model.intercept_)
print("----------------------------------------------------")
print("Model is trained successfully.")


Model Coefficients: [-8.43524745e-07  1.67091127e-11 -6.43485972e-13  2.17238199e+00]
Model Intercept: 0.03300684393692446
----------------------------------------------------
Model is trained successfully.
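The raw coefficients above span many orders of magnitude, which is expected when features sit on very different scales (prices near 1 alongside market caps near 1e11), so their sizes alone say little about feature importance. A minimal sketch, on synthetic data and with illustrative names, of how standardizing inside a scikit-learn `Pipeline` makes coefficients comparable:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for two features on very different scales (assumption).
rng = np.random.default_rng(0)
n_samples = 200
small_scale = rng.lognormal(mean=0.0, sigma=1.0, size=n_samples)
large_scale = rng.lognormal(mean=10.0, sigma=1.0, size=n_samples)
X_demo = np.column_stack([small_scale, large_scale])

# Target built so that, per standard deviation, the first feature matters 5x more.
y_demo = (0.5 * small_scale / small_scale.std()
          + 0.1 * large_scale / large_scale.std()
          + rng.normal(0.0, 0.01, n_samples))

# Fit once on raw features and once with standardization in a pipeline.
raw_model = LinearRegression().fit(X_demo, y_demo)
scaled_model = make_pipeline(StandardScaler(), LinearRegression()).fit(X_demo, y_demo)

raw_coefs = raw_model.coef_
scaled_coefs = scaled_model.named_steps['linearregression'].coef_

print("Raw coefficients:         ", raw_coefs)
print("Standardized coefficients:", scaled_coefs)
```

The standardized coefficients equal the raw ones multiplied by each feature's standard deviation, so predictions are unchanged; only the interpretability of the coefficients improves.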


### 3.4 Model Evaluation (Initial)

In this step, we evaluate the initial performance of our trained model using the test dataset. This evaluation provides insight into how well the model is able to predict unseen data. 

We calculate key regression metrics such as:

- **Root Mean Squared Error (RMSE):** Measures the average size of the errors and penalizes larger errors more heavily.
- **Mean Absolute Error (MAE):** Provides the average magnitude of prediction errors.
- **R² Score:** Indicates the proportion of variance in the target variable explained by the model.

A higher R² score and lower error values indicate better model performance. The results from this initial evaluation help us understand if the model is making accurate predictions or if further improvements — like hyperparameter tuning or feature selection — are necessary to enhance its predictive ability.
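To make the three definitions concrete, the sketch below computes them by hand with NumPy on four made-up values (the arrays are illustrative, not taken from the dataset). The formulas are the standard ones that `mean_squared_error`, `mean_absolute_error`, and `r2_score` implement.

```python
import numpy as np

# Made-up actual and predicted values, for illustration only.
y_true = np.array([0.10, 0.20, 0.30, 0.40])
y_pred = np.array([0.12, 0.18, 0.33, 0.35])

errors = y_true - y_pred

# RMSE: square the errors, average them, then take the square root.
rmse_manual = np.sqrt(np.mean(errors ** 2))

# MAE: average of the absolute errors.
mae_manual = np.mean(np.abs(errors))

# R²: 1 minus (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"RMSE: {rmse_manual:.4f}")
print(f"MAE:  {mae_manual:.4f}")
print(f"R²:   {r2_manual:.4f}")
```

A model that always predicts the mean of `y_true` would have `ss_res == ss_tot` and therefore an R² of 0, which is a useful reference point when reading scores close to zero.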

In [12]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np

# Generate predictions from the test dataset
predictions = model.predict(X_test)

# Compute evaluation metrics
rmse_val = np.sqrt(mean_squared_error(y_test, predictions))
mae_val = mean_absolute_error(y_test, predictions)
r2_val = r2_score(y_test, predictions)

print(f"RMSE: {rmse_val}")
print(f"MAE: {mae_val}")
print(f"R-squared: {r2_val}")

RMSE: 0.44150904133427954
MAE: 0.12048422855312403
R-squared: 0.049835303442470336


**From the output we see:**

- The **RMSE** of approximately **0.44** is the square root of the mean squared error. Because errors are squared before averaging, this figure is dominated by the largest misses.  
- The **MAE** of around **0.12** is the average absolute error. That it sits well below the RMSE suggests most predictions are closer than 0.44, with a small number of large errors inflating the RMSE.  
- The **R-squared** value of about **0.05** means the model explains only about 5% of the variance in the target variable, indicating weak predictive power.  
- Overall, the model is **not performing well** and does not capture the relationship between the features and the target.  
- **Further improvements**, such as additional feature engineering or a different, tunable algorithm, are needed to boost performance.
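The gap between RMSE (0.44) and MAE (0.12) is the kind of pattern that a few large errors can produce. The sketch below uses two made-up error vectors with almost the same MAE to show how a single large miss inflates RMSE; the numbers are illustrative and not derived from the model's residuals.

```python
import numpy as np

def rmse(errors):
    """Root of the mean of squared errors."""
    return float(np.sqrt(np.mean(np.square(errors))))

def mae(errors):
    """Mean of absolute errors."""
    return float(np.mean(np.abs(errors)))

# Case 1: every prediction is off by the same small amount.
errors_even = np.full(100, 0.05)

# Case 2: 99 very small errors plus one very large miss.
errors_tail = np.concatenate([np.full(99, 0.01), [4.0]])

for name, errs in [("even", errors_even), ("heavy tail", errors_tail)]:
    print(f"{name:>10}: MAE={mae(errs):.4f}  RMSE={rmse(errs):.4f}")
```

When the two metrics differ by a large factor, looking at the individual largest residuals is usually more informative than either summary number.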

### 3.5 Hyperparameter Tuning with GridSearchCV

In this step, we aim to improve on the baseline by switching to a Random Forest regressor and tuning its hyperparameters (plain Linear Regression has essentially no hyperparameters to tune). Hyperparameters are settings that control the learning process, such as the number of trees in a Random Forest or the maximum depth of each tree. Instead of trying combinations by hand, we use **GridSearchCV**, which performs an exhaustive search over a specified parameter grid and scores each candidate with cross-validation. This identifies the combination with the highest mean cross-validated score on the chosen metric (here, R²). The expectation is that the selected model generalizes better to unseen data.

In [15]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

# Initializing the model
model = RandomForestRegressor(random_state=42)

param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [5, 10, 15]
}

grid_search = GridSearchCV(
    estimator=model,
    param_grid=param_grid,
    cv=3,
    n_jobs=-1,
    scoring='r2',
    verbose=1,
    refit=True
)
grid_search.fit(X_train, y_train)

# Printing the best parameters and the best score
print("Best Parameters Found:", grid_search.best_params_)
print(f"Best cross-validation R² score: {grid_search.best_score_:.4f}")


Fitting 3 folds for each of 9 candidates, totalling 27 fits
Best Parameters Found: {'max_depth': 10, 'n_estimators': 150}
Best cross-validation R² score: 0.7591


**From the output we see:**

- GridSearchCV evaluated 9 hyperparameter combinations using 3-fold cross-validation, for a total of 27 model fits.
- The best parameters found are:
  - `max_depth`: 10
  - `n_estimators`: 150
- The best cross-validation R² score achieved with these parameters is approximately 0.7591.
- This indicates that the tuned Random Forest model explains about 75.9% of the variance in the target variable, averaged over the held-out validation folds.
- No Random Forest with default settings was evaluated, so this score can only be compared with the Linear Regression baseline; the test set evaluation in section 3.6 checks whether the gain carries over to unseen data.
- Further tuning or experimenting with other hyperparameters could potentially enhance the performance even more.
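The fit count reported in the log can be reproduced by enumerating the grid with scikit-learn's `ParameterGrid`. This is a small sketch using the same `param_grid` as the cell above; `candidates`, `cv_folds`, and `total_fits` are illustrative names.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the GridSearchCV cell.
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [5, 10, 15]
}

# Every combination of the listed values is one candidate.
candidates = list(ParameterGrid(param_grid))

# Each candidate is fitted once per cross-validation fold.
cv_folds = 3
total_fits = len(candidates) * cv_folds

print("Candidates:", len(candidates))
print("Total fits:", total_fits)
```

Adding a third hyperparameter with three values would triple both numbers, which is why exhaustive grids become expensive quickly.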


### 3.6 Retrain Model with Best Parameters and Re-evaluate

Once GridSearchCV has identified the best hyperparameters, the next step is to retrain the model with them on the entire training set. Because the search was run with `refit=True`, `grid_search.best_estimator_` already holds a model fitted this way; the cell below rebuilds it explicitly so that the chosen settings are visible. The retrained model is then evaluated on the test set with the same metrics as before (RMSE, MAE, and R²), which shows whether the gains seen in cross-validation carry over to unseen data.
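With `refit=True`, scikit-learn's `GridSearchCV` refits the best candidate on all the data passed to `fit` and exposes it as `best_estimator_`. The sketch below checks, on synthetic data from `make_regression` and with a deliberately small grid (both are assumptions for illustration), that this estimator matches a manual retrain with the same parameters and seed.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Small synthetic regression problem standing in for the real training data.
X_demo, y_demo = make_regression(n_samples=120, n_features=4, noise=0.1, random_state=0)

search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_grid={'n_estimators': [10, 20], 'max_depth': [3, 5]},
    cv=3,
    scoring='r2',
    refit=True,  # refit the best candidate on all of X_demo, y_demo
)
search.fit(X_demo, y_demo)

# Manual retrain with the winning parameters and the same random seed.
manual_model = RandomForestRegressor(random_state=42, **search.best_params_)
manual_model.fit(X_demo, y_demo)

# Both models should give the same predictions.
same_predictions = np.allclose(
    search.best_estimator_.predict(X_demo),
    manual_model.predict(X_demo),
)
print("best_estimator_ matches manual retrain:", same_predictions)
```

Either route is fine; the explicit retrain is more verbose but makes the final configuration easy to read.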

In [16]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np

# Retraining the model with the best parameters
best_params = grid_search.best_params_
best_model = RandomForestRegressor(
    n_estimators=best_params['n_estimators'], 
    max_depth=best_params['max_depth'], 
    random_state=42
)

# Fitting the model on the training data
best_model.fit(X_train, y_train)

# Predicting on the test set
y_pred_best = best_model.predict(X_test)

# Evaluating model performance
rmse_best = np.sqrt(mean_squared_error(y_test, y_pred_best))
mae_best = mean_absolute_error(y_test, y_pred_best)
r2_best = r2_score(y_test, y_pred_best)

print(f"Updated Root Mean Squared Error (RMSE): {rmse_best:.4f}")
print(f"Updated Mean Absolute Error (MAE): {mae_best:.4f}")
print(f"Updated R-squared (R²) Score: {r2_best:.4f}")


Updated Root Mean Squared Error (RMSE): 0.3467
Updated Mean Absolute Error (MAE): 0.0504
Updated R-squared (R²) Score: 0.4140


**Model Evaluation After Hyperparameter Tuning**

- The updated RMSE of approximately **0.35** is lower than the **0.44** of the Linear Regression baseline, so the largest errors have shrunk somewhat.
- The MAE of about **0.05** is less than half the baseline's **0.12**, meaning the typical prediction is considerably closer to the actual value.
- The R-squared score of **0.41** means the model now explains about 41% of the variance in the target variable, up from about 5%.
- These gains reflect both the switch to a Random Forest and the tuning, since an untuned Random Forest was not evaluated separately. The test R² (0.41) is also well below the cross-validation R² (0.76), so performance varies noticeably between data splits and there is still room for improvement.


### 3.7 Final Prediction Check (Compare Actual vs Predicted)

In this step, we compare the actual target values with the values predicted by the tuned model on the test dataset. Placing them side by side lets us inspect how close individual predictions are to the real data points. From these results we can judge whether the model is consistently accurate across the range of values or whether it struggles in specific areas, which can guide further improvements or validation.

In [22]:
# Predicting on test data
predicted_liquidity = best_model.predict(X_test)

# Build a comparison DataFrame with a fresh 0..n-1 index
comparison_df = pd.DataFrame({
    'Actual Liquidity': y_test.values,
    'Predicted Liquidity': predicted_liquidity
}).reset_index(drop=True)

# Add error column
comparison_df['Error'] = comparison_df['Actual Liquidity'] - comparison_df['Predicted Liquidity']

print(comparison_df.head(10))

   Actual Liquidity  Predicted Liquidity     Error
0          0.051516             0.053734 -0.002219
1          0.080784             0.068837  0.011947
2          0.064324             0.067555 -0.003231
3          0.153632             0.156490 -0.002859
4          0.010830             0.011185 -0.000356
5          0.123382             0.131801 -0.008419
6          0.219153             0.189631  0.029522
7          0.007353             0.007230  0.000122
8          0.003577             0.004392 -0.000816
9          0.141715             0.142974 -0.001258


**Conclusion: Actual vs Predicted Liquidity Comparison**

The first ten test rows show predictions close to the true values: the mean absolute error over these rows is about 0.006, and the largest miss (row 6) is about 0.03. These rows are not representative of the whole test set, however. The test MAE is 0.0504 and the RMSE is 0.3467, so a small number of samples elsewhere must carry much larger errors. The model therefore captures the pattern well for typical liquidity values but struggles on some points, and inspecting the largest errors across the full test set is a sensible next check before relying on the model in practice.
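A summary over all rows is more informative than the first few. The sketch below shows one way to do this with pandas, using the ten rows printed above as stand-in data (`demo`, `abs_err`, `summary`, and `worst` are illustrative names). On the real `comparison_df` the same calls would cover the whole test set.

```python
import pandas as pd

# The ten rows printed above, used here as stand-in data.
demo = pd.DataFrame({
    'Actual Liquidity': [0.051516, 0.080784, 0.064324, 0.153632, 0.010830,
                         0.123382, 0.219153, 0.007353, 0.003577, 0.141715],
    'Predicted Liquidity': [0.053734, 0.068837, 0.067555, 0.156490, 0.011185,
                            0.131801, 0.189631, 0.007230, 0.004392, 0.142974],
})
demo['Error'] = demo['Actual Liquidity'] - demo['Predicted Liquidity']

# Distribution of absolute errors, including upper percentiles.
abs_err = demo['Error'].abs()
summary = abs_err.describe(percentiles=[0.5, 0.9, 0.99])

# The three rows with the largest absolute error, worst first.
worst = demo.loc[abs_err.nlargest(3).index]

print(summary)
print(worst)
```

Sorting by absolute error makes it easy to see whether the worst predictions share a characteristic, such as very high or very low actual liquidity.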

### 3.8 Saving the Trained Model

Once the model is trained and tuned to achieve satisfactory performance, the next important step is to save the model to disk. Saving the model allows you to reuse it later without retraining, which is essential for deployment or further analysis. The saved model can be loaded at any time to make predictions on new data.

Common libraries for saving models in Python are `joblib` and `pickle`. `joblib` is often preferred for scikit-learn models because it handles objects that contain large NumPy arrays efficiently.

In [None]:
import joblib
import os

model = best_model

# path to save the model
model_save_path = r"C:\Users\ACER\OneDrive\Documents\my codess\Data-Analytics-Assignment\Crypto-Liquidity-Prediction-ML-Project\models\crypto_liquidity_rf_model.pkl"

# Create the target directory if it doesn't exist
os.makedirs(os.path.dirname(model_save_path), exist_ok=True)

# Save the model
joblib.dump(model, model_save_path)

print(f"Model saved successfully at: \n{model_save_path}")


Model saved successfully at: 
C:\Users\ACER\OneDrive\Documents\my codess\Data-Analytics-Assignment\Crypto-Liquidity-Prediction-ML-Project\models\crypto_liquidity_rf_model.pkl


### 3.9 Loading and Testing the Saved Model

After saving the trained machine learning model, it is important to verify that the model can be successfully loaded and used to make predictions. This step ensures that the saved model file is intact and the model's predictive capabilities remain consistent. Loading the model from disk allows you to reuse the model in different environments without retraining, facilitating deployment and future inference tasks.

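In this step, we load a saved model file and use it to predict on the test dataset, then compute RMSE and R² to confirm that the results match those of the in-memory model. Note that the cell below first writes `best_model` again, this time under `outputs\models`, and loads that copy, so it checks the save and load round trip rather than the file written to `models` in section 3.8.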

In [None]:
import joblib
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
import os

# Define model save path
model_dir = r"C:\Users\ACER\OneDrive\Documents\my codess\Data-Analytics-Assignment\Crypto-Liquidity-Prediction-ML-Project\outputs\models"
model_path = os.path.join(model_dir, "crypto_liquidity_rf_model.pkl")

# Ensure the directory exists
os.makedirs(model_dir, exist_ok=True)

joblib.dump(best_model, model_path)

print(f"Model saved at: {model_path}")

# Load the model from the specified path
loaded_model = joblib.load(model_path)

# Using the loaded model to predict on test data
y_pred_loaded = loaded_model.predict(X_test)

# Calculating evaluation metrics to verify performance
rmse_loaded = np.sqrt(mean_squared_error(y_test, y_pred_loaded))
r2_loaded = r2_score(y_test, y_pred_loaded)

# Printing the evaluation results
print(f"Loaded Model RMSE: {rmse_loaded:.4f}")
print(f"Loaded Model R² Score: {r2_loaded:.4f}")

Model saved at: C:\Users\ACER\OneDrive\Documents\my codess\Data-Analytics-Assignment\Crypto-Liquidity-Prediction-ML-Project\outputs\models\crypto_liquidity_rf_model.pkl
Loaded Model RMSE: 0.3467
Loaded Model R² Score: 0.4140
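Matching aggregate metrics is good evidence, but a stricter check is that the loaded model returns exactly the same predictions as the original. A minimal sketch of that round trip, using a temporary directory and a small stand-in model trained on synthetic data (both are assumptions, so no project paths are touched):

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Stand-in model trained on synthetic data.
X_demo, y_demo = make_regression(n_samples=100, n_features=4, random_state=0)
demo_model = RandomForestRegressor(n_estimators=20, max_depth=5, random_state=42)
demo_model.fit(X_demo, y_demo)

# Save to a temporary directory and load the file back.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_rf_model.pkl")
    joblib.dump(demo_model, demo_path)
    restored_model = joblib.load(demo_path)

# The restored model should reproduce the original predictions exactly.
identical = np.array_equal(demo_model.predict(X_demo), restored_model.predict(X_demo))
print("Predictions identical after reload:", identical)
```

In the notebook, the equivalent check would compare `best_model.predict(X_test)` with `loaded_model.predict(X_test)`.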


# Summary of `03_Model_Training.ipynb`

This notebook covers the complete lifecycle of building a predictive model for Crypto Liquidity:

## 1. Data Preparation
- Loaded the feature-engineered dataset and selected the features and target.
- Split the data into training and testing sets.

## 2. Initial Model Training
- Trained a baseline Linear Regression model.
- Evaluated initial performance using RMSE, MAE, and R² metrics.

## 3. Hyperparameter Tuning
- Applied GridSearchCV to identify the best hyperparameters (`n_estimators` and `max_depth`).
- Achieved improved model performance with optimized parameters.

## 4. Retraining and Evaluation
- Retrained the model using the best hyperparameters.
- Re-evaluated the model on the test set: R² rose from about 0.05 (Linear Regression baseline) to 0.41, and RMSE fell from 0.44 to 0.35.

## 5. Prediction Comparison
- Compared actual vs predicted liquidity values on the test dataset.
- Analyzed prediction errors to assess model accuracy and reliability.

## 6. Model Saving
- Saved the trained model as a pickle file in a dedicated `models` directory.
- This facilitates easy loading for future inference or deployment.

---

# Next Steps: Deployment `app.py`
In the deployment step, we will:

- Load the saved model from the `models` directory.
- Prepare the pipeline to accept new input data and generate predictions.
- Optionally, create a Streamlit interface for practical use of the model.
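One way the deployment step could accept new input is sketched below. `FEATURE_COLUMNS` matches the columns used for training in section 3.2; the `predict_liquidity` helper, the stand-in model, and the example record are hypothetical and not part of this project's `app.py`. In practice the model would come from `joblib.load` on the saved file.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Column order must match the order used when the model was trained.
FEATURE_COLUMNS = ['price', '24h_volume', 'mkt_cap', 'volatility_score']

def predict_liquidity(model, record):
    """Predict for one input record given as a dict of feature values (hypothetical helper)."""
    missing = [col for col in FEATURE_COLUMNS if col not in record]
    if missing:
        raise ValueError(f"Missing input fields: {missing}")
    # Build a one-row frame with the training column names and order.
    frame = pd.DataFrame([[record[col] for col in FEATURE_COLUMNS]],
                         columns=FEATURE_COLUMNS)
    return float(model.predict(frame)[0])

# Stand-in model on synthetic data, used only to make the sketch runnable.
rng = np.random.default_rng(0)
train_demo = pd.DataFrame(rng.random((50, 4)), columns=FEATURE_COLUMNS)
target_demo = train_demo['24h_volume'] / (train_demo['mkt_cap'] + 1.0)
stand_in_model = RandomForestRegressor(n_estimators=10, random_state=42)
stand_in_model.fit(train_demo, target_demo)

example_record = {'price': 0.5, '24h_volume': 0.2, 'mkt_cap': 0.8, 'volatility_score': 0.01}
prediction = predict_liquidity(stand_in_model, example_record)
print(f"Predicted value: {prediction:.4f}")
```

Validating the input fields before calling `predict` gives a clear error message when a field is missing, instead of a failure inside the model.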

---