## crypto trading model template v1
- 1: Data Loading
- 2: Feature Engineering
- 3: Data Cleaning
- 4: Class Balancing
- 5: Feature Scaling
- 6: Model Training
- 7: Model Evaluation
- 8: Cross-Validation
#### Future Uses
- Hyperparameter Tuning
    - Add improved searches to optimize model here
- Model Comparisons
    - Framework is modular for seamless model swapping
- Extended Feature Engineering
    - Feature Engineering is endless
    - As the model learns, I gain new insights and expand my field knowledge
      - This process is progressive, and eventually we'll have a whole system built using this template

### 1: Data Loading
- (download_crypto_csv, load_crypto_data): Easily swap out the data source by 
changing the url and output_path arguments to pull different datasets.

In [443]:
# Import libraries
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from io import StringIO
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

#### Automated data pull
- Download cryptocurrency data from a default URL and save as a CSV file
- Args:
    - output_path (str): Path to save the downloaded CSV
    - url (str): The URL to fetch the CSV data from (default is Binance BTC/USDT daily data)  
- Returns: str: Path to the saved CSV file

In [328]:
# Automated CSV Pull Function
def download_crypto_csv(output_path, url="https://www.cryptodatadownload.com/cdd/Binance_BTCUSDT_d.csv"):
    try:
        print(f"Downloading data from {url}")
        response = requests.get(url, timeout=30)  # fail instead of hanging indefinitely
        response.raise_for_status()
        
        # Save the CSV content
        with open(output_path, 'w') as f:
            f.write(response.text)
        
        print(f"Data downloaded and saved to {output_path}")
        return output_path
    except Exception as e:
        print(f"Error downloading data: {e}")
        return None

#### Data Load
- Load cryptocurrency data from CSV file
- Args:
    - filepath (str): path to CSV file
- Returns:
    - pd.DataFrame: data loaded into DataFrame

In [331]:
# Load Crypto Data from CSV
def load_crypto_data(filepath):
    try:
        # Skip the first row and load the data
        df = pd.read_csv(filepath, skiprows=1)
        print(f"Data loaded successfully: {len(df)} rows")
        print("Column names:", df.columns)
        return df
    except Exception as e:
        print(f"Error loading data: {e}")
        return None
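
The sample rows printed later in this notebook arrive newest-first (2024-12-09 before 2024-12-08). Rolling windows and `shift()` follow row order, so the indicators and a `shift(-1)` target only look forward in time when rows are oldest-first. A minimal sketch of a reordering step, assuming the `Date` column parses as a date; `sort_chronologically` is a hypothetical helper, not part of the original template:

```python
import pandas as pd

def sort_chronologically(df, date_col="Date"):
    """Return a copy of df sorted oldest-first with a fresh 0..n-1 index."""
    out = df.copy()
    out[date_col] = pd.to_datetime(out[date_col])
    return out.sort_values(date_col).reset_index(drop=True)

# Tiny illustrative frame using the three sample rows shown in this notebook
sample = pd.DataFrame({
    "Date": ["2024-12-09", "2024-12-08", "2024-12-07"],
    "Close": [97276.47, 101109.59, 99831.99],
})
ordered = sort_chronologically(sample)
print(ordered)
```

If used, this would probably be called right after `load_crypto_data` and before any feature is generated.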

### 2: Feature Engineering
- (rsi, moving_average, macd, bollinger_bands, average_true_range): Add, remove, or tweak features.
- Append more calculations or move them around.

In [334]:
# Calculate the Relative Strength Index (RSI).
def rsi(data, window=14):
    delta = data['Close'].diff()
    gain = (delta.where(delta > 0, 0)).rolling(window=window).mean()
    loss = (-delta.where(delta < 0, 0)).rolling(window=window).mean()
    rs = gain / loss
    return 100 - (100 / (1 + rs))
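
The `rsi` function above averages gains and losses with a simple rolling mean. Many charting tools use Wilder's exponential smoothing instead, so values can differ from what a charting platform shows. A sketch of that variant, assuming a plain `Close` price Series as input; `rsi_wilder` is a hypothetical name and is not used elsewhere in this template:

```python
import pandas as pd

def rsi_wilder(close, window=14):
    """RSI with Wilder's smoothing, i.e. an EWM with alpha = 1 / window."""
    delta = close.diff()
    gain = delta.clip(lower=0)      # positive moves, zero otherwise
    loss = (-delta).clip(lower=0)   # size of negative moves, zero otherwise
    avg_gain = gain.ewm(alpha=1 / window, min_periods=window, adjust=False).mean()
    avg_loss = loss.ewm(alpha=1 / window, min_periods=window, adjust=False).mean()
    rs = avg_gain / avg_loss
    return 100 - (100 / (1 + rs))

# Toy check: a strictly rising series has no losses, a falling one has no gains
rising = pd.Series([float(p) for p in range(1, 31)])
falling = rising[::-1].reset_index(drop=True)
rising_rsi = rsi_wilder(rising)
falling_rsi = rsi_wilder(falling)
print(rising_rsi.iloc[-1], falling_rsi.iloc[-1])
```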

In [336]:
# Feature: 50-period Moving Average (SMA)
def moving_average(data, window=50):
    return data['Close'].rolling(window=window).mean()

In [338]:
# Feature: MACD (Moving Average Convergence Divergence)
def macd(data, short_window=12, long_window=26, signal_window=9):
    short_ema = data['Close'].ewm(span=short_window, min_periods=1).mean()
    long_ema = data['Close'].ewm(span=long_window, min_periods=1).mean()
    macd_line = short_ema - long_ema
    signal_line = macd_line.ewm(span=signal_window, min_periods=1).mean()
    return macd_line, signal_line

In [340]:
# Calculate Bollinger Bands.
def bollinger_bands(data, window=20):
    sma = data['Close'].rolling(window=window).mean()
    std = data['Close'].rolling(window=window).std()
    upper_band = sma + (2 * std)
    lower_band = sma - (2 * std)
    return upper_band, lower_band

In [342]:
# Calculate Average True Range (ATR).
def average_true_range(data, window=14):
    high_low = data['High'] - data['Low']
    high_close = abs(data['High'] - data['Close'].shift())
    low_close = abs(data['Low'] - data['Close'].shift())
    tr = pd.concat([high_low, high_close, low_close], axis=1).max(axis=1)  # row-wise maximum of the three ranges
    return tr.rolling(window=window).mean()

### 3: Data Cleaning
- (dropna): Handles missing data or NaNs so the model doesn't encounter issues when scaling or fitting.

In [357]:
# Clean the dataset by handling missing values and scaling features.
def clean_data(df):
    # Drop rows with NaN values for the required features
    df.dropna(subset=['RSI', 'MACD', 'MACD_Signal', '50_MA', 'Upper_Band', 'Lower_Band', 'ATR'], inplace=True)

    # Scale numerical features (excluding Upper_Band and Lower_Band)
    scaler = StandardScaler()
    features_to_scale = ['RSI', '50_MA', 'MACD', 'MACD_Signal', 'ATR']
    df[features_to_scale] = scaler.fit_transform(df[features_to_scale])

    print(f"Data cleaned and scaled: {len(df)} rows remaining")
    return df

In [359]:
# Load, process, and clean crypto data.
# Args: filepath (str): Path to the CSV file.
# Returns: pd.DataFrame: Processed and cleaned data.
def prepare_data(filepath):
    # Load data
    data = load_crypto_data(filepath)
    print("Data after loading:", data.head() if data is not None else "None")
    
    if data is not None:
        # Ensure 'Close', 'High', and 'Low' are numeric
        data['Close'] = pd.to_numeric(data['Close'], errors='coerce')
        data['High'] = pd.to_numeric(data['High'], errors='coerce')
        data['Low'] = pd.to_numeric(data['Low'], errors='coerce')
        
        # Generate features
        data['RSI'] = rsi(data)
        data['50_MA'] = moving_average(data)
        data['MACD'], data['MACD_Signal'] = macd(data)
        data['Upper_Band'], data['Lower_Band'] = bollinger_bands(data)
        data['ATR'] = average_true_range(data)
        
        print("Data before cleaning and scaling:", data.head())
        
        # Clean data
        prepared_data = clean_data(data)
        print("Data after cleaning:", prepared_data.head() if prepared_data is not None else "None")
        return prepared_data
    else:
        print("Data loading failed.")
        return None

### TESTDRIVER : Trading Model

In [363]:
output_path = "crypto_data.csv"
downloaded_file = download_crypto_csv(output_path)
if downloaded_file:
    with open(downloaded_file, 'r') as f:
        for i in range(5):
            print(f.readline())

    prepared_data = prepare_data(downloaded_file)
    print(prepared_data.head())

Downloading data from https://www.cryptodatadownload.com/cdd/Binance_BTCUSDT_d.csv
Data downloaded and saved to crypto_data.csv
https://www.CryptoDataDownload.com

Unix,Date,Symbol,Open,High,Low,Close,Volume BTC,Volume USDT,tradecount

1733702400000,2024-12-09,BTCUSDT,101109.6,101215.93,94150.05,97276.47,53949.11595,5283626995.705648,8445872

1733616000000,2024-12-08,BTCUSDT,99831.99,101351.0,98657.7,101109.59,14612.99688,1459576946.402151,2994709

1733529600000,2024-12-07,BTCUSDT,99740.84,100439.18,98844.0,99831.99,14931.9459,1487953846.838546,2634566

Data loaded successfully: 2672 rows
Column names: Index(['Unix', 'Date', 'Symbol', 'Open', 'High', 'Low', 'Close', 'Volume BTC',
       'Volume USDT', 'tradecount'],
      dtype='object')
Data after loading:             Unix        Date   Symbol       Open       High       Low  \
0  1733702400000  2024-12-09  BTCUSDT  101109.60  101215.93  94150.05   
1  1733616000000  2024-12-08  BTCUSDT   99831.99  101351.00  98657.70   
2  1733529600

In [365]:
print(prepared_data['RSI'].describe())
print(prepared_data[['MACD', 'MACD_Signal']].describe())
print(prepared_data[['Close', 'Upper_Band', 'Lower_Band']].head())
print(prepared_data['ATR'].describe())

count    2.623000e+03
mean     8.668459e-17
std      1.000191e+00
min     -2.559690e+00
25%     -6.815060e-01
50%      3.796224e-02
75%      7.179793e-01
max      2.616823e+00
Name: RSI, dtype: float64
               MACD   MACD_Signal
count  2.623000e+03  2.623000e+03
mean  -2.167115e-17 -3.250672e-17
std    1.000191e+00  1.000191e+00
min   -3.614405e+00 -3.935634e+00
25%   -3.174013e-01 -3.160748e-01
50%    7.978049e-02  9.086910e-02
75%    4.360468e-01  4.329307e-01
max    4.484965e+00  4.044815e+00
       Close    Upper_Band    Lower_Band
49  67377.50  77146.832043  63483.615957
50  69031.99  76087.668111  63778.232889
51  68378.00  74873.067316  64179.655684
52  68428.00  73606.891524  64702.842476
53  67421.78  72076.662039  65418.050961
count    2.623000e+03
mean     4.334229e-17
std      1.000191e+00
min     -1.100705e+00
25%     -7.965450e-01
50%     -3.946270e-01
75%      6.788972e-01
max      4.129427e+00
Name: ATR, dtype: float64


### 4: Class Balancing (SMOTE)
- Adjust sampling_strategy to explore ways to address class imbalance.
- Experiment with other resampling techniques here in the future when time permits
    - like NearMiss or RandomUnderSampler.
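
As a rough picture of what random undersampling does, the sketch below downsamples every class to the size of the smallest one using only pandas. It is a hand-rolled stand-in for illustration, not imblearn's `RandomUnderSampler` implementation; `random_undersample` and the toy data are hypothetical:

```python
import pandas as pd

def random_undersample(X, y, random_state=42):
    """Keep a random subset of each class, sized to the smallest class."""
    n_min = y.value_counts().min()
    keep_idx = []
    for label in y.unique():
        sampled = y[y == label].sample(n=n_min, random_state=random_state)
        keep_idx.extend(sampled.index)
    return X.loc[keep_idx], y.loc[keep_idx]

# Toy data: six rows of class 0 and four rows of class 1
X_toy = pd.DataFrame({"feature": range(10)})
y_toy = pd.Series([0] * 6 + [1] * 4, name="target")

X_bal, y_bal = random_undersample(X_toy, y_toy)
print(y_bal.value_counts())
```

Unlike SMOTE, this discards rows instead of synthesizing new ones, so it trades data volume for balance.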

In [369]:
# Create target column based on price movement 
prepared_data.loc[:, 'target'] = (prepared_data['Close'].shift(-1) > prepared_data['Close']).astype(int)

In [373]:
# Clean The Data 
prepared_data = prepared_data.dropna(subset=['RSI', 'MACD', 'MACD_Signal', '50_MA', 'Upper_Band', 'Lower_Band', 'ATR', 'target'])

# Assess Cleanliness
print(prepared_data.head())
print(f"Number of rows in cleaned data: {len(prepared_data)}")

             Unix        Date   Symbol      Open      High       Low  \
49  1729468800000  2024-10-21  BTCUSDT  69032.00  69519.52  66840.67   
50  1729382400000  2024-10-20  BTCUSDT  68377.99  69400.00  68100.00   
51  1729296000000  2024-10-19  BTCUSDT  68427.99  68693.26  68010.00   
52  1729209600000  2024-10-18  BTCUSDT  67421.78  69000.00  67192.36   
53  1729123200000  2024-10-17  BTCUSDT  67620.00  67939.40  66666.00   

       Close   Volume BTC   Volume USDT  tradecount       RSI     50_MA  \
49  67377.50  31374.42184  2.130834e+09     3686777  0.061345  2.932015   
50  69031.99  12442.47378  8.540824e+08     1563795  0.194357  2.903780   
51  68378.00   8193.66737  5.596286e+08     1152428 -0.026648  2.871060   
52  68428.00  28725.63500  1.959736e+09     4010969 -0.040079  2.839668   
53  67421.78  25328.22861  1.702164e+09     3981960 -0.352440  2.807360   

        MACD  MACD_Signal    Upper_Band    Lower_Band       ATR  target  
49 -3.467932    -3.935634  77146.832043  6

In [377]:
# Split into features (X) and target (y)
feature_columns = ['RSI', '50_MA', 'MACD', 'MACD_Signal', 'Upper_Band', 'Lower_Band', 'ATR']
X = prepared_data.loc[:, feature_columns]
y = prepared_data.loc[:, 'target']

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
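
`train_test_split` shuffles rows by default, so with price data the test set can contain days that fall between training days. A chronological hold-out is one common alternative. A minimal sketch on synthetic data, assuming rows are already sorted oldest-first; the `_demo`, `_tr`, and `_te` names are illustrative only:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten time-ordered rows standing in for the real feature table
X_demo = pd.DataFrame({"feature": np.arange(10)})
y_demo = pd.Series(np.arange(10) % 2, name="target")

# shuffle=False keeps row order: the last 20% of rows become the test set
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, shuffle=False
)
print("train rows:", X_tr.index.tolist())
print("test rows:", X_te.index.tolist())
```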

In [383]:
# Check class distribution before applying SMOTE
# balance is not bad but why not balance it
print("Class distribution before SMOTE:")
print(y_train.value_counts())

Class distribution before SMOTE:
target
0    1075
1    1023
Name: count, dtype: int64


In [387]:
# Apply SMOTE to perfect the balance in class distribution
smote = SMOTE(sampling_strategy='auto', random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)

# Check the class distribution after applying SMOTE
print(f"Class distribution after SMOTE: {y_train_res.value_counts()}")

Class distribution after SMOTE: target
1    1075
0    1075
Name: count, dtype: int64


### 5: Feature Scaling
- Currently using the StandardScaler
    - Can swap to a different scaler here when time permits
    - something like: MinMaxScaler or RobustScaler
- Scaling helps model performance by creating consistent ranges in the data
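
One way the scaler swap mentioned above could look, mirroring the dictionary lookup used for models in step 6. `SCALERS` and `scale_features` are hypothetical names introduced for this sketch:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Map a short name to a scaler class so the choice is a single string
SCALERS = {
    "standard": StandardScaler,
    "minmax": MinMaxScaler,
    "robust": RobustScaler,
}

def scale_features(X_train, X_test, scaler_name="standard"):
    """Fit the chosen scaler on the training data only, then transform both sets."""
    if scaler_name not in SCALERS:
        raise ValueError(f"Scaler '{scaler_name}' is not supported.")
    scaler = SCALERS[scaler_name]()
    return scaler.fit_transform(X_train), scaler.transform(X_test), scaler

# Toy example with one outlier in the training column
X_tr_demo = np.array([[1.0], [2.0], [3.0], [100.0]])
X_te_demo = np.array([[2.0]])
tr_scaled, te_scaled, fitted = scale_features(X_tr_demo, X_te_demo, "minmax")
print(tr_scaled.ravel(), te_scaled.ravel())
```

Fitting on the training split only keeps test-set statistics out of the transformation.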

In [390]:
# Scale the features using StandardScaler
scaler = StandardScaler()

In [392]:
# Fit and transform the training data, and transform the test data
X_train_scaled = scaler.fit_transform(X_train_res)
X_test_scaled = scaler.transform(X_test)

In [398]:
# Verify the scaled data
print("First 5 rows of scaled training data:")
print(X_train_scaled[:5])

First 5 rows of scaled training data:
[[ 2.13110721  0.03531851  0.8143963   0.56786761  0.09294378  0.09167859
  -0.45820035]
 [ 1.15722658 -0.93244254  0.32117468  0.23104401 -0.92370256 -0.89209491
  -0.74802718]
 [-2.02346264 -0.66871053 -1.1485002  -1.16890297 -0.78573941 -0.91926356
  -0.63486257]
 [-0.79330567  2.02157109 -1.1716956  -0.64842767  2.05734873  2.02517554
   1.20320364]
 [ 0.14953347  0.78675016  0.09949341 -0.24620059  0.75162101  0.70188664
   1.35025029]]


### 6: Model Training
- Select a model and train it
- Default model : (RandomForestClassifier)

In [433]:
# Dynamic model selection allows swapping models in and out for performance comparisons
def train_model(model_name, X_train, y_train, X_test):
    """
    Dynamically select and train a model, then predict on the test set.
    
    Args:
        model_name (str): The name of the model to use (e.g., "random_forest").
        X_train (np.array): Scaled training features.
        y_train (np.array): Training target labels.
        X_test (np.array): Scaled testing features.
    
    Returns:
        model: Trained model instance.
        y_pred: Predictions on the test set.
    """
    # Define supported models
    models = {
        "random_forest": RandomForestClassifier(random_state=42),
        "logistic_regression": LogisticRegression(random_state=42),
        "svm": SVC(random_state=42),
    }
    
    # Validate model_name
    if model_name not in models:
        raise ValueError(f"Model '{model_name}' is not supported.")
    
    # Get the selected model
    model = models[model_name]
    
    # Train the model
    model.fit(X_train, y_train)
    print(f"'{model_name}' training complete.")
    
    # Make predictions
    y_pred = model.predict(X_test)
    print(f"First 5 predictions for '{model_name}': {y_pred[:5]}")
    
    return model, y_pred


In [435]:
print(X_train_scaled.shape, y_train_res.shape)  # Ensure they match : allll goood

(2150, 7) (2150,)


In [437]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(random_state=42)
model.fit(X_train_scaled, y_train_res)  # Train the model
y_pred = model.predict(X_test_scaled)  # Make predictions
print("Direct training successful. First 5 predictions:", y_pred[:5])


Direct training successful. First 5 predictions: [0 1 0 0 0]


In [441]:
model_name = "random_forest"  # Try other models like "logistic_regression" or "svm"
model, y_pred = train_model(model_name, X_train_scaled, y_train_res, X_test_scaled)


'random_forest' training complete.
First 5 predictions for 'random_forest': [0 1 0 0 0]


### 7: Model Evaluation
- Accuracy, classification report, and confusion matrix.
- Implement additional/other metrics here
    - perhaps ROC curve, precision-recall curve, or F1 score analysis
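
The ROC curve is listed above but not computed in the evaluation cell. A minimal sketch of how it could be obtained, using synthetic data and a logistic regression purely for illustration; the `_demo`, `_tr`, and `_te` names are not part of the template:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic two-class data with seven features, like the feature table above
X_demo, y_demo = make_classification(n_samples=400, n_features=7, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# ROC needs scores, not hard labels: use the probability of class 1
proba_up = clf.predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, proba_up)
auc = roc_auc_score(y_te, proba_up)
print(f"ROC AUC: {auc:.4f}")
```

`RandomForestClassifier` also exposes `predict_proba`; `SVC` only does so when built with `probability=True`. Plotting `fpr` against `tpr` with matplotlib gives the curve itself.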

In [445]:
# Accuracy, Classification, Confusion.
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

print("\nClassification Report:")
print(classification_report(y_test, y_pred))

print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Accuracy: 0.4571

Classification Report:
              precision    recall  f1-score   support

           0       0.48      0.50      0.49       274
           1       0.43      0.41      0.42       251

    accuracy                           0.46       525
   macro avg       0.46      0.46      0.46       525
weighted avg       0.46      0.46      0.46       525


Confusion Matrix:
[[136 138]
 [147 104]]


### 8: Cross-Validation
- Get a more reliable view of model performance across multiple folds, which helps detect overfitting and averages out the noise of any single split
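
A sketch of what this step could look like, assuming time-ordered rows. `TimeSeriesSplit` always trains on earlier rows and validates on later ones, and wrapping the scaler in a pipeline refits it inside each fold. The random data below is a placeholder, so the scores carry no meaning:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data: 300 time-ordered rows, seven features, random labels
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 7))
y_demo = rng.integers(0, 2, size=300)

# The scaler is fitted on each fold's training rows only
pipeline = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=50, random_state=42),
)

cv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv, scoring="accuracy")
print("Fold accuracies:", np.round(scores, 3))
print(f"Mean accuracy: {scores.mean():.4f}")
```

A large spread between folds would suggest the single-split accuracy reported in step 7 is not stable.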