## Crypto Trading Model Template v2

- 1: Data Loading
- 2: Feature Engineering
- 3: Data Cleaning
- 4: Class Balancing
- 5: Feature Scaling
- 6: Model Training
- 7: Model Evaluation
- 8: Cross-Validation
#### Future Uses
- Hyperparameter Tuning
    - Add improved searches to optimize model here
- Model Comparisons
    - Framework is modular for seamless model swapping
- Extended Feature Engineering
    - Feature Engineering is endless
    - As the model learns, I gain new insights and expand my field knowledge
      - This process is progressive, and eventually we'll have a whole system built using this template

### 1: Data Loading
- (download_crypto_csv, load_crypto_data): Easily swap out the data source by passing a different url, or point the loader at another CSV file.

In [67]:
# Import libraries
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from io import StringIO
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

#### Automated data pull
- Download cryptocurrency data from a default URL and save as a CSV file
- Args:
    - output_path (str): Path to save the downloaded CSV
    - url (str): The URL to fetch the CSV data from (default is Binance BTC/USDT daily data)  
- Returns: str: Path to the saved CSV file

In [5]:
# Automated CSV Pull Function
def download_crypto_csv(output_path, url="https://www.cryptodatadownload.com/cdd/Binance_BTCUSDT_d.csv"):
    try:
        print(f"Downloading data from {url}")
        response = requests.get(url)
        response.raise_for_status()
        
        # Save the CSV content
        with open(output_path, 'w') as f:
            f.write(response.text)
        
        print(f"Data downloaded and saved to {output_path}")
        return output_path
    except Exception as e:
        print(f"Error downloading data: {e}")
        return None

#### Data Load
- Load cryptocurrency data from CSV file
- Args:
    - filepath (str): path to CSV file
- Returns:
    - pd.DataFrame: data loaded into DataFrame

In [7]:
# Load Crypto Data from CSV
def load_crypto_data(filepath):
    try:
        # Skip the first row and load the data
        df = pd.read_csv(filepath, skiprows=1)
        print(f"Data loaded successfully: {len(df)} rows")
        print("Column names:", df.columns)
        return df
    except Exception as e:
        print(f"Error loading data: {e}")
        return None
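
The test-driver output further down shows the file opening with a banner line and listing the newest day first. Rolling windows and `shift` calls generally assume oldest-first rows, so it may be worth sorting right after loading. Below is a minimal sketch of that idea, not part of the template itself: `load_chronological` is a hypothetical helper name, and the sample is a trimmed copy (fewer columns) of the rows printed by the test driver.

```python
from io import StringIO

import pandas as pd

# Trimmed sample: banner line, header, then newest-first rows
SAMPLE_CSV = """https://www.CryptoDataDownload.com
Unix,Date,Symbol,Open,High,Low,Close
1733788800000,2024-12-10,BTCUSDT,97276.48,98270.0,94256.54,96593.0
1733702400000,2024-12-09,BTCUSDT,101109.6,101215.93,94150.05,97276.47
1733616000000,2024-12-08,BTCUSDT,99831.99,101351.0,98657.7,101109.59
"""

def load_chronological(source):
    # Hypothetical helper: skip the banner row, then sort oldest-first
    df = pd.read_csv(source, skiprows=1)
    df['Date'] = pd.to_datetime(df['Date'])
    return df.sort_values('Date').reset_index(drop=True)

sample_df = load_chronological(StringIO(SAMPLE_CSV))
print(sample_df[['Date', 'Close']])
```

With the rows in this order, `rolling(...)` looks back over earlier days and `shift(-1)` refers to the following day.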

### 2: Feature Engineering
- (generate_features): Add, remove, or tweak features.
- Append more calculations or move them around.

In [9]:
# Calculate the Relative Strength Index (RSI).
def rsi(data, window=14):
    delta = data['Close'].diff()
    gain = (delta.where(delta > 0, 0)).rolling(window=window).mean()
    loss = (-delta.where(delta < 0, 0)).rolling(window=window).mean()
    rs = gain / loss
    return 100 - (100 / (1 + rs))

In [10]:
# Feature: 50-period Moving Average (SMA)
def moving_average(data, window=50):
    return data['Close'].rolling(window=window).mean()

In [11]:
# Feature: MACD (Moving Average Convergence Divergence)
def macd(data, short_window=12, long_window=26, signal_window=9):
    short_ema = data['Close'].ewm(span=short_window, min_periods=1).mean()
    long_ema = data['Close'].ewm(span=long_window, min_periods=1).mean()
    macd_line = short_ema - long_ema
    signal_line = macd_line.ewm(span=signal_window, min_periods=1).mean()
    return macd_line, signal_line

In [12]:
# Calculate Bollinger Bands.
def bollinger_bands(data, window=20):
    sma = data['Close'].rolling(window=window).mean()
    std = data['Close'].rolling(window=window).std()
    upper_band = sma + (2 * std)
    lower_band = sma - (2 * std)
    return upper_band, lower_band

In [13]:
# Calculate Average True Range (ATR).
def average_true_range(data, window=14):
    high_low = data['High'] - data['Low']
    high_close = abs(data['High'] - data['Close'].shift())
    low_close = abs(data['Low'] - data['Close'].shift())
    tr = high_low.combine(high_close, max).combine(low_close, max)
    return tr.rolling(window=window).mean()
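
As an alternative to chaining `combine(..., max)`, the true range can be computed column-wise with `pd.concat` and a row-wise `max`. This is a sketch under the assumption that rows are ordered oldest-first, so `shift()` yields the previous close. `true_range` is a hypothetical helper name and the toy prices are made up.

```python
import pandas as pd

def true_range(data):
    # Previous close (assumes rows are sorted oldest-first)
    prev_close = data['Close'].shift()
    candidates = pd.concat(
        [
            data['High'] - data['Low'],
            (data['High'] - prev_close).abs(),
            (data['Low'] - prev_close).abs(),
        ],
        axis=1,
    )
    # Row-wise max ignores the NaN produced by shift() on the first row
    return candidates.max(axis=1)

# Made-up toy prices for illustration only
toy = pd.DataFrame({
    'High':  [10.0, 12.0, 11.0],
    'Low':   [8.0, 9.0, 10.5],
    'Close': [9.0, 11.0, 10.6],
})
tr = true_range(toy)
print(tr.tolist())
```

The ATR is then `tr.rolling(window=14).mean()`, as in the function above.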

#### 3: Data Cleaning
- (dropna): Handle missing data and NaNs so the model doesn't run into issues when scaling or fitting.

In [15]:
# Clean the dataset by handling missing values and scaling features.
def clean_data(df):
    # Drop rows with NaN values for the required features
    df.dropna(subset=['RSI', 'MACD', 'MACD_Signal', '50_MA', 'Upper_Band', 'Lower_Band', 'ATR'], inplace=True)

    # Scale numerical features (excluding Upper_Band and Lower_Band)
    scaler = StandardScaler()
    features_to_scale = ['RSI', '50_MA', 'MACD', 'MACD_Signal', 'ATR']
    df[features_to_scale] = scaler.fit_transform(df[features_to_scale])

    print(f"Data cleaned and scaled: {len(df)} rows remaining")
    return df

In [16]:
# Load, process, and clean crypto data.
# Args: filepath (str): Path to the CSV file.
# Returns: pd.DataFrame: Processed and cleaned data.
def prepare_data(filepath):
    # Load data
    data = load_crypto_data(filepath)
    print("Data after loading:", data.head() if data is not None else "None")
    
    if data is not None:
        # Ensure 'Close', 'High', and 'Low' are numeric
        data['Close'] = pd.to_numeric(data['Close'], errors='coerce')
        data['High'] = pd.to_numeric(data['High'], errors='coerce')
        data['Low'] = pd.to_numeric(data['Low'], errors='coerce')
        
        # Generate features
        data['RSI'] = rsi(data)
        data['50_MA'] = moving_average(data)
        data['MACD'], data['MACD_Signal'] = macd(data)
        data['Upper_Band'], data['Lower_Band'] = bollinger_bands(data)
        data['ATR'] = average_true_range(data)
        
        print("Data before cleaning and scaling:", data.head())
        
        # Clean data
        prepared_data = clean_data(data)
        print("Data after cleaning:", prepared_data.head() if prepared_data is not None else "None")
        return prepared_data
    else:
        print("Data loading failed.")
        return None

### TESTDRIVER : Trading Model

In [18]:
output_path = "crypto_data.csv"
downloaded_file = download_crypto_csv(output_path)
if downloaded_file:
    with open(downloaded_file, 'r') as f:
        for i in range(5):
            print(f.readline())

    prepared_data = prepare_data(downloaded_file)
    print(prepared_data.head())

Downloading data from https://www.cryptodatadownload.com/cdd/Binance_BTCUSDT_d.csv
Data downloaded and saved to crypto_data.csv
https://www.CryptoDataDownload.com

Unix,Date,Symbol,Open,High,Low,Close,Volume BTC,Volume USDT,tradecount

1733788800000,2024-12-10,BTCUSDT,97276.48,98270.0,94256.54,96593.0,51708.68933,4988660195.092429,11556189

1733702400000,2024-12-09,BTCUSDT,101109.6,101215.93,94150.05,97276.47,53949.11595,5283626995.705648,8445872

1733616000000,2024-12-08,BTCUSDT,99831.99,101351.0,98657.7,101109.59,14612.99688,1459576946.402151,2994709

Data loaded successfully: 2673 rows
Column names: Index(['Unix', 'Date', 'Symbol', 'Open', 'High', 'Low', 'Close', 'Volume BTC',
       'Volume USDT', 'tradecount'],
      dtype='object')
Data after loading:             Unix        Date   Symbol       Open       High       Low  \
0  1733788800000  2024-12-10  BTCUSDT   97276.48   98270.00  94256.54   
1  1733702400000  2024-12-09  BTCUSDT  101109.60  101215.93  94150.05   
2  1733616000

In [19]:
print(prepared_data['RSI'].describe())
print(prepared_data[['MACD', 'MACD_Signal']].describe())
print(prepared_data[['Close', 'Upper_Band', 'Lower_Band']].head())
print(prepared_data['ATR'].describe())

count    2624.000000
mean        0.000000
std         1.000191
min        -2.560094
25%        -0.681480
50%         0.036109
75%         0.718039
max         2.617374
Name: RSI, dtype: float64
               MACD   MACD_Signal
count  2.624000e+03  2.624000e+03
mean  -3.249433e-17 -4.874150e-17
std    1.000191e+00  1.000191e+00
min   -3.641802e+00 -4.039720e+00
25%   -3.155787e-01 -3.138227e-01
50%    8.094901e-02  9.231699e-02
75%    4.360918e-01  4.331587e-01
max    4.474226e+00  4.032209e+00
       Close    Upper_Band    Lower_Band
49  67426.00  78987.381697  62942.317303
50  67377.50  77146.832043  63483.615957
51  69031.99  76087.668111  63778.232889
52  68378.00  74873.067316  64179.655684
53  68428.00  73606.891524  64702.842476
count    2.624000e+03
mean    -4.332578e-17
std      1.000191e+00
min     -1.101088e+00
25%     -7.967380e-01
50%     -3.948591e-01
75%      6.804814e-01
max      4.129316e+00
Name: ATR, dtype: float64


#### 4: Class Balancing (SMOTE)
- Adjust sampling_strategy to explore ways to address class imbalance.
- Experiment with other resampling techniques here in the future when time permits
    - like NearMiss or RandomUnderSampler.
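
To get a feel for what undersampling does before reaching for imblearn's `RandomUnderSampler`, here is a hand-rolled pandas stand-in. It is a rough sketch, not imblearn's implementation: `undersample_majority` is a hypothetical helper and the toy values are made up. As with SMOTE, it should only be applied to the training split.

```python
import pandas as pd

def undersample_majority(frame, target_col='target', random_state=42):
    # Size of the smallest class sets the per-class sample size
    smallest = frame[target_col].value_counts().min()
    balanced = frame.groupby(target_col).sample(n=smallest, random_state=random_state)
    # Restore the original row order
    return balanced.sort_index()

# Made-up toy data: four rows of class 0, three rows of class 1
toy = pd.DataFrame({
    'RSI': [0.1, -0.4, 0.7, 1.2, -1.1, 0.3, -0.2],
    'target': [0, 0, 0, 0, 1, 1, 1],
})
balanced = undersample_majority(toy)
print(balanced['target'].value_counts())
```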

In [21]:
# Create target column based on price movement 
prepared_data.loc[:, 'target'] = (prepared_data['Close'].shift(-1) > prepared_data['Close']).astype(int)

In [22]:
# Clean The Data 
prepared_data = prepared_data.dropna(subset=['RSI', 'MACD', 'MACD_Signal', '50_MA', 'Upper_Band', 'Lower_Band', 'ATR', 'target'])

# Assess Cleanliness
print(prepared_data.head())
print(f"Number of rows in cleaned data: {len(prepared_data)}")

             Unix        Date   Symbol      Open      High       Low  \
49  1729555200000  2024-10-22  BTCUSDT  67377.50  67836.01  66571.42   
50  1729468800000  2024-10-21  BTCUSDT  69032.00  69519.52  66840.67   
51  1729382400000  2024-10-20  BTCUSDT  68377.99  69400.00  68100.00   
52  1729296000000  2024-10-19  BTCUSDT  68427.99  68693.26  68010.00   
53  1729209600000  2024-10-18  BTCUSDT  67421.78  69000.00  67192.36   

       Close   Volume BTC   Volume USDT  tradecount       RSI     50_MA  \
49  67426.00  24598.96268  1.654286e+09     3669291 -0.179382  2.955721   
50  67377.50  31374.42184  2.130834e+09     3686777  0.061425  2.926559   
51  69031.99  12442.47378  8.540824e+08     1563795  0.194461  2.898366   
52  68378.00   8193.66737  5.596286e+08     1152428 -0.026584  2.865695   
53  68428.00  28725.63500  1.959736e+09     4010969 -0.040018  2.834348   

        MACD  MACD_Signal    Upper_Band    Lower_Band       ATR  target  
49 -3.641802    -4.039720  78987.381697  6

In [23]:
# Split into features (X) and target (y)
feature_columns = ['RSI', '50_MA', 'MACD', 'MACD_Signal', 'Upper_Band', 'Lower_Band', 'ATR']
X = prepared_data.loc[:, feature_columns]
y = prepared_data.loc[:, 'target']

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
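
Since the rows are daily prices, a shuffled split can put later days in the training set and earlier days in the test set. One option worth trying is a chronological split with `shuffle=False`. The sketch below assumes the data is sorted oldest-first; `X_demo` and `y_demo` are made-up stand-ins for `X` and `y`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up daily data, sorted oldest-first
dates = pd.date_range('2024-01-01', periods=10, freq='D')
X_demo = pd.DataFrame({'feature': np.arange(10, dtype=float)}, index=dates)
y_demo = pd.Series([0, 1, 0, 1, 1, 0, 1, 0, 0, 1], index=dates)

# shuffle=False keeps the last 20% of rows as the test set
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, shuffle=False
)
print("Last training day:", X_tr.index.max())
print("First test day:", X_te.index.min())
```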

In [24]:
# Check class distribution before applying SMOTE
# The classes are already close to balanced, but SMOTE evens them out exactly
print("Class distribution before SMOTE:")
print(y_train.value_counts())

Class distribution before SMOTE:
target
0    1064
1    1035
Name: count, dtype: int64


In [25]:
# Apply SMOTE to perfect the balance in class distribution
smote = SMOTE(sampling_strategy='auto', random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)

# Check the class distribution after applying SMOTE
print(f"Class distribution after SMOTE: {y_train_res.value_counts()}")

Class distribution after SMOTE: target
0    1064
1    1064
Name: count, dtype: int64


#### 5: Feature Scaling
- Currently using the StandardScaler
    - Can swap to a different scaler here when time permits
    - something like: MinMaxScaler or RobustScaler
- Scaling helps model performance by creating consistent ranges in the data
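
A possible way to make the scaler swappable is a small lookup keyed by name. This is a sketch only: `SCALERS` and `scale_split` are hypothetical names, and the random arrays stand in for the real training and test features. The scaler is fit on the training data alone and then reused on the test data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical registry of scaler classes
SCALERS = {
    'standard': StandardScaler,
    'minmax': MinMaxScaler,
    'robust': RobustScaler,
}

def scale_split(X_train, X_test, scaler_name='standard'):
    # Fit on training data only, then apply the same transform to test data
    scaler = SCALERS[scaler_name]()
    return scaler.fit_transform(X_train), scaler.transform(X_test), scaler

# Random stand-in data for illustration
rng = np.random.default_rng(42)
X_train_demo = rng.normal(size=(100, 3))
X_test_demo = rng.normal(size=(20, 3))

train_scaled, test_scaled, fitted = scale_split(X_train_demo, X_test_demo, 'minmax')
print(train_scaled.min(axis=0), train_scaled.max(axis=0))
```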

In [27]:
# Scale the features using StandardScaler
scaler = StandardScaler()

In [28]:
# Fit and transform the training data, and transform the test data
X_train_scaled = scaler.fit_transform(X_train_res)
X_test_scaled = scaler.transform(X_test)

In [29]:
# Verify the scaled data
print("First 5 rows of scaled training data:")
print(X_train_scaled[:5])

First 5 rows of scaled training data:
[[-1.97516711e+00 -6.78844126e-01 -1.13459160e+00 -1.15949102e+00
  -8.02437481e-01 -9.29700474e-01 -6.23654861e-01]
 [-2.31505712e-01 -2.06418576e-01 -7.17216453e-01 -9.60892056e-01
  -3.90854678e-01 -3.19558903e-01 -4.97477272e-02]
 [-6.30403376e-02  1.86955887e-01  3.51155892e-01  4.44809992e-01
   1.24387218e-01  3.91861394e-01 -5.21574927e-01]
 [-4.48721412e-01 -4.43290228e-01 -2.91756867e-02  1.21825513e-03
  -4.90599264e-01 -3.61347001e-01 -8.22236081e-01]
 [ 2.13835666e+00  2.87129688e-02  7.74643632e-01  4.97630726e-01
   7.45886435e-02  9.21440340e-02 -4.41053238e-01]]


#### 6: Model Training
- Select a model and train it
- Default model : (RandomForestClassifier)

In [31]:
# Dynamic model selection allows swapping models in and out for performance comparisons
# Args:
#     model_name (str): The name of the model to use (e.g., "random_forest").
#     X_train (np.array): Scaled training features.
#     y_train (np.array): Training target labels.
#     X_test (np.array): Scaled testing features.
# Returns:
#     model: Trained model instance.
#     y_pred: Predictions on the test set.
def train_model(model_name, X_train, y_train, X_test):
    # Define supported models
    models = {
        "random_forest": RandomForestClassifier(random_state=42),
        "logistic_regression": LogisticRegression(random_state=42),
        "svm": SVC(random_state=42),
    }
    
    # Validate model_name
    if model_name not in models:
        raise ValueError(f"Model '{model_name}' is not supported.")
    
    # Get the selected model
    model = models[model_name]
    
    # Train the model
    model.fit(X_train, y_train)
    print(f"'{model_name}' training complete.")
    
    # Make predictions
    y_pred = model.predict(X_test)
    print(f"First 5 predictions for '{model_name}': {y_pred[:5]}")
    
    return model, y_pred
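
For side-by-side comparisons, one variation is to store factories instead of instances, so each run builds a fresh model and only the selected one is constructed. The sketch below runs on synthetic data from `make_classification`; `MODEL_FACTORIES` is a hypothetical name and the scores say nothing about the crypto data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical registry: each entry builds a fresh, untrained model
MODEL_FACTORIES = {
    'random_forest': lambda: RandomForestClassifier(random_state=42),
    'logistic_regression': lambda: LogisticRegression(random_state=42),
    'svm': lambda: SVC(random_state=42),
}

# Synthetic stand-in data with 7 features
X_demo, y_demo = make_classification(n_samples=300, n_features=7, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

results = {}
for name, build in MODEL_FACTORIES.items():
    model = build().fit(X_tr, y_tr)
    results[name] = accuracy_score(y_te, model.predict(X_te))

for name, acc in results.items():
    print(f"{name}: {acc:.4f}")
```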


In [32]:
print(X_train_scaled.shape, y_train_res.shape)  # Ensure the row counts match

(2128, 7) (2128,)


In [33]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(random_state=42)
model.fit(X_train_scaled, y_train_res)  # Train the model
y_pred = model.predict(X_test_scaled)  # Make predictions
print("Direct training successful. First 5 predictions:", y_pred[:5])


Direct training successful. First 5 predictions: [0 0 0 1 1]


In [34]:
model_name = "random_forest"  # Try other models like "logistic_regression" or "svm"
model, y_pred = train_model(model_name, X_train_scaled, y_train_res, X_test_scaled)


'random_forest' training complete.
First 5 predictions for 'random_forest': [0 0 0 1 1]


#### 7: Model Evaluation
- Accuracy, classification report, and confusion matrix (the ROC curve is not implemented yet).
- Implement additional/other metrics here
    - perhaps precision-recall curve or F1 score analysis
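
A rough sketch of how the probability-based metrics could be added, shown on synthetic data rather than the crypto set. It assumes the model exposes `predict_proba`, which `RandomForestClassifier` and `LogisticRegression` do. `SVC` would need `probability=True`, or `decision_function` scores instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    f1_score,
    precision_recall_curve,
    roc_auc_score,
    roc_curve,
)
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with 7 features
X_demo, y_demo = make_classification(n_samples=400, n_features=7, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

clf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
proba_up = clf.predict_proba(X_te)[:, 1]  # probability of class 1

auc = roc_auc_score(y_te, proba_up)
fpr, tpr, _ = roc_curve(y_te, proba_up)
precision, recall, _ = precision_recall_curve(y_te, proba_up)
f1 = f1_score(y_te, clf.predict(X_te))

print(f"ROC AUC: {auc:.4f}")
print(f"F1 score: {f1:.4f}")
```

The `fpr`/`tpr` and `precision`/`recall` arrays can be passed straight to `plt.plot` to draw the curves.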

In [36]:
# Accuracy, Classification, Confusion.
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

print("\nClassification Report:")
print(classification_report(y_test, y_pred))

print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Accuracy: 0.4705

Classification Report:
              precision    recall  f1-score   support

           0       0.52      0.48      0.49       286
           1       0.43      0.46      0.44       239

    accuracy                           0.47       525
   macro avg       0.47      0.47      0.47       525
weighted avg       0.47      0.47      0.47       525


Confusion Matrix:
[[136 150]
 [128 111]]


#### 8: Cross-Validation
- Get a more reliable view of model performance by scoring across multiple folds, which makes overfitting and noisy single-split results easier to spot
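
This step has no code yet. One way it might look is sketched below, using `TimeSeriesSplit` so every fold trains on earlier rows and tests on later ones, with the scaler inside a pipeline so it is refit per fold. The data is synthetic, SMOTE is left out for simplicity, and the fold count and `n_estimators` are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; real use assumes rows sorted oldest-first
X_demo, y_demo = make_classification(n_samples=500, n_features=7, random_state=42)

# Scaling lives inside the pipeline so each fold fits its own scaler
pipeline = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=50, random_state=42),
)

cv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv, scoring='accuracy')

print("Fold accuracies:", scores)
print(f"Mean accuracy: {scores.mean():.4f}")
```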