## Crypto Trading Model Template v3
- 1: Data Loading
- 2: Feature Engineering
- 3: Data Cleaning
- 4: Class Balancing
- 5: Feature Scaling
- 6: Model Training
- 7: Model Evaluation
- 8: Cross-Validation
#### Future Uses
- Hyperparameter Tuning
    - Add improved searches to optimize model here
- Model Comparisons
    - Framework is modular for seamless model swapping
- Extended Feature Engineering
    - Feature Engineering is endless
    - As the model learns, I gain new insights and expand my field knowledge
      - This process is progressive, and eventually we'll have a whole system built using this template
- Implement paper trading to provide evidence and data for the project
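
For the Hyperparameter Tuning item above, one possible starting point is a randomized search. This is only a sketch: the parameter grid, the `tune_random_forest` helper, and the synthetic data are assumptions for illustration, not tuned values from this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit


def tune_random_forest(X, y, n_iter=5, random_state=42):
    """Randomized search over a small, illustrative RandomForest grid."""
    # Hypothetical search space; widen or narrow it for real data
    param_distributions = {
        "n_estimators": [25, 50, 100],
        "max_depth": [3, 5, None],
        "min_samples_leaf": [1, 5, 10],
    }
    search = RandomizedSearchCV(
        RandomForestClassifier(random_state=random_state),
        param_distributions=param_distributions,
        n_iter=n_iter,
        cv=TimeSeriesSplit(n_splits=3),  # folds keep chronological order
        scoring="accuracy",
        random_state=random_state,
    )
    search.fit(X, y)
    return search.best_params_, search.best_score_


# Synthetic data stands in for the prepared feature matrix
X_demo, y_demo = make_classification(n_samples=300, n_features=7, random_state=42)
best_params, best_score = tune_random_forest(X_demo, y_demo)
print(best_params, round(best_score, 4))
```

Swapping in another estimator only requires changing the model and the grid, which fits the modular design described above.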
#### v3 updates
- Multi-Timeframe Integration (in progress)
  - fetch data for specific intervals: hourly, four hour, daily, weekly, ...
    - increase accuracy of open, close, high, and low values for better feature engineering
  - combine and align into comprehensive DataFrame
    - combine on features (OHLC) and Volume 
    - Update Data Cleaning
  - Generate Multi-Timeframe Features
    - Update indicators for each timeframe: RSI, MACD, MA
- Build Feature Iteration Mechanism (incomplete)
  - Systematically test combinations of features
  - generate heatmaps and visualizations for performance insights
  - refine and discard noisy features
- Test Models on Improved features
  - Evaluate current features without tweaking
  - identify how models perform on the baseline setup
  - begin planning for the presentation here by documenting and comparing each model's performance
    - this makes the project self-analyzing and, eventually, self-optimizing
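
The "combine and align" step in the Multi-Timeframe Integration item can be sketched as follows. The helper names (`resample_ohlcv`, `align_timeframes`), the column names, and the synthetic hourly bars are assumptions for illustration. The sketch also assumes bars are indexed by their start time, oldest first, and it shifts each higher-timeframe bar to the moment it completes so that a base bar never sees a bar that is still forming.

```python
import numpy as np
import pandas as pd

OHLCV_AGG = {"Open": "first", "High": "max", "Low": "min", "Close": "last", "Volume": "sum"}


def resample_ohlcv(df, rule):
    """Aggregate OHLCV bars (DatetimeIndex, oldest first) into a coarser timeframe."""
    return df.resample(rule).agg(OHLCV_AGG).dropna()


def align_timeframes(base, higher, period, suffix):
    """Attach the latest completed higher-timeframe bar to each base bar."""
    available = higher.add_suffix(suffix)
    # Bars are labelled by start time, so a bar is only known once its period has elapsed
    available.index = available.index + pd.Timedelta(period)
    return pd.merge_asof(
        base, available, left_index=True, right_index=True, direction="backward"
    )


# Synthetic hourly bars stand in for downloaded data
idx = pd.date_range("2024-01-01", periods=48, freq="h")
close = 100 + np.random.default_rng(0).normal(0, 1, len(idx)).cumsum()
hourly = pd.DataFrame(
    {"Open": close, "High": close + 1, "Low": close - 1, "Close": close, "Volume": 10.0},
    index=idx,
)
four_hour = resample_ohlcv(hourly, "4h")
combined = align_timeframes(hourly, four_hour, "4h", "_4h")
print(combined[["Close", "Close_4h"]].iloc[3:6])
```

The first four hourly rows have no completed four-hour bar yet, so their `_4h` columns are NaN and would be dropped by the cleaning step.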


### 1: Data Loading
- (get_historical_data): Easily swap out the data source or adjust parameters like coin_id, vs_currency, and days for different datasets.
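
This notebook does not define `get_historical_data`. The parameter names (coin_id, vs_currency, days) match CoinGecko's `market_chart` endpoint, so the sketch below assumes that source. The endpoint URL, the `market_chart_to_frame` helper, and the output column names are assumptions, and the endpoint returns prices and volumes rather than full OHLC bars.

```python
import pandas as pd
import requests

# Assumed data source: CoinGecko's public market_chart endpoint
COINGECKO_URL = "https://api.coingecko.com/api/v3/coins/{coin_id}/market_chart"


def market_chart_to_frame(payload):
    """Convert a market_chart-style payload into a DataFrame indexed by timestamp."""
    prices = pd.DataFrame(payload["prices"], columns=["timestamp", "Close"])
    volumes = pd.DataFrame(payload["total_volumes"], columns=["timestamp", "Volume"])
    df = prices.merge(volumes, on="timestamp", how="left")
    # Timestamps arrive as milliseconds since the Unix epoch
    df["timestamp"] = pd.to_datetime(df["timestamp"], unit="ms")
    return df.set_index("timestamp").sort_index()


def get_historical_data(coin_id="bitcoin", vs_currency="usd", days=30, timeout=10):
    """Fetch price and volume history and return it as a DataFrame."""
    response = requests.get(
        COINGECKO_URL.format(coin_id=coin_id),
        params={"vs_currency": vs_currency, "days": days},
        timeout=timeout,
    )
    response.raise_for_status()
    return market_chart_to_frame(response.json())


# Offline example with a hand-written payload in the same shape
sample_payload = {
    "prices": [[1704067200000, 42000.0], [1704153600000, 43000.0]],
    "total_volumes": [[1704067200000, 1.0], [1704153600000, 2.0]],
}
sample_frame = market_chart_to_frame(sample_payload)
print(sample_frame)
```

Keeping the parsing separate from the request makes it possible to swap the data source without touching the rest of the pipeline.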

In [218]:
# Import libraries
import requests
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from io import StringIO
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

### Automated data pull (V3 Update: implement new functions for data retrieval on Multiple Timeframe intervals)
- Download cryptocurrency data from a default URL and save as a CSV file

In [271]:
# Generalized function to download cryptocurrency data and save as a CSV
# Args: output_path (str): Path to save the downloaded CSV
     #  url (str): The URL to fetch the CSV data from
# Returns: pd.DataFrame: A DataFrame containing the data
def download_crypto_data(output_path, url):
    
    try:
        print(f"Downloading data from {url}")
        response = requests.get(url, timeout=30)  # Avoid hanging forever on a stalled connection
        response.raise_for_status()
        
        # Save the CSV content
        with open(output_path, 'w') as f:
            f.write(response.text)
        
        print(f"Data downloaded and saved to {output_path}")
        return pd.read_csv(output_path, skiprows=1)  # Assuming the first row is metadata
    except Exception as e:
        print(f"Error downloading data: {e}")
        return None


In [223]:
# Download cryptocurrency data for a given interval and save it to a file.
# Args: interval (str): Timeframe interval (e.g., "1h", "4h", "d", "w").
#       symbol (str): Trading pair symbol (default: BTCUSDT).
#       output_dir (str): Directory to save the downloaded CSV.
# Returns: pd.DataFrame: A DataFrame containing the data, or None if the download fails.
def download_timeframe_data(interval, symbol="BTCUSDT", output_dir="data"):
    # Make sure the output directory exists before writing to it
    os.makedirs(output_dir, exist_ok=True)

    # Define the base URL for CryptoDataDownload
    base_url = "https://www.cryptodatadownload.com/cdd/"

    # Construct the URL dynamically
    url = f"{base_url}Binance_{symbol}_{interval}.csv"
    output_path = os.path.join(output_dir, f"{symbol}_{interval}.csv")

    return download_crypto_data(output_path, url)

##### download data for 1h,4h,d,w

In [226]:
# Download hourly data
hourly_data = download_timeframe_data(interval="1h", symbol = "BTCUSD")

Downloading data from https://api.cryptodatadownload.com/v1/data/OHLC/BINANCE/SPOT
Error downloading data: 404 Client Error: Not Found for url: https://api.cryptodatadownload.com/v1/data/OHLC/BINANCE/SPOT


In [227]:
# Download 4-hour data
four_hour_data = download_timeframe_data(interval="4h")

Downloading data from https://api.cryptodatadownload.com/v1/data/OHLC/BINANCE/SPOT
Error downloading data: 404 Client Error: Not Found for url: https://api.cryptodatadownload.com/v1/data/OHLC/BINANCE/SPOT


In [229]:
# Download daily data
daily_data = download_timeframe_data(interval="d")

Downloading data from https://api.cryptodatadownload.com/v1/data/OHLC/BINANCE/SPOT
Error downloading data: 404 Client Error: Not Found for url: https://api.cryptodatadownload.com/v1/data/OHLC/BINANCE/SPOT


In [231]:
# Download weekly data
weekly_data = download_timeframe_data(interval="w")

Downloading data from https://api.cryptodatadownload.com/v1/data/OHLC/BINANCE/SPOT
Error downloading data: 404 Client Error: Not Found for url: https://api.cryptodatadownload.com/v1/data/OHLC/BINANCE/SPOT


# We hit a limit with 

#### Data Load
- Load cryptocurrency data from CSV file
- Args:
    - filepath (str): path to CSV file
- Returns:
    - pd.DataFrame: data loaded into DataFrame

In [236]:
# Load Crypto Data from CSV
def load_crypto_data(filepath):
    try:
        # Skip the first row and load the data
        df = pd.read_csv(filepath, skiprows=1)
        print(f"Data loaded successfully: {len(df)} rows")
        print("Column names:", df.columns)
        return df
    except Exception as e:
        print(f"Error loading data: {e}")
        return None

### 2: Feature Engineering
- (generate_features): Add, remove, or tweak features.
- Append more calculations or move them around.

In [239]:
# Calculate the Relative Strength Index (RSI).
def rsi(data, window=14):
    delta = data['Close'].diff()
    gain = (delta.where(delta > 0, 0)).rolling(window=window).mean()
    loss = (-delta.where(delta < 0, 0)).rolling(window=window).mean()
    rs = gain / loss
    return 100 - (100 / (1 + rs))
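
The `rsi` function above averages gains and losses with a simple rolling mean. Wilder's original RSI uses exponential smoothing with a factor of 1/window instead. The variant below is a sketch that approximates that smoothing with pandas' `ewm`. The name `rsi_wilder` and the synthetic prices are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def rsi_wilder(data, window=14):
    """RSI with Wilder-style exponential smoothing (alpha = 1 / window)."""
    delta = data["Close"].diff()
    gain = delta.clip(lower=0)      # positive moves only
    loss = (-delta).clip(lower=0)   # magnitude of negative moves only
    avg_gain = gain.ewm(alpha=1 / window, min_periods=window, adjust=False).mean()
    avg_loss = loss.ewm(alpha=1 / window, min_periods=window, adjust=False).mean()
    rs = avg_gain / avg_loss        # inf when there are no losses, which maps to RSI 100
    return 100 - (100 / (1 + rs))


# Synthetic random-walk prices stand in for real closes
prices = pd.DataFrame(
    {"Close": 100 + np.random.default_rng(1).normal(0, 1, 60).cumsum()}
)
print(rsi_wilder(prices).tail())
```

Both versions stay between 0 and 100. The smoothed one reacts less abruptly when a large move leaves the lookback window.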

In [241]:
# Feature: 50-period Moving Average (SMA)
def moving_average(data, window=50):
    return data['Close'].rolling(window=window).mean()

In [243]:
# Feature: MACD (Moving Average Convergence Divergence)
def macd(data, short_window=12, long_window=26, signal_window=9):
    short_ema = data['Close'].ewm(span=short_window, min_periods=1).mean()
    long_ema = data['Close'].ewm(span=long_window, min_periods=1).mean()
    macd_line = short_ema - long_ema
    signal_line = macd_line.ewm(span=signal_window, min_periods=1).mean()
    return macd_line, signal_line

In [245]:
# Calculate Bollinger Bands.
def bollinger_bands(data, window=20):
    sma = data['Close'].rolling(window=window).mean()
    std = data['Close'].rolling(window=window).std()
    upper_band = sma + (2 * std)
    lower_band = sma - (2 * std)
    return upper_band, lower_band

In [247]:
# Calculate Average True Range (ATR).
def average_true_range(data, window=14):
    high_low = data['High'] - data['Low']
    high_close = abs(data['High'] - data['Close'].shift())
    low_close = abs(data['Low'] - data['Close'].shift())
    tr = high_low.combine(high_close, max).combine(low_close, max)
    return tr.rolling(window=window).mean()

#### 3: Data Cleaning
- (dropna): Handle missing data or NaNs so the model doesn't encounter issues when scaling or fitting.

In [250]:
# Clean the dataset by handling missing values and scaling features.
def clean_data(df):
    # Drop rows with NaN values for the required features
    df.dropna(subset=['RSI', 'MACD', 'MACD_Signal', '50_MA', 'Upper_Band', 'Lower_Band', 'ATR'], inplace=True)

    # Scale numerical features (excluding Upper_Band and Lower_Band)
    scaler = StandardScaler()
    features_to_scale = ['RSI', '50_MA', 'MACD', 'MACD_Signal', 'ATR']
    df[features_to_scale] = scaler.fit_transform(df[features_to_scale])

    print(f"Data cleaned and scaled: {len(df)} rows remaining")
    return df

In [252]:
# Load, process, and clean crypto data.
# Args: filepath (str): Path to the CSV file.
# Returns: pd.DataFrame: Processed and cleaned data.
def prepare_data(filepath):
    # Load data
    data = load_crypto_data(filepath)
    print("Data after loading:", data.head() if data is not None else "None")
    
    if data is not None:
        # Ensure 'Close', 'High', and 'Low' are numeric
        data['Close'] = pd.to_numeric(data['Close'], errors='coerce')
        data['High'] = pd.to_numeric(data['High'], errors='coerce')
        data['Low'] = pd.to_numeric(data['Low'], errors='coerce')
        
        # Generate features
        data['RSI'] = rsi(data)
        data['50_MA'] = moving_average(data)
        data['MACD'], data['MACD_Signal'] = macd(data)
        data['Upper_Band'], data['Lower_Band'] = bollinger_bands(data)
        data['ATR'] = average_true_range(data)
        
        print("Data before cleaning and scaling:", data.head())
        
        # Clean data
        prepared_data = clean_data(data)
        print("Data after cleaning:", prepared_data.head() if prepared_data is not None else "None")
        return prepared_data
    else:
        print("Data loading failed.")
        return None

### TESTDRIVER : Trading Model

In [256]:
output_path = "crypto_data.csv"
# Daily BTCUSDT file, built the same way as in download_timeframe_data
url = "https://www.cryptodatadownload.com/cdd/Binance_BTCUSDT_d.csv"
downloaded = download_crypto_data(output_path, url)
if downloaded is not None:
    # Peek at the raw file before processing
    with open(output_path, 'r') as f:
        for i in range(5):
            print(f.readline())

    prepared_data = prepare_data(output_path)
    print(prepared_data.head())

In [258]:
print(prepared_data['RSI'].describe())
print(prepared_data[['MACD', 'MACD_Signal']].describe())
print(prepared_data[['Close', 'Upper_Band', 'Lower_Band']].head())
print(prepared_data['ATR'].describe())

NameError: name 'prepared_data' is not defined

#### 4: Class Balancing (SMOTE)
- Adjust sampling_strategy to explore ways to address class imbalance.
- Experiment with other resampling techniques here in the future when time permits
    - like NearMiss or RandomUnderSampler.
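
As a contrast to oversampling, random undersampling shrinks the majority class instead of synthesizing minority rows. The sketch below does this by hand with pandas so the mechanics are visible. imblearn's `RandomUnderSampler` is the library equivalent. The function name `random_undersample` and the toy data are assumptions, and the index is assumed to be unique.

```python
import pandas as pd


def random_undersample(X, y, random_state=42):
    """Down-sample every class to the size of the rarest class."""
    n_min = y.value_counts().min()
    keep = []
    for _, group in y.groupby(y):
        # Draw n_min rows from each class without replacement
        keep.extend(group.sample(n=n_min, random_state=random_state).index)
    keep = sorted(keep)  # restore the original row order
    return X.loc[keep], y.loc[keep]


# Toy data: 70 rows of class 0 and 30 rows of class 1
X_demo = pd.DataFrame({"RSI": range(100)})
y_demo = pd.Series([0] * 70 + [1] * 30)
X_res, y_res = random_undersample(X_demo, y_demo)
print(y_res.value_counts())
```

Undersampling discards data, so it is usually worth comparing against SMOTE on the same split rather than assuming one is better.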

In [261]:
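# Create target column: 1 if the next close is higher than the current close.
# The last row has no next close, so its target stays NaN and is dropped in the next cell.
next_close = prepared_data['Close'].shift(-1)
prepared_data.loc[:, 'target'] = (next_close > prepared_data['Close']).astype(float).where(next_close.notna())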

NameError: name 'prepared_data' is not defined

In [263]:
# Clean The Data 
prepared_data = prepared_data.dropna(subset=['RSI', 'MACD', 'MACD_Signal', '50_MA', 'Upper_Band', 'Lower_Band', 'ATR', 'target'])

# Assess Cleanliness
print(prepared_data.head())
print(f"Number of rows in cleaned data: {len(prepared_data)}")

NameError: name 'prepared_data' is not defined

In [265]:
# Split into features (X) and target (y)
feature_columns = ['RSI', '50_MA', 'MACD', 'MACD_Signal', 'Upper_Band', 'Lower_Band', 'ATR']
X = prepared_data.loc[:, feature_columns]
y = prepared_data.loc[:, 'target']

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

NameError: name 'prepared_data' is not defined

In [267]:
# Check class distribution before applying SMOTE
# The split is already close to even, but SMOTE removes the remaining imbalance
print("Class distribution before SMOTE:")
print(y_train.value_counts())

Class distribution before SMOTE:


NameError: name 'y_train' is not defined

In [269]:
# Apply SMOTE to equalize the class distribution in the training set
smote = SMOTE(sampling_strategy='auto', random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)

# Check the class distribution after applying SMOTE
print(f"Class distribution after SMOTE: {y_train_res.value_counts()}")

NameError: name 'X_train' is not defined

#### 5: Feature Scaling
- Currently using the StandardScaler
    - Can swap to a different scaler here when time permits
    - something like: MinMaxScaler or RobustScaler
- Scaling puts features on consistent ranges, which helps scale-sensitive models such as logistic regression and SVM
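
Swapping scalers can be reduced to a one-argument change. The sketch below is one way to do it: the `SCALERS` registry and the `scale_features` helper are assumptions for illustration. In each case the scaler is fitted on the training data only and then applied to the test data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical registry of interchangeable scalers
SCALERS = {
    "standard": StandardScaler,
    "minmax": MinMaxScaler,
    "robust": RobustScaler,
}


def scale_features(X_train, X_test, scaler_name="standard"):
    """Fit the chosen scaler on the training set and apply it to both sets."""
    if scaler_name not in SCALERS:
        raise ValueError(f"Scaler '{scaler_name}' is not supported.")
    scaler = SCALERS[scaler_name]()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)  # reuse training statistics
    return X_train_scaled, X_test_scaled, scaler


# Synthetic features stand in for X_train_res and X_test
rng = np.random.default_rng(0)
X_train_demo = rng.normal(50, 10, size=(80, 3))
X_test_demo = rng.normal(50, 10, size=(20, 3))
train_scaled, test_scaled, fitted = scale_features(X_train_demo, X_test_demo, "minmax")
print(train_scaled.min(axis=0), train_scaled.max(axis=0))
```

`RobustScaler` centers on the median and scales by the interquartile range, so it may be the better fit when price spikes produce outliers.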

In [113]:
# Scale the features using StandardScaler
scaler = StandardScaler()

In [115]:
# Fit and transform the training data, and transform the test data
X_train_scaled = scaler.fit_transform(X_train_res)
X_test_scaled = scaler.transform(X_test)

NameError: name 'X_train_res' is not defined

In [117]:
# Verify the scaled data
print("First 5 rows of scaled training data:")
print(X_train_scaled[:5])

First 5 rows of scaled training data:


NameError: name 'X_train_scaled' is not defined

#### 6: Model Training
- Select a model and train it
- Default model : (RandomForestClassifier)

In [120]:
# Dynamic model selection: swap models in and out for performance comparisons
def train_model(model_name, X_train, y_train, X_test):
    """
    Dynamically select and train a model, then predict on the test set.
    
    Args:
        model_name (str): The name of the model to use (e.g., "random_forest").
        X_train (np.array): Scaled training features.
        y_train (np.array): Training target labels.
        X_test (np.array): Scaled testing features.
    
    Returns:
        model: Trained model instance.
        y_pred: Predictions on the test set.
    """
    # Define supported models
    models = {
        "random_forest": RandomForestClassifier(random_state=42),
        "logistic_regression": LogisticRegression(random_state=42),
        "svm": SVC(random_state=42),
    }
    
    # Validate model_name
    if model_name not in models:
        raise ValueError(f"Model '{model_name}' is not supported.")
    
    # Get the selected model
    model = models[model_name]
    
    # Train the model
    model.fit(X_train, y_train)
    print(f"'{model_name}' training complete.")
    
    # Make predictions
    y_pred = model.predict(X_test)
    print(f"First 5 predictions for '{model_name}': {y_pred[:5]}")
    
    return model, y_pred


In [122]:
print(X_train_scaled.shape, y_train_res.shape)  # Row counts should match

NameError: name 'X_train_scaled' is not defined

In [124]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(random_state=42)
model.fit(X_train_scaled, y_train_res)  # Train the model
y_pred = model.predict(X_test_scaled)  # Make predictions
print("Direct training successful. First 5 predictions:", y_pred[:5])


NameError: name 'X_train_scaled' is not defined

In [126]:
model_name = "random_forest"  # Try other models like "logistic_regression" or "svm"
model, y_pred = train_model(model_name, X_train_scaled, y_train_res, X_test_scaled)


NameError: name 'X_train_scaled' is not defined

#### 7: Model Evaluation
- Accuracy, classification report, confusion matrix, and ROC curve.
- Implement additional/other metrics here
    - perhaps precision-recall curve or F1 score analysis
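
The ROC and precision-recall metrics listed above need class scores rather than hard predictions. The sketch below shows one way to collect them. The `evaluate_scores` helper and the synthetic dataset are assumptions for illustration. For `SVC`, scores would come from `decision_function`, or from `predict_proba` when the model is built with `probability=True`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    average_precision_score,
    f1_score,
    precision_recall_curve,
    roc_auc_score,
    roc_curve,
)
from sklearn.model_selection import train_test_split


def evaluate_scores(y_true, y_pred, y_score):
    """Collect threshold-free metrics and the points needed to plot both curves."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    return {
        "roc_auc": roc_auc_score(y_true, y_score),
        "average_precision": average_precision_score(y_true, y_score),
        "f1": f1_score(y_true, y_pred),
        "roc_curve": (fpr, tpr),
        "pr_curve": (recall, precision),
    }


# Synthetic data stands in for the scaled features and target
X_demo, y_demo = make_classification(n_samples=400, n_features=7, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
clf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Probability of the positive class (column 1) is the score for both curves
metrics = evaluate_scores(y_te, clf.predict(X_te), clf.predict_proba(X_te)[:, 1])
print({k: round(v, 4) for k, v in metrics.items() if k not in ("roc_curve", "pr_curve")})
```

The curve arrays can be passed straight to `plt.plot` for a visual comparison between models.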

In [129]:
# Accuracy, classification report, and confusion matrix
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

print("\nClassification Report:")
print(classification_report(y_test, y_pred))

print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

NameError: name 'y_test' is not defined

#### 8: Cross-Validation
- Evaluate the model across multiple folds to detect overfitting and average out the noise of any single train/test split
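
A minimal sketch of this step follows. It assumes rows are in chronological order, so it uses `TimeSeriesSplit` (each fold trains on the past and validates on the period that follows) rather than shuffled folds. The `cross_validate_model` helper and the synthetic data are assumptions for illustration. Scaling sits inside the pipeline so it is refitted on each training fold.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def cross_validate_model(model, X, y, n_splits=5):
    """Return one accuracy score per chronological fold."""
    pipeline = make_pipeline(StandardScaler(), model)
    cv = TimeSeriesSplit(n_splits=n_splits)
    return cross_val_score(pipeline, X, y, cv=cv, scoring="accuracy")


# Synthetic data stands in for the prepared features and target
X_demo, y_demo = make_classification(n_samples=300, n_features=7, random_state=42)
scores = cross_validate_model(
    RandomForestClassifier(n_estimators=50, random_state=42), X_demo, y_demo
)
print(f"Fold accuracies: {scores.round(3)}")
print(f"Mean: {scores.mean():.4f}, Std: {scores.std():.4f}")
```

A large spread between folds suggests the model's performance depends heavily on the market period it was trained on.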