## crypto trading model template v1
- 1: Data Loading
- 2: Feature Engineering
- 3: Data Cleaning
- 4: Class Balancing
- 5: Feature Scaling
- 6: Model Training
- 7: Model Evaluation
- 8: Cross-Validation
#### Future Uses
- Hyperparameter Tuning
    - Add improved searches to optimize model here
- Model Comparisons
    - Framework is modular for seamless model swapping
- Extended Feature Engineering
    - Feature Engineering is endless
    - As the the model learns, I gain new insights and expand my field knowledge
      - This process is progressive and eventually we'll have a whole system build using this template

#### 1: Data Loading
- (get_historical_data): Easily swap out the data source or adjust parameters 
like coin_id, vs_currency, and days for different datasets.

In [42]:
# Import libraries
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score
from imblearn.over_sampling import SMOTE

In [44]:
# Function to fetch historical data
def get_historical_data(coin_id, vs_currency, days):
    url = f'https://api.coingecko.com/api/v3/coins/{coin_id}/market_chart'
    params = {'vs_currency': vs_currency, 'days': days}
    
    print(f"Requesting data for: {coin_id}, Currency: {vs_currency}, Days: {days}")
    
    response = requests.get(url, params=params)
    if response.status_code == 200:
        data = response.json()
        df = pd.DataFrame(data['prices'], columns=['timestamp', 'price'])
        df['timestamp'] = pd.to_datetime(df['timestamp'], unit='ms')
        return df
    else:
        print(f"Error fetching data: {response.status_code}")
        print(response.json())  # Print the error response for more detail
        return None

# Example: Fetch 180 days of Bitcoin data
data = get_historical_data('bitcoin', 'usd', 180)

# Check the length of your data and date range
print(len(data))  # Length of your data in rows (days)
print(data['timestamp'].min(), data['timestamp'].max())  # Check the date range

Requesting data for: bitcoin, Currency: usd, Days: 180
181
2024-06-14 00:00:00 2024-12-10 00:40:09


#### 2: Feature Engineering
- (generate_features): Add, remove, or tweak features.
- Append more calculations or move them around.

In [47]:
# Feature: RSI (Relative Strength Index)
def rsi(data, window=14):
    delta = data['price'].diff()
    gain = (delta.where(delta > 0, 0)).rolling(window=window).mean()
    loss = (-delta.where(delta < 0, 0)).rolling(window=window).mean()
    rs = gain / loss
    rsi = 100 - (100 / (1 + rs))
    return rsi

# Feature: 50-period Moving Average (SMA)
def moving_average(data, window=50):
    return data['price'].rolling(window=window).mean()

# Feature: MACD (Moving Average Convergence Divergence)
def macd(data, fast_period=12, slow_period=26, signal_period=9):
    fast_ema = data['price'].ewm(span=fast_period, min_periods=fast_period).mean()
    slow_ema = data['price'].ewm(span=slow_period, min_periods=slow_period).mean()
    macd = fast_ema - slow_ema
    signal = macd.ewm(span=signal_period, min_periods=signal_period).mean()
    return macd, signal

# Feature: Generate additional features based on existing ones
def generate_features(data):
    # Calculate RSI
    data['rsi'] = rsi(data)

    # Calculate 50-period SMA
    data['50_ma'] = moving_average(data)

    # Calculate MACD and MACD Signal
    data['macd'], data['macd_signal'] = macd(data)

    # Price and MACD conditions
    data['price_breakout'] = data['price'] > data['50_ma']  # Price above SMA (for trend confirmation)
    data['rsi_above_30'] = data['rsi'] > 30  # RSI increasing
    data['macd_above_signal'] = data['macd'] > data['macd_signal']  # MACD crossover
    data['near_resistance'] = data['price'] > (data['50_ma'] * 0.99)  # Price is near resistance (e.g., 1% away)
    
    # Candlestick pattern: Bullish Engulfing
    data['bullish_engulfing'] = (data['price'] > data['price'].shift(1)) & (data['price'].shift(1) < data['price'].shift(2))

    # Entry conditions
    data['pre_breakout_signal'] = data['near_resistance'] & data['rsi_above_30'] & data['macd_above_signal'] & data['bullish_engulfing']
    data['price_pullback'] = data['price'] < data['50_ma']  # If price is below 50-period SMA
    data['rsi_oversold'] = data['rsi'] < 30  # RSI < 30 for a potential pullback buy
    data['long_entry_signal'] = (data['price_breakout'] | data['pre_breakout_signal'] | data['price_pullback']) & data['rsi_above_30'] & data['macd_above_signal']
    
    # Exit conditions
    data['price_below_sma'] = data['price'] < data['50_ma']
    data['rsi_above_70'] = data['rsi'] > 70
    data['macd_below_signal'] = data['macd'] < data['macd_signal']
    data['long_exit_signal'] = data['price_below_sma'] & data['rsi_above_70'] & data['macd_below_signal']

    return data

#### 3: Data Cleaning
- (dropna): Handle any future missing data or NaNs, ensures the model doesn't encounter issues when scaling or fitting.

In [52]:
# Generate the features for the data
data_clean = generate_features(data)

# Check for NaNs after generating features
print(data_clean.isna().sum())  # Display number of NaN values in each column

# Drop rows where any of the key columns (RSI, MACD, 50 MA) have NaN values
data_clean = data_clean.dropna(subset=['rsi', 'macd', 'macd_signal', '50_ma', 'long_entry_signal', 'long_exit_signal'])

# Check the cleaned data
print(data_clean.head(60))  # Display the first 60 rows after cleaning
print(f"Number of rows in cleaned data: {len(data_clean)}")  # Check how many rows are left

timestamp               0
price                   0
rsi                    13
50_ma                  49
macd                   25
macd_signal            33
price_breakout          0
rsi_above_30            0
macd_above_signal       0
near_resistance         0
bullish_engulfing       0
pre_breakout_signal     0
price_pullback          0
rsi_oversold            0
long_entry_signal       0
price_below_sma         0
rsi_above_70            0
macd_below_signal       0
long_exit_signal        0
dtype: int64
     timestamp         price        rsi         50_ma         macd  \
49  2024-08-02  65357.529608  55.031492  63192.932944  1139.441799   
50  2024-08-03  61407.295474  32.523403  63087.072514   699.808546   
51  2024-08-04  60738.744925  28.815672  62981.625838   295.036621   
52  2024-08-05  58006.206587  20.546258  62817.940220  -239.939976   
53  2024-08-06  53956.261842  17.000575  62564.754700  -974.617324   
54  2024-08-07  55959.841074  26.257582  62354.743742 -1381.294049   
55 

#### 4: Class Balancing (SMOTE)
- Adjust sampling_strategy to explore ways to address class imbalance.
- Experiment with other resampling techniques here in the future when time permits
    - like NearMiss or RandomUnderSampler.

In [60]:
# Create the target column based on price movement (up or down)
data_clean.loc[:, 'target'] = (data_clean['price'].shift(-1) > data_clean['price']).astype(int)

# Drop rows with NaN values
data_clean = data_clean.dropna(subset=['rsi', 'macd', 'macd_signal', '50_ma', 'long_entry_signal', 'long_exit_signal', 'target'])

# Check the cleaned data
print(data_clean.head(60))  # Display the first 60 rows after cleaning
print(f"Number of rows in cleaned data: {len(data_clean)}")  # Check how many rows

     timestamp         price        rsi         50_ma         macd  \
49  2024-08-02  65357.529608  55.031492  63192.932944  1139.441799   
50  2024-08-03  61407.295474  32.523403  63087.072514   699.808546   
51  2024-08-04  60738.744925  28.815672  62981.625838   295.036621   
52  2024-08-05  58006.206587  20.546258  62817.940220  -239.939976   
53  2024-08-06  53956.261842  17.000575  62564.754700  -974.617324   
54  2024-08-07  55959.841074  26.257582  62354.743742 -1381.294049   
55  2024-08-08  55099.951811  25.862670  62154.556482 -1751.821859   
56  2024-08-09  61859.031599  42.990908  62093.595622 -1489.115926   
57  2024-08-10  60912.588533  36.794071  62014.953904 -1341.017145   
58  2024-08-11  60887.708616  36.436216  61951.275591 -1211.660522   
59  2024-08-12  58804.234500  33.336155  61842.562211 -1261.175592   
60  2024-08-13  59350.074333  36.455818  61765.136550 -1242.399983   
61  2024-08-14  60601.223178  40.001241  61769.896314 -1114.493630   
62  2024-08-15  5873

In [62]:
# Split into features (X) and target (y)
X = data_clean[['rsi', '50_ma', 'macd', 'price_breakout', 'rsi_above_30', 'macd_above_signal', 'price_pullback', 'rsi_oversold', 'pre_breakout_signal']]  # Add all relevant features here
y = data_clean['target']  # Target variable (price up or down)

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Apply SMOTE to balance the class distribution in the training set
smote = SMOTE(sampling_strategy='auto', random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)

# Check the class distribution after applying SMOTE
print(f"Class distribution after SMOTE: {y_train_res.value_counts()}")


Class distribution after SMOTE: target
1    56
0    56
Name: count, dtype: int64


#### 5: Feature Scaling
- Currently using the StandardScaler
    - Can swap to a different scaler here when time permits
    - something like: MinMaxScaler or RobustScaler

In [67]:
# Scale the features using StandardScaler
scaler = StandardScaler()

# Fit and transform the training data, and transform the test data
X_train_scaled = scaler.fit_transform(X_train_res)
X_test_scaled = scaler.transform(X_test)

#### 6: Model Training
- Current model : (RandomForestClassifier)
- Core Model Access: Swap out for other classification models here to compare results
    - Models we could try: Logistic Regression, SVM, etc..

In [69]:
# set current model
model = RandomForestClassifier(random_state=42)
# fit / train current model
model.fit(X_train_scaled, y_train_res)

In [72]:
# generate predictions
y_pred = model.predict(X_test_scaled)

#### 7: Model Evaluation
- Accuracy, classification report, confusion matrix, and ROC curve.
- Implement additional/other metrics here
    - perhaps precision-recall curve or F1 score analysis

In [78]:
# Accuracy, Classification, Confusion.
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Classification Report:")
print(classification_report(y_test, y_pred))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Accuracy: 0.4074074074074074
Classification Report:
              precision    recall  f1-score   support

           0       0.28      0.62      0.38         8
           1       0.67      0.32      0.43        19

    accuracy                           0.41        27
   macro avg       0.47      0.47      0.41        27
weighted avg       0.55      0.41      0.42        27

Confusion Matrix:
[[ 5  3]
 [13  6]]


#### 8: Cross-Validation
- Get an accurate view of model performance across multiple folds to reduce overfitting and signal noise