# High Frequency Trading Algorithm

You have been tasked by the investment firm Renaissance High Frequency Trading (RHFT) to develop an automated trading strategy utilizing a combination of machine learning algorithms and high frequency algorithms. RHFT wants this new algorithm to be based on stock market data of the 30 stocks in the Dow Jones at the minute level and to conduct buys and sells every minute based on 1 min, 5 min, and 10 min Momentum. The CIO asked you to choose the Machine Learning Algorithm best suited for this task and wants you to execute the trades via Alpaca's API.

### Initial Set-Up

In [1]:
import os
from pathlib import Path
import alpaca_trade_api as tradeapi
import pandas as pd
import numpy as np
import datetime
import time
# from dotenv import load_dotenv

import warnings
warnings.filterwarnings('ignore')

In [2]:
# Load .env enviroment variables
# YOUR CODE HERE

In [3]:
# Set Alpaca API key and secret

api_key = 'PKKHA6KCMXSEQPSP9ZI9'
secret_key = '4yLfOkgUrO898gbNKPcXEH0Gx0taRLDLv9TdvAUJ'

In [4]:
# Create the Alpaca API object, specifying use of the paper trading account:

base_url = "https://paper-api.alpaca.markets"
alpaca = tradeapi.REST(api_key, secret_key, base_url, api_version='v2')
account = alpaca.get_account()

## Part 2: Train and Compare Multiple Machine Learning Algorithms

 In this section, you'll train each of the requested algorithms and compare performance. Be sure to use the same parameters and training steps for each model. This is necessary to compare each model accurately.

### Preprocessing Data

#### 1. Generate your feature data (`X`) and target data (`y`):
* Create a dataframe `X` that contains all the columns from the returns dataframe that will be used to predict `F_1_m_returns`.
* Create a variable, called `y`, that is equal 1 if `F_1_m_returns` is larger than 0. This will be our target variable.

In [5]:
# Load the dataset returns.csv and set the index to level_0 and time

returns = pd.read_csv('returns.csv', index_col = ['level_0', 'level_1'])
returns.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,F_1_m_returns,1_m_returns,5_m_returns,10_m_returns
level_0,level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
FB,2021-01-05 09:40:00-05:00,0.000814,7.4e-05,0.00126,0.004758
FB,2021-01-05 09:41:00-05:00,0.000887,0.000814,0.001889,0.004941
FB,2021-01-05 09:42:00-05:00,0.000628,0.000887,0.001999,0.003782
FB,2021-01-05 09:43:00-05:00,0.00048,0.000628,0.003408,0.00785
FB,2021-01-05 09:44:00-05:00,-0.001291,0.00048,0.002886,0.005416


In [6]:
# Create a separate dataframe for features and define the target variable as a binary target
X = returns.drop('F_1_m_returns', axis=1)

# Create the target variable
Y = returns['F_1_m_returns']
Y = np.where(Y>0, 1, 0)
Y = pd.Series(Y)

##### Note:
> Notice that we don't use shuffle when splitting the dataset into a training and testing dataset. 

> We want to keep the original ordering of the data, so we don't end up using observations in the future to predict past observations,

> This is a critical mistake known as look ahead bias.

#### 2. Use the train_test_split library to split the dataset into a training and testing dataset, with 70% used for testing
* Set the shuffle parameter to False, so that you use the first 70% for training to prvent look ahead bias.
* Make sure you have these 4 variables: `X_train`, `X_test`, `y_train`, `y_test`. 

In [7]:
# Import train_test_split 
from sklearn.model_selection import train_test_split

# Split the dataset without shuffling

X_train, X_test, y_train, y_test = train_test_split(X, Y, train_size=0.7, random_state=1, shuffle=False)

#### 3. Use the `Counter` function to test the distribution of the data. 
* The result of `Counter({1: 668, 0: 1194})` reveals the data is indeed unbalanced.

In [8]:
# Import the Counter function from the collections library
from collections import Counter

# Use Counter to count the number 1s and 0 in y_train
Counter(y_train)

Counter({1: 668, 0: 1193})

#### 4. Balance the dataset with the Oversampler libary, setting `random state= 1`.

In [9]:
# Import RandomOverSampler from the imblearn library
from imblearn.over_sampling import RandomOverSampler

# Use RandomOverSampler to resample the datase using random_state=1
ros = RandomOverSampler(random_state=1)
X_resampled, y_resampled = ros.fit_resample(X_train, y_train)

#### 5. Test the distribution once again with `Counter`. The new result of `Counter({1: 1194, 0: 1194})` shows the data is now balanced.

In [10]:
# Use Counter again to verify imbalance removed

Counter(y_resampled)

Counter({1: 1193, 0: 1193})

# Machine Learning

#### 1. The first cells in this section provide an example of how to fit and train your model using the `LogisticRegression` model from sklearn:
* Import select model.
* Instantiate model object.
* Fit the model to the resampled data - `X_resampled` and `y_resampled`.
* Predict the model using `X_test`.
* Print the classification report.

In [11]:
# Import classification_report from sklearn
from sklearn.metrics import classification_report

In [12]:
# Import LogisticRegression from sklearn
from sklearn.linear_model import LogisticRegression

# Create a LogisticRegression model and train it on the X_resampled data we created before
log_model = LogisticRegression()
log_model.fit(X_resampled, y_resampled)  

# Use the model you trained to predict using X_test
y_pred = log_model.predict(X_test)   

# Print out a classification report toevaluate performance
print(classification_report(y_test, y_pred, digits=4))

              precision    recall  f1-score   support

           0     0.5396    0.5197    0.5295       406
           1     0.5221    0.5420    0.5318       393

    accuracy                         0.5307       799
   macro avg     0.5309    0.5308    0.5307       799
weighted avg     0.5310    0.5307    0.5306       799



#### 2. Use the same approach as above to train and test the following ML Algorithms:
* [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
* [GradientBoostingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html)
* [AdaBoostClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html)
* [XGBClassifier](https://xgboost.readthedocs.io/en/latest/python/python_api.html)

#### RandomForestClassifier

In [13]:
# Import RandomForestClassifier from sklearn
from sklearn.ensemble import RandomForestClassifier

# Create a RandomForestClassifier model and train it on the X_resampled data we created before
rfc_model = RandomForestClassifier()
rfc_model.fit(X_resampled, y_resampled)

# Use the model you trained to predict using X_test
y_pred = rfc_model.predict(X_test)

# Print out a classification report to evaluate performance
print(classification_report(y_test, y_pred, digits=4))

              precision    recall  f1-score   support

           0     0.5051    0.6133    0.5539       406
           1     0.4869    0.3791    0.4263       393

    accuracy                         0.4981       799
   macro avg     0.4960    0.4962    0.4901       799
weighted avg     0.4961    0.4981    0.4912       799



#### GradientBoostingClassifier

In [14]:
# Import RandomForestClassifier from sklearn
from sklearn.ensemble import GradientBoostingClassifier

# Create a GradientBoostingClassifier model and train it on the X_resampled data we created before
gbc_model = GradientBoostingClassifier()
gbc_model.fit(X_resampled, y_resampled)

# Use the model you trained to predict using X_test
y_pred = gbc_model.predict(X_test)

# Print out a classification report to evaluate performance
print(classification_report(y_test, y_pred, digits=4))

              precision    recall  f1-score   support

           0     0.5180    0.5320    0.5249       406
           1     0.5026    0.4885    0.4955       393

    accuracy                         0.5106       799
   macro avg     0.5103    0.5103    0.5102       799
weighted avg     0.5104    0.5106    0.5104       799



#### AdaBoostClassifier

In [15]:
# Import RandomForestClassifier from sklearn
from sklearn.ensemble import AdaBoostClassifier

# Create a AdaBoostClassifier model and train it on the X_resampled data we created before
abc_model = AdaBoostClassifier()
abc_model.fit(X_resampled, y_resampled)

# Use the model you trained to predict using X_test
y_pred = gbc_model.predict(X_test)

# Print out a classification report to evaluate performance
print(classification_report(y_test, y_pred, digits=4))

              precision    recall  f1-score   support

           0     0.5180    0.5320    0.5249       406
           1     0.5026    0.4885    0.4955       393

    accuracy                         0.5106       799
   macro avg     0.5103    0.5103    0.5102       799
weighted avg     0.5104    0.5106    0.5104       799



#### XGBClassifier

In [16]:
# Import RandomForestClassifier from sklearn
from xgboost import XGBClassifier

# Create a XGBClassifier model and train it on the X_resampled data we created before
xgbc_model = XGBClassifier()
xgbc_model.fit(X_resampled, y_resampled)

# Use the model you trained to predict using X_test
y_pred = xgbc_model.predict(X_test)

# Print out a classification report to evaluate performance
print(classification_report(y_test, y_pred, digits=4))

              precision    recall  f1-score   support

           0     0.4906    0.5788    0.5311       406
           1     0.4656    0.3791    0.4180       393

    accuracy                         0.4806       799
   macro avg     0.4781    0.4790    0.4745       799
weighted avg     0.4783    0.4806    0.4754       799



### Evaluate the performance of each model


#### 1. Using the classification report for each model, choose the model with the highest precision for use in your algo-trading program.
#### 2. Save the selected model with the `joblib` libary to avoid retraining every time you wish to use it.

In [17]:
# Import the joblib library 
import joblib

# Use the library to save the model that you want to use for trading
joblib.dump(log_model, 'log_model.pkl')

['log_model.pkl']

## Part 3: Implement the strongest model using Apaca API

### Develop the Algorithm


#### 1. Use the provided code to ping the Alpaca API and create the DataFrame needed to feed data into the model.
   * This code will also store the correct feature data in `X` for later use.

In [18]:
# Create the list of tickers

ticker_list = ['FB','AMZN','AAPL','NFLX', 'GOOGL', 'MSFT', 'TSLA']
# Define Dates

beg_date = '2021-01-06'
end_date = '2021-01-06'

# Convert the date in a format the Alpaca API reqires
start =  pd.Timestamp(f'{beg_date} 09:30:00-0400', tz='America/New_York').replace(hour=9, minute=30, second=0).astimezone('GMT').isoformat()[:-6]+'Z'
end   =  pd.Timestamp(f'{end_date} 16:00:00-0400', tz='America/New_York').replace(hour=15, minute=0, second=0).astimezone('GMT').isoformat()[:-6]+'Z'
timeframe='1Min'

# Use iloc to get the last 10 mins every time we pull new data
prices = alpaca.get_barset(ticker_list, "minute", start=start, end=end).df.iloc[-11:]
prices.ffill(inplace=True)   

# Create an empty DataFrame for closing prices
df_closing_prices = pd.DataFrame()

# Fetch the closing prices of our tickers
df_closing_prices["FB"] = prices["FB"]["close"]
df_closing_prices["AMZN"] = prices["AMZN"]["close"]
df_closing_prices["AAPL"] = prices["AAPL"]["close"]
df_closing_prices["NFLX"] = prices["NFLX"]["close"]
df_closing_prices["GOOGL"] = prices["GOOGL"]["close"]
df_closing_prices['MSFT'] = prices['MSFT']["close"]
df_closing_prices['TSLA'] = prices['TSLA']["close"]

print(df_closing_prices.head(20))

                                FB      AMZN     AAPL    NFLX    GOOGL  \
time                                                                     
2021-01-06 14:50:00-05:00  264.610  3146.960  127.110  506.54  1721.82   
2021-01-06 14:51:00-05:00  264.630  3146.910  127.430  506.54  1721.82   
2021-01-06 14:52:00-05:00  264.830  3147.980  127.720  506.69  1723.67   
2021-01-06 14:53:00-05:00  264.525  3148.570  127.510  506.01  1723.67   
2021-01-06 14:54:00-05:00  264.560  3147.840  127.645  506.01  1720.84   
2021-01-06 14:55:00-05:00  264.880  3150.330  127.920  506.30  1720.60   
2021-01-06 14:56:00-05:00  264.965  3150.610  128.150  506.72  1721.10   
2021-01-06 14:57:00-05:00  264.980  3151.745  127.980  507.07  1720.07   
2021-01-06 14:58:00-05:00  265.000  3149.280  127.850  506.33  1720.07   
2021-01-06 14:59:00-05:00  265.360  3150.840  127.930  506.13  1720.48   
2021-01-06 15:00:00-05:00  264.840  3148.580  127.630  506.43  1720.48   

                              MSFT   

In [19]:
# Create list of momentums
list_of_momentums = [1,5,10]

for i in list_of_momentums:  
    # Compute percentage change for each one of the momentums in the momentum list
    returns_temp = df_closing_prices.pct_change(i)
    # Unstack the returns 
    returns_temp = pd.DataFrame(returns_temp.unstack())
    name = f'{i}_m_returns'
    returns_temp.rename(columns={0: name}, inplace = True)
    # Reset the index so we can merge based on index
    returns_temp.reset_index(inplace = True)
    # Merge newly computed returns with previously created returns
    if i ==1:
        returns = returns_temp
    else:
        returns = pd.merge(returns,returns_temp,left_on=['level_0', 'time'],right_on=['level_0', 'time'], how='left', suffixes=('_original', 'right'))

# Drop nulls and set index
returns.dropna(axis=0, how='any', inplace=True)
returns.set_index(['level_0', 'time'], inplace=True)

# Generate feature data and preview first 10 rows.
X = returns
X.head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,1_m_returns,5_m_returns,10_m_returns
level_0,time,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
FB,2021-01-06 15:00:00-05:00,-0.00196,-0.000151,0.000869
AMZN,2021-01-06 15:00:00-05:00,-0.000717,-0.000555,0.000515
AAPL,2021-01-06 15:00:00-05:00,-0.002345,-0.002267,0.004091
NFLX,2021-01-06 15:00:00-05:00,0.000593,0.000257,-0.000217
GOOGL,2021-01-06 15:00:00-05:00,0.0,-7e-05,-0.000778
MSFT,2021-01-06 15:00:00-05:00,-0.00077,-0.001213,0.000654
TSLA,2021-01-06 15:00:00-05:00,-0.000735,-0.001587,0.0105


#### 2. Using `joblib`, load the chosen model.

In [20]:
# Load the previously trained and saved model using joblib

final_model = joblib.load('log_model.pkl')

#### 3. Use the model file to make predicttions:
* Use `predict` on `X` and save this as `y_pred`.
* Convert `y_pred` to a DataFrame, setting the index to the index of `X`.
* Rename the column 0 to 'buy', be sure to set `inplace =True`.

In [21]:
# Use the model file to predict on X
y_pred = final_model.predict(X)

# Convert y_pred to a dataframe, set the index to the index of X
y_pred_df = pd.DataFrame(y_pred, index=X.index)

# Rename the column 0 to 'buy', be sure to set inplace =True
y_pred_df.rename(columns={0:'buy'}, inplace=True)

In [22]:
y_pred_df

Unnamed: 0_level_0,Unnamed: 1_level_0,buy
level_0,time,Unnamed: 2_level_1
FB,2021-01-06 15:00:00-05:00,1
AMZN,2021-01-06 15:00:00-05:00,1
AAPL,2021-01-06 15:00:00-05:00,1
NFLX,2021-01-06 15:00:00-05:00,0
GOOGL,2021-01-06 15:00:00-05:00,1
MSFT,2021-01-06 15:00:00-05:00,1
TSLA,2021-01-06 15:00:00-05:00,0


#### 4. Filter the stocks where 'buy' is equal to 1, saving the filter as `y_pred`.

In [23]:
# Filter the stocks where 'buy' is equal to 1

buy_df = y_pred_df[y_pred_df['buy'] == 1]
buy_df

Unnamed: 0_level_0,Unnamed: 1_level_0,buy
level_0,time,Unnamed: 2_level_1
FB,2021-01-06 15:00:00-05:00,1
AMZN,2021-01-06 15:00:00-05:00,1
AAPL,2021-01-06 15:00:00-05:00,1
GOOGL,2021-01-06 15:00:00-05:00,1
MSFT,2021-01-06 15:00:00-05:00,1


#### 5. Using the `y_pred` filter, create a dictionary called `buy_dict` and assign 'n' to each Ticker (key value) as a placeholder.

In [24]:
# Create dictionary from y_pred and assign a 'n' to each of them for now as a placeholder.
buy_dict = dict.fromkeys(y_pred_df.index.get_level_values(0), 'n')
buy_dict

{'FB': 'n',
 'AMZN': 'n',
 'AAPL': 'n',
 'NFLX': 'n',
 'GOOGL': 'n',
 'MSFT': 'n',
 'TSLA': 'n'}

#### 6. Obtain the total available equity in your account from the Alpaca API and store in a variable called `total_capital`. You will split the capital equally between all selected stocks per the CIO's request.

In [25]:
# Pull the total available equity in our account from the  Alpaca API

total_capital = int(account.equity)
total_capital

100000

In [26]:
# Compute capital per stock, divide equity in account by number of stocks
# Use Alpaca API to pull the equity in the account
if len(buy_dict) > 0:
    capital_per_stock = float(total_capital)/ len(buy_dict)
else:
    capital_per_stock = 0
print(f'Capital per stock: {capital_per_stock}')

Capital per stock: 14285.714285714286


#### 7. Use a for-loop to iterate through `buy_dict` to determine the number stocks you need to buy for each ticker.

In [27]:
# Use for loop to iterate through dictionary of buys 
# Determine the number stocks we need to buy for each ticker
for ticker in buy_dict:
    try:
        buy_dict[ticker] = int(capital_per_stock /int(prices[ticker].iloc[-1]['close']))
    except:
        pass

print(buy_dict)

{'FB': 54, 'AMZN': 4, 'AAPL': 112, 'NFLX': 28, 'GOOGL': 8, 'MSFT': 66, 'TSLA': 18}


#### 8. Cancel all previous orders in the Alpaca API (so you don't buy more than intended) and sell all currently held stocks to close all positions.

In [28]:
# Cancel all previous orders in the Alpaca API
alpaca.cancel_all_orders()

# Sell all currently held stocks to close all positions
alpaca.close_all_positions()

[]

#### 9. Iterate through `buy_dict` and send a buy order for each ticker with their corresponding number of shares.

In [29]:
# Iterate through the longlist object and send a buy order for each ticker with a corresponding number of shares:

for key,value in buy_dict.items():
    alpaca.submit_order(symbol=key, qty=value, side="buy", type="market", time_in_force="day")

In [30]:
# alpaca.cancel_all_orders()

### Automate the algorithm

#### 1. Make a function called `trade()` that incorporates all of the steps above.

In [31]:
# Add all of the steps conducted above into the function trade
def trade():

    ticker_list = ['FB','AMZN','AAPL','NFLX', 'GOOGL', 'MSFT', 'TSLA']
    # Notice that we remove the start and end variables since we want the latest prices.
    timeframe='1Min'
    # Use iloc to get the last 10 mins every time we pull new data
    prices = alpaca.get_barset(ticker_list, "minute").df.iloc[-11:]
    prices.ffill(inplace=True)   

    # Create and empty DataFrame for closing prices
    df_closing_prices = pd.DataFrame()

    # Fetch the closing prices of our tickers
    df_closing_prices["FB"] = prices["FB"]["close"]
    df_closing_prices["AMZN"] = prices["AMZN"]["close"]
    df_closing_prices["AAPL"] = prices["AAPL"]["close"]
    df_closing_prices["NFLX"] = prices["NFLX"]["close"]
    df_closing_prices["GOOGL"] = prices["GOOGL"]["close"]
    df_closing_prices['MSFT'] = prices['MSFT']["close"]
    df_closing_prices['TSLA'] = prices['TSLA']["close"]
    print(df_closing_prices.head())
    
    # Loop through momentums to build new DataFrame
    list_of_momentums = [1,5,10]
    for i in list_of_momentums:   
        returns_temp = df_closing_prices.pct_change(i)
        returns_temp = pd.DataFrame(returns_temp.unstack())
        name = f'{i}_m_returns'
        returns_temp.rename(columns={0: name}, inplace = True)
        returns_temp.reset_index(inplace = True)
        if i ==1:
            returns = returns_temp
        else:
            returns = pd.merge(returns,returns_temp,left_on=['level_0', 'time'],right_on=['level_0', 'time'], how='left', suffixes=('_original', 'right'))

    # Drop nulls and set index            
    returns.dropna(axis=0, how='any', inplace=True)
    returns.set_index(['level_0', 'time'], inplace=True)
    
    # Preprocess data for model
    X = returns
    final_model = joblib.load('log_model.pkl')
    y_pred = final_model.predict(X)
    y_pred_df = pd.DataFrame(y_pred, index=X.index)
    y_pred_df.rename(columns={0:'buy'}, inplace=True)
    buy_df = y_pred_df[y_pred_df['buy'] == 1]

    # Create the `buy_dict` object
    buy_dict = dict.fromkeys(y_pred_df.index.get_level_values(0), 'n')
    
    # Split capital between stocks and determine buy or sell
    total_capital = int(account.equity)
    if len(buy_dict) > 0:
        capital_per_stock = float(total_capital)/ len(buy_dict)
    else:
        capital_per_stock = 0
    for ticker in buy_dict:
        try:
            buy_dict[ticker] = int(capital_per_stock /int(prices[ticker].iloc[-1]['close']))
        except:
            pass
    
    # Cancel pending orders and close positions
    alpaca.cancel_all_orders()
    alpaca.close_all_positions()
   
    
    # Submit orders
    for key,value in buy_dict.items():
        alpaca.submit_order(symbol=key, qty=value, side="buy", type="market", time_in_force="day")


#### 2. Import Python's schedule module.

In [32]:
# Import Python's schedule module 
import schedule

ModuleNotFoundError: No module named 'schedule'

#### 3. Use the "schedule" module to automate the algorithm:
* Clear the schedule with `.clear()`.
* Define a schedule to run the trade function every minute at 5 seconds past the minute mark (e.g. `10:31:05`).
* Use the Alpaca API to check whether the market is open.
* Use run_pending() function inside schedule to execute the schedule you defined while the market is open

In [None]:
# Clear the schedule
schedule.clear()

# Define a schedule to run the trade function every minute at 5 seconds past the minute mark (e.g. 10:31:05)
schedule.every().minute.at(":05").do(trade)

# Use the Alpaca API to check whether the market is open
clock = alpaca.get_clock()

# Use run_pending() function inside schedule to execute the schedule you defined as long as the market is open
while clock.is_open:
    schedule.run_pending()
    time.sleep(1)