## Raw Data Extraction

In this section we manually enter the core data for each trade: symbol, entry and exit date and time, entry price, exit price, and some additional tags. Using the symbol, date and time as keys, we then download 1-minute, 1-hour and 1-day bar data, plus fundamental data, for each symbol and date entered.

Sources:

* 1 minute bar data: Interactive Brokers intraday data and Yahoo Finance.
* 1 hour bar data: Yahoo Finance.
* 1 day bar data: Yahoo Finance.
* Fundamentals data: Yahoo Finance.

The Yahoo Finance data is downloaded with the yfinance library, and the Interactive Brokers data is downloaded through the Interactive Brokers API (this requires a valid IB account and a monthly data subscription).

Trades are picked manually, on a discretionary basis, in the Interactive Brokers TWS trading platform, which is why the 1-minute data is downloaded from Interactive Brokers. The 1-minute Yahoo Finance data serves as a backup and is used only when the Interactive Brokers data is unavailable or inaccurate.

Yahoo Finance data is relatively accurate, reliable and easy to download, so it is sufficient for the 1-hour and 1-day bars; downloading these timeframes from Interactive Brokers is unnecessary.


In [1]:
# Import libraries
from ibapi.client import EClient
from ibapi.wrapper import EWrapper
from ibapi.contract import Contract
import pandas as pd
import threading
import datetime
import time
import os  # local file paths and directory listing
import sys
import yfinance as yf




### Trades Snap-Shot Entry (daily manual data entry)


In [17]:


##### symbol #####
symbol = 'OCGN'

##### intended entry #####
intended_entry = 7.45

##### SL #####
SL = 7.23

##### Generic Exit #####
exit = 7.23

##### Entry Time #####
entry_time = datetime.datetime(2021,4,22,10,44)

#### General Exit Time #####
# NOTE: if the exit was not on the SL, always enter the candle that follows the exit signal candle
exit_time = datetime.datetime(2021,4,22,11,15)

#### VWAP TAG ##### 
#  BO,  SUPPORT,  FALSE
vwap_tag="SUPPORT"

#### Pattern #####
# W - wedge, F - flag, AT - ascending triangle, DT - descending triangle, ST - symmetrical triangle, R - rectangle, P - pennant
pattern ='ST'

#### Catalyst #####
# H  - hype, L - leading industry/sector, C - news catalyst
catalyst = "C"


#### Strategy ####
strategy ="BO"

#### Entry ####
entry = intended_entry
###############################################


### Check Errors

This cell validates the manually entered data to catch errors early and avoid later cleaning.
It checks that the dates, times and prices are consistent.
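
The same checks can also be collected in one function that reports every problem at once instead of stopping at the first. The following is only a sketch under assumptions: `validate_trade` is a hypothetical helper that is not part of this notebook, the session bounds are the regular 09:30-16:00 hours used in the cell below, and the SL check assumes a long trade, as the cell below does.

```python
import datetime

# Regular session bounds, matching the cell below (an assumption for other markets)
SESSION_OPEN = datetime.time(9, 30)
SESSION_CLOSE = datetime.time(16, 0)


def validate_trade(intended_entry, stop_loss, entry_time, exit_time):
    """Return a list of problems found in one manual trade entry (empty if none)."""
    problems = []
    if stop_loss >= intended_entry:
        problems.append("SL is at or above the intended entry")
    if entry_time > exit_time:
        problems.append("entry time is after exit time")
    if entry_time.date() != exit_time.date():
        problems.append("entry and exit are on different dates")
    for label, moment in (("entry", entry_time), ("exit", exit_time)):
        if not (SESSION_OPEN <= moment.time() < SESSION_CLOSE):
            problems.append(label + " time is outside 09:30-16:00")
    return problems


# Values taken from the OCGN entry above
issues = validate_trade(
    7.45, 7.23,
    datetime.datetime(2021, 4, 22, 10, 44),
    datetime.datetime(2021, 4, 22, 11, 15),
)
print(issues)  # []
```

Returning a list rather than calling `sys.exit()` is a design choice of this sketch; it lets all mistakes in one entry be fixed in a single pass.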


In [18]:
# errors:
if SL>=intended_entry:
    print("******** ERROR: SL higher than entry ********")
    sys.exit()
    

if (entry_time>exit_time) or (entry_time.date()!=exit_time.date()):
    print("******** ERROR: Incorrect entry time or exit time  ********")
    sys.exit()
    
day_start = datetime.datetime.combine(entry_time.date(),datetime.time(9,30))
day_end = datetime.datetime.combine(entry_time.date(),datetime.time(16,0))
    
if (entry_time<day_start or entry_time>=day_end) or (exit_time<day_start or exit_time>=day_end):
    print("******** ERROR: Invalid entry time or exit time, exceeds the trading hours range  ********")
    sys.exit()



### Store The Core and Fundamentals Data

Based on the given symbol and date, download the fundamental data and store it in the Fundamentals dataset for future analysis and prediction. The core data entered above and the freshly downloaded fundamental data are appended to the previously stored data in local Excel files.
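
The cell below skips a trade whose symbol, date and entry time are already stored. That check amounts to treating the three values as a composite key. A minimal sketch of the idea follows, using a plain dictionary instead of the Excel-backed DataFrames; `make_key` and `add_trade` are hypothetical names introduced only for illustration.

```python
import datetime


def make_key(symbol, entry_time):
    """Composite key: symbol, trade date and entry time of day."""
    return (symbol, entry_time.date(), entry_time.time())


def add_trade(store, symbol, entry_time, fields):
    """Insert the trade unless its key is already present; return True if added."""
    key = make_key(symbol, entry_time)
    if key in store:
        return False
    # Key columns and manually entered fields are merged into a single row
    store[key] = {"Symbol": symbol, "Date": entry_time.date(), **fields}
    return True


store = {}
when = datetime.datetime(2021, 4, 22, 10, 44)
first = add_trade(store, "OCGN", when, {"Entry": 7.45, "SL": 7.23})
second = add_trade(store, "OCGN", when, {"Entry": 7.45, "SL": 7.23})
print(first, second, len(store))  # True False 1
```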

In [19]:
# Create the new core data row and core keys.

core_key = {"Symbol":symbol,"Date": entry_time.date()}
core_dict = {"Entry Time": entry_time.time(),"Exit Time": exit_time.time(), "Intended Entry":intended_entry,
"Entry": entry, "SL": SL,"Exit": exit,"Pattern": pattern,"VWAP Tag": vwap_tag,"Strategy":strategy,"Catalyst":catalyst,"Download":0}

# import the datasets 
core_data =  pd.read_excel('Core Data.xlsx')  
f_data =  pd.read_excel('Fundamentals.xlsx')

# a function that merges two dictionaries
def merge_two_dicts(x, y):
    z = x.copy()   # start with x's keys and values
    z.update(y)    # modifies z with y's keys and values & returns None
    return z

# timestamp - Datetime type changes
timestamp_date = pd.Timestamp(entry_time.date())
strftime_time = entry_time.time().strftime("%H:%M:%S")

# add the Core Data dictionary to the Core Data dataset; warn if the key already exists to avoid duplicates
if len(core_data)==0 or len(core_data[(core_data["Symbol"]==symbol) & (core_data["Entry Time"]==strftime_time)
& (core_data["Date"]==timestamp_date)])==0:
    core_dict = merge_two_dicts(core_key, core_dict)
    # DataFrame.append was removed in pandas 2.0; pd.concat adds the new row instead
    core_data = pd.concat([core_data, pd.DataFrame([core_dict])], ignore_index=True)
    core_data.to_excel("Core Data.xlsx",index = False)
else:
    print("******** WARNING: Given key already inserted   ********")

# download the fundamentals and add them to the Fundamentals dataset (skipped if the symbol and date already exist)
if len(f_data)==0 or (len(f_data[(f_data["Symbol"]==symbol) & (f_data["Date"]==timestamp_date)])==0):
    ticker = yf.Ticker(symbol)
    fundamentals = ticker.info
    fundamentals = merge_two_dicts(core_key, fundamentals)
    # DataFrame.append was removed in pandas 2.0; pd.concat adds the new row instead
    f_data = pd.concat([f_data, pd.DataFrame([fundamentals])], ignore_index=True)
    f_data.to_excel("Fundamentals.xlsx",index = False)


### Intra-day Data collection

Helper class and functions for collecting data through the IB API.
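
Historical data requests to the IB API are asynchronous: bars arrive through the `historicalData` callback on the connection thread, and the `EWrapper` callback `historicalDataEnd` fires once a request is complete. The sketch below shows how a `threading.Event` could be used to wait for that completion signal. It is an illustration under assumptions, not this notebook's method: `FakeHistoryApp` and its `request` method are hypothetical stand-ins that simulate the callbacks so the example runs without a TWS connection.

```python
import threading


class FakeHistoryApp:
    """Hypothetical stand-in for TradeApp; simulates IB callbacks without a connection."""

    def __init__(self):
        self.data = {}
        self.done = {}  # reqId -> threading.Event set when the request completes

    def request(self, req_id, bars):
        """Simulate an asynchronous request: callbacks fire on another thread."""
        self.done[req_id] = threading.Event()

        def deliver():
            for bar in bars:
                self.historicalData(req_id, bar)
            self.historicalDataEnd(req_id, "", "")

        threading.Thread(target=deliver, daemon=True).start()

    def historicalData(self, reqId, bar):
        # Collect each incoming bar under its request id
        self.data.setdefault(reqId, []).append(bar)

    def historicalDataEnd(self, reqId, start, end):
        # Signal that all bars for this request have been delivered
        self.done[reqId].set()


app_demo = FakeHistoryApp()
app_demo.request(0, [{"Close": 7.40}, {"Close": 7.45}])
finished = app_demo.done[0].wait(timeout=5)  # blocks until the end callback or the timeout
print(finished, len(app_demo.data[0]))  # True 2
```

Compared with a fixed pause after each request, waiting on an event returns as soon as the data is complete and reports a timeout explicitly when it is not.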

In [2]:

class TradeApp(EWrapper, EClient): 
    def __init__(self): 
        EClient.__init__(self, self) 
        self.data = {}
        
    def historicalData(self, reqId, bar):
        if reqId not in self.data:
            self.data[reqId] = [{"Datetime":bar.date,"Open":bar.open,"High":bar.high,"Low":bar.low,"Close":bar.close,"Volume":bar.volume}]
        else:
            self.data[reqId].append({"Datetime":bar.date,"Open":bar.open,"High":bar.high,"Low":bar.low,"Close":bar.close,"Volume":bar.volume})
        #print("reqID:{}, date:{}, open:{}, high:{}, low:{}, close:{}, volume:{}".format(reqId,bar.date,bar.open,bar.high,bar.low,bar.close,bar.volume))

def usTechStk(symbol,sec_type="STK",currency="USD",exchange="ISLAND"):
    contract = Contract()
    contract.symbol = symbol
    contract.secType = sec_type
    contract.currency = currency
    contract.exchange = exchange
    return contract 

def histData(req_num,contract,endDate,duration,candle_size):
    """extracts historical data"""
    app.reqHistoricalData(reqId=req_num, 
                          contract=contract,
                          endDateTime=endDate,
                          durationStr=duration,
                          barSizeSetting=candle_size,
                          whatToShow='TRADES',
                          useRTH=1,
                          formatDate=1,
                          keepUpToDate=0,
                          chartOptions=[])  # EClient method that requests historical bar data

# storing trade app object data (the downloaded data) in a dataframe 
def dataDataframe(symbols,TradeApp_obj):
    "returns extracted historical data in dataframe format"
    df_data = {}
    i=0
    for symbol in symbols:
        df_data[symbol] = pd.DataFrame(TradeApp_obj.data[i])
        df_data[symbol].set_index("Datetime",inplace=True)
        i+=1
    return df_data

# convert string to datetime and validate the time
def datetimeCon(x):
    date_time_obj = datetime.datetime.strptime(x, '%Y%m%d %H:%M:%S')
    if date_time_obj.time()< datetime.time(9,30) or date_time_obj.time()>datetime.time(16,0):
        print("******** WARNING: Unusual time, change IB time-zone ********")
        print(date_time_obj.time())
        sys.exit()
    return date_time_obj

# connect to web socket with multi threading
def websocket_con():
    app.run()

### Download Raw Data

In this cell we download the bar data from the sources mentioned above.
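
A regular session has 390 one-minute bars, from 09:30 through 15:59, and the cell below warns when the merged data has a different length. To see which minutes are missing, the expected labels can be generated and compared with the ones present. This is a minimal sketch: `expected_minutes` and `missing_minutes` are hypothetical helpers, and the label format (date, two spaces, time) is assumed to match the strings the cell below builds with `strftime`.

```python
import datetime


def expected_minutes(day, open_time=datetime.time(9, 30), minutes=390):
    """Return the one-minute bar labels of a regular session as strings."""
    start = datetime.datetime.combine(day, open_time)
    return [
        (start + datetime.timedelta(minutes=i)).strftime("%Y%m%d  %H:%M:%S")
        for i in range(minutes)
    ]


def missing_minutes(labels, day):
    """List the expected bar labels that are absent from `labels`."""
    present = set(labels)
    return [label for label in expected_minutes(day) if label not in present]


day = datetime.date(2021, 4, 22)
labels = expected_minutes(day)
labels.remove("20210422  10:44:00")  # simulate one missing bar
print(len(expected_minutes(day)), missing_minutes(labels, day))
# 390 ['20210422  10:44:00']
```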

In [3]:
#import the Core Data
core_data =  pd.read_excel('Core Data.xlsx') 

# Filter the data to get only the rows where data has not yet been downloaded (where the column Download is 0)
download_data = core_data[core_data["Download"]==0][["Symbol","Date"]].drop_duplicates().reset_index()
print(download_data)


# connect to IB server
app = TradeApp()
app.connect(host='127.0.0.1', port=7497, clientId=23) #port 4002 for ib gateway paper trading/7497 for TWS paper trading
con_thread = threading.Thread(target=websocket_con, daemon=True)
con_thread.start()
time.sleep(1) # some latency added to ensure that the connection is established


# make an API call to download the 1 minute data from IB
for index,row in download_data.iterrows():    
    date = datetime.datetime.combine(row["Date"].date(),datetime.time(23,59)).strftime("%Y%m%d %H:%M:%S")
    
    
    symbol = row["Symbol"] 
    dict_download = {"Symbol": symbol,"Date":date}
    
    print(str(symbol)+ " " + date)
    histData(index,usTechStk(symbol),date,'1 D', '1 min')
    time.sleep(5)

        



#extract and store IB 1 minute data in dataframes
historicalData = dataDataframe(download_data["Symbol"].values,app)

# store the data locally:
# the intraday IB data from the broker, the intraday, H1 and D1 data from Yahoo Finance,
# and a merged intraday dataset that takes the prices from IB and only the volume from Yahoo

#the local directories for the bar data
directory1 = 'Yahoo Intraday Data'
directory2 = 'Yahoo D1 Data'
directory3 = 'Yahoo H1 Data'
directory4 = 'IB Intraday Data'
directory5 = 'Merged Intraday Data'
directory6 = 'SPY Intraday Data'

time_period1 = 365 # 1 year
time_period2 = 30 #1 month

# for every symbol download the intraday, 1h and 1d bar data
for key in historicalData:
    print(key)
    
    
    date_timestamp = historicalData[key].index[0].split()[0]
    date = datetime.datetime.strptime(date_timestamp, '%Y%m%d').date()
    
    # download the 1 minute data from Yahoo Finance
    minute_data_yahoo =  yf.download(key, start = date, end =  (date + datetime.timedelta(days=1)), interval = "1m" )  
    
    # Remove time zone from date-time index value
    minute_data_yahoo.index = minute_data_yahoo.index.map(lambda x: datetime.datetime.combine(x.date(),x.time()).strftime("%Y%m%d  %H:%M:%S"))
    
    # download the d1 and 1h bar data from Yahoo Finance
    d1_data=  yf.download(key, start = (date - datetime.timedelta(days=time_period1)), end =  (date + datetime.timedelta(days=1)), interval = "1d" )
    h1_data=  yf.download(key, start = (date - datetime.timedelta(days=time_period2)), end =  (date + datetime.timedelta(days=1)), interval = "1h" )

    #create strings for file names
    file_name = key + ' ' + date_timestamp + '.xlsx'
    spy_data_file_name = "SPY "+date_timestamp + ".xlsx"
    
    spy_files =os.listdir(directory6)
    
    # download the SPY 1 minute data if it is not stored yet for this date
    if spy_data_file_name not in spy_files:
        spy_minute_data_yahoo = yf.download("SPY", start = date, end = (date + datetime.timedelta(days=1)), interval = "1m")
        spy_minute_data_yahoo.index = spy_minute_data_yahoo.index.map(lambda x: datetime.datetime.combine(x.date(),x.time()).strftime("%Y%m%d  %H:%M:%S"))

        path6 = os.path.join(directory6, spy_data_file_name)
        spy_minute_data_yahoo.to_excel(path6)
        

    
    # merge the data so that high,low, open, close are taken from IB and Volume is taken from Yahoo
    merged_data = pd.merge(historicalData[key].drop(columns = ["Volume"]),minute_data_yahoo[["Volume"]],on='Datetime',how='left')
    merged_data["Volume"] =merged_data["Volume"].fillna(0) 
    
    # every session is 390 minutes starting at 9:30 and ending at the end of 15:59
    # leave a warning if some minutes are missing
    minutes_per_session = 390
    if len(merged_data)!=minutes_per_session:
        print("******** WARNING: The length is not 390, please fix the length of symbol " + key + " ********")
        
    
    
    #local paths to the directories
    path1 = os.path.join(directory1, file_name)
    path2 = os.path.join(directory2, file_name)
    path3 = os.path.join(directory3, file_name)
    path4 = os.path.join(directory4, file_name)
    path5 = os.path.join(directory5, file_name)
    
    
    # save the datasets in excel files in the paths
    minute_data_yahoo.to_excel(path1)
    d1_data.to_excel(path2)
    h1_data.to_excel(path3)
    historicalData[key].to_excel(path4)
    merged_data.to_excel(path5)

# update the core_data table, change download to 1 from 0 to signal that the data has been downloaded
core_data["Download"] =  core_data["Download"].replace({0:1})
core_data.to_excel('Core Data.xlsx',index = False) 


ERROR -1 2104 Market data farm connection is OK:usfarm.nj
ERROR -1 2104 Market data farm connection is OK:usfarm
ERROR -1 2106 HMDS data farm connection is OK:euhmds
ERROR -1 2106 HMDS data farm connection is OK:ushmds.nj
ERROR -1 2106 HMDS data farm connection is OK:fundfarm
ERROR -1 2106 HMDS data farm connection is OK:ushmds
ERROR -1 2158 Sec-def data farm connection is OK:secdefil


   index Symbol       Date
0     59   FCEL 2021-04-22
1     61   FWAA 2021-04-22
2     62   OCGN 2021-04-22
FCEL 20210422 23:59:00
FWAA 20210422 23:59:00
OCGN 20210422 23:59:00
FCEL
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
FWAA
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
OCGN
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
