### Database

**Input**: 
* input/factors/YYYYMMDD.csv - factor location deciles of all NYSE-listed stocks at one-minute frequency for the day YYYYMMDD.
* input/returns/YYYYMMDD.csv - returns of all NYSE-listed stocks at one-minute frequency for the day YYYYMMDD.

**Output**: 
* output/data/median_breakpoint/value_weighted/YYYYMMDD.csv - factor portfolios built from NYSE-listed stocks at one-minute frequency for the day YYYYMMDD.

The purpose of this notebook is to join two databases, one containing the stock returns and the other containing the factor location decile of each stock. From this merge, we create value-weighted factor portfolios using the median as the breakpoint.

In [1]:
# packages
import numpy as np
import pandas as pd

In [2]:
# hide warning messages
import warnings
warnings.filterwarnings("ignore")

In [3]:
# pd.set_option('display.max_columns', None)

In [4]:
# pd.set_option('display.max_rows', None)

### Functions

portfolio_position function: receives three parameters, col, df_factors and df_returns.

* col is the percentile NYSE factor column (pNYSE_size, pNYSE_value, ..., pNYSE_ipo)
* df_factors is the factors dataframe of any day.
* df_returns is the returns dataframe of any trading day.

This function returns two lists, long_position and short_position. Each list holds the tickers of the firms in that position that also have a matching column in the returns dataframe.

In [5]:
def portfolio_position(col, df_factors, df_returns):
    long_position = []      # tickers of the stocks we buy
    short_position = []     # tickers of the stocks we sell

    # first, drop the rows that have NaN as the decile of this factor
    temp = df_factors[df_factors[col].notna()]

    # classify each stock by its decile:
    #   strictly above the median decile 5 -> long position
    #   at or below the median decile 5    -> short position
    for permno in temp.index:
        if temp.loc[permno, col] > 5:
            long_position.append(temp.loc[permno, 'TAQ_TICKER'])
        else:
            short_position.append(temp.loc[permno, 'TAQ_TICKER'])

    # next, check every ticker in each list against the columns of the
    # returns dataframe and drop the tickers that have no returns column

    drop_long_position = []
    drop_short_position = []

    for ticker in long_position:
        if ticker not in df_returns.columns:
            drop_long_position.append(ticker)

    for ticker in drop_long_position:
        long_position.remove(ticker)

    for ticker in short_position:   
        if ticker not in df_returns.columns:
            drop_short_position.append(ticker)

    for ticker in drop_short_position:
        short_position.remove(ticker)

    return(long_position, short_position)
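
The selection rule above can also be written without explicit loops. The following is a minimal sketch, not the notebook's own implementation: the helper name `split_by_median` and the toy tickers and deciles are hypothetical, and it assumes the same `TAQ_TICKER` column and decile column layout used by portfolio_position.

```python
import numpy as np
import pandas as pd

def split_by_median(col, df_factors, df_returns):
    """Hypothetical vectorized counterpart of portfolio_position()."""
    # keep rows that have a decile and whose ticker is a column of df_returns
    has_decile = df_factors[col].notna()
    has_returns = df_factors['TAQ_TICKER'].isin(df_returns.columns)
    valid = df_factors[has_decile & has_returns]

    # strictly above decile 5 -> long, otherwise -> short
    is_long = valid[col] > 5
    long_position = valid.loc[is_long, 'TAQ_TICKER'].tolist()
    short_position = valid.loc[~is_long, 'TAQ_TICKER'].tolist()
    return long_position, short_position

# toy data (made up for illustration)
toy_factors = pd.DataFrame(
    {
        'TAQ_TICKER': ['AAA', 'BBB', 'CCC', 'DDD', 'EEE'],
        'pNYSE_size': [9, 5, 2, np.nan, 7],
    },
    index=[10001, 10002, 10003, 10004, 10005],
)
toy_returns = pd.DataFrame({'AAA': [0.01, -0.02], 'BBB': [0.0, 0.01], 'CCC': [0.03, 0.0]})

long_position, short_position = split_by_median('pNYSE_size', toy_factors, toy_returns)
print(long_position, short_position)
```

In this toy case DDD is dropped because its decile is NaN, EEE is dropped because it has no returns column, and BBB lands in the short list because decile 5 is not strictly above the median.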

portfolio_formation function: receives four parameters, col, df, df_factors, df_returns.

* col is the percentile NYSE factor column (pNYSE_size, pNYSE_value, ..., pNYSE_ipo).
* df is the factor dataframe which will be filled with the value weighted portfolios.
* df_factors is the factors dataframe of any day.
* df_returns is the returns dataframe of any trading day.

This function returns nothing. It fills the df passed in with the portfolio returns, in place.

In [6]:
def portfolio_formation(col, df, df_factors, df_returns):
    # we use the portfolio_position() function to get the long_position and short_position list
    long_position, short_position = portfolio_position(col, df_factors, df_returns)
    
    # the factor column name drops the 'pNYSE_' prefix and capitalizes
    # the rest, e.g. 'pNYSE_size' -> 'Size Factor'
    new_col = col[6].upper() + col[7:] + ' Factor'

    """
    Now, we need to create a weight vector to form the value weighted portfolio
    """
    # market cap of companies in long_position
    marketcap_long_position = list(df_factors[df_factors['TAQ_TICKER'].isin(long_position)]['MARKETCAP'].values)

    # market cap of companies in short_position
    marketcap_short_position = list(df_factors[df_factors['TAQ_TICKER'].isin(short_position)]['MARKETCAP'].values)

    # sum of market cap of companies in long_position
    sum_marketcap_long_position = sum(marketcap_long_position)

    # sum of market cap of companies in short_position
    sum_marketcap_short_position = sum(marketcap_short_position)

    # value weight of companies in long_position
    weight_long_position = marketcap_long_position/sum_marketcap_long_position

    # value weight of companies in short_position
    weight_short_position = marketcap_short_position/sum_marketcap_short_position
    """
    The portfolio is formed subtracting the weighted returns (value-weighted) of short_position stocks
    from the weighted returns (value-weighted) of long_position stocks
    """
    df[new_col] = (df_returns[long_position]*weight_long_position).sum(axis=1) - (df_returns[short_position]*weight_short_position).sum(axis=1)
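
To make the weighting concrete, here is a small worked example of the same long-minus-short calculation. It is only a sketch on made-up numbers: the tickers, market caps and returns are hypothetical, and the weights are kept as a pandas Series indexed by ticker (an assumption of this sketch) so that each weight is matched to its returns column by label.

```python
import pandas as pd

# made-up market caps and position lists
caps = pd.Series({'AAA': 300.0, 'BBB': 100.0, 'CCC': 150.0, 'DDD': 50.0})
long_position = ['AAA', 'BBB']
short_position = ['CCC', 'DDD']

# made-up one-minute returns, one column per ticker
toy_returns = pd.DataFrame({
    'AAA': [0.010, 0.000],
    'BBB': [0.020, -0.010],
    'CCC': [0.000, 0.004],
    'DDD': [0.040, 0.000],
})

# value weights: each market cap divided by the total cap of its leg
w_long = caps[long_position] / caps[long_position].sum()      # AAA 0.75, BBB 0.25
w_short = caps[short_position] / caps[short_position].sum()   # CCC 0.75, DDD 0.25

# weighted return of each leg, then long minus short
long_leg = toy_returns[long_position].mul(w_long, axis=1).sum(axis=1)
short_leg = toy_returns[short_position].mul(w_short, axis=1).sum(axis=1)
factor = long_leg - short_leg
print(factor)
```

In the first minute the long leg earns 0.75 * 0.010 + 0.25 * 0.020 = 0.0125 and the short leg earns 0.25 * 0.040 = 0.0100, so the factor return is 0.0025.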

### Data Generator Loop

This loop will generate daily data (one .csv for each day) with all one-minute-frequency portfolio returns (one for each factor).

In [7]:
# we need to create a date range for the period we have data
bdates = pd.bdate_range('2005-01-01', '2019-12-31')

# formatting the dates the same way as the csv names (YYYYMMDD)
bdates_ = bdates.strftime('%Y%m%d').tolist()

In [8]:
for day in bdates_:
    try:
        # factors dataframe
        factors_path = f'../../../input/factors/{day}.csv'
        factors = pd.read_csv(factors_path, index_col=0)
        # dropping all firms that don't have a TAQ_TICKER, because it is the only variable that links the factors to the returns database
        factors = factors[factors['TAQ_TICKER'] != '<undefined>'].copy()
        # fixing the marketcap column values (taking the absolute value)
        factors['MARKETCAP'] = factors['MARKETCAP'].abs()
        
        # returns dataframe
        returns_path = f'../../../input/returns/{day}.csv'
        returns = pd.read_csv(returns_path, index_col=0)
        
        # portfolio returns dataframe
        portfolio = pd.DataFrame(index=returns.index)

        # a loop filling the intraday portfolio returns dataframe;
        # every column from the fifth one onwards is treated as a factor decile column
        pNYSE_factors = factors.columns[4:]
        for pNYSE in pNYSE_factors:
            portfolio_formation(pNYSE, portfolio, factors, returns)

        # converting portfolio returns dataframes to csv
        output_path = f'../../../output/data/median_breakpoint/value_weighted/{day}.csv'
        portfolio.to_csv(output_path, sep=',', encoding='utf-8')
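    except FileNotFoundError:
        # no input files for this date (e.g. a market holiday): skip it
        continue
    except Exception as err:
        # keep going, but report the day instead of failing silently
        print(f'{day}: skipped ({err!r})')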
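
`pd.bdate_range` yields every weekday, so dates without trading (market holidays, for instance) presumably have no input files. One way to separate "no data for this day" from genuine errors is to filter the dates before the loop runs. The sketch below is an illustration only: the helper `available_days` is hypothetical, and the demo builds a throwaway directory instead of using the real input folders.

```python
import tempfile
from pathlib import Path

import pandas as pd

def available_days(days, factors_dir, returns_dir):
    """Hypothetical helper: keep the days that have both input files."""
    factors_dir, returns_dir = Path(factors_dir), Path(returns_dir)
    return [
        day for day in days
        if (factors_dir / f'{day}.csv').is_file()
        and (returns_dir / f'{day}.csv').is_file()
    ]

# demo on a temporary directory so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    factors_dir = Path(tmp) / 'factors'
    returns_dir = Path(tmp) / 'returns'
    factors_dir.mkdir()
    returns_dir.mkdir()

    # 20050103 has both files, 20050104 lacks its returns file
    (factors_dir / '20050103.csv').write_text('placeholder')
    (returns_dir / '20050103.csv').write_text('placeholder')
    (factors_dir / '20050104.csv').write_text('placeholder')

    days = pd.bdate_range('2005-01-03', '2005-01-05').strftime('%Y%m%d').tolist()
    usable = available_days(days, factors_dir, returns_dir)

print(usable)
```

With the dates filtered up front, any exception raised inside the loop points to a real problem in that day's data rather than to a missing file.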