### Database

**Input**: 
* input/factors/YYYYMMDD.parquet - factor location deciles of all NYSE-listed stocks at one-minute frequency for trading day YYYYMMDD.
* input/returns/YYYYMMDD.parquet - returns of all NYSE-listed stocks at one-minute frequency for trading day YYYYMMDD.

**Output**: 
* output/data/threedeciles_breakpoint/value_weighted/YYYYMMDD.parquet - factor portfolios built from NYSE-listed stocks at one-minute frequency for trading day YYYYMMDD.

This notebook joins two datasets: one with stock returns and one with each stock's factor location decile. From the merged data, we build value-weighted factor portfolios using the lowest three and highest three deciles as breakpoints: for each factor, stocks with a decile of 3 or less are held long and stocks with a decile of 8 or more are held short.

In [18]:
# packages
import numpy as np
import pandas as pd

In [19]:
# hide warning messages
import warnings
warnings.filterwarnings("ignore")

In [20]:
# pd.set_option('display.max_columns', None)

In [21]:
# pd.set_option('display.max_rows', None)

### DateRange

In [22]:
# get the daily date range used to build the dataframes below (taken from the daily returns dataset)
returns_path = '../../../input/returns/daily.parquet'
returns = pd.read_parquet(returns_path)
daterange = returns.index
daterange

DatetimeIndex(['2005-01-03', '2005-01-04', '2005-01-05', '2005-01-06',
               '2005-01-07', '2005-01-10', '2005-01-11', '2005-01-12',
               '2005-01-13', '2005-01-14',
               ...
               '2019-12-17', '2019-12-18', '2019-12-19', '2019-12-20',
               '2019-12-23', '2019-12-24', '2019-12-26', '2019-12-27',
               '2019-12-30', '2019-12-31'],
              dtype='datetime64[ns]', length=3773, freq=None)

### Factors

In [23]:
# all factors
pNYSE_factors = ['pNYSE_size', 'pNYSE_value', 'pNYSE_prof', 'pNYSE_dur', 'pNYSE_valprof', 
                 'pNYSE_nissa', 'pNYSE_accruals', 'pNYSE_growth', 'pNYSE_aturnover', 
                 'pNYSE_gmargins', 'pNYSE_divp', 'pNYSE_ep', 'pNYSE_cfp', 'pNYSE_noa', 
                 'pNYSE_inv', 'pNYSE_invcap', 'pNYSE_igrowth', 'pNYSE_sgrowth', 
                 'pNYSE_lev', 'pNYSE_roaa', 'pNYSE_roea', 'pNYSE_sp', 'pNYSE_gltnoa', 
                 'pNYSE_divg', 'pNYSE_invaci', 'pNYSE_mom', 'pNYSE_indmom', 'pNYSE_valmom',
                 'pNYSE_valmomprof', 'pNYSE_shortint', 'pNYSE_mom12', 'pNYSE_momrev',
                 'pNYSE_lrrev', 'pNYSE_valuem', 'pNYSE_nissm', 'pNYSE_sue', 'pNYSE_roe',
                 'pNYSE_rome', 'pNYSE_roa', 'pNYSE_strev', 'pNYSE_ivol', 'pNYSE_betaarb',
                 'pNYSE_season', 'pNYSE_indrrev', 'pNYSE_indrrevlv', 'pNYSE_indmomrev',
                 'pNYSE_ciss', 'pNYSE_price', 'pNYSE_age', 'pNYSE_shvol',
                 'pNYSE_fscore', 'pNYSE_debtiss', 'pNYSE_repurch', 'pNYSE_exchsw', 'pNYSE_ipo']

### Functions

drop_ticker function: receives two parameters, percentile and df_returns. 

* percentile is the list of tickers in one position (long_position or short_position).
* df_returns is the returns dataframe of any trading day.

This function returns the same list, modified in place, with only the tickers that have a column in the returns dataframe.

In [24]:
def drop_ticker(percentile, df_returns):
    drop_tickers = []
    for ticker in percentile:
        if ticker not in df_returns.columns:
            drop_tickers.append(ticker)
    for ticker in drop_tickers:
        percentile.remove(ticker)
    return percentile
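
The same filtering can also be written without mutating the input list. The block below is only an illustrative sketch, not the notebook's code: `keep_listed` and the toy tickers are hypothetical names introduced for the example.

```python
import pandas as pd

def keep_listed(tickers, df_returns):
    """Return the tickers that have a column in df_returns, preserving order."""
    available = set(df_returns.columns)   # set lookup instead of scanning the columns each time
    return [t for t in tickers if t in available]

# toy one-minute returns for three hypothetical tickers
toy_returns = pd.DataFrame(
    {"AAA": [0.01, -0.02], "BBB": [0.00, 0.01], "CCC": [0.03, 0.00]}
)

kept = keep_listed(["AAA", "ZZZ", "CCC"], toy_returns)
print(kept)  # ['AAA', 'CCC']
```

Returning a new list leaves the caller's list untouched, which can make the function easier to reuse; the in-place version above behaves the same as long as the caller uses the returned or mutated list.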

portfolio_position function: receives three parameters, col, df_factors and df_returns.

* col is the percentile NYSE factor column (pNYSE_size, pNYSE_value, ..., pNYSE_ipo).
* df_factors is the factors dataframe of any day.
* df_returns is the returns dataframe of any trading day.

This function returns the long_position and short_position lists. Each list holds the tickers of the firms in that position that also have a matching column in the returns dataframe.

In [25]:
def portfolio_position(col, df_factors, df_returns):
    long_position = []      # tickers the portfolio buys
    short_position = []     # tickers the portfolio sells

    # first, drop the rows whose decile for this factor is NaN
    temp = df_factors[df_factors[col].notna()]

    # Breakpoints:
    #   lowest three deciles (col <= 3)  -> long position
    #   highest three deciles (col >= 8) -> short position
    # Stocks in the deciles in between enter neither leg.
    for permno in temp.index:
        if temp.loc[permno, col] <= 3:
            long_position.append(temp.loc[permno, 'TAQ_TICKER'])
        elif temp.loc[permno, col] >= 8:
            short_position.append(temp.loc[permno, 'TAQ_TICKER'])

    # keep only the tickers that have a column in the returns dataframe
    # (drop_ticker modifies both lists in place)
    drop_ticker(long_position, df_returns)
    drop_ticker(short_position, df_returns)

    return long_position, short_position
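
For larger cross-sections, the row-by-row loop can be replaced with boolean masks. The following is a minimal sketch under the same breakpoint rule (decile 3 or less goes long, 8 or more goes short); `split_by_decile` and the toy data are hypothetical, and only the column names `TAQ_TICKER` and `pNYSE_size` come from the notebook.

```python
import numpy as np
import pandas as pd

def split_by_decile(col, df_factors, df_returns):
    """Vectorized sketch: return (long, short) ticker lists for one factor column."""
    valid = df_factors[df_factors[col].notna()]             # drop missing deciles
    listed = valid["TAQ_TICKER"].isin(df_returns.columns)   # ticker has a returns column
    long_position = valid.loc[(valid[col] <= 3) & listed, "TAQ_TICKER"].tolist()
    short_position = valid.loc[(valid[col] >= 8) & listed, "TAQ_TICKER"].tolist()
    return long_position, short_position

# toy factors indexed by hypothetical permno values
toy_factors = pd.DataFrame(
    {"TAQ_TICKER": ["AAA", "BBB", "CCC", "DDD", "EEE"],
     "pNYSE_size": [1, 5, 9, np.nan, 10]},
    index=[10001, 10002, 10003, 10004, 10005],
)
# EEE has no returns column, so it is excluded from the short leg
toy_returns = pd.DataFrame({"AAA": [0.01], "BBB": [0.0], "CCC": [-0.01], "DDD": [0.02]})

longs, shorts = split_by_decile("pNYSE_size", toy_factors, toy_returns)
print(longs, shorts)  # ['AAA'] ['CCC']
```

Both lists keep the row order of the factors dataframe, which matters later when weights are matched to return columns by position.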

value_weight function: receives two parameters, percentile and df_factors.

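* percentile is the list of tickers in one position (long_position or short_position).
* df_factors is the factors dataframe of any day.

This function returns the value weights of the firms in this position, i.e. each firm's market cap divided by the total market cap of the position. The weights follow the row order of df_factors, which matches the order of the percentile list because portfolio_position builds that list in the same row order.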

In [26]:
def value_weight(percentile, df_factors):
    # market caps of the firms in this position, in df_factors row order
    in_position = df_factors['TAQ_TICKER'].isin(percentile)
    marketcap = df_factors.loc[in_position, 'MARKETCAP'].to_numpy()
    # each firm's share of the position's total market cap
    weight = marketcap / marketcap.sum()
    return weight
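
As a worked example of the weighting, the sketch below computes the weights for two toy firms. Indexing the market caps by ticker is an assumption added here (it requires unique tickers) so that the weights are aligned with the selected list by label rather than by position; the tickers and market caps are made up.

```python
import pandas as pd

# toy factors data: hypothetical tickers and market caps
toy_factors = pd.DataFrame(
    {"TAQ_TICKER": ["AAA", "BBB", "CCC"], "MARKETCAP": [100.0, 300.0, 600.0]}
)
selected = ["AAA", "BBB"]

# market caps in the order of the selected list (assumes unique tickers)
caps = toy_factors.set_index("TAQ_TICKER")["MARKETCAP"].reindex(selected)

# value weight = own market cap / total market cap of the position
weights = caps / caps.sum()
print(weights)
```

Here AAA receives 100 / 400 = 0.25 and BBB receives 300 / 400 = 0.75, so the weights sum to one.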

portfolio_formation function: receives four parameters, col, df, df_factors, df_returns.

* col is the percentile NYSE factor column (pNYSE_size, pNYSE_value, ..., pNYSE_ipo).
* df is the factor dataframe that will be filled with the value-weighted portfolios.
* df_factors is the factors dataframe of any day.
* df_returns is the returns dataframe of any trading day.

This function returns nothing. It only fills the df passed in with the portfolio returns.

In [27]:
def portfolio_formation(col, df, df_factors, df_returns):
    # get the long and short ticker lists from the portfolio_position function
    long_position, short_position = portfolio_position(col, df_factors, df_returns)

    # factor column name, e.g. 'pNYSE_size' -> 'Size Factor'
    new_col = col[6].upper() + col[7:] + ' Factor'

    # Fill df with the factor portfolio: the value-weighted return of the
    # long_position stocks minus the value-weighted return of the short_position stocks.
    long_leg = (df_returns[long_position] * value_weight(long_position, df_factors)).sum(axis=1)
    short_leg = (df_returns[short_position] * value_weight(short_position, df_factors)).sum(axis=1)
    df[new_col] = long_leg - short_leg
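
To make the long-minus-short computation concrete, here is a small self-contained sketch with made-up numbers. The helper `vw_return`, the tickers, the market caps and the two timestamps are all hypothetical; only the column-naming rule is taken from the function above.

```python
import pandas as pd

# toy one-minute returns for three hypothetical tickers
toy_returns = pd.DataFrame(
    {"AAA": [0.01, 0.02], "BBB": [0.03, 0.00], "CCC": [-0.01, 0.01]},
    index=pd.to_datetime(["2005-01-03 09:31", "2005-01-03 09:32"]),
)
caps = pd.Series({"AAA": 100.0, "BBB": 300.0, "CCC": 600.0})
longs, shorts = ["AAA", "BBB"], ["CCC"]

def vw_return(tickers):
    """Value-weighted return of one leg, minute by minute."""
    w = caps[tickers] / caps[tickers].sum()
    # weights are matched to return columns by ticker label
    return toy_returns[tickers].mul(w, axis=1).sum(axis=1)

factor = vw_return(longs) - vw_return(shorts)

# same naming rule as in portfolio_formation
col = "pNYSE_size"
new_col = col[6].upper() + col[7:] + " Factor"
print(new_col)  # Size Factor
print(factor.round(3).tolist())  # [0.035, -0.005]
```

In the first minute the long leg earns 0.25 * 0.01 + 0.75 * 0.03 = 0.025 and the short leg earns -0.01, so the factor return is 0.035.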

n_firms function: receives five parameters, col, df, date, df_factors and df_returns.

* col is the percentile NYSE factor column (pNYSE_size, pNYSE_value, ..., pNYSE_ipo).
* df is the number-of-firms dataframe that will be filled with the number of firms in each position.
* date is the trading day (a row label of df) whose counts are being filled.
* df_factors is the factors dataframe of any day.
* df_returns is the returns dataframe of any trading day.

This function returns nothing. It only fills the df passed in with the number of firms.

In [28]:
def n_firms(col, df, date, df_factors, df_returns):
    # get the long and short ticker lists from the portfolio_position function
    long_position, short_position = portfolio_position(col, df_factors, df_returns)

    # record the number of firms in each position for this date
    # (a single .loc assignment avoids chained indexing)
    df.loc[date, 'Long Position'] = len(long_position)
    df.loc[date, 'Short Position'] = len(short_position)
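
Writing to a single cell with `.loc[row, column]` avoids pandas' chained-assignment pitfalls, where an assignment may land on a temporary copy instead of the original dataframe. A minimal sketch with hypothetical toy dates and counts:

```python
import pandas as pd

dates = pd.to_datetime(["2005-01-03", "2005-01-04"])
counts = pd.DataFrame(index=dates, columns=["Long Position", "Short Position"])

# one indexing call selects both the row and the column, then assigns
counts.loc[dates[0], "Long Position"] = 3
counts.loc[dates[0], "Short Position"] = 2

print(counts)
```

Rows that are never filled stay NaN, which makes days that were skipped easy to spot afterwards.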

### Data Generator Process

In [29]:
for date in daterange:
    # convert the date to the YYYYMMDD format used in the file names
    day = date.strftime('%Y%m%d')
    try:
        # factors dataframe
        factors_path = f'../../../input/factors/{day}.parquet'
        factors = pd.read_parquet(factors_path)
        
        # returns dataframe
        returns_path = f'../../../input/returns/{day}.parquet'
        returns = pd.read_parquet(returns_path)
        
        # portfolio returns dataframe
        portfolio = pd.DataFrame(index=returns.index)

        # a loop filling the portfolio returns dataframes (intradaily)
        for pNYSE in pNYSE_factors:
            portfolio_formation(pNYSE, portfolio, factors, returns)

        # converting portfolio returns dataframes to parquet
        output_path = f'../../../output/data/threedeciles_breakpoint/value_weighted/{day}.parquet'
        portfolio.to_parquet(output_path)
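    except Exception as exc:
        # skip days that cannot be processed (e.g. a missing input file), but report them
        print(f'{day}: skipped ({exc!r})')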
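
The file names are derived from each date in the range. As a quick check, the sketch below shows that `strftime('%Y%m%d')` gives the same result as slicing the string form of the timestamp; the two dates are simply the first and last entries of the date range shown earlier.

```python
import pandas as pd

daterange = pd.to_datetime(["2005-01-03", "2019-12-31"])

# explicit format string
days = [d.strftime("%Y%m%d") for d in daterange]

# string slicing of 'YYYY-MM-DD HH:MM:SS'
sliced = [str(d)[:4] + str(d)[5:7] + str(d)[8:10] for d in daterange]

print(days)  # ['20050103', '20191231']
```

The format string states the intended layout directly, so it is less fragile than relying on character positions.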

In [None]:
pNYSE = pNYSE_factors[0]

In [None]:
n10 = pd.DataFrame(index=daterange, columns=['Long Position', 'Short Position'])
for date in daterange:
    # convert the date to the YYYYMMDD format used in the file names
    day = date.strftime('%Y%m%d')
    try:
        # factors dataframe
        factors_path = f'../../../input/factors/{day}.parquet'
        factors = pd.read_parquet(factors_path)
        
        # returns dataframe
        returns_path = f'../../../input/returns/{day}.parquet'
        returns = pd.read_parquet(returns_path)
        
        # filling the number of firms dataframes (daily)
        n_firms(pNYSE, n10, date, factors, returns)
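    except Exception as exc:
        # skip days that cannot be processed (e.g. a missing input file), but report them
        print(f'{day}: skipped ({exc!r})')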

In [None]:
n10

| | Long Position | Short Position |
|---|---|---|
| 2005-01-03 | 2960 | 526 |
| 2005-01-04 | 2936 | 526 |
| 2005-01-05 | 2919 | 526 |
| 2005-01-06 | 2893 | 526 |
| 2005-01-07 | 2885 | 526 |
| ... | ... | ... |
| 2019-12-24 | 1913 | 536 |
| 2019-12-26 | 1908 | 536 |
| 2019-12-27 | 1919 | 536 |
| 2019-12-30 | 1926 | 536 |
