# Universe Selection

One of the steps in the quantfundamental workflow is the selection of the universe.

We need to restrict the investment universe to those companies that matches certain criteria.

In our case, we are going to identify our universe of companies applying the following process:

1. We are going to compute the market capitalization of each company. $mkt\_cap = total\_shares * stock\_price$

    We will need the account: Account 1.89.03 - Total Shares  
    
    
2. We are going to compute the liquidity of each company for a 120-days time window. How much money has been trade during the period. $liquidity=\sum_{t=1}^{k}volume_traded_t\ast stock_price_t$



3. We are going to create two ranks of companies, one with Large-cap companies and other with all the Mid-cap companies, ordered by its $liquidity$ from high to low liquid companies. 
    
    3.1. Any public company with a market cap above \\$10 billion is generally considered to be a large cap company.
    
    3.2. Any public company with a market cap between US\\$ 2 and US\\$ 10 billion is generally considered to be a medium cap company.


4. We select the Top 12 companies of each group.


5. We generate a quality index for each company using the Greenblatt formula for which is known as __Wonderful Companies__: _"when a business earning a high return on equity"_ (ROC).  
    $$ROC=\frac{earnings}{capital}$$
    or also $$ROC=\frac{earnings}{fixed\_assets + net\_working\_capital - cash}$$

    where;

    $$net\_working\_capital = current\_assets - current\_liabilities$$
    
    To compute ROC we will need access to the following company accounts:
    
    - Account 1.01 - Current Assets (needed by the ROC formula)
    - Account 1.01.01 - Cash (needed by the ROC formula)
    - Account 1.02 - Fixed Assets (needed by the ROC formula)
    - Account 2.01 - Current Liabilities
    - Account 3.09 - Net income (needed by the ROC formula)


6. We generate another quality index using the Greenblatt formula for which is known as __Wonderful Companies__: _"when is a company with fair prices"_ as the Earnings Yield.

    $$Earnings\ Yieal=\frac{EBIT}{TEV}$$
    
	$$ EY=\frac{EBIT}{marketcap + total\_debt - excess\_cash + stock}$$
    
    where 
    
    $excess\_cash = total\_cash - MAX(current\_liabilities - current\_non\_cash\_assets, 0)$
    
    $total\_cash = cash + short\_term\_investments$
    
    $current\_non\_cash\_assets = current\_assets - total\_cash$
    
    $total\_debt = (current\_liabilities + fixed\_liabilities) - (cash + cash\_equivalents)$
    
    To compute Earnings Yield we will need access to the following company accounts:
    
    - Account 1.01 - Current Assets
    - Account 1.01.01 - Cash
    - Account 1.01.02 - Short term investments
    - Account 1.01.04 - Stock
    - Account 1.89.03 - Total Shares    
    - Account 2.01 - Current Liabilities
    - Account 2.02 - Fixed Liabilities    
    - Account 3.05 - EBIT
    
    and the Stock Price.
    
    
7. We combine both quality indexes (ROC * Earnings Yield) and we order descendently the rankings, selecting the first 7 companies of each group.

Now we have our __14 companies__ that will be part of our investment portfolio.

In [5]:
import IPython
IPython.auto_scroll_threshold = 9999

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
from IPython.display import display, HTML

%load_ext autoreload
%autoreload 2
%load_ext autotime

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
The autotime extension is already loaded. To reload it, use:
  %reload_ext autotime
time: 6.44 ms


Prepare the Spark environment

In [6]:
from dateutil.parser import parse
from pyspark.sql.functions import *
from pyspark.sql import Window

time: 1.54 ms


In [7]:
from spark import init_spark_context, load_and_get_table_df

sc, sql_context = init_spark_context("Universe Selection Job")

time: 3.54 s


## Global Universe

We locate all companies that have fundamental data and stock prices.

In [8]:
companies_tickers_df = load_and_get_table_df(sql_context, "tfm_uoc_analysis", "company_tickers")
companies_tickers_df.show(10)

+------+-----+------------------+--------------------+------------+----------+-----------+
|ticker| ccvm|              cnpj|        company_name|num_accounts|solr_query|ticker_type|
+------+-----+------------------+--------------------+------------+----------+-----------+
| MNDL3| 5312|88.610.191/0001-54|MUNDIAL S.A - PRO...|        9347|      null|        3.0|
| TXRX3| 7544|82.982.075/0001-80|TÊXTIL RENAUXVIEW...|       22140|      null|        3.0|
| PATI3|   94|92.693.019/0001-89|     PANATLANTICA SA|       23100|      null|        3.0|
| CESP5| 2577|60.933.603/0001-78|CESP - COMPANHIA ...|        3502|      null|        5.0|
| ESTR3| 8427|61.082.004/0001-50|MANUFATURA DE  BR...|       21062|      null|        3.0|
| FRIO3|20613|04.821.041/0001-08|METALFRIO SOLUTIO...|       24378|      null|        3.0|
| UGPA3|18465|33.256.439/0001-39|ULTRAPAR PARTICIP...|       25859|      null|        3.0|
| BRSR3| 1210|92.702.067/0001-96|BANCO DO ESTADO D...|       20568|      null|        3.0|

### Selectable companies

The followings are the base criteria we will use to conduct our analysis:

- Investment date: 2015-12-30 (last operating day of 2015)

We need to select companies that were active in 2015-12-30.

And companies we know they have historic stock prices and fundamental data.

In [9]:
last_stock_exchange_date = parse("2015-12-30")
last_fundamentals_date = parse("2015-12-31")

print(f"Analysis conducted for {last_stock_exchange_date} in"
      f"stock exchange and {last_fundamentals_date} for fundamentals")

Analysis conducted for 2015-12-30 00:00:00 instock exchange and 2015-12-31 00:00:00 for fundamentals
time: 1.55 ms


In [10]:
companies_df = load_and_get_table_df(sql_context, "tfm_uoc", "bovespa_company")
companies_df = companies_df.filter(
    ((col("canceled_date").isNull()) | (col("canceled_date") > last_fundamentals_date)) & 
    ((~col("granted_date").isNull() & (col("granted_date") < last_fundamentals_date) )) )

print(f"The number if active companies at {last_fundamentals_date}: {companies_df.count()}")

The number if active companies at 2015-12-31 00:00:00: 562
time: 701 ms


In [11]:
companies_with_data_df = load_and_get_table_df(sql_context, "tfm_uoc_analysis", "company_tickers")

print(f"The number companies with historic stock prices and fundamental data: {companies_with_data_df.count()}")

The number companies with historic stock prices and fundamental data: 401
time: 487 ms


In [12]:
selectable_companies_df = companies_with_data_df.join(companies_df, companies_with_data_df.ccvm == companies_df.ccvm)

print(f"The number of selectable companies is: {selectable_companies_df.count()}")

The number of selectable companies is: 387
time: 935 ms


In [13]:
selectable_companies = [row.ccvm for row in selectable_companies_df.collect()]

time: 338 ms


## Investment Universe

Let's start applying the rules to restrict our investment universe

We are using `selectable_companies_df` as our global available investment universe and we are going to select the stocks that better match our criteria.

In [14]:
# Accessing to the fundamental data of the companies
fundamentals = load_and_get_table_df(sql_context, "tfm_uoc", "bovespa_account")

time: 14.4 ms


In [15]:
# Accessing to the technical data of the companies (stock prices)
security_prices = load_and_get_table_df(sql_context, "tfm_uoc_analysis", "security_prices")
security_prices = security_prices.filter(col("type") == "EQUITY")

time: 30.3 ms


### Compute Marketcap

We are going to compute the market capitalization of all the companies today.

$ Market\ capitalization = total\_shares * stock\_price$

We need to get the account `Account 1.89.03 - Total Shares` for `2015-12-31` and the price at the end of the year for the associated tickers.

In [16]:
total_shares_df = fundamentals.filter(
    (col("number").isin(["1.89.03"])) & 
    (col("period") == last_fundamentals_date) &
    (col("ccvm").isin(selectable_companies)))

# We make sure that we use the last version reported by the companies
total_shares_df = total_shares_df. \
    orderBy(["ccvm", "period", "version"], ascending=[True, True, True]). \
    groupBy("ccvm", "period"). \
    agg(last("version").alias("version"), last("amount").alias("amount")). \
    orderBy(["version"], ascending=[False])

total_shares_df = total_shares_df.withColumn("account_name", lit("total_shares"))

total_shares_df = total_shares_df.groupby(col("ccvm"), col("period").alias("astodate"))\
    .pivot("account_name").agg(last("amount")).orderBy("ccvm", "astodate")


print(f"We have {total_shares_df.count()} companies with shares information for {last_fundamentals_date}")

We have 239 companies with shares information for 2015-12-31 00:00:00
time: 2min 4s


In [17]:
total_shares_df.cache()
total_shares_df.show(20)

DataFrame[ccvm: string, astodate: date, total_shares: decimal(38,18)]

+-----+----------+--------------------+
| ccvm|  astodate|        total_shares|
+-----+----------+--------------------+
| 1023|2015-12-31|2865417.000000000...|
|10456|2015-12-31|470450.0000000000...|
|10472|2015-12-31|28596.00000000000...|
|10880|2015-12-31|2431.000000000000...|
|10960|2015-12-31|126471.0000000000...|
| 1120|2015-12-31|15285.00000000000...|
|11207|2015-12-31|99305.00000000000...|
|11215|2015-12-31|1206.000000000000...|
|11223|2015-12-31|503.0000000000000...|
|11231|2015-12-31|2948.000000000000...|
|11258|2015-12-31|118440.0000000000...|
|11312|2015-12-31|825761.0000000000...|
|11398|2015-12-31|23214.00000000000...|
| 1155|2015-12-31|315912.0000000000...|
|11592|2015-12-31|83550.00000000000...|
| 1171|2015-12-31|9521.000000000000...|
|11762|2015-12-31|740921.0000000000...|
|11932|2015-12-31|94863.00000000000...|
|11975|2015-12-31|27000.00000000000...|
| 1210|2015-12-31|408974.0000000000...|
+-----+----------+--------------------+
only showing top 20 rows

time: 1min 6s


In [18]:
tickers_w_market_data = [row.ticker for row in 
                         companies_with_data_df.filter(
                             col("ccvm").isin(selectable_companies)).collect()]

market_price_df = security_prices.filter(
    (col("date") == last_stock_exchange_date) & 
    (col("ticker").isin(tickers_w_market_data)))

print(f"We have {market_price_df.count()} companies with stock price in {last_stock_exchange_date}")

We have 281 companies with stock price in 2015-12-30 00:00:00
time: 8.39 s


In [19]:
market_price_df = market_price_df.select(
    col("ccvm").alias("market_ccvm"), 
    "ticker", 
    col("date").alias("market_date"), 
    col("adjclose").alias("price_share"))
    
market_price_df.show()

+-----------+------+-----------+-----------+
|market_ccvm|ticker|market_date|price_share|
+-----------+------+-----------+-----------+
|      20532| SANB3| 2015-12-30|   7.160567|
|      20532| SANB4| 2015-12-30|   5.424535|
|      16306| RSID3| 2015-12-30|        3.2|
|      19453| ECOR3| 2015-12-30|   4.401374|
|      11592| UNIP3| 2015-12-30|    4.62786|
|      11592| UNIP5| 2015-12-30|  2.0308588|
|      11592| UNIP6| 2015-12-30|  1.4911208|
|      19992| TOTS3| 2015-12-30|  29.083752|
|      20524| EVEN3| 2015-12-30|  3.9321954|
|      14320| USIM3| 2015-12-30|  3.9599302|
|      14320| USIM5| 2015-12-30|  1.5193313|
|       6629| HETA4| 2015-12-30|        8.2|
|      20346| PFRM3| 2015-12-30|     4.6314|
|      20338| MDIA3| 2015-12-30|  20.696838|
|      13366| HAGA3| 2015-12-30|       2.39|
|      17485| EKTR4| 2015-12-30|  17.235706|
|      21008| GSHP3| 2015-12-30| 0.69697803|
|      20788| MRFG3| 2015-12-30|       6.35|
|       2100| CAMB3| 2015-12-30|    399.832|
|       21

In [20]:
market_capitalization_df = total_shares_df.join(
    market_price_df,
    total_shares_df.ccvm == market_price_df.market_ccvm, how='left')

market_capitalization_df = market_capitalization_df.withColumn("marketcap", col("total_shares") * col("price_share"))

market_capitalization_df = market_capitalization_df.select("ccvm", "astodate", "ticker", "total_shares", "price_share", "marketcap")

time: 27.3 ms


In [21]:
market_capitalization_df.show()

+-----+----------+------+--------------------+-----------+-------------------+
| ccvm|  astodate|ticker|        total_shares|price_share|          marketcap|
+-----+----------+------+--------------------+-----------+-------------------+
|11592|2015-12-31| UNIP3|83550.00000000000...|    4.62786|  386657.7087879181|
|11592|2015-12-31| UNIP5|83550.00000000000...|  2.0308588| 169678.24898958206|
|11592|2015-12-31| UNIP6|83550.00000000000...|  1.4911208| 124583.14411640167|
|13030|2015-12-31|  null|12600.00000000000...|       null|               null|
| 3891|2015-12-31| CRIV3|103515.0000000000...|   2.412978|  249779.4108259678|
| 3891|2015-12-31| CRIV4|103515.0000000000...|  2.5041864| 259220.85435032845|
|14460|2015-12-31| CYRE3|399743.0000000000...|   6.484147|  2591992.402937889|
|20621|2015-12-31| FHER3|53857.00000000000...|       1.35|  72706.95128405094|
|21440|2015-12-31| LLIS3|349863.0000000000...|  11.959079| 4184039.1822710037|
|12696|2015-12-31|  null|19291.00000000000...|      

### Compute liquidity

$liquidity=\sum_{t=1}^{k}volume_traded_t\ast stock_price_t$

In [22]:
liquidity_df = security_prices.withColumn("liquidity", col("volume") * col("adjclose"))
liquidity_df = liquidity_df.orderBy(["ccvm", "ticker", "date"], ascending=[True, True, True])
liquidity_df = liquidity_df.withColumn("liquidity120days", sum("liquidity")
             .over(Window.partitionBy("ticker").rowsBetween(-120, 0))) \
             .filter(
                (col("date") == last_stock_exchange_date) &
                (col("ticker").isin(tickers_w_market_data))) \
             .select("ccvm", "ticker", "liquidity120days")

time: 2.04 s


In [23]:
liquidity_df.show()

+-----+------+-------------------+
| ccvm|ticker|   liquidity120days|
+-----+------+-------------------+
| 3980| GGBR4|     6.3312292145E9|
|20362| POSI3| 1.69953639609375E7|
| 4820| BRKM5|      4.003165093E9|
|12653| KLBN3|                NaN|
|19879| LIGT3|  1.0787184818125E9|
|22012| MILS3|    3.37530769375E8|
|11932| MYPK3|    8.97043041875E8|
| 9342| PNVL4|    6421988.6484375|
|22187| PRIO3|3.260092503466797E7|
|19348| ITUB3|  2.4047461684375E8|
|19763| ENBR3|    2.25959503775E9|
|16608|ENMA3B| 3178444.1604003906|
|16306| RSID3|  2.5553822434375E8|
|20532| SANB4|  6280221.784057617|
| 1228| BNBR3|  276445.3107910156|
| 2437| ELET3|  1.0668287653125E9|
| 4146| CTKA4|  257727.0002593994|
|15253| ENGI4|  624835.5443115234|
|21199| BPAN4|4.553221304187012E7|
|20060| LUPA3|  1436750.245666504|
+-----+------+-------------------+
only showing top 20 rows

time: 15.3 s


### Create two ranks

We order the joint dataframe per market captitzalization and liquidity descendent.

Then we generate two dataframes, one with `$marketcap > 10B$` (__Big__) and other with companies `$2B <= marketcap <= 10B$` (__MED__)

In [24]:
all_companies_df = market_capitalization_df.join(
    liquidity_df, ["ccvm", "ticker"], how='left')

all_companies_df = all_companies_df.orderBy(
    ["marketcap", "liquidity120days"], ascending=[False, False])

all_companies_df = all_companies_df.orderBy(["ccvm"], ascending=[True])

all_companies_df.show()

+-----+------+----------+--------------------+-----------+--------------------+--------------------+
| ccvm|ticker|  astodate|        total_shares|price_share|           marketcap|    liquidity120days|
+-----+------+----------+--------------------+-----------+--------------------+--------------------+
| 1023| BBAS3|2015-12-31|2865417.000000000...|  12.486569|3.5779228243626595E7|    1.42961338945E10|
|10456| ALPA3|2015-12-31|470450.0000000000...|   6.911579|  3251552.4026870728|3.4135243062316895E7|
|10456| ALPA4|2015-12-31|470450.0000000000...|   5.103701|   2401036.189389229|   4.1551202871875E8|
|10472| SLED3|2015-12-31|28596.00000000000...|  4.6795573|  133816.62122154236|   34802.80584716797|
|10880| SOND3|2015-12-31|2431.000000000000...|   40.29944|   97967.93493652344|                 0.0|
|10960|  null|2015-12-31|126471.0000000000...|       null|                null|                null|
| 1120| BGIP4|2015-12-31|15285.00000000000...|   8.485733|  129704.42939758301|   3357995.2

In [25]:
# The total shares were expressed in thousands of shares. Then
# we check for market cap over 5M not 5B
bigcap_companies = all_companies_df.filter(col("marketcap") >= 5e6)
midcap_companies = all_companies_df.filter((col("marketcap") > 1e6) & (col("marketcap") < 5e6))

print(f"Number of Big-cap companies is {bigcap_companies.count()}")
print(f"Number of Mid-cap companies is {midcap_companies.count()}")

Number of Big-cap companies is 48
Number of Mid-cap companies is 79
time: 47.9 s


In [26]:
bigcap_companies = bigcap_companies.orderBy(["liquidity120days"], ascending=[False]).limit(15)

midcap_companies = midcap_companies.orderBy(["liquidity120days"], ascending=[False]).limit(20)

time: 13.5 ms


### Calculate index for wonderful companies (ROC)

$$ROC=\frac{earnings}{capital}
$$
    or also $$ROC=\frac{earnings}{fixed\_assets + net\_working\_capital - cash}$$

    where;

$$net\_working\_capital = current\_assets - current\_liabilities$$
    
    To compute ROC we are going to need access to the following company accounts:
    
    - Account 3.09 - Net income (needed by the ROC formula)
    - Account 1.01 - Current Assets (needed by the ROC formula)
    - Account 1.01.01 - Cash (needed by the ROC formula)
    - Account 1.02 - Fixed Assets (needed by the ROC formula)
    - Account 2.01 - Current Liabilities

In [27]:
# Get all the required accounts
ROC_accounts_df = fundamentals.filter(
    col("number").isin(["1.01", "1.01.01", "1.02", "2.01", "3.09"]))
ROC_accounts_df = ROC_accounts_df.withColumn(
    "account_name", when(ROC_accounts_df.number == "1.01", "current_assets").otherwise(
        when(ROC_accounts_df.number == "1.01.01", "cash").otherwise(
            when(ROC_accounts_df.number == "1.02", "fixed_assets").otherwise(
                when(ROC_accounts_df.number == "2.01", "current_liabilities").otherwise("net_income")))))

ROC_accounts_df = ROC_accounts_df \
    .select(col("ccvm"), 
            col("period").alias("astodate"),
            col("account_name"), 
            col("amount").alias("amount"))

ROC_accounts_df = ROC_accounts_df.groupby(col("ccvm"), col("astodate"))\
    .pivot("account_name").sum("amount").orderBy("ccvm", "astodate")

ROC_accounts_df.show()

+----+----------+--------------------+--------------------+--------------------+--------------------+--------------------+
|ccvm|  astodate|                cash|      current_assets| current_liabilities|        fixed_assets|          net_income|
+----+----------+--------------------+--------------------+--------------------+--------------------+--------------------+
|1023|2010-12-31|38008416.00000000...|1296401789.000000...|1515634908.000000...|998645507.0000000...|14391005.00000000...|
|1023|2011-03-31|12244142.00000000...|460670797.0000000...|536607202.0000000...|298169181.0000000...|469621.0000000000...|
|1023|2011-06-30|18868357.00000000...|479955357.0000000...|554643276.0000000...|306061270.0000000...|71272.00000000000...|
|1023|2011-09-30|19767405.00000000...|512077755.0000000...|571192875.0000000...|312982066.0000000...|1313364.000000000...|
|1023|2011-12-31|57893154.00000000...|2163486616.000000...|2289926472.000000...|1682047332.000000...|28022160.00000000...|
|1023|2012-03-31

In [28]:
ROC_df = ROC_accounts_df.filter(
        (col("astodate") == last_fundamentals_date) &
        (col("ccvm").isin(selectable_companies)))

ROC_df = ROC_df. \
    withColumn(
        "capital", 
        col("fixed_assets") + 
        col("current_assets") - 
        col("current_liabilities") - 
        col("cash")). \
    withColumn("ROC", col("net_income") / col("capital")). \
    select("ccvm", "astodate", "ROC")
    
ROC_df.show()

+-----+----------+---------+
| ccvm|  astodate|      ROC|
+-----+----------+---------+
| 1023|2015-12-31| 0.036004|
|10456|2015-12-31| 0.139110|
|10472|2015-12-31|-0.244239|
|10880|2015-12-31|-0.181915|
|10960|2015-12-31|-0.261820|
| 1120|2015-12-31| 0.000000|
|11207|2015-12-31| 0.003945|
|11215|2015-12-31|-1.023490|
|11223|2015-12-31| 0.151255|
|11231|2015-12-31|-0.054596|
|11258|2015-12-31|-0.115173|
|11312|2015-12-31|-0.122770|
|11398|2015-12-31|-0.056296|
| 1155|2015-12-31| 0.007317|
|11592|2015-12-31| 0.070570|
| 1171|2015-12-31| 0.000000|
|11762|2015-12-31|-0.143392|
|11932|2015-12-31| 0.022565|
|11975|2015-12-31| 0.012990|
| 1210|2015-12-31| 0.016671|
+-----+----------+---------+
only showing top 20 rows

time: 32 s


### Calculate index for Wonderful Companies (Earnings Yield)

$$Earnings\ Yieal=\frac{EBIT}{TEV}$$

$$ EY=\frac{EBIT}{marketcap + total\_debt - excess\_cash + stock}$$

where 

$excess\_cash = total\_cash - MAX(current\_liabilities - current\_non\_cash\_assets, 0)$

$total\_cash = cash + short\_term\_investments$

$current\_non\_cash\_assets = current\_assets - total\_cash$

$total\_debt = (current\_liabilities + fixed\_liabilities) - (cash + cash\_equivalents)$

To compute Earnings Yield we will need access to the following company accounts:

- Account 1.01 - Current Assets
- Account 1.01.01 - Cash
- Account 1.01.02 - Short term investments
- Account 1.01.04 - Stock
- Account 1.89.03 - Total Shares    
- Account 2.01 - Current Liabilities
- Account 2.02 - Fixed Liabilities
- Account 3.05 - EBIT

and the Stock Price.


In [29]:
# Get all the required accounts
EY_accounts_df = fundamentals.filter(
    col("number").isin(["1.01", "1.01.01", "1.01.02", "1.01.04", "2.01", "2.02", "3.05"]))
EY_accounts_df = EY_accounts_df.withColumn(
    "account_name", when(EY_accounts_df.number == "1.01", "current_assets").otherwise(
        when(EY_accounts_df.number == "1.01.01", "cash").otherwise(
            when(EY_accounts_df.number == "1.01.02", "short_term_investments").otherwise(
                when(EY_accounts_df.number == "1.01.04", "stock").otherwise(
                    when(EY_accounts_df.number == "2.01", "current_liabilities").otherwise(
                        when(EY_accounts_df.number == "2.02", "fixed_liabilities").otherwise("EBIT")))))))

EY_accounts_df = EY_accounts_df \
    .select(col("ccvm"), 
            col("period").alias("astodate"),
            col("account_name"), 
            col("amount").alias("amount"))

EY_accounts_df = EY_accounts_df.groupby(col("ccvm"), col("astodate"))\
    .pivot("account_name").sum("amount").orderBy("ccvm", "astodate")

EY_accounts_df.show()

+----+----------+--------------------+--------------------+--------------------+--------------------+--------------------+----------------------+--------------------+
|ccvm|  astodate|                EBIT|                cash|      current_assets| current_liabilities|   fixed_liabilities|short_term_investments|               stock|
+----+----------+--------------------+--------------------+--------------------+--------------------+--------------------+----------------------+--------------------+
|1023|2010-12-31|67696690.00000000...|38008416.00000000...|1296401789.000000...|1515634908.000000...|549724995.0000000...|  344677921.0000000...|258862344.0000000...|
|1023|2011-03-31|4420967.000000000...|12244142.00000000...|460670797.0000000...|536607202.0000000...|198846804.0000000...|  145400347.0000000...|90974816.00000000...|
|1023|2011-06-30|5009653.000000000...|18868357.00000000...|479955357.0000000...|554643276.0000000...|206937578.0000000...|  145592090.0000000...|91742924.00000000...

In [30]:
EY_df = EY_accounts_df.filter(
        (col("astodate") == last_fundamentals_date) &
        (col("ccvm").isin(selectable_companies)))

EY_df = EY_df.join(all_companies_df, "ccvm", how='left')

EY_df = EY_df.drop(all_companies_df.astodate)

EY_df = EY_df.filter(EY_df.liquidity120days.isNotNull())

EY_df.show()

+-----+----------+--------------------+--------------------+--------------------+--------------------+--------------------+----------------------+--------------------+------+--------------------+-----------+--------------------+-------------------+
| ccvm|  astodate|                EBIT|                cash|      current_assets| current_liabilities|   fixed_liabilities|short_term_investments|               stock|ticker|        total_shares|price_share|           marketcap|   liquidity120days|
+-----+----------+--------------------+--------------------+--------------------+--------------------+--------------------+----------------------+--------------------+------+--------------------+-----------+--------------------+-------------------+
|11592|2015-12-31|157732.0000000000...|126949.0000000000...|362629.0000000000...|293480.0000000000...|544351.0000000000...|  95492.00000000000...|26644.00000000000...| UNIP3|83550.00000000000...|    4.62786|   386657.7087879181| 1155224.1525878906|
|115

In [31]:
EY_computation_df = EY_df. \
    withColumn(
        "total_debt", 
        abs(col("current_liabilities")) + 
        abs(col("fixed_liabilities")) - 
        col("cash")). \
    withColumn(
        "total_cash", 
        col("cash") + col("short_term_investments")). \
    withColumn(
        "current_non_cash_assets", 
        col("current_assets") - col("total_cash")). \
    withColumn(
        "excess_cash", 
        col("total_cash") - 
        when((abs(col("current_liabilities")) - col("current_non_cash_assets"))  >= 0,
             (abs(col("current_liabilities")) - col("current_non_cash_assets"))).otherwise(0)). \
    withColumn(
        "dividend", 
        (col("marketcap") * 1000) - col("total_debt") - col("excess_cash") + col('stock')). \
    withColumn(
        "EarningsYield", 
        col("EBIT") / col("dividend"))

EY_computation_df.show()

+-----+----------+--------------------+--------------------+--------------------+--------------------+--------------------+----------------------+--------------------+------+--------------------+-----------+--------------------+-------------------+--------------------+--------------------+-----------------------+--------------------+--------------------+--------------------+
| ccvm|  astodate|                EBIT|                cash|      current_assets| current_liabilities|   fixed_liabilities|short_term_investments|               stock|ticker|        total_shares|price_share|           marketcap|   liquidity120days|          total_debt|          total_cash|current_non_cash_assets|         excess_cash|            dividend|       EarningsYield|
+-----+----------+--------------------+--------------------+--------------------+--------------------+--------------------+----------------------+--------------------+------+--------------------+-----------+--------------------+----------------

In [42]:
EY_computation_df.cache()

DataFrame[ccvm: string, astodate: date, EBIT: decimal(38,18), cash: decimal(38,18), current_assets: decimal(38,18), current_liabilities: decimal(38,18), fixed_liabilities: decimal(38,18), short_term_investments: decimal(38,18), stock: decimal(38,18), ticker: string, total_shares: decimal(38,18), price_share: float, marketcap: double, liquidity120days: double, total_debt: decimal(38,16), total_cash: decimal(38,17), current_non_cash_assets: decimal(38,16), excess_cash: decimal(38,14), dividend: double, EarningsYield: double]

time: 202 ms


### Combine both indexes to select the companies of our universe

We will combine both indexes multiplying them and ordering the resulting index in descendent order. We do this in both group of companies, the big-cap and mid-cap pre-selected companies.

In [32]:
scored_bigcap_companies = bigcap_companies.join(
    ROC_df, "ccvm", how='left').join(EY_computation_df, ["ccvm", "ticker"], how='left')

scored_bigcap_companies = scored_bigcap_companies.drop(ROC_df.astodate)
scored_bigcap_companies = scored_bigcap_companies.drop(EY_computation_df.astodate)
scored_bigcap_companies = scored_bigcap_companies.drop(EY_computation_df.total_shares)
scored_bigcap_companies = scored_bigcap_companies.drop(EY_computation_df.price_share)
scored_bigcap_companies = scored_bigcap_companies.drop(EY_computation_df.marketcap)
scored_bigcap_companies = scored_bigcap_companies.drop(EY_computation_df.liquidity120days)

scored_bigcap_companies = scored_bigcap_companies. \
    withColumn("quality_index", col("ROC") * col("EarningsYield")). \
    orderBy(["quality_index"], ascending=[False]).limit(15).na.drop()

time: 278 ms


In [33]:
scored_bigcap_companies. \
    select(col("ccvm"), 
           col("ticker"), 
           col("ROC"), 
           col("EarningsYield"), 
           col("quality_index")).show(15)

+-----+------+---------+--------------------+--------------------+
| ccvm|ticker|      ROC|       EarningsYield|       quality_index|
+-----+------+---------+--------------------+--------------------+
| 4170| VALE3|-0.158347|-8.63221170417238E-4|1.366884826720583...|
| 7617| ITSA4| 0.189773|6.098875137125337E-4|1.157401831397686...|
|20575| JBSS3| 0.088494|0.001270149168437...|1.124005805116751...|
|23159| BBSE3| 0.623346|1.724109032173471...|1.074716468769204...|
|21733| CIEL3| 0.193639|2.380277920926754...|4.609146363303358...|
| 1023| BBAS3| 0.036004|0.001198105840941668|4.313660269726381...|
|18465| UGPA3| 0.129780|3.183256187277489...|4.131229879848726E-5|
|23264| ABEV3| 0.208983|1.321774901825072...|2.762284843081091...|
|16292| BRFS3| 0.116983|2.074101170406661E-4|2.426345772176824...|
|19348| ITUB4| 0.035055|4.674190724581319E-4|1.638537558501981...|
|  906| BBDC4| 0.023877|3.330352678931849E-4|7.951883091485575E-6|
|14826| PCAR4|-0.001188|7.699758579245562E-4|-9.14731319214372

In [34]:
scored_midcap_companies = midcap_companies.join(
    ROC_df, "ccvm", how='left').join(EY_computation_df, ["ccvm", "ticker"], how='left')

scored_midcap_companies = scored_midcap_companies.drop(ROC_df.astodate)
scored_midcap_companies = scored_midcap_companies.drop(EY_computation_df.astodate)
scored_midcap_companies = scored_midcap_companies.drop(EY_computation_df.total_shares)
scored_midcap_companies = scored_midcap_companies.drop(EY_computation_df.price_share)
scored_midcap_companies = scored_midcap_companies.drop(EY_computation_df.marketcap)
scored_midcap_companies = scored_midcap_companies.drop(EY_computation_df.liquidity120days)

scored_midcap_companies = scored_midcap_companies. \
    withColumn("quality_index", col("ROC") * col("EarningsYield")). \
    orderBy(["quality_index"], ascending=[False]).limit(22)

time: 184 ms


In [35]:
scored_midcap_companies. \
    select(col("ccvm"), 
           col("ticker"), 
           col("ROC"), 
           col("EarningsYield"), 
           col("quality_index")).show(22)

+-----+------+---------+--------------------+--------------------+
| ccvm|ticker|      ROC|       EarningsYield|       quality_index|
+-----+------+---------+--------------------+--------------------+
|18724| BRAP4|-0.277357|-0.00344038958013...|9.542161327779384E-4|
|14320| USIM5|-0.156665|-0.00336588311383...|5.273160780282257E-4|
| 8656| GOAU4|-0.105342|-0.00377913841021...|3.981019984084973E-4|
| 4030| CSNA3| 0.040765|0.005409873630167907|2.205334985337947...|
|19739| RENT3| 0.123553|9.927848364600783E-4|1.226615448991520...|
|19763| ENBR3| 0.130239|9.007812700306895E-4|1.173168518275269...|
|19550| NATU3| 0.141377|5.056658489592716E-4|7.148952072831494E-5|
|14460| CYRE3| 0.067733|9.016770190401691E-4|6.107328953064778E-5|
|11312| OIBR4|-0.122770|-3.19359932288224...|3.920781888702529E-5|
|20982| MULT3| 0.060055|5.758746778735874E-4|3.458415377969829E-5|
|20125| ODPV3| 0.245647|1.368352415368857...|3.361316657781138E-5|
|20028| VLID3| 0.099457|3.079707734975453E-4|3.062984921974536

## Result

We proceed to merge the best big-cap companies and best mid-cap companies to generate the stocks of our portfoli.

In [36]:
portfolio_universe = scored_bigcap_companies.union(scored_midcap_companies)
portfolio_universe.show(37)

+-----+------+----------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+----------------------+--------------------+--------------------+-----------+--------------------+------------------+--------------------+--------------------+-----------------------+--------------------+--------------------+--------------------+--------------------+
| ccvm|ticker|  astodate|      ROC|                EBIT|                cash|      current_assets| current_liabilities|   fixed_liabilities|short_term_investments|               stock|        total_shares|price_share|           marketcap|  liquidity120days|          total_debt|          total_cash|current_non_cash_assets|         excess_cash|            dividend|       EarningsYield|       quality_index|
+-----+------+----------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+----------------------+--------------------+

In [None]:
portfolio_universe.cache()

In [44]:
portfolio_universe_pdf = portfolio_universe.toPandas()

Unnamed: 0,ccvm,ticker,astodate,ROC,EBIT,cash,current_assets,current_liabilities,fixed_liabilities,short_term_investments,...,price_share,marketcap,liquidity120days,total_debt,total_cash,current_non_cash_assets,excess_cash,dividend,EarningsYield,quality_index
0,4170,VALE3,2015-12-31,-0.158347,-52256885.0,14539346.0,108947720.0,66900919.0,313648883.0,0.0,...,11.612576,60900020.0,12166840000.0,366010456.0,14539346.0,94408374.0,14539346.0,60537080000.0,-0.000863,0.0001366885
1,7617,ITSA4,2015-12-31,0.189773,17905000.0,2976000.0,8452000.0,4746000.0,4382000.0,564000.0,...,4.344943,29366600.0,15843350000.0,6152000.0,3540000.0,4912000.0,3540000.0,29357870000.0,0.00061,0.0001157402
2,20575,JBSS3,2015-12-31,0.088494,42021267.0,51062832.0,205276848.0,180469273.0,209917194.0,39242961.0,...,11.694019,33408140.0,15456120000.0,339323635.0,90305793.0,114971055.0,24807575.0,33083730000.0,0.00127,0.0001124006
3,23159,BBSE3,2015-12-31,0.623346,6623388.0,4862586.0,13827669.0,13031277.0,3665169.0,156.0,...,19.214462,38428920.0,12734350000.0,11833860.0,4862742.0,8964927.0,796392.0,38416290000.0,0.000172,0.0001074716
4,21733,CIEL3,2015-12-31,0.193639,8294421.0,1294011.0,23116011.0,14963423.0,21600292.0,0.0,...,18.536745,34883000.0,17061630000.0,35269704.0,1294011.0,21822000.0,1294011.0,34846440000.0,0.000238,4.609146e-05
5,1023,BBAS3,2015-12-31,0.036004,38532135.0,85601543.0,2539085289.0,2977909236.0,1360017690.0,1196303181.0,...,12.486569,35779230.0,14296130000.0,4252325383.0,1281904724.0,1257180565.0,-438823947.0,32160880000.0,0.001198,4.31366e-05
6,18465,UGPA3,2015-12-31,0.12978,4438200.0,2750954.0,10412404.0,4165779.0,10074542.0,810012.0,...,25.080441,13954880.0,10256240000.0,11489367.0,3560966.0,6851438.0,3560966.0,13942330000.0,0.000318,4.13123e-05
7,23264,ABEV3,2015-12-31,0.208983,32828106.0,15565033.0,42359214.0,48257627.0,37975835.0,2812575.0,...,15.805341,248422300.0,25827910000.0,70668429.0,18377608.0,23981606.0,-5898413.0,248363800000.0,0.000132,2.762285e-05
8,16292,BRFS3,2015-12-31,0.116983,9743254.0,6207975.0,31392866.0,21971311.0,29083675.0,932518.0,...,53.893978,47021040.0,22687160000.0,44847011.0,7140493.0,24252373.0,7140493.0,46975790000.0,0.000207,2.426346e-05
9,19348,ITUB4,2015-12-31,0.035055,38847751.0,155156.0,34675517.0,5575186.0,79242280.0,9978893.0,...,13.676385,83205980.0,49735280000.0,84662310.0,10134049.0,24541468.0,10134049.0,83111180000.0,0.000467,1.638538e-05


time: 5.9 s


In [45]:
portfolio_universe_pdf[["ccvm", "ticker", "astodate", "marketcap", "liquidity120days", "ROC", "EarningsYield", "quality_index"]]

Unnamed: 0,ccvm,ticker,astodate,marketcap,liquidity120days,ROC,EarningsYield,quality_index
0,4170,VALE3,2015-12-31,60900020.0,12166840000.0,-0.158347,-0.000863,0.0001366885
1,7617,ITSA4,2015-12-31,29366600.0,15843350000.0,0.189773,0.00061,0.0001157402
2,20575,JBSS3,2015-12-31,33408140.0,15456120000.0,0.088494,0.00127,0.0001124006
3,23159,BBSE3,2015-12-31,38428920.0,12734350000.0,0.623346,0.000172,0.0001074716
4,21733,CIEL3,2015-12-31,34883000.0,17061630000.0,0.193639,0.000238,4.609146e-05
5,1023,BBAS3,2015-12-31,35779230.0,14296130000.0,0.036004,0.001198,4.31366e-05
6,18465,UGPA3,2015-12-31,13954880.0,10256240000.0,0.12978,0.000318,4.13123e-05
7,23264,ABEV3,2015-12-31,248422300.0,25827910000.0,0.208983,0.000132,2.762285e-05
8,16292,BRFS3,2015-12-31,47021040.0,22687160000.0,0.116983,0.000207,2.426346e-05
9,19348,ITUB4,2015-12-31,83205980.0,49735280000.0,0.035055,0.000467,1.638538e-05


time: 45 ms


In [38]:
from db import sync_table
sync_table(portfolio_universe, "tfm_uoc_dse", "tfm_uoc_analysis", "portfolio_universe", ["ccvm", "ticker"])

Closing connections
time: 256 ms




In [39]:
portfolio_universe.write\
    .format("org.apache.spark.sql.cassandra")\
    .options(table="portfolio_universe", keyspace="tfm_uoc_analysis")\
    .option("confirm.truncate","true")\
    .mode("overwrite")\
    .partitionBy("astodate")\
    .save()

time: 2min 38s


In [40]:
# sc.stop()

time: 1.14 ms
