# Select top-N assets with good indicators
The objective is to calculate the standard indicators for the top N assets, use those indicators to shortlist a potentially significant set of assets for the portfolio, and then apply the MPT (Modern Portfolio Theory) Monte Carlo algorithm to construct a weighted portfolio.
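
The notebook does not show the Monte Carlo step itself, so the following is only a minimal sketch of the general idea, not the rezaware implementation: draw many random weight vectors, score each by a Sharpe-like ratio, and keep the best. The function names, the trial count, the seed, and the toy return series are all assumptions made up for illustration.

```python
import random
import statistics


def random_weights(n_assets, rng):
    """Draw non-negative weights that sum to 1."""
    raw = [rng.random() for _ in range(n_assets)]
    total = sum(raw)
    return [value / total for value in raw]


def portfolio_returns(weights, asset_returns):
    """Weighted sum of the per-asset return series, period by period."""
    return [sum(w * r for w, r in zip(weights, period))
            for period in zip(*asset_returns)]


def monte_carlo_portfolio(asset_returns, n_trials=5000, risk_free=0.0, seed=7):
    """Return the sampled weights with the highest mean/volatility ratio."""
    rng = random.Random(seed)
    best_sharpe, best_weights = float("-inf"), None
    for _ in range(n_trials):
        weights = random_weights(len(asset_returns), rng)
        series = portfolio_returns(weights, asset_returns)
        volatility = statistics.pstdev(series)
        if volatility == 0:
            continue  # ratio undefined for a flat series
        sharpe = (statistics.fmean(series) - risk_free) / volatility
        if sharpe > best_sharpe:
            best_sharpe, best_weights = sharpe, weights
    return best_weights, best_sharpe


# Made-up daily returns for three hypothetical assets (illustration only)
toy_returns = [
    [0.010, -0.004, 0.006, 0.002, -0.001],
    [0.020, -0.015, 0.012, -0.008, 0.010],
    [-0.003, 0.004, 0.001, 0.002, 0.000],
]
best_weights, best_sharpe = monte_carlo_portfolio(toy_returns)
print(best_weights, best_sharpe)
```

Random sampling only approximates the optimum; more trials tighten the estimate at the cost of run time.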

In [1]:
'''
    WARNING CONTROL to display or ignore all warnings
'''
import warnings; warnings.simplefilter('ignore') #switch between 'default' and 'ignore'
import traceback

''' Set debug flag to view extended error messages; 
    else set it to False to turn off debugging mode '''
debug = True

## Initialize the classes

In [2]:
import os
import sys
from datetime import datetime, date, timedelta
from pyspark.sql import functions as F

sys.path.insert(1,"/home/nuwan/workspace/rezaware/")
import rezaware as reza
from mining.modules.assets.etp import performIndex as indx
from utils.modules.etl.load import sparkDBwls as sdb
from utils.modules.ml.timeseries import rollingstats as stats
from utils.modules.etl.load import sparkNoSQLwls as nosql
from utils.modules.lib.spark import execSession as spark

''' when debugging, reload the modules before initializing the classes '''
if debug:
    import importlib
    reza = importlib.reload(reza)
    indx = importlib.reload(indx)
    stats= importlib.reload(stats)
    nosql= importlib.reload(nosql)
    spark= importlib.reload(spark)
    sdb = importlib.reload(sdb)
    
__desc__ = "analyze crypto market capitalization time series data"
clsIndx =indx.Portfolio(desc=__desc__)
# clsStat=stats.RollingStats(desc=__desc__)
clsSDB = sdb.SQLWorkLoads(desc=__desc__)
clsNoSQL=nosql.NoSQLWorkLoads(desc=__desc__)
print("\nClass initialization and load complete!")

All functional APP-libraries in REZAWARE-package of REZAWARE-module imported successfully!
All functional PERFORMINDEX-libraries in ETP-package of ASSETS-module imported successfully!
All functional SPARKDBWLS-libraries in LOAD-package of ETL-module imported successfully!
All packages in utils ml timeseries RollingStats imported successfully!
All functional SPARKNOSQLWLS-libraries in LOAD-package of ETL-module imported successfully!
All functional EXECSESSION-libraries in SPARK-package of LIB-module imported successfully!

## Read top N mcap assets from tip sql db

Set the following parameters to select the mcap data from the database:
* ```_num_assets``` (integer) limits the number of assets returned
* ```_mcap_val_lb``` (decimal) lower bound; only assets with an mcap_value above it are selected
* ```_date``` (datetime) selects assets with values for that day
* ```_table``` (string) by default 'warehouse.mcap_past', where the daily mcap data is stored

The cell uses the ```utils/etl/load/sparkDBwls``` package to read the data from the database table.


In [3]:
_num_assets=3
_mcap_val_lb=10000.0
_date=datetime.strftime(date(2022,1,30),'%Y-%m-%dT00:00:00')
_table='warehouse.mcap_past'
kwargs={}

_query =f"select * from {_table} wmp where wmp.mcap_date = '{_date}' " +\
        f"and wmp.mcap_value > {_mcap_val_lb} " +\
        f"order by wmp.mcap_value DESC limit {_num_assets} "

_data = clsSDB.read_data_from_table(select=_query, **kwargs)
print("selected %d assets \n" % _data.count())
_data.show(n=1,vertical=True)

23/03/22 20:16:52 WARN Utils: Your hostname, FarmRaiderTester resolves to a loopback address: 127.0.1.1; using 192.168.124.15 instead (on interface enp2s0)
23/03/22 20:16:52 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
23/03/22 20:16:53 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/03/22 20:16:53 WARN FileSystem: Cannot load filesystem: java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem Unable to get public no-arg constructor
23/03/22 20:16:53 WARN FileSystem: java.lang.NoClassDefFoundError: com/google/api/client/auth/oauth2/Credential
23/03/22 20:16:53 WARN FileSystem: java.lang.ClassNotFoundException: com.google.api.client.auth.oauth2.Credential


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


23/03/22 20:16:56 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
23/03/22 20:16:56 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.


                                                                                

selected 3 assets 



-RECORD 0-----------------------------
 mcap_past_pk  | 19539                
 uuid          | 6397027b7cc473c58... 
 data_source   | coingecko            
 asset_name    | btc                  
 asset_symbol  | btc                  
 mcap_date     | 2022-01-30 00:00:00  
 mcap_value    | 729084397823.3950... 
 mcap_rank     | null                 
 created_dt    | 2023-02-14 12:07:... 
 created_by    | farmraider           
 created_proc  | wrangler_assets_e... 
 modified_dt   | 2023-02-21 14:45:... 
 modified_by   | farmraider           
 modified_proc | utils_etl_load_sp... 
 deactivate_dt | null                 
 log_ror       | 0.6966161566986901   
 simp_ror      | 0.0069020000000000   
only showing top 1 row
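
The query in the cell above is assembled inline with f-strings. As a sketch only, the same string could be produced by a small helper that checks its inputs first. `build_mcap_query` is a hypothetical name and is not part of `sparkDBwls`; the validation rules are assumptions.

```python
import re
from datetime import date


def build_mcap_query(table, mcap_date, mcap_lower_bound, num_assets):
    """Build the top-N mcap select statement used in this notebook."""
    # accept only plain `table` or `schema.table` identifiers
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?", table):
        raise ValueError(f"unexpected table name: {table!r}")
    if not isinstance(num_assets, int) or num_assets < 1:
        raise ValueError("num_assets must be a positive integer")
    date_str = mcap_date.strftime("%Y-%m-%dT00:00:00")
    return (
        f"select * from {table} wmp "
        f"where wmp.mcap_date = '{date_str}' "
        f"and wmp.mcap_value > {float(mcap_lower_bound)} "
        f"order by wmp.mcap_value DESC limit {num_assets}"
    )


query = build_mcap_query("warehouse.mcap_past", date(2022, 1, 30), 10000.0, 3)
print(query)
```

The resulting string can be passed as the `select` argument in the same way as `_query`.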



                                                                                

## Construct a dict with selected assets
Each selected asset is stored as a dictionary, and the resulting list serves as the input to the ```mining/modules/assets/etp/performIndex``` package for computing the index values.

In [4]:
from pyspark.sql import functions as F
_assets=_data.select(F.col('mcap_date'),
                     F.col('asset_name'),
                     F.col('mcap_value'))\
                .distinct()
_portf=[]
for _asset in _assets.collect():
    # one dict per row: (mcap_date, asset_name, mcap_value); the weight is fixed at 1.0
    _asset_dict={"date" : datetime.strftime(_asset[0],'%Y-%m-%dT00:00:00'),
                 "asset": _asset[1],
                 'mcap.weight': 1.0,
                 'mcap.value' : float(_asset[2]),
                }
    _portf.append(_asset_dict)
_portf=sorted(_portf, key=lambda d: d['mcap.value'], reverse=True)
_portf[:3]

                                                                                

[{'date': '2022-01-30T00:00:00',
  'asset': 'btc',
  'mcap.weight': 1.0,
  'mcap.value': 729084397823.395},
 {'date': '2022-01-30T00:00:00',
  'asset': 'bnb',
  'mcap.weight': 1.0,
  'mcap.value': 65455197964.9447},
 {'date': '2022-01-30T00:00:00',
  'asset': 'avax',
  'mcap.weight': 1.0,
  'mcap.value': 17939687507.8046}]
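
In the output above every asset carries `'mcap.weight': 1.0`. If weights proportional to market capitalization were wanted instead, one possible sketch is shown below. This is an assumption about a possible next step, not something the notebook or the `performIndex` package does, and `mcap_weights` is a hypothetical helper.

```python
def mcap_weights(portfolio):
    """Return a copy of the portfolio with weights proportional to mcap.value."""
    total = sum(item["mcap.value"] for item in portfolio)
    if total <= 0:
        raise ValueError("total market capitalization must be positive")
    # build new dicts so the input list is left unchanged
    return [{**item, "mcap.weight": item["mcap.value"] / total}
            for item in portfolio]


# values copied from the notebook output above
portfolio = [
    {"date": "2022-01-30T00:00:00", "asset": "btc",
     "mcap.weight": 1.0, "mcap.value": 729084397823.395},
    {"date": "2022-01-30T00:00:00", "asset": "bnb",
     "mcap.weight": 1.0, "mcap.value": 65455197964.9447},
    {"date": "2022-01-30T00:00:00", "asset": "avax",
     "mcap.weight": 1.0, "mcap.value": 17939687507.8046},
]
weighted = mcap_weights(portfolio)
for item in weighted:
    print(item["asset"], item["mcap.weight"])
```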

## Compute index values
Set the parameters to compute the desired set of index measures:
* ```__idx_type__``` (list) the indices to compute, typically 'adx','sharp','rsi','mfi','beta'
* ```_coll_dt``` (date) the evaluation date, the same date for which the assets were selected
* ```__val_col__``` (string) the column name with the asset value used in the computation ('simp_ror' in this example)
* ```__name_col__``` (string) the column name with the asset names
* ```__date_col__``` (string) the column name with the mcap date for which the value was set
* ```__rf_assets__``` (list) the 'risk free' assets to use as the baseline
* ```__rf_val_col__``` (string) the column name with the risk-free asset value ('log_ror' in this example)
* ```__rf_name_col__``` (string) the column name with the risk-free asset names
* ```__rf_date_col__``` (string) the column name with the date for which the risk-free value was set

The ```mining/modules/assets/etp/performIndex``` package returns the index values for each of the assets.

In [5]:
import pandas as pd

__idx_type__=['adx','sharp','rsi','mfi','beta']
_coll_dt=date(2022,1,30)
__val_col__="simp_ror"
__name_col__='asset_name'
__date_col__='mcap_date'
__rf_assets__=['btc']
__rf_val_col__="log_ror"
__rf_name_col__='asset_name'
__rf_date_col__='mcap_date'
_kwargs={
    "WINLENGTH":7,
    "WINUNIT":'DAY',
}
# accumulate one row of index values per asset
_res_df=pd.DataFrame()
for asset_portf in _portf:
    _idx_dict = clsIndx.get_index(
        portfolio=[asset_portf],
        asset_eval_date=_coll_dt,
        asset_name_col=__name_col__,
        asset_val_col =__val_col__,
        asset_date_col=__date_col__,
        index_type=__idx_type__,
        risk_free_assets=__rf_assets__,
        risk_free_name_col=__rf_name_col__,
        risk_free_val_col=__rf_val_col__,
        risk_free_date_col=__rf_date_col__,
        **_kwargs,
    )
    _idx_dict['asset']=asset_portf['asset']
    _res_df=pd.concat([_res_df,pd.DataFrame([_idx_dict])])
_res_df.insert(0, 'asset', _res_df.pop('asset'))
_res_df

                                                                                

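|   | asset | adx | sharp | rsi | mfi | beta |
|---|-------|-----|-------|-----|-----|------|
| 0 | btc | 0.088397 | 36.016989 | 0.451438 | 0.451438 | 0.868303 |
| 0 | bnb | 0.320226 | 34.996164 | 0.292148 | 0.292148 | 0.631958 |
| 0 | avax | 0.38621 | 20.929375 | 0.468284 | 0.468284 | 0.950068 |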
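
The notebook does not show how `get_index` computes its values. As a hedged illustration, the sketch below uses the textbook definitions of two of the requested indices: a Sharpe-style ratio (mean excess return over its standard deviation) and beta (covariance with a baseline over the baseline variance). The function names and the toy return series are assumptions, and the package may well use different formulas, windows, or scaling.

```python
import statistics


def sharpe_ratio(asset_returns, baseline_returns):
    """Mean excess return divided by the sample std. dev. of the excess return."""
    excess = [a - b for a, b in zip(asset_returns, baseline_returns)]
    spread = statistics.stdev(excess)
    if spread == 0:
        return float("nan")  # undefined when the excess return never varies
    return statistics.fmean(excess) / spread


def beta(asset_returns, baseline_returns):
    """Sample covariance with the baseline divided by the baseline variance."""
    mean_a = statistics.fmean(asset_returns)
    mean_b = statistics.fmean(baseline_returns)
    covariance = sum((a - mean_a) * (b - mean_b)
                     for a, b in zip(asset_returns, baseline_returns))
    covariance /= len(asset_returns) - 1
    return covariance / statistics.variance(baseline_returns)


# Made-up 7-period return series (illustration only, not notebook data)
baseline = [0.010, -0.020, 0.015, 0.005, -0.010, 0.020, 0.000]
asset = [2 * r for r in baseline]  # moves twice as much as the baseline

print(beta(asset, baseline))
print(sharpe_ratio(asset, [0.0] * len(asset)))
```

Seven periods are used here only to mirror the `WINLENGTH` of 7 set in the cell above.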


## Dimensionality reduction techniques
Review and select one:
* Principal Component Analysis (PCA) - check [sampling adequacy](https://statistics.laerd.com/spss-tutorials/principal-components-analysis-pca-using-spss-statistics.php)
* Generalized discriminant analysis (GDA)
* Missing Values Ratio
* Low Variance Filter
* [Feature selection](https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/) - unsupervised selection that removes the target variable; dimensionality reduction is a replacement for unsupervised correlation-based feature selection
* Feature extraction - we are not looking to find isolated signals in the data
* Non-negative matrix factorization (NMF) - we can normalize the data, which might be faster?
* Linear discriminant analysis (LDA) - a supervised technique
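
Two of the simpler options in the list, the Missing Values Ratio and the Low Variance Filter, can be sketched in a few lines. The indicator values are copied from the results table above; the function names and both thresholds are hypothetical choices for illustration. Raw variance depends on each column's scale, so in practice the columns would probably be normalized first.

```python
import statistics


def missing_ratio(values):
    """Fraction of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)


def filter_columns(table, max_missing=0.5, min_variance=0.01):
    """Keep columns with few missing values and enough variance."""
    kept = {}
    for name, values in table.items():
        if missing_ratio(values) > max_missing:
            continue  # too many gaps to be useful
        present = [v for v in values if v is not None]
        if len(present) < 2 or statistics.pvariance(present) < min_variance:
            continue  # nearly constant across assets
        kept[name] = values
    return kept


# index values for btc, bnb, avax, copied from the results table
indicators = {
    "adx": [0.088397, 0.320226, 0.38621],
    "sharp": [36.016989, 34.996164, 20.929375],
    "rsi": [0.451438, 0.292148, 0.468284],
    "mfi": [0.451438, 0.292148, 0.468284],
    "beta": [0.868303, 0.631958, 0.950068],
}
kept = filter_columns(indicators)
print(list(kept))
```

With only three assets the variances are rough estimates, so the result is indicative at best.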
