# Colin Lefter

## Research question/interests

**Which equity data are the strongest determinants of an equity's performance, and which of these data are most relevant to a growth portfolio investment strategy, so that we can compute an optimized portfolio of equities while using user input to drive our optimization algorithm?**

My research objective is to develop a scalable asset allocation and construction algorithm that implements an object-oriented design approach. This objective follows from determining which equity data most strongly determine the price of an equity, which will be the focus of the majority of the project.

I intend to develop algorithms for constructing multiple linear regressions and Fourier transforms, among others, which I will then use to construct interactive and statistical models with Plotly and Seaborn. As such, I have a strong interest in the system design of our software and in developing helper functions that can help all of us process data more efficiently. I am also looking forward to using Facebook Prophet[^1] to construct a time series forecast of a sample portfolio recommendation from our software, which can be included in our Tableau Dashboard.

### Analysis Plan
Our objective function takes a selection of columns from our data sets and searches for the top n companies that satisfy the criteria for having the highest probability of producing an optimal return on investment. These inputs themselves refer to sub-objective functions that take user-defined parameters and thresholds as input, which set the criteria for favourable performance attributes. To rank the companies in our data set, and ultimately determine what portion of capital to assign to each equity, I propose a data normalization algorithm that normalizes the favourable subset of each column of our data set. We interpret these normalized values as probabilities of equity selection, average the score of each company across all columns, and then multiply each company's final score percentage by the total capital specified by the user. In a broad sense, our software is composed of four general classes: "Data", "Quantitative Analysis", "Data Visualization" and "Portfolio Construction". We inherit the properties from each of these classes to build a functional data analysis chain.
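
The normalization and allocation step described above can be sketched as follows. This is a minimal illustration rather than the project's implementation: it assumes min-max normalization, and it rescales the averaged scores so that they sum to 1 before multiplying by the capital. The function name `allocate_capital`, the tickers and the sample values are all hypothetical.

```python
import pandas as pd


def allocate_capital(df, columns, capital):
    """Min-max normalize each column, average the scores, and split capital proportionally."""
    subset = df[columns].astype(float)
    # Min-max normalization maps each column onto [0, 1].
    # Assumption: no column is constant (a constant column would divide by zero).
    scores = (subset - subset.min()) / (subset.max() - subset.min())
    # Average each company's score across all selected columns.
    mean_score = scores.mean(axis=1)
    # Rescale so the weights sum to 1, then multiply by the total capital.
    weights = mean_score / mean_score.sum()
    return weights * capital


# Hypothetical sample data: three companies, two favourable-is-higher columns.
sample = pd.DataFrame(
    {"Net Margin": [0.10, 0.20, 0.30], "Current Ratio": [1.0, 2.0, 3.0]},
    index=["AAA", "BBB", "CCC"],
)
allocation = allocate_capital(sample, ["Net Margin", "Current Ratio"], capital=10_000.0)
```

With this scheme the lowest-ranked company receives no capital at all, which may or may not be desirable; a floor on the weights would be one way to change that.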

Our data visualization will be concerned with analyzing the influence of certain financial variables, such as Price-to-Earnings, on the price of each equity from a sample of 500 equities (from the S&P 500 index). Such analysis would begin with a statistical summary that will constitute exploratory data analysis, followed by our application of analysis algorithms that we design. The construction of a portfolio is a bonus of our project and will be made possible by the analysis algorithms we have constructed.

**Important Note**
A component of the analysis will involve comparing different values of financial variables with the corresponding price of each equity. This constitutes inferential analysis, as we are attempting to identify correlations that can be used to pick stocks based on expected performance. This requires us to take past financial data and compare it with the current price of each equity. As a result, we can only use the 3-month performance data (i.e. the 3-month change in share price) for this comparison; otherwise we would be using future data to predict past performance, which would be invalid.

#### User-defined parameters
Some initial ideas for these parameters include:
- (float) Initial capital
- (float) Additional capital per day, week or month
- (int) Intended holding period (in days)
- (boolean) Importance of dividends (validated based on capital invested)
- (String) Preferred industries (choose from a list, or select all)
- (float) Volatility tolerance (from 0 to 1, 1 indicating that volatility is not important)
- (String) Preferred companies (as a list)[^2]
- (float) Preferred degree of portfolio diversification (from 0 to 1, 1 indicating complete diversification)
- (String) Preferred investment strategy (choose from "Growth", "Value", "GARP")
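
One way these parameters could be grouped and validated is a small dataclass. The sketch below is only an assumption about how the inputs might be represented: the class name `UserParameters`, the field names and the default values are hypothetical and are not part of the project code.

```python
from dataclasses import dataclass, field


@dataclass
class UserParameters:
    """Hypothetical container for the user-defined parameters listed above."""
    initial_capital: float
    additional_capital: float = 0.0
    contribution_period: str = "month"  # "day", "week" or "month"
    holding_period_days: int = 365
    dividends_important: bool = False
    preferred_industries: list = field(default_factory=list)  # empty list = all industries
    volatility_tolerance: float = 0.5  # 0 to 1, 1 = volatility is not important
    preferred_companies: list = field(default_factory=list)
    diversification: float = 0.5  # 0 to 1, 1 = complete diversification
    strategy: str = "Growth"

    def __post_init__(self):
        # Validate the ranges and choices described in the list above.
        if self.initial_capital <= 0:
            raise ValueError("initial_capital must be positive")
        for name in ("volatility_tolerance", "diversification"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be between 0 and 1")
        if self.strategy not in ("Growth", "Value", "GARP"):
            raise ValueError("strategy must be 'Growth', 'Value' or 'GARP'")


params = UserParameters(initial_capital=5000.0, strategy="GARP")
```

Validating in `__post_init__` means an invalid combination fails as soon as the parameters are created, before any screening or ranking is run.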

### Algorithm Plan

####  Tier 1: Threshold-based screening algorithms
- The current plan is to use these algorithms to screen the financial documents from each company by setting a minimum threshold for each financial ratio. This class of algorithms will need to conduct such screening per industry, as financial ratios are distinct from one industry to another.
- A global screening algorithm that selects companies which show favourable performance across all ratios can also be used after each ratio has been individually tested.
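
A per-industry screen followed by a global screen could look like the sketch below. It assumes a `Sector` column and per-sector minimum thresholds supplied as nested dictionaries; the function names, tickers and threshold values are hypothetical.

```python
import pandas as pd


def screen_by_sector(df, ratio, thresholds, sector_col="Sector"):
    """Keep the rows whose `ratio` meets the minimum threshold for their sector."""
    minimums = df[sector_col].map(thresholds)
    # Sectors without a threshold map to NaN; the comparison is then False,
    # so companies from those sectors are dropped.
    return df[df[ratio] >= minimums]


def screen_all(df, thresholds_by_ratio, sector_col="Sector"):
    """Global screen: keep only companies that pass every individual ratio screen."""
    passed = df
    for ratio, thresholds in thresholds_by_ratio.items():
        passed = screen_by_sector(passed, ratio, thresholds, sector_col)
    return passed


# Hypothetical sample data and thresholds.
companies = pd.DataFrame(
    {
        "Sector": ["Technology", "Technology", "Utilities", "Utilities"],
        "Current Ratio (MRQ)": [2.5, 0.9, 1.1, 0.6],
        "Net Margin (TTM)": [0.25, 0.30, 0.08, 0.12],
    },
    index=["AAA", "BBB", "CCC", "DDD"],
)
thresholds = {
    "Current Ratio (MRQ)": {"Technology": 1.5, "Utilities": 0.8},
    "Net Margin (TTM)": {"Technology": 0.15, "Utilities": 0.10},
}
selected = screen_all(companies, thresholds)
```

In this sample only `AAA` clears both thresholds for its sector.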

#### Tier 2: Regression models
- As of now, the intent is to develop a multiple linear regression model that will attempt to determine a relationship between the yearly and quarterly performance of each company in relation to several columns of data that act as predictors. This can essentially implement the results from the threshold-based screening algorithms to only conduct this analysis on the pre-screened companies.
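
As a rough illustration of the regression step, a multiple linear regression can be fitted by ordinary least squares with NumPy alone. The sketch below uses synthetic data; the function name `fit_multiple_linear_regression` is hypothetical and is not the signature of the project's `multiple_linear_regression` method.

```python
import numpy as np


def fit_multiple_linear_regression(X, y):
    """Ordinary least squares fit; returns (intercept, coefficients, r_squared)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the first fitted parameter is the intercept.
    design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    ss_res = float(residuals @ residuals)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return beta[0], beta[1:], 1.0 - ss_res / ss_tot


# Synthetic example: y depends linearly on two predictors plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.01, size=200)
intercept, coefs, r2 = fit_multiple_linear_regression(X, y)
```

On real equity data the predictors would be the pre-screened columns and the target a performance column; highly correlated predictors would make the individual coefficients unstable, so they should be interpreted with care.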

#### Tier 3: Statistical modelling algorithms
- Tier 3 denotes a class of broadly experimental statistical modelling algorithms that are applied on a pre-final portfolio to add additional points to companies that perform exceptionally well compared to others in the portfolio. For now, these algorithms constitute signal processing algorithms such as a Fourier Transform algorithm that attempts to identify peaks in numerical values that would otherwise not be apparent when examined in isolation and without further processing. Therefore, these algorithms will be used to fine-tune the capital allocation percentages for each company in the pre-final portfolio.
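
To make the Fourier Transform idea concrete, the sketch below finds the dominant periodic components of a numerical series with NumPy's FFT. It runs on a synthetic signal; the function name `dominant_periods` is hypothetical, and how such peaks would translate into extra points for a company is left open here.

```python
import numpy as np


def dominant_periods(values, top_n=1):
    """Return the periods (in samples) of the strongest non-constant frequency components."""
    values = np.asarray(values, dtype=float)
    # Remove the mean so the zero-frequency term does not dominate the spectrum.
    spectrum = np.abs(np.fft.rfft(values - values.mean()))
    freqs = np.fft.rfftfreq(len(values), d=1.0)
    # Skip bin 0 (zero frequency), then rank the remaining bins by amplitude.
    order = np.argsort(spectrum[1:])[::-1][:top_n] + 1
    return [1.0 / freqs[i] for i in order]


# Synthetic series: a strong cycle of 32 samples plus a weaker cycle of 8 samples.
t = np.arange(256)
series = np.sin(2 * np.pi * t / 32) + 0.2 * np.sin(2 * np.pi * t / 8)
periods = dominant_periods(series, top_n=2)
```

For this synthetic series the two recovered periods are 32 and 8 samples, in that order, because both cycles fit a whole number of times into the 256-sample window.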

#### Columns of relevance
Data set 1: Overview
- Price
- MKT Cap
- P/E
- EPS
- Sector

Data set 2: Performance
- 1M change (1 month change)
- 3-month performance
- 6-month performance
- YTD performance
- Yearly performance
- Volatility

Data set 3: Valuation
- Price / revenue
- Enterprise value

Data set 4: Dividends
- Dividend yield FWD
- Dividends per share (FY)

Data set 5: Margins
- Gross profit margin
- Operating margin
- Net profit margin

Data set 6: Income Statement
- Gross profit
- Income
- Net cash flow

Data set 7: Balance Sheet
- Current ratio
- Debt/equity
- Quick ratio

The total number of columns would be 24 in this case.

[^1]: This would mean that a few time series data sets would need to be downloaded from TradingView at the end of the project to test the demo portfolio.

[^2]: A helper function can be developed for this, where the user can just type out the name of the company and the ticker is identified.

In [63]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as mplt
import plotly.graph_objects as go
from IPython.display import display, HTML, Markdown, Latex
from plotly.subplots import make_subplots
import sys
sys.path.append('..')
from analysis.code import project_functions1 as pf

In [64]:
# sns.set is an alias of sns.set_theme, so the theme is configured in a single call.
sns.set_theme(style="ticks", context="talk", rc={"figure.dpi": 300, "savefig.dpi": 300})
mplt.style.use("dark_background")

config = {
  'toImageButtonOptions': {
    'format': 'png',
    'filename': 'custom_image',
    'height': 3500,
    'width': 2000,
    'scale': 2
  }
}

# Data Loading

In [65]:
equities = pf.EquityData()
overview_df = equities.load_and_process("overview", exclude_columns=['Change %', 'Change', 'Technical Rating', 'Volume', 'Volume*Price'])
income_statement_df = equities.load_and_process("income_statement")
balance_sheet_df = equities.load_and_process("balance_sheet")
dividends_df = equities.load_and_process("dividends", exclude_columns=['Price'])
margins_df = equities.load_and_process("margins")
performance_df = equities.load_and_process("performance", exclude_columns=['Change 1m, %', 'Change 5m, %', 'Change 15m, %', 'Change 1h, %', 'Change 4h, %', 'Change 1W, %', 'Change 1M, %', 'Change %'])
valuation_df = equities.load_and_process("valuation", exclude_columns=['Price', 'Market Capitalization', 'Price to Earnings Ratio (TTM)', 'Basic EPS (TTM)', 'EPS Diluted (FY)'])

### Analysis-Specific Data Wrangling

In [66]:
dfs = [
    overview_df,
    income_statement_df,
    balance_sheet_df,
    dividends_df,
    margins_df,
    performance_df,
    valuation_df
    ]

# Names are listed in the same order as the data frames in dfs.
dfs_names = [
    "Overview Data",
    "Income Statement Data",
    "Balance Sheet Data",
    "Dividends Data",
    "Margins Data",
    "Performance Data",
    "Valuation Data"
    ]

overview_df['3-Month Performance'] = performance_df['3-Month Performance']
income_statement_df['3-Month Performance'] = performance_df['3-Month Performance']
balance_sheet_df['3-Month Performance'] = performance_df['3-Month Performance']
dividends_df['3-Month Performance'] = performance_df['3-Month Performance']
margins_df['3-Month Performance'] = performance_df['3-Month Performance']
valuation_df['3-Month Performance'] = performance_df['3-Month Performance']

mega_df = pd.concat(dfs, axis=1)
mega_df = mega_df.loc[:,~mega_df.columns.duplicated()].copy()
mega_df = mega_df.dropna()
mega_df_no_strings = mega_df.select_dtypes(exclude='object')

mega_df['6-Month Performance'] = performance_df['6-Month Performance']
mega_df['YTD Performance'] = performance_df['YTD Performance']
mega_df['Yearly Performance'] = performance_df['Yearly Performance']

# Experimental Feature Development Zone

In [67]:
quant = pf.QuantitativeAnalysis()
viz = pf.DataVisualization()

In [68]:
#test = QuantitativeAnalysis()
#test.rank(mega_df, '6-Month Performance', inplace=True) # taking 6-month performance and inplace=True leads the norm algorithm to skip row 493
#mega_df

# Analysis Zone
## Note: to view the interactive graphs plotted, run this analysis notebook in a Jupyter Notebook environment

In [69]:
#viz.correlation_plot(mega_df, 'S&P 500 Equity Data')
fig = viz.heatmap_plot(mega_df, 'Correlation Plot of S&P500 Equity Data', number_of_companies=500, correlation_plot=True)

config_2 = {
  'toImageButtonOptions': {
    'format': 'png',
    'filename': 'custom_image',
    'height': 2000,
    'width': 2500,
  }
}

fig.show(config=config_2)





Correlation coefficients greater than or equal to 0.7 are considered strong positive correlations; likewise, coefficients less than or equal to -0.7 are considered strong negative correlations. The purpose of this correlation plot is to identify single variables that are correlated with positive returns in many other variables that are often considered benchmarks of strong equity performance. Even more significant is the identification of variables that show no apparent correlation when viewed in isolation, but do when paired together in a regression. Once such cases are established, certain variables can be assigned a stronger weight than others when processed by the normalization algorithm that underlies the equity ranking system in our asset allocation and construction algorithm. Such variables can later be grouped as predictors in a multiple linear regression model for further analysis. The results from the correlation plot can be classified as follows:

| X-Value(s) | Strong Positive Y-Values (r >= 0.7) |
| --- | --- |
| Market Capitalization | Total Shares Outstanding, Net Income (FY), Gross Profit (FY), Gross Profit (MRQ), EBITDA (TTM), Total Current Assets (MRQ) |
| Basic EPS (TTM) | EPS Diluted (FY), EPS Diluted (TTM), Basic EPS (FY) |
| EBITDA (TTM), Gross Profit (MRQ), Gross Profit (FY) | Total Shares Outstanding, Enterprise Value (MRQ), Total Current Assets (MRQ), Total Assets (MRQ), Net Income (FY), Gross Profit (FY), Gross Profit (MRQ), Total Debt (MRQ), Last Year Revenue (FY), Total Revenue (FY)|
| Total Revenue (FY) | Total Current Assets (MRQ), Last Year Revenue (FY), Total Assets (MRQ) |
| Last Year Revenue (FY) | Total Current Assets (MRQ), Total Assets (MRQ) |
| Current Ratio (MRQ) | Quick Ratio (MRQ), Total Shares Outstanding, Enterprise Value (MRQ) |
| Total Assets (MRQ) | Total Current Assets (MRQ), Total Debt (MRQ) |
| Operating Margin (TTM) | Net Margin (TTM), Pretax Margin (TTM) |
| Enterprise Value (MRQ) | Total Shares Outstanding |
| Number of Employees | Last Year Revenue (FY), Total Revenue (FY) |
| Net Income (FY) | Total Shares Outstanding, Enterprise Value (MRQ), Total Current Assets (MRQ), Total Assets (MRQ) |
| Net Debt (MRQ) | Total Assets (MRQ) |
| Gross Margin (TTM) | Price to Revenue Ratio (TTM), Net Margin (TTM), Pretax Margin (TTM), Operating Margin (TTM) |
| Price to Revenue Ratio (TTM) | Enterprise Value / EBITDA (TTM) |
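
Pairs like those in the table above could also be extracted programmatically rather than read off the plot. The sketch below is one possible approach, assuming Pearson correlation over the numeric columns of a data frame; the function name `strong_pairs` and the sample data are hypothetical.

```python
import numpy as np
import pandas as pd


def strong_pairs(df, threshold=0.7):
    """List column pairs whose absolute Pearson correlation is at least `threshold`."""
    corr = df.select_dtypes(include="number").corr()
    # Keep only the upper triangle so each pair appears once
    # and self-correlations on the diagonal are excluded.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # The comparison is False for NaN, so masked entries never pass the filter.
    return pairs[pairs.abs() >= threshold].sort_values(ascending=False)


# Hypothetical sample: "b" is roughly twice "a", while "c" is only weakly related.
data = pd.DataFrame(
    {
        "a": [1.0, 2.0, 3.0, 4.0, 5.0],
        "b": [2.1, 3.9, 6.2, 8.1, 9.8],
        "c": [5.0, 1.0, 4.0, 2.0, 3.0],
    }
)
result = strong_pairs(data)
```

Using the absolute value means strong negative correlations are reported alongside strong positive ones; the sign of each coefficient is preserved in the returned values.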

It should be noted, however, that many of the X values show a high correlation with other Y values because those Y values are derived from the X value, or vice versa. Taking this into account and removing such derived pairs gives the following results:

| X-Value(s) | Strong Positive Y-Values (r >= 0.7) |
| --- | --- |
| Market Capitalization | Total Shares Outstanding, Net Income (FY), Gross Profit (FY), Gross Profit (MRQ), EBITDA (TTM), Total Current Assets (MRQ) |
| EBITDA (TTM), Gross Profit (MRQ), Gross Profit (FY) | Total Shares Outstanding, Enterprise Value (MRQ), Total Current Assets (MRQ), Total Assets (MRQ), Net Income (FY), Gross Profit (FY), Gross Profit (MRQ), Total Debt (MRQ), Last Year Revenue (FY), Total Revenue (FY) |
| Current Ratio (MRQ) | Quick Ratio (MRQ), Total Shares Outstanding, Enterprise Value (MRQ) |
| Enterprise Value (MRQ) | Total Shares Outstanding |
| Number of Employees | Last Year Revenue (FY), Total Revenue (FY) |
| Net Income (FY) | Total Shares Outstanding, Enterprise Value (MRQ), Total Current Assets (MRQ), Total Assets (MRQ) |
| Price to Revenue Ratio (TTM) | Enterprise Value / EBITDA (TTM) |

Restricting the table further to pairs of variables whose correlation is not immediately obvious yields the following:

| X-Value(s) | Strong Positive Y-Values (r >= 0.7) |
| --- | --- |
| EBITDA (TTM), Gross Profit (MRQ), Gross Profit (FY) | Total Shares Outstanding |
| Current Ratio (MRQ) | Total Shares Outstanding |
| Enterprise Value (MRQ) | Total Shares Outstanding |
| Number of Employees | Last Year Revenue (FY), Total Revenue (FY) |
| Net Income (FY) | Total Shares Outstanding |

One use of these observations is to take the correlation coefficients of these variables and apply them as multipliers to the normalized values of each respective column, so that equity data more strongly associated with a positive return on investment are prioritized over others.

Another use of these observations is to analyze the density plots created during the EDA phase and focus solely on the distributions of the variables from the second or third table above that are skewed towards low normalized values. The logic is that the high-value outliers in such positively skewed distributions are more likely to indicate strong performance, because those companies excel in a financial ratio in which very few companies excel.
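
Columns whose values cluster at the low end with a few high outliers have a positive sample skewness, so they could be picked out numerically instead of by eye. The sketch below assumes pandas' `skew` as the measure and an arbitrary cut-off of 1.0; the function name `right_skewed_columns` and the sample values are hypothetical.

```python
import pandas as pd


def right_skewed_columns(df, min_skew=1.0):
    """Return the numeric columns whose sample skewness is at least `min_skew`."""
    skew = df.select_dtypes(include="number").skew()
    return skew[skew >= min_skew].sort_values(ascending=False)


# Hypothetical sample: "x" has most values near zero and one large outlier,
# while "y" is evenly spread and therefore has zero skewness.
sample = pd.DataFrame(
    {
        "x": [0.10, 0.10, 0.20, 0.20, 0.10, 0.15, 0.10, 5.00],
        "y": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    }
)
skewed = right_skewed_columns(sample)
```

The cut-off is a judgment call; it would need to be tuned against the density plots from the EDA phase.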

The weighted scores from each analysis can then be aggregated into a single weighted scoring system.
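
A weighted scoring system of this kind could be sketched as below. It assumes min-max normalization and multipliers supplied by hand (for example, correlation coefficients read from the correlation plot); the function name `weighted_score` and the multiplier values are hypothetical and chosen purely for illustration.

```python
import pandas as pd


def weighted_score(df, multipliers):
    """Weighted average of min-max normalized columns, using one multiplier per column."""
    columns = list(multipliers)
    subset = df[columns].astype(float)
    # Assumption: no column is constant (a constant column would divide by zero).
    normalized = (subset - subset.min()) / (subset.max() - subset.min())
    weights = pd.Series(multipliers, dtype=float)
    # Multiply each column by its weight, sum per company, and rescale to [0, 1].
    return normalized.mul(weights, axis=1).sum(axis=1) / weights.sum()


# Hypothetical sample data and multipliers.
sample = pd.DataFrame(
    {"Net Income (FY)": [1.0, 2.0, 3.0], "Number of Employees": [30.0, 10.0, 20.0]},
    index=["AAA", "BBB", "CCC"],
)
scores = weighted_score(sample, {"Net Income (FY)": 0.8, "Number of Employees": 0.2})
```

Dividing by the sum of the weights keeps the final score between 0 and 1, so scores from different analyses remain comparable when they are aggregated.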

In [70]:
predictors = [
    ['Market Capitalization',
    'Basic EPS (TTM)',
    'EBITDA (TTM)',
    'Gross Profit (MRQ)'],
    ['Gross Profit (FY)',
    'Total Revenue (FY)',
    'Last Year Revenue (FY)',
    'Current Ratio (MRQ)'],
    ['Total Assets (MRQ)',
    'Operating Margin (TTM)',
    'Enterprise Value (MRQ)',
    'Number of Employees'],
    ['Net Income (FY)',
    'Net Debt (MRQ)',
    'Gross Margin (TTM)',
    'Price to Revenue Ratio (TTM)']
    ]

In [71]:
# target_y = 'Market Capitalization Score'
# mega_dfx = mega_df.select_dtypes(exclude='object')
# predictors = [x for x in mega_dfx.columns]
# #predictors = ['Net Income (FY)', 'Number of Employees', 'Current Ratio (MRQ)']

# for predictor in predictors:
#     quant.rank(mega_df, predictor)

# predictors = [x + " Score" for x in predictors]
# predictors
# quant.multiple_linear_regression(mega_df, predictors, target_y) # result for market cap score seems suspicious if you plot the graph

In [72]:
# fig = px.scatter(
#     mega_df, x=predictors, y=target_y, opacity=0.65,
#     trendline_color_override='darkblue'
# )
# #fig.show()

In [73]:
def subplot_generator(predictors, height_reduction_factor=8, horizontal_spacing=0.02, vertical_spacing=0.005):
    """Build a facet grid with one row per predictor: bottom 20 companies (left), top 20 (right)."""
    fig = make_subplots(
        rows=len(predictors),
        cols=2,
        shared_yaxes=True,
        column_titles=['Bottom 20 Companies', 'Top 20 Companies'],
        horizontal_spacing=horizontal_spacing,
        vertical_spacing=vertical_spacing,
        row_titles=['Sorted by ' + predictor for predictor in predictors]
        )

    # Plotly subplot rows and columns are 1-indexed.
    for row, predictor in enumerate(predictors, start=1):
        # Column 1 holds the bottom companies, column 2 the top companies.
        for col, plot_last in ((1, True), (2, False)):
            fig.add_trace(
                trace=viz.heatmap_plot(
                    df=mega_df,
                    plot_last_companies=plot_last,
                    sort_by=predictor).data[0],
                row=row,
                col=col
            )

    # The figure height scales with the number of rows. With the default factor of 8,
    # fewer than nine predictors gives a non-positive height, so callers pass a smaller factor.
    height_multiplier = len(predictors) - height_reduction_factor
    fig.update_layout(
        title_text='Heat Map Facet Grid of Normalized Equity Data from the Top and Bottom 20 Companies in the S&P500 Index by Predictor (1 = Best, 0 = Worst)',
        template='plotly_dark',
        width=1500,
        height=1500*height_multiplier)

    return fig


for predictor_set in predictors:
    # fig.show() returns None, so wrapping it in display() is unnecessary.
    subplot_generator(predictor_set, 2, vertical_spacing=0.02).show(config=config)
    
#large_fig_2 = subplot_generator(predictors[1], 4, vertical_spacing=0.009)

Exception: The (row, col) pair sent is out of range. Use Figure.print_grid to view the subplot grid. 

The heatmap plots validate the strong correlation between market capitalization and overall equity performance across all other variables. Therefore, market capitalization is a strong determinant of the likelihood of a good return on investment, as larger companies are more likely to perform better due to being more established in the market. The bottom 20 companies from the S&P 500 index perform poorly across almost all variables, in direct contrast to those in the top 20 by market capitalization.