# Colin Lefter

## Research question/interests

**What equity data is the most deterministic of the performance of an equity, and of this data, which is the most relevant for a growth portfolio investment strategy such that we can compute an optimized portfolio of equities while using user input to drive our optimization algorithm?**

My research objective is to develop a scalable asset allocation and construction algorithm that implements an objected-oriented design approach. This objective is an outcome of determining what equity data is the most deterministic of the price of an equity, which will be the focus for the majority of the project/

I intend to develop algorithms for constructing multiple linear regressions and Fourier Franforms, among others, that I will then use to construct interactive and statistical models with Plotly and Seaborn. As such, I have a strong interest in the system design of our software and in developing helper functions that can assist all of us with processing data more efficiently. I am also looking forward to using Facebook Prophet[^1] to construct a time series forecast of a sample portfolio recommendation from our software, which can be included in our Tableau Dashboard.

### Analysis Plan
Our objective function is one that takes in a selection of columns from our data sets to then search for the top n companies that satisfy a criteria for having the highest probability of producing an optimal return on investment. These inputs themselves refer to sub-objective functions that take as input user-defined parameters and thresholds that set the criteria for favourable performance attributes. To rank the companies from our data set, and ultimately determine what portion of capital to assign to each equity, I propose a data normalization algorithm that normalizes the data that comprises the favourable subset from each column of our data set. We interpret these normalized values as probabilities of equity selection and ultimately average the score of each company across all columns to then multiply the final score percentage of each company with the total capital specified by the user. In a broad sense, our software is composed of four general classes that include "Data", "Quantitative Analysis", "Data Visualization" and "Portfolio Construction". We inherit the properties from each of these classes to build a functional data analysis chain.

Our data visualization will be concerned with analyzing the influence of certain financial variables, such as Price-to-Earnings, on the price of each equity from a sample of 500 equities (from the S&P 500 index). Such analysis would begin with a statistical summary that will constitute exploratory data analysis, followed by our application of analysis algorithms that we design. The construction of a portfolio is a bonus of our project and will be made possible by the analysis algorithms we have constructed.

**Important Note**
A component of the analysis will involve the comparison of different values of financial variables with the corresponding price of each equity. This constitutes inferential analysis as we are attempting to identify a correlation on the basis of picking stocks based on expected performance. Therefore, this will require us to use past financial data and compare this data with the current price of each equity. As a result, we can only use the 3-month performance data (i.e. 3-month change in share price data) for this comparison as otherwise we would be using future data to predict past performance, which would be invalid.

#### User-defined parameters
Some initial ideas for these parameters include:
- (float) Initial capital
- (float) Additional capital per day, week or month
- (int) Intended holding period (in days)
- (boolean) Importance of dividends (validated based on capital invested)
- (String) Preferred industries (choose from a list, or select all)
- (int) Volatility tolerance (from 0 to 1, 1 indicating that volatility is not important)
- (String) Preferred companies (as a list)[^2]
- (int) Preferred degree of portfolio diversification (from 0 to 1, 1 indicating complete diversification)
- (String) Preferred investment strategy (choose from "Growth", "Value", "GARP")

### Algorithm Plan

####  Tier 1: Threshold-based screening algorithms
- The current plan is to use these algorithms to screen the financial documents from each company by setting a minimum threshold for each financial ratio. This class of algorithms will need to conduct such screening per industry as industry financial ratios are dinstinct from one another.
- A global screening algorithm that selects companies which show favourable performance across all ratios can also be used after each ratio has been individually tested.

#### Tier 2: Regression models
- As of now, the intent is to develop a multiple linear regression model that will attempt to determine a relationship between the yearly and quarterly performance of each company in relation to several columns of data that act as predictors. This can essentially implement the results from the threshold-based screening algorithms to only conduct this analysis on the pre-screened companies.

#### Tier 3: Statistical modelling algorithms
- Tier 3 denotes a class of broadly experimental statistical modelling algorithms that are applied on a pre-final portfolio to add additional points to companies that perform exceptionally well compared to others in the portfolio. For now, these algorithms constitute signal processing algorithms such as a Fourier Transform algorithm that attempts to identify peaks in numerical values that would otherwise not be apparent when examined in isolation and without further processing. Therefore, these algorithms will be used to fine-tune the capital allocation percentages for each company in the pre-final portfolio.

#### Columns of relevance
Data set 1: Overview
- Price
- MKT Cap
- P/E
- EPS
- Sector

Data set 2: Performance
- 1M change (1 month change)
- 3-Month performance
- 6-month perfromance
- YTD performance
- Yearly performance
- Volatility

Data set 3: Valuation
- Price / revenue
- Enterprise value

Data set 4: Dividends
- Dividend yield FWD
- Dividends per share (FY)

Data set 5: Margins
- Gross profit margin
- Operating margin
- Net profit margin

Data set 6: Income Statement
- Gross profit
- Income
- Net cash flow

Data set 7: Balance Sheet
- Current ratio
- Debt/equity
- Quick ratio

The total number of columns would be 24 in this case.

[^1]: This would mean that a few time series data sets would need to be downloaded from TradingView at the end of the project to test the demo porfolio.

[^2]: A helper function can be developed for this, where the user can just type out the name of the company and the ticker is identifed.

In [613]:
import pandas as pd
import plotly as plt
import seaborn as sns
import numpy as np
import datetime as dt
import matplotlib.pyplot as mplt
from sklearn.linear_model import LinearRegression
from sklearn import preprocessing
import plotly.graph_objects as go
import plotly.express as px
import plotly.figure_factory as ff
from IPython.display import display, HTML, Markdown, Latex
from tqdm import tqdm, trange
from typing import *
from dataclasses import dataclass
from scipy import stats
import plotly.io as pio
import sys
from sklearn.model_selection import train_test_split
from sklearn import metrics
sys.path.append('..')
from analysis.code import project_functions1 as pf

In [614]:
sns.set_theme(style="darkgrid")
sns.set(rc={"figure.dpi":300, 'savefig.dpi':300})
sns.set(style="ticks", context="talk")
mplt.style.use("dark_background")

config = {
  'toImageButtonOptions': {
    'format': 'png',
    'filename': 'custom_image',
    'height': 800,
    'width': 2000,
    'scale': 2
  }
}

In [615]:
@dataclass
class ValueRange:
    min: float
    max: float
    
    def validate(self, x):
        """Checks if inputs to variables that must lie within a specific range are valid
        
        :x: the value that must be checked as satisfying the specified range
        :raises ValueError: if the value does not lie within the specified range
        """
        if not (self.min <= x <= self.max):
            raise ValueError(f'{x} must be between 0 and 1 (including).')

# Data Loading

In [616]:
equities = pf.EquityData()
overview_df = equities.load_and_process("overview", exclude_columns=['Change %', 'Change', 'Technical Rating', 'Volume', 'Volume*Price'])
income_statement_df = equities.load_and_process("income_statement")
balance_sheet_df = equities.load_and_process("balance_sheet")
dividends_df = equities.load_and_process("dividends", exclude_columns=['Price'])
margins_df = equities.load_and_process("margins")
performance_df = equities.load_and_process("performance", exclude_columns=['Change 1m, %', 'Change 5m, %', 'Change 15m, %', 'Change 1h, %', 'Change 4h, %', 'Change 1W, %', 'Change 1M, %', 'Change %'])
valuation_df = equities.load_and_process("valuation", exclude_columns=['Price', 'Market Capitalization', 'Price to Earnings Ratio (TTM)', 'Basic EPS (TTM)', 'EPS Diluted (FY)'])

### Analysis-Specific Data Wrangling

In [617]:
dfs = [
    overview_df,
    income_statement_df,
    balance_sheet_df,
    dividends_df,
    margins_df,
    performance_df,
    valuation_df
    ]

dfs_names = [
    "Overview Data",
    "Balance Sheet Data",
    "Dividends Data",
    "Income Statement Data",
    "Margins Data",
    "Performance Data",
    "Valuation Data"
    ]

overview_df['3-Month Performance'] = performance_df['3-Month Performance']
income_statement_df['3-Month Performance'] = performance_df['3-Month Performance']
balance_sheet_df['3-Month Performance'] = performance_df['3-Month Performance']
dividends_df['3-Month Performance'] = performance_df['3-Month Performance']
margins_df['3-Month Performance'] = performance_df['3-Month Performance']
valuation_df['3-Month Performance'] = performance_df['3-Month Performance']

mega_df = pd.concat(dfs, axis=1)
mega_df = mega_df.loc[:,~mega_df.columns.duplicated()].copy()
mega_df = mega_df.dropna()
mega_df_no_strings = mega_df.select_dtypes(exclude='object')

mega_df['6-Month Performance'] = performance_df['6-Month Performance']
mega_df['YTD Performance'] = performance_df['YTD Performance']
mega_df['Yearly Performance'] = performance_df['Yearly Performance']

# Functional Classes

In [618]:
class QuantitativeAnalysis:
    def __init__(self, number_of_companies: int=500, initial_capital: float=100000.00, capital_per_period: float=100.00, period: int=7, dividends_importance: bool=False, preferred_industries: list=["Technology Services, Electronic Technology"],
                volatility_tolerance: Annotated[float, ValueRange(0.0, 1.0)]=0.7, preferred_companies: list=["Apple, Google, Microsoft, Amazon"], diversification: Annotated[float, ValueRange(0.0, 1.0)]=0.4, investment_strategy: str="Growth"):
        """Includes several analysis functions that process select data across all data sets

        :number_of_companies: the number of companies included in the sample, with the default being those from the S&P500 Index\n
        :initial_capital: the initial amount of cash to be invested by the client, in USD\n
        :capital_per_period: the amount of cash to be invested by the client at a fixed rate in addition to the initial capital invested, in USD\n
        :period: the frequency (in days) at which additional cash is invested, if desired\n
        :dividends_importance: specifies whether dividends are important to the client, dictating whether analysis algorithms should place greater importance on dividends\n
        :preferred_industries: specifies a list of industries that the analysis algorithms should prioritize when constructing the investment portfolio\n
        :volatility_tolerance: accepts a range of values from 0 to 1, with 1 implying maximum volatility tolerance (i.e. the client is willing to lose 100% of their investment to take on more risk)\n
        :preferred_companies: specifies a list of companies that the analysis algorithms will accomodate in the final portfolio irrespective of their score\n
        :diversification: accepts a range of values from 0 to 1, with 1 implying maximum diversification (i.e. funds will be distributed evenly across all industries and equally among all companies)\n
        :investment_strategy: specifies the investment strategy that will guide the output of the analysis algorithms, in which this analysis notebook strictly focuses on growth investing\n
        :raises: ValueError if an input parameter does not satisfy its accepted range
        """
        
        self.number_of_companies = number_of_companies
        self.initial_capital = initial_capital
        self.capital_per_period = capital_per_period
        self.period = period
        self.dividends_importance = dividends_importance
        self.preferred_industries = preferred_industries
        self.volatility_tolerance = volatility_tolerance
        self.preferred_companies = preferred_companies
        self.diversification = diversification
        self.preferred_companies = preferred_industries
        self.investment_strategy = investment_strategy
        
    def lin_reg_coef_determination(self, df: pd.DataFrame, X: str, y: str='3-Month Performance', filter_outliers: bool=True) -> np.float64:
        if filter_outliers:
            df = self.outlier_filtered_df(df, col=y)
        
        X = df[X]
        y = df[y]
        
        y = y.dropna()
        X = X.dropna()
        
        if len(X) > len(y):
            X = X[:len(y)]
        else:
            y = y[:len(X)]
        
        self.X = np.array(X).reshape(-1, 1)
        self.y = np.array(y).reshape(-1, 1)
        
        model = LinearRegression()
        model.fit(self.X, self.y)
         
        return model.score(self.X, self.y)

    def get_lin_reg_coefs(self, df: pd.DataFrame, x_values: list(), y_value: str='3-Month Performance') -> pd.DataFrame:
        """Returns a Pandas DataFrame with the coefficients of determination for each y-on-x regression
        Example: 3-Month Performance against Price to Earnings Ratio (TTM)
        
        :df: the data frame that contains the columns to process\n
        :x_values: a list of strings of the names of each column to process\n
        :y_value: a common y-value to map each x value against in the regression analysis\n
        :returns: A Pandas DataFrame with the coefficients of determination for each y-on-x regression\n
        
        """
        self.coef_dict = dict.fromkeys(x_values, 0) # initialize a dict with all the columns assigned to a value of 0
        
        for predictor in tqdm(x_values, desc="Constructing linear regression models", total=len(x_values)):
            self.coef_dict[predictor] = self.lin_reg_coef_determination(df, X=predictor, y=y_value)
        
        self.processed_df = pd.DataFrame(list(zip(self.coef_dict.keys(), self.coef_dict.values())), columns=[f'Equity Data Against {y_value}', 'Coefficient of Determination'])
        
        return self.processed_df
        
    def multiple_linear_regression(self, df: pd.DataFrame, predictors: list(), target_y: str='Market Capitalization') -> pd.DataFrame:
        """Consturcts a multiple linear regression model
        :df: a Pandas DataFrame containing the data to be processed
        :predictors: the x values that will be used to predict the target y value
        :target_y: the y value to be predicted
        :returns: a Pandas DataFrame containing a statistical summary of the performance of the model
        """
        df = df.select_dtypes(exclude='object')
        
        if target_y in predictors:
            predictors.remove(target_y) # so you don't have a perfect correlation for the same variable

        X = mega_df[predictors]
        y = mega_df[target_y]

        x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=100)
        mlr = LinearRegression()
        mlr.fit(x_train, y_train)
        y_pred_mlr = mlr.predict(x_test)

        mlr_diff = pd.DataFrame({'Actual value': y_test, 'Predicted value': y_pred_mlr})
        mlr_diff.head()

        meanAbErr = metrics.mean_absolute_error(y_test, y_pred_mlr)
        meanSqErr = metrics.mean_squared_error(y_test, y_pred_mlr)
        rootMeanSqErr = np.sqrt(metrics.mean_squared_error(y_test, y_pred_mlr))
        
        results = {'R squared': mlr.score(X,y) * 100, 'Mean Absolute Error': meanAbErr, 'Mean Square Error': meanSqErr, 'Root Mean Square Error': rootMeanSqErr}
        results_df = pd.DataFrame(results, index=['Model Results'])
        return results_df
    
    def fourier_transform(self):
        pass
    
    def rank(self, df: pd.DataFrame, col: str, normalize_only: bool=True, threshold: float=1.5,
             below_threshold: bool=True, filter_outliers: bool=True, normalize_after: bool=False,
             lower_quantile: float=0.05, upper_quantile: float=0.95, inplace: bool=False) -> None:
        """The scoring algorithm for determining the weight of each equity in the construction of the portfolio for this specific column examined.
        Features a custom outlier-filtering algorithm that is robust to outliers in the data set while still returning normalized values.
        
        :df: The original dataframe\n
        :col: The name of the column being extracted from the dataframe provided\n
        :normalize_only: if True, does not apply a threshold to the screening algorithm, and only normalizes values with a minmax scaler\n
        :threshold: the minimum value that equities must have for that column in order to be considered for further analysis\n
        :below_threshold: if True, removes equities that are below the threshold for that column\n
        :filter_outliers: if True, does not consider equities in the data normalization algorithm, but assigns a min or max value to all outliers depending on the below_threshold parameter\n
        :normalize_after: if True, normalizes the data only after the threshold filter has been applied\n
        :lower_quantile: specifies the lower quantile of the distribution when filtering outliers\n
        :upper_quantile: specifies the upper quantile of the distribution when filtering outliers\n
        :inplace: if true, specifies that the normalization algorithm should directly modify the column being processed, otherwise, a new column is created
        """
        
        #NOTE: should make an option for no threshold
        self.x = df[col]
        new_col = col + " Score"
        
        # normalization can be done either before or after equities have been filtered by the threshold
        # the difference is that by filtering initially, the min and max values of that smaller set will become 0 and 1 respectively
        df[new_col] = np.NaN # initialize the score column with only NaN values
        
        def outlier_filter(self):
            """Nested helper function to filter outliers"""
            upper_fence = self.x.quantile(upper_quantile)
            lower_fence = self.x.quantile(lower_quantile)
            
            if below_threshold:
                df.loc[self.x > upper_fence, new_col] = 1 # outliers still need to be included in the data (max score assigned)
                df.loc[self.x < lower_fence, new_col] = 0 # lowest score assigned
            else:
                # inverse of the above
                df.loc[self.x > upper_fence, new_col] = 0
                df.loc[self.x < lower_fence, new_col] = 1

            # now only take the rows that are not outliers into the minmax scaler
            self.x = self.x[(self.x <= upper_fence) & (self.x >= lower_fence)]
            
            if normalize_only:
                normalize_after = False
                
            if normalize_after:
                if below_threshold:
                    # since we are only taking valid values, we consider the inverse of the values that are below the threshold to be valid values
                    self.x = self.x[self.x >= threshold]
                else:
                    self.x = self.x[self.x <= threshold]
        
        if filter_outliers:
            outlier_filter(self)
        
        self.y = np.array(self.x).reshape(-1, 1)
        self.y = preprocessing.MinMaxScaler().fit_transform(self.y)
 
        if inplace: # NOTE: this is currently an unstable feature and does not give accurate results
            df.drop(columns=[new_col], inplace=True) # directly modifying the original column, so the new column should be removed
            for col_idx, array_idx in zip(self.x.index, range(len(self.y))):
                df.at[col_idx, col] = self.y[array_idx]
        else:
            for col_idx, array_idx in zip(self.x.index, range(len(self.y))):
                df.at[col_idx, new_col] = self.y[array_idx]
        
        # if we are giving the minimum score to values below the threshold, assign 0 to those values
        if not normalize_only:
            if below_threshold:
                df.loc[df[col] <= threshold, new_col] = 0
            else:
                df.loc[df[col] >= threshold, new_col] = 0
    
    def outlier_filtered_df(self, df: pd.DataFrame, col: list(), lower_quantile: float=0.05, upper_quantile: float=0.95):
        upper_fence = df[col].quantile(upper_quantile)
        lower_fence = df[col].quantile(lower_quantile)

        df = df[(df[col] <= upper_fence) & (df[col] >= lower_fence)]
        
        return df

In [619]:
#test = QuantitativeAnalysis()
#test.rank(mega_df, '6-Month Performance', inplace=True) # taking 6-month performance and inplace=True leads the norm algorithm to skip row 493
#mega_df

In [620]:
class DataVisualization(QuantitativeAnalysis):
    def __init__(self):
        QuantitativeAnalysis.__init__(self)

    def score_density_plot(self, df: pd.DataFrame, data_name: str) -> plt.graph_objs._figure.Figure:
        """Constructs an interactive compound density plot based on a histogram of the data provided, plotting a density curve with clusters of data points below
        
        :df: a Pandas DataFrame of equity data
        :data_name: the name of the type of data that has been input into the plot
        :returns: a density plot
        """
        df = df.select_dtypes(exclude='object')[:self.number_of_companies]
        self.n = len(df)
        
        for column in df.columns:
            self.rank(df, col=column, upper_quantile=0.99, lower_quantile=0.01)
            
        self.score_data_length = len(df.axes[1])
        self.input_df = df.T[int(self.score_data_length/2 + 1):].T
        self.hist_data = [self.input_df[x] for x in self.input_df.columns]
        
        self.group_labels = [x for x in self.input_df.columns]
        self.colors = ['#333F44', '#37AA9C', '#94F3E4']

        self.fig = ff.create_distplot(self.hist_data, self.group_labels, show_hist=False, colors=self.colors)
        self.fig.update_layout(title_text=f'Distribution for Normalized {data_name} of {self.n} Companies in the S&P500', template='plotly_dark')
        
        self.fig.update_xaxes(title='Score (0 = low, 1 = high)')
        self.fig.update_yaxes(title='Density')
        
        return self.fig

    def heatmap_plot(self, df: pd.DataFrame, title: str, number_of_companies: int=50, number_of_subset_companies: int=20,
                    plot_last_companies: bool=False, sort_by: str='Market Capitalization', correlation_plot: bool=False, plot_width: int=1000, plot_height: int=1000) -> plt.graph_objs._figure.Figure:
        """A wrapper function for the default heatmap plot, constructing an interactive heatmap plot of equity data against each company (ticker)
        
        :df: a Pandas DataFrame of equity data
        :data_name: the name of the type of data that has been input into the plot
        :number_of_companies: the number of companies to include in the heatmap
        :correlation_plot: if true, creates a correlation plot instead of a heatmap plot
        :returns: a heatmap plot
        """
        def construct_correlation_plot(self) -> pd.DataFrame:
            """A helper function to convert the heat map into a correlation plot"""
            # Correlation
            self.df_corr = df.corr(numeric_only=True).round(1)
            # Conver to a triangular correlation plot
            self.mask = np.zeros_like(self.df_corr, dtype=bool)
            self.mask[np.triu_indices_from(self.mask)] = True
            # Final visualization
            self.df_corr_viz = self.df_corr.mask(self.mask).dropna(how='all').dropna('columns', how='all')
            
            return self.df_corr_viz
        
        df = df.sort_values(by=sort_by, ascending=False)
    
        if correlation_plot:            
            self.cor_df = construct_correlation_plot(self)
            self.fig = px.imshow(
                self.cor_df,
                text_auto=True,
                template='plotly_dark',
                title=title,
                width=plot_width,
                height=plot_height)
        else:
            df = df[:number_of_companies] # selecting only x number of companies in order
                
            self.z = []
            self.tickers = df['Ticker']
            df.index = df['Ticker']
            df = df.select_dtypes(exclude='object')
            for column in df.columns:
                self.rank(df, col=column) # scoring the data
                        
            if plot_last_companies:
                df = df[-number_of_subset_companies:]
                self.tickers = self.tickers[-number_of_subset_companies:]
            else:
                df = df[:number_of_subset_companies] # the normalization algorithm has been applied on number_of_companies but we choose a subset from that
                self.tickers = self.tickers[:number_of_subset_companies]
            
            self.score_data_length = len(df.axes[1])
            self.input_df = df.T[int(self.score_data_length/2 + 1):].T
            for column in self.input_df.columns:
                self.z.append(self.input_df[column].round(3))
            
            self.fig = px.imshow(
                self.z,
                text_auto=True,
                template='plotly_dark',
                title=title,
                x=[x for x in self.tickers], 
                y=[x for x in df.columns[int(self.score_data_length/2 + 1):]],
                width=plot_width,
                height=plot_height)
        
        return self.fig

    def scatter_3d(self, df: pd.DataFrame, x: str, y: str, z: str) -> plt.graph_objs._figure.Figure:
        """Constructs a 3D interactive plot of equity data on 3 axes
        :df: a Pandas DataFrame of equity data
        :x: the name of the column data to be plotted on the x-axis, as a string
        :y: the name of the column data to be plotted on the y-axis, as a string
        :z: the name of the column data to be plotted on the z-axis, as a string
        :returns: a 3D scatter plot
        """
        df.index = df['Ticker']
        df = df.select_dtypes(exclude='object')
        
        for column in df.columns:
            self.rank(df, col=column)
            
        fig = px.scatter_3d(df, x=x, y=y, z=z,
                    title='3D Scatter Plot of Normalized Equity Data',
                    template='plotly_dark',
                    size_max=18,
                    color='3-Month Performance Score',
                    opacity=0.7)

        return fig
    
    def correlation_plot(self, df, data_name) -> None: 
        """Produces a correlation plot that maps all of the data points in the Data Frame provided
        :df: a Pandas DataFrame of the data to be processed
        :data_name: the name of the data being plotted
        """
        # Compute the correlation matrix
        self.corr = df.corr(numeric_only=True)

        # Generate a mask for the upper triangle
        self.mask = np.triu(np.ones_like(self.corr, dtype=bool))

        # Set up the matplotlib figure
        self.f, self.ax = plt.subplots(figsize=(18, 14))

        # Generate a custom diverging colormap
        self.cmap = sns.diverging_palette(1, 10, as_cmap=True)

        #Draw the heatmap with the mask and correct aspect ratio
        sns.heatmap(self.corr, mask=self.mask, cmap=self.cmap, vmax=.3, center=0,
                    square=True, linewidths=.5, cbar_kws={"shrink": .5})
        mplt.title(f"Correlation Plot of {data_name}")

In [621]:
class PortfolioConstruction(DataVisualization):
    def __init__(self):
        DataVisualization.__init__(self)
    
    def asset_allocation(self):
        pass
    
    def construct_portfolio(self):
        pass

In [622]:
quant = QuantitativeAnalysis()
viz = DataVisualization()

# Experimental Feature Development Zone

# Analysis Zone
## Note: to view the interactive graphs plotted, run this analysis notebook in a Jupyter Notebook environment

In [623]:
#viz.correlation_plot(mega_df, 'S&P 500 Equity Data')
viz.heatmap_plot(mega_df, 'Correlation Plot of S&P500 Equity Data', number_of_companies=500, correlation_plot=True)


In a future version of pandas all arguments of DataFrame.dropna will be keyword-only.



Values with a correlation coefficient greater than or equal to 0.7 are considered as strong correlations. Likewise, negative correlation coefficients follow the inverse of this criteria. The purpose of this correlation plot is to identify singular variables that are correlated with the positive returns of many other variables that are often considered as benchmarks for strong equity performance. Even more significant is the identification of variables that normally do not have any apparent correlation when viewed in isolation, but do when paired together in a regression. After establishing such occurrences, certain variables can be assigned a stronger weight than others when being processed by the normalization algorithm that is the basis of our equity ranking system used in our asset allocation and construction algorithm. Such variables can later be grouped as predictors into a multiple linear regression model for further analysis. The results from the correlation plot can be classified as follows:

| X-Value(s) | Strong Positive Y-Values (r >= 0.7) |
| --- | --- |
| Market Capitalization | Total Shares Outstanding, Net Income (FY), Gross Profit (FY), Gross Profit (MRQ), EBITDA (TTM), Total Current Assets (MRQ) |
| Basic EPS (TTM) | EPS Diluted (FY), EPS Diluted (TTM), Basic EPS FY |
| EBITDA (TTM), Gross Profit (MRQ), Gross Profit (FY) | Total Shares Outstanding, Enterprise Value (MRQ), Total Current Assets (MRQ), Total Assets (MRQ), Net Income (FY), Gross Profit (FY), Gross Profit (MRQ), Total Debt (MRQ), Last Year Revenue (FY), Total Revenue (FY)|
| Total Revenue (FY) | Total Current Assets (MRQ), Last Year Revenue (FY), Total Assets (MRQ) |
| Last Year Revenue (FY) | Total Current Assets (MRQ), Total Assets (MRQ) |
| Current Ratio (MRQ) | Quick ratio (MRQ), Total Shares Outstanding, Enterprise Value (MRQ) |
| Total Assets (MRQ) | Total Current Assets (MRQ), Total Debt (MRQ) |
| Operating Margin (TTM) | Net Margin (TTM), Pretax Margin (TTM) |
| Enterprise Value (MRQ) | Total Shares Outstanding |
| Number of Employees | Last Year Revenue, Total Revenue (FY) |
| Net Income | Total Shares Outstanding, Enterprise Value, Total Current Assets (MRQ), Total Assets (MRQ) |
| Net Debt | Total Assets (MRQ) |
| Gross Margin (TTM) | Price to Revenue Ratio (TTM), Net Margin (TTM), Pretax Margin (TTM), Operating Margin (TTM) |
| Price to Revenue Ratio (TTM) | Enterprise Value / EBITDA (TTM) |

It should be noted, however, that many of the X values show a high correlation with other Y values due to those Y values being a derivative of the initial X value and vice versa. This observation confirms that  Taking this feature into account can give the following results:

| X-Value(s) | Strong Positive Y-Values (r >= 0.7) |
| --- | --- |
| Market Capitalization | Total Shares Outstanding, Net Income (FY), Gross Profit (FY), Gross Profit (MRQ), EBITDA (TTM), Total Current Assets (MRQ) |
| EBITDA (TTM), Gross Profit (MRQ), Gross Profit (FY) | Total Shares Outstanding, Enterprise Value (MRQ), Total Current Assets (MRQ), Total Assets (MRQ), Net Income (FY), Gross Profit (FY), Gross Profit (MRQ), Total Debt (MRQ), Last Year Revenue (FY), Total Revenue (FY) |
| Current Ratio (MRQ) | Quick ratio (MRQ), Total Shares Outstanding, Enterprise Value (MRQ) |
| Enterprise Value (MRQ) | Total Shares Outstanding |
| Number of Employees | Last Year Revenue, Total Revenue (FY) |
| Net Income | Total Shares Outstanding, Enterprise Value, Total Current Assets (MRQ), Total Assets (MRQ) |
| Price to Revenue Ratio (TTM) | Enterprise Value / EBITDA (TTM) |

Taking variables that may not have any immediate obvious correlation can yield the following:

| X-Value(s) | Strong Positive Y-Values (r >= 0.7) |
| --- | --- |
| EBITDA (TTM), Gross Profit (MRQ), Gross Profit (FY) | Total Shares Outstanding |
| Current Ratio (MRQ) | Total Shares Outstanding |
| Enterprise Value (MRQ) | Total Shares Outstanding |
| Number of Employees | Last Year Revenue, Total Revenue (FY) |
| Net Income | Total Shares Outstanding |

An outcome of these observations can be to take the correlations of these variables and assign them as a multiplier to the normalized values for each respective column to prioritize certain equity data as being more deterministic of a positive return on investment than others.

Another outcome of these observations can be to analyze the density plot creating during the EDA phase and solely focus on the distributions of the above variables from tables 2 or 3 that are skewed towads low normalized values. The logic with this would be that outliers in negatively skewed distributions are more likely to be indicative of stronger performance because they excel in a financial ratio that very few companies excel in.

The weighted scores from each analysis can be aggregated and in the end, a weighted scoring system is used.

In [624]:
target_y = 'Market Capitalization Score'
mega_dfx = mega_df.select_dtypes(exclude='object')
predictors = [x for x in mega_dfx.columns]
#predictors = ['Net Income (FY)', 'Number of Employees', 'Current Ratio (MRQ)']

for predictor in predictors:
    quant.rank(mega_df, predictor)

predictors = [x + " Score" for x in predictors]
predictors
quant.multiple_linear_regression(mega_df, predictors, target_y) # result for market cap score seems suspicious if you plot the graph

Unnamed: 0,R squared,Mean Absolute Error,Mean Square Error,Root Mean Square Error
Model Results,98.329168,0.030034,0.00244,0.0494


In [625]:
fig = px.scatter(
    mega_df, x=predictors, y=target_y, opacity=0.65,
    trendline_color_override='darkblue'
)
fig.show()

In [626]:
sort_by = 'Market Capitalization'

display(
viz.heatmap_plot(
    mega_df,
    'Heat Map of Normalized Equity Data from the Bottom 20 Companies in the S&P500',
    500,
    20,
    True,
    plot_width=800,
    plot_height=1000,
    sort_by=sort_by),
viz.heatmap_plot(
    mega_df,
    'Heat Map of Normalized Equity Data from the Top 20 Companies in the S&P500',
    500,
    20,
    False,
    plot_width=800,
    plot_height=1000,
    sort_by=sort_by))


The behavior of `series[i:j]` with an integer-dtype index is deprecated. In a future version, this will be treated as *label-based* indexing, consistent with e.g. `series[i]` lookups. To retain the old behavior, use `series.iloc[i:j]`. To get the future behavior, use `series.loc[i:j]`.


The behavior of `series[i:j]` with an integer-dtype index is deprecated. In a future version, this will be treated as *label-based* indexing, consistent with e.g. `series[i]` lookups. To retain the old behavior, use `series.iloc[i:j]`. To get the future behavior, use `series.loc[i:j]`.



The heatmap plots validate the strong correlation between market capitalization and overall equity performance across all other variables. Therefore, market capitalization is a strong determinant of the likelihood of a good return on investment as larger companies are more likely to perform better due to being more established in the market. The bottom 20 companies fromt he S&P 500 index perform poorly across almost all variables in direct contrast to those in the top 20 by market capitalization. 