![QuantConnect Logo](https://cdn.quantconnect.com/web/i/icon.png)
<hr>

# Factor Investing Research In QuantConnect

The objective of this Notebook is to implement in QuantConnect most of the features present in Alphalens (Quantopian): a set of standard statistical techniques commonly used in the research process of factor selection for the design of long-short equity strategies.
Most of the analysis is carried out visually through a number of plots, with the aim of speeding up the process of iterating over and testing different factors. However, the tools below also return all the underlying data so the user can extend this study.

The Notebook is structured in two sections: Factor Analysis and Risk Analysis.

## Part 1: Using Factors To Construct A Long-Short Equity Strategy

This section corresponds to the **FactorAnalysis class** whose purpose is to build a long-short portfolio based on statistically significant factors.
* Provided a list of tickers and time period, this class will pull at initialization all the **historical OHLCV** data needed for analysis.
* Using the CustomFactor function, **calculate factor values for each symbol and time** based on historical price and volume data. This function gets applied to the historical OHLCV DataFrame for each symbol, and the calculations need to be done in a rolling fashion so each day gets a factor value based on data up until that point (see examples of factors below for more info). These factors can then be **standardized and used to create combined factors** as linear combinations of single factors.
* The next step in the process is to **create quantile groups** based on the chosen factors and calculate different **forward period returns** to assess the relationship between the two.
* Finally, we need to **build a portfolio** that goes long one quantile and short another, with the idea of exploiting the return spread between two opposite quantiles. Naturally, this only works if the factors can consistently separate relative winners from losers.
* A standard way of assessing the degree of **correlation between the factors and forward returns** is the Spearman Rank Correlation (Information Coefficient). This measure will also be plotted for each forward return period.

## Part 2: Analysing Common Risk Exposures Of The Long-Short Equity Strategy

This section corresponds to the **RiskAnalysis class** whose purpose is to discover what **risk factors our strategy is exposed to and to what degree**. As we will see below in more detail, these external factors can be any time series of returns that our portfolio could have some exposure to. Some popular risk factors are provided here (Fama-French Five Factors, Industry Factors), but the user can easily test any other by passing its time series of returns.
* Run **multiple linear regression** for the entire period of analysis along with **partial regression** plots for each pair of dependent/independent variables.
* Run **rolling multiple regression** and visualize the **rolling coefficients** for each independent variable throughout the entire period. Ideally, the exposures remain relatively stable throughout time.
* Visualize the **distribution of rolling exposures** in order to quickly see where the average exposures lie and their range.

---

In [1]:
# import packages
%load_ext autoreload
%autoreload 2

import warnings
warnings.filterwarnings('ignore')

from ResearchFactorAnalysis import FactorAnalysis
from ResearchRiskAnalysis import RiskAnalysis

qb = QuantBook()

## Part 1: Using Factors To Construct A Long-Short Equity Strategy
---

In [2]:
# create a list of tickers
from io import StringIO

# dropbox link to the current SP500 tickers
link = 'https://www.dropbox.com/s/4ru3kxbns1fp5lt/constituents_csv.csv?dl=1'
strFile = qb.Download(link) # download the csv file as a string
fileDf = pd.read_csv(StringIO(strFile), sep = ',')
tickers = fileDf['Symbol'].tolist()

In [3]:
# select start and end date for analysis
startDate = datetime(2017, 1, 1)
endDate = datetime(2020, 10, 1)

In [4]:
# initialize factor analysis
factorAnalysis = FactorAnalysis(qb, tickers, startDate, endDate, Resolution.Daily)

#### Historical OHLCV
After initializing the FactorAnalysis class, we get our OHLCV DataFrame. This is a MultiIndex DataFrame with pricing and volume data indexed by symbol and time.

In [6]:
factorAnalysis.ohlcvDf

### 1.1. Calculate Factors
The CustomFactor function allows us to apply factor calculations in a rolling fashion to each symbol using open, high, low, close and volume data.

The below example calculates a momentum factor as follows:
* When the GetFactorsDf function runs, the OHLCV MultiIndex DataFrame is grouped by Symbol and then the CustomFactor is applied.
* The x now simply becomes a SingleIndex DataFrame for each Symbol.
* As you can see, we first extract the Close series and create a rolling window for it. This ensures our calculations will be applied at each time step, based on data up until that point in time.
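
The grouping and rolling steps above can be sketched with plain pandas. This is only an illustration on toy data: `toyDf`, `toy_momentum` and the 3-day window are hypothetical, and the loop is an assumption about what happens conceptually, not the actual implementation of GetFactorsDf.

```python
import pandas as pd

# toy MultiIndex (symbol, time) DataFrame standing in for the OHLCV data
dates = pd.date_range('2020-01-01', periods=5, freq='D')
index = pd.MultiIndex.from_product([['AAA', 'BBB'], dates], names=['symbol', 'time'])
toyDf = pd.DataFrame({'close': [10.0, 11.0, 12.0, 13.0, 14.0,
                                20.0, 19.0, 18.0, 17.0, 16.0]}, index=index)

def toy_momentum(x):
    # x is a SingleIndex DataFrame for one symbol (hypothetical 3-day window)
    window = x['close'].rolling(3)
    # raw=True passes each window as a NumPy array: last value over first value
    return window.apply(lambda w: w[-1] / w[0] - 1, raw=True)

# apply the factor function symbol by symbol, then stitch the results back together
pieces = {symbol: toy_momentum(group.droplevel('symbol'))
          for symbol, group in toyDf.groupby(level='symbol')}
toyFactor = pd.concat(pieces, names=['symbol'])
print(toyFactor)
```

The first two rows of each symbol are NaN because the rolling window is not yet full, which mirrors the warm-up period of the 252-day window used in the notebook.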

In [7]:
# example of calculating a momentum factor using the CustomFactor function

def CustomFactor(x):
    
    '''
    Description:
        Applies factor calculations to a SingleIndex DataFrame of historical data OHLCV by symbol
    Args:
        x: SingleIndex DataFrame of historical OHLCV data for each symbol
    Returns:
        The factor value for each day
    '''
    
    try:
        # momentum factor --------------------------------------------------------------------------
        closePricesTimeseries = x['close'].rolling(252) # create a 252 day rolling window of close prices
        # raw = True passes each window as a NumPy array, so positional indexing is safe
        momentum = closePricesTimeseries.apply(lambda window: (window[-1] / window[0]) - 1, raw = True)
        
        # get momentum factor
        factors = pd.concat([momentum], axis = 1)

    except Exception: # avoid catching KeyboardInterrupt/SystemExit
        factors = np.nan
        
    return factors

In [8]:
# example of a single factor
factorsDf = factorAnalysis.GetFactorsDf(CustomFactor)
factorsDf

The below example calculates multiple factors in the same way. Notice how the factors are concatenated at the end.

In [9]:
# example of calculating multiple factors using the CustomFactor function

from scipy.stats import skew, kurtosis

def CustomFactor(x):
    
    '''
    Description:
        Applies factor calculations to a SingleIndex DataFrame of historical data OHLCV by symbol
    Args:
        x: SingleIndex DataFrame of historical OHLCV data for each symbol
    Returns:
        The factor value for each day
    '''
    
    try:
        # momentum factor --------------------------------------------------------------------------
        closePricesTimeseries = x['close'].rolling(252) # create a 252 day rolling window of close prices
        returns = x['close'].pct_change().dropna() # create a returns series
        # raw = True passes each window as a NumPy array, so positional indexing is safe
        momentum = closePricesTimeseries.apply(lambda window: (window[-1] / window[0]) - 1, raw = True)
        
        # volatility factor ------------------------------------------------------------------------
        volatility = returns.rolling(252).apply(lambda window: np.nanstd(window, axis = 0), raw = True)
        
        # get a dataframe with all factors as columns --------------------------------------------
        factors = pd.concat([momentum, volatility], axis = 1)

    except Exception: # avoid catching KeyboardInterrupt/SystemExit
        factors = np.nan
        
    return factors

In [10]:
# example of multiple factors
factorsDf = factorAnalysis.GetFactorsDf(CustomFactor)
factorsDf

Visualize the distribution of each raw factor before standardization.

In [11]:
factorAnalysis.PlotHistograms(factorsDf)

#### Winsorizing And Standardizing Factors
Winsorize to reduce the effect of outliers and then standardize (zscore) each factor.
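
As a rough sketch of what this step does, the snippet below clips one toy cross-section at the 5th and 95th percentiles and then z-scores it. The percentile limits and the helper name `winsorize_and_zscore` are assumptions made for illustration; the limits actually used by GetStandardizedFactorsDf are not stated here.

```python
import pandas as pd

def winsorize_and_zscore(values, lower=0.05, upper=0.95):
    # clip values to the chosen percentiles (assumed limits) to tame outliers
    clipped = values.clip(values.quantile(lower), values.quantile(upper))
    # z-score: subtract the mean and divide by the (population) standard deviation
    return (clipped - clipped.mean()) / clipped.std(ddof=0)

# one toy cross-section of factor values with an obvious outlier
toyValues = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0], index=list('ABCDE'))
toyZ = winsorize_and_zscore(toyValues)
print(toyZ)
```

After this transformation the toy factor has zero mean and unit standard deviation, so factors measured in different units become comparable.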

In [12]:
standardizedFactorsDf = factorAnalysis.GetStandardizedFactorsDf(factorsDf)
standardizedFactorsDf

#### Create A Combined Factor
Create a new combined factor as a linear combination of the single factors. Notice how we can give negative weights to some factors if we want to invert their effect.
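
A linear combination of this kind can be sketched as follows. The toy values are made up, and whether GetCombinedFactorsDf rescales the result afterwards is not covered here, so treat this as an illustration of the idea only.

```python
import pandas as pd

# toy standardized factors for three assets (made-up values)
toyFactors = pd.DataFrame({'Factor_1': [0.5, -1.0, 0.2],
                           'Factor_2': [1.0, 0.5, -0.3]})
toyWeights = {'Factor_1': 1, 'Factor_2': -1} # negative weight inverts Factor_2

# weighted sum of the single factors, column by column
toyFactors['Combined_Factor'] = sum(toyFactors[name] * weight
                                    for name, weight in toyWeights.items())
print(toyFactors)
```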

In [13]:
# dictionary containing the factor name and weights for each factor
combinedFactorWeightsDict = {'Factor_1': 1, 'Factor_2': 1}
#combinedFactorWeightsDict = None # None to not add a combined factor when using single factors

finalFactorsDf = factorAnalysis.GetCombinedFactorsDf(standardizedFactorsDf, combinedFactorWeightsDict)
finalFactorsDf

#### Run All Section 1.1.
The steps above can be run all at once using the GetFinalFactorsDf method, as shown below.

In [14]:
# dictionary containing the factor name and weights for each factor
#combinedFactorWeightsDict = {'Factor_1': 1, 'Factor_2': -1} # negative weight inverts the effect of Factor_2
#combinedFactorWeightsDict = None # None to not add a combined factor when using single factors

#finalFactorsDf = factorAnalysis.GetFinalFactorsDf(CustomFactor, combinedFactorWeightsDict, standardize = True)
#finalFactorsDf.head()

Visualize factor correlations.

In [15]:
factorAnalysis.PlotFactorsCorrMatrix(finalFactorsDf)

Visualize the distributions of each standardized factor.

In [16]:
factorAnalysis.PlotHistograms(finalFactorsDf)

### 1.2. Create Quantile Groups And Calculate Forward Returns
* Calculate multiple forward period returns (field and forwardPeriods parameters) for each day and asset. These will be used to evaluate the performance of each quantile.
* For each day, group the assets into quantiles (q parameter) based on chosen factor values (factor parameter).
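
The two steps above can be sketched on toy data. The exact forward-return convention used by GetFactorQuantilesForwardReturnsDf (for example, whether it skips a bar before entering) is not stated, so the simple "price one step ahead over today's price" definition below is an assumption.

```python
import pandas as pd

dates = pd.date_range('2020-01-01', periods=3, freq='D')
# toy open prices: rows are days, columns are assets
toyOpen = pd.DataFrame({'A': [10.0, 11.0, 12.1], 'B': [20.0, 19.0, 19.0]}, index=dates)

# 1-period forward return: price one step ahead relative to today's price
toyForward1 = toyOpen.shift(-1) / toyOpen - 1

# quantile groups for one day of toy factor values (q = 5), labelled 1 to 5
toyFactorValues = pd.Series(range(10), index=list('ABCDEFGHIJ'), dtype=float)
toyGroups = pd.qcut(toyFactorValues, 5, labels=False) + 1
print(toyForward1)
print(toyGroups)
```

The last row of forward returns is NaN because there is no future price to compare against, which is why the final days of the analysis period drop out for longer forward periods.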

In [17]:
# inputs for forward returns calculations
field = 'open' # choose between open, high, low, close prices to calculate returns
forwardPeriods = [1, 5, 21] # choose periods for forward return calculations

# inputs for quantile calculations
factor = 'Combined_Factor' # choose a factor to create quantiles
q = 5 # choose the number of quantile groups to create

factorQuantilesForwardReturnsDf = factorAnalysis.GetFactorQuantilesForwardReturnsDf(finalFactorsDf, field,
                                                                                    forwardPeriods,
                                                                                    factor, q)
factorQuantilesForwardReturnsDf

Plot a box plot showing the distribution of the number of stocks in each quantile, to make sure each quantile holds a roughly equal number of stocks most of the time.

In [18]:
factorAnalysis.PlotBoxPlotQuantilesCount(factorQuantilesForwardReturnsDf)

Plot overall mean returns by quantile and forward period return to get an idea about the mean return spread between extreme quantiles.

In [19]:
factorAnalysis.PlotMeanReturnsByQuantile(factorQuantilesForwardReturnsDf)

#### Calculate And Plot Cumulative Returns By Quantile
For each day, aggregate the forward returns (forwardPeriod parameter) within each quantile using a given weighting (weighting parameter):
* **Mean**: Take the average return within each quantile
* **Factor**: Take a factor-weighted return within each quantile
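
The difference between the two weightings can be illustrated on a single day and a single quantile. The factor weighting below uses weights proportional to the absolute factor value, which is an assumed convention for this sketch and may differ from what GetReturnsByQuantileDf does.

```python
import pandas as pd

# toy cross-section: one day, one quantile (made-up values)
toyQuantile = pd.DataFrame({'factor': [1.0, 2.0, 3.0],
                            'forwardReturn': [0.01, 0.02, 0.06]})

# mean weighting: every asset counts equally
meanReturn = toyQuantile['forwardReturn'].mean()

# factor weighting (assumed convention): weights proportional to |factor|
weights = toyQuantile['factor'].abs() / toyQuantile['factor'].abs().sum()
factorWeightedReturn = (weights * toyQuantile['forwardReturn']).sum()

print(meanReturn, factorWeightedReturn)
```

Here the factor-weighted return is higher than the mean return because the asset with the largest factor value also has the largest forward return.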

In [20]:
forwardPeriod = 1 # choose the forward period to use for returns
weighting = 'mean' # mean/factor

returnsByQuantileDf = factorAnalysis.GetReturnsByQuantileDf(factorQuantilesForwardReturnsDf,
                                                            forwardPeriod, weighting)
returnsByQuantileDf

The ultimate goal is for the factors to consistently separate relative winners from losers. In the plot below, this shows up as opposite quantiles diverging from each other over time.

In [21]:
# this function runs the above internally so no need to first calculate returnsByQuantileDf
#forwardPeriod = 1 # choose the forward period to use for returns
#weighting = 'mean' # mean/factor

factorAnalysis.PlotCumulativeReturnsByQuantile(factorQuantilesForwardReturnsDf, forwardPeriod, weighting)

### 1.3. Create A Long-Short Portfolio
Linearly combine the daily returns of two quantiles to simulate a portfolio. The portfolioWeightsDict parameter maps each quantile group name to its weight in the portfolio.
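
A minimal sketch of this combination on made-up quantile returns is shown below. Compounding the daily returns into a cumulative series is an assumption for illustration; the actual method may accumulate returns differently.

```python
import pandas as pd

# toy daily returns for the two extreme quantile groups (made-up values)
toyReturnsByQuantile = pd.DataFrame({'Group_1': [0.01, -0.02, 0.00],
                                     'Group_5': [0.02, 0.01, -0.01]})
toyPortfolioWeights = {'Group_5': 1, 'Group_1': -1} # long top group, short bottom group

# weighted sum of the group returns for each day
weightsSeries = pd.Series(toyPortfolioWeights)
toyStrategy = (toyReturnsByQuantile[weightsSeries.index] * weightsSeries).sum(axis=1)

# compound the daily returns into a cumulative return series (assumed convention)
toyCumulative = (1 + toyStrategy).cumprod() - 1
print(toyStrategy)
print(toyCumulative)
```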

In [22]:
# dictionary containing the quintile group names and portfolio weights for each
portfolioWeightsDict = {'Group_5': 1, 'Group_1': -1}

portfolioLongShortReturnsDf = factorAnalysis.GetPortfolioLongShortReturnsDf(returnsByQuantileDf, portfolioWeightsDict)
portfolioLongShortReturnsDf

Plot the cumulative returns of the long-short portfolio.

In [23]:
# this function runs the above internally so no need to first calculate portfolioLongShortReturnsDf
#forwardPeriod = 1 # choose the forward period to use for returns
#weighting = 'mean' # mean/factor
# dictionary containing the quintile group names and portfolio weights for each
#portfolioWeightsDict = {'Group_5': 1, 'Group_1': -1}

factorAnalysis.PlotPortfolioLongShortCumulativeReturns(factorQuantilesForwardReturnsDf,
                                                       forwardPeriod, weighting,
                                                       portfolioWeightsDict)

#### Plot Spearman Rank Correlation (Information Coefficient)
The Spearman Rank Correlation measures the strength and direction of association between two ranked variables. It is the non-parametric version of the Pearson correlation and focuses on the monotonic relationship between two variables rather than their linear relationship. Below we plot the daily IC between the factor values and each forward period return, along with a 21-day moving average.
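
For a single day, the IC can be computed as the Pearson correlation of the ranks, which is the definition of the Spearman correlation. The toy values below are made up, and this sketch is not the code behind PlotIC.

```python
import pandas as pd

# toy factor values and forward returns for five assets on one day
toyFactor = pd.Series([0.1, 0.4, 0.2, 0.9, 0.5])
toyForwardReturn = pd.Series([0.00, 0.03, 0.01, 0.02, 0.05])

# Spearman correlation = Pearson correlation of the ranks
toyIC = toyFactor.rank().corr(toyForwardReturn.rank())
print(round(toyIC, 4))  # 0.7
```

Repeating this calculation for every day gives the daily IC series, which can then be smoothed with a moving average.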

In [24]:
factorAnalysis.PlotIC(factorQuantilesForwardReturnsDf)

#### Run All Section 1.3.
The method RunFactorAnalysis below runs the functions in sections 1.2. and 1.3., taking only the factorQuantilesForwardReturnsDf generated at the start of section 1.2. This method also generates all the relevant DataFrames needed for Part 2: Risk Analysis. The parameter makePlots controls whether we want to visualize plots or only generate the DataFrames.

In [25]:
#forwardPeriod = 1 # choose the forward period to use for returns
#weighting = 'mean' # mean/factor
# dictionary containing the quintile group names and portfolio weights for each
#portfolioWeightsDict = {'Group_5': 1, 'Group_1': -1}

# run analysis
factorAnalysis.RunFactorAnalysis(factorQuantilesForwardReturnsDf,
                                 forwardPeriod, weighting,
                                 portfolioWeightsDict,
                                 makePlots = False)

After running the RunFactorAnalysis method, the following DataFrames are generated:

In [26]:
# returns by quantile
factorAnalysis.returnsByQuantileDf.head()

In [27]:
# cumulative returns by quantile
factorAnalysis.cumulativeReturnsByQuantileDf.head()

In [28]:
# portfolio returns
factorAnalysis.portfolioLongShortReturnsDf.head()

In [29]:
# cumulative portfolio returns
factorAnalysis.portfolioLongShortCumulativeReturnsDf.head()

## Part 2: Analysing Common Risk Exposures Of The Long-Short Equity Strategy
---

In [30]:
# initialize risk analysis
riskAnalysis = RiskAnalysis(qb)

#### External Factors
After initializing the RiskAnalysis class, we get two datasets with classic risk factors:
* **Fama-French 5 Factors**: Historical daily returns of Market Excess Return (Mkt-RF), Small Minus Big (SMB), High Minus Low (HML), Robust Minus Weak (RMW) and Conservative Minus Aggressive (CMA).
* **12 Industry Factors**: Consumer Nondurables (NoDur), Consumer Durables (Durbl), Manufacturing (Manuf), Energy (Enrgy), Chemicals (Chems), Business Equipment (BusEq), Telecommunications (Telcm), Utilities (Utils), Wholesale and Retail (Shops), Healthcare (Hlth), Finance (Money) and Other (Other).

Visit https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html for more factor datasets to add to this analysis.

In [31]:
# fama-french 5 factors
riskAnalysis.ffFiveFactorsDf.head()

In [32]:
# 12 industry factors
riskAnalysis.industryFactorsDf.head()

#### Combine Strategy Returns And External Risk Factors
Create a DataFrame containing both the returns from our long-short portfolio and the external risk factors.

In [33]:
# optionally combine fama-french 5 factors and 12 industry factors (uncomment to use both)
#externalFactorsDf = pd.merge(riskAnalysis.ffFiveFactorsDf, riskAnalysis.industryFactorsDf,
#                             how = 'inner', left_index = True, right_index = True)
externalFactorsDf = riskAnalysis.ffFiveFactorsDf

combinedReturnsDf = riskAnalysis.GetCombinedReturnsDf(factorAnalysis.portfolioLongShortReturnsDf, externalFactorsDf)
combinedReturnsDf

#### Plot Cumulative Returns
Visualize the historical cumulative returns of our strategy together with all the external risk factors.

In [34]:
riskAnalysis.PlotCumulativeReturns(combinedReturnsDf)

In [35]:
# plot correlation matrix
factorAnalysis.PlotFactorsCorrMatrix(combinedReturnsDf)

#### Run Regression Analysis
* Fit a **Regression Model** to the data to analyse linear relationships between our strategy returns and the external risk factors.
* **Partial Regression plots**. When performing multiple linear regression, these plots are useful in analysing the relationship between each independent variable and the response variable while accounting for the effect of all the other independent variables present in the model. Calculations are as follows (Wikipedia):
    1. Compute the residuals of regressing the response variable against the independent variables but omitting Xi.
    2. Compute the residuals from regressing Xi against the remaining independent variables.
    3. Plot the residuals from (1) against the residuals from (2).
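
The three steps above can be sketched with NumPy on simulated data. The variable names and the simulated coefficients are hypothetical, and plotting is left out. The slope of the line through the residual scatter equals the coefficient of Xi in the full multiple regression (the Frisch-Waugh-Lovell result), which the last lines check.

```python
import numpy as np

def residuals(y, X):
    # residuals of an OLS fit of y on X (X already contains an intercept column)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

# simulated data: x2 is correlated with x1, y depends on both
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 3.0 * x2 + 0.1 * rng.normal(size=n)

ones = np.ones(n)
others = np.column_stack([ones, x2])   # all regressors except x1

resY = residuals(y, others)            # step 1: y regressed on everything but x1
resX1 = residuals(x1, others)          # step 2: x1 regressed on the remaining regressors

# step 3 would plot resY against resX1; here we compute the slope of that scatter
partialSlope = (resX1 @ resY) / (resX1 @ resX1)

# coefficient of x1 in the full multiple regression, for comparison
fullBeta, *_ = np.linalg.lstsq(np.column_stack([ones, x1, x2]), y, rcond=None)
print(partialSlope, fullBeta[1])
```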

In [36]:
riskAnalysis.PlotRegressionModel(combinedReturnsDf, dependentColumn = 'Strategy')

#### Plot Rolling Regression Coefficients
The above relationships are not static through time, so it is useful to visualize how these coefficients behave over time by running a rolling regression model (with a given lookback period).
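
A rolling regression can be sketched as an OLS fit repeated over each trailing window. The helper `rolling_ols_betas` and the noiseless toy data are hypothetical, and this is not the implementation behind PlotRollingRegressionCoefficients.

```python
import numpy as np
import pandas as pd

def rolling_ols_betas(y, X, lookback):
    # fit OLS with an intercept on each trailing window of `lookback` rows
    rows = []
    for end in range(lookback, len(y) + 1):
        window = slice(end - lookback, end)
        design = np.column_stack([np.ones(lookback), X.iloc[window].to_numpy()])
        beta, *_ = np.linalg.lstsq(design, y.iloc[window].to_numpy(), rcond=None)
        rows.append(beta[1:])  # drop the intercept, keep the factor exposures
    return pd.DataFrame(rows, index=y.index[lookback - 1:], columns=X.columns)

rng = np.random.default_rng(1)
toyX = pd.DataFrame(rng.normal(size=(60, 2)), columns=['Mkt-RF', 'SMB'])
toyY = 0.5 * toyX['Mkt-RF'] - 0.2 * toyX['SMB']  # noiseless, so the betas are recovered exactly
toyBetas = rolling_ols_betas(toyY, toyX, lookback=20)
print(toyBetas.describe())
```

With real returns the estimated exposures vary from window to window, and the spread of each column is what a box plot of rolling exposures summarizes.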

In [37]:
riskAnalysis.PlotRollingRegressionCoefficients(combinedReturnsDf, dependentColumn = 'Strategy', lookback = 126)

#### Plot Distribution Of Rolling Exposures
We can now visualize the historical distributions of the rolling regression coefficients in order to get a better idea of the variability of the data.

In [38]:
riskAnalysis.PlotBoxPlotRollingFactorExposure(combinedReturnsDf, dependentColumn = 'Strategy', lookback = 126)

#### Run All
We can run all of the above at once using the RunRiskAnalysis method by passing two DataFrames (our strategy and the external risk factors), the column name for the response variable and a lookback period for the rolling regression analysis.

In [39]:
riskAnalysis.RunRiskAnalysis(factorAnalysis.portfolioLongShortReturnsDf, externalFactorsDf,
                             dependentColumn = 'Strategy', lookback = 126)