<a href="https://www.quantrocket.com"><img alt="QuantRocket logo" src="https://www.quantrocket.com/assets/img/notebook-header-logo.png"></a><br>
<a href="https://www.quantrocket.com/disclaimer/">Disclaimer</a>

***
[Pipeline Tutorial](Introduction.ipynb) › Lesson 4: Factors
***

# Factors
A factor is a function from an asset and a moment in time to a number.
```
F(asset, timestamp) -> float
```
In Pipeline, Factors are the most commonly-used term, representing the result of any computation producing a numerical result. Factors require a column of data and a window length as input.
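
To make the definition concrete, here is a minimal plain-Python sketch of a factor as a function of an asset and a date. It is not Zipline code: the `closes` dictionary, the ticker `XYZ`, and the helper `mean_close` are hypothetical names invented for illustration. The helper averages the closes from the `window_length` trading days before the given date.

```python
from datetime import date

# Hypothetical daily closes for one made-up asset, keyed by trading date.
closes = {
    "XYZ": {
        date(2010, 1, 4): 10.0,
        date(2010, 1, 5): 11.0,
        date(2010, 1, 6): 12.0,
        date(2010, 1, 7): 13.0,
    }
}

def mean_close(asset, as_of, window_length):
    """F(asset, timestamp) -> float: mean of the last `window_length` closes before `as_of`."""
    history = [price for day, price in sorted(closes[asset].items()) if day < as_of]
    if len(history) < window_length:
        return float("nan")  # not enough history to fill the window
    window = history[-window_length:]
    return sum(window) / window_length

value = mean_close("XYZ", date(2010, 1, 8), window_length=3)
print(value)  # 12.0
```

The two inputs named above, a column of data and a window length, appear here as the `closes` lookup and the `window_length` argument.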

The simplest factors in Pipeline are built-in Factors. Built-in Factors are pre-built to perform common computations. As a first example, let's make a factor to compute the average close price over the last 10 days. We can use the `SimpleMovingAverage` built-in factor which computes the average value of the input data (close price) over the specified window length (10 days). To do this, we need to import our built-in `SimpleMovingAverage` factor and the `EquityPricing` dataset.

```python
# New from the last lesson, import the EquityPricing dataset.
from zipline.pipeline import Pipeline, EquityPricing
from zipline.research import run_pipeline

# New from the last lesson, import the built-in SimpleMovingAverage factor.
from zipline.pipeline.factors import SimpleMovingAverage
```

To see the full list of built-in factors, click on the `factors` module in the above import statement then press Control, or see the [API Reference](https://www.quantrocket.com/docs/api/#built-in-factors).

## Creating a Factor
Let's go back to our `make_pipeline` function from the previous lesson and instantiate a `SimpleMovingAverage` factor. To create a `SimpleMovingAverage` factor, we can call the `SimpleMovingAverage` constructor with two arguments: `inputs`, which must be a list of `BoundColumn` objects, and `window_length`, which must be an integer indicating how many days' worth of data our moving average calculation should receive. (We'll discuss `BoundColumn` in more depth later; for now we just need to know that a `BoundColumn` is an object indicating what kind of data should be passed to our Factor.)

The following line creates a `Factor` for computing the 10-day mean close price of securities.

```python
# inputs is passed as a list of BoundColumn objects, as described above.
mean_close_10 = SimpleMovingAverage(inputs=[EquityPricing.close], window_length=10)
```

It's important to note that creating the factor does not actually perform a computation. Creating a factor is like defining the function. To perform a computation, we need to add the factor to our pipeline and run it.
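
The distinction between defining a computation and running it can be illustrated with a small plain-Python sketch. The `LazyMean` class below is a hypothetical stand-in written for this example and is not how Zipline implements factors; it only shows that constructing the object records what to compute, while the work happens in a separate step.

```python
class LazyMean:
    """Hypothetical stand-in for a factor: it records what to compute, not the result."""

    def __init__(self, window_length):
        self.window_length = window_length
        self.compute_calls = 0  # counts how many times work was actually done

    def compute(self, prices):
        self.compute_calls += 1
        window = prices[-self.window_length:]
        return sum(window) / len(window)

factor = LazyMean(window_length=3)  # like creating a factor: no work happens here
assert factor.compute_calls == 0

result = factor.compute([1.0, 2.0, 3.0, 4.0])  # like running the pipeline
print(result)  # 3.0
```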

## Adding a Factor to a Pipeline

Let's update our original empty pipeline to make it compute our new moving average factor. To start, let's move our factor instantiation into `make_pipeline`. Next, we can tell our pipeline to compute our factor by passing it a `columns` argument, which should be a dictionary mapping column names to factors, filters, or classifiers. Our updated `make_pipeline` function should look something like this:

```python
def make_pipeline():

    # inputs is passed as a list of BoundColumn objects.
    mean_close_10 = SimpleMovingAverage(inputs=[EquityPricing.close], window_length=10)

    return Pipeline(
        columns={
            '10_day_mean_close': mean_close_10
        }
    )
```

To see what this looks like, let's make our pipeline, run it, and display the result.

```python
result = run_pipeline(make_pipeline(), start_date='2010-01-05', end_date='2010-01-05')
result
```

| date | asset | 10_day_mean_close |
|---|---|---|
| 2010-01-05 | Equity(FIBBG000C2V3D6 [A]) | 30.432000 |
| 2010-01-05 | Equity(QI000000004076 [AABA]) | 16.605000 |
| 2010-01-05 | Equity(FIBBG000BZWHH8 [AACC]) | 6.434000 |
| 2010-01-05 | Equity(FIBBG000V2S3P6 [AACG]) | 4.501444 |
| 2010-01-05 | Equity(FIBBG000M7KQ09 [AAI]) | 5.250000 |
| 2010-01-05 | ... | ... |
| 2010-01-05 | Equity(FIBBG011MC2100 [AATC]) | 11.980500 |
| 2010-01-05 | Equity(FIBBG000GDBDH4 [BDG]) | NaN |
| 2010-01-05 | Equity(FIBBG000008NR0 [ISM]) | NaN |
| 2010-01-05 | Equity(FIBBG000GZ24W8 [PEM]) | NaN |


Now we have a column in our pipeline output with the 10-day average close price for all ~8000 securities (display truncated). Each row holds the result of our computation for a given security on a given date. The `DataFrame` has a MultiIndex where the first level is a datetime representing the date of the computation and the second level is an `Equity` object corresponding to the security.

If we run our pipeline over more than one day, the output looks like this.

```python
result = run_pipeline(make_pipeline(), start_date='2010-01-05', end_date='2010-01-07')
result
```

| date | asset | 10_day_mean_close |
|---|---|---|
| 2010-01-05 | Equity(FIBBG000C2V3D6 [A]) | 30.432000 |
| 2010-01-05 | Equity(QI000000004076 [AABA]) | 16.605000 |
| 2010-01-05 | Equity(FIBBG000BZWHH8 [AACC]) | 6.434000 |
| 2010-01-05 | Equity(FIBBG000V2S3P6 [AACG]) | 4.501444 |
| 2010-01-05 | Equity(FIBBG000M7KQ09 [AAI]) | 5.250000 |
| ... | ... | ... |
| 2010-01-07 | Equity(FIBBG011MC2100 [AATC]) | 11.816000 |
| 2010-01-07 | Equity(FIBBG000GDBDH4 [BDG]) | NaN |
| 2010-01-07 | Equity(FIBBG000008NR0 [ISM]) | NaN |
| 2010-01-07 | Equity(FIBBG000GZ24W8 [PEM]) | NaN |
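
Because the output is indexed by date and asset, standard pandas MultiIndex selection applies to it. The sketch below builds a small stand-in `DataFrame` with the same two index levels so that it runs without a bundle; the tickers `AAA` and `BBB` and all values are made up, and the asset level holds plain strings here, whereas real pipeline output holds `Equity` objects.

```python
import pandas as pd

# Hypothetical stand-in for pipeline output: two dates, two made-up assets.
index = pd.MultiIndex.from_product(
    [pd.to_datetime(["2010-01-05", "2010-01-06"]), ["AAA", "BBB"]],
    names=["date", "asset"],
)
result = pd.DataFrame({"10_day_mean_close": [10.0, 20.0, 10.5, 20.5]}, index=index)

# All assets on a single date (the date level is dropped from the index).
one_day = result.xs(pd.Timestamp("2010-01-06"), level="date")

# One asset across all dates (the asset level is dropped from the index).
one_asset = result.xs("BBB", level="asset")

# A single value, addressed by (date, asset) and column name.
value = result.loc[(pd.Timestamp("2010-01-06"), "BBB"), "10_day_mean_close"]
print(value)  # 20.5
```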


Note: factors can also be added to an existing `Pipeline` instance using the `Pipeline.add` method. Using `add` looks something like this:

```python
my_pipe = Pipeline()
f1 = SomeFactor(...)
my_pipe.add(f1, 'f1')
```

## Latest
The most commonly used built-in `Factor` is `Latest`. The `Latest` factor gets the most recent value of a given data column. This factor is common enough that it is instantiated differently from other factors. The best way to get the latest value of a data column is by getting its `.latest` attribute. As an example, let's update `make_pipeline` to create a latest close price factor and add it to our pipeline:

```python
def make_pipeline():

    # inputs is passed as a list of BoundColumn objects.
    mean_close_10 = SimpleMovingAverage(inputs=[EquityPricing.close], window_length=10)
    latest_close = EquityPricing.close.latest

    return Pipeline(
        columns={
            '10_day_mean_close': mean_close_10,
            'latest_close_price': latest_close
        }
    )
```

And now, when we make and run our pipeline again, there are two columns in our output dataframe. One column has the 10-day mean close price of each security, and the other has the latest close price.

```python
result = run_pipeline(make_pipeline(), start_date='2010-01-05', end_date='2010-01-05')
result.head(5)
```

| date | asset | 10_day_mean_close | latest_close_price |
|---|---|---|---|
| 2010-01-05 | Equity(FIBBG000C2V3D6 [A]) | 30.432 | 31.3 |
| 2010-01-05 | Equity(QI000000004076 [AABA]) | 16.605 | 17.1 |
| 2010-01-05 | Equity(FIBBG000BZWHH8 [AACC]) | 6.434 | 7.15 |
| 2010-01-05 | Equity(FIBBG000V2S3P6 [AACG]) | 4.501444 | 4.702 |
| 2010-01-05 | Equity(FIBBG000M7KQ09 [AAI]) | 5.25 | 5.18 |


`.latest` can sometimes return things other than `Factors`. We'll see examples of other possible return types in later lessons.

## Default Inputs
Some factors have default inputs that should never be changed. For example, the `VWAP` built-in factor is always calculated from `EquityPricing.close` and `EquityPricing.volume`. When a factor is always calculated from the same `BoundColumn` objects, we can call the constructor without specifying `inputs`.

```python
from zipline.pipeline.factors import VWAP
vwap = VWAP(window_length=10)
```
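
For readers unfamiliar with the term, the following plain-Python sketch shows the conventional definition of a volume-weighted average price over a window: the sum of price times volume, divided by total volume. It illustrates why the factor needs both a close column and a volume column; it is not Zipline's implementation, and the helper name `volume_weighted_average` and the sample numbers are made up for this example.

```python
def volume_weighted_average(closes, volumes):
    """Volume-weighted average price over equal-length windows of closes and volumes."""
    total_volume = sum(volumes)
    return sum(c * v for c, v in zip(closes, volumes)) / total_volume

# Hypothetical 3-day window: the high-volume day pulls the average toward its price.
sample_closes = [10.0, 20.0, 30.0]
sample_volumes = [100, 100, 200]

weighted = volume_weighted_average(sample_closes, sample_volumes)
simple = sum(sample_closes) / len(sample_closes)
print(weighted)  # 22.5
print(simple)    # 20.0
```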

## Choosing a Start Date

When choosing a `start_date` for `run_pipeline`, there are two gotchas to keep in mind. First, the earliest possible `start_date` you can specify must be one day after the start date of the bundle. This is because the `start_date` you pass to `run_pipeline` indicates the first date you want to include in the pipeline output, and each day's pipeline output is based on the previous day's data. The purpose of this one-day lag is to avoid lookahead bias. Pipeline output tells you what you would have known at the start of each day, based on the previous day's data.

The learning bundle starts on 2007-01-03 (the first trading day of 2007), but if we try to run a pipeline that starts on (or before) that date, we'll get an error that tells us to start one day after the bundle start date:

```python
result = run_pipeline(Pipeline(), start_date='2007-01-03', end_date='2007-01-03')
```

```
ValidationError: start_date cannot be earlier than 2007-01-04 for this bundle (one session after the bundle start date of 2007-01-03)
```

The second gotcha to keep in mind is that the `start_date` you choose must also make allowance for the `window_length` of your factors. The following pipeline includes a 10-day VWAP factor, so if we set the `start_date` to 2007-01-04 (as suggested by the previous error message), we will get a new error (scroll to the bottom of the traceback for the useful error message):    

```python
pipeline = Pipeline(
    columns={
        "vwap": VWAP(window_length=10)
    }
)

result = run_pipeline(pipeline, start_date='2007-01-04', end_date='2007-01-04')
```

```
NoDataOnDate: the pipeline definition requires EquityPricing<US>.close::float64 data on 2006-12-18 00:00:00 but no bundle data is available on that date; the cause of this issue is that another pipeline term needs EquityPricing<US>.close::float64 and has a window_length of 10, which necessitates loading 9 extra rows of EquityPricing<US>.close::float64; try setting a later start date so that the maximum window_length of any term doesn't extend further back than the bundle start date. Review the pipeline dependencies below to help determine which terms are causing the problem:

{'dependencies': [{'term': EquityPricing<US>.close::float64,
                   'used_by': VWAP([EquityPricing.close, EquityPricing.volume], 10)},
                  {'term': EquityPricing<US>.volume::float64,
                   'used_by': VWAP([EquityPricing.close, EquityPricing.volume], 10)}],
 'nodes': [{'extra_rows': 9, 'needed_for': EquityPricing<US>.close::float64},
           {'extra_rows': 9, 'needed_for': EquityPricing<US>.volume::float64}]}
```

The error message indicates that we would need data back to 2006-12-18 in order to calculate a 10-day VWAP and produce pipeline output on 2007-01-04 (`window_length` is measured in trading days, not calendar days). The solution is to set a later start date so that the VWAP factor doesn't require data prior to the bundle start date of 2007-01-03. The 10-day window must begin no earlier than 2007-01-03 and end on the trading day before `start_date`, so the start date has to move forward by the 9 extra rows reported in the error. In this example, the earliest possible `start_date` turns out to be 2007-01-18 (14 calendar days, or 9 trading days, after 2007-01-04).

```python
result = run_pipeline(pipeline, start_date='2007-01-18', end_date='2007-01-18')
```
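
The start-date arithmetic from this section can be checked with a short plain-Python sketch. The `sessions` list is hard-coded for illustration and assumes the first trading days of the learning bundle (weekends and the 2007-01-15 market holiday excluded); `earliest_start_date` is a hypothetical helper, not part of Zipline.

```python
from datetime import date

# Assumed trading sessions at the start of the learning bundle
# (hard-coded for illustration; not read from a real trading calendar).
sessions = [
    date(2007, 1, 3), date(2007, 1, 4), date(2007, 1, 5), date(2007, 1, 8),
    date(2007, 1, 9), date(2007, 1, 10), date(2007, 1, 11), date(2007, 1, 12),
    date(2007, 1, 16), date(2007, 1, 17), date(2007, 1, 18), date(2007, 1, 19),
]

def earliest_start_date(sessions, max_window_length=1):
    """First session whose output can be built from `max_window_length` prior sessions."""
    # Output for sessions[i] uses data from sessions[i - max_window_length]
    # through sessions[i - 1], so the first usable index is max_window_length.
    return sessions[max_window_length]

print(earliest_start_date(sessions))                        # 2007-01-04
print(earliest_start_date(sessions, max_window_length=10))  # 2007-01-18
```

With a window length of 1 (or an empty pipeline), only the one-day lag applies; with a 10-day window, the start date moves forward by 9 more sessions.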

---

**Next Lesson:** [Combining Factors](Lesson05-Combining-Factors.ipynb) 