**Fin 585R**  
**Diether**  
**Regression/Estimating Linear Models** 

**Overview**

The purpose of this notebook is to introduce how to estimate linear regressions using the Python Data Analysis library. In this class, we will primarily rely on the `statsmodels` library to estimate regression models.

To introduce estimating regression models using Pandas/statsmodels, I am going to use the short-selling data we used in your first portfolio assignment. Remember, the data are monthly stock data for all stocks in the U.S. with non-missing loan fee data. The basic unit of observation is the stock-month. You can download the data directly using the following link: [the data](http://diether.org/prephd/03-mstk_short_02-12.csv). The data contain the following variables:

|Variable | Description                                       |
|---------|---------------------------------------------------|
|permno   | stock identifier                                  |
|caldt    | calendar date                                     |
|ret      | monthly return                                    |
|prclag   | stock price, lagged                               |   
|melag    | market equity, lagged                             |
|feelag   | the loan fee expressed as a percent per annum, lagged |

Remember, `feelag` represents the cost of shorting.

You should look over this notebook on your own. If you have questions, please bring them up in class.

**I. Importing the statsmodels library**

In [None]:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

**II. Read in the data and create a `dataframe`**

I am going to read in the data, multiply returns by 100 (so they represent percent per month), and create the port/binning variable from the homework using a different method.

In [None]:
df = pd.read_csv("https://diether.org/prephd/03-mstk_short_02-12.csv",parse_dates=['caldt'])
df['ret'] = df['ret']*100

df['port'] = 'low'
df.loc[(df.feelag >= 3) & (df.feelag < 5),'port'] = 'medium'
df.loc[df.feelag >= 5,'port'] = 'high'
df.head()
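
The same binning can also be sketched with `pd.cut`. This is only an alternative illustration on a small made-up dataframe (`toy` and its values are assumptions, not the course data); the bin edges mirror the thresholds used above, with `right=False` so that each bin includes its left edge.

```python
import numpy as np
import pandas as pd

# Toy loan fees (made up for illustration; not the course data)
toy = pd.DataFrame({'feelag': [0.5, 3.0, 4.9, 5.0, 12.0]})

# Bins are [-inf, 3), [3, 5), [5, inf) because right=False closes each bin on the left
toy['port'] = pd.cut(
    toy['feelag'],
    bins=[-np.inf, 3, 5, np.inf],
    right=False,
    labels=['low', 'medium', 'high'],
)
print(toy)
```

Note that `pd.cut` returns a pandas categorical column rather than plain strings.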

**Estimating Regressions**

Estimating linear models using regressions will be a bread and butter part of the course as we move forward. `Statsmodels` is our main regression analysis library in Python. It has a formula based interface with a syntax very similar to R. It's designed to work seamlessly with `pandas`' dataframes.

**Estimating regressions using statsmodels and the patsy formula interface**

First, let's do a simple pooled univariate linear regression using `ret` and `feelag`. So the model is the following:<br>

$$
ret_{it} = a + \beta fee_{i,t-1} + \epsilon_{it}
$$

Graphically, we are fitting a line through the following scatterplot:

In [None]:
df.plot.scatter(y='ret',x='feelag')

Statsmodels includes an `ols` function. The `ols` function requires two parameters (and accepts other optional parameters). The **first required parameter is a formula expressed as a string.** This formula interface may seem slightly strange the first time you see it, but it allows you to compactly express linear models. What does the formula string look like for the following regression?

$$
ret_{it} = a + \beta fee_{i,t-1} + \epsilon_{it}
$$

The formula string is the following:

```python
'ret ~ feelag'
```

A few notes on the formula interface:

1. `~` plays the role of the equal sign in the formula interface. I think the idea behind using `~` is that `=` is already used in so many other contexts in a programming language. Using `~` makes it obvious that we are writing a formula for a model.<br><br>

2. Model coefficients are implied. If we want to estimate the following regression, we don't need to specify $a$ and $\beta$ in our statsmodels formula (technically it's called a patsy formula ... that's the name of the formula library): <br><br>
$$
ret_{it} = a + \beta fee_{i,t-1} + \epsilon_{it}
$$ <br>
The formula assumes them because an intercept and a slope coefficient are standard parts of a univariate linear model.<br><br>

3. This formula syntax comes from the R statistical programming language (well, really it comes from the S statistical language ... R started off as basically a reimplementation of that language).
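
To see what "implied coefficients" means in practice, here is a minimal sketch that asks patsy directly for the design matrices a formula produces. The dataframe `toy` and its values are made up for illustration; `dmatrices` is patsy's function for building the left- and right-hand-side matrices from a formula string.

```python
import pandas as pd
from patsy import dmatrices

# Toy data (made up for illustration; not the course data)
toy = pd.DataFrame({
    'ret': [1.2, -0.4, 0.3, 2.1],
    'feelag': [0.5, 4.0, 1.0, 7.5],
})

# y holds the dependent variable; X holds one column per coefficient
y, X = dmatrices('ret ~ feelag', data=toy, return_type='dataframe')
print(X)
```

The right-hand-side matrix has an `Intercept` column of ones next to `feelag`, even though the formula never mentions it.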

The **second required parameter is the dataframe** containing the data that will be used; the dataframe must contain both the dependent and independent variables (with the exception that you can create new variables derived from the variables in the dataframe via the formula interface).<br><br>

In [None]:
reg0 = smf.ols('ret ~ feelag',data=df).fit()

Note that the call to `smf.ols` followed by the `fit()` method estimates the regression and returns a regression results object. If I display `reg0`, I get the following:

In [None]:
reg0

**Statsmodels Regression Object**

It's just an object, and by default it doesn't report the results when we print it. A statsmodels regression object contains a lot of methods and attributes that allow you to access or report information from the regression, but you have to call them explicitly. We will probably use the `summary` method the most. The `summary` method outputs the results of the regression:

In [None]:
reg0.summary()
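
Besides `summary`, individual pieces of the results can be pulled out as attributes. The sketch below uses simulated data (the dataframe `toy`, the seed, and the coefficients used to generate it are all assumptions for illustration) and reads a few standard attributes of a fitted statsmodels OLS result.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data (made up for illustration; not the course data)
rng = np.random.default_rng(0)
toy = pd.DataFrame({'feelag': rng.uniform(0, 10, 200)})
toy['ret'] = 1.0 - 0.5 * toy['feelag'] + rng.normal(0, 2, 200)

res = smf.ols('ret ~ feelag', data=toy).fit()

coefs = res.params     # pandas Series indexed by term name
stderr = res.bse       # standard errors, same index
tstats = res.tvalues   # t-statistics (coefficient / standard error)
r2 = res.rsquared      # R-squared
nobs = res.nobs        # number of observations used
print(coefs)
```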

<br>

**Creating new variables via the formula interface**

You can create new variables derived from the variables in the dataframe via the formula interface. For example, maybe you want to include lagged market-cap (`melag`) as a control variable but you also know that market-cap is pretty skewed (some stocks have very big market-caps). Therefore, you want to take the natural log of the variable. You don't need to create a new column with the natural log of `melag` in your dataframe. Instead, you can create it in the formula string itself. So let's estimate the following regression and create the independent variable we need in the formula string:

$$
ret_{it} = a + \beta_1 fee_{i,t-1} + \beta_2 log(me_{i,t-1}) +  \epsilon_{it}
$$

In [None]:
reg1 = smf.ols('ret ~ feelag + np.log(melag)',data=df).fit()
reg1.summary()
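
As a quick check on the idea, the following sketch compares a transform written inside the formula with the same transform stored as a column first. The data are simulated and the column name `lnme` is a hypothetical name chosen for this example.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data (made up for illustration; not the course data)
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    'feelag': rng.uniform(0, 10, 100),
    'melag': rng.lognormal(mean=6, sigma=1.5, size=100),
})
toy['ret'] = 0.5 - 0.1 * toy['feelag'] + rng.normal(0, 1, 100)

# Transform inside the formula string
in_formula = smf.ols('ret ~ feelag + np.log(melag)', data=toy).fit()

# Same model with the log stored in a (hypothetical) column first
toy['lnme'] = np.log(toy['melag'])
precomputed = smf.ols('ret ~ feelag + lnme', data=toy).fit()

print(in_formula.params)
print(precomputed.params)
```

The two fits should agree; only the label of the coefficient differs (`np.log(melag)` versus `lnme`).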

**Creating dummy variables and interaction terms via the formula interface**

One nice feature of the formula interface is that it includes a special function for creating dummy variables/categorical variables. The function is called `C` ($\leftarrow$ short for Categorical). In the homework you created a `port` variable that breaks `feelag` into low, medium, and high categories. I recreated that variable at the beginning of the notebook. Let's create a model that uses those categories as dummy variables instead of using `feelag`:<br>

$$
ret_{it} = a + \beta_1\bigl(3 \le fee_{i,t-1} < 5\bigr) + \beta_2\bigl(fee_{i,t-1} \ge 5\bigr) +  \beta_3 log(me_{i,t-1}) + \epsilon_{it}
$$

<br>

In [None]:
reg2 = smf.ols('ret ~ C(port) + np.log(melag)',data=df).fit()
reg2.summary()

<br>

**More on the "C" Function**

The Categorical function is quite handy. If there are more than two categories, it will automatically create a dummy variable for each group and then omit one of the categories (it omits a category to avoid linear dependence between the sum of the dummies and the intercept). Note, it didn't do precisely what I wanted in the formula. The high group is the omitted category (so the high group is absorbed into the intercept), because for string categories the omitted one is the first in alphabetical order.
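
If you want a particular category to be the omitted one, patsy lets you pass a contrast to `C`. The sketch below is an illustration on a made-up dataframe (`toy` is an assumption); it uses `Treatment(reference=...)`, which is available inside patsy formulas, and inspects the resulting design-matrix columns with `dmatrix`.

```python
import pandas as pd
from patsy import dmatrix

# Toy categories (made up for illustration; not the course data)
toy = pd.DataFrame({'port': ['low', 'medium', 'high', 'low']})

# Default: the alphabetically first level ('high') is omitted
default_X = dmatrix('C(port)', data=toy, return_type='dataframe')

# Explicit reference level: 'low' is omitted instead
low_ref_X = dmatrix("C(port, Treatment(reference='low'))", data=toy,
                    return_type='dataframe')

print(list(default_X.columns))
print(list(low_ref_X.columns))
```

The same `C(port, Treatment(reference='low'))` term can be used in an `smf.ols` formula string.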

I can explicitly create the dummy variables in the patsy formula string. Suppose I want to estimate the following model:

$$
ret_{it} = a + \beta_1\bigl(fee_{i,t-1} > 5\bigr) +  \beta_2 log(me_{i,t-1}) + \epsilon_{it}
$$

<br>

In [None]:
reg3 = smf.ols('ret ~ feelag > 5 + np.log(melag)',data=df).fit()
reg3.summary()
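
It may help to see how a comparison inside a formula is turned into a column. In the sketch below (made-up dataframe `toy`), patsy treats `feelag > 5` as a single Python expression and `+` as the separator between terms, so the formula yields an intercept, a True/False dummy, and the log term.

```python
import numpy as np
import pandas as pd
from patsy import dmatrix

# Toy data (made up for illustration; not the course data)
toy = pd.DataFrame({
    'feelag': [0.5, 4.0, 6.0, 9.0],
    'melag': [100.0, 2500.0, 40.0, 900.0],
})

X = dmatrix('feelag > 5 + np.log(melag)', data=toy, return_type='dataframe')

# The boolean expression is coded as a dummy that equals 1 when it is True
dummy_col = [c for c in X.columns if c.endswith('[T.True]')][0]
print(list(X.columns))
print(X[dummy_col].tolist())
```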

<br>

**Interaction Terms in the Formula Interface**

You can also specify interaction terms in a formula string:<br><br>

In [None]:
reg4 = smf.ols('ret ~ feelag > 5 + np.log(melag) + np.log(melag)*(feelag > 5) ',data=df).fit()
reg4.summary()
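
A note on the operators: in a patsy formula, `a * b` is shorthand for `a + b + a:b`, while `a:b` alone adds only the interaction column. The sketch below shows this on two made-up numeric columns (`x1` and `x2` are hypothetical names, simpler than the boolean interaction used above).

```python
import pandas as pd
from patsy import dmatrix

# Toy numeric columns (made up for illustration)
toy = pd.DataFrame({'x1': [1.0, 2.0, 3.0, 4.0], 'x2': [0.5, 1.5, -1.0, 2.0]})

star = dmatrix('x1 * x2', data=toy, return_type='dataframe')        # main effects + interaction
colon = dmatrix('x1 : x2', data=toy, return_type='dataframe')       # interaction only
redundant = dmatrix('x1 + x2 + x1 * x2', data=toy, return_type='dataframe')

print(list(star.columns))
print(list(colon.columns))
print(list(redundant.columns))
```

Because repeated terms are dropped, listing the main effects and also using `*` gives the same columns as using `*` alone.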

<br>

**Suppressing the intercept**

By default the formula string implicitly includes an intercept coefficient. Given that, how do we suppress estimation of an intercept term? A more explicit way to write down a linear regression model with an intercept term (using formula interface syntax) is to include an intercept data column. An intercept coefficient corresponds to a variable that is a column of all ones:<br><br>

In [None]:
reg5 = smf.ols('ret ~ 1 + feelag + np.log(melag)',data=df).fit()
reg5.summary()

<br>

The suppression of the intercept in the formula string comes from the idea that a regression doesn't estimate an intercept if we get rid of the column of ones in the independent variable matrix. Based on that idea, `statsmodels` (and once again, this comes from the `R` world) allows you to suppress the intercept by putting '-1' in the formula:<br><br>

In [None]:
reg5 = smf.ols('ret ~ -1 + feelag + np.log(melag)',data=df).fit()
reg5.summary()
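
Patsy also accepts `0 +` as another spelling for removing the intercept. The following sketch, on a made-up dataframe `toy`, compares the design-matrix columns produced by the default formula, the `-1` form, and the `0 +` form.

```python
import pandas as pd
from patsy import dmatrices

# Toy data (made up for illustration; not the course data)
toy = pd.DataFrame({
    'ret': [1.0, 0.2, -0.5, 0.9],
    'feelag': [0.5, 4.0, 6.0, 9.0],
})

_, X_default = dmatrices('ret ~ feelag', data=toy, return_type='dataframe')
_, X_minus1 = dmatrices('ret ~ -1 + feelag', data=toy, return_type='dataframe')
_, X_zero = dmatrices('ret ~ 0 + feelag', data=toy, return_type='dataframe')

print(list(X_default.columns))  # includes the column of ones
print(list(X_minus1.columns))   # no intercept column
print(list(X_zero.columns))     # no intercept column
```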

**Suppressing the Intercept and "C" Function**

If you suppress the intercept in the formula string, the Categorical function will include all the potential dummy variable columns in the regression instead of omitting one category:<br><br>

In [None]:
reg6 = smf.ols('ret ~ -1 + C(port) + np.log(melag)',data=df).fit()
reg6.summary()
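
The effect on the dummy columns can be inspected directly. This sketch uses a small made-up dataframe (`toy` is an assumption) and compares the columns patsy builds for `C(port)` with and without the intercept.

```python
import pandas as pd
from patsy import dmatrix

# Toy categories (made up for illustration; not the course data)
toy = pd.DataFrame({'port': ['low', 'medium', 'high', 'low', 'high']})

with_int = dmatrix('C(port)', data=toy, return_type='dataframe')
no_int = dmatrix('-1 + C(port)', data=toy, return_type='dataframe')

print(list(with_int.columns))  # intercept plus two dummies
print(list(no_int.columns))    # one dummy per category, no intercept
```

Without the intercept, each row has exactly one dummy equal to one, so each coefficient on a dummy plays the role of that group's own intercept.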

**Creating a Table of Regression Results**

**The Regtable function**

The BYU Fin 585R Library includes a function called `Regtable` that allows you to display regression results in a table. In the academic literature for economics, it's standard to stack regressions so that each column represents a separate regression.

[Regtable documentation](https://fin-library.readthedocs.io/en/latest/regtables.html)

Note, the primary parameter for Regtable is a `list` of regression objects.<br><br>

In [None]:
from finance_byu.regtables import Regtable

tbl = Regtable([reg0,reg1,reg2,reg3,reg4],stat='tstat',sig='coeff')
tbl.render()

In [None]:
tbl = Regtable([reg0,reg1,reg2,reg3,reg4,reg5,reg6],stat='tstat',sig='coeff')
tbl.render()