**Fin 585R**  
**Diether**  
**Intro to Portfolios** <br><br> 


**I. Overview**

+ This notebook introduces the concept of **portfolios.**<br><br>

+ It also introduces **portfolio construction** using Python/Pandas.<br><br>

+ It covers programming concepts for basic portfolio formation and computing portfolio returns.<br><br>

+ Portfolio formation and computing portfolio returns rely heavily on the groupby programming construct.<br><br>


**II. Portfolios: Conceptual Overview**

+ A portfolio is a collection of assets.<br><br>

+ Portfolios aren't an artificial construct. $\leftarrow$ if you own any financial assets, you have a portfolio.<br><br>

+ These assets can be primitive securities like stocks or bonds.<br><br>

+ These assets could also be other portfolios.<br><br>

+ A portfolio is defined by two parameters:<br>

  1. The assets in the portfolio.<br><br>
  
  2. The weight on each asset $\leftarrow$ weight = percent of overall investment.<br><br>


**Example Portfolio**

+ You invest 30% in Krispy Kreme's stock and 70% in Google's stock.<br><br>

+ Weights: $w_{g} = 0.7$ and $w_{k} = 0.3$.<br><br>

+ For a standard portfolio (also called a unit cost portfolio), the weights must sum to 1. <i>Summation Constraint</i><br><br>

+ <i><b>MY NOTES:</b> d_t is the dividend payment or the bond payment for the month; the return is technically a rate of change </i><br><br>

+ The one period return (period $t$) for any asset ($i$) is the following:<br>

$$
r_{it} = r_t = \frac{d_t + P_t - P_{t-1}}{P_{t-1}} 
$$<br>

+ Or for Google:

$$
r_{gt} = \frac{P_{gt} - P_{g,t-1} + d_{gt}}{P_{g,t-1}}= \frac{P_{gt} + d_{gt}}{P_{g,t-1}} - 1
$$<br>

+ It's just the percentage change in value for the asset, including any cash payments or dividends during the period.<br><br>

+ Given that portfolios are defined by the assets in the portfolio and the weights, the return on our two-asset portfolio is the following (call the portfolio $p$):<br>

\begin{align*}
r_p  &= wr_g + (1-w)r_k \\
r_{p} &= 0.7r_{g} + 0.3r_{k}
\end{align*}<br>

+ <i><b>MY NOTES:</b> Note in each case the weights will always sum to 1 </i><br><br>
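
As a rough numeric illustration of the two formulas above, the sketch below computes a one-period return for each stock and then the 70/30 portfolio return. The prices and dividends are made-up numbers chosen for easy arithmetic, and `one_period_return` is a hypothetical helper name, not something from the course code.

```python
# Hypothetical prices and dividends (illustrative numbers, not real data)
p_prev_g, p_now_g, d_g = 100.0, 104.0, 1.0   # "Google"
p_prev_k, p_now_k, d_k = 20.0, 19.0, 0.0     # "Krispy Kreme"


def one_period_return(p_prev, p_now, dividend=0.0):
    """Percentage change in value, including cash paid during the period."""
    return (p_now + dividend - p_prev) / p_prev


r_g = one_period_return(p_prev_g, p_now_g, d_g)   # (104 + 1 - 100) / 100
r_k = one_period_return(p_prev_k, p_now_k, d_k)   # (19 - 20) / 20

# Portfolio weights from the example; they must sum to 1
w_g, w_k = 0.7, 0.3
assert abs(w_g + w_k - 1.0) < 1e-12

# Portfolio return is the weighted sum of the asset returns
r_p = w_g * r_g + w_k * r_k
print(round(r_g, 4), round(r_k, 4), round(r_p, 4))
```

With these numbers Google returns 5%, Krispy Kreme returns -5%, and the portfolio returns 2%.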

**N-Asset Portfolio**

+ In general we can write the return on a portfolio with N assets as the following:<br>

$$
r_{p} = \sum_{i=1}^{N} \omega_{i}r_{i}, \quad \text{where} \quad \sum_{i=1}^{N} \omega_{i} = 1  
$$

+ $r_i$ refers to the return on asset $i$.<br><br>

+ $\omega_i$ refers to the weight on asset $i$ in the portfolio.<br><br>
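
The N-asset formula is a dot product of the weight vector and the return vector. A minimal sketch with NumPy, using three hypothetical assets whose weights and returns are invented for illustration:

```python
import numpy as np

# Hypothetical weights and one-period returns for three assets
w = np.array([0.5, 0.3, 0.2])
r = np.array([0.02, -0.01, 0.04])

# Summation constraint: the weights must add up to 1
assert np.isclose(w.sum(), 1.0)

# r_p = sum_i w_i * r_i, written as a dot product
r_p = float(w @ r)
print(round(r_p, 4))
```

Here the portfolio return works out to 0.5(0.02) + 0.3(-0.01) + 0.2(0.04) = 0.015.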


**IV. Portfolio Construction Framework**

1. Data preparation.<br>

+ <i><b>MY NOTES:</b> We will typically skip this step </i><br>

2. Creation of the portfolio formation variable.<br>

+ <i><b>MY NOTES:</b> Pick something economically interesting; maybe number of analysts?</i><br>

3. Binning the stock return data based on the formation variable.<br>

+ <i><b>MY NOTES:</b> Stratify the sample; does it belong in or outside of the portfolio? </i><br>

4. Portfolio creation.<br>

+ <i><b>MY NOTES:</b> This code can be reused for whatever strategy we use </i><br>

5. Estimating historical performance of the portfolios or testing economic models using portfolios.<br><br>


**Today's Focus for Our Framework**

+ Today, we introduce steps 2-4.<br><br>

+ In the future, we will cover all the steps in more detail.<br><br>


**Goal of Step 1**

+ I've already done step 1: data preparation.<br><br>

+ Our goal for step 1 is to get the data in <b>panel form.</b><br><br>

+ <i><b>MY NOTES:</b> Panel data means that the permno and the date uniquely determine an observation </i><br><br>

+ That form $\rightarrow$ panel data with two dimensions: date and entity (e.g., different stocks)<br><br>

+ For example, our data today in the panel form: monthly-stock data<br><br>

+ Permno/caldt defines an observation.<br><br>

+ Returns and prices of the stocks are going to be our variables of interest.


```
    permno      caldt ticker     prc       ret
0    10026 2020-09-30   JJSF  130.39 -0.036668
1    10026 2020-10-30   JJSF  135.57  0.039727
2    10026 2020-11-30   JJSF  145.39  0.072435
3    10026 2020-12-31   JJSF  155.37  0.072598
4    10028 2020-09-30    ELA    4.29  0.105670
5    10028 2020-10-30    ELA    4.04 -0.058275
6    10028 2020-11-30    ELA    4.62  0.143560
7    10028 2020-12-31    ELA    5.20  0.125540
8    10032 2020-09-30   PLXS   70.63 -0.071513
9    10032 2020-10-30   PLXS   69.54 -0.015432
10   10032 2020-11-30   PLXS   74.71  0.074346
11   10032 2020-12-31   PLXS   78.21  0.046848
```

<br>

+ We often want to group and then transform data by stock ID or date.<br><br>

+ Portfolio construction typically involves both.<br><br>
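
One way to check the panel structure described above is to confirm that no permno/caldt pair appears twice, and then group along each dimension. The sketch below rebuilds four rows of the sample shown earlier; the variable names (`panel`, `is_panel`, and so on) are illustrative only.

```python
import pandas as pd

# Four rows copied from the sample panel above (two stocks, two months)
panel = pd.DataFrame({
    "permno": [10026, 10026, 10028, 10028],
    "caldt": pd.to_datetime(
        ["2020-09-30", "2020-10-30", "2020-09-30", "2020-10-30"]
    ),
    "ticker": ["JJSF", "JJSF", "ELA", "ELA"],
    "ret": [-0.036668, 0.039727, 0.105670, -0.058275],
})

# In panel form, each (permno, caldt) pair identifies exactly one row
is_panel = not panel.duplicated(subset=["permno", "caldt"]).any()

# Grouping by entity: number of monthly observations per stock
obs_per_stock = panel.groupby("permno")["ret"].count()

# Grouping by date: number of stocks observed in each month
stocks_per_month = panel.groupby("caldt")["ret"].count()

print(is_panel)
print(obs_per_stock)
print(stocks_per_month)
```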


**Data for our Grouping Example**

+ The data are monthly stock prices and returns for all publicly traded stocks in the U.S. from 2019-2021.<br><br> 

+ <i><b>MY NOTES:</b> 1926 - Present is considered the modern era of financial data; CRSP starts in '26 </i><br><br>

+ The data are drawn from the most common academic source: the Center for Research in Security Prices (CRSP). <br><br>

+ The basic unit of observation is the stock-month. <br><br>

+ You can download the data directly using the following link: [the data](http://diether.org/markets/02-mstk.csv).<br><br>

+ Data variables:<br><br>

|Variable | Description                                       |
|---------|---------------------------------------------------|
|permno   | stock identifier                                  |
|caldt    | calendar date                                     |
|ticker   | ticker symbol                                     |
|prc      | month end price                                   |
|ret      | monthly return                                    |
|vol      | monthly shares traded (in 1,000s)                 |
|shr      | shares outstanding (in 1,000s)                    |   


In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv("https://diether.org/prephd/02-mstk.csv",parse_dates=['caldt'])
df

,permno,caldt,ticker,prc,ret,vol,shr
0,10026,2019-01-31,JJSF,154.35,0.067501,21310.0,18783.0
1,10026,2019-02-28,JJSF,155.28,0.006025,14119.0,18783.0
2,10026,2019-03-29,JJSF,158.84,0.026146,13216.0,18815.0
3,10026,2019-04-30,JJSF,157.18,-0.010451,14432.0,18817.0
4,10026,2019-05-31,JJSF,160.85,0.023349,15264.0,18817.0
...,...,...,...,...,...,...,...
137552,93436,2021-08-31,TSLA,735.72,0.070605,3812200.0,1001800.0
137553,93436,2021-09-30,TSLA,775.48,0.054042,3889200.0,1004000.0
137554,93436,2021-10-29,TSLA,1114.00,0.436530,5264000.0,1004300.0
137555,93436,2021-11-30,TSLA,1144.80,0.027612,6457200.0,1004300.0


**V. Our First Portfolio**

+ An equal-weight portfolio of all stocks.<br><br>

+ <i><b>MY NOTES:</b> Equal weight is very easy for pandas to compute </i><br><br>

+ It's actually an easy portfolio to construct.<br><br>

+ Step 2: portfolio formation variable $\leftarrow$ all stocks.<br><br>

+ Step 3: bin the data $\leftarrow$ no binning, want all observations.<br><br>

+ Step 4: portfolio creation and returns $\leftarrow$ some work here.<br><br>


**Equal-Weight Portfolio of All Stocks**

+ Equal-weight portfolios are very common.<br><br>

+ Pretty easy to program as well.<br><br>

+ Each stock's weight in the portfolio is 1/N<br><br>

+ Implies we rebalance the weights every month.<br><br>

+ <i><b>MY NOTES:</b> This implies that we need to buy/sell stocks each month to equalize the portfolio "Well diversified portfolio"</i><br><br>


**Step 4: Need to Implement the Formula**

+ Every month, the return on the portfolio is the following:

\begin{align*}
r_p &= \frac{1}{n}r_1 + \frac{1}{n}r_2 + \frac{1}{n}r_3 + \cdots  + \frac{1}{n}r_n \\
    &= \frac{1}{n} \bigl(r_1 + r_2 + r_3 + \cdots  + r_n \bigr) \\
    &= \frac{1}{n} \sum_{i=1}^{n} r_i 
\end{align*}

+ Note, the preceding is just an average across all stocks in a given month.<br><br>

+ That gives us a shortcut.<br><br>

**Implementing Step 4 in Python**

+ So conceptually to form this portfolio we want to do the following:<br>

  1. group the observations by calendar month<br><br>
  
  2. loop through each of the months computing the average across the stocks (equivalent to the equal-weight portfolio return)<br>
      + <i><b>MY NOTES:</b> The average is equivalent to the equal weight portfolio </i><br>
 
  3. save those portfolio returns into a new dataframe.<br><br>

+ Python/Pandas is really good at this.<br><br>

+ Groupby does this for us.<br><br>
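
To make the three conceptual steps concrete, the sketch below spells them out as an explicit loop on a tiny made-up panel and then compares the result with the one-line groupby version. The data and the names `toy`, `loop_port`, and `gb_port` are invented for illustration.

```python
import pandas as pd

# Toy panel: three hypothetical stocks observed in two months
toy = pd.DataFrame({
    "caldt": pd.to_datetime(["2020-01-31"] * 3 + ["2020-02-28"] * 3),
    "permno": [1, 2, 3, 1, 2, 3],
    "ret": [0.01, 0.02, 0.03, -0.02, 0.00, 0.05],
})

# Steps 1-3 written out: group by month, average within each month, save
rows = []
for month, grp in toy.groupby("caldt"):
    rows.append({"caldt": month, "ret": grp["ret"].mean()})
loop_port = pd.DataFrame(rows).set_index("caldt")["ret"]

# The same computation in one line
gb_port = toy.groupby("caldt")["ret"].mean()

# Both approaches give the same equal-weight portfolio returns
assert (loop_port - gb_port).abs().max() < 1e-12
print(gb_port)
```

The loop is only there to show what groupby does; the one-line form is the one to use in practice.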

In [4]:
df.groupby('caldt')['ret']

<pandas.core.groupby.generic.SeriesGroupBy object at 0x7fd2b21fb6d0>

In [5]:
df.groupby('caldt')['ret'].mean()

caldt
2019-01-31    0.142560
2019-02-28    0.053005
2019-03-29   -0.012808
2019-04-30    0.025195
2019-05-31   -0.080560
2019-06-28    0.055159
2019-07-31   -0.009140
2019-08-30   -0.052700
2019-09-30    0.021564
2019-10-31    0.002017
2019-11-29    0.039943
2019-12-31    0.053211
2020-01-31   -0.011801
2020-02-28   -0.072001
2020-03-31   -0.223813
2020-04-30    0.192553
2020-05-29    0.081180
2020-06-30    0.071086
2020-07-31    0.040533
2020-08-31    0.048534
2020-09-30   -0.026038
2020-10-30    0.008745
2020-11-30    0.205248
2020-12-31    0.096930
2021-01-29    0.109572
2021-02-26    0.087045
2021-03-31    0.012909
2021-04-30    0.012332
2021-05-28    0.010912
2021-06-30    0.030062
2021-07-30   -0.041969
2021-08-31    0.023730
2021-09-30   -0.028472
2021-10-29    0.023266
2021-11-30   -0.053534
2021-12-31   -0.008932
Name: ret, dtype: float64

In [7]:
port = df.groupby('caldt')['ret'].mean()*100
port.describe()

count    36.000000
mean      2.293117
std       7.750542
min     -22.381319
25%      -1.205256
50%       2.241497
75%       5.369789
max      20.524796
Name: ret, dtype: float64

**This is a high-growth period.** Typically, the monthly equal-weighted return is about 1%.

**VI. Our Second Portfolio**

+ Let's form two portfolios:<br>

  1. Portfolio contains stocks with low prices: $P_{lag} \le 5$.<br><br>
  
  2. Portfolio contains stocks with higher prices: $P_{lag} > 5$.<br><br>
  
**Portfolio Construction**

2. Portfolio formation variable $\leftarrow$ lagged price.<br><br>

3. Bin the data $\leftarrow$ create binning variable that equals 0 if $P_{lag} \le 5$ and 1 if $P_{lag} > 5$.<br><br>

4. Portfolio creation and returns $\leftarrow$ let's form equal-weight portfolios for each.<br><br>

**Portfolio Formation Variable: Lagged Price**

+ Key: for portfolio construction, we can only use information we would have had in real time.<br><br>

+ Using information from the future can create terrible biases in testing (look-ahead bias).<br><br>

+ **Always be careful with this issue.**<br><br>

+ Returns are measured at time $t$.<br><br>

+ Asset selection and portfolio construction info has to come from $t-1$ or earlier.<br><br>

+ Therefore, if price is the portfolio formation variable it must be lagged.<br><br>

+ <i><b>MY NOTES:</b> We cannot create portfolios with information we don't have at time t! Important! </i><br><br>


**Pandas: Use Shift**

+ The code in the next cell is wrong. Why?<br><br>

+ <i><b>MY NOTES:</b> The prices "overflow" into different tickers! Screwing up the data </i><br><br>

+ We need to use a groupby with shift. Why?<br><br>

+ <i><b>MY NOTES:</b> This will respect the seam, protecting the fidelity of the data, and producing a missing value instead of an error </i><br><br>


In [10]:
df['prc_lag'] = df['prc'].shift(1)
df.tail(40)

,permno,caldt,ticker,prc,ret,vol,shr,prc_lag
137517,93434,2021-09-30,SANW,2.59,-0.071685,32656.0,36777.0,2.79
137518,93434,2021-10-29,SANW,4.29,0.65637,48416.0,38643.0,2.59
137519,93434,2021-11-30,SANW,2.92,-0.31935,25531.0,38715.0,4.29
137520,93434,2021-12-31,SANW,2.73,-0.065069,23067.0,38890.0,2.92
137521,93436,2019-01-31,TSLA,307.02,-0.077464,1757100.0,172600.0,2.73
137522,93436,2019-02-28,TSLA,319.88,0.041887,1285700.0,172720.0,307.02
137523,93436,2019-03-29,TSLA,279.86,-0.12511,2137800.0,173680.0,319.88
137524,93436,2019-04-30,TSLA,238.69,-0.14711,2307500.0,173720.0,279.86
137525,93436,2019-05-31,TSLA,185.16,-0.22427,2826500.0,177270.0,238.69
137526,93436,2019-06-28,TSLA,223.46,0.20685,2149800.0,179120.0,185.16


In [11]:
df['prc_lag'] = df.groupby('permno')['prc'].shift(1)
df.tail(40)

,permno,caldt,ticker,prc,ret,vol,shr,prc_lag
137517,93434,2021-09-30,SANW,2.59,-0.071685,32656.0,36777.0,2.79
137518,93434,2021-10-29,SANW,4.29,0.65637,48416.0,38643.0,2.59
137519,93434,2021-11-30,SANW,2.92,-0.31935,25531.0,38715.0,4.29
137520,93434,2021-12-31,SANW,2.73,-0.065069,23067.0,38890.0,2.92
137521,93436,2019-01-31,TSLA,307.02,-0.077464,1757100.0,172600.0,
137522,93436,2019-02-28,TSLA,319.88,0.041887,1285700.0,172720.0,307.02
137523,93436,2019-03-29,TSLA,279.86,-0.12511,2137800.0,173680.0,319.88
137524,93436,2019-04-30,TSLA,238.69,-0.14711,2307500.0,173720.0,279.86
137525,93436,2019-05-31,TSLA,185.16,-0.22427,2826500.0,177270.0,238.69
137526,93436,2019-06-28,TSLA,223.46,0.20685,2149800.0,179120.0,185.16
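
The same contrast can be reproduced on a tiny made-up panel. One assumption worth stating: `shift` works on row order, so the sketch sorts by permno and caldt first in case the rows do not arrive sorted (the course data appear to be sorted already). The column name `prc_lag_wrong` is hypothetical and exists only to show the leak.

```python
import pandas as pd

# Toy panel, deliberately out of order: two hypothetical stocks, two months
toy = pd.DataFrame({
    "permno": [2, 1, 1, 2],
    "caldt": pd.to_datetime(
        ["2020-02-28", "2020-01-31", "2020-02-28", "2020-01-31"]
    ),
    "prc": [50.0, 10.0, 11.0, 48.0],
})

# shift() relies on row order, so sort within stock by date first
toy = toy.sort_values(["permno", "caldt"]).reset_index(drop=True)

# Wrong: a plain shift carries stock 1's last price into stock 2's first row
toy["prc_lag_wrong"] = toy["prc"].shift(1)

# Right: shifting within each permno leaves the first row of each stock missing
toy["prc_lag"] = toy.groupby("permno")["prc"].shift(1)

print(toy)
```

In the sorted toy data, the first row of stock 2 gets stock 1's price (11.0) under the plain shift and a missing value under the grouped shift.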


**Bin the Data with Cut**

+ pd.cut takes breakpoints and bins the data.<br>

    + <i><b>MY NOTES:</b> VERY INTERESTING; cut will ignore missing observations anyway</i><br>

+ Specify the breakpoint values in a list: [0,5,500000] <br><br>

+ The above will create two bins: (0,5] and (5,500000]<br><br>

+ <i><b>MY NOTES:</b> Set labels=False to get integer bin codes instead of interval labels </i><br><br>

+ <i><b>MY NOTES:</b> Cut is most useful; it is used for more than one grouping </i><br><br>
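
A small sketch of how these breakpoints behave at the edges, using a handful of made-up lagged prices. Because the intervals are closed on the right, a price of exactly 5 lands in the low-price bin, and a missing lagged price stays missing.

```python
import numpy as np
import pandas as pd

# Hypothetical lagged prices, including the boundary value and a missing one
prc_lag = pd.Series([0.75, 5.0, 5.01, 120.0, np.nan])

# Same breakpoints as in the notes; labels=False returns integer bin codes
bins = pd.cut(prc_lag, [0, 5, 500000], labels=False)

print(bins.tolist())
```

The codes come back as floats because the missing value forces a float dtype.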



In [17]:
pd.cut(df['prc_lag'],[0,5,500000])

0                     NaN
1         (5.0, 500000.0]
2         (5.0, 500000.0]
3         (5.0, 500000.0]
4         (5.0, 500000.0]
               ...       
137552    (5.0, 500000.0]
137553    (5.0, 500000.0]
137554    (5.0, 500000.0]
137555    (5.0, 500000.0]
137556    (5.0, 500000.0]
Name: prc_lag, Length: 137557, dtype: category
Categories (2, interval[int64, right]): [(0, 5] < (5, 500000]]

In [18]:
pd.cut(df['prc_lag'],[0,5,500000],labels=False)

0         NaN
1         1.0
2         1.0
3         1.0
4         1.0
         ... 
137552    1.0
137553    1.0
137554    1.0
137555    1.0
137556    1.0
Name: prc_lag, Length: 137557, dtype: float64

In [19]:
df['bins'] = pd.cut(df['prc_lag'],[0,5,500000],labels=False)
df

,permno,caldt,ticker,prc,ret,vol,shr,prc_lag,bins
0,10026,2019-01-31,JJSF,154.35,0.067501,21310.0,18783.0,,
1,10026,2019-02-28,JJSF,155.28,0.006025,14119.0,18783.0,154.35,1.0
2,10026,2019-03-29,JJSF,158.84,0.026146,13216.0,18815.0,155.28,1.0
3,10026,2019-04-30,JJSF,157.18,-0.010451,14432.0,18817.0,158.84,1.0
4,10026,2019-05-31,JJSF,160.85,0.023349,15264.0,18817.0,157.18,1.0
...,...,...,...,...,...,...,...,...,...
137552,93436,2021-08-31,TSLA,735.72,0.070605,3812200.0,1001800.0,687.20,1.0
137553,93436,2021-09-30,TSLA,775.48,0.054042,3889200.0,1004000.0,735.72,1.0
137554,93436,2021-10-29,TSLA,1114.00,0.436530,5264000.0,1004300.0,775.48,1.0
137555,93436,2021-11-30,TSLA,1144.80,0.027612,6457200.0,1004300.0,1114.00,1.0


In [20]:
df['bins'].describe()

count    129918.000000
mean          0.809880
std           0.392397
min           0.000000
25%           1.000000
50%           1.000000
75%           1.000000
max           1.000000
Name: bins, dtype: float64

**Portfolio Construction**

+ Use the same basic code as our equal-weight portfolio of all stocks.<br><br>

+ But want to group on date/bin combinations.<br><br>

+ Pandas does this too: a two way groupby.<br><br>

+ For each date/bin combination, compute the equal-weight portfolio return (equivalent to the average return across the assets for each bin in a given month)<br><br>

In [30]:
port = df.groupby(['caldt','bins'])['ret'].mean()*100
port

caldt       bins
2019-02-28  0.0      6.156624
            1.0      5.169918
2019-03-29  0.0      1.175309
            1.0     -1.906431
2019-04-30  0.0      0.019657
                      ...    
2021-10-29  1.0      2.911849
2021-11-30  0.0    -11.457200
            1.0     -4.368234
2021-12-31  0.0     -9.007991
            1.0      0.723044
Name: ret, Length: 70, dtype: float64

In [31]:
port.reset_index()

,caldt,bins,ret
0,2019-02-28,0.0,6.156624
1,2019-02-28,1.0,5.169918
2,2019-03-29,0.0,1.175309
3,2019-03-29,1.0,-1.906431
4,2019-04-30,0.0,0.019657
...,...,...,...
65,2021-10-29,1.0,2.911849
66,2021-11-30,0.0,-11.457200
67,2021-11-30,1.0,-4.368234
68,2021-12-31,0.0,-9.007991


**Trick: Unstack**

+ Nobody likes this form for portfolios.<br><br>

+ We generally work with portfolios in a matrix-like dataframe.<br><br>

+ <b>Use unstack to make bins into columns.</b><br><br>

+ <i><b>MY NOTES:</b> The index is the LEVELS </i><br><br>
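
A toy version of the two-way groupby followed by unstack, with made-up returns so the averages are easy to verify by hand. The names `toy`, `long_port`, and `wide_port` are illustrative only.

```python
import pandas as pd

# Toy data: two months, two bins, two hypothetical stocks per bin
toy = pd.DataFrame({
    "caldt": pd.to_datetime(["2020-01-31"] * 4 + ["2020-02-28"] * 4),
    "bins": [0, 0, 1, 1, 0, 0, 1, 1],
    "ret": [0.10, 0.20, 0.01, 0.03, -0.10, 0.00, 0.02, 0.04],
})

# Long form: one row per (caldt, bins) pair, stored in a two-level index
long_port = toy.groupby(["caldt", "bins"])["ret"].mean()

# Wide form: move the 'bins' index level into the columns
wide_port = long_port.unstack(level="bins")

print(long_port)
print(wide_port)
```

The wide form has one row per month and one column per bin, which is the matrix-like layout described above.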


In [32]:
port = df.groupby(['caldt','bins'])['ret'].mean()*100
port = port.unstack(level='bins')
port.head(5)

bins,0.0,1.0
caldt,,
2019-02-28,6.156624,5.169918
2019-03-29,1.175309,-1.906431
2019-04-30,0.019657,3.169937
2019-05-31,-10.468014,-7.568818
2019-06-28,1.347738,6.81364


In [24]:
port.describe()

bins,0.0,1.0
count,35.0,35.0
mean,3.623642,1.505018
std,12.814236,6.752536
min,-24.987104,-21.724547
25%,-3.333687,-2.280127
50%,1.347738,1.816352
75%,10.09985,4.985778
max,38.76391,18.110653


**BYU Finance Library**

+ We're going to use my finance library's summary function.<br><br>

+ It adds a t-stat (and p-value) for the test that the mean of each column equals zero.<br><br>



In [25]:
from finance_byu.summarize import summary
summary(port)

bins,0.0,1.0
count,35.0,35.0
mean,3.623642,1.505018
std,12.814236,6.752536
tstat,1.672964,1.318587
pval,0.103513,0.196121
min,-24.987104,-21.724547
25%,-3.333687,-2.280127
50%,1.347738,1.816352
75%,10.09985,4.985778
max,38.76391,18.110653


In [26]:
summary(port).loc[['count','mean','std','tstat']].round(3)

bins,0.0,1.0
count,35.0,35.0
mean,3.624,1.505
std,12.814,6.753
tstat,1.673,1.319
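
The tstat row above is consistent with the standard one-sample t-statistic, mean / (std / sqrt(count)). The sketch below is not the library's code. It recomputes the statistic from the mean, std, and count reported in the output, and `tstat_mean_zero` is a hypothetical helper showing how the same number could be computed directly from a dataframe of portfolio returns, assuming the sample standard deviation that pandas uses by default.

```python
import numpy as np
import pandas as pd


def tstat_mean_zero(returns: pd.DataFrame) -> pd.Series:
    """One-sample t-statistic for H0: the mean of each column is zero."""
    n = returns.count()                  # non-missing observations per column
    se = returns.std() / np.sqrt(n)      # standard error of the mean (ddof=1)
    return returns.mean() / se


# Summary statistics copied from the output above (bins 0.0 and 1.0)
mean = pd.Series({"bin0": 3.623642, "bin1": 1.505018})
std = pd.Series({"bin0": 12.814236, "bin1": 6.752536})
n = 35

tstat_from_summary = mean / (std / np.sqrt(n))
print(tstat_from_summary.round(3))
```

Rounded to three decimals this gives 1.673 and 1.319, matching the reported tstat row.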
