# McKinney Chapter 8 - Practice for Section 02

## Announcements

1. *Joining Data with pandas*, our fourth and final DataCamp course, is due by Friday, 2/16, at 11:59 PM
2. We will use class next week, from 2/19 through 2/23, for group work on Project 1, so there is no lecture video or pre-class quiz
3. Project 1 is due by Tuesday, 2/27, at 8:25 AM
4. Please complete the [anonymous ungraded survey on Canvas](https://northeastern.instructure.com/courses/171271/quizzes/586970) to help me help you learn better

## 10-Minute Recap

Chapter 8 of McKinney covers 3 topics.

1. *Hierarchical Indexing:* Hierarchical indexing helps us organize data at multiple levels, rather than just a flat, two-dimensional structure.
It helps us work with high-dimensional data in a low-dimensional form.
For example, we can index rows by multiple levels like "ticker" and "date", or columns by "variable" and "ticker".
2. *Combining Data:* We can combine datasets on one or more keys.
    1. We will use the `pd.merge()` function for database-style joins, which can be `inner`, `outer`, `left`, or `right` joins.
    2. We will use the `.join()` method to combine data frames with similar indexes.
    3. We will use the `pd.concat()` function to combine similarly-shaped series and data frames.
3. *Reshaping Data:* We can reshape data to change its structure, such as pivoting from wide to long format or vice versa.
We will most often use the `.stack()` and `.unstack()` methods, which pivot columns to rows and rows to columns, respectively.
Later in the course we will learn about the `.pivot_table()` method for aggregating data and the `.melt()` method for more advanced reshaping.
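
The three topics above can be sketched on toy data.
This is a minimal illustration, not the chapter's own example: the tickers `AAA`, `BBB`, and `CCC`, the `Sector` and `Beta` columns, and all values are made up.

```python
import pandas as pd

# Toy prices: rows indexed by date, columns by a (Variable, Ticker) multi index
dates = pd.to_datetime(['2024-01-02', '2024-01-03'])
columns = pd.MultiIndex.from_product(
    [['Close', 'Volume'], ['AAA', 'BBB']],
    names=['Variable', 'Ticker']
)
wide = pd.DataFrame(
    data=[[10.0, 20.0, 100.0, 200.0], [11.0, 19.0, 110.0, 190.0]],
    index=pd.Index(dates, name='Date'),
    columns=columns
)

# 1. Hierarchical indexing: slicing the outer column level returns a flat data frame
close = wide['Close']

# 2. Combining data
left = pd.DataFrame({'Ticker': ['AAA', 'BBB'], 'Sector': ['Banks', 'Tech']})
right = pd.DataFrame({'Ticker': ['BBB', 'CCC'], 'Beta': [1.2, 0.8]})
inner = pd.merge(left, right, on='Ticker', how='inner')  # only BBB is in both
outer = pd.merge(left, right, on='Ticker', how='outer')  # AAA, BBB, and CCC
joined = close.add_prefix('Close_').join(wide['Volume'].add_prefix('Volume_'))  # match on index
stacked_rows = pd.concat([close, close + 1])  # 4 rows, same columns

# 3. Reshaping data
long = wide.stack()    # the Ticker level moves from columns to rows
back = long.unstack()  # and back to columns
```

Here `long` has one row per date-ticker pair and one column per variable, which is the same layout as `stocks_long` in the practice below.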

## Practice

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pandas_datareader as pdr
import yfinance as yf

In [2]:
%precision 4
pd.options.display.float_format = '{:.4f}'.format
%config InlineBackend.figure_format = 'retina'

### Download data from Yahoo! Finance for BAC, C, GS, JPM, MS, and PNC and assign to data frame `stocks`.

Use `.rename_axis(columns=['Variable', 'Ticker'])` to assign the names `Variable` and `Ticker` to the column multi index.
We could instead use `stocks.columns.names = ['Variable', 'Ticker']`, but we can chain the `.rename_axis()` method.

In [3]:
stocks = (
    yf.download(tickers='BAC C GS JPM MS PNC')
    .rename_axis(columns=['Variable', 'Ticker'])
)

[*********************100%%**********************]  6 of 6 completed


In [4]:
stocks.tail()

Variable,Adj Close,Adj Close,Adj Close,Adj Close,Adj Close,Adj Close,Close,Close,Close,Close,...,Open,Open,Open,Open,Volume,Volume,Volume,Volume,Volume,Volume
Ticker,BAC,C,GS,JPM,MS,PNC,BAC,C,GS,JPM,...,GS,JPM,MS,PNC,BAC,C,GS,JPM,MS,PNC
Date,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
2024-02-09,33.07,53.99,384.26,175.01,85.89,147.77,33.07,53.99,384.26,175.01,...,384.77,175.0,85.69,147.87,36176000,13843300.0,2028200.0,6296700.0,5664500.0,1157600.0
2024-02-12,33.62,53.92,392.64,175.79,86.87,149.14,33.62,53.92,392.64,175.79,...,385.0,174.78,85.86,147.77,34160400,17162300.0,2797200.0,8539300.0,7888000.0,1612700.0
2024-02-13,32.75,52.76,378.75,174.26,83.97,145.26,32.75,52.76,378.75,174.26,...,387.59,175.32,85.86,146.78,43801500,17672100.0,3030800.0,8397600.0,11339400.0,2121400.0
2024-02-14,33.13,53.98,378.04,176.03,84.0,147.87,33.13,53.98,378.04,176.03,...,380.88,175.07,84.54,146.63,27833900,14891900.0,2041300.0,7056700.0,5973700.0,1265300.0
2024-02-15,34.07,55.21,385.42,179.87,85.67,149.63,34.07,55.21,385.42,179.87,...,379.42,176.15,84.45,148.75,41557209,16288325.0,1780459.0,8718729.0,7981854.0,1991304.0


### Reshape `stocks` from wide to long with dates and tickers as row indexes and assign to data frame `stocks_long`.

In [5]:
stocks_long = stocks.stack()

In [6]:
stocks_long.tail()

Unnamed: 0_level_0,Variable,Adj Close,Close,High,Low,Open,Volume
Date,Ticker,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
2024-02-15,C,55.21,55.21,55.48,54.135,54.22,16288325.0
2024-02-15,GS,385.42,385.42,387.21,379.35,379.42,1780459.0
2024-02-15,JPM,179.87,179.87,180.21,176.15,176.15,8718729.0
2024-02-15,MS,85.67,85.67,86.23,84.41,84.45,7981854.0
2024-02-15,PNC,149.63,149.63,150.2575,147.37,148.75,1991304.0


### Add daily returns for each stock to data frames `stocks` and `stocks_long`.

Name the returns variable `Returns`, and maintain all multi indexes.
*Hint:* Use `pd.MultiIndex.from_product()` to create a multi index for the wide data frame `stocks`.
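
The hint can be tried on a toy data frame before touching `stocks`.
This is only a sketch with made-up tickers (`AAA`, `BBB`) and prices; `toy` and `new_columns` are illustrative names.

```python
import pandas as pd

# Toy wide data frame with a (Variable, Ticker) column multi index
toy = pd.DataFrame(
    data=[[10.0, 20.0], [11.0, 19.0], [12.1, 19.0]],
    index=pd.date_range('2024-01-02', periods=3, name='Date'),
    columns=pd.MultiIndex.from_product(
        [['Adj Close'], ['AAA', 'BBB']],
        names=['Variable', 'Ticker']
    )
)

# One new column label per ticker: ('Returns', 'AAA') and ('Returns', 'BBB')
new_columns = pd.MultiIndex.from_product([['Returns'], toy['Adj Close'].columns])

# Assign all of the new columns at once; pandas pairs them with the tickers in order
toy[new_columns] = toy['Adj Close'].pct_change()
```

After the assignment, `toy['Returns']` is a flat data frame of returns with one column per ticker, and the first row is missing because there is no prior price.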

In [7]:
_ = pd.MultiIndex.from_product([['Returns'], stocks['Adj Close'].columns]) # I use _ as a temporary variable
stocks[_] = stocks['Adj Close'].iloc[:-1].pct_change()

In [8]:
stocks.tail()

Variable,Adj Close,Adj Close,Adj Close,Adj Close,Adj Close,Adj Close,Close,Close,Close,Close,...,Volume,Volume,Volume,Volume,Returns,Returns,Returns,Returns,Returns,Returns
Ticker,BAC,C,GS,JPM,MS,PNC,BAC,C,GS,JPM,...,GS,JPM,MS,PNC,BAC,C,GS,JPM,MS,PNC
Date,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
2024-02-09,33.07,53.99,384.26,175.01,85.89,147.77,33.07,53.99,384.26,175.01,...,2028200.0,6296700.0,5664500.0,1157600.0,-0.0015,-0.0055,-0.002,0.0012,0.0028,-0.0011
2024-02-12,33.62,53.92,392.64,175.79,86.87,149.14,33.62,53.92,392.64,175.79,...,2797200.0,8539300.0,7888000.0,1612700.0,0.0166,-0.0013,0.0218,0.0045,0.0114,0.0093
2024-02-13,32.75,52.76,378.75,174.26,83.97,145.26,32.75,52.76,378.75,174.26,...,3030800.0,8397600.0,11339400.0,2121400.0,-0.0259,-0.0215,-0.0354,-0.0087,-0.0334,-0.026
2024-02-14,33.13,53.98,378.04,176.03,84.0,147.87,33.13,53.98,378.04,176.03,...,2041300.0,7056700.0,5973700.0,1265300.0,0.0116,0.0231,-0.0019,0.0102,0.0004,0.018
2024-02-15,34.07,55.21,385.42,179.87,85.67,149.63,34.07,55.21,385.42,179.87,...,1780459.0,8718729.0,7981854.0,1991304.0,,,,,,


The easiest and safest way to add returns to the long data frame `stocks_long` is to `.stack()` the wide data frame `stocks`, which now includes returns.
We could instead sort `stocks_long` by ticker and date (to sort chronologically within each ticker), then use `.pct_change()`.
However, `.pct_change()` ignores ticker boundaries, so this approach miscalculates the first return for every ticker except the first: it divides that ticker's first price by the previous ticker's last price.
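
The pitfall of calling `.pct_change()` on sorted long data shows up clearly on toy data.
The sketch below uses made-up tickers and prices, and it also shows one possible fix for long data, grouping by ticker first, which is an alternative to the `.stack()` approach and not something the exercise requires.

```python
import pandas as pd

# Toy wide prices with made-up tickers
prices = pd.DataFrame(
    data={'AAA': [10.0, 11.0, 12.1], 'BBB': [100.0, 50.0, 55.0]},
    index=pd.date_range('2024-01-02', periods=3, name='Date')
).rename_axis(columns='Ticker')

# Long series sorted by ticker, then by date within each ticker
prices_long = prices.stack().swaplevel().sort_index()

# Pitfall: the first BBB "return" compares BBB's first price to AAA's last price
naive = prices_long.pct_change()

# Possible fix: compute returns within each ticker, so each ticker starts with NaN
grouped = prices_long.groupby(level='Ticker').pct_change()
```

In `naive`, the first `BBB` value is `100 / 12.1 - 1`, which is not a return on anything, while `grouped` correctly reports a missing value there.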

In [9]:
stocks_long = stocks.stack()

In [10]:
stocks_long.tail()

Unnamed: 0_level_0,Variable,Adj Close,Close,High,Low,Open,Volume,Returns
Date,Ticker,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
2024-02-15,C,55.21,55.21,55.48,54.135,54.22,16288325.0,
2024-02-15,GS,385.42,385.42,387.21,379.35,379.42,1780459.0,
2024-02-15,JPM,179.87,179.87,180.21,176.15,176.15,8718729.0,
2024-02-15,MS,85.67,85.67,86.23,84.41,84.45,7981854.0,
2024-02-15,PNC,149.63,149.63,150.2575,147.37,148.75,1991304.0,


### Download the daily benchmark return factors from Ken French's data library.

I rarely remember the exact combination of uppercase and lowercase letters, dashes, and underscores, so I use `pdr.famafrench.get_available_datasets()` to display the list of available names for data from Ken French's website.

In [11]:
pdr.famafrench.get_available_datasets()[:5]

['F-F_Research_Data_Factors',
 'F-F_Research_Data_Factors_weekly',
 'F-F_Research_Data_Factors_daily',
 'F-F_Research_Data_5_Factors_2x3',
 'F-F_Research_Data_5_Factors_2x3_daily']

In [12]:
ff = (
    pdr.DataReader(
        name='F-F_Research_Data_Factors_daily',
        data_source='famafrench',
        start='1900' # otherwise, pdr.DataReader defaults to the last five years of data
    )
    [0] # pdr.DataReader returns a dictionary of data frames
    .div(100) # French stores returns as percents, but our stock returns are decimals
)


In [13]:
ff.tail()

Unnamed: 0_level_0,Mkt-RF,SMB,HML,RF
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2023-12-22,0.0021,0.0064,0.0009,0.0002
2023-12-26,0.0048,0.0069,0.0046,0.0002
2023-12-27,0.0016,0.0014,0.0012,0.0002
2023-12-28,-0.0001,-0.0036,0.0003,0.0002
2023-12-29,-0.0043,-0.0112,-0.0037,0.0002


### Add the daily benchmark return factors to `stocks` and `stocks_long`.

For the wide data frame `stocks`, use the outer index name `Factors`.

For the wide data frame `stocks`, the simplest approach is to create a multi index with four columns, one for each of the four factors.

In [14]:
_ = pd.MultiIndex.from_product([['Factors'], ff.columns]) # I use _ as a temporary variable
stocks[_] = ff

In [15]:
stocks.tail()

Variable,Adj Close,Adj Close,Adj Close,Adj Close,Adj Close,Adj Close,Close,Close,Close,Close,...,Returns,Returns,Returns,Returns,Returns,Returns,Factors,Factors,Factors,Factors
Ticker,BAC,C,GS,JPM,MS,PNC,BAC,C,GS,JPM,...,BAC,C,GS,JPM,MS,PNC,Mkt-RF,SMB,HML,RF
Date,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
2024-02-09,33.07,53.99,384.26,175.01,85.89,147.77,33.07,53.99,384.26,175.01,...,-0.0015,-0.0055,-0.002,0.0012,0.0028,-0.0011,,,,
2024-02-12,33.62,53.92,392.64,175.79,86.87,149.14,33.62,53.92,392.64,175.79,...,0.0166,-0.0013,0.0218,0.0045,0.0114,0.0093,,,,
2024-02-13,32.75,52.76,378.75,174.26,83.97,145.26,32.75,52.76,378.75,174.26,...,-0.0259,-0.0215,-0.0354,-0.0087,-0.0334,-0.026,,,,
2024-02-14,33.13,53.98,378.04,176.03,84.0,147.87,33.13,53.98,378.04,176.03,...,0.0116,0.0231,-0.0019,0.0102,0.0004,0.018,,,,
2024-02-15,34.07,55.21,385.42,179.87,85.67,149.63,34.07,55.21,385.42,179.87,...,,,,,,,,,,


In [16]:
stocks_long = stocks_long.join(ff)

In [17]:
stocks_long.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Adj Close,Close,High,Low,Open,Volume,Returns,Mkt-RF,SMB,HML,RF
Date,Ticker,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
1973-02-21,BAC,1.5818,4.625,4.625,4.625,4.625,99200.0,,-0.0074,-0.0039,0.0054,0.0002
1973-02-22,BAC,1.5872,4.6406,4.6406,4.6406,4.6406,47200.0,0.0034,-0.003,-0.0037,0.0022,0.0002
1973-02-23,BAC,1.5818,4.625,4.625,4.625,4.625,133600.0,-0.0034,-0.0108,-0.0019,0.0054,0.0002
1973-02-26,BAC,1.5818,4.625,4.625,4.625,4.625,24000.0,0.0,-0.0088,-0.005,0.0054,0.0002
1973-02-27,BAC,1.5818,4.625,4.625,4.625,4.625,41600.0,0.0,-0.0115,-0.0018,0.0064,0.0002


In [18]:
stocks_long.loc['2023-12'].iloc[-12:]

Unnamed: 0_level_0,Unnamed: 1_level_0,Adj Close,Close,High,Low,Open,Volume,Returns,Mkt-RF,SMB,HML,RF
Date,Ticker,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
2023-12-28,BAC,33.88,33.88,33.97,33.77,33.82,21799600.0,0.0012,-0.0001,-0.0036,0.0003,0.0002
2023-12-28,C,51.0329,51.52,51.8,51.4,51.4,10218500.0,0.0012,-0.0001,-0.0036,0.0003,0.0002
2023-12-28,GS,386.41,386.41,387.76,383.63,384.52,1024700.0,0.005,-0.0001,-0.0036,0.0003,0.0002
2023-12-28,JPM,169.2563,170.3,170.66,169.0,169.35,6320100.0,0.0053,-0.0001,-0.0036,0.0003,0.0002
2023-12-28,MS,92.7316,93.64,93.95,93.24,93.31,4089500.0,-0.0002,-0.0001,-0.0036,0.0003,0.0002
2023-12-28,PNC,154.0486,155.63,156.21,154.95,155.44,1153300.0,0.0041,-0.0001,-0.0036,0.0003,0.0002
2023-12-29,BAC,33.67,33.67,33.99,33.55,33.94,28037800.0,-0.0062,-0.0043,-0.0112,-0.0037,0.0002
2023-12-29,C,50.9537,51.44,51.61,51.22,51.56,13147900.0,-0.0016,-0.0043,-0.0112,-0.0037,0.0002
2023-12-29,GS,385.77,385.77,386.64,383.57,385.57,881300.0,-0.0017,-0.0043,-0.0112,-0.0037,0.0002
2023-12-29,JPM,169.0575,170.1,170.69,169.63,170.0,6431800.0,-0.0012,-0.0043,-0.0112,-0.0037,0.0002


### Write a function `download()` that accepts tickers and returns a wide data frame of returns with the daily benchmark return factors.

We can even add a `shape` argument to return a wide or long data frame!

In [19]:
def download(tickers, shape='wide'):
    if shape.lower() not in ['wide', 'long']:
        raise ValueError('Invalid shape: must be "wide" or "long".')

    stocks = yf.download(tickers)

    factors_all = (
        pdr.DataReader(
            name='F-F_Research_Data_Factors_daily',
            data_source='famafrench',
            start='1900'
        )
    )
    
    factors = factors_all[0].div(100)
    
    # if stocks has a multi index on the columns because we asked for more than one ticker
    # then we have more work to do
    if isinstance(stocks.columns, pd.MultiIndex):
        
        # whether we want a wide or long data frame, it is easier to add returns to a wide data frame
        _ = pd.MultiIndex.from_product([['Returns'], stocks['Adj Close'].columns])
        stocks[_] = stocks['Adj Close'].pct_change()

        # if we want a wide data frame, then we need a factors multi index
        if shape.lower() == 'wide':
            _ = pd.MultiIndex.from_product([['Factors'], factors.columns])
            stocks[_] = factors
            return stocks.rename_axis(columns=['Variable', 'Ticker'])

        # if we want a long data frame, then we need to stack stocks and join factors
        # because we tested for wide and long above, we know that here shape must be long
        else:
            return stocks.stack().join(factors).rename_axis(columns=['Variable'], index=['Date', 'Ticker'])
            
    # if stocks does not have a multi index on the columns
    # then we only have to join factors
    else:
        # join the local factors instead of the global ff, so the function is self-contained
        return stocks.join(factors).rename_axis(columns=['Variable'])

In [20]:
download(tickers=['AAPL', 'TSLA'], shape='long')

[*********************100%%**********************]  2 of 2 completed


Unnamed: 0_level_0,Variable,Adj Close,Close,High,Low,Open,Volume,Returns,Mkt-RF,SMB,HML,RF
Date,Ticker,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
1980-12-12,AAPL,0.0992,0.1283,0.1289,0.1283,0.1283,469033600.0000,,0.0138,-0.0001,-0.0105,0.0006
1980-12-15,AAPL,0.0940,0.1217,0.1222,0.1217,0.1222,175884800.0000,-0.0522,0.0011,0.0025,-0.0046,0.0006
1980-12-16,AAPL,0.0871,0.1127,0.1133,0.1127,0.1133,105728000.0000,-0.0734,0.0071,-0.0075,-0.0047,0.0006
1980-12-17,AAPL,0.0893,0.1155,0.1161,0.1155,0.1155,86441600.0000,0.0248,0.0152,-0.0086,-0.0034,0.0006
1980-12-18,AAPL,0.0919,0.1189,0.1194,0.1189,0.1189,73449600.0000,0.0290,0.0041,0.0022,0.0126,0.0006
...,...,...,...,...,...,...,...,...,...,...,...,...
2024-02-13,TSLA,184.0200,184.0200,187.2600,182.1100,183.9900,86759500.0000,-0.0218,,,,
2024-02-14,AAPL,184.1500,184.1500,185.5300,182.4400,185.3200,54630500.0000,-0.0048,,,,
2024-02-14,TSLA,188.7100,188.7100,188.8900,183.3500,185.3000,81203000.0000,0.0255,,,,
2024-02-15,AAPL,183.8600,183.8600,184.4900,181.4000,183.5500,63936069.0000,-0.0016,,,,


### Download earnings per share for the stocks in `stocks` and combine to a long data frame `earnings`.

Use the `.earnings_dates` attribute of `yf.Ticker()` objects described [here](https://pypi.org/project/yfinance/).
Use `pd.concat()` to combine each of the `.earnings_dates` data frames and assign the result to a new data frame `earnings`.
Name the row indexes `Ticker` and `Date` and swap to match the order of the row index in `stocks_long`.
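
The `pd.concat()` step can be sketched without downloading anything.
The two small data frames below are hypothetical stand-ins for what `.earnings_dates` returns for each ticker; the tickers, dates, values, and the column names `EPS Estimate` and `Reported EPS` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for each ticker's earnings data frame,
# indexed by time-zone-aware announcement timestamps
eps_aaa = pd.DataFrame(
    data={'EPS Estimate': [1.00, 1.10], 'Reported EPS': [1.05, 1.08]},
    index=pd.to_datetime(['2023-10-17 07:00', '2024-01-16 07:00']).tz_localize('America/New_York')
)
eps_bbb = pd.DataFrame(
    data={'EPS Estimate': [2.00, 2.20], 'Reported EPS': [1.90, 2.30]},
    index=pd.to_datetime(['2023-10-18 16:30', '2024-01-17 16:30']).tz_localize('America/New_York')
)

earnings = (
    pd.concat(
        objs={'AAA': eps_aaa, 'BBB': eps_bbb},  # the dictionary keys become the outer index level
        names=['Ticker', 'Date']                # name both row index levels
    )
    .swaplevel()   # (Ticker, Date) to (Date, Ticker), like stocks_long
    .sort_index()
)
```

Passing a dictionary to `pd.concat()` is one way to label each piece with its ticker; a list of data frames with the `keys` argument should work the same way.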

### Combine `earnings` with the returns from `stocks_long`.

***It is easier to leave `stocks` and `stocks_long` as-is and work with slices `returns` and `returns_long`.***
Use the `.tz_localize('America/New_York')` method to add time zone information back to `returns.index` and use `pd.to_timedelta(16, unit='h')` to set the time to the market close in New York City.
Use [`pd.merge_asof()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge_asof.html) to match earnings announcement dates and times to appropriate return periods.
For example, if a firm announces earnings after the close at 5 PM on February 7, we want to match the return period from 4 PM on February 7 to 4 PM on February 8.
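
The matching logic in the example above can be sketched with `pd.merge_asof()` on toy data.
The ticker, returns, and surprises are made up, the `Return Date` column is an illustrative helper, and `direction='forward'` is one reasonable way to pick the first market close at or after each announcement, not necessarily the only one.

```python
import pandas as pd

# Toy daily returns for one made-up ticker
returns = pd.DataFrame({
    'Date': pd.to_datetime(['2024-02-06', '2024-02-07', '2024-02-08', '2024-02-09']),
    'Ticker': 'AAA',
    'Returns': [0.01, -0.02, 0.05, 0.00],
})

# Stamp each daily return at the 4 PM market close in New York City
returns['Date'] = (
    returns['Date'].dt.tz_localize('America/New_York')
    + pd.to_timedelta(16, unit='h')
)
returns['Return Date'] = returns['Date']  # keep a copy, because the merge keeps the left Date

# Toy announcements: one after the close on February 7, one before the open on February 9
earnings = pd.DataFrame({
    'Date': pd.to_datetime(['2024-02-07 17:00', '2024-02-09 07:00']).tz_localize('America/New_York'),
    'Ticker': 'AAA',
    'Surprise': [0.10, -0.05],
})

# For each announcement, find the first close at or after it, within the same ticker
matched = pd.merge_asof(
    left=earnings.sort_values('Date'),
    right=returns.sort_values('Date'),
    on='Date',
    by='Ticker',
    direction='forward'
)
```

The 5 PM announcement on February 7 matches the return stamped at 4 PM on February 8, and the 7 AM announcement on February 9 matches the return stamped at 4 PM that same day.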

### Plot the relation between daily returns and earnings surprises

Three options in increasing difficulty:

1. Scatter plot
1. Scatter plot with a best-fit line using `regplot()` from the seaborn package
1. Bar plot using `barplot()` from the seaborn package after using `pd.qcut()` to form five groups on earnings surprises
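
The grouping step behind the third option can be sketched on simulated data.
Everything here is an assumption for illustration: the simulated surprises and returns, the seed, the slope, and the column name `Surprise Group`.

```python
import numpy as np
import pandas as pd

# Simulated data: returns that rise with earnings surprises, plus noise
rng = np.random.default_rng(seed=42)
surprise = rng.normal(size=500)
toy = pd.DataFrame({
    'Surprise': surprise,
    'Returns': 0.01 * surprise + rng.normal(scale=0.02, size=500),
})

# Five equal-sized groups on surprises, labeled 1 (lowest) through 5 (highest)
toy['Surprise Group'] = pd.qcut(toy['Surprise'], q=5, labels=False) + 1

# Mean return in each group, which is what a bar plot of the groups would display
group_means = toy.groupby('Surprise Group')['Returns'].mean()
```

With seaborn installed, `barplot(data=toy, x='Surprise Group', y='Returns')` should plot these group means with error bars.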

### Repeat the earnings exercise with the S&P 100 stocks

### Repeat the earnings exercise with *excess returns* of the S&P 100 stocks

Excess returns are returns minus market returns.
We need to add a timezone and the closing time to the market return from Fama and French.
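
A minimal sketch of these two steps on toy data follows.
It assumes the market return is the sum of the `Mkt-RF` and `RF` factors, and the tickers, dates, and values are made up.

```python
import pandas as pd

dates = pd.date_range('2024-02-06', periods=3, name='Date')

# Toy factors in decimals, using the same column names as the French data
factors = pd.DataFrame(
    data={'Mkt-RF': [0.0100, -0.0050, 0.0020], 'RF': [0.0002, 0.0002, 0.0002]},
    index=dates
)

# Toy stock returns for two made-up tickers
returns = pd.DataFrame(
    data={'AAA': [0.0200, -0.0100, 0.0000], 'BBB': [0.0050, 0.0050, 0.0050]},
    index=dates
)

# Market return is the market excess return plus the risk-free rate
market = factors['Mkt-RF'] + factors['RF']

# Subtract the market return from every stock's return, matching on dates
excess = returns.sub(market, axis=0)

# Add the time zone and the 4 PM close, so the index can match announcement timestamps
excess.index = excess.index.tz_localize('America/New_York') + pd.to_timedelta(16, unit='h')
```

The same time zone and closing-time adjustment applies to `market` itself if the market return is kept as a separate column.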