# The Data Deluge: Alternative Data and Its Role in Modern Financial Analysis

In [None]:
# hide
%load_ext autoreload
%autoreload 2
%matplotlib inline

import numpy as np
import pandas as pd
from IPython.display import Image, display

In the previous section, we described hedge funds, and more generally the buy-side industry, as `consumers of data`. In practice, hedge funds have benefited from the exponential growth in the number of available datasets. We first describe the main classes of datasets that have become available to the industry. We then introduce some of the illustrative `toy` datasets that are used across the notebooks.

## Data deluge

There are now sensors everywhere in the physical world, and most online interactions are tracked, leading to a `data deluge` (e.g. see [Mary Meeker (2018)](http://www.kpcb.com/internet-trends) on internet trends or [Lori Lewis (2022)](https://www.allaccess.com/merge/archive/31294/infographic-what-happens-in-an-internet-minute)). On the one hand, the cost of sensors has dropped so much that they can be placed practically everywhere to record data from the physical world. On the other hand, the volume and velocity of online data have also exploded.

In [None]:
# hide
# sidebyside
display(
    Image("images/sensors.PNG", width=500),
    Image("images/internet_minute_2020.png", width=400),
)

### Data scouting

The article "Hedge funds see a gold rush in data mining" [FT (08/28/2017)](https://www.ft.com/content/d86ad460-8802-11e7-bf50-e1c239b45787) describes use cases associated with four types of alternative data:

- web traffic 

- credit card transactions

- geolocation

- satellite imaging 

In each of these cases, the main idea is to use alternative data to obtain a noisy but real-time forecast of company fundamentals (e.g. quarterly sales).
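
As a toy illustration of this idea, the sketch below simulates a card-spend panel that observes a small, noisy fraction of a company's quarterly sales and fits a linear map from panel spend to reported sales. Everything here is a synthetic assumption made for illustration (the names `true_sales`, `panel_share`, `panel_spend`, the growth and noise levels); none of it comes from the FT article.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical setup: 20 quarters of reported sales (in $m), growing with noise.
quarters = pd.period_range("2015Q1", periods=20, freq="Q")
true_sales = pd.Series(
    1000 * np.cumprod(1 + rng.normal(0.02, 0.03, len(quarters))), index=quarters
)

# Assumption: the card panel sees about 2% of sales, with multiplicative noise.
panel_share = 0.02
panel_spend = true_sales * panel_share * np.exp(rng.normal(0, 0.05, len(quarters)))

# Fit reported sales on panel spend using the first 16 quarters only.
train, test = slice(0, 16), slice(16, 20)
slope, intercept = np.polyfit(panel_spend.iloc[train], true_sales.iloc[train], deg=1)

# Nowcast the last four quarters from panel spend alone.
nowcast = intercept + slope * panel_spend.iloc[test]
abs_pct_error = (nowcast / true_sales.iloc[test] - 1).abs()
print(abs_pct_error.round(3))
```

The panel is available as soon as the quarter ends, whereas reported sales arrive weeks later; the price of that timeliness is the measurement noise in `panel_spend`.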

In [None]:
# hide
display(Image("images/ftalternativedata.png", width=700))

Predicting company fundamentals is only part of the story of what drives stock prices in financial markets. Understanding market participants, in particular retail traders, is also important, as shown in China [Bloomberg (05/15/2019)](https://www.bloomberg.com/news/articles/2019-05-15/quants-think-like-amateurs-in-the-world-s-wildest-stock-market) or in the US with the GameStop saga [FT (01/29/2021)](https://www.ft.com/content/04477ee8-0af2-4f0f-a331-2987444892c3).

In [None]:
# hide
# sidebyside
display(
    Image("images/bloomberg_investing_in_china.png", width=600),
    Image("images/game_stop.png", width=600),
)

Recent examples show even more sophistication and creativity in sourcing new data, e.g. FT (09/13/2024): "How SEC mobile phones can signal an imminent stock price drop."

In [None]:
# hide
display(Image("images/ft_sec_as_data.png", width=600))

### ML research

ML research has become a race, with new ideas coming out at an increasing pace, as illustrated by the number of papers published on the scientific paper repository arxiv.org ([Jeff Dean (06/02/2019)](https://twitter.com/JeffDean/status/1135114657344237568)).

In [None]:
# hide
display(Image("images/mlarxiv2.png", width=600))

The success of deep learning depends on i) model capacity, ii) computational power, and iii) dataset size. [Sun, Shrivastava, Singh, Gupta (2017)](https://arxiv.org/abs/1707.02968) note that, while model capacity and computational power kept growing, the size of the largest dataset had remained roughly constant in the years preceding their study.

In [None]:
# hide
# sidebyside
display(
    Image("images/unreasonableEffectiveness.PNG", width=300),
    Image("images/unreasonableEffectiveness2.PNG", width=300),
)

Here are several trends from Stanford's AI Index report:

- The training costs for cutting-edge AI models have skyrocketed, with OpenAI's GPT-4 and Google's Gemini Ultra costing \\$78m and \\$191m, respectively.
- AI projects on GitHub have surged from 845 in 2011 to about 1.8m in 2023, with a 59.3% increase in 2023 alone.
- As language models have rapidly advanced, surpassing human performance on numerous benchmarks, there has been a growing need for more robust and wide-ranging evaluation methods, with the Holistic Evaluation of Language Models (HELM) providing a comprehensive framework.
- Introduced by OpenAI researchers in 2021, HumanEval is a benchmark for assessing the coding abilities of AI systems. At the time of the report, AgentCoder, a variant of the GPT-4 model, led HumanEval performance.

In [None]:
# hide
display(Image("images/nlp_model_size2.png", width=800))

In [None]:
# hide
display(Image("images/nlp_model_cost.png", width=800))

In [None]:
# hide
display(Image("images/ai_understanding.png", width=800))

In [None]:
# hide
display(Image("images/ai_adoption.png", width=800))

## Datasets

Across the notebooks, we use several sources of data:

- Ken French's Financial data library

- Berkshire Hathaway financials 

- Stock returns 

- Federal Open Market Committee (FOMC) statements

- the sentiment dictionary created by Loughran and McDonald

- company filings with the U.S. Securities and Exchange Commission (SEC), known as 10-Qs and 10-Ks

- Long-term stock market data put together by Amit Goyal

### Data helper functions

In this section, we use the Jupyter cell magic `%%writefile` to write the content of a cell to a file and thereby build a module of data helper functions.

In [None]:
%%writefile ../skfin/dataloaders/loaders.py
import logging
import subprocess
import sys
from pathlib import Path
from typing import Dict

import pandas as pd
from tqdm.auto import tqdm

from skfin.dataloaders.cache import CacheManager
from skfin.dataloaders.web_utils import WebUtils
from skfin.dataloaders.cleaners import DataCleaner
from skfin.dataloaders.fomc import FomcUtils
from skfin.dataloaders.constants.mappings import symbol_dict
from skfin.dataloaders.io_utils import _download_file_safely

logging.basicConfig(stream=sys.stdout, level=logging.CRITICAL)
logger = logging.getLogger(__name__)


class DatasetLoader:
    """Class for loading various financial datasets."""

    def __init__(self, cache_dir: str = "data"):
        self.cache_manager = CacheManager(cache_dir)
        self.logger = logging.getLogger(__name__)

    def load_kf_returns(
        self, filename: str = "12_Industry_Portfolios", force_reload: bool = False
    ) -> Dict:
        """
        Load Ken French return data.

        Args:
            filename: Name of the data file to load
            force_reload: If True, ignore cache and reload data

        Returns:
            Dictionary of return data
        """
        if filename == "12_Industry_Portfolios":
            skiprows, multi_df = 11, True
        elif filename == "F-F_Research_Data_Factors":
            skiprows, multi_df = 3, False
        elif filename == "F-F_Momentum_Factor":
            skiprows, multi_df = 13, False
        elif filename == "F-F_Research_Data_Factors_daily":
            skiprows, multi_df = 4, False
        else:
            skiprows, multi_df = 11, True

        def loader_func():
            path = (
                "http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/"
                + filename
                + "_CSV.zip"
            )
            files = WebUtils.download_zip_content(path)

            df = pd.read_csv(
                files.open(filename + ".csv"), skiprows=skiprows, index_col=0
            )
            if "daily" in filename:
                return {
                    "Daily": df.iloc[:-1].pipe(
                        lambda x: x.set_index(pd.to_datetime(x.index))
                    )
                }
            else:
                return DataCleaner.clean_kf_dataframes(df, multi_df=multi_df)

        return self.cache_manager.get_cached_dataframe(
            filename=Path(filename), loader_func=loader_func, force_reload=force_reload
        )

    def load_buffets_data(self, force_reload: bool = False) -> pd.DataFrame:
        """
        Load Buffett's portfolio data.

        Args:
            force_reload: If True, ignore cache and reload data

        Returns:
            DataFrame containing Buffett's portfolio data
        """

        def loader_func():
            path = "https://github.com/slihn/buffetts_alpha_R/archive/master.zip"
            files = WebUtils.download_zip_content(path)

            df = pd.read_csv(
                files.open("buffetts_alpha_R-master/ffdata_brk13f.csv"), index_col=0
            )
            df.index = pd.to_datetime(df.index, format="%m/%d/%Y")
            return df

        return self.cache_manager.get_cached_dataframe(
            filename=Path("ffdata_brk13f.parquet"),
            loader_func=loader_func,
            force_reload=force_reload,
        )

    def load_sklearn_stock_returns(self, force_reload: bool = False) -> pd.DataFrame:
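        """
        Load daily stock price changes from the scikit-learn example data.

        Note: the values are close-to-close price differences (`diff`),
        not percentage returns.

        Args:
            force_reload: If True, ignore cache and reload data

        Returns:
            DataFrame of daily price changes, one column per symbol
        """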

        def loader_func():
            url = "https://raw.githubusercontent.com/scikit-learn/examples-data/master/financial-data"
            df = (
                pd.concat(
                    {
                        c: pd.read_csv(f"{url}/{c}.csv", index_col=0, parse_dates=True)[
                            "close"
                        ].diff()
                        for c in symbol_dict.keys()
                    },
                    axis=1,
                )
                .asfreq("B")
                .iloc[1:]
            )
            return df

        return self.cache_manager.get_cached_dataframe(
            filename=Path("sklearn_returns.parquet"),
            loader_func=loader_func,
            force_reload=force_reload,
        )

    def load_fomc_statements(
        self,
        add_url: bool = True,
        force_reload: bool = False,
        progress_bar: bool = False,
        from_year: int = 1999,
    ) -> pd.DataFrame:
        """
        Load FOMC statements.

        Args:
            add_url: If True, adds URLs to the output
            force_reload: If True, ignore cache and reload data
            progress_bar: If True, displays a progress bar during loading
            from_year: Year from which to load data

        Returns:
            DataFrame containing FOMC statements
        """

        def loader_func():
            urls = FomcUtils.get_fomc_urls(from_year=from_year)
            if progress_bar:
                urls_ = tqdm(urls)
            else:
                urls_ = urls

            corpus = [
                DataCleaner.bs_cleaner(
                    WebUtils.parse_html(WebUtils.get_response(url).text)
                )
                for url in urls_
            ]

            statements = FomcUtils.feature_extraction(corpus).set_index("release_date")
            if add_url:
                statements = statements.assign(url=urls)
            return statements.sort_index()

        return self.cache_manager.get_cached_dataframe(
            filename=Path("fomc_statements.parquet"),
            loader_func=loader_func,
            force_reload=force_reload,
        )

    def load_loughran_mcdonald_dictionary(
        self, filename: str = None, force_reload: bool = False
    ) -> pd.DataFrame:
        """
        Load the Loughran-McDonald dictionary.

        Args:
            filename: Custom filename to use
            force_reload: If True, ignore cache and reload data

        Returns:
            DataFrame containing the dictionary data
        """
        if filename is None:
            filename = "Loughran-McDonald_MasterDictionary_1993-2021.csv"
        filename = Path(filename)

        def loader_func():
            # Google Drive file identifier (renamed to avoid shadowing the built-in `id`)
            file_id = "17CmUZM9hGUdGYjCXcjQLyybjTrcjrhik"
            url = f"https://docs.google.com/uc?export=download&confirm=t&id={file_id}"
            filepath = self.cache_manager.cache_dir / filename

            _download_file_safely(
                url=url,
                filepath=filepath,
                manual_url="https://sraf.nd.edu/loughran-mcdonald-master-dictionary/"
            )

            return pd.read_csv(filepath)

        return self.cache_manager.get_cached_dataframe(
            filename=filename, loader_func=loader_func, force_reload=force_reload
        )


    def load_10X_summaries(
        self, filename: str = None, force_reload: bool = False
    ) -> pd.DataFrame:
        """
        Load 10-X summaries.

        Args:
            filename: Custom filename to use
            force_reload: If True, ignore cache and reload data

        Returns:
            DataFrame containing 10-X summaries
        """
        if filename is None:
            filename = "Loughran-McDonald_10X_Summaries_1993-2021.csv"
        filename = Path(filename)

        def loader_func():
            # Google Drive file identifier (renamed to avoid shadowing the built-in `id`)
            file_id = "1CUzLRwQSZ4aUTfPB9EkRtZ48gPwbCOHA"
            url = f"https://docs.google.com/uc?export=download&confirm=t&id={file_id}"
            filepath = self.cache_manager.cache_dir / filename

            _download_file_safely(
                url=url,
                filepath=filepath,
                manual_url="https://sraf.nd.edu/sec-edgar-data/lm_10x_summaries/"
            )

            return pd.read_csv(filepath)

        df = self.cache_manager.get_cached_dataframe(
            filename=filename,
            loader_func=loader_func,
            force_reload=force_reload,
        )
        return df.assign(
            date=lambda x: pd.to_datetime(x.FILING_DATE, format="%Y%m%d")
        ).set_index("date")

    def load_ag_features(
        self,
        filename: str = None,
        sheet_name: str = "Monthly",
        force_reload: bool = False,
    ) -> pd.DataFrame:
        """
        Load Amit Goyal's characteristics data.

        Args:
            filename: Custom filename to use
            sheet_name: Name of the sheet to load
            force_reload: If True, ignore cache and reload data

        Returns:
            DataFrame containing characteristic data
        """
        if filename is None:
            filename = "Data2024.xlsx"
        filename = Path(filename)

        def loader_func():
            # Google Sheets file identifier (renamed to avoid shadowing the built-in `id`)
            file_id = "10_nkOkJPvq4eZgNl-1ys63PzhbnM3S2y"
            url = f"https://docs.google.com/spreadsheets/d/{file_id}/export?format=xlsx"
            filepath = self.cache_manager.cache_dir / filename

            _download_file_safely(
                url=url,
                filepath=filepath,
                manual_url="https://sites.google.com/view/agoyal145/data-library"
            )

            return pd.read_excel(filepath, sheet_name=sheet_name)

        df = self.cache_manager.get_cached_dataframe(
            filename=filename,
            loader_func=loader_func,
            force_reload=force_reload,
            sheet_name=sheet_name,
        )
        return df.assign(
            date=lambda x: pd.to_datetime(x.yyyymm, format="%Y%m")
        ).set_index("date")

In [None]:
%%writefile ../skfin/datasets_.py
import warnings
import sys
import logging
from typing import Dict, Optional

import pandas as pd

from skfin.dataloaders import DatasetLoader

warnings.filterwarnings("ignore", category=UserWarning, module="openpyxl")

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logger = logging.getLogger(__name__)


def load_kf_returns(
    filename: str = "12_Industry_Portfolios",
    force_reload: bool = False,
    cache_dir: Optional[str] = "data",
) -> Dict:
    """Load Ken French return data."""
    loader = DatasetLoader(cache_dir=cache_dir)
    return loader.load_kf_returns(filename, force_reload)


def load_buffets_data(
    force_reload: bool = False, cache_dir: Optional[str] = "data"
) -> pd.DataFrame:
    """Load Buffett's portfolio data."""
    loader = DatasetLoader(cache_dir=cache_dir)
    return loader.load_buffets_data(force_reload)


def load_sklearn_stock_returns(
    force_reload: bool = False, cache_dir: Optional[str] = "data"
) -> pd.DataFrame:
    """Load stock returns data from scikit-learn."""
    loader = DatasetLoader(cache_dir=cache_dir)
    return loader.load_sklearn_stock_returns(force_reload)


def load_fomc_statements(
    add_url: bool = True,
    force_reload: bool = False,
    progress_bar: bool = False,
    from_year: int = 1999,
    cache_dir: Optional[str] = "data",
) -> pd.DataFrame:
    """Load FOMC statements."""
    loader = DatasetLoader(cache_dir=cache_dir)
    return loader.load_fomc_statements(add_url, force_reload, progress_bar, from_year)


def load_loughran_mcdonald_dictionary(
    force_reload: bool = False,
    cache_dir: Optional[str] = "data",
    filename: Optional[str] = None,
) -> pd.DataFrame:
    """Load the Loughran-McDonald dictionary."""
    loader = DatasetLoader(cache_dir=cache_dir)
    return loader.load_loughran_mcdonald_dictionary(
        force_reload=force_reload, filename=filename
    )


def load_10X_summaries(
    force_reload: bool = False,
    cache_dir: Optional[str] = "data",
    filename: Optional[str] = None,
) -> pd.DataFrame:
    """Load 10-X summaries."""
    loader = DatasetLoader(cache_dir=cache_dir)
    return loader.load_10X_summaries(force_reload=force_reload, filename=filename)


def load_ag_features(
    sheet_name: str = "Monthly",
    force_reload: bool = False,
    cache_dir: Optional[str] = "data",
    filename: Optional[str] = None,
) -> pd.DataFrame:
    """Load Amit Goyal's characteristics data."""
    loader = DatasetLoader(cache_dir=cache_dir)
    return loader.load_ag_features(
        sheet_name=sheet_name, force_reload=force_reload, filename=filename
    )


### Ken French data: industry returns

In [None]:
from skfin.datasets_ import (
    load_buffets_data,
    load_kf_returns,
    load_sklearn_stock_returns,
)

import logging 
logging.getLogger("skfin.dataloaders.cache").setLevel(level=logging.INFO)

In [None]:
%%time
returns_data = load_kf_returns(filename="12_Industry_Portfolios", force_reload=True)

Reloading from a cache directory is faster!

In [None]:
%%time
returns_data = load_kf_returns(filename="12_Industry_Portfolios", force_reload=False)
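
The `CacheManager` behind these calls is not shown in this section. A minimal sketch of the load-or-compute pattern it presumably follows is given below; the function name `get_cached`, the pickle storage format and the toy loader are assumptions made for illustration, not the `skfin` implementation.

```python
import tempfile
from pathlib import Path
from typing import Callable

import pandas as pd


def get_cached(
    path: Path, loader_func: Callable[[], pd.DataFrame], force_reload: bool = False
) -> pd.DataFrame:
    """Return the DataFrame stored at `path`, calling `loader_func` only if needed."""
    if path.exists() and not force_reload:
        return pd.read_pickle(path)
    df = loader_func()
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_pickle(path)  # persist so the next call can skip the loader
    return df


calls = []


def slow_loader() -> pd.DataFrame:
    calls.append(1)  # count how often the expensive download would run
    return pd.DataFrame({"ret": [0.01, -0.02, 0.03]})


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "returns.pkl"
    first = get_cached(cache_file, slow_loader)  # cache miss: runs the loader
    second = get_cached(cache_file, slow_loader)  # cache hit: read from disk
    third = get_cached(cache_file, slow_loader, force_reload=True)  # forced rerun

print(len(calls))
```

Passing the loader as a function means the download code only executes on a cache miss or when `force_reload=True`, which is why the second timed call above is faster.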

In [None]:
returns_data_SMB_HML = load_kf_returns(filename="F-F_Research_Data_Factors")

In [None]:
returns_data_MOM = load_kf_returns(filename="F-F_Momentum_Factor")

In [None]:
returns_data_DAILY = load_kf_returns(
    filename="F-F_Research_Data_Factors_daily", force_reload=True
)["Daily"]

### Stock returns (2003-2007)

In [None]:
%%time
returns_data = load_sklearn_stock_returns(force_reload=True)

In [None]:
from skfin.dataloaders.constants.mappings import symbol_dict
from skfin.metrics import sharpe_ratio
from skfin.plot import bar

In [None]:
start_date, end_date = (
    returns_data.index[0].strftime("%Y-%m-%d"),
    returns_data.index[-1].strftime("%Y-%m-%d"),
)

df = (
    returns_data.pipe(sharpe_ratio)
    .rename(index=symbol_dict)
    .sort_values()
    .pipe(lambda x: pd.concat([x.head(), x.tail()]))
)
bar(
    df,
    horizontal=True,
    title=f"Annualized stock Sharpe ratio: {start_date} to {end_date}",
)
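
`sharpe_ratio` is imported from `skfin.metrics` and its definition is not shown here. A common convention, assumed in the sketch below, is the mean of the series divided by its standard deviation, scaled by the square root of the number of periods per year (about 260 for business-day data). The function name `annualized_sharpe` and the synthetic data are illustrative only.

```python
import numpy as np
import pandas as pd


def annualized_sharpe(returns, periods_per_year: int = 260):
    """Mean over standard deviation, scaled to an annual horizon."""
    return returns.mean() / returns.std() * np.sqrt(periods_per_year)


# Synthetic business-day series for two hypothetical assets.
rng = np.random.default_rng(42)
dates = pd.bdate_range("2003-01-01", periods=1000)
toy = pd.DataFrame(
    {
        "A": rng.normal(0.0005, 0.01, len(dates)),
        "B": rng.normal(0.0, 0.02, len(dates)),
    },
    index=dates,
)

# Applied to a DataFrame, the function returns one ratio per column.
print(annualized_sharpe(toy).round(2))
```

Because the ratio divides a mean by a standard deviation, it is unchanged when a series is multiplied by a positive constant; only the annualization factor depends on the sampling frequency.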

### 13F Berkshire Hathaway

In [None]:
%%time
df = load_buffets_data(force_reload=True)

### FOMC Statements 

In [None]:
from skfin.datasets_ import load_fomc_statements

In [None]:
%%time
statements = load_fomc_statements(force_reload=True)

In [None]:
%%time
statements = load_fomc_statements(force_reload=False)

### Loughran-McDonald sentiment dictionary and regulatory filing summaries

In [None]:
from skfin.datasets_ import load_loughran_mcdonald_dictionary, load_10X_summaries

In [None]:
%%time
filing_summaries = load_10X_summaries(force_reload=False)

In [None]:
%%time
lm = load_loughran_mcdonald_dictionary(force_reload=False, filename="Loughran-McDonald_MasterDictionary_1993-2021.csv")
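
To give a sense of how such a dictionary is typically used, the sketch below scores a sentence by counting positive and negative words. The tiny word list and the column layout (`Word`, `Positive`, `Negative`, with a nonzero entry flagging membership) are assumptions for illustration and should be checked against the downloaded file; `net_sentiment` is a hypothetical helper.

```python
import re

import pandas as pd

# Toy stand-in for the dictionary (assumed layout, not the real file).
toy_dict = pd.DataFrame(
    {
        "Word": ["GAIN", "STRONG", "LOSS", "WEAK", "IMPAIRMENT"],
        "Positive": [2009, 2009, 0, 0, 0],
        "Negative": [0, 0, 2009, 2009, 2009],
    }
)
positive = set(toy_dict.loc[toy_dict["Positive"] != 0, "Word"])
negative = set(toy_dict.loc[toy_dict["Negative"] != 0, "Word"])


def net_sentiment(text: str) -> float:
    """(positive - negative) / (positive + negative) word counts, 0.0 if none match."""
    tokens = re.findall(r"[A-Za-z]+", text.upper())
    n_pos = sum(t in positive for t in tokens)
    n_neg = sum(t in negative for t in tokens)
    total = n_pos + n_neg
    return (n_pos - n_neg) / total if total else 0.0


print(net_sentiment("Strong demand offset a weak quarter and an impairment loss."))
```

The example sentence contains one positive word and three negative words from the toy list, so the score is -0.5.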

### Goyal long-term market data

In [None]:
from skfin.datasets_ import load_ag_features

In [None]:
%%time
ag = load_ag_features(force_reload=False)