**If you lost points on the last checkpoint you can get them back by responding to TA/IA feedback**  

Update/change the relevant sections where you lost those points, make sure you respond on GitHub Issues to your TA/IA to call their attention to the changes you made here.

Please update your Timeline... no battle plan survives contact with the enemy, so make sure we understand how your plans have changed.

# COGS 108 - Data Checkpoint

# Names

- Scott Benninger
- Ruben Melikyan
- Ethan Fastovsky
- Noah Cramer
- Shion Okino

# Research Question

Which of the following macroeconomic factors (aggregate consumption, unemployment, or CPI) has the strongest correlation with stock market volatility, as measured by percent change in aggregate stock market returns, in the US from January 1, 2015 to January 1, 2024?



## Background and Prior Work


Since a couple of us are economics majors with aspirations in finance, we are particularly interested in how macroeconomic indicators influence stock market dynamics, and we decided to take our project in that direction. Through this project, we aim to deepen our knowledge of the economic forces that drive stock market volatility. The project aligns with our academic background and prepares us for future roles in finance, where understanding these relationships is essential for informed decision-making.

Stock market volatility is influenced by a complex interplay of economic forces, and understanding these relationships has been a longstanding focus of economic and financial research. Macroeconomic indicators such as unemployment, inflation, and interest rates serve as signals of economic health, affecting market stability and investor sentiment. Recent work recognized by the Nobel Prize in Economic Sciences highlights the role these indicators play in national economic cycles, exploring how fluctuations in key factors can signal periods of growth or downturn. One notable exploration of these dynamics is found in Why Nations Rise and Fall, which underscores how these economic indicators not only track but often predict shifts in stability and growth.

Beyond this, a couple of us previously read part of Fama’s 1981 paper, Stock Returns, Real Activity, Inflation, and Money, in an economics course. The paper emphasizes the importance of understanding how key economic indicators affect stock market performance. Fama’s research found that inflation, in particular, has a significant inverse relationship with stock returns: higher inflation diminishes purchasing power and investor confidence, often leading to market volatility and downturns. This finding is highly relevant to our project, as it illustrates the role of macroeconomic indicators, such as inflation and unemployment, in shaping financial markets. By examining unemployment as a potential primary predictor of stock market volatility, we aim to build on Fama’s insights and analyze whether this variable holds unique predictive power relative to other indicators, such as the Consumer Price Index (CPI), during different economic cycles.

Probably most directly useful is Chen, Roll, and Ross's (1986) paper, Economic Forces and the Stock Market, which examined how macroeconomic factors, including unemployment, influence stock market returns. They found that rising unemployment often signals reduced consumer spending and lower corporate earnings, which can drive stock prices down due to weakened investor confidence. Drawing on these findings and their analysis of other variables, our project will investigate whether unemployment alone is a leading predictor of stock market volatility, or whether other indicators, such as the CPI, play a more significant role.

The most valuable insight from this paper is their use of an aggregated time series of stock returns as the variable to be explained. Initially, we considered using the VIX index, but after reviewing this study, we realized that an index based on market sentiment and perception might not be as suitable. The S&P 500, as a broad measure of stock market performance, offers a more direct assessment of market movements. The authors emphasize that macroeconomic data is often smoothed and averaged, making it challenging to capture immediate market impacts, whereas stock prices respond rapidly to new information. This difference means that stock market returns may show only a weak and noisy relationship with innovations in macroeconomic factors. Because the S&P 500 responds to economic changes in a timely way, it is an effective choice for our analysis.

Methodologically, the paper uses a combination of time-series and cross-sectional regression techniques to analyze the relationship between stock returns and macroeconomic variables. The authors employ time-series analysis to identify unanticipated changes (or "innovations") in economic factors like industrial production, inflation, and risk premiums. They then estimate stock return sensitivities to these innovations using historical data, applying cross-sectional regressions to test the relationship between these sensitivities and subsequent asset returns. Additionally, they use the Fama-MacBeth procedure, in which they calculate average risk premiums across multiple periods and test their significance with t-tests, allowing them to assess whether systematic economic factors significantly affect expected stock returns. We can apply a similar approach by using time-series analysis to capture unexpected changes in macroeconomic factors and then using regressions to examine how these innovations relate to stock market volatility, helping us identify which factors are most predictive.
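
As a rough illustration of the "innovations, then regression" idea described above, the following sketch fits an AR(1) model to a macro series, treats the residuals as innovations, and regresses returns on them. It is not the procedure from Chen, Roll, and Ross; the data is synthetic, and the helper names `ar1_innovations` and `ols_slope` are hypothetical names chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic monthly macro series following an AR(1) process (illustration only)
n = 120
shocks = rng.normal(size=n)
macro = np.zeros(n)
for t in range(1, n):
    macro[t] = 0.8 * macro[t - 1] + shocks[t]

# Synthetic returns that load on the macro shocks plus independent noise
returns = 0.5 * shocks + rng.normal(scale=0.5, size=n)


def ar1_innovations(x):
    """Residuals from regressing x[t] on a constant and x[t-1]."""
    X = np.column_stack([np.ones(len(x) - 1), x[:-1]])
    coef, *_ = np.linalg.lstsq(X, x[1:], rcond=None)
    return x[1:] - X @ coef


def ols_slope(x, y):
    """Slope from regressing y on a constant and x."""
    X = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]


innov = ar1_innovations(macro)
# The first observation has no lagged value, so returns are aligned from t = 1
beta = ols_slope(innov, returns[1:])
print(f"estimated sensitivity of returns to macro innovations: {beta:.2f}")
```

With real data, the AR(1) step would be replaced by whatever forecasting model suits each series; the point here is only that the regression uses the unexpected part of the macro variable, not its level.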


# Hypothesis


We hypothesize that a regression analysis will reveal that among aggregate consumption, unemployment, and CPI, the unemployment rate will have the strongest correlation with stock market volatility, measured by percent change in aggregate stock market returns, in the United States from January 1, 2015 to January 1, 2024. Historically, high unemployment rates are associated with economic downturns, which tend to increase stock market volatility as investor uncertainty rises. Unemployment data is also a closely monitored indicator for forecasting potential recessions and expansions, guiding investor decisions on buying and selling. Therefore, we expect that fluctuations in the unemployment rate will be a significant predictor of stock market volatility, signaling broader economic shifts and investor sentiment.

# Data

## Data overview

Dataset #1

Percent Change, Daily, Not Seasonally Adjusted, Vintage: Current 2014-11-13 to 2024-11-12 (2024-11-12)

Dataset Name: FRED Economic Data St Louis Fed S&P 500

Link to the dataset: https://fred.stlouisfed.org/graph/?g=lln 

https://fred.stlouisfed.org/seriesBeta/SP500 

Number of observations: 2610

Number of variables: 2


Dataset #2 

Dataset Name: FRED Economic Data St Louis Fed Unemployment Rate

Link to the dataset: https://fred.stlouisfed.org/seriesBeta/UNRATE 

Number of observations: 109

Number of variables: 2


Dataset #3

Dataset Name: FRED Economic Data St Louis Fed Aggregate Consumption Rate

Link to the dataset: https://fred.stlouisfed.org/series/PCECA  

Number of observations: 42

Number of variables: 4


Dataset #4

Dataset Name: Core CPI seasonally adjusted 1995 - 2024

Link to dataset: https://datacatalog.worldbank.org/search/dataset/0037798/Global-Economic-Monitor

Number of observations: 29

Number of variables: 57



Dataset #1: 
FRED Economic Data - S&P 500 Percent Change
This dataset contains daily percentage changes in the S&P 500 index over a ten-year period, from November 13, 2014, to November 12, 2024, with 2,610 observations. The primary variable, "Percent Change," reflects the daily price movement of the S&P 500; it is a float measured in percent and is not seasonally adjusted. This variable is crucial for assessing short-term market volatility and understanding trends in investor sentiment and economic conditions. Data wrangling steps include converting the time column to a datetime object, checking for and handling missing data, and aggregating the daily percent changes to monthly values so that they match the frequency of our other series. Because the data extends beyond the dates needed for our analysis, we also filter to the relevant years.


Dataset #2: 
FRED Economic Data - Unemployment Rate
This dataset includes 109 monthly observations of the U.S. unemployment rate, provided by the St. Louis Fed’s FRED database. The dataset consists of two variables: "Date" and "Unemployment Rate," where the latter (float data type) represents the percentage of the labor force that is unemployed and actively seeking work. This rate is a vital economic indicator, serving as a proxy for labor market conditions and broader economic health. Planned data wrangling steps are converting the time column to a datetime object with pandas and transforming the rate into period-over-period percent changes for time series analysis.
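
The change transformation mentioned above can be computed in more than one way. The sketch below uses the first five 2015 unemployment values that appear in this notebook's unemployment output and contrasts the change in percentage points (`diff`) with the relative percent change (`pct_change`). Which of the two suits the final analysis is an open choice; the column names `pp_change` and `pct_change` are illustrative.

```python
import pandas as pd

# First five monthly UNRATE values for 2015, as displayed in this notebook
unrate = pd.Series(
    [5.7, 5.5, 5.4, 5.4, 5.6],
    index=pd.to_datetime(
        ["2015-01-01", "2015-02-01", "2015-03-01", "2015-04-01", "2015-05-01"]
    ),
    name="UNRATE",
)

changes = pd.DataFrame({
    "pp_change": unrate.diff(),                # change in percentage points
    "pct_change": unrate.pct_change() * 100,   # relative change, in percent
})
print(changes.round(2))
```

Because the unemployment rate is already a percentage, the two measures differ in scale and interpretation: a move from 5.7 to 5.5 is a 0.2 percentage point drop but roughly a 3.5 percent relative decline.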

Dataset #3: FRED Economic Data - Aggregate Consumption Rate
This dataset from the St. Louis Fed’s FRED database tracks aggregate consumption in the U.S. The overview above lists 42 quarterly observations and four variables, whereas the `PCE.csv` file we load below is monthly and has two columns: "DATE" (the observation month) and "PCE" (float), the level of total personal consumption expenditures, which FRED reports in billions of dollars rather than as a percentage of GDP. Aggregate consumption is a key economic indicator, reflecting consumer behavior trends and its contribution to overall economic growth. Preprocessing steps involve converting the time variable into a datetime object, converting values to consistent units or types, and handling any missing or outlier data points for reliable trend analysis.

Dataset #4: Core CPI Seasonally Adjusted (1995-2024)
This dataset provides seasonally adjusted Core Consumer Price Index (CPI) data from 1995 to 2024, sourced from the World Bank's Global Economic Monitor, covering 29 years with 57 variables (float). The series we load below is indexed by month. Core CPI measures inflation by tracking the price changes of goods and services, excluding volatile categories like food and energy, making it a critical indicator for understanding underlying inflation trends. The variables are a date column and the Core CPI for each region or country. Data wrangling involves reshaping the dataset for time series analysis, handling any missing values, converting the time column to a datetime object, and removing the columns for other countries, as our analysis is limited to the US.


Plan for Combining Datasets:
The datasets we downloaded from FRED gave us the flexibility to choose exact date ranges and units of measurement. We therefore merge the datasets at monthly frequency for a smoother time series analysis. The World Bank data we use is also monthly, so once each time column is converted to datetime objects and aligned to the start of the month, the datasets can be merged directly on the time column.


In [1]:
# import needed packages
import pandas as pd
from pathlib import Path
import numpy as np

## S&P 500 data

In [2]:
# read in data
fp = Path('data') / 'SP500.csv'
sp500 = pd.read_csv(fp)

# Ensure the DATE column is in datetime format
sp500['DATE'] = pd.to_datetime(sp500['DATE'])

# Filter the data for dates between 2015-01-01 and 2024-01-01
sp500 = sp500[(sp500['DATE'] >= '2015-01-01') & (sp500['DATE'] <= '2024-01-01')]

# Reset the index after filtering
sp500 = sp500.reset_index(drop=True)

# Coerce the SP500 column to numeric; non-numeric entries become NaN
sp500['SP500'] = pd.to_numeric(sp500['SP500'], errors='coerce')

# Drop rows where 'SP500' is NaN (i.e., non-numeric rows)
sp500 = sp500.dropna(subset=['SP500'])

# Calculate daily returns (percent change)
sp500['Daily_Return'] = sp500['SP500'].pct_change() * 100  # daily returns in percentage

# Drop the first row, as it will contain NaN due to the percent change calculation
sp500 = sp500.dropna(subset=['Daily_Return'])

# Aggregate daily returns to monthly returns by summing them.
# Summing daily percent changes approximates the monthly return; it ignores compounding.
# Note: the 'ME' (month-end) alias requires pandas 2.2 or newer.
monthly_sp500 = sp500.resample('ME', on='DATE')['Daily_Return'].sum().reset_index()

# Display the monthly data
monthly_sp500
                    

Unnamed: 0,DATE,Daily_Return
0,2015-01-31,-3.007878
1,2015-02-28,5.384284
2,2015-03-31,-1.668783
3,2015-04-30,0.877221
4,2015-05-31,1.087943
...,...,...
103,2023-08-31,-1.720630
104,2023-09-30,-4.943132
105,2023-10-31,-2.138358
106,2023-11-30,8.608871
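
The monthly figures above are sums of daily percent changes, which only approximate the true monthly return. The sketch below compares summing with compounding. The price levels are hypothetical values chosen to make the arithmetic easy to follow, not FRED data, and the grouping uses `dt.to_period("M")` instead of `resample` so that it does not depend on the pandas version.

```python
import pandas as pd

# Hypothetical closing levels spanning two months (illustrative values only)
prices = pd.DataFrame({
    "DATE": pd.to_datetime(
        ["2015-01-02", "2015-01-15", "2015-01-30", "2015-02-13", "2015-02-27"]
    ),
    "SP500": [100.0, 110.0, 99.0, 108.9, 103.455],
})
prices["Daily_Return"] = prices["SP500"].pct_change() * 100
prices = prices.dropna(subset=["Daily_Return"])

# Group the returns by calendar month
grouped = prices.groupby(prices["DATE"].dt.to_period("M"))["Daily_Return"]

monthly = pd.DataFrame({
    # Simple sum of the percent changes within each month
    "summed": grouped.sum(),
    # Compounded return: multiply the gross returns, then convert back to percent
    "compounded": grouped.apply(lambda r: ((1 + r / 100).prod() - 1) * 100),
})
print(monthly.round(3))
```

In this toy example a +10% move followed by a -10% move sums to zero but compounds to a 1% loss, so the gap between the two measures grows in months with large swings, which are the months that matter most for a volatility analysis.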


## Unemployment Rate 

In [4]:
#read in the unemployment rate data
fp = Path('data') / 'UNRATE.csv'
unem = pd.read_csv(fp)

# Ensure the DATE column is in datetime format
unem['DATE'] = pd.to_datetime(unem['DATE'])

# Filter the data for dates between 2015-01-01 and 2024-01-01 and reset the index
unem = unem[(unem['DATE'] >= '2015-01-01') & (unem['DATE'] <= '2024-01-01')].reset_index(drop=True)
unem

Unnamed: 0,DATE,UNRATE
0,2015-01-01,5.7
1,2015-02-01,5.5
2,2015-03-01,5.4
3,2015-04-01,5.4
4,2015-05-01,5.6
...,...,...
104,2023-09-01,3.8
105,2023-10-01,3.8
106,2023-11-01,3.7
107,2023-12-01,3.7


## Aggregate Consumption 

In [6]:
# read in the aggregate consumption data
fp = Path('data') / 'PCE.csv'
agg = pd.read_csv(fp)

# Ensure the DATE column is in datetime format
agg['DATE'] = pd.to_datetime(agg['DATE'])

# Filter the data for dates between 2015-01-01 and 2024-01-01 and reset the index
agg = agg[(agg['DATE'] >= '2015-01-01') & (agg['DATE'] <= '2024-01-01')].reset_index(drop=True)
agg


Unnamed: 0,DATE,PCE
0,2015-01-01,12066.7
1,2015-02-01,12116.6
2,2015-03-01,12176.1
3,2015-04-01,12209.1
4,2015-05-01,12275.4
...,...,...
104,2023-09-01,19024.9
105,2023-10-01,19069.5
106,2023-11-01,19151.0
107,2023-12-01,19289.9


## Core CPI Data 

In [10]:
# read in the core CPI data
fp = Path('data') / 'CORECPI.xlsx'
CPI_data = pd.read_excel(fp)

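# grab only the United States data and the dates (copy so later edits do not touch a view of the full table)
CPI_data = CPI_data[['United States', 'Unnamed: 0']].copy()

# rename the date column appropriately
CPI_data = CPI_data.rename(columns={'Unnamed: 0': 'DATE'})

# Convert the DATE column from labels such as '2015M01' (the layout implied by the format string) to datetime;
# rows that do not match the format become NaT and are removed by the date filter below
CPI_data['DATE'] = pd.to_datetime(CPI_data['DATE'], format='%YM%m', errors='coerce')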

# Filter the data for dates between 2015-01-01 and 2024-01-01
CPI_data = CPI_data[(CPI_data['DATE'] >= '2015-01-01') & (CPI_data['DATE'] <= '2024-01-01')]

CPI_data

Unnamed: 0,United States,DATE
242,108.6923,2015-01-01
243,108.8559,2015-02-01
244,109.1201,2015-03-01
245,109.3880,2015-04-01
246,109.5430,2015-05-01
...,...,...
346,140.7967,2023-09-01
347,141.1348,2023-10-01
348,141.5695,2023-11-01
349,141.9593,2023-12-01
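
The date conversion in the cell above relies on the format string `'%YM%m'`, which reads a four-digit year, a literal `M`, and a two-digit month. The sketch below shows that behavior on a few hypothetical labels, including one non-date row of the kind a spreadsheet header might contain. The exact labels in `CORECPI.xlsx` are an assumption here.

```python
import pandas as pd

# Hypothetical labels in the year-'M'-month style, plus one non-date header row
raw = pd.Series(["2015M01", "2015M02", "2023M12", "Units: index"])

# Labels that do not match the format are coerced to NaT instead of raising
parsed = pd.to_datetime(raw, format="%YM%m", errors="coerce")
print(parsed)
```

Because comparisons against `NaT` evaluate to `False`, the date-range filter that follows the conversion also drops any rows that failed to parse.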


## Merging Data

In [11]:
# Align every DATE to the first day of its month so the merge keys match across datasets
for df in (monthly_sp500, CPI_data, unem, agg):
    # Convert to datetime if needed
    df['DATE'] = pd.to_datetime(df['DATE'])
    # Keep only the year and month part
    df['DATE'] = df['DATE'].dt.to_period('M').dt.to_timestamp()

# Perform the outer merge on 'DATE' after adjusting all to monthly
merged_data = (
    monthly_sp500
    .merge(CPI_data, on='DATE', how='outer')
    .merge(unem, on='DATE', how='outer')
    .merge(agg, on='DATE', how='outer')
)

# Filter to ensure the date range is from January 2015 to December 2023
merged_data = merged_data[(merged_data['DATE'] >= '2015-01-01') & (merged_data['DATE'] <= '2023-12-31')]

# Check for null values in merged data
assert merged_data.isnull().sum(axis = 0).sum() == 0

#display the data
merged_data.head()


Unnamed: 0,DATE,Daily_Return,United States,UNRATE,PCE
0,2015-01-01,-3.007878,108.6923,5.7,12066.7
1,2015-02-01,5.384284,108.8559,5.5,12116.6
2,2015-03-01,-1.668783,109.1201,5.4,12176.1
3,2015-04-01,0.877221,109.388,5.4,12209.1
4,2015-05-01,1.087943,109.543,5.6,12275.4
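
One possible next step toward the research question is to convert the macro levels into changes and compare their correlations with the monthly return. The sketch below does this on the five merged rows displayed above, purely to stay self-contained; with so few observations the resulting numbers carry no meaning. The column names `cpi_pct`, `unrate_pp`, and `pce_pct` are illustrative, and using differences rather than levels is an assumption about the eventual analysis.

```python
import pandas as pd

# The five merged rows displayed above (used only to keep the sketch self-contained)
merged = pd.DataFrame({
    "DATE": pd.to_datetime(
        ["2015-01-01", "2015-02-01", "2015-03-01", "2015-04-01", "2015-05-01"]
    ),
    "Daily_Return": [-3.007878, 5.384284, -1.668783, 0.877221, 1.087943],
    "United States": [108.6923, 108.8559, 109.1201, 109.388, 109.543],
    "UNRATE": [5.7, 5.5, 5.4, 5.4, 5.6],
    "PCE": [12066.7, 12116.6, 12176.1, 12209.1, 12275.4],
})

# Month-over-month changes in each macro factor
factors = pd.DataFrame({
    "cpi_pct": merged["United States"].pct_change() * 100,
    "unrate_pp": merged["UNRATE"].diff(),
    "pce_pct": merged["PCE"].pct_change() * 100,
})

# Pearson correlation of each factor with the monthly return, ranked by magnitude
corr = factors.corrwith(merged["Daily_Return"]).sort_values(key=abs, ascending=False)
print(corr)
```

Run on the full `merged_data` table, the same pattern would give a first ranking of the three factors before any regression modeling.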


# Ethics & Privacy

There could be ethical concerns regarding the following:

- Biases/Privacy/Terms of Use: World Bank U.S. data is generally public and follows strict data privacy and usage standards. However, potential biases can arise due to the structure of reported metrics and reliance on government and institutional sources, which might reflect particular economic priorities or omit certain demographic nuances.
- Potential Biases in Composition: U.S. World Bank data might underrepresent specific subpopulations or regional economic conditions, especially if it aggregates across states or overlooks informal economies. Additionally, while U.S. data is generally comprehensive, it might still exclude or simplify information from marginalized communities or rural areas, which could skew results. The stock market data may also fail to adequately represent activity by lower income populations, as this market tends to be influenced primarily by wealthy investors.
- Detecting Biases: To detect these biases, we’ll review data for representativeness across different demographics and states, looking for gaps in coverage or skewed trends. Cross-checking with other sources, like the Bureau of Labor Statistics, will also help validate findings and identify potential issues during analysis.
- Other Issues (Privacy & Equitable Impact): While anonymized, the data could still emphasize economic conditions linked to sensitive attributes, such as income disparities. We’ll contextualize findings carefully to avoid reinforcing stereotypes or inequities.
- Handling Issues Identified: We’ll adjust models to account for possible underreporting in certain states or demographics and provide transparency about these limitations in our communication. Sensitivity analyses and clear statements about data limitations will also help ensure fair, unbiased interpretations.


# Team Expectations 

- Check our Discord regularly for any updates
- Respond promptly to texts in our group chat
- Let everyone know if you are unable to attend our meeting times
- Be communicative and transparent if you can’t get your share of the work done
- Be understanding


Scott - Clean Data, Create initial visualizations 

Ruben  -  Clean Data, Create initial visualizations 

Ethan -  Double Check Visualizations and Data - Analysis, Discussion 

Noah - Analysis, Conclusion, Discussion

Shion - Analysis, Conclusion, Discussion


# Project Timeline Proposal

**All meetings are tentatively set for 11:30 a.m.; we will confirm the time with each other before each meeting**


| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 10/28  |  11:30 AM | Read Ruben's proposed ideas; brainstorm topics/questions  | Finalize our question and understand our roles within the project going forward| 
| 11/11  |  11:30 AM | Checkpoint #1: Data is due on Wednesday. Ruben and Scott should have our datasets cleaned and ready for analysis | Ruben and Scott will present the cleaned data and we will discuss how to continue with our project. Discuss possible analytical approaches |
| 11/25  | 11:30 AM  | Checkpoint #2: EDA is due on Wednesday. Scott and Ruben should finalize wrangling; Ethan and Noah should continue to perform EDA | Address any concerns with the EDA process, and discuss further advanced analysis and new directions to take the project  |
| 12/02  | 11:30 AM  | Wrap up any data analysis and go through any completed work | Figure out the best time to film our final video, and discuss anything that should be changed |
| 12/09  | 11:30 AM  | Shion should conclude our paper and review it carefully, fixing anything that needs fixing | Wrap up our project. Ensure we are happy with our results and have the project turned in 2 days early to avoid any last minute issues |