# Predicting S&P 500 Movements Using Macroeconomic Indicators

### Team
- **Prathmesh D. Pawar** (GitHub ID: `Prathmesh0312`)
- **Pratik More** (GitHub ID: `Pratikmgit`)
- **Rushikesh Shinde** (GitHub ID: `rushikesh328`)
- **Sai Swetha Lakkoju** (GitHub ID: `Shweta23-rgb`)
- **Shikha A. Singh** (GitHub ID: `ssingh952002`)



## Introduction

The goal of our project is to predict whether the S&P 500, a major U.S. stock market index, will move up or down each day. To do this, we analyze several factors that might affect the stock market, such as gold and oil prices, currency exchange rates, and the strength of the U.S. dollar. By examining data from 2016 to 2024, we want to see whether these factors can help predict future movements of the S&P 500.

Traditional models often rely heavily on stock prices, trading volumes, or financial metrics, whereas our model also accounts for broader economic factors. This more holistic approach gives a wider view of market movements, which could improve accuracy and adaptability, particularly during volatile market periods.

### Primary Stakeholders

Our stakeholders include retail investors, financial analysts, FinTech platforms, and corporate leaders. Retail investors can use our predictions to make better investment decisions and manage their finances more effectively. Financial analysts can improve their forecasts by incorporating additional macroeconomic data into their analysis. FinTech platforms can enhance their tools by offering more reliable market predictions to their users. Corporate leaders can use these insights to plan their budgets, allocate resources, and make informed decisions about managing risks in the market. Our project aims to help all these groups make smarter and more confident financial decisions.


### Solution: 

Our solution involves analyzing macroeconomic factors, such as the US Dollar Index (USIDX), currency exchange rates, gold prices, and crude oil prices, to predict future movements of the S&P 500 index. We created a target variable called **Binary Movement**, which categorizes the S&P 500's daily movement as up, down, or unchanged. This solution addresses the needs of our stakeholders by providing a data-driven approach to market predictions. Retail investors can use the predictions to make informed decisions, financial analysts can incorporate additional macroeconomic data into their forecasts, FinTech platforms can enhance their tools with more reliable predictions, and corporate leaders can use the insights for strategic planning and risk management.









## Literature Review

Current market prediction models rely heavily on technical factors such as stock prices, trading volumes, and common economic indicators (interest rates, inflation). Chen et al. (2020) demonstrated that Random Forest (RF) models offered more accurate predictions of S&P 500 returns than other machine learning models, but they still primarily focused on stock-specific data.[^1]

Fischer and Krauss (2018) showed that Long Short-Term Memory (LSTM) networks could forecast 1-day ahead returns for S&P 500 stocks, but they did not include broader macroeconomic factors, limiting the scope of their prediction.[^2] Similarly, Krauss et al. (2017) demonstrated that ensemble machine learning models, such as combining Random Forests and Gradient-Boosted Trees, improved prediction accuracy. However, these studies did not fully integrate non-traditional factors like consumer spending or commodity prices, which can provide deeper insights into market fluctuations.[^3]

Our approach seeks to address these limitations by incorporating commodity prices and currency exchange rates into a unified prediction model, offering a more comprehensive understanding of market dynamics.



## Data and Methods: 

This section introduces our data and specifies our modeling approach.

Data Description: The dataset used for this project spans 2016 to 2024 and contains 2,236 rows and 31 columns (details of the data source are given below). It combines financial and macroeconomic indicators with additional engineered features to forecast movements of the S&P 500.

Modeling Approach: Our modeling approach involved preprocessing, systematic feature engineering, and testing various machine learning algorithms with hyperparameter tuning. 

**Google Drive Link to access all the data files:**
https://drive.google.com/drive/folders/1s6eXHTSxT-JsYRjJFXvUoOaYpX22w1my?usp=drive_link



## Data:

### Data Source:

The data for this project was collected from Investing.com, a well-known platform that provides reliable real-time and historical data for financial markets. It is commonly used by both professionals and individual investors to track stock indices, commodities, currencies, and other economic trends. This makes the dataset trustworthy and suitable for our analysis.

### About the dataset:

- **S&P 500 Futures**: Historical daily data for the S&P 500 index, including closing prices, highs, lows, and volume.
- **Gold and Crude Oil Futures**: Prices for gold and crude oil, reflecting commodity trends.
- **Forex Data**: Exchange rates for major currencies (EUR/USD, GBP/USD, USD/CNY, and USD/JPY).
- **US Dollar Index**: A measure of the U.S. dollar's value relative to a basket of foreign currencies.

**Data Quality**: The dataset was cleaned before modeling. Its key properties are:

- **Number of Rows**: 2,236
- **Initial Number of Columns**: 47 (Not including engineered features)
- **Target Variable**: Binary Movement (classification: -1 for down, 1 for up, 0 for no movement).
- **Balance of Classes**:
  - `-1`: 1213 instances
  - `1`: 1000 instances
  - `0`: 23 instances 




### Visualisations to provide an overview of our data: 
![Screenshot%202024-12-17%20at%2012.18.22%20PM.png](attachment:Screenshot%202024-12-17%20at%2012.18.22%20PM.png)



![Screenshot%202024-12-17%20at%2012.22.09%20PM.png](attachment:Screenshot%202024-12-17%20at%2012.22.09%20PM.png)

![Screenshot%202024-12-17%20at%2012.28.50%20PM.png](attachment:Screenshot%202024-12-17%20at%2012.28.50%20PM.png)

### Methods

#### Goal of the Project
To reiterate, the goal of our project was to predict the daily movements of the S&P 500 index using historical market data and macroeconomic indicators. To achieve this, we created a target variable called **Binary Movement**, which encodes whether the index goes up (+1), goes down (-1), or stays the same (0) on the next day.

#### Data Preprocessing

To prepare the data for modeling, we followed these steps:

**1. Merging the Datasets:**
- We combined several datasets, including S&P 500 Futures, gold and crude oil prices, forex exchange rates, and the US Dollar Index.
- These datasets were merged using the `Date` column, resulting in a unified dataset with **2,236 rows** and **31 columns**.
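
The merge step can be sketched as follows. This is a minimal illustration on tiny synthetic frames, not the project's actual code: the column name `DollarIndex_Price` and the choice of a left join onto the S&P 500 calendar are assumptions, since the report only states that the datasets were merged on `Date`.

```python
from functools import reduce

import pandas as pd

# Tiny synthetic frames standing in for the per-instrument exports.
sp500 = pd.DataFrame({
    "Date": pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-04"]),
    "Price": [4780.0, 4745.0, 4730.0],
})
gold = pd.DataFrame({
    "Date": pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-04"]),
    "Gold_Price": [2073.4, 2042.8, 2050.0],
})
dollar_index = pd.DataFrame({
    "Date": pd.to_datetime(["2024-01-02", "2024-01-04"]),  # one day missing
    "DollarIndex_Price": [102.2, 102.4],  # hypothetical column name
})

frames = [sp500, gold, dollar_index]

# Assumption: left-join every frame onto the S&P 500 dates so that
# no S&P 500 trading day is lost; gaps in other series become NaN.
merged = reduce(
    lambda left, right: left.merge(right, on="Date", how="left"),
    frames,
)
merged = merged.sort_values("Date").reset_index(drop=True)
print(merged)
```

With a left join, days missing from a secondary series show up as `NaN` rather than silently dropping rows, which is what makes the missing-data step that follows necessary.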

**2. Handling Missing Data:**
- Some columns, such as volume columns for certain datasets, had too many missing values, so we dropped them.
- For other columns with a small number of missing values, we used **forward-fill** and **backward-fill** methods to fill the gaps.
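
A minimal sketch of this missing-data handling is shown below. The 50% cutoff (`MAX_MISSING_SHARE`) and the column `Gold_Vol` are hypothetical; the report does not state the threshold it used for "too many" missing values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Price": [4780.0, 4745.0, 4730.0, 4760.0, 4790.0],
    "Gold_Price": [np.nan, 2042.8, np.nan, 2050.0, 2061.3],
    "Gold_Vol": [np.nan, np.nan, np.nan, np.nan, 1.2],  # mostly missing
})

MAX_MISSING_SHARE = 0.5  # hypothetical cutoff, not taken from the report

# Drop columns whose share of missing values exceeds the cutoff.
missing_share = df.isna().mean()
too_sparse = missing_share[missing_share > MAX_MISSING_SHARE].index
df = df.drop(columns=too_sparse)

# Forward-fill carries the last known value forward; back-fill then
# covers any gaps left at the very start of a series.
df = df.ffill().bfill()
print(df)
```

Note that back-filling copies a later observation into an earlier row, so in a time-series setting it is usually safest to restrict it to the leading gaps that forward-fill cannot reach.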

**3. Creating New Features:**
To improve the performance of the models, we created new features:
- **Lagged Features:** Previous day values of important variables, such as `Price` and `Gold_Price`.
- **Rolling Averages:** A 5-day moving average (`Price_MA5`) to capture short-term trends.
- **Volatility:** A 7-day rolling standard deviation (`Price_volatility_7d`) to measure price fluctuations.
- **Ratios:**
  - `Gold_to_SP500`: Ratio of gold price to S&P 500 price.
  - `DollarIndex_to_SP500`: Ratio of the US Dollar Index to S&P 500 price.
- **Momentum Indicator:** We calculated the Relative Strength Index (**RSI**) using 14-day price changes to measure market momentum.
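
The engineered features listed above can be sketched in pandas as follows. The data is synthetic, the names `Price_lag1`, `Gold_Price_lag1`, `DollarIndex_Price`, and `RSI` are hypothetical, and the RSI uses a simple moving average of gains and losses, which is one common variant (Wilder's smoothing is another); the report does not say which one was used.

```python
import numpy as np
import pandas as pd

# Synthetic random-walk prices standing in for the merged dataset.
rng = np.random.default_rng(0)
n = 40
df = pd.DataFrame({
    "Price": 4700 + rng.normal(0, 20, n).cumsum(),
    "Gold_Price": 2000 + rng.normal(0, 5, n).cumsum(),
    "DollarIndex_Price": 102 + rng.normal(0, 0.2, n).cumsum(),
})

# Lagged features: the previous day's value.
df["Price_lag1"] = df["Price"].shift(1)
df["Gold_Price_lag1"] = df["Gold_Price"].shift(1)

# 5-day moving average and 7-day rolling standard deviation.
df["Price_MA5"] = df["Price"].rolling(window=5).mean()
df["Price_volatility_7d"] = df["Price"].rolling(window=7).std()

# Cross-asset ratios.
df["Gold_to_SP500"] = df["Gold_Price"] / df["Price"]
df["DollarIndex_to_SP500"] = df["DollarIndex_Price"] / df["Price"]


def rsi(prices: pd.Series, window: int = 14) -> pd.Series:
    """RSI from simple moving averages of gains and losses."""
    delta = prices.diff()
    avg_gain = delta.clip(lower=0).rolling(window).mean()
    avg_loss = (-delta.clip(upper=0)).rolling(window).mean()
    rs = avg_gain / avg_loss
    return 100 - 100 / (1 + rs)


df["RSI"] = rsi(df["Price"])
print(df.tail(3).round(2))
```

Every rolling feature leaves `NaN` values in its warm-up window (the first 14 rows for the RSI here), so those rows need to be dropped or filled before modeling.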

To capture seasonal patterns and trends in the data, we created a few additional features: 

- **Day of the Week**: This feature indicates the day of the week (0 for Monday, 6 for Sunday). Financial markets often show different behaviors depending on the day.  
- **Month**: A feature to identify the month of the year, capturing any monthly trends or seasonality.  
- **Quarter**: This feature identifies which quarter (Q1 to Q4) a particular data point falls into, as quarterly reporting and earnings can impact market movements.  
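
These calendar features map directly onto the pandas datetime accessor. The sketch below assumes a parsed `Date` column; the output column names `DayOfWeek`, `Month`, and `Quarter` are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"Date": pd.to_datetime(["2024-01-02", "2024-04-05", "2024-12-16"])}
)

df["DayOfWeek"] = df["Date"].dt.dayofweek  # 0 = Monday ... 6 = Sunday
df["Month"] = df["Date"].dt.month          # 1 to 12
df["Quarter"] = df["Date"].dt.quarter      # 1 to 4
print(df)
```

Since markets are closed on weekends, `DayOfWeek` will in practice only take the values 0 to 4 on a daily trading dataset.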



**4. Creating the Target Variable:**
- We created a target variable called **Binary Movement** to predict whether the S&P 500 would go up, down, or stay the same the next day.
- To align with this goal, we shifted the target column backward by one day so that today’s features would predict tomorrow’s movement.
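
A minimal sketch of this target construction is shown below. It assumes the movement is the sign of the day-over-day change in `Price`; the intermediate column `Movement` and the target name `Binary_Movement` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Price": [100.0, 101.0, 101.0, 99.5, 100.2]})

# Sign of the day-over-day change: +1 up, -1 down, 0 unchanged.
df["Movement"] = np.sign(df["Price"].diff())

# Shift by -1 so that each row carries the NEXT day's movement as its label.
df["Binary_Movement"] = df["Movement"].shift(-1)

# The last row has no next day, so it has no label and is dropped.
df = df.dropna(subset=["Binary_Movement"])
df = df.astype({"Binary_Movement": int})
print(df[["Price", "Binary_Movement"]])
```

After the shift, the features on a given row describe today while the label describes tomorrow, which is the alignment the step above relies on.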

**5. Scaling the Features:**
- Since the data contained features with different ranges, we tested multiple scaling methods, including:
  - **MinMaxScaler**
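
As an illustration of the scaling step, the sketch below applies `MinMaxScaler` to toy arrays. Fitting the scaler on the training rows only and reusing it for the test rows is an assumption about good practice, not something the report states explicitly.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrices; the test row comes later in time than the training rows.
X_train = np.array([[10.0, 1.0], [20.0, 3.0], [30.0, 5.0]])
X_test = np.array([[40.0, 2.0]])

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns min/max from train only
X_test_scaled = scaler.transform(X_test)        # reuses the train min/max

print(X_train_scaled)
print(X_test_scaled)
```

Because the minimum and maximum come from the training period, later values can fall outside the 0 to 1 range (the first test feature here scales to 1.5), which is expected for trending price series.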
  
**6. Class Imbalance:**
The target variable was imbalanced:

- Class -1 (down): 969 instances
- Class 1 (up): 806 instances
- Class 0 (no movement): only 13 instances

 We decided not to use SMOTE (oversampling) because our data is financial time-series data, where the sequence of events is crucial. Using SMOTE could disrupt this sequential nature by introducing synthetic data points that do not preserve the temporal dependencies in the dataset. 
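
One order-preserving alternative to oversampling is class weighting, a form of the cost-sensitive learning listed under future work. The sketch below was not part of our modeling; it only shows what scikit-learn's `"balanced"` weights would look like for the class counts reported above.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Labels reconstructed from the class counts reported above.
y = np.concatenate([
    np.full(969, -1),  # down
    np.full(806, 1),   # up
    np.full(13, 0),    # no movement
])
classes = np.array([-1, 0, 1])

# "balanced" weight = n_samples / (n_classes * class_count)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weight = dict(zip(classes.tolist(), weights.tolist()))
print({label: round(w, 3) for label, w in class_weight.items()})
```

Such a dictionary can be passed as the `class_weight` argument of estimators like `LogisticRegression` or `SVC`, so that errors on the rare class cost more without creating any synthetic rows.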

  
**7. Removal of the "Open", "Change %", and "Price" Columns:**

We identified that the Open and Change % columns posed a risk of data leakage, as they could indirectly provide future information to the model during training. The Open price, being the first traded price of the day, could overlap with the next day's target movement. Similarly, the Change % column directly encodes price changes that are closely related to the target variable. To avoid bias and ensure fair model training, we dropped these columns from our dataset. Because the Binary Movement target is derived from the Price column, we also removed Price before training the models to avoid data leakage and overfitting.
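
In code, this step amounts to dropping the flagged columns when building the feature matrix. The sketch below uses a toy frame; the target column name `Binary_Movement` and the remaining feature columns are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Open": [4775.0, 4750.0, 4728.0],
    "Change %": [0.4, -0.7, -0.3],
    "Price": [4780.0, 4745.0, 4730.0],
    "Gold_Price": [2073.4, 2042.8, 2050.0],
    "RSI": [55.1, 48.3, 46.9],
    "Binary_Movement": [-1, -1, 1],
})

LEAKY_COLUMNS = ["Open", "Change %", "Price"]
TARGET = "Binary_Movement"

# Features exclude both the leakage-prone columns and the target itself.
X = df.drop(columns=LEAKY_COLUMNS + [TARGET])
y = df[TARGET]
print(list(X.columns))
```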


### Modeling 

For our modeling process, we explored multiple machine learning algorithms to determine which model would best predict the daily movements of the S&P 500 index. We started with Logistic Regression, as it is a simple yet effective model for binary classification tasks and serves as a good baseline. Next, we tried the Decision Tree Classifier, which can capture non-linear relationships in the data and provide interpretability by showing how features influence predictions. To improve performance further, we applied Random Forest and Gradient Boosting, both ensemble methods that combine multiple decision trees to reduce overfitting and improve generalization. Finally, we tested Support Vector Machines (SVM), as they are known for their ability to handle complex decision boundaries.
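
The model comparison described above can be sketched with scikit-learn as follows. The data is synthetic and the hyperparameters are placeholders, not the tuned values from our experiments; the 80/20 chronological split is likewise an assumption made for the example.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the scaled feature matrix and up/down labels.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
signal = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300)
y = np.where(signal > 0, 1, -1)

# Chronological split: the first 80% of rows train, the rest test.
split = int(len(X) * 0.8)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "SVM (linear kernel)": SVC(kernel="linear"),
}

# Fit each model and record (train accuracy, test accuracy).
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))

for name, (train_acc, test_acc) in scores.items():
    print(f"{name:<22} train={train_acc:.3f}  test={test_acc:.3f}")
```

Comparing the train and test columns side by side is a quick way to spot models whose training accuracy far exceeds their test accuracy.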
  
  


### Results

Below, we summarize the performance of the top two models: **Support Vector Machine (SVM)** and **Logistic Regression**. We evaluated these models using **accuracy**, **F1 score**, **precision**, and **recall** on both the training and testing datasets.

---

| Metric                | SVM (Linear Kernel) | Logistic Regression |
|-----------------------|----------------------|----------------------|
| **Train Accuracy**    | 83.07%              | 82.30%              |
| **Test Accuracy**     | 80.33%              | 79.14%              |
| **Train F1 Score**    | 82.53%              | 81.74%              |
| **Test F1 Score**     | 79.74%              | 78.59%              |
| **Train Precision**   | 82.44%              | 81.71%              |
| **Test Precision**    | 79.78%              | 78.39%              |
| **Train Recall**      | 83.07%              | 82.30%              |
| **Test Recall**       | 80.33%              | 79.14%              |

---
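
In the table above, the recall rows match the accuracy rows exactly. That is what support-weighted averaging produces, so the sketch below assumes `average="weighted"`; the labels are toy values, not our predictions.

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
)

# Toy labels: -1 = down, 0 = no movement, 1 = up.
y_true = [1, -1, -1, 1, 0, -1, 1, -1]
y_pred = [1, -1, 1, 1, -1, -1, -1, -1]

accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 avoids warnings for a class that is never predicted.
precision = precision_score(y_true, y_pred, average="weighted", zero_division=0)
recall = recall_score(y_true, y_pred, average="weighted", zero_division=0)
f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

Weighted recall sums each class's recall times its share of the samples, which reduces to the overall fraction of correct predictions. This also means a rare class such as "no movement" has very little influence on the weighted scores.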

### Reason for Not Performing Cross-Validation

We **did not perform k-fold cross-validation** because our dataset is **time-series data**, where the chronological order of the observations is crucial. Standard k-fold cross-validation with shuffling assigns observations to folds without regard to time, so a model could be trained on later observations and evaluated on earlier ones, which would **disrupt the temporal dependencies** in the dataset. Instead, we opted for a **train-test split**, ensuring that the test set contains only data that occurs **chronologically after** the training set. This more accurately mimics real-world use, where predictions about the future must rely on past observations.
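
A chronological split of this kind can be sketched as below. The 80% training share (`TRAIN_SHARE`) is an assumption for the example, and the frame is a toy stand-in for the real dataset.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.date_range("2024-01-01", periods=10, freq="B"),
    "feature": range(10),
    "Binary_Movement": [1, -1, 1, 1, -1, -1, 1, -1, 1, 1],
}).sort_values("Date")

TRAIN_SHARE = 0.8  # assumption for this sketch

# No shuffling: the earliest rows train the model, the latest rows test it.
split_idx = int(len(df) * TRAIN_SHARE)
train, test = df.iloc[:split_idx], df.iloc[split_idx:]

print(train["Date"].max(), "<", test["Date"].min())
```

The same effect can be obtained with scikit-learn's `train_test_split` by passing `shuffle=False`.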

--- 
### Key Takeaways

- **SVM** performed the best overall with a test accuracy of **80.33%**, followed closely by **Logistic Regression** at **79.14%**.  
- Both models demonstrated strong generalization ability and balanced the trade-off between training and testing performance.  
- Other models, such as **Random Forest** and **Gradient Boosting**, tended to overfit the training data and underperformed on the test set.  

The consistent performance of SVM and Logistic Regression makes them the most reliable models for predicting daily movements in the S&P 500 index.




### Discussions:

Our goal for this project was to predict the **daily movements of the S&P 500 index** using historical market data and macroeconomic indicators. The stakeholder need was to develop a reliable model that could provide insights into the direction of the market (up, down, or no movement) to help in decision-making processes such as investments and trading strategies. 

#### Degree to Which We Achieved Our Goals

To a **large extent**, we were able to meet our goals. Our best-performing models, **Support Vector Machine (SVM)** and **Logistic Regression**, achieved **80.33%** and **79.14%** test accuracy, respectively. These results indicate that our models can predict daily market movements reasonably well. However, there are a few key reflections and limitations:

1. **Class Imbalance**:  
   The "no movement" class (0) was heavily underrepresented, with only 13 observations. This impacted the performance of our models in correctly predicting this class, as reflected in the **classification reports**, where precision and recall for the "no movement" class were 0. Our models focused more on distinguishing between "up" (+1) and "down" (-1) movements. Addressing this imbalance remains a critical area for improvement.

2. **Slight Overfitting**:  
   While SVM and Logistic Regression generalized well to the test set, other models like **Random Forest** and **Gradient Boosting** overfit the training data. This highlights the challenge of balancing complexity with generalization when working with time-series financial data.

3. **Temporal Nature of Data**:  
   We deliberately avoided **k-fold cross-validation** because it would disrupt the **chronological order** of our data. Instead, we relied on a simple **train-test split** where the test data occurred after the training data. While this method preserves the temporal structure, it may not fully evaluate model robustness under varying conditions.

4. **Feature Engineering**:  
   We incorporated several features, such as **lagged variables, moving averages, volatility indicators**, and **seasonal features** (Day of Week, Month, and Quarter), which we believe contributed to our models' performance. 
   

#### Addressing Stakeholder Needs

We believe our work mostly addresses the needs of our stakeholders. The models can predict market direction with reasonable accuracy, providing a solid foundation for identifying trends in the S&P 500 index. However:

- The models still face challenges with **class imbalance**, particularly for predicting no movement. This may limit their applicability in stable market conditions.  
- Stakeholders looking for **higher accuracy** in predicting small fluctuations or anticipating major market movements may require more complex models, such as ensemble techniques, deep learning, or models incorporating external economic data.





### Limitations

Apart from the limitations mentioned above, a couple of further limitations could be addressed to improve our project.


1. **Limited Features**:  
   While we created several new features (lagged variables, moving averages, volatility indicators, seasonal features), our feature set was still limited to **historical market data**. We did not incorporate external factors such as:
   - **Macroeconomic indicators** like GDP, unemployment rates, or interest rates.  
   - **Sentiment analysis** from financial news, reports, or social media.  
   - **Global economic events** like trade wars, geopolitical conflicts, or policy changes.

   Including these factors could make the models more robust and responsive to external market influences.

2. **Lack of Advanced Models**:  
   While we explored a variety of models, including **Logistic Regression**, **Support Vector Machines**, and **tree-based methods**, we did not experiment with **deep learning techniques** such as LSTMs (Long Short-Term Memory networks) or other recurrent neural networks, which are well-suited for time-series data. This was due to the complexity and time required to implement such models.





### Future work

To better address stakeholder needs, future efforts could focus on:
1. **Class Balancing**: Using methods such as cost-sensitive learning, downsampling, or exploring other time-series-specific balancing techniques.

2. **Incorporating External Data**: Adding sentiment analysis from news headlines or other macroeconomic indicators (e.g., unemployment rates, GDP growth) to improve model predictions.  

3. **Advanced Models**: Exploring time-series models like **LSTM** (Long Short-Term Memory networks) or ensemble approaches (e.g., XGBoost, stacking models) to better capture market trends and volatility.  

4. **Better Validation**: Implementing **time-series cross-validation** with more splits to further ensure model robustness and stability over unseen data.
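
As a pointer for the validation item above, the sketch below shows how scikit-learn's `TimeSeriesSplit` builds expanding-window folds in which every test fold lies after its training fold. The data is synthetic, and the choice of five splits and of `LogisticRegression` is illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Synthetic, time-ordered stand-in for the feature matrix and labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.where(X[:, 0] + rng.normal(scale=0.5, size=200) > 0, 1, -1)

tscv = TimeSeriesSplit(n_splits=5)

# Each fold trains on an initial segment and tests on the block right after it.
folds = list(tscv.split(X))
for train_idx, test_idx in folds:
    print(f"train rows 0..{train_idx.max()}  "
          f"test rows {test_idx.min()}..{test_idx.max()}")

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=tscv)
print(scores.round(3), scores.mean().round(3))
```

Unlike shuffled k-fold, this keeps the chronological order intact while still giving several out-of-sample estimates across different periods.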


In conclusion, while our models provide a strong starting point and meet our goals to a significant degree, there is still room for improvement. The insights gained from this project form a foundation for building more accurate and reliable predictive models to better address stakeholder needs in the future.
