# Stock Forecasting with the SARIMA Model

## Summary

**SARIMA Model:** Seasonal Autoregressive Integrated Moving Average (SARIMA) model, a time series forecasting method, is implemented to capture the underlying patterns and trends in the stock price data.

The Seasonal Autoregressive Integrated Moving Average (SARIMA) model provides a robust framework for predicting stock prices. SARIMA's ability to capture seasonal patterns and cyclic behavior inherent in sequential data, along with its foundation in statistical time series analysis principles, lends credibility to our predictions. The model's interpretable parameters allow for a deeper understanding of the underlying dynamics driving stock price movements. However, it's important to note SARIMA's limitation in capturing external factors such as news events or sudden market shocks, which may impact stock prices but are not directly incorporated into the model. Despite this consideration, SARIMA remains a valuable tool for predicting stock prices based on historical patterns and internal data dynamics.

For this project, four time ranges of historical data were evaluated for the accuracy of the SARIMA models' predictions for three different stocks (AAPL, TSLA, NVDA).

Predicted Dates: 06-01-2024 to 06-11-2024

Historical Data Time Ranges:

1. 01-01-2015 to 5-31-2024
2. 01-01-2017 to 5-31-2024
3. 01-01-2029 to 5-31-2024
4. 01-01-2021 to 5-31-2024

From these time ranges, it was determined how much historical data provided the most accurate predictions through the use of comparison metrics (R-squared, p-value, RMSE, MSE, MAPE, AIC, BIC).

### Question: Is the SARIMA model useful for a company's stock prediction? Can a company use this tool to make informed decisions with excellent accuracy?
This tool would allow a company to make informed decisions based on future stock behavior, as this tool has excellent accuracy (on average, 96.8% accuracy for the recent time ranges (2019, 2021)).

Parameters for Selection:
1. Accuracy (RMSE, MSE, MAPE)
2. Speed
3. Compute resource usage considerations
4. Reproducibility
5. World Events
6. Other factors

**Selected Model:** The SARIMA models with the most recent historical data time range (2021) were chosen because:
- The more recent models demonstrated high accuracy in capturing seasonal trends and underlying patterns in stock data.
- The more recent models provided interpretable parameters that offer insights into stock price movements.
Other time ranges (2015, 2017) weren't as effective because:
- These time ranges showed slower computation times.
- These time ranges had lower accuracy metrics compared to recent time ranges.
- These time ranges required more computational resources, making them less practical for real-time stock prediction.

### A Little More Detail About Why Not Other Answers and Methods (Context)

Time Series Prediction vs. Other Machine Learning Tasks: SARIMA specifically excels in time series analysis, making it well-suited for stock price predictions where historical data is crucial. In contrast, other machine learning models may not capture the temporal dependencies as effectively.

Specifics of the Model:
SARIMA: Captures seasonality and trends in data, making it robust for stocks with clear patterns (e.g., AAPL and TSLA).
Other Models (like ARIMA or regression-based models): While they may also predict stock prices, they often fail to account for seasonal variations, leading to less accurate forecasts.
Contextual Effectiveness:
The SARIMA model's ability to utilize recent historical data proved beneficial for stable stocks like AAPL, while its effectiveness varied with more volatile stocks like NVDA and TSLA, which required longer historical data for accurate predictions.

### Why Not Other Models?

1. **ARIMA (Autoregressive Integrated Moving Average)**:
   - While ARIMA is effective for univariate time series data, it lacks the capability to model seasonal patterns as effectively as SARIMA. SARIMA extends ARIMA by incorporating seasonal components, making it more suitable for datasets with clear seasonal trends, such as stock prices.

2. **GARCH (Generalized Autoregressive Conditional Heteroskedasticity)**:
   - GARCH models focus primarily on modeling volatility and are useful for forecasting the variance of returns rather than predicting actual prices. They are not designed for point forecasts of stock prices, limiting their applicability for direct price prediction tasks.

3. **XGBoost**:
   - XGBoost is a powerful machine learning algorithm that excels in classification and regression tasks. However, it does not inherently account for the temporal dependencies in time series data. Without careful feature engineering to incorporate time-based features, it may not perform as well for stock price predictions as SARIMA, which directly models these temporal relationships.

4. **Random Forest**:
   - Like XGBoost, Random Forest is a robust model for various prediction tasks but does not leverage the sequential nature of time series data. While it can provide accurate predictions with sufficient data, it typically requires extensive feature engineering to capture the temporal dynamics, which is automatically handled by SARIMA.

5. **TBATS (Trigonometric, Box-Cox Transformation, ARMA Errors, Trend, and Seasonal Components)**:
   - TBATS is designed for complex seasonal patterns, but it can be computationally intensive and may require tuning of multiple parameters. In contrast, SARIMA offers a more straightforward framework for seasonal data without needing as much computational overhead.

6. **Prophet**:
   - Prophet is user-friendly and effective for forecasting time series data with strong seasonal effects and missing values. However, it may not be as robust for datasets with less clear seasonal patterns or when high-frequency forecasting is needed. SARIMA, with its statistical basis, can provide more granular control over the model parameters for precise forecasting.

### Summary
While each of these models has its strengths, they may not capture the intricate seasonal patterns and temporal relationships present in stock price data as effectively as SARIMA. SARIMA’s design specifically caters to time series forecasting, allowing for more reliable predictions in financial contexts, particularly when historical data plays a critical role.

## How

### High-Level Description of Final Process for AAPL, NVDA, and TSLA

Based on the dataset comprising historical stock prices from 2021 (AAPL, TSLA) and 2019 (NVDA), the SARIMA model was utilized, leveraging packages such as statsmodels, pandas, and matplotlib to create the most accurate predictions. The results were visualized using the matplotlib package.

This dataset employs definitions for Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and prediction accuracy.

**Definition of Accuracy:** Accuracy in this context refers to the degree to which the predicted values from the SARIMA model align with actual observed values, as indicated by the low MSE, RMSE, and MAPE metrics.

### Demonstration of How to Implement SARIMA Model



In [1]:
# Step 1: Import packages
# % pip install pandas
# % pip install numpy
# %%! pip install plt
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Step 2: Read in data
data = pd.read_csv('data.csv')

# Step 3: Check data before processing
print(data.head())
print(data.info())
print(data.describe())

# Step 4: Process data
# Example: Handle missing values and encode categorical variables
data.fillna(data.mean(), inplace=True)  # Filling missing values with mean
data = pd.get_dummies(data, drop_first=True)  # One-hot encoding

# Step 5: Adjust model as needed
# Define features and target variable
X = data.drop('target', axis=1)  # Features
y = data['target']  # Target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and fit the model
model = LinearRegression()
model.fit(X_train, y_train)

# Step 6: Run
y_pred = model.predict(X_test)

# Step 7: Assess output
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse}')

# Step 8: Visualize
plt.scatter(y_test, y_pred)
plt.xlabel('True Values')
plt.ylabel('Predictions')
plt.title('True vs Predicted Values')
plt.plot([y.min(), y.max()], [y.min(), y.max()], color='red')  # Diagonal line
plt.show()

# Step 9: Output data and charts somewhere
output_data = pd.DataFrame({'True Values': y_test, 'Predictions': y_pred})
output_data.to_csv('predictions.csv', index=False)
plt.savefig('true_vs_predicted.png')  # Save the visualization



FileNotFoundError: [Errno 2] No such file or directory: 'data.csv'