# Stock Forecasting with the SARIMA Model

The Seasonal Autoregressive Integrated Moving Average (SARIMA) Machine Learning Model should be added to your company's stock forecasting toolkit because it offers several advantages that can significantly improve forecasting accuracy. The SARIMA model is suitable when dealing with time-series data, like stock data, that exhibit both trend and seasonality. Our SARIMA model has demonstrated high accuracy (~96%). By incorporating SARIMA, you can enhance forecast precision, uncover key seasonal insights, and improve decision-making for better investment strategies.

## Summary

The SARIMA Model provides a robust framework for predicting stock prices and is a strong complement to the existing machine learning models used by stock market analysts. 

For this project, we applied the SARIMA model to predict stock prices for companies Apple (AAPL), Tesla (TSLA), and NVIDIA (NVDA). The goal was to assess the effectiveness of SARIMA in capturing seasonal patterns and underlying trends in stock price movements. We evaluated multiple time ranges of historical data (from 2015 to 2024) and found that using the most recent data (2021) yielded the most accurate predictions, with a forecast accuracy of ~96%.

### Definition

**SARIMA: Seasonal Autoregressive Integrated Moving Average Model**

Key components of the SARIMA model include:
* S (Seasonal): Includes seasonal autoregressive (SAR), seasonal differencing (SI), and seasonal moving average (SMA) terms to account for recurring patterns or cycles at specific intervals (e.g., yearly, quarterly).

* AR (AutoRegressive): Models the relationship between an observation and a specified number of lagged observations.
* I (Integrated): Differencing the series to make it stationary (i.e., removing trends). These trends make the data non-stationary, which can lead to unreliable predictions because the model might mistake the trend as part of the pattern rather than just noise.
* MA (Moving Average): Models the relationship between an observation and the residual errors from a moving average model applied to lagged observations.

### Benefits

SARIMA's ability to capture the seasonal patterns and cyclic behavior inherent in sequential data, combined with its foundation in statistical time series analysis principles, enhances the credibility of stock forecasting processes.

The model's interpretable parameters (S, AR, I, MA structure) provide a deeper understanding of the underlying dynamics driving stock price movements. With these parameters, this model is designed to:

* notice the seasonal patterns present in the historical data (S component)
* highlight the behavior of past observations (AR component)
* understand the difference between noise and actual patterns (I component)
* model the relationship between the current observation and past forecast errors (residuals) (MA component)

#### Three Reasons to Use this Model

1. Captures Stock Price Seasonality: SARIMA can model seasonal patterns in stock prices, such as recurring market cycles or quarterly trends, improving forecast accuracy

2. Accounts for Time Dependencies: By using past stock prices (AR) and past forecast errors (MA), SARIMA can effectively capture trends and fluctuations in stock prices over time

3. Provides Interpretability: The model's parameters offer insights into the factors influencing stock price movements, helping analysts understand the underlying dynamics and make informed investment decisions

### Caveats

It is important to note SARIMA's limitation in capturing external factors, such as news events or sudden market shocks, which may affect stock prices but are not directly incorporated into the model. Despite this limitation, SARIMA remains a valuable tool for predicting stock prices based on historical patterns and internal data dynamics.

While SARIMA is a powerful tool for time series forecasting, there are several reasons why it might not be widely used by everyone, especially in stock forecasting:

1. Complexity and Tuning: SARIMA requires careful parameter tuning, which can be time-consuming and technical.
2. Assumes Stationarity: SARIMA works best with stationary data, but stock prices are often non-stationary and need preprocessing.
3. Limited to Historical Data: It only uses past data, ignoring external factors like news or market sentiment, which can influence stock prices.
4. Difficulty with Volatility: Stock prices are volatile and unpredictable, making SARIMA less effective for handling sudden market changes.
5. Competition from Machine Learning: More flexible machine learning models, like XGBoost or LSTMs, are often preferred for their ability to handle complex data and capture non-linear relationships.

## Methodology

For this project, four time ranges of historical data were evaluated to determine the optimal time lag and time steps to include in the model for three different stocks, AAPL (Apple Inc.), TSLA (Tesla Inc.), and NVDA (NVIDIA Corp.). It is important to test the accuracy of this model with stocks that exhibit all behaviors, which was the reason the three stocks were chosen. Historically, AAPL is known to be a safer stock, TSLA is known to be safe and volatile (fluctuates) and NVDA is known to be commonly volatile. By testing these three stocks, this SARIMA model's performance was analyzed with all ranges of behavior. In turn, improvements were made to the model's parameters to best work for all behavior.

Predicted Dates: 06-01-2024 to 06-11-2024

Historical Data Time Ranges:

1. 01-01-2015 to 5-31-2024
2. 01-01-2017 to 5-31-2024
3. 01-01-2029 to 5-31-2024
4. 01-01-2021 to 5-31-2024

### Tests Used to Determine Accuracy

Prediction accuracy was determined using comparison metrics, including R-squared, p-value, RMSE, MSE, MAPE, AIC, and BIC. Please refer to Definition Metric Choices section below for definition of each metric. These metrics were chosen as they are the most common metrics used by financial analysts to analyze model prediction accuracy.

First, these metrics were specifically utilizated to compare time ranges. Once the optimal time ranges were chosen for each stock, the overall accuracy of the rool was analyzed. The metric with the most weight for determination of accuracy was the MAPE metric (Mean Absolute Percentage Error). The MAPE metric was emphasized more than other metrics due to it being ideal for stationary data. Since the SARIMA model's Integrated component caused the data to undergo differencing, thus making the data stationary, the MAPE metric was determined to be suitable for this application of SARIMA Model with stock forecasting.

### Accuracy Results

| **Stock**       | **Best Time Range** | **MAPE**     | **Overall Accuracy** | **Behavior**      |
|-----------------|---------------------|--------------|----------------------|-------------------|
| Apple (AAPL)    | 2021                | 2.3569%      | 97.6431%             | Stable            |
| Tesla (TSLA)    | 2021                | 1.9726%      | 98.0274%             | Fluctuates        |
| NVIDIA (NVDA)   | 2019                | 5.2467%      | 94.7533%             | Volatile          |

**Average Accuracy: 96.8079%**

The most recent time ranges yielded the most accurate results (2019, 2021). The finalized predictive model was thoroughly tested and found to have excellent accuracy, averaging ~96.8% for the recent time ranges (2019, 2021). Therefore, the SARIMA model is a valuable addition to a company's toolkit.

   ### Kevin: Factors for Selecting Time Ranges:
1. Accuracy (RMSE, MSE, MAPE) *which? why?* 
    1a. *seasonality*
2. Speed
3. Compute resource usage considerations
4. Reproducibility
5. World Events

### **Kevin: Variance Among Stocks**?? (can you check her notes? I don't know what she wants in this section)
### COME BACKKKKK!??
 They demonstrated high accuracy in capturing seasonal trends and underlying patterns in stock data. *see above*
~~- They provided interpretable parameters that offered valuable insights into stock price movements.~~ *what does this really mean?*

### A Detailed Analysis of SARIMA's Superiority Over Other Forecasting Methods

#### Time Series Prediction vs. Other Machine Learning Tasks: 

SARIMA excels in time series analysis, making it highly effective for stock price predictions, where the temporal dependencies of historical data play a crucial role. In contrast, other machine learning models may struggle to capture these temporal patterns as effectively. 

SARIMA is specifically useful due to its interpretable, flexible, and effective for different-sized datasets, offering a reliable method for forecasting complex time series with seasonal effects.

Specifics of the Model:

- SARIMA: Captures seasonality and trends in data, making it robust for stocks with clear cyclical patterns (seasonal patterns), such as AAPL and TSLA. By leveraging its interpretable parameters, SARIMA provides a nuanced understanding of stock price movements and helps identify recurring trends. While it explicitly models seasonal patterns (e.g., yearly cycles), it also accounts for long-term trends, such as gradual increases or decreases in the data. The differencing component of SARIMA helps remove these trends to make the data stationary, enabling more accurate forecasting of underlying patterns, whether they are cyclical or directional.

- Kevin: Contextual Effectiveness:
(Not sure what to put here)

#### The Disadvantages of Other Models:

1. ARIMA (Autoregressive Integrated Moving Average):
   - While ARIMA is effective for univariate time series data, it lacks the capability to model seasonal patterns as effectively as SARIMA. SARIMA extends ARIMA by incorporating seasonal components, making it more suitable for datasets with clear seasonal trends, such as stock prices.


2. GARCH (Generalized Autoregressive Conditional Heteroskedasticity):
   - GARCH models focus primarily on modeling volatility and are useful for forecasting the variance of returns rather than predicting actual prices. They are not designed for point forecasts of stock prices, limiting their applicability for direct price prediction tasks.

3. XGBoost:
   - XGBoost is a powerful machine learning algorithm that excels in classification and regression tasks. However, it does not inherently account for the temporal dependencies in time series data. Without careful feature engineering to incorporate time-based features, it may not perform as well for stock price predictions as SARIMA, which directly models these temporal relationships.

4. Random Forest:
   - Like XGBoost, Random Forest is a robust model for various prediction tasks but does not leverage the sequential nature of time series data. While it can provide accurate predictions with sufficient data, it typically requires extensive feature engineering to capture the temporal dynamics, which is automatically handled by SARIMA.

5. TBATS (Trigonometric, Box-Cox Transformation, ARMA Errors, Trend, and Seasonal Components):
   - TBATS is designed for complex seasonal patterns, but it can be computationally intensive and may require tuning of multiple parameters. In contrast, SARIMA offers a more straightforward framework for seasonal data without needing as much computational overhead.

6. Prophet:
   - Prophet is user-friendly and effective for forecasting time series data with strong seasonal effects and missing values. However, it may not be as robust for datasets with less clear seasonal patterns or when high-frequency forecasting is needed. SARIMA, with its statistical basis, can provide more granular control over the model parameters for precise forecasting.

### SARIMA vs. Other Models
While each of these models has its strengths, they may not capture the intricate seasonal patterns and temporal relationships present in stock price data as effectively as SARIMA. SARIMA’s design specifically caters to time series forecasting, allowing for more reliable predictions in financial contexts, particularly when historical data plays a critical role.

## Process Overview

### Final Workflow Summary for AAPL, NVDA, and TSLA Stock Predictions

Utilizing historical stock price data from 2021 (AAPL, TSLA) and 2019 (NVDA), the SARIMA model was applied. Key packages such as statsmodels, pandas, and matplotlib were used to generate the most accurate predictions, with results visualized using matplotlib.

The evaluation metrics employed include Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and prediction accuracy.

### Definition Metric choices:
The following metrics were considered when determining prediction accuracy:

* R-squared (R²):
R-squared represents the proportion of variance in the dependent variable explained by the independent variables. Values range from 0 (no fit) to 1 (perfect fit), with higher values indicating a better model fit.

* P-value:
The p-value tests the significance of results in hypothesis testing. A value less than 0.05 suggests strong evidence against the null hypothesis, indicating a statistically significant result.
* RMSE (Root Mean Squared Error):
RMSE measures the average magnitude of errors between predicted and actual values. Lower RMSE values indicate better predictive accuracy.
* MSE (Mean Squared Error):
MSE is the average of squared differences between predicted and actual values. Lower values suggest fewer errors and better model performance.
* MAPE (Mean Absolute Percentage Error):
MAPE calculates the average percentage difference between predicted and actual values. It’s used to assess forecasting accuracy, with lower values indicating better performance.
* AIC (Akaike Information Criterion):
AIC compares models based on fit and complexity, penalizing models with more parameters. A lower AIC suggests a more efficient model.
* BIC (Bayesian Information Criterion):
BIC is similar to AIC but applies a stronger penalty for additional parameters. It helps identify the simplest, best-fitting model with a lower BIC indicating better performance.


**Selected Definition of Accuracy:** In this context, accuracy refers to how closely the SARIMA model's predicted values match the actual observed stock prices, as measured by low (less than 10%) MSE, RMSE, and MAPE scores.


### Fit Assessment

The Train-Test Split and Cross-Validation fit assessments were performed for our models. Both Train-Test Split and Cross-Validation are techniques for assessing the performance of machine learning models, but they differ in how they split the data and how many times they train and test the model. Below, you will see how evaluated the AAPL model's fit with the historical data time range from 01-01-2021 to 05-31-2024.

- Train-Test split: Rolling 
    - Train period window: 684 days (80% of days stock market was open)
    - Test period window: 172 days (20% of days stock market was open)
    - Quarterly splitting was used (every three months)
- Cross-Validation: 3-Fold
   - Period 1: 
      - Train dates: 2021-01-01 to 2022-12-31 (2 years)
      - Test dates: 2023-01-01 to 2023-12-31 (1 year)
   - Period 2:
      - Train dates: 2021-01-01 to 2023-06-30 (2.5 years)
      - Test dates: 2023-07-01 to 2024-05-31 (1 year)
   - Period 3:
      - Train dates: 2021-01-01 to 2023-12-31 (3 years)
      - Test dates: 2024-01-01 to 2024-05-31 (5 months)

   


## **Step-by-Step Guide to Implementing the SARIMA Model with the AAPL Stock**

#### For this example, we will guide you step-by-step in implementing the SARIMA model for the AAPL stock.


### **Step 1: Import Packages**

Necessary packages were imported into Python. Packages important to note were pandas, numpy, yfinance, etc.

In [18]:
# Step 1

# Import packages
import numpy as np
import pandas as pd
import matplotlib
import sklearn
import pmdarima as pm
import matplotlib.pyplot as plt
import datetime

# Import necessary modules
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

# Packages specific to project
import statsmodels.api as sm
import yfinance as yf

### **Step 2: Download Historical Data**

The stock ticker was specified, and multiple years of historical data for the selected stock were downloaded and stored in the stock_select_str variable. In this case, the stock was AAPL.

In [3]:
# Step 2

# Specify stock ticker
stock_select_str = "AAPL"

# Initial load run
stock_data = yf.download(stock_select_str, start='2021-01-01', end='2024-05-31', interval='1d')


[*********************100%%**********************]  1 of 1 completed


### **Step 3: Check Data Before Processing**

The historical data for AAPL was checked before processing. First, the head of the dataframe was printed to verify the right dates were downloaded for our set of historical data. Below, you will see the first date is Jan 4th, 2021 (the first business day of the year and the first day of the year the stock market was open.) From this, it was determined the historical data set was correct. Second, more information was printed, with the intent of seeing the count of each column in the dataframe. Below, there is a count of 857 displayed for each column, meaning the right amount of days were pulled.

Lastly, the index of the stock_data DataFrame was first converted to a DateTimeIndex to enable easy filtering. Then, weekend data was filtered out by retaining only rows where the day of the week was less than 5 (Monday to Friday).

In [14]:
# Step 3

# Check data before processing
print("Beginning of historical data df:\n\n", stock_data.head())
print("\nExtra information:\n\n", stock_data.describe())

# Ensure the index is a DateTimeIndex for easy filtering
stock_data.index = pd.to_datetime(stock_data.index)
# Filter out weekends
stock_data = stock_data[stock_data.index.to_series().dt.dayofweek < 5]


Beginning of historical data df:

                   Open        High         Low       Close   Adj Close  \
Date                                                                     
2021-01-04  133.520004  133.610001  126.760002  129.410004  126.544228   
2021-01-05  128.889999  131.740005  128.429993  131.009995  128.108795   
2021-01-06  127.720001  131.050003  126.379997  126.599998  123.796440   
2021-01-07  128.360001  131.630005  127.860001  130.919998  128.020752   
2021-01-08  132.429993  132.630005  130.229996  132.050003  129.125717   

               Volume  
Date                   
2021-01-04  143301900  
2021-01-05   97664900  
2021-01-06  155088000  
2021-01-07  109578200  
2021-01-08  105158200  

Extra information:

              Open        High         Low       Close   Adj Close  \
count  857.000000  857.000000  857.000000  857.000000  857.000000   
mean   158.865963  160.564808  157.292415  158.995519  157.163260   
std     20.422276   20.367330   20.476360   20.42

### **Step 4: Process Historical Data with Model**

First, the SARIMA Model Parameters were defined.

#### **Kevin (please fill in below)**
- SARIMA: (s)easonal + (i)ntegrated + (a)uto (r)egression + (m)oving (a)verage
    - p = number of lagged values included:
    - q = number of time steps in moving average:
    - d = number of differencing sequences: 
    - m = number of observations per year:
    - smoothing choice: explain
    - final model choice: SARIMA(p, d, q)(P, D, Q)m
- p: 1 - 3
- q: 1 - 3
- d: none
- m: 12
- P: 0
- D: 1
- Q: ?

Next, The process_model function was defined to create and configure a SARIMA (Seasonal AutoRegressive Integrated Moving Average) model. The function uses "Close" price of the stock dataframe. The auto_arima function was set up to optimize the SARIMA model parameters (like p, q, P, and Q) with specific settings: a yearly seasonal cycle (m=12), seasonal differencing of order 1, and stepwise selection for efficiency. After fitting, the model was returned for further use.

In [15]:
# Step 4

# Define the process_model function
def process_model(stock_transformed):
    print("Processing SARIMA model...\n")
    
    sarima_model = pm.auto_arima(stock_transformed["Close"], start_p=1, start_q=1,
                                  test='adf',
                                  max_p=3, max_q=3,
                                  m=12,  # 12 is the frequency of the cycle (yearly)
                                  start_P=0,
                                  seasonal=True,  # Set to seasonal
                                  d=None,
                                  D=1,  # Order of the seasonal differencing
                                  trace=False,
                                  error_action='ignore',
                                  suppress_warnings=True,
                                  stepwise=True)
    
    return sarima_model



When passing historical stock data through the auto_arima function (see below code block), this will fit the data in the model by doing the following:

- Stationarity Check: It tests whether the data is stationary using the Augmented Dickey-Fuller test (test='adf').
- Differencing: If necessary, it applies differencing to make the data stationary.
- Seasonality Detection: It looks for any seasonal patterns and models them accordingly (e.g., yearly cycles).
- Parameter Optimization: It tries different combinations of the AR, MA, seasonal AR, and seasonal MA parameters to find the optimal model that minimizes the AIC/BIC (Akaike/Bayesian Information Criteria).
- Model Return: After fitting, the function returns the SARIMA model that has been trained on the historical data (i.e., the model that can now make predictions based on past data).

In [17]:
# Fit the SARIMA model 
sarima_model = process_model(stock_data)

Processing SARIMA model...



### **Step 5: Run Predictions**

The predictions were run by calling a built-in predict function. The n_forecast variable is 12 because of the number of days to forecast. 

In [20]:
# Step 5

# Run predictions
n_forecast = 12
forecast, conf_int = sarima_model.predict(n_periods=n_forecast, return_conf_int=True)


  return get_prediction_index(
  return get_prediction_index(


In [57]:
import yfinance as yf
import pandas as pd
import pmdarima as pm
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np
import matplotlib.pyplot as plt

# Total number of data points
total_days = len(stock_data)

# Defining the train and test periods based on your requested specifications
train_period_days = int(0.5 * total_days)  # 50% of the total period
test_period_days = int(0.5 * total_days)  # 50% of the total period 

# Set initial train-test split periods based on your 3-fold cross-validation requirement
rolling_splits = []

# Period 1
train_period_1 = stock_data.iloc[:train_period_days]  # Train for 684 days

test_period_1 = stock_data.iloc[train_period_days:]  # Test for 172 days
#print("HEREEE\n", test_period_1)

rolling_splits.append({
    "train_dates": (train_period_1.index[0], train_period_1.index[-1]),
    "test_dates": (test_period_1.index[0], test_period_1.index[-1])
})

#print(rolling_splits)

# Defining the train and test periods based on your requested specifications
train_period_days2 = int(0.7 * total_days)  # 70% of the total period
print("HEREEE\n", train_period_days2)
test_period_days2 = int(0.3 * total_days)  # 30% of the total period 
print("HEREEE\n", test_period_days2)
# Period 2 (Extend training window to 2.5 years)
train_period_2 = stock_data.iloc[:train_period_days2]  # Train for 2.5 years (1007 + 252 days)
test_period_2 = stock_data.iloc[train_period_days2:]  # Test for 252 days
rolling_splits.append({
    "train_dates": (train_period_2.index[0], train_period_2.index[-1]),
    "test_dates": (test_period_2.index[0], test_period_2.index[-1])
})

# Defining the train and test periods based on your requested specifications
train_period_day3 = int(0.9 * total_days)  # 50% of the total period
test_period_days3 = int(0.1 * total_days)  # 50% of the total period 

# Period 3 (Extend training window to 3 years)
train_period_3 = stock_data.iloc[:train_period_days]  # Train for 3 years (1007 + 252 + 252 days)
test_period_3 = stock_data.iloc[train_period_days:]  # Test for 152 days (5 months)
rolling_splits.append({
    "train_dates": (train_period_3.index[0], train_period_3.index[-1]),
    "test_dates": (test_period_3.index[0], test_period_3.index[-1])
})

# Step 2: Perform train-test split for each period and fit/predict with SARIMA
for i, split in enumerate(rolling_splits):
    print(f"Period {i+1}:")
    #print (split)
    print(f"  Train dates: {split['train_dates']}")
    print(f"  Test dates: {split['test_dates']}\n")

    # Get the actual data for training and testing
    train_data = stock_data.loc[split["train_dates"][0]:split["train_dates"][1]]
    test_data = stock_data.loc[split["test_dates"][0]:split["test_dates"][1]]
    print("traindata\n", train_data)
    print("testdata\n", test_data)

    # Fit the SARIMA model on the training data
    sarima_model = process_model(train_data)

    # Make predictions on the test data
    forecast, conf_int = sarima_model.predict(n_periods=test_period_days, return_conf_int=True)
    #test_data = test_data.drop(test_data.index[-1])
    # Plot predictions and confidence intervals (optional)
    plt.figure(figsize=(10, 5))
    #plt.plot(test_data.index, test_data['Close'], color='blue', label='Actual')
    #print(test_data.shape)
    #print(test_data.head(), "\n")
    #print(test_data.tail(), "\n")

    print("forecasstttt", forecast)
    forecast_dates = pd.date_range(start=test_data.index[0], periods=test_period_days, freq='B')
    forecast_df = pd.DataFrame(forecast, index=forecast_dates, columns=['Forecast'])
    print(forecast_df)
    #size_predict = forecast.size
    #first_date = forecast.loc[0].date
    #print(size_predict)
    #actual = stock_data[size_predict]
    #print(actual)


    print(forecast.shape)
    print(forecast.head(), "\n")
    print(forecast.tail(), "\n")
    '''
    plt.plot(test_data.index, forecast, color='red', linestyle='--', label='Forecast')
    plt.fill_between(test_data.index, conf_int[:, 0], conf_int[:, 1], color='gray', alpha=0.2)
    plt.title(f"Period {i+1}: Actual vs Forecast")
    plt.legend()
    plt.show()

    '''

'''
    # Evaluate the predictions (using RMSE and MAE as example metrics)
    mae = mean_absolute_error(test_data['Close'], forecast)
    rmse = np.sqrt(mean_squared_error(test_data['Close'], forecast))
    
    print(f"  MAE (Mean Absolute Error) for Period {i+1}: {mae}")
    print(f"  RMSE (Root Mean Squared Error) for Period {i+1}: {rmse}")
    print("------------\n")
'''

HEREEE
 599
HEREEE
 257
Period 1:
  Train dates: (Timestamp('2021-01-04 00:00:00'), Timestamp('2022-09-14 00:00:00'))
  Test dates: (Timestamp('2022-09-15 00:00:00'), Timestamp('2024-05-30 00:00:00'))

traindata
                   Open        High         Low       Close   Adj Close  \
Date                                                                     
2021-01-04  133.520004  133.610001  126.760002  129.410004  126.544228   
2021-01-05  128.889999  131.740005  128.429993  131.009995  128.108795   
2021-01-06  127.720001  131.050003  126.379997  126.599998  123.796440   
2021-01-07  128.360001  131.630005  127.860001  130.919998  128.020752   
2021-01-08  132.429993  132.630005  130.229996  132.050003  129.125717   
...                ...         ...         ...         ...         ...   
2022-09-08  154.639999  156.360001  152.679993  154.460007  152.599472   
2022-09-09  155.470001  157.820007  154.750000  157.369995  155.474426   
2022-09-12  159.589996  164.259995  159.300003 

  return get_prediction_index(
  return get_prediction_index(


forecasstttt 428    154.262325
429    156.812420
430    157.219351
431    159.074026
432    159.074005
          ...    
851    155.338166
852    154.997057
853    154.328502
854    155.287463
855    153.385077
Length: 428, dtype: float64
            Forecast
2022-09-15       NaN
2022-09-16       NaN
2022-09-19       NaN
2022-09-20       NaN
2022-09-21       NaN
...              ...
2024-04-30       NaN
2024-05-01       NaN
2024-05-02       NaN
2024-05-03       NaN
2024-05-06       NaN

[428 rows x 1 columns]
(428,)
428    154.262325
429    156.812420
430    157.219351
431    159.074026
432    159.074005
dtype: float64 

851    155.338166
852    154.997057
853    154.328502
854    155.287463
855    153.385077
dtype: float64 

Period 2:
  Train dates: (Timestamp('2021-01-04 00:00:00'), Timestamp('2023-05-19 00:00:00'))
  Test dates: (Timestamp('2023-05-22 00:00:00'), Timestamp('2024-05-30 00:00:00'))

traindata
                   Open        High         Low       Close   Adj Close  \
D

  return get_prediction_index(
  return get_prediction_index(


forecasstttt 599     175.085831
600     178.462054
601     178.072236
602     176.351551
603     176.102946
           ...    
1022    179.577695
1023    179.823884
1024    179.085718
1025    177.640282
1026    178.345940
Length: 428, dtype: float64
            Forecast
2023-05-22       NaN
2023-05-23       NaN
2023-05-24       NaN
2023-05-25       NaN
2023-05-26       NaN
...              ...
2025-01-02       NaN
2025-01-03       NaN
2025-01-06       NaN
2025-01-07       NaN
2025-01-08       NaN

[428 rows x 1 columns]
(428,)
599    175.085831
600    178.462054
601    178.072236
602    176.351551
603    176.102946
dtype: float64 

1022    179.577695
1023    179.823884
1024    179.085718
1025    177.640282
1026    178.345940
dtype: float64 

Period 3:
  Train dates: (Timestamp('2021-01-04 00:00:00'), Timestamp('2022-09-14 00:00:00'))
  Test dates: (Timestamp('2022-09-15 00:00:00'), Timestamp('2024-05-30 00:00:00'))

traindata
                   Open        High         Low       Close 

  return get_prediction_index(
  return get_prediction_index(


'\n    # Evaluate the predictions (using RMSE and MAE as example metrics)\n    mae = mean_absolute_error(test_data[\'Close\'], forecast)\n    rmse = np.sqrt(mean_squared_error(test_data[\'Close\'], forecast))\n    \n    print(f"  MAE (Mean Absolute Error) for Period {i+1}: {mae}")\n    print(f"  RMSE (Root Mean Squared Error) for Period {i+1}: {rmse}")\n    print("------------\n")\n'

<Figure size 1000x500 with 0 Axes>

<Figure size 1000x500 with 0 Axes>

<Figure size 1000x500 with 0 Axes>

### Step 7: The output was assessed.

In [None]:
# Step 7: Assess output
print(f"Forecasted values: {forecast}")
print(f"Confidence intervals: {conf_int}")

### Step 8: Chart Actuals Against Forecast
Actual closing prices were imported for the stock AAPL into the real_values DataFrame, using data up to June 12, 2024. It then filtered this data to include only the closing prices from January 1 to June 11, 2024, which were stored in plot_real.

A plot was created with a figure size of 12x8 inches, displaying the historical closing prices alongside the forecasted closing prices. The forecast was plotted starting from June 1, 2024, extending for the specified number of forecasted days (n_forecast) on business days (freq='B'). A shaded area representing the 95% confidence interval was added around the forecast line. Finally, the x-axis range was set from April 1 to June 11, 2024, and labels for the title, x-axis, y-axis, and legend were included for clarity.

In [None]:
# Step 8: Visualize
# Import actual closing prices into real_values dataframe
real_values = yf.download("AAPL", end='2024-06-12')
plot_real = real_values.loc['2024-01-01':'2024-06-11', 'Close']
    
# Plot the extended historical data
plt.figure(figsize=(12, 8))
plt.plot(plot_real, label='Historical Close Prices')
plt.plot(pd.date_range(start='2024-06-01', periods=n_forecast, freq='B'), 
         forecast, label='Forecasted Close Prices', color='orange')
plt.fill_between(pd.date_range(start='2024-06-01', periods=n_forecast, freq='B'), 
                 conf_int[:, 0], conf_int[:, 1], color='gray', alpha=0.5, label='95% Confidence Interval')

# Adjust the x-axis and other labels
plt.xlim(pd.Timestamp('2024-04-01'), pd.Timestamp('2024-06-11'))
plt.title(f'Close Price Forecast for {stock_select_str}')
plt.xlabel('Date')
plt.ylabel('Price')
plt.legend()
plt.show()

### Step 9: 
The print_metrics function outputs the chosen comparison metrics for determining accuracy (RMSE, MSE, and MAPE)

In [None]:
def print_metrics(actual, predicted):
    # Calculate RMSE
    rmse = np.sqrt(mean_squared_error(actual, predicted))
    
    # Calculate MSE
    mse = mean_squared_error(actual, predicted)
    
    # Calculate MAPE (Mean Absolute Percentage Error)
    actual_values = np.array(actual)
    predicted_values = np.array(predicted)
    mape = np.mean(np.abs((actual_values - predicted_values) / actual_values[actual_values != 0])) * 100 if np.any(actual_values != 0) else np.nan
    
    # Print the metrics
    print(f"RMSE: {rmse:.4f}")
    print(f"MSE: {mse:.4f}")
    print(f"MAPE: {mape:.2f}%")

# Example usage (assuming 'forecast' and 'plot_real[-n_forecast:]' are defined)
print_metrics(plot_real[-n_forecast:], forecast)


In [None]:
import pandas as pd
import yfinance as yf
from datetime import timedelta

# Load stock data (using a sample ticker here, e.g., "AAPL")
stock_ticker = "AAPL"
stock_data = yf.download(stock_ticker, start="2021-01-01", end="2024-05-31", interval="1d")

# Calculate the total number of days in the dataset
total_days = len(stock_data)
train_days = int(0.8 * total_days)  # 80% of the data for training
test_days = total_days - train_days  # 20% of the data for testing

# Define rolling periods (quarterly)
rolling_splits = []

# Period 1 (Train on first 80%, Test on the next 20%)
train_period_1 = stock_data.iloc[:train_days]
test_period_1 = stock_data.iloc[train_days:train_days+test_days]
rolling_splits.append({"train_dates": (train_period_1.index[0], train_period_1.index[-1]), 
                       "test_dates": (test_period_1.index[0], test_period_1.index[-1])})

# Period 2 (Roll forward by 3 months, Train on 80%, Test on the next 20%)
train_period_2 = stock_data.iloc[:train_days + 30]  # Add 30 days for training
test_period_2 = stock_data.iloc[train_days + 30:train_days + 30 + test_days]
rolling_splits.append({"train_dates": (train_period_2.index[0], train_period_2.index[-1]), 
                       "test_dates": (test_period_2.index[0], test_period_2.index[-1])})

# Period 3 (Roll forward again, Train on 80%, Test on the next 20%)
train_period_3 = stock_data.iloc[:train_days + 60]  # Add another 30 days for training
test_period_3 = stock_data.iloc[train_days + 60:train_days + 60 + test_days]
rolling_splits.append({"train_dates": (train_period_3.index[0], train_period_3.index[-1]), 
                       "test_dates": (test_period_3.index[0], test_period_3.index[-1])})

# Output train-test splits for each period
for i, split in enumerate(rolling_splits):
    print(f"Period {i+1}:")
    print(f"  Train dates: {split['train_dates']}")
    print(f"  Test dates: {split['test_dates']}\n")


In [None]:
# Example of Cross-Validation Splits
cv_splits = []

# Fold 1 (Train on 2021-2022, Test on 2023)
train_1 = stock_data.loc["2021-01-01":"2022-12-31"]
test_1 = stock_data.loc["2023-01-01":"2023-12-31"]
cv_splits.append({"train_dates": (train_1.index[0], train_1.index[-1]), "test_dates": (test_1.index[0], test_1.index[-1])})

# Fold 2 (Train on 2021-2023, Test on 2023 mid-year to 2024)
train_2 = stock_data.loc["2021-01-01":"2023-06-30"]
test_2 = stock_data.loc["2023-07-01":"2024-05-31"]
cv_splits.append({"train_dates": (train_2.index[0], train_2.index[-1]), "test_dates": (test_2.index[0], test_2.index[-1])})

# Fold 3 (Train on 2021-2023, Test on 2024)
train_3 = stock_data.loc["2021-01-01":"2023-12-31"]
test_3 = stock_data.loc["2024-01-01":"2024-05-31"]
cv_splits.append({"train_dates": (train_3.index[0], train_3.index[-1]), "test_dates": (test_3.index[0], test_3.index[-1])})

# Output Cross-Validation Splits
for i, split in enumerate(cv_splits):
    print(f"Fold {i+1}:")
    print(f"  Train dates: {split['train_dates']}")
    print(f"  Test dates: {split['test_dates']}\n")
