# 🌞 Summer School 2024: Advanced Data Science and Time Series Analysis

Welcome to Summer School 2024! This session is designed to take you on an in-depth journey through advanced data science concepts, with a special focus on time series analysis, model evaluation, and visualization using Python. Whether you're a beginner or an experienced practitioner, this notebook will guide you through practical applications in data preprocessing, feature selection, and model performance evaluation, with a specific emphasis on working with real-world datasets like space weather data.

## 📚 Learning Objectives

By the end of this session, you will be able to:
- Load, clean, and preprocess data using `pandas`
- Perform correlation analysis and visualize relationships between variables
- Prepare and manipulate time series data for autoregressive modeling
- Train and evaluate a Bidirectional LSTM model for time series prediction
- Assess model performance using both regression and classification metrics
- Visualize results to draw meaningful insights from data

## 🛠️ Tools and Libraries

Throughout this notebook, we use the following libraries:
- **Pandas**: For data manipulation, loading, and preprocessing
- **NumPy**: For array operations and numerical computations
- **Seaborn**: For creating intuitive and visually appealing statistical plots
- **Matplotlib**: The backbone for all our visualizations and plots
- **Scikit-learn**: For feature selection and model evaluation metrics
- **SciPy**: For the Pearson correlation coefficient used in model evaluation
- **TensorFlow/Keras**: For building, training, and evaluating our LSTM models

## 📊 Dataset Overview

The dataset we'll be using is sourced from the `omni_full.csv` file, which contains a range of space weather-related attributes. This data is vital for understanding the effects of solar activities on Earth's magnetosphere, and in this session, we will focus on predicting the Disturbance Storm Time (DST) index—a key measure of geomagnetic activity.

## 🔍 What You Will Do

1. **Data Loading and Exploration**:
   - Start by loading the dataset and exploring its structure to gain a clear understanding of the attributes available for analysis.

2. **Data Cleaning and Preprocessing**:
   - Clean the dataset by removing unnecessary columns and handling missing values. This step ensures that the data is ready for analysis and modeling.

3. **Correlation Analysis and Visualization**:
   - Perform a detailed correlation analysis to identify relationships between different attributes, and visualize these relationships using heatmaps.

4. **Feature Selection and Data Splitting**:
   - Select the most relevant features for predicting the DST index using statistical methods, and split the dataset into training, validation, and test sets to ensure robust model evaluation.

5. **Time Series Data Preparation**:
   - Prepare the time series data for an autoregressive LSTM model using `TimeseriesGenerator`. Understand how to manipulate sequential data for time series forecasting.

6. **Model Training and Evaluation**:
   - Train a Bidirectional LSTM model to predict future values of the DST index. Evaluate the model's performance using key metrics such as Mean Squared Error (MSE), Pearson Correlation Coefficient (PCC), and Matthews Correlation Coefficient (MCC).

7. **Final Model Evaluation**:
   - Apply both regression and classification metrics to assess the final model's performance on unseen test data, ensuring that it generalizes well to new observations.

## 📈 Let's Get Started!

Follow the instructions in each section of this notebook, and feel free to experiment with different parameters and approaches. This is a hands-on learning experience, so dive in and explore the fascinating world of data science and time series analysis. If you have any questions or need assistance, don't hesitate to ask. Let's make this session both educational and enjoyable!

## 📦 Importing Necessary Libraries

We start by importing the libraries and modules used throughout this notebook. These cover data manipulation, visualization, feature selection, and model training.


In [None]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

# Scikit-learn modules for feature selection and evaluation metrics
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, matthews_corrcoef

# TensorFlow/Keras modules for model creation, training, and evaluation
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.layers import Input, LSTM, Bidirectional, TimeDistributed, Dense, Flatten
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping


## 🌐 Running This Notebook on Google Colab

To ensure that everyone can follow along and execute the notebook without any local setup issues, we recommend using Google Colab. Google Colab is a free, cloud-based platform that allows you to run Jupyter notebooks with access to powerful computing resources, including GPUs.

### Steps to Open and Run the Notebook on Google Colab:

1. **Open the Notebook**:
   - Click on the following link to open the notebook directly in Google Colab: [Open in Google Colab]().

2. **Clone the Repository**:
   - Once the notebook is open in Colab, you'll need to clone the entire repository to access all necessary files and data.
   - Run the following command in a new code cell to clone the repository:


In [None]:
!git clone https://github.com/VieraMaslej/summer_school_2024.git

3. **Install Git LFS (Large File Storage):**

    - Git LFS is used to handle large files in the repository. If you encounter any issues with file sizes or missing files, it's likely because Git LFS hasn't been installed or initialized.
    - Run the following commands to install Git LFS and initialize it:
    


In [None]:
!sudo apt-get install git-lfs
!git lfs install


4. **Pull the LFS Files:**

    - Navigate to the `data` directory within the cloned repository and pull the large files tracked by Git LFS:
    
*Comment*: The `%cd` command is used to change the working directory to `data` within the repository. Then, `git lfs pull` ensures that any large files tracked by LFS are downloaded.


In [None]:
%cd summer_school_2024/data
!git lfs pull


5. **Create Necessary Directories:**

    - Create the `results` and `models` directories if they do not already exist. These directories will store your output files and trained models:
    
*Comment*: The `-p` flag lets `mkdir` succeed silently when a directory already exists, so these cells can be re-run without errors.
    

In [None]:
!mkdir -p ../results
!mkdir -p ../models


6. **Navigate Back to the Notebook Directory:**

    - Return to the `notebook` directory to continue running the notebook:

In [None]:
%cd ../notebook

## 📂 Loading the Dataset

Here, we load our dataset from the CSV file `omni_full.csv`. This dataset contains various space weather-related attributes that we will analyze and visualize in the following sections.

The OMNI dataset is available online. The dataset was obtained using the Aidapy and Heliopy libraries.

Source of the dataset: [https://spdf.gsfc.nasa.gov/pub/data/omni/low_res_omni/](https://spdf.gsfc.nasa.gov/pub/data/omni/low_res_omni/)


In [None]:
df = pd.read_csv('../data/omni_full.csv')

## 🔍 Exploring the Dataset

To better understand the structure of our dataset, we start by examining the columns it contains. This will give us an overview of the attributes available for analysis.


In [None]:
df.columns

## 🧹 Cleaning the Dataset

In this step, we clean the dataset by removing unnecessary columns that won't be needed for our analysis. This helps to simplify the dataset and focus on the most relevant attributes.


In [None]:
df = df.drop(columns = ['Unnamed: 0.1', 'Unnamed: 0', 'BZ_GSM-1', 'BZ_GSM-2', 'BZ_GSM-3', 'BZ_GSM-4', 'BZ_GSM-5', 'BZ_GSM-6',
       'BZ_GSM-7', 'BZ_GSM-8', 'BZ_GSM-9', 'BZ_GSM-10', 'BZ_GSM-11',
       'BZ_GSM-12', 'BZ_GSM-13', 'BZ_GSM-14', 'BZ_GSM-15', 'BZ_GSM-16',
       'BZ_GSM-17', 'BZ_GSM-18', 'BZ_GSM-19', 'BZ_GSM-20', 'BZ_GSM-21',
       'BZ_GSM-22', 'BZ_GSM-23', 'BZ_GSM-24'])

## 🏗️ Checking Dataset Dimensions

After cleaning the dataset, it's important to check the shape of the data. This will tell us the number of rows and columns that remain, giving us a sense of the dataset's size and scope.


In [None]:
df.shape

## 📊 Descriptive Statistics for DST

The `DST` column is one of the key attributes in our dataset. Here, we calculate descriptive statistics for this column to understand its distribution, including metrics like mean, standard deviation, and range.


In [None]:
df["DST"].describe()

## 🚨 Checking for Missing Values

Finally, we count the missing values in every column, including `DST`. Missing data can affect our analysis, so it's important to identify any gaps early on so they can be handled appropriately.


In [None]:
df.isna().sum()

## 🔗 Correlation Analysis

In this section, we will perform a correlation analysis on our dataset to understand the relationships between different attributes. Correlation measures how strongly two variables are related to each other, with values ranging from -1 (perfect negative correlation) to +1 (perfect positive correlation). 

### Steps:
1. **Calculate the Correlation Matrix**: We start by calculating the correlation matrix for the entire dataset. This matrix will show us the correlation coefficients between all pairs of attributes.
  
2. **Determine the Total Number of Attributes**: Next, we determine the total number of attributes in our dataset. This will help us in dividing the attributes into two halves for easier visualization.

3. **Splitting Attributes into Two Halves**: To make the heatmaps more readable, we split the attributes into two halves:
   - **First Half**: Contains the first half of the attributes.
   - **Second Half**: Contains the second half of the attributes.
   
4. **Heatmap for the First Half**: We create a heatmap for the first half of the attributes, showing how each attribute correlates with `DST`. The `vlag` color palette is used to intuitively represent the correlation, with the color scale ranging from -1 to +1.
  
5. **Heatmap for the Second Half**: Similarly, we create a heatmap for the second half of the attributes, focusing again on their correlation with `DST`.

### Visualization Goals:
- **Identify Strong Correlations**: The heatmaps will help us quickly identify which attributes have strong positive or negative correlations with `DST`.
- **Focus on Important Attributes**: By visualizing the correlations, we can focus on the most significant attributes for further analysis or model development.

The heatmaps will be saved as images (`corr1.png` and `corr2.png`) for easy reference and sharing.


In [None]:
# Ensure that only numeric columns are selected for the correlation matrix
numeric_df = df.select_dtypes(include=[np.number])

# Calculate the correlation matrix only for numeric columns
correlation_matrix = numeric_df.corr()

# Determine the total number of attributes
total_attributes = len(correlation_matrix.columns)

# Find the middle index to split the attributes into two halves
mid_index = total_attributes // 2

# First half of the attributes
first_half = correlation_matrix.columns[:mid_index]

# Second half of the attributes
second_half = correlation_matrix.columns[mid_index:]

# Create a heatmap for the first half of attributes in correlation with DST
plt.figure(figsize=(15, 5))
sns.heatmap(correlation_matrix.loc[['DST'], first_half], annot=True, fmt=".2f", cmap="vlag", vmin=-1, vmax=1)
plt.title('Correlation heatmap for the first half of attributes with DST')
plt.savefig("../results/corr1")  # Save the figure as "corr1"
plt.show()

# Create a heatmap for the second half of attributes in correlation with DST
plt.figure(figsize=(15, 5))
sns.heatmap(correlation_matrix.loc[['DST'], second_half], annot=True, fmt=".2f", cmap="vlag", vmin=-1, vmax=1)
plt.title('Correlation heatmap for the second half of attributes with DST')
plt.savefig("../results/corr2")  # Save the figure as "corr2"
plt.show()


## 🛠️ Data Preparation and Feature Selection

In this section, we proceed with data preparation and feature selection, focusing on identifying the most relevant attributes for predicting the `DST` variable:

1. **Creating a New DataFrame for Analysis**: We create a new DataFrame `df_analysis` for the current experiment to ensure that the original data, including the `time1` attribute, is preserved for future use.

2. **Removing the `time1` Attribute**: The `time1` column, which is not needed for this specific analysis, is removed from `df_analysis`. This allows us to focus on the other attributes without introducing potential noise from the timestamp data.

3. **Handling Missing Values**: Any missing values in `df_analysis` are replaced with the median value of their respective columns. This step ensures that the dataset is complete and ready for further analysis.

4. **Target Variable (`y`)**: The target variable `DST` is extracted from `df_analysis` and stored in the variable `y`. This will be the variable we aim to predict.

5. **Input Attributes (`X`)**: All input attributes, excluding `DST`, are stored in the variable `X`. These attributes will be used as predictors in our model.

6. **Feature Labels**: The names of the input attributes (excluding `DST`) are stored in the variable `feature_labels`. This will help us identify which features are selected as most relevant.

7. **Feature Selection**: 
   - We use the `SelectKBest` selector with the `f_regression` scoring function to select the top 10 most relevant features for predicting `DST`.
   - The selected features are those with the highest F-scores. For a single feature, the F-score grows with the squared Pearson correlation between that feature and `DST`, so it captures linear relationships only.

8. **Visualizing the Selected Features**: A horizontal bar chart is plotted to visualize the top 10 selected features based on their F-scores. This visualization helps to identify which attributes are most significant for predicting `DST`.

9. **Output Selected Features**: Finally, the selected features and their corresponding F-scores are printed for further inspection. This output provides a clear understanding of which features are most important for the prediction task.

The resulting plot is saved as "KBest.png" and displayed for easy reference and documentation.
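
To make the link between correlation and the F-score concrete, here is a minimal sketch in plain Python. It assumes the standard univariate regression F-test, in which the score for one feature is `r² / (1 - r²) * (n - 2)` with `r` the Pearson correlation. The helper names and the toy values are hypothetical and are not taken from `omni_full.csv`.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def univariate_f_score(x, y):
    """F-statistic of a one-feature linear regression of y on x."""
    r = pearson_r(x, y)
    dof = len(x) - 2  # two fitted parameters: slope and intercept
    return r ** 2 / (1 - r ** 2) * dof

# Toy data (hypothetical values)
target = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
strong_feature = [1.1, 1.9, 3.2, 3.8, 5.1, 6.3]  # nearly linear in target
weak_feature = [3.0, 1.0, 4.0, 1.0, 5.0, 2.0]    # only weakly related to target

f_strong = univariate_f_score(strong_feature, target)
f_weak = univariate_f_score(weak_feature, target)
print(f"strong feature F-score: {f_strong:.1f}")
print(f"weak feature F-score:   {f_weak:.3f}")
```

Because the score depends only on the squared correlation, a feature with a strong but non-linear relationship to `DST` can still receive a low F-score.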


In [None]:
# Create a new DataFrame for the current analysis
df_analysis = df.copy()

# Removing the time1 attribute from df_analysis for the current experiment
df_analysis = df_analysis.drop(columns=['time1'])

# Replacing missing values with the median in df_analysis
df_analysis.fillna(df_analysis.median(), inplace=True)

# The target variable DST is stored in the variable y
y = df_analysis['DST'].values

# Input attributes are stored in the variable X (all columns except DST)
X = df_analysis.drop(columns=['DST']).values

# The names of the attributes (except DST) are stored in the variable feature_labels
feature_labels = df_analysis.drop(columns=['DST']).columns.values

# The SelectKBest model with the f_regression score function is used to select the 10 best attributes (numFeatures).
numFeatures = 10
fSelect_model = SelectKBest(score_func=f_regression, k=numFeatures)
X_fSelect = fSelect_model.fit_transform(X, y)
scores = fSelect_model.scores_
order = np.argsort(scores)
ordered_feature_labels = feature_labels[order]
y_pos = np.arange(len(feature_labels))

# Plotting the graph
plt.figure(figsize=(10, 8))
plt.barh(y_pos[-numFeatures:], scores[order][-numFeatures:], align='center', color="#EFBF7B")
plt.yticks(y_pos[-numFeatures:], ordered_feature_labels[-numFeatures:], fontsize=12)
plt.xlabel('F-score', fontsize=15)
plt.title('Top 10 Features Based on F-score', fontsize=15)
plt.grid(axis='x', linestyle='--', alpha=0.6) 
plt.savefig("../results/KBest")
plt.show()

# Extracting selected feature labels and their scores
selected_feature_labels = ordered_feature_labels[-numFeatures:]
selected_scores = scores[order][-numFeatures:]

# Printing selected features and their F-scores
print("Selected features and their F-scores:")
for label, score in zip(selected_feature_labels, selected_scores):
    print(f"{label}: {score}")


## 📊 Data Splitting for Training, Validation, and Testing

In this section, we split the dataset into training, validation, and test sets. This is a crucial step in machine learning workflows to ensure that the model is trained, validated, and tested on separate portions of the data. Here’s how the data is divided:

1. **Convert `time1` to Datetime**: The `time1` column is converted to a datetime format to facilitate time-based splitting.

2. **Splitting the Data**:
   - The data is split into training and test sets based on a specific date (`2003-11-19`).
   - The training set contains all data points before this date, while the test set contains all data points from this date onwards.

3. **Creating the Validation Set**:
   - The last 20% of the training set is further split off to create a validation set. This set is used to tune and validate the model during training.

4. **Visualizing the Data Splits**:
   - The `DST` index is plotted over time for the training, validation, and test sets. Different colors are used to distinguish between the different sets, making it easier to visualize how the data is divided.

5. **Saving the Splits**:
   - The resulting training and test datasets are saved to CSV files (`train_omni.csv` and `test_omni.csv`) for future use in model training and evaluation.

This structured approach ensures that the model can be trained on one part of the data, validated on another, and finally tested on unseen data to evaluate its performance.
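
The splitting logic can be sketched on a toy series before applying it to the real data. The example below uses only the standard library; the timestamps and readings are hypothetical, and the variable names (`train_part`, `valid_part`, `test_part`) are chosen for illustration only.

```python
from datetime import datetime, timedelta

# Toy hourly series (hypothetical values): 20 timestamps with a dummy reading
start = datetime(2003, 11, 18, 9)
records = [(start + timedelta(hours=h), -1.0 * h) for h in range(20)]

split_date = datetime(2003, 11, 19)

# Rows before the split date go to training; rows from that date onwards go to the test set
train_part = [r for r in records if r[0] < split_date]
test_part = [r for r in records if r[0] >= split_date]

# Hold out the most recent 20% of the training rows for validation.
# The rows are never shuffled, so validation data always follows training data in time.
valid_size = int(len(train_part) * 0.2)
cut = len(train_part) - valid_size
valid_part = train_part[cut:]
train_part = train_part[:cut]

print(len(train_part), len(valid_part), len(test_part))
```

Keeping the three sets in chronological order prevents the model from being validated or tested on observations that precede its training data.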


In [None]:
# Convert the time1 column to datetime format
df['time1'] = pd.to_datetime(df['time1'])

# Split the data based on the date '2003-11-19'
split_index = df[df['time1'] >= '2003-11-19'].index.min()
train = df.iloc[:split_index, :]
test = df.iloc[split_index:, :]

# Display the last few rows of the training set to verify the split
train.tail()

In [None]:
# Display the first few rows of the test set to verify the split
test.head()

In [None]:
# Keep only relevant columns in the test set
test = test[["time1", "KP", "DST"]]

# Print the number of records in the test and train sets
print(len(test))
print(len(train))

# Create a validation set from the last 20% of the training data
valid_size = int(len(train) * 0.2)
valid = train.iloc[-valid_size:, :].copy()
train = train.iloc[:-valid_size, :].copy()

# Plotting the DST index for training, validation, and test sets
plt.rcParams['figure.figsize'] = [12, 8]
plt.figure()
plt.title("DST Index - Data Splitting")
plt.plot(train['time1'], train['DST'], label='Training Data', color="#FFA07A") 
plt.plot(valid['time1'], valid['DST'], label='Validation Data', color="#20B2AA") 
plt.plot(test['time1'], test['DST'], label='Test Data', color="#9370DB")  
plt.legend()  # Add the legend before saving so that it appears in the saved image
plt.savefig("../results/dataset_cut.png")
plt.show()

# Save the test and train datasets to CSV files
test.to_csv('../data/test_omni.csv')
train.to_csv('../data/train_omni.csv')


## 🔄 Autoregressive Model Data Preparation

In this section, we prepare the data for training an autoregressive model. An autoregressive model predicts the future values of a time series based on its own previous values.

## ⏳ Time Series Data Preparation and Generators

In this section, we prepare the time series data and set up the generators needed for model training, validation, and testing.

### What is a TimeseriesGenerator?

The `TimeseriesGenerator` is a powerful tool for preparing sequential data for time series forecasting models. It automatically generates batches of input-output pairs for your model by sliding a window over your dataset. The generator allows you to specify how many previous time steps (`n_input`) to include for each prediction.

### How Does It Work?

1. **Sliding Window**: 
   - The generator slides a window of a specified length (`n_input`) over the dataset. For each position of the window, it extracts a sequence of past observations as the input (`X`) and the next observation as the output (`y`).

2. **Batch Creation**: 
   - These sequences are grouped into batches, with the size of each batch defined by `batch_size`. During training, the model processes one batch at a time.

3. **Example**:
   - Suppose you have a dataset of daily temperatures over 10 days: `[10, 11, 12, 13, 14, 15, 16, 17, 18, 19]`.
   - With `n_input=3` and `batch_size=2`, the `TimeseriesGenerator` would create the following sequences:
     - **Batch 1**:
       - `X`: `[[10, 11, 12], [11, 12, 13]]`
       - `y`: `[13, 14]`
     - **Batch 2**:
       - `X`: `[[12, 13, 14], [13, 14, 15]]`
       - `y`: `[15, 16]`
   - The generator continues sliding the window across the data to create all possible sequences.
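
The windowing in the temperature example can be reproduced with a few lines of plain Python. This is a sketch of the idea only, not the Keras implementation; `sliding_window_batches` is a hypothetical helper written for this illustration.

```python
def sliding_window_batches(series, n_input, batch_size):
    """Yield (X, y) batches: each X row holds n_input past values, y is the next value."""
    samples = [
        (series[i:i + n_input], series[i + n_input])
        for i in range(len(series) - n_input)
    ]
    for start in range(0, len(samples), batch_size):
        chunk = samples[start:start + batch_size]
        yield [x for x, _ in chunk], [y for _, y in chunk]

temperatures = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
batches = list(sliding_window_batches(temperatures, n_input=3, batch_size=2))
for number, (X, y) in enumerate(batches, start=1):
    print(f"Batch {number}: X={X}, y={y}")
```

Ten observations with a window of three give seven input-output pairs, so the last batch contains a single pair.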

### Setting Up TimeseriesGenerator

We define the following parameters:

- **n_input**: The number of previous time steps to consider for each prediction.
- **b_size**: The number of samples to process in each batch.
- **n_features**: The number of features per time step. The model takes only past `DST` values as input, so each time step carries a single feature.



In [None]:
from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator

# Define the target column to predict
y_col = 'DST'

# Extract the target (y) and input (X) variables for training, validation, and test sets
y_train = train[y_col].values.copy()
X_train = train['DST'].values.copy()
y_val = valid[y_col].values.copy()
X_val = valid['DST'].values.copy()
y_test = test[y_col].values.copy()
X_test = test['DST'].values.copy()

# Time series generator settings
n_input = 6  # Number of previous time steps to consider for each prediction
n_features = 1  # Number of features per time step (only DST is used as input)
b_size = 256  # Batch size

# Create TimeseriesGenerators for training, validation, and test sets
train_generator = TimeseriesGenerator(X_train, y_train, length=n_input, batch_size=b_size)
val_generator = TimeseriesGenerator(X_val, y_val, length=n_input, batch_size=b_size)
test_generator = TimeseriesGenerator(X_test, y_test, length=n_input, batch_size=b_size)

# Inspect the generators
print("Number of batches: ", len(train_generator))
print("Each batch contains features (X component) and labels (y component): ", len(train_generator[0]))
print("Length of the X component of a batch: ", len(train_generator[0][0]))
print("Length of the y component of a batch (number of measurements in the batch): ", len(train_generator[0][1]))
print("Number of rows considered for one measurement (how many steps back): ", len(train_generator[0][0][0]))


## 🔍 Inspecting the First Batch from TimeseriesGenerator

In this section, we verify the correctness of the `TimeseriesGenerator` by inspecting the first batch generated from the training data. This process involves the following steps:

1. **Extract the First Batch**:
   - We extract the first batch of input sequences (`X_train_batch`) and target values (`y_train_batch`) from the training generator.

2. **Print the Batch**:
   - We print the contents of the first batch to visually inspect the sequences and their corresponding target values.

3. **Manual Verification**:
   - We manually verify the first few sequences to ensure that the `TimeseriesGenerator` is correctly sliding over the data and generating the expected input-output pairs.

This inspection step helps ensure that the data preparation is functioning as intended before moving on to model training.


In [None]:

# Inspect the first batch from the train generator to verify correctness
X_train_batch, y_train_batch = train_generator[0]

print("First batch from the training generator:")
print("X (input sequences):")
print(X_train_batch)
print("y (target values):")
print(y_train_batch)
print()

# Manually verify the first few sequences
print("Manually verifying the first few sequences:")
for i in range(min(5, len(X_train_batch))):  # Verify first 5 sequences or less if batch size is smaller
    print(f"Sequence {i+1}:")
    print(f"X: {X_train_batch[i]}")
    print(f"Expected y: {y_train_batch[i]}")
    print()


## 🚀 Training an Autoregressive LSTM Model

In this section, we build, compile, and train a Bidirectional LSTM model designed to predict future values of the `DST` index based on its past values. The model is trained using the sequences generated by the `TimeseriesGenerator` and incorporates techniques to ensure robust training and prevent overfitting.

### Model Architecture:

1. **Bidirectional LSTM Layers**:
   - The model begins with a `Bidirectional LSTM` layer, allowing the model to capture patterns in the time series data in both forward and backward directions.
   - This is followed by another `LSTM` layer to further process the sequential data.

2. **TimeDistributed Layer**:
   - A `TimeDistributed` dense layer applies the same dense layer to each time step of the sequence, maintaining the temporal structure of the data.

3. **Output Layer**:
   - The output from the sequence is flattened and passed through a dense layer to predict the next value in the time series.

4. **Model Compilation**:
   - The model is compiled with Mean Squared Error (`mse`) as the loss function and the Adam optimizer for efficient training. The Mean Absolute Error (`mae`) is used as a metric to monitor performance.
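
To see how the data changes shape from layer to layer, the sketch below works through the output shapes and trainable parameter counts by hand. It assumes 128 units per LSTM, a window of 6 time steps with one feature, standard LSTM layers with bias terms, and a `Bidirectional` wrapper that concatenates its two directions (the Keras default). The helper functions are hypothetical and written only for this illustration.

```python
def lstm_params(input_dim, units):
    """Trainable weights of a standard LSTM: 4 gates, each with kernel, recurrent kernel, and bias."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    """Trainable weights of a dense layer: one weight per input-unit pair plus one bias per unit."""
    return input_dim * units + units

n_input = 6   # window length (time steps per sample)
units = 128   # LSTM units per direction

# (layer description, output shape per sample, trainable parameters)
layers = [
    ("Bidirectional(LSTM(128))", (n_input, 2 * units), 2 * lstm_params(1, units)),
    ("LSTM(128)", (n_input, units), lstm_params(2 * units, units)),
    ("TimeDistributed(Dense(1))", (n_input, 1), dense_params(units, 1)),
    ("Flatten", (n_input,), 0),
    ("Dense(1)", (1,), dense_params(n_input, 1)),
]

for name, shape, params in layers:
    print(f"{name:28s} output shape {str(shape):10s} params {params}")

total_params = sum(params for _, _, params in layers)
print("Total trainable parameters:", total_params)
```

If these assumptions hold for the model you build, the total should agree with the figure reported by `model.summary()`.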

### Model Training:

1. **Checkpoint and Early Stopping**:
   - The model is trained with callbacks for saving the best model (based on validation MAE) and stopping early if the validation performance does not improve, which helps prevent overfitting.

2. **Training Process**:
   - The model is trained for up to 100 epochs (early stopping may end training sooner), with both training and validation data fed from the `TimeseriesGenerator`.
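
The idea behind early stopping with a patience setting can be sketched without any deep learning library. The function below is a simplified illustration, not the Keras callback itself: it stops once `patience` epochs in a row have passed without beating the best validation score so far. The function name and the validation history are hypothetical.

```python
def early_stopping_epoch(val_scores, patience):
    """Return (stop_epoch, best_epoch) for a metric that should decrease.

    Epochs are numbered from 0. If the patience is never exhausted,
    stop_epoch is the last epoch in the history.
    """
    best_score = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, score in enumerate(val_scores):
        if score < best_score:
            best_score, best_epoch = score, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_scores) - 1, best_epoch

# Hypothetical validation MAE per epoch
val_mae_history = [9.0, 7.5, 7.8, 7.6, 7.7, 6.0]

stop_epoch, best_epoch = early_stopping_epoch(val_mae_history, patience=3)
print(f"Stopped after epoch {stop_epoch}; best epoch was {best_epoch}")
```

With a small patience, training here stops before the later improvement at the final epoch is ever reached, which is why a larger patience can be worthwhile when the validation metric is noisy.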

### Prediction:

- After training, the best model can be loaded, and predictions can be made on the test data using the trained LSTM model.



In [None]:
# Define the LSTM-based autoregressive model
inputs = Input(shape=(n_input, 1))
layers = Bidirectional(LSTM(128, return_sequences=True, dropout=0.1, recurrent_dropout=0.1))(inputs)
layers = LSTM(128, return_sequences=True)(layers)
layers = TimeDistributed(Dense(1, activation='linear'))(layers)
output = Flatten()(layers)
output = Dense(1, activation='linear')(output)

model = Model(inputs=inputs, outputs=output)
model.compile(loss='mse', optimizer='adam', metrics=["mae"])
model.summary()  # summary() prints the table itself; wrapping it in print() adds a stray "None"

# Define the model checkpoint and early stopping callbacks
saved_model = "../models/1h_dopredu_DST.keras"
checkpoint = ModelCheckpoint(saved_model, monitor='val_mae', verbose=1, save_best_only=True, mode='min')
early = EarlyStopping(monitor="val_mae", mode="min", patience=25)
callbacks_list = [checkpoint, early]

# Train the model with the TimeseriesGenerator
history = model.fit(train_generator, validation_data=val_generator, epochs=100, verbose=1, callbacks=callbacks_list)


In [None]:
# Load the best saved model
model = load_model(saved_model)
# Make predictions on the test data
y_pred = model.predict(test_generator)

## 📊 Signal Comparison Metrics

In this section, we evaluate the performance of the model in predicting the `DST` values using various signal comparison metrics. These metrics help us understand the magnitude of the error, the correlation between predicted and actual values, and the overall signal quality.

### 1. Mean Squared Error (MSE)
- **Definition**: MSE measures the average of the squares of the errors—that is, the average squared difference between the predicted and actual values. It provides a measure of the magnitude of the error.
- **Formula**: 
  $$
  \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
  $$
  
### 2. Root Mean Squared Error (RMSE)
- **Definition**: RMSE is the square root of MSE and provides an error metric in the same units as the predicted values, making it easier to interpret.
- **Formula**: 
  $$
  \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
  $$

### 3. Pearson Correlation Coefficient (PCC)
- **Definition**: PCC measures the linear correlation between the predicted and actual values. It ranges from `-1` to `1`, with `1` indicating a perfect positive correlation.
- **Formula**: 
  $$
  \text{PCC} = \frac{\text{Cov}(y, \hat{y})}{\sigma_y \sigma_{\hat{y}}}
  $$

### 4. Signal-to-Noise Ratio (SNR)
- **Definition**: SNR compares the level of the desired signal to the level of background noise. A higher SNR indicates that the signal is much stronger than the noise.
- **Formula**: 
  $$
  \text{SNR} = 10 \log_{10}\left(\frac{\sum_{i=1}^{n} y_i^2}{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}\right)
  $$
  
- **Interpretation**:
  - **High SNR**: Indicates a strong signal compared to noise. Generally, an SNR above 20 dB is considered good, with values above 40 dB being excellent.
  - **Low SNR**: Indicates that the noise level is comparable to or greater than the signal level, which can make the signal difficult to distinguish. An SNR below 10 dB is typically considered poor.
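
The four formulas above can be checked by hand on a tiny example. The sketch below uses only the standard library; the actual and predicted values are hypothetical, and `signal_metrics` is a helper written for this illustration.

```python
import math

def signal_metrics(actual, predicted):
    """Compute MSE, RMSE, PCC, and SNR (in dB) for two equal-length sequences."""
    n = len(actual)
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    mse = sum(squared_errors) / n
    rmse = math.sqrt(mse)

    mean_a, mean_p = sum(actual) / n, sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    pcc = cov / math.sqrt(var_a * var_p)

    # Signal energy divided by error energy, expressed in decibels
    snr_db = 10 * math.log10(sum(a ** 2 for a in actual) / sum(squared_errors))
    return {"mse": mse, "rmse": rmse, "pcc": pcc, "snr_db": snr_db}

# Hypothetical DST-like values
actual = [-10.0, -20.0, -30.0, -40.0]
predicted = [-12.0, -18.0, -33.0, -39.0]

metrics = signal_metrics(actual, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Here the squared errors are 4, 4, 9, and 1, so the MSE is 18 / 4 = 4.5, and the SNR compares the signal energy of 3000 with the error energy of 18.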


In [None]:
from sklearn.metrics import mean_squared_error
from scipy.stats import pearsonr
import numpy as np

# Signal Comparison Metrics

# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test[n_input:], y_pred.reshape(-1))
print(f"Mean Squared Error: {mse}")

# Calculate Root Mean Squared Error (RMSE)
rmse = np.sqrt(mse)
print(f"Root Mean Squared Error: {rmse}")

# Calculate Pearson Correlation Coefficient (PCC)
pcc, _ = pearsonr(y_test[n_input:], y_pred.reshape(-1))
print(f"Pearson Correlation Coefficient: {pcc}")

# Calculate Signal-to-Noise Ratio (SNR)
snr = 10 * np.log10(np.sum(y_test[n_input:]**2) / np.sum((y_test[n_input:] - y_pred.reshape(-1))**2))
print(f"Signal-to-Noise Ratio: {snr}")


## 🧮 Classification Metrics: Evaluating Model Performance with MCC and Other Metrics

In this section, we evaluate the performance of the trained LSTM model on the validation set using various classification metrics, including the Matthews Correlation Coefficient (MCC), confusion matrix, classification report, and accuracy. These metrics provide a comprehensive understanding of the model's ability to distinguish between classes, particularly in imbalanced datasets.

### 1. Threshold Selection Using MCC

The first step in evaluating the model's performance is to determine the optimal threshold for classifying `DST` values as events or non-events. This is done using the Matthews Correlation Coefficient (MCC).

#### What is MCC?

The Matthews Correlation Coefficient (MCC) is a single-value measure that takes into account true and false positives and negatives, providing a balanced evaluation of classification performance. It returns a value between `-1` and `+1`:
- **`+1`**: Indicates a perfect prediction.
- **`0`**: Indicates no better performance than random guessing.
- **`-1`**: Indicates total disagreement between predictions and actual values.

#### Advantages of Using MCC:

1. **Handles Imbalanced Datasets**:
   - Unlike accuracy, MCC provides a more informative and reliable score when the classes are imbalanced. This makes it ideal for datasets where one class (e.g., extreme `DST` events) is much rarer than the other.

2. **Balanced Evaluation**:
   - MCC considers all four quadrants of the confusion matrix (true positives, false positives, true negatives, and false negatives), offering a balanced evaluation even when one class is underrepresented.

3. **Single Metric**:
   - MCC combines the information from precision, recall, and specificity into a single metric, making it easier to compare the performance of different models or thresholds without needing to calculate multiple metrics.
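
A small example shows why accuracy alone can mislead on imbalanced data. The sketch below computes accuracy and MCC from the confusion-matrix counts in plain Python. The labels are hypothetical (95 quiet hours and 5 storm hours), and MCC is returned as 0 when its denominator is zero, which is the convention assumed here.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels coded as 0 and 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denominator == 0 else (tp * tn - fp * fn) / denominator

# Hypothetical labels: 95 quiet hours (0) followed by 5 storm hours (1)
y_true = [0] * 95 + [1] * 5

# Predictor A never reports a storm
always_quiet = [0] * 100

# Predictor B raises 3 false alarms and catches 4 of the 5 storms
detector = [1] * 3 + [0] * 92 + [1] * 4 + [0] * 1

for name, y_pred in [("always quiet", always_quiet), ("detector", detector)]:
    print(f"{name}: accuracy={accuracy(y_true, y_pred):.2f}, MCC={mcc(y_true, y_pred):.2f}")
```

Both predictors reach an accuracy of at least 0.95, but only the MCC reveals that the first one never detects a single event.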

#### Steps for Threshold Selection:

1. **Prediction on Validation Set**:
   - We use the trained model to predict the `DST` values on the validation set using the `TimeseriesGenerator`.

2. **Binary Classification of `DST` Values**:
   - The true labels are determined by setting a threshold of `-20` on the `DST` index. Values less than or equal to `-20` are classified as `1` (event), and others as `0` (non-event).

3. **Threshold Tuning**:
   - We calculate the MCC for a range of thresholds (`-10` to `-40` with a step of `-0.1`) to determine the threshold that yields the best MCC score.
   - The best threshold is identified as the one that maximizes the MCC.

4. **Results**:
   - The best threshold and its corresponding MCC are printed, providing insights into the optimal cutoff for event detection based on the model’s predictions.

### 2. Confusion Matrix, Classification Report, and Accuracy

Once the optimal threshold is selected using MCC, we apply it to the model's test-set predictions and evaluate classification performance using additional metrics.

#### Confusion Matrix:
- **Definition**: The confusion matrix provides a detailed breakdown of the classification outcomes, showing the true positives, true negatives, false positives, and false negatives.

#### Classification Report:
- **Definition**: The classification report provides key metrics like precision, recall, and F1-score for each class, giving a more detailed insight into the model's performance across different classes.

#### Accuracy (ACC):
- **Definition**: Accuracy is the ratio of correctly predicted instances (both true positives and true negatives) to the total instances. It provides a general measure of how often the model is correct.
- **Formula**: $$ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} $$
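
The quantities in the classification report can be derived directly from the confusion matrix. The sketch below assumes scikit-learn's layout for labels `[0, 1]` (rows are true classes, columns are predicted classes). The function name `event_class_metrics` and the example counts are hypothetical.

```python
def event_class_metrics(cm):
    """Compute accuracy, precision, recall and F1 for the positive (event) class.

    `cm` follows scikit-learn's layout for labels [0, 1]:
    rows are true classes, columns are predicted classes,
    i.e. [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = cm
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts, for illustration only
example_cm = [[80, 10],
              [5, 5]]
metrics = event_class_metrics(example_cm)
print(metrics)
```

In this example the accuracy is 0.85, but precision (about 0.33) and recall (0.5) for the event class are much lower, which is why the per-class report is worth reading alongside accuracy.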


In [None]:
# Predict on the validation set
y_val_pred = model.predict(val_generator)

# Create binary true labels for the validation set based on a threshold of -20
true_labels_val = np.where(y_val[n_input:] <= -20, 1, 0)

# Function to calculate the Matthews Correlation Coefficient (MCC)
def calculate_mcc(y_true, y_pred):
    return matthews_corrcoef(y_true, y_pred)

# Explore different thresholds to find the best MCC
thresholds = np.arange(-10, -40, -0.1)
mcc_scores = []

for thresh in thresholds:
    # Convert predictions to binary labels based on the current threshold
    predictions = np.where(y_val_pred <= thresh, 1, 0)
    # Calculate MCC for the current threshold
    mcc = calculate_mcc(true_labels_val, predictions)
    mcc_scores.append(mcc)

# Find the threshold with the highest MCC
best_threshold_mcc = thresholds[np.argmax(mcc_scores)]
best_mcc = max(mcc_scores)

print("Best Threshold for MCC on Validation Set:", best_threshold_mcc)
print("Best MCC on Validation Set:", best_mcc)

# Optional: Round the best threshold for reporting
best_threshold_mcc_rounded = round(best_threshold_mcc, 1)
print("Best Threshold for MCC (Rounded):", best_threshold_mcc_rounded)


In [None]:
# Convert test-set predictions to binary classifications based on the best threshold
# (y_pred holds the model's predictions on the test set, not the true values)
y_pred_best_th = np.where(y_pred.reshape(-1) <= best_threshold_mcc_rounded, 1, 0)
true_labels_test = np.where(y_test[n_input:].reshape(-1) <= -20, 1, 0)

# Calculate Accuracy (ACC)
acc = accuracy_score(true_labels_test, y_pred_best_th)
print(f"Accuracy: {acc}")

# Calculate MCC on the test set
mcc_test = matthews_corrcoef(true_labels_test, y_pred_best_th)
print(f"Test Set MCC: {mcc_test}")

# Confusion Matrix
cm = confusion_matrix(true_labels_test, y_pred_best_th)
print("Confusion matrix:\n", cm)

# Classification Report
print(classification_report(true_labels_test, y_pred_best_th))


## 📈 Visualizing Model Predictions vs. True Values

In this section, we create visualizations to compare the predicted values of the DST index with the true values. We use different colors for the true and predicted values to make the comparison clear and visually appealing.

## Full Data Visualization

First, we plot the true and predicted DST index values over the entire time period. This helps us get an overall sense of how well the model's predictions align with the actual data.


In [None]:
# Create a DataFrame with time, true values, and predicted values
df = pd.DataFrame(data={
    "time": test['time1'][n_input:], 
    "y_true": y_test[n_input:].reshape(-1), 
    "y_predict": y_pred.reshape(-1)
})

# Plotting the full data with enhanced colors
plt.figure(figsize=(14, 7))
plt.title('Predicting the Change in DST Index', fontsize=16)
plt.plot(df['time'], df['y_true'], label='True Values', color='#17becf', linewidth=2)  # Cyan color for true values
plt.plot(df['time'], df['y_predict'], label='Predicted Values', color='#9467bd', linewidth=2)  # Purple color for predicted values
plt.legend(loc='upper right', fontsize=12)
plt.gcf().autofmt_xdate()
plt.xlabel('Time', fontsize=14)
plt.ylabel('DST Index', fontsize=14)
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()


## Zoomed-In Data Visualization

Next, we provide a zoomed-in view of the DST index over a smaller time interval. This detailed view allows us to closely examine the accuracy of the model's predictions in a specific segment of the data.


In [None]:

# Zoomed-in view for a smaller time interval
# Define the interval (e.g., the first 200 time steps for detailed view)
zoom_interval = 200
df_zoomed = df.iloc[:zoom_interval]

plt.figure(figsize=(14, 7))
plt.title('Zoomed-In: Predicting the Change in DST Index', fontsize=16)
plt.plot(df_zoomed['time'], df_zoomed['y_true'], label='True Values', color='#17becf', linewidth=2) 
plt.plot(df_zoomed['time'], df_zoomed['y_predict'], label='Predicted Values', color='#9467bd', linewidth=2) 
plt.legend(loc='upper right', fontsize=12)
plt.gcf().autofmt_xdate()
plt.xlabel('Time', fontsize=14)
plt.ylabel('DST Index', fontsize=14)
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()


## 📅 Interactive Date-Based Visualization (Limited to Test Set Range)

In this section, you can interactively select a date range to visualize the predicted versus true values of the DST index. Use the date pickers below to choose the start and end dates; the plot updates automatically whenever either date changes.

The pickers default to the first and last dates of the test set. Dates chosen outside that range are clamped to it, so the plot only covers dates for which predictions are available.


In [None]:
import ipywidgets as widgets
from IPython.display import display


# Get the minimum and maximum dates from the test set
min_date = test['time1'].min().date()
max_date = test['time1'].max().date()

# Create date picker widgets for selecting the start and end dates
start_date_picker = widgets.DatePicker(
    description='Start Date',
    value=min_date,  # Set the default start date to the minimum date in the test set
    disabled=False
)

end_date_picker = widgets.DatePicker(
    description='End Date',
    value=max_date,  # Set the default end date to the maximum date in the test set
    disabled=False
)

# Display the date pickers
display(start_date_picker, end_date_picker)

def update_plot_by_date(start_date, end_date):
    # Ensure the dates are within the test set range
    if start_date is not None and end_date is not None:
        if start_date < min_date:
            start_date = min_date
        if end_date > max_date:
            end_date = max_date
            
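        # Filter the DataFrame to the selected range; the end date is inclusive,
        # so keep everything before midnight of the following day
        end_exclusive = pd.Timestamp(end_date) + pd.Timedelta(days=1)
        mask = (df['time'] >= pd.Timestamp(start_date)) & (df['time'] < end_exclusive)
        df_filtered = df.loc[mask]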
        
        # Plotting the data within the selected date range
        plt.figure(figsize=(14, 7))
        plt.title(f'Predicting the Change in DST Index from {start_date} to {end_date}', fontsize=16)
        plt.plot(df_filtered['time'], df_filtered['y_true'], label='True Values', color='#17becf', linewidth=2)  # Cyan color for true values
        plt.plot(df_filtered['time'], df_filtered['y_predict'], label='Predicted Values', color='#9467bd', linewidth=2)  # Purple color for predicted values
        plt.legend(loc='upper right', fontsize=12)
        plt.gcf().autofmt_xdate()
        plt.xlabel('Time', fontsize=14)
        plt.ylabel('DST Index', fontsize=14)
        plt.grid(True, linestyle='--', alpha=0.7)
        plt.show()

# Link the function to the date pickers using an interactive widget
widgets.interactive(update_plot_by_date, start_date=start_date_picker, end_date=end_date_picker)



## 🚀 Next Steps: Enhancing Your Time Series Model

Now that you've successfully built and evaluated a model for predicting the DST index, it's time to explore some extensions and enhancements. These next steps are designed to help you deepen your understanding of time series forecasting and neural network modeling.

### 1. Predicting Further Into the Future

Currently, the model is set up to predict the DST index one hour ahead with a window size of 6 time steps. However, you can modify the model to predict further into the future (e.g., 3 hours, 6 hours) or adjust the window size to see how it affects the predictions.

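**How to Adjust the Prediction Window:**
- The forecast horizon (how far ahead the model predicts) and the window size (how many past time steps the model sees) are separate settings. To forecast multiple hours ahead, shift the targets so that each input window is paired with the `DST` value several steps after it.
- For example, to predict 3 hours ahead, pair each 6-step window with the value 3 hours after its last time step. You can also enlarge the window (e.g., from 6 to 18 steps) to give the model more history, but this alone does not change the horizon.
- Longer horizons are harder to predict, so you may need to adjust the neural network architecture (e.g., more units or layers) accordingly.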
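
One way to forecast further ahead is to pair each input window with a target that lies `horizon` steps after the window ends. The sketch below uses plain NumPy on a toy series. The function `make_windows` and its parameters are hypothetical and are not the notebook's own data pipeline.

```python
import numpy as np

def make_windows(series, n_input, horizon=1):
    """Pair each window of `n_input` past values with the value `horizon` steps after it.

    Returns X with shape (samples, n_input) and y with shape (samples,).
    """
    series = np.asarray(series, dtype=float)
    n_samples = len(series) - n_input - horizon + 1
    if n_samples <= 0:
        raise ValueError("series is too short for this window size and horizon")
    X = np.stack([series[i:i + n_input] for i in range(n_samples)])
    first_target = n_input + horizon - 1
    y = series[first_target:first_target + n_samples]
    return X, y

# Toy hourly series 0, 1, 2, ... so the alignment is easy to check by eye
toy = np.arange(12)
X1, y1 = make_windows(toy, n_input=6, horizon=1)  # one hour ahead
X3, y3 = make_windows(toy, n_input=6, horizon=3)  # three hours ahead
print(X1[0], "->", y1[0])
print(X3[0], "->", y3[0])
```

Both calls use the same 6-step window; only the target moves. A longer horizon also leaves fewer training samples, because the last few windows have no target available.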

### 2. Adding More Features to the Neural Network

The DST index is influenced by various factors, including the KP index, solar wind speed, and other space weather parameters. By adding these features to your neural network, you may improve its predictive power.

**Steps to Add More Features:**
- Include additional features in your dataset, such as the KP index, to provide the model with more information.
- Adjust the input shape of your neural network to account for the additional features.
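
The effect on the input shape can be sketched as follows. The data here is synthetic, and the three columns merely stand in for inputs such as DST, KP, and solar wind speed. The helper `make_multivariate_windows` is hypothetical.

```python
import numpy as np

def make_multivariate_windows(features, target, n_input):
    """Build recurrent-network inputs from several feature columns.

    features: array of shape (time_steps, n_features)
    target:   array of shape (time_steps,)
    Returns X with shape (samples, n_input, n_features) and
    y with shape (samples,), where y is the target one step after each window.
    """
    features = np.asarray(features, dtype=float)
    target = np.asarray(target, dtype=float)
    n_samples = len(features) - n_input
    X = np.stack([features[i:i + n_input] for i in range(n_samples)])
    y = target[n_input:]
    return X, y

# Synthetic stand-ins for three input features
rng = np.random.default_rng(0)
features = rng.normal(size=(100, 3))
target = features[:, 0]  # pretend the first column is the DST index

X, y = make_multivariate_windows(features, target, n_input=6)
input_shape = X.shape[1:]  # (n_input, n_features), what the first layer expects
print(X.shape, y.shape, input_shape)
```

With three features and a 6-step window, the network's input shape becomes `(6, 3)` instead of `(6, 1)`. Features on very different scales usually need to be normalized before training.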

### 3. Experimenting with Neural Network Architecture

Explore different neural network architectures to see how they affect model performance. Consider trying different types of layers, activation functions, or even completely different models like GRU (Gated Recurrent Unit) or Transformer-based models.

**Ideas for Architecture Modifications:**
- Increase the number of LSTM units to capture more complex patterns in the data.
- Add dropout layers to help prevent overfitting by randomly setting a fraction of input units to 0 during training.
- Experiment with other recurrent layers like GRU instead of LSTM, or even a combination of both.
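
As a starting point, the sketch below swaps the LSTM for a GRU and adds a dropout layer. The layer sizes, dropout rate, and the name `build_gru_model` are assumptions chosen for illustration, not the configuration used earlier in this notebook.

```python
import tensorflow as tf

def build_gru_model(n_input=6, n_features=1, units=64, dropout_rate=0.2):
    """Bidirectional GRU regressor with dropout; all sizes are illustrative."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(n_input, n_features)),
        tf.keras.layers.Bidirectional(tf.keras.layers.GRU(units)),
        tf.keras.layers.Dropout(dropout_rate),  # only active during training
        tf.keras.layers.Dense(1),               # single predicted DST value
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

model_gru = build_gru_model()
model_gru.summary()
```

Keeping the builder parameterized makes it straightforward to compare several values of `units` or `dropout_rate` against the same validation set.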

### 4. Exploring Other Time Series Models

Beyond LSTMs, there are various other models you can experiment with:
- **ARIMA (AutoRegressive Integrated Moving Average)**: A popular statistical method for time series forecasting.
- **Prophet**: A model developed by Facebook for forecasting time series data, particularly effective with missing data and outliers.
- **Transformer Models**: Explore the use of Transformer models for time series forecasting, which are particularly good at handling long-range dependencies.
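
Before reaching for a full ARIMA implementation (available, for example, in `statsmodels`), it can help to see the autoregressive part on its own. The sketch below fits an AR(p) model by ordinary least squares. It omits the differencing and moving-average terms of ARIMA, and the functions `fit_ar` and `predict_next` are hypothetical helpers.

```python
import numpy as np

def fit_ar(series, p):
    """Least-squares fit of x[t] = c + a1*x[t-1] + ... + ap*x[t-p].

    Returns (intercept, coefficients), with coefficients ordered a1..ap.
    """
    series = np.asarray(series, dtype=float)
    # Column k holds the lag-(k+1) values aligned with the targets series[p:]
    lags = np.column_stack(
        [series[p - k - 1:len(series) - k - 1] for k in range(p)]
    )
    design = np.column_stack([np.ones(len(lags)), lags])
    coef, *_ = np.linalg.lstsq(design, series[p:], rcond=None)
    return coef[0], coef[1:]

def predict_next(series, intercept, coefs):
    """One-step-ahead forecast from the most recent len(coefs) values."""
    recent = np.asarray(series, dtype=float)[-len(coefs):][::-1]  # newest first
    return intercept + float(np.dot(coefs, recent))

# Noise-free AR(1) series with known parameters (c = 1.0, a1 = 0.8),
# so the fit can be checked exactly
x = [0.0]
for _ in range(29):
    x.append(1.0 + 0.8 * x[-1])

intercept, coefs = fit_ar(x, p=1)
print("estimated intercept:", intercept)
print("estimated a1:", coefs[0])
print("next value forecast:", predict_next(x, intercept, coefs))
```

A simple linear baseline like this gives a useful reference point: if the LSTM does not beat it on the test set, the extra model complexity is not yet paying off.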

### 5. Evaluating with Different Metrics

Expand your evaluation by using additional metrics like:
- **R-squared (Coefficient of Determination)**: Measures the proportion of variance in the dependent variable that is predictable from the independent variable(s).
- **MAE (Mean Absolute Error)**: The average absolute difference between predicted and true values, expressed in the same units as the DST index (nT), which makes it easy to interpret.
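
Both metrics are short enough to write out by hand, as in the sketch below. The example values are made up, and the function names `mae` and `r_squared` are hypothetical; scikit-learn provides equivalent `mean_absolute_error` and `r2_score` functions.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: average size of the prediction errors."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual variation / total variation)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Made-up DST values (nT), for illustration only
y_true_demo = [-10, -20, -30, -40]
y_pred_demo = [-12, -18, -33, -37]

print("MAE:", mae(y_true_demo, y_pred_demo))
print("R^2:", r_squared(y_true_demo, y_pred_demo))
```

Here the MAE is 2.5 nT and R-squared is 0.948. Note that R-squared can be negative when the predictions are worse than always predicting the mean of the true values.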

### Summary

These next steps provide you with a pathway to deepen your understanding and enhance your model's capabilities. By experimenting with different configurations, adjusting the window size, adding more features, and exploring new models, you'll gain valuable insights into the challenges and intricacies of time series forecasting in real-world scenarios.

Feel free to experiment, iterate, and share your findings!
