### **Problem Understanding**

Geomagnetic storms are caused by the interaction of solar wind with Earth's magnetic field. The resulting disturbances in the geomagnetic field can wreak havoc on GPS systems, satellite communication, electric power transmission, and more. These disturbances are measured by the Disturbance Storm-Time Index, or [Dst](https://www.ngdc.noaa.gov/stp/GEOMAG/dst.html).

The task is to forecast Dst in real-time to help satellite operators, power grid operators, and users of magnetic navigation systems prepare for magnetic disturbances.

The primary input data comes from sensors on two satellites, NASA's [ACE](https://www.swpc.noaa.gov/products/ace-real-time-solar-wind) and NOAA's DSCOVR. This space weather data includes sensor readings related to both the interplanetary magnetic field and plasma from solar wind.


> **The Interplanetary Magnetic Field (IMF)**
>
The interplanetary magnetic field (IMF) plays a huge role in how the solar wind interacts with Earth’s magnetosphere. In this article we will learn what the interplanetary magnetic field is and how it affects auroral activity here on Earth.

> **The Sun’s magnetic field**
>
During solar minimum, the magnetic field of the Sun looks similar to Earth’s magnetic field. It looks a bit like an ordinary bar magnet with closed lines close to the equator and open field lines near the poles. Scientists call this a dipole field. The dipole field of the Sun is about as strong as a magnet on a refrigerator (around 50 gauss). The magnetic field of the Earth is about 100 times weaker.
Around solar maximum, when the Sun reaches her maximum activity, many sunspots are visible on the visible solar disk. These sunspots are filled with magnetism and large magnetic field lines which run material along them. These field lines are often hundreds of times stronger than the surrounding dipole. This causes the magnetic field around the Sun to be a very complex magnetic field with many disturbed field lines.
The magnetic field of our Sun doesn’t stay around the Sun itself. The solar wind carries it through the Solar System until it reaches the heliopause. The heliopause is the place where the solar wind comes to a stop and where it collides with the interstellar medium. Because the Sun turns around her axis (once in about 25 days) the interplanetary magnetic field has a spiral shape which is called the Parker Spiral.

> **Bt value**
>
The Bt value of the interplanetary magnetic field indicates the total strength of the interplanetary magnetic field. It is a combined measure of the magnetic field strength in the north-south, east-west, and towards-Sun vs. away-from-Sun directions. The higher this value, the better it is for enhanced geomagnetic conditions. We speak of a moderately strong total interplanetary magnetic field when the Bt exceeds 10nT. Strong values start at 20nT and we speak of a very strong total interplanetary magnetic field when values exceed 30nT. The units are in nano-Tesla (nT) — named after Nikola Tesla, the famous physicist, engineer and inventor.
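
As a rough illustration of how the three components combine, the total field strength is the length (Euclidean norm) of the field vector. The component values below are made up for the sketch and are not real measurements.

```python
import numpy as np

# Hypothetical IMF components in nT (illustrative values, not real measurements)
bx, by, bz = 3.0, -4.0, -12.0

# Bt is the magnitude of the field vector
bt = np.sqrt(bx**2 + by**2 + bz**2)
print(f"Bt = {bt:.1f} nT")  # Bt = 13.0 nT
```

With these numbers the strongly southward Bz dominates the total, which would put Bt in the "moderately strong" range described above.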

> **Bx, By and Bz**
>
The interplanetary magnetic field is a vector quantity with three components, two of which (Bx and By) are oriented parallel to the ecliptic. The Bx and By components are not important for auroral activity and are therefore not featured on our website. The third component, Bz, is perpendicular to the ecliptic and is created by waves and other disturbances in the solar wind.

![im1](https://www.spaceweatherlive.com/images/help/IMF/BxByBz.gif)

> **Interaction with Earth’s magnetosphere**
>
The north-south direction of the interplanetary magnetic field (Bz) is the most important ingredient for auroral activity. When the north-south direction (Bz) of the interplanetary magnetic field is oriented southward, it will connect with Earth’s magnetosphere, which points northward. Think of the ordinary bar magnets that you have at home. Two opposite poles attract each other! A (strong) southward Bz can create havoc with Earth’s magnetic field, disrupting the magnetosphere and allowing particles to rain down into our atmosphere along Earth’s magnetic field lines. When these particles collide with the oxygen and nitrogen atoms that make up our atmosphere, it causes them to glow and emit light, which we see as aurora.
>
For a geomagnetic storm to develop it is vital that the direction of the interplanetary magnetic field (Bz) turns southward. Sustained values of -10nT and lower are good indicators that a geomagnetic storm could develop, and the lower this value goes, the better it is for auroral activity. Only during extreme events with high solar wind speeds is it possible for a geomagnetic storm (Kp5 or higher) to develop with a northward Bz.


![im2](https://www.spaceweatherlive.com/images/help/IMF/magnetosphere.jpg)
*A schematic diagram showing the interaction between the IMF with a southward Bz and Earth’s magnetosphere.*

> **Measuring the interplanetary magnetic field**
>
>The real-time solar wind and interplanetary magnetic field data that you can find on this website come from the Deep Space Climate Observatory (DSCOVR) satellite which is stationed in an orbit around the Sun-Earth Lagrange Point 1. This is a point in space which is always located between the Sun and Earth where the gravity of the Sun and Earth have an equal pull on satellites meaning they can remain in a stable orbit around this point. This point is ideal for solar missions like DSCOVR, as this gives DSCOVR the opportunity to measure the parameters of the solar wind and the interplanetary magnetic field before it arrives at Earth. This gives us a 15 to 60 minute warning time (depending on the solar wind speed) as to what kind of solar wind structures are on their way to Earth.
>
>The Deep Space Climate Observatory (DSCOVR) mission is now the primary source for real-time solar wind and interplanetary magnetic field data, but there is one more satellite at the Sun-Earth L1 point that measures the incoming solar wind: the Advanced Composition Explorer. This satellite was the primary real-time space weather data source until July 2016, when DSCOVR became fully operational. The Advanced Composition Explorer (ACE) satellite is still collecting data and now operates mostly as a backup to DSCOVR.

*The location of a satellite at the Sun-Earth L1 point.*
![im3](https://www.spaceweatherlive.com/images/help/zonnewind/L1_animation.gif)


In this notebook we'll cover how to:

- load the data
- create features using the timedelta index
- generate batches of 32-length sequences for training
- train an LSTM model in Keras

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline 

from scipy import stats
from scipy.stats import norm, skew


from sklearn.model_selection import train_test_split, KFold, GroupKFold, GridSearchCV, StratifiedKFold
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler, MinMaxScaler, RobustScaler

from sklearn.metrics import *

import sys, os
import random 

if not sys.warnoptions:
    import warnings
    warnings.simplefilter("ignore")
    
from IPython import display, utils




In [None]:


solar_wind = pd.read_csv("../input/soalr-wind/solar_wind.csv")
solar_wind.timedelta = pd.to_timedelta(solar_wind.timedelta)
solar_wind.set_index(["period", "timedelta"], inplace=True)

dst = pd.read_csv("../input/soalr-wind/labels.csv")
dst.timedelta = pd.to_timedelta(dst.timedelta)
dst.set_index(["period", "timedelta"], inplace=True)

sunspots = pd.read_csv("../input/soalr-wind/sunspots.csv")
sunspots.timedelta = pd.to_timedelta(sunspots.timedelta)
sunspots.set_index(["period", "timedelta"], inplace=True)

In [None]:
dst.groupby("period").describe()

We have nearly 140,000 observations of hourly dst data, representing over 15 years. There are almost twice as many observations in each of the train_b and train_c periods as there are in train_a. It also seems train_a represents a more intense period, given that it has a lower mean and higher standard deviation. Also note that most of the values are negative.

A very strong magnetic field disturbance has a large Dst value, measured in nano-Teslas (nT). Because these disturbances are usually flowing towards the Earth, the values are negative. Sometimes Dst can be highly positive. During calm conditions, Dst values are situated at or just below 0.



### EDA

In [None]:
print("Solar wind shape: ", solar_wind.shape)
solar_wind.head()

In [None]:
print("Sunspot shape: ", sunspots.shape)
sunspots.head()

We can see that we're mostly working with floats, except for the solar wind source column which tells us which of the two satellites recorded the data. We also see that the size of the solar_wind data is fairly large, close to 8.4 million rows. That makes sense, given we're working with minutely values.

On the other hand, we only have 192 observations of monthly sunspot data. When it comes to feature generation, we should think about the best ways to combine these features given their different frequencies.



In [None]:
solar_wind.groupby("period").describe().T

In [None]:
sunspots.groupby("period").describe().T

The means and standard deviations of values in train_a are generally larger in magnitude than in the other two periods, which again points to a more intense period.

Also, our values exist across very different scales. For instance, temperature values are quite high, reaching into the hundreds of thousands of kelvin. Meanwhile, IMF readings are fairly small values, and usually negative. One of the best practices when training deep learning models is scaling your features; we'll definitely want to think about that later on.

Let's do a few visualizations before we move on to feature generation. First, we'll just plot the first 1,000 rows for some of our time series data to get a sense of its shape. Instead of plotting all of our features, let's just choose a few IMF features and a few plasma features.

In [None]:
plt.style.use('fivethirtyeight')
def show_raw_visualization(data):
    fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(15, 15), dpi=80)
    for i, key in enumerate(data.columns):
        t_data = data[key]
        ax = t_data.plot(
            ax=axes[i // 2, i % 2],
            title=f"{key.capitalize()}",
            rot=25,color='teal', lw=1.2
        )

    fig.subplots_adjust(hspace=0.8)
    plt.tight_layout()


cols_to_plot = ["bx_gse", "bx_gsm", "bt", "density", "speed", "temperature"]
show_raw_visualization(solar_wind[cols_to_plot].iloc[:1000])

In [None]:
solar_wind.isnull().sum()

The proportion varies by feature, but every one will require some kind of imputation. Turns out sensor readings from space aren't always reliable - instruments are subject to all kinds of nasty space weather. It's up to us to figure out a sensible way of dealing with these missing values.

Another observation is that the IMF features bx_gsm and bx_gse are closely related. This could present multicollinearity issues if both are present in our model. As our last exploratory step, let's plot a correlation matrix to find other relationships between our features.

In [None]:
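# forward fill so the hourly labels and monthly sunspots cover every minutely row
joined = solar_wind.join(sunspots).join(dst).ffill()

# clustermap creates its own figure, so pass figsize to it directly;
# numeric_only=True skips the categorical `source` column
sns.clustermap(joined.corr(numeric_only=True), annot=True, figsize=(20, 15))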

- The plasma-related features like speed and temperature look to be strongly anti-correlated with Dst. The IMF feature bt also exhibits strong anti-correlation.

- The gsm and gse variables are strongly correlated. We probably want to leave some of those out.



### Feature Engineering

In [None]:
from numpy.random import seed
from tensorflow.random import set_seed

seed(2020)
set_seed(2021)

**We've learned from data exploration:**


*1. Features exist across different scales*

We'll fix this using the StandardScaler from scikit-learn. It's a pretty standard affair - it'll help us subtract the mean and divide by the standard deviation for each feature. What's nice about it is that it'll save the parameters used for scaling so that we can re-use it later during prediction. You could also experiment with using the MinMaxScaler or other scaling methods instead.

*2. Features are provided at different frequencies*

One easy way to fix this is to aggregate values to the same frequency. Since our dst values are provided hourly, we'll aggregate solar_wind data to the hour. The timedelta API makes it really easy to group by the hour. We'll take both the mean and std of each value for each hour. This would be a great place to experiment with different frequencies and aggregations to try and get the best performance out of your model.

*3. Certain IMF features are highly correlated*

We'll solve this by taking a subset of the data. Let's use the plasma variables temperature, speed, and density, as well as the IMF feature bt. Since we saw that gse and gsm variables were highly correlated, let's only take the gse variables. This is a first pass at subsetting the variables - you may want to take a more principled approach in your implementation.

*4. There are many missing values*

We've already decided to aggregate our minutely solar wind data to an hourly frequency. That'll help with some of the missing values. For the remaining gaps, there are a few different methods we could try. We want something that is effective and that also works in a real-time environment. For simplicity, we're going to interpolate between missing values using df.interpolate(). Again, this is another great place for experimentation! You could impute using the mean or median, or you might even think about developing another model to estimate the missing values.

Finally, we have our monthly sunspot numbers. These aren't really "missing" - they're just provided at a coarser frequency. We'll fix that by using the forward fill or ffill method of imputation so that the correct monthly number is assigned to each row.
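
To make the difference between the two imputation strategies concrete, here is a small sketch on a made-up hourly frame (the values and the names `toy`, `wind`, and `ssn` are illustrative only). Linear interpolation draws a straight line across a gap, while forward fill repeats the last known value.

```python
import numpy as np
import pandas as pd

# Toy hourly frame with gaps (made-up values)
idx = pd.to_timedelta(np.arange(5), unit="h")
toy = pd.DataFrame(
    {
        "wind": [400.0, np.nan, np.nan, 430.0, 440.0],  # sensor dropout
        "ssn": [65.0, np.nan, np.nan, np.nan, np.nan],  # one monthly value
    },
    index=idx,
)

wind_filled = toy["wind"].interpolate()  # linear by default: 400, 410, 420, 430, 440
ssn_filled = toy["ssn"].ffill()          # repeats 65.0 on every row

print(pd.DataFrame({"wind": wind_filled, "ssn": ssn_filled}))
```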

In [None]:
from sklearn.preprocessing import StandardScaler

# subset of solar wind features to use for modeling
SOLAR_WIND_FEATURES = [
    "bt",
    "temperature",
    "bx_gse",
    "by_gse",
    "bz_gse",
    "speed",
    "density",
]

# all of the features we'll use, including sunspot numbers
XCOLS = (
    [col + "_mean" for col in SOLAR_WIND_FEATURES]
    + [col + "_std" for col in SOLAR_WIND_FEATURES]
    + ["smoothed_ssn"]
)


def impute_features(feature_df):
    """Imputes data using the following methods:
    - `smoothed_ssn`: forward fill
    - `solar_wind`: interpolation
    """
    # forward fill sunspot data for the rest of the month
    # (`fillna(method="ffill")` is deprecated in recent pandas; `ffill()` is equivalent)
    feature_df.smoothed_ssn = feature_df.smoothed_ssn.ffill()
    # interpolate between missing solar wind values
    feature_df = feature_df.interpolate()
    return feature_df


def aggregate_hourly(feature_df, aggs=["mean", "std"]):
    """Aggregates features to the floor of each hour using mean and standard deviation.
    e.g. All values from "11:00:00" to "11:59:00" will be aggregated to "11:00:00".
    """
    # group by the floor of each hour using the timedelta index
    agged = feature_df.groupby(
        ["period", feature_df.index.get_level_values(1).floor("H")]
    ).agg(aggs)
    # flatten hierarchical column index, e.g. ("bt", "mean") -> "bt_mean"
    agged.columns = ["_".join(x) for x in agged.columns]
    return agged


def preprocess_features(solar_wind, sunspots, scaler=None, subset=None):
    """
    Preprocessing steps:
        - Subset the data
        - Aggregate hourly
        - Join solar wind and sunspot data
        - Scale using standard scaler
        - Impute missing values
    """
    # select features we want to use
    if subset:
        solar_wind = solar_wind[subset]

    # aggregate solar wind data and join with sunspots
    hourly_features = aggregate_hourly(solar_wind).join(sunspots)

    # subtract mean and divide by standard deviation
    if scaler is None:
        scaler = StandardScaler()
        scaler.fit(hourly_features)

    normalized = pd.DataFrame(
        scaler.transform(hourly_features),
        index=hourly_features.index,
        columns=hourly_features.columns,
    )

    # impute missing values
    imputed = impute_features(normalized)

    # we want to return the scaler object as well to use later during prediction
    return imputed, scaler


In [None]:
features, scaler = preprocess_features(solar_wind, sunspots, subset=SOLAR_WIND_FEATURES)
print(features.shape)
features.head()

In [None]:
assert (features.isna().sum() == 0).all()

Our final feature set is composed of 15 features - the mean and standard deviation of seven solar_wind features, along with smoothed_ssn. We've also saved our scaler object. We'll serialize this later along with our model so that it can be used to preprocess features during prediction.

Before we start modeling, we also have to reshape our label, dst. Remember that we have to predict both t0, the current timestep, and t+1, an hour ahead. We'll train our LSTM to do multi-step prediction by providing it both steps. To do that, we just add another column called t1 that is dst shifted by 1. We'll also rename our dst column to t0 for consistency.
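
As a small illustration of the shift, here is a sketch on made-up labels for two toy periods (`toy` and its values are hypothetical). Shifting within each period keeps the last hour of one period from borrowing the first hour of the next, at the cost of a missing t1 on each period's final row.

```python
import pandas as pd

# Toy labels for two periods (made-up values)
toy = pd.DataFrame(
    {
        "period": ["a", "a", "a", "b", "b"],
        "dst": [-10, -20, -30, 5, 0],
    }
)

# shift(-1) within each period pulls the next hour's value onto the current row
toy["t1"] = toy.groupby("period")["dst"].shift(-1)
toy = toy.rename(columns={"dst": "t0"})
print(toy)
```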

In [None]:
YCOLS = ["t0", "t1"]


def process_labels(dst):
    y = dst.copy()
    y["t1"] = y.groupby("period").dst.shift(-1)
    y.columns = YCOLS
    return y


labels = process_labels(dst)
labels.head()

Et voilà! Now we have our features and our labels. Let's join them together into a single DataFrame, data, so that it's easier to keep the appropriate rows together when splitting into our train, test, and validation sets.

In [None]:
data = labels.join(features)
data.head()

We want to split our data into three datasets. As you might have guessed, the train set will be used for training. We'll also pass a validation set to keras during training to monitor the model's fit over epochs. Finally, we'll use a test set to evaluate our model for over- or underfitting before we submit to the competition.

We have two atypical concerns to consider when splitting this dataset:

1. We're dealing with timeseries data. Observations in a time series are not independent, so we cannot randomly assign observations across our datasets. We also don't want to "cheat" by leaking future information into our training data. In the real world, we will never be able to train on data from the future, so we should emulate those same constraints here.

2. We have three non-contiguous periods, meaning we have gaps in our data. We don't know how long each gap is or in what order each period occurred. We also know that the three periods are differently distributed. That suggests that observations from each period should be included in our train set, instead of reserving one wholesale as our test or validation set.

*To solve these problems, we'll hold out the last 6,000 rows from each period for our test set, and reserve the last 3,000 before that for our validation set. The remaining rows will be used in the training set.*

In [None]:
def get_train_test_val(data, test_per_period, val_per_period):
    """Splits data across periods into train, test, and validation"""
    # assign the last `test_per_period` rows from each period to test
    test = data.groupby("period").tail(test_per_period)
    interim = data[~data.index.isin(test.index)]
    # assign the last `val_per_period` from the remaining rows to validation
    # (taken from `interim`, not `data`, so validation does not overlap test)
    val = interim.groupby("period").tail(val_per_period)
    # the remaining rows are assigned to train
    train = interim[~interim.index.isin(val.index)]
    return train, test, val


train, test, val = get_train_test_val(data, test_per_period=6_000, val_per_period=3_000)
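
A useful sanity check for any split of this kind is that the three sets are disjoint and that, within each period, validation rows sit directly before test rows. The sketch below illustrates the idea on a toy frame; `toy` and `split_by_period` are hypothetical names used only for this example.

```python
import pandas as pd

# Toy frame: two periods with 10 rows each (made-up data)
toy = pd.DataFrame(
    {"period": ["a"] * 10 + ["b"] * 10, "step": list(range(10)) * 2}
).set_index(["period", "step"])


def split_by_period(data, test_per_period, val_per_period):
    """Hypothetical helper: last rows per period -> test, the ones before -> val."""
    test = data.groupby("period").tail(test_per_period)
    rest = data[~data.index.isin(test.index)]
    val = rest.groupby("period").tail(val_per_period)
    train = rest[~rest.index.isin(val.index)]
    return train, val, test


toy_train, toy_val, toy_test = split_by_period(toy, test_per_period=3, val_per_period=2)

# the three sets should be disjoint and together cover every row
assert len(toy_train) + len(toy_val) + len(toy_test) == len(toy)
assert toy_val.index.intersection(toy_test.index).empty
assert toy_train.index.intersection(toy_val.index).empty
print(toy_val.index.get_level_values("step").tolist())  # [5, 6, 5, 6]
```

The same three assertions can be run against the real train, val, and test frames.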

### Modeling


The first thing we have to do is separate our data into sequences and batches for modeling. We have to decide on:

- timesteps: this determines the sequence length, i.e. how many timesteps in the past to use to predict each step at t0 and t1. Our data is aggregated hourly, so timesteps is equal to the number of hours we want to use for each prediction.
- batch_size: this determines the number of samples to work through before a model's parameters are updated.

For this tutorial, we'll choose fairly standard numbers of 32 timesteps per sequence and 32 sequences per batch. These numbers will likely have a large impact on your model, so feel free to experiment.

Now we need to use these numbers to separate our training data and labels into batches of sequences that will be fed into the model. Luckily, keras has just the tool with their timeseries_dataset_from_array function. According to the documentation:

> If targets was passed, the dataset yields tuple (batch_of_sequences, batch_of_targets). If not, the dataset yields only batch_of_sequences.

So we can easily specify timesteps (referred to as sequence_length in the documentation) and batch_size to get a dataset of (features, targets) batches that we can pass to model.fit(). For your implementation, you can experiment with different sequence_lengths along with sequence_stride (the number of observations between the starts of successive sequences) and sampling_rate (the number of observations between successive timesteps within a sequence).

There's one hiccup - we need to make sure that our sequences don't span across periods. To get around that, we'll iterate through our periods and generate a timeseries dataset for each one. Then we'll concatenate them at the end to rejoin our training set and validation set. And let's not forget - since we're only allowed to use feature data up until t-1, we'll need to realign our features and labels. We'll do that during our loop as well.
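
The realignment can be sketched in plain NumPy before bringing in Keras. This is only an illustration on made-up arrays, and it assumes (as the Keras documentation describes) that the window starting at index i is paired with the target at index i. A window of 3 is used instead of 32 to keep the output readable.

```python
import numpy as np

timesteps = 3  # short window for readability (the notebook uses 32)
n = 8
features = np.arange(n)     # the feature at hour i is simply i
labels = np.arange(n) * 10  # the label at hour i is 10 * i

# realign: drop the last `timesteps` features and the first `timesteps` labels
inputs = features[:-timesteps]
outputs = labels[timesteps:]

# window i covers inputs[i : i + timesteps] and is paired with outputs[i]
n_windows = len(inputs) - timesteps + 1
pairs = [(inputs[i : i + timesteps].tolist(), int(outputs[i])) for i in range(n_windows)]
for window, target in pairs:
    print(window, "->", target)
```

The first pair is hours 0 to 2 predicting the label at hour 3, so each target only sees features up to t-1. With n rows in a period, this scheme yields n - 2 * timesteps + 1 sequences.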

In [None]:
import tensorflow as tf


data_config = {
    "timesteps": 32,
    "batch_size": 32,
}


def timeseries_dataset_from_df(df, batch_size):
    dataset = None
    timesteps = data_config["timesteps"]

    # iterate through periods
    for _, period_df in df.groupby("period"):
        # realign features and labels so that first sequence of 32 is aligned with the 33rd target
        inputs = period_df[XCOLS][:-timesteps]
        outputs = period_df[YCOLS][timesteps:]

        period_ds = tf.keras.preprocessing.timeseries_dataset_from_array(
            inputs,
            outputs,
            timesteps,
            batch_size=batch_size,
        )

        if dataset is None:
            dataset = period_ds
        else:
            dataset = dataset.concatenate(period_ds)

    return dataset


train_ds = timeseries_dataset_from_df(train, data_config["batch_size"])
val_ds = timeseries_dataset_from_df(val, data_config["batch_size"])

print(f"Number of train batches: {len(train_ds)}")
print(f"Number of val batches: {len(val_ds)}")


Finally, we need to design our LSTM network. We're going to build a simple sequential model with one hidden LSTM layer and one output layer with 2 output values (t0 and t1). You can experiment by making your model deep (many layers) or wide (adding more neurons).

There are many hyperparameters that we can tune. We'll concentrate on a few here:

- n_epochs: This determines the number of complete passes your model takes through the training data. For this tutorial, we'll choose 20 epochs. You'll want to monitor your rate of convergence to tune this number.
- n_neurons: The number of hidden nodes. Usually increased by powers of 2. We'll use 512.
- dropout: This regularizes by randomly "ignoring" a dropout fraction of a layer's neurons during each pass through the network during training, so that no particular neuron overfits its input. We'll start with a value of 0.4.
- stateful: This determines whether the model carries its internal state over from one batch to the next. Since each sample within a batch encodes the entire sequence we care about, we can set this to False.

Tuning these values will affect how fast your model learns, whether it converges, and how much it over- or underfits. Play around to see what works best.

After instantiating the model with these hyperparameters, we'll compile it with mean_squared_error as our loss function and adam as our optimizer. We'll have to remember to take the square root of our loss to get our competition metric, root_mean_squared_error.
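
As a quick worked example of the relationship between the loss and the metric, here is a sketch on made-up targets and predictions for the two outputs. It assumes the error is pooled over both horizons, which is how a mean squared error over a two-column output behaves; check the competition's own definition before relying on it.

```python
import numpy as np

# Made-up targets and predictions for two outputs (t0, t1)
y_true = np.array([[-10.0, -12.0], [-20.0, -18.0]])
y_pred = np.array([[-13.0, -8.0], [-20.0, -18.0]])

mse = np.mean((y_true - y_pred) ** 2)  # averaged over samples and both outputs
rmse = np.sqrt(mse)
print(f"MSE: {mse:.2f}, RMSE: {rmse:.2f}")  # MSE: 6.25, RMSE: 2.50
```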

In [None]:


from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, LSTM

# define our model
model_config = {"n_epochs": 20, "n_neurons": 512, "dropout": 0.4, "stateful": False}

model = Sequential()
model.add(
    LSTM(
        model_config["n_neurons"],
        # usually set to (`batch_size`, `sequence_length`, `n_features`)
        # setting the batch size to None allows for variable length batches
        batch_input_shape=(None, data_config["timesteps"], len(XCOLS)),
        stateful=model_config["stateful"],
        dropout=model_config["dropout"],
    )
)
model.add(Dense(len(YCOLS)))
model.compile(
    loss="mean_squared_error",
    optimizer="adam",
)

model.summary()
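
To help read the summary, the parameter counts can be worked out by hand. This sketch assumes the standard LSTM formulation with four gates and bias terms, which to the best of my knowledge matches the Keras default; treat the numbers as a cross-check rather than a specification.

```python
n_neurons = 512  # LSTM units, as in model_config
n_features = 15  # len(XCOLS)
n_outputs = 2    # len(YCOLS)

# each of the 4 LSTM gates has input weights, recurrent weights, and a bias
lstm_params = 4 * (n_neurons * (n_features + n_neurons) + n_neurons)
# the dense output layer has one weight per (neuron, output) pair plus a bias per output
dense_params = n_neurons * n_outputs + n_outputs

print(lstm_params, dense_params, lstm_params + dense_params)
# 1081344 1026 1082370
```

Almost all of the capacity sits in the recurrent weights, so halving n_neurons cuts the parameter count by roughly a factor of four.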

In [None]:
history = model.fit(
    train_ds,
    # no `batch_size` here: the dataset is already batched
    epochs=model_config["n_epochs"],
    verbose=1,
    shuffle=False,
    validation_data=val_ds,
)

In [None]:
# plot training and validation loss per epoch
for name, values in history.history.items():
    plt.plot(values, label=name)
plt.xlabel("epoch")
plt.legend()

In [None]:
test_ds = timeseries_dataset_from_df(test, data_config["batch_size"])
mse = model.evaluate(test_ds)
print(f"Test RMSE: {mse**.5:.2f}")

In [None]:
import json
import pickle

model.save("model")

with open("scaler.pck", "wb") as f:
    pickle.dump(scaler, f)

data_config["solar_wind_subset"] = SOLAR_WIND_FEATURES
print(data_config)
with open("config.json", "w") as f:
    json.dump(data_config, f)