<a href="https://colab.research.google.com/github/wilmi94/MasterThesis-AE/blob/main/notebooks/sdo_bin_class.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Binary Classification of full-disk SDO/AIA Data

> This notebook is part of the Master Thesis *Predicting Coronal Mass Ejections using Machine Learning methods* by Wilmar Ender, FH Wiener Neustadt, 2023.

This notebook belongs to the Data Exploration phase (section 4.3 in the thesis) and applies a simple binary image classifier to the SDO/AIA dataset.

**Objectives:** \\
* explore the data and become familiar with the dataset
* develop helper functions and code for the subsequent prediction model
* test and apply "simple" binary classification models
* evaluate the models to get a performance baseline.

**Solar event list:** \\
The solar event list is taken from the following paper: \\
*Liu et al. 2020, Predicting Coronal Mass Ejections Using SDO/HMI Vector Magnetic Data Products and Recurrent Neural Networks* \\
This list/catalog consists of:
* 129 M- and X-class Flares that are associated with CMEs and
* 610 M- and X-class Flares that are **not** associated with CMEs.
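
Because the two classes are strongly imbalanced, a quick calculation helps to interpret accuracy values later on. The following sketch uses only the two counts listed above. The "balanced" weighting formula `n_total / (n_classes * n_class)` is one common heuristic and an assumption here, as is the mapping of label 0 to the negative and label 1 to the positive class.

```python
# Class counts taken from the event list above
n_pos = 129  # flares associated with a CME
n_neg = 610  # flares not associated with a CME
n_total = n_pos + n_neg

# Accuracy of a trivial classifier that always predicts "no CME"
majority_baseline = n_neg / n_total

# "Balanced" class weights (assumed heuristic): n_total / (n_classes * n_class)
class_weight = {
    0: n_total / (2 * n_neg),  # negative class
    1: n_total / (2 * n_pos),  # positive class
}

print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
print(f"class weights: {class_weight}")
```

On data with this class ratio, a model that does not exceed the majority-class baseline has learned nothing useful, no matter how high its accuracy looks in isolation.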

**Image Dataset:** \\
The image data is taken from the following paper: \\
*Ahmadzadeh et al. 2019, A Curated Image Parameter Data Set from the Solar Dynamics Observatory Mission*. \\
The data itself is accessed via *sdo-cli* (https://github.com/i4Ds/sdo-cli), developed by Marius Giger.


## Setting up the Notebook

In [3]:
%%capture
pip install -U sdo-cli

In [4]:
from pathlib import Path
import os
import requests
import subprocess
import shutil
import random

import numpy as np
import pandas as pd
from PIL import Image
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

# Import everything from tensorflow.keras to avoid mixing the standalone
# keras package with tf.keras, and to remove duplicate imports
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (
    Activation, BatchNormalization, Conv2D, Dense, Dropout, Flatten,
    GlobalAveragePooling2D, Input, MaxPooling2D,
)
from tensorflow.keras.applications import VGG16, InceptionResNetV2

In [5]:
# mount Google drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
# Change present working directory
%cd /content/drive/MyDrive/Academia/MSc. Aerospace Engineering - FH Wiener Neustadt/4. Master Thesis/03-Work/

/content/drive/MyDrive/Academia/MSc. Aerospace Engineering - FH Wiener Neustadt/4. Master Thesis/03-Work


In [7]:
!ls -a

 00_Dataset		   03_sdo_ConvLSTM			      .sdo-cli
 01_sdo_data_exploration   04_Tests
 02_sdo_binclass	  'Master Thesis-ML-Project-Checklist.gdoc'


## Helper Functions

In [53]:
def create_sdo_aia_dataset(output_dir, start_idx, event_list, dt, wavelength ):
  '''
  Download AIA images for a given wavelength and time interval with the help of sdo-cli.
  Note: this function can be adjusted such that the start time, end time and interval are set separately.
  input:
  output_dir = string, where the images should be saved (repository path)
  start_idx  = row index of event_list from which the query should start
  event_list = dataframe which provides CME data like start, peak and end time
  dt         = string, time step between images (if possible), e.g. '6min'
  wavelength = string, corresponding wavelength channel of AIA, e.g. '171' for the 171 Angström channel

  output:
  images (512x512) saved in the output_dir folder
  '''

  for idx in range(start_idx, event_list.shape[0]):
    start_time = event_list['Peak Time'][idx]
    end_time = start_time
    command = f"sdo-cli data download --path={output_dir} --start={start_time} --end={end_time} --freq={dt} --wavelength={wavelength}"
    subprocess.call(command, shell=True)
    print("\r", idx, ': downloading CME from ', start_time, end="")
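
The download call above builds one shell string, which can break when a path contains spaces (as the Google Drive working directory does). As a hedged alternative, the command can be assembled as a list of arguments. The flags are exactly those used in `create_sdo_aia_dataset`; the helper name `build_download_command` is hypothetical.

```python
import shlex

def build_download_command(output_dir, start_time, end_time, dt, wavelength):
    """Return the sdo-cli call as a list of arguments (no shell parsing needed)."""
    return [
        "sdo-cli", "data", "download",
        f"--path={output_dir}",
        f"--start={start_time}",
        f"--end={end_time}",
        f"--freq={dt}",
        f"--wavelength={wavelength}",
    ]

cmd = build_download_command(
    "00_Dataset/Liu2020_events/raw/positive",
    "2011-02-15T01:56:00", "2011-02-15T01:56:00", "10min", "171",
)
# shlex.join only renders the list for display; nothing is executed here
print(shlex.join(cmd))
```

Such a list can be passed to `subprocess.run(cmd, check=True)`, which raises `CalledProcessError` when the command exits with a non-zero status, so failed downloads do not pass silently.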

In [72]:
def compare_filenames_with_dataframe(directory, dataframe, wavelength):
  '''
  Compare the filenames within a folder with a provided list.

  input:
  directory  = string, where the images are saved (repository path)
  dataframe  = a dataframe where the events/images are stored
  wavelength = wavelength channel (only necessary for the filename)

  output:
  statistics = a dictionary where statistical data is stored
  df_missing = a dataframe with all the missing images
  '''


  file_end = '_' + str(wavelength) + '.jpeg'
  # Image files are named '<timestamp>__<wavelength>.jpeg'
  name_suffix = '__' + str(wavelength) + '.jpeg'
  # Get list of filenames from the directory
  directory_filenames = [filename for filename in os.listdir(directory) if filename.endswith(file_end)]

  # Get list of names from the DataFrame (the wavelength is no longer hard coded)
  df_check = dataframe.copy()
  df_check['Peak Time'] = pd.to_datetime(df_check['Peak Time']).dt.strftime('%Y-%m-%dT%H%M%S') + name_suffix

  dataframe_names = df_check['Peak Time'].tolist()  # the name of the image should correspond to the timestamp

  # Compare filenames
  common_filenames = set(directory_filenames) & set(dataframe_names)
  missing_filenames = set(dataframe_names) - set(directory_filenames)
  extra_filenames = set(directory_filenames) - set(dataframe_names)

  # Convert the missing filenames back into sdo-cli readable time stamps
  df_missing = pd.DataFrame(data=list(missing_filenames), columns=['Peak Time'])
  df_missing['Peak Time'] = df_missing['Peak Time'].str.replace(name_suffix, '', regex=False)
  df_missing['Peak Time'] = pd.to_datetime(df_missing['Peak Time'], format='%Y-%m-%dT%H%M%S').dt.strftime('%Y-%m-%dT%H:%M:%S')

  # Calculate statistics
  total_directory_files = len(directory_filenames)
  total_dataframe_names = len(dataframe_names)
  total_common_files = len(common_filenames)
  total_missing_files = len(missing_filenames)
  total_extra_files = len(extra_filenames)
  print('Total Directory Files: ', total_directory_files)
  print('Total DataFrame Names: ', total_dataframe_names)
  print('Common Files: ', total_common_files)
  print('Missing Files: ', total_missing_files)
  print('Extra Files: ', total_extra_files)

  statistics = {
      'Total Directory Files': total_directory_files,
      'Total DataFrame Names': total_dataframe_names,
      'Common Files': total_common_files,
      'Missing Files': total_missing_files,
      'Extra Files': total_extra_files,
      #'Common File Names': common_filenames,
      'Missing File Names': missing_filenames,
      'Extra File Names': extra_filenames
  }

  return statistics, df_missing
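
The comparison above boils down to mapping each time stamp to a filename and applying set arithmetic. The following minimal sketch shows this with the standard library only. The file names and the duplicated event are made up for illustration, and `expected_filename` is a hypothetical helper.

```python
from datetime import datetime

def expected_filename(peak_time, wavelength):
    """Map an event time stamp to the image filename pattern used above."""
    stamp = datetime.strptime(peak_time, "%Y-%m-%dT%H:%M:%S")
    return stamp.strftime("%Y-%m-%dT%H%M%S") + f"__{wavelength}.jpeg"

# Hypothetical example: three list entries (one duplicated) and two files on disk
event_times = ["2011-02-15T01:56:00", "2011-02-24T07:35:00", "2011-02-24T07:35:00"]
files_on_disk = {"2011-02-15T015600__171.jpeg", "2010-01-01T000000__171.jpeg"}

expected = [expected_filename(t, 171) for t in event_times]
common = set(expected) & files_on_disk    # listed and downloaded
missing = set(expected) - files_on_disk   # listed but not downloaded
extra = files_on_disk - set(expected)     # downloaded but not listed

print(len(expected), len(common), len(missing), len(extra))
```

Since sets drop duplicates, the number of common plus missing files can be smaller than the number of list entries whenever two events share the same peak time.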

## Get Solar Events

The following list (stored as a .csv file) from *Liu et al. 2020* holds all the events (129) and non-events (610) presented in the study. \\
First, the csv file is imported and stored as a pandas dataframe.


In [45]:
# load list with both (neg, pos) labels
df_events = pd.read_csv(r'00_Dataset/event_lists/all_cme_events.csv', delimiter =';')
# show number of events and display the first twelve samples
print('There are', df_events.shape[0], 'events stored in the list.\n')
df_events.head(12)

There are 739 events stored in the list.



Unnamed: 0,Flare Class,Start Time,Peak Time,End Time,Active Region Number,Harp Number,CME
0,M1.2,2010-05-05T17:13Z,2010-05-05T17:19Z,2010-05-05T17:22Z,11069,8,No
1,M1.0,2010-06-13T05:30Z,2010-06-13T05:39Z,2010-06-13T05:44Z,11079,49,No
2,M2.0,2010-06-12T00:30Z,2010-06-12T00:57Z,2010-06-12T01:02Z,11081,54,No
3,M1.0,2010-08-07T17:55Z,2010-08-07T18:24Z,2010-08-07T18:47Z,11093,115,No
4,M2.9,2010-10-16T19:07Z,2010-10-16T19:12Z,2010-10-16T19:15Z,11112,211,No
5,M5.4,2010-11-06T15:27Z,2010-11-06T15:36Z,2010-11-06T15:44Z,11121,245,No
6,M1.6,2010-11-04T23:30Z,2010-11-04T23:58Z,2010-11-05T00:12Z,11121,245,No
7,M1.0,2010-11-05T12:43Z,2010-11-05T13:29Z,2010-11-05T14:06Z,11121,245,No
8,M1.3,2011-01-28T00:44Z,2011-01-28T01:03Z,2011-01-28T01:10Z,11149,345,No
9,M1.9,2011-02-09T01:23Z,2011-02-09T01:31Z,2011-02-09T01:35Z,11153,362,No


As one can see, the list shows the Flare Class, the Start, Peak and End Time, and where the event happened in terms of the Active Region Number and Harp Number. The last column labels the event either as `No`, for a flare that was not associated with a CME, or with the time stamp of the associated subsequent CME (e.g. `2011-02-15T02:25:00-CME-001`).
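
The `CME` column therefore carries both the class label and a time stamp. A minimal sketch of how both could be extracted from a single row is given below. The helper names are hypothetical, the example values are taken from the event list, and it is assumed that flare and CME times refer to the same time zone.

```python
from datetime import datetime

def cme_label(cme_entry):
    """Return 1 if the flare is associated with a CME, otherwise 0."""
    return 0 if cme_entry == "No" else 1

def cme_delay_minutes(peak_time, cme_entry):
    """Minutes between flare peak and CME time stamp (None for non-CME flares)."""
    if cme_entry == "No":
        return None
    # The entry looks like '<time stamp>-CME-<counter>'; keep the time stamp only
    cme_time = datetime.strptime(cme_entry.split("-CME-")[0], "%Y-%m-%dT%H:%M:%S")
    peak = datetime.strptime(peak_time, "%Y-%m-%dT%H:%MZ")
    return (cme_time - peak).total_seconds() / 60

print(cme_label("No"), cme_label("2011-02-15T02:25:00-CME-001"))
print(cme_delay_minutes("2011-02-15T01:56Z", "2011-02-15T02:25:00-CME-001"))
```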

### Extract CME Events from the list

Extract all CME events from `df_events` and store them in a new dataframe called `df_cme_list`:





In [42]:
# store only cme events
df_cme_list = df_events.loc[df_events['CME'] != 'No']
# reindex the list
df_cme_list = df_cme_list.reset_index(drop=True)
# show number of events and display the first five samples
print('There are', df_cme_list.shape[0], 'CME events in the list.\n')
df_cme_list.head()

There are 129 CME events in the list.



Unnamed: 0,Flare Class,Start Time,Peak Time,End Time,Active Region Number,Harp Number,CME
0,X2.2,2011-02-15T01:44Z,2011-02-15T01:56Z,2011-02-15T02:06Z,11158,377,2011-02-15T02:25:00-CME-001
1,M3.5,2011-02-24T07:23Z,2011-02-24T07:35Z,2011-02-24T07:42Z,11163,392,2011-02-24T08:00:00-CME-001
2,M3.7,2011-03-07T19:43Z,2011-03-07T20:12Z,2011-03-07T20:58Z,11164,393,2011-03-07T20:12:00-CME-001
3,M2.0,2011-03-07T13:45Z,2011-03-07T14:30Z,2011-03-07T14:56Z,11166,401,2011-03-07T14:40:00-CME-001
4,M1.5,2011-03-08T03:37Z,2011-03-08T03:58Z,2011-03-08T04:20Z,11171,415,2011-03-08T05:00:00-CME-001


In the next step, the times (start, peak and end) have to be converted into a format that `sdo-cli` can read:

In [43]:
# convert time stamp such that sdo-cli can read them
df_cme_list['Start Time'] = pd.to_datetime(df_cme_list['Start Time']).dt.strftime('%Y-%m-%dT%H:%M:%S')
df_cme_list['Peak Time'] = pd.to_datetime(df_cme_list['Peak Time']).dt.strftime('%Y-%m-%dT%H:%M:%S')
df_cme_list['End Time'] = pd.to_datetime(df_cme_list['End Time']).dt.strftime('%Y-%m-%dT%H:%M:%S')
df_cme_list.head()

Unnamed: 0,Flare Class,Start Time,Peak Time,End Time,Active Region Number,Harp Number,CME
0,X2.2,2011-02-15T01:44:00,2011-02-15T01:56:00,2011-02-15T02:06:00,11158,377,2011-02-15T02:25:00-CME-001
1,M3.5,2011-02-24T07:23:00,2011-02-24T07:35:00,2011-02-24T07:42:00,11163,392,2011-02-24T08:00:00-CME-001
2,M3.7,2011-03-07T19:43:00,2011-03-07T20:12:00,2011-03-07T20:58:00,11164,393,2011-03-07T20:12:00-CME-001
3,M2.0,2011-03-07T13:45:00,2011-03-07T14:30:00,2011-03-07T14:56:00,11166,401,2011-03-07T14:40:00-CME-001
4,M1.5,2011-03-08T03:37:00,2011-03-08T03:58:00,2011-03-08T04:20:00,11171,415,2011-03-08T05:00:00-CME-001


### Extract all non-CME Events from the List
The same steps from above have to be done for the non-CME events:

In [44]:
# store only non-cme events
df_no_cme_list = df_events.loc[df_events['CME'] == 'No']
# reindex the list
df_no_cme_list = df_no_cme_list.reset_index(drop=True)
# convert time stamp such that sdo-cli can read them
df_no_cme_list['Start Time'] = pd.to_datetime(df_no_cme_list['Start Time']).dt.strftime('%Y-%m-%dT%H:%M:%S')
df_no_cme_list['Peak Time'] = pd.to_datetime(df_no_cme_list['Peak Time']).dt.strftime('%Y-%m-%dT%H:%M:%S')
df_no_cme_list['End Time'] = pd.to_datetime(df_no_cme_list['End Time']).dt.strftime('%Y-%m-%dT%H:%M:%S')
# show number of non-events and display the first five samples
print('There are', df_no_cme_list.shape[0], 'non-CME events in the list.\n')
df_no_cme_list.head()

There are 610 non-CME events in the list.



Unnamed: 0,Flare Class,Start Time,Peak Time,End Time,Active Region Number,Harp Number,CME
0,M1.2,2010-05-05T17:13:00,2010-05-05T17:19:00,2010-05-05T17:22:00,11069,8,No
1,M1.0,2010-06-13T05:30:00,2010-06-13T05:39:00,2010-06-13T05:44:00,11079,49,No
2,M2.0,2010-06-12T00:30:00,2010-06-12T00:57:00,2010-06-12T01:02:00,11081,54,No
3,M1.0,2010-08-07T17:55:00,2010-08-07T18:24:00,2010-08-07T18:47:00,11093,115,No
4,M2.9,2010-10-16T19:07:00,2010-10-16T19:12:00,2010-10-16T19:15:00,11112,211,No


## Download and Check Images

Next, the events have to be downloaded and saved in the appropriate folders. \\

First, the positive class, i.e. the CME events are downloaded:

In [70]:
# NOTE: once the image data has been downloaded, this cell can be commented out.
# Otherwise the data would be downloaded again every time the notebook is run from the beginning, which is not necessary.

data_path_pos = '00_Dataset/Liu2020_events/raw/positive'
#create_sdo_aia_dataset(output_dir = data_path_pos, start_idx = 0, event_list = df_cme_list, dt = '10min', wavelength = '171')

Check the content of the folder by comparing the image names (which are time stamps and the corresponding wavelength) with the corresponding list/dataframe.

In [77]:
result_raw_pos, missing_raw_pos = compare_filenames_with_dataframe(data_path_pos, df_cme_list, 171)
missing_raw_pos

Total Directory Files:  125
Total DataFrame Names:  129
Common Files:  125
Missing Files:  3
Extra Files:  0


Unnamed: 0,Peak Time
0,2014-11-07T04:25:00
1,2013-10-29T21:54:00
2,2014-11-06T03:46:00


In [74]:
create_sdo_aia_dataset(output_dir = data_path_pos, start_idx = 0, event_list = missing_raw_pos, dt = '10min', wavelength = '171')

 2 : downloading CME from  2014-11-06T03:46:00

The same is done for the negative class, i.e. flares which are not associated with CMEs:



In [None]:
# NOTE: once the image data has been downloaded, this cell can be commented out.
# Otherwise the data would be downloaded again every time the notebook is run from the beginning, which is not necessary.
data_path_neg = '00_Dataset/Liu2020_events/raw/negative/'
create_sdo_aia_dataset(output_dir = data_path_neg, start_idx = 0, event_list = df_no_cme_list, dt = '10min', wavelength = '171')

 525 : downloading CME from  2015-06-21T09:44:00

In [None]:
result_raw_neg, missing_raw_neg = compare_filenames_with_dataframe(data_path_neg, df_no_cme_list, 171)
missing_raw_neg

**SUMMARY**: \\
* It was not possible to download all the events for the following reasons:
  * SDO data from *Ahmadzadeh et al. 2019* is only available from February 2011 onwards. Hence XXXX events can't be considered.
  * ??? corrupted samples were detected where no image data is available.
* TODO: describe some solutions.
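
The number of events that fall before the availability window can be counted directly from the peak times. The sketch below does this for the ten sample rows of the event list shown earlier only, so it does not replace the open figure above. The cutoff day is an assumption, since only "February 2011" is stated.

```python
from datetime import datetime

# Assumed cutoff: only "February 2011" is known, the exact day is a placeholder
DATA_AVAILABLE_FROM = datetime(2011, 2, 1)

# Peak times of the ten sample rows of the event list shown earlier
peak_times = [
    "2010-05-05T17:19Z", "2010-06-13T05:39Z", "2010-06-12T00:57Z",
    "2010-08-07T18:24Z", "2010-10-16T19:12Z", "2010-11-06T15:36Z",
    "2010-11-04T23:58Z", "2010-11-05T13:29Z", "2011-01-28T01:03Z",
    "2011-02-09T01:31Z",
]

def is_available(peak_time):
    """True if the event lies inside the assumed availability window."""
    return datetime.strptime(peak_time, "%Y-%m-%dT%H:%MZ") >= DATA_AVAILABLE_FROM

too_early = [t for t in peak_times if not is_available(t)]
print(f"{len(too_early)} of {len(peak_times)} sample rows lie before the cutoff")
```

Applied to the full `Peak Time` column, the same check would give the number of events that cannot be downloaded for this reason.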

## Prepare Dataset -- see if this is needed!!!

Positive Class: flares which are associated with CMEs

Negative Class: flares which are NOT associated with CMEs

#### Resize the Images


In [16]:
#!sdo-cli data resize --path='./data/raw/raw_512/positive/' --targetpath='./data/raw/raw_256/positive' --wavelength='171' --size=256

In [17]:
#!sdo-cli data resize --path='./data/raw/raw_512/negative/' --targetpath='./data/raw/raw_256/negative' --wavelength='171' --size=256

## Models: General Remarks

Since we deal with a very small dataset, some preliminary considerations have to be mentioned:
* Small datasets lead to overfitting: \\
One of the most relevant problems with ML models is the so-called bias-variance tradeoff, i.e. the balance between over-simplification (high bias) and paying too much attention to the training set, such that the model does not generalize well (high variance).
A model with low bias and high variance overfits the data. Conversely, a model with high bias and low variance underfits the data.
* Using simple models: \\
If only a limited amount of data is available, a simple model architecture is often better suited than an overly complex one. Additionally, regularization techniques can help to avoid overfitting.
* Combining models: \\
Combining several simple models can also help to reduce variance and improve generalization.
* Use data augmentation techniques: \\
As the name suggests, data augmentation techniques take the original data and augment it with modified copies. Examples for images are: rotating the image, zooming in or out, flipping the image horizontally or vertically, changing the brightness, etc.

These considerations (and more) have to be made in advance to generate a useful model for a tiny dataset like the present one.
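
As an illustration of the "combining models" idea, the following sketch averages the predicted probabilities of several models and thresholds the result. It is a minimal example in plain Python; the probabilities are made-up numbers and `ensemble_predict` is a hypothetical helper, not part of the models trained below.

```python
def ensemble_predict(probabilities_per_model, threshold=0.5):
    """Average the predicted CME probabilities of several models and threshold them."""
    n_models = len(probabilities_per_model)
    n_samples = len(probabilities_per_model[0])
    averaged = [
        sum(model_probs[i] for model_probs in probabilities_per_model) / n_models
        for i in range(n_samples)
    ]
    labels = [1 if p >= threshold else 0 for p in averaged]
    return averaged, labels

# Hypothetical sigmoid outputs of three small models for four images
model_outputs = [
    [0.9, 0.2, 0.6, 0.4],
    [0.7, 0.1, 0.4, 0.3],
    [0.8, 0.3, 0.7, 0.2],
]
averaged, labels = ensemble_predict(model_outputs)
print(averaged, labels)
```

For the third image the individual models disagree (0.6, 0.4, 0.7), while the average gives one smoothed decision. This is the variance reduction the remark above refers to.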

## Model 1: "Keras tutorial"

### Prepare the Dataset

In [18]:
#%cd /content/drive/MyDrive/Academia/MSc. Aerospace Engineering - FH Wiener Neustadt/4. Master Thesis/03-Work/sdo_binclass/data/raw/

In [19]:
#!ls raw_512

In [20]:
# num_skipped = 0
# for folder_name in ("negative", "positive"):
#     folder_path = os.path.join("raw_512", folder_name)
#     for fname in os.listdir(folder_path):
#         fpath = os.path.join(folder_path, fname)
#         try:
#             fobj = open(fpath, "rb")
#             is_jfif = tf.compat.as_bytes("JFIF") in fobj.peek(10)
#         finally:
#             fobj.close()

#         if not is_jfif:
#             num_skipped += 1
#             # Delete corrupted image
#             os.remove(fpath)

# print("Deleted %d images" % num_skipped)

FileNotFoundError: ignored

In [None]:
image_size = (180, 180)
batch_size = 10

train_ds, val_ds = tf.keras.utils.image_dataset_from_directory(
    "raw_512",
    validation_split=0.2,
    subset="both",
    seed=1337,
    image_size=image_size,
    batch_size=batch_size,
)

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 10))
for images, labels in train_ds.take(1):
    for i in range(9):
        ax = plt.subplot(3, 3, i + 1)
        plt.imshow(images[i].numpy().astype("uint8"))
        plt.title(int(labels[i]))
        plt.axis("off")

In [None]:
from tensorflow.keras import layers
data_augmentation = keras.Sequential(
    [
        layers.RandomFlip("horizontal"),
        layers.RandomRotation(0.1),
    ]
)

In [None]:
plt.figure(figsize=(10, 10))
for images, _ in train_ds.take(1):
    for i in range(9):
        augmented_images = data_augmentation(images)
        ax = plt.subplot(3, 3, i + 1)
        plt.imshow(augmented_images[0].numpy().astype("uint8"))
        plt.axis("off")

In [None]:
inputs = keras.Input(shape=image_size + (3,))  # add the channel dimension (RGB)
x = data_augmentation(inputs)
x = layers.Rescaling(1./255)(x)

In [None]:
# Apply `data_augmentation` to the training images.
train_ds = train_ds.map(
    lambda img, label: (data_augmentation(img), label),
    num_parallel_calls=tf.data.AUTOTUNE,
)
# Prefetching samples in GPU memory helps maximize GPU utilization.
train_ds = train_ds.prefetch(tf.data.AUTOTUNE)
val_ds = val_ds.prefetch(tf.data.AUTOTUNE)

### Model

In [None]:
def make_model(input_shape, num_classes):
    inputs = keras.Input(shape=input_shape)

    # Entry block
    x = layers.Rescaling(1.0 / 255)(inputs)
    x = layers.Conv2D(128, 3, strides=2, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)

    previous_block_activation = x  # Set aside residual

    for size in [256, 512, 728]:
        x = layers.Activation("relu")(x)
        x = layers.SeparableConv2D(size, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)

        x = layers.Activation("relu")(x)
        x = layers.SeparableConv2D(size, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)

        x = layers.MaxPooling2D(3, strides=2, padding="same")(x)

        # Project residual
        residual = layers.Conv2D(size, 1, strides=2, padding="same")(
            previous_block_activation
        )
        x = layers.add([x, residual])  # Add back residual
        previous_block_activation = x  # Set aside next residual

    x = layers.SeparableConv2D(1024, 3, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)

    x = layers.GlobalAveragePooling2D()(x)
    if num_classes == 2:
        activation = "sigmoid"
        units = 1
    else:
        activation = "softmax"
        units = num_classes

    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(units, activation=activation)(x)
    return keras.Model(inputs, outputs)
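
The branch on `num_classes` picks a single sigmoid unit for the binary case instead of a two-unit softmax. Both describe the same probability, which the following small check illustrates in plain Python. The logit value is an arbitrary example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    shifted = [v - max(logits) for v in logits]  # subtract the max for numerical stability
    exps = [math.exp(v) for v in shifted]
    total = sum(exps)
    return [e / total for e in exps]

z = 1.3  # arbitrary example logit of the single output unit
p_sigmoid = sigmoid(z)
p_softmax = softmax([0.0, z])[1]  # two-unit head with logits (0, z)
print(p_sigmoid, p_softmax)
```

The two heads expect different label formats, though: a sigmoid unit is trained with `binary_crossentropy` on 0/1 labels, whereas a softmax head is trained with `categorical_crossentropy` on one-hot labels.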



In [None]:
model = make_model(input_shape=image_size + (3,), num_classes=2)
#keras.utils.plot_model(model, show_shapes=True)

### Train and Test

In [None]:
epochs = 25

callbacks = [
    keras.callbacks.ModelCheckpoint("save_at_{epoch}.keras"),
]
model.compile(
    optimizer=keras.optimizers.Adam(1e-3),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
model.fit(
    train_ds,
    epochs=epochs,
    callbacks=callbacks,
    validation_data=val_ds,
)

### Evaluate
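
Given the class imbalance, accuracy alone says little about the model. The following minimal sketch derives a few scores from the confusion matrix in plain Python; the labels are made up and `binary_metrics` is a hypothetical helper. In practice, `y_pred` would be obtained by thresholding the model output on the validation set.

```python
def binary_metrics(y_true, y_pred):
    """Confusion-matrix counts and derived scores for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return {
        "tp": tp, "tn": tn, "fp": fp, "fn": fn,
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "balanced_accuracy": (recall + specificity) / 2,
    }

# Hypothetical labels: a classifier that always predicts "no CME"
y_true = [0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

The always-negative classifier reaches a high accuracy here but a recall of zero, which is why recall and balanced accuracy are reported alongside accuracy.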

## Model 2: ConvNet binary classifier

### Prepare the Dataset:

Split image data (old approach): \\
The following function splits the downloaded data into three datasets, namely:


*   training
*   test
*   validation

with corresponding negative and positive samples.


In [None]:
def split_dataset(input_dir, output_dir, test_ratio, val_ratio, random_seed=None):
  '''
  - Replace 'input_dir' with the path to your 'dataset' folder and 'output_dir' with the desired output path.
  - The function will create 'train', 'test', and 'validation' folders inside the 'output_dir' and split the data accordingly.
  - Don't forget to specify the 'test_ratio', 'val_ratio', and 'random_seed' if desired.
  '''

  if random_seed is not None:
      random.seed(random_seed)

  # Create output directories for train, test, and validation
  train_dir = os.path.join(output_dir, 'train')
  test_dir = os.path.join(output_dir, 'test')
  val_dir = os.path.join(output_dir, 'validation')

  os.makedirs(train_dir, exist_ok=True)
  os.makedirs(test_dir, exist_ok=True)
  os.makedirs(val_dir, exist_ok=True)

  # Get the list of class folders in the input directory
  class_folders = os.listdir(input_dir)

  for class_folder in class_folders:
      # Get the path to the class folder
      class_path = os.path.join(input_dir, class_folder)

      # Get the list of image filenames in the class folder
      image_filenames = os.listdir(class_path)

      # Shuffle the image filenames randomly
      random.shuffle(image_filenames)

      # Calculate the number of images for testing and validation
      print('For the ', class_folder, 'class:')
      num_images = len(image_filenames)
      print('There are in total', num_images, ' images.')

      num_test_images = int(num_images * test_ratio)
      print('With ', num_test_images, 'images for testing, and')

      num_val_images = int(num_images * val_ratio)
      print(num_val_images, 'images for validation.\n')
      print('===================================\n')

      # Split the image filenames into train, test, and validation sets
      train_images = image_filenames[num_test_images + num_val_images:]
      test_images = image_filenames[:num_test_images]
      val_images = image_filenames[num_test_images:num_test_images + num_val_images]

      # Copy images to the corresponding directories
      for image_filename in train_images:
          src = os.path.join(class_path, image_filename)
          dst = os.path.join(train_dir, class_folder, image_filename)
          os.makedirs(os.path.dirname(dst), exist_ok=True)
          shutil.copy(src, dst)

      for image_filename in test_images:
          src = os.path.join(class_path, image_filename)
          dst = os.path.join(test_dir, class_folder, image_filename)
          os.makedirs(os.path.dirname(dst), exist_ok=True)
          shutil.copy(src, dst)

      for image_filename in val_images:
          src = os.path.join(class_path, image_filename)
          dst = os.path.join(val_dir, class_folder, image_filename)
          os.makedirs(os.path.dirname(dst), exist_ok=True)
          shutil.copy(src, dst)
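
To see how many images end up in each subset, the index arithmetic of `split_dataset` can be reproduced without touching any files. The class sizes below are the list sizes from the event catalog and serve only as an illustration, since the number of images actually downloaded may be smaller. `split_counts` is a hypothetical helper.

```python
def split_counts(num_images, test_ratio, val_ratio):
    """Reproduce the index arithmetic of split_dataset for one class folder."""
    num_test = int(num_images * test_ratio)   # int() rounds down
    num_val = int(num_images * val_ratio)
    num_train = num_images - num_test - num_val
    return num_train, num_test, num_val

# Illustrative class sizes (train, test, validation counts are printed per class)
for name, n in [("positive", 129), ("negative", 610)]:
    print(name, split_counts(n, test_ratio=0.15, val_ratio=0.15))
```

Because each class folder is split separately with the same ratios, the class proportions stay roughly the same in all three subsets.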


In [None]:
input_dir = './data/raw/raw_256/'
output_dir = './data/datasets/Test_1/'
test_ratio = 0.15
val_ratio = 0.15
random_seed = 42  # Optional, set to None for random shuffling each time.
split_dataset(input_dir, output_dir, test_ratio, val_ratio, random_seed)

Define parameters and data generators

In [None]:
# Set random seed for reproducibility
random.seed(42)
np.random.seed(42)
tf.random.set_seed(42)

# Define constants
dataset_path = './data/datasets/Test_1/'
input_shape = (256, 256, 3)
num_classes = 2
batch_size = 32
epochs = 20
train_ratio = 0.7
test_ratio = 0.15
val_ratio = 0.15

# Create data generators for training, test and validation
# (validation_split is not needed here: the subsets already live in separate folders)
train_datagen = ImageDataGenerator(rescale=1./255)

train_generator = train_datagen.flow_from_directory(
    os.path.join(dataset_path, 'train'),
    target_size=(input_shape[0], input_shape[1]),
    batch_size=batch_size,
    class_mode='binary'
)

test_generator = train_datagen.flow_from_directory(
    os.path.join(dataset_path, 'test'),
    target_size=(input_shape[0], input_shape[1]),
    batch_size=batch_size,
    class_mode='binary',
    shuffle=False
)

val_generator = train_datagen.flow_from_directory(
    os.path.join(dataset_path, 'validation'),
    target_size=(input_shape[0], input_shape[1]),
    batch_size=batch_size,
    class_mode='binary',
    shuffle=False
)
# train_generator = train_datagen.flow_from_directory(
#     './data/datasets/Test_1/',
#     target_size=(img_width, img_height),
#     batch_size=batch_size,
#     class_mode='binary',
#     subset='training'
# )

# validation_generator = train_datagen.flow_from_directory(
#     './data/datasets/Test_1/',
#     target_size=(img_width, img_height),
#     batch_size=batch_size,
#     class_mode='binary',
#     subset='validation'
# )

## Model 3: Wang et al. 2019, CME-CNN
*Wang et al. 2019, CME Arrival Time Prediction Using Convolutional Neural Network* (Yimin Wang, Jiajia Liu, Ye Jiang, Robert Erdélyi; The Astrophysical Journal, 2019)

See: https://github.com/yiminking/CME-CNN/blob/master/model_loader.py

### Prepare the Dataset

### Define the Model-Architecture

In [None]:
model = Sequential()
model.add(Conv2D(64, kernel_size=(11, 11), input_shape=(input_shape[0], input_shape[1],input_shape[2]), padding='same'))
model.add(BatchNormalization(momentum=0.7))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))
model.add(Conv2D(128, kernel_size=(11, 11), padding='same'))
model.add(BatchNormalization(momentum=0.7))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))
model.add(Conv2D(256, kernel_size=(11, 11), padding='same'))
model.add(BatchNormalization(momentum=0.7))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))
model.add(Flatten())
model.add(Dense(256))
model.add(Activation('relu'))
model.add(Dense(1))
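
To relate this architecture to the "use simple models" remark, the size of the first fully connected layer can be worked out by hand. The sketch below assumes the 256x256x3 input defined for Model 2 and uses only the layer settings listed above.

```python
import math

input_size = 256      # height and width, from input_shape = (256, 256, 3)
last_filters = 256    # filters of the third Conv2D layer
dense_units = 256     # units of the first Dense layer

# Conv2D with padding='same' (stride 1) keeps the spatial size; each
# MaxPooling2D(pool_size=(2, 2), padding='same') halves it (rounded up)
size = input_size
for _ in range(3):
    size = math.ceil(size / 2)

flatten_features = size * size * last_filters
dense_params = flatten_features * dense_units + dense_units  # weights + biases

print(size, flatten_features, dense_params)
```

The first `Dense` layer alone therefore holds tens of millions of parameters, compared with a training set of only a few hundred images.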

### Training and Testing

In [None]:
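# NOTE: Dense(1) above has no activation, so this model is trained as a regressor on the 0/1 labels (MSE loss)
model.compile(loss='mean_squared_error', optimizer=keras.optimizers.Adam(learning_rate=0.01), metrics=['accuracy'])
# The batch size is set by the generator and must not be passed to fit() again
history = model.fit(train_generator, epochs=20, verbose=1)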
# Save the trained model
model.save('./Models/cme_classifier_Model-1.h5')
model.summary()
tf.keras.utils.plot_model(model, to_file="./Models/Model-1_Wang.png", show_shapes=True)

### Evaluate

## Model 4

In [None]:
model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(input_shape[0], input_shape[1], input_shape[2])))
model.add(MaxPooling2D((2, 2)))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D((2, 2)))
model.add(Conv2D(128, (3, 3), activation='relu'))
model.add(MaxPooling2D((2, 2)))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(train_generator, epochs=epochs, validation_data=val_generator)

# Save the trained model
model.save('./models/cme_classifier_Model-2.h5')
model.summary()
#tf.keras.utils.plot_model(model, to_file="./data/datasets/Test_2/model-2.png", show_shapes=True)

## Model 5: Data Augmentation

In [None]:
# Step 1: Split data into train, test, and validation sets
#def split_data(input_dir, output_dir, test_ratio, val_ratio):
    # Use the split_data_into_sets function here
    # ... (the function we previously defined)

# Replace 'dataset_path' with the path to your 'dataset' folder.
# The function will create 'train', 'test', and 'validation' folders inside the 'dataset' folder.
# It will then split the data from 'positive' and 'negative' folders into these sets.
#dataset_path = './data/datasets/Test_1/'
#split_data(dataset_path, dataset_path, test_ratio, val_ratio)

# Step 2: Data Augmentation
train_data_gen = ImageDataGenerator(
    rescale=1.0/255.0,
    rotation_range=20,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode='nearest'
)

# No augmentation for test and validation data, only rescaling
test_data_gen = ImageDataGenerator(rescale=1.0/255.0)

# Step 3: Create Data Generators for Train, Test, and Validation Sets
# class_mode='categorical' yields one-hot labels, matching the softmax output of the model below
train_generator = train_data_gen.flow_from_directory(
    os.path.join(dataset_path, 'train'),
    target_size=(input_shape[0], input_shape[1]),
    batch_size=batch_size,
    class_mode='categorical'
)

test_generator = test_data_gen.flow_from_directory(
    os.path.join(dataset_path, 'test'),
    target_size=(input_shape[0], input_shape[1]),
    batch_size=batch_size,
    class_mode='categorical',
    shuffle=False
)

val_generator = test_data_gen.flow_from_directory(
    os.path.join(dataset_path, 'validation'),
    target_size=(input_shape[0], input_shape[1]),
    batch_size=batch_size,
    class_mode='categorical',
    shuffle=False
)

# Step 4: Build the CNN Model
model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=input_shape))
model.add(MaxPooling2D((2, 2)))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D((2, 2)))
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax'))

# Step 5: Compile and Train the Model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# fit() accepts generators directly; the older fit_generator() is deprecated.
history = model.fit(
    train_generator,
    steps_per_epoch=train_generator.samples // batch_size,
    epochs=epochs,
    validation_data=val_generator,
    validation_steps=val_generator.samples // batch_size
)

# Step 6: Evaluate the Model on the Test Set
test_loss, test_accuracy = model.evaluate(test_generator, steps=test_generator.samples // batch_size)
print("Test Accuracy:", test_accuracy)

# Save the trained model
model.save('cme_classifier.h5')
model.summary()  # Keras models have no plot() method; summary() prints the architecture