
# Data Preprocessing
In this notebook we load and process the raw data to build the final dataset for the IBM-Z Datathon. We use three datasets to compile a list of all observed geoeffective CMEs from the post-SOHO era (1996-2024), and two further datasets for the features and targets:

#### Geoeffective CMEs:
- The [Richardson and Cane list](https://izw1.caltech.edu/ACE/ASC/DATA/level3/icmetable2.htm); a list of near-Earth CMEs from 1996-2024.
- The [George Mason University CME/ICME list](http://solar.gmu.edu/heliophysics/index.php/GMU_CME/ICME_List); a list of geoeffective CMEs from 2007-2017.
- The [NASA CME Scoreboard](https://kauai.ccmc.gsfc.nasa.gov/CMEscoreboard/); a list of geoeffective CMEs from 2013-2024.

#### Features and Targets:
- The [SOHO-LASCO CME Catalogue](https://cdaw.gsfc.nasa.gov/CME_list/); a list of all CMEs observed from 1996-2024 containing information on physical quantities.
- [OMNIWeb Plus data](https://omniweb.gsfc.nasa.gov/); a list of features associated with the solar wind and sunspot numbers.


## Cleaning the data:

In [170]:
# Importing libraries:
# For data manipulation
import pandas as pd

# For data visualisation
import matplotlib.pyplot as plt


#### SOHO-LASCO & OMNIWeb Plus
We begin by loading the SOHO-LASCO Catalogue to obtain the physical quantities for all CMEs. The original dataset had 11 features. Mass and kinetic energy were missing for most CMEs, so both have been excluded. We have also excluded the second-order speeds, as these are correlated with the linear speed. The resulting dataset contains the date and time of each CME, together with five features:
- Central Position Angle in degrees.
- Angular Width in degrees.
- Linear Speed in km/s.
- Acceleration in m/s$^2$.
- Measurement Position Angle in degrees.
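
The two exclusion criteria above (mostly missing values, and strong correlation with the linear speed) can be checked with a few lines of pandas. The following is a minimal sketch on a toy frame, not the procedure actually used for the catalogue: the column names `SecondOrderSpeed` and `Mass`, the toy values, and both thresholds (0.5 and 0.95) are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw catalogue. Column names and values are
# illustrative assumptions, not the real SOHO-LASCO data.
toy = pd.DataFrame({
    "LinearSpeed": [499.0, 290.0, 525.0, 267.0, 262.0, 220.0],
    "SecondOrderSpeed": [510.0, 300.0, 530.0, 260.0, 270.0, 225.0],
    "Mass": [np.nan, np.nan, 1.2e15, np.nan, np.nan, np.nan],
    "AngularWidth": [18, 16, 43, 37, 27, 16],
})

# Criterion 1: fraction of missing values per column
missing_frac = toy.isna().mean()
mostly_missing = missing_frac[missing_frac > 0.5].index.tolist()

# Criterion 2: Pearson correlation of every other column with LinearSpeed
corr_with_speed = toy.corr()["LinearSpeed"].drop("LinearSpeed")
redundant = corr_with_speed[corr_with_speed.abs() > 0.95].index.tolist()

# Drop every column flagged by either criterion
to_drop = sorted(set(mostly_missing) | set(redundant))
reduced = toy.drop(columns=to_drop)

print(missing_frac)
print("Dropped:", to_drop)
```

Both thresholds are judgment calls, so it is worth printing `missing_frac` and `corr_with_speed` and inspecting them before dropping anything.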


In [171]:
# Adding filepaths as variables (change DATA_DIR to point at your local copy of the data)
from pathlib import Path

DATA_DIR = Path(r"C:\Users\Yusuf\PycharmProjects\IBM-Z_Datathon_2024\data")
cane_file_path = DATA_DIR / "RichardsonCane.csv"
gmu_file_path = DATA_DIR / "GMU.csv"
nasa_file_path = DATA_DIR / "NASA_Scoreboard.csv"
soho_file_path = DATA_DIR / "SOHO_LASCO.csv"
omniweb_file_path = DATA_DIR / "OMNIWeb.csv"

In [172]:
# Reading SOHO-LASCO dataset
soho_df = pd.read_csv(soho_file_path)
soho_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39285 entries, 0 to 39284
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Date          39285 non-null  object 
 1   Time          39285 non-null  object 
 2   CentralPA     39285 non-null  object 
 3   AngularWidth  39285 non-null  int64  
 4   LinearSpeed   39166 non-null  float64
 5   Accel         39285 non-null  object 
 6   MPA           39285 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 2.1+ MB


In [173]:
soho_df.head(200)

Unnamed: 0,Date,Time,CentralPA,AngularWidth,LinearSpeed,Accel,MPA
0,11/01/1996,00:14:36,267,18,499.0,-64.3*,272
1,13/01/1996,22:08:30,265,16,290.0,2.8*,266
2,15/01/1996,07:01:10,262,43,525.0,-31.1,272
3,22/01/1996,03:11:01,105,37,267.0,-126.3*,103
4,26/01/1996,09:16:19,90,27,262.0,1.9*,90
...,...,...,...,...,...,...,...
195,20/06/1996,13:53:43,73,16,220.0,0.8*,82
196,20/06/1996,17:49:03,105,38,80.0,6.6*,103
197,20/06/1996,19:18:00,1,6,91.0,-4.3*,360
198,21/06/1996,02:07:30,260,20,154.0,2.2*,265


After inspecting the dataset, we will do the following:
- Mark all placeholder values ("------" and "NaN") as missing.
- Convert Central PA values labelled as "Halo" to 360.
- Reformat the Acceleration column by removing asterisks.
- Convert all columns to numeric.
- Remove CMEs with an angular width below 90 degrees, since such narrow CMEs are unlikely to be geoeffective.

In [174]:
import numpy as np

# Step 1: Replace placeholder strings ('------' and 'NaN') with NaN.
# np.nan is used rather than None because older pandas versions treat
# replace(..., None) as a forward fill instead of inserting missing values.
soho_df = soho_df.replace(['------', 'NaN'], np.nan)

# Step 2: Convert Central PA values labelled as "Halo" to 360
soho_df['CentralPA'] = soho_df['CentralPA'].replace('Halo', 360)

# Step 3: Remove asterisks from the Acceleration column
soho_df['Accel'] = soho_df['Accel'].astype(str).str.replace('*', '', regex=False)

# Step 4: Convert all columns to numeric, except the first two (Date and Time)
cols_to_convert = soho_df.columns[2:]  # Keep first two columns (Date and Time) as object
soho_df[cols_to_convert] = soho_df[cols_to_convert].apply(pd.to_numeric, errors='coerce')

# Step 5: Keep only rows with an angular width of at least 90 degrees
# (.copy() so later edits to the filtered frame do not touch soho_df)
soho_filtered_df = soho_df[soho_df['AngularWidth'] >= 90].copy()

soho_filtered_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5873 entries, 12 to 39282
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Date          5873 non-null   object 
 1   Time          5873 non-null   object 
 2   CentralPA     5873 non-null   int64  
 3   AngularWidth  5873 non-null   int64  
 4   LinearSpeed   5867 non-null   float64
 5   Accel         5851 non-null   float64
 6   MPA           5873 non-null   int64  
dtypes: float64(2), int64(3), object(2)
memory usage: 367.1+ KB
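
Beyond reading the `info()` summary, the cleaning result can be checked programmatically. The sketch below is one possible sanity check and is not part of the original notebook: the helper `check_cleaned` is hypothetical, and the toy frame holds three rows taken from the filtered catalogue so the block runs on its own.

```python
import pandas as pd

NUMERIC_COLS = ["CentralPA", "AngularWidth", "LinearSpeed", "Accel", "MPA"]

def check_cleaned(df: pd.DataFrame, min_width: float = 90) -> None:
    """Raise ValueError if a cleaned frame violates the expected properties."""
    # Every feature column should be numeric after cleaning
    for col in NUMERIC_COLS:
        if not pd.api.types.is_numeric_dtype(df[col]):
            raise ValueError(f"{col} is not numeric")
    # No narrow CMEs should survive the angular-width filter
    if (df["AngularWidth"] < min_width).any():
        raise ValueError("rows with a narrow angular width remain")
    # Position angles are expressed in degrees, with 360 used for halo CMEs
    if not df["CentralPA"].between(0, 360).all():
        raise ValueError("CentralPA outside 0-360 degrees")

# Three rows from the filtered catalogue; Accel is missing in the second.
toy_clean = pd.DataFrame({
    "CentralPA": [180, 360, 94],
    "AngularWidth": [119, 360, 95],
    "LinearSpeed": [80.0, 65.0, 314.0],
    "Accel": [1.8, float("nan"), 0.7],
    "MPA": [164, 149, 70],
})

check_cleaned(toy_clean)  # passes silently
```

In the notebook itself, the same call would be `check_cleaned(soho_filtered_df)`.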


In [175]:
soho_filtered_df.head(200) 

Unnamed: 0,Date,Time,CentralPA,AngularWidth,LinearSpeed,Accel,MPA
12,02/02/1996,23:00:47,180,119,80.0,1.8,164
83,29/04/1996,14:38:48,360,360,65.0,,149
85,01/05/1996,08:41:46,94,95,314.0,0.7,70
188,18/06/1996,17:28:50,84,95,64.0,-0.4,79
285,20/07/1996,09:28:16,31,175,246.0,9.4,34
...,...,...,...,...,...,...,...
2561,07/11/1998,20:54:05,321,96,750.0,23.7,314
2565,08/11/1998,11:54:05,264,196,559.0,6.2,214
2569,09/11/1998,01:54:05,16,94,144.0,0.7,27
2573,09/11/1998,18:17:55,330,190,325.0,2.6,338
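
The `Date` column appears to be day-first (for example, `13/01/1996` in the raw catalogue can only be 13 January), while `Date` and `Time` are still stored as separate strings. If a single timestamp is needed later, for instance to match events against the other CME lists, the two columns can be combined as in the sketch below. The `Datetime` column name is an assumption, and the toy rows reuse values from the output above.

```python
import pandas as pd

# Two rows in the catalogue's Date/Time layout
toy = pd.DataFrame({
    "Date": ["02/02/1996", "29/04/1996"],
    "Time": ["23:00:47", "14:38:48"],
})

# An explicit format avoids day/month ambiguity for dates such as 01/05/1996
toy["Datetime"] = pd.to_datetime(
    toy["Date"] + " " + toy["Time"], format="%d/%m/%Y %H:%M:%S"
)

print(toy["Datetime"])
```

Passing `format` explicitly, rather than relying on automatic inference, makes pandas raise an error if any row does not follow the day-first layout, instead of silently swapping day and month.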
