
# Data Processing
In this notebook we load and process the raw data to build the final dataset for the IBM-Z Datathon. We use three datasets to compile a list of all observed geo-effective CMEs from the post-SOHO era (1996-2024), and two datasets for the features and targets:

#### Geo-effective CMEs:
- The [Richardson and Cane list](https://izw1.caltech.edu/ACE/ASC/DATA/level3/icmetable2.htm); a list of near-Earth CMEs from 1996-2010.
- The [George Mason University CME/ICME list](http://solar.gmu.edu/heliophysics/index.php/GMU_CME/ICME_List); a list of geo-effective CMEs from 2007-2017.
- The [NASA CME Scoreboard](https://kauai.ccmc.gsfc.nasa.gov/CMEscoreboard/); a list of geo-effective CMEs from 2013-2024.

#### Features and Targets:
- The [SOHO-LASCO CME Catalogue](https://cdaw.gsfc.nasa.gov/CME_list/); a list of all CMEs observed from 1996-2024 containing information on physical quantities.
- [OMNIWeb Plus data](https://omniweb.gsfc.nasa.gov/); a list of features associated with the solar wind and sunspot numbers.


## Cleaning the data:

In [91]:
# Importing libraries:
# For data manipulation
import pandas as pd

# For data visualisation
import matplotlib.pyplot as plt


#### SOHO-LASCO & OMNIWeb Plus
We begin by loading the SOHO-LASCO Catalogue to obtain the physical quantities for all CMEs. The original dataset had 11 features. Mass and kinetic energy were missing for most CMEs, so these have been excluded. We have also excluded the second-order speeds, as these are correlated with the linear speed. As a result, this dataset contains the date and time of each CME, together with five features:
- Central Position Angle in degrees.
- Angular Width in degrees.
- Linear Speed in km/s.
- Acceleration in m/s$^2$.
- Measurement Position Angle in degrees.


In [95]:
# Adding filepaths as variables
cane_file_path = r"C:\Users\Yusuf\PycharmProjects\IBM-Z_Datathon_2024\data\RichardsonCane.csv"
gmu_file_path = r"C:\Users\Yusuf\PycharmProjects\IBM-Z_Datathon_2024\data\GMU.csv"
nasa_file_path = r"C:\Users\Yusuf\PycharmProjects\IBM-Z_Datathon_2024\data\NASA_Scoreboard.csv"
soho_file_path = r"C:\Users\Yusuf\PycharmProjects\IBM-Z_Datathon_2024\data\SOHO_LASCO.csv"
omniweb_file_path = r"C:\Users\Yusuf\PycharmProjects\IBM-Z_Datathon_2024\data\OMNIWeb.txt"

In [99]:
# Reading SOHO-LASCO dataset
soho_df = pd.read_csv(soho_file_path)
soho_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39285 entries, 0 to 39284
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Date          39285 non-null  object 
 1   Time          39285 non-null  object 
 2   CentralPA     39285 non-null  object 
 3   AngularWidth  39285 non-null  int64  
 4   LinearSpeed   39166 non-null  float64
 5   Accel         39285 non-null  object 
 6   MPA           39285 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 2.1+ MB


In [103]:
soho_df.head(100)

Unnamed: 0,Date,Time,CentralPA,AngularWidth,LinearSpeed,Accel,MPA
0,11/01/1996,00:14:36,267,18,499.0,-64.3*,272
1,13/01/1996,22:08:30,265,16,290.0,2.8*,266
2,15/01/1996,07:01:10,262,43,525.0,-31.1,272
3,22/01/1996,03:11:01,105,37,267.0,-126.3*,103
4,26/01/1996,09:16:19,90,27,262.0,1.9*,90
...,...,...,...,...,...,...,...
95,07/05/1996,02:24:06,263,22,62.0,6.2*,265
96,07/05/1996,05:24:58,95,40,154.0,-6.3*,93
97,07/05/1996,21:34:54,80,26,127.0,8.1*,81
98,09/05/1996,12:29:36,81,16,95.0,1.1*,82


After inspecting the dataset, we will do the following:
- Convert all missing values, labelled as "------", to "NaN".
- Reformat the Acceleration column by removing the asterisks and converting the values to numeric.
- Remove CMEs with an angular width below 90 degrees, as these are known to be unlikely to be geo-effective.
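
The three cleaning steps listed above can be sketched as follows. This is a minimal illustration rather than the notebook's actual implementation: the helper name `clean_soho` and the `min_width` parameter are hypothetical, and the column names `Accel` and `AngularWidth` are taken from the `soho_df.info()` output shown earlier. The small `sample` frame is made-up data used only to show the behaviour.

```python
import numpy as np
import pandas as pd


def clean_soho(df: pd.DataFrame, min_width: int = 90) -> pd.DataFrame:
    """Hypothetical helper applying the three cleaning steps to a SOHO-LASCO frame."""
    out = df.copy()

    # 1. Treat the "------" placeholder as a missing value.
    out = out.replace("------", np.nan)

    # 2. Strip the asterisk flags from Accel and convert the column to numeric.
    #    Anything that still cannot be parsed becomes NaN.
    accel = out["Accel"].astype(str).str.replace("*", "", regex=False)
    out["Accel"] = pd.to_numeric(accel, errors="coerce")

    # 3. Drop CMEs narrower than min_width degrees (keep widths >= min_width).
    out = out[out["AngularWidth"] >= min_width].reset_index(drop=True)
    return out


# Made-up rows for illustration only (not taken from the catalogue).
sample = pd.DataFrame(
    {
        "AngularWidth": [18, 360, 120],
        "LinearSpeed": [499.0, None, 800.0],
        "Accel": ["-64.3*", "------", "12.5"],
    }
)
cleaned = clean_soho(sample)
print(cleaned)
```

Stripping the asterisk discards whatever flag it represents in the catalogue; if that flag matters for later analysis, one option would be to store it in a separate boolean column before removing it.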