<a href="https://colab.research.google.com/github/sinahuss/solar-flare-prediction/blob/main/notebooks/solar_flare_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# C964 Capstone: Solar Flare Prediction and Analysis

### 1. Import Libraries

Import the necessary Python libraries.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### 2. Load Dataset

Load the dataset from a public GitHub repository into a pandas DataFrame. Display the first few rows to verify that it loaded correctly.

In [None]:
url = 'https://raw.githubusercontent.com/sinahuss/solar-flare-prediction/refs/heads/main/data/data.csv'
df = pd.read_csv(url)

df.head()

### 3. Initial Data Inspection

First, perform an initial inspection of the dataset to understand its basic structure. Use `.info()` to check the column names, data types, and non-null counts, which reveal any missing values. Then use `.describe()` to get a statistical summary of the numerical features, which helps in identifying the scale and distribution of the data.

In [None]:
df.info()

df.describe()
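
`.info()` reports non-null counts, so missing values have to be inferred by comparing them with the row count. A minimal sketch of a more direct check follows. The helper name `missing_summary` and the toy frame are hypothetical and stand in for `df`.

```python
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, worst first."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    return summary.sort_values("missing", ascending=False)

# Toy frame with hypothetical values; in the notebook, pass `df` instead
toy = pd.DataFrame({"area": [1, 2, None, 1], "activity": [1, 1, 2, 2]})
summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(df)` in the notebook would give the same table for the real dataset.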

### 4. Categorize and Count Flare Classes

The dataset tracks C-, M-, and X-class flares in three separate columns (`common flares`, `moderate flares`, and `severe flares`), each holding the count of that event type. The classification task needs a single target variable, so a new `flare_class` column will categorize each 24-hour period by the most significant flare produced by a sunspot group, following the standard hierarchy (X > M > C). Periods with no flares are labeled `'None'`.

In [None]:
# Determine the highest flare class for each row
def get_flare_class(row):
    if row['severe flares'] > 0:
        return 'X'
    elif row['moderate flares'] > 0:
        return 'M'
    elif row['common flares'] > 0:
        return 'C'
    else:
        return 'None'

# Create a new target column
df['flare_class'] = df.apply(get_flare_class, axis=1)

print(df['flare_class'].value_counts())
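
The row-wise `apply` above is easy to read but runs a Python function once per row. As an alternative sketch, the same X > M > C hierarchy can be expressed with `np.select`, which evaluates the conditions column-wise and takes the first match. The helper name `add_flare_class` and the toy frame are hypothetical. The column names are the ones used in the cell above.

```python
import numpy as np
import pandas as pd

def add_flare_class(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `frame` with a 'flare_class' column (X > M > C > None)."""
    # Conditions are checked in order, so the most severe class wins
    conditions = [
        frame["severe flares"] > 0,
        frame["moderate flares"] > 0,
        frame["common flares"] > 0,
    ]
    labels = ["X", "M", "C"]
    out = frame.copy()
    out["flare_class"] = np.select(conditions, labels, default="None")
    return out

# Toy frame with hypothetical counts; in the notebook, pass `df` instead
toy = pd.DataFrame({
    "common flares":   [0, 2, 1, 0],
    "moderate flares": [0, 0, 1, 1],
    "severe flares":   [0, 0, 0, 1],
})
labelled = add_flare_class(toy)
print(labelled["flare_class"].tolist())
```

Both approaches should produce identical labels. On a small dataset the speed difference is unlikely to matter.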

### 5. One-Hot Encode Categorical Features

The machine learning model requires all input features to be numerical. The dataset contains several categorical columns with object data types (e.g., 'modified Zurich class'). One-hot encoding converts each of these columns into a set of binary indicator columns, one per category, so the classification algorithm can use them without an artificial ordering being imposed on the categories.

In [None]:
# Identify the categorical columns to be encoded
categorical_cols = ['modified Zurich class', 'largest spot size', 'spot distribution']

# Apply one-hot encoding using pandas get_dummies
df_encoded = pd.get_dummies(df, columns=categorical_cols)

# Display the first few rows to see the new columns
df_encoded.head()
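
To make the effect of `pd.get_dummies` concrete, here is a small sketch on a toy frame. The category values are hypothetical and only illustrate the mechanics. `dtype=int` is an optional choice made here so the indicators print as 0/1 rather than booleans.

```python
import pandas as pd

# Toy frame with hypothetical category values
toy = pd.DataFrame({
    "modified Zurich class": ["C", "D", "C", "H"],
    "spot distribution": ["O", "I", "O", "X"],
    "activity": [1, 2, 1, 1],
})

encoded = pd.get_dummies(
    toy,
    columns=["modified Zurich class", "spot distribution"],
    dtype=int,  # emit 0/1 integers instead of the default dtype
)

# Untouched columns come first, then one indicator column per category
print(encoded.columns.tolist())
print(encoded)
```

Each original categorical column is replaced by columns named `<column>_<category>`, and every row has exactly one 1 within each group.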

### 6. Exploratory Data Analysis

#### 6.1. Correlation Matrix Heatmap

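To understand the relationships between the features of solar active regions, a correlation matrix heatmap will be generated. Strongly correlated feature pairs can indicate redundancy, and features that correlate with the flare-count columns are candidates for predicting the target variable. Only numeric and boolean columns enter the matrix, so the text-valued `flare_class` column is left out.

In [None]:
plt.figure(figsize=(12, 10))

# numeric_only=True skips the string-valued 'flare_class' column,
# which recent pandas versions cannot convert to numbers in .corr()
sns.heatmap(df_encoded.corr(numeric_only=True), annot=False, cmap='viridis')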

plt.title('Feature Correlation Matrix')
plt.show()
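
A heatmap with many one-hot columns can be hard to read by eye. As a complementary sketch, the strongest pairwise correlations can be listed numerically. The helper name `top_correlated_pairs` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(frame: pd.DataFrame, n: int = 5) -> pd.Series:
    """Return the n feature pairs with the largest absolute correlation."""
    corr = frame.corr(numeric_only=True).abs()
    # Keep only the upper triangle so each pair appears once
    # and the diagonal (self-correlation) is dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(ascending=False).head(n)

# Toy frame with hypothetical values; in the notebook, pass `df_encoded` instead
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],          # perfectly correlated with "a"
    "c": [4, 1, 3, 2],
    "label": ["x", "y", "x", "y"],  # non-numeric, ignored by numeric_only=True
})
pairs = top_correlated_pairs(toy)
print(pairs)
```

The result is a Series indexed by column pairs, which makes redundant features easier to spot than scanning the heatmap.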

#### 6.2. Class Imbalance Visualization

Next, a visualization of the distribution of our engineered `flare_class` target variable will be created. This is critical for identifying class imbalance, which is a common challenge in datasets where one class is much rarer than others. Understanding this imbalance is key to selecting the right evaluation metrics for our model later on.

In [None]:
plt.figure(figsize=(10, 6))

# Create a count plot for the engineered 'flare_class' target variable
sns.countplot(data=df_encoded, order=['None', 'C', 'M', 'X'], x='flare_class', hue='flare_class', palette='magma', legend=False)

plt.title('Distribution of Solar Flare Classes')
plt.ylabel('Number of Events')
plt.xlabel('Flare Class')
plt.show()
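
The count plot shows the imbalance visually. A short numeric summary can make it easier to quote. The sketch below computes each class's share and the ratio between the most and least frequent class. The helper name `class_balance` and the toy labels are hypothetical.

```python
import pandas as pd

def class_balance(labels: pd.Series) -> pd.DataFrame:
    """Return the count and share of each class, most frequent first."""
    counts = labels.value_counts()
    return pd.DataFrame({
        "count": counts,
        "share": (counts / counts.sum()).round(3),
    })

# Toy labels with hypothetical frequencies; in the notebook, pass df['flare_class']
toy_labels = pd.Series(["None"] * 8 + ["C"] * 3 + ["M"] * 1)
balance = class_balance(toy_labels)

# Ratio of the most common class to the rarest one
imbalance_ratio = balance["count"].max() / balance["count"].min()

print(balance)
print(f"Imbalance ratio: {imbalance_ratio:.1f}")
```

A large ratio suggests that plain accuracy would be dominated by the majority class, which is why per-class metrics are worth considering later.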