<a href="https://colab.research.google.com/github/sinahuss/solar-flare-prediction/blob/main/notebooks/solar_flare_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# C964 Capstone: Solar Flare Prediction and Analysis


## 1. Business Understanding

### Organizational Need

Space weather events, particularly solar flares, pose significant risks to critical infrastructure on Earth and in space. Organizations like NOAA's Space Weather Prediction Center require reliable early warning systems to protect:

- Satellite communications and GPS systems
- Power grids
- Astronauts and aircraft
- Radio communications

Current prediction methods rely heavily on human expertise and limited historical patterns, which may result in missed events or false alarms. These risks can lead to potentially billions of dollars in economic damage and disruptions to essential services.

### Project Goal

This project aims to develop a data product featuring a machine learning model that can predict the likelihood of solar flare events (C, M, or X-class) within a 24-hour period based on characteristics of sunspot regions. The model will give space weather forecasters early warning capability and improve flare prediction accuracy, reducing both false alarms and missed events.

### Success Criteria

The model's success will be measured by:

- High recall for X and M-class flares (the most dangerous events) to minimize missed warnings
- Balanced precision and recall to reduce false alarms while maintaining sensitivity
- Practical deployment feasibility for integration into existing space weather monitoring systems

This predictive capability would enable space weather agencies to provide more reliable warnings, allowing for better preparation and protection of critical infrastructure.


## 2. Data Understanding


### 2.1. Load Libraries and Data

Our solar flare prediction analysis begins with importing essential libraries and loading the sunspot dataset.

The dataset will be loaded from a public GitHub repository containing the Solar Flare Dataset from Kaggle, which provides the historical data needed to train our flare prediction model.

This dataset contains morphological characteristics of sunspot groups that solar physicists use to assess flare potential. The first few rows will be displayed to verify successful data loading and provide an initial glimpse of the sunspot characteristics.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset from GitHub repository (public Kaggle dataset)
url = "https://raw.githubusercontent.com/sinahuss/solar-flare-prediction/refs/heads/main/data/data.csv"
df = pd.read_csv(url)

# Display first few rows to verify successful loading
df.head()

### 2.2. Dataset Feature Descriptions

The dataset contains 13 features describing each solar active region. The first 10 are the input features for our model, and the last three are the target variables we aim to predict.

**Input Features:**

- **modified Zurich class:** A classification of the sunspot group's magnetic complexity, generally ordered from least to most complex (A, B, C, D, E, F, H).
- **largest spot size:** Size of the largest spot in the group, ordered from smallest to largest (X, R, S, A, H, K).
- **spot distribution:** Compactness of the sunspot group, ordered from least to most compact (X, O, I, C).
- **activity:** A code representing the region's recent growth (1=decay, 2=no change).
- **evolution:** Describes the region's evolution over the last 24 hours (1=decay, 2=no growth, 3=growth).
- **previous 24 hour flare activity:** A code summarizing prior flare activity (1 = none, 2 = one M1, 3 = more than one M1).
- **historically-complex:** A flag indicating if the region was ever historically complex (1=Yes, 2=No).
- **became complex on this pass:** A flag indicating if the region became complex on its current transit (1=Yes, 2=No).
- **area:** A code for the total area of the sunspot group (1=small, 2=large).
- **area of largest spot:** A code for the area of the largest individual spot (1 = ≤5, 2 = >5).
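
The orderings listed above for the letter-coded features can be made explicit in pandas with an ordered categorical. This is only a minimal sketch: the `observed` values are hypothetical and are not rows from the dataset.

```python
import pandas as pd

# Order taken from the feature description above (smallest to largest)
spot_size_order = ["X", "R", "S", "A", "H", "K"]

# Hypothetical observations, for illustration only
observed = ["K", "X", "S"]

spot_size = pd.Categorical(observed, categories=spot_size_order, ordered=True)

# Integer position of each value within the declared order
codes = spot_size.codes.tolist()

# Because the categorical is ordered, min/max follow the declared order,
# not alphabetical order
largest = spot_size.max()

print(codes, largest)
```

An ordered categorical keeps the original letters readable while still allowing comparisons such as "larger than S".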

**Target Variables:**

- **common flares:** The number of **C-class** flares produced in the next 24 hours.
- **moderate flares:** The number of **M-class** flares produced in the next 24 hours.
- **severe flares:** The number of **X-class** flares produced in the next 24 hours.


### 2.3. Initial Data Inspection

A foundational understanding of the dataset's structure and quality must be established. This inspection is critical for the solar flare prediction model because data quality directly impacts model performance and reliability for space weather forecasting.

First, we will use `.info()` to examine the column names, data types, and check for any missing values. The output confirms that there are no missing values, meaning that null values do not have to be accounted for in the data preparation phase.

Next, we use `describe()` to generate a summary of the categorical features, including their unique values and most frequent entries, which helps us understand the distribution and composition of the dataset's categorical variables.
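
As a complement to reading the `.info()` output by eye, the absence of missing values can also be checked programmatically. The sketch below uses a hypothetical three-row `sample` frame rather than the real dataset; the same calls would apply to `df`.

```python
import pandas as pd

# Hypothetical three-row sample, not the real dataset
sample = pd.DataFrame(
    {
        "modified Zurich class": ["C", "D", "H"],
        "area": [1, 2, 1],
        "common flares": [0, 1, 0],
    }
)

# Count missing values per column, then in total
missing_per_column = sample.isna().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print(f"Total missing values: {total_missing}")
```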


In [None]:
df.info()

df.astype("object").describe().transpose()

### 2.4. Exploratory Data Analysis


#### 2.4.1. Target Variable Analysis

Before analyzing the input features, we must first understand the distribution of our target variables: `common flares`, `moderate flares`, and `severe flares`. The plots below show the number of 24-hour periods in the dataset that recorded zero, one, two, or more flares of each type.

**Key Findings:**

The visualization reveals a severe class imbalance for our solar flare prediction. Out of all 24-hour periods available, only 15% experienced at least one C-Class event, 5% recorded M-Class events, and 1% showed X-Class events.
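
Shares like these can be computed as the fraction of rows with a count above zero. The following is a sketch on a hypothetical ten-row `sample` frame, so its numbers are illustrative and do not reproduce the percentages quoted above.

```python
import pandas as pd

# Hypothetical ten-row sample, for illustration only
sample = pd.DataFrame(
    {
        "common flares": [0, 0, 1, 0, 2, 0, 0, 0, 0, 0],
        "moderate flares": [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
        "severe flares": [0] * 10,
    }
)

# (sample > 0) is True where at least one flare was recorded;
# the mean of a boolean column is the share of True values
share_with_flare = (sample > 0).mean()

print(share_with_flare)
```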

This imbalance has several implications for our machine learning approach:

1. **Model Selection:** Traditional accuracy metrics will be misleading due to the dominance of the "no flare" class. We must focus on precision, recall, and F1-score for the minority classes.

2. **Sampling Strategy:** We may need to employ techniques like SMOTE, class weights, or stratified sampling to address the imbalance.

3. **Evaluation Metrics:** The model's success will be measured primarily by its ability to correctly identify the rare but dangerous M and X-class flares, rather than overall accuracy.

4. **Business Impact:** Missing an X-class flare (false negative) is far more costly than incorrectly predicting one (false positive), making recall for severe flares our primary optimization target.
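
The first two points can be illustrated with a small sketch. It assumes hypothetical labels with a 1% positive rate (mirroring the X-class share above) and a baseline that never predicts a flare. The helper functions are written for this example and are not part of any library. The weighting formula, `n_samples / (n_classes * class_count)`, is one common "balanced" heuristic, not necessarily the scheme this project will use.

```python
from collections import Counter

# Hypothetical labels: 1,000 periods, 10 of them with a flare
y_true = [1] * 10 + [0] * 990

# Trivial baseline that never predicts a flare
y_pred = [0] * len(y_true)


def accuracy(actual, predicted):
    """Share of predictions that match the actual label."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def recall(actual, predicted, positive=1):
    """Share of actual positives that were predicted as positive."""
    true_pos = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    actual_pos = sum(a == positive for a in actual)
    return true_pos / actual_pos if actual_pos else 0.0


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


baseline_accuracy = accuracy(y_true, y_pred)
baseline_recall = recall(y_true, y_pred)
weights = balanced_class_weights(y_true)

print(f"accuracy={baseline_accuracy:.2f}, recall={baseline_recall:.2f}")
print(weights)
```

The baseline scores 99% accuracy while catching none of the flares, which is why recall on the minority classes is the more informative measure here.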


In [None]:
flare_columns = ["common flares", "moderate flares", "severe flares"]

# Create a figure with 3 subplots, one for each flare type
fig, axes = plt.subplots(1, 3, figsize=(16, 5), sharey=True)
fig.suptitle("Distribution of Raw Flare Counts Per 24-Hour Period")

# Loop through each flare type and plot its distribution
for i, col in enumerate(flare_columns):
    ax = axes[i]
    sns.countplot(
        data=df, x=col, ax=ax, hue=col, palette="viridis", legend=False
    )
    ax.set_title(f"Distribution of {col}")
    ax.set_xlabel("Flares Recorded")
    for container in ax.containers:
        ax.bar_label(container, fmt="%d", label_type="edge", padding=2)

plt.tight_layout(rect=[0, 0, 1, 0.95])
plt.show()

#### 2.4.2. Feature Distribution Analysis

To understand the characteristics of sunspot regions that may influence solar flare activity, we analyze three categorical features in our dataset. These features represent key morphological properties of sunspot groups that solar physicists use to assess flare potential:

**Modified Zurich Class:** This classification system categorizes sunspot groups based on their magnetic complexity and structure. The classes range from simple (A, B, C) to complex (D, E, F) then decayed (H), with more complex classes generally associated with higher flare probability. Complex magnetic configurations create more opportunities for magnetic reconnection events that trigger solar flares.

**Largest Spot Size:** This feature indicates the maximum size of individual sunspots within a group, measured in millionths of the solar hemisphere. Larger spots often indicate stronger magnetic fields and greater potential for energy release through flares. The size categories help quantify the scale of magnetic activity in the region.

**Spot Distribution:** This describes the spatial arrangement of sunspots within a group, ranging from single spots to complex clusters. The distribution pattern reflects the underlying magnetic field structure and can indicate the likelihood of magnetic reconnection events that produce flares.

**Key Findings:**

Plotted in category order, the counts for each feature appear roughly bell-shaped, with the exception of the modified Zurich class, where the H class (decayed remnants of previous sunspot groups) is heavily represented. This gives the model many examples of low-activity regions, but the rarity of classes E and F may limit its ability to distinguish between different levels of high complexity.


In [None]:
# Define the categorical features to analyze
categorical_features = [
    "modified Zurich class",
    "largest spot size",
    "spot distribution",
]

# Specify the order for each categorical feature for consistent plotting
category_orders = {
    "modified Zurich class": ["B", "C", "D", "E", "F", "H"],
    "largest spot size": ["X", "R", "S", "A", "H", "K"],
    "spot distribution": ["X", "O", "I", "C"],
}

# Create a figure with 3 subplots, one for each categorical feature
fig, axes = plt.subplots(1, 3, figsize=(20, 6), sharey=True)
fig.suptitle("Distribution of Key Categorical Features")

# Loop through each categorical feature and plot its distribution
for i, feature in enumerate(categorical_features):
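    sns.countplot(
        ax=axes[i],
        data=df,
        x=feature,
        order=category_orders[feature],
        hue=feature,
        hue_order=category_orders[feature],  # keep colors aligned with bar order
        palette="viridis",
        legend=False,  # hue duplicates the x-axis, so a legend adds nothing
    )
    axes[i].set_title(f"Distribution of {feature}")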

plt.tight_layout(rect=[0, 0, 1, 0.95])
plt.show()

#### 2.4.3. Relationship Analysis

Having analyzed the features and targets individually, we now explore their relationship through two visualizations. These analyses investigate a key hypothesis: that more complex sunspot groups produce more significant flares.

**Total Flares Analysis:** The first subplot shows the total number of C, M, and X-class flares produced by each modified Zurich class, revealing which sunspot configurations are the most prolific sources of solar flares. 

**Average Flares Analysis:** The second subplot normalizes this data by showing the average number of flares per class instance, accounting for the different frequencies of each modified Zurich class in the dataset. This provides a more accurate assessment of flare risk per sunspot group.
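
The effect of this normalization can be seen in a small sketch. The `sample` frame below is hypothetical: two classes produce the same total number of flares, but from a different number of regions.

```python
import pandas as pd

# Hypothetical sample: four D-class regions and one F-class region
sample = pd.DataFrame(
    {
        "modified Zurich class": ["D", "D", "D", "D", "F"],
        "common flares": [1, 0, 0, 1, 2],
    }
)

# Total flares, number of regions, and average flares per region for each class
summary = sample.groupby("modified Zurich class")["common flares"].agg(
    total="sum", regions="count", average="mean"
)

print(summary)
```

Both classes total two flares here, but the average per region is 0.5 for D and 2.0 for F, which is the distinction the second subplot is meant to expose.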

**Key Findings:**

Together with the distribution of modified Zurich class above, the two visualizations show how the complexity of a sunspot group relates to the flares it produces.

B and C class sunspot regions are low complexity and produce the fewest solar flares, so they can be seen as low-risk regions. H class regions are decayed remnants of C, D, E, and F regions, and are also low-risk regions.

D class sunspot regions produce the highest total number of solar flares in the dataset. After normalizing by the number of regions, however, they produce considerably fewer flares per region than the E and F classes, so they can be categorized as medium-risk regions.

E and F class sunspots can be seen as high-risk regions. E class regions average just under one solar flare per class instance. F class regions produce a low total number of solar flares, but after adjusting for their lower representation in the dataset, they produce a high number of flares per region. F class regions also produce the highest number of X-class (severe) flares per region.

In [None]:
# Melt the dataframe to have a single column for flare type and another for the count
flare_counts_df = df.melt(
    id_vars=["modified Zurich class"],
    value_vars=["common flares", "moderate flares", "severe flares"],
    var_name="flare_type",
    value_name="count",
)

# Remove rows where flares have not occurred
flare_counts_df = flare_counts_df[flare_counts_df["count"] > 0]

# Calculate the number of sunspot groups for each Zurich class
zurich_class_counts = df["modified Zurich class"].value_counts().to_dict()

# Calculate the proportional number of flares (per Zurich class instance)
flare_counts_df["class_count"] = flare_counts_df["modified Zurich class"].map(zurich_class_counts)
flare_counts_df["count_per_class"] = flare_counts_df["count"] / flare_counts_df["class_count"]

# Create a figure with 2 subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 8))

# Create the Grouped Bar Plot
sns.barplot(
    data=flare_counts_df,
    x="modified Zurich class",
    y="count",
    hue="flare_type",
    estimator=sum,
    order=category_orders["modified Zurich class"],
    palette="viridis",
    errorbar=None,
    ax=ax1
)
ax1.set_title("Total Flares Produced by Sunspot Zurich Class")
ax1.set_xlabel("Modified Zurich Class")
ax1.set_ylabel("Total Number of Flares Recorded")
ax1.legend(title="Flare Type")

# Second subplot: Average Flares per Class Instance
sns.barplot(
    data=flare_counts_df,
    x="modified Zurich class",
    y="count_per_class",
    hue="flare_type",
    estimator=sum,
    order=category_orders["modified Zurich class"],
    palette="viridis",
    errorbar=None,
    ax=ax2
)
ax2.set_title("Average Number of Flares per Sunspot Zurich Class")
ax2.set_xlabel("Modified Zurich Class")
ax2.set_ylabel("Average Number of Flares per Class Instance")
ax2.legend(title="Flare Type")

plt.tight_layout(rect=[0, 0, 1, 0.95])
plt.show()

## 3. Data Preparation


### 3.1. Categorize and Count Flare Classes

The dataset tracks C, M, and X-class flares in three separate columns, representing the count of each event type. For this classification task, a single target variable is needed. A new column will be created called `flare_class` that categorizes each 24-hour period by the most significant flare produced by a sunspot group, following the standard hierarchy (X > M > C).
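
The X > M > C hierarchy can also be expressed without a row-wise function. The sketch below is an alternative formulation using `numpy.select`, shown on a hypothetical four-row `sample` frame. It relies on `numpy.select` returning the choice for the first condition that is true, so the conditions are listed from most to least severe.

```python
import numpy as np
import pandas as pd

# Hypothetical sample covering all four outcomes
sample = pd.DataFrame(
    {
        "common flares": [3, 1, 0, 0],
        "moderate flares": [0, 2, 1, 0],
        "severe flares": [0, 0, 1, 0],
    }
)

# Conditions are checked in order, so the most severe class wins
conditions = [
    sample["severe flares"] > 0,
    sample["moderate flares"] > 0,
    sample["common flares"] > 0,
]
labels = [3, 2, 1]  # X-class, M-class, C-class

# Rows matching no condition fall back to 0 (no flare)
sample["flare_class"] = np.select(conditions, labels, default=0)

print(sample["flare_class"].tolist())
```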


In [None]:
# Determine the highest flare class for each row
def get_flare_class(row):
    if row["severe flares"] > 0:
        return 3  # X-class
    elif row["moderate flares"] > 0:
        return 2  # M-class
    elif row["common flares"] > 0:
        return 1  # C-class
    else:
        return 0  # None


# Create a new target column
df["flare_class"] = df.apply(get_flare_class, axis=1)

print(df["flare_class"].value_counts())

### 3.2. Prevent Data Leakage

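The `flare_class` target was derived directly from the original flare count columns, so leaving them in the feature set would let the model read the answer from its inputs. They are dropped to prevent this data leakage.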
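
A simple guard against this kind of leakage is to check that none of the source columns remain in the feature matrix. This is a sketch on a hypothetical `sample` frame. The `features`, `target`, and `leaky_columns` names are illustrative and are not variables defined elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical sample that still contains the raw flare counts
sample = pd.DataFrame(
    {
        "area": [1, 2, 1],
        "common flares": [0, 1, 0],
        "moderate flares": [0, 0, 1],
        "severe flares": [0, 0, 0],
        "flare_class": [0, 1, 2],
    }
)

leaky_columns = ["common flares", "moderate flares", "severe flares"]

# Separate the inputs from the target, dropping the columns the target came from
features = sample.drop(columns=leaky_columns + ["flare_class"])
target = sample["flare_class"]

# Any overlap here would mean the model could see the answer in its inputs
remaining_leaks = set(leaky_columns) & set(features.columns)

print(list(features.columns), remaining_leaks)
```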


In [None]:
# Drop original flare columns to prevent data leakage
df.drop(columns=["common flares", "moderate flares", "severe flares"], inplace=True)

### 3.3. One-Hot Encode Categorical Features

The machine learning model requires all input features to be numerical. The dataset contains several categorical columns with object data types (e.g., 'modified Zurich class'). One-hot encoding will be used to convert these categorical features into a numerical format, creating new binary columns for each category. This is a crucial step to make the data suitable for the classification algorithm.
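
To make the transformation concrete, here is a sketch of one-hot encoding on a hypothetical three-row `sample` frame. Passing `dtype=int` is a choice made for this example so that the new columns hold 0/1 integers. Without it, recent pandas versions return boolean columns.

```python
import pandas as pd

# Hypothetical sample with one categorical and one numeric column
sample = pd.DataFrame(
    {
        "spot distribution": ["X", "O", "X"],
        "area": [1, 2, 1],
    }
)

# Each category becomes its own 0/1 column; other columns pass through unchanged
encoded = pd.get_dummies(sample, columns=["spot distribution"], dtype=int)

print(encoded)
```

Each row has a 1 in exactly one of the new `spot distribution_*` columns, matching its original category.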


In [None]:
# Identify the categorical columns to be encoded
categorical_cols = ["modified Zurich class", "largest spot size", "spot distribution"]

# Apply one-hot encoding using pandas get_dummies
df_encoded = pd.get_dummies(df, columns=categorical_cols)

# Display the first few rows to see the new columns
df_encoded.head()