<a href="https://colab.research.google.com/github/sinahuss/solar-flare-prediction/blob/main/notebooks/solar_flare_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# C964 Capstone: Solar Flare Prediction and Analysis

## 1. Business Understanding

### Organizational Need

Space weather events, particularly solar flares, pose significant risks to critical infrastructure on Earth and in space. Organizations like NOAA's Space Weather Prediction Center require reliable early warning systems to protect:

- **Satellite communications and GPS systems** that can be disrupted by solar radiation
- **Power grids** vulnerable to geomagnetic storms triggered by solar flares
- **Astronauts and aircraft** exposed to increased radiation during solar events
- **Radio communications** that can be severely affected by solar activity

Current prediction methods rely heavily on human expertise and limited historical patterns, often resulting in missed events or false alarms that can lead to unnecessary protective measures and associated costs.

### Project Goal

This project aims to develop a machine learning model that can predict the likelihood of solar flare events (C, M, or X-class) within a 24-hour period based on observable characteristics of sunspot regions. The model will provide:

- **Early warning capability** for space weather forecasters
- **Automated risk assessment** to supplement human expertise
- **Improved accuracy** in flare prediction to reduce false alarms and missed events

### Success Criteria

The model's success will be measured by:

- **High recall for X and M-class flares** (the most dangerous events) to minimize missed warnings
- **Balanced precision and recall** to reduce false alarms while maintaining sensitivity
- **Practical deployment feasibility** for integration into existing space weather monitoring systems

This predictive capability would enable space weather agencies to provide more reliable warnings, allowing for better preparation and protection of critical infrastructure.

## 2. Data Understanding

### 2.1. Load Libraries and Data

Import the necessary Python libraries and load the dataset from a public GitHub repository into a Pandas dataframe. Display the first few rows to verify that it has been loaded.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

url = 'https://raw.githubusercontent.com/sinahuss/solar-flare-prediction/refs/heads/main/data/data.csv'
df = pd.read_csv(url)

df.head()

### 2.2. Initial Data Inspection

Start with an initial inspection of the dataset to understand its basic structure. Use `.info()` to check the column names and data types and to look for missing values. Then use `.describe()` to get a statistical summary of the numerical features, which helps identify the scale and distribution of the data. The output from `.info()` confirms that there are no missing values, so no imputation or row removal is needed.


In [None]:
df.info()

df.describe()
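
The missing-value check can also be made explicit rather than read off the `.info()` output. The following is a minimal sketch, assuming a small hypothetical frame (`toy`) in place of the real dataset; the column names mirror ones used later in this notebook, but the values are invented and one gap is inserted on purpose.

```python
import pandas as pd

# Hypothetical stand-in for df; the values are invented for illustration.
toy = pd.DataFrame({
    "modified Zurich class": ["C", "D", "B", "H"],
    "largest spot size": ["S", "A", "X", None],  # one deliberate gap
    "common flares": [0, 1, 0, 2],
})

# Count missing entries per column; all zeros would mean nothing to impute.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Collapse to a single yes/no answer for the whole frame.
has_missing = bool(toy.isna().any().any())
print("Any missing values:", has_missing)
```

Running the same two calls on `df` should report zero for every column if the `.info()` reading is correct.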

### 2.3. Exploratory Data Analysis

#### 2.3.1. Correlation Matrix Heatmap

To understand the relationships between the different features of solar active regions, a correlation matrix heatmap will be generated. This visualization helps identify which features are strongly related, which could indicate redundancy, and which features might be most predictive of our target variable.

In [None]:
plt.figure(figsize=(10, 8))

sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='viridis')

plt.title('Feature Correlation Matrix')
plt.show()
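
A heatmap can be hard to scan once there are many features, so it may help to list the feature pairs by correlation strength as well. This is a minimal sketch on a hypothetical numeric frame (`toy`, with made-up columns `a`, `b`, `c`), not the notebook's actual data.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric features; the values are invented for illustration.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],  # perfectly correlated with "a"
    "c": [5, 3, 4, 1, 2],
})

corr = toy.corr(numeric_only=True)

# Keep only the upper triangle (k=1 skips the diagonal) so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Order the pairs by absolute correlation, strongest first.
ranked = pairs.reindex(pairs.abs().sort_values(ascending=False).index)
print(ranked)
```

Pairs near the top of `ranked` are candidates for redundancy; the same steps could be applied to `df.corr(numeric_only=True)`.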

#### 2.3.2. Class Imbalance Visualization

## 3. Data Preparation

### 3.1. Categorize and Count Flare Classes

The dataset tracks C, M, and X-class flares in three separate columns, each holding the count of that event type. For this classification task, a single target variable is needed. A new column, `flare_class`, will categorize each 24-hour period by the most significant flare produced by a sunspot group, following the standard hierarchy (X > M > C), with 0 indicating that no flare occurred.

In [None]:
# Determine the highest flare class for each row
def get_flare_class(row):
    if row['severe flares'] > 0:
        return 3 # X-class
    elif row['moderate flares'] > 0:
        return 2 # M-class
    elif row['common flares'] > 0:
        return 1 # C-class
    else:
        return 0 # None

# Create a new target column
df['flare_class'] = df.apply(get_flare_class, axis=1)

print(df['flare_class'].value_counts())
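
The row-wise `apply` above is easy to read but runs a Python function once per row. A vectorized alternative is `np.select`, sketched here on a hypothetical frame (`toy`) that reuses the three flare-count column names with invented values; it is an illustration of the same X > M > C hierarchy, not a replacement the notebook depends on.

```python
import numpy as np
import pandas as pd

# Hypothetical flare counts; the values are invented for illustration.
toy = pd.DataFrame({
    "common flares":   [0, 2, 1, 0],
    "moderate flares": [0, 0, 1, 0],
    "severe flares":   [0, 0, 0, 1],
})

# Conditions are checked in order, so the most severe class wins (X > M > C).
conditions = [
    toy["severe flares"] > 0,
    toy["moderate flares"] > 0,
    toy["common flares"] > 0,
]
labels = [3, 2, 1]  # 3 = X-class, 2 = M-class, 1 = C-class

# Rows matching no condition receive the default, 0 (no flare).
toy["flare_class"] = np.select(conditions, labels, default=0)
print(toy["flare_class"].tolist())
```

The third row has both a C-class and an M-class flare and is labeled 2, which matches the behavior of `get_flare_class`.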

### 3.2. Prevent Data Leakage

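The `flare_class` target was derived directly from the three original flare-count columns, so leaving them in the feature set would let the model read the answer from its inputs. They are dropped to prevent this data leakage.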

In [None]:
# Drop original flare columns to prevent data leakage
df.drop(columns=['common flares', 'moderate flares', 'severe flares'], inplace=True)

### 3.3. One-Hot Encode Categorical Features

The machine learning model requires all input features to be numerical. The dataset contains several categorical columns with object data types (e.g., 'modified Zurich class'). One-hot encoding will be used to convert these categorical features into a numerical format, creating new binary columns for each category. This is a crucial step to make the data suitable for the classification algorithm.

In [None]:
# Identify the categorical columns to be encoded
categorical_cols = ['modified Zurich class', 'largest spot size', 'spot distribution']

# Apply one-hot encoding using pandas get_dummies
df_encoded = pd.get_dummies(df, columns=categorical_cols)

# Display the first few rows to see the new columns
df_encoded.head()
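
To make the effect of one-hot encoding concrete, here is a minimal sketch on a hypothetical two-column frame (`toy`). The category codes are invented for illustration and are not claimed to match the dataset's actual values.

```python
import pandas as pd

# Hypothetical data; the category codes are invented for illustration.
toy = pd.DataFrame({
    "spot distribution": ["O", "I", "O", "X"],
    "flare_class": [0, 1, 0, 2],
})

# Each distinct category becomes its own indicator column.
encoded = pd.get_dummies(toy, columns=["spot distribution"])
print(encoded.columns.tolist())

# Every row has exactly one active indicator among the new columns.
indicator_cols = [c for c in encoded.columns if c.startswith("spot distribution_")]
active_per_row = encoded[indicator_cols].sum(axis=1)
print(active_per_row.tolist())
```

Columns that are not listed in `columns=` (here `flare_class`) pass through unchanged, and the new indicator columns are named `<original column>_<category>`.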