## 1. Data Loading and Initial Exploration

In this section, we'll load the Turkish earthquake dataset and perform initial exploratory analysis to understand its structure, size, and basic properties. We'll also check for missing values and outliers that might affect our analysis.

Our dataset contains earthquake records with magnitude >4.0 from AFAD (Disaster and Emergency Management Presidency), including:
- Geographic coordinates (Longitude, Latitude)
- Earthquake characteristics (Magnitude, Depth, Type)
- Temporal information (Date)
- Location descriptions
- Fault line data for contextual analysis

We'll also set up the necessary directory structure for organizing our output.
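
As a minimal sketch of that directory setup, the helper below builds every path with `pathlib` so the same notebook runs on Windows, macOS, and Linux. The function name `prepare_workspace` is hypothetical; the folder and file names are the ones used in this notebook.

```python
import tempfile
from pathlib import Path

def prepare_workspace(root, output_dirs=("maps", "models", "produced_data")):
    """Create the output folders under root and return the expected input paths."""
    root = Path(root)
    for name in output_dirs:
        (root / name).mkdir(parents=True, exist_ok=True)
    data_dir = root / "data"
    return {
        "earthquakes": data_dir / "earthquake_data.csv",
        "faults": data_dir / "tr_faults_imp.geojson",
    }

# Demonstrate in a throwaway directory so nothing is written to the project.
with tempfile.TemporaryDirectory() as tmp:
    paths = prepare_workspace(tmp)
    created = sorted(p.name for p in Path(tmp).iterdir() if p.is_dir())
    print(created)
```

Because `mkdir` is called with `exist_ok=True`, re-running the cell is harmless.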

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import folium
from folium.plugins import HeatMap
from datetime import datetime
import plotly.express as px
import geopandas as gpd
from shapely.geometry import Point, LineString
import math
import os

# Set visualization settings
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('viridis')
%matplotlib inline
plt.rcParams['figure.figsize'] = (12, 8)

# Load the earthquake dataset (os.path.join keeps the path portable across operating systems)
earthquake_df = pd.read_csv(os.path.join('data', 'earthquake_data.csv'))

# Load fault line data
fault_gdf = gpd.read_file(os.path.join('data', 'tr_faults_imp.geojson'))
print(f"Number of fault lines: {len(fault_gdf)}")
print(f"Available properties: {fault_gdf.columns.tolist()}")

# Display first few rows to understand the structure
earthquake_df.head()

In [None]:
# Basic information about the dataset
print(f"Dataset shape: {earthquake_df.shape}")
print(f"Number of earthquakes: {len(earthquake_df)}")
print("\nData types:")
print(earthquake_df.dtypes)

# Check for missing values
print("\nMissing values:")
print(earthquake_df.isnull().sum())

# Basic statistics (printed explicitly because this is not the last expression in the cell)
print("\nBasic statistics:")
print(earthquake_df.describe())

os.makedirs("maps", exist_ok=True)
os.makedirs("models", exist_ok=True)
os.makedirs("produced_data", exist_ok=True)

print("Created output directories: maps, models, produced_data")

## 2. Exploratory Data Analysis (EDA)

### 2.1 Temporal Feature Creation

Converting date information to structured temporal features is crucial for analyzing patterns over time. We'll extract year, month, day, and create seasonal variables to identify potential cyclical patterns in earthquake occurrences.

These temporal features let us check whether earthquake counts vary by season and whether their frequency has risen or fallen over the years.
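
As a minimal sketch of the parse-then-derive idea, the standard-library helpers below try several candidate formats on the raw string and then extract the same calendar features. The names `parse_timestamp` and `season_of`, the list of candidate formats, and the sample timestamp are illustrative assumptions, not part of the dataset specification.

```python
from datetime import datetime

# Candidate formats are assumptions; adjust them to the actual file.
CANDIDATE_FORMATS = ("%d/%m/%Y %H:%M:%S", "%d-%m-%Y %H:%M:%S", "%Y-%m-%d %H:%M:%S")

def parse_timestamp(text, formats=CANDIDATE_FORMATS):
    """Return a datetime for the first format that matches, else None."""
    for fmt in formats:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    return None

def season_of(month):
    """Map a month number (1-12) to a Northern Hemisphere season name."""
    if month in (12, 1, 2):
        return "Winter"
    if month in (3, 4, 5):
        return "Spring"
    if month in (6, 7, 8):
        return "Summer"
    return "Fall"

# Illustrative timestamp string; the first format fails, the second matches.
stamp = parse_timestamp("06-02-2023 04:17:00")
features = {
    "Year": stamp.year,
    "Month": stamp.month,
    "Day": stamp.day,
    "DayOfWeek": stamp.weekday(),  # Monday = 0, matching pandas' dt.dayofweek
    "Season": season_of(stamp.month),
}
print(features)
```

The key design point is that every fallback works on the original string: once a value has been coerced to a missing timestamp, the text needed for a retry is gone.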

In [None]:
# Keep the raw strings so that fallback parsers can retry the rows that fail
raw_dates = earthquake_df['Date'].astype(str)

# Convert Date column to datetime format with explicit format
earthquake_df['Date'] = pd.to_datetime(raw_dates, format="%d/%m/%Y %H:%M:%S", errors='coerce')

# Check if any dates couldn't be parsed
null_dates = earthquake_df['Date'].isnull().sum()
print(f"Number of dates that couldn't be parsed: {null_dates}")

# If we have null dates, retry only the failed rows with alternate formats
if null_dates > 0:
    print("Trying alternative date formats...")
    # Try another common format on the rows that are still unparsed
    failed = earthquake_df['Date'].isnull()
    earthquake_df.loc[failed, 'Date'] = pd.to_datetime(
        raw_dates[failed], format="%d-%m-%Y %H:%M:%S", errors='coerce')
    # If still having issues, fall back to day-first parsing of the remaining rows
    failed = earthquake_df['Date'].isnull()
    if failed.sum() > 0:
        earthquake_df.loc[failed, 'Date'] = pd.to_datetime(
            raw_dates[failed], dayfirst=True, errors='coerce')

    print(f"Remaining null dates after fixes: {earthquake_df['Date'].isnull().sum()}")

# Create additional time-based features
earthquake_df['Year'] = earthquake_df['Date'].dt.year
earthquake_df['Month'] = earthquake_df['Date'].dt.month
earthquake_df['Day'] = earthquake_df['Date'].dt.day
earthquake_df['DayOfWeek'] = earthquake_df['Date'].dt.dayofweek
earthquake_df['Season'] = earthquake_df['Month'].apply(lambda x: 
                                                     'Winter' if x in [12, 1, 2] else
                                                     'Spring' if x in [3, 4, 5] else
                                                     'Summer' if x in [6, 7, 8] else
                                                     'Fall')

# Display the updated dataframe
earthquake_df.head()

### 2.2 Geographic Visualization

Creating geographic visualizations is essential for understanding the spatial distribution of earthquakes across Turkey. We'll generate interactive maps showing:

1. Earthquake hotspots using heatmap visualization
2. Strong earthquakes (magnitude > 6) with detailed information
3. Fault lines with importance classification

These visualizations will help identify regions with higher seismic activity and their proximity to known fault lines.

In [None]:
# First check and clean coordinate data
print("Coordinate ranges before cleaning:")
print(f"Longitude: {earthquake_df['Longitude'].min()} to {earthquake_df['Longitude'].max()}")
print(f"Latitude: {earthquake_df['Latitude'].min()} to {earthquake_df['Latitude'].max()}")

# Filter out any extreme outliers (coordinates that are clearly wrong)
# Turkey spans roughly 26-45 E and 36-42 N; the bounds below add a small margin around that box
valid_coords = (
    (earthquake_df['Longitude'] >= 25) & 
    (earthquake_df['Longitude'] <= 45) & 
    (earthquake_df['Latitude'] >= 35) & 
    (earthquake_df['Latitude'] <= 43)
)

# Filter the dataframe to keep only valid coordinates
clean_df = earthquake_df[valid_coords].copy()
outliers_removed = len(earthquake_df) - len(clean_df)
print(f"Removed {outliers_removed} records with coordinates outside Turkey's boundaries")

print("Coordinate ranges after cleaning:")
print(f"Longitude: {clean_df['Longitude'].min()} to {clean_df['Longitude'].max()}")
print(f"Latitude: {clean_df['Latitude'].min()} to {clean_df['Latitude'].max()}")

# Create a map centered on Turkey
turkey_map = folium.Map(location=[38.5, 35.5], zoom_start=6)

# Sample points for better visualization performance
sample_df = clean_df.sample(min(2000, len(clean_df)))

# Create a heatmap layer with cleaned data ([lat, lon] pairs, built without a row loop)
heat_data = sample_df[['Latitude', 'Longitude']].values.tolist()
HeatMap(heat_data, radius=8, gradient={'0.4': 'blue', '0.6': 'cyan', '0.8': 'yellow', '1.0': 'red'}).add_to(turkey_map)

# Add markers for strong earthquakes (magnitude > 6)
for idx, row in clean_df[clean_df['Magnitude'] > 6].iterrows():
    # Create enhanced popup content with styled HTML
    popup_content = f"""
    <div style="font-family: Arial; min-width: 200px;">
        <h4 style="margin-bottom: 5px; color: #d32f2f;">Earthquake Details</h4>
        <b>Magnitude:</b> {row['Magnitude']:.1f}<br>
        <b>Depth:</b> {row['Depth']:.1f} km<br>
        <b>Date:</b> {row['Date']}<br>
        <b>Location:</b> {row['Location']}<br>
    """
    
    # Add type information if available
    if 'Type' in row:
        popup_content += f"<b>Type:</b> {row['Type']}<br>"
    
    # Add additional information if available  
    if 'TypeName' in row and not pd.isna(row['TypeName']):
        popup_content += f"<b>Type Description:</b> {row['TypeName']}<br>"
        
    # Add EventID if available
    if 'EventID' in row and not pd.isna(row['EventID']):
        popup_content += f"<b>Event ID:</b> {row['EventID']}<br>"
    
    popup_content += "</div>"
    
    folium.CircleMarker(
        location=[row['Latitude'], row['Longitude']],
        radius=row['Magnitude'] * 1.5,
        color='red',
        fill=True,
        fill_color='red',
        fill_opacity=0.7,
        popup=folium.Popup(popup_content, max_width=300),
    ).add_to(turkey_map)

# Add fault lines to the map
def add_faults_to_map(map_obj, fault_gdf, importance_threshold=0):
    # Filter faults by importance if desired
    if importance_threshold > 0:
        fault_data = fault_gdf[fault_gdf['importance'] >= importance_threshold]
    else:
        fault_data = fault_gdf
    
    # Color by importance
    def style_function(feature):
        importance = feature['properties']['importance']
        color = '#FF0000' if importance >= 4 else '#FFA500' if importance >= 3 else '#FFFF00'
        return {
            'color': color,
            'weight': importance * 0.5,  # Thicker lines for more important faults
            'opacity': 0.7
        }
    
    # Add GeoJSON to map
    folium.GeoJson(
        fault_data,
        name='Fault Lines',
        style_function=style_function,
        tooltip=folium.GeoJsonTooltip(fields=['FAULT_NAME', 'importance']),
    ).add_to(map_obj)
    
    return map_obj

# Add fault lines to the map
turkey_map = add_faults_to_map(turkey_map, fault_gdf, importance_threshold=3)

# Add a tile layer for better visualization
folium.TileLayer('cartodbpositron').add_to(turkey_map)

# Add a legend for the map
legend_html = '''
<div style="position: fixed; bottom: 50px; right: 50px; width: 200px; height: auto; 
    background-color: white; border:2px solid grey; z-index:9999; font-size:12px;
    padding: 10px; border-radius: 5px;">
    <p><b>Earthquake Map Legend</b></p>
    <p><i class="fa fa-circle" style="color:red"></i> Magnitude > 6</p>
    <div style="margin-top:5px;">
      <p><b>Heatmap Intensity:</b></p>
      <div style="display:inline-block; height:15px; width:30px; background:blue;"></div>
      <div style="display:inline-block; height:15px; width:30px; background:cyan;"></div>
      <div style="display:inline-block; height:15px; width:30px; background:yellow;"></div>
      <div style="display:inline-block; height:15px; width:30px; background:red;"></div>
      <p style="font-size:10px;">Low &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; High</p>
    </div>
    <div style="margin-top:5px;">
      <p><b>Fault Line Importance:</b></p>
      <p><span style="color:#FF0000;">━━━</span> High (4+)</p>
      <p><span style="color:#FFA500;">━━━</span> Medium (3)</p>
      <p><span style="color:#FFFF00;">━━━</span> Low (<3)</p>
    </div>
</div>
'''
turkey_map.get_root().html.add_child(folium.Element(legend_html))

# Save map to HTML file to view it
turkey_map.save('maps/earthquake_map.html')

# Display in the notebook (folium maps render inline without extra widgets)
# from IPython.display import display
# display(turkey_map)

### 2.3 Temporal Analysis

Analyzing earthquake patterns over different time periods can reveal long-term trends and cyclical behaviors. We'll examine:

1. Yearly earthquake frequency to identify long-term trends
2. Seasonal distribution to detect potential seasonal patterns
3. Monthly patterns for finer granularity

These analyses may reveal whether earthquakes are becoming more frequent over time or occur more often during certain periods of the year.

In [None]:
# Use the cleaned dataframe for temporal analysis
# Yearly earthquake frequency
yearly_counts = clean_df.groupby('Year').size()

plt.figure(figsize=(14, 6))
yearly_counts.plot(kind='bar')
plt.title('Yearly Earthquake Frequency')
plt.xlabel('Year')
plt.ylabel('Number of Earthquakes')
plt.tight_layout()
plt.show()

# Seasonal patterns
seasonal_counts = clean_df.groupby('Season').size()

plt.figure(figsize=(10, 6))
seasonal_counts.plot(kind='pie', autopct='%1.1f%%')
plt.title('Seasonal Distribution of Earthquakes')
plt.ylabel('')
plt.tight_layout()
plt.show()

# Monthly patterns
monthly_counts = clean_df.groupby('Month').size()

plt.figure(figsize=(14, 6))
monthly_counts.plot(kind='bar')
plt.title('Monthly Earthquake Frequency')
plt.xlabel('Month')
plt.ylabel('Number of Earthquakes')
plt.xticks(range(12), ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
plt.tight_layout()
plt.show()

### 2.4 Magnitude and Depth Analysis

Examining the distribution of earthquake magnitudes and depths provides insight into the nature of seismic activity in Turkey. We'll analyze:

1. Magnitude distribution to understand the relative frequency of different earthquake sizes
2. Depth distribution to examine how deep earthquakes typically occur
3. Relationship between magnitude and depth to identify any correlation

This analysis will help us understand whether larger earthquakes tend to occur at specific depths and the typical characteristics of Turkish earthquakes.

In [None]:
# Magnitude distribution
plt.figure(figsize=(12, 6))
sns.histplot(clean_df['Magnitude'], bins=30, kde=True)
plt.title('Distribution of Earthquake Magnitudes')
plt.xlabel('Magnitude')
plt.ylabel('Frequency')
plt.axvline(clean_df['Magnitude'].mean(), color='red', linestyle='--', label=f'Mean: {clean_df["Magnitude"].mean():.2f}')
plt.legend()
plt.tight_layout()
plt.show()

# Depth distribution
plt.figure(figsize=(12, 6))
sns.histplot(clean_df['Depth'], bins=30, kde=True)
plt.title('Distribution of Earthquake Depths')
plt.xlabel('Depth (km)')
plt.ylabel('Frequency')
plt.axvline(clean_df['Depth'].mean(), color='red', linestyle='--', label=f'Mean: {clean_df["Depth"].mean():.2f}')
plt.legend()
plt.tight_layout()
plt.show()

# Relationship between magnitude and depth
plt.figure(figsize=(12, 8))
sns.scatterplot(x='Depth', y='Magnitude', data=clean_df, alpha=0.6)
plt.title('Relationship Between Earthquake Depth and Magnitude')
plt.xlabel('Depth (km)')
plt.ylabel('Magnitude')
plt.tight_layout()
plt.show()

### 2.5 Correlation Analysis

Exploring correlations between numerical features helps identify relationships that might be useful for prediction. We'll create a correlation matrix to visualize the relationships between:

- Longitude and Latitude
- Depth and Magnitude
- Temporal features and earthquake characteristics

Strong correlations may indicate predictive relationships that our models can leverage.
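
To make the numbers in the matrix concrete, here is a small sketch of the Pearson coefficient, which is the default method of pandas' `corr()`. The function name `pearson_r` and the sample values are made up for illustration and are not taken from the dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant column
    return cov / math.sqrt(var_x * var_y)

# Made-up values for illustration only.
depth_km = [5.0, 10.0, 15.0, 20.0]
rising = [4.0, 4.5, 5.0, 5.5]    # increases linearly with depth
falling = [5.5, 5.0, 4.5, 4.0]   # decreases linearly with depth

print(pearson_r(depth_km, rising))
print(pearson_r(depth_km, falling))
```

Pearson's coefficient only measures linear association, so a value near zero does not rule out a nonlinear relationship between two features.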

In [None]:
# Correlation analysis of numerical columns
numerical_cols = clean_df.select_dtypes(include=[np.number]).columns
correlation_matrix = clean_df[numerical_cols].corr()

plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Matrix')
plt.tight_layout()
plt.show()

### 2.6 Additional Visualizations

Further exploration of earthquake patterns through specialized visualizations:

1. Geographic distribution by magnitude with color coding
2. Magnitude distribution over years using box plots
3. Depth trends over time
4. 3D visualization incorporating longitude, latitude, and depth

These visualizations provide additional perspectives on the data and may reveal patterns not evident in simpler analyses.

In [None]:
# Geographic distribution by magnitude
plt.figure(figsize=(14, 10))
scatter = plt.scatter(clean_df['Longitude'], clean_df['Latitude'], 
                     c=clean_df['Magnitude'], cmap='YlOrRd', 
                     alpha=0.7, s=clean_df['Magnitude']**2)
plt.colorbar(scatter, label='Magnitude')
plt.title('Geographic Distribution of Earthquakes by Magnitude')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Magnitude distribution over years (box plot)
plt.figure(figsize=(16, 8))
sns.boxplot(x='Year', y='Magnitude', data=clean_df)
plt.title('Magnitude Distribution Over Years')
plt.xlabel('Year')
plt.ylabel('Magnitude')
plt.xticks(rotation=90)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Depth vs Year analysis
plt.figure(figsize=(16, 8))
sns.boxplot(x='Year', y='Depth', data=clean_df)
plt.title('Depth Distribution Over Years')
plt.xlabel('Year')
plt.ylabel('Depth (km)')
plt.xticks(rotation=90)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# 3D visualization with Plotly
import plotly.express as px

fig = px.scatter_3d(clean_df.sample(min(3000, len(clean_df))), 
                   x='Longitude', y='Latitude', z='Depth',
                   color='Magnitude', size='Magnitude',
                   color_continuous_scale='Viridis',
                   title='3D Visualization of Earthquakes')
# Ensure proper axis orientation
fig.update_layout(scene=dict(
    xaxis_title='Longitude',
    yaxis_title='Latitude',
    zaxis_title='Depth (km)',
    # Reverse the depth axis to show deeper earthquakes lower
    zaxis=dict(autorange="reversed")
))
fig.write_html('maps/earthquake_3d.html')  # Save the interactive plot
# fig.show()  # Display in notebook if supported

In [None]:
# Magnitude frequency plot
plt.figure(figsize=(12, 6))
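counts, bins, _ = plt.hist(clean_df['Magnitude'], bins=30, alpha=0.7)
# Plot the counts at the bin centers rather than at the left bin edges
bin_centers = (bins[:-1] + bins[1:]) / 2
plt.plot(bin_centers, counts, '-o', color='darkred')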
plt.title('Frequency Distribution of Earthquake Magnitudes')
plt.xlabel('Magnitude')
plt.ylabel('Frequency')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Regional magnitude comparison
# Extract region from location (assuming format includes region at end)
# Modify this based on your actual data format
if 'Location' in clean_df.columns:
    # Take the last comma-separated part of the location as the region
    clean_df['Region'] = clean_df['Location'].str.split(',').str[-1].str.strip()
    
    # Get top 10 regions by earthquake count
    top_regions = clean_df['Region'].value_counts().head(10).index
    
    # Plot magnitude distribution by region
    plt.figure(figsize=(14, 8))
    sns.boxplot(x='Region', y='Magnitude', data=clean_df[clean_df['Region'].isin(top_regions)])
    plt.title('Magnitude Distribution by Top 10 Regions')
    plt.xlabel('Region')
    plt.ylabel('Magnitude')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()

In [None]:
# Heatmap of earthquake frequency by month and year
if len(clean_df) > 0:
    # Create pivot table
    heatmap_data = pd.pivot_table(
        clean_df,
        values='Magnitude',
        index=clean_df['Date'].dt.year,
        columns=clean_df['Date'].dt.month,
        aggfunc='count'
    )
    
    plt.figure(figsize=(14, 10))
    sns.heatmap(heatmap_data, cmap='YlOrRd', annot=False)
    plt.title('Earthquake Frequency by Month and Year')
    plt.xlabel('Month')
    plt.ylabel('Year')
    # Heatmap cells are centered at 0.5, 1.5, ..., so offset the tick positions accordingly
    plt.xticks(np.arange(12) + 0.5, ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
    plt.tight_layout()
    plt.show()

### 2.7 Fault Line Analysis

Examining the relationship between earthquakes and fault lines is crucial for understanding seismic risk. We'll:

1. Calculate the distance from each earthquake to the nearest fault line
2. Analyze how earthquake magnitude relates to proximity to faults
3. Examine the relationship between fault importance and earthquake characteristics

This analysis will help determine whether earthquakes closer to major fault lines tend to have higher magnitudes.
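
Distances measured directly in degrees are only a rough proxy for kilometers, because a degree of longitude shrinks with latitude. The sketch below shows one possible alternative under stated assumptions: it projects coordinates onto a local flat plane (an equirectangular approximation, reasonable over short distances) and then measures the point-to-segment distance in km. The helper names and the test coordinates are hypothetical; for production work, reprojecting the data to a metric CRS (for example with geopandas' `to_crs`) is usually the more robust choice.

```python
import math

KM_PER_DEG = 111.0  # approximate length of one degree of latitude, in km

def to_local_km(lon, lat, ref_lon, ref_lat):
    """Project (lon, lat) onto a flat plane in km, centered on the reference point."""
    x = (lon - ref_lon) * KM_PER_DEG * math.cos(math.radians(ref_lat))
    y = (lat - ref_lat) * KM_PER_DEG
    return x, y

def point_to_segment_km(lon, lat, seg_start, seg_end):
    """Approximate distance in km from a point to a segment of (lon, lat) pairs."""
    # The query point becomes the origin of the local plane.
    ax, ay = to_local_km(*seg_start, lon, lat)
    bx, by = to_local_km(*seg_end, lon, lat)
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:
        return math.hypot(ax, ay)  # degenerate segment: a single point
    # Parameter of the closest point on the segment, clamped to [0, 1].
    t = max(0.0, min(1.0, -(ax * dx + ay * dy) / seg_len_sq))
    return math.hypot(ax + t * dx, ay + t * dy)

# Both hypothetical faults are exactly 1 degree away from the point (30 E, 40 N).
north_fault = point_to_segment_km(30.0, 40.0, (29.0, 41.0), (31.0, 41.0))
east_fault = point_to_segment_km(30.0, 40.0, (31.0, 39.0), (31.0, 41.0))
print(round(north_fault, 1), round(east_fault, 1))
```

Both segments are the same number of degrees away, yet the one to the east is noticeably closer in kilometers, which is the effect a flat degrees-times-111 conversion cannot capture.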

In [None]:
# Calculate distances to fault lines
def calc_fault_distance(row, fault_gdf):
    point = Point(row['Longitude'], row['Latitude'])
    
    # Calculate distance to each fault line
    distances = []
    for idx, fault in fault_gdf.iterrows():
        fault_geom = fault.geometry
        dist = point.distance(fault_geom)
        distances.append((dist, idx))
    
    # Find the closest fault
    closest_dist, closest_idx = min(distances, key=lambda x: x[0])
    
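    # Convert distance to kilometers (rough approximation)
    # 1 degree of latitude ≈ 111 km; a degree of longitude is shorter away from
    # the equator (about 86 km at 39° N), so east-west distances are overestimated
    dist_km = closest_dist * 111
    
    # Get fault properties (idx from iterrows is an index label, so use .loc)
    closest_fault = fault_gdf.loc[closest_idx]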
    
    return pd.Series({
        'distance_to_fault': dist_km,
        'nearest_fault_name': closest_fault.get('FAULT_NAME', 'Unknown'),
        'nearest_fault_importance': closest_fault.get('importance', 0)
    })

# Apply to a sample for visualization (full calculation will be done later)
sample_size = min(1000, len(clean_df))
fault_distance_sample = clean_df.sample(sample_size).apply(
    lambda row: calc_fault_distance(row, fault_gdf), axis=1
)

# Visualize relationship between earthquake magnitude and distance to fault
plt.figure(figsize=(12, 8))
plt.scatter(fault_distance_sample['distance_to_fault'], 
           clean_df.loc[fault_distance_sample.index, 'Magnitude'],
           alpha=0.6, c=fault_distance_sample['nearest_fault_importance'], 
           cmap='viridis')
plt.colorbar(label='Fault Importance')
plt.xlabel('Distance to Nearest Fault (km)')
plt.ylabel('Magnitude')
plt.title('Relationship Between Earthquake Magnitude and Distance to Fault')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## 3. Data Preprocessing

This section focuses on preparing our data for machine learning by handling missing values, outliers, and other data quality issues:

1. Identifying and addressing missing values in all columns
2. Detecting and handling outliers using the IQR method
3. Validating geographic coordinates to ensure they're within Turkey's boundaries
4. Checking the value ranges of features that are standardized later in the modeling pipeline
5. Creating a clean dataset for the modeling phase

Proper preprocessing ensures our models receive high-quality input data, which is essential for accurate predictions.
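
The IQR rule mentioned in the list can be sketched with the standard library alone. The helper names `iqr_bounds` and `cap_outliers` and the depth values are made up for illustration; the `inclusive` quantile method is chosen here because it uses the same linear interpolation as pandas' default `quantile`.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def cap_outliers(values, k=1.5):
    """Clamp every value into the IQR fences instead of dropping it."""
    lower, upper = iqr_bounds(values, k)
    return [min(max(v, lower), upper) for v in values]

# Made-up depths in km, for illustration only.
depths = [5, 7, 8, 10, 10, 12, 15, 150]
lower, upper = iqr_bounds(depths)
capped = cap_outliers(depths)
print(lower, upper)
print(capped)
```

Capping keeps the row count unchanged, which matters when other columns of the same record are still informative, but it also flattens genuinely extreme values, so it is worth inspecting what was capped before modeling.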

In [None]:
# Data Preprocessing Section
print("Starting data preprocessing...")

# Check for missing values again to confirm
missing_values = clean_df.isnull().sum()
print(f"Missing values in each column:\n{missing_values}")

# Handle missing values
# For numerical columns: fill with median
numerical_cols = ['Longitude', 'Latitude', 'Depth', 'Magnitude']
for col in numerical_cols:
    if missing_values[col] > 0:
        median_value = clean_df[col].median()
        # Assign the result back; inplace fillna on a column selection is unreliable in recent pandas
        clean_df[col] = clean_df[col].fillna(median_value)
        print(f"Filled {missing_values[col]} missing values in {col} with median: {median_value}")

# For categorical columns: fill with mode
categorical_cols = [col for col in clean_df.columns if col not in numerical_cols 
                   and col not in ['Date', 'Year', 'Month', 'Day', 'YearMonth']]
for col in categorical_cols:
    if col in missing_values and missing_values[col] > 0:
        mode_value = clean_df[col].mode()[0]
        # Assign the result back; inplace fillna on a column selection is unreliable in recent pandas
        clean_df[col] = clean_df[col].fillna(mode_value)
        print(f"Filled {missing_values[col]} missing values in {col} with mode: {mode_value}")

# Handle outliers using IQR method for depth and magnitude
def handle_outliers(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
    print(f"Found {len(outliers)} outliers in {column}")
    
    # Cap outliers instead of removing them
    df[column] = np.where(df[column] < lower_bound, lower_bound, df[column])
    df[column] = np.where(df[column] > upper_bound, upper_bound, df[column])
    
    return df

# Apply outlier handling to Depth
clean_df = handle_outliers(clean_df, 'Depth')

# For Magnitude, we may want to keep high values as they're important
# But we can still check for potential errors
magnitude_outliers = clean_df[clean_df['Magnitude'] > 8.5]
print(f"Extremely high magnitudes (>8.5): {len(magnitude_outliers)}")
if len(magnitude_outliers) > 0:
    print(magnitude_outliers[['Date', 'Magnitude', 'Location']])

# Report coordinate ranges (scaling itself is applied later, inside the model pipeline)
print("\nCoordinate ranges:")
print(f"Longitude: {clean_df['Longitude'].min()} to {clean_df['Longitude'].max()}")
print(f"Latitude: {clean_df['Latitude'].min()} to {clean_df['Latitude'].max()}")

# Verify coordinates are in the Turkey region (already done in previous step)
# This is now redundant since we've already filtered the coordinates
turkey_coords = clean_df[
    (clean_df['Longitude'] >= 25) & 
    (clean_df['Longitude'] <= 45) & 
    (clean_df['Latitude'] >= 35) & 
    (clean_df['Latitude'] <= 43)
]
outside_turkey = len(clean_df) - len(turkey_coords)
print(f"Records potentially outside Turkey region: {outside_turkey}")

# Create a copy of the dataframe for modeling
model_df = clean_df.copy()

print("\nData preprocessing completed!")
model_df.head()

## 4. Feature Engineering

Feature engineering is a critical step that can significantly improve model performance. We'll create new features that may help predict earthquake magnitudes:

1. Temporal features: cyclical encoding of time components (month, day of year) to capture seasonal patterns
2. Geographic features: grid-based location encoding and regional activity metrics
3. Historical activity features: information about previous earthquakes in the same region
4. Distance-based features: measuring proximity to fault lines and fault characteristics
5. Interaction features: combinations of existing features that might have predictive power together

These engineered features will provide our models with richer information than the raw data alone.
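
To see why cyclical encoding helps, consider the sketch below. It maps a periodic value onto the unit circle and compares distances between months before and after encoding. The helper names `cyclical_encode` and `encoded_distance` are hypothetical.

```python
import math

def cyclical_encode(value, period):
    """Map a periodic value onto the unit circle as (sin, cos)."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

def encoded_distance(a, b, period):
    """Euclidean distance between two values after cyclical encoding."""
    return math.dist(cyclical_encode(a, period), cyclical_encode(b, period))

# Raw month numbers put December and January 11 units apart...
raw_gap = abs(12 - 1)
# ...while on the circle they are as close as any other pair of neighbors.
dec_jan = encoded_distance(12, 1, period=12)
jan_feb = encoded_distance(1, 2, period=12)
jan_jul = encoded_distance(1, 7, period=12)  # opposite sides of the year

print(raw_gap, round(dec_jan, 3), round(jan_feb, 3), round(jan_jul, 3))
```

Both the sine and the cosine component are needed: either one alone maps two different months to the same value.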

In [None]:
# Feature Engineering
print("Starting feature engineering...")

# Create time-based features
model_df['DayOfYear'] = model_df['Date'].dt.dayofyear
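# Cast to a plain float: isocalendar() returns a nullable UInt32 column, which some estimators handle poorly
model_df['WeekOfYear'] = model_df['Date'].dt.isocalendar().week.astype('float64')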
model_df['IsWeekend'] = model_df['DayOfWeek'].apply(lambda x: 1 if x >= 5 else 0)

# Encode seasonal information using cyclical encoding
model_df['MonthSin'] = np.sin(2 * np.pi * model_df['Month']/12)
model_df['MonthCos'] = np.cos(2 * np.pi * model_df['Month']/12)
# Use 365.25 so that day 366 of a leap year does not wrap past a full cycle
model_df['DayOfYearSin'] = np.sin(2 * np.pi * model_df['DayOfYear']/365.25)
model_df['DayOfYearCos'] = np.cos(2 * np.pi * model_df['DayOfYear']/365.25)

# Create regional activity features
# Group by regions and calculate historical earthquake counts
# First, create a spatial grid
lon_grid = pd.cut(clean_df['Longitude'], bins=10)
lat_grid = pd.cut(clean_df['Latitude'], bins=10)
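# Build the cell label from index-aligned Series; a freshly built Series would get a
# 0..n-1 index that no longer matches clean_df after the coordinate filtering
clean_df['Grid'] = lon_grid.astype(str) + '_' + lat_grid.astype(str)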

# For each earthquake, count previous earthquakes in the same grid
clean_df = clean_df.sort_values('Date')
clean_df['PrevQuakesInGrid'] = clean_df.groupby('Grid').cumcount()

# Calculate distances between consecutive earthquakes
clean_df['PrevLon'] = clean_df['Longitude'].shift(1)
clean_df['PrevLat'] = clean_df['Latitude'].shift(1)

# Haversine formula to calculate distance in km
from math import radians, sin, cos, sqrt, asin

def haversine(lon1, lat1, lon2, lat2):
    # Convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    
    # Haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    r = 6371  # Radius of earth in km
    return c * r

# Apply haversine to calculate distance from previous earthquake
clean_df['DistFromPrev'] = clean_df.apply(
    lambda x: haversine(x['Longitude'], x['Latitude'], x['PrevLon'], x['PrevLat']) 
    if not pd.isna(x['PrevLon']) else np.nan, axis=1)

# Add distance features to model_df
model_df['PrevQuakesInGrid'] = clean_df['PrevQuakesInGrid']
model_df['DistFromPrev'] = clean_df['DistFromPrev']
model_df['DistFromPrev'] = model_df['DistFromPrev'].fillna(model_df['DistFromPrev'].median())

# Create feature for time since last earthquake (in days)
clean_df['PrevDate'] = clean_df['Date'].shift(1)
clean_df['DaysSinceLastQuake'] = (clean_df['Date'] - clean_df['PrevDate']).dt.total_seconds() / (24 * 3600)
model_df['DaysSinceLastQuake'] = clean_df['DaysSinceLastQuake']
model_df['DaysSinceLastQuake'] = model_df['DaysSinceLastQuake'].fillna(model_df['DaysSinceLastQuake'].median())

# Add historical magnitude information
clean_df['PrevMagnitude'] = clean_df['Magnitude'].shift(1)
model_df['PrevMagnitude'] = clean_df['PrevMagnitude']
model_df['PrevMagnitude'] = model_df['PrevMagnitude'].fillna(model_df['PrevMagnitude'].median())

# Create interaction features
model_df['DepthByLat'] = model_df['Depth'] * model_df['Latitude']
model_df['DepthByLon'] = model_df['Depth'] * model_df['Longitude']

# Add fault-related features - calculate for all data points
print("Calculating fault-related features...")
fault_features = clean_df.apply(lambda row: calc_fault_distance(row, fault_gdf), axis=1)
clean_df = pd.concat([clean_df, fault_features], axis=1)
model_df = pd.concat([model_df, fault_features], axis=1)

# Calculate fault density in a radius
def calc_fault_density(lat, lon, fault_gdf, radius=50):
    """Calculate fault density within radius (km) of a point"""
    point = Point(lon, lat)
    buffer_degrees = radius / 111  # Convert km to approximate degrees
    
    # Create a buffer around the point
    buffer = point.buffer(buffer_degrees)
    
    # Count intersecting faults and sum their lengths
    intersecting_faults = 0
    total_length = 0
    
    for _, fault in fault_gdf.iterrows():
        if buffer.intersects(fault.geometry):
            intersecting_faults += 1
            # Calculate length of intersection
            intersection = buffer.intersection(fault.geometry)
            total_length += intersection.length * 111  # Convert to km
    
    return pd.Series({
        'fault_count_50km': intersecting_faults,
        'fault_length_50km': total_length,
        'fault_density': total_length / (math.pi * radius**2) if radius > 0 else 0
    })

# Calculate fault density for strategic points (grid centers) to avoid heavy computation
print("Calculating fault density (this may take a while)...")
# Create a grid for Turkey
lon_range = np.linspace(25, 45, 10)
lat_range = np.linspace(35, 43, 10)
grid_points = []

for lon in lon_range:
    for lat in lat_range:
        grid_points.append((lon, lat))

# Calculate density at grid points
grid_densities = []
for lon, lat in grid_points:
    density = calc_fault_density(lat, lon, fault_gdf)
    density['lon'] = lon
    density['lat'] = lat
    grid_densities.append(density)

grid_df = pd.DataFrame(grid_densities)

# For each earthquake, find nearest grid point and assign its density
def assign_grid_density(row, grid_df):
    distances = []
    for idx, grid_point in grid_df.iterrows():
        dist = haversine(row['Longitude'], row['Latitude'], grid_point['lon'], grid_point['lat'])
        distances.append((dist, idx))
    
    closest_idx = min(distances, key=lambda x: x[0])[1]
    return pd.Series({
        'fault_count_50km': grid_df.iloc[closest_idx]['fault_count_50km'],
        'fault_length_50km': grid_df.iloc[closest_idx]['fault_length_50km'],
        'fault_density': grid_df.iloc[closest_idx]['fault_density']
    })

# Apply grid-based density estimation
density_features = clean_df.apply(lambda row: assign_grid_density(row, grid_df), axis=1)
clean_df = pd.concat([clean_df, density_features], axis=1)
model_df = pd.concat([model_df, density_features], axis=1)

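# Add magnitude-distance interaction feature
# NOTE: this column is computed from the target (Magnitude); keep it for analysis only and
# exclude it from the model inputs, otherwise the model sees the answer (target leakage)
model_df['magnitude_fault_interaction'] = model_df['Magnitude'] / (model_df['distance_to_fault'] + 1)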

print("Feature engineering completed!")
model_df.head()

## 5. Model Selection and Training

In this section, we'll train and compare multiple regression models to predict earthquake magnitude:

1. Linear models: Linear Regression, Ridge, and Lasso
2. Tree-based models: Random Forest and Gradient Boosting
3. Advanced gradient boosting: XGBoost and LightGBM

We'll evaluate each model using:
- Cross-validation to ensure robust performance estimates
- RMSE (Root Mean Squared Error) as our primary metric
- MAE (Mean Absolute Error) and R² as secondary metrics

This comprehensive evaluation will help us identify the most suitable model for earthquake magnitude prediction.
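
For reference, the three metrics can be written out by hand as in the sketch below. The function name `regression_metrics` and the sample magnitudes are made up for illustration; in the notebook itself these values would come from scikit-learn's metric functions.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE and R² for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                   # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)    # total variation
    r2 = 1 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return {"rmse": math.sqrt(mse), "mae": mae, "r2": r2}

# Made-up magnitudes for illustration only.
observed = [4.0, 5.0, 6.0]
predicted = [4.5, 5.0, 5.5]
metrics = regression_metrics(observed, predicted)
print(metrics)
```

RMSE penalizes large misses more heavily than MAE because the errors are squared before averaging, which is why the two are worth reading side by side.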

In [None]:
# Import necessary libraries for modeling
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

# Model Selection and Training
print("Setting up model training...")

# Define features and target
target = 'Magnitude'
# Remove non-feature columns
drops = ['Date', 'Location', 'EventID', 'TimeName', 'TypeName', 
         'MagnitudeName', 'Grid', 'PrevLon', 'PrevLat', 'PrevDate',
         'nearest_fault_name',  # Remove string columns
         'magnitude_fault_interaction']  # Derived from the target, so it would leak Magnitude

# Check if these optional columns exist and add them to drops if they do
optional_drops = ['YearMonth']
for col in optional_drops:
    if col in model_df.columns:
        drops.append(col)

# First, create a preliminary feature list
preliminary_features = [col for col in model_df.columns if col != target and col not in drops]

# Check for non-numeric columns in our features
for col in preliminary_features:
    if col in model_df.columns and model_df[col].dtype == 'object':
        print(f"Removing non-numeric column: {col}")
        drops.append(col)

# Final feature list with only numeric columns
features = [col for col in model_df.columns if col != target and col not in drops]

print(f"Selected features: {features}")

# Define features to scale
features_to_scale = ['Longitude', 'Latitude', 'Depth']
other_features = [f for f in features if f not in features_to_scale]

# Build the feature matrix and target vector
# (.copy() so the imputation below does not write into a view of model_df)
X = model_df[features].copy()
y = model_df[target]

print("Columns with NaN values:")
for col in X.columns:
    nan_count = X[col].isna().sum()
    if nan_count > 0:
        print(f"- {col}: {nan_count} NaNs")

# Fill missing values appropriately for each column
for col in X.columns:
    if X[col].isna().sum() > 0:
        # For numeric columns, use median
        X[col] = X[col].fillna(X[col].median())

# Also check target variable
if y.isna().sum() > 0:
    print(f"Target has {y.isna().sum()} NaN values, filling with median")
    y = y.fillna(y.median())

# Verify all NaNs are fixed
print(f"Remaining NaN values in X: {X.isna().sum().sum()}")
print(f"Remaining NaN values in y: {y.isna().sum()}")

# Create new train-test split with cleaned data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set size: {X_train.shape[0]}")
print(f"Testing set size: {X_test.shape[0]}")

# Create preprocessing transformer
preprocessor = ColumnTransformer(
    transformers=[
        ('geo_features', StandardScaler(), features_to_scale),
        ('other_features', 'passthrough', other_features)
    ])

# Set up models with pipelines
models = {
    'Linear Regression': Pipeline([
        ('preprocessor', preprocessor),
        ('model', LinearRegression())
    ]),
    'Ridge Regression': Pipeline([
        ('preprocessor', preprocessor),
        ('model', Ridge())
    ]),
    'Lasso Regression': Pipeline([
        ('preprocessor', preprocessor),
        ('model', Lasso())
    ]),
    'Random Forest': Pipeline([
        ('preprocessor', preprocessor),
        ('model', RandomForestRegressor(n_estimators=100, random_state=42))
    ]),
    'Gradient Boosting': Pipeline([
        ('preprocessor', preprocessor),
        ('model', GradientBoostingRegressor(n_estimators=100, random_state=42))
    ]),
    'XGBoost': Pipeline([
        ('preprocessor', preprocessor),
        ('model', XGBRegressor(n_estimators=100, random_state=42))
    ]),
    'LightGBM': Pipeline([
        ('preprocessor', preprocessor),
        ('model', LGBMRegressor(n_estimators=100, random_state=42))
    ])
}

# Function to evaluate models
def evaluate_model(pipeline, X_train, X_test, y_train, y_test):
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    
    mae = mean_absolute_error(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)
    
    return mae, rmse, r2, pipeline

# Cross-validation for more robust evaluation
results = {}
cv_results = {}
fitted_models = {}

for name, pipeline in models.items():
    print(f"Training {name}...")
    mae, rmse, r2, fitted_pipeline = evaluate_model(pipeline, X_train, X_test, y_train, y_test)
    fitted_models[name] = fitted_pipeline
    
    # 5-fold cross-validation for RMSE on the training split only, so the test set
    # stays untouched during model selection
    cv_scores = -cross_val_score(pipeline, X_train, y_train, cv=5, scoring='neg_root_mean_squared_error')
    
    results[name] = {'MAE': mae, 'RMSE': rmse, 'R²': r2}
    cv_results[name] = {'Mean RMSE': cv_scores.mean(), 'Std RMSE': cv_scores.std()}
    
    print(f"{name} - MAE: {mae:.4f}, RMSE: {rmse:.4f}, R²: {r2:.4f}, CV RMSE: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")

# Convert results to DataFrames for better visualization
results_df = pd.DataFrame(results).T
cv_results_df = pd.DataFrame(cv_results).T

print("\nTest Results:")
print(results_df.sort_values('RMSE'))

print("\nCross-Validation Results:")
print(cv_results_df.sort_values('Mean RMSE'))

# Visualize model performance
plt.figure(figsize=(12, 6))
results_df['RMSE'].sort_values().plot(kind='bar')
plt.title('RMSE by Model')
plt.ylabel('RMSE (lower is better)')
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

# Select the best performing model based on CV results
best_model_name = cv_results_df.sort_values('Mean RMSE').index[0]
print(f"\nBest model based on cross-validation: {best_model_name}")
best_pipeline = fitted_models[best_model_name]

## 6. Hyperparameter Optimization

Tuning hyperparameters can improve prediction accuracy beyond what the default settings achieve. For our best-performing model, we'll:

1. Define a comprehensive hyperparameter grid to explore
2. Use RandomizedSearchCV for efficient optimization
3. Evaluate results using cross-validation to avoid overfitting
4. Select the optimal hyperparameter configuration
5. Create a final model with the optimized parameters

Because randomized search samples only part of the search space, the result is a good configuration within that space rather than a guaranteed optimum.
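
To give a sense of the search budget, the sketch below enumerates a grid with the same candidate values as the Random Forest grid in the next cell and draws 20 combinations from it. It uses only the standard library and is an illustration of the idea, not scikit-learn's own sampler, so the combinations it picks will differ from those `RandomizedSearchCV` evaluates.

```python
import itertools
import random

# Candidate values mirroring the Random Forest grid used in the next cell
grid = {
    'model__n_estimators': [50, 100, 200],
    'model__max_depth': [None, 10, 20, 30],
    'model__min_samples_split': [2, 5, 10],
    'model__min_samples_leaf': [1, 2, 4],
}

# Every combination an exhaustive grid search would have to evaluate
names = list(grid)
all_combinations = [
    dict(zip(names, values)) for values in itertools.product(*grid.values())
]

# Randomized search evaluates only n_iter of them, drawn without replacement
n_iter = 20
rng = random.Random(42)
sampled = rng.sample(all_combinations, n_iter)

cv_folds = 5
print(f"Exhaustive grid search: {len(all_combinations)} candidates, "
      f"{len(all_combinations) * cv_folds} model fits")
print(f"Randomized search:      {len(sampled)} candidates, "
      f"{len(sampled) * cv_folds} model fits")
```

When every parameter is given as a list, as in this notebook, scikit-learn also samples without replacement, so no configuration is evaluated twice.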

In [None]:
# Hyperparameter Optimization
print(f"Optimizing hyperparameters for {best_model_name}...")

from sklearn.model_selection import RandomizedSearchCV

# Define hyperparameter grids for each model type
# You may need to adjust these based on your selected best model
param_grids = {
    'Random Forest': {
        'model__n_estimators': [50, 100, 200],
        'model__max_depth': [None, 10, 20, 30],
        'model__min_samples_split': [2, 5, 10],
        'model__min_samples_leaf': [1, 2, 4]
    },
    'Gradient Boosting': {
        'model__n_estimators': [50, 100, 200],
        'model__learning_rate': [0.01, 0.1, 0.2],
        'model__max_depth': [3, 5, 7],
        'model__min_samples_split': [2, 5]
    },
    'XGBoost': {
        'model__n_estimators': [50, 100, 200],
        'model__learning_rate': [0.01, 0.1, 0.2],
        'model__max_depth': [3, 5, 7],
        'model__colsample_bytree': [0.7, 0.8, 0.9]
    },
    'LightGBM': {
        'model__n_estimators': [50, 100, 200],
        'model__learning_rate': [0.01, 0.1, 0.2],
        'model__max_depth': [3, 5, 7],
        'model__num_leaves': [31, 50, 70]
    }
}

# Get the appropriate parameter grid
if best_model_name in param_grids:
    param_grid = param_grids[best_model_name]
    
    # Use RandomizedSearchCV for efficiency
    random_search = RandomizedSearchCV(
        best_pipeline, 
        param_distributions=param_grid,
        n_iter=20,  # Number of parameter settings sampled
        cv=5,
        scoring='neg_root_mean_squared_error',
        n_jobs=-1,
        random_state=42,
        verbose=1
    )
    
    # Fit the random search
    random_search.fit(X_train, y_train)
    
    # Print best parameters and score
    print(f"Best parameters: {random_search.best_params_}")
    print(f"Best RMSE: {-random_search.best_score_:.4f}")
    
    # Create the optimized model
    best_pipeline = random_search.best_estimator_
else:
    print(f"No parameter grid defined for {best_model_name}. Using default model.")
    best_pipeline = fitted_models[best_model_name]

# Final evaluation with the best model
y_pred = best_pipeline.predict(X_test)
final_mae = mean_absolute_error(y_test, y_pred)
final_rmse = np.sqrt(mean_squared_error(y_test, y_pred))
final_r2 = r2_score(y_test, y_pred)

print(f"\nFinal model performance:")
print(f"MAE: {final_mae:.4f}")
print(f"RMSE: {final_rmse:.4f}")
print(f"R²: {final_r2:.4f}")

## 7. Model Evaluation and Interpretation

To understand the final model's strengths and limitations, and which features drive its predictions, we evaluate it by:

1. Visualizing actual vs. predicted magnitudes
2. Analyzing prediction errors through residual plots
3. Examining the distribution of residuals
4. Identifying the most important features for prediction
5. Saving the model pipeline for deployment

Feature importance shows which inputs the model relies on most when predicting magnitude. These are predictive associations rather than physical causes, but they can still help prioritize factors for earthquake risk assessment in Turkey.
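
One detail to watch when reading feature importances from a pipeline: a `ColumnTransformer` concatenates its outputs in the order its transformers are listed, so the model sees the scaled columns first and the passthrough columns after them. The sketch below shows the resulting name order in plain Python. The feature list and importance values are hypothetical, and `Year` is an assumed column name used only for illustration.

```python
# Hypothetical feature list, in the DataFrame's column order
features = ['Depth', 'distance_to_fault', 'Latitude', 'Longitude', 'Year']

# Same split as in the notebook: three scaled columns, the rest passed through
features_to_scale = ['Longitude', 'Latitude', 'Depth']
other_features = [f for f in features if f not in features_to_scale]

# Order of the columns after the ColumnTransformer: scaled block, then passthrough block
transformed_order = features_to_scale + other_features

# Hypothetical importances, in the order a fitted model would report them
importances = [0.30, 0.25, 0.20, 0.15, 0.10]

# Pair each importance with the correct name and rank from most to least important
ranked = sorted(zip(transformed_order, importances), key=lambda pair: pair[1], reverse=True)
for name, value in ranked:
    print(f"{name:<20} {value:.2f}")
```

Pairing the importances with `features` instead would silently mislabel the bars whenever the DataFrame's column order differs from the transformer order. On a fitted pipeline, `get_feature_names_out()` on the preprocessor step is another way to recover the names, with a prefix for each transformer.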

In [None]:
# Model Evaluation and Interpretation
print("Evaluating final model...")

# Visualize actual vs predicted values
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.xlabel('Actual Magnitude')
plt.ylabel('Predicted Magnitude')
plt.title('Actual vs Predicted Earthquake Magnitude')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Plot residuals
residuals = y_test - y_pred
plt.figure(figsize=(10, 6))
plt.scatter(y_pred, residuals, alpha=0.5)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted Magnitude')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Analyze residual distribution
plt.figure(figsize=(10, 6))
sns.histplot(residuals, kde=True)
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.title('Distribution of Residuals')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Feature importance (for tree-based models)
try:
    # Extract the model component from the pipeline
    model_component = best_pipeline.named_steps['model']
    
    # Check if it has feature importances
    if hasattr(model_component, 'feature_importances_'):
        # The ColumnTransformer outputs the scaled columns first, then the passthrough
        # columns, so the importances follow that order rather than `features`
        transformed_features = features_to_scale + other_features
        feature_importance = pd.DataFrame({
            'Feature': transformed_features,
            'Importance': model_component.feature_importances_
        }).sort_values('Importance', ascending=False)
        
        # Visualize feature importances
        plt.figure(figsize=(12, 8))
        sns.barplot(x='Importance', y='Feature', data=feature_importance.head(15))
        plt.title('Top 15 Feature Importances')
        plt.tight_layout()
        plt.show()
        
        print("Top 10 most important features:")
        print(feature_importance.head(10))
except Exception as e:
    print(f"Could not extract feature importances from the model: {e}")

# Save the entire pipeline - this contains both preprocessing and model
import joblib
joblib.dump(best_pipeline, 'models/earthquake_pipeline.pkl')
print("Pipeline saved as 'models/earthquake_pipeline.pkl'")

# Also save the clean dataset with original coordinates for unsupervised learning
clean_df.to_csv('produced_data/clean_earthquake_data.csv', index=False)
print("Clean data with original coordinates saved as 'produced_data/clean_earthquake_data.csv'")

## 8. Conclusion and Next Steps

Our supervised learning model predicts earthquake magnitudes in Turkey from geographic, temporal, and fault-related features; its accuracy is summarized by the MAE, RMSE, and R² reported in Section 6. The cleaned data, with original coordinates, has been saved to `produced_data/clean_earthquake_data.csv` for the follow-up unsupervised analysis.

That unsupervised analysis will complement our supervised prediction model by identifying natural groupings and high-risk regions that might not be evident through regression analysis alone.

## 9. GPU Acceleration with PyTorch (Bonus)

In this section, we'll implement a deep learning approach using PyTorch with GPU acceleration to compare with our traditional machine learning models. This addresses the bonus requirement for utilizing GPU libraries.

### 9.1 PyTorch Model Setup

In [None]:
# Install required packages if needed (uncomment and run if necessary)
# !pip install torch torchvision torchaudio

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
import time
import numpy as np
from sklearn.preprocessing import StandardScaler

# Check for GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
print(f"PyTorch version: {torch.__version__}")
if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU: {torch.cuda.get_device_name(0)}")

In [None]:
# Define Neural Network architecture for earthquake prediction
class EarthquakeNN(nn.Module):
    def __init__(self, input_size):
        super(EarthquakeNN, self).__init__()
        self.network = nn.Sequential(
            nn.Linear(input_size, 128),
            nn.ReLU(),
            nn.BatchNorm1d(128),
            nn.Dropout(0.3),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.BatchNorm1d(64),
            nn.Dropout(0.2),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 1)
        )
    
    def forward(self, x):
        return self.network(x)

In [None]:
# Prepare data for PyTorch
# Use the same features and the same train/test split as the traditional models

# Fit the scaler on the training split only so no test-set statistics leak into training
scaler_torch = StandardScaler()
X_train_scaled = scaler_torch.fit_transform(X_train)
X_test_scaled = scaler_torch.transform(X_test)

# Convert to PyTorch tensors (float32)
X_train_tensor = torch.FloatTensor(X_train_scaled)
y_train_tensor = torch.FloatTensor(y_train.values.reshape(-1, 1))
X_test_tensor = torch.FloatTensor(X_test_scaled)
y_test_tensor = torch.FloatTensor(y_test.values.reshape(-1, 1))

# Create PyTorch datasets and dataloaders
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)

batch_size = 64
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

print(f"Training samples: {len(train_dataset)}")
print(f"Testing samples: {len(test_dataset)}")

In [None]:
# Initialize PyTorch model, loss function and optimizer
model = EarthquakeNN(input_size=X_train.shape[1])
model = model.to(device)  # Move model to GPU if available

criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

In [None]:
# Train the model
def train_and_evaluate():
    epochs = 100
    best_val_loss = float('inf')
    patience = 10
    counter = 0
    
    train_losses = []
    val_losses = []
    
    # Record training time
    start_time = time.time()
    
    for epoch in range(epochs):
        # Training
        model.train()
        running_loss = 0.0
        for inputs, targets in train_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            
            # Zero the parameter gradients
            optimizer.zero_grad()
            
            # Forward pass
            outputs = model(inputs)
            loss = criterion(outputs, targets)
            
            # Backward pass and optimize
            loss.backward()
            optimizer.step()
            
            running_loss += loss.item() * inputs.size(0)
        
        train_loss = running_loss / len(train_dataset)
        train_losses.append(train_loss)
        
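        # Validation
        # NOTE: the test set doubles as the validation set here, so early stopping
        # sees the test data and the reported test metrics are somewhat optimistic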
        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for inputs, targets in test_loader:
                inputs, targets = inputs.to(device), targets.to(device)
                outputs = model(inputs)
                loss = criterion(outputs, targets)
                val_loss += loss.item() * inputs.size(0)
        
        val_loss = val_loss / len(test_dataset)
        val_losses.append(val_loss)
        
        # Early stopping
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            counter = 0
        else:
            counter += 1
        
        if counter >= patience:
            print(f"Early stopping at epoch {epoch+1}")
            break
        
        if (epoch+1) % 10 == 0:
            print(f"Epoch {epoch+1}/{epochs}, Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}")
    
    # Calculate total training time
    training_time = time.time() - start_time
    print(f"Training completed in {training_time:.2f} seconds")
    
    return train_losses, val_losses, training_time

# Run training
train_losses, val_losses, gpu_training_time = train_and_evaluate()

In [None]:
# Evaluate the model on test data
# (named evaluate_torch_model so it does not shadow evaluate_model from Section 5)
def evaluate_torch_model():
    model.eval()
    all_preds = []
    all_targets = []
    
    with torch.no_grad():
        for inputs, targets in test_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            outputs = model(inputs)
            all_preds.append(outputs.cpu().numpy())
            all_targets.append(targets.cpu().numpy())
    
    # Concatenate all predictions and targets
    y_pred_torch = np.vstack(all_preds).flatten()
    y_test_torch = np.vstack(all_targets).flatten()
    
    # Calculate metrics
    torch_mae = mean_absolute_error(y_test_torch, y_pred_torch)
    torch_rmse = np.sqrt(mean_squared_error(y_test_torch, y_pred_torch))
    torch_r2 = r2_score(y_test_torch, y_pred_torch)
    
    print("PyTorch Neural Network Results:")
    print(f"MAE: {torch_mae:.4f}")
    print(f"RMSE: {torch_rmse:.4f}")
    print(f"R²: {torch_r2:.4f}")
    
    return torch_mae, torch_rmse, torch_r2, y_pred_torch

# Evaluate model
torch_mae, torch_rmse, torch_r2, y_pred_torch = evaluate_torch_model()

In [None]:
# Plot training and validation loss
plt.figure(figsize=(10, 6))
plt.plot(train_losses, label='Training Loss')
plt.plot(val_losses, label='Validation Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss (MSE)')
plt.title('Training and Validation Loss')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Visualize actual vs predicted
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred_torch, alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.xlabel('Actual Magnitude')
plt.ylabel('Predicted Magnitude (PyTorch)')
plt.title('Actual vs Predicted Earthquake Magnitude (PyTorch)')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Compare PyTorch with traditional model results
comparison_data = {
    # Label the network with the device it actually ran on ('cuda' or 'cpu')
    'Model': [f'PyTorch NN ({device.type})', f'{best_model_name} (CPU)'],
    'MAE': [torch_mae, final_mae],
    'RMSE': [torch_rmse, final_rmse],
    'R²': [torch_r2, final_r2],
    'Training Time (s)': [gpu_training_time, None]  # We don't have CPU time recorded
}

comparison_df = pd.DataFrame(comparison_data)
print("Model Performance Comparison:")
comparison_df

In [None]:
# Save the PyTorch model
torch.save(model.state_dict(), 'models/earthquake_pytorch_model.pt')
print("PyTorch model saved as 'models/earthquake_pytorch_model.pt'")

# Also save a script to load and use the model
with open('models/load_pytorch_model.py', 'w') as f:
    f.write("""
import torch
import torch.nn as nn

class EarthquakeNN(nn.Module):
    def __init__(self, input_size):
        super(EarthquakeNN, self).__init__()
        self.network = nn.Sequential(
            nn.Linear(input_size, 128),
            nn.ReLU(),
            nn.BatchNorm1d(128),
            nn.Dropout(0.3),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.BatchNorm1d(64),
            nn.Dropout(0.2),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 1)
        )
    
    def forward(self, x):
        return self.network(x)

# Function to load model
def load_model(model_path, input_size, device='cpu'):
    model = EarthquakeNN(input_size)
    model.load_state_dict(torch.load(model_path, map_location=device))
    model.eval()
    return model
""")
print("Model loading script saved as 'models/load_pytorch_model.py'")