# Japan Gas Demand Forecasting - Data Collection and Sources

## Project Overview

This notebook establishes the foundation for forecasting natural gas demand in Japan. We'll:

1. **Document data sources** for Japanese gas consumption
2. **Set up data collection infrastructure** 
3. **Generate synthetic data** for demonstration and testing
4. **Validate data quality** and establish baseline metrics

## Data Sources for Japan Gas Demand

### Primary Sources
- **METI (Ministry of Economy, Trade and Industry)**: Official energy statistics
- **Japan Gas Association**: Industry consumption data
- **JMA (Japan Meteorological Agency)**: Weather and temperature data
- **Bank of Japan**: Economic indicators

### Secondary Sources
- **IEA Statistics**: International benchmarks
- **Platts/S&P Global**: LNG prices and market data
- **Trading platforms**: Real-time demand signals


In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

# Import our custom modules
import sys
sys.path.append('../src')
from data_processing import JapanGasDataCollector, load_sample_data
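from forecasting_utils import calculate_forecast_metrics, time_series_cv_split
# NOTE: the quality-assessment cell below calls these three helpers, which were never
# imported. They are ASSUMED to live in forecasting_utils; adjust the module name if
# they are defined elsewhere (for example in data_processing).
from forecasting_utils import (
    calculate_seasonal_indices,
    detect_structural_breaks,
    weather_demand_correlation_analysis,
)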

# Set up plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)

print("🚀 Japan Gas Demand Forecasting - Data Collection")
print("=" * 60)
print(f"📅 Analysis started: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print("✅ Libraries imported successfully")
print("✅ Custom modules loaded")


## 1. Data Collection Strategy

### Real Data Sources (When Available)

1. **METI Energy Statistics**
   - Monthly gas consumption by sector
   - Historical data from 2010 onward
   - High quality, official source

2. **Japan Meteorological Agency**
   - Daily temperature data
   - Heating/Cooling degree days
   - Regional weather patterns

3. **Economic Indicators**
   - GDP growth rates
   - Industrial production indices
   - Energy price indices
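
Heating and cooling degree days are usually derived from daily mean temperatures rather than collected directly. The sketch below shows one way to aggregate them to monthly totals. The helper name `degree_days`, the 18 °C base temperature, and the sample temperatures are assumptions for illustration; they are not taken from JMA data or from this project's `data_processing` module.

```python
import pandas as pd


def degree_days(daily_temp_c: pd.Series, base_c: float = 18.0) -> pd.DataFrame:
    """Aggregate daily mean temperatures (°C) into monthly HDD and CDD totals.

    Hypothetical helper: base_c = 18.0 is a common convention, assumed here.
    """
    # A day contributes to HDD only when it is colder than the base temperature,
    # and to CDD only when it is warmer.
    hdd = (base_c - daily_temp_c).clip(lower=0)
    cdd = (daily_temp_c - base_c).clip(lower=0)
    daily = pd.DataFrame({"heating_degree_days": hdd, "cooling_degree_days": cdd})
    # Sum the daily values within each calendar month (month-start labels).
    return daily.resample("MS").sum()


# Illustrative values only, not real observations
temps = pd.Series(
    [5.0, 8.0, 20.0, 25.0],
    index=pd.to_datetime(["2024-01-01", "2024-01-02", "2024-07-01", "2024-07-02"]),
)
monthly = degree_days(temps)
print(monthly.loc[[pd.Timestamp("2024-01-01"), pd.Timestamp("2024-07-01")]])
```

Months without any observations appear with totals of zero after resampling, so with real data it may be worth checking daily coverage before trusting a monthly value.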

### Synthetic Data Generation

For demonstration and testing purposes, we'll generate realistic synthetic data that captures:
- **Seasonal patterns**: Winter heating demand
- **Temperature correlations**: Heating degree days
- **Economic cycles**: Industrial demand variations
- **COVID-19 impact**: Demand reduction during pandemic
- **Holiday effects**: Golden Week, Obon, New Year
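
To make these components concrete, here is a minimal, self-contained sketch of how such a series could be assembled. It is not the implementation behind `JapanGasDataCollector.generate_synthetic_data`, which this notebook does not show. The function name `make_synthetic_demand`, every coefficient, the 18 °C base temperature, and the April to September 2020 pandemic window are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd


def make_synthetic_demand(start="2018-01-01", end="2024-08-31", seed=42):
    """Hypothetical generator; all magnitudes are illustrative placeholders."""
    rng = np.random.default_rng(seed)
    idx = pd.date_range(start, end, freq="MS")  # one row per month
    month = idx.month.to_numpy()
    n = len(idx)

    # Seasonal temperature: coldest in January, warmest in July, plus noise
    temp = 15.0 - 10.0 * np.cos(2 * np.pi * (month - 1) / 12) + rng.normal(0, 1.0, n)
    # Monthly heating degree days from the monthly mean temperature
    hdd = np.clip(18.0 - temp, 0, None) * idx.days_in_month.to_numpy()

    # Economic component: a slow linear trend plus a four-year cycle
    t = np.arange(n)
    economy = 1.0 * t + 50.0 * np.sin(2 * np.pi * t / 48)

    # Multiplicative reductions for the assumed pandemic window and holiday months
    in_covid = (idx >= pd.Timestamp("2020-04-01")) & (idx <= pd.Timestamp("2020-09-01"))
    covid = np.where(in_covid, 0.92, 1.0)
    holiday = np.where(np.isin(month, [1, 5, 8]), 0.98, 1.0)  # New Year, Golden Week, Obon

    demand = (3000.0 + 2.0 * hdd + economy) * covid * holiday + rng.normal(0, 30.0, n)
    return pd.DataFrame(
        {
            "total_gas_demand_mcm": demand,
            "avg_temperature_celsius": temp,
            "heating_degree_days": hdd,
        },
        index=idx,
    )


demo = make_synthetic_demand()
print(demo.head())
```

Fixing the seed keeps the demonstration reproducible, which matters when later notebooks compare model scores on the same synthetic history.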


In [None]:
# Initialize data collector
collector = JapanGasDataCollector()

print("📊 DATA COLLECTION SETUP")
print("=" * 40)
print(f"Data sources configured: {len(collector.data_sources)}")
for source, url in collector.data_sources.items():
    print(f"  • {source}: {url}")

# Generate comprehensive synthetic dataset
print(f"\n🔄 Generating synthetic Japan gas demand data...")
print("   This creates realistic data for demonstration and model development")

# Generate monthly data from January 2018 through August 2024 (about 6.7 years) for sufficient history
gas_data = collector.generate_synthetic_data(
    start_date='2018-01-01', 
    end_date='2024-08-31'
)

print(f"✅ Synthetic data generated successfully!")
print(f"📈 Dataset overview:")
print(f"   • Period: {gas_data.index.min().strftime('%Y-%m')} to {gas_data.index.max().strftime('%Y-%m')}")
print(f"   • Observations: {len(gas_data):,} monthly records")
print(f"   • Variables: {len(gas_data.columns)}")
print(f"   • Memory usage: {gas_data.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

# Display basic statistics
print(f"\n📊 Gas Demand Statistics:")
print(f"   • Mean: {gas_data['total_gas_demand_mcm'].mean():.1f} million m³/month")
print(f"   • Std Dev: {gas_data['total_gas_demand_mcm'].std():.1f} million m³/month")
print(f"   • Min: {gas_data['total_gas_demand_mcm'].min():.1f} million m³/month")
print(f"   • Max: {gas_data['total_gas_demand_mcm'].max():.1f} million m³/month")
print(f"   • Coefficient of Variation: {(gas_data['total_gas_demand_mcm'].std()/gas_data['total_gas_demand_mcm'].mean()*100):.1f}%")


In [None]:
# Add calendar features and enhance the dataset
print("🔧 ENHANCING DATASET WITH FEATURES")
print("=" * 40)

# Add calendar-based features
gas_data_enhanced = collector.add_calendar_features(gas_data)

# Create lagged features for forecasting
gas_data_enhanced = collector.create_lagged_features(
    gas_data_enhanced, 
    'total_gas_demand_mcm', 
    max_lag=12
)

# Clean and validate the data
gas_data_final = collector.clean_and_validate_data(gas_data_enhanced)

print(f"✅ Dataset enhancement complete!")
print(f"📊 Enhanced dataset features:")
print(f"   • Total columns: {len(gas_data_final.columns)}")
print(f"   • Calendar features: {len([c for c in gas_data_final.columns if c in ['year', 'month', 'quarter', 'day_of_year', 'is_winter', 'is_spring', 'is_summer', 'is_autumn']])}")
print(f"   • Lag features: {len([c for c in gas_data_final.columns if 'lag_' in c])}")
print(f"   • Rolling features: {len([c for c in gas_data_final.columns if 'ma_' in c or 'std_' in c])}")
print(f"   • Cyclical features: {len([c for c in gas_data_final.columns if 'sin' in c or 'cos' in c])}")

# Display first few rows
print(f"\n📋 Dataset Preview:")
print(gas_data_final.head())

# Show column information
print(f"\n📝 Column Information:")
for i, col in enumerate(gas_data_final.columns, 1):
    dtype = gas_data_final[col].dtype
    non_null = gas_data_final[col].count()
    print(f"  {i:2d}. {col:<30} | {str(dtype):<10} | {non_null:3d} non-null")


## 2. Data Quality Assessment

Let's analyze the quality and characteristics of our dataset to understand:
- **Seasonal patterns** and trends
- **Data completeness** and missing values
- **Outliers** and anomalies
- **Correlations** between variables
- **Structural breaks** in the time series
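
Two of these checks are easy to express directly in pandas and SciPy. The sketch below computes monthly seasonal indices as the ratio of each month's mean to the overall mean, and flags a possible structural break with a Welch t-test on the two halves of the series. The function names are hypothetical stand-ins; the project's own `calculate_seasonal_indices` and `detect_structural_breaks` may work differently. The dictionary keys mirror the ones read in the assessment cell that follows.

```python
import numpy as np
import pandas as pd
from scipy import stats


def seasonal_indices_sketch(series: pd.Series) -> pd.Series:
    """Ratio of each calendar month's mean to the overall mean (1.0 = average)."""
    monthly_mean = series.groupby(series.index.month).mean()
    return monthly_mean / series.mean()


def split_half_break_sketch(series: pd.Series, alpha: float = 0.05) -> dict:
    """Compare the first and second half of the series with a Welch t-test."""
    mid = len(series) // 2
    first, second = series.iloc[:mid], series.iloc[mid:]
    _, p_value = stats.ttest_ind(first, second, equal_var=False)
    return {
        "has_structural_break": bool(p_value < alpha),
        "break_point_approx": series.index[mid],
        "p_value": float(p_value),
        "first_half_mean": float(first.mean()),
        "second_half_mean": float(second.mean()),
    }


# Toy series: a level shift of +20 halfway through, plus a repeating monthly ramp
idx = pd.date_range("2020-01-01", periods=48, freq="MS")
level = np.r_[np.full(24, 100.0), np.full(24, 120.0)]
toy = pd.Series(level + np.tile(np.arange(12.0), 4), index=idx)

indices = seasonal_indices_sketch(toy)
result = split_half_break_sketch(toy)
print(indices.round(3))
print(result)
```

A split-half test is a crude diagnostic: it assumes the break sits near the midpoint and ignores autocorrelation and seasonality, so a significant result is a prompt for closer inspection rather than proof of a break.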


In [None]:
# Comprehensive data quality assessment
print("🔍 DATA QUALITY ASSESSMENT")
print("=" * 40)

# 1. Missing values analysis
missing_values = gas_data_final.isnull().sum()
missing_pct = (missing_values / len(gas_data_final) * 100).round(2)

print("📊 Missing Values Analysis:")
if missing_values.sum() > 0:
    print("Missing values found:")
    for col, count in missing_values[missing_values > 0].items():
        print(f"  • {col:<30}: {count:3d} ({missing_pct[col]:5.1f}%)")
else:
    print("✅ No missing values found")

# 2. Outlier detection using IQR method
print(f"\n🎯 Outlier Detection:")
target_col = 'total_gas_demand_mcm'
Q1 = gas_data_final[target_col].quantile(0.25)
Q3 = gas_data_final[target_col].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = gas_data_final[
    (gas_data_final[target_col] < lower_bound) | 
    (gas_data_final[target_col] > upper_bound)
]

print(f"  • IQR bounds: {lower_bound:.1f} - {upper_bound:.1f} million m³/month")
print(f"  • Outliers detected: {len(outliers)} ({len(outliers)/len(gas_data_final)*100:.1f}%)")

if len(outliers) > 0:
    print(f"  • Outlier dates: {outliers.index.strftime('%Y-%m').tolist()}")
    print(f"  • Outlier values: {outliers[target_col].round(1).tolist()}")

# 3. Seasonal strength analysis
print(f"\n📈 Seasonal Analysis:")
seasonal_indices = calculate_seasonal_indices(gas_data_final, target_col)
print("Monthly seasonal indices:")
for month, index in seasonal_indices.items():
    month_name = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                  'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'][month-1]
    print(f"  • {month_name}: {index:.3f}")

# Calculate seasonal strength
seasonal_strength = seasonal_indices.std()
print(f"  • Seasonal strength: {seasonal_strength:.3f}")

# 4. Structural break detection
print(f"\n🔍 Structural Break Analysis:")
break_analysis = detect_structural_breaks(gas_data_final[target_col])
print(f"  • Has structural break: {'Yes' if break_analysis['has_structural_break'] else 'No'}")
if break_analysis['has_structural_break']:
    print(f"  • Break point: {break_analysis['break_point_approx'].strftime('%Y-%m')}")
    print(f"  • P-value: {break_analysis['p_value']:.4f}")
    print(f"  • Before break mean: {break_analysis['first_half_mean']:.1f}")
    print(f"  • After break mean: {break_analysis['second_half_mean']:.1f}")

# 5. Weather-demand correlation analysis
print(f"\n🌡️ Weather-Demand Correlation:")
correlations = weather_demand_correlation_analysis(gas_data_final)
for weather_var, corr_dict in correlations.items():
    print(f"  • {weather_var}:")
    for demand_var, corr in corr_dict.items():
        print(f"    - {demand_var}: {corr:.3f}")


In [None]:
# Create comprehensive data visualization dashboard
print("📊 CREATING DATA VISUALIZATION DASHBOARD")
print("=" * 50)

# Create subplots
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
fig.suptitle('Japan Gas Demand - Data Quality Dashboard', fontsize=16, fontweight='bold')

# 1. Time series plot with trend
axes[0,0].plot(gas_data_final.index, gas_data_final['total_gas_demand_mcm'], 
               linewidth=2, color='steelblue', alpha=0.8)
axes[0,0].set_title('Gas Demand Time Series', fontweight='bold')
axes[0,0].set_xlabel('Date')
axes[0,0].set_ylabel('Gas Demand (Million m³/month)')
axes[0,0].grid(True, alpha=0.3)

# Add trend line
from scipy import stats
x_numeric = np.arange(len(gas_data_final))
slope, intercept, r_value, p_value, std_err = stats.linregress(x_numeric, gas_data_final['total_gas_demand_mcm'])
trend_line = slope * x_numeric + intercept
axes[0,0].plot(gas_data_final.index, trend_line, 'r--', alpha=0.7, linewidth=2, 
               label=f'Trend (R²={r_value**2:.3f})')
axes[0,0].legend()

# 2. Seasonal patterns (monthly)
monthly_demand = gas_data_final.groupby('month')['total_gas_demand_mcm'].mean()
month_names = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
               'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
bars = axes[0,1].bar(range(1, 13), monthly_demand.values, color='lightcoral', alpha=0.8)
axes[0,1].set_title('Average Monthly Gas Demand', fontweight='bold')
axes[0,1].set_xlabel('Month')
axes[0,1].set_ylabel('Average Demand (Million m³/month)')
axes[0,1].set_xticks(range(1, 13))
axes[0,1].set_xticklabels(month_names)
axes[0,1].grid(True, alpha=0.3)

# Add value labels on bars
for bar, value in zip(bars, monthly_demand.values):
    axes[0,1].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 10,
                   f'{value:.0f}', ha='center', va='bottom', fontweight='bold')

# 3. Temperature vs Gas Demand scatter
axes[0,2].scatter(gas_data_final['avg_temperature_celsius'], gas_data_final['total_gas_demand_mcm'], 
                  alpha=0.6, s=50, color='green')
axes[0,2].set_title('Gas Demand vs Temperature', fontweight='bold')
axes[0,2].set_xlabel('Temperature (°C)')
axes[0,2].set_ylabel('Gas Demand (Million m³/month)')

# Add correlation coefficient
correlation = gas_data_final['avg_temperature_celsius'].corr(gas_data_final['total_gas_demand_mcm'])
axes[0,2].text(0.05, 0.95, f'Correlation: {correlation:.3f}', 
              transform=axes[0,2].transAxes, fontsize=12, 
              bbox=dict(boxstyle="round,pad=0.3", facecolor="white", alpha=0.8))
axes[0,2].grid(True, alpha=0.3)

# 4. Sector breakdown
sector_cols = ['residential_demand_mcm', 'commercial_demand_mcm', 
               'industrial_demand_mcm', 'power_generation_demand_mcm']
sector_means = gas_data_final[sector_cols].mean()
sector_labels = ['Residential', 'Commercial', 'Industrial', 'Power Gen']

wedges, texts, autotexts = axes[1,0].pie(sector_means.values, labels=sector_labels, 
                                         autopct='%1.1f%%', startangle=90)
axes[1,0].set_title('Gas Demand by Sector', fontweight='bold')

# 5. Heating Degree Days vs Demand
axes[1,1].scatter(gas_data_final['heating_degree_days'], gas_data_final['total_gas_demand_mcm'], 
                  alpha=0.6, s=50, color='red')
axes[1,1].set_title('Gas Demand vs Heating Degree Days', fontweight='bold')
axes[1,1].set_xlabel('Heating Degree Days')
axes[1,1].set_ylabel('Gas Demand (Million m³/month)')

# Add correlation
hdd_corr = gas_data_final['heating_degree_days'].corr(gas_data_final['total_gas_demand_mcm'])
axes[1,1].text(0.05, 0.95, f'Correlation: {hdd_corr:.3f}', 
              transform=axes[1,1].transAxes, fontsize=12, 
              bbox=dict(boxstyle="round,pad=0.3", facecolor="white", alpha=0.8))
axes[1,1].grid(True, alpha=0.3)

# 6. Year-over-year growth
yearly_demand = gas_data_final.groupby('year')['total_gas_demand_mcm'].mean()
yoy_growth = yearly_demand.pct_change() * 100

axes[1,2].bar(yearly_demand.index[1:], yoy_growth.values[1:], 
              color=['green' if x > 0 else 'red' for x in yoy_growth.values[1:]], alpha=0.7)
axes[1,2].set_title('Year-over-Year Growth Rate', fontweight='bold')
axes[1,2].set_xlabel('Year')
axes[1,2].set_ylabel('Growth Rate (%)')
axes[1,2].grid(True, alpha=0.3)
axes[1,2].axhline(y=0, color='black', linestyle='-', alpha=0.5)

# Add value labels
for year, growth in zip(yearly_demand.index[1:], yoy_growth.values[1:]):
    axes[1,2].text(year, growth + (0.5 if growth > 0 else -0.5), f'{growth:.1f}%', 
                   ha='center', va='bottom' if growth > 0 else 'top', fontweight='bold')

plt.tight_layout()
plt.show()

# Print summary statistics
print(f"\n📈 DATA SUMMARY STATISTICS")
print("=" * 40)
print(f"• Dataset period: {gas_data_final.index.min().strftime('%Y-%m')} to {gas_data_final.index.max().strftime('%Y-%m')}")
print(f"• Total observations: {len(gas_data_final):,} months")
print(f"• Average monthly demand: {gas_data_final['total_gas_demand_mcm'].mean():.1f} million m³")
print(f"• Peak demand: {gas_data_final['total_gas_demand_mcm'].max():.1f} million m³ ({gas_data_final['total_gas_demand_mcm'].idxmax().strftime('%Y-%m')})")
print(f"• Minimum demand: {gas_data_final['total_gas_demand_mcm'].min():.1f} million m³ ({gas_data_final['total_gas_demand_mcm'].idxmin().strftime('%Y-%m')})")
print(f"• Seasonal variation: {seasonal_indices.max():.3f}x peak vs {seasonal_indices.min():.3f}x trough")
print(f"• Temperature correlation: {correlation:.3f}")
print(f"• HDD correlation: {hdd_corr:.3f}")
print(f"• Annual trend: {abs(slope) * 12:.1f} million m³/year {'increase' if slope > 0 else 'decrease'}")

print(f"\n✅ Data collection and quality assessment complete!")
print(f"📁 Dataset ready for exploratory data analysis and modeling.")


## 3. Data Export and Next Steps

### Export Processed Dataset

The enhanced dataset is now ready for the next phase of analysis. We'll save it for use in subsequent notebooks.
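
When a later notebook reloads the CSV, the date index has to be parsed again, because CSV stores it as plain text. The sketch below shows a round trip on a tiny stand-in frame written to a temporary directory. The index name `date`, the file name `demo.csv`, and the sample values are assumptions for illustration; the real export may label its index differently.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Tiny stand-in for the processed dataset (illustrative values only)
frame = pd.DataFrame(
    {"total_gas_demand_mcm": [3100.0, 2950.5, 2800.25]},
    index=pd.date_range("2024-01-01", periods=3, freq="MS", name="date"),
)

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "processed"
    out_dir.mkdir(parents=True, exist_ok=True)  # to_csv does not create directories

    csv_path = out_dir / "demo.csv"
    frame.to_csv(csv_path)

    # Restore the DatetimeIndex on load; without parse_dates it comes back as strings
    reloaded = pd.read_csv(csv_path, index_col="date", parse_dates=True)
    size_kb = csv_path.stat().st_size / 1024  # size on disk, not in memory

print(reloaded)
print(f"File size: {size_kb:.2f} KB")
```

The reloaded index no longer carries a frequency, so code that relies on `index.freq` may need to call `asfreq("MS")` after loading.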


In [None]:
# Export the processed dataset for use in other notebooks
print("💾 EXPORTING PROCESSED DATASET")
print("=" * 40)

# Save to CSV for easy access, creating the output directory if needed
import os
output_path = '../data/processed/japan_gas_demand_processed.csv'
os.makedirs(os.path.dirname(output_path), exist_ok=True)
gas_data_final.to_csv(output_path)

print(f"✅ Dataset exported to: {output_path}")
print(f"📊 Export summary:")
print(f"   • Rows: {len(gas_data_final):,}")
print(f"   • Columns: {len(gas_data_final.columns)}")
# Report the size of the file on disk rather than the in-memory footprint
print(f"   • File size: {os.path.getsize(output_path) / 1024**2:.2f} MB")

# Also save a summary statistics file
summary_stats = {
    'dataset_info': {
        'period_start': gas_data_final.index.min().strftime('%Y-%m-%d'),
        'period_end': gas_data_final.index.max().strftime('%Y-%m-%d'),
        'total_observations': len(gas_data_final),
        'total_columns': len(gas_data_final.columns)
    },
    'gas_demand_stats': {
        'mean': gas_data_final['total_gas_demand_mcm'].mean(),
        'std': gas_data_final['total_gas_demand_mcm'].std(),
        'min': gas_data_final['total_gas_demand_mcm'].min(),
        'max': gas_data_final['total_gas_demand_mcm'].max(),
        'cv': gas_data_final['total_gas_demand_mcm'].std() / gas_data_final['total_gas_demand_mcm'].mean()
    },
    'seasonal_patterns': {
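        # Cast to int so the month is written as a JSON number; NumPy integers are
        # otherwise stringified by default=str in json.dump below
        'peak_month': int(gas_data_final.groupby('month')['total_gas_demand_mcm'].mean().idxmax()),
        'trough_month': int(gas_data_final.groupby('month')['total_gas_demand_mcm'].mean().idxmin()),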
        'seasonal_strength': seasonal_indices.std()
    },
    'correlations': {
        'temperature': correlation,
        'heating_degree_days': hdd_corr
    }
}

# Save summary as JSON
import json
summary_path = '../data/processed/dataset_summary.json'
with open(summary_path, 'w') as f:
    json.dump(summary_stats, f, indent=2, default=str)

print(f"✅ Summary statistics saved to: {summary_path}")

print(f"\n🎯 NEXT STEPS:")
print(f"   1. Run notebook 02: Exploratory Data Analysis")
print(f"   2. Run notebook 03: Forecasting Models")
print(f"   3. Run notebook 04: Model Evaluation")
print(f"   4. Run notebook 05: Forecast Visualization")

print(f"\n✅ Data collection and processing complete!")
print(f"📁 Ready for comprehensive analysis and forecasting.")
