# Task 1: Laying the Foundation for Analysis

## Objective
Define the data analysis workflow and develop a thorough understanding of the model and data.

This notebook implements Task 1 requirements:
1. **Data Analysis Workflow**: Documented steps from data loading to insight generation
2. **Event Data Research**: Compiled dataset of major oil market events
3. **Assumptions and Limitations**: Documented with emphasis on correlation vs. causation
4. **Communication Channels**: Identified formats for stakeholder communication
5. **Time Series Properties**: Analysis of trend, stationarity, and volatility
6. **Change Point Models**: Explanation of purpose and expected outputs

## 1. Data Analysis Workflow

### Step 1: Data Loading and Validation
- Load Brent crude oil price time series data (daily frequency recommended)
- Validate data quality: check for missing values, outliers, and data integrity
- Document data source, date range, and frequency
- Create initial time series visualization to identify visual patterns
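
The validation checks listed above can be sketched with pandas alone. This is a minimal illustration and not the project's `validate_data` implementation, whose internals are not shown here. The function name `basic_validation_report`, the `price` column name, and the 1.5 × IQR outlier rule are assumptions made for the example.

```python
import numpy as np
import pandas as pd

def basic_validation_report(df: pd.DataFrame, column: str = "price") -> dict:
    """Summarize common data-quality checks for a daily price series."""
    series = df[column]
    # Flag outliers with the 1.5 * IQR rule applied to daily percentage changes
    returns = series.dropna().pct_change().dropna()
    q1, q3 = returns.quantile([0.25, 0.75])
    iqr = q3 - q1
    outliers = returns[(returns < q1 - 1.5 * iqr) | (returns > q3 + 1.5 * iqr)]
    return {
        "n_rows": len(df),
        "start": df.index.min(),
        "end": df.index.max(),
        "missing_values": int(series.isna().sum()),
        "duplicate_dates": int(df.index.duplicated().sum()),
        "non_positive_prices": int((series <= 0).sum()),
        "is_sorted": bool(df.index.is_monotonic_increasing),
        "n_return_outliers": int(len(outliers)),
    }
```

Running the check on returns rather than price levels is a design choice: price levels drift across regimes, so a level-based rule would flag entire periods rather than individual suspect observations.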

### Step 2: Exploratory Data Analysis (EDA)
- **Trend Analysis**: Decompose time series to identify long-term trends, test for deterministic vs stochastic trends
- **Stationarity Testing**: Apply ADF, KPSS, and PP tests to determine if differencing is required
- **Volatility Analysis**: Calculate rolling volatility, test for volatility clustering (ARCH effects), identify volatility regimes
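
As a rough illustration of the ARCH-effects check mentioned above, the Ljung-Box Q statistic can be computed on squared returns with NumPy alone. The function names are hypothetical, and the hard-coded critical value (18.307, the 5% point of a chi-square distribution with 10 degrees of freedom) only applies when `lags=10`. A full implementation would normally compute a p-value with a statistics library instead.

```python
import numpy as np

def ljung_box_q(x, lags=10):
    """Ljung-Box Q statistic for autocorrelation in x up to `lags`."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    centered = x - x.mean()
    denom = (centered ** 2).sum()
    q = 0.0
    for k in range(1, lags + 1):
        # Sample autocorrelation at lag k
        rho_k = (centered[k:] * centered[:-k]).sum() / denom
        q += rho_k ** 2 / (n - k)
    return n * (n + 2) * q

def has_arch_effects(returns, lags=10, critical_value=18.307):
    """Return True if squared returns are autocorrelated at the 5% level.

    The default critical value assumes lags=10; change both together.
    """
    returns = np.asarray(returns, dtype=float)
    return bool(ljung_box_q(returns ** 2, lags=lags) > critical_value)
```

Testing squared returns rather than raw returns is what makes this a check for volatility clustering: it asks whether large moves tend to follow large moves, regardless of sign.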

### Step 3: Event Data Integration
- Load compiled event dataset
- Align event dates with price data timeline
- Categorize events by type (OPEC decisions, geopolitical events, economic shocks)
- Visually assess price movements around event dates
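
Events often fall on weekends or holidays when no price is quoted, so alignment usually means mapping each event to a nearby trading day. The sketch below uses `pandas.merge_asof` to pick the first trading day on or after each event. It is not the project's `align_events_with_prices` function; the helper name and the assumption that `prices` has a `DatetimeIndex` and a single price column are illustrative.

```python
import pandas as pd

def align_events_to_trading_days(events: pd.DataFrame, prices: pd.DataFrame) -> pd.DataFrame:
    """Attach to each event the first trading day on or after its date.

    Assumes `events` has an `event_date` datetime column and `prices`
    has a DatetimeIndex with exactly one price column.
    """
    events_sorted = events.sort_values("event_date")
    trading_days = prices.sort_index().reset_index()
    trading_days.columns = ["trading_date", "price"]
    # direction="forward" picks the nearest trading day at or after the event
    return pd.merge_asof(
        events_sorted,
        trading_days,
        left_on="event_date",
        right_on="trading_date",
        direction="forward",
    )
```

Using `direction="backward"` instead would attach the last price before the event, which may be the better baseline when measuring the price change across the event.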

### Step 4: Change Point Detection
- Select appropriate change point detection method (Bayesian change point models, CUSUM, Chow test)
- Specify model parameters (mean shifts, variance changes, trend breaks)
- Fit models and estimate change point locations with uncertainty intervals
- Perform model diagnostics (convergence checks, model fit assessment)
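
To make the idea of a mean-shift change point concrete, here is a minimal least-squares search for a single break. It is a simple relative of the CUSUM and Chow-test approaches listed above, not the Bayesian model this project will fit. The function name and the `min_size` guard are assumptions for the example.

```python
import numpy as np

def single_mean_shift(y, min_size=5):
    """Find the split index that minimizes the two-segment squared error.

    Returns (k, cost), where y[:k] and y[k:] are the two regimes.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    best_k, best_cost = None, np.inf
    for k in range(min_size, n - min_size + 1):
        left, right = y[:k], y[k:]
        # Cost of fitting a separate constant mean to each segment
        cost = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k, best_cost
```

This returns only a point estimate. The Bayesian approach named in this step differs mainly in that it yields a distribution over break locations rather than a single index.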

### Step 5: Results Interpretation and Validation
- Extract most probable change point dates with credible intervals
- Estimate regime-specific parameters (mean, variance, trend)
- Compare detected change points with historical events
- Assess statistical significance of detected breaks
- Characterize each identified regime
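
Comparing detected change points with historical events can start as a simple nearest-date match within a tolerance window. The sketch below is illustrative only: the function name and the 30-day default window are assumptions, and a match indicates temporal proximity, not causation.

```python
import pandas as pd

def match_breaks_to_events(break_dates, events: pd.DataFrame, window_days: int = 30) -> pd.DataFrame:
    """For each break date, report the nearest event if it lies within the window.

    Assumes `events` has an `event_date` datetime column.
    """
    rows = []
    for b in pd.to_datetime(list(break_dates)):
        gaps = (events["event_date"] - b).abs()
        nearest = gaps.idxmin()
        within = gaps.loc[nearest] <= pd.Timedelta(days=window_days)
        rows.append({
            "break_date": b,
            "nearest_event_date": events.loc[nearest, "event_date"] if within else pd.NaT,
            "gap_days": gaps.loc[nearest].days if within else None,
        })
    return pd.DataFrame(rows)
```

The window width matters: a wide window will match almost every break to some event, so it is worth reporting results for more than one width.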

### Step 6: Visualization and Reporting
- Create time series plots with overlaid change points
- Visualize posterior distributions for change point uncertainty
- Generate event timeline showing events and detected breaks
- Produce summary statistics comparing regimes
- Create executive summary and technical report

### Step 7: Insight Generation
- Identify patterns in structural breaks
- Discuss potential causal relationships (with appropriate caveats)
- Discuss what the detected regimes suggest about future price behavior, noting that historical breaks are not necessarily predictive
- Identify periods of elevated risk and uncertainty

In [None]:
# Import required libraries
import sys
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Add src to path
sys.path.append(str(Path('../src').resolve()))

# Import project modules
from data_loader import load_oil_price_data, validate_data, preprocess_data, load_event_data
from eda import (
    descriptive_statistics, 
    trend_analysis, 
    test_stationarity, 
    volatility_analysis,
    autocorrelation_analysis
)
from event_integration import (
    align_events_with_prices,
    categorize_events,
    calculate_event_impact_statistics
)

# Set plotting style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

print("Libraries imported successfully!")

## 2. Event Data Research and Compilation

We have compiled a structured dataset containing **22 major oil market events** (2001-2023) covering:
- OPEC production decisions
- Geopolitical events (wars, conflicts, sanctions)
- Economic shocks (financial crises, pandemics)
- Natural disasters
- Market events

Let's load and examine the event data:

In [None]:
# Load event data
event_data_path = '../data/raw/oil_market_events.csv'
events_df = load_event_data(event_data_path)

print(f"Total events compiled: {len(events_df)}")
print(f"\nEvent date range: {events_df['event_date'].min()} to {events_df['event_date'].max()}")
print(f"\nEvent types:")
print(events_df['event_type'].value_counts())
print(f"\nImpact types:")
print(events_df['impact_type'].value_counts())
print(f"\nSeverity distribution:")
print(events_df['severity'].value_counts())

# Display the complete event table
print("\n" + "="*100)
print("COMPLETE EVENT TABLE")
print("="*100)
display(events_df.sort_values('event_date'))

## 3. Data Loading and Validation

Now let's load actual Brent crude oil price data and perform validation. The cells below try the FRED series `DCOILBRENTEU` first and fall back to yfinance (`BZ=F`) if that fails.

In [None]:
# Load Brent crude oil price data using yfinance
try:
    import yfinance as yf
    
    # Yahoo Finance lists Brent crude futures under BZ=F (CL=F is WTI).
    # This cell only prints the available loading options; the next cell loads the data.
    print("Attempting to load Brent crude oil price data...")
    
    # Option 1: Try to fetch from yfinance (if available)
    # For demonstration, we'll create sample data structure
    # In practice, load from your data source
    
    # Create date range for analysis
    start_date = '2000-01-01'
    end_date = '2024-12-31'
    
    print(f"Data range: {start_date} to {end_date}")
    print("\nNOTE: To load actual data, either:")
    print("1. Provide a CSV file with Brent prices in data/raw/")
    print("2. Use yfinance with appropriate ticker")
    print("3. Use FRED API for DCOILBRENTEU (Brent Crude Oil Price)")
    
    # For demonstration, we'll show the data loading structure
    # Replace this with actual data loading when available
    
except ImportError:
    print("yfinance not installed. Install with: pip install yfinance")
    print("Alternatively, load data from CSV file.")

In [None]:
# Alternative: Load from FRED API or create sample data for demonstration
# This cell demonstrates the data loading workflow with actual implementation

def load_brent_data_from_fred():
    """Load Brent crude oil price from FRED API."""
    try:
        import pandas_datareader.data as web
        from datetime import datetime
        
        # FRED series ID for Brent Crude Oil Price
        series_id = 'DCOILBRENTEU'
        start = datetime(2000, 1, 1)
        end = datetime(2024, 12, 31)
        
        df = web.DataReader(series_id, 'fred', start, end)
        df.columns = ['price']
        df = df.dropna()
        return df
    except Exception as e:
        print(f"FRED API loading failed: {e}")
        return None

def load_brent_data_from_yfinance():
    """Load Brent crude oil futures prices from yfinance."""
    try:
        import yfinance as yf
        # BZ=F is the Brent crude futures ticker on Yahoo Finance (CL=F is WTI)
        ticker = yf.Ticker("BZ=F")
        df = ticker.history(start="2000-01-01", end="2024-12-31")
        if df.empty:
            return None
        df = df[['Close']].rename(columns={'Close': 'price'})
        # Drop timezone info so the index aligns with tz-naive event dates
        df.index = df.index.tz_localize(None)
        return df
    except Exception as e:
        print(f"yfinance loading failed: {e}")
        return None

# Try to load data
print("Attempting to load Brent crude oil price data...")
price_df = None

# Try FRED first
price_df = load_brent_data_from_fred()
if price_df is not None:
    print(f"✓ Loaded {len(price_df)} observations from FRED")
    print(f"Date range: {price_df.index.min()} to {price_df.index.max()}")
else:
    # Try yfinance
    price_df = load_brent_data_from_yfinance()
    if price_df is not None:
        print(f"✓ Loaded {len(price_df)} observations from yfinance")
        print(f"Date range: {price_df.index.min()} to {price_df.index.max()}")
    else:
        print("⚠ Could not load data from online sources.")
        print("Please provide a CSV file with Brent prices or install required packages:")
        print("  pip install pandas-datareader yfinance")
        print("\nFor demonstration, creating sample data structure...")
        # Create sample date range
        dates = pd.date_range(start='2000-01-01', end='2024-12-31', freq='D')
        # Create placeholder (replace with actual data loading)
        price_df = pd.DataFrame(index=dates, columns=['price'])
        price_df['price'] = np.nan
        print("Sample structure created. Replace with actual data loading.")

# Skip validation for the all-NaN placeholder, matching the guards in later cells
if price_df is not None and not price_df['price'].isna().all():
    # Validate data
    validation_report = validate_data(price_df)
    print("\n" + "="*80)
    print("DATA VALIDATION REPORT")
    print("="*80)
    for key, value in validation_report.items():
        if key != 'outliers':
            print(f"{key}: {value}")
    
    # Display first few rows
    print("\nFirst 5 rows:")
    print(price_df.head())
    print("\nLast 5 rows:")
    print(price_df.tail())
    print(f"\nData shape: {price_df.shape}")
    print(f"Missing values: {price_df.isnull().sum().sum()}")

## 4. Time Series Properties Analysis

Now we perform comprehensive analysis of time series properties: **Trend Analysis**, **Stationarity Testing**, and **Volatility Analysis**. These analyses inform our modeling choices for change point detection.

### 4.1 Trend Analysis

**Purpose**: Identify long-term directional movements in oil prices to understand if the series has deterministic trends, stochastic trends, or trend breaks.

**Methods Applied**:
- Moving averages (30-day and 60-day windows)
- Linear trend fitting
- Time series decomposition (trend, seasonal, residual components)

**Modeling Implications**:
- If **trend-stationary**: Include deterministic trend in change point model
- If **difference-stationary**: Use differencing or include stochastic trend
- If **trend breaks exist**: Model requires change point detection in trend component
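
The moving-average and linear-trend methods listed above can be sketched as follows. This is not the project's `trend_analysis` function. The helper name and return keys are assumptions, and the series is assumed to contain no missing values.

```python
import numpy as np
import pandas as pd

def trend_summary(prices: pd.Series, windows=(30, 60)) -> dict:
    """Compute moving averages and an ordinary least-squares linear trend."""
    out = {f"ma_{w}": prices.rolling(window=w, min_periods=w).mean() for w in windows}
    values = prices.to_numpy(dtype=float)
    t = np.arange(len(values), dtype=float)  # time index in observations
    slope, intercept = np.polyfit(t, values, deg=1)
    fitted = slope * t + intercept
    ss_res = ((values - fitted) ** 2).sum()
    ss_tot = ((values - values.mean()) ** 2).sum()
    out["slope"] = float(slope)
    out["intercept"] = float(intercept)
    out["r_squared"] = float(1.0 - ss_res / ss_tot)
    return out
```

The slope here is expressed per observation, which equals per calendar day only if the series has one row per day.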

In [None]:
# Perform Trend Analysis
if price_df is not None and not price_df['price'].isna().all():
    # Preprocess: handle missing values
    price_df_clean = preprocess_data(price_df, handle_missing='forward_fill')
    
    # Perform trend analysis
    print("="*80)
    print("TREND ANALYSIS")
    print("="*80)
    
    trend_results = trend_analysis(price_df_clean, window=30)
    
    # Extract results for price column
    price_col = price_df_clean.columns[0]
    
    # Display linear trend results
    if f"{price_col}_linear_trend" in trend_results:
        linear_trend = trend_results[f"{price_col}_linear_trend"]
        print(f"\nLinear Trend Analysis for {price_col}:")
        print(f"  Slope: {linear_trend['slope']:.4f} (price units per day)")
        print(f"  Intercept: {linear_trend['intercept']:.2f}")
        print(f"  R-squared: {linear_trend['r_squared']:.4f}")
        print(f"  P-value: {linear_trend['p_value']:.2e}")
        
        if linear_trend['p_value'] < 0.05:
            print("  → Significant linear trend detected")
        else:
            print("  → No significant linear trend")
    
    # Create visualization
    fig, axes = plt.subplots(2, 1, figsize=(14, 10))
    
    # Plot 1: Price with moving averages
    axes[0].plot(price_df_clean.index, price_df_clean[price_col], 
                 label='Brent Crude Price', alpha=0.6, linewidth=1)
    if f"{price_col}_ma_30" in trend_results:
        ma_30 = trend_results[f"{price_col}_ma_30"]
        axes[0].plot(price_df_clean.index, ma_30, 
                    label='30-day MA', linewidth=2)
    if f"{price_col}_ma_60" in trend_results:
        ma_60 = trend_results[f"{price_col}_ma_60"]
        axes[0].plot(price_df_clean.index, ma_60, 
                    label='60-day MA', linewidth=2)
    
    axes[0].set_title('Brent Crude Oil Price with Moving Averages', 
                      fontsize=14, fontweight='bold')
    axes[0].set_xlabel('Date')
    axes[0].set_ylabel('Price (USD/barrel)')
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)
    
    # Plot 2: Decomposed trend (if available)
    if f"{price_col}_trend" in trend_results:
        trend_component = trend_results[f"{price_col}_trend"]
        axes[1].plot(price_df_clean.index, price_df_clean[price_col], 
                    label='Original', alpha=0.3)
        axes[1].plot(trend_component.index, trend_component.values, 
                    label='Trend Component', linewidth=2, color='red')
        axes[1].set_title('Time Series Decomposition - Trend Component', 
                         fontsize=14, fontweight='bold')
        axes[1].set_xlabel('Date')
        axes[1].set_ylabel('Price (USD/barrel)')
        axes[1].legend()
        axes[1].grid(True, alpha=0.3)
    else:
        axes[1].text(0.5, 0.5, 'Decomposition not available\n(insufficient data or period)', 
                    ha='center', va='center', transform=axes[1].transAxes)
        axes[1].set_title('Time Series Decomposition - Trend Component', 
                         fontsize=14, fontweight='bold')
    
    plt.tight_layout()
    Path('../reports/figures').mkdir(parents=True, exist_ok=True)  # ensure output folder exists
    plt.savefig('../reports/figures/task1_trend_analysis.png', dpi=300, bbox_inches='tight')
    plt.show()
    
    print("\n✓ Trend analysis completed")
else:
    print("⚠ Cannot perform trend analysis: No price data available")
    print("Please load price data first.")

### 4.2 Stationarity Testing

**Purpose**: Determine if statistical properties (mean, variance) are constant over time. This is crucial for change point detection because non-stationary data may require differencing or different modeling approaches.

**Tests Applied**:
- **Augmented Dickey-Fuller (ADF) Test**: Tests null hypothesis of unit root (non-stationary)
- **Kwiatkowski-Phillips-Schmidt-Shin (KPSS) Test**: Tests null hypothesis of stationarity

**Modeling Implications**:
- If **non-stationary**: May require differencing or cointegration analysis before change point detection
- If **stationary**: Can be modeled directly with change point models
- **Structural breaks** can make a series appear non-stationary to these tests, and those breaks are exactly what change point detection aims to find
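
Because ADF and KPSS have opposite null hypotheses, their results have to be read together. The function below is one plausible way to combine the two p-values. It is an assumption about how a combined conclusion could be derived, not a description of the project's `test_stationarity` internals, and the `"Stationary"` label is hypothetical.

```python
def combine_adf_kpss(adf_p_value: float, kpss_p_value: float, alpha: float = 0.05) -> str:
    """Combine ADF and KPSS p-values into a single stationarity conclusion."""
    # ADF null: unit root. Rejecting it is evidence FOR stationarity.
    adf_rejects_unit_root = adf_p_value < alpha
    # KPSS null: stationarity. Rejecting it is evidence AGAINST stationarity.
    kpss_rejects_stationarity = kpss_p_value < alpha

    if adf_rejects_unit_root and not kpss_rejects_stationarity:
        return "Stationary"
    if not adf_rejects_unit_root and kpss_rejects_stationarity:
        return "Non-stationary"
    # Either both tests reject or neither does, so the evidence disagrees
    return "Inconclusive - conflicting results"
```

For example, `combine_adf_kpss(0.60, 0.01)` returns `"Non-stationary"`, the pattern typically expected for raw price levels.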

In [None]:
# Perform Stationarity Testing
if price_df is not None and not price_df['price'].isna().all():
    price_df_clean = preprocess_data(price_df, handle_missing='forward_fill')
    price_col = price_df_clean.columns[0]
    
    print("="*80)
    print("STATIONARITY TESTING")
    print("="*80)
    
    # Test original series
    print(f"\nTesting stationarity of {price_col} (original series):")
    stationarity_results = test_stationarity(price_df_clean[price_col], alpha=0.05)
    
    # Display ADF test results
    if stationarity_results['adf']:
        adf = stationarity_results['adf']
        print(f"\nAugmented Dickey-Fuller (ADF) Test:")
        print(f"  Test Statistic: {adf['test_statistic']:.4f}")
        print(f"  P-value: {adf['p_value']:.4f}")
        print(f"  Critical Values:")
        for level, value in adf['critical_values'].items():
            print(f"    {level}: {value:.4f}")
        print(f"  Is Stationary: {adf['is_stationary']}")
        if adf['is_stationary']:
            print("  → Series is STATIONARY (reject null hypothesis of unit root)")
        else:
            print("  → Series is NON-STATIONARY (fail to reject null hypothesis)")
    
    # Display KPSS test results
    if stationarity_results['kpss']:
        kpss = stationarity_results['kpss']
        print(f"\nKwiatkowski-Phillips-Schmidt-Shin (KPSS) Test:")
        print(f"  Test Statistic: {kpss['test_statistic']:.4f}")
        print(f"  P-value: {kpss['p_value']:.4f}")
        print(f"  Critical Values:")
        for level, value in kpss['critical_values'].items():
            print(f"    {level}: {value:.4f}")
        print(f"  Is Stationary: {kpss['is_stationary']}")
        if kpss['is_stationary']:
            print("  → Series is STATIONARY (fail to reject null hypothesis)")
        else:
            print("  → Series is NON-STATIONARY (reject null hypothesis)")
    
    # Combined conclusion
    print(f"\n{'='*80}")
    print(f"CONCLUSION: {stationarity_results['conclusion']}")
    print(f"{'='*80}")
    
    # Test first difference if original is non-stationary
    if stationarity_results['conclusion'] in ['Non-stationary', 'Inconclusive - conflicting results']:
        print("\n" + "="*80)
        print("Testing First Difference (to check if series is difference-stationary):")
        print("="*80)
        
        price_diff = price_df_clean[price_col].diff().dropna()
        diff_results = test_stationarity(price_diff, alpha=0.05)
        
        if diff_results['adf']:
            adf_diff = diff_results['adf']
            print(f"\nADF Test on First Difference:")
            print(f"  Test Statistic: {adf_diff['test_statistic']:.4f}")
            print(f"  P-value: {adf_diff['p_value']:.4f}")
            print(f"  Is Stationary: {adf_diff['is_stationary']}")
            if adf_diff['is_stationary']:
                print("  → First difference is STATIONARY")
                print("  → Original series is DIFFERENCE-STATIONARY (I(1))")
                print("  → Modeling implication: Use differenced series or include stochastic trend")
        
        # Visualize original vs differenced
        fig, axes = plt.subplots(2, 1, figsize=(14, 8))
        
        axes[0].plot(price_df_clean.index, price_df_clean[price_col])
        axes[0].set_title('Original Series (Brent Crude Price)', fontsize=12, fontweight='bold')
        axes[0].set_ylabel('Price (USD/barrel)')
        axes[0].grid(True, alpha=0.3)
        
        axes[1].plot(price_diff.index, price_diff.values)
        axes[1].axhline(y=0, color='r', linestyle='--', alpha=0.5)
        axes[1].set_title('First Difference', fontsize=12, fontweight='bold')
        axes[1].set_xlabel('Date')
        axes[1].set_ylabel('Price Change')
        axes[1].grid(True, alpha=0.3)
        
        plt.tight_layout()
        Path('../reports/figures').mkdir(parents=True, exist_ok=True)  # ensure output folder exists
        plt.savefig('../reports/figures/task1_stationarity_analysis.png', dpi=300, bbox_inches='tight')
        plt.show()
    
    print("\n✓ Stationarity testing completed")
else:
    print("⚠ Cannot perform stationarity testing: No price data available")

### 4.3 Volatility Analysis

**Purpose**: Understand how price variability changes over time. Volatility clustering and regime changes are important for change point detection.

**Methods Applied**:
- Rolling volatility (30-day window)
- Annualized volatility calculations
- ARCH effects test (Ljung-Box test on squared returns) to detect volatility clustering

**Modeling Implications**:
- If **constant volatility**: Simple variance parameter in change point model
- If **time-varying volatility**: Include volatility change points in model
- If **volatility clustering detected**: May require GARCH-type models or volatility regime detection
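
Rolling and annualized volatility can be sketched as below. This is not the project's `volatility_analysis` function. The use of log returns and the 252-trading-day annualization factor are assumptions for the example, and simple percentage returns would work similarly.

```python
import numpy as np
import pandas as pd

def rolling_volatility(prices: pd.Series, window: int = 30, trading_days: int = 252) -> pd.DataFrame:
    """Rolling standard deviation of daily log returns, plus an annualized version."""
    log_returns = np.log(prices).diff()
    daily_vol = log_returns.rolling(window=window).std()
    return pd.DataFrame({
        "log_return": log_returns,
        "daily_vol": daily_vol,
        # Scale by sqrt(time), assuming roughly uncorrelated daily returns
        "annualized_vol": daily_vol * np.sqrt(trading_days),
    })
```

The first `window` rows of `daily_vol` are `NaN`, because the first return is undefined and a full window of returns is needed before the standard deviation is reported.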

In [None]:
# Perform Volatility Analysis
if price_df is not None and not price_df['price'].isna().all():
    price_df_clean = preprocess_data(price_df, handle_missing='forward_fill')
    price_col = price_df_clean.columns[0]
    
    print("="*80)
    print("VOLATILITY ANALYSIS")
    print("="*80)
    
    volatility_results = volatility_analysis(price_df_clean, window=30)
    
    # Display volatility statistics
    if f"{price_col}_volatility_stats" in volatility_results:
        vol_stats = volatility_results[f"{price_col}_volatility_stats"]
        print(f"\nVolatility Statistics for {price_col}:")
        print(f"  Mean Daily Volatility: {vol_stats['mean_volatility']:.4f}")
        print(f"  Annualized Volatility: {vol_stats['annualized_volatility']:.2%}")
        print(f"  Maximum Rolling Volatility: {vol_stats['max_volatility']:.4f}")
        print(f"  Minimum Rolling Volatility: {vol_stats['min_volatility']:.4f}")
        print(f"  Volatility of Volatility: {vol_stats['volatility_of_volatility']:.4f}")
    
    # ARCH effects test
    if f"{price_col}_arch_test" in volatility_results and volatility_results[f"{price_col}_arch_test"]:
        arch_test = volatility_results[f"{price_col}_arch_test"]
        print(f"\nARCH Effects Test (Volatility Clustering):")
        print(f"  Ljung-Box Statistic: {arch_test['ljung_box_statistic']:.4f}")
        print(f"  P-value: {arch_test['p_value']:.4f}")
        print(f"  Has ARCH Effects: {arch_test['has_arch_effects']}")
        if arch_test['has_arch_effects']:
            print("  → Volatility clustering detected (ARCH effects present)")
            print("  → Modeling implication: Consider GARCH models or volatility change points")
        else:
            print("  → No significant volatility clustering detected")
    
    # Create visualizations
    fig, axes = plt.subplots(3, 1, figsize=(14, 12))
    
    # Plot 1: Price series
    axes[0].plot(price_df_clean.index, price_df_clean[price_col], linewidth=1, alpha=0.7)
    axes[0].set_title('Brent Crude Oil Price', fontsize=12, fontweight='bold')
    axes[0].set_ylabel('Price (USD/barrel)')
    axes[0].grid(True, alpha=0.3)
    
    # Plot 2: Returns
    returns = price_df_clean[price_col].pct_change().dropna()
    axes[1].plot(returns.index, returns.values, linewidth=0.5, alpha=0.7, color='green')
    axes[1].axhline(y=0, color='r', linestyle='--', alpha=0.5)
    axes[1].set_title('Daily Returns', fontsize=12, fontweight='bold')
    axes[1].set_ylabel('Return')
    axes[1].grid(True, alpha=0.3)
    
    # Plot 3: Rolling volatility
    if f"{price_col}_rolling_volatility" in volatility_results:
        rolling_vol = volatility_results[f"{price_col}_rolling_volatility"]
        axes[2].plot(rolling_vol.index, rolling_vol.values, linewidth=2, color='red', label='30-day Rolling Volatility')
        if f"{price_col}_rolling_volatility_annualized" in volatility_results:
            rolling_vol_ann = volatility_results[f"{price_col}_rolling_volatility_annualized"]
            ax2_twin = axes[2].twinx()
            ax2_twin.plot(rolling_vol_ann.index, rolling_vol_ann.values, 
                         linewidth=2, color='orange', alpha=0.7, 
                         label='Annualized Volatility')
            ax2_twin.set_ylabel('Annualized Volatility', color='orange')
            ax2_twin.tick_params(axis='y', labelcolor='orange')
        axes[2].set_title('Rolling Volatility (30-day window)', fontsize=12, fontweight='bold')
        axes[2].set_xlabel('Date')
        axes[2].set_ylabel('Daily Volatility', color='red')
        axes[2].tick_params(axis='y', labelcolor='red')
        axes[2].grid(True, alpha=0.3)
        axes[2].legend(loc='upper left')
    
    plt.tight_layout()
    Path('../reports/figures').mkdir(parents=True, exist_ok=True)  # ensure output folder exists
    plt.savefig('../reports/figures/task1_volatility_analysis.png', dpi=300, bbox_inches='tight')
    plt.show()
    
    # Volatility regime identification
    if f"{price_col}_rolling_volatility" in volatility_results:
        rolling_vol = volatility_results[f"{price_col}_rolling_volatility"]
        vol_mean = rolling_vol.mean()
        vol_std = rolling_vol.std()
        
        high_vol_threshold = vol_mean + vol_std
        low_vol_threshold = vol_mean - vol_std
        
        high_vol_periods = (rolling_vol > high_vol_threshold).sum()
        low_vol_periods = (rolling_vol < low_vol_threshold).sum()
        
        print(f"\nVolatility Regime Analysis:")
        print(f"  Mean Volatility: {vol_mean:.4f}")
        print(f"  Std of Volatility: {vol_std:.4f}")
        print(f"  High Volatility Periods (> mean + 1 std): {high_vol_periods} days")
        print(f"  Low Volatility Periods (< mean - 1 std): {low_vol_periods} days")
        if high_vol_periods > 0 or low_vol_periods > 0:
            print("  → Days outside the ±1 std band are consistent with time-varying volatility")
    
    print("\n✓ Volatility analysis completed")
else:
    print("⚠ Cannot perform volatility analysis: No price data available")

## 5. Change Point Models: Explanation and Purpose

### 5.1 What are Change Point Models?

Change point models are statistical methods designed to identify **structural breaks** in time series data - points where the underlying data-generating process changes. In the context of oil prices, these breaks represent fundamental shifts in market dynamics.

### 5.2 Purpose in Oil Price Analysis

Change point models help us identify when and where structural breaks occur in Brent crude oil prices. These breaks may be caused by:

1. **Supply Shocks**: 
   - OPEC production decisions (cuts/increases)
   - Pipeline disruptions
   - Geopolitical conflicts affecting production
   - Natural disasters

2. **Demand Shocks**:
   - Economic recessions (reduced demand)
   - Economic booms (increased demand)
   - Technological changes (e.g., electric vehicles)
   - Policy shifts (e.g., carbon taxes)

3. **Market Structure Changes**:
   - Financialization of oil markets
   - Regulatory changes
   - New trading mechanisms

4. **Regime Shifts**:
   - Transition from one market equilibrium to another
   - Long-term structural changes in supply/demand balance

### 5.3 How Change Point Models Work

Change point models identify breaks by:

1. **Statistical Detection**: Objectively identify break points without requiring prior knowledge of when events occurred
2. **Uncertainty Quantification**: Provide probability distributions for break locations (not just point estimates)
3. **Multiple Breaks**: Can detect multiple structural breaks simultaneously
4. **Parameter Estimation**: Estimate regime-specific parameters (mean, variance, trend) for each identified period
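
To illustrate what a probability distribution over break locations looks like, here is a small grid-based posterior for a single mean shift. It is a simplified sketch under strong assumptions: Gaussian noise with a known standard deviation `sigma`, flat priors on the two regime means, and a uniform prior on the break location. The function names are hypothetical, and the model fitted later in the project may differ.

```python
import numpy as np

def change_point_posterior(y, sigma, min_size=2):
    """Posterior over the break index tau, where y[:tau] and y[tau:] are the regimes."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    taus = np.arange(min_size, n - min_size + 1)
    log_post = np.empty(len(taus))
    for i, tau in enumerate(taus):
        left, right = y[:tau], y[tau:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        # Marginal likelihood after integrating out both means under flat priors
        log_post[i] = -sse / (2 * sigma ** 2) - 0.5 * np.log(len(left) * len(right))
    log_post -= log_post.max()  # stabilize before exponentiating
    post = np.exp(log_post)
    return taus, post / post.sum()

def credible_interval(taus, post, mass=0.95):
    """Equal-tailed credible interval for the break index."""
    cdf = np.cumsum(post)
    lo = taus[np.searchsorted(cdf, (1 - mass) / 2)]
    hi = taus[np.searchsorted(cdf, 1 - (1 - mass) / 2)]
    return int(lo), int(hi)
```

A sharp, single-peaked posterior indicates a well-located break, while a wide or multi-peaked one signals that the data cannot pin down the date precisely.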

### 5.4 Types of Change Points

1. **Mean Shifts**: Sudden changes in the average price level
2. **Variance Changes**: Changes in volatility/uncertainty
3. **Trend Breaks**: Changes in the direction or slope of trends
4. **Combined Changes**: Simultaneous changes in multiple parameters

### 5.5 Expected Outputs

When we run change point analysis, we expect to get:

1. **Change Point Dates**:
   - Most probable break dates
   - Uncertainty intervals (credible intervals showing range of possible break dates)
   - Posterior probability distributions

2. **Regime Parameters**:
   - Mean price levels for each regime
   - Volatility (variance) for each regime
   - Trend parameters (if applicable)
   - Transition probabilities (if using regime-switching models)

3. **Model Diagnostics**:
   - Convergence statistics (for Bayesian methods)
   - Model fit metrics
   - Posterior predictive checks
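
Once break dates are available, regime-specific parameters can be summarized by splitting the series at those dates. The sketch below assumes a `DatetimeIndex` and treats each break date as the first day of the new regime. The function name and output columns are illustrative.

```python
import pandas as pd

def regime_summary(prices: pd.Series, break_dates) -> pd.DataFrame:
    """Mean and standard deviation of prices within each regime."""
    breaks = sorted(pd.to_datetime(list(break_dates)))
    # Regime edges: series start, each break, and one day past the series end
    edges = [prices.index.min()] + breaks + [prices.index.max() + pd.Timedelta(days=1)]
    rows = []
    for start, end in zip(edges[:-1], edges[1:]):
        segment = prices[(prices.index >= start) & (prices.index < end)]
        rows.append({
            "start": segment.index.min(),
            "end": segment.index.max(),
            "n_obs": len(segment),
            "mean_price": segment.mean(),
            "std_price": segment.std(),
        })
    return pd.DataFrame(rows)
```

These are plain sample statistics per segment. A Bayesian model would additionally report uncertainty for each regime parameter.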

### 5.6 Limitations and Caveats

**Critical Limitation - Correlation vs. Causation**:
- Detected breaks may **correlate** with events but don't **prove causation**
- Multiple events may occur simultaneously
- Markets may anticipate events (prices change before official dates)
- Additional causal inference methods needed to establish causation

**Other Limitations**:
- Results depend on model assumptions and specifications
- Different models may yield different break points
- Temporal resolution limited by data frequency
- Testing for multiple breaks increases false positive risk
- Detected breaks are historical and may not be predictive

## 6. Assumptions and Limitations

### Key Assumptions

#### Data Assumptions
- **Data Quality**: Brent crude price data is accurate, complete, and free from systematic errors
- **Data Frequency**: Daily frequency is appropriate for detecting structural breaks
- **Market Representation**: Brent crude prices are representative of global oil market dynamics
- **Data Availability**: Sufficient historical data is available to identify meaningful patterns

#### Model Assumptions
- **Change Point Model**: Structural breaks can be adequately modeled using chosen change point detection methods
- **Parameter Stability**: Within each regime, parameters (mean, variance) are relatively stable
- **Independence**: Errors are independent across regimes (may not hold for financial time series)
- **Linearity**: Linear relationships within regimes (may not capture non-linear dynamics)
- **Prior Distributions**: In Bayesian models, prior distributions reasonably reflect prior knowledge

#### Event Data Assumptions
- **Event Dates**: Event dates accurately represent the true timing of market impact
- **Event Impact**: Events have immediate or near-immediate impact on prices (may not account for anticipation effects)
- **Event Completeness**: The compiled event list captures all major market-moving events
- **Event Classification**: Events can be meaningfully categorized (supply/demand/geopolitical)

#### Statistical Assumptions
- **Stationarity**: Within regimes, data is stationary or can be made stationary
- **Distribution**: Data follows specified distribution (e.g., normal, t-distribution)
- **Homoscedasticity**: Within regimes, constant variance (may not hold)
- **Sample Size**: Sufficient observations within each regime for reliable parameter estimation

### Critical Limitations

#### ⚠️ Correlation vs. Causation - MOST IMPORTANT LIMITATION

**This is the most critical limitation**: The analysis can identify **statistical correlations** between structural breaks and events, but **cannot prove causation**.

**Why this matters**:
- **Temporal Correlation ≠ Causation**: Just because a structural break occurs near an event date does not mean the event caused the break
- **Confounding Factors**: Multiple events may occur simultaneously, making it difficult to attribute breaks to specific causes
- **Anticipation Effects**: Markets may react before events occur (e.g., prices may change in anticipation of OPEC decisions)
- **Reverse Causation**: Price changes may influence events (e.g., high prices may trigger policy responses)
- **Omitted Variables**: Unobserved factors may drive both events and price changes

**What this means**:
- Detected breaks may be **associated** with events but not necessarily **caused** by them
- Additional evidence (e.g., event studies, causal inference methods) would be needed to establish causation
- Results should be interpreted as **suggestive** rather than **definitive**

#### Other Limitations

**Methodological Limitations**:
- **Model Specification**: Results depend on chosen model structure (number of change points, parameterization)
- **Model Uncertainty**: Different models may yield different break points
- **Overfitting Risk**: Complex models may detect spurious breaks
- **Multiple Testing**: Testing for multiple breaks increases false positive rates
- **Temporal Resolution**: Limited by data frequency (cannot detect breaks within a day if using daily data)

**Data Limitations**:
- **Missing Data**: Missing observations may affect break detection
- **Data Revisions**: Historical data may be revised, affecting results
- **Single Market**: Analysis focuses on Brent crude; other benchmarks may show different patterns
- **Time Period**: Results are limited to the available historical period

**Interpretation Limitations**:
- **Backward-Looking**: Analysis is historical; may not predict future breaks
- **Context-Dependent**: Results may not generalize to different time periods or markets
- **Event Attribution**: Difficult to attribute breaks to specific events without additional evidence

## 7. Communication Channels

### Primary Formats for Stakeholder Communication

1. **Executive Summary** (1-2 pages)
   - High-level overview for decision-makers
   - Focus on key change points, regime characteristics, and practical implications
   - Visual timeline of events and breaks
   - Risk assessment summary

2. **Technical Report**
   - Detailed methodology and model specifications
   - Statistical results and diagnostics
   - Model comparison and validation
   - Full results tables and figures

3. **Interactive Visualizations**
   - Time series plots with overlaid change points
   - Event timeline visualizations
   - Regime comparison charts
   - Interactive dashboards (Plotly, Streamlit)

4. **Jupyter Notebooks**
   - Reproducible analysis notebooks for technical audiences
   - Step-by-step methodology
   - Code and results together

5. **Presentations**
   - Slide decks for stakeholder meetings
   - Key findings and visualizations
   - Q&A preparation materials

### Key Messages to Communicate

- **Change Point Locations**: When structural breaks occurred (with uncertainty intervals)
- **Regime Characteristics**: How different periods differ (mean, volatility, trends)
- **Event Associations**: Which events coincide in time with detected breaks
- **Uncertainty Quantification**: Confidence levels in detected breaks
- **Practical Implications**: What this means for decision-making
- **Limitations**: Clear statement that correlations do not prove causation

### Stakeholder-Specific Considerations

- **Executives**: Focus on actionable insights and risk periods
- **Analysts**: Provide detailed methodology and statistical evidence
- **Risk Managers**: Emphasize uncertainty and limitations
- **Policy Makers**: Highlight correlation vs. causation distinction

## 5. Understanding the Model and Data

### 5.1 Time Series Properties Analysis

Before modeling, we need to investigate the Brent oil price data for key properties that will inform our modeling choices:

#### Trend Analysis
- **Purpose**: Identify long-term directional movements in oil prices
- **Methods**: Linear/non-linear trend fitting, unit root testing, structural trend breaks
- **Modeling Implications**: 
  - If trend-stationary: Include deterministic trend in model
  - If difference-stationary: Use differencing or include stochastic trend
  - If trend breaks exist: Model requires change point detection in trend component
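
As a rough illustration of the trend-fitting methods listed above, the following sketch fits a linear trend to log prices and computes a 30-day moving average. The series is synthetic and stands in for Brent prices; the drift and noise values are arbitrary assumptions chosen only to make the example run.

```python
import numpy as np

# Synthetic daily series standing in for Brent prices (not real data).
rng = np.random.default_rng(0)
n = 500
t = np.arange(n)
log_price = np.log(50.0) + 0.0005 * t + rng.normal(0.0, 0.02, size=n)

# Deterministic linear trend fitted to log prices by least squares.
# np.polyfit returns coefficients from highest degree to lowest.
slope, intercept = np.polyfit(t, log_price, deg=1)
fitted = intercept + slope * t
detrended = log_price - fitted

# 30-day moving average as a non-parametric trend estimate.
window = 30
moving_avg = np.convolve(log_price, np.ones(window) / window, mode="valid")

print(f"Estimated daily drift in log price: {slope:.5f}")
print(f"Residual std after detrending: {detrended.std():.4f}")
print(f"Moving average length: {len(moving_avg)} (of {n} observations)")
```

If the detrended residuals still wander without reverting to zero, a deterministic trend is probably the wrong description and a stochastic trend (differencing) is worth considering.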

#### Stationarity Testing
- **Purpose**: Determine if statistical properties (mean, variance) are constant over time
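- **Methods**: Augmented Dickey-Fuller (ADF), KPSS, and Phillips-Perron (PP) tests. ADF and PP take a unit root as the null hypothesis, whereas KPSS takes stationarity as the null, so the tests are best read together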
- **Modeling Implications**:
  - Non-stationary data may require differencing or cointegration analysis
  - Stationary data can be modeled directly
  - Structural breaks can cause apparent non-stationarity
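
To show what a unit root test actually computes, here is a minimal sketch of the basic Dickey-Fuller regression using only NumPy. It omits the lagged-difference terms that make the test "augmented", and the critical value is an approximate large-sample figure, so treat it as an illustration rather than a substitute for a library implementation such as `adfuller` and `kpss` in `statsmodels.tsa.stattools`. The function name `dickey_fuller_stat` is hypothetical.

```python
import numpy as np


def dickey_fuller_stat(y: np.ndarray) -> float:
    """t-statistic on rho in: diff(y)_t = alpha + rho * y_{t-1} + e_t."""
    dy = np.diff(y)
    X = np.column_stack([np.ones(len(dy)), y[:-1]])
    beta, *_ = np.linalg.lstsq(X, dy, rcond=None)
    resid = dy - X @ beta
    sigma2 = resid @ resid / (len(dy) - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return float(beta[1] / np.sqrt(cov[1, 1]))


rng = np.random.default_rng(1)
random_walk = np.cumsum(rng.normal(size=1000))  # has a unit root by construction
increments = np.diff(random_walk)               # stationary by construction

# Approximate large-sample 5% critical value (constant, no trend).
CRIT_5PCT = -2.86
for name, series in [("level", random_walk), ("first difference", increments)]:
    stat = dickey_fuller_stat(series)
    verdict = "reject unit root" if stat < CRIT_5PCT else "cannot reject unit root"
    print(f"{name}: DF statistic = {stat:.2f} -> {verdict}")
```

The statistic is compared against Dickey-Fuller critical values rather than the usual t-distribution, which is why a dedicated table or library routine is needed in practice.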

#### Volatility Patterns
- **Purpose**: Understand how price variability changes over time
- **Methods**: Rolling standard deviation, GARCH models, volatility clustering tests
- **Modeling Implications**:
  - Constant volatility: Simple variance parameter
  - Time-varying volatility: Include volatility change points
  - Volatility clustering: May require GARCH-type models
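
The rolling-volatility method can be sketched as follows. The returns are synthetic, with a calm and a turbulent regime placed back to back, and the annualization factor of 252 trading days is a common convention assumed here. The autocorrelation of squared returns is only a crude indicator of clustering; a formal check would be Engle's ARCH LM test (for example `het_arch` in `statsmodels.stats.diagnostic`).

```python
import numpy as np
import pandas as pd

# Synthetic returns with a calm and a turbulent regime (not real data).
rng = np.random.default_rng(2)
returns = pd.Series(
    np.concatenate([rng.normal(0, 0.01, 500), rng.normal(0, 0.03, 500)]),
    index=pd.bdate_range("2020-01-01", periods=1000),
)

# 30-day rolling standard deviation, annualized with 252 trading days.
rolling_vol = returns.rolling(window=30).std() * np.sqrt(252)

# Crude clustering indicator: lag-1 autocorrelation of squared returns.
sq_autocorr = (returns ** 2).autocorr(lag=1)

print(rolling_vol.dropna().iloc[[0, -1]])
print(f"Lag-1 autocorrelation of squared returns: {sq_autocorr:.3f}")
```

The first 29 values of `rolling_vol` are missing because a full 30-observation window is not yet available.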

### 5.2 Change Point Models

#### Purpose in Oil Price Analysis
Change point models identify structural breaks where the underlying data-generating process changes. In oil prices, these breaks may occur due to:
- **Supply Shocks**: OPEC production cuts/increases, pipeline disruptions, geopolitical conflicts
- **Demand Shocks**: Economic recessions, technological changes, policy shifts
- **Market Structure Changes**: Financialization of oil markets, regulatory changes
- **Regime Shifts**: Transition from one equilibrium to another

#### How They Help Identify Structural Breaks
- **Statistical Detection**: Locate break points from the data without specifying candidate dates in advance
- **Uncertainty Quantification**: Bayesian variants provide posterior probability distributions for break locations
- **Multiple Breaks**: Can detect multiple structural breaks simultaneously
- **Parameter Estimation**: Estimate regime-specific parameters (mean, variance, trend)
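
The simplest instance of statistical detection is a single shift in the mean, located by trying every admissible split and keeping the one with the smallest within-segment squared error. The sketch below does this on synthetic data; `best_single_break` and its `min_size` parameter are hypothetical names, and this is a least-squares illustration rather than the Bayesian model planned for later tasks.

```python
import numpy as np


def best_single_break(y: np.ndarray, min_size: int = 10) -> tuple:
    """Return (split index, SSE) minimizing total within-segment squared error."""
    n = len(y)
    csum = np.cumsum(y)
    csum2 = np.cumsum(y ** 2)
    best_k, best_sse = -1, np.inf
    for k in range(min_size, n - min_size + 1):
        # Segment 1 is y[:k], segment 2 is y[k:].
        s1, q1 = csum[k - 1], csum2[k - 1]
        s2, q2 = csum[-1] - s1, csum2[-1] - q1
        sse = (q1 - s1 ** 2 / k) + (q2 - s2 ** 2 / (n - k))
        if sse < best_sse:
            best_k, best_sse = k, sse
    return best_k, best_sse


# Synthetic series with a mean shift at index 300 (not real data).
rng = np.random.default_rng(3)
true_break = 300
y = np.concatenate([rng.normal(0.0, 1.0, true_break), rng.normal(2.0, 1.0, 200)])

k_hat, sse = best_single_break(y)
print(f"Estimated break index: {k_hat} (true: {true_break})")
print(f"Regime means: {y[:k_hat].mean():.2f} and {y[k_hat:].mean():.2f}")
```

Cumulative sums keep each candidate split at constant cost, so the full scan is linear in the series length.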

### 5.3 Expected Outputs

#### Change Point Analysis Outputs
1. **Change Point Dates**: 
   - Most probable break dates
   - Uncertainty intervals (credible intervals)
   - Posterior probability distributions

2. **Regime Parameters**:
   - Mean price levels for each regime
   - Volatility (variance) for each regime
   - Trend parameters (if applicable)
   - Transition probabilities (if using regime-switching models)

3. **Model Diagnostics**:
   - Convergence statistics
   - Model fit metrics
   - Posterior predictive checks
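
To make the first two output types concrete, here is a minimal sketch of a posterior over a single break location, computed by exact enumeration. It assumes Gaussian noise with a known standard deviation, flat priors on both regime means, and a uniform prior over the break index. Those assumptions, the function name `break_posterior`, and the synthetic data are all illustrative. Because nothing is sampled, there are no convergence statistics to report here; those apply to MCMC-based models.

```python
import numpy as np


def break_posterior(y: np.ndarray, sigma: float, min_size: int = 5) -> np.ndarray:
    """Posterior over the break index k (segments y[:k] and y[k:]).

    Assumes Gaussian noise with known sigma, flat priors on both segment
    means, and a uniform prior over admissible k.
    """
    n = len(y)
    log_post = np.full(n, -np.inf)
    for k in range(min_size, n - min_size + 1):
        left, right = y[:k], y[k:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        # Integrating out each flat-prior mean contributes a 1/sqrt(length) factor.
        log_post[k] = -sse / (2 * sigma ** 2) - 0.5 * np.log(k) - 0.5 * np.log(n - k)
    post = np.exp(log_post - log_post.max())
    return post / post.sum()


# Synthetic price-like series with a level shift at index 150 (not real data).
rng = np.random.default_rng(4)
y = np.concatenate([rng.normal(60.0, 2.0, 150), rng.normal(64.0, 2.0, 150)])

post = break_posterior(y, sigma=2.0)
k_map = int(post.argmax())
cdf = np.cumsum(post)
lo, hi = int(np.searchsorted(cdf, 0.025)), int(np.searchsorted(cdf, 0.975))

print(f"Most probable break index: {k_map}")
print(f"95% credible interval: [{lo}, {hi}]")
print(f"Regime means: {y[:k_map].mean():.2f}, {y[k_map:].mean():.2f}")
print(f"Regime std devs: {y[:k_map].std(ddof=1):.2f}, {y[k_map:].std(ddof=1):.2f}")
```

With a date index, the break index and interval bounds map directly to calendar dates.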

#### Limitations of Change Point Analysis
- **Correlation vs Causation**: Detected breaks may correlate with events but don't prove causation
- **Model Uncertainty**: Results depend on model assumptions and specifications
- **Temporal Resolution**: Limited by data frequency (daily vs monthly)
- **Multiple Hypotheses**: Testing for multiple breaks increases false positive risk
- **Post-Hoc Analysis**: Detected breaks may not be predictive
- **Event Attribution**: Difficult to attribute breaks to specific events without additional evidence
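
Given the attribution limitation, one cautious practice is to report the time gap between each detected break and its nearest documented event, and to leave a break unmatched when nothing falls within a tolerance window. The sketch below uses `pandas.merge_asof` for this. All dates and event names are placeholders, and the 30-day tolerance is an arbitrary assumption.

```python
import pandas as pd

# Placeholder dates for illustration only (neither real breaks nor real events).
change_points = pd.DataFrame({"date": pd.to_datetime(["2021-02-10", "2021-08-20"])})
events_demo = pd.DataFrame({
    "date": pd.to_datetime(["2021-02-01", "2021-05-15"]),
    "event_name": ["placeholder event A", "placeholder event B"],
})
# Keep a copy of the event date, because the join key column is shared.
events_demo["event_date"] = events_demo["date"]

# Nearest event within +/- 30 days of each break; both frames sorted by date.
matched = pd.merge_asof(
    change_points.sort_values("date"),
    events_demo.sort_values("date"),
    on="date",
    direction="nearest",
    tolerance=pd.Timedelta(days=30),
)
matched["gap_days"] = (matched["event_date"] - matched["date"]).dt.days
print(matched[["date", "event_name", "gap_days"]])
```

A small gap is a temporal association only. It says nothing about whether the event caused the break.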

## 6. Data Loading and Initial Exploration

This section outlines how the Brent crude oil price data will be loaded and validated. No price data is loaded in this notebook: the cell below only prints the intended workflow, and loading actual prices from your data source is left to the next task.

In [None]:
# Example: Data loading workflow
# In practice, replace this with actual data loading
# For demonstration, we'll create a note about data requirements

print("DATA LOADING WORKFLOW")
print("="*80)
print("""
To load Brent crude oil price data:

1. Option 1: Load from CSV file
   price_df = load_oil_price_data(
       file_path='data/raw/brent_prices.csv',
       start_date='2000-01-01',
       end_date='2024-12-31',
       frequency='D'
   )

2. Option 2: Fetch from online source (e.g., FRED, Yahoo Finance)
   - Implement API calls in data_loader.py
   - Use libraries like yfinance, pandas_datareader, or FRED API

3. Validate data quality
   validation_report = validate_data(price_df)
   print(validation_report)

4. Preprocess if needed
   price_df_clean = preprocess_data(price_df, handle_missing='forward_fill')
""")

# For demonstration, we'll note that actual data loading would happen here
print("\nNOTE: Actual price data should be loaded here for full analysis.")
print("The workflow functions are ready in src/data_loader.py")
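
For readers without access to `src/data_loader.py`, the validation step can be approximated as below. This is not the project's `validate_data` function; `quick_validation_report` is a hypothetical helper, the `date` and `price` column names are assumptions, and the inline CSV holds placeholder values chosen to trigger each check.

```python
import io

import pandas as pd

# Tiny inline CSV with placeholder values (not real Brent prices).
csv_text = """date,price
2024-01-01,80.0
2024-01-02,81.5
2024-01-03,
2024-01-04,79.0
2024-01-04,79.0
2024-01-05,-1.0
"""


def quick_validation_report(df: pd.DataFrame) -> dict:
    """Hypothetical helper: basic integrity checks on a date-indexed price frame."""
    return {
        "n_rows": len(df),
        "start": df.index.min(),
        "end": df.index.max(),
        "missing_prices": int(df["price"].isna().sum()),
        "duplicate_dates": int(df.index.duplicated().sum()),
        "non_positive_prices": int((df["price"] <= 0).sum()),
        "is_sorted": bool(df.index.is_monotonic_increasing),
    }


price_df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"], index_col="date")
report = quick_validation_report(price_df)
for key, value in report.items():
    print(f"{key}: {value}")
```

Replacing `io.StringIO(csv_text)` with a file path would apply the same checks to a real CSV export.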

## 7. Exploratory Data Analysis (EDA) Workflow

Once data is loaded, we perform comprehensive EDA to understand time series properties. The following code demonstrates the workflow:

In [None]:
# EDA Workflow Example
# This demonstrates the workflow - replace with actual data when available

print("EXPLORATORY DATA ANALYSIS WORKFLOW")
print("="*80)
print("""
When price data is loaded, perform the following analyses:

# 1. Descriptive Statistics
stats = descriptive_statistics(price_df)
print(stats)

# 2. Trend Analysis
trend_results = trend_analysis(price_df, window=30)
# Access results:
# - trend_results['price_ma_30']: 30-day moving average
# - trend_results['price_linear_trend']: Linear trend parameters
# - trend_results['price_trend']: Decomposed trend component

# 3. Stationarity Testing
for col in price_df.select_dtypes(include=[np.number]).columns:
    stationarity_results = test_stationarity(price_df[col])
    print(f"\\n{col} Stationarity Test Results:")
    print(f"ADF Test: {stationarity_results['adf']}")
    print(f"KPSS Test: {stationarity_results['kpss']}")
    print(f"Conclusion: {stationarity_results['conclusion']}")

# 4. Volatility Analysis
volatility_results = volatility_analysis(price_df, window=30)
# Access results:
# - volatility_results['price_rolling_volatility']: Rolling volatility
# - volatility_results['price_volatility_stats']: Volatility statistics
# - volatility_results['price_arch_test']: ARCH effects test

# 5. Autocorrelation Analysis
for col in price_df.select_dtypes(include=[np.number]).columns:
    acf_vals, pacf_vals = autocorrelation_analysis(price_df[col], lags=40)
    # Plot ACF and PACF to identify AR/MA components
""")

print("\nNOTE: Run these analyses when actual price data is loaded.")
print("All functions are available in src/eda.py")
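
The autocorrelation step in the workflow above can be illustrated without the project helpers. The sketch below computes a sample ACF with NumPy on a synthetic AR(1) series; `sample_acf` is a hypothetical function, not the `autocorrelation_analysis` helper from `src/eda.py`, and the AR coefficient of 0.8 is an arbitrary choice.

```python
import numpy as np


def sample_acf(x: np.ndarray, max_lag: int) -> np.ndarray:
    """Sample autocorrelation for lags 0..max_lag (biased estimator)."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = x @ x
    return np.array([
        1.0 if k == 0 else (x[:-k] @ x[k:]) / denom
        for k in range(max_lag + 1)
    ])


# Synthetic AR(1) process with coefficient 0.8, built recursively (not real data).
rng = np.random.default_rng(5)
noise = rng.normal(size=2000)
ar1 = np.zeros_like(noise)
for t in range(1, len(noise)):
    ar1[t] = 0.8 * ar1[t - 1] + noise[t]

acf = sample_acf(ar1, max_lag=5)
band = 1.96 / np.sqrt(len(ar1))  # approximate 95% band under white noise
print("ACF lags 0-5:", np.round(acf, 2))
print(f"White-noise band: +/-{band:.3f}")
```

Autocorrelations that decay geometrically and stay well outside the band, as here, point to an autoregressive component.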

## 8. Event Data Integration Workflow

Integrate the compiled event data with price data to examine how prices moved around each event date:

In [None]:
# Event Integration Workflow
print("EVENT DATA INTEGRATION WORKFLOW")
print("="*80)
print("""
When price data is loaded, integrate events as follows:

# 1. Align events with prices
events_with_prices = align_events_with_prices(
    price_df=price_df,
    event_df=events_df,
    price_column='price',  # Adjust to your column name
    event_date_column='event_date',
    window_days=30  # Analyze 30 days before/after each event
)

# 2. Categorize events
event_categories = categorize_events(events_with_prices)
print("Event categories:", list(event_categories.keys()))

# 3. Calculate impact statistics
impact_stats = calculate_event_impact_statistics(events_with_prices)
print("\\nEvent Impact Statistics:")
print(impact_stats)

# 4. Visualize events on price timeline
# (Create plots showing price movements around event dates)
""")

print("\nEvent data is ready for integration when price data is loaded.")
print("All functions are available in src/event_integration.py")
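
A before/after window comparison of the kind described above might look like the following. This is a sketch on synthetic prices with placeholder events, not the `align_events_with_prices` implementation from `src/event_integration.py`. The helper `window_change` and the frame `events_demo` are hypothetical names, and the result is descriptive only.

```python
import numpy as np
import pandas as pd

# Synthetic business-day prices and placeholder events (not real data).
rng = np.random.default_rng(6)
dates = pd.bdate_range("2022-01-03", periods=250)
price = pd.Series(80 + np.cumsum(rng.normal(0, 1, len(dates))), index=dates, name="price")
events_demo = pd.DataFrame({
    "event_date": pd.to_datetime(["2022-03-15", "2022-09-01"]),
    "event_name": ["placeholder event A", "placeholder event B"],
})


def window_change(price: pd.Series, event_date: pd.Timestamp, window_days: int = 30) -> dict:
    """Mean price before and after an event, and the percentage change between them."""
    delta = pd.Timedelta(days=window_days)
    before = price.loc[event_date - delta : event_date - pd.Timedelta(days=1)]
    after = price.loc[event_date : event_date + delta]
    pct = (after.mean() / before.mean() - 1.0) * 100.0
    return {"mean_before": before.mean(), "mean_after": after.mean(), "pct_change": pct}


rows = [window_change(price, d) for d in events_demo["event_date"]]
impact = pd.concat([events_demo, pd.DataFrame(rows)], axis=1)
print(impact.round(2))
```

Overlapping windows from events that fall close together will share observations, which is one reason these figures should not be read as isolated event effects.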

## 9. Summary: Task 1 Deliverables

### ✅ Completed Deliverables

1. **Data Analysis Workflow** (Section 1)
   - Documented 7-step workflow from data loading to insight generation
   - Clear process for each analysis stage

2. **Event Data Table** (Section 2)
   - Compiled 22 major oil market events (2001-2023)
   - Structured CSV file with dates, types, descriptions, impact types, and severity
   - Exceeds minimum requirement of 10-15 events

3. **Assumptions and Limitations** (Section 3)
   - Comprehensive documentation of all assumptions
   - **Critical discussion on correlation vs. causation** - the most important limitation
   - Methodological, data, and interpretation limitations

4. **Communication Channels** (Section 4)
   - Identified 5 primary formats for stakeholder communication
   - Key messages to communicate
   - Stakeholder-specific considerations

5. **Understanding Model and Data** (Section 5)
   - Time series properties analysis framework (trend, stationarity, volatility)
   - Change point model explanation and purpose
   - Expected outputs and limitations

6. **Implementation Code**
   - Data loading functions (`src/data_loader.py`)
   - EDA functions (`src/eda.py`)
   - Event integration functions (`src/event_integration.py`)
   - Workflow demonstrated in this notebook

### Next Steps (Task 2+)

- Load actual Brent crude oil price data
- Perform full EDA analysis
- Implement change point detection models
- Compare detected breaks with event dates
- Generate visualizations and reports

## 10. Event Data Table Summary

Below is a summary visualization of the compiled events:

In [None]:
# Create summary visualizations of event data
# (assumes `plt` and `events_df` were defined in earlier cells)
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# 1. Events by type
event_counts = events_df['event_type'].value_counts()
axes[0, 0].bar(event_counts.index, event_counts.values, color='steelblue')
axes[0, 0].set_title('Number of Events by Type', fontsize=12, fontweight='bold')
axes[0, 0].set_xlabel('Event Type')
axes[0, 0].set_ylabel('Count')
axes[0, 0].tick_params(axis='x', rotation=45)

# 2. Events by impact type
impact_counts = events_df['impact_type'].value_counts()
axes[0, 1].bar(impact_counts.index, impact_counts.values, color='coral')
axes[0, 1].set_title('Number of Events by Impact Type', fontsize=12, fontweight='bold')
axes[0, 1].set_xlabel('Impact Type')
axes[0, 1].set_ylabel('Count')
axes[0, 1].tick_params(axis='x', rotation=45)

# 3. Events by severity
severity_counts = events_df['severity'].value_counts()
axes[1, 0].bar(severity_counts.index, severity_counts.values, color='mediumseagreen')
axes[1, 0].set_title('Number of Events by Severity', fontsize=12, fontweight='bold')
axes[1, 0].set_xlabel('Severity')
axes[1, 0].set_ylabel('Count')

# 4. Timeline of events
events_df_sorted = events_df.sort_values('event_date')
axes[1, 1].scatter(events_df_sorted['event_date'], 
                   range(len(events_df_sorted)), 
                   s=100, alpha=0.6, c='red')
axes[1, 1].set_title('Timeline of Events (2001-2023)', fontsize=12, fontweight='bold')
axes[1, 1].set_xlabel('Event Date')
axes[1, 1].set_ylabel('Event Index')
axes[1, 1].tick_params(axis='x', rotation=45)

plt.tight_layout()
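# Create the output directory if needed so savefig does not fail
import os
os.makedirs('../reports/figures', exist_ok=True)
plt.savefig('../reports/figures/task1_event_summary.png', dpi=300, bbox_inches='tight')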
plt.show()

print("\nEvent Summary Statistics:")
print(f"Total events: {len(events_df)}")
print(f"Date range: {events_df['event_date'].min()} to {events_df['event_date'].max()}")
print(f"\nEvents by type:\n{event_counts}")
print(f"\nEvents by impact:\n{impact_counts}")
print(f"\nEvents by severity:\n{severity_counts}")