# Statistical Tests for Time Series Analysis on NIFTY50 5 min OHLC Data 📈

## 1. Augmented Dickey-Fuller (ADF) Test 🎯

### Purpose
> Tests for stationarity in time series data (Critical for prediction)

### Why Important
* **Model Prerequisites**
  - Essential for ARIMA modeling
  - Foundation for reliable predictions

### Interpretation
| Result | Meaning | Action |
|--------|---------|--------|
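| p < 0.05 | Unit-root null rejected; series is treated as stationary | Ready for modeling |
| p ≥ 0.05 | Unit root cannot be rejected; series is treated as non-stationary | Difference and retest |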

### Impact on Models
* **ARIMA**: Informs the 'd' parameter (the number of differences needed before the test rejects a unit root)
* **Transformers**: Guides data preprocessing strategy

---

## 2. ACF (Autocorrelation Function) 📊

### Purpose
> Reveals relationship between current and past values

### Key Insights
* **Pattern Detection**
  - Series memory depth
  - Seasonal patterns
  - Trend strength

### Pattern Interpretation
* **Slow Decay**: Strong trend or unit root; difference the series before reading off orders
* **Periodic Spikes**: Seasonal patterns exist
* **Quick Decay**: Short memory; autocorrelation dies out within a few lags

### Model Implications
* **ARIMA**: Helps set 'q' (MA) parameter
* **Transformers**: Guides sequence length choice
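
To make the pattern reading concrete, here is a minimal NumPy-only sketch of the sample ACF with the usual ±1.96/√n significance band. The AR(1) series is synthetic, and `sample_acf` and `simulate_ar1` are hypothetical helpers, not functions from this notebook.

```python
import numpy as np

def sample_acf(x, nlags):
    """Sample autocorrelation for lags 0..nlags (biased estimator, common denominator)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    lags = [np.dot(x[:-k], x[k:]) / denom for k in range(1, nlags + 1)]
    return np.array([1.0] + lags)

def simulate_ar1(phi, n, seed=0):
    """Synthetic AR(1) series x[t] = phi * x[t-1] + noise (illustration only)."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n)
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + eps[t]
    return x

x = simulate_ar1(phi=0.8, n=5000)
acf_vals = sample_acf(x, nlags=20)

# Approximate 95% band under the white-noise null
band = 1.96 / np.sqrt(len(x))
significant_lags = [k for k in range(1, 21) if abs(acf_vals[k]) > band]
print("lag-1 ACF:", round(acf_vals[1], 3))
print("lags outside the band:", significant_lags)
```

The last lag that clears the band gives one rough upper bound for a lookback or sequence length; it is a heuristic, not a rule.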

---

## 3. PACF (Partial Autocorrelation Function) 🔍

### Purpose
> Shows direct relationships between time points

### Key Benefits
* **Direct Dependencies**
  - Pure lag relationships
  - Optimal lookback period
  - AR order identification

### Pattern Reading
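* **Significant Spikes**: Direct relationships at those lags
* **Cut-off Point**: Suggests AR order (PACF falls inside the confidence band after lag p)
* **Decay Pattern**: Gradual decay instead of a cut-off points to an MA component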

### Model Applications
* **ARIMA**: Sets 'p' (AR) parameter
* **Transformers**: Guides attention window
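
The "cut-off" idea can be sketched without statsmodels: the PACF at lag k is the last coefficient when the series is regressed on its first k lags. The example below uses a synthetic AR(2) process, so the PACF should be near zero from lag 3 onward. `pacf_ols` is a hypothetical helper written for this illustration.

```python
import numpy as np

def pacf_ols(x, max_lag):
    """PACF for lags 0..max_lag via successive least-squares AR(k) regressions."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    out = [1.0]
    for k in range(1, max_lag + 1):
        y = x[k:]
        # Column j holds the series lagged by j steps, aligned with y
        X = np.column_stack([x[k - j : n - j] for j in range(1, k + 1)])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        out.append(coef[-1])   # coefficient on the longest lag is the PACF at lag k
    return np.array(out)

# Synthetic AR(2): x[t] = 0.5 x[t-1] + 0.3 x[t-2] + noise (illustration only)
rng = np.random.default_rng(0)
n = 20000
eps = rng.standard_normal(n)
x = np.zeros(n)
for t in range(2, n):
    x[t] = 0.5 * x[t - 1] + 0.3 * x[t - 2] + eps[t]

pacf_vals = pacf_ols(x, max_lag=6)
band = 1.96 / np.sqrt(n)
print(np.round(pacf_vals, 3))
print("band:", round(band, 4))
```

With real 5-minute returns the spikes are typically much smaller than in this synthetic case, so the band matters more than the visual shape.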

---

## 4. Ljung-Box Test ⚖️

### Purpose
> Validates residual randomness

### Importance
* **Model Validation**
  - Tests assumptions
  - Checks pattern capture
  - Quality assurance

### Results Guide
| p-value | Interpretation | Next Steps |
|---------|----------------|------------|
| > 0.05 | No significant autocorrelation left in residuals | Proceed with model |
| < 0.05 | Residuals still autocorrelated; patterns missed | Model refinement needed |

---

## Trading Strategy Implementation 💹

### Data Properties Analysis
* **Pattern Understanding**
  - Price movement characteristics
  - Key time intervals
  - Preprocessing requirements

### Model Selection Framework
| Pattern | Suggested Model | Why |
|---------|----------------|-----|
| Strong seasonality | Seasonal ARIMA (SARIMA) | Plain ARIMA has no seasonal terms; SARIMA models regular patterns |
| Complex patterns | Transformers | Better with non-linear relationships |
| High randomness | Feature enrichment | Need more predictive signals |

### Parameter Optimization
1. **ARIMA Parameters**
   - p: From PACF analysis
   - d: From ADF test
   - q: From ACF analysis

2. **Transformer Settings**
   - Sequence length
   - Attention window
   - Feature engineering scope

---

## 🎯 Goal Achievement (0.6% Return Prediction)

### Framework Application
1. **Use Tests to:**
   - Validate predictability
   - Identify key patterns
   - Set model parameters

2. **Choose Model Based on:**
   - Pattern complexity
   - Seasonality strength
   - Data stationarity

3. **Optimize Using:**
   - Statistical significance
   - Pattern identification
   - Model validation

In [2]:
# %pip installs into the environment of the running kernel
%pip install mlflow
import pandas as pd
import numpy as np
from statsmodels.tsa.stattools import adfuller, kpss, acf, pacf
from statsmodels.stats.diagnostic import acorr_ljungbox
import matplotlib.pyplot as plt
import seaborn as sns
import mlflow



In [3]:
# Input file: 1-minute OHLC data, 2014 to 2024
# Six slices of 5-minute OHLC bars: the last 6 months and the 6 months before that,
# then the same pairing for 1-year and 2-year windows.

In [21]:
# Data Extraction
def resample_and_get_periods(csv_path, interval='5min'):
   """
   Load 1-minute data, resample to OHLC bars, run quality checks, and split into periods
   
   Parameters:
   - csv_path: path to CSV file
   - interval: pandas resampling frequency (default '5min'; the older '5T' alias is deprecated in recent pandas)
   
   Returns: Dictionary mapping period name to {'data': resampled DataFrame, 'metadata': dict}
   """
   # Load data
   def load_and_clean(csv_path):
       df = pd.read_csv(csv_path)
       df['date'] = pd.to_datetime(df['date'])
       return df.set_index('date').sort_index()
   
   # Quality checks
   def quality_checks(df, period_name):
       issues = []
       
       # Check for missing values
       missing = df.isnull().sum()
       if missing.any():
           issues.append(f"Missing values found: {missing[missing > 0]}")
           
       # Check for duplicates
       duplicates = df.index.duplicated().sum()
       if duplicates:
           issues.append(f"Found {duplicates} duplicate timestamps")
           
       # Check for price anomalies
       price_std = df['close'].std()
       price_mean = df['close'].mean()
       outliers = df[abs(df['close'] - price_mean) > 3 * price_std]
       if not outliers.empty:
           issues.append(f"Found {len(outliers)} potential price outliers")
           
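       # Check for gaps in time series
       # Note: the break between trading sessions (overnight, weekends, holidays)
       # also exceeds 6 minutes, so every session boundary is counted here too
       time_diff = df.index.to_series().diff()
       gaps = time_diff[time_diff > pd.Timedelta(minutes=6)]  # More than 6 min gap
       if not gaps.empty:
           issues.append(f"Found {len(gaps)} time gaps > 6 minutes")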
           
       # Print issues for this period
       if issues:
           print(f"\nQuality issues in {period_name}:")
           for issue in issues:
               print(f"- {issue}")
               
       return len(issues) == 0
   
   # Resample to 5 min OHLC
   def resample_ohlc(df, interval):
       resampled = df.resample(interval).agg({
           'open': 'first',
           'high': 'max',
           'low': 'min',
           'close': 'last',
           'volume': 'sum'
       })
       return resampled.dropna()  # Drop empty bins (non-trading hours) created by resampling
   
   print("Loading and preprocessing data...")
   df = load_and_clean(csv_path)
   
   # Resample full dataset
   df_resampled = resample_ohlc(df, interval)
   
   # Get end date
   end_date = df_resampled.index.max()
   
   # Define periods and create slices
   periods = {
       'last_6m': (end_date - pd.DateOffset(months=6), end_date),
       'last_to_last_6m': (end_date - pd.DateOffset(months=12), 
                          end_date - pd.DateOffset(months=6)),
       'last_1y': (end_date - pd.DateOffset(years=1), end_date),
       'last_to_last_1y': (end_date - pd.DateOffset(years=2), 
                          end_date - pd.DateOffset(years=1)),
       'last_2y': (end_date - pd.DateOffset(years=2), end_date),
       'last_to_last_2y': (end_date - pd.DateOffset(years=4), 
                          end_date - pd.DateOffset(years=2))
   }
   
   datasets = {}
   print("\nCreating and validating period datasets...")
   
   for name, (start, end) in periods.items():
       # Slice data
       period_data = df_resampled[start:end].copy()
       
       # Perform quality checks
       is_clean = quality_checks(period_data, name)
       
       # Store data and metadata
       datasets[name] = {
           'data': period_data,
           'metadata': {
               'start_date': period_data.index.min(),
               'end_date': period_data.index.max(),
               'records': len(period_data),
               'trading_days': len(set(period_data.index.date)),
               'quality_passed': is_clean
           }
       }
   
   # Print summary
   print("\nPeriod Summaries:")
   print("="*70)
   for name, dataset in datasets.items():
       meta = dataset['metadata']
       print(f"\n{name}:")
       print(f"Date Range: {meta['start_date']} to {meta['end_date']}")
       print(f"Records: {meta['records']}, Trading Days: {meta['trading_days']}")
       print(f"Quality Check: {'✓' if meta['quality_passed'] else '✗'}")
   
   return datasets

# Usage
csv_path = 'nifty50_1min_2015_to_2024.csv'
datasets = resample_and_get_periods(csv_path)

# Access data for a period
df1 = datasets['last_6m']['data']
# last_6m_metadata = datasets['last_6m']['metadata']
df2 = datasets['last_to_last_6m']['data']
df3 = datasets['last_1y']['data']
df4 = datasets['last_to_last_1y']['data']
df5 = datasets['last_2y']['data']
df6 = datasets['last_to_last_2y']['data']
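# 5-minute simple returns for the last-6-month slice.
# Stored under a new name so df1 stays a DataFrame and later cells can still read df1["close"].
returns_6m = df1["close"].pct_change().dropna().rename("Return")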


Loading and preprocessing data...

Creating and validating period datasets...

Quality issues in last_6m:
- Found 124 time gaps > 6 minutes

Quality issues in last_to_last_6m:
- Found 126 time gaps > 6 minutes

Quality issues in last_1y:
- Found 250 time gaps > 6 minutes

Quality issues in last_to_last_1y:
- Found 247 time gaps > 6 minutes

Quality issues in last_2y:
- Found 497 time gaps > 6 minutes

Quality issues in last_to_last_2y:
- Found 496 time gaps > 6 minutes

Period Summaries:

last_6m:
Date Range: 2024-02-28 15:25:00+05:30 to 2024-08-28 15:25:00+05:30
Records: 9043, Trading Days: 123
Quality Check: ✗

last_to_last_6m:
Date Range: 2023-08-28 15:25:00+05:30 to 2024-02-28 15:25:00+05:30
Records: 9388, Trading Days: 127
Quality Check: ✗

last_1y:
Date Range: 2023-08-28 15:25:00+05:30 to 2024-08-28 15:25:00+05:30
Records: 18430, Trading Days: 249
Quality Check: ✗

last_to_last_1y:
Date Range: 2022-08-29 09:15:00+05:30 to 2023-08-28 15:25:00+05:30
Records: 18537, Trading Days: 24
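
The gap counts above are close to the number of trading days in each period, which suggests that most of them are the normal break between sessions. A minimal sketch of a check that ignores the first bar of each day is shown below. `find_gaps` is a hypothetical helper, and the two sessions are synthetic, timezone-naive timestamps rather than the notebook's data.

```python
import pandas as pd

def find_gaps(index, max_gap="6min", intraday_only=True):
    """Return the time differences that exceed max_gap.

    With intraday_only=True the first bar of each calendar day is skipped,
    so the break between sessions is not reported as a gap.
    """
    deltas = index.to_series().diff()
    too_long = deltas > pd.Timedelta(max_gap)
    if intraday_only:
        days = pd.Series(index.date, index=index)
        first_bar_of_day = days != days.shift()
        too_long &= ~first_bar_of_day
    return deltas[too_long]

# Two synthetic sessions of 5-minute bars (09:15 to 15:25) with one bar removed
day1 = pd.date_range("2024-01-01 09:15", "2024-01-01 15:25", freq="5min")
day2 = pd.date_range("2024-01-02 09:15", "2024-01-02 15:25", freq="5min")
index = day1.append(day2).delete(len(day1) + 10)

all_gaps = find_gaps(index, intraday_only=False)
intraday_gaps = find_gaps(index, intraday_only=True)
print(len(all_gaps), len(intraday_gaps))  # 2 1
```

Only the missing intraday bar survives the stricter check, so a quality flag based on it would separate real data holes from the market being closed.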

In [38]:
import pandas as pd
import numpy as np
from statsmodels.tsa.stattools import adfuller
import matplotlib.pyplot as plt
import seaborn as sns

def analyze_stationarity(returns_series, title="Returns Analysis"):
   """
   Comprehensive stationarity analysis using ADF test
   
   Parameters:
   returns_series: Series of returns data
   title: Title for plots
   
   Returns:
   Dictionary containing test results and interpretation
   """
   
   # Run ADF Test
   adf_result = adfuller(returns_series)
   
   # Create visualization
   fig, axes = plt.subplots(2, 1, figsize=(15, 10))
   
   # Plot 1: Returns Time Series
   axes[0].plot(returns_series)
   axes[0].set_title(f"{title} - Time Series Plot")
   axes[0].set_xlabel("Time")
   axes[0].set_ylabel("Returns")
   
   # Plot 2: Returns Distribution
   sns.histplot(returns_series, kde=True, ax=axes[1])
   axes[1].set_title("Returns Distribution")
   axes[1].set_xlabel("Return Value")
   
   # Compile results
   results = {
       'test_statistic': adf_result[0],
       'p_value': adf_result[1],
       'critical_values': adf_result[4],
       'is_stationary': adf_result[1] < 0.05,
   }
   
   # Print detailed interpretation
   print("\n=== Augmented Dickey-Fuller Test Results ===")
   print(f"Test Statistic: {results['test_statistic']:.4f}")
   print(f"P-value: {results['p_value']:.4f}")
   print("\nCritical Values:")
   for key, value in results['critical_values'].items():
       print(f"\t{key}: {value:.4f}")
   
   print("\nInterpretation:")
   print("-" * 50)
   if results['is_stationary']:
       print("✓ Series is STATIONARY")
       print("• Suitable for direct modeling")
       print("• ARIMA: No differencing needed (d=0)")
       print("• Good for prediction tasks")
   else:
       print("✗ Series is NON-STATIONARY")
       print("• Needs differencing before modeling")
       print("• ARIMA: Use d>0")
       print("• Consider transforming data")
   
   # Additional statistics
   results['statistics'] = {
       'mean': returns_series.mean(),
       'std': returns_series.std(),
       'skew': returns_series.skew(),
       'kurtosis': returns_series.kurtosis()
   }
   
   print("\nBasic Statistics:")
   print(f"Mean: {results['statistics']['mean']:.6f}")
   print(f"Std Dev: {results['statistics']['std']:.6f}")
   print(f"Skewness: {results['statistics']['skew']:.6f}")
   print(f"Kurtosis: {results['statistics']['kurtosis']:.6f}")
   
   return results, fig

# Markdown report built from the ADF results
def document_stationarity_analysis(returns_df):
   """
   Document stationarity analysis results in markdown format
   """
   results, fig = analyze_stationarity(returns_df['returns'])
   
   markdown_doc = f"""
# Stationarity Analysis Report 📊

## Test Results
- **ADF Test Statistic**: {results['test_statistic']:.4f}
- **P-value**: {results['p_value']:.4f}
- **Is Stationary**: {'Yes' if results['is_stationary'] else 'No'}

## Critical Values
"""
   for key, value in results['critical_values'].items():
       markdown_doc += f"- {key}: {value:.4f}\n"
   
   markdown_doc += f"""
## Statistical Properties
- Mean: {results['statistics']['mean']:.6f}
- Standard Deviation: {results['statistics']['std']:.6f}
- Skewness: {results['statistics']['skew']:.6f}
- Kurtosis: {results['statistics']['kurtosis']:.6f}

## Implications for Modeling

### ARIMA
- {'Use original series (d=0)' if results['is_stationary'] else 'Need differencing (d>0)'}
- {'Ready for AR and MA components' if results['is_stationary'] else 'Transform data first'}

### Transformers
- {'Use raw returns' if results['is_stationary'] else 'Consider preprocessing'}
- Feature engineering implications: {
   'Direct feature creation possible' if results['is_stationary'] 
   else 'Need to handle non-stationarity in features'
}

## Recommendations
1. {
   'Proceed with modeling using raw returns' if results['is_stationary'] 
   else 'Apply differencing or transformation'
}
2. {
   'Focus on model selection' if results['is_stationary'] 
   else 'Validate stationarity after transformation'
}
3. Consider return distribution characteristics for model assumptions
"""
   
   return markdown_doc

# Usage
# returns_df = your_processed_dataframe
# doc = document_stationarity_analysis(returns_df)
# print(doc)  # Or save to markdown file

In [27]:
# Preview the first 5-minute bars of the last-6-month slice
datasets['last_6m']['data'][['open', 'high', 'low', 'close', 'volume']].head(5)

date,open,high,low,close,volume
2024-02-28 15:25:00+05:30,21934.05,21935.0,21915.85,21926.25,0
2024-02-29 09:15:00+05:30,21935.2,21945.05,21901.15,21936.35,0
2024-02-29 09:20:00+05:30,21934.4,21934.75,21887.1,21932.45,0
2024-02-29 09:25:00+05:30,21933.25,21963.7,21931.85,21952.6,0
2024-02-29 09:30:00+05:30,21952.5,21992.45,21943.6,21985.05,0


In [37]:
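# Compute 5-minute simple returns from the close column of the last-6-month slice.
# Reading from `datasets` keeps this cell re-runnable even after df1 has been reassigned.
df1 = datasets['last_6m']['data']['close'].pct_change().dropna().rename("Return")
df1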






date
2024-02-29 09:15:00+05:30    0.000461
2024-02-29 09:20:00+05:30   -0.000178
2024-02-29 09:25:00+05:30    0.000919
2024-02-29 09:30:00+05:30    0.001478
2024-02-29 09:35:00+05:30   -0.001599
2024-02-29 09:40:00+05:30   -0.000164
2024-02-29 09:45:00+05:30   -0.000162
2024-02-29 09:50:00+05:30    0.001780
2024-02-29 09:55:00+05:30    0.001522
2024-02-29 10:00:00+05:30   -0.000722
2024-02-29 10:05:00+05:30   -0.002559
2024-02-29 10:10:00+05:30   -0.003170
2024-02-29 10:15:00+05:30    0.000025
2024-02-29 10:20:00+05:30    0.000505
2024-02-29 10:25:00+05:30    0.000790
2024-02-29 10:30:00+05:30   -0.000794
2024-02-29 10:35:00+05:30    0.000021
2024-02-29 10:40:00+05:30    0.001076
2024-02-29 10:45:00+05:30   -0.000249
2024-02-29 10:50:00+05:30   -0.000813
2024-02-29 10:55:00+05:30   -0.000155
2024-02-29 11:00:00+05:30    0.001376
2024-02-29 11:05:00+05:30    0.000618
2024-02-29 11:10:00+05:30    0.001931
2024-02-29 11:15:00+05:30   -0.000683
2024-02-29 11:20:00+05:30    0.000305
2024-02

In [None]:
def run_stationarity_tests(series, title=""):
    """
    Comprehensive stationarity tests
    1. Augmented Dickey-Fuller (ADF) Test
    2. KPSS Test
    3. Visual Tests with plots
    """
    results = {}
    
    # ADF Test
    adf_result = adfuller(series)
    results['adf'] = {
        'test_statistic': adf_result[0],
        'p_value': adf_result[1],
        'critical_values': adf_result[4],
        'is_stationary': adf_result[1] < 0.05
    }
    
    # KPSS Test
    kpss_result = kpss(series)
    results['kpss'] = {
        'test_statistic': kpss_result[0],
        'p_value': kpss_result[1],
        'is_stationary': kpss_result[1] > 0.05
    }
    
    # Plotting
    fig, axes = plt.subplots(3, 1, figsize=(12, 10))
    
    # Time Series Plot
    axes[0].plot(series)
    axes[0].set_title(f'{title} Time Series Plot')
    
    # ACF Plot
    acf_values = acf(series, nlags=40)
    axes[1].plot(range(len(acf_values)), acf_values)
    axes[1].axhline(y=0, linestyle='--', color='gray')
    axes[1].axhline(y=-1.96/np.sqrt(len(series)), linestyle='--', color='gray')
    axes[1].axhline(y=1.96/np.sqrt(len(series)), linestyle='--', color='gray')
    axes[1].set_title('Autocorrelation Function')
    
    # PACF Plot
    pacf_values = pacf(series, nlags=40)
    axes[2].plot(range(len(pacf_values)), pacf_values)
    axes[2].axhline(y=0, linestyle='--', color='gray')
    axes[2].axhline(y=-1.96/np.sqrt(len(series)), linestyle='--', color='gray')
    axes[2].axhline(y=1.96/np.sqrt(len(series)), linestyle='--', color='gray')
    axes[2].set_title('Partial Autocorrelation Function')
    
    plt.tight_layout()
    return results, fig

def test_randomness(series):
    """
    Ljung-Box Q Test for randomness
    """
    lb_result = acorr_ljungbox(series, lags=[10, 20, 30], return_df=True)
    # Fix accessing results
    return {
        'lb_pvalue_10': lb_result.iloc[0]['lb_pvalue'],
        'lb_pvalue_20': lb_result.iloc[1]['lb_pvalue'],
        'lb_pvalue_30': lb_result.iloc[2]['lb_pvalue']
    }

# Log results to MLflow
with mlflow.start_run(run_name="stationarity_tests"):
    # Load your data
    df = pd.read_csv('nifty50_1min_2015_to_2024.csv')  # Replace with your path
    df = df.tail(144000)  # keep the most recent 144,000 one-minute bars
    df['date'] = pd.to_datetime(df['date'])
    close_prices = df['close']
    
    # Run tests on original series (price levels)
    results_orig, fig_orig = run_stationarity_tests(close_prices, "Original Series")
    mlflow.log_figure(fig_orig, "original_series_tests.png")
    
    # Test statistics are outputs of the run, so log them as metrics rather than params
    mlflow.log_metrics({
        "adf_statistic": results_orig['adf']['test_statistic'],
        "adf_pvalue": results_orig['adf']['p_value'],
        "kpss_statistic": results_orig['kpss']['test_statistic'],
        "kpss_pvalue": results_orig['kpss']['p_value']
    })
    
    # If not stationary, test first difference
    if not results_orig['adf']['is_stationary']:
        diff_series = close_prices.diff().dropna()
        results_diff, fig_diff = run_stationarity_tests(diff_series, "First Difference")
        mlflow.log_figure(fig_diff, "diff_series_tests.png")
        
        # Log differenced series test results
        mlflow.log_metrics({
            "diff_adf_statistic": results_diff['adf']['test_statistic'],
            "diff_adf_pvalue": results_diff['adf']['p_value'],
            "diff_kpss_statistic": results_diff['kpss']['test_statistic'],
            "diff_kpss_pvalue": results_diff['kpss']['p_value']
        })
    
    # Ljung-Box on returns: raw price levels are so strongly autocorrelated
    # that the test would reject at every lag and say nothing useful
    returns = close_prices.pct_change().dropna()
    lb_test = test_randomness(returns)
    mlflow.log_metrics({
        "lb_test_10": lb_test['lb_pvalue_10'],
        "lb_test_20": lb_test['lb_pvalue_20'],
        "lb_test_30": lb_test['lb_pvalue_30']
    })


print("Test Interpretation:")
print("1. ADF Test:")
print(f"   - Null Hypothesis: Series is non-stationary")
print(f"   - p-value: {results_orig['adf']['p_value']:.4f}")
print(f"   - Series is {'stationary' if results_orig['adf']['is_stationary'] else 'non-stationary'}")

print("\n2. KPSS Test:")
print(f"   - Null Hypothesis: Series is stationary")
print(f"   - p-value: {results_orig['kpss']['p_value']:.4f}")
print(f"   - Series is {'non-stationary' if results_orig['kpss']['p_value'] < 0.05 else 'stationary'}")

print("\n3. Ljung-Box Test:")
print("   - Null Hypothesis: Data is independently distributed")
print(f"   - p-values for lags 10, 20, 30: {list(lb_test.values())}")
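
ADF and KPSS have opposite null hypotheses, so the two p-values are most useful when read together. The sketch below encodes the four-way reading given in the statsmodels ADF/KPSS example as a plain function. `combine_adf_kpss` is a hypothetical helper, and the p-values passed to it are made-up inputs, not results from this notebook. Keep in mind that `kpss` reports interpolated p-values that are clipped to the range of its lookup table.

```python
def combine_adf_kpss(adf_p, kpss_p, alpha=0.05):
    """Joint reading of ADF (null: unit root) and KPSS (null: stationary)."""
    adf_says_stationary = adf_p < alpha       # ADF rejects the unit root
    kpss_says_stationary = kpss_p >= alpha    # KPSS fails to reject stationarity

    if adf_says_stationary and kpss_says_stationary:
        return "stationary"
    if not adf_says_stationary and not kpss_says_stationary:
        return "non-stationary"
    if kpss_says_stationary:
        # ADF cannot reject a unit root, KPSS sees stationarity
        return "trend-stationary: remove the trend"
    # ADF rejects a unit root, KPSS rejects stationarity
    return "difference-stationary: difference the series"

# Made-up p-values covering the four cases
verdicts = [
    combine_adf_kpss(0.001, 0.10),
    combine_adf_kpss(0.60, 0.01),
    combine_adf_kpss(0.20, 0.10),
    combine_adf_kpss(0.001, 0.01),
]
for v in verdicts:
    print(v)
```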

