# Data Types and Scientific Types (mtypes and scitypes)

This tutorial explains sktime's data type system and how to work with different data formats including polars.

**Duration:** ~10 minutes

## Learning objectives

By the end of this tutorial, you will be able to:
- Understand what mtypes and scitypes are in sktime
- Work with different data formats in sktime
- Use polars DataFrames with sktime
- Convert between different data representations

## 1. Introduction to mtypes and scitypes

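sktime separates what a data object represents (its scitype) from how it is stored in memory (its mtype). This lets the same estimator accept several container formats for the same kind of data.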

In [None]:
import pandas as pd
import numpy as np
from sktime.datasets import load_airline
from sktime.datatypes import check_raise, convert_to, mtype, scitype

print("sktime Data Type System Overview:")
print("=" * 35)

print("\nSCITYPE (Scientific Type):")
print("  - Abstract mathematical concept")
print("  - Examples: 'Series', 'Panel', 'Hierarchical'")
print("  - Defines what the data represents")

print("\nMTYPE (Machine Type):")
print("  - Concrete data format/representation")
print("  - Examples: 'pd.Series', 'pd.DataFrame', 'np.ndarray'")
print("  - Defines how the data is stored")

print("\nThe relationship: scitype → mtype → actual data")
print("Example: Series → pd.Series → your pandas Series object")

## 2. Exploring Available Data Types

In [None]:
# Illustrative, hand-written lists; the exact set of supported types
# depends on the installed sktime version and its optional dependencies
print("Supported Scientific Types (scitypes):")
scitypes = ['Series', 'Panel', 'Hierarchical', 'Table']
for i, scitype_name in enumerate(scitypes, 1):
    print(f"{i}. {scitype_name}")

print("\nSupported Machine Types (mtypes) for Series:")
series_mtypes = [
    'pd.Series',      # pandas Series
    'pd.DataFrame',   # pandas DataFrame (single column)
    'np.ndarray',     # numpy array
    'pl.DataFrame',   # polars DataFrame (availability depends on the sktime version)
    'pl.Series'       # polars Series (availability depends on the sktime version)
]

for i, mtype_name in enumerate(series_mtypes, 1):
    print(f"{i}. {mtype_name}")

print("\nSupported Machine Types (mtypes) for Panel:")
panel_mtypes = [
    'pd-multiindex',  # pandas MultiIndex DataFrame
    'nested_univ',    # nested pandas DataFrame
    'numpy3D',        # 3D numpy array
    'df-list',        # list of DataFrames
    'pl.DataFrame'    # polars DataFrame (availability depends on the sktime version)
]

for i, mtype_name in enumerate(panel_mtypes, 1):
    print(f"{i}. {mtype_name}")

## 3. Working with Different Data Formats

Let's load data and explore different representations.

In [None]:
# Load sample data
y = load_airline()
print(f"Original data type: {type(y)}")
print(f"Original data shape: {y.shape}")
print(f"Data head:\n{y.head()}")

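# Check the scitype and mtype
# Restrict the candidates: a pandas Series may also qualify as a "Table",
# and type inference raises an error if more than one type matches
print(f"\nScitype: {scitype(y, candidate_scitypes=['Series', 'Panel', 'Hierarchical'])}")
print(f"Mtype: {mtype(y, as_scitype='Series')}")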

# Verify data integrity
try:
    check_raise(y, mtype="pd.Series", scitype="Series")
    print("✓ Data passes validation checks")
except Exception as e:
    print(f"✗ Data validation failed: {e}")

## 4. Converting Between Data Types

In [None]:
# Convert to different mtypes
print("Converting to different formats:")
print("=" * 32)

# 1. Convert to pandas DataFrame
y_df = convert_to(y, to_type="pd.DataFrame")
print(f"\n1. As DataFrame:")
print(f"   Type: {type(y_df)}")
print(f"   Shape: {y_df.shape}")
print(f"   Columns: {y_df.columns.tolist()}")
print(f"   Head:\n{y_df.head()}")

# 2. Convert to numpy array
y_np = convert_to(y, to_type="np.ndarray")
print(f"\n2. As numpy array:")
print(f"   Type: {type(y_np)}")
print(f"   Shape: {y_np.shape}")
print(f"   First 5 values: {y_np[:5]}")

# 3. Try converting to polars (if available)
try:
    import polars as pl
    y_pl = convert_to(y, to_type="pl.DataFrame")
    print(f"\n3. As Polars DataFrame:")
    print(f"   Type: {type(y_pl)}")
    print(f"   Shape: {y_pl.shape}")
    print(f"   Schema: {y_pl.schema}")
    print(f"   Head:\n{y_pl.head()}")
except ImportError:
    print(f"\n3. Polars not available")
    print("   Install with: pip install polars")
except Exception as e:
    print(f"\n3. Polars conversion failed: {e}")

## 5. Working with Polars DataFrames

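Polars is a fast DataFrame library. The example below takes the most portable route: convert a polars DataFrame to a pandas Series, forecast with sktime, then move the result back to polars.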

In [None]:
try:
    import polars as pl
    
    print("Working with Polars:")
    print("=" * 20)
    
    # Create a polars DataFrame directly
    dates = pd.date_range('2020-01-01', periods=100, freq='D')
    values = np.random.randn(100).cumsum() + 100
    
    # Create polars DataFrame
    pl_df = pl.DataFrame({
        'date': dates,
        'value': values
    })
    
    print(f"Original Polars DataFrame:")
    print(f"Type: {type(pl_df)}")
    print(f"Shape: {pl_df.shape}")
    print(f"Schema: {pl_df.schema}")
    
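    # Convert to sktime-compatible format
    # First convert to pandas for sktime compatibility
    pd_from_pl = pl_df.to_pandas()
    # polars does not keep a pandas frequency, so set it again explicitly
    pd_from_pl = pd_from_pl.set_index('date')['value'].asfreq('D')
    
    print(f"\nConverted to sktime format:")
    print(f"Type: {type(pd_from_pl)}")
    # Restrict the candidate types so that type inference is unambiguous
    print(f"Scitype: {scitype(pd_from_pl, candidate_scitypes=['Series', 'Panel', 'Hierarchical'])}")
    print(f"Mtype: {mtype(pd_from_pl, as_scitype='Series')}")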
    
    # Use with sktime forecaster
    from sktime.forecasting.naive import NaiveForecaster
    
    forecaster = NaiveForecaster(strategy="drift")
    forecaster.fit(pd_from_pl)
    prediction = forecaster.predict(fh=[1, 2, 3, 4, 5])
    
    print(f"\nForecast successful!")
    print(f"Prediction shape: {prediction.shape}")
    print(f"Predictions: {prediction.values}")
    
    # Convert prediction back to polars if needed
    pred_pl = pl.DataFrame({
        'date': prediction.index,
        'forecast': prediction.values
    })
    
    print(f"\nPredictions as Polars DataFrame:")
    print(pred_pl)
    
except ImportError:
    print("Polars Example (Conceptual):")
    print("=" * 28)
    print("\nTo use Polars with sktime:")
    print("1. Install polars: pip install polars")
    print("2. Create polars DataFrame")
    print("3. Convert to pandas for sktime compatibility")
    print("4. Use with sktime estimators")
    print("5. Convert results back to polars if needed")
    
    print("\nExample workflow:")
    print("```python")
    print("import polars as pl")
    print("from sktime.forecasting.naive import NaiveForecaster")
    print("")
    print("# Create polars data")
    print("pl_df = pl.DataFrame({'date': dates, 'value': values})")
    print("")
    print("# Convert for sktime")
    print("pd_series = pl_df.to_pandas().set_index('date')['value']")
    print("")
    print("# Use with sktime")
    print("forecaster = NaiveForecaster()")
    print("forecaster.fit(pd_series)")
    print("prediction = forecaster.predict(fh=[1, 2, 3])")
    print("```")
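
One detail of the polars-to-pandas route deserves attention: a DataFrame column that becomes the index via `set_index` carries no frequency, even when the timestamps are perfectly regular. The pandas-only sketch below shows the effect and one way to restore the frequency. The `frame` object is a stand-in for the result of `pl_df.to_pandas()`, so polars itself is not needed to run it.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2020-01-01", periods=10, freq="D")
# Stand-in for pl_df.to_pandas(): a plain table with a datetime column
frame = pd.DataFrame({"date": dates, "value": np.arange(10, dtype=float)})

series = frame.set_index("date")["value"]
freq_before = series.index.freq
print(f"freq after set_index: {freq_before}")

# Infer the spacing from the timestamps and attach it to the index
inferred = pd.infer_freq(series.index)
series = series.asfreq(inferred)
freq_after = series.index.freqstr
print(f"freq after asfreq: {freq_after}")
```

Setting the frequency explicitly means forecast dates do not depend on frequency inference at prediction time.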

## 6. Panel Data (Multiple Time Series)

This section shows how to represent multiple time series (panel data) in different formats.

In [None]:
# Create sample panel data
np.random.seed(42)
dates = pd.date_range('2020-01-01', periods=50, freq='D')

# Method 1: MultiIndex DataFrame (preferred for sktime)
panel_data = []
for series_id in ['A', 'B', 'C']:
    for date in dates:
        value = np.random.randn() * (ord(series_id) - ord('A') + 1) + 100
        panel_data.append({'series_id': series_id, 'date': date, 'value': value})

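panel_df = pd.DataFrame(panel_data)
# Double brackets keep a DataFrame: the "pd-multiindex" mtype expects a
# pd.DataFrame with an (instance, time) MultiIndex, not a pd.Series
panel_multiindex = panel_df.set_index(['series_id', 'date'])[['value']]

print("Panel Data Formats:")
print("=" * 18)

print(f"\n1. MultiIndex DataFrame:")
print(f"   Type: {type(panel_multiindex)}")
print(f"   Shape: {panel_multiindex.shape}")
print(f"   Index levels: {panel_multiindex.index.names}")
# Restrict the candidate types so that type inference is unambiguous
print(f"   Scitype: {scitype(panel_multiindex, candidate_scitypes=['Series', 'Panel', 'Hierarchical'])}")
print(f"   Mtype: {mtype(panel_multiindex, as_scitype='Panel')}")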
print(f"   Sample:\n{panel_multiindex.head(10)}")

# Method 2: Convert to nested format
try:
    panel_nested = convert_to(panel_multiindex, to_type="nested_univ")
    print(f"\n2. Nested DataFrame:")
    print(f"   Type: {type(panel_nested)}")
    print(f"   Shape: {panel_nested.shape}")
    print(f"   Columns: {panel_nested.columns.tolist()}")
    print(f"   Mtype: {mtype(panel_nested, as_scitype='Panel')}")
    print(f"   Sample:\n{panel_nested.head()}")
except Exception as e:
    print(f"\n2. Nested format conversion failed: {e}")

# Method 3: List of DataFrames
try:
    panel_list = convert_to(panel_multiindex, to_type="df-list")
    print(f"\n3. List of DataFrames:")
    print(f"   Type: {type(panel_list)}")
    print(f"   Length: {len(panel_list)}")
    print(f"   Mtype: {mtype(panel_list, as_scitype='Panel')}")
    print(f"   First series shape: {panel_list[0].shape}")
    print(f"   First series head:\n{panel_list[0].head()}")
except Exception as e:
    print(f"\n3. List format conversion failed: {e}")
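
The `numpy3D` mtype listed earlier stores a panel as one array with axes (instances, variables, time points). The sketch below builds that layout by hand with pandas and numpy, to show what the axes mean. It assumes equal-length series sorted by instance and time. In practice `convert_to(..., to_type="numpy3D")` should be preferred over a manual reshape.

```python
import numpy as np
import pandas as pd

# Three series ("A", "B", "C") with four daily observations each
index = pd.MultiIndex.from_product(
    [["A", "B", "C"], pd.date_range("2020-01-01", periods=4, freq="D")],
    names=["series_id", "date"],
)
panel = pd.DataFrame({"value": np.arange(12, dtype=float)}, index=index)

n_instances = panel.index.get_level_values("series_id").nunique()
n_timepoints = panel.index.get_level_values("date").nunique()

# Rows are ordered instance by instance, so reshape to
# (instances, time points, variables), then swap the last two axes
array3d = panel.to_numpy().reshape(n_instances, n_timepoints, -1).transpose(0, 2, 1)

print(f"shape: {array3d.shape}")
print(f"series 'B', variable 'value': {array3d[1, 0]}")
```

The instance labels and timestamps are not stored in the array, which is why this format is mostly used for equal-length panels where the index is not needed.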

## 7. Data Type Validation and Debugging

In [None]:
from sktime.datatypes import check_is_mtype, check_is_scitype

print("Data Type Validation:")
print("=" * 21)

# Test different data objects
test_objects = [
    ("airline_series", y),
    ("numpy_array", y.values),
    ("panel_multiindex", panel_multiindex),
    ("plain_list", [1, 2, 3, 4, 5])
]

for name, obj in test_objects:
    print(f"\n{name}:")
    
    # Check various mtypes
    mtypes_to_test = ["pd.Series", "pd.DataFrame", "np.ndarray", "pd-multiindex"]
    
    for mtype_test in mtypes_to_test:
        is_mtype = check_is_mtype(obj, mtype_test, return_metadata=False)
        status = "✓" if is_mtype else "✗"
        print(f"   {status} {mtype_test}")
    
    # Check scitypes
    scitypes_to_test = ["Series", "Panel", "Table"]
    for scitype_test in scitypes_to_test:
        is_scitype = check_is_scitype(obj, scitype_test, return_metadata=False)
        status = "✓" if is_scitype else "✗"
        print(f"   {status} scitype:{scitype_test}")

print("\n\nDebugging Tips:")
print("=" * 15)
print("1. Use scitype(obj, candidate_scitypes=[...]) to identify the scientific type")
print("2. Use mtype(obj, as_scitype=...) to identify the machine type")
print("3. Use check_raise(obj, mtype, scitype) for validation")
print("4. Use convert_to(obj, to_type) for format conversion")
print("5. Check sktime documentation for supported formats")

## 8. Practical Examples and Use Cases

In [None]:
print("Practical Use Cases:")
print("=" * 19)

print("\n1. WORKING WITH EXTERNAL DATA:")
print("   # From CSV file")
print("   df = pd.read_csv('data.csv', index_col='date', parse_dates=True)")
print("   y = df['target']  # Extract series")
print("   # Validate: check_raise(y, mtype='pd.Series', scitype='Series')")

print("\n2. CONVERTING NUMPY TO SKTIME:")
print("   # From numpy array with known frequency")
print("   values = np.random.randn(100)")
print("   dates = pd.date_range('2020-01-01', periods=100, freq='D')")
print("   y = pd.Series(values, index=dates)")

print("\n3. HANDLING MULTIPLE SERIES:")
print("   # Create MultiIndex for panel data")
print("   index = pd.MultiIndex.from_product([['A', 'B'], dates], names=['series_id', 'date'])")
print("   panel = pd.DataFrame({'value': np.random.randn(len(index))}, index=index)")

print("\n4. POLARS INTEGRATION:")
print("   # Convert polars to sktime-compatible format")
print("   pl_df = pl.DataFrame({'date': dates, 'value': values})")
print("   pd_series = pl_df.to_pandas().set_index('date')['value']")

# Demonstrate a complete workflow
print("\n\nComplete Workflow Example:")
print("=" * 26)

# Simulate loading data from different sources
print("\nStep 1: Load data from various sources")

# Source 1: CSV-like data
csv_data = pd.DataFrame({
    'date': pd.date_range('2020-01-01', periods=50, freq='D'),
    'sales': np.random.randint(100, 200, 50),
    'region': ['North'] * 25 + ['South'] * 25
})

print(f"CSV data loaded: {csv_data.shape}")

# Convert to sktime format
print("\nStep 2: Convert to sktime format")
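# Filtering rows drops the index frequency, so set it again with asfreq
y_north = csv_data[csv_data['region'] == 'North'].set_index('date')['sales'].asfreq('D')
y_south = csv_data[csv_data['region'] == 'South'].set_index('date')['sales'].asfreq('D')

# Restrict the candidate types so that type inference is unambiguous
ts_scitypes = ['Series', 'Panel', 'Hierarchical']
print(f"North series: {scitype(y_north, candidate_scitypes=ts_scitypes)} / {mtype(y_north, as_scitype='Series')}")
print(f"South series: {scitype(y_south, candidate_scitypes=ts_scitypes)} / {mtype(y_south, as_scitype='Series')}")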

# Use with sktime
print("\nStep 3: Use with sktime forecaster")
from sktime.forecasting.naive import NaiveForecaster

forecaster = NaiveForecaster(strategy="drift")
forecaster.fit(y_north)
forecast = forecaster.predict(fh=[1, 2, 3, 4, 5])

print(f"Forecast generated: {forecast.shape}")
print(f"Forecast values: {forecast.values}")

# Convert results to desired format
print("\nStep 4: Convert results to desired format")
forecast_df = forecast.reset_index()
forecast_df.columns = ['date', 'forecast']
print(f"Results as DataFrame:\n{forecast_df}")
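
The workflow above splits the table into one series per region. An alternative is to keep both regions in a single panel-shaped DataFrame with an (instance, time) MultiIndex, which is the layout the `pd-multiindex` format described earlier expects. The pandas-only sketch below shows this, assuming a table shaped like `csv_data`. The name `csv_like` and the seeded random values are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
csv_like = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=50, freq="D"),
    "sales": rng.integers(100, 200, size=50),
    "region": ["North"] * 25 + ["South"] * 25,
})

# Instance level first, time level second; double brackets keep a DataFrame
panel = csv_like.set_index(["region", "date"])[["sales"]].sort_index()
print(panel.head())

# Individual series can still be recovered from the panel when needed
per_region = {
    region: group.droplevel("region")
    for region, group in panel.groupby(level="region")
}
print({region: frame.shape for region, frame in per_region.items()})
```

Keeping the data in one panel avoids a separate variable per region and makes it easier to add further regions later.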

## Summary

In this tutorial, you learned:

1. **Data Type System**: Understanding scitypes (what) vs mtypes (how)
2. **Available Formats**: Pandas, numpy, polars, and other supported formats
3. **Data Conversion**: Using `convert_to()` to change between formats
4. **Polars Integration**: How to work with polars DataFrames in sktime
5. **Panel Data**: Working with multiple time series in different representations
6. **Validation**: Using data type checking and debugging tools
7. **Practical Workflows**: Real-world examples of data format handling

## Key Takeaways

- **Flexibility**: sktime supports multiple data formats for different use cases
- **Conversion**: Easy conversion between formats using built-in functions
- **Validation**: Always validate data types when in doubt
- **Polars Support**: polars data can be used with sktime by converting it to pandas first
- **Panel Data**: Multiple time series require specific index structures

## Best Practices

1. **Use pandas Series/DataFrame** for most sktime operations
2. **Validate data types** when loading from external sources
3. **Convert consistently** between formats in your workflow
4. **Use MultiIndex** for panel data when possible
5. **Check documentation** for format-specific requirements

## Next Steps

- Explore specific forecasting tutorials to apply these data types
- Learn about "Global Forecasting" for advanced panel data techniques
- Try "Classification" tutorials for different data type applications