# Data Types in Data Science: Time Series, Cross-Sectional, and Panel Data

This cheat sheet covers the three fundamental data types encountered in data science: **time series**, **cross-sectional**, and **panel data**. Understanding these is crucial for selecting the right analysis methods, models, and visualizations.

---

## 1. Time Series Data

**Definition:**
- Observations collected sequentially over time (e.g., daily stock prices, hourly temperature).
- Each observation is associated with a specific timestamp or time interval.

**Key Characteristics:**
- **Temporal Order:** Order matters; past values can influence future values.
- **Frequency:** Can be regular (daily, monthly) or irregular.
- **Trends & Seasonality:** May exhibit long-term trends, cycles, or repeating seasonal patterns.

**Common Use Cases:**
- Forecasting (sales, weather, demand)
- Anomaly detection (fraud, equipment failure)
- Signal processing

**Typical Structure:**
| Time       | Value   |
|------------|---------|
| 2023-01-01 | 100     |
| 2023-01-02 | 105     |
| ...        | ...     |

**Key Analysis Methods:**
- Smoothing (moving averages)
- Decomposition (trend, seasonality, residual)
- Autocorrelation analysis
- Time series forecasting models (ARIMA, Exponential Smoothing)
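
As an illustration of the smoothing method listed above, here is a minimal sketch of a trailing simple moving average in plain Python. The function name `moving_average`, the `window` parameter, and the sample numbers are assumptions made for this example, not part of any library.

```python
def moving_average(values, window):
    """Trailing simple moving average.

    Returns len(values) - window + 1 points, one per full window.
    """
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


# Made-up daily values, for illustration only
daily_values = [100, 105, 103, 108, 110, 107]
smoothed = moving_average(daily_values, window=3)
print([round(x, 2) for x in smoothed])
```

A larger `window` removes more noise but also reacts more slowly to real changes in level.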

**Pitfalls:**
- Ignoring autocorrelation
- Failing to account for non-stationarity

---

## 2. Cross-Sectional Data

**Definition:**
- Observations collected at a single point in time across multiple entities (e.g., survey results from 1,000 people in 2024).
- No inherent time component.

**Key Characteristics:**
- **Snapshot:** Represents a "slice" of the population at one moment.
- **Entities:** Can be individuals, companies, countries, etc.
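- **No temporal order:** Row order carries no information; observations are usually assumed independent (this can fail with clustered samples, e.g., people from the same household).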

**Common Use Cases:**
- Market research
- Demographic analysis
- Comparing groups or categories

**Typical Structure:**
| Entity   | Feature 1 | Feature 2 | ... |
|----------|-----------|-----------|-----|
| Person A | 25        | 1.75      | ... |
| Person B | 32        | 1.68      | ... |
| ...      | ...       | ...       | ... |

**Key Analysis Methods:**
- Descriptive statistics (mean, median, mode)
- Regression analysis
- Group comparisons (t-tests, ANOVA)
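
To make the group-comparison idea concrete, the sketch below computes Welch's t statistic for two independent groups using only the standard library. The function name `welch_t_statistic` and the sample heights are assumptions for this example; it returns the statistic only, not a p-value.

```python
import statistics


def welch_t_statistic(group_a, group_b):
    """Welch's t statistic for two independent samples (unequal variances allowed)."""
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    # statistics.variance is the sample variance (divides by n - 1)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    standard_error = (var_a / len(group_a) + var_b / len(group_b)) ** 0.5
    return (mean_a - mean_b) / standard_error


# Made-up heights in metres for two groups measured at the same time
heights_a = [1.75, 1.68, 1.80, 1.72]
heights_b = [1.60, 1.65, 1.70, 1.62]
t_value = welch_t_statistic(heights_a, heights_b)
print(f"Welch t statistic: {t_value:.3f}")
```

A statistic far from zero suggests the group means differ; turning it into a p-value requires the t distribution, which a statistics library would normally provide.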

**Pitfalls:**
- Assuming causality from correlation
- Ignoring sampling bias

---

## 3. Panel Data (Longitudinal Data)

**Definition:**
- Observations collected over time for multiple entities (e.g., annual income for 100 people over 10 years).
- Combines features of both time series and cross-sectional data.

**Key Characteristics:**
- **Two Dimensions:** Entity and time.
- **Tracks changes:** Can analyze how entities evolve over time.
- **Correlation:** Observations for the same entity are not independent.

**Common Use Cases:**
- Policy impact studies
- Customer behavior analysis
- Medical studies (patient follow-up)

**Typical Structure:**
| Entity   | Time       | Value   |
|----------|------------|---------|
| Person A | 2020-01-01 | 100     |
| Person A | 2020-02-01 | 110     |
| Person B | 2020-01-01 | 90      |
| ...      | ...        | ...     |
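
The long format shown in the table can be regrouped into one time series per entity. The sketch below does this with a dictionary; the function name `group_by_entity` is an assumption for this example, and the rows are the three visible rows of the table.

```python
from collections import defaultdict

# (entity, time, value) rows in long format, copied from the table above
long_rows = [
    ("Person A", "2020-01-01", 100),
    ("Person A", "2020-02-01", 110),
    ("Person B", "2020-01-01", 90),
]


def group_by_entity(rows):
    """Map each entity to its (time, value) observations in chronological order."""
    series_by_entity = defaultdict(list)
    for entity, time, value in rows:
        series_by_entity[entity].append((time, value))
    # ISO-formatted dates (YYYY-MM-DD) sort chronologically as plain strings
    return {entity: sorted(observations)
            for entity, observations in series_by_entity.items()}


panel = group_by_entity(long_rows)
for entity, observations in panel.items():
    print(entity, observations)
```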

**Key Analysis Methods:**
- Fixed effects and random effects models
- Difference-in-differences
- Growth curve modeling
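
The core of a fixed effects model is the "within" transformation: subtract each entity's own mean from its observations, then regress on what is left. The sketch below shows this for a single regressor. The function name `within_slope` and the data are assumptions for illustration, and the sketch omits standard errors and time effects.

```python
from collections import defaultdict


def within_slope(rows):
    """Fixed-effects (within) slope of y on x from (entity, x, y) rows."""
    by_entity = defaultdict(list)
    for entity, x, y in rows:
        by_entity[entity].append((x, y))

    numerator = 0.0
    denominator = 0.0
    for observations in by_entity.values():
        # Demean x and y within each entity to remove its fixed level
        x_mean = sum(x for x, _ in observations) / len(observations)
        y_mean = sum(y for _, y in observations) / len(observations)
        for x, y in observations:
            numerator += (x - x_mean) * (y - y_mean)
            denominator += (x - x_mean) ** 2
    return numerator / denominator


# Made-up panel: both entities follow y = level + 2 * x with different levels
panel_rows = [
    ("A", 1, 102), ("A", 2, 104), ("A", 3, 106),
    ("B", 11, 22), ("B", 12, 24), ("B", 13, 26),
]
print(within_slope(panel_rows))
```

In this made-up data a pooled regression that ignores the entity would be misled by the different levels of A and B, while the within slope recovers the common slope of 2.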

**Pitfalls:**
- Ignoring within-entity correlation
- Not handling missing data properly

---

## Summary Table

| Data Type        | Time Component | Multiple Entities | Example Use Case         |
|------------------|---------------|------------------|-------------------------|
| Time Series      | Yes           | No               | Stock price forecasting |
| Cross-Sectional  | No            | Yes              | Market survey analysis  |
| Panel (Longitudinal) | Yes        | Yes              | Policy impact study     |
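
The two columns of the summary table can be turned into a rough check on a dataset's shape. The sketch below is a heuristic under stated assumptions: the function name `classify_dataset` is hypothetical, and it only counts distinct entities and time points, so repeated cross-sections (different entities sampled in each period) would be labelled as panel data.

```python
def classify_dataset(records):
    """Classify a dataset from its (entity, time) pairs.

    Heuristic only: it counts distinct entities and distinct time points.
    """
    records = list(records)
    entities = {entity for entity, _ in records}
    times = {time for _, time in records}

    if len(entities) > 1 and len(times) > 1:
        return "panel"
    if len(times) > 1:
        return "time series"
    if len(entities) > 1:
        return "cross-sectional"
    return "too few observations"


print(classify_dataset([("AAPL", "2023-01-01"), ("AAPL", "2023-01-02")]))
print(classify_dataset([("Person A", "2024"), ("Person B", "2024")]))
print(classify_dataset([("Person A", "2020"), ("Person A", "2021"),
                        ("Person B", "2020"), ("Person B", "2021")]))
```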

---

**Tips:**
- Always identify your data type before analysis. The choice of statistical methods and models depends on it.
- Panel data allows you to control for unobserved heterogeneity and study dynamics over time.
- Time series requires special handling for trends and autocorrelation.
- Cross-sectional analysis is best for comparing groups at a single time point.

# Understanding Autocorrelation in Time Series

Autocorrelation is a key concept in time series analysis: it measures the correlation between a time series and a lagged copy of itself. In other words, it tells us how strongly the current value of a series is linearly related to its past values.

## What is Autocorrelation?

**Definition:**
- The correlation between observations of the same variable at different time points
- Indicates how strongly past values are linearly related to later values (an association, not proof of influence)
- Values range from -1 to +1 (like regular correlation)
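
For reference, one common way to write the sample autocorrelation at lag $k$ for a series $x_1, \dots, x_n$ is the estimator below. The symbols $r_k$, $x_t$, $n$, and $\bar{x}$ are notation introduced here for illustration.

$$
r_k = \frac{\sum_{t=k+1}^{n} (x_t - \bar{x})(x_{t-k} - \bar{x})}{\sum_{t=1}^{n} (x_t - \bar{x})^2},
\qquad
\bar{x} = \frac{1}{n} \sum_{t=1}^{n} x_t
$$

With $k = 1$, each value is paired with the one immediately before it. The denominator uses the full series, so $r_0 = 1$ and $|r_k| \le 1$ for every lag.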

**Types of Autocorrelation:**
1. **Positive Autocorrelation (>0)**
   - High values tend to be followed by high values
   - Low values tend to be followed by low values
   - Example: Temperature readings (hot days tend to be followed by hot days)

2. **Negative Autocorrelation (<0)**
   - High values tend to be followed by low values
   - Low values tend to be followed by high values
   - Example: Stock market overreactions (extreme up days often followed by down days)

3. **No Autocorrelation (≈0)**
   - No relationship between consecutive values
   - Example: Fair coin tosses

## Common Patterns

**Lag-1 Autocorrelation:**
```
Time:     1  2  3  4  5
Value:    10 12 15 17 20  (Strong positive autocorrelation)
Value:    10 8  12 9  13  (Negative autocorrelation)
Value:    11 12 8  9  10  (Weak/No autocorrelation)
```

## Why It Matters in Data Science

1. **Forecasting:**
   - High autocorrelation means past values are good predictors
   - Models like ARIMA rely on understanding autocorrelation

2. **Model Assumptions:**
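   - Many statistical tests and regression standard errors assume independent (uncorrelated) errors
   - With positive autocorrelation, standard errors are typically understated, which leads to overconfident conclusions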

3. **Feature Engineering:**
   - Can create lagged features for machine learning
   - Helps in detecting seasonal patterns
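
The lagged-feature idea above can be sketched as follows: each target value is paired with the values that preceded it. The function name `make_lag_features` and its `lags` parameter are assumptions for this example.

```python
def make_lag_features(series, lags=(1, 2)):
    """Build lagged features for supervised learning.

    Returns (features, targets), where features[i] holds the values of the
    series at the requested lags before targets[i].
    """
    max_lag = max(lags)
    features = [[series[t - lag] for lag in lags]
                for t in range(max_lag, len(series))]
    targets = list(series[max_lag:])
    return features, targets


series = [10, 12, 15, 17, 20]
features, targets = make_lag_features(series, lags=(1, 2))
print(features)  # [[12, 10], [15, 12], [17, 15]]
print(targets)   # [15, 17, 20]
```

The first `max(lags)` observations are dropped because they have no complete history, so longer lags cost more training rows.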

## Common Tests and Tools

1. **Autocorrelation Function (ACF) Plot:**
   - Shows correlation at different lags
   - Helps identify seasonal patterns
   - Used in ARIMA model identification

2. **Durbin-Watson Test:**
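   - Tests for lag-1 autocorrelation in regression residuals; the statistic ranges from 0 to 4
   - Values near 2 suggest no autocorrelation
   - Values near 0 indicate strong positive autocorrelation; values near 4 indicate strong negative autocorrelation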
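
The Durbin-Watson statistic is simple enough to compute by hand: the sum of squared changes between consecutive residuals divided by the sum of squared residuals. For long series it is approximately 2 × (1 − lag-1 autocorrelation). The sketch below assumes the residuals come from some already fitted model; the function name `durbin_watson` and the residual values are made up for illustration.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences
    divided by the sum of squared residuals."""
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                    for t in range(1, len(residuals)))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Made-up residuals: one set drifts slowly, the other flips sign every step
persistent_residuals = [-3, -2, -1, 1, 2, 3]
alternating_residuals = [1, -1, 1, -1, 1, -1]

print(f"Persistent:  {durbin_watson(persistent_residuals):.3f}")   # well below 2
print(f"Alternating: {durbin_watson(alternating_residuals):.3f}")  # well above 2
```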

## Example Use Cases

1. **Financial Markets:**
   - Stock returns usually show weak autocorrelation; at some horizons it is slightly negative (mean reversion)
   - Trading volumes show positive autocorrelation (trending behavior)

2. **Weather Data:**
   - Temperature shows strong positive autocorrelation
   - Daily cycles and seasonal effects add recurring peaks at the matching lags

3. **Website Traffic:**
   - Hour-to-hour traffic shows strong positive autocorrelation
   - Day-of-week patterns are common

## Handling Autocorrelation

1. **When to Address It:**
   - In time series forecasting
   - When using statistical tests
   - When validating model assumptions

2. **Methods:**
   - Differencing the series
   - Including lagged variables
   - Using appropriate time series models
   - Adjusting confidence intervals (e.g., with autocorrelation-robust standard errors such as Newey-West)
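
Differencing, the first method listed above, replaces each value with its change from an earlier value. A minimal sketch follows; the function name `difference` and its `lag` parameter are assumptions for this example.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t with a full history."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


trend = [10, 12, 15, 17, 20]
print(difference(trend))         # [2, 3, 2, 3]

# A purely linear trend becomes constant after one round of differencing
linear = [5, 8, 11, 14, 17]
print(difference(linear))        # [3, 3, 3, 3]

# lag > 1 gives a seasonal difference (here: compare with 2 steps earlier)
print(difference(trend, lag=2))  # [5, 5, 5]
```

Differencing removes the trend that drives much of the autocorrelation in the raw levels, at the cost of losing the first `lag` observations.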

## Code Example (Conceptual)
```python
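# Lag-1 sample autocorrelation (standard ACF estimator)
def lag_1_autocorrelation(series):
    # Pair each value (from the second onward) with the value just before it
    main_series = series[1:]
    lagged_series = series[:-1]

    # Use the mean and total variation of the full series
    mean = sum(series) / len(series)
    numerator = sum((current - mean) * (previous - mean)
                    for current, previous in zip(main_series, lagged_series))
    denominator = sum((x - mean) ** 2 for x in series)

    # A constant series has zero variance, so its autocorrelation is undefined
    if denominator == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    return numerator / denominator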
```

## Tips and Best Practices

1. **Always Plot Your Data:**
   - Visual inspection can reveal patterns
   - Look for trends and seasonality

2. **Consider Multiple Lags:**
   - Don't just look at lag-1
   - Seasonal data may show peaks at regular intervals

3. **Context Matters:**
   - What's "high" autocorrelation depends on the field
   - Financial data often has lower autocorrelation than physical measurements
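
To illustrate the advice about looking at multiple lags, the sketch below computes the sample autocorrelation at several lags for a made-up series that repeats every four steps. The function name `acf` is an assumption for this example and is not a library call.

```python
def acf(series, max_lag):
    """Sample autocorrelation at lags 0..max_lag (full-series mean and variance)."""
    n = len(series)
    mean = sum(series) / n
    denominator = sum((x - mean) ** 2 for x in series)
    return [
        sum((series[t] - mean) * (series[t - k] - mean) for t in range(k, n)) / denominator
        for k in range(max_lag + 1)
    ]


# Made-up seasonal series with period 4 (24 points)
seasonal = [1, 2, 3, 4] * 6
values = acf(seasonal, max_lag=8)
for lag, value in enumerate(values):
    print(f"lag {lag}: {value:+.3f}")
```

The largest value after lag 0 appears at lag 4, the length of the repeating pattern, even though the lag-1 value is close to zero. Looking only at lag 1 would miss the seasonality.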

```python
# Example of autocorrelation in time series data
import random

# Generate sample time series data
random.seed(42)

# Create a series with strong positive autocorrelation
def generate_autocorrelated_series(n=100, correlation_strength=0.8):
    series = [random.normalvariate(0, 1)]  # Start with random value
    for _ in range(n-1):
        # New value depends on previous value
        new_value = (correlation_strength * series[-1] + 
                    (1 - correlation_strength) * random.normalvariate(0, 1))
        series.append(new_value)
    return series

# Generate three different series
strong_autocorr = generate_autocorrelated_series(100, 0.8)
weak_autocorr = generate_autocorrelated_series(100, 0.2)
no_autocorr = [random.normalvariate(0, 1) for _ in range(100)]

# Calculate lag-1 autocorrelation
def calc_autocorrelation(series, lag=1):
    main = series[lag:]
    lagged = series[:-lag]

    # Remove mean
    main_mean = sum(main) / len(main)
    lagged_mean = sum(lagged) / len(lagged)

    # Calculate correlation
    numerator = sum((x - main_mean) * (y - lagged_mean) 
                   for x, y in zip(main, lagged))
    denominator = (sum((x - main_mean) ** 2 for x in main) * 
                  sum((x - lagged_mean) ** 2 for x in lagged)) ** 0.5

    return numerator / denominator

# Print results
print("Lag-1 Autocorrelation:")
print(f"Strong autocorrelation series: {calc_autocorrelation(strong_autocorr):.3f}")
print(f"Weak autocorrelation series: {calc_autocorrelation(weak_autocorr):.3f}")
print(f"No autocorrelation series: {calc_autocorrelation(no_autocorr):.3f}")

# Print first few values of each series
print("\nFirst 5 values of each series:")
print("Strong autocorr:", [f"{x:.3f}" for x in strong_autocorr[:5]])
print("Weak autocorr:", [f"{x:.3f}" for x in weak_autocorr[:5]])
print("No autocorr:", [f"{x:.3f}" for x in no_autocorr[:5]])
```