# pdtab: Comprehensive Tabulation Tutorial

This notebook walks through the main features of the pdtab library and shows how to reproduce the behavior of Stata's `tabulate` command in Python.

## Table of Contents
1. [Installation and Setup](#installation)
2. [Basic One-way Tabulation](#oneway)
3. [Two-way Cross-tabulation](#twoway)
4. [Statistical Tests](#tests)
5. [Summary Tabulation](#summary)
6. [Multiple Tables](#multiple)
7. [Immediate Tabulation](#immediate)
8. [Weighted Analysis](#weights)
9. [Visualization](#viz)
10. [Advanced Examples](#advanced)

## 1. Installation and Setup {#installation}

First, let's import the necessary libraries and create some sample data.

In [None]:
# Install pdtab (uncomment if needed)
# !pip install pdtab

import pandas as pd
import numpy as np
import pdtab

# Set random seed for reproducibility
np.random.seed(42)

print(f"pdtab version: {pdtab.__version__}")

In [None]:
# Create sample dataset
n = 200

data = {
    'gender': np.random.choice(['Male', 'Female'], n, p=[0.55, 0.45]),
    'education': np.random.choice(['High School', 'College', 'Graduate'], n, p=[0.4, 0.45, 0.15]),
    'region': np.random.choice(['North', 'South', 'East', 'West'], n),
    'age_group': np.random.choice(['18-29', '30-44', '45-59', '60+'], n, p=[0.25, 0.35, 0.25, 0.15]),
    'income': np.random.lognormal(10.8, 0.5, n),  # Log-normal distribution for income
    'satisfaction': np.random.choice([1, 2, 3, 4, 5], n, p=[0.1, 0.15, 0.3, 0.35, 0.1]),
    'treatment': np.random.choice(['Control', 'Treatment'], n),
    'outcome': np.random.choice(['Success', 'Failure'], n, p=[0.6, 0.4])
}

# Add some correlation between treatment and outcome
for i in range(n):
    if data['treatment'][i] == 'Treatment':
        data['outcome'][i] = np.random.choice(['Success', 'Failure'], p=[0.75, 0.25])

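df = pd.DataFrame(data)

# Add some missing values. This is done on the DataFrame because assigning None
# into a NumPy string array would store the literal string 'None' instead.
missing_indices = np.random.choice(n, 10, replace=False)
df.loc[missing_indices, 'education'] = np.nan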
print(f"Dataset shape: {df.shape}")
df.head()

## 2. Basic One-way Tabulation {#oneway}

Let's start with basic frequency tables for single variables.

In [None]:
# Basic one-way tabulation
result = pdtab.tabulate('gender', data=df)
print("Basic Gender Distribution:")
print(result)

In [None]:
# Sorted by frequency
result = pdtab.tabulate('education', data=df, sort=True)
print("Education Distribution (sorted by frequency):")
print(result)

In [None]:
# Include missing values
result = pdtab.tabulate('education', data=df, missing=True)
print("Education Distribution (including missing):")
print(result)

In [None]:
# Suppress frequencies, show only percentages
result = pdtab.tabulate('region', data=df, nofreq=True)
print("Region Distribution (percentages only):")
print(result)
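
For readers who want to see what a one-way table contains, here is a minimal pandas-only sketch of the frequency, percent, and cumulative-percent columns that Stata's `tabulate` reports. It does not call pdtab: the helper name `oneway_table` and the small example Series are illustrative assumptions, and pdtab's own output layout may differ.

```python
import pandas as pd

def oneway_table(values: pd.Series) -> pd.DataFrame:
    """Frequency, percent, and cumulative percent for one categorical variable."""
    # value_counts drops missing values by default; sort_index orders by category
    freq = values.value_counts().sort_index()
    percent = 100 * freq / freq.sum()
    return pd.DataFrame({"Freq": freq, "Percent": percent, "Cum": percent.cumsum()})

regions = pd.Series(["North", "South", "North", "East", "North", "South", None, "East"])
table = oneway_table(regions)
print(table.round(2))
```

The missing entry is excluded from the denominator, so the percentages are computed over the seven non-missing observations.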

## 3. Two-way Cross-tabulation {#twoway}

Now let's explore relationships between two categorical variables.

In [None]:
# Basic two-way table
result = pdtab.tabulate('gender', 'education', data=df)
print("Gender by Education Cross-tabulation:")
print(result)

In [None]:
# With row percentages
result = pdtab.tabulate('treatment', 'outcome', data=df, row=True)
print("Treatment by Outcome (with row percentages):")
print(result)

In [None]:
# With column percentages
result = pdtab.tabulate('gender', 'age_group', data=df, column=True)
print("Gender by Age Group (with column percentages):")
print(result)
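
As a cross-check on what row and column percentages mean, the sketch below builds them with plain `pd.crosstab` on a tiny hand-made dataset. The DataFrame `df_small` is an assumption made for illustration; the code does not use pdtab and only shows the arithmetic behind the `row` and `column` options.

```python
import pandas as pd

df_small = pd.DataFrame({
    "treatment": ["Control", "Control", "Control", "Control",
                  "Treatment", "Treatment", "Treatment", "Treatment"],
    "outcome":   ["Success", "Failure", "Failure", "Failure",
                  "Success", "Success", "Success", "Failure"],
})

# Raw cell counts
counts = pd.crosstab(df_small["treatment"], df_small["outcome"])

# normalize="index" divides each cell by its row total (row percentages)
row_pct = 100 * pd.crosstab(df_small["treatment"], df_small["outcome"], normalize="index")

# normalize="columns" divides each cell by its column total (column percentages)
col_pct = 100 * pd.crosstab(df_small["treatment"], df_small["outcome"], normalize="columns")

print(counts, row_pct, col_pct, sep="\n\n")
```

Row percentages sum to 100 across each row and answer "what share of each treatment group succeeded?"; column percentages sum to 100 down each column and answer "what share of the successes came from each group?".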

## 4. Statistical Tests {#tests}

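pdtab can report tests of independence (Pearson chi-square, Fisher's exact, and likelihood-ratio chi-square) and measures of association (Cramér's V, Goodman-Kruskal gamma, and Kendall's τb) alongside a two-way table.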

In [None]:
# Chi-square test for independence
result = pdtab.tabulate('treatment', 'outcome', data=df, chi2=True)
print("Treatment by Outcome with Chi-square Test:")
print(result)

if 'chi2' in result.statistics:
    chi2_stat = result.statistics['chi2']['statistic']
    p_value = result.statistics['chi2']['p_value']
    print(f"\nChi-square statistic: {chi2_stat:.4f}")
    print(f"P-value: {p_value:.4f}")
    print(f"Significant at α=0.05: {'Yes' if p_value < 0.05 else 'No'}")
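
To sanity-check a reported chi-square, the same quantities can be computed independently with SciPy. The sketch below uses a fixed 2×2 table rather than the simulated data. It assumes, as Stata's `tabulate` does, that the Pearson statistic is reported without Yates' continuity correction; whether pdtab makes the same choice is not confirmed here.

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[45, 25],
                     [35, 55]])

# correction=False requests the uncorrected Pearson statistic;
# SciPy would otherwise apply Yates' continuity correction to 2x2 tables.
chi2_stat, p_value, dof, expected = chi2_contingency(observed, correction=False)

# lambda_="log-likelihood" switches to the likelihood-ratio (G) statistic.
lr_stat, lr_p, _, _ = chi2_contingency(observed, correction=False,
                                       lambda_="log-likelihood")

print(f"Pearson chi2({dof}) = {chi2_stat:.4f}, p = {p_value:.4f}")
print(f"LR chi2({dof})      = {lr_stat:.4f}, p = {lr_p:.4f}")
print("Expected counts under independence:")
print(expected)
```

The expected counts are row total × column total / N for each cell; the Pearson statistic sums (observed − expected)² / expected over all cells.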

In [None]:
# Fisher's exact test
result = pdtab.tabulate('treatment', 'outcome', data=df, exact=True)
print("Treatment by Outcome with Fisher's Exact Test:")
print(result)

if 'exact' in result.statistics:
    p_exact = result.statistics['exact']['p_value']
    print(f"\nFisher's exact p-value: {p_exact:.4f}")
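
Fisher's exact test is most useful when expected cell counts are small and the chi-square approximation is doubtful. The following sketch, which uses SciPy rather than pdtab and a made-up 2×2 table, checks the expected counts and then runs the two-sided exact test.

```python
import numpy as np
from scipy.stats import fisher_exact

observed = np.array([[8, 2],
                     [1, 5]])

# Expected counts under independence: row total * column total / N
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()
print(f"Smallest expected count: {expected.min():.3f}")

# Two-sided test: sums the probabilities of all tables (with the same margins)
# that are no more likely than the observed one.
odds_ratio, p_exact = fisher_exact(observed, alternative="two-sided")
print(f"Sample odds ratio: {odds_ratio:.2f}")
print(f"Fisher's exact p-value: {p_exact:.4f}")
```

A common rule of thumb treats expected counts below 5 as a reason to prefer the exact test over the chi-square approximation.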

In [None]:
# Multiple association measures
result = pdtab.tabulate('gender', 'education', data=df, 
                       chi2=True, V=True, gamma=True, taub=True)
print("Gender by Education with Association Measures:")
print(result)

print("\nAssociation Measures:")
if 'cramers_v' in result.statistics:
    print(f"Cramér's V: {result.statistics['cramers_v']:.4f}")
if 'gamma' in result.statistics:
    gamma_val = result.statistics['gamma']['statistic']
    gamma_ase = result.statistics['gamma']['ase']
    print(f"Goodman-Kruskal Gamma: {gamma_val:.4f} (ASE: {gamma_ase:.4f})")
if 'taub' in result.statistics:
    taub_val = result.statistics['taub']['statistic']
    taub_ase = result.statistics['taub']['ase']
    print(f"Kendall's τb: {taub_val:.4f} (ASE: {taub_ase:.4f})")
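
Cramér's V rescales the chi-square statistic to a 0 to 1 range: V = sqrt(χ² / (N · min(r − 1, c − 1))). The helper below is a hypothetical stand-alone implementation written for this tutorial, not pdtab's internal code. It returns the unsigned magnitude, whereas Stata reports a signed value for 2×2 tables.

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(table) -> float:
    """Unsigned Cramér's V: sqrt(chi2 / (N * min(r - 1, c - 1)))."""
    observed = np.asarray(table, dtype=float)
    # Uncorrected Pearson chi-square (no Yates correction)
    chi2_stat = chi2_contingency(observed, correction=False)[0]
    n = observed.sum()
    k = min(observed.shape) - 1
    return float(np.sqrt(chi2_stat / (n * k)))

table_3x3 = [[30, 25, 20],
             [40, 35, 30],
             [20, 15, 25]]
print(f"Cramér's V: {cramers_v(table_3x3):.4f}")
```

Because V is derived from chi-square, it ignores category order. Gamma and Kendall's τb, by contrast, are only meaningful when both variables are ordinal, which is not the case for `gender`.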

## 5. Summary Tabulation {#summary}

Analyze continuous variables broken down by categorical variables.

In [None]:
# One-way summary tabulation
result = pdtab.tabulate('gender', data=df, summarize='income')
print("Income Summary by Gender:")
print(result)

In [None]:
# Two-way summary tabulation
result = pdtab.tabulate('gender', 'education', data=df, summarize='income')
print("Income Summary by Gender and Education:")
print(result)

In [None]:
# Custom statistics selection
result = pdtab.tabulate('age_group', data=df, summarize='satisfaction', 
                       means=True, standard=False, freq=True)
print("Satisfaction by Age Group (means and frequencies only):")
print(result)
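
A summary tabulation amounts to a grouped aggregation. As a rough pandas equivalent (not pdtab's implementation), the sketch below computes the mean, sample standard deviation, and count of a continuous variable within each category. The toy DataFrame and the column labels `Mean`, `StdDev`, and `Freq` are assumptions chosen for the example.

```python
import pandas as pd

df_small = pd.DataFrame({
    "gender": ["Female", "Female", "Female", "Male", "Male", "Male"],
    "income": [40_000.0, 50_000.0, 60_000.0, 30_000.0, 50_000.0, 70_000.0],
})

# Named aggregation: one output column per statistic.
# "std" is the sample standard deviation (divides by n - 1).
summary = df_small.groupby("gender")["income"].agg(
    Mean="mean", StdDev="std", Freq="count"
)
print(summary)
```

Here both groups share the same mean but differ in spread, which is why the standard deviation is worth reporting next to the mean.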

## 6. Multiple Tables {#multiple}

Generate several tables in one call: `tab1` produces a one-way table for each listed variable, and `tab2` produces a two-way table for every pair of listed variables.

In [None]:
# Multiple one-way tables
results = pdtab.tab1(['gender', 'education', 'region'], data=df)

print("Multiple One-way Tables:")
for variable, result in results.items():
    print(f"\n{variable.upper()}:")
    print(result)

In [None]:
# All possible two-way tables
results = pdtab.tab2(['gender', 'treatment', 'outcome'], data=df, chi2=True)

print("All Two-way Combinations:")
for (var1, var2), result in results.items():
    print(f"\n{var1.upper()} × {var2.upper()}:")
    print(result)
    
    # Show chi-square results if available
    if result.statistics and 'chi2' in result.statistics:
        chi2_p = result.statistics['chi2']['p_value']
        print(f"Chi-square p-value: {chi2_p:.4f}")
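
"All possible two-way tables" means one table per unordered pair of variables, so k variables yield k(k − 1)/2 tables. The sketch below illustrates that enumeration with `itertools.combinations` and `pd.crosstab`. The function `all_twoway` and the toy data are hypothetical and are not part of pdtab.

```python
from itertools import combinations

import pandas as pd

def all_twoway(data: pd.DataFrame, variables: list) -> dict:
    """Cross-tabulate every unordered pair of the given variables."""
    return {
        (a, b): pd.crosstab(data[a], data[b])
        for a, b in combinations(variables, 2)
    }

df_small = pd.DataFrame({
    "gender":    ["F", "M", "F", "M"],
    "treatment": ["Control", "Control", "Treatment", "Treatment"],
    "outcome":   ["Failure", "Success", "Success", "Success"],
})

tables = all_twoway(df_small, ["gender", "treatment", "outcome"])
for (var1, var2), table in tables.items():
    print(f"\n{var1} x {var2}:")
    print(table)
```

The number of tables grows quadratically with the number of variables, so when each table comes with a significance test it is worth keeping multiple comparisons in mind.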

## 7. Immediate Tabulation {#immediate}

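`tabi` tabulates cell counts that you enter directly, either as a Stata-style string with rows separated by `\` or as a nested list, so no DataFrame is needed.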

In [None]:
# 2×2 table from string (Stata format)
result = pdtab.tabi("45 25 \\ 35 55", exact=True, chi2=True)
print("2×2 Table Analysis:")
print(result)

print("\nStatistical Results:")
if 'exact' in result.statistics:
    print(f"Fisher's exact p-value: {result.statistics['exact']['p_value']:.4f}")
if 'chi2' in result.statistics:
    print(f"Chi-square p-value: {result.statistics['chi2']['p_value']:.4f}")
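
In Stata's `tabi` syntax a backslash separates the rows of the table, which is why the Python string above contains `\\` (an escaped single backslash). To make the format concrete, here is one way such a string could be turned into an array. `parse_tabi` is a hypothetical helper written for this tutorial and says nothing about how pdtab parses its input.

```python
import numpy as np

def parse_tabi(spec: str) -> np.ndarray:
    """Parse a Stata-style 'a b \\ c d' string into a 2-D integer array."""
    # Split into rows on the backslash, then into cells on whitespace
    rows = [[int(cell) for cell in row.split()] for row in spec.split("\\")]
    # Every row must have the same number of cells
    if len({len(row) for row in rows}) != 1 or not rows[0]:
        raise ValueError("all rows must be non-empty and have the same length")
    return np.array(rows)

table = parse_tabi("45 25 \\ 35 55")
print(table)
print("Row totals:", table.sum(axis=1))
print("Column totals:", table.sum(axis=0))
```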

In [None]:
# Larger table from list
table_data = [
    [30, 25, 20],
    [40, 35, 30],
    [20, 15, 25]
]

result = pdtab.tabi(table_data, chi2=True, V=True)
print("3×3 Table Analysis:")
print(result)

if 'cramers_v' in result.statistics:
    print(f"\nCramér's V: {result.statistics['cramers_v']:.4f}")

## 8. Weighted Analysis {#weights}

Perform weighted tabulation to account for sampling weights or importance weights.

In [None]:
# Add sampling weights to our data
df['sample_weight'] = np.random.uniform(0.5, 2.0, len(df))

# Weighted one-way tabulation
result_unweighted = pdtab.tabulate('region', data=df)
result_weighted = pdtab.tabulate('region', data=df, weights='sample_weight')

print("Region Distribution - Unweighted:")
print(result_unweighted)

print("\nRegion Distribution - Weighted:")
print(result_weighted)

In [None]:
# Weighted summary tabulation
result = pdtab.tabulate('gender', data=df, summarize='income', weights='sample_weight')
print("Weighted Income Summary by Gender:")
print(result)
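
Conceptually, a weighted frequency is the sum of the weights in each category, and a weighted mean is Σ(w·x) / Σw within each category. The pandas sketch below shows that arithmetic on a toy dataset. It treats the weights as simple multipliers, which is an assumption: it does not establish how pdtab interprets the `weights` argument (for example, as frequency or analytic weights).

```python
import pandas as pd

df_small = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "income": [10.0, 20.0, 30.0, 50.0],
    "w":      [1.0, 3.0, 2.0, 2.0],
})

# Weighted frequency: total weight per category
weighted_freq = df_small.groupby("region")["w"].sum()
weighted_pct = 100 * weighted_freq / weighted_freq.sum()

# Weighted mean: sum of w * x divided by sum of w, per category
weighted_sum = (df_small["income"] * df_small["w"]).groupby(df_small["region"]).sum()
weighted_mean = weighted_sum / weighted_freq

unweighted_mean = df_small.groupby("region")["income"].mean()

print(pd.DataFrame({
    "WeightedFreq": weighted_freq,
    "WeightedPct": weighted_pct,
    "WeightedMean": weighted_mean,
    "UnweightedMean": unweighted_mean,
}))
```

In the North group the heavier weight on the larger income pulls the weighted mean above the unweighted one; in the South group the weights are equal, so the two means coincide.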

## 9. Visualization {#viz}

Create plots directly from tabulation results.

In [None]:
# Bar chart for one-way table
result = pdtab.tabulate('education', data=df)

try:
    fig = pdtab.viz.create_tabulation_plots(result, plot_type='bar', 
                                           title='Education Distribution')
    fig.show()
except ImportError:
    print("Matplotlib not available for plotting")

In [None]:
# Heatmap for two-way table
result = pdtab.tabulate('gender', 'education', data=df)

try:
    fig = pdtab.viz.create_tabulation_plots(result, plot_type='heatmap',
                                           title='Gender by Education')
    fig.show()
except ImportError:
    print("Matplotlib/Seaborn not available for plotting")

In [None]:
# Association measures visualization
result = pdtab.tabulate('treatment', 'outcome', data=df, 
                       V=True, gamma=True, taub=True)

try:
    fig = pdtab.viz.create_tabulation_plots(result, plot_type='association')
    fig.show()
except ImportError:
    print("Matplotlib not available for plotting")

## 10. Advanced Examples {#advanced}

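The following examples combine the features above in three scenarios, all using the simulated dataset: a clinical trial, a market research survey, and a publication-ready table.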

In [None]:
# Clinical trial analysis
print("=== CLINICAL TRIAL ANALYSIS ===")
print()

# Comprehensive analysis with all tests
result = pdtab.tabulate('treatment', 'outcome', data=df,
                       chi2=True, exact=True, lrchi2=True, V=True,
                       row=True, expected=True)

print("Treatment Efficacy Analysis:")
print(result)

# Extract key statistics
stats = result.statistics
print("\n=== STATISTICAL SUMMARY ===")
if 'chi2' in stats:
    print(f"Pearson χ²: {stats['chi2']['statistic']:.4f} (p = {stats['chi2']['p_value']:.4f})")
if 'exact' in stats:
    print(f"Fisher's exact: p = {stats['exact']['p_value']:.4f}")
if 'lrchi2' in stats:
    print(f"LR χ²: {stats['lrchi2']['statistic']:.4f} (p = {stats['lrchi2']['p_value']:.4f})")
if 'cramers_v' in stats:
    print(f"Cramér's V: {stats['cramers_v']:.4f}")

# Calculate effect size (ratio of success rates, treatment vs. control)
cross_table = pd.crosstab(df['treatment'], df['outcome'])
if 'Success' in cross_table.columns and {'Treatment', 'Control'} <= set(cross_table.index):
    treat_success = cross_table.loc['Treatment', 'Success']
    treat_total = cross_table.loc['Treatment'].sum()
    control_success = cross_table.loc['Control', 'Success']
    control_total = cross_table.loc['Control'].sum()
    
    # The event counted here is 'Success', so these are success rates
    rate_treat = treat_success / treat_total
    rate_control = control_success / control_total
    rate_ratio = rate_treat / rate_control
    
    print(f"\nSuccess rate in treatment group: {rate_treat:.3f}")
    print(f"Success rate in control group: {rate_control:.3f}")
    print(f"Rate ratio (treatment / control): {rate_ratio:.3f}")
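
The ratio of the two group proportions computed above is a point estimate. A standard way to attach an approximate confidence interval is the log method: SE(ln RR) = sqrt(1/a − 1/n₁ + 1/c − 1/n₂), where a and c are the event counts and n₁ and n₂ the group sizes. The sketch below applies it to illustrative counts (75 of 100 versus 60 of 100), not to the simulated data. The function name `ratio_with_ci` is an assumption made for this example.

```python
import math
from statistics import NormalDist

def ratio_with_ci(a: int, n1: int, c: int, n2: int, level: float = 0.95):
    """Ratio of proportions (a/n1) / (c/n2) with a log-method confidence interval."""
    ratio = (a / n1) / (c / n2)
    # Standard error of ln(ratio) from the large-sample (delta method) formula
    se_log = math.sqrt(1 / a - 1 / n1 + 1 / c - 1 / n2)
    # Two-sided critical value, about 1.96 for a 95% interval
    z = NormalDist().inv_cdf(0.5 + level / 2)
    lower = math.exp(math.log(ratio) - z * se_log)
    upper = math.exp(math.log(ratio) + z * se_log)
    return ratio, se_log, lower, upper

ratio, se_log, lower, upper = ratio_with_ci(a=75, n1=100, c=60, n2=100)
print(f"Ratio: {ratio:.3f}, SE(log): {se_log:.3f}, 95% CI: [{lower:.3f}, {upper:.3f}]")
```

An interval that excludes 1 points to a difference between the groups at the chosen level; the approximation relies on reasonably large counts in both groups.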

In [None]:
# Market research analysis
print("=== MARKET RESEARCH ANALYSIS ===")
print()

# Multi-way analysis of satisfaction
print("1. Overall satisfaction distribution:")
result = pdtab.tabulate('satisfaction', data=df, sort=True)
print(result)

print("\n2. Satisfaction by demographics:")
result = pdtab.tabulate('gender', 'satisfaction', data=df, 
                       column=True, chi2=True)
print(result)

print("\n3. Income analysis by satisfaction level:")
result = pdtab.tabulate('satisfaction', data=df, summarize='income')
print(result)

print("\n4. Regional differences in satisfaction:")
results = pdtab.tab2(['region', 'satisfaction'], data=df, chi2=True)
for (var1, var2), result in results.items():
    print(f"\n{var1} × {var2}:")
    print(result)
    if 'chi2' in result.statistics:
        p_val = result.statistics['chi2']['p_value']
        significance = "significant" if p_val < 0.05 else "not significant"
        print(f"Association is {significance} (p = {p_val:.4f})")

In [None]:
# Publication-ready analysis with export
print("=== PUBLICATION-READY OUTPUT ===")
print()

result = pdtab.tabulate('gender', 'education', data=df,
                       chi2=True, exact=True, V=True)

print("Table 1: Educational Attainment by Gender")
print(result)

# Export options
print("\nExport formats available:")
print("1. Dictionary format:")
data_dict = result.to_dict()
print(f"   Keys: {list(data_dict.keys())}")

print("\n2. HTML format:")
html_output = result.to_html()
print(f"   HTML length: {len(html_output)} characters")

# Statistical reporting
stats = result.statistics
if stats:
    print("\n3. Statistical summary for manuscript:")
    if 'chi2' in stats:
        chi2_stat = stats['chi2']['statistic']
        chi2_p = stats['chi2']['p_value']
        df_val = stats['chi2']['df']
        print(f"   χ²({df_val}) = {chi2_stat:.3f}, p = {chi2_p:.3f}")
    
    if 'cramers_v' in stats:
        v_stat = stats['cramers_v']
        print(f"   Cramér's V = {v_stat:.3f}")
        
        # Effect size interpretation
        if v_stat < 0.1:
            effect_size = "negligible"
        elif v_stat < 0.3:
            effect_size = "small"
        elif v_stat < 0.5:
            effect_size = "medium"
        else:
            effect_size = "large"
        print(f"   Effect size: {effect_size}")

## Summary

This tutorial has demonstrated the comprehensive functionality of the pdtab library:

- ✅ **One-way tabulation** with frequencies, percentages, and sorting
- ✅ **Two-way cross-tabulation** with various percentage options
- ✅ **Statistical testing** including chi-square, Fisher's exact, and likelihood-ratio tests
- ✅ **Association measures** like Cramér's V, Gamma, and Kendall's τb
- ✅ **Summary tabulation** for continuous variables by categories
- ✅ **Multiple table generation** with tab1 and tab2 functions
- ✅ **Immediate tabulation** from direct data input
- ✅ **Weighted analysis** for complex sampling designs
- ✅ **Visualization** capabilities for publication-quality plots
- ✅ **Export options** for integration with other tools

The pdtab library carries Stata's `tabulate` workflow over to Python, keeping the familiar option names while building on the pandas ecosystem.

### Next Steps

- Explore the [full documentation](https://pdtab.readthedocs.io)
- Try pdtab with your own datasets
- Contribute to the project on [GitHub](https://github.com/pdtab/pdtab)
- Report issues or request features