# Distribution Plot Types Requirements

## Core Distribution Types

### 1. Univariate Distributions
- **Histogram**
  - Configurable bin width and count
  - Option for frequency or density display
  - Support for automatic bin optimization
  - Ability to overlay multiple histograms

- **Kernel Density Estimation (KDE)**
  - Adjustable bandwidth parameter
  - Multiple kernel function options (Gaussian, Epanechnikov, etc.)
  - Ability to overlay KDE on histograms

- **Box Plot**
  - Standard five-number summary (min, Q1, median, Q3, max)
  - Outlier detection and visualization
  - Support for notched box plots
  - Optional hybrid display combining box and violin plots

- **Violin Plot**
  - KDE-based density visualization
  - Symmetrical density display
  - Option to show/hide inner box plot
  - Configurable scale (width) options

### 2. Data Distribution Requirements
- Support for continuous numerical data
- Handling of discrete/categorical data
- Management of missing values
- Automatic detection of data type
- Support for weighted distributions
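
As a rough illustration of the missing-value and weighting requirements above, the sketch below drops non-finite observations before building a weighted, density-normalized histogram. The function name `weighted_histogram` and its return shape are assumptions made for this example, not part of any specified API.

```python
import numpy as np

def weighted_histogram(values, weights=None, bins=10):
    """Density histogram that ignores NaN/inf values and honours weights.

    Returns (density, bin_edges, n_dropped). Hypothetical helper.
    """
    values = np.asarray(values, dtype=float)
    if weights is None:
        weights = np.ones_like(values)
    weights = np.asarray(weights, dtype=float)

    # Keep only rows where both the value and its weight are finite
    keep = np.isfinite(values) & np.isfinite(weights)
    density, edges = np.histogram(
        values[keep], bins=bins, weights=weights[keep], density=True
    )
    return density, edges, int((~keep).sum())
```

Reporting the number of dropped observations alongside the plot is one way to keep missing-value handling visible to the reader.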

### 3. Multi-Distribution Support
- Side-by-side comparison capability
- Overlay support with transparency
- Grouped distribution displays
- Faceted/grid layout options

### 4. Edge Cases
- Handle extremely skewed distributions
- Support for multimodal distributions
- Management of long-tail distributions
- Zero-inflated data handling
- Boundary condition management (e.g., non-negative data)

### 5. Scale Requirements
- Support for different data scales (linear, log, symlog)
- Automatic scale selection based on data characteristics
- Custom scale transformations
- Axis limit handling
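
One possible custom scale transformation for data that spans zero is a symmetric log. The sketch below uses a smooth `sign(x) * log1p(|x| / linthresh)` form, which is an assumption chosen for simplicity and is not identical to the piecewise `symlog` scale in matplotlib; `symlog`, `symlog_inverse`, and `linthresh` are illustrative names.

```python
import numpy as np

def symlog(x, linthresh=1.0):
    """Smooth symmetric log: roughly linear near 0, logarithmic far from 0."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.log1p(np.abs(x) / linthresh)

def symlog_inverse(y, linthresh=1.0):
    """Inverse of symlog, used to map tick positions back to data units."""
    y = np.asarray(y, dtype=float)
    return np.sign(y) * linthresh * np.expm1(np.abs(y))
```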

### 6. Statistical Overlay Options
- Mean indicators
- Median lines
- Standard deviation bands
- Confidence intervals
- Percentile markers

# Statistical Requirements for Distribution Plots

## 1. Central Tendency Metrics
### Required Calculations
- Arithmetic mean
- Weighted mean
- Geometric mean
- Median
- Mode (including multimodal detection)
- Trimmed mean (configurable trim percentage)
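
The weighted, geometric, and trimmed means listed above can be sketched with NumPy alone. The helper names below are hypothetical; the geometric mean here is the plain version for strictly positive data, not the compounded-return form used for financial returns.

```python
import numpy as np

def weighted_mean(x, w):
    """Mean of x using non-negative weights w of the same length."""
    return float(np.average(np.asarray(x, float), weights=np.asarray(w, float)))

def geometric_mean(x):
    """Geometric mean; only defined for strictly positive values."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("geometric mean requires strictly positive data")
    return float(np.exp(np.mean(np.log(x))))

def trimmed_mean(x, trim=0.1):
    """Mean after dropping the lowest and highest `trim` fraction of points."""
    x = np.sort(np.asarray(x, dtype=float))
    k = int(trim * x.size)  # number of points cut from each tail
    return float(np.mean(x[k:x.size - k]))
```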

### Visual Representations
- Vertical/horizontal reference lines
- Annotated values with configurable precision
- Hover/tooltip information
- Optional confidence intervals around central tendency metrics

## 2. Dispersion Measures
### Core Statistics
- Standard deviation
- Variance
- Interquartile range (IQR)
- Mean absolute deviation
- Coefficient of variation
- Range (min-max)

### Visualization Features
- Standard deviation bands (±1, ±2, ±3 SD)
- Percentile bands
- Quantile ranges
- Configurable whisker lengths for box plots
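
A minimal sketch of how the band and whisker positions could be computed before drawing. `sd_bands` and `whisker_ends` are hypothetical helper names, and the default `whis=1.5` follows the usual box plot convention.

```python
import numpy as np

def sd_bands(x, ks=(1, 2, 3)):
    """Return {k: (mean - k*sd, mean + k*sd)} using the sample SD (ddof=1)."""
    x = np.asarray(x, dtype=float)
    m, s = x.mean(), x.std(ddof=1)
    return {k: (m - k * s, m + k * s) for k in ks}

def whisker_ends(x, whis=1.5):
    """Most extreme data points within `whis` * IQR of the quartiles."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    inside = x[(x >= q1 - whis * iqr) & (x <= q3 + whis * iqr)]
    return float(inside.min()), float(inside.max())
```

Points outside the returned whisker ends are the ones a box plot would draw individually as outliers.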

## 3. Distribution Shape Metrics
### Calculations
- Skewness
- Kurtosis
- Modality testing
- Normality tests
  - Shapiro-Wilk test
  - Anderson-Darling test
  - Kolmogorov-Smirnov test
  - Q-Q plot support (a graphical check rather than a formal test)
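
For the Q-Q plot item, the points to draw can be computed without a plotting library. The sketch below pairs sorted observations with standard normal quantiles at plotting positions `(i - 0.5) / n`; the function name `qq_points` and the choice of plotting positions are assumptions for illustration.

```python
import numpy as np
from statistics import NormalDist

def qq_points(x):
    """Return (theoretical, observed) quantiles for a normal Q-Q plot."""
    obs = np.sort(np.asarray(x, dtype=float))
    n = obs.size
    probs = (np.arange(1, n + 1) - 0.5) / n  # plotting positions in (0, 1)
    std_normal = NormalDist()
    theo = np.array([std_normal.inv_cdf(p) for p in probs])
    return theo, obs

# Usage: for roughly normal data the points fall near a straight line
# whose slope and intercept approximate the SD and mean.
rng = np.random.default_rng(0)
theo, obs = qq_points(rng.normal(5.0, 2.0, size=500))
slope, intercept = np.polyfit(theo, obs, 1)
```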

### Statistical Annotations
- Distribution type indicators
- Shape characteristic labels
- P-values for normality tests
- Critical values for statistical tests

## 4. Outlier Detection
### Methods
- IQR-based detection (Tukey's fences, 1.5 * IQR by default)
- Z-score method
- Modified Z-score (median/MAD based)
- Custom threshold definitions

### Visualization
- Highlighted outlier points
- Outlier summary statistics
- Optional outlier labeling
- Configurable outlier treatment

## 5. Comparative Statistics
### Between Groups
- T-tests
- Mann-Whitney U test
- Kolmogorov-Smirnov two-sample test
- Effect size calculations
  - Cohen's d
  - Hedges' g
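
The two effect sizes differ only by a small-sample correction. The sketch below uses the pooled sample standard deviation for Cohen's d and the common approximation `1 - 3 / (4(n1 + n2) - 9)` for the Hedges' g correction factor; the function names are illustrative.

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d with the pooled sample SD (ddof=1), valid for unequal n."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    na, nb = a.size, b.size
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return float((a.mean() - b.mean()) / np.sqrt(pooled_var))

def hedges_g(a, b):
    """Cohen's d shrunk by the approximate small-sample bias correction."""
    n = len(a) + len(b)
    return cohens_d(a, b) * (1 - 3 / (4 * n - 9))
```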

### Multiple Distributions
- ANOVA support
- Kruskal-Wallis test
- Multiple comparison corrections
  - Bonferroni
  - Holm-Bonferroni
  - False Discovery Rate (FDR) control, e.g. Benjamini-Hochberg
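
The three corrections can each be expressed as adjusted p-values. The following is a NumPy-only sketch with hypothetical function names; Benjamini-Hochberg is used here as one common FDR procedure.

```python
import numpy as np

def bonferroni(p):
    """Multiply every p-value by the number of tests, capped at 1."""
    p = np.asarray(p, dtype=float)
    return np.minimum(p * p.size, 1.0)

def holm(p):
    """Holm step-down: the k-th smallest p-value is scaled by (m - k + 1)."""
    p = np.asarray(p, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Running maximum keeps adjusted values non-decreasing in sorted order
    adj = np.maximum.accumulate(p[order] * (m - np.arange(m)))
    out = np.empty(m)
    out[order] = np.minimum(adj, 1.0)
    return out

def benjamini_hochberg(p):
    """Benjamini-Hochberg step-up adjustment for FDR control."""
    p = np.asarray(p, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    # Running minimum from the largest p-value downwards
    adj = np.minimum.accumulate(ranked[::-1])[::-1]
    out = np.empty(m)
    out[order] = np.minimum(adj, 1.0)
    return out
```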

## 6. Confidence Intervals
### Types
- Mean CI
- Median CI
- Proportion CI
- Quantile CI

### Configuration
- Configurable confidence levels
- Bootstrap CI support
- Asymptotic and exact methods
- Visual representation options
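
As one way to meet the configurable-level requirement, the sketch below shows a Wilson score interval for a proportion and a percentile bootstrap interval whose tails follow the requested confidence level. The names `wilson_ci` and `bootstrap_ci` are assumptions, and `stat` is assumed to accept an `axis` argument (as `np.median` and `np.mean` do).

```python
import math
import numpy as np
from statistics import NormalDist

def wilson_ci(successes, n, confidence=0.95):
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

def bootstrap_ci(x, stat=np.median, confidence=0.95, n_boot=2000, seed=0):
    """Percentile bootstrap interval for `stat` at the given confidence."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    idx = rng.integers(0, x.size, size=(n_boot, x.size))  # resample indices
    reps = stat(x[idx], axis=1)                           # one statistic per resample
    tail = (1 - confidence) / 2 * 100
    lo, hi = np.percentile(reps, [tail, 100 - tail])
    return float(lo), float(hi)
```

Deriving the percentile cut-offs from `confidence` avoids a mismatch between the requested level and hard-coded 2.5/97.5 bounds.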

## 7. Density Estimation Parameters
### KDE Configuration
- Bandwidth selection methods
  - Silverman's rule
  - Scott's rule
  - Cross-validation
- Kernel function options
- Boundary correction methods
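
To make the bandwidth and boundary items concrete, here is a small NumPy sketch of Silverman's rule of thumb and a Gaussian KDE with reflection at a lower bound. Reflection is only one of several boundary correction methods, and `silverman_bandwidth` and `kde_on_grid` are hypothetical names.

```python
import numpy as np

def silverman_bandwidth(x):
    """Silverman's rule of thumb: 0.9 * min(sd, IQR / 1.34) * n^(-1/5)."""
    x = np.asarray(x, dtype=float)
    q75, q25 = np.percentile(x, [75, 25])
    return 0.9 * min(x.std(ddof=1), (q75 - q25) / 1.34) * x.size ** (-0.2)

def kde_on_grid(x, grid, bw, lower=None):
    """Gaussian KDE evaluated on `grid`; reflects data at `lower` if given."""
    x = np.asarray(x, dtype=float)
    grid = np.asarray(grid, dtype=float)

    def dens(points):
        z = (grid[:, None] - points[None, :]) / bw
        return np.exp(-0.5 * z**2).sum(axis=1) / (x.size * bw * np.sqrt(2 * np.pi))

    d = dens(x)
    if lower is not None:
        d = d + dens(2 * lower - x)          # add the mirror image of the data
        d = np.where(grid < lower, 0.0, d)   # no density below the boundary
    return d
```

Without the reflection term, a KDE of non-negative data places some probability mass below zero, so the curve integrates to less than 1 over the valid range.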

### Histogram Statistics
- Optimal bin width calculations
  - Sturges' rule
  - Freedman-Diaconis rule
  - Scott's rule
- Density normalization options
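
NumPy already implements these three bin rules through the `bins` argument of `np.histogram_bin_edges`, so a first version could simply delegate to it. The sample data and variable names below are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=1000)  # synthetic sample for demonstration

# Number of bins chosen by each named rule
bin_counts = {
    rule: len(np.histogram_bin_edges(data, bins=rule)) - 1
    for rule in ("sturges", "fd", "scott")
}

# Density normalization: bar areas sum to 1
density, edges = np.histogram(data, bins="fd", density=True)
```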

## 8. Export Capabilities
### Statistical Summary
- Comprehensive statistics table
- Test results summary
- Configuration parameters
- Data quality metrics

### Format Options
- CSV export of calculations
- PNG/SVG of visualizations with annotations
- Statistical report generation
- Machine-readable JSON output
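
Result dictionaries built from NumPy values usually need converting to plain Python types before JSON serialization. The sketch below shows one way to do that; `to_jsonable` is a hypothetical helper and the `summary` content is made-up example data.

```python
import json
import numpy as np

def to_jsonable(obj):
    """Recursively convert NumPy scalars, arrays and tuples to plain types."""
    if isinstance(obj, dict):
        return {str(k): to_jsonable(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple, np.ndarray)):
        return [to_jsonable(v) for v in obj]
    if isinstance(obj, np.generic):
        return obj.item()  # e.g. np.float64 -> float
    return obj

summary = {
    "mean": np.float64(0.5),
    "ci": (np.float64(0.1), np.float64(0.9)),
    "n": np.int64(10),
}
payload = json.dumps(to_jsonable(summary), indent=2)
```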

In [11]:
import numpy as np
import pandas as pd
import scipy.stats as stats
from scipy.stats import norm, jarque_bera
from statsmodels.stats.diagnostic import lilliefors
from statsmodels.nonparametric.kde import KDEUnivariate
from scipy.stats import ks_2samp, ttest_ind, mannwhitneyu
import warnings
warnings.filterwarnings('ignore')

In [12]:
# Generate synthetic financial data
np.random.seed(42)
n_days = 1000

In [13]:

# Generate returns intended to show slight skewness and excess kurtosis
returns = np.random.normal(0.0005, 0.01, n_days)
# Cubic perturbation: with |returns| ~ 0.01 this term is only ~1e-7,
# so its effect on skewness is negligible
returns = returns + 0.1 * returns**3
returns = returns + np.random.standard_t(df=5, size=n_days) * 0.002  # Add fat tails


In [14]:

# Create price series
prices = 100 * np.exp(np.cumsum(returns))


In [15]:

# Create DataFrame
dates = pd.date_range(start='2022-01-01', periods=n_days, freq='B')
df = pd.DataFrame({
    'Date': dates,
    'Price': prices,
    'Returns': returns,
    'Volume': np.random.lognormal(10, 1, n_days),
    'Volatility': np.abs(returns) * np.sqrt(252)
})

In [23]:
def central_tendency(returns):
    """1. Central Tendency Metrics"""
    results = {
        'arithmetic_mean': np.mean(returns),
        'geometric_mean': np.exp(np.mean(np.log(1 + returns))) - 1,
        'median': np.median(returns),
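        # NOTE: for continuous data every value is typically unique, so
        # stats.mode returns an observed value (the smallest), not a density peak
        'mode': stats.mode(returns)[0],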
        'trimmed_mean_10': stats.trim_mean(returns, 0.1)
    }
    
    # Annualize returns
    results['annualized_return'] = results['arithmetic_mean'] * 252
    return results

In [25]:
returns = df['Returns']

In [26]:
central_tendency(returns=returns)

{'arithmetic_mean': np.float64(0.0006392321780688051),
 'geometric_mean': np.float64(0.0005889639302967264),
 'median': np.float64(0.0008171283617610226),
 'mode': np.float64(-0.03144776556281655),
 'trimmed_mean_10': np.float64(0.0005395784029374917),
 'annualized_return': np.float64(0.1610865088733389)}
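
The `mode` value above appears to be simply the smallest observation, which is what `stats.mode` tends to return when no value repeats. A rough alternative for continuous data is the midpoint of the most populated histogram bin, sketched below; `histogram_mode` is a hypothetical helper and the bin rule is an assumption.

```python
import numpy as np

def histogram_mode(x, bins="fd"):
    """Midpoint of the most populated histogram bin (crude mode estimate)."""
    counts, edges = np.histogram(np.asarray(x, dtype=float), bins=bins)
    i = int(np.argmax(counts))
    return float((edges[i] + edges[i + 1]) / 2)
```

The estimate depends on the bin rule, so it is best reported together with the bin width used.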

In [27]:
def dispersion_measures(returns):
    """2. Dispersion Measures"""
    results = {
        'std_dev': np.std(returns),
        'variance': np.var(returns),
        'ann_volatility': np.std(returns) * np.sqrt(252),
        'IQR': np.percentile(returns, 75) - np.percentile(returns, 25),
        'MAD': np.mean(np.abs(returns - np.mean(returns))),
        'CV': np.std(returns) / np.mean(returns) if np.mean(returns) != 0 else np.nan,
        'range': np.max(returns) - np.min(returns)
    }
    return results

In [28]:
dispersion_measures(returns)

{'std_dev': np.float64(0.01003272972615938),
 'variance': np.float64(0.00010065566575816206),
 'ann_volatility': np.float64(0.15926464695925724),
 'IQR': np.float64(0.01318178739240081),
 'MAD': np.float64(0.007988098926295593),
 'CV': np.float64(15.694969794026052),
 'range': np.float64(0.07155798888588491)}

In [29]:
def distribution_shape(returns):
    """3. Distribution Shape Metrics"""
    
    # Normality tests
    shapiro_stat, shapiro_p = stats.shapiro(returns)
    jb_stat, jb_p = jarque_bera(returns)
    ks_stat, ks_p = lilliefors(returns)
    
    results = {
        'skewness': stats.skew(returns),
        'kurtosis': stats.kurtosis(returns),
        'normality_tests': {
            'shapiro': {'statistic': shapiro_stat, 'p_value': shapiro_p},
            'jarque_bera': {'statistic': jb_stat, 'p_value': jb_p},
            'lilliefors': {'statistic': ks_stat, 'p_value': ks_p}
        }
    }
    return results

In [30]:
distribution_shape(returns)

{'skewness': np.float64(0.10146453416747092),
 'kurtosis': np.float64(0.1495143004468691),
 'normality_tests': {'shapiro': {'statistic': np.float64(0.9985892662344467),
   'p_value': np.float64(0.6138332197034445)},
  'jarque_bera': {'statistic': np.float64(2.6472805338918386),
   'p_value': np.float64(0.2661646259877677)},
  'lilliefors': {'statistic': np.float64(0.020163704968186646),
   'p_value': np.float64(0.4934888055353677)}}}

In [32]:
def detect_outliers(returns):
    """4. Outlier Detection"""
    
    # IQR method
    Q1 = np.percentile(returns, 25)
    Q3 = np.percentile(returns, 75)
    IQR = Q3 - Q1
    iqr_outliers = returns[(returns < Q1 - 1.5 * IQR) | (returns > Q3 + 1.5 * IQR)]
    
    # Z-score method
    z_scores = (returns - np.mean(returns)) / np.std(returns)
    z_outliers = returns[np.abs(z_scores) > 3]
    
    # Modified Z-score
    mad = np.median(np.abs(returns - np.median(returns)))
    modified_z = 0.6745 * (returns - np.median(returns)) / mad
    mod_z_outliers = returns[np.abs(modified_z) > 3.5]
    
    results = {
        'iqr_outliers_count': len(iqr_outliers),
        'z_score_outliers_count': len(z_outliers),
        'modified_z_outliers_count': len(mod_z_outliers),
        'outlier_percentage': len(z_outliers) / len(returns) * 100  # share flagged by the z-score rule only
    }
    return results

In [33]:
detect_outliers(returns)

{'iqr_outliers_count': 10,
 'z_score_outliers_count': 3,
 'modified_z_outliers_count': 1,
 'outlier_percentage': 0.3}

In [None]:
def comparative_statistics(returns, other_returns):
    """5. Comparative Statistics"""
    # Two-sample tests: means (t-test), ranks (Mann-Whitney U), whole distribution (KS)
    t_stat, t_p = ttest_ind(returns, other_returns)
    u_stat, u_p = mannwhitneyu(returns, other_returns)
    ks_stat, ks_p = ks_2samp(returns, other_returns)
    
    # Effect size (Cohen's d); averaging the two sample variances assumes equal group sizes
    pooled_std = np.sqrt((np.var(returns, ddof=1) + np.var(other_returns, ddof=1)) / 2)
    cohens_d = (np.mean(returns) - np.mean(other_returns)) / pooled_std
    
    results = {
        't_test': {'statistic': t_stat, 'p_value': t_p},
        'mann_whitney': {'statistic': u_stat, 'p_value': u_p},
        'ks_test': {'statistic': ks_stat, 'p_value': ks_p},
        'cohens_d': cohens_d
    }
    return results



In [None]:
def confidence_intervals(returns, confidence=0.95):
    """6. Confidence Intervals"""
    n = len(returns)
    mean = np.mean(returns)
    std_err = stats.sem(returns)
    
    # Mean CI (Student's t)
    mean_ci = stats.t.interval(confidence, n-1, mean, std_err)
    
    # Median CI (percentile bootstrap); the tails follow `confidence`
    bootstrap_medians = []
    for _ in range(1000):
        bootstrap_sample = np.random.choice(returns, size=n, replace=True)
        bootstrap_medians.append(np.median(bootstrap_sample))
    tail = (1 - confidence) / 2 * 100
    median_ci = np.percentile(bootstrap_medians, [tail, 100 - tail])
    
    results = {
        'mean_ci': mean_ci,
        'median_ci': median_ci,
        'std_error': std_err
    }
    return results
    


In [None]:
def density_estimation(returns):
    """7. Density Estimation Parameters"""
    
    # KDE
    kde = KDEUnivariate(returns)
    kde.fit()
    
    # Histogram bins using different rules
    n = len(returns)
    sturges_bins = int(np.ceil(np.log2(n)) + 1)
    fd_bins = int(np.ceil((np.max(returns) - np.min(returns)) / 
                (2 * stats.iqr(returns) / np.power(n, 1/3))))
    scott_bins = int(np.ceil((np.max(returns) - np.min(returns)) / 
                    (3.5 * np.std(returns) / np.power(n, 1/3))))
    
    results = {
        'kde_bandwidth': kde.bw,
        'bin_counts': {
            'sturges': sturges_bins,
            'freedman_diaconis': fd_bins,
            'scott': scott_bins
        }
    }
    return results

In [34]:
class FinancialAnalysis:
    def __init__(self, data):
        self.data = data
        self.returns = data['Returns']
    
    def central_tendency(self):
        """1. Central Tendency Metrics"""
        results = {
            'arithmetic_mean': np.mean(self.returns),
            'geometric_mean': np.exp(np.mean(np.log(1 + self.returns))) - 1,
            'median': np.median(self.returns),
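            # NOTE: for continuous data every value is typically unique, so
            # stats.mode returns an observed value (the smallest), not a density peak
            'mode': stats.mode(self.returns)[0],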
            'trimmed_mean_10': stats.trim_mean(self.returns, 0.1)
        }
        
        # Annualize returns
        results['annualized_return'] = results['arithmetic_mean'] * 252
        return results
    
    def dispersion_measures(self):
        """2. Dispersion Measures"""
        returns = self.returns
        results = {
            'std_dev': np.std(returns),
            'variance': np.var(returns),
            'ann_volatility': np.std(returns) * np.sqrt(252),
            'IQR': np.percentile(returns, 75) - np.percentile(returns, 25),
            'MAD': np.mean(np.abs(returns - np.mean(returns))),
            'CV': np.std(returns) / np.mean(returns) if np.mean(returns) != 0 else np.nan,
            'range': np.max(returns) - np.min(returns)
        }
        return results
    
    def distribution_shape(self):
        """3. Distribution Shape Metrics"""
        returns = self.returns
        
        # Normality tests
        shapiro_stat, shapiro_p = stats.shapiro(returns)
        jb_stat, jb_p = jarque_bera(returns)
        ks_stat, ks_p = lilliefors(returns)
        
        results = {
            'skewness': stats.skew(returns),
            'kurtosis': stats.kurtosis(returns),
            'normality_tests': {
                'shapiro': {'statistic': shapiro_stat, 'p_value': shapiro_p},
                'jarque_bera': {'statistic': jb_stat, 'p_value': jb_p},
                'lilliefors': {'statistic': ks_stat, 'p_value': ks_p}
            }
        }
        return results
    
    def detect_outliers(self):
        """4. Outlier Detection"""
        returns = self.returns
        
        # IQR method
        Q1 = np.percentile(returns, 25)
        Q3 = np.percentile(returns, 75)
        IQR = Q3 - Q1
        iqr_outliers = returns[(returns < Q1 - 1.5 * IQR) | (returns > Q3 + 1.5 * IQR)]
        
        # Z-score method
        z_scores = (returns - np.mean(returns)) / np.std(returns)
        z_outliers = returns[np.abs(z_scores) > 3]
        
        # Modified Z-score
        mad = np.median(np.abs(returns - np.median(returns)))
        modified_z = 0.6745 * (returns - np.median(returns)) / mad
        mod_z_outliers = returns[np.abs(modified_z) > 3.5]
        
        results = {
            'iqr_outliers_count': len(iqr_outliers),
            'z_score_outliers_count': len(z_outliers),
            'modified_z_outliers_count': len(mod_z_outliers),
            'outlier_percentage': len(z_outliers) / len(returns) * 100  # share flagged by the z-score rule only
        }
        return results
    
    def comparative_statistics(self, other_returns):
        """5. Comparative Statistics"""
        # Perform various statistical tests
        t_stat, t_p = ttest_ind(self.returns, other_returns)
        u_stat, u_p = mannwhitneyu(self.returns, other_returns)
        ks_stat, ks_p = ks_2samp(self.returns, other_returns)
        
        # Effect size (Cohen's d); averaging the two sample variances assumes equal group sizes
        pooled_std = np.sqrt((np.var(self.returns, ddof=1) + np.var(other_returns, ddof=1)) / 2)
        cohens_d = (np.mean(self.returns) - np.mean(other_returns)) / pooled_std
        
        results = {
            't_test': {'statistic': t_stat, 'p_value': t_p},
            'mann_whitney': {'statistic': u_stat, 'p_value': u_p},
            'ks_test': {'statistic': ks_stat, 'p_value': ks_p},
            'cohens_d': cohens_d
        }
        return results
    
    def confidence_intervals(self, confidence=0.95):
        """6. Confidence Intervals"""
        returns = self.returns
        n = len(returns)
        mean = np.mean(returns)
        std_err = stats.sem(returns)
        
        # Mean CI
        mean_ci = stats.t.interval(confidence, n-1, mean, std_err)
        
        # Median CI (percentile bootstrap); the tails follow `confidence`
        bootstrap_medians = []
        for _ in range(1000):
            bootstrap_sample = np.random.choice(returns, size=n, replace=True)
            bootstrap_medians.append(np.median(bootstrap_sample))
        tail = (1 - confidence) / 2 * 100
        median_ci = np.percentile(bootstrap_medians, [tail, 100 - tail])
        
        results = {
            'mean_ci': mean_ci,
            'median_ci': median_ci,
            'std_error': std_err
        }
        return results
    
    def density_estimation(self):
        """7. Density Estimation Parameters"""
        returns = self.returns
        
        # KDE
        kde = KDEUnivariate(returns)
        kde.fit()
        
        # Histogram bins using different rules
        n = len(returns)
        sturges_bins = int(np.ceil(np.log2(n)) + 1)
        fd_bins = int(np.ceil((np.max(returns) - np.min(returns)) / 
                    (2 * stats.iqr(returns) / np.power(n, 1/3))))
        scott_bins = int(np.ceil((np.max(returns) - np.min(returns)) / 
                      (3.5 * np.std(returns) / np.power(n, 1/3))))
        
        results = {
            'kde_bandwidth': kde.bw,
            'bin_counts': {
                'sturges': sturges_bins,
                'freedman_diaconis': fd_bins,
                'scott': scott_bins
            }
        }
        return results

In [35]:
# Create analysis instance
analysis = FinancialAnalysis(df)

In [36]:

# Run all analyses
results = {
    'central_tendency': analysis.central_tendency(),
    'dispersion': analysis.dispersion_measures(),
    'distribution_shape': analysis.distribution_shape(),
    'outliers': analysis.detect_outliers(),
    'comparative_stats': analysis.comparative_statistics(np.random.normal(0, 0.01, n_days)),
    'confidence_intervals': analysis.confidence_intervals(),
    'density_estimation': analysis.density_estimation()
}


In [37]:
# Print formatted results
def print_results(results):
    print("\nQuantitative Financial Analysis Results")
    print("="*40)
    
    for category, metrics in results.items():
        print(f"\n{category.upper()}")
        print("-"*40)
        
        if isinstance(metrics, dict):
            for metric, value in metrics.items():
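                if isinstance(value, dict):
                    # Only one nested level is formatted; deeper dicts
                    # (e.g. each normality test) are printed as raw dicts
                    print(f"\n{metric}:")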
                    for sub_metric, sub_value in value.items():
                        print(f"  {sub_metric}: {sub_value:.6f}" if isinstance(sub_value, float) 
                              else f"  {sub_metric}: {sub_value}")
                else:
                    print(f"{metric}: {value:.6f}" if isinstance(value, float) 
                          else f"{metric}: {value}")

print_results(results)


Quantitative Financial Analysis Results
========================================

CENTRAL_TENDENCY
----------------------------------------
arithmetic_mean: 0.000639
geometric_mean: 0.000589
median: 0.000817
mode: -0.031448
trimmed_mean_10: 0.000540
annualized_return: 0.161087

DISPERSION
----------------------------------------
std_dev: 0.010033
variance: 0.000101
ann_volatility: 0.159265
IQR: 0.013182
MAD: 0.007988
CV: 15.694970
range: 0.071558

DISTRIBUTION_SHAPE
----------------------------------------
skewness: 0.101465
kurtosis: 0.149514

normality_tests:
  shapiro: {'statistic': np.float64(0.9985892662344467), 'p_value': np.float64(0.6138332197034445)}
  jarque_bera: {'statistic': np.float64(2.6472805338918386), 'p_value': np.float64(0.2661646259877677)}
  lilliefors: {'statistic': np.float64(0.020163704968186646), 'p_value': np.float64(0.4934888055353677)}

OUTLIERS
----------------------------------------
iqr_outliers_count: 10
z_score_outliers_count: 3
modified_z_outliers_count: 1
outlier_percentage: 0.300000

COMP