
## Introduction

Data visualization is crucial for understanding distributions, patterns, and outliers in datasets. This guide explores four fundamental tools for summarizing the distribution of continuous data, covering their strengths, limitations, and practical applications, with interactive Plotly visualizations.

The four techniques covered are:
1. Boxplot
2. Frequency Table
3. Histogram
4. Density Plot


---

## Boxplot

### What is a Boxplot?

A boxplot (also called box-and-whisker plot) is a graphical representation of a dataset's distribution through quartiles. It displays the median, first quartile (Q1), third quartile (Q3), and outliers in a compact format.

### Key Components

- **Median Line**: The line inside the box represents the 50th percentile
- **Box**: Spans the interquartile range (IQR = Q3 − Q1), which contains the middle 50% of the data
- **Whiskers**: Lines extending from the box to the most extreme data points within 1.5×IQR of the box edges (the most common convention)
- **Outliers**: Individual points beyond whiskers
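
The components listed above can be computed by hand. The following is a minimal sketch, assuming the common Tukey convention (fences at 1.5×IQR beyond the quartiles); the `sample` array and all variable names are illustrative and not part of the guide's dataset. Plotting libraries may use a slightly different quartile method than `np.percentile`, so drawn boxes can differ marginally from these numbers.

```python
import numpy as np

# Illustrative sample (hypothetical): nine regular values and one extreme value
sample = np.array([12, 15, 17, 18, 20, 21, 23, 24, 26, 60], dtype=float)

# Box: first quartile, median, third quartile
q1, median, q3 = np.percentile(sample, [25, 50, 75])
iqr = q3 - q1

# Tukey fences: points outside these limits are drawn as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme observations that still lie inside the fences
inside = sample[(sample >= lower_fence) & (sample <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
outlier_points = sample[(sample < lower_fence) | (sample > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"Whiskers: {whisker_low} to {whisker_high}")
print(f"Outliers: {outlier_points.tolist()}")
```

Note that the whiskers stop at actual data values, not at the fences themselves.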

### Advantages

- **Quick distribution overview**: Instantly see median, spread, and skewness
- **Outlier detection**: Clearly identifies extreme values
- **Multiple dataset comparison**: Easily compare distributions side-by-side
- **Symmetry visibility**: Quickly detect if data is skewed
- **Compact representation**: Works well with limited space
- **Robust statistics**: Based on quartiles, resistant to extreme values

### Limitations

- **Loses detail**: Hides the actual distribution shape and bimodality
- **Not suitable for small samples**: Less informative with fewer than 10-15 data points
- **Arbitrary whisker definition**: Different conventions exist (1.5×IQR, std dev, percentiles)
- **Individual data points hidden**: Cannot see frequency of specific values
- **Assumes continuous data**: Less effective for discrete or categorical data
- **No information about data density**: Cannot determine if data is sparse or dense

### When to Use Boxplots

- Comparing distributions across multiple groups or categories
- Identifying outliers in a dataset
- Presenting statistical summaries to stakeholders
- Assessing symmetry and skewness of distributions
- Large datasets where individual points would clutter visualization
- Quick exploratory data analysis on multiple variables simultaneously

### Disadvantages vs Other Plots

- Inferior to histograms for understanding actual distribution shape
- Less detailed than density plots for smooth distribution visualization
- Cannot replace frequency tables when exact counts matter
- A bimodal distribution can produce the same box as a unimodal one, so multiple modes go unnoticed

---

```python
# Sample dataset creation for visualizations
import numpy as np
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
from scipy import stats

# Set random seed for reproducibility
np.random.seed(42)

# Generate diverse sample data - Bimodal distribution with outliers
data_main = np.concatenate([
    np.random.normal(loc=50, scale=10, size=1000),
    np.random.normal(loc=100, scale=12, size=800),
    np.random.uniform(low=20, high=130, size=100)
])

# Add some outliers for realism
outliers = np.array([150, 155, 160, 12, 8, 5])
data = np.concatenate([data_main, outliers])

# Create DataFrame
df = pd.DataFrame({
    'value': data,
    'group': np.random.choice(['Group A', 'Group B', 'Group C'], size=len(data))
})

print(f"Dataset shape: {df.shape}")
print(f"Data summary:\n{df['value'].describe()}")
```

```
Dataset shape: (1906, 2)
Data summary:
count    1906.000000
mean       72.808066
std        27.902383
min         5.000000
25%        49.126222
50%        63.924767
75%        98.987204
max       160.000000
Name: value, dtype: float64
```

```python
# Example 1: Interactive Boxplot with Plotly
# Create boxplot showing distribution and outliers
fig = go.Figure()

# Add boxplot
fig.add_trace(go.Box(
    y=df['value'],
    name='Distribution',
    boxmean='sd',  # Show mean and std dev
    marker=dict(color='rgba(102, 126, 234, 0.6)'),
    line=dict(color='rgba(102, 126, 234, 1)'),
    jitter=0.3,
    pointpos=-1.8,
    showlegend=True,
    hovertemplate='<b>Value</b>: %{y:.2f}<extra></extra>'
))

fig.update_layout(
    title='<b>Boxplot: Distribution Analysis with Outliers</b>',
    yaxis_title='Values',
    height=600,
    template='plotly_white',
    showlegend=True,
    hovermode='closest',
    font=dict(size=12)
)

fig.show()

# Boxplot by group
fig = px.box(df, x='group', y='value',
             title='<b>Boxplot Comparison: Multiple Groups</b>',
             labels={'value': 'Values', 'group': 'Groups'},
             color='group',
             hover_data={'value': ':.2f'})

fig.update_layout(height=600, template='plotly_white', font=dict(size=12))
fig.show()
```

---

## Frequency Table

### What is a Frequency Table?

A frequency table is a tabular display of data organized by categories or bins, showing how many observations fall into each category. It summarizes data distribution through counts and percentages.

### Structure

| Value/Bin | Frequency | Relative Frequency | Cumulative Frequency |
|-----------|-----------|-------------------|----------------------|
| Category 1 | 25 | 0.25 (25%) | 0.25 |
| Category 2 | 40 | 0.40 (40%) | 0.65 |
| Category 3 | 20 | 0.20 (20%) | 0.85 |
| Category 4 | 15 | 0.15 (15%) | 1.00 |
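
A table with the structure above can be built in a few lines of pandas. This is a minimal sketch under stated assumptions: the `scores` values and the bin `edges` are hypothetical and chosen only for illustration, and `right=False` makes each bin include its lower edge and exclude its upper edge.

```python
import pandas as pd

# Illustrative scores (hypothetical data)
scores = pd.Series([52, 58, 61, 64, 67, 70, 71, 73, 75, 78, 82, 85, 88, 91, 97])

# Bin edges are an assumption; right=False gives half-open bins [low, high)
edges = [50, 60, 70, 80, 90, 100]
binned = pd.cut(scores, bins=edges, right=False)

# Count observations per bin and keep the bins in ascending order
freq = binned.value_counts().sort_index()

table = pd.DataFrame({
    'Frequency': freq,
    'Relative Frequency': freq / freq.sum(),
    'Cumulative Frequency': freq.cumsum() / freq.sum(),
})
print(table)
```

For categorical data the binning step is unnecessary; calling `value_counts()` on the column directly gives the frequencies.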

### Advantages

- **Precise counts**: Exact number of observations per category
- **Easy interpretation**: Straightforward to understand and communicate
- **Foundation for other visualizations**: Basis for histograms and bar charts
- **Categorical data**: Natural fit for categorical or discrete data
- **Percentage calculation**: Easily compute proportions and percentages
- **Cumulative insights**: Can show cumulative frequencies for running totals
- **No information loss for discrete data**: Every observation is counted explicitly (binning continuous data still discards within-bin detail)
- **Statistical calculations**: Foundation for statistical analysis and chi-square tests

### Limitations

- **Not visual**: Requires reading numbers rather than visual pattern recognition
- **Difficult to perceive trends**: Hard to spot distribution shape from table alone
- **Space consuming**: Can be lengthy for many categories or continuous data
- **Bin selection dependency**: Results depend on how categories/bins are defined
- **Bimodality hidden**: Multiple peaks not immediately apparent
- **Outliers not highlighted**: Individual extreme values blend into table
- **Comparison challenges**: Difficult to compare multiple datasets in table format

### When to Use Frequency Tables

- Providing precise counts for statistical reports
- Continuous data organized into bins for analysis
- Categorical data with predefined categories
- Creating foundation data for other visualizations
- Academic or technical documentation requiring exact values
- When stakeholders need exact numbers, not estimates
- Compliance and regulatory reporting where precision matters
- Discrete data where each value's count is important

### Disadvantages vs Other Plots

- Lacks visual appeal and immediate pattern recognition of plots
- Cannot show smooth distribution like density plots
- Less effective for quick comparisons than boxplots
- Does not highlight outliers like boxplots do
- Requires binning decisions for continuous data, which affect interpretation (the same issue histograms have)

---

## Histogram

### What is a Histogram?

A histogram is a graphical representation of continuous data distribution using adjacent rectangular bars. The height of each bar represents the frequency of data falling within that interval (bin).

### Key Components

- **Bins**: Equal-width intervals partitioning the data range
- **Height**: Represents frequency or frequency density
- **X-axis**: Continuous variable values
- **Y-axis**: Frequency or density of observations
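
The difference between a frequency axis and a density axis can be checked numerically. The sketch below assumes synthetic normal data (not the guide's dataset) and uses `np.histogram`, whose `density=True` option rescales bar heights so that the total bar area is 1.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(loc=50, scale=10, size=1000)  # illustrative data

# Frequency histogram: bar height = number of observations in the bin
counts, edges = np.histogram(values, bins=20)

# Density histogram: bar height = count / (n * bin width)
density, _ = np.histogram(values, bins=20, density=True)
widths = np.diff(edges)

print(f"Sum of counts: {counts.sum()}")
print(f"Total bar area on the density scale: {(density * widths).sum():.6f}")
```

On the density scale, histograms with different sample sizes or bin widths become directly comparable, which is also what allows a density curve to be overlaid on a histogram.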

### Advantages

- **Distribution shape visibility**: Clearly shows the overall shape (normal, skewed, bimodal)
- **Pattern recognition**: Easily identify clusters, gaps, and concentration areas
- **Intuitive interpretation**: One of the most understood visualizations
- **Skewness detection**: Immediately apparent left or right skewness
- **Multimodality**: Can reveal multiple modes in data
- **Frequency understanding**: Visual representation of how data concentrates
- **Large dataset handling**: Effective for summarizing large datasets
- **Comparison ready**: Overlaid histograms can compare distributions

### Limitations

- **Bin selection dependency**: Results heavily influenced by bin width and starting point
- **Information loss**: Exact individual values hidden within bins
- **Appearance variability**: Same data can look different with different bin sizes
- **Biased interpretation**: Arbitrary binning can misrepresent distribution
- **Not for categorical data**: Requires continuous or at least orderable data
- **Overlapping challenges**: Multiple histograms overlap and obscure each other
- **Boundary artifacts**: Data near bin boundaries can appear artificially clustered
- **No individual outlier emphasis**: Outliers merge with surrounding data
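
One way to make the bin choice less arbitrary is to use an established bin-width rule. The sketch below, on synthetic data chosen only for illustration, compares three of the rules that `np.histogram_bin_edges` supports by name; which rule suits a given dataset remains a judgment call.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(loc=50, scale=10, size=1000)  # illustrative data

# Number of bins suggested by three common rules
n_bins = {}
for rule in ('sturges', 'scott', 'fd'):  # 'fd' is the Freedman-Diaconis rule
    edges = np.histogram_bin_edges(values, bins=rule)
    n_bins[rule] = len(edges) - 1
    print(f"{rule:>8}: {n_bins[rule]} bins, width {edges[1] - edges[0]:.2f}")
```

If the rules disagree strongly, plotting the histogram with each suggested bin count is a quick check of whether the visible shape is stable.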

### When to Use Histograms

- Understanding the distribution shape of continuous data
- Detecting normality or deviation from normal distribution
- Identifying multimodal distributions
- Analyzing skewness and kurtosis visually
- Comparing distributions of similar datasets
- Exploratory data analysis on new variables
- Understanding data concentration and spread
- Presenting data distribution to general audiences

### Disadvantages vs Other Plots

- Less precise than frequency tables (loses exact counts)
- Dependent on binning choices unlike boxplots
- Less smooth visualization than density plots
- Cannot compare multiple groups as easily as boxplots
- Individual values not preserved like in frequency tables

---

## Density Plot

### What is a Density Plot?

A density plot is a smoothed, continuous curve representation of data distribution. It estimates the probability density function of a continuous variable using kernel density estimation (KDE).

### Key Components

- **Smooth Curve**: Continuous line representing estimated probability density
- **Area Under Curve**: Total area equals 1 (or the total count if the curve is rescaled to frequencies)
- **Peak Height**: Indicates highest concentration area
- **X-axis**: Continuous variable values
- **Y-axis**: Density or probability density
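
Because the curve is a probability density, probabilities come from areas under it, not from its height. The following sketch uses `scipy.stats.gaussian_kde` on synthetic data (an assumption made for illustration) and its `integrate_box_1d` method to compute such areas.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
values = rng.normal(loc=50, scale=10, size=1000)  # illustrative data

kde = gaussian_kde(values)

# Total area under the estimated density
total_area = kde.integrate_box_1d(-np.inf, np.inf)

# Estimated probability that a value falls between 40 and 60
prob_40_60 = kde.integrate_box_1d(40, 60)

# The curve height is a density, so it is not itself a probability
peak_height = kde(np.array([50.0]))[0]

print(f"Total area: {total_area:.6f}")
print(f"P(40 <= X <= 60): {prob_40_60:.3f}")
print(f"Density at x=50: {peak_height:.4f}")
```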

### Advantages

- **Smooth visualization**: Non-jagged representation without binning artifacts
- **True distribution shape**: More accurate representation than histograms with arbitrary bins
- **Multimodality clarity**: Clearly shows multiple peaks without binning bias
- **Aesthetic appeal**: Visually pleasing and professional appearance
- **Bimodality detection**: Excellent for identifying subtle distribution features
- **Continuous representation**: No artificial boundaries from binning
- **Comparison friendly**: Overlaid density plots compare distributions elegantly
- **Statistical soundness**: Based on mathematical kernel estimation
- **Outlier context**: Shows where outliers fall relative to main distribution

### Limitations

- **Interpretation challenge**: Less intuitive for non-technical audiences
- **Artificial smoothing**: KDE smoothing can obscure true peaks and gaps
- **Parameter dependency**: Bandwidth selection affects appearance significantly
- **Edge effects**: Boundary distortion at data range extremes
- **Not exact counts**: Cannot determine precise frequency from curve
- **Oversmoothing risk**: Can hide important distribution details
- **Bandwidth selection subjectivity**: No universal agreement on optimal bandwidth
- **Less suitable for discrete data**: Smoothing inappropriate for categorical/discrete data
- **Missing raw data view**: Individual data points invisible
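
The bandwidth dependence can be made concrete by counting the peaks of the estimated curve at different bandwidths. This is a rough sketch on synthetic bimodal data; `count_peaks` is a hypothetical helper written for this example, and the `bw_method` values are arbitrary choices meant to show under- and oversmoothing.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
# Illustrative bimodal sample with modes near 50 and 100
bimodal = np.concatenate([rng.normal(50, 10, 500), rng.normal(100, 12, 500)])
grid = np.linspace(0, 150, 301)

def count_peaks(y):
    """Count strict local maxima in a sampled curve (hypothetical helper)."""
    return int(np.sum((y[1:-1] > y[:-2]) & (y[1:-1] > y[2:])))

factors = {}
peaks = {}
for bw in (0.05, 'scott', 1.0):  # small factor, default rule, large factor
    kde = gaussian_kde(bimodal, bw_method=bw)
    factors[bw] = kde.factor
    peaks[bw] = count_peaks(kde(grid))
    print(f"bw_method={bw!r}: factor={kde.factor:.4f}, peaks={peaks[bw]}")
```

A very small factor tends to produce many spurious peaks, while a very large one can merge the two real modes into a single bump.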

### When to Use Density Plots

- Visualizing smooth, continuous distributions
- Comparing multiple distributions elegantly
- Detecting multimodal distributions with confidence
- Professional statistical reports and publications
- Combining with other visualizations (like rugplots) to show individual values
- When distribution smoothness is more important than exact frequencies
- Time series or signal analysis visualizations
- Presenting to audiences familiar with probability density concepts

### Disadvantages vs Other Plots

- Less precise than frequency tables (estimates only)
- Cannot show exact values like histograms and frequency tables
- Smoothing can hide true distribution details
- Requires more technical understanding than simpler plots
- Undersmoothing (a bandwidth that is too small) can create spurious peaks
- Not suitable when exact counts matter

---

## Comparative Analysis

### Visual Comparison Table

| Aspect | Boxplot | Frequency Table | Histogram | Density Plot |
|--------|---------|-----------------|-----------|--------------|
| **Data Type** | Continuous | Any | Continuous | Continuous |
| **Shows Distribution Shape** | No | No | Yes | Yes |
| **Exact Counts** | No | Yes | No (binned) | No (estimated) |
| **Outlier Detection** | Excellent | Poor | Fair | Fair |
| **Multiple Groups** | Excellent | Difficult | Good | Excellent |
| **Space Efficient** | Yes | No | No | Yes |
| **Interpretation Ease** | High | High | High | Medium |
| **Small Sample Suitable** | No | Yes | No | No |
| **Bimodality Detection** | Poor | Fair | Good | Excellent |
| **Binning Dependent** | No | Optional | Yes | No (bandwidth dependent) |
| **Processing Speed** | Fast | Very Fast | Fast | Medium |
| **Statistical Robustness** | High | High | Medium | Medium |

### Key Differences

**Boxplot vs Histogram**
- Boxplot: Compact statistical summary (five-number summary), suited for comparison
- Histogram: Shows full distribution shape, requires more space

**Boxplot vs Density Plot**
- Boxplot: Discrete quartile-based summary
- Density Plot: Continuous probability density estimation

**Histogram vs Density Plot**
- Histogram: Depends on bin selection, per-bin counts visible
- Density Plot: Smooth, free of bins but dependent on bandwidth, aesthetically superior

**Frequency Table vs All Plots**
- Table: Precise numerical data, no visual pattern recognition
- Plots: Visual interpretation, pattern recognition, less precise

---

## When to Use Each Plot

### Use Boxplot When:
- Comparing distributions across many groups
- Outliers are important to identify and highlight
- Space is limited in publication
- Audience prefers statistical summaries
- Data quality concerns exist (robust to extremes)
- Quick visual comparison needed between datasets

### Use Frequency Table When:
- Exact counts are legally or scientifically required
- Data is categorical with discrete categories
- Readers need precise numbers for calculations
- Creating foundation for other statistical tests
- Reporting regulatory or compliance data
- Small dataset where all values can be shown
- Statistical analysis requiring raw frequencies

### Use Histogram When:
- Exploring new continuous data distribution
- Understanding data concentration and spread
- Checking for normality assumptions
- Identifying multiple modes or clusters
- Communicating with general audiences
- Data quality exploration and validation
- Teaching distribution concepts

### Use Density Plot When:
- Comparing multiple continuous distributions elegantly
- Multimodal distributions need clear representation
- Professional or academic publication required
- Smoothing is preferred over artificial binning
- Combining with other visualization techniques
- Probability concepts are familiar to audience
- High-quality visualization aesthetically important

---

## Python Code Examples

### Setup and Sample Data Generation

```python
import numpy as np
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
from scipy import stats

# Set random seed for reproducibility
np.random.seed(42)

# Generate diverse sample data - Bimodal distribution with outliers
data_main = np.concatenate([
    np.random.normal(loc=50, scale=10, size=1000),
    np.random.normal(loc=100, scale=12, size=800),
    np.random.uniform(low=20, high=130, size=100)
])

# Add some outliers for realism
outliers = np.array([150, 155, 160, 12, 8, 5])
data = np.concatenate([data_main, outliers])

# Create DataFrame
df = pd.DataFrame({
    'value': data,
    'group': np.random.choice(['Group A', 'Group B', 'Group C'], size=len(data))
})

print(f"Dataset shape: {df.shape}")
print(f"Data summary:\n{df['value'].describe()}")
```

### Example 1: Interactive Boxplot with Plotly

```python
# Create boxplot showing distribution and outliers
fig = go.Figure()

# Add boxplot
fig.add_trace(go.Box(
    y=df['value'],
    name='Distribution',
    boxmean='sd',  # Show mean and std dev
    marker=dict(color='rgba(102, 126, 234, 0.6)'),
    line=dict(color='rgba(102, 126, 234, 1)'),
    jitter=0.3,
    pointpos=-1.8,
    showlegend=True,
    hovertemplate='<b>Value</b>: %{y:.2f}<extra></extra>'
))

fig.update_layout(
    title='<b>Boxplot: Distribution Analysis with Outliers</b>',
    yaxis_title='Values',
    height=600,
    template='plotly_white',
    showlegend=True,
    hovermode='closest',
    font=dict(size=12)
)

fig.show()

# Boxplot by group
fig = px.box(df, x='group', y='value',
             title='<b>Boxplot Comparison: Multiple Groups</b>',
             labels={'value': 'Values', 'group': 'Groups'},
             color='group',
             hover_data={'value': ':.2f'})

# boxmean is a Box trace property rather than a px.box argument, so set it on the traces
fig.update_traces(boxmean='sd')
fig.update_layout(height=600, template='plotly_white', font=dict(size=12))
fig.show()
```

### Example 2: Frequency Table with Visualization

```python
# Create frequency table
n_bins = 15
counts, bin_edges = np.histogram(df['value'], bins=n_bins)
bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2

# Create DataFrame for frequency table
freq_df = pd.DataFrame({
    'Bin Range': [f"{bin_edges[i]:.1f}-{bin_edges[i+1]:.1f}" for i in range(len(bin_edges)-1)],
    'Frequency': counts,
    'Relative Frequency': counts / counts.sum(),
    'Cumulative Frequency': np.cumsum(counts) / counts.sum()
})

print("\nFrequency Table:\n")
print(freq_df.to_string(index=False))

# Visualize frequency table as bar chart
fig = go.Figure()

fig.add_trace(go.Bar(
    x=freq_df['Bin Range'],
    y=freq_df['Frequency'],
    name='Frequency',
    marker=dict(color='rgba(102, 126, 234, 0.7)'),
    hovertemplate='<b>Bin</b>: %{x}<br><b>Frequency</b>: %{y}<extra></extra>'
))

fig.add_trace(go.Scatter(
    x=freq_df['Bin Range'],
    y=freq_df['Cumulative Frequency'],
    name='Cumulative Frequency',
    yaxis='y2',
    mode='lines+markers',
    marker=dict(color='rgba(118, 75, 162, 0.8)', size=8),
    line=dict(color='rgba(118, 75, 162, 1)', width=2),
    hovertemplate='<b>Bin</b>: %{x}<br><b>Cumulative</b>: %{y:.2%}<extra></extra>'
))

fig.update_layout(
    title='<b>Frequency Table Visualization with Cumulative Distribution</b>',
    xaxis_title='Value Bins',
    yaxis_title='Frequency',
    yaxis2=dict(
        title='Cumulative Frequency',
        overlaying='y',
        side='right'
    ),
    height=600,
    hovermode='x unified',
    template='plotly_white',
    font=dict(size=11)
)

fig.show()
```

### Example 3: Histogram with Variable Bin Sizes

```python
# Demonstrate bin selection impact
# nbinsx is an upper bound: Plotly may choose fewer bins to get round bin edges
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=('Bins=10', 'Bins=30', 'Bins=50', 'Bins=100'),
    specs=[[{'type': 'histogram'}, {'type': 'histogram'}],
           [{'type': 'histogram'}, {'type': 'histogram'}]]
)

bin_choices = [10, 30, 50, 100]
positions = [(1,1), (1,2), (2,1), (2,2)]

for bins, (row, col) in zip(bin_choices, positions):
    fig.add_trace(
        go.Histogram(
            x=df['value'],
            nbinsx=bins,
            name=f'Bins={bins}',
            marker=dict(color='rgba(102, 126, 234, 0.6)'),
            showlegend=False,
            hovertemplate='<b>Range</b>: %{x}<br><b>Count</b>: %{y}<extra></extra>'
        ),
        row=row, col=col
    )

fig.update_xaxes(title_text='Values')
fig.update_yaxes(title_text='Frequency')

fig.update_layout(
    title_text='<b>Impact of Bin Selection on Histogram Appearance</b>',
    height=800,
    showlegend=False,
    template='plotly_white',
    font=dict(size=11)
)

fig.show()

# Single detailed histogram
fig = go.Figure()

fig.add_trace(go.Histogram(
    x=df['value'],
    nbinsx=30,
    name='Histogram',
    marker=dict(color='rgba(102, 126, 234, 0.7)', line=dict(color='rgba(102, 126, 234, 1)', width=1)),
    hovertemplate='<b>Value Range</b>: %{x}<br><b>Frequency</b>: %{y}<extra></extra>'
))

fig.update_layout(
    title='<b>Histogram: Distribution Shape and Frequency</b>',
    xaxis_title='Values',
    yaxis_title='Frequency',
    height=600,
    template='plotly_white',
    font=dict(size=12),
    showlegend=True
)

fig.show()
```

### Example 4: Density Plot with Advanced Features

```python
# Create density plot with KDE
from scipy.stats import gaussian_kde

# Calculate density
kde = gaussian_kde(df['value'])
x_range = np.linspace(df['value'].min() - 5, df['value'].max() + 5, 300)
density = kde(x_range)

fig = go.Figure()

# Add density plot
fig.add_trace(go.Scatter(
    x=x_range,
    y=density,
    name='Density Curve',
    mode='lines',
    line=dict(color='rgba(102, 126, 234, 1)', width=3),
    fill='tozeroy',
    fillcolor='rgba(102, 126, 234, 0.3)',
    hovertemplate='<b>Value</b>: %{x:.2f}<br><b>Density</b>: %{y:.4f}<extra></extra>'
))

# Add rug plot (individual data points)
fig.add_trace(go.Scatter(
    x=df['value'],
    y=np.zeros_like(df['value']),
    mode='markers',
    name='Individual Values',
    # 'line-ns-open' draws short vertical ticks, the usual rug-plot marker
    marker=dict(size=4, color='rgba(118, 75, 162, 0.5)', symbol='line-ns-open'),
    hovertemplate='<b>Value</b>: %{x:.2f}<extra></extra>',
    showlegend=True
))

# Add mean line
mean_val = df['value'].mean()
fig.add_vline(x=mean_val, line_dash="dash", line_color="red", 
              annotation_text="Mean", annotation_position="top right")

# Add median line
median_val = df['value'].median()
fig.add_vline(x=median_val, line_dash="dot", line_color="green",
              annotation_text="Median", annotation_position="top left")

fig.update_layout(
    title='<b>Density Plot: Smooth Distribution with Rug Plot and Statistics</b>',
    xaxis_title='Values',
    yaxis_title='Probability Density',
    height=600,
    template='plotly_white',
    hovermode='x unified',
    font=dict(size=12)
)

fig.show()

# Overlaid density plots by group
fig = go.Figure()

for group in df['group'].unique():
    group_data = df[df['group'] == group]['value']
    kde_group = gaussian_kde(group_data)
    x_range_group = np.linspace(df['value'].min() - 5, df['value'].max() + 5, 300)
    
    fig.add_trace(go.Scatter(
        x=x_range_group,
        y=kde_group(x_range_group),
        name=group,
        mode='lines',
        line=dict(width=3),
        fill='tozeroy',
        hovertemplate='<b>Value</b>: %{x:.2f}<br><b>Density</b>: %{y:.4f}<extra></extra>'
    ))

fig.update_layout(
    title='<b>Overlaid Density Plots: Group Comparison</b>',
    xaxis_title='Values',
    yaxis_title='Probability Density',
    height=600,
    template='plotly_white',
    hovermode='x unified',
    font=dict(size=12)
)

fig.show()
```

### Example 5: Comprehensive Comparison Dashboard

```python
# Create a comprehensive subplot showing all four visualizations
from scipy.stats import gaussian_kde

fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=('Boxplot', 'Histogram', 'Frequency Table', 'Density Plot'),
    specs=[[{'type': 'box'}, {'type': 'histogram'}],
           [{'type': 'bar'}, {'type': 'scatter'}]],
    vertical_spacing=0.15,
    horizontal_spacing=0.12
)

# 1. Boxplot
fig.add_trace(
    go.Box(y=df['value'], name='Boxplot', marker=dict(color='rgba(102, 126, 234, 0.6)'),
           showlegend=False, hovertemplate='<b>Value</b>: %{y:.2f}<extra></extra>'),
    row=1, col=1
)

# 2. Histogram
fig.add_trace(
    go.Histogram(x=df['value'], nbinsx=30, name='Histogram', 
                 marker=dict(color='rgba(102, 126, 234, 0.6)'), showlegend=False,
                 hovertemplate='<b>Range</b>: %{x}<br><b>Count</b>: %{y}<extra></extra>'),
    row=1, col=2
)

# 3. Frequency Table as Bar
n_bins = 12
counts, bin_edges = np.histogram(df['value'], bins=n_bins)
bin_labels = [f"{bin_edges[i]:.0f}-{bin_edges[i+1]:.0f}" for i in range(len(bin_edges)-1)]

fig.add_trace(
    go.Bar(x=bin_labels, y=counts, name='Frequency', 
           marker=dict(color='rgba(118, 75, 162, 0.6)'), showlegend=False,
           hovertemplate='<b>Bin</b>: %{x}<br><b>Frequency</b>: %{y}<extra></extra>'),
    row=2, col=1
)

# 4. Density Plot
kde = gaussian_kde(df['value'])
x_range = np.linspace(df['value'].min() - 5, df['value'].max() + 5, 300)
density = kde(x_range)

fig.add_trace(
    go.Scatter(x=x_range, y=density, name='Density', mode='lines',
               line=dict(color='rgba(118, 75, 162, 0.8)', width=3),
               fill='tozeroy', fillcolor='rgba(118, 75, 162, 0.3)', showlegend=False,
               hovertemplate='<b>Value</b>: %{x:.2f}<br><b>Density</b>: %{y:.4f}<extra></extra>'),
    row=2, col=2
)

fig.update_xaxes(title_text='', row=1, col=1)
fig.update_xaxes(title_text='Values', row=1, col=2)
fig.update_xaxes(title_text='Bins', row=2, col=1)
fig.update_xaxes(title_text='Values', row=2, col=2)

fig.update_yaxes(title_text='', row=1, col=1)
fig.update_yaxes(title_text='Frequency', row=1, col=2)
fig.update_yaxes(title_text='Frequency', row=2, col=1)
fig.update_yaxes(title_text='Density', row=2, col=2)

fig.update_layout(
    title_text='<b>Comprehensive Statistical Visualization Comparison</b>',
    height=900,
    showlegend=False,
    template='plotly_white',
    font=dict(size=11)
)

fig.show()
```

### Example 6: Statistical Analysis and Insights

```python
# Generate detailed statistical insights
def analyze_data(data):
    """Generate comprehensive statistical analysis"""
    
    analysis = {
        'Count': len(data),
        'Mean': np.mean(data),
        'Median': np.median(data),
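        # For continuous data most values are unique, so the mode is rarely informative
        'Mode': stats.mode(data, keepdims=True).mode[0],
        # np.std and np.var use the population formulas (ddof=0) by default,
        # so they differ slightly from pandas' describe(), which uses ddof=1
        'Std Dev': np.std(data),
        'Variance': np.var(data),
        'Min': np.min(data),
        'Max': np.max(data),
        'Q1 (25%)': np.percentile(data, 25),
        'Q3 (75%)': np.percentile(data, 75),
        'IQR': np.percentile(data, 75) - np.percentile(data, 25),
        'Skewness': stats.skew(data),
        # Excess (Fisher) kurtosis: a normal distribution scores 0
        'Kurtosis': stats.kurtosis(data),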
        'Range': np.max(data) - np.min(data)
    }
    
    return analysis

stats_data = analyze_data(df['value'])

print("\nDetailed Statistical Analysis:\n")
for key, value in stats_data.items():
    print(f"{key:.<25} {value:>12.4f}")

# Identify outliers using IQR method
Q1 = np.percentile(df['value'], 25)
Q3 = np.percentile(df['value'], 75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = df[(df['value'] < lower_bound) | (df['value'] > upper_bound)]['value']

print("\nOutlier Detection (IQR Method):")
print(f"Lower Bound: {lower_bound:.2f}")
print(f"Upper Bound: {upper_bound:.2f}")
print(f"Number of Outliers: {len(outliers)}")
print(f"Outlier Values: {sorted(outliers.values)}")

# Normality test
stat, p_value = stats.normaltest(df['value'])
print("\nNormality Test (D'Agostino-Pearson):")
print(f"Test Statistic: {stat:.4f}")
print(f"P-value: {p_value:.6f}")
# A large p-value means normality cannot be rejected; it does not prove the data is normal
print(f"Reject normality at α=0.05? {'Yes' if p_value < 0.05 else 'No'}")
```

### Example 7: Distribution by Groups with All Four Methods

```python
# Create comprehensive group analysis
from scipy.stats import gaussian_kde

groups = df['group'].unique()

# Boxplot comparison
fig = px.box(df, x='group', y='value', 
             title='<b>Boxplot by Group: Outliers and Distribution Comparison</b>',
             color='group', points='all')
fig.update_layout(height=600, template='plotly_white')
fig.show()

# Overlaid histograms
fig = go.Figure()
for group in groups:
    fig.add_trace(go.Histogram(
        x=df[df['group'] == group]['value'],
        name=group,
        opacity=0.7,
        nbinsx=20
    ))
fig.update_layout(
    title='<b>Histograms by Group: Distribution Shape Comparison</b>',
    xaxis_title='Values',
    yaxis_title='Frequency',
    barmode='overlay',
    height=600,
    template='plotly_white'
)
fig.show()

# Overlaid density plots
fig = go.Figure()
for group in groups:
    group_data = df[df['group'] == group]['value']
    kde = gaussian_kde(group_data)
    x_range = np.linspace(df['value'].min() - 5, df['value'].max() + 5, 300)
    
    fig.add_trace(go.Scatter(
        x=x_range,
        y=kde(x_range),
        name=group,
        mode='lines',
        fill='tozeroy',
        line=dict(width=2)
    ))

fig.update_layout(
    title='<b>Density Plots by Group: Smooth Distribution Comparison</b>',
    xaxis_title='Values',
    yaxis_title='Probability Density',
    height=600,
    template='plotly_white'
)
fig.show()

# Frequency tables by group
print("\nFrequency Tables by Group:\n")
for group in groups:
    group_data = df[df['group'] == group]['value']
    counts, bin_edges = np.histogram(group_data, bins=10)
    bin_labels = [f"{bin_edges[i]:.1f}-{bin_edges[i+1]:.1f}" for i in range(len(bin_edges)-1)]
    
    freq_table = pd.DataFrame({
        'Bin': bin_labels,
        'Frequency': counts,
        'Rel. Freq': counts / counts.sum(),
        'Cum. Freq': np.cumsum(counts) / counts.sum()
    })
    
    print(f"\n{group}:")
    print(freq_table.to_string(index=False))
```

---

## Key Takeaways and Recommendations

### Decision Matrix: Which Plot to Choose?

**Choose Boxplot if you need to:**
- Compare distributions across multiple groups quickly
- Identify and highlight outliers explicitly
- Work with limited space (publications, dashboards)
- Provide statistical summaries to non-technical stakeholders
- Detect skewness and symmetry visually

**Choose Frequency Table if you need to:**
- Provide exact, precise counts (compliance, legal)
- Enable statistical calculations from raw data
- Work with categorical data with predefined classes
- Allow readers to verify exact numbers
- Create foundation for statistical tests (chi-square)
- Document discrete data precisely

**Choose Histogram if you need to:**
- Explore and understand continuous data distribution
- Detect multimodal distributions and clusters
- Identify data quality issues and gaps
- Check normality assumptions
- Teach data distribution concepts to audiences
- Balance visual interpretation with reasonable accuracy

**Choose Density Plot if you need to:**
- Create professional, publication-quality visualizations
- Compare multiple distributions elegantly
- Detect subtle distribution features and multimodality
- Avoid arbitrary binning decisions
- Communicate probability concepts
- Create aesthetic, modern visualizations
- Combine with other plots (rug plots, histograms)

### When NOT to Use Each Plot

**Avoid Boxplot when:**
- Your audience needs to understand exact distribution shape
- Sample size is very small (< 10 observations)
- You need to preserve exact frequency information
- Individual data points are important to see
- Data has obvious bimodality you want to emphasize

**Avoid Frequency Table when:**
- Visual pattern recognition is needed immediately
- Large number of categories makes table unwieldy
- Continuous data requires binning decisions
- Your audience prefers visual communication
- Comparing multiple datasets side-by-side

**Avoid Histogram when:**
- Your audience cannot accept binning artifacts
- Exact counts per bin are not important and a smooth curve would communicate the shape better
- Professional smoothness is required
- Overlaying many distributions (becomes cluttered)
- The data is expected to follow a smooth, continuous distribution that bins would break up

**Avoid Density Plot when:**
- Exact frequencies need to be communicated
- Your audience isn't familiar with probability density
- Individual data points need explicit visibility
- Discrete or categorical data
- Edge effects near data boundaries are problematic
- Bandwidth selection adds unwanted variability

### Best Practices

1. **Combine Visualizations**: Use multiple plots together. A histogram with an overlaid density curve provides both detail and smoothness. A boxplot next to a density plot shows both summary and shape.

2. **Know Your Audience**: Boxplots and histograms for general audiences; density plots and frequency tables for technical audiences.

3. **Always Check Assumptions**: Before choosing a visualization, understand your data type, sample size, and distribution characteristics.

4. **Bin Carefully**: If using histograms or frequency tables, validate that your binning choice isn't distorting the underlying patterns.

5. **Label Thoroughly**: Regardless of plot type, ensure axes, titles, legends, and hover information are clear and complete.

6. **Provide Context**: Always accompany visualizations with statistical summaries (mean, median, std dev, count) for complete picture.

7. **Use Interactivity**: Plotly's hover, zoom, and pan features enhance exploration. Use these to reveal detailed information without cluttering the plot.

8. **Color Strategically**: Use color to distinguish groups or highlight important features, but avoid overwhelming the visualization.

9. **Multiple Comparisons**: For comparing many groups, boxplots are superior; for a few groups, overlaid density plots or histograms work well.

10. **Test Transformations**: If data is heavily skewed, consider logarithmic transformation before visualization to improve interpretability.
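
The effect of a logarithmic transformation on skewed data can be checked with a skewness statistic before plotting. This is a minimal sketch that assumes synthetic log-normal data (strictly positive, which the logarithm requires); the parameter values are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Illustrative right-skewed, strictly positive data
skewed = rng.lognormal(mean=3.0, sigma=1.0, size=5000)

skew_raw = stats.skew(skewed)
skew_log = stats.skew(np.log(skewed))

print(f"Skewness before log transform: {skew_raw:.2f}")
print(f"Skewness after log transform:  {skew_log:.2f}")
```

When a transformed variable is plotted, the axis label should state the transformation so readers do not misread the scale.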

---

## Conclusion

Each visualization technique has unique strengths:

- **Boxplots** excel at summary statistics and outlier detection across groups
- **Frequency Tables** provide precise, verifiable counts
- **Histograms** reveal distribution shape with bin-dependent flexibility
- **Density Plots** deliver smooth, aesthetic representations of probability distributions

The choice depends on your data characteristics, audience, and objectives. Often, the best approach combines multiple visualizations to leverage each one's strengths while compensating for weaknesses. Use this guide to select and create effective statistical visualizations that let your data tell a compelling story.