## Data Visualization
### Visualizing Distributions

- Xianli Zeng
- School of Economics, Xiamen University
- April 7, 2025

Today's Agenda

- Visualizing distributions

    - A single distribution or comparison of multiple distributions

    - Various techniques and important considerations

- Histograms and density plots

- Boxplots, Violin plots and Raincloud plots

- Ridgeline Plots

### Today's Data: penguins
<div>
<img src="./penguins.jpg" width="300"/>
</div>

Data source: https://gist.github.com/slopp/ce3b90b9168f2f921784de84fa445651


- Source: Palmer Penguins (2007–2009, Dr. Kristen Gorman) – a built-in dataset in seaborn

- Size: 344 penguins, 8 variables

- Categorical variables: Species (Adelie, Chinstrap, Gentoo), Island (Biscoe, Dream, Torgersen), Sex (Male/Female)

- Numeric variables: Bill length/depth (mm), flipper length (mm), body mass (g), Year (2007–2009)

#### Data Types

- Categorical variables: Species, island, sex (nominal); Year (ordinal discrete)

- Numeric variables: Bill length, depth, flipper length, body mass (continuous)

- Date/time: None (year treated as numeric)


Three ways to load data:

1. Direct loading:
    - Download the dataset from the link above
    - Use pd.read_csv()


2. From the seaborn package
    - import seaborn as sns
    - penguins = sns.load_dataset("penguins")


3. From the palmerpenguins package
    - Install the palmerpenguins package
    - Import the load_penguins function
    - penguins = load_penguins()

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from palmerpenguins import load_penguins
import os

# Load the penguins dataset
penguins = load_penguins()
penguins = penguins.dropna()

In [None]:
penguins.head(20)

### Contents today
1. ***histograms***
2. density plots
3. box plots
4. violin plots
5. raincloud plots
6. ridgeline plots

### 1. Histograms
<pre>
plt.hist()
</pre>


- Suitable for continuous numeric variables, histograms reflect the frequency distribution of values



- Divide the numeric range into intervals (bins) and count how many data points fall into each



- The choice of bin width significantly affects the appearance of the histogram
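To make the binning step concrete, here is a minimal sketch that counts values per bin by hand with NumPy. The ten values and the four bin edges are made up for illustration; they are not taken from the penguins data.

```python
import numpy as np

# Made-up measurements (illustration only)
values = np.array([2.1, 2.4, 2.9, 3.3, 3.8, 4.0, 4.6, 4.7, 5.2, 5.9])

# Four equal-width bins covering [2, 6]: edges at 2, 3, 4, 5, 6
edges = np.linspace(2, 6, 5)

# np.histogram returns the number of values falling into each bin.
# Every bin is half-open [left, right), except the last one,
# which also includes its right edge.
counts, _ = np.histogram(values, bins=edges)

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"bin from {left:.0f} to {right:.0f}: {n} values")
```

plt.hist() performs the same counting internally and then draws one bar per bin.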



#### Example: Histogram of penguin body mass

A histogram showing the frequency distribution of all penguin body masses, using the default bin setting. The x-axis represents weight (g), y-axis is frequency.

In [None]:

# Optional: use a font with CJK glyphs (only needed for Chinese labels)
plt.rcParams['font.family'] = 'SimHei'
plt.rcParams['axes.unicode_minus'] = False


# Histogram with the default bin setting
plt.figure()
plt.hist(penguins['body_mass_g'], edgecolor='black')
plt.xlabel('Body Mass(g)')
plt.ylabel('Frequency')
plt.title('Histogram of Body Mass - default bin width')
plt.show()

### Better Visualization: 

#### Proper number of bins and bin width

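- The default number of bins and bin width may not suit the data; try several values.
- In plt.hist(), both are controlled through the **bins** argument: pass an integer for the number of bins, or an array of bin edges for a fixed bin width. (A separate **binwidth** argument exists in seaborn's sns.histplot(), not in plt.hist().)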
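As a starting point before tuning by eye, NumPy offers several automatic binning rules. The sketch below compares three of them and then builds edges for a fixed bin width. It uses synthetic data as a stand-in for body mass (an assumption for illustration, not the real penguin measurements).

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for body mass in grams (not the real penguin data)
body_mass = rng.normal(loc=4200, scale=800, size=333)

# Automatic rules: Sturges, Freedman-Diaconis, and NumPy's "auto"
for rule in ("sturges", "fd", "auto"):
    edges = np.histogram_bin_edges(body_mass, bins=rule)
    print(f"{rule:>8}: {len(edges) - 1} bins, width = {edges[1] - edges[0]:.0f} g")

# A fixed bin width (here 200 g) is obtained by passing explicit edges,
# aligned to round multiples of the width
width = 200
start = np.floor(body_mass.min() / width) * width
stop = np.ceil(body_mass.max() / width) * width
fixed_edges = np.arange(start, stop + width, width)
print(f"fixed width: {len(fixed_edges) - 1} bins")
```

Either result can be handed to plt.hist() through bins=..., which accepts the rule name, an integer, or an array of edges.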

#### Too few bins

In [None]:
plt.figure()
plt.hist(penguins['body_mass_g'], bins=5, edgecolor='black')  # use only 5 bins
plt.xlabel('Body Mass(g)')
plt.ylabel('Frequency')
plt.title('Too Few Bins')
plt.show()

#### Bin width too wide

In [None]:
import numpy as np
plt.figure()
data = penguins['body_mass_g']
min_val, max_val = data.min(), data.max()
bins = np.arange(min_val, max_val + 500, 500)  # one bin per 500 g
plt.hist(data, bins=bins, edgecolor='black')
plt.xlabel('Body Mass(g)')
plt.ylabel('Frequency')
plt.title('Bin Width Too Wide')
plt.show()


#### Too many bins

In [None]:
plt.figure()
plt.hist(penguins['body_mass_g'], bins=100, edgecolor='black')
plt.xlabel('Body Mass(g)')
plt.ylabel('Frequency')
plt.title('Too Many Bins')
plt.show()

#### Proper number of bins

In [None]:
plt.figure()
plt.hist(penguins['body_mass_g'], bins=20, edgecolor='black')
plt.xlabel('Body Mass(g)')
plt.ylabel('Frequency')
plt.title('Proper Number of Bins')
plt.show()

#### Proper binwidth

In [None]:

plt.figure()
bins = np.arange(min_val, max_val + 200, 200)  # one bin per 200 g
plt.hist(penguins['body_mass_g'], bins=bins, edgecolor='black')
plt.xlabel('Body Mass(g)')
plt.ylabel('Frequency')
plt.title('Proper Binwidth')
plt.show()


### Better Visualization

Colors: face color and edgecolor
    

In [None]:
# Styling the histogram: change the fill color and the edge color
plt.figure()
plt.hist(penguins['body_mass_g'], bins=20, color='#CC0000', edgecolor='white')
plt.xlabel('Body Mass(g)')
plt.ylabel('Frequency')
plt.title('Improve the visualization: facecolor and edgecolor')
plt.show()

#### Compare two histograms

In [None]:


# Split body mass by sex and drop missing values
male_mass = penguins[penguins['sex'] == 'male']['body_mass_g'].dropna()
female_mass = penguins[penguins['sex'] == 'female']['body_mass_g'].dropna()

# Stacked histogram
plt.figure()
plt.hist([male_mass, female_mass], bins=30, stacked=True, label=['Male', 'Female'], edgecolor='white')
plt.xlabel('Body Mass(g)')
plt.ylabel('Frequency')
plt.title('Male vs Female Body Mass - Stacked Histogram')
plt.legend()
plt.show()



#### Transparency

In [None]:


# Overlapping histograms (transparency keeps both groups visible)
plt.figure()
bins = np.histogram_bin_edges(penguins['body_mass_g'], bins=20)
plt.hist(male_mass, bins=bins, alpha=0.6, label='Male', edgecolor='white')
plt.hist(female_mass, bins=bins, alpha=0.6, label='Female', edgecolor='white')
plt.xlabel('Body Mass(g)')
plt.ylabel('Frequency')
plt.title('Male vs Female Body Mass - Overlapping Histogram')
plt.legend()
plt.show()

##### Problem

However, even with overlapping histograms, there are still some limitations:

👎 The edges of the bars are still hard to distinguish after overlapping: Especially when both groups have values in the same interval, the overlaid colors make it difficult to accurately determine the starting point and height of the bars.

👎 Color blending may cause misinterpretation: The mixed colors in the overlapping areas might lead to overestimation or underestimation of the data amount for a particular group in that interval.

A better approach: For comparing two-group distributions, consider using mirror histograms or population pyramid charts:

Mirror histogram: Plot two histograms separately — one above the x-axis and one below — in a mirror-symmetric layout. This way, the two distributions are visually independent, making it easier to directly compare their shapes.

Population pyramid: Rotate the mirror histogram by 90°, so that the vertical axis represents the groups and the horizontal axis shows frequencies. This back-to-back horizontal histogram is commonly used to compare age distributions across different population groups, and is especially effective when the group differences are small.

#### Mirrored Histogram

In [None]:
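import numpy as np
# Mirrored histogram (female counts are drawn below the axis)


plt.figure()
# Shared bin edges from the pooled data, so that no female value
# falls outside the range covered by the male bins
bins_m = np.histogram_bin_edges(np.concatenate([male_mass, female_mass]), bins=20)
counts_m, _ = np.histogram(male_mass, bins=bins_m)
counts_f, _ = np.histogram(female_mass, bins=bins_m)
bin_centers = 0.5 * (bins_m[:-1] + bins_m[1:])
widths = np.diff(bins_m)
# Male bars point up
plt.bar(bin_centers, counts_m, width=widths, label='Male', edgecolor='white')
# Female bars point down (plotted as negative counts)
plt.bar(bin_centers, -counts_f, width=widths, label='Female', edgecolor='white')
plt.axhline(0, color='black')  # horizontal axis line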
plt.xlabel('Body Mass(g)')
plt.ylabel('Frequency')
plt.title('Male vs Female Body Mass - Mirrored Histogram')
yticks = plt.yticks()[0]
plt.yticks(yticks, [abs(int(tick)) for tick in yticks])
plt.legend(loc='upper right')
plt.show()


#### Pyramid Plot

In [None]:

# Histogram counts on shared bins (edges computed from the pooled data)
bins_m = np.histogram_bin_edges(np.concatenate([male_mass, female_mass]), bins=20)
counts_m, _ = np.histogram(male_mass, bins=bins_m)
counts_f, _ = np.histogram(female_mass, bins=bins_m)
bin_centers = 0.5 * (bins_m[:-1] + bins_m[1:])
heights = np.diff(bins_m)

# Start plotting
plt.figure(figsize=(8, 6))

# Females to the left (negative counts)
plt.barh(bin_centers, -counts_f, height=heights, label='Female', edgecolor='white', color='skyblue')

# Males to the right (positive counts)
plt.barh(bin_centers, counts_m, height=heights, label='Male', edgecolor='white', color='salmon')

# y-axis: body mass; x-axis: frequency (mirrored left and right)
plt.axvline(0, color='black')  # center line
plt.ylabel('Body Mass (g)')
plt.xlabel('Frequency')
plt.title('Male vs Female Body Mass - Population Pyramid')

# Tidy the x tick labels (hide the minus sign)
xticks = plt.xticks()[0]
plt.xticks(xticks, [abs(int(t)) for t in xticks])

plt.legend()
plt.tight_layout()
plt.show()

### Contents today
1. histograms
2. ***density plots***
3. box plots
4. violin plots
5. raincloud plots
6. ridgeline plots

### 2. Density Plots

<pre>
sns.kdeplot()
</pre>


### Smoothed version of a histogram: the density plot

- As the sample size grows toward infinity and the bin width shrinks toward zero, the density-normalized histogram converges to a smooth curve: the probability density

- This follows from large-sample theory
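The limiting behavior can be checked numerically. The sketch below is an illustration under stated assumptions: it draws samples from a standard normal distribution (chosen because its density is known exactly) and measures how far the density-normalized histogram bars are from the true density. The helper name max_hist_error is hypothetical.

```python
import numpy as np

def max_hist_error(n, n_bins, seed=0):
    """Largest gap between histogram bar heights and the true N(0, 1) density."""
    rng = np.random.default_rng(seed)
    sample = rng.standard_normal(n)
    # density=True rescales the bars so that their total area is 1
    heights, edges = np.histogram(sample, bins=n_bins, range=(-4, 4), density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    true_pdf = np.exp(-0.5 * centers**2) / np.sqrt(2 * np.pi)
    return np.max(np.abs(heights - true_pdf))

# Small sample with wide bins vs. large sample with narrow bins
err_small = max_hist_error(n=100, n_bins=8)
err_large = max_hist_error(n=1_000_000, n_bins=100)
print(f"n = 100, 8 bins:         max error = {err_small:.4f}")
print(f"n = 1,000,000, 100 bins: max error = {err_large:.4f}")
```

With more data and narrower bins the bars track the density curve much more closely, which is the motivation for drawing the smooth curve directly.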

### Density plots are usually obtained through kernel density estimation (KDE)

- Kernel function: The most commonly used is the Gaussian kernel

- Bandwidth (bw): Controls the smoothness of the curve, similar to bin width in a histogram; the default is usually sufficient
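To show what KDE computes, here is a minimal NumPy sketch of a one-dimensional Gaussian KDE: the estimate at x is the average of Gaussian bumps centered on each observation, f(x) = 1/(n·h) · Σ K((x − x_i)/h). The function name gaussian_kde_1d, the use of Scott's rule of thumb for the bandwidth h, and the synthetic data are assumptions made for this illustration; they are not a description of seaborn's internals.

```python
import numpy as np

def gaussian_kde_1d(sample, grid, bw_adjust=1.0):
    """Evaluate a Gaussian kernel density estimate on the points in `grid`."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    # Bandwidth from Scott's rule of thumb (assumed default for this sketch)
    h = bw_adjust * sample.std(ddof=1) * n ** (-1 / 5)
    # Standardized distance from every grid point to every observation
    z = (grid[:, None] - sample[None, :]) / h
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    # Average the kernels and rescale by the bandwidth
    return kernels.sum(axis=1) / (n * h)

rng = np.random.default_rng(1)
flipper = rng.normal(200, 14, size=300)   # synthetic stand-in, not real data
grid = np.linspace(flipper.min() - 60, flipper.max() + 60, 500)
density = gaussian_kde_1d(flipper, grid)

# A density should integrate to (approximately) 1
area = density.sum() * (grid[1] - grid[0])
print(f"area under the curve = {area:.3f}")
```

Plotting grid against density with plt.plot() would give a curve of the same kind that sns.kdeplot() draws.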

### For large datasets, density plots are more reliable and provide richer information

### For small datasets, density plots may be misleading

In [None]:

# Density plot: kernel density estimate of flipper length
plt.figure(figsize=(8, 5))
sns.kdeplot(data=penguins, x='flipper_length_mm')
plt.title("Density Plot of Flipper Length")
plt.xlabel("Flipper Length (mm)")
plt.show()
# plt.savefig("penguins_figs/density.png")
# plt.close()


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Load the data
penguins = sns.load_dataset("penguins")

# Filled density plot of flipper length

plt.figure(figsize=(8, 5))
sns.kdeplot(data=penguins, x='flipper_length_mm', fill=True, color='red', alpha=0.6, linewidth=2,
            bw_adjust=1)
plt.title("Density Plot of Flipper Length")
plt.xlabel("Flipper Length (mm)")
plt.show()
# plt.savefig("penguins_figs/density.png")
# plt.close()


- linewidth: controls the thickness of the line

- bw_adjust: Adjusts the bandwidth (controls the smoothness of the curve)
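The effect of the bandwidth multiplier can be illustrated without any plotting. The sketch below is a hedged illustration: it uses a hand-written Gaussian KDE with Scott's rule (an assumption, with hypothetical helper names kde and count_peaks) on synthetic two-cluster data, and counts how many peaks the estimated curve has for several multipliers.

```python
import numpy as np

def kde(sample, grid, bw_adjust=1.0):
    """Gaussian KDE; bandwidth = bw_adjust x Scott's rule (assumed default)."""
    h = bw_adjust * sample.std(ddof=1) * sample.size ** (-1 / 5)
    z = (grid[:, None] - sample[None, :]) / h
    return np.exp(-0.5 * z**2).sum(axis=1) / (sample.size * h * np.sqrt(2 * np.pi))

def count_peaks(curve):
    """Number of local maxima: places where the slope turns from rising to falling."""
    slope = np.diff(curve)
    return int(np.sum((slope[:-1] > 0) & (slope[1:] <= 0)))

rng = np.random.default_rng(2)
# Synthetic bimodal data: two well-separated clusters
sample = np.concatenate([rng.normal(0, 1, 200), rng.normal(10, 1, 200)])
grid = np.linspace(-5, 15, 400)

peaks = {bw: count_peaks(kde(sample, grid, bw)) for bw in (0.2, 1.0, 5.0)}
for bw, n_peaks in peaks.items():
    print(f"bw_adjust = {bw}: {n_peaks} peak(s)")
```

A very small multiplier follows the noise in the sample, while a very large one merges the two clusters into a single bump, so the bimodal structure disappears.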

### Histogram with density plot


In [None]:
plt.figure()
plt.hist(penguins['flipper_length_mm'].dropna(), bins=20, density=True, color='#CC0000', alpha=0.5, edgecolor='white')
sns.kdeplot(penguins['flipper_length_mm'].dropna(), color='black')
plt.title("Histogram + Density Plot")
plt.xlabel("Flipper Length (mm)")
plt.show()

#### Compare different distributions with density plots

In [None]:

# Split the data by species

species = penguins['species'].dropna().unique()
data_to_plot = [penguins[penguins['species'] == s]['flipper_length_mm'].dropna() for s in species]

plt.figure(figsize=(8, 5))

# Overlaid kernel density curves for the three species
sns.kdeplot(data_to_plot[0], label=species[0], fill=True, color='red', alpha=0.4)
sns.kdeplot(data_to_plot[1], label=species[1], fill=True, color='green', alpha=0.4)
sns.kdeplot(data_to_plot[2], label=species[2], fill=True, color='blue', alpha=0.4)
plt.title("Flipper Length of Different Species")
plt.xlabel("Flipper Length (mm)")
plt.legend()
plt.show()

#### Comparing Multiple Distributions – Advantages of Density Plots

👍 For visualizing multiple distributions, **density plots are generally better than histograms.
Overlapping curves are much clearer than overlapping bars, and there’s no confusion about stacking or alignment of bar starting points.**

### Compare different distributions
Faceted density plot


In [None]:
g = sns.FacetGrid(penguins, col='species', height=4, aspect=1)
g.map(sns.kdeplot, 'flipper_length_mm', fill=True,color = '#CC0000',alpha =1)
g.set_axis_labels('Flipper Length(mm)', 'Density')
plt.show()

### Visualizing different distributions
Stacked density plot


In [None]:

# Plot
plt.figure(figsize=(10, 6))
sns.kdeplot(
    data=penguins,
    x='flipper_length_mm',
    hue='species',
    multiple='stack',       # stacked mode
    fill=True,
    alpha=0.7
    
)

plt.title("Stacked Density Plot by Species")
plt.xlabel("Flipper Length (mm)")
plt.ylabel("Density")
plt.tight_layout()
plt.show()

### Contents today
1. histograms
2. density plots
3. ***box plots***
4. violin plots
5. raincloud plots
6. ridgeline plots

### 3. Box Plots
<pre>
plt.boxplot()   
sns.boxplot()
</pre>


- Box plots display the center and spread of a distribution using the five-number summary (minimum, first quartile, median, third quartile, and maximum). They are especially useful for comparing the distribution characteristics across multiple groups.

- The bottom and top of the box represent the first quartile (Q1) and the third quartile (Q3), respectively.

- The line inside the box indicates the median.


<div>
<img src="./box_plot.png" width="300"/>
</div>



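- The "whiskers" extend from the box to the most extreme data points that still lie within 1.5 times the interquartile range (IQR = Q3 - Q1) of the box edges, that is, no lower than Q1 - 1.5 × IQR and no higher than Q3 + 1.5 × IQR.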

- Outliers:
Data points beyond the whiskers are considered outliers and are typically marked as individual dots, reflecting extreme values.

- Box plots make it easy to compare medians and variability across groups side by side, but they do not reveal detailed features of the distribution such as bimodality.
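The quantities drawn in a box plot can be computed by hand. The following minimal sketch uses ten made-up values (not penguin data) and the 1.5 × IQR whisker rule described above.

```python
import numpy as np

# Made-up values with one extreme observation (illustration only)
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30], dtype=float)

# Box: quartiles and median
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Fences: the farthest a whisker is allowed to reach
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points inside the fences
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Everything outside the fences is drawn as an individual outlier point
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3)             # 3.25 5.5 7.75
print(whisker_low, whisker_high)  # 1.0 9.0
print(outliers)                   # [30.]
```

Note that the upper whisker stops at 9, the largest observed value below the fence at 14.5, not at the fence itself.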



- Standard box plot
- Set colors for box and outliers
- Hide the outlier
- Highlight 
- Mark the mean
- Show datapoints

In [None]:

# Box plot: flipper length by penguin species
plt.figure(figsize=(8, 5))

species = penguins['species'].dropna().unique()
data_to_plot = [penguins[penguins['species'] == s]['flipper_length_mm'].dropna() for s in species]

plt.boxplot(data_to_plot, labels=species)  # `labels` is named `tick_labels` in Matplotlib 3.9+
plt.title("Box plot of flipper length by species")
plt.xlabel("Species")
plt.ylabel("Flipper Length (mm)")

plt.show()



In [None]:
plt.figure(figsize=(8, 5))

sns.boxplot(data=penguins, x='species', y='flipper_length_mm')
plt.title("Box plot of flipper length by species")
plt.xlabel("Species")
plt.ylabel("Flipper Length (mm)")

plt.show()



#### Adding subgroups

In [None]:
plt.figure(figsize=(10, 6))
sns.boxplot(data=penguins, x='species', y='body_mass_g', hue='sex')

plt.title("Penguin Body Mass by Species and Sex")
plt.xlabel("Species")
plt.ylabel("Body Mass (g)")
plt.legend(title="Sex")
plt.tight_layout()
plt.show()

#### Marking outliers

In [None]:
plt.figure(figsize=(8, 6))
sns.boxplot(
    data=penguins,
    x='species',
    y='flipper_length_mm',
    color='gray',           # Fill color
    fliersize=4,            # Outlier marker size
    flierprops=dict(
        marker='o',         # Outlier shape (circle)
        markerfacecolor='#CC0000',  # Outlier color (red)
        markeredgecolor='#CC0000',  # Outlier color (red)
        markersize=5
    )
)

plt.title("Flipper Length by Species")
plt.xlabel("Species")
plt.ylabel("Flipper Length (mm)")
plt.tight_layout()
plt.show()

#### Hide the outliers

In [None]:
plt.figure(figsize=(8, 6))
sns.boxplot(
    data=penguins,
    x='species',
    y='flipper_length_mm',
    color='#CC0000',           # Fill color
    showfliers=False
    )

plt.title("Flipper Length by Species")
plt.xlabel("Species")
plt.ylabel("Flipper Length (mm)")
plt.tight_layout()
plt.show()

#### Highlight one group

In [None]:
highlight_species = 'Adelie'

# Build the color mapping: red for the highlighted species, light gray for the rest
species_order = sorted(penguins['species'].unique())
colors = ['#bb0000' if s == highlight_species else '#dddddd' for s in species_order]

# Plot
plt.figure(figsize=(8, 6))
sns.boxplot(
    data=penguins,
    x='species',
    y='flipper_length_mm',
    order=species_order,
    palette=colors,
    showfliers=False
)

plt.title(f"Flipper Length by Species (Highlight: {highlight_species})")
plt.xlabel("Species")
plt.ylabel("Flipper Length (mm)")
plt.tight_layout()
plt.show()

#### Mark the mean

In [None]:

species_order = penguins['species'].unique()

# Gray box plots
plt.figure(figsize=(8, 6))
sns.boxplot(
    data=penguins,
    x='species',
    y='flipper_length_mm',
    order=species_order,
    color='gray',
    showfliers=False
)

# Compute the group means, reordered to match species_order
group_means = penguins.groupby('species')['flipper_length_mm'].mean().reindex(species_order)

# Key point: the x positions must follow the same order as species_order
plt.scatter(
    x=np.arange(len(species_order)), 
    y=group_means, 
    color='red', 
    marker='s', 
    s=100,   # larger squares
    zorder=5 # draw above the boxes
     )

plt.title("Flipper Length by Species with Mean Points")
plt.xlabel("Species")
plt.ylabel("Flipper Length (mm)")
plt.tight_layout()
plt.show()

#### Mark the sample size

In [None]:

# Group order
species_order = penguins['species'].unique()

# Median and sample size per group, in the same order as the boxes
group_stats = penguins.groupby('species')['flipper_length_mm'].agg(['median', 'count']).reindex(species_order).reset_index()

# Plot
plt.figure(figsize=(8, 6))
sns.boxplot(
    data=penguins,
    x='species',
    y='flipper_length_mm',
    order=species_order,
    color='white',
    showfliers=False
)

# Add sample-size labels (slightly below the median)
for i, row in group_stats.iterrows():
    plt.text(
        x=i, 
        y=row['median'] - 2,   # just below the median
        s=f"n = {int(row['count'])}", 
        ha='center', 
        va='top',
        fontsize=10,
        color='red'
    )

plt.title("Flipper Length by Species with Sample Sizes")
plt.xlabel("Species")
plt.ylabel("Flipper Length (mm)")
plt.tight_layout()
plt.show()

#### Draw the data points

In [None]:

# Plot
plt.figure(figsize=(8, 6))

# Box plots (white, outliers hidden)
sns.boxplot(
    data=penguins,
    x='species',
    y='flipper_length_mm',
    color='white',
    showfliers=False
)

# Raw data points (red, semi-transparent)
sns.stripplot(
    data=penguins,
    x='species',
    y='flipper_length_mm',
    color='#CC0000',
    size=3,
    alpha=0.5,
    jitter=False  # no jitter: points in each group overlap on one vertical line
)

plt.title("Flipper Length by Species (with Raw Data Points)")
plt.xlabel("Species")
plt.ylabel("Flipper Length (mm)")
plt.tight_layout()
plt.show()

#### 👍 Adding jitter for better visualization

In [None]:

# Plot
plt.figure(figsize=(8, 6))

# Box plots (white, outliers hidden)
sns.boxplot(
    data=penguins,
    x='species',
    y='flipper_length_mm',
    color='white',
    showfliers=False
)

# Raw data points (red, semi-transparent)
sns.stripplot(
    data=penguins,
    x='species',
    y='flipper_length_mm',
    color='#CC0000',
    size=3,
    alpha=0.5,
    jitter=True  # add horizontal jitter to reduce overlap
)

plt.title("Flipper Length by Species (with Raw Data Points)")
plt.xlabel("Species")
plt.ylabel("Flipper Length (mm)")
plt.tight_layout()
plt.show()

### Contents today
1. histograms
2. density plots
3. box plots
4. ***violin plots***
5. raincloud plots
6. ridgeline plots

### 4. Violin Plots
<pre>
sns.violinplot()
</pre>

Similar to box plots, violin plots provide more detailed data visualization.

Unlike box plots, violin plots can accurately represent multimodal data, such as bimodal distributions, which box plots cannot.

A violin plot is essentially a density plot combined with its mirror image, giving it the shape of a violin.
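The "density plus mirror image" idea can be sketched in a few lines. This is an illustration under stated assumptions, not seaborn's implementation: the helper violin_outline is hypothetical, it uses a hand-written Gaussian KDE with Scott's rule, and the two-cluster data are synthetic.

```python
import numpy as np

def violin_outline(sample, grid, center=0.0, max_half_width=0.4):
    """Left and right edges of a violin body centered at `center`."""
    sample = np.asarray(sample, dtype=float)
    h = sample.std(ddof=1) * sample.size ** (-1 / 5)   # Scott's rule (assumed)
    z = (grid[:, None] - sample[None, :]) / h
    density = np.exp(-0.5 * z**2).sum(axis=1) / (sample.size * h * np.sqrt(2 * np.pi))
    # Rescale so that the widest point has the requested half width
    half_width = density / density.max() * max_half_width
    # The right edge is the density curve; the left edge is its mirror image
    return center - half_width, center + half_width

rng = np.random.default_rng(3)
# Synthetic bimodal sample (illustration only)
sample = np.concatenate([rng.normal(190, 6, 150), rng.normal(217, 6, 120)])
grid = np.linspace(sample.min(), sample.max(), 200)
left, right = violin_outline(sample, grid, center=0.0)

# With Matplotlib, plt.fill_betweenx(grid, left, right) would draw the violin body.
print(f"widest point: {np.max(right - left):.2f}")
```

Because the outline is a full density curve, two bulges appear for bimodal data, which a box plot of the same sample would not show.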

<div>
<img src="./vioplot.png" width="300"/>
</div>


In [None]:
penguins = sns.load_dataset("penguins")
plt.figure(figsize=(8, 5))
sns.violinplot(data=penguins, x='species', y='flipper_length_mm')
plt.title("Flipper Length of Different Species")
plt.xlabel("Species")
plt.ylabel("Flipper Length (mm)")
plt.show()

#### Rotation and Scale

In [None]:
plt.figure(figsize=(8, 5))
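# scale='count' makes each violin's width reflect its sample size
# (seaborn 0.13 and later name this argument density_norm)
sns.violinplot(data=penguins, y='species', x='flipper_length_mm', scale='count')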
plt.title("Flipper Length of Different Species (Rotated)")
plt.ylabel("Species")
plt.xlabel("Flipper Length (mm)")
plt.show()

#### Change colors

In [None]:
plt.figure(figsize=(8, 5))
sns.violinplot(data=penguins, y='species', x='flipper_length_mm', palette='Set1')
plt.title("Flipper Length with Custom Colors")
plt.xlabel("Flipper Length (mm)")
plt.ylabel("Species")
plt.show()

#### Thicker lines

In [None]:
plt.figure(figsize=(7, 4))
sns.violinplot(
    data=penguins,
    x='flipper_length_mm',
    y='species',
    linewidth=2,
    palette='muted'
)
plt.title("Violin Plot with Thicker Edges")
plt.show()

#### Adding points

In [None]:
plt.figure(figsize=(7, 4))
sns.violinplot(
    data=penguins,
    x='flipper_length_mm',
    y='species',
    inner=None,  # do not draw the inner quartile box
    palette='Set1'
)
sns.stripplot(
    data=penguins,
    x='flipper_length_mm',
    y='species',
    color='black',
    size=2.5,
    jitter=True,
    alpha=0.4
)
plt.title("Violin + Jittered Raw Data Points")
plt.show()

#### Adding median

In [None]:
plt.figure(figsize=(7, 4))
sns.violinplot(
    data=penguins,
    x='flipper_length_mm',
    y='species',
    inner=None,
    palette='Set1'
)

# Add a marker at each group's median
medians = penguins.groupby("species")["flipper_length_mm"].median()
for i, species in enumerate(medians.index):
    plt.plot(medians[species], i, marker='s', color='black', markersize=6)

plt.title("Violin Plot with Median Highlight")
plt.show()

### Contents today
1. histograms
2. density plots
3. box plots
4. violin plots
5. ***raincloud plots***
6. ridgeline plots

### 5. Raincloud Plots
Raincloud plots = Half violin plot + Boxplot + Jittered scatter plot

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.stats import gaussian_kde
from matplotlib.patches import Rectangle

penguins = sns.load_dataset("penguins")
df = penguins.dropna(subset=["sex", "body_mass_g"]).copy()

groups = ["Male", "Female"]
palette = {"Male": "#4E79A7", "Female": "#F28E2B"}

x_grid = np.linspace(df["body_mass_g"].min(), df["body_mass_g"].max(), 200)
max_dens_global = 0
dens_dict = {}

for grp in groups:
    values = df[df["sex"] == grp]["body_mass_g"].values
    kde = gaussian_kde(values, bw_method=0.3)
    y_vals = kde(x_grid)
    dens_dict[grp] = y_vals
    max_dens_global = max(max_dens_global, y_vals.max())

plt.figure(figsize=(6, 6))
ax = plt.gca()

for i, grp in enumerate(groups):
    values = df[df["sex"] == grp]["body_mass_g"].values
    density = dens_dict[grp]
    scaled_dens = density * (0.4 / max_dens_global)
    base_y = i

    # Half violin (above the baseline)
    plt.fill_between(
        x_grid,
        base_y + scaled_dens,
        base_y,
        color=palette[grp],
        alpha=0.6
    )

    # ggplot-style box
    q1, med, q3 = np.percentile(values, [25, 50, 75])
    box_width = 0.1
    box_left = q1
    box_right = q3
    box_bottom = base_y - box_width / 2
    box_height = box_width

    ax.add_patch(Rectangle(
        (box_left, box_bottom),
        box_right - box_left,
        box_height,
        facecolor='white',
        edgecolor='black',
        linewidth=1.5,
        zorder=3
    ))

    # Median line
    plt.plot([med, med], [box_bottom, box_bottom + box_height], color='black', lw=1.2, zorder=4)

    # Jittered raw points (below the baseline)
    x_jitter = values + np.random.normal(0, 5, size=len(values))
    y_jitter = np.random.uniform(base_y - 0.3, base_y - 0.1, size=len(values))
    plt.scatter(x_jitter, y_jitter, s=10, alpha=0.5, color=palette[grp])

# Styling (no legend: the groups are already labeled on the y-axis)
plt.yticks(range(len(groups)), groups)
plt.xlabel("Body Mass (g)")
plt.ylabel("Sex")
plt.title("Raincloud Plot by Sex")
plt.grid(axis='x', linestyle='--', alpha=0.3)
plt.tight_layout()
plt.show()

### Contents today
1. histograms
2. density plots
3. box plots
4. violin plots
5. raincloud plots
6. ***ridgeline plots***

### 6. Ridgeline Plots
- Ridgeline plot
- Colors, Emphasis
- Ridgeline plot with joypy package
- R ggplot2 with python

In [None]:
import numpy as np
from scipy.stats import gaussian_kde


# Prepare the data: combine species and sex into a new category
df = penguins.dropna(subset=['sex']).copy()
df['group'] = df['species'] + ' - ' + df['sex']
# Collect all group labels
groups = df.groupby('group')['flipper_length_mm'].unique().index
# Compute the density curve for each group
density_curves_group = {}
x_grid = np.linspace(df['flipper_length_mm'].min(), df['flipper_length_mm'].max(), 200)
max_dens_global = 0
for grp in groups:
    vals = df[df['group']==grp]['flipper_length_mm'].values
    kde = gaussian_kde(vals)
    dens = kde(x_grid)
    density_curves_group[grp] = dens
    max_dens_global = max(max_dens_global, dens.max())
# plt.figure()
for j, grp in enumerate(groups):
    dens = density_curves_group[grp]
    dens_height = dens * (2/ max_dens_global)
    base = j  # baseline y position of each group
    plt.fill_between(x_grid, base, base + dens_height, alpha=0.7)
plt.yticks(range(len(groups)), groups)
plt.xlabel('Flipper Length(mm)')
plt.ylabel('Groups')
plt.title('Ridgeline Plot: Distribution of Flipper Length by Species & Sex')
plt.show()


#### Same color

In [None]:
import numpy as np
from scipy.stats import gaussian_kde


# Prepare the data: combine species and sex into a new category
df = penguins.dropna(subset=['sex']).copy()
df['group'] = df['species'] + ' - ' + df['sex']
# Collect all group labels
groups = df.groupby('group')['flipper_length_mm'].unique().index
# Compute the density curve for each group
density_curves_group = {}
x_grid = np.linspace(df['flipper_length_mm'].min(), df['flipper_length_mm'].max(), 200)
max_dens_global = 0
for grp in groups:
    vals = df[df['group']==grp]['flipper_length_mm'].values
    kde = gaussian_kde(vals)
    dens = kde(x_grid)
    density_curves_group[grp] = dens
    max_dens_global = max(max_dens_global, dens.max())
plt.figure()
for j, grp in enumerate(groups):
    dens = density_curves_group[grp]
    dens_height = dens * (2/ max_dens_global)
    base = j  # baseline y position of each group
    plt.fill_between(x_grid, base, base + dens_height, color = '#CC0000',alpha=0.7)
plt.yticks(range(len(groups)), groups)
plt.xlabel('Flipper Length(mm)')
plt.ylabel('Groups')
plt.title('Ridgeline Plot: Distribution of Flipper Length by Species & Sex')
plt.show()


#### Emphasize some group

In [None]:
import numpy as np
from scipy.stats import gaussian_kde


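# Prepare the data: combine species and sex into a new category
df = penguins.dropna(subset=['sex']).copy()
df['group'] = df['species'] + ' - ' + df['sex']
# Order the groups by mean flipper length
groups = df.groupby('group')['flipper_length_mm'].mean().sort_values().index
# Compute the density curve for each group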
density_curves_group = {}
x_grid = np.linspace(df['flipper_length_mm'].min(), df['flipper_length_mm'].max(), 200)
max_dens_global = 0
for grp in groups:
    vals = df[df['group']==grp]['flipper_length_mm'].values
    kde = gaussian_kde(vals)
    dens = kde(x_grid)
    density_curves_group[grp] = dens
    max_dens_global = max(max_dens_global, dens.max())
plt.figure()
for j, grp in enumerate(groups):
    dens = density_curves_group[grp]
    dens_height = dens * (2/ max_dens_global)
    base = j  # baseline y position of each group
    if grp == 'Adelie - Male':
        cl = '#CC0000'
    else:
        cl = 'grey'
    plt.fill_between(x_grid, base, base + dens_height, color = cl,alpha=0.7)
plt.yticks(range(len(groups)), groups)
plt.xlabel('Flipper Length(mm)')
plt.ylabel('Groups')
plt.title('Ridgeline Plot: Distribution of Flipper Length by Species & Sex')
plt.show()


#### Adding jitter plots

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde
import seaborn as sns

# Prepare the data
penguins = sns.load_dataset("penguins")
df = penguins.dropna(subset=['sex', 'flipper_length_mm']).copy()
df['group'] = df['species'] + ' - ' + df['sex']

# All species-sex groups
groups = df['group'].unique()
groups = sorted(groups)

# Set up the KDE grid
x_grid = np.linspace(df['flipper_length_mm'].min(), df['flipper_length_mm'].max(), 200)
density_curves_group = {}
max_dens_global = 0

# Compute the KDE curve for each group
for grp in groups:
    vals = df[df['group'] == grp]['flipper_length_mm'].values
    kde = gaussian_kde(vals)
    dens = kde(x_grid)
    density_curves_group[grp] = dens
    max_dens_global = max(max_dens_global, dens.max())

# Create the figure
plt.figure(figsize=(10, len(groups) * 1))

for j, grp in enumerate(groups):
    # Look up the curve
    dens = density_curves_group[grp]
    dens_height = dens * (2.0 / max_dens_global)  # scale to a common height
    base = j  # y position of each group

    # Draw the density curve
    plt.fill_between(x_grid, base, base + dens_height, color='#CC0000', alpha=0.7)
    
    # Add the raw points with vertical jitter
    raw_vals = df[df['group'] == grp]['flipper_length_mm'].values
    x_jit = raw_vals  
    y_jit = np.random.uniform(base - 0.1, base + 0.1, size=len(raw_vals))  # slight vertical jitter
    plt.scatter(x_jit, y_jit, s=5, alpha=0.4, color='black')

# Axes and title
plt.yticks(range(len(groups)), groups)
plt.xlabel('Flipper Length(mm)')
plt.ylabel('Groups')
plt.title('Ridgeline Plot: Distribution of Flipper Length by Species & Sex')
plt.tight_layout()
plt.show()

#### Ridgeline plot with joypy

In [None]:
import joypy

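# Drop rows with missing sex or flipper length, then build the group label
penguins = penguins.dropna(subset=['sex', 'flipper_length_mm']).copy()
penguins['group'] = penguins['species'] + ' | ' + penguins['sex']

# Ridgeline plot by group (joypy.joyplot creates its own figure)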
joypy.joyplot(
    data=penguins,
    by='group',
    column='flipper_length_mm',
    kind='kde',
    overlap=0,
    # colormap=plt.cm.Set2,   # color scheme
    linewidth=1
)

plt.xlabel("Flipper Length (mm)")
plt.title("Distribution of Flipper Length by Species & Sex")
# plt.tight_layout()
plt.show()

#### R ggplot2 with python

Download R and install two packages:
- install.packages("ggplot2")
- install.packages("ggridges") 


Tell Python where R is installed:
- import os
- os.environ["R_HOME"] = "C:/Program Files/R/R-4.4.3"

In [None]:
import os
# Replace with your actual R installation path
os.environ["R_HOME"] = "C:/Program Files/R/R-4.4.3"

Import useful packages

In [None]:
# rpy2.robjects: interface for accessing R
# pandas2ri: conversion between pandas and R data frames
from rpy2.robjects import r, pandas2ri
# importr: loads an R package
from rpy2.robjects.packages import importr
# localconverter: context manager that applies the pandas <-> R conversion
from rpy2.robjects.conversion import localconverter

In [None]:
import seaborn as sns
import pandas as pd
penguins = sns.load_dataset("penguins").dropna(subset=["flipper_length_mm", "species", "sex"])
penguins["group"] = penguins["species"] + " | " + penguins["sex"]

# Rename the column to match the aes() mapping in R
penguins = penguins.rename(columns={"flipper_length_mm": "x"})

# Load the R packages
pandas2ri.activate()
ggplot2 = importr("ggplot2")
ggridges = importr("ggridges")

# Convert the pandas DataFrame to an R data frame
with localconverter(pandas2ri.converter):
    rdf = pandas2ri.py2rpy(penguins)
# print(rdf)
# Define an R function that draws the plot and saves it
r('''
library(ggplot2)
library(ggridges)

draw_flipper_ridges <- function(df, filename) {
  p <- ggplot(df, aes(x = x, y = group)) +
    geom_density_ridges(fill = "#CC0000", color = "white", alpha = 0.75) +
    theme_minimal() +
    labs(x = "Flipper Length (mm)", y = "Species | Sex",
         title = "Distribution of Flipper Length by Species & Sex")

  ggsave(filename, p, width = 8, height = 5, dpi = 300)
}
''')

# Call the R function to draw and save the plot
r["draw_flipper_ridges"](rdf, "flipper_ridges.png")



In [None]:
# Display the image in Python
from PIL import Image
import matplotlib.pyplot as plt

img = Image.open("flipper_ridges.png")
plt.imshow(img)
plt.axis('off')
plt.show()

#### Ridgeline plot with histograms (binline)

In [None]:
import seaborn as sns
import pandas as pd
penguins = sns.load_dataset("penguins").dropna(subset=["flipper_length_mm", "species", "sex"])
penguins["group"] = penguins["species"] + " | " + penguins["sex"]

# Rename the column to match the aes() mapping in R
penguins = penguins.rename(columns={"flipper_length_mm": "x"})

# Load the R packages
pandas2ri.activate()
ggplot2 = importr("ggplot2")
ggridges = importr("ggridges")

# Convert the pandas DataFrame to an R data frame
with localconverter(pandas2ri.converter):
    rdf = pandas2ri.py2rpy(penguins)
# print(rdf)
# Define an R function that draws the plot and saves it
r('''
library(ggplot2)
library(ggridges)

draw_flipper_ridges <- function(df, filename) {
  p <- ggplot(df, aes(x = x, y = group)) +
    geom_density_ridges(stat = "binline",binwidth = 3,fill = "#CC0000", color = "white", alpha = 0.75) +
    theme_minimal() +
    labs(x = "Flipper Length (mm)", y = "Species | Sex",
         title = "Distribution of Flipper Length by Species & Sex")

  ggsave(filename, p, width = 8, height = 5, dpi = 300)
}
''')

# Call the R function to draw and save the plot
r["draw_flipper_ridges"](rdf, "flipper_ridges.png")

# Display the image in Python
from PIL import Image
import matplotlib.pyplot as plt

img = Image.open("flipper_ridges.png")
plt.imshow(img)
plt.axis('off')
plt.show()

#### Ridgeline plot with jitter points

In [None]:
import seaborn as sns
import pandas as pd
penguins = sns.load_dataset("penguins").dropna(subset=["flipper_length_mm", "species", "sex"])
penguins["group"] = penguins["species"] + " | " + penguins["sex"]

# Rename the column to match the aes() mapping in R
penguins = penguins.rename(columns={"flipper_length_mm": "x"})

# Load the R packages
pandas2ri.activate()
ggplot2 = importr("ggplot2")
ggridges = importr("ggridges")

# Convert the pandas DataFrame to an R data frame
with localconverter(pandas2ri.converter):
    rdf = pandas2ri.py2rpy(penguins)
# print(rdf)
# Define an R function that draws the plot and saves it
r('''
library(ggplot2)
library(ggridges)

draw_flipper_ridges <- function(df, filename) {
  p <- ggplot(df, aes(x = x, y = group)) +
    geom_density_ridges(jittered_points = TRUE,point_size = 0.5, fill = "#CC0000", alpha = 0.75) +
    theme_minimal() +
    labs(x = "Flipper Length (mm)", y = "Species | Sex",
         title = "Distribution of Flipper Length by Species & Sex")

  ggsave(filename, p, width = 8, height = 5, dpi = 300)
}
''')

# Call the R function to draw and save the plot
r["draw_flipper_ridges"](rdf, "flipper_ridges.png")

# Display the image in Python
from PIL import Image
import matplotlib.pyplot as plt

img = Image.open("flipper_ridges.png")
plt.imshow(img)
plt.axis('off')
plt.show()

In [None]:
import seaborn as sns
import pandas as pd
penguins = sns.load_dataset("penguins").dropna(subset=["flipper_length_mm", "species", "sex"])
penguins["group"] = penguins["species"] + " | " + penguins["sex"]

# Rename the column to match the aes() mapping in R
penguins = penguins.rename(columns={"flipper_length_mm": "x"})

# Load the R packages
pandas2ri.activate()
ggplot2 = importr("ggplot2")
ggridges = importr("ggridges")

# Convert the pandas DataFrame to an R data frame
with localconverter(pandas2ri.converter):
    rdf = pandas2ri.py2rpy(penguins)
# print(rdf)
# Define an R function that draws the plot and saves it
r('''
library(ggplot2)
library(ggridges)

draw_flipper_ridges <- function(df, filename) {
  p <- ggplot(df, aes(x = x, y = group)) +
    geom_density_ridges(fill = '#CC0000',jittered_points = TRUE, position = position_points_jitter(width = 0.05, height = 0),
    point_shape = "|", point_size = 3, alpha =0.7)+
        theme_minimal() +
    labs(x = "Flipper Length (mm)", y = "Species | Sex",
         title = "Distribution of Flipper Length by Species & Sex")

  ggsave(filename, p, width = 8, height = 5, dpi = 300)
}
''')

# Call the R function to draw and save the plot
r["draw_flipper_ridges"](rdf, "flipper_ridges.png")

# Display the image in Python
from PIL import Image
import matplotlib.pyplot as plt

img = Image.open("flipper_ridges.png")
plt.imshow(img)
plt.axis('off')
plt.show()

#### Ridgeline plot with raincloud 

In [None]:
import seaborn as sns
import pandas as pd
penguins = sns.load_dataset("penguins").dropna(subset=["flipper_length_mm", "species", "sex"])
penguins["group"] = penguins["species"] + " | " + penguins["sex"]

# Rename the column to match the aes() mapping in R
penguins = penguins.rename(columns={"flipper_length_mm": "x"})

# Load the R packages
pandas2ri.activate()
ggplot2 = importr("ggplot2")
ggridges = importr("ggridges")

# Convert the pandas DataFrame to an R data frame
with localconverter(pandas2ri.converter):
    rdf = pandas2ri.py2rpy(penguins)
# print(rdf)
# Define an R function that draws the plot and saves it
r('''
library(ggplot2)
library(ggridges)

draw_flipper_ridges <- function(df, filename) {
  p <- ggplot(df, aes(x = x, y = group)) +
    geom_density_ridges(fill = '#CC0000',jittered_points = TRUE, point_size = 0.5,
    scale = 0.6, alpha = 0.7, position = 'raincloud') +
        theme_minimal() +
    labs(x = "Flipper Length (mm)", y = "Species | Sex",
         title = "Distribution of Flipper Length by Species & Sex")

  ggsave(filename, p, width = 8, height = 5, dpi = 300)
}
''')

# Call the R function to draw and save the plot
r["draw_flipper_ridges"](rdf, "flipper_ridges.png")

# Display the image in Python
from PIL import Image
import matplotlib.pyplot as plt

img = Image.open("flipper_ridges.png")
plt.imshow(img)
plt.axis('off')
plt.show()