Spring 2025 <br>
Lecture 07

# Distributions
- Creating plots that show an entire distribution (box plots and histograms)
- Dataset: Diamonds Dataset

## Metadata
- price: price in US dollars (\$326--\$18,823)
- carat: weight of the diamond (0.2--5.01)
- cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal)
- color: diamond colour, from J (worst) to D (best)
- clarity: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
- x: length in mm (0--10.74)
- y: width in mm (0--58.9)
- z: depth in mm (0--31.8)
- depth: total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43--79)
- table: width of top of diamond relative to widest point (43--95)
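
As a quick sanity check on the `depth` definition, the sketch below recomputes it from `x`, `y`, and `z` for the five rows that appear in the `head()` output of these notes. The column names `depth_calc` and `abs_diff` are made up for illustration, and small mismatches are expected because the recorded measurements are rounded.

```python
import pandas as pd

# First five rows of the diamonds data (values copied from the head() output)
sample = pd.DataFrame({
    "x": [3.95, 3.89, 4.05, 4.20, 4.34],
    "y": [3.98, 3.84, 4.07, 4.23, 4.35],
    "z": [2.43, 2.31, 2.31, 2.63, 2.75],
    "depth": [61.5, 59.8, 56.9, 62.4, 63.3],
})

# depth (%) = 100 * z / mean(x, y) = 100 * 2 * z / (x + y)
sample["depth_calc"] = 100 * 2 * sample["z"] / (sample["x"] + sample["y"])

# Absolute gap between the recorded depth and the recomputed depth
sample["abs_diff"] = (sample["depth"] - sample["depth_calc"]).abs()
print(sample.round(2))
```

For these rows the recomputed values land within a fraction of a percentage point of the recorded `depth`.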

In [15]:
# Imports
import pandas as pd
import plotly.express as px

# Load data
df_diamonds = pd.read_csv("data/diamonds.csv")

# "Drop" a column
df_diamonds.drop(
    # List all columns to drop
    ['Unnamed: 0'],
    axis = 1, # axis = 0 ---> rows, axis = 1 ---> columns
    inplace = True
)

df_diamonds.head()

,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75


## Examples

### Histograms
1. Create a histogram of the `price` variable. Add a reference line at the 75th percentile of price.
2. Create a histogram of `price` colored by the `cut` variable.
3. Create a histogram of `price` with facets by the `cut` variable.

### Boxplots
4. Create a vertical and horizontal box plot of the `price` variable.
5. Create a horizontal box plot of the `price` variable by the `cut` variable. Can color be added?

#### Example 1

- Percentile vs. Quantile
    - Percentiles are expressed in percentage points, whereas quantiles are expressed as decimals
    - 90th Percentile = 0.90 quantile
- 75th Percentile / 0.75 Quantile --> Q3
- 50th Percentile / 0.5 Quantile --> Q2
    - The Median is the 50th Percentile
- 25th Percentile / 0.25 Quantile --> Q1

- The 75th percentile is the value that 75% of the data fall below.
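
To make the percentile/quantile correspondence concrete, here is a minimal sketch on a toy series (the values 1 through 9 are invented for illustration, not taken from the diamonds data). pandas' `quantile` takes a decimal in [0, 1], while NumPy's `percentile` takes a number in [0, 100].

```python
import numpy as np
import pandas as pd

# Toy data: nine evenly spaced values
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

q1 = values.quantile(0.25)  # 25th percentile
q2 = values.quantile(0.50)  # 50th percentile
q3 = values.quantile(0.75)  # 75th percentile
print(q1, q2, q3)  # 3.0 5.0 7.0

# The median is the 0.5 quantile (Q2)
print(q2 == values.median())  # True

# Same cut point, expressed as a percentile instead of a quantile
print(np.percentile(values, 75) == q3)  # True
```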

In [16]:
q75_price = df_diamonds["price"].quantile(q = 0.75).astype(int) # .astype(int) --> truncates the float to an integer (whole number)

q75_label = f"   ${q75_price}, 75th percentile"
print(q75_label)

# print(
#     "Median:", df_diamonds["price"].median(),
#     "Q2:", df_diamonds["price"].quantile(q = 0.5)
# )

   $5324, 75th percentile


In [22]:
fig = px.histogram(
    df_diamonds,
    x = "price", # x ---> Numeric variable
    title = "Diamond prices are positively skewed",
    template = "simple_white",
    # Change labels (for variables in the dataset)
    labels={
        "price": "Price ($)"
    },
    subtitle=f"While very few diamonds are priced at over $18K, 75% of the diamonds cost less than ${q75_price}.",
    color_discrete_sequence=["salmon"],
    width = 1000,
    nbins=100
)

# Overwrite the y-axis variable name
fig.update_yaxes(title = "Count")
# Add a reference line at the 75th percentile
fig.add_vline(
    x=q75_price,
    line_width = 3,
    line_color = "green",
    line_dash = "dot", # dot, dash, solid
    annotation_text = q75_label
)
# Font customization
fig.update_layout(
    font_family = "Helvetica",
    title_font_family = "Georgia",
    title_font_size = 22
)
fig.show()

#### Example 2

*Plotly Color Palettes*

1. Sequential: Colors go in sequential order (Ordinal or Numerical variables)
2. Categorical (called *qualitative* in Plotly): Colors have no order and are easily distinguished (Nominal Variable - No inherent order: e.g. Basketball Positions or Genres)
3. Diverging: Colors that go in opposite directions from the center (Only use if ends of a variable are meaningful -- Ex: Temperature, Hot = Red, Cold = Blue)

In [None]:
fig = px.histogram(
    df_diamonds.sort_values(by = "cut"),
    x = "price",
    color = "cut",
    color_discrete_sequence = px.colors.sequential.Aggrnyl,
    labels = {
        "price": "Price ($)",
        "cut": "Cut"
    },
    # Sort the "cut" variable in order
    category_orders={
        "cut": ["Fair", "Good", "Very Good", "Premium", "Ideal"]
    },
    template="simple_white",
    width=1000,
    title="<b>Diamond prices are positively skewed across all cuts</b>"
)

# Font customization
fig.update_layout(
    font_family = "Helvetica",
    title_font_family = "Georgia",
    title_font_size = 22
)
fig.show()

#### Example 3

In [46]:
fig = px.histogram(
    df_diamonds,
    x = "price",
    facet_col="cut",
    facet_col_wrap = 1,
    width = 1000,
    height = 800,
    color = "cut",
    color_discrete_sequence = px.colors.sequential.Aggrnyl,
    labels = {
        "price": "Price ($)",
        "cut": "Cut"
    },
    # Sort the "cut" variable in order
    category_orders={
        "cut": ["Fair", "Good", "Very Good", "Premium", "Ideal"]
    },
    template="simple_white",
    title="<b>Diamond prices are positively skewed across all cuts</b>"
)

fig.update_layout(showlegend = False)
# Font customization
fig.update_layout(
    font_family = "Helvetica",
    title_font_family = "Georgia",
    title_font_size = 22
)
fig.update_yaxes(title = "Count")
fig.show()

#### Example 4

- IQR (interquartile range) = Q3 - Q1
- Lower Fence = Q1 - (IQR) * 1.5
- Upper Fence = Q3 + (IQR) * 1.5
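
The fence formulas can be tried on a toy series before applying them to the diamonds data. This is a minimal sketch with invented values (one deliberately extreme point), showing how values outside the fences get flagged as outliers. Plotting libraries may use a slightly different quartile convention, so fences drawn on a chart can differ a little from these numbers.

```python
import pandas as pd

# Toy data with one extreme value
toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

q1 = toy.quantile(0.25)  # 3.0
q3 = toy.quantile(0.75)  # 7.0
iqr = q3 - q1            # 4.0

lower_fence = q1 - 1.5 * iqr  # -3.0
upper_fence = q3 + 1.5 * iqr  # 13.0

# Anything below the lower fence or above the upper fence is an outlier
outliers = toy[(toy < lower_fence) | (toy > upper_fence)]
print(outliers.tolist())  # [100]
```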

In [49]:
q3_price = df_diamonds["price"].quantile(0.75)
q1_price = df_diamonds["price"].quantile(0.25)

IQR = q3_price - q1_price

upper_fence = q3_price + IQR * 1.5
print(upper_fence)

lower_fence = q1_price - IQR * 1.5
print(lower_fence)

11885.625
-5611.375


In [53]:
title = "<b>50% of the diamonds are priced between $950 and $5,324</b>"
subtitle = f"A typical diamond is priced at $2,401. Diamonds valued over ${upper_fence:,.0f} are considered outliers."

# Horizontal
fig = px.box(
    df_diamonds,
    x = "price",
    title = title,
    subtitle = subtitle,
    width = 1000,
    template="plotly_white",
    color_discrete_sequence=["cornflowerblue"],
    labels={"price": "Price ($)",}
)
# Font customization
fig.update_layout(
    font_family = "Helvetica",
    title_font_family = "Georgia",
    title_font_size = 22
)

In [56]:
# Vertical
fig = px.box(
    df_diamonds,
    y = "price",
    title = title,
    subtitle = subtitle,
    width = 600,
    height = 800,
    template="plotly_white",
    color_discrete_sequence=["cornflowerblue"],
    labels={"price": "Price ($)",}
)
# Font customization
fig.update_layout(
    font_family = "Helvetica",
    title_font_family = "Georgia",
    title_font_size = 18
)

#### Example 5

In [70]:
fig = px.box(
    df_diamonds,
    x = "price",
    y = "cut",
    category_orders={
        "cut": ["Fair", "Good", "Very Good", "Premium", "Ideal"]
    },
    color = "cut",
    width = 1000,
    height = 600,
    template = "plotly_white",
    title = "<b>Fair-cut diamonds have the highest median price ($3,282)</b>",
    subtitle = "Fair and Premium cut diamonds have the highest typical prices. Generally, the Premium cut diamonds have the fewest outliers.",
    color_discrete_sequence=px.colors.sequential.Viridis
)
# Font customization
fig.update_layout(
    font_family = "Helvetica",
    title_font_family = "Georgia",
    title_font_size = 18
)