# EDA > Explore

<div class="alert alert-info">Compute summary statistics for numeric columns</div>

The `explore` function provides a quick way to calculate summary statistics for numeric columns in your data. It supports grouping and custom aggregation functions.

In [1]:
import polars as pl
import pyrsm as rsm

## set up pyrsm for autoreload
%reload_ext autoreload
%autoreload 2
%aimport pyrsm

# Diamonds Dataset

The diamonds dataset contains prices and attributes of 3,000 diamonds. We'll use it to explore summary statistics.

In [2]:
diamonds = pl.read_parquet("https://github.com/radiant-ai-hub/pyrsm/raw/refs/heads/main/examples/data/data/diamonds.parquet")
diamonds

price,carat,clarity,cut,color,depth,table,x,y,z,date
i32,f64,enum,enum,enum,f64,f64,f64,f64,f64,date
580,0.32,"""VS1""","""Ideal""","""H""",61.0,56.0,4.43,4.45,2.71,2012-02-26
650,0.34,"""SI1""","""Very Good""","""G""",63.4,57.0,4.45,4.42,2.81,2012-02-26
630,0.3,"""VS2""","""Very Good""","""G""",63.1,58.0,4.27,4.23,2.68,2012-02-26
706,0.35,"""VVS2""","""Ideal""","""H""",59.2,56.0,4.6,4.65,2.74,2012-02-26
1080,0.4,"""VS2""","""Premium""","""F""",62.6,58.0,4.72,4.68,2.94,2012-02-26
…,…,…,…,…,…,…,…,…,…,…
4173,1.14,"""SI1""","""Very Good""","""J""",63.3,55.0,6.6,6.67,4.2,2015-12-01
8396,1.51,"""SI1""","""Ideal""","""I""",61.2,60.0,7.39,7.37,4.52,2015-12-01
449,0.32,"""VS2""","""Premium""","""I""",62.6,58.0,4.37,4.42,2.75,2015-12-01
4370,0.91,"""VS1""","""Very Good""","""H""",62.1,59.0,6.17,6.2,3.84,2015-12-01


In [3]:
rsm.md("https://raw.githubusercontent.com/radiant-ai-hub/pyrsm/refs/heads/main/examples/data/data/diamonds_description.md")

## Diamond prices

Prices of 3,000 round cut diamonds

### Description

A dataset containing the prices and other attributes of a sample of 3000 diamonds. The variables are as follows:

### Variables

- price = price in US dollars ($338--$18,791)
- carat = weight of the diamond (0.2--3.00)
- clarity = a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
- cut = quality of the cut (Fair, Good, Very Good, Premium, Ideal)
- color = diamond color, from J (worst) to D (best)
- depth = total depth percentage = z / mean(x, y) = 2 * z / (x + y) (54.2--70.80)
- table = width of top of diamond relative to widest point (50--69)
- x = length in mm (3.73--9.42)
- y = width in mm (3.71--9.26)
- z = depth in mm (2.33--5.58)
- date = shipment date

### Additional information

<a href="http://www.diamondse.info/diamonds-clarity.asp" target="_blank">Diamond search engine</a>

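The `depth` definition in the variable list can be checked against rows from the data preview. The sketch below is plain Python; the helper name `depth_pct` is hypothetical, and the factor of 100 is an assumption made so that the result is on the same percentage scale as the `depth` column.

```python
# Rows copied from the diamonds preview: (x, y, z, reported depth)
rows = [
    (4.43, 4.45, 2.71, 61.0),
    (4.45, 4.42, 2.81, 63.4),
    (4.27, 4.23, 2.68, 63.1),
]


def depth_pct(x: float, y: float, z: float) -> float:
    """Total depth percentage: 100 * z / mean(x, y) = 100 * 2 * z / (x + y)."""
    return 100 * 2 * z / (x + y)


for x, y, z, reported in rows:
    print(f"computed={depth_pct(x, y, z):.1f}  reported={reported}")
```

For these three rows the computed value, rounded to one decimal, agrees with the reported `depth`.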

## Basic Usage

By default, `explore` calculates the mean, median, min, max, and standard deviation (`sd`) for all numeric columns.

In [4]:
rsm.eda.explore(diamonds)

statistic,price,carat,depth,table,x,y,z
str,f64,f64,f64,f64,f64,f64,f64
"""mean""",3907.186,0.794283,61.752667,57.465333,5.721823,5.7233,3.533447
"""median""",2407.0,0.7,61.9,57.0,5.71,5.72,3.52
"""min""",338.0,0.2,54.2,50.0,3.73,3.71,2.33
"""max""",18791.0,3.0,70.8,69.0,9.42,9.26,5.58
"""sd""",3956.9154,0.473826,1.446028,2.241102,1.124055,1.114313,0.693858


## Select Specific Columns

Use `cols` to focus on specific variables.

In [5]:
rsm.eda.explore(diamonds, cols=["price", "carat"])

statistic,price,carat
str,f64,f64
"""mean""",3907.186,0.794283
"""median""",2407.0,0.7
"""min""",338.0,0.2
"""max""",18791.0,3.0
"""sd""",3956.9154,0.473826


## Custom Summary Functions

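Use `agg` to specify which statistics to compute. Available functions: mean, median, sum, std, var, min, max, count, n_unique, null_count. Note that a requested `std` is labeled `std` in the output, whereas the default output labels the same statistic `sd`.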

In [6]:
rsm.eda.explore(diamonds, cols=["price", "carat"], agg=["mean", "median", "std", "min", "max"])

statistic,price,carat
str,f64,f64
"""mean""",3907.186,0.794283
"""median""",2407.0,0.7
"""std""",3956.9154,0.473826
"""min""",338.0,0.2
"""max""",18791.0,3.0


## Grouped Statistics

Use `by` to compute statistics within groups.

In [7]:
rsm.eda.explore(diamonds, cols=["price"], by="cut")

cut,price_mean,price_median,price_min,price_max,price_sd
enum,f64,f64,i32,i32,f64
"""Fair""",4505.237624,3323.0,497,16386,3749.540458
"""Premium""",4369.40856,2858.0,367,18745,4236.977216
"""Ideal""",3470.223639,1788.0,362,18791,3827.423266
"""Good""",4130.432727,3259.0,339,16776,3730.353553
"""Very Good""",3959.915805,2745.0,338,18678,3895.898807


In [8]:
rsm.eda.explore(diamonds, cols=["price", "carat"], by="color", agg=["mean", "median", "count"])

color,price_mean,price_median,price_count,carat_mean,carat_median,carat_count
enum,f64,f64,u32,f64,f64,u32
"""G""",3970.572864,2250.0,597,0.773585,0.7,597
"""J""",5642.012195,4413.5,164,1.192744,1.14,164
"""D""",3217.002618,1857.0,382,0.665393,0.54,382
"""E""",3284.595668,1791.0,554,0.679242,0.55,554
"""H""",4250.301762,3482.0,454,0.879802,0.9,454
"""F""",3654.492035,2298.0,565,0.727965,0.69,565
"""I""",4869.190141,3672.5,284,1.000704,0.96,284


# Titanic Dataset

The Titanic dataset contains passenger information from the 1912 disaster.

In [9]:
titanic = pl.read_parquet("https://github.com/radiant-ai-hub/pyrsm/raw/refs/heads/main/examples/data/data/titanic.parquet")

In [10]:
rsm.md("https://raw.githubusercontent.com/radiant-ai-hub/pyrsm/refs/heads/main/examples/data/data/titanic_description.md")

## Titanic

This dataset describes the survival status of individual passengers on the Titanic. The titanic data frame does not contain information from the crew, but it does contain actual ages of (some of) the passengers. The principal source for data about Titanic passengers is the Encyclopedia Titanica. One of the original sources is Eaton & Haas (1994) Titanic: Triumph and Tragedy, Patrick Stephens Ltd, which includes a passenger list created by many researchers and edited by Michael A. Findlay.

## Variables

* survived - Survival (Yes, No)
* pclass - Passenger Class (1st, 2nd, 3rd)
* sex - Sex (female, male)
* age - Age in years
* sibsp - Number of Siblings/Spouses Aboard
* parch - Number of Parents/Children Aboard
* fare - Passenger Fare
* name - Name
* cabin - Cabin
* embarked - Port of Embarkation (Cherbourg, Queenstown, Southampton)

## Notes

`pclass` is a proxy for socio-economic status (SES): 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower.

Age is in years; it is fractional if the age is less than one (1). If the age is estimated, it is in the form xx.5.

With respect to the family relation variables (i.e., `sibsp` and `parch`), some relations were ignored. The following definitions are used for `sibsp` and `parch`:

* Sibling: Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
* Spouse: Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
* Parent: Mother or Father of Passenger Aboard Titanic
* Child: Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic

Other family relatives excluded from this study include cousins, nephews/nieces, aunts/uncles, and in-laws. Some children travelled only with a nanny, so `parch` is 0 for them. Others travelled with very close friends or neighbors from their village, but the definitions do not cover such relations.

Note: Missing values and the `ticket` variable were removed from the data

## Related reading

<a href="http://phys.org/news/2012-07-shipwrecks-men-survive.html" target="_blank">In shipwrecks, men more likely to survive</a>

## Explore All Numeric Columns

In [11]:
rsm.eda.explore(titanic)

statistic,age,sibsp,parch,fare
str,f64,f64,f64,f64
"""mean""",29.813199,0.504314,0.42186,36.603024
"""median""",28.0,0.0,0.0,15.75
"""min""",0.1667,0.0,0.0,0.0
"""max""",80.0,8.0,6.0,512.329224
"""sd""",14.366261,0.91308,0.840655,55.753648


## Statistics by Passenger Class

In [12]:
rsm.eda.explore(titanic, cols=["age", "fare"], by="pclass", agg=["mean", "median", "std", "count"])

pclass,age_mean,age_median,age_std,age_count,fare_mean,fare_median,fare_std,fare_count
enum,f64,f64,f64,u32,f64,f64,f64,u32
"""1st""",39.083038,39.0,14.535653,282,92.316092,67.950001,82.88817,282
"""2nd""",29.506705,29.0,13.638628,261,21.855044,15.75,13.540335,261
"""3rd""",24.745,24.0,11.862897,500,12.879299,8.05,9.733091,500


## Statistics by Survival Status

In [13]:
rsm.eda.explore(titanic, cols=["age", "fare"], by="survived", agg=["mean", "median", "count"])

survived,age_mean,age_median,age_count,fare_mean,fare_median,fare_count
enum,f64,f64,u32,f64,f64,u32
"""Yes""",28.81902,28.0,425,53.258884,26.25,425
"""No""",30.496899,28.0,618,25.148752,13.0,618


## Check for Missing Values

In [14]:
rsm.eda.explore(titanic, agg=["count", "null_count", "n_unique"])

statistic,age,sibsp,parch,fare
str,i64,i64,i64,i64
"""count""",1043,1043,1043,1043
"""null_count""",0,0,0,0
"""n_unique""",97,7,7,255


© Vincent Nijs (2026)