## Intro

**Author**: Andre Schomakers

**Date**: 5 Mar 2025

This interactive Python notebook (`.ipynb`) compares two different mean functions on the numerical columns of the well-known iris dataset.

The mean can be mathematically described by:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

Where:
- $\bar{x}$ represents the mean
- $n$ is the number of observations
- $x_i$ is the value of the $i$-th observation
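
As a minimal sketch, the formula can be written out directly in plain Python. The helper name `mean_from_definition` and the sample values are assumptions made for illustration and are not part of the notebook's own code.

```python
from typing import Sequence


def mean_from_definition(values: Sequence[float]) -> float:
    """Compute the arithmetic mean exactly as the formula states (hypothetical helper)."""
    n = len(values)  # n: number of observations
    if n == 0:
        raise ValueError("mean is undefined for an empty sequence")

    total = 0.0
    for x_i in values:  # sum over x_1 ... x_n
        total += x_i

    return total / n  # multiply the sum by 1/n


sample = [2.0, 4.0, 6.0]  # illustrative values, not taken from iris
print(mean_from_definition(sample))  # 4.0
```

The pandas-based functions below compute the same quantity column by column.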

#### Initial naive calcMeans()

Initial setup and function definitions (defined here, executed further below).
Note that I'm using type hints here; they are optional in Python and are not enforced at runtime.

In [13]:
# install packages
!pip install pandas
!pip install scikit-learn



In [14]:
# import pkg
import pandas as pd
import numpy as np
from sklearn import datasets # needed for the iris dataset

In [15]:
# define the calcMeans function using apply
def calcMeans(dataframe: pd.DataFrame) -> np.ndarray:
    """
    Calculate the mean values of each column in a given dataframe using apply function.

    Parameters:
    -----------
    dataframe : pandas.DataFrame
        The input dataframe with numeric columns

    Returns:
    --------
    numpy.ndarray
        Array containing mean values for each column
    """
    # Use apply function to calculate mean for each column
    means = dataframe.apply(lambda col: col.mean())

    # Return the underlying NumPy array so the result matches the annotated return type
    return means.to_numpy()

def validate_means(dataframe: pd.DataFrame) -> None:
    """
    Helper function that validates our results against pandas' describe().

    Parameters:
    -----------
    dataframe : pandas.DataFrame
        The input dataframe to validate

    Returns:
    --------
    None
    """
    print("Our calculated means:")
    print(calcMeans(dataframe))

    print("\nPandas describe means:")
    print(dataframe.describe().loc['mean'].values)

    # Check whether our calculation matches pandas; compare with a tolerance,
    # because exact equality (==) on floating-point values is fragile
    matches = np.allclose(calcMeans(dataframe), dataframe.describe().loc['mean'].values)
    print("\nDo our means match pandas?", matches)

#### Execution of code on Sklearn's iris dataset



In [16]:
iris: tuple[pd.DataFrame, pd.Series] = datasets.load_iris(return_X_y=True, as_frame=True)

# extract the feature DataFrame (the X part)
iris_features = iris[0]

# printing df's head
print("First 5 rows of iris dataset features:")
print(iris_features.head())

# call our own programmed fx
print("\nMean values calculated with calcMeans function:")
iris_our_means: np.ndarray = calcMeans(iris_features)
print(iris_our_means) # shape (4,)

# comparison vs pandas .describe()
print("\nMean values from pandas describe function:")
print(iris_features.describe().loc['mean'].values)

First 5 rows of iris dataset features:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2

Mean values calculated with calcMeans function:
[5.84333333 3.05733333 3.758      1.19933333]

Mean values from pandas describe function:
[5.84333333 3.05733333 3.758      1.19933333]


Both arrays contain the same four mean values for the columns `sepal length (cm)`, `sepal width (cm)`, `petal length (cm)` and `petal width (cm)`.
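
Rather than comparing the printed arrays by eye, the agreement can also be checked programmatically. The following is a minimal sketch, assuming only the five rows shown by `head()` above (retyped by hand) and using `np.allclose` for a tolerance-based comparison; the variable names are illustrative and not part of the notebook's own code.

```python
import numpy as np
import pandas as pd

# The five rows printed by iris_features.head() above, retyped by hand.
head_rows = pd.DataFrame(
    {
        "sepal length (cm)": [5.1, 4.9, 4.7, 4.6, 5.0],
        "sepal width (cm)": [3.5, 3.0, 3.2, 3.1, 3.6],
        "petal length (cm)": [1.4, 1.4, 1.3, 1.5, 1.4],
        "petal width (cm)": [0.2, 0.2, 0.2, 0.2, 0.2],
    }
)

# Same approach as calcMeans: apply a per-column mean, then take the array.
apply_means = head_rows.apply(lambda col: col.mean()).to_numpy()

# Reference values from the 'mean' row of describe().
describe_means = head_rows.describe().loc["mean"].to_numpy()

# Tolerance-based comparison instead of reading the printed arrays by eye.
means_match = bool(np.allclose(apply_means, describe_means))
print("Means agree within tolerance:", means_match)
```

The same check should carry over to the full `iris_features` frame by swapping it in for `head_rows`.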