# DataFrame Statistics

In [0]:
import numpy as np
import pandas as pd

pandas provides many methods for computing descriptive statistics on `Series` and `DataFrame` objects. Most are aggregations that produce a lower-dimensional result, such as `sum()`, `mean()`, and `quantile()`, but some, such as `cumsum()` and `cumprod()`, return an object of the same size.
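To make the difference between the two kinds of methods concrete, here is a minimal sketch on a small toy frame (the name `toy` and its values are made up for illustration and are not the demo data used later):

```python
import pandas as pd

# Hypothetical toy data, not the demo DataFrame used later in this notebook
toy = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})

# Aggregation: one value per column, so the result is a Series
totals = toy.sum()
print(totals.shape)   # (2,)

# Cumulative method: the result has the same shape as the input
running = toy.cumsum()
print(running.shape)  # (3, 2)
```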

Generally speaking, these methods take an axis argument, and the axis can be specified by name or integer:

- `Series`: no axis argument needed
- `DataFrame`: `"index"` (`axis=0`, default), `"columns"` (`axis=1`)
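As a quick illustration of the two spellings, the sketch below assumes a small made-up frame called `toy` and checks that the name and the integer select the same axis:

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})

# axis=0 or "index": collapse the rows, giving one result per column
by_column = toy.sum(axis=0)
same_by_column = toy.sum(axis="index")

# axis=1 or "columns": collapse the columns, giving one result per row
by_row = toy.sum(axis=1)
same_by_row = toy.sum(axis="columns")

print(by_column.equals(same_by_column))  # True
print(by_row.equals(same_by_row))        # True
```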

In [0]:
# Create a demo DataFrame
df = pd.DataFrame({'date': pd.date_range('2020-01-01', periods=5),
                   'numbers': [np.nan, 1, 8, 5, 1],
                   'fractions': [0.481236, 0.758691, 0.977380, 0.992931, np.nan],
                   'category': pd.Categorical(["test", "train", "test", "train", "test"]),
                   'boolean': pd.array([True, False, False, False, True], dtype='boolean')},
                  index=['a', 'b', 'c', 'd', 'e'])
df

| |date|numbers|fractions|category|boolean|
|--- |--- |--- |--- |--- |--- |
|a|2020-01-01|NaN|0.481236|test|True|
|b|2020-01-02|1.0|0.758691|train|False|
|c|2020-01-03|8.0|0.977380|test|False|
|d|2020-01-04|5.0|0.992931|train|False|
|e|2020-01-05|1.0|NaN|test|True|


# Statistics over rows or columns

## Mean over columns

In [0]:
df.mean(axis=0)

numbers      3.750000
fractions    0.802559
dtype: float64

## Mean over rows

In [0]:
df.mean(axis=1)

a    0.481236
b    0.879346
c    4.488690
d    2.996466
e    1.000000
dtype: float64

## NaN values

Notice in the above examples that NaN values were ignored in the calculations.

Most of these statistical methods accept a `skipna` option that controls whether missing data is excluded (`True` by default):

In [0]:
# Sum over rows, letting NaN propagate into the result
df.sum(axis=1, skipna=False)

a         NaN
b    1.758691
c    8.977380
d    5.992931
e         NaN
dtype: float64

In [0]:
# Sum over rows, skipping NaN
df.sum(axis=1, skipna=True)

a    0.481236
b    1.758691
c    8.977380
d    5.992931
e    1.000000
dtype: float64

In [0]:
# Sum over columns, skipping NaN
df.sum(axis=0, skipna=True)

numbers                          15
fractions                   3.21024
category     testtraintesttraintest
boolean                           2
dtype: object
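The `object` result above comes from summing columns of different types in one call; how non-numeric columns are handled can differ between pandas versions. A minimal sketch of restricting an aggregation to numeric columns with the `numeric_only` argument, using a made-up frame called `mixed`:

```python
import pandas as pd

# Hypothetical mixed-type data for illustration only
mixed = pd.DataFrame({"count": [1, 2, 3], "label": ["a", "b", "c"]})

# Only numeric columns take part in the sum; "label" is left out
numeric_totals = mixed.sum(numeric_only=True)
print(numeric_totals)
```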

Here is a quick reference summary table of common functions.

|Function|Description|
|--- |--- |
|`count`|Number of non-NA observations|
|`sum`|Sum of values|
|`mean`|Mean of values|
|`mad`|Mean absolute deviation (deprecated in pandas 1.5, removed in 2.0)|
|`median`|Arithmetic median of values|
|`min`|Minimum|
|`max`|Maximum|
|`mode`|Mode|
|`abs`|Absolute Value|
|`prod`|Product of values|
|`std`|Bessel-corrected sample standard deviation|
|`var`|Unbiased variance|
|`sem`|Standard error of the mean|
|`skew`|Sample skewness (3rd moment)|
|`kurt`|Sample kurtosis (4th moment)|
|`quantile`|Sample quantile (value at %)|
|`cumsum`|Cumulative sum|
|`cumprod`|Cumulative product|
|`cummax`|Cumulative maximum|
|`cummin`|Cumulative minimum|
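The "Bessel-corrected" and "unbiased" entries in the table mean that `std` and `var` divide by `n - 1` by default (`ddof=1`), whereas NumPy's `np.std` defaults to dividing by `n`. The sketch below, on a made-up `values` series, shows how the two relate and how `sem` follows from `std`:

```python
import numpy as np
import pandas as pd

# Hypothetical sample for illustration only
values = pd.Series([1.0, 8.0, 5.0, 1.0])

sample_std = values.std()            # divides by n - 1 (ddof=1)
population_std = values.std(ddof=0)  # divides by n

# NumPy defaults to ddof=0, so it matches the population version
numpy_std = np.std(values.to_numpy())

# Standard error of the mean: sample std divided by sqrt(n)
standard_error = values.sem()

print(np.isclose(population_std, numpy_std))                             # True
print(np.isclose(standard_error, sample_std / np.sqrt(len(values))))    # True
```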

# Unique values

In [0]:
# Get the `numbers` Series from the DataFrame
df['numbers']

a    NaN
b    1.0
c    8.0
d    5.0
e    1.0
Name: numbers, dtype: float64

`Series.unique()` will return the unique values in a `Series`:

In [0]:
df['numbers'].unique()

array([nan,  1.,  8.,  5.])

`Series.nunique()` will return the number of unique non-NA values in a `Series`:

In [0]:
df['numbers'].nunique()

3
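Note that `unique()` keeps NaN in its result while `nunique()` leaves it out by default. A small sketch of how the `dropna` argument changes this, and of `value_counts` as a way to get per-value counts, using a stand-alone series with the same values as the `numbers` column:

```python
import numpy as np
import pandas as pd

# Same values as the `numbers` column, rebuilt so the example is self-contained
numbers = pd.Series([np.nan, 1, 8, 5, 1])

without_nan = numbers.nunique()           # 3
with_nan = numbers.nunique(dropna=False)  # 4

# Occurrences of each distinct value, keeping NaN as its own entry
counts = numbers.value_counts(dropna=False)
print(counts)
```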

# Summarize data with describe

Data can be summarized with the `.describe()` method. It works the same way for both `Series` and `DataFrame`.

As with other statistical methods, NaN is excluded from the results.


In [0]:
# Summarize a Series
df['numbers'].describe()

count    4.00000
mean     3.75000
std      3.40343
min      1.00000
25%      1.00000
50%      3.00000
75%      5.75000
max      8.00000
Name: numbers, dtype: float64

In [0]:
# Summarize a DataFrame
df.describe()

| |numbers|fractions|
|--- |--- |--- |
|count|4.0|4.0|
|mean|3.75|0.802559|
|std|3.40343|0.239428|
|min|1.0|0.481236|
|25%|1.0|0.689327|
|50%|3.0|0.868035|
|75%|5.75|0.981268|
|max|8.0|0.992931|


We can also select specific percentiles to be included in the result.

In [0]:
df.describe(percentiles=[.05, .25, .75, .95])

| |numbers|fractions|
|--- |--- |--- |
|count|4.0|4.0|
|mean|3.75|0.802559|
|std|3.40343|0.239428|
|min|1.0|0.481236|
|5%|1.0|0.522854|
|25%|1.0|0.689327|
|50%|3.0|0.868035|
|75%|5.75|0.981268|
|95%|7.55|0.990598|
|max|8.0|0.992931|
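The percentile rows correspond to what `quantile()` returns for the same fraction. A minimal sketch that checks this for the non-missing values of the `numbers` column (rebuilt here as a stand-alone series):

```python
import pandas as pd

# Non-missing values of the `numbers` column, rebuilt for a self-contained example
numbers = pd.Series([1.0, 8.0, 5.0, 1.0])

summary = numbers.describe(percentiles=[0.05, 0.95])

# Each requested percentile is labelled as a percentage string
p05 = numbers.quantile(0.05)
p95 = numbers.quantile(0.95)

print(summary["5%"], p05)
print(summary["95%"], p95)
```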


For a non-numerical `Series` object, `describe()` will give a simple summary of the number of unique values and most frequently occurring values:

In [0]:
df['category']

a     test
b    train
c     test
d    train
e     test
Name: category, dtype: category
Categories (2, object): [test, train]

In [0]:
df['category'].describe()

count        5
unique       2
top       test
freq         3
Name: category, dtype: object

Note that on a mixed-type `DataFrame`, `describe()` restricts the summary to numerical columns or, if there are none, to categorical columns.

This behavior can be controlled by passing a list of types to the `include` and `exclude` arguments. The special value `'all'` can also be used:


In [0]:
# Include only numbers (default)
df.describe(include=['number'])

| |numbers|fractions|
|--- |--- |--- |
|count|4.0|4.0|
|mean|3.75|0.802559|
|std|3.40343|0.239428|
|min|1.0|0.481236|
|25%|1.0|0.689327|
|50%|3.0|0.868035|
|75%|5.75|0.981268|
|max|8.0|0.992931|


In [0]:
# Include only category
df.describe(include=['category'])

| |category|
|--- |--- |
|count|5|
|unique|2|
|top|test|
|freq|3|


In [0]:
# Include all
df.describe(include='all')

| |date|numbers|fractions|category|boolean|
|--- |--- |--- |--- |--- |--- |
|count|5|4.0|4.0|5|5|
|unique|5|NaN|NaN|2|2|
|top|2020-01-03 00:00:00|NaN|NaN|test|False|
|freq|1|NaN|NaN|3|3|
|first|2020-01-01 00:00:00|NaN|NaN|NaN|NaN|
|last|2020-01-05 00:00:00|NaN|NaN|NaN|NaN|
|mean|NaN|3.75|0.802559|NaN|NaN|
|std|NaN|3.40343|0.239428|NaN|NaN|
|min|NaN|1.0|0.481236|NaN|NaN|
|25%|NaN|1.0|0.689327|NaN|NaN|
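The examples above all use `include`. As a complement, here is a small sketch of the `exclude` argument on a made-up two-column frame called `mixed` (the column names are hypothetical):

```python
import pandas as pd

# Hypothetical mixed-type data for illustration only
mixed = pd.DataFrame({"score": [1.5, 2.5, 3.5], "label": ["a", "b", "a"]})

# Default: only the numeric column is summarized
numeric_summary = mixed.describe()

# exclude drops the listed types, leaving the non-numeric column
label_summary = mixed.describe(exclude=["number"])

print(list(numeric_summary.columns))  # ['score']
print(list(label_summary.columns))    # ['label']
```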


# Max/Min Index

It is often desirable to find the index of the minimum and maximum values, which allows us to see what is going on in the rest of the row.

Pandas provides helper functions `idxmin` and `idxmax` for this purpose.

In [0]:
# Create a new dataframe with only numbers
df_numbers = pd.DataFrame(np.random.rand(5,3), columns=['A', 'B', 'C'])
df_numbers

| |A|B|C|
|--- |--- |--- |--- |
|0|0.736578|0.834027|0.672312|
|1|0.477935|0.884255|0.913437|
|2|0.241817|0.891527|0.185116|
|3|0.219596|0.072499|0.512523|
|4|0.654276|0.932431|0.493025|


### Minimum

In [0]:
# Find the index of the minimum of each column with `idxmin`
df_numbers.idxmin(axis=0)

A    3
B    3
C    2
dtype: int64

In [0]:
# Find the minimum for a specific column series with `idxmin`
a_min = df_numbers['A'].idxmin()
a_min

3

### Maximum

In [0]:
# Find the maximum across the entire dataframe for each column with `idxmax`
df_numbers.idxmax(axis=0)

A    0
B    4
C    1
dtype: int64

In [0]:
# Find the maximum for a specific column series with `idxmax`
a_max = df_numbers['A'].idxmax()
a_max

0

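Now that we have some index labels of interest, we can use them to look up the corresponding rows. Because `idxmin` and `idxmax` return index labels rather than positions, `loc` is the matching accessor (with the default integer index used here, `iloc` would return the same rows).

In [0]:
# Get the entire row that contains the minimum value in column A
df_numbers.loc[a_min]

A    0.219596
B    0.072499
C    0.512523
Name: 3, dtype: float64

In [0]:
# Get the entire row that contains the maximum value in column A
df_numbers.loc[a_max]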

A    0.736578
B    0.834027
C    0.672312
Name: 0, dtype: float64
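`idxmin` and `idxmax` return index labels, not integer positions. With the default integer index the two coincide, but with any other index the label has to be looked up by label. A minimal sketch with a made-up string-indexed frame called `scores`:

```python
import pandas as pd

# Hypothetical data with a non-integer index, for illustration only
scores = pd.DataFrame(
    {"A": [0.7, 0.2, 0.9], "B": [0.1, 0.8, 0.4]},
    index=["x", "y", "z"],
)

label_of_max = scores["A"].idxmax()  # "z", a label rather than a position
row = scores.loc[label_of_max]       # label-based lookup of the full row
print(row)
```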

# Value Counts (Histogram)

A pandas `Series` has a `value_counts` method that counts how often each distinct value occurs, which amounts to a 1D histogram.

In [0]:
# Create array data with repeating values
data = np.random.randint(0, 7, size=50)
data

array([5, 5, 1, 2, 5, 6, 3, 6, 0, 5, 6, 3, 4, 6, 0, 5, 2, 2, 3, 5, 1, 2,
       3, 5, 5, 2, 4, 6, 5, 4, 5, 6, 1, 5, 6, 3, 3, 4, 0, 3, 3, 6, 0, 5,
       4, 5, 5, 2, 2, 0])

In [0]:
# Make a series, and calculate the histogram with `value_counts`
s = pd.Series(data)
s.value_counts()

5    14
6     8
3     8
2     7
4     5
0     5
1     3
dtype: int64

Notice that `value_counts` automatically sorts counts from high to low.
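Two variations are often useful: relative frequencies instead of raw counts, and ordering by value instead of by count. The sketch below uses a short made-up series rather than the random data above so that the results are reproducible:

```python
import pandas as pd

# Hypothetical fixed data so the results do not depend on a random seed
s = pd.Series([5, 5, 1, 2, 5, 6, 3, 6])

# Fractions of the total instead of counts
shares = s.value_counts(normalize=True)

# Same counts, ordered by the value itself rather than by frequency
by_value = s.value_counts().sort_index()

print(shares.loc[5])            # 0.375
print(by_value.index.tolist())  # [1, 2, 3, 5, 6]
```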