# Chapter 2: Descriptive Data Analysis

In [4]:
from IPython.display import Markdown
base_path = (
    "https://raw.githubusercontent.com/rezahabibi96/GitBook/refs/heads/main/"
    "books/applied-statistics-with-python/.resources"
)

This chapter reviews the ideas of **Descriptive Statistics**, which summarize a sample in terms of means, standard deviations, counts, visualizations, etc. No attempt is made here to generalize sample statistics to the entire population; that is the task of **Inferential Statistics**.

## Numerical Data

In the previous introductory chapter, we defined numerical data as quantitative data that can be described with numbers. In this chapter, we review standard measures of center and spread for such data. In addition to the basic definitions of Statistics, we introduce some new ideas, such as handling missing values (NAs) and the trimmed mean.

### Sample and Population Means

We start with the most standard measure of center of a numerical sample, the **mean** (average). For a sample of $n$ measurements $x_1, x_2, ..., x_n$, the mean is defined as:

$$
\bar{x} = \frac{x_1 + x_2 + ... + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n}x_i
$$

The way to compute it depends on whether you are dealing with a small set of data typed manually into an array or with a data file. Let's first assume that we have a small data set of heights for students in a class, and we want to find the average. A Python list can be defined with `[]`, but the *numpy* library needs to be imported to use numerical arrays and statistical functions effectively. Numpy arrays are better suited to Statistics, as they are vectorized (addition, subtraction, and many other operations are automatically done element-by-element) and come with a rich set of mathematical functions. First, the array is created with *np.array()*, then *np.mean()* is used to compute the mean. The results are printed with format specifications like `{:.4f}` to display the desired number of digits rather than the roughly 16 significant digits of double precision.

In [None]:
import numpy as np # numerical library


x = np.array([69,64,63,73,74,64,66,70,70,68,67])
xbar = np.mean(x)

print('xbar = {:.4f} for n = {:d} observations'.format(xbar,len(x)))

xbar = 68.0000 for n = 11 observations
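
As a small check on the formula above, the sketch below computes the mean directly as the sum divided by $n$ and contrasts a plain Python list with a numpy array. The variable names (`heights`, `xbar_formula`, `deviations`) are illustrative and not part of the original example.

```python
import numpy as np

heights = [69, 64, 63, 73, 74, 64, 66, 70, 70, 68, 67]  # plain Python list
x = np.array(heights)                                    # numpy array

# Mean straight from the definition: sum of the values divided by n
xbar_formula = x.sum() / len(x)

# "+" on lists concatenates; arithmetic on arrays is element-by-element
doubled_list = heights + heights   # a list with 22 entries
deviations = x - xbar_formula      # each height minus the mean

print('xbar from the formula = {:.4f}'.format(xbar_formula))
print('length of list + list = {:d}'.format(len(doubled_list)))
print('sum of deviations     = {:.4f}'.format(deviations.sum()))
```

The deviations from the mean sum to zero (up to rounding), which is a quick sanity check on any computed mean.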


Note that if any of the data are missing and represented by *nan*, *np.mean()* returns *nan*. To avoid this, we use *np.nanmean()*, which ignores the missing values:

In [2]:
x = np.array([69,64,63,73,74,64,66,70,70,68,67,float("nan")])

xbar = np.mean(x)  # nan propagates into the result
print('regular mean xbar = {:.4f}'.format(xbar))

xbar = np.nanmean(x)  # nan entries are ignored
print('nanmean() xbar = {:.4f}'.format(xbar))

regular mean xbar = nan
nanmean() xbar = 68.0000
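
Two other common ways to handle missing entries are sketched below: filtering them out with a boolean mask, and converting the array to a pandas Series, whose `.mean()` method skips missing values by default. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

x = np.array([69, 64, 63, 73, 74, 64, 66, 70, 70, 68, 67, np.nan])

# Count the missing entries before deciding how to treat them
n_missing = int(np.isnan(x).sum())

# Option 1: keep only the entries that are not nan, then average
mask_mean = x[~np.isnan(x)].mean()

# Option 2: pandas skips missing values by default (skipna=True)
series_mean = pd.Series(x).mean()

print('missing values = {:d}'.format(n_missing))
print('masked mean    = {:.4f}'.format(mask_mean))
print('Series mean    = {:.4f}'.format(series_mean))
```

Counting the missing values first is a useful habit, since a mean based on very few remaining observations may not be meaningful.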


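Alternatively, we can use a trimmed mean, which discards a specified proportion of the lowest and highest observations before averaging the rest. In the example below, 10% is cut from each end. The result is 68.5 rather than *nan*, which indicates that the *nan* entry was discarded along with the high extreme; this is a side effect and should not be relied on as a general way to handle missing data.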

In [3]:
from scipy import stats


x = np.array([69,64,63,73,74,64,66,70,70,68,67,float("nan")])
xbar = stats.trim_mean(x,0.1)

print('trimmed mean xbar = {:.4f}'.format(xbar))

trimmed mean xbar = 68.5000
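
To make the trimming rule concrete, here is a minimal sketch that sorts the data, cuts the same number of values from each end, and averages what remains. The helper `trimmed_mean_sketch` is hypothetical and written only for illustration; it assumes the number of values cut from each end is `n * proportion` truncated to an integer, and that the data contain no missing values.

```python
import numpy as np


def trimmed_mean_sketch(values, proportion):
    """Hypothetical helper: cut int(n * proportion) values from each end, then average."""
    ordered = np.sort(np.asarray(values, dtype=float))
    k = int(proportion * ordered.size)  # number of values cut from each tail
    return ordered[k:ordered.size - k].mean()


# Same heights as before, without the missing entry
x = np.array([69, 64, 63, 73, 74, 64, 66, 70, 70, 68, 67])
result = trimmed_mean_sketch(x, 0.1)

print('manual 10% trimmed mean = {:.4f}'.format(result))
```

With 11 observations and a 10% cut, one value is removed from each end (63 and 74) and the remaining nine are averaged. The result differs from the 68.5 shown above because that array also contained a *nan* entry.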


Next, let's find the average of a variable stored in a data file. We use the `cesd` depression score from the `HELPrct` data file, which we used before:

In [8]:
import pandas as pd


url = f'{base_path}/HELPrct.csv'
mydata = pd.read_csv(url) 
print(mydata[['cesd','pcs','mcs']].head(10))

xbar = np.mean(mydata['cesd'])
print('mean of cesd = {:.4f}'.format(xbar))

   cesd        pcs        mcs
0    49  58.413689  25.111990
1    30  36.036942  26.670307
2    39  74.806328   6.762923
3    15  61.931679  43.967880
4    39  37.345585  21.675755
5     6  46.475212  55.508991
6    52  24.515039  21.793024
7    32  65.138008   9.160530
8    50  38.270878  22.029678
9    46  22.610598  36.143761
mean of cesd = 32.8477


We can also find the means of several columns at once with the *np.mean()* function or the *.mean()* method, as illustrated below:

In [9]:
print('Mean of the two columns using np.mean:')
print(np.mean(mydata[['cesd', 'mcs']], axis=0) )

print('\nMean of the two columns .mean():')
print(mydata[['cesd', 'mcs']].mean() )

Mean of the two columns using np.mean:
cesd    32.847682
mcs     31.676678
dtype: float64

Mean of the two columns .mean():
cesd    32.847682
mcs     31.676678
dtype: float64
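
The pandas `.mean()` method also takes `axis` and `skipna` arguments, which matter once a data frame has missing entries. The sketch below uses a small made-up data frame (the `scores` table and its values are hypothetical, not taken from `HELPrct`).

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing entry in the second column
scores = pd.DataFrame({
    "test1": [10.0, 20.0, 30.0],
    "test2": [1.0, np.nan, 3.0],
})

col_means = scores.mean()                     # column means, nan skipped by default
col_means_strict = scores.mean(skipna=False)  # nan propagates into the result
row_means = scores.mean(axis=1)               # one mean per row

print(col_means)
print(col_means_strict)
print(row_means)
```

Whether to skip or propagate missing values is a judgment call; `skipna=False` is a way to make sure missing data do not go unnoticed.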


We can also compute trimmed means of data frame columns. The result is returned as a plain array, in the same order as the selected columns:

In [10]:
print('\nTrimmed mean from stats library : ', stats.trim_mean(mydata[['cesd', 'mcs']], 0.05))


Trimmed mean from stats library :  [33.06356968 31.31445885]
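
Since the trimmed means come back as an unlabeled array, one option is to wrap the result in a pandas Series indexed by the column names. The sketch below does this on a small made-up data frame (the columns `a` and `b` and their values are hypothetical), assuming `scipy` is installed.

```python
import pandas as pd
from scipy import stats

# Hypothetical data: column "a" contains one extreme value
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],
    "b": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
})

# Cut 10% from each end of every column (axis=0 works column by column)
values = stats.trim_mean(df.to_numpy(), 0.1, axis=0)

# Reattach the column names so the output is self-describing
trimmed = pd.Series(values, index=df.columns)

print(trimmed)
print(df.mean())
```

Comparing the two printouts shows how a single extreme value pulls the ordinary mean of column `a` upward while leaving its trimmed mean close to the center of the data.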
