# Data aggregations on pandas DataFrames

We start by building a DataFrame we used before.
![image.png](attachment:image.png)

In [1]:
import pandas as pd

stores_dict = {'manager': ['Ali', 'Madhura', 'Victoria', 'Mary', 'Nina', 'Mia'],
               'city': ['Toronto', 'Hamilton', 'Ottawa', 'Brampton', 'Kingston', 'Windsor'],
               'employees': [26, 18, 20, 22, 16, 24],
               'revenue': [34, 28, 30, 26, 21, 18]}

stores = pd.DataFrame(stores_dict)



## Overview of the whole DataFrame

The `info()` method prints a concise summary of a DataFrame, including the index, the column dtypes, the number of non-null values in each column, and the memory usage.
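As a minimal sketch, the cell below calls `info()` on the `stores` DataFrame (re-created here only so that the cell runs on its own). The `shape` and `dtypes` attributes are shown as well, since they expose parts of the same summary as ordinary Python objects.

```python
import pandas as pd

# Same data as the `stores` DataFrame built above
stores = pd.DataFrame({
    'manager': ['Ali', 'Madhura', 'Victoria', 'Mary', 'Nina', 'Mia'],
    'city': ['Toronto', 'Hamilton', 'Ottawa', 'Brampton', 'Kingston', 'Windsor'],
    'employees': [26, 18, 20, 22, 16, 24],
    'revenue': [34, 28, 30, 26, 21, 18],
})

# info() prints its summary directly and returns None
stores.info()

# Parts of that summary are also available as attributes
n_rows, n_cols = stores.shape   # number of rows and columns
print(n_rows, n_cols)
print(stores.dtypes)            # dtype of each column
```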

One of the most important statistical functions is the `describe()` method.

It provides summary statistics for the numerical variables (columns) of a pandas DataFrame:
* `count`, `mean`, `std`, `min`, `25%` (Q1), `50%` (median), `75%` (Q3) and `max`.
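A minimal sketch of `describe()` on the same data follows (the DataFrame is re-created so that the cell is self-contained). By default only the numeric columns, here `employees` and `revenue`, appear in the result.

```python
import pandas as pd

# Same data as the `stores` DataFrame built above
stores = pd.DataFrame({
    'manager': ['Ali', 'Madhura', 'Victoria', 'Mary', 'Nina', 'Mia'],
    'city': ['Toronto', 'Hamilton', 'Ottawa', 'Brampton', 'Kingston', 'Windsor'],
    'employees': [26, 18, 20, 22, 16, 24],
    'revenue': [34, 28, 30, 26, 21, 18],
})

# describe() returns a new DataFrame: one row per statistic,
# one column per numeric variable
summary = stores.describe()
print(summary)

# Individual statistics can be looked up by row and column label
print(summary.loc['mean', 'employees'])
```

Because the result is itself a DataFrame, any of its values can be reused in later calculations instead of being read off the printed table.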

Notice that the output of `info()` first specifies the type (a DataFrame), followed by the number of entries (rows) and the number of columns. Next, it lists the position of each column (0 for manager, 1 for city, etc.) along with its label, the number of non-null entries, and the dtype (data type). Here, the manager and city columns hold strings (typically displayed as `object`, the dtype used for strings or mixed values), while the employees and revenue columns hold integers. Some of the most important dtypes used in pandas are listed below.


![image-3.png](attachment:image-3.png)

## Statistics on each variable

Many of the statistical functions that we used for NumPy arrays are also available for pandas DataFrames, e.g., `mean()`, `sum()`, `max()`, etc.

For example, we can apply one of those functions on
* A particular column
* The whole DataFrame, in which case the function is applied to each column

In [2]:
# Apply a statistics function (e.g., mean) on the 'employees' column



In [3]:
# Apply a statistics function (e.g., mean) on the whole DataFrame



Note that when a DataFrame contains non-numeric columns, we often need to pass the argument `numeric_only=True`, since these statistics cannot be calculated on non-numeric data types.
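One possible way to carry out both cases is sketched below, using `mean()` as the example statistic. The DataFrame is re-created so that the cell runs on its own, and the variable names `employees_mean` and `column_means` are only illustrative.

```python
import pandas as pd

# Same data as the `stores` DataFrame built above
stores = pd.DataFrame({
    'manager': ['Ali', 'Madhura', 'Victoria', 'Mary', 'Nina', 'Mia'],
    'city': ['Toronto', 'Hamilton', 'Ottawa', 'Brampton', 'Kingston', 'Windsor'],
    'employees': [26, 18, 20, 22, 16, 24],
    'revenue': [34, 28, 30, 26, 21, 18],
})

# A particular column: the result is a single number
employees_mean = stores['employees'].mean()
print(employees_mean)

# The whole DataFrame: the result is a Series with one value per numeric column.
# numeric_only=True skips the string columns 'manager' and 'city'.
column_means = stores.mean(numeric_only=True)
print(column_means)
```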

Similar to the `mean()` function, we also have the following aggregations for each column:
* `count()`: count non-NA entries
* `sum()`: summation of all values 
* `std()`: calculate standard deviation
* `max()` and `min()`: find the greatest and smallest values 
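The aggregations listed above can be sketched on the numeric columns as follows. The cell also shows `agg()`, which applies several aggregations in one call. Selecting the numeric columns first is a choice made for this example, not a requirement. Note that `std()` in pandas computes the sample standard deviation (it divides by n - 1) by default.

```python
import pandas as pd

# Same data as the `stores` DataFrame built above
stores = pd.DataFrame({
    'manager': ['Ali', 'Madhura', 'Victoria', 'Mary', 'Nina', 'Mia'],
    'city': ['Toronto', 'Hamilton', 'Ottawa', 'Brampton', 'Kingston', 'Windsor'],
    'employees': [26, 18, 20, 22, 16, 24],
    'revenue': [34, 28, 30, 26, 21, 18],
})

# Keep only the numeric columns for this example
numeric = stores[['employees', 'revenue']]

print(numeric.count())  # non-NA entries per column
print(numeric.sum())    # column totals
print(numeric.std())    # sample standard deviation per column
print(numeric.max())    # greatest value per column
print(numeric.min())    # smallest value per column

# Several aggregations at once: one row per function, one column per variable
stats = numeric.agg(['count', 'sum', 'std', 'max', 'min'])
print(stats)
```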


### Percentiles

We can also get a given percentile of a distribution with the `quantile()` method. Its main argument, `q`, is the percentile expressed as a number between 0 and 1 (if omitted, it defaults to 0.5, the median).
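A minimal sketch of `quantile()` on the `employees` column is given below, with the DataFrame re-created so that the cell is self-contained. The quartile levels are chosen only for illustration. By default pandas interpolates linearly between neighbouring data points, so a result may fall between two observed values.

```python
import pandas as pd

# Same data as the `stores` DataFrame built above
stores = pd.DataFrame({
    'manager': ['Ali', 'Madhura', 'Victoria', 'Mary', 'Nina', 'Mia'],
    'city': ['Toronto', 'Hamilton', 'Ottawa', 'Brampton', 'Kingston', 'Windsor'],
    'employees': [26, 18, 20, 22, 16, 24],
    'revenue': [34, 28, 30, 26, 21, 18],
})

# A single percentile: the 25th percentile (first quartile) of 'employees'
first_quartile = stores['employees'].quantile(0.25)
print(first_quartile)

# Several percentiles at once: pass a list and get a Series indexed by level
quartiles = stores['employees'].quantile([0.25, 0.5, 0.75])
print(quartiles)
```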

For example, consider the following question.

**Question**: 56% of the stores have **no more** than ___ employees (fill in the blank)   

In [4]:
# Calculating multiple percentiles at the same time
