<a id='menu'></a>
 <hr style="width:100%;height:4px;border-width:0;color:gray;background-color:#003d59; opacity:0.25"> 

 ![learning academy and data science campus logos](../images/la_dsc_logo.jpg)
  <hr style="width:100%;height:4px;border-width:0;color:gray;background-color:#003d59; opacity:0.25"> 

# Introduction to Python
## Chapter 6 – Summary Statistics and Aggregation  
***
Follow along with the code by running cells as you encounter them
***
*Chapter Overview/Learning Objectives*

* [Packages and Data](#packages)
 * Packages
 * Data
 
 
* [Overall Descriptive Statistics](#describe)


* [Range](#range)
 * min
 * max
 * quantiles
 

* [Averages](#averages)
 * Mean
 * Median
 * Mode
 
 
* [Spread](#spread)
 * Standard Deviation
 * Variance
 

* [Counting Values](#count)
 * Counts
 * Null Value counts
 * Value Counts


* [Other Summary Statistics](#other)


* [Creating Size Bands](#cut)


* [Aggregation](#agg)

<a id='packages'></a>
<hr style="width:100%;height:4px;border-width:0;color:gray;background-color:#003d59; opacity:1">  

## Packages and Data

### Packages

As a reminder – we should always import our packages at the top of our script.


In this session we will use:

* `pandas`, and give it the nickname `pd`
* `numpy` and give it the nickname `np`

Complete this action in the cell below.

In [None]:
# Import pandas and numpy into this cell



You can run the "solution" cell if you need help - or revisit Chapter 2.

Practicing these basic commands helps your retention of the skills.

In [None]:
# Solution - These cells contain answers for the exercises.
# Run once to reveal the code
# Run again to reveal the output


%load ../solutions/chapter_6/chaptersixpackages.py

We’ll now import the data again.

In this session we will use:

| variable name | file name  |
| ------- | --- |
| animals | animals.csv |
| titanic | titanic.xlsx |


In [None]:
# Import the animals and the titanic data into this cell



In [None]:
# Solution - These cells contain answers for the exercises.
# Run once to reveal the code
# Run again to reveal the output

%load ../solutions/chapter_6/chaptersixdata.py

You can check your variables are loaded by using `%whos` in Jupyter. In Spyder or other IDEs they should appear in your variable explorer.

If you struggle with this section – review Chapter 3.

In [None]:
%whos

[return to menu](#menu)

<a id='describe'></a>
<hr style="width:100%;height:4px;border-width:0;color:gray;background-color:#003d59; opacity:1"> 

## Overall Descriptive Statistics

Pandas has an inbuilt method for basic descriptive statistics: `.describe()`.

In [None]:
titanic.describe()

`.describe()` works on numeric columns by default. We can also get summary statistics for a specific column.

In [None]:
titanic["fare"].describe()

These statistics are explained in more detail below:
 
| Summary Statistic | Description|
|--------------------------|-----------------|
| count | the number (count) of non missing entries in the given column.
| mean | the average (arithmetic mean) data value in the given column.
| std | the standard deviation (spread) of values in the given column.
| min |  the smallest value in the given column.
| 25% | the value of the data at the lower quartile <br> (i.e. after the first 25% of data, ordered from smallest to largest).
| 50% | the middle value of the data (aka the median). <br> half the values are larger than this value, and half smaller.
| 75% | the value of the data at the upper quartile <br> (i.e. after the first 75% of data, ordered from smallest to largest).
| max | the maximum data value recorded.


We can display descriptive information for other data types by using the parameter `include= ` . 

This parameter takes a list as the input; even if we’re just including one kind of data. 

Here we’re specifying that we want to include “object” – our object columns.

In [None]:
titanic.describe(include=["object"])

| Summary Statistic | Description|
|--------------------------|-----------------|
| count | The number (count) of non missing entries in the given column.
| unique | The number of unique values in a column.
| top | The most frequently occurring value.
| freq | The frequency of the “top” value.

If two or more values tie for “top” (that is, several values share the highest frequency), Pandas will [kind of arbitrarily](https://github.com/pandas-dev/pandas/issues/15833) choose one of them to be the top value. The link is added for general interest; you don't need to understand it.

In this data two women share one name and two men share another. Pandas will choose one of these names to display.

`titanic.describe(include=[np.number])` will return you all numeric columns, in the same way as `titanic.describe(include=["int64", "float64"])`
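As a minimal sketch of how `include=` changes the output, the cell below uses a small made-up DataFrame called `passengers` (not the course data). The `name` column is given the `object` data type explicitly so that the example does not depend on how a particular pandas version stores text.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature data set, standing in for the course data
passengers = pd.DataFrame({
    "name": pd.Series(["Ann", "Bob", "Cy", "Bob"], dtype="object"),
    "age": [30.0, 41.5, np.nan, 22.0],
    "pclass": [1, 3, 3, 2],
})

# Default behaviour: numeric columns only
numeric_summary = passengers.describe()

# Object (text) columns only: count, unique, top and freq
object_summary = passengers.describe(include=["object"])

print(numeric_summary)
print(object_summary)
```

Note that the `count` for `age` is 3 rather than 4, because the missing value is not counted.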

[return to menu](#menu)

<a id='range'></a>
<hr style="width:100%;height:4px;border-width:0;color:gray;background-color:#003d59; opacity:1"> 

## Range

We can also access these summary statistics individually. In most cases the name of the method is the same as the summary statistic.

<hr style="width:70%;height:4px;border-width:0;color:gray;background-color:#003d59; opacity:0.75"> 

### min

We can use `.min()` to return the minimum value in a column.

In [None]:
titanic["fare"].min()

This also works for object (text) columns.

In [None]:
titanic["name"].min()

<hr style="width:70%;height:4px;border-width:0;color:gray;background-color:#003d59; opacity:0.75"> 

### max

We can use `.max()` to return the maximum value in a column.

In [None]:
titanic["fare"].max()

This also works for object (text) columns.

In [None]:
titanic["name"].max()

Something important to note here is that text is compared character by character, using the numeric code behind each character.

Capital letters A-Z come first and **then** lower case a-z.

A lower case "a" is therefore treated as coming **after** a capital "Z" in Python.

This is why van Melkebeke, Mr. Philemon is the maximum value in our `titanic` DataFrame rather than Zimmerman, Mr. Leo.

We can handle this by ensuring our data is either all lower case or all upper case before finding the `.min()` or `.max()` values. Here I’ve done this by chaining the methods `.str.lower()` and `.max()` together.

If I wanted to do more work with this column in future I may consider overwriting it with a lower case version.

In [None]:
titanic["name"].str.lower().max()
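The ordering described above can be checked on a tiny example. This is a sketch using a made-up Series called `names`; the three names are only there to mix upper and lower case first letters.

```python
import pandas as pd

# Hypothetical names, chosen to mix upper and lower case first letters
names = pd.Series(["Zimmerman, Mr. Leo",
                   "van Melkebeke, Mr. Philemon",
                   "Abbott, Mr. Eugene"])

# ord() shows the numeric code behind a character: "Z" is 90, "a" is 97
code_points = (ord("Z"), ord("a"))

raw_max = names.max()                  # lower case "v" sorts after "Z"
lowered_max = names.str.lower().max()  # compare everything in lower case

print(code_points)
print(raw_max)
print(lowered_max)
```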

<hr style="width:70%;height:4px;border-width:0;color:gray;background-color:#003d59; opacity:0.75"> 

### Quantiles

We can use `.quantile()` to find information at different points of our data.

Our parameter is `q= ` and then a decimal number.

If we don't specify this, the default value is `0.5`, which gives the median.

In [None]:
titanic["fare"].quantile(q=0.25)

If we wish to specify more than one, we pass a list to the parameter `q= `

In [None]:
titanic["fare"].quantile(q=[0, 0.25, 0.5, 0.75, 1])

We can also use the `np.arange()` function from the `numpy` package (`np`) here rather than typing out a list by hand. This creates *a* range of numbers (it does not arrange the data). Note that the inbuilt `range()` function only works with integers (whole numbers), while `np.arange()` also allows us to work with floats (decimal numbers).

As in `range()`, the stop number in `np.arange()` is exclusive.

In [None]:
# Note that .arange() comes from the numpy package (np)


titanic["fare"].quantile(q=np.arange(start=0.0, 
                                     stop=1.1,  # Remember this is exclusive!
                                     step=0.1)) 
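To see the exclusive stop in action, here is a small sketch on a made-up Series called `fares`. It also shows `np.linspace()`, a different numpy function that is not used elsewhere in this chapter; it is included only as an alternative that takes the number of points and includes the stop value.

```python
import numpy as np
import pandas as pd

# Hypothetical fares, not the course data
fares = pd.Series([5.0, 7.5, 8.0, 13.0, 26.0, 30.0, 52.0, 80.0, 151.0])

# np.arange() stops *before* the stop value, so 1.0 is not included here
deciles_exclusive = np.arange(start=0.0, stop=1.0, step=0.1)

# np.linspace() includes the stop value and asks for the number of points
deciles_inclusive = np.linspace(start=0.0, stop=1.0, num=11)

result = fares.quantile(q=deciles_inclusive)

print(len(deciles_exclusive), len(deciles_inclusive))
print(result)
```

The first and last entries of `result` are the minimum and maximum of the column, because the quantiles at 0 and 1 are the two ends of the data.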

<hr style="width:50%;height:4px;border-width:0;color:gray;background-color:#003d59; opacity:0.5"> 

## Exercise 1

How old is the oldest passenger in the DataFrame?


In [None]:
# Exercise



In [None]:
# Solution - These cells contain answers for the exercises.
# Run once to reveal the code
# Run again to reveal the output

%load ../solutions/chapter_6/oldest_passenger.py

[return to menu](#menu)

<a id='averages'></a>
<hr style="width:100%;height:4px;border-width:0;color:gray;background-color:#003d59; opacity:1"> 

## Averages

### Mean


`.mean()` gives us the sum of the column divided by the number of (non-null) items in the column.

In [None]:
titanic["fare"].mean()

<hr style="width:70%;height:4px;border-width:0;color:gray;background-color:#003d59; opacity:0.75"> 

### Median

The `.median()` is the middle value when the numbers are listed in order.

In [None]:
titanic["fare"].median()

<hr style="width:70%;height:4px;border-width:0;color:gray;background-color:#003d59; opacity:0.75"> 

### Mode

The `.mode()` is the value that occurs most frequently in the column.

In [None]:
titanic["fare"].mode()

`.mode()` can also be calculated for object based columns.

Here the `name` column has two modes; both `Connolly, Miss. Kate` and `Kelly, Mr. James` appear twice in the DataFrame.

How do we know they appear twice? We’ll look at the method that tells us this, `.value_counts()`, later in this chapter.

In [None]:
titanic["name"].mode()

Interestingly, these rows aren’t duplicates. There really were two separate individuals with each of those names!

In [None]:
titanic[(titanic["name"] == "Connolly, Miss. Kate") | (titanic["name"] == "Kelly, Mr. James")]
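The three averages can be compared side by side on a small made-up Series. This is only a sketch; the values in `fares` are invented so that the column has two modes, which shows why `.mode()` returns a Series rather than a single number.

```python
import pandas as pd

# Hypothetical fares with two values tied for most frequent
fares = pd.Series([10.0, 10.0, 20.0, 20.0, 90.0])

mean_fare = fares.mean()      # (10 + 10 + 20 + 20 + 90) / 5
median_fare = fares.median()  # middle value once sorted
modes = fares.mode()          # a Series, because there can be more than one mode

print(mean_fare, median_fare)
print(modes)
```

The single large value pulls the mean well above the median, which is one reason to look at more than one average.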

[return to menu](#menu)

<a id='spread'></a>
<hr style="width:100%;height:4px;border-width:0;color:gray;background-color:#003d59; opacity:1"> 

## Spread

### Standard Deviation


The standard deviation measures how far the data are spread out around the mean (average) value.

We can calculate the standard deviation using the `.std()` method.


In [None]:
titanic["fare"].std()

<hr style="width:70%;height:4px;border-width:0;color:gray;background-color:#003d59; opacity:0.75"> 

### Variance

Variance also measures how spread out the values of a variable are; it is the square of the standard deviation.
We can calculate the variance using the `.var()` method.

In [None]:
titanic["fare"].var()
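The link between the two measures can be checked directly. The sketch below uses a made-up Series called `values`. It also uses the `ddof=` parameter of `.std()`, which is not covered in this chapter: by default pandas divides by n - 1 (the sample version), and `ddof=0` divides by n instead.

```python
import numpy as np
import pandas as pd

# Hypothetical values with a mean of 5
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_std = values.std()            # default ddof=1, divides by n - 1
sample_var = values.var()            # default ddof=1 as well
population_std = values.std(ddof=0)  # divides by n

# The variance is the standard deviation squared
print(np.isclose(sample_std ** 2, sample_var))
print(sample_std, sample_var, population_std)
```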

<hr style="width:50%;height:4px;border-width:0;color:gray;background-color:#003d59; opacity:0.5"> 

## Exercise

Find the standard deviation of the `age` column

In [None]:
# Exercise



In [None]:
# Solution - These cells contain answers for the exercises.
# Run once to reveal the code
# Run again to reveal the output

%load ../solutions/chapter_6/standard_deviation.py

[return to menu](#menu)

<a id='count'></a>
<hr style="width:100%;height:4px;border-width:0;color:gray;background-color:#003d59; opacity:1"> 

## Counting Values


### Counts

We can find the number of non null values in a column using the `.count()` method. 

By non null values we mean values with data in them - not the missing values.

In [None]:
titanic["sex"].count()

<hr style="width:70%;height:4px;border-width:0;color:gray;background-color:#003d59; opacity:0.75"> 

### Null Value counts

We can find how many null values we have by using the `.isnull()` method. 

`.isnull()` returns a Boolean series – of `True` and `False` values.

In [None]:
titanic["age"].isnull().head()

As these have numeric values behind them (`True` is 1, `False` is 0 ) we can use `.sum()` to total them.

In [None]:
titanic["age"].isnull().sum()

In more modern versions of Pandas the method `.isna()` also exists and provides the same output.
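As a quick sketch of how these counts fit together, the cell below uses a made-up Series called `ages`. The non-null count plus the null count should equal the length of the column, and taking the `.mean()` of the Boolean series gives the proportion that is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing values
ages = pd.Series([29.0, np.nan, 2.0, np.nan, 30.0])

non_null = ages.count()               # values that are present
nulls = ages.isnull().sum()           # True counts as 1, False as 0
total = len(ages)                     # every row, missing or not
share_missing = ages.isnull().mean()  # proportion of rows that are missing

print(non_null, nulls, total, share_missing)
```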

<hr style="width:70%;height:4px;border-width:0;color:gray;background-color:#003d59; opacity:0.75"> 

### Value Counts

We can find the frequencies of each unique value in a column, by using `.value_counts()`.

In [None]:
titanic["sex"].value_counts()

<hr style="width:50%;height:4px;border-width:0;color:gray;background-color:#003d59; opacity:0.5"> 

## Exercise 1

How many passengers were in each class?

In [None]:
# Exercise


In [None]:
# Solution - These cells contain answers for the exercises.
# Run once to reveal the code
# Run again to reveal the output

%load ../solutions/chapter_6/how_many_pclass.py

<hr style="width:50%;height:4px;border-width:0;color:gray;background-color:#003d59; opacity:0.5"> 

## Exercise 2

Look in the help for `pd.Series.value_counts()` to see how you can return the values as a proportion

In [None]:
# Exercise



In [None]:
# Solution - These cells contain answers for the exercises.
# Run once to reveal the code
# Run again to reveal the output

%load ../solutions/chapter_6/pclass_proportion.py

[return to menu](#menu)

<a id='other'></a>
<hr style="width:100%;height:4px;border-width:0;color:gray;background-color:#003d59; opacity:1"> 

## Other Summary Statistics

### Sum

We can use `.sum()` to add up columns of numeric data.

In [None]:
titanic["fare"].sum()

The `.sum()` method comes from Pandas; there is also an inbuilt Python function `sum()`. However, if we have null values in the column, the inbuilt function returns `nan`.

In [None]:
sum(titanic["fare"])

The Pandas `.sum()` method handles null values for us by skipping them.
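The difference is easiest to see on a column that definitely contains a null value. This sketch uses a made-up Series called `fares`; the `skipna=` parameter shown on the last line is a pandas option that switches the null handling off.

```python
import numpy as np
import pandas as pd

# Hypothetical fares with one missing value
fares = pd.Series([7.25, np.nan, 8.05])

pandas_total = fares.sum()              # skips the missing value
builtin_total = sum(fares)              # the missing value turns the total into nan
strict_total = fares.sum(skipna=False)  # pandas can be asked to behave the same way

print(pandas_total, builtin_total, strict_total)
```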


<hr style="width:70%;height:4px;border-width:0;color:gray;background-color:#003d59; opacity:0.75"> 

### Unique

We can use `.unique()` to find the distinct values in a column. These are returned as an array.

In [None]:
titanic["boat"].unique()

<hr style="width:70%;height:4px;border-width:0;color:gray;background-color:#003d59; opacity:0.75"> 

### Nunique

`.nunique()` can be used to find the number of unique values in a column.

In [None]:
titanic["boat"].nunique()
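One detail worth knowing is how the two methods treat missing values. In the sketch below, on a made-up Series called `boats`, `.unique()` lists the missing value as one of the distinct values, while `.nunique()` leaves it out unless we set `dropna=False`.

```python
import numpy as np
import pandas as pd

# Hypothetical lifeboat labels with one missing value
boats = pd.Series(["2", "11", np.nan, "2", "A"], dtype="object")

distinct = boats.unique()                 # includes the missing value
n_distinct = boats.nunique()              # missing values are ignored by default
n_with_nan = boats.nunique(dropna=False)  # count the missing value as well

print(distinct)
print(n_distinct, n_with_nan)
```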

[return to menu](#menu)

<a id='cut'></a>
<hr style="width:100%;height:4px;border-width:0;color:gray;background-color:#003d59; opacity:1"> 

## Creating Size Bands

### pd.cut()

We can use the function `pd.cut()` to cut or “bin” our data into groups or categories. This is commonly done when creating size bands, like age bands.

`pd.cut()` takes a column of data and groups it into “bins” or categories. The resulting column has the data type “category”. There’s some more information about the “category” data type [in this link](https://pbpython.com/pandas_dtypes_cat.html).

We often assign the output of `pd.cut()` to a new column. This is because we don’t want to overwrite and change the data type of the existing column.


In [None]:
titanic["binned_ages"] = pd.cut(titanic["age"],
                                bins=10)

titanic.head()

We set the parameter `bins=` to specify the number of categories that we want.

By passing an integer to this, Pandas takes the smallest value and the largest value in the column and divides that range into the given number of equal-width categories.

We can look at our bins. Note that `(` denotes exclusion of that number and `]` denotes inclusion of that number.

Because these are now categories, we can see that they are ordered: the output shows the relationship between each category.

In [None]:
titanic["binned_ages"].unique()
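Here is a small sketch of integer `bins=` on a made-up Series called `ages`, so that the bin edges are easy to follow by eye. Counting the values in each bin shows that the bins have equal width but not equal numbers of values; one of them is empty.

```python
import pandas as pd

# Hypothetical ages running from 1 to 80
ages = pd.Series([1, 5, 18, 22, 35, 40, 64, 80])

# Four equal-width bins across the range of the data
binned = pd.cut(ages, bins=4)

# Count the values in each bin, in bin order
counts = binned.value_counts().sort_index()

print(binned.dtype)        # category
print(binned.cat.ordered)  # the categories have an order
print(counts)
```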

We can also pass our own values to determine where the edges of the categories are.

This could be a list of values; here I create a range instead.

Note that I have to use the `numpy` function `np.arange()` here because the maximum age is stored as a decimal number (a float), which `range()` does not accept.

In [None]:
titanic["binned_ages2"]  = pd.cut(titanic["age"],  # Data to cut
                                  bins=np.arange(start=0,
                                                 stop=(titanic["age"].max() + 1) , # Remember stop is exclusive!
                                                 step=10))

titanic.head()

Here we’re combining the `.max()` method with the `np.arange()` function inside the `pd.cut()` function.

This means I don’t have to know the maximum value for the column before I write this piece of code. You can see how we can combine methods quite easily to make our life easier.


It is important to note that in this part of the code:

`stop=(titanic["age"].max() + 1)`

the brackets are there for readability; Python would evaluate the expression the same way without them. What matters is adding 1 to the stop. If I didn’t, the last bin edge would fall below my maximum value, and the oldest passengers would be assigned `NaN`.
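The effect of adding 1 to the stop can be sketched on a made-up Series called `ages` with a maximum of 80. Without the extra 1 the last edge is 70, so the value 80 does not fall into any bin.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with a maximum of 80
ages = pd.Series([5.0, 35.0, 80.0])

# Stop is exclusive, so these edges end at 70
edges_short = np.arange(start=0, stop=ages.max(), step=10)

# Adding 1 pushes the stop past 80, so 80 becomes the last edge
edges_full = np.arange(start=0, stop=ages.max() + 1, step=10)

short_binned = pd.cut(ages, bins=edges_short)
full_binned = pd.cut(ages, bins=edges_full)

print(short_binned)  # the last entry is NaN
print(full_binned)   # every entry has a bin
```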


We can also add labels to our categories. This time rather than displaying the bin edges it will display the text strings we specify.

This is passed as a list to the parameter `labels=` .

In [None]:
titanic["binned_ages3"] = pd.cut(titanic["age"],  # Data to cut
                                 bins=np.arange(start=0,
                                                stop=(titanic["age"].max() + 1),
                                                step=10),
                                 labels=["0-10", "11-20", "21-30",
                                         "31-40", "41-50", "51-60",
                                         "61-70", "71-80"])

titanic.head()

Note these bands are approximate, e.g. someone with an age of 20.2 will go into the band labelled `21-30`. Whole numbers were chosen for the labels as they are easier to read, and most ages after 1 are whole numbers.

There are additional parameters we can set here; check the help function if there’s anything specific you need to do.
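One such parameter is `right=`, which controls which end of each bin is included. The sketch below uses a made-up Series called `ages` and made-up labels; it is only meant to show how values that sit exactly on an edge move between bands.

```python
import pandas as pd

# Hypothetical ages, two of which sit exactly on a bin edge
ages = pd.Series([10.0, 20.0, 20.2, 29.9])
edges = [0, 10, 20, 30]

# Default: bins include their right-hand edge, e.g. (10, 20]
right_closed = pd.cut(ages, bins=edges,
                      labels=["0-10", "11-20", "21-30"])

# right=False: bins include their left-hand edge instead, e.g. [10, 20)
left_closed = pd.cut(ages, bins=edges, right=False,
                     labels=["0-9", "10-19", "20-29"])

print(right_closed.tolist())
print(left_closed.tolist())
```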

<hr style="width:70%;height:4px;border-width:0;color:gray;background-color:#003d59; opacity:0.75"> 

### pd.qcut()

`pd.qcut()` is described in the documentation as a “Quantile-based discretization function”. This means that `pd.qcut()` tries to divide the data into bins that each contain roughly the same number of values, rather than splitting the numeric range of the data into equal-width bins.

In the cell below I’m using `pd.qcut()` to cut the `age` column into 10 bins. This uses the same data and the same number of segments as the `pd.cut()` we did at the start.

In [None]:
# Divide age into 10 bins holding roughly equal numbers of passengers.
titanic["age_qcut"] = pd.qcut(titanic["age"],
                              q=10)  # Note the parameter here is q

# View the data
titanic.head()

We can really see the difference between the two new "cut" columns if we visualise them.

Here I've taken a `pd.cut()` and a `pd.qcut()` and set `bins/q = 10`.

The `pd.cut()` action on the left, cuts the range of the data into 10 bins. You can see the data is not distributed evenly between these 10 bins, but the bins are of equal size.

The `pd.qcut()` action on the right cuts the data so each of the 10 bins has roughly an equal number in each bin (edge cases in this case have made it slightly more uneven as we cannot split identical values between bins!). You can see the size of bins is not uniform, unlike the `pd.cut()` bins.


![A comparison of cut and qcut](../images/comparing_cut_and_qcut.svg)


In [None]:
# The code used to create this figure can be loaded in this cell.
# Note this uses a package called matplotlib (its pyplot module) - one of many options for visualising your data.
# Learning visualisation is out of the scope of this course

%load ../solutions/chapter_6/pdcutvisualisationcode.py
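The same contrast between `pd.cut()` and `pd.qcut()` can be seen in numbers rather than in a figure. This is a sketch on a made-up, deliberately lopsided Series called `ages`, cut into two bins each way.

```python
import pandas as pd

# Hypothetical ages: mostly small values plus two large ones
ages = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 50, 100])

# Equal-width bins: the range 1 to 100 is split in half
equal_width = pd.cut(ages, bins=2).value_counts().sort_index()

# Equal-count bins: the data is split at the median
equal_count = pd.qcut(ages, q=2).value_counts().sort_index()

print(equal_width)  # 9 values in the first bin, 1 in the second
print(equal_count)  # 5 values in each bin
```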

[return to menu](#menu)

<a id='agg'></a>
<hr style="width:100%;height:4px;border-width:0;color:gray;background-color:#003d59; opacity:1"> 

## Aggregation

Aggregation means grouping data together by a particular grouping variable and producing a summary of one or more columns for that grouping variable.

I want to find out if the average fare paid is different for each passenger class.

We have 3 passenger classes; 1st, 2nd and 3rd.

We can check that using the `.unique()` method as we saw earlier.


In [None]:
titanic["pclass"].unique()

We'll use the `.groupby()` method in this tutorial. Pandas also provides the `pd.crosstab()` and `pd.pivot_table()` functions. You can find these in the additional materials chapter.

`.groupby()` can be really useful, especially when your data are disaggregated, e.g. data about individual units of people or things.

`.groupby()` allows us to aggregate by a categorical variable and summarise numerical data into a new DataFrame.

`.groupby()` works on a principle known as 'split-apply-combine':


![image showing the stages of a group by](../images/group_by.JPG)

Split - a DataFrame is divided into a set of smaller DataFrames based on the grouping variable. 

Apply - an aggregation is applied to each of the groups to create a single row for each group in the original DataFrame. 

Combine - bring together the aggregated DataFrame rows into a final new DataFrame.


In [None]:
titanic_class_fare = titanic.groupby(by="pclass")["fare"].mean() 

titanic_class_fare

Here I want to find out whether the mean `fare` differs between passenger classes.

I pass the column I wish to group by to the `by=` parameter of the `.groupby()` method.

The column `pclass` has three values: `1`, `2` and `3`.

The `.groupby()` behaviour will effectively split the original DataFrame `titanic` into three new DataFrames: one with the values of `1`, one with the values of `2` and one with the values of `3`. This is the **split** step.

From these grouped DataFrames I select the column `["fare"]` and apply the summary statistic `.mean()` to it. This is the **apply** step.

The results are returned together in `titanic_class_fare`, a Series indexed by `pclass`. This is the **combine** step.
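To make the three steps concrete, the sketch below carries them out by hand on a made-up DataFrame called `mini` and compares the result with the one-line `.groupby()` version. This is for illustration only; it is not how pandas works internally, and in practice we would always use the one-liner.

```python
import pandas as pd

# Hypothetical miniature data set with the same column names as the text
mini = pd.DataFrame({
    "pclass": [1, 1, 2, 3, 3, 3],
    "fare": [100.0, 80.0, 20.0, 8.0, 7.0, 9.0],
})

# Split: one smaller DataFrame per value of pclass
pieces = {key: group for key, group in mini.groupby(by="pclass")}

# Apply: reduce the fare column of each piece to a single number
means = {key: group["fare"].mean() for key, group in pieces.items()}

# Combine: gather the results into one Series indexed by pclass
manual = pd.Series(means, name="fare")
manual.index.name = "pclass"

# The one-line version gives the same values
one_liner = mini.groupby(by="pclass")["fare"].mean()

print(manual)
print(one_liner)
```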

We can also use more complicated groupings - here grouping first by `pclass` then `embarked`.

In [None]:
titanic.groupby(by=["pclass", "embarked"])["fare"].mean()

The order we pass these columns affects our output.

Note here with `embarked` and `pclass` reversed the resultant Series is different.

In [None]:
titanic.groupby(by=["embarked", "pclass"])["fare"].mean()

You can also use other summary statistics here – including `.count()` to return the number of values.


Here 141 passengers embarked at Cherbourg and were in pclass 1.

In [None]:
titanic.groupby(by=["embarked", "pclass"])["fare"].count()

[return to menu](#menu)

<hr style="width:50%;height:4px;border-width:0;color:gray;background-color:#003d59; opacity:0.5"> 

## Exercise 1




Group `animals` by the column `AnimalClass` and find the `.sum()` of the `IncidentNominalCost(£)` column.

In [None]:
# Exercise



In [None]:
# Solution - These cells contain answers for the exercises.
# Run once to reveal the code
# Run again to reveal the output

%load ../solutions/chapter_6/animalclass_gb_incident_nominal.py


<hr style="width:50%;height:4px;border-width:0;color:gray;background-color:#003d59; opacity:0.5"> 

## Exercise 2




Group `animals` by the column `Borough` and `AnimalClass` and find the `.mean()` of the `PumpHoursTotal` column.

In [None]:
# Exercise



In [None]:
# Solution - These cells contain answers for the exercises.
# Run once to reveal the code
# Run again to reveal the output

%load ../solutions/chapter_6/animalclass_groupby.py

<hr style="width:50%;height:4px;border-width:0;color:gray;background-color:#003d59; opacity:0.5"> 

If we want to return more than one aggregation, there is the `.agg()` method.

This takes a dictionary where the key is our column and the value is the aggregation we wish to apply, given as a string.

These aggregation names are written slightly differently from what we’ve seen, but are straightforward, like `"sum"`, `"count"`, `"mean"` etc.


In [None]:
animalclass_sum = animals.groupby(by="AnimalClass", as_index=True).agg(
    {"IncidentNominalCost(£)": "sum",
     "PumpHoursTotal": "mean"})

animalclass_sum

If we want to apply more than one aggregation to a column we can pass a list to the values of the dictionary.

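The aggregation names can be given as strings here too, for example `["sum", "mean"]`. Older tutorials often pass the numpy functions `np.sum` and `np.mean` instead; recent versions of pandas emit a warning for this and recommend the string names.

In [None]:
animalclass_sum = animals.groupby(by="AnimalClass").agg(
    {"IncidentNominalCost(£)": ["sum", "mean"],
     "PumpHoursTotal": "mean"})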

animalclass_sum

From Pandas 0.25.0 onwards, named aggregation is also available. [The help guide can be found here](https://pandas-docs.github.io/pandas-docs-travis/user_guide/groupby.html#named-aggregation)
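As a sketch of what named aggregation looks like, the cell below uses a made-up DataFrame called `animals_mini` with the same column names as the `animals` data. The output column names (`total_cost`, `mean_cost`, `mean_pump_hours`) are my own choices; each one is paired with a (column, aggregation) tuple.

```python
import pandas as pd

# Hypothetical miniature version of the animals data
animals_mini = pd.DataFrame({
    "AnimalClass": ["Bird", "Bird", "Mammal"],
    "IncidentNominalCost(£)": [260.0, 520.0, 300.0],
    "PumpHoursTotal": [1.0, 2.0, 1.0],
})

# new_column_name=(column_to_summarise, aggregation_as_a_string)
summary = animals_mini.groupby(by="AnimalClass").agg(
    total_cost=("IncidentNominalCost(£)", "sum"),
    mean_cost=("IncidentNominalCost(£)", "mean"),
    mean_pump_hours=("PumpHoursTotal", "mean"),
)

print(summary)
```

A benefit of this form is that the result has flat, readable column names rather than two levels of column headings.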

Note that these DataFrames look a little different from the ones we’ve seen so far: the index holds our grouping categories.

We can use the parameter `as_index=False` in `.groupby()` to keep the grouping columns as ordinary columns, or we could apply the `.reset_index()` method we have seen previously to the result. There are a few reasons we might do this, including visualisation.


In [None]:
group_by_dropped_index = titanic.groupby(by=["embarked", "pclass"], as_index=False)["fare"].count()

group_by_dropped_index
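The two routes to a flat result can be compared on a small made-up DataFrame called `mini`. This is a sketch only; it shows that `as_index=False` and `.reset_index()` both leave the grouping columns as ordinary columns.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature data set with one missing fare
mini = pd.DataFrame({
    "embarked": ["C", "C", "S", "S"],
    "pclass": [1, 1, 3, 3],
    "fare": [80.0, 90.0, 8.0, np.nan],
})

# Default: the grouping columns become the index of the result
indexed = mini.groupby(by=["embarked", "pclass"])["fare"].count()

# Option 1: ask groupby not to use them as the index
flat = mini.groupby(by=["embarked", "pclass"], as_index=False)["fare"].count()

# Option 2: move the index back into columns afterwards
flat_again = indexed.reset_index()

print(indexed)
print(flat)
print(flat_again)
```

Note that `.count()` gives 1 for the second group, because the missing fare is not counted.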

<a id='END'></a>
<hr style="width:100%;height:4px;border-width:0;color:gray;background-color:#003d59; opacity:1"> 

## End of Chapter

In this chapter we’ve explored:
* Overall descriptive statistics
* Range
* Averages
* Spread
* Counting Values
* Other general summary statistics
* Size Bands
* Aggregation


You have completed chapter 6 of the Introduction to Python course. Please move on to chapter 7.

[return to menu](#menu)