# 19 GroupBy
File(s) needed: gapminder.tsv

This notebook covers three types of data operations using `groupby` functionality in pandas.
1. aggregation
2. transformation
3. filtering

Each of these operations can be done _without_ `groupby`. In fact, we have already done #1 and #3 when we used conditional subsetting earlier. `groupby`, however, gives us more speed and flexibility.

In [1]:
# Import necessary libraries and read data from tab separated file
import numpy as np
import pandas as pd
df=pd.read_csv("../MIS-3335/DATA/gapminder.tsv",sep='\t')

In [2]:
df

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.853030
2,Afghanistan,Asia,1962,31.997,10267083,853.100710
3,Afghanistan,Asia,1967,34.020,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106
...,...,...,...,...,...,...
1699,Zimbabwe,Africa,1987,62.351,9216418,706.157306
1700,Zimbabwe,Africa,1992,60.377,10704340,693.420786
1701,Zimbabwe,Africa,1997,46.809,11404948,792.449960
1702,Zimbabwe,Africa,2002,39.989,11926563,672.038623


# Data aggregation
A big part of working with large datasets is looking at summarizations of the data, which we have done previously. Let's review these simple aggregations.

---

You know from statistics that we describe data with quantities like mean, median, min, and max. With these quantities, we get some insight into the data from a single number.

When we use these aggregating functions with a DataFrame we get a value for each column with numeric data.

|<p style="text-align:left;">pandas Aggregation Method</p>|  | <p style="text-align:left;">Result</p>|
| --- | --- | --- |
|<p style="text-align:left;">count()</p> | |<p style="text-align:left;">Total number of items</p>|
|<p style="text-align:left;">first(), last()</p> | |<p style="text-align:left;">First and last item</p>|
|<p style="text-align:left;">sum()</p> | |<p style="text-align:left;">Sum of all items</p>|
|<p style="text-align:left;">mean(), median()</p> | |<p style="text-align:left;">Mean and median</p>|
|<p style="text-align:left;">min(), max()</p>  | |<p style="text-align:left;">Minimum and maximum items</p>|
|<p style="text-align:left;">std(), var()</p>  | |<p style="text-align:left;">Standard deviation and variance</p>|


In [6]:
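# Example: simple aggregation with DataFrame data
# numeric_only=True skips the text columns (recent pandas versions raise an error otherwise)
# Only the last expression in a cell is displayed, so the output below is the median
df.mean(numeric_only=True)
df.median(numeric_only=True)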

year         1.979500e+03
lifeExp      6.071250e+01
pop          7.023596e+06
gdpPercap    3.531847e+03
dtype: float64

We can use the methods on individual columns as well.

In [8]:
# Example: find the mean of the pop column
df['pop'].mean()

29601212.324530516

Some of these aggregations are included in the `describe()` method we've used before. This is still a good tool to use to learn about your data.

In [10]:
# dataframe describe() method
df.describe()


Unnamed: 0,year,lifeExp,pop,gdpPercap
count,1704.0,1704.0,1704.0,1704.0
mean,1979.5,59.474439,29601210.0,7215.327081
std,17.26533,12.917107,106157900.0,9857.454543
min,1952.0,23.599,60011.0,241.165877
25%,1965.75,48.198,2793664.0,1202.060309
50%,1979.5,60.7125,7023596.0,3531.846989
75%,1993.25,70.8455,19585220.0,9325.462346
max,2007.0,82.603,1318683000.0,113523.1329


Compare this output to the full DataFrame displayed above. Where are `country` and `continent`? By default, `describe()` summarizes only the numeric columns. We saw in the output from the `info()` method that those two columns are of the _object_ data type (a data type inherited from numpy). We have to either specifically request that object results be included (using the `include` option) or ask that the numeric columns be left out (using `exclude`).

In [14]:
# see object column results for describe()
df.describe(include=[object])

Unnamed: 0,country,continent
count,1704,1704
unique,142,5
top,Nigeria,Africa
freq,12,624


And we can also get appropriate info on individual columns.
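
As a minimal sketch of what that looks like, the cell below calls `describe()` on a single text column. The `toy` DataFrame is made-up data that only mimics the gapminder column names; it is not the real file.

```python
import pandas as pd

# Made-up rows that mimic the gapminder layout (not the real data)
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B", "C"],
    "continent": ["Asia", "Asia", "Africa", "Africa", "Africa"],
    "lifeExp": [50.0, 52.0, 40.0, 43.0, 45.0],
})

# describe() on one non-numeric column reports count, unique, top, and freq
continent_summary = toy["continent"].describe()
print(continent_summary)

# value_counts() shows the full frequency table behind 'top' and 'freq'
print(toy["continent"].value_counts())
```

With the real data, the same call would be `df['continent'].describe()`.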

In [None]:
# Example: individual non-numeric column



In [13]:
# What happens if we try to obtain an inappropriate aggregation?
#df['country'].mean()
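
One cautious way to explore that question without stopping the notebook is to wrap the call in `try`/`except`. This sketch uses a small made-up Series and assumes the failure surfaces as a `TypeError`, which is what recent pandas versions raise for text data.

```python
import pandas as pd

# Made-up text column standing in for df['country']
countries = pd.Series(["Nigeria", "Peru", "Japan"], name="country")

try:
    countries.mean()          # a mean of text values is not defined
    outcome = "no error"
except TypeError as err:      # assumption: pandas reports this as a TypeError
    outcome = "TypeError"
    print("mean() is not defined for text:", err)
```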

# Conditional aggregation: GroupBy
Those basic aggregation methods can give us quite a bit of insight into our data. Sometimes we want to see some of those aggregations based upon a grouping of the data. We can slice or subset the data and then use those simple aggregations, or we can do both in a single statement. pandas implements a `groupby` method of the DataFrame to do just that.

http://pandas.pydata.org/pandas-docs/stable/groupby.html

If the term 'groupby' sounds familiar, that is because it was adopted from the `GROUP BY` clause of the SQL language.

Look familiar?
```
SELECT Column1, Column2, AVG(Column3), SUM(Column4)
FROM SomeTable
GROUP BY Column1, Column2
```

In SQL and in pandas, `groupby` allows you to apply aggregation to selected subsets of the data. 
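
For comparison, here is a rough pandas counterpart to the SQL query above. The table contents are invented for illustration, and the output column names `mean_col3` and `sum_col4` are hypothetical labels chosen for this sketch.

```python
import pandas as pd

# Invented stand-in for SomeTable
some_table = pd.DataFrame({
    "Column1": ["a", "a", "a", "b"],
    "Column2": ["x", "x", "y", "y"],
    "Column3": [1.0, 3.0, 5.0, 7.0],
    "Column4": [10, 20, 30, 40],
})

# Group on two key columns, then aggregate two other columns differently
result = (
    some_table
    .groupby(["Column1", "Column2"], as_index=False)
    .agg(mean_col3=("Column3", "mean"), sum_col4=("Column4", "sum"))
)
print(result)
```

Passing `as_index=False` keeps the grouping keys as ordinary columns, which is closer to what a SQL result set looks like.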

In [20]:
# Example: count all column values by continent
df.groupby('continent').count()

continent,country,year,lifeExp,pop,gdpPercap
Africa,624,624,624,624,624
Americas,300,300,300,300,300
Asia,396,396,396,396,396
Europe,360,360,360,360,360
Oceania,24,24,24,24,24


That last statement creates a `groupby` object and then applies the `count()` method to it, but the grouped version of the data is not saved. The `groupby` object can instead be assigned to a name and then reused in multiple operations. Think of it as a view of the data, like a view created by a SQL query.

In [24]:
# Example: create the groupby object for further use
continents=df.groupby('continent')
continents.head()

# print only the Asia group
continents.get_group("Asia")

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.853030
2,Afghanistan,Asia,1962,31.997,10267083,853.100710
3,Afghanistan,Asia,1967,34.020,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106
...,...,...,...,...,...,...
1675,"Yemen, Rep.",Asia,1987,52.922,11219340,1971.741538
1676,"Yemen, Rep.",Asia,1992,55.599,13367997,1879.496673
1677,"Yemen, Rep.",Asia,1997,58.020,15826497,2117.484526
1678,"Yemen, Rep.",Asia,2002,60.308,18701257,2234.820827


That was an example of subsetting the entire dataframe by grouping on a column. What if we want to perform operations on a single column? The following cell calculates the average life expectancy per year. Let's take a good look at the code once we've run it.

In [27]:
# Average life expectancy per year
avg_life=df.groupby('year')['lifeExp'].mean()
avg_life

year
1952    49.057620
1957    51.507401
1962    53.609249
1967    55.678290
1972    57.647386
1977    59.570157
1982    61.533197
1987    63.212613
1992    64.160338
1997    65.014676
2002    65.694923
2007    67.007423
Name: lifeExp, dtype: float64

The effect is that we've created subsets of the data by each unique year value, then applied the `mean()` method to each subset.

Instead of using the groupby technique, we could 
- get a list of unique year values,
- subset the data for each year, and
- take the mean of each subset.

In [32]:
# Example: find the average life expectancy for 1952
years=df['year'].unique()
y1952=df.query("year==1952")
y1952_mean=y1952['lifeExp'].mean()
y1952_mean

49.05761971830987

Even with all the functionality pandas has built into it, that takes several lines of code, and it only handles one year. Groupby does all that for every year in one statement and returns the results in a single Series indexed by year.
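
To make the comparison concrete, the sketch below carries the manual steps through for every year with a loop and checks the result against the one-line `groupby` version. The four rows are made up so the cell runs on its own.

```python
import pandas as pd

# Made-up data with the same column names as the gapminder file
toy = pd.DataFrame({
    "year": [1952, 1952, 1957, 1957],
    "lifeExp": [30.0, 40.0, 35.0, 45.0],
})

# Manual approach: loop over the unique years, subset, and average each subset
manual_means = {}
for year in toy["year"].unique():
    subset = toy[toy["year"] == year]
    manual_means[year] = subset["lifeExp"].mean()
manual = pd.Series(manual_means, name="lifeExp")

# groupby approach: the same result in one statement
grouped = toy.groupby("year")["lifeExp"].mean()

print(manual.tolist() == grouped.tolist())
```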

## Other built-in and custom aggregation methods
We have already used several of the built-in aggregation methods from pandas, like `count`, `sum`, and `mean`. Many of them are included as part of the `describe` method. There are many more available for use. Refer to the pandas documentation to learn more: https://pandas.pydata.org/pandas-docs/stable/reference/frame.html

You can also apply methods from numpy or other libraries to a column by using the `agg` or `aggregate` series methods. You can even apply multiple aggregations or write your own Python function and apply it to your data in that way. More on this topic is covered in the text, pages 192-197. 
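
As a small illustration, `agg` can take a list that mixes built-in aggregation names with your own function. The data below is made up, and `value_range` is a hypothetical helper written for this sketch, not a pandas function.

```python
import pandas as pd

# Made-up data with gapminder-style column names
toy = pd.DataFrame({
    "continent": ["Asia", "Asia", "Africa", "Africa"],
    "lifeExp": [50.0, 60.0, 40.0, 48.0],
})

def value_range(values):
    """Custom aggregation: spread between the largest and smallest value."""
    return values.max() - values.min()

# Built-in aggregations are named as strings; the custom one is passed directly
summary = toy.groupby("continent")["lifeExp"].agg(["mean", "max", value_range])
print(summary)
```

Each entry in the list becomes one column of the result, and the custom column is named after the function.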

# Data transformation
Transforming the data doesn't reduce or summarize it. Transforming data alters each data value into another form. We will use the `transform()` method in combination with `groupby` to change data as desired.

### Standardizing
One concept you might remember from your stats class is z-scores. We often use z-scores to **_standardize_** data. In some analyses, the data is on vastly different scales. One way to be able to meaningfully compare the data is to convert it to be on the same scale. We do that by standardizing with z-scores.

A z-score indicates how far a particular value is from the mean, expressed in standard deviations. The standardized dataset is centered at 0 (the mean is zero standard deviations away from itself) and has a standard deviation of one. The formula to calculate a z-score using a data value `x`, a dataset mean of `mu`, and a standard deviation of `sigma`, is

\begin{equation*}
z=\frac{x-\mu}{\sigma}
\end{equation*}

We could write a simple function and use `transform` to apply it. Of course, there is already a z-score function in the Python ecosystem that we can use. It is in the scipy library. Use the scipy `zscore` function to standardize the life expectancy values by year.

In [None]:
# import the zscore function from scipy.stats


# calculate a non-grouped zscore for life expectancy


# calculate a grouped zscore on year for life expectancy


In [None]:
# Let's look at the results


Why are they different? When the z-score is calculated outside the groupby, it is calculated across the entire dataset. Inside the groupby, the z-score is calculated across just the group values. The important thing to note is the difference. Which one to use depends upon your data and what you want to do with it.
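
The difference is easy to see on a tiny made-up dataset. This sketch uses a hand-written `z_score` helper (hypothetical, written for this example) rather than scipy so that the cell has no extra dependencies. It divides by the population standard deviation (`ddof=0`), which is also the default used by `scipy.stats.zscore`.

```python
import pandas as pd

# Made-up data: two years with very different life expectancy levels
toy = pd.DataFrame({
    "year": [1952, 1952, 1952, 2007, 2007, 2007],
    "lifeExp": [30.0, 40.0, 50.0, 60.0, 70.0, 80.0],
})

def z_score(values):
    """Standardize using the population standard deviation (ddof=0)."""
    return (values - values.mean()) / values.std(ddof=0)

# Non-grouped: mean and standard deviation come from the whole column
toy["z_overall"] = z_score(toy["lifeExp"])

# Grouped: mean and standard deviation are computed within each year
toy["z_by_year"] = toy.groupby("year")["lifeExp"].transform(z_score)

print(toy)
```

In this toy data the middle value of each year gets a grouped z-score of 0, while its overall z-score is nonzero because it is compared against both years at once.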

# Filtering
Filtering lets you split your data by key values and then keep or drop entire groups based on a boolean condition. We'll use the _tips_ data set from the seaborn library.

In [None]:
# import seaborn and load the data set


In [None]:
# look at the number of rows and frequency counts for table size


What we might do with the data depends upon our overall goals, but we might want to exclude rows with sizes of 1, 5, or 6 people because there are so few of them in the data. We can do that by using a filter on a grouping.

In [None]:
# use a lambda function to keep only the party sizes that appear in at least 30 rows
# (assumes `tips` was loaded from seaborn in the cell above)
tips_filtered = tips.groupby('size').filter(lambda x: x['size'].count() >= 30)
print(tips_filtered.shape)
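
The same pattern can be tried on a small made-up table where the result is easy to check by eye. The `toy_tips` data and the `min_rows` threshold are assumptions for this sketch and are not part of the seaborn data set.

```python
import pandas as pd

# Made-up stand-in for the tips data: party size and bill amount
toy_tips = pd.DataFrame({
    "size": [2, 2, 2, 3, 3, 6],
    "total_bill": [10.0, 12.5, 9.0, 20.0, 22.0, 48.0],
})

min_rows = 2  # hypothetical threshold; the notebook cell uses 30

# filter() calls the function once per group and keeps every row of the
# groups for which it returns True
kept = toy_tips.groupby("size").filter(lambda group: len(group) >= min_rows)
print(kept.shape)
```

The single row for a party of 6 is dropped because its group has fewer than `min_rows` rows, while the rows for parties of 2 and 3 are returned unchanged.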

In [None]:
# look at the filtered data
