# D06: Summary Statistics & GroupBy

Here we'll explore some basic statistical methods we can call on dataframes. We'll start by importing pandas and numpy and loading some data.

In [None]:
import pandas as pd
import numpy as np

In [None]:
url = 'https://raw.github.com/pydata/pandas/master/pandas/tests/data/tips.csv'
df = pd.read_csv(url)

In [None]:
df.head(5)

Pandas can produce some excellent summary statistics for us by calling a variety of methods on the dataframe object we've just created.

In [None]:
''' Some Basics '''

sum_col = df['total_bill'].sum()      # Summing columns (ignores NaN values)
cou_col = df['total_bill'].count()    # Counting values in a column (ignores NaN values)
min_col = df['tip'].min()             # Returns the minimum value in a column
min_idx = df['size'].idxmin()         # Returns the index of the minimum value in a column
max_col = df['size'].max()            # Returns the maximum value in a column
max_idx = df['size'].idxmax()         # Returns the index of the maximum value in a column

''' Averages + Stats '''

med_col = df['size'].median()                       # Returns the median  
mean_col = df['size'].mean()                        # Returns the mean
mode_col = df['size'].mode()                        # Returns the mode(s) as a Series (there can be ties)
std_col = df['tip'].std()                           # Returns the standard deviation
var_col = df['tip'].var()                           # Returns the variance
qua_col = df['total_bill'].quantile([.25,.5,.75,1]) # Returns the quantile values (note you can use any fractions between 0 and 1!)
stats_col = df['total_bill'].describe()             # Returns summary statistics (count, mean, std, min, 25%, 50%, 75%, max)

''' Cumulative Values'''

df['tip cumulative'] = df['tip'].cumsum()           # Creating a column for the Cumulative Sum
df['tip cumulative max'] = df['tip'].cummax()       # Creating a column for the Cumulative Max
df['tip cumulative min'] = df['tip'].cummin()       # Creating a column for the Cumulative Min
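To see how some of these methods relate to each other, here is a minimal sketch on a small made-up Series (the values and the name `bills` are invented for illustration and are not taken from the tips dataset). It checks that the quartiles reported by describe() match quantile(), and that the last value of a cumulative sum equals the overall sum.

```python
import pandas as pd

# Hypothetical bill amounts, invented for illustration
bills = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

quartiles = bills.quantile([0.25, 0.5, 0.75])   # quantiles are given as fractions between 0 and 1
summary = bills.describe()                      # count, mean, std, min, 25%, 50%, 75%, max

# describe() reports the same quartiles under the labels '25%', '50%' and '75%'
print(quartiles[0.5], summary['50%'], bills.median())

# The last value of the cumulative sum equals the overall sum
running_total = bills.cumsum()
print(running_total.iloc[-1], bills.sum())
```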

Note that we can also pass keyword arguments to these methods. For example, the skipna keyword argument controls how missing values are handled:

In [None]:
sum_col = df['total_bill'].sum(skipna = False)
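The effect of skipna is easiest to see on data that actually contains a missing value. The sketch below uses a small made-up Series (`values` is a hypothetical name, not part of the tips dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical values with one missing entry
values = pd.Series([1.0, 2.0, np.nan, 4.0])

total_skipping = values.sum()              # skipna=True is the default, so NaN is ignored
total_strict = values.sum(skipna=False)    # NaN propagates, so the result is NaN

print(total_skipping)    # 7.0
print(total_strict)      # nan
print(values.count())    # 3, because count() only counts non-missing values
```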

We also can be more versatile with how we call methods. They don't have to be on a single column:

In [None]:
totals = df.sum()  # Sums each column in the dataset (named totals to avoid shadowing Python's built-in sum)
totals

In [None]:
sum2 = df[['size','tip','total_bill']].sum()    # sums the values in the selected columns
sum2
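As a rough illustration of calling sum() on a whole dataframe, the sketch below builds a tiny made-up frame (`toy` is a hypothetical stand-in for the tips data). It assumes we only want totals for the numeric columns, and it also shows the axis keyword argument, which sums across each row instead of down each column.

```python
import pandas as pd

# Hypothetical mini version of the tips data, invented for illustration
toy = pd.DataFrame({
    'total_bill': [10.0, 20.0, 30.0],
    'tip': [1.0, 3.0, 5.0],
    'sex': ['Female', 'Male', 'Male'],
})

col_totals = toy.sum(numeric_only=True)                # one total per numeric column
row_totals = toy[['total_bill', 'tip']].sum(axis=1)    # one total per row

print(col_totals)
print(row_totals)
```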

## Groupby

Whilst this is nice, you'll probably want to create summary tables of statistics rather than storing them as individual objects. This is where <b>groupby</b> can help us.

Groupby does three things for us:

* <b>Split</b> Splits the data into groups based on the keys you pass in
* <b>Apply</b> Applies a function/method/transformation to each group separately 
* <b>Combine</b> Combines the resulting data into a single data structure

You'll hear this referred to as 'Split, Apply, Combine' but this makes things sound a little complicated! In actual fact it's easy to understand once you've seen an example:

In [None]:
gp1 = df.groupby('sex').sum(numeric_only=True)   # Simple groupby statement (numeric columns only)
gp1

Here we've split the tips dataset by the sex variable, applied the sum() method to each group and combined the results into a new dataframe.
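To make the three steps concrete, here is a minimal sketch that performs them by hand on a small made-up frame and compares the result with the one-line groupby. The names `toy`, `pieces`, `totals`, `manual` and `direct` are invented for illustration; this is not how pandas implements groupby internally.

```python
import pandas as pd

# Hypothetical mini version of the tips data, invented for illustration
toy = pd.DataFrame({
    'sex': ['Female', 'Male', 'Female', 'Male'],
    'tip': [1.0, 2.0, 3.0, 4.0],
})

# Split: one sub-dataframe per distinct value of 'sex'
pieces = {key: group for key, group in toy.groupby('sex')}

# Apply: sum the 'tip' column within each piece
totals = {key: group['tip'].sum() for key, group in pieces.items()}

# Combine: gather the per-group results into a single Series
manual = pd.Series(totals, name='tip')

# The one-line groupby gives the same numbers
direct = toy.groupby('sex')['tip'].sum()

print(manual)
print(direct)
```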

We can group by more than one variable and use different methods as well:

In [None]:
gp2 = df.groupby(['sex','time']).mean(numeric_only=True) # Grouping by two variables (recent pandas versions need numeric_only=True to skip text columns)
gp2

You'll notice that the sex and time columns are in bold. This is because groupby automatically uses them as the index of the result. A dataframe can have more than one index level, which is called a MultiIndex, and this is where things start to get complicated! We can reset the index to a basic numeric index as follows:

In [None]:
gp2.reset_index(inplace=True)        # Resetting the index in place without creating a new object
gp2

Note the inplace=True keyword argument to the reset_index() method. This modifies gp2 directly, so we don't have to assign the result back to a variable. Without it, we would overwrite the object ourselves like so:

In [None]:
gp2 = gp2.reset_index()   # Alternative to the in-place call; run one or the other, not both
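The two styles can be compared side by side. The sketch below uses a small made-up frame (`toy`, `grouped`, `flat_a` and `flat_b` are hypothetical names) and works on a copy for the in-place version, so both results can be checked against each other.

```python
import pandas as pd

# Hypothetical mini version of the tips data, invented for illustration
toy = pd.DataFrame({
    'sex': ['Female', 'Male', 'Female', 'Male'],
    'time': ['Lunch', 'Lunch', 'Dinner', 'Dinner'],
    'tip': [1.0, 2.0, 3.0, 4.0],
})

grouped = toy.groupby(['sex', 'time']).mean()   # the index is a two-level MultiIndex
print(grouped.index.names)                      # the level names are 'sex' and 'time'

# Option 1: assign the result of reset_index() to a variable
flat_a = grouped.reset_index()

# Option 2: modify a copy in place (reset_index returns None when inplace=True)
flat_b = grouped.copy()
flat_b.reset_index(inplace=True)

print(flat_a.equals(flat_b))
```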

Note that we could instead reset only the first level of the index (sex, because it was specified first in the groupby call), leaving time as the index:

In [None]:
gp2 = df.groupby(['sex','time']).mean(numeric_only=True)  # Grouping by the sex and time variables
gp2 = gp2.reset_index(level=0)                # Moving index level 0 (sex) back into a column
gp2.sort_index(inplace=True)                  # Sorting by the remaining index (time)
gp2.index.name = None                         # Removing the index name so it displays nicely =)
gp2
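To check what a partial reset leaves behind, here is a minimal sketch on a small made-up frame (`toy`, `grouped`, `partial` and `by_name` are hypothetical names). It also shows that a level can be referred to by name rather than by position.

```python
import pandas as pd

# Hypothetical mini version of the tips data, invented for illustration
toy = pd.DataFrame({
    'sex': ['Female', 'Male', 'Female', 'Male'],
    'time': ['Lunch', 'Lunch', 'Dinner', 'Dinner'],
    'tip': [1.0, 2.0, 3.0, 4.0],
})

grouped = toy.groupby(['sex', 'time']).mean()

partial = grouped.reset_index(level=0)        # 'sex' becomes a column, 'time' stays as the index
by_name = grouped.reset_index(level='sex')    # the same level, referred to by name

print(partial.index.name)       # time
print(list(partial.columns))    # ['sex', 'tip']
```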

Often using the groupby function will be the final step before you visualise your data with a chart or graph.

## Further Reading

<a href = "http://pandas.pydata.org/pandas-docs/stable/groupby.html">GroupBy</a>