# Pandas Group Operations

In pandas, a “group by” operation on a DataFrame refers to a process involving one or more of the following steps:

* Splitting the data into groups based on some criteria.

* Applying a function to each group independently.

    * Aggregation
    * Transformation
    * Filtration

* Combining the results into a data structure.
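The three steps above can be sketched by hand on a tiny table. This is only an illustration: the `toy` DataFrame and its values are made up and are not the drinks dataset loaded below.

```python
import pandas as pd

# Toy data with made-up values (not the drinks dataset used later).
toy = pd.DataFrame({
    "continent": ["Asia", "Europe", "Asia", "Europe", "Africa"],
    "beer_servings": [10, 200, 30, 300, 50],
})

# Split: one sub-DataFrame per distinct continent value.
pieces = {name: part for name, part in toy.groupby("continent")}

# Apply: compute a summary (here the mean) for each piece independently.
means = {name: part["beer_servings"].mean() for name, part in pieces.items()}

# Combine: collect the per-group results into a single Series.
manual = pd.Series(means, name="beer_servings").sort_index()

# The usual one-line groupby call gives the same numbers.
built_in = toy.groupby("continent")["beer_servings"].mean()
print(manual)
print(built_in)
```

In practice the one-line form is what you would normally write; the manual version is here only to make each step visible.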

In [None]:
import pandas as pd

In [None]:
# Import the data from an online source
drink = pd.read_csv("http://bit.ly/drinksbycountry")

In [None]:
drink

In [None]:
# Drop the "total_litres_of_pure_alcohol" column to keep the examples simple
drink = drink.drop(columns=['total_litres_of_pure_alcohol'])

In [None]:
drink

In [None]:
#Sorting values
drink.sort_values(by='wine_servings')

In [None]:
drink.beer_servings.mean()

## Groupby

A grouped operation starts by specifying which groups of data we want to operate over. There are many ways of making groups, but the tool that pandas provides for this is groupby.

Syntax: DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False)

Parameters:

* by : mapping, function, str, or iterable. Used to determine the groups.
* axis : int, default 0
* level : If the axis is a MultiIndex (hierarchical), group by a particular level or levels.
* as_index : For aggregated output, return an object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output.
* sort : Sort group keys. Turning this off gives better performance. Note that this does not influence the order of observations within each group; groupby preserves the order of rows within each group.
* group_keys : When calling apply, add group keys to the index to identify pieces.
* squeeze : Reduce the dimensionality of the return type if possible, otherwise return a consistent type. This argument was deprecated in pandas 1.1 and removed in pandas 2.0.

Returns: GroupBy object
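To see what as_index and sort change in practice, here is a small sketch. The `toy` DataFrame is a made-up stand-in, not the drinks dataset.

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({
    "continent": ["Europe", "Asia", "Europe", "Asia"],
    "wine_servings": [100, 10, 300, 30],
})

# Defaults: group labels become the index and are sorted by key.
indexed = toy.groupby("continent").sum()

# as_index=False keeps the key as an ordinary column ("SQL-style" output);
# sort=False keeps the groups in order of first appearance.
flat = toy.groupby("continent", as_index=False, sort=False).sum()

print(indexed)
print(flat)
```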

In [None]:
# Roughly comparable to a GROUP BY continent clause in SQL
dcontinent = drink.groupby('continent')

In [None]:
dcontinent

What is that DataFrameGroupBy thing? Its .__str__() doesn’t tell you much about what it actually is or how it works. A DataFrameGroupBy object can be difficult to wrap your head around because it is lazy in nature: it doesn’t compute a useful result until you ask for one, for example by iterating over it or calling an aggregation.
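A quick way to get a feel for the object is to ask it a few cheap questions before aggregating anything. The sketch below uses a made-up `toy` DataFrame rather than the drinks data.

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({
    "continent": ["Asia", "Europe", "Asia"],
    "beer_servings": [10, 200, 30],
})

grouped = toy.groupby("continent")    # no summary has been computed yet
print(type(grouped).__name__)         # name of the GroupBy class

print(grouped.ngroups)                # number of distinct groups
print(grouped.size())                 # number of rows in each group

# A result is only produced once an operation is requested.
totals = grouped["beer_servings"].sum()
print(totals)
```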

In [None]:
#Listing out group by continent using for loop
for cont, contd in dcontinent:
    print(cont)
    print(contd.head())

In [None]:
# Mapping of each group name to the row labels that belong to it
dcontinent.groups

In [None]:
# Selecting the rows of a single group (here "Asia")
dcontinent.get_group("Asia").head()

# Apply - Combine Operations

The apply step is arguably the most important part of a GroupBy operation: this is where we perform aggregation, transformation, filtration, or even run our own function on each group.
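As a minimal sketch of running your own function on each group, the example below computes the spread (max minus min) per continent. The helper name `value_range` and the `toy` values are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({
    "continent": ["Asia", "Asia", "Europe", "Europe"],
    "beer_servings": [10, 30, 200, 300],
})

def value_range(series):
    """Hypothetical helper: difference between the largest and smallest value."""
    return series.max() - series.min()

# apply() calls value_range once per group and combines the results
# into a Series indexed by the group labels.
spread = toy.groupby("continent")["beer_servings"].apply(value_range)
print(spread)
```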

## Aggregation

Aggregation: compute a summary statistic (or statistics) for each group. Some examples:

* Compute group sums or means.

* Compute group sizes / counts.

The agg() function in pandas lets us compute several statistics at once. Commonly used summary functions include:
* count() – Number of non-null observations
* sum() – Sum of values
* mean() – Mean of values
* median() – Arithmetic median of values
* min() – Minimum
* max() – Maximum
* mode() – Mode
* std() – Standard deviation
* var() – Variance
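Besides a list of function names, agg() also accepts keyword arguments of the form new_name=(column, function), which pandas calls named aggregation. A small sketch on made-up data follows; the output column names `beer_mean`, `wine_max` and `n_rows` are arbitrary choices.

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({
    "continent": ["Asia", "Asia", "Europe", "Europe"],
    "beer_servings": [10, 30, 200, 300],
    "wine_servings": [5, 15, 100, 300],
})

# Each keyword names an output column; the tuple gives (input column, function).
summary = toy.groupby("continent").agg(
    beer_mean=("beer_servings", "mean"),
    wine_max=("wine_servings", "max"),
    n_rows=("beer_servings", "count"),
)
print(summary)
```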

In [None]:
#Finding Mean for main dataset
print(drink)
drink.beer_servings.mean()

In [None]:
#Finding mean for beer serving for grouped dataset
dcontinent.beer_servings.mean()

In [None]:
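# Mean of every numeric column per continent.
# numeric_only=True skips the text column "country"; recent pandas versions raise an error without it.
dcontinent.mean(numeric_only=True)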

In [None]:
#Finding max. value for beer serving on grouped dataset
dcontinent.beer_servings.max()

In [None]:
#Finding min. value for beer serving on grouped dataset
dcontinent.beer_servings.min()

In [None]:
#Using describe for grouped dataset
dcontinent.describe()

In [None]:
# Computing several aggregations of beer_servings at once
dcontinent.beer_servings.agg(['count','mean','median'])

In [None]:
# %matplotlib inline will make your plot outputs appear and be stored within the notebook.

%matplotlib inline

In [None]:
# Bar chart of the per-continent means of the numeric columns
dcontinent.mean(numeric_only=True).plot(kind='bar')

## Filter

Filtration: discard some groups, according to a group-wise computation that evaluates True or False. Some examples:

* Discard data that belongs to groups with only a few members.

* Filter out data based on the group sum or mean.
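Both examples above map onto the filter() method of a GroupBy object, which keeps or drops whole groups depending on whether a function returns True or False for them. The sketch below uses made-up data, and the thresholds (2 rows, mean above 100) are arbitrary.

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({
    "continent": ["Asia", "Asia", "Europe", "Europe", "Europe", "Oceania"],
    "beer_servings": [10, 30, 200, 300, 250, 90],
})
grouped = toy.groupby("continent")

# Discard groups with only a few members: keep groups with at least 2 rows.
big_groups = grouped.filter(lambda g: len(g) >= 2)

# Filter on a group statistic: keep groups whose mean beer_servings exceeds 100.
heavy_groups = grouped.filter(lambda g: g["beer_servings"].mean() > 100)

print(big_groups)
print(heavy_groups)
```

Note that the result contains the original rows of the surviving groups, not one summary row per group.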

>> loc: filter rows by LABEL, and select columns by LABEL

In [None]:
# row with label 0
drink.loc[0] 

In [None]:
# rows with labels 0 through 3
drink.loc[0:3]

>> iloc: filter rows by POSITION, and select columns by POSITION

In [None]:
# rows with positions 0 through 2 (not 3)
drink.iloc[0:3]

In [None]:
# all rows, columns with positions 0 through 2
drink.iloc[:, 0:3] 

>> mixing: select columns by LABEL, then filter rows by POSITION

In [None]:
# first three rows of the wine_servings column
drink.wine_servings[0:3]

In [None]:
# first three rows of the three servings columns
drink[['beer_servings', 'spirit_servings', 'wine_servings']][0:3]

## Transformation

Transformation performs some group-specific computation and returns a like-indexed object. Some examples:

* Standardize data (zscore) within a group.

* Filling NAs within groups with a value derived from each group.
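The two examples above can be sketched with transform(), which returns a result aligned row by row with the original data. The `toy` DataFrame is made up, with a missing wine value added on purpose.

```python
import pandas as pd

# Made-up values for illustration only; one wine value is missing on purpose.
toy = pd.DataFrame({
    "continent": ["Asia", "Asia", "Europe", "Europe"],
    "beer_servings": [10.0, 30.0, 200.0, 300.0],
    "wine_servings": [5.0, None, 100.0, 300.0],
})
by_continent = toy.groupby("continent")

# z-score within each group: (value - group mean) / group standard deviation.
beer = by_continent["beer_servings"]
beer_z = (toy["beer_servings"] - beer.transform("mean")) / beer.transform("std")

# Fill each missing value with the mean of its own group.
wine_filled = toy["wine_servings"].fillna(
    by_continent["wine_servings"].transform("mean")
)

print(beer_z)
print(wine_filled)
```

Unlike an aggregation, each result here has one value per original row, so it can be assigned straight back as a new column.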

In [None]:
# Largest beer_servings value in the whole dataset
drink.beer_servings.max()

In [None]:
# Number of countries where at least one type of serving exceeds 300
drink[(drink.wine_servings > 300) | (drink.beer_servings > 300) | (drink.spirit_servings > 300)].country.count()