By "group by" we are referring to a process involving one or more of the following steps:
- Splitting the data into groups based on some criteria
- Applying a function to each group independently
- Combining the results into a data structure

Of these, the split step is the most straightforward. In fact, in many situations you may wish to split the data set into groups and do something with those groups yourself. In the apply step, we might wish to do one of the following:

- Aggregation: computing a summary statistic (or statistics) about each group. Some examples:
     - Compute group sums or means
     - Compute group sizes / counts

- Transformation: perform some group-specific computations and return a like-indexed object. Some examples:
    - Standardizing data (z-score) within each group
    - Filling NAs within groups with a value derived from each group

- Filtration: discard some groups, according to a group-wise computation that evaluates to True or False. Some examples:
    - Discarding data that belongs to groups with only a few members
    - Filtering out data based on the group sum or mean
    

- Some combination of the above: GroupBy will examine the results of the apply step and try to return a sensibly combined result if it doesn't fit into any of the above categories
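
The three kinds of apply step listed above can be sketched on a toy frame. This is only an illustration: the column names `key` and `val` and the data are made up, and it assumes the `mean`, `transform` and `filter` methods of a pandas GroupBy object.

```python
import pandas as pd

# Hypothetical toy data, not taken from the text above.
df = pd.DataFrame({"key": ["a", "a", "b", "b", "b", "c"],
                   "val": [1.0, 3.0, 2.0, 4.0, 6.0, 10.0]})
grouped = df.groupby("key")["val"]

# Aggregation: one summary value per group.
group_means = grouped.mean()

# Transformation: the result is indexed like the original rows.
demeaned = grouped.transform(lambda x: x - x.mean())

# Filtration: drop every row that belongs to a group with fewer than two members.
big_groups = df.groupby("key").filter(lambda g: len(g) >= 2)

print(group_means)
print(demeaned)
print(big_groups)
```

Note that the aggregation is indexed by group name, while the transformation and the filtration keep the original row labels.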

Since the set of object instance methods on pandas data structures is generally rich and expressive, we often simply want to invoke, say, a DataFrame function on each group. The name GroupBy should be quite familiar to those who have used a SQL-based tool (or itertools), in which you can write code like:


```sql
SELECT column1, column2
FROM someTable
GROUP BY column1
```

We aim to make operations like this natural and easy to express using pandas. We'll address each area of GroupBy functionality, then provide some non-trivial examples / use cases.

### Splitting an object into groups

pandas objects can be split on any of their axes. The abstract definition of grouping is to provide a mapping of labels to group names. To create a GroupBy object (more on what the GroupBy object is later), you do the following:

> grouped = obj.groupby(key)

> grouped = obj.groupby(key, axis=1)

> grouped = obj.groupby([key1, key2])

The mapping can be specified in many different ways:

- A Python function, to be called on each of the axis labels
- A list or NumPy array of the same length as the selected axis
- A dict or Series, providing a "label" -> "group name" mapping
- For DataFrame objects, a string indicating a column to be used to group. Of course df.groupby('A') is just syntactic sugar for df.groupby(df['A']), but it makes life simpler
- A list of any of the above things
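
As a rough sketch of these key types, the snippet below groups the same small Series by a function, by a NumPy array and by a dict, and then groups a DataFrame by a column name. The labels, values and group names are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical example data; the labels and values are made up.
s = pd.Series([1, 2, 3, 4], index=["apple", "avocado", "banana", "blueberry"])

# 1. A function, called on each index label (here: its first letter).
by_func = s.groupby(lambda label: label[0]).sum()

# 2. A NumPy array with the same length as the grouped axis.
by_array = s.groupby(np.array(["x", "y", "x", "y"])).sum()

# 3. A dict mapping label -> group name.
by_dict = s.groupby({"apple": "green", "avocado": "green",
                     "banana": "yellow", "blueberry": "blue"}).sum()

# 4. For a DataFrame, a column name; equivalent to passing the column itself.
df = pd.DataFrame({"A": ["foo", "bar", "foo"], "C": [1, 2, 3]})
by_name = df.groupby("A")["C"].sum()
by_column = df["C"].groupby(df["A"]).sum()

print(by_func, by_array, by_dict, by_name, sep="\n")
```

All four forms end up as the same thing internally: a group name for every label on the grouped axis.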
   
Collectively, we refer to the grouping objects as the keys. For example, consider the following DataFrame:

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'tow', 'two', 'one', 'three'],
                    'C': np.random.randn(8),
                    'D': np.random.randn(8),
                  })
df

     A      B         C         D
0  foo    one  0.050834  0.423966
1  bar    one  2.321354 -0.257804
2  foo    two  1.648755  0.496588
3  bar  three  0.644570 -0.306532
4  foo    tow -0.232748  0.798262
5  bar    two -0.508882 -0.985062
6  foo    one -0.975162  0.363517
7  foo  three -0.862191 -0.986987


We could naturally group by either the A or B column, or both:

In [3]:
grouped = df.groupby('A')
grouped = df.groupby(['A', 'B'])

These will split the DataFrame on its index (rows). We could also split by the columns:

In [4]:
def get_letter_type(letter):
    print(letter)  # show which label the function receives
    if letter.lower() in 'aeiou':
        return 'vowel'
    else:
        return 'consonant'

print(get_letter_type('a'))
print(get_letter_type('b'))
# print(get_letter_type(1))  # would raise AttributeError: an int has no lower()

a
vowel
b
consonant


In [5]:
grouped = df.groupby(get_letter_type, axis=1)
# grouped = df.groupby(get_letter_type, axis=0)  # with axis=0 the index labels are passed in, not the column names

A
B
C
D
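
In newer pandas releases the axis argument of groupby is deprecated, so the call above may warn or fail depending on the installed version (which version you have is an assumption here). A minimal sketch of one alternative is to transpose the frame and group its rows instead. The helper `letter_type` and the numeric data below are hypothetical stand-ins for get_letter_type and df.

```python
import pandas as pd

def letter_type(label):
    """Classify a label as 'vowel' or 'consonant' (get_letter_type without the print)."""
    return "vowel" if label.lower() in "aeiou" else "consonant"

# Hypothetical numeric frame with the same column names as the example above.
df = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6], "D": [7, 8]})

# Transposing turns the column labels into index labels, so an ordinary
# row-wise groupby ends up grouping the original columns.
col_groups = df.T.groupby(letter_type)

# Sum within each group, then transpose back: one column per group.
summed = col_groups.sum().T
print(summed)
```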


Starting with 0.8, pandas Index objects support duplicate values. If a non-unique index is used as the group key in a groupby operation, all values for the same index value will be considered to be in one group, and thus the output of aggregation functions will only contain unique index values:

In [6]:
lst = [1, 2, 3, 1, 2, 3]
s = pd.Series([4, 20, 6, 10, 5, 30], lst)
grouped = s.groupby(level=0)  # with a MultiIndex, level 0 is the first (outermost) index level
print(grouped.first())  # return the first row of each group
print(grouped.last())   # return the last row of each group
print(grouped.sum())

1     4
2    20
3     6
dtype: int64
1    10
2     5
3    30
dtype: int64
1    14
2    25
3    36
dtype: int64


Note that no splitting occurs until it's needed. Creating the GroupBy object only verifies that you've passed a valid mapping.
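
A small sketch of that validation step: creating the GroupBy object with a key that is not a column fails immediately, before any aggregation is requested. The frame and the column name `no_such_column` are hypothetical.

```python
import pandas as pd

# Hypothetical two-row frame.
df = pd.DataFrame({"A": ["foo", "bar"], "C": [1, 2]})

# A valid key: the GroupBy object is created, but nothing is computed yet.
grouped = df.groupby("A")

# An invalid key is rejected when the GroupBy object is created.
try:
    df.groupby("no_such_column")
    raised = False
except KeyError:
    raised = True

print(raised)
```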

### GroupBy sorting

By default the group keys are sorted during the groupby operation. You may, however, pass sort=False for potential speedups:

In [7]:
df2 = pd.DataFrame({'X' : ['B','A', 'B', 'A', 'A'], 'Y' : [1, 10, 2, 3, 4]})
print(df2)

print(df2.groupby('X').sum())
print(df2.groupby('X', sort=False).sum())

   X   Y
0  B   1
1  A  10
2  B   2
3  A   3
4  A   4
    Y
X    
A  17
B   3
    Y
X    
B   3
A  17


Note that groupby will preserve the order in which observations are sorted within each group. For example, the rows in the groups created by groupby() below are in the order they appeared in the original DataFrame:

In [8]:
df3 = pd.DataFrame({'X' : ['A', 'B', 'A', 'B'], 'Y' : [1, 4, 3, 2]})
print(df3.groupby('X').get_group('A'))
print(df3.groupby('X').get_group('B'))


   X  Y
0  A  1
2  A  3
   X  Y
1  B  4
3  B  2


### GroupBy object attributes

The groups attribute is a dict whose keys are the computed unique groups and whose corresponding values are the axis labels belonging to each group. In the above example we have:

In [9]:
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'tow', 'two', 'one', 'three'],
                    'C': np.random.randn(8),
                    'D': np.random.randn(8)
                   })
print(df)
print(df.groupby('A').groups)
print(df.groupby(get_letter_type, axis=1).groups)

     A      B         C         D
0  foo    one -2.217326 -0.497954
1  bar    one -2.111401  0.092335
2  foo    two  1.041459  0.395484
3  bar  three -0.648543  1.635534
4  foo    tow  0.024635  1.017739
5  bar    two  0.376798 -1.348295
6  foo    one -2.046664  0.449330
7  foo  three  0.783075  0.216307
{'foo': [0, 2, 4, 6, 7], 'bar': [1, 3, 5]}
A
B
C
D
{'consonant': ['B', 'C', 'D'], 'vowel': ['A']}


Calling the standard Python len function on the GroupBy object just returns the length of the groups dict, so it is largely just a convenience:

In [10]:
grouped = df.groupby(['A', 'B'])
print(grouped.groups)
print(len(grouped))

{('foo', 'three'): [7], ('bar', 'two'): [5], ('foo', 'one'): [0, 6], ('bar', 'one'): [1], ('foo', 'tow'): [4], ('bar', 'three'): [3], ('foo', 'two'): [2]}
7
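
As a quick check of that equivalence, the sketch below compares len() with the size of the groups dict. It also uses the `ngroups` attribute and the `size()` method, which are not mentioned in the text above. The data is invented.

```python
import pandas as pd

# Hypothetical frame with three distinct (A, B) combinations.
df = pd.DataFrame({"A": ["foo", "bar", "foo", "bar"],
                   "B": ["one", "one", "two", "one"],
                   "C": [1, 2, 3, 4]})
grouped = df.groupby(["A", "B"])

# len() counts groups, not rows.
n_groups = len(grouped)
same_as_dict = n_groups == len(grouped.groups)

# size() gives the number of rows in each group.
sizes = grouped.size()

print(n_groups, same_as_dict)
print(sizes)
```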


GroupBy will tab-complete column names (and other attributes).

### GroupBy with MultiIndex

With hierarchically-indexed data, it's quite natural to group by one of the levels of the hierarchy.

Let's create a Series with a two-level MultiIndex:

In [11]:
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])
s = pd.Series(np.random.randn(8), index=index)
print(s)

first  second
bar    one       1.768040
       two      -0.612406
baz    one      -0.050725
       two      -1.344785
foo    one       0.918940
       two      -0.992633
qux    one      -0.694429
       two       1.969538
dtype: float64


We can group by one of the levels in s.

In [12]:
grouped = s.groupby(level=0)
print(grouped.groups)
print(grouped.sum())

{'baz': [('baz', 'one'), ('baz', 'two')], 'foo': [('foo', 'one'), ('foo', 'two')], 'bar': [('bar', 'one'), ('bar', 'two')], 'qux': [('qux', 'one'), ('qux', 'two')]}
first
bar    1.155633
baz   -1.395510
foo   -0.073692
qux    1.275109
dtype: float64


If the MultiIndex has names specified, these can be passed instead of the level number:

In [13]:
print(s.groupby(level='second').sum())
print(s.groupby(level=1).sum())

second
one    1.941825
two   -0.980285
dtype: float64
second
one    1.941825
two   -0.980285
dtype: float64


Also, as of v0.6, grouping with multiple levels is supported:

In [14]:
s.groupby(level=['first', 'second']).sum()

first  second
bar    one       1.768040
       two      -0.612406
baz    one      -0.050725
       two      -1.344785
foo    one       0.918940
       two      -0.992633
qux    one      -0.694429
       two       1.969538
dtype: float64

### DataFrame column selection in GroupBy

Once you have created the GroupBy object from a DataFrame, for example, you might want to do something different for each of the columns. Thus, using [] similar to getting a column from a DataFrame, you can do:

In [15]:
grouped = df.groupby(['A'])
print(grouped.sum())
print(grouped['C'].sum())

            C         D
A                      
bar -2.383146  0.379575
foo -2.414820  1.580907
A
bar   -2.383146
foo   -2.414820
Name: C, dtype: float64


This is mainly syntactic sugar for the alternative and much more verbose:

In [16]:
df['C'].groupby(df['A']).sum()

A
bar   -2.383146
foo   -2.414820
Name: C, dtype: float64

Additionally, this method avoids recomputing the internal grouping information derived from the passed key.
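
A short sketch of reusing one GroupBy object for several column selections, compared against the verbose form. The frame below is hypothetical; only the column names mirror the example above.

```python
import pandas as pd

# Hypothetical data with the same column names as the running example.
df = pd.DataFrame({"A": ["foo", "bar", "foo", "bar"],
                   "C": [1, 2, 3, 4],
                   "D": [10.0, 20.0, 30.0, 40.0]})

# Build the grouping once...
grouped = df.groupby("A")

# ...then select columns from it as often as needed.
c_sum = grouped["C"].sum()
d_mean = grouped["D"].mean()
both = grouped[["C", "D"]].sum()   # a list of names selects several columns

# The verbose alternative has to rebuild the grouping for each call.
verbose = df["C"].groupby(df["A"]).sum()

print(c_sum.equals(verbose))
```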

### Iterating through groups

With the GroupBy object in hand, iterating through the grouped data is very natural and functions similarly to itertools.groupby:

In [17]:
grouped = df.groupby('A')
for name, group in grouped:
    print(name)
    print(group)

bar
     A      B         C         D
1  bar    one -2.111401  0.092335
3  bar  three -0.648543  1.635534
5  bar    two  0.376798 -1.348295
foo
     A      B         C         D
0  foo    one -2.217326 -0.497954
2  foo    two  1.041459  0.395484
4  foo    tow  0.024635  1.017739
6  foo    one -2.046664  0.449330
7  foo  three  0.783075  0.216307


In the case of grouping by multiple keys, the group name will be a tuple:

In [18]:
grouped = df.groupby(['A','B'])
for name, group in grouped:
    print(name)
    print(group)

('bar', 'one')
     A    B         C         D
1  bar  one -2.111401  0.092335
('bar', 'three')
     A      B         C         D
3  bar  three -0.648543  1.635534
('bar', 'two')
     A    B         C         D
5  bar  two  0.376798 -1.348295
('foo', 'one')
     A    B         C         D
0  foo  one -2.217326 -0.497954
6  foo  one -2.046664  0.449330
('foo', 'three')
     A      B         C         D
7  foo  three  0.783075  0.216307
('foo', 'tow')
     A    B         C         D
4  foo  tow  0.024635  1.017739
('foo', 'two')
     A    B         C         D
2  foo  two  1.041459  0.395484


It's standard Python-fu, but remember you can unpack the tuple in the for loop statement if you wish: for (k1, k2), group in grouped:
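
A minimal sketch of unpacking the group name in the loop header when grouping by two keys. The data is made up for illustration.

```python
import pandas as pd

# Hypothetical frame; (bar, one) occurs twice, the other pairs once.
df = pd.DataFrame({"A": ["foo", "bar", "foo", "bar"],
                   "B": ["one", "one", "two", "one"],
                   "C": [1, 2, 3, 4]})

# With two grouping keys each group name is a 2-tuple,
# so it can be unpacked directly in the for statement.
summary = []
for (k1, k2), group in df.groupby(["A", "B"]):
    summary.append((k1, k2, len(group)))

print(summary)
```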

### Selecting a group

A single group can be selected using GroupBy.get_group():

In [23]:
grouped.get_group(('foo','one'))

     A    B         C         D
0  foo  one -2.217326 -0.497954
6  foo  one -2.046664  0.449330


### Aggregation

Once the GroupBy object has been created, several methods are available to perform a computation on the grouped data.

An obvious one is aggregation via the aggregate or equivalently agg method:


In [28]:
grouped = df.groupby('A')
print(grouped.aggregate(np.sum))
grouped = df.groupby(['A', 'B'])
print(grouped.agg(np.sum))

            C         D
A                      
bar -2.383146  0.379575
foo -2.414820  1.580907
                  C         D
A   B                        
bar one   -2.111401  0.092335
    three -0.648543  1.635534
    two    0.376798 -1.348295
foo one   -4.263989 -0.048623
    three  0.783075  0.216307
    tow    0.024635  1.017739
    two    1.041459  0.395484


As you can see, the result of the aggregation will have the group names as the new index along the grouped axis. In the case of multiple keys, the result is a MultiIndex by default, though this can be changed by using the as_index option: