Categorizing a dataset and applying a function to each group. 

After loading, merging, and preparing a dataset, you may need to compute group statistics or possibly pivot tables for reporting or visualization purposes.

Use the pandas `groupby` interface to slice, dice, and summarize datasets:

- Split a pandas object into pieces using one or more keys
- Calculate group summary statistics like count, mean or standard deviation
- Apply within-group transformations or other manipulations like normalization, linear regression, rank, or subset selection
- Compute pivot tables and cross-tabulations
- Perform quantile analysis and other statistical group analyses

In [1]:
import numpy as np

import pandas as pd



# 10.1 How to think about Group Operations
Group operations follow a "split-apply-combine" pattern:
1. Data contained in a pandas object is split into groups based on one or more keys that you provide. The splitting is performed on a particular axis of the object.
2. A function is applied to each group, producing a new value.
3. Finally, the results of all those function applications are combined into a result object.

Each grouping key can take many forms, and the keys do not all have to be of the same type.
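The three steps above can be sketched by hand and compared with what `groupby` does in one call. The frame `toy` and its column names are made up for illustration; this is a rough sketch of the idea, not how pandas implements it internally.

```python
import pandas as pd

# Hypothetical toy data: one key column and one value column
toy = pd.DataFrame({"key": ["a", "b", "a", "b", "a"],
                    "value": [1.0, 2.0, 3.0, 4.0, 5.0]})

# Split: one Series of values per distinct key
pieces = {k: toy.loc[toy["key"] == k, "value"] for k in toy["key"].unique()}

# Apply: reduce each piece to a single number
applied = {k: piece.mean() for k, piece in pieces.items()}

# Combine: assemble the per-group results into one Series
manual = pd.Series(applied, name="value").sort_index()

# The same computation expressed with groupby
via_groupby = toy.groupby("key")["value"].mean()

print(manual)
print(via_groupby)
```

Both results hold one mean per key; `groupby` simply packages the split, apply, and combine steps behind a single interface.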



A `GroupBy` object may look like a DataFrame, but it is already grouped by the provided group key. Nothing is computed until you apply an operation to it.

In [2]:
df = pd.DataFrame(
    {
        "key1": ["a", "a", None, "b", "b", "a", None],
        "key2": pd.Series([1, 2, 1, 2, 1, None, 1], dtype="Int64"),
        "data1": np.random.standard_normal(7),
        "data2": np.random.standard_normal(7),
    }
)

# Compute the mean of the data1 column using the labels from key1
# Returns the mean of each group in key1 (rows sharing the same key1 value form one group)
grouped = df['data1'].groupby(df['key1'])
grouped.mean()




key1
a   -0.455194
b    0.115268
Name: data1, dtype: float64

In [None]:
df.groupby(df['key1']).head()

In [None]:
means = df['data1'].groupby( df['key1']).mean()
means

In [None]:
means = df['data1'].groupby( df['key2']).mean()
means

In [None]:
means = df['data1'].groupby([df['key1'], df['key2']])
means.head(999)


In [None]:
means = df['data1'].groupby([df['key1'], df['key2']]).mean()
means.unstack()

In [None]:
states = np.array(['OH', "CA", "CA", "OH", "OH", "CA", "OH"])

years = [2005, 2005, 2006, 2005, 2006, 2005, 2006]

# group keys can be any array of the right length.
df['data1'].groupby([states, years]).mean().unstack()

In [None]:
# Pass column names to use the column as the group keys

df.groupby('key1').mean()

df.groupby(['key1', 'key2']).mean().unstack()

In [None]:
df.groupby(['key1', 'key2']).mean()

Use the `GroupBy.size` method to return a Series containing group sizes. Any missing values in a group key are excluded from the result by default. This behavior can be disabled by passing `dropna=False` to `groupby`.

In [None]:
df.groupby(['key1', 'key2'], dropna=False).size().unstack()

In [None]:
df

In [None]:
df.groupby('key1').count()

In [None]:
df.groupby('key1', dropna=False).size()
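Because `df` above is built from random numbers, the difference between `size`, `count`, and `dropna=False` is easier to check on a small fixed frame. The frame `toy` below is hypothetical and exists only for this sketch.

```python
import pandas as pd

# Hypothetical data with one missing key and one missing value
toy = pd.DataFrame({"key": ["a", "a", None, "b"],
                    "val": [1.0, None, 3.0, 4.0]})

# size counts rows per group; rows whose key is missing are dropped by default
sizes = toy.groupby("key").size()

# dropna=False keeps the missing key as a group of its own
sizes_with_na = toy.groupby("key", dropna=False).size()

# count only counts non-NA values within each group
counts = toy.groupby("key")["val"].count()

print(sizes)
print(sizes_with_na)
print(counts)
```

Group `a` has two rows but only one non-missing `val`, so `size` and `count` disagree for it.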

## Iterating over Groups

The object returned by `groupby` supports iteration, generating a sequence of 2-tuples containing the group name along with the chunk of data.

In [None]:
for name, group in df.groupby('key1'):
	print(name)
	print(group)

# In the case of multiple keys, the first element in the tuple will be a tuple of key values
for (k1, k2), group in df.groupby(['key1', 'key2']):
	print((k1, k2))
	print(group)

In [None]:

# Computing a dictionary of the data pieces as a one-liner
pieces = {name: group for name, group in df.groupby("key1")}

pieces['b']


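By default `groupby` groups on the rows (`axis="index"`), but you can group on any of the other axes.

For example, group the columns of `df` by whether their names start with 'key' or 'data':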



In [None]:
grouped = df.groupby(
    {"key1": "key", "key2": "key", "data1": "data", "data2": "data"}, axis="columns"
)

for group_key, group_val in grouped:
	print(group_key)
	print(group_val)

In [None]:
df[['data1','data2']]

In [None]:
df[['key1', 'key2']]
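Recent pandas releases (2.1 and later, to the best of my knowledge) deprecate the `axis` argument of `groupby`, so `axis="columns"` may emit a warning. One way to get the same column-wise grouping is to transpose, group the rows, and transpose back. The frame `toy` is a hypothetical stand-in with numeric columns only, so that a sum is well defined.

```python
import pandas as pd

# Hypothetical numeric frame with the same column names as df
toy = pd.DataFrame({"key1": [1, 2], "key2": [3, 4],
                    "data1": [0.5, 1.5], "data2": [2.5, 3.5]})

mapping = {"key1": "key", "key2": "key", "data1": "data", "data2": "data"}

# Transpose so that the column labels become the row index,
# group those labels through the mapping, then transpose back
col_sums = toy.T.groupby(mapping).sum().T

print(col_sums)
```

Each row of `col_sums` holds the sum over the "data" columns and over the "key" columns of the matching row in `toy`.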

## Selecting a Column or Subset of Columns

Indexing a GroupBy object created from a DataFrame with a column name or an array of column names has the effect of column subsetting for aggregation. The first two cells below are equivalent.

In [10]:
df.groupby('key1')['data1'].head()

0   -0.980510
1    0.753196
3   -0.031481
4    0.262018
5   -1.138268
Name: data1, dtype: float64

In [9]:
df['data1'].groupby(df['key1']).head()

0   -0.980510
1    0.753196
3   -0.031481
4    0.262018
5   -1.138268
Name: data1, dtype: float64

In [11]:
# To aggregate only a few columns, select them before aggregating
# Here, compute means for the data2 column only
df.groupby(['key1', 'key2'])[['data2']].mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,data2
key1,key2,Unnamed: 2_level_1
a,1,0.092148
a,2,-1.514385
b,1,0.534622
b,2,-0.924592
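The shape of the result depends on how the column is selected: a single name gives a Series, while a list of names gives a DataFrame. A small sketch on a hypothetical frame `toy`:

```python
import pandas as pd

# Hypothetical data for illustration
toy = pd.DataFrame({"key": ["a", "a", "b"],
                    "x": [1.0, 3.0, 5.0],
                    "y": [10.0, 20.0, 30.0]})

# A scalar column name yields a grouped Series
as_series = toy.groupby("key")["x"].mean()

# A list of column names yields a grouped DataFrame, even with one column
as_frame = toy.groupby("key")[["x"]].mean()

print(type(as_series).__name__)
print(type(as_frame).__name__)
```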


## Grouping with Dictionaries and Series

In [16]:
people = pd.DataFrame(
    np.random.standard_normal((5, 5)),
    columns=["a", "b", "c", "d", "e"],
    index=["Joe", "Steve", "Wanda", "Jill", "Trey"],
)

people.iloc[2:3, [1,2]] = np.nan


In [17]:
people

Unnamed: 0,a,b,c,d,e
Joe,-0.472598,3.187343,-0.502007,-1.884353,-1.207755
Steve,0.810863,-0.416002,1.704311,-1.977146,-1.335709
Wanda,-1.390674,,,-1.774004,0.70221
Jill,-1.308237,-0.967369,0.231921,0.355734,0.373386
Trey,0.236129,-0.02646,-0.260725,0.394199,-0.552764


In [23]:
mapping = {"a": "red", "b": "red", "c": "blue", "d": "blue", "e": "red", "f": "orange"}

by_column = people.groupby(mapping, axis="columns")

by_column.count()

Unnamed: 0,blue,red
Joe,2,3
Steve,2,3
Wanda,1,2
Jill,2,3
Trey,2,3


In [24]:
by_column.sum()

Unnamed: 0,blue,red
Joe,-2.38636,1.50699
Steve,-0.272835,-0.940848
Wanda,-1.774004,-0.688465
Jill,0.587655,-1.90222
Trey,0.133474,-0.343095


In [25]:
by_column.head(999)

Unnamed: 0,a,b,c,d,e
Joe,-0.472598,3.187343,-0.502007,-1.884353,-1.207755
Steve,0.810863,-0.416002,1.704311,-1.977146,-1.335709
Wanda,-1.390674,,,-1.774004,0.70221
Jill,-1.308237,-0.967369,0.231921,0.355734,0.373386
Trey,0.236129,-0.02646,-0.260725,0.394199,-0.552764


In [26]:
map_series = pd.Series(mapping)

people.groupby(map_series, axis="columns").count()

Unnamed: 0,blue,red
Joe,2,3
Steve,2,3
Wanda,1,2
Jill,2,3
Trey,2,3


## Grouping with Functions
Any function passed as a group key will be called once per index value, with the return values being used as the group names. 

In [27]:
people.groupby(len).sum()

Unnamed: 0,a,b,c,d,e
3,-0.472598,3.187343,-0.502007,-1.884353,-1.207755
4,-1.072108,-0.993829,-0.028805,0.749933,-0.179378
5,-0.579812,-0.416002,1.704311,-3.751149,-0.633499


In [36]:
key_list = ['one', 'one', 'one', 'two', 'two']

people.groupby([len, key_list]).sum()

Unnamed: 0,Unnamed: 1,a,b,c,d,e
3,one,-0.472598,3.187343,-0.502007,-1.884353,-1.207755
4,two,-1.072108,-0.993829,-0.028805,0.749933,-0.179378
5,one,-0.579812,-0.416002,1.704311,-3.751149,-0.633499


In [37]:

people.groupby([len, key_list]).min()

Unnamed: 0,Unnamed: 1,a,b,c,d,e
3,one,-0.472598,3.187343,-0.502007,-1.884353,-1.207755
4,two,-1.308237,-0.967369,-0.260725,0.355734,-0.552764
5,one,-1.390674,-0.416002,1.704311,-1.977146,-1.335709


## Grouping by Index Levels
With hierarchically indexed data, you can aggregate using one of the levels of an axis index.



In [40]:
columns = pd.MultiIndex.from_arrays(
    [["US", "US", "US", "JP", "JP"], [1, 3, 5, 1, 3]], names=["city", "tenor"]
)

hier_df = pd.DataFrame(np.random.standard_normal((4, 5)), columns=columns)

hier_df


city,US,US,US,JP,JP
tenor,1,3,5,1,3
0,0.06568,2.017971,0.631632,-2.054549,0.324105
1,-0.097418,-0.249019,-0.903861,0.965828,0.481348
2,0.072769,0.019059,-1.197369,0.216974,0.442749
3,-0.202678,-1.080596,1.796581,-0.040351,-1.846178


In [42]:
# To group by level, pass the level number or name using level keyword
hier_df.groupby(level="city", axis='columns').count()

city,JP,US
0,2,3
1,2,3
2,2,3
3,2,3


# 10.2 Data Aggregation

Aggregation refers to any data transformation that produces scalar values from arrays.

Optimized groupby methods:

| Function Name | Description |
| - | - |
| any, all | Return True if any (one or more values) or all non-NA values are "truthy" |
| count | Number of non-NA values |
| cummin, cummax | Cumulative minimum and maximum of non-NA values |
| cumsum | Cumulative sum of non-NA values |
| cumprod | Cumulative product of non-NA values |
| first, last | First and last non-NA values |
| mean | Mean of non-NA values |
| median | Arithmetic median of non-NA values |
| min, max | Minimum and maximum of non-NA values |
| nth | Retrieve the nth row from each group |
| ohlc | Compute four "open-high-low-close" statistics for time series-like data |
| prod | Product of non-NA values |
| quantile | Compute sample quantile |
| rank | Ordinal ranks of non-NA values, like calling Series.rank |
| size | Compute group sizes, returning result as a Series |
| std, var | Sample standard deviation and variance |
 

To use your own aggregation functions, pass any function that aggregates an array to the `aggregate` method or its short alias `agg`:



In [13]:
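def peak_to_peak(arr):
	return arr.max() - arr.min()

# Note: `grouped` here is tips.groupby(['day', 'smoker']), created in the next section
# (this cell was executed after it, as the execution counts and the output show)
# On recent pandas versions you may need to select the numeric columns first
grouped.agg(peak_to_peak)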


Unnamed: 0_level_0,Unnamed: 1_level_0,total_bill,tip,size,tip_pct
day,smoker,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Fri,No,10.29,2.0,1,0.067349
Fri,Yes,34.42,3.73,3,0.159925
Sat,No,41.08,8.0,3,0.235193
Sat,Yes,47.74,9.0,4,0.290095
Sun,No,39.4,4.99,4,0.193226
Sun,Yes,38.1,5.0,3,0.644685
Thur,No,33.68,5.45,5,0.19335
Thur,Yes,32.77,3.0,2,0.15124
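Since the cell above depends on `tips` and on execution order, here is a self-contained sketch of a custom aggregation on a hypothetical frame `toy`. It passes the same `peak_to_peak` function to `agg`.

```python
import pandas as pd

# Hypothetical data for illustration
toy = pd.DataFrame({"key": ["a", "a", "b", "b"],
                    "val": [1.0, 4.0, 10.0, 12.0]})

def peak_to_peak(arr):
    # Range of the values in one group: largest minus smallest
    return arr.max() - arr.min()

# agg calls peak_to_peak once per group and collects the scalar results
spread = toy.groupby("key")["val"].agg(peak_to_peak)

print(spread)
```

Custom functions like this are generally slower than the optimized methods in the table above, because they are called once per group in Python.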


## Column-wise and multiple function application 

In [2]:
tips = pd.read_csv('./datasets/tips.csv')

tips.head()


Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [4]:
tips['tip_pct'] = tips['tip'] / tips['total_bill']

tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,tip_pct
0,16.99,1.01,Female,No,Sun,Dinner,2,0.059447
1,10.34,1.66,Male,No,Sun,Dinner,3,0.160542
2,21.01,3.5,Male,No,Sun,Dinner,3,0.166587
3,23.68,3.31,Male,No,Sun,Dinner,2,0.13978
4,24.59,3.61,Female,No,Sun,Dinner,4,0.146808


In [5]:
grouped = tips.groupby(['day', 'smoker'])

In [7]:
grouped_pct = grouped['tip_pct']

In [10]:
grouped_pct.agg('mean')


day   smoker
Fri   No        0.151650
      Yes       0.174783
Sat   No        0.158048
      Yes       0.147906
Sun   No        0.160113
      Yes       0.187250
Thur  No        0.160298
      Yes       0.163863
Name: tip_pct, dtype: float64

In [17]:
grouped_pct.agg(['mean', 'std', peak_to_peak])

Unnamed: 0_level_0,Unnamed: 1_level_0,mean,std,peak_to_peak
day,smoker,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Fri,No,0.15165,0.028123,0.067349
Fri,Yes,0.174783,0.051293,0.159925
Sat,No,0.158048,0.039767,0.235193
Sat,Yes,0.147906,0.061375,0.290095
Sun,No,0.160113,0.042347,0.193226
Sun,Yes,0.18725,0.154134,0.644685
Thur,No,0.160298,0.038774,0.19335
Thur,Yes,0.163863,0.039389,0.15124


In [20]:
# Pass a list of (name, function) tuples, the first element of each tuple will be used as DataFrame column name
grouped_pct.agg([('Average', 'mean'), ('stdev', np.std)])

Unnamed: 0_level_0,Unnamed: 1_level_0,Average,stdev
day,smoker,Unnamed: 2_level_1,Unnamed: 3_level_1
Fri,No,0.15165,0.028123
Fri,Yes,0.174783,0.051293
Sat,No,0.158048,0.039767
Sat,Yes,0.147906,0.061375
Sun,No,0.160113,0.042347
Sun,Yes,0.18725,0.154134
Thur,No,0.160298,0.038774
Thur,Yes,0.163863,0.039389
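The `(name, function)` form can be checked on a small fixed frame. The frame `toy` and the column names `average` and `spread` are hypothetical choices for this sketch.

```python
import pandas as pd

# Hypothetical data for illustration
toy = pd.DataFrame({"key": ["a", "a", "b", "b"],
                    "val": [1.0, 4.0, 10.0, 12.0]})

def peak_to_peak(arr):
    return arr.max() - arr.min()

# Each tuple is (output column name, function or function name)
named = toy.groupby("key")["val"].agg(
    [("average", "mean"), ("spread", peak_to_peak)]
)

print(named)
```

Without the tuples, the columns would be named after the functions themselves (`mean` and `peak_to_peak`).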


In [22]:
# Specify a list of functions to apply to all of the columns or different functions per column in DataFrame
functions = ['count', 'mean', 'max']

result = grouped[['tip_pct', 'total_bill']].agg(functions)

result

Unnamed: 0_level_0,Unnamed: 1_level_0,tip_pct,tip_pct,tip_pct,total_bill,total_bill,total_bill
Unnamed: 0_level_1,Unnamed: 1_level_1,count,mean,max,count,mean,max
day,smoker,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
Fri,No,4,0.15165,0.187735,4,18.42,22.75
Fri,Yes,15,0.174783,0.26348,15,16.813333,40.17
Sat,No,45,0.158048,0.29199,45,19.661778,48.33
Sat,Yes,42,0.147906,0.325733,42,21.276667,50.81
Sun,No,57,0.160113,0.252672,57,20.506667,48.17
Sun,Yes,19,0.18725,0.710345,19,24.12,45.35
Thur,No,45,0.160298,0.266312,45,17.113111,41.19
Thur,Yes,17,0.163863,0.241255,17,19.190588,43.11


In [23]:
# To apply different functions to one or more of the columns, pass a dictionary to agg
# that maps column names to any of the function specifications
# Pass multiple functions to one column by using a list
grouped.agg({"tip_pct": ["min", "max", "mean"]})

# A different function per column (this is the result displayed below)
grouped.agg({"tip": np.max, "size": "sum"})


Unnamed: 0_level_0,Unnamed: 1_level_0,tip,size
day,smoker,Unnamed: 2_level_1,Unnamed: 3_level_1
Fri,No,3.5,9
Fri,Yes,4.73,31
Sat,No,9.0,115
Sat,Yes,10.0,104
Sun,No,6.0,167
Sun,Yes,6.5,49
Thur,No,6.7,112
Thur,Yes,5.0,40
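A small sketch of how the dictionary form shapes the result, using a hypothetical frame `toy`. With one function per column the result has flat columns; as soon as any column gets a list of functions, the columns become hierarchical.

```python
import pandas as pd

# Hypothetical data for illustration
toy = pd.DataFrame({"key": ["a", "a", "b"],
                    "x": [1.0, 3.0, 5.0],
                    "y": [10.0, 20.0, 30.0]})

# One function per column: flat column index
flat_cols = toy.groupby("key").agg({"x": "max", "y": "sum"})

# A list of functions for at least one column: hierarchical column index
multi_cols = toy.groupby("key").agg({"x": ["min", "max"], "y": "sum"})

print(flat_cols)
print(multi_cols)
```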


In [24]:
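# as_index=False returns the group keys as regular columns instead of as an index
# numeric_only=True skips the string columns (sex, time), which newer pandas versions no longer drop silently
tips.groupby(['day', 'smoker'], as_index=False).mean(numeric_only=True)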

Unnamed: 0,day,smoker,total_bill,tip,size,tip_pct
0,Fri,No,18.42,2.8125,2.25,0.15165
1,Fri,Yes,16.813333,2.714,2.066667,0.174783
2,Sat,No,19.661778,3.102889,2.555556,0.158048
3,Sat,Yes,21.276667,2.875476,2.47619,0.147906
4,Sun,No,20.506667,3.167895,2.929825,0.160113
5,Sun,Yes,24.12,3.516842,2.578947,0.18725
6,Thur,No,17.113111,2.673778,2.488889,0.160298
7,Thur,Yes,19.190588,3.03,2.352941,0.163863
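The same flat layout can also be obtained by calling `reset_index` on a regular grouped result. A minimal sketch on a hypothetical frame `toy`:

```python
import pandas as pd

# Hypothetical data for illustration
toy = pd.DataFrame({"key": ["a", "a", "b"], "x": [1.0, 3.0, 5.0]})

# Group keys come back as an ordinary column
flat = toy.groupby("key", as_index=False).mean()

# Equivalent result: aggregate with the keys in the index, then move them to a column
via_reset = toy.groupby("key").mean().reset_index()

print(flat)
```

Passing `as_index=False` skips the intermediate indexed result, which can matter on large data.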


# 10.3 Apply: General split-apply-combine

The most general-purpose GroupBy method is `apply`.
`apply` splits the object being manipulated into pieces, invokes the passed function on each piece, and then attempts to concatenate the pieces.



In [27]:
# A function that selects the rows with the largest values in a particular column
def top(df, n=5, column="tip_pct"):
	return df.sort_values(column, ascending=False)[:n]

top(tips, n=6)

# The top function will be applied to each smoker group
# The result has a hierarchical index with an inner level that contains index values from the original DataFrame
tips.groupby('smoker').apply(top)

# Extra positional or keyword arguments given to apply are passed on to the function
# The call below returns the row with the highest total bill for each smoker/day combination
tips.groupby(['smoker', 'day']).apply(top, n=1, column='total_bill')

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,total_bill,tip,sex,smoker,day,time,size,tip_pct
smoker,day,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
No,Fri,94,22.75,3.25,Female,No,Fri,Dinner,2,0.142857
No,Sat,212,48.33,9.0,Male,No,Sat,Dinner,4,0.18622
No,Sun,156,48.17,5.0,Male,No,Sun,Dinner,6,0.103799
No,Thur,142,41.19,5.0,Male,No,Thur,Lunch,5,0.121389
Yes,Fri,95,40.17,4.73,Male,Yes,Fri,Dinner,4,0.11775
Yes,Sat,170,50.81,10.0,Male,Yes,Sat,Dinner,3,0.196812
Yes,Sun,182,45.35,3.5,Male,Yes,Sun,Dinner,3,0.077178
Yes,Thur,197,43.11,5.0,Female,Yes,Thur,Lunch,4,0.115982
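To see how `apply` forwards extra arguments and how the pieces are glued back together, here is a sketch on a hypothetical frame `toy`. The value column is selected explicitly before `apply`, which is an assumption made here so that the function only sees the data column regardless of pandas version.

```python
import pandas as pd

# Hypothetical data for illustration
toy = pd.DataFrame({"key": ["a", "a", "a", "b", "b"],
                    "val": [1.0, 5.0, 3.0, 2.0, 8.0]})

def top(df, n=5, column="val"):
    # Rows with the n largest values in the given column
    return df.sort_values(column, ascending=False)[:n]

# n and column are passed through apply to top for every group
best = toy.groupby("key")[["val"]].apply(top, n=1, column="val")

print(best)
```

The result index has two levels: the group key and the original row label of the selected row.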


In [31]:
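result = tips.groupby('smoker')['tip_pct'].describe()
result
# The output of this expression is the second block shown below
result.unstack('smoker')

# Inside GroupBy, invoking a method like describe is a shortcut for:
def f(group):
	return group.describe()

# The output of this expression is the first block shown below
grouped.apply(f)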

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,total_bill,tip,size,tip_pct
day,smoker,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Fri,No,count,4.000000,4.000000,4.00,4.000000
Fri,No,mean,18.420000,2.812500,2.25,0.151650
Fri,No,std,5.059282,0.898494,0.50,0.028123
Fri,No,min,12.460000,1.500000,2.00,0.120385
Fri,No,25%,15.100000,2.625000,2.00,0.137239
...,...,...,...,...,...,...
Thur,Yes,min,10.340000,2.000000,2.00,0.090014
Thur,Yes,25%,13.510000,2.000000,2.00,0.148038
Thur,Yes,50%,16.470000,2.560000,2.00,0.153846
Thur,Yes,75%,19.810000,4.000000,2.00,0.194837


       smoker
count  No        151.000000
       Yes        93.000000
mean   No          0.159328
       Yes         0.163196
std    No          0.039910
       Yes         0.085119
min    No          0.056797
       Yes         0.035638
25%    No          0.136906
       Yes         0.106771
50%    No          0.155625
       Yes         0.153846
75%    No          0.185014
       Yes         0.195059
max    No          0.291990
       Yes         0.710345
dtype: float64

## Suppressing the Group Keys

In the preceding examples, the resulting object has a hierarchical index formed from the group keys along with the index of each piece of the original object. Pass `group_keys=False` to `groupby` to disable this.


In [32]:
tips.groupby('smoker').apply(top)

Unnamed: 0_level_0,Unnamed: 1_level_0,total_bill,tip,sex,smoker,day,time,size,tip_pct
smoker,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
No,232,11.61,3.39,Male,No,Sat,Dinner,2,0.29199
No,149,7.51,2.0,Male,No,Thur,Lunch,2,0.266312
No,51,10.29,2.6,Female,No,Sun,Dinner,2,0.252672
No,185,20.69,5.0,Male,No,Sun,Dinner,5,0.241663
No,88,24.71,5.85,Male,No,Thur,Lunch,2,0.236746
Yes,172,7.25,5.15,Male,Yes,Sun,Dinner,2,0.710345
Yes,178,9.6,4.0,Female,Yes,Sun,Dinner,2,0.416667
Yes,67,3.07,1.0,Female,Yes,Sat,Dinner,1,0.325733
Yes,183,23.17,6.5,Male,Yes,Sun,Dinner,4,0.280535
Yes,109,14.31,4.0,Female,Yes,Sat,Dinner,2,0.279525


In [33]:
tips.groupby('smoker', group_keys=False).apply(top)

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,tip_pct
232,11.61,3.39,Male,No,Sat,Dinner,2,0.29199
149,7.51,2.0,Male,No,Thur,Lunch,2,0.266312
51,10.29,2.6,Female,No,Sun,Dinner,2,0.252672
185,20.69,5.0,Male,No,Sun,Dinner,5,0.241663
88,24.71,5.85,Male,No,Thur,Lunch,2,0.236746
172,7.25,5.15,Male,Yes,Sun,Dinner,2,0.710345
178,9.6,4.0,Female,Yes,Sun,Dinner,2,0.416667
67,3.07,1.0,Female,Yes,Sat,Dinner,1,0.325733
183,23.17,6.5,Male,Yes,Sun,Dinner,4,0.280535
109,14.31,4.0,Female,Yes,Sat,Dinner,2,0.279525
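The effect of `group_keys` on the result index can be verified on a small hypothetical frame `toy`. As a precaution for this sketch, the value column is selected explicitly before calling `apply`.

```python
import pandas as pd

# Hypothetical data for illustration
toy = pd.DataFrame({"key": ["a", "a", "a", "b", "b"],
                    "val": [1.0, 5.0, 3.0, 2.0, 8.0]})

def top(df, n=1, column="val"):
    return df.sort_values(column, ascending=False)[:n]

# Default: the group key is added as an outer index level
with_keys = toy.groupby("key")[["val"]].apply(top)

# group_keys=False: only the original row labels remain
without_keys = toy.groupby("key", group_keys=False)[["val"]].apply(top)

print(with_keys.index.nlevels)
print(list(without_keys.index))
```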


## Quantile and Bucket Analysis
`pandas.cut` and `pandas.qcut` slice data up into buckets, with bins of your choosing or by sample quantiles.

In [42]:
# Sample random dataset and an equal-length bucket categorization using pandas.cut
frame = pd.DataFrame(
    {"data1": np.random.standard_normal(1000), "data2": np.random.standard_normal(1000)}
)

frame.head()

quartiles = pd.cut(frame['data1'], 4)

# The Categorical object returned by cut can be passed directly to groupby,
# so we can compute a set of group statistics for the quartiles.

In [43]:
quartiles

0      (0.226, 1.753]
1       (-1.3, 0.226]
2      (0.226, 1.753]
3       (-1.3, 0.226]
4      (0.226, 1.753]
            ...      
995    (0.226, 1.753]
996     (-1.3, 0.226]
997     (-1.3, 0.226]
998    (1.753, 3.279]
999     (-1.3, 0.226]
Name: data1, Length: 1000, dtype: category
Categories (4, interval[float64, right]): [(-2.833, -1.3] < (-1.3, 0.226] < (0.226, 1.753] < (1.753, 3.279]]
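Note that `pandas.cut` with an integer produces bins of equal width, so the bucket sizes can be very uneven, as the counts further down show. `pandas.qcut` instead picks bin edges from sample quantiles, which gives buckets of roughly equal size. A sketch with a hypothetical, deliberately skewed Series:

```python
import pandas as pd

# Hypothetical skewed data: seven small values and one outlier
values = pd.Series([0, 1, 2, 3, 4, 5, 6, 100])

equal_width = pd.cut(values, 4)    # four bins of equal width across the value range
equal_count = pd.qcut(values, 4)   # four bins with edges at the sample quartiles

# observed=False keeps empty bins in the result
width_counts = values.groupby(equal_width, observed=False).count()
quantile_counts = values.groupby(equal_count, observed=False).count()

print(width_counts)
print(quantile_counts)
```

With equal-width bins, almost everything lands in the first bucket; with quantile bins, each bucket holds two values.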

In [45]:
def get_stats(group):
    return pd.DataFrame(
        {
            "min": group.min(),
            "max": group.max(),
            "count": group.count(),
            "mean": group.mean(),
        }
    )

grouped = frame.groupby(quartiles)

grouped.apply(get_stats)


Unnamed: 0_level_0,Unnamed: 1_level_0,min,max,count,mean
data1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
"(-2.833, -1.3]",data1,-2.826442,-1.312433,75,-1.791765
"(-2.833, -1.3]",data2,-2.27962,1.793336,75,-0.005226
"(-1.3, 0.226]",data1,-1.287826,0.225605,476,-0.397329
"(-1.3, 0.226]",data2,-3.035522,3.168642,476,-0.021084
"(0.226, 1.753]",data1,0.229181,1.742931,397,0.857498
"(0.226, 1.753]",data2,-3.632431,2.725002,397,-0.011502
"(1.753, 3.279]",data1,1.757662,3.278974,52,2.128956
"(1.753, 3.279]",data2,-1.826796,2.424735,52,-0.104719


## Example: Filling Missing Values with Group-Specific Values
Suppose you need the fill value to vary by group.
One way to do this is to group the data and use `apply` with a function that calls `fillna` on each data chunk.

In [46]:
states = ['Ohio', 'New York', 'Vermont', 'Florida', 'Oregon', 'Nevada', 'California', 'Idaho']

group_key = ['East', 'East', 'East', 'East', 'West', 'West', 'West', 'West']

data = pd.Series(np.random.standard_normal(8), index=states)

In [47]:
data

Ohio         -0.045672
New York     -1.684287
Vermont      -0.115937
Florida       0.858967
Oregon        0.608497
Nevada        2.437919
California   -0.897093
Idaho        -0.930766
dtype: float64

In [48]:
# Set some value in the data to be missing
data[['Vermont', 'Nevada', "Idaho"]] = np.nan

In [49]:
data.groupby(group_key).size()

East    4
West    4
dtype: int64

In [51]:
data.groupby(group_key).count()

East    3
West    2
dtype: int64

In [53]:
def fill_mean(group):
	return group.fillna(group.mean())

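# group_keys=False keeps the original flat index on pandas 2.0 and later,
# where apply would otherwise add the group key as an extra index level
data.groupby(group_key, group_keys=False).apply(fill_mean)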

Ohio         -0.045672
New York     -1.684287
Vermont      -0.290331
Florida       0.858967
Oregon        0.608497
Nevada       -0.144298
California   -0.897093
Idaho        -0.144298
dtype: float64
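Because `data` above is random, the filled values cannot be checked by eye. The sketch below repeats the idea with fixed numbers; the Series `toy` and the list `regions` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing value in each region
toy = pd.Series([1.0, 3.0, np.nan, 10.0, np.nan],
                index=["Ohio", "New York", "Vermont", "Oregon", "Nevada"])
regions = ["East", "East", "East", "West", "West"]

def fill_mean(group):
    # Replace missing values with the mean of the non-missing values in the group
    return group.fillna(group.mean())

# group_keys=False keeps the state names as the only index level
filled = toy.groupby(regions, group_keys=False).apply(fill_mean)

print(filled)
```

Vermont receives the East mean of 1.0 and 3.0, and Nevada receives the only West value.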

In [54]:
fill_values = {'East':0.5, 'West': -1}

def fill_func(group):
	return group.fillna(fill_values[group.name])