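# Aggregation and Grouping
An essential part of analyzing large data is efficient summarization: computing aggregations such as sum(), mean(), median(), min(), and max(), each of which captures one characteristic of a potentially large dataset. In this section we explore aggregation in Pandas, from simple operations like those we saw earlier on NumPy arrays to more sophisticated operations built on groupby.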

## GroupBy: Split, Apply, Combine
![](https://s2.ax1x.com/2020/02/20/3ZJTxg.png)
As a concrete example, we use Pandas to carry out the computation shown in the figure above, starting by creating the input DataFrame:
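
The input frame itself is not listed in this section. A minimal sketch, assuming the figure uses a small table with a `key` column of repeated labels and a numeric `data` column (the name `df_example`, the column names, and all values are illustrative assumptions, not taken from the figure):

```python
import pandas as pd

# Hypothetical input: column names and values are assumptions for illustration
df_example = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                           'data': range(6)},
                          columns=['key', 'data'])
print(df_example)
```

Each label appears twice, so there is something to combine once the rows are split by `key`.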

Most common split-apply-combine operations can be computed with the DataFrame's groupby() method, passing in the name of the column to group by:
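
A sketch of that call on a hypothetical six-row frame (the frame, its column names, and the variable names are assumptions for illustration). `groupby('key')` returns a `DataFrameGroupBy` object; nothing is computed until an aggregation such as `sum()` is applied to it:

```python
import pandas as pd

# Hypothetical input frame, not the author's data
df_example = pd.DataFrame({'key': list('ABCABC'), 'data': range(6)})

grouped = df_example.groupby('key')  # split: one group per distinct key
totals = grouped.sum()               # apply sum() per group, combine into one frame
print(totals)
```

The result is indexed by the distinct values of `key`, with one aggregated row per group.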

### Importing the data

In [7]:
import pandas as pd
import numpy as np

In [9]:
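# Raw string so the backslashes in the Windows path are not read as escape sequences
df = pd.read_csv(r'E:\code_studying\DL_exercise\MachineLearning\pandas\planets.csv')
# Drop the last row if every field in it is missing
if df.iloc[-1].isnull().all():
    df = df.iloc[:-1]
planets = df
planets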

| | method | number | orbital_period | mass | distance | year |
|---|---|---|---|---|---|---|
| 0 | Radial Velocity | 1 | 269.300000 | 7.10 | 77.40 | 2006 |
| 1 | Radial Velocity | 1 | 874.774000 | 2.21 | 56.95 | 2008 |
| 2 | Radial Velocity | 1 | 763.000000 | 2.60 | 19.84 | 2011 |
| 3 | Radial Velocity | 1 | 326.030000 | 19.40 | 110.62 | 2007 |
| 4 | Radial Velocity | 1 | 516.220000 | 10.50 | 119.47 | 2009 |
| ... | ... | ... | ... | ... | ... | ... |
| 1030 | Transit | 1 | 3.941507 | NaN | 172.00 | 2006 |
| 1031 | Transit | 1 | 2.615864 | NaN | 148.00 | 2007 |
| 1032 | Transit | 1 | 3.191524 | NaN | 174.00 | 2007 |
| 1033 | Transit | 1 | 4.125083 | NaN | 293.00 | 2008 |


### Simple aggregation

In [10]:
rng = np.random.RandomState(42)
ser = pd.Series(rng.rand(5))
ser

0    0.374540
1    0.950714
2    0.731994
3    0.598658
4    0.156019
dtype: float64

In [12]:
ser.sum()

2.811925491708157

In [13]:
ser.mean()

0.5623850983416314

In [14]:
df = pd.DataFrame({'A': rng.rand(5),
                   'B': rng.rand(5)})
df

| | A | B |
|---|---|---|
| 0 | 0.020584 | 0.183405 |
| 1 | 0.96991 | 0.304242 |
| 2 | 0.832443 | 0.524756 |
| 3 | 0.212339 | 0.431945 |
| 4 | 0.181825 | 0.291229 |


In [15]:
df.mean()

A    0.443420
B    0.347115
dtype: float64

In [16]:
df.mean(axis='columns')

0    0.101995
1    0.637076
2    0.678600
3    0.322142
4    0.236527
dtype: float64
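
The other aggregations named in the introduction follow the same pattern: by default they reduce each column, and `axis='columns'` reduces each row instead. A minimal sketch on a freshly seeded frame rather than the exact values above (the seed and the names `frame`, `summary`, `col_median`, and `row_max` are assumptions for illustration), which also uses `agg` to compute several aggregations in one call:

```python
import numpy as np
import pandas as pd

# Hypothetical frame; the values differ from the ones shown above
rng = np.random.RandomState(0)
frame = pd.DataFrame({'A': rng.rand(5), 'B': rng.rand(5)})

col_median = frame.median()          # one value per column
row_max = frame.max(axis='columns')  # one value per row

# Several aggregations at once: one result row per function name
summary = frame.agg(['sum', 'mean', 'median', 'min', 'max'])
print(summary)
```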

For the planets data loaded earlier, we first find the rows that contain missing values:

In [42]:
# any(axis='columns') already returns a boolean mask, so no comparison with True is needed
planets[planets.isnull().any(axis='columns')]

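| | method | number | orbital_period | mass | distance | year |
|---|---|---|---|---|---|---|
| 7 | Radial Velocity | 1 | 798.500000 | NaN | 21.41 | 1996 |
| 20 | Radial Velocity | 5 | 0.736540 | NaN | 12.53 | 2011 |
| 25 | Radial Velocity | 1 | 116.688400 | NaN | 18.11 | 1996 |
| 26 | Radial Velocity | 1 | 691.900000 | NaN | 81.50 | 2012 |
| 29 | Imaging | 1 | NaN | NaN | 45.52 | 2005 |
| ... | ... | ... | ... | ... | ... | ... |
| 1030 | Transit | 1 | 3.941507 | NaN | 172.00 | 2006 |
| 1031 | Transit | 1 | 2.615864 | NaN | 148.00 | 2007 |
| 1032 | Transit | 1 | 3.191524 | NaN | 174.00 | 2007 |
| 1033 | Transit | 1 | 4.125083 | NaN | 293.00 | 2008 |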


In [47]:
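# Note: len() of a boolean mask counts every entry, True or False, so this is the
# total number of rows in planets, not the number of rows with missing values
len(planets.isnull().any(axis='columns'))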

1035
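
Because len() returns the length of the mask, the 1035 above is the number of rows in planets, not the number of rows with missing values. A minimal sketch of the difference on a small hypothetical frame (the name `toy` and its contents are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical four-row frame with a few missing entries
toy = pd.DataFrame({'mass': [7.1, np.nan, 2.6, np.nan],
                    'distance': [77.4, 21.41, np.nan, 45.52]})

mask = toy.isnull().any(axis='columns')  # True where a row has any NaN

n_rows = len(mask)                 # length of the mask: every row is counted
n_missing_rows = int(mask.sum())   # True counts as 1, so this counts flagged rows
per_column = toy.isnull().sum()    # number of missing values in each column

print(n_rows, n_missing_rows)
print(per_column)
```

Applied to the planets data, `planets.isnull().any(axis='columns').sum()` would give the count that the cell above was presumably after.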

In [58]:
# Boolean array with 1035 entries, all False, because no zero is greater than 1
narray1 = np.zeros(1035)
narray2 = (narray1 > 1)
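
This cell appears to set up the same check in plain NumPy: an all-False array still has a length of 1035. A short sketch, assuming that is the intent (the names `length`, `n_true`, and `has_any` are introduced here for illustration):

```python
import numpy as np

narray1 = np.zeros(1035)
narray2 = narray1 > 1  # all False

length = len(narray2)                    # number of entries, whatever their value
n_true = int(np.count_nonzero(narray2))  # number of True entries
has_any = bool(narray2.any())            # whether at least one entry is True

print(length, n_true, has_any)
```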