# Summarizing and Computing Descriptive Statistics with pandas
- Descriptive statistics
    - DataFrameObj.sum()
    - DataFrameObj.sum(axis=1)
    - DataFrameObj.mean(axis=1, skipna=False)
    - DataFrameObj.describe()
    - Table of descriptive and summary statistics
- Correlation and covariance
    - corr()
    - cov()
    - corrwith()
- Unique values, value counts, and membership
    - unique()
    - value_counts()
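    - isin(): test membership against a given array of values
    - pd.match(SeriesObj1, SeriesObj2): look up each element of SeriesObj1 in SeriesObj2 and return an array of the matching positions in SeriesObj2 (removed from recent pandas releases; `pd.Index(SeriesObj2).get_indexer(SeriesObj1)` gives the same result)
    - apply(pd.value_counts).fillna(0): example that counts values column by column and fills missing counts with zero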

In [1]:
# coding:utf-8
from pandas import Series, DataFrame
import pandas as pd
import numpy as np

## Descriptive statistics

In [2]:
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])
df

    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3


Calling DataFrame’s `sum` method returns a Series containing column sums:

In [3]:
df.sum()

one    9.25
two   -5.80
dtype: float64

In [4]:
df.sum(axis=1)

a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

In [5]:
df.mean(axis=1, skipna=False)

a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64

Table 5-9. Options for reduction methods

Method | Description
-------|------------
axis | Axis to reduce over: 0 reduces down the rows (one result per column), 1 reduces across the columns (one result per row).
skipna | Exclude missing values, True by default.
level | Reduce grouped by level if the axis is hierarchically indexed (MultiIndex). Removed in pandas 2.0; use `groupby(level=...)` instead.
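As a small sketch of how `axis` and `skipna` interact, the snippet below rebuilds the same `df` and compares the default behavior with `skipna=False`. The `min_count=1` argument is an addition not used in the original text; it is shown here only because it makes `sum` return NaN instead of 0.0 for an all-NaN row.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])

# Default (skipna=True): NaN values are ignored, so row 'a' is the mean
# of its single valid value; row 'c' has no valid values and stays NaN.
row_mean_skip = df.mean(axis=1)

# skipna=False: a single NaN in a row makes that row's result NaN.
row_mean_strict = df.mean(axis=1, skipna=False)

# sum over an all-NaN row is 0.0 by default; min_count=1 requires at
# least one valid value and otherwise yields NaN.
row_sum_default = df.sum(axis=1)
row_sum_min1 = df.sum(axis=1, min_count=1)

print(row_mean_skip)
print(row_mean_strict)
print(row_sum_min1)
```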

Some methods, like `idxmin` and `idxmax`, return indirect statistics like the index value where the minimum or maximum values are attained:

In [6]:
df.idxmax()

one    b
two    d
dtype: object
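The label returned by `idxmax` can be fed straight back into `loc`, which is usually why one asks for it. A minimal sketch on the same data (the variable names are illustrative only):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])

# Label of the row holding the largest value in column 'one'.
best_label = df['one'].idxmax()

# Looking that label up returns the maximum itself.
best_value = df.loc[best_label, 'one']

# idxmin works the same way, column by column.
worst_labels = df.idxmin()

print(best_label, best_value)
print(worst_labels)
```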

Other methods, such as `cumsum`, are accumulations:

In [7]:
df.cumsum()

    one  two
a  1.40  NaN
b  8.50 -4.5
c   NaN  NaN
d  9.25 -5.8
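Accumulations skip NaN by default: the missing position stays NaN, but the running total continues afterwards. The sketch below contrasts that with `skipna=False`, where everything after the first NaN becomes NaN, and also shows `cummax` from the same family.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])

# Default: the NaN at 'c' is left in place and the sum resumes at 'd'.
running = df['one'].cumsum()

# skipna=False: the NaN at 'c' propagates to every later position.
strict = df['one'].cumsum(skipna=False)

# Running maximum, again skipping the missing value.
running_max = df['one'].cummax()

print(running)
print(strict)
print(running_max)
```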


`describe` produces multiple summary statistics in one shot:

In [8]:
df.describe()

            one       two
count  3.000000  2.000000
mean   3.083333 -2.900000
std    3.493685  2.262742
min    0.750000 -4.500000
25%    1.075000 -3.700000
50%    1.400000 -2.900000
75%    4.250000 -2.100000
max    7.100000 -1.300000
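The quartiles reported by `describe` are only a default. As a hedged sketch, the `percentiles` argument (not used in the original text) selects other cut points; the chosen values 0.1 and 0.9 are arbitrary.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])

# Request the 10th and 90th percentiles instead of the quartiles.
summary = df.describe(percentiles=[0.1, 0.9])

# Each percentile row matches the corresponding quantile() call.
q10_one = df['one'].quantile(0.1)

print(summary)
print(q10_one)
```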


On non-numeric data, `describe` produces alternate summary statistics:

In [9]:
obj = pd.Series(['a', 'a', 'b', 'c'] * 4)
obj

0     a
1     a
2     b
3     c
4     a
5     a
6     b
7     c
8     a
9     a
10    b
11    c
12    a
13    a
14    b
15    c
dtype: object

In [10]:
obj.describe()

count     16
unique     3
top        a
freq       8
dtype: object

Table 5-10. Descriptive and summary statistics

Method | Description
-------|------------
count | Number of non-NA values
describe | Compute set of summary statistics for Series or each DataFrame column
min, max | Compute minimum and maximum values
argmin, argmax | Compute index locations (integers) at which minimum or maximum value obtained, respectively
idxmin, idxmax | Compute index values at which minimum or maximum value obtained, respectively
quantile | Compute sample quantile ranging from 0 to 1
sum | Sum of values
mean | Mean of values
median | Arithmetic median (50% quantile) of values
mad | Mean absolute deviation from mean value (removed in pandas 2.0)
prod | Product of all values
var | Sample variance of values
std | Sample standard deviation of values
skew | Sample skewness (3rd moment) of values
kurt | Sample kurtosis (4th moment) of values
cumsum | Cumulative sum of values
cummin, cummax | Cumulative minimum or maximum of values, respectively
cumprod | Cumulative product of values
diff | Compute 1st arithmetic difference (useful for time series)
pct_change | Compute percent changes
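A few of the methods in the table are easiest to understand on a tiny series. The values below are made up for illustration and are not from the original notebook.

```python
import pandas as pd

# Hypothetical price series, chosen so the results are easy to check by hand.
prices = pd.Series([100.0, 110.0, 99.0, 108.9])

# First difference: each value minus the previous one (first entry is NaN).
steps = prices.diff()

# Percent change: the difference divided by the previous value.
changes = prices.pct_change()

# The 0.5 quantile is the median.
middle = prices.quantile(0.5)

print(steps)
print(changes)
print(middle)
```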

## Correlation and covariance
Let’s consider some DataFrames of stock prices and volumes obtained from Yahoo! Finance using the add-on `pandas-datareader` package (imported as `pandas_datareader`):

    pip install pandas-datareader

```py
import pandas_datareader.data as web
all_data = {ticker: web.get_data_yahoo(ticker)
            for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']}

price = pd.DataFrame({ticker: data['Adj Close']
                     for ticker, data in all_data.items()})
volume = pd.DataFrame({ticker: data['Volume']
                      for ticker, data in all_data.items()})

In [217]: returns = price.pct_change()

In [218]: returns.tail()
Out[218]: 
                AAPL      GOOG       IBM      MSFT
Date                                              
2016-10-17 -0.000680  0.001837  0.002072 -0.003483
2016-10-18 -0.000681  0.019616 -0.026168  0.007690
2016-10-19 -0.002979  0.007846  0.003583 -0.002255
2016-10-20 -0.000512 -0.005652  0.001719 -0.004867
2016-10-21 -0.003930  0.003011 -0.012474  0.042096
```

The `corr` method of Series computes the correlation of the overlapping, non-NA, aligned-by-index values in two Series. Relatedly, `cov` computes the covariance:

```py
In [219]: returns.MSFT.corr(returns.IBM)
Out[219]: 0.49976361144151144

In [220]: returns.MSFT.cov(returns.IBM)
Out[220]: 8.8706554797035462e-05
```
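To make "overlapping, non-NA" concrete, the sketch below uses synthetic random data (an assumption; it is not the Yahoo! Finance data above) and checks that `corr` equals the covariance divided by the two standard deviations, provided all three are computed on the rows where both series are present.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for two return series.
rng = np.random.default_rng(0)
x = pd.Series(rng.normal(size=250))
y = 0.5 * x + pd.Series(rng.normal(size=250))

# Knock out a few observations in y to create missing values.
y.iloc[:5] = np.nan

# corr and cov silently drop the rows where either value is missing.
r = x.corr(y)
c = x.cov(y)

# Reproduce the correlation by hand on the overlapping rows only.
valid = x.notna() & y.notna()
manual = x[valid].cov(y[valid]) / (x[valid].std() * y[valid].std())

print(r, manual)
```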

DataFrame’s `corr` and `cov` methods, on the other hand, return a full correlation or covariance matrix as a DataFrame, respectively:

```py
In [221]: returns.corr()
Out[221]: 
          AAPL      GOOG       IBM      MSFT
AAPL  1.000000  0.407919  0.386817  0.389695
GOOG  0.407919  1.000000  0.405099  0.465919
IBM   0.386817  0.405099  1.000000  0.499764
MSFT  0.389695  0.465919  0.499764  1.000000

In [222]: returns.cov()
Out[222]: 
          AAPL      GOOG       IBM      MSFT
AAPL  0.000277  0.000107  0.000078  0.000095
GOOG  0.000107  0.000251  0.000078  0.000108
IBM   0.000078  0.000078  0.000146  0.000089
MSFT  0.000095  0.000108  0.000089  0.000215
```

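Using DataFrame’s `corrwith` method, you can compute pairwise correlations between a DataFrame’s columns or rows with another Series or DataFrame. Passing a Series returns a Series with the correlation value computed for each column. Passing a DataFrame instead computes the correlations of matching column names, as in the second call, which correlates percent changes with volume: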

```py
In [223]: returns.corrwith(returns.IBM)
Out[223]: 
AAPL    0.386817
GOOG    0.405099
IBM     1.000000
MSFT    0.499764
dtype: float64

In [224]: returns.corrwith(volume)
Out[224]: 
AAPL   -0.075565
GOOG   -0.007067
IBM    -0.204849
MSFT   -0.092950
dtype: float64
```
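The same two call patterns can be tried without a network connection. The frames below are synthetic and the ticker-like column names are placeholders, so the numbers carry no meaning; the point is only how `corrwith` matches columns.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Placeholder "returns" and "volume" frames sharing two column names.
returns = pd.DataFrame(rng.normal(size=(100, 3)),
                       columns=['AAA', 'BBB', 'CCC'])
volume = pd.DataFrame(rng.normal(size=(100, 3)),
                      columns=['AAA', 'BBB', 'DDD'])

# Series argument: every column is correlated with that one Series.
by_column = returns.corrwith(returns['AAA'])

# DataFrame argument: columns are paired by name; drop=True keeps only
# the names present in both frames.
matched = returns.corrwith(volume, drop=True)

print(by_column)
print(matched)
```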

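## Unique values, value counts, and membership
Another class of related methods extracts information about the values contained in a one-dimensional Series. The first is `unique`, which gives you an array of the unique values in a Series: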

In [25]:
obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
uniques = obj.unique()
uniques

array(['c', 'a', 'd', 'b'], dtype=object)

The unique values are not necessarily returned in sorted order, but could be sorted after the fact if needed (`uniques.sort()`). Relatedly, `value_counts` computes a Series containing value frequencies:

In [26]:
obj.value_counts()

c    3
a    3
b    2
d    1
dtype: int64

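`value_counts` is also available as a top-level pandas function that accepts any array or sequence (deprecated in recent pandas releases in favor of the `Series.value_counts` method):

In [27]: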
pd.value_counts(obj.values, sort=False)

a    3
c    3
b    2
d    1
dtype: int64
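Two common follow-ups are turning counts into proportions and ordering the result by label rather than by frequency. A minimal sketch on the same `obj`, using the `normalize` argument and `sort_index` (neither appears in the original text):

```python
import pandas as pd

obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

# Relative frequencies instead of raw counts.
shares = obj.value_counts(normalize=True)

# Counts ordered by the values themselves rather than by frequency.
by_label = obj.value_counts().sort_index()

print(shares)
print(by_label)
```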

`isin` performs a vectorized set membership check and can be useful in filtering a data set down to a subset of values in a Series or column in a DataFrame:

In [28]:
mask = obj.isin(['b', 'c'])
mask

0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool

In [29]:
obj[mask]

0    c
5    b
6    b
7    c
8    c
dtype: object
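The same mask works on a DataFrame column, and `~` inverts it to keep everything outside the set. The frame below is a made-up example, not data from the original notebook.

```python
import pandas as pd

# Hypothetical frame with a categorical key column.
frame = pd.DataFrame({'key': ['c', 'a', 'd', 'a', 'b'],
                      'value': [1, 2, 3, 4, 5]})

# Keep the rows whose key is in the set.
wanted = frame[frame['key'].isin(['b', 'c'])]

# Keep the rows whose key is NOT in the set.
others = frame[~frame['key'].isin(['b', 'c'])]

print(wanted)
print(others)
```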

Related to `isin` is the `pandas.match` function of older pandas versions, which gives you an index array from an array of possibly non-distinct values into another array of distinct values. It has since been removed from pandas; the `Index.get_indexer` method returns the same result:

In [30]:
to_match = pd.Series(['c', 'a', 'b', 'b', 'c', 'a'])
unique_vals = pd.Series(['c', 'b', 'a'])
pd.Index(unique_vals).get_indexer(to_match)

array([0, 2, 1, 1, 0, 2])
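A position lookup of this kind can be sketched with `pd.Index.get_indexer`, which also shows what happens to values that have no match: they are reported as -1. The extra value `'z'` is added here purely for illustration.

```python
import pandas as pd

# 'z' does not occur among the distinct values below.
to_match = pd.Series(['c', 'a', 'b', 'b', 'c', 'a', 'z'])
unique_vals = pd.Index(['c', 'b', 'a'])

# Position of each element of to_match inside unique_vals; -1 means "not found".
positions = unique_vals.get_indexer(to_match)

# The -1 entries line up with isin() returning False.
found = to_match.isin(unique_vals)

print(positions)
print(found.tolist())
```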

Table 5-11. Unique, value counts, and set membership methods

Method | Description
-------|------------
isin | Compute boolean array indicating whether each Series value is contained in the passed sequence of values.
match | Compute integer indices for each value in an array into another array of distinct values. Helpful for data alignment and join-type operations. Removed from recent pandas releases; `Index.get_indexer` is the replacement.
unique | Compute array of unique values in a Series, returned in the order observed.
value_counts | Return a Series containing unique values as its index and frequencies as its values, ordered by count in descending order.

In some cases, you may want to compute a `histogram` on multiple related columns in a DataFrame. Here’s an example:

In [31]:
data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})
data

   Qu1  Qu2  Qu3
0    1    2    1
1    3    3    5
2    4    1    2
3    3    2    4
4    4    3    4


In [32]:
result = data.apply(pd.value_counts).fillna(0) 
result

   Qu1  Qu2  Qu3
1  1.0  1.0  1.0
2  0.0  2.0  1.0
3  2.0  2.0  0.0
4  2.0  0.0  2.0
5  0.0  0.0  1.0
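Because the top-level `pd.value_counts` function is deprecated in recent pandas releases, the same per-column histogram can be sketched with the `Series.value_counts` method instead. The `astype(int)` and `sort_index()` calls are additions for tidier output, not part of the original example.

```python
import pandas as pd

data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})

# Count values in each column; values absent from a column become NaN,
# which fillna(0) turns into a zero count.
result = (data.apply(lambda col: col.value_counts())
              .fillna(0)
              .astype(int)
              .sort_index())

print(result)
```

Each column of `result` sums to the number of rows in `data`, which is a quick sanity check on the counts.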
