<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#numpy" data-toc-modified-id="numpy-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>numpy</a></span><ul class="toc-item"><li><span><a href="#代替列表" data-toc-modified-id="代替列表-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Replacing lists</a></span></li></ul></li><li><span><a href="#ndarray" data-toc-modified-id="ndarray-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>ndarray</a></span><ul class="toc-item"><li><span><a href="#创建数组" data-toc-modified-id="创建数组-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Creating arrays</a></span></li><li><span><a href="#数据类型" data-toc-modified-id="数据类型-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Data types</a></span></li></ul></li><li><span><a href="#ufunc" data-toc-modified-id="ufunc-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>ufunc</a></span><ul class="toc-item"><li><span><a href="#运算函数" data-toc-modified-id="运算函数-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Arithmetic functions</a></span></li><li><span><a href="#统计函数" data-toc-modified-id="统计函数-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Statistical functions</a></span></li><li><span><a href="#排序函数" data-toc-modified-id="排序函数-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Sorting functions</a></span></li></ul></li></ul></div>

# numpy
[Processing data quickly with NumPy](https://time.geekbang.org/column/article/73756)

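## Replacing lists
- The elements of a list are scattered across memory, whereas a NumPy array is stored in one **contiguous** block of memory, so there is no per-element address lookup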

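- Matrix computations in NumPy can run multi-threaded (depending on the linear-algebra backend NumPy is built against), making full use of multi-core CPUs and improving throughput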
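
The contiguous layout described above can be inspected directly. The following is a minimal sketch; the array `a` is only an illustration, and the stride values assume the 8-byte `int64` dtype requested explicitly in the code.

```python
import numpy as np

# Illustrative array: 2 rows x 3 columns of 8-byte integers.
a = np.arange(6, dtype=np.int64).reshape(2, 3)

# A freshly created array lives in one C-contiguous block.
print(a.flags['C_CONTIGUOUS'])   # True

# Each element takes 8 bytes; moving to the next row skips 3 elements (24 bytes).
print(a.itemsize, a.strides)     # 8 (24, 8)
```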

Usage tip:
- Use in-place operations and **avoid implicit copies**. For example, to double a value x, write `x *= 2` rather than `y = x * 2`
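
To make the difference concrete, here is a small sketch (the names `x`, `alias`, and `y` are illustrative). The in-place form writes into the existing buffer, while the second form allocates a new array for the result.

```python
import numpy as np

x = np.array([1, 2, 3])
alias = x                       # a second name for the same array object

x *= 2                          # in-place: reuses the existing buffer
print(alias)                    # [2 4 6]

y = x * 2                       # allocates a new array for the result
print(np.shares_memory(x, y))   # False
```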

# ndarray
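ndarray (N-dimensional array object): a multidimensional array

- Rank: the number of dimensions is called the rank (available as `ndarray.ndim`); a one-dimensional array has rank 1 and a two-dimensional array has rank 2

- Axis: each linear dimension of the array is called an axis; the rank is the number of axes
    - axis = 0: operates across rows, giving one result per column
    - axis = 1: operates across columns, giving one result per row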
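
A short sketch of how the axis argument plays out on a 2 x 3 array (the array `b` is just an example):

```python
import numpy as np

b = np.array([[1, 2, 3],
              [4, 5, 6]])

print(b.ndim)          # 2 (the rank)
print(b.sum(axis=0))   # [5 7 9]   collapses the rows: one sum per column
print(b.sum(axis=1))   # [ 6 15]   collapses the columns: one sum per row
```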


## Creating arrays
`ndarray = np.array()`

In [7]:
import numpy as np

a = np.array([1, 2, 3])
b = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

print(a.shape, a.dtype)
print(b.shape, b.dtype)

(3,) int64
(3, 3) int64
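
The `int64` shown above is the default integer type on the machine that produced this output; the default can differ by platform and NumPy version. When the element type matters, it can be requested explicitly through the `dtype` argument, as in this sketch (the arrays `c` and `d` are illustrative):

```python
import numpy as np

c = np.array([1, 2, 3], dtype=np.float32)
d = np.array([[1, 2], [3, 4]], dtype=np.int64)

print(c.dtype, c.shape)   # float32 (3,)
print(d.dtype, d.shape)   # int64 (2, 2)
```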


## Data types
`dtype = numpy.dtype(object, align=False, copy=False)`

- object
    - Pass a string: field names default to 'f0', 'f1', ...
    ```
    dt = np.dtype("i4, (2,3)f8, f4")
    ```
    - Pass a list: `[(field_name, field_dtype, field_shape), ...]`
    ```
    dt = np.dtype([('big', '>i4'), ('little', '<i4')])
    ```
    - Pass a dict: `{'names': ..., 'formats': ..., 'offsets': ..., 'titles': ..., 'itemsize': ...}`
    ```
    dt = np.dtype({
        'names': ['r','g','b','a'],
        'formats': [np.uint8, np.uint8, np.uint8, np.uint8]})
    ```
   

- Type codes

Code | Meaning
:--: | :--:
'?' | boolean
'b' | (signed) byte
'B' | unsigned byte
'i' | integer
'u' | unsigned integer
'f' | floating-point
'c' | complex floating-point
'U' | unicode string
'O' | (Python) objects

In [26]:
# field f0: a 32-bit integer
# field f1: a 2 x 3 sub-array of 64-bit floats
# field f2: a 32-bit float
dt = np.dtype("i4, (2,3)f8, f4")
print(dt.fields)

# created from a list
dt = np.dtype([('name', 'U32'), ('age', 'i'), ('chinese', 'i')])
print(dt.fields)

# created from a dict
dt = np.dtype({
    'names': ('name', 'age', 'chinese'),
    'formats': ('U32', 'i', 'i')
})
print(dt.fields)

{'f0': (dtype('int32'), 0), 'f1': (dtype(('<f8', (2, 3))), 4), 'f2': (dtype('float32'), 52)}
{'name': (dtype('<U32'), 0), 'age': (dtype('int32'), 128), 'chinese': (dtype('int32'), 132)}
{'name': (dtype('<U32'), 0), 'age': (dtype('int32'), 128), 'chinese': (dtype('int32'), 132)}
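
The offsets in the output above follow from the field sizes: a `U32` field holds 32 characters at 4 bytes each, so the next field starts at byte 128. The sketch below checks this; it uses `'i4'` instead of `'i'` so that the integer width is explicit, which is an assumption made only for this example.

```python
import numpy as np

dt = np.dtype([('name', 'U32'), ('age', 'i4'), ('chinese', 'i4')])

print(dt.names)               # ('name', 'age', 'chinese')
print(dt['name'].itemsize)    # 128  (32 characters x 4 bytes)
print(dt.fields['age'][1])    # 128  (byte offset of the 'age' field)
print(dt.itemsize)            # 136  (128 + 4 + 4)
```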


In [23]:
import numpy as np

person = np.dtype({                      # define the record type (the table schema)
    'names': ['name', 'age', 'chinese', 'math', 'english'],
    'formats': ['U32', 'i', 'i', 'i', 'f']
})

people = np.array([                      # fill in the records (the rows)
    ("ZhangFei", 32, 75, 100, 90), ("GuanYu", 24, 85, 96, 88.5),
    ("ZhaoYun", 28, 85, 92, 96.5), ("HuangZhong", 29, 65, 85, 100)],
    dtype=person
)

names = people['name']
chinese = people['chinese']
print(names, chinese, np.mean(chinese), sep='\n')

['ZhangFei' 'GuanYu' 'ZhaoYun' 'HuangZhong']
[75 85 85 65]
77.5
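
Because each field is returned as an ordinary array, it can also drive a boolean mask that selects whole records. A minimal sketch, using a reduced two-field version of the record type (the threshold of 25 is arbitrary):

```python
import numpy as np

person = np.dtype({'names': ['name', 'age'], 'formats': ['U32', 'i4']})
people = np.array([("ZhangFei", 32), ("GuanYu", 24), ("ZhaoYun", 28)],
                  dtype=person)

# Keep only the records whose 'age' field exceeds 25.
older = people[people['age'] > 25]
print(older['name'])   # ['ZhangFei' 'ZhaoYun']
```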


# ufunc

## Arithmetic functions

In [29]:
x1 = np.arange(1,11,2)
x2 = np.linspace(1,9,5)

print(np.add(x1, x2))
print(np.subtract(x1, x2))
print(np.multiply(x1, x2))
print(np.divide(x1, x2))
print(np.power(x1, x2))
print(np.mod(x1, x2))

[ 2.  6. 10. 14. 18.]
[0. 0. 0. 0. 0.]
[ 1.  9. 25. 49. 81.]
[1. 1. 1. 1. 1.]
[1.00000000e+00 2.70000000e+01 3.12500000e+03 8.23543000e+05
 3.87420489e+08]
[0. 0. 0. 0. 0.]
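
Since `x1` and `x2` hold the same values in the cell above, several of the results are trivial. The sketch below uses two different (illustrative) arrays, and also shows that the arithmetic operators give the same results as the corresponding ufuncs.

```python
import numpy as np

x1 = np.array([7, 8, 9])
x2 = np.array([2, 3, 4])

# The + operator and np.add produce the same result.
print(np.array_equal(x1 + x2, np.add(x1, x2)))   # True

print(np.mod(x1, x2))      # [1 2 1]     element-wise remainder
print(np.power(x1, 2))     # [49 64 81]  the scalar is applied to every element
print(np.divide(x1, x2))   # true division, so the result is a float array
```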


## Statistical functions

In [50]:
import numpy as np

a = np.array([[1,2,3], [4,5,6], [7,8,9]])

print(np.min(a), np.min(a, 0), np.min(a, 1))    # minimum
print(np.ptp(a), np.ptp(a, 0), np.ptp(a, 1))    # max - min (peak to peak)



1 [1 2 3] [1 4 7]
8 [6 6 6] [2 2 2]


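- `np.percentile(a, q, axis=None, out=None, overwrite_input=False, interpolation='linear', keepdims=False)`

    Returns the q-th percentile: a value such that at least q% of the data items are less than or equal to it, and at least (100-q)% are greater than or equal to it. (NumPy 1.22 and later name the `interpolation` parameter `method`.)
    - q=0 gives the minimum, q=50 gives the median, q=100 gives the maximum
- `numpy.median(a, axis=None, out=None, overwrite_input=False, keepdims=False)` 

    Returns the median
- `numpy.mean(a, axis=None, dtype=None, out=None, keepdims=False)`

    Returns the arithmetic mean
- `numpy.average(a, axis=None, weights=None, returned=False)`

    Returns the weighted average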
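
The relationships between these functions can be checked on a small array. In this sketch the weights `w` are chosen arbitrarily (here equal to the data), and the weighted average is compared against the formula sum(a * w) / sum(w).

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
w = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])   # illustrative weights

# The 50th percentile is the median; 0 and 100 give the minimum and maximum.
print(np.percentile(a, 50) == np.median(a))          # True
print(np.percentile(a, 0), np.percentile(a, 100))    # 1.0 9.0

# A weighted average is sum(a * w) / sum(w).
manual = (a * w).sum() / w.sum()
print(np.isclose(np.average(a, weights=w), manual))  # True
```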

In [78]:
import numpy as np

a = np.array([[1.1,2.2,3.1], [4,5,6], [7,8,9]], dtype=np.float16)

print(np.percentile(a, (50, 60)), np.percentile(a, (50, 60) , 0), np.percentile(a, (50, 60) , 1), sep='\n\n')


[5.  5.8]

[[4.  5.  6. ]
 [4.6 5.6 6.6]]

[[2.19921875 5.         8.        ]
 [2.37929688 5.2        8.2       ]]


In [81]:
print(np.median(a), np.median(a, 0), np.median(a, 1))
print(np.mean(a), np.mean(a, 0), np.mean(a, 1))
print(np.average(a, weights=[[1,2,3], [4,5,6], [7,8,9]]))



5.0 [4. 5. 6.] [2.2 5.  8. ]
5.043 [4.03  5.066 6.03 ] [2.133 5.    8.   ]
6.351041666666666


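- `numpy.var(a, axis=None, dtype=None, out=None, ddof=0, keepdims=False)`

    Variance (with the default `ddof=0`, the divisor is N, i.e. the population variance)
- `numpy.std(a, axis=None, dtype=None, out=None, ddof=0, keepdims=False)`

    Standard deviation (the square root of the variance)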
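
The `ddof` parameter changes the divisor from N to N - ddof. The following sketch, on the values 1 to 9, contrasts the default population variance with the sample variance (`ddof=1`):

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# The squared deviations from the mean (5.0) sum to 60.
print(np.isclose(np.var(a), 60 / 9))   # True  (divisor N = 9)
print(np.var(a, ddof=1))               # 7.5   (divisor N - 1 = 8)

# The standard deviation is the square root of the variance.
print(np.isclose(np.std(a), np.sqrt(np.var(a))))   # True
```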

In [85]:
a = np.array([[1,2,3], [4,5,6], [7,8,9]])

print(np.var(a), np.var(a, 0), np.var(a, 1))
print(np.std(a), np.std(a, 0), np.std(a, 1))

6.666666666666667 [6. 6. 6.] [0.66666667 0.66666667 0.66666667]
2.581988897471611 [2.44948974 2.44948974 2.44948974] [0.81649658 0.81649658 0.81649658]


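## Sorting functions
- `np.sort(a, axis=-1, kind='quicksort', order=None)`
    - axis
        - By default, sorts along the last axis of the array
        - None: flattens the array into one dimension before sorting
    - kind 
        - `quicksort`, `mergesort`, and `heapsort` select quick sort, merge sort, and heap sort respectively.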

In [88]:
a = np.array([[4,3,2],[2,4,1]])
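print(np.sort(a), '\n')                 # default: sort along the last axis (within each row)
print(np.sort(a, axis=None), '\n')      # flatten to a 1-D array, then sort
print(np.sort(a, axis=0), '\n')         # sort each column
print(np.sort(a, axis=1), '\n')         # sort each row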


[[2 3 4]
 [1 2 4]] 

[1 2 2 3 4 4] 

[[2 3 1]
 [4 4 2]] 

[[2 3 4]
 [1 2 4]] 
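
Two related features are sketched below. `np.argsort` returns the indices that would sort the array rather than the sorted values, and the `order` parameter of `np.sort` names the fields to sort a structured array by. The record array `rec` and its field names are made up for this example.

```python
import numpy as np

a = np.array([[4, 3, 2], [2, 4, 1]])

# Indices that would sort each row.
print(np.argsort(a, axis=1))
# [[2 1 0]
#  [2 0 1]]

# Hypothetical records: sort by 'score' first, then by 'name' to break ties.
rec = np.array([('b', 2), ('a', 2), ('c', 1)],
               dtype=[('name', 'U8'), ('score', 'i4')])
print(np.sort(rec, order=['score', 'name'])['name'])   # ['c' 'a' 'b']
```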



- `list = sorted(iterable, key=None, reverse=False)`

In [115]:
import numpy as np

type_grade = np.dtype({
    'names': ['name', 'chinese', 'english', 'math'],
    # 'f16' (128-bit extended float) is platform-dependent; 'f8' is a portable alternative
    'formats': ['U32', 'f16', 'f16', 'f16']
})

grades = np.array([
    ('zhangfei', 66, 65, 30),
    ('guanyu', 95, 85, 98),
    ('zhaoyun', 93, 92, 96)], 
    dtype=type_grade)


names = grades['name']
chinese = grades['chinese']
english = grades['english']
math = grades['math']

def show(name, scores):
    # Print one table row: mean, min, max, variance, standard deviation.
    print('{} | {} | {} | {} | {} | {} '
          .format(name, np.mean(scores), np.min(scores), np.max(scores), np.var(scores), np.std(scores)))

print("Subject | Mean | Min | Max | Variance | Std dev")
show("Chinese", chinese)
show("English", english)
show("Math", math)


# Rank the students by total score, highest first.
ranking = sorted(grades, key=lambda x: x['chinese'] + x['english'] + x['math'], reverse=True)
print(ranking)

Subject | Mean | Min | Max | Variance | Std dev
Chinese | 84.66666666666667 | 66.0 | 95.0 | 174.88888888888889 | 13.224556283251582 
English | 80.66666666666667 | 65.0 | 92.0 | 130.88888888888889 | 11.440668201153676 
Math | 74.66666666666667 | 30.0 | 98.0 | 998.2222222222222 | 31.594654962860762 
[('zhaoyun', 93., 92., 96.), ('guanyu', 95., 85., 98.), ('zhangfei', 66., 65., 30.)]
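
The built-in `sorted` returns a Python list of records. If an ndarray is preferred, one possible alternative is to rank with `np.argsort` on the total score. This is a sketch under stated assumptions: the English field names and the `'f8'` format are chosen for this example, not taken from the cell above.

```python
import numpy as np

# Same scores as above, with illustrative field names and a portable float type.
grades = np.array([('zhangfei', 66, 65, 30),
                   ('guanyu', 95, 85, 98),
                   ('zhaoyun', 93, 92, 96)],
                  dtype=[('name', 'U32'), ('chinese', 'f8'),
                         ('english', 'f8'), ('math', 'f8')])

total = grades['chinese'] + grades['english'] + grades['math']
order = np.argsort(-total)     # negate the totals to sort in descending order
ranking = grades[order]        # still a structured ndarray

print(ranking['name'])         # ['zhaoyun' 'guanyu' 'zhangfei']
```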
