# Scientific Computing in Python

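#### Installing scientific computing packages: Anaconda
The open-source Anaconda distribution combines convenient package installation, package management, and unified environment management. It is promoted as the easiest way to do Python/R data science and machine learning on Linux, Windows, and Mac OS X, and reports more than 11 million users worldwide.  
[Anaconda official website](https://www.anaconda.com/distribution/)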

# NumPy

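[NumPy quickstart](https://www.numpy.org/devdocs/user/quickstart.html)

NumPy is the fundamental package for scientific computing in Python.
It provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and a collection of functions and APIs for fast operations on arrays,
including mathematical and logical operations, shape manipulation, sorting, selection, I/O, discrete Fourier transforms, basic linear algebra, basic statistics, random simulation, and more.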

### Why NumPy

- Why not just use a list?
    - Lists are slow
    - Lists lack mathematical operations

In [1]:
import numpy as np
import timeit

x = np.array([(1, 2),(3,4)], dtype=[('a', np.int8), ('b', np.int8)])
xv = x.view(dtype=np.int8).reshape(-1,2)
print (xv)

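def t_list():
    # Build a list of 400,000 ones, then a second list of zeros of the same length
    l = [1 for i in range(4*100000)]
    [0 for i in l]
    
def t_float32():
    # Allocate 400,000 float32 ones and overwrite them with zeros in place
    np.ones(4*100000, np.float32).view(np.float32)[...] = 0

def t_int8():
    # Same operation with 1-byte integers, so only a quarter of the memory is touched
    np.ones(4*100000, np.int8).view(np.int8)[...] = 0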
    
print(timeit.timeit('t_list()', setup='from __main__ import t_list', number=1))
print(timeit.timeit('t_float32()', setup='from __main__ import t_float32', number=1))
print(timeit.timeit('t_int8()', setup='from __main__ import t_int8', number=1))

[[1 2]
 [3 4]]
0.048134199999999794
0.000827499999999759
0.0002157999999994331


### Memory

- Layout: alignment (align)
- Allocation: malloc
- Copying: memcpy
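The three keywords above correspond to properties that can be inspected from Python. The following is a minimal sketch rather than the author's own example; the array names are made up for illustration, and the link to `malloc`/`memcpy` is conceptual, since NumPy manages its buffers internally.

```python
import numpy as np

# Layout: a C-contiguous 3x4 array of 8-byte floats.
m = np.zeros((3, 4), dtype=np.float64)
print(m.itemsize)   # bytes per element
print(m.nbytes)     # total bytes in the data buffer
print(m.strides)    # bytes to step along each axis
print(m.flags['C_CONTIGUOUS'], m.flags['ALIGNED'])

# Allocation vs. copy: a slice is a view onto the same buffer,
# while .copy() allocates a new buffer and copies the data into it.
view = m[:, :2]
copy = m[:, :2].copy()
print(np.shares_memory(m, view))
print(np.shares_memory(m, copy))
```

Because views avoid the allocation and the copy, slicing is cheap, but it also means that writes through a view are visible in the original array.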

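### The core NumPy data type: the ndarray  
A NumPy array is an N-dimensional grid of values, all of the same type, indexed by a tuple of non-negative integers. The number of dimensions is the rank of the array, and the `shape` attribute is a tuple of integers giving the size of the array along each dimension. It differs from a standard Python list in several important ways:
- A NumPy array has a fixed size at creation, unlike a Python list, which can grow dynamically.
- All elements of a NumPy array must have the same data type and therefore occupy the same amount of memory.
- NumPy arrays support vectorized operations on large amounts of data, as well as advanced mathematical and other operations.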
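The first two differences can be observed directly. This is a small sketch with made-up values, assuming nothing beyond NumPy itself:

```python
import numpy as np

arr = np.array([1, 2, 3])
print(arr.ndim, arr.shape, arr.dtype)

# Homogeneous type: mixing ints and a float upcasts every element to float64.
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)

# Fixed size: np.append does not grow arr in place, it returns a new array.
bigger = np.append(arr, 4)
print(arr.shape, bigger.shape)
```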

In [4]:
heights = [1.73, 1.68, 1.71, 1.89, 1.79] # heights of 5 people (m)
weights = [65.4, 59.2, 63.6, 88.4, 68.7] # weights of 5 people (kg)
bmis = weights / heights ** 2            # plain lists cannot compute the 5 BMIs directly

TypeError: unsupported operand type(s) for ** or pow(): 'list' and 'int'

In [5]:
import numpy as np
np_heights = np.array(heights)        # an array (vector), usually holding only numbers
np_weights = np.array(weights)
bmis = np_weights / np_heights ** 2   # the computation now operates on whole vectors
print(bmis)

[21.85171573 20.97505669 21.75028214 24.7473475  21.44127836]


In [13]:
bmis.ndim

1

In [7]:
bmis > 21  # element-wise comparison gives the intuitive result

array([ True, False,  True,  True,  True])

##### The BMI formula, now computed on vectors
\begin{equation}\vec{BMIS} = \frac{\vec{weights}}{\vec{heights}^{2}}\end{equation}

### Data types of an ndarray

In [18]:
type(np_weights)                   # should return the NumPy array type

numpy.ndarray

In [6]:
mixed_list = [1.0, "is", True] # a list with mixed element types
print(np.array(mixed_list))
type(np.array(mixed_list))     # converted to a NumPy array
type(np.array(mixed_list)[0])  # every element has been coerced to a string

['1.0' 'is' 'True']


numpy.str_

In [20]:
np.array(mixed_list)               # the dtype is a Unicode string

array(['1.0', 'is', 'True'], dtype='<U32')

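### Creating NumPy arrays
There are several mechanisms for creating arrays:
1. Conversion from other Python structures (e.g. lists and tuples), as used above
2. Native NumPy array creation routines (e.g. arange, ones, zeros)
3. Reading arrays from disk, in either standard or custom formats
4. Special library functions (e.g. random)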
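The four mechanisms in the list can be sketched in a few lines. The variable names are illustrative, the in-memory buffer stands in for a real file on disk, and the seed value is arbitrary:

```python
import io
import numpy as np

from_list = np.array([(1, 2), (3, 4)])   # 1. from Python lists/tuples
native = np.arange(0, 10, 2)             # 2. native creation routine
grid = np.linspace(0.0, 1.0, 5)          #    evenly spaced values

# 3. Save and load in NumPy's .npy format (a BytesIO buffer replaces the file).
buf = io.BytesIO()
np.save(buf, native)
buf.seek(0)
loaded = np.load(buf)

# 4. Random values from a seeded generator.
rng = np.random.default_rng(0)
samples = rng.random((2, 2))

print(from_list, native, grid, loaded, samples, sep='\n')
```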

 
We can initialize a NumPy array from a nested Python list and access its elements using square brackets:

In [10]:
a = np.array([1, 2, 3])   # Create a rank 1 array
print(type(a))            # Prints "<class 'numpy.ndarray'>"
print(a.shape)            # Prints "(3,)"
print(a[0], a[1], a[2])   # Prints "1 2 3"
a[0] = 5                  # Change an element of the array
print(a)                  # Prints "[5 2 3]"

b = np.array([[1,2,3],[4,5,6]])     # Create a rank 2 array
print (b.shape)                     # Prints "(2, 3)"
print (b[0, 0], b[0, 1], b[1, 0])   # Prints "1 2 4"

<class 'numpy.ndarray'>
(3,)
1 2 3
[5 2 3]
(2, 3)
1 2 4


In [15]:
a = np.zeros((2,2))   # Create an array of all zeros
print(a, 'a')         # Prints "[[0. 0.]
                      #          [0. 0.]] a"
b = np.ones((1,2))    # Create an array of all ones
print(b, 'b')         # Prints "[[1. 1.]] b"

c = np.full((2,2), 7) # Create a constant array (integer dtype, taken from the fill value)
print(c, 'c')         # Prints "[[7 7]
                      #          [7 7]] c"
d = np.eye(2)         # Create a 2x2 identity matrix
print(d, 'd')         # Prints "[[1. 0.]
                      #          [0. 1.]] d"

e = np.random.random((2,2)) # Create an array filled with random values
print(e, 'e')               # Might print "[[0.91940167 0.08143941]
                            #              [0.68744134 0.87236687]] e"

[[0. 0.]
 [0. 0.]] a
[[1. 1.]] b
[[7 7]
 [7 7]] c
[[1. 0.]
 [0. 1.]] d
[[0.0175284  0.62989422]
 [0.10559723 0.93251141]] e


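### Common operations on NumPy arrays

#### Indexing and slicing arrays

NumPy offers several ways to index into arrays:  
#### Slicing: 
Like Python lists, NumPy arrays can be sliced. Because arrays may be multidimensional, you must specify a slice for each dimension of the array.
* Note that a slice of an array is a view into the same data, so modifying it modifies the original array.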

In [24]:
# Create the following rank 2 array with shape (3, 4)
# [[ 1  2  3  4]
#  [ 5  6  7  8]
#  [ 9 10 11 12]]
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])
# Use slicing to pull out the subarray consisting of the first 2 rows and columns 1 and 2; b is the following array of shape (2, 2):
# [[2 3]
# [6 7]]
b = a[:2, 1:3]
print (b)
# A slice of an array is a view into the same data, so modifying it will modify the original array.
print (a[0, 1])   # Prints "2"
b[0, 0] = 77    # b[0, 0] is the same piece of data as a[0, 1]
print (a[0, 1])   # Prints "77"

[[2 3]
 [6 7]]
2
77
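When the aliasing shown above is not wanted, an explicit copy breaks the link to the original data. A minimal sketch using the same 3x4 array:

```python
import numpy as np

a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
b = a[:2, 1:3].copy()   # independent copy of the 2x2 block
b[0, 0] = 77            # changes only the copy
print(a[0, 1])          # the original element is untouched
print(b.base is None)   # a copy owns its data, so it has no base array
```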


#### Integer array indexing: 
One useful trick is to select or mutate one element from each row of a matrix.

In [36]:
# Create a new array from which we will select elements
a = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])
print (a,'/')  # prints "array([[ 1,  2,  3],
               #                [ 4,  5,  6],
               #                [ 7,  8,  9],
               #                [10, 11, 12]])"
b = np.array([0, 2, 0, 1])     # Create an array of indices
print (a[np.arange(4), b],'/') # Select one element from each row of a using the indices in b and prints "[ 1  6  7 11]"
a[np.arange(4), b] += 10       # Mutate one element from each row of a using the indices in b
print (a,'/')  # prints "array([[11,  2,  3],
               #                [ 4,  5, 16],
               #                [17,  8,  9],
               #                [10, 21, 12]])

[[ 1  2  3]
 [ 4  5  6]
 [ 7  8  9]
 [10 11 12]] /
[ 1  6  7 11] /
[[11  2  3]
 [ 4  5 16]
 [17  8  9]
 [10 21 12]] /


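#### Boolean array indexing: 
Boolean array indexing lets you pick out arbitrary elements of an array. It is most often used to select the elements that satisfy some condition. Here is an example: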

In [29]:
bool_idx = (a > 6)   # a is the 3x4 array from the slicing example, after b[0, 0] = 77
print(bool_idx)
print(a[bool_idx])
print(a[a>6])

[[False  True False False]
 [False False  True  True]
 [ True  True  True  True]]
[77  7  8  9 10 11 12]
[77  7  8  9 10 11 12]
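A boolean mask can also be used on the left-hand side of an assignment, and `np.where` makes an element-wise choice based on a condition. The sketch below uses a small made-up array:

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
mask = a > 6

clipped = a.copy()
clipped[mask] = 6                  # assign through the mask
print(clipped)

labels = np.where(a % 2 == 0, 'even', 'odd')   # element-wise choice
print(labels)
print(np.count_nonzero(mask))      # how many elements satisfy the condition
```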


There are other indexing methods as well; see the [NumPy guide](https://www.numpy.org.cn/article/basics/an_introduction_to_scientific_python_numpy.html) for more.

#### Array arithmetic
- Element-wise multiplication (`*`) versus the matrix product (`.dot`)

In [31]:
x = np.array([[1,2],[3,4]], dtype=np.float64)
y = np.array([[5,6],[7,8]], dtype=np.float64)
print (x * y)
print (np.multiply(x, y))

[[ 5. 12.]
 [21. 32.]]
[[ 5. 12.]
 [21. 32.]]


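We use the dot function to compute inner products of vectors, to multiply a matrix by a vector, and to multiply matrices. dot is available both as a function in the numpy module and as a method of array objects: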

In [33]:
print (x.dot(y))
print (np.dot(x, y))

[[19. 22.]
 [43. 50.]]
[[19. 22.]
 [43. 50.]]


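#### Broadcasting

Broadcasting is a powerful mechanism that lets NumPy work with arrays of different shapes when performing arithmetic. Often we have a smaller array and a larger array, and we want to use the smaller array multiple times to perform some operation on the larger one.
For example, suppose we want to add a constant vector to each row of a matrix. We could do it like this: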

In [37]:
x = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])
v = np.array([1, 0, 1])
y = x + v  # Add v to each row of x using broadcasting
print (y)

[[ 2  2  4]
 [ 5  5  7]
 [ 8  8 10]
 [11 11 13]]
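Broadcasting compares shapes starting from the trailing dimension; two dimensions are compatible when they are equal or when one of them is 1. The sketch below, with made-up arrays, shows one compatible and one incompatible pair:

```python
import numpy as np

col = np.arange(4).reshape(4, 1)   # shape (4, 1)
row = np.array([10, 20, 30])       # shape (3,)

# (4, 1) and (3,) broadcast to (4, 3).
table = col + row
print(table.shape)
print(table)

# Incompatible trailing dimensions (3 vs. 4) cannot be broadcast.
try:
    np.ones((4, 3)) + np.ones(4)
except ValueError as exc:
    print('broadcast failed:', exc)
```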


### Commonly used NumPy functions

In [8]:
round(bmis) # the built-in round() is not supported by numpy.ndarray here

TypeError: type numpy.ndarray doesn't define __round__ method

In [40]:
print(bmis)

[21.85171573 20.97505669 21.75028214 24.7473475  21.44127836]


In [143]:
np.round(bmis) # use NumPy's own round function instead

array([22., 21., 22., 25., 21.])

In [4]:
np.max(bmis) == max(bmis) # a bold guess: both approaches should work

True

In [43]:
np.argmax(bmis)

3

In [42]:
np.sort(bmis)

array([20.97505669, 21.44127836, 21.75028214, 21.85171573, 24.7473475 ])

In [41]:
np.median(bmis)

21.750282138093777

In [44]:
np.std(bmis)

1.3324932175111337
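Note that `np.std` computes the population standard deviation by default (it divides by N). For the sample standard deviation, pass `ddof=1`. A short sketch using the five BMI values printed earlier:

```python
import numpy as np

bmis = np.array([21.85171573, 20.97505669, 21.75028214, 24.7473475, 21.44127836])

pop_std = np.std(bmis)             # divides by N (ddof=0, the default)
sample_std = np.std(bmis, ddof=1)  # divides by N - 1
print(pop_std, sample_std)
```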

### Exercise: sum of two arrays

In [18]:
import timeit
a1 = [i for i in range(10000)]
a2 = [j for j in range(10000)]
n1 = np.array(a1)
n2 = np.array(a2)

def list_add():
    return [i1 + i2 for (i1, i2) in zip(a1, a2)]

def np_add():
    return n1 + n2

print(timeit.timeit(lambda: list_add(), number = 1))
print(timeit.timeit(lambda: np_add(), number = 1))

0.001779190999968705
0.00038525799982380704


In [16]:
help(np.ndarray)  # the documentation doubles as a tutorial

Help on class ndarray in module numpy:

class ndarray(builtins.object)
 |  ndarray(shape, dtype=float, buffer=None, offset=0,
 |          strides=None, order=None)
 |  
 |  An array object represents a multidimensional, homogeneous array
 |  of fixed-size items.  An associated data-type object describes the
 |  format of each element in the array (its byte-order, how many bytes it
 |  occupies in memory, whether it is an integer, a floating point number,
 |  or something else, etc.)
 |  
 |  Arrays should be constructed using `array`, `zeros` or `empty` (refer
 |  to the See Also section below).  The parameters given here refer to
 |  a low-level method (`ndarray(...)`) for instantiating an array.
 |  
 |  For more information, refer to the `numpy` module and examine the
 |  methods and attributes of an array.
 |  
 |  Parameters
 |  ----------
 |  (for the __new__ method; see Notes below)
 |  
 |  shape : tuple of ints
 |      Shape of created array.
 |  dtype : data-type, optional
 |

##### A closer look at NumPy data types
[dtype](https://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html)

##### Searching the developer documentation quickly
>Applications with a local index; Jupyter's Help menu is not fast enough

- [Dash](https://kapeli.com/dash)
- [Zeal](https://zealdocs.org)
- [velocity](http://velocity.silverlakesoftware.com/)


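## NumPy in mathematical statistics and linear algebra

##### A statistics example
- Compute the body mass index on a large sample
- Scale the earlier 5-person example up by a factor of 1000
- Randomly generate normally distributed data

> Documentation for [generating a normal distribution](https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.normal.html?highlight=random%20normal#numpy.random.normal)
![Documentation link](http://bazhou.blob.core.windows.net/learning/mpp/npnormal_doc.png)

![Snapshot of the normal distribution documentation](http://bazhou.blob.core.windows.net/learning/mpp/npnormal.png)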

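##### Reference means and standard deviations of height and weight for Chinese adults (2015)
|Sex (unit)|Northeast & North China|Northwest|Southeast|Central China|South China|Southwest|
|-|-|-|-|-|-|-|
|Height / weight|mean, std|mean, std|mean, std|mean, std|mean, std|mean, std|
|Male (mm)|1693, 56.6|1684, 53.7|1686, 55.2|1669, 56.3|1650, 57.1|1647, 56.7|
|Female (mm)|1586, 51.8|1575, 51.9|1575, 50.8|1560, 50.7|1549, 49.7|1546, 53.9|
|Male (kg)|64, 8.2|60, 7.6|59, 7.7|57, 6.9|56, 6.9|55, 6.8|
|Female (kg)|55, 7.1|52, 7.1|51, 7.2|50, 6.8|49, 6.5|50, 6.9|

> We use the data for adult men in the Southeast region: height 1686 mm with standard deviation 55.2 mm; weight 59 kg with standard deviation 7.7 kg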

In [22]:
height_5k = np.random.normal(1.686, 0.0552, 5000)   # 5000 heights (m)
weight_5k = np.random.normal(59, 7.7, 5000)         # 5000 weights (kg)
shmale_5k = np.column_stack((height_5k, weight_5k)) # data for 5000 Shanghai men
shmale_5k                                           # the random draws include some thin and some heavy people

array([[ 1.67762294, 60.75896532],
       [ 1.63347506, 69.78589296],
       [ 1.67032477, 49.20994517],
       ...,
       [ 1.70339819, 53.62467471],
       [ 1.67471629, 62.41044849],
       [ 1.69527693, 40.4186189 ]])

In [24]:
shmale_weight = shmale_5k[:,1]
shmale_height = shmale_5k[:,0]
shmale_height_mean = np.mean(shmale_height)  # mean height
shmale_height_std = np.std(shmale_height)    # standard deviation of height
shmale_weight_mean = np.mean(shmale_weight)  # mean weight
shmale_weight_std = np.std(shmale_weight)    # standard deviation of weight
from tabulate import tabulate                # format the result as a table
print(tabulate([['Height (m)', shmale_height_mean, shmale_height_std],
                ['Weight (kg)', shmale_weight_mean, shmale_weight_std]],
               headers=['Shanghai men', 'Mean', 'Std']))


Shanghai men      Mean        Std
------------  --------  ---------
Height (m)     1.68436  0.0553223
Weight (kg)   59.0428   7.6029


In [25]:
shmale_bmi = shmale_weight / shmale_height ** 2 # compute 5000 body mass indices
shmale_bmi

array([21.58845969, 26.1542713 , 17.63806908, ..., 18.48129107,
       22.25229531, 14.06371856])

In [26]:
print(np.mean(shmale_bmi), np.std(shmale_bmi))  # distribution of body types among Shanghai men

20.87830460016227 3.0283880774606056
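The numbers above change on every run because the generator is not seeded. One possible way to make the experiment reproducible is NumPy's `default_rng` with a fixed seed; the seed value 42 below is an arbitrary choice, not part of the original notebook:

```python
import numpy as np

rng = np.random.default_rng(42)            # the seed value is arbitrary
height = rng.normal(1.686, 0.0552, 5000)   # metres
weight = rng.normal(59, 7.7, 5000)         # kilograms
bmi = weight / height ** 2

print(bmi.mean(), bmi.std())

# Re-creating the generator with the same seed reproduces the same draws.
rng_again = np.random.default_rng(42)
assert np.array_equal(height, rng_again.normal(1.686, 0.0552, 5000))
```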


##### A linear algebra example
- Solve the system of equations
\begin{equation}2x + 3y = 8 \end{equation}
\begin{equation}5x + 2y = 9 \end{equation}

- In matrix form this is
$$ A = \begin{bmatrix}2 & 3\\5 & 2\end{bmatrix} \;\;\;\; \vec{b} = \begin{bmatrix}8\\9\end{bmatrix}$$

- The goal is to find the vector $\vec{x}$
$$ A\vec{x}= \vec{b} $$

In [28]:
a = np.array([[2, 3], [5, 2]]) # coefficient matrix of the system
a.transpose()                  # transpose

array([[2, 5],
       [3, 2]])

In [29]:
b = np.array([8, 9])
print(b.shape, a.shape)         # shape is an attribute holding a tuple, not a method
b.transpose()

(2,) (2, 2)


array([8, 9])

In [31]:
# Indexing into the matrix
print(tabulate([['0', a[0,0], a[0,1]], 
                ['1', a[1,0], a[1,1]]], 
               headers=['A', '0', '1']))

  A    0    1
---  ---  ---
  0    2    3
  1    5    2


>Documentation for [linear algebra routines](https://docs.scipy.org/doc/numpy/reference/routines.linalg.html)
![linalg](http://bazhou.blob.core.windows.net/learning/mpp/linalg.png)

![Matrix inverse](http://bazhou.blob.core.windows.net/learning/mpp/linalg.inv.png)

In [32]:
# Matrix inverse
from numpy.linalg import inv
a_inv = inv(a)
a_inv

array([[-0.18181818,  0.27272727],
       [ 0.45454545, -0.18181818]])

$$ A^{-1}A=\begin{bmatrix}1 & 0\\0 & 1\end{bmatrix} $$

In [155]:
np.round(a_inv @ a) # verify the inverse by multiplying it with the original matrix; @ is matrix multiplication

array([[ 1.,  0.],
       [-0.,  1.]])

$$ A^{-1}A\vec{x}=A^{-1}\vec{b}$$


$$ \begin{bmatrix}1 & 0\\0 & 1\end{bmatrix}\vec{x}= A^{-1}\vec{b}$$


$$ \vec{x} = A^{-1}\vec{b}$$


In [33]:
x = a_inv @ b
x

array([1., 2.])

In [34]:
from numpy.linalg import solve # import the solver
solve(a, b)

array([1., 2.])

$$ \vec{x} = \begin{bmatrix}1\\2\end{bmatrix} $$


$$ x=1 $$
$$ y=2 $$
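The result can be checked by substituting it back into the system. The sketch below repeats the setup so that it runs on its own, and confirms that both routes (explicit inverse and `solve`) agree:

```python
import numpy as np

A = np.array([[2, 3], [5, 2]])
b = np.array([8, 9])

x = np.linalg.solve(A, b)
det = np.linalg.det(A)

print(np.allclose(A @ x, b))                 # residual check
print(det)                                   # non-zero, so A is invertible
print(np.allclose(x, np.linalg.inv(A) @ b))  # same answer via the inverse
```

In practice `solve` is usually preferred over forming the inverse explicitly, since it avoids computing a matrix that is not needed for its own sake.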

In [35]:
import random as random
import numpy as np
import timeit
from itertools import accumulate
from tabulate import tabulate

class RandomWalker:
    def __init__(self):
        self.position = 0

    def walk(self, n):
        self.position = 0
        for i in range(n):
            yield self.position
            self.position += 2 * random.randint(0, 1) - 1

def oo_walk(n=1000):
    walker = RandomWalker()
    return [p for p in walker.walk(n)]

def procedure_walk(n=1000):
    position = 0
    walk = [position]
    for i in range(n):
        position += 2 * random.randint(0, 1)-1
        walk.append(position)
    return walk

def acc_walk(n=1000):
    steps = random.choices([-1, +1], k=n)
    return [0] + list(accumulate(steps))

def np_walk(n=1000):
    steps = np.random.choice([-1,+1], n)
    return np.cumsum(steps)

oo = timeit.timeit(lambda: oo_walk(), number=1)
procedure = timeit.timeit(lambda: procedure_walk(), number=1)
acc =  timeit.timeit(lambda: acc_walk(), number=1)
np_time = timeit.timeit(lambda: np_walk(), number=1)  # a separate name, so the numpy module np is not shadowed
print(tabulate([['algo', 'time'],['oo', oo],['proc', procedure],['acc', acc], ['np', np_time]], headers='firstrow'))
    

algo           time
------  -----------
oo      0.00355035
proc    0.00287531
acc     0.000473735
np      0.000811369
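A single timed run (`number=1`) is noisy, which may be part of why the NumPy version looks slower than `accumulate` in the table above. One common remedy is to repeat the measurement and keep the best run. The repeat counts below are arbitrary, and only `np_walk` is reproduced here:

```python
import timeit
import numpy as np

def np_walk(n=1000):
    steps = np.random.choice([-1, +1], n)
    return np.cumsum(steps)

# Five repetitions of 100 calls each; the minimum is the least disturbed run.
runs = timeit.repeat(np_walk, number=100, repeat=5)
best_per_call = min(runs) / 100
print(best_per_call)
```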


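# SciPy
SciPy is built on top of NumPy and provides many more functions that operate on NumPy arrays.  
SciPy is a convenient, easy-to-use Python toolkit designed for science and engineering. It includes modules for statistics, optimization, integration, linear algebra, Fourier transforms, signal and image processing, ordinary differential equation solvers, and more.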

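### NumPy vs SciPy

- The two libraries differ in size
- NumPy is a numerical module built around multidimensional arrays, used to store and process large matrices
- SciPy is a scientific computing library that adds, on top of NumPy, many routines commonly used in mathematics, science, and engineering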

### SciPy subpackages

|Package|Description|
|---|---|
|cluster|Clustering algorithms|
|constants|Physical and mathematical constants|
|fftpack|Fast Fourier Transform routines|
|integrate|Integration and ordinary differential equation solvers|
|interpolate|Interpolation and smoothing splines|
|io|Input and Output|
|linalg|Linear algebra|
|ndimage|N-dimensional image processing|
|odr|Orthogonal distance regression|
|optimize|Optimization and root-finding routines|
|signal|Signal processing|
|sparse|Sparse matrices and associated routines|
|spatial|Spatial data structures and algorithms|
|special|Special functions|
|stats|Statistical distributions and functions|

[Reference](https://docs.scipy.org/doc/scipy/reference/tutorial/general.html)

![Modules](http://bazhou.blob.core.windows.net/learning/mpp/scipy.png)

### A typical SciPy example
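The notes do not include a worked example at this point. As a stand-in, and not the author's own example, here is a minimal sketch of one routine from the `integrate` subpackage listed above: numerically integrating $e^{-x^2}$ over the real line, whose exact value is $\sqrt{\pi}$.

```python
import numpy as np
from scipy import integrate

# quad returns the integral estimate and an estimate of the absolute error.
value, abs_err = integrate.quad(lambda x: np.exp(-x ** 2), -np.inf, np.inf)
print(value, np.sqrt(np.pi), abs_err)
```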



- [libquadmath](https://gcc.gnu.org/onlinedocs/libquadmath/Math-Library-Routines.html#Math-Library-Routines)
- [OpenBLAS](https://github.com/xianyi/OpenBLAS)
> BLAS stands for Basic Linear Algebra Subprograms. BLAS provides standard interfaces for linear algebra, including BLAS1 (vector-vector operations), BLAS2 (matrix-vector operations), and BLAS3 (matrix-matrix operations). In general, BLAS is the computational kernel ("the bottom of the food chain") in linear algebra or scientific applications. Thus, if BLAS implementation is highly optimized, the whole application can get substantial benefit.

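# Pandas
Pandas is a tool built on NumPy, created for data analysis tasks.  
Pandas brings together a large number of libraries and some standard data models, and provides the tools needed to work efficiently with large data sets. 

### Pandas data structures

- Series, one-dimensional
- DataFrame, two-dimensional
- Panel, three-dimensional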

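#### Basic operations and common functions of Series and DataFrame

A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively called the index.
The basic way to create a Series is:
 s = pd.Series(data, index=index)

Here, data can be many different things:
- a Python dict
- an ndarray
- a scalar value (such as 5)

The passed index is a list of axis labels.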

In [3]:
#From ndarray  
import pandas as pd
import numpy as np
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e']) # the index must have the same length as the data
print(s)

a    0.181330
b    0.992798
c    1.922655
d   -0.125142
e    0.987567
dtype: float64


In [4]:
s.index             # pandas also supports non-unique index values

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [5]:
pd.Series(np.random.randn(5))

0    0.621302
1    1.358443
2   -1.214798
3   -1.876280
4    0.429610
dtype: float64

In [6]:
#From dict
d = {'b': 1, 'a': 0, 'c': 2}
pd.Series(d)


a    0
b    1
c    2
dtype: int64

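Note: when the data is a dict and no index is passed, the Series index follows the dict's insertion order if you are using Python >= 3.6 and pandas >= 0.23.
If you are using Python < 3.6 or pandas < 0.23 and no index is passed, the index is the lexically ordered list of dict keys, as in the output above.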

In [7]:
pd.Series(d, index=['b', 'c', 'd', 'a']) # if an index is passed, the values matching its labels are pulled from the data; missing labels become NaN

b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64

In [8]:
#From scalar value
pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])

a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64

In [12]:
# Series operations are similar to ndarray operations
s[0]

0.18132983564722829

In [13]:
s[:3]

a    0.181330
b    0.992798
c    1.922655
dtype: float64

In [14]:
s[s > s.median()]

b    0.992798
c    1.922655
dtype: float64

In [15]:
s[[4, 3, 1]]

e    0.987567
d   -0.125142
b    0.992798
dtype: float64
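The cells above index a labeled Series with bare integers. Recent pandas versions warn about treating integer keys as positions on a non-integer index, so explicit `.iloc` (position) and `.loc` (label) are the safer spelling. The sketch below uses rounded, illustrative values and also shows that arithmetic aligns on labels:

```python
import numpy as np
import pandas as pd

s = pd.Series([0.18, 0.99, 1.92, -0.13, 0.99], index=['a', 'b', 'c', 'd', 'e'])

print(s.iloc[0])          # by integer position
print(s.loc['a'])         # by label
print(s.iloc[[4, 3, 1]])  # several positions at once

# Vectorized operations work as on an ndarray.
print(np.exp(s) + s * 2)

# Arithmetic aligns on labels: 'a' and 'e' each appear on one side only, giving NaN.
aligned = s.iloc[1:] + s.iloc[:-1]
print(aligned)
```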

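A DataFrame is a two-dimensional labeled data structure with columns of potentially different types. You can think of it as a spreadsheet, a SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. Like Series, DataFrame accepts many different kinds of input:

- Dict of 1D ndarrays, lists, dicts, or Series
- 2-D numpy.ndarray
- Structured or record ndarray
- A Series
- Another DataFrame

Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments. If you pass an index and/or columns, you are guaranteeing the index and/or columns of the resulting DataFrame. Thus, a dict of Series plus a specific index will discard all data not matching the passed index.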


In [18]:
#From dict of Series 
d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
df

Unnamed: 0,one,two
a,1.0,1.0
b,2.0,2.0
c,3.0,3.0
d,,4.0


In [19]:
pd.DataFrame(d, index=['d', 'b', 'a'])

Unnamed: 0,one,two
d,,4.0
b,2.0,2.0
a,1.0,1.0


In [20]:
pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])

Unnamed: 0,two,three
d,4.0,
b,2.0,
a,1.0,


In [21]:
df.index

Index(['a', 'b', 'c', 'd'], dtype='object')

In [22]:
df.columns

Index(['one', 'two'], dtype='object')

In [23]:
# From dict of ndarrays / lists
d = {'one': [1., 2., 3., 4.],'two': [4., 3., 2., 1.]}
pd.DataFrame(d)


Unnamed: 0,one,two
0,1.0,4.0
1,2.0,3.0
2,3.0,2.0
3,4.0,1.0


In [24]:
pd.DataFrame(d, index=['a', 'b', 'c', 'd'])

Unnamed: 0,one,two
a,1.0,4.0
b,2.0,3.0
c,3.0,2.0
d,4.0,1.0


In [71]:
# Passing a dict of tuples automatically creates a MultiIndex
pd.DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
              ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
              ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
              ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
              ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}})

Unnamed: 0_level_0,Unnamed: 1_level_0,a,a,a,b,b
Unnamed: 0_level_1,Unnamed: 1_level_1,a,b,c,a,b
A,B,4.0,1.0,5.0,8.0,10.0
A,C,3.0,2.0,6.0,7.0,
A,D,,,,,9.0


In [72]:
pd.DataFrame.from_dict(dict([('A', [1, 2, 3]),('B', [4, 5, 6])]),orient='index') 

Unnamed: 0,0,1,2
A,1,2,3
B,4,5,6


In [70]:
df.columns = ['one', 'two', 'three']
df

ValueError: Length mismatch: Expected axis has 4 elements, new values have 3 elements

In [42]:
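# Column selection, addition, and deletion (df here is a small frame with rows key2, key1, key3, built in a cell that is not shown)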
df['one']

key2    4
key1    1
key3    7
Name: one, dtype: int64

In [44]:
df['three'] = df['one'] * df['two']
df

Unnamed: 0,one,two,three
key2,4,5,20
key1,1,2,2
key3,7,8,56


In [53]:
df['flag'] = df['one'] > 2
df

Unnamed: 0,one,flag
key2,4,True
key1,1,False
key3,7,True


In [47]:
del df['two']
df

Unnamed: 0,one,three,flag
key2,4,20,True
key1,1,2,False
key3,7,56,True


In [52]:
three = df.pop('three')
df

KeyError: 'three'

##### Indexing / Selection
The basics of indexing are as follows:

|Operation|Syntax|Result|
|---|---|---|
|Select column|df[col]|Series|
|Select row by label|df.loc[label]|Series|
|Select row by integer location|df.iloc[loc]|Series|
|Slice rows|df[5:10]|DataFrame|
|Select rows by boolean vector|df[bool_vec]|DataFrame|
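Each row of the table can be tried on a small frame. The sketch below rebuilds a frame with the same labels and values that appear in the surrounding outputs (rows key2, key1, key3; columns one and two); the variable names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'one': [4, 1, 7], 'two': [5, 2, 8]},
                  index=['key2', 'key1', 'key3'])

col = df['one']                 # select column -> Series
row_by_label = df.loc['key1']   # select row by label -> Series
row_by_pos = df.iloc[1]         # select row by integer position -> Series
first_two = df[0:2]             # slice rows -> DataFrame
big = df[df['one'] > 2]         # boolean vector -> DataFrame

print(col, row_by_label, first_two, big, sep='\n\n')
```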

In [56]:
df.loc['key1']   # select a row by label

one         1
flag    False
Name: key1, dtype: object

In [59]:
df

Unnamed: 0,one,flag
key2,4,True
key1,1,False
key3,7,True


In [60]:
df.iloc[1]   # select the 2nd row by integer position

one         1
flag    False
Name: key1, dtype: object

In [61]:
df = pd.DataFrame(np.random.randn(10, 4), columns=['A', 'B', 'C', 'D'])
df2 = pd.DataFrame(np.random.randn(7, 3), columns=['A', 'B', 'C'])
df + df2

Unnamed: 0,A,B,C,D
0,1.846331,1.008791,1.679312,
1,-3.187723,-0.748738,-2.782484,
2,0.962797,0.544605,-0.560604,
3,-0.940819,-2.757307,-0.950905,
4,0.421416,-0.091232,-2.845273,
5,0.948495,1.826965,0.073359,
6,0.86914,-2.012826,-1.603238,
7,,,,
8,,,,
9,,,,
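The NaN values above come from label alignment: any row or column label present in only one operand yields NaN. If a missing value should instead count as zero, the `add` method accepts a `fill_value`. A small sketch with made-up frames:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((3, 2)), columns=['A', 'B'])
df2 = pd.DataFrame(np.ones((2, 1)), columns=['A'])

plain = df + df2                      # unmatched labels become NaN
filled = df.add(df2, fill_value=0)    # a value missing on one side counts as 0
print(plain)
print(filled)
```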


In [62]:
df - df.iloc[0]

Unnamed: 0,A,B,C,D
0,0.0,0.0,0.0,0.0
1,-2.656066,-2.46089,-2.670479,-1.278996
2,0.881333,-0.845595,-1.752782,-0.848336
3,-0.889294,-3.876843,-2.843044,-0.489264
4,-1.315849,-1.7827,-4.375105,-0.897013
5,-0.607609,0.224697,-0.715493,-1.864492
6,-0.968256,-3.569502,-1.573848,0.125748
7,-0.999731,-0.699439,-3.026557,-0.743453
8,-2.060926,0.1906,-0.147368,-1.561439
9,-0.097323,-1.424839,-2.692717,-2.664375


In [63]:
df * 5 + 2

Unnamed: 0,A,B,C,D
0,7.337059,8.741749,9.461832,5.994874
1,-5.943271,-3.562701,-3.890561,-0.400106
2,11.743724,4.513775,0.697923,1.753195
3,2.890587,-10.642464,-4.753389,3.548554
4,0.757814,-0.171753,-12.413691,1.509809
5,4.299012,9.865234,5.884364,-3.327588
6,2.495779,-9.105761,1.592592,6.623615
7,2.338403,5.244554,-5.670955,2.277608
8,-2.967573,9.694747,8.724993,-1.81232
9,6.850442,1.617552,-4.001751,-7.326999


In [64]:
df1 = pd.DataFrame({'a': [1, 0, 1], 'b': [0, 1, 1]}, dtype=bool)
df2 = pd.DataFrame({'a': [0, 1, 1], 'b': [1, 1, 0]}, dtype=bool)
df1 & df2

Unnamed: 0,a,b
0,False,False
1,False,True
2,True,False


In [65]:
df1 | df2

Unnamed: 0,a,b
0,True,True
1,True,True
2,True,True


In [66]:
# To transpose, access the T attribute (or call the transpose method)
df[:5].T    # only show the first 5 rows

Unnamed: 0,0,1,2,3,4
A,1.067412,-1.588654,1.948745,0.178117,-0.248437
B,1.34835,-1.11254,0.502755,-2.528493,-0.434351
C,1.492366,-1.178112,-0.260415,-1.350678,-2.882738
D,0.798975,-0.480021,-0.049361,0.309711,-0.098038


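#### Panel
Panel is a less frequently used but still important container for three-dimensional data. The term panel data comes from econometrics. Note that Panel was deprecated in pandas 0.20 and removed in 0.25, so the cell below only runs on older versions.
The names of the 3 axes are intended to give some semantic meaning to operations involving panel data, in particular econometric analysis of panel data.
However, for the strict purpose of slicing and dicing a collection of DataFrame objects, you may find the axis names slightly arbitrary:

- items: axis 0, each item corresponds to a DataFrame contained inside
- major_axis: axis 1, the index (rows) of each DataFrame
- minor_axis: axis 2, the columns of each DataFrame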

In [68]:
wp = pd.Panel(np.random.randn(2, 5, 4), items=['Item1', 'Item2'],
                major_axis=pd.date_range('1/1/2000', periods=5),
                minor_axis=['A', 'B', 'C', 'D'])
wp

<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 5 (major_axis) x 4 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 2000-01-01 00:00:00 to 2000-01-05 00:00:00
Minor_axis axis: A to D
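On pandas versions where `pd.Panel` is not available, one possible substitute is a DataFrame with a two-level row index. The sketch below is an assumption about how the same data could be laid out, not the author's code; the level names `item` and `date` are made up:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('1/1/2000', periods=5)
frames = {
    'Item1': pd.DataFrame(np.random.randn(5, 4), index=dates, columns=list('ABCD')),
    'Item2': pd.DataFrame(np.random.randn(5, 4), index=dates, columns=list('ABCD')),
}

# Stack the frames; the dict keys become the outer level of a row MultiIndex.
wp = pd.concat(frames, names=['item', 'date'])
print(wp.shape)

item1 = wp.loc['Item1']                      # the 5x4 frame for one item
first_day = wp.xs(dates[0], level='date')    # one date across both items
print(item1)
print(first_day)
```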

### A typical Pandas case study

### Data source 

[CIA world factbook](https://www.cia.gov/library/publications/the-world-factbook/rankorder/rankorderguide.html)


In [1]:
import pandas as pd

electricity_consuming = pd.read_csv('https://www.cia.gov/library/publications/the-world-factbook/rankorder/rawdata_2233.txt')
nature_gas_consuming = pd.read_csv('https://www.cia.gov/library/publications/the-world-factbook/rankorder/rawdata_2250.txt')

CParserError: Error tokenizing data. C error: Expected 4 fields in line 12, saw 5


In [8]:
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
import pandas as pd

electricity_consuming = pd.read_csv('https://www.cia.gov/library/publications/the-world-factbook/rankorder/rawdata_2233.txt')
nature_gas_consuming = pd.read_csv('https://www.cia.gov/library/publications/the-world-factbook/rankorder/rawdata_2250.txt')

ParserError: Error tokenizing data. C error: Expected 4 fields in line 12, saw 5


In [2]:
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
import pandas as pd

electricity_consuming = pd.read_csv('https://www.cia.gov/library/publications/the-world-factbook/rankorder/rawdata_2233.txt', engine='python')
nature_gas_consuming = pd.read_csv('https://www.cia.gov/library/publications/the-world-factbook/rankorder/rawdata_2250.txt', engine='python')

ValueError: Expected 4 fields in line 12, saw 5

In [3]:
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
import pandas as pd

electricity_consuming = pd.read_csv('https://www.cia.gov/library/publications/the-world-factbook/rankorder/rawdata_2233.txt', engine='python', header=None)
nature_gas_consuming = pd.read_csv('https://www.cia.gov/library/publications/the-world-factbook/rankorder/rawdata_2250.txt', engine='python', header=None)

ValueError: Expected 4 fields in line 12, saw 5

In [4]:
energy_consuming = nature_gas_consuming.set_index('country').join(electricity_consuming.set_index('country'))
energy_consuming

NameError: name 'nature_gas_consuming' is not defined

In [5]:
import ssl
ssl._create_default_https_context = ssl._create_unverified_context

import pandas as pd

electricity_consuming = pd.read_csv('https://www.cia.gov/library/publications/the-world-factbook/rankorder/rawdata_2233.txt', 
                                    delimiter=r'\s{2,}', engine='python', header=None,  # fields are separated by two or more whitespace characters
                                    names=['erank', 'country', 'etotal'], thousands=',')
nature_gas_consuming = pd.read_csv('https://www.cia.gov/library/publications/the-world-factbook/rankorder/rawdata_2250.txt', 
                                   delimiter=r'\s{2,}', engine='python', header=None,
                                   names=['nrank', 'country', 'ntotal'], thousands=',')
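The earlier attempts failed because the default comma separator splits both the thousands separators and names such as "Korea, South". The regex separator can be tried offline on a few lines of text. The sample below is hypothetical: it only imitates the assumed layout (rank, country, value, separated by runs of spaces) and is not the actual file content.

```python
import io
import pandas as pd

# Hypothetical sample rows in the assumed layout.
raw = (
    "1      China              5,920,000,000,000\n"
    "2      United States      3,911,000,000,000\n"
    "9      Korea, South       497,000,000,000\n"
)

sample = pd.read_csv(io.StringIO(raw), sep=r'\s{2,}', engine='python',
                     header=None, names=['erank', 'country', 'etotal'],
                     thousands=',')
print(sample)
```

Splitting on two or more whitespace characters keeps the single space inside "United States" and the comma inside "Korea, South" intact, while `thousands=','` turns the grouped digits into integers.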


In [11]:
electricity_consuming

Unnamed: 0,erank,country,etotal
0,1,China,5920000000000
1,2,United States,3911000000000
2,3,European Union,2845000000000
3,4,India,1048000000000
4,5,Japan,933600000000
5,6,Russia,890100000000
6,7,Canada,516600000000
7,8,Germany,514600000000
8,9,"Korea, South",497000000000
9,10,Brazil,460800000000


In [12]:
nature_gas_consuming

Unnamed: 0,nrank,country,ntotal
0,1,United States,773200000000
1,2,European Union,428800000000
2,3,Russia,418900000000
3,4,China,186200000000
4,5,Iran,186000000000
5,6,Japan,123600000000
6,7,Canada,114800000000
7,8,Saudi Arabia,102300000000
8,9,Germany,81350000000
9,10,Mexico,77930000000


### join

In [14]:
energy_consuming = nature_gas_consuming.set_index('country').join(electricity_consuming.set_index('country'))
energy_consuming

Unnamed: 0_level_0,nrank,ntotal,erank,etotal
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
United States,1,773200000000,2.0,3.911000e+12
European Union,2,428800000000,3.0,2.845000e+12
Russia,3,418900000000,6.0,8.901000e+11
China,4,186200000000,1.0,5.920000e+12
Iran,5,186000000000,20.0,2.209000e+11
Japan,6,123600000000,5.0,9.336000e+11
Canada,7,114800000000,7.0,5.166000e+11
Saudi Arabia,8,102300000000,14.0,2.928000e+11
Germany,9,81350000000,8.0,5.146000e+11
Mexico,10,77930000000,16.0,2.452000e+11
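`DataFrame.join` performs a left join on the index by default, so every row of the calling frame is kept and unmatched rows receive NaN. The sketch below uses a three-row subset of the ranks shown above to compare the `how` options; the frame names are illustrative:

```python
import pandas as pd

gas = pd.DataFrame({'country': ['United States', 'Russia', 'Iran'],
                    'nrank': [1, 3, 5]}).set_index('country')
power = pd.DataFrame({'country': ['China', 'United States', 'Russia'],
                      'erank': [1, 2, 6]}).set_index('country')

left = gas.join(power)                 # default how='left': keep every row of gas
inner = gas.join(power, how='inner')   # only countries present in both
outer = gas.join(power, how='outer')   # countries present in either
print(left, inner, outer, sep='\n\n')
```

A NaN introduced by an unmatched row forces the column to a float dtype, which is a likely reason `erank` is displayed as `2.0` rather than `2` in the joined output above.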


In [15]:
energy_consuming.sort_values(by=['etotal', 'ntotal'], ascending=False)

Unnamed: 0_level_0,nrank,ntotal,erank,etotal
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
China,4,186200000000,1.0,5.920000e+12
United States,1,773200000000,2.0,3.911000e+12
European Union,2,428800000000,3.0,2.845000e+12
India,19,47520000000,4.0,1.048000e+12
Japan,6,123600000000,5.0,9.336000e+11
Russia,3,418900000000,6.0,8.901000e+11
Canada,7,114800000000,7.0,5.166000e+11
Germany,9,81350000000,8.0,5.146000e+11
"Korea, South",12,69630000000,9.0,4.970000e+11
Brazil,27,38490000000,10.0,4.608000e+11


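# How to learn and master NumPy, SciPy, and Pandas

### Make use of executability

> StackOverflow, Notebook, Example Section

### Make use of indexes

> index is a tool to map the world

### Make use of criticism

- Spirited defenses
- HackerNews

### Viewpoints

- What + How vs Who + Why
- Elimination as a method of efficiency optimization

### Narrative structure (how to gain technical insight)

- The key figures do not appeal to me?
  - They have probably raised the most valuable questions

- The stage view and the audience view
  - Two genres of historical writing
  - Roadmaps and guides
  
- Facts alone do not matter
  - What guides action must be a conclusion
  - The historical record is the evidence; the view of history is the thesis

### Pattern &  Insight

## Pragmatism toward languages, libraries, and tools

- If you are not sick, you do not need medicine
- Every medicine has side effects
- Information, once taken in, cannot be forgotten
- An illness usually has a longer life cycle than its medicine

## Framework & Libs

- The author's analysis of the market
> When acquiring new users, what comparison do users most often make?
- The author's analysis of user scenarios
> What is the users' biggest pain point?
- How the author manages exit
> Lock-in and network effects

## The author's problem is not your problem, and vice versa

> Everyone in the same ecosystem lives in symbiosis, and co-evolution is a game-theoretic process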