# Understanding Data Types in Python

Effective data-driven science and computation requires understanding how data is stored and manipulated.
This section outlines and contrasts how arrays of data are handled in the Python language itself, and how NumPy improves on this.
Understanding this difference is fundamental to understanding much of the material throughout the rest of the book.

Users of Python are often drawn in by its ease of use, one piece of which is dynamic typing.
While a statically-typed language like C or Java requires each variable to be explicitly declared, a dynamically-typed language like Python skips this specification. For example, in C you might specify a particular operation as follows:

```C
/* C code */
int result = 0;
for(int i=0; i<100; i++){
    result += i;
}
```

While in Python the equivalent operation could be written this way:

```python
# Python code
result = 0
for i in range(100):
    result += i
```

Notice the main difference: in C, the data type of each variable is explicitly declared, while in Python the types are dynamically inferred. This means, for example, that we can assign any kind of data to any variable:

```python
# Python code
x = 4
x = "four"
```

Here we've switched the contents of ``x`` from an integer to a string. The same thing in C would lead (depending on compiler settings) to a compilation error or other unintended consequences:

```C
/* C code */
int x = 4;
x = "four";  // FAILS
```

This sort of flexibility is one piece that makes Python and other dynamically-typed languages convenient and easy to use.
Understanding *how* this works is an important piece of learning to analyze data efficiently and effectively with Python.
But what this type-flexibility also points to is the fact that Python variables are more than just their value; they also contain extra information about the type of the value. We'll explore this more in the sections that follow.

## A Python Integer Is More Than Just an Integer

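Every Python object is a C structure in disguise.
When we write ``x = 10000``, ``x`` is not just a raw integer; it is a pointer to a compound C structure.
In CPython, the C implementation of Python, the integer (``long``) type is effectively defined as follows: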

```C
struct _longobject {
    long ob_refcnt;
    PyTypeObject *ob_type;
    size_t ob_size;
    long ob_digit[1];
};
```

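A single integer therefore contains four pieces:

- ``ob_refcnt``, a reference count used to decide when the memory can be reclaimed
- ``ob_type``, which encodes the type of the variable
- ``ob_size``, which specifies the size of the data members that follow
- ``ob_digit``, which contains the actual integer value

This means that an integer in Python carries extra overhead compared with a raw integer in C: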

![Integer Memory Layout](figures/cint_vs_pyint.png)

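This extra information is what allows Python to be coded so freely and dynamically, but it comes at a cost, which becomes especially apparent when many such objects are combined.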
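
The overhead described above can be observed from within Python. The following is a minimal sketch using ``sys.getsizeof`` from the standard library; the exact byte counts depend on the Python version and platform, so none are quoted here.

```python
import sys

x = 10000

# Size in bytes of the whole integer object (header plus digits),
# not just the raw value
int_object_size = sys.getsizeof(x)

# For comparison, a raw 64-bit C integer needs only 8 bytes
raw_c_int64_size = 8

print(f"Python int object: {int_object_size} bytes")
print(f"Raw 64-bit C integer: {raw_c_int64_size} bytes")

# Larger values need more digits, so the object grows
big_object_size = sys.getsizeof(10 ** 100)
print(f"Python int holding 10**100: {big_object_size} bytes")
```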

## A Python List Is More Than Just a List

The standard mutable multi-element container in Python is the list:

```python
L = list(range(10))
L
```

```
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

```python
type(L[0])
```

```
int
```

Or, similarly, a list of strings:

```python
L2 = [str(c) for c in L]
L2
```

```
['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
```

```python
type(L2[0])
```

```
str
```

Because Python is dynamically typed, we can even create lists of mixed types:

```python
L3 = [True, "2", 3.0, 4]
[type(item) for item in L3]
```

```
[bool, str, float, int]
```

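This flexibility comes at a cost. As the figure below shows, even when all elements of a Python list have the same type, each element is still a complete Python object carrying its own type information and reference count, so that information is stored redundantly.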

![Array Memory Layout](figures/array_vs_list.png)

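As the figure above shows, a NumPy array is essentially a single pointer to one contiguous block of data. A Python list, by contrast, holds a pointer to a block of pointers, each of which points in turn to a full Python object, so reaching a value takes two memory accesses.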
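
The difference in layout can be made concrete with a rough memory estimate. This is only a sketch: ``sys.getsizeof`` reports per-object sizes that vary by Python version, and CPython shares small integer objects, so the list figure is an approximation rather than an exact measurement.

```python
import sys
import numpy as np

n = 1000
py_list = list(range(n))
np_array = np.arange(n)

# The list stores n pointers, and each pointer refers to a separate int object
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(item) for item in py_list)

# The array stores n fixed-size values in one contiguous buffer
array_bytes = np_array.nbytes

print(f"list (container + int objects): {list_bytes} bytes")
print(f"NumPy array data buffer: {array_bytes} bytes")
```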

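## Fixed-Type Arrays in Python

To store fixed-type data efficiently, Python's standard library also provides the ``array`` module, which can be used to store items of a single type contiguously: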

```python
import array
L = list(range(10))
A = array.array('i', L)
A
```

```
array('i', [0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
```

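Here ``'i'`` is a type code indicating that the contents are integers.
NumPy's ``ndarray`` object, however, not only stores array data efficiently but also provides efficient operations on that data.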

```python
import numpy as np
```
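
To illustrate what "efficient operations" means in practice, here is a small sketch contrasting the two containers. It assumes that ``np.array`` can read an ``array.array`` directly (it does so through the buffer protocol); the variable names are chosen for this example only.

```python
import array
import numpy as np

A = array.array('i', range(5))   # fixed-type storage from the standard library
N = np.array(A)                  # the same values as a NumPy ndarray

# On array.array, * means sequence repetition, as it does for lists
repeated = A * 2

# On an ndarray, * is applied element by element
doubled = N * 2

print(repeated)
print(doubled)
```

The ``array`` module solves the storage problem, but arithmetic on its contents still requires an explicit Python loop.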

## Creating Arrays from Python Lists

```python
# integer array:
np.array([1, 4, 2, 5, 3])
```

```
array([1, 4, 2, 5, 3])
```

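Remember that all elements of a NumPy array must have the same type. If the types do not match, NumPy upcasts where possible; here the integers are upcast to floating point: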

```python
np.array([3.14, 4, 2, 3])
```

```
array([ 3.14,  4.  ,  2.  ,  3.  ])
```
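
The upcast can be confirmed by inspecting the ``dtype`` attribute of the resulting arrays. A minimal sketch:

```python
import numpy as np

ints = np.array([1, 4, 2, 5, 3])
mixed = np.array([3.14, 4, 2, 3])

print(ints.dtype)    # an integer type; the exact width depends on the platform
print(mixed.dtype)   # float64

# The integer inputs are now stored as floats
print(mixed[1], type(mixed[1]))
```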

You can also set the data type explicitly with the ``dtype`` keyword:

```python
np.array([1, 2, 3, 4], dtype='float32')
```

```
array([ 1.,  2.,  3.,  4.], dtype=float32)
```
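
The choice of ``dtype`` determines how many bytes each element occupies. The sketch below compares two float widths using the ``itemsize`` and ``nbytes`` attributes, and uses ``astype`` to convert an existing array; the variable names are illustrative only.

```python
import numpy as np

a32 = np.array([1, 2, 3, 4], dtype='float32')
a64 = np.array([1, 2, 3, 4], dtype='float64')

print(a32.itemsize, a32.nbytes)   # 4 bytes per element, 16 bytes in total
print(a64.itemsize, a64.nbytes)   # 8 bytes per element, 32 bytes in total

# astype returns a converted copy; the original array is unchanged
as_int = a64.astype('int16')
print(as_int.dtype)
```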

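Unlike Python lists, NumPy explicitly supports multidimensional arrays: an element is accessed by giving an index for each dimension, for example ``a[1, 5, 2]``. A multidimensional array can be initialized from a list of lists: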

```python
# nested lists result in multi-dimensional arrays
np.array([range(i, i + 3) for i in [2, 4, 6]])
```

```
array([[2, 3, 4],
       [4, 5, 6],
       [6, 7, 8]])
```

The inner lists are treated as the rows of the resulting two-dimensional array.
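
As a short sketch of how the dimensions show up in practice, the array built from nested sequences can be inspected and indexed per dimension (the name ``grid`` is chosen for this example):

```python
import numpy as np

grid = np.array([range(i, i + 3) for i in [2, 4, 6]])

print(grid.ndim)     # 2 dimensions
print(grid.shape)    # (3, 3): three rows, three columns
print(grid[1])       # second row: [4 5 6]
print(grid[1, 2])    # row 1, column 2: 6
```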

## Creating Arrays from Scratch

For larger arrays, it is more efficient to create them from scratch using NumPy's built-in routines:

```python
# Create a length-10 integer array filled with zeros
np.zeros(10, dtype=int)
```

```
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
```

```python
# Create a 3x5 floating-point array filled with ones
np.ones((3, 5), dtype=float)
```

```
array([[1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.]])
```

```python
# Create a 3x5 array filled with 3.14
np.full((3, 5), 3.14)
```

```
array([[3.14, 3.14, 3.14, 3.14, 3.14],
       [3.14, 3.14, 3.14, 3.14, 3.14],
       [3.14, 3.14, 3.14, 3.14, 3.14]])
```

```python
# Create an array filled with a linear sequence
# Starting at 0, ending at 20, stepping by 2
# (this is similar to the built-in range() function)
np.arange(0, 20, 2)
```

```
array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])
```

```python
# Create an array of five values evenly spaced between 0 and 1
np.linspace(0, 1, 5)
```

```
array([0.  , 0.25, 0.5 , 0.75, 1.  ])
```
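
One detail worth noting is how the two sequence routines treat the end value: ``np.arange`` excludes it, like ``range()``, while ``np.linspace`` includes it by default. A minimal sketch over the same interval:

```python
import numpy as np

stepped = np.arange(0, 20, 2)     # stop value 20 is excluded
spaced = np.linspace(0, 20, 11)   # stop value 20 is included

print(stepped[-1])                # 18
print(spaced[-1])                 # 20.0
print(len(stepped), len(spaced))  # 10 11
```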

```python
# Create a 3x3 array of uniformly distributed
# random values between 0 and 1
np.random.random((3, 3))
```

```
array([[0.66184123, 0.68307018, 0.49392886],
       [0.11661599, 0.45578323, 0.6520928 ],
       [0.25467614, 0.7497171 , 0.24714148]])
```

```python
# Create a 3x3 array of normally distributed random values
# with mean 0 and standard deviation 1
np.random.normal(0, 1, (3, 3))
```

```
array([[ 1.51772646,  0.39614948, -0.10634696],
       [ 0.25671348,  0.00732722,  0.37783601],
       [ 0.68446945,  0.15926039, -0.70744073]])
```

```python
# Create a 3x3 array of random integers in the interval [0, 10)
np.random.randint(0, 10, (3, 3))
```

```
array([[5, 8, 4],
       [9, 0, 5],
       [4, 9, 7]])
```
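
The random arrays above differ on every run. If reproducible values are needed, one option is NumPy's ``Generator`` interface, created with ``np.random.default_rng``. This is not the approach used in the examples above; it is a sketch that assumes NumPy 1.17 or later, and the seed value is arbitrary.

```python
import numpy as np

seed = 42  # arbitrary value chosen for this illustration

rng_a = np.random.default_rng(seed)
rng_b = np.random.default_rng(seed)

uniform_a = rng_a.random((3, 3))          # uniform on [0, 1)
uniform_b = rng_b.random((3, 3))

normal_a = rng_a.normal(0, 1, (3, 3))     # mean 0, standard deviation 1
ints_a = rng_a.integers(0, 10, (3, 3))    # integers in [0, 10)

# Same seed, same sequence of draws
print(np.array_equal(uniform_a, uniform_b))   # True
```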

```python
# Create a 3x3 identity matrix
np.eye(3)
```

```
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])
```

```python
# Create an uninitialized 3x4 array (floating point by default)
# The values will be whatever happens to already exist at that memory location
np.empty((3, 4))
```

```
array([[4.65401464e-310, 0.00000000e+000, 0.00000000e+000,
        0.00000000e+000],
       [0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
        0.00000000e+000],
       [0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
        0.00000000e+000]])
```

## NumPy Standard Data Types

The standard data types supported by NumPy are listed in the table below.

When constructing an array, the type can be specified using a string:
```python
np.zeros(10, dtype='int16')
```

or using the corresponding NumPy object:

```python
np.zeros(10, dtype=np.int16)
```

| Data type     | Description |
|---------------|-------------|
| ``bool_``     | Boolean (True or False) stored as a byte |
| ``int_``      | Default integer type (same as C ``long``; normally either ``int64`` or ``int32``) |
| ``intc``      | Identical to C ``int`` (normally ``int32`` or ``int64``) |
| ``intp``      | Integer used for indexing (same as C ``ssize_t``; normally either ``int32`` or ``int64``) |
| ``int8``      | Byte (-128 to 127) |
| ``int16``     | Integer (-32768 to 32767) |
| ``int32``     | Integer (-2147483648 to 2147483647) |
| ``int64``     | Integer (-9223372036854775808 to 9223372036854775807) |
| ``uint8``     | Unsigned integer (0 to 255) |
| ``uint16``    | Unsigned integer (0 to 65535) |
| ``uint32``    | Unsigned integer (0 to 4294967295) |
| ``uint64``    | Unsigned integer (0 to 18446744073709551615) |
| ``float_``    | Shorthand for ``float64`` (alias removed in NumPy 2.0) |
| ``float16``   | Half precision float: sign bit, 5 bits exponent, 10 bits mantissa |
| ``float32``   | Single precision float: sign bit, 8 bits exponent, 23 bits mantissa |
| ``float64``   | Double precision float: sign bit, 11 bits exponent, 52 bits mantissa |
| ``complex_``  | Shorthand for ``complex128`` (alias removed in NumPy 2.0) |
| ``complex64`` | Complex number, represented by two 32-bit floats |
| ``complex128``| Complex number, represented by two 64-bit floats |
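
The ranges and bit counts in the table do not need to be memorized; NumPy can report them through ``np.iinfo`` (integer types) and ``np.finfo`` (floating-point types). A minimal sketch covering a few of the types listed:

```python
import numpy as np

# Integer limits
for int_type in (np.int8, np.int16, np.uint8, np.uint16):
    info = np.iinfo(int_type)
    print(f"{int_type.__name__}: {info.min} to {info.max}")

# Floating-point layout
for float_type in (np.float16, np.float32, np.float64):
    info = np.finfo(float_type)
    print(f"{float_type.__name__}: {info.nexp} exponent bits, {info.nmant} mantissa bits")
```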

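More advanced type specifications are also possible, such as choosing big-endian or little-endian byte order, or defining compound data types.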
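
As a brief sketch of both ideas: a byte-order prefix in a dtype string selects the endianness, and a list of ``(name, type)`` pairs defines a compound (structured) type. The field names and values below are invented for this illustration.

```python
import numpy as np

# '>' requests big-endian and '<' little-endian byte order; 'i4' is a 4-byte integer
big = np.array([1, 2, 3], dtype='>i4')
little = np.array([1, 2, 3], dtype='<i4')

# Same values, different byte layout in memory
print(big.tobytes().hex())
print(little.tobytes().hex())

# A compound (structured) type; the field names are made up for this example
person_type = np.dtype([('name', 'U10'), ('age', 'i4'), ('weight', 'f8')])
people = np.zeros(2, dtype=person_type)
people['name'] = ['Alice', 'Bob']
people['age'] = [25, 45]

print(people['age'])
print(people[0])
```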
