### Array

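Starting with this chapter, I use Python to review common data structures, mainly following Goodrich's *Data Structures & Algorithms*. The series covers, in order, arrays, stacks, queues, deques, linked lists, trees, priority queues, maps, and other common data structures.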

Python has built-in sequence classes such as list, tuple, and str. Indexing with seq[k] is common to all of them, and each is built on a low-level data structure: the array.

#### Low-level arrays
1. The primary memory of a computer is composed of bits of information, and those bits are typically grouped into larger units that depend upon the precise system architecture. Such a typical unit is a **byte**, which is equivalent to 8 bits.  
A byte is the smallest addressable unit of memory on typical machines; one byte holds 8 bits of information.
2. To keep track of what information is stored in what byte, the computer uses an abstraction known as a memory address.  
Memory addresses are used to keep track of bytes.
3. Any byte of the main memory can be efficiently accessed based upon its memory address. In this sense, we say that a computer's main memory performs as random access memory (RAM). That is, it is just as easy to retrieve byte \#8675309 as it is to retrieve byte \#309. Any individual byte of memory can be stored or retrieved in O(1) time.  
The time to fetch a byte by its address is constant no matter how large the address is (real hardware involves caches, which are not discussed here), which is why main memory is called random access memory.

A group of related variables can be stored one after another in a contiguous portion of the computer's memory. We will denote such a representation as an array.  
In short, an array is a group of related variables placed in a contiguous region of memory.
![](http://i.imgur.com/U0YneQY.png)


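We will refer to each location within an array as a **cell**, and will use an integer **index** to describe its location within the array, with cells numbered starting with 0, 1, 2, and so on.  
* A cell is one unit of the array; it may occupy a single byte or several consecutive bytes.  
* The index identifies a cell's position within the array.  
* The address of each element is therefore start + cellsize\*index. In the figure above, "L" (index 4, with 2-byte cells starting at address 2146) is located at 2146+2\*4=2154.

The string "SAMPLE" is laid out in an array as follows: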
![](http://i.imgur.com/fDSLk2w.png)
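
The address formula above can be sketched in a few lines. The helper name `cell_address` is hypothetical, and the start address 2146 and cell size 2 are simply the values used in the example; real addresses are chosen by the system.

```python
def cell_address(start, cell_size, index):
    """Address of the cell at `index` in an array that begins at `start`."""
    return start + cell_size * index

word = "SAMPLE"
start, cell_size = 2146, 2  # values taken from the example above

# Print the (illustrative) address of every character in the array.
for i, ch in enumerate(word):
    print(ch, cell_address(start, cell_size, i))
```

Because the computation is a single multiplication and addition, any cell can be located in constant time regardless of its index.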


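##### Referential arrays
The data held by each cell may vary in size, and padding every cell to the largest size would waste a lot of memory. A better solution is to store a reference to the data in each cell. Python's list and tuple work this way.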

The fact that lists and tuples are referential structures is significant to the semantics of these classes. A single list instance may include multiple references to the same object as elements of the list, and it is possible for a single object to be an element of two or more lists, as those lists simply store references back to that object. As an example, when you compute a slice of a list, the result is a new list instance, but that new list has references to the same elements that are in the original list, as portrayed in Figure 5.5.  
It is precisely because lists and tuples are referential that slicing can be implemented cheaply and that the same object can be an element of several lists.
![](http://i.imgur.com/wrzuDyi.png)
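
A minimal sketch of this sharing, using illustrative variable names (`rows`, `part`) that are not from the original text:

```python
rows = [[1, 2], [3, 4], [5, 6]]
part = rows[1:3]              # the slice is a new list object ...

print(part is rows)           # False: two distinct list instances
print(part[0] is rows[1])     # True: both refer to the same inner list

part[0].append(99)            # mutate the shared element through the slice
print(rows[1])                # the change is visible through the original list

part[1] = "replaced"          # rebinding a cell of the slice ...
print(rows[2])                # ... leaves the original list's cell untouched
```

Mutating a shared element is visible through both lists, while assigning to a cell only changes the reference stored in that one list.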

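A shallow copy in Python creates a new list object that holds the same references as the original. A deep copy creates a new list object and recursively copies the objects it refers to, so mutable elements become new, independent objects, while immutable elements may still be shared.  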
> http://www.cnblogs.com/wilber2013/p/4645353.html 

![](http://i.imgur.com/fzP5nEQ.png)
![](http://i.imgur.com/lDqKTqx.png)
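
The difference can be checked with the standard `copy` module. This is a small sketch with made-up data, not code from the referenced article:

```python
import copy

original = [[1, 2], "text"]
shallow = copy.copy(original)      # new outer list, same element references
deep = copy.deepcopy(original)     # new outer list, mutable elements copied too

print(shallow is original)         # False: the outer list is new
print(shallow[0] is original[0])   # True: the inner list is shared
print(deep[0] is original[0])      # False: the inner list was duplicated

original[0].append(3)              # mutate the inner list through the original
print(shallow[0])                  # the shallow copy sees the change
print(deep[0])                     # the deep copy does not
```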

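##### Compact arrays
A compact array is the kind described at the start of this chapter: the array cells store the bits of the data directly. Its advantages are:
1. Lower memory usage. A referential structure typically spends 64 bits on each stored address. In theory, one million 64-bit integers need only 64 million bits, but a Python list typically uses four to five times as much, because each list cell holds a 64-bit address and each int object lives elsewhere in memory with its own overhead. The reference text reports about 14 bytes per int object on its system, more than the 8 bytes of a raw 64-bit value; the exact figure depends on the Python build and can be checked with `sys.getsizeof`.
2. The primary data are stored consecutively in memory, which works well with caching.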

Python's standard library provides the `array` module, which offers compact arrays of C-style primitive types.
> https://docs.python.org/3/library/array.html
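
As a rough illustration, the sketch below compares a list of integers with an `array` of type code `'q'` (signed 64-bit integers). The estimate for the list adds the size of every int object, which over-counts when small ints are shared, so treat the numbers as indicative only; they vary by platform and Python version.

```python
import sys
from array import array

n = 1000
values = list(range(n))
compact = array("q", values)   # 'q' = signed 64-bit integers (C long long)

# Bytes needed for the raw data in the compact array.
compact_bytes = compact.itemsize * len(compact)

# Rough estimate for the list: the reference array plus every int object.
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

print(compact.itemsize, compact_bytes, list_bytes)
```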

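#### Dynamic arrays and amortization
A low-level array cannot change size once it has been created, which is fine for immutable Python instances such as tuple and str. A Python list, however, can grow. Its implementation relies on a dynamic array, whose underlying storage is enlarged as elements are added: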

In [1]:
import sys
data=[]
for k in range(10):
    a=len(data)              # number of elements stored
    b=sys.getsizeof(data)    # bytes used by the list object itself
    print("length:{0:3};size in bytes:{1:4} ".format(a,b))
    data.append(None)

length:  0;size in bytes:  64 
length:  1;size in bytes:  96 
length:  2;size in bytes:  96 
length:  3;size in bytes:  96 
length:  4;size in bytes:  96 
length:  5;size in bytes: 128 
length:  6;size in bytes: 128 
length:  7;size in bytes: 128 
length:  8;size in bytes: 128 
length:  9;size in bytes: 192 
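
To see only the moments at which the underlying array grows, the experiment can be condensed as follows. The helper `growth_points` is a hypothetical name introduced here, and the sizes it reports depend on the Python version and platform.

```python
import sys

def growth_points(n):
    """Return (length, size_in_bytes) pairs at which the list's reported size changed."""
    data = []
    points = []
    last = sys.getsizeof(data)
    for _ in range(n):
        data.append(None)
        size = sys.getsizeof(data)
        if size != last:              # the underlying array was reallocated
            points.append((len(data), size))
            last = size
    return points

for length, size in growth_points(64):
    print(length, size)
```

Far fewer lines are printed than elements appended, which shows that most appends reuse spare capacity that was reserved during an earlier resize.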


#### Implementing a dynamic array

In [7]:
import ctypes

class DynamicArray:
    def __init__(self):
        self._n=0
        self._capacity=1
        self._A=self._make_array(self._capacity)
    
    def __len__(self):
        return self._n
    
    def __getitem__(self,k):
        if not 0<=k<self._n:
            raise IndexError("invalid index")
        else:
            return self._A[k]
        
    def append(self,obj):
        if self._n==self._capacity:
            self._resize(self._capacity*2)
        self._A[self._n]=obj
        self._n+=1
    
    def remove(self,value):
        for i in range(self._n):
            if self._A[i]==value:
                for j in range(i,self._n-1):
                    self._A[j]=self._A[j+1]
                # clear the vacated cell to help garbage collection
                self._A[self._n-1]=None
                self._n-=1
                return
        raise ValueError("value not found") 
        
    def _resize(self,c):
        B=self._make_array(c)
        for i in range(self._n):
            B[i]=self._A[i]
        self._A=B
        self._capacity=c
            
    
    def _make_array(self,c):
        # use ctypes to create a low-level array of object references with capacity c
        return (c* ctypes.py_object)()
    
A=DynamicArray()
A.append(1)
print(len(A),A[0])
A.remove(1)
print(len(A))

1 1
0


#### Amortized analysis of dynamic arrays
Proposition 1: Let S be a sequence implemented by means of a dynamic array with initial capacity one, using the strategy of doubling the array size when full. The total time to perform a series of n append operations in S, starting from S being empty, is $O(n)$.  
If the capacity is doubled every time a resize is needed, n appends take $O(n)$ time in total, which is $O(1)$ amortized time per append.
![Screenshot from 2017-03-23 16-27-21.png](https://ooo.0o0.ooo/2017/03/23/58d3881fa3dc6.png)
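
The bound follows from a geometric sum. As a sketch, assume each append costs one unit and each resize costs one unit per element copied (a cost model assumed here for illustration). With initial capacity 1 and doubling, successive resizes copy $1, 2, 4, \dots$ elements, so the total cost $T(n)$ of $n$ appends satisfies

$$
T(n) = n + \sum_{i=0}^{\lceil \log_2 n \rceil - 1} 2^i = n + 2^{\lceil \log_2 n \rceil} - 1 < n + 2n = 3n ,
$$

which is $O(n)$ in total and $O(1)$ amortized per append.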

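Proposition 2: Performing a series of n append operations on an initially empty dynamic array using a fixed increment with each resize takes $\Omega(n^2)$ time.  
If the capacity grows by a fixed increment at each resize, n appends take $\Omega(n^2)$ time in total, which is $\Omega(n)$ amortized time per append.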
![Screenshot from 2017-03-23 16-36-33.png](https://ooo.0o0.ooo/2017/03/23/58d38a1908f08.png)
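
The two growth strategies can be compared by counting how many element copies the resizes perform. This is a simulation sketch under the assumption that copies dominate the cost; `count_copies` is a hypothetical helper and the increment of 4 is an arbitrary choice.

```python
def count_copies(n, grow):
    """Count element copies made by n appends; grow(capacity) returns the new capacity."""
    capacity, size, copies = 1, 0, 0
    for _ in range(n):
        if size == capacity:
            copies += size            # a resize copies every stored element
            capacity = grow(capacity)
        size += 1
    return copies

n = 1024
doubling = count_copies(n, lambda c: 2 * c)   # geometric growth
fixed = count_copies(n, lambda c: c + 4)      # fixed increment (4 is arbitrary)
print(doubling, fixed)
```

With doubling, the number of copies stays below 2n, whereas with a fixed increment it grows roughly quadratically in n.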

#### Efficiency of Python list and tuple operations
Nonmutating operations
![Screenshot from 2017-03-23 16-45-15.png](https://ooo.0o0.ooo/2017/03/23/58d38b342560b.png)
Mutating operations
![Screenshot from 2017-03-23 16-47-14.png](https://ooo.0o0.ooo/2017/03/23/58d38ba464163.png)

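`len(data)`, `data[i]`, `data[i] = val`, `append()`, and `pop()` (from the end) are constant-time operations; for `append()` and `pop()` the bound is amortized.  
Building a list with a list comprehension is noticeably faster than building it with repeated calls to `append`.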
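
A quick way to check the last claim on your own machine is a `timeit` comparison. The function names and the sizes below are illustrative choices, and the measured times will differ between systems.

```python
import timeit

def squares_append(n):
    """Build the list with repeated append calls."""
    result = []
    for k in range(n):
        result.append(k * k)
    return result

def squares_comprehension(n):
    """Build the same list with a list comprehension."""
    return [k * k for k in range(n)]

t_append = timeit.timeit(lambda: squares_append(10_000), number=20)
t_comp = timeit.timeit(lambda: squares_comprehension(10_000), number=20)
print("append: {0:.4f}s  comprehension: {1:.4f}s".format(t_append, t_comp))
```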

#### Python's string class
 
Finding a pattern of length m within a longer string of length n: a naive implementation of string pattern matching takes $O(mn)$ time.
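
For illustration, here is a sketch of the naive approach that the $O(mn)$ bound refers to. The function name `find_naive` is hypothetical; Python's built-in `str.find` is a separate, optimized implementation.

```python
def find_naive(text, pattern):
    """Return the lowest index where pattern occurs in text, or -1 if absent."""
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):        # try every possible starting position
        k = 0
        while k < m and text[i + k] == pattern[k]:
            k += 1                    # extend the match one character at a time
        if k == m:                    # all m characters matched
            return i
    return -1

print(find_naive("abracadabra", "cad"))
```

The outer loop runs up to n - m + 1 times and the inner loop up to m times, which gives the $O(mn)$ worst case.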

In [23]:
%%timeit -n10
letters=""
document="aqweafafafasfafaqqqqqqqqqqqqqqqqqqqqqqqwqwwwwwwwwwwwwwwwwww\
rdfsdasssssssssssssssssssssssssssssssssssssssssssssssssssssssssssaaas\
hftujgddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddk\
as"
for c in document:
    if c.isalpha():
        letters+=c

10 loops, best of 3: 22.8 µs per loop


In [24]:
%%timeit -n10
temp=[]
for c in document:
    if c.isalpha():
        temp.append(c)
letters="".join(temp)

10 loops, best of 3: 2.58 µs per loop


In [33]:
%%timeit -n10
# no temporary list is needed
letters="".join(c for c in document if c.isalpha())

10 loops, best of 3: 2.48 µs per loop


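Strings are immutable, so appending to one actually creates a new string instance. Each append can cost time proportional to the current length, so building a string of length n by repeated concatenation can take $O(n^2)$ time in the worst case, although some Python implementations optimize this pattern.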

#### Multidimensional arrays (matrices)

In [34]:
# wrong: this produces one flat list of 18 zeros
data = ([0]*6)*3
print(data)

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


In [37]:
# still wrong: the outer list holds three references to the same inner list
data=[[0]*6]*3
print(data)
data[0][0]=1
print(data)

[[0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0]]
[[1, 0, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0]]


![Screenshot from 2017-03-23 17-53-50.png](https://ooo.0o0.ooo/2017/03/23/58d39b414935d.png)
Written this way, all three elements of the outer list refer to the same inner list object.

In [38]:
# correct: every inner list must be an independent instance
data=[[0]*6 for i in range(3)]
print(data)
data[0][0]=1
print(data)

[[0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0]]
[[1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0]]
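
The aliasing can be confirmed directly with the `is` operator. A short sketch with smaller dimensions and illustrative names (`aliased`, `independent`):

```python
aliased = [[0] * 3] * 2                      # two references to one inner list
independent = [[0] * 3 for _ in range(2)]    # two separate inner lists

print(aliased[0] is aliased[1])              # True: same object
print(independent[0] is independent[1])      # False: distinct objects

aliased[0][0] = 1
independent[0][0] = 1
print(aliased)        # the write shows up in both rows
print(independent)    # only the first row changes
```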
