### Common Search Methods

#### Sequential Search

Starting at the first item in the list, we simply move from item to item, following the underlying 
sequential ordering until we either find what we are looking for or run out of items. 
If we run out of items, we have discovered that the item we were searching for was not present.
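In short: start at the first element and check each index in turn until the item is found or the end of the list is reached.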

![Sequential search](http://interactivepython.org/courselib/static/pythonds/_images/seqsearch.png)

In [1]:
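def sequentialSearch(alist, item):
    pos = 0
    finditem = False
    # Stop as soon as the item is found or the end of the list is reached
    while pos < len(alist) and not finditem:
        if alist[pos] == item:
            finditem = True
        pos += 1
    return finditem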
print(sequentialSearch([1,2,3,4],3))
print(sequentialSearch([1,2,3,4],5))

True
False


Comparisons needed:
* Item present: best case 1; worst case n; average n/2
* Item absent: best case n; worst case n; average n
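
These counts can be checked empirically. The sketch below uses a hypothetical helper, `count_comparisons`, which is not part of the original notes; it assumes the search stops as soon as the item is found.

```python
def count_comparisons(alist, item):
    """Return (found, comparisons) for an early-exit sequential search."""
    comparisons = 0
    for value in alist:
        comparisons += 1
        if value == item:
            return True, comparisons
    return False, comparisons

data = list(range(1, 101))  # n = 100

best = count_comparisons(data, 1)[1]      # item in the first position
worst = count_comparisons(data, 100)[1]   # item in the last position
absent = count_comparisons(data, 0)[1]    # item not in the list
# Average over every possible position of a present item
average = sum(count_comparisons(data, x)[1] for x in data) / len(data)

print(best, worst, absent, average)  # 1 100 100 50.5
```

The exact average for a present item is (n + 1)/2, which is approximately n/2 for large n.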

##### Sequential Search on an Ordered List

Assume that the list of items was constructed so that the items were in ascending order, from low to high. If the item we are looking for is present in the list, the chance of it being in any one of the n positions is still the same as before. However, if the item is not present there is a slight advantage. 
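In short: when the list is sorted, the search can stop early. Once an element larger than the target is seen, the target cannot appear later in the list.
![Sequential search on an ordered list](http://interactivepython.org/courselib/static/pythonds/_images/seqsearch2.png)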

In [2]:
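def sequentialSearch2(alist, item):
    pos = 0
    finditem = False
    while pos < len(alist) and not finditem:
        if alist[pos] == item:
            finditem = True
        # The list is sorted, so once the current element exceeds the target,
        # every later element does too: stop searching
        elif alist[pos] > item:
            break
        pos += 1
    return finditem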
print(sequentialSearch2([1,2,3,4],3))
print(sequentialSearch2([1,2,3,4],5))

True
False


Comparisons needed:
* Item present: best case 1; worst case n; average n/2
* Item absent: best case 1; worst case n; average n/2

#### Binary Search

A binary search will start by examining the middle item. If that item is the one we are searching for, we are done. If it is not the correct item, we can use the ordered nature of the list to eliminate half of the remaining items. If the item we are searching for is greater than the middle item, we know that the entire lower half of the list as well as the middle item can be eliminated from further consideration. The item, if it is in the list, must be in the upper half.
In short: binary search starts by checking the middle element. If the target is greater than the middle element we search the upper half; otherwise we search the lower half.
![Binary search](http://interactivepython.org/courselib/static/pythonds/_images/binsearch.png)

In [3]:
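def binarySearch(alist, item):
    lo = 0
    hi = len(alist) - 1
    found = False
    # lo <= hi (not lo < hi) so that a one-element list is still examined;
    # `not found` ends the loop as soon as the item is located
    while lo <= hi and not found:
        mid = (lo + hi) // 2
        if alist[mid] == item:
            found = True
        elif alist[mid] > item:
            # item is smaller than the middle element: search the lower half
            hi = mid - 1
        else:
            # item is larger than the middle element: search the upper half
            lo = mid + 1
    return found
print(binarySearch([1,2,3,4],3))
print(binarySearch([3],3))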

True
True


A recursive binary search using the divide-and-conquer strategy:

In [4]:
def binarySearch2(alist, item):
    if len(alist) == 0:
        return False
    else:
        mid = len(alist) // 2
        if alist[mid] == item:
            return True
        elif alist[mid] > item:
            # item is smaller than the middle element: search the lower half
            return binarySearch2(alist[:mid], item)
        else:
            # item is larger than the middle element: search the upper half
            return binarySearch2(alist[mid+1:], item)
print(binarySearch2([1,2,3,4],3))
print(binarySearch2([3],3))

True
True


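Time complexity: O(log n) comparisons.
Note that binarySearch2 slices the list on every recursive call. In Python, a slice of length k is an O(k) operation because it copies the elements, so the slicing noticeably increases the running time and the function is not strictly logarithmic.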
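
For comparison, Python's standard library provides `bisect.bisect_left`, which binary-searches a sorted list by index without slicing. The wrapper `bisect_search` below is a hypothetical helper written for illustration, a sketch rather than the method used in these notes.

```python
from bisect import bisect_left

def bisect_search(alist, item):
    """Return True if item is in the sorted list alist."""
    # bisect_left returns the leftmost index at which item could be inserted
    # while keeping alist sorted; if item is present, it sits at that index.
    pos = bisect_left(alist, item)
    return pos < len(alist) and alist[pos] == item

print(bisect_search([1, 2, 3, 4], 3))  # True
print(bisect_search([1, 2, 3, 4], 5))  # False
print(bisect_search([], 1))            # False
```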

A binary search that passes indices instead of slicing:

In [5]:
class binarySearch3:
    def __init__(self, alist, item):
        self.alist = alist
        self.item = item
    def binarySearch(self):
        # hi is the index of the last element, so the range [lo, hi] is inclusive
        return self._binarySearch(0, len(self.alist) - 1)
    def _binarySearch(self, lo, hi):
        # Empty range: check this before indexing to avoid going out of bounds
        if lo > hi:
            return False
        mid = (lo + hi) // 2
        if self.alist[mid] == self.item:
            return True
        elif self.alist[mid] > self.item:
            return self._binarySearch(lo, mid - 1)
        else:
            return self._binarySearch(mid + 1, hi)

b=binarySearch3([1,2,3,4],4)
print(b.binarySearch())
b=binarySearch3([4],3)
print(b.binarySearch())

True
False


#### Hashing: Hash Tables with O(1) Search

A hash table is a collection of items which are stored in such a way as to make it easy to find them later. Each position of the hash table, often called a slot, can hold an item and is named by an integer value starting at 0. In other words, a hash table is a set of sequentially numbered slots, each of which can hold one item.
![Hash table](http://interactivepython.org/courselib/static/pythonds/_images/hashtable.png)

A hash function maps each item to a slot. Assume the items are integers and the hash function is the remainder method, h(item) = item % 11. The mapping is then:

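| Item | Hash Value |
|:-------:|:---------:|
| 54 | 10 |
| 26 | 4 |
| 93 | 5 |
| 17 | 6 |
| 77 | 0 |
| 31 | 9 |

The load factor of a hash table is the fraction of its slots that are occupied: load factor = number of items / table size.
![Load factor of the hash table](http://interactivepython.org/courselib/static/pythonds/_images/hashtable2.png)
Here 6 of the 11 slots are occupied, so the load factor is 6/11.

To search for an item, we compute its hash value and look in that slot, so the search time is constant, O(1), and does not grow with the number of items (as long as there are no collisions).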
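
The mapping and load factor above can be reproduced with a few lines of Python. This is a minimal sketch; the names `remainder_hash` and `contains` are illustrative assumptions, and the lookup only works here because this data set has no collisions.

```python
def remainder_hash(item, tablesize):
    """Remainder method: map an integer item to a slot index."""
    return item % tablesize

items = [54, 26, 93, 17, 77, 31]
tablesize = 11

slots = [None] * tablesize
for item in items:
    slots[remainder_hash(item, tablesize)] = item  # no collisions in this data set

load_factor = len(items) / tablesize

def contains(slots, item):
    """A single slot lookup, independent of how many items are stored."""
    return slots[remainder_hash(item, len(slots))] == item

print(slots)        # [77, None, None, None, 26, 93, 17, None, None, 31, 54]
print(load_factor)  # 6/11, about 0.545
print(contains(slots, 93), contains(slots, 20))  # True False
```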

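The choice of hash function is the key to a hash table. When the hash function produces the same value for two different items, that is, hash(item1) = hash(item2), this is called a collision.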

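##### Common Hash Functions
* Folding method: split the item into equal-size pieces, add the pieces, and take the remainder. For example, splitting the phone number "436-555-4601" into groups of two digits gives 43, 65, 55, 46, 01. These sum to 210, and 210 % 11 = 1, so "436-555-4601" maps to slot 1. In essence, folding condenses a long item into a small number that can then be mapped to a slot.
* Mid-square method: square the item, extract some of the middle digits, then take the remainder. For example, 44 -> 44 \* 44 = 1936 -> middle digits 93 -> 93 % 11 = 5.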
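
Both methods can be sketched in Python as follows. The function names, the `group` parameter, and the rule for picking the two middle digits are assumptions made for illustration; other conventions are equally valid.

```python
def fold_hash(digits, tablesize, group=2):
    """Folding method: split a digit string into groups, sum them, take the remainder."""
    digits = digits.replace("-", "")
    pieces = [int(digits[i:i + group]) for i in range(0, len(digits), group)]
    return sum(pieces) % tablesize

def mid_square_hash(item, tablesize):
    """Mid-square method: square the item, keep two middle digits, take the remainder."""
    squared = str(item * item)
    mid = len(squared) // 2
    middle = squared[max(mid - 1, 0):mid + 1]  # assumption: two digits around the centre
    return int(middle) % tablesize

print(fold_hash("436-555-4601", 11))  # 43 + 65 + 55 + 46 + 1 = 210, and 210 % 11 = 1
print(mid_square_hash(44, 11))        # 44 * 44 = 1936, middle "93", and 93 % 11 = 5
```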

For strings, take the ordinal value of each character, add them up, and take the remainder.
![Hashing a string](http://interactivepython.org/courselib/static/pythonds/_images/stringhash.png)

In [6]:
def hash(astring, tablesize):  # note: this name shadows Python's built-in hash()
    total = 0
    # Add up the ordinal value of every character
    for ch in astring:
        total += ord(ch)
    return total % tablesize
print(hash("cat",11))

4


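However, this method has a weakness: anagrams (different orderings of the same characters) always produce the same hash value.
An improvement is to weight each character by its position.
![Position-weighted string hash](http://interactivepython.org/courselib/static/pythonds/_images/stringhash2.png)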

In [7]:
def hash2(astring, tablesize):
    total = 0
    # Weight each character's ordinal value by its 1-based position
    for pos, ch in enumerate(astring):
        total += ord(ch) * (pos + 1)
    return total % tablesize
print(hash2("cat",11))

3


##### Collision Resolution


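###### Open Addressing
A simple way to do this is to start at the original hash value position and then move in a sequential manner through the slots until we encounter the first slot that is empty.  
In short: walk along the slot indices until an empty slot is found, then insert the item there.  
For example, 20 % 11 = 9 and 31 % 11 = 9, so 20 collides with 31. Linear probing starts at slot 9 and checks slots 10, 0, 1 and 2, which are all occupied in the figure, before finding slot 3 empty.
![Collision resolved by linear probing](http://interactivepython.org/courselib/static/pythonds/_images/linearprobing1.png)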

The drawback of this method is clustering: when many collisions occur at the same hash value, a large run of adjacent slots fills up.
![clustering](http://interactivepython.org/courselib/static/pythonds/_images/clustering.png)

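One way to reduce clustering is rehashing with a skip: instead of probing the very next slot, jump ahead by several slots each time.  
rehash(pos) = (pos + skip) % sizeoftable  
Plain linear probing uses skip = 1.  
Rehashing with a skip uses skip > 1. Note: the skip must be chosen so that every slot is eventually visited. Making the table size a prime number guarantees this; otherwise some slots may never be probed and that space is wasted.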
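
The effect of the skip and the table size can be seen by listing the slots a probe visits. `probe_sequence` is a hypothetical helper, not part of the original notes; it repeatedly applies the rehash formula until a slot repeats.

```python
def probe_sequence(start, skip, tablesize):
    """List the slots visited by repeated rehash(pos) = (pos + skip) % tablesize."""
    seen = []
    pos = start
    while pos not in seen:
        seen.append(pos)
        pos = (pos + skip) % tablesize
    return seen

print(probe_sequence(9, 1, 11))  # [9, 10, 0, 1, 2, 3, 4, 5, 6, 7, 8]
print(probe_sequence(9, 3, 11))  # [9, 1, 4, 7, 10, 2, 5, 8, 0, 3, 6]
print(probe_sequence(9, 3, 12))  # [9, 0, 3, 6]
```

With a prime table size of 11, a skip of 3 still reaches all 11 slots. With a table size of 12, which shares the factor 3 with the skip, only 4 of the 12 slots are ever probed.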


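###### Quadratic Probing
A variation of linear probing in which the skip is not constant: the offsets from the original hash value h are successive perfect squares, so the probes land at h+1, h+4, h+9, h+16, and so on (modulo the table size).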
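
As a small illustration, the helper below (a hypothetical name, `quadratic_probes`) lists the first few slots probed after a collision, assuming the i-th probe is at (h + i²) % tablesize.

```python
def quadratic_probes(h, tablesize, count):
    """Return the first `count` slots probed after a collision at slot h."""
    return [(h + i * i) % tablesize for i in range(1, count + 1)]

# Starting from slot 9 in a table of size 11: 9+1, 9+4, 9+9, 9+16 (mod 11)
print(quadratic_probes(9, 11, 4))  # [10, 2, 7, 3]
```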

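###### Chaining
Each slot holds a reference to a chain (for example a linked list) of items. When a collision occurs, the new item is added to the chain at that slot.
![Chaining](http://interactivepython.org/courselib/static/pythonds/_images/chaining.png)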
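
A minimal sketch of chaining follows. The class name `ChainedHashTable` is an assumption, and Python lists stand in for linked lists to keep the example short.

```python
class ChainedHashTable:
    """Hash table that resolves collisions by chaining (illustrative sketch)."""

    def __init__(self, size=11):
        self.size = size
        # Each slot holds a list of (key, value) pairs that hash to it.
        self.slots = [[] for _ in range(size)]

    def put(self, key, val):
        chain = self.slots[key % self.size]
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, val)  # key already present: replace its value
                return
        chain.append((key, val))

    def get(self, key):
        for k, v in self.slots[key % self.size]:
            if k == key:
                return v
        raise KeyError(key)

table = ChainedHashTable()
table.put(20, "twenty")
table.put(31, "thirty-one")  # 20 and 31 both hash to slot 9
print(table.slots[9])        # [(20, 'twenty'), (31, 'thirty-one')]
print(table.get(31))         # thirty-one
```

Lookups now scan only the chain in one slot, so the cost depends on the chain length rather than on the table size.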

#### Implementing a Hash Table
* Map(): implemented with two lists, one storing the keys and one storing the values
* put(key,val): `__setitem__`
* get(key): `__getitem__`
* del: `__delitem__`
* len(): `__len__`
* in: `__contains__`

In [11]:
class Hashtable:
    def __init__(self):
        self.size = 11
        # stores the keys
        self.slots = [None] * self.size
        # stores the values
        self.data = [None] * self.size

    def hashfunc(self, key):
        return key % self.size

    def rehash(self, oldhash, skip):
        return (oldhash + skip) % self.size

    def put(self, key, val):
        start = self.hashfunc(key)
        index = start
        # Linear probing: advance until we reach an empty slot or the same key
        while self.slots[index] is not None and self.slots[index] != key:
            index = self.rehash(index, 1)
            # Back at the starting slot: every slot is occupied by another key
            if index == start:
                raise Exception("hash table is full")
        self.slots[index] = key
        self.data[index] = val

    def get(self, key):
        start = self.hashfunc(key)
        index = start
        # Follow the same probe sequence as put; an empty slot means the key is absent
        while self.slots[index] is not None:
            if self.slots[index] == key:
                return self.data[index]
            index = self.rehash(index, 1)
            if index == start:
                break
        raise KeyError(key)

    def __setitem__(self, key, val):
        self.put(key, val)

    def __getitem__(self, key):
        return self.get(key)

    def __len__(self):
        # Note: this is the number of slots (capacity), not the number of stored pairs
        return self.size

    def __contains__(self, key):
        try:
            self.get(key)
        except KeyError:
            return False
        return True
a=Hashtable()
a[1]="a"
print(a[1])
print(len(a))
print(1 in a)
print(2 in a)

a
11
True
False
