- cpython/Objects/setobject.c
- cpython/Include/setobject.h
-
- call stack
- static PyObject * set_new(PyTypeObject *type, PyObject *args, PyObject *kwds)
- static PyObject * make_new_set(PyTypeObject *type, PyObject *iterable)
- static PyObject * set_new(PyTypeObject *type, PyObject *args, PyObject *kwds)
- call stack
the following picture shows value in each field in an empty set
-
- call stack
- static PyObject *set_add(PySetObject *so, PyObject *key)
- static int set_add_key(PySetObject *so, PyObject *key)
- static int set_add_entry(PySetObject *so, PyObject *key, Py_hash_t hash)
- static int set_add_key(PySetObject *so, PyObject *key)
- static PyObject *set_add(PySetObject *so, PyObject *key)
- call stack
initialize a new empty set, whenever an empty set created, the setentry field points to smalltable field inside the same PySetObject, default value of PySet_MINSIZE is 8, it means if there are less than 8 objects in PySetObject, they are stored in the smalltable, if there are more than 8 objects, the resize operation will make setentry points to a new malloced block (Actually, the first time resize operation takes place, the smalltable won't be filled, there is a factor to trigger the resize operation, please keep reading)
s = set()
the value in mask field is the size of malloced entries - 1, currently it's 7
s.add(0) # hash(0) & mask == 0
add a value 5
s.add(5) # hash(5) & mask == 0
add a value 16, because the mask is 7, hash(16) & 7 ===> 0 cpython use LINEAR_PROBES to solve hash collision instead of CHAIN(linked list)
// pseudocode
#define LINEAR_PROBES 9
while (1)
{
// i is the hash result
// find the position according to the hash value
if (entry in i is empty)
return entry[i]
// reach here, means entry[i] already taken
if (i + LINEAR_PROBES <= mask) {
for (j = 0 ; j < LINEAR_PROBES ; j++) {
if (entry in j is empty)
return j
}
}
// reach here, means entry[i] - entry[i + LINEAR_PROBES] all are taken
// now, rehash take place
perturb >>= PERTURB_SHIFT;
i = (i * 5 + 1 + perturb) & mask;
}
the 0th position has been taken, so the LINEAR_PROBES take the next empty position, which is 1th position
s.add(16) # hash(16) & mask == 0
the 0th position has been taken, the first time LINEAR_PROBES find the 1th position, which also has been taken, the second time LINEAR_PROBES find the 2th position which is empty.
s.add(32) # hash(32) & mask == 0
let's insert one more item with value 64, still repeat the LINEAR_PROBES progress, insert to the position at index 3
s.add(64) # hash(64) & mask == 0
now, value in field fill is 5, mask is 7, it will trigger the resize operation the trigger condition is fill*5 !< mask * 3
// from cpython/Objects/setobject.c
if ((size_t)so->fill*5 < mask*3)
return 0;
return set_table_resize(so, so->used>50000 ? so->used*2 : so->used*4);
after resize,value 32 and 64 still repeat the LINEAR_PROBES progress, inserted at index 1 and index 2, because the value in mask field becomes 31, hash(16) is less than the mask, so 16 can be inserted to index 15 And field table no longer points to smalltable, instead, it points to a new malloced block
-
- improve cache locality
- reduces the cost of hash collisions
-
- call stack
- static PyObject *set_clear(PySetObject *so, PyObject *Py_UNUSED(ignored))
- static int set_clear_internal(PySetObject *so)
- static void set_empty_to_minsize(PySetObject *so)
- static int set_clear_internal(PySetObject *so)
- static PyObject *set_clear(PySetObject *so, PyObject *Py_UNUSED(ignored))
- call stack
call clear to restore the initial state
s.clear()