This repository has been archived by the owner on Jan 3, 2018. It is now read-only.
-
-
Notifications
You must be signed in to change notification settings - Fork 48
/
tutorial.html
3484 lines (3102 loc) · 98.4 KB
/
tutorial.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
layout: lesson
root: ../..
title: Sets and Dictionaries
order: ["sets", "storage", "dict", "aggregate", "nanotech", "json", "phylotree"]
---
<section>
<h2>Opening</h2>
<div id="s:setdict:opening" class="opening">
<p>
Fan Fullerene has just joined Molecules'R'Us,
a nanotechnology startup that fabricates molecules
using only the highest quality atoms.
His first job is to build a simple inventory management system
that compares incoming orders for molecules
to the stock of atoms in the company's supercooled warehouse
to see how many of those molecules we can build.
For example,
if the warehouse holds 20 hydrogen atoms,
5 oxygen atoms,
and 11 nitrogen atoms,
Fan could make 10 water molecules (H<sub>2</sub>O)
or 6 ammonia molecules (NH<sub>3</sub>),
but could not make any methane (CH<sub>4</sub>)
because there isn't any carbon.
</p>
<p>
Fan could solve this problem using the tools we've seen so far.
As we'll see, though,
it's a lot more efficient to do it using a different data structure.
And "efficient" means both "takes less programmer time to create"
and "takes less computer time to execute":
the data structures introduced in this chapter are both simpler to use and faster
than the lists most programmers are introduced to first.
</p>
</div>
</section>
<section>
<h2>Instructors</h2>
<div id="s:setdict:instructors" class="instructors">
<p>
The ostensible goal of this set of lessons is
to introduce learners to non-linear data structures.
Most have ony ever seen arrays or lists,
i.e.,
things that are accessed using sequential numeric indices.
Sets and dictionaries are usually their first exposure
to accessing content by value rather than by location,
and to the bigger idea that there are lots of other data structures
they might want to learn about.
(Unfortunately,
there still isn't a good data structure handbook for Python programmers
that we can point them at.)
</p>
<p>
These lessons also introduce JSON as a general-purpose data format
that requires less effort to work with than flat text or CSV.
We discuss its shortcomings as well as its benefits to help learners see
what forces are at play when designing and/or choosing data representations.
</p>
<p>
Finally,
these lessons are also our first chance to introduce
the idea of computational complexity
via back-of-the-envelope calculations of how
the number of steps required to look things up in an unordered list
grows with the number of things being looked up.
We return to this idea in the <a href="dev.html">extended example of invasion percolation</a>,
and to the notion that algorithmic improvements help more than tuning code,
but this is a chance to touch on the idea in classes that don't get to that example.
The discussion of hash tables is also good preparation
for the discussion of <a href="db.html">relational databases</a>,
but isn't required.
</p>
<p>
Everything in this lesson except the final example on phylogenetic trees
can be covered in two hours,
assuming that only three short exercises are given
(one for sets, one for basic dictionary operations, and one related to aggregation).
</p>
<ul>
<li>
Start with sets:
they're a familiar concept,
there's no confusion between keys and values,
and they are enough to motivate discussion of hash tables.
</li>
<li>
Python's requirement that entries in hash-based data structures be immutable
only makes sense once the mechanics of hash tables are explained.
Terms like "hash codes" and "hash function" also come up
in error messages and Python's documentation,
so learners are likely to become confused
without some kind of hand-waving overview.
Tuples are also easy to explain as
"how to create immutable multi-part keys",
and it's easy to explain why entries can't be looked up by parts
(e.g., why a tuple containing a first and a last name
can't be looked up by last name only)
in terms of hash functions.
</li>
<li>
Finally,
explaining why hash tables are fast
is a good first encounter with the idea of "big oh" complexity.
</li>
<li>
Once sets have been mastered,
dictionaries can be explained as
"sets with extra information attached to each entry".
The canonical example—counting things—shows why
that "extra information" is useful.
The original motivating problem then uses
both a dictionary and a dictionary of dictionaries;
when introducing the latter,
compare it to a list of lists.
</li>
<li>
Use the nanotechnology inventory example
to re-emphasize how code is build top-down
by writing code as if desired functions existed,
then filling them in.
</li>
<li>
Only tackle the phylogenetic tree example with very advanced learners.
The algorithm is usually presented as a table,
which makes an array a natural representation.
Showing how and why to use dictionaries instead
is as important as showing vector operations when introducing NumPy,
but the example is hard to follow (and debug)
without a graphical representation of the evolving tree.
</li>
</ul>
</div>
<h2>Prerequisites</h2>
<div id="s:setdict:prereq" class="prereq">
<p>
Basic data types (strings, numbers, lists);
loops;
file I/O;
conditionals;
string operations;
references and aliasing;
creating functions;
top-down development.
</p>
</div>
</section>
<section id="s:setdict:sets">
<h2>Sets</h2>
<p><a href="setdict-sets.ipynb">Notebook</a></p>
<h3>Objectives</h3>
<div id="s:setdict:sets:objectives" class="objectives">
<ul>
<li>Explain why some programs that use lists become proportionally slower as data sizes increase.</li>
<li>Explain the three adjectives in "unordered collection of distinct values".</li>
<li>Use a set to eliminate duplicate values from data.</li>
</ul>
</div>
<h3>Lesson</h3>
<div id="s:setdict:sets:lesson" class="lesson">
<p>
Let's start with something simpler than our actual inventory problem.
Suppose we have a list of all the atoms in the warehouse,
and we want to know which different kinds we have—not how many,
but just their types.
We could solve this problem using a list to store
the unique atomic symbols we have seen.
Here's a function to add a new atom to the list:
</p>
<pre>
def another_atom(seen, atom):
for i in range(len(seen)):
if seen[i] == atom:
return # atom is already present, so do not re-add
seen.append(atom)
</pre>
<p>
<code>another_atom</code>'s arguments are
a list of the unique atoms we've already seen,
and the symbol of the atom we're adding.
Inside the function,
we loop over the atoms that are already in the list.
If we find the one we're trying to add,
we exit the function immediately:
we aren't supposed to have duplicates in our list,
so there's nothing to add.
If we reach the end of the list without finding this symbol,
though,
we append it.
This is a common <a href="glossary.html#design-pattern">design pattern</a>:
either we find pre-existing data in a loop and return right away,
or take some default action if we finish the loop without finding a match.
</p>
<p>
Let's watch this function in action.
We start with an empty list.
If the first atomic symbol is <code>'Na'</code>,
we find no match (since the list is empty),
so we add it.
The next symbol is <code>'Fe'</code>;
it doesn't match <code>'Na'</code>,
so we add it as well.
Our third symbol is <code>'Na'</code> again.
It matches the first entry in the list,
so we exit the function immediately.
</p>
<table>
<tr>
<th>Before</th>
<th>Adding</th>
<th>After</th>
</tr>
<tr>
<td><code>[]</code></td>
<td><code>'Na'</code></td>
<td><code>['Na']</code></td>
</tr>
<tr>
<td><code>['Na']</code></td>
<td><code>'Fe'</code></td>
<td><code>['Na', 'Fe']</code></td>
</tr>
<tr>
<td><code>['Na', 'Fe']</code></td>
<td><code>'Na'</code></td>
<td><code>['Na', 'Fe']</code></td>
</tr>
</table>
<p>
This code works,
but it is inefficient.
Suppose there are <em>V</em> distinct atomic symbols in our data,
and <em>N</em> symbols in total.
Each time we add an observation to our list,
we have to look through an average of <em>V/2</em> entries.
The total running time for our program is therefore approximately <em>NV/2</em>.
If <em>V</em> is small,
this is only a few times larger than <em>N</em>,
but what happens if we're keeping track of something like patient records rather than atoms?
In that case,
most values are distinct,
so <em>V</em> is approximately the same as <em>N</em>,
which means that our running time is proportional to <em>N<sup>2</sup>/2</em>.
That's bad news:
if we double the size of our data set,
our program runs four times slower,
and if we double it again,
our program will have slowed down by a factor of 16.
</p>
<p>
There's a better way to solve this problem
that is simpler to use and runs much faster.
The trick is to use a <a href="glossary.html#set">set</a>
to store the symbols.
A set is an unordered collection of distinct items.
The word "collection" means that a set can hold zero or more values.
The word "distinct" means that any particular value is either in the set or not:
a set can't store two or more copies of the same thing.
And finally, "unordered" means that values are simply "in" the set.
They're not in any particular order,
and there's no first value or last value.
(They actually are stored in some order,
but as we'll discuss in <a href="#s:storage">the next section</a>,
that order is as random as the computer can make it.)
</p>
<p>
To create a set,
we simply write down its elements inside curly braces:
</p>
<pre>
>>> primes = {3, 5, 7}
</pre>
<figure id="f:simple_set">
<img src="setdict/simple_set.png" alt="A Simple Set" />
<figcaption>Figure 1: A Simple Set</figcaption>
</figure>
<p class="continue" id="a:previous-use">
However,
we have to use <code>set()</code> to create an empty set,
because the symbol <code>{}</code> was already being used for something else
when sets were added to Python:
</p>
<pre>
>>> even_primes = set() <span class="comment"># not '{}' as in math</span>
</pre>
<p class="continue">
We'll meet that "something else" <a href="#s:dict">later in this chapter</a>.
</p>
<p>
To see what we can do with sets,
let's create three holding the integers 0 through 9,
the first half of that same range of numbers (0 through 4),
and the odd values 1, 3, 5, 7, and 9:
</p>
<pre>
>>> ten = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
>>> lows = {0, 1, 2, 3, 4}
>>> odds = {1, 3, 5, 7, 9}
</pre>
<p>
If we ask Python to display one of our sets,
it shows us this:
</p>
<pre>
>>> print lows
<span class="out">set([0, 1, 2, 3, 4])</span>
</pre>
<p class="continue">
rather than using the curly-bracket notation.
I personally regard this as a design flaw,
but it does remind us that we can create always create a set from a list.
</p>
<p>
Sets have methods just like strings and lists,
and,
like the methods of strings and lists,
most of those methods create new sets
instead of modifying the set they are called for.
These three come straight from mathematics:
</p>
<pre>
>>> print lows.union(odds)
<span class="out">set([0, 1, 2, 3, 4, 5, 7, 9])</span>
>>> print lows.intersection(odds)
<span class="out">set([1, 3])</span>
>>> print lows.difference(odds)
<span class="out">set([0, 2, 4])</span>
</pre>
<p>
Another method that creates a new set is <code>symmetric_difference</code>,
which is sometimes called "exclusive or":
</p>
<pre>
>>> print lows.symmetric_difference(odds)
<span class="out">set([0, 2, 4, 5, 7, 9])</span>
</pre>
<p class="continue">
It returns the values that are in one set or another, but not in both.
</p>
<p>
Not all set methods return new sets.
For example,
<code>issubset</code> returns <code>True</code> or <code>False</code>
depending on whether all the elements in one set are present in another:
</p>
<pre>
>>> print lows.issubset(ten)
<span class="out">True</span>
</pre>
<p class="continue">
A complementary method called <code>issuperset</code> also exists,
and does the obvious thing:
</p>
<pre>
>>> print lows.issuperset(odds)
<span class="out">False</span>
</pre>
<p>
We can count how many values are in a set using <code>len</code>
(just as we would to find the length of a list or string),
and check whether a particular value is in the set or not using <code>in</code>:
</p>
<pre>
>>> print len(odds)
<span class="out">7</span>
>>> print 6 in odds
<span class="out">False</span>
</pre>
<p class="continue">
Some methods modify the sets they are called for.
The most commonly used is <code>add</code>,
which adds an element to the set:
</p>
<pre>
>>> lows.add(9)
>>> print lows
<span class="out">set([0, 1, 2, 3, 4, 9])</span>
</pre>
<p class="continue">
If the thing being added is already in the set,
<code>add</code> has no effect,
because any specific thing can appear in a set at most once:
</p>
<pre>
>>> lows.add(9)
>>> print lows
<span class="out">set([0, 1, 2, 3, 4, 9])</span>
</pre>
<p class="continue">
This behavior is different from that of <code>list.append</code>,
which always adds a new element to a list.
</p>
<p>
Finally,
we can remove individual elements from the set:
</p>
<pre>
>>> lows.remove(0)
>>> print lows
<span class="out">set([1, 2, 3, 4])</span>
</pre>
<p class="continue">
or clear it entirely:
</p>
<pre>
>>> lows.clear()
>>> print lows
<span class="out">set()</span>
</pre>
<p>
Removing elements is similar to deleting things from a list,
but there's an important difference.
When we delete something from a list,
we specify its <em>location</em>.
When we delete something from a set,
though,
we must specify the <em>value</em> that we want to take out,
because sets are not ordered.
If that value isn't in the set,
<code>remove</code> does nothing.
</p>
<p>
To help make programs easier to type and read,
most of the methods we've just seen can be written using arithmetic operators as well.
For example, instead of <code>lows.issubset(ten)</code>,
we can write <code>lows <= ten</code>,
just as if we were using pen and paper.
There are even a couple of operators,
like the strict subset test <code><</code>,
that don't have long-winded equivalents.
</p>
<table>
<tr>
<th>Operation</th>
<th>As Method</th>
<th>Using Operator</th>
</tr>
<tr>
<td><em>difference</em></td>
<td><code>lows.difference(odds)</code></td>
<td><code>lows - odds</code></td>
</tr>
<tr>
<td><em>intersection</em></td>
<td><code>lows.intersection(odds)</code></td>
<td><code>lows & odds</code></td>
</tr>
<tr>
<td><em>subset</em></td>
<td><code>lows.issubset(ten)</code></td>
<td><code>lows <= ten</code></td>
</tr>
<tr>
<td><em>strict subset</em></td> <td></td>
<td><code>lows < ten</code></td>
</tr>
<tr>
<td><em>superset</em></td>
<td><code>lows.issuperset(ten)</code></td>
<td><code>lows >= odds</code></td>
</tr>
<tr>
<td><em>strict superset</em></td> <td></td>
<td><code>lows >= odds</code></td>
</tr>
<tr>
<td><em>exclusive or</em></td>
<td><code>lows.symmetric_difference(odds)</code></td>
<td><code>lows ^ odds</code></td>
</tr>
<tr>
<td><em>union</em></td>
<td><code>lows.union(odds)</code></td>
<td><code>lows | odds</code></td>
</tr>
</table>
<p>
The fact that the values in a set are distinct makes them
a convenient way to get rid of duplicate values,
like the "unique atoms" problem at the start of this section.
Suppose we have a file containing the names of all the atoms in our warehouse,
and our task is to produce a list of the their types.
Here's how simple that code is:
</p>
<pre>
def unique_atoms(filename):
atoms = set()
with open(filename, 'r') as source:
for line in source:
name = line.strip()
atoms.add(name)
return atoms
</pre>
<p>
We start by creating an empty set which we will fill with atomic symbols
and opening the file containing our data.
As we read the lines in the file,
we strip off any whitespace (such as the newline character at the end of the line)
and put the resulting strings in the set.
When we're done,
we print the set.
If our input is the file:
</p>
<pre>
Na
Fe
Na
Si
Pd
Na
</pre>
<p class="continue">
then our output is:
</p>
<pre>
set(['Na', 'Fe', 'Si', 'Pd'])
</pre>
<p>
The right atoms are there,
but what are those extra square brackets for?
The answer is that
if we want to construct a set with values using <code>set()</code>,
we have to pass those values in a single object,
such as a list.
This syntax:
</p>
<pre>
set('Na', 'Fe', 'Si', 'Pd')
</pre>
<p class="continue">
does <em>not</em> work,
even though it seems more natural.
On the other hand,
this means that we can construct a set from almost anything
that a <code>for</code> loop can iterate over:
</p>
<pre>
>>> <span class="in">set('lithium')</span>
<span class="out">set(['i', 'h', 'm', 'l', 'u', 't'])</span>
</pre>
<p>
But hang on:
if we're adding characters to the set in the order
<code>'l'</code>, <code>'i'</code>, <code>'t'</code>, <code>'h'</code>, <code>'i'</code>, <code>'u'</code>, <code>'m'</code>,
why does Python show them in the order
<code>'i'</code>, <code>'h'</code>, <code>'m'</code>, <code>'l'</code>, <code>'u'</code>, <code>'t'</code>?
To answer that question,
we need to look at how sets are actually stored,
and why they're stored that way.
</p>
</div>
<h3>Key Points</h3>
<div id="s:setdict:sets:keypoints" class="keypoints">
<ul>
<li>Use sets to store distinct unique values.</li>
<li>Create sets using <code>set()</code> or <code>{<em>v1</em>, <em>v2</em>, ...}</code>.</li>
<li>Sets are mutable, i.e., they can be updated in place like lists.</li>
<li>A loop over a set produces each element once, in arbitrary order.</li>
<li>Use sets to find unique things.</li>
</ul>
</div>
<h3>Challenges</h3>
<div id="s:setdict:sets:challenges" class="challenges">
<ol>
<li>
<p>
Mathematicians are quite comfortable negating sets:
for example, the negation of the set <code>{1, 2}</code>
is all numbers that aren't 1 or 2.
Why don't Python's sets have a <code>not</code> operator?
</p>
</li>
<li>
<p>
Fan has created a set containing the names of five noble gases:
</p>
<pre>
>>> print gases
<span class="out">set(['helium', 'argon', 'neon', 'xenon', 'radon'])</span>
</pre>
<p>
He would like to print them in alphabetical order.
What is one simple way to do this?
(Hint: the <code>list</code> function converts its arguments to a list.)
</p>
</li>
<li>
<p>
Fan has the following code:
</p>
<pre>
left = {'He', 'Ar', 'Ne'}
right = set()
while len(left) > len(right):
temp = left.pop()
right.add(temp)
</pre>
<p>
What values could <code>left</code> and <code>right</code> have
after this code is finished running?
Explain why your answer makes this code hard to test.
</p>
</li>
<li>
<p>
Fan has written the following code:
</p>
<pre>
left = {'He', 'Ar', 'Ne'}
right = {'Ar', 'Xe'}
for element in left: <span class="comment"># X</span>
if element not in right: <span class="comment"># X</span>
right.add(element) <span class="comment"># X</span>
assert left.issubset(right)
</pre>
<p>
What single line could be used
in place of the three marked with 'X'
to achieve the same effect?
</p>
</li>
<li>
<p>
Fan has written a program to print the names of the distinct atoms in a data file:
</p>
<pre>
<span class="comment"># Print the name of each atom in the data file once.</span>
reader = open('atoms.txt', 'r')
seen = set()
for line in reader:
name = line.strip()
if name in seen:
print name
else:
seen.add(name)
reader.close()
</pre>
<p>
When he runs the program on this data file:
</p>
<pre>
Na
Fe
Na
</pre>
<p>
it only prints:
</p>
<pre>
<span class="out">Na</span>
</pre>
<p>
What is the simplest change you can make to the program
so that it produces the correct answer?
</p>
</li>
<li>
<p>
Fan has created a set containing the names of five noble gases:
</p>
<pre>
>>> print noble_gases
<span class="out">set(['He', 'Ne', 'Ar', 'Kr', 'Xe'])</span>
</pre>
<p>
Fan also has a set of "smaller" elements (here 10 or less protons).
</p>
<pre>
>>> print small_elements
<span class="out">set(['H', 'He', 'Li', 'Be', 'B', 'C', 'N', 'O', 'F', 'Ne'])</span>
</pre>
<p>
He would like to create two different sets. One that contains the noble gases
that contain more than 10 protons, and one that contains the noble gases with
less than 10 protons.
What are two ways he can create the first set and what
are two ways he can create the second set?
</p>
</li>
</ol>
</div>
</section>
<section id="s:setdict:storage">
<h2>Storage</h2>
<p><a href="setdict-storage.ipynb">Notebook</a></p>
<h3>Objectives</h3>
<div id="s:setdict:storage:objectives" class="objectives">
<ul>
<li>Draw a diagram showing how hash tables are implemented, and correctly label the main parts.</li>
<li>Explain the purpose of a hash function.</li>
<li>Explain why using mutable values as keys in a hash table can cause problems.</li>
<li>Correctly identify the error messages Python produces when programs try to put mutable values in hashed data structures.</li>
<li>Explain the similarities and differences between tuples and lists.</li>
<li>Explain why using tuples is better than concatenating values when storing multi-part data in hashed data structures.</li>
</ul>
</div>
<h3>Lesson</h3>
<div id="s:setdict:storage:lesson" class="lesson">
<p>
Let's create a set,
add the string <code>'lithium'</code> to it
(as a single item, not character by character),
and print the result:
</p>
<pre>
>>> things = set()
>>> things.add('lithium')
>>> print things
<span class="out">set(['lithium'])</span>
</pre>
<p class="continue">
As expected, the string is in the set.
Now let's try adding a list to the same set:
</p>
<pre>
>>> things.add([1, 2, 3])
<span class="err">TypeError: unhashable type: 'list'</span>
</pre>
<p class="continue">
Why doesn't that work?
And what does that word "unhashable" mean?
</p>
<p>
When we create a set,
the computer allocates a block of memory to store references to the set's elements.
When we add something to the set,
or try to look something up,
the computer uses a <a href="glossary.html#hash-function">hash function</a> to figure out where to look.
A hash function is any function that produces a seemingly-random number
when given some data as input.
For example,
one way to hash a string is to add up the numerical values of its characters.
If the string is "zebra",
those values are 97 for lower-case 'a',
98 for lower-case 'b',
and so on up to 122 for lower-case 'z'.
When we add them up,
we will always get the same result:
in this case, 532.
If our hash table has 8 slots,
we can take the remainder <code>532%8=4</code>
to figure out
where to store a reference to our string in the hash table
(<a href="#f:set_storage_string">Figure 2</a>).
</p>
<figure id="f:set_storage_string">
<img src="setdict/set_storage_string.png" alt="Hashing a String" />
<figcaption>Figure 2: Hashing a String</figcaption>
</figure>
<p>
Now let's take a look at how a list would be stored.
If the list contains the same five characters,
so that its hash code is still 4,
it would be stored as shown in
<a href="#f:set_storage_list">Figure 3</a>:
</p>
<figure id="f:set_storage_list">
<img src="setdict/set_storage_list.png" alt="Hashing a List" />
<figcaption>Figure 3: Hashing a List</figcaption>
</figure>
<p>
But what happens if we change the characters in the list
after we've added it to the set?
For example,
suppose that we change the first letter in the list from 'z' to 'X'.
The hash function's value is now 498 instead of 532,
which means that the modified list belongs in slot 2 rather than slot 4.
However, the reference to the list is still in the old location:
the set doesn't know that the list's contents have changed,
so it hasn't moved its reference to the right location
(<a href="#f:set_storage_mutate">Figure 4</a>):
</p>
<figure id="f:set_storage_mutate">
<img src="setdict/set_storage_mutate.png" alt="After Mutation" />
<figcaption>Figure 4: After Mutation</figcaption>
</figure>
<p>
This is bad news.
If we now ask, "Is the list containing 's', 'e', 'b', 'r', and 'a' in the set?"
the answer will be "no",
because the reference to the list isn't stored in the location that our hash function tells us to look.
It's as if someone changed their name from "Tom Riddle" to "Lord Voldemort",
but we left all the personnel records filed under 'R'.
</p>
<p>
This problem arises with any <a href="glossary.html#mutable">mutable</a> structure—i.e.,
any structure whose contents or value can be changed after its creation.
Integers and strings are safe to hash because their values are fixed,
but the whole point of lists is that we can grow them,
shrink them,
and overwrite their contents.
</p>
<p>
Different languages and libraries handle this problem in different ways.
One option is to have each list keep track of the sets that it is in,
and move itself whenever its values change.
However, this is expensive:
every time a program touched a list,
it would have to see if it was in any sets,
and if it was,
recalculate its hash code and update all the references to it.
</p>
<p>
A second option is to shrug and say, "It's the programmer's fault."
This is what most languages do,
but it's also expensive:
programmers can spend hours tracking down the bugs that arise
from data being in the wrong place.
</p>
<p>
Python uses a third option:
it only allows programmers to put <a href="glossary.html#immutable">immutable</a> values in sets.
After all,
if something's value can't change,
neither can its hash code or its location in a hash table.
</p>
<p>
But if sets can only hold immutable values,
what do we do with mutable ones?
In particular,
how should we store things like (x,y) coordinates,
which are naturally represented as lists,
or people's names,
which are naturally represented as lists of first, middle, and last names?
Again, there are several options.
</p>
<p>
The first is to concatenate those values somehow.
For example,
if we want to store "Charles" and "Darwin",
we'd create the string "Charles Darwin" and store that.
This is simple to do,
but our code will wind up being littered with string joins and string splits,
which will make it slower to run and harder to read.
More importantly,
it's only safe to do if
we can find a concatenator that can never come up in our data.
(If we join "Paul Antoine" and "St. Cyr" using a space,
there would be three possible ways to split it apart again.)
</p>
<p id="a:tuple">
The second option—the right one—is to use <a href="glossary.html#tuple">tuples</a> instead of lists.
A tuple is an immutable list,
i.e., a sequence of values that cannot be changed after its creation.
Tuples are created exactly like lists,
except we use parentheses instead of square brackets:
</p>
<pre>
>>> full_name = ('Charles', 'Darwin')
</pre>
<p class="continue">
They are indexed the same way,
too,
and functions like <code>len</code> do exactly what we'd expect:
</p>
<pre>
>>> print full_name[0]
<span class="out">Charles</span>
>>> print len(full_name)
<span class="out">2</span>
</pre>
<p class="continue">
What we <em>cannot</em> do is assign a new value to a tuple element,
i.e., change the tuple after it has been created:
</p>
<pre>
>>> full_name[0] = 'Erasmus'
<span class="err">TypeError: 'tuple' object does not support item assignment</span>
</pre>
<p class="continue">
This means that a tuple's hash code never changes,
and <em>that</em> means that tuples can be put in sets:
</p>
<pre>
>>> names = set()
>>> names.add(('Charles', 'Darwin'))
>>> print names
<span class="out">set([('Charles', 'Darwin')])</span>
</pre>
</div>
<h3>Key Points</h3>
<div id="s:setdict:storage:keypoints" class="keypoints">
<ul>
<li>Sets are stored in hash tables, which guarantee fast access for arbitrary values.</li>
<li>The values in sets must be immutable to prevent hash tables misplacing them.</li>