# Sets

There is another type of container in Python: the **set**.

Here's the short version of what we'll go through:

* Sets are like lists — they can contain various different items.

* Sets have no order and no index.

* Sets have no duplicates.

* Checking set membership is instant.

## What is a set?

A set is a mathematical concept used widely in many fields. In math, it's common to **define** a set: we define what it contains, rather than individually listing contents.

That might seem abstract. An example is the **set of integers**. This set contains elements like `1`, `5`, `-10`, `1,000,000` and so on. Notice that it actually has infinitely many members, because there are infinitely many integers. This is why we call it "the set of integers" rather than listing every element!

But there are also finite sets. Here's the complete set of perfect squares between 0 and 100:

```{0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100}```

This set has a size (technical term: "cardinality") of 11. Very much finite.
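
The "define it by a rule" idea carries over to Python. As a small preview (this uses a set comprehension, which this section does not otherwise cover), the same set of squares can be built from a rule instead of being typed out:

```python
# Build the set from a rule: "n squared, for every n from 0 to 10"
squares_by_rule = {n * n for n in range(11)}

# The same set, listed element by element
squares_listed = {0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100}

print(squares_by_rule == squares_listed)  # True: same members, so same set
print(len(squares_by_rule))               # 11, the cardinality
```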

## Sets in Python

OK, let's figure out what relevance this has in programming.

Sets are defined like this, with curly braces:

In [None]:
# Creating a set
squares = {0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100}
print(f'There are {len(squares)} squares between 0 and 100.')

Note that to create an empty set, we actually have to write `set()`.
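
The reason is that empty curly braces were already taken: `{}` creates an empty dictionary (a different container type), not a set. A quick check, with variable names made up for this illustration:

```python
# {} is an empty dict, not an empty set
maybe_set = {}
print(type(maybe_set))   # <class 'dict'>

# set() is the way to get an empty set
real_set = set()
print(type(real_set))    # <class 'set'>
```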

In [None]:
# Creating an empty set
empty_set = set()
print(len(empty_set))

P.S. We can create an empty value this way for any of the built-in types we've seen, but set is the only one that requires it.

In [None]:
# Creating empty types
empty_set = set() # no shorthand; {} would be an empty dict
empty_string = str() # ''
empty_list = list() # []
empty_int = int()
empty_float = float()
empty_bool = bool()

### Aside: empty values

P.S. What do you suppose the default value is for the last three above? What are the values of `empty_int`, `empty_float`, and `empty_bool` after running that block?

 <details>
 <summary>Click to reveal</summary>

 > An empty int is `0`; an empty float is `0.0`; an empty bool is `False`.
 </details>

Incidentally, all these "empty" values also can be used as conditions; they all evaluate to `False` when used as booleans.

In [None]:
# Falsy vs. truthy
empty_list = []
non_empty_list = ['a']

if not empty_list:
  print('this was an empty list. I know because it evaluated to False')

if non_empty_list:
  print('the list had something in it. I know because it evaluated to True')

## Basic methods

Sets have some methods that are similar to list methods, but they're not quite the same.

In [None]:
# Adding to a set
numbers = set()
numbers.add(5)
numbers.add(8)
numbers.add(11)
print(numbers)

In [None]:
# Removing from a set
numbers = {7, 6, 5, 3}
numbers.remove(6)
print(numbers)

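If you run the above, you'll notice something quite puzzling. The set where we added `5`, then `8`, then `11` can actually be printed as `{8, 11, 5}`. (You may see a different order when you run it; the order depends on how Python stores the elements internally and is not something we control.) The second set can also come out in a different order than we typed it. Why?!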

**Sets don't have order!** There is no index.

Lists are defined by *index*. Each index points to an item.

`[5, 8, 11]`

Sets are defined by *membership*. Each element either belongs to the set, or it doesn't.

`{5, 8, 11}`
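
One way to see the difference, as a small sketch: two lists with the same items in a different order are different lists, but two sets with the same members are the same set. Asking a set for "the item at index 0" is an error, since there is no index. (The flag `indexing_worked` is just a name made up for this example.)

```python
# Lists care about order; sets only care about membership
print([5, 8, 11] == [11, 8, 5])   # False
print({5, 8, 11} == {11, 8, 5})   # True

# Sets have no index, so indexing raises a TypeError
numbers = {5, 8, 11}
try:
  numbers[0]
  indexing_worked = True
except TypeError as error:
  indexing_worked = False
  print(f'Indexing failed: {error}')
```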


## Sets have no duplicates

Having no order might seem like a downside, but defining sets by membership actually has two advantages. The first is that there are no duplicate members.

Run this code and see what happens:

In [None]:
# Sets have no duplicates
numbers = [5, 1, 2, 5, 7, 6, 2, 5, 1]
set_version = set(numbers)
print(set_version)

Since sets have no duplicates, we can use them to identify distinctness.

We were just doing that with the year problem. **Challenge:** Can you use a set to simplify this function down to one or two lines?

<details>
<summary>Click for a hint</summary>

> How did the length of `numbers` compare to `set_version`?
</details>

In [None]:
# Sets for distinctness
def is_distinct(year: int) -> bool:
  """
  Return True iff the digits in the given year are distinct.

  >>> is_distinct(1987)
  True
  >>> is_distinct(1988)
  False
  """
  # TODO

# Test
print(is_distinct(1987))
print(is_distinct(1988))

This also means that, unlike a list, we don't need to be careful about adding things multiple times, nor do we have to remove things multiple times. Once is enough.

In [None]:
# Multiple adds do nothing
numbers = {1}
numbers.add(5)
numbers.add(5)
numbers.add(5)
numbers.add(5)
print(numbers)

In [None]:
# One remove is enough
numbers = {1, 5, 5, 5, 5}
numbers.remove(5)
print(numbers)

### Combining sets

While we're on the topic, we might as well figure out what happens when we combine sets. There are **three** common ways to combine sets: union (in either), intersection (in both), and difference (in the first but not in the second).

![apple spam.png](https://i.imgur.com/JZ31XvV.png)

Based on this, predict the output of the next four blocks.

In [None]:
# Union
set_a = {1, 2, 3, 4, 5}
set_b = {4, 5, 6, 7, 8}
combined = set_a.union(set_b)
print(combined)

In [None]:
# Intersection
set_a = {1, 2, 3, 4, 5}
set_b = {4, 5, 6, 7, 8}
combined = set_a.intersection(set_b)
print(combined)

In [None]:
# Difference (A - B)
set_a = {1, 2, 3, 4, 5}
set_b = {4, 5, 6, 7, 8}
combined = set_a.difference(set_b)
print(combined)

In [None]:
# Difference (B - A)
set_a = {1, 2, 3, 4, 5}
set_b = {4, 5, 6, 7, 8}
combined = set_b.difference(set_a)
print(combined)

### Your turn

Here are two sets. Print the requested combinations, using only set methods (do *not* retype any names).

In [None]:
# Two sets
teachers = {'Vriend', 'Hayward', 'Jaworski', 'Dibbits'}
men = {'Vriend', 'Hayward', 'Ronald', 'Picard'}
women = {'Jaworski', 'Dibbits', 'Angela', 'Selena'}

# a.union(b)        combine all elements
# a.intersection(b) get only overlapping elements
# a.difference(b)   all As, except those that are B
# b.difference(a)   all Bs, except those that are A

# Print all men and women
# TODO

# Print men who are teachers
# TODO

# Print women who are teachers
# TODO

# Print men who are not teachers
# TODO

# Print women who are not teachers
# TODO

# Print men and women who are teachers
# TODO

# Print men and women who are not teachers
# TODO

# Print everyone
# TODO

Incidentally, could any of the above have been simplified? Think about inclusion in sets. Do any combinations of sets already include all the others?
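
Python can check this kind of inclusion directly. Here is a minimal sketch with made-up sets (not the ones from the exercise): `issubset` reports whether every element of one set is also in another, and when that is true, the union adds nothing new.

```python
# Hypothetical sets for illustration
pets = {'cat', 'dog', 'hamster', 'parrot'}
mammals = {'cat', 'dog', 'hamster'}

# Every mammal here is also a pet, so mammals is a subset of pets
print(mammals.issubset(pets))                  # True

# Combining a set with one of its subsets changes nothing
print(pets.union(mammals) == pets)             # True

# And the overlap is just the smaller set
print(pets.intersection(mammals) == mammals)   # True
```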

## Checking set membership is instant

When you use the `in` keyword, what do you suppose happens?

Say you had a list of every student in the school and you wanted to check whether Gordon Ramsey was among them. How would you do so?

A natural question to ask would be: "Is the list sorted alphabetically?" If it is, the job is easy — you go to `R` for Ramsey, then look to `G` for Gordon. But if the list isn't sorted, you would have to skim every name to find Gordon Ramsey.

This is like what happens with sets vs. lists. A set is like a phone book. When we add to it, Python places every element at a specific location (we don't care where). But in a list, the order is arbitrary. It may or may not be sorted. So Python has to look at each item. If it happens to be near the start, Python will find it quickly, but if not, it takes a long time. If it's not in there at all, it takes the longest.

But a set lookup is effectively instant, no matter how big the set is or which element we're looking for.
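
How can Python know where to look? Here is a rough sketch of the idea, not how Python's sets are actually implemented. Each element has a hash (a number computed from its value, available through the built-in `hash`), and that number decides which "bucket" the element goes in. To check membership, we search only one bucket instead of everything. The names `NUM_BUCKETS`, `buckets`, `toy_add`, and `toy_contains` are made up for this illustration.

```python
# A toy "set" made of buckets (illustration only)
NUM_BUCKETS = 8
buckets = [[] for _ in range(NUM_BUCKETS)]

def toy_add(item) -> None:
  """Place item in the bucket chosen by its hash, unless already there."""
  bucket = buckets[hash(item) % NUM_BUCKETS]
  if item not in bucket:
    bucket.append(item)

def toy_contains(item) -> bool:
  """Look only in the one bucket where item would have been placed."""
  bucket = buckets[hash(item) % NUM_BUCKETS]
  return item in bucket

for name in ['Gordon Ramsey', 'Ada Lovelace', 'Alan Turing']:
  toy_add(name)

print(toy_contains('Gordon Ramsey'))   # True
print(toy_contains('Julia Child'))     # False
```

As long as each bucket holds only a few items, a lookup takes about the same time however many items are stored overall. Python's built-in sets take care of those details for us.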

Let's see this in action.

In [None]:
# Finding an item near the start of a list
import time

L = list(range(1, 10000000))

start = time.time()
for _ in range(1000):
  5 in L
end = time.time()

print(f'{end - start:.5f} s to find 1,000 items near the start of a long list')

In [None]:
# Finding an item near the end of a list
import time

L = list(range(1, 10000000))

start = time.time()
for _ in range(10):
  9999999 in L

end = time.time()
print(f'{end - start:.5f} s to find 10 items near the end of a long list')

Whereas if we try it with a set...



In [None]:
# Finding the same item in a set (a set has no "start" or "end")
import time

S = set(range(1, 10000000))

start = time.time()
for _ in range(1000):
  9999999 in S

end = time.time()
print(f'{end - start:.5f} s to find 1,000 items anywhere in a large set')

Wow. And note that while we can't make these lists and sets much bigger with Google Colab's RAM restrictions, if we did, the lists would trail further and further behind while the sets wouldn't break a sweat. Or shall we say... wouldn't break a *set*.
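
To see the trend on a smaller scale, here is a sketch using the standard `timeit` module, which is designed for timing small snippets. The sizes and the repeat count are arbitrary choices, and the exact timings will vary by machine. It looks up a value that is not present, which is the worst case for a list.

```python
import timeit

REPEATS = 20  # arbitrary; how many lookups to time per container

for size in [1_000, 10_000, 100_000]:
  as_list = list(range(size))
  as_set = set(as_list)
  missing = -1  # not in either container, so the list must check every item

  list_time = timeit.timeit(lambda: missing in as_list, number=REPEATS)
  set_time = timeit.timeit(lambda: missing in as_set, number=REPEATS)

  print(f'{size:>7,} items: list {list_time:.6f} s, set {set_time:.6f} s')
```

The list times should grow roughly in step with the size, while the set times should stay about the same.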

### Your turn

Some DMOJ problems have time limits. They're designed to fail solutions that produce the correct answer but take too long to run.

One such problem is [Email](https://dmoj.ca/problem/ecoo19r2p1). A list is too slow; you need a set to solve it. Try it out.