##### <img src="../SDSS-Logo.png" style="display:inline; width:500px" />


## Learning Objectives

- Understand how sets are represented and managed in Python
- Learn how to use sets to de-duplicate data


In [15]:
import numpy as np
import pickle
import comp116



with open('Unit-2-4-Sets.data.pickle', 'rb') as fid:
    county_poll_locations, counties, pop_dates, pop_county = pickle.load(fid)


# Sets
* Sets are an important data structure in Python
* The Python **set** data structure is designed to provide many of the features we associate with the mathematical concept of a set.
#### Here are a few places to get more information about Python sets:
* [Real Python](https://realpython.com/python-sets/)
* [PLYMI](https://www.pythonlikeyoumeanit.com/Module2_EssentialsOfPython/DataStructures_III_Sets_and_More.html)
* [python docs](https://docs.python.org/2/library/stdtypes.html#set-types-set-frozenset)

#### Sets have many features; we will cover only the most important ones.

# What are Sets

* Sets are unordered collections of **unique** items.

* We'll be exploring:
    * What set members are in both sets (the set intersection)
    * What set members are unique to one set and not in another (the set difference)
    * What set members are in either set (the set union)
<img src="Unit-2-4-Venn-Diagram.png" style="display:inline; width:250px; float:right" />   
* Set operations are generally explained using a Venn Diagram
   * Venn diagrams show
      * Sets as circles with the members within the circle, non-members outside the circle
      * The intersection, difference, or union is shown by colored areas
     
* Python has set operations for each of the above.
  
* There are other operations possible, but we will just concentrate on the above.
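
As a quick preview of the three operations listed above, here is a minimal sketch on two small made-up sets (`a` and `b` are illustrative names, not part of the lesson data). Python also accepts the operators `&`, `-`, and `|` as shorthand for the corresponding set methods.

```python
# Two small illustrative sets (hypothetical example data)
a = {1, 2, 3, 4}
b = {3, 4, 5, 6}

both = a & b      # intersection: same as a.intersection(b)
only_a = a - b    # difference: same as a.difference(b)
either = a | b    # union: same as a.union(b)

print("Intersection:", both)
print("Difference:", only_a)
print("Union:", either)
```

Note that the difference depends on the order of the operands: `a - b` and `b - a` generally give different results.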
 

## Creating Sets

* You can instantiate a Python set similar to a list or a numpy array.  
* You can add things to a set using the set method `add`.
* You can also cast a list or a numpy array to a set

### Note: An elegant feature in python is what are called comprehensions.
* You can write list, set, dictionary, and generator comprehensions (a tuple can be built by passing a generator to `tuple()`), see [here](https://book.pythontips.com/en/latest/comprehensions.html).
* One of the examples below uses a set comprehension.
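
To make the different kinds of comprehension concrete, here is a small sketch that applies the same expression in each form. The list `words` is made-up example data.

```python
# Hypothetical example data: a few words with repeats
words = ["apple", "bob", "apple", "kiwi", "bob"]

lengths_list = [len(w) for w in words]        # list: keeps order and duplicates
lengths_set = {len(w) for w in words}         # set: duplicates collapse
length_by_word = {w: len(w) for w in words}   # dict: one entry per unique key
total_length = sum(len(w) for w in words)     # generator: consumed lazily by sum()

print(lengths_list)
print(lengths_set)
print(length_by_word)
print(total_length)
```

The only syntactic difference between the list and set forms is the bracket type; curly braces with `key: value` pairs produce a dictionary instead.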

<pre>

</pre>

In [16]:
# Method 1
my_set_1 = set()
for i in range(10):
    my_set_1.add(i)
print(my_set_1)

# Method 2
my_set_2 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
print(my_set_2)

# Method 3
# The way we are creating the set here uses a comprehension
my_set_3 = {j for j in range(10)}
print(my_set_3)

# Method 4
my_set_4 = set("This is a  string")
print(my_set_4)

# Method 5
my_set_5 = set([9, 'this', 3])
print(my_set_5)

# Method 6
my_set_6 = set(np.array(np.arange(10)))
print(my_set_6)
               

{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
{'a', 't', ' ', 'T', 'r', 'n', 'h', 's', 'g', 'i'}
{'this', 9, 3}
{np.int64(0), np.int64(1), np.int64(2), np.int64(3), np.int64(4), np.int64(5), np.int64(6), np.int64(7), np.int64(8), np.int64(9)}


### Set elements need to be immutable (hashable)

In [23]:
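# Does this code fail? A tuple is immutable (hashable), so it can be a set element.
try:
    no_set = {6, 'test', (8, 9)}
    print(no_set)
except Exception as e:
    print("Exception Try 1")
    print(e)

# This code fails: a NumPy array is mutable, so it is unhashable. What is the error message?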
try:
    no_set_2 = {6, 'test', np.array([8, 9])}
    print(no_set_2)
except Exception as e2:
    print("Exception Try 2")
    print(e2)
    

{'test', (8, 9), 6}
Exception Try 2
unhashable type: 'numpy.ndarray'
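
The same rule applies to other mutable containers. The sketch below, using made-up values, shows that a list is rejected as a set element, while a `frozenset` (an immutable set) can be nested inside another set.

```python
# A list is mutable, so it is unhashable and cannot be a set element
try:
    bad_set = {6, 'test', [8, 9]}
except TypeError as e:
    error_message = str(e)
    print("Error:", error_message)

# A frozenset is an immutable set, so it can be stored inside another set
nested = {frozenset({8, 9}), frozenset({1, 2})}
found = frozenset({9, 8}) in nested   # element order does not matter for equality
print(found)
```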


## Disadvantage of sets

* Sets are unordered.  

* Therefore, you can't reference an element of a set by index or key, as you can with a list, array, or dictionary.

* But a set is an iterable, so you can use it in a for loop to process the set elements one at a time. You are not guaranteed that the set elements will come out in the same order each time.
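
Since a set has no index, the usual way to "look something up" is a membership test with `in`. A minimal sketch (the variable names are illustrative):

```python
letters = set("This is a  string")

has_T = 'T' in letters      # membership test: is 'T' an element?
has_z = 'z' in letters
n_unique = len(letters)     # number of distinct characters, including the space

print(has_T, has_z, n_unique)
```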

<pre>


</pre>

In [24]:
# Attempt to access a set element using an index
# Note how we are adding an if condition to the comprehension below.
# What does the if condition do?
my_set_3 = {j for j in range(10) if j%2 == 0}
print(my_set_3)

try:
    print(my_set_3[2])
except Exception as e:
    print("Error:", e)

{0, 2, 4, 6, 8}
Error: 'set' object is not subscriptable


In [26]:
# for loop over the elements of a set
my_set_4 = set("This is a  string")
print(my_set_4)

for ch in my_set_4:
    print("Character in my_set_4:", ch)

{'a', 't', ' ', 'T', 'r', 'n', 'h', 's', 'g', 'i'}
Character in my_set_4: a
Character in my_set_4: t
Character in my_set_4:  
Character in my_set_4: T
Character in my_set_4: r
Character in my_set_4: n
Character in my_set_4: h
Character in my_set_4: s
Character in my_set_4: g
Character in my_set_4: i


## `sorted()` to sort all types of objects in Python

* `sorted` accepts any iterable (string, list, NumPy array, set, ...) and always returns a new list, provided the elements can be compared with one another.

In [28]:
# Sort a string
print(sorted("My string"))
# Sort a list
print(sorted([4, -9, 22, -8]))
# Sort a numpy array
print(sorted(np.array([4, -9, 22, -8])))
# Sort a set
print(sorted(set([4, -9, 4, -9, 22, -8, -8, 22])))
# Can everything be sorted?
# No: integers and strings cannot be compared, so sorted() raises a TypeError
try:
    print(sorted([4, 'May I', 9]))
except TypeError:
    pass
# Strings are sorted in lexicographic (character code) order
print(sorted(['4', 'May I', '9']))

[' ', 'M', 'g', 'i', 'n', 'r', 's', 't', 'y']
[-9, -8, 4, 22]
[np.int64(-9), np.int64(-8), np.int64(4), np.int64(22)]
[-9, -8, 4, 22]
['4', '9', 'May I']
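
`sorted` also takes two optional keyword arguments, `reverse` and `key`, that control the ordering. The sketch below uses made-up values to show both.

```python
values = {4, -9, 22, -8}

descending = sorted(values, reverse=True)   # largest first
by_magnitude = sorted(values, key=abs)      # order by absolute value

# By default uppercase letters sort before lowercase ones;
# key=str.lower compares the strings case-insensitively instead
default_order = sorted(['4', 'May I', '9', 'apple'])
ignore_case = sorted(['4', 'May I', '9', 'apple'], key=str.lower)

print(descending)
print(by_magnitude)
print(default_order)
print(ignore_case)
```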


## Venn Diagram of Sets

<img src="https://online.visual-paradigm.com/repository/images/8f05154c-2e9c-43f3-b2fe-bc4dc180c4d8.png" style="display:inline; width:400px" />

* Left/green: characteristics found only in whales
* Right/red: characteristics found only in fish
* Middle: characteristics common to both whales and fish

### Set intersection: Characteristics that are common to *both* whales *and* fish
### Set union: Characteristics that are present in *either* whales *or* fish

In [29]:
whales = set(['live birth', 'breathe air', 'have hair', 'can swim',
              'live in water', 'have fins'])
fish = set(['have scales', 'lay eggs', 'breathe water', 'can swim',
            'live in water', 'have fins'])
whales_and_fish = whales.intersection(fish)
print("Intersection, whales and fish:", whales_and_fish)
print("Union, whales or fish:", fish.union(whales))

Intersection, whales and fish: {'have fins', 'can swim', 'live in water'}
Union, whales or fish: {'live birth', 'lay eggs', 'live in water', 'breathe water', 'have hair', 'have fins', 'have scales', 'breathe air', 'can swim'}


### Suppose we want to find what is in one set but not the other.
### The set `difference` method can do that.

In [12]:
# What do whales have but not fish?
print(whales.difference(fish))

# And what fish have but not whales?
print(fish.difference(whales))

{'live birth', 'breathe air', 'have hair'}
{'lay eggs', 'have scales', 'breathe water'}
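
One of the other available operations is `symmetric_difference`, which returns everything that is in exactly one of the two sets. The sketch below uses separate variable names (`whale_traits`, `fish_traits`) so that it does not disturb the sets used in the surrounding cells.

```python
whale_traits = {'live birth', 'breathe air', 'have hair', 'can swim',
                'live in water', 'have fins'}
fish_traits = {'have scales', 'lay eggs', 'breathe water', 'can swim',
               'live in water', 'have fins'}

# Traits belonging to exactly one of the two animals (same as whale_traits ^ fish_traits)
in_one_only = whale_traits.symmetric_difference(fish_traits)
print(in_one_only)

# It equals the union of the two one-sided differences
same = in_one_only == (whale_traits - fish_traits) | (fish_traits - whale_traits)
print(same)
```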


### Removing items from a set can be done using the `remove` or `discard` methods.

In [30]:
print(whales)
whales.discard('breathe air')
print(whales)


{'live birth', 'live in water', 'have hair', 'have fins', 'breathe air', 'can swim'}
{'live birth', 'live in water', 'have hair', 'have fins', 'can swim'}


### What is the difference between `remove` and `discard`?

In [31]:
print(whales)

try:
    whales.discard('breathe air')
except Exception as e:
    print('discard error:', e)

try:
    whales.remove('breathe air')
except Exception as e:
    print('remove error:', e)

{'live birth', 'live in water', 'have hair', 'have fins', 'can swim'}
remove error: 'breathe air'
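
The output above shows the difference: `discard` silently does nothing when the element is missing, while `remove` raises a `KeyError` whose message is the missing element. A minimal sketch, using a small made-up set named `traits`:

```python
traits = {'live birth', 'have hair', 'can swim'}

traits.discard('breathe air')    # not present: nothing happens, no error

try:
    traits.remove('breathe air') # not present: raises KeyError
except KeyError as e:
    missing = e.args[0]
    print("KeyError for missing element:", missing)

# A membership test avoids the exception when using remove
if 'can swim' in traits:
    traits.remove('can swim')
print(traits)
```

Use `remove` when a missing element indicates a bug you want to hear about, and `discard` when it does not matter whether the element was there.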


## North Carolina voting sites and population during the Nov 2020 election
### Let us say we want to determine how many voting sites were available in each county during the November 2020 elections.
* The State of North Carolina makes many datasets about the [demography](https://www.osbm.nc.gov/facts-figures/population-demographics) of NC publicly available.
* [Data related to elections](https://www.ncsbe.gov/) is made available by the NC State Board of Elections.
* One such dataset covers [polling places](https://www.ncsbe.gov/results-data/polling-place-data).
* We will be working with the [polling place data from Nov. 3, 2020](https://s3.amazonaws.com/dl.ncsbe.gov/ENRS/2020_11_03/polling_place_20201103.csv).
* The data is provided as a CSV file, see below.
* We have extracted just the `county_name` field from this spreadsheet and made it available as the string numpy array `county_poll_locations`.

<img src="Unit-2-4-NC-Polling-Places.jpg" style="display:inline; width:800px; float:bottom" />


### Write a function called `setOfUniqueCountiesInArray` that given a numpy string array of counties in the variable `names` returns a set of the unique counties in `names`.

In [32]:
def setOfUniqueCountiesInArray(names):
    '''
    Given a Python array of county names, return the unique counties in a set
    '''
    # Unique counties in names

    unique_counties = set(names)

    return unique_counties


uc_set = setOfUniqueCountiesInArray(county_poll_locations)

print('There are', len(uc_set), 'unique counties in the county_poll_locations array.')

print('The unique counties are:')
print(*uc_set, sep=', ')

There are 100 unique counties in the county_poll_locations array.
The unique counties are:
CAMDEN, ONSLOW, GATES, HYDE, PERQUIMANS, HENDERSON, MOORE, DAVIE, CHOWAN, WILKES, CABARRUS, DURHAM, RANDOLPH, ASHE, GREENE, IREDELL, PITT, CLAY, EDGECOMBE, DARE, CHEROKEE, GUILFORD, CURRITUCK, NORTHAMPTON, ROBESON, RUTHERFORD, ROCKINGHAM, SCOTLAND, GASTON, MARTIN, CUMBERLAND, TYRRELL, STOKES, HOKE, MADISON, CALDWELL, AVERY, GRANVILLE, GRAHAM, SAMPSON, YADKIN, WARREN, CATAWBA, FRANKLIN, COLUMBUS, RICHMOND, WAYNE, WATAUGA, MITCHELL, HALIFAX, BEAUFORT, ORANGE, MECKLENBURG, POLK, ALEXANDER, JONES, NASH, DUPLIN, VANCE, CASWELL, JOHNSTON, CLEVELAND, WASHINGTON, BURKE, LENOIR, PENDER, WILSON, SURRY, CRAVEN, NEW HANOVER, SWAIN, PASQUOTANK, LEE, LINCOLN, STANLY, CHATHAM, TRANSYLVANIA, HARNETT, DAVIDSON, BRUNSWICK, ALAMANCE, BUNCOMBE, YANCEY, BERTIE, PAMLICO, ANSON, MCDOWELL, ALLEGHANY, CARTERET, ROWAN, PERSON, FORSYTH, HERTFORD, HAYWOOD, JACKSON, MONTGOMERY, MACON, WAKE, UNION, BLADEN
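
Notice that the counties come out in an arbitrary order, because a set is unordered. If the first-seen order matters, one common alternative is `dict.fromkeys`, since dictionaries keep insertion order. This is a sketch on a short made-up list, not on the real polling data.

```python
# Hypothetical sample standing in for the county name array
sample_names = ['ALAMANCE', 'ALAMANCE', 'WAKE', 'ORANGE', 'WAKE']

unique_unordered = set(sample_names)               # order is not guaranteed
unique_in_order = list(dict.fromkeys(sample_names))  # keeps first-seen order

print(unique_unordered)
print(unique_in_order)
```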


### Write a function called `pollingLocationCounts` that given a numpy string array of county names with polling locations, returns a numpy array of unique county names and a numpy array of the number of polling locations in that county.

In [35]:
def pollingLocationCounts(names):
    '''
    Given a Python array of county names, returns a numpy array of the unique
    counties and a numpy array of the number of polling locations in each county
    '''
    # Unique counties and number of polling locations

    unique_counties_set = set(names)
    poll_location_count = []
    for county in unique_counties_set:
        poll_location_count.append(np.count_nonzero(names == county))

    return (np.array(list(unique_counties_set)), np.array(poll_location_count))

print(county_poll_locations)
county_array, poll_count_array = pollingLocationCounts(county_poll_locations)

# print('There are', len(county_array), 'unique counties in the county_poll_locations array.')
# print('The unique counties are:')
# print(*county_array, sep=', ')
# print('Number of poll locations by county is:', *poll_count_array)
comp116.array_to_html(poll_count_array, row_names=county_array, col_names=['Number'])

0       ALAMANCE
1       ALAMANCE
2       ALAMANCE
3       ALAMANCE
4       ALAMANCE
          ...   
2657      YANCEY
2658      YANCEY
2659      YANCEY
2660      YANCEY
2661      YANCEY
Name: county_name, Length: 2662, dtype: object


| County | Number |
|---|---|
| CAMDEN | 3 |
| ONSLOW | 24 |
| GATES | 6 |
| HYDE | 7 |
| PERQUIMANS | 7 |
| HENDERSON | 35 |
| MOORE | 26 |
| DAVIE | 14 |
| CHOWAN | 6 |
| WILKES | 27 |
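
The function above rescans the whole array once for every county. As an alternative sketch (not the approach the exercise asks for), `collections.Counter` from the standard library tallies every name in a single pass. The list `sample_names` is made-up data standing in for `county_poll_locations`.

```python
from collections import Counter

# Hypothetical sample standing in for county_poll_locations
sample_names = ['ALAMANCE', 'ALAMANCE', 'WAKE', 'ORANGE', 'WAKE', 'WAKE']

counts = Counter(sample_names)     # maps each county to its number of rows

for county in sorted(counts):      # sorted() gives a reproducible order
    print(county, counts[county])
```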


In [36]:
## Quick check
county = 'ORANGE'
print(np.sum(county_poll_locations == county))

41


## What attributes and methods are available for sets?

In [37]:
whales = set(['live birth', 'breathe air', 'have hair', 'can swim',
              'live in water', 'have fins'])
dir(whales)

['__and__',
 '__class__',
 '__class_getitem__',
 '__contains__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__iand__',
 '__init__',
 '__init_subclass__',
 '__ior__',
 '__isub__',
 '__iter__',
 '__ixor__',
 '__le__',
 '__len__',
 '__lt__',
 '__ne__',
 '__new__',
 '__or__',
 '__rand__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__ror__',
 '__rsub__',
 '__rxor__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__sub__',
 '__subclasshook__',
 '__xor__',
 'add',
 'clear',
 'copy',
 'difference',
 'difference_update',
 'discard',
 'intersection',
 'intersection_update',
 'isdisjoint',
 'issubset',
 'issuperset',
 'pop',
 'remove',
 'symmetric_difference',
 'symmetric_difference_update',
 'union',
 'update']

### Find a list of prime numbers by removing items from a set

* Let's find all the prime numbers from 2 to some max number, call it `limit`.
    * (Remember a prime number is a number that's only divisible by itself and one. One is not a prime number.)

* Every composite number $N$ (say 100) has a divisor no larger than $N^{1/2}$ ($100^{1/2} = 10$), so we only need to cross out the multiples of numbers up to $N^{1/2}$.

* So write the function `allPrimesUpto`, which takes one integer parameter, `limit`, and returns a set of all prime numbers up to `limit`.

* Note that this is not an efficient way to find primes when `limit` is large: it becomes computationally expensive. There are more sophisticated ways of finding primes, as well as of checking whether a number is prime.

* Primes are very important in cryptography.


In [41]:
def allPrimesUpto(limit):
    """
    Find all prime numbers from 2 up to and including limit.
    """

    all_numbers = set(range(2, limit + 1))  # candidates: every integer from 2 to limit
    print("Range 1:", int(limit ** 0.5) + 1)
    for number in range(2, int(limit ** 0.5) + 1):  # 2 up to the square root of limit
        for multiple in range(2 * number, limit + 1, number):  # multiples of number, up to limit
            all_numbers.discard(multiple)  # a multiple of a smaller number is not prime
    return all_numbers  # whatever remains in the set is prime

print("All prime numbers up to 100 are", allPrimesUpto(100))






Range 1: 11
All prime numbers up to 100 are {2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97}
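
One way to sanity-check a result like this is to test each number individually by trial division, using the same square-root bound. The helper `is_prime` below is a hypothetical name introduced only for this sketch.

```python
def is_prime(n):
    """Trial division: n is prime if no integer from 2 to floor(sqrt(n)) divides it."""
    if n < 2:
        return False
    for d in range(2, int(n ** 0.5) + 1):
        if n % d == 0:
            return False
    return True

# Build the set of primes below 30 with a set comprehension
primes_below_30 = {n for n in range(2, 30) if is_prime(n)}
print(sorted(primes_below_30))
```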


### NumPy provides a function to keep only unique items in an array

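* The advantage of NumPy's `np.unique` compared to sets is that it returns a sorted array, so you can reference items by index.

* But the result is an array, so it does not have the set methods `intersection`, `union`, `difference`, etc. (NumPy provides separate functions such as `np.intersect1d`, `np.union1d`, and `np.setdiff1d` for arrays.)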

In [45]:
words = '''
A primary object should be the education of our youth in the science of government. 
In a republic, what species of knowledge can be equally important? And what duty 
more pressing than communicating it to those who are to be the future guardians 
of the liberties of the country -- George Washington
'''

unique_words = np.unique( words.split() )
print('The unique words in the George Washington quote are', unique_words)

The unique words in the George Washington quote are ['--' 'A' 'And' 'George' 'In' 'Washington' 'a' 'are' 'be' 'can'
 'communicating' 'country' 'duty' 'education' 'equally' 'future'
 'government.' 'guardians' 'important?' 'in' 'it' 'knowledge' 'liberties'
 'more' 'object' 'of' 'our' 'pressing' 'primary' 'republic,' 'science'
 'should' 'species' 'than' 'the' 'those' 'to' 'what' 'who' 'youth']
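
If you need set-style operations while staying with NumPy arrays, NumPy offers the functions `np.intersect1d`, `np.union1d`, and `np.setdiff1d`. The sketch below uses two short made-up arrays.

```python
import numpy as np

# Hypothetical example arrays
a = np.array(['to', 'be', 'or', 'not', 'to', 'be'])
b = np.array(['be', 'the', 'future'])

common = np.intersect1d(a, b)     # unique values present in both arrays
combined = np.union1d(a, b)       # unique values present in either array
only_in_a = np.setdiff1d(a, b)    # unique values in a that are not in b

print(common)
print(combined)
print(only_in_a)
print(combined[0])                # the results are sorted arrays, so indexing works
```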


# Unpacking

* You may have noticed that we are using an asterisk `*` in `print`.

* The asterisk unpacks the items of a list, set, array, string, or other iterable into separate arguments.

* It is really handy for shorthand prints.

In [44]:
from_nyt = "The number of new daily Covid-19 cases has plunged 57 percent since peaking on Sept. 1."
print(from_nyt)
print(*from_nyt)
print(*from_nyt, sep="|")

arr = np.arange(10)
print(arr)
print(*arr)


The number of new daily Covid-19 cases has plunged 57 percent since peaking on Sept. 1.
T h e   n u m b e r   o f   n e w   d a i l y   C o v i d - 1 9   c a s e s   h a s   p l u n g e d   5 7   p e r c e n t   s i n c e   p e a k i n g   o n   S e p t .   1 .
T|h|e| |n|u|m|b|e|r| |o|f| |n|e|w| |d|a|i|l|y| |C|o|v|i|d|-|1|9| |c|a|s|e|s| |h|a|s| |p|l|u|n|g|e|d| |5|7| |p|e|r|c|e|n|t| |s|i|n|c|e| |p|e|a|k|i|n|g| |o|n| |S|e|p|t|.| |1|.
[0 1 2 3 4 5 6 7 8 9]
0 1 2 3 4 5 6 7 8 9
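
Unpacking is not limited to `print`. The sketch below shows three other places where `*` can be used; the function `add_three` is a hypothetical helper defined only for this illustration.

```python
def add_three(a, b, c):
    """Hypothetical helper: add three numbers."""
    return a + b + c

nums = [1, 2, 3]
total = add_three(*nums)           # same as add_three(1, 2, 3)

first, *rest = [10, 20, 30, 40]    # first item, then a list of everything else

merged = {*{1, 2}, *{2, 3}}        # unpack two sets into a new set (a union)

print(total)
print(first, rest)
print(merged)
```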
