# Set Introduction

## Overview


In the previous lesson we learned about data structures in Python:

<ul>
    <li><b>Lists:</b> Ordered sequences of elements whose values can be changed</li>
    <li><b>Dictionaries:</b> Collections that map unique keys to data values</li>
</ul>
Lists help us store ordered values, such as a roster of Hogwarts students. Dictionaries store values under a given key. In the data cleaning process, we sometimes encounter datasets with millions of rows or many duplicate values. Removing duplicates from such a large file with only a list or dictionary would either take too long or require complicated code.

<b>Sets</b> are the Python data structure covered in this lesson. A set is an unordered collection of unique items, such as a collection of first names, so it never contains duplicates. Sets are helpful for managing repeat customer data in telephone logs or purchase histories. We will learn how to use set operations to compare two lists of customers.

## Learning Objectives
  

  <ol>
    <li>Describe a common problem</li>
    <li>Define sets in Python</li>
    <li>Understand why sets are used</li>
    <li>Create a set</li>
    <li>Set Operations: In, Not In, Union, Intersection, For Set</li>
    <li>Sets compared to other data structures</li>
  </ol>

## A Common Problem


Let’s revisit Molly’s cupcake business from “Visualizing Data with Graphs.” Molly's cupcake sales have exploded! She decides to open a second store and hires Bob as the manager. Molly and Bob write down the name of each customer who purchases a cupcake:

| MollyStore | BobStore  |
|------------|-----------|
| Suzie      | Fred      |
| Edgar      | Suzie     |
| Steven     | Steven    |
| Natalie    | Joseph    |
| Natalie    | Catherine |
| Suzie      | Steven    |
|            | Steven    |


Molly wants to analyze which customers shop at her store vs. Bob's store. However, she notices duplicate entries (e.g., Natalie frequently visits Molly's cupcake store but has not been to Bob's store). Molly wants to determine:
<ul><li>How to create a list of customers and eliminate repeat entries
    <li>Whether a person has ever shopped at her store 
    <li>Which customers shop at both stores
    <li>How to create a roster of all customers from either store
</ul>
    
    
Sets simplify datasets like these by removing duplicate names. We will learn what a set is, why to use one, and how set operations help generate Molly's analysis.


## What is a Set?

A set is an unordered collection of <b>unique</b> data elements. Sets do not contain any duplicate elements. Sets can quickly generate a list of unique customer names from a log of transactions or telephone calls. This is extremely helpful for later lessons when we use SQL to merge datasets on customer account information.
 
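In Python, we can add elements to a set or remove elements from it (sets are "mutable," which means changeable). Set items do not follow a specific order, so they cannot be accessed by index or position. An individual element also cannot be edited in place; to change it, we remove the old element and add a new one.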
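As a rough sketch of what this looks like in practice, the snippet below adds and removes names and then shows that indexing a set raises an error. The set name customers and the variable indexing_works are illustrative choices, not part of Molly's data.

```python
# Illustrative sketch: `customers` is a hypothetical set used only for this demo.
customers = {'Suzie', 'Edgar'}

customers.add('Natalie')      # add a new element
customers.add('Suzie')        # adding an existing element changes nothing
customers.discard('Edgar')    # remove an element (no error if it is absent)

print(sorted(customers))      # sorted() gives a predictable display order

# Sets have no positions, so indexing is not supported.
try:
    customers[0]
    indexing_works = True
except TypeError:
    indexing_works = False

print(indexing_works)
```

Printing through sorted() is only for readability here; the set itself still has no order.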

## Why use Sets?

Molly's example illustrates three reasons to use sets instead of lists or dictionaries:

<ol><li><b>Eliminate Duplicates:</b> If the same customer name is added to a set multiple times, only one copy is kept</li>
    <li><b>Membership Testing:</b> The In operator determines whether a name is in Molly's customer set</li>
    <li><b>Set Operations:</b> The intersection operation finds customers who have shopped at both stores, and the union operation generates a list of all customers from either store</li>
</ol>
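To make the first reason concrete, here is a minimal sketch that removes duplicates from Molly's log twice: once with only a list, and once with a set. The variable names molly_log, unique_with_list, and unique_with_set are assumptions made for this illustration.

```python
# Molly's handwritten log, duplicates included (names from the table above).
molly_log = ['Suzie', 'Edgar', 'Steven', 'Natalie', 'Natalie', 'Suzie']

# List-only approach: check every name against the names kept so far.
unique_with_list = []
for name in molly_log:
    if name not in unique_with_list:
        unique_with_list.append(name)

# Set approach: duplicates are dropped as the set is built.
unique_with_set = set(molly_log)

print(unique_with_list)
print(len(unique_with_set))
```

Both approaches keep the same four names. The list version rescans the kept names for every entry, which is one reason it slows down on very large files, while the set version needs a single line.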

## Creating a Set

There are two standard approaches to creating a set. We can list the elements inside braces to build the set directly. Alternatively, the set() function can convert a list to a set.

In [368]:
##Option #1: Brace Construction
MollyStore = {'Suzie','Edgar','Steven','Natalie','Natalie','Suzie'}
print(MollyStore)


{'Natalie', 'Steven', 'Suzie', 'Edgar'}


Note: Although Natalie shopped at Molly's store twice, her name only appears once. Python <b>automatically deletes</b> the repeating "Natalie" instances. The printed order may also differ from the order we typed, because sets are unordered. Even if we create a set with Natalie's name written seven times, the set will only contain one "Natalie" element:


In [369]:
NatalieTestSet={'Natalie','Natalie','Natalie','Natalie','Natalie','Natalie','Natalie'}
print(NatalieTestSet) #Set only contains one instance of "Natalie" 

{'Natalie'}


In [370]:
##Option #2: Converting a list to a set
BobStore = set(['Fred','Suzie','Steven','Joseph','Catherine','Steven','Steven'])
print(BobStore)

{'Joseph', 'Catherine', 'Steven', 'Suzie', 'Fred'}


The set() function converts a list to a set. Steven appears three times in the list but only once in the set, because all duplicates are removed.
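A small follow-up sketch, assuming we keep Bob's original list around under the hypothetical name bob_log: comparing lengths shows how many repeat entries were dropped, and sorted() turns the set back into an alphabetized list when a list is needed.

```python
# `bob_log`, `repeat_entries`, and `bob_roster` are names chosen for this example.
bob_log = ['Fred', 'Suzie', 'Steven', 'Joseph', 'Catherine', 'Steven', 'Steven']
BobStore = set(bob_log)

# Every entry beyond the first for a given name counts as a repeat.
repeat_entries = len(bob_log) - len(BobStore)
print(repeat_entries)

# sorted() accepts a set and returns a new, alphabetized list.
bob_roster = sorted(BobStore)
print(bob_roster)
```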

## Set Operations: In/Not In

The In and Not In operators determine whether an element is a member of a set. We can use them to see if a customer has ever shopped at Molly's store.

The example below checks whether 'Catherine' is in the MollyStore set. The result is stored in a boolean variable <b>CatherineCustomer</b> with two possible outcomes:
<ul>
    <li><b>True:</b> If 'Catherine' exists in the MollyStore set</li>
    <li><b>False:</b> If 'Catherine' cannot be found in the set</li>
</ul>

In [371]:
CatherineCustomer='Catherine' in MollyStore
print(CatherineCustomer)

False


The <b>Not In</b> operation checks the opposite situation. If Heather has not shopped at Bob's store, the result is True. Our boolean variable <b>HeatherNotCustomer</b> can return one of two values:
<ul>
    <li><b>True:</b> If Heather did not shop at Bob's store</li>
    <li><b>False:</b> If Heather did shop at Bob's store</li>
</ul>

In [376]:
HeatherNotCustomer='Heather' not in BobStore
print(HeatherNotCustomer)


True
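The two membership checks can be combined to answer Molly's question for any name. The sketch below is one possible way to do it; the helper function where_shopped is hypothetical and written only for this illustration.

```python
MollyStore = {'Suzie', 'Edgar', 'Steven', 'Natalie'}
BobStore = {'Fred', 'Suzie', 'Steven', 'Joseph', 'Catherine'}

def where_shopped(name):
    """Return a short description of which stores a customer has visited."""
    at_molly = name in MollyStore
    at_bob = name in BobStore
    if at_molly and at_bob:
        return 'both stores'
    if at_molly:
        return "Molly's store only"
    if at_bob:
        return "Bob's store only"
    return 'neither store'

for name in ['Suzie', 'Natalie', 'Catherine', 'Heather']:
    print(name, '->', where_shopped(name))
```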


## Set Operations: Intersection/Union

The intersection operation compares two sets and reports which elements are in both. Molly can use this operation to find which customers shop at both store locations. There are two equivalent techniques:
<ol><li>Calculate the intersection using the & sign: print(Set1&Set2)</li>
    <li>Use the intersection method: print(Set1.intersection(Set2))</li>
</ol>

In [377]:
print(MollyStore&BobStore)

{'Suzie', 'Steven'}


In [378]:
print(MollyStore.intersection(BobStore))

{'Suzie', 'Steven'}


Suzie and Steven are the only customers who appear in both store locations.

| MollyStore | BobStore |
|------------|----------|
| <font color="orange">Suzie</font> | Fred |
| Edgar | <font color="orange">Suzie</font> |
| <font color="orange">Steven</font> | <font color="orange">Steven</font> |
| Natalie | Joseph |
| Natalie | Catherine |
| <font color="orange">Suzie</font> | <font color="orange">Steven</font> |
|  | <font color="orange">Steven</font> |
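The two techniques give the same result when both sides are sets. One small difference is worth a quick sketch: the intersection method accepts any iterable, such as Bob's raw list, while the & sign requires a set on both sides. The names bob_log, both_stores, and operator_accepts_list are assumptions made for this example.

```python
MollyStore = {'Suzie', 'Edgar', 'Steven', 'Natalie'}
bob_log = ['Fred', 'Suzie', 'Steven', 'Joseph', 'Catherine', 'Steven', 'Steven']

# The method form accepts any iterable, so the raw list works directly.
both_stores = MollyStore.intersection(bob_log)
print(sorted(both_stores))

# The operator form needs a set on both sides; a list raises TypeError.
try:
    MollyStore & bob_log
    operator_accepts_list = True
except TypeError:
    operator_accepts_list = False

print(operator_accepts_list)
```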


Molly also requested a list of all customers from either location. We can use the union operation to combine all the names from Molly's store with Bob's store:
<ol><li>Calculate the union using the | sign (the pipe character, which on a US keyboard is to the right of the } key): print(Set1|Set2)</li>
    <li>Use the union method: print(Set1.union(Set2))</li>
</ol>

In [379]:
print(MollyStore|BobStore) #pipe operator

{'Joseph', 'Catherine', 'Natalie', 'Steven', 'Suzie', 'Fred', 'Edgar'}


In [382]:
print(MollyStore.union(BobStore))

{'Joseph', 'Catherine', 'Natalie', 'Steven', 'Suzie', 'Fred', 'Edgar'}
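As a quick sanity check on the union, customers who shop at both stores should be counted only once. The sketch below verifies this by comparing set sizes; the variable names all_customers, both_stores, and counted_once are chosen for this illustration.

```python
MollyStore = {'Suzie', 'Edgar', 'Steven', 'Natalie'}
BobStore = {'Fred', 'Suzie', 'Steven', 'Joseph', 'Catherine'}

all_customers = MollyStore | BobStore   # everyone from either store
both_stores = MollyStore & BobStore     # customers who shop at both

# Adding the two store sizes double-counts the shared customers,
# so subtracting the intersection size should give the union size.
counted_once = len(MollyStore) + len(BobStore) - len(both_stores)

print(len(all_customers))
print(counted_once == len(all_customers))
```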


## For Set


Finally, we can use a for loop (the "For Set" pattern) to step through the entire set and print each customer's name. Because sets are unordered, the names may print in a different order on your machine:

In [383]:
for item in (MollyStore|BobStore):
    print(item)
    

Joseph
Catherine
Natalie
Steven
Suzie
Fred
Edgar
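If Molly wants the roster in a predictable order, one option is to loop over a sorted copy of the set. This is a minimal sketch; the names all_customers and roster_lines are assumptions, and the numbering is added only for readability.

```python
MollyStore = {'Suzie', 'Edgar', 'Steven', 'Natalie'}
BobStore = {'Fred', 'Suzie', 'Steven', 'Joseph', 'Catherine'}

all_customers = MollyStore | BobStore

# sorted() returns an alphabetized list, so the loop order is the same on every run.
roster_lines = []
for number, name in enumerate(sorted(all_customers), start=1):
    roster_lines.append(f"{number}. {name}")

for line in roster_lines:
    print(line)
```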


## Summary 



In this lesson, we learned how to use sets to remove duplicates from a list of customers. Sets can help with the machine learning and data cleaning process by:
<ul>
    <li>Removing duplicate elements from a customer list
    <li>Using the In operator to check if a customer is in the set
    <li>Comparing two sets using the intersection and union operations
    <li>Printing each element using For Set
</ul>        
Sets complement lists and dictionaries because they let us manage larger customer datasets with repeated entries. Set operations can quickly compare the elements of one set to another. The techniques we applied to Molly's customer dataset will be used later on in our machine learning and SQL curriculum.
    