# Important note!

Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [None]:
YOUR_ID = "" # example: "rvuduc3"
COLLABORATORS = [] # list of strings of your collaborators' IDs

**Jupyter / IPython version check.** The following code cell verifies that you are using the correct version of Jupyter/IPython.

In [None]:
import IPython
assert IPython.version_info[0] >= 3, "Your version of IPython is too old, please update it."

---

# Part 5: Actual baskets!

Let's download some actual shopping basket data and find co-occurring pairs. We can use the shopping basket data available in someone else's [online tutorial](http://www.salemmarafi.com/code/market-basket-analysis-with-r/).

## The Requests module

Let's start by downloading the data using Python. The data we want is stored as a comma-separated values (CSV) file with variable-length rows. Its first few lines look like this:

```
citrus fruit,semi-finished bread,margarine,ready soups
tropical fruit,yogurt,coffee
whole milk
pip fruit,yogurt,cream cheese ,meat spreads
other vegetables,whole milk,condensed milk,long life bakery product
...
```

Each line is a basket and the items are separated by commas. An item appears _at most once_ in any given basket. (This scenario differs from the letters scenario in previous parts of this lab.)
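As a minimal sketch of how text in this format might be split into baskets, consider the following. The helper name `parse_baskets` and the sample string are hypothetical, introduced only for illustration. The sketch assumes item names should be kept exactly as they appear (so an item such as `cream cheese ` keeps its trailing space); whether to normalize whitespace is a choice left to you.

```python
# A small sample in the same format as the groceries file (copied from the lines above).
sample_text = """citrus fruit,semi-finished bread,margarine,ready soups
tropical fruit,yogurt,coffee
whole milk
"""

def parse_baskets(text):
    """Hypothetical helper: split CSV-like text into a list of item sets.

    Each non-empty line becomes one basket. A set is used because an item
    appears at most once in any given basket.
    """
    baskets = []
    for line in text.splitlines():
        if line:  # skip blank lines, e.g., a trailing newline at end of file
            baskets.append(set(line.split(',')))
    return baskets

baskets = parse_baskets(sample_text)
print(len(baskets), "baskets; the third one is", baskets[2])
```

Because the rows have variable length, each basket is parsed independently rather than into a fixed-width table.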

The [Requests module](http://requests.readthedocs.io/en/latest/user/quickstart/) makes it easy to download any web page as (raw) text.

In [None]:
import requests
response = requests.get ('http://www.salemmarafi.com/wp-content/uploads/2014/03/groceries.csv')
groceries_file = response.text  # or response.content for raw bytes

print (groceries_file[0:250]) # Prints the first 250 characters only

## Find the co-occurring items (21 points)

Write your own code to solve the pairwise association problem for the groceries data set. Your code must include a function with the signature,

```python
  def pairwise_assoc_miner (text, s):
      ...
```

This function should take a text string `text` and a positive integer threshold `s` as input. You may assume `text` is formatted just like the groceries file above. The function should return a list whose entries have the form `((i, j), c)`, where `i` and `j` are item names, `c` is the frequency of the pair, and the list contains exactly those pairs where `c >= s`. Our test code will simply call this function and check the list.

This problem exhibits symmetry in that $(i, j)$ and $(j, i)$ pairs may be regarded as the same. Your implementation can use any convention, i.e., it can include one or both, as long as it includes at least one of them if the count is `s` or higher.
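One possible convention for handling this symmetry, shown here only as a sketch on made-up toy data, is to store each pair in sorted order so that $(i, j)$ and $(j, i)$ map to the same key. The names `toy_baskets` and `pair_counts` are assumptions for illustration; this is one option among several and not a required approach.

```python
from collections import Counter
from itertools import combinations

# Toy baskets (made up for illustration; not taken from the groceries data).
toy_baskets = [{'milk', 'bread'}, {'bread', 'milk', 'eggs'}, {'eggs'}]

pair_counts = Counter()
for basket in toy_baskets:
    # Sorting the items first means every pair is generated as (smaller, larger),
    # so ('bread', 'milk') and ('milk', 'bread') are never counted separately.
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

print(pair_counts[('bread', 'milk')])  # 2: the pair occurs in the first two baskets
```

With this convention only one ordering of each pair is ever stored, which is allowed because the result needs to include at least one of the two orderings.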

Feel free to define any auxiliary functions you'd like; just make sure everything needed to run your solution appears in the cell below.

Lastly, try to write clear, readable code. For tricky bits, use lightweight documentation to explain what you are doing.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

The following cells will check your function on the 'groceries' data (`groceries_file` above). However, you probably want to develop your own smaller-scale tests before you run on the full data set. We recommend you create a few code cells and write your own code for testing and debugging. (We won't check or grade those.)
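A smaller-scale test might look like the following sketch. It uses a tiny, hand-countable input and a hypothetical helper, `canonicalize`, that puts every pair into a single ordering so results can be compared regardless of which convention an implementation uses. The list `candidate_result` is written by hand as a stand-in for what a call such as `pairwise_assoc_miner (toy_text, 2)` might return; it is not produced by any real implementation here.

```python
def canonicalize(result):
    """Hypothetical helper: turn a list of ((i, j), c) entries into a
    dictionary keyed by (smaller, larger), whatever order each pair came in."""
    canon = {}
    for (i, j), c in result:
        key = (i, j) if i <= j else (j, i)
        # If both (i, j) and (j, i) are present, their counts should agree.
        assert canon.get(key, c) == c, "Inconsistent counts for {}".format(key)
        canon[key] = c
    return canon

# Tiny input that is easy to count by hand.
toy_text = "a,b,c\na,b\nb,c\n"

# With s = 2: (a, b) occurs twice and (b, c) occurs twice; (a, c) occurs once
# and so should be excluded.
expected = {('a', 'b'): 2, ('b', 'c'): 2}

# Hand-written stand-in for the function's output on toy_text with s = 2.
candidate_result = [(('b', 'a'), 2), (('b', 'c'), 2), (('a', 'b'), 2)]

assert canonicalize(candidate_result) == expected
print("Toy test passed.")
```

Replacing `candidate_result` with the actual return value of your function turns this into a quick check before running on the full data set.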

In [None]:
try:
    freq_pairs = pairwise_assoc_miner (groceries_file, 500)
except NameError:
    raise AssertionError ("*** Your implementation did not appear to define a `pairwise_assoc_miner()` function as asked. ***")

In [None]:
freq_pairs_dict = {ij: c for (ij, c) in freq_pairs}
print (freq_pairs_dict)

# We *think* these values would be correct.
check_dict = {('rolls/buns', 'whole milk'): 557,
              ('whole milk', 'yogurt'): 551,
              ('other vegetables', 'whole milk'): 736}

for ((i, j), c) in check_dict.items ():
    # Accept either ordering of the pair, since (i, j) and (j, i) are equivalent.
    if (i, j) in freq_pairs_dict:
        assert freq_pairs_dict[(i, j)] == c
    elif (j, i) in freq_pairs_dict:
        assert freq_pairs_dict[(j, i)] == c
    else:
        raise AssertionError ('Pair ({}, {}) missing!'.format (i, j))
        
for ((i, j), c) in freq_pairs_dict.items ():
    if i <= j:
        assert (i, j) in check_dict and c == check_dict[(i, j)]
    else:
        assert (j, i) in check_dict and c == check_dict[(j, i)]
        
print ("\n(Passed.)")