<a/ id='top'></a>

# CSCI4022 Homework 5; A-Priori

## Due Friday, March 4 at 11:59 pm to Canvas and Gradescope

#### Submit this file as a .ipynb with *all cells compiled and run* to the associated dropbox.

***

Your solutions to computational questions should include any specified Python code and results as well as written commentary on your conclusions.  Remember that you are encouraged to discuss the problems with your classmates, but **you must write all code and solutions on your own**.

**NOTES**: 

- Any relevant data sets should be available on Canvas. To make life easier on the graders if they need to run your code, do not change the relative path names here. Instead, move the files around on your computer.
- If you're not familiar with typesetting math directly into Markdown then by all means, do your work on paper first and then typeset it later.  Here is a [reference guide](https://math.meta.stackexchange.com/questions/5020/mathjax-basic-tutorial-and-quick-reference) linked on Canvas on writing math in Markdown. **All** of your written commentary, justifications and mathematical work should be in Markdown.  I also recommend the [wikibook](https://en.wikibooks.org/wiki/LaTeX) for LaTeX.
- Because you can technically evaluate notebook cells in a non-linear order, it's a good idea to do **Kernel $\rightarrow$ Restart & Run All** as a check before submitting your solutions.  That way if we need to run your code you will know that it will work as expected. 
- It is **bad form** to make your reader interpret numerical output from your code.  If a question asks you to compute some value from the data you should show your code output **AND** write a summary of the results in Markdown directly below your code. 
- 45 points of this assignment are in problems.  The remaining 5 are for neatness, style, and overall exposition of both code and text.
- This probably goes without saying, but... For any question that asks you to calculate something, you **must show all work and justify your answers to receive credit**. Sparse or nonexistent work will receive sparse or nonexistent credit. 
- There is *not a prescribed API* for these problems.  You may answer coding questions with whatever syntax or object typing you deem fit.  Your evaluation will primarily live in the clarity of how well you present your final results, so don't skip over any interpretations!  Your code should still be commented and readable to ensure you followed the given course algorithm.
- There are two ways to quickly make a .pdf out of this notebook for Gradescope submission.  Either:
 - Use File -> Download as PDF via LaTeX.  This requires that your system path can find a working install of a TeX compiler.
 - Easier: Use File ->  Print Preview, and then Right-Click -> Print using your default browser and "Print to PDF"



---
**Shortcuts:**  [Problem 1](#p1) | [Problem 2](#p2) | [Extra Credit](#p3) |
---


In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.stats as stats
import statsmodels.api as sm
import itertools #may use for .combinations/similar, if desired.

***
<a/ id='p1'></a>
[Back to top](#top)
# Problem 1 (Practice: Candidate Items; 20 pts)

In the A-Priori algorithm, there is a step in which we create a candidate list of frequent itemsets of size $k+1$ as we prune the frequent itemsets of size $k$.  In this problem we will create two functions to do that formally.

#### Part A:

There are two types of data objects in which we might be holding the frequency counts of itemsets.  If $k=2$, they may be stored in a triangular array.  Create a function `cand_trips` that takes a triangular array and returns all valid candidate triples as a list.  Recall that the itemset $\{i,j,k\}$ is only a candidate if all 3 of the itemsets in $\{\{i,j\}, \{i,k\}, \{k,j\}\}$ are frequent.

Some usage notes:

- The first input argument is `triang_counts`,  a zero-indexed triangular (numeric) array, by same convention as introduced in class.
- The second input argument is the positive integer support threshold `s`.
- The underlying itemset is 0-indexed, so e.g. `[0,1,3]` is a valid triple.
- You should not convert the input list `triang_counts` into a list of triples as part of your function.
- The return array `candidates` should be a list of 3-index lists of the item numbers of the triples.  So a final answer for some input might be:

`cand_trips` =
    `[[0,3,4], [1,2,7]]`

- An implementation note: there are two fundamentally different ways to think about implementing this function.  Option 1 involves thinking about the elements of `triang_counts` in terms of their locations on the corresponding *triangular matrix*: scan row $i$ for a pair of frequent pairs $\{\{i,j\}, \{i,k\}\}$ and then check if $\{j,k\}$ is in fact frequent.  Option 2 scans all of `triang_counts` for frequent item pairs (the "pruning" step) and saves those in some object with their indices, then scans *that* object for candidates.  Both are valid for this problem, but option 2 may generalize to higher $k$ better...
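As a rough illustration of Option 1, the row scan might look like the sketch below. It assumes the row-major triangular layout used in class; the names `tri_index` and `cand_trips_rowscan` are hypothetical and are not part of the required interface.

```python
import math

def tri_index(i, j, n):
    """Position of pair (i, j), with i < j, in a row-major triangular array over n items."""
    return i * n - i * (i + 1) // 2 + (j - i - 1)

def cand_trips_rowscan(triang_counts, s):
    """Option 1: in each row i, pair up frequent entries {i,j}, {i,k} and confirm {j,k}."""
    # The array holds n*(n-1)/2 counts, so n = (1 + sqrt(1 + 8*len)) / 2.
    n = (1 + math.isqrt(1 + 8 * len(triang_counts))) // 2
    candidates = []
    for i in range(n):
        # Columns j > i for which the pair {i, j} meets the support threshold.
        row = [j for j in range(i + 1, n) if triang_counts[tri_index(i, j, n)] >= s]
        for a in range(len(row)):
            for b in range(a + 1, len(row)):
                j, k = row[a], row[b]
                # The triple is a candidate only if the third pair is frequent too.
                if triang_counts[tri_index(j, k, n)] >= s:
                    candidates.append([i, j, k])
    return candidates

print(cand_trips_rowscan([10, 7, 3, 2, 6, 4, 3, 3, 6, 0], 6))
```

Because every triple is discovered from its smallest item's row, no duplicate check is needed in this version.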

In [2]:
index = lambda i,j,n: int(i*(n-(i+1)/2)+j-i-1)

In [3]:
index(3,4,5), index(0,1,3)
#yay

(9, 0)

In [4]:
np.union1d([1,0], [0,2])

array([0, 1, 2])

In [5]:
def cand_trips(t_c, s):
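    #a triangular array over n items has n*(n-1)/2 entries, so solve for n.
    #(len/2 only happens to be right for the 5-item test case.)
    num_items = int(round((1 + np.sqrt(1 + 8*len(t_c)))/2))
    #cant use ndarrays here as we dont know their length.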
    freq_items = []
    candidates = []
    
    for i in range(num_items):
        for j in range(i+1, num_items): #for 5: 1,2,3,4 -> 2,3,4 -> 3,4 -> 4
            t_i = index(i,j,num_items)
            if t_c[t_i] >= s:
                freq_items.append((i,j))
                
    #iterate over all combinations of two frequent pairs
    for x,y in itertools.combinations(freq_items, 2):
        #construct the union of the two tuples here
        #if it has len 3 the pairs share exactly 1 item; otherwise we ignore it
        k_len = list(np.union1d(x,y)) #cheap cast to list, the union has at most 4 items
        if len(k_len) == 3:
            #skip triples we already found before doing the extra set op.
            if k_len not in candidates:
                #ex. (0,1), (0,2) -> need (1,2) to be frequent -> the xor of the two pairs
                uniq_vals = tuple(np.setxor1d(x,y)) #sorted, so it matches the (i,j) with i<j format
                if uniq_vals in freq_items:
                    candidates.append(k_len)
    
    return candidates

In [6]:
triang_counts=[10,7,3,2,6,4,3,3,6 ,0]
cand_trips(triang_counts, 4)

[[0, 1, 2]]

#### Part B:

A quick test case.  Below is a matrix $C$ and code including its corresponding triangular array.  

$C=\begin{bmatrix}
\cdot &10&7&3&2\\
\cdot &\cdot&6&4&3\\
\cdot &\cdot&\cdot&3&6\\
\cdot &\cdot&\cdot&\cdot&0\\
\cdot &\cdot&\cdot&\cdot&\cdot\\
\end{bmatrix}$
 
Input the given list into your function to verify that it returns the correct valid triples at $s=1$ and $s=6$.

In [7]:
triang_counts=[10,7,3,2,6,4,3,3,6 ,0]
print('For s>=6, candidate:', cand_trips(triang_counts, 6))
print('For s>=1, candidates:', cand_trips(triang_counts, 1))

#Check that...
#cand_trips(triang_counts, 1) returns all the possible triples except those that contain BOTH items 3 and 4.
#cand_trips(triang_counts, 6) returns only the triple [[0,1,2]]

For s>=6, candidate: [[0, 1, 2]]
For s>=1, candidates: [[0, 1, 2], [0, 1, 3], [0, 1, 4], [0, 2, 3], [0, 2, 4], [1, 2, 3], [1, 2, 4]]


#### Part C:

Suppose instead that our $k=2$ item counts were stored in a list of the form e.g.
`pairs_counts` =
    `[[0,1,12], [0,2,0], [0,3,11], ..., [7,8,103]]`
    
Where each element is a triple storing the two item indices and their count, $[i,j,c_{ij}]$. 

Create a function `cand_trips_list` that takes in a list of pairs counts and returns all valid candidate triples as a list.  

Some usage notes:

- The first input argument is `pairs_counts`,  a zero-indexed list of triples.
- The second input argument is the positive integer support threshold `s`.
- The underlying itemset is 0-indexed, so e.g. `[0,1,3]` is a valid triple.
- The return array `candidates` should be a list of 3-element lists, as above.

You should **not** convert the input list `pairs_counts` into a triangular array as part of your function.  After all, sometimes we use the list format for pairs because it saves memory compared to the triangular array format!  You may be able to borrow heavily from the logic of your first function, though!
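One way to borrow the earlier logic while staying in the list format is to keep the frequent pairs in a set, so that each "is this pair frequent?" check is a constant-time lookup. The sketch below is only an illustration: it assumes every pair appears once with `i < j`, and the name `cand_trips_from_pairs` is hypothetical.

```python
from itertools import combinations

def cand_trips_from_pairs(pairs_counts, s):
    """Candidate triples from a list of [i, j, count] entries, without a triangular array."""
    # Prune: keep only pairs that meet the support threshold, stored for fast lookup.
    frequent = {(i, j) for i, j, c in pairs_counts if c >= s}
    # Only items that appear in some frequent pair can appear in a candidate triple.
    items = sorted({x for pair in frequent for x in pair})
    candidates = []
    for i, j, k in combinations(items, 3):
        # All three 2-item subsets must be frequent.
        if (i, j) in frequent and (i, k) in frequent and (j, k) in frequent:
            candidates.append([i, j, k])
    return candidates

demo_pairs = [[0, 1, 10], [0, 2, 7], [0, 3, 3], [0, 4, 2],
              [1, 2, 6], [1, 3, 4], [1, 4, 3],
              [2, 3, 3], [2, 4, 6],
              [3, 4, 0]]
print(cand_trips_from_pairs(demo_pairs, 6))
```

Looping over all item triples is cubic in the number of surviving items, which is fine for a handful of items but would be worth replacing with a pair-driven scan on larger inputs.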

In [8]:
#:o
def cand_trips_list(pairs_counts, s):
    #iterate over all combinations 
    freq_items = []
    candidates = []
    
    #we just iterate elementwise here.
    for x,y,z in pairs_counts:
        if z >= s:
            freq_items.append((x,y))
    
    for x,y in itertools.combinations(freq_items, 2):
        k_len = list(np.union1d(x,y))
        if len(k_len) == 3:
            if k_len not in candidates:
                uniq_vals = tuple(np.setxor1d(x,y))
                if uniq_vals in freq_items:
                    candidates.append(k_len)
                    
    return candidates

#### Part D:

Do the test case again.  Below is the list representation of the same matrix $C$ from part B.  
 
Input the given list into your function to verify that it returns the correct valid triples at $s=1$ and $s=6$.

In [9]:
pairs_counts=[[0,1,10], [0,2,7], [0,3,3], [0,4,2],\
             [1,2,6],[1,3,4], [1,4,3],\
             [2,3,3],[2,4,6],\
             [3,4,0]]
print(cand_trips_list(pairs_counts, 6))
print(cand_trips_list(pairs_counts, 1))
#Check that...
#cand_trips_list(pairs_counts, 1) returns all the possible triples except those that contain BOTH items 3 and 4.
#cand_trips_list(pairs_counts, 6) returns only the triple [[0,1,2]]

[[0, 1, 2]]
[[0, 1, 2], [0, 1, 3], [0, 1, 4], [0, 2, 3], [0, 2, 4], [1, 2, 3], [1, 2, 4]]


#### Part E

Describe *in words* how you would generalize your code in part D to work for generating candidate quadruples $[i_1, i_2, i_3, i_4]$ from an input list of triples counts (each element of the form $[i, j, k, c_{ijk}]$).

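The list iteration would change first: for each element $x$, the count is $x[-1]$ and the itemset is $x[:-1]$ cast as a tuple, so the same loop works for any $k$.

The union step also carries over: two frequent triples can form a quadruple only when their union has exactly $k+1 = 4$ items, so the length check should be parameterized by $k$ (tracked by the pass counter during A-Priori). The final check does need to change, though. With pairs, the xor of the two pairs is the single remaining pair to verify, but with triples the xor has only two items, so instead I would check that all four 3-item subsets of the candidate quadruple are in the frequent list.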
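The generalization described above could be sketched as follows. This is an illustration under the assumption that frequent itemsets are stored as sorted tuples; `candidates_next` is a hypothetical name.

```python
from itertools import combinations

def candidates_next(freq_itemsets):
    """Given frequent k-itemsets (sorted tuples), return candidate (k+1)-itemsets."""
    frequent = set(freq_itemsets)
    if not frequent:
        return []
    k = len(next(iter(frequent)))
    candidates = set()
    for a, b in combinations(sorted(frequent), 2):
        union = tuple(sorted(set(a) | set(b)))
        if len(union) != k + 1:
            continue  # a and b must share exactly k-1 items
        # Every k-item subset of the union must itself be frequent.
        if all(sub in frequent for sub in combinations(union, k)):
            candidates.add(union)
    return sorted(candidates)

freq_triples = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3), (0, 1, 4)]
print(candidates_next(freq_triples))
```

The same function handles the pairs-to-triples step, since nothing in it depends on a particular value of $k$.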


***
<a/ id='p2'></a>
[Back to top](#top)
# Problem 2 (Practice: A-Priori; 25 pts) 

Consider the recipe data set provided in `recipes.npy` (use `np.load`).  This includes 100,000 recipes from a variety of sources.

We want to use the baskets and the ingredients therein (see `ingredients.npy`) to perform an item basket analysis.

This data set is small enough to run directly from main memory, so you may do that if you wish.

Loading and accessing the data set is shown below:

In [13]:
recipes=np.load('../recipes.npy', allow_pickle=True)
ingredients=np.load('../ingredients.npy', allow_pickle=True)

In [14]:
print(recipes[:2]) #list of lists
print(ingredients[:2]) #inventory list
print(ingredients[recipes[0]]) #to access a recipe by string

[array([ 233, 2754,   42,  120,  560,  345,  150, 2081,   12,   21])
 array([ 198,  249,    2,  194, 1884,  791,  965,  423,   53,   48,  798,
         31,  362, 1031,   94,   26,    8])]
['salt' 'pepper']
['basil leaves' 'focaccia' 'leaves' 'mozzarella' 'pesto' 'plum tomatoes'
 'rosemary' 'sandwiches' 'sliced' 'tomatoes']


In [15]:
ingredients[9]

'ground'



#### a) Since the ingredients file already provides integer codes for each of our items, we can move directly into counting via the A-Priori algorithm.  Using the two given files, create a table of frequent single items at a 1% support threshold. You may use Python's native classes to set up your lookup functions/tables.

In [16]:
index = lambda i,j,n: int(i*(n-(i+1)/2)+j-i-1)

In [17]:
def tri_c(t_c,s,m):
    freq_items = []
    for i in range(m):
        for j in range(i+1,m): #for 5: 1,2,3,4 -> 2,3,4 -> 3,4 -> 4
            t_i = index(i,j,m)
            if t_c[t_i] >= s:
                freq_items.append((i,j))
    return freq_items

In [18]:
#s -> supp threshold
#k -> number of passes

def a_priori(baskets, items, s, k):
    s = s*len(baskets) #let s be a fraction of baskets.
    counts = np.zeros(len(items)) #use the items argument rather than the global ingredients
    c = 1
    #first pass -----
    for x in baskets: 
        #single recipe
        for y in x:
            #single ingredient
            counts[y] += 1     
            
    m_items = np.zeros(len(items)) #m_items[x] is the 1-indexed rank of item x, or 0 if infrequent
    m = 0 #this is the number of frequent singles.
    for x in range(len(counts)):
        if counts[x] >= s:
            m += 1
            m_items[x] = m

    if k < 2:
        #just a helper func to translate to strings
        ind = []
        for x in range(len(counts)):
            if m_items[x] > 0:
                ind.append(x)
        return items[ind], m_items
    
    c+=1 #just incrementing to represent num passes
    m_c_2 = int(np.ceil((m**2/2)))
    count_pairs = np.zeros(m_c_2) 

    for x in baskets:
        #routine to check which values in basket are freq
        freq = []
        for y in range(0, len(x)):
            m_val = m_items[x[y]]
            #we need to use the m-value here as we want to make sure
            #our indices translate properly to the tri_mat
            if m_val > 0:
                freq.append(m_val-1) 
                #we need to minus one as our indices are operating on values being 0 indexed - 1-m is 1 indexed.
        #we need to sort frequency here as 0 < i < j <= m. If we dnt sort then i can be g.t. j
        #sorting is pretty cheap too.
        for i,j in itertools.combinations(sorted(freq), 2):
            #calculate pos'n in triangular array. 
            #increment count
            ind = index(i,j,m)
            count_pairs[ind] += 1
    freq_pairs = tri_c(count_pairs, s, m)
    #terribly hard coded but will work
    return freq_pairs, m_items, counts, count_pairs    

In [19]:
f, m, c, c_p = a_priori(recipes, ingredients, 0.1, 2)

In [20]:
#sanity check
n_both = 0 #separate name so we dont overwrite the counts array c
for x in recipes:
    if 0 in x and 1 in x:
        n_both += 1
print(c_p[0] == n_both) #index(0,1,m) == 0 -> checking items 0/1 count for comparison

True


In [21]:
ing, ind = a_priori(recipes, ingredients, 0.01, 1)
len(ing), ing

(293,
 array(['salt', 'pepper', 'butter', 'garlic', 'sugar', 'flour', 'onion',
        'olive oil', 'water', 'ground', 'olive', 'powder', 'sliced',
        'eggs', 'black pepper', 'milk', 'cheese', 'cream', 'lemon',
        'chicken', 'sauce', 'tomatoes', 'brown', 'white', 'egg', 'onions',
        'vinegar', 'vegetable', 'brown sugar', 'lemon juice',
        'ground black pepper', 'parsley', 'cinnamon', 'garlic cloves',
        'extract', 'vegetable oil', 'vanilla', 'baking powder',
        'vanilla extract', 'unsalted butter', 'ginger', 'chocolate',
        'leaves', 'soda', 'parmesan', 'tomato', 'celery', 'potatoes',
        'kosher salt', 'mustard', 'cheddar', 'juice', 'baking soda',
        'kosher', 'sour cream', 'cilantro', 'soy sauce', 'cream cheese',
        'parmesan cheese', 'cheddar cheese', 'oregano', 'red pepper',
        'carrots', 'clove', 'chicken broth', 'mushrooms', 'honey',
        'packed', 'orange', 'bread', 'thyme', 'oil', 'basil', 'seasoning',
        'extra', 'm

Was 1% an appropriate support threshold?  Describe why or why not.  Keep in mind, the goal here is twofold: you want "actionable" conclusions, and output that's small enough that you or your grader can make sure that you have the right set!

From a purely theoretical lens, 1% seems alright for very large datasets with many distinct items and little overlap. We expect considerable overlap here: with 100k recipes and about 3.5k ingredients, many ingredients will appear in at least 1,000 baskets, and indeed 293 items clear the threshold. We cannot really draw actionable conclusions from that many frequent items, and the list is too long to check by eye, so 1% is too low.

If we wanted to compute frequent pairs from these 293 items, we would need about $293^2/2 \approx 42{,}925$ pair counts in memory. This isn't a big deal computationally, but it could become an issue if the dataset scaled. One interesting note is that our ingredient list isn't large enough to create a memory bottleneck even without pruning: a triangular array over all ~3,500 ingredients holds about $3500^{2}/2 \approx 6.1\times10^{6}$ counts, which at 32 bits each is roughly 25 MB, far less than a GB. Most modern computers have around 8 GB of RAM, of which a program might expect perhaps 1/8th to be accessible. 
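The memory estimate can be checked with a few lines. This sketch uses the exact pair count $n(n-1)/2$ and assumes 4-byte (32-bit) counts; `triangular_size` is a hypothetical helper name.

```python
def triangular_size(n_items, bytes_per_count=4):
    """Return (number of pairs, bytes) for a triangular count array over n_items."""
    n_pairs = n_items * (n_items - 1) // 2
    return n_pairs, n_pairs * bytes_per_count

# 293 frequent items at 1% support; roughly 3,500 ingredients overall.
for n in (293, 3500):
    n_pairs, n_bytes = triangular_size(n)
    print(f"{n} items: {n_pairs:,} pairs, about {n_bytes / 1e6:.2f} MB")
```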


#### b) Use A-priori to find all frequent  pairs of items from your set of frequent items in a).  Use whatever support threshold you feel is most appropriate, but make sure your result is readable: you should list the top handful of most frequent pairs, sorted by their prevalence.

Report the confidences of the two association rules corresponding to the most frequent item pair.


In [76]:
f, m, c, c_p = a_priori(recipes, ingredients, 0.05, 2) #0.05*100000 -> 5000. 

In [23]:
print(f"We would have to generate around: {int(max(m)**2/2)} pairs")

We would have to generate around: 1682 pairs


In [24]:
ind_sort = []
num_freq = int(max(m))
for x in range(0, len(f)):
        ind_sort.append((x, c_p[index(f[x][0], f[x][1], num_freq)]))
ind_sort.sort(key=lambda x:x[1], reverse=True) #just sort by the second value

prevalence = [f[x[0]] for x in ind_sort]

#pairs to strings.
for x in range(0, len(prevalence)):
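        #m holds 1-indexed ranks of frequent items, so rank r+1 maps back to the ingredient index
        ind1 = int(np.where(m == prevalence[x][0] + 1)[0][0])
        ind2 = int(np.where(m == prevalence[x][1] + 1)[0][0])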
        ind = ingredients[[ind1,ind2]]
        prevalence[x] = ind
        
#few blocks of code to sort by count + convert indi to strings for interpretability

In [49]:
prevalence[:25], len(prevalence)

([array(['salt', 'pepper'], dtype='<U39'),
  array(['olive oil', 'olive'], dtype='<U39'),
  array(['pepper', 'garlic'], dtype='<U39'),
  array(['salt', 'sugar'], dtype='<U39'),
  array(['pepper', 'black pepper'], dtype='<U39'),
  array(['pepper', 'ground'], dtype='<U39'),
  array(['salt', 'flour'], dtype='<U39'),
  array(['salt', 'butter'], dtype='<U39'),
  array(['salt', 'garlic'], dtype='<U39'),
  array(['butter', 'sugar'], dtype='<U39'),
  array(['sugar', 'flour'], dtype='<U39'),
  array(['pepper', 'olive'], dtype='<U39'),
  array(['pepper', 'olive oil'], dtype='<U39'),
  array(['pepper', 'onion'], dtype='<U39'),
  array(['butter', 'flour'], dtype='<U39'),
  array(['garlic', 'olive'], dtype='<U39'),
  array(['garlic', 'olive oil'], dtype='<U39'),
  array(['garlic', 'onion'], dtype='<U39'),
  array(['salt', 'onion'], dtype='<U39'),
  array(['salt', 'olive'], dtype='<U39'),
  array(['salt', 'olive oil'], dtype='<U39'),
  array(['ground', 'black pepper'], dtype='<U39'),
  array(['garli

In [48]:
np.flip(ingredients[np.argsort(c)[-5:]]) #gets 5 largest counts, prob a better function to call here but w/e

array(['salt', 'pepper', 'sugar', 'garlic', 'butter'], dtype='<U39')

Most of these associations are quite intuitive, which validates the function via an eye test. We might want to calculate confidence and rank by confidence rather than prevalence, however, as we expect salt/pepper to appear in the most pairs: they are the most prevalent singletons (added an np.argsort call to display this). As such, our most prevalent pairs should contain salt or pepper.

Confidence(i,j): $conf({I}\rightarrow{J}) = \frac{support(I\cup{J})}{support(I)}$
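To make the definition concrete, here is a small sketch on made-up baskets. The toy data and the helper names `support` and `confidence` are illustrative assumptions, not the recipe data set.

```python
def support(baskets, itemset):
    """Number of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for basket in baskets if itemset <= set(basket))

def confidence(baskets, I, J):
    """conf(I -> J) = support(I u J) / support(I); assumes support(I) > 0."""
    return support(baskets, set(I) | set(J)) / support(baskets, I)

toy_baskets = [
    ["salt", "pepper"],
    ["salt", "pepper", "garlic"],
    ["salt", "sugar"],
    ["salt", "flour"],
    ["pepper"],
]
print(confidence(toy_baskets, ["salt"], ["pepper"]))
print(confidence(toy_baskets, ["pepper"], ["salt"]))
```

The two rules share a numerator but divide by different supports, which is why the two directions generally give different values.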

In [63]:
#we want the confidence of the two association rules corresponding to the most frequent item pair
#therefore we find confidence of i -> salt, j -> pepper; i -> pepper, j -> salt. 

#first lets strip the item -> index.

salt_i = np.where(ingredients == "salt")
pepper_i = np.where(ingredients == "pepper")

s_Supp, p_Supp = c[salt_i], c[pepper_i]
sp_Supp = c_p[index(m[salt_i[0]]-1, m[pepper_i[0]]-1, num_freq)] #c_p is a triangular matrix, 
#we technically want the m values here but they are 1 indexed.

print(f"Salt -> Pepper confidence: {sp_Supp/s_Supp}")
print(f"Pepper -> Salt confidence: {sp_Supp/p_Supp}")
print(f"Baskets w salt: {s_Supp/len(recipes)}, w pepper: {p_Supp/len(recipes)}")

Salt -> Pepper confidence: [0.53350094]
Pepper -> Salt confidence: [0.58468497]
Baskets w salt: [0.42163], w pepper: [0.38472]


The importance of confidence here lies in separating items that simply appear in many baskets from those that appear in many baskets together. Both confidence values are above 50% and exceed the corresponding base rates (0.53 for salt $\to$ pepper against 38% of baskets containing pepper, 0.58 for pepper $\to$ salt against 42% containing salt), so there is a clear association between salt and pepper. 

Real-world intuition says that many savory recipes call for both salt and pepper as basic seasonings, so this relationship is not all that surprising. 

**c)**

Zach has to go to the store and stock his pantry.  He knows that his girlfriend has a (borderline unhealthy to those around her) love of garlic.  What should he make sure he has in stock?  What are the two most frequent $\{garlic, x\}$ item pairs, and what are the two most **interesting** $garlic \to X$ associations?

In [52]:
garlic_pairs = []
for x in prevalence:
    if "garlic" in x:
        garlic_pairs.append(x)

In [55]:
garlic_pairs[:2]

[array(['pepper', 'garlic'], dtype='<U39'),
 array(['salt', 'garlic'], dtype='<U39')]

These are both incredibly unsurprising as pepper and salt are the most prevalent singletons.

Interest(i,j): $int({I}\rightarrow{J}) = \frac{support(I\cup{J})}{support(I)} - P(J)$
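Since interest depends on which item is the antecedent, it can help to make the direction explicit. The sketch below uses made-up baskets, and the `interest` signature shown is a hypothetical one that takes the baskets directly.

```python
def interest(baskets, I, J):
    """int(I -> J) = support(I u J) / support(I) - P(J); assumes support(I) > 0."""
    I, J = set(I), set(J)
    n = len(baskets)
    supp_I = sum(1 for b in baskets if I <= set(b))
    supp_J = sum(1 for b in baskets if J <= set(b))
    supp_IJ = sum(1 for b in baskets if (I | J) <= set(b))
    return supp_IJ / supp_I - supp_J / n

toy_baskets = [
    ["garlic", "olive oil"],
    ["garlic", "olive oil", "salt"],
    ["garlic", "salt"],
    ["salt"],
    ["salt", "sugar"],
    ["sugar"],
]
print(interest(toy_baskets, ["garlic"], ["olive oil"]))   # garlic as antecedent
print(interest(toy_baskets, ["olive oil"], ["garlic"]))   # olive oil as antecedent
```

For a question about $garlic \to X$ rules, garlic should always be passed as `I`, regardless of the order in which the pair happens to be stored.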

In [64]:
interest = lambda ij,i,pj: ij/i - pj #int/int - float -> float

In [77]:
l_i = np.zeros(len(garlic_pairs))
l_r = len(recipes)
for ind,x in enumerate(garlic_pairs):
    i,j = np.where(ingredients == x[0])[0][0], np.where(ingredients==x[1])[0][0] 
    #np.where gives arr(arr(ind, ind, etc.))
    i_s, j_s = c[i], c[j]
    pj = j_s/l_r
    
    ij_ind = index(m[i]-1, m[j]-1, num_freq) #1-indexed :L
    ij = c_p[ij_ind]
    i = interest(ij, i_s, pj)
    l_i[ind] = i

In [96]:
gp = np.array(garlic_pairs)
li_maxi = np.flip(np.argsort(l_i)[-4:])
gp[li_maxi], l_i[li_maxi] 

(array([['garlic', 'olive'],
        ['garlic', 'olive oil'],
        ['pepper', 'garlic'],
        ['garlic', 'onion']], dtype='<U39'),
 array([0.21352845, 0.21294542, 0.21049972, 0.15199928]))

Note: Somewhat confused by a few ingredients - might be an issue with how the ingredients were scraped. We see ingredients like ground/olive - these feel like parts of another ingredient.

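Regardless, we see that garlic with olive/olive oil are the two most interesting relations, with pepper/garlic and garlic/onion right behind. One caveat: the loop above uses the first-listed item of each pair as the antecedent, so the pepper/garlic value is the interest of pepper $\to$ garlic rather than garlic $\to$ pepper. The olive oil result is unsurprising, as garlic and olive oil are cornerstones of many American recipes. 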

***
<a/ id='p3'></a>
[Back to top](#top)
# Problem 3 (Extra-Credit: A-Priori with hashing and more baskets; 10 pts each part) 

The data set in 2 had two very appealing properties that we typically do **not** assume to be the case:
- It came with an ingredient list provided
- It was small enough to fit into main memory.

To fully implement the model, you can get some extra credit by attempting variants of the data that do not have those properties.  We will tackle each problem individually.  You should answer each problem *in its own, separate notebook* to ensure you're not using any variables from your solution to problem 2 above.

## EC1: A-P with hashing

#### EC1a) The file `recipesbying` contains the same data set as in problem 2, but the strings themselves live in each recipe.

Create a hash table as in nb08 that hashes each ingredient observed based on its string. In other words, create your own version of what **was** in `ingredients.npy` by creating your own hash and/or lookup functions.
Include a check to minimize and fix any collisions, as in nb08.


In [None]:
def hash_string(s):
    val = 0
    for i,x in enumerate(s):
        val += (i+1)*ord(x) #enumerate is 0 indexed, so weight each character by i+1.
    return val
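A position-weighted hash like the one above still needs a table and a collision rule before it can replace `ingredients.npy`. The sketch below is one possible approach and not necessarily the method from nb08: it assumes a fixed-size table with linear probing, and the names `hash_to_slot` and `insert` are hypothetical.

```python
def hash_to_slot(s, n_slots):
    """Position-weighted character sum, reduced to a slot in the table."""
    val = 0
    for i, ch in enumerate(s):
        val += (i + 1) * ord(ch)
    return val % n_slots

def insert(table, s):
    """Return the integer code (slot) for s, resolving collisions by linear probing."""
    n = len(table)
    slot = hash_to_slot(s, n)
    for _ in range(n):
        if table[slot] is None:   # free slot: claim it
            table[slot] = s
            return slot
        if table[slot] == s:      # already stored: reuse its code
            return slot
        slot = (slot + 1) % n     # collision: try the next slot
    raise ValueError("hash table is full")

table = [None] * 7
codes = [insert(table, name) for name in ["ba", "salt", "ab"]]
print(codes)
```

With only 7 slots, `"ba"` and `"salt"` hash to the same slot, so the second insert is pushed to the next free position and each string still ends up with a distinct code.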


**EC1b)** Use A-priori to find all frequent items and all frequent pairs of items from your hashed data set in part EC1a).  Ensure that the results match those of problem 2.



## EC2: A-P with massive data

The `.npz` file `simplified-recipes-1M.npz` contains over 1 million recipes, and is the original source of the 100,000 recipes used in problem 2.  Using this file (and `ingredients.npy`, if desired), use A-priori to find all frequent items and all frequent pairs of items.  However, you should **not** load all of the file into main memory.  Instead, use `np.memmap` or other options to ensure that you never load into main memory more than 100,000 recipes at a time.  Include any processing in your submission, and use the same proportionate support threshold as you did in problem 2.  Do the most common items differ?

A few notes: 
- If you process the data to make it readable in other forms `.npy`, `.csv`, etc., that's fine, but show all processing code in your submission.
- For example, if you find `.memmap` hard to get working, you may convert to `.csv` and use `pd.read_csv` with arguments `chunksize` or `skiprows`, `nrows`
- You may be able to do the problem with very little additional work if you are clever about how you open the file and read over it.  In this case, set up your "loop" over baskets to only go over 100,000 rows of the file at a time, though, and be very explicit as to how you're avoiding the larger objects ever entering main memory.
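The "loop over 100,000 rows at a time" idea can be sketched independently of the file format. The example below assumes `baskets` is any lazy iterable of baskets (for instance a generator that reads rows from disk); how to obtain such an iterable from the `.npz` file is not shown here, and the names `iter_chunks` and `frequent_singles_chunked` are hypothetical.

```python
from collections import Counter
from itertools import islice

def iter_chunks(baskets, chunk_size):
    """Yield lists of at most chunk_size baskets from a lazy iterable."""
    it = iter(baskets)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk

def frequent_singles_chunked(baskets, s, chunk_size=100_000):
    """First A-Priori pass, holding only one chunk of baskets in memory at a time."""
    counts = Counter()
    n_baskets = 0
    for chunk in iter_chunks(baskets, chunk_size):
        for basket in chunk:
            counts.update(set(basket))  # count each item once per basket
        n_baskets += len(chunk)
    threshold = s * n_baskets           # s is a fraction of all baskets
    return {item for item, c in counts.items() if c >= threshold}, counts, n_baskets

# Toy stand-in for a large file: a generator, so nothing is materialized up front.
toy_stream = ([i % 3, 3] for i in range(10))
frequent, counts, n = frequent_singles_chunked(toy_stream, 0.4, chunk_size=4)
print(frequent, n)
```

Only the running counts persist between chunks, so peak memory is set by the chunk size and the number of distinct items rather than by the number of baskets.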