# Day 21: Allergen Assessment

You reach the train's last stop and the closest you can get to your vacation island without getting wet. There aren't even any boats here, but nothing can stop you now: you build a raft. You just need a few days' worth of food for your journey.

You don't speak the local language, so you can't read any ingredients lists. However, sometimes, allergens are listed in a language you do understand. You should be able to use this information to determine which ingredient contains which allergen and work out which foods are safe to take with you on your trip.

You start by compiling a list of foods (your puzzle input), one food per line. Each line includes that food's ingredients list followed by some or all of the allergens the food contains.

Each allergen is found in exactly one ingredient. Each ingredient contains zero or one allergen. Allergens aren't always marked; when they're listed (as in (contains nuts, shellfish) after an ingredients list), the ingredient that contains each listed allergen will be somewhere in the corresponding ingredients list. However, even if an allergen isn't listed, the ingredient that contains that allergen could still be present: maybe they forgot to label it, or maybe it was labeled in a language you don't know.

For example, consider the following list of foods:

```text
mxmxvkd kfcds sqjhc nhms (contains dairy, fish)
trh fvjkl sbzzf mxmxvkd (contains dairy)
sqjhc fvjkl (contains soy)
sqjhc mxmxvkd sbzzf (contains fish)
```

The first food in the list has four ingredients (written in a language you don't understand): mxmxvkd, kfcds, sqjhc, and nhms. While the food might contain other allergens, a few allergens the food definitely contains are listed afterward: dairy and fish.

The first step is to determine which ingredients can't possibly contain any of the allergens in any food in your list. In the above example, none of the ingredients kfcds, nhms, sbzzf, or trh can contain an allergen. Counting the number of times any of these ingredients appear in any ingredients list produces 5: they all appear once each except sbzzf, which appears twice.

Determine which ingredients cannot possibly contain any of the allergens in your list. How many times do any of those ingredients appear?

In [1]:
# Python imports
from collections import defaultdict
from itertools import combinations
from pathlib import Path

import numpy as np

We load the data into a list of tuples using `load_data()`. Each tuple holds the set of ingredients and the set of listed allergens for one food (one line of the input).

In [2]:
def load_data(fpath):
    """Return a list of (ingredients, allergens) set pairs, one per food."""
    data = []
    with Path(fpath).open("r") as ifh:
        for line in (_.strip() for _ in ifh):
            if not line:  # skip blank lines rather than failing on them
                continue
            ingredients, allergens = line.split(" (contains ")
            allergens = set(allergens.rstrip(")").split(", "))
            ingredients = set(ingredients.split())
            data.append((ingredients, allergens))
    return data
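To make the resulting structure concrete, the sketch below parses a single line from the example in the same way as `load_data()`. The helper name `parse_food()` is hypothetical and not part of the notebook, and it assumes that every line ends with a `(contains ...)` clause, as in the example.

```python
def parse_food(line):
    """Split one food line into (ingredients, allergens) sets.

    Hypothetical helper mirroring the body of load_data(); it assumes the
    line always carries a "(contains ...)" clause.
    """
    ingredients, allergens = line.strip().split(" (contains ")
    return set(ingredients.split()), set(allergens.rstrip(")").split(", "))


ingredients, allergens = parse_food("mxmxvkd kfcds sqjhc nhms (contains dairy, fish)")
print(sorted(ingredients))  # ['kfcds', 'mxmxvkd', 'nhms', 'sqjhc']
print(sorted(allergens))    # ['dairy', 'fish']
```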

To find the non-allergens, we define a *pattern* (a `numpy` array) for each of the ingredients and the allergens. We use a `1` to indicate that the ingredient or allergen occurs in the food, and a `0` to indicate that it does not.

We can then use these patterns to identify the certain non-allergens: ingredients whose pattern is incompatible with every allergen pattern. An ingredient pattern is incompatible with an allergen pattern if at least one food lists the allergen but does not contain the ingredient. If an ingredient is compatible with *any* allergen, it is not a certain non-allergen. We test for compatibility by subtracting the allergen pattern from the ingredient pattern, food by food. The possibilities are:

- ingredient present in every food that lists the allergen (compatible, possible allergen): only 1-1, 1-0 or 0-0 occur; there is no 0-1, so all values are > -1
- ingredient absent from at least one food that lists the allergen (incompatible, not this allergen): at least one 0-1 occurs, so at least one value is -1

The `find_non_allergens()` function returns only ingredients that fall into the second category for every allergen, i.e. those that are not consistent with containing any of the allergens.
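As a small illustration of the subtraction test, the sketch below builds the patterns by hand for one allergen and two ingredients from the example foods. The function name `is_compatible()` is an assumption made for this illustration; the notebook performs the same comparison inline.

```python
import numpy as np

# Occurrence patterns over the four example foods (1 = present/listed)
dairy = np.array([1, 1, 0, 0])    # listed on the first and second foods
mxmxvkd = np.array([1, 1, 0, 1])  # in the first, second and fourth foods
kfcds = np.array([1, 0, 0, 0])    # in the first food only


def is_compatible(iptn, aptn):
    """True if the ingredient is in every food that lists the allergen."""
    return bool(np.all(iptn - aptn > -1))


print(mxmxvkd - dairy, is_compatible(mxmxvkd, dairy))  # [0 0 0 1] True
print(kfcds - dairy, is_compatible(kfcds, dairy))      # [ 0 -1  0  0] False
```

The second food lists dairy but does not contain `kfcds`, which produces the -1 and rules `kfcds` out as the dairy ingredient.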

In [3]:
def find_non_allergens(foods):
    # What are the allergens
    allergens = set().union(*(alls for _, alls in foods))

    # What are the ingredients
    ingredients = set().union(*(ings for ings, _ in foods))

    # What are the patterns of occurrence for the allergens and ingredients
    a_ptns = {}
    for allergen in allergens:
        a_ptns[allergen] = np.array([1 if allergen in alls else 0 for _, alls in foods])
    i_ptns = {}
    for ingredient in ingredients:
        i_ptns[ingredient] = np.array([1 if ingredient in ings else 0 for ings, _ in foods])
        
    # Which ingredients are incompatible with every allergen?
    non_allergens = set()
    for ing, iptn in i_ptns.items():
        # Compatible: the ingredient is in every food that lists the allergen
        if not any(np.all(iptn - aptn > -1) for aptn in a_ptns.values()):
            non_allergens.add(ing)
    return non_allergens

We use the `count_non_allergens()` function to count the total number of times the non-allergen ingredients appear across the list of foods:

In [4]:
def count_non_allergens(foods, non_allergens):
    count = 0
    for ing, allerg in foods:
        count += len(ing.intersection(non_allergens))
    return count

We run this on the test data:

In [5]:
foods = load_data("day21_test.txt")
non_allergens = find_non_allergens(foods)
print(count_non_allergens(foods, non_allergens))
non_allergens

5


{'kfcds', 'nhms', 'sbzzf', 'trh'}
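As a cross-check on the test result, the same non-allergens can be derived with plain set intersections instead of `numpy` patterns: an allergen's candidate ingredients are those common to every food that lists it. This is an alternative sketch, not the notebook's method, and the variable names are chosen for illustration only.

```python
# The four example foods as (ingredients, allergens) tuples
foods = [
    ({"mxmxvkd", "kfcds", "sqjhc", "nhms"}, {"dairy", "fish"}),
    ({"trh", "fvjkl", "sbzzf", "mxmxvkd"}, {"dairy"}),
    ({"sqjhc", "fvjkl"}, {"soy"}),
    ({"sqjhc", "mxmxvkd", "sbzzf"}, {"fish"}),
]

# An allergen's candidates are the ingredients common to every food listing it
candidates = {}
for ingredients, allergens in foods:
    for allergen in allergens:
        if allergen in candidates:
            candidates[allergen] &= ingredients
        else:
            candidates[allergen] = set(ingredients)  # copy, foods stay intact

# Ingredients that are a candidate for no allergen are certain non-allergens
possible = set().union(*candidates.values())
safe = set().union(*(ings for ings, _ in foods)) - possible
count = sum(len(ings & safe) for ings, _ in foods)

print(sorted(safe), count)  # ['kfcds', 'nhms', 'sbzzf', 'trh'] 5
```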

Then on the puzzle data:

In [6]:
foods = load_data("day21_data.txt")
non_allergens = find_non_allergens(foods)
count_non_allergens(foods, non_allergens)

1882

## Part Two

Now that you've isolated the inert ingredients, you should have enough information to figure out which ingredient contains which allergen.

In the above example:

    mxmxvkd contains dairy.
    sqjhc contains fish.
    fvjkl contains soy.

Arrange the ingredients alphabetically by their allergen and separate them by commas to produce your canonical dangerous ingredient list. (There should not be any spaces in your canonical dangerous ingredient list.) In the above example, this would be mxmxvkd,sqjhc,fvjkl.

Time to stock your raft with supplies. What is your canonical dangerous ingredient list?
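The ordering rule can be sketched in a few lines: sort the ingredients by the allergen each one contains, then join them with commas. The dictionary name `dangerous` is chosen for this illustration; its contents are the three pairs given for the example.

```python
# Ingredient -> allergen mapping for the example
dangerous = {"mxmxvkd": "dairy", "sqjhc": "fish", "fvjkl": "soy"}

# Sort ingredients by the allergen they contain, then join without spaces
canonical = ",".join(sorted(dangerous, key=dangerous.get))
print(canonical)  # mxmxvkd,sqjhc,fvjkl
```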

To find the allergens in our food list, we pass the list of foods and the set of known non-allergens to the `find_allergens()` function. It records each resolved allergenic ingredient in the dictionary `allergens`, which maps an ingredient to the allergen it contains.

The function first removes all known non-allergens from the ingredients list of each food.

Then, we check each pairwise combination of foods to see if they logically isolate a single ingredient for a single allergen, using set arithmetic: if two sets of "unknown" ingredients have a single ingredient in common, and the two sets of "unknown" allergens have a single allergen in common, that ingredient must contain that allergen. The `allergens` dictionary is then updated.

We conduct an additional test: if a food has a single "unknown" ingredient and a single "unknown" allergen, then the `allergens` dictionary is updated accordingly.
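The pairwise test can be illustrated with the first two example foods once their non-allergens (`kfcds`, `nhms`, `sbzzf`, `trh`) have been removed. This is a minimal sketch of a single comparison with illustrative variable names, not the full function.

```python
# Two example foods after the non-allergens have been removed
food1 = ({"mxmxvkd", "sqjhc"}, {"dairy", "fish"})
food2 = ({"fvjkl", "mxmxvkd"}, {"dairy"})

shared_ingredients = food1[0] & food2[0]
shared_allergens = food1[1] & food2[1]

resolved = {}
# Exactly one shared ingredient and one shared allergen: they must match
if len(shared_ingredients) == len(shared_allergens) == 1:
    resolved[shared_ingredients.pop()] = shared_allergens.pop()

print(resolved)  # {'mxmxvkd': 'dairy'}
```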

In [7]:
def find_allergens(foods, non_allergens):
    allergens = {}  # maps ingredient -> the allergen it contains

    # Remove known non-allergens from each food's ingredients
    for idx, food in enumerate(foods):
        foods[idx] = (food[0].difference(non_allergens), food[1])

    # Compare every pair of foods
    for food1, food2 in combinations(foods, 2):
        # Exclude ingredients (keys) and allergens (values) already resolved
        f1_ing = food1[0].difference(allergens.keys())
        f2_ing = food2[0].difference(allergens.keys())
        f1_all = food1[1].difference(allergens.values())
        f2_all = food2[1].difference(allergens.values())

        # One shared ingredient and one shared allergen: they must match
        ing_candidates = f1_ing.intersection(f2_ing)
        all_candidates = f1_all.intersection(f2_all)
        if len(ing_candidates) == len(all_candidates) == 1:
            allergens[ing_candidates.pop()] = all_candidates.pop()

        # One unresolved ingredient and one unresolved allergen in a food
        if len(f1_ing) == len(f1_all) == 1:
            allergens[f1_ing.pop()] = f1_all.pop()
        if len(f2_ing) == len(f2_all) == 1:
            allergens[f2_ing.pop()] = f2_all.pop()

    return allergens

There's an additional complication. Our input data for the puzzle is underdetermined. Several ingredients cannot be ruled out as allergens (each is compatible with at least one allergen), but they are compatible with so many allergens that the pairwise comparisons in `find_allergens()` cannot resolve them from the raw foods.

To remedy this, the `find_candidates()` function identifies all ingredients that are compatible with containing each allergen, and builds one entry per allergen: a tuple holding the set of ingredients that *could* contain that allergen, and a set containing only that allergen. These tuples have the same shape as the foods, so they can be passed to `find_allergens()` in place of the original list.

That reduces the complexity of the problem and avoids underdetermination.

In [8]:
def find_candidates(foods):
    # What are the allergens
    allergens = set().union(*(alls for _, alls in foods))

    # What are the ingredients
    ingredients = set().union(*(ings for ings, _ in foods))

    # What are the patterns of occurrence for the allergens and ingredients
    a_ptns = {}
    for allergen in allergens:
        a_ptns[allergen] = np.array([1 if allergen in alls else 0 for _, alls in foods])
    i_ptns = {}
    for ingredient in ingredients:
        i_ptns[ingredient] = np.array([1 if ingredient in ings else 0 for ings, _ in foods])
        
    # For each allergen, which ingredients are compatible with that allergen?
    candidates = defaultdict(set)
    for alg, aptn in a_ptns.items():
        for ing, iptn in i_ptns.items():
            if np.all(iptn - aptn > -1):
                candidates[alg].add(ing)

    # Format the candidates as (ingredients, {allergen}) tuples, like foods
    return [(set(ings), {alg}) for alg, ings in candidates.items()]

For the test data, we do not need to reduce the candidate ingredients.

In [9]:
foods = load_data("day21_test.txt")
non_allergens = find_non_allergens(foods)
print(count_non_allergens(foods, non_allergens))
allergens = find_allergens(foods, non_allergens)

result = [(val, key) for key, val in allergens.items()]
",".join([_[1] for _ in sorted(result)])

5


'mxmxvkd,sqjhc,fvjkl'

But for the puzzle data, we do:

In [10]:
foods = load_data("day21_data.txt")
non_allergens = find_non_allergens(foods)
print(count_non_allergens(foods, non_allergens))
candidates = find_candidates(foods)
allergens = find_allergens(candidates, non_allergens)

result = [(val, key) for key, val in allergens.items()]
",".join([_[1] for _ in sorted(result)])

1882


'xgtj,ztdctgq,bdnrnx,cdvjp,jdggtft,mdbq,rmd,lgllb'

In [11]:
candidates

[({'bdnrnx', 'mdbq', 'xgtj'}, {'dairy'}),
 ({'mdbq'}, {'sesame'}),
 ({'bdnrnx', 'cdvjp', 'mdbq', 'xgtj', 'ztdctgq'}, {'nuts'}),
 ({'lgllb', 'mdbq', 'ztdctgq'}, {'soy'}),
 ({'bdnrnx', 'cdvjp', 'mdbq', 'rmd'}, {'shellfish'}),
 ({'bdnrnx', 'ztdctgq'}, {'eggs'}),
 ({'bdnrnx', 'mdbq'}, {'fish'}),
 ({'bdnrnx', 'jdggtft', 'mdbq'}, {'peanuts'})]

In [12]:
allergens

{'mdbq': 'sesame',
 'bdnrnx': 'fish',
 'xgtj': 'dairy',
 'jdggtft': 'peanuts',
 'ztdctgq': 'eggs',
 'cdvjp': 'nuts',
 'lgllb': 'soy',
 'rmd': 'shellfish'}
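As a cross-check on this mapping, the candidate sets printed by `In [11]` can also be resolved by simple elimination: repeatedly take an allergen with exactly one remaining candidate, record it, and remove that ingredient from every other allergen's candidates. This is a different technique from the pairwise comparison in `find_allergens()`, and `resolve_by_elimination()` is a hypothetical name that does not appear in the notebook. The sketch assumes the candidates reduce to a unique assignment and raises an error otherwise.

```python
def resolve_by_elimination(candidates):
    """Map each ingredient to its allergen by repeated elimination.

    `candidates` maps allergen -> set of possible ingredients.
    """
    remaining = {alg: set(ings) for alg, ings in candidates.items()}
    resolved = {}
    while remaining:
        # Find an allergen with exactly one possible ingredient left
        allergen = next(
            (alg for alg, ings in remaining.items() if len(ings) == 1), None
        )
        if allergen is None:
            raise ValueError("candidates do not reduce to a unique assignment")
        (ingredient,) = remaining.pop(allergen)
        resolved[ingredient] = allergen
        # That ingredient can no longer contain any other allergen
        for ings in remaining.values():
            ings.discard(ingredient)
    return resolved


# Candidate sets copied from the `candidates` output above
candidates = {
    "dairy": {"bdnrnx", "mdbq", "xgtj"},
    "sesame": {"mdbq"},
    "nuts": {"bdnrnx", "cdvjp", "mdbq", "xgtj", "ztdctgq"},
    "soy": {"lgllb", "mdbq", "ztdctgq"},
    "shellfish": {"bdnrnx", "cdvjp", "mdbq", "rmd"},
    "eggs": {"bdnrnx", "ztdctgq"},
    "fish": {"bdnrnx", "mdbq"},
    "peanuts": {"bdnrnx", "jdggtft", "mdbq"},
}
resolved = resolve_by_elimination(candidates)
print(",".join(sorted(resolved, key=resolved.get)))
# xgtj,ztdctgq,bdnrnx,cdvjp,jdggtft,mdbq,rmd,lgllb
```

Elimination reaches the same eight ingredient/allergen pairs and the same canonical list as the pairwise approach.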