## [Day 7](https://adventofcode.com/2020/day/7)


For this one we have a series of relationships where one kind of bag can hold some combination of other colors. We're told we have a shiny gold bag and want to know which outermost bags could eventually hold it. I'm not sure what the numbers are for at the moment, but I assume they matter for part 2.

In [2]:
import pandas as pd
import numpy as np

bags = open('../inputs/d7.txt').read().splitlines()
bags[:10]

['light beige bags contain 5 dark green bags, 5 light gray bags, 3 faded indigo bags, 2 vibrant aqua bags.',
 'faded purple bags contain 4 shiny green bags, 2 mirrored olive bags.',
 'drab tomato bags contain 4 shiny coral bags.',
 'mirrored crimson bags contain 4 bright maroon bags.',
 'faded magenta bags contain 2 clear bronze bags, 5 dim brown bags, 3 striped cyan bags.',
 'vibrant beige bags contain 1 pale silver bag.',
 'plaid lavender bags contain 5 striped teal bags, 2 vibrant tan bags, 3 clear bronze bags, 3 light black bags.',
 'posh maroon bags contain no other bags.',
 'dotted yellow bags contain 4 plaid turquoise bags, 2 plaid lavender bags, 1 dotted violet bag.',
 'posh fuchsia bags contain 5 mirrored gold bags, 2 faded bronze bags, 2 faded coral bags, 1 vibrant maroon bag.']

We've wanted to turn a list of lists like this into a data frame several times now. I'm thinking there is probably a way to go straight to a data frame (perhaps via an ndarray) without breaking the rows apart and stitching them back together. Let's see if that's possible.

In [2]:
def fix_string(x):
    x = (x.replace(' bags', '')
         .replace(' bag', '')
         .replace(' contain', ',')
         .replace(', no other', '')
         .replace('.', ''))
    return x

bags = [fix_string(x) for x in bags]
bags = [x.split(', ') for x in bags]
bags = pd.DataFrame(bags)
bags.head(15)

Unnamed: 0,0,1,2,3,4
0,light beige,5 dark green,5 light gray,3 faded indigo,2 vibrant aqua
1,faded purple,4 shiny green,2 mirrored olive,,
2,drab tomato,4 shiny coral,,,
3,mirrored crimson,4 bright maroon,,,
4,faded magenta,2 clear bronze,5 dim brown,3 striped cyan,
5,vibrant beige,1 pale silver,,,
6,plaid lavender,5 striped teal,2 vibrant tan,3 clear bronze,3 light black
7,posh maroon,,,,
8,dotted yellow,4 plaid turquoise,2 plaid lavender,1 dotted violet,
9,posh fuchsia,5 mirrored gold,2 faded bronze,2 faded coral,1 vibrant maroon
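A quick sketch of why passing the ragged list of lists straight to `pd.DataFrame` works: short rows get padded on the right with missing values. The three rows below are a made-up miniature of the cleaned rules, not the real input.

```python
import pandas as pd

# Hypothetical miniature of the cleaned-up rules, with rows of unequal length.
rows = [
    ['light beige', '5 dark green', '5 light gray'],
    ['drab tomato', '4 shiny coral'],
    ['posh maroon'],
]

demo = pd.DataFrame(rows)

# Short rows are padded on the right with missing values.
print(demo.shape)                   # (3, 3)
print(demo.isna().sum().tolist())   # [0, 1, 2]
```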


Turns out it was incredibly easy this whole time. Okay so now I think a long version will probably be nicer.

In [3]:
# A few odd things I learned here: when pandas auto-generated the .columns
# attribute for this data set, it was actually a RangeIndex, not a list of numbers
# as it appeared from the print statement. Second, as you can see below, column names
# don't need to be strings, so the rename statement below works with an integer key.
bags.columns = list(bags.columns)
bags.rename(columns = {0 : 'parent'}, inplace=True)
bags

Unnamed: 0,parent,1,2,3,4
0,light beige,5 dark green,5 light gray,3 faded indigo,2 vibrant aqua
1,faded purple,4 shiny green,2 mirrored olive,,
2,drab tomato,4 shiny coral,,,
3,mirrored crimson,4 bright maroon,,,
4,faded magenta,2 clear bronze,5 dim brown,3 striped cyan,
...,...,...,...,...,...
589,dull crimson,2 drab red,,,
590,dark plum,1 dark blue,2 light yellow,2 striped silver,
591,faded violet,4 dotted gray,5 muted blue,,
592,plaid black,2 posh aqua,5 plaid orange,,
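To see the two column quirks in isolation, here is a minimal sketch on a made-up two-column frame (the name `demo` is just for illustration):

```python
import pandas as pd

demo = pd.DataFrame([['a', 'b'], ['c', 'd']])

# The auto-generated labels are a RangeIndex, not a plain list.
print(type(demo.columns).__name__)   # RangeIndex

# Integer labels work as dictionary keys in rename; untouched labels stay integers.
renamed = demo.rename(columns={0: 'parent'})
print(list(renamed.columns))         # ['parent', 1]
```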


In [4]:
bags_long = bags.melt(id_vars = 'parent').dropna().drop(columns = 'variable').reset_index(drop = True)
bags_long

Unnamed: 0,parent,value
0,light beige,5 dark green
1,faded purple,4 shiny green
2,drab tomato,4 shiny coral
3,mirrored crimson,4 bright maroon
4,faded magenta,2 clear bronze
...,...,...
1480,shiny maroon,4 light maroon
1481,muted tomato,3 muted yellow
1482,wavy plum,5 clear tan
1483,bright purple,1 wavy plum


In [5]:
splits = pd.DataFrame(bags_long.value.str.split(' ', n = 1).to_list())
splits.columns = ['count', 'child']
splits = splits[['child', 'count']]
splits

Unnamed: 0,child,count
0,dark green,5
1,shiny green,4
2,shiny coral,4
3,bright maroon,4
4,clear bronze,2
...,...,...
1480,light maroon,4
1481,muted yellow,3
1482,clear tan,5
1483,wavy plum,1


In [6]:
bags_long = pd.concat([bags_long, splits], axis = 1).drop(columns = 'value')
bags_long

Unnamed: 0,parent,child,count
0,light beige,dark green,5
1,faded purple,shiny green,4
2,drab tomato,shiny coral,4
3,mirrored crimson,bright maroon,4
4,faded magenta,clear bronze,2
...,...,...,...
1480,shiny maroon,light maroon,4
1481,muted tomato,muted yellow,3
1482,wavy plum,clear tan,5
1483,bright purple,wavy plum,1


Now that we've done all that data munging we can try to build the complete chains. I know that the easiest or most elegant way of doing this would be to form a number of trees and count the nodes leading up to the gold bag, but since we're doing this pandas style, everything is solved by merging!

So I think if we make a subset with gold bags as the children, and then repeatedly join on the data set itself with old_parent = new_child, we can basically form every possible chain. Then the distinct entries in the resulting data set will be the possible antecedents to a gold bag.

If I were really clever, I would know how to incorporate the counts into this for part 2 but for now, I'll just drop them.
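For comparison, the tree-style approach mentioned above might look roughly like this. It is only a sketch, not what the rest of this notebook does: the `edges` list is a hypothetical stand-in for the (parent, child) pairs in `bags_long`, and `held_by` and `ancestors` are names made up for the example.

```python
from collections import defaultdict, deque

# Hypothetical (parent, child) pairs standing in for bags_long.
edges = [
    ('bright white', 'shiny gold'),
    ('muted yellow', 'shiny gold'),
    ('dark orange', 'bright white'),
    ('dark orange', 'muted yellow'),
    ('light red', 'bright white'),
    ('light red', 'muted yellow'),
    ('shiny gold', 'dark olive'),
]

# Map each child to the set of bags that can directly hold it.
held_by = defaultdict(set)
for parent, child in edges:
    held_by[child].add(parent)

def ancestors(target):
    """Return every bag that can contain `target`, directly or indirectly."""
    seen = set()
    queue = deque([target])
    while queue:
        bag = queue.popleft()
        for parent in held_by.get(bag, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

print(sorted(ancestors('shiny gold')))
# ['bright white', 'dark orange', 'light red', 'muted yellow']
```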

In [7]:
bags_long2 = bags_long.copy().drop(columns = 'count')
golden_children = (bags_long2
                   .copy()
                   .query("child == 'shiny gold'")
                   .rename(columns = {'child':'child0', 'parent':'parent0'}))
golden_children

Unnamed: 0,parent0,child0
64,vibrant white,shiny gold
73,drab turquoise,shiny gold
105,clear tan,shiny gold
196,vibrant fuchsia,shiny gold
259,faded green,shiny gold
660,dotted gray,shiny gold
1395,dotted brown,shiny gold


In [8]:
# Let's just do a few iterations to see if this makes any sense:
golden_children2 = golden_children.merge(
    bags_long2.rename(columns = {'child':'child'+str(1), 'parent':'parent'+str(1)}), 
    how = 'left', 
    left_on = 'parent'+str(0), 
    right_on = 'child'+str(1))
print(golden_children2.shape[0])
golden_children2.head(10)

31


Unnamed: 0,parent0,child0,parent1,child1
0,vibrant white,shiny gold,,
1,drab turquoise,shiny gold,bright beige,drab turquoise
2,drab turquoise,shiny gold,drab fuchsia,drab turquoise
3,drab turquoise,shiny gold,shiny indigo,drab turquoise
4,drab turquoise,shiny gold,shiny silver,drab turquoise
5,clear tan,shiny gold,dark aqua,clear tan
6,clear tan,shiny gold,faded coral,clear tan
7,clear tan,shiny gold,shiny yellow,clear tan
8,clear tan,shiny gold,striped olive,clear tan
9,clear tan,shiny gold,vibrant blue,clear tan


In [9]:
golden_children2 = golden_children2.merge(
    bags_long2.rename(columns = {'child':'child'+str(2), 'parent':'parent'+str(2)}), 
    how = 'left', 
    left_on = 'parent'+str(1), 
    right_on = 'child'+str(2))
print(golden_children2.shape[0])
golden_children2.head(10)

66


Unnamed: 0,parent0,child0,parent1,child1,parent2,child2
0,vibrant white,shiny gold,,,,
1,drab turquoise,shiny gold,bright beige,drab turquoise,,
2,drab turquoise,shiny gold,drab fuchsia,drab turquoise,posh indigo,drab fuchsia
3,drab turquoise,shiny gold,drab fuchsia,drab turquoise,dull chartreuse,drab fuchsia
4,drab turquoise,shiny gold,drab fuchsia,drab turquoise,mirrored coral,drab fuchsia
5,drab turquoise,shiny gold,shiny indigo,drab turquoise,bright red,shiny indigo
6,drab turquoise,shiny gold,shiny silver,drab turquoise,clear turquoise,shiny silver
7,clear tan,shiny gold,dark aqua,clear tan,,
8,clear tan,shiny gold,faded coral,clear tan,striped salmon,faded coral
9,clear tan,shiny gold,faded coral,clear tan,plaid yellow,faded coral


So that seems like it will work pretty well. I think we can keep track of the columns to be merged with a counter and then end when the latest parent column is all missing


In [10]:
# I'm sure there is some way to do this that doesn't involve this stupid 'while True' loop but I am
# no computer scientist.
i = 1
while True:
    golden_children = golden_children.merge(
        bags_long2.rename(columns = {'child':'child'+str(i), 'parent':'parent'+str(i)}), 
        how = 'left', 
        left_on = 'parent'+str(i-1), 
        right_on = 'child'+str(i))
    
    if pd.isna(golden_children['parent'+str(i)]).all():
        break
    i += 1

    

In [11]:
golden_children.head()

Unnamed: 0,parent0,child0,parent1,child1,parent2,child2,parent3,child3,parent4,child4,...,parent7,child7,parent8,child8,parent9,child9,parent10,child10,parent11,child11
0,vibrant white,shiny gold,,,,,,,,,...,,,,,,,,,,
1,drab turquoise,shiny gold,bright beige,drab turquoise,,,,,,,...,,,,,,,,,,
2,drab turquoise,shiny gold,drab fuchsia,drab turquoise,posh indigo,drab fuchsia,faded orange,posh indigo,,,...,,,,,,,,,,
3,drab turquoise,shiny gold,drab fuchsia,drab turquoise,dull chartreuse,drab fuchsia,,,,,...,,,,,,,,,,
4,drab turquoise,shiny gold,drab fuchsia,drab turquoise,mirrored coral,drab fuchsia,,,,,...,,,,,,,,,,


Hmm I was definitely expecting it to go more iterations than that.... Somewhat sus
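One way to sanity-check the iteration count would be a frontier-style loop that only tracks newly found parents and counts its rounds. This is a sketch on a hypothetical toy `edges` table, not the real input, so the numbers below say nothing about the actual data.

```python
import pandas as pd

# Hypothetical toy edge table with the same parent/child layout as bags_long2.
edges = pd.DataFrame({
    'parent': ['bright white', 'muted yellow', 'dark orange', 'dark orange', 'light red'],
    'child':  ['shiny gold', 'shiny gold', 'bright white', 'muted yellow', 'bright white'],
})

found = set()              # every container seen so far
frontier = {'shiny gold'}  # bags whose parents we still need to look up
rounds = 0
while frontier:
    parents = set(edges.loc[edges['child'].isin(frontier), 'parent'])
    frontier = parents - found   # only bags we have not seen before
    found |= frontier
    rounds += 1

# The final round finds nothing new, much like the all-NaN merge above.
print(len(found), rounds)  # 4 3
```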

In [12]:
(golden_children
.drop('child0', axis = 1)
.melt()
.drop('variable', axis = 1)
.dropna()
.drop_duplicates())

Unnamed: 0,value
0,vibrant white
1,drab turquoise
8,clear tan
45,vibrant fuchsia
120,faded green
...,...
2196,shiny crimson
2197,vibrant coral
2200,clear red
2242,wavy white


When I first did this, I forgot that there would be NA values in there, so it's good to know that `drop_duplicates` treats `NaN` as a value of its own and keeps one copy of it, hence the `dropna` in the chain above.
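A tiny illustration of that behaviour, on a made-up column:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'value': ['clear tan', np.nan, 'clear tan', np.nan]})

# drop_duplicates keeps one NaN row alongside the distinct strings.
print(len(demo.drop_duplicates()))            # 2

# Dropping missing values as well leaves only real bag names.
print(len(demo.dropna().drop_duplicates()))   # 1
```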

### Part 2

As I suspected, we now need to use the weights (the counts) attached to the bags. Instead of working our way up from the golden bag, we now work our way down. The goal is to compute the total weight within the golden bag, i.e. how many bags it contains in total.

In [13]:
golden_parents = (bags_long
                   .copy()
                   .query("parent == 'shiny gold'")
                   .rename(columns = {'child':'child0', 'parent':'parent0', 'count':'count0'}))
golden_parents

Unnamed: 0,parent0,child0,count0
136,shiny gold,wavy green,4
690,shiny gold,mirrored teal,2
1095,shiny gold,dark tomato,4
1360,shiny gold,faded beige,2


So we know that we'll need to figure out the weight of the 4 wavy green bags + 2 mirrored teal + ... 

I think we can take a similar tack:
1. Start with the above table
2. Repeatedly join on the parent/child relationship
3. At each join, compound the count
4. Count em up

I originally thought this would just be products or cumulative products, but there is a lot of duplication due to the branching process.
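For reference, the non-pandas way to compound the counts is a short recursion: each child contributes itself plus everything inside it, times its count. This is a sketch on a small made-up rule set; `contents` and `bags_inside` are hypothetical names, and this is not the approach used below.

```python
from functools import lru_cache

# Hypothetical rules: colour -> {child colour: how many of that child it holds}.
contents = {
    'shiny gold': {'dark olive': 1, 'vibrant plum': 2},
    'dark olive': {'faded blue': 3, 'dotted black': 4},
    'vibrant plum': {'faded blue': 5, 'dotted black': 6},
    'faded blue': {},
    'dotted black': {},
}

@lru_cache(maxsize=None)
def bags_inside(colour):
    """Total number of bags nested inside one bag of the given colour."""
    return sum(n * (1 + bags_inside(child))
               for child, n in contents[colour].items())

print(bags_inside('shiny gold'))  # 32
```

Here a dark olive bag holds 3 + 4 = 7 bags and a vibrant plum holds 5 + 6 = 11, so the gold bag holds 1 * (1 + 7) + 2 * (1 + 11) = 32.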

In [14]:
# We dropped the count variable previously so I guess we'll make another copy:
bags_long3 = bags_long.copy()

In [15]:
# Same kind of thing as before
i = 1
while True:
    golden_parents = golden_parents.merge(
        bags_long3.rename(columns = {'child':'child'+str(i), 'parent':'parent'+str(i), 'count':'count'+str(i)}), 
        how = 'left', 
        left_on = 'child'+str(i-1), 
        right_on = 'parent'+str(i))
    
    if pd.isna(golden_parents['child'+str(i)]).all():
        break
    i += 1
    
golden_parents.sort_values(['child0', 'parent0', 'child1', 'parent1'])    


Unnamed: 0,parent0,child0,count0,parent1,child1,count1,parent2,child2,count2,parent3,...,count4,parent5,child5,count5,parent6,child6,count6,parent7,child7,count7
29,shiny gold,dark tomato,4,dark tomato,dotted purple,5,dotted purple,muted yellow,2,muted yellow,...,,,,,,,,,,
30,shiny gold,dark tomato,4,dark tomato,dotted purple,5,dotted purple,muted yellow,2,muted yellow,...,,,,,,,,,,
42,shiny gold,dark tomato,4,dark tomato,drab silver,2,drab silver,light turquoise,2,light turquoise,...,2,,,,,,,,,
43,shiny gold,dark tomato,4,dark tomato,drab silver,2,drab silver,mirrored indigo,3,mirrored indigo,...,,,,,,,,,,
44,shiny gold,dark tomato,4,dark tomato,drab silver,2,drab silver,mirrored indigo,3,mirrored indigo,...,5,dim lavender,posh yellow,2,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25,shiny gold,mirrored teal,2,mirrored teal,pale blue,3,pale blue,faded black,2,faded black,...,3,,,,,,,,,
26,shiny gold,mirrored teal,2,mirrored teal,pale blue,3,pale blue,faded black,2,faded black,...,4,,,,,,,,,
27,shiny gold,mirrored teal,2,mirrored teal,wavy coral,5,,,,,...,,,,,,,,,,
0,shiny gold,wavy green,4,wavy green,light violet,2,light violet,dim crimson,3,,...,,,,,,,,,,


In [16]:
# Make sure that the count variables are actually numeric. This confused me
count_names = list(golden_parents.filter(regex = 'count')
              .columns)
for col in count_names:
    golden_parents[col] = golden_parents[col].astype(dtype = 'float').fillna(0)
golden_parents.sort_values(['child0', 'child1', 'child2'])   

Unnamed: 0,parent0,child0,count0,parent1,child1,count1,parent2,child2,count2,parent3,...,count4,parent5,child5,count5,parent6,child6,count6,parent7,child7,count7
29,shiny gold,dark tomato,4.0,dark tomato,dotted purple,5.0,dotted purple,muted yellow,2.0,muted yellow,...,0.0,,,0.0,,,0.0,,,0.0
30,shiny gold,dark tomato,4.0,dark tomato,dotted purple,5.0,dotted purple,muted yellow,2.0,muted yellow,...,0.0,,,0.0,,,0.0,,,0.0
42,shiny gold,dark tomato,4.0,dark tomato,drab silver,2.0,drab silver,light turquoise,2.0,light turquoise,...,2.0,,,0.0,,,0.0,,,0.0
43,shiny gold,dark tomato,4.0,dark tomato,drab silver,2.0,drab silver,mirrored indigo,3.0,mirrored indigo,...,0.0,,,0.0,,,0.0,,,0.0
44,shiny gold,dark tomato,4.0,dark tomato,drab silver,2.0,drab silver,mirrored indigo,3.0,mirrored indigo,...,5.0,dim lavender,posh yellow,2.0,,,0.0,,,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11,shiny gold,mirrored teal,2.0,mirrored teal,pale blue,3.0,pale blue,wavy green,5.0,wavy green,...,3.0,,,0.0,,,0.0,,,0.0
12,shiny gold,mirrored teal,2.0,mirrored teal,pale blue,3.0,pale blue,wavy green,5.0,wavy green,...,2.0,,,0.0,,,0.0,,,0.0
27,shiny gold,mirrored teal,2.0,mirrored teal,wavy coral,5.0,,,0.0,,...,0.0,,,0.0,,,0.0,,,0.0
0,shiny gold,wavy green,4.0,wavy green,light violet,2.0,light violet,dim crimson,3.0,,...,0.0,,,0.0,,,0.0,,,0.0


Alright so the strategy I've come up with is this:
1. Aggregate by all but the lowest level of child. Sum the last count.
2. Remove the last parent, child, and count. Take distinct values.
3. Join the counts onto the distinct values.
4. Replace the next count up with count + count*agg_sum
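On a tiny hand-made table with only two levels, one pass of those four steps would look something like this. The `chains` frame is hypothetical and much narrower than `golden_parents`; it only mirrors the `childN`/`countN` naming.

```python
import pandas as pd

# Hypothetical two-level chains: gold -> child0 (count0) -> child1 (count1).
chains = pd.DataFrame({
    'child0': ['dark olive', 'dark olive', 'vibrant plum', 'vibrant plum'],
    'count0': [1.0, 1.0, 2.0, 2.0],
    'child1': ['faded blue', 'dotted black', 'faded blue', 'dotted black'],
    'count1': [3.0, 4.0, 5.0, 6.0],
})

# Step 1: sum the deepest counts within each higher-level chain.
agg = chains.groupby('child0')['count1'].sum().reset_index()

# Step 2: drop the deepest level and keep the distinct chains.
upper = chains.drop(columns=['child1', 'count1']).drop_duplicates()

# Step 3: join the sums back onto the distinct chains.
upper = upper.merge(agg, on='child0', how='left')

# Step 4: each bag counts itself plus everything inside it.
upper['count0'] = upper['count0'] + upper['count0'] * upper['count1']

total = upper['count0'].sum()
print(total)  # 32.0
```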

In [17]:
golden_res = golden_parents.copy()

# This is the last child that was ever produced xD
last_child = int(golden_res.columns[-1].replace('count', ''))

for i in range(last_child, 0, -1):
    
    # First step is to compute the aggregates:
    older_children = ['child'+str(j) for j in range(i)]
    next_level = golden_res.groupby(older_children)['count'+str(i)].sum()
    next_level = pd.DataFrame(next_level).reset_index(drop = False)
        
    # Now we remove all the current level columns from the data set
    golden_res = (golden_res
                  .drop(columns = ['count'+str(i), 'child'+str(i), 'parent'+str(i)])
                  .drop_duplicates())

    golden_res = golden_res.merge(next_level, how = 'left')
    golden_res['count'+str(i-1)] = golden_res['count'+str(i-1)] + golden_res['count'+str(i-1)]*golden_res['count'+str(i)]
    
    # Okay this last line turned out to be a huge headache for me.
    # I wasn't dropping this column and as a result, the residual values from
    # shorter branches entering into a chain were zeroes and thus affected the count
    # of distinct and thus made tons of duplicates:
    golden_res.drop(columns = 'count'+str(i), inplace = True)

golden_res[['parent0', 'child0', 'count0']].drop_duplicates()['count0'].sum()


34988.0