## [Day 6](https://adventofcode.com/2020/day/6)

Alright, so here we have grouped input, and what we're after is the count of unique values per group, summed across all groups. I think this will be relatively easy if we reuse some of the code from earlier days to catch the blank lines that delimit the groups.
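As a quick sanity check on the goal before touching pandas, here is a minimal plain-Python sketch of "unique values per group, summed". The `sample` string and the `count_any` helper are made up for illustration and are not part of the notebook; it assumes groups are separated by a single blank line.

```python
# Hypothetical toy input: five groups separated by blank lines.
sample = "abc\n\na\nb\nc\n\nab\nac\n\na\na\na\na\n\nb"

def count_any(text):
    """Sum, over all groups, the number of distinct letters in the group."""
    groups = text.strip().split("\n\n")
    # Dropping the newlines merges every member's answers into one string,
    # and set() keeps each letter only once.
    return sum(len(set(group.replace("\n", ""))) for group in groups)

print(count_any(sample))  # 3 + 3 + 3 + 1 + 1 = 11
```

The pandas route below should get to the same kind of number, just with more practice on data frames along the way.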


In [1]:
import pandas as pd
import numpy as np

forms = open('../inputs/d6.txt').read().splitlines()
forms[:6]

['bdmceunt', 'ubcdjqvnmte', 'mcgetfndul', 'dcenmtu', '', 'ldy']

So the first thing I wanna do is split all these into single characters and then flatten down the lists.

In [2]:
def split_string(x):
    # Split a line into single characters; a blank line becomes a ' ' marker
    # so the group boundaries survive the flattening step.
    return list(x) if x else [' ']

forms = [split_string(x) for x in forms]
forms[:5]

[['b', 'd', 'm', 'c', 'e', 'u', 'n', 't'],
 ['u', 'b', 'c', 'd', 'j', 'q', 'v', 'n', 'm', 't', 'e'],
 ['m', 'c', 'g', 'e', 't', 'f', 'n', 'd', 'u', 'l'],
 ['d', 'c', 'e', 'n', 'm', 't', 'u'],
 [' ']]

In [3]:
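# Flatten to one character per row; each ' ' marks a blank line (group boundary)
forms_flat = [item for sublist in forms for item in sublist]
forms_flag = [1 if x == ' ' else 0 for x in forms_flat]
# The running total of boundary flags gives every character its group id
forms_sum = np.cumsum(forms_flag)
forms_df = pd.DataFrame({'value' : forms_flat, 'flag' : forms_flag, 'id_no' : forms_sum})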
forms_df.head()

Unnamed: 0,value,flag,id_no
0,b,0,0
1,d,0,0
2,m,0,0
3,c,0,0
4,e,0,0
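The group id trick is easier to see on a toy list. This is only a sketch: the `lines` list is made up, and `itertools.accumulate` stands in for `np.cumsum` so the snippet runs without any dependencies.

```python
from itertools import accumulate

# Hypothetical toy input: three groups separated by blank lines.
lines = ['ab', 'ac', '', 'a', '', 'b']

# Flag each blank line with 1 and everything else with 0.
flags = [1 if line == '' else 0 for line in lines]

# The running total increases by one at every blank line, so all rows
# between two blanks share the same id.
group_ids = list(accumulate(flags))

print(group_ids)  # [0, 0, 1, 1, 2, 2]
```

Note that each blank row ends up carrying the id of the group that follows it, which is harmless as long as the blanks are filtered out before counting.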


In [4]:
# Cool, now we just get rid of the blanks and count the distinct (value, id_no) pairs,
# which is the number of unique letters per group summed over all groups:
forms_df = forms_df.query("value != ' '").drop('flag', axis = 1)
forms_df.drop_duplicates().shape[0]

6273

In [5]:
len(forms)

2061

### Part 2

The criterion here has changed. We no longer need the number of distinct occurrences per group, but rather the letters that appear for every member of the group. The data transformations I did previously threw away how many members were in each group, so I'll have to go back and restructure the data. This is good though, since I can practice some of the group-by summaries and merging.
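For comparison, the new rule can be sketched in plain Python as a set intersection per group. The `sample` string and the `count_all` helper are made up for illustration; the sketch assumes one member per line and a single blank line between groups.

```python
# Hypothetical toy input: five groups separated by blank lines.
sample = "abc\n\na\nb\nc\n\nab\nac\n\na\na\na\na\n\nb"

def count_all(text):
    """Sum, over all groups, the letters that every member answered."""
    total = 0
    for group in text.strip().split("\n\n"):
        # One set of letters per member of the group.
        members = [set(person) for person in group.splitlines()]
        # Keep only the letters common to every member.
        total += len(set.intersection(*members))
    return total

print(count_all(sample))  # 3 + 0 + 1 + 1 + 1 = 6
```

The pandas version gets at the same idea differently: a letter counts when its number of occurrences in the group equals the group size.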


In [30]:
forms = [''.join(x) for x in forms]
forms_flag2 = [1 if x == ' ' else 0 for x in forms]
forms_sum2 = np.cumsum(forms_flag2)
forms_df2 = pd.DataFrame({'value' : forms, 'flag' : forms_flag2, 'id_no' : forms_sum2})
forms_df2 = forms_df2.query("value != ' '")
forms_df2.head(10)

Unnamed: 0,value,flag,id_no
0,bdmceunt,0,0
1,ubcdjqvnmte,0,0
2,mcgetfndul,0,0
3,dcenmtu,0,0
5,ldy,0,1
6,dy,0,1
8,wnghmqt,0,2
9,zjhwmabg,0,2
10,kmwhisgy,0,2
11,hwvomngqj,0,2


In [31]:
# Now we just count the group sizes and use the df from part 1 to get the value counts
forms_group_counts = forms_df2.groupby('id_no')[['id_no']].count().rename(columns = {'id_no' : 'N'}).reset_index()
forms_group_counts.head()

Unnamed: 0,id_no,N
0,0,4
1,1,2
2,2,4
3,3,4
4,4,2


So that is a bit more work than count() in R, since pandas labels the count with the old column name and I have to rename it afterwards. I suppose that could be rather nice if we want to count up a bunch of columns at once. Maybe this will get me closer to what I want:
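One possible shortcut for the group sizes, sketched on a made-up `toy` frame rather than the real data: `groupby(...).size()` returns a Series of row counts, and `reset_index(name=...)` names the count column directly, so no rename step is needed.

```python
import pandas as pd

# Hypothetical toy frame: two groups with 2 and 3 members.
toy = pd.DataFrame({'value': ['ab', 'ac', 'a', 'b', 'c'],
                    'id_no': [0, 0, 1, 1, 1]})

# size() counts rows per group; reset_index(name='N') turns the result
# back into a DataFrame with the count column called N.
group_sizes = toy.groupby('id_no').size().reset_index(name='N')

print(group_sizes)
```

This is just an alternative spelling; the count-and-rename version above gives the same table.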

In [40]:
forms_group_value_counts = (forms_df
                            .assign(n = 1)
                            .groupby(['id_no', 'value'])
                            .count()
                            .reset_index())
forms_group_value_counts.head()

Unnamed: 0,id_no,value,n
0,0,b,2
1,0,c,4
2,0,d,4
3,0,e,4
4,0,f,1


Alright, I'm liking that syntax a bit more.

In [42]:
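# Join on the shared id_no column, then keep the letters whose count (n)
# equals the group size (N), i.e. the letters every member answered
(forms_group_value_counts
 .merge(forms_group_counts)
 .query('n == N')
 .shape)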

(3254, 4)