### Day 3

Here we have a series of binary numbers, and we're asked to find the most common and least common digit at each position within the number. We could probably do some kind of OO design or plain looping, but we're shitty data scientists here, so everything is a problem of rectangular data.

In [1]:
import pandas as pd
import numpy as np
from functools import reduce
from itertools import starmap
# Note that I originally did not have the dtype as string and it 
# read them in as numerics, thus dropping all the leading zeroes!
report = pd.read_csv('d3.txt', header = None, dtype = 'str')
report.columns = ['binary']
report

| | binary |
|---|---|
| 0 | 100101001000 |
| 1 | 011101110101 |
| 2 | 000001010101 |
| 3 | 001001010001 |
| 4 | 001101011110 |
| ... | ... |
| 995 | 010000000000 |
| 996 | 011010100000 |
| 997 | 110011001111 |
| 998 | 010110110111 |
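
To make the leading-zero problem from the comment above concrete, here is a minimal sketch using a two-line in-memory stand-in for `d3.txt` (the sample values are made up for illustration, not taken from the real input):

```python
import io
import pandas as pd

# Hypothetical two-line stand-in for d3.txt
raw = "00100\n11110\n"

# Without a dtype, pandas infers integers and the leading zeroes vanish
as_numbers = pd.read_csv(io.StringIO(raw), header=None)
# With dtype='str' every line is kept exactly as written
as_strings = pd.read_csv(io.StringIO(raw), header=None, dtype='str')

print(as_numbers[0].tolist())  # [100, 11110]
print(as_strings[0].tolist())  # ['00100', '11110']
```

Once the zeroes are gone the numbers no longer have the same width, so the digit positions stop lining up.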


So first, let's write a function that turns a binary string into a data frame with one row per digit, holding the digit and its position. This is kinda nice since it stops the fact that 

In [2]:
def bin_to_df(x, id):
    # Cool trick with slicing: same start and end, but a step of -1 reverses
    # the string, so place 0 is the rightmost (least significant) digit
    x = x[::-1]
    return pd.DataFrame({
        'id' : id,
        'place' : list(range(len(x))),
        'digit' : [int(y) for y in x]
    })             

In [3]:
bin_to_df('01011', 'A')

| | id | place | digit |
|---|---|---|---|
| 0 | A | 0 | 1 |
| 1 | A | 1 | 1 |
| 2 | A | 2 | 0 |
| 3 | A | 3 | 1 |
| 4 | A | 4 | 0 |
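
As a quick sanity check on the reversed numbering, the sketch below repeats `bin_to_df` so it runs on its own, then confirms that summing `digit * 2**place` gives the same value as Python's built-in `int(x, 2)`:

```python
import pandas as pd

def bin_to_df(x, id):
    # Same logic as the function above: reverse, then one row per digit
    x = x[::-1]
    return pd.DataFrame({
        'id': id,
        'place': list(range(len(x))),
        'digit': [int(y) for y in x]
    })

df = bin_to_df('01011', 'A')

# Place-value sum: each digit times 2 raised to its place
value = int((df['digit'] * 2 ** df['place']).sum())

print(value)            # 11
print(int('01011', 2))  # 11
```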


Now let's use our favorite tool, map-reduce. I updated this after seeing part 2 to include the ID so that we can keep track of which digit comes from which number:

In [4]:
# starmap takes an iterable of tuples and unpacks each tuple into the
# function's arguments; in this case they will be (binary_number, id)
bin_tuples = [(x, i) for i, x in enumerate(report['binary'])]

bin_df = reduce(
    lambda x, y: pd.concat([x, y]),
    starmap(bin_to_df, bin_tuples)
)
bin_df.head()

| | id | place | digit |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 |
| 2 | 0 | 2 | 0 |
| 3 | 0 | 3 | 1 |
| 4 | 0 | 4 | 0 |
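
A side note on the reduce step: `pd.concat` accepts a whole list of frames, so the pairwise reduce is not strictly needed. Here is a small sketch comparing the two on a made-up three-number input (the `toy` list is a hypothetical stand-in for `report['binary']`):

```python
from functools import reduce
from itertools import starmap
import pandas as pd

def bin_to_df(x, id):
    # Same logic as the function above: reverse, then one row per digit
    x = x[::-1]
    return pd.DataFrame({
        'id': id,
        'place': list(range(len(x))),
        'digit': [int(y) for y in x]
    })

toy = ['00100', '11110', '10110']  # hypothetical stand-in for report['binary']
bin_tuples = [(x, i) for i, x in enumerate(toy)]

# The map-reduce version: concatenate two frames at a time
via_reduce = reduce(
    lambda x, y: pd.concat([x, y]),
    starmap(bin_to_df, bin_tuples)
)
# The single-call version: hand pd.concat the whole list at once
via_concat = pd.concat(list(starmap(bin_to_df, bin_tuples)))

print(via_reduce.equals(via_concat))  # True
```

Both give the same frame. The single call avoids re-copying the accumulated frame on every step, which may matter on larger inputs.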


Cool, now I think what we want to do is count by position and digit, then pivot wider so that each place is a row and each digit is a column. 

In [5]:
bin_df_wide = (
    bin_df
    .groupby(['place', 'digit'])
    .size()
    .reset_index(drop = False)
    .rename(columns = {0 : 'N'})
    .pivot(index = 'place', columns = 'digit', values = 'N')
    # In case there are none:
    .fillna(0)
)

bin_df_wide

| place | digit 0 | digit 1 |
|---|---|---|
| 0 | 498 | 502 |
| 1 | 516 | 484 |
| 2 | 480 | 520 |
| 3 | 508 | 492 |
| 4 | 518 | 482 |
| 5 | 491 | 509 |
| 6 | 505 | 495 |
| 7 | 520 | 480 |
| 8 | 495 | 505 |
| 9 | 481 | 519 |
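
For comparison, the same count-then-widen shape can be sketched with `unstack`, which fills a missing place/digit combination with 0 in one step. The tiny `long_df` here is made up so that place 1 has no 1s at all:

```python
import pandas as pd

# Hypothetical long-format data: place 1 never sees the digit 1
long_df = pd.DataFrame({
    'place': [0, 0, 0, 1, 1, 1],
    'digit': [1, 1, 0, 0, 0, 0],
})

counts = (
    long_df
    .groupby(['place', 'digit'])
    .size()
    # Move digit from the row index to the columns; absent pairs become 0
    .unstack(fill_value=0)
)
print(counts)
```

This is an alternative sketch, not what the cell above does. It keeps the counts as integers instead of the floats that a `NaN` followed by `fillna(0)` would produce.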


In [6]:
# Verify that there are no ties. The puzzle doesn't say how to handle them,
# which suggests there are none, but just in case
(bin_df_wide[0] == bin_df_wide[1]).value_counts()

False    12
dtype: int64

In [7]:
bin_df_wide['gamma'] = np.where(
    bin_df_wide[0] > bin_df_wide[1],
    0,
    1
)
bin_df_wide['epsilon'] = np.where(
    bin_df_wide[0] < bin_df_wide[1],
    0,
    1
)
bin_df_wide

| place | digit 0 | digit 1 | gamma | epsilon |
|---|---|---|---|---|
| 0 | 498 | 502 | 1 | 0 |
| 1 | 516 | 484 | 0 | 1 |
| 2 | 480 | 520 | 1 | 0 |
| 3 | 508 | 492 | 0 | 1 |
| 4 | 518 | 482 | 0 | 1 |
| 5 | 491 | 509 | 1 | 0 |
| 6 | 505 | 495 | 0 | 1 |
| 7 | 520 | 480 | 0 | 1 |
| 8 | 495 | 505 | 1 | 0 |
| 9 | 481 | 519 | 1 | 0 |
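
Since there are no ties, the epsilon bit is always the flip of the gamma bit. Here is a small sketch of that on made-up counts (the numbers are illustrative, not from the real input):

```python
import numpy as np
import pandas as pd

# Hypothetical per-place counts of zeroes (column 0) and ones (column 1)
counts = pd.DataFrame({0: [5, 7, 4], 1: [7, 5, 8]})

# Same np.where logic as the cell above
gamma = np.where(counts[0] > counts[1], 0, 1)
epsilon = np.where(counts[0] < counts[1], 0, 1)

print(gamma.tolist())    # [1, 0, 1]
print(epsilon.tolist())  # [0, 1, 0]
# With no ties, each place gets exactly one 1 between gamma and epsilon
print(bool((gamma + epsilon == 1).all()))  # True
```

With a tie, both `np.where` calls would fall through to 1, which is why the tie check above is worth running first.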


I had a good idea here: use the place as the exponent of 2, then let pandas' vectorized operations do the binary-to-decimal conversion.

In [8]:
bin_df_wide['base'] = list(map(lambda x: 2**x, bin_df_wide.index))
bin_df_wide

| place | digit 0 | digit 1 | gamma | epsilon | base |
|---|---|---|---|---|---|
| 0 | 498 | 502 | 1 | 0 | 1 |
| 1 | 516 | 484 | 0 | 1 | 2 |
| 2 | 480 | 520 | 1 | 0 | 4 |
| 3 | 508 | 492 | 0 | 1 | 8 |
| 4 | 518 | 482 | 0 | 1 | 16 |
| 5 | 491 | 509 | 1 | 0 | 32 |
| 6 | 505 | 495 | 0 | 1 | 64 |
| 7 | 520 | 480 | 0 | 1 | 128 |
| 8 | 495 | 505 | 1 | 0 | 256 |
| 9 | 481 | 519 | 1 | 0 | 512 |
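
The place-value idea can be checked on a small made-up bit pattern. The sketch below builds the powers of 2 straight from the index and compares the weighted sum with `int(..., 2)`; the `bits` series is hypothetical and stands in for a column like `gamma`:

```python
import pandas as pd

# Hypothetical bits for the binary number 10110, indexed by place
# (place 0 is the rightmost digit)
bits = pd.Series([0, 1, 1, 0, 1], index=pd.RangeIndex(5, name='place'))

# Vectorized powers of two, one per place: 1, 2, 4, 8, 16
base = 2 ** bits.index.to_numpy()

value = int((bits * base).sum())
print(value)            # 22
print(int('10110', 2))  # 22
```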


In [9]:
# Here's the answer:
epsilon = (bin_df_wide['epsilon']*bin_df_wide['base']).sum()
gamma = (bin_df_wide['gamma']*bin_df_wide['base']).sum()
gamma*epsilon

2648450

### Part 2

Now we're going to do the same kind of counting, but as an iterative process. Finding the oxygen generator and CO2 scrubber ratings goes like this:

1. Figure out the most (or least) common digit in the highest (leftmost) position.
2. If it is a tie, choose 1 in the case of oxygen and 0 in the case of co2
3. Keep only the numbers with that as the digit in that position.
4. Repeat until only one number remains

I went back and modified bin_df to have an ID so we can subset it repeatedly.

So the remaining steps are:
1. Write a function that can do the digit counts like above
2. Make a copy of bin_df that we can iterate on
3. Apply the function from step #1, find the values in the copy (#2) that match that digit, filter the copy, and continue until we only have one item left.
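
Before doing this with data frames, the filtering procedure can be sketched in plain Python on strings. This is a deliberately different, simpler approach than the pandas version in this notebook; the `rating` function and the `toy` input are made up for illustration, and the sketch assumes the numbers are distinct and all the same width:

```python
def rating(numbers, keep_most_common):
    """Filter position by position (left to right) until one number is left."""
    remaining = list(numbers)
    position = 0
    while len(remaining) > 1:
        ones = sum(n[position] == '1' for n in remaining)
        zeroes = len(remaining) - ones
        # Assumption: if every remaining number agrees here, skip the position
        if ones and zeroes:
            if keep_most_common:
                keep = '1' if ones >= zeroes else '0'   # tie goes to 1
            else:
                keep = '0' if zeroes <= ones else '1'   # tie goes to 0
            remaining = [n for n in remaining if n[position] == keep]
        position += 1
    return int(remaining[0], 2)

# Hypothetical toy input, not the real d3.txt
toy = ['00100', '11110', '10110', '10111', '10101', '01111',
       '00111', '11100', '10000', '11001', '00010', '01010']

oxygen = rating(toy, keep_most_common=True)
co2 = rating(toy, keep_most_common=False)
print(oxygen, co2, oxygen * co2)  # 23 10 230
```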

In [10]:
def count_position(df, position, default):
    
    count_df = (
        df
        .query('place == ' + str(position))
        .groupby('digit')
        .size()
        .reset_index(drop = False)
        .rename(columns = {0 : 'N'})
        .assign(A = 'A')
        .pivot(index = 'A', columns = 'digit', values = 'N')
        # In case there are none:
        .fillna(0)
    )
    #print(count_df.columns)
    # Now we compute the most common this way
    zeroes = count_df[0].iloc[0]
    ones = count_df[1].iloc[0]
    #print('Zeroes: ' + str(zeroes) + ', Ones: ' + str(ones))
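    # Now run through the cases. Note that we can infer the direction
    # of comparison from the default value: a default of 1 means we want the
    # most common digit (oxygen), a default of 0 the least common (co2)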
    if zeroes == ones:
        return default
    elif default == 1:
        if zeroes > ones:
            return 0
        else:
            return 1
    else:
        if zeroes > ones:
            return 1
        else:
            return 0

count_position(bin_df, 0, 1)    

1
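
One edge case to be aware of: as far as I can tell, `fillna(0)` only fills gaps in columns that already exist, so if every remaining number had the same digit at a position, `count_df[0]` or `count_df[1]` would raise a `KeyError`. A possible way around it, sketched here on made-up digits, is to count with `value_counts` and look the counts up with a default:

```python
import pandas as pd

# Hypothetical digits at one position where no zeroes are left
digits = pd.Series([1, 1, 1])

counts = digits.value_counts()
# .get returns the fallback instead of raising when the digit never appears
zeroes = counts.get(0, 0)
ones = counts.get(1, 0)

print(zeroes, ones)  # 0 3
```

This is only a sketch of the counting step, not a drop-in replacement for `count_position`.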

And now another function to repeatedly filter the list:

In [11]:
def filter_bins(df, digit, place):
    keep_ids = (
        df
        .query('place ==' + str(place))
        .query('digit == ' + str(digit))
        ['id']
    )
    
    return df.copy()[df.id.isin(keep_ids)]

filter_bins(bin_df, 0, 0).query('place == 0').digit.value_counts()

0    498
Name: digit, dtype: int64
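
The point of filtering by ID is that a match at one position keeps every row of that number, not just the matching row. Here is a small sketch on a made-up two-number frame, with `filter_bins` repeated so the block runs on its own:

```python
import pandas as pd

def filter_bins(df, digit, place):
    # Same logic as the function above
    keep_ids = (
        df
        .query('place ==' + str(place))
        .query('digit == ' + str(digit))
        ['id']
    )
    return df.copy()[df.id.isin(keep_ids)]

# Hypothetical long-format data: A is '10' and B is '01' (place 0 is rightmost)
long_df = pd.DataFrame({
    'id':    ['A', 'A', 'B', 'B'],
    'place': [0, 1, 0, 1],
    'digit': [0, 1, 1, 0],
})

# Keep the numbers whose digit at place 1 is 1, which is only A
kept = filter_bins(long_df, 1, 1)
print(kept['id'].unique().tolist())  # ['A']
print(len(kept))                     # 2
```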

Alright, and now we just have to apply these two functions repeatedly:

In [12]:
# The base column only existed on the wide table in part 1; add it to the
# long table so we can convert the surviving numbers to decimal
bin_df['base'] = list(map(lambda x: 2**x, bin_df.place))

# Set up looping
oxygen = bin_df.copy()
co2 = bin_df.copy()
n_digits = len(report.binary.iloc[0])
current_position = n_digits -1

while len(oxygen.id.unique()) > 1:
    
    bit_value = count_position(oxygen, current_position, 1)
    oxygen = filter_bins(oxygen, bit_value, current_position)
    current_position -= 1

current_position = n_digits -1
while len(co2.id.unique()) > 1:
    
    bit_value = count_position(co2, current_position, 0)
    co2 = filter_bins(co2, bit_value, current_position)
    current_position -= 1
    
oxygen = (oxygen['digit']*oxygen['base']).sum()
co2 = (co2['digit']*co2['base']).sum()
oxygen*co2

2845944