## [Day 4](https://adventofcode.com/2020/day/4)

So this problem is about verifying that chunks of information (passports) have all the required fields. The fields are:
  
* byr (Birth Year)
* iyr (Issue Year)
* eyr (Expiration Year)
* hgt (Height)
* hcl (Hair Color)
* ecl (Eye Color)
* pid (Passport ID)
* cid (Country ID)

A passport counts as valid if it has every one of these fields; only cid is allowed to be missing.

I think this one, like the last few, is more complicated to do with pandas than with a few loops over strings, but I will again treat it as practice and try to make a tidy data set out of it. Sometimes the second part of these problems is a lot easier if you put in a bit more time up front instead of looking for the shortest path possible.
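
For comparison, here is a minimal sketch of what the loop-over-strings version of Part 1 might look like. The names `REQUIRED` and `count_complete` are made up for this sketch and are not used anywhere else in the notebook, and the sample text is just the first few lines of my input.

```python
# Required fields; cid is deliberately left out because it is optional.
REQUIRED = {"byr", "iyr", "eyr", "hgt", "hcl", "ecl", "pid"}

def count_complete(text):
    """Count records that contain every required field."""
    count = 0
    # Records are separated by blank lines; fields by any whitespace.
    for record in text.strip().split("\n\n"):
        keys = {field.split(":")[0] for field in record.split()}
        if REQUIRED <= keys:  # subset test: all required keys present
            count += 1
    return count

sample = """eyr:2028 iyr:2016 byr:1995 ecl:oth
pid:543685203 hcl:#c0946f
hgt:152cm
cid:252

hcl:#733820 hgt:155cm
iyr:2013 byr:1989 pid:728471979
ecl:grn eyr:2022

hgt:171cm"""

print(count_complete(sample))  # 2: the third record is cut off after hgt
```

This is shorter, but it throws the values away, which is exactly what tends to hurt in the second part.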

In [1]:
import pandas as pd
import numpy as np
import re
passes = open('c:/Users/Sven/Documents/py_files/aoc_2020/inputs/d4.txt').read().splitlines()
passes[:10]


['eyr:2028 iyr:2016 byr:1995 ecl:oth',
 'pid:543685203 hcl:#c0946f',
 'hgt:152cm',
 'cid:252',
 '',
 'hcl:#733820 hgt:155cm',
 'iyr:2013 byr:1989 pid:728471979',
 'ecl:grn eyr:2022',
 '',
 'hgt:171cm']

In [2]:
# First thing I want to do is split the lines up and make id:data pairs
# Seems like I've solved this a bunch of times in these problems but I can never
# remember how to do it nicely.
passes = [x.split() if len(x)>0 else [''] for x in passes ]
passes[:6]

[['eyr:2028', 'iyr:2016', 'byr:1995', 'ecl:oth'],
 ['pid:543685203', 'hcl:#c0946f'],
 ['hgt:152cm'],
 ['cid:252'],
 [''],
 ['hcl:#733820', 'hgt:155cm']]

In [3]:
# now flatten:
passes_flat = [x for y in passes for x in y]
passes_flat[:15]

['eyr:2028',
 'iyr:2016',
 'byr:1995',
 'ecl:oth',
 'pid:543685203',
 'hcl:#c0946f',
 'hgt:152cm',
 'cid:252',
 '',
 'hcl:#733820',
 'hgt:155cm',
 'iyr:2013',
 'byr:1989',
 'pid:728471979',
 'ecl:grn']

In [4]:
# Now we need to make the ids. Use this neato trick:
ids = [1 if x == '' else 0 for x in passes_flat]
id_sum = np.cumsum(ids)
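
To spell out why the trick works, here is a tiny sketch on a made-up list (the `lines` values are invented for illustration). Every blank entry bumps the running total by one, so all entries between two blanks share the same number. Note that the blank entry itself gets the id of the record that follows it.

```python
import numpy as np

# Toy input: three records separated by blank entries.
lines = ["a:1", "b:2", "", "c:3", "", "d:4", "e:5"]

# 1 marks a record separator, 0 marks a real field.
is_blank = [1 if x == "" else 0 for x in lines]

# The running sum increases by one at each separator.
record_id = np.cumsum(is_blank)
print(record_id.tolist())  # [0, 0, 1, 1, 2, 2, 2]
```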

In [5]:
# And next the splits of type and value:
types = [x.split(':')[0] for x in passes_flat]
values = [x.split(':')[1] if len(x)> 0 else '' for x in passes_flat]
passes_df = pd.DataFrame({'id_no' : id_sum, 'type' : types, 'value' : values})
passes_df.head(20)

,id_no,type,value
0,0,eyr,2028
1,0,iyr,2016
2,0,byr,1995
3,0,ecl,oth
4,0,pid,543685203
5,0,hcl,#c0946f
6,0,hgt,152cm
7,0,cid,252
8,1,,
9,1,hcl,#733820


Lovely. Now let's filter it down a bit and look at a wide version for kicks.

In [6]:
passes_df = passes_df.query("type != ''")
passes_wide = passes_df.pivot(index = 'id_no', columns = 'type', values = 'value')
passes_wide.head()

id_no,byr,cid,ecl,eyr,hcl,hgt,iyr,pid
0,1995,252,oth,2028,#c0946f,152cm,2016,543685203
1,1989,,grn,2022,#733820,155cm,2013,728471979
2,1986,,grn,2028,#cfa07d,171cm,2013,214368857
3,1945,210,brn,2029,#cfa07d,167cm,2010,429131951
4,1966,,amb,2028,#888785,170cm,2015,893805464


I don't think we'll want to solve the problem with the wide form (I just made it to have a look), but we can at least use it to check the missing rate of each field.

In [7]:
passes_wide.apply(lambda x: np.mean(pd.isna(x)))

type
byr    0.014035
cid    0.512281
ecl    0.024561
eyr    0.014035
hcl    0.035088
hgt    0.010526
iyr    0.014035
pid    0.017544
dtype: float64
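
As an aside, the same missing rates can be computed without `apply`. A small sketch on a made-up frame (`toy` is invented here, it is not my input):

```python
import numpy as np
import pandas as pd

# Toy wide frame with some missing fields.
toy = pd.DataFrame({
    "byr": ["1995", None, "1986", "1945"],
    "cid": ["252", None, None, "210"],
})

# Column-by-column version, as in the cell above.
rates_apply = toy.apply(lambda x: np.mean(pd.isna(x)))

# isna() gives a boolean frame; the mean of booleans is the share of True.
rates_direct = toy.isna().mean()

print(rates_direct)
```

Both give one number per column, the fraction of records where that field is missing.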

In [8]:
# so back to the actual solution:
sol1 = passes_df.query("type != 'cid'").groupby('id_no').count()
sol1.query('type == 7').shape

(256, 2)
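
The long-form count above quietly assumes that no field appears twice in the same passport. A sketch of the same check on the wide form, which asks directly whether every required column is present, might look like this. The toy data uses only two required fields (`byr` and `pid`) to keep it short; that is an assumption of the sketch, not the real rule set.

```python
import pandas as pd

# Toy long-form data: record 1 is missing pid, record 2 is missing only cid.
long_df = pd.DataFrame({
    "id_no": [0, 0, 0, 1, 1, 2, 2],
    "type":  ["byr", "pid", "cid", "byr", "cid", "byr", "pid"],
    "value": ["1995", "543685203", "252", "1989", "100", "1986", "214368857"],
})

# Long form: count the non-cid fields per record (2 required fields here).
counts = long_df.query("type != 'cid'").groupby("id_no").count()
n_long = counts.query("type == 2").shape[0]

# Wide form: a record is complete if no required column is missing.
wide = long_df.pivot(index="id_no", columns="type", values="value")
n_wide = int(wide.drop(columns="cid").notna().all(axis=1).sum())

print(n_long, n_wide)  # 2 2
```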

### Part 2

So now we just have to apply a bunch of criteria to this to remove implausible values. Papers, please!
  
* byr (Birth Year) - four digits; at least 1920 and at most 2002.
* iyr (Issue Year) - four digits; at least 2010 and at most 2020.
* eyr (Expiration Year) - four digits; at least 2020 and at most 2030.
* hgt (Height) - a number followed by either cm or in:
    * If cm, the number must be at least 150 and at most 193.
    * If in, the number must be at least 59 and at most 76.
* hcl (Hair Color) - a # followed by exactly six characters 0-9 or a-f.
* ecl (Eye Color) - exactly one of: amb blu brn gry grn hzl oth.
* pid (Passport ID) - a nine-digit number, including leading zeroes.
* cid (Country ID) - ignored, missing or not.

So this is going to be a bit tedious, but maybe we can do some R-style thinking here... What I would normally do is set up a function for each variable that specifies which values are valid, then write another function that replaces the invalid values in those columns with NaN.

In [9]:
# So now we wanna get a function that tests this:
def v_byr(x):
    if ('1920' <= x <= '2002') and len(x) == 4:
        return x
    else:
        return np.nan

# and then a wrapper:
def vectorize(f):
    
    def return_fun(x):
        res = []
        #for i in range(len(x))
        # Huge learning moment here. I originally had the commented line above,
        # but x[i] looks up index labels, not positions, so it fails if you subset
        # the data frame and don't reset the index. I'd feed it a length 2 Series
        # whose indices were like 3, 45 and get a KeyError. Instead, do this:
        for i in x.index:
            if pd.isna(x[i]):
                res.append(np.nan)
            else:
                res.append(f(x[i]))
        return res
    
    return return_fun

# Test:
vectorize(v_byr)(pd.Series(['2000', '20000', '1950', '3000', np.nan]))

['2000', nan, '1950', nan, nan]
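
Here is a small sketch that reproduces the indexing problem described in the comment above, on a made-up Series whose labels are not 0, 1, 2. With an integer index, `s[i]` is a label lookup, so looping over positions fails, while looping over `s.index` or using `.iloc` works.

```python
import pandas as pd

# A Series as it might look after subsetting: labels are 3, 45, 7.
s = pd.Series(["2000", "1950", "3000"], index=[3, 45, 7])

# Label-based loop: works for any index.
by_label = [s[i] for i in s.index]

# Position-based loop with s[i]: looks up labels 0, 1, 2, which do not exist.
try:
    by_position = [s[i] for i in range(len(s))]
except KeyError:
    by_position = None

# .iloc is explicitly positional, so it also works.
by_iloc = [s.iloc[i] for i in range(len(s))]

print(by_label, by_position, by_iloc)
```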

Alright, well, this is probably way more tedious than it's worth, but I wanted to see how you would do this without the beloved tidyverse. Now define the rest of the validation functions:

In [10]:
# Issue year
def v_iyr(x):
    if ('2010' <= x <= '2020') and len(x) == 4:
        return x
    else:
        return np.nan

# Expire year
def v_eyr(x):
    if ('2020' <= x <= '2030') and len(x) == 4:
        return x
    else:
        return np.nan
 
# Height
def v_hgt(x):
    
    # separate the units and value for the height
    units = x[-2:]
    value = x[:-2]
    if len(value) > 0:
        value = int(value)
    else:
        value = 0
    
    if units == 'cm' and (150 <= value <= 193):
        return x
    elif units == 'in' and (59 <= value <= 76):
        return x
    else:
        return np.nan
    
# Hair color
def v_hcl(x):
    if len(re.findall('^#[0-9a-f]{6}$', x)) > 0:
        return x
    else:
        return np.nan

# Eye color
def v_ecl(x):
    if x in ['amb', 'blu', 'brn', 'gry', 'grn', 'hzl', 'oth']:
        return x
    else:
        return np.nan

# pid    
def v_pid(x):
    if len(re.findall('^[0-9]{9}$', x)) > 0:
        return x
    else:
        return np.nan
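
Two caveats about the checks above. Comparing years as strings only looks at character order, so a four-character value like `'19a0'` would pass `v_byr`, and `int(value)` in `v_hgt` would raise a `ValueError` on a height like `'xxcm'`. Neither seems to have come up in my input, but a stricter sketch could look like this. `year_between` and `v_hgt_strict` are hypothetical names for this sketch and are not used in the rest of the notebook.

```python
import re
import numpy as np

def year_between(x, lo, hi):
    """Return x if it is exactly four digits and lo <= int(x) <= hi, else NaN."""
    if re.fullmatch(r"[0-9]{4}", x) and lo <= int(x) <= hi:
        return x
    return np.nan

def v_hgt_strict(x):
    """Return x if it is a valid height such as '152cm' or '60in', else NaN."""
    m = re.fullmatch(r"([0-9]+)(cm|in)", x)
    if m is None:
        # No digits, no unit, or stray characters.
        return np.nan
    value, units = int(m.group(1)), m.group(2)
    lo, hi = (150, 193) if units == "cm" else (59, 76)
    return x if lo <= value <= hi else np.nan

print(year_between("2002", 1920, 2002), year_between("19a0", 1920, 2002))
print(v_hgt_strict("152cm"), v_hgt_strict("193in"), v_hgt_strict("xxcm"))
```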

So now we're going to try to apply all these functions to the columns of the data frame.

In [11]:
# So I think we can just make a dictionary:
func_dict = {'byr':v_byr, 'iyr':v_iyr, 'eyr':v_eyr, 'hgt':v_hgt, 'hcl':v_hcl, 'ecl':v_ecl, 'pid':v_pid}

# Now vectorize them all:
for key in func_dict:
    func_dict[key] = vectorize(func_dict[key])
    
func_dict['byr'](pd.Series(['2000', '20000', '1950', '3000', np.nan]))

['2000', nan, '1950', nan, nan]
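
For what it's worth, pandas can do the job of the hand-written wrapper. A sketch, assuming the same `v_byr` rule as above (repeated here so the block runs on its own): `Series.map` with `na_action='ignore'` skips missing values and keeps the original index.

```python
import numpy as np
import pandas as pd

def v_byr(x):
    # Same rule as v_byr in the notebook.
    if ('1920' <= x <= '2002') and len(x) == 4:
        return x
    return np.nan

# Deliberately awkward index labels, as after subsetting a data frame.
s = pd.Series(['2000', '20000', '1950', '3000', np.nan], index=[3, 45, 7, 8, 9])

# NaN entries are passed through untouched; everything else goes through v_byr.
checked = s.map(v_byr, na_action='ignore')
print(checked)
```

Unlike the list returned by `vectorize`, the result is a Series that carries its index along, so it lines up with the data frame by label when assigned back.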

In [12]:
passes_wide_fixed = passes_wide.copy()
# Then loop through:
for key in func_dict:
    passes_wide_fixed[key] = func_dict[key](passes_wide_fixed[key])

passes_wide_fixed.drop('cid', axis = 1).dropna().shape    
    

(198, 7)

In [13]:
# So this got the right answer now... but originally I had a couple issues
# with the check functions such as < vs <= and so I did some checks.
# These were kinda valuable so I'll leave them in there just in case I wanna
# see it later.

wrong_answer = passes_wide_fixed.drop('cid', axis = 1).dropna()
wrong_answer

id_no,byr,ecl,eyr,hcl,hgt,iyr,pid
0,1995,oth,2028,#c0946f,152cm,2016,543685203
1,1989,grn,2022,#733820,155cm,2013,728471979
2,1986,grn,2028,#cfa07d,171cm,2013,214368857
3,1945,brn,2029,#cfa07d,167cm,2010,429131951
4,1966,amb,2028,#888785,170cm,2015,893805464
...,...,...,...,...,...,...,...
277,1998,gry,2021,#c0946f,189cm,2012,472066200
278,1922,oth,2028,#623a2f,158cm,2014,594856217
281,1955,grn,2030,#5637d2,187cm,2014,862655087
282,1980,hzl,2029,#7d3b0c,176cm,2019,703908707


In [14]:
wrong_answer.shape

(198, 7)

In [15]:
passes_wide.shape

(285, 8)

In [20]:
# Answer was too small so I thought I would look at some examples of pieces that 
# were not in my solution.
leftovers = passes_wide.loc[np.setdiff1d(passes_wide.index, wrong_answer.index), :]
leftovers = leftovers.drop('cid', axis = 1).dropna()
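
A sketch of two pandas-only ways to pick the rows that are not in another frame, on made-up data. Whether these read better than `np.setdiff1d` is a matter of taste.

```python
import numpy as np
import pandas as pd

# Toy frames: `kept` is a subset of the rows of `all_rows`.
all_rows = pd.DataFrame({"byr": ["1995", "2013", "1989", "2023"]},
                        index=[0, 46, 1, 68])
kept = all_rows.loc[[0, 1]]

# The approach from the cell above.
via_numpy = all_rows.loc[np.setdiff1d(all_rows.index, kept.index), :]

# Index.difference returns the (sorted) labels present only on the left.
via_pandas = all_rows.loc[all_rows.index.difference(kept.index)]

# A boolean mask keeps the original row order instead of sorting.
via_mask = all_rows[~all_rows.index.isin(kept.index)]

print(via_pandas)
```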

In [17]:
# Okay test a subset:
test = leftovers.copy().iloc[10:20]
test

id_no,byr,ecl,eyr,hcl,hgt,iyr,pid
46,2013,zzz,2032,#a97842,193in,2021,163cm
51,1983,grt,1979,#cfa07d,163cm,1958,796395720
68,2023,grn,2038,06d729,73,1939,#eb4c2a
73,1985,utc,2026,#866857,169cm,2018,#ff1cbf
78,2015,xry,2037,1b816a,96,1954,472891001
80,2027,#4e3d72,2037,#c0946f,129,2009,3569865
82,1972,xry,2030,#7d3b0c,172in,2015,833809421
84,2013,oth,2021,#866857,181cm,2010,072317444
85,1933,amb,2020,#b6652a,96,2012,4354408888
87,1932,brn,2040,#cfa07d,170cm,2014,777844412


In [18]:
# Apply the criteria and look at differences.
for key in func_dict:
    test[key] = func_dict[key](test[key])
test

id_no,byr,ecl,eyr,hcl,hgt,iyr,pid
46,,,,#a97842,,,
51,1983,,,#cfa07d,163cm,,796395720
68,,grn,,,,,
73,1985,,2026,#866857,169cm,2018,
78,,,,,,,472891001
80,,,,#c0946f,,,
82,1972,,2030,#7d3b0c,,2015,833809421
84,,oth,2021,#866857,181cm,2010,072317444
85,1933,amb,2020,#b6652a,,2012,
87,1932,brn,,#cfa07d,170cm,2014,777844412
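
Instead of eyeballing the two tables, the before and after frames can be compared directly to list which fields each passport failed. A sketch using three columns of rows 46 and 51 from the tables above; `before`, `after` and `failed_fields` are names invented for this sketch.

```python
import numpy as np
import pandas as pd

# Values before validation (rows 46 and 51, three columns only).
before = pd.DataFrame(
    {"byr": ["2013", "1983"], "ecl": ["zzz", "grt"], "hgt": ["193in", "163cm"]},
    index=pd.Index([46, 51], name="id_no"),
)

# The same cells after the validators replaced invalid values with NaN.
after = pd.DataFrame(
    {"byr": [np.nan, "1983"], "ecl": [np.nan, np.nan], "hgt": [np.nan, "163cm"]},
    index=before.index,
)

# A field failed if it had a value before and is NaN afterwards.
failed = before.notna() & after.isna()

# Collect the failed column names per passport.
failed_fields = {
    idx: [col for col in failed.columns if failed.loc[idx, col]]
    for idx in failed.index
}
print(failed_fields)
```

Printing `before[failed]` would also show the offending values themselves, which is handy when hunting for an off-by-one in a `<` versus `<=` check.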
