# Assignment 1

In this assignment, you'll be working with messy medical data and using regex to extract relevant information from the data.

Each line of the `dates.txt` file corresponds to a medical note. Each note has a date that needs to be extracted, but each date is encoded in one of many formats.

The goal of this assignment is to correctly identify all of the different date variants encoded in this dataset and to properly normalize and sort the dates. 

Here is a list of some of the variants you might encounter in this dataset:
* 04/20/2009; 04/20/09; 4/20/09; 4/3/09
* Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009
* 20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
* Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
* Feb 2009; Sep 2009; Oct 2010
* 6/2008; 12/2009
* 2009; 2010
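
The variants above can be sketched as an ordered list of regular expressions, tried from most to least specific so that a bare year only matches when nothing richer does. This is a minimal sketch, not the assignment's reference solution: the names `MONTH`, `PATTERNS`, and `classify` are hypothetical, and the patterns assume well-formed English month names (any letters after the three-letter prefix are accepted, which also tolerates misspellings).

```python
import re

# Hypothetical building block: three-letter month prefix, any trailing letters, optional period.
MONTH = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.?"

# Ordered from most specific to least specific (assumed ordering, not from the assignment).
PATTERNS = [
    # 04/20/2009; 04/20/09; 4/3/09
    ("mdy_numeric", re.compile(r"\b(\d{1,2})[/-](\d{1,2})[/-](\d{4}|\d{2})\b")),
    # Mar-20-2009; Mar 20, 2009; Mar. 20, 2009; Mar 20th, 2009
    ("month_day_year", re.compile(
        rf"\b({MONTH})[\s-]+(\d{{1,2}})(?:st|nd|rd|th)?,?[\s-]+(\d{{4}})\b")),
    # 20 Mar 2009; 20 March, 2009
    ("day_month_year", re.compile(rf"\b(\d{{1,2}})\s+({MONTH}),?\s+(\d{{4}})\b")),
    # Feb 2009
    ("month_year", re.compile(rf"\b({MONTH}),?\s+(\d{{4}})\b")),
    # 6/2008
    ("m_year_numeric", re.compile(r"\b(\d{1,2})/(\d{4})\b")),
    # 2009
    ("year_only", re.compile(r"\b([12]\d{3})\b")),
]


def classify(text):
    """Return (pattern_name, captured_groups) for the first pattern that matches, else None."""
    for name, pattern in PATTERNS:
        match = pattern.search(text)
        if match:
            return name, match.groups()
    return None
```

For instance, `classify("Mar 20th, 2009")` falls into the `month_day_year` branch and `classify("6/2008")` into `m_year_numeric`. Real notes contain other numbers (codes, lab values), so a pattern list like this would likely need tightening against the actual data.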

Once you have extracted these date patterns from the text, the next step is to sort them in ascending chronological order according to the following rules:
* Assume all dates in xx/xx/xx format are mm/dd/yy
* Assume that any year encoded in only two digits is from the 1900s (e.g. 1/5/89 is January 5th, 1989)
* If the day is missing (e.g. 9/2009), assume it is the first day of the month (e.g. September 1, 2009).
* If the month is missing (e.g. 2010), assume it is the first of January of that year (e.g. January 1, 2010).
* Watch out for potential typos as this is a raw, real-life derived dataset.
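
The defaulting rules above can be sketched as a small normalizer that takes the extracted pieces as strings. This is an illustration under assumptions: `MONTHS`, `month_number`, and `normalize` are hypothetical names, and month names are resolved by their first three letters only, which is one way to tolerate misspellings.

```python
from datetime import date

# Hypothetical lookup: three-letter month prefix -> month number.
MONTHS = {
    name: number
    for number, name in enumerate(
        ["jan", "feb", "mar", "apr", "may", "jun",
         "jul", "aug", "sep", "oct", "nov", "dec"],
        start=1,
    )
}


def month_number(token):
    """Map '3', '03', 'Mar', 'March', or 'Mar.' to a month number 1-12."""
    token = token.strip().rstrip(".")
    if token.isdigit():
        return int(token)
    # Only the first three letters decide, so misspelled month names still resolve.
    return MONTHS[token[:3].lower()]


def normalize(year, month=None, day=None):
    """Build a date from string parts, applying the assignment's defaulting rules."""
    year_number = int(year)
    if year_number < 100:
        # Two-digit years are assumed to be in the 1900s.
        year_number += 1900
    # A missing month defaults to January; a missing day defaults to the 1st.
    month_value = month_number(month) if month is not None else 1
    day_value = int(day) if day is not None else 1
    return date(year_number, month_value, day_value)
```

With this sketch, `normalize("89", "1", "5")` gives January 5, 1989, and `normalize("2009", "9")` gives September 1, 2009. Because `datetime.date` objects compare chronologically, the results can be sorted directly.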

With these rules in mind, find the correct date in each note and return a pandas Series in chronological order of the original Series' indices. **This Series should be sorted by a tie-break sort in the format of ("extracted date", "original row number").**

For example if the original series was this:

    0    1999
    1    2010
    2    1978
    3    2015
    4    1985

Your function should return this:

    0    2
    1    4
    2    0
    3    1
    4    3
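
The tie-break rule can be illustrated without pandas by sorting row numbers on the key `(date, row number)`. This is a minimal sketch with a hypothetical helper name, `chronological_order`; it uses the five-row example above plus one made-up input containing a tie.

```python
def chronological_order(dates):
    """Return original row numbers ordered by (date, row number)."""
    return sorted(range(len(dates)), key=lambda i: (dates[i], i))


# The example from the assignment text.
years = [1999, 2010, 1978, 2015, 1985]
print(chronological_order(years))  # [2, 4, 0, 1, 3]

# Equal dates keep their original relative order: row 0 comes before row 2.
print(chronological_order([2009, 1999, 2009]))  # [1, 0, 2]
```

Passing the resulting list to `pd.Series(...)` would give a Series with the default `0..n-1` index, which is the shape the assignment asks for.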

Your score will be calculated using [Kendall's tau](https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient), a correlation measure for ordinal data.
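
To give a feel for the metric, here is a minimal pure-Python sketch of Kendall's tau in its simplest form (tau-a), which counts concordant and discordant pairs. The assignment does not say which variant the grader uses or how ties are treated, so this is an illustration only; `kendall_tau_a` is a hypothetical name.

```python
from itertools import combinations


def kendall_tau_a(a, b):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    concordant = 0
    discordant = 0
    for i, j in combinations(range(len(a)), 2):
        sign = (a[i] - a[j]) * (b[i] - b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
        # Pairs tied in either sequence count as neither.
    total_pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / total_pairs
```

Two identical rankings give 1.0, a fully reversed ranking gives -1.0, and swapping one adjacent pair in a ranking of three items gives 1/3.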

*This function should return a Series of length 500 and dtype int.*

In [284]:
def date_sorter():
    
    # mm/dd/yy or mm/dd/yyyy
    regex1 = r'(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})'
    # Month dd, yyyy (month name or abbreviation first)
    regex2 = r'((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[\S]*[+\s]\d{1,2}[,]{0,1}[+\s]\d{4})'
    # dd Month yyyy
    regex3 = r'(\d{1,2}[+\s](?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[\S]*[+\s]\d{4})'
    # Month yyyy
    regex4 = r'((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[\S]*[+\s]\d{4})'
    # mm/yyyy ([12] rather than [1|2], which would also match a literal '|')
    regex5 = r'(\d{1,2}[/-][12]\d{3})'
    # yyyy
    regex6 = r'([12]\d{3})'
    full_regex = '(%s|%s|%s|%s|%s|%s)' %(regex1, regex2, regex3, regex4, regex5, regex6)
    parsed_date = df.str.extract(full_regex, expand=True)
    parsed_date = parsed_date.iloc[:,0].str.replace('Janaury', 'January').str.replace('Decemeber', 'December')
    parsed_date = pd.Series(pd.to_datetime(parsed_date))

    data = pd.DataFrame({'date':parsed_date})
    data.date = data.date.mask(data.date.gt(pd.Timestamp('today')), data.date-pd.DateOffset(years=100))
    
    parsed_date = pd.Series(data.iloc[:,0])
    
    #parsed_date[231] = parsed_date[231]+pd.to_timedelta(1)
    #parsed_date[335] = parsed_date[335]+pd.to_timedelta(1)
    
    # a stable sort keeps the original row order for equal dates (the tie-break rule)
    parsed_date = parsed_date.sort_values(ascending=True,kind='stable')
    parsed_date_idx = parsed_date.index

    '''a = parsed_date.iloc[17]
    parsed_date.replace(parsed_date.iloc[17],parsed_date.iloc[16])
    parsed_date.replace(parsed_date.iloc[16],a)
'''
    return pd.Series(parsed_date_idx.values), parsed_date

test_id, test = date_sorter()




In [285]:
print(test.iloc[10:20])
print(df[test_id[18]])

111   1972-06-10
225   1972-06-15
31    1972-07-20
171   1972-10-04
191   1972-11-30
486   1973-01-01
335   1973-02-01
415   1973-02-01
36    1973-02-14
405   1973-03-01
Name: date, dtype: datetime64[ns]
2/14/73 CPT Code: 90801 - Psychiatric Diagnosis Interview



In [286]:
import re
import numpy as np
s_test,s = date_sorter()

def run_df_modified_check():
    """
    Check if df appears to be modified.
    """
    try:
        assert type(df) == pd.Series
        assert (df.index == pd.RangeIndex(start=0, stop=500, step=1)).all()
        assert (df.apply(type) == str).all()
        assert df.str.len().min() >= 6
        assert df.str[5].apply(ord).sum() == 38354
        print("Passed df modification check")
    except:
        print("Failed df modification check")

run_df_modified_check()

# check if running the code twice produces the same result
try:
    assert (date_sorter() == s_test).all()
    print("Passed repeatability check")
except:
    print("Failed repeatability check")

# check if the result has the expected index
try:
    assert type(date_sorter().index) == pd.RangeIndex
    assert (date_sorter().index == pd.RangeIndex(start=0, stop=500, step=1)).all()
    print("Passed index check")
except:
    print("Failed index check")

# check the tie-break sort for a sample of records where some have the same date
# note that this only tests a sample and does not check the entire answer
try:
    test_indices = [335, 415, 323, 405, 370, 382, 303, 488, 283,
                    395, 318, 369, 493, 252, 314, 410, 490]
    answer_lkp = {original_index:answer_index for
                  answer_index, original_index in s_test.to_dict().items()}
    i_test = [answer_lkp[i] for i in test_indices]
    assert sorted(i_test) == i_test
    print("Passed secondary sort sample check")
except:
    print("Failed secondary sort sample check")

def run_v_check(s_test):
    """
    Check if the parsed dates appear to be correct and correctly sorted.
    The check works by producing some test checksums
    if you get for example a False entry in the agree column for
    index value 20 that would mean you have at least one incorrectly
    parsed or incorrectly sorted date in the **output** index
    range 20,21,...,29
    The results of the test are printed.
    Args:
    s_test: Series such as produced by date_sorter()
    Returns:
    None
    """
    try:
        v_check = pd.DataFrame({'correct':
        [6695, 14428, 16742, 9275, 12290, 14654, 9421, 10185, 11464, 16491,
         11797, 14036, 15459, 9412, 13069, 10400, 10498, 14322, 13274, 11001,
         11383, 11910, 10977, 9692, 10199, 10187, 15456, 13491, 9186, 13646,
         11142, 13724, 10994, 12905, 15968, 16648, 13966, 14607, 16932, 14622,
         17942, 18220, 17818, 18305, 19633, 12522, 13978, 18445, 20156, 14797],
        'learner':[
        (s_test.iloc[10*i:(i+1)*10].values * np.array(range(1,11))).sum() for i in range(50)]},
        index=range(0,500,10)).assign(agree=lambda x:x['correct']==x['learner'])
        print("Values checksums:")
        print(v_check)
        assert v_check['agree'].all()
        print("Passed values check")
    except:
        print("Failed values check")
    return

run_v_check(s_test)

Passed df modification check
Failed repeatability check
Failed index check
Failed secondary sort sample check
Values checksums:
     correct  learner  agree
0       6695     6695   True
10     14428    15248  False
20     16742    16660  False
30      9275     9275   True
40     12290    12290   True
50     14654    14654   True
60      9421    11271  False
70     10185    10000  False
80     11464    11238  False
90     16491    16426  False
100    11797    11797   True
110    14036    13942  False
120    15459    15261  False
130     9412     9412   True
140    13069    12854  False
150    10400    10400   True
160    10498    10498   True
170    14322    14155  False
180    13274    13131  False
190    11001    11001   True
200    11383    12723  False
210    11910    11776  False
220    10977    10977   True
230     9692     9692   True
240    10199    10199   True
250    10187    10187   True
260    15456    15276  False
270    13491    15261  False
280     9186     8832  False
29



In [271]:
import pandas as pd
import re

doc = []
with open('assets/dates.txt') as file:
    for line in file:
        doc.append(line)

df = pd.Series(doc)
df.head(10)

print(df[200:230])
print(df[200])

200    July 26, 1978 Total time of visit (in minutes):\n
201    father was depressed inpatient at DFC December...
202                   May 15, 1989 SOS-10 Total Score:\n
203    September 06, 1995 Total time of visit (in min...
204    Mar. 10, 1976 CPT Code: 90791: No medical serv...
205                    .Got back to U.S. Jan 27, 1983.\n
206    Queen Hamilton in Bonita Springs courthouse.  ...
207    r August 12 2004 - diagnosed with Parkinson's ...
208                            September 01, 2012 Age:\n
209    July 25, 1983 Total time of visit (in minutes):\n
210    August 11, 1989 Total time of visit (in minute...
211    April 17, 1992 Total time of visit (in minutes...
212    EKG July 24, 1999: QTc 496 msPertinent Medical...
213    July 11, 1997 CPT Code: 90792: With medical se...
214    s Gale Youngquist is a 22 yo single Caucasian ...
215    .August 14, 1981- bad reaction to SpiceK2 - sy...
216     Nov 11, 1988 Total time of visit (in minutes):\n
217    e June 13, 2011 Suicidal

In [42]:
def date_sorter():
    
    order = None
    # YOUR CODE HERE
    
    test = df.str.findall(r'(\d?\d)[\/\-](\d?\d)[\/\-](\d{2})(?!\d)')
    reg_month = '((?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?))'
    #test = test + df.str.findall('r"' + reg_month + '[\s\-\.]+(?:\d?\d)[,\s\-]+(?:\d{4}|\d{2})')
    
    test = test + df.str.findall(r'((?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?))[\s\-\.]+((?:\d?\d)(?!\d))[,\s\-]+((?:\d{4}|\d{2}))')
    test = test + df.str.findall(r'((?:\d?\d))[,\s\-]+((?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?))[\s\-\.,]+((?:\d{4}|\d{2}))')
    test = test + df.str.findall(r'((?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?))[\s]((?:\d?\d\w*)(?!\d))[,\s\-]*((?:\d{4}|\d{2}))')
    test = test+ df.str.findall(r'(?<!(\d{2}\s))((?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?))[\s]((?:\d{4}))')
    test = test + df.str.findall(r'((?:\d{1,2}))?[\/]*((?:\d{4}))')
    test = test + df.str.findall(r'((?:Jan(?:[ua]+ry)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:[em]+ber)?))[\s\,\-]((?:\d?\d\w*)(?!\d))?[,\s\-]*((?:\d{4}|\d{2}))')
    
    #df_test = test.apply(pd.Series)
    #df_test.columns = []

    #raise NotImplementedError()
    return test # Your answer here

test = date_sorter()
years = [item[0][-1] for item in test]
day = [item[0][0] if len(item[0][0])!=0 else 1 for item in test]
month = [item[0][1] for item in test]

years = ['19'+year if len(year)<4 else year for year in years ]
idx = sorted(range(len(years)), key=lambda k: years[k])

mask_test = test.str.len()==0
#print(test[mask_test].index)
print(day,'\n',month,'\n',years)
print(idx)
print(years)


['03', '6', '7', '9', '2', '7', '5', '10', '3', '4', '5', '4', '8', '1', '24', '25', '4', '13', '4', '5', '7', '10', '3', '2', '25', '4', '9', '9', '9', '10', '31', '7', '4', '06', '12', '3', '2', '5', '27', '1', '7', '6', '8', '13', '8', '15', '7', '06', '9', '2', '11', '5', '6', '7', '12', '11', '3', '12', '5', '20', '7', '8', '02', '6', '29', '08', '10', '7', '1', '3', '7', '4', '7', '4', '09', '9', '12', '05', '4', '10', '6', '8', '07', '14', '5', '09', '6', '8', '12', '8', '10', '4', '08', '9', '08', '11', '7', '3', '5', '11', '8', '10', '18', '9', '2', '2', '11', '8', '5', '20', '6', '6', '10', '12', '12', '4', '12', '6', '27', '07', '12', '10', '11', '5', '2', '24', '10', '26', '28', '06', '25', '14', '30', '28', '14', '10', '11', '10', '05', '21', '14', '30', '22', '14', '06', '18', '11', '30', '02', '09', '12', '22', '28', '13', '06', '10', '26', '10', '23', '26', '21', '19', '05', '29', '21', '18', '11', '01', '13', '21', '24', '04', '23', '18', '04', '21', '26', '18', '15', 

AttributeError: 'list' object has no attribute 'iloc'

In [258]:
from typing import NamedTuple
import re
import numpy as np

class Token(NamedTuple):
    type: str
    value: str
    line: int
    column: int

def tokenize(code,i):
    
    token_specification = [
        ('month_day_num', r'((?<![\/\-]\d{1}[\/\-])(?<![\/\-]\d{2}[\/\-])(?:\d?\d)(?!\d{2}))'),  # one- or two-digit day or month number
        ('month_alfa',   r'(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)'),           # month name, abbreviated or spelled out
        ('year',      r'(?<!\d)(?:\d{4}|\d{2})\b')  # two- or four-digit year
    ]
    tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
    
    temp_group = []
    year_group = []
    date_group = []
    line_num = i
    line_start = 0

    k = 0
    for mo in re.finditer(tok_regex, code):
        kind = mo.lastgroup
        value = mo.group()
        column = mo.start() - line_start
        
        
        #temp_group.append(mo.group())
        
        
        if mo.group('year')!= None:
            year_group.append((line_num,mo.group('year'),k))
        
        temp_group.append(mo.group())
        
        k +=1    
        
        

    
    if   len(temp_group) == 1:
        temp_group=[1,1,1]
        year_group.append((line_num,mo.group(),k))
    
    l = year_group[-1][-1]     
    #print(year_group)

    date_group.append((temp_group[l-2],temp_group[l-1],year_group[-1][-2])) 
    
    

        #yield Token(kind, value, line_num, column)
    
    return temp_group, year_group,date_group
        
   

statements = df
#statements = ['2004','Mar 20th, 1992','3-4-52']       
#a = pd.DataFrame(index=[0],columns=['day','month','year'])

i=0
temp =[]
temp_year = []
temp_date = []
for s in statements:
    temp.append(tokenize(s.strip(),i)[0])
    temp_year.append(tokenize(s.strip(),i)[1])
    temp_date.append(tokenize(s.strip(),i)[2])
    i+=1
    

#idx = [x if len(temp[x])>3 else [] for x in range(0,500)]

print(temp_date[57])
print(df[10])

final_df = pd.DataFrame([t for lst in temp_date for t in lst],columns=['Month','Day','Year'])
final_df['Year'] = ['19'+year if len(year)<4 else year for year in final_df['Year'] ]
final_df.sort_values(['Year','Month','Day'],inplace=True,kind='mergesort',ascending=[True,True,True])
print(final_df)


[('12', '01', '73')]
(5/11/85) Crt-1.96, BUN-26; AST/ALT-16/22; WBC_12.6Activities of Daily Living (ADL) Bathing: Independent

    Month  Day  Year
10     26   16  1922
9       4   10  1971
84      5   18  1971
53      7   11  1971
2       7    8  1971
..    ...  ...   ...
464  2016    8  2016
253  2016  Feb  2016
141    30  May  2016
231    50  May  2016
427     6    5  2016

[500 rows x 3 columns]
