# Assignment 1

In this assignment, you'll be working with messy medical data and using regex to extract relevant information from the data.

Each line of the `dates.txt` file corresponds to a medical note. Each note has a date that needs to be extracted, but each date is encoded in one of many formats.

The goal of this assignment is to correctly identify all of the different date variants encoded in this dataset and to properly normalize and sort the dates. 

Here is a list of some of the variants you might encounter in this dataset:
* 04/20/2009; 04/20/09; 4/20/09; 4/3/09
* Mar-20-2009; Mar 20, 2009; March 20, 2009;  Mar. 20, 2009; Mar 20 2009;
* 20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
* Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
* Feb 2009; Sep 2009; Oct 2010
* 6/2008; 12/2009
* 2009; 2010

Once you have extracted these date patterns from the text, the next step is to sort them in ascending chronological order according to the following rules:
* Assume all dates in xx/xx/xx format are mm/dd/yy
* Assume all dates where the year is encoded in only two digits are years from the 1900s (e.g. 1/5/89 is January 5th, 1989)
* If the day is missing (e.g. 9/2009), assume it is the first day of the month (e.g. September 1, 2009).
* If the month is missing (e.g. 2010), assume it is the first of January of that year (e.g. January 1, 2010).
* Watch out for potential typos as this is a raw, real-life derived dataset.
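
The rules above can be sketched for the purely numeric formats as follows. This is a minimal illustration, not the graded solution: the helper name `normalize_numeric` is hypothetical, and it assumes the date string has already been extracted and contains only digits separated by `/` or `-`.

```python
import re
from datetime import date

def normalize_numeric(text):
    """Turn m/d/yy, m/d/yyyy, m/yyyy, or yyyy into a datetime.date."""
    parts = [int(p) for p in re.split(r"[/-]", text.strip())]
    if len(parts) == 3:          # mm/dd/yy or mm/dd/yyyy
        month, day, year = parts
    elif len(parts) == 2:        # mm/yyyy: missing day defaults to the 1st
        month, year = parts
        day = 1
    else:                        # yyyy: missing month and day default to January 1
        (year,) = parts
        month, day = 1, 1
    if year < 100:               # two-digit years are assumed to be 19xx
        year += 1900
    return date(year, month, day)

print(normalize_numeric("1/5/89"))   # 1989-01-05
print(normalize_numeric("9/2009"))   # 2009-09-01
print(normalize_numeric("2010"))     # 2010-01-01
```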

With these rules in mind, find the correct date in each note and return a pandas Series containing the original Series' indices, ordered chronologically by the extracted dates. **Ties should be broken by sorting on ("extracted date", "original row number").**

For example if the original series was this:

    0    1999
    1    2010
    2    1978
    3    2015
    4    1985

Your function should return this:

    0    2
    1    4
    2    0
    3    1
    4    3
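
To make the expected ordering concrete, here is a small plain-Python sketch (deliberately not using pandas) that sorts row numbers by the key ("extracted date", "original row number"). The variable names are illustrative only.

```python
# The example series from above: position = original row number, value = year.
years = [1999, 2010, 1978, 2015, 1985]

# Sort the row numbers by (date, row number); the row number breaks ties.
order = sorted(range(len(years)), key=lambda i: (years[i], i))
print(order)  # [2, 4, 0, 1, 3]

# With duplicate dates, rows that tie keep their original relative order.
tied = [2000, 1990, 2000, 1990]
tied_order = sorted(range(len(tied)), key=lambda i: (tied[i], i))
print(tied_order)  # [1, 3, 0, 2]
```

A stable sort on the dates alone produces the same result, because rows with equal dates are left in their original order.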

Your score will be calculated using [Kendall's tau](https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient), a correlation measure for ordinal data.
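
For intuition, Kendall's tau compares every pair of items and counts how often two rankings agree on their relative order. The sketch below implements the simple tau-a form for rankings without ties; which variant the grader actually uses is not stated here, so treat `kendall_tau` as a hypothetical illustration only.

```python
from itertools import combinations

def kendall_tau(a, b):
    """Tau-a: (concordant - discordant) / total pairs, assuming no ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        sign = (a[i] - a[j]) * (b[i] - b[j])
        if sign > 0:
            concordant += 1      # both rankings order the pair the same way
        elif sign < 0:
            discordant += 1      # the rankings disagree on this pair
    n_pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / n_pairs

print(kendall_tau([1, 2, 3, 4], [1, 2, 3, 4]))  # 1.0 (identical order)
print(kendall_tau([1, 2, 3, 4], [4, 3, 2, 1]))  # -1.0 (reversed order)
print(kendall_tau([1, 2, 3, 4], [1, 2, 4, 3]))  # one swapped pair out of six
```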

*This function should return a Series of length 500 and dtype int.*

In [1]:
import pandas as pd

doc = []
with open('assets/dates.txt') as file:
    for line in file:
        doc.append(line)

df = pd.Series(doc)
df.head(10)

0         03/25/93 Total time of visit (in minutes):\n
1                       6/18/85 Primary Care Doctor:\n
2    sshe plans to move as of 7/8/71 In-Home Servic...
3                7 on 9/27/75 Audit C Score Current:\n
4    2/6/96 sleep studyPain Treatment Pain Level (N...
5                    .Per 7/06/79 Movement D/O note:\n
6    4, 5/18/78 Patient's thoughts about current su...
7    10/24/89 CPT Code: 90801 - Psychiatric Diagnos...
8                         3/7/86 SOS-10 Total Score:\n
9             (4/10/71)Score-1Audit C Score Current:\n
dtype: object

In [2]:
num_elements = df.shape[0]

print(f"Number of elements (rows): {num_elements}")

Number of elements (rows): 500


In [3]:
# Export the raw notes to Excel for manual inspection. With index=True (the default),
# the Series index is written as the first column of the sheet.
df.to_excel('raw.xlsx', index=True)

In [4]:
import pandas as pd
import re

def date_sorter():
    # Regular expressions for the date formats, ordered from most to least specific.
    # Raw strings avoid invalid-escape warnings; [12] replaces [1|2], which also matched a literal '|'.
    regex1 = r'(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})'  # MM/DD/YYYY, M/D/YY, MM-DD-YYYY
    regex2 = r'((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\S*[-/\s]\d{1,2},?[-/\s]\d{2,4})'  # Mar 20, 2009; Mar-20-2009
    regex3 = r'(\d{1,2},?[-/\s](?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\S*[-/\s]\d{2,4})'  # 20 Mar 2009
    regex4 = r'((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\S*[-/\s]\d{2,4})'  # Mar 2009
    regex5 = r'(\d{1,2}[/-][12]\d{3})'  # MM/YYYY, MM-YYYY
    regex6 = r'([12]\d{3})'  # YYYY only

    # Combine the patterns into one alternation; earlier alternatives take priority.
    full_regex = '(%s|%s|%s|%s|%s|%s)' % (regex1, regex2, regex3, regex4, regex5, regex6)
    # str.extract() returns the first match per row, with one column per capture group.
    parsed_date = df.str.extract(full_regex)
    # Column 0 is the outermost group, i.e. the full matched date string.
    # Fix the month misspellings found in the data; otherwise date parsing fails.
    parsed_date = parsed_date.iloc[:, 0].str.replace('Janaury', 'January').str.replace('Decemeber', 'December')
    # Year-only strings such as '1995' are parsed later by pd.to_datetime as 1995-01-01.
    
    def convert_two_digit_year(date_str):
        # Regular expression to find 2-digit year and prepend '19'
        match = re.match(r'(\d{1,2}[/-]\d{1,2}[/-])(\d{2})$', date_str)  # Matches MM/DD/YY or MM-DD-YY
        if match:
            # If two-digit year found, convert it to 19YY
            return match.group(1) + '19' + match.group(2)
        return date_str  # If not, return the original date string
    
    # Apply the conversion to all date entries
    df_converted = parsed_date.apply(convert_two_digit_year)
    # Debugging aid: df_converted.to_excel('test_output.xlsx', index=True)

    df_converted = pd.Series(pd.to_datetime(df_converted))
    # A stable sort preserves the original order of rows with identical dates,
    # which gives the required ("extracted date", "original row number") tie-break.
    df_converted = df_converted.sort_values(ascending=True, kind='stable').index
    return pd.Series(df_converted.values)  # Sorted original indices as a pandas Series

date_sorter()


0        9
1       84
2        2
3       53
4       28
      ... 
495    427
496    141
497    186
498    161
499    413
Length: 500, dtype: int64

In [5]:
# Checking code posted by the lecturer on the discussion forum. Make sure your solution
# matches the expected output first, and only then run the checks below.

import re
import numpy as np
s_test = date_sorter()

def run_df_modified_check():
    """
    Check if df appears to be modified.
    """
    try:
        assert type(df) == pd.Series
        assert (df.index == pd.RangeIndex(start=0, stop=500, step=1)).all()
        assert (df.apply(type) == str).all()
        assert df.str.len().min() >= 6
        assert df.str[5].apply(ord).sum() == 38354
        print("Passed df modification check")
    except:
        print("Failed df modification check")

run_df_modified_check()

# check if running the code twice produces the same result
try:
    assert (date_sorter() == s_test).all()
    print("Passed repeatability check")
except:
    print("Failed repeatability check")

# check if the result has the expected index
try:
    # assert type(date_sorter().index) == pd.RangeIndex
    # assert (date_sorter().index == pd.RangeIndex(start=0, stop=500, step=1)).all()
    assert list(date_sorter().index) == list(range(500))
    print("Passed index check")
except:
    print("Failed index check")

# check the tie-break sort for a sample of records where some have the same date
# note that this only tests a sample and does not check the entire answer
try:
    test_indices = [335, 415, 323, 405, 370, 382, 303, 488, 283,
                    395, 318, 369, 493, 252, 314, 410, 490]
    answer_lkp = {original_index:answer_index for
                  answer_index, original_index in s_test.to_dict().items()}
    i_test = [answer_lkp[i] for i in test_indices]
    assert sorted(i_test) == i_test
    print("Passed secondary sort sample check")
except:
    print("Failed secondary sort sample check")

def run_v_check(s_test):
    """
    Check if the parsed dates appear to be correct and correctly sorted.
    The check works by producing some test checksums
    if you get for example a False entry in the agree column for
    index value 20 that would mean you have at least one incorrectly
    parsed or incorrectly sorted date in the **output** index
    range 20,21,...,29
    The results of the test are printed.
    Args:
    s_test: Series such as produced by date_sorter()
    Returns:
    None
    """
    try:
        v_check = pd.DataFrame({'correct':
        [6695, 14428, 16742, 9275, 12290, 14654, 9421, 10185, 11464, 16491,
         11797, 14036, 15459, 9412, 13069, 10400, 10498, 14322, 13274, 11001,
         11383, 11910, 10977, 9692, 10199, 10187, 15456, 13491, 9186, 13646,
         11142, 13724, 10994, 12905, 15968, 16648, 13966, 14607, 16932, 14622,
         17942, 18220, 17818, 18305, 19633, 12522, 13978, 18445, 20156, 14797],
        'learner':[
        (s_test.iloc[10*i:(i+1)*10].values * np.array(range(1,11))).sum() for i in range(50)]},
        index=range(0,500,10)).assign(agree=lambda x:x['correct']==x['learner'])
        print("Values checksums:")
        print(v_check)
        assert v_check['agree'].all()
        print("Passed values check")
    except:
        print("Failed values check")
    return

run_v_check(s_test)

Passed df modification check
Passed repeatability check
Passed index check
Passed secondary sort sample check
Values checksums:
     correct  learner  agree
0       6695     6695   True
10     14428    14428   True
20     16742    16742   True
30      9275     9275   True
40     12290    12290   True
50     14654    14654   True
60      9421     9421   True
70     10185    10185   True
80     11464    11464   True
90     16491    16491   True
100    11797    11797   True
110    14036    14036   True
120    15459    15459   True
130     9412     9412   True
140    13069    13069   True
150    10400    10400   True
160    10498    10498   True
170    14322    14322   True
180    13274    13274   True
190    11001    11001   True
200    11383    11383   True
210    11910    11910   True
220    10977    10977   True
230     9692     9692   True
240    10199    10199   True
250    10187    10187   True
260    15456    15456   True
270    13491    13491   True
280     9186     9186   True
29