# Assignment Instructions
## Completing the Assignment  
1. Fill in your STUDENTID (abc123) in the code block below.
2. Make sure you fill in any place that says `#YOUR CODE HERE` or "YOUR ANSWER HERE"
3. When filling in `#YOUR CODE HERE` sections, remove or comment out the line  
> `raise NotImplementedError()`  

## Assignment Submission Checklist  
Before you submit this assignment for grading, you must do the following or you risk losing points. 
1. **Remove Extraneous prints** Long prints _might_ confuse the grader. If they do, you lose points.
2. **Restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart)
3. **Run all cells** (in the menubar, select Cell$\rightarrow$Run All).  
  - If any cell that contains code throws an exception or does not compile, fix it and restart this checklist 
  - If a grading cell throws an exception, you will not recieve any credit for that cell
4. **Save the notebook** This ensures that any graphs or plots are in the submission   
  - Do NOT rename your notebook. It must have the same name that was downloaded, or the grading will fail.  
5. **Zip up the assignment notebook(s) and any files required to run the notebook**
  - You must name the zip file "ASnn.zip" where nn is the zero padded assignment number. This is the same file name used to download the assignment.  
  - All files must be in the root of the zip file, NOT in a subdirectory

By submitting this notebook for grading, you affirm that all work was produced by the author identified below, and that references are included for all use of public source material (to include code, data, diagrams, pictures, and verbatim text).

In [1]:
STUDENTID = "igy530"

---

# AS03: Text Data Basics
**Version:**  1.0  
**Total Points:**  5  

## Objective
The objective of this assignment is to become familiar with regular expressions as an essential tools for extracting specific information from text.

## Data Sources
The data file for this exercise was modified from a problem in the Coursera "Applied Text Mining in Python" course. Each line in the data file corresponds to a hand transcribed medical note. Each note has a date that needs to be extracted, but there are several different date formats.

The relevant data sources for this exercise have been copied to the read-only Datasets directory (location is identified by the environment variable DATASETS_ROOT).

## Setup
This section must be run to initialize everything.

[Back to Instructions](#Instructions)

In [2]:
# Imports and globals for this exercise
# mainline tools
import os
import re
# data tools
import pandas as pd
import numpy as np
# DSIP class utilities
import DSIPClassUtilities as utl

# Set paths to directories for the data
dataroot = os.environ['DATASETS_ROOT']
sDataFile = os.path.join(dataroot, 'Misc', 'dated-transcriptions.txt')

# For this exercise, we'll need a list of months to use in our regular expressions.
sMonthPattern = 'jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec'

# place to save temporary data and avoid lengthy reprocessing
TmpSaveDir = './_tmp' 
if not os.path.exists(TmpSaveDir):
    os.makedirs(TmpSaveDir)

# Document the paths and selectors
print('Paths and data for this exercise')
print('--------------------------------')
print('Data File: {}'.format(sDataFile))
print('Temp files saved to:  {}'.format(TmpSaveDir))    

Paths and data for this exercise
--------------------------------
Data File: \\adfs01\datasets\Misc\dated-transcriptions.txt
Temp files saved to:  ./_tmp


## Read in the text file
Define a function that reads in a text file. 
- The output must be a pandas Series named Text, with index named `Line`
- Each entry in the series is a line in the file (in order)
- Strip the line end (`\n`) from each line
- The resulting Series will have 500 entries  

This is easier if you just use standard python file reading tools, then create the Series.

In [3]:
def read_file( sFileName ):
    # YOUR CODE HERE
    ps = pd.read_fwf(sFileName, usecols=[0], header = None)
    ps = pd.Series(ps[0].values, name = 'Text')
    ps.index.name='Line'
    return ps 
    #raise NotImplementedError()

In [4]:
## This is an automatically graded test cell.
# It contains public tests that you can use to help determine whether your
# functions are correct. It also contains hidden tests that are run by
# the autograder.

# Use the defined functions
srDoc = read_file(sDataFile)
l100 = srDoc[100]

# Public tests (make sure your function passes these tests)
# ---------------------------------------------------------
assert 500 == len(srDoc), 'Incorrect size'
assert l100[ len(l100)-1 ] != '\n', 'Line ending was not stripped'
                          

## Extract dates from strings
The following functions are designed to match a particular type of date format. Although the tests for these functions will only be date strings, they must be able to find the date string anywhere it appears in a line.


The first function, `extract_numeric_date()`, must extract a date from a string where the date format is all numbers and separators
- Example forms include: 04/20/2009; 04/20/09; 4/20/09; 4/3/09; 6/2008; 12/2009; 5-13-92; 6-13-1983
- this function should ignore stand alone 4 digit years
- The return must be a datetime object (use pd.to_datetime()) if a date is found, otherwise None
- If no day is specified, set it to the first day of the month
- For two digit years, assume the century is 1900 (e.g., 2/95 is 02/01/1995)
- Assume all dates in xx/xx/xx format are mm/dd/yy

In [5]:
def extract_numeric_date(s):
    # YOUR CODE HERE
    
    if(re.search('(\d{1,2})[/-](\d{1,2})[/-](\d{4})',s)):
        return pd.to_datetime(re.search('(\d{1,2})[/-](\d{1,2})[/-](\d{4})',s).group(0),errors= 'coerce')
    elif(re.search('(\d{1,2})[/-](\d{1,2})[/-](\d{2})',s)):
        y = '19'+re.search('(\d{1,2})[/-](\d{1,2})[/-](\d{2})',s).group(3)
        d = re.search('(\d{1,2})[/-](\d{1,2})[/-](\d{2})',s).group(1)+'/'+ re.search('(\d{1,2})[/-](\d{1,2})[/-](\d{2})',s).group(2) + '/'+ y
        return pd.to_datetime(d,errors= 'coerce')
    elif(re.search('(\d{1,2})[/-](\d{4})',s)):
        m = re.search('(\d{1,2})[/-](\d{4})',s).group(1)
        y = re.search('(\d{1,2})[/-](\d{4})',s).group(2)
        d = m+'/'+ '01' + '/'+ y
        return pd.to_datetime(d,errors= 'coerce')
    elif(re.search('(\d{1,2})[/-](\d{2})',s)):
        m = re.search('(\d{1,2})[/-](\d{4})',s).group(1)
        y = '19' + re.search('(\d{1,2})[/-](\d{4})',s).group(2)
        d = m+'/'+ '01' + '/'+ y
        return pd.to_datetime(d,errors= 'coerce')
    else:
        return None


In [6]:
## This is an automatically graded test cell.
# It contains public tests that you can use to help determine whether your
# functions are correct. It also contains hidden tests that are run by
# the autograder.

# Use the defined functions
dfDates = pd.DataFrame( 
   [
    (  '04/20/2009',        '2009-04-20 00:00:00'),
    (  '4-13-82',           '1982-04-13 00:00:00'),
    (  '4/3/09',            '1909-04-03 00:00:00'),
    (  '6/2008',            '2008-06-01 00:00:00'),
   ],
    columns=['text', 'truth'])
dfDates['truth'] = pd.to_datetime(dfDates['truth'],infer_datetime_format=True)

# Public tests
# ---------------------------------------------------------
for idx, row in dfDates.iterrows():
    assert row['truth'] == extract_numeric_date(row['text']), 'Incorrect for text: {}'.format(row['text'])


The second function, `extract_alpha_date()` must extract a date from a string where the month is spelled out in some form 
- Examples include: 
  - Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;
  - 20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
  - Mar 20th, 2009; Mar 21st, 2009; mar 22nd, 2009
  - Feb 2009; SEP 2009; Oct 2010
- In the Setup cell, there is a variable named `sMonthPattern` that you can use to help construct re patterns to match against month names.   
- For months that are spelled out, there may be typo's. You should only rely on the first 3 letters begin correct.
- Comparison should be case insensitive


In [18]:
def extract_alpha_date(s):
    # YOUR CODE HERE
    s = s.lower()
    matchstring = sMonthPattern.split('|')
    findstring = re.findall('[a-zA-Z]', s)
    findstring = "".join(findstring).lower()
            print(s)
            return pd.to_datetime(s, errors = 'coerce')
            break
        else:
            if(i == matchstring[-1]):
                return None
            else:
                continue
    #raise NotImplementedError()
dfDates = pd.DataFrame( 
   [
    (  'yo yo Jan-20-2009',       '2009-01-20 00:00:00'),
    (  'March 20, 2009',    '2009-03-20 00:00:00'),
    (  'Apr. 20, 2009',     '2009-04-20 00:00:00'),
    (  'Jul 21st, 2009',    '2009-07-21 00:00:00'),
    (  'Aug 22nd, 2009',    '2009-08-22 00:00:00'),
    (  'Oct 2009',          '2009-10-01 00:00:00'),
    (  'Novenber 2010',     '2010-11-01 00:00:00'),
    (  '20 Jan 2009',       '2009-01-20 00:00:00'),
    (  '20 April, 2009',    '2009-04-20 00:00:00'),
    (  '2Jun 1999',         '1999-06-02 00:00:00')
   ], columns=['text', 'truth'])
dfDates['truth'] = pd.to_datetime(dfDates['truth'],infer_datetime_format=True)


# Public tests
# ---------------------------------------------------------
for idx, row in dfDates.iterrows():
    assert row['truth'] == extract_alpha_date(row['text']), 'Incorrect for text: {}'.format(row['text'])

yo yo jan-20-2009
jan
ok
yo yo jan-20-2009
NaT


AssertionError: Incorrect for text: yo yo Jan-20-2009

In [8]:
## This is an automatically graded test cell.
# It contains public tests that you can use to help determine whether your
# functions are correct. It also contains hidden tests that are run by
# the autograder.

# Use the defined functions
dfDates = pd.DataFrame( 
   [
    (  'Jan-20-2009',       '2009-01-20 00:00:00'),
    (  'March 20, 2009',    '2009-03-20 00:00:00'),
    (  'Apr. 20, 2009',     '2009-04-20 00:00:00'),
    (  'Jul 21st, 2009',    '2009-07-21 00:00:00'),
    (  'Aug 22nd, 2009',    '2009-08-22 00:00:00'),
    (  'Oct 2009',          '2009-10-01 00:00:00'),
    (  'Novenber 2010',     '2010-11-01 00:00:00'),
    (  '20 Jan 2009',       '2009-01-20 00:00:00'),
    (  '20 April, 2009',    '2009-04-20 00:00:00'),
    (  '2Jun 1999',         '1999-06-02 00:00:00')
   ], columns=['text', 'truth'])
dfDates['truth'] = pd.to_datetime(dfDates['truth'],infer_datetime_format=True)


# Public tests
# ---------------------------------------------------------
for idx, row in dfDates.iterrows():
    assert row['truth'] == extract_alpha_date(row['text']), 'Incorrect for text: {}'.format(row['text'])
    

The third function, `extract_fourdigit_date()` just looks for a four digit year, but only for years in the 1900's or 2000's
- Examples include: 1950; 2010
- When no month is given, assume January 1 of that year.


In [9]:
def extract_fourdigit_date(s):
    # YOUR CODE HERE
    if(re.search('(\d{4})',s)):
        if(len(s) == 4):
            if((re.search('(\d{4})',s).group().startswith('19')) or (re.search('(\d{4})',s).group().startswith('20'))):
                y = re.search('(\d{4})',s).group()
                d = '01'+'/'+ '01' + '/'+ y
                return pd.to_datetime(d,errors= 'coerce')
            else:
                return None
        else:
            return None
    else:
        return None
    #raise NotImplementedError()

In [10]:
## This is an automatically graded test cell.
# It contains public tests that you can use to help determine whether your
# functions are correct. It also contains hidden tests that are run by
# the autograder.

# Public tests
# ---------------------------------------------------------
assert extract_fourdigit_date('2009') == pd.to_datetime('2009-01-01 00:00:00')
assert extract_fourdigit_date('1927') == pd.to_datetime('1927-01-01 00:00:00')


## Putting them together
Use the three functions above to build a function `extract_date` that parses out a date in any of the three styles handled by these functions. The order that you use them matters! This function will be tested against the data file.
- Remember to return None if no date is found.

In [11]:
def extract_date(s):
    # YOUR CODE HERE
    s= s.lower()
    if (re.search('(\d{0,})[/-](\d{1,})[/-](\d)', s)):
        return extract_numeric_date(s)
    elif (re.search('[a-z]', s)):
        return extract_alpha_date(s)
    elif (re.search('(\d{4})', s)):
        return extract_fourdigit_date(s)
    else:
        return None
    #raise NotImplementedError()

In [12]:
# Loop through the document lines
def date_finder(srDoc):
    # Start with a blank Series
    srOut = pd.Series()
    srOut.name = 'date'
    # Run through the lines
    for idx, s in srDoc.iteritems():
        srOut.loc[idx] = extract_date(s)
    return srOut


In [13]:
## This is an automatically graded test cell.
# It contains public tests that you can use to help determine whether your
# functions are correct. It also contains hidden tests that are run by
# the autograder.

# Public tests
# ---------------------------------------------------------
# This first set of tests is against strings that you have already been using
srDates = pd.Series(['04/20/2009', '4-13-82', '14 January 2011', 
                     '2010','Oct 2009','Novenber 2010'])
srTruth = pd.Series(['2009-04-20 00:00:00', '1982-04-13 00:00:00', '2011-01-14 00:00:00',
                     '2010-01-01 00:00:00', '2009-10-01 00:00:00','2010-11-01 00:00:00'])
srTruth = pd.to_datetime(srTruth)

ans = date_finder(srDates)
idx = ans == srTruth
assert len(idx) == sum(idx), 'Only got {} correct.'.format(sum(idx))
assert date_finder(pd.Series(['Not a date']))[0] is None




In [14]:
## This is an automatically graded test cell.
# It contains public tests that you can use to help determine whether your
# functions are correct. It also contains hidden tests that are run by
# the autograder.

# Now we start testing against the data file.
# The next three test blocks add up to 2 points.
# 1 point (total) if you get them all right
# 0.5 points if you get >480 right

# Use the defined functions
srDates = date_finder(srDoc)

# Public tests (make sure your function passes these tests)
# ---------------------------------------------------------
# No public tests



In [15]:
## This is an automatically graded test cell.
# No public tests



In [19]:
print(srDates.count())

124
