# Table of Contents
* [Extract Traits from the VertNet Database](#Extract-Traits-from-the-VertNet-Database)
	* [Introduction](#Introduction)
		* [General Approach for Extraction](#General-Approach-for-Extraction)
		* [Constants Used During the Extraction](#Constants-Used-During-the-Extraction)
		* [Look at the Words in the Target Cells](#Look-at-the-Words-in-the-Target-Cells)
		* [Regular Expression Objects](#Regular-Expression-Objects)
	* [Sex Parsing](#Sex-Parsing)
		* [Sex Parsing Regular Expression Battery](#Sex-Parsing-Regular-Expression-Battery)
		* [Test Sex Parsing](#Test-Sex-Parsing)
	* [Life Stage Parsing](#Life-Stage-Parsing)
		* [Life Stage Parsing Regular Expression Battery](#Life-Stage-Parsing-Regular-Expression-Battery)
		* [Test Life Stage Parsing](#Test-Life-Stage-Parsing)
	* [Common Regular Expression Fragments for Both Length and Mass Trait Parsing](#Common-Regular-Expression-Fragments-for-Both-Length-and-Mass-Trait-Parsing)
	* [Total Length Parsing](#Total-Length-Parsing)
		* [Common Total Length Parsing Regular Expression Fragments](#Common-Total-Length-Parsing-Regular-Expression-Fragments)
		* [Total Length Parsing Regular Expression Battery](#Total-Length-Parsing-Regular-Expression-Battery)
		* [Test Total Length Parsing](#Test-Total-Length-Parsing)
	* [Body Mass Parsing](#Body-Mass-Parsing)
		* [Common Body Mass Parsing Regular Expression Fragments](#Common-Body-Mass-Parsing-Regular-Expression-Fragments)
		* [Body Mass Parsing Regular Expression Battery](#Body-Mass-Parsing-Regular-Expression-Battery)
		* [Test Body Mass Parsing](#Test-Body-Mass-Parsing)
	* [Extract the Traits](#Extract-the-Traits)
		* [Extract the Raw Trait Values](#Extract-the-Raw-Trait-Values)
		* [Look at Extracted Keys and Units](#Look-at-Extracted-Keys-and-Units)
		* [Normalize the Extracted Traits](#Normalize-the-Extracted-Traits)


# Extract Traits from the VertNet Database

## Introduction

**Welcome to the thrilling world of parsing irregularly structured text!**

We're going to extract the following traits from an extract of the [VertNet database](http://vertnet.org/):
- Sex
- Life stage
- Total length (or a measure commonly substituted for total length, e.g. snout-vent length)
- Body mass (look for common body mass substitutes too)

We are looking for the traits in these columns of the VertNet database:
- dynamicproperties (This will be the preferred column for extracting values)
- occurrenceremarks
- fieldnotes

We will append the extracted data to new columns in each row.

We're exploiting the fact that most of the data is in a structured or semi-structured format.

**Note**: This is an early version and, as such, it uses an *ad hoc* approach with regular expressions.

### General Approach for Extraction

We loop through each row in the CSV file and scan each of the target column cells for a trait. The scanning uses an ordered battery of regular expressions for each trait. Once a trait is found in a cell, we stop scanning that cell for that trait and move on to the next cell. This means that we may find the same trait in more than one cell of a row. For example, we may find a sex in both dynamicproperties and occurrenceremarks, and we will record both. Once we have scanned all cells in a row for one trait, we scan all cells in the row for the next trait, and so on. **The order of the regular expressions is important.**
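
The scanning order described above can be sketched roughly as follows. This is a simplified illustration, not the notebook's implementation: it uses the standard-library `re` module, and `SEX_BATTERY`, `scan_cell`, and `scan_row` are hypothetical names invented for the example.

```python
import re

# Hypothetical mini-battery: ordered (name, pattern) pairs, most specific first.
SEX_BATTERY = [
    ('keyed', re.compile(r'\bsex\W+(?P<value>\w+)', re.IGNORECASE)),
    ('unkeyed', re.compile(r'\b(?P<value>females?|males?)\b', re.IGNORECASE)),
]
COLUMNS = ['dynamicproperties', 'occurrenceremarks', 'fieldnotes']


def scan_cell(battery, cell):
    """Return the first match from the ordered battery, or None."""
    for name, pattern in battery:
        match = pattern.search(cell)
        if match:
            return {'regex': name, 'value': match.group('value')}
    return None


def scan_row(battery, row):
    """Scan every search column; keep one result per column that matched."""
    found = {}
    for column in COLUMNS:
        parsed = scan_cell(battery, row.get(column, ''))
        if parsed:
            found[column] = parsed
    return found


row = {
    'dynamicproperties': 'sex=female ; weight=81 g',
    'occurrenceremarks': 'adult male collected at dusk',
    'fieldnotes': '',
}
result = scan_row(SEX_BATTERY, row)
```

Here the first cell is resolved by the keyed pattern and the second by the unkeyed fallback, so the row ends up with two (conflicting) sex values, one per column. Both are kept.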

We will add a new column for each trait being extracted. That column will contain a JSON object, keyed by source column, like so:

<table>
    <tr>
        <th>...rest of CSV row...</th>
        <th>autoextract_body_length</th>
        <th>...other extracted columns...</th>
    </tr>
    <tr>
        <td>original data is untouched</td>
        <td>{"dynamicproperties":{"key":"totalLengthInMM","units":"MM","value":"270"},
        "fieldnotes":{"key":"total length","units":"mm","value":"270.0"}}</td>
        <td>other extracted data</td>
    </tr>
    
</table>

The outer object is keyed by the column the trait was extracted from, so it has up to three entries (in the example above nothing was found in the "occurrenceremarks" column). Each entry is itself an object with these fields:
- key: The regex key, i.e. the keyword in the text that anchors the match. (Not to be confused with the column key of the outer object.)
- value: A number or a number range for measurements, or a word or phrase for categorical values.
- units: For numeric measurements, the units associated with the value, when we can extract them.

**Note**: We do not try to interpret any of the values; we are only extracting them. We will interpret the data in a later step.
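
As a small illustration of the cell format, the following sketch serializes a pair of per-column results with the standard `json` module. The `extracted` dictionary is made-up data that mirrors the table above, and the compact separators are an assumption about formatting rather than something the notebook prescribes.

```python
import json

# Hypothetical parse results, keyed by the source column they came from.
extracted = {
    'dynamicproperties': {'key': 'totalLengthInMM', 'value': '270', 'units': 'MM'},
    'fieldnotes': {'key': 'total length', 'value': '270.0', 'units': 'mm'},
}

# Compact separators keep the CSV cell short; the values stay as raw strings.
cell = json.dumps(extracted, separators=(',', ':'))

# Reading the cell back recovers the per-column results.
round_trip = json.loads(cell)
```

Note that the values are stored as strings exactly as they appeared in the text ("270" versus "270.0"), which is consistent with deferring interpretation to a later step.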

In [1]:
import os
import sys
import csv
import json
import regex   # re expressions are not enough
import datetime
import unittest
from collections import Counter
from pprint import pprint

### Constants Used During the Extraction

In [2]:
# data directory
DATA_DIR = 'data/'

# The file containing the original VertNet extraction
VERTNET_FILE_NAME = os.path.join(DATA_DIR, 'extendedvnextract')

# Used in file names
now = datetime.datetime.now().strftime("_%Y%m%d_%H%M_")
BASE_FILE_NAME = VERTNET_FILE_NAME + '_'

# The file containing the parsed VertNet traits
RAW_FILE_NAME = BASE_FILE_NAME + 'raw.csv'

# The file containing the normalized VertNet traits
NORMALIZED_FILE_NAME = BASE_FILE_NAME + 'norm.csv'

# A file containing all of the raw words in the target columns -- used to search for stem words
WORDS_FILE_NAME = BASE_FILE_NAME + 'words.txt'

# We will search these VertNet columns to extract the traits
VERTNET_SEARCH_COLUMNS = [
    'dynamicproperties',
    'occurrenceremarks',
    'fieldnotes'
]

### Look at the Words in the Target Cells

To get an idea of what kinds of data are in the cells, have a look at the different words they contain. After examining those words, we can decide which regular expressions to write and which words to use as anchors for them.

In [3]:
def get_words_in_cells(csv_dict_reader, search_columns):
    # For this, we consider dots as letters
    punctuation = regex.compile(r'[^\p{Letter}.]+')
    
    words = Counter()
    
    for row in csv_dict_reader:
        extracted_words = []
        
        for column in search_columns:
            extracted_words.extend(punctuation.split(row[column]))
    
        for word in extracted_words:
            words[word.lower()] += 1
    
    return sorted(words.keys())

In [4]:
def get_all_words():
    with open(VERTNET_FILE_NAME, 'r') as in_file:
        reader = csv.DictReader(in_file)
        words = get_words_in_cells(reader, VERTNET_SEARCH_COLUMNS)

    with open(WORDS_FILE_NAME, 'w') as out_file:
        for word in words:
            out_file.write(word + '\n')

    # print(words)

# get_all_words()
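
For a feel of what the word scan produces, here is a self-contained sketch on two made-up rows. It assumes ASCII letters only and uses the standard-library `re` module, whereas the cell above relies on `\p{Letter}` from the `regex` package; the sample data is invented.

```python
import csv
import io
import re
from collections import Counter

# Assumption: ASCII letters only; dots are kept so abbreviations stay intact.
NOT_WORD = re.compile(r'[^A-Za-z.]+')

sample = io.StringIO(
    'dynamicproperties,occurrenceremarks\n'
    '"sex=female ; wt=81 g","Total length 230 mm."\n'
    '"sex=male","total length 111 mm"\n'
)

counts = Counter()
for row in csv.DictReader(sample):
    for column in ('dynamicproperties', 'occurrenceremarks'):
        for word in NOT_WORD.split(row[column]):
            if word:  # split() can yield empty strings at the edges
                counts[word.lower()] += 1

# Words seen more than once are good candidates for regex anchors.
frequent = sorted(w for w, n in counts.items() if n > 1)
```

In this toy input the repeated words are `length`, `sex`, and `total`, which are the kinds of anchors the batteries below are built around.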

### Regular Expression Objects

The regular expressions require common supporting logic so they are packaged into an object. Then we will use an array of these objects for the actual parsing.

In [5]:
class Regexp:
    def __init__(self, name, regexp,
                 want_array=False,
                 parse_units=False,
                 default_key=None,
                 default_units=None,
                 units_from_key=None,
                 compound_value=False):
        self.name = name
        self.regexp = regex.compile(
            regexp,
            regex.IGNORECASE | regex.VERBOSE)
        self.want_array     = want_array
        self.parse_units    = parse_units
        self.default_key    = default_key
        self.default_units  = default_units
        self.compound_value = compound_value
        self.units_from_key = units_from_key

    def _get_key_(self, match):
        key = None
        if 'key' in match.groupdict().keys():
            key = match.group('key')
        if not key:
            key = self.default_key
        return key

    def _get_value_(self, match):
        if 'value' in match.groupdict().keys():
            return match.group('value')
        return [match.group('value1'), match.group('value2')]

    def _get_units_(self, match, key):
        units = None
        if 'units' in match.groupdict().keys():
            units = match.group('units')
        if 'units1' in match.groupdict().keys():
            units = [match.group('units1'), match.group('units2')]
        if not units and key and self.units_from_key:
            u = self.units_from_key.search(key)
            if u:
                units = u.group('units')
        if not units:
            units = self.default_units
        return units

    def _get_value_array_(self, string):
        matches = self.regexp.findall(string)
        if matches:
            return dict(key=None, value=matches)
        else:
            return None

    def matches(self, string):
        if self.want_array:
            return self._get_value_array_(string)

        match = self.regexp.search(string)
        if not match:
            return None

        parsed = dict()
        parsed['key']   = self._get_key_(match)
        parsed['value'] = self._get_value_(match)
        if self.parse_units:
            parsed['units'] = self._get_units_(match, parsed['key'])

        return parsed

We use a carefully ordered array of regular expressions to look for traits in the database. The logic for dealing with the entire array of regular expressions is captured in the following object.

In [6]:
class RegexpBattery:
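    '''An ordered battery of Regexp objects.

    parse() returns the first match whose value is not rejected by exclude_pattern.
    '''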
    def __init__(self, exclude_pattern=None, parse_units=False, units_from_key=None):
        self.exclude_pattern = exclude_pattern
        if exclude_pattern:
            self.exclude_pattern = regex.compile(
                exclude_pattern,
                regex.IGNORECASE | regex.VERBOSE)

        self.units_from_key = units_from_key
        if units_from_key:
            self.units_from_key = regex.compile(
                units_from_key,
                regex.IGNORECASE | regex.VERBOSE)

        self.battery     = []
        self.parse_units = parse_units

    def _excluded_(self, match):
        if self.exclude_pattern and match and isinstance(match['value'], str):
            return self.exclude_pattern.search(match['value'])
        return False
    
    def append(self, *args, **keywords):
        regexp = Regexp(*args, **keywords)
        self.battery.append(regexp)
        regexp.parse_units    = self.parse_units
        regexp.units_from_key = self.units_from_key
    
    def parse(self, string):
        for regexp in self.battery:
            match = regexp.matches(string)
            if match and not self._excluded_(match):
                # print(regexp.name)
                return match
        return None
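
To see why the order of a battery matters, consider this minimal sketch. It uses the standard-library `re` module rather than the `Regexp` and `RegexpBattery` classes above, and the two patterns and the `first_match` helper are hypothetical.

```python
import re

FLAGS = re.IGNORECASE | re.VERBOSE

# Hypothetical two-pattern battery for a length trait.
specific = re.compile(r' \b (?P<key> total \s* length ) \W+ (?P<value> \d+ ) ', FLAGS)
generic = re.compile(r' \b (?P<key> length ) \W+ (?P<value> \d+ ) ', FLAGS)


def first_match(patterns, text):
    """Return (key, value) from the first pattern that matches, else None."""
    for pattern in patterns:
        match = pattern.search(text)
        if match:
            return match.group('key'), match.group('value')
    return None


text = 'tail length=115 mm; total length=230 mm'
good_order = first_match([specific, generic], text)  # finds the total length
bad_order = first_match([generic, specific], text)   # grabs the tail length
```

With the generic pattern first, the scan stops at the tail length and never reaches the total length, which is why the specific patterns are placed early in each battery.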

[top](#Table-of-Contents)

## Sex Parsing

### Sex Parsing Regular Expression Battery

The regular expressions:
- First we look for a keyword for sex and its value. We try to get a string of words for the value by looking for a delimiter after the value.
- If no delimiter is found, we just return the word that follows the keyword.
- Failing that, we look for the words "male" or "female" in the cells. Here we want to return all matches, not just one, so that we don't seem more sure of the value than we should.
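
The difference between returning one match and returning all of them can be sketched like this. The pattern is a simplified stand-in for the unkeyed case, written for the standard-library `re` module, and the sample text is invented.

```python
import re

# Bare "male"/"female" words, optionally followed by a question mark.
UNKEYED = re.compile(r'\b(?:females?|males?)(?:\s*\?)?', re.IGNORECASE)

text = 'two specimens in jar: male ? and female'

first_only = UNKEYED.search(text).group(0)  # stops at the first hit
all_hits = UNKEYED.findall(text)            # keeps every hit
```

Keeping only the first hit would report a single, uncertain "male ?" and hide the fact that the cell also mentions a female.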

In [7]:
SEX = RegexpBattery(exclude_pattern=r''' ^ (?: and | was | is ) $ ''')

# Look for a key and value that is terminated with a delimiter
SEX.append(
    'sex_key_value_delimited',
    r'''
        \b (?P<key> sex)
        \W+
        (?P<value> [\w?.]+ (?: \s+ [\w?.]+ ){0,2} )
        \s* (?: [:;,"] | $ )
    '''
)

# Look for a key and value without a clear delimiter
SEX.append(
    'sex_key_value_undelimited',
    r'''
         \b (?P<key> sex) \W+ (?P<value> \w+ )
    '''
)

# Look for the words male & female
SEX.append(
    'sex_unkeyed',
    r'''
        \b (?P<value> (?: males? | females? ) (?: \s* \? )? ) \b
    ''',
    want_array=True
)

### Test Sex Parsing

In [8]:
target = SEX

class TestSexParsing(unittest.TestCase):

    def test_sex_key_value_delimited(self):
        self.assertDictEqual(
            target.parse('weight=81.00 g; sex=female ? ; age=u ad.'),
            {'key':'sex', 'value': 'female ?'})
        self.assertDictEqual(
            SEX.parse('sex=unknown ; crown-rump length=8 mm'),
            {'key': 'sex', 'value': 'unknown'})

    def test_sex_key_value_undelimited(self):
        self.assertDictEqual(
            target.parse('sex=F crown rump length=8 mm'),
            {'key':'sex', 'value': 'F'})

    def test_sex_unkeyed(self):
        self.assertDictEqual(
            target.parse('words male female unknown more words'),
            {'key':None, 'value': ['male', 'female']})

    def test_excluded(self):
        self.assertEqual(
            target.parse('Respective sex and msmt. in mm'),
            None)

suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestSexParsing)
unittest.TextTestRunner().run(suite)

....
----------------------------------------------------------------------
Ran 4 tests in 0.002s

OK


<unittest.runner.TextTestResult run=4 errors=0 failures=0>

[top](#Table-of-Contents)

## Life Stage Parsing

### Life Stage Parsing Regular Expression Battery

The regular expressions:
- First we look for a keyword for life stage and its value. We try to get a string of words for the value by looking for a delimiter after the value.
- If no delimiter is found then return known life stage phrases that follow the keyword.
- Failing that, we look for phrases that are associated with life stage.
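
A keyed match is not always a real value: in a phrase such as "age determined by ..." the word after the key is not a life stage. The sketch below shows the idea of rejecting such values with an exclusion pattern. It is a simplified illustration with the standard-library `re` module; `KEYED`, `EXCLUDE`, and `parse_age` are hypothetical names.

```python
import re

FLAGS = re.IGNORECASE | re.VERBOSE
KEYED = re.compile(r' \b (?P<key> age ) \W+ (?P<value> \w+ ) ', FLAGS)
EXCLUDE = re.compile(r' ^ determin ', FLAGS)


def parse_age(text):
    """Return the keyed value unless the exclusion pattern rejects it."""
    match = KEYED.search(text)
    if match and not EXCLUDE.search(match.group('value')):
        return match.group('value')
    return None


kept = parse_age('sex=female ; age=adult')
dropped = parse_age('age determined by skull ossification')
```

The exclusion is applied to the extracted value, not to the whole cell, so a cell can still yield a life stage elsewhere even if it contains the word "determined".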

In [9]:
LIFE_STAGE = RegexpBattery(
    exclude_pattern=r''' ^ determin ''')

# Look for a key and value that is terminated with a delimiter
LIFE_STAGE.append(
    'life_stage_key_value_delimited',
    r'''
        \b (?P<key> (?: life \s* stage | age (?: \s* class )? ) )
           \W+
           (?P<value> [\w?.\/]+ (?: \s+ [\w?.\/]+){0,4} ) \s*
           (?: [:;,"] | $ )
    '''
)

# Look for a key and value without a clear delimiter
LIFE_STAGE.append(
    'life_stage_key_value_undelimited',
    r'''
        \b (?P<key> life \s* stage
                  | age \s* class
                  | age \s* in \s* (?: hour | day ) s?
                  | age
            )
            \W+
            (?P<value> \w+ (?: \s+ (?: year | recorded ) )? )
    '''
)

# Look for common life stage phrases
LIFE_STAGE.append(
    'life_stage_no_keyword',
    r'''
        (?P<value> (?: after \s+ )?
                   (?: first | second | third | fourth | hatching ) \s+
                   year )
    '''
)

### Test Life Stage Parsing

In [10]:
target = LIFE_STAGE

class TestLifeStageParsing(unittest.TestCase):

    def test_life_stage_key_value_delimited(self):
        self.assertDictEqual(
            target.parse('sex=unknown ; age class=adult/juvenile'),
            {'key': 'age class', 'value': 'adult/juvenile'})
        self.assertDictEqual(
            target.parse('weight=81.00 g; sex=female ? ; age=u ad.'),
            {'key': 'age', 'value': 'u ad.'})

    def test_life_stage_key_value_undelimited(self):
        self.assertDictEqual(
            target.parse('sex=female ? ; age=1st year more than four words here'),
            {'key': 'age', 'value': '1st year'})

    def test_life_stage_no_keyword(self):
        self.assertDictEqual(
            target.parse('words after hatching year more words'),
            {'key': None, 'value': 'after hatching year'})

    def test_excluded(self):
        self.assertEqual(
            target.parse('age determined by 20-sided die'),
            None)

suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestLifeStageParsing)
unittest.TextTestRunner().run(suite)

....
----------------------------------------------------------------------
Ran 4 tests in 0.001s

OK


<unittest.runner.TextTestResult run=4 errors=0 failures=0>

[top](#Table-of-Contents)

## Common Regular Expression Fragments for Both Length and Mass Trait Parsing

Length and mass regular expressions use many of the same parsing fragments repeatedly. We group them here and prepend them to the regular expressions in the batteries.

One common abbreviation that is used in both mass and length traits is in the form of: 181-75-21-18=22. The first number is always the total length in millimeters and the last number is the body mass. The other numbers are various length measurements that we are not extracting at this time. The first number (total length) is easy to extract but the last number is typically, but not always, preceded by an equal sign. Things to be careful about when parsing this form:
- We do not want to mistake a date for this shorthand notation.
- If the last number is not preceded by an equal sign or not followed by a mass unit (which makes the parsing easy) then we will consider the last number to be a mass if there are at least 5 numbers in the sequence.
- There is also a variant of the shorthand like: 83-0-17-23-fa64-35. We consider the last number after the "fa" number to be the total mass.
- We have to be careful not to mistake these shorthand notations for number ranges like 10.5-20.2.
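
The cautions above can be illustrated with a deliberately simplified pattern. It assumes dash-separated numbers only and an optional "=mass" tail, whereas the real fragments accept more separators and forms; `SHORTHAND` and `parse_shorthand` are hypothetical names and the code uses the standard-library `re` module.

```python
import re

# Assumption: four or more dash-separated numbers, optionally ending in "=mass".
SHORTHAND = re.compile(r'''
    (?<! [\d.\-] )                                  # do not start mid-number
    (?P<length> \d+ (?: \. \d+ )? )                 # first number: total length
    (?: - \d+ (?: \. \d+ )? ){3,}                   # at least three more numbers
    (?: \s* = \s* (?P<mass> \d+ (?: \. \d+ )? ) )?  # optional equals sign and mass
''', re.VERBOSE)


def parse_shorthand(text):
    """Return (length, mass) strings, or None when text is not shorthand."""
    match = SHORTHAND.search(text)
    if not match:
        return None
    return match.group('length'), match.group('mass')


with_mass = parse_shorthand('181-75-21-18=22')
no_mass = parse_shorthand('308-190-45-20')
a_range = parse_shorthand('10.5-20.2')
a_date = parse_shorthand('2015-06-12')
```

Requiring at least four numbers is what keeps a two-number range and a three-number date from being mistaken for the shorthand in this sketch.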

In [11]:
MASS_LENGTH_FRAGMENTS = r'''
    (?(DEFINE)
    
        # For our purposes numbers are always positive and may have a decimal part.
        (?P<number> [\[\(]? \d+ (?: \. \d* )? [\]\)]? [\*]? )
        
        # We also want to pull in number ranges when appropriate.
        (?P<range> (?&number) (?: \s* (?: - | to ) \s* (?&number) )? )

        # Characters that follow a keyword
        (?P<key_end>  \s* [^\w.\[\(]* \s* )
        
        # We sometimes want to guarantee no word precedes another word.
        # This cannot be done with negative look behind, so we do a positive search for a separator
        (?P<no_word>  (?: ^ | [;,:"'\{\[\(]+ ) \s* )

        # Keywords that may precede a shorthand measurement
        (?P<shorthand_words> on \s* tag
                           | specimens?
                           | catalog
                           | measurements (?: \s+ [\p{Letter}]+)
                           | tag \s+ \d+ \s* =? (?: male | female)? \s* ,
                           | meas [.,]? (?: \s+ \w+ \. \w+ \. )?
        )
        
        # Common keyword misspellings that precede shorthand measurement
        (?P<shorthand_typos>  mesurements | Measurementsnt )
        
        # Keys where we need units to know if it's for mass or length
        (?P<key_units_req> measurements? | body | total )
        
        # Characters that separate shorthand values
        (?P<shorthand_sep> [:,\/\-\s] )

        # Look for an optional dash or space character
        (?P<dash> [\s\-]? )
        
        # Look for an optional dot character
        (?P<dot> \.? )
        
        # Numbers are sometimes surrounded by brackets or parentheses
        # Don't worry about matching the opening and closing brackets
        (?P<open>  [\(\[\{]? )
        (?P<close> [\)\]\}]? )
    )
'''
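
To get a feel for what the `number` and `range` fragments accept, here is a rough equivalent built with plain string interpolation. This is an assumption-laden stand-in: the standard-library `re` module does not support `(?(DEFINE) ...)`, so the fragments are spliced in as strings instead, and `NUMBER`, `RANGE`, and `range_re` are names made up for the example.

```python
import re

# Assumption: string interpolation stands in for (?(DEFINE) ...) subroutines.
NUMBER = r'[\[(]? \d+ (?: \. \d* )? [\])]? \*?'
RANGE = rf'{NUMBER} (?: \s* (?: - | to ) \s* {NUMBER} )?'

range_re = re.compile(rf'(?P<value> {RANGE} )', re.IGNORECASE | re.VERBOSE)

samples = ['10.5-20.2 mm', '16 to 23 mm.', '[270]* est.', 'about 44']
values = [range_re.search(s).group('value') for s in samples]
```

The named-subroutine approach used in the notebook keeps each pattern readable and avoids repeating the fragment text, at the cost of depending on the third-party `regex` package.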

[top](#Table-of-Contents)

## Total Length Parsing

### Common Total Length Parsing Regular Expression Fragments

In [12]:
LENGTH_FRAGMENTS = MASS_LENGTH_FRAGMENTS + r'''
    (?(DEFINE)

        # Look for a shorthand total length. Make sure this isn't a date
        (?P<len_shorthand> (?: (?&shorthand_sep) (?&number) ){3,} )

        # The "European" version of the shorthand length
        (?P<len_shorthand_euro> (?: (?&shorthand_sep) (?&number) [\p{Letter}]* ){3,} )

        # Keys that indicate we have a total length
        (?P<total_len_key> total  (?&dash) length (?&dash) in (?&dash) mm
                         | length (?&dash) in     (?&dash) millimeters
                         | (?: total | max | standard ) (?&dash) lengths?
        )

        # Snout-vent length is sometimes used as a proxy for total length in some groups
        (?P<svl_len_key> snout  (?&dash) vent   (?&dash) lengths? (?: (?&dash) in (?&dash) mm )?
                       | s (?&dot) v (?&dot) l (?&dot)
                       | snout \s+ vent \s+ lengths?
        )

        # Other keys that may be used as a proxy for total length for some groups
        (?P<other_len_key> head  (?&dash) body (?&dash) length (?&dash) in (?&dash) millimeters
                         | (?: fork | mean | body ) (?&dash) lengths?
                         | t [o.]? l (?&dot) _?
        )

        # Ambiguous length keys
        (?P<len_key_ambiguous> lengths? | tag )

        # Abbreviations for total length
        (?P<len_key_abbrev> t (?&dot) o? l (?&dot) )

        # For when the key is a suffix like: 44 mm TL
        (?P<len_key_suffix> (?: in \s* )? (?&len_key_abbrev) )

        # Gather all length key types
        (?P<all_len_keys> (?&total_len_key)
                        | (?&svl_len_key)
                        | (?&other_len_key)
                        | (?&len_key_ambiguous)
                        | (?&key_units_req)
                        | (?&shorthand_words)
                        | (?&shorthand_typos)
        )

        # Length keys found in phrases
        (?P<len_in_phrase> (?: total \s+ length | snout \s+ vent \s+ length ) s? )

        # Length unit words
        (?P<len_units_word> (?: meter | millimeter | centimeter | foot | feet | inch e? ) s? )

        # Length unit abbreviations
        (?P<len_units_abbrev> (?: [cm] (?&dot) m | in | ft ) (?&dot) s? )

        # All length units
        (?P<len_units> (?&len_units_word) | (?&len_units_abbrev) )

        # Used for parsing forms like: 2 ft 4 inches
        (?P<len_foot> (?: foot | feet | ft ) s? (?&dot) )
        (?P<len_inch> (?: inch e? | in )     s? (?&dot) )
    )
'''

### Total Length Parsing Regular Expression Battery

In [13]:
TOTAL_LENGTH = RegexpBattery(parse_units=True, units_from_key=r''' (?<units> mm | millimeters ) $ ''')

# Look for a pattern like: total length: 4 ft 8 in
TOTAL_LENGTH.append(
    'en_len',
    LENGTH_FRAGMENTS + r'''
        \b (?<key> (?&all_len_keys))? (?&key_end)?
           (?<value1> (?&range))    \s*
           (?<units1> (?&len_foot)) \s*
           (?<value2> (?&range))    \s*
           (?<units2> (?&len_inch))
    ''',
    default_key='_english_',
    compound_value=2
)

# Look for total key, number (not a range) and optional units
# Like: total length = 10.5 mm
TOTAL_LENGTH.append(
    'total_len_key_num',
    LENGTH_FRAGMENTS + r'''
        \b (?<key>   (?&total_len_key)) (?&key_end)
           (?<value> (?&number)) \s*
           (?<units> (?&len_units))?
    '''
)

# Look for these secondary length keys next but allow a range
TOTAL_LENGTH.append(
    'other_len_key',
    LENGTH_FRAGMENTS + r'''
        \b (?<key>   (?&other_len_key)) (?&key_end)
           (?<value> (?&range)) \s*
           (?<units> (?&len_units))?
    '''
)

# Look for keys where the units are required
TOTAL_LENGTH.append(
    'key_units_req',
    LENGTH_FRAGMENTS + r'''
        \b (?<key> (?&key_units_req)) (?&key_end)
        (?<value> (?&range)) \s*
        (?<units> (?&len_units))
    '''
)

# Look for a length in a phrase
TOTAL_LENGTH.append(
    'len_in_phrase',
    LENGTH_FRAGMENTS + r'''
        \b (?<key>   (?&len_in_phrase)) \D{1,32}
           (?<value> (?&range)) \s*
           (?<units> (?&len_units))?
    '''
)

# These keys require units to disambiguate what is being measured
TOTAL_LENGTH.append(
    'len_key_ambiguous_units',
    LENGTH_FRAGMENTS + r'''
        (?&no_word)
        (?<key>   (?&len_key_ambiguous)) (?&key_end)
        (?<value> (?&range)) \s*
        (?<units> (?&len_units))
    '''
)

# An out of order parse: tol (mm) 20-25
TOTAL_LENGTH.append(
    'len_key_abbrev',
    LENGTH_FRAGMENTS + r'''
        \b (?<key>   (?&len_key_abbrev)) \s*
           (?&open)  \s* (?<units> (?&len_units)) \s* (?&close) \s*
           (?<value> (?&range))
    '''
)

# This parse puts the key at the end: 20-25 mm TL
TOTAL_LENGTH.append(
    'len_key_suffix',
    LENGTH_FRAGMENTS + r'''
        \b (?<value> (?&range)) \s*
           (?<units> (?&len_units))? \s*
           (?<key>   (?&len_key_suffix))
    '''
)

# Length is in shorthand notation
TOTAL_LENGTH.append(
    'len_shorthand',
    LENGTH_FRAGMENTS + r'''
        \b (?: (?<key> (?&all_len_keys)) (?&key_end) )?
           (?<value>   (?&number))
           (?&len_shorthand)
    ''',
    default_units='_mm_',
    default_key='_shorthand_'
)

# A shorthand notation with some abbreviations in it
TOTAL_LENGTH.append(
    'len_shorthand_euro',
    LENGTH_FRAGMENTS + r'''
        \b (?: (?<key> (?&all_len_keys)) (?&key_end) )?
           [a-z]*
           (?<value>   (?&number))
           (?&len_shorthand_euro)
    ''',
    default_units='_mm_',
    default_key='_shorthand_'
)

# Now we can look for the total length, RANGE, optional units
# See 'total_len_key_num' above
TOTAL_LENGTH.append(
    'total_len_key',
    LENGTH_FRAGMENTS + r'''
        \b (?<key>   (?&total_len_key)) (?&key_end)
           (?<value> (?&range)) \s*
           (?<units> (?&len_units))?
    '''
)

# We will now allow an ambiguous key if it is not preceded by another word
TOTAL_LENGTH.append(
    'len_key_ambiguous',
    LENGTH_FRAGMENTS + r'''
        (?&no_word)
        (?<key>   (?&len_key_ambiguous)) (?&key_end)
        (?<value> (?&range))
    '''
)

# Look for snout-vent length keys
TOTAL_LENGTH.append(
    'svl_len_key',
    LENGTH_FRAGMENTS + r'''
        \b (?<key>   (?&svl_len_key)) (?&key_end)
           (?<value> (?&range)) \s*
           (?<units> (?&len_units))?
    '''
)
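
Units can come from three places: an explicit unit next to the value, a suffix of the key itself (as in `totalLengthInMM`), or a per-pattern default (as in the shorthand notation). The following sketch shows that fallback order in isolation. It uses the standard-library `re` module, and `resolve_units` is a hypothetical helper, not part of the classes above.

```python
import re

# Same idea as the battery's units_from_key pattern, written for stdlib re.
UNITS_FROM_KEY = re.compile(r' (?P<units> mm | millimeters ) $ ',
                            re.IGNORECASE | re.VERBOSE)


def resolve_units(parsed_units, key, default_units=None):
    """Pick units: explicit match first, then the key suffix, then the default."""
    if parsed_units:
        return parsed_units
    if key:
        suffix = UNITS_FROM_KEY.search(key)
        if suffix:
            return suffix.group('units')
    return default_units


from_key = resolve_units(None, 'totalLengthInMM')
explicit = resolve_units('cm', 'Body length')
fallback = resolve_units(None, '_shorthand_', default_units='_mm_')
missing = resolve_units(None, 'length')
```

The units are returned exactly as written in the source text (for example `MM`), since normalization happens in a later step.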

### Test Total Length Parsing

In [14]:
target = TOTAL_LENGTH

class TestTotalLengthParsing(unittest.TestCase):

    def test_units_from_key(self):
        self.assertDictEqual(
            target.parse('{"totalLengthInMM":"123" };'),
            {'key': 'totalLengthInMM', 'value': '123', 'units': 'MM'})

    def test_en_len(self):
        pass
    
    def test_to_be_determined0(self):
        self.assertDictEqual(
            target.parse('measurements: ToL=230;TaL=115;HF=22;E=18; total length=230 mm; tail length=115 mm;'),
            {'key': 'total length', 'value': '230', 'units': 'mm'})
    
    def test_to_be_determined1(self):
        self.assertEqual(
            target.parse('sex=unknown ; crown-rump length=8 mm'),
            None)
    
    def test_to_be_determined2(self):
        self.assertEqual(
            target.parse('left gonad length=10 mm; right gonad length=10 mm;'),
            None)
    
    def test_to_be_determined3(self):
        self.assertDictEqual(
            target.parse('"{"measurements":"308-190-45-20" }"'),
            {'key':'measurements','value': '308','units': '_mm_'})
    
    def test_to_be_determined4(self):
        self.assertDictEqual(
            target.parse('308-190-45-20'),
            {'key':'_shorthand_','value': '308','units': '_mm_'})
    
    def test_to_be_determined5(self):
        self.assertDictEqual(
            target.parse('{"measurements":"143-63-20-17=13 g" }'),
            {'key':'measurements','value': '143','units': '_mm_'})
    
    def test_to_be_determined6(self):
        self.assertDictEqual(
            target.parse('143-63-20-17=13'),
            {'key':'_shorthand_','value': '143','units': '_mm_'})
    
    def test_to_be_determined7(self):
        self.assertDictEqual(
            target.parse('snout-vent length=54 mm; total length=111 mm; tail length=57 mm; weight=5 g'),
            {'key':'total length','value': '111','units': 'mm'})
    
    def test_to_be_determined8(self):
        self.assertDictEqual(
            target.parse('unformatted measurements=Verbatim weight=X;ToL=230;TaL=115;HF=22;E=18; ; total length=230 mm; tail length=115 mm;'),
            {'key':'total length','value': '230','units': 'mm'})
    
    def test_to_be_determined9(self):
        self.assertDictEqual(
            target.parse('** Body length =345 cm; Blubber=1 cm '),
            {'key':'Body length','value': '345','units': 'cm'})

    def test_to_be_determined10(self):
        self.assertDictEqual(
            target.parse('t.l.= 2 feet 3.1 - 4.5 inches '),
            {'key':'t.l.', 'value': ['2', '3.1 - 4.5'], 'units': ['feet', 'inches']})

    def test_to_be_determined11(self):
        self.assertDictEqual(
            target.parse('2 ft. 3.1 - 4.5 in. '),
            {'key':'_english_', 'value': ['2', '3.1 - 4.5'], 'units': ['ft.', 'in.']})

    def test_to_be_determined12(self):
        self.assertDictEqual(
            target.parse('total length= 2 ft.'),
            {'key':'total length','value': '2','units': 'ft.'})

    def test_to_be_determined13(self):
        self.assertDictEqual(
            target.parse('AJR-32   186-102-23-15  15.0g'),
            {'key':'_shorthand_','value': '186','units': '_mm_'})

    def test_to_be_determined14(self):
        self.assertDictEqual(
            target.parse('length=8 mm'),
            {'key':'length','value': '8','units': 'mm'})

    def test_to_be_determined15(self):
        self.assertDictEqual(
            target.parse('another; length=8 mm'),
            {'key':'length','value': '8','units': 'mm'})
    
    def test_to_be_determined16(self):
        self.assertDictEqual(
            target.parse('another; TL_120, noise'),
            {'key':'TL_','value': '120','units': None})
    
    def test_to_be_determined17(self):
        self.assertDictEqual(
            target.parse('another; TL - 101.3mm, noise'),
            {'key':'TL','value': '101.3','units': 'mm'})
     
    def test_to_be_determined18(self):
        self.assertDictEqual(
           target.parse('before; TL153, after'),
            {'key':'TL','value': '153','units': None})

    def test_to_be_determined19(self):
        self.assertDictEqual(
            target.parse('before; Total length in catalog and specimen tag as 117, after'),
            {'key':'Total length','value': '117','units': None})
    
    def test_to_be_determined20(self):
        self.assertDictEqual(
            target.parse('before Snout vent lengths range from 16 to 23 mm. after'),
            {'key':'Snout vent lengths','value': '16 to 23','units': 'mm.'})
    
    def test_to_be_determined21(self):
        self.assertDictEqual(
            target.parse('Size=13 cm TL'),
            {'key':'TL','value': '13','units': 'cm'})
    
    def test_to_be_determined22(self):
        self.assertDictEqual(
            target.parse('det_comments:31.5-58.3inTL'),
            {'key':'TL','value': '31.5-58.3','units': 'in'})
    
    def test_to_be_determined23(self):
        self.assertDictEqual(
            target.parse('SVL52mm'),
            {'key':'SVL','value': '52','units': 'mm'})
    
    def test_to_be_determined24(self):
        self.assertDictEqual(
            target.parse('snout-vent length=221 mm; total length=257 mm; tail length=36 mm'),
            {'key':'total length','value': '257','units': 'mm'})
    
    def test_to_be_determined25(self):
        self.assertDictEqual(
            target.parse('SVL 209 mm, total 272 mm, 4.4 g.'),
            {'key':'total','value': '272','units': 'mm'})
    
    def test_to_be_determined26(self):
        self.assertDictEqual(
            target.parse('{"time collected":"0712-0900", "length":"12.0" }'),
            {'key':'length','value': '12.0','units': None})
    
    def test_to_be_determined27(self):
        self.assertDictEqual(
            target.parse('{"time collected":"1030", "water depth":"1-8", "bottom":"abrupt lava cliff dropping off to sand at 45 ft.", "length":"119-137" }'),
            {'key':'length','value': '119-137','units': None})
    
    def test_to_be_determined28(self):
        self.assertDictEqual(
            target.parse('TL (mm) 44,SL (mm) 38,Weight (g) 0.77 xx'),
            {'key':'TL','value': '44','units': 'mm'})
    
    def test_to_be_determined29(self):
        self.assertDictEqual(
            target.parse('{"totalLengthInMM":"270-165-18-22-31", '),
            {'key':'totalLengthInMM','value': '270','units': 'MM'})
    
    def test_to_be_determined30(self):
        self.assertDictEqual(
            target.parse('{"length":"20-29" }'),
            {'key':'length','value': '20-29','units': None})
    
    def test_to_be_determined31(self):
        self.assertDictEqual(
            target.parse('field measurements on fresh dead specimen were 157-60-20-19-21g'),
            {'key':'_shorthand_','value': '157','units': '_mm_'})


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestTotalLengthParsing)
unittest.TextTestRunner().run(suite)

..................................
----------------------------------------------------------------------
Ran 34 tests in 0.055s

OK


<unittest.runner.TextTestResult run=34 errors=0 failures=0>

[top](#Table-of-Contents)

## Body Mass Parsing

### Common Body Mass Parsing Regular Expression Fragments

In [15]:
MASS_FRAGMENTS = MASS_LENGTH_FRAGMENTS + r'''
    (?(DEFINE)

        # Used to indicate that the next measurement in a shorthand notation is total mass
        (?P<wt_shorthand_sep> [=\s\-]+ )

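        # Leading part of a shorthand notation: at least four measurements, then the separator before the mass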
        (?P<wt_shorthand> (?: (?&number) (?&shorthand_sep) ){3,} (?&number) (?&wt_shorthand_sep) )

        # Shorthand notation requiring units
        (?P<wt_shorthand_req> (?: (?&number) (?&shorthand_sep) ){4,} )

        # A common shorthand notation
        (?P<wt_shorthand_euro> (?&number) hb (?: (?&shorthand_sep) (?&number) [a-z]* ){4,} = )

        # Keywords for total mass
        (?P<total_wt_key> weightingrams | massingrams
                        | (?: body | full | observed | total ) (?&dot) \s* (?&wt_key_word)
        )

        # Keywords often used for total mass
        (?P<other_wt_key> (?: dead | live ) (?&dot) \s* (?&wt_key_word) )

        #  Weight keyword
        (?P<wt_key_word> weights?
                       | weigh (?: s | ed | ing )
                       | mass
                       | w (?&dot) t s? (?&dot)
        )

        # Gather all weight keys
        (?P<all_wt_keys>  (?&total_wt_key)  | (?&other_wt_key) | (?&wt_key_word)
                       |  (?&key_units_req) | (?&shorthand_words) | (?&shorthand_typos))

        # Look for phrases with the total weight
        (?P<wt_in_phrase> total \s+ (?&wt_key_word) )

        # Mass unit words
        (?P<wt_units_word> (?: gram | milligram | kilogram | pound | ounce ) s? )

        # Mass unit abbreviations
        (?P<wt_units_abbrev> (?: m (?&dot) g | k (?&dot) g | g[mr]? | lb | oz ) s? (?&dot) )

        # All mass units
        (?P<wt_units> (?&wt_units_word) | (?&wt_units_abbrev) )

        # Use to parse forms like: 2 lbs 4 oz.
        (?P<wt_pound> (?: pound | lb ) s? (?&dot) )
        (?P<wt_ounce> (?: ounce | oz ) s? (?&dot) )
    )
'''
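
The unit fragments above can be tried out in isolation. The sketch below flattens `wt_units_word`, `wt_units_abbrev`, and `wt_units` into plain strings for the standard library's `re` module, which has no `(?(DEFINE) ...)` or `(?&name)` support. It assumes `(?&dot)` is an optional period (that fragment lives in `MASS_LENGTH_FRAGMENTS`, so this is a guess from how it is used here), and `is_mass_unit` is a hypothetical helper that is not part of the notebook.

```python
import re

# Assumption: (?&dot) is an optional literal period.
DOT = r'\.?'

# Flattened copies of the mass unit fragments defined above.
WT_UNITS_WORD = r'(?:gram|milligram|kilogram|pound|ounce)s?'
WT_UNITS_ABBREV = r'(?:m' + DOT + r'g|k' + DOT + r'g|g[mr]?|lb|oz)s?' + DOT
WT_UNITS = re.compile('(?:' + WT_UNITS_WORD + '|' + WT_UNITS_ABBREV + ')',
                      re.IGNORECASE)


def is_mass_unit(text):
    """Hypothetical helper: True if the whole string is a mass unit."""
    return WT_UNITS.fullmatch(text) is not None


for candidate in ['g', 'gr.', 'kgs', 'lbs.', 'ounces', 'mm', 'in']:
    print(candidate, is_mass_unit(candidate))
```

Length units such as `mm` and `in` are rejected, which is what lets the mass battery skip over length measurements in the same cell.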

### Body Mass Parsing Regular Expression Battery

In [16]:
BODY_MASS = RegexpBattery(parse_units=True, units_from_key=r''' (?<units> grams ) $ ''')

# English units in compound form, like: 2 lbs 4 oz
BODY_MASS.append(
    'en_wt',
    MASS_FRAGMENTS + r'''
         \b (?<key>    (?&all_wt_keys))? (?&key_end)?
            (?<value1> (?&range))  \s*
            (?<units1> (?&wt_pound))  \s*
            (?<value2> (?&range))  \s*
            (?<units2> (?&wt_ounce))
    ''',
    default_key='_english_',
    compound_value=2
)

# A total-weight key followed by a value and optional units, like: body mass=20 g
BODY_MASS.append(
    'total_wt_key',
    MASS_FRAGMENTS + r'''
         \b (?<key>   (?&total_wt_key)) (?&key_end)
            (?<value> (?&range)) \s*
            (?<units> (?&wt_units))?
    '''
)

# Keys that are often used for total mass, like: live weight
BODY_MASS.append(
    'other_wt_key',
    MASS_FRAGMENTS + r'''
         \b (?<key>   (?&other_wt_key)) (?&key_end)
            (?<value> (?&range)) \s*
            (?<units> (?&wt_units))?
    '''
)

# Ambiguous keys that only count when mass units follow the value
BODY_MASS.append(
    'key_units_req',
    MASS_FRAGMENTS + r'''
         \b (?<key>   (?&key_units_req)) (?&key_end)
            (?<value> (?&range)) \s*
            (?<units> (?&wt_units))
    '''
)

# Total weight mentioned in a phrase, with up to 32 non-digit characters before the value
BODY_MASS.append(
    'wt_in_phrase',
    MASS_FRAGMENTS + r'''
         \b (?<key>   (?&wt_in_phrase)) \D{1,32}
            (?<value> (?&range)) \s*
            (?<units> (?&wt_units))?
    '''
)

# Units given between the key and the value, like: Weight (g) 0.77
BODY_MASS.append(
    'wt_key_word',
    MASS_FRAGMENTS + r'''
         \b (?<key>   (?&wt_key_word)) \s*
            (?&open) \s* (?<units> (?&wt_units)) \s* (?&close) \s*
            (?<value> (?&range))
    '''
)

# A bare weight keyword followed by a value; units are required
BODY_MASS.append(
    'wt_key_word_req',
    MASS_FRAGMENTS + r'''
         (?<key>   (?&wt_key_word)) (?&key_end)
         (?<value> (?&range)) \s*
         (?<units> (?&wt_units))
    '''
)

# Shorthand notation with the mass as the last number, like: 143-63-20-17=13
BODY_MASS.append(
    'wt_shorthand',
    MASS_FRAGMENTS + r'''
         \b (?: (?<key> (?&all_wt_keys)) (?&key_end) )?
            (?&wt_shorthand) \s*
            (?<value> (?&number)) \s*
            (?<units> (?&wt_units))?
    ''',
    default_key='_shorthand_'
)

# Shorthand notation with four or more leading measurements; units are required
BODY_MASS.append(
    'wt_shorthand_req',
    MASS_FRAGMENTS + r'''
         \b (?: (?<key> (?&all_wt_keys)) (?&key_end) )?
            (?&wt_shorthand_req) \s*
            (?<value> (?&number)) \s*
            (?<units> (?&wt_units))
    ''',
    default_key='_shorthand_'
)

# Shorthand notation that starts with an "hb" measurement and puts "=" before the mass
BODY_MASS.append(
    'wt_shorthand_euro',
    MASS_FRAGMENTS + r'''
         \b (?: (?<key> (?&all_wt_keys)) (?&key_end) )?
            (?&wt_shorthand_euro) \s*
            (?<value> (?&number)) \s*
            (?<units> (?&wt_units))?
    ''',
    default_key='_shorthand_'
)

# Mass that follows an "fa" entry in shorthand notation, like: 83-0-17-23-fa64-35g
BODY_MASS.append(
    'wt_fa',
    MASS_FRAGMENTS + r'''
         fa \d* -
         (?<value> (?&number)) \s*
         (?<units> (?&wt_units))?
    ''',
    default_key='_shorthand_'
)

# A bare weight keyword followed by a value, with optional units (the least specific pattern)
BODY_MASS.append(
    'wt_key_ambiguous',
    MASS_FRAGMENTS + r'''
         (?<key>   (?&wt_key_word)) (?&key_end)
         (?<value> (?&range)) \s*
         (?<units> (?&wt_units))?
    '''
)
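
How `RegexpBattery` chooses among these patterns is defined earlier in the notebook and not repeated here. One reading that is consistent with the tests in the next cell is "first matching pattern wins, in the order appended". The sketch below illustrates that idea only. `MiniBattery` is a hypothetical, heavily simplified stand-in built on the standard library's `re`, and its two patterns are toy versions of `total_wt_key` and `wt_shorthand`.

```python
import re


class MiniBattery:
    """Hypothetical stand-in for RegexpBattery: first matching pattern wins."""

    def __init__(self):
        self.patterns = []

    def append(self, name, pattern, default_key=None):
        compiled = re.compile(pattern, re.IGNORECASE | re.VERBOSE)
        self.patterns.append((name, compiled, default_key))

    def parse(self, text):
        # Try the patterns in the order they were appended.
        for name, compiled, default_key in self.patterns:
            match = compiled.search(text)
            if match:
                groups = match.groupdict()
                return {'key': groups.get('key') or default_key,
                        'value': groups['value'],
                        'units': groups.get('units')}
        return None


toy = MiniBattery()
toy.append(
    'total_wt_key',
    r''' \b (?P<key> (?: body | total ) \s* (?: weight | mass ) ) \s* [:=]? \s*
            (?P<value> \d+ (?: \. \d+ )? ) \s*
            (?P<units> g | kg )?
    ''')
toy.append(
    'wt_shorthand',
    r''' (?: \d+ - ){3,} \d+ [=\s-]+
         (?P<value> \d+ (?: \. \d+ )? ) \s*
         (?P<units> g | kg )?
    ''',
    default_key='_shorthand_')

print(toy.parse('body mass=20 g'))
print(toy.parse('143-63-20-17=13'))
```

Putting the most specific patterns first means a keyed measurement beats an unkeyed shorthand match when both appear in the same cell.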

### Test Body Mass Parsing

In [17]:
target = BODY_MASS

class TestBodyMassParsing(unittest.TestCase):

    def test_to_be_determined1(self):
        self.assertDictEqual(
            target.parse('762-292-121-76 2435.0g'),
            {'key':'_shorthand_','value': '2435.0','units': 'g'})

    def test_to_be_determined2(self):
        self.assertDictEqual(
            target.parse('TL (mm) 44,SL (mm) 38,Weight (g) 0.77 xx'),
            {'key':'Weight','value': '0.77','units': 'g'})

    def test_to_be_determined3(self):
        self.assertDictEqual(
            target.parse('Note in catalog: Mus. SW Biol. NK 30009; 91-0-17-22-62g'),
            {'key':'_shorthand_','value': '62','units': 'g'})

    def test_to_be_determined4(self):
        self.assertDictEqual(
            target.parse('body mass=20 g'),
            {'key':'body mass','value': '20','units': 'g'})

    def test_to_be_determined5(self):
        self.assertDictEqual(
            target.parse('2 lbs. 3.1 - 4.5 oz '),
            {'key':'_english_', 'value': ['2', '3.1 - 4.5'], 'units': ['lbs.', 'oz']})

    def test_to_be_determined6(self):
        self.assertDictEqual(
            target.parse('{"totalLengthInMM":"x", "earLengthInMM":"20", "weight":"[139.5] g" }'),
            {'key':'weight','value': '[139.5]','units': 'g'})

    def test_to_be_determined7(self):
        self.assertDictEqual(
            target.parse('{"fat":"No fat", "gonads":"Testes 10 x 6 mm.", "molt":"No molt", "stomach contents":"Not recorded", "weight":"94 gr."'),
            {'key':'weight','value': '94','units': 'gr.'})

    def test_to_be_determined8(self):
        self.assertDictEqual(
            target.parse('Note in catalog: 83-0-17-23-fa64-35g'),
            {'key':'_shorthand_','value': '35','units': 'g'})

    def test_to_be_determined9(self):
        self.assertDictEqual(
            target.parse('{"measurements":"20.2g, SVL 89.13mm" }'),
            {'key':'measurements','value': '20.2','units': 'g'})

    def test_to_be_determined10(self):
        self.assertDictEqual(
            target.parse('Body: 15 g'),
            {'key':'Body','value': '15','units': 'g'})

    def test_to_be_determined11(self):
        self.assertDictEqual(
            target.parse('82-00-15-21-tr7-fa63-41g'),
            {'key':'_shorthand_','value': '41','units': 'g'})

    def test_to_be_determined12(self):
        self.assertDictEqual(
            target.parse('weight=5.4 g; unformatted measurements=77-30-7-12=5.4'),
            {'key':'weight','value': '5.4','units': 'g'})

    def test_to_be_determined13(self):
        self.assertDictEqual(
            target.parse('unformatted measurements=77-30-7-12=5.4; weight=5.4;'),
            {'key':'measurements','value': '5.4','units': None})

    def test_to_be_determined14(self):
        self.assertDictEqual(
            target.parse('{"totalLengthInMM":"270-165-18-22-31", '),
            {'key':'_shorthand_','value': '31','units': None})

    def test_to_be_determined15(self):
        self.assertDictEqual(
            target.parse('{"measurements":"143-63-20-17=13 g" }'),
            {'key':'measurements','value': '13','units': 'g'})

    def test_to_be_determined16(self):
        self.assertDictEqual(
            target.parse('143-63-20-17=13'),
            {'key':'_shorthand_','value': '13','units': None})

    def test_to_be_determined17(self):
        self.assertDictEqual(
            target.parse('reproductive data: Testes descended -10x7 mm; sex: male; unformatted measurements: 181-75-21-18=22 g'),
            {'key':'measurements','value': '22','units': 'g'})

    def test_to_be_determined18(self):
        self.assertDictEqual(
            target.parse('{ "massInGrams"="20.1" }'),
            {'key':'massInGrams','value': '20.1','units': 'Grams'})

suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestBodyMassParsing)
unittest.TextTestRunner().run(suite)

..................
----------------------------------------------------------------------
Ran 18 tests in 0.024s

OK


<unittest.runner.TextTestResult run=18 errors=0 failures=0>

[top](#Table-of-Contents)

## Extract the Traits

### Extract the Raw Trait Values

**Warning: This function is currently several times slower than the Perl version.**

I am looking into why. It is slow enough to be a concern.

In [18]:
# The new columns we will put the raw (unnormalized) extracted data into,
# along with the regular expression battery used to parse each value
raw_trait_columns = [
    dict(column='autoextract_sex',         battery=SEX),
    dict(column='autoextract_life_stage',  battery=LIFE_STAGE),
    dict(column='autoextract_body_length', battery=TOTAL_LENGTH),
    dict(column='autoextract_body_mass',   battery=BODY_MASS)
]


def extract_raw_traits():
    with open(VERTNET_FILE_NAME, 'r') as in_file, open(RAW_FILE_NAME, 'w') as out_file:
        reader  = csv.DictReader(in_file)
        headers = reader.fieldnames + [c['column'] for c in raw_trait_columns]
        writer  = csv.DictWriter(out_file, headers)
        writer.writeheader()
        for row in reader:
            for r in raw_trait_columns:
                cell = {}
                for v in VERTNET_SEARCH_COLUMNS:
                    string = row[v]
                    trait  = r['battery'].parse(string)
                    if trait:
                        cell[v] = trait
                if cell:
                    row[r['column']] = json.dumps(cell)
            writer.writerow(row)


extract_raw_traits()
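
To make the shape of each `autoextract_*` cell concrete, here is a small sketch of the inner loop on a single row. The column names and `toy_parse` are hypothetical stand-ins (the real `VERTNET_SEARCH_COLUMNS` and batteries are defined elsewhere in the notebook). The point is that each cell holds a JSON object keyed by the source column the trait was found in.

```python
import json
import re

# Hypothetical column names standing in for VERTNET_SEARCH_COLUMNS.
SEARCH_COLUMNS = ['remarks', 'properties']


def toy_parse(text):
    """Hypothetical stand-in for battery.parse(): find '<number> g'."""
    match = re.search(r'(?P<value>\d+(?:\.\d+)?)\s*(?P<units>g)\b', text)
    if not match:
        return None
    return {'key': 'weight',
            'value': match.group('value'),
            'units': match.group('units')}


row = {'remarks': 'weight 20 g', 'properties': 'no measurements'}

# Same structure as the loop in extract_raw_traits(): one entry per
# search column that produced a match, nothing for columns that did not.
cell = {}
for column in SEARCH_COLUMNS:
    trait = toy_parse(row[column])
    if trait:
        cell[column] = trait

encoded = json.dumps(cell)
print(encoded)
```

Columns with no match are simply absent from the object, and a row with no matches at all leaves the new column empty.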

### Look at Extracted Keys and Units

In [19]:
extracted_words = [
    dict(column='autoextract_body_length', json_field='key'),
    dict(column='autoextract_body_length', json_field='units'),
    dict(column='autoextract_body_mass',   json_field='key'),
    dict(column='autoextract_body_mass',   json_field='units')
]

def get_extracted_word_counts(column_name, json_field):
    cnt = Counter()
    with open(RAW_FILE_NAME, 'r') as in_file:
        reader = csv.DictReader(in_file)
        for row in reader:
            cell = row[column_name]
            if not cell:
                continue
            jcell = json.loads(cell)
            for key, obj in jcell.items():
                if isinstance(obj[json_field], list):
                    word = ' '.join(obj[json_field])
                else:
                    word = obj[json_field]
                if word:
                    cnt[word.lower()] += 1
    return cnt


def print_extracted_word_counts():
    for target in extracted_words:
        words = get_extracted_word_counts(target['column'], target['json_field'])
        out_file_name = '{0}{1}_{2}.txt'.format(BASE_FILE_NAME, target['column'],  target['json_field'])
        with open(out_file_name, 'w') as out_file:
            for word, n in sorted(words.items()):
                out_file.write(word + '\n')


print_extracted_word_counts()
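
The tally can also be tried on a few in-memory cells instead of the full file. This is a sketch with made-up cells and hypothetical column names (`col_a`, `col_b`); `count_words` mirrors the logic of `get_extracted_word_counts` without the CSV reading. It also shows why the unit tables in the next section contain entries like `'lbs. oz'`: compound units are stored as a list and joined with a space.

```python
import json
from collections import Counter

# Made-up cells in the same shape as the autoextract_body_mass column.
cells = [
    json.dumps({'col_a': {'key': '_english_',
                          'value': ['2', '4'],
                          'units': ['lbs.', 'oz']}}),
    json.dumps({'col_a': {'key': 'weight', 'value': '94', 'units': 'gr.'},
                'col_b': {'key': 'Weight', 'value': '0.77', 'units': 'g'}}),
    json.dumps({'col_b': {'key': '_shorthand_', 'value': '13', 'units': None}}),
]


def count_words(cells, json_field):
    """Tally one JSON field (key or units) across all cells."""
    counts = Counter()
    for cell in cells:
        for obj in json.loads(cell).values():
            word = obj[json_field]
            if isinstance(word, list):
                word = ' '.join(word)   # compound units become one entry
            if word:                    # skip missing units (None)
                counts[word.lower()] += 1
    return counts


unit_counts = count_words(cells, 'units')
key_counts = count_words(cells, 'key')
print(sorted(unit_counts.items()))
print(sorted(key_counts.items()))
```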

### Normalize the Extracted Traits

In [20]:
# TODO

LEN_KEY = {
    '_english_'                   : 'total length',
    '_shorthand_'                 : 'total length',
    'body'                        : 'head-body length',
    'Body'                        : 'head-body length',
    'BODY LENGTH'                 : 'head-body length',
    'Body Length'                 : 'head-body length',
    'body length'                 : 'head-body length',
    'Body length'                 : 'head-body length',
    'catalog'                     : 'total length',
    'Forklength'                  : 'fork length',
    'Fork length'                 : 'fork length',
    'fork length'                 : 'fork length',
    'headBodyLengthInMillimeters' : 'head-body length',
    'Length'                      : 'total length',
    'LENGTH'                      : 'total length',
    'length'                      : 'total length',
    'lengthInMillimeters'         : 'total length',
    'Lengths'                     : 'total length',
    'lengths'                     : 'total length',
    'max length'                  : 'total length',
    'maxlength'                   : 'total length',
    'mean length'                 : 'total length',
    'meas.'                       : 'total length',
    'Meas'                        : 'total length',
    'Meas.'                       : 'total length',
    'meas'                        : 'total length',
    'Meas,'                       : 'total length',
    'Meas. H.B.'                  : 'head-body length',
    'Measurement'                 : 'total length',
    'measurement'                 : 'total length',
    'MEASUREMENTS'                : 'total length',
    'Measurements'                : 'total length',
    'measurements'                : 'total length',
    'Measurementsnt'              : 'total length',
    'Mesurements'                 : 'total length',
    'Snout-Vent Length'           : 'snout-vent length',
    'SNOUT-VENT LENGTH'           : 'snout-vent length',
    'snout-vent length'           : 'snout-vent length',
    'Snout-vent length'           : 'snout-vent length',
    'Snout vent length'           : 'snout-vent length',
    'Snout-Vent length'           : 'snout-vent length',
    'snoutVentLengthInMM'         : 'snout-vent length',
    'Snout vent lengths'          : 'snout-vent length',
    'specimen'                    : 'total length',
    'Specimen'                    : 'total length',
    'specimens'                   : 'total length',
    'Standard Length'             : 'standard length',
    'Standard length'             : 'standard length',
    'standard length'             : 'standard length',
    'SVL.'                        : 'snout-vent length',
    'SVL'                         : 'snout-vent length',
    'tag'                         : 'total length',
    'Tag'                         : 'total length',
    'TL_'                         : 'total length',
    'Tl'                          : 'total length',
    'TL.'                         : 'total length',
    'T.l.'                        : 'total length',
    'Tl.'                         : 'total length',
    'tl.'                         : 'total length',
    'TL'                          : 'total length',
    't.l.'                        : 'total length',
    'T.L'                         : 'total length',
    'tl'                          : 'total length',
    'T.L.'                        : 'total length',
    'Tol'                         : 'total length',
    'ToL'                         : 'total length',
    'TOL'                         : 'total length',
    'TOTAL'                       : 'total length',
    'total'                       : 'total length',
    'Total'                       : 'total length',
    'total  length'               : 'total length',
    'Totallength'                 : 'total length',
    'Total Length'                : 'total length',
    'Total length'                : 'total length',
    'totalLength'                 : 'total length',
    'total length'                : 'total length',
    'Total  length'               : 'total length',
    'TOTAL LENGTH'                : 'total length',
    'total length in mm'          : 'total length',
    'totalLengthInMM'             : 'total length',
    'total lengths'               : 'total length',
}

LEN_UNITS = {
    ''             : 1.0,
    'centimeters'  : 10.0,
    'C.M.'         : 10.0,
    'CM.'          : 10.0,
    'CM'           : 10.0,
    'cm.'          : 10.0,
    'cm'           : 10.0,
    'cm.S'         : 10.0,
    'cmS'          : 10.0,
    'feet'         : 304.8,
    'feet inches.' : [304.8, 25.4],
    'FEET INCHES.' : [304.8, 25.4],
    'feet inches'  : [304.8, 25.4],
    'ft'           : 304.8,
    'ft.'          : 304.8,
    'FT'           : 304.8,
    'ft. in'       : [304.8, 25.4],
    'FT IN.'       : [304.8, 25.4],
    'ft in'        : [304.8, 25.4],
    'ft in.'       : [304.8, 25.4],
    'ft. in.'      : [304.8, 25.4],
    'FT IN'        : [304.8, 25.4],
    'ft. inches'   : [304.8, 25.4],
    'In'           : 25.4,
    'in.'          : 25.4,
    'in'           : 25.4,
    'IN.'          : 25.4,
    'IN'           : 25.4,
    'INCHES'       : 25.4,
    'inches'       : 25.4,
    'ins'          : 25.4,
    'meter'        : 1000.0,
    'METERS'       : 1000.0,
    'meters'       : 1000.0,
    'Millimeters'  : 1.0,
    'm.m'          : 1.0,
    'M.M.'         : 1.0,
    'MM'           : 1.0,
    'm.m.'         : 1.0,
    'mm'           : 1.0,
    'MM.'          : 1.0,
    'mm.'          : 1.0,
    '_mm_'         : 1.0,
    'MM.S'         : 1.0,
    'mm.S'         : 1.0,
    'mmS'          : 1.0,
}

MASS_KEY = {
    '_shorthand_'                     : 'total weight',
    '_english_'                       : 'total weight',
    'Body'                            : 'total weight',
    'BODY'                            : 'total weight',
    'Body mass'                       : 'total weight',
    'body mass'                       : 'total weight',
    'Body Mass'                       : 'total weight',
    'body weight'                     : 'total weight',
    'catalog'                         : 'total weight',
    'dead. Wt'                        : 'total weight',
    'full.weight'                     : 'total weight',
    'Live weight'                     : 'total weight',
    'live weight'                     : 'total weight',
    'live wt'                         : 'total weight',
    'Live wt'                         : 'total weight',
    'live wt.'                        : 'total weight',
    'Live wt.'                        : 'total weight',
    'MASS'                            : 'total weight',
    'Mass'                            : 'total weight',
    'mass'                            : 'total weight',
    'massInGrams'                     : 'total weight',
    'Measurement'                     : 'total weight',
    'measurement'                     : 'total weight',
    'Measurements'                    : 'total weight',
    'MEASUREMENTS'                    : 'total weight',
    'measurements'                    : 'total weight',
    'measurements at time of prep'    : 'total weight',
    'Measurements in English'         : 'total weight',
    'MEASUREMENTS IN RED BOOK SAY'    : 'total weight',
    'Measurements read'               : 'total weight',
    'measurements written on NK page' : 'total weight',
    'observedweight'                  : 'total weight',
    'total'                           : 'total weight',
    'Total weight'                    : 'total weight',
    'total weight'                    : 'total weight',
    'Total wt.'                       : 'total weight',
    'total wt'                        : 'total weight',
    'WEIGHT'                          : 'total weight',
    'Weight'                          : 'total weight',
    'weight'                          : 'total weight',
    'weightInGrams'                   : 'total weight',
    'Weights'                         : 'total weight',
    'weights'                         : 'total weight',
    'WT'                              : 'total weight',
    'wt.'                             : 'total weight',
    'WT.'                             : 'total weight',
    'Wt.'                             : 'total weight',
    'Wt'                              : 'total weight',
    'wt'                              : 'total weight',
}

MASS_UNITS = {
    ''               : 1.0,
    'g.'             : 1.0,
    'G.'             : 1.0,
    'G'              : 1.0,
    'g'              : 1.0,
    'gm'             : 1.0,
    'GM'             : 1.0,
    'GM.'            : 1.0,
    'gm.'            : 1.0,
    'gms.'           : 1.0,
    'GMS'            : 1.0,
    'gms'            : 1.0,
    'Gr'             : 1.0,
    'Gr.'            : 1.0,
    'gr.'            : 1.0,
    'GR'             : 1.0,
    'GR.'            : 1.0,
    'gr'             : 1.0,
    'gram'           : 1.0,
    'grams'          : 1.0,
    'GRAMS'          : 1.0,
    'Grams'          : 1.0,
    'grs'            : 1.0,
    'KG.'            : 1000.0,
    'KG'             : 1000.0,
    'Kg.'            : 1000.0,
    'kg.'            : 1000.0,
    'Kg'             : 1000.0,
    'kg'             : 1000.0,
    'kgs.'           : 1000.0,
    'kgs'            : 1000.0,
    'kilograms'      : 1000.0,
    'LB'             : 453.593,
    'LB.'            : 453.593,
    'lb'             : 453.593,
    'lb.'            : 453.593,
    'LB OZ'          : [453.593, 28.349],
    'lb. oz.'        : [453.593, 28.349],
    'lb oz'          : [453.593, 28.349],
    'LB OZ.'         : [453.593, 28.349],
    'lb oz.'         : [453.593, 28.349],
    'lb. oz'         : [453.593, 28.349],
    'LBS.'           : 453.593,
    'lbs'            : 453.593,
    'Lbs'            : 453.593,
    'lbs.'           : 453.593,
    'LBS'            : 453.593,
    'lbs oz.'        : [453.593, 28.349],
    'lbs oz'         : [453.593, 28.349],
    'lbs. oz.'       : [453.593, 28.349],
    'lbs. oz'        : [453.593, 28.349],
    'lbs ozs'        : [453.593, 28.349],
    'mg.'            : 0.001,
    'mg'             : 0.001,
    'mgs.'           : 0.001,
    'ounce'          : 28.349,
    'ounces'         : 28.349,
    'OZ.'            : 28.349,
    'oz.'            : 28.349,
    'oz'             : 28.349,
    'Oz.'            : 28.349,
    'Ozs.'           : 28.349,
    'ozs'            : 28.349,
    'ozs.'           : 28.349,
    'pound ounces'   : [453.593, 28.349],
    'POUNDS'         : 453.593,
    'pounds'         : 453.593,
    'pounds ounces.' : [453.593, 28.349],
    'pounds ounces'  : [453.593, 28.349],
}


def to_number(value):
    value = regex.sub(r'[^\d\.]', '', value)
    return round(float(value), 3)


def normalize(in_file_name, out_file_name):
    with open(in_file_name, 'r') as in_file, open(out_file_name, 'w') as out_file:
        reader = csv.reader(in_file)
        writer = csv.writer(out_file)
        row = next(reader)   # Header row
        row.extend(['Length', 'Weight'])
        writer.writerow(row)

        for row in reader:
            print(reader.line_num)
            lengths = row[-4]
            weights = row[-3]
            norm_len = None
            norm_wt  = None

            if lengths:
                json_len = json.loads(lengths)
                norm_len = []
                for key, obj in json_len.items():
                    if obj['key'] not in LEN_KEY:
                        continue
                    label = LEN_KEY[obj['key']]
                    # Missing units are stored as JSON null; map them to the '' entry
                    units = obj['units'] or ''
                    if isinstance(units, list):
                        # Compound values such as feet and inches: one factor per part
                        units  = ' '.join(units)
                        value  = to_number(obj['value'][0]) * LEN_UNITS[units][0]
                        value += to_number(obj['value'][1]) * LEN_UNITS[units][1]
                    elif regex.search(r'- | to',
                                      obj['value'],
                                      regex.IGNORECASE | regex.VERBOSE):
                        # Ranges such as "20-29" become a (low, high) pair
                        values = regex.split(r'- | to',
                                             obj['value'],
                                             flags=regex.IGNORECASE | regex.VERBOSE)
                        value = (to_number(values[0]) * LEN_UNITS[units],
                                 to_number(values[1]) * LEN_UNITS[units])
                    else:
                        value = to_number(obj['value']) * LEN_UNITS[units]
                    norm_len.append((label, value))

            if weights:
                json_wt = json.loads(weights)
                norm_wt = []
                for key, obj in json_wt.items():
                    if obj['key'] not in MASS_KEY:
                        continue
                    label = MASS_KEY[obj['key']]
                    # Missing units are stored as JSON null; map them to the '' entry
                    units = obj['units'] or ''
                    if isinstance(units, list):
                        # Compound values such as pounds and ounces: one factor per part
                        units  = ' '.join(units)
                        value  = to_number(obj['value'][0]) * MASS_UNITS[units][0]
                        value += to_number(obj['value'][1]) * MASS_UNITS[units][1]
                    elif regex.search(r'- | to', obj['value'], regex.IGNORECASE | regex.VERBOSE):
                        # Ranges become a (low, high) pair
                        values = regex.split(r'- | to',
                                             obj['value'],
                                             flags=regex.IGNORECASE | regex.VERBOSE)
                        value = (to_number(values[0]) * MASS_UNITS[units],
                                 to_number(values[1]) * MASS_UNITS[units])
                    else:
                        value = to_number(obj['value']) * MASS_UNITS[units]
                    norm_wt.append((label, value))

            row.extend([json.dumps(norm_len), json.dumps(norm_wt)])
            writer.writerow(row)


#if __name__ == '__main__':
#    in_file_name  = sys.argv[1]
#    out_file_name = sys.argv[2]

#    normalize(in_file_name, out_file_name)
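
The conversion arithmetic inside `normalize` can be checked on single values without any files. The sketch below is an illustration rather than the notebook's code: `to_grams` is a hypothetical helper, `MASS_UNITS_SAMPLE` is a trimmed copy of the table above, and the standard library's `re` stands in for `regex`. It covers the three cases the function distinguishes: a single value, a range, and a compound pounds-and-ounces value.

```python
import re

# Trimmed copy of the MASS_UNITS table, for illustration only.
MASS_UNITS_SAMPLE = {
    '': 1.0,
    'g': 1.0,
    'kg': 1000.0,
    'lbs. oz': [453.593, 28.349],
}


def to_number(value):
    """Strip everything except digits and periods, then convert."""
    return round(float(re.sub(r'[^\d.]', '', value)), 3)


def to_grams(value, units):
    """Hypothetical helper: convert one extracted mass to grams."""
    units = units or ''                      # missing units map to ''
    if isinstance(units, list):              # compound: one factor per part
        factors = MASS_UNITS_SAMPLE[' '.join(units)]
        return sum(to_number(v) * f for v, f in zip(value, factors))
    parts = re.split(r'-|to', value, flags=re.IGNORECASE)
    grams = [to_number(p) * MASS_UNITS_SAMPLE[units] for p in parts]
    return grams[0] if len(grams) == 1 else tuple(grams)


print(to_grams('[139.5]', 'g'))              # brackets are stripped
print(to_grams('1.5-2', 'kg'))               # range becomes a pair
print(to_grams(['2', '4'], ['lbs.', 'oz']))  # 2 * 453.593 + 4 * 28.349
```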
