# Exploring Address Parsing APIs

## 0. Setup

The following parsing APIs were considered:
* scourgify
* usaddress
* usps-api
* postal_address (poorly documented)
* address (out of date: its API relies on `print` statements written for Python 2)

After reading the docs and testing each library, I narrowed the list down to three:
* scourgify
* usaddress
* usps-api

### Imports

In [1]:
import usaddress # usaddress
import scourgify # usaddress-scourgify
from usps import USPSApi, Address # usps-api

import urllib.request, json 
import sys

### Set Address

In [2]:
address = '5137 Pond Crest Trail, Fairview, TX 75069'

## 1. Scourgify

Scourgify normalizes addresses according to RESO guidelines and USPS Publication 28.

#### Notes:
* Scourgify does not attempt to validate an address

In [3]:
# Returns a dictionary of "normalized" address data, but does not validate
scourgify_result = scourgify.normalize_address_record(address)

print("scourgify result: \n")
scourgify_result

scourgify result: 



{'address_line_1': '5137 POND CREST TRL',
 'address_line_2': None,
 'city': 'FAIRVIEW',
 'state': 'TX',
 'postal_code': '75069'}
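
Because scourgify returns a plain dictionary, the normalized record is easy to turn back into a single mailing line. The following is a minimal sketch that works on the output shown above. The helper name `format_address_record` is hypothetical and is not part of scourgify.

```python
def format_address_record(record):
    """Join a scourgify-style record into one line (hypothetical helper)."""
    # Keep only the street lines that are actually present
    street_lines = [record.get('address_line_1'), record.get('address_line_2')]
    street = ' '.join(line for line in street_lines if line)
    return f"{street}, {record['city']}, {record['state']} {record['postal_code']}"


# The record printed by the scourgify cell above
record = {
    'address_line_1': '5137 POND CREST TRL',
    'address_line_2': None,
    'city': 'FAIRVIEW',
    'state': 'TX',
    'postal_code': '75069',
}
one_line = format_address_record(record)
print(one_line)
```

Since scourgify does not validate, the formatted line is only as trustworthy as the input address.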

## 2. USAddress

usaddress is a Python library that parses unstructured address strings into labeled address components using statistical NLP methods (a conditional random field model).

#### Notes: 
* Probabilistic parsing
* Does not validate address

In [4]:
# Returns a list of (token, label) tuples
usaddress_result = usaddress.parse(address)

print("usaddress result: \n")
usaddress_result

usaddress result: 



[('5137', 'AddressNumber'),
 ('Pond', 'StreetName'),
 ('Crest', 'StreetName'),
 ('Trail,', 'StreetNamePostType'),
 ('Fairview,', 'PlaceName'),
 ('TX', 'StateName'),
 ('75069', 'ZipCode')]
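
The list of tuples is awkward to consume directly, so a small post-processing step helps. The sketch below merges tokens that share a label into one dictionary entry and strips trailing commas. `merge_labeled_tokens` is a hypothetical helper written for this notebook, not a usaddress function. The library also ships its own `usaddress.tag`, which is likely the better choice in practice.

```python
def merge_labeled_tokens(pairs):
    """Merge (token, label) pairs into {label: 'token token ...'} (hypothetical helper)."""
    merged = {}
    for token, label in pairs:
        token = token.strip(',')  # the parser keeps trailing commas on tokens
        if label in merged:
            merged[label] += ' ' + token
        else:
            merged[label] = token
    return merged


# The output of usaddress.parse shown above
parsed = [
    ('5137', 'AddressNumber'),
    ('Pond', 'StreetName'),
    ('Crest', 'StreetName'),
    ('Trail,', 'StreetNamePostType'),
    ('Fairview,', 'PlaceName'),
    ('TX', 'StateName'),
    ('75069', 'ZipCode'),
]
components = merge_labeled_tokens(parsed)
print(components)
```

Note that this simple version also merges tokens with the same label that are not adjacent, which may not be what you want for unusual addresses.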

## 3. USPS API

The USPS API seems to be the most comprehensive of the three. In addition to parsing the address, it is the only candidate that also _validates_ it.

#### Setup

In [5]:
required_address_std_keys = ['Address2','Address1','City','State','Zip5','Zip4']
campaign_data_table_name = 'campaign_data_drop_'
baselines = {
    'cost_per_piece':0.51,
    'comparison_response_rate_bps':44,
    'comparison_cost_per_response':116,
    'comparison_cost_per_member':1000
}
# Types of responses
current_sales_status_dict = {
    'disqualified':['Declined - Ineligible','Do Not Contact / Deceased'],
    'applied':['Declined - Enrollment Cancelled','Application  Received by Agent','Declined - Enrollment Cancelle','Enrollment Denied','Processed by Enrollment','Enrollment Pending',' Declined - Enrollment Can','Application  Received by Agent','Processed by Enrollment','Enrollment Pending','Application  Received by'],
    'positive_applied':['Application  Received by Agent','Processed by Enrollment','Enrollment Pending','Application  Received by']
    }
# Scaling results
point_scaler = 10000

In [6]:
COGNITO_PUBLIC_KEY = ""
PASSAGEIQ_PUBLIC_KEY_URL = "https://cognito-idp.us-west-2.amazonaws.com/us-west-2_tvhMfFPru/.well-known/jwks.json"
AUDIENCE = "2o9l4v55jf8dhk3hl6f4c8hvl8"
# decode and verify the token

def _clean(value):
    """Replace non-breaking spaces and drop any remaining non-ASCII characters."""
    return str(value).replace('\xa0', ' ').encode('ascii', 'ignore').decode('ascii')


def get_usps_address(table_data, verbose=False):
    """
    Validate a standard US address against the USPS dataset.

    Takes a dictionary with the following keys:
    table_data = {
        'name': '',
        'address_1': '',
        'address_2': '',
        'city': '',
        'state': '',
        'zipcode': ''
    }
    Some values, such as name, can be left blank.

    Keep in mind that the USPS web API switches the two address lines:
    Address2 holds the street address and Address1 holds secondary
    information such as an apartment number. In this notebook's example,
    the street address is passed as address_1 and comes back as Address2.

    The verbose flag is currently unused; the raw result is always printed.

    Returns the 'Address' part of the response, with any key from
    required_address_std_keys that is missing filled with '', or None if
    the request fails. The raw response has the following format:
    {'AddressValidateResponse': {'Address': {'Address1': 'APT 403', 'Address2': '315 N PRINCE ST', '@ID': '0', 'Zip5': '17603', 'State': 'PA', 'City': 'LANCASTER', 'Zip4': '3033'}}}
    """
    address = Address(
        name='',  # table_data['name'] is intentionally not sent
        address_1=_clean(table_data['address_1']),
        address_2=_clean(table_data['address_2']),
        city=_clean(table_data['city']),
        state=_clean(table_data['state']),
        zipcode=_clean(table_data['zipcode'])
    )
    usps = USPSApi('YOUR KEY HERE', test=True)

    try:
        validation = usps.validate_address(address)
        result = validation.result['AddressValidateResponse']
        print("RESULT! ", result, '\n')

        # Standardize the result: make sure every required key is present
        out_res = result['Address']
        for key in required_address_std_keys:
            out_res.setdefault(key, '')
    except Exception:
        # Any failure (request error, unexpected response shape) yields None
        out_res = None
    return out_res
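
The standardization step inside `get_usps_address` can be tried on its own, without calling the USPS service. The sketch below fills in any required key that is absent from a response. The `partial` dictionary is a made-up response used only for illustration, and `fill_missing_keys` is a hypothetical helper name.

```python
REQUIRED_KEYS = ['Address2', 'Address1', 'City', 'State', 'Zip5', 'Zip4']


def fill_missing_keys(address, required_keys=REQUIRED_KEYS):
    """Return a copy of `address` in which every required key exists ('' if absent)."""
    filled = dict(address)  # copy so the caller's dictionary is not modified
    for key in required_keys:
        filled.setdefault(key, '')
    return filled


# Hypothetical response with no secondary line and no ZIP+4
partial = {
    '@ID': '0',
    'Address2': '5137 POND CREST TRL',
    'City': 'FAIRVIEW',
    'State': 'TX',
    'Zip5': '75069',
}
standardized = fill_missing_keys(partial)
print(standardized)
```

Returning a copy is a design choice made for this sketch; the notebook's function updates the response dictionary in place.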

In [7]:
table_data = {
        'name' : '',
        'address_1' : '5137 Pond Crest',
        'address_2' : '',
        'city' : 'Fairview',
        'state' : 'TX',
        'zipcode' : '75069'
}

In [8]:
usps_result = get_usps_address(table_data,verbose=False)
print("USPS result: \n")
usps_result

RESULT!  {'Address': {'@ID': '0', 'Address1': '-', 'Address2': '5137 POND CREST TRL', 'City': 'FAIRVIEW', 'State': 'TX', 'Zip5': '75069', 'Zip4': '6854'}} 

USPS result: 



{'@ID': '0',
 'Address1': '-',
 'Address2': '5137 POND CREST TRL',
 'City': 'FAIRVIEW',
 'State': 'TX',
 'Zip5': '75069',
 'Zip4': '6854'}
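
To compare the USPS result with the scourgify result, it helps to map both onto the same keys. The sketch below renames the USPS fields to scourgify's field names, treating `'-'` and `''` as an absent secondary line. The function name `usps_to_scourgify_keys` is hypothetical, and the mapping is an assumption based only on the two outputs shown in this notebook.

```python
def usps_to_scourgify_keys(usps_address):
    """Rename USPS response fields to scourgify-style keys (hypothetical helper)."""
    secondary = usps_address.get('Address1')
    return {
        # USPS puts the street line in Address2 and the unit in Address1
        'address_line_1': usps_address.get('Address2'),
        'address_line_2': secondary if secondary not in ('-', '', None) else None,
        'city': usps_address.get('City'),
        'state': usps_address.get('State'),
        'postal_code': usps_address.get('Zip5'),
    }


usps_result = {
    '@ID': '0', 'Address1': '-', 'Address2': '5137 POND CREST TRL',
    'City': 'FAIRVIEW', 'State': 'TX', 'Zip5': '75069', 'Zip4': '6854',
}
scourgify_result = {
    'address_line_1': '5137 POND CREST TRL', 'address_line_2': None,
    'city': 'FAIRVIEW', 'state': 'TX', 'postal_code': '75069',
}
same_address = usps_to_scourgify_keys(usps_result) == scourgify_result
print(same_address)
```

The ZIP+4 extension is dropped here because the scourgify output above has no matching field.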

# Conclusion:

In conclusion, after testing usaddress, usaddress-scourgify, and usps-api, I found usps-api to be the most comprehensive. Not only is it backed and supported by the USPS, with releases as recent as 4/24/19, but it was also the only one that validates addresses rather than simply parsing them.

The usaddress API is interesting in that it uses NLP to parse the address, but it returns a less convenient output (a list of tuples).

The usaddress-scourgify API builds upon the usaddress API by returning a more easily accessible output, though the documentation does not mention the use of NLP.