# Canonicalization
In several areas, there are explicit rules to standardize the representation of data.  This procedure, called canonicalization or normalization, removes differences that are not important and permits tests for exact matches to succeed even when the details of the representation have been changed.

URI normalization is described in RFC 3986, which lists transformations that can be applied to a URI so that equivalent URIs can be recognized as the same.

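Duplicate slashes in the path are commonly collapsed (a widespread convention rather than an RFC 3986 rule, since a server may treat the two paths differently):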
    
    http://example.com/foo//bar.html → http://example.com/foo/bar.html

Relative path segments (`.` and `..`, which RFC 3986 calls dot-segments) should be resolved and removed:

    http://example.com/foo/../bar.html → http://example.com/bar.html
    
Certain ASCII characters (the unreserved characters: letters, digits, `-`, `.`, `_`, and `~`) do not require percent encoding in URI strings, and should be decoded:

    http://example.com/%7Efoo → http://example.com/~foo
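The three path rules above can be sketched in Python as follows. This is a minimal illustration, not a complete RFC 3986 normalizer: the function names `normalize_path` and `normalize_url` are made up for this example, and `posixpath.normpath` (a filesystem-path function) is used as a stand-in for the RFC's dot-segment removal, so some edge cases may differ.

```python
import posixpath
import re
import string
from urllib.parse import urlsplit, urlunsplit

# The characters RFC 3986 calls "unreserved".
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

def _decode_unreserved(match):
    # Decode %XX only when it stands for an unreserved character;
    # otherwise keep the escape, with its hex digits in upper case.
    char = chr(int(match.group(1), 16))
    return char if char in UNRESERVED else match.group(0).upper()

def normalize_path(path):
    # Hypothetical helper: applies the three rules to the path component.
    path = re.sub(r"%([0-9A-Fa-f]{2})", _decode_unreserved, path)
    path = re.sub(r"/{2,}", "/", path)           # collapse duplicate slashes
    if path.startswith("/"):
        keep_trailing_slash = path.endswith("/")
        path = posixpath.normpath(path)          # resolve "." and ".."
        if keep_trailing_slash and path != "/":
            path += "/"
    return path

def normalize_url(url):
    # Only the path is touched; scheme, host, query, and fragment are kept as-is.
    parts = urlsplit(url)
    return urlunsplit(parts._replace(path=normalize_path(parts.path)))

for url in ["http://example.com/foo//bar.html",
            "http://example.com/foo/../bar.html",
            "http://example.com/%7Efoo"]:
    print(url, "->", normalize_url(url))
```

Once two URLs have been passed through the same function, they can be compared with `==`.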

Differences in capitalization can hide the fact that two values are the same, and fields often contain stray whitespace, such as spaces and newlines, that makes otherwise identical values look different.


    names = ["THOMPSON, EMILY",
             "THOMPSON,EMILY",
             "THOMPSON,   EMILY ",
             "Thompson, Emily ",
             "Thompson, Emily A."]

We cannot tell from these strings alone how many distinct Emily Thompsons are in this database, but differences in spacing and capitalization are usually not meaningful.

The string methods `.upper()`, `.lower()`, and `.strip()` are useful for removing such differences.
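As a quick illustration of what each method does, here is a small sketch on one sample string (the value of `raw` is invented for this example). The last line, which collapses runs of internal whitespace with `split()` and `join()`, goes beyond the three methods named above and is included only as one possible extra step.

```python
raw = "  Thompson,   Emily \n"

stripped = raw.strip()             # removes leading and trailing whitespace only
uppered = raw.upper()              # changes case, leaves whitespace alone
lowered = raw.lower()
collapsed = " ".join(raw.split())  # also squeezes internal runs of whitespace

for value in (stripped, uppered, lowered, collapsed):
    print(repr(value))
```

Printing with `repr()` makes the remaining spaces and newlines visible.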

If we wanted to make all of these names the same, we could write a function to clean up capitalization and spacing differences:

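    def clean(s):
        # Upper-case, trim the ends, and split "LAST, FIRST" at the comma.
        # Assumes the name contains at least one comma.
        fields = s.upper().strip().split(',')
        # Trim each field and rejoin with a single ", " separator.
        return fields[0].strip() + ", " + fields[1].strip()

    for name in names:
        print(clean(name))

This prints:

    THOMPSON, EMILY
    THOMPSON, EMILY
    THOMPSON, EMILY
    THOMPSON, EMILY
    THOMPSON, EMILY A.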


As a first step, we should explore our data to find what kinds of extraneous encoding differences are present, and then make a copy of the data that has been "cleaned" of the differences that we were able to identify.
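One way to do this exploration is to group the raw values by their canonical form, which shows which variants collapse together while leaving the original data untouched. The sketch below is only an illustration: the `canonical` function and the variable names are assumptions made for this example, not a prescribed method.

```python
from collections import defaultdict

def canonical(name):
    # Hypothetical canonical form: upper case, single spaces,
    # and exactly one space after each comma.
    collapsed = " ".join(name.upper().split())
    return ", ".join(part.strip() for part in collapsed.split(","))

raw_names = ["THOMPSON, EMILY",
             "THOMPSON,EMILY",
             "THOMPSON,   EMILY ",
             "Thompson, Emily ",
             "Thompson, Emily A."]

# Map each canonical form to the raw variants that produce it.
variants = defaultdict(list)
for raw in raw_names:
    variants[canonical(raw)].append(raw)

for form, raws in variants.items():
    print(form, "<-", raws)

# The cleaned copy is a new list; raw_names is not modified.
cleaned_names = [canonical(raw) for raw in raw_names]
```

Keeping the raw list alongside the cleaned copy makes it possible to revisit the cleaning rules later if a difference turns out to matter.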

Collecting unusual (or pathological) examples of data as you find them helps you make your cleaning code more robust.
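As a sketch of how such a collection might be used, the snippet below runs a `clean` function like the one above over a few awkward inputs and records which ones fail. The entries in `pathological` are invented for illustration; a real collection would come from the data itself.

```python
def clean(s):
    # Same logic as the cleaning function discussed earlier.
    fields = s.upper().strip().split(',')
    return fields[0].strip() + ", " + fields[1].strip()

# Hypothetical examples of input that does not fit the "LAST, FIRST" pattern.
pathological = [
    "Thompson",               # no comma at all
    "",                       # empty field
    "Thompson, Emily, Jr.",   # more than one comma
    "O'Brien ,  Sean\n",      # space before the comma, trailing newline
]

failures = []
for example in pathological:
    try:
        print(repr(example), "->", repr(clean(example)))
    except IndexError as err:
        failures.append(example)
        print(repr(example), "-> failed:", err)
```

Here the first two examples raise an `IndexError`, and the third is cleaned without error but silently loses the "Jr." suffix, which is the kind of problem that is easy to miss without a test case.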