_This is one of the steps in the Truss interview process. If you've
stumbled upon this repository and are interested in a career with
Truss, [check out our jobs page](https://truss.works/jobs)._

# Truss Software Engineering Interview

## Introduction and expectations

Hi there! Please complete the problem described below to the best of
your ability, using the tools you're most comfortable with. Assume
you're sending your submission in for code review from peers;
we'll be talking about your submission in your interview in that
context.

We expect this to take less than 4 hours of actual coding time. Please
submit a working but incomplete solution instead of spending more time
on it. We're also aware that getting after-hours coding time can be
challenging; we'd like a submission within a week and if you need more
time please let us know.

If you have any questions, please contact hiring@truss.works; we're
happy to help if you're not sure what we're asking for or if you have
questions.

## How to submit your response

Please send hiring@truss.works a link to a public git repository
(Github is fine) that contains your code and a README.md that tells us
how to build and run it. Your code will be run on either macOS 10.13
or Ubuntu 16.04 LTS, your choice.

## The problem: CSV normalization

Please write a tool that reads a CSV formatted file on `stdin` and
emits a normalized CSV formatted file on `stdout`. Normalized, in this
case, means:

* The entire CSV is in the UTF-8 character set.
* The Timestamp column should be formatted in ISO-8601 format.
* The Timestamp column should be assumed to be in US/Pacific time;
  please convert it to US/Eastern.
* All ZIP codes should be formatted as 5 digits. If there are less
  than 5 digits, assume 0 as the prefix.
* All name columns should be converted to uppercase. There will be
  non-English names.
* The Address column should be passed through as is, except for
  Unicode validation. Please note there are commas in the Address
  field; your CSV parsing will need to take that into account. Commas
  will only be present inside a quoted string.
* The columns `FooDuration` and `BarDuration` are in HH:MM:SS.MS
  format (where MS is milliseconds); please convert them to a floating
  point seconds format.
* The column "TotalDuration" is filled with garbage data. For each
  row, please replace the value of TotalDuration with the sum of
  FooDuration and BarDuration.
* The column "Notes" is free form text input by end-users; please do
  not perform any transformations on this column. If there are invalid
  UTF-8 characters, please replace them with the Unicode Replacement
  Character.

You can assume that the input document is in UTF-8 and that any times
that are missing timezone information are in US/Pacific. If a
character is invalid, please replace it with the Unicode Replacement
Character. If that replacement makes data invalid (for example,
because it turns a date field into something unparseable), print a
warning to `stderr` and drop the row from your output.

You can assume that the sample data we provide will contain all date
and time format variants you will need to handle.



In [498]:
from datetime import datetime, timedelta
import pandas as pd

In [495]:
def change_encoding(data, cols):
    for col in cols:
        try:
            data[col] = data[col].encode('raw_unicode_escape').decode('utf-8')
        except UnicodeDecodeError:
            data[col] = data[col].encode('utf-8')
    return data 

def convert_time(row, num_hours):
    new_time = pd.to_datetime(row['Timestamp']) - timedelta(hours=num_hours)
    return new_time.isoformat()

def fix_zip(row, norm_len=5):
    word = str(row.ZIP)
    if len(word) < 5:
        new_zip = word.zfill(norm_len)
        return new_zip
    return str(row.ZIP)

def time_diff(row):
    try:
        foo = datetime.datetime.strptime(row['FooDuration'], "%H:%M:%S.%f")
        foo_elements = {'hours': foo.hour, 'minutes': foo.minute, 'seconds': foo.second,
                    'microseconds': foo.microsecond}
        foo_td = timedelta(**foo_elements)
    except ValueError:
        hours, rest = row['FooDuration'].split(':', 1)
        r = datetime.datetime.strptime(rest, "%M:%S.%f")
        r_elem = {'hours':  r.hour, 'minutes':  r.minute, 'seconds':  r.second, 'microseconds': r.microsecond}
        foo_td = timedelta(hours=int(hours)) + timedelta(**r_elem)
    try:
        bar = datetime.datetime.strptime(row['BarDuration'], "%H:%M:%S.%f")
        bar_elements = {'hours':  bar.hour, 'minutes':  bar.minute, 'seconds':  bar.second,
                        'microseconds': bar.microsecond}
        bar_td = timedelta(**bar_elements)
    except ValueError:
        hours, rest = row['BarDuration'].split(':', 1)
        r = datetime.datetime.strptime(rest, "%M:%S.%f")
        r_elem = {'hours':  r.hour, 'minutes':  r.minute, 'seconds':  r.second, 'microseconds': r.microsecond}
        bar_td = timedelta(hours=int(hours)) + timedelta(**r_elem)
#     return str(foo_td + bar_td)
    return (foo_td + bar_td).total_seconds()

def note_fixer(row):
    try:
        return row.encode('utf-8').decode('utf8', 'replace')
    except AttributeError:
        return row.decode('utf8', 'replace').encode('raw_unicode_escape').decode('utf-8', 'replace')

In [168]:
sample = pd.read_csv('sample.csv')

In [169]:
sample.head()

Unnamed: 0,Timestamp,Address,ZIP,FullName,FooDuration,BarDuration,TotalDuration,Notes
0,4/1/11 11:00:00 AM,"123 4th St, Anywhere, AA",94121,Monkey Alberto,1:23:32.123,1:32:33.123,zzsasdfa,I am the very model of a modern major general
1,3/12/14 12:00:00 AM,"Somewhere Else, In Another Time, BB",1,Superman übertan,111:23:32.123,1:32:33.123,zzsasdfa,This is some Unicode right here. ü ¡! 😀
2,2/29/16 12:11:11 PM,111 Ste. #123123123,1101,Résumé Ron,31:23:32.123,1:32:33.123,zzsasdfa,🏳️🏴🏳️🏴
3,1/1/11 12:00:01 AM,"This Is Not An Address, BusyTown, BT",94121,Mary 1,1:23:32.123,0:00:00.000,zzsasdfa,I like Emoji! 🍏🍎😍
4,12/31/16 11:59:59 PM,"123 Gangnam Style Lives Here, Gangnam Town",31403,Anticipation of Unicode Failure,1:23:32.123,1:32:33.123,zzsasdfa,I like Math Symbols! ≱≰⨌⊚


In [460]:
broken_utf8 = pd.read_csv('sample-with-broken-utf8.csv', encoding='ISO-8859-1')

In [371]:
text_cols = ['Address', 'FullName', 'Notes']

In [461]:
broken_utf8.fillna('', inplace=True)

In [463]:
broken_utf8[text_cols] = broken_utf8[text_cols].apply(lambda x: change_encoding(x,text_cols), axis=1)

In [464]:
broken_utf8['Timestamp'] = broken_utf8.apply(lambda x: convert_time(x, 3), axis=1)

In [465]:
broken_utf8['ZIP'] = broken_utf8.apply(lambda x: fix_zip(x), axis=1)

In [466]:
broken_utf8['FullName'] = broken_utf8['FullName'].apply(lambda x: x.upper())

In [367]:
# def duration_to_float(row):
#     try:
#         time = datetime.datetime.strptime(row, "%H:%M:%S.%f")
#     except ValueError:
#         hours, rest = row.split(':', 1)
#         r = datetime.datetime.strptime(rest, "%M:%S.%f")
#         r_elem = {'hours':  r.hour, 'minutes':  r.minute, 'seconds':  r.second, 'microseconds': r.microsecond}
#         time = timedelta(hours=int(hours)) + timedelta(**r_elem)
#         print(type(time))
#     return datetime.datetime.strftime(time, "%H:%M:%S.%f")

In [363]:
# broken_utf8['FooDuration'] = broken_utf8['FooDuration'].apply(lambda x: duration_to_float(x))
# broken_utf8['BarDuration'] = broken_utf8['BarDuration'].apply(lambda x: duration_to_float(x))

In [467]:
broken_utf8['TotalDuration'] = broken_utf8.apply(lambda x: time_diff(x), axis=1)

In [496]:
broken_utf8['Notes'] = broken_utf8['Notes'].apply(lambda x: note_fixer(x))

In [497]:
broken_utf8.head()

Unnamed: 0,Timestamp,Address,ZIP,FullName,FooDuration,BarDuration,TotalDuration,Notes
0,2011-04-01T08:00:00,"123 4th St, Anywhere, AA",94121,MONKEY ALBERTO,1:23:32.123,1:32:33.123,10565.246,I am the very model of a modern major general
1,2014-03-11T21:00:00,"Somewhere Else, In Another Time, BB",1,SUPERMAN ÜBERTAN,111:23:32.123,1:32:33.123,406565.246,This is some Unicode right h�xxx ü ¡! 😀
2,2016-02-29T09:11:11,111 Ste. #123123123,1101,RÉSUMÉ RON,31:23:32.123,1:32:33.123,118565.246,🏳️🏴🏳️🏴
3,2010-12-31T21:00:01,"This Is Not An Address, BusyTown, BT",94121,MARY 1,1:23:32.123,0:00:00.000,5012.123,I like Emoji! 🍏🍎😍
4,2016-12-31T20:59:59,"123 Gangnam Style Lives Here, Gangnam Town",31403,ANTICIPATION OF UNICODE FAILURE,1:23:32.123,1:32:33.123,10565.246,I like Math Symbols! ≱≰⨌⊚


In [502]:
broken_utf8.to_csv('test.csv', index=False)