# Data wrangling


Data often has to be reshaped before it can be used. Here is a detailed example: we start with a statement of the kind someone might send in an email, and we want the total credits and debits for the month.

Here's the input: 

![Sample Statement](stmt.jpg)

After downloading and some offline editing, we obtain this statement:

In [4]:
stmt = """
04.02.19| SI10072| Sale| $250| | $250 
04.11.19| SI10073| Sale| $400| | $400 
04.15.19| DC18 | Payment| | $100 | -$100
04.22.19| CR20020| Credit| -$50| | $50
"""
stmt

'\n04.02.19| SI10072| Sale| $250| | $250 \n04.11.19| SI10073| Sale| $400| | $400 \n04.15.19| DC18 | Payment| | $100 | -$100\n04.22.19| CR20020| Credit| -$50| | $50\n'

Our next step is to split the data up into chunks we can handle. We start by splitting it into lines. The line break character is '\n'. 

In [5]:
# splitting at \n converts this to a list of lines. 
lines = stmt.split('\n')
lines

['',
 '04.02.19| SI10072| Sale| $250| | $250 ',
 '04.11.19| SI10073| Sale| $400| | $400 ',
 '04.15.19| DC18 | Payment| | $100 | -$100',
 '04.22.19| CR20020| Credit| -$50| | $50',
 '']

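Our next step is to skip the blank lines and split each remaining line into columns. The columns are separated by "|" characters, so we split at "|". Each debit or credit cell then has to be converted to a number: the helper `add_num` strips spaces and dollar signs, and appends the value to a list if the cell holds a number.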
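To make the splitting step concrete, here is a minimal sketch applied to a single line of the statement. The names `sample_line` and `fields` are illustrative and not part of the original notebook, and the call to `strip()` is an extra tidying step that the notebook itself does not perform.

```python
# One line of the statement, copied from the `lines` list above.
sample_line = '04.15.19| DC18 | Payment| | $100 | -$100'

# Splitting at '|' yields one string per column; strip() drops the padding.
fields = [field.strip() for field in sample_line.split('|')]
print(fields)

# A blank line contains no '|', so it splits into a single empty string.
blank = ''.split('|')
print(blank)
```

The first `print` shows six columns, with an empty string where the debit cell is blank. The second shows that a blank line does not produce six columns, which is what lets us tell blank lines apart from data lines.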

In [17]:
def add_num(list_, string):
    # remove spaces and dollar signs
    for char in string:
        if char in " $":
            string = string.replace(char,'')
    try:
        num = float(string)
        list_.append(num)
    except ValueError:
        # empty or non-numeric cells are skipped
        pass
    # Uncomment this line when debugging this function
    # print (list_)
    return list_

# >>For testing this function we wrote
if True: # Change True to False once we know it works!
    credits = []
    credit = '   $500.01  '
    add_num(credits, '$500')
    print (credits)
    add_num(credits, credit)
    print (credits)
# << end of testing; we can delete or comment out these lines once done

[500.0]
[500.0, 500.01]


In [19]:
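credits = []
debits = []
for line in lines:
    fields = line.split("|")  # splitting at '|' separates the columns
    if len(fields) == 6:      # blank lines do not have six columns
        add_num(debits, fields[3])
        add_num(credits, fields[4])
(credits, debits)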

([100.0], [250.0, 400.0, -50.0])

Finally, we sum up the credits and debits. Summing a list follows a common accumulator pattern: start a total at zero and add each item in turn.

In [8]:
total_credits = 0
for c in credits: 
    total_credits += c
total_debits = 0
for c in debits: 
    total_debits += c
(total_credits, total_debits)

(100.0, 600.0)

Or we could remember that Python has a built-in `sum` function and write:

In [9]:
total_credits = sum(credits)
total_debits = sum(debits)
(total_credits, total_debits)

(100.0, 600.0)

# Some basic observations
1. Each transformation depends on the output of the previous one.
2. I printed the result of each transformation to ensure that I was doing things correctly.
3. The cells are written in the order in which they should be executed.
4. One can thus check visually, step by step, whether everything is working correctly.
5. Put another way, the postconditions of each step are at least strong enough to serve as the preconditions of the next step.
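The last observation can be made explicit with assertions. The following is only a sketch, using a shortened copy of the statement and a hypothetical variable `rows` that does not appear in the notebook:

```python
# A shortened copy of the statement, for illustration only.
stmt = """
04.02.19| SI10072| Sale| $250| | $250
04.15.19| DC18 | Payment| | $100 | -$100
"""

# Step 1: split into lines.
lines = stmt.split('\n')
# Postcondition of step 1: a list of strings, possibly including blank ones.
assert all(isinstance(line, str) for line in lines)

# Step 2: drop blank lines and split each remaining line into columns.
rows = [line.split('|') for line in lines if line.strip()]
# Postcondition of step 2, and precondition of any later step that
# picks out the debit and credit columns: every row has six columns.
assert all(len(row) == 6 for row in rows)

print(len(lines), len(rows))
```

If a later change to the statement format broke one of these assumptions, the failing `assert` would point at the step whose output no longer satisfies what the next step needs.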

# Some exercises

1. We wrote the function `add_num` above. Restructure the code as two classes: `Statement_Line`, with an appropriate `__init__` method, and `Statement`, with an `__init__` that takes the variable `stmt` as input and offers methods `credits` and `debits` that return the sums worked out above.

In [23]:
string = "fgxclkjdzfgon fdghsfdgs $$ %% foo"
for char in string:
    if char in " $":
        print (string)
        string = string.replace(char,'')
        print (string)
        print ("----------")

fgxclkjdzfgon fdghsfdgs $$ %% foo
fgxclkjdzfgonfdghsfdgs$$%%foo
----------
fgxclkjdzfgonfdghsfdgs$$%%foo
fgxclkjdzfgonfdghsfdgs$$%%foo
----------
fgxclkjdzfgonfdghsfdgs$$%%foo
fgxclkjdzfgonfdghsfdgs%%foo
----------
fgxclkjdzfgonfdghsfdgs%%foo
fgxclkjdzfgonfdghsfdgs%%foo
----------
fgxclkjdzfgonfdghsfdgs%%foo
fgxclkjdzfgonfdghsfdgs%%foo
----------
fgxclkjdzfgonfdghsfdgs%%foo
fgxclkjdzfgonfdghsfdgs%%foo
----------


Why did the body of the `if` run six times when only two of those passes changed the string, one removing `' '` and one removing `'$'`? What would you do to fix it?
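One possible answer, offered as a sketch rather than the only fix: `str.replace` already removes every occurrence of a character in one call, so it is enough to loop over the characters to remove instead of over the string being cleaned.

```python
string = "fgxclkjdzfgon fdghsfdgs $$ %% foo"

# Loop over the two characters to remove, not over the string itself.
# Each replace() call removes all occurrences of that character at once.
for char in " $":
    string = string.replace(char, '')

print(string)
```

This makes exactly two passes, and it also avoids reassigning `string` while a loop is iterating over its original value.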

In [34]:
class Statement():
    def __init__(self, stmt):
        def add_num(list_, string):
            # remove spaces and dollar signs
            for char in string:
                if char in " $":
                    string = string.replace(char,'')
            try:
                num = float(string)
                list_.append(num)
            except ValueError:
                pass
            # Comment out this line when done debugging this function
            # print (list_)
            return list_

        lines = stmt.split('\n')
        self.credits_ = []
        self.debits_ = []
        for l in lines: 
            # splitting at '|' separates the columns.
            try:
                item, desc, detail, debit, credit, amount = l.split("|")
                self.credits_ = add_num(self.credits_, credit)
                self.debits_ = add_num(self.debits_, debit)
            except ValueError:  # wrong number of columns, e.g. a blank line
                continue

    def credits(self):
        return sum(self.credits_)

    def debits(self):
        return sum(self.debits_)


class Statement_Line():
    def __init__(self, line):
        item, desc, detail, debit, credit, amount = line.split("|")
        self.credit = ...  # ? (left as an exercise)
        self.debit = ...  # ? (left as an exercise)

In [35]:
april_statement = Statement(stmt)
print (april_statement.credits())
print (april_statement.debits())

100.0
600.0


## What if the line format had been different? 

2. We never finished the class `Statement_Line` above. The point of having one class for lines and another, `Statement`, for statements (which are collections of lines) is that only `Statement_Line` has to change if the format of a line changes.

3. **Edit your classes** so as to completely drop the last column `Amount Due`. Edit the `stmt` variable accordingly and get it to behave properly!

4. This idiom of having separate classes, one representing items and another representing the collection of those items, is pervasive in libraries!
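As an illustration of the idiom, here is one possible shape for the two classes. It is a sketch under stated assumptions, not the intended solution: the helper `_to_number`, the choice to count empty cells as zero, and the decision to raise `ValueError` on a malformed line are all choices made here for illustration.

```python
class Statement_Line:
    """One row of the statement; the only class that knows the column layout."""

    def __init__(self, line):
        # Assumed layout: date | reference | type | debit | credit | amount due
        fields = [field.strip() for field in line.split('|')]
        if len(fields) != 6:
            raise ValueError('expected 6 columns, got %d' % len(fields))
        self.debit = self._to_number(fields[3])
        self.credit = self._to_number(fields[4])

    @staticmethod
    def _to_number(text):
        # Hypothetical helper: drop '$' and spaces; empty cells count as zero.
        cleaned = text.replace('$', '').replace(' ', '')
        return float(cleaned) if cleaned else 0.0


class Statement:
    """A collection of Statement_Line objects."""

    def __init__(self, stmt):
        # Blank lines are skipped; every other line becomes a Statement_Line.
        self.lines = [Statement_Line(l) for l in stmt.split('\n') if l.strip()]

    def credits(self):
        return sum(line.credit for line in self.lines)

    def debits(self):
        return sum(line.debit for line in self.lines)


stmt = """
04.02.19| SI10072| Sale| $250| | $250
04.11.19| SI10073| Sale| $400| | $400
04.15.19| DC18 | Payment| | $100 | -$100
04.22.19| CR20020| Credit| -$50| | $50
"""
april_statement = Statement(stmt)
print(april_statement.credits(), april_statement.debits())
```

With this split, dropping the `Amount Due` column would only require changing the column count and indices inside `Statement_Line`; `Statement` would stay as it is.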