# Building Realistic Example Data
This __optional bonus assignment__ is worth up to 5% of your final grade. Hand it in by emailing it directly to the instructor (Christopher Brooks, brooksch@umich.edu) no later than Sunday September 27th at 11:59pm EST. This is not an all-or-nothing assignment; partial grades will be awarded as appropriate.

## Assignment Overview
A constant need I have when teaching pandas is finding compelling example data to work from. Good example data is real-world, messy enough to need some manipulation, and fits reasonable constraints for a given problem. For instance, if I want to demonstrate joining multiple `DataFrame` objects together I might want one which is about people and one which is about purchases, where every person has an identifier and a bunch of personal information, and every purchase is linked to a given person. This is much more compelling than a bunch of random `np.ndarray` lists that I create inline while trying to give a lecture!

In addition, I'm taken by domain-specific languages, and in this part of the assignment you are required to build a processor for a simple domain-specific language I have invented for the purpose of describing pandas `DataFrame` structures! You are expected to demonstrate your knowledge of regex here in particular.

Here's an example of the language I've created for this part of the assignment:
```
persons
-------
first_name
last_name*
phone_number
random_number(5) as customer_number [1]*

purchases
---------
isbn10
credit_card_full
random_number(3) as price
random_number(5) as customer_number [1]
```

In this example I describe two `DataFrame` objects by underlining a string with two or more hyphens. The string (`persons` and `purchases`) should be used as the variable name for the `DataFrame` objects created, and the language will always separate multiple `DataFrame` definitions by whitespace and hyphens as shown. Each column in the `DataFrame` is described on its own line with a string (e.g. `first_name*`). The string defines the column as follows:

1. The first word (e.g. `first_name` or `random_number(3)`) describes a function and an optional set of parameters to be called against a common `faker` object (an instance of the `Faker` class) for each entry in the resulting `DataFrame`. For instance, a value of `isbn10` implies that `faker.isbn10()` be called (note the default parameters), while a value of `random_number(5)` implies that `faker.random_number(5)` be called (with my supplied parameters).
2. The first word *may* be followed by some whitespace and then an `as` statement. The `as` statement denotes that the following word be used as the name of the column. For instance, `random_number(3) as price` means that I'm looking for a column named `price` where every value in the column comes from a `faker.random_number(3)` invocation.
3. If there is no `as` statement the name of the column should be the name of the function (with no parameters) I supplied, e.g. `first_name`.
4. The definition may include a reference in the format `[#]` where the # sign is any number. This reference will be used across tables to show that the data in those tables should be of a similar set of values. This is so I can join between tables, which is a common need to demonstrate. In the example given I want each of the tables to have a column called `customer_number` where the data values in the column are such that `set(column1)==set(column2)`. Note that this doesn't mean the columns should be the same (they shouldn't), but just that they should include only 100% overlapping data. See point 6 for data distribution. **Clarification: In the example here, all of the `customer_number` values in `persons` should be unique (hence the `*`), but in the referencing column `purchases[customer_number]` the uniqueness is relaxed. This means the mapping does not have to be 1:1 as implied above; it is 1:many. Since `purchases` has repeated data, you could verify this with `set(purchases['customer_number']).issubset(set(persons['customer_number']))`.**
5. The definition may end with a `*`. This indicates that the column described should be made up of unique data (no repeated elements). For instance, you wouldn't want a customer to be accidentally assigned the customer number of another person! In the example above I decided I wanted the `last_name` of the persons and their `customer_number` to be unique in that table.
6. By default a column should have 25% repeated data, e.g. something close to `len(column)==len(set(column))*1.25`. This lets me demonstrate operations such as left joins easily.
7. By default, the length of each `DataFrame` created should be 99 items. This is both reasonable for most demonstrations, and a homage to The Great One.
8. The functionality described should be executed in a cell magic function called `%%fakedata`, where the remainder of the cell is the definition in plain text. See https://ipython.readthedocs.io/en/stable/config/custommagics.html for more details.
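
Rules 5 through 7 interact, so it can help to sketch the column-filling step on its own before wiring in `faker`. The following is a minimal sketch, not a reference solution: `build_column` and its parameters are hypothetical names, and `generate` stands in for any zero-argument callable (such as a bound `faker` method). It assumes the generator can actually produce enough distinct values.

```python
import random


def build_column(generate, length=99, unique=False, max_tries=10_000):
    """Return `length` values produced by `generate`, a zero-argument callable.

    unique=True  -> every value is distinct (rule 5).
    unique=False -> about length / 1.25 distinct values, padded with repeats (rule 6).
    """
    target = length if unique else int(length / 1.25)
    distinct = []
    seen = set()
    tries = 0
    while len(distinct) < target:
        tries += 1
        if tries > max_tries:
            raise ValueError("generator did not yield enough distinct values")
        value = generate()
        if value not in seen:
            seen.add(value)
            distinct.append(value)
    # Pad with repeats drawn from the values already chosen, then shuffle
    # so the repeats are not all clustered at the end of the column.
    column = distinct + random.choices(distinct, k=length - target)
    random.shuffle(column)
    return column


# Stand-in generators for illustration only.
customer_numbers = build_column(lambda: random.randrange(100_000), unique=True)
prices = build_column(lambda: random.randrange(1_000))
```

With the default length of 99, a non-unique column ends up with 79 distinct values and 20 repeats, which is close to the `len(column)==len(set(column))*1.25` target in rule 6.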

## An attempt at a more formal grammar

```
function_to_call  ::= <wordcharacters>
parameters        ::= "" | "(" ( <wordcharacters> | <number> ) ")"
as_name           ::= "" | "as" <whitespace> <wordcharacters>
column_name       ::= <as_name> | <function_to_call>
reference         ::= "" | "[" <number> "]"
unique_mark       ::= "" | "*"
column_definition ::= <function_to_call> <parameters> <whitespace> \
                      <as_name> <whitespace> <reference> <unique_mark>
df_sep            ::= "--" ("-"*)
df_definition     ::= <wordcharacters> <newline> <df_sep> <newline> \
                      (<column_definition>*) <newline> <newline>
language_spec     ::= <df_definition>*
```
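
One way to turn the grammar into code is a pair of regular expressions: one for the `df_sep` line and one for a `column_definition`. This is a sketch under assumptions rather than the expected solution. The names `COLUMN_RE`, `SEPARATOR_RE`, `parse_column`, and `parse_spec` are made up for illustration, and I assume that whitespace between parts is optional when a part is absent and that a parameter is a single token.

```python
import re

# One column_definition: function, optional "(param)", optional "as name",
# optional "[number]" reference, optional trailing "*".
COLUMN_RE = re.compile(
    r"^(?P<function>\w+)"
    r"(?:\((?P<parameter>\w*)\))?"
    r"(?:\s+as\s+(?P<as_name>\w+))?"
    r"(?:\s*\[(?P<reference>\d+)\])?"
    r"\s*(?P<unique>\*)?\s*$"
)
SEPARATOR_RE = re.compile(r"^-{2,}\s*$")  # df_sep: two or more hyphens


def parse_column(text):
    """Parse one column line into a dict, or return None if it does not match."""
    match = COLUMN_RE.match(text.strip())
    if match is None:
        return None
    parts = match.groupdict()
    return {
        "function": parts["function"],
        "parameter": parts["parameter"] or None,
        # Rule 3: without an `as` clause the column is named after the function.
        "name": parts["as_name"] or parts["function"],
        "reference": int(parts["reference"]) if parts["reference"] else None,
        "unique": parts["unique"] is not None,
    }


def parse_spec(text):
    """Map each DataFrame name to its list of parsed column definitions."""
    tables = {}
    lines = text.splitlines()
    current = None
    for index, line in enumerate(lines):
        stripped = line.strip()
        if not stripped:
            current = None                  # a blank line ends a definition
        elif SEPARATOR_RE.match(stripped):
            continue                        # the name was handled on the previous line
        elif index + 1 < len(lines) and SEPARATOR_RE.match(lines[index + 1].strip()):
            current = stripped              # a line underlined by hyphens is a name
            tables[current] = []
        elif current is not None:
            column = parse_column(stripped)
            if column is not None:
                tables[current].append(column)
    return tables


spec = """persons
-------
first_name
last_name*
random_number(5) as customer_number [1]*

purchases
---------
isbn10
random_number(5) as customer_number [1]
"""
tables = parse_spec(spec)
```

Keeping parsing separate from data generation means the regex can be tested against plain strings without needing `faker` or pandas at all.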


## Background: What is this `Faker` class?
The `Faker` class defines a number of great functions that generate realistic data. You create a new instance of `Faker` with no parameters, then call methods on that object; the available methods are documented at https://faker.readthedocs.io/en/stable/

The cell below demonstrates this by building a single-row `DataFrame` that follows the `persons` description above.

In [1]:
import pandas as pd
from faker import Faker

fake = Faker()
person = pd.DataFrame( [{"first_name": fake.first_name(),
                        "last_name": fake.last_name(),
                        "phone_number": fake.phone_number(),
                        "customer_number": fake.random_number(5)}])
person

|   | first_name | last_name | phone_number | customer_number |
|---|------------|-----------|--------------|-----------------|
| 0 | Ashley | Fowler | 625.031.2882x91917 | 7187 |
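
Since the language names generator functions as plain words, a processor has to look methods up by name at runtime. A minimal sketch of that lookup with `getattr` follows. `StubFaker` is a hypothetical stand-in so the snippet runs without the `faker` package installed, and `call_by_name` is an illustrative helper name; with the real library you would pass a `Faker()` instance as `provider` instead.

```python
import random


class StubFaker:
    """Hypothetical stand-in for a Faker instance (illustration only)."""

    def first_name(self):
        return random.choice(["Ashley", "Jordan", "Morgan"])

    def random_number(self, digits):
        return random.randrange(10 ** digits)


def call_by_name(provider, function_name, parameter=None):
    """Look up `function_name` on `provider` and call it, passing `parameter` if given."""
    method = getattr(provider, function_name, None)
    if not callable(method):
        raise ValueError(f"unknown generator: {function_name!r}")
    if parameter is None:
        return method()
    # Numeric parameters such as the 5 in random_number(5) arrive as text.
    argument = int(parameter) if str(parameter).isdigit() else parameter
    return method(argument)


stub = StubFaker()
name = call_by_name(stub, "first_name")
number = call_by_name(stub, "random_number", "3")
```

Rejecting unknown names up front gives a clearer error message than the `AttributeError` a bare `getattr` would raise for a misspelled word in the definition.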


## Some example test cases
There is no autograder for this assignment, but it might be useful to see some example test cases, so imagine that I was going to run your code using the following cells. My assumption is that your code to define, load, and run the magic function goes in one cell at the top of the notebook.

In [2]:
from IPython.core.magic import (register_line_magic, register_cell_magic,
                                register_line_cell_magic)

@register_line_magic
def lmagic(line):
    "my line magic"
    return line*2

@register_cell_magic
def cmagic(line,cell):
    "my cell magic"
    global cell_result
    cell_result = line,cell
    return line, cell

@register_line_cell_magic
def lcmagic(line, cell=None):
    'Magic that works both as %lcmagic and as %%lcmagic'
    if cell is None:
        print("called as line magic")
        return line
    else:
        print('called as cell magic')
        return line, cell

In [3]:
from IPython.core.magic import (register_cell_magic, needs_local_scope)

@register_cell_magic
@needs_local_scope
def fakedata(line, cell, local_ns=None):
    import random
    import re
    import pandas as pd
    from faker import Faker

    fake = Faker()

    # function name, optional "(param)", optional "as <name>", optional "[ref]", optional "*"
    reg = (r'^(?P<func>\w+)(\((?P<param>\w*)\))?'
           r'(\s+as\s+(?P<var_name>\w+))?'
           r'(\s*\[(?P<table_group>\d+)\])?\s*(?P<unique>\*)?\s*$')

    rows_n = 99                       # final length of every DataFrame
    distinct_n = int(rows_n / 1.25)   # 79 distinct values gives roughly 25% repeats

    refs = {}      # reference number -> values shared across tables
    columns = {}   # df name -> {column name: list of values}
    current = None

    define_df_name = True
    for row in cell.split('\n'):
        print('got line: ' + row)

        # a run of two or more hyphens ends the df name
        if re.match(r'-{2,}\s*$', row):
            define_df_name = False
            print('finished define df name')

        # a blank line ends the current df definition
        elif row.strip() == '':
            define_df_name = True

        # define the df name
        elif define_df_name:
            print('defining df name as: ' + row)
            current = row.strip()
            columns[current] = {}

        # define contents of df
        else:
            re_search = re.match(reg, row.strip())
            if re_search is None:
                continue
            d = re_search.groupdict()
            name = d['var_name'] or d['func']
            group = d['table_group']

            if group in refs and not d['unique']:
                # referencing column: draw (with repeats) from the referenced values
                values = random.choices(refs[group], k=rows_n)
            else:
                p = d['param']
                args = [int(p) if p.isdigit() else p] if p else []
                # fake.unique refuses to return the same value twice
                source = fake.unique if d['unique'] else fake
                n = rows_n if d['unique'] else distinct_n
                values = [getattr(source, d['func'])(*args) for _ in range(n)]
                # pad non-unique columns to full length by repeating existing values
                values += random.choices(values, k=rows_n - n)
                if group is not None:
                    refs[group] = values
            columns[current][name] = values

    # publish each DataFrame under its name in the caller's namespace
    for df_name, data in columns.items():
        local_ns[df_name] = pd.DataFrame(data)
    


In [4]:
%%fakedata
persons
-------
first_name
last_name*
phone_number
random_number(5) as customer_number [1]*

purchases
---------
isbn10
credit_card_full
random_number(3) as price
random_number(5) as customer_number [1]

got line: persons
defining df name as: persons
got line: -------
finished define df name
got line: first_name
got line: last_name*
got line: phone_number
got line: random_number(5) as customer_number [1]*
got line: 
got line: purchases
defining df name as: purchases
got line: ---------
finished define df name
got line: isbn10
got line: credit_card_full
got line: random_number(3) as price
got line: random_number(5) as customer_number [1]
got line: 


In [5]:
assert ('persons' in locals()), "You should automatically set the persons and purchases objects"
assert ('purchases' in locals()), "You should automatically set the persons and purchases objects"
assert (type(persons)==pd.DataFrame), "You should be setting persons and purchases to be DataFrame objects"
assert (len(persons)==99), "All Hail the Great One!"
assert (set(purchases['customer_number']).issubset(set(persons['customer_number']))), "Check the clarification in the description carefully"
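
The assertions above do not exercise rules 5 and 6. A few extra checks along the following lines might be useful when testing your own solution. This is only a sketch: `repeat_ratio` and `is_unique` are hypothetical helpers, the sample lists are made-up data standing in for `DataFrame` columns (with pandas you could pass e.g. `persons['customer_number']`), and how much tolerance to accept around 1.25 is a judgment call.

```python
def repeat_ratio(column):
    """len(column) / len(set(column)); rule 6 asks for roughly 1.25."""
    values = list(column)
    return len(values) / len(set(values))


def is_unique(column):
    """True when no value repeats, as a trailing `*` requires (rule 5)."""
    values = list(column)
    return len(values) == len(set(values))


# Made-up sample data standing in for DataFrame columns.
person_ids = list(range(1000, 1099))               # 99 distinct values
purchase_ids = person_ids[:79] + person_ids[:20]   # 79 distinct values, 20 repeats

print(is_unique(person_ids))
print(round(repeat_ratio(purchase_ids), 2))
print(set(purchase_ids).issubset(person_ids))
```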