# Handling Data and Writing Functions in Python
**by [Jason DeBacker](http://jasondebacker.com), August 2017**

This Jupyter Notebook introduces students to reading and saving data in Python, as well as writing and calling functions.

## Reading in Data

There are a number of methods you can use to read in various data sources.  We consider reading and writing data with just three packages, CSV, NumPy, and Pandas, but note that other packages can also read and write data.

We'll work with the microdata from the 2015 [Kauffman Index of Startup Activity](http://www.kauffman.org/kauffman-index/about/kiea-microdata) as our example data.  This file is named `kisa_2015.csv`, and this notebook assumes that the file is in the same directory as the notebook.

## Using CSV to read in data from text files

In [17]:
import csv
# read in data with csv.reader and put it into a list of rows
# (the with block closes the file once the rows have been read)
with open('kisa_2015.csv', newline='') as f:
    kisa_data = list(csv.reader(f))
# Take a look at the data (may be large, so just look at first 5 elements in the list)
kisa_data[:5]

[['month',
  'grdatn',
  'marstat',
  'age',
  'class',
  'region',
  'state',
  'hours',
  'mlr',
  'natvty',
  'msafp',
  'msastat',
  'faminc',
  'spneth',
  'race',
  'year',
  'class_t1',
  'mlr_t1',
  'wgta',
  'ind2',
  'indmaj2',
  'ind2_t1',
  'indmaj2_t1',
  'pid',
  'yeart1',
  'female',
  'immigr',
  'homeown',
  'hoursu1b',
  'hoursu1b_t1',
  'se15u',
  'se15u_t1',
  'ent015u',
  'ent015ua',
  'vet',
  'wgtat',
  'wgtat1'],
 ['12',
  '42',
  '5',
  '57',
  '4',
  '1',
  '14',
  '40',
  '1',
  '57',
  '49340',
  '2',
  '15',
  '-1',
  '1',
  '2014',
  '4',
  '1',
  '3032.9811953',
  '8190',
  '10',
  '8190',
  '10',
  '15866200851 3111 11912',
  '2015',
  '1',
  '0',
  '',
  '40',
  '40',
  '0',
  '0',
  '0',
  '0',
  '0',
  '269.17244224',
  '270.43382431'],
 ['12',
  '39',
  '7',
  '26',
  '4',
  '1',
  '14',
  '40',
  '1',
  '57',
  '49340',
  '2',
  '15',
  '2',
  '1',
  '2014',
  '4',
  '1',
  '4541.1878663',
  '2170',
  '4',
  '2170',
  '4',
  '15866200851 3111 21912'

In [39]:
# see number of obs
len(kisa_data)

636018

Note that we have a list of lists.  Each list is a row from the data, and every element of each list is a string.  The first list holds the variable names.  We *could* work with the data like this, but it's not ideal: we need to keep track of the index of each variable name.

For example, if we want to find the mean age of those surveyed, we have to note that `age` is the fourth element in the first list (index 3, since Python counts from zero).  To get the mean age we'd then do:

In [25]:
# reference a cell containing age to see what type of object it is
type(kisa_data[1][3])

str

In [43]:
# so we need to turn ages into numeric values when we take the mean
sum_age = sum([float(i) for i in [x[3] for x in kisa_data[1:]]])
obs_age = len(([float(i) for i in [x[3] for x in kisa_data[1:]]]))
mean_age = sum_age/obs_age
mean_age

42.70527831803238
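Rather than counting columns by hand, one option is to look up the position of a variable in the header row. The sketch below is only an illustration: it uses a small in-memory stand-in for the file (the first four columns of the two rows shown above), and the names `sample`, `header`, `records`, and `age_col` are introduced here and are not part of the original notebook.

```python
import csv
import io

# A tiny in-memory stand-in for kisa_2015.csv (illustration only)
sample = io.StringIO("month,grdatn,marstat,age\n12,42,5,57\n12,39,7,26\n")
rows = list(csv.reader(sample))

# Split the variable names from the observations
header, records = rows[0], rows[1:]

# Find the column position of 'age' instead of hard-coding the index 3
age_col = header.index('age')

# Convert the strings to floats before averaging
ages = [float(r[age_col]) for r in records]
mean_age_sample = sum(ages) / len(ages)
print(age_col, mean_age_sample)
```

This still requires manual type conversion, but it no longer breaks if the column order of the file changes.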

Clearly, reading in the csv file in this way is probably not the most intuitive way to handle the data.  Let's try another option with the CSV module.  You can read each row into a dictionary rather than a list.

In [52]:
kisa_data = list(csv.DictReader(open('kisa_2015.csv')))
# look at a slice
kisa_data[:5]

[OrderedDict([('month', '12'),
              ('grdatn', '42'),
              ('marstat', '5'),
              ('age', '57'),
              ('class', '4'),
              ('region', '1'),
              ('state', '14'),
              ('hours', '40'),
              ('mlr', '1'),
              ('natvty', '57'),
              ('msafp', '49340'),
              ('msastat', '2'),
              ('faminc', '15'),
              ('spneth', '-1'),
              ('race', '1'),
              ('year', '2014'),
              ('class_t1', '4'),
              ('mlr_t1', '1'),
              ('wgta', '3032.981195'),
              ('ind2', '8190'),
              ('indmaj2', '10'),
              ('ind2_t1', '8190'),
              ('indmaj2_t1', '10'),
              ('pid', '15866200851 3111 11912'),
              ('yeart1', '2015'),
              ('female', '1'),
              ('immigr', '0'),
              ('homeown', ''),
              ('hoursu1b', '40'),
              ('hoursu1b_t1', '40'),
              ('

In [58]:
# Now variable names correspond to the values
# This makes things a little easier.  To get the mean with Python built-ins:
type(kisa_data[0])      # each row is a dictionary keyed by variable name
kisa_data[0]['age']     # age of the first respondent (still a string)
# DictReader has already consumed the header row, so use every row (no [1:] slice)
ages = [float(x['age']) for x in kisa_data]
sum_age = sum(ages)
obs_age = len(ages)
mean_age = sum_age/obs_age
mean_age

43.17894736842105

But this still isn't ideal.


## Using NumPy to read text files

Remember that NumPy gives us matrix programming capabilities, akin to what one can do in Matlab.

In [71]:
import numpy as np
kisa_data = np.genfromtxt('kisa_2015.csv', delimiter=",")
# look at data
kisa_data[:5]

array([[             nan,              nan,              nan,
                     nan,              nan,              nan,
                     nan,              nan,              nan,
                     nan,              nan,              nan,
                     nan,              nan,              nan,
                     nan,              nan,              nan,
                     nan,              nan,              nan,
                     nan,              nan,              nan,
                     nan,              nan,              nan,
                     nan,              nan,              nan,
                     nan,              nan,              nan,
                     nan,              nan,              nan,
                     nan],
       [  1.20000000e+01,   4.20000000e+01,   5.00000000e+00,
          5.70000000e+01,   4.00000000e+00,   1.00000000e+00,
          1.40000000e+01,   4.00000000e+01,   1.00000000e+00,
          5.70000000e+01,   4.93400000e+04,

In [70]:
# As in Matlab, the default settings don't deal well with strings
# (the header row of variable names is read in as nan)
# But now we have some NumPy capabilities
# This makes things like means easy
np.mean(kisa_data[1:, 3])

43.178947368421049

So NumPy can be very useful for reading in text files with numeric data that will be manipulated in NumPy.  But it doesn't deal well with data of mixed types or with keeping track of variable names.
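If you do want NumPy to keep the variable names, `numpy.genfromtxt` accepts a `names=True` argument that reads the first line as field names and returns a structured array. The following is a minimal sketch on a small in-memory sample (not the full Kauffman file); the variable names `sample` and `data` are placeholders chosen for this illustration.

```python
import io

import numpy as np

# A tiny in-memory stand-in for kisa_2015.csv (illustration only)
sample = io.StringIO("month,grdatn,marstat,age\n12,42,5,57\n12,39,7,26\n")

# names=True reads the first line as field names instead of data,
# so the header no longer shows up as a row of nan values
data = np.genfromtxt(sample, delimiter=",", names=True)

print(data.dtype.names)     # the variable names taken from the header
print(data['age'].mean())   # columns can now be referenced by name
```

Alternatively, `skip_header=1` drops the header line and returns an ordinary two-dimensional float array, at the cost of losing the names.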

## Using Pandas to read in text files

Pandas provides the best options for reading data into a standard tabular format.  The main command one will use for text data (at least in csv format) is the `pandas.read_csv()` function.  This function reads in a csv file and returns a Pandas DataFrame.

In [75]:
import pandas as pd
# read in csv file into Pandas
kisa_data = pd.read_csv('kisa_2015.csv')
# look at the first 5 obs
kisa_data.head(n=5)

Unnamed: 0,month,grdatn,marstat,age,class,region,state,hours,mlr,natvty,...,homeown,hoursu1b,hoursu1b_t1,se15u,se15u_t1,ent015u,ent015ua,vet,wgtat,wgtat1
0,12,42,5,57,4,1,14,40,1,57,...,,40,40,0,0,0.0,0.0,0,269.172442,270.433824
1,12,39,7,26,4,1,14,40,1,57,...,,40,40,0,0,0.0,0.0,0,403.023478,404.912105
2,12,41,1,43,4,2,41,46,1,110,...,,46,40,0,0,0.0,0.0,0,402.790075,404.677609
3,12,39,1,38,4,2,41,40,1,57,...,,40,30,0,0,0.0,0.0,0,342.934489,344.541531
4,12,42,1,51,-1,3,58,-1,6,57,...,,-1,-1,0,0,0.0,0.0,0,560.224448,562.849743


In [76]:
# see some summary statistics
kisa_data.describe()

Unnamed: 0,month,grdatn,marstat,age,class,region,state,hours,mlr,natvty,...,homeown,hoursu1b,hoursu1b_t1,se15u,se15u_t1,ent015u,ent015ua,vet,wgtat,wgtat1
count,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,...,0.0,95.0,95.0,95.0,95.0,87.0,85.0,95.0,95.0,95.0
mean,12.0,40.284211,2.568421,43.178947,2.810526,3.210526,69.084211,27.673684,2.557895,88.473684,...,,27.315789,27.221053,0.073684,0.084211,0.0,0.0,0.084211,320.829357,322.332811
std,0.0,2.562675,2.434908,11.816846,2.531902,0.921317,23.262454,21.014403,2.499854,79.168799,...,,20.111531,20.316212,0.262642,0.279177,0.0,0.0,0.279177,127.035039,127.630344
min,12.0,34.0,1.0,20.0,-1.0,1.0,14.0,-1.0,1.0,57.0,...,,-1.0,-1.0,0.0,0.0,0.0,0.0,0.0,40.806662,40.997888
25%,12.0,39.0,1.0,33.5,-1.0,3.0,63.0,-1.0,1.0,57.0,...,,-1.0,-1.0,0.0,0.0,0.0,0.0,0.0,244.629113,245.775481
50%,12.0,40.0,1.0,44.0,4.0,3.0,71.0,40.0,1.0,57.0,...,,40.0,40.0,0.0,0.0,0.0,0.0,0.0,351.493054,353.140203
75%,12.0,43.0,5.0,52.0,4.0,4.0,86.0,41.5,5.5,57.0,...,,40.0,40.0,0.0,0.0,0.0,0.0,0.0,405.081462,406.979734
max,12.0,46.0,7.0,64.0,7.0,4.0,94.0,65.0,7.0,364.0,...,,65.0,70.0,1.0,1.0,0.0,0.0,1.0,600.702105,603.517084


In [77]:
# find mean age
kisa_data['age'].mean()

43.17894736842105

Pandas' `read_csv()` function provides lots of flexibility with respect to skipping rows/columns, changing column names, etc.
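As a rough illustration of that flexibility, the sketch below applies three `read_csv` options (`usecols`, `names` together with `header=0`, and `skiprows`) to a small in-memory sample. The replacement column names are made up for this example and do not come from the Kauffman documentation.

```python
import io

import pandas as pd

# A tiny in-memory stand-in for kisa_2015.csv (illustration only)
csv_text = "month,grdatn,marstat,age\n12,42,5,57\n12,39,7,26\n"

# Keep only a subset of the columns
subset = pd.read_csv(io.StringIO(csv_text), usecols=['month', 'age'])

# Replace the existing header with our own (hypothetical) column names;
# header=0 tells pandas that the first line is a header to be overwritten
renamed = pd.read_csv(io.StringIO(csv_text), header=0,
                      names=['month', 'grade', 'marital', 'age'])

# Skip line 1 of the file (line 0 is the header, so this drops the first observation)
skipped = pd.read_csv(io.StringIO(csv_text), skiprows=[1])

print(subset.columns.tolist())
print(renamed.columns.tolist())
print(len(skipped))
```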

Pandas also contains functions to read in data in many different formats.  Below is a table of some of the commonly used `pandas` read-in functions (taken partially from Table 6.1 of McKinney, 2013).

| Function       | Description |
| -------------- | ----------- |
| read_csv       | Load delimited data from a file, URL, or file-like object. Use comma as default delimiter. |
| read_table     | Load delimited data from a file, URL, or file-like object. Use tab ('\t') as default delimiter. |
| read_fwf       | Read data in fixed-width column format (that is, no delimiters). |
| read_clipboard | Version of `read_table` that reads data from the clipboard. Useful for converting tables from webpages. |
| read_stata     | Load .dta format Stata data file as a DataFrame. |
| read_excel     | Load .xls or .xlsx Excel data file as a DataFrame. |
| read_sas       | Load SAS data file (.sas7bdat or .xpt) as a DataFrame. |
| read_json      | Load .json data file as a DataFrame. |
| read_pickle    | Load a pickled pandas object (e.g., a DataFrame) from a .pkl file. |

Type inference is one of the more important features of these functions; that means you don't have to specify which columns are floating point, integer, boolean, or string.  Handling dates and other custom types requires a bit more effort, though.
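To make type inference concrete, here is a small sketch using invented data (the column names and values below are not from the Kauffman file). It reads the same text twice: once with the defaults, and once asking `read_csv` to parse a date column through the `parse_dates` argument.

```python
import io

import pandas as pd

# Made-up sample with one column of each basic type plus a date column
csv_text = ("id,weight,flag,label,date\n"
            "1,269.17,True,a,2015-12-01\n"
            "2,403.02,False,b,2015-12-02\n")

# Default behavior: integer, float, and boolean columns are inferred,
# but the date column is left as text
inferred = pd.read_csv(io.StringIO(csv_text))
print(inferred.dtypes)

# Dates need an explicit request to be parsed
with_dates = pd.read_csv(io.StringIO(csv_text), parse_dates=['date'])
print(with_dates.dtypes)
```

Comparing the two `dtypes` printouts shows which conversions happen automatically and which need to be requested.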

## Saving Data

We can use many of the modules/classes above to write data to text files.  For example, `csv.writer()`, `numpy.savetxt()`, and `pandas.DataFrame.to_csv()` are all options.
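The three writers mentioned above can be sketched side by side. This example writes a tiny made-up table into a temporary directory so that it does not touch any real files; the file names and the `rows` data are placeholders for illustration.

```python
import csv
import os
import tempfile

import numpy as np
import pandas as pd

out_dir = tempfile.mkdtemp()   # scratch directory for the example files
rows = [['month', 'age'], [12, 57], [12, 26]]

# 1. csv.writer: write a list of rows, header included
csv_path = os.path.join(out_dir, 'rows.csv')
with open(csv_path, 'w', newline='') as f:
    csv.writer(f).writerows(rows)

# 2. numpy.savetxt: write a numeric array; comments='' keeps the header
#    line free of the default '# ' prefix
np_path = os.path.join(out_dir, 'array.csv')
np.savetxt(np_path, np.array(rows[1:]), delimiter=',',
           header='month,age', comments='', fmt='%d')

# 3. DataFrame.to_csv: index=False leaves out the row labels
pd_path = os.path.join(out_dir, 'frame.csv')
df = pd.DataFrame(rows[1:], columns=rows[0])
df.to_csv(pd_path, index=False)
```

Each of the three files can be read back with `pd.read_csv` and yields the same two columns.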

### Pickling data

Another option for writing data to disk is to use `pickle`, a module in the Python standard library.  Pickle allows you to write Python objects to disk in a binary file format.  Pickling can be especially convenient in Python because it allows you to write just about any Python object to disk and then recall it.

To "pickle" an object:

In [78]:
import pickle
# save the kisa dataframe to a file called kisa_df.pkl
# note the 'wb': pickle writes bytes, so the file must be opened in binary write mode
with open('kisa_df.pkl', 'wb') as f:
    pickle.dump(kisa_data, f)

In [79]:
# To read the pickle, just "unpickle"
# Note the 'rb': pickle files are binary, so open them in binary read mode
with open('kisa_df.pkl', 'rb') as f:
    kisa_data2 = pickle.load(f)
kisa_data2.describe()

Unnamed: 0,month,grdatn,marstat,age,class,region,state,hours,mlr,natvty,...,homeown,hoursu1b,hoursu1b_t1,se15u,se15u_t1,ent015u,ent015ua,vet,wgtat,wgtat1
count,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,...,0.0,95.0,95.0,95.0,95.0,87.0,85.0,95.0,95.0,95.0
mean,12.0,40.284211,2.568421,43.178947,2.810526,3.210526,69.084211,27.673684,2.557895,88.473684,...,,27.315789,27.221053,0.073684,0.084211,0.0,0.0,0.084211,320.829357,322.332811
std,0.0,2.562675,2.434908,11.816846,2.531902,0.921317,23.262454,21.014403,2.499854,79.168799,...,,20.111531,20.316212,0.262642,0.279177,0.0,0.0,0.279177,127.035039,127.630344
min,12.0,34.0,1.0,20.0,-1.0,1.0,14.0,-1.0,1.0,57.0,...,,-1.0,-1.0,0.0,0.0,0.0,0.0,0.0,40.806662,40.997888
25%,12.0,39.0,1.0,33.5,-1.0,3.0,63.0,-1.0,1.0,57.0,...,,-1.0,-1.0,0.0,0.0,0.0,0.0,0.0,244.629113,245.775481
50%,12.0,40.0,1.0,44.0,4.0,3.0,71.0,40.0,1.0,57.0,...,,40.0,40.0,0.0,0.0,0.0,0.0,0.0,351.493054,353.140203
75%,12.0,43.0,5.0,52.0,4.0,4.0,86.0,41.5,5.5,57.0,...,,40.0,40.0,0.0,0.0,0.0,0.0,0.0,405.081462,406.979734
max,12.0,46.0,7.0,64.0,7.0,4.0,94.0,65.0,7.0,364.0,...,,65.0,70.0,1.0,1.0,0.0,0.0,1.0,600.702105,603.517084
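Pickling is not limited to DataFrames. As a minimal sketch, the example below round-trips an ordinary dictionary through a pickle file in a temporary directory; the dictionary contents and the file name `results.pkl` are illustrative placeholders.

```python
import os
import pickle
import tempfile

# An arbitrary Python object: a dictionary holding mixed types
results = {'mean_age': 43.18, 'columns': ['month', 'age'], 'n_obs': 95}

path = os.path.join(tempfile.mkdtemp(), 'results.pkl')

# Write the object to disk in binary mode
with open(path, 'wb') as f:
    pickle.dump(results, f)

# Read it back; the result is a new object with equal contents
with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == results)
```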


## Writing Functions

### Syntax

A Python function is defined with the `def` keyword, followed by the function name.  Arguments to the function follow the function name, in parentheses.  After the arguments comes a colon, and the statements to be executed when the function is called are indented on the lines below.  A `return` statement declares any objects that are to be returned from the function.

Let's see this through a simple example.  We'll define a function that returns the product of two arguments.

In [83]:
def product(val1, val2):
    prod = val1 * val2

    return prod

Here we defined the function `product`.  The function takes two arguments, `val1` and `val2`.  The statements to be executed are one line, setting a variable called `prod` equal to `val1 * val2`.  Finally, the last line of the function contains a `return`, which says that whenever the function is called, the variable `prod` will be returned.  

We've now defined this function.  We can call it by using the function name followed by its arguments in parentheses:

In [84]:
x, y = 4, 5

In [85]:
product(x, y)

20

It's as simple as that.  

But a few more words on functions.  

First, remember that you want well-documented code.  An important part of this is documenting what each of your custom functions is doing.  To do this well, you'll want to include a 'docstring' in your functions to help document what each function does (note that there are a number of conventions for formatting docstrings; see [here](https://stackoverflow.com/questions/3898572/what-is-the-standard-python-docstring-format) for some examples).  For example:

In [86]:
def product(val1, val2):
    '''
    This function returns the product of two numbers.
    
    Args:
        val1: a scalar
        val2: a scalar

    Returns:
        A scalar that is the product of val1 and val2.

    Raises:
        N/A
    '''
    prod = val1 * val2

    return prod
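One payoff of writing a docstring is that Python stores it on the function object, where you and your tools can retrieve it. The short sketch below uses a shortened version of the docstring above and reads it back with the `__doc__` attribute and with `inspect.getdoc`, which also strips the indentation.

```python
import inspect


def product(val1, val2):
    '''
    This function returns the product of two numbers.
    '''
    prod = val1 * val2
    return prod


# The raw docstring, including the original line breaks and indentation
raw_doc = product.__doc__

# A cleaned-up version with indentation and surrounding blank lines removed
clean_doc = inspect.getdoc(product)
print(clean_doc)
```

In a notebook, `help(product)` or `product?` displays the same text interactively.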

Second, note two ways functions are often defined.  One is within a class.  Functions of a class are called methods.  Another way is in a script that contains nothing but functions.  A script (a `.py` file) that collects function definitions can be imported as a module.  It's often helpful to put functions together in modules to help make your code more readable.  Functions in classes and modules are accessible when those classes or modules are loaded.  For example, we might save the following functions in a script called `fibo.py` in the directory with this notebook:

```
# Fibonacci numbers module

def fib(n):    # print Fibonacci series up to n
    a, b = 0, 1
    while b < n:
        print(b)
        a, b = b, a+b

def fib2(n):   # return Fibonacci series up to n
    result = []
    a, b = 0, 1
    while b < n:
        result.append(b)
        a, b = b, a+b
    return result
```

We can then import this module to make the functions available in this notebook.

In [88]:
import fibo

In [89]:
# to call the fib function
fibo.fib(100)

1
1
2
3
5
8
13
21
34
55
89


In [90]:
fibo.fib2(100)

[1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
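There are a few common ways to import from a module. The sketch below is self-contained, so it first writes a throwaway module to a temporary directory and adds that directory to `sys.path`; the module name `fibo_demo` is a hypothetical stand-in for `fibo.py`, used so that the example does not depend on a file being present.

```python
import importlib
import os
import sys
import tempfile

# Source for a throwaway module containing the fib2 function from above
module_source = '''
def fib2(n):   # return Fibonacci series up to n
    result = []
    a, b = 0, 1
    while b < n:
        result.append(b)
        a, b = b, a+b
    return result
'''

# Write the module to a temporary directory and make it importable
module_dir = tempfile.mkdtemp()
with open(os.path.join(module_dir, 'fibo_demo.py'), 'w') as f:
    f.write(module_source)
sys.path.insert(0, module_dir)
importlib.invalidate_caches()

# Import the whole module under a shorter alias
import fibo_demo as fd
# Or import a single function directly into the current namespace
from fibo_demo import fib2

print(fd.fib2(20))
print(fib2(20))
```

Importing the module under an alias keeps it clear where each function comes from, while importing a single name saves typing when a function is used often.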

### One-line functions

Sometimes (as in the above example) functions may be very simple.  The keyword `lambda` is a shortcut for creating one-line functions in Python.  E.g.,

In [91]:
# alternative to our 'product' function above
f = lambda x, y: x*y
f(4, 5)

20
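A `lambda` is often most useful when a short function is needed only once, for example as the `key` argument of the built-in `sorted`. The following sketch uses a made-up list of (name, value) pairs.

```python
# Made-up (name, value) pairs for illustration
records = [('month', 12), ('age', 57), ('marstat', 5)]

# Sort by the second element of each pair using a one-line function
by_value = sorted(records, key=lambda pair: pair[1])

# Sort by the length of the name instead
by_name_length = sorted(records, key=lambda pair: len(pair[0]))

print(by_value)
print(by_name_length)
```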

### Generalized function input

Sometimes you will want to define a function that accepts a variable number of input arguments.  Python's function syntax includes two variable-length input objects: `*args` and `**kwargs`.  Inside the function, `args` is a tuple of the positional arguments, and `kwargs` is a dictionary mapping the keywords to their arguments.  This is the most general form of a function definition.

In [92]:
def report(*args, **kwargs):
    for i, arg in enumerate(args):
        print('Argument ' + str(i) + ':', arg)
    for key in kwargs:
        print("Keyword", key, "->", kwargs[key])

report("TK", 421, exceptional=False, missing=True)

Argument 0: TK
Argument 1: 421
Keyword exceptional -> False
Keyword missing -> True


Passing arguments or dictionaries through the variable-length `*args` or `**kwargs` objects is often desirable for the targets of SciPy's root finders, solvers, and minimizers.
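To show the pattern without depending on SciPy, here is a sketch of a simple bisection root finder that forwards extra parameters to its target function. Both `bisect_root` and `excess_demand` are hypothetical helpers written for this illustration; they only imitate the convention, used by SciPy routines, of accepting an `args` tuple of extra parameters.

```python
def bisect_root(func, lo, hi, args=(), tol=1e-10, **kwargs):
    """Hypothetical helper: find a root of func(x, *args, **kwargs)
    on [lo, hi] by bisection, assuming func changes sign on the interval."""
    f_lo = func(lo, *args, **kwargs)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        f_mid = func(mid, *args, **kwargs)
        if (f_lo < 0) == (f_mid < 0):
            # Same sign as the lower end: the root is in the upper half
            lo, f_lo = mid, f_mid
        else:
            # Sign change: the root is in the lower half
            hi = mid
    return (lo + hi) / 2


def excess_demand(price, intercept, slope, supply=1.0):
    # Linear demand minus a fixed supply; zero at the market-clearing price
    return intercept - slope * price - supply


# Extra positional parameters travel in the args tuple,
# extra keyword parameters travel through **kwargs
root = bisect_root(excess_demand, 0.0, 10.0, args=(5.0, 2.0), supply=1.0)
print(root)
```

The solver never needs to know what `intercept`, `slope`, or `supply` mean; it simply unpacks them into each call of the target function.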

### Some function best practices

1. Don't use global variables. Always explicitly pass in everything that a function requires to execute.
2. Don't pass input arguments into a function that do not get used. This principle is helpful when one needs to debug code.
3. Don't create objects in the return line of a function. Even though it is easier and you can often write an entire function in one return line, it is much cleaner and more transparent to create all of your objects in the body of a function and only return objects that have already been created.
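The practices above can be illustrated with a small contrast. The function and variable names below (`grow_bad`, `grow_good`, `RATE`) are invented for this sketch.

```python
RATE = 0.05  # a module-level (global) value


def grow_bad(balance):
    # Discouraged: silently depends on the global RATE and
    # builds the result inside the return line
    return balance * (1 + RATE)


def grow_good(balance, rate):
    # Preferred: everything the function needs is passed in,
    # and the object is created before it is returned
    new_balance = balance * (1 + rate)
    return new_balance


print(grow_bad(100.0))
print(grow_good(100.0, 0.05))
```

Both functions give the same answer here, but `grow_good` can be tested and debugged without knowing anything about the surrounding script.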