

# Synopsis

This notebook will explain the following topics and concepts:


1) Functions - Basics

2) Apply functions to DataFrames

3) Vectorized Functions
- Using `numpy`
- Using `numba`

4) Aggregation

5) Transformation

6) Filter


# Functions - Basics

Functions group Python code so it can be reused, which removes the need to copy and paste code.

D.R.Y. - Don't Repeat Yourself

Use the `def` keyword to define a function.

Python uses indentation to define blocks of code.

Use the `return` keyword to have the function return a value.

In [None]:
# First Function
def foo():
    # indent 4 spaces
    # function code here
    print("Calling Foo")
    
    
# execute the function
foo()

In [None]:
# Second function
# Have the function return a value

def my_sq(n):
    return n * n

# execute the function
my_sq(9)

In [None]:
# Third Function

def my_exp(x, e):
    return x ** e

# execute the function
my_exp(6, 3)

# Apply (Basics)

Use `apply` to run a function across the rows or columns of a DataFrame.

Use the `func` argument to identify the function to be executed.

## Create a simple DataFrame

In [None]:
import pandas as pd

d = {
    'a': [10,20,30],
    'b': [20,30,40]
}

df = pd.DataFrame(data=d)

display(df)

## Apply over a series

**NOTE** <BR>
- a column / row in a DataFrame is a Series

In [None]:
# Cannot apply a function that takes ZERO arguments:
# apply passes each value to the function as an argument

# This raises a TypeError
# df['a'].apply(foo)

### Single Parameter Function

In [None]:
# Uses the value in each cell as the argument to the function
df['a'].apply(func=my_sq)

In [None]:
df

### Multiple Parameter Function

In [None]:
# Uses the value in each cell as the argument to the function
# Use the keyword argument to pass in the value for 'e'
df['a'].apply(func=my_exp, e=3)

## Apply over a DataFrame

In [None]:
def print_me(x):
    print(x)
    

### Column-wise Operations

- `axis=0`

In [None]:
df.apply(func=print_me, axis=0)

### Row-wise Operations

- `axis=1`

In [None]:
df.apply(func=print_me, axis=1)
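
As a minimal sketch of the difference between the two axes, the example below rebuilds the same small DataFrame and applies a helper that returns a value instead of printing. The names `df_demo` and `value_range` are assumptions made for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({'a': [10, 20, 30], 'b': [20, 30, 40]})

def value_range(ser):
    # ser is one column (axis=0) or one row (axis=1), passed in as a Series
    return ser.max() - ser.min()

by_column = df_demo.apply(func=value_range, axis=0)  # one result per column
by_row = df_demo.apply(func=value_range, axis=1)     # one result per row

print(by_column)
print(by_row)
```

With `axis=0` the result is indexed by column name; with `axis=1` it is indexed by the row labels.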

## Advanced Apply

Use the titanic dataset to write more advanced apply operations

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns

df_titanic = sns.load_dataset('titanic')

### Define some functions

In [None]:
# Number of missing values in a Series 
def count_missing(ser):
    nulls = pd.isnull(ser)
    
    return np.sum(nulls)

# Proportion of missing values in a Series 
def prop_missing(ser):
    num = count_missing(ser)
    den = ser.size
    
    return num/den

# Proportion complete
def prop_complete(ser):
    return 1 - prop_missing(ser)

### Try them out

In [None]:
# Default is column wise
missing = df_titanic.apply(func=count_missing)
print(missing)

In [None]:
# Explicitly set the column-wise ordering
p_missing = df_titanic.apply(func=prop_missing, axis=0)
print(p_missing)

In [None]:
# Rowwise
p_complete = df_titanic.apply(func=prop_complete, axis=1)
print(p_complete)
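
For comparison, the same counts can usually be obtained without `apply` by chaining built-in pandas methods. The sketch below uses a small made-up DataFrame (`df_small`, not the titanic data) so that it runs without downloading anything.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few missing values
df_small = pd.DataFrame({
    'age': [22.0, np.nan, 35.0, np.nan],
    'deck': ['C', None, None, None],
    'fare': [7.25, 71.28, 8.05, 53.10],
})

def count_missing(ser):
    # Number of missing values in a Series
    return pd.isnull(ser).sum()

via_apply = df_small.apply(func=count_missing)   # column-wise apply
via_builtin = df_small.isnull().sum()            # same counts, no apply

# The mean of a boolean mask is the proportion of True values
prop_missing_cols = df_small.isnull().mean()
prop_complete_rows = 1 - df_small.isnull().mean(axis=1)

print(via_apply)
print(via_builtin)
print(prop_missing_cols)
print(prop_complete_rows)
```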

# Vectorized Functions

## Create a simple DataFrame

- Same as the DataFrame at the start of the notebook

In [None]:
import pandas as pd

d = {
    'a': [10,20,30],
    'b': [20,30,40]
}

df = pd.DataFrame(data=d)

display(df)

## Vectorized Arithmetic

e.g. average of 2 values

Note that `pandas` and `numpy` will perform either scalar or vector arithmetic.


In [None]:
def avg_2(x,y):
    return (x+y)/2

# Scalar
print(avg_2(10,100))

# Vector
print(avg_2(df['a'], df['b']))

## Non-Vectorized Arithmetic

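Not all functions can be automatically vectorized. A function that branches on its input with `if` fails when given a Series, because the truth value of a whole Series is ambiguous.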

In [None]:
import numpy as np

def avg_2_modified(x,y):
    if (x == 20):
        # np.nan (the np.NaN alias was removed in NumPy 2.0)
        return np.nan
    else:
        return (x + y) / 2

# Scalar
print(avg_2_modified(10,100))
print(avg_2_modified(20,100))

# Vectors
## Raises ValueError: the truth value of a Series is ambiguous
#print(avg_2_modified(df['a'], df['b']))
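
The failure can be observed safely by catching the exception. The sketch below also shows `np.where`, which is not one of the options covered in this notebook but is a common way to express an element-wise condition; `df_demo` is a stand-in for the DataFrame above.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'a': [10, 20, 30], 'b': [20, 30, 40]})

def avg_2_modified(x, y):
    if x == 20:
        return np.nan
    return (x + y) / 2

# Passing whole Series makes `x == 20` a boolean Series,
# which `if` cannot reduce to a single True/False
try:
    avg_2_modified(df_demo['a'], df_demo['b'])
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)

# np.where evaluates the condition element by element
result = np.where(df_demo['a'] == 20, np.nan, (df_demo['a'] + df_demo['b']) / 2)
print(result)
```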

## Option 1 - Use np.vectorize

Use `numpy.vectorize` to wrap the function so that, when given a vector of values, it performs element-wise calculations. <BR>

Useful in 2 cases

1 - When you don't have the source code for the function, use `np.vectorize()`

2 - When you do have the source code for the function, use the `vectorize` decorator

In [None]:
# Case 1
avg_2_mod_vec = np.vectorize(avg_2_modified)

# Scalar
print(avg_2_mod_vec(10,100))
print(avg_2_mod_vec(20,100))

# Vectors
print(avg_2_mod_vec(df['a'], df['b']))


In [None]:
# Case 2

@np.vectorize
def v_avg_2_modified(x,y):
    if (x == 20):
        return np.nan
    else:
        return (x + y) / 2

# Scalar
print(v_avg_2_modified(10,100))
print(v_avg_2_modified(20,100))

# Vectors
print(v_avg_2_modified(df['a'], df['b']))
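
Note that `np.vectorize` is provided mainly for convenience rather than performance (internally it is essentially a loop), and the result comes back as a NumPy array rather than a Series. A minimal sketch of wrapping the result back into a Series follows; `df_demo` and the column name `avg` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'a': [10, 20, 30], 'b': [20, 30, 40]})

@np.vectorize
def v_avg_2_modified(x, y):
    if x == 20:
        return np.nan
    return (x + y) / 2

# The vectorized call returns a plain NumPy array
raw = v_avg_2_modified(df_demo['a'], df_demo['b'])

# Re-attach the original index to get a Series back
as_series = pd.Series(raw, index=df_demo.index, name='avg')
print(as_series)
```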


## Option 2 - Use numba

Designed to optimize Python code, especially array calculations.

- Use the numba vectorize decorator

- `numba` does not understand pandas objects, so Series and DataFrames need to be converted to arrays

- Use the `values` property

Same 2 situations as before

1 - When you don't have the source code for the function, use `numba.vectorize()`

2 - When you do have the source code for the function, use the `vectorize` decorator

In [None]:
# Case 1
import numba

avg_mod_2_numba = numba.vectorize(avg_2_modified)

# Scalar
print(avg_mod_2_numba(10,100))
print(avg_mod_2_numba(20,100))

# Vectors (numba needs arrays, so pass .values)
print(avg_mod_2_numba(df['a'].values, df['b'].values))

In [None]:
# Case 2
import numba

@numba.vectorize
def v_avg_2_numba(x,y):
    if (x == 20):
        return np.nan
    else:
        return (x + y) / 2

# Scalar
print(v_avg_2_numba(10,100))
print(v_avg_2_numba(20,100))

# Vectors
print(v_avg_2_numba(df['a'].values, df['b'].values))


# Split-Apply-Combine

Referring to a process involving one or more of the following steps:
- **Splitting** the data into groups based on some criteria. 
- **Applying** a function to each group independently. 
- **Combining** the results into a data structure. 

Out of these, the split step is the most straightforward. In fact, in many situations we may wish to split the data set into groups and do something with those groups. In the apply step, we might wish to do one of the following:

- **Aggregation**: compute a summary statistic (or statistics) for each group. e.g.:
>
>Compute group sums or means.
>Compute group sizes / counts.
>

- **Transformation**: perform some group-specific computations and return a like-indexed object. e.g.:
>
>Standardize data (zscore) within a group.
>Filling NAs within groups with a value derived from each group.
>

- **Filtration**: discard some groups, according to a group-wise computation that evaluates True or False. e.g.:
>
>Discard data that belongs to groups with only a few members.
>Filter out data based on the group sum or mean.
>
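
The three steps can be sketched by hand and compared with the one-line `groupby` equivalent. The data and the names `df_demo`, `pieces`, and `manual` below are made up for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'group': ['x', 'x', 'y', 'y', 'y'],
    'value': [1, 3, 10, 20, 30],
})

# Split: one sub-DataFrame per group key
pieces = {key: part for key, part in df_demo.groupby('group')}

# Apply: compute a summary statistic for each piece independently
means = {key: part['value'].mean() for key, part in pieces.items()}

# Combine: collect the per-group results into one Series
manual = pd.Series(means, name='value')
manual.index.name = 'group'

# groupby performs all three steps in one expression
one_step = df_demo.groupby('group')['value'].mean()

print(manual)
print(one_step)
```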


# Aggregation

The process of taking multiple values and returning a single value <BR>
Sometimes referred to as *`summarization`*
 
 
Some form of reduction is performed by taking multiple values and replacing them with a single value


## Group By

Many functions can be applied to the result of GroupBy


### Pandas functions
- count, size, mean, std, min, quantile, max, sum, var, sem, describe, first, last

In [None]:
import pandas as pd

df = pd.read_csv(filepath_or_buffer='../Data/gapminder.tsv', sep='\t')

df.groupby(by='year')['lifeExp'].mean()
df.groupby(by='year')['lifeExp'].std()
df.groupby(by='year')['lifeExp'].sem()


### Non Pandas Functions
Use the `agg` method
e.g. for numpy and scipy functions

- np.count_nonzero, np.mean, np.std, np.min, np.max, np.percentile, np.sum, np.var
- scipy.stats.sem, scipy.stats.describe


In [None]:
import numpy as np
import scipy.stats  # import the submodule explicitly so scipy.stats.sem is available

df.groupby(by='year')['lifeExp'].agg(np.count_nonzero)
df.groupby(by='year')['lifeExp'].agg(np.std)
df.groupby(by='year')['lifeExp'].agg(scipy.stats.sem)


### User Defined Functions

In [None]:
# Simple function to calculate mean

def my_mean(values):
    n = len(values)
    total = 0  # avoid shadowing the built-in sum()
    for v in values:
        total += v
    
    return total / n

df.groupby(by='year')['lifeExp'].agg(my_mean)

In [None]:
# Function that takes multiple parameters
def my_mean_diff(values, diff_value):
    n = len(values)
    total = 0  # avoid shadowing the built-in sum()
    for v in values:
        total += v
    mean = total / n
    
    return mean - diff_value


# Calculate global average life exp & subtract it from a grouped value
global_mean = df['lifeExp'].mean()

# Note: passing the function by keyword (func=...) can fail in older pandas
# versions, likely because the first parameter of agg was not named 'func' there
# df.groupby('year')['lifeExp'].agg(func=my_mean_diff,diff_value=global_mean)

# Passing the function positionally works fine
df.groupby('year')['lifeExp'].agg(my_mean_diff,diff_value=global_mean)
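
An alternative that avoids passing extra arguments through `agg` is to bind the value in a `lambda`. The sketch below uses a small made-up DataFrame (`df_demo`, not the gapminder data) and a shortened stand-in for `my_mean_diff`.

```python
import pandas as pd

# Hypothetical data, for illustration only
df_demo = pd.DataFrame({
    'year': [1952, 1952, 2007, 2007],
    'lifeExp': [30.0, 50.0, 60.0, 80.0],
})

def my_mean_diff(values, diff_value):
    # Group mean minus a reference value
    return values.mean() - diff_value

global_mean = df_demo['lifeExp'].mean()

# Extra keyword arguments are forwarded to the function
via_kwarg = df_demo.groupby('year')['lifeExp'].agg(my_mean_diff, diff_value=global_mean)

# The lambda captures global_mean, so agg only sees a one-argument function
via_lambda = df_demo.groupby('year')['lifeExp'].agg(lambda v: v.mean() - global_mean)

print(via_kwarg)
print(via_lambda)
```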

## Multiple Functions

### Option 1 - Use a list of functions (or some iterable collection of functions)

Can also use `aggregate(...)` instead of `agg(...)`

In [None]:
func_list = [np.std, np.mean, np.count_nonzero]
df.groupby('year')['lifeExp'].agg(func_list)

# These are the pandas functions, referenced by name
# (a set is unordered, so the column order may vary)
func_set = {'std','min', 'max', 'count'}
df.groupby('year')['lifeExp'].agg(func_set)

# Mix and match
func_tup = (my_mean, np.std, 'count')
df.groupby('year')['lifeExp'].agg(func_tup)

# Same mix and match, using aggregate(...) instead of agg(...)
func_tup = (my_mean, np.std, 'count')
df.groupby('year')['lifeExp'].aggregate(func_tup)

### Option 2 - Use a dict

- Use the key of the dict to refer to the column
- Use the value to refer to the function to apply

In [None]:
func_dict = {
    'lifeExp'  : 'mean',
    'pop'      : np.std,
    'gdpPercap': my_mean
}

df.groupby('year').aggregate(func_dict)
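
A related form is named aggregation (available since pandas 0.25), where each keyword names an output column and maps to a `(column, function)` pair. This is a minimal sketch on made-up data; `df_demo`, `mean_life`, and `total_pop` are illustrative names, not part of the gapminder data.

```python
import pandas as pd

# Hypothetical data, for illustration only
df_demo = pd.DataFrame({
    'year': [1952, 1952, 2007, 2007],
    'lifeExp': [30.0, 50.0, 60.0, 80.0],
    'pop': [100, 300, 200, 600],
})

# output_column=(input_column, function)
summary = df_demo.groupby('year').agg(
    mean_life=('lifeExp', 'mean'),
    total_pop=('pop', 'sum'),
)

print(summary)
```

Compared with the dict form, this lets the output column names be chosen explicitly.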

# Transform

Takes a vector of values and performs a one-to-one mapping / transformation

There is no reduction with transform

e.g. the z-score

- number of std deviations from the mean

- $\mathbf{ z = \frac{x - \mu}{\sigma} }$

In [None]:
def my_zscore(x):
    return (x - x.mean()) / x.std()

# Transform the data
df.groupby(by='year')['lifeExp'].transform(my_zscore)
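
Another common transformation is filling missing values with a value derived from each group. The sketch below uses made-up data; `df_demo` and `fill_with_group_mean` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per group
df_demo = pd.DataFrame({
    'group': ['x', 'x', 'x', 'y', 'y', 'y'],
    'value': [1.0, np.nan, 3.0, 10.0, 20.0, np.nan],
})

def fill_with_group_mean(ser):
    # ser holds the values of one group; mean() skips NaN by default
    return ser.fillna(ser.mean())

# The result has the same length and index as the input (no reduction)
filled = df_demo.groupby('group')['value'].transform(fill_with_group_mean)

print(filled)
```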

# Filter

## Load in some data


In [None]:
df_tips = sns.load_dataset('tips')

# Look at table size
df_tips['size'].value_counts()

## Option 1 use pandas.filter & user defined function

e.g. keep only the rows whose table-size group has at least 30 observations

In [None]:
def ge30(group):
    # True when the group has at least 30 observations
    return group['size'].count() >= 30

df_tmp = df_tips.groupby(by='size').filter(ge30)

df_tmp['size'].value_counts()

## Option 2 use pandas.filter & lambda

Same as above but using a lambda

In [None]:
df_tmp = df_tips.groupby(by='size').filter(lambda x: x['size'].count() >= 30)

df_tmp['size'].value_counts()
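
Groups can also be filtered on a computed statistic such as the group mean. This is a minimal sketch on made-up data (`df_demo`, not the tips dataset); the threshold of 2.5 is arbitrary.

```python
import pandas as pd

# Hypothetical data, for illustration only
df_demo = pd.DataFrame({
    'size': [1, 1, 2, 2, 2, 4],
    'tip': [1.0, 2.0, 3.0, 4.0, 5.0, 2.0],
})

# Keep only the rows of groups whose mean tip exceeds the threshold
kept = df_demo.groupby('size').filter(lambda g: g['tip'].mean() > 2.5)

print(kept)
```

The rows that survive keep their original index labels, because `filter` discards whole groups rather than reshaping the data.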