# Advanced Pandas Tutorial

In [None]:
import pandas as pd
import numpy as np
import random
random.seed(10)

In [None]:
# let's generate some random data for a dataframe
d = {}
d['index'] = list(range(100))
d['colA'] = [random.randint(1, 100) for i in range(100)]
d['colB'] = [random.randint(1, 100) for i in range(100)]
d['colC'] = [random.randint(1, 100) for i in range(100)]
df = pd.DataFrame(data=d)


In [None]:
df = df.set_index('index')  # set_index returns a new dataframe, so assign the result back
df.head()

### Creating new columns

In [None]:
# we can create a new column in our dataframe by simply assigning a static value
df['colD'] = 0
df.head()

In [None]:
# but assigning a static value is not very useful
# let's create a function that we can apply to each row

def foo(val):
    """
    @param val - integer value
    @return - True if val < 50, False otherwise
    """
    return val < 50

In [None]:
# let's fill colD by applying foo to colC in each row, using a lambda function
df['colD'] = df.apply(lambda x: foo(x['colC']), axis=1)   # x represents a row, which can be indexed by column name
                                                          # axis=1 is required to traverse the dataframe by rows
                                                          #     (by default, axis=0 will traverse df by columns)

In [None]:
df.head()
# and we notice colD contains True/False based on the value of colC

In [None]:
# equivalent function calls
df['colD'] = df['colC'].apply(lambda x: foo(x))  # apply to the colC column/series, instead of the dataframe
                                                 # useful if you are not referencing more than one column in the application
# or
df['colD'] = df.apply(lambda x: True if x['colC'] < 50 else False, axis=1) # inline rewrite of foo()
# or
df['colD'] = df['colC'].apply(lambda x: True if x < 50 else False) # merging the above two approaches

df.head()
# still the same result
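For a simple threshold like this one, the same column can also be built without `apply` by comparing the Series directly. The sketch below uses a small stand-in dataframe named `demo` (not the tutorial's `df`) and checks that both approaches agree.

```python
import random
import pandas as pd

random.seed(10)
# stand-in data with the same shape as colC above
demo = pd.DataFrame({'colC': [random.randint(1, 100) for _ in range(100)]})

# element-wise apply, as in the cells above
via_apply = demo['colC'].apply(lambda x: x < 50)

# vectorized comparison: evaluates the whole column in one operation
via_compare = demo['colC'] < 50

# both are boolean Series with identical values
print((via_apply == via_compare).all())
```

The vectorized form is usually faster on large dataframes; `apply` remains the more flexible choice when the per-row logic does not reduce to a single expression.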

#### But why would I need to create new columns?

This is incredibly useful for annotating data.  
For example, in MP1, you can use df.apply() to:
 * Store the number of bit flips based on the syndrome column
 * Mark coalesced entries with True/False; you can then filter out coalesced entries based on this new column
 * Identify error reasons & suberror reasons by splitting the 'Error Type' column into multiple new columns
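
As a rough illustration of the first and third items, here is a minimal sketch on invented rows. The column formats are assumptions, not the real MP1 layout: `Syndrome` is taken to be a hexadecimal string whose set bits count the flips, and `Error Type` is taken to be `reason:subreason`. The helper `count_set_bits` and the new column names are hypothetical.

```python
import pandas as pd

# invented sample rows; the real MP1 column formats may differ
log = pd.DataFrame({
    'Syndrome': ['0x1', '0x3', '0xf0'],
    'Error Type': ['memory:single-bit', 'memory:multi-bit', 'bus:timeout'],
})

def count_set_bits(syndrome):
    """Return the number of 1 bits in a hexadecimal syndrome string (assumed format)."""
    return bin(int(syndrome, 16)).count('1')

# annotate each row with a value derived from one column
log['Bit Flips'] = log['Syndrome'].apply(count_set_bits)

# split on the first ':' into two new columns (assumed separator)
log[['Reason', 'Subreason']] = log['Error Type'].str.split(':', n=1, expand=True)

print(log)
```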

### Editing column values
    

In [None]:
# We can update column values in df with =

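# A first attempt: index into the row with iloc, then assign to the column
# (this is chained indexing: df.iloc[0] returns a temporary row, so the assignment may never reach df)
df.iloc[0]['colA'] = 999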

### ARGH, pandas won't allow us to modify the dataframe
Let's do this the right way

In [None]:
df.at[0, 'colA'] = 999
# or df.loc[0, 'colA'] = 999

In [None]:
df.head()
# and 'colA' at index 0 has been updated
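
The accessors that take the row and the column in a single indexing call all write straight to the dataframe. A minimal sketch on a throwaway dataframe named `demo` (not the tutorial's `df`):

```python
import pandas as pd

demo = pd.DataFrame({'colA': [1, 2, 3], 'colB': [4, 5, 6]})

demo.at[0, 'colA'] = 999            # single cell, by row label and column name
demo.loc[1, ['colA', 'colB']] = 0   # several cells in one row, by label
demo.iat[2, 1] = -1                 # single cell, by integer position (row 2, column 1)

print(demo)
```

`.at` and `.iat` only address one cell at a time, while `.loc` also accepts lists, slices, and boolean masks.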

In [None]:
# let's try slicing the dataframe and then modifying it
df_slice = df[df['colA'] < 10] # select entries with colA < 10

In [None]:
df_slice.head()

In [None]:
len(df_slice)

In [None]:
# we are left with 9 entries
# let's modify df_slice at index 12
df_slice.at[12, 'colA'] = 123

In [None]:
df_slice.head()

In [None]:
# Okay, that was easy
# But has our original df been modified?
df.at[12, 'colA']


In [None]:
# Nope. Keep this in mind when modifying slices: a boolean-mask selection like df_slice is a copy, so the original dataframe is not modified.
# To modify the original dataframe, you must operate on it directly
# (This will come in handy when coalescing your dataset, where you will have to operate on slices of unique nodes)
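
One way to operate on the original directly is to pass the boolean mask to `.loc` instead of editing the slice. The sketch below uses a small stand-in dataframe named `demo`; the explicit `.copy()` makes it clear that the subset is independent of the original.

```python
import pandas as pd

demo = pd.DataFrame({'colA': [5, 50, 7, 80]})

mask = demo['colA'] < 10           # boolean Series, True where colA < 10
subset = demo[mask].copy()         # explicit copy: edits here stay local
subset.loc[:, 'colA'] = 123

unchanged = demo['colA'].tolist()  # the original is untouched by the edit above

demo.loc[mask, 'colA'] = 123       # assign through .loc on the original instead

print(unchanged)
print(demo['colA'].tolist())
```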

### Parallelizing your code

In [None]:
# Let's say you want to call a certain function on a set of different inputs
# (Once again, this will be useful in coalescing)


# generate a set of dummy inputs
inputs = [random.randint(1, 100) for i in range(100)]
print(inputs[:10])

In [None]:
# Okay, now let's define a function that we will apply to our set of inputs
def square(i):
    """
    @param i - input
    @return - square of i
    """
    return i**2

In [None]:
# we could just use a for loop here
for i in inputs:
    print(square(i))

In [None]:
# or we could do this in parallel

In [None]:
from multiprocessing.pool import ThreadPool
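# create a pool with 8 threads (increase or reduce this based on the number of threads your processor supports)
# note: in CPython, threads share the global interpreter lock, so the speedup is largest for I/O or NumPy-heavy work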
pool = ThreadPool(8) 
poolresults = pool.map(square, inputs) # map the square function to inputs
pool.close()
pool.join()

# poolresults contains our returned values
poolresults
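
As an alternative to calling close/join by hand, the pool can be used as a context manager. This is a minimal sketch with a smaller input list and an arbitrary pool size of 4; it also checks that `map` returns results in the same order as the inputs.

```python
from multiprocessing.pool import ThreadPool

def square(i):
    """Return the square of i."""
    return i ** 2

inputs = list(range(10))

# the with-block shuts the pool down on exit; map has already returned by then
with ThreadPool(4) as pool:
    results = pool.map(square, inputs)

# map preserves input order, so this matches a plain loop
print(results == [square(i) for i in inputs])
```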

We strongly recommend using the ThreadPool approach to coalesce your data.

* Create a list of dataframe slices, each slice containing entries for a unique node.

* Next, define your Sliding Window Algorithm function such that it returns the number of tuples after coalescing its slice. You should also mark any rows that you will eventually filter out.

* Then, create a ThreadPool and map your algorithm function to the list of dataframe slices.

* Run the pool with close/join. The return value of the ThreadPool map gives you the tuple counts; sum this list to get the data point for your knee curve.
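
The steps above can be sketched roughly as follows. Everything specific here is an assumption for illustration: the column names `node` and `time`, the sample rows, the window of 10, and the function `coalesce`, which uses a simple gap-based rule as a stand-in for your own sliding window algorithm.

```python
from multiprocessing.pool import ThreadPool
import pandas as pd

# invented events; real data will have different columns and units
events = pd.DataFrame({
    'node': ['a', 'a', 'a', 'b', 'b'],
    'time': [0, 5, 40, 3, 100],
})

def coalesce(node_df, window=10):
    """Mark rows within `window` of the previous row; return the tuple count.

    Stand-in rule, not the assignment's algorithm. Expects rows sorted by time.
    """
    gaps = node_df['time'].diff()          # NaN for the first row of the slice
    node_df['coalesced'] = gaps <= window  # NaN compares as False, so the first row starts a tuple
    return int((~node_df['coalesced']).sum())

# one independent copy per node, sorted by time
slices = [group.copy() for _, group in events.sort_values('time').groupby('node')]

with ThreadPool(4) as pool:
    counts = pool.map(coalesce, slices)

total_tuples = sum(counts)  # one data point for the knee curve
print(counts, total_tuples)
```

Because each slice is its own copy, the `coalesced` marks live on the slices rather than on `events`; `pd.concat(slices)` would bring them back together for filtering.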