# Python Week 4: NumPy and Pandas

This week we take a deep dive into NumPy and Pandas. These two libraries are among the most widely used Python libraries, and for good reason: they provide fast, reliable, and readable classes and methods for numerical computing, data analysis, and machine learning. Today we will look at some of their more advanced features.

In [None]:
# like always we start by importing the necessary libraries
import numpy as np
from numpy import random
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib notebook

## NumPy

In [None]:
# create a testing array for the lesson, 10x10 normally distributed
test_array = random.randn(10,10)
print(test_array)

### Review and Common Operations

In [None]:
# review from week two
# remember that to reach a single element of a NumPy array we need one index per dimension
# (fewer indices return a sub-array, e.g. test_array[1] is the whole second row)
# also, NumPy uses row, column indexing like in linear algebra

print(test_array[1][1]) # second row, second column
print(test_array[-1][-1]) # negative indices count from the end of the array
print(test_array[-1,-1]) # the same element using a single pair of brackets
# a new variant on this is fancy indexing, where we pass arrays (or lists) of indices;
# the two lists are paired up, so this grabs elements (1,9), (2,8), and (3,7)
rows = [1,2,3]
cols = [9,8,7]
print(test_array[rows,cols])
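
Fancy indexing is easier to follow on an array whose values are predictable. The sketch below uses a small hand-built array, `grid`, which is an assumption for illustration and not part of the lesson data. It shows that index lists are paired element by element, and that `np.ix_` is one way to ask for every combination of rows and columns instead.

```python
import numpy as np

# hypothetical 3x4 array with values 0..11 so results are easy to check by eye
grid = np.arange(12).reshape(3, 4)

rows = [0, 1, 2]
cols = [3, 2, 1]

# the lists are paired up: (0,3), (1,2), (2,1)
picked = grid[rows, cols]
print(picked)  # [3 6 9]

# np.ix_ builds the cross product: rows 0 and 2, columns 1 and 3
block = grid[np.ix_([0, 2], [1, 3])]
print(block)   # [[ 1  3]
               #  [ 9 11]]
```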

In [None]:
# indexing in NumPy accesses the array itself rather than a copy, thus a statement like:
test_array[-1,-1] = -5
# changes the actual array
test_array[-1,-1]
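
The same idea extends to slices: a basic slice is a view onto the original memory, while fancy indexing and `.copy()` give independent arrays. A minimal sketch, assuming a small zero-filled array `base` made up for this example:

```python
import numpy as np

base = np.zeros((3, 3))

row_view = base[0, :]          # basic slicing returns a view of the same memory
row_view[:] = 1.0              # writing through the view changes `base`

row_copy = base[1, :].copy()   # an explicit copy is independent
row_copy[:] = 9.0              # `base` is not affected

fancy = base[[2], :]           # fancy indexing returns a copy, not a view
fancy[:] = 7.0                 # `base` is not affected either

print(base)
print(np.shares_memory(base, row_view))  # True
print(np.shares_memory(base, row_copy))  # False
```

If you want to modify a piece of an array without touching the original, call `.copy()` first.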

In [None]:
# we should also review slicing
# if indexing is like a specific coordinate, slicing is like grabbing a specific line of latitude or longitude
# remember, : means all

print(test_array[0,:]) # grab all columns in the first row
print(test_array[:,0]) # grab all rows in the first column
print(test_array[:5]) # grab the first five rows (rows 0 through 4)
print(test_array[5:]) # grab all rows from row 5 to the end
print(test_array[0,5:]) # grab the last 5 entries in the first row

In [None]:
# here are some common attributes of NumPy arrays that you should know how to use
# we can use size to determine the number of entries
print(test_array.size)

# we can use shape to determine the dimensions
print(test_array.shape)

# we can use ndim to determine, well, the number of dimensions
print(test_array.ndim)

# we can also check the datatype
print(test_array.dtype)


In [None]:
# some more useful methods/functions
# we can transpose an array 
print(test_array.transpose())

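# we can sort our arrays; by default np.sort works along the last axis, so each row is sorted separately
print(np.sort(test_array))

# argsort returns the indices that would sort the array (again along the last axis)
print(np.argsort(test_array))


# in-place sorting permanently changes the order of the array, so beware!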
test_array.sort() 
test_array
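
Two details of sorting are worth seeing on small inputs: the `axis` argument controls the direction of the sort, and the output of `argsort` can be used to reorder a second array. The arrays `scores`, `names`, and `values` below are made up for illustration.

```python
import numpy as np

scores = np.array([[3, 1, 2],
                   [9, 7, 8]])

by_row = np.sort(scores)              # default axis=-1: each row sorted
by_col = np.sort(scores, axis=0)      # each column sorted
flat = np.sort(scores, axis=None)     # flattened, then sorted
print(by_row)   # [[1 2 3]
                #  [7 8 9]]
print(flat)     # [1 2 3 7 8 9]

# argsort gives the ordering, which we can apply to another array
names = np.array(['c', 'a', 'b'])
values = np.array([30, 10, 20])
order = np.argsort(values)            # [1 2 0]
print(names[order])                   # ['a' 'b' 'c']
```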


In [None]:
# we should also look at splitting and concatenating arrays
# create a new test array
test_array1 = random.randn(2,2)
test_array2 = random.randn(2,2)

# let's say we need to combine two arrays
test_array_1_2 = np.concatenate([test_array1, test_array2]) # with no axis argument, concatenate joins along axis 0 (rows)
print(test_array_1_2)
# in my opinion hstack and vstack are clearer because the direction is stated in the name
# stack horizontally
test_array_1_2_h = np.hstack([test_array1, test_array2])
print(test_array_1_2_h)
# stack vertically
test_array_1_2_v = np.vstack([test_array1, test_array2])
print(test_array_1_2_v)

# notice that concatenate, with its default axis=0, produces the same result as vstack
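
For 2-D arrays, `np.concatenate` can reproduce both stacking functions once the `axis` argument is given explicitly. A short sketch with two hand-written 2x2 arrays (`a` and `b` are illustrative names):

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
b = np.array([[5, 6],
              [7, 8]])

stacked_rows = np.concatenate([a, b], axis=0)  # shape (4, 2), same as vstack
stacked_cols = np.concatenate([a, b], axis=1)  # shape (2, 4), same as hstack

print(stacked_rows.shape, stacked_cols.shape)
print(np.array_equal(stacked_rows, np.vstack([a, b])))  # True
print(np.array_equal(stacked_cols, np.hstack([a, b])))  # True
```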

In [None]:
# splitting is the inverse operation
# biblically inspired programmer humour
the_red_sea = random.randn(10,10)

the_red_sea_l, the_red_sea_r = np.split(the_red_sea, 2) # default is axis=0: the first half holds rows 0-4, the second rows 5-9
print(the_red_sea_l)

# fruit inspired humour

banana = random.randn(10,10)

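top_banana, bottom_banana = np.vsplit(banana, 2) # vsplit cuts along the rows: top and bottom halves
print(top_banana)

left_banana, right_banana = np.hsplit(banana, 2) # hsplit cuts along the columns: left and right halves
print(left_banana)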

# we can also split arrays at specific points in an array
# all out of bad jokes

my_array = random.randn(10,10)
w, x, y  = np.split(my_array,[3,6]) # split before rows 3 and 6: rows 0-2, rows 3-5, rows 6-9
print(w,x,y)
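
To cut along columns rather than rows, pass `axis=1`. The sketch below uses a hypothetical 4x5 array, `table`, and also shows `np.array_split`, which accepts a number of pieces that does not divide the axis evenly.

```python
import numpy as np

table = np.arange(20).reshape(4, 5)

# split before columns 1 and 3: widths 1, 2, and 2
left, middle, right = np.split(table, [1, 3], axis=1)
print(left.shape, middle.shape, right.shape)   # (4, 1) (4, 2) (4, 2)

# two equal halves along the rows
top, bottom = np.split(table, 2, axis=0)
print(top.shape, bottom.shape)                 # (2, 5) (2, 5)

# np.split raises an error for uneven pieces; array_split does not
parts = np.array_split(table, 3, axis=1)
print([p.shape for p in parts])                # [(4, 2), (4, 2), (4, 1)]
```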

In [None]:
# we should now talk about universal functions (ufuncs), which we use when we want to apply operations
# element-wise in NumPy
# you have probably already used them: the arithmetic operators on arrays are wrappers around these functions
# for example:
test_array3 = random.randn(10,10)
print(np.array_equal(test_array + test_array3, np.add(test_array3, test_array))) # True

# the same holds for the other arithmetic operators (np.subtract, np.multiply, np.divide, ...), so I won't
# go through each one; consult the documentation if you need help
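
Beyond mirroring the operators, ufuncs broadcast scalars and come with methods such as `reduce` and `accumulate`. A small sketch on hand-written arrays (`x` and `y` are illustrative):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([10.0, 20.0, 30.0])

# the operator and the ufunc give identical results
same = np.array_equal(x + y, np.add(x, y))
print(same)                       # True

# a scalar is broadcast to every element
scaled = np.multiply(x, 2)
print(scaled)                     # [2. 4. 6.]

# ufunc methods: reduce collapses an axis, accumulate keeps running results
total = np.add.reduce(y)          # same as y.sum()
running = np.add.accumulate(x)    # same as x.cumsum()
print(total, running)             # 60.0 [1. 3. 6.]
```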


In [None]:
# perhaps something more useful is np.where
# where checks a condition element-wise, i.e. for every element, but it is much faster than a for loop,
# and it returns a new array built from the values we choose for the True and False cases

# let's create an imaginary situation where we need to filter some imaginary Pearson correlations by significance

# lets create a supposed matrix of correlation coefficients
corr = random.rand(10,10)
# now lets create a supposed array that contains the p-value of those coefficients
ps =  abs(random.normal(0.05, 0.5, [10,10]))

# first let's eliminate all correlations that are not significant
# basically, find all values in ps where the entry is greater than 0.05, find that same location in corr,
# replace with zero, and keep the original correlation everywhere else
sig_corr = np.where(ps > 0.05, 0, corr)
print(sig_corr)
# another similar operation is extract, which grabs values based on boolean logic
# here it returns a 1-D array holding only the significant correlations
sig_corr1 = np.extract(ps <= 0.05, corr)
print(sig_corr1)
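
Because the arrays above are random, it can help to check the masking logic on a tiny example whose answer is known in advance. The 2x2 arrays `corr_demo` and `p_demo` below are invented for this purpose. The sketch also shows that a boolean mask can index directly, and that `np.where` with only a condition returns the positions where it holds.

```python
import numpy as np

corr_demo = np.array([[0.9, 0.2],
                      [0.5, 0.7]])
p_demo = np.array([[0.01, 0.30],
                   [0.04, 0.20]])

significant = p_demo <= 0.05                     # boolean mask

# keep significant correlations, zero out the rest
masked = np.where(significant, corr_demo, 0.0)
print(masked)        # [[0.9 0. ]
                     #  [0.5 0. ]]

# boolean indexing returns the selected values as a 1-D array
kept = corr_demo[significant]
print(kept)          # [0.9 0.5]

# with only a condition, np.where returns the row and column indices
rows, cols = np.where(significant)
print(rows, cols)    # [0 1] [0 0]
```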

In [None]:
# another very useful thing is the ability to wrap our own Python functions so they accept arrays
# be wary: np.vectorize is a convenience, not a speed-up; internally it is essentially a for loop, so it
# will almost never be faster than a vector function written in C/C++

# let's create a function here
def my_func(a, b):
    if a != b:  # compare values; `is not` would compare object identity instead
        return a
    else:
        return b**3 + a

# we can then wrap it with the vectorize class in NumPy
vectorized_func = np.vectorize(my_func)
# we can now run our newly vectorized function; obviously for a case like this we are probably going overkill
# but you can use your imagination
# also for the adventurous, you can make your own vectorized functions in C/C++ and port them to Python
# that's however outside the scope of these lessons
vectorized_func(np.array([1,2,3,4]),2)
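
When the branching logic is simple, the same result can often be written with `np.where`, which stays in compiled code. A sketch under that assumption, using a stand-in function `pick` (a hypothetical name) that keeps `a` unless it equals `b`:

```python
import numpy as np

def pick(a, b):
    # scalar version: keep a unless it equals b
    return a if a != b else b**3 + a

a = np.array([1, 2, 3, 4])
b = 2

via_vectorize = np.vectorize(pick)(a, b)
via_where = np.where(a != b, a, b**3 + a)

print(via_vectorize)                            # [ 1 10  3  4]
print(np.array_equal(via_vectorize, via_where)) # True
```

Note that `np.where` evaluates both branches for every element, so it only suits cases where computing the unused branch is cheap and safe.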


In [None]:
# numpy also has an excellent convolve function, for when we need to combine two functions like during signal
# processing
plt.style.use('ggplot')
# let's define a time axis
time = np.linspace(0, 60, 60)
response = random.randn(60)

# let's define a kernel (a 5-point moving average)
kernel_size = 5
kernel = np.ones(kernel_size) / kernel_size

# convolve the response with the kernel; mode='same' keeps the output the same length as the input
convolve_r = np.convolve(response, kernel, mode='same')

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.plot(time, response, color='blue')
ax.plot(time, convolve_r, color='orange')
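
To see what a moving-average kernel does without random noise in the way, here is a sketch on a short hand-written signal. The names `signal` and `window` are illustrative. The `mode` argument decides how the edges are handled.

```python
import numpy as np

signal = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
window = np.ones(3) / 3                 # 3-point moving average

# 'valid' keeps only positions where the window fully overlaps the signal
smooth_valid = np.convolve(signal, window, mode='valid')
print(smooth_valid.round(3))            # [2. 3. 4. 5.]

# 'same' pads with zeros so the output length matches the input;
# the first and last values are pulled toward zero as a result
smooth_same = np.convolve(signal, window, mode='same')
print(len(smooth_same))                 # 6
```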


In [None]:
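# you also have no excuse for not knowing where to find documentation, thanks to np.lookfor
# numpy comes with a literal search function!
# note: np.lookfor was removed in NumPy 2.0; on newer versions use np.info(np.fft.fft) or the online docs
np.lookfor('discrete fourier')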

In all honesty, this only scratches the surface of NumPy. There are so many applications in signal processing, linear algebra, and beyond that I couldn't possibly show you everything you need to know. If you have a specific, numerically oriented use case, however, I would always suggest checking whether NumPy already has a solution.

## Pandas

As a quick note, many of the functions in Pandas have counterparts in NumPy, so I'm not going to spend time going over them. If you need to concatenate or split DataFrames, please consult the documentation!

In [None]:
# import this test dataset that I downloaded from Kaggle
# it has the relative expression of several proteins in mice that are either WT or Down syndrome models,
# whether they received a pharmacological agent, and whether they were fear conditioned or not

mouse_protein = pd.read_csv('Data_Cortex_Nuclear.csv')
print(mouse_protein.head()) # print first 5 rows
print(mouse_protein.tail()) # print last 5 rows

In [None]:
# let's take a quick look at all of the columns, their datatypes, and general characteristics
print(mouse_protein.columns)
print(mouse_protein.dtypes)
print(mouse_protein.describe(include='all'))

In [None]:
# I won't talk about concatenation since, as I stated, it is similar to the NumPy methods; the one advantage of
# Pandas however is that it is intelligent in how it parses the column and row labels, meaning you can easily
# concatenate multiple spreadsheets with the same index labels into one large document

# but we can talk about the merge feature
# let's say you have new data that you want to add to an existing DF
# for example, let's say we measured one last protein

BDNF = pd.DataFrame({'BDNF':random.rand(len(mouse_protein)),'MouseID':mouse_protein.MouseID})
print(BDNF.head())
mouse_protein = mouse_protein.merge(BDNF, how='inner', on='MouseID') # match rows on the shared MouseID column
mouse_protein

In [None]:
# I want to explain how this is different from concatenating
# create different dataframes
df1 = pd.DataFrame({'Key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df2 = pd.DataFrame({'Key': ['a', 'b', 'd'], 'data2': range(3)})

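print(pd.merge(df1, df2)) # inner join on the shared 'Key' column: only keys found in both frames ('a' and 'b') survive
print(pd.concat([df1, df2])) # stacks the rows of df2 under df1 and fills the missing columns with NaN

# be wary how you use these: they seem similar, but they answer different questions and rarely give the same result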
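
The `how` argument of `merge` controls which keys survive. A minimal sketch on two small frames (`left` and `right`, invented for this example) comparing inner, outer, and left joins:

```python
import pandas as pd

left = pd.DataFrame({'Key': ['a', 'b', 'c'], 'data1': [1, 2, 3]})
right = pd.DataFrame({'Key': ['a', 'b', 'd'], 'data2': [10, 20, 30]})

# inner: only keys present in both frames
inner = pd.merge(left, right, on='Key', how='inner')

# outer: every key from either frame; indicator=True adds a _merge column
# recording where each row came from
outer = pd.merge(left, right, on='Key', how='outer', indicator=True)

# left: every key from the left frame, with NaN where the right has no match
left_join = pd.merge(left, right, on='Key', how='left')

print(inner)
print(outer)
print(left_join)
```

Passing `on=` explicitly is a good habit, since otherwise `merge` joins on every column name the two frames share.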

In [None]:
# another advantage of pandas is its ability to easily aggregate data
# let's take a look at some of those methods

# one way is groupby, which allows us to basically group some labels, apply a function, then get a summary
# let's say I want the mean value for each distinct group in this experiment
# numeric_only=True skips text columns such as MouseID, which newer pandas versions refuse to average
print(mouse_protein.groupby(['Treatment', 'Genotype', 'Behavior'], as_index=False).mean(numeric_only=True))

# another option is a pivot table; I tend to find this less useful, but it does have its uses
# it computes the same group means, just arranged with Behavior spread across the columns
numeric_cols = mouse_protein.select_dtypes('number').columns.tolist()
print(pd.pivot_table(mouse_protein, values=numeric_cols, index=['Treatment','Genotype'], columns=['Behavior'], aggfunc='mean'))
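
The relationship between `groupby` and `pivot_table` is easier to see on a frame small enough to check by hand. The frame `toy` below, with its labels and values, is made up for illustration and is not taken from the mouse dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    'Genotype': ['Control', 'Control', 'Control', 'Ts65Dn', 'Ts65Dn', 'Ts65Dn'],
    'Behavior': ['C/S', 'C/S', 'S/C', 'C/S', 'S/C', 'S/C'],
    'protein':  [1.0, 3.0, 5.0, 2.0, 4.0, 6.0],
})

# long format: one row per (Genotype, Behavior) group
long_means = toy.groupby(['Genotype', 'Behavior'])['protein'].mean()
print(long_means)

# wide format: the same means, with Behavior spread across the columns
wide_means = toy.pivot_table(index='Genotype', columns='Behavior',
                             values='protein', aggfunc='mean')
print(wide_means)

# several summaries at once
summary = toy.groupby('Genotype')['protein'].agg(['mean', 'max'])
print(summary)
```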



In [None]:
# one of the real advantages of pandas however is method chaining
# method chaining is the application of multiple operations condensed into a single statement
# in this single line I filter to the C/S behavior group and get the mean and median of SYP_N
print(mouse_protein[mouse_protein.Behavior.eq('C/S')].agg({'SYP_N': ["mean", "median"]}))
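
Longer chains are usually easier to read when wrapped in parentheses with one step per line. A sketch of that layout, assuming a small invented frame `toy` rather than the mouse dataset:

```python
import pandas as pd

toy = pd.DataFrame({
    'Genotype': ['Control', 'Control', 'Ts65Dn', 'Ts65Dn', 'Control'],
    'Behavior': ['C/S', 'C/S', 'C/S', 'S/C', 'S/C'],
    'protein':  [1.0, 3.0, 2.0, 4.0, 5.0],
})

summary = (
    toy
    .loc[lambda d: d['Behavior'] == 'C/S']   # keep only the C/S rows
    .groupby('Genotype')['protein']          # group the remaining rows
    .agg(['mean', 'median'])                 # two summaries per group
    .round(2)                                # tidy the output
)
print(summary)
```

Each step receives the result of the previous one, so no intermediate variables are needed.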

In [None]:
# pandas also has a lot of plotting features to make quick plots of descriptive statistics
# as a note these aren't intended for publication
pd.plotting.boxplot(mouse_protein)
plt.show()

As a final note, this isn't an end-all, be-all guide. I tried to include the tools that would be most useful to you and generalizable to many applications. Some of these examples don't even paint the full picture of that specific function or method, so I believe it is imperative that you familiarize yourselves with the Pandas and NumPy documentation.