# Numpy & Pandas
In this notebook, we'll encounter two packages for scientific computing in Python: NumPy and Pandas.

**At the end of this notebook, you'll be able to:**
* Install and import packages for Python
* Create NumPy arrays
* Execute methods & access attributes of arrays
* Create & manipulate Pandas dataframes


## Importing packages

Before we can use numpy or pandas, we need to import them. We can also nickname the modules when we import them.

The convention is to import `numpy` as `np` and `pandas` as `pd`.

In [1]:
# Import packages
import numpy as np
import pandas as pd

# Use the %whos 'magic command' to list the names (here, modules) currently defined
%whos

Variable   Type      Data/Info
------------------------------
np         module    <module 'numpy' from '/op<...>kages/numpy/__init__.py'>
pd         module    <module 'pandas' from '/o<...>ages/pandas/__init__.py'>


## Numpy

**Numpy** is the fundamental package for scientific computing with Python. It'll allow us to work with bigger datasets more efficiently.

### Creating `numpy` arrays

A numpy **array** is a grid of values which are all the same type (they’re homogeneous).

We can create a numpy array in a few different ways:

* from Python lists or tuples
* by using functions that are dedicated to generating numpy arrays, such as `arange`, `linspace`, `empty`, `zeros`, etc.
* reading data from files
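
As a rough sketch of the first two options, the cell below builds arrays from a tuple and from a few generator functions. The variable names are made up for illustration.

```python
import numpy as np

# From a tuple (a list works the same way)
from_tuple = np.array((1, 2, 3))

# From dedicated array-generating functions
counting = np.arange(0, 10, 2)   # start, stop (excluded), step
zeros = np.zeros((2, 3))         # 2 rows x 3 columns, filled with 0.0
uninit = np.empty(4)             # 4 slots holding arbitrary, uninitialized values

print(from_tuple)
print(counting)
print(zeros.shape)
print(uninit.shape)
```

Note that `np.empty` only reserves memory, so its contents should be overwritten before they are used.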

In [2]:
# Create a list
lst = [1,2,3,4,5]

# Make our list into an array
my_vector = np.array(lst)
type(my_vector)

numpy.ndarray

In [19]:
# If we give numpy a list of lists, it will create a matrix
my_matrix = np.array([lst,lst])
print(my_matrix)

[[1 2 3 4 5]
 [1 2 3 4 5]]


### Accessing attributes of numpy arrays
We can check an array's shape and size either by looking at its `shape` and `size` attributes, or by using the `np.shape()` and `np.size()` functions.

Other attributes that might be of interest are `ndim` and `dtype`.
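
To make the two routes concrete, here is a small sketch on a made-up array called `example`. It shows that the attributes and the module-level functions report the same information.

```python
import numpy as np

example = np.array([[1, 2, 3], [4, 5, 6]])

# Attribute access (no parentheses)
print(example.shape)
print(example.size)

# Equivalent module-level functions
print(np.shape(example))
print(np.size(example))

# Number of dimensions and element type
print(example.ndim)
print(example.dtype)
```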

In [7]:
# Check the dimensions of vector
print(my_vector.ndim)
print(my_vector.shape)
print(my_vector.size)

# Check the dimensions of matrix
print(my_matrix.ndim)
print(my_matrix.shape)
print(my_matrix.size)

1
(5,)
5
2
(2, 5)
10


Array data type is decided upon creation of the array.

You can set the data type explicitly by passing `dtype=` to `np.array()`. Options include `int`, `float`, `complex`, `bool`, `object`, etc.
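
A short sketch of how the data type gets decided, using illustrative variable names: numpy infers a type from the values, a single float promotes the whole array, and `astype` makes a converted copy.

```python
import numpy as np

ints = np.array([1, 2, 3])                    # all integers -> an integer dtype
floats = np.array([1, 2.5, 3])                # one float promotes every element
as_float = np.array([1, 2, 3], dtype=float)   # dtype set explicitly at creation
converted = ints.astype(float)                # new array; ints itself is unchanged

print(ints.dtype)
print(floats.dtype)
print(as_float.dtype)
print(converted.dtype)
```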

In [16]:
my_matrix.dtype 

my_complex_array = np.array([lst,lst],dtype='complex')
my_complex_array

array([[1.+0.j, 2.+0.j, 3.+0.j, 4.+0.j, 5.+0.j],
       [1.+0.j, 2.+0.j, 3.+0.j, 4.+0.j, 5.+0.j]])

<div class="alert alert-success">

**Task**: Create an array of booleans called `bool_array` that is 2 rows x 3 columns. Access the `shape` and `ndim` attributes to confirm its size, and the `dtype` attribute to confirm that it is boolean.

</div>

In [11]:
a1 = [True, False, False]
a2 = [False, True, True]
bool_array = np.array([a1, a2], dtype = 'bool')
print(bool_array.shape)
print(bool_array.ndim)
print(bool_array.dtype)

(2, 3)
2
bool


### Indexing & slicing arrays

Indexing and slicing 1D arrays (vectors) is similar to indexing lists.

You can index matrices using `[row,column]`. If you omit the column, it will give you the whole row.

If you use `:` for either row or column, it will give you all of those values.
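
Here is a minimal sketch of these indexing rules on a made-up 3 x 4 array called `grid`:

```python
import numpy as np

grid = np.array([[10, 11, 12, 13],
                 [20, 21, 22, 23],
                 [30, 31, 32, 33]])

single = grid[1, 2]       # row 1, column 2
whole_row = grid[1]       # omitting the column is the same as grid[1, :]
whole_col = grid[:, 0]    # every row, column 0
block = grid[0:2, 1:3]    # rows 0-1, columns 1-2 (stop index excluded)

print(single)
print(whole_row)
print(whole_col)
print(block)
```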

In [29]:
# Note: the 8 in the first row was set by the assignment cell shown later in this notebook
print(my_matrix)
my_matrix[1,:]

[[1 2 8 4 5]
 [1 2 3 4 5]]


array([1, 2, 3, 4, 5])

We can also index arrays using Booleans or lists of indices. Indexing with Booleans acts as a *filter* on the array. For example:

In [30]:
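# Index with a Boolean mask: keep only the values greater than 2
bool_matrix = my_matrix[my_matrix>2]

# Index with lists of coordinates: rows [0,1] paired with columns [1,3],
# i.e. the elements at (0,1) and (1,3)
list_matrix = my_matrix[[0,1],[1,3]]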

print(my_matrix)
print(bool_matrix)
print(list_matrix)

[[1 2 8 4 5]
 [1 2 3 4 5]]
[8 4 5 3 4 5]
[2 4]
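
To see what happens under the hood, the sketch below rebuilds the same values in a stand-alone array called `demo` (a name chosen for illustration) and inspects the intermediate Boolean mask.

```python
import numpy as np

demo = np.array([[1, 2, 8, 4, 5],
                 [1, 2, 3, 4, 5]])

mask = demo > 2           # Boolean array with the same shape as demo
filtered = demo[mask]     # 1D array of the values where mask is True

# Row indices [0, 1] are paired with column indices [1, 3],
# selecting the elements at (0, 1) and (1, 3)
paired = demo[[0, 1], [1, 3]]

print(mask)
print(filtered)
print(paired)
```

Filtering with a mask always returns a flat, one-dimensional array, because the selected values generally no longer form a rectangle.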


<div class="alert alert-success">

**Task**: Filter the `bool_array` you created above to create a `true_array` that only contains True values.

</div>

In [15]:
true_array = bool_array[bool_array == True]
print(true_array)

[ True  True  True]


We can also change values in an array similar to how we would change values in a list.

In [20]:
print(my_matrix)
my_matrix[0,2] = 8
print(my_matrix)

[[1 2 3 4 5]
 [1 2 3 4 5]]
[[1 2 8 4 5]
 [1 2 3 4 5]]


### Benefits of using arrays
In addition to being less clunky & a bit faster than lists of lists, arrays can do a lot of things that lists can't. For example, we can add and multiply them element by element, whereas `+` on two lists concatenates them. We can also use the `sum` method to sum across a specific axis.

In [21]:
sum_list = [1,3,5] + [3,5,7]
sum_array = np.array([1,3,5]) + np.array([3,5,7])

print(sum_list)
print(sum_array)

[1, 3, 5, 3, 5, 7]
[ 4  8 12]


In [26]:
this_array = np.array([[1,3,5,7],[3,5,7,9]])
# axis=1 sums across the columns, giving one total per row
sum_rows = this_array.sum(axis=1)
print(this_array)
print(sum_rows)

[[1 3 5 7]
 [3 5 7 9]]
[16 24]
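
The `axis` argument is easy to mix up, so here is a small sketch comparing both directions on a made-up array called `table`. `axis=0` collapses the rows (one result per column), while `axis=1` collapses the columns (one result per row).

```python
import numpy as np

table = np.array([[1, 3, 5, 7],
                  [3, 5, 7, 9]])

col_sums = table.sum(axis=0)   # one sum per column
row_sums = table.sum(axis=1)   # one sum per row
total = table.sum()            # no axis: sum of every element
doubled = table * 2            # arithmetic is applied element by element

print(col_sums)
print(row_sums)
print(total)
print(doubled)
```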


### Numpy also includes some very useful array generating functions:

* `arange`: like `range`, but gives you a numpy array instead of an iterator, and can use more than just integers
* `linspace`: creates an array with given start and end points and a desired number of points
* `logspace`: same as `linspace`, but the points are evenly spaced on a log scale
* `random`: a submodule that can create arrays of random numbers (there are <a href="https://docs.scipy.org/doc/numpy-1.14.0/reference/routines.random.html">many different ways to use this</a>)
* `concatenate`: joins two or more arrays along an existing axis [<a href="https://docs.scipy.org/doc/numpy/reference/generated/numpy.concatenate.html">documentation</a>]
* `hstack` and `vstack`: stack arrays horizontally or vertically

Whenever we call these, we need to use whatever name we imported numpy as (here, `np`).
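
A quick sketch of several of these functions follows. The variable names are illustrative, and `np.random.default_rng` is just one of several ways to draw random numbers (it assumes NumPy 1.17 or newer).

```python
import numpy as np

steps = np.arange(0, 1, 0.25)     # non-integer step; the stop value is excluded
decades = np.logspace(0, 3, 4)    # 4 points from 10**0 to 10**3

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
joined = np.concatenate([a, b])   # one longer 1D array
side_by_side = np.hstack([a, b])  # same result as concatenate for 1D input
stacked = np.vstack([a, b])       # each array becomes a row

rng = np.random.default_rng(seed=0)   # seeded so the draw is repeatable
noise = rng.random(5)                 # 5 floats drawn from [0, 1)

print(steps)
print(decades)
print(joined)
print(stacked)
print(noise.shape)
```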

In [31]:
# When using linspace, both end points are included!
np.linspace(0,147,10)

array([  0.        ,  16.33333333,  32.66666667,  49.        ,
        65.33333333,  81.66666667,  98.        , 114.33333333,
       130.66666667, 147.        ])

<div class="alert alert-success">

**Task**: Create an array called `big_array` that has two rows. The first row should be a list of 10 numbers that are evenly spaced, and range from exactly 1 to 100. The second row should be a list of 10 numbers that begin at 0 and are exactly 10 apart. `big_array` should have a shape (2,10): two rows, and ten columns. Lastly, reassign the last value of each row in the array to be -100. 

*Hint*: Create your two arrays, and then use the `np.vstack` function to stack them.

</div>

In [41]:
big_array = np.array([np.linspace(1,100,10), np.arange(0,100,10)])
print(big_array)
big_array.shape
big_array[:,-1] = -100
print(big_array)


[[  1.  12.  23.  34.  45.  56.  67.  78.  89. 100.]
 [  0.  10.  20.  30.  40.  50.  60.  70.  80.  90.]]
[[   1.   12.   23.   34.   45.   56.   67.   78.   89. -100.]
 [   0.   10.   20.   30.   40.   50.   60.   70.   80. -100.]]


Numpy also has built-in functions to save and load arrays: `np.save()` and `np.load()`. Numpy files have a `.npy` extension.

See full documentation <a href="https://docs.scipy.org/doc/numpy/reference/generated/numpy.save.html">here</a>.

In [None]:
# np.save takes the file name, then the array; '.npy' is appended if it is missing
np.save('matrix',my_matrix)

In [None]:
np.load('matrix.npy')
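
To check that saving and loading really round-trips, the sketch below writes a small array to a temporary folder and reads it back. The file name `demo_matrix` is hypothetical, and the temporary folder is only used so that nothing is left behind on disk.

```python
import os
import tempfile

import numpy as np

original = np.array([[1, 2, 3], [4, 5, 6]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_matrix')   # hypothetical file name
    np.save(path, original)                       # written as demo_matrix.npy
    restored = np.load(path + '.npy')

# The loaded array has the same values and dtype as the one we saved
print(np.array_equal(original, restored))
print(restored.dtype == original.dtype)
```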

## Pandas

Pandas is a useful module that creates **data frames** (think of these like Excel spreadsheets, but much faster!). 

We can think of Pandas as "numpy with labels".


### Benefits of Pandas
* Great for real-world, heterogeneous data
* Similar to Excel spreadsheets (but way faster!)
* Smartly deals with missing data
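
As a toy illustration of "numpy with labels" (the gene and area names below are made up), we can wrap a numpy array in a dataframe, add row and column labels, and see how a missing value is handled.

```python
import numpy as np
import pandas as pd

values = np.array([[0.5, -1.2],
                   [1.1, 0.3],
                   [-0.7, 2.0]])

# Hypothetical labels, for illustration only
toy_df = pd.DataFrame(values,
                      index=['gene_a', 'gene_b', 'gene_c'],
                      columns=['area_1', 'area_2'])

# Mark one entry as missing
toy_df.loc['gene_b', 'area_1'] = np.nan

print(toy_df)
print(toy_df.mean())   # column means; missing values are skipped by default
```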

We can work with the gene expression data from a2 as a Pandas dataframe!


In [45]:
# Import necessary packages
import pandas as pd

# Read the tab-separated file into a data frame, using gene symbols as row labels
gene_df = pd.read_csv('brainarea_vs_genes_exp_w_reannotations.tsv',sep='\t',index_col = 'gene_symbol')
gene_df.head() # Show the first five rows

| gene_symbol | CA1 field | CA2 field | CA3 field | CA4 field | Crus I, lateral hemisphere | Crus I, paravermis | Crus II, lateral hemisphere | Crus II, paravermis | Edinger-Westphal nucleus | Heschl's gyrus | ... | temporal pole, inferior aspect | temporal pole, medial aspect | temporal pole, superior aspect | transverse gyri | trochlear nucleus | tuberomammillary nucleus | ventral tegmental area | ventromedial hypothalamic nucleus | vestibular nuclei | zona incerta |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A1BG | 0.856487 | -1.773695 | -0.678679 | -0.986914 | 0.826986 | 0.948039 | 0.935427 | 1.120774 | -1.018554 | 0.170282 | ... | 0.27783 | 0.514923 | 0.733368 | -0.104286 | -0.910245 | 1.03961 | -0.155167 | -0.444398 | -0.901361 | -0.23679 |
| A1BG-AS1 | 0.257664 | -1.373085 | -0.619923 | -0.636275 | 0.362799 | 0.353296 | 0.422766 | 0.346853 | -0.812015 | 0.903358 | ... | 1.074116 | 0.821031 | 1.219272 | 0.901213 | -1.522431 | 0.598719 | -1.709745 | -0.054156 | -1.695843 | -1.155961 |
| A1CF | -0.089614 | -0.546903 | 0.282914 | -0.528926 | 0.507916 | 0.577696 | 0.647671 | 0.306824 | 0.089958 | 0.14982 | ... | -0.030265 | -0.187367 | -0.428358 | -0.465863 | -0.136936 | 1.229487 | -0.11068 | -0.118175 | -0.139776 | 0.123829 |
| A2M | 0.552415 | -0.635485 | -0.954995 | -0.259745 | -1.687391 | -1.756847 | -1.640242 | -1.73311 | -0.091695 | 0.003428 | ... | -0.058505 | 0.207109 | -0.161808 | 0.18363 | 0.948098 | -0.977692 | 0.911896 | -0.499357 | 1.469386 | 0.557998 |
| A2ML1 | 0.758031 | 1.549857 | 1.262225 | 1.33878 | -0.289888 | -0.407026 | -0.358798 | -0.589988 | 0.944684 | -0.466327 | ... | -0.472908 | -0.598317 | -0.247797 | -0.282673 | 1.396365 | 0.945043 | 0.158202 | 0.572771 | 0.073088 | -0.88678 |


Indexing in Pandas works slightly differently. Similar to a dictionary, we can index values by their names.

* Use `df['column_name']` for columns, and `.loc` for rows.
* Use `.iloc` to index by integer position.
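
A minimal sketch of these three ways of indexing, on a toy dataframe with made-up gene and area names:

```python
import pandas as pd

# Hypothetical data, for illustration only
toy_df = pd.DataFrame(
    {'area_1': [0.5, 1.1, -0.7], 'area_2': [-1.2, 0.3, 2.0]},
    index=['gene_a', 'gene_b', 'gene_c'],
)

column = toy_df['area_2']                        # one column, by name
row_by_label = toy_df.loc['gene_b']              # one row, by label
row_by_position = toy_df.iloc[1]                 # the same row, by position
single_value = toy_df.loc['gene_c', 'area_2']    # row label, then column name

print(column)
print(row_by_label)
print(single_value)
```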

In [67]:
DISC1_data = gene_df.loc['DISC1']
print(DISC1_data[DISC1_data > 1.5])
#DISC1_data['CA4 field']

cingulum bundle                      1.966820
corpus callosum                      2.173866
emboliform nucleus                   1.631351
fastigial nucleus                    1.891645
globose nucleus                      1.919081
globus pallidus, external segment    1.889369
globus pallidus, internal segment    1.813221
lateral habenular nucleus            1.761793
lateral parabrachial nucleus         1.811779
medial habenular nucleus             1.563363
red nucleus                          1.914885
reticular nucleus of thalamus        1.629758
substantia nigra, pars reticulata    1.950869
zona incerta                         1.515921
Name: DISC1, dtype: float64


Pandas has many, many useful methods that you can use on your data, including `describe`, `mean`, and more.

In [47]:
DISC1_data.describe()

count    232.000000
mean      -0.006324
std        0.965280
min       -1.572527
25%       -0.715778
50%       -0.132084
75%        0.831907
max        2.173866
Name: DISC1, dtype: float64
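
The individual statistics are also available as their own methods. Here is a small sketch on a made-up Series called `scores`:

```python
import pandas as pd

scores = pd.Series([1.0, 2.0, 3.0, 4.0], name='toy_gene')

summary = scores.describe()    # count, mean, std, min, quartiles, max
high = scores[scores > 2.5]    # Boolean filtering works as it does for arrays

print(summary)
print(scores.mean())
print(scores.median())
print(high)
```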

## Resources
Check out the <a href="https://docs.scipy.org/doc/numpy/user/index.html">NumPy user guide</a> if you ever have a question about a NumPy array!

## About this notebook
This notebook is largely derived from UCSD COGS18 Materials, created by Tom Donoghue & Shannon Ellis, as well as <a href="https://github.com/jrjohansson/scientific-python-lectures/blob/master/Lecture-2-Numpy.ipynb">JR Johansson's Scientific Python Lecture on Numpy</a>.


Want to run this notebook as a slideshow? If you have Python (or Anaconda), follow <a href="http://www.blog.pythonlibrary.org/2018/09/25/creating-presentations-with-jupyter-notebook/">these instructions</a> to set up your computer with the RISE plugin.