# **Python For Neuro Week 6**: More Numpy and Pandas

## Warmup

We're going to start today by loading in ```ex_array.npy``` again. If you don't remember how to do this, use Google to find the right function. Please assign the array to a variable called ```arr```.

In [17]:
import numpy as np
# load in data
arr = np.load('ex_array.npy')

## Summary operations

Summary operations allow you to collapse an array according to a certain summary statistic. For instance, we may want to compute the overall mean firing rate in our experimental data:

In [18]:
arr.mean()

np.float64(0.8717270073789575)
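Other summary statistics follow the same pattern as `.mean()`. Here is a minimal sketch on a small made-up array (`toy` and its values are hypothetical and not part of the course data):

```python
import numpy as np

# Hypothetical 2 x 3 array standing in for real data
toy = np.array([[1.0, 2.0, 3.0],
                [4.0, 5.0, 6.0]])

overall_mean = toy.mean()   # mean over every entry
overall_sum = toy.sum()     # sum over every entry
overall_max = toy.max()     # largest entry
overall_std = toy.std()     # population standard deviation (ddof=0)

print(overall_mean, overall_sum, overall_max, overall_std)
```

Each of these methods collapses the whole array to a single number unless you pass an `axis` argument.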

You can also specify the axis along which to average. For instance, maybe we want to average firing rates across individual trials:

In [19]:
arr.shape

(2, 10, 50, 2000)

In [20]:
arr_across_trials = arr.mean(axis=1)
print(arr)
print(arr_across_trials)
arr_across_trials.shape

[[[[1.06735651e-01 8.50013428e-01 6.61261976e-02 ... 1.22214287e+00
    1.85433810e+00 2.97715561e-02]
   [1.74584281e+00 1.04634545e+00 1.19461252e+00 ... 3.86994751e-01
    1.46257796e-01 5.57668861e-01]
   [1.27660851e+00 1.63782002e+00 1.73077654e+00 ... 2.69178437e-01
    8.87518742e-01 4.97409664e-01]
   ...
   [1.33977689e-01 1.61987886e-01 4.46480597e-01 ... 4.39952432e-01
    6.42594250e-01 1.23527806e+00]
   [2.47564259e+00 1.69413446e+00 2.12417460e+00 ... 2.61901093e+00
    3.11246504e+00 1.92026681e+00]
   [7.29644202e-01 2.04679777e+00 1.37959358e+00 ... 1.74199210e+00
    2.11039000e+00 8.98703266e-01]]

  [[8.10785116e-01 9.33062418e-01 2.59379116e-01 ... 1.31543734e+00
    1.29622586e+00 1.23752785e+00]
   [5.66745857e-01 7.88732871e-01 1.09353632e+00 ... 4.95212124e-01
    2.18896547e-01 5.55039361e-01]
   [8.82901511e-01 6.86934256e-01 1.25065050e+00 ... 9.04460497e-01
    4.60074219e-01 4.38216968e-01]
   ...
   [3.64492211e-02 5.71047168e-01 4.95341715e-01 ... 5.94

(2, 50, 2000)

Setting `keepdims=True` means that the dimensions you average over are not removed; their length is set to 1 instead:

In [24]:
arr_across_trials = arr.mean(axis=1, keepdims=True)
arr_across_trials.shape
print(arr_across_trials)

[[[[0.65972936 0.68679441 0.68199004 ... 0.96912439 1.07270512
    0.84568177]
   [0.79611248 0.99935367 0.88531804 ... 0.50904864 0.3364558
    0.4932102 ]
   [1.00938778 1.30849305 0.95961269 ... 0.67265484 0.94855162
    0.57360145]
   ...
   [0.32410045 0.60430154 0.59851933 ... 0.43082814 0.35448484
    0.60595627]
   [2.20282076 2.10202105 1.72845273 ... 2.40318232 2.62285072
    2.47028176]
   [0.77735322 1.2437595  1.09964378 ... 1.38810799 1.48371474
    1.22155473]]]


 [[[0.29366721 0.40403176 0.32066067 ... 1.11915401 0.79285423
    1.20313501]
   [0.54035301 0.7392684  0.70751494 ... 1.3894037  1.30790281
    1.4672015 ]
   [0.62631555 0.49423944 0.59660195 ... 0.82260387 0.99906975
    0.56779008]
   ...
   [0.60475668 0.75494216 0.66881441 ... 0.33957372 0.47778094
    0.5476308 ]
   [0.42349661 0.39669646 0.79542393 ... 1.52273235 1.35616066
    1.55100953]
   [0.51770586 0.46192723 0.57713179 ... 1.50219293 1.53141366
    1.34884422]]]]


In [26]:
arr_across_trials.shape

(2, 1, 50, 2000)
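One reason to keep the averaged dimension is that the result then lines up with the original array for arithmetic. A small sketch with a hypothetical 2 x 3 array (`toy` is made up for illustration):

```python
import numpy as np

# Hypothetical array: 2 trials x 3 timepoints
toy = np.array([[1.0, 2.0, 3.0],
                [3.0, 4.0, 5.0]])

mean_drop = toy.mean(axis=1)                 # shape (2,)
mean_keep = toy.mean(axis=1, keepdims=True)  # shape (2, 1)

# With keepdims=True the result broadcasts against the original array,
# so each row's mean is subtracted from that row.
centered = toy - mean_keep

print(mean_drop.shape, mean_keep.shape)
print(centered)
```

Subtracting `mean_drop` instead would raise a broadcasting error here, because the shapes `(2, 3)` and `(2,)` do not line up.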

You can average across multiple axes as well. For instance, maybe you want to average across both trials and time:

In [28]:
arr_across_trials_and_time = arr.mean(axis=(1,3))

In [29]:
arr_across_trials_and_time.shape

(2, 50)

### Question 1

- What is the average firing rate across all neurons, times, and trials for each condition?
- (Advanced.) Subtract the average firing rate per time across all neurons, trials, and conditions from the original array.

In [34]:
# (2, 10, 50, 2000)
# 2 conditions
# 10 trials
# 50 neurons
# 2000 timepoints

arr_across_neurons_times_trials = arr.mean(axis=(1,2,3), keepdims=True)
print(arr_across_neurons_times_trials)

arr_norm = (arr - arr_across_neurons_times_trials)
print(arr[0, 0, 0, 0])
print(arr_norm[0, 0, 0, 0])


[[[[0.98828779]]]


 [[[0.75516623]]]]
0.10673565077352659
-0.8815521351172901
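The advanced part of the question asks for one average per timepoint, which means averaging over the other three axes. The following is a sketch on a small random array, assuming the same axis order as in the comments above (conditions, trials, neurons, timepoints); `toy` is a hypothetical stand-in for `arr`:

```python
import numpy as np

# Hypothetical stand-in with the same axis order as `arr`:
# (conditions, trials, neurons, timepoints), but much smaller
rng = np.random.default_rng(0)
toy = rng.random((2, 3, 4, 5))

# Average over conditions, trials, and neurons; keep one value per timepoint
mean_per_time = toy.mean(axis=(0, 1, 2), keepdims=True)  # shape (1, 1, 1, 5)

# Broadcasting subtracts the matching timepoint mean from every entry
toy_centered = toy - mean_per_time

print(mean_per_time.shape)
print(toy_centered.mean(axis=(0, 1, 2)))  # approximately zero at each timepoint
```

After the subtraction, the average at every timepoint is zero up to floating-point error, which is a quick way to check that the right axes were used.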


## Indexing

Indexing in vectors works just as in lists:

In [36]:
lst_1 = [25, 20, 40, 5]
vec_1 = np.array(lst_1)
vec_1

array([25, 20, 40,  5])

In [37]:
vec_1[0]

np.int64(25)

For matrices and higher-dimensional arrays, a single index selects a single row:

In [39]:
lst_1 = [
    [1, 2],
    [3, 4],
    [5, 6]
]
mat_1 = np.array(lst_1)
mat_1

array([[1, 2],
       [3, 4],
       [5, 6]])

In [40]:
mat_1[0]

array([1, 2])

In [41]:
mat_1[0][1]

np.int64(2)

Instead of using two brackets, you can also separate the row and column index by a comma:

In [None]:
# The following two lines of code are equivalent
print(mat_1[0][0])
print(mat_1[0,0])

### Slicing

Slicing is a useful way of extracting more than one element. In particular, `j:k` extracts the elements j,...,k-1:

In [42]:
vec = np.arange(10)
print(vec)

[0 1 2 3 4 5 6 7 8 9]


In [43]:
vec[3:7]

array([3, 4, 5, 6])

You can omit either end of the range; it then defaults to the beginning or the end of the array, respectively.

In [44]:
vec[:7]

array([0, 1, 2, 3, 4, 5, 6])

In [45]:
vec[3:]

array([3, 4, 5, 6, 7, 8, 9])

In [46]:
vec[:] # What do you think this will do?

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

You can therefore also use the colon to select all rows of a matrix and specific columns.

In [47]:
mat_1

array([[1, 2],
       [3, 4],
       [5, 6]])

In [48]:
mat_1[:,0]

array([1, 3, 5])
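Row and column slices can be combined freely. A short sketch using the same matrix values as above (the variable names are only for illustration):

```python
import numpy as np

mat = np.array([[1, 2],
                [3, 4],
                [5, 6]])

first_two_rows = mat[:2, :]   # rows 0 and 1, all columns
last_column = mat[:, 1]       # all rows, column 1
lower_left = mat[1:, 0]       # rows 1 and 2, column 0

print(first_two_rows)
print(last_column)
print(lower_left)
```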

You can add another colon to specify a step size, similarly to how you would use these three arguments in `range`.

In [49]:
vec[3:7:2]

array([3, 5])

You can still omit the beginning or the end of the slice:

In [50]:
vec[::2]

array([0, 2, 4, 6, 8])

### Question 2
Predict the output of the following commands:

In [51]:
vec[:4]

array([0, 1, 2, 3])

In [52]:
vec[5:9:2]

array([5, 7])

In [53]:
vec[:7:2]

array([0, 2, 4, 6])

In [54]:
vec[2::2]

array([2, 4, 6, 8])

### Boolean indexing

Do you remember how to create an array that is true if and only if `vec` is smaller than or equal to 5?

In [55]:
vec = np.arange(10)
vec

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [56]:
selector = vec <= 5
selector

array([ True,  True,  True,  True,  True,  True, False, False, False,
       False])

You can use such a boolean array as an index to keep only the entries where it is `True`.

In [57]:
vec

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [58]:
vec[selector]

array([0, 1, 2, 3, 4, 5])

In [59]:
vec[vec<=5]

array([0, 1, 2, 3, 4, 5])

You can do the same with matrices:

In [60]:
mat_1 = np.array([[1, 2],
       [3, 4],
       [5, 6]])

In [61]:
mat_1 >= 3

array([[False, False],
       [ True,  True],
       [ True,  True]])

In [62]:
mat_1[mat_1 >= 3]

array([3, 4, 5, 6])

### Question 3
- Consider the example matrix from above and subset all entries with values between 2 and 4 (inclusive). You can try to do this in one line or do it through multiple lines!

In [74]:
selector = (mat_1 >=2) & (mat_1 <=4)
mat_1[selector]

array([2, 3, 4])
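Boolean arrays are combined with the elementwise operators `&` (and), `|` (or), and `~` (not). The parentheses around each comparison matter, because `&` binds more tightly than `>=`, so without them Python would try to evaluate `2 & mat_1` first. A small sketch on the same matrix values:

```python
import numpy as np

mat = np.array([[1, 2],
                [3, 4],
                [5, 6]])

in_range = (mat >= 2) & (mat <= 4)   # elementwise AND
outside = (mat < 2) | (mat > 4)      # elementwise OR
not_in_range = ~in_range             # elementwise NOT

print(mat[in_range])      # [2 3 4]
print(mat[outside])       # [1 5 6]
print(in_range.sum())     # number of True entries: 3
```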

# Pandas
## Python's package for handling data
### Motivation for pandas
Dictionaries allow us to save multiple attributes of a particular object. For example, we can store some information about a lesson:

In [75]:
lesson_5 = {
    'topic': 'Numpy',
    'teacher': 'Sharon',
    'week': 5
}

Often, we collect multiple observations for which we record the same attributes and we'd like to store them together:

In [76]:
lesson_3 = {
    'topic': 'Basics of Python 2',
    'teacher': 'Sharon',
    'week': 3
}
lesson_1 = {
    'topic': 'Setting up Python',
    'teacher': 'Abhi',
    'week': 1
}

We could go about this by storing them in a list:

In [77]:
lst_lessons = [lesson_5, lesson_3, lesson_1]

In [78]:
lst_lessons

[{'topic': 'Numpy', 'teacher': 'Sharon', 'week': 5},
 {'topic': 'Basics of Python 2', 'teacher': 'Sharon', 'week': 3},
 {'topic': 'Setting up Python', 'teacher': 'Abhi', 'week': 1}]

However, such lists are lacking a lot of functionality. For example, we may want to print out only those observations where Sharon was the teacher. We'd have to write a loop (here, a list comprehension) for this:

In [79]:
sharons_lessons = [
    lesson for lesson in lst_lessons if lesson['teacher'] == 'Sharon'
]
sharons_lessons

[{'topic': 'Numpy', 'teacher': 'Sharon', 'week': 5},
 {'topic': 'Basics of Python 2', 'teacher': 'Sharon', 'week': 3}]

We therefore need a new data structure that can record multiple pieces of information about multiple observations. This is provided by `pandas` (which stands for *panel data*):

In [80]:
#We normally import pandas like this
import pandas as pd

The core object in pandas is a *data frame*, which consists of observations organized along its rows and different pieces of information about its observations organized along its columns.

In [113]:
df_lessons = pd.DataFrame(lst_lessons)
df_lessons

                topic teacher  week
0               Numpy  Sharon     5
1  Basics of Python 2  Sharon     3
2   Setting up Python    Abhi     1


### Finding out basic information

In [82]:
df_lessons.shape

(3, 3)

In [83]:
df_lessons.columns

Index(['topic', 'teacher', 'week'], dtype='object')

### Indexing

Square brackets return a specific column or a subset of columns:

In [84]:
df_lessons['teacher']

0    Sharon
1    Sharon
2      Abhi
Name: teacher, dtype: object

(*Note:* The object that is returned is called a `pd.Series` and has a few additional features compared to a one-dimensional numpy array. I personally don't use those additional features and think they are counter-productive, but you can look them up if you have to interact with them.)

You can operate on those columns in the same way you would operate on numpy arrays:

In [85]:
df_lessons['teacher'] == 'Sharon'

0     True
1     True
2    False
Name: teacher, dtype: bool

In [86]:
df_lessons[['topic', 'teacher']]

                topic teacher
0               Numpy  Sharon
1  Basics of Python 2  Sharon
2   Setting up Python    Abhi


`.loc` allows you to index data frames by row labels (here, the default row numbers) and column names:

In [87]:
df_lessons.loc[1, 'teacher']

'Sharon'

This also works with slicing:

In [88]:
df_lessons.loc[1:, 'teacher']

1    Sharon
2      Abhi
Name: teacher, dtype: object

In [89]:
df_lessons.loc[:, ['topic', 'teacher']]

                topic teacher
0               Numpy  Sharon
1  Basics of Python 2  Sharon
2   Setting up Python    Abhi


`.iloc` works in the same way, but selects both rows and columns by their integer position rather than by label or name:

In [90]:
df_lessons.iloc[1, 1]

'Sharon'
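One difference worth knowing is how the two handle slices: `.loc` slices by label and includes the end label, while `.iloc` slices by position and excludes the end, like a list. A sketch on a hypothetical frame that mirrors the lessons example:

```python
import pandas as pd

# Hypothetical frame mirroring the lessons example
df = pd.DataFrame({
    'topic': ['Numpy', 'Basics of Python 2', 'Setting up Python'],
    'teacher': ['Sharon', 'Sharon', 'Abhi'],
    'week': [5, 3, 1],
})

by_label = df.loc[0:1, 'teacher']   # label slice: includes row 1
by_position = df.iloc[0:1, 1]       # position slice: excludes row 1

print(len(by_label), len(by_position))  # 2 1
```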

Finally, you can also do boolean indexing with square brackets.

In [91]:
selector = df_lessons['teacher'] == 'Sharon'
df_lessons[selector]

                topic teacher  week
0               Numpy  Sharon     5
1  Basics of Python 2  Sharon     3


(Note that the single `=` assigns the result of the expression on its right to the variable on its left. The double `==`, on the other hand, compares the values in `df_lessons['teacher']` and determines whether they are equal to `'Sharon'`.)

In [92]:
df_lessons[df_lessons['teacher']=='Sharon']

                topic teacher  week
0               Numpy  Sharon     5
1  Basics of Python 2  Sharon     3
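As with numpy arrays, several conditions can be combined with `&` and `|`, as long as each comparison is wrapped in parentheses. A sketch on a hypothetical copy of the lessons frame (the week threshold of 4 is only an example):

```python
import pandas as pd

# Hypothetical frame mirroring the lessons example
df = pd.DataFrame({
    'topic': ['Numpy', 'Basics of Python 2', 'Setting up Python'],
    'teacher': ['Sharon', 'Sharon', 'Abhi'],
    'week': [5, 3, 1],
})

# Lessons taught by Sharon in week 4 or later
selector = (df['teacher'] == 'Sharon') & (df['week'] >= 4)
late_sharon = df[selector]

# .loc accepts the same boolean selector and can pick columns too
late_topics = df.loc[selector, 'topic']

print(late_sharon)
print(late_topics.tolist())  # ['Numpy']
```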


You can also add new columns in the same way you would add a new key-value pair to a dictionary:

In [93]:
df_lessons

                topic teacher  week
0               Numpy  Sharon     5
1  Basics of Python 2  Sharon     3
2   Setting up Python    Abhi     1


In [114]:
df_lessons['homework'] = [True, True, False]

In [97]:
df_lessons


                topic teacher  week  homework
0               Numpy  Sharon     5      True
1  Basics of Python 2  Sharon     3      True
2   Setting up Python    Abhi     1     False


### Exercises
1. Create a data frame that additionally includes this week (week 6) with the appropriate topic (pandas) and teacher (Sam).
2. Print out the topic for the second row.
3. Subset the data frame to only print out the lessons for week 3 and higher.
4. Create a new data frame that also includes week 7's lesson with teacher Sam. However, you don't know the topic yet. How does `pandas` represent this information? (Hint: Create a dictionary that only contains the keys `week` and `teacher`, but not `topic`. Try adding it to the list we used above and turning it into a dataframe.)
5. Alternatively, you could have represented this information as a two-dimensional array with observations structured along rows and variables structured along columns. What would the difference be and why might this be a bad idea in this case? Discuss with the other students at your table.

In [120]:
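# 1. Add week 6 as a new row (row label 3)
week_6 = {
    'topic': 'Pandas',
    'teacher': 'Sam',
    'week': 6,
    'homework': True
}
df_lessons.loc[3] = week_6

# 2. Print the topic for the second row (row label 1)
print(df_lessons.loc[1, 'topic'])

# 3. Keep only the lessons from week 3 onwards
selector = df_lessons['week'] >= 3
print(df_lessons[selector])

# 4. Add week 7 without a topic; pandas fills the missing entry with NaN
week_7 = {
    'teacher': 'Abhi',
    'week': 7,
    'homework': False
}
df_lessons.loc[4] = week_7
df_lessons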


Basics of Python 2
                topic teacher  week homework
0               Numpy  Sharon     5     True
1  Basics of Python 2  Sharon     3     True
3              Pandas     Sam     6     True
4                 NaN    Abhi     7    False


                topic teacher  week homework
0               Numpy  Sharon     5     True
1  Basics of Python 2  Sharon     3     True
2   Setting up Python    Abhi     1    False
3              Pandas     Sam     6     True
4                 NaN    Abhi     7    False
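The hint for exercise 4 suggests a second route: put a dictionary without the `topic` key into the list and build the data frame from that list. A minimal sketch of that route, using a shortened, hypothetical version of the lessons list:

```python
import pandas as pd

lessons = [
    {'topic': 'Numpy', 'teacher': 'Sharon', 'week': 5},
    {'topic': 'Pandas', 'teacher': 'Sam', 'week': 6},
    {'teacher': 'Sam', 'week': 7},   # topic not known yet
]

df = pd.DataFrame(lessons)

print(df)
print(df['topic'].isna())   # True only for the row with the missing topic
```

Either way, pandas marks the missing entry as `NaN`, and `.isna()` lets you find such entries later.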


### Saving and loading a data frame
You can save data frames in different formats. A popular format is CSV (comma-separated values), which stores each observation on its own row, with the variables separated by commas.

In [103]:
df_lessons.to_csv('df_lessons.csv')

Let's inspect this file.

We'll be using CSV files today. Note that they are not always ideal. For example, they do not store the data type of each column, which can lead to issues when you load the file again. The HDF5 format is a popular alternative (but a little more complicated to use); the Feather format is lightweight and more reliable, but a little less common.

In [104]:
df_lessons_loaded = pd.read_csv('df_lessons.csv')

In [107]:
df_lessons_loaded
type(df_lessons_loaded.loc[0,'week'])

numpy.int64
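By default, `to_csv` also writes the row index as an unnamed first column, which `read_csv` then loads as a regular column called `Unnamed: 0`. Here is a sketch of two common ways around this, using an in-memory string instead of a file (the small frame `df` is hypothetical):

```python
import io
import pandas as pd

df = pd.DataFrame({'teacher': ['Sharon', 'Abhi'], 'week': [5, 1]})

# Default: the row index is written as an unnamed first column
with_index = df.to_csv()
reloaded = pd.read_csv(io.StringIO(with_index))
print(reloaded.columns.tolist())   # ['Unnamed: 0', 'teacher', 'week']

# Option 1: do not write the index at all
without_index = df.to_csv(index=False)
clean = pd.read_csv(io.StringIO(without_index))

# Option 2: tell read_csv that the first column is the index
clean_too = pd.read_csv(io.StringIO(with_index), index_col=0)

print(clean.columns.tolist())      # ['teacher', 'week']
print(clean_too.columns.tolist())  # ['teacher', 'week']
```

The same arguments work when you pass a file name such as `'df_lessons.csv'` instead of a string buffer.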

### Exercises
1. Read in the file `dot_motion.csv` using pandas and assign it to the variable `df_dm`.
2. Try exploring the file and describe the data contained in it.
3. Subset the data frame to only contain the observations with a reaction time above 100.
4. Create a new variable 'accuracy' that is 1 if the motion and the choice are matching and 0 otherwise.

#### Hint for 4:
If the motion and choice are matching, their entries should be equal. Create an array `accuracy` that contains, as a boolean, whether or not they match. You can turn this boolean array (with True and False values) into a float array (which will assign 1 to True and 0 to False) using `accuracy.astype(float)`.
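The conversion described in the hint can be sketched on a tiny made-up frame. The column names follow the exercise text, but the three trials are hypothetical:

```python
import pandas as pd

# Hypothetical trials; column names follow the exercise text
trials = pd.DataFrame({
    'motion': ['right', 'left', 'left'],
    'choice': ['right', 'right', 'left'],
})

matches = trials['motion'] == trials['choice']   # boolean Series
trials['accuracy'] = matches.astype(float)       # True -> 1.0, False -> 0.0

print(trials)
print(trials['accuracy'].mean())   # fraction of correct trials
```

Storing accuracy as a number is convenient because the mean of the column is then the fraction of correct trials.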


In [135]:
df_dm = pd.read_csv('dot_motion.csv')
df_dm.drop(columns='Unnamed: 0', inplace=True)

In [136]:
df_dm


       subject motion noise_level choice  reaction_time
0    Subject 1  right        high   left     504.360449
1    Subject 1  right        high  right     503.477539
2    Subject 1  right         low  right     531.792948
3    Subject 1   left        high  right     529.713605
4    Subject 1   left        high   left    1065.592634
..         ...    ...         ...    ...            ...
495  Subject 2  right        high  right     682.588538
496  Subject 2   left         low   left     608.847204
497  Subject 2  right         low  right     529.400439
498  Subject 2   left         low   left     600.240023


In [None]:

for column in df_dm.columns[0:3]:
    print(f"{column}: {df_dm[column].nunique()} unique values, {df_dm[column].unique()}")


subject: 2 unique values, ['Subject 1' 'Subject 2']
motion: 2 unique values, ['right' 'left']
noise_level: 2 unique values, ['high' 'low']


In [140]:
#3. Subset the data frame to only contain the observations with a reaction time of above 100.

df_dm[df_dm['reaction_time'] > 100]

       subject motion noise_level choice  reaction_time
0    Subject 1  right        high   left     504.360449
1    Subject 1  right        high  right     503.477539
2    Subject 1  right         low  right     531.792948
3    Subject 1   left        high  right     529.713605
4    Subject 1   left        high   left    1065.592634
..         ...    ...         ...    ...            ...
495  Subject 2  right        high  right     682.588538
496  Subject 2   left         low   left     608.847204
497  Subject 2  right         low  right     529.400439
498  Subject 2   left         low   left     600.240023


In [141]:
#4. Create a new variable 'accuracy' that is 1 if the motion and the choice are matching and 0 otherwise.
df_dm['accuracy'] = (df_dm['motion'] == df_dm['choice']).astype(int)
df_dm

       subject motion noise_level choice  reaction_time  accuracy
0    Subject 1  right        high   left     504.360449         0
1    Subject 1  right        high  right     503.477539         1
2    Subject 1  right         low  right     531.792948         1
3    Subject 1   left        high  right     529.713605         0
4    Subject 1   left        high   left    1065.592634         1
..         ...    ...         ...    ...            ...       ...
495  Subject 2  right        high  right     682.588538         1
496  Subject 2   left         low   left     608.847204         1
497  Subject 2  right         low  right     529.400439         1
498  Subject 2   left         low   left     600.240023         1
