# Applying Complex Orthogonal Decomposition to lampreys swimming in fluid environments of various viscosities
# Part I - Pre-process

In this notebook, we pre-process the original dataset to obtain quantities such as the center of mass and the swimming velocity. This prepares the dataset for analysis with complex orthogonal decomposition.

### Author: Yuexia Luna Lin (luna.lin@epfl.ch)
### Data provided by Prof. Eric Tytell.

# Start by loading the necessary libraries and files.

Pre-process swimming dataset to compute swimming speed, body axis, etc.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.animation import FuncAnimation
from matplotlib import rc
import zipfile

import os
from os import listdir
from os.path import isfile, isdir, join
from scipy.interpolate import interp1d
from scipy.signal import hilbert
from scipy.fft import rfft, irfft
from scipy.linalg import eigh
from scipy.optimize import curve_fit, brute, minimize

# The following two lines make it easy to convert
# comma decimal separators to point decimal separators
import locale
locale.setlocale(locale.LC_NUMERIC, "fr_CH.ISO8859-15")
import time

# To read Eric's h5 file, we can't use Pandas since it requires a particular structure within the HDF5 file.
# So we need this library
import h5py

#%matplotlib inline
%matplotlib widget

# Read in dataset and pre-process it.

The original h5 file cannot be directly read in by Pandas. We use h5py to read it in, then we convert it to Pandas DataFrame for easier manipulation.

We also add columns for the center of mass (comx, comy), body orientation (bodyaxisx, bodyaxisy), tracker coordinates in the body frame (bodycoordx, bodycoordy), and swimming velocity (swimvelx, swimvely).

For convenience, we will save the processed dataset as an h5 file named "processed_midline_all.h5".

# 1. We first compute the center of mass and the body axis. Then we rotate the body coordinates so that the body is aligned with the $x$-axis.

To compute the body axis, we tried both a least-squares fit of the midline data and a principal component analysis (PCA; note that this is exactly proper orthogonal decomposition, POD). After inspecting several examples, we conclude that the two approaches give extremely similar results.

PCA has the benefit that its results remain stable and consistent when the fish's body becomes more vertical, so we choose it.
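
The stability argument can be illustrated on synthetic data. The sketch below is not part of the original analysis: the midline is a made-up, gently undulating curve, and `pca_angle_deg`, `lsq_angle_deg` and `rotate` are hypothetical helpers. It rotates the same shape by 10 and by 80 degrees and compares how much each estimate of the body orientation changes.

```python
import numpy as np

def pca_angle_deg(x, y):
    """Orientation (degrees, folded to [0, 180)) of the first principal component."""
    D = np.column_stack([x - np.mean(x), y - np.mean(y)])
    U, S, Vh = np.linalg.svd(D.T @ D)
    v = U[:, np.argmax(S)]
    return np.degrees(np.arctan2(v[1], v[0])) % 180.0

def lsq_angle_deg(x, y):
    """Orientation (degrees, folded to [0, 180)) of the least-squares line y = a*x + b."""
    A = np.column_stack([x - np.mean(x), np.ones_like(x)])
    soln, *_ = np.linalg.lstsq(A, y - np.mean(y), rcond=None)
    return np.degrees(np.arctan(soln[0])) % 180.0

# Hypothetical midline: 20 points along a body with a small lateral wiggle
s = np.linspace(-70.0, 70.0, 20)
w = 5.0 * np.sin(np.pi * s / 70.0)

def rotate(angle_deg):
    """Return the synthetic midline rotated counter-clockwise by angle_deg."""
    a = np.radians(angle_deg)
    return np.cos(a) * s - np.sin(a) * w, np.sin(a) * s + np.cos(a) * w

x10, y10 = rotate(10.0)
x80, y80 = rotate(80.0)

# A rotation-consistent estimator should report a change of exactly 70 degrees
pca_change = pca_angle_deg(x80, y80) - pca_angle_deg(x10, y10)
lsq_change = lsq_angle_deg(x80, y80) - lsq_angle_deg(x10, y10)
print("PCA change: {:.3f} deg, LSQ change: {:.3f} deg".format(pca_change, lsq_change))
```

For this toy shape the PCA estimate follows the rotation to within round-off, whereas the least-squares slope does not, because least squares only penalizes residuals in $y$ and so treats the two coordinates asymmetrically.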

In [23]:
with h5py.File("midlines_all2.h5", 'r') as h5file:
    column_names = [c1.decode('ascii') for c1 in h5file.attrs['colnames']]
    
    dat = dict([(c1, np.array(h5file[c1])) for c1 in column_names])
    data_frame = pd.DataFrame(dat)# , columns=column_names)

In [24]:
data_frame.head(30)

Unnamed: 0,filename,date,indiv,trial,t,frame,point,mxmm,mymm,viscosity.cP,len.mm
0,b'lamprey5-1-corr-midline.csv',15868.0,b'5',b'1',0.58,29.0,1.0,341.299771,88.110667,1.0,140.0
1,b'lamprey5-1-corr-midline.csv',15868.0,b'5',b'1',0.58,29.0,2.0,349.324554,90.105424,1.0,140.0
2,b'lamprey5-1-corr-midline.csv',15868.0,b'5',b'1',0.58,29.0,3.0,357.545449,91.064575,1.0,140.0
3,b'lamprey5-1-corr-midline.csv',15868.0,b'5',b'1',0.58,29.0,4.0,365.759349,91.948042,1.0,140.0
4,b'lamprey5-1-corr-midline.csv',15868.0,b'5',b'1',0.58,29.0,5.0,374.046024,91.78167,1.0,140.0
5,b'lamprey5-1-corr-midline.csv',15868.0,b'5',b'1',0.58,29.0,6.0,382.245394,90.91437,1.0,140.0
6,b'lamprey5-1-corr-midline.csv',15868.0,b'5',b'1',0.58,29.0,7.0,390.32937,89.114436,1.0,140.0
7,b'lamprey5-1-corr-midline.csv',15868.0,b'5',b'1',0.58,29.0,8.0,398.584686,88.492048,1.0,140.0
8,b'lamprey5-1-corr-midline.csv',15868.0,b'5',b'1',0.58,29.0,9.0,406.743469,89.811668,1.0,140.0
9,b'lamprey5-1-corr-midline.csv',15868.0,b'5',b'1',0.58,29.0,10.0,414.012944,93.737974,1.0,140.0


In [25]:
data_frame['date']    = data_frame['date'].astype('int')
data_frame['indiv']   = data_frame['indiv'].astype('int')
data_frame['trial']   = data_frame['trial'].apply(lambda x: x.decode())

In [26]:
data_frame['viscosity.cP'].unique()

array([ 1., 10., 20.])

Some of the fields need a bit of massaging:
1. frame numbers ought to start at 0
2. we add columns for the center of mass

Subtract off the minimum frame for each individual and trial

In [27]:
data_frame[['frame']] = data_frame.groupby(['indiv','trial'])[['frame']]\
    .transform(lambda x: x - x.min())
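
As a quick illustration of what this `groupby`/`transform` step does, here is a minimal sketch on a made-up table (the values in `toy` are hypothetical, not taken from the dataset). Each (individual, trial) group is shifted so that its smallest frame number becomes 0.

```python
import pandas as pd

# Hypothetical toy table: two trials whose frame counters start at 29 and 7
toy = pd.DataFrame({
    "indiv": [5, 5, 5, 5, 5, 5],
    "trial": ["1", "1", "1", "2", "2", "2"],
    "frame": [29.0, 30.0, 31.0, 7.0, 8.0, 9.0],
})

# Same pattern as in the notebook: subtract the per-group minimum frame
toy[["frame"]] = toy.groupby(["indiv", "trial"])[["frame"]]\
    .transform(lambda x: x - x.min())

print(toy)
```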

Take the PCA to get the body axis in each frame

In [28]:
# A simple PCA function
def PCA(df):
    """ This function takes in a Pandas DataFrame that contains the (x,y)
    data, labeled as 'mxmm', 'mymm' respectively.
    It then calculates the two principal components of this set of data points."""
    
    x= df['mxmm'].tolist()
    y= df['mymm'].tolist()
    comx = np.mean(x)
    comy = np.mean(y)
    standard_x = (x - comx)
    standard_y = (y - comy)
    
    # perform PCA
    D = np.vstack([standard_x, standard_y]).T
    Corr = D.T @ D
    U, S, Vh = np.linalg.svd(Corr)
    pca = U[:, np.argmax(S)]
    # If the principal vector is pointing to the tail, we flip it around
    if (np.dot(pca, D[0,:])) < 0:
        pca = -pca
    return pd.Series([comx, comy, pca[0], pca[1]], index = ["comx", "comy", "bodyaxisx", "bodyaxisy"])

bodyaxis = data_frame.groupby(['indiv','trial','frame'])\
        .apply(PCA)

In [29]:
data_frame = pd.merge(data_frame, bodyaxis, how="left", on=["indiv", "trial", "frame"])

Then center each frame on the center of mass and rotate into the body axis coordinate system

In [30]:
data_frame["bodycoordx"] = (data_frame['mxmm'] - data_frame['comx']) * data_frame['bodyaxisx'] \
                                    + (data_frame['mymm'] - data_frame['comy']) * data_frame['bodyaxisy']
data_frame["bodycoordy"] = - (data_frame['mxmm'] - data_frame['comx'])  * data_frame['bodyaxisy'] \
                                    + (data_frame['mymm'] - data_frame['comy'])  * data_frame['bodyaxisx']       
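
The two expressions above project each centered point onto the body axis and onto its perpendicular. A minimal sanity check, assuming a perfectly straight synthetic body (`to_body_frame` and all numbers below are hypothetical), is that such a body ends up lying on the $x$-axis of the body frame:

```python
import numpy as np

def to_body_frame(x, y, com, axis):
    """Center points on com, then rotate so that the unit vector axis maps to the x-axis."""
    dx, dy = x - com[0], y - com[1]
    bx = dx * axis[0] + dy * axis[1]     # component along the body axis
    by = -dx * axis[1] + dy * axis[0]    # component perpendicular to it
    return bx, by

# Hypothetical straight body tilted by 30 degrees, centered at (100, 50)
angle = np.radians(30.0)
axis = np.array([np.cos(angle), np.sin(angle)])
com = np.array([100.0, 50.0])
s = np.linspace(-70.0, 70.0, 8)          # positions along the body
x = com[0] + s * axis[0]
y = com[1] + s * axis[1]

bx, by = to_body_frame(x, y, com, axis)
print(bx)   # positions along the body
print(by)   # lateral offsets, zero up to round-off
```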

In [31]:
# ## Here we just double-check that PCA and least squares give similar results
# ind = 3
# tr = 3
# frame = 30
# x= data_frame.loc[(data_frame['indiv']==ind) & (data_frame['trial'] == tr) & (data_frame['frame']== frame)]['mxmm'].tolist()
# y= data_frame.loc[(data_frame['indiv']==ind) & (data_frame['trial'] == tr) & (data_frame['frame']== frame)]['mymm'].tolist()
# standard_x = (x - np.mean(x))
# standard_y = (y - np.mean(y))

# A = np.vstack([standard_x, np.ones_like(x)]).T
# soln, res, rank, sv = np.linalg.lstsq(A, standard_y, rcond=None)

# B = np.vstack([standard_x, standard_y]).T
# C = B.T @ B
# U, S, Vh = np.linalg.svd(C)
# print(U)
# plt.figure()

# plt.plot(standard_x, standard_y,'bv-', label = 'centered actual data')
# plt.plot(standard_x[0], standard_y[0], 'rs')
# plt.plot(standard_x, A@soln, 'ro', label = 'LSQ, slope {:.3f}'.format(soln[0]))
# expanse  = np.max( np.linalg.norm(B, axis=1) )
# plt.plot([-expanse*U[0,0], expanse*U[0,0]], [-expanse*U[1,0], expanse*U[1,0]], 'g-', \
#          label='First principal component, slope {:.3f}'.format(U[1,0] / U[0,0] ))
# plt.legend()

# 2. Then we compute swimming velocity (translational velocity of the center of mass)

At the first and last frames, we use first-order forward and backward finite differences, respectively. In between, we use centered differences (second-order accurate in time).
### Note we don't process any trial that has skipped frames
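
One possible way to detect trials with skipped frames is sketched below. It assumes that, after the re-zeroing step, consecutive frame numbers within a trial should differ by exactly 1; `has_skipped_frames` and the toy tables are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def has_skipped_frames(trial_df):
    """Return True if the sorted unique frame numbers of one trial are not consecutive."""
    frames = np.sort(trial_df["frame"].unique())
    return bool(np.any(np.diff(frames) != 1))

# Hypothetical trials: one complete, one missing frame 2
complete = pd.DataFrame({"frame": [0.0, 0.0, 1.0, 1.0, 2.0, 2.0]})
gappy = pd.DataFrame({"frame": [0.0, 0.0, 1.0, 1.0, 3.0, 3.0]})

print(has_skipped_frames(complete), has_skipped_frames(gappy))
```

Applied per trial, for instance through `groupby(['indiv', 'trial'])`, such a check would flag the trials to leave out.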

In [32]:
def comvelocity(df):
    dx = np.zeros((df.shape[0],))
    dy = np.zeros((df.shape[0],))
    dt = np.zeros((df.shape[0],))
    comx = df['comx'].to_numpy()
    comy = df['comy'].to_numpy()
    t = df['t'].to_numpy()

    dx[1:-1] = comx[2:] - comx[0:-2]
    dx[0] = comx[1] - comx[0]
    dx[-1] = comx[-1] - comx[-2]

    dy[1:-1] = comy[2:] - comy[0:-2]
    dy[0] = comy[1] - comy[0]
    dy[-1] = comy[-1] - comy[-2]

    dt[1:-1] = t[2:] - t[0:-2]
    dt[0] = t[1] - t[0]
    dt[-1] = t[-1] - t[-2]

    return pd.DataFrame({'frame': df['frame'].to_numpy(), 'swimvelx': dx/dt, 'swimvely': dy/dt})

swimvel = data_frame.groupby(['indiv', 'trial', 'frame'])\
    .first()\
    .reset_index()\
    .groupby(['indiv', 'trial'])\
    .apply(comvelocity)
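
The differencing scheme in `comvelocity` can be checked on a trajectory with a known derivative. The following is a sketch under stated assumptions: `edge_centered_diff` is a hypothetical stand-alone version of the same stencil, and the quadratic trajectory and the 0.02 time step are made up for the test. For evenly spaced samples the result should agree with `np.gradient`, which also uses centered differences in the interior and first-order one-sided differences at the ends by default.

```python
import numpy as np

def edge_centered_diff(f, t):
    """Centered differences in the interior, one-sided differences at both ends."""
    f = np.asarray(f, dtype=float)
    t = np.asarray(t, dtype=float)
    df = np.empty_like(f)
    dt = np.empty_like(t)
    df[1:-1] = f[2:] - f[:-2]
    dt[1:-1] = t[2:] - t[:-2]
    df[0], dt[0] = f[1] - f[0], t[1] - t[0]
    df[-1], dt[-1] = f[-1] - f[-2], t[-1] - t[-2]
    return df / dt

# Hypothetical center-of-mass trajectory sampled every 0.02 time units
t = 0.58 + 0.02 * np.arange(8)
tau = t - t[0]
comx = 400.0 - 200.0 * tau + 50.0 * tau**2
exact = -200.0 + 100.0 * tau               # analytical derivative

vel = edge_centered_diff(comx, t)
print(np.max(np.abs(vel[1:-1] - exact[1:-1])))   # interior error, round-off only
print(np.max(np.abs(vel - np.gradient(comx, t))))
```

Centered differences are exact for a quadratic, so only the two end points carry a first-order error here.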

In [33]:
data_frame = data_frame.merge(swimvel, how = "left", on=["indiv", "trial", "frame"])
data_frame.head()

Unnamed: 0,filename,date,indiv,trial,t,frame,point,mxmm,mymm,viscosity.cP,len.mm,comx,comy,bodyaxisx,bodyaxisy,bodycoordx,bodycoordy,swimvelx,swimvely
0,b'lamprey5-1-corr-midline.csv',15868,5,1,0.58,0.0,1.0,341.299771,88.110667,1.0,140.0,413.689152,104.191588,-0.950239,-0.311522,73.79677,-7.270154,-206.90573,-68.994977
1,b'lamprey5-1-corr-midline.csv',15868,5,1,0.58,0.0,2.0,349.324554,90.105424,1.0,140.0,413.689152,104.191588,-0.950239,-0.311522,65.549898,-6.665755,-206.90573,-68.994977
2,b'lamprey5-1-corr-midline.csv',15868,5,1,0.58,0.0,3.0,357.545449,91.064575,1.0,140.0,413.689152,104.191588,-0.950239,-0.311522,57.439287,-5.01619,-206.90573,-68.994977
3,b'lamprey5-1-corr-midline.csv',15868,5,1,0.58,0.0,4.0,365.759349,91.948042,1.0,140.0,413.689152,104.191588,-0.950239,-0.311522,49.3589,-3.296886,-206.90573,-68.994977
4,b'lamprey5-1-corr-midline.csv',15868,5,1,0.58,0.0,5.0,374.046024,91.78167,1.0,140.0,413.689152,104.191588,-0.950239,-0.311522,41.536407,-0.557311,-206.90573,-68.994977


In [34]:
data_frame["swimvel"] = np.sqrt(data_frame["swimvelx"]**2 + data_frame["swimvely"]**2)

# Next, we compute
- heading angle, i.e. body orientation computed by using body axis ('theta') 
- swimming angle, i.e. angle computed by using COM velocity ('com_vel_theta')
- time derivatives of these angles, 'd_theta' and 'com_vel_d_theta', for the heading and swimming angles, respectively
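
Both angles are computed in this notebook with the expression `arctan(vy / vx * sign(vx))`. A short sketch of what that expression returns is given below; `folded_angle` is a hypothetical helper name. It is the angle of the vector $(|v_x|, v_y)$, which always lies in $(-\pi/2, \pi/2)$, so vectors pointing left and right with the same $v_y/|v_x|$ map to the same angle.

```python
import numpy as np

def folded_angle(vx, vy):
    """Angle of (|vx|, vy) in radians, in (-pi/2, pi/2); vx must be nonzero."""
    vx = np.asarray(vx, dtype=float)
    vy = np.asarray(vy, dtype=float)
    return np.arctan(vy / vx * np.sign(vx))

# Body axis of the first frame shown in the table above
theta = folded_angle(-0.950239, -0.311522)
print(theta)

# The same quantity written with arctan2
print(np.arctan2(-0.311522, abs(-0.950239)))
```

With the body-axis values of the first frame this gives approximately -0.3168, in line with the `theta` column shown in the final table of the notebook.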

In [35]:
individuals = data_frame.loc[:,'indiv'].unique().tolist()

for ind in individuals:
    trials = data_frame.loc[data_frame['indiv'] == ind, 'trial'].unique().tolist()
    for tr in trials:
        print("working on individual {}, trial {}".format(ind, tr))
        
        trial_df = data_frame.loc[(data_frame['indiv'] == ind) & (data_frame['trial'] == tr)]

        # Compute the heading angle based on body axis
        data_frame.loc[(data_frame['indiv'] == ind) &\
                      (data_frame['trial'] == tr), 'theta'] \
                        = np.arctan( np.array(trial_df['bodyaxisy'])/np.array(trial_df['bodyaxisx'])\
                        * np.sign(trial_df['bodyaxisx']) )

        trial_df = data_frame.loc[(data_frame['indiv'] == ind) &\
                                              (data_frame['trial'] == tr)]

        frame = np.array(trial_df.loc[:,'frame'].unique()).copy()
        
        # Compute the rate of change of the heading angle using a forward finite difference
        # (0.02 is the hard-coded time step between consecutive frames)
        data_frame.loc[(data_frame['indiv'] == ind) &\
                      (data_frame['trial'] == tr), 'd_theta'] = np.nan

        data_frame.loc[(data_frame['indiv'] == ind) &\
                      (data_frame['trial'] == tr) &\
                      (data_frame['frame']<np.max(frame)), 'd_theta'] \
                        = (np.array(trial_df[trial_df['frame']>0]['theta'])\
                        - np.array(trial_df[trial_df['frame']<np.max(frame)]['theta']))/0.02

        # At the last frame we use a backward finite difference,
        # which equals the forward difference at the previous frame
        data_frame.loc[(data_frame['indiv'] == ind) &\
                      (data_frame['trial'] == tr) &\
                      (data_frame['frame']==np.max(frame)), 'd_theta'] \
        = np.array( data_frame.loc[(data_frame['indiv'] == ind) &\
                      (data_frame['trial'] == tr) &\
                      (data_frame['frame']==np.max(frame)-1), 'd_theta'])

        # Compute the swimming angle
        data_frame.loc[(data_frame['indiv'] == ind) &\
                      (data_frame['trial'] == tr), 'com_vel_theta'] \
                        = np.arctan( np.array(trial_df['swimvely'])/np.array(trial_df['swimvelx'])\
                                            * np.sign(trial_df['swimvelx'])\
                                           )
        trial_df = data_frame.loc[(data_frame['indiv'] == ind) & (data_frame['trial'] == tr)]

        # Compute the changes in swimming angle, similar to what we did with heading angle
        data_frame.loc[(data_frame['indiv'] == ind) &\
                      (data_frame['trial'] == tr),'com_vel_d_theta'] = np.nan
        data_frame.loc[(data_frame['indiv'] == ind) &\
                      (data_frame['trial'] == tr) & \
                    (data_frame['frame']<np.max(frame)), 'com_vel_d_theta'] \
                    = (np.array(trial_df[trial_df['frame']>0]['com_vel_theta'])\
                             - np.array(trial_df[trial_df['frame']<np.max(frame)]['com_vel_theta']))/0.02
        data_frame.loc[(data_frame['indiv'] == ind) &\
                      (data_frame['trial'] == tr) & \
                    (data_frame['frame']==np.max(frame)), 'com_vel_d_theta'] \
                    = np.array( data_frame.loc[(data_frame['indiv'] == ind) &\
                      (data_frame['trial'] == tr) & \
                    (data_frame['frame']==np.max(frame)-1), 'com_vel_d_theta'])


working on individual 5, trial 1
working on individual 5, trial 10
working on individual 5, trial 11
working on individual 5, trial 12
working on individual 5, trial 13
working on individual 5, trial 14
working on individual 5, trial 15
working on individual 5, trial 16
working on individual 5, trial 17
working on individual 5, trial 18
working on individual 5, trial 19
working on individual 5, trial 2
working on individual 5, trial 20
working on individual 5, trial 21
working on individual 5, trial 22
working on individual 5, trial 23
working on individual 5, trial 24
working on individual 5, trial 25
working on individual 5, trial 3
working on individual 5, trial 4
working on individual 5, trial 5
working on individual 5, trial 6
working on individual 5, trial 7
working on individual 5, trial 8
working on individual 5, trial 9
working on individual 5, trial 28
working on individual 5, trial 29
working on individual 5, trial 30
working on individual 5, trial 31
working on individual 5
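
The angular-rate stencil used in the loop above can be written compactly for a single per-frame series. This is a sketch, not the notebook's implementation: `forward_diff_repeat_last` is a hypothetical helper, the sample angles are made up, and the default `dt=0.02` simply mirrors the constant hard-coded in the loop.

```python
import numpy as np

def forward_diff_repeat_last(theta, dt=0.02):
    """Forward difference for all frames but the last; the last frame reuses the previous value."""
    theta = np.asarray(theta, dtype=float)
    d = np.empty_like(theta)
    d[:-1] = (theta[1:] - theta[:-1]) / dt
    d[-1] = d[-2]          # identical to a backward difference at the last frame
    return d

# Hypothetical heading angles, one value per frame
theta = np.array([0.0, 0.1, 0.3, 0.6])
d_theta = forward_diff_repeat_last(theta)
print(d_theta)
```

Note that the notebook stores one value per tracked point, so each per-frame value is repeated for every point of that frame; the sketch works on one value per frame only.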

# Save the dataset for future use!

In [36]:
data_frame.head(30)

Unnamed: 0,filename,date,indiv,trial,t,frame,point,mxmm,mymm,viscosity.cP,...,bodyaxisy,bodycoordx,bodycoordy,swimvelx,swimvely,swimvel,theta,d_theta,com_vel_theta,com_vel_d_theta
0,b'lamprey5-1-corr-midline.csv',15868,5,1,0.58,0.0,1.0,341.299771,88.110667,1.0,...,-0.311522,73.79677,-7.270154,-206.90573,-68.994977,218.106139,-0.316794,0.148989,-0.321865,-0.842216
1,b'lamprey5-1-corr-midline.csv',15868,5,1,0.58,0.0,2.0,349.324554,90.105424,1.0,...,-0.311522,65.549898,-6.665755,-206.90573,-68.994977,218.106139,-0.316794,0.148989,-0.321865,-0.842216
2,b'lamprey5-1-corr-midline.csv',15868,5,1,0.58,0.0,3.0,357.545449,91.064575,1.0,...,-0.311522,57.439287,-5.01619,-206.90573,-68.994977,218.106139,-0.316794,0.148989,-0.321865,-0.842216
3,b'lamprey5-1-corr-midline.csv',15868,5,1,0.58,0.0,4.0,365.759349,91.948042,1.0,...,-0.311522,49.3589,-3.296886,-206.90573,-68.994977,218.106139,-0.316794,0.148989,-0.321865,-0.842216
4,b'lamprey5-1-corr-midline.csv',15868,5,1,0.58,0.0,5.0,374.046024,91.78167,1.0,...,-0.311522,41.536407,-0.557311,-206.90573,-68.994977,218.106139,-0.316794,0.148989,-0.321865,-0.842216
5,b'lamprey5-1-corr-midline.csv',15868,5,1,0.58,0.0,6.0,382.245394,90.91437,1.0,...,-0.311522,34.015228,2.821113,-206.90573,-68.994977,218.106139,-0.316794,0.148989,-0.321865,-0.842216
6,b'lamprey5-1-corr-midline.csv',15868,5,1,0.58,0.0,7.0,390.32937,89.114436,1.0,...,-0.311522,26.894238,7.049815,-206.90573,-68.994977,218.106139,-0.316794,0.148989,-0.321865,-0.842216
7,b'lamprey5-1-corr-midline.csv',15868,5,1,0.58,0.0,8.0,398.584686,88.492048,1.0,...,-0.311522,19.243602,10.212945,-206.90573,-68.994977,218.106139,-0.316794,0.148989,-0.321865,-0.842216
8,b'lamprey5-1-corr-midline.csv',15868,5,1,0.58,0.0,9.0,406.743469,89.811668,1.0,...,-0.311522,11.079718,11.500629,-206.90573,-68.994977,218.106139,-0.316794,0.148989,-0.321865,-0.842216
9,b'lamprey5-1-corr-midline.csv',15868,5,1,0.58,0.0,10.0,414.012944,93.737974,1.0,...,-0.311522,2.948849,10.0343,-206.90573,-68.994977,218.106139,-0.316794,0.148989,-0.321865,-0.842216


In [37]:
# We save this processed dataset for future use!
data_frame.to_hdf("Data/processed_midline_all.h5", key="data")

your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block2_values] [items->Index(['filename', 'trial'], dtype='object')]

  data_frame.to_hdf("Data/processed_midline_all.h5", "data")
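
The warning above names `filename` and `trial` as object columns of mixed type. In this notebook `trial` was decoded to `str` while `filename` still holds `bytes`, which is a likely source of the mix. One option, not verified against this dataset, is to decode the remaining `bytes` columns before saving; `decode_bytes_columns` and the toy table are hypothetical.

```python
import pandas as pd

def decode_bytes_columns(df, encoding="ascii"):
    """Return a copy of df in which every all-bytes object column is decoded to str."""
    out = df.copy()
    for col in out.columns:
        is_bytes = out[col].map(lambda v: isinstance(v, bytes))
        if out[col].dtype == object and is_bytes.all():
            out[col] = out[col].map(lambda v: v.decode(encoding))
    return out

# Hypothetical two-row table mimicking the column types in the notebook
toy = pd.DataFrame({
    "filename": [b"lamprey5-1-corr-midline.csv", b"lamprey5-1-corr-midline.csv"],
    "trial": ["1", "1"],
    "frame": [0.0, 1.0],
})

clean = decode_bytes_columns(toy)
print(clean.dtypes)
print(type(clean["filename"].iloc[0]))
```

The object columns are still stored as Python strings, so this may silence the mixed-type message without changing how PyTables serializes them.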
