# Data Exploration

We will now explore the data and try to gain some biological insight from it. 

**Before starting, the tutors will give an introduction. If you are ready for this step, please let them know!**

The dataset we will work with consists of time-lapse data of ...

---

## Import packages

Before running any code, we need to import all the required packages.

We use a number of important Python packages:
- [NumPy](https://numpy.org): Go-to package for vector/matrix-based calculations (heavily inspired by MATLAB)
- [Pandas](https://pandas.pydata.org): Go-to package for handling data tables (heavily inspired by R)
- [Matplotlib](https://matplotlib.org): Go-to package for plotting data
- [Seaborn](https://seaborn.pydata.org): Fancy plots made easy (similar to ggplot in R)
- [pathlib](https://docs.python.org/3/library/pathlib.html): Path handling made easy

In [None]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
%gui qt

import numpy as np
import pandas as pd

import matplotlib
import matplotlib.pyplot as plt
import matplotlib.cm as cm
matplotlib.rc("figure", figsize=(10,5))

import seaborn as sns

import pathlib

---

## Setup Folders
As always, we start by specifying the data paths:

In [None]:
proj_dir = pathlib.Path(pathlib.Path.home(), 'I2ICourse', 'Project2C')
data_path = proj_dir / 'cell_data.pkl'

---

## Load Data
We load the data from the last notebook:

In [None]:
#load data frame (data_path was defined in the Setup Folders cell)
df = pd.read_pickle(data_path)

---

## A bit more Pandas

For how to use Pandas, please refer to the Project 2A notebook.

Here we just give one last example. Sometimes you might be interested in a property of a cell at birth or at division. Pandas has a very neat way to get this: combine `groupby`, which groups the rows of each cell by its unique `lin_id`, with `first` or `last` to extract the first or last entry of each cell.

In [None]:
#get first and last frame of all cells
df_first_frame = df.groupby('lin_id').first()
df_last_frame = df.groupby('lin_id').last()

#we only want cells of which we have observed the full lineage,
#so we exclude the ones without a recorded division (NextDivisionFrame is NaN)
df_first_frame = df_first_frame.loc[df_first_frame['NextDivisionFrame'].notna()]
df_last_frame = df_last_frame.loc[df_last_frame['NextDivisionFrame'].notna()]

fig, axs = plt.subplots(1, 3, figsize=(12, 4))
sns.histplot(ax=axs[0], data=df_first_frame, x='Length')
sns.histplot(ax=axs[1], data=df_last_frame, x='Length')
sns.scatterplot(ax=axs[2], x=df_first_frame['Length'], y=df_last_frame['Length'])

#use the same length range on all axes
for ax in axs:
    ax.set(xlim=(0, 10))
axs[2].set(ylim=(0, 10))

#label the scatter plot axes explicitly
axs[2].set(xlabel='Length at birth', ylabel='Length at division')

titles = ['length at birth', 'length at division', 'length at division vs length at birth']
for ax, title in zip(axs, titles):
    ax.set_title(title)
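
To see what `groupby` with `first` and `last` does, here is a minimal sketch on a small made-up table. The column name `Frame` and all values are assumptions for illustration only; they are not taken from the course dataset. One detail worth knowing: by default `first` returns the first *non-missing* value of each column, so it can mix values from different rows. If you need the literal first row of each cell, `head(1)` is one alternative.

```python
import numpy as np
import pandas as pd

# Made-up table: two cells (lin_id 0 and 1), three frames each.
# 'Frame' is a hypothetical column name used only for this sketch.
toy = pd.DataFrame({
    "lin_id": [0, 0, 0, 1, 1, 1],
    "Frame":  [0, 1, 2, 3, 4, 5],
    "Length": [2.0, 3.0, 4.0, np.nan, 2.5, 5.0],
})

# Sort by time so that "first" and "last" mean birth and division.
toy = toy.sort_values(["lin_id", "Frame"])

first = toy.groupby("lin_id").first()  # first non-missing value per column
last = toy.groupby("lin_id").last()    # last non-missing value per column

# Literal first row of each cell (keeps the NaN of cell 1).
true_first = toy.groupby("lin_id").head(1).set_index("lin_id")

# Both tables are indexed by lin_id, so the subtraction aligns cell by cell.
added_length = last["Length"] - first["Length"]

print(first)
print(true_first)
print(added_length)
```

For cell 1, `first` reports `Frame` 3 together with `Length` 2.5, which comes from frame 4, whereas `true_first` keeps the missing value.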

---

## Exercise

Now it is time for some biology. Discuss with your tutor what kind of biological questions you could address with this data.  
Some ideas:
- How do cells respond to the nutrient switches? How do growth rate, cell size, etc. change? What is the variation between cells?
- Is the response to the first switch correlated with the response to the second switch?
- Are there any correlations between the phenotypes of closely related cells?
- Are there correlations between different cell properties (growth rate, size, etc.)?
- Does the phenotype of cells depend on the position in the channel?
- etc.
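
As a possible starting point for the correlation questions above, the sketch below averages each property per cell and then computes a Pearson correlation matrix. It runs on synthetic data, and the column name `GrowthRate` is a hypothetical placeholder; check `df.columns` for the names actually present in your table.

```python
import numpy as np
import pandas as pd

# Synthetic per-frame table: 50 cells, 5 frames each (made-up values).
rng = np.random.default_rng(0)
n_cells, n_frames = 50, 5
lin_id = np.repeat(np.arange(n_cells), n_frames)
cell_length = np.repeat(rng.uniform(2, 4, n_cells), n_frames)

toy = pd.DataFrame({
    "lin_id": lin_id,
    "Length": cell_length + rng.normal(0, 0.05, lin_id.size),
    # 'GrowthRate' is a hypothetical column, built to depend on length.
    "GrowthRate": 0.01 + 0.002 * cell_length + rng.normal(0, 0.0005, lin_id.size),
})

# One row per cell: average each property over the cell's life.
per_cell = toy.groupby("lin_id")[["Length", "GrowthRate"]].mean()

# Pearson correlation matrix between the per-cell averages.
corr = per_cell.corr()
print(corr)
```

Averaging per cell first avoids counting the same cell many times, which would otherwise make the correlation look more certain than it is.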

Bacmman calculates quite a few cell properties, but there might be other interesting phenotypes that it does not calculate by default. For example, Bacmman calculates the average growth rate over a cell's life, but the growth rate can change between division events, for instance when nutrients change. You might thus have to do some calculations of your own to find, e.g.:
- Cell elongation rate (rate at which cell length increases) as a function of time
  - Hint: cell length is described by $l(t) = l(0) \cdot e^{r\cdot t}$ where $r$ is the elongation/growth rate
  - Hint: elongation rates can change during the life of a cell; how could you quantify this?
- Lag time: the time cells need to start growing again after a nutrient switch
  - Hint: you can use either length increase or growth rate increase to measure lag time
- etc.

Think about how you could estimate these quantities from the data you do have. 
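
One possible approach, sketched here on a synthetic trace of a single cell: since $l(t) = l(0) \cdot e^{r\cdot t}$ implies $\log l(t) = \log l(0) + r\cdot t$, the local elongation rate is the slope of $\log l$ against time, which can be fitted in a short sliding window. The function name `elongation_rate`, the window size, the switch time, and the 25% threshold used for the lag time are all arbitrary assumptions for illustration, not the course's reference solution. On real data you would apply this per cell (per `lin_id`), because length drops at division.

```python
import numpy as np

def elongation_rate(time, length, window=5):
    """Slope of log(length) vs time, fitted in a sliding window (odd size)."""
    time = np.asarray(time, dtype=float)
    log_len = np.log(np.asarray(length, dtype=float))
    half = window // 2
    rate = np.full(time.size, np.nan)  # edges stay NaN (window incomplete)
    for i in range(half, time.size - half):
        sl = slice(i - half, i + half + 1)
        # Degree-1 fit: the first coefficient is the slope, i.e. the rate r.
        rate[i] = np.polyfit(time[sl], log_len[sl], 1)[0]
    return rate

# Synthetic trace: no growth until t = 20 min, then r = 0.02 per min.
t = np.arange(0, 60, 2.0)
length = 2.0 * np.exp(0.02 * np.clip(t - 20, 0, None))

rate = elongation_rate(t, length, window=5)

# Lag time: first time after the switch at which the rate exceeds a threshold.
t_switch = 0.0                       # assumed switch time for this toy trace
threshold = 0.25 * np.nanmax(rate)   # arbitrary choice, tune for real data
above = np.flatnonzero(rate > threshold)
lag_time = t[above[0]] - t_switch

print(rate)
print(lag_time)
```

A larger window gives smoother but slower-responding rate estimates, so the measured lag time depends on both the window size and the threshold.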

In [None]:
#Add your code here