# Explore Mother Machine Data

We have an experiment in which two carbon switches were performed (details to come).

In [2]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
%gui qt

import pandas as pd
import numpy as np
import seaborn as sns

import matplotlib.pyplot as plt

import pathlib

## Import Data from BACMMAN

Set the path to the csv file exported from Bacmman so that it can be loaded into Python.

Note that there is also a Python package that allows direct interaction between Python and Bacmman, for example to find and select problematic or interesting cells that you want to correct manually. A detailed explanation can be found [here](https://github.com/jeanollion/bacmman/wiki/Selections#create-selections-from-python).

If needed, you can manually edit the segmentation and tracking using the Bacmman GUI; see [here](https://github.com/jeanollion/bacmman/wiki/Data-Curation) for instructions. You can also watch [this screencast](https://www.github.com/jeanollion/bacmman/wiki/resources/screencast/manual_correction_dataset2.webm).

To save time we will skip these steps and use the data as is.

In [4]:
root = pathlib.Path.home() / 'I2ICourse'
proj_dir = root / 'Project2C'
bm_path = proj_dir / 'Bacmman' # location of the Bacmman data

data_set_name = "Project2C" # change to the actual name of the dataset
objectClassIdx = 1 # index of the object class to load; 1 = bacteria

# exported files are named <dataset name>_<object class index>.csv
file_name = f'{data_set_name}_{objectClassIdx}.csv'
file_path = bm_path / data_set_name / file_name

print(file_path)

/Users/simonvanvliet/I2ICourse/Project2C/Bacmman/Project2C/Project2C_1.csv


Now we read this in with pandas:

In [7]:
df = pd.read_csv(file_path, sep=';') 

## Bacmman data formats

Let's have a look at how Bacmman stores cell property data.

In [6]:
df.head()

Unnamed: 0,Position,PositionIdx,Indices,Frame,Idx,Time,BacteriaLineage,NextDivisionFrame,PreviousDivisionFrame,SizeRatio,...,TrackErrorPrev,BacteriaCenterX,BacteriaCenterY,BacteriaCenterZ,GrowthRateArea,SizeAtBirthArea,Size,GrowthRateLength,SizeAtBirthLength,Length


There is quite a lot of information here, but some of it is a bit obscure:
- `Position` is the name of the position (image)
- `PositionIdx` is an integer keeping track of which position you are in
- `Indices` corresponds to `frame_nr - channel_nr - cell_nr`
- `Frame` is the frame number
- `Idx` is the cell number (1 = mother cells)
- `BacteriaLineage` keeps track of the cell lineage (after each division a letter is added)

Annoyingly there is no field for channel, so let's add it. 

> **Exercise** 
> 
> Think about how you could do this
> 
> Hint: you can use the Python package [`re`](https://docs.python.org/3/library/re.html#) to extract it from the `Indices` field

In [None]:
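import re
# Indices has the form frame-channel-cell, so the second element is the channel
# (a raw string avoids the invalid "\-" escape sequence)
ChIdx = [int(re.split(r"-", ind)[1]) for ind in df['Indices']]
df['ChannelIdx'] = ChIdx
df.head()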
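
The same extraction can also be done without a Python loop, using the vectorized string methods of pandas. The sketch below runs on a small made-up table rather than the real `df`, and the column names `frame_nr`, `channel_nr` and `cell_nr` are only illustrative.

```python
import pandas as pd

# Toy stand-in for the Bacmman table (made-up values)
toy = pd.DataFrame({'Indices': ['0-0-0', '0-0-1', '1-2-0', '12-10-3']})

# Split 'frame-channel-cell' into three columns and convert them to integers
parts = toy['Indices'].str.split('-', expand=True).astype(int)
parts.columns = ['frame_nr', 'channel_nr', 'cell_nr']  # illustrative names

# The second element is the channel index
toy['ChannelIdx'] = parts['channel_nr']
print(toy)
```

On the real data this would amount to replacing `toy` with `df`.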

## Cell lineage information
Now let's look at the mother cell and its first offspring in the first channel. Try to understand how lineages are connected.

As you might notice, lineages in different channels have the same `BacteriaLineage` code. It is often very useful to have a unique lineage id: a number that stays constant throughout a cell's life and that belongs to only one cell in the data table. Can you come up with a good way to implement this?

In [64]:
df.loc[(df['ChannelIdx']==0) & (df['Idx']<2) & (df['Frame']<13)]

Unnamed: 0,Position,PositionIdx,Indices,Frame,Idx,Time,BacteriaLineage,NextDivisionFrame,PreviousDivisionFrame,SizeRatio,...,BacteriaCenterX,BacteriaCenterY,BacteriaCenterZ,GrowthRateArea,SizeAtBirthArea,Size,GrowthRateLength,SizeAtBirthLength,Length,ChannelIdx
0,dataset1_0-50,0,0-0-0,0,0,0.0,A,6.0,,,...,1.366,2.7596,0.0,0.032735,1.7776,1.7881,0.031729,2.7234,2.7738,0
1,dataset1_0-50,0,0-0-1,0,1,0.0,B,5.0,,,...,1.3461,5.4857,0.0,0.03764,1.8383,1.8474,0.035061,2.6976,2.7227,0
52,dataset1_0-50,0,1-0-0,1,0,4.0,A,6.0,,,...,1.3646,2.9181,0.0,0.032735,1.7776,2.0017,0.031729,2.7234,3.0825,0
53,dataset1_0-50,0,1-0-1,1,1,4.0,B,5.0,,,...,1.3459,6.1708,0.0,0.03764,1.8383,2.1401,0.035061,2.6976,3.0921,0
105,dataset1_0-50,0,2-0-0,2,0,8.0,A,6.0,,,...,1.3697,3.1221,0.0,0.032735,1.7776,2.3498,0.031729,2.7234,3.4644,0
106,dataset1_0-50,0,2-0-1,2,1,8.0,B,5.0,,,...,1.3519,6.823,0.0,0.03764,1.8383,2.4645,0.035061,2.6976,3.5873,0
162,dataset1_0-50,0,3-0-0,3,0,12.0,A,6.0,,,...,1.3716,3.3315,0.0,0.032735,1.7776,2.599,0.031729,2.7234,3.9041,0
163,dataset1_0-50,0,3-0-1,3,1,12.0,B,5.0,,,...,1.3567,7.5835,0.0,0.03764,1.8383,2.8641,0.035061,2.6976,3.9669,0
214,dataset1_0-50,0,4-0-0,4,0,16.0,A,6.0,,,...,1.3578,3.6236,0.0,0.032735,1.7776,2.9827,0.031729,2.7234,4.5302,0
215,dataset1_0-50,0,4-0-1,4,1,16.0,B,5.0,,,...,1.3467,8.5774,0.0,0.03764,1.8383,3.3902,0.035061,2.6976,4.8467,0
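
Since a letter is added to the lineage code after each division, the code itself tells you how cells are related. The sketch below assumes that a daughter's code is exactly its parent's code plus one extra letter. The codes used here are hypothetical, so check this assumption against your own table before relying on it.

```python
import pandas as pd

# Hypothetical lineage codes; the letters are made up for illustration
toy = pd.DataFrame({'BacteriaLineage': ['A', 'AH', 'AT', 'AHT', 'B']})

# Assumption: daughter code = parent code + one letter,
# so dropping the last letter gives the parent's code
# (cells present from the start get an empty string)
toy['ParentLineage'] = toy['BacteriaLineage'].str[:-1]

# Number of letters added so far = number of divisions since the first frame
toy['Generation'] = toy['BacteriaLineage'].str.len() - 1
print(toy)
```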


To uniquely identify a cell lineage we need three pieces of information:
- `PositionIdx`
- `ChannelIdx`
- `BacteriaLineage`

> **Exercise** 
> Think about how you could add a unique lineage id to the dataframe

Below we give an example of how to combine these fields into a unique identifier. Sometimes it is handy if the identifier is a simple integer, so we also make one of those.

In [None]:
#combine PositionIdx-ChannelIdx-BacteriaLineage into single string and add string lin_id_str property
df['lin_id_str'] = df['PositionIdx'].map(str) + '-' + df['ChannelIdx'].map(str) + '-' + df['BacteriaLineage'].map(str)

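#find unique strings (sorted, so that the numbering is reproducible)
uniq_lin_ids = sorted(set(df['lin_id_str']))
#map each unique string to an integer (a set has no .index method)
lin_id_map = {id_str: i for i, id_str in enumerate(uniq_lin_ids)}
lin_id = [lin_id_map[id_str] for id_str in df['lin_id_str']]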

#add integer lin_id property
df['lin_id'] = lin_id

#show data-frame
df.head()
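
An alternative way to turn the id strings into integers is `pd.factorize`, which does the lookup in one call. This is a minimal sketch on made-up id strings, not the course's own solution. Note that `factorize` numbers the ids in order of first appearance, so the integers may differ from those produced by other approaches.

```python
import pandas as pd

# Made-up lineage id strings of the form PositionIdx-ChannelIdx-BacteriaLineage
toy = pd.DataFrame({'lin_id_str': ['0-0-A', '0-0-B', '0-0-A', '0-1-A', '0-0-B']})

# factorize returns one integer code per row plus the unique values
codes, uniques = pd.factorize(toy['lin_id_str'])
toy['lin_id'] = codes
print(toy)
```

The `uniques` array lets you go back from an integer id to its string: `uniques[2]` is the string that was given id 2.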

Now we can extract a cell lineage (e.g. nr 6) as:

In [None]:
df_sub = df.loc[df['lin_id']==6]
df_sub.head()
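
With a unique lineage id in place you can also summarize all lineages at once instead of extracting them one by one. The sketch below uses `groupby` on a small made-up table. The summary column names (`first_frame`, `last_frame`, `n_frames`, `max_length`) are only illustrative.

```python
import pandas as pd

# Made-up data: three lineages tracked over a few frames
toy = pd.DataFrame({
    'lin_id': [0, 1, 0, 1, 0, 2],
    'Frame':  [0, 0, 1, 1, 2, 2],
    'Length': [2.7, 2.6, 3.1, 3.0, 3.5, 1.9],
})

# One row per lineage: first and last frame, number of frames, maximum length
summary = toy.groupby('lin_id').agg(
    first_frame=('Frame', 'min'),
    last_frame=('Frame', 'max'),
    n_frames=('Frame', 'count'),
    max_length=('Length', 'max'),
)
print(summary)
```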

### Saving
This would be a good time to save your data. 

In [None]:
save_name = proj_dir / 'cell_data.pkl'
df.to_pickle(save_name)
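
To check that the saved file can be read back, you can reload it with `pd.read_pickle` and compare it to the original. The sketch below does this with a small made-up table in a temporary folder, so it does not touch your project directory.

```python
import pathlib
import tempfile

import pandas as pd

# Made-up table standing in for the full data frame
toy = pd.DataFrame({'lin_id': [0, 1], 'Length': [2.7, 2.6]})

with tempfile.TemporaryDirectory() as tmp_dir:
    save_name = pathlib.Path(tmp_dir) / 'cell_data.pkl'
    toy.to_pickle(save_name)             # save
    reloaded = pd.read_pickle(save_name)  # load again

# True if values, column names and dtypes all survived the round trip
print(reloaded.equals(toy))
```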