# Exploring data using Pandas

**Before starting with this notebook, the tutors will give a brief introduction**

Now that we have extracted the cell properties, we can address some biological questions. The data we have been analyzing come from a synthetic microbial cross-feeding community consisting of two auxotrophic strains of *E. coli* that can only grow by exchanging amino acids with each other. We are interested in understanding the dynamics of this community, and as a first question we would like to know how the frequency of the two cell types changes over time. Your task is to try to answer this question.

## Import packages

Before starting the code we need to import all the required packages.

We use a number of important Python packages:
- [Numpy](https://numpy.org): Go-to package for vector/matrix based calculations (heavily inspired by Matlab)
- [Pandas](https://pandas.pydata.org): Go-to package for handling data tables (heavily inspired by R)
- [Matplotlib](https://matplotlib.org): Go-to package for plotting data
- [Seaborn](https://seaborn.pydata.org): Fancy plots made easy (similar to ggplot in R)
- [pathlib](https://docs.python.org/3/library/pathlib.html): Path handling made easy

In [None]:
#next two lines make sure that Matplotlib plots are shown properly in Jupyter Notebook
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

#main data analysis packages
import numpy as np
import pandas as pd

#data plotting packages
import matplotlib
import matplotlib.pyplot as plt
#set default figure size
matplotlib.rc("figure", figsize=(10,5))
import seaborn as sns

#path handling
import pathlib

---
## Import Data
We start by specifying the path to our data:

In [None]:
#Set the path to the folder that contains project data
root = pathlib.Path(pathlib.Path.home(), 
                    'workdir/Project2A/ProcessedData/')

image_name = 'pos0_preproc-rg.tif' #set name of image
data_path = root /  image_name.replace('.tif','_cellprop.pkl')
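
The path construction above can be tried out in isolation. The following is a minimal sketch that uses a made-up folder (`/home/user/workdir/ProcessedData` is hypothetical) and `PurePosixPath`, so that it runs on any system without touching the file system:

```python
import pathlib

# Hypothetical folder, for illustration only (not the course directory)
root = pathlib.PurePosixPath('/home/user/workdir/ProcessedData')
image_name = 'pos0_preproc-rg.tif'

# The / operator joins path components;
# str.replace swaps the image extension for the data file ending
data_path = root / image_name.replace('.tif', '_cellprop.pkl')
print(data_path)

# name and suffix give access to parts of the file name
print(data_path.name)
print(data_path.suffix)
```

With a real `pathlib.Path`, calling `data_path.exists()` before loading is a quick way to check that the file is where you expect it to be.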

Next, we load the dataframe:

In [None]:
df = pd.read_pickle(data_path)
df.head()

----
## Working with Pandas Dataframes
Now let's analyze some data. You can manipulate Pandas dataframes in many ways, for example by extracting a column.  
An extracted column is known as a [Series](https://pandas.pydata.org/docs/user_guide/dsintro.html#series).

In [None]:
cell_length = df['axis_major_length'] #extract a column 
print('data type of extracted column = ', type(cell_length))
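
As a small illustration of the difference between a Series and a dataframe, here is a sketch on a toy table (the values are made up and are not the course data):

```python
import pandas as pd

# Toy table with made-up values
toy = pd.DataFrame({'axis_major_length': [4.0, 6.0, 9.0],
                    'area': [8.0, 12.0, 27.0]})

as_series = toy['axis_major_length']    # single brackets -> Series
as_frame = toy[['axis_major_length']]   # double brackets -> DataFrame with one column
as_array = as_series.to_numpy()         # the values as a NumPy array

print(type(as_series).__name__, type(as_frame).__name__, type(as_array).__name__)
```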

You can also add new columns, either from a Series, vector data, or a constant value:

In [None]:
cell_length = df['axis_major_length'] #extract a column 
cell_width = df['axis_minor_length'] #extract a column 

#add new column based on ratio of two Series
df["aspect_ratio"] = cell_length/cell_width

#show output
df.head()
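
The three ways of adding a column can be compared side by side. This is a minimal sketch on a toy table; the column names `cell_id` and `position` are hypothetical and only serve as illustration:

```python
import numpy as np
import pandas as pd

# Toy table with made-up values (not the course data)
toy = pd.DataFrame({'axis_major_length': [4.0, 6.0, 9.0],
                    'axis_minor_length': [2.0, 2.0, 3.0]})

# New column from a Series (element-wise ratio of two columns)
toy['aspect_ratio'] = toy['axis_major_length'] / toy['axis_minor_length']

# New column from a NumPy vector (needs one value per row)
toy['cell_id'] = np.arange(len(toy))

# New column from a constant (the value is repeated for every row)
toy['position'] = 0

print(toy)
```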

We can get a quick summary of the data using the [`describe` function](https://pandas.pydata.org/docs/getting_started/intro_tutorials/06_calculate_statistics.html):

In [None]:
df.describe()

We can also filter rows to extract all cells that fulfill a certain requirement. For example, we can select the largest 0.1% of cells using:

In [None]:
#find 99.9th percentile of cell lengths:
size_tr = df['axis_major_length'].quantile(0.999)

#select biggest cells:  
huge_cells = df[df['axis_major_length']>size_tr]
huge_cells.head(n=10)
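
To see what the filter does, it can help to try it on data where the answer is known. The sketch below uses a toy table of 1000 made-up cell lengths and also shows how two conditions can be combined:

```python
import numpy as np
import pandas as pd

# Toy table: 1000 made-up cells, 500 per frame, with lengths 1..1000
toy = pd.DataFrame({'frame': np.repeat([0, 1], 500),
                    'axis_major_length': np.arange(1, 1001, dtype=float)})

# The comparison returns a boolean Series that is used to select rows
size_tr = toy['axis_major_length'].quantile(0.999)
huge = toy[toy['axis_major_length'] > size_tr]
print(len(huge), 'of', len(toy), 'cells selected')

# Conditions are combined with & (and) or | (or);
# each condition needs its own parentheses
late_and_long = toy[(toy['frame'] == 1) & (toy['axis_major_length'] > 990)]
print(len(late_and_long), 'long cells in frame 1')
```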

For more details on how to use Pandas, see the examples below or consult the extensive [documentation online](https://pandas.pydata.org/docs/user_guide/index.html).

----
## Plotting and analyzing cell properties using Pandas & Matplotlib 

### Cell number over time
We will first look at how the number of cells changes over time.

We use the [`groupby`](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#splitting-an-object-into-groups) function on the `frame` property to group cells based on their frame. We can then count the number of cells at each frame by calling the `size()` method.

The result is a [Pandas Series](https://pandas.pydata.org/docs/user_guide/dsintro.html#series). We can directly plot the result using the built-in plot function.

In [None]:
cell_num_t = df.groupby('frame').size() #output is a Pandas Series, indexed by frame
cell_num_t.plot(xlabel='frame', ylabel='# of cells', figsize=(10,5)) #use built-in plot function
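
The counting step can be checked by hand on a toy table. This is a minimal sketch with made-up values (the `area` column is only a placeholder):

```python
import pandas as pd

# Toy table: 2 cells in frame 0, 3 in frame 1, 1 in frame 2
toy = pd.DataFrame({'frame': [0, 0, 1, 1, 1, 2],
                    'area': [10, 12, 11, 13, 9, 14]})

# One row per frame: the index holds the frame, the value the number of cells
cell_num_toy = toy.groupby('frame').size()
print(cell_num_toy)
```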

### Average cell properties over time
Next we will look at how the average properties of the cells, such as their fluorescent intensity, change over time.

Again we use the [`groupby`](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#splitting-an-object-into-groups) function on the `frame` property to group cells based on their frame. We can then calculate the average value for each group by calling the `mean()` function.
The output in this case is a Pandas dataframe, which shows the average value over all cells contained in a given frame (each row is a frame).

[Here](https://pbpython.com/groupby-agg.html) you can find an overview of how to group and aggregate data in Pandas.

In [None]:
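#numeric_only=True skips non-numeric columns (recent Pandas versions raise an error otherwise)
av_prop = df.groupby('frame').mean(numeric_only=True)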
av_prop['frame'] = av_prop.index
av_prop.head()
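
Besides `mean()`, several summary statistics can be computed in one go with `agg`. The sketch below uses a toy table with made-up intensities; the text column `label` is hypothetical and only there to show why `numeric_only=True` is useful:

```python
import pandas as pd

# Toy table with made-up values; 'label' is a non-numeric column
toy = pd.DataFrame({'frame': [0, 0, 1, 1],
                    'mean_intensity-0': [100.0, 200.0, 300.0, 500.0],
                    'label': ['a', 'b', 'c', 'd']})

# Average of all numeric columns per frame
av_toy = toy.groupby('frame').mean(numeric_only=True)

# reset_index() turns the frame index back into a regular column
av_toy = av_toy.reset_index()
print(av_toy)

# Several statistics for one column at once
summary = toy.groupby('frame')['mean_intensity-0'].agg(['mean', 'median', 'count'])
print(summary)
```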

We can plot the result in two ways: using [Matplotlib](https://matplotlib.org/stable/index.html) or [seaborn](https://seaborn.pydata.org/index.html).

Matplotlib is a lower-level package, giving you a lot of freedom but requiring quite a bit of code to make things look nice.

Seaborn is a higher-level package, making it easier to make nice-looking figures, at the cost of some flexibility.

First we show you Matplotlib:

In [None]:
#plot with Matplotlib
fig, ax = plt.subplots(figsize=(10,5))
ax.plot(av_prop['frame'],av_prop['mean_intensity-0'],label='channel 0')
ax.plot(av_prop['frame'],av_prop['mean_intensity-1'],label='channel 1')
ax.set_xlabel('frame nr')
ax.set_ylabel('fluorescent intensity')
ax.legend()

----
## Plotting with Seaborn

Now let's look at Seaborn:

In [None]:
#plot with Seaborn
p = sns.lineplot(data=av_prop[["mean_intensity-0", "mean_intensity-1"]])
p.set_ylabel("fluorescent intensity")

Similarly we can look at distributions and scatter plots:

In [None]:
fig, axs = plt.subplots(1,2, figsize=(12,6))
sns.histplot(ax=axs[0], data=df[["mean_intensity-0", "mean_intensity-1"]])
sns.scatterplot(ax=axs[1], data=df, x="mean_intensity-0", y="mean_intensity-1")

Seaborn also has many more advanced features. For example, you can automatically make a facet plot that shows a separate plot for each group. Here we visualize how the distribution of RFP intensities changes over time:

In [None]:
g = sns.FacetGrid(df, col="frame", col_wrap=10, height=2)
g.map(sns.histplot, "mean_intensity-0")

For more details, you can consult the [Example gallery](https://seaborn.pydata.org/examples/index.html) or [Tutorial section](https://seaborn.pydata.org/tutorial.html) on the Seaborn website.

---
## Data analysis: Quantifying Community Dynamics

> ### Exercise
> We are interested in the dynamics of the community, and would like to know how the fraction of red cells changes over time.   
> Try to come up with a way to calculate this.
> 
> Hints:
> - Think about how you can tell red and green cells apart in a reliable way
> - Classify cells as either red or green
> - Calculate the fraction of red cells over time

In [None]:
#enter code here

---

### Solution 1
To see the solution, uncomment (remove `#`) the `load` line below and run the cell twice (the first run loads the code, the second run executes it).

In [None]:
# %load ../Solutions/p0_classify_sol1.py

### Solution 2
To see the solution, uncomment (remove `#`) the `load` line below and run the cell twice (the first run loads the code, the second run executes it).

In [None]:
# %load ../Solutions/p0_classify_sol2.py
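
For reference, the three hints can be sketched on a toy table. This is only one possible approach and not necessarily the one used in the solution files. It assumes that channel 0 is the red channel and that a cell is red when its channel 0 intensity exceeds its channel 1 intensity; on real data this simple rule may need checking, for example with a scatter plot of the two channels. All values are made up.

```python
import pandas as pd

# Toy table with made-up intensities for two frames
toy = pd.DataFrame({
    'frame':            [0,   0,   0,   0,   1,   1,   1,   1],
    'mean_intensity-0': [900, 850, 100, 120, 950, 110, 90,  130],
    'mean_intensity-1': [100, 150, 800, 700, 120, 900, 850, 780],
})

# Classify: red if channel 0 is brighter than channel 1 (assumption, see above)
toy['is_red'] = toy['mean_intensity-0'] > toy['mean_intensity-1']

# The mean of a boolean column is the fraction of True values
frac_red = toy.groupby('frame')['is_red'].mean()
print(frac_red)
```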