<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/80x15.png" /></a><div align="center">This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.</div>

----

# Data Analysis with Pandas

[Pandas](https://pandas.pydata.org/) is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

## Preamble

Before using Pandas, we must import it.  Pandas depends on NumPy and can interchange data with it, so let's import both of them using abbreviated names.

In [1]:
import numpy as np
import pandas as pd

Evaluating the following cell allows us to display plots *in* the Python notebook:

In [2]:
%matplotlib inline

import matplotlib.pyplot as plt

# make large figures so we can appreciate details
plt.rc('figure', figsize=(20.0, 15.0))

import seaborn as sea

# use visual style like R's ggplot2
sea.set_style('darkgrid')

----

## Exercise 8.A

Load the "metadata" files into a list `mds` of tables; that is to say, `mds[0]` should be a Pandas `DataFrame` holding the values read from file `BIO325_CRISPR_Yap_p1_D07_Cells_metadata.csv`, `mds[1]` holds the values from `BIO325_CRISPR_Yap_p1_D08_Cells_metadata.csv`, and so on.

Similarly, load the "feature values" files into a list of tables `fvs`.

How many rows are in each table?

In [21]:
mds = list(map(pd.read_csv, [
    'BIO325_CRISPR_Yap_p1_D07_Cells_metadata.csv',
    'BIO325_CRISPR_Yap_p1_D08_Cells_metadata.csv',
    'BIO325_CRISPR_Yap_p1_D09_Cells_metadata.csv',
    'BIO325_CRISPR_Yap_p1_D10_Cells_metadata.csv',
]))

fvs = list(map(pd.read_csv, [
    'BIO325_CRISPR_Yap_p1_D07_Cells_feature-values.csv.gz',
    'BIO325_CRISPR_Yap_p1_D08_Cells_feature-values.csv.gz',
    'BIO325_CRISPR_Yap_p1_D09_Cells_feature-values.csv.gz',
    'BIO325_CRISPR_Yap_p1_D10_Cells_feature-values.csv.gz',
]))

In [22]:
mds[1].head()

Unnamed: 0,mapobject_id,plate_name,well_name,well_pos_y,well_pos_x,tpoint,zplane,label,is_border,Classification-5,TPlus
0,512061,p1,D08,0,0,0,0,1,1,0.0,0.0
1,512062,p1,D08,0,0,0,0,2,1,0.0,0.0
2,512063,p1,D08,0,0,0,0,3,1,0.0,0.0
3,512064,p1,D08,0,0,0,0,4,1,0.0,0.0
4,512065,p1,D08,0,0,0,0,5,1,0.0,0.0


In [23]:
fvs[2].head()

Unnamed: 0,mapobject_id,Nuclei_Intensity_max_A02_C01,Nuclei_Intensity_mean_A02_C01,Nuclei_Intensity_min_A02_C01,Nuclei_Intensity_sum_A02_C01,Nuclei_Intensity_std_A02_C01,Intensity_max_A01_C02,Intensity_mean_A01_C02,Intensity_min_A01_C02,Intensity_sum_A01_C02,...,Texture_LBP-radius-5-26_A01_C03,Texture_LBP-radius-5-27_A01_C03,Texture_LBP-radius-5-28_A01_C03,Texture_LBP-radius-5-29_A01_C03,Texture_LBP-radius-5-30_A01_C03,Texture_LBP-radius-5-31_A01_C03,Texture_LBP-radius-5-32_A01_C03,Texture_LBP-radius-5-33_A01_C03,Texture_LBP-radius-5-34_A01_C03,Texture_LBP-radius-5-35_A01_C03
0,533047,732.0,300.344502,122.0,349601.0,111.768719,13577.0,4207.824389,366.0,16700855.0,...,64.0,142.0,6.0,26.0,21.0,82.0,53.0,27.0,259.0,172.0
1,533048,387.0,243.867097,124.0,377994.0,43.024769,9372.0,3547.173381,267.0,13912014.0,...,47.0,183.0,1.0,14.0,7.0,58.0,55.0,18.0,203.0,127.0
2,533049,323.0,223.507143,122.0,281619.0,43.917192,7575.0,3285.757202,363.0,7984390.0,...,40.0,118.0,1.0,7.0,0.0,35.0,18.0,4.0,97.0,40.0
3,533050,370.0,238.769716,124.0,302760.0,47.942993,6527.0,2297.226913,290.0,12786365.0,...,95.0,248.0,7.0,18.0,17.0,126.0,83.0,25.0,361.0,268.0
4,533051,297.0,217.765554,122.0,287015.0,33.750384,7825.0,3038.53232,511.0,22187363.0,...,104.0,366.0,6.0,43.0,32.0,122.0,85.0,42.0,421.0,302.0
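
The question about row counts can be answered with `len()` or the `.shape` attribute. The following is a minimal sketch on small made-up tables (the name `tables` and its contents are hypothetical stand-ins for `mds` or `fvs`, not the real data):

```python
import pandas as pd

# Hypothetical stand-ins for the tables loaded with pd.read_csv
tables = [
    pd.DataFrame({'mapobject_id': [1, 2, 3]}),
    pd.DataFrame({'mapobject_id': [4, 5]}),
]

# len(df) is the number of rows; df.shape is the pair (rows, columns)
row_counts = [len(t) for t in tables]
print(row_counts)  # [3, 2]
```

With the real data, the same list comprehension could be applied to `mds` and to `fvs`.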


----

## Exercise 8.B

What is the mean value of column `Intensity_mean_A01_C03` in the "feature values" of each well?

In [24]:
for i in range(4):
    print(np.mean(fvs[i]['Intensity_mean_A01_C03']))

178.42255450758492
185.8060821058702
179.5681108325277
183.5100216211156
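
An equivalent way to write this uses the `.mean()` method of the column itself. The sketch below runs on made-up numbers (the list `fvs_demo` is a hypothetical stand-in for `fvs`):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the per-well "feature values" tables
fvs_demo = [
    pd.DataFrame({'Intensity_mean_A01_C03': [1.0, 2.0, 3.0]}),
    pd.DataFrame({'Intensity_mean_A01_C03': [10.0, np.nan, 30.0]}),
]

# Series.mean() skips missing values by default (skipna=True)
means = [t['Intensity_mean_A01_C03'].mean() for t in fvs_demo]
print(means)  # [2.0, 20.0]
```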


----

## Exercise 8.C

How many unique values are in column `TPlus` in the metadata of each well?

In [25]:
for i in range(4):
    print(len(np.unique(mds[i]['TPlus'])))

2
2
2
2
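
Pandas also offers `Series.nunique()` for this count, and `Series.value_counts()` shows how often each value occurs. A minimal sketch on made-up data (`md_demo` is a hypothetical stand-in for one metadata table):

```python
import pandas as pd

# Hypothetical stand-in for one metadata table
md_demo = pd.DataFrame({'TPlus': [0.0, 1.0, 1.0, 0.0, 1.0]})

# Number of distinct values in the column
n_unique = md_demo['TPlus'].nunique()
print(n_unique)  # 2

# How many rows carry each distinct value
counts = md_demo['TPlus'].value_counts()
print(counts[1.0], counts[0.0])  # 3 2
```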


----

## Exercise 8.D

The `is_border` column in a "metadata" table tells you whether a cell lies at the border of an acquisition site or not (1 = lies at the border, 0 = neither touches nor crosses the border).

Can you count the number of "border" cells in each well?

In [26]:
for i in range(4):
    print(np.sum(mds[i]['is_border']))

2197
2282
1884
2500
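
Because `is_border` only holds the values 0 and 1, its sum is the number of border cells and its mean is the fraction of border cells. A small sketch on made-up data (`md_demo` is a hypothetical stand-in for one metadata table):

```python
import pandas as pd

# Hypothetical stand-in for one metadata table
md_demo = pd.DataFrame({'is_border': [1, 0, 0, 1, 0]})

# Sum of a 0/1 column = number of rows flagged as border cells
n_border = int(md_demo['is_border'].sum())
print(n_border)  # 2

# Mean of a 0/1 column = fraction of rows flagged as border cells
border_fraction = md_demo['is_border'].mean()
print(border_fraction)  # 0.4
```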


-----

## Exercise 8.E

Make stacked tables:

- `md`, combining metadata for all wells
- `fv`, combining feature values for all wells

In [39]:
md = pd.concat(mds)
fv = pd.concat(fvs)
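
Note that `pd.concat` keeps each table's original row labels by default, so the stacked table may contain repeated index values. The sketch below, on two tiny made-up tables (`a` and `b` are hypothetical), shows the difference that `ignore_index=True` makes:

```python
import pandas as pd

# Two hypothetical per-well tables, each with the default index 0, 1
a = pd.DataFrame({'mapobject_id': [1, 2]})
b = pd.DataFrame({'mapobject_id': [3, 4]})

# Default behaviour: the original row labels are kept, so they repeat
stacked = pd.concat([a, b])
print(stacked.index.tolist())  # [0, 1, 0, 1]

# ignore_index=True renumbers the rows of the result from 0
renumbered = pd.concat([a, b], ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```

Repeated labels do not affect the join on `mapobject_id` in the next exercise, but they matter whenever rows are later selected by label.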

----

## Exercise 8.F

Make a single large table `all` by joining tables `md` and `fv` over the common column `mapobject_id`.

How many rows are in the combined table?

In [40]:
all = md.merge(fv, how='inner', on=['mapobject_id'])

If your solution is correct, evaluating the following cells should give the results already shown:

In [41]:
all.shape

(100568, 215)

In [42]:
all.head()

Unnamed: 0,mapobject_id,plate_name,well_name,well_pos_y,well_pos_x,tpoint,zplane,label,is_border,Classification-5,...,Texture_LBP-radius-5-26_A01_C03,Texture_LBP-radius-5-27_A01_C03,Texture_LBP-radius-5-28_A01_C03,Texture_LBP-radius-5-29_A01_C03,Texture_LBP-radius-5-30_A01_C03,Texture_LBP-radius-5-31_A01_C03,Texture_LBP-radius-5-32_A01_C03,Texture_LBP-radius-5-33_A01_C03,Texture_LBP-radius-5-34_A01_C03,Texture_LBP-radius-5-35_A01_C03
0,566404,p1,D07,0,0,0,0,1,1,0.0,...,52.0,133.0,1.0,20.0,21.0,72.0,46.0,22.0,214.0,153.0
1,566405,p1,D07,0,0,0,0,2,1,0.0,...,50.0,217.0,6.0,20.0,15.0,78.0,44.0,16.0,225.0,119.0
2,566406,p1,D07,0,0,0,0,3,1,0.0,...,100.0,404.0,5.0,54.0,27.0,171.0,83.0,38.0,489.0,309.0
3,566407,p1,D07,0,0,0,0,4,1,0.0,...,30.0,186.0,1.0,4.0,5.0,42.0,19.0,6.0,135.0,66.0
4,566408,p1,D07,0,0,0,0,5,1,0.0,...,65.0,246.0,3.0,24.0,22.0,94.0,46.0,37.0,259.0,185.0
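
To see how an inner join decides which rows survive, here is a minimal sketch on made-up tables (`md_demo`, `fv_demo` and the column `value` are hypothetical). The result is called `merged` rather than `all` only to avoid shadowing Python's built-in `all()` function; the optional `validate` argument makes Pandas raise an error if `mapobject_id` is repeated on either side:

```python
import pandas as pd

# Hypothetical metadata and feature tables sharing the key column
md_demo = pd.DataFrame({'mapobject_id': [1, 2, 3],
                        'well_name': ['D07', 'D07', 'D08']})
fv_demo = pd.DataFrame({'mapobject_id': [2, 3, 4],
                        'value': [0.2, 0.3, 0.4]})

# Inner join: keep only the ids present in both tables
merged = md_demo.merge(fv_demo, how='inner', on='mapobject_id',
                       validate='one_to_one')

print(merged.shape)  # (2, 3)
print(merged['mapobject_id'].tolist())  # [2, 3]
```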


----

## Exercise 8.G

Make a table `good` by extracting from `all` only rows which refer to objects that are *not* "border" objects.

How many good objects are there?  

*Advanced:* could you compute this number from the selector alone?

In [43]:
good_rows = (all['is_border'] == 0)

good = all.loc[good_rows]

If your solution is correct, evaluating the following cell gives the result `(91705, 215)`.

In [44]:
good.shape

(91705, 215)

This can be computed from the selector alone by noting that the selector is just an array of booleans, where `True` counts as `1` and `False` counts as `0`, so summing the array gives the required number:

In [45]:
np.sum(good_rows)

91705

----

## Exercise 8.H

Make two tables `all0` and `all1` by splitting table `all` on the two values of column `TPlus` (`0` or `1`).
What is the mean of column `Intensity_mean_A01_C03` in each table?  And the standard deviation?

In [46]:
tp0_rows = (all['TPlus'] == 0)
tp1_rows = (all['TPlus'] == 1)  # or: tp1_rows = ~tp0_rows

all0 = all.loc[tp0_rows]
all1 = all.loc[tp1_rows]

The `.describe()` method returns the basic descriptive statistics:

In [47]:
all0['Intensity_mean_A01_C03'].describe()

count    84165.000000
mean       183.761707
std         22.503610
min        100.447304
25%        171.670273
50%        183.710159
75%        196.467103
max        824.712494
Name: Intensity_mean_A01_C03, dtype: float64
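
The mean and standard deviation for both `TPlus` groups can also be obtained in one step with `groupby`, without building the two tables first. A minimal sketch on made-up numbers (the table `demo` is hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for the combined table
demo = pd.DataFrame({
    'TPlus': [0.0, 0.0, 1.0, 1.0],
    'Intensity_mean_A01_C03': [1.0, 3.0, 10.0, 14.0],
})

# One row per TPlus value, one column per statistic
stats = demo.groupby('TPlus')['Intensity_mean_A01_C03'].agg(['mean', 'std'])
print(stats)
```

Here `std` is the sample standard deviation (normalized by N-1), which is also what `.describe()` reports.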

-----

## Exercise 8.J

Make a 2x2 grid of plots, showing box plots of the mean Intensity of the *Yap* channel for each of the wells D07, D08, D09, D10.
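
One possible approach is sketched below on synthetic data. The column name `yap_mean_intensity` and the random values are assumptions made for illustration; which column of the real table holds the *Yap* channel is not stated here, so substitute the appropriate one.

```python
import matplotlib
matplotlib.use('Agg')  # only needed outside a notebook; drop when using %matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the combined table (values are made up)
rng = np.random.default_rng(0)
wells = ['D07', 'D08', 'D09', 'D10']
demo = pd.DataFrame({
    'well_name': np.repeat(wells, 50),
    'yap_mean_intensity': rng.normal(180.0, 20.0, size=200),
})

# One subplot per well, arranged in a 2x2 grid with a shared y axis
fig, axes = plt.subplots(2, 2, figsize=(8, 6), sharey=True)
for ax, well in zip(axes.flat, wells):
    values = demo.loc[demo['well_name'] == well, 'yap_mean_intensity']
    ax.boxplot(values.to_numpy())
    ax.set_title(well)
fig.tight_layout()
```

Sharing the y axis makes the four boxes directly comparable across wells.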

-----

## Exercise 8.K

Modify the above violin plot so that only *one* (asymmetrical) violin is shown per well: the two halves of the violin should show the distribution for TPlus-positive and TPlus-null data.
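
Seaborn's `violinplot` can draw such two-sided violins when it is given a `hue` column with exactly two values together with `split=True`. The sketch below uses synthetic data; the column name `intensity` and the random values are assumptions, not the real measurements.

```python
import matplotlib
matplotlib.use('Agg')  # only needed outside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sea

# Synthetic stand-in: 4 wells, 100 cells each, half of them TPlus-positive
rng = np.random.default_rng(0)
wells = ['D07', 'D08', 'D09', 'D10']
demo = pd.DataFrame({
    'well_name': np.repeat(wells, 100),
    'TPlus': np.tile([0.0, 1.0], 200),
    'intensity': rng.normal(180.0, 20.0, size=400),
})

# hue + split=True: each half of a violin shows one TPlus group
fig, ax = plt.subplots(figsize=(8, 5))
sea.violinplot(data=demo, x='well_name', y='intensity',
               hue='TPlus', split=True, ax=ax)
```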

----

## Exercise 8.L

Make a 2x2 grid of the violin plots, showing only one well per plot.
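
A possible way to arrange the plots is to create the grid with `plt.subplots` and pass each axis to Seaborn through the `ax` argument. This is a sketch on synthetic data (the column name `intensity` and the values are assumptions):

```python
import matplotlib
matplotlib.use('Agg')  # only needed outside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sea

# Synthetic stand-in: 4 wells, 100 cells each, half of them TPlus-positive
rng = np.random.default_rng(0)
wells = ['D07', 'D08', 'D09', 'D10']
demo = pd.DataFrame({
    'well_name': np.repeat(wells, 100),
    'TPlus': np.tile([0.0, 1.0], 200),
    'intensity': rng.normal(180.0, 20.0, size=400),
})

# One split violin per subplot, one well per subplot
fig, axes = plt.subplots(2, 2, figsize=(8, 6), sharey=True)
for ax, well in zip(axes.flat, wells):
    subset = demo[demo['well_name'] == well]
    sea.violinplot(data=subset, x='well_name', y='intensity',
                   hue='TPlus', split=True, ax=ax)
    ax.set_title(well)
fig.tight_layout()
```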

-----