In [0]:
import numpy as np
import pandas as pd

# Data Mapping

Data often arrive in a format whose values must be mapped to other values before they are useful for analysis or figure creation. A good example is the data output from the slide scanner.

Let's take a look at it:

In [111]:
df = pd.read_csv('slide_data.csv')

df.head()

Unnamed: 0,File,Spots1,Spots2,Spots3,Area,DMax,Dmin,Smax,Smin
0,Plate001_Well001_Object1.tif_projection.tif,0,0,0,11639576,661.4024,110.0419,1819,126
1,Plate001_Well002_Object1.tif_projection.tif,0,0,1,5042555,2362.7571,125.9414,1969,151
2,Plate001_Well003_Object1.tif_projection.tif,370,272,489,1515167,12584.084,141.2762,2509,254
3,Plate002_Well001_Object1.tif_projection.tif,0,0,7,10463116,1549.1947,127.4712,1848,157
4,Plate002_Well002_Object1.tif_projection.tif,0,0,31,7595761,2527.1138,111.0442,1887,133


Our dataframe contains experimental data, but no information about the conditions for the experiment. We only have the plate and well number for each sample. 

We know which conditions each plate and well number corresponds to, so we have created a mapping file. Let's open it:

In [88]:
condition_map = pd.read_csv('condition_map.csv')

condition_map.head(10)

Unnamed: 0,slide,temperature,maturity_days,plate,well
0,1,room,0,1,1
1,2,room,2,1,2
2,3,room,4,1,3
3,4,room,6,2,1
4,5,room,8,2,2
5,6,incubate,0,2,3
6,7,incubate,2,3,1
7,8,incubate,4,3,2
8,9,incubate,6,3,3
9,10,incubate,8,3,4


We have a new dataframe that contains the mapping from **plate** and **well** to the experimental conditions **temperature** and **maturity_days**.

## Goal

Our goal is to "map" the conditions into the `df` dataframe so that it contains both the experimental values and the experimental conditions. To do this, we need to understand a couple of useful functions:

- Python `zip` function
- Pandas `map` function

## Zip

`zip` is a built-in Python function that takes multiple lists (or other iterables) and pairs up their elements by position. In Python 3 it returns an iterator, so we wrap it in `list` to see the result.

Let's create two simple lists and try it out.

In [89]:
# Run this code cell and inspect the output

# Create two lists
list1 = ['a', 'b', 'd', 'r']
list2 = [1, 2, 3, 2]

# Zip the lists
zipped2 = zip(list1, list2)

# Inspect the output
list(zipped2)

[('a', 1), ('b', 2), ('d', 3), ('r', 2)]

We see from the above output that `zip` took the two lists and combined them into a single list of tuples. The first element of the new **zipped** list contains the first element of **list1** AND the first element of **list2**, and so on for each consecutive element.

The `zip` function will work with any number of lists. Let's create one more list to see it in action.

In [90]:
# Run this code cell and inspect the output

# Create a third list
list3 = ['steak sauce', 'bomber', 'js', 'd2']

# Zip all 3 lists
zipped3 = zip(list1, list2, list3)

# Inspect the output
list(zipped3)

[('a', 1, 'steak sauce'), ('b', 2, 'bomber'), ('d', 3, 'js'), ('r', 2, 'd2')]

Again, the first element of each list is combined to create the first element of the zipped list. The second element of each list is combined to create the second element of the zipped list... etc.
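
Two details of `zip` are worth keeping in mind before we use it on real data. The short sketch below uses made-up lists (`letters` and `numbers` are illustrative names, not part of the notebook above) to show that `zip` stops at the shortest input and that a zip object can only be consumed once.

```python
letters = ['a', 'b', 'd', 'r']
numbers = [1, 2, 3]  # one element shorter than letters

# zip stops at the shortest input, so 'r' is silently dropped
pairs = zip(letters, numbers)
first_pass = list(pairs)

# the zip object is now exhausted; a second pass yields nothing
second_pass = list(pairs)

print(first_pass)   # [('a', 1), ('b', 2), ('d', 3)]
print(second_pass)  # []
```

This is why the cells above wrap `zip` in `list` right away: the list can be inspected and reused as often as needed.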

## Map

The pandas `map` method (`Series.map`) replaces each value in a series with another value, which we can look up in a dictionary.

Let's create a simple series to see how it works.

In [91]:
# Create a pandas Series
series = pd.Series(['dog', 'cat', 'spider', 'worm'])

series

0       dog
1       cat
2    spider
3      worm
dtype: object

We also need to create a dictionary of substitutions. We will substitute *dog* with **puppy** and *cat* with **kitten**.

In [92]:
# Create substitution dictionary
subs_dict = {'dog': 'puppy', 'cat': 'kitten'}

subs_dict

{'cat': 'kitten', 'dog': 'puppy'}

Now that we have a series and a substitution dictionary, let's try the `map` function:

In [93]:
# Use map to do substitutions

sub_series = series.map(subs_dict)

sub_series

0     puppy
1    kitten
2       NaN
3       NaN
dtype: object

After `map`, we have a new `sub_series` with the substituted values. Notice that the values that were not listed in our dictionary (spider and worm) become `NaN` in the new `sub_series`, so every value we want to keep must have an entry in the dictionary.
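
If we would rather keep the original value when there is no dictionary entry, there are a couple of ways to do it. The sketch below is one possibility, not part of the workflow in this notebook: it fills the gaps left by `map` with the original values, and shows `Series.replace` as an alternative that leaves unlisted values alone.

```python
import pandas as pd

series = pd.Series(['dog', 'cat', 'spider', 'worm'])
subs_dict = {'dog': 'puppy', 'cat': 'kitten'}

# map returns NaN for values missing from the dictionary;
# fillna(series) puts the original value back at those positions
kept = series.map(subs_dict).fillna(series)

# replace only touches values that appear as dictionary keys
replaced = series.replace(subs_dict)

print(kept.tolist())      # ['puppy', 'kitten', 'spider', 'worm']
print(replaced.tolist())  # ['puppy', 'kitten', 'spider', 'worm']
```

For the slide data we actually want the `NaN` behaviour, because a missing condition should stand out rather than be filled in quietly.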

# Mapping Real Data

Using what we've learned above, we are ready to use our dataframe `df` and our `condition_map` to create a new dataframe containing both experimental values and conditions.

Remember what the data looks like:

In [94]:
df.head()

Unnamed: 0,File,Spots1,Spots2,Spots3,Area,DMax,Dmin,Smax,Smin
0,Plate001_Well001_Object1.tif_projection.tif,0,0,0,11639576,661.4024,110.0419,1819,126
1,Plate001_Well002_Object1.tif_projection.tif,0,0,1,5042555,2362.7571,125.9414,1969,151
2,Plate001_Well003_Object1.tif_projection.tif,370,272,489,1515167,12584.084,141.2762,2509,254
3,Plate002_Well001_Object1.tif_projection.tif,0,0,7,10463116,1549.1947,127.4712,1848,157
4,Plate002_Well002_Object1.tif_projection.tif,0,0,31,7595761,2527.1138,111.0442,1887,133


In [95]:
condition_map.head(10)

Unnamed: 0,slide,temperature,maturity_days,plate,well
0,1,room,0,1,1
1,2,room,2,1,2
2,3,room,4,1,3
3,4,room,6,2,1
4,5,room,8,2,2
5,6,incubate,0,2,3
6,7,incubate,2,3,1
7,8,incubate,4,3,2
8,9,incubate,6,3,3
9,10,incubate,8,3,4


- In dataframe `df` the *File* column contains a string with both plate and well number.
- The `condition_map` dataframe contains a plate and well column.

We will use these as keys to map the conditions and data into a single dataframe.

### Create `df` mapping key

To create a mapping key, we need the plate and well number in a string formatted like **1_3**, where the first number represents the plate and the second number represents the well.

The *File* column contains the plate and well numbers, so we will use string functions to extract the info we need.

Plate and well are separated by an underscore "_" in the filename, so let's use the pandas `Series.str.split` method (called here as `df['File'].str.split`) to get them into their own columns in a temporary dataframe.

In [96]:
# Split plate and well into columns in a new dataframe
split = df['File'].str.split('_', expand=True)

split.head()

Unnamed: 0,0,1,2,3
0,Plate001,Well001,Object1.tif,projection.tif
1,Plate001,Well002,Object1.tif,projection.tif
2,Plate001,Well003,Object1.tif,projection.tif
3,Plate002,Well001,Object1.tif,projection.tif
4,Plate002,Well002,Object1.tif,projection.tif


Let's work on extracting the plate and well numbers from the first two columns in the `split` dataframe. We notice that we can get the numbers by splitting on the string "00". This works because every number in this data set is a single digit padded with two zeros.

In [97]:
# Split the plate column on "00"
plate = split[0].str.split('00', expand=True)

plate.head()

Unnamed: 0,0,1
0,Plate,1
1,Plate,1
2,Plate,1
3,Plate,2
4,Plate,2


In [98]:
# Split the well column on "00"
well = split[1].str.split('00', expand=True)

well.head()

Unnamed: 0,0,1
0,Well,1
1,Well,2
2,Well,3
3,Well,1
4,Well,2


Now we have two new dataframes called `plate` and `well` that contain the plate/well number in the second column (index 1). Both columns are already strings, so we can concatenate them into a new series.

In [99]:
# Concatenate plate and well
plate_well = plate[1] + '_' + well[1]

plate_well.head()

0    1_1
1    1_2
2    1_3
3    2_1
4    2_2
Name: 1, dtype: object

This is the plate/well key that we need! Let's put it back in the dataframe `df`:

In [100]:
# Create new plate_well column in df
df['plate_well'] = plate_well

df.head()

Unnamed: 0,File,Spots1,Spots2,Spots3,Area,DMax,Dmin,Smax,Smin,plate_well
0,Plate001_Well001_Object1.tif_projection.tif,0,0,0,11639576,661.4024,110.0419,1819,126,1_1
1,Plate001_Well002_Object1.tif_projection.tif,0,0,1,5042555,2362.7571,125.9414,1969,151,1_2
2,Plate001_Well003_Object1.tif_projection.tif,370,272,489,1515167,12584.084,141.2762,2509,254,1_3
3,Plate002_Well001_Object1.tif_projection.tif,0,0,7,10463116,1549.1947,127.4712,1848,157,2_1
4,Plate002_Well002_Object1.tif_projection.tif,0,0,31,7595761,2527.1138,111.0442,1887,133,2_2
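
For filenames with larger plate or well numbers, a regular expression is one option for building the same key. The sketch below is an alternative under stated assumptions, not the method used in this notebook: it assumes every filename contains `Plate<digits>_Well<digits>`, and the second filename is a hypothetical example added to show two- and three-digit numbers.

```python
import pandas as pd

files = pd.Series([
    'Plate001_Well003_Object1.tif_projection.tif',
    'Plate012_Well100_Object1.tif_projection.tif',  # hypothetical filename
])

# Capture the digits after "Plate" and "Well" as named groups
parts = files.str.extract(r'Plate(?P<plate>\d+)_Well(?P<well>\d+)')

# Converting to int drops the zero padding: '001' -> 1, '012' -> 12
plate_num = parts['plate'].astype(int).astype(str)
well_num = parts['well'].astype(int).astype(str)

plate_well = plate_num + '_' + well_num
print(plate_well.tolist())  # ['1_3', '12_100']
```

The named groups become the column names of `parts`, which keeps the later concatenation readable.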


We need the same key in the `condition_map` dataframe. Because we have the plate and well number in separate columns, we can convert each to a string and skip straight to the concatenation step.

Let's create a new `condition_map` column called **plate_well**:

In [101]:
# Create new plate_well column in condition_map
condition_map['plate_well'] = condition_map['plate'].astype(str) + '_' + condition_map['well'].astype(str)

condition_map.head()

Unnamed: 0,slide,temperature,maturity_days,plate,well,plate_well
0,1,room,0,1,1,1_1
1,2,room,2,1,2,1_2
2,3,room,4,1,3,1_3
3,4,room,6,2,1,2_1
4,5,room,8,2,2,2_2


Both dataframes have a new column named **plate_well** that can be used as a key for mapping data from `condition_map` into `df`.
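
Before mapping, it can be worth checking that the two key columns actually line up. The sketch below uses small stand-in series rather than the real dataframes (`df_keys`, `map_keys`, and the key `'9_9'` are made up for illustration) to show one way of finding keys with no match and keys that repeat.

```python
import pandas as pd

# Small stand-ins for the two key columns; '9_9' is a hypothetical unmatched key
df_keys = pd.Series(['1_1', '1_2', '9_9'], name='plate_well')
map_keys = pd.Series(['1_1', '1_2', '1_3'], name='plate_well')

# Keys in the data that have no entry in the condition map
missing = df_keys[~df_keys.isin(map_keys)]

# Keys that occur more than once in the condition map
duplicated = map_keys[map_keys.duplicated()]

print(missing.tolist())     # ['9_9']
print(duplicated.tolist())  # []
```

A missing key would end up as `NaN` after `map`, and a repeated key would make the mapping ambiguous, so both lists should be empty before we continue.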

### Use `zip` and `map`

We will use the `condition_map` dataframe to create our mapping.

First we zip our **plate_well** key with the two conditions we want to add; namely, **temperature** and **maturity_days**.

In [102]:
zip_map = list(zip(condition_map['plate_well'], condition_map['temperature'], condition_map['maturity_days']))

zip_map

[('1_1', 'room', 0),
 ('1_2', 'room', 2),
 ('1_3', 'room', 4),
 ('2_1', 'room', 6),
 ('2_2', 'room', 8),
 ('2_3', 'incubate', 0),
 ('3_1', 'incubate', 2),
 ('3_2', 'incubate', 4),
 ('3_3', 'incubate', 6),
 ('3_4', 'incubate', 8)]

Create a mapping dictionary from the zipped values. Each **plate_well** key points to a tuple of (temperature, maturity_days).

In [103]:
mapping_dict = {p_w: (temp, days) for p_w, temp, days in zip_map}

mapping_dict

{'1_1': ('room', 0),
 '1_2': ('room', 2),
 '1_3': ('room', 4),
 '2_1': ('room', 6),
 '2_2': ('room', 8),
 '2_3': ('incubate', 0),
 '3_1': ('incubate', 2),
 '3_2': ('incubate', 4),
 '3_3': ('incubate', 6),
 '3_4': ('incubate', 8)}

Use `map` to look up the conditions for each **plate_well** key in `df`, then unpack the resulting tuples into a two-column dataframe.

In [108]:
mapped_df = pd.DataFrame(df.plate_well.map(mapping_dict).tolist())

mapped_df

Unnamed: 0,0,1
0,room,0
1,room,2
2,room,4
3,room,6
4,room,8
5,incubate,0
6,incubate,2
7,incubate,4
8,incubate,6
9,incubate,8
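
Because each dictionary value is a tuple, the mapped result has to be unpacked into a dataframe. A possible variation, sketched below with small made-up dataframes rather than the real files, is to build one dictionary per condition with `dict(zip(...))` and map each column on its own.

```python
import pandas as pd

# Small stand-ins for condition_map and df (values are illustrative)
condition_map = pd.DataFrame({
    'plate_well': ['1_1', '1_2', '2_3'],
    'temperature': ['room', 'room', 'incubate'],
    'maturity_days': [0, 2, 0],
})
df = pd.DataFrame({'plate_well': ['2_3', '1_1', '1_2']})

# One dictionary per condition: key -> single value
temp_dict = dict(zip(condition_map['plate_well'], condition_map['temperature']))
days_dict = dict(zip(condition_map['plate_well'], condition_map['maturity_days']))

# Each map call fills one new column, aligned on df's own index
df['temperature'] = df['plate_well'].map(temp_dict)
df['maturity_days'] = df['plate_well'].map(days_dict)

print(df)
```

Mapping one column at a time means the new columns are assigned straight from a series that shares the index of `df`, so there is no intermediate dataframe to line up.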


Now insert the values into the original dataframe!

In [110]:
df[['temperature','maturity_days']] = mapped_df

df

Unnamed: 0,File,Spots1,Spots2,Spots3,Area,DMax,Dmin,Smax,Smin,plate_well,temperature,maturity_days
0,Plate001_Well001_Object1.tif_projection.tif,0,0,0,11639576,661.4024,110.0419,1819,126,1_1,room,0
1,Plate001_Well002_Object1.tif_projection.tif,0,0,1,5042555,2362.7571,125.9414,1969,151,1_2,room,2
2,Plate001_Well003_Object1.tif_projection.tif,370,272,489,1515167,12584.084,141.2762,2509,254,1_3,room,4
3,Plate002_Well001_Object1.tif_projection.tif,0,0,7,10463116,1549.1947,127.4712,1848,157,2_1,room,6
4,Plate002_Well002_Object1.tif_projection.tif,0,0,31,7595761,2527.1138,111.0442,1887,133,2_2,room,8
5,Plate002_Well003_Object1.tif_projection.tif,0,0,0,11451702,584.117,114.7559,2052,130,2_3,incubate,0
6,Plate003_Well001_Object1.tif_projection.tif,0,0,0,13455481,721.2783,108.9951,1989,113,3_1,incubate,2
7,Plate003_Well002_Object1.tif_projection.tif,3,3,0,6983515,397.6721,111.6656,2033,136,3_2,incubate,4
8,Plate003_Well003_Object1.tif_projection.tif,0,0,0,7059987,762.9282,110.5655,1631,121,3_3,incubate,6
9,Plate003_Well004_Object1.tif_projection.tif,0,0,0,493406,213.945,139.2979,1247,169,3_4,incubate,8
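
The same result can also be reached with a join. The sketch below is an alternative to the `zip` and `map` approach, shown on small made-up dataframes (the values and the unmatched key `'9_9'` are illustrative): `DataFrame.merge` with `how='left'` keeps every row of the data and adds the condition columns by matching on **plate_well**.

```python
import pandas as pd

df = pd.DataFrame({
    'plate_well': ['1_1', '1_2', '9_9'],  # '9_9' is a hypothetical key with no match
    'Spots1': [0, 0, 370],
})
condition_map = pd.DataFrame({
    'plate_well': ['1_1', '1_2', '1_3'],
    'temperature': ['room', 'room', 'room'],
    'maturity_days': [0, 2, 4],
})

# Left join keeps every row of df; validate raises an error
# if a plate_well key appears more than once in condition_map
merged = df.merge(
    condition_map[['plate_well', 'temperature', 'maturity_days']],
    on='plate_well',
    how='left',
    validate='many_to_one',
)

print(merged)
```

Rows whose key has no match in `condition_map` get `NaN` in the new columns, just as they would with `map`.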
