**Notebook - Combining JUMP Metadata**

This notebook loads the JUMP-DP plate, well and compound metadata files and merges them into a single dataframe with one row per well.

# Imports:

In [1]:
import pandas as pd
import numpy as np

# Load Metadata:
Files downloaded from: https://github.com/jump-cellpainting/datasets/tree/main/metadata

## Plates:

In [2]:
plates = pd.read_csv('../data/metadata/plate.csv.gz')
print(plates.shape)
plates.head(2)

(2378, 4)


Unnamed: 0,Metadata_Source,Metadata_Batch,Metadata_Plate,Metadata_PlateType
0,source_1,Batch1_20221004,UL000109,COMPOUND_EMPTY
1,source_1,Batch1_20221004,UL001641,COMPOUND


## Wells:

In [4]:
wells = pd.read_csv('../data/metadata/well.csv.gz')
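# Cast every column to string so plate IDs such as 110000296383 are treated as text
# (applymap is deprecated in recent pandas versions)
wells = wells.astype(str)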
print(wells.shape)
wells.head(2)

(1096074, 4)


Unnamed: 0,Metadata_Source,Metadata_Plate,Metadata_Well,Metadata_JCP2022
0,source_1,UL000081,A02,JCP2022_033924
1,source_1,UL000081,A03,JCP2022_085227


## Compounds:

In [7]:
compound = pd.read_csv('../data/metadata/compound.csv.gz')
print(compound.shape)
compound.head(2)

(116753, 3)


Unnamed: 0,Metadata_JCP2022,Metadata_InChIKey,Metadata_InChI
0,JCP2022_000001,AAAHWCWPZPSPIW-UHFFFAOYSA-N,InChI=1S/C25H31N5O2/c1-4-23-26-14-16-30(23)24-...
1,JCP2022_000002,AAAJHRMBUHXWLD-UHFFFAOYSA-N,InChI=1S/C11H13ClN2O/c12-10-4-2-9(3-5-10)8-14-...


- 'JCP2022_999999' marks untreated wells (these wells contain only cells); see the comment at https://github.com/jump-cellpainting/datasets/issues/49.

- This row has no InChIKey or InChI, so the cell below replaces its missing values with 'NON_COMPOUND'. That way the merged dataframes contain no null values.

In [8]:
compound.loc[compound.Metadata_InChIKey.isnull(), ['Metadata_InChIKey', 'Metadata_InChI']] = 'NON_COMPOUND'

In [9]:
# Check that the above code ran successfully:
compound.loc[compound['Metadata_JCP2022'] == 'JCP2022_999999']

Unnamed: 0,Metadata_JCP2022,Metadata_InChIKey,Metadata_InChI
116752,JCP2022_999999,NON_COMPOUND,NON_COMPOUND


# Combining Metadata:

## Merging compound with well data:

In [13]:
c_well = pd.merge(compound, wells, on='Metadata_JCP2022', how='inner')
print(c_well.shape)
c_well.head()

(945604, 6)


Unnamed: 0,Metadata_JCP2022,Metadata_InChIKey,Metadata_InChI,Metadata_Source,Metadata_Plate,Metadata_Well
0,JCP2022_000001,AAAHWCWPZPSPIW-UHFFFAOYSA-N,InChI=1S/C25H31N5O2/c1-4-23-26-14-16-30(23)24-...,source_1,UL001783,C29
1,JCP2022_000001,AAAHWCWPZPSPIW-UHFFFAOYSA-N,InChI=1S/C25H31N5O2/c1-4-23-26-14-16-30(23)24-...,source_10,Dest210622-144945,C07
2,JCP2022_000001,AAAHWCWPZPSPIW-UHFFFAOYSA-N,InChI=1S/C25H31N5O2/c1-4-23-26-14-16-30(23)24-...,source_3,B40803aW,B15
3,JCP2022_000001,AAAHWCWPZPSPIW-UHFFFAOYSA-N,InChI=1S/C25H31N5O2/c1-4-23-26-14-16-30(23)24-...,source_6,110000296383,B15
4,JCP2022_000002,AAAJHRMBUHXWLD-UHFFFAOYSA-N,InChI=1S/C11H13ClN2O/c12-10-4-2-9(3-5-10)8-14-...,source_1,UL000087,D43
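
The inner merge returns 945,604 rows while the well table has 1,096,074, so roughly 150,470 well rows have no matching entry in the compound table (assuming each JCP2022 ID appears only once in `compound`). A minimal sketch of how those rows could be inspected, using a left merge with `indicator=True` on toy data; the ID 'JCP2022_900001' and the tiny tables are made up for illustration:

```python
import pandas as pd

# Toy stand-ins for the compound and well tables (values are illustrative only).
compound = pd.DataFrame({
    "Metadata_JCP2022": ["JCP2022_000001", "JCP2022_999999"],
    "Metadata_InChIKey": ["AAAHWCWPZPSPIW-UHFFFAOYSA-N", "NON_COMPOUND"],
})
wells = pd.DataFrame({
    # 'JCP2022_900001' is a hypothetical ID with no compound entry.
    "Metadata_JCP2022": ["JCP2022_000001", "JCP2022_999999", "JCP2022_900001"],
    "Metadata_Well": ["A02", "A03", "A04"],
})

# A left merge from wells keeps every well; the `_merge` column records
# whether a matching compound row was found.
audit = pd.merge(wells, compound, on="Metadata_JCP2022", how="left", indicator=True)
unmatched = audit[audit["_merge"] == "left_only"]

print(len(unmatched), "of", len(wells), "wells have no compound entry")
print(unmatched["Metadata_JCP2022"].unique())
```

On the real tables, the unmatched IDs would show which perturbations the inner merge drops; these are presumably wells whose JCP2022 IDs are described in other metadata files rather than in compound.csv.gz.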


- Non-compound datapoints are retained and can be identified by the 'NON_COMPOUND' placeholder that stands in for their missing InChIKeys:

In [14]:
c_well[c_well['Metadata_JCP2022'] == 'JCP2022_999999'][0:2]

Unnamed: 0,Metadata_JCP2022,Metadata_InChIKey,Metadata_InChI,Metadata_Source,Metadata_Plate,Metadata_Well
908510,JCP2022_999999,NON_COMPOUND,NON_COMPOUND,source_10,Dest210531-152149,A05
908511,JCP2022_999999,NON_COMPOUND,NON_COMPOUND,source_10,Dest210531-152149,A09


- Check for null plate IDs (an empty result means none were found):

In [16]:
c_well[c_well.Metadata_Plate.isnull()]

Unnamed: 0,Metadata_JCP2022,Metadata_InChIKey,Metadata_InChI,Metadata_Source,Metadata_Plate,Metadata_Well
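
The check above looks at a single column. As a sketch, the same idea can be extended to every column at once with `isnull().sum()`. The small dataframe here is a toy example with one missing value planted on purpose, not the real merged data:

```python
import numpy as np
import pandas as pd

# Toy merged table; the NaN in Metadata_Plate is deliberate.
merged = pd.DataFrame({
    "Metadata_JCP2022": ["JCP2022_000001", "JCP2022_999999"],
    "Metadata_InChIKey": ["AAAHWCWPZPSPIW-UHFFFAOYSA-N", "NON_COMPOUND"],
    "Metadata_Plate": ["UL001783", np.nan],
})

# Number of missing values in each column.
null_counts = merged.isnull().sum()

# Show only the columns that contain at least one missing value.
print(null_counts[null_counts > 0])
```

Note that a column that has been cast to strings may hold the literal text 'nan' instead of a true missing value, in which case `isnull()` would not report it.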


## Merging compound + well data with plates data:

In [17]:
cwp_data = pd.merge(c_well, plates, on=['Metadata_Source', 'Metadata_Plate'], how='inner')
print(cwp_data.shape)
cwp_data.head()

(945604, 8)


Unnamed: 0,Metadata_JCP2022,Metadata_InChIKey,Metadata_InChI,Metadata_Source,Metadata_Plate,Metadata_Well,Metadata_Batch,Metadata_PlateType
0,JCP2022_000001,AAAHWCWPZPSPIW-UHFFFAOYSA-N,InChI=1S/C25H31N5O2/c1-4-23-26-14-16-30(23)24-...,source_1,UL001783,C29,Batch5_20221030,COMPOUND
1,JCP2022_000013,AABSTWCOLWSFRA-UHFFFAOYSA-N,InChI=1S/C17H19N5O2S/c1-11-20-14(16-22(11)7-8-...,source_1,UL001783,O05,Batch5_20221030,COMPOUND
2,JCP2022_000026,AACNNMAJYLOGIN-UHFFFAOYSA-N,"InChI=1S/C11H17BrN2O/c1-8(2)11(3,15)7-14-10-4-...",source_1,UL001783,Z11,Batch5_20221030,COMPOUND
3,JCP2022_000209,ABAOQGNJQBALMY-UHFFFAOYSA-N,InChI=1S/C10H9NO4/c11-8(12)7-3-6-4-1-2-5(6)10(...,source_1,UL001783,H28,Batch5_20221030,COMPOUND
4,JCP2022_000273,ABJRPGSJRJOLST-UHFFFAOYSA-N,InChI=1S/C15H19F2N5/c16-13(17)14-19-18-11-4-5-...,source_1,UL001783,O41,Batch5_20221030,COMPOUND
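
The row count is unchanged by this merge (945,604 before and after), which suggests each (source, plate) pair matches exactly one row in the plate table. A hedged sketch of how pandas can enforce that expectation with the `validate` argument of `pd.merge`, shown on toy tables rather than the real files:

```python
import pandas as pd

keys = ["Metadata_Source", "Metadata_Plate"]

# Toy stand-ins: two wells on the same plate, and one plate record.
c_well = pd.DataFrame({
    "Metadata_Source": ["source_1", "source_1"],
    "Metadata_Plate": ["UL001783", "UL001783"],
    "Metadata_Well": ["C29", "O05"],
})
plates = pd.DataFrame({
    "Metadata_Source": ["source_1"],
    "Metadata_Plate": ["UL001783"],
    "Metadata_Batch": ["Batch5_20221030"],
})

# validate="many_to_one" raises if the merge keys are duplicated in the
# right-hand table, which would otherwise silently multiply well rows.
merged = pd.merge(c_well, plates, on=keys, how="inner", validate="many_to_one")
rows_preserved = len(merged) == len(c_well)

# Duplicating the plate record on purpose triggers the check.
try:
    pd.merge(c_well, pd.concat([plates, plates]), on=keys,
             how="inner", validate="many_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True

print(rows_preserved, duplicate_detected)
```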


- Plate check: for a specific batch, count the rows (wells) recorded for each plate:

In [18]:
cwp_data.loc[cwp_data['Metadata_Batch'] == 'CP_28_all_Phenix1'].Metadata_Plate.value_counts()[0:5]

SP20P23d    384
SP20P23c    384
DMSOC26     384
DMSOC25     384
SP05P08c    384
Name: Metadata_Plate, dtype: int64
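
`value_counts()` above returns the number of rows per plate. If the number of distinct plates in each batch is also of interest, a `groupby` with `nunique()` is one way to get it. This is a sketch on made-up batch and plate names, not the real data:

```python
import pandas as pd

# Toy table: batch and plate names are hypothetical.
cwp = pd.DataFrame({
    "Metadata_Batch": ["batch_a", "batch_a", "batch_a", "batch_b", "batch_b"],
    "Metadata_Plate": ["P1", "P1", "P2", "P3", "P3"],
    "Metadata_Well": ["A01", "A02", "A01", "A01", "A02"],
})

# Number of distinct plates in each batch.
plates_per_batch = cwp.groupby("Metadata_Batch")["Metadata_Plate"].nunique()

# Number of rows (wells) recorded for each plate within each batch.
wells_per_plate = cwp.groupby(["Metadata_Batch", "Metadata_Plate"]).size()

print(plates_per_batch)
print(wells_per_plate)
```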

# Save Data:

In [19]:
cwp_data.to_csv('../data/cwp_data.csv', index=False)
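
When the saved file is loaded again, pandas infers column types from the text, so a column of all-numeric plate IDs can come back as integers. A minimal sketch of the effect, using an in-memory buffer in place of the real file; the second plate ID is made up, and passing `dtype=str` is one possible way to keep the IDs as text:

```python
import io

import pandas as pd

# Toy table with numeric-looking plate IDs (second ID is hypothetical).
df = pd.DataFrame({
    "Metadata_Source": ["source_6", "source_6"],
    "Metadata_Plate": ["110000296383", "110000296384"],
    "Metadata_Well": ["B15", "B16"],
})

# Write to an in-memory buffer instead of '../data/cwp_data.csv'.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Default read: pandas infers the plate column as integers.
buffer.seek(0)
inferred = pd.read_csv(buffer)

# Reading with dtype=str keeps every column as text.
buffer.seek(0)
as_text = pd.read_csv(buffer, dtype=str)

print(inferred["Metadata_Plate"].dtype)
print(as_text["Metadata_Plate"].tolist())
```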