# **Gather clean data to build one database per dataset construction method**

Goal: build every database that may be of interest.

**One output score corresponds to one replicate of a pair of bacteria (1 undesirable = P and 1 Bacillus = B) for one mixing method. We have several samples that each represent what a single-species biofilm of P or B looks like. The question is how to match our input data to the output. Said differently: which representation of P and which representation of B do we match to a score obtained by mixing P and B together?**

Several methods are possible:
- Random Sampling
- Average representation
- Combinatoric

In [38]:
import numpy as np
import pandas as pd
import random as rd
import os

# Set seeds for reproducibility (NumPy and Python's `random`, which rd.choices uses later)
seed = 62
np.random.seed(seed)
rd.seed(seed)

Below we import all exclusion scores.

In [2]:
scores = pd.read_csv("ScoresData.csv")
print(scores.shape, "\n")
scores["Bacillus"] = scores["Bacillus"].str.replace('_', '') # remove beginning '_' in Bacillus names
print(scores.info(), '\n')
scores.head()

(648, 4) 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 648 entries, 0 to 647
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Bacillus   648 non-null    object 
 1   Pathogene  648 non-null    object 
 2   Modele     648 non-null    object 
 3   Score      648 non-null    float64
dtypes: float64(1), object(3)
memory usage: 20.4+ KB
None 



Unnamed: 0,Bacillus,Pathogene,Modele,Score
0,11285,E.ce,I,0.913489
1,11285,E.ce,I,0.984517
2,11285,E.ce,I,0.992624
3,12048,E.ce,I,0.951883
4,12048,E.ce,I,0.982055


Below we load our whole bank of single species biofilm representations.

In [3]:
organisms = [f[:f.index('_')] for f in os.listdir('./Data/Clean') if f.endswith('bank.csv')] 
print(organisms)

['11285', '11457', '1167', '12001', '1202', '12048', '1218', '1219', '1234', '12701', '1273', '12832', '1298', '1339', 'B18', 'B1', 'B8', 'C5', 'E.ce', 'E.co', 'S.au', 'S.en']


In [4]:
path = "./Data/Clean/"
organisms_database = {org:pd.read_csv(path + org + '_bank.csv') for org in organisms}
print(organisms_database["E.ce"].shape)
organisms_database["B1"].shape

(36, 62)


(16, 72)

We do not have the same number of single-species samples for all organisms, and the same morphological features are not always available.
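
To make this concrete, one could tabulate, for each organism, the number of samples and the columns it lacks compared with the other organisms of its type. The sketch below runs on a toy dictionary; the helper name `summarize_banks` and the toy values are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def summarize_banks(database):
    """Per organism: number of samples, number of columns, and columns missing
    compared with the union of columns over organisms of the same type."""
    # Same convention as the notebook: pathogen names contain a '.'
    kind = {org: "Pathogen" if "." in org else "Bacillus" for org in database}
    all_cols = {"Pathogen": set(), "Bacillus": set()}
    for org, df in database.items():
        all_cols[kind[org]] |= set(df.columns)
    rows = []
    for org, df in database.items():
        missing = sorted(all_cols[kind[org]] - set(df.columns))
        rows.append({"organism": org, "type": kind[org],
                     "n_samples": df.shape[0], "n_columns": df.shape[1],
                     "missing": missing})
    return pd.DataFrame(rows).set_index("organism")

# Toy stand-in for organisms_database (values are made up)
toy_db = {
    "B1": pd.DataFrame({"Biofilm_Height": [1.0, 2.0], "Biofilm_Roughness": [0.1, 0.2]}),
    "C5": pd.DataFrame({"Biofilm_Height": [3.0]}),
    "E.ce": pd.DataFrame({"Biofilm_Height": [4.0, 5.0, 6.0]}),
}
summary = summarize_banks(toy_db)
print(summary)
```

Calling `summarize_banks(organisms_database)` on the real banks should give the same kind of overview, assuming every bank was loaded as a DataFrame as above.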

### **Random Sampling**

This method consists of randomly selecting one P sample and one B sample among the actual single-species samples and matching the pair to a (P, B) exclusion score. The rationale is that it should reflect real biological variability: single-species biofilms of the same organism can differ a lot in how they are described morphologically.

The dataset should have 648 rows, since each output score is matched to one pair of representations of P and B.
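
The idea can be sketched on toy data as follows. This is only an illustration under stated assumptions: the toy tables and the helper `draw_one` are made up, and a single shared `np.random.default_rng` generator is used here so that the Bacillus and pathogen draws are independent.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(62)  # one generator shared by all draws

# Toy stand-ins (made-up values) for the score table and the sample banks
toy_scores = pd.DataFrame({"Bacillus": ["B1", "B1", "C5"],
                           "Pathogene": ["E.ce", "E.ce", "E.ce"],
                           "Score": [0.9, 0.8, 0.4]})
toy_db = {
    "B1": pd.DataFrame({"Biofilm_Height": [10.0, 12.0, 11.0]}),
    "C5": pd.DataFrame({"Biofilm_Height": [20.0, 22.0]}),
    "E.ce": pd.DataFrame({"Biofilm_Height": [5.0, 6.0, 7.0, 8.0]}),
}

def draw_one(org):
    """Pick one sample (row) of `org` uniformly at random, as a one-row DataFrame."""
    bank = toy_db[org]
    pos = rng.integers(len(bank))
    return bank.iloc[[pos]].reset_index(drop=True)

# One draw per score row, in the same order as the score table
b_rows = pd.concat([draw_one(o) for o in toy_scores["Bacillus"]],
                   ignore_index=True).add_prefix("B_")
p_rows = pd.concat([draw_one(o) for o in toy_scores["Pathogene"]],
                   ignore_index=True).add_prefix("P_")

toy_random = pd.concat([toy_scores, b_rows, p_rows], axis=1)
print(toy_random)
```

The result has exactly one row per score, which is why the real dataset is expected to have 648 rows.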


In [5]:
def fillout_missing_columns(sample_df_dico, all_columns_dico):
    """Give every dataframe of the same type (Bacillus or pathogen) the same columns,
    so that they can be merged later on. Missing columns are added and filled with NaN."""
    new_sample_dico = {}

    for org, org_df in sample_df_dico.items():
        # Pathogen names contain a '.', Bacillus names do not
        trigger = 'Pathogen' if '.' in org else 'Bacillus'

        # Columns known for this type but absent from this organism
        missing_columns = list(set(all_columns_dico[trigger]).difference(org_df.columns))

        # Adding all missing columns in a single reindex (new columns are filled with NaN)
        # avoids inserting them one by one, which fragments the dataframe
        org_selected = org_df.reindex(columns=list(org_df.columns) + missing_columns)

        # Check that we did what we wanted
        assert set(all_columns_dico[trigger]).issubset(org_selected.columns), \
            "Error, there are still some columns missing."
        new_sample_dico[org] = org_selected
    return new_sample_dico

In [None]:
# Union of all column names seen for each organism type.
# `columns_dict` is required later on by fillout_missing_columns
# (Average representation and Combinatoric sections).
# An earlier version of this cell also sampled rows per organism;
# it is superseded by the row-by-row sampling in the next cell.
columns_dict = {'Bacillus': set(), 'Pathogen': set()}

for org in organisms:
    trigger = 'Pathogen' if '.' in org else 'Bacillus'
    columns_dict[trigger] = columns_dict[trigger].union(organisms_database[org].columns)

In [None]:
# Lists to hold the sampled rows in the same order as scores
bacillus_samples = []
pathogen_samples = []

# Iterate over each row in scores
for idx, row in scores.iterrows():
    bac_org = row["Bacillus"]
    path_org = row["Pathogene"]
    
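    # Sample one row from the corresponding organism database.
    # random_state=seed + idx gives each score row its own reproducible draw
    # (the Bacillus and pathogen draws share that state, so they are not independent).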
    sampled_bac = organisms_database[bac_org].sample(
        n=1, replace=True, random_state=seed + idx
    ).reset_index(drop=True)
    sampled_path = organisms_database[path_org].sample(
        n=1, replace=True, random_state=seed + idx
    ).reset_index(drop=True)
    
    bacillus_samples.append(sampled_bac)
    pathogen_samples.append(sampled_path)

# Combine the list of DataFrames into a single DataFrame
bacillus_df = pd.concat(bacillus_samples, ignore_index=True)
# Prefix the column names so that Bacillus and pathogen columns do not collide
bacillus_df.columns = ['B_' + col for col in bacillus_df.columns]

pathogen_df = pd.concat(pathogen_samples, ignore_index=True)
pathogen_df.columns = ['P_' + col for col in pathogen_df.columns]

random_database = pd.concat(
    [scores.reset_index(drop=True), bacillus_df, pathogen_df],
    axis=1
)

print(random_database.shape, "\n")
print(random_database.info(), '\n')
random_database.head()

(648, 559) 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 648 entries, 0 to 647
Columns: 559 entries, Bacillus to P_Intensity_Ratio_Mean_ch2_ch1_std_biovolume
dtypes: float64(538), int64(14), object(7)
memory usage: 2.8+ MB
None 



Unnamed: 0,Bacillus,Pathogene,Modele,Score,B_Unnamed: 0.1,B_Unnamed: 0,B_Collection,B_Ncells,B_Timepoint,B_Architecture_LocalDensity_range10_max,...,P_Intensity_Ratio_Mean_ch2_ch1_shell_mean,P_Intensity_Ratio_Mean_ch2_ch1_shell_mean_biovolume,P_Intensity_Ratio_Mean_ch2_ch1_shell_median,P_Intensity_Ratio_Mean_ch2_ch1_shell_min,P_Intensity_Ratio_Mean_ch2_ch1_shell_p25,P_Intensity_Ratio_Mean_ch2_ch1_shell_p75,P_Intensity_Ratio_Mean_ch2_ch1_shell_std,P_Intensity_Ratio_Mean_ch2_ch1_shell_std_biovolume,P_Intensity_Ratio_Mean_ch2_ch1_std,P_Intensity_Ratio_Mean_ch2_ch1_std_biovolume
0,11285,E.ce,I,0.913489,5,1,Lalfilm,31616.0,10-Feb-2021 15:56:04,1.0,...,,,,,,,,,,
1,11285,E.ce,I,0.984517,0,1,Lalfilm,33699.0,10-Feb-2021 15:56:04,1.0,...,,,,,,,,,,
2,11285,E.ce,I,0.992624,4,1,Lalfilm,39061.0,05-Feb-2021 09:37:34,1.0,...,,,,,,,,,,
3,12048,E.ce,I,0.951883,2,1,Lalfilm,21590.0,10-Feb-2021 15:56:00,1.0,...,,,,,,,,,,
4,12048,E.ce,I,0.982055,1,1,Lalfilm,19530.0,10-Feb-2021 15:56:00,1.0,...,,,,,,,,,,


There are a lot of columns, but we will focus on only a few of them:
- Biofilm_Height
- Biofilm_Volume
- Biofilm_Roughness
- Biofilm_SubstratumCoverage

We will call them `Columns Of Interest (COI)`.

But first let's save the whole data:

In [55]:
random_database.to_csv("./Data/Datasets/random_all.csv")

Let's check if the mentioned columns are there and how they are named:

In [56]:
COI = ["Height", "_Volume", "Coverage", "Roughness"]
actual_COI = []
for columns in random_database.columns:
    for col in COI:
        if columns.endswith(col):
            print(columns)
            actual_COI.append(columns)

B_Biofilm_Height
B_Biofilm_Roughness
B_Biofilm_SubstratumCoverage
B_Biofilm_Volume
P_Biofilm_Height
P_Biofilm_Volume
P_Biofilm_SubstratumCoverage

Only seven columns match: no `P_Biofilm_Roughness` column was found, so in this dataset roughness is available on the Bacillus side only.


In [57]:
random_database[["Bacillus", "Pathogene"] + actual_COI + ["Modele", "Score"]].to_csv("./Data/Datasets/random_COI.csv")

### **Average representation**

This method consists in taking the average representation of a single-species biofilm for each organism. The rationale is to address one issue with random sampling: when sampling randomly with replacement, we will probably get very different samples describing the same organism. By simplifying the input distributions, we hope that models will learn more easily how each single-species feature impacts the observed antagonism.

However, instead of keeping only the average representation, we also keep the average - standard deviation and the average + standard deviation representations of each single-species biofilm. This still takes biological variance into account when representing one organism's biofilm.
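
As a small worked example of these three representations, here is a sketch on a toy table (made-up values; the name `representations` is illustrative). It uses the population standard deviation (`ddof=0`), which is what `np.std` computes by default.

```python
import numpy as np
import pandas as pd

# Toy single-species samples for one organism (made-up values)
toy = pd.DataFrame({"Biofilm_Height": [10.0, 12.0, 14.0],
                    "Biofilm_Volume": [100.0, np.nan, 300.0]})

means = toy.mean()      # NaN values are skipped
stds = toy.std(ddof=0)  # population standard deviation, NaN values skipped

# One row per representation, one column per feature
representations = pd.DataFrame({"mean": means,
                                "mean - std": means - stds,
                                "mean + std": means + stds}).T
print(representations)
```

Each organism is thus summarized by three rows, whatever its number of samples.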

In [10]:
def separate_num_cat_cols(df, index_cols_indicator='Unnamed'): #simple but works perfectly for our use case (no other datatypes)
    num_cols = []
    cat_cols = []
    index_cols = []
    for col in df.columns:
        if ('float' in str(type(df[col].iloc[0])) or 'int' in str(type(df[col].iloc[0]))) and index_cols_indicator not in col:
            num_cols.append(col)
        elif index_cols_indicator in col:
            index_cols.append(col)
        else:
            cat_cols.append(col)
    return num_cols, cat_cols, index_cols

def make_average_representation(org, database):
    org_df = database[org]
    num_cols, cat_cols, index_cols = separate_num_cat_cols(org_df)
    means = []
    stds = []
    for col in num_cols:
        if np.sum(org_df[col].isna()) == org_df.shape[0]:
            means.append(np.nan)
            stds.append(np.nan)
        else:
            noNA_mask = ~pd.isna(org_df[col])
            noNA_col = org_df[col].loc[noNA_mask]
            means.append(np.mean(noNA_col))
            stds.append(np.std(noNA_col))
    means = np.array(means)
    stds = np.array(stds)
    
    avg_m = np.ones((3, len(num_cols)))

    avg_m[0,:] = means
    avg_m[1,:] = means - stds
    avg_m[2,:] = means + stds
    df_avg = pd.DataFrame(avg_m, columns=num_cols)
    return pd.concat([org_df[index_cols + cat_cols].iloc[:3,:], df_avg], axis=1)

def filterout_fullNA(df):
    not_full_na_indices = []
    n_cols = df.shape[1]
    for i in range(df.shape[0]):
        if np.sum(pd.isna(df.iloc[i,:])) != n_cols:
            not_full_na_indices.append(i)
    return df.iloc[not_full_na_indices,:]

In [11]:
avg_representations = {org:make_average_representation(org, organisms_database) for org in organisms}
avg_representations = fillout_missing_columns(avg_representations, columns_dict)


In [12]:
bacillus_selected_avg = [avg_representations[org] for org in organisms if '.' not in org]
pathogen_selected_avg = [avg_representations[org] for org in organisms if '.' in org]

# Cycle through the three representations of each organism so that consecutive
# score rows of a Bacillus get a different one
# (row 0: mean, row 1: mean - std, row 2: mean + std).
memory_per_bacillus = {org: 0 for org in organisms if '.' not in org}
all_lines = []
for i in range(scores.shape[0]):
    bacillus, pathogen = scores["Bacillus"].iloc[i], scores["Pathogene"].iloc[i]

    # idx determines which representation is used for this score row
    idx = memory_per_bacillus[bacillus]
    bacillus_avg_line = avg_representations[bacillus].iloc[idx:idx+1, :].add_prefix('B_')
    pathogen_avg_line = avg_representations[pathogen].iloc[idx:idx+1, :].add_prefix('P_')

    full_line = pd.concat([scores[["Bacillus", "Pathogene"]].iloc[i:i+1, :].reset_index(drop=True),
                           bacillus_avg_line.reset_index(drop=True),
                           pathogen_avg_line.reset_index(drop=True),
                           scores[["Modele", "Score"]].iloc[i:i+1, :].reset_index(drop=True)], axis=1)
    full_line = filterout_fullNA(full_line)
    all_lines.append(full_line)

    # Move on to the next representation: 0 -> 1 -> 2 -> 0
    memory_per_bacillus[bacillus] = (idx + 1) % 3

In [13]:
avg_database = pd.concat(all_lines, axis=0, ignore_index=True)
print(avg_database.shape, "\n")
print(avg_database.info(), '\n')
avg_database.head()

(648, 559) 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 648 entries, 0 to 647
Columns: 559 entries, Bacillus to Score
dtypes: float64(548), int64(4), object(7)
memory usage: 2.8+ MB
None 



Unnamed: 0,Bacillus,Pathogene,B_Unnamed: 0.1,B_Unnamed: 0,B_Collection,B_Timepoint,B_Ncells,B_Architecture_LocalDensity_range10_max,B_Architecture_LocalDensity_range10_mean,B_Architecture_LocalDensity_range10_mean_biovolume,...,P_Grid_ID_shell_std,P_Intensity_Ratio_Mean_ch2_ch1_shell_p75,P_Distance_ToBiofilmCenter_shell_max,P_Intensity_Ratio_Mean_ch1_ch2_shell_p25,P_Cube_VolumeFraction_shell_mean_biovolume,P_Cube_VolumeFraction_core_p25,P_Distance_ToBiofilmCenterAtSubstrate_mean_biovolume,P_Architecture_LocalNumberDensity_range3_mean,Modele,Score
0,11285,E.ce,0,1,Lalfilm,05-Feb-2021 09:37:34,31256.222222,1.0,0.516779,0.720486,...,,,,,,,,,I,0.913489
1,11285,E.ce,1,1,Lalfilm,05-Feb-2021 09:37:34,22761.5541,1.0,0.466152,0.682156,...,,,,,,,,,I,0.984517
2,11285,E.ce,2,1,Lalfilm,05-Feb-2021 09:37:34,39750.890344,1.0,0.567406,0.758816,...,,,,,,,,,I,0.992624
3,12048,E.ce,0,1,Lalfilm,05-Feb-2021 09:37:28,26769.833333,1.0,0.527177,0.732543,...,,,,,,,,,I,0.951883
4,12048,E.ce,1,1,Lalfilm,05-Feb-2021 09:37:28,20346.013431,1.0,0.477913,0.698378,...,,,,,,,,,I,0.982055


In [14]:
avg_database.to_csv("./Data/Datasets/avg_all.csv")

In [15]:
avg_database[["Bacillus", "Pathogene"] + actual_COI + ["Modele", "Score"]].to_csv("./Data/Datasets/avg_COI.csv")

### **Combinatoric**

This method consists in building every possible combination of single-species samples for each Bacillus/pathogen interaction. Once all combinations are created, each one is assigned an exclusion score drawn at random among the 3 replicates of that interaction, for each mixing model.

For Bacillus B1 and pathogen P1, the number of rows is:  
(number of single-species biofilm samples of B1) × (number of samples of P1) × 3 mixing models, each row carrying 1 randomly selected B1/P1 exclusion score replicate.
 
To prevent data leakage, we will need to ensure that the same combination does not appear both in the training set and the test set, but this will be done later on.
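
One possible way to do this later is to split on the sample combinations rather than on the rows. The sketch below is an assumption about how such a split could look, not the method actually used: the helper `split_by_combination`, the toy table and the model labels are hypothetical, and a "combination" is taken to be a (`B_sample_ID`, `P_sample_ID`) pair.

```python
import numpy as np
import pandas as pd

def split_by_combination(df, test_fraction=0.2, seed=62):
    """Split rows so that each (B_sample_ID, P_sample_ID) pair lands entirely
    in the training set or entirely in the test set."""
    rng = np.random.default_rng(seed)
    keys = ["B_sample_ID", "P_sample_ID"]
    pairs = df[keys].drop_duplicates().reset_index(drop=True)
    n_test = max(1, int(round(test_fraction * len(pairs))))
    test_pos = rng.choice(len(pairs), size=n_test, replace=False)
    test_pairs = pairs.iloc[test_pos].assign(_is_test=True)
    # A left merge keeps the rows of df in their original order
    flagged = df.merge(test_pairs, on=keys, how="left")
    is_test = flagged["_is_test"].notna().to_numpy()
    return df[~is_test], df[is_test]

# Toy combinatoric table: 4 sample pairs, 2 rows each (made-up values and labels)
toy = pd.DataFrame({
    "B_sample_ID": [0, 0, 0, 0, 1, 1, 1, 1],
    "P_sample_ID": [0, 0, 1, 1, 0, 0, 1, 1],
    "Modele": ["I", "II"] * 4,
    "Score": np.linspace(0.1, 0.8, 8),
})
train, test = split_by_combination(toy, test_fraction=0.25)
print(len(train), len(test))
```

A stricter variant would hold out whole samples (all rows sharing a `B_sample_ID` or a `P_sample_ID`), at the cost of discarding the mixed rows.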

Because of the size of the resulting dataset, we only build a version restricted to the Columns Of Interest (COI).
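
The row count described above can be checked on toy data with a vectorized cross join. This is a sketch under assumptions: it relies on `DataFrame.merge(how="cross")`, available in pandas 1.2 and later, and the toy values and the model labels are placeholders.

```python
import pandas as pd

# Toy samples (made-up values): 2 Bacillus samples and 3 pathogen samples
b_df = pd.DataFrame({"Biofilm_Height": [70.0, 65.0]}).add_prefix("B_")
p_df = pd.DataFrame({"Biofilm_Height": [17.0, 18.0, 16.0]}).add_prefix("P_")

# Keep track of which sample each row comes from
b_df.insert(0, "B_sample_ID", range(len(b_df)))
p_df.insert(0, "P_sample_ID", range(len(p_df)))

# how="cross" pairs every row of b_df with every row of p_df
pairs = b_df.merge(p_df, how="cross")

# One copy of the pair table per mixing model (labels are placeholders)
models = pd.DataFrame({"Modele": ["I", "II", "III"]})
toy_combinatoric = pairs.merge(models, how="cross")

print(toy_combinatoric.shape)  # 2 * 3 * 3 = 18 rows, 5 columns
```

Compared with concatenating one row at a time, a single cross merge per pair should be much faster on the real banks, although this has not been timed here.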

In [16]:
# Iterating over a DataFrame yields its column names
for col in organisms_database["E.ce"].columns:
    print(col)

Unnamed: 0.1
Unnamed: 0
Collection
Ncells
Timepoint
Biofilm_AspectRatio_HeightToLength
Biofilm_AspectRatio_HeightToWidth
Biofilm_AspectRatio_LengthToWidth
Biofilm_BaseArea
Biofilm_BaseEccentricity
Biofilm_Height
Biofilm_Length
Biofilm_OuterSurfacePerSubstrate
Biofilm_OuterSurfacePerVolume
Biofilm_Volume
Biofilm_VolumePerSubstrate
Biofilm_Width
Cube_Surface_max
Cube_Surface_mean
Cube_Surface_mean_biovolume
Cube_Surface_median
Cube_Surface_min
Cube_Surface_p25
Cube_Surface_p75
Cube_Surface_std
Cube_Surface_std_biovolume
Cube_VolumeFraction_max
Cube_VolumeFraction_mean
Cube_VolumeFraction_mean_biovolume
Cube_VolumeFraction_median
Cube_VolumeFraction_min
Cube_VolumeFraction_p25
Cube_VolumeFraction_p75
Cube_VolumeFraction_std
Cube_VolumeFraction_std_biovolume
Grid_ID_max
Grid_ID_mean
Grid_ID_mean_biovolume
Grid_ID_median
Grid_ID_min
Grid_ID_p25
Grid_ID_p75
Grid_ID_std
Grid_ID_std_biovolume
Intensity_Mean_ch1_max
Intensity_Mean_ch1_mean
Intensity_Mean_ch1_mean_biovolume
Intensity_Mean_ch1_me

In [17]:
[col[2:] for col in actual_COI]

['Biofilm_Height',
 'Biofilm_Roughness',
 'Biofilm_SubstratumCoverage',
 'Biofilm_Volume',
 'Biofilm_Height',
 'Biofilm_Volume',
 'Biofilm_SubstratumCoverage']

In [18]:
from tqdm import tqdm

def get_row_as_df(df, row):
    if row >= df.shape[0]:
        raise IndexError("Index is out of bound")
    if row == df.shape[0] - 1:
        sample = df.iloc[row:,:]
    else:
        sample = df.iloc[row: row+1,:]
    return sample


def cross_join(old_bacillus_dict, old_pathogen_dict, all_columns):
    bacillus_dict = fillout_missing_columns(old_bacillus_dict, all_columns)
    pathogen_dict = fillout_missing_columns(old_pathogen_dict, all_columns)

    all_cross_tables = {}
    start_b_id = 0
    pathogen_memory = {}
    start_p_id = 0
    for B in tqdm(bacillus_dict.keys(), desc="Bacillus"):
        bacillus_df = bacillus_dict[B].copy()
        # print(bacillus_df.columns)
        bacillus_df.columns = ['B_' + col for col in bacillus_df.columns]
        # print(bacillus_df.columns)
        # print(bacillus_df.shape)
        b_ids = [i for i in range(start_b_id, start_b_id + bacillus_df.shape[0])]
        b_ids = pd.DataFrame(b_ids, columns=["B_sample_ID"], index=bacillus_df.index)

        start_b_id += bacillus_df.shape[0]
        bacillus_df = pd.concat([b_ids, bacillus_df], axis=1)
        for P in pathogen_dict.keys():
            pathogen_df = pathogen_dict[P].copy()
            pathogen_df.columns = ['P_' + col for col in pathogen_df.columns]
            # print(pathogen_df.shape)

            if P not in pathogen_memory.keys():
                pathogen_memory[P] = [i for i in range(start_p_id, start_p_id + pathogen_df.shape[0])]
                start_p_id += pathogen_df.shape[0]
            p_ids = pd.DataFrame(pathogen_memory[P], columns=["P_sample_ID"], index=pathogen_df.index)

            
            pathogen_df = pd.concat([p_ids, pathogen_df], axis=1)
            all_lines = []
            for b_row in range(bacillus_df.shape[0]):
                b_row = get_row_as_df(bacillus_df, b_row).reset_index(drop=True)
                for p_row in range(pathogen_df.shape[0]):
                    p_row = get_row_as_df(pathogen_df, p_row).reset_index(drop=True)
                    line = pd.concat([b_row, p_row], axis=1)
                    assert isinstance(line, pd.DataFrame), f"Incorrect type, got {type(line)}"
                    all_lines.append(line)
            all_cross_tables[f"{B}x{P}"] = pd.concat(all_lines, axis=0)
    return all_cross_tables

In [19]:
bacillus_database = {org:organisms_database[org] for org in organisms_database.keys() if '.' not in org}
pathogen_database = {org:organisms_database[org] for org in organisms_database.keys() if '.' in org}

all_crossjoins = cross_join(bacillus_database, pathogen_database, columns_dict)


In [20]:
print(all_crossjoins.keys())
len(all_crossjoins.keys())

dict_keys(['11285xE.ce', '11285xE.co', '11285xS.au', '11285xS.en', '11457xE.ce', '11457xE.co', '11457xS.au', '11457xS.en', '1167xE.ce', '1167xE.co', '1167xS.au', '1167xS.en', '12001xE.ce', '12001xE.co', '12001xS.au', '12001xS.en', '1202xE.ce', '1202xE.co', '1202xS.au', '1202xS.en', '12048xE.ce', '12048xE.co', '12048xS.au', '12048xS.en', '1218xE.ce', '1218xE.co', '1218xS.au', '1218xS.en', '1219xE.ce', '1219xE.co', '1219xS.au', '1219xS.en', '1234xE.ce', '1234xE.co', '1234xS.au', '1234xS.en', '12701xE.ce', '12701xE.co', '12701xS.au', '12701xS.en', '1273xE.ce', '1273xE.co', '1273xS.au', '1273xS.en', '12832xE.ce', '12832xE.co', '12832xS.au', '12832xS.en', '1298xE.ce', '1298xE.co', '1298xS.au', '1298xS.en', '1339xE.ce', '1339xE.co', '1339xS.au', '1339xS.en', 'B18xE.ce', 'B18xE.co', 'B18xS.au', 'B18xS.en', 'B1xE.ce', 'B1xE.co', 'B1xS.au', 'B1xS.en', 'B8xE.ce', 'B8xE.co', 'B8xS.au', 'B8xS.en', 'C5xE.ce', 'C5xE.co', 'C5xS.au', 'C5xS.en'])


72

In [21]:
scores.head()

Unnamed: 0,Bacillus,Pathogene,Modele,Score
0,11285,E.ce,I,0.913489
1,11285,E.ce,I,0.984517
2,11285,E.ce,I,0.992624
3,12048,E.ce,I,0.951883
4,12048,E.ce,I,0.982055


In [22]:
b = "11285"
p = "E.co"
print(f"{b} {organisms_database[b].shape[0]} x {p} {organisms_database[p].shape[0]} = {organisms_database[b].shape[0] * organisms_database[p].shape[0]}")
# list(all_crossjoins[f"{b}x{p}"].columns)

11285 18 x E.co 36 = 648


In [28]:
scores[(scores["Bacillus"] == "11285") & (scores["Pathogene"] == "E.co") & (scores["Modele"] == "I")]

Unnamed: 0,Bacillus,Pathogene,Modele,Score
108,11285,E.co,I,0.323109
109,11285,E.co,I,0.395121
110,11285,E.co,I,0.402264


In [46]:
combinatoric_chunks = []
memory = set()  # (Bacillus, pathogen, model) triplets already processed
for i in range(scores.shape[0]):
    b, p, m, s = scores.iloc[i, :]
    if (b, p, m) not in memory:
        pair = b + 'x' + p
        # All sample combinations for this pair, restricted to the columns of interest
        cross_product = all_crossjoins[pair][["B_sample_ID", "P_sample_ID"] + actual_COI].reset_index(drop=True)
        n_rows = cross_product.shape[0]

        # For each combination, draw one of the replicate scores of this (b, p, m) triplet.
        # rd.choices draws from Python's `random` module, whose state is separate
        # from NumPy's (np.random.seed does not affect it).
        mask = (scores["Bacillus"] == b) & (scores["Pathogene"] == p) & (scores["Modele"] == m)
        sampled_scores = rd.choices(list(scores[mask]["Score"]), k=n_rows)
        score_infos = pd.DataFrame({"Bacillus": [b] * n_rows,
                                    "Pathogene": [p] * n_rows,
                                    "Modele": [m] * n_rows,
                                    "Score": sampled_scores})
        combinatoric_chunks.append(pd.concat([score_infos[["Bacillus", "Pathogene"]], cross_product, score_infos[["Modele", "Score"]]], axis=1))
        memory.add((b, p, m))

In [None]:
# combinatoric_chunks = []
# memory = {}
# for i in range(scores.shape[0]):
#     b, p, m, s = scores.iloc[i,:]
#     pair = b + 'x' + p
#     # print(set(actual_COI).intersection(set(all_crossjoins[pair].columns)))
#     cross_product = all_crossjoins[pair][["B_sample_ID", "P_sample_ID"] + actual_COI]
#     cross_product.reset_index(drop=True, inplace=True)
#     n_rows = cross_product.shape[0]
#     score_infos = pd.DataFrame({"Bacillus":[b]*n_rows,
#                                     "Pathogene":[p]*n_rows,
#                                     "Modele":[m]*n_rows,
#                                     "Score":[s]*n_rows})
#     combinatoric_chunks.append(pd.concat([score_infos[["Bacillus", "Pathogene"]], cross_product, score_infos[["Modele", "Score"]]], axis=1))

In [47]:
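# ignore_index=True gives the final table a unique row index instead of one restarting at 0 in every chunk
combinatoric_db = pd.concat(combinatoric_chunks, axis=0, ignore_index=True)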
print(combinatoric_db.shape)
combinatoric_db.iloc[:50,:]

(115020, 13)


Unnamed: 0,Bacillus,Pathogene,B_sample_ID,P_sample_ID,B_Biofilm_Height,B_Biofilm_Roughness,B_Biofilm_SubstratumCoverage,B_Biofilm_Volume,P_Biofilm_Height,P_Biofilm_Volume,P_Biofilm_SubstratumCoverage,Modele,Score
0,11285,E.ce,0,0,71.2709,0.24003,0.74907,717122.4472,17.8136,185840.7796,,I,0.984517
1,11285,E.ce,0,1,71.2709,0.24003,0.74907,717122.4472,18.3367,195676.4463,,I,0.984517
2,11285,E.ce,0,2,71.2709,0.24003,0.74907,717122.4472,16.9216,215739.3021,,I,0.913489
3,11285,E.ce,0,3,71.2709,0.24003,0.74907,717122.4472,20.6751,259495.2255,,I,0.984517
4,11285,E.ce,0,4,71.2709,0.24003,0.74907,717122.4472,19.9349,335603.1025,,I,0.984517
5,11285,E.ce,0,5,71.2709,0.24003,0.74907,717122.4472,19.8387,351449.1457,,I,0.913489
6,11285,E.ce,0,6,71.2709,0.24003,0.74907,717122.4472,17.0281,332652.5343,,I,0.992624
7,11285,E.ce,0,7,71.2709,0.24003,0.74907,717122.4472,19.9252,366016.9373,,I,0.992624
8,11285,E.ce,0,8,71.2709,0.24003,0.74907,717122.4472,18.533,339632.8413,,I,0.913489
9,11285,E.ce,0,9,71.2709,0.24003,0.74907,717122.4472,19.6616,342774.4385,,I,0.992624


In [48]:
combinatoric_db.to_csv("./Data/Datasets/combinatoric_COI.csv")

### **Additional Notes**

We could also have built a dataset similar to the Average representation one, with some data augmentation: instead of three fixed rows per organism, values would be drawn between mean - std and mean + std following a given distribution (e.g. Gaussian or uniform sampling in this range).
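
A minimal sketch of what such an augmentation could look like is given below. It was not part of this work: the helper `augment_within_one_std`, its parameters and the toy values are all hypothetical. For the Gaussian option, draws are clipped to the range, which concentrates some probability mass on the two bounds.

```python
import numpy as np
import pandas as pd

def augment_within_one_std(samples, n_draws, seed=62, distribution="uniform"):
    """Draw synthetic rows whose values lie in [mean - std, mean + std] per column."""
    rng = np.random.default_rng(seed)
    means = samples.mean().to_numpy()
    stds = samples.std(ddof=0).to_numpy()  # population std, as np.std by default
    low, high = means - stds, means + stds
    size = (n_draws, samples.shape[1])
    if distribution == "uniform":
        draws = rng.uniform(low, high, size=size)
    elif distribution == "gaussian":
        # Gaussian centered on the mean, then clipped to the one-std range
        draws = np.clip(rng.normal(means, stds, size=size), low, high)
    else:
        raise ValueError(f"Unknown distribution: {distribution}")
    return pd.DataFrame(draws, columns=samples.columns)

# Toy single-species samples for one organism (made-up values, no missing data)
toy = pd.DataFrame({"Biofilm_Height": [10.0, 12.0, 14.0],
                    "Biofilm_Volume": [100.0, 200.0, 300.0]})
augmented = augment_within_one_std(toy, n_draws=5)
print(augmented)
```

Columns containing only missing values would need separate handling before using such a helper.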