# Random Forest

This notebook helps you set up your first model on the bit.bio colony segmentation features.
The model is a random forest, an easy model to start with when the features have no clear or uniform scale.

First, we load the data.
We load the enriched data and drop ImageNumber, Metadata_Well, and ObjectNumber, because these are not things we want our model to react to.
Metadata_TimePoint is also removed for now because it is not a number.

Next, all location data is removed.
Otherwise the model could learn that a colony in the top right is good and one in the bottom left is bad.

In [1]:
import os
import sys
from pathlib import Path
basefolder_loc = Path(os.path.abspath('')).parents[0]
sys.path.append(str(basefolder_loc))

from feature_engineering import load_enriched_data, remove_all_location_info
bitbio_data = load_enriched_data()

# Remove columns that are not numeric indicators
bitbio_data = bitbio_data.drop(
    columns=[
        "ImageNumber",
        "Metadata_Well",
        "ObjectNumber",
        "Metadata_TimePoint",
    ]
)

bitbio_data = remove_all_location_info(bitbio_data)

bitbio_data.head()

Unnamed: 0,AreaShape_Area,AreaShape_BoundingBoxArea,AreaShape_Compactness,AreaShape_Eccentricity,AreaShape_EquivalentDiameter,AreaShape_EulerNumber,AreaShape_Extent,AreaShape_FormFactor,AreaShape_MajorAxisLength,AreaShape_MaxFeretDiameter,...,Texture_Variance_Phase_5_00_256,Texture_Variance_Phase_5_01_256,Texture_Variance_Phase_5_02_256,Texture_Variance_Phase_5_03_256,is_good,SampleName,ColonyName,Circularity,AreaShape_BoundingBoxMaximum_Width,AreaShape_BoundingBoxMaximum_Height
Good_1_00d12h00m.tif,1423,7597,13.697258,0.958639,42.565477,1,0.187311,0.073007,120.073586,116.361506,...,55.191058,58.169413,52.687,51.290731,True,Good_1,Good_1_1,0.073007,107,71
Good_1_01d00h00m.tif,1288,2624,8.649761,0.997509,40.49608,1,0.490854,0.11561,180.713061,163.370744,...,15.489024,15.462723,12.422264,15.4263,True,Good_1,Good_1_1,0.11561,16,164
Good_1_01d04h00m.tif,2916,12420,7.879338,0.995852,60.932475,1,0.234783,0.126914,213.572235,211.009479,...,21.288325,22.808554,20.98627,20.334934,True,Good_1,Good_1_1,0.126914,60,207
Good_1_02d04h00m.tif,1398,3312,2.397711,0.945498,42.189914,1,0.422101,0.417064,82.411836,76.967526,...,110.650754,113.107319,113.482992,117.162961,True,Good_1,Good_1_1,0.417064,46,72
Good_1_02d08h00m.tif,1683,4180,2.564627,0.944304,46.291059,1,0.402632,0.38992,89.759893,83.186537,...,68.624614,67.173322,70.372928,68.00064,True,Good_1,Good_1_1,0.38992,55,76
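
The helper `remove_all_location_info` lives in the project's `feature_engineering` module and its implementation is not shown in this notebook. As a rough sketch of the idea only, the function below keeps every column whose name contains none of a few location-related keywords. The keyword list, the function name `drop_location_columns`, and the position column names in the example are assumptions for illustration, not the project's actual rules.

```python
# Hypothetical sketch: the real remove_all_location_info may use different rules.
LOCATION_KEYWORDS = ("Location", "Center_X", "Center_Y")  # assumed keywords


def drop_location_columns(column_names):
    """Return the column names that contain none of the location keywords."""
    return [
        name
        for name in column_names
        if not any(keyword in name for keyword in LOCATION_KEYWORDS)
    ]


columns = [
    "AreaShape_Area",
    "AreaShape_Center_X",  # assumed example of a position column
    "AreaShape_Center_Y",  # assumed example of a position column
    "Location_Center_X",   # assumed example of a position column
    "Texture_Variance_Phase_5_00_256",
]
kept = drop_location_columns(columns)
print(kept)
```

With a DataFrame `df`, the same filter could be applied as `df[drop_location_columns(df.columns)]`.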


## Feature leakage

[Feature leakage](https://en.wikipedia.org/wiki/Leakage_(machine_learning)) occurs when the training data contains information about the test data.
In our case there are two groupings to look at: time and wells.

**Time**. If we put a colony at timepoint 10 in the training set and the same colony at timepoint 11 in the test set, we have leakage.
The model can overfit to the timepoint 10 snapshot, and the test set will not reveal this, because the timepoint 11 snapshot is likely to look very similar.

**Wells**. Some colonies grew in the same well (sample).
This means they share the same environment and the same label.
But because only 2 samples contain bad colonies (Bad_1 and Bad_2), it might be better NOT to split on samples as well.
This gives us more freedom in splitting the dataset.
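
To make the time leakage concrete, here is a minimal sketch in plain Python. It compares a split that separates snapshots of the same colony with a split that keeps whole colonies together. The rows and the helper names `split_by_group` and `shared_groups` are made up for the example.

```python
# Toy rows: (colony name, time point index). The values are made up.
rows = [
    ("Good_1_1", 10), ("Good_1_1", 11),
    ("Good_1_2", 10), ("Good_1_2", 11),
    ("Bad_1_1", 10), ("Bad_1_1", 11),
]


def split_by_group(rows, test_groups):
    """Send every row of a test group to the test set, all others to training."""
    train = [row for row in rows if row[0] not in test_groups]
    test = [row for row in rows if row[0] in test_groups]
    return train, test


def shared_groups(train, test):
    """Return the groups that occur on both sides of the split."""
    return {row[0] for row in train} & {row[0] for row in test}


# Leaky split: time point 10 goes to training, time point 11 to test.
leaky_train, leaky_test = rows[0::2], rows[1::2]
print(shared_groups(leaky_train, leaky_test))  # all three colonies are shared

# Group-aware split: whole colonies go to one side only.
train, test = split_by_group(rows, test_groups={"Good_1_2"})
print(shared_groups(train, test))  # empty set, so no colony leaks
```

An empty intersection is a cheap check to run after any split of this dataset.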

Below we group the data by SampleName and by ColonyName.
We see that Bad_1 has 411 colony snapshots.
This could be anything from 411 timepoints of 1 colony to 1 timepoint with 411 colonies.

Bad_1_1 is the first colony of Bad_1.
It has 86 snapshots, so it was tracked for about 340 hours (85 intervals, assuming one snapshot every 4 hours).
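
The 340 hours can be checked with a small calculation. The 4-hour spacing is an assumption read off file names such as `Good_1_01d00h00m.tif` and `Good_1_01d04h00m.tif`; gaps in the series would make the real span longer. The sketch also shows one way a `02d08h00m`-style stamp could be turned into a number of hours. `timepoint_to_hours` is a hypothetical helper, not part of the project.

```python
import re

TIMEPOINT_PATTERN = re.compile(r"(\d+)d(\d+)h(\d+)m")


def timepoint_to_hours(stamp):
    """Convert a stamp such as '02d08h00m' to hours since the start."""
    match = TIMEPOINT_PATTERN.search(stamp)
    if match is None:
        raise ValueError(f"no time stamp found in {stamp!r}")
    days, hours, minutes = (int(part) for part in match.groups())
    return days * 24 + hours + minutes / 60


print(timepoint_to_hours("Good_1_02d08h00m.tif"))  # 56.0

# Span covered by n equally spaced snapshots: (n - 1) intervals.
snapshots = 86
interval_hours = 4  # assumed spacing between snapshots
print((snapshots - 1) * interval_hours)  # 340
```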

In [2]:
bitbio_data.groupby("SampleName").count()["is_good"]

SampleName
Bad_1      411
Bad_2      148
Good_1      99
Good_10    142
Good_11    117
Good_12    151
Good_13    101
Good_14    174
Good_15    182
Good_16    114
Good_17    137
Good_18    159
Good_19    100
Good_2     261
Good_20    158
Good_21    114
Good_22     10
Good_23    137
Good_24     86
Good_25    261
Good_3     143
Good_4     108
Good_5     165
Good_6     127
Good_7      80
Good_8     111
Good_9     169
Name: is_good, dtype: int64

In [5]:
bitbio_data.groupby("ColonyName").count()["is_good"]

ColonyName
Bad_1_1     86
Bad_1_10     4
Bad_1_11     3
Bad_1_12     2
Bad_1_13     1
            ..
Good_9_5     8
Good_9_6     5
Good_9_7     2
Good_9_8     2
Good_9_9     2
Name: is_good, Length: 153, dtype: int64

## Training/test split

We will create a test set that is balanced by colony (10 good and 10 bad colonies).
Everything that is left over will be the training set.
We balance the test set because the dataset itself is imbalanced: 559 bad snapshots against 3,406 good ones.

In [6]:
import random

all_colony_names = bitbio_data.ColonyName.unique()
all_good_colony_names = [name for name in all_colony_names if name.startswith("Good_")]
all_bad_colony_names = [name for name in all_colony_names if name.startswith("Bad_")]

# Shuffle so the held-out colonies are picked at random (unseeded, so the split differs per run)
random.shuffle(all_good_colony_names)
random.shuffle(all_bad_colony_names)

# The last 10 colonies of each class form the test set; the rest is used for training
training_colony_names = all_good_colony_names[:-10] + all_bad_colony_names[:-10]
test_colony_names = all_good_colony_names[-10:] + all_bad_colony_names[-10:]

training_data = bitbio_data[bitbio_data["ColonyName"].isin(training_colony_names)]
test_data = bitbio_data[bitbio_data["ColonyName"].isin(test_colony_names)]

# The grouping columns are no longer needed and must not be used as features
training_data = training_data.drop(columns=["ColonyName", "SampleName"])
test_data = test_data.drop(columns=["ColonyName", "SampleName"])
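
Because `random.shuffle` draws from the global random state, the split in this cell changes on every run, and so do the scores further down. A minimal sketch of a reproducible variant uses a dedicated `random.Random` instance with a fixed seed. The function name `balanced_colony_split`, the seed value, and the toy colony names are assumptions for illustration.

```python
import random


def balanced_colony_split(colony_names, n_test_per_class=10, seed=0):
    """Hold out n_test_per_class good and bad colonies; the rest is training."""
    rng = random.Random(seed)  # private generator, so the split is repeatable
    # Sort first so the result does not depend on the input order.
    good = sorted(name for name in colony_names if name.startswith("Good_"))
    bad = sorted(name for name in colony_names if name.startswith("Bad_"))
    rng.shuffle(good)
    rng.shuffle(bad)
    test = good[:n_test_per_class] + bad[:n_test_per_class]
    train = good[n_test_per_class:] + bad[n_test_per_class:]
    return train, test


# Toy colony names, made up for the example.
names = [f"Good_1_{i}" for i in range(1, 31)] + [f"Bad_1_{i}" for i in range(1, 16)]
train, test = balanced_colony_split(names)
print(len(train), len(test))  # 25 20
```

Calling the function twice with the same seed returns the same split, which makes later comparisons between models meaningful.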

## Training a model

We give the model all columns except 'is_good' and let it predict 'is_good'.

In [7]:
from sklearn.ensemble import RandomForestClassifier

y = training_data.is_good
X = training_data.drop(
    columns=[
        "is_good",
    ]
)
clf = RandomForestClassifier(max_depth=12, random_state=0)
clf.fit(X, y)

RandomForestClassifier(max_depth=12, random_state=0)

## Validation

We run a few simple validations: recall, precision, and the confusion matrix.

A good colony has value 1, a bad colony has value 0.

Precision and recall are both very high on the training set.
On the test set, precision for good colonies drops to about 0.80 and recall for bad colonies to about 0.54.
Still, not a bad score for a first attempt.

In [11]:
# Training set
from sklearn.metrics import confusion_matrix, precision_score, recall_score

p = clf.predict(X)
print("Precision Good", precision_score(y, p))
print("Recall Good", recall_score(y, p))
# Flip the labels (1 - x) so that "bad" becomes the positive class
print("Precision Bad", precision_score(1 - y, 1 - p))
print("Recall Bad", recall_score(1 - y, 1 - p))
confusion_matrix(y, p)

Precision Good 0.9798170923998738
Recall Good 1.0
Precision Bad 1.0
Recall Bad 0.8363171355498721


array([[ 327,   64],
       [   0, 3107]], dtype=int64)

In [12]:
# Test set
from sklearn.metrics import confusion_matrix, precision_score, recall_score

p2 = clf.predict(test_data.drop(columns=["is_good"]))
y2 = test_data.is_good
print("Precision", precision_score(y2, p2))
print("Recall", recall_score(y2, p2))
# Flip the labels (1 - x) so that "bad" becomes the positive class
print("Precision Bad", precision_score(1 - y2, 1 - p2))
print("Recall Bad", recall_score(1 - y2, 1 - p2))
confusion_matrix(y2, p2)

Precision 0.7952127659574468
Recall 1.0
Precision Bad 1.0
Recall Bad 0.5416666666666666


array([[ 91,  77],
       [  0, 299]], dtype=int64)
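
The printed scores can be recomputed by hand from the two confusion matrices, which is a useful sanity check. In scikit-learn's layout, rows are the true classes and columns the predicted classes, ordered bad (0) then good (1). The helper name `scores_from_confusion` is made up for this sketch.

```python
def scores_from_confusion(matrix):
    """Precision and recall for both classes from [[tn, fp], [fn, tp]].

    'Good' is the positive class, matching the notebook's labels.
    """
    (tn, fp), (fn, tp) = matrix
    return {
        "precision_good": tp / (tp + fp),
        "recall_good": tp / (tp + fn),
        "precision_bad": tn / (tn + fn),
        "recall_bad": tn / (tn + fp),
    }


train_scores = scores_from_confusion([[327, 64], [0, 3107]])
test_scores = scores_from_confusion([[91, 77], [0, 299]])

for name, value in test_scores.items():
    print(f"{name}: {value:.4f}")
```

On the test set, 77 of the 168 bad snapshots are predicted as good. That is why recall for the bad class is only about 0.54, even though the model never mislabels a good snapshot.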