# Part 4: Classification

In this notebook, we will exploit quantitative measurements in order to retrieve the different phenotypic groups present in a large image dataset. The end goal is to group samples (whether full images or individual objects) into different classes, a process referred to as *classification*. Here, we review commonly used (non-deep-learning) strategies for classification.

In [1]:
import os
import numpy as np
import imageio.v2 as imageio
import matplotlib.pyplot as plt

plt.rcParams['figure.dpi'] = 200

In this part, we will use pandas to handle numerical data (https://pandas.pydata.org/) and seaborn to generate cute plots. Feel free to consult the extensive documentation available on their websites if you want to know more about these libraries!

In [2]:
import pandas as pd
import seaborn as sb

## 1. Data loading

Following up on what we did in notebook 3 - Quantification, we will again work with feature matrices extracted from the BBBC010 dataset, which features dead and live *C. elegans* worms.

**1.1** Run the lines below to load and display the feature matrix for the entire BBBC010 dataset. Note that features are here reported *per-image*: they correspond to the average value of any given feature across all instances present in the image.

In [4]:
bbbc010_img_feats = pd.read_csv('data/Part 4/BBBC010/bbbc010_image_features.csv')
bbbc010_img_feats.set_index('image_id', inplace = True)

display(bbbc010_img_feats)

image_id,area,area_bbox,area_convex,area_filled,axis_major_length,axis_minor_length,eccentricity,equivalent_diameter_area,euler_number,extent,...,moments_weighted_hu-2,moments_weighted_hu-3,moments_weighted_hu-4,moments_weighted_hu-5,moments_weighted_hu-6,perimeter,perimeter_crofton,solidity,inertia_tensor_eigvals-0,inertia_tensor_eigvals-1
B19,943.0,4108.5,1106.0,943.0,123.570549,10.548067,0.996573,34.650125,1.0,0.228990,...,7588.008983,5412.063805,3.628178e+07,3.696531e+05,-244015.600002,247.959938,238.056804,0.849288,955.418213,6.961393
C08,992.0,4017.0,2320.0,992.0,105.018978,31.627365,0.962186,35.539466,1.0,0.270981,...,31037.408386,6120.538317,4.518897e+06,6.949827e+04,131317.931469,261.036580,250.159713,0.474510,689.311608,62.518137
A19,1007.0,5570.0,1256.0,1007.0,132.804181,10.995167,0.996344,35.805206,1.0,0.194813,...,9226.734523,6673.665822,4.696403e+07,4.239094e+05,-1899.335729,265.282792,256.847158,0.812467,1102.317839,7.557020
A17,1003.5,3452.0,1205.0,1003.5,130.924845,10.653125,0.996696,35.744738,1.0,0.256726,...,11922.301041,9432.248373,1.098919e+08,7.028946e+05,-7180.150784,267.734542,257.081984,0.834433,1071.545636,7.093067
C14,986.0,7068.0,1223.0,986.0,132.262768,10.362722,0.996913,35.431825,1.0,0.133833,...,58065.858521,39343.463171,1.872940e+09,2.767755e+06,102463.455280,266.960461,255.775905,0.833582,1093.339988,6.711625
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
D17,992.0,2640.0,1229.0,992.0,128.188235,10.288765,0.996859,35.539466,1.0,0.317766,...,58537.926367,55776.638663,3.187109e+09,4.670525e+06,35678.848885,257.154329,255.864418,0.817810,1027.013969,6.616168
D14,976.0,4068.0,1138.0,976.0,125.830275,10.531605,0.996413,35.251692,1.0,0.249263,...,4838.434642,2703.017315,5.447706e+06,1.222994e+05,33613.441270,253.303607,242.828395,0.849762,989.578625,6.932169
A23,1005.5,4443.0,1209.0,1005.5,132.000311,10.698218,0.996587,35.778792,1.0,0.252010,...,21727.929175,15534.741409,2.650161e+08,9.351359e+05,3827.231079,273.320328,264.756956,0.839605,1089.018195,7.153789
B18,1046.0,4346.0,1240.0,1046.0,133.212443,10.771310,0.996535,36.493952,1.0,0.232115,...,10618.905167,8311.790206,7.664504e+07,4.582506e+05,217124.289809,269.747258,258.417954,0.834098,1109.097186,7.251320


**1.2** In order to evaluate the quality of our classification attempts, we need a ground truth to compare to. Run the lines below to load and display the ground truth label for each image in the dataset.

In [5]:
bbbc010_img_gt = pd.read_csv('data/Part 4/BBBC010/bbbc010_image_ground_truth.csv')
bbbc010_img_gt.set_index('image_id', inplace = True)

display(bbbc010_img_gt)

image_id,label
A01,live
A02,live
A03,live
A04,live
A05,live
...,...
D24,dead
E01,live
E02,live
E03,live


## 2. Feature selection and dimensionality reduction

As we have seen in notebook 3 - Quantification, it is hard to find a single feature able to discriminate between dead and live worms. The feature matrix we are working with has thus been built by putting together an extensive collection of 33 measurements capturing shape and intensity, in the hope that all of these features considered together can capture the difference between the 'dead' and 'live' phenotypes.

While it is clear that more than a single feature is needed, some of the features may however be more informative than others. Among the 33 features considered, some may in fact be entirely uninformative. Revealing which features are relevant and which aren't and making sure that our feature matrix is not too redundant is the job of feature selection and dimensionality reduction methods, as we shall see now.

**2.1** In order to get an impression of whether each of the features in our matrix is informative, we can first investigate their distributions. We know that our dataset is composed of two classes (dead and live), and are therefore mostly interested in features that have a bimodal distribution. Can you spot features that seem to be uninformative?
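
As a numerical complement to looking at histograms, the sketch below counts distinct values and computes a relative spread for each feature. It uses a handful of values copied from the table displayed in 1.1; the names `feats`, `summary` and `constant_features` are illustrative and not part of the original notebook.

```python
import pandas as pd

# A few rows and columns copied from the feature matrix displayed in 1.1.
feats = pd.DataFrame(
    {
        "area": [943.0, 992.0, 1007.0, 1003.5, 986.0],
        "eccentricity": [0.996573, 0.962186, 0.996344, 0.996696, 0.996913],
        "euler_number": [1.0, 1.0, 1.0, 1.0, 1.0],
        "solidity": [0.849288, 0.474510, 0.812467, 0.834433, 0.833582],
    },
    index=pd.Index(["B19", "C08", "A19", "A17", "C14"], name="image_id"),
)

# Number of distinct values and relative spread (std / |mean|) per feature.
summary = pd.DataFrame(
    {
        "n_unique": feats.nunique(),
        "rel_spread": feats.std() / feats.mean().abs(),
    }
)

# A feature that takes a single value cannot separate two classes.
constant_features = summary.index[summary["n_unique"] == 1].tolist()

print(summary)
print("Constant features:", constant_features)
```

On the full matrix, something like `bbbc010_img_feats.hist(bins=30)` should draw one histogram per feature, which makes bimodal distributions easier to spot by eye.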

**2.2** Correlation matrices for each feature
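
A minimal sketch of how redundancy between features could be examined with a correlation matrix. The data is synthetic and the 0.9 threshold is an arbitrary assumption; only the column names are borrowed from the feature matrix.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the feature matrix (values are NOT from BBBC010).
rng = np.random.default_rng(0)
n = 100
length = rng.normal(120.0, 10.0, n)
toy = pd.DataFrame(
    {
        "axis_major_length": length,
        # Nearly a rescaled copy of the length: a redundant feature.
        "perimeter": 2.0 * length + rng.normal(0.0, 1.0, n),
        # Drawn independently of the two others.
        "solidity": rng.normal(0.8, 0.05, n),
    }
)

# Pairwise Pearson correlation between all features.
corr = toy.corr()

# List each pair of distinct features whose absolute correlation is high.
threshold = 0.9  # arbitrary cut-off chosen for this illustration
cols = list(corr.columns)
redundant = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]

print(corr.round(2))
print(redundant)
```

On the real data, `sb.heatmap(bbbc010_img_feats.corr())` would be one way to visualize the full matrix at once.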

**2.3** The Fisher score provides a formal way of evaluating how much information we can get from each feature and of selecting features accordingly. It compares how much a feature varies between our two classes (*inter-class* variance) with how much it varies within each class (*intra-class* variance). Features that vary strongly between classes but little within a given class have high discriminative power, reflected by a large Fisher score. The Fisher score is based on the notion of Fisher information, a core theoretical concept in information theory (https://en.wikipedia.org/wiki/Fisher_information).

Run the lines below to compute the Fisher score of the features in our matrix and visualize their distribution. Which ones do you identify as being uninformative? Does that corroborate your observations from 2.1?
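
The sketch below implements one common definition of the Fisher score: the class-size-weighted squared distance between class means and the overall mean, divided by the class-size-weighted within-class variance. The function `fisher_score` and the toy data are assumptions made for illustration; the notebook's own implementation may differ.

```python
import numpy as np
import pandas as pd


def fisher_score(features: pd.DataFrame, labels: pd.Series) -> pd.Series:
    """Between-class over within-class variance, one score per feature."""
    overall_mean = features.mean()
    between = pd.Series(0.0, index=features.columns)
    within = pd.Series(0.0, index=features.columns)
    for _, group in features.groupby(labels):
        n_k = len(group)
        between += n_k * (group.mean() - overall_mean) ** 2
        within += n_k * group.var(ddof=0)
    # Constant features give 0/0 = NaN and end up last after sorting.
    return (between / within).sort_values(ascending=False)


# Synthetic two-class data: one feature separates the classes, one does not.
rng = np.random.default_rng(0)
labels = pd.Series(["live"] * 50 + ["dead"] * 50, name="label")
toy = pd.DataFrame(
    {
        "informative": np.r_[rng.normal(0.0, 1.0, 50), rng.normal(5.0, 1.0, 50)],
        "noise": rng.normal(0.0, 1.0, 100),
    }
)

scores = fisher_score(toy, labels)
print(scores)
```

Assuming the two tables share the same `image_id` index, a call such as `fisher_score(bbbc010_img_feats, bbbc010_img_gt['label'])` would apply the same computation to the real matrix.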

**2.4** Linear Discriminant Analysis (LDA, https://en.wikipedia.org/wiki/Linear_discriminant_analysis)
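
A minimal sketch of LDA using scikit-learn's `LinearDiscriminantAnalysis`, on synthetic data. With two classes, LDA can produce at most one discriminant axis, so every sample is projected onto a single number. The data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic data: the two classes differ only along the first feature.
rng = np.random.default_rng(0)
X = np.vstack(
    [
        rng.normal(0.0, 1.0, (50, 3)),
        rng.normal([4.0, 0.0, 0.0], 1.0, (50, 3)),
    ]
)
y = np.array(["live"] * 50 + ["dead"] * 50)

# Two classes allow at most n_classes - 1 = 1 discriminant component.
lda = LinearDiscriminantAnalysis(n_components=1)
projected = lda.fit_transform(X, y)

print(projected.shape)                        # one coordinate per sample
print("Training accuracy:", lda.score(X, y))
```

Note that, unlike PCA, LDA uses the class labels to find its projection, so it is a supervised way of reducing dimensionality.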

**2.5** Beyond selecting individual features, another way to reduce the dimensionality of our feature matrix is to try and find a few combinations of features that can explain most of the variability present in the matrix. This is the idea behind the famous principal component analysis (PCA, https://en.wikipedia.org/wiki/Principal_component_analysis). 

Run the lines below to 1) extract the first N principal components of our feature matrix, and 2) plot the cumulative variance that they are able to explain. Based on this, how many principal components do you think are sufficient to analyze this dataset? How does that compare to the number of features we initially had, and how does that relate to the distribution of Fisher scores you observed in 2.3?
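
The two steps can be sketched with scikit-learn as follows. The data is synthetic (six features built from two underlying factors), and the 95% cut-off is an arbitrary assumption; standardizing first is a common choice when features live on very different scales, as they do in our matrix.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: six observed features driven by two hidden factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = np.array(
    [
        [1.0, 1.0, 1.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 1.0, 1.0, 1.0],
    ]
)
X = latent @ mixing + 0.1 * rng.normal(size=(200, 6))

# Standardize so that no feature dominates only because of its units.
X_scaled = StandardScaler().fit_transform(X)

# 1) Extract the principal components.
pca = PCA(n_components=6).fit(X_scaled)

# 2) Cumulative fraction of variance explained by the first k components.
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components reaching the (arbitrary) 95% cut-off.
n_needed = int(np.searchsorted(cumulative, 0.95)) + 1

print(cumulative.round(3))
print("Components needed:", n_needed)
```

Plotting `cumulative` against the component index, for instance with `plt.plot(range(1, 7), cumulative, marker='o')`, gives the usual "elbow" view of the same information.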

## 3. Unsupervised classification

Now that we have "cleaned up" our feature matrix, we can dig into the actual classification.

**3.1** K-means clustering (https://en.wikipedia.org/wiki/K-means_clustering)
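
A minimal sketch of K-means with scikit-learn's `KMeans`, asking for two clusters since we expect two phenotypes. The two synthetic blobs and all variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic data: two well-separated groups of 60 and 40 samples.
rng = np.random.default_rng(0)
X = np.vstack(
    [
        rng.normal(0.0, 1.0, (60, 2)),
        rng.normal(6.0, 1.0, (40, 2)),
    ]
)

# K-means relies on Euclidean distances, so features are standardized first.
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X_scaled)

print(np.bincount(cluster_ids))  # number of samples in each cluster
```

The cluster ids (0 and 1) are arbitrary: K-means never sees the labels, so deciding which cluster corresponds to 'dead' and which to 'live' requires comparing the clusters with the ground truth afterwards.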

**3.2** Kolmogorov–Smirnov test (https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test): check whether the samples from both clusters are drawn from the same distribution
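
A sketch of the two-sample test using `scipy.stats.ks_2samp`, applied to the values of a single feature in each cluster. The samples below are synthetic stand-ins for per-cluster feature values.

```python
import numpy as np
from scipy.stats import ks_2samp

# Synthetic values of one feature in two clusters (NOT from BBBC010).
rng = np.random.default_rng(0)
cluster_a = rng.normal(0.80, 0.05, 200)
cluster_b = rng.normal(0.60, 0.05, 200)      # shifted distribution
same_as_a = rng.normal(0.80, 0.05, 200)      # same distribution as cluster_a

# The statistic is the largest gap between the two empirical CDFs.
res_diff = ks_2samp(cluster_a, cluster_b)
res_same = ks_2samp(cluster_a, same_as_a)

print("different:", res_diff.statistic, res_diff.pvalue)
print("same:     ", res_same.statistic, res_same.pvalue)
```

A small p-value suggests that the two samples are unlikely to come from the same distribution. Since the test handles one feature at a time, it would have to be repeated for each feature of interest.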

## 4. Evaluating classification performance

**4.1** Confusion matrix (https://en.wikipedia.org/wiki/Confusion_matrix)
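
A confusion matrix can be built directly with `pd.crosstab`, as sketched below. The ground truth and predictions are made up for illustration.

```python
import pandas as pd

# Hypothetical ground truth and predictions for eight samples.
truth = pd.Series(
    ["live", "live", "live", "dead", "dead", "dead", "dead", "live"],
    name="ground truth",
)
prediction = pd.Series(
    ["live", "live", "dead", "dead", "dead", "live", "dead", "live"],
    name="prediction",
)

# Rows are true classes, columns are predicted classes.
confusion = pd.crosstab(truth, prediction)
print(confusion)
```

Diagonal entries count correct predictions; off-diagonal entries count each kind of mistake. When predictions come from a clustering method, the cluster ids first have to be mapped to 'dead' and 'live'.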

**4.2** Classification metrics
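
The usual metrics can all be derived from the four entries of a binary confusion matrix. The sketch below treats 'dead' as the positive class, which is an arbitrary choice; the helper `classification_metrics` and the example data are hypothetical.

```python
def classification_metrics(truth, prediction, positive="dead"):
    """Accuracy, precision, recall and F1 for a binary problem."""
    pairs = list(zip(truth, prediction))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn

    # Ratios are undefined (NaN) when their denominator is zero.
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else float("nan")
    )
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical example: 4 dead and 6 live samples.
truth = ["dead"] * 4 + ["live"] * 6
prediction = ["dead", "dead", "dead", "live"] + ["live"] * 4 + ["dead"] * 2

metrics = classification_metrics(truth, prediction)
print(metrics)
```

Here 3 of the 4 dead samples are found (recall 0.75), but only 3 of the 5 samples predicted as dead really are (precision 0.6), which accuracy alone (0.7) would not reveal.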

## 5. Supervised classification

**5.1** Training and test set
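
A sketch of splitting the data with scikit-learn's `train_test_split`. The 75/25 ratio is an assumption, and `stratify` is used so that both subsets keep the same proportion of dead and live samples. The toy table and label series stand in for the real feature matrix and ground truth.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the feature matrix and the ground truth labels.
rng = np.random.default_rng(0)
toy_feats = pd.DataFrame(
    rng.normal(size=(100, 4)), columns=["f0", "f1", "f2", "f3"]
)
toy_labels = pd.Series(["live"] * 50 + ["dead"] * 50, name="label")

# Hold out 25% of the samples; never use them while fitting the model.
X_train, X_test, y_train, y_test = train_test_split(
    toy_feats,
    toy_labels,
    test_size=0.25,
    stratify=toy_labels,
    random_state=0,
)

print(len(X_train), len(X_test))
print(y_test.value_counts())
```

Fixing `random_state` makes the split reproducible, which helps when comparing several classifiers on the same held-out samples.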

**5.2** Support Vector Machines (https://en.wikipedia.org/wiki/Support-vector_machine)
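
A minimal sketch of training and evaluating a support vector machine with scikit-learn. The RBF kernel and `C=1.0` are library defaults rather than tuned choices, and the data is synthetic. Scaling is bundled with the classifier in a pipeline so that it is fitted on the training set only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic two-class data (NOT from BBBC010).
rng = np.random.default_rng(0)
X = np.vstack(
    [
        rng.normal(0.0, 1.0, (100, 2)),
        rng.normal(4.0, 1.0, (100, 2)),
    ]
)
y = np.array(["live"] * 100 + ["dead"] * 100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Standardize, then fit the SVM; both steps only see the training set.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
svm.fit(X_train, y_train)

# Accuracy on samples the model has never seen.
test_accuracy = svm.score(X_test, y_test)
print("Test accuracy:", test_accuracy)
```

`svm.predict(X_test)` returns the predicted labels themselves, which can then be compared with `y_test` in a confusion matrix.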

**5.3** Decision trees (https://en.wikipedia.org/wiki/Decision_tree)
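
A sketch of a shallow decision tree with scikit-learn. One appeal of trees is that the learned rules can be printed and read, which `export_text` does below. The data is synthetic, and `feature_a` and `feature_b` are placeholder names.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data: only the first feature carries class information.
rng = np.random.default_rng(0)
informative = np.r_[rng.normal(0.0, 1.0, 100), rng.normal(6.0, 1.0, 100)]
noise = rng.normal(0.0, 1.0, 200)
X = np.column_stack([informative, noise])
y = np.array(["live"] * 100 + ["dead"] * 100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Limiting the depth keeps the tree readable and reduces overfitting.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X_train, y_train)

# Human-readable version of the learned thresholds.
rules = export_text(tree, feature_names=["feature_a", "feature_b"])
print(rules)

test_accuracy = tree.score(X_test, y_test)
print("Test accuracy:", test_accuracy)
print("Feature importances:", tree.feature_importances_)
```

The `feature_importances_` attribute offers another view on which features matter, which can be compared with the Fisher scores computed earlier.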

## BONUS. Classifying individual objects

**6.1** Run the lines below to load and display the feature matrix for the entire BBBC010 dataset. This time, features are reported for individual object instances.

In [6]:
bbbc010_obj_feats = pd.read_csv('data/Part 4/BBBC010/bbbc010_object_features.csv')
bbbc010_obj_feats.set_index('instance_id', inplace = True)

display(bbbc010_obj_feats)

instance_id,area,area_bbox,area_convex,area_filled,axis_major_length,axis_minor_length,eccentricity,equivalent_diameter_area,euler_number,extent,...,moments_weighted_hu-2,moments_weighted_hu-3,moments_weighted_hu-4,moments_weighted_hu-5,moments_weighted_hu-6,perimeter,perimeter_crofton,solidity,inertia_tensor_eigvals-0,inertia_tensor_eigvals-1
B19_14,902,5168,1017,902,103.543142,11.616397,0.993687,33.888967,1,0.174536,...,23940.345921,19475.162046,4.203367e+08,1.511937e+06,-1.241909e+07,215.705627,207.183275,0.886922,670.073886,8.433792
B19_14,892,6248,1048,892,119.446128,9.930965,0.996538,33.700589,1,0.142766,...,2535.212490,1829.380476,3.933488e+06,1.364230e+05,-2.211612e+05,236.090404,226.509255,0.851145,891.711088,6.164004
B19_14,1073,2356,1264,1073,132.409286,10.895316,0.996609,36.961954,1,0.455433,...,34302.045488,33058.033002,1.113206e+09,2.363752e+06,-7.176924e+05,265.740115,254.618944,0.848892,1095.763693,7.419245
B19_14,870,3780,1087,870,114.346205,10.926564,0.995424,33.282404,1,0.230159,...,8654.502748,3454.043512,1.760087e+07,1.579022e+05,-6.844388e+06,239.994949,230.603696,0.800368,817.190906,7.461862
B19_14,969,7569,1098,969,127.694970,9.925922,0.996974,35.125050,1,0.128022,...,7172.795870,7184.270060,5.157250e+07,4.813559e+05,-4.276454e+03,255.924928,245.509912,0.882514,1019.125338,6.157746
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
A01_15,799,1408,1092,799,46.779707,34.087398,0.684855,31.895429,-1,0.567472,...,739.318118,20.898658,-1.294405e+02,2.005234e+01,-2.594499e+03,205.758405,198.964503,0.731685,136.771310,72.621921
B01_07,978,4305,1410,978,116.127593,13.314544,0.993405,35.287792,1,0.227178,...,11175.911702,9020.480529,9.053035e+07,5.652779e+05,2.691117e+06,246.409163,236.292052,0.693617,842.851118,11.079817
C10_10,1069,1710,1345,1069,120.147486,12.942310,0.994181,36.892995,1,0.625146,...,54144.930588,41022.308572,1.931240e+09,3.362168e+06,-9.015857e+07,242.911688,232.976238,0.794796,902.213649,10.468962
D13_03,986,3157,2041,986,92.380767,31.010642,0.941975,35.431825,1,0.312322,...,12931.994491,1845.573247,-3.473114e+05,-3.817916e+04,9.009636e+06,223.681241,214.744631,0.483097,533.387879,60.103744


**6.2** Run the lines below to load and display the ground truth label for each instance in the dataset.

In [8]:
bbbc010_obj_gt = pd.read_csv('data/Part 4/BBBC010/bbbc010_object_ground_truth.csv')
bbbc010_obj_gt.set_index('instance_id', inplace = True)

display(bbbc010_obj_gt)

instance_id,label
B19_14,dead
C08_03,live
A19_06,dead
A17_08,dead
C14_13,dead
...,...
A01_15,live
B01_07,live
C10_10,live
D13_03,live


**6.3** Adapt the analysis we did above to classify individual worms into 2 categories: dead or live.

**6.4** Identify images in which there are misclassified worms and find a good way to visualize the result.
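
One possible starting point is sketched below. It assumes that each `instance_id` has the form `<image_id>_<object number>`, as the displayed tables suggest, and uses made-up predictions; the ids and labels are taken from the ground truth excerpt shown in 6.2.

```python
import pandas as pd

# Labels from the excerpt in 6.2; the predictions are hypothetical.
results = pd.DataFrame(
    {
        "label": ["dead", "live", "dead", "live", "live"],
        "prediction": ["dead", "dead", "dead", "live", "dead"],
    },
    index=pd.Index(
        ["B19_14", "C08_03", "A19_06", "A01_15", "B01_07"], name="instance_id"
    ),
)

# Assumption: the part before the underscore identifies the source image.
results["image_id"] = results.index.str.split("_").str[0]
results["misclassified"] = results["label"] != results["prediction"]

# Number of misclassified worms in each image.
n_wrong = results.groupby("image_id")["misclassified"].sum()
images_with_errors = n_wrong.index[n_wrong > 0].tolist()

print(n_wrong)
print("Images with errors:", images_with_errors)
```

From there, a bar plot of `n_wrong`, or an overlay of the misclassified objects on the corresponding images, are two possible ways to visualize the result.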