## IIa- Processing labels
### Aim

The aim here is to end up with a fundus-based dataset in which each row corresponds to one eye fundus and its related information, rather than to one patient with both fundi and their information (see below).
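
As a minimal sketch of that target layout (not necessarily the code used later in this notebook), the toy example below stacks the left-eye and right-eye columns of a two-row frame so that each eye gets its own row. The helper `to_eye_level` and the output column names `Eye`, `Fundus` and `Diagnostic Keywords` are assumptions made for illustration.

```python
import pandas as pd

# Toy patient-level frame that borrows the column names of the dataset.
patients = pd.DataFrame({
    "ID": [0, 1],
    "Patient Age": [69, 57],
    "Left-Fundus": ["0_left.jpg", "1_left.jpg"],
    "Right-Fundus": ["0_right.jpg", "1_right.jpg"],
    "Left-Diagnostic Keywords": ["cataract", "normal fundus"],
    "Right-Diagnostic Keywords": ["normal fundus", "normal fundus"],
})

def to_eye_level(df):
    """Return one row per eye: shared patient info plus that eye's image and keywords.

    Hypothetical helper; the output column names are illustrative.
    """
    parts = []
    for side in ("Left", "Right"):
        part = df[["ID", "Patient Age"]].copy()
        part["Eye"] = side
        part["Fundus"] = df[f"{side}-Fundus"]
        part["Diagnostic Keywords"] = df[f"{side}-Diagnostic Keywords"]
        parts.append(part)
    stacked = pd.concat(parts, ignore_index=True)
    # Keep both eyes of a patient next to each other.
    return stacked.sort_values(["ID", "Eye"]).reset_index(drop=True)

eyes = to_eye_level(patients)
print(eyes)
```

Each patient row becomes two rows, so the eye-level frame has twice as many rows as the patient-level one.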

In [12]:
import pandas as pd
import numpy as np

df=pd.read_csv('Fil_rouge_TV.csv')
df.head()

Unnamed: 0,ID,Patient Age,Patient Sex,Left-Fundus,Right-Fundus,Left-Diagnostic Keywords,Right-Diagnostic Keywords,N,D,G,C,A,H,M,O
0,0,69,Female,0_left.jpg,0_right.jpg,cataract,normal fundus,0,0,0,1,0,0,0,0
1,1,57,Male,1_left.jpg,1_right.jpg,normal fundus,normal fundus,1,0,0,0,0,0,0,0
2,2,42,Male,2_left.jpg,2_right.jpg,laser spot，moderate non proliferative retinopathy,moderate non proliferative retinopathy,0,1,0,0,0,0,0,1
3,3,66,Male,3_left.jpg,3_right.jpg,normal fundus,branch retinal artery occlusion,0,0,0,0,0,0,0,1
4,4,53,Male,4_left.jpg,4_right.jpg,macular epiretinal membrane,mild nonproliferative retinopathy,0,1,0,0,0,0,0,1


### Processing labels

Since several combinations of labels can occur, I will:
1) identify the different label combinations within the dataset,
2) focus on the main ones, based on their weight across the whole dataset.

#### 1) Identifying label combinations of interest

Here I:
- create a column pooling the values of the 8 label columns,
- count the number of unique label combinations present,
- check the weight of each individual combination relative to the others,
- identify the top label combinations of interest based on their cumulative weight, and visualize them.

In [13]:
#1 Pool the 8 label columns (N, D, ..., O) into one string per row:
label_cols = ['N', 'D', 'G', 'C', 'A', 'H', 'M', 'O']
df = df.assign(Pooled_labels=df[label_cols].apply(lambda row: ''.join(str(each) for each in row), axis=1))
#df.head()

#2 Count the number of unique combinations:
print("There are", df['Pooled_labels'].nunique(), "label combinations in total.")

#3 Assess the weight of each individual combination relative to all combinations:
print('\033[1m\nWeight of each individual label combination\ncompared to all combinations (in %):\033[0m')
Comb_labl1 = pd.DataFrame(df['Pooled_labels'].value_counts(normalize=True).round(4)*100)
Comb_labl1 = Comb_labl1.rename({'Pooled_labels': 'Pooled'}, axis=1)
pd.set_option('display.max_columns', None)
Comb_labl1.head(25)
# To display results horizontally:
#Comb_labl1.T

There are 37 label combinations in total.

Weight of each individual label combination
compared to all combinations (in %):


Unnamed: 0,Pooled
10000000,32.57
01000000,19.91
00000001,15.74
01000001,8.06
00010000,4.17
00100000,3.43
00001000,3.34
00000010,3.06
01000100,1.26
00000011,1.14


In [14]:
#4 Determine the weight of the top labels to identify most relevant combination: 
print('--> The first 12 combined labels covers',df['Pooled_labels'].value_counts(normalize=True).head(12).round(4).sum()*100,'% of the whole dataset,')   
print('--> The first 15 combined labels covers',df['Pooled_labels'].value_counts(normalize=True).head(15).round(4).sum()*100,'% of the whole dataset.\n')

#5 Plot the top 12 label combinations to visualize their share relative to all unique combinations:
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import math
import colorcet as cc
from colorcet.plotting import swatch, swatches
from bokeh.plotting import figure, show, output_notebook
output_notebook()
from bokeh.palettes import Category20c
from bokeh.models import Legend
from bokeh.palettes import Paired
from bokeh.models import LabelSet, ColumnDataSource, HoverTool
from bokeh.layouts import column
from bokeh.layouts import row

#5a names of the top 12 sectors
sectors = df['Pooled_labels'].value_counts().head(12).index
# percentages are computed over all combinations, so the 12 wedges leave a gap for the remaining ones
percentages = df['Pooled_labels'].value_counts(normalize=True).round(3)*100
#5b converting percentages into radians
radians = [math.radians((percent / 100) * 360) for percent in percentages]
#5c starting angle of each wedge (running sum of the previous wedges)
start_angle = [math.radians(0)]
prev = start_angle[0]
for i in radians[:-1]:
    start_angle.append(i + prev)
    prev = i + prev
#5d ending angle of each wedge (the start of the next one)
end_angle = start_angle[1:] + [math.radians(0)]
#5e center of the pie chart
x = 0
y = 0  
#5f radius of the glyphs
radius = 1
#5g color of the wedges
color=Category20c[len(sectors)]

#5h instantiating the figure object
# (plot_width/plot_height are the Bokeh 2.x names; Bokeh 3.x uses width/height)
graph = figure(title = "Top 12 label combinations", x_range=(-.7, .7), plot_width=500, plot_height=500)
graph.add_layout(Legend(), 'right')

# one wedge per top label combination
for i in range(len(sectors)):
    graph.annular_wedge(x, y,
                        inner_radius=0.45,
                        outer_radius=0.65,
                        direction="anticlock",
                        start_angle = start_angle[i],
                        end_angle = end_angle[i],
                        color = color[i],
                        legend_label = sectors[i],
                        fill_alpha=0.7,
                        line_color='gray',
                        hover_color = 'blue',
                        hover_alpha = 0.5)

# styling, set once after the wedges are drawn
graph.axis.visible = False
graph.grid.grid_line_color = None
graph.title.align = 'center'
graph.title.text_font_size = '16pt'
graph.legend.click_policy = 'hide'
    
from bokeh.models import Panel, Tabs
show(graph)

--> The first 12 combined labels covers 94.71 % of the whole dataset,
--> The first 15 combined labels covers 96.89 % of the whole dataset.
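
Instead of trying cut-offs such as 12 or 15 by hand, the number of combinations needed to reach a given coverage can be read off a cumulative sum. The sketch below illustrates the idea on made-up counts; `combos_needed` and its `target` parameter are hypothetical names, not part of the notebook.

```python
import pandas as pd

def combos_needed(labels, target):
    """Return (k, coverage): the smallest number k of most frequent values whose
    cumulative share of `labels` reaches `target` (a fraction between 0 and 1),
    together with the share actually covered by those k values.

    Hypothetical helper; expects a non-empty Series.
    """
    shares = labels.value_counts(normalize=True)   # sorted from most to least frequent
    cumulative = shares.cumsum()
    k = int((cumulative < target).sum()) + 1       # first position that reaches the target
    k = min(k, len(shares))                        # guard against rounding at 100 %
    return k, float(cumulative.iloc[k - 1])

# Toy pooled labels with made-up counts: 6 + 3 + 1 = 10 rows.
toy = pd.Series(["10000000"] * 6 + ["01000000"] * 3 + ["00000001"] * 1)
k, coverage = combos_needed(toy, target=0.85)
print(k, round(coverage, 2))
```

On the real data this would be called as `combos_needed(df['Pooled_labels'], target=0.94)`, assuming `df` still holds the pooled column.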



<i> --> As a matter of practicality, I will focus on the top 12 label combinations (out of 37), which represent over 94% of the whole dataset, as shown in the chart.</i>
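
The start and end angles of the wedges can also be obtained from a running sum. The following is a minimal standard-library sketch of that computation; `wedge_angles` is a hypothetical helper, and it assumes the percentages are passed in the order in which the wedges should be drawn.

```python
import math
from itertools import accumulate

def wedge_angles(percentages):
    """Return (starts, ends): the angles in radians of consecutive wedges,
    each wedge spanning `percent` % of a full turn."""
    if not percentages:
        return [], []
    sweeps = [percent / 100 * 2 * math.pi for percent in percentages]
    ends = list(accumulate(sweeps))   # running sum of the sweeps
    starts = [0.0] + ends[:-1]        # each wedge starts where the previous one ends
    return starts, ends

starts, ends = wedge_angles([50, 25, 25])
print([round(math.degrees(a)) for a in starts])  # start angles in degrees
print([round(math.degrees(a)) for a in ends])    # end angles in degrees
```

With this formulation, percentages that do not add up to 100 simply leave the rest of the circle empty.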
   
#### 2) Processing dataset based on label combinations of interest

Since I will focus on the top 12 label combinations out of 37, I will:
- remove the remaining 25 minor label combinations (about 5% of the whole dataset),
- assess the impact of their removal on the dataset,
- rename the 12 remaining label combinations properly,
- check that the remaining 12 label combinations contain the 8 individual labels of interest for our classification model.
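
The filtering and renaming steps can be sketched without hard-coded lists, assuming the pooled string follows the column order N, D, G, C, A, H, M, O. In this illustration on made-up rows, `decode` and `keep_top` are hypothetical helpers, and the display names mirror those used in this notebook.

```python
import pandas as pd

# One display name per position of the pooled string (order: N, D, G, C, A, H, M, O).
LABEL_NAMES = ["Normal", "Diabetes", "Glaucoma", "Cataract",
               "AMD", "Hypertension", "Myopia", "Others"]

def decode(pooled):
    """Turn an 8-character 0/1 string into a name such as 'Diabetes-Others'."""
    active = [name for bit, name in zip(pooled, LABEL_NAMES) if bit == "1"]
    return "-".join(active)

def keep_top(df, column, k):
    """Keep only the rows whose value in `column` is among the k most frequent."""
    top = df[column].value_counts().head(k).index
    return df[df[column].isin(top)].copy()

# Toy pooled labels with made-up counts; the rarest combination is dropped.
toy = pd.DataFrame({"Pooled_labels": ["10000000"] * 4 + ["01000001"] * 3
                                     + ["00000011"] * 2 + ["00110000"] * 1})
kept = keep_top(toy, "Pooled_labels", k=3)
kept["Labels"] = kept["Pooled_labels"].map(decode)
print(kept["Labels"].value_counts())
```

Deriving the names from the bit positions avoids typing each combination by hand, at the cost of less control over how the combined names are spelled.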

In [15]:
#6 Keep the top 12 label combinations (out of 37) by dropping the other 25:
drop_values = ['01010000','01100000','00010001','01001000','00001001',
               '01000010','00101000','00000101','01100001','01010001',
               '00100010','00100100','00001100','00100101','01000011',
               '00010100','00001010','00110000','00110001','01000101',
               '01001010','00100011','01100010','00101001','01001001']
# isin() compares whole values; str.contains() would treat the joined list as a regex
df1 = df[~df['Pooled_labels'].isin(drop_values)]

## Or, without hard-coding the list:
#drop = df['Pooled_labels'].value_counts().tail(25).index
#df1 = df[~df['Pooled_labels'].isin(drop)]

#7 Assess the number of lines removed from the dataset
print("Shape of the dataset after removing all 'minor' label combinations :",df1.shape)
n_removed = len(df) - len(df1)
pct_removed = n_removed * 100 / len(df)
print("A total of",n_removed,"lines have been removed, representing",round(pct_removed, 2),"% of the original dataset.\n")

#8a Rename each combination based on the positive label(s) (value = 1) within the combination : 
my_dict= {'10000000':'Normal', '01000000':'Diabetes', '00000001':'Others',
          '01000001':'Diabetes-Others', '00010000':'Cataract', '00100000':'Glaucoma',
          '00001000':'AMD', '00000010':'Myopia', '01000100':'Diabetes-Hypertension',
          '00000011':'Myopia-Others', '00000100':'Hypertension', '00100001':'Glaucoma-Others'}

df1=df1.replace({'Pooled_labels': my_dict})

#8b Rename the column 'Pooled_labels'
df1 = df1.rename({'Pooled_labels': 'Labels'}, axis=1)

df1.head()

Shape of the dataset after removing all 'minor' label combinations : (3315, 16)
A total of 185 lines have been removed, representing 5.29 % of the original dataset.



Unnamed: 0,ID,Patient Age,Patient Sex,Left-Fundus,Right-Fundus,Left-Diagnostic Keywords,Right-Diagnostic Keywords,N,D,G,C,A,H,M,O,Labels
0,0,69,Female,0_left.jpg,0_right.jpg,cataract,normal fundus,0,0,0,1,0,0,0,0,Cataract
1,1,57,Male,1_left.jpg,1_right.jpg,normal fundus,normal fundus,1,0,0,0,0,0,0,0,Normal
2,2,42,Male,2_left.jpg,2_right.jpg,laser spot，moderate non proliferative retinopathy,moderate non proliferative retinopathy,0,1,0,0,0,0,0,1,Diabetes-Others
3,3,66,Male,3_left.jpg,3_right.jpg,normal fundus,branch retinal artery occlusion,0,0,0,0,0,0,0,1,Others
4,4,53,Male,4_left.jpg,4_right.jpg,macular epiretinal membrane,mild nonproliferative retinopathy,0,1,0,0,0,0,0,1,Diabetes-Others


In [16]:
#9 Check the 12 remaining label combinations to ensure they contain the 8 individual labels of interest
# for our classification model (N, D, G, C, A, H, M, O)
Comb_labl=pd.DataFrame(df1['Labels'].value_counts(normalize=True).round(4)*100)
pd.set_option('display.max_columns', None)
Comb_labl

Unnamed: 0,Labels
Normal,34.39
Diabetes,21.03
Others,16.62
Diabetes-Others,8.51
Cataract,4.4
Glaucoma,3.62
AMD,3.53
Myopia,3.23
Diabetes-Hypertension,1.33
Myopia-Others,1.21
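
The table above lists the retained combinations; whether each of the 8 individual labels still occurs can also be checked directly on the one-hot columns. Here is a minimal sketch on made-up rows, where `missing_labels` is a hypothetical helper rather than part of the notebook.

```python
import pandas as pd

LABEL_COLUMNS = ["N", "D", "G", "C", "A", "H", "M", "O"]

def missing_labels(df, label_columns=LABEL_COLUMNS):
    """Return the label columns that are never positive (== 1) in `df`."""
    positives = df[label_columns].sum()   # number of positive rows per label
    return [col for col in label_columns if positives[col] == 0]

# Toy frame with made-up rows in which label 'H' never occurs.
toy = pd.DataFrame(
    [[1, 0, 0, 0, 0, 0, 0, 0],
     [0, 1, 0, 0, 0, 0, 0, 1],
     [0, 0, 1, 1, 0, 0, 0, 0],
     [0, 0, 0, 0, 1, 0, 1, 0]],
    columns=LABEL_COLUMNS,
)
print(missing_labels(toy))
```

An empty list from `missing_labels(df1)` would confirm that all 8 labels are still represented after the filtering.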


In [8]:
#10 Save this temporary dataset
df1.to_csv('Temp_df.csv')