<div align='center'><font size="5" color='#353B47'>Chest-X-ray</font></div>
<div align='center'><font size="4" color="#353B47">How to deal with model averaging?</font></div>
<br>
<hr>

The objective of this notebook is to approach different methods for ensembling:

* OR method (Affirmative): A box is considered if it’s generated by at least one of the models.
* AND method (Unanimous): A box is considered if all of the models generate the same box (the box is considered the same if IOU > 0.5).
* Consensus method: A box is considered if the majority of the models generate the same box, i.e. if there are m models and at least (m/2 + 1) of them (integer division) generate the same box, that box is considered valid.
* Weighted Fusion: This is a novel method which was created to replace NMS and its shortcomings.
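
The three voting rules above differ only in how many models must agree on a box. The following is a minimal sketch of that idea, not part of the notebook's pipeline: the names `iou` and `vote` are hypothetical, boxes are assumed to be `(x_min, y_min, x_max, y_max)` tuples, and class labels are ignored for brevity.

```python
from typing import List, Sequence

Box = Sequence[float]  # assumed layout: (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def vote(boxes_per_model: List[List[Box]], min_votes: int, thr: float = 0.5) -> List[Box]:
    """Keep one representative of each box produced by at least `min_votes` models."""
    kept: List[Box] = []
    for boxes in boxes_per_model:
        for box in boxes:
            # Skip boxes that already have a representative in the output
            if any(iou(box, k) > thr for k in kept):
                continue
            # Count the models (the current one included) with a matching box
            votes = sum(
                any(iou(box, other) > thr for other in other_boxes)
                for other_boxes in boxes_per_model
            )
            if votes >= min_votes:
                kept.append(box)
    return kept


# Toy predictions from three hypothetical models
models = [
    [(0, 0, 10, 10), (50, 50, 60, 60)],
    [(1, 1, 10, 10)],
    [(0, 0, 9, 9), (50, 50, 61, 61)],
]
m = len(models)
print("OR       :", vote(models, min_votes=1))
print("AND      :", vote(models, min_votes=m))
print("Consensus:", vote(models, min_votes=m // 2 + 1))
```

With this formulation, OR corresponds to `min_votes=1`, AND to `min_votes=m`, and consensus to `min_votes=m // 2 + 1`. The first box seen is kept as the representative; averaging the matching boxes instead would be an equally reasonable choice.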

# <div id="summary">Summary</div>

**<font size="2"><a href="#chap1">1. Load libraries and dataframes with predictions</a></font>**
**<br><font size="2"><a href="#chap2">2. Helper functions</a></font>**
**<br><font size="2"><a href="#chap3">3. Run ensembling with appropriate strategy</a></font>**
**<br><font size="2"><a href="#chap4">4. Save results</a></font>**

# <div id="chap1">1. Load libraries and dataframes with predictions</div>

In [None]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import os
from tqdm import tqdm

In [None]:
# Load the outputs of each selected model
# yolo = pd.read_csv('../input/vinbigdatastack/yolov5.csv')
# detectron = pd.read_csv('../input/vinbigdatastack/detectron2.csv')
# fasterrcnn = pd.read_csv('../input/vinbigdatastack/fasterrcnn.csv')
yolo = pd.read_csv('../input/ensample/submission0.175.csv')
detectron = pd.read_csv('../input/ensample/submission_0.2.csv')
fasterrcnn = pd.read_csv('../input/ensample/submission_2class filter.csv')
image_ids = yolo.image_id.values

# <div id="chap2">2. Helper functions</div>

In [None]:
def getitem(dataframe, img_id):
    
    """
    Parameters
    ----------
    dataframe : pd.DataFrame
    img_id : str
        
    Returns
    -------
    Dictionary of radiographic observations
    """

    pred = list(dataframe.loc[dataframe.image_id == img_id, "PredictionString"])[0].split(' ')
    nb_elm = len(pred)//6
    output = {}
    
    for elm in range(nb_elm):
        output[f'elm_{elm}'] = pred[elm*6 : (elm+1)*6]
        
    return output


def sortDictByProba(dict_):
    
    """
    Parameters
    ----------
    dict_ : dict, Dictionary of radiographic observations
        
    Returns
    -------
    Dictionary of radiographic observations sorted by probabilities 
    """
    
    for key in dict_.keys():
        dict_[key] = list(map(lambda x: float(x), dict_[key]))
    
    # item[1][1] corresponds to the second element of the value (the confidence of the class identified)
    return {k: v for k, v in sorted(dict_.items(), key=lambda item: item[1][1], reverse = True)}


def getHighestProba(*list_of_dicts, n=3):
    
    """
    Parameters
    ----------
    list_of_dicts : list[dict], List of dictionaries containing radiographic observations
    n : int, keep n highest elements of each list_of_dicts at most
    
    Returns
    -------
    Dict merging the n highest-confidence observations of each dict in list_of_dicts
    """
    
    output = {}
    for index, dict_ in enumerate(list_of_dicts):
        dict_length = len(dict_)
        for i in range(dict_length):
            if i < n:
                output[f"elm_{i}_dict_{index}"] =list(dict_.values())[i]
                
    return output


def getUnique(dict_):
    
    """
    Parameters
    ----------
    dict_ : dict, Dictionary of radiographic observations
        
    Returns
    -------
    Array of class_id values that appear once, array of class_id values that appear more than once
    """
    
    dict_length = len(dict_)
    
    classes_non_unique = [list(dict_.values())[index][0] for index in range(dict_length)]
    classes_unique = list(set(classes_non_unique))
    
    uniques, counts = np.unique(classes_non_unique, return_counts=True)
    duplicates = uniques[counts > 1]
    singles = np.setdiff1d(classes_unique, duplicates)
    
    return singles, duplicates


def getKeysByValue(dictOfElements, valueToFind):
    
    """
    Parameters
    ----------
    dictOfElements : dict, Dictionary of radiographic observations
    valueToFind : int, corresponds to class_id
    
    Returns
    -------
    List of keys of dictOfElements that contain valueToFind
    """
    
    output = list()
    listOfItems = dictOfElements.items()
    
    for item  in listOfItems:
        if item[1][0] == valueToFind:
            output.append(item[0])
            
    return  output


def getListKeysByValue(dictOfElements, valuesToFind):
    
    """
    Parameters
    ----------
    dictOfElements : dict, Dictionary of radiographic observations
    valuesToFind : list[int], list of class_id
    
    Returns
    -------
    List of lists of keys of dictOfElements for each value in valuesToFind
    """
    
    output = []
    
    for value in valuesToFind:
        output.append(getKeysByValue(dictOfElements, value))
        
    return output


def averaging(from_dict, single_keys, dupl_keys):
    
    """
    Parameters
    ----------
    from_dict : dict, dictionary to be filtered
    single_keys : list[list[str]], keys of the observations whose class_id appears only once (kept unchanged)
    dupl_keys : list[list[str]], for each duplicated class_id, the keys of all its observations
    
    Returns
    -------
    A filtered dictionary with averaged probs and boxes
    """
    
    output = {}
    
    # Keep observations whose class appears only once unchanged
    if len(np.ravel(single_keys)) != 0:
        for single in np.ravel(single_keys):
            output[single] = from_dict[single]

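    # For each duplicated class, gather all its occurrences and average confidence and box.
    # dupl_keys may be ragged (inner lists of different lengths), so test the list itself:
    # np.ravel on a ragged list raises an error in recent NumPy versions.
    if len(dupl_keys) != 0: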
        for index, list_of_duplicate_class in enumerate(dupl_keys):
            probs = [] 
            boxing1 = []
            boxing2 = []
            boxing3 = []
            boxing4 = []
            
            for elm in list_of_duplicate_class:
                probs.append(from_dict[elm][1])
                boxing1.append(from_dict[elm][2])
                boxing2.append(from_dict[elm][3])
                boxing3.append(from_dict[elm][4])
                boxing4.append(from_dict[elm][5])
            
            output[f"elm_{index}"] = [from_dict[list_of_duplicate_class[0]][0],
                                      np.mean(probs),
                                      np.mean(boxing1),
                                      np.mean(boxing2),
                                      np.mean(boxing3),
                                      np.mean(boxing4)]
            
    return output


def toString(pred_list):
    
    """
    Parameters
    ----------
    pred_list : array-like, flattened radiographic observations (6 values per observation)
    
    Returns
    -------
    A string which fits with the expected output
    """
    
    castedList = []
    for index, elm in enumerate(pred_list):
        if index%6 == 0:
            castedList.append(str(int(elm)))
        else:
            castedList.append(str(elm))
            
    output = " ".join(castedList)
    
    return output
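
To make the data layout handled by these helpers concrete, here is a small standalone round trip on a toy `PredictionString`. It is a sketch that does not reuse the helpers above: the function names are hypothetical, and the six fields are assumed to be `class_id confidence x_min y_min x_max y_max`, which matches how the helpers index each observation.

```python
def parse_prediction_string(prediction_string):
    """Split a PredictionString into observations of 6 floats each:
    [class_id, confidence, x_min, y_min, x_max, y_max] (assumed layout)."""
    tokens = prediction_string.split()
    usable = len(tokens) - len(tokens) % 6  # ignore a trailing incomplete group
    return [[float(tok) for tok in tokens[i:i + 6]] for i in range(0, usable, 6)]


def format_prediction_string(observations):
    """Inverse operation: class_id is written as an int, the rest as floats."""
    parts = []
    for class_id, confidence, *box in observations:
        parts.append(str(int(class_id)))
        parts.append(str(confidence))
        parts.extend(str(value) for value in box)
    return " ".join(parts)


# Toy string with two observations (made-up values)
example = "0 0.5 10 20 110 220 3 0.9 5 5 50 50"
observations = parse_prediction_string(example)

# Sort by confidence (second field), highest first
observations.sort(key=lambda obs: obs[1], reverse=True)
print(format_prediction_string(observations))
```

Working with a list of observations rather than a dict keyed by `elm_i` keeps the sorting step to a single call. The notebook's dict-based helpers achieve the same result.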

--------

**<font size="2"><a href="#summary">Back to summary</a></font>**

# <div id="chap3">3. Run ensembling with appropriate strategy</div>

My strategy here consists in averaging observations that have at least one duplicate (same class) among all models. Some filtering on box areas should be added; this will come in a future release.
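
As an illustration of this strategy on toy data, the sketch below pools observations from several models, groups them by class, and averages confidence and box coordinates within each group. The function name `average_by_class` and the values are made up for the example; it is a simplified stand-in for the helpers above, not a replacement.

```python
from collections import defaultdict


def average_by_class(observations):
    """Average confidence and box coordinates of observations sharing a class_id.

    Each observation is [class_id, confidence, x_min, y_min, x_max, y_max].
    Classes that appear once are returned unchanged (mean of a single value).
    """
    groups = defaultdict(list)
    for obs in observations:
        groups[int(obs[0])].append(obs)

    merged = []
    for class_id, members in groups.items():
        # Transpose the group, drop the class_id column, average each remaining column
        columns = list(zip(*members))[1:]
        merged.append([class_id] + [sum(col) / len(col) for col in columns])
    return merged


# Pooled top predictions of three hypothetical models for one image
pooled = [
    [0, 0.9, 100, 100, 200, 200],  # model A
    [0, 0.7, 110, 90, 210, 190],   # model B: same class, overlapping box
    [3, 0.6, 400, 400, 500, 500],  # model C: class seen only once
]
for row in average_by_class(pooled):
    print(row)
```

Because grouping is done on the class only, two boxes of the same class in different regions of the image would also be averaged into a single box. This is the case the box filtering mentioned above is meant to address.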

In [None]:
def main():
    
    output = pd.DataFrame(columns = ["image_id", "PredictionString"])
    
    for image_id in tqdm(image_ids):
        
        # For each model, get PredictionString of image_id as a dict
        fasterrcnn_pred = getitem(fasterrcnn, image_id)
        detectron_pred = getitem(detectron, image_id)
        yolo_pred = getitem(yolo, image_id)  
        
        # Sort dicts by proba
        sorted_fasterrcnn = sortDictByProba(fasterrcnn_pred)
        sorted_detectron = sortDictByProba(detectron_pred)
        sorted_yolo = sortDictByProba(yolo_pred)

        # Filter dicts into one dict with at most top n probs
        highest_probs = getHighestProba(sorted_fasterrcnn, 
                                        sorted_detectron, 
                                        sorted_yolo,
                                        n = 3)
        
        # Get keys of unique and duplicates values in the filtered dict
        singles, duplicates = getUnique(highest_probs)
        single_keys = getListKeysByValue(highest_probs, singles)
        dupl_keys = getListKeysByValue(highest_probs, duplicates)
        
        # Apply averaging strategy
        stacked_dict = averaging(highest_probs, single_keys, dupl_keys)
        
        # Put string in right format
        prediction_int = np.ravel(list(stacked_dict.values()))
        prediction_string = toString(prediction_int)
        
        # DataFrame.append was removed in pandas 2.0; add the row by label instead
        output.loc[len(output)] = [image_id, prediction_string]
        
    return output

Some other strategies will be tested in a future release:
* OR method 
* AND method
* Consensus method
* Weighted Fusion
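
For the last item, the core idea can be sketched as a confidence-weighted average of boxes that have already been matched to each other. This is a simplified illustration under stated assumptions, not the complete method: the name `fuse_cluster` is hypothetical, the cluster is assumed to contain only overlapping boxes of the same class, and the matching step itself is left out.

```python
def fuse_cluster(cluster):
    """Fuse overlapping boxes into one, weighting coordinates by confidence.

    cluster: list of (confidence, x_min, y_min, x_max, y_max) tuples that are
    assumed to describe the same object.
    Returns [fused_confidence, x_min, y_min, x_max, y_max].
    """
    total = sum(item[0] for item in cluster)
    if total == 0:
        raise ValueError("cluster has zero total confidence")

    # Each coordinate is the confidence-weighted mean over the cluster
    fused_box = [
        sum(item[0] * item[k] for item in cluster) / total
        for k in range(1, 5)
    ]
    # The fused confidence is the plain mean of the member confidences
    fused_confidence = total / len(cluster)
    return [fused_confidence] + fused_box


# Two made-up boxes for the same finding; the more confident one dominates
cluster = [
    (0.75, 100, 100, 200, 200),
    (0.25, 120, 120, 220, 220),
]
print(fuse_cluster(cluster))
```

Compared with a plain mean, the fused box is pulled towards the more confident prediction, so no box is discarded outright as it would be with NMS.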

In the meantime, if you found this notebook useful and have suggestions on how it could be better implemented, do not hesitate to contribute. I'd really appreciate it!

--------

**<font size="2"><a href="#summary">Back to summary</a></font>**

# <div id="chap4">4. Save results</div>

In [None]:
final_sub = main()
final_sub.to_csv("submission.csv", index=False)

# References

* <a href = "https://medium.com/inspiredbrilliance/object-detection-through-ensemble-of-models-fed015bc1ee0">Article on object detection through ensemble of models</a>
* detectron2 : https://www.kaggle.com/c/vinbigdata-chest-xray-abnormalities-detection/code?competitionId=24800&sortBy=scoreDescending
* fasterrcnn : https://www.kaggle.com/awsaf49/vinbigdata-cxr-ad-yolov5-14-class-infer
* yolov5 : https://www.kaggle.com/basu369victor/chest-x-ray-abnormalities-detection-submission

<hr>
<div align='justify'><font color="#353B47" size="4">Thank you for taking the time to read this notebook. I hope that I was able to answer your questions or satisfy your curiosity, and that it was understandable. <u>Any constructive comments are welcome</u>. They help me progress and motivate me to share better quality content. I am above all a passionate person who tries to advance my knowledge but also that of others. If you liked it, feel free to <u>upvote and share my work.</u> </font></div>
<br>
<div align='center'><font color="#353B47" size="3">Thank you and may passion guide you.</font></div>