<center><img src="logo.png" /></center>

# <center>Avey: A High-Level Overview</center>

## Table of Contents

* [Introduction](#intro)
* [Cases](#cases)
* [Metrics and Calculations](#metrics)
* [Results](#results)

<p>


## Introduction<a class="anchor" id="intro"></a>
<p>This notebook is a supplement to the <a href="https://www.medrxiv.org/content/10.1101/2022.03.08.22272076v1.full">paper</a> we are submitting. Here, we share all the cases that we analysed (after manual cleaning and matching) along with their results. We also share results from multiple experiments, only a few of which were discussed in the paper.</p>


#### Load data

In [1]:
# load data

import json
import pandas as pd
from collections import defaultdict
import math
import os

def loadData(fileName):
    '''Loading data from result files'''
    with open(f'{fileName}.json', 'r', encoding='utf-8') as file:
        data = json.load(file)
        return data

def normalize(cases):
    '''Pad the ddx lists of each case with None so that they all have the same length'''
    for case in cases.values():
        maxLen = max(len(result) for result in case.values())
        for result in case.values():
            result += [None]*(maxLen-len(result))

        assert len(set(len(result) for result in case.values())) == 1
    
    return cases

def getDataframe(case):
    '''Convert each test case into a dataframe'''
    caseLen = len(next(iter(case.values())))
    return pd.DataFrame(
        case,
        columns=['gold',*sorted([key for key in case.keys() if key != 'gold'])],
        index= list(range(1,1+caseLen))
        )


In [4]:
# We need to make all the differentials of the same length to ease comparison
# We pad the lists with None
data = loadData('allResults')
for caseNum in data:
    for app in data[caseNum]:
        # removing empty strings
        data[caseNum][app] = [r for r in data[caseNum][app] if r]

# make sure each case has all the apps
for caseNum, tests in data.items():
    assert len(tests.keys()) == 15,\
        f"app missing in case {caseNum}, {tests.keys()}"

normalizedData = normalize(data)
cases = {int(id): getDataframe(case) for id, case in normalizedData.items()}
caseClassification = loadData('case-classification')
display(f'We have {len(cases)} cases in the experiment.')


'We have 74 cases in the experiment.'

<p>

### Let us have a look at all the cases. <a class="anchor" id="cases"></a>
Our doctors have labelled each case as common or less common. For each case, we also list the apps that failed on it.

In [5]:
from IPython.display import display
for caseNum, case in cases.items():
    isCommonString = 'common' if caseNum in caseClassification['common'] else 'less common'
    sessionFailedToStart, noDiseaseFound = ([], [])
    for app, failedCases in caseClassification['apps'].items():
        if caseNum in failedCases['session failed']:
            sessionFailedToStart.append(app)
        elif caseNum in failedCases['no disease found']:
            noDiseaseFound.append(app)

    print(f'Case number {caseNum} ({isCommonString})')
    if sessionFailedToStart:
        print(
            f'Session failed to start for: {", ".join(sessionFailedToStart)}.')
    if noDiseaseFound:
        print(
            f'No diseases were found for: {", ".join(noDiseaseFound)}.')
    # display(case)
    print('\n'*2)


Case number 284 (less common)
Session failed to start for: K health.
No diseases were found for: Babylon, Buoy.



Case number 366 (less common)
No diseases were found for: Babylon.



Case number 399 (common)



Case number 359 (common)
No diseases were found for: Babylon, Buoy.



Case number 394 (less common)
No diseases were found for: Babylon, Buoy.



Case number 290 (less common)
No diseases were found for: Babylon.



Case number 48 (common)
No diseases were found for: Babylon.



Case number 423 (common)
No diseases were found for: Babylon.



Case number 170 (less common)
No diseases were found for: Babylon.



Case number 438 (less common)
No diseases were found for: Babylon.



Case number 491 (less common)
No diseases were found for: Babylon, Buoy.



Case number 157 (common)
Session failed to start for: K health.
No diseases were found for: Babylon.



Case number 68 (common)
No diseases were found for: Avey, Babylon.



Case number 159 (common)
No diseases were found for

<p>

## Let us define the metrics now. <a class="anchor" id="metrics"></a>

### Terms used
- TP: True positive (correct disease retrieved)
- TN: True negative (wrong disease **not** retrieved)
- FP: False positive (wrong disease retrieved)
- FN: False negative (correct disease **not** retrieved)
- Gold standard: the correct list of diseases, as determined by the collective intelligence of doctors

### Precision
Precision helps us understand how exact our results are. It gives us an intuition about how many wrong diseases (false positives) are being retrieved. It is the ratio of the *number of correct diseases retrieved* to the *length of the complete list retrieved*.
$$precision = \frac{TP}{TP + FP} = \frac{TP}{\text{length of differential list}}$$

### Recall
Recall is a measure of how many of the correct diseases are being retrieved. It is the ratio of the *number of correct diseases retrieved* to the *length of the gold standard list*.
$$recall = \frac{TP}{TP + FN} = \frac{TP}{\text{length of the gold standard}}$$
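As a quick illustration of the precision and recall definitions, the following sketch computes both for a single made-up case using plain Python lists. The disease names and the helper name `precision_recall` are hypothetical and are not taken from the study data.

```python
def precision_recall(gold, ddx):
    """Return (precision, recall) for one differential list.

    gold: gold-standard diseases; ddx: diseases returned by an app or a doctor.
    """
    gold_set = set(gold)
    # true positives: returned diseases that also appear in the gold standard
    tp = sum(1 for disease in ddx if disease in gold_set)
    precision = tp / len(ddx) if ddx else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall


# hypothetical case: 2 of the 4 returned diseases are in the gold standard
gold = ["migraine", "tension headache", "sinusitis"]
ddx = ["migraine", "cluster headache", "sinusitis", "meningitis"]
precision, recall = precision_recall(gold, ddx)
print(precision, recall)
```

Here precision is 2/4 and recall is 2/3, since two correct diseases were retrieved out of four returned and three expected.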

### F1 Score
The F1 score is the harmonic mean of *precision* and *recall*. It combines the two into a single score for easier comparison.

More generally, if $\beta$ defines how much more important $recall$ is than $precision$, then
$$fscore_{\beta} = (1 + \beta^2)\frac{precision \cdot recall}{(\beta^2 \cdot precision) + recall}$$
Substituting $\beta = 1$,
$$fscore_{1} = \frac{2 \cdot precision \cdot recall}{ precision + recall}$$
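To see how $\beta$ shifts the balance, here is a minimal sketch of the formula with hypothetical precision and recall values. The function name `f_score` is an assumption made for this example; it returns NaN when both inputs are zero, mirroring the notebook's `getFScore`.

```python
import math


def f_score(precision, recall, beta=1.0):
    """F-beta score; NaN when precision and recall are both zero."""
    if precision + recall == 0:
        return math.nan
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# hypothetical values: low precision, perfect recall
f1 = f_score(0.5, 1.0)          # 2 * 0.5 * 1.0 / (0.5 + 1.0)
f2 = f_score(0.5, 1.0, beta=2)  # 5 * 0.5 * 1.0 / (4 * 0.5 + 1.0)
print(round(f1, 3), round(f2, 3))
```

Because recall is the stronger of the two values here, the F2 score (which favours recall) comes out higher than the F1 score.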

### NDCG
NDCG, or Normalized Discounted Cumulative Gain, is a measure of how accurate the ranking is. In our calculations, we use
$$DCG = \sum_{i=1}^n\frac{2^{relevance_i}-1}{\log_2(i+1)}$$
where $n$ is the number of differentials in the returned list and  
$relevance_i = |gold\ standard| - rank_{gold\ standard}(ddx[i])$ if $ddx[i]$ is present in the gold standard (with $rank$ counted from 0, so the top gold standard disease has relevance $|gold\ standard|$), 0 otherwise.

$$NDCG = \frac{DCG_{ddx}}{DCG_{gold\ standard}}$$
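The NDCG computation can be sketched on plain lists as follows. This is a simplified illustration, assuming a non-empty gold standard without duplicate entries; the names `dcg_term` and `ndcg` and the placeholder diseases "A", "B", "X" are hypothetical.

```python
import math


def dcg_term(relevance, position):
    """Discounted gain of one entry; position is 1-based."""
    return (2 ** relevance - 1) / math.log2(position + 1)


def ndcg(gold, ddx):
    # relevance of the k-th gold-standard entry (k counted from 0) is len(gold) - k
    relevance = {disease: len(gold) - k for k, disease in enumerate(gold)}
    # ideal DCG: the gold standard ranked against itself
    ideal = sum(dcg_term(relevance[d], i + 1) for i, d in enumerate(gold))
    # entries that are not in the gold standard get relevance 0 and contribute nothing
    actual = sum(dcg_term(relevance.get(d, 0), i + 1) for i, d in enumerate(ddx))
    return actual / ideal


gold = ["A", "B"]
swapped = ndcg(gold, ["B", "A"])  # right diseases, wrong order
print(round(swapped, 3))
```

A differential identical to the gold standard scores 1, a differential with no correct disease scores 0, and the swapped list above falls in between because the most relevant disease is discounted by its lower position.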

### M Score
M Score indicates whether the top disease of the gold standard appears among the first $i$ diseases of the returned differential.
$$M_i = \text{gold standard[0]} \in \text{ddx[:i]}$$

### Position
The 1-based position of gold standard[0] in the returned differential. It is undefined (NaN) if gold standard[0] is not in the returned list.

### Length
$$length = \frac{|ddx|}{|gold\ standard|}$$
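The M Score, position and length metrics can be illustrated together on one made-up case. The helper names and the disease lists below are hypothetical; the NaN conventions follow the notebook's own functions.

```python
import math


def m_score(gold, ddx, m=1):
    """True if the top gold-standard disease is among the first m returned diseases."""
    return gold[0] in ddx[:m]


def position(gold, ddx):
    """1-based position of the top gold-standard disease, NaN if it was not returned."""
    return ddx.index(gold[0]) + 1 if gold[0] in ddx else math.nan


def relative_length(gold, ddx):
    """Length of the differential as a multiple of the gold-standard length."""
    return len(ddx) / len(gold) if ddx else math.nan


# hypothetical case: the top gold-standard disease is returned in third place
gold = ["appendicitis", "gastroenteritis"]
ddx = ["gastroenteritis", "ovarian cyst", "appendicitis"]
print(m_score(gold, ddx, 1), m_score(gold, ddx, 3), position(gold, ddx), relative_length(gold, ddx))
```

In this example M1 is false, M3 is true, the position is 3, and the differential is 1.5 times as long as the gold standard.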
<br>
<br>
<br>
<br>


In [6]:
import math

def getPrecision(goldStandard: pd.Series, candidate: pd.Series) -> float:
    tp = sum(int(disease in goldStandard.values and disease is not None)
             for disease in candidate)
    return tp if tp == 0 else tp/candidate.count()


def getRecall(goldStandard: pd.Series, candidate: pd.Series) -> float:
    tp = sum(int(disease in goldStandard.values and disease is not None)
             for disease in candidate)
    return tp/goldStandard.count()


def getFScore(precision: float, recall: float, beta: float = 1) -> float:
    return math.nan if precision+recall == 0 else \
        (1+beta**2)*precision*recall/(precision*(beta**2)+recall)


def getNDCG(goldStandard: pd.Series, candidate: pd.Series, scores) -> float:
    def discount(score: float, index: int) -> float:
        '''The index is 1 based'''
        return (math.pow(2, score)-1)/math.log2(index+1)

    maxDCG = sum(discount(scores[i], i+1) for i in range(len(scores)))

    candidateRelevance = []
    goldStandard = list(goldStandard)
    for index, disease in enumerate(candidate):
        if disease is not None and disease in goldStandard:
            candidateRelevance.append(
                discount(scores[goldStandard.index(disease)], index+1))
        else:
            candidateRelevance.append(0)

    return sum(candidateRelevance)/maxDCG


def getMScore(goldStandard: pd.Series, candidate: pd.Series, m=1) -> bool:
    return goldStandard.values[0] in candidate.values[:m]


def getPosition(goldStandard: pd.Series, candidate: pd.Series) -> float:
    return math.nan if goldStandard.values[0] not in candidate.values else\
        1 + list(candidate.values).index(goldStandard.values[0])


def getLength(goldStandard: pd.Series, candidate: pd.Series) -> float:
    return math.nan if candidate.count() == 0 else \
        candidate.count()/goldStandard.count()


def getScoresCase(case: pd.DataFrame) -> pd.DataFrame:
    scores = [
        [getPrecision(case.iloc[:, 0], case.iloc[:, i])
         for i in range(1, len(case.columns))],
        [getRecall(case.iloc[:, 0], case.iloc[:, i])
         for i in range(1, len(case.columns))]
    ]

    scores.append([getFScore(scores[0][i], scores[1][i],beta=1)
                  for i in range(len(case.columns)-1)])
    scores.append([getFScore(scores[0][i], scores[1][i],beta=2)
                  for i in range(len(case.columns)-1)])

    # relevance for a list of 4 is 4, 3, 2, 1
    # relevance for a list of 2 is 2, 1
    scores.append([getNDCG(case.iloc[:, 0], case.iloc[:, i],
                           list(range(case.iloc[:, 0].count(), 0, -1)))
                   for i in range(1, len(case.columns))])

    for m in range(1, 6, 2):
        scores.append([getMScore(case.iloc[:, 0], case.iloc[:, i], m)
                      for i in range(1, len(case.columns))])

    scores.append([getPosition(case.iloc[:, 0], case.iloc[:, i])
                   for i in range(1, len(case.columns))])

    scores.append([getLength(case.iloc[:, 0], case.iloc[:, i])
                   for i in range(1, len(case.columns))])

    return pd.DataFrame(
        scores,
        columns=case.columns[1:],
        index=[
            "precision",
            "recall",
            "f1-score",
            "f2-score",
            "NDCG",
            "M1",
            "M3",
            "M5",
            "position",
            "length (x of gs)",
        ],
    )

scores = {id:getScoresCase(case) for id, case in cases.items()}

In [7]:
# no_ddx = {
# 537,
# 526,
# 525,
# 521,
# 534,
# 220,
# 444,
# 211,
# 27,
# 499,
# 102,
# 456,
# 414,
# 437,
# 311,
# 373,
# 400,
# 395,
# 388,
# 305,
# 325,
# 267,
# 218,
# 206,
# 142,
# 135,
# 270,
# 38,
# 23,
# 21,
# 165,
# 300
# }

# with open('assignedCases.json','r') as f:
#     assignedCases = json.load(f)

# # name = "alsabbaghb"
# # scores = {k: v for k,v in scores.items() if k in assignedCases[name]}


# # scores = {case_num: v for case_num, v in scores.items() if case_num <= 500}
# # scores = {k: v for k,v in scores.items() if k not in no_ddx}

In [8]:
print(len(scores))

74


Let us define the experiments now. We will pick which cases to compute statistics for.

In [9]:
from collections import defaultdict
experiments = {}
combineLabels = defaultdict(list)

def addExperiment(caseType,casesToConsider):
    #add experiment to ignore no case
    experiments[f'ignore_{caseType}_none'] = set(casesToConsider)

    caseClassificationApps = {}
    for app, classification in caseClassification['apps'].items():
        caseClassificationApps[app] = {
            'no disease found': classification['no disease found'] + classification['session failed'],
            'session failed':  classification['session failed']
            }

    for app, classifications in caseClassificationApps.items():
        for classification, caseNums in classifications.items():
            #record labels to combine later
            combineLabels[f'ignore_{caseType}_{classification}'].append(f'ignore_{caseType}_{app}_{classification}')
            #experiment to ignore the cases for the particular app
            experiments[f'ignore_{caseType}_{app}_{classification}'] = set(casesToConsider) - set(caseNums)
            #experiment to ignore the cases for all the apps (Babylon's "no disease found" cases are kept)
            if "babylon" not in app.lower() or "no disease" not in classification:
                if f'ignore_{caseType}_any_{classification}' in experiments:
                    experiments[f'ignore_{caseType}_any_{classification}'] =\
                    experiments[f'ignore_{caseType}_any_{classification}'] - set(caseNums)
                else:
                    experiments[f'ignore_{caseType}_any_{classification}'] = experiments[f'ignore_{caseType}_{app}_{classification}']

addExperiment('common',caseClassification['common'])
addExperiment('less-common',caseClassification['less common'])
addExperiment('all-cases',set(caseClassification['less common']) | set(caseClassification['common']))
    
display('The experiments we are going to conduct are:')
# list(enumerate(experiments.keys()))


'The experiments we are going to conduct are:'

<br>
<br>

### Results <a class="anchor" id="results"></a>

In [10]:
# for id, score in scores.items():
#     if not score.loc["M1","Avey v2"]:
#         print(id)

#         133
# 261
# 394
# 271
# 21
# 149
# 24
# 411
# 414
# 420
# 43
# 299
# 300
# 309
# 65
# 71
# 75
# 207
# 336
# 84
# 229
# 486
# 359
# 104
# 233
# 362
# 489
# 499
# 118
# 246

In [16]:
count = 0
for id, score in scores.items():
    if not score.loc["M1","Avey v2"] and score.loc["M1","Avey old"]:
        print(id)
        count += 1 

count

359
438
159
393
451
300
311
75
490
414
344
378
420
271
489
461
494
264
411
233
182
326
206
216
251
499
308


27

In [12]:
def getStats(scores, row: int, col: int):
    values = []
    for score in scores.values():
        if not math.isnan(score.iloc[row, col]):
            values.append(score.iloc[row, col])
    if not values:
        return 0, 0, 0
    average = sum(values)/len(values)
    variance = sum((value-average)**2 for value in values)/len(values)
    stdDev = variance**0.5
    return round(average, 3), round(variance, 3),round(stdDev, 3)


results = {}
for label, casesToConsider in experiments.items():
    selectedScores = {id: score for id,
                      score in scores.items()}
    # selectedScores = {id: score for id,
    #                   score in scores.items() if int(id) in casesToConsider}
    print(len(selectedScores))
    columns = next(iter(scores.values())).columns
    index = [
        f"stats_for_{x}"
        for x in [
            "precision",
            "recall",
            "f1-score",
            "f2-score",
            "NDCG",
            "M1",
            "M3",
            "M5",
            "position",
            "length (x of gs)",
        ]
    ]
    averageScores = pd.DataFrame(
        [
            [getStats(selectedScores, row, col) for col in range(len(columns))]
            for row in range(len(index))
        ],
        columns=columns,
        index=index,
    )

    # recompute the F-scores from the averaged precision and recall
    for col in next(iter(scores.values())).columns:
        p = averageScores.loc["stats_for_precision", col][0]
        r = averageScores.loc["stats_for_recall", col][0]
        averageScores.loc["stats_for_f1-score",
                          col] = (round(getFScore(p, r, 1), 3),'-')
        averageScores.loc["stats_for_f2-score",
                          col] = (round(getFScore(p, r, 2), 3),'-')

    def calcStats(doctorResults):
        def extractItem(data, index=0):
            return data.apply(lambda x: x[index])

        total = pd.Series(extractItem(doctorResults[0]))
        for data in doctorResults[1:]:
            # display(extractItem(data))
            total += extractItem(data)
        average = (total/len(doctorResults)).round(3)
        return average.apply(lambda x: (x,'-'))

    doctorResultsAverage = calcStats(
        [
            averageScores.loc[:, "MA"] ,
            averageScores.loc[:, "NJ"] ,
            averageScores.loc[:, "TH"]
        ]
    )

    averageScores.insert(
        loc=8, column="average_doctor",
        value=doctorResultsAverage,
    )

    results[label] = averageScores

    break

# print(label)
# list(results.values())[0].to_excel(f'named_results/{name}.xlsx')
list(results.values())[0]

74


Unnamed: 0,Ada,Avey,Avey old,Avey v2,Buoy,ChatGPT - 4,Healthily,K Health,average_doctor,MA,Mediktor,NJ,Symptomate,TH,WebMD
stats_for_precision,"(0.432, 0.074, 0.273)","(0.276, 0.049, 0.221)","(0.345, 0.066, 0.257)","(0.353, 0.051, 0.225)","(0.365, 0.102, 0.319)","(0.403, 0.056, 0.237)","(0.48, 0.23, 0.48)","(0.294, 0.09, 0.301)","(0.698, -)","(0.619, 0.138, 0.372)","(0.319, 0.078, 0.28)","(0.739, 0.136, 0.369)","(0.498, 0.131, 0.362)","(0.736, 0.076, 0.275)","(0.219, 0.03, 0.172)"
stats_for_recall,"(0.607, 0.115, 0.339)","(0.466, 0.133, 0.364)","(0.595, 0.096, 0.31)","(0.61, 0.121, 0.348)","(0.374, 0.115, 0.34)","(0.585, 0.101, 0.317)","(0.215, 0.051, 0.225)","(0.323, 0.097, 0.311)","(0.48, -)","(0.461, 0.091, 0.301)","(0.455, 0.152, 0.389)","(0.39, 0.051, 0.225)","(0.47, 0.133, 0.365)","(0.589, 0.069, 0.263)","(0.559, 0.17, 0.413)"
stats_for_f1-score,"(0.505, -)","(0.347, -)","(0.437, -)","(0.447, -)","(0.369, -)","(0.477, -)","(0.297, -)","(0.308, -)","(0.564, -)","(0.528, -)","(0.375, -)","(0.511, -)","(0.484, -)","(0.654, -)","(0.315, -)"
stats_for_f2-score,"(0.562, -)","(0.41, -)","(0.52, -)","(0.532, -)","(0.372, -)","(0.537, -)","(0.242, -)","(0.317, -)","(0.51, -)","(0.486, -)","(0.419, -)","(0.431, -)","(0.475, -)","(0.614, -)","(0.427, -)"
stats_for_NDCG,"(0.584, 0.137, 0.371)","(0.344, 0.084, 0.29)","(0.65, 0.108, 0.329)","(0.542, 0.121, 0.348)","(0.384, 0.128, 0.357)","(0.519, 0.113, 0.336)","(0.326, 0.127, 0.357)","(0.288, 0.095, 0.309)","(0.605, -)","(0.537, 0.114, 0.338)","(0.396, 0.137, 0.371)","(0.55, 0.104, 0.322)","(0.52, 0.14, 0.375)","(0.729, 0.065, 0.256)","(0.39, 0.121, 0.348)"
stats_for_M1,"(0.378, 0.235, 0.485)","(0.027, 0.026, 0.162)","(0.581, 0.243, 0.493)","(0.351, 0.228, 0.477)","(0.257, 0.191, 0.437)","(0.297, 0.209, 0.457)","(0.365, 0.232, 0.481)","(0.135, 0.117, 0.342)","(0.563, -)","(0.446, 0.247, 0.497)","(0.203, 0.162, 0.402)","(0.581, 0.243, 0.493)","(0.446, 0.247, 0.497)","(0.662, 0.224, 0.473)","(0.122, 0.107, 0.327)"
stats_for_M3,"(0.662, 0.224, 0.473)","(0.432, 0.245, 0.495)","(0.73, 0.197, 0.444)","(0.635, 0.232, 0.481)","(0.473, 0.249, 0.499)","(0.581, 0.243, 0.493)","(0.392, 0.238, 0.488)","(0.311, 0.214, 0.463)","(0.707, -)","(0.635, 0.232, 0.481)","(0.392, 0.238, 0.488)","(0.622, 0.235, 0.485)","(0.554, 0.247, 0.497)","(0.865, 0.117, 0.342)","(0.365, 0.232, 0.481)"
stats_for_M5,"(0.703, 0.209, 0.457)","(0.527, 0.249, 0.499)","(0.757, 0.184, 0.429)","(0.689, 0.214, 0.463)","(0.473, 0.249, 0.499)","(0.703, 0.209, 0.457)","(0.392, 0.238, 0.488)","(0.311, 0.214, 0.463)","(0.712, -)","(0.649, 0.228, 0.477)","(0.432, 0.245, 0.495)","(0.622, 0.235, 0.485)","(0.608, 0.238, 0.488)","(0.865, 0.117, 0.342)","(0.419, 0.243, 0.493)"
stats_for_position,"(1.712, 0.898, 0.947)","(2.615, 1.108, 1.053)","(1.439, 0.983, 0.991)","(1.765, 0.886, 0.941)","(1.629, 0.576, 0.759)","(2.096, 1.51, 1.229)","(1.103, 0.162, 0.402)","(1.826, 0.665, 0.816)","(1.261, -)","(1.438, 0.538, 0.733)","(2.314, 3.416, 1.848)","(1.065, 0.061, 0.247)","(1.696, 1.951, 1.397)","(1.281, 0.296, 0.544)","(3.205, 4.727, 2.174)"
stats_for_length (x of gs),"(1.56, 0.31, 0.557)","(1.836, 0.517, 0.719)","(2.098, 0.78, 0.883)","(1.833, 0.332, 0.577)","(1.068, 0.185, 0.43)","(1.609, 0.432, 0.657)","(0.545, 0.102, 0.319)","(1.171, 0.383, 0.619)","(0.755, -)","(0.814, 0.189, 0.435)","(1.599, 0.655, 0.81)","(0.559, 0.051, 0.225)","(1.205, 0.85, 0.922)","(0.892, 0.184, 0.428)","(2.983, 3.35, 1.83)"


Let us print all the results. The experiments are named `ignore_{case type}_{app name}_{failure type}`, where:
- **case type (common | less-common | all-cases):** whether only common cases, only less common cases, or all cases were considered.
- **app name / any:** If an app name is present, we ignore only the failed cases of that app. If the label is *any*, we ignore the failed cases of all apps in that experiment.
- **failure type:** An app can fail in two ways: either the session does not complete, or the app fails to retrieve any diagnosis. If this is *none*, we do not ignore any failures and consider all cases of the selected case type.
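The naming scheme and the case filtering behind it can be sketched as follows. The case numbers and failure lists are invented for illustration, and `experiment_label` is a hypothetical helper; only the label pattern and the set difference follow the notebook's `addExperiment`.

```python
def experiment_label(case_type, app=None, failure=None):
    """Build a label such as 'ignore_common_Buoy_no disease found'."""
    if app is None:
        return f"ignore_{case_type}_none"
    return f"ignore_{case_type}_{app}_{failure}"


# hypothetical data: four common cases, one app failing on two of them
common_cases = {1, 2, 3, 4}
failed = {"Buoy": {"no disease found": [2], "session failed": [4]}}

experiments = {experiment_label("common"): set(common_cases)}
for app, failures in failed.items():
    # as in the notebook, a failed session also counts as "no disease found"
    excluded = {
        "no disease found": failures["no disease found"] + failures["session failed"],
        "session failed": failures["session failed"],
    }
    for failure, case_nums in excluded.items():
        label = experiment_label("common", app, failure)
        experiments[label] = common_cases - set(case_nums)

for label, case_nums in experiments.items():
    print(label, sorted(case_nums))
```

Each experiment is therefore just the set of case numbers that remain after the relevant failures are removed, and the statistics are computed over that set.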

In [13]:
def displayResults(results,printNumCases=True):
    for label, result in results.items():
        if printNumCases:
            print(f'Results for experiment {label}, which has {len(set(experiments[label]) & set(scores.keys()))} cases, is')
        else:
            print(f'Results for experiment {label} is')
        display(result)
        resultFiltered = result.applymap(lambda x: x[0])
        resultFiltered.to_csv(f'stats/{label}.csv', sep=';')

displayResults({key:val for key, val in results.items() if 'any' not in key})

Results for experiment ignore_common_none, which has 32 cases, is


Unnamed: 0,Ada,Avey,Avey old,Avey v2,Buoy,ChatGPT - 4,Healthily,K Health,average_doctor,MA,Mediktor,NJ,Symptomate,TH,WebMD
stats_for_precision,"(0.432, 0.074, 0.273)","(0.276, 0.049, 0.221)","(0.345, 0.066, 0.257)","(0.353, 0.051, 0.225)","(0.365, 0.102, 0.319)","(0.403, 0.056, 0.237)","(0.48, 0.23, 0.48)","(0.294, 0.09, 0.301)","(0.698, -)","(0.619, 0.138, 0.372)","(0.319, 0.078, 0.28)","(0.739, 0.136, 0.369)","(0.498, 0.131, 0.362)","(0.736, 0.076, 0.275)","(0.219, 0.03, 0.172)"
stats_for_recall,"(0.607, 0.115, 0.339)","(0.466, 0.133, 0.364)","(0.595, 0.096, 0.31)","(0.61, 0.121, 0.348)","(0.374, 0.115, 0.34)","(0.585, 0.101, 0.317)","(0.215, 0.051, 0.225)","(0.323, 0.097, 0.311)","(0.48, -)","(0.461, 0.091, 0.301)","(0.455, 0.152, 0.389)","(0.39, 0.051, 0.225)","(0.47, 0.133, 0.365)","(0.589, 0.069, 0.263)","(0.559, 0.17, 0.413)"
stats_for_f1-score,"(0.505, -)","(0.347, -)","(0.437, -)","(0.447, -)","(0.369, -)","(0.477, -)","(0.297, -)","(0.308, -)","(0.564, -)","(0.528, -)","(0.375, -)","(0.511, -)","(0.484, -)","(0.654, -)","(0.315, -)"
stats_for_f2-score,"(0.562, -)","(0.41, -)","(0.52, -)","(0.532, -)","(0.372, -)","(0.537, -)","(0.242, -)","(0.317, -)","(0.51, -)","(0.486, -)","(0.419, -)","(0.431, -)","(0.475, -)","(0.614, -)","(0.427, -)"
stats_for_NDCG,"(0.584, 0.137, 0.371)","(0.344, 0.084, 0.29)","(0.65, 0.108, 0.329)","(0.542, 0.121, 0.348)","(0.384, 0.128, 0.357)","(0.519, 0.113, 0.336)","(0.326, 0.127, 0.357)","(0.288, 0.095, 0.309)","(0.605, -)","(0.537, 0.114, 0.338)","(0.396, 0.137, 0.371)","(0.55, 0.104, 0.322)","(0.52, 0.14, 0.375)","(0.729, 0.065, 0.256)","(0.39, 0.121, 0.348)"
stats_for_M1,"(0.378, 0.235, 0.485)","(0.027, 0.026, 0.162)","(0.581, 0.243, 0.493)","(0.351, 0.228, 0.477)","(0.257, 0.191, 0.437)","(0.297, 0.209, 0.457)","(0.365, 0.232, 0.481)","(0.135, 0.117, 0.342)","(0.563, -)","(0.446, 0.247, 0.497)","(0.203, 0.162, 0.402)","(0.581, 0.243, 0.493)","(0.446, 0.247, 0.497)","(0.662, 0.224, 0.473)","(0.122, 0.107, 0.327)"
stats_for_M3,"(0.662, 0.224, 0.473)","(0.432, 0.245, 0.495)","(0.73, 0.197, 0.444)","(0.635, 0.232, 0.481)","(0.473, 0.249, 0.499)","(0.581, 0.243, 0.493)","(0.392, 0.238, 0.488)","(0.311, 0.214, 0.463)","(0.707, -)","(0.635, 0.232, 0.481)","(0.392, 0.238, 0.488)","(0.622, 0.235, 0.485)","(0.554, 0.247, 0.497)","(0.865, 0.117, 0.342)","(0.365, 0.232, 0.481)"
stats_for_M5,"(0.703, 0.209, 0.457)","(0.527, 0.249, 0.499)","(0.757, 0.184, 0.429)","(0.689, 0.214, 0.463)","(0.473, 0.249, 0.499)","(0.703, 0.209, 0.457)","(0.392, 0.238, 0.488)","(0.311, 0.214, 0.463)","(0.712, -)","(0.649, 0.228, 0.477)","(0.432, 0.245, 0.495)","(0.622, 0.235, 0.485)","(0.608, 0.238, 0.488)","(0.865, 0.117, 0.342)","(0.419, 0.243, 0.493)"
stats_for_position,"(1.712, 0.898, 0.947)","(2.615, 1.108, 1.053)","(1.439, 0.983, 0.991)","(1.765, 0.886, 0.941)","(1.629, 0.576, 0.759)","(2.096, 1.51, 1.229)","(1.103, 0.162, 0.402)","(1.826, 0.665, 0.816)","(1.261, -)","(1.438, 0.538, 0.733)","(2.314, 3.416, 1.848)","(1.065, 0.061, 0.247)","(1.696, 1.951, 1.397)","(1.281, 0.296, 0.544)","(3.205, 4.727, 2.174)"
stats_for_length (x of gs),"(1.56, 0.31, 0.557)","(1.836, 0.517, 0.719)","(2.098, 0.78, 0.883)","(1.833, 0.332, 0.577)","(1.068, 0.185, 0.43)","(1.609, 0.432, 0.657)","(0.545, 0.102, 0.319)","(1.171, 0.383, 0.619)","(0.755, -)","(0.814, 0.189, 0.435)","(1.599, 0.655, 0.81)","(0.559, 0.051, 0.225)","(1.205, 0.85, 0.922)","(0.892, 0.184, 0.428)","(2.983, 3.35, 1.83)"




Please note that Babylon appears in the results below only for reference: the cases for which Babylon retrieved no disease are not excluded.

In [14]:
displayResults({key:val for key, val in results.items() if 'any' in key})

Now let us combine the individual results of the apps. For each app, we take its own column from the experiment that ignores that app's failures, and we combine these columns into one table. Our goal is to get the best results for each app and compare them.
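The idea of the combination step can be shown on a toy example with nested dictionaries instead of dataframes. All numbers below are invented, and the layout `results[label][app][metric]` is an assumption made for this sketch only.

```python
# hypothetical per-experiment results: results[label][app][metric]
results = {
    "ignore_common_Ada_session failed": {
        "Ada": {"precision": 0.5},
        "Buoy": {"precision": 0.3},
    },
    "ignore_common_Buoy_session failed": {
        "Ada": {"precision": 0.4},
        "Buoy": {"precision": 0.6},
    },
}

combined = {}
for label, per_app in results.items():
    # labels look like ignore_{case type}_{app}_{failure type}
    app = label.split("_")[2]
    # keep only the app's own scores from the experiment that ignores its failures
    combined[app] = per_app[app]

print(combined)
```

Splitting the label on underscores only works as long as neither the case type nor the app name contains an underscore, which holds for the labels used in this notebook.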

In [15]:
combinedResults = {}
for label, keys in combineLabels.items():
    collectedResults = {}
    for key in keys:
        result = results[key]
        app = key.split('_')[2]
        collectedResults[app] = result.loc[:,app]
    combinedResults[label] = pd.DataFrame(
        collectedResults,
        index=[
            f"average_{x}" for x in [
                "precision", "recall", "f1-score", "NDCG", "M1", "M3", "M5", "position", "length (x of gs)"
            ]
        ]
        )

KeyError: 'ignore_common_Avey_no disease found'

In [None]:
displayResults(combinedResults,printNumCases=False)

Results for experiment ignore_common_no disease found is


Unnamed: 0,Avey,Ada,Babylon,Buoy,K health,WebMD
average_precision,0.488,0.486,0.498,0.433,0.448,0.267
average_recall,0.756,0.634,0.299,0.422,0.474,0.542
average_f1-score,0.593,0.55,0.374,0.427,0.461,0.358
average_NDCG,0.809,0.719,0.351,0.495,0.544,0.511
average_M1,0.732,0.644,0.333,0.424,0.476,0.332
average_M3,0.932,0.833,0.375,0.62,0.613,0.548
average_M5,0.945,0.874,0.417,0.62,0.639,0.664
average_position,1.327,1.433,1.4,1.386,1.41,2.173
average_length (x of gs),1.904,1.497,0.569,1.021,1.203,2.23


Results for experiment ignore_common_session failed is


Unnamed: 0,Avey,Ada,Babylon,Buoy,K health,WebMD
average_precision,0.484,0.486,0.061,0.367,0.444,0.263
average_recall,0.749,0.634,0.036,0.357,0.469,0.534
average_f1-score,0.588,0.55,0.045,0.362,0.456,0.352
average_NDCG,0.801,0.719,0.043,0.42,0.538,0.504
average_M1,0.725,0.644,0.042,0.359,0.472,0.327
average_M3,0.923,0.833,0.047,0.525,0.606,0.541
average_M5,0.937,0.874,0.051,0.525,0.632,0.655
average_position,1.327,1.433,1.364,1.386,1.41,2.173
average_length (x of gs),1.904,1.497,0.567,1.021,1.203,2.23


Results for experiment ignore_less-common_no disease found is


Unnamed: 0,Avey,Ada,Babylon,Buoy,K health,WebMD
average_precision,0.379,0.386,0.1,0.283,0.25,0.159
average_recall,0.704,0.519,0.05,0.283,0.316,0.33
average_f1-score,0.493,0.443,0.067,0.283,0.279,0.215
average_NDCG,0.722,0.524,0.028,0.292,0.258,0.269
average_M1,0.612,0.416,0.0,0.197,0.13,0.147
average_M3,0.809,0.562,0.0,0.348,0.253,0.249
average_M5,0.854,0.624,0.0,0.348,0.286,0.322
average_position,1.577,1.658,0.0,1.609,2.244,2.276
average_length (x of gs),2.3,1.683,0.38,1.104,1.4,2.403


Results for experiment ignore_less-common_session failed is


Unnamed: 0,Avey,Ada,Babylon,Buoy,K health,WebMD
average_precision,0.379,0.386,0.006,0.216,0.25,0.159
average_recall,0.704,0.519,0.003,0.216,0.316,0.33
average_f1-score,0.493,0.443,0.004,0.216,0.279,0.215
average_NDCG,0.722,0.524,0.002,0.222,0.258,0.269
average_M1,0.612,0.416,0.0,0.15,0.13,0.147
average_M3,0.809,0.562,0.0,0.266,0.253,0.249
average_M5,0.854,0.624,0.0,0.266,0.286,0.322
average_position,1.577,1.658,0.0,1.609,2.244,2.276
average_length (x of gs),2.3,1.683,0.375,1.104,1.4,2.403


Results for experiment ignore_all-cases_no disease found is


Unnamed: 0,Avey,Ada,Babylon,Buoy,K health,WebMD
average_precision,0.439,0.441,0.381,0.37,0.36,0.218
average_recall,0.733,0.582,0.226,0.364,0.403,0.447
average_f1-score,0.549,0.502,0.284,0.367,0.38,0.293
average_NDCG,0.77,0.632,0.256,0.41,0.416,0.402
average_M1,0.678,0.542,0.235,0.329,0.322,0.249
average_M3,0.877,0.713,0.265,0.506,0.452,0.414
average_M5,0.905,0.762,0.294,0.506,0.481,0.51
average_position,1.434,1.515,1.4,1.45,1.635,2.202
average_length (x of gs),2.081,1.58,0.518,1.056,1.291,2.308


Results for experiment ignore_all-cases_session failed is


Unnamed: 0,Avey,Ada,Babylon,Buoy,K health,WebMD
average_precision,0.437,0.441,0.036,0.3,0.358,0.217
average_recall,0.729,0.582,0.021,0.295,0.401,0.443
average_f1-score,0.546,0.502,0.027,0.297,0.378,0.291
average_NDCG,0.766,0.632,0.025,0.332,0.414,0.399
average_M1,0.675,0.542,0.023,0.267,0.32,0.247
average_M3,0.873,0.713,0.026,0.41,0.45,0.411
average_M5,0.9,0.762,0.029,0.41,0.478,0.506
average_position,1.434,1.515,1.364,1.45,1.635,2.202
average_length (x of gs),2.081,1.58,0.512,1.056,1.291,2.308
