# Medical Label Project

## Setting up

Import necessary packages and adjust settings.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', None)
pd.set_option('max_colwidth', None)
pd.options.mode.chained_assignment = None  # default='warn'

Read data from csv files.

In [2]:
reads = pd.read_csv('data/1345_admin_reads.csv')
results = pd.read_csv('data/1345_customer_results.csv')

Discard every row of the results dataframe whose "Origin" value does not contain voteN for N = 0-8, and extract N from "Origin" for the remaining rows.\
N is the number of experts who voted 'yes'.

In [3]:
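# Keep rows whose "Origin" contains voteN with N in 0-8 (case-insensitive).
valid_vote = results["Origin"].str.fullmatch(r'(.*)vote[0-8](.*)', case=False)
valid_results = results.loc[valid_vote]
# Extract N with the same case-insensitivity as the filter; expand=False yields a Series.
votes = valid_results["Origin"].str.extract(r'(?i)vote([0-8])', expand=False).astype(int)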
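
As a minimal sketch of this filter-then-extract step, the snippet below runs the same two string operations on made-up data. The `toy` frame and its "Origin" strings are hypothetical; the naming scheme in the real CSV files may differ.

```python
import pandas as pd

# Hypothetical "Origin" values; the real file's naming scheme may differ.
toy = pd.DataFrame({"Origin": ["img_001_vote3.jpg", "img_002_vote8.jpg",
                               "img_003_gold.jpg", "img_004_vote9.jpg"]})

# fullmatch requires the whole string to match, so the pattern is padded with .*
has_vote = toy["Origin"].str.fullmatch(r'.*vote[0-8].*', case=False)
kept = toy.loc[has_vote]

# expand=False returns a Series instead of a one-column DataFrame.
toy_votes = kept["Origin"].str.extract(r'vote(\d)', expand=False).astype(int)
print(toy_votes.tolist())
```

Rows without a `voteN` tag, or with N outside 0-8, are dropped before extraction, so the integer conversion never sees a missing value.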

Drop unnecessary columns. Create a new column called "Votes" to store the number of votes extracted from "Origin".

In [4]:
valid_results = valid_results.loc[:,"Qualified Reads":"Second Choice Weight"]
valid_results["Votes"] = votes

## Remarks on "Correct Label"

If fewer than 4 votes is read as a correct label of 'no', more than 4 votes as a correct label of 'yes', and exactly 4 votes as no correct label, the "Votes" column agrees completely with the "Correct Label" column.

In [5]:
print("Correct Labels corresponding to < 4 votes:", valid_results.loc[valid_results["Votes"]<4,"Correct Label"].unique())
print("Correct Labels corresponding to > 4 votes:", valid_results.loc[valid_results["Votes"]>4,"Correct Label"].unique())
print("Correct Labels corresponding to = 4 votes:", valid_results.loc[valid_results["Votes"]==4,"Correct Label"].unique())

Correct Labels corresponding to < 4 votes: ["'no'"]
Correct Labels corresponding to > 4 votes: ["'yes'"]
Correct Labels corresponding to = 4 votes: [nan]
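
The same agreement can also be checked row by row rather than by inspecting unique values. The following is a sketch on hypothetical rows, not the real data; `toy`, `derived`, and `agree` are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for valid_results.
toy = pd.DataFrame({
    "Votes": [0, 3, 4, 5, 8],
    "Correct Label": ["'no'", "'no'", np.nan, "'yes'", "'yes'"],
})

# Derive a label from the vote count; 4 of 8 is a tie and stays unlabeled.
derived = pd.Series(np.nan, index=toy.index, dtype=object)
derived[toy["Votes"] < 4] = "'no'"
derived[toy["Votes"] > 4] = "'yes'"

# Rows agree if the labels are equal or both are missing.
agree = derived.eq(toy["Correct Label"]) | (derived.isna() & toy["Correct Label"].isna())
print(agree.all())
```

Missing values need the explicit `isna()` comparison because NaN never compares equal to NaN.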


We also note that of the 27000 results, 12000 have a correct label of 'no', 12000 have a correct label of 'yes', and 3000 have no correct label.

In [6]:
print(valid_results["Correct Label"].value_counts(dropna=False))

'no'     12000
'yes'    12000
NaN       3000
Name: Correct Label, dtype: int64


## Remark on "Majority Label"

Of the 27000 results, 14547 have a majority label of 'yes', 12211 have a majority label of 'no', and 242 have no majority label.

In [7]:
print(valid_results["Majority Label"].value_counts(dropna=False))

'yes'    14547
'no'     12211
NaN        242
Name: Majority Label, dtype: int64


All 242 results without a majority label have zero qualified reads, which explains the missing label.

In [8]:
print(valid_results.loc[valid_results["Majority Label"].isna(),"Qualified Reads"].value_counts())

0    242
Name: Qualified Reads, dtype: int64


Of the results with at least one qualified read, 1508 have an equal number of yes and no votes. In these cases, the majority label is determined by the sum of the accuracy scores of the qualified reads for each answer choice.

In [9]:
print(valid_results.loc[(valid_results["First Choice Votes"]==valid_results["Second Choice Votes"])&(valid_results["Qualified Reads"]!=0),"Majority Label"].value_counts(dropna=False))
# print(valid_results.loc[(valid_results["First Choice Votes"]==valid_results["Second Choice Votes"])&(valid_results["Qualified Reads"]!=0),"First Choice Answer"].value_counts(dropna=False))

'no'     764
'yes'    744
Name: Majority Label, dtype: int64
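
The tie-breaking rule can be illustrated on made-up rows. This sketch assumes that "First Choice Weight" and "Second Choice Weight" hold the summed accuracy scores for each answer, which the data does not state explicitly; the `ties` frame is hypothetical.

```python
import pandas as pd

# Hypothetical tied rows; the weight columns are ASSUMED to hold the summed
# accuracy scores of the qualified reads behind each answer.
ties = pd.DataFrame({
    "First Choice Answer": ["'yes'", "'no'"],
    "First Choice Votes": [2, 1],
    "First Choice Weight": [1.50, 0.90],
    "Second Choice Answer": ["'no'", "'yes'"],
    "Second Choice Votes": [2, 1],
    "Second Choice Weight": [1.72, 0.70],
})

# Pick whichever answer carries the larger total weight.
first_wins = ties["First Choice Weight"] >= ties["Second Choice Weight"]
tie_break = ties["First Choice Answer"].where(first_wins, ties["Second Choice Answer"])
print(tie_break.tolist())
```

Comparing such a `tie_break` series against "Majority Label" on the real tied rows would be one way to check whether this reading of the rule holds.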


## Analysis

We first count the number of expert votes that agree with the expert majority, and the total number of expert votes.\
Since each result has 8 expert votes, the total is simply 8 times the number of results.\
As seen earlier, some results have no expert majority because the experts split evenly between yes and no, so we need to decide whether to include them. I argue that since there must be a correct label (though we don't know which), 50% of the experts got the correct answer in these cases, so these results can be included.\
The number of agreeing votes is then the sum, over all results, of the votes on the majority side.

In [10]:
num_expert_agree = (np.max((valid_results["Votes"],8-valid_results["Votes"]),axis=0)).sum()
total_expert_votes = len(valid_results.index) * 8
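
To make the arithmetic concrete, here is a small worked example on hypothetical vote counts. The values in `toy_votes` are made up; only the `max(N, 8 - N)` rule comes from the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical 'yes' vote counts out of 8 experts for four results.
toy_votes = pd.Series([0, 3, 4, 7])

# Votes on the majority side: max(N, 8 - N). A 4-4 tie contributes 4,
# which matches the "50% of the experts were right" argument.
majority_side = np.maximum(toy_votes, 8 - toy_votes)
toy_agree = majority_side.sum()       # 8 + 5 + 4 + 7
toy_total = len(toy_votes) * 8
print(toy_agree, toy_total, toy_agree / toy_total)
```

Because `max(N, 8 - N)` is never below 4, the resulting agreement rate is bounded below by 50%.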

Then we do the same for the crowd votes.\
Note that in some cases the first choice answer differs from the majority label because the first and second choice answers received an equal number of votes. We can still use the number of votes on the first choice answer when counting the votes that agree with the correct label, since both answers received the same number of votes in these cases.

In [11]:
total_crowd_votes = (valid_results["Qualified Reads"]).sum()
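
The cell above computes only the total. One possible way to count the agreeing crowd votes, assuming the intended quantity mirrors the expert calculation (votes on the majority side), is sketched below on hypothetical tallies. The names `toy_crowd_agree` and `toy_crowd_total` are illustrative and do not appear in the notebook.

```python
import pandas as pd

# Hypothetical crowd tallies standing in for valid_results.
toy = pd.DataFrame({
    "Qualified Reads": [3, 4, 0],
    "First Choice Votes": [2, 2, 0],
    "Second Choice Votes": [1, 2, 0],
})

# ASSUMPTION: the first choice never has fewer votes than the second, so its
# count equals the number of reads on the crowd-majority side, ties included.
assert (toy["First Choice Votes"] >= toy["Second Choice Votes"]).all()

toy_crowd_agree = toy["First Choice Votes"].sum()
toy_crowd_total = toy["Qualified Reads"].sum()
print(toy_crowd_agree, toy_crowd_total)
```

Results with zero qualified reads add nothing to either sum, so they need no special handling here.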

In [12]:
valid_results

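| | Qualified Reads | Correct Label | Majority Label | Difficulty | Agreement | First Choice Answer | First Choice Votes | First Choice Weight | Second Choice Answer | Second Choice Votes | Second Choice Weight | Votes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 'no' | 'no' | 0.000 | 1.000 | 'no' | 2 | 1.54 | 'yes' | 0 | 0.00 | 2 |
| 1 | 3 | 'no' | 'no' | 0.000 | 1.000 | 'no' | 3 | 2.34 | 'yes' | 0 | 0.00 | 0 |
| 2 | 2 | 'no' | 'no' | 0.000 | 1.000 | 'no' | 2 | 1.70 | 'yes' | 0 | 0.00 | 0 |
| 3 | 1 | 'no' | 'no' | 0.000 | 1.000 | 'no' | 1 | 0.82 | 'yes' | 0 | 0.00 | 0 |
| 4 | 7 | NaN | 'yes' | NaN | 0.571 | 'yes' | 4 | 3.28 | 'no' | 3 | 2.32 | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 30288 | 2 | 'no' | 'yes' | 1.000 | 1.000 | 'yes' | 2 | 1.56 | 'no' | 0 | 0.00 | 2 |
| 30289 | 3 | 'no' | 'yes' | 0.667 | 0.667 | 'yes' | 2 | 1.56 | 'no' | 1 | 0.76 | 3 |
| 30290 | 6 | NaN | 'yes' | NaN | 1.000 | 'yes' | 6 | 4.78 | 'no' | 0 | 0.00 | 4 |
| 30291 | 0 | 'yes' | NaN | NaN | NaN | 'yes' | 0 | 0.00 | 'no' | 0 | 0.00 | 5 |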
