### What difference does the source of the training data tag make?

**See how much training data came from each of the 4 tagging sources:**
1. EPMC
2. ResearchFish (RF)
3. Grants via Excel
4. Grants via Prodigy active learning

**See how well the ensemble does for each of these 4 sources.**

### Findings:

Number of training data points labelled by each source:
- Multiple sources: 26
- RF only: 57 (all 57 'tech')
- EPMC only: 126 (all 126 'tech')
- Grants only: 485 (138 'tech', 347 'not tech')
- Prodigy only: 286 (150 'tech', 136 'not tech')

Of the 26 data points labelled by multiple sources, only 3 were not in agreement:
- 3 times the grant description was labelled 'not tech' but the RF data was labelled 'tech'
- 0 times the grant description was labelled 'not tech' but the EPMC data was labelled 'tech'
- 3 times the grant description and RF labels both said 'tech'
- 13 times the grant description and EPMC labels both said 'tech'
- 7 times the EPMC and RF labels both said 'tech'

On the test data (i.e. data that did not go into training the models), the 210223 ensemble model performs differently depending on which source labelled the data point. In every case the prediction is made from the grant description.

- Of the 19 ResearchFish-labelled data points (all labelled as tech), a proportion of **0.632** were correctly predicted as tech (i.e. the recall).
- Of the 35 EPMC-labelled data points (all labelled as tech), a proportion of **0.714** were correctly predicted as tech (i.e. the recall).
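
Because every ResearchFish and EPMC test data point is labelled tech, those subsets contain no false positives, so precision is trivially 1.0 and the bolded proportion is the recall. A minimal sketch of that arithmetic is below. The helper `binary_metrics` is hypothetical, and the count of 12 correct predictions is inferred from 0.632 × 19 rather than read from the notebook output.

```python
def binary_metrics(y_true, y_pred):
    """Precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {
        'precision': round(precision, 3),
        'recall': round(recall, 3),
        'f1': round(f1, 3),
    }

# Assumed illustration: 19 grants all labelled tech, 12 of them predicted tech
y_true = [1] * 19
y_pred = [1] * 12 + [0] * 7
rf_metrics = binary_metrics(y_true, y_pred)
print(rf_metrics)
```

With these inputs the sketch gives precision 1.0, recall 0.632 and F1 0.774, matching the RF figures reported in the notebook.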

The model has higher recall on data points that were labelled from the original grant descriptions:
- 114 original grants tagged via Excel: precision 0.805, recall **0.892**, F1 0.846
- 73 grants tagged via Prodigy: precision 0.771, recall **0.902**, F1 0.831
- 187 grants tagged via **either** Excel or Prodigy: precision 0.787, recall **0.897**, F1 0.838

Ensemble performance on all single-source test data (recap):
- 241 grants: precision 0.849, recall 0.811, F1 0.829


In [71]:
import json

import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, f1_score, precision_score, recall_score


## Load datasets

In [2]:
# 1. The training data that went into Prodigy : 
original_training_data_dir = '../data/processed/training_data/210126/training_data.csv'

# 2. The final Prodigy merged results (original training + new tags) :
prodigy_data_dir = '../data/prodigy/merged_tech_grants/merged_tech_grants.jsonl'

# 3. Which grants were from the test set for the 210221 model runs: 
test_data_dir = '../data/processed/model_test_results/test_data_210221.csv'

# 4. The predictions on all grants from the 210223 Ensemble model (which used the 210221 models):
tech_preds_dir = '../data/processed/ensemble/210223/wellcome-grants-awarded-2005-2019_tagged.csv'

In [3]:
original_training_data = pd.read_csv(original_training_data_dir)
original_training_data_dict = {}
for i, row in original_training_data.iterrows():
    original_training_data_dict[row['Internal ID']] = {
        'Orig tech': row['Relevance code'],
        'RF': None if pd.isnull(row['Normalised code - RF']) else 1,
        'EPMC': None if pd.isnull(row['Normalised code - EPMC']) else 1,
        'Grants': None if pd.isnull(row['Normalised code - grants']) else (0 if int(row['Normalised code - grants'])==5 else 1),
    }
print(len(original_training_data_dict))
original_training_data_dict['106169/Z/14/Z']

696


{'Orig tech': 1, 'RF': 1, 'EPMC': None, 'Grants': 1}

In [4]:
# All Prodigy + original data in one place

cat2bin = {'Not tech grant': 0, 'Tech grant': 1}

training_data = {}
with open(prodigy_data_dir, 'r') as json_file:
    for json_str in json_file:
        data = json.loads(json_str)
        if data['answer'] != 'ignore':
            label = cat2bin[data['label']]
            if data['answer']=='accept':
                rel_code = label
            else:
                # Not accepted, so flip the proposed label (1 -> 0, 0 -> 1)
                rel_code = 1 - label
            training_data[data['id']] = rel_code
print(len(training_data))
training_data['106169/Z/14/Z']

980


1
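
The cell above turns each Prodigy annotation into a binary label: an `accept` answer keeps the proposed label, and any other answer that is not `ignore` flips it. A minimal sketch of that rule as a standalone function is below. The name `prodigy_to_binary` and the sample records are hypothetical and only mirror the `label` and `answer` fields used above.

```python
CAT2BIN = {'Not tech grant': 0, 'Tech grant': 1}

def prodigy_to_binary(record):
    """Return 0/1 for an annotation record, or None if it was ignored."""
    if record['answer'] == 'ignore':
        return None
    label = CAT2BIN[record['label']]
    # 'accept' confirms the proposed label; any other answer flips it
    return label if record['answer'] == 'accept' else 1 - label

# Made-up records covering each branch
examples = [
    {'label': 'Tech grant', 'answer': 'accept'},
    {'label': 'Tech grant', 'answer': 'reject'},
    {'label': 'Not tech grant', 'answer': 'reject'},
    {'label': 'Not tech grant', 'answer': 'ignore'},
]
binary_labels = [prodigy_to_binary(record) for record in examples]
print(binary_labels)
```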

In [5]:
# Which were the test set
test_data = pd.read_csv(test_data_dir)
test_grants = test_data['Internal ID'].tolist()
len(test_grants)

245

In [6]:
# Get the predictions
tech_preds = pd.read_csv(tech_preds_dir)
tech_preds.drop_duplicates(inplace=True)
tech_preds = tech_preds.set_index('Grant ID')['Tech grant prediction'].to_dict()
print(len(tech_preds))
tech_preds['220282/Z/20/Z']

16854


0

## Combine datasets

In [7]:
test_grants_set = set(test_grants)  # A set gives fast membership checks

training_data_details = []
for grant_number, tech_cat in training_data.items():
    grant_details = {'Grant number': grant_number}
    # Is it from the test set or training?
    grant_details['Test/Train?'] = 'Test' if grant_number in test_grants_set else 'Train'

    # Get prediction from ensemble model (None if there is no prediction for this grant)
    grant_details['Ensemble prediction'] = tech_preds.get(grant_number)

    # Was it in the original training data?
    orig_info = original_training_data_dict.get(grant_number)
    if orig_info is None:
        orig_info = {'Orig tech': None, 'RF': None, 'EPMC': None, 'Grants': None, 'Prodigy only': tech_cat}
    else:
        # Build a new dict so the entry in original_training_data_dict is not modified in place
        orig_info = {**orig_info, 'Prodigy only': None}
    orig_info['Final tech'] = tech_cat
    grant_details.update(orig_info)
    training_data_details.append(grant_details)

In [8]:
len(training_data_details)

980

In [68]:
training_data_details_df = pd.DataFrame(training_data_details)

## Checks

In [12]:
# if 'Orig tech' is given it should always be the same as 'Final tech'
orig_tech = training_data_details_df[pd.notnull(training_data_details_df['Orig tech'])]
all(orig_tech['Orig tech'] == orig_tech['Final tech'])

True

In [36]:
multi_tag_data = training_data_details_df[pd.notnull(training_data_details_df[['RF', 'EPMC', 'Grants', 'Prodigy only']]).sum(axis=1)!=1]
print(f'Multiple sources: {len(multi_tag_data)}')
single_tag_data = training_data_details_df[pd.notnull(training_data_details_df[['RF', 'EPMC', 'Grants', 'Prodigy only']]).sum(axis=1)==1]
num_source = pd.notnull(single_tag_data[['RF', 'EPMC', 'Grants', 'Prodigy only']]).sum(axis=0)
num_source

Multiple sources: 26


RF               57
EPMC            126
Grants          485
Prodigy only    286
dtype: int64
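
The split above counts, for each grant, how many of the four source columns are non-null, and treats a count of exactly one as single-source. The same idea in plain Python is sketched below. The rows and the helper `count_sources` are made up for illustration and are not taken from the data.

```python
SOURCES = ['RF', 'EPMC', 'Grants', 'Prodigy only']

def count_sources(row):
    """Number of source columns that carry a label (i.e. are not None)."""
    return sum(row.get(source) is not None for source in SOURCES)

# Made-up rows: a label of 0 ('not tech') still counts as a label
rows = [
    {'RF': 1, 'EPMC': None, 'Grants': None, 'Prodigy only': None},
    {'RF': 1, 'EPMC': None, 'Grants': 0, 'Prodigy only': None},
    {'RF': None, 'EPMC': None, 'Grants': None, 'Prodigy only': 0},
]
single = [row for row in rows if count_sources(row) == 1]
multi = [row for row in rows if count_sources(row) != 1]
print(len(single), len(multi))
```

The check is `is not None` rather than truthiness, since a 'not tech' label of 0 is falsy but is still a label.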

In [144]:
single_tag_data.groupby('Final tech')[['RF', 'EPMC', 'Grants', 'Prodigy only']].count()

Final tech  RF  EPMC  Grants  Prodigy only
0            0     0     347           136
1           57   126     138           150


In [69]:
# For the multiple ones, do they tend to agree?
print('Number of times the grant description tag was not tech, but RF data said tech:')
print(len(multi_tag_data.loc[((multi_tag_data['Grants']==0) & (multi_tag_data['RF']==1))]))
print('Number of times the grant description tag was not tech, but EPMC data said tech:')
print(len(multi_tag_data.loc[((multi_tag_data['Grants']==0) & (multi_tag_data['EPMC']==1))]))
print('Number of times the grant description tag was tech, and RF data said tech:')
print(len(multi_tag_data.loc[((multi_tag_data['Grants']==1) & (multi_tag_data['RF']==1))]))
print('Number of times the grant description tag was tech, and EPMC data said tech:')
print(len(multi_tag_data.loc[((multi_tag_data['Grants']==1) & (multi_tag_data['EPMC']==1))]))
print('Number of times EPMC and RF said tech:')
print(len(multi_tag_data.loc[((multi_tag_data['RF']==1) & (multi_tag_data['EPMC']==1))]))

Number of times the grant description tag was not tech, but RF data said tech:
3
Number of times the grant description tag was not tech, but EPMC data said tech:
0
Number of times the grant description tag was tech, and RF data said tech:
3
Number of times the grant description tag was tech, and EPMC data said tech:
13
Number of times EPMC and RF said tech:
7
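
The pairwise checks above can also be expressed as one tally of label pairs per pair of sources. A sketch using `collections.Counter` follows. The rows and the helper `pair_counts` are hypothetical.

```python
from collections import Counter

def pair_counts(rows, source_a, source_b):
    """Tally (label_a, label_b) pairs for rows labelled by both sources."""
    return Counter(
        (row[source_a], row[source_b])
        for row in rows
        if row[source_a] is not None and row[source_b] is not None
    )

# Made-up multi-source rows
rows = [
    {'Grants': 0, 'RF': 1, 'EPMC': None},
    {'Grants': 1, 'RF': 1, 'EPMC': None},
    {'Grants': 1, 'RF': None, 'EPMC': 1},
    {'Grants': None, 'RF': 1, 'EPMC': 1},
]
grants_vs_rf = pair_counts(rows, 'Grants', 'RF')
# (0, 1) is a disagreement: grant description 'not tech', RF 'tech'
print(grants_vs_rf[(0, 1)], grants_vs_rf[(1, 1)])
```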


## Test metrics for the different sources

In [58]:
test_single_tag_data = single_tag_data[single_tag_data['Test/Train?']=='Test']

In [103]:
# Metrics for the test data points labelled by a given source
def get_metrics(source):
    """Print ensemble metrics for the single-source test data labelled by `source`; return (y, y_pred)."""
    source_additions = test_single_tag_data[pd.notnull(test_single_tag_data[source])]
    y_pred = source_additions['Ensemble prediction'].tolist()
    y = source_additions['Final tech'].tolist()
    
    y_tech_index = [i for i, v in enumerate(y) if v==1] # Indices of the grants labelled as tech
    y_pred_tech = [y_pred[i] for i in y_tech_index] # The ensemble predictions for the tech grants only
    
    print({'precision': round(precision_score(y, y_pred, average='binary'),3),
         'recall': round(recall_score(y, y_pred, average='binary'),3),
         'f1': round(f1_score(y, y_pred, average='binary'),3),
           'Number tagged': len(y),
           'Number tagged as tech': len(y_tech_index),
           # Share of the grants labelled tech that were predicted tech (equal to recall)
           'Proportion predicted as tech were tech': round(sum(y_pred_tech)/len(y_tech_index),3)
        })
    return y, y_pred

In [104]:
y, y_pred = get_metrics('RF')

{'precision': 1.0, 'recall': 0.632, 'f1': 0.774, 'Number tagged': 19, 'Number tagged as tech': 19, 'Proportion predicted as tech were tech': 0.632}


In [105]:
y, y_pred = get_metrics('EPMC')

{'precision': 1.0, 'recall': 0.714, 'f1': 0.833, 'Number tagged': 35, 'Number tagged as tech': 35, 'Proportion predicted as tech were tech': 0.714}


In [106]:
y, y_pred = get_metrics('Grants')

{'precision': 0.805, 'recall': 0.892, 'f1': 0.846, 'Number tagged': 114, 'Number tagged as tech': 37, 'Proportion predicted as tech were tech': 0.892}


In [107]:
y, y_pred = get_metrics('Prodigy only')

{'precision': 0.771, 'recall': 0.902, 'f1': 0.831, 'Number tagged': 73, 'Number tagged as tech': 41, 'Proportion predicted as tech were tech': 0.902}


### Either grants source

In [112]:
source_additions = test_single_tag_data[
    ((pd.notnull(test_single_tag_data['Grants'])) |
     (pd.notnull(test_single_tag_data['Prodigy only'])))]
y_pred = source_additions['Ensemble prediction'].tolist()
y = source_additions['Final tech'].tolist()

y_tech_index = [i for i, v in enumerate(y) if v==1] # Which grant index were tech
y_pred_tech = [y_pred[i] for i in y_tech_index] # The grant predictions for the tech grants only

print({'precision': round(precision_score(y, y_pred, average='binary'),3),
     'recall': round(recall_score(y, y_pred, average='binary'),3),
     'f1': round(f1_score(y, y_pred, average='binary'),3),
       'Number tagged': len(y),
       'Number tagged as tech': len(y_tech_index),
       'Proportion predicted as tech were tech': round(sum(y_pred_tech)/len(y_tech_index),3)
    })

{'precision': 0.787, 'recall': 0.897, 'f1': 0.838, 'Number tagged': 187, 'Number tagged as tech': 78, 'Proportion predicted as tech were tech': 0.897}


In [113]:
## All data
source_additions = test_single_tag_data
y_pred = source_additions['Ensemble prediction'].tolist()
y = source_additions['Final tech'].tolist()

y_tech_index = [i for i, v in enumerate(y) if v==1] # Which grant index were tech
y_pred_tech = [y_pred[i] for i in y_tech_index] # The grant predictions for the tech grants only

print({'precision': round(precision_score(y, y_pred, average='binary'),3),
     'recall': round(recall_score(y, y_pred, average='binary'),3),
     'f1': round(f1_score(y, y_pred, average='binary'),3),
       'Number tagged': len(y),
       'Number tagged as tech': len(y_tech_index),
       'Proportion predicted as tech were tech': round(sum(y_pred_tech)/len(y_tech_index),3)
    })

{'precision': 0.849, 'recall': 0.811, 'f1': 0.829, 'Number tagged': 241, 'Number tagged as tech': 132, 'Proportion predicted as tech were tech': 0.811}


## Output training data with just grants tagged data

In [130]:
# Grant numbers tagged only via EPMC or RF (no Grants label and no Prodigy-only label)
epmc_rf_addition_grants = training_data_details_df[
    pd.isnull(training_data_details_df[['Grants', 'Prodigy only']]).sum(axis=1)==2
]['Grant number'].tolist()
len(epmc_rf_addition_grants)

190

In [133]:
new_training_data = []
with open(prodigy_data_dir, 'r') as json_file:
    for json_str in list(json_file):
        data = json.loads(json_str)
        if data['id'] not in epmc_rf_addition_grants:
            new_training_data.append(data)
print(len(new_training_data))

991


In [136]:
with open('../data/prodigy/merged_tech_grants/merged_tech_grants_noepmcrf.jsonl', 'w') as json_file:
    for entry in new_training_data:
        json.dump(entry, json_file)
        json_file.write('\n')
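
Filtering JSONL records by ID and writing them back out, as done above, can be sketched as follows. This version uses a `set` for the excluded IDs so each membership test is constant-time. The records, IDs, helper name and temporary file are made up for illustration.

```python
import json
import os
import tempfile

def filter_jsonl_records(records, excluded_ids):
    """Keep the records whose 'id' is not in excluded_ids."""
    excluded = set(excluded_ids)
    return [record for record in records if record['id'] not in excluded]

# Made-up records; note that an ID may appear on more than one line
records = [
    {'id': 'A', 'label': 'Tech grant', 'answer': 'accept'},
    {'id': 'B', 'label': 'Tech grant', 'answer': 'reject'},
    {'id': 'A', 'label': 'Not tech grant', 'answer': 'reject'},
]
kept = filter_jsonl_records(records, excluded_ids=['B'])

# Write one JSON object per line, then read the file back to check the round trip
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'filtered.jsonl')
    with open(path, 'w') as json_file:
        for entry in kept:
            json_file.write(json.dumps(entry) + '\n')
    with open(path) as json_file:
        reloaded = [json.loads(line) for line in json_file]
print(len(reloaded))
```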