# WoS dataset filtering
In this notebook, we'll address the problem of filtering irrelevant papers out of our Web of Science (WoS)-generated dataset. We'll explore a few approaches to this problem, including filtering based on static and dynamic WoS keywords. We'll then evaluate our approaches on a subset of abstracts that have been relevance labeled, i.e. annotated for whether they should be kept in or removed from the dataset.

In [2]:
import jsonlines
import pandas as pd
from collections import Counter

## Reading in the data

In the `WoS_dataset_characterization.ipynb` notebook, I experimented with word embedding clustering to determine which keywords were relevant or irrelevant for our task. While the clustering does an impressive job of grouping keywords semantically, the semantic axes along which they cluster tend to be things like scientific disciplines or chemical and protein names, and each of these groups contains both relevant and irrelevant terms for our subject matter. Therefore, I decided to move ahead by manually labeling the keywords as relevant or not. Because the keywords are labeled in isolation from their context (i.e. without looking at the abstracts they describe), and because the same keyword can quite possibly appear on both relevant and irrelevant papers, there's no guarantee that this method is any better, so we want to evaluate it against a test set of manually labeled abstracts. We'll read in all three sets of labels here.

In [3]:
### UPDATE with newest version when available
with jsonlines.open('../data/wos_files/core_collection_destol_or_anhydro_ALL_03Jan2024_sequential.jsonl') as reader:
    # Each line of the file is one paper, read in as a dict
    data = list(reader)

In [4]:
dynamos_ann1 = pd.read_csv('../data/wos_files/dynamic_keys_relevance_classified_27Dec2023.csv', index_col=0)
dynamos_ann2 = pd.read_csv('../data/wos_files/dynamic_keys_labeled_Ian_04Jan2024.csv', index_col=0)

In [5]:
statics_ann1 = pd.read_csv('../data/wos_files/static_keys_relevance_classified_27Dec2023.csv', index_col=0)
statics_ann2 = pd.read_csv('../data/wos_files/static_keys_labeled_Ian_04Jan2024.csv', index_col=0)

In [None]:
## TODO read in both annotations of the test set

## Calculating IAA
We'll measure inter-annotator agreement (IAA) with [Cohen's kappa](https://surge-ai.medium.com/inter-annotator-agreement-an-introduction-to-cohens-kappa-statistic-dcc15ffa5ac4), which is defined as:

$$\frac{P_{o} - P_{e}}{1 - P_{e}}$$

where $P_{o}$ is the observed agreement, i.e. the proportion of items to which both raters assigned the same label, and $P_{e}$ is the probability that both raters would choose the same label if they guessed randomly according to their own label frequencies. We'll code up a function to calculate these values to get the overall Cohen's kappa.
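
As a quick sanity check of the formula, the sketch below works through it by hand on a made-up set of six labels. The lists `ann1` and `ann2` are hypothetical and are not drawn from our annotation files; the only point is to show how $P_{o}$ and $P_{e}$ combine.

```python
from collections import Counter

# Hypothetical toy labels for six items from two annotators (not from our data)
ann1 = ['Y', 'Y', 'N', 'N', 'Y', 'N']
ann2 = ['Y', 'N', 'N', 'N', 'Y', 'Y']
total = len(ann1)

# Observed agreement: fraction of items that received identical labels
p_o = sum(a == b for a, b in zip(ann1, ann2)) / total

# Chance agreement: for each label, multiply the two annotators' label
# frequencies, then sum over the labels
freq1, freq2 = Counter(ann1), Counter(ann2)
p_e = sum((freq1[label] / total) * (freq2[label] / total) for label in ('Y', 'N'))

kappa = (p_o - p_e) / (1 - p_e)
print(f'Po = {p_o:.2f}, Pe = {p_e:.2f}, kappa = {kappa:.2f}')
# prints: Po = 0.67, Pe = 0.50, kappa = 0.33
```

Here the annotators agree on four of six items, but since each uses Y and N equally often, half of that agreement is expected by chance, which pulls kappa down to about 0.33.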

In [18]:
def cohens_kappa(ann1, ann2):
    """
    Compute the cohen's kappa for a set of annotations.
    
    parameters:
        ann1, df: annotator1's annotations
        ann2, df: annotator2's annotations
    
    returns:
        kappa, float: cohen's kappa
    """
    # Merge the two dfs
    anns = ann1.merge(ann2, left_index=True, right_index=True, suffixes=('_ann1', '_ann2'))
    
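    # Calculate the input values: the four cells of the 2x2 agreement table
    total = len(anns)
    # Agreements: both annotators said Y (tp) or both said N (tn)
    tp = len(anns[(anns['relevant_ann1'] == anns['relevant_ann2']) & (anns['relevant_ann1'] == 'Y')])
    tn = len(anns[(anns['relevant_ann1'] == anns['relevant_ann2']) & (anns['relevant_ann1'] == 'N')])
    # Disagreements: ann1 said Y and ann2 said N (fp), or the reverse (fn)
    fp = len(anns[(anns['relevant_ann1'] != anns['relevant_ann2']) & (anns['relevant_ann1'] == 'Y')])
    fn = len(anns[(anns['relevant_ann1'] != anns['relevant_ann2']) & (anns['relevant_ann1'] == 'N')])
    # Observed agreement
    Po = (tp + tn)/total
    # Chance agreement on Y (P1) and on N (P2), from each annotator's label frequencies
    P1 = ((tp + fn)*(tp + fp))/total**2
    P2 = ((tn + fn) * (tn + fp))/total**2
    Pe = P1 + P2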
    
    # Calculate the overall value
    kappa = (Po - Pe)/(1 - Pe)
    
    return kappa


### IAA on keywords
First, we want to calculate an IAA for our labels of the keywords. 

In [21]:
dynamo_iaa = cohens_kappa(dynamos_ann1, dynamos_ann2)
print(f'Agreement on the dynamic keywords was {dynamo_iaa:.2f}')

Agreement on the dynamic keywords was 0.10


In [22]:
static_iaa = cohens_kappa(statics_ann1, statics_ann2)
print(f'Agreement on the static keywords was {static_iaa:.2f}')

Agreement on the static keywords was 0.38


These agreements are terrible! However, it's important to consider that, since the keywords are labeled completely out of context (on purpose), this is an extremely subjective task. One person's labels may turn out to be more accurate than the other's when evaluated on the test set, so for now I'm going to keep the keyword annotations separate and treat them as separate filtering "methods".

### IAA on test set
We will also calculate the IAA for the test set. It's more important that this value be relatively high, since we want consensus on the test labels; labeling the test set should be less a matter of opinion and more self-evident.

## Designing filtering methods
Most papers have multiple static and multiple dynamic keywords, so there are a few ways that we can choose to filter papers based on the classified keywords. We'll implement the following:

1. Most stringent: To keep a paper, all of its keywords must have Y relevance
2. Middle road: To keep a paper, more of its keywords must have Y relevance than N relevance (a tie means the paper is dropped)
3. Least stringent: To keep a paper, at least one of its keywords must have Y relevance
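
The three rules above can be sketched on a single paper's keyword labels as follows. This is a minimal illustration in plain Python, separate from the pandas-based function used in this notebook; `keep_paper` is a hypothetical helper name, and the choice to drop papers with no labeled keywords is an assumption.

```python
from collections import Counter

def keep_paper(labels, stringency='most'):
    """Decide whether to keep a paper given the Y/N labels of its keywords."""
    if not labels:
        # Assumption: a paper with no labeled keywords is dropped
        return False
    if stringency == 'most':
        return all(label == 'Y' for label in labels)
    if stringency == 'middle':
        counts = Counter(labels)
        # A tie does not count as a majority
        return counts['Y'] > counts['N']
    if stringency == 'least':
        return any(label == 'Y' for label in labels)
    raise ValueError(f'Unknown stringency: {stringency}')

# Hypothetical paper with two relevant keywords and one irrelevant keyword
labels = ['Y', 'Y', 'N']
for level in ('most', 'middle', 'least'):
    print(level, keep_paper(labels, level))
# prints:
# most False
# middle True
# least True
```

A paper with mixed labels like this one is exactly where the three levels diverge: it fails the strictest rule but passes the other two.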

In [31]:
def filter_papers(papers, key_df, kind, stringency='most'):
    """
    Filter papers by keyword.
    
    parameters:
        papers, list of dict: papers to filter
        key_df, pandas df: index are keywords, column is 'relevant' containing Y or N strings
        kind, str: either 'static_keywords' or 'dynamic_keys' ## TODO change to static_keys for updated dataset
        stringency, str: 'most', 'middle', or 'least', default is 'most'
    
    returns:
        filtered_papers, list of dict: list of papers with irrelevant papers removed
    """
    filtered_papers = []
    for paper in papers:
        # Get the relevances of all keywords
        keys = paper[kind]
        key_rels = key_df.loc[keys, :]
        # Filter based on requested stringency
        if stringency == 'most':
            if (len(key_rels['relevant'].unique()) == 1) and (key_rels['relevant'].unique()[0] == 'Y'):
                filtered_papers.append(paper)
        elif stringency == 'middle':
            nums = Counter(key_rels['relevant'].values.tolist())
            if nums['Y'] > nums['N']:
                filtered_papers.append(paper)
        elif stringency == 'least':
            if 'Y' in key_rels.relevant.values.tolist():
                filtered_papers.append(paper)
    
    print(f'{len(filtered_papers)} of {len(papers)} were kept upon filtering.')
    return filtered_papers        

## Testing filtering methods
TODO: Actually test on test set
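
Once the annotated test set is read in, one possible way to score each filter is with precision and recall against the test labels. The sketch below is only an illustration of that idea under stated assumptions: `score_filter`, the paper IDs, and the shape of `gold` (a dict of ID to 'Y'/'N') are all hypothetical, since the test set format is not yet fixed.

```python
def score_filter(kept_ids, gold_labels):
    """
    Score a filter against gold relevance labels.

    parameters:
        kept_ids, set: IDs of papers that survived filtering
        gold_labels, dict: paper ID -> 'Y' (relevant) or 'N' (irrelevant)

    returns:
        precision, recall, f1: floats
    """
    # Relevant papers that were kept
    tp = sum(1 for pid, lab in gold_labels.items() if lab == 'Y' and pid in kept_ids)
    # Irrelevant papers that were kept
    fp = sum(1 for pid, lab in gold_labels.items() if lab == 'N' and pid in kept_ids)
    # Relevant papers that were removed
    fn = sum(1 for pid, lab in gold_labels.items() if lab == 'Y' and pid not in kept_ids)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Hypothetical test labels and filter output (not real data)
gold = {'p1': 'Y', 'p2': 'Y', 'p3': 'N', 'p4': 'N', 'p5': 'Y'}
kept = {'p1', 'p2', 'p3'}
precision, recall, f1 = score_filter(kept, gold)
print(f'precision = {precision:.2f}, recall = {recall:.2f}, F1 = {f1:.2f}')
```

Precision penalizes a filter for keeping irrelevant papers, while recall penalizes it for discarding relevant ones, so the most and least stringent filters would be expected to trade one off against the other.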

Now, let's see how each of these filtering methods impacts the dataset.

### Dynamic keywords
#### Most stringent

In [32]:
most_dynamo = filter_papers(data, dynamos, 'dynamic_keys')

4044 of 6903 were kept upon filtering.


#### Middle road

In [33]:
middle_dynamo = filter_papers(data, dynamos, 'dynamic_keys', stringency='middle')

5796 of 6903 were kept upon filtering.


#### Least stringent

In [34]:
least_dynamo = filter_papers(data, dynamos, 'dynamic_keys', stringency='least')

6380 of 6903 were kept upon filtering.


### Static keywords
#### Most stringent

In [35]:
most_static = filter_papers(data, statics, 'static_keywords')

3978 of 6903 were kept upon filtering.


#### Middle road

In [37]:
middle_static = filter_papers(data, statics, 'static_keywords', stringency='middle')

4692 of 6903 were kept upon filtering.


#### Least stringent

In [38]:
least_static = filter_papers(data, statics, 'static_keywords', stringency='least')

5963 of 6903 were kept upon filtering.
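
To compare the six runs at a glance, the counts printed above can be converted to retention fractions. The snippet below simply re-enters those printed counts by hand (the dict layout is my own choice for illustration) and does not re-run the filters.

```python
# Counts copied from the filtering outputs above
total = 6903
kept = {
    ('dynamic', 'most'): 4044,
    ('dynamic', 'middle'): 5796,
    ('dynamic', 'least'): 6380,
    ('static', 'most'): 3978,
    ('static', 'middle'): 4692,
    ('static', 'least'): 5963,
}

# Fraction of the dataset retained by each keyword type and stringency
for (kind, stringency), n in kept.items():
    print(f'{kind:>7} / {stringency:<6}: {n / total:.1%} kept')
```

As expected, retention increases as the stringency is relaxed for both keyword types.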
