# Classification correction
Our classification algorithm isn't perfect, so we manually corrected the classifications. Here, we'll merge the corrections into the graph and quantify the algorithm's mistakes.

In [13]:
import pandas as pd
import networkx as nx
from sklearn.metrics import f1_score
from collections import defaultdict
import jsonlines

## Read in the data

In [49]:
debugged_graph = nx.read_graphml('../data/citation_network/core_collection_destol_or_anhydro_FILTERED_MAIN_ONLY_classified_network_isolates_cleaned_11Mar2023.graphml')

In [15]:
original = pd.read_csv('../data/citation_network/DEBUGGED_automatic_classifications_core_collection_destol_or_anhydro.csv')
original.head()

Unnamed: 0,UID,title,abstract,study_system
0,WOS:A1990ET59600010,RESPONSE OF 4 SORGHUM LINES TO MID-SEASON DROU...,Four sorghum (Sorghum bicolor L. Moench) lines...,Plant
1,WOS:000244317000009,Effects of abscisic acid on growth and dehydra...,Cynanchum komarovii is well adapted to hot and...,Plant
2,WOS:000186691200001,LEAping to conclusions: A computational reanal...,Background: The late embryogenesis abundant (L...,Plant
3,WOS:000178654700046,Early salt stress effects on the changes in ch...,A technique based on Fourier transform infrare...,Plant
4,WOS:A1997XH01700004,Approaches to elucidate the basis of desiccati...,"Plants undergo a series of physiological, bioc...",Plant


Note that we adjusted the classification code to remove isolate nodes when generating the graph, then re-ran the algorithm to get the classifications loaded in `original`. We did this after passing the classifications from the previous iteration to the first annotator, so ~400 rows whose classifications were corrected by the annotators are missing from `original`. Let's do a quick sanity check to make sure that, beyond the missing rows, nothing about the `original` annotations is meaningfully different from those provided to the annotators:

In [51]:
# Note: this is the same path that `original` was read from above
provided_csv = pd.read_csv('../data/citation_network/DEBUGGED_automatic_classifications_core_collection_destol_or_anhydro.csv')
provided_csv.shape

(5626, 4)

In [53]:
orig_compare = provided_csv.merge(original, on=['UID', 'title', 'abstract'], suffixes=('_provided', '_original'))
orig_compare[orig_compare['study_system_provided'] != orig_compare['study_system_original']]

Unnamed: 0,UID,title,abstract,study_system_provided,study_system_original


Perfect, nothing changed except for dropping those rows. We can now move forward, and the extra rows from the manual dataframes will be dropped when the information is merged onto the graph.
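
The check above only compares rows that appear in both dataframes. As a minimal sketch of a slightly broader check, an outer merge with `indicator=True` can also list the rows that exist on only one side. The dataframes and UIDs below are hypothetical toy data, not the real classification tables.

```python
import pandas as pd

# Toy stand-ins for the two classification tables (hypothetical data).
provided = pd.DataFrame({
    "UID": ["A", "B", "C", "D"],
    "study_system": ["Plant", "Animal", "Plant", "Microbe"],
})
rerun = pd.DataFrame({
    "UID": ["A", "B", "C"],
    "study_system": ["Plant", "Animal", "Fungi"],
})

# An outer merge with indicator=True adds a `_merge` column that records
# whether each UID was found in the left table, the right table, or both.
merged = provided.merge(
    rerun, on="UID", how="outer",
    suffixes=("_provided", "_rerun"), indicator=True,
)

# Rows present in only one table (here, the row dropped in the re-run).
only_one_side = merged[merged["_merge"] != "both"]

# Rows present in both tables whose labels disagree.
both = merged[merged["_merge"] == "both"]
changed = both[both["study_system_provided"] != both["study_system_rerun"]]

print(only_one_side["UID"].tolist())
print(changed["UID"].tolist())
```

An empty `changed` frame together with the expected number of one-sided rows would support the conclusion that only the dropped rows differ.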

In [54]:
manual_RV = pd.read_csv('../data/citation_network/DEBUGGED_automatic_classifications_core_collection_destol_or_anhydro_RV_manually_corrected.csv')
manual_RV.head()

Unnamed: 0,UID,title,abstract,study_system
0,WOS:000249421700004,Phenotypic plasticity mediates climate change ...,Synergies between global change and biological...,Animal
1,WOS:000189080600003,"The importance of cuticular permeability, osmo...",Euedaphic collembolans have recently been show...,Animal
2,WOS:A1993LX94500007,GEOGRAPHICAL VARIATION IN THE ACCLIMATION RESP...,Populations may adapt to climatic stresses by ...,Animal
3,WOS:000182189500050,Transition from natively unfolded to folded st...,Late embryogenesis abundant (LEA) proteins are...,Animal
4,WOS:000170973900012,Mechanisms of plant desiccation tolerance,Anhydrobiosis ('life without water') is the re...,Animal


In [55]:
manual_SL = pd.read_csv('../data/citation_network/DEBUGGED_automatic_classifications_core_collection_destol_or_anhydro_SL_manually_corrected_29Mar2024.csv')
manual_SL.head()

Unnamed: 0,UID,title,abstract,study_system
0,WOS:000249421700004,Phenotypic plasticity mediates climate change ...,Synergies between global change and biological...,Animal
1,WOS:000189080600003,"The importance of cuticular permeability, osmo...",Euedaphic collembolans have recently been show...,Animal
2,WOS:A1993LX94500007,GEOGRAPHICAL VARIATION IN THE ACCLIMATION RESP...,Populations may adapt to climatic stresses by ...,Animal
3,WOS:000182189500050,Transition from natively unfolded to folded st...,Late embryogenesis abundant (LEA) proteins are...,Animal
4,WOS:000170973900012,Mechanisms of plant desiccation tolerance,Anhydrobiosis ('life without water') is the re...,Plant


I went back over the first set of manual corrections to make further adjustments, so let's do a quick common-sense check on the second set of manual corrections:

In [56]:
manual_v_manual = manual_RV[['UID', 'study_system', 'title', 'abstract']].merge(manual_SL[['UID', 'study_system']], on='UID', suffixes=('_RV', '_SL'))
manual_v_manual[manual_v_manual['study_system_RV'] != manual_v_manual['study_system_SL']]

Unnamed: 0,UID,study_system_RV,title,abstract,study_system_SL
4,WOS:000170973900012,Animal,Mechanisms of plant desiccation tolerance,Anhydrobiosis ('life without water') is the re...,Plant
5,WOS:000244031100003,Animal,Modelling the effects of microclimate on bean ...,Bean seed storage ability is of major interest...,Plant
23,WOS:000075551200015,Animal,Methods for dehydration-tolerance: Depression ...,"Anhydrobiosis, or life without water, is the r...",NOCLASS
38,WOS:000276030700002,Animal,Effect of storage temperature on spore viabili...,To effectively preserve the vulnerable species...,Plant
63,WOS:000288553000008,Animal,Tolerance to oxidative stress induced by desic...,Unravelling the mechanisms underlying desiccat...,Plant
...,...,...,...,...,...
5563,WOS:000446307800006,Plant,Life in Suspended Animation: Role of Chaperone...,"When confronted by environmental stress, organ...",Animal
5572,WOS:000399420300015,Plant,Dry Preservation of Spermatozoa: Consideration...,The current gold standard for sperm preservati...,Animal
5589,WOS:000525751200012,Plant,Survivorship of geographic Pomacea canaliculat...,"Pomacea canaliculata, a freshwater snail from ...",Animal
5619,WOS:000341348500013,Plant,Combination of synchrotron radiation-based Fou...,Understanding the spatial heterogeneity within...,Microbe


Looks good! We'll move forward using the secondarily reviewed corrections (SL).
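
The filtered dataframe above lists individual disagreements between the two annotators. One way to summarize which label pairs account for them is a cross-tabulation; the sketch below uses hypothetical toy labels rather than the real `manual_RV` and `manual_SL` data.

```python
import pandas as pd

# Hypothetical labels from two annotators for the same five papers.
labels = pd.DataFrame({
    "UID": ["A", "B", "C", "D", "E"],
    "study_system_RV": ["Animal", "Animal", "Plant", "Plant", "Microbe"],
    "study_system_SL": ["Plant", "Animal", "Plant", "Animal", "Microbe"],
})

# Rows are the first annotator's labels, columns the second annotator's.
# Off-diagonal cells count disagreements for that pair of labels.
table = pd.crosstab(labels["study_system_RV"], labels["study_system_SL"])

# Fraction of papers on which the two annotators agree.
agreement = (labels["study_system_RV"] == labels["study_system_SL"]).mean()

print(table)
print(f"Agreement: {agreement:.0%}")
```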

In [57]:
comparison = original[['UID', 'study_system', 'title', 'abstract']].merge(manual_SL[['UID', 'study_system']], on='UID', suffixes=('_original', '_manual'))

In [58]:
comparison.head()

Unnamed: 0,UID,study_system_original,title,abstract,study_system_manual
0,WOS:A1990ET59600010,Plant,RESPONSE OF 4 SORGHUM LINES TO MID-SEASON DROU...,Four sorghum (Sorghum bicolor L. Moench) lines...,Plant
1,WOS:000244317000009,Plant,Effects of abscisic acid on growth and dehydra...,Cynanchum komarovii is well adapted to hot and...,Plant
2,WOS:000186691200001,Plant,LEAping to conclusions: A computational reanal...,Background: The late embryogenesis abundant (L...,NOCLASS
3,WOS:000178654700046,Plant,Early salt stress effects on the changes in ch...,A technique based on Fourier transform infrare...,Plant
4,WOS:A1997XH01700004,Plant,Approaches to elucidate the basis of desiccati...,"Plants undergo a series of physiological, bioc...",Plant


In [59]:
comparison[comparison['study_system_manual'] != comparison['study_system_original']]

Unnamed: 0,UID,study_system_original,title,abstract,study_system_manual
2,WOS:000186691200001,Plant,LEAping to conclusions: A computational reanal...,Background: The late embryogenesis abundant (L...,NOCLASS
28,WOS:000170973900012,Animal,Mechanisms of plant desiccation tolerance,Anhydrobiosis ('life without water') is the re...,Plant
33,WOS:000244031100003,Animal,Modelling the effects of microclimate on bean ...,Bean seed storage ability is of major interest...,Plant
34,WOS:000169703200006,Animal,Changes in oligosaccharide content and antioxi...,Seeds of bean (Phaseolos vulgaris cv. Vernel) ...,Plant
39,WOS:000087110100001,NOCLASS,Dehydration in dormant insects,Many of the mechanisms used by active insects ...,Animal
...,...,...,...,...,...
5598,WOS:000296681200006,NOCLASS,MEDIATED TREHALOSE UN-LOADING FOR REDUCED ERYT...,"Recently, high concentrations of intracellular...",Animal
5612,WOS:000341348500013,Plant,Combination of synchrotron radiation-based Fou...,Understanding the spatial heterogeneity within...,Microbe
5615,WOS:000455747900024,Fungi,Plasticity of a holobiont: desiccation induces...,The role of host-associated microbiota in endu...,Microbe
5618,WOS:000457618800009,Plant,Temporal clustering of extreme climate events ...,Research on regime shifts has focused primaril...,Microbe


## Quantify misclassifications
### Amount of each class
How many belong to each class in the automatic versus manually verified annotations?

In [60]:
comparison.study_system_original.value_counts()

study_system_original
Plant      3380
Animal     1273
Microbe     570
Fungi       237
NOCLASS     166
Name: count, dtype: int64

In [61]:
comparison.study_system_manual.value_counts()

study_system_manual
Plant      3403
Animal     1336
Microbe     562
Fungi       214
NOCLASS     111
Name: count, dtype: int64
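
To make the two outputs above easier to compare, here is a small sketch that computes the net change per class. The counts are copied by hand from the `value_counts()` outputs above; the variable names are ours.

```python
# Class counts copied from the two value_counts() outputs above.
original_counts = {"Plant": 3380, "Animal": 1273, "Microbe": 570, "Fungi": 237, "NOCLASS": 166}
manual_counts = {"Plant": 3403, "Animal": 1336, "Microbe": 562, "Fungi": 214, "NOCLASS": 111}

# Net change per class (manually verified minus automatic).
net_change = {cls: manual_counts[cls] - original_counts[cls] for cls in original_counts}

for cls, delta in net_change.items():
    print(f"{cls}: {delta:+d}")

# Both columns describe the same set of papers, so the totals must match.
assert sum(original_counts.values()) == sum(manual_counts.values())
```

Net changes can hide corrections that offset one another (a paper moved into a class cancels one moved out), so they understate the number of misclassified papers. That is why the row-level comparison is used for the performance numbers.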

### Calculating performance
To get a general idea of how we performed, we'll ignore specific classes and just check correct versus incorrect.

In [62]:
comparison['correct'] = comparison['study_system_original'] == comparison['study_system_manual']

In [63]:
accuracy = comparison['correct'].mean()  # fraction of True values in the boolean column
print(f'Overall classification accuracy was {accuracy*100:.2f}%')

Overall classification accuracy was 89.37%


We can look at an overall F1 score:

In [64]:
f1_overall = f1_score(comparison['study_system_manual'], comparison['study_system_original'], average='weighted') # weighted accounts for class imbalance
print(f'Overall F1 score is {f1_overall:.2f}')

Overall F1 score is 0.90


As well as F1 for each class:

In [65]:
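# Pass the labels explicitly so each score is guaranteed to line up with its class name
class_labels = sorted(set(comparison['study_system_manual']) | set(comparison['study_system_original']))
f1_by_class = f1_score(comparison['study_system_manual'], comparison['study_system_original'], labels=class_labels, average=None)
f1_by_class = dict(zip(class_labels, f1_by_class))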
for n, f in f1_by_class.items():
    print(f'F1 score for class {n} is {f:.2f}')

F1 score for class Animal is 0.87
F1 score for class Fungi is 0.75
F1 score for class Microbe is 0.89
F1 score for class NOCLASS is 0.30
F1 score for class Plant is 0.94
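
For readers who want to see what these scores measure, here is a minimal pure-Python sketch of per-class F1 and its support-weighted average, treating the manual labels as ground truth. The labels below are hypothetical toy data, and `per_class_f1` is our own helper, not part of scikit-learn.

```python
from collections import Counter

# Hypothetical gold (manual) and predicted (automatic) labels.
y_true = ["Plant", "Plant", "Plant", "Animal", "Animal", "Fungi"]
y_pred = ["Plant", "Plant", "Animal", "Animal", "Animal", "Plant"]


def per_class_f1(y_true, y_pred):
    """Return {label: F1}, treating each label as the positive class in turn."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)  # true positives
        fp = sum(t != label and p == label for t, p in pairs)  # false positives
        fn = sum(t == label and p != label for t, p in pairs)  # false negatives
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


f1 = per_class_f1(y_true, y_pred)

# Weighted average: each class's F1 is weighted by its number of true instances.
support = Counter(y_true)
weighted_f1 = sum(f1[c] * support[c] for c in f1) / len(y_true)

for label, score in f1.items():
    print(f"{label}: {score:.2f}")
print(f"Weighted: {weighted_f1:.2f}")
```

A class that the classifier never predicts correctly, like `Fungi` in this toy example, gets an F1 of 0 and pulls the weighted average down in proportion to its size.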


## Updating graph
We want to both update the classifications and remove the nodes that are true NOCLASS.

In [66]:
to_update = comparison.set_index('UID')['study_system_manual'].to_dict()

In [67]:
nx.set_node_attributes(debugged_graph, to_update, name='study_system')

In [70]:
to_drop = [n for n, c in to_update.items() if c == 'NOCLASS']
len(to_drop)

111

In [71]:
debugged_graph.remove_nodes_from(to_drop)
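
The update-and-drop steps above can be tried out on a toy graph. This sketch uses hypothetical nodes and corrections; it relies on the documented networkx behavior that `set_node_attributes` and `remove_nodes_from` silently skip keys that are not nodes in the graph, so it records those keys first.

```python
import networkx as nx

# Toy citation graph with automatic labels (hypothetical nodes).
G = nx.DiGraph()
G.add_nodes_from([
    ("A", {"study_system": "Plant"}),
    ("B", {"study_system": "Animal"}),
    ("C", {"study_system": "Plant"}),
])
G.add_edges_from([("A", "B"), ("B", "C")])

# Corrected labels; "Z" is deliberately not a node in the graph.
corrections = {"A": "Plant", "B": "NOCLASS", "C": "Microbe", "Z": "Fungi"}

# Record corrected UIDs that the graph does not contain, since the
# two calls below will not report them.
missing = sorted(set(corrections) - set(G.nodes))

# Overwrite the labels, then drop every node corrected to NOCLASS.
nx.set_node_attributes(G, corrections, name="study_system")
G.remove_nodes_from([n for n, c in corrections.items() if c == "NOCLASS"])

labels = nx.get_node_attributes(G, "study_system")
print(missing)
print(labels)
```

Removing a node also removes its edges, so dropping `NOCLASS` papers can leave other nodes without any remaining citations.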

Double-check that we updated and dropped correctly by counting the nodes in each class:

In [72]:
verification = defaultdict(int)
for node, attrs in debugged_graph.nodes(data=True):
    verification[attrs['study_system']] += 1
verification

defaultdict(int, {'Plant': 3181, 'Animal': 1221, 'Microbe': 519, 'Fungi': 198})

Looks good, no `NOCLASS` nodes remain. Now write out:

In [74]:
nx.write_graphml(debugged_graph, '../data/citation_network/FINAL_FINAL_DEBUGGED_MANUALLY_VERIFIED_core_collection_destol_or_anhydro_classified_network_29Mar2024.graphml')