# Classification correction
Our classification algorithm makes a substantial number of incorrect predictions (roughly 15% of Plant and Fungi classifications were incorrect, and about 15% of papers overall were left without a classification), but the number of papers is small enough that we could manually verify and adjust the classifications. Here, we'll quantify the discrepancies and merge the corrected annotations back into the final graphml file.

In [4]:
import pandas as pd
import networkx as nx
from sklearn.metrics import f1_score
from collections import defaultdict
import jsonlines

## Read in the data

In [3]:
debugged_graph = nx.read_graphml('../data/citation_network/core_collection_destol_or_anhydro_FILTERED_classified_network_DEBUGGED_KINGS_30Jan2023.graphml')

In [6]:
with jsonlines.open('../data/wos_files/core_collection_destol_or_anhydro_FILTERED_MAIN_ONLY_with_ref_abstracts_05Jan2023.jsonl') as reader:
    papers = [obj for obj in reader]
papers_by_uid = {paper['UID']: paper for paper in papers}

In [7]:
# Generate csv for manual classification
to_check = defaultdict(list)
for node, attrs in debugged_graph.nodes(data=True):
    to_check['UID'].append(node)
    to_check['title'].append(papers_by_uid[node]['title'])
    to_check['abstract'].append(papers_by_uid[node]['abstract'])
    to_check['study_system'].append(attrs['study_system'])
to_check_df = pd.DataFrame(to_check)
to_check_df.head()

Unnamed: 0,UID,title,abstract,study_system
0,WOS:A1990ET59600010,RESPONSE OF 4 SORGHUM LINES TO MID-SEASON DROU...,Four sorghum (Sorghum bicolor L. Moench) lines...,Plant
1,WOS:000244317000009,Effects of abscisic acid on growth and dehydra...,Cynanchum komarovii is well adapted to hot and...,Plant
2,WOS:000186691200001,LEAping to conclusions: A computational reanal...,Background: The late embryogenesis abundant (L...,Plant
3,WOS:000178654700046,Early salt stress effects on the changes in ch...,A technique based on Fourier transform infrare...,Plant
4,WOS:A1997XH01700004,Approaches to elucidate the basis of desiccati...,"Plants undergo a series of physiological, bioc...",Plant


In [8]:
to_check_df.to_csv('../data/citation_network/DEBUGGED_automatic_classifications_core_collection_destol_or_anhydro.csv', index=False)

In [58]:
graph = nx.read_graphml('../data/citation_network/core_collection_destol_or_anhydro_FILTERED_classified_network_06Jan2023.graphml')

In [40]:
original = pd.read_csv('../data/citation_network/core_collection_classified_mains_only_with_abstracts.csv')
original.head()

Unnamed: 0,UID,Id,study_system,title,abstract
0,WOS:000071070200022,WOS:000071070200022,Plant,Endophyte effect on drought tolerance in diver...,Tall fescue (Festuca arundinacea Schreb.) drou...
1,WOS:000071120900007,WOS:000071120900007,Plant,Seasonal variations in tolerance to ion leakag...,A simple ion leakage assay was used to test if...
2,WOS:000071209300053,WOS:000071209300053,NOCLASS,Regulation of body water balance in reedfrogs ...,The regulation of body water balance was exami...
3,WOS:000071582300006,WOS:000071582300006,NOCLASS,Community structure and environmental stress: ...,"In a previous field experiment, communities of..."
4,WOS:000071672300004,WOS:000071672300004,Plant,Evaluation of field and laboratory predictors ...,"In Mediterranean regions, plant breeding progr..."


In [41]:
manual = pd.read_csv('../data/citation_network/core_collection_classified_mains_only_with_abstracts_RV.csv')
manual.head()

Unnamed: 0,UID,Id,study_system,title,abstract
0,WOS:000074822300003,WOS:000074822300003,Animal,Ultrastructural changes during desiccation of ...,Ultrastructural changes during desiccation of ...
1,WOS:000077146200008,WOS:000077146200008,Animal,Entomopathogenic nematodes for control of codl...,The susceptibility of codling moth diapausing ...
2,WOS:000078065800003,WOS:000078065800003,Animal,Nematodes and other aquatic invertebrates in E...,Bryophytes provide microhabitats for aquatic i...
3,WOS:000080234300045,WOS:000080234300045,Animal,Factors affecting long-term survival of dry bd...,Naturally dried lichens and mushrooms were col...
4,WOS:000081030800008,WOS:000081030800008,Animal,Desiccation survival of the entomopathogenic n...,The present study describes different desiccat...


In [42]:
comparison = original[['UID', 'study_system', 'title', 'abstract']].merge(manual[['UID', 'study_system']], on='UID', suffixes=('_original', '_manual'))

In [43]:
comparison.head()

Unnamed: 0,UID,study_system_original,title,abstract,study_system_manual
0,WOS:000071070200022,Plant,Endophyte effect on drought tolerance in diver...,Tall fescue (Festuca arundinacea Schreb.) drou...,Plant
1,WOS:000071120900007,Plant,Seasonal variations in tolerance to ion leakag...,A simple ion leakage assay was used to test if...,Plant
2,WOS:000071209300053,NOCLASS,Regulation of body water balance in reedfrogs ...,The regulation of body water balance was exami...,Animal
3,WOS:000071582300006,NOCLASS,Community structure and environmental stress: ...,"In a previous field experiment, communities of...",Animal
4,WOS:000071672300004,Plant,Evaluation of field and laboratory predictors ...,"In Mediterranean regions, plant breeding progr...",Plant
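
The merge above uses pandas' default inner join, so any UID present in only one of the two tables is dropped without warning. As a minimal sketch on toy data (the UIDs and the names `auto`, `manual_toy` and `checked` are hypothetical, not part of this notebook), an outer merge with `indicator=True` shows how one might check for unmatched rows before relying on the comparison:

```python
import pandas as pd

# Toy stand-ins for the automatic and manually verified tables (hypothetical UIDs)
auto = pd.DataFrame({'UID': ['A', 'B', 'C'],
                     'study_system': ['Plant', 'NOCLASS', 'Fungi']})
manual_toy = pd.DataFrame({'UID': ['A', 'B', 'D'],
                           'study_system': ['Plant', 'Animal', 'Microbe']})

# indicator=True adds a '_merge' column recording which table each row came from;
# validate='one_to_one' raises a MergeError if a UID is duplicated on either side
checked = auto.merge(manual_toy, on='UID', how='outer',
                     suffixes=('_original', '_manual'),
                     indicator=True, validate='one_to_one')

merge_counts = checked['_merge'].value_counts().to_dict()
print(merge_counts)
```

Rows marked `left_only` or `right_only` are the ones an inner merge would have silently discarded.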


## Quantify misclassifications
### Amount of each class
How many belong to each class in the automatic versus manually verified annotations?

In [44]:
comparison.study_system_original.value_counts()

study_system_original
Plant      3630
Microbe     830
NOCLASS     788
Animal      245
Fungi       133
Name: count, dtype: int64

In [45]:
comparison.study_system_manual.value_counts()

study_system_manual
Plant      3627
Animal     1176
Microbe     620
Fungi       134
NOCLASS      62
fungi         7
Name: count, dtype: int64

Some of the Fungi labels are lowercased but shouldn't be, so let's fix that quickly:

In [52]:
comparison.loc[comparison['study_system_manual'] == 'fungi', 'study_system_manual'] = 'Fungi'

In [53]:
comparison.study_system_manual.value_counts()

study_system_manual
Plant      3627
Animal     1176
Microbe     620
Fungi       141
NOCLASS      62
Name: count, dtype: int64
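
The one-off fix above handles the single inconsistency we found. If manual annotation introduced more variants (stray whitespace, different capitalization), a more general normalization could look like the sketch below. The toy labels and the names `known`, `canonical`, `cleaned` and `unmapped` are assumptions for illustration only:

```python
import pandas as pd

# Toy labels with the kinds of inconsistencies manual entry can introduce
labels = pd.Series(['Plant', 'fungi', ' Fungi', 'MICROBE', 'plnat'])

# The five class names used in this notebook, keyed by their lowercase form
known = ['Plant', 'Animal', 'Microbe', 'Fungi', 'NOCLASS']
canonical = {name.lower(): name for name in known}

# Strip whitespace, lowercase, then map back to the canonical spelling;
# anything that is not a known class becomes NaN rather than passing through
cleaned = labels.str.strip().str.lower().map(canonical)
unmapped = labels[cleaned.isna()]

print(cleaned.tolist())
print(unmapped.tolist())
```

Anything left in `unmapped` is a genuine typo that needs a human decision rather than an automatic fix.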

### Calculating performance
To get a general idea of how the classifier performed, we'll ignore specific classes and just check correct versus incorrect.

In [54]:
comparison['correct'] = comparison['study_system_original'] == comparison['study_system_manual']

In [55]:
# The mean of a boolean column is the fraction of True values, i.e. the accuracy
accuracy = comparison['correct'].mean()
print(f'Overall classification accuracy was {accuracy*100:.2f}%')

Overall classification accuracy was 71.31%
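
A single accuracy number doesn't show where the errors go. As a sketch on toy data (the rows and the names `toy`, `confusion` and `toy_accuracy` are hypothetical), a cross-tabulation of manual against automatic labels gives a confusion table in which off-diagonal cells count each kind of mistake:

```python
import pandas as pd

# Toy comparison table using the same column names as above
toy = pd.DataFrame({
    'study_system_manual':   ['Plant', 'Plant', 'Animal', 'Animal', 'Animal', 'Fungi'],
    'study_system_original': ['Plant', 'Plant', 'NOCLASS', 'Animal', 'NOCLASS', 'Plant'],
})

# Rows are the manually verified class, columns are the automatic prediction
confusion = pd.crosstab(toy['study_system_manual'], toy['study_system_original'])
print(confusion)

# Fraction of rows where the two labels agree
toy_accuracy = (toy['study_system_manual'] == toy['study_system_original']).mean()
print(f'Toy accuracy: {toy_accuracy:.2f}')
```

Applied to `comparison`, the same call would show, for example, how many manually verified Animal papers had been left as NOCLASS.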


We can look at an overall F1 score:

In [56]:
f1_overall = f1_score(comparison['study_system_manual'], comparison['study_system_original'], average='weighted') # weighted accounts for class imbalance
print(f'Overall F1 score is {f1_overall:.2f}')

Overall F1 score is 0.72


As well as F1 for each class:

In [57]:
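# Pass the labels explicitly so each score is paired with the right class name
class_labels = sorted(set(comparison['study_system_manual']) | set(comparison['study_system_original']))
f1_by_class = f1_score(comparison['study_system_manual'], comparison['study_system_original'], labels=class_labels, average=None)
f1_by_class = dict(zip(class_labels, f1_by_class))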
for n, f in f1_by_class.items():
    print(f'F1 score for class {n} is {f:.2f}')

F1 score for class Animal is 0.34
F1 score for class Fungi is 0.31
F1 score for class Microbe is 0.83
F1 score for class NOCLASS is 0.12
F1 score for class Plant is 0.85
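
To make the per-class numbers easier to interpret, here is a small pure-Python sketch of how F1 for one class follows from true positives, false positives and false negatives. The helper `f1_for_class` and the toy labels are illustrative assumptions, not the scikit-learn implementation, and returning 0.0 when there are no true positives is a convention chosen for this sketch:

```python
def f1_for_class(y_true, y_pred, label):
    """F1 for a single class, treating every other class as negative."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        # Precision and/or recall is zero or undefined; report 0.0 by convention
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy case: two of three true Animal papers were left as NOCLASS
y_true = ['Animal', 'Animal', 'Animal', 'Plant', 'Plant']
y_pred = ['Animal', 'NOCLASS', 'NOCLASS', 'Plant', 'Plant']

animal_f1 = f1_for_class(y_true, y_pred, 'Animal')
plant_f1 = f1_for_class(y_true, y_pred, 'Plant')
noclass_f1 = f1_for_class(y_true, y_pred, 'NOCLASS')
print(animal_f1, plant_f1, noclass_f1)
```

In the toy case, Animal has perfect precision but recall of only one third, which pulls its F1 down to 0.5. That is the same pattern one would expect when many papers of a class are left unclassified.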


## Updating graph
We want to both update the classifications and remove nodes that are true NOCLASS.

In [60]:
to_update = comparison.set_index('UID')['study_system_manual'].to_dict()

In [63]:
nx.set_node_attributes(graph, to_update, name='study_system')

In [64]:
to_drop = [n for n, c in to_update.items() if c == 'NOCLASS']
len(to_drop)

62

In [65]:
graph.remove_nodes_from(to_drop)
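
The update-and-drop steps above can be sketched on a toy graph to show their side effects. The node names and the variables `toy_graph`, `toy_update` and `toy_drop` are hypothetical. Note that removing a node also removes every edge attached to it:

```python
import networkx as nx

# Toy citation graph: three papers citing each other in a cycle
toy_graph = nx.DiGraph()
toy_graph.add_nodes_from([
    ('A', {'study_system': 'Plant'}),
    ('B', {'study_system': 'NOCLASS'}),
    ('C', {'study_system': 'NOCLASS'}),
])
toy_graph.add_edges_from([('A', 'B'), ('B', 'C'), ('C', 'A')])

# Manual verification: B turns out to be Animal, C is a true NOCLASS
toy_update = {'A': 'Plant', 'B': 'Animal', 'C': 'NOCLASS'}
nx.set_node_attributes(toy_graph, toy_update, name='study_system')

# Drop the true NOCLASS nodes; their incident edges are removed with them
toy_drop = [n for n, c in toy_update.items() if c == 'NOCLASS']
toy_graph.remove_nodes_from(toy_drop)

print(sorted(toy_graph.nodes(data='study_system')))
print(list(toy_graph.edges()))
```

Only the A to B edge survives, because the other two edges touched the removed node C.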

Double-check that we updated and dropped correctly by verifying the class counts:

In [69]:
verification = defaultdict(int)
for node, attrs in graph.nodes(data=True):
    verification[attrs['study_system']] += 1
verification

defaultdict(int, {'Plant': 3627, 'Animal': 1176, 'Microbe': 620, 'Fungi': 141})

Looks good! Now write out:

In [70]:
nx.write_graphml(graph, '../data/citation_network/core_collection_destol_or_anhydro_FILTERED_classified_network_06Jan2023_MANUALLY_VERIFIED.graphml')