# Data Analysis: Manual Codes vs. Model Classifications

Compare the manual coding of descriptions for *Omission* and *Stereotype* with the predictions of two models: the Baseline Omission and Stereotype Classifier (Baseline OSC) and the Omission and Stereotype Classifier with linguistic features (LCOSC).

***

**Table of Contents**

I. [Prepare Data for Manual Review, part 2](#i)

II. [Analyze Agreement, part 1](#ii)

   - [Manual vs. Baseline OSC](#ii-i)
   - [Manual vs. LCOSC](#ii-ii)
   - [Baseline OSC vs. LCOSC](#ii-iii)

***

In [None]:
# For custom variables and functions
import clf_utils
import config

# For data analysis
import pandas as pd
import numpy as np

# For reading and writing files and directories
import os
from pathlib import Path
import joblib
from joblib import load

# For evaluation of classification/coding
import sklearn.metrics
from sklearn.metrics import confusion_matrix, multilabel_confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import precision_recall_fscore_support

Read manually coded and classified data.

In [2]:
dir = config.coded_and_classified
f = "manually_coded_baselineosc_lcosc.csv"
df = pd.read_csv(dir+f)
df.head(2)

Unnamed: 0,description_id,token_id,index,doc,linguistic_prediction,gender_bias_manual,omission_manual,stereotype_manual,type,note,eadid,rowid,field,baseline_prediction,lcosc_prediction
0,11452,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...",11452,Drafts and meeting notes relating to the creat...,['O'],n,n,n,,,CHE,CHE/01/01,scopecontent,(),()
1,11467,"[16, 17, 15]",11467,CHE Supporters Review,['O'],n,n,n,,,CHE,CHE/02/04,unittitle,(),()


Clean the model predictions' values.

In [3]:
# Convert stringified tuples such as "('Omission',)" into lists of labels
df["baseline_prediction"] = df["baseline_prediction"].apply(lambda x: x.strip("()"))
df["baseline_prediction"] = df["baseline_prediction"].apply(lambda x: x.replace("'", ""))
df["baseline_prediction"] = df["baseline_prediction"].apply(lambda x: x.strip(","))
# An empty string means no labels; "".split(", ") would otherwise give [""]
df["baseline_prediction"] = df["baseline_prediction"].apply(lambda x: x.split(", ") if x else [])
df.baseline_prediction.value_counts()

[]                        12168
[Omission]                  164
[Stereotype]                  3
[Omission, Stereotype]        2
Name: baseline_prediction, dtype: int64
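
The cleaning step above turns stringified tuples into lists by stripping characters. As an alternative, here is a minimal sketch that parses the same strings with `ast.literal_eval` from the standard library. The helper name `parse_prediction` and the toy values are assumptions for illustration, not part of the notebook's data or of `clf_utils`.

```python
import ast

import pandas as pd


def parse_prediction(value: str) -> list:
    """Turn a stringified tuple such as "('Omission',)" into a list of labels."""
    return list(ast.literal_eval(value))


# Toy values in the same shape as the CSV's prediction columns (illustrative only)
toy = pd.Series(["()", "('Omission',)", "('Omission', 'Stereotype')"])
parsed = toy.apply(parse_prediction)
print(parsed.tolist())
```

With this approach an empty tuple becomes an empty list, whereas plain `"".split(", ")` returns a list holding one empty string.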

In [4]:
# Convert stringified tuples such as "('Omission',)" into lists of labels
df["lcosc_prediction"] = df["lcosc_prediction"].apply(lambda x: x.strip("()"))
df["lcosc_prediction"] = df["lcosc_prediction"].apply(lambda x: x.replace("'", ""))
df["lcosc_prediction"] = df["lcosc_prediction"].apply(lambda x: x.strip(","))
# An empty string means no labels; "".split(", ") would otherwise give [""]
df["lcosc_prediction"] = df["lcosc_prediction"].apply(lambda x: x.split(", ") if x else [])
df.lcosc_prediction.value_counts()

[]                        12040
[Omission]                  291
[Omission, Stereotype]        3
[Stereotype]                  3
Name: lcosc_prediction, dtype: int64

In [5]:
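# Normalize spelling and label order in the manual codes
df["type"] = df["type"].replace("Sterotype", "Stereotype")
df["type"] = df["type"].replace("Stereotype and Omission", "Omission, Stereotype")
# Uncoded rows become empty lists rather than lists holding an empty string
df["type"] = df["type"].fillna("")
df["type"] = df["type"].apply(lambda x: x.split(", ") if x else [])
df.type.value_counts()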

[]                        12287
[Omission]                   45
[Stereotype]                  3
[Omission, Stereotype]        2
Name: type, dtype: int64

<a id="i"></a>
### I. Prepare Data for Manual Review, part 2

In [7]:
subdf = df.drop(columns=["token_id", "index", "linguistic_prediction", "gender_bias_manual", "omission_manual", "stereotype_manual"])
subdf = subdf.rename(columns={"type":"first_manual_code", "note":"first_note"})
subdf = subdf[[
    "doc", "baseline_prediction", "lcosc_prediction", "first_manual_code", "first_note",
      "eadid", "rowid", "field", "description_id"
    ]]
subdf.insert(1, "second_manual_code", np.nan)
subdf.insert(2, "second_note", np.nan)
subdf.tail(2)

Unnamed: 0,doc,second_manual_code,second_note,baseline_prediction,lcosc_prediction,first_manual_code,first_note,eadid,rowid,field,description_id
12335,“Thomas Sharp – an appreciation” by Lewis Keeble.,,,,,,,THS,THS 56.2,unittitle,32934
12336,“Thomas Sharp – an appreciation” by Lewis Keeb...,,,,,,,THS,THS 56.2,scopecontent,32933


In [8]:
subdf.head()

Unnamed: 0,doc,second_manual_code,second_note,baseline_prediction,lcosc_prediction,first_manual_code,first_note,eadid,rowid,field,description_id
0,Drafts and meeting notes relating to the creat...,,,,,,,CHE,CHE/01/01,scopecontent,11452
1,CHE Supporters Review,,,,,,,CHE,CHE/02/04,unittitle,11467
2,Friend Newcastle Annual Reports,,,,,,,CHE,CHE/02/06,unittitle,11471
3,Collection of documents on sex education in sc...,,,,,,,CHE,CHE/03/06/12,unittitle,11554
4,"Letters, newsletters and leaflets on the topic...",,,,,,,CHE,CHE/03/06/13,unittitle,11555


In [13]:
subdf.to_csv(dir+"manually_coded_and_classified.csv")

In [14]:
new_dir1 = "data/manually_coded_part2/group1/"
new_dir2 = "data/manually_coded_part2/group2/"
Path(new_dir1).mkdir(parents=True, exist_ok=True)
Path(new_dir2).mkdir(parents=True, exist_ok=True)

# eadids1 = ["BP", "HL", "THS"]
# eadids2 = ["BXB", "CHE", "OBR", "SH", "SW", "WCT"]
# subdf1 = subdf[subdf["eadid"].isin(eadids1)]
# subdf2 = subdf[subdf["eadid"].isin(eadids2)]
subdf3 = subdf[subdf["eadid"] == "CPT"]

# subdf1.to_csv(new_dir1+"manually_coded_and_classified.csv")
# subdf2.to_csv(new_dir2+"manually_coded_and_classified.csv")
subdf3.to_csv(new_dir2+"manually_coded_and_classified_CPT.csv")

<a id="ii"></a>
### II. Analyze Agreement, part 1
##### Compare manual codes to model classifications

Standardize the manually coded and classified columns and values: create the columns `omission_baseline`, `stereotype_baseline`, `omission_lcosc`, and `stereotype_lcosc`, marking the presence of a code with `'y'` and its absence with `'n'`.

In [6]:
cols = ["baseline_prediction", "lcosc_prediction"]
for col in cols:
    df_col = list(df[col])
    omission_clfs = ["y" if "Omission" in pred else "n" for pred in df_col]
    stereotype_clfs = ["y" if "Stereotype" in pred else "n" for pred in df_col]
    if "baseline" in col:
        suffix = "_baseline"
    else:
        suffix = "_lcosc"
    df.insert(column="omission"+suffix, value=omission_clfs, loc=(len(df.columns)))
    df.insert(column="stereotype"+suffix, value=stereotype_clfs, loc=(len(df.columns)))
df.head(2)

Unnamed: 0,description_id,token_id,index,doc,linguistic_prediction,gender_bias_manual,omission_manual,stereotype_manual,type,note,eadid,rowid,field,baseline_prediction,lcosc_prediction,omission_baseline,stereotype_baseline,omission_lcosc,stereotype_lcosc
0,11452,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...",11452,Drafts and meeting notes relating to the creat...,['O'],n,n,n,[],,CHE,CHE/01/01,scopecontent,[],[],n,n,n,n
1,11467,"[16, 17, 15]",11467,CHE Supporters Review,['O'],n,n,n,[],,CHE,CHE/02/04,unittitle,[],[],n,n,n,n
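
To make the `'y'`/`'n'` standardization concrete, here is a small sketch on a toy DataFrame. The rows are invented for illustration; only the column names follow the notebook.

```python
import pandas as pd

# Toy predictions (illustrative only), already parsed into lists of labels
toy = pd.DataFrame({
    "baseline_prediction": [[], ["Omission"], ["Omission", "Stereotype"]],
    "lcosc_prediction": [["Stereotype"], [], ["Omission"]],
})

column_suffixes = [("baseline_prediction", "_baseline"), ("lcosc_prediction", "_lcosc")]
for col, suffix in column_suffixes:
    for label in ["Omission", "Stereotype"]:
        # 'y' when the label appears in the row's prediction list, otherwise 'n'
        toy[label.lower() + suffix] = toy[col].apply(
            lambda labels, label=label: "y" if label in labels else "n"
        )

print(toy[["omission_baseline", "stereotype_baseline", "omission_lcosc", "stereotype_lcosc"]])
```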


In [7]:
# Load the saved label binarizer for the Omission and Stereotype targets
mlb = joblib.load("models/transform_labels/mlb_targets_os.joblib")

In [61]:
y_manual = mlb.transform(df["type"])
y_baseline = mlb.transform(df["baseline_prediction"])
y_lcosc = mlb.transform(df["lcosc_prediction"])
# print(y_baseline[190:200])
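
The `transform` calls above turn each list of labels into a row of 0/1 indicators, one column per class. The sketch below shows the idea with a freshly built `MultiLabelBinarizer`. It assumes the saved `mlb` is a scikit-learn `MultiLabelBinarizer` with the classes `Omission` and `Stereotype` in that order, which matches the label order in the agreement tables but is not confirmed by the saved file itself.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Assumption: stand-in for the saved binarizer, with classes fixed in this order
toy_mlb = MultiLabelBinarizer(classes=["Omission", "Stereotype"])
toy_mlb.fit([[]])

# Each inner list is one description's set of labels (illustrative values)
y_toy = toy_mlb.transform([[], ["Omission"], ["Stereotype"], ["Omission", "Stereotype"]])
print(list(toy_mlb.classes_))
print(y_toy)
```

Column 0 then corresponds to *Omission* and column 1 to *Stereotype*, which is why the later cells pass `labels=[0,1]`.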



<a id="ii-i"></a>
##### i. Agreement: Manual vs. Baseline OSC

In [62]:
matrix = multilabel_confusion_matrix(y_manual, y_baseline, labels=[0,1])
print(matrix)

[[[12134   156]
  [   37    10]]

 [[12327     5]
  [    5     0]]]
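
For reading the output above: `multilabel_confusion_matrix` returns one 2x2 matrix per label, laid out as `[[tn, fp], [fn, tp]]`. The toy arrays in this sketch are invented to show that layout and are unrelated to the notebook's data.

```python
import numpy as np
from sklearn.metrics import multilabel_confusion_matrix

# Toy indicator arrays; columns are [Omission, Stereotype] (illustrative only)
y_true = np.array([[1, 0], [0, 0], [1, 1], [0, 0]])
y_pred = np.array([[1, 0], [1, 0], [0, 1], [0, 0]])

toy_matrix = multilabel_confusion_matrix(y_true, y_pred)

# Flatten each 2x2 matrix into (tn, fp, fn, tp)
for name, (tn, fp, fn, tp) in zip(["Omission", "Stereotype"], toy_matrix.reshape(-1, 4)):
    print(name, "tn:", tn, "fp:", fp, "fn:", fn, "tp:", tp)
```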


In [63]:
tn = matrix[:, 0, 0]  # True negatives
fn = matrix[:, 1, 0]  # False negatives
tp = matrix[:, 1, 1]  # True positives
fp = matrix[:, 0, 1]  # False positives
class_names = list(mlb.classes_)

[precision, recall, f_1, support] = precision_recall_fscore_support(
    y_manual, y_baseline, beta=1.0, zero_division=0, labels=[0,1]
)

baseline_agmt_df = pd.DataFrame({
    "labels":class_names, "true_neg":tn, "false_neg":fn, "true_pos":tp, "false_pos":fp,
    "precision":precision, "recall":recall, "f_1":f_1
})
baseline_agmt_df

Unnamed: 0,labels,true_neg,false_neg,true_pos,false_pos,precision,recall,f_1
0,Omission,12134,37,10,156,0.060241,0.212766,0.093897
1,Stereotype,12327,5,0,5,0.0,0.0,0.0


In [64]:
print("Macro scores:")
print(baseline_agmt_df[["precision", "recall", "f_1"]].mean())

Macro scores:
precision    0.030120
recall       0.106383
f_1          0.046948
dtype: float64


In [65]:
precision, recall, F1 = clf_utils.precisionRecallF1(sum(tp), sum(fp), sum(fn))
print("Micro scores:")
print("precision \t", precision)
print("recall \t\t", recall)
print("f_1 \t\t", F1)

Micro scores:
precision 	 0.05847953216374269
recall 		 0.19230769230769232
f_1 		 0.08968609865470853


<a id="ii-ii"></a>
##### ii. Agreement: Manual vs. LCOSC

In [66]:
matrix = multilabel_confusion_matrix(y_manual, y_lcosc, labels=[0,1])
print(matrix)

[[[12012   278]
  [   31    16]]

 [[12326     6]
  [    5     0]]]


In [67]:
tn = matrix[:, 0, 0]  # True negatives
fn = matrix[:, 1, 0]  # False negatives
tp = matrix[:, 1, 1]  # True positives
fp = matrix[:, 0, 1]  # False positives
class_names = list(mlb.classes_)

[precision, recall, f_1, support] = precision_recall_fscore_support(
    y_manual, y_lcosc, beta=1.0, zero_division=0, labels=[0,1]
)

lcosc_agmt_df = pd.DataFrame({
    "labels":class_names, "true_neg":tn, "false_neg":fn, "true_pos":tp, "false_pos":fp,
    "precision":precision, "recall":recall, "f_1":f_1
})
lcosc_agmt_df

Unnamed: 0,labels,true_neg,false_neg,true_pos,false_pos,precision,recall,f_1
0,Omission,12012,31,16,278,0.054422,0.340426,0.093842
1,Stereotype,12326,5,0,6,0.0,0.0,0.0


In [68]:
print("Macro scores:")
print(lcosc_agmt_df[["precision", "recall", "f_1"]].mean())

Macro scores:
precision    0.027211
recall       0.170213
f_1          0.046921
dtype: float64


In [69]:
precision, recall, F1 = clf_utils.precisionRecallF1(sum(tp), sum(fp), sum(fn))
print("Micro scores:")
print("precision \t", precision)
print("recall \t\t", recall)
print("f_1 \t\t", F1)

Micro scores:
precision 	 0.05333333333333334
recall 		 0.3076923076923077
f_1 		 0.09090909090909093


<a id="ii-iii"></a>
##### iii. Agreement: Baseline OSC vs. LCOSC

In [70]:
matrix = multilabel_confusion_matrix(y_baseline, y_lcosc, labels=[0,1])
print(matrix)

[[[12005   166]
  [   38   128]]

 [[12331     1]
  [    0     5]]]


In [71]:
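# Baseline OSC is treated as the reference (y_true) and LCOSC as the prediction
tn = matrix[:, 0, 0]  # Neither model assigns the label
fn = matrix[:, 1, 0]  # Only Baseline OSC assigns the label
tp = matrix[:, 1, 1]  # Both models assign the label
fp = matrix[:, 0, 1]  # Only LCOSC assigns the label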
class_names = list(mlb.classes_)

[precision, recall, f_1, support] = precision_recall_fscore_support(
    y_baseline, y_lcosc, beta=1.0, zero_division=0, labels=[0,1]
)

osc_agmt_df = pd.DataFrame({
    "labels":class_names, "true_neg":tn, "false_neg":fn, "true_pos":tp, "false_pos":fp,
    "precision":precision, "recall":recall, "f_1":f_1
})
osc_agmt_df

Unnamed: 0,labels,true_neg,false_neg,true_pos,false_pos,precision,recall,f_1
0,Omission,12005,38,128,166,0.435374,0.771084,0.556522
1,Stereotype,12331,0,5,1,0.833333,1.0,0.909091


In [72]:
print("Macro scores:")
print(osc_agmt_df[["precision", "recall", "f_1"]].mean())

Macro scores:
precision    0.634354
recall       0.885542
f_1          0.732806
dtype: float64


In [73]:
precision, recall, F1 = clf_utils.precisionRecallF1(sum(tp), sum(fp), sum(fn))
print("Micro scores:")
print("precision \t", precision)
print("recall \t\t", recall)
print("f_1 \t\t", F1)

Micro scores:
precision 	 0.44333333333333336
recall 		 0.7777777777777778
f_1 		 0.5647558386411891
