Multi-evaluation Binary Classifier #15

mjohnson541 · 2024-05-06T22:10:51Z

This adds an SIDT algorithm for Multi-Evaluation binary classification.

It also adds some smaller improvements:
- Allows plotting only to a specified depth
- Saves rules as well as nodes in postpruning
- Allows specification of an initial set of splits from the root node

An example notebook for unstable Q.OOH classification is provided.

These groups go on nodes immediately below Root.

Since classifications are binary, many of the individual nodes don't actually affect the training classification, so we can remove them, resulting in a much smaller, easier-to-analyze tree.

hwpang

Thanks for the PR!

hwpang · 2024-05-07T13:12:31Z

pysidt/plotting.py

+        if node.depth <= depth:
+            n = pydot.Node(name=name, label=name, fontname="Helvetica", fontsize="16")
+            if images:
+                img = node.group.draw("png")


Suggested change

-                img = node.group.draw("png")
+                img = node.group.draw("pdf")

This gives better-quality figures for papers and slides: the figures don't get fuzzy edges when you enlarge them. Additionally, Overleaf compiles faster if you include all your figures as PDFs.

hwpang · 2024-05-07T13:13:28Z

pysidt/plotting.py

@@ -1,27 +1,31 @@
 from IPython.display import Image, display
 import pydot
 import os
+import numpy as np


Suggested change

-import numpy as np
+import numpy as np
+from pathlib import Path

hwpang · 2024-05-07T13:15:45Z

pysidt/plotting.py

    graph.set_fontsize("10")
    if not os.path.exists("./tree"):
        os.makedirs("./tree")


Can we change all uses of os to pathlib? pathlib has been the more modern approach since Python 3.4.

Suggested change

-    graph.set_fontsize("10")
-    if not os.path.exists("./tree"):
-        os.makedirs("./tree")
+    graph.set_fontsize("10")
+    save_dir = Path("./tree")
+    save_dir.mkdir(exist_ok=True)
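
A minimal sketch of the directory-creation pattern in this suggestion, run against a temporary directory so it does not touch the repository's ./tree folder. The names `base` and `nested` are illustrative only and do not appear in the PR.

```python
import tempfile
from pathlib import Path

# Work inside a throwaway directory (illustrative; the PR uses "./tree").
base = Path(tempfile.mkdtemp())
save_dir = base / "tree"

# exist_ok=True makes the call idempotent, so no os.path.exists() guard is needed.
save_dir.mkdir(exist_ok=True)
save_dir.mkdir(exist_ok=True)  # second call is a no-op instead of raising FileExistsError

# For nested targets, parents=True mirrors os.makedirs by creating intermediate directories.
nested = base / "a" / "b"
nested.mkdir(parents=True, exist_ok=True)

print(save_dir.is_dir(), nested.is_dir())
```

Without `parents=True`, `mkdir` raises `FileNotFoundError` when an intermediate directory is missing, which is the main behavioral difference from `os.makedirs` to keep in mind.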

hwpang · 2024-05-07T13:16:54Z

pysidt/plotting.py

+                with open("./tree/" + node.name + ".png", "wb") as f:
+                    f.write(img)
+                n.set_image(os.path.abspath("./tree/" + node.name + ".png"))


Suggested change

-                with open("./tree/" + node.name + ".png", "wb") as f:
-                    f.write(img)
-                n.set_image(os.path.abspath("./tree/" + node.name + ".png"))
+                node_save_path = (save_dir / f"{node.name}.pdf").resolve()
+                with open(node_save_path, "wb") as f:
+                    f.write(img)
+                n.set_image(str(node_save_path))
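
One detail worth checking when composing paths this way: `/` binds tighter than `+`, so `save_dir / name + ".pdf"` is evaluated as `(save_dir / name) + ".pdf"`, and `Path` does not support `+` with a string. Below is a small sketch of a form that builds the file name first, with a hypothetical `node_name` standing in for `node.name` and a temporary directory standing in for `./tree`.

```python
import tempfile
from pathlib import Path

save_dir = Path(tempfile.mkdtemp())  # stand-in for Path("./tree")
node_name = "Root"                   # hypothetical stand-in for node.name

# Path / str + str fails: the division happens first, then Path + str is unsupported.
try:
    bad = save_dir / node_name + ".pdf"
except TypeError:
    bad = None

# Building the file name before joining avoids the problem.
node_save_path = (save_dir / f"{node_name}.pdf").resolve()

# Path.write_bytes is equivalent to open(..., "wb") followed by write().
node_save_path.write_bytes(b"%PDF-placeholder")

# APIs that expect a plain string path can be given str(node_save_path).
print(str(node_save_path))
```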

hwpang · 2024-05-07T13:17:30Z

pysidt/plotting.py

    graph.write_dot("./tree/tree.dot")
    graph.write_png("./tree/tree.png")


Suggested change

-    graph.write_dot("./tree/tree.dot")
-    graph.write_png("./tree/tree.png")
+    graph.write_dot(save_dir / "tree.dot")
+    graph.write_png(save_dir / "tree.png")

hwpang · 2024-05-07T13:28:39Z

pysidt/sidt.py

+            sidt_val_values = [self.evaluate(d.mol) for d in self.validation_set]
+            true_val_values = [d.value for d in self.validation_set]
+
+        P,N,PP,PN,TP,FN,FP,TN = analyze_binary_classification(sidt_train_values,true_train_values)


I'm pretty sure there are metrics functions for computing accuracy, recall, precision, etc. that we can just import from sklearn to make the code cleaner. Can you do that?

Suggested change

-        P,N,PP,PN,TP,FN,FP,TN = analyze_binary_classification(sidt_train_values,true_train_values)
+        P, N, PP, PN, TP, FN, FP, TN = analyze_binary_classification(sidt_train_values, true_train_values)
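
For reference, a dependency-free sketch of the eight counts being unpacked here, assuming the names follow the usual confusion-matrix convention (P/N for actual positives and negatives, PP/PN for predicted positives and negatives, then TP, FN, FP, TN). The function name `binary_counts` and the sample data are hypothetical, and this is not necessarily how `analyze_binary_classification` is implemented. `sklearn.metrics` provides `confusion_matrix`, `accuracy_score`, `precision_score`, and `recall_score` for the same purpose.

```python
def binary_counts(predicted, actual):
    """Return (P, N, PP, PN, TP, FN, FP, TN) for two equal-length boolean sequences."""
    pairs = list(zip(predicted, actual))
    TP = sum(1 for p, a in pairs if p and a)          # predicted True, actually True
    FP = sum(1 for p, a in pairs if p and not a)      # predicted True, actually False
    FN = sum(1 for p, a in pairs if not p and a)      # predicted False, actually True
    TN = sum(1 for p, a in pairs if not p and not a)  # predicted False, actually False
    P, N = TP + FN, FP + TN    # actual positives / negatives
    PP, PN = TP + FP, FN + TN  # predicted positives / negatives
    return P, N, PP, PN, TP, FN, FP, TN


# Hypothetical sample data, for illustration only.
predicted = [True, True, False, False, True]
actual = [True, False, False, True, True]
P, N, PP, PN, TP, FN, FP, TN = binary_counts(predicted, actual)

# Derived metrics, guarded against empty denominators.
accuracy = (TP + TN) / (P + N)
precision = TP / PP if PP else float("nan")
recall = TP / P if P else float("nan")
print(accuracy, precision, recall)
```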

hwpang · 2024-05-07T13:28:58Z

pysidt/sidt.py

+                    }
+                self.best_rule_map = {name:self.nodes[name].rule for name in self.best_tree_nodes}
+
+        logging.info("# nodes: {}".format(len(self.nodes)))


Suggested change

-        logging.info("# nodes: {}".format(len(self.nodes)))
+        logging.info(f"# nodes: {len(self.nodes)}")

hwpang · 2024-05-07T13:29:11Z

pysidt/sidt.py

+        2) merges nodes with their parents if they do not result in different predictions
+        """
+
+        self.datum_truth_map = {datum:[getattr(n,"rule") for n in self.mol_node_maps[datum]["nodes"]] for datum in self.datums}


Suggested change

-        self.datum_truth_map = {datum:[getattr(n,"rule") for n in self.mol_node_maps[datum]["nodes"]] for datum in self.datums}
+        self.datum_truth_map = {datum: [getattr(n,"rule") for n in self.mol_node_maps[datum]["nodes"]] for datum in self.datums}

hwpang · 2024-05-07T13:29:16Z

pysidt/sidt.py

+        """
+
+        self.datum_truth_map = {datum:[getattr(n,"rule") for n in self.mol_node_maps[datum]["nodes"]] for datum in self.datums}
+        self.datum_node_map = {datum:[n for n in self.mol_node_maps[datum]["nodes"]] for datum in self.datums}


Suggested change

-        self.datum_node_map = {datum:[n for n in self.mol_node_maps[datum]["nodes"]] for datum in self.datums}
+        self.datum_node_map = {datum: [n for n in self.mol_node_maps[datum]["nodes"]] for datum in self.datums}

hwpang · 2024-05-07T13:31:42Z

pysidt/sidt.py

+
+            assert len(new) == 0
+            assert len(comp) == 0
+            pnew = new_class_true/Nnew


Suggested change

-            pnew = new_class_true/Nnew
+            pnew = new_class_true / Nnew

hwpang · 2024-05-07T13:33:57Z

IPython/Unstable_QOOH_Binary_Classification.ipynb

+   "source": [
+    "data = []\n",
+    "for sm in stable_smiles:\n",
+    "    data.append(Datum(Molecule().from_smiles(sm),True))\n",


Suggested change

-    "    data.append(Datum(Molecule().from_smiles(sm),True))\n",
+    "    data.append(Datum(Molecule().from_smiles(sm), True))\n",

Can you run the black formatter on the notebook too?

hwpang · 2024-05-08T14:23:39Z

Can you also add a pytest for this new type of tree? I don't think it's best practice to rely only on the notebook test. The --nbmake option doesn't provide a very comprehensive report.

mjohnson541 · 2024-08-24T03:49:35Z

Replaced

mjohnson541 added 14 commits May 6, 2024 15:12

allow plotting tree to specified depth

a94c101

save the fitted rules as well as the nodes in postpruning

b726180

allow specification of a set of initial root splits

aa3ddc2

these groups go on nodes immediately below Root

add MultiEvalSubgraphIsomorphicDecisionTreeBinaryClassifier class

cbdc2ec

add function for picking nodes to expand

516d940

add function for choosing the best binary classifier extension

37d21cd

add extend_tree_from_node for binary classifier

0d4e00d

add evaluate function for binary classifier

282901e

add function for analyzing the error in binary classifier

8236d61

add generate_tree for binary classifier

ee8a8d0

add function for reducing binary classifier tree

c2870bc

Since classifications are binary, many of the individual nodes don't actually affect the training classification, so we can remove them, resulting in a much smaller, easier-to-analyze tree.

add function for computing binary classification statistics

58272c1

add example notebook for multi-eval binary classification

7b167e2

add Mult-eval binary classification notebook to pytests

814815d

mjohnson541 force-pushed the covdep_associated branch from c7ade02 to 814815d on May 6, 2024 22:13

mjohnson541 requested a review from hwpang May 6, 2024 23:37

hwpang requested changes May 7, 2024

mjohnson541 added 5 commits May 20, 2024 12:58

fix naming

b251090

add node tracing for evaluations

300566d

make logging rather than print

045b698

set initial groups that don't have defined rules to True

bdc0acc

bulk commit

3f17718

mjohnson541 closed this Aug 24, 2024


