Special characters (e.g. &) are not escaped by sklearn.tree.export_graphviz #28339

Open
domdfcoding opened this issue Feb 1, 2024 · 2 comments

@domdfcoding

Describe the bug

Exporting a decision tree where the feature_names or class_names contain special characters (particularly &, < and >) results in invalid graphviz output, as those characters have specific meanings to graphviz. Escaping them to &amp;, &lt; and &gt; results in correct output. This can of course be done by the user, but it's something I think scikit-learn should handle internally.
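
The user-side escaping described above can be sketched with Python's standard `html.escape`. The helper name `escape_names` is hypothetical and not part of scikit-learn, and `quote=False` is chosen on the assumption that only `&`, `<` and `>` need escaping, as in this report.

```python
from html import escape


def escape_names(names):
    """Return a copy of ``names`` with &, < and > replaced by HTML entities.

    Hypothetical helper for illustration only; not a scikit-learn API.
    """
    # quote=False leaves " and ' untouched, so only &, < and > are escaped.
    return [escape(name, quote=False) for name in names]


target_names = ["setosa & 123", "versicolor", "virginica"]
escaped_names = escape_names(target_names)
print(escaped_names)  # ['setosa &amp; 123', 'versicolor', 'virginica']
```

The escaped list can then be passed as `class_names=escaped_names` to `tree.export_graphviz`; the same helper could be applied to `feature_names` if those contain special characters.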

Steps/Code to Reproduce

from sklearn.datasets import load_iris
from sklearn import tree

# Fit a small classifier on the iris data
iris = load_iris()
clf = tree.DecisionTreeClassifier()
clf = clf.fit(iris.data, iris.target)

# The unescaped "&" in the first class name triggers the error
target_names = ["setosa & 123", "versicolor", "virginica"]
# target_names = ["setosa &amp; 123", "versicolor", "virginica"]  # This one works

tree.export_graphviz(
    clf,
    out_file="tree.dot",
    feature_names=iris.feature_names,
    class_names=target_names,
    filled=True,
    special_characters=True,
)

Then run graphviz

dot tree.dot -Tsvg -o tree.svg 

Expected Results

Graphviz successfully converts to SVG without error.

Actual Results

Error: not well-formed (invalid token) in line 1 
... <br/>class = setosa & 123 ...
in label of node 0
Error: not well-formed (invalid token) in line 1 
... <br/>class = setosa & 123 ...
in label of node 1

Although SVG output is written to disk it is not correct.

Versions

System:
    python: 3.8.10 (default, Nov 22 2023, 10:22:35)  [GCC 9.4.0]
executable: /home/domdf/Python/01 GitHub Repos/13 GunShotMatch/gunshotmatch-cli/venv/bin/python3
   machine: Linux-5.15.0-92-generic-x86_64-with-glibc2.29

Python dependencies:
      sklearn: 1.3.2
          pip: 23.3.2
   setuptools: 69.0.3
        numpy: 1.24.4
        scipy: 1.10.1
       Cython: None
       pandas: 2.0.3
   matplotlib: 3.7.4
       joblib: 1.3.2
threadpoolctl: 3.2.0

Built with OpenMP: True

threadpoolctl info:
       user_api: openmp
   internal_api: openmp
    num_threads: 16
         prefix: libgomp
       filepath: /home/domdf/Python/01 GitHub Repos/13 GunShotMatch/gunshotmatch-cli/venv/lib/python3.8/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
        version: None

       user_api: blas
   internal_api: openblas
    num_threads: 16
         prefix: libopenblas
       filepath: /home/domdf/Python/01 GitHub Repos/13 GunShotMatch/gunshotmatch-cli/venv/lib/python3.8/site-packages/numpy.libs/libopenblas64_p-r0-15028c96.3.21.so
        version: 0.3.21
threading_layer: pthreads
   architecture: Zen

       user_api: blas
   internal_api: openblas
    num_threads: 16
         prefix: libopenblas
       filepath: /home/domdf/Python/01 GitHub Repos/13 GunShotMatch/gunshotmatch-cli/venv/lib/python3.8/site-packages/scipy.libs/libopenblasp-r0-41284840.3.18.so
        version: 0.3.18
threading_layer: pthreads
   architecture: Zen
domdfcoding added the Bug and Needs Triage labels Feb 1, 2024
@jatindyerawadekar

Hi @domdfcoding can I contribute towards resolving this bug? I'm a first-timer interested in contributing to this project.

lesteve removed the Needs Triage label Feb 8, 2024
@lesteve
Member

lesteve commented Feb 8, 2024

So this is a bug, but at the same time I think its priority is rather low:

  • using tree.plot_tree is recommended instead of tree.export_graphviz. If there are things that you cannot do or don't like with tree.plot_tree, I would say that investing time in improving tree.plot_tree may be a more useful thing to do
  • if you really need to use graphviz output, a reasonable workaround is to escape special characters in target_names.

I am a bit worried about trying to support complicated things in the dot output. If this is a simple replacement for a few characters (&, < and >), why not. If you need to read the dot format spec for a few days and cover all the edge cases, I don't think this is worth our time.
