classification-analysis

This repository contains the data and analysis scripts required to reproduce the results shown in "Data Efficiency of Classification Strategies for Chemical and Materials Design."

The data behind each figure can be regenerated by running the script of the same name (e.g., fig3.py regenerates the data for Figure 3). The output is stored in the data/ directory as a pickle file. To plot a figure, use plotter.py; for example, python plotter.py --fig 3 plots Figure 3 and saves it in the figures/ directory. This separate plotting step is not needed for Figures 1c and 2, whose scripts (fig1c.py and fig2.py) generate the figure directly.

We also include prep_metafeatures.py, which downselects the set of unique and uncorrelated metafeatures (stored in metafeatures.pickle) used in the sequential feature addition conducted in fig8.py.
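The two-step workflow above (generate data, then plot) can be sketched as a small helper that lists the commands for a given figure. This is only an illustration: the helper name commands_for_figure is hypothetical and not part of the repository, and it assumes the scripts are run from the repository root with python on the path.

```python
# Figures whose scripts draw the figure themselves, per the description above.
SELF_PLOTTING = {"1c", "2"}


def commands_for_figure(fig):
    """Return the commands that reproduce one figure (hypothetical helper)."""
    fig = str(fig)
    # Step 1: regenerate the data with the script named after the figure.
    commands = [["python", f"fig{fig}.py"]]
    # Step 2: plot with plotter.py, unless the script already plots.
    if fig not in SELF_PLOTTING:
        commands.append(["python", "plotter.py", "--fig", fig])
    return commands


for cmd in commands_for_figure(3):
    print(" ".join(cmd))
```

Each returned list could be passed to subprocess.run(cmd, check=True) if you prefer to drive the scripts from Python rather than from the shell.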

Data Files

results.pickle is a dictionary that contains the performance of every seed of every classification strategy on every task. The dictionary is nested by task, algorithm type, sampler, model, and seed. For example, results['qm9_cv']['al']['medoids']['nn'][24] contains a numpy array of the performance of seed 24 of an active learning algorithm with a medoids sampler and a neural network model applied to the classification of heat capacities in QM9. The array is 2D with shape (11, 4): the first axis corresponds to rounds of active learning, and the second axis holds the round number, balanced accuracy, Macro F₁, and Matthews correlation coefficient, in that order.
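As a minimal sketch of how this nesting might be navigated, the snippet below builds a synthetic stand-in with the documented layout and unpacks one run into its four columns. The numeric values are made up for illustration, and the assumption that round numbers start at 0 is ours, not stated by the repository.

```python
import numpy as np

# Synthetic stand-in for results.pickle (values are made up), mirroring the
# documented nesting: task -> algorithm type -> sampler -> model -> seed.
rng = np.random.default_rng(0)
fake_run = np.column_stack([
    np.arange(11),              # round number (assumed to start at 0)
    rng.uniform(0.5, 1.0, 11),  # balanced accuracy
    rng.uniform(0.5, 1.0, 11),  # Macro F1
    rng.uniform(0.0, 1.0, 11),  # Matthews correlation coefficient
])
results = {"qm9_cv": {"al": {"medoids": {"nn": {24: fake_run}}}}}

# With the real file, the dictionary would instead be loaded with:
#   import pickle
#   with open("results.pickle", "rb") as f:
#       results = pickle.load(f)

run = results["qm9_cv"]["al"]["medoids"]["nn"][24]

# Transposing lets each metric be unpacked as its own 1D array.
rounds, bal_acc, macro_f1, mcc = run.T
final_macro_f1 = macro_f1[-1]  # Macro F1 after the last round
print(run.shape)
```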

baseline/baseline.pickle is a dictionary that contains the performance of the naive strategy used to benchmark the data efficiency of classification strategies. Keys correspond to tasks, and values are numpy arrays in which the first axis is the number of acquired points and the second axis holds the mean minus standard error, the mean, and the mean plus standard error, in that order, of the Macro F₁ scores of the naive strategy. baseline/baseline.pickle is a processed version of several baseline/baseline_raw_*.pickle files, which contain the balanced accuracy, Macro F₁ score, and Matthews correlation coefficient for every seed of the naive strategy on every task.
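The summary layout described above (mean minus standard error, mean, mean plus standard error) can be illustrated with a short sketch. The layout of the raw files is not documented here, so the input below is a hypothetical array of per-seed Macro F₁ scores with shape (seeds, acquisition sizes); summarize_seeds is an illustrative name, and the use of the sample standard deviation (ddof=1) is an assumption rather than the repository's confirmed choice.

```python
import numpy as np


def summarize_seeds(scores):
    """Collapse per-seed scores into rows of (mean - SE, mean, mean + SE).

    scores: array of shape (n_seeds, n_points), hypothetical layout.
    Returns an array of shape (n_points, 3).
    """
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean(axis=0)
    # Standard error of the mean; ddof=1 (sample std) is an assumption.
    se = scores.std(axis=0, ddof=1) / np.sqrt(scores.shape[0])
    return np.column_stack([mean - se, mean, mean + se])


# Hypothetical Macro F1 scores: 3 seeds (rows) x 2 acquisition sizes (columns).
raw = np.array([
    [0.50, 0.75],
    [0.25, 0.75],
    [0.75, 0.75],
])
summary = summarize_seeds(raw)
print(summary.shape)
```

Each row of summary then matches the three-column layout of the arrays stored in baseline/baseline.pickle, with the mean in the middle column.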

Package Dependencies

The packages required by the analysis scripts above can be installed from the included requirements.txt file. prep_metafeatures.py also uses code from the ClassificationSuite Python package, which can be installed by following the instructions in the classification-suite repository.

