AC-Solver

A long-horizon, sparse-reward math environment for reinforcement learning. Official code repo for "What makes Math problems hard for reinforcement learning: A case study".

A question that puzzled mathematicians for 39 years gets resolved.

(Figures) A state might have to become much more complex before it is solved; path-length distribution of solutions for different initial states; the number of environment interactions scales almost linearly with horizon length.

Overview

This repository accompanies the paper "What Makes Math Problems Hard for Reinforcement Learning: A Case Study." It introduces the “Andrews-Curtis (AC) Environment”—a reinforcement learning environment with a unique blend of features that we consider crucial for advancing reinforcement learning research:

  • Sparse rewards,
  • Easy control over horizon length,
  • A wide range of initial state complexities, and
  • (Extremely) low computational cost to run.

One of the coolest aspects of working with this environment is that solving tough episodes means cracking math problems that have stumped humans for decades!

Using this environment, we propose modifications to existing reinforcement learning algorithms, giving them the ability to dynamically “learn how to learn.” For more details, check out our paper, and if you want a gentle introduction to the Andrews-Curtis conjecture and its environment, see What is AC?.

Usage

Option A: I wish to build on the AC Environment

To work with the AC Environment or build upon it, you can simply install the package using pip:

pip install ac_solver

This package comes equipped with two classical search algorithms (greedy search and breadth-first search) as well as an implementation of Proximal Policy Optimization (PPO). After installation, you can use the environment and agents as follows:

Initializing the environment

The ACEnv class is initialized using the ACEnvConfig class. By default, ACEnv() initializes the environment with the trivial presentation $\langle x, y | x, y \rangle$, represented in code as [1, 0, 2, 0].
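
For instance, the default environment can be constructed with no arguments:

# Defaults to the trivial presentation <x, y | x, y>, i.e. [1, 0, 2, 0].
from ac_solver import ACEnv
env = ACEnv()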

If you want to specify your own presentation, you can do so using a list of integers: $x$ and $y$ are encoded as 1 and 2, their inverses as -1 and -2, each relator is written as a word equal to the identity, and each relator is zero-padded to half the length of the list. For example, to initialize the environment with the presentation $\langle x, y | x^2 = y^3, xyx = yxy \rangle$, you can follow these steps:

# 0. Import the classes.
from ac_solver import ACEnvConfig, ACEnv

# 1. Define the presentation as a list:
presentation = [1, 1, -2, -2, -2, 0, 0, 1, 2, 1, -2, -1, -2, 0]

# 2. Initialize the ACEnvConfig with this presentation:
config = ACEnvConfig(initial_state=presentation)

# 3. Create the environment with the custom configuration:
env = ACEnv(config)

This allows you to set different initial states of the AC Environment.
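
Once the environment is initialized, you can interact with it. The snippet below is only a rough sketch: it assumes ACEnv exposes a Gym-style reset()/step() loop over a discrete set of AC moves, which is not documented in this README, so check the package source for the exact signatures and return values.

from ac_solver import ACEnvConfig, ACEnv

config = ACEnvConfig(initial_state=[1, 1, -2, -2, -2, 0, 0, 1, 2, 1, -2, -1, -2, 0])
env = ACEnv(config)

obs = env.reset()                               # assumed: returns the encoded presentation
for _ in range(100):
    action = env.action_space.sample()          # assumed: actions index the available AC moves
    obs, reward, done, info = env.step(action)  # assumed Gym-style return values
    if done:                                    # reward is sparse: episode ends on trivialization (or truncation)
        break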

Performing Classical Search

To perform greedy or breadth-first search in the neighborhood of a specified presentation, do:

# greedy
from ac_solver import greedy_search
greedy_search(presentation)

# or bfs
from ac_solver import bfs
bfs(presentation)

You may cap the size of the search by passing the max_nodes_to_explore argument to either function; by default, the search explores up to 10000 nodes. If the search succeeds in reaching a trivial state, a path of AC moves is returned to the user (see the sketch below).
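
For example, to widen the search beyond the default budget (a sketch that uses only the call pattern and the max_nodes_to_explore argument described above):

from ac_solver import greedy_search, bfs

presentation = [1, 1, -2, -2, -2, 0, 0, 1, 2, 1, -2, -1, -2, 0]

# Explore up to 50000 nodes instead of the default 10000.
result = greedy_search(presentation, max_nodes_to_explore=50000)
result = bfs(presentation, max_nodes_to_explore=50000)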

Solving the Environment with PPO

To train a PPO agent on the AC environment, run the following command in your terminal:

python ac_solver/agents/ppo.py

By default, this command trains the PPO agent with initial states drawn from approximately 1200 presentations listed in this file. You can customize your run by passing any hyperparameters listed in args.py via the command line.
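
For example, a run might look like the following; the flag names here are hypothetical placeholders, so substitute the hyperparameter names actually defined in args.py:

python ac_solver/agents/ppo.py --total-timesteps 2000000 --learning-rate 2.5e-4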

Option B: I wish to reproduce or extend the analyses in the paper

If you wish to reproduce the plots and analyses in the paper, you will need to clone the repository locally. Here is the recommended process:

  1. Clone the repository:

    git clone https://github.com/shehper/AC-Solver.git
    cd AC-Solver
  2. Make a virtual environment and install the package locally using pip:

    python -m venv ./env
    source ./env/bin/activate
    pip install .

The repository comes with code for performing barcode analysis on the AC graph, as well as notebooks for various analyses of experimental results.

Experiments

A PPO agent may be trained simply by running python ac_solver/agents/ppo.py. All agents trained for this paper were logged with Weights & Biases; see the W&B Project page for training runs.

Analysis

Analyses of experimental results may be performed using the Jupyter notebooks in the notebooks folder.

  • Classical-Search-and-PPO-in-AC-Environment.ipynb: Performs various analyses of the performance of classical search and PPO in the AC Environment. It shows, for example, that longer presentations are harder to solve for all of these algorithms.
  • Scaling-PPO-in-AC-Environment.ipynb: Studies the effects of scaling environment interactions and horizon length on agent performance.
  • Stable-AK3.ipynb: Demonstrates that AK(3), a potential counterexample that remained open for nearly four decades, is a stably AC-trivial presentation.

Contributing

Contributions to this project are welcome! If you'd like to contribute, please follow the steps listed in Contributions.

Acknowledgments

This project’s PPO implementation is based on the CleanRL library.

Citation

@misc{shehper2024makesmathproblemshard,
      title={What makes math problems hard for reinforcement learning: a case study}, 
      author={Ali Shehper and Anibal M. Medina-Mardones and Bartłomiej Lewandowski and Angus Gruen and Piotr Kucharski and Sergei Gukov},
      year={2024},
      eprint={2408.15332},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2408.15332}, 
}
