Learning to Optimize

The repository archives papers regarding the combination of combinatorial optimization and machine learning and corresponding reading notes. For this intersection of machine learning and combinatorial optimization, we name it Learning to Optimize.

Corresponding authos: Mingxuan Yuan, Huiling Zhen, and Xijun Li

Email: Yuan.Mingxuan@huawei.com, zhenhuiling2@huawei.com, xijun.li@huawei.com

Mingxuan Yuan, Huiling Zhen and Xijun Li are with Huawei Noah's Ark Lab

Thanks to all authors for contributing their notes, here we have:

Weilin Luo from Beihang University

Zhenkun Wang from SUSTech

Jinhong and Han from Shanghai Jiao Tong University

Looking forwards to anyone interested in this topic joining us.

2020 - Machine Learning for Computer Systems(ML4CS) - paper reading progress (opening)

First a survey road map is given below. The road map is being updated. Stay tuned!

Paper title	Readers	Slide
Performance-Effective and Low-Complexity Task Scheduling for Heterogenous Computing	Jinhong
Machine Learning Priority Rule For Solving Resource-Constrained Project Scheduling Problems	Jinhong
Learning Scheduling Algorithms for Data Processing Clusters	Jinhong
DeepWeave: Accelerating Job Completion Time with Deep Reinforcement Learning-based Coflow Scheduling	Jinhong
Generalizable Resource Allocation in Stream Processing via Deep Reinforcement Learning	Jinhong
Placeto: Learning Generalizable Device Placement Algorithms for Distributed Machine Learning	Jinhong
DL2 : A Deep Learning-driven Scheduler for Deep Learning Clusters	Jinhong
QL-HEFT: a novel machine learning scheduling scheme base on cloud computing environment	Jinhong
LEARNING FINITE STATE REPRESENTATIONS OF RECURRENT POLICY NETWORKS	Jinhong, Luhan
External vs. Internal: An Essay on Machine Learning Agents for Autonomous Database Management Systems	Luhan
Self-Driving Database Management Systems	Luhan
Automatic Database Management System Tuning Through Large-scale Machine Learning	Luhan
Query-based Workload Forecasting for Self-Driving Database Management Systems	Luhan, Xijun
An End-to-End Automatic Cloud Database Tuning System Using Deep Reinforcement Learning	Luhan
Learning Memory Access Patterns	Luhan
Resource Management with Deep Reinforcement Learning	Luhan, Xijun
Towards Safe Online Reinforcement Learning in Computer Systems	Luhan
Safe Reinforcement Learning via Shielding	Luhan

LEARNING FINITE STATE REPRESENTATIONS OF RECURRENT POLICY NETWORKS

Anurag Koul, Sam Greydanus and Alan Fern, "LEARNING FINITE STATE REPRESENTATIONS OF RECURRENT POLICY NETWORKS",ICLR2019

Notes:

The author of this paper use Moore Machine Network(MMN) to learn the RNN, converting the continuous state representation to finite discrete state representation with less memory consumption. The main idea of this paper is using Finite State Machine(FSM) to reducing the complexity of traditional learning model without degrading of their performance. First of all, the author trains two Quantized Bottleneck Networks(QBN) to quantize memory and observations. Then, the author inserts the QBNs into RNN network and tuning the new network. Finally, the author extracts equivalent Moore Machine(MM) form MMN and minimizes the number of state representation. --Jinhong

The main idea is to use an autoencoder(QBN Quantized Bottleneck Networks) to convert the continuous state representation to finite discrete state representation. It will change the RNN model to MMN to help improve the interpretability and save computational expense. This is meaningful in industry. And the auther only experiment on synthetic experiments and Atari games. It will degrade performance in complex condition, like two games mentioned in the paper. --Luhan

External vs. Internal: An Essay on Machine Learning Agents for Autonomous Database Management Systems

Pavlo, Andrew, et al. "External vs. Internal: An Essay on Machine Learning Agents for Autonomous Database Management Systems." IEEE Data Eng. Bull. 42.2 (2019): 32-46.

Notes：

This article is an overview about parameters tuning in DBMS. The author mainly talk about two engineering approaches for integrating ML agents in a DBMS, external agents and internal agents. External agents reuse the DBMS’s existing APIs and environment data collection infrastructure without having to modify the DBMS itself. However, some changes of configration will cause the machine to restart and the exposed metrics are large and noisy. Internal agents require to design the system’s architecture to natively support autonomous operation. The challenge is that many internal agents affect each other. If we only use single agent, it will increase the dimension and require more trianing data and training time. --Luhan

Self-Driving Database Management Systems

Pavlo, Andrew, et al. "Self-Driving Database Management Systems." CIDR. Vol. 4. 2017.

Notes：

This paper presents the architecture of Peloton, the first self-driving DBMS. The basic architecture consists of three components: Runtime Architecture, Workload Modeling, Control Framework. Runtime Architecture aims to acquire the Real-time performance. Workload Modeling captures the qurey's logical semantics feature to do clustering and forecasting. And the forecasting method is RNN. Control Framework uses receding-horizon control model (RHCM) to guide the action using the prediction and history recorded. The experiment show the results are promising and the direction is worth exploring. --Luhan

Automatic Database Management System Tuning Through Large-scale Machine Learning

Van Aken, Dana, et al. "Automatic database management system tuning through large-scale machine learning." Proceedings of the 2017 ACM International Conference on Management of Data. 2017.

Notes:

The motivation is that the knobs of DBMS are not standardized, not independent and not universal. And some expriments expose that the different configration lead to different performance obviously, such as only modifying the size of buffer pool or logfile. The paper is aim to deal with the unlabeled metrics. Fisrt, the author choose the internal metrics as feature and do pruning using factor analysis and k-meas.Then, he uses lasso regression model to sort and rank the knobs. The model using some top metrics to help optimize. It uses dynamic mapping to find the nearest histroy and then modify the parameters using GP regression. It proves to be comparable to ones created by human experts and save much time. --Luhan

Query-based Workload Forecasting for Self-Driving Database Management Systems

Ma, Lin, et al. "Query-based workload forecasting for self-driving database management systems." Proceedings of the 2018 International Conference on Management of Data. 2018.

Notes:

The author chooses the logical sematic as features. And for SQL queries, we can convert them to similar template by replacing the values as placeholder. This will reduce the data scale effectively. First, the author uses improved DBSCAN as online-clustering method. Then, he emsembles the RNN and LR result to do prediction. Considering long time periodical events, he also uses the kernel regression model to help. If the error between the former and the later is larger than 150%, it will choose kernel regression. --Luhan

proposed a query-based workload forecasting method for database management systems. The workload forecasting can be beneficial to database management systems optimization. If a DBMS only considers the behavior of the application in the past when selecting which optimizations to apply, these optimizations may be sub-optimal for the workload in the near future. It can also cause resource contention if the DBMS tries to apply optimizations after the workload has shifted (e.g., it is entering a peak load period). Instead, a self-driving DBMS should choose its optimizations proactively according to the expected workload patterns in the future. Thus, this is necessary for wokload prediciton. The system can then select the optimization configuration to prepare based on this prediction. However, the workloads for real-world applications are never static. The authos propsosd QeuryBot5000(QB5000) in this paper. QB5000 is a workload forecasting framework that runs as either an external controller or as an embedded module. The workflow of QB5000 presented in Figure 2. The framework receives SQL queris from the DBMS. This data is first passed into the Pre-Processor that identifies distinct templates in the workload and records their arrival rate history. In this step, various SQL queries can be abstracted to several defined templates. Next, the Clusterer combines the templates with similar arrival rate patterns together. In this step, the complexity of this prediction task is furthermore decreased. Finally, the information is then fed into the Forecaster where it build models that predict the arrival rate of templates in each cluster. Note that if a template cannnot be classied to any cluster, then the templet will be the center of a new cluster. And the Forecaster will periodically update the prediction model for each cluster. In the experiment, the results show that QB5000 is effective in helping the DBMS select the optimal indexes for the target workload in real time. -- Xijun

An End-to-End Automatic Cloud Database Tuning System Using Deep Reinforcement Learning

Zhang, Ji, et al. "An end-to-end automatic cloud database tuning system using deep reinforcement learning." Proceedings of the 2019 International Conference on Management of Data. 2019.

Notes:

The author declares to be the first end-to-end automatic database tuning system to use deep RL learning and recommended database configurations. The features consist of 63 internal metrics, including 14 state value and 49 accumulated value. The author uses policy-based DDPG(Deep Deterministic Policy Gradient). The author also designs specific reward to take care of both the beginning state and the previous moment state. And the experiments show it performs much better than the former models and database experts. --Luhan

Learning Memory Access Patterns

Hashemi, Milad, et al. "Learning memory access patterns." arXiv preprint arXiv:1803.02329 (2018).

Notes:

The paper aims to deal with Prefecting addresses problem. The author does many experiments to find the adequate settings. It is more like the engineering exploration. He uses the delta of addresses instead of origin address beacuse of smaller searching space. He compares two models and prefers the Clustering and LSTM model to traditional embedding LSTM model. The input features are PC number of instruction and the delta addresses. --Luhan

Resource Management with Deep Reinforcement Learning

Mao, Hongzi, et al. "Resource management with deep reinforcement learning." Proceedings of the 15th ACM Workshop on Hot Topics in Networks. 2016.

Notes:

The author models tasks in idealized situations. Jobs arrive in discrete time steps. The scheduler selects one or more waiting jobs for scheduling in each time step. We assume that the resource requirements for each task are known at the time of arrival. We want to optimize the average lattancy time. We model the state as for each job, we konw its time slice, CPU number and Memory requirement. The RL network is simple policy gradient network. The experiment show it is effective to use RL. But it is really a simple and ideal condition. Its reward is worth learnning. --Luhan

Mao et al. proposed a method to solve online multi-resource allocation problem in computer system using deep reinforcement learning. It is relatively novel and interesting in publication of 2016 although it could be thought as simple in 2020. The motivation of this paper is that machine learning can provide a viable alternative to human-generated heuristics for resource management. The typical design flow of human-generated heuristics is (1) come up with clever heuristics for a simplified model of the problem; and (2) painstakingly test and tune the heuristics for good performance in practice. This process often has to be repeated if some perspectives of the problem such as the workload or the metric of interest changes. In this paper, the authors gives a simple model of multi-type resource (CPU and memory) management problem in cluster, where the cluster has a certain capacity for each type of resources, and jobs comes into the cluster dynamically and the jobs consume an certain amount of each type of resources. The objective of this problem is to schedule the jobs in the cluster fulfilling all resource constraint so as to minimize the average slowdown time of jobs. For this problem, the authors gives an integrate MDP, including design of state space, action space, and reward, inspired by Terits, as shown below. They adopts deep reinforcement learning model to solve this MDP, and train it using REINFORCE. The experiment shows that the DRL method could beat several heuristics. It should be noted that the retraining is still needed when the environment (cluster or job workload) changes notably. The authors also note that learning resource management strategies directly from experience, if we can make it work in a practical context, could offer a real alternative to current heuristic based approach. -- Xijun

Towards Safe Online Reinforcement Learning in Computer Systems

Mao, Hongzi, et al. "Towards Safe Online Reinforcement Learning in Computer Systems."

Notes:

The background is that in many systems, models trained using reinforcement learning (RL) are powerful alternatives to hand-written heuristics, but their potentially unpredictable behavior makes them unsafe for production deployment. The key idea is to train the RL model online, in the real system, but to fall back on a simple, known-safe fallback policy if the system enters an unsafe region of the state space, while still providing sufficient feedback for RL to learn a good policy. There is a "training wheel" helping model solve the unsafe condition and it will receive penalty to prevent the action again. In the Implementation process, we will increase entroy when encountering a new unstable condition to help explore. And we will decrease the entroy when the condition becomes stable. It performs well in different condition and can adapt the changing circumstances quickly. --Luhan

Safe Reinforcement Learning via Shielding

Alshiekh, Mohammed, et al. "Safe reinforcement learning via shielding." arXiv preprint arXiv:1708.08611 (2017).

Notes:

The paper describes two types of shielding. The first one is preemptive shielding. It will acts before the agent, providing the safe action list to choose. The second one is post-posted shielding. When the agent's action is unsafe, sheild will monitor and correct the errors. The interesting idea is for the later shielding, we can assgin a pushment when agent take unsafe action but can also assign a reward according to the shielding action. Because we can consider the shielding as a projection or as a last network layer. --Luhan

2020 - Learning to Opitimize (Solver & Logistics) - paper reading progress (opening)

Paper title	Readers	Commenters
A General Pricing Scheme for the Simplex Method	Xiaotian, Fangzhou, Xijun
A LEARNING-BASED ITERATIVE METHOD FOR SOLVING VEHICLE ROUTING PROBLEMS	Yi
An ADMM-Based Interior-Point Method for Large-Scale Linear Programming
DEEPSIMPLEX: REINFORCEMENT LEARNING OF PIVOT RULES IMPROVES THE EFFICIENCY OF SIMPLEX ALGORITHM IN SOLVING LINEAR PROGRAMMING PROBLEMS	Xiaotian
Efficient Revised Simplex Method for SVM Training	Xiaotian
FICO opens up the complex world of optimization to business analysts
Improving Optimization Bounds using Machine Learning: Decision Diagrams meet Deep Reinforcement Learning	Xiaotian, Hanchao
Presolve Reductions in Mixed Integer Programming
SCIP - a framework to integrate Constraint and Mixed Integer Programming	Xiaotian, Hanchao, Xijun, Fangzhou, Huiling
Introduction to SCIP	Xiaotian, Hanchao, Xijun, Fangzhou
SCIP Doxygen Documentation	Xiaotian, Hanchao, Xijun, Fangzhou
Improving the Accuracy of Linear Programming Solvers with Iterative Refinement
Machine Learning for Combinatorial Optimization: a Methodological Tour d'Horizon	ALL
Exact Combinatorial Optimization with Graph Convolutional Neural Networks with Graph Convolutional Neural Networks	Yi, Xiaotian, Hanchao
ATTENTION, LEARN TO SOLVE ROUTING PROBLEMS!	Yi
Reinforcement Learning for Solving the Vehicle Routing Problem	Yi
Pointer Networks	Yi
Learning to Run Heuristics in Tree Search.	ALL
Attention Solves Your TSP, Approximately	Yi
Combinatorial Optimization with Graph Convolutional Networks and Guided Tree Search	ALL
Learning the Multiple Traveling Salesmen Problem	Yi
Dynamic pickup and delivery problems (Gerardo Berbeglia et. al.	Yi
Waiting and Buffering Strategies for the Dynamic Pickup and Delivery Problem with Time Windows	Yi
0. SCIP: solving constraint integer programs	Xiaotian, Hanchao, Xijun, Fangzhou, Huiling
1.1. Challenges in Linear Programming and how SoPlex deals with them	Xijun, Fangzhou, Xiaotian, Huiling
1.2. Notes 6 The Primal Simplex Method	Xijun, Fangzhou, Xiaotian, Huiling
2. Improving the Accuracy of Linear Programming Solvers with Iterative Refinement	Xijun, Fangzhou, Xiaotian, Huiling
3. Floating point algorithm	Xijun, Fangzhou, Xiaotian, Huiling
4. Implementation of Cutting Plane Separators for Mixed Integer Programs	Solver guys
5. Primal heuristics for mixed integer programs	Solver guys
Safe Exploration in Continuous Action Spaces	ALL
Linear Programming 2: Theory and Extensions	ALL
A General Pricing Scheme for the Simplex Method	Xijun, Fangzhou, Xiaotian, Huiling
Learning Fast Optimizers for Contextual Stochastic Integer Programs
On Constraint Sampling in the Linear Programming Approach to Approximate Dynamic Programming

2019 - Learning to Optimize - paper reading progress

Paper title	Readers	Commenters
Machine Learning for Combinatorial Optimization: a Methodological Tour d'Horizon	Xijun, Mingxuan, Huiling	All
Learning to branch	Xijun, Mingxuan, Huiling	Mingxuan
Learning to Search in Branch and Bound Algorithm	Xijun, Mingxuan, Huiling	Mingxuan
Learning to branch in mixed integer programming	Xijun, Mingxuan, Huiling	Mingxuan
Learning Combinatorial Optimization Algorithms over Graphs	Huiling, Xijun	Mingxuan
Exact Combinatorial Optimization with Graph Convolutional Neural Networks with Graph Convolutional Neural Networks	Xijun, Huiling	Xijun
Predicting Tactical Solutions to Operational Planning Problems under Imperfect Information	Huiling, Xijun	Huiling
Discriminative Embeddings of Latent Variable Models for Structured Data	Xijun
Learning when to use a decomposition	Zhenkun	Zhenkun
Best arm identification in multi-armed bandits with delayed feedback	Huiling	Huiling
Solving a New 3D Bin Packing Problem with Deep Reinforcement Learning Method	Xijun	Xijun
A Multi-task Selected Learning Approach for Solving 3D Flexible Bin Packing Problem	Xijun	Xijun
Pointer Networks	Huiling, Xijun	Huiling
NEURAL COMBINATORIAL OPTIMIZATION WITH REINFORCEMENT LEARNING	Huiling,Xijun	Huiling
ATTENTION, LEARN TO SOLVE ROUTING PROBLEMS!	Huiling	Huiling
Reinforcement Learning for Solving the Vehicle Routing Problem	Huiling	Huiling
Generalized Inverse Multiobjective Optimization with Application to Cancer Therapy	Zhenkun	Zhenkun
Reinforcement Learning for Integer Programming: Learning to Cut	Zhenkun	Zhenkun
Learning Permutations with Sinkhorn Policy Gradient	Huiling	Huiling
Learning to Run Heuristics in Tree Search.	Xijun	Xijun
Predicting Solution Summaries to Integer Linear Programs under Imperfect Information with Machine Learning.	Huiling	Huiling
Online Learning for Strong Branching Approximation in Branch-and-Bound	Zhenkun	Zhenkun
Optimization as a model for few-shot learning	Huiling	Huiling
Learning a SAT Solver from Single-Bit Supervision	Xijun	Xijun
Machine Learning to Balance the Load in Parallel Branch-and-Bound	Zhenkun	Zhenkun
Attention Solves Your TSP, Approximately	Xijun	Huiling
A Machine Learning-Based Approximation of Strong Branching	Zhenkun	Zhenkun
Learned Optimizers that Scale and Generalize	Huiling	Huiling
Learning fast optimizers for contextual stochastic integer programs	Huiling	Huiling
Combinatorial Optimization with Graph Convolutional Networks and Guided Tree Search	Huiling
Optimization on a Budget: A Reinforcement Learning Approach	Huiling	Huiling
A Deep Q-Network for the Beer Game: Reinforcement Learning for Inventory Optimization	Huiling
Boosting Combinatorial Problem Modeling with Machine Learning	Huiling	(-)
Learning Scheduling Algorithms for Data Processing Clusters
Learning the Multiple Traveling Salesmen Problem
Learning to Perform Local Rewriting for Combinatorial Optimization	Long Kang
Deep Reinforcement Learning with Knowledge Transfer for Online Rides Order Dispatching	Xijun, Weilin
Efficient Ridesharing Order Dispatching with Mean Field Multi-Agent Reinforcement Learning	Huiling, Weilin
Optimizing Online Matching for Ride-Sourcing Services with Multi-Agent Deep Reinforcement Learning
Optimizing Taxi Carpool Policies via Reinforcement Learning and Spatio-Temporal Mining	Xijun, Weilin
Crowd recruiter: Selecting participants for piggyback crowdsensing under probabilistic coverage constraint	Huiling
A taxi order dispatch model based on combinatorial optimization.	Xijun, Weilin
Efficient largescale fleet management via multi-agent deep reinforcement learning	Weilin
Large-scale order dispatch in on-demand ride-hailing platforms: A learning and planning approach.	Weilin
Learning to estimate the travel time.	Huiling, Weilin
Learning Scheduling Models from Event Data	Huiling
LSTOD: LATENT SPATIAL-TEMPORAL ORIGINDESTINATION PREDICTION MODEL AND ITS APPLICATIONS IN RIDE-SHARING PLATFORMS (ICLR 2020 PAPER under review)	Xijun, Weilin	Weilin
A Comprehensive Survey of Graph Embedding: Problems, Techniques and Applications	Xijun, Weilin
Learning Fast Optimizers for Contextual Stochastic Integer Programs
On Constraint Sampling in the Linear Programming Approach to Approximate Dynamic Programming
DEEPSIMPLEX: REINFORCEMENT LEARNING OF PIVOT RULES IMPROVES THE EFFICIENCY OF SIMPLEX ALGORITHM IN SOLVING LINEAR PROGRAMMING PROBLEMS	Xijun

CO+ML Survey Bengio 2018

Bengio, Yoshua, Andrea Lodi, and Antoine Prouvost. "Machine Learning for Combinatorial Optimization: a Methodological Tour d'Horizon." arXiv preprint arXiv:1811.06128 (2018).

Paper location: CO-ML-papers⁩/papers⁩/CO+ML Survey Bengio 2018.pdf

Notes:

这篇文章Bengio总结了目前组合优化和机器学习融合的方法和架构。从方法论上分为demonstration和experience。demonstration是需要agent通过ML或RL的方式学习expert(人或者优化算法最优解)的"经验”；experience是需要agent通过RL的方式从“零”，自我学习的方式学得求解优化问题的“经验”。从架构上分为ene-to-end和嵌入式架构。end-to-end的架构是利用ML或RL完全替代优化算法来解决优化问题。嵌入式架构是在原始优化算法中利用ML替换掉速度较慢的模块，以此达到提高优化效率或精度的目的。 —— Xijun

Note location: CO-ML-papers⁩/⁨notes⁩/20190717⁩/bengio-co-ml-survey

Learning to Branch

Balcan, Maria-Florina, Travis Dick, Tuomas Sandholm, and Ellen Vitercik. "Learning to branch." arXiv preprint arXiv:1803.10150 (2018).

Paper location: CO-ML-papers⁩/papers⁩/Learning to BranchMaria-Flori.pdf

Notes:

这篇文章是基于之前做branch&bound算法中各种评价node的指标。其目标是利用ML的方法学出一组权重，线性加权这些指标来综合评估node，从而达到减小搜索树的size。 —— Xijun

Note Location: CO-ML-papers⁩/⁨notes⁩/20190717⁩/learn-to-branch-flori

Learning to Search in Branch and Bound Algorithm

He, He, Hal Daume III, and Jason M. Eisner. "Learning to search in branch and bound algorithms." Advances in neural information processing systems. 2014.

Location: CO-ML-papers⁩/papers⁩/Learning to Search in Branch and Bound Algorithms.pdf

Notes:

这篇文章首先假设有一个Oracle已经知道branch & bound中知道什么是最优node selection策略（表示为训练数据），利用imitation learning学习不同问题的Oracle的node selection policy和node pruning policy。在训练集上学完后，然后在测试集上完全用学到的策略指导branch & bound的搜索过程。 —— Xijun

Note location: CO-ML-papers⁩/⁨notes⁩/20190717⁩/hehe-learn-to-branch-nips2014

Learning to Branch in Mixed Integer Programming

Khalil, Elias Boutros, et al. "Learning to branch in mixed integer programming." Thirtieth AAAI Conference on Artificial Intelligence. 2016.

Paper location: CO-ML-papers⁩/papers⁩/Learning to Branch in Mixed Integer Programming.pdf

Notes:

这篇文章是利用learning to rank学习branch & bound中的经典打分策略Strong Branching(SB)。其具体做法是，在面对任何一个MIP问题时，其算法在500个分支点前都用经典的Strong Branching作为变量评分标准来选择节点的最优分支变量，同时保留过程中的特征（手工构造的72个特种）和打分结果，这些特征和打分结果构成了训练集。在第501个节点时，利用learning to rank学出训练集中的SB打分策略，在后续节点中的变量选择中就都采用learning to rank来选择最优分支变量。 —— Xijun

Learning when to use a decomposition

Kruber M, Lübbecke M E, Parmentier A. Learning when to use a decomposition[C]//International Conference on AI and OR Techniques in Constraint Programming for Combinatorial Optimization Problems. Springer, Cham, 2017: 202-210.

Paper location: CO-ML-papers⁩/papers⁩/Learning when to use a decomposition.pdf

Notes:

这篇文章提出监督学习来检测MIP问题结构，并根据结构安排合适合适分解策略，进而提升solver的求解速度的。当一个MIP问题具有arrowhead结构或者double-bordered block diagonal form。这个MIP就可以利用DW分解来进行更快速的求解。本文将MIP的结构detect问题建模成一个0-1分类问题。其中输入参数是，MIP，分解策略，以及时间范围。本文采用的是scikit-learn library的标准分类器。--zhenkun

Best arm identification in multi-armed bandits with delayed feedback

Aditya Grover, Todor Markov, Peter Attia, Norman Jin, Nicholas Perkins, Bryan Cheong, Michael Chen, Zi Yang, Stephen Harris, William Chueh, Stefano Ermon (Stanford University University of Michigan Lawrence Berkeley National Laboratory )

Notes:

这篇文章的思路与其他文章差别很大，bandit没有用来解决问题本身，而是用来挑选cplex的启发式策略。文章默认，cplex中有很多启发式策略，而针对不同的问题，会自动的筛选不同的。但是在筛选之前，cplex内部会有测试机制。本文就是为了缩短测试的时间而设计的。training set是2000个MIP问题，在有32个arms的rl模型上，利用不同时刻的feedback做训练。test的mip问题，则直接根据这32个arm所对应的特征，挑选启发式算法。因此，##启发式算法本身不重要，重要的是它们做表现出来的特征##。值得我们学习的地方是，这个文章的是把求解Mip转成了资源分配问题，同时是一个online的思路，很适合处理动态的云资源/wireless资源分配问题。-- Huiling

Learning Combinatorial Optimization Algorithms over Graphs

Khalil, Elias, et al. "Learning combinatorial optimization algorithms over graphs." Advances in Neural Information Processing Systems. 2017.

Paper location: CO-ML-papers⁩/papers⁩/7214-learning-combinatorial-optimization-algorithms-over-graphs.pdf

Notes:

这篇文章挑了三个跟图相关的优化问题Minimum Vertex Cover, Maximum Cut,以及TSP。将求解这三个问题建模为Markov Decision Process，即有一个partial solution（set of vertex）,然后从剩下的vertex set中每一步挑选一个点进入partial solution，直到拼成一个合法解（MDP终止）。partial solution中的节点状态利用structure2vec来描述，利用Q-learning方法来学习action policy。需要注意的是其structure2vec和Q-evaluation function是放在一起来学习的。

该算法的优点是：graph embedding parameterization can deal with different graph instances and sizes seamlessly. 能自适应地求解来自同一分布但不同size的问题，抓住了同一分布问题的本质。 —— Xijun

Exact Combinatorial Optimization with Graph Convolutional Neural Networks

Gasse, Maxime, et al. "Exact Combinatorial Optimization with Graph Convolutional Neural Networks." arXiv preprint arXiv:1906.01629 (2019).

Paper location: CO-ML-papers⁩/papers⁩/Exact Combinatorial Optimization with Graph Convolutional Neural.pdf

Notes:

这篇文章是利用graph convolutional network来实现learn to branch。具体做法也是将B&B建模成Markov Decision Process。MDP中的每个状态state是由constraint和variable的bipartite图表示，action是从fractional variable中挑选一个变量来做分支。但是其并没有利用reinforcement learning来完成policy的学习，而是通过imitation learning。因此其需要事先构造训练集，即(B&B树的状态，各个fractional variable的得分)。这里B&B树的状态利用构造的bipartite graph表示，variable得分根据strong branching score公式得到。有了这些训练集后，便利用imitation learning的方式训练一个graph convolutional neural network来学习strong branching score挑选variable的policy。

优势：利用GCNN学习strong branching的policy，避免了手工构造特征。实验结果表现超过其他利用机器学习方法提速的branching策略，同时也超过了SCIP的branching策略。 —— Xijun 另外，这个文章给了一种用学习的方法来处理约束满足的组合优化问题的方法，即：让约束和变量，作为图的定点，利用约束和变量之间的关系，构成二部图。从而，可以通过提取约束和变量所构成的图的特征，来代替认为的设定特征。 ——Huiling

Solving a New 3D Bin Packing Problem with Deep Reinforcement Learning Method

Hu, Haoyuan, et al. "Solving a new 3d bin packing problem with deep reinforcement learning method." arXiv preprint arXiv:1708.05930 (2017).

Paper location: CO-ML-papers⁩/papers/Solving a New 3D Bin Packing Problem with Deep Reinforcement Learning Metho.pdf

Notes:

这篇文章是阿里旗下菜鸟团队利用深度强化学习技术解决他们业务定义的一个新型3D bin packing problem (3DBPP)。具体问题定义为给定一系列待装载的item，如何将这些item装进一个bin中，使得这个bin的表面积最小。其方案思路为将整个3DBPP划分成三个连续的决策问题：1）决定item的装载顺序；2）决定每个item在bin中的摆放方向；3）决定每个item在bin中的摆放位置（coordinates）。上述决策问题中，第二、第三步均采用启发式方法，第一步(决定item的装载顺序）的问题利用深度强化学习来解决。具体来说，利用pointer networks来学习装载顺序的policy。其输入是待装载item的长宽高数据，输出是这些item的顺序(sequence of order)。需要注意的是，pointer network学到的装载policy应该是与摆放方向、摆放位置所采取的heurisitic算法强相关。因为pointer networks的学习（网络参数的更新）中，其reward的计算需要摆放方向、摆放位置所采取的heurisitic算法来辅助的（即算bin的表面积）。 —— Xijun

A Multi-task Selected Learning Approach for Solving 3D Flexible Bin Packing Problem

Duan, Lu, et al. "A Multi-task Selected Learning Approach for Solving 3D Flexible Bin Packing Problem." Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems. International Foundation for Autonomous Agents and Multiagent Systems, 2019.

Paper location: CO-ML-papers⁩/papers/A Multi-task Selected Learning Approach for Solving 3D Flexible Bin Packing Problem

这篇文章是阿里旗下菜鸟团队在Solving a New 3D Bin Packing Problem with Deep Reinforcement Learning Method基础上的改进工作。在前述工作中其方案思路为将整个3DBPP划分成三个连续的决策问题：1）决定item的装载顺序；2）决定每个item在bin中的摆放方向；3）决定每个item在bin中的摆放位置（coordinates）。上述决策问题中，第二、第三步均采用启发式方法，第一步(决定item的装载顺序）的问题利用深度强化学习来解决。然后这篇文章的工作是将1）决定item的装载顺序和2）决定每个item在bin中的摆放方向都用RL的方式来优化，然后摆放位置仍然是启发式算法。其具体思路是类似于Pointer network，利用encoder与decoder的架构一起学习最优顺序和最优摆放方向的策略(所以是weight sharing)，其中除了利用了pointer networks本身的attention mechanism外，他们还提出了更高一层的intra-attention mechanism（即在decode的过程中除了利用encoder的hidden states信息外，还利用前序时间步的decoder的hidden state，能有效防止重复item在序列中重复出现）。 —— Xijun

Pointer Networks

Oriol Vinyals (Google brain), Meire Fortunato (UC Berkeley) and Navdeep Jaitly (Google brain). NIPS 2015.

这是第一篇把组合优化看成序列决策问题，利用seq2seq模型通过end-to-end的方式来多对多的求解TSP问题的工作。主要贡献在于1）利用RNN作为decoder和encoder的基本单元。我们可以这样理解这个结构： encoder部分对原求解条件做概括，压缩为隐藏状态，输入给decoder；decoder根据上一次子问题的求解输出和当前求解的状态来预测当前。 2）但如果直接是RNN构成的encoder和decoder，我们无法避免一个问题：这样的结构中烟隐藏状态是一个固定的向量，而这个向量与输入的每一部分都有关系，但我们又无法得知这一关系的区别。因此，本文又在encoder和decoder之间，嵌入了attention结构。但是，本文修改了传统attention networks的结构：去掉了传统attention中的weight-sum而直接输出distribution。这样，通过attention，可以计算出当前输出与输入序列关系最大的元素，那么我们就将输入序列的元素作为该输出元素，每一个输出元素都像有一个指针一样指向输入元素。在实际试验中，这样的结构，也一定程度上可以降低计算时间复杂度。当然，增加了网络训练的难度。 3）利用标准的tsp solver作为标签，supervise的方式训练这样的网络，从而得到tsp问题的结果（节点顺序）。

但这种方式有几个缺陷：1）网络结构：RNN结构（包括lstm）会记忆历史上下文的相关性，但是如果输入顺序发生改变（或者本来就是乱序的），会很大的影响预测结果。在NTM的任务中，人们会用memory networks来改进这个缺陷。但在tsp的问题中，还需要重新测试这种改进的实用性。 2）适用范围：这种将组合优化问题转化成可利用RNN或者其他seq2seq模型求解的问题，基本条件是该问题一定是序列化决策的问题，因此tsp，vrp这类，是比较合适的。但是，有些问题就需要重新建模，或者，改变模型，例如，set cover, bin-packing, planning以及scheduling等非标准的序列化问题（区分是否是序列化，我比较直观的理解是，）。

另外，该方法所处理的问题规模，也有待进一步考量。在本文的实验中，最大的节点个数是500。——Huiling

NEURAL COMBINATORIAL OPTIMIZATION WITH REINFORCEMENT LEARNING

Irwan Bello, Hieu Pham, Quoc V. Le, Mohammad Norouzi & Samy Bengio （Google brain）. ICLR workshop, 2017.

这是基于Pointer networks的改进工作，同时，文章的实验依然是应用到tsp问题上。不过，作者也给出了在knapsack问题上的实验结果。

本文依赖的网络结构依然是pointer networks，主要贡献在于使用了RL的训练方法，从而不需要事先知道tsp问题的解。其主要思路是：使用policy gradient，问题（例如：TSP）是agent，不同时刻的Graph作为state，以期学出Q-policy。文章给了两种实现算法：（1）一种基于pretraining。在该方法中，本文还引入了多线程处理算法A3C，其中critic有三个模块，分别是2个LSTM和1个二层神经网络（激活函数是relu）。这里，采用了标准的A3C的处理方法：critic利用TD-error自我更新，而后梯度传给agent, agent借助critic建议的梯度，完成自己的梯度更新。（2）另一种active search，不需要Pretraining。主要区别是：1）从一个固定的state出发，利用不同的hyperparameters, 采样得到不同的pi (policy)。这里的主要技术是：把hyerparameter看成state的温度（类比：Boltzmann distribution和SA中温度的概念）。 2）求出最小的L_j = L（pi_j|s），也就是找出效果最好的pi。这两个步骤主要是用来改进decoding的过程。这样处理的好处是：在给出pi之前就做了Local search的工作。3）不断的利用test数据集来refine网络。

通过后文实验显示，active search会得到质量更高的解，同时并未花费更多的时间。

本文的后续问题是：RL的训练方式比supervised的训练方式，能找到质量更高的解的实验结果，这一结果是否可以扩展到其他问题上。我们需要在新的数据集上测试这样的实验效果。——Huiling

Reinforcement Learning for Solving the Vehicle Routing Problem

Mohammadreza Nazari Afshin Oroojlooy Martin Takác Lawrence V. Snyder (Lehigh Univ.) NIPS， 2018

这也是一个基于Pointer networks的改进工作。与NEURAL COMBINATORIAL OPTIMIZATION WITH REINFORCEMENT LEARNING唯一的区别是，encoder的部分直接利用embedding来代替。构造逻辑是：（1）作者认为原有的pointer networks不适合处理更复杂的问题，并在图一做了说明。不适合的主要原因是：每次Input的顺序（比如在vrp问题中，两个customer的顺序）发生变化时，都需要重新计算网络上的所有参数。而我们需要一个新的结构，这个结构对于输入顺序具备不变性。同时，作者认为RNN这个结构对于解决组合优化问题而言，提供了过多的其他信息，但这些信息对于优化本身来说不是必须的。（比如在翻译任务中，RNN可以提供word和position之间的映射关系，这个关系对于翻译而言是重要的，因为反映了语序，但是对于vrp这类问题而言，是不必要的。笔者注：当然，对于tsp而言，也是必要的。）（2）用embedding来代替encoder中的RNN，输入的是一对信息，即：（coordinate, demand）。以期用embedding的结构，学出这样的pair的隐含关系。（3）decoder和attention的部分，与Pointer networks保持一致。（4）同时，本文给出了一种处理约束的方式，即：让约束变成termination condition，也就是，当不满足demand的约束时，就停止。（5）本文使用了非常常规的训练方法，即用标准的policy gradient结合actor-critic的算法框架去完成网络训练。

本文在后续讨论中，指出，当有很多约束时，可以让约束变成soft constraints，即：放到目标上去min。但这里带来了一个隐患：约束不一定全部被满足，需要在求出解以后去repair。如果约束很多时，会带来时间上的浪费。

本文在vrp问题上做了测试。不过，实验结果仅包括在50辆车上的调度，是一个非常小规模的问题。—— Huiling

Reinforcement Learning for Integer Programming: Learning to Cut

Yunhao Tang, Shipra Agrawal, Yuri Faenza (Columbia University), arxiv 2019

在这篇文章中，作者提出来用强化学习来加速割平面法的收敛速度。对于一个IP问题，割平面法是通过对一个松弛后LP最优解加cut(割平面)，并迭代求LP，直到该LP最优解满足所有整数约束。在割平面法中，割平面的选择极大的影响算法收敛速度。目前主要是是通过启发式算法来完成割平面的选择。本文首次提出使用RL来完成割平面法的选择。因为现实中割平面的数量可能非常大，作者通过一些实现技巧，做了些state space size 和 generality of method之间的平衡。实验结果现实，该方法极大的提升了算法的收敛速度，且具有良好的泛华能力。该算法还可以作为一个子程序用到一些Branch and Cut的solver里。-— Zhenkun

ATTENTION, LEARN TO SOLVE ROUTING PROBLEMS!

Wouter Kool, Herke van Hoof & Max Wellling. ICLR 2019.

本文依然借助encoder-decoder这个结构，来完成end-to-end的处理组合优化问题这一任务。本文的主要创新是借助了transformer （from "Attention is all you need"）的原有架构，将原模型在NLP任务上的效果，迁移到了组合优化问题中。但与原transformer主要的不同点在于，1）原有模型critic部分用了baseline，而本文用的rollout （经过试验对比，rollout的效果会远远好于baseline）。2）借助了Local search来提升解的质量 3）decoder的input没有用PCA降维。

据我们了解，本文是第一篇在100个节点（含以下）的问题上效率超过gurobi的算法。但是，本文的网络结构相对复杂，在更大规模的问题上的时间效率，需要进一步测试，或者修正模型。—— Huiling

Learning to Run Heuristics in Tree Search

Khalil, Elias B., et al. "Learning to Run Heuristics in Tree Search." IJCAI. 2017.

本文提出了一种利用机器学习方法来指导branch&bound算法中启发式搜索解的框架。首先来说，目前市面上最好的开源mip求解器SCIP，在其branch&bound实现的过程中，对于每个节点都会有多种heuristic方法来进行节点选择和变量的选择，而这些启发式的调用频率和规则都是由专家确定的。这也就是说对于不同问题而言，这些启发式的调用频率都不会因问题不同而有所改变。因此本文作者提出了利用机器学习的方法来学习面对不同问题时branch&bound算法启发式调用规则，从而提高branch&bound算法的收敛速度。具体来说，对于某一种启发式H而言，给定一个b&b搜索树，它需要一个二分类器帮助其确定在树中某一个节点是否使用该启发式H。这是一个监督学习，因此需要线下收集大量的（feature,label），简单来说feature包括节点信息和树的整体信息等，label是该启发式H能否找到incumbent，能找到是1，不能是0.

他们通过实验发现提出的机器学习框架要比SCIP默认（专家设定）的启发式调用规则更能提升branch&bound算法的效率（即primal integral更小，这个指标是本文或者前述文章提出来的一个衡量branch&bound算法性能的指标，越小越好）。

其实本文构思很简单，但是贵在实验丰富，他们claim自己是第一篇系统性地优化树搜索过程中启发式使用策略。 —— Xijun

Learning a SAT Solver from Single-Bit Supervision

Selsam, Daniel, et al. "Learning a SAT solver from single-bit supervision." arXiv preprint arXiv:1802.03685 (2018).

本文提出了用神经网络来学习处理SAT问题的Solver，即NeuroSAT。给定一个谓词逻辑，NeuroSAT能判断该谓词逻辑是不是可满足的，如果可满足的话，其以一定概率能输出对应变量的真值表。该网络的输入是将谓词逻辑转化成literal(即变量）和clause的bipartite图，用邻接矩阵描述这个bipartite图；该网络的总体架构是LSTM；该网络的输出是判断输入的谓词逻辑是否可满足，输出1表示可满足，输出0表示不能满足。训练方式是监督学习，因此需要事先利用SAT solver收集训练数据，即（给定谓词逻，可满足标签）。值得一提的是，NueroSAT如果判断给定谓词逻辑是可满足的话，是有70%的概率能从literal向量最后迭代的值中构造出一个能使给定谓词逻辑为真的assignment。

本文claim最大贡献不是提出的NeuroSAT性能，其性能还是比不过States of the Art SAT solvers。但他们提出的这个NeuroSAT以及网络输出可视化能learn to perform discrete search on their own without the help of hard-coded search procedure, even after only end-to-end training with minimal supervision. —— Xijun

Attention Solves Your TSP, Approximately

Kool, Wouter, Herke van Hoof, and Max Welling. "Attention solves your TSP, approximately." stat 1050 (2018): 22.

本文是提出了一个利用transformer神经网络架构，采用RL训练方式的方法来求解TSP问题。其与Pointer networks的区别很显著：1)本文是transformer架构（可以视为升级复杂版的attention mechanism），point networks是基础版的attention mechanism；2)本文是强化学习的方式训练网络，而point networks是监督学习。本文采用的transformer神经网络架构还是能分为encoder和decoder：1) encoder阶段将TSP中每个节点经过多层transformer层映射到embedding空间，然后将多个节点的embedding取平均，作为对tsp整张图的embedding；2) decoder阶段，每次解码出一个点（每次其实是输出candidate点的概率，然后greedily选取概率最大的candidate点作为解码输出），decoder的输入是（图的embedding, decoder上一次输出点的embedding,decoder第一次输出点的embedding）。采用REINFORCE训练上述网络。

在TSP（n=20,50,100）上测试发现，提出的方法比Point networks，Learning Combinatorial Optimization Algorithms over Graphs以及各种启发式方法都要好。 —— Xijun

Learning Permutations with Sinkhorn Policy Gradient

本文的主要贡献是，利用Sinkhorn algorithm（一种启发式算法，可以类比理解是结合模拟退火的local search方法。与模拟退火类似，Sinkhorn也常伴有一个表示temperature的超参），构造了Sinkhorn layer。这个结构可以作为一种Plug-and-play的结构，嵌入到原有神经网络中，从而方便利用backpropagation来实现active/local search。

具体做法是：给定一个矩阵x(instance)，构造其对应的Sinkhorn形式，即对矩阵的每个元素做softmax。同时，借助这个形式，可以将离散的PG变成DPG。本文利用这种方法求解了TSP 20和permutations(sorting问题)，但其结果显示该方法并不总适用于组合优化。例如在tsp问题中，该方法并没有优于已有方法。但是在permutation的问题上表现良好。

根据我的理解，这种表现良好的原因是： Permutations的问题的最优解更接近于连续分布，因此本文的方法更适合找到质量更高的解。--Huiling

Learning Fast Optimizers for Contextual Stochastic Integer Programs

Deepmind, UAI 2018.

本文的研究目的是利用学习的方法结合已有的数学规划方法和计算软件，以期提高商业软件的求解效率和质量。

本文的研究对象是一般的随机整数规划（SIP）问题，这是一个标准的数学规划模型。一般的，我们认为这是一个two-stage的规划问题。第一阶段先优化一个主要变量，这是一个确定性的优化问题，第二阶段优化一个含有噪声的随机问题，主要考察噪声对于主要变量的最优解的影响。因此，在数学规划领域，人们习惯的做法是利用Benders分解来求原SIP问题。其中，Master problem是第一阶段的优化问题， subproblem是含有噪声的随机优化问题。一般流程是：先求解master problem, 而后固定master的解，求解subproblem，求解后利用对偶形式，回传给master其对应的upperbound，而后重新求解master，依次循环，直到master和subproblem分别求出的bound之间的差距，小于给定的threshold，则停止计算。

本文的具体做法就是结合标准的Benders分解，其主要贡献是：

（1）构造Initial policy和local move policy，以求解Master problem （不包含初值，master的初值由计算软件给出）。其中，Initial policy的主要作用是利用生成模型，在给定Instance(变量x)的前提下，得到环境变量(contextual variable z)，为后面的学习提供基础。这里用的生成模型是NADE，事实上我们可以利用auto-encoder，GAN来做同样的工作。 Local move policy主要是为了在给定x和z的情况下，借助RL来得到policy，从而求解master。这里采用的RL是标准的A3C，借助的网络结构也是简单的feedforward networks (Relu active function, 2 hidden layers)。

（2）learning dual. 这种做法的好处是：构造dual的时间往往比求解还长，因此利用学习，可以有效的减少这部分的构造时间。学习的方法也很简单，也是feedforward networks (Relu active function, 2 hidden layers)，结合SGD。

同时，本文利用了一个技巧：同时生成多个被扰动的subproblems，以并行求解。 -- Huiling

Predicting Tactical Solutions to Operational Planning Problems under Imperfect Information

Bengio, Lodi, et. al. Mar 2019, Preprint.

本文的研究对象依然是随机整数规划，本文的研究目的是用确定性的优化+机器学习的方法，来代替传统的对随机问题的直接求解。

本文的主要思路分两个步骤：

（1）依照噪声的分布，同时生成多个被扰动后的确定性优化问题，分别构成training set, test set 和validation set。方便利用学习来求解。但这里在学习方法上，并没有与learning to branch等文章，有本质的区别。（2）利用已知的数据集上，噪声对确定性问题的影响，构造回归模型，通过对回归参数的学习，来代替直接求解随机问题。

本文的主要内容是提出一般的方法论，因此实验很简单。目前正在审稿阶段，可以期待后续的修改和补充。-- Huiling

Generalized inverse multiobjective optimization with application to cancer therapy

Timothy C. Y. Chan, Tim Craig, Taewoo Lee, Michael B. Sharpe, Operations Research 2014.

针对多目标优化问题的逆问题，本文提出来一个一般性方法。该方法在当前解为不可行解时，依然可以求解逆问题。除此以外，该方法还被用到一个前列腺放射治疗的案例上。

针对一个多目标线性规划问题，其原问题是给定一个目标权重向量（weights），就可以得到一个非支配解（又称有效解）。其逆问题是给定一个有效解，就可以确定一个权重向量。但是如果给定的当前解不是有效解的时候，该逆问题是不可行的。本文提出两种解决策略：一种是相对误差的策略。让误差用一个乘积系数来表示。当系数为1时，该解为有效解。第二有是绝对误差的策略。用当前解到最近有效解的距离作为当前解的绝对误差。该误差为0，代表该解为有效解。此外本文还阐明了他们提出的广义多目标逆优化方法与传统帕累托曲面近似技术之间的联系。--Zhenkun

Combinatorial Optimization with Graph Convolutional Networks and Guided Tree Search

2018 NIPS

本文的研究背景是将多类图上的NP-hard问题（包括Minimum Vertex Cover，Satisfiability以及Maximal Clique），转变成Maximal independent set(MIS)问题。而后，传统的解决MIS问题的方法就是树搜索，但树搜索的最大问题是深度优先和广度优先很难平衡，因此选哪个节点优先分叉往往会对求解的效率起到至关重要的作用。本文的目的就是利用其它方法来加速这种收敛。

本文的主要思路是，用GCN来加速原有启发式算法（树搜索）的收敛。认为树搜索中被分叉的节点都在一个候选集中。利用GCN输出的概率，给树的节点分类{0，1}，分类是1的则被选入候选集。因此，本文将树搜索中最难决策的exploration-exploitation平衡的问题，转变成了训练一个更合适分类的GCN的问题。本文的主要学习方法是supervised learning（主要为了回避RL自身的不稳定性）。并且利用了并行机制：同时生成了多个GCN，使得一次可以生成多个不同的解，从而提升探索性能。

不同的图G_i{V,E,A}构成了一个统一的网络f=f(G_i)，输出是[0,1]^N, 表示节点进入了候选集的概率。这里利用solver在训练集上求出的最优解来打标签I_i。原始训练方式是：直接利用cross-entropy作为目标函数 sum_i loss(I_i, f(G_i, theta))，继而利用选定的solver（例如sgd）做训练。但是这样的训练有一个问题，就是让本不影响最优解性能的节点位置，标签完全相反，见文中图2。因此，在这里本文做了一个改进，就是让目标函数变成：sum_i min_m loss(I_i, f(G_i, theta)), 意味着对于一个给定的训练样本，损失只与预测最准确的结果相关。使得收敛更加稳定。这种做法在很多其他文献中也有涉及，例如NEURAL COMBINATORIAL OPTIMIZATION WITH REINFORCEMENT LEARNING中的active search, Learning Permutations with Sinkhorn Policy Gradient中的Sinkhorn layer，都是为了做类似的改进：在初始点附近就做local search，而后选择效果最好的往下继续进行。

随后本文将该算法与很多经典算法对比，包括gurobi，在求解时间和求解质量上，均有提升。是目前少有的可以比Gurobi效果更好的工作。本文涉及到的图/网络的节点个数都在2000以内，因此不是一个很大的图。

事实上如果适当转换，set cover，facility location甚至submodular/supermodular都可以变成MIS的问题。 --Huiling

Online Learning for Strong Branching Approximation in Branch-and-Bound

(Alejandro Marcos Alvarez, Louis Wehenkel, and Quentin Louveaux, 2016)

这篇论文提出了一种结合了可靠分支（reliability branching）和基于学习的分支（learned branching）的新分支策略。本文提出用一种在线学习算法来学习一个对strong branching的代理模型。然后用在某些情况下用该代理模型来替换原来的strong branching，从而克服了strong branching在某些时候计算代价过大的问题。

在本文的方法中，当分支定界法需要分支时，从当前节点处的分数变量集合中选择一组候选变量，并计算每个变量的分数。然后选择具有最高分数的变量用于分支。这里的分数计算方法有两种： 1）如果认为变量的代理模型不可靠，则计算实际strong branching得分，并得到描述当前节点变量的一组特征。得分用于对候选变量进行排名，但它也存储在数据集（具有特征向量）中以改进strong branching代理模型的精度。 2）如果认为代理模型是可靠的，则计算描述当前候选变量的特征向量并将其输入给strong branching代理模型（这里用的是线性回归学习的函数），以便快速生成近似的strong branching分数。

此外作者在一些实验上，证明了该方法是有效的。--zhenkun

A Machine Learning-Based Approximation of Strong Branching

(Alejandro Marcos Alvarez, Louis Wehenkel, and Quentin Louveaux, INFORMS Journal on Computing, 2017)

针对处理混合整数规划的分支定界法中的分支策略。本文提出了一种快速的代理模型技术，用以模拟strong branching的分支策略。这种近似算法是通过机器学习技术在一组观察到的strong branching分支决策中训练获得的。为了能够很好的提升代理模型的精度。本文详细设计了特征。本文将特征分为以下三类： 1）静态特征：它们是从问题参数c，A和b计算出来的。它们只需一次计算计算，代表问题的静态特征。这些特征能对问题进行全面描述。 2）动态特征：这些特征与当前B＆B节点上的解决方案有关。它们包含当前解决方案中固定变量的比例，变量的上下分数，等等。 3）动态优化特征：它们旨在表示优化的整体状态。这些特征汇总了单个当前节点无法提供的全局信息。当对变量执行分支时，会为该变量存储目标增量。

一旦特征被设计好了，就可以使用机器学习算法来从数据集学习代理模型。在这项工作中，他们使用Extremely Randomized Trees作为学习算法。 --zhenkun

Machine Learning to Balance the Load in Parallel Branch-and-Bound

(Alejandro Marcos Alvarez, Louis Wehenkel, and Quentin Louveaux, 2019)

本文的主要贡献是开发一种并行B&B方法，该方法使用机器学习来创建和分配给多个处理器的若干子问题，使得每个处理器的工作负载尽量接近。此外，我们开发了一组新特征，用以预测子问题的节点个数，从而预测该子问题的求解难度。实验结果表明，该方法可以有效地平衡了几个处理器之间的负载。

在这项工作中，监督学习框架被用来预测子问题节点的数量，以估计问题的难度。为了提高监督学习算法的精度，本文设计了五类特征。每个类别用于表示问题的不同动态。此外，该算法中还采用了一个打分函数来确定那个变量是最后的分支节点，以便通过该节点扩展当前树。

最后算法通过一个贪婪的策略，来为每个处理器分配子问题。--zhenkun

Optimization as a model for few-shot learning, Optimization on a Budget: A Reinforcement Learning Approach, Learned Optimizers that Scale and Generalize

Twitter, ICLR 2017; NIPS 2009; ICML 2017

这几篇文章共同的背景都是meta-learning。可以与另外的几个文章放到一起来看：MAML：Model-agnostic meta-learning for fast adaptation of deep networks，Reptile:Reptile: a Scalable Metalearning Algorithm，以及阿里的 Few-Shot Text Classification with Induction Network。

Meta-learning的本质是“learning to learn”（最早是Schmidhuber在1987年提出的概念，但是在2016年以后Learnining to learning by gradient descent以后才开始逐渐成为更广泛的研究内容），为了实现这个目的，人们会训练一系列的任务，来作为共同的training set，在每个任务上，又会有针对自己的support set (training)和query set(test)。因此，全部的support 和query构成了training set。计算框架是：利用support set更新建议的梯度theta, 而后利用总的loss（包含query）更新phi,而后沿着不同的task逐一进行，最后迭代到收敛。简单说，可以理解成一个复合函数F(f)的求导的chain rule的过程： F对应整体的loss，就是我们要学习的终极目的，每个support set针对f进行更新，而后利用query更新F。

结合optimization的角度，我们可以这样理解learning to learn: 目的就是要让训练出来的模型代替优化过程，从而自动的完成优化。也就是说，在优化过程中不需要人工处理超参。但是这种优化与传统优化有一个区别：就是我们不是一定要让optimization在已知的问题（training set）上达到最好，而是要让这种优化过程在未知的问题上（test set）上达到最好。这几个文章大多数都沿着这个思路进行展开，只是在细节处理上各有不同：

Optimization on a Budget: A Reinforcement Learning Approach

该文章早于meta-learning这个概念的提出，但是具有同样的motivation。本文的主要目的是利用RL学习Levenberg Marquardt Algorithm (LMA)算法，这是一个解决非线性最小二乘问题的优化方法，是牛顿法的修正方法，主要解决当hessian矩阵不正定的时候，LMA保证了每次迭代的方向都是下降的方向。本文用Q-learning估计出了最优的更新过程。据我们所知，这是第一篇用Q-learning来估计二阶优化算法的。

Model-agnostic meta-learning for fast adaptation of deep networks

这是MAML的原文，也是目前应用范围最广的few-shot learning算法之一，few-shot learning可以理解为meta-learning的supervised版本。这也是最直接的实现meta-learning的工作，文章的主要目的放在了找到最好的初始化phi上。

Reptile: a Scalable Metalearning Algorithm

这是Reptile的原文，是对MAML最直接的改良版本。和 MAML 类似，Reptile 会学习神经网络的参数初始化phi，以使神经网络可使用少量新任务数据进行调整。但是 MAML 通过梯度下降算法的计算图来展开微分计算过程，而 Reptile 在每个任务中执行标准形式的随机梯度下降（SGD）：它不用展开计算图或计算任意二阶导数。因此 Reptile 比 MAML 所需的计算量和内存都更少。Reptile 的工作原理是使用泰勒级数逼近更新来分析的。Reptile 更新最大化同一任务中不同小批量的梯度内积，以改善泛化效果。

Optimization as a model for few-shot learning

本文也是一个把optimization看成学习过程的一个文章，与之前不同的是，本文将优化过程中的参数更新与LSTM中的cell更新联系到一起。具体来说，本文是采用 LSTM表示meta learner，用其状态表达目标分类器的参数的更新，最终学会如何在新的分类任务上，对分类器网络（learner）进行初始化和参数更新。这个优化算法同时考虑一个任务的短时知识和跨多个任务的长时知识（memory & cell，并且通过捕获所有任务之前共享的基础知识，进而更好地初始化learner。

Learned Optimizers that Scale and Generalize

本文与上文的区别主要在于依赖的网络模型不同，不再是依赖一个LSTM，而是构造了一个hierarchical RNN来作为meta learner. 用不同的tensor RNN在support set上训练，用一个普通的RNN完成在query上的更新。并且，构造了一个global RNN，在完成不同时刻的parameter sharing的同时，协调各个tensor RNN的参数更新（协调方法是：该global RNN的输入是所有的hidden state的平均值，输出是biased terms）。构造层级的RNN的好处是使得每个RNN都不用很大，还能在support set, query set以外获取inter-tensor/inter-parameter的dependecies.

近两年few-shot, one-shot以及zero-shot（可以理解为不仅学习了过程，还引入了推理）发展很快，有很多模型和结构为transfer, generalization等问题服务。 --Huiling

A taxi order dispatch model based on combinatorial optimization.

Zhang, Lingyu, et al. "A taxi order dispatch model based on combinatorial optimization." Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2017.

This paper proposed a order dispatch model which has been deployed in the online system of Didi Chuxing. The objective of the model is to maximize the global success rate which denotes the rate of an order is accepted by drives. In this way, they could optimize the overall traffic efficiency and user experience.

They formulate the order dispatch model as a combinatorial optimization problem, in which the first step is to estimate the probability of a drive accepting an order. They use logistic regression model to predict the probability of driver accepting an order based on amounts of historic pair data (order, driver). Based on the probability matrix they predict, they proceed to construct an combinatorial optimization problem to select the best suitable driver for each order, in order to improve the order success rate.

Besides, they also develop a destination prediction model to predict a destination list when a user open Didi App. They formulate the problem using the Bayesian framework based on user's historical data including departure time, latitude, longitude of departure place.

-- Xijun

LSTOD: LATENT SPATIAL-TEMPORAL ORIGINDESTINATION PREDICTION MODEL AND ITS APPLICATIONS IN RIDE-SHARING PLATFORMS

Overall: Weakly Accept

Summary: This paper proposes a latent spatial-temporal origin-destination model to address the OD flow prediction problem.

The main contributions are summarized as follows:

The authors propose a purely convolutional framework to learn both short-term and long-term spatio-temporal features simultaneously from dynamic origin-destination flow data.
The authors propose a novel SACN architecture to capture the relevance of OD flows by modeling each OD flow map as an adjacency matrix.
The authors design a periodically shift attention mechanism to model the long-term periodicity.
The results demonstrate that the proposed model outperforms state-of-the-art methods in OD flow prediction, with 6.5% to 15.0% improvement of testing RMSE.

However, my major concerns are as follows:

On Page 4, the authors explain the reasons why they use TGCNN instead of RNN-based architectures to capture the temporal representation. However, it would be more convincing if quantitative analysis or empirical results are provided.
On Page 4, readers might be confused with the symbols in the formulation (3) that are not defined clearly.
On Page 7, the experiment only considers one metric, i.e., RMSE. The efficiency comparisons between the proposed model and the baselines are missing.

-- Xijun

Large-scale order dispatch in on-demand ride-hailing platforms: A learning and planning approach

Zhe Xu et al. "Large-scale order dispatch in on-demand ride-hailing platforms: A learning and planning approach" KDD, 2018

This paper proposes a novel framework to solve the order dispatching problem, that contains a off-line learning module and an on-line planning module. The objective of the framework is to maximize the gross merchandise volume of the whole system.

In the off-line learning module, the environment is modeled as a Markov Decision Process, in which the learning objective is to maximize the accumulated revenue of the driver. To achieve this target, a tabular-RL method based on dynamic programming is applied to collect the state value function for all states.

According to the learned state value function, the on-line planning process can be finished via formulating order dispatching as a problem of bipartite-graph matching and KM algorithm is used to solve the problem.

The experiment shows that the presented framework has a relatively good performance on measurements such as total revenue, pickup distance and answer rate.

-- Weilin

Deep Reinforcement Learning with Knowledge Transfer for Online Rides Order Dispatching

Zhaodong Wang et al. "Deep Reinforcement Learning with Knowledge Transfer for Online Rides Order Dispatching" ICDM, 2018

This paper evaluates the revenue (value) of each pair of state-action through DRL techniques and in the basis of that, the order dispatching problem is formulated as a combinatorial optimization issue, which could be solved by some heuristic algorithms.

Specifically, the action space of the MDP is dynamics, which means the number of optional actions for each agent is different and the definition of the state is continuous. Thus, it is hard to obtain tha tabular Q value from the enormous state-action space. To handle the problem, this paper invents a improved DQN that take both state and action as input of the function estimator, and gives a action search strategy to generate action space for training the DQN.

In addition, three transfer learning techniques (fine-tuning, progressive network and CFPT) are introduced to achieve the transferring of knowledge from source city to other target cities. The main idea is to reuse the parameters of the branch trained by features that are adaptable for target city while training its network.

-- Weilin

Efficient Ridesharing Order Dispatching with Mean Field Multi-Agent Reinforcement Learning

Minne Li et al. "Efficient Ridesharing Order Dispatching with Mean Field Multi-Agent Reinforcement Learning" World Wide Web, 2019

This paper also model the order dispatching issue as a MDP and solve it based on the Actor-Critic model (that is independent order dispatching algorithm). Aiming at the problem of the changing action set, the Actor module takes each optional action with the observation of agents as input and generate a set of ranking values to choose an order.

Furthermore, to model other agent's policies, a mean field approximation is introduced following the mean field reinforcement learning (MFRL), in which the scalability issue in M-A RL with a large amount of agents is addressed via an average response that can approximate the interaction among agents. Specifically, the average response in the order dispatching problem is defined as the number of drivers at the same neighborhood as the agent, divided by the number of available orders for this agent. In practice, this average response is also input into the critic with the action and observation to help obtain the more effective ranking value.

-- Weilin

Efficient Large-Scale Fleet Management via Multi-Agent Deep Reinforcement Learning

Kaixiang Lin et al. "Efficient Large-Scale Fleet Management via Multi-Agent Deep Reinforcement Learning" KDD, 2018

This paper tries to maximize the gross merchandise volume of the platform by repositioning available vehicles to locations with larger demand-supply gap via deep reinforcement learning. Accordingly, the action set of each agent specifies where the agent is able to arrive at the next time interval and the reward is defined as the averaged revenue of all agents in the target grid, thus all the agents that are repositioned to the same grid would share one reward.

To solve the DRL problem, two types of contextual information are applied to limit the action space: geographic context and collaborative context. The first one is to reduce the grids that cannot be arrived such as lakes and rivers. The second one is to avoid that vehicles are scheduled in conflicting directions (i.e., vehicles from grid 1 are scheduled to grid 2, and vehicles from grid 2 are scheduled to grid 1). The contextual information is introduced into both DQN and Actor-Critic to improve the performance of the models.

-- Weilin

Optimizing Online Matching for Ride-Sourcing Services with Multi-Agent Deep Reinforcement Learning

Jintao Ke et al. "Optimizing Online Matching for Ride-Sourcing Services with Multi-Agent Deep Reinforcement Learning" IEEE, 2019

This paper has a different problem setting that it focus more on the passenger side and try to minimize passengers' pickup time through a dynamic control of delayed matching. In this work, a two-fold framework is constructed based on an assumption that a passenger may be matched with a much closer idle driver if his/her request is delayed for a few match time intervals. In the first step, the delayed time of each order is determined by DRL techniques and then, in the second step, optimal matching between idle drivers and unserved passengers is executed in the matching pool.

Specifically, the delayed matching problem is formulated as MDP and two types of rewards are defined: global reward and individual reward. To evaluate the value of each action, DQN and Actor-Critic are both trained in simulator. In the end of each time interval, orders that are decided to enter the matching pool would be matched with an available driver and this dispatching problem is formulated as a combinatorial optimization problem.

-- Weilin

Optimizing Taxi Carpool Policies via Reinforcement Learning and Spatio-Temporal Mining

Ishan Jindal et al. "Optimizing Taxi Carpool Policies via Reinforcement Learning and Spatio-Temporal Mining" Big Data, 2018

This paper attempts to tackle the issue of taxi carpooling via DRL. The target of this paper is maximizing transportation efficiency and thus reducing the traffic congestion because more orders are served by less vehicles. To achieve this target, the reward of each driver is defined as the effective distance, which is the sum of actual distances between the origin of the trips to the destination of individual trips. Besides that, each agent has three actions: waiting, non-carpool, carpool. Specifically, the top-level carpool decisions are made via DRL and the low-level details are executed based on some rules, including the path-choosing rule in the case of carpool. DQN is used in this work to solve the problem.

In addition, in order to estimate the travel time more precisely, the ST-NN architecture is proposed. This architecture contains two modules: Dist DNN Module and Time DNN Module. The Dist DNN Module takes the GPS coordinates of the origin and destination of an order and gives the estimated distance. As for the Time DNN Module, the activation of last hidden layer in the Dist DNN is fed into this module and the estimated travel time is obtained via a deep neural network.

-- Weilin

DEEPSIMPLEX: REINFORCEMENT LEARNING OF PIVOT RULES IMPROVES THE EFFICIENCY OF SIMPLEX ALGORITHM IN SOLVING LINEAR PROGRAMMING PROBLEMS

This paper was submitted to ICLR2020. It seems that the paper has been rejected due to its small experimnent scale.

First of all, we all know that one of popular method to solve linear programming is the simplex method. The authors of this paper use deep value-based reinforcement learning to learn a pivoting strategy (or called policy) taht at each iteration chooses between two of the most popular pivot rules (i.e., choosing a non-basci variable to become basis variables) -- Dantzig and steepest edge. ** The idea is so simple and straightforwad **. They optimize the choosing policy on a neural network designed for the simplex method. And they train the network using reinforcement learning and supervised learning respectively for LP relaxations of randomly generated instances of five-city traveling salesman problem. The experimental results show that 20% to 50% reduction in the gap between the learned strategy and the best possible omniscient polices.

-- Xijun

Name		Name	Last commit message	Last commit date
Latest commit History 125 Commits
notes		notes
papers		papers
LICENSE		LICENSE
README.md		README.md

License

xijunlee/Learning-to-Optimize-Arxiv

Folders and files

Latest commit

History

Repository files navigation

Learning to Optimize

2020 - Machine Learning for Computer Systems(ML4CS) - paper reading progress (opening)

LEARNING FINITE STATE REPRESENTATIONS OF RECURRENT POLICY NETWORKS

External vs. Internal: An Essay on Machine Learning Agents for Autonomous Database Management Systems

Self-Driving Database Management Systems

Automatic Database Management System Tuning Through Large-scale Machine Learning

Query-based Workload Forecasting for Self-Driving Database Management Systems

An End-to-End Automatic Cloud Database Tuning System Using Deep Reinforcement Learning

Learning Memory Access Patterns

Resource Management with Deep Reinforcement Learning

Towards Safe Online Reinforcement Learning in Computer Systems

Safe Reinforcement Learning via Shielding

2020 - Learning to Opitimize (Solver & Logistics) - paper reading progress (opening)

2019 - Learning to Optimize - paper reading progress

CO+ML Survey Bengio 2018

Learning to Branch

Learning to Search in Branch and Bound Algorithm

Learning to Branch in Mixed Integer Programming

Learning when to use a decomposition

Best arm identification in multi-armed bandits with delayed feedback

Learning Combinatorial Optimization Algorithms over Graphs

Exact Combinatorial Optimization with Graph Convolutional Neural Networks

Solving a New 3D Bin Packing Problem with Deep Reinforcement Learning Method

A Multi-task Selected Learning Approach for Solving 3D Flexible Bin Packing Problem

Pointer Networks

NEURAL COMBINATORIAL OPTIMIZATION WITH REINFORCEMENT LEARNING

Reinforcement Learning for Solving the Vehicle Routing Problem

Reinforcement Learning for Integer Programming: Learning to Cut

ATTENTION, LEARN TO SOLVE ROUTING PROBLEMS!

Learning to Run Heuristics in Tree Search

Learning a SAT Solver from Single-Bit Supervision

Attention Solves Your TSP, Approximately

Learning Permutations with Sinkhorn Policy Gradient

Learning Fast Optimizers for Contextual Stochastic Integer Programs

Predicting Tactical Solutions to Operational Planning Problems under Imperfect Information

Generalized inverse multiobjective optimization with application to cancer therapy

Combinatorial Optimization with Graph Convolutional Networks and Guided Tree Search

Online Learning for Strong Branching Approximation in Branch-and-Bound

A Machine Learning-Based Approximation of Strong Branching

Machine Learning to Balance the Load in Parallel Branch-and-Bound

Optimization as a model for few-shot learning, Optimization on a Budget: A Reinforcement Learning Approach, Learned Optimizers that Scale and Generalize

Optimization on a Budget: A Reinforcement Learning Approach

Model-agnostic meta-learning for fast adaptation of deep networks

Reptile: a Scalable Metalearning Algorithm

Optimization as a model for few-shot learning

Learned Optimizers that Scale and Generalize

A taxi order dispatch model based on combinatorial optimization.

LSTOD: LATENT SPATIAL-TEMPORAL ORIGINDESTINATION PREDICTION MODEL AND ITS APPLICATIONS IN RIDE-SHARING PLATFORMS

Large-scale order dispatch in on-demand ride-hailing platforms: A learning and planning approach

Deep Reinforcement Learning with Knowledge Transfer for Online Rides Order Dispatching

Efficient Ridesharing Order Dispatching with Mean Field Multi-Agent Reinforcement Learning

Efficient Large-Scale Fleet Management via Multi-Agent Deep Reinforcement Learning

Optimizing Online Matching for Ride-Sourcing Services with Multi-Agent Deep Reinforcement Learning

Optimizing Taxi Carpool Policies via Reinforcement Learning and Spatio-Temporal Mining

DEEPSIMPLEX: REINFORCEMENT LEARNING OF PIVOT RULES IMPROVES THE EFFICIENCY OF SIMPLEX ALGORITHM IN SOLVING LINEAR PROGRAMMING PROBLEMS

About

Resources

License

Stars

Watchers

Forks