
Reinforcement Learning

Small projects to delve into reinforcement learning. Following Sutton & Barto's Reinforcement Learning: An Introduction.

Developments

Basics (17 May 2021)

  1. Install dependencies from requirements.txt
  2. Run the random search algorithm on the CartPole environment. It works well due to the simplicity of the environment: only 2 possible actions and 4 state parameters to consider (a minimal sketch follows this list).
  3. Perform hill climbing and policy gradient methods on the CartPole environment. Both work well.
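
A minimal random-search sketch for step 2, assuming the classic pre-gymnasium gym API (env.step() returning a 4-tuple); the environment name, number of candidate policies, and weight range are illustrative rather than taken from the repo.

```python
# Random search on CartPole: sample random linear policies and keep the best.
# Assumes the classic gym API (obs = env.reset(); obs, reward, done, info = env.step(a)).
import gym
import numpy as np

def run_episode(env, weights):
    """Run one episode with a fixed linear policy and return its total reward."""
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        # Linear policy over the 4 state parameters: push right if the
        # weighted sum is positive, otherwise push left.
        action = int(np.dot(state, weights) > 0)
        state, reward, done, _ = env.step(action)
        total_reward += reward
    return total_reward

env = gym.make("CartPole-v0")
best_weights, best_reward = None, -np.inf
for _ in range(1000):                                # candidate policies
    weights = np.random.uniform(-1.0, 1.0, size=4)   # random linear weights
    reward = run_episode(env, weights)
    if reward > best_reward:
        best_weights, best_reward = weights, reward
print("Best episode reward:", best_reward)
```

With only four weights and two actions, random guessing of a linear policy typically reaches the episode cap within a few hundred candidates, which is why the method works well here.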

Dynamic Programming (22 May 2021)

  1. Run policy iteration and value iteration on the FrozenLake environment. Both methods are suitable for the 4×4 and 8×8 cases (see the value-iteration sketch after this list). Random search does not work at all, with an abysmal sub-1% success rate.
  2. Run value iteration on the biased coin gambling problem. Policy iteration fails to converge, as can be seen in my notebook. There appears to be a bug in the policy evaluation function that I have not been able to pinpoint yet.
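
A minimal value-iteration sketch along the lines of step 1, assuming gym's FrozenLake exposes its transition model as env.unwrapped.P[state][action] (a list of (probability, next_state, reward, done) tuples); the discount factor and convergence threshold are illustrative.

```python
# Value iteration on FrozenLake using the environment's known transition model.
import gym
import numpy as np

def one_step_lookahead(env, s, V, gamma):
    """Expected return of every action from state s under the current value estimate V."""
    return [sum(p * (r + gamma * V[s_next])
                for p, s_next, r, _ in env.unwrapped.P[s][a])
            for a in range(env.action_space.n)]

def value_iteration(env, gamma=0.99, threshold=1e-4):
    n_states = env.observation_space.n
    V = np.zeros(n_states)
    while True:
        # Bellman optimality backup applied to every state.
        V_new = np.array([max(one_step_lookahead(env, s, V, gamma))
                          for s in range(n_states)])
        if np.max(np.abs(V_new - V)) < threshold:
            V = V_new
            break
        V = V_new
    # Extract the greedy policy from the converged value function.
    policy = np.array([int(np.argmax(one_step_lookahead(env, s, V, gamma)))
                       for s in range(n_states)])
    return V, policy

env = gym.make("FrozenLake-v1")   # "FrozenLake8x8-v1" for the larger grid
V, policy = value_iteration(env)
```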

Monte Carlo (25 May 2021)

  1. Estimate the value of pi using the dots-in-a-square method. Very basic Monte Carlo.
  2. Estimate the reward value of the DP optimal policy using first-visit and every-visit policy evaluation. The estimated values are very close to the exact DP values (a first-visit sketch follows this list).
  3. Find the optimal policy for blackjack, a game with a large state space, using on-policy and off-policy Monte Carlo control.
    • For on-policy Monte Carlo control, exploring starts and ε-greedy soft policies yield good results, with ε-greedy outperforming exploring starts over more iterations.
    • In off-policy Monte Carlo control, incremental Q-function updates greatly increase scalability and reduce space complexity compared to the naive implementation. In addition, weighted importance sampling improves accuracy and yields consistent results compared to the high variance seen with ordinary importance sampling.
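
As a rough companion to item 2, here is a minimal first-visit Monte Carlo policy-evaluation sketch. It assumes a discrete gym environment with the classic step() signature and a fixed deterministic policy given as a mapping from state to action; the episode count, discount factor, and placeholder policy are illustrative.

```python
# First-visit Monte Carlo policy evaluation: average the returns observed on
# the first visit to each state across many sampled episodes.
from collections import defaultdict

import gym

def first_visit_mc_evaluation(env, policy, n_episodes=10000, gamma=1.0):
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = defaultdict(float)
    for _ in range(n_episodes):
        # Generate one episode by following the fixed policy.
        episode, done = [], False
        state = env.reset()
        while not done:
            next_state, reward, done, _ = env.step(policy[state])
            episode.append((state, reward))
            state = next_state
        # Walk the episode backwards, accumulating the return G. Overwriting
        # the dict entry on every visit means the value that survives is the
        # return from the earliest (first) visit to that state.
        G, first_visit_return = 0.0, {}
        for state, reward in reversed(episode):
            G = gamma * G + reward
            first_visit_return[state] = G
        for state, G in first_visit_return.items():
            returns_sum[state] += G
            returns_count[state] += 1
            V[state] = returns_sum[state] / returns_count[state]
    return V

# Example: evaluate a (placeholder) deterministic policy on FrozenLake; in the
# notebooks the policy would come from the DP section's value iteration.
env = gym.make("FrozenLake-v1")
policy = {s: 0 for s in range(env.observation_space.n)}
V = first_visit_mc_evaluation(env, policy)
```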

Temporal Difference Learning (30 May 2021)

  1. Find the optimal policy of the Cliff Walking environment using Q-learning. A simple problem that Q-learning solved in relatively few episodes.
  2. Find the optimal policy of Windy GridWorld, a similar environment to Cliff Walking but with additional wind displacement. SARSA worked great here, with Q-learning showing similar results.
  3. For the taxicab problem, three methods were tried (a tabular Q-learning sketch follows this list):
  • Q-Learning converged with a relatively high error rate, as can be seen in the graphs. Grid search would probably reduce the error further; my attempts did not yield significant improvement even after stepwise parameter reductions past 400 episodes and longer episode lengths.
  • SARSA had similar problems, but to a lesser extent. Grid search resulted in an average increase in reward of +12.
  • Double Q-Learning worked far better than Q-Learning as it did not overestimate action values as much. However, it tended to somewhat underestimate action values and so converged more slowly. Nonetheless, its episode rewards converged better than Q-Learning's (though SARSA with grid search still performed best). All of these methods depend greatly on hyperparameter selection.
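
As a rough illustration of the tabular TD methods above, here is a minimal Q-learning sketch, assuming a discrete gym environment such as Taxi-v3 with the classic step() signature; the hyperparameters are illustrative and, as noted above, matter a great deal.

```python
# Tabular Q-learning: epsilon-greedy behaviour policy, greedy (max) bootstrap target.
import gym
import numpy as np

def q_learning(env, n_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(n_episodes):
        state, done = env.reset(), False
        while not done:
            # Epsilon-greedy action selection.
            if np.random.rand() < epsilon:
                action = env.action_space.sample()
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done, _ = env.step(action)
            # Off-policy TD update: bootstrap from the best next action,
            # and drop the bootstrap term at terminal states.
            td_target = reward + gamma * np.max(Q[next_state]) * (not done)
            Q[state, action] += alpha * (td_target - Q[state, action])
            state = next_state
    return Q

Q = q_learning(gym.make("Taxi-v3"))
```

SARSA differs only in the target: it bootstraps from the action actually chosen by the epsilon-greedy policy in the next state, while Double Q-Learning keeps two Q-tables and uses one to select and the other to evaluate the bootstrap action.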

Multi-Arm Bandits (6 July 2021)

  1. Solve multi-arm bandit (slot machines with probabilistic rewards) problems using random, ε-greedy, and softmax policies. Here, the ε-greedy policy converged the fastest (a sketch follows this list).
  2. Solve a simplified webpage ad placement problem using Upper Confidence Bound (UCB) and Thompson Sampling. In this case, contextual bandits (one bandit leading to another) were used, which increased both the information available and the complexity of the problem.
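
A minimal ε-greedy sketch on a Bernoulli multi-arm bandit, in the spirit of item 1; the payout probabilities, step count, and ε are illustrative and not taken from the repo.

```python
# Epsilon-greedy on a Bernoulli multi-arm bandit with incremental mean updates.
import numpy as np

def epsilon_greedy_bandit(payout_probs, n_steps=10000, epsilon=0.1):
    n_arms = len(payout_probs)
    Q = np.zeros(n_arms)       # running estimate of each arm's mean payout
    counts = np.zeros(n_arms)  # number of pulls per arm
    total_reward = 0.0
    for _ in range(n_steps):
        if np.random.rand() < epsilon:
            arm = np.random.randint(n_arms)   # explore a random arm
        else:
            arm = int(np.argmax(Q))           # exploit the current best arm
        reward = float(np.random.rand() < payout_probs[arm])  # Bernoulli payout
        counts[arm] += 1
        # Incremental mean keeps memory and per-step computation constant.
        Q[arm] += (reward - Q[arm]) / counts[arm]
        total_reward += reward
    return Q, total_reward / n_steps

Q, avg_reward = epsilon_greedy_bandit([0.2, 0.5, 0.8])
```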

Function Approximation (10 July 2021)

  1. Estimate Q-values using linear function approximation, reducing the dimensionality of the problem. The linear model is first implemented as a single fully connected layer (`nn.Linear`), then improved to a shallow neural network built with `nn.Sequential` and ReLU non-linearities (`nn.ReLU()`).
  2. The approximate Q-values are then used to solve continuous-state environments like CartPole and MountainCar. Q-learning is further improved with experience replay, which learns from a randomly sampled batch of stored transitions instead of the entire episode in order (a sketch follows this list).
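
A minimal sketch combining the two items above: a shallow Q-network built with `nn.Sequential`/`nn.ReLU` plus one experience-replay update step. The layer sizes, buffer capacity, and hyperparameters are illustrative and not the repo's exact implementation.

```python
# Shallow Q-network with an experience replay buffer (random minibatch updates).
import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn

n_states, n_actions, n_hidden = 4, 2, 64   # e.g. CartPole dimensions

# Shallow network: one hidden layer with a ReLU non-linearity.
q_network = nn.Sequential(
    nn.Linear(n_states, n_hidden),
    nn.ReLU(),
    nn.Linear(n_hidden, n_actions),
)
optimizer = torch.optim.Adam(q_network.parameters(), lr=1e-3)

# Buffer of past transitions: (state, action, reward, next_state, done).
replay_buffer = deque(maxlen=10000)

def replay_update(batch_size=32, gamma=0.99):
    """Sample a random batch of stored transitions and take one TD step."""
    if len(replay_buffer) < batch_size:
        return
    states, actions, rewards, next_states, dones = zip(*random.sample(replay_buffer, batch_size))
    states = torch.tensor(np.array(states), dtype=torch.float32)
    next_states = torch.tensor(np.array(next_states), dtype=torch.float32)
    actions = torch.tensor(actions, dtype=torch.int64).unsqueeze(1)
    rewards = torch.tensor(rewards, dtype=torch.float32)
    dones = torch.tensor(dones, dtype=torch.float32)
    # TD targets bootstrap from the best next-state Q-value; terminal states
    # contribute only their immediate reward.
    with torch.no_grad():
        targets = rewards + gamma * q_network(next_states).max(dim=1).values * (1 - dones)
    predictions = q_network(states).gather(1, actions).squeeze(1)
    loss = nn.functional.mse_loss(predictions, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Sampling transitions at random breaks the strong correlation between consecutive states within an episode, which is the main reason experience replay stabilizes learning.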