
gambler's problem #39

Closed
datahaki opened this issue Apr 30, 2017 · 4 comments

@datahaki (Contributor) commented Apr 30, 2017

There is a mistake in the book:
for the gambler's problem in Chapter 4, there is no unique optimal policy.
The plot of optimal actions is missing other equally valuable actions.

For example:

  • in state 51, actions {1, 49} are equally valuable;
  • in state 64, the optimal actions are {11, 14, 36}.

There are no floating point problems in your implementation.

The correct solution (I hope) is displayed in the images on the page
https://github.com/idsc-frazzoli/subare

Suggestion: prohibit the gambler from betting an amount of 0 as long as his/her capital is strictly between 0 and 100. This will speed up convergence.
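
A minimal sketch of what I mean, in Python (the function and variable names here are hypothetical, not taken from your code):

```python
# Hypothetical sketch: the gambler must stake at least 1 unit whenever
# his/her capital is strictly between 0 and 100.
def allowed_stakes(capital, goal=100):
    # a stake of 0 would leave the state unchanged and only slows convergence,
    # so the smallest allowed stake is 1; the largest is bounded by the capital
    # itself and by the amount still needed to reach the goal
    return range(1, min(capital, goal - capital) + 1)

print(list(allowed_stakes(3)))   # [1, 2, 3]
print(list(allowed_stakes(97)))  # [1, 2, 3]
```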

@ShangtongZhang (Owner)

The book mentions that

there is a whole family of optimal policies, all corresponding to ties

I just wonder how you managed to make the policy figure exactly the same as in the book. It would be much appreciated if you could briefly explain how you select actions when there is a tie.

@datahaki (Contributor, Author) commented Apr 30, 2017

Under Figure 4.3 it states: "In particular, for capital of 50 it bets it all on one flip, but for capital of 51 it does not." But this is misleading, since it is perfectly fine, i.e. optimal, to bet 49 while in state 51. And:
"Why does the optimal policy for the gambler’s problem have such a curious form?"

there is a whole family of optimal policies, all corresponding to ties

You are right. I overlooked this remark and focused on the caption of the figure and the remark in your code. Your code doesn't reproduce the policy plotted in the book, but it still obtains a different, equally optimal policy.

Regarding the implementation:
In the value iteration, nothing changes. The value iteration simply computes the value (a real number) for each state 0, 1, ..., 100.
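
As a minimal sketch of such a sweep in Python (assuming p_h = 0.4, a reward of +1 only on reaching 100, a stopping threshold of 1e-9, and stakes of at least 1 as suggested above; none of these constants are taken from your code):

```python
import numpy as np

p_h = 0.4                       # assumed probability of heads
state_value = np.zeros(101)     # states 0..100
state_value[100] = 1.0          # winning terminal state, value 1

# plain value iteration: repeat full sweeps until the values stop changing
while True:
    delta = 0.0
    for s in range(1, 100):
        stakes = np.arange(1, min(s, 100 - s) + 1)
        returns = (p_h * state_value[s + stakes]
                   + (1 - p_h) * state_value[s - stakes])
        best = returns.max()
        delta = max(delta, abs(best - state_value[s]))
        state_value[s] = best
    if delta < 1e-9:
        break
```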

However, the arg max used when establishing the greedy policy is often not unique. That means that if the maximum occurs for multiple actions, all of these actions can be assigned a non-zero probability in a greedy policy.

In my implementation, the class GreedyPolicy collects precisely all of these possible actions. Also, notice the class FairArgMax.
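
In Python, the idea behind FairArgMax and GreedyPolicy could be sketched roughly like this (the tolerance and the function names are my own assumptions, not the actual classes):

```python
import numpy as np

def fair_arg_max(values, tol=1e-9):
    """Return the indices of all entries within tol of the maximum."""
    values = np.asarray(values, dtype=float)
    return np.flatnonzero(values >= values.max() - tol)

def greedy_actions(s, state_value, p_h=0.4, tol=1e-9):
    """All stakes that attain the maximal expected return in state s."""
    stakes = np.arange(1, min(s, 100 - s) + 1)
    returns = (p_h * state_value[s + stakes]
               + (1 - p_h) * state_value[s - stakes])
    return stakes[fair_arg_max(returns, tol)]
```

With a converged state_value (e.g. from the sweep sketched above), greedy_actions(51, state_value) should yield {1, 49}, matching the list below.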

Below are the greedy actions for each state. (I hope they are correct...)

0 -> {0}
1 -> {1}
2 -> {2}
3 -> {3}
4 -> {4}
5 -> {5}
6 -> {6}
7 -> {7}
8 -> {8}
9 -> {9}
10 -> {10}
11 -> {11}
12 -> {12}
13 -> {12, 13}
14 -> {11, 14}
15 -> {10, 15}
16 -> {9, 16}
17 -> {8, 17}
18 -> {7, 18}
19 -> {6, 19}
20 -> {5, 20}
21 -> {4, 21}
22 -> {3, 22}
23 -> {2, 23}
24 -> {1, 24}
25 -> {25}
26 -> {1, 24, 26}
27 -> {2, 23, 27}
28 -> {3, 22, 28}
29 -> {4, 21, 29}
30 -> {5, 20, 30}
31 -> {6, 19, 31}
32 -> {7, 18, 32}
33 -> {8, 17, 33}
34 -> {9, 16, 34}
35 -> {10, 15, 35}
36 -> {11, 14, 36}
37 -> {12, 13, 37}
38 -> {12, 38}
39 -> {11, 39}
40 -> {10, 40}
41 -> {9, 41}
42 -> {8, 42}
43 -> {7, 43}
44 -> {6, 44}
45 -> {5, 45}
46 -> {4, 46}
47 -> {3, 47}
48 -> {2, 48}
49 -> {1, 49}
50 -> {50}
51 -> {1, 49}
52 -> {2, 48}
53 -> {3, 47}
54 -> {4, 46}
55 -> {5, 45}
56 -> {6, 44}
57 -> {7, 43}
58 -> {8, 42}
59 -> {9, 41}
60 -> {10, 40}
61 -> {11, 39}
62 -> {12, 38}
63 -> {12, 13, 37}
64 -> {11, 14, 36}
65 -> {10, 15, 35}
66 -> {9, 16, 34}
67 -> {8, 17, 33}
68 -> {7, 18, 32}
69 -> {6, 19, 31}
70 -> {5, 20, 30}
71 -> {4, 21, 29}
72 -> {3, 22, 28}
73 -> {2, 23, 27}
74 -> {1, 24, 26}
75 -> {25}
76 -> {1, 24}
77 -> {2, 23}
78 -> {3, 22}
79 -> {4, 21}
80 -> {5, 20}
81 -> {6, 19}
82 -> {7, 18}
83 -> {8, 17}
84 -> {9, 16}
85 -> {10, 15}
86 -> {11, 14}
87 -> {12, 13}
88 -> {12}
89 -> {11}
90 -> {10}
91 -> {9}
92 -> {8}
93 -> {7}
94 -> {6}
95 -> {5}
96 -> {4}
97 -> {3}
98 -> {2}
99 -> {1}
100 -> {0}

@ShangtongZhang (Owner)

Oh, got it: you just plot all the actions when there is a tie rather than randomly selecting one. It makes the figure look nicer. Thanks for your reply!

@datahaki (Contributor, Author) commented Apr 30, 2017

Thank you for your awesome code and repository!
