#### Comments on Multi-Armed Bandit Algorithms

##### Questions to ask yourself when planning to deploy bandit algorithms
- How sure are you that you won't subtly corrupt your deployment code?
- How many different tests are you planning to run simultaneously? Will these tests interfere with each other? Will starting a new test while another one is already running corrupt its results?
- How long do you plan to run your tests?
- How many users are you willing to expose to non-preferred versions of your site?
- How well-chosen is your metric of success?
- How are the arms you are measuring related to one another?
- What additional information about context do you have when choosing arms? Do you have demographics based on browser information? Does your site have access to external information about people's tastes in products you might advertise to them?
- How much traffic does your site receive? Is the system you are building going to scale up? How much traffic can your algorithm handle before it starts to slow your site down?
- How much will you have to distort the setup we have introduced when you admit that visitors to real websites are concurrent and are not arriving sequentially as in our simulations?



##### A/A Testing
- Even if you try A/A testing and do not find any worrying issues, this approach provides a useful way to estimate the actual variability in your data before trying to decide whether the differences found by a bandit algorithm are real.
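
To make the idea of "actual variability" concrete, here is a minimal sketch of a simulated A/A test. It assumes binary conversions and a known true rate; the function name `simulate_aa_differences` and all parameter values are hypothetical and chosen only for illustration. Because both arms are identical, any difference it reports is sampling noise.

```python
import random
import statistics

def simulate_aa_differences(true_rate, visitors_per_arm, n_trials, seed=0):
    """Simulate repeated A/A tests and return the observed rate differences.

    Both arms share the same true conversion rate, so every nonzero
    difference is pure sampling noise.
    """
    rng = random.Random(seed)
    differences = []
    for _ in range(n_trials):
        conv_a = sum(rng.random() < true_rate for _ in range(visitors_per_arm))
        conv_b = sum(rng.random() < true_rate for _ in range(visitors_per_arm))
        differences.append((conv_a - conv_b) / visitors_per_arm)
    return differences

# Hypothetical setup: 5% conversion rate, 1,000 visitors per arm, 500 repeats.
diffs = simulate_aa_differences(true_rate=0.05, visitors_per_arm=1000, n_trials=500)
noise_sd = statistics.stdev(diffs)
print(f"std. dev. of A/A differences: {noise_sd:.4f}")
```

A difference between real arms that is no larger than this noise level is probably not worth acting on yet.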

##### Running Concurrent Experiments
- The best solution is simple: _**try your best to keep track of all of the experiments each user is a part of and include this information in analyses of any single experiment.**_
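
One way this bookkeeping might look is sketched below. It is only an illustration under simple assumptions: assignments live in an in-memory dictionary, and the names `record_assignment`, `mean_reward_by_combination`, and the experiment and arm labels are all hypothetical.

```python
from collections import defaultdict

# user_id -> {experiment_name: arm_name}; every name here is hypothetical.
assignments = defaultdict(dict)

def record_assignment(user_id, experiment, arm):
    """Remember which arm of which experiment a user was exposed to."""
    assignments[user_id][experiment] = arm

def mean_reward_by_combination(target_experiment, outcomes):
    """Average reward for users in target_experiment, grouped by the full
    set of (experiment, arm) pairs each user belongs to."""
    groups = defaultdict(list)
    for user_id, reward in outcomes.items():
        arms = assignments.get(user_id, {})
        if target_experiment not in arms:
            continue  # user never saw the experiment being analysed
        groups[tuple(sorted(arms.items()))].append(reward)
    return {key: sum(r) / len(r) for key, r in groups.items()}

record_assignment("u1", "button_color", "red")
record_assignment("u1", "headline", "short")
record_assignment("u2", "button_color", "red")
record_assignment("u2", "headline", "long")
record_assignment("u3", "button_color", "blue")
record_assignment("u4", "headline", "long")  # not part of button_color

outcomes = {"u1": 1.0, "u2": 0.0, "u3": 1.0, "u4": 1.0}
result = mean_reward_by_combination("button_color", outcomes)
```

Grouping by the full combination makes it visible when the "red" arm only looks good alongside one particular headline.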

##### Continuous Experimentation Vs. Periodic Testing

- Bandit algorithms look much better than A/B testing when you are willing to let them run for a very long time. If you are willing to have your site perpetually be in a state of experimentation, bandit algorithms will be many times better than A/B testing.

- A/B testing's preference for balancing people across arms can be advantageous if you are not going to gather a lot of data.

##### Bad Metrics of Success
- Monitoring many different metrics you think are important to your business is probably the best thing to do.

##### Scaling Problems with Good Metrics of Success
- Even if you have a good metric of success, like the total amount of purchases made by a client over a period of a year, the algorithms described in this book may not work well unless you rescale those metrics into the 0-1 space we have used in our examples. Some of the algorithms are numerically unstable, especially the softmax algorithm, which will break down if you start trying to calculate things like exp(10000.0). You need to make sure that you have scaled the rewards in your problem into a range in which the algorithms will be numerically stable. If you can, try to use the 0-1 scale we have used, which is, as we briefly noted earlier, an absolute requirement if you plan on using the UCB algorithm.
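
The sketch below illustrates both points. `rescale` maps raw rewards onto the 0-1 scale, assuming you know plausible lower and upper bounds for the metric (the bounds used here are made up). `softmax_probs` additionally subtracts the largest value before exponentiating; that is a common numerical-stability trick added here as an assumption, not something these notes attribute to the book. Both function names are hypothetical.

```python
import math

def rescale(reward, low, high):
    """Map a reward from [low, high] onto [0, 1], clipping values outside the range."""
    if high <= low:
        raise ValueError("high must be greater than low")
    return min(1.0, max(0.0, (reward - low) / (high - low)))

def softmax_probs(values, temperature):
    """Softmax probabilities computed after subtracting the largest scaled value.

    Subtracting the maximum leaves the probabilities unchanged but keeps
    every exponent <= 0, so math.exp cannot overflow.
    """
    scaled = [v / temperature for v in values]
    top = max(scaled)
    weights = [math.exp(s - top) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

# A naive math.exp(10000.0) raises OverflowError in Python.
try:
    math.exp(10000.0)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

# The shifted version handles the same unscaled values without trouble.
probs = softmax_probs([10000.0, 9999.0], temperature=1.0)

# Hypothetical bounds: yearly purchases between 0 and 1000 currency units.
scaled_reward = rescale(250.0, low=0.0, high=1000.0)
```

Rescaling is still the safer default, since UCB-style confidence bounds assume rewards in the 0-1 range regardless of how the exponentials are computed.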

##### Intelligent Initialization of Values
- Smart initialization of arm values is very important.
- First, you can use the historical metrics for the control arm in your bandit algorithm.
- Second, you can use the amount of historical data you have to calibrate how much the algorithm thinks you know about the historical options. For an algorithm like UCB1, that will strongly encourage the algorithm to explore new options until the algorithm has some confidence about their worth relative to tradition.
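
As a rough sketch of both suggestions, the code below seeds a control arm with a historical mean and a historical play count, and then computes UCB1-style scores using the usual bonus sqrt(2 ln(total) / count). The helper names, the 4% historical rate, and the 5,000 historical plays are hypothetical values for illustration only.

```python
import math

def initialize_from_history(n_arms, control_arm, historical_mean, historical_count):
    """Return (counts, values) with the control arm seeded from historical data.

    historical_count acts as the number of plays the algorithm believes it
    has already seen for the control arm; the new arms start with no data.
    """
    counts = [0] * n_arms
    values = [0.0] * n_arms
    counts[control_arm] = historical_count
    values[control_arm] = historical_mean
    return counts, values

def ucb1_scores(counts, values):
    """UCB1 score per arm; unplayed arms get infinity so they are tried first."""
    total = sum(counts)
    return [
        float("inf") if c == 0 else v + math.sqrt(2 * math.log(total) / c)
        for c, v in zip(counts, values)
    ]

# Hypothetical history: the control arm converted at 4% over 5,000 plays.
counts, values = initialize_from_history(3, control_arm=0,
                                         historical_mean=0.04,
                                         historical_count=5000)
initial_scores = ucb1_scores(counts, values)

# Pretend each new arm has now been played once with no reward.
counts[1], counts[2] = 1, 1
later_scores = ucb1_scores(counts, values)
```

Because the control arm carries a large count, its confidence bonus is tiny, so the new arms keep the higher scores until they have accumulated enough plays of their own.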

##### Running Better Simulations
- In addition to initializing your algorithm using prior information you have before deploying a bandit algorithm, you can often run much better simulations if you use historical information to build them. In this book, we have used a toy Monte Carlo simulation with click-through rates that varied from 0.1 to 0.9. Real-world click-through rates are typically much lower than this. Low success rates may mean that your algorithm must run for a very long time before it is able to reach any strong conclusions.
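
A minimal sketch of a simulation built from historical data might look like this. It assumes you have click and impression counts per variant; the numbers in `history`, the `BernoulliArm` class as written here, and the helper names are hypothetical stand-ins rather than the book's code.

```python
import random

class BernoulliArm:
    """An arm that pays 1.0 with probability p and 0.0 otherwise."""
    def __init__(self, p, rng):
        self.p = p
        self.rng = rng

    def draw(self):
        return 1.0 if self.rng.random() < self.p else 0.0

def arms_from_history(history, seed=0):
    """Build one simulated arm per historical (clicks, impressions) pair."""
    rng = random.Random(seed)
    return [BernoulliArm(clicks / impressions, rng) for clicks, impressions in history]

def estimate_rates(arms, plays_per_arm):
    """Play every arm the same number of times and return the observed rates."""
    return [sum(arm.draw() for _ in range(plays_per_arm)) / plays_per_arm
            for arm in arms]

# Hypothetical logs: (clicks, impressions) for two site variants.
history = [(120, 10_000), (150, 10_000)]
arms = arms_from_history(history)

few = estimate_rates(arms, 200)       # a few hundred plays per arm
many = estimate_rates(arms, 200_000)  # far more plays per arm
print("200 plays per arm:    ", few)
print("200,000 plays per arm:", many)
```

With true rates of 1.2% and 1.5%, estimates from a few hundred plays are typically too noisy to rank the arms reliably, which is the practical meaning of "must run for a very long time".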

##### Moving Worlds
- In the real world, the value of different arms in a bandit problem can easily change over time. As we said in the introduction, an orange and black site design might be perfect during session X, but terrible during session Y. Because the true value of an arm might actually shift over time, you want your estimates to be able to do this as well.

- _**Arms with changing values can be a very serious problem if you are not careful when you deploy a bandit algorithm.**_

- Epsilon-Greedy, Softmax and UCB1 cannot handle most sorts of change in the underlying values of arms well. The problem has to do with the way that we estimate the value of an arm. We typically update our estimates using the following snippet of code: `new_value = ((n - 1) * value + reward) / float(n)`

- The problem with this update rule is that `1 / float(n)` goes to 0 as n gets large. When you are dealing with millions or billions of plays, this means that recent rewards will have almost zero effect on your estimates of the value of different arms. If those values shift only a small amount, the algorithm will take a huge number of plays to update its estimated values.

- There is a simple trick for working around this that can be used if you are careful: instead of estimating the value of arms using strict averages, you can overweight recent events by using a slightly different rule based on a different snippet of code:
`new_value = (1 - alpha) * value + alpha * reward`

- This alternative updating rule will allow your estimates to shift much more with recent experiences.  When the world can change radically, that flexibility is very important.

- Unfortunately, the price you pay for the flexibility is the introduction of a new parameter that you will have to tune to your specific business. 

- **The authors encourage you to experiment with this modified updating rule in simulations to develop an intuition for how it behaves in your environment.**
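
The two update rules above can be compared in a small simulation. This sketch uses a deliberately artificial world in which an arm pays 1.0 for 10,000 plays and then 0.0 for 1,000 plays; the function names and `alpha = 0.01` are assumptions picked for illustration, not recommended settings.

```python
def average_update(value, n, reward):
    """Sample-average rule; n is the play count including this reward."""
    return ((n - 1) * value + reward) / float(n)

def constant_alpha_update(value, alpha, reward):
    """Exponentially weighted rule that overweights recent rewards."""
    return (1 - alpha) * value + alpha * reward

# Toy world: the arm is perfect for 10,000 plays, then worthless for 1,000.
rewards = [1.0] * 10_000 + [0.0] * 1_000

alpha = 0.01  # hypothetical setting; this is the parameter you would tune
avg_value, ema_value = 0.0, 0.0
for n, reward in enumerate(rewards, start=1):
    avg_value = average_update(avg_value, n, reward)
    ema_value = constant_alpha_update(ema_value, alpha, reward)

print(f"sample average: {avg_value:.3f}")
print(f"constant alpha: {ema_value:.5f}")
```

After the shift, the sample average is still about 0.91, while the constant-alpha estimate has dropped close to 0. Smaller values of `alpha` react more slowly but are less noisy when the world is stable.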


##### Correlated Bandits


##### Contextual Bandits

##### Implementing Bandit Algorithms at Scale

##### Concluding Remarks

- Domain expertise and good judgement will always be necessary
- By testing an algorithm in many different hypothetical worlds, you can build an appreciation for the qualitative dynamics that cause a bandit algorithm to succeed in one scenario and to fail in another.
- _**Trade-offs, trade-offs, trade-offs**_ : 
- _**God does play dice**_ : Randomization is the key to the good life.
- _**Defaults matter a lot**_ : The way in which you initialize an algorithm can have a powerful effect on its long-term success. You need to figure out whether your biases are helping you or hurting you. No matter what you do, you will be biased in some way or another. What matters is that you spend some time learning whether your biases help or hurt. Part of the genius of the UCB family of algorithms is that they make a point to do this initialization in a very systematic way right at the start.
- _**Take a chance**_ : You should try everything at the start of your explorations to ensure that you know a little bit about the potential value of every option. Do not close your mind without giving each option a fair shot. At the same time, just one experience should be enough to convince you that some mistakes are not worth repeating.
- _**Everybody's gotta grow up sometime**_ : You should make sure that you explore less over time.
- _**Leave your mistakes behind**_ : You should direct your exploration to focus on the second-best option, the third-best option and a few other options that are just a little further away from the best.
- _**Do not be cocky**_ : You should keep track of how confident you are about your evaluations of each of the options available to you. Do not be close-minded when you do not have evidence to support your beliefs. At the same time, do not be so unsure of yourself that you forget how much you already know. Measuring one's confidence explicitly is what makes UCB so much more effective than either of the other algorithms we studied.
- _**Context Matters**_ : You should use any and every piece of information available to you about the context of your experiments. Do not simplify the world too much and pretend you have got things figured out: there's more to optimizing your business than comparing A with B. If you can figure out a way to exploit context using strategies like contextual bandit algorithms, use them. And if there are ways to generalize your experiences across arms, take advantage of them.
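
The advice to explore less over time can be sketched as an annealing schedule for an epsilon-greedy exploration rate. The logarithmic schedule below is one possible choice made for illustration; the function name `annealed_epsilon` and the cap at 1.0 are assumptions.

```python
import math

def annealed_epsilon(total_plays):
    """Exploration rate that shrinks as the number of plays grows.

    The small constant avoids dividing by log(1) == 0 on the first play,
    and min() keeps the result a valid probability.
    """
    t = total_plays + 1
    return min(1.0, 1.0 / math.log(t + 1e-7))

schedule = [annealed_epsilon(n) for n in (0, 10, 1_000, 1_000_000)]
```

The rate starts at 1.0 (pure exploration) and decays slowly, so the algorithm never stops exploring entirely but spends most late plays exploiting.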



##### A Taxonomy of Bandit Algorithms

- Curiosity : Does the algorithm keep track of how much it knows about each arm? Does the algorithm try to gain knowledge explicitly, rather than incidentally? In other words, is the algorithm curious?
- Increased exploitation over time : Does the algorithm use annealing?
- Strategic exploration : What factors determine the algorithm's decision at each time point? Does it maximize reward, knowledge or a combination of the two?
- Number of tunable parameters
- Initialization strategy : What assumptions does the algorithm make about the value of arms it has not yet explored?
- Context-aware : Is the algorithm able to use background context about the value of the arms?
