## Content
### 1. TensorFlow
### 2. Discrete vs Continuous
### 3. A simple program

In [1]:
import numpy as np

_rollout\_test.py_ runs a linear policy with randomly initialized parameters on 'CartPole-v0' and 'Point-v0'.

>'Point-v0' is a simple point-reaching environment. In this environment, a point is spawned at a random location at the beginning of each episode, and the goal is to move to the center as quickly as possible.

```
obs_dim = 2
action_dim = 2
rng = np.random.RandomState(42)
theta = rng.normal(scale=0.01, size=(action_dim, obs_dim+1))  # randomly initialized parameter
```
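The extra column in `theta` (size `obs_dim+1`) leaves room for a bias term. The sketch below illustrates this under the assumption that a constant 1 is appended to the observation before the matrix product; the names `W` and `b` and the sample observation are illustrative only.

```python
import numpy as np

obs_dim = 2
action_dim = 2
rng = np.random.RandomState(42)
theta = rng.normal(scale=0.01, size=(action_dim, obs_dim + 1))

# Split the parameter matrix into weights and bias (illustrative names).
W, b = theta[:, :obs_dim], theta[:, obs_dim]

ob = np.array([0.330, 0.974])   # a sample 2-D observation
ob_1 = np.append(ob, 1.0)       # observation with a constant 1 appended

# Both forms compute the same affine map W @ ob + b.
mean_with_bias_column = theta.dot(ob_1)
mean_explicit = W.dot(ob) + b

print(theta.shape)
print(np.allclose(mean_with_bias_column, mean_explicit))
```

Folding the bias into the last column keeps all policy parameters in a single array.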

### np.random.RandomState(seed)
- class
- parameter
  - seed : {None, int, array_like}
- methods : normal, binomial, multinomial, poisson...

Generates random numbers drawn from a variety of probability distributions.
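A short sketch of why a seed is passed: two `RandomState` objects created with the same seed produce the same stream of numbers, which makes experiments repeatable. The seeds and variable names here are arbitrary choices for illustration.

```python
import numpy as np

rng_a = np.random.RandomState(42)
rng_b = np.random.RandomState(42)   # same seed as rng_a
rng_c = np.random.RandomState(7)    # different seed

a = rng_a.normal(size=3)
b = rng_b.normal(size=3)
c = rng_c.normal(size=3)

print(np.array_equal(a, b))   # same seed, same samples
print(np.array_equal(a, c))   # different seed, different samples
```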

### RandomState.normal(loc=0.0, scale=1.0, size=None)
- function
- parameter
  - loc : mean of distribution
  - scale : standard deviation of the distribution
  - size : output shape 

Draws samples from a normal (Gaussian) distribution with mean $\mu$ (`loc`) and standard deviation $\sigma$ (`scale`).
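One way to read `loc` and `scale` is as a shift and a stretch of a standard normal sample. The sketch below compares a direct draw with a rescaled standard-normal draw from an identically seeded generator. With the legacy `RandomState` the two should coincide, but treat this as an illustration rather than a guarantee for every generator.

```python
import numpy as np

loc, scale = 1.5, 0.1

# Draw directly with the requested mean and standard deviation.
rng_a = np.random.RandomState(0)
direct = rng_a.normal(loc=loc, scale=scale, size=5)

# Draw standard-normal samples from an identically seeded generator,
# then shift and stretch them by hand.
rng_b = np.random.RandomState(0)
rescaled = loc + scale * rng_b.standard_normal(size=5)

print(np.allclose(direct, rescaled))
```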

Let's check the sample mean and standard deviation of the normal distribution while changing the `loc`, `scale`, and `size` parameters.

In [25]:
th = rng.normal(size=(10000,))
print('mean:', np.mean(th))
print('std:', np.std(th))

0.343479652446
0.633343066633


In [26]:
th_scale = rng.normal(scale=.1, size=(10000,))
print('mean:', np.mean(th_scale))
print('std:', np.std(th_scale))

0.343479652446
0.633343066633


In [29]:
th_size = rng.normal(scale=1, size=(2,10000))
print('mean:', np.mean(th_size, axis=1))
print('std:', np.std(th_size, axis=1))

[ 0.33537946  0.35157984]
[ 0.34054087  0.82834025]


In [32]:
th_loc = rng.normal(loc=[[0],[1]], scale=1, size=(2,10000))
print('mean:', np.mean(th_loc, axis=1))
print('std:', np.std(th_loc, axis=1))

mean: [-0.00486791  1.00470028]
std: [ 0.9936277   1.00665813]


```
def point_get_action(theta, ob, rng=np.random):
    ob_1 = include_bias(ob)
    mean = theta.dot(ob_1)
    return rng.normal(loc=mean, scale=1.)


def cartpole_get_action(theta, ob, rng=np.random):
    ob_1 = include_bias(ob)
    logits = ob_1.dot(theta.T)
    return weighted_sample(logits, rng=rng)
```
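To see the two policies run end to end, here is a self-contained sketch. The helper functions are copied from _simple\_utils.py_ as quoted in this document. The observation values are made up, and the CartPole shapes (4 observation dimensions, 2 actions) are assumed from the standard environment.

```python
import numpy as np

def include_bias(x):
    return np.concatenate([x, np.ones_like(x[..., :1])], axis=-1)

def softmax(logits):
    x = logits - np.max(logits, axis=-1, keepdims=True)
    x = np.exp(x)
    return x / np.sum(x, axis=-1, keepdims=True)

def weighted_sample(logits, rng=np.random):
    weights = softmax(logits)
    return min(int(np.sum(rng.uniform() > np.cumsum(weights))),
               len(weights) - 1)

def point_get_action(theta, ob, rng=np.random):
    ob_1 = include_bias(ob)
    mean = theta.dot(ob_1)
    return rng.normal(loc=mean, scale=1.)

def cartpole_get_action(theta, ob, rng=np.random):
    ob_1 = include_bias(ob)
    logits = ob_1.dot(theta.T)
    return weighted_sample(logits, rng=rng)

rng = np.random.RandomState(42)

# Continuous action: a Gaussian centered on the linear output.
point_theta = rng.normal(scale=0.01, size=(2, 2 + 1))
point_ob = np.array([0.330, 0.974])
point_action = point_get_action(point_theta, point_ob, rng=rng)

# Discrete action: an index sampled from softmax(logits).
cartpole_theta = rng.normal(scale=0.01, size=(2, 4 + 1))
cartpole_ob = np.array([0.01, -0.02, 0.03, 0.04])  # made-up observation
cartpole_action = cartpole_get_action(cartpole_theta, cartpole_ob, rng=rng)

print("point action    :", point_action)
print("cartpole action :", cartpole_action)
```

The point policy returns a real-valued vector, while the CartPole policy returns an integer action index.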

There are two predefined functions, _include\_bias_ and _weighted\_sample_, in _simple\_utils.py_. Let's find out how these functions work.

First, the _include\_bias_ function appends a constant 1 to the end of each observation vector, so that the bias weight can be stored as the last column of `theta`.

```
def weighted_sample(logits, rng=np.random):
    weights = softmax(logits)
    return min(
        int(np.sum(rng.uniform() > np.cumsum(weights))),
        len(weights) - 1
    )


def include_bias(x):
    # Add a constant term (1.0) to each entry in x
    return np.concatenate([x, np.ones_like(x[..., :1])], axis=-1)
```

In [40]:
def include_bias(x):
    # Add a constant term (1.0) to each entry in x
    return np.concatenate([x, np.ones_like(x[..., :1])], axis=-1)

ob = np.array([[0.330, 0.974]]) # shape : (1, obs_dim)
ob_1 = include_bias(ob)         # shape : (1, obs_dim+1)
print("ob   :", ob)
print("ob_1 :", ob_1)

ob   : [[ 0.33   0.974]]
ob_1 : [[ 0.33   0.974  1.   ]]


For batch_size=10, the result is as follows.

In [39]:
obn = np.array([[0.330, 0.974],
        [0.355, 0.999],
        [0.330, 1.000],
        [0.305, 0.975],
        [0.330, 0.950],
        [0.305, 0.925],
        [0.280, 0.950],
        [0.255, 0.925],
        [0.280, 0.900],
        [0.305, 0.875]])    # shape : (10, obs_dim)
obn_1 = include_bias(obn)   # shape : (10, obs_dim+1)
print("obn   :", obn)
print("obn_1 :", obn_1)

obn   : [[ 0.33   0.974]
 [ 0.355  0.999]
 [ 0.33   1.   ]
 [ 0.305  0.975]
 [ 0.33   0.95 ]
 [ 0.305  0.925]
 [ 0.28   0.95 ]
 [ 0.255  0.925]
 [ 0.28   0.9  ]
 [ 0.305  0.875]]
obn_1 : [[ 0.33   0.974  1.   ]
 [ 0.355  0.999  1.   ]
 [ 0.33   1.     1.   ]
 [ 0.305  0.975  1.   ]
 [ 0.33   0.95   1.   ]
 [ 0.305  0.925  1.   ]
 [ 0.28   0.95   1.   ]
 [ 0.255  0.925  1.   ]
 [ 0.28   0.9    1.   ]
 [ 0.305  0.875  1.   ]]


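Second, the _weighted\_sample_ function converts the logits into probabilities with softmax and then samples an action index according to those probabilities. It draws a single uniform number and counts how many cumulative-probability thresholds that number exceeds; the `min` guards against an out-of-range index caused by rounding.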

```
def weighted_sample(logits, rng=np.random):
    weights = softmax(logits)
    return min(
        int(np.sum(rng.uniform() > np.cumsum(weights))),
        len(weights) - 1
    )
```
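As a rough check of this sampling scheme, the sketch below draws many samples and compares the empirical frequencies with the softmax weights. The logits, seed, and sample count are arbitrary choices for illustration.

```python
import numpy as np

def softmax(logits):
    x = logits - np.max(logits, axis=-1, keepdims=True)
    x = np.exp(x)
    return x / np.sum(x, axis=-1, keepdims=True)

def weighted_sample(logits, rng=np.random):
    weights = softmax(logits)
    return min(int(np.sum(rng.uniform() > np.cumsum(weights))),
               len(weights) - 1)

rng = np.random.RandomState(0)
logits = np.array([0.0, 1.0, 2.0])   # illustrative logits
weights = softmax(logits)

# Count how often each index is sampled.
n = 20000
counts = np.zeros(len(logits))
for _ in range(n):
    counts[weighted_sample(logits, rng=rng)] += 1
freqs = counts / n

print("softmax weights :", weights)
print("empirical freqs :", freqs)
```

The two printed vectors should agree up to sampling noise, since index `i` is chosen exactly when the uniform draw falls between the cumulative weights on either side of it.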

In [42]:
def softmax(logits):
    x = logits
    x = x - np.max(x, axis=-1, keepdims=True)
    x = np.exp(x)
    return x / np.sum(x, axis=-1, keepdims=True)

logits = [-7.01659141e-05, 1.04048004e-02] # output value of the linear policy
weights = softmax(logits)

print("weights: ", weights)


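The `softmax` above subtracts the maximum logit before exponentiating. A likely reason is numerical stability: the shift does not change the result but keeps `np.exp` from overflowing. The sketch below illustrates this; `naive_softmax` is a hypothetical helper written only for comparison.

```python
import numpy as np

def softmax(logits):
    x = np.asarray(logits, dtype=float)
    x = x - np.max(x, axis=-1, keepdims=True)
    x = np.exp(x)
    return x / np.sum(x, axis=-1, keepdims=True)

def naive_softmax(logits):
    # Hypothetical variant without the max shift, for comparison only.
    x = np.exp(np.asarray(logits, dtype=float))
    return x / np.sum(x, axis=-1, keepdims=True)

small = np.array([-7.01659141e-05, 1.04048004e-02])
large = small + 1000.0   # same differences, much larger magnitude

stable_small = softmax(small)
stable_large = softmax(large)

# Silence the overflow warnings that the naive version triggers.
with np.errstate(over="ignore", invalid="ignore"):
    naive_large = naive_softmax(large)

print(np.allclose(stable_small, stable_large))  # shift leaves the result unchanged
print(np.isnan(naive_large).any())              # naive version overflows
```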
