## Introduction

This tutorial introduces an approximate algorithm for optimization problems: Stochastic Local Search (SLS). <br>
<br>
Many practical problems in data science use data to perform classification or regression. One good example is "homework 3" of 15688, where we fit a linear regression on real-time bus data from Pittsburgh to make TrueTime predictions of arrival times.<br>
<br>
Almost all classification or regression problems seek an optimal solution. However, computing the optimal answer directly can be NP-hard. People therefore turn to approximate algorithms, which look for an approximate solution, ideally with provable guarantees on the distance between the returned solution and the optimal one.<br>
<br>
This tutorial covers one useful approximate algorithm: Stochastic Local Search.<br>
<br>

### Tutorial Content

In this tutorial, we will cover how to use stochastic local search on real problems. One example shows how to use stochastic local search to compute the parameters of a Bayesian network and how the noise parameter influences the search progress and result. The other example applies stochastic local search to an actual regression problem, which is .......<br>
<br>
The following topics are covered in this tutorial:
- [Motivation of SLS](#Motivation-of-SLS)
- [How SLS algorithm work](#How-SLS-algorithm-work)
- [Python implement of SLS](#Pyhton-implement-of-SLS)
- [Example application: linear regression](#Example-application:-linear-regression)

## Motivation of SLS

Most approximate algorithms are designed as greedy algorithms. A greedy approximate algorithm takes the locally optimal step each time in the hope of reaching the global optimum. However, if the objective function has multiple local maxima or minima, the global maximum or minimum can differ from the local optimum the search reaches, and the greedy algorithm will most likely get stuck in a local optimum. <br><br>
For example, one of the most commonly used greedy approximate algorithms is gradient descent. Gradient descent takes iterative steps toward a maximum or minimum of the objective function. An example run looks like the following:<br>
<img src="./gradient_decent.png" style="width: 500px;"/>


When we start at point Q, gradient descent easily finds the global minimum: each iteration moves in the direction that makes the objective function smaller. However, when we choose P as the start point, gradient descent ends in a local optimum instead of the global optimum.<br><br>
So the initial value has a great influence on whether gradient descent ends in the global optimum. Since we have no way of knowing whether a start point is a "good start point", the algorithm is very likely to get stuck in a local optimum.<br><br>
The SLS algorithm addresses this problem by introducing a random step into the iterations. The random step moves the parameters randomly, in the hope that they jump out of the current local optimum.<br><br>
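To make the dependence on the start point concrete, here is a minimal sketch of plain gradient descent on a one-dimensional function with two minima. The function `f`, its derivative `grad_f`, the helper `gradient_descent_1d`, and the step size are illustrative assumptions; they are not the function shown in the figure.

```python
def f(x):
    # illustrative objective (an assumption): local minimum near x = 1.13,
    # global minimum near x = -1.30
    return x**4 - 3 * x**2 + x

def grad_f(x):
    # derivative of f
    return 4 * x**3 - 6 * x + 1

def gradient_descent_1d(x0, learning_rate=0.01, iterations=1000):
    # repeatedly step against the derivative, starting from x0
    x = x0
    for _ in range(iterations):
        x -= learning_rate * grad_f(x)
    return x

x_left = gradient_descent_1d(-2.0)   # start on the left side
x_right = gradient_descent_1d(2.0)   # start on the right side
print("start -2.0 -> x = %.3f, f(x) = %.3f" % (x_left, f(x_left)))
print("start  2.0 -> x = %.3f, f(x) = %.3f" % (x_right, f(x_right)))
```

Both runs use the same algorithm and step size. Only the start point differs, yet the run started on the right settles in the shallower minimum.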

## How SLS algorithm work

### Gradient Descent
Before we head to the SLS algorithm, let's look at the gradient descent algorithm first, since it is used as one part of the SLS algorithm.<br><br>
   Given the parameter set **{Max_Iterations, Min_error, learning_rate, training_examples, training_labels}**<br><br>
   Assign initial values to the training weights **w**<br><br>
   Repeat the following steps until reaching Max_Iterations or Min_error:<br><br>
      compute the model output: **y' = X*w**<br><br>
      update the training weights: **w** := **w** - learning_rate * $\sum_{j} x_j (y'_j - y_j)$, where $x_j$ is the j-th training example and $y_j$ is its training label. The minus sign moves **w** against the gradient of the squared error.
<br><br>
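As a short derivation of where the update comes from, assume the loss is half the sum of squared errors (the factor 1/2 and the symbol $\eta$ for the learning rate are notational assumptions). Here $X$ stacks the training examples $x_j$ as rows and $y$ stacks the training labels $y_j$:

$$L(w) = \frac{1}{2}\sum_{j} \left(x_j^\top w - y_j\right)^2, \qquad \nabla_w L(w) = \sum_{j} x_j \left(x_j^\top w - y_j\right) = X^\top (Xw - y)$$

$$w \leftarrow w - \eta \, X^\top (Xw - y)$$

Since $y' = Xw$, the sum $\sum_{j} x_j (y'_j - y_j)$ is exactly this gradient, and a descent step subtracts it from $w$.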

### SLS
Compared to gradient descent, the main difference of SLS is that it adds a random step to the algorithm. With probability p the algorithm takes a random step; otherwise it takes a greedy step.<br><br>
We define the random step as flipping one randomly chosen bit of a randomly chosen variable. More specifically, we convert the float value into its binary representation and then randomly choose one bit to flip, e.g. 0->1 or 1->0. This assigns a random new value to one entry of our weight matrix.<br><br>
In the greedy step, we use gradient descent to find a local optimum.
So our SLS algorithm now works like this:<br><br>
   1. Initialize all parameters uniformly at random.<br><br>
   2. Take a stochastic local search step: with probability p take the random step, and with probability 1-p take the gradient descent step. <br><br>
   3. Repeat the stochastic local search step until the SLS algorithm meets our STOP_CONSTRAINTS.<br><br>
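To see what flipping one bit does to a float, here is a minimal sketch. The helper `flip_bit` is a hypothetical name used for illustration, and it assumes the value is stored as a 32-bit IEEE 754 float with bit 0 as the lowest mantissa bit and bit 31 as the sign bit.

```python
from struct import pack, unpack

def flip_bit(value, bit):
    """Flip one bit of value stored as a 32-bit float (0 = lowest mantissa bit, 31 = sign)."""
    # reinterpret the float as an unsigned 32-bit integer
    (bits,) = unpack('<I', pack('<f', value))
    # toggle the chosen bit with XOR and reinterpret the result as a float
    (flipped,) = unpack('<f', pack('<I', bits ^ (1 << bit)))
    return flipped

print(flip_bit(1.0, 22))  # highest mantissa bit: 1.0 becomes 1.5
print(flip_bit(1.0, 0))   # lowest mantissa bit: a change of about 1.2e-07
print(flip_bit(1.0, 31))  # sign bit: 1.0 becomes -1.0
```

The size of the jump depends heavily on which bit is chosen: low mantissa bits barely move the value, while high mantissa, exponent, or sign bits move it a lot.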

### The workflow of SLS algorithm
  The algorithm's parameter set is **{MAX-TRIES, MAX-STEPS, NOISE PROBABILITY: p, OBJECTIVE FUNCTION: f, SOLUTION PARAMETER: b}**<br><br>
    TRIES = 0<br><br>
    **while** TRIES < MAX-TRIES **do** { <br><br>
    Initialize b in (0,1) uniformly at random; STEPS = 0 <br><br>
    **while** STEPS < MAX-STEPS **do** { <br><br>
       RANDOM = flip a biased coin using noise parameter p<br><br>
       **if** RANDOM **then** take a random step: flip one random bit of a randomly chosen variable in b<br><br>
       **else** use gradient descent to find the nearest local optimum<br><br>
       **if** f(b) > f' **then** {f' = f(b); b' = b}<br><br>
       **if**  MEET THE STOP_CONSTRAINTS **then** return b'<br><br>
       STEPS = STEPS + 1<br><br>
       }<br><br>
    TRIES = TRIES + 1<br><br>
    }<br><br>
    return b' <br><br>

#### Supplement: 
b' stores the best solution found so far, and f' is its objective value. STOP_CONSTRAINTS is the condition we set for stopping the iteration: as soon as the current solution meets STOP_CONSTRAINTS, the search stops. The comparison f(b) > f' assumes that f is being maximized; the implementation below minimizes the squared error, so it keeps the solution with the smaller loss instead.

As you can see from the workflow, whether the random step executes depends on the noise parameter p, a probability in the range [0,1]. The goal of the random step is to make it possible for the algorithm to jump out of a local optimum.
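The workflow can be sketched on a one-dimensional function with two minima. This is a simplified illustration, not the implementation used in this tutorial: the objective `f`, the helper names, the step sizes, and the seed are assumptions, and the random step redraws `x` uniformly instead of flipping a bit.

```python
import random

def f(x):
    # illustrative objective (an assumption): local minimum near x = 1.13,
    # global minimum near x = -1.30
    return x**4 - 3 * x**2 + x

def grad_f(x):
    return 4 * x**3 - 6 * x + 1

def greedy_step(x, learning_rate=0.02, iterations=100):
    # a short run of gradient descent toward the nearest local minimum
    for _ in range(iterations):
        x -= learning_rate * grad_f(x)
    return x

def sls_1d(x0, p, max_steps=200, seed=0):
    rng = random.Random(seed)
    x = x0
    best_x, best_f = x, f(x)
    for _ in range(max_steps):
        if rng.random() < p:
            # random step (assumption: redraw x uniformly instead of flipping a bit)
            x = rng.uniform(-2.5, 2.5)
        else:
            x = greedy_step(x)
        if f(x) < best_f:
            # we minimize here, so a smaller objective value is better
            best_x, best_f = x, f(x)
    return best_x, best_f

greedy_x, greedy_f = sls_1d(2.0, p=0.0)  # p = 0: purely greedy
noisy_x, noisy_f = sls_1d(2.0, p=0.3)    # p = 0.3: occasional random steps
print("p=0.0 -> x = %.3f, f(x) = %.3f" % (greedy_x, greedy_f))
print("p=0.3 -> x = %.3f, f(x) = %.3f" % (noisy_x, noisy_f))
```

With p = 0 the search never leaves the basin it starts in. With p = 0.3 a random step eventually lands in the other basin, and the following greedy step descends to the deeper minimum.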

## Pyhton implement of SLS

In [None]:
import numpy as np
from numpy import random
from struct import pack,unpack
import matplotlib.pyplot as plt
def biased_flip(some_list, probabilities):
    """
    input:  some_list: the values to sample from, in this case 0 or 1
            probabilities: the probability of picking each value, in the same order as some_list
    output: one item of some_list, drawn according to probabilities
    """
    x = random.uniform(0, 1)
    cumulative_probability = 0.0
    for item, item_probability in zip(some_list, probabilities):
        cumulative_probability += item_probability
        if x < cumulative_probability:
            break
    return item

In [None]:
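def random_step(w):
    """
    function: flip one random bit in the 32-bit float representation of one randomly chosen weight
    """
    w = w.copy()  # leave the caller's array untouched
    # choose the variable we want to assign a random value
    i = random.randint(0, len(w))
    variable = w[i, 0]
    # view the value as the 4 bytes of a 32-bit float
    bval = list(unpack('BBBB', pack('f', variable)))
    # numpy's randint excludes the upper bound, so this picks byte 0-2 and bit 0-6;
    # on a little-endian machine these are all mantissa bits, so the new value stays finite
    index = random.randint(0, 3)
    v = random.randint(0, 7)
    bval[index] ^= 1 << v  # flip the chosen bit
    w[i, 0] = unpack('f', pack('BBBB', *bval))[0]
    return w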


In [None]:
def gradient_decent(x, y, learning_rate, weight_initial, maxIterations):
    """
    function: use gradient descent to find a local optimum
    input:  x: training data
            y: training labels
            learning_rate: learning rate of each iteration, in range of [0,1]
            weight_initial: start weight
            maxIterations: maximum number of iterations
    output: w: weight matrix of the local optimum
    """
    iteration = 0
    w = weight_initial.copy()  # do not modify the caller's weights in place
    er = np.dot(x, w) - y
    # stop after maxIterations or once the sum of squared errors drops to 0.1
    while iteration < maxIterations and np.sum(er**2) > 0.1:
        # batch update: x^T (x w - y) is the gradient of half the squared error
        w = w - learning_rate * np.dot(x.T, er)
        er = np.dot(x, w) - y
        iteration += 1
    return w
        
        

In [None]:
def greedy_step(w, training_examples, training_labels):
    """
    function: use gradient descent to find a local optimum in the greedy step
    """
    learning_rate = 0.0001
    max_iterations = 50000
    return gradient_decent(training_examples, training_labels, learning_rate, w, max_iterations)

In [None]:
def SLS(training_examples, training_labels, MAX_TRIES, MAX_STEPS, noise_parameter, stop_constraint):
    """
    function: use SLS to fit a linear regression
    input:  noise_parameter: probability p of taking a random step
            stop_constraint: return as soon as the best sum of squared errors drops below this value
    output: the best weight matrix found
    """
    def loss(w):
        # sum of squared errors of the linear model on the training data
        return np.sum((training_labels - np.dot(training_examples, w))**2)

    shape = [training_examples.shape[1], training_labels.shape[1]]
    w_current_optimal = random.random(shape)
    TRIES = 0
    while TRIES < MAX_TRIES:
        # initialize weight uniformly at random
        weight = random.random(shape)
        STEPS = 0
        while STEPS < MAX_STEPS:
            STEPS += 1
            i = biased_flip([0, 1], [noise_parameter, 1 - noise_parameter])
            if i == 0:
                weight = random_step(weight)
            else:
                weight = greedy_step(weight, training_examples, training_labels)
            if loss(weight) < loss(w_current_optimal):
                # store a copy so that later in-place updates of weight cannot overwrite the best solution
                w_current_optimal = weight.copy()
            if loss(w_current_optimal) < stop_constraint:
                return w_current_optimal
        TRIES += 1
    return w_current_optimal

## Example application: linear regression

In this part, we'll use a simple linear regression as an example to see how the SLS algorithm works. We'll see how the area of a house relates to its price. Given data: <br><br>
area of house: 150, 200, 250, 300, 350, 400, 600<br><br>
price of house: 6450, 7450, 8450, 9450, 11450, 15450, 18450<br><br>
What we are trying to do here is train a linear regression model that fits our training data properly. <br><br>
Before we start training, let's first use min-max normalization to scale our data into [0,1].<br><br>

In [None]:
training_examples=np.array([[150], [200], [250], [300], [350], [400], [600]])
training_labels=np.array([[6450], [7450], [8450], [9450], [11450], [15450], [18450]])

def min_max_normalization(x):
    """
    function: use min-max normalization to normalize data into [0,1]
    input: data to be normalized
    output: normalized data, same shape as the input
    """
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

training_e = min_max_normalization(training_examples)
training_l = min_max_normalization(training_labels)

The linear regression model we want to train is y=w'X+b, where "b" is the bias. To make "b" part of the trained weights, we append a constant "1" to every training example.
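Written out for a single training example with one feature $x_j$, the appended "1" turns the bias into one more weight (the symbol $\tilde{X}$ for the augmented data matrix is a notational assumption):

$$y_j = w\,x_j + b = \begin{bmatrix} x_j & 1 \end{bmatrix} \begin{bmatrix} w \\ b \end{bmatrix}, \qquad \tilde{X} = \begin{bmatrix} x_1 & 1 \\ \vdots & \vdots \\ x_n & 1 \end{bmatrix}$$

The trained weight matrix therefore has two rows: the first holds the slope $w$ and the second holds the bias $b$.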

In [None]:
a = np.array((1,1,1,1,1,1,1))
training_data = np.column_stack((training_e,a))

Now we pass training_data and training_l to the SLS algorithm and check how well the resulting weight matrix fits our training data.<br><br>

In [None]:
weight = SLS(training_data, training_l, 2, 20, 0.1, 0.1)

In [None]:
plt.plot(training_e, training_l, 'r*')
plt.plot(training_e, weight[0,0]*training_e+weight[1,0])
plt.show()
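As a rough sanity check on the fitted line, the ordinary least-squares solution for a single feature can be computed in closed form and compared with the weights SLS returns. This check is an addition for illustration, not part of the SLS algorithm, and the helper `min_max` and the variable names are assumptions; the data is the same house data, normalized the same way.

```python
areas = [150, 200, 250, 300, 350, 400, 600]
prices = [6450, 7450, 8450, 9450, 11450, 15450, 18450]

def min_max(values):
    # scale a list of numbers into [0, 1]
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

x = min_max(areas)
y = min_max(prices)
n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# closed-form least squares for one feature: slope = cov(x, y) / var(x)
slope = (sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
         / sum((xi - x_mean) ** 2 for xi in x))
intercept = y_mean - slope * x_mean

# sum of squared errors of the best possible line
sse = sum((slope * xi + intercept - yi) ** 2 for xi, yi in zip(x, y))
print("slope = %.4f, intercept = %.4f, SSE = %.4f" % (slope, intercept, sse))
```

The slope comes out at roughly 1.08, the intercept at roughly -0.03, and the smallest achievable sum of squared errors at roughly 0.045. A stop constraint of 0.1 is therefore reachable on this data, while no straight line can reach a sum of squared errors below about 0.045.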

### See how the noise parameter influences our training

As we have seen, the noise parameter p determines the probability that a random step takes place. Let's change "p" to see how it influences training.<br><br>
First, let's modify our SLS algorithm to save the output of the loss function every time we take a step.<br><br>

In [None]:
def noise_parameter_SLS(training_examples, training_labels, MAX_STEPS, stop_constraint):
    """
    function: plot how the noise parameter (probability of a random step) influences training
    input:  training_examples, training_labels: training data and labels
            MAX_STEPS: maximum number of SLS steps per noise value
            stop_constraint: stop a run once its best sum of squared errors drops below this value
    """
    noise = [0, 0.1, 0.3, 0.5, 0.7, 1.0]
    # set_color_cycle is no longer available in current matplotlib; set_prop_cycle replaces it
    plt.gca().set_prop_cycle(color=['red', 'green', 'blue', 'yellow', 'cyan', 'magenta'])
    for noise_parameter in noise:
        loss = []
        itera = []
        best_loss = np.inf
        # initialize weight uniformly at random
        weight = random.random([training_examples.shape[1], training_labels.shape[1]])
        STEPS = 0
        while STEPS < MAX_STEPS:
            STEPS += 1
            itera.append(STEPS)
            i = biased_flip([0, 1], [noise_parameter, 1 - noise_parameter])
            if i == 0:
                weight = random_step(weight)
            else:
                weight = greedy_step(weight, training_examples, training_labels)
            squared_error = np.sum((training_labels - np.dot(training_examples, weight))**2)
            # record the mean squared error after every step
            loss.append(squared_error / training_examples.shape[0])
            best_loss = min(best_loss, squared_error)
            if best_loss < stop_constraint:
                break
        # draw one curve per noise value
        plt.plot(itera, loss, label='noise=' + str(noise_parameter))
    plt.legend(loc='upper right')
    plt.show()

In [None]:
noise_parameter_SLS(training_data, training_l, 100, 0.01)

We set the noise parameter to 0, 0.1, 0.3, 0.5, 0.7, and 1. One example output is shown below. (Because the choice between a random step and a greedy step is itself random, your output may differ from this example.)<br><br>
<img src="./noise_parameter.png" style="width: 500px;"/>

As we can see, the larger the noise parameter, the more random the training becomes. However, a larger noise parameter also means the search is more likely to jump out of a local optimum. When the noise parameter is set to 0, the SLS algorithm is completely greedy.

## Summary and references

When multiple local optima exist, this tutorial introduces one possible way of jumping out of a local optimum. In principle, given unlimited time (steps or tries), SLS will eventually find the global optimum. With a limited number of steps or tries, however, there is no way to know when or whether SLS will find the global optimum. Still, because SLS keeps the best solution seen so far, the solution it returns is usually no worse than that of pure greedy search.<br><br>
If you are interested, more discussion and details on SLS are available from the following links.<br><br>
1. Understanding the role of noise in stochastic local search: Analysis and experiments https://www.sciencedirect.com/science/article/pii/S0004370208000040
2. Stochastic Local Search https://www.cs.put.poznan.pl/mkomosinski/materialy/optymalizacja/extras/StochasticLocalSearch.pdf
3. video about SLS algorithm https://www.youtube.com/watch?v=E0l8xIXYSY4