# Greedy algorithm

## Definition
A greedy algorithm is an algorithmic paradigm that follows the problem-solving heuristic of making the locally optimal choice at each stage in the hope of finding a global optimum. For many problems a greedy strategy does not produce an optimal solution, but a greedy heuristic can still yield locally optimal solutions that approximate a globally optimal solution in a reasonable amount of time.
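
As a small illustration of the gap between a local and a global optimum, the sketch below applies a greedy rule to coin change. The function name `greedy_coin_change` and both coin sets are assumptions chosen for illustration; they are not part of the example developed later in this notebook.

```python
def greedy_coin_change(amount, coins):
    """Repeatedly take the largest coin that still fits (illustrative helper)."""
    picked = []
    for coin in sorted(coins, reverse=True):
        while amount >= coin:
            amount -= coin
            picked.append(coin)
    # Return None when the remaining amount cannot be covered
    return picked if amount == 0 else None


# With coins 1, 2, 5 the greedy choice is also optimal: two coins
canonical = greedy_coin_change(6, [1, 2, 5])

# With coins 1, 3, 4 greedy uses three coins, although 3 + 3 needs only two
tricky = greedy_coin_change(6, [1, 3, 4])
```

Each step takes the best-looking coin, yet the second call ends with one coin more than necessary, which is the sense in which a greedy result only approximates the global optimum.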

## Specifics
In general, greedy algorithms have five components:

1. A candidate set, from which a solution is created
2. A selection function, which chooses the best candidate to be added to the solution
3. A feasibility function, which determines whether a candidate can contribute to a solution
4. An objective function, which assigns a value to a solution or a partial solution
5. A solution function, which indicates when a complete solution has been found
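
To make the five components concrete, here is a minimal sketch that labels each of them in a classic greedy problem, activity selection (choose as many non-overlapping intervals as possible). The problem, the function name `select_activities`, and the sample intervals are assumptions for illustration only.

```python
def select_activities(activities):
    """Greedy activity selection on (start, end) pairs (illustrative helper)."""
    # Candidate set: all activities, ordered by end time so that the
    # selection function can simply take the next one in the list.
    candidates = sorted(activities, key=lambda a: a[1])
    solution = []

    def is_feasible(activity):
        # Feasibility function: the activity must not overlap the last pick.
        return not solution or activity[0] >= solution[-1][1]

    def objective(partial):
        # Objective function: number of activities scheduled so far.
        return len(partial)

    for activity in candidates:  # selection function: earliest end time first
        if is_feasible(activity):
            solution.append(activity)

    # Solution function: the solution is complete once every candidate
    # has been considered.
    return solution, objective(solution)


chosen, count = select_activities(
    [(1, 4), (3, 5), (0, 6), (5, 7), (8, 9), (5, 9)]
)
```

For this particular problem the earliest-end-time rule is known to give an optimal schedule; that is a property of the problem, not of greedy algorithms in general.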

## Process

1. Establish a mathematical model to describe the problem.
2. Divide the problem into several sub-problems.
3. Obtain the locally optimal solution of each sub-problem.
4. Merge the solutions of the sub-problems into a solution of the original problem.

## Example
Extract rows from the dataset so that the selected rows meet the following conditions:  
1. The average age is between 35 and 40
2. The total salary is between 100,000 and 120,000

## The solution to the problem
Minimize the mean squared error of the average age and the total salary with respect to their target ranges.

In [5]:
import pandas as pd
import random, time
import numpy as np

## Establish a random dataset
The dataset contains the following columns:  
1. Gender: randomly generate 'male' or 'female'  
2. Age: random integers between 22 and 65  
3. Salary: random integers between 3000 and 10000  

In [6]:
n_row = 1000
random.seed(50)

# create a series of gender
gender = pd.Series([random.choice(['male','female']) for i in range(n_row)])

# create a series of age
age_low = 22
age_high = 65
age = pd.Series([random.randint(age_low, age_high) for i in range(n_row)])

# create a series of salary
salary_low = 3000
salary_high = 10000
salary = pd.Series([random.randint(salary_low, salary_high) for i in range(n_row)])

# create a dataframe from gender, age and salary
df = pd.DataFrame({"gender": gender,"age": age, "salary": salary})
df.head()

|   | age | gender | salary |
|---|-----|--------|--------|
| 0 | 64 | female | 3828 |
| 1 | 64 | female | 6389 |
| 2 | 39 | female | 8344 |
| 3 | 30 | male | 4540 |
| 4 | 33 | female | 5147 |


## Create a dictionary of strings and their corresponding functions
1. "average": numpy.mean function  
2. "sum": numpy.sum function

In [7]:
def str2func(x):
    func_dict = {"average": np.mean, "sum": np.sum}
    return func_dict[x]

## Calculate squared error
Suppose we want the variable x to fall in the interval [a, b]. The squared error is then calculated as follows:  
1. if x in [a, b], then $SE = 0$
2. if x > b, then $SE = (x / b - 1)^2$
3. if x < a, then $SE = (1 - x / a)^2$
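
As a worked illustration, take the age range $[a, b] = [35, 40]$ and two assumed values of x, one above and one below the interval:

$$x = 50 > b: \quad SE = (50 / 40 - 1)^2 = 0.25^2 = 0.0625$$

$$x = 28 < a: \quad SE = (1 - 28 / 35)^2 = 0.2^2 = 0.04$$

Because x is divided by the nearest bound, the error is relative rather than absolute, so an age condition and a salary condition produce values on a comparable scale.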


In [8]:
def get_se(x, rng):
    a, b = rng
    if a <= x <= b:
        res = 0
    elif x > b:
        # Normalization
        res = (x / b - 1) ** 2
    else:
        res = (1 - x / a) ** 2
    return res

## Calculate mean squared error
$MSE = \frac{1}{n}\sum_{i=1}^{n}SE_i$, where n is the number of conditions and $SE_i$ is the squared error of the i-th condition.
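
As a worked check, assume the selection contains a single row with age 39 and salary 9980, and the target ranges are [35, 40] for the average age and [100000, 120000] for the total salary:

$$SE_{age} = 0, \quad SE_{salary} = (1 - 9980 / 100000)^2 = 0.9002^2 \approx 0.8104$$

$$MSE = \frac{1}{2}(0 + 0.8104) \approx 0.405$$

This agrees with the first loss value printed in the test run, although the order in which rows were picked is not shown there.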

In [9]:
def get_mse(data, rows, cols, funcs, rngs, n_cond):
    mse = 0.0
    for col, func, rng in zip(cols, funcs, rngs):
        se = func(data.loc[rows == 1, col])
        se = get_se(se, rng)   
        mse += se / n_cond
    return mse

## Search function
The variable rows looks like [1, 1, 0, 1, 0, 0, 0..., 0, 1, 0, 0, 0], where 1 means the corresponding row is selected.  
Both mse and min_mse start as infinity, so the first comparison needs no special case.
1. Create an indicator series of n zeros (no rows selected).
2. For each row that is still unselected, tentatively select it and calculate the resulting mse.
3. Record the minimum of these values as min_mse, together with the index of the row that produced it.
4. If min_mse is not greater than the current mse, select that row, set mse to min_mse, and repeat from step 2.
5. Otherwise stop, because the mse cannot be lowered any further. The loop also stops once mse is no longer above the threshold.

In [10]:
def search(data, cols, funcs, rngs, threshold=10e-6):
    n_row = data.shape[0]
    n_cond = len(cols)
    
    # create a series to show which rows are selected
    rows = pd.Series(np.zeros(n_row, dtype = np.int32))
    rows.index = data.index
    
    # get functions
    funcs = [str2func(x) for x in funcs]

    i = 0
    mse = float('inf')
    while mse > threshold:
        min_mse = float('inf')
        for idx in data.loc[rows == 0].index:
            rows.loc[idx] = 1
            tmp_mse = get_mse(data, rows, cols, funcs, rngs, n_cond)
            
            # keep the best candidate seen so far in this pass
            if tmp_mse < min_mse:
                min_mse = tmp_mse
                min_mse_idx = idx
            
            rows.loc[idx] = 0
        
        # check if mse cannot be lower any more
        if min_mse > mse:
            break
        else:
            mse = min_mse
            rows.loc[min_mse_idx] = 1
        
        # print loss
        print("%d times iteration, mse %.3f" % (i+1, mse))
        i += 1
        
    return rows
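
The search above re-slices the DataFrame once per candidate row. As an alternative sketch, and assuming only the two conditions used in this example (an average and a sum), the loss of every candidate can be computed in one vectorized step with NumPy. The names `interval_se` and `greedy_search_np` are hypothetical and the toy arrays are made up for illustration.

```python
import numpy as np


def interval_se(x, rng):
    """Vectorized relative squared error of x against the range [a, b]."""
    a, b = rng
    x = np.asarray(x, dtype=float)
    over = (x / b - 1) ** 2
    under = (1 - x / a) ** 2
    return np.where(x > b, over, np.where(x < a, under, 0.0))


def greedy_search_np(age, salary, age_rng, salary_rng, threshold=1e-5):
    """Greedy row selection for an average-age and a total-salary range."""
    age = np.asarray(age, dtype=float)
    salary = np.asarray(salary, dtype=float)
    selected = np.zeros(len(age), dtype=bool)
    mse = np.inf
    while mse > threshold and not selected.all():
        k = selected.sum()
        # Statistics that would result from adding each row in turn
        new_avg_age = (age[selected].sum() + age) / (k + 1)
        new_total = salary[selected].sum() + salary
        cand = (interval_se(new_avg_age, age_rng)
                + interval_se(new_total, salary_rng)) / 2
        cand[selected] = np.inf  # already-selected rows are not candidates
        best = int(np.argmin(cand))
        if cand[best] > mse:  # no candidate improves the loss
            break
        mse = cand[best]
        selected[best] = True
    return selected, mse


# Toy data (assumed): target average age 35 to 40, total salary 500 to 650
mask, loss = greedy_search_np(
    age=[30, 38, 50, 36],
    salary=[100, 200, 300, 400],
    age_rng=[35, 40],
    salary_rng=[500, 650],
)
```

On the toy data the sketch first picks the row with age 36 and salary 400, then the row with age 38 and salary 200, which gives an average age of 37 and a total salary of 600. The greedy rule is the same as in `search`; only the way candidate losses are evaluated differs.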

## Test the search function and show results

In [11]:
print("\n" * 3)
print("Test search:")

run_time = time.time()

idxs = search(data = df
              , cols = ["age", "salary"]
              , funcs = ["average", "sum"]
              , rngs = [[35,40], [100000, 120000]])

search_result = df.loc[idxs == 1]
average_age = search_result.age.mean()
total_salary = search_result.salary.sum()

print()
print("Target average age is 35 to 40 and target total salary is 100000 to 120000")
print("Average age is %.2f and total salary is %d" % (average_age, total_salary))
print("Run time is %.2f s" % (time.time() - run_time))





Test search:
1 times iteration, mse 0.405
2 times iteration, mse 0.321
3 times iteration, mse 0.246
4 times iteration, mse 0.181
5 times iteration, mse 0.126
6 times iteration, mse 0.081
7 times iteration, mse 0.046
8 times iteration, mse 0.021
9 times iteration, mse 0.005
10 times iteration, mse 0.000
11 times iteration, mse 0.000

Target average age is 35 to 40 and target total salary is 100000 to 120000
Average age is 39.00 and total salary is 103287
Run time is 10.84 s


In [12]:
search_result

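|     | age | gender | salary |
|-----|-----|--------|--------|
| 0 | 64 | female | 3828 |
| 176 | 22 | male | 9945 |
| 222 | 39 | female | 9980 |
| 415 | 59 | female | 9921 |
| 449 | 27 | female | 9986 |
| 497 | 24 | female | 9944 |
| 602 | 43 | male | 9935 |
| 725 | 36 | male | 9937 |
| 846 | 60 | female | 9975 |
| 878 | 22 | male | 9922 |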
