# Optimizing the Python Code for Big Data 
Balancing Coding Complexity against Computational Complexity 
    
    AUTHOR: Dr. Roy Jafari 

## Chapter 1: Effectively employing computational and memory resources 

### Challenge 3: Patterns or Invention?
In this challenge, we are going to learn which of the following strategies is the most advantageous.

    - Strategy 1: Learning the famous coding patterns, and trying to mix and match these patterns to every programming challenge
    - Strategy 2: Using our invention to come up with a code that solves the programming challenge
    - Strategy 3: Spending little on the coding part, just coming up with any logically correct code, and using advanced technology to run the code

You may not be familiar with coding patterns. These are strategies for solving programming challenges. For instance, in the following example, we will use a pattern called sliding window to solve the problem.

#### Example of Sliding Window pattern
In this example, we would like to come up with a function that calculates the moving averages of a given time series.

Moving average is a forecasting method that uses the average of the last K values of the time series as the forecast for the next value in the time series. If K=1, then the last value of the series will be the forecast value. If K=2, then the mean of the last two values will be the forecast value.

The following two functions take a K and a series and output the moving averages. The names of these functions are `moving_average_brute_force()` and `moving_average_sliding_window()`.

The following code shows the definition of `moving_average_brute_force()`.

```
def moving_average_brute_force(K, series):
    # Re-sum each window of K consecutive values from scratch.
    result = []
    n = len(series)
    for i in range(n - K + 1):
        result.append(sum(series[i:i + K]) / K)
    return result
```

The following code shows the definition of `moving_average_sliding_window()`.

```
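def moving_average_sliding_window(K, series):
    result = []
    n = len(series)
    if K > n:
        # No complete window exists, so match the brute-force result.
        return result
    # Sum the first window once.
    _sum = sum(series[:K])
    result.append(_sum / K)
    for i in range(n - K):
        # Slide the window: drop the oldest value, add the newest one.
        _sum -= series[i]
        _sum += series[i + K]
        result.append(_sum / K)
    return result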
```

The function `moving_average_brute_force()` gets the CPU to go over all the n-K+1 windows (runs of K consecutive values) of the series. Also, on each iteration, the CPU must go over the K values in the window to add them and calculate the mean. All in all, the CPU must perform something (n-K+1)*K times, and after simplification, the CPU usage of the function can be shown by O(n*K).

On the other hand, the function `moving_average_sliding_window()` goes over the same n-K+1 windows, but after summing the first window, on each iteration it only needs to remove the first value from the window and add a new value to it. All in all, the CPU must perform something roughly K+(n-K)*2 times; when simplified, the computational complexity of this function can be represented by O(n).
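To make the two operation counts concrete, here is a rough worked count for a series of n = 1,000,000 values and K = 100. It only counts how many values are visited and ignores all constant overheads, so it is an approximation of the workload, not a prediction of running time.

$$
\begin{aligned}
\text{brute force:}\quad & (n-K+1)\cdot K = (10^6 - 100 + 1)\cdot 100 = 99{,}990{,}100 \\
\text{sliding window:}\quad & K + 2\,(n-K) = 100 + 2\cdot(10^6 - 100) = 1{,}999{,}900
\end{aligned}
$$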

The sliding window is the name of a coding pattern in which the code keeps a window over the data throughout its processing and gets the task done by sliding that window. In this example, you experienced how this coding pattern decreased the computational complexity from O(n*K) to only O(n). For instance, when I use my computer to calculate the moving averages with K=100 for a series with a million values, `moving_average_brute_force()` takes 2.6 seconds while `moving_average_sliding_window()` takes only 420 milliseconds.
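The comparison above can be reproduced with a small script. The following is a minimal sketch, not the author's benchmark: the test series, the value of K, and the use of `timeit` are assumptions made for illustration, and the timings it prints will differ from machine to machine.

```python
import timeit

def moving_average_brute_force(K, series):
    # Re-sum every window of K values from scratch.
    n = len(series)
    return [sum(series[i:i + K]) / K for i in range(n - K + 1)]

def moving_average_sliding_window(K, series):
    # Sum the first window once, then slide it one value at a time.
    n = len(series)
    _sum = sum(series[:K])
    result = [_sum / K]
    for i in range(n - K):
        _sum -= series[i]
        _sum += series[i + K]
        result.append(_sum / K)
    return result

# Hypothetical test data: integers keep the running sum exact,
# so the two results can be compared with ==.
series = list(range(1, 2001))
K = 100

brute = moving_average_brute_force(K, series)
sliding = moving_average_sliding_window(K, series)
print(brute == sliding)
print(brute[:3])

# Time 20 calls of each function; absolute numbers depend on the machine.
for func in (moving_average_brute_force, moving_average_sliding_window):
    seconds = timeit.timeit(lambda: func(K, series), number=20)
    print(func.__name__, round(seconds, 4))
```

With floating-point input, the running sum in the sliding-window version can accumulate small rounding differences, so comparing the two results with a tolerance would be the safer check in that case.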


    ######################################################################
    Coding patterns
    
    The sliding window is only one of many coding patterns that you may invest in learning. The fourteen most famous patterns are introduced with examples at https://hackernoon.com/14-patterns-to-ace-any-coding-interview-question-c5bb3357f6ed.
    ######################################################################

Now that we have a better understanding of coding patterns, let us review the prompts for this challenge.

#### Challenge Prompts
In this challenge, our mission is to create a function that removes duplicates from a list of sorted numbers. For example, if we input the list [1,2,2,3,3,3,4,10], the function must return [1,2,3,4,10].

This is a famous programming challenge, and if you Google it, chances are you will find many different solutions to the task. However, avoid doing that: while our mission is to solve this challenge, our ultimate goal is to understand the difference between the three strategies that we saw at the beginning of this challenge.

Regarding this programming task, answer the following questions. 

1. What is the most straightforward function that solves this task? Imagine that you don’t have any limitations on computation, time, or memory. Implement the solution in the form of a function and call it `remove_duplicates_naive()`.
2. What are the computational complexity and memory complexity of `remove_duplicates_naive()`?
3. Try to come up with a function that has less computational complexity than `remove_duplicates_naive()`, and call it `remove_duplicates_inventive_CPU()`.
4. What are the computational complexity and memory complexity of `remove_duplicates_inventive_CPU()`?
5. Try to come up with a function that has less memory complexity than `remove_duplicates_naive()`, and call it `remove_duplicates_inventive_RAM()`.
6. What are the computational complexity and memory complexity of `remove_duplicates_inventive_RAM()`?
7. Study and learn the Two Pointers coding pattern from the following subchapter. Can this pattern be used to solve this programming task?
8. Use the two-pointers pattern to solve this task, and call the ensuing function `remove_duplicates_two_pointers()`.
9. What are the computational complexity and memory complexity of `remove_duplicates_two_pointers()`?
10. In this challenge, we created four different functions to solve the task. In creating each of these functions we used one of the three strategies that were introduced at the beginning of this challenge. Specify under what strategy each function was created.
11. From what you experienced in this challenge, assign one of the following three labels to each of the three strategies to summarize and remember them: **High-Thinking-Low-Computation**, **Low-Thinking-Learning-High-Computation**, and **High-Learning-Low-Computation**.
12. From what you experienced in this challenge, decide which of these three strategies is the most advantageous when dealing with big data every day. What is your reasoning?

Give this challenge a real try, and then check whether your answers were correct. The solution to the challenge can be found in the file *Challenge3_Solution.ipynb* in the book's GitHub repository.

#### Two Pointers Coding Pattern
When we come up with a programming solution using a loop, we normally assume that within each loop we can keep track of only one index, or pointer. That assumption is not true, and the two-pointers coding pattern is the antithesis of it. By learning this coding pattern, you will be able to put that assumption aside when it is advantageous.

For instance, let us discuss the function `find_doubles_brute()` that we came up with in this chapter under Understanding Big O Notation, Learning the common Big O Complexities, Example of decreasing computational complexity at the expense of memory complexity. The function’s definition is the following.

```
def find_doubles_brute(num_list, val):
    output = []
    n = len(num_list)
    # Check every pair (i, j) with i < j.
    for i in range(n):
        for j in range(i + 1, n):
            if num_list[i] + num_list[j] == val:
                output.append([num_list[i], num_list[j]])
    return output
 ```
The function takes in the list of sorted numbers `num_list` and another number `val`, and finds the pairs of numbers from `num_list` that sum up to `val`. Because the function is written under the assumption that within each loop we can only keep track of one index, it has to use two nested loops to get the task done, and because of the two nested loops, the computational complexity of the function is O(n^2).

However, the function `find_doubles_two_pointers()` disrupts that assumption and uses two pointers in one loop to get the task done. The following is the definition of said function.

```
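def find_doubles_two_pointers(num_list, val):
    output = []
    # p1 starts at the smallest value, p2 at the largest.
    p1, p2 = 0, len(num_list) - 1
    # Using < rather than != also stops safely on an empty list.
    while p1 < p2:
        _sum_val = num_list[p1] + num_list[p2]
        if _sum_val < val:
            p1 += 1  # sum too small: move to a larger value
        elif _sum_val > val:
            p2 -= 1  # sum too large: move to a smaller value
        else:
            output.append([num_list[p1], num_list[p2]])
            p1 += 1
    return output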
```

The function `find_doubles_two_pointers()` keeps track of the two pointers, `p1` and `p2`, in one loop. Before the loop begins, the first pointer `p1` points to the beginning of `num_list`, and the second pointer `p2` points to the end of `num_list`. During the loop, if the sum of the values that the two pointers point to, which is `_sum_val`, is smaller than `val`, `p1` moves forward one index; if `_sum_val` is larger than `val`, `p2` moves backward one index; if `_sum_val` is equal to `val`, the pair that `p1` and `p2` point to is appended to the output and `p1` moves forward one index. When `p1` and `p2` point to the same index, the loop terminates and the output is returned.

Using the two-pointers coding pattern, the function `find_doubles_two_pointers()` solves the programming task in only one loop, and that improves the computational complexity to O(n).
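As a quick check of the description above, the following sketch runs both functions on a small sorted list and confirms that they return the same pairs. The example list and target values are assumptions chosen for illustration. The list contains no repeated values; with repeated values the two functions can report a different number of pairs, so the comparison is only meant for lists of distinct numbers.

```python
def find_doubles_brute(num_list, val):
    # Check every pair (i, j) with i < j: O(n^2) comparisons.
    output = []
    n = len(num_list)
    for i in range(n):
        for j in range(i + 1, n):
            if num_list[i] + num_list[j] == val:
                output.append([num_list[i], num_list[j]])
    return output

def find_doubles_two_pointers(num_list, val):
    # One pass with two pointers moving toward each other: O(n) steps.
    # The condition p1 < p2 is used here so an empty list is handled too.
    output = []
    p1, p2 = 0, len(num_list) - 1
    while p1 < p2:
        _sum_val = num_list[p1] + num_list[p2]
        if _sum_val < val:
            p1 += 1
        elif _sum_val > val:
            p2 -= 1
        else:
            output.append([num_list[p1], num_list[p2]])
            p1 += 1
    return output

# Hypothetical example: a sorted list of distinct numbers.
num_list = [1, 2, 3, 4, 5, 6]

pairs_brute = find_doubles_brute(num_list, 7)
pairs_two = find_doubles_two_pointers(num_list, 7)
print(pairs_brute)   # [[1, 6], [2, 5], [3, 4]]
print(pairs_two)     # [[1, 6], [2, 5], [3, 4]]

# A target that no pair reaches gives an empty result from both.
print(find_doubles_brute(num_list, 20), find_doubles_two_pointers(num_list, 20))
```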

As a side note, the function `find_doubles_two_pointers()` is better than `find_triplets_hashtable()` as well; we came up with that function in this chapter under Understanding Big O Notation, Learning the common Big O Complexities, Example of decreasing computational complexity at the expense of memory complexity. While the computational complexity of both functions is O(n), the function `find_doubles_two_pointers()` is advantageous in terms of memory complexity because, apart from the output, it does not use any extra space; the function `find_triplets_hashtable()` has to use O(n) extra space due to the hash table.

# Answers


**Question:**
1. What is the most straightforward function that solves this task? Imagine that you don’t have any limitations on computation, time, or memory. Implement the solution in the form of a function and call it `remove_duplicates_naive()`.

**Answer:**

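We will create the test data first. The following code creates `test_correctness`, a small list for checking correctness, and `test_complexity`, a sorted list of 100,000 random integers from 1 to 999 (the upper bound of `np.random.randint()` is exclusive).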

In [170]:
import numpy as np

test_correctness = [1,2,2,3,3,3,4,10,10]
test_complexity = np.sort(np.random.randint(1,1000,100000)).tolist()

Now, the definition of `remove_duplicates_naive()`

In [171]:
def remove_duplicates_naive(input_list):
    uniuqe_nums = []
    for num in input_list[::-1]:
        if num in uniuqe_nums:
            input_list.pop(
                find_index(input_list,num)
            )
        else:
            uniuqe_nums.append(num)
    return input_list

def find_index(input_list,num):
    return input_list.index(num)

In [172]:
remove_duplicates_naive(test_correctness)

[1, 2, 3, 4, 10]

In [147]:
%%time
remove_duplicates_naive(test_complexity)

Wall time: 51.2 s


[1, 2, 3, 4, 5, ..., 184, 185, ... (output truncated)

**Question:**

2. What is the computational complexity, and memory complexity of `remove_duplicates_naive()`?

**Answer:**

- The computational complexity of the preceding function is $$ O(n^2) $$, n being the number of values in `input_list`. There is a loop with n iterations that evaluates each `num` in `input_list`, and inside it `find_index()` runs another loop of up to n steps to find the index of `num` in `input_list`.

- The memory complexity of the preceding function is $$ O(n) $$, n being the number of values in `input_list`. That much memory is used by `uniuqe_nums` and by the reversed copy `input_list[::-1]`.

**Question:**

3. Try to come up with a function that has less computational complexity than `remove_duplicates_naive()`, and call it `remove_duplicates_inventive_CPU()`.

In [148]:
def remove_duplicates_inventive_CPU(input_list):

    uniuqe_nums = []
    dropping_index = []
    
    for i,num in enumerate(input_list):
        if num in uniuqe_nums:
            dropping_index.append(i)
        else:
            uniuqe_nums.append(num)
            
    for i in dropping_index[::-1]:
        input_list.pop(i)
        
    return input_list

In [198]:
test_correctness = [1,2,2,3,3,3,4,10,10]
test_complexity = np.sort(np.random.randint(1,1000,100000)).tolist()

In [199]:
remove_duplicates_inventive_CPU(test_correctness)

[1, 2, 3, 4, 10]

In [200]:
%%time
remove_duplicates_inventive_CPU(test_complexity)

Wall time: 669 ms


[1, 2, 3, 4, 5, ..., 184, 185, ... (output truncated)

**Question**:

4. What is the computational complexity, and memory complexity of `remove_duplicates_inventive_CPU()`?

**Answer:**

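- Counting loop iterations, the computational complexity of the preceding function is $$ O(n) $$, n being the number of values in `input_list`. There is a loop with n iterations that evaluates each `num` in `input_list`, followed by a second loop that is not nested and removes the values whose indices are in `dropping_index`. The computational complexity is O(n+n), which simplifies to O(n). Note that this count treats `num in uniuqe_nums` and `input_list.pop(i)` as single steps. On a Python list both take time proportional to the number of values they have to go over, so the running time is closer to O(n*u), u being the number of distinct values, and approaches O(n^2) when most values are distinct.

- The memory complexity of the preceding function is $$O(n)$$, n being the number of values in `input_list`. We are using O(n+n): n due to `uniuqe_nums` and another n due to `dropping_index`. O(n+n) simplifies to O(n).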
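If the membership test is the concern, one possible variation is to keep the values already seen in a `set`, whose membership test takes constant time on average, and to build a new list instead of popping from the old one. The following is a minimal sketch under that assumption; the name `remove_duplicates_with_set()` is hypothetical and is not one of the functions from the challenge solution. Unlike the functions in this challenge, it returns a new list and leaves its input unchanged.

```python
def remove_duplicates_with_set(input_list):
    # Hypothetical variant: O(n) time on average, O(n) extra memory.
    seen = set()        # values encountered so far
    unique_nums = []    # first occurrence of each value, in order
    for num in input_list:
        if num not in seen:   # average O(1) lookup in a set
            seen.add(num)
            unique_nums.append(num)
    return unique_nums

test_correctness = [1, 2, 2, 3, 3, 3, 4, 10, 10]
result = remove_duplicates_with_set(test_correctness)
print(result)  # [1, 2, 3, 4, 10]
```

The trade-off is the same one discussed in this challenge: the set buys computation time at the cost of O(n) extra memory.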

**Question**:

5. Try to come up with a function that has less memory complexity than `remove_duplicates_naive()`, and call it `remove_duplicates_inventive_RAM()`.

In [201]:
def remove_duplicates_inventive_RAM(input_list):
   
    for i in range(len(input_list)):
        if i >= len(input_list):
            return(input_list)
        
        num =  input_list[-i-1]
        
        for num2 in input_list[:-i-1]:
            if num == num2:
                input_list.pop(
                    find_index(input_list,num)
                )
        
def find_index(input_list,num):
    return input_list.index(num)    

In [202]:
test_correctness = [1,2,2,3,3,3,4,10,10]
test_complexity = np.sort(np.random.randint(1,1000,100000)).tolist()

In [203]:
remove_duplicates_inventive_RAM(test_correctness)

[1, 2, 3, 4, 10]

In [204]:
%%time
remove_duplicates_inventive_RAM(test_complexity)

Wall time: 50.9 s


[1, 2, 3, 4, 5, ..., 184, 185, ... (output truncated)

**Question**:

6. What is the computational complexity, and memory complexity of `remove_duplicates_inventive_RAM()`?

**Answer**:
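- The computational complexity of the preceding function is $$ O(n^3) $$ as an upper bound, n being the number of values in `input_list`. There are three nested loops: the outer loop, the inner loop over `input_list[:-i-1]`, and the scan inside `find_index()`. Because every call to `find_index()` is followed by a removal, it can run at most n times in total, so the actual running time is closer to O(n^2).

- The memory complexity of the preceding function is $$O(1)$$ in the sense that it keeps no auxiliary list. Note, however, that the slice `input_list[:-i-1]` creates a temporary copy of up to n values on each outer iteration; looping over indices instead of the slice would avoid that copy.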

**Question**:

7. Study, and learn the coding pattern Two Pointers from the following subchapter. Can this pattern be used to solve this programming task?

**Answer**:

Yes.

**Question**:

8. Use the two-pointers pattern to solve this task, and call the ensuing function `remove_duplicates_two_pointers()`.

In [205]:
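def remove_duplicates_two_pointers(input_list):
    
    # p1 points to the last value kept, p2 to the value being examined.
    p1, p2 = 0, 1
    
    while p2 < len(input_list):
        
        if input_list[p1] == input_list[p2]:
            # Duplicate: remove it; p2 now points to the next value.
            input_list.pop(p2)
        else:
            p1 += 1
            p2 += 1
            
    return input_list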

In [206]:
test_correctness = [1,2,2,3,3,3,4,10,10]
test_complexity = np.sort(np.random.randint(1,1000,100000)).tolist()

In [207]:
remove_duplicates_two_pointers(test_correctness)

[1, 2, 3, 4, 10]

In [208]:
%%time
remove_duplicates_two_pointers(test_complexity)

Wall time: 13 s


[1, 2, 3, 4, 5, ..., 184, 185, ... (output truncated)

**Question**:

9.	What is the computational complexity, and memory complexity of `remove_duplicates_two_pointers()`?

**Answer**:
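- Counting loop iterations, the computational complexity of the preceding function is $$ O(n) $$, n being the number of values in `input_list`. There is only one loop with n iterations. Note that `input_list.pop(p2)` is itself not a constant-time step: it shifts every value after `p2`, so when the list holds many duplicates the actual running time grows faster than linearly, up to O(n^2).

- The memory complexity of the preceding function is $$O(1)$$. We are not using any extra space.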
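One way to keep the two-pointers idea while avoiding repeated calls to `list.pop()` is to overwrite values in place and cut off the leftover tail once at the end. The following is a minimal sketch of that variation; the name `remove_duplicates_two_pointers_overwrite()` and the pointer names `read` and `write` are hypothetical and do not come from the challenge solution. Like the functions above, it assumes the input list is sorted and it modifies the list in place.

```python
def remove_duplicates_two_pointers_overwrite(input_list):
    # Hypothetical variant: O(n) time, O(1) extra memory.
    if not input_list:
        return input_list
    write = 0  # index of the last unique value kept so far
    for read in range(1, len(input_list)):
        if input_list[read] != input_list[write]:
            # New value found: store it right after the last kept value.
            write += 1
            input_list[write] = input_list[read]
    # Everything after index `write` is leftover; remove it in one step.
    del input_list[write + 1:]
    return input_list

test_correctness = [1, 2, 2, 3, 3, 3, 4, 10, 10]
result = remove_duplicates_two_pointers_overwrite(test_correctness)
print(result)  # [1, 2, 3, 4, 10]
```

Each value is read once and written at most once, and the single `del` at the end replaces the many individual removals.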

**Question**:

10. In this challenge, we created four different functions to solve the task. In creating each of these functions we used one of the three strategies that were introduced at the beginning of this challenge. Specify under what strategy each function was created.

**Answer**:

 - **Strategy 1: Learning the famous coding patterns, and trying to mix and match these patterns to every programming challenge**: `remove_duplicates_two_pointers()`
 - **Strategy 2: Using our invention to come up with a code that solves the programming challenge**: `remove_duplicates_inventive_RAM()` and `remove_duplicates_inventive_CPU()`
 - **Strategy 3: Spending little on the coding part, just coming up with any logically correct code, and using advanced technology to run the code**: `remove_duplicates_naive()`.

**Question**:

11. From what you experienced in this challenge, assign one of the following three labels to each of the three strategies to summarize and remember them: **High-Thinking-Low-Computation**, **Low-Thinking-Learning-High-Computation**, and **High-Learning-Low-Computation**.


**Answer**:

- Strategy 1: **High-Learning-Low-Computation**
- Strategy 2: **High-Thinking-Low-Computation**
- Strategy 3: **Low-Thinking-Learning-High-Computation**

**Question**:

12.	From what you experienced in this challenge, decide which of these three strategies is more advantageous when dealing with big data every day. What is your reasoning?

**Answer**:

The best strategy is Strategy 1 for data practitioners who have to manipulate data and code on a daily basis. It is best to learn these coding patterns and reach the best computational performance.