## Activity 02: Indexing, Slicing, and Iterating

Our client wants to prove that our dataset is nicely distributed around the mean value of 100.   
They asked us to run some tests on several subsections of it to make sure they won't get a non-representative section of our data.

For each subtask, look at the mean value of the selected subsection.

#### Loading the dataset

In [1]:
# importing the necessary dependencies
import numpy as np

In [2]:
# loading the dataset from the CSV file
dataset = np.genfromtxt('normal_distribution.csv', delimiter=',')
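A quick sanity check after loading can catch parsing problems early. The sketch below is only an illustration: instead of `normal_distribution.csv` it reads a tiny in-memory CSV (the hypothetical `csv_text`, with values copied from the first two rows of the dataset and cut down to three columns) and checks the shape and whether any value failed to parse.

```python
import io
import numpy as np

# Stand-in for normal_distribution.csv: two rows, three columns (illustrative only)
csv_text = "99.14931546,104.03852715,107.43534677\n92.02628776,97.10439252,99.32066924\n"

# genfromtxt accepts a file-like object as well as a filename
sample = np.genfromtxt(io.StringIO(csv_text), delimiter=',')

print(sample.shape)            # (rows, columns)
print(np.isnan(sample).any())  # True would signal values that failed to parse
```

The same two checks should carry over to the real `dataset` unchanged.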

---

#### Indexing

Since we need several rows of our dataset to complete the given task, we have to use indexing to get the right rows.   
To recap, we need: 
- the second row 
- the last row
- the first value of the first row
- the last value of the second-to-last row

In [5]:
dataset

array([[ 99.14931546, 104.03852715, 107.43534677,  97.85230675,
         98.74986914,  98.80833412,  96.81964892,  98.56783189],
       [ 92.02628776,  97.10439252,  99.32066924,  97.24584816,
         92.9267508 ,  92.65657752, 105.7197853 , 101.23162942],
       [ 95.66253664,  95.17750125,  90.93318132, 110.18889465,
         98.80084371, 105.95297652,  98.37481387, 106.54654286],
       [ 91.37294597, 100.96781394, 100.40118279, 113.42090475,
        105.48508838,  91.6604946 , 106.1472841 ,  95.08715803],
       [101.20862522, 103.5730309 , 100.28690912, 105.85269352,
         93.37126331, 108.57980357, 100.79478953,  94.20019732],
       [102.80387079,  98.29687616,  93.24376389,  97.24130034,
         89.03452725,  96.2832753 , 104.60344836, 101.13442416],
       [106.71751618, 102.97585605,  98.45723272, 100.72418901,
        106.39798503,  95.46493436,  94.35373179, 106.83273763],
       [ 96.02548256, 102.82360856, 106.47551845, 101.34745901,
        102.45651798,  98.7476749

In [3]:
# indexing the second row of the dataset (2nd row)
dataset[1]


array([ 92.02628776,  97.10439252,  99.32066924,  97.24584816,
        92.9267508 ,  92.65657752, 105.7197853 , 101.23162942])

In [6]:
# indexing the last element of the dataset (last row)

dataset[-1]

array([ 94.11176915,  99.62387832, 104.51786419,  97.62787811,
        93.97853495,  98.75108352, 106.05042487, 100.07721494])

In [7]:
# indexing the first value of the first row (1st row, 1st value)
dataset[0, 0]


99.14931546

In [8]:
# indexing the last value of the second to last row (we want to use the combined access syntax here)
dataset[-2, -1]


103.83852459
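Since the client's check is about mean values, each selection can be passed straight to `np.mean`. The following is a minimal sketch on a hypothetical stand-in array `demo` (same 24x8 shape as the dataset, but filled with consecutive numbers so the results are easy to verify by hand), not on the real data.

```python
import numpy as np

# Hypothetical stand-in with the same shape as the dataset (24 rows, 8 columns)
demo = np.arange(24 * 8, dtype=float).reshape(24, 8)

second_row = demo[1]                   # a single index selects a whole row
last_row = demo[-1]                    # negative indices count from the end
first_value = demo[0, 0]               # row and column in one pair of brackets
last_of_second_to_last = demo[-2, -1]  # both axes accept negative indices

# mean of each row selection; single values are their own mean
print(np.mean(second_row), np.mean(last_row))
print(first_value, last_of_second_to_last)
```

Replacing `demo` with `dataset` should give the means the client is interested in.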

---

#### Slicing

Besides single rows and values, we also need some subsets of the dataset.   
Here we want slices:
- a 2x2 slice covering the second and third elements of the second and third rows
- every other element of the 5th row
- the content of the last row in reversed order

In [9]:
# slicing an intersection of 4 elements (2x2) of the second and third rows and the second and third columns
dataset[1:3, 1:3]


array([[97.10439252, 99.32066924],
       [95.17750125, 90.93318132]])

##### Why is it not a problem if such a small subsection deviates more strongly from 100?

Several smaller values can cluster in such a small subsection, pulling its mean well below 100.   
If we make our subsection larger, we have a higher chance of getting a more representative view of our data.
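This effect can be illustrated with simulated data. The sketch below assumes a hypothetical normal population with mean 100 and standard deviation 5 (both values are assumptions, not taken from the CSV) and compares how much the mean of a 4-value subsection scatters compared with the mean of 192 values, the size of the full 24x8 dataset. For independent samples, the spread of the mean shrinks roughly with the square root of the sample size.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: mean 100, standard deviation 5 (assumed values)
draws = 2000
small_means = rng.normal(100, 5, size=(draws, 4)).mean(axis=1)    # 2x2 subsections: 4 values each
large_means = rng.normal(100, 5, size=(draws, 192)).mean(axis=1)  # 24x8 datasets: 192 values each

# spread of the subsection means around 100: the small samples scatter far more
print(small_means.std(), large_means.std())
```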

In [16]:
dataset[4]

array([101.20862522, 103.5730309 , 100.28690912, 105.85269352,
        93.37126331, 108.57980357, 100.79478953,  94.20019732])

In [19]:
# selecting every second element of the fifth row 
dataset[4, ::2]


array([101.20862522, 100.28690912,  93.37126331, 100.79478953])

In [21]:
dataset[-1]

array([ 94.11176915,  99.62387832, 104.51786419,  97.62787811,
        93.97853495,  98.75108352, 106.05042487, 100.07721494])

In [20]:
# reversing the entry order, selecting the last row in reversed order
dataset[-1, ::-1]


array([100.07721494, 106.05042487,  98.75108352,  93.97853495,
        97.62787811, 104.51786419,  99.62387832,  94.11176915])
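All three slices follow the same `start:stop:step` pattern, applied per axis. As a small sketch on a hypothetical stand-in array `demo` (24x8, filled with consecutive numbers rather than the real data), the snippet below also shows that basic slices are views that share memory with the original array.

```python
import numpy as np

demo = np.arange(24 * 8, dtype=float).reshape(24, 8)  # hypothetical stand-in

block = demo[1:3, 1:3]          # rows 1-2 and columns 1-2 (stop index is exclusive): a 2x2 block
every_other = demo[4, ::2]      # step 2 keeps columns 0, 2, 4, 6 of the fifth row
reversed_last = demo[-1, ::-1]  # step -1 walks the last row backwards

print(block.shape, every_other, reversed_last[0])

# basic slices are views: they share memory with the original array
print(np.shares_memory(demo, block))
```

Because of the view behaviour, writing into a slice also changes the original array; calling `.copy()` on the slice avoids that when an independent subset is needed.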

---

#### Splitting

Our client's team only wants to use a small subset of the given dataset.   
Therefore we need to first split it into 3 equal pieces and then give them the first half of the first split.   
They sent us this drawing to show us what they need:
```
1, 2, 3, 4, 5, 6          1, 2     3, 4    5, 6          1, 2  
3, 2, 1, 5, 4, 6    =>    3, 2     1, 5    4, 6    =>    3, 2    =>    1, 2
5, 3, 1, 2, 4, 3          5, 3     1, 2    4, 3                        3, 2
1, 2, 2, 4, 1, 5          1, 2     2, 4    1, 5          5, 3
                                                         1, 2
```

> **Note:**   
We are using a very small dataset here, but imagine you have a huge amount of data and only want to look at a small subset of it to tweak your visualizations.

In [22]:
dataset.shape

(24, 8)
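The two steps from the client's drawing can be sketched on the 4x6 example itself. The variable names `drawing`, `hor_splits`, `ver_splits`, and `uneven` are illustrative. The last lines cover the case where the number of columns is not divisible by the number of pieces, as with 8 columns and 3 pieces: `np.hsplit` requires an equal division, while `np.array_split` allows uneven pieces.

```python
import numpy as np

# the 4x6 example from the client's drawing
drawing = np.array([[1, 2, 3, 4, 5, 6],
                    [3, 2, 1, 5, 4, 6],
                    [5, 3, 1, 2, 4, 3],
                    [1, 2, 2, 4, 1, 5]])

hor_splits = np.hsplit(drawing, 3)        # three 4x2 pieces, split along the columns
ver_splits = np.vsplit(hor_splits[0], 2)  # first piece cut into two 2x2 halves
print(ver_splits[0])                      # the part the client asked for

# 8 columns cannot be divided into 3 equal pieces, so np.hsplit(..., 3) would raise;
# np.array_split tolerates uneven pieces (here 3, 3, and 2 columns)
uneven = np.array_split(np.zeros((24, 8)), 3, axis=1)
print([part.shape for part in uneven])
```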

In [26]:
# splitting up our dataset horizontally into 2 equal parts of 4 columns each (8 columns cannot be split into 3 equal parts)
np.hsplit(dataset, 2)

[array([[ 99.14931546, 104.03852715, 107.43534677,  97.85230675],
        [ 92.02628776,  97.10439252,  99.32066924,  97.24584816],
        [ 95.66253664,  95.17750125,  90.93318132, 110.18889465],
        [ 91.37294597, 100.96781394, 100.40118279, 113.42090475],
        [101.20862522, 103.5730309 , 100.28690912, 105.85269352],
        [102.80387079,  98.29687616,  93.24376389,  97.24130034],
        [106.71751618, 102.97585605,  98.45723272, 100.72418901],
        [ 96.02548256, 102.82360856, 106.47551845, 101.34745901],
        [105.30350449,  92.87730812, 103.19258339, 104.40518318],
        [110.44484313,  93.87155456, 101.5363647 ,  97.65393524],
        [101.3514185 , 100.37372248, 106.6471081 , 100.61742813],
        [ 97.21315663, 107.02874163, 102.17642112,  96.74630281],
        [ 95.65982034, 107.22482426, 107.19119932, 102.93039474],
        [100.39303522,  92.0108226 ,  97.75887636,  93.18884302],
        [103.1521596 , 109.40523174,  93.83969256,  99.95827854],
        [1

In [27]:
# splitting up our dataset vertically into 2 equal parts of 12 rows each
np.vsplit(dataset, 2)

[array([[ 99.14931546, 104.03852715, 107.43534677,  97.85230675,
          98.74986914,  98.80833412,  96.81964892,  98.56783189],
        [ 92.02628776,  97.10439252,  99.32066924,  97.24584816,
          92.9267508 ,  92.65657752, 105.7197853 , 101.23162942],
        [ 95.66253664,  95.17750125,  90.93318132, 110.18889465,
          98.80084371, 105.95297652,  98.37481387, 106.54654286],
        [ 91.37294597, 100.96781394, 100.40118279, 113.42090475,
         105.48508838,  91.6604946 , 106.1472841 ,  95.08715803],
        [101.20862522, 103.5730309 , 100.28690912, 105.85269352,
          93.37126331, 108.57980357, 100.79478953,  94.20019732],
        [102.80387079,  98.29687616,  93.24376389,  97.24130034,
          89.03452725,  96.2832753 , 104.60344836, 101.13442416],
        [106.71751618, 102.97585605,  98.45723272, 100.72418901,
         106.39798503,  95.46493436,  94.35373179, 106.83273763],
        [ 96.02548256, 102.82360856, 106.47551845, 101.34745901,
         102.45651

---

#### Iterating

Once you have sent over the dataset, they tell you that they also need a way to iterate over the whole dataset element by element, as if it were a one-dimensional list.   
However, they also want to know the position of each element in the dataset itself.

They send you this piece of code and tell you that it's not working as intended.   
Come up with the right solution for their needs.

In [14]:
# iterating over whole dataset (each value in each row)
curr_index = 0
for x in np.nditer(dataset):
    print(x, curr_index)
    curr_index += 1

99.14931546 0
104.03852715 1
107.43534677 2
97.85230675 3
98.74986914 4
98.80833412 5
96.81964892 6
98.56783189 7
92.02628776 8
97.10439252 9
99.32066924 10
97.24584816 11
92.9267508 12
92.65657752 13
105.7197853 14
101.23162942 15
95.66253664 16
95.17750125 17
90.93318132 18
110.18889465 19
98.80084371 20
105.95297652 21
98.37481387 22
106.54654286 23
91.37294597 24
100.96781394 25
100.40118279 26
113.42090475 27
105.48508838 28
91.6604946 29
106.1472841 30
95.08715803 31
101.20862522 32
103.5730309 33
100.28690912 34
105.85269352 35
93.37126331 36
108.57980357 37
100.79478953 38
94.20019732 39
102.80387079 40
98.29687616 41
93.24376389 42
97.24130034 43
89.03452725 44
96.2832753 45
104.60344836 46
101.13442416 47
106.71751618 48
102.97585605 49
98.45723272 50
100.72418901 51
106.39798503 52
95.46493436 53
94.35373179 54
106.83273763 55
96.02548256 56
102.82360856 57
106.47551845 58
101.34745901 59
102.45651798 60
98.74767493 61
97.57544275 62
92.5748759 63
105.30350449 64
92.87730812
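The counter in the client's code is only a flat position, not a (row, column) pair. One way to recover the pair from it is `np.unravel_index`, sketched here on a hypothetical 2x3 stand-in array `demo`. The sketch assumes a C-contiguous array, for which `np.nditer` visits the values row by row.

```python
import numpy as np

demo = np.arange(6, dtype=float).reshape(2, 3)  # hypothetical 2x3 stand-in

positions = []
# for a C-contiguous array like this one, nditer visits the values row by row
for flat_index, value in enumerate(np.nditer(demo)):
    # convert the flat counter into a (row, column) pair for the given shape
    row, col = np.unravel_index(flat_index, demo.shape)
    positions.append((int(row), int(col)))
    print(value, flat_index, positions[-1])
```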

In [29]:
# iterating over the whole dataset row by row and printing each value (the position is not printed here)
for row in dataset:
    for col in row:
        print(col, end=', ')

99.14931546, 104.03852715, 107.43534677, 97.85230675, 98.74986914, 98.80833412, 96.81964892, 98.56783189, 92.02628776, 97.10439252, 99.32066924, 97.24584816, 92.9267508, 92.65657752, 105.7197853, 101.23162942, 95.66253664, 95.17750125, 90.93318132, 110.18889465, 98.80084371, 105.95297652, 98.37481387, 106.54654286, 91.37294597, 100.96781394, 100.40118279, 113.42090475, 105.48508838, 91.6604946, 106.1472841, 95.08715803, 101.20862522, 103.5730309, 100.28690912, 105.85269352, 93.37126331, 108.57980357, 100.79478953, 94.20019732, 102.80387079, 98.29687616, 93.24376389, 97.24130034, 89.03452725, 96.2832753, 104.60344836, 101.13442416, 106.71751618, 102.97585605, 98.45723272, 100.72418901, 106.39798503, 95.46493436, 94.35373179, 106.83273763, 96.02548256, 102.82360856, 106.47551845, 101.34745901, 102.45651798, 98.74767493, 97.57544275, 92.5748759, 105.30350449, 92.87730812, 103.19258339, 104.40518318, 101.29326772, 100.85447132, 101.2226037, 106.03868807, 110.44484313, 93.87155456, 101.5363
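One possible way to meet the client's request, iterating element by element while also knowing the position, is `np.ndenumerate`, which yields an index tuple together with each value. The sketch below uses a hypothetical 2x3 stand-in array `demo` to keep the output short.

```python
import numpy as np

demo = np.arange(6, dtype=float).reshape(2, 3)  # hypothetical 2x3 stand-in

# ndenumerate yields ((row, column), value) pairs for every element
pairs = list(np.ndenumerate(demo))
for index, value in pairs:
    print(index, value)
```

Applied to `dataset`, the same loop would print every value next to its (row, column) position.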