## Aggregate Functions (continued...)

In [13]:
a = np.arange(1, 13).reshape(3, -1)

a

array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])

<b>1. `.sum()`</b>
- `np.sum(array_variable)` = Sum of all numeric elements

- `array_variable.sum()` = another way to compute the same total; equivalent to `np.sum()`. Returns the sum of all numeric elements.<br>

But there is a practical difference between the above 2 ways -<br>
`np.sum()` is a function that accepts any array-like object; if the argument is not a NumPy array (for example, a Python list), it gets converted to a NumPy array internally. In the 2nd way, `.sum()` is a method of the NumPy array itself, so it works only when the object already is a NumPy array. *The same behavior holds true for the other aggregate functions.*<br><br>

- `np.sum(a, axis=1)` = sum of all elements along horizontal direction

- `np.sum(a, axis=0)` = sum of all elements along vertical direction

In [15]:
print(a.sum())

print(np.sum(a))

78
78


In [21]:
np.sum(a, axis=1)

## or, a.sum(axis=1)

array([10, 26, 42])

In [22]:
a.sum(axis=0)

## sum along vertical direction

array([15, 18, 21, 24])

In [28]:
## `b` is Python list

b = [[3,4,5,6], [2,5,-1,0]]

print( np.sum(b, axis=1) )       ## this works fine, because Python list gets converted to Numpy array internally

print( b.sum() )                 ## ERROR !!

[18  6]


AttributeError: 'list' object has no attribute 'sum'

<br>
<hr>

### How to determine `axis` in any operation :

For 2-D NumPy arrays (and Pandas DataFrames), if :

- Operation occurs along the **VERTICAL** direction => `axis = 0`
<br><br>
- Operation occurs along the **HORIZONTAL** direction => `axis = 1`

The above convention/intuition generally holds for ***any kind of operation***, such as sum, stacking, drop, merge, etc.
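
As a small illustration that the same convention applies to stacking, here is a minimal sketch using `np.concatenate`. The arrays `top` and `bottom` are made up for this example.

```python
import numpy as np

# Two small illustrative 2-D arrays (hypothetical data)
top = np.array([[1, 2], [3, 4]])
bottom = np.array([[5, 6], [7, 8]])

# axis=0: arrays are joined in the vertical direction (more rows)
vertical = np.concatenate([top, bottom], axis=0)

# axis=1: arrays are joined in the horizontal direction (more columns)
horizontal = np.concatenate([top, bottom], axis=1)

print(vertical.shape)     # (4, 2)
print(horizontal.shape)   # (2, 4)
```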

<hr>

<b>2. `.mean()`</b>
- `np.mean(array_variable)` = Mean of all numeric elements

- `array_variable.mean()` = another way to compute the mean; equivalent to `np.mean()`. Returns the mean of all numeric elements.<br>

In [None]:
a = np.arange(1, 13).reshape(3, 4)

In [38]:
print( a.mean() )                ## mean of all numbers

print( np.mean(a, axis=1) )      ## mean along horizontal direction only

print( np.mean(a, axis=0) )      ## mean along vertical direction only

6.5
[ 2.5  6.5 10.5]
[5. 6. 7. 8.]


<b>3. `.min()`</b>


<b>4. `.max()`</b>

In [41]:
print( np.max(a) )
print( np.max(a, axis=0) )
print( a.max(axis=1) )

print("-" * 20)

print( np.min(a) )
print( np.min(a, axis=1) )



12
[ 9 10 11 12]
[ 4  8 12]
--------------------
1
[1 5 9]


****

**Example** : Apply `.sum()` along different axes of a 3-D array.

In [50]:
## example :

a = np.arange(1, 13).reshape(3, -1)
print(a.sum())

b = np.arange(11, 23).reshape(3, -1)
print(b.sum())

c = np.arange(5, 17).reshape(3, -1)
print(c.sum())

D = np.array([a,b,c])

D

78
198
126


array([[[ 1,  2,  3,  4],
        [ 5,  6,  7,  8],
        [ 9, 10, 11, 12]],

       [[11, 12, 13, 14],
        [15, 16, 17, 18],
        [19, 20, 21, 22]],

       [[ 5,  6,  7,  8],
        [ 9, 10, 11, 12],
        [13, 14, 15, 16]]])

In [59]:
print(f'D.sum() = {D.sum()} \t D.shape = {D.shape} \t D.ndim = {D.ndim} \t D.size = {D.size}')

D.sum() = 402 	 D.shape = (3, 3, 4) 	 D.ndim = 3 	 D.size = 36


In [71]:
## sum along the depth direction

D.sum(axis=0)

array([[17, 20, 23, 26],
       [29, 32, 35, 38],
       [41, 44, 47, 50]])

In [68]:
## sum along vertical direction

D.sum(axis=1)

array([[15, 18, 21, 24],
       [45, 48, 51, 54],
       [27, 30, 33, 36]])

In [69]:
## sum along horizontal direction

D.sum(axis=2)

array([[10, 26, 42],
       [50, 66, 82],
       [26, 42, 58]])
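
One way to keep track of which axis is which is to look at the shapes: summing along an axis removes that axis from the result. The sketch below uses a stand-in array `cube` with the same shape as `D`; its values are arbitrary.

```python
import numpy as np

# Stand-in array with the same shape as D, (3, 3, 4); values are arbitrary
cube = np.arange(36).reshape(3, 3, 4)

shapes = []
for axis in range(cube.ndim):
    reduced = cube.sum(axis=axis)      # the summed axis disappears
    shapes.append(reduced.shape)
    print(f"axis={axis}: {cube.shape} -> {reduced.shape}")

# axis=0: (3, 3, 4) -> (3, 4)
# axis=1: (3, 3, 4) -> (3, 4)
# axis=2: (3, 3, 4) -> (3, 3)
```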

****

****

### Logical Functions
- `any`
- `all`

<b>`np.any()`</b>

- checks if there is at least 1 non-zero value (i.e. a value that evaluates to `True`) in the array. If so, it returns `True`; otherwise `False`.

- if at least 1 True value => returns `True`.

In [72]:
a = np.array([1,2,3,4])

np.any(a)    ## all values in a are non-zero

True

In [75]:
np.any([8,0,1,-9])       ## has three non-zero values and one 0

True

In [77]:
a = np.array([False, False, True, False])        ## has 1 True (non-zero) value

a.any()

True

In [80]:
a = np.array([False] * 3)

a.any()

False

In [104]:
np.any(np.zeros(5))

False

<br>

Given the array `arr`, return `True` if at least one value in `arr` is less than 0.6; otherwise return `False`.



In [3]:
arr = 2 * np.arange(0,2,0.5)
arr

array([0., 1., 2., 3.])

In [4]:
np.any(arr < 0.6)

True

<br>

<b>`np.all()`</b>

- checks if all elements are non-zero values (i.e. evaluate to `True`).

- if at least 1 element is 0 (or `False`) => returns `False`.

- if all True values => returns `True`.

In [83]:
np.all(np.array([True, True, True, True]))

True

In [89]:
np.all(np.array([1,1,1,0,1]))

False

In [99]:
np.all(np.array([3,1,9,-1]))

True

In [110]:
np.all(np.array([False, False, True]))

False

In [109]:
a = np.array([4, 5, 8, 2, 0])
b = np.array([1, 3, 8, 4, -1])

mask = (a == b)

print(mask)

np.any(a == b)

## we want to know whether any value in `a` is equal to the corresponding value in `b`. So, use .any()

[False False  True False False]


True

In [119]:
## example :

g = np.arange(1, 13).reshape(3,4)
h = np.arange(2, 14).reshape(3,4)
h[1,1] = 4
h[2,2] = 1

print(g)
print("-" * 20)
print(h)

[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]
--------------------
[[ 2  3  4  5]
 [ 6  4  8  9]
 [10 11  1 13]]


In [120]:
np.any(h < g, axis=0)      ## operation along vertical direction

array([False,  True,  True, False])
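
`np.all()` accepts the same `axis` argument. As a complementary sketch on the same `g` and `h` arrays, the following checks in which rows and in which columns every element of `h` is greater than the corresponding element of `g`.

```python
import numpy as np

g = np.arange(1, 13).reshape(3, 4)
h = np.arange(2, 14).reshape(3, 4)
h[1, 1] = 4
h[2, 2] = 1

# axis=1: one result per row (operation along the horizontal direction)
rows_all_greater = np.all(h > g, axis=1)

# axis=0: one result per column (operation along the vertical direction)
cols_all_greater = np.all(h > g, axis=0)

print(rows_all_greater)   # [ True False False]
print(cols_all_greater)   # [ True False False  True]
```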

<br>
<hr>

### Sorting

`np.sort()` (like the `.sort()` method) sorts in ascending order only; it has no argument for descending order. However, this can be worked around, for example by reversing the sorted result.

In [128]:
arr = np.array([2, 30, 41, 7, 1, 13, 52])
print(arr)

np.sort(arr)

[ 2 30 41  7  1 13 52]


array([ 1,  2,  7, 13, 30, 41, 52])
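
The workaround for descending order mentioned above can be sketched in two common ways: reversing the ascending result with a slice, or (for numeric data only) negating the values before and after sorting. This is one possible approach, not the only one.

```python
import numpy as np

arr = np.array([2, 30, 41, 7, 1, 13, 52])

# Option 1: sort ascending, then reverse with a slice
descending = np.sort(arr)[::-1]

# Option 2 (numeric data only): negate, sort, negate back
descending_alt = -np.sort(-arr)

print(descending)       # [52 41 30 13  7  2  1]
print(descending_alt)   # [52 41 30 13  7  2  1]
```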

To find the indexes that the sorted values had in the original array, use `.argsort()`

For example, in the original array `arr` :<br>
13 is at index `5`<br>
41 is at index `2`

In [129]:
np.argsort(arr)

array([4, 0, 3, 5, 1, 2, 6], dtype=int64)
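
A typical use of these indexes is to reorder the array itself, or a second array that runs parallel to it. In the sketch below, `labels` is a hypothetical companion array invented for illustration.

```python
import numpy as np

arr = np.array([2, 30, 41, 7, 1, 13, 52])
labels = np.array(["a", "b", "c", "d", "e", "f", "g"])  # hypothetical parallel array

order = np.argsort(arr)

# Indexing with `order` gives the same result as np.sort(arr)
sorted_via_index = arr[order]

# The same indexes reorder the parallel array consistently
labels_sorted = labels[order]

print(sorted_via_index)   # [ 1  2  7 13 30 41 52]
print(labels_sorted)      # ['e' 'a' 'd' 'f' 'b' 'c' 'g']
```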

<br>

Find index of minimum value :

In [130]:
np.argmin(arr)

## index `4` is the index of minimum value of arr

4

<br>

Find index of maximum value :

In [131]:
np.argmax(arr)

6

<br>

<u>Sorting of 2-D arrays</u>

**NOTE : last axis is taken as default axis of operation.**

default axis for :
- 3D array => 2
- 2D array => 1

In [134]:
k = np.array([[32, 45, 11], [4, 8, 1], [10, 3, 7], [98, 75, 110]])

print(k)

[[ 32  45  11]
 [  4   8   1]
 [ 10   3   7]
 [ 98  75 110]]


In [137]:
np.sort(k, axis=0)    ## sorting in vertical direction

array([[  4,   3,   1],
       [ 10,   8,   7],
       [ 32,  45,  11],
       [ 98,  75, 110]])

In [138]:
np.sort(k, axis=1)   ## sort in horizontal direction

array([[ 11,  32,  45],
       [  1,   4,   8],
       [  3,   7,  10],
       [ 75,  98, 110]])

In [143]:
np.sort(k)

## the last axis is taken to be default axis i.e. axis=1 in case of 2-D arrays

array([[ 11,  32,  45],
       [  1,   4,   8],
       [  3,   7,  10],
       [ 75,  98, 110]])

<hr><hr><br>

## Analyse a real-world dataset using Numpy

Let's see and work with a Fitbit dataset of a person...

In [3]:
fit_data = np.loadtxt("./datasets/fitness.txt", dtype="str")

## dtype='str' reads every column's values as strings

In [4]:
fit_data

array([['06-10-2017', '5464', '200', '181', '5', '0', '66'],
       ['07-10-2017', '6041', '100', '197', '8', '0', '66'],
       ['08-10-2017', '25', '100', '0', '5', '0', '66'],
       ['09-10-2017', '5461', '100', '174', '4', '0', '66'],
       ['10-10-2017', '6915', '200', '223', '5', '500', '66'],
       ['11-10-2017', '4545', '100', '149', '6', '0', '66'],
       ['12-10-2017', '4340', '100', '140', '6', '0', '66'],
       ['13-10-2017', '1230', '100', '38', '7', '0', '66'],
       ['14-10-2017', '61', '100', '1', '5', '0', '66'],
       ['15-10-2017', '1258', '100', '40', '6', '0', '65'],
       ['16-10-2017', '3148', '100', '101', '8', '0', '65'],
       ['17-10-2017', '4687', '100', '152', '5', '0', '65'],
       ['18-10-2017', '4732', '300', '150', '6', '500', '65'],
       ['19-10-2017', '3519', '100', '113', '7', '0', '65'],
       ['20-10-2017', '1580', '100', '49', '5', '0', '65'],
       ['21-10-2017', '2822', '100', '86', '6', '0', '65'],
       ['22-10-2017', '181', '10

In [5]:
print(fit_data.ndim)

print(fit_data.shape)

print(fit_data.size)

2
(96, 7)
672


In [6]:
## let's unpack all columns into respective variables

date, step_count, mood, calories_burned, hours_of_sleep, bool_active, weight = fit_data.T
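
The file `fitness.txt` itself is not shown here. As a self-contained sketch, assuming it is whitespace-delimited text, the loading and unpacking steps can be reproduced on the first three rows of the output above using an in-memory buffer (`sample` and `rows` are names chosen for this example).

```python
import io
import numpy as np

# First three rows of the displayed data, assumed to be whitespace-delimited
sample = io.StringIO(
    "06-10-2017 5464 200 181 5 0 66\n"
    "07-10-2017 6041 100 197 8 0 66\n"
    "08-10-2017 25 100 0 5 0 66\n"
)

rows = np.loadtxt(sample, dtype="str")

# Transposing turns each column into a row, so it can be unpacked by name
date, step_count, mood, calories_burned, hours_of_sleep, bool_active, weight = rows.T

# Values are still strings at this point; convert the numeric ones explicitly
step_count = step_count.astype(int)

print(rows.shape)    # (3, 7)
print(step_count)    # [5464 6041   25]
```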

In [7]:
step_count      = step_count.astype(int)
calories_burned = calories_burned.astype('int')
hours_of_sleep  = hours_of_sleep.astype('int')

step_count

## convert step_count, calories_burned and hours_of_sleep to integer type, since they hold whole-number values

array([5464, 6041,   25, 5461, 6915, 4545, 4340, 1230,   61, 1258, 3148,
       4687, 4732, 3519, 1580, 2822,  181, 3158, 4383, 3881, 4037,  202,
        292,  330, 2209, 4550, 4435, 4779, 1831, 2255,  539, 5464, 6041,
       4068, 4683, 4033, 6314,  614, 3149, 4005, 4880, 4136,  705,  570,
        269, 4275, 5999, 4421, 6930, 5195,  546,  493,  995, 1163, 6676,
       3608,  774, 1421, 4064, 2725, 5934, 1867, 3721, 2374, 2909, 1648,
        799, 7102, 3941, 7422,  437, 1231, 1696, 4921,  221, 6500, 3575,
       4061,  651,  753,  518, 5537, 4108, 5376, 3066,  177,   36,  299,
       1447, 2599,  702,  133,  153,  500, 2127, 2203])

`mood` and `bool_active` columns look categorical in nature, as they contain only a few distinct string values: "300", "200" & "100" for `mood`, and "500" & "0" for `bool_active`.

These can be re-labelled in a more intuitive way, using Boolean Masking, like below :

`mood` :
- "300" => "Happy"
- "200" => "Neutral"
- "100" => "Sad"

`bool_active` :
- "500" => "Active"
- "0" => "Inactive"

In [10]:
mood[mood == "300"] = "Happy"
mood[mood == "200"] = "Neutral"
mood[mood == "100"] = "Sad"

bool_active[bool_active == "500"] = "Active"
bool_active[bool_active == "0"]   = "Inactive"
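
One caveat with in-place relabelling: NumPy string arrays have a fixed width, and longer strings are silently truncated on assignment. It works in this notebook because the loaded array is wide enough (`<U10`, as the `np.unique` output shows), but a narrower array would not be. The sketch below uses a small hypothetical `codes` array to show the problem and one possible fix, widening the dtype first.

```python
import numpy as np

codes = np.array(["300", "200", "100", "300"])   # dtype is <U3 (3 characters wide)

# Assigning a longer string into a <U3 array silently truncates it
truncated = codes.copy()
truncated[codes == "300"] = "Happy"

# Widening the dtype first keeps the full labels
labels = codes.astype("<U10")
labels[codes == "300"] = "Happy"
labels[codes == "200"] = "Neutral"
labels[codes == "100"] = "Sad"

print(truncated)   # ['Hap' '200' '100' 'Hap']
print(labels)      # ['Happy' 'Neutral' 'Sad' 'Happy']
```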

In [16]:
print(mood)

print("-" * 100)

print(bool_active)

['Neutral' 'Sad' 'Sad' 'Sad' 'Neutral' 'Sad' 'Sad' 'Sad' 'Sad' 'Sad' 'Sad'
 'Sad' 'Happy' 'Sad' 'Sad' 'Sad' 'Sad' 'Neutral' 'Neutral' 'Neutral'
 'Neutral' 'Neutral' 'Neutral' 'Happy' 'Neutral' 'Happy' 'Happy' 'Happy'
 'Happy' 'Happy' 'Happy' 'Happy' 'Neutral' 'Happy' 'Happy' 'Happy' 'Happy'
 'Happy' 'Happy' 'Happy' 'Happy' 'Happy' 'Happy' 'Neutral' 'Happy' 'Happy'
 'Happy' 'Happy' 'Happy' 'Happy' 'Happy' 'Happy' 'Happy' 'Neutral' 'Sad'
 'Happy' 'Happy' 'Happy' 'Happy' 'Happy' 'Happy' 'Happy' 'Sad' 'Neutral'
 'Neutral' 'Sad' 'Sad' 'Neutral' 'Neutral' 'Happy' 'Neutral' 'Neutral'
 'Sad' 'Neutral' 'Sad' 'Neutral' 'Neutral' 'Sad' 'Sad' 'Sad' 'Sad' 'Happy'
 'Neutral' 'Happy' 'Neutral' 'Sad' 'Sad' 'Sad' 'Neutral' 'Neutral' 'Sad'
 'Sad' 'Happy' 'Neutral' 'Neutral' 'Happy']
----------------------------------------------------------------------------------------------------
['Inactive' 'Inactive' 'Inactive' 'Inactive' 'Active' 'Inactive'
 'Inactive' 'Inactive' 'Inactive' 'Inactive' 'Inactive' 'I

<br>

What is the average number of steps that the person took each day?

In [17]:
np.mean(step_count)

2935.9375

What is the highest step count that the person has?

In [18]:
np.max(step_count)

## the largest value in the `step_count` array printed above

7422


<br>

Find the dates on which he took the maximum and minimum number of steps.

In [20]:
date[step_count.argmax()]

## 14-Dec-2017

'14-12-2017'

In [21]:
date[step_count.argmin()]

## find the index of minimum `step_count` and use that index to fetch the date

'08-10-2017'

<br>

Find the number of times he was Happy, Sad or Neutral.

In [22]:
## since this is a 1-D array, we can get the count of "Happy" via .shape

mood[mood == "Happy"].shape

## 40 times Happy

(40,)

In [23]:
mood[mood == "Neutral"].shape

## 27 times Neutral

(27,)

In [24]:
mood[mood == "Sad"].shape

## 29 times Sad

(29,)

This can also be done by using **`np.unique()`** :

- `return_counts` arg gives the frequency of each unique occurrence.

In [26]:
np.unique(mood, return_counts=True)

(array(['Happy', 'Neutral', 'Sad'], dtype='<U10'),
 array([40, 27, 29], dtype=int64))
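
The two arrays returned by `np.unique()` line up position by position, so they can be zipped into a dictionary for easier lookup. A minimal sketch on a small made-up `sample_mood` array (not the real dataset):

```python
import numpy as np

# Hypothetical sample, not the real dataset
sample_mood = np.array(["Happy", "Sad", "Neutral", "Happy", "Sad", "Happy"])

values, counts = np.unique(sample_mood, return_counts=True)

# Pair each unique label with its frequency
mood_counts = {str(v): int(c) for v, c in zip(values, counts)}

print(mood_counts)   # {'Happy': 3, 'Neutral': 1, 'Sad': 2}
```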

<br>

Similarly, number of Active and Inactive days can be determined :

In [27]:
np.unique(bool_active, return_counts=True)

## Active days   = 42
## Inactive days = 54

(array(['Active', 'Inactive'], dtype='<U10'), array([42, 54], dtype=int64))

<br>

Let's determine the average number of steps he takes when he is "Happy" and when he is "Sad".

In [28]:
np.mean(step_count[mood == "Happy"])

## ie. on an average, the person walks ~3392 steps on Happy days

3392.725

In [29]:
np.mean(step_count[mood == "Sad"])

2103.0689655172414
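
Rather than repeating the mask for each mood, the same calculation can be looped over every unique label. The sketch below uses small made-up arrays `sample_steps` and `sample_mood`, so the numbers do not correspond to the real dataset.

```python
import numpy as np

# Hypothetical sample data, not the real dataset
sample_steps = np.array([5000, 1000, 3000, 7000, 2000, 6000])
sample_mood = np.array(["Happy", "Sad", "Neutral", "Happy", "Sad", "Happy"])

mean_steps = {}
for label in np.unique(sample_mood):
    # Boolean mask selects only the step counts recorded under this mood
    mean_steps[str(label)] = float(sample_steps[sample_mood == label].mean())

print(mean_steps)   # {'Happy': 6000.0, 'Neutral': 3000.0, 'Sad': 1500.0}
```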

<br>

*In this way, we can derive various insights by examining the relationships between features such as the person's mood, weight and step count.*

<br>
What were his mood patterns on the days when he walked more than 4000 steps?

In [30]:
np.unique(mood[step_count > 4000], return_counts=True)

## 22 days were Happy,
## 9 days were Neutral,
## 7 days he was Sad

(array(['Happy', 'Neutral', 'Sad'], dtype='<U10'),
 array([22,  9,  7], dtype=int64))

In [31]:
np.unique(mood[step_count < 2000], return_counts=True)

(array(['Happy', 'Neutral', 'Sad'], dtype='<U10'),
 array([13,  8, 18], dtype=int64))