### Topics 
- array data types with numpy
- more advanced data manipulation with pandas
- seaborn for more flexible plots 

### Numpy Basics 
Numerical Python (numpy) is a powerful library of functions, methods, and data types.
The main object that numpy provides is the **ndarray**. Arrays differ in a few fundamental ways from regular Python lists:
- Arrays cannot be of mixed types. An array can hold all integers, all floats, all strings, all logical (boolean) values, or other immutable values, but not some characters and some numbers or any other mixture of data types; numpy coerces mixed input to one common type. Outside of the generic 'object' dtype, arrays also cannot contain mutable types such as lists. So we can have a list of lists, but not an array of lists. We can, however, have an array of arrays (sort of), which brings us to:

- Arrays can be multidimensional, but they must be rectangular. You can have a list of lists where the first interior list is 3 elements long, the second 5, and the third 12, but in a multidimensional array every row must have the same number of columns.

- We can perform vector operations on them, which can be algebraic functions (like a dot product), or simple replacements of values in a slice of the array.
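
The three differences above can be seen in a few lines. This is a minimal sketch; the variable names are made up for illustration and the values are arbitrary.

```python
import numpy as np

# Mixed inputs are coerced to a single common dtype rather than kept as-is.
ints_and_floats = np.array([1, 2.5, 3])
print(ints_and_floats.dtype)         # float64

numbers_and_text = np.array([1, "a"])
print(numbers_and_text.dtype.kind)   # 'U': every element became a string

# A rectangular list of lists becomes a 2d array.
grid = np.array([[1, 2, 3], [4, 5, 6]])
print(grid.shape)                    # (2, 3)

# Vector operations act on every element without an explicit loop.
doubled = grid * 2
grid[0, :] = 0                       # replace a whole slice at once
print(doubled)
print(grid)
```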

### Initializing Arrays
Start with a list and turn it into an array with the np.array function.

In [5]:
import numpy as np

a = [0] * 40
print (type(a))

a = np.array(a)
print (type(a))
print (a)

<class 'list'>
<class 'numpy.ndarray'>
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0]


A similar vector of zeros can be created with np.zeros. Note that the values are floats (0.) by default, not integers.

In [8]:
a = np.zeros(40)
print (a)
print(np.ones(5))

[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.]
[ 1.  1.  1.  1.  1.]


For arrays that are not all zeros or ones, np.arange produces evenly spaced values:

In [10]:
print (np.arange(14))

# more parameters could be given to arange
print (np.arange(10, 20))  ## starting and stopping points 

print (np.arange(10, 20, 2))   ## starting, stopping points and increment 
print (np.arange(10, 20, 1.5))

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13]
[10 11 12 13 14 15 16 17 18 19]
[10 12 14 16 18]
[ 10.   11.5  13.   14.5  16.   17.5  19. ]


In [14]:
# from 5 to 6 with 10 items, linear spacing 
print (np.linspace(5, 6, 10))

[ 5.          5.11111111  5.22222222  5.33333333  5.44444444  5.55555556
  5.66666667  5.77777778  5.88888889  6.        ]


In [15]:
# from 10^0 to 10^5, logarithmic spacing 
print (np.logspace(0, 5, 10))

[  1.00000000e+00   3.59381366e+00   1.29154967e+01   4.64158883e+01
   1.66810054e+02   5.99484250e+02   2.15443469e+03   7.74263683e+03
   2.78255940e+04   1.00000000e+05]


Arrays with more than one dimension

In [16]:
# the shape is passed as a tuple, hence the inner set of parentheses: (rows, columns)
np.zeros ((3,4))

array([[ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.]])

In [17]:
# or create one from a python list of lists
a = [ [1,2,3], [3,4,5], [8,9,10] ]
np.array(a)

array([[ 1,  2,  3],
       [ 3,  4,  5],
       [ 8,  9, 10]])

### Recap: creating numpy arrays
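- np.zeros: array of 0.0 with the specified shape
- np.ones: array of 1.0 with the specified shape
- np.array: create a numpy array from a python list
- np.arange: evenly spaced values given a start, a stop (excluded), and a step
- np.linspace: a given number of values within an interval, with linear spacing
- np.logspace: a given number of values between two powers of 10, with logarithmic spacing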
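
One detail worth checking when choosing between np.arange and np.linspace is how each treats the end of the interval. The following is a small sketch with arbitrary values:

```python
import numpy as np

# arange takes a step size and excludes the stop value.
stepped = np.arange(0, 1, 0.25)
print(stepped)   # four values: 0, 0.25, 0.5, 0.75

# linspace takes a number of points and includes the stop value by default.
spaced = np.linspace(0, 1, 5)
print(spaced)    # five values: 0, 0.25, 0.5, 0.75, 1
```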

### Accessing elements 

### 1d arrays

For 1d arrays, indexing works just as it does for lists.


In [30]:
a = np.arange(10)
print (a)
print (a[5])

[0 1 2 3 4 5 6 7 8 9]
5


In [21]:
# slicing conventions also work 
print (a[:3])
print (a[-1])

[0 1 2]
9


In [25]:
# could also index with a list or another numpy array 
a = np.array([2, 4, 5, 8, 10])  # this is array creation from list 

# items wanted with their location in a list 
local = [0, 2, 3]
print (a[local])

# also works if locations are stored in another array
local = np.array(local)
print (a[local])
print (type(a[local]))

[2 5 8]
[2 5 8]
<class 'numpy.ndarray'>


**This (indexing with a list of positions) DOES NOT work with a normal python list.**
A comprehension or a loop is needed instead.

In [28]:
a = [2, 4, 5, 8, 10]
local = [0, 2, 3]

print ([a[i] for i in local])
print (type([a[i] for i in local]))

[2, 5, 8]
<class 'list'>


You can also index into an array with boolean values
- the boolean array must be the same length as the array (otherwise recent NumPy versions raise an IndexError; older versions only issued a warning)
- **doesn't work with regular python lists**

In [29]:
a = np.array([2, 4, 6, 8, 10, 12])
b = np.array([True, True, False, False, False, True])
a[b]

array([ 2,  4, 12])
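
In practice the boolean array is rarely typed out by hand; it usually comes from a comparison on the array itself. A short sketch, reusing the values from the cell above (the names mask, large and middle are illustrative):

```python
import numpy as np

a = np.array([2, 4, 6, 8, 10, 12])

# A comparison produces a boolean array of the same length as a.
mask = a > 5
print(mask)

# Indexing with the mask keeps only the elements where it is True.
large = a[mask]                    # 6, 8, 10, 12

# Conditions combine element-wise with & and |; the parentheses are required.
middle = a[(a > 2) & (a < 10)]     # 4, 6, 8
print(large, middle)
```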

### 2d Array

For 2d arrays, the indexing is [row, column].

In [31]:
a = np.array([[1, 2, 3, 4, 5], [2, 4, 6, 8, 10], [3, 6, 9, 12, 15]])
a

array([[ 1,  2,  3,  4,  5],
       [ 2,  4,  6,  8, 10],
       [ 3,  6,  9, 12, 15]])

In [32]:
a[2, 4]   # single element

15

In [33]:
a[1, :]   # entire row

array([ 2,  4,  6,  8, 10])

In [34]:
a[:, 2]   # entire column

array([3, 6, 9])

**Slicing and boolean indexing work as they did before, but they now apply separately along each axis.**
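
As a sketch of what "along each axis" means, the example below reuses the 3 x 5 array from above; the names block, keep_rows, picked and big_rows are made up for illustration.

```python
import numpy as np

a = np.array([[1, 2, 3, 4, 5], [2, 4, 6, 8, 10], [3, 6, 9, 12, 15]])

# Slice both axes at once: first two rows, columns 1 through 3.
block = a[:2, 1:4]

# Boolean mask on the row axis: keep rows 0 and 2, all columns.
keep_rows = np.array([True, False, True])
picked = a[keep_rows, :]

# A mask built from one column selects whole rows.
big_rows = a[a[:, 0] > 1, :]

print(block)
print(picked)
print(big_rows)
```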


### Numpy array properties 

Numpy arrays have two important properties: shape and size.

In [35]:
a = np.zeros((5, 10))
a

array([[ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]])

In [36]:
a.shape

(5, 10)

In [37]:
a.size 

50

In [38]:
# len only gives the length of the outermost dimension, which is the number of rows
len(a)

5
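
How these values relate to one another can be checked directly. The sketch below also uses ndim (the number of axes), which is not shown in the cells above:

```python
import numpy as np

a = np.zeros((5, 10))

print(a.ndim)    # number of axes: 2
print(a.shape)   # (5, 10)
print(a.size)    # 50, the product of the entries in shape
print(len(a))    # 5, the same as a.shape[0]
```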

Every value in a numpy array must be the same type (unlike normal python lists). This type is stored in the 'dtype' property.


In [40]:
# the element type of an array is stored in dtype
a.dtype

dtype('float64')

In [41]:
# type of a single entry
type(a[0, 0])

numpy.float64

In [44]:
# arange with integer arguments gives an integer dtype by default ('int64' on most platforms)
a = np.arange(10)
print (a.dtype)

# the dtype can also be set explicitly, as follows
b = np.arange(10, dtype = 'float')
print (b.dtype)

int64
float64


In [45]:
# could also cast from one to the other 
a = np.arange(10).astype('float')
a.dtype

dtype('float64')
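
Casting in the other direction, from float to integer, discards information, which is easy to overlook. A minimal sketch with arbitrary values:

```python
import numpy as np

values = np.array([1.7, -1.7, 2.2])

# Casting float to int drops the fractional part (truncates toward zero);
# it does not round.
as_int = values.astype(int)        # 1, -1, 2

# astype returns a new array; the original keeps its dtype.
print(values.dtype)

# True division of an integer array gives floats; floor division keeps integers.
halves = np.arange(4) / 2
floored = np.arange(4) // 2
print(as_int, halves, floored)
```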

### Operating on numpy arrays

In [52]:
a = np.zeros((5, 5))

In [53]:
# update the current element values in the multidimensional array
a[1, 3] = 7
a

array([[ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  7.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.]])

In [54]:
a[1, 3] *= 2
a 

array([[  0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,  14.,   0.],
       [  0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.]])

**The real power of arrays is the ability to have one statement affect a large chunk of an array.**

In [56]:
# assign 1 to the entire row 1
a[1, :] = 1
a 

array([[ 0.,  0.,  0.,  0.,  0.],
       [ 1.,  1.,  1.,  1.,  1.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.]])

In [58]:
# assign 7 to the entire column at index 3 (the fourth column)
a[:, 3] = 7
a

array([[ 0.,  0.,  0.,  7.,  0.],
       [ 1.,  1.,  1.,  7.,  1.],
       [ 0.,  0.,  0.,  7.,  0.],
       [ 0.,  0.,  0.,  7.,  0.],
       [ 0.,  0.,  0.,  7.,  0.]])

In [59]:
# Assign -1 everywhere array is zero (this works because (a == 0) produces a 2d boolean numpy array)
a[a == 0] = -1
a

array([[-1., -1., -1.,  7., -1.],
       [ 1.,  1.,  1.,  7.,  1.],
       [-1., -1., -1.,  7., -1.],
       [-1., -1., -1.,  7., -1.],
       [-1., -1., -1.,  7., -1.]])
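
The masked assignment above changes the array in place. When the original should be kept, np.where offers a non-mutating alternative. A small sketch on a 3 x 3 array (smaller than the one above, to keep the output short):

```python
import numpy as np

a = np.zeros((3, 3))
a[:, 1] = 7

# np.where(condition, value_if_true, value_if_false) builds a new array.
replaced = np.where(a == 0, -1, a)

print(replaced)
print(a)   # unchanged: still contains zeros
```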

If we try to do the above using lists, we need explicit loops:

In [66]:
# create list of lists with all zeros 
ll = [ [0] * 5 for i in range(5)]
print (ll)

[[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]]


In [67]:
# set entries in row 1 to 1
# enumerate(iterable[, start]) yields (index, value) pairs, counting from start (default 0):
#     (0, seq[0]), (1, seq[1]), (2, seq[2]), ...
for i, elem in enumerate(ll[1]):
    ll[1][i] = 1

ll

[[0, 0, 0, 0, 0],
 [1, 1, 1, 1, 1],
 [0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0]]

In [79]:
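# ll[1, :] does not work on a list of lists (TypeError: list indices must be integers or slices, not tuple)
# ll[:1] is a slice holding only the first row; ll[1] is row 1 itself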
print (ll[:1])

print (ll[1])

[[0, 0, 0, 0, 0]]
[1, 1, 1, 1, 1]


In [82]:
for i, elem in enumerate(ll[1]):
    print(i, elem)

0 1
1 1
2 1
3 1
4 1


In [63]:
# set entries in the column at index 4 (the last column) to 7
for l in ll:
    l[4] = 7
ll

[[0, 0, 0, 0, 7],
 [1, 1, 1, 1, 7],
 [0, 0, 0, 0, 7],
 [0, 0, 0, 0, 7],
 [0, 0, 0, 0, 7]]

### Vector math with arrays 

Arrays let us do math on many values at once without writing a for loop.

In [83]:
a = np.arange(0, 100, 2)
b = np.arange(50)

print (a)
print (b)

[ 0  2  4  6  8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48
 50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80 82 84 86 88 90 92 94 96 98]
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49]


In [84]:
b/ 2

array([  0. ,   0.5,   1. ,   1.5,   2. ,   2.5,   3. ,   3.5,   4. ,
         4.5,   5. ,   5.5,   6. ,   6.5,   7. ,   7.5,   8. ,   8.5,
         9. ,   9.5,  10. ,  10.5,  11. ,  11.5,  12. ,  12.5,  13. ,
        13.5,  14. ,  14.5,  15. ,  15.5,  16. ,  16.5,  17. ,  17.5,
        18. ,  18.5,  19. ,  19.5,  20. ,  20.5,  21. ,  21.5,  22. ,
        22.5,  23. ,  23.5,  24. ,  24.5])

In [86]:
# pairwise multiplication 
print(a * b) 

# dot product 
np.dot(a, b)

[   0    2    8   18   32   50   72   98  128  162  200  242  288  338  392
  450  512  578  648  722  800  882  968 1058 1152 1250 1352 1458 1568 1682
 1800 1922 2048 2178 2312 2450 2592 2738 2888 3042 3200 3362 3528 3698 3872
 4050 4232 4418 4608 4802]


80850

In [87]:
print ((a * b).sum())

print (np.sum(a * b))

80850
80850
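
The equivalence between the dot product and the sum of pairwise products can be verified against an explicit loop. The sketch below also uses the @ operator, which for 1d arrays gives the same result as np.dot:

```python
import numpy as np

a = np.arange(0, 100, 2)
b = np.arange(50)

by_dot = np.dot(a, b)
by_operator = a @ b          # @ is the matrix-multiplication operator
# Explicit pure-Python loop over the pairs, for comparison.
by_loop = sum(x * y for x, y in zip(a.tolist(), b.tolist()))

print(by_dot, by_operator, by_loop)   # all three are 80850
```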


In [88]:
# other math functions also apply to every element
a = np.linspace(-np.pi, np.pi, 10)
np.sin(a)

array([ -1.22464680e-16,  -6.42787610e-01,  -9.84807753e-01,
        -8.66025404e-01,  -3.42020143e-01,   3.42020143e-01,
         8.66025404e-01,   9.84807753e-01,   6.42787610e-01,
         1.22464680e-16])

Also:
- np.exp : exponential function
- np.log : logarithm
- np.abs : absolute value
- ... and many others
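
A brief sketch of the functions listed above, applied element-wise to an arbitrary array:

```python
import numpy as np

x = np.array([-2.0, 0.5, 1.0, 3.0])

magnitudes = np.abs(x)            # 2, 0.5, 1, 3
growth = np.exp(x)                # e raised to each element
round_trip = np.log(np.exp(x))    # the natural log undoes exp, up to rounding

print(magnitudes)
print(growth)
print(round_trip)
```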

### Useful numpy functions
NumPy is huge, with around 1200 pages of reference documentation. The following are some basic statistical functions:

### Random distributions 
- uniform distribution
- normal distribution 

In [89]:
a = np.random.uniform(0, 100, 10)   # lower, upper bound and size of output
a

array([ 92.47768987,  55.66739508,  81.39816102,  41.21994641,
        94.32882325,  35.04759045,  91.5061206 ,  69.43786929,
        55.28103639,  14.65728916])

In [90]:
# the third argument can also be a shape
a = np.random.uniform(0, 100, (3, 3))
a

array([[ 30.53575893,  14.9305635 ,  62.77591512],
       [ 92.42281326,  17.23172521,  93.17119396],
       [ 67.1859455 ,   0.22656211,   8.54016835]])

In [91]:
# normal distribution with mean = 0, std = 1 and sample size= 10
a = np.random.normal(0, 1, 10)
a


array([-0.99089343,  0.49238631, -0.2529982 , -0.12834274, -1.0440051 ,
       -0.35750895,  0.51206632, -0.40534158,  0.10268209,  0.99130447])

### Summary statistics 

In [92]:
# draw 1000 samples from a normal distribution with mean 5 and std 3
a = np.random.normal(5, 3, 1000)

print (np.mean(a))
print (np.std(a))
print (np.min(a))
print (np.max(a))

5.2111643091
3.08882647628
-4.34183503261
14.2378638305
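
Because the samples are random, the numbers above change on every run. If repeatable results are needed, one option is to seed the generator first. A sketch, where the seed value 0 is an arbitrary choice made for this example:

```python
import numpy as np

np.random.seed(0)                      # 0 is an arbitrary seed for this sketch
first = np.random.normal(5, 3, 1000)

np.random.seed(0)                      # reseeding restarts the same sequence
second = np.random.normal(5, 3, 1000)

print(np.array_equal(first, second))   # True
print(np.mean(first), np.std(first))   # close to 5 and 3, but not exactly
```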


### Operating on 2d arrays 
One of the areas where numpy really shines is its ability to quickly operate along an axis of a 2d array.

In [93]:
# 2d array with 5 rows, 3 columns
a = np.ones ((5, 3)) 
a

array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.]])

In [94]:
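# python's built-in sum adds the rows together, giving one total per column
# (use np.sum(a) or a.sum() to sum over all elements)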
sum(a)

array([ 5.,  5.,  5.])

In [96]:
# sum down the rows (axis = 0): one total per column
np.sum(a, axis = 0)

# which is the same as
a.sum(axis = 0)

array([ 5.,  5.,  5.])

In [97]:
# sum across the columns (axis = 1): one total per row
np.sum(a, axis = 1)

array([ 3.,  3.,  3.,  3.,  3.])
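
With an array of ones it is hard to tell the axes apart, so here is a sketch on a small array whose entries all differ (built with reshape, which is not used in the cells above):

```python
import numpy as np

a = np.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]

total = a.sum()                  # 15: every element
per_column = a.sum(axis=0)       # [3, 5, 7]: collapses the rows
per_row = a.sum(axis=1)          # [3, 12]: collapses the columns
builtin = sum(a)                 # [3, 5, 7]: python's sum adds the rows together

print(total, per_column, per_row, builtin)
```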

### Boolean numpy arrays for selection and filtering 

In [98]:
a = np.zeros(10, dtype = bool)
a

array([False, False, False, False, False, False, False, False, False, False], dtype=bool)

In [99]:
# slicing and mass assignment still work for array with BOOLEAN type 
a[2: 5] = True 
a

array([False, False,  True,  True,  True, False, False, False, False, False], dtype=bool)

In [100]:
# the character ~ inverts the boolean array
b = ~a
b

array([ True,  True, False, False, False,  True,  True,  True,  True,  True], dtype=bool)

In [101]:
# "&" and "|" combine boolean arrays element-wise
a = np.array([True, False, True])
b = np.array([False, False, True])

print (" A and B is {}". format (a & b))
print (" A or B is {}". format (a | b))

 A and B is [False False  True]
 A or B is [ True False  True]


In [102]:
# boolean array and data accessing example

# np.random.randn returns a sample (or samples) from the “standard normal” distribution.
data = np.random.randn(10)
data 

array([-0.32709702, -0.72586373, -0.06293976,  0.9014897 , -0.38309493,
       -0.22420609, -0.87452063,  0.73498385,  0.47888074,  0.29893285])

In [104]:
data = data[data < 0]
data 

array([-0.32709702, -0.72586373, -0.06293976, -0.38309493, -0.22420609,
       -0.87452063])

In [124]:
# np.random.rand creates an array of the given shape and populates it with random samples from a uniform distribution over [0, 1).
data = np.random.rand(4, 5) * 10
print (type (data))
data 


<class 'numpy.ndarray'>


array([[  7.72212061e+00,   3.32878843e+00,   4.26967922e+00,
          4.24869577e+00,   7.08036316e+00],
       [  6.80014456e+00,   2.58262997e+00,   9.88908151e+00,
          4.35020049e+00,   6.93051621e+00],
       [  7.00247033e+00,   9.05154777e+00,   5.79075262e+00,
          8.11815066e+00,   4.50701992e-02],
       [  4.54546024e+00,   9.14537478e+00,   8.86582688e+00,
          6.88032997e+00,   3.59784621e-04]])

In [113]:
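# mean of each row (axis = 1 averages across the columns)
# note: the output below has 5 values, so it apparently comes from an earlier run on a 5-row array, not the 4 x 5 data shown above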
np.mean(data, axis = 1)

array([ 3.45422862,  4.6599894 ,  6.24009515,  6.59614927,  4.3316211 ])

In [126]:
# Subset of the data matrix containing only rows with a mean > 5.
# To also require the second column < 4, combine the masks: (np.mean(data, axis = 1) > 5) & (data[:, 1] < 4)

data[np.mean(data, axis = 1) > 5, :]


array([[  7.72212061e+00,   3.32878843e+00,   4.26967922e+00,
          4.24869577e+00,   7.08036316e+00],
       [  6.80014456e+00,   2.58262997e+00,   9.88908151e+00,
          4.35020049e+00,   6.93051621e+00],
       [  7.00247033e+00,   9.05154777e+00,   5.79075262e+00,
          8.11815066e+00,   4.50701992e-02],
       [  4.54546024e+00,   9.14537478e+00,   8.86582688e+00,
          6.88032997e+00,   3.59784621e-04]])
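
Filtering on two conditions at once (row mean > 5 and second column < 4) can be sketched as follows. The data here is a small fixed array made up for this example so the result is easy to check by hand; it is not the random array above.

```python
import numpy as np

# Small fixed array (invented for this sketch).
data = np.array([[7.0, 3.0, 8.0],
                 [6.0, 9.0, 6.0],
                 [1.0, 2.0, 3.0]])

high_mean = np.mean(data, axis=1) > 5    # [True, True, False]
small_second = data[:, 1] < 4            # [True, False, True]

# & keeps only the rows where both masks are True.
subset = data[high_mean & small_second, :]
print(subset)                            # only the first row qualifies
```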

### Why Numpy?
- Avoid writing loops (don't re-invent the wheel)
- Most other python data libraries work on top of numpy
- Efficient computation: operations using numpy are often many times faster than their pure Python implementations. This is because numpy processes arrays using compiled code written in 'low-level' languages like C or Fortran. These languages are much more tedious to write programs in, but run much faster than a 'high-level' language like Python. By using Python to call functions that other people have written in low-level languages, you get the best of both worlds.


### Quick performance comparison 

In [130]:
# create a 1000 * 1000 list of lists (2d matrix)
n_row = 1000
n_col = 1000

# one list comprehension builds the rows; wrapping it in an extra pair of
# brackets adds a third level of nesting and makes the += below fail with
# "TypeError: 'int' object is not iterable"
python_matrix = [[2] * n_col for i in range(n_row)]

In [131]:
%%timeit
# add 1 to every entry in the matrix
for i in range(n_row):
    for j in range(n_col):
        python_matrix[i][j] += 1
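
For a side-by-side comparison, the sketch below times the nested loop against the equivalent numpy statement. It uses time.perf_counter instead of %%timeit so that it also runs outside a notebook, and np.full to build the array; both are choices made for this example. The timings depend on the machine, so no numbers are quoted here.

```python
import time
import numpy as np

n_row, n_col = 1000, 1000

# The same 1000 x 1000 matrix of 2s, once as a list of lists and once as an array.
python_matrix = [[2] * n_col for _ in range(n_row)]
numpy_matrix = np.full((n_row, n_col), 2)

# Pure Python: visit every entry with a nested loop.
start = time.perf_counter()
for i in range(n_row):
    for j in range(n_col):
        python_matrix[i][j] += 1
loop_seconds = time.perf_counter() - start

# numpy: one statement updates every entry.
start = time.perf_counter()
numpy_matrix += 1
numpy_seconds = time.perf_counter() - start

print(f"nested loop: {loop_seconds:.4f} s")
print(f"numpy:       {numpy_seconds:.4f} s")
```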