## Introduction to Data Science

### Introduction to Numpy

In [1]:
#import pylab
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import os
import pathlib


%matplotlib inline
#%matplotlib notebook

## [Numpy Basics](https://www.datacamp.com/community/tutorials/python-numpy-tutorial)

What Is A Python Numpy Array?  

NumPy arrays are a bit like Python lists, but they differ in important ways. For those of you who are new to the topic, let’s clarify what a NumPy array is and what it’s good for.  

As the name suggests, the NumPy array is the central data structure of the numpy library. The library’s name is short for “Numeric Python” or “Numerical Python”.  

In other words, NumPy is the core Python library for scientific computing. It contains a collection of tools and techniques for solving, on a computer, mathematical models of problems in science and engineering. One of these tools is a high-performance multidimensional array object, a powerful data structure for efficient computation on arrays and matrices. To work with these arrays, NumPy provides a large number of high-level mathematical functions that operate on them.  

Then, what is an array?  

When you print a couple of arrays, you can see each one as a grid that contains values of the same type. The array holds and represents any regular data in a structured way.  

However, you should know that, on a structural level, an array is basically nothing but pointers. It’s a combination of a memory address, a data type, a shape and strides:  

- The data pointer indicates the memory address of the first byte in the array.
- The data type, or dtype, pointer describes the kind of elements that are contained within the array.
- The shape indicates the size of the array along each dimension.
- The strides are the number of bytes that should be skipped in memory to go to the next element along each dimension.

If your strides are (10,1), you need to proceed one byte to get to the next column and 10 bytes to locate the next row.  
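
The four components can be inspected on a real array, as in the minimal sketch below. The array `grid` and its `int8` dtype are illustrative choices that are not part of the original notebook; they are picked so that each element occupies exactly one byte and the strides come out as (10, 1).

```python
import numpy as np

# Hypothetical example array: 3 rows x 10 columns of 1-byte integers.
grid = np.zeros((3, 10), dtype=np.int8)

print(grid.dtype)     # element type
print(grid.shape)     # size along each dimension
print(grid.strides)   # bytes to step per dimension: (10, 1)

# The same shape with 8-byte elements scales the strides by the item size.
grid64 = grid.astype(np.int64)
print(grid64.itemsize, grid64.strides)
```

The strides depend on both the shape and the element size, which is why the earlier `int64` array of shape (4,) reports strides of (8,).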

### Creating Arrays:

In [2]:
my_numbers = [1,2,3,4]
simple_array = np.array(my_numbers)
print(simple_array)

[1 2 3 4]


In [3]:
type(simple_array)

numpy.ndarray

In [4]:
simple_array + 34

array([35, 36, 37, 38])

In [5]:
simple_array.shape

(4,)

In [6]:
simple_array.dtype

dtype('int64')

In [7]:
simple_array.data

<memory at 0x7f1fdd4d1588>

In [8]:
simple_array.strides

(8,)

In [9]:
my_other_numbers = [[1,2,3],[4,5,6],[7,8,9]]
other_simple_array = np.array(my_other_numbers)
other_simple_array

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [10]:
other_simple_array.shape

(3, 3)

In [11]:
#a = np.arange(20)
a = np.arange(1,5,0.2)
print(a)

[1.  1.2 1.4 1.6 1.8 2.  2.2 2.4 2.6 2.8 3.  3.2 3.4 3.6 3.8 4.  4.2 4.4
 4.6 4.8]


In [12]:
b = np.linspace(1,10,30)
#b = np.linspace(1,2*np.pi,50)
print(b)

[ 1.          1.31034483  1.62068966  1.93103448  2.24137931  2.55172414
  2.86206897  3.17241379  3.48275862  3.79310345  4.10344828  4.4137931
  4.72413793  5.03448276  5.34482759  5.65517241  5.96551724  6.27586207
  6.5862069   6.89655172  7.20689655  7.51724138  7.82758621  8.13793103
  8.44827586  8.75862069  9.06896552  9.37931034  9.68965517 10.        ]


In [13]:
b2 = np.logspace(1,100,30)
print(b2)

[1.00000000e+001 2.59294380e+004 6.72335754e+007 1.74332882e+011
 4.52035366e+014 1.17210230e+018 3.03919538e+021 7.88046282e+024
 2.04335972e+028 5.29831691e+031 1.37382380e+035 3.56224789e+038
 9.23670857e+041 2.39502662e+045 6.21016942e+048 1.61026203e+052
 4.17531894e+055 1.08263673e+059 2.80721620e+062 7.27895384e+065
 1.88739182e+069 4.89390092e+072 1.26896100e+076 3.29034456e+079
 8.53167852e+082 2.21221629e+086 5.73615251e+089 1.48735211e+093
 3.85662042e+096 1.00000000e+100]


In [14]:
a1 = np.zeros((3,4))
print(a1)

[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]


In [15]:
a2 = np.ones((2,2))
print(a2)

[[1. 1.]
 [1. 1.]]


In [16]:
a3 = np.empty((2,3))
print(a3)

[[2.09097435e-316 0.00000000e+000 0.00000000e+000]
 [0.00000000e+000 0.00000000e+000 0.00000000e+000]]


In [17]:
a4 = np.identity(3)
print(a4)

[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]


In [18]:
a5 = np.eye(2)        
print(a5)

[[1. 0.]
 [0. 1.]]


In [19]:
a6 = np.full((4,2), 7)
print(a6)

[[7 7]
 [7 7]
 [7 7]
 [7 7]]


In [20]:
a7 = np.random.random((2,2))  
print(a7)

[[0.59991968 0.33402303]
 [0.74318138 0.43792201]]


#### Modifying Dimensions:

In [21]:
c = np.arange(100)
print(c)

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
 96 97 98 99]


In [22]:
print(c.shape)
print(np.ndim(c))
print(c.dtype.name)

(100,)
1
int64


In [23]:
d = c.reshape(4,5,5)
print(d)

[[[ 0  1  2  3  4]
  [ 5  6  7  8  9]
  [10 11 12 13 14]
  [15 16 17 18 19]
  [20 21 22 23 24]]

 [[25 26 27 28 29]
  [30 31 32 33 34]
  [35 36 37 38 39]
  [40 41 42 43 44]
  [45 46 47 48 49]]

 [[50 51 52 53 54]
  [55 56 57 58 59]
  [60 61 62 63 64]
  [65 66 67 68 69]
  [70 71 72 73 74]]

 [[75 76 77 78 79]
  [80 81 82 83 84]
  [85 86 87 88 89]
  [90 91 92 93 94]
  [95 96 97 98 99]]]


In [24]:
print(d.shape)
print(np.ndim(d))
print(d.dtype.name)

(4, 5, 5)
3
int64


In [25]:
d = np.random.random(100).reshape(4,25)
print(d)

[[0.5999244  0.8814462  0.33183581 0.42446993 0.00420474 0.96995923
  0.23528338 0.44374605 0.20149377 0.07050976 0.75996641 0.54290391
  0.15154222 0.56959726 0.44789564 0.2489747  0.30437618 0.21482265
  0.12637288 0.55179516 0.96735    0.94634489 0.81230964 0.93998996
  0.68265093]
 [0.47098122 0.27058527 0.96417137 0.09864398 0.69695584 0.06026422
  0.77356504 0.2700961  0.52474836 0.46524722 0.08879987 0.54181731
  0.84480711 0.38909336 0.41670707 0.199213   0.9652786  0.45556662
  0.87430596 0.09861877 0.35039281 0.91083742 0.9015205  0.2264822
  0.64301588]
 [0.86232495 0.26203706 0.15288921 0.20090174 0.03103731 0.47857353
  0.97723429 0.32800019 0.61169331 0.23480186 0.76360912 0.56228763
  0.62238145 0.70501046 0.03119854 0.74027337 0.37478862 0.1797827
  0.29016696 0.9411003  0.13422322 0.28044806 0.51306431 0.20398006
  0.73233439]
 [0.96300305 0.62774931 0.1044551  0.77719199 0.14446713 0.77215985
  0.19209024 0.91242255 0.50654187 0.37379202 0.5780322  0.04384369
  0.0883

In [26]:
print(d.shape)
print(np.ndim(d))
print(d.dtype.name)

(4, 25)
2
float64


In [27]:
x = np.arange(12).reshape((3,4))
print(x)

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]


In [28]:
print(x.shape)
print(np.ndim(x))
print(x.dtype.name)

(3, 4)
2
int64


In [29]:
x.ravel()

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

In [30]:
print(x.ravel().shape)
print(np.ndim(x.ravel()))
print(x.ravel().dtype.name)

(12,)
1
int64


In [31]:
# Resize `x` to (7,5); np.resize repeats the elements of `x` to fill the new shape
y = np.resize(x, (7,5))
print(x, '\n\n\n', y)

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]] 


 [[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11  0  1  2]
 [ 3  4  5  6  7]
 [ 8  9 10 11  0]
 [ 1  2  3  4  5]
 [ 6  7  8  9 10]]
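
The output above has more elements than `x`, which may be surprising. The sketch below, offered as an illustration rather than part of the original notebook, contrasts `np.resize` (which cycles through the source elements to fill the new shape) with `reshape` (which must keep the element count unchanged).

```python
import numpy as np

x = np.arange(12).reshape((3, 4))

# np.resize returns a new array and cycles through x to fill the extra slots.
y = np.resize(x, (7, 5))
print(y.shape)             # (7, 5)
print(y.ravel()[12:16])    # wraps around to the start of x

# reshape must preserve the element count, so 12 -> 35 elements is an error.
try:
    x.reshape(7, 5)
except ValueError as err:
    print("reshape failed:", err)
```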


### Slicing multidimensional arrays

In [32]:
d = np.arange(40).reshape(4,2,5)
print(d)

[[[ 0  1  2  3  4]
  [ 5  6  7  8  9]]

 [[10 11 12 13 14]
  [15 16 17 18 19]]

 [[20 21 22 23 24]
  [25 26 27 28 29]]

 [[30 31 32 33 34]
  [35 36 37 38 39]]]


In [33]:
d[1,:,3:]

array([[13, 14],
       [18, 19]])

In [34]:
d.shape

(4, 2, 5)

In [35]:
d[d%2==0]

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32,
       34, 36, 38])

In [36]:
d[~(d%2==0)]  # negation of the condition; parentheses are needed because ~ binds tighter than % and ==

array([ 1,  3,  5,  7,  9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33,
       35, 37, 39])

In [37]:
# Create a new array from which we will select elements
a = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])
print(a)

[[ 1  2  3]
 [ 4  5  6]
 [ 7  8  9]
 [10 11 12]]


In [38]:
# Create an array of indices
b = np.array([0, 2, 0, 1])
c = np.arange(4)

print(b, '\n\n', c)

[0 2 0 1] 

 [0 1 2 3]


In [39]:
# Select one element from each row of a using the indices in b
print(a[c, b])

[ 1  6  7 11]


In [40]:
# Mutate one element from each row of a using the indices in b
a[c, b] += 10
print(a)

[[11  2  3]
 [ 4  5 16]
 [17  8  9]
 [10 21 12]]


In [41]:
d

array([[[ 0,  1,  2,  3,  4],
        [ 5,  6,  7,  8,  9]],

       [[10, 11, 12, 13, 14],
        [15, 16, 17, 18, 19]],

       [[20, 21, 22, 23, 24],
        [25, 26, 27, 28, 29]],

       [[30, 31, 32, 33, 34],
        [35, 36, 37, 38, 39]]])

In [42]:
# Boolean array indexing
bool_idx = (d > 13)
print(bool_idx)

[[[False False False False False]
  [False False False False False]]

 [[False False False False  True]
  [ True  True  True  True  True]]

 [[ True  True  True  True  True]
  [ True  True  True  True  True]]

 [[ True  True  True  True  True]
  [ True  True  True  True  True]]]


In [43]:
print(d[bool_idx])

[14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37
 38 39]


### Datatypes

In [44]:
x = np.array([1, 2])   # Let numpy choose the datatype
print(x.dtype)         # Prints "int64"

int64


In [45]:
x = np.array([1.0, 2.0])   # Let numpy choose the datatype
print(x.dtype)             # Prints "float64"

float64


In [46]:
x = np.array([1, 2], dtype=np.float64)   # Force a particular datatype
print(x.dtype)                         # Prints "float64"

float64


### Array Math

### Inline and vectorized operations:

In [47]:
a

array([[11,  2,  3],
       [ 4,  5, 16],
       [17,  8,  9],
       [10, 21, 12]])

In [48]:
a * 2

array([[22,  4,  6],
       [ 8, 10, 32],
       [34, 16, 18],
       [20, 42, 24]])

In [49]:
# the original array stays the same
a

array([[11,  2,  3],
       [ 4,  5, 16],
       [17,  8,  9],
       [10, 21, 12]])

In [50]:
a.cumsum()

array([ 11,  13,  16,  20,  25,  41,  58,  66,  75,  85, 106, 118])

In [51]:
a = np.arange(16).reshape(4,4)
np.vstack([a,np.arange(4).reshape(1,4)])

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [ 0,  1,  2,  3]])

In [52]:
np.hstack([a,np.arange(4).reshape(4,1)])

array([[ 0,  1,  2,  3,  0],
       [ 4,  5,  6,  7,  1],
       [ 8,  9, 10, 11,  2],
       [12, 13, 14, 15,  3]])

In [53]:
x = np.array([[1,2],[3,4]], dtype=np.float64)
y = np.array([[5,6],[7,8]], dtype=np.float64)

# Elementwise sum; both produce the array
# [[ 6.0  8.0]
#  [10.0 12.0]]
print(x + y)
print(np.add(x, y))

[[ 6.  8.]
 [10. 12.]]
[[ 6.  8.]
 [10. 12.]]


In [54]:
# Elementwise difference; both produce the array
# [[-4.0 -4.0]
#  [-4.0 -4.0]]
print(x - y)
print(np.subtract(x, y))

[[-4. -4.]
 [-4. -4.]]
[[-4. -4.]
 [-4. -4.]]


In [55]:
# Elementwise product; both produce the array
# [[ 5.0 12.0]
#  [21.0 32.0]]
print(x * y)
print(np.multiply(x, y))

[[ 5. 12.]
 [21. 32.]]
[[ 5. 12.]
 [21. 32.]]


In [56]:
# Elementwise division; both produce the array
# [[ 0.2         0.33333333]
#  [ 0.42857143  0.5       ]]
print(x / y)
print(np.divide(x, y))

[[0.2        0.33333333]
 [0.42857143 0.5       ]]
[[0.2        0.33333333]
 [0.42857143 0.5       ]]


In [57]:
# Elementwise square root; produces the array
# [[ 1.          1.41421356]
#  [ 1.73205081  2.        ]]
print(np.sqrt(x))

[[1.         1.41421356]
 [1.73205081 2.        ]]


In [58]:
x = np.array([[1,2],[3,4]])
y = np.array([[5,6],[7,8]])
v = np.array([9,10])
w = np.array([11, 12])

In [59]:
print(x)
print()
print(y)
print()
print(v)
print()
print(w)

[[1 2]
 [3 4]]

[[5 6]
 [7 8]]

[ 9 10]

[11 12]


In [60]:
# Inner product of vectors; both produce 219
print(v.dot(w), '\n')
print(np.dot(v, w))

219 

219


In [61]:
# Matrix / vector product; both produce the rank 1 array [29 67]
print(x.dot(v), '\n')
print(np.dot(x, v))

[29 67] 

[29 67]


In [62]:
# Matrix / matrix product; both produce the rank 2 array
print(x.dot(y), '\n')
print(np.dot(x, y))

[[19 22]
 [43 50]] 

[[19 22]
 [43 50]]


In [63]:
x = np.array([[1,2],[3,4]])
print(x, '\n')
print(np.sum(x), '\n')  # Compute sum of all elements; prints "10"
print(np.sum(x, axis=0), '\n')  # Compute sum of each column; prints "[4 6]"
print(np.sum(x, axis=1))  # Compute sum of each row; prints "[3 7]"

[[1 2]
 [3 4]] 

10 

[4 6] 

[3 7]


In [64]:
x = np.array([[1,2], [3,4]])

In [65]:
print(x, '\n')
print(x.T)

[[1 2]
 [3 4]] 

[[1 3]
 [2 4]]


In [66]:
# Note that taking the transpose of a rank 1 array does nothing:
v = np.array([1,2,3])

In [67]:
print(v, '\n')
print(v.T)

[1 2 3] 

[1 2 3]
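
If a column vector is actually needed, one common approach is to give the 1-D array a second axis first. The sketch below assumes that is the goal; the name `col` is illustrative and does not appear in the original notebook.

```python
import numpy as np

v = np.array([1, 2, 3])

# Add a second axis to turn the 1-D array into a 3x1 column.
col = v[:, np.newaxis]      # equivalent to v.reshape(-1, 1)
print(col.shape)            # (3, 1)
print(col.T.shape)          # (1, 3): transposing now changes the shape
```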


In [68]:
# We will add the vector v to each row of the matrix x,
# storing the result in the matrix y
x = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])
v = np.array([1, 0, 1])
y = np.empty_like(x)   # Create an empty matrix with the same shape as x

In [69]:
print(x, '\n')
print(v, '\n')
print(y)

[[ 1  2  3]
 [ 4  5  6]
 [ 7  8  9]
 [10 11 12]] 

[1 0 1] 

[[38118896        0        0]
 [       0        0        0]
 [       0        0        0]
 [       0        0        0]]


In [70]:
for i in range(4):
    y[i, :] = x[i, :] + v
print(y)

[[ 2  2  4]
 [ 5  5  7]
 [ 8  8 10]
 [11 11 13]]


This works; however, when the matrix x is very large, running an explicit loop in Python can be slow. Note that adding the vector v to each row of the matrix x is equivalent to forming a matrix vv by stacking multiple copies of v vertically, then performing elementwise summation of x and vv. We could implement this approach like this:

In [71]:
# We will add the vector v to each row of the matrix x,
# storing the result in the matrix y
x = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])
v = np.array([1, 0, 1])
vv = np.tile(v, (4, 1))   # Stack 4 copies of v on top of each other
print(vv)

[[1 0 1]
 [1 0 1]
 [1 0 1]
 [1 0 1]]


In [72]:
y = x + vv  # Add x and vv elementwise
print(y)

[[ 2  2  4]
 [ 5  5  7]
 [ 8  8 10]
 [11 11 13]]


Numpy broadcasting allows us to perform this computation without actually creating multiple copies of v. Consider this version, using broadcasting:

In [73]:
x = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])
v = np.array([1, 0, 1])
y = x + v  # Add v to each row of x using broadcasting
print(y)

[[ 2  2  4]
 [ 5  5  7]
 [ 8  8 10]
 [11 11 13]]


The line y = x + v works even though x has shape (4, 3) and v has shape (3,) due to broadcasting; this line works as if v actually had shape (4, 3), where each row was a copy of v, and the sum was performed elementwise.
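
One way to see that no copies are made is `np.broadcast_to`, which returns a read-only view of `v` with the broadcast shape. This is a sketch for illustration only; the name `v_view` is not from the original notebook.

```python
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
v = np.array([1, 0, 1])

# A read-only view of v with shape (4, 3); no data is copied.
v_view = np.broadcast_to(v, x.shape)
print(v_view.shape)
print(v_view.strides)   # first stride is 0: every row points at the same bytes

# Same result as the tiled version and as plain broadcasting.
assert np.array_equal(x + v_view, x + np.tile(v, (4, 1)))
assert np.array_equal(x + v_view, x + v)
```

A stride of 0 along the first axis means that moving to the next row does not advance in memory, so the four rows share the three stored values of `v`.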

In [74]:
# Initialize `x` and `y`
x = np.ones((3,4))
y = np.random.random((5,1,4))

# Add `x` and `y`
print(x + y)

[[[1.74055131 1.07220087 1.11817047 1.77738885]
  [1.74055131 1.07220087 1.11817047 1.77738885]
  [1.74055131 1.07220087 1.11817047 1.77738885]]

 [[1.44328333 1.88690766 1.51450607 1.24817231]
  [1.44328333 1.88690766 1.51450607 1.24817231]
  [1.44328333 1.88690766 1.51450607 1.24817231]]

 [[1.68292416 1.10869722 1.46317615 1.92572402]
  [1.68292416 1.10869722 1.46317615 1.92572402]
  [1.68292416 1.10869722 1.46317615 1.92572402]]

 [[1.14557556 1.72052571 1.18634334 1.04222257]
  [1.14557556 1.72052571 1.18634334 1.04222257]
  [1.14557556 1.72052571 1.18634334 1.04222257]]

 [[1.40906291 1.62320429 1.13549471 1.66081139]
  [1.40906291 1.62320429 1.13549471 1.66081139]
  [1.40906291 1.62320429 1.13549471 1.66081139]]]


You see that, even though x and y seem to have somewhat different dimensions, the two can be added together.  
That is because they are compatible in all dimensions:

    Array x has dimensions 3 X 4,
    Array y has dimensions 5 X 1 X 4

Two dimensions are compatible when they are equal or when one of them is 1. Shapes are compared starting from the trailing dimension, so x is treated as if it had dimensions 1 X 3 X 4, and these two arrays are indeed good candidates for broadcasting!  

What you will notice is that in the dimension where y has size 1 and x has a size greater than 1 (that is, 3), y behaves as if it were copied along that dimension. Likewise, x behaves as if it were copied 5 times along the leading dimension that it lacks.  

Note that the shape of the resulting array is the maximum size along each dimension of x and y: the result has shape (5,3,4).  

In short, if you want to make use of broadcasting, you need to pay close attention to the shape and dimensions of the arrays with which you’re working.  
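
The shape check can be done without computing the sum, as in the sketch below. The array `z` is a hypothetical counterexample added for illustration; it is not part of the original notebook.

```python
import numpy as np

x = np.ones((3, 4))
y = np.random.random((5, 1, 4))

# Shapes are aligned from the right: (3, 4) is treated as (1, 3, 4).
print(np.broadcast(x, y).shape)   # (5, 3, 4)

# A trailing dimension of 3 vs 4 is neither equal nor 1, so this fails.
z = np.ones((3, 3))
try:
    z + y
except ValueError as err:
    print("not broadcastable:", err)
```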

#### Useful functions:

In [86]:
grades1 = np.array([1.0,3,5.0,7,9,2,4,6])
grades2 = np.array([0.9,3,4.9,7,9,4,4,6])

In [87]:
np.where(grades1 > 4)

(array([2, 3, 4, 7]),)

In [88]:
np.where(grades1 > 4, 'bigger', 'lower')

array(['lower', 'lower', 'bigger', 'bigger', 'bigger', 'lower', 'lower',
       'bigger'], dtype='<U6')

In [89]:
grades1.argmin()

0

In [90]:
grades1.argmax()

4

In [91]:
grades1.argsort()

array([0, 5, 1, 6, 2, 7, 3, 4])

In [92]:
np.intersect1d(grades1,grades2)

array([3., 4., 6., 7., 9.])

In [94]:
np.allclose(grades1,grades2,rtol=0.1)

False

In [97]:
np.allclose(grades1,grades2,rtol=0.5)

True
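
The third positional argument of `np.allclose` is the relative tolerance `rtol`: two values a and b count as close when |a - b| <= atol + rtol * |b|, with `atol` defaulting to 1e-8. To see which positions cause the `False` result, the elementwise variant `np.isclose` can be used, as sketched below; the name `close` is illustrative.

```python
import numpy as np

grades1 = np.array([1.0, 3, 5.0, 7, 9, 2, 4, 6])
grades2 = np.array([0.9, 3, 4.9, 7, 9, 4, 4, 6])

# Elementwise version of the same test, so the failing positions are visible.
close = np.isclose(grades1, grades2, rtol=0.1)
print(close)
print(np.where(~close)[0])   # positions 0 and 5 exceed the 10% tolerance

# With a 50% relative tolerance every pair passes.
print(np.isclose(grades1, grades2, rtol=0.5).all())
```

Because the tolerance is relative to the second argument, the check is not symmetric: swapping `grades1` and `grades2` can change the result.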