<h1>Python libraries for data analysis</h1>


<li><b><span style="color:blue">Numpy</span></b>: supports numerical and array operations
<li><b><span style="color:blue">Scipy</span></b>: open source library for mathematics and scientific computing
<li><b><span style="color:blue">Pandas</span></b>: supports data manipulation and analysis
<li><b><span style="color:blue">Visualization libraries</span></b>: matplotlib, seaborn, bokeh, plotly, gmplot, and many others provide support for charts and graphs

<h1>numpy</h1>


<h2>Why numpy?</h2>
<li>Multi-dimensional arrays
<li>Faster and more space efficient than lists
<li>Can incorporate C/C++/Fortran code
<li>Linear algebra, Fourier transforms, random number support



<h2>numpy array</h2>

In [1]:
import numpy as np
ax = np.array([1,2,3,4,5])
# print(type(ax))


<li>A numpy array has a single data type (dtype) associated with its elements
<li>so all elements must be of the same data type
<li>But with the object dtype, an element can be an 'arbitrarily' complex Python object
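The points above can be sketched as follows. The variable names are illustrative and not part of the original notebook. Mixing types makes numpy pick one common type, while an explicit object dtype keeps arbitrary Python objects.

```python
import numpy as np

# Mixed ints and floats are upcast to one common type
mixed_numbers = np.array([1, 2.5, 3])
print(mixed_numbers.dtype)        # float64

# Mixing strings and numbers stores every element as a string
mixed_text = np.array(['1', 2, 3.0])
print(mixed_text.dtype.kind)      # 'U' (unicode string)

# dtype=object stores references to arbitrary Python objects
objects = np.array([{'a': 1}, 'text', None], dtype=object)
print(objects.dtype, objects.shape)
```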

In [2]:
# np.array(['1','2',False])

In [3]:
midt = np.array([67,47.5,34,81,84.5,82,89,83,71,94,80.5,67.5,88,93,78,76,75,45,79,87.5,65,82,50,76,81,79,68,52,68,70,73,80,73.5,85,79,72,74,64,88,66,75,86,72,77,84,75,97,54,85,50,66.5,82,74,69.5,51,86])

In [4]:
# print(midt.max(),midt.min(),midt.mean(),midt.std())

<h2>Specifying the type</h2>
<h3>Useful when reading a text stream directly into a numerical array</h3>

<h4>The <i>dtype</i> attribute</h4>
<li>Stores the data type of the array's elements
<li>When no type is given, numpy infers a common type that can hold every element

In [5]:
ax = np.array([[1,2,3,4],[5,6,7,8.3]])
ax.dtype

dtype('float64')

In [6]:
ax = np.array([{'a':1,'b':2},4])
# ax

In [7]:
x=['1','2','3']
xi = np.array(x,'int')
xf = np.array(x,'float')
xs = np.array(x,'str')
# print(xi,xf,xs,sep='\n')

<li>The <i>astype</i> method returns a copy of the array converted to another type


In [8]:
ax = np.array([1,2,3,'4'])
# print(ax.dtype)
# ax.astype(int)

In [9]:
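# np.float128 exists only on some platforms (e.g. Linux x86-64); np.longdouble is the portable alias
ay = ax.astype(np.float128)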
print(ay)
ay.dtype

[1. 2. 3. 4.]


dtype('float128')

<h2>Basic operations</h2>

<h4>statistical operations</h4>

In [10]:
x = np.array([13,24,21.2,17.6,21.7],'float')
print(x.sum(),x.mean(),x.std(),sep='\n')

97.50000000000001
19.500000000000004
3.8429155598321434


<h4>array arithmetic operations</h4>
<li><b>Important</b>: Arrays must have the same shape (or shapes that can be broadcast, covered later)

In [11]:
x = np.array([13,24,21.2,17.6,21.7],'float')
y = np.array([1,3,4,7,2],'float')
x - y

array([12. , 21. , 17.2, 10.6, 19.7])

In [12]:
x+y

array([14. , 27. , 25.2, 24.6, 23.7])

In [13]:
x*y

array([ 13. ,  72. ,  84.8, 123.2,  43.4])

In [14]:
x/y

array([13.        ,  8.        ,  5.3       ,  2.51428571, 10.85      ])

<h2>Multi-dimensional arrays</h2>

In [15]:
x=[[0,1,2,3,4,5],[10,11,12,13,14,15],[20,21,22,23,24,25]]
ax=np.array(x,'float')
# print(ax)

<h3>Indexing</h3>

In [16]:
# ax[1:,:] #indexing

In [17]:
# ax[(0,2),(0,5)]

<h3>Slicing</h3>

In [18]:
# ax[1:3,2:4]
#Intersection between ax[1:3,:] and ax[:,2:4]

In [19]:
# ax[[1,2],[2,3]]

In [20]:
a = np.arange(10)
# a[2:10]

<h3>Reshaping</h3>
<li>nd arrays can be reshaped as long as the total number of elements is unchanged


In [21]:
print(ax.shape)
ax.reshape(9,2)
#ax.reshape(10,3)

(3, 6)


array([[ 0.,  1.],
       [ 2.,  3.],
       [ 4.,  5.],
       [10., 11.],
       [12., 13.],
       [14., 15.],
       [20., 21.],
       [22., 23.],
       [24., 25.]])
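A reshape only succeeds when the product of the new shape equals the number of elements. The sketch below uses an illustrative array named grid, not one from the notebook. It shows -1 as a placeholder that numpy fills in, and the error raised for an incompatible shape.

```python
import numpy as np

grid = np.arange(18).reshape(3, 6)   # 18 elements

# -1 asks numpy to infer that dimension from the element count
inferred = grid.reshape(-1, 2)
print(inferred.shape)                # (9, 2)

# A shape whose product is not 18 is rejected
try:
    grid.reshape(10, 3)
    reshape_failed = False
except ValueError as err:
    reshape_failed = True
    print("reshape failed:", err)
```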

<h3>Creating nd arrays</h3>

<h4>Using the <i>array</i> function</h4>

In [22]:
data = [[0,1,2,3,4],[5,6,7,8,9]]
data_array = np.array(data)
data_array

array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]])

<h4>Using initializers</h4>

<li>The <i>arange</i> (array range) function


In [23]:
ax = np.arange(10)
print(ax)
ay = np.array([np.arange(10),np.arange(10)])
print(ay)
ax.dtype

[0 1 2 3 4 5 6 7 8 9]
[[0 1 2 3 4 5 6 7 8 9]
 [0 1 2 3 4 5 6 7 8 9]]


dtype('int64')

In [24]:
ax = np.arange(10)**2
print(ax)

[ 0  1  4  9 16 25 36 49 64 81]


<li>The <i>ones</i> function creates an array of 1s (floats)

In [25]:
ax = np.ones(10)
print(ax)
ax.dtype

[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]


dtype('float64')

In [26]:
ax = np.array([[1,2,3,4],[5,6,7,8]])
ay = np.ones_like(ax)
ay

array([[1, 1, 1, 1],
       [1, 1, 1, 1]])

In [27]:

ay = np.zeros_like(ax)
ay

array([[0, 0, 0, 0],
       [0, 0, 0, 0]])

<li>The <i>identity(n)</i> function creates an identity matrix of order n

In [28]:
np.identity(10)

array([[1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 1.]])

<li>The function <i>empty</i> creates an "empty" array without initializing its contents
<li>Values in the array are "garbage" values (whatever was in memory; they may happen to be zeros, as below)

In [29]:
np.empty([2,3],float)

array([[0., 0., 0.],
       [0., 0., 0.]])

<h3>Matrix multiplication</h3>


In [30]:
ax = np.arange(10)
ay = np.array([ax,ax])
#Scalar multiplication
ay*2

array([[ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18],
       [ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18]])

In [31]:
np.dot(ay,ay.reshape(10,2)) #Dot product

array([[220, 265],
       [220, 265]])

<h2>Lists vs numpy arrays</h2>
<li>Lists are heterogeneous. Elements of a list can be of multiple types
<li>Numpy arrays are homogeneous. Elements can be of only one type
<li>Both are mutable
<li>Homogeneity makes indexed access faster and storage more memory efficient
<li>numpy arrays are optimized for matrix operations
<li>numpy provides random number support
    
CRUD 


<h3>numpy arrays are homogeneous</h3>
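A minimal sketch of what homogeneity means in practice, using illustrative names that are not in the original notebook. A list accepts any object in any slot, while an integer array converts assigned values to its own dtype or rejects them.

```python
import numpy as np

values_list = [1, 2, 3]
values_array = np.array([1, 2, 3])

# A list accepts any object in any slot
values_list[0] = 'one'

# An integer array converts assigned values to its own dtype
values_array[0] = 7.9            # stored as 7 (fractional part dropped)
print(values_array)              # [7 2 3]

# A value that cannot be converted is rejected
try:
    values_array[1] = 'two'
    assignment_failed = False
except ValueError as err:
    assignment_failed = True
    print("assignment failed:", err)
```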

<h3>numpy arrays are faster</h3>

In [79]:
n=10
ax = np.array([np.arange(n)**2,np.arange(n)**3])
ay = ax.transpose()
print(ax)
print(ay)
np.dot(ax,ay)

[[  0   1   4   9  16  25  36  49  64  81]
 [  0   1   8  27  64 125 216 343 512 729]]
[[  0   0]
 [  1   1]
 [  4   8]
 [  9  27]
 [ 16  64]
 [ 25 125]
 [ 36 216]
 [ 49 343]
 [ 64 512]
 [ 81 729]]


array([[ 15333, 120825],
       [120825, 978405]])

<h4>Functionalize this</h4>


In [33]:
def dotproduct(n):
    ax = np.array([np.arange(n)**2,np.arange(n)**3])
    ay = ax.transpose()
    import datetime
    start = datetime.datetime.now()
    np.dot(ax,ay)
    end = datetime.datetime.now()
    return end-start
    
dotproduct(10)    

datetime.timedelta(microseconds=16)

<h4>Do the same with python lists</h4>


In [34]:
def dot_product_lists(n):
    x = [x**2 for x in range(n)]
    y = [x**3 for x in range(n)]
    ax = [x,y]                          # 2 x n "matrix" as a list of rows
    ay = [list(i) for i in zip(*ax)]    # transpose: n x 2
    import datetime
    start = datetime.datetime.now()
    # row-by-column products; zip(*ay) yields the columns of ay
    [[sum(a*b for a,b in zip(X_row,Y_col)) for Y_col in zip(*ay)] for X_row in ax]
    end = datetime.datetime.now()
    return end-start
    
dot_product_lists(10)


datetime.timedelta(microseconds=17)

In [89]:
x = [x**2 for x in range(n)]
y = [x**3 for x in range(n)]
ax = [x,y]
ay = [list(i) for i in zip(*ax)]
print(ax)
print(ay)

[[0, 1, 4, 9, 16, 25, 36, 49, 64, 81], [0, 1, 8, 27, 64, 125, 216, 343, 512, 729]]
[[0, 0], [1, 1], [4, 8], [9, 27], [16, 64], [25, 125], [36, 216], [49, 343], [64, 512], [81, 729]]



<h4>Compare the two</h4>

In [90]:
for n in [10,100,1000,10000,1000000]:
    numpy_result = dotproduct(n)
    list_result = dot_product_lists(n)
    print(n,numpy_result,list_result,sep='\t')

10	0:00:00.000095	0:00:00.000018
100	0:00:00.000018	0:00:00.000060
1000	0:00:00.000019	0:00:00.000665
10000	0:00:00.000175	0:00:00.008250
1000000	0:00:00.003757	0:00:01.734458


In [None]:
# https://www.geeksforgeeks.org/timeit-python-examples/
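The standard library's timeit module is a more reliable way to time short snippets than subtracting datetime values, because it repeats the call many times. This sketch compares a 1-D dot product. The size n, the repeat count, and the variable names are arbitrary choices for illustration.

```python
import timeit
import numpy as np

n = 1000
squares = np.arange(n, dtype=np.int64) ** 2
squares_list = [i ** 2 for i in range(n)]

# timeit calls each function `number` times and returns the total seconds
numpy_seconds = timeit.timeit(lambda: np.dot(squares, squares), number=200)
list_seconds = timeit.timeit(
    lambda: sum(a * b for a, b in zip(squares_list, squares_list)),
    number=200,
)
print(f"numpy: {numpy_seconds:.6f}s  lists: {list_seconds:.6f}s")
```

The actual timings depend on the machine, so no expected output is shown.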

<h3>numpy indexing vs list indexing</h3>
<li>numpy arrays use direct indexing
<li>lists use chained indexing

In [92]:
ax = np.array([1,2,3,4,8,9])
x = [1,2,3,4,8,9]

#Extract the first and last elements from the numpy array into a single array
ax[[0,-1]]

#Extract the first and last elements from the list into a new list
[x[0],x[-1]]

[1, 9]

<h3>numpy slicing vs list slicing</h3>

In [37]:
ax = np.array([[11,12,13,14],[21,22,23,24],[31,32,33,34]])
ax[1:3,1:3]

array([[22, 23],
       [32, 33]])

In [38]:
ax

array([[11, 12, 13, 14],
       [21, 22, 23, 24],
       [31, 32, 33, 34]])

In [93]:
lx = [[11,12,13,14],[21,22,23,24],[31,32,33,34]]

SyntaxError: invalid syntax (<ipython-input-93-28d89b3980c8>, line 2)
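A nested list has no direct equivalent of ax[1:3,1:3]. As a sketch, indexing the list with a tuple of slices raises a TypeError, and the same sub-block has to be built by slicing the rows and then slicing each row.

```python
import numpy as np

lx = [[11, 12, 13, 14], [21, 22, 23, 24], [31, 32, 33, 34]]

# A list does not accept a tuple of slices
try:
    lx[1:3, 1:3]
    tuple_index_failed = False
except TypeError as err:
    tuple_index_failed = True
    print("list slicing failed:", err)

# Equivalent sub-block: slice the rows, then slice each row
sub_block = [row[1:3] for row in lx[1:3]]
print(sub_block)                 # [[22, 23], [32, 33]]
```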

<h2>batch operations on nd arrays</h2>
<li>numpy arrays allow the application of batch operations on all elements of an array
<li>without having to write a for loop or use an iterator
<li>by <i>vectorizing</i> operations, numpy runs the loop in compiled code, which is much faster than a Python for loop
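To make the contrast concrete, here is a small sketch with made-up temperature values (not data from the notebook). Both versions compute the same result, but the vectorized one is a single expression over the whole array.

```python
import numpy as np

temperatures_c = np.array([12.0, 18.5, 21.0, 30.2])

# Loop version: one Python-level operation per element
loop_result = np.empty_like(temperatures_c)
for i, value in enumerate(temperatures_c):
    loop_result[i] = value * 9 / 5 + 32

# Vectorized version: one expression applied to the whole array
vectorized_result = temperatures_c * 9 / 5 + 32

print(np.allclose(loop_result, vectorized_result))   # True
```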


<h3>batch: selecting elements using a boolean mask</h3>
<li> A boolean mask applies a condition to each element in turn
<li> And returns an array of booleans with
<ul>
<li> True for each value that satisfies the condition
<li> False for every other value
</ul>

In [40]:
ax = np.array([1,4,7,9,2,3,10,11,34,2])
ax < 7

array([ True,  True, False, False,  True,  True, False, False, False,
        True])

<h4>The mask can be applied as a selection operator on the array</h4>

In [41]:
ax[ax<7]

array([1, 4, 2, 3, 2])

<h4>The mask doesn't have to be constructed on the same array</h4>
<li>But the mask and the array should have the same shape

In [42]:
names = np.array(['Bill','Sally','Qing','Savitri','Giovanni'])
bonus = np.array([232300.56,478123.45,3891.24,98012.36,52123.50])
names[bonus > 130000]


array(['Bill', 'Sally'], dtype='<U8')

<h3>batch: arithmetic operations</h3>
<li>+, -, *, /, scalar multiplication do an element by element operation

In [43]:
ax = np.array([[1,2,3],[4,5,6]])
1/ax


array([[1.        , 0.5       , 0.33333333],
       [0.25      , 0.2       , 0.16666667]])

<h3>batch: functional arithmetic operators</h3>

In [94]:
ax = np.array([4,9,2,4,0,25,0])
print(np.sum(ax>5))
print(np.count_nonzero(ax))
print(np.any(ax>10))
print(np.all(ax>0))

2
5
True
False


<h3>Logical operations with numpy</h3>
<li>logical_or
<li>logical_and

In [45]:
np.logical_and(bonus>90000.0, bonus<400000)

array([ True, False, False,  True, False])

<h4>Boolean operators</h4>
<li>the numpy equivalent of "and" is "&"
<li>the numpy equivalent of "or" is "|"
<li>the numpy equivalent of "not" is "~"


In [95]:
print(np.sum(bonus[(bonus>50000) & (bonus < 200000)]))
print(np.sum(bonus[(names=="Bill") | (names == "Qing")]))
print(np.sum(bonus[(names!="Bill")]))
print(np.sum(bonus[~((names=="Bill") | (names == "Qing"))]))


150135.86
236191.8
632150.55
628259.31


<b>Problem</b> Calculate the mean and median bonus amount for all female employees with bonus less than $100,000

In [47]:
names = np.array(['Bill','Sally','Qing','Savitri','Giovanni'])
bonus = np.array([232300.56,478123.45,3891.24,98012.36,52123.50])
gender = np.array(['M','F','F','F','M'])



In [48]:
# np.mean(bonus[(gender == 'F') & (bonus < 100000)])
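One possible solution, sketched with the same arrays: build the combined mask once, then pass the selected values to np.mean and np.median. The name selected is illustrative.

```python
import numpy as np

names = np.array(['Bill', 'Sally', 'Qing', 'Savitri', 'Giovanni'])
bonus = np.array([232300.56, 478123.45, 3891.24, 98012.36, 52123.50])
gender = np.array(['M', 'F', 'F', 'F', 'M'])

# Each condition needs its own parentheses because & binds tighter than == and <
selected = bonus[(gender == 'F') & (bonus < 100000)]

print(np.mean(selected), np.median(selected))
```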

<b>Problem</b> Return an nd array containing the names of all female employees with bonus less than $100,000

In [49]:
names = np.array(['Bill','Sally','Qing','Savitri','Giovanni'])
bonus = np.array([232300.56,478123.45,3891.24,98012.36,52123.50])
gender = np.array(['M','F','F','F','M'])
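One possible solution, as a sketch: the mask is built from gender and bonus but applied to names, which works because all three arrays have the same shape. The name selected_names is illustrative.

```python
import numpy as np

names = np.array(['Bill', 'Sally', 'Qing', 'Savitri', 'Giovanni'])
bonus = np.array([232300.56, 478123.45, 3891.24, 98012.36, 52123.50])
gender = np.array(['M', 'F', 'F', 'F', 'M'])

# Mask built on two arrays, applied to a third
selected_names = names[(gender == 'F') & (bonus < 100000)]
print(selected_names)            # ['Qing' 'Savitri']
```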


<h3>batch: Selecting elements using where</h3>
<li>The <i>where</i> function creates a new array using a conditional expression
<li>Somewhat like the IF function in an Excel spreadsheet
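A short sketch of both forms of np.where, reusing the bonus values from earlier cells. The labels 'high' and 'standard' and the 100000 threshold are made up for illustration.

```python
import numpy as np

bonus = np.array([232300.56, 478123.45, 3891.24, 98012.36, 52123.50])

# np.where(condition, value_if_true, value_if_false)
band = np.where(bonus > 100000, 'high', 'standard')
print(band)

# With only a condition, np.where returns the indices where it holds
high_positions = np.where(bonus > 100000)[0]
print(high_positions)            # [0 1]
```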

<h2>axes</h2>
<li>The axis parameter tells numpy which axis to operate along

In [102]:
ax = np.array([1,2,3,4,5,6,7,8,9,10,11,12])
ax = ax.reshape(3,4)
print(ax)

[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]


<h4>sum each column</h4>

In [51]:
ax.sum(axis=0)

array([15, 18, 21, 24])

<h4>sum each row</h4>

In [52]:
ax.sum(axis=1)

array([10, 26, 42])

<h4>sum by depth</h4>

In [53]:
ax=ax.reshape(2,3,2)
ax.sum(axis=2)

array([[ 3,  7, 11],
       [15, 19, 23]])

<h4>add an axis to an array</h4>


In [104]:
ax = np.array([1,2,3,4,5,6,7,8,9,10,11,12])
print(ax)
ax[:,np.newaxis]

[ 1  2  3  4  5  6  7  8  9 10 11 12]


array([[ 1],
       [ 2],
       [ 3],
       [ 4],
       [ 5],
       [ 6],
       [ 7],
       [ 8],
       [ 9],
       [10],
       [11],
       [12]])

<h4>Easy to add n-dimensions to an nd array using newaxis</h4>

In [55]:
ax = ax.reshape(4,3)
# ax[np.newaxis,np.newaxis,np.newaxis].shape

In [56]:
x=[[0,1,2,3,4,5],[10,11,12,13,14,15],[20,21,22,23,24,25]]
ax=np.array(x,float)
# np.where(ax%2==0,1,0)

<h2>Broadcasting</h2>
<li>arithmetic operations work element by element
<li>so both arrays normally have to be of the same shape
<li><b>broadcasting</b> is used for arithmetic on arrays of different shapes

In [57]:
ax = np.array([1,2,3])
ay = np.array([3,2,1])
ax+ay

array([4, 4, 4])

<li>when one operand is a scalar, numpy works as if it has created a second array
<li>ax + 5 is equivalent to ax + np.array([5,5,5,])
<li>note the "as if" because it doesn't actually do that
<li>instead it <b>broadcasts</b> the 5 to each element of ax
<li>we can do this broadcasting on arrays of any dimension
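The "as if" can be made visible with np.broadcast_to, which is an addition here and not used in the original notebook. This sketch shows that the broadcast array reads a single stored value at every position without copying it.

```python
import numpy as np

ax = np.array([1, 2, 3])

# Adding a scalar gives the same result as adding a full array of 5s
same_result = np.array_equal(ax + 5, ax + np.array([5, 5, 5]))
print(same_result)               # True

# broadcast_to builds a read-only view that repeats the value without copying
virtual_fives = np.broadcast_to(5, ax.shape)
print(virtual_fives)             # [5 5 5]
print(virtual_fives.strides)     # (0,): every position reads the same value
```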

In [58]:
ay = np.ones([3,3])
ay

array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]])

In [59]:
ax = np.array([1,2,3])
ax + ay

array([[2., 3., 4.],
       [2., 3., 4.],
       [2., 3., 4.]])

<b>broadcasting</b> won't work when the arrays have incompatible shapes

In [107]:
ax = np.array([1,2,3,4,5,6,7,8,9,10,11,12])
ay = np.array([3,4,5])
ax[np.newaxis].shape
ax + ay

ValueError: operands could not be broadcast together with shapes (12,) (3,) 

<h4>np.newaxis is useful here because we can convert ax into a 2D array</h4>

In [111]:
ax[:,np.newaxis] + ay

array([[ 4,  5,  6],
       [ 5,  6,  7],
       [ 6,  7,  8],
       [ 7,  8,  9],
       [ 8,  9, 10],
       [ 9, 10, 11],
       [10, 11, 12],
       [11, 12, 13],
       [12, 13, 14],
       [13, 14, 15],
       [14, 15, 16],
       [15, 16, 17]])

In [62]:
#Broadcasting effectively does this:
ax[:,np.newaxis] + np.array([[3,4,5],[3,4,5],[3,4,5],[3,4,5],[3,4,5],[3,4,5],[3,4,5],[3,4,5],[3,4,5],[3,4,5],[3,4,5],[3,4,5]])

array([[ 4,  5,  6],
       [ 5,  6,  7],
       [ 6,  7,  8],
       [ 7,  8,  9],
       [ 8,  9, 10],
       [ 9, 10, 11],
       [10, 11, 12],
       [11, 12, 13],
       [12, 13, 14],
       [13, 14, 15],
       [14, 15, 16],
       [15, 16, 17]])

<h4>We could also convert ay into a 2D array</h4>
<li>the result will be different (why?) 

In [63]:
ax = np.array([1,2,3,4,5,6,7,8,9,10,11,12])
ay = np.array([3,4,5])
ax + ay[:,np.newaxis]

array([[ 4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15],
       [ 5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16],
       [ 6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17]])
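One way to see why the results differ is to compare shapes. In this sketch the size-1 axis is the one that gets stretched, so adding the new axis to ax gives a (12, 3) result while adding it to ay gives a (3, 12) result. The names tall and wide are illustrative.

```python
import numpy as np

ax = np.arange(1, 13)            # shape (12,)
ay = np.array([3, 4, 5])         # shape (3,)

# (12, 1) with (3,): ay is repeated for each row, result is (12, 3)
tall = ax[:, np.newaxis] + ay
print(tall.shape)

# (12,) with (3, 1): ax is repeated for each row, result is (3, 12)
wide = ax + ay[:, np.newaxis]
print(wide.shape)

# The two results are transposes of each other
print(np.array_equal(tall, wide.T))   # True
```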

<h2>Fancy indexing</h2>
<li>numpy lets us work on indexed subsets of an nd array
<li>this lets us construct arbitrary subsets of an nd array in any dimension

In [113]:
ax = np.array([4,3,9,2,1,6])
ay = np.array([2,4])
# ax[ay]

In [65]:
ay = np.array([[2,4],[1,3]])
# ax[ay]

<h4>multi-dimensional indexes</h4>
<li>In a 2-D index, the index arrays are used to generate (row_number,col_number) pairs

In [66]:
ax = np.array([[23,34,21,34,22],[33,44,11,29,32],[14,90,10,20,17]])
rows = np.array([0,2])
cols = np.array([1,4])
# ax[rows,cols] # [ax[0,1],ax[2,4]]

In [67]:
rows2 = np.array([[0,0],[2,2]])
cols2 = np.array([[0,4],[0,4]])
# ax[rows2,cols2]
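As a sketch of the pairing rule, the first cell below selects exactly the elements at (0,1) and (2,4). The np.ix_ helper is an addition not used in the original notebook. It selects every combination of the listed rows and columns.

```python
import numpy as np

ax = np.array([[23, 34, 21, 34, 22],
               [33, 44, 11, 29, 32],
               [14, 90, 10, 20, 17]])
rows = np.array([0, 2])
cols = np.array([1, 4])

# Index arrays are paired element by element: (0, 1) and (2, 4)
paired = ax[rows, cols]
print(paired)                    # [34 17]

# np.ix_ selects the full cross product of rows and columns
block = ax[np.ix_(rows, cols)]
print(block)
```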

In [68]:
# ax[1:3,[2,3,4]]

In [69]:
# ax[0:3:2,0:5:4]

In [70]:
# ax.reshape(5,3)

<h2>Universal functions</h2>
<li>functions that perform elementwise operations on arrays
<li>fast vectorized "wrappers" around functions that take scalar inputs and produce scalar outputs
<li>sqrt, exp, add, maximum, minimum, abs, etc.
<li>https://docs.scipy.org/doc/numpy/reference/ufuncs.html
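A brief sketch of the elementwise behaviour, with illustrative input values. The reduce and accumulate methods of binary ufuncs are an addition to what the notebook covers. They are how a ufunc produces a lower-dimensional result.

```python
import numpy as np

ax = np.array([1, 4, 9, 16], float)

# Elementwise: the output has the same shape as the input
roots = np.sqrt(ax)
print(roots)                     # [1. 2. 3. 4.]

# Binary ufuncs also expose reduce and accumulate methods
total = np.add.reduce(ax)
running = np.add.accumulate(ax)
print(total)                     # 30.0
print(running)                   # [ 1.  5. 14. 30.]
```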

In [71]:
ax = np.array([1,2,3,4,5,6,7],float)
np.sqrt(ax)
np.exp(ax)

array([   2.71828183,    7.3890561 ,   20.08553692,   54.59815003,
        148.4131591 ,  403.42879349, 1096.63315843])

In [72]:
ay = np.arange(10,17)
np.add(ax,ay)
np.maximum(ax,ay)

array([10., 11., 12., 13., 14., 15., 16.])

In [73]:

#linalg, a linear algebra module
#functions dealing with polynomials, differentials, etc
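A small sketch of the two areas mentioned in the comments above, with a made-up system of equations and polynomial: np.linalg.solve for a linear system and np.polyval for evaluating a polynomial.

```python
import numpy as np

# Solve 2x + y = 5 and x + 3y = 10
coefficients = np.array([[2.0, 1.0], [1.0, 3.0]])
constants = np.array([5.0, 10.0])
solution = np.linalg.solve(coefficients, constants)
print(solution)                  # approximately [1. 3.]

# Evaluate the polynomial x**2 + 2x + 3 at x = 2
value = np.polyval([1, 2, 3], 2)
print(value)                     # 11
```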


In [74]:
# recent SciPy releases no longer expose nanmean at the top level; numpy provides it
np.nanmean(x)

  


12.5

<h3>Random number support in numpy</h3>

In [75]:
np.random.normal(size=10)
np.random.normal(size=(100,100))
np.random.exponential()
np.random.exponential(1.0,size=(6,3))
np.random.randint(-10,10,size=(9,9))

array([-0.48417364, -0.27293103, -1.19893426, -0.02400497, -1.92888687,
       -0.89108241,  0.95531516, -1.10296501, -1.35814482,  0.01034872])
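For reproducible results, a seeded generator can be used. This sketch relies on np.random.default_rng, which also appears in the dir(np.random) listing below. The seed value 42 and the variable names are arbitrary.

```python
import numpy as np

# A seeded Generator gives reproducible draws
rng = np.random.default_rng(seed=42)
normal_draws = rng.normal(size=5)
integer_draws = rng.integers(-10, 10, size=(3, 3))   # upper bound excluded

# Re-creating the generator with the same seed repeats the sequence
rng_again = np.random.default_rng(seed=42)
repeated = np.array_equal(normal_draws, rng_again.normal(size=5))
print(repeated)                  # True
```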

In [114]:
dir(np.random)

['BitGenerator',
 'Generator',
 'MT19937',
 'PCG64',
 'Philox',
 'RandomState',
 'SFC64',
 'SeedSequence',
 '__RandomState_ctor',
 '__all__',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '_bit_generator',
 '_bounded_integers',
 '_common',
 '_generator',
 '_mt19937',
 '_pcg64',
 '_philox',
 '_pickle',
 '_sfc64',
 'absolute_import',
 'beta',
 'binomial',
 'bytes',
 'chisquare',
 'choice',
 'default_rng',
 'dirichlet',
 'division',
 'exponential',
 'f',
 'gamma',
 'geometric',
 'get_state',
 'gumbel',
 'hypergeometric',
 'laplace',
 'logistic',
 'lognormal',
 'logseries',
 'mtrand',
 'multinomial',
 'multivariate_normal',
 'negative_binomial',
 'noncentral_chisquare',
 'noncentral_f',
 'normal',
 'pareto',
 'permutation',
 'poisson',
 'power',
 'print_function',
 'rand',
 'randint',
 'randn',
 'random',
 'random_integers',
 'random_sample',
 'ranf',
 'rayleigh',
 'sample',
 'seed',
 'set_state',
 'shuffle',