<p style="font-family: Arial; font-size:3.75em;color:purple; font-style:bold"><br>
Introduction to numpy:
</p><br>

<p style="font-family: Arial; font-size:1.25em;color:#2462C0; font-style:bold"><br>
Package for scientific computing with Python
</p><br>

NumPy, short for Numerical Python, is a foundational package on which many of the most common data science packages are built. NumPy provides high-performance multi-dimensional arrays, which we can use as vectors or matrices.


It provides n-dimensional arrays whose elements all share the same data type, along with built-in functions that are fast and space-efficient.

This is an elementary introduction to NumPy for beginners, meant to help them get the hang of it.
The key topics we will cover are:

    Creation of ndarray
	Indexing & Slicing 
	Arithmetic Array Operations 
	Statistical Methods
	Set Operations
	Broadcasting

<b>Additional Recommended Resources:</b><br>
<a href="https://docs.scipy.org/doc/numpy/reference/">Numpy Documentation</a><br>
<i>Python for Data Analysis</i> by Wes McKinney<br>
<i>Python Data Science Handbook</i> by Jake VanderPlas



# Sample Data that we will be Dealing with

https://www.kaggle.com/c/digit-recognizer/data - Image Processing

https://www.kaggle.com/c/imaterialist-challenge-fashion-2018 - Image Identification

https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data - Data Prediction

https://www.kaggle.com/c/movie-review-sentiment-analysis-kernels-only - Natural Language Processing

# Why Numpy

In [45]:
%config IPCompleter.greedy=True
import numpy as np

# NumPy array
my_arr = np.arange(1000000)

# Python list
my_list = list(range(1000000))

print (len(my_arr))
print (len(my_list))

1000000
1000000


In [None]:
#Now let’s multiply each sequence by 2
# Numpy is extremely fast compared to usual collections (traditional list in this case)

%time for _ in range(50): my_arr2 = my_arr * 2 #Numpy processing time

%time for _ in range(50): my_list2 = [x * 2 for x in my_list] #loop based processing
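
The `%time` magic only works inside IPython or Jupyter. As a rough sketch of the same comparison in plain Python, the standard-library `timeit` module can be used instead. The array size and repetition count below are arbitrary choices for illustration, and the measured times will vary from machine to machine.

```python
import timeit

import numpy as np

n = 100_000  # assumed size, smaller than the notebook's so the sketch runs quickly
arr = np.arange(n)
lst = list(range(n))

# Time 20 repetitions of each approach.
t_numpy = timeit.timeit(lambda: arr * 2, number=20)
t_list = timeit.timeit(lambda: [x * 2 for x in lst], number=20)

print(f"NumPy: {t_numpy:.4f} s, list comprehension: {t_list:.4f} s")
```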

# NumPy data types

In [27]:
vector = np.array([1, 2, 3])
print (type(vector))
print(vector)

<class 'numpy.ndarray'>
[1 2 3]


In [26]:
matrix = np.array([[1,2,3],
              [4,5,6]])
print (matrix)
print (matrix[0],matrix[1])

print (type(matrix))
for l in matrix:
    print (l)


[[1 2 3]
 [4 5 6]]
[1 2 3] [4 5 6]
<class 'numpy.ndarray'>
[1 2 3]
[4 5 6]


In [18]:
tensor = np.array([
                [[1,2,3], 
               [4,5,6]],
                   
              [[21,22,33],
               [44,55,66]]
            ])
print(tensor)

[[[ 1  2  3]
  [ 4  5  6]]

 [[21 22 33]
  [44 55 66]]]


In [21]:
print(vector.shape, vector.size)
print(matrix.shape, matrix.size)
print(tensor.shape, tensor.size)

(3,) 3
(2, 3) 6
(2, 2, 3) 12
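
As a small companion sketch, the `ndim` attribute reports the number of dimensions (the rank), and `size` is always the product of the entries in `shape`. The arrays below simply repeat the ones defined above so the snippet is self-contained.

```python
import numpy as np

vector = np.array([1, 2, 3])
matrix = np.array([[1, 2, 3], [4, 5, 6]])
tensor = np.array([[[1, 2, 3], [4, 5, 6]],
                   [[21, 22, 33], [44, 55, 66]]])

for name, arr in [("vector", vector), ("matrix", matrix), ("tensor", tensor)]:
    # ndim is the number of axes; size equals the product of the shape entries
    print(name, arr.ndim, arr.shape, arr.size, int(np.prod(arr.shape)))
```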


<p style="font-family: Arial; font-size:2.75em;color:purple; font-style:bold"><br>

Getting started with ndarray<br><br></p>

**ndarrays** are time and space-efficient multidimensional arrays at the core of numpy.  Let's get started by creating ndarrays using the numpy package.

<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold"><br>

How to create Rank 1 numpy arrays:
</p>

In [29]:
import numpy as np                 #import numpy package into notebook

an_array = np.array([3, 33, 333],dtype=np.int32)  # Create a rank 1 array

print(an_array)
print(type(an_array))              # The type of an ndarray is: "<class 'numpy.ndarray'>"
print(an_array.dtype)
print(an_array.shape)              # Attribute giving the size along each dimension; for a 1-D array, the length of the array
print(an_array.size)

[  3  33 333]
<class 'numpy.ndarray'>
int32
(3,)
3


In [30]:
# because this is a rank 1 array, we need only one index to access each element
print(an_array[0], an_array[1], an_array[2]) 

3 33 333


In [31]:
an_array[0] =888            # ndarrays are mutable, here we change an element of the array

print(an_array)

[888  33 333]


<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold"><br>

How to access a Rank 2 numpy array:</p>

A rank 2 **ndarray** is one with two dimensions.  Notice the format below of [ [row] , [row] ].  2 dimensional arrays are great for representing matrices which are often useful in data science.

In [32]:
another = np.array([[11,12,13],[21,22,23]])   # Create a rank 2 array

print(another)  # print the array

print(another.shape)  # rows x columns 

#print(another.size) 

#print("Accessing elements [0,0], [0,1], and [1,0] of the ndarray: ", another[0, 0], ", ",another[0, 1],", ", another[1, 0])

[[11 12 13]
 [21 22 23]]
(2, 3)


<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold"><br>

There are many ways to create numpy arrays:
</p>

Here we create a number of different size arrays with different shapes and different pre-filled values.  numpy has a number of built in methods which help us quickly and easily create multidimensional arrays.

In [38]:
import numpy as np

# create a 2x2 array of zeros
ex1 = np.zeros((2,2)) 
ex2 = np.zeros((2,2),int)      
print(ex1)  
print (ex2)

[[0. 0.]
 [0. 0.]]
[[0 0]
 [0 0]]


In [43]:
# create a 2x2 array filled with 9.0
ex2 = np.full((2,2), 9.0)  
print(ex2)   

[[9. 9.]
 [9. 9.]]


In [47]:
# create a 2x2 matrix with the diagonal 1s and the others 0
ex3 = np.eye(2,2)
ex4 = np.eye(3,5)
print (ex4)
print(ex3)  

[[1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0.]]
[[1. 0.]
 [0. 1.]]


In [48]:
# create an array of ones
ex4 = np.ones((1,2))
print(ex4)    

[[1. 1.]]


In [55]:
# create an array of random floats between 0 and 1 and between 0 and 5
ex5 = np.random.random((5,5))
ex6 = np.random.random((6,6))*5
print(ex5)    
print (ex6)

[[0.34332948 0.66266928 0.38326866 0.91311684 0.88344043]
 [0.64519673 0.10812673 0.46421238 0.63245379 0.67003942]
 [0.42534684 0.2054543  0.68111379 0.46294504 0.84049412]
 [0.68373073 0.11632202 0.20488594 0.90326204 0.74775538]
 [0.52996399 0.53965589 0.15547809 0.52830508 0.66166518]]
[[2.74492733 1.09713977 1.80084828 2.33201231 0.93834372 2.34385635]
 [3.21342439 0.65251605 2.19568316 3.25645989 3.50720231 1.75655761]
 [2.09220398 1.14745742 0.29446128 2.35059253 0.11138307 3.10680607]
 [2.36422973 4.11679796 1.36991672 3.38098898 1.85764853 0.42721163]
 [4.0405649  0.78611973 1.38500519 1.97412698 4.22137313 1.37721157]
 [0.37113306 3.27561626 0.75784483 0.93679809 0.78943605 4.88532289]]


In [57]:
x = np.arange(1,10)
y = x.reshape(3,3) 
print(x)
print (y)


[1 2 3 4 5 6 7 8 9]
[[1 2 3]
 [4 5 6]
 [7 8 9]]
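
Two more creation helpers that are often useful alongside the ones above are `np.arange` with a step and `np.linspace`. This is only an illustrative sketch; the particular ranges are arbitrary.

```python
import numpy as np

# arange with a step: start is included, stop is excluded
evens = np.arange(0, 10, 2)
print(evens)        # [0 2 4 6 8]

# linspace: a fixed number of evenly spaced points, both endpoints included
points = np.linspace(0, 1, 5)
print(points)       # [0.   0.25 0.5  0.75 1.  ]

# zeros_like: an array of zeros with the same shape and dtype as another array
blank = np.zeros_like(evens)
print(blank)        # [0 0 0 0 0]
```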


<p style="font-family: Arial; font-size:2.75em;color:purple; font-style:bold"><br>

Array Indexing & Slicing

<br><br></p>

<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold"><br>
Slice indexing:
</p>

We can use slice indexing to pull out sub-regions of ndarrays.
Slicing is also how a dataset is commonly divided into training, cross-validation, and test sets.
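
As a minimal sketch of that idea, the snippet below splits ten stand-in samples into three parts with plain slices. The 60/20/20 proportions and the variable names are assumptions made for illustration, not a recommendation.

```python
import numpy as np

data = np.arange(10)   # stand-in for 10 samples

train = data[:6]       # first 60%
val = data[6:8]        # next 20%
test = data[8:]        # remaining 20%

print(train, val, test)   # [0 1 2 3 4 5] [6 7] [8 9]
```

In practice the data would usually be shuffled before splitting so that each part is representative.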

In [58]:
import numpy as np
x = np.array([1,2,3,4,5])
print(x)


[1 2 3 4 5]


In [59]:
print('1st element:', x[0])      #positive indices count from the beginning of the array
print('2nd element:', x[1])
#print('5th element:', x[4])

1st element: 1
2nd element: 2


Negative indices count backward from the end of the array, so x[-1] is the last element.

In [61]:
print('1st element:', x[-5])    #negative indices from end of the array
print('2nd element:', x[-4])
#print('5th element:', x[-1])

1st element: 1
2nd element: 2


<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold"><br>
Slice indexing:
</p>

We can use slice indexing to pull out sub-regions of ndarrays.
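1. ndarray[start : end]     &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; start is included and end is excluded
2. ndarray[start : ]        &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;       From the start index to the end of the array
3. ndarray[ : end]          &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;       From the beginning of the array up to, but not including, end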
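
The three forms listed above can be tried on a small 1-D array, as in the sketch below. The optional third `step` value is not covered in the list and is shown here only as an extra.

```python
import numpy as np

x = np.arange(10)   # [0 1 2 3 4 5 6 7 8 9]

print(x[2:5])    # [2 3 4]       start included, end excluded
print(x[5:])     # [5 6 7 8 9]   from index 5 to the end
print(x[:3])     # [0 1 2]       from the beginning up to index 3

# Extra: a third value gives the step
print(x[::2])    # [0 2 4 6 8]
print(x[::-1])   # [9 8 7 6 5 4 3 2 1 0]
```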

In [78]:
a = np.array([[1,2,3,4],
              [5,6,7,8],
              [9,10,11,12]])

# Use slicing to pull out the subarray consisting of the first 2 rows
# [1,2,3,4],[5,6,7,8]

print(a[:2]) #first 2 rows, all columns
print(a[:,:]) #all rows, all columns
print(a[:,:2]) #all rows, first 2 columns

[[1 2 3 4]
 [5 6 7 8]]
[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]
[[ 1  2]
 [ 5  6]
 [ 9 10]]


In [65]:
a = np.array([[1,2,3,4],
              [5,6,7,8],
              [9,10,11,12]])

# Use slicing to pull out the subarray consisting of the first 2 rows
# and columns 1 and 2; b is the following array of shape (2, 2):
# [[2 3]
#  [6 7]]
b = a[:2, 1:3] #Multiple slices 

print (b)

[[2 3]
 [6 7]]


Exercise

In [109]:
arr2d = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]]) #3x3 matrix

# Retrieve rows 0 and 1, columns from index 1 onward
print (arr2d[:2,1:])
# Retrieve the last row, all columns
print (arr2d[-1])
# Retrieve [5,6],[8,9]
print (arr2d[1:,1:])

[[2 3]
 [5 6]]
[7 8 9]
[[5 6]
 [8 9]]


When you modify a slice, you actually modify the underlying array.

In [None]:
an_array = np.array([[11,12,13,14], [21,22,23,24], [31,32,33,34]])
a_slice = an_array[:2, 1:3]
print(a_slice)

print("Before:", an_array[0, 1])   #inspect the element at 0, 1  
a_slice[0, 0] = 1000    # a_slice[0, 0] is the same piece of data as an_array[0, 1]
print("After:", an_array[0, 1])  
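
If the original array must stay untouched, one option is to take an explicit copy of the slice. The sketch below reuses the same array and uses `np.shares_memory` to show the difference between a view and a copy.

```python
import numpy as np

an_array = np.array([[11, 12, 13, 14], [21, 22, 23, 24], [31, 32, 33, 34]])

a_view = an_array[:2, 1:3]          # a slice is a view onto the same data
a_copy = an_array[:2, 1:3].copy()   # .copy() allocates independent data

a_copy[0, 0] = 1000                 # modifies only the copy
print(an_array[0, 1])               # 12, the original is unchanged

print(np.shares_memory(an_array, a_view))   # True
print(np.shares_memory(an_array, a_copy))   # False
```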


<p style="font-family: Arial; font-size:2.75em;color:purple; font-style:bold"><br>

Datatypes and Array Operations
<br><br></p>

<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold"><br>

Datatypes:
</p>

In [None]:
ex1 = np.array([11, 12]) # NumPy infers the data type
print(ex1.dtype)

In [None]:
ex2 = np.array([11.0, 12.0]) # NumPy infers the data type
print(ex2.dtype)

In [None]:
ex3 = np.array([11, 21], dtype=np.int64) # You can also tell NumPy the data type
print(ex3.dtype)

In [None]:
# you can use this to force floats into integers (the fractional part is truncated)
ex4 = np.array([11.1,12.7], dtype=np.int64)
print(ex4.dtype)
print()
print(ex4)

In [None]:
# you can use this to force integers into floats if you anticipate
# the values may change to floats later
ex5 = np.array([11, 21], dtype=np.float64)
print(ex5.dtype)
print()
print(ex5)
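
For an array that already exists, the `astype` method performs the same kind of conversion. The sketch below also shows that converting floats to integers truncates rather than rounds, so rounding has to be requested explicitly.

```python
import numpy as np

floats = np.array([11.1, 12.7])

truncated = floats.astype(np.int64)          # drops the fractional part
rounded = np.rint(floats).astype(np.int64)   # round to nearest first, then convert

print(truncated)   # [11 12]
print(rounded)     # [11 13]
print(floats)      # [11.1 12.7], astype returns a new array
```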

<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold"><br>

Arithmetic Array Operations:

</p>

In [79]:
import numpy as np

x = np.array([[111,112],[121,122]], dtype=int)  # np.int was removed in recent NumPy releases; the built-in int works
y = np.array([[211.1,212.1],[221.1,222.1]], dtype=np.float64)

print(x)
print()
print(y)

[[111 112]
 [121 122]]

[[211.1 212.1]
 [221.1 222.1]]


In [81]:
# add
%time print(x + y)         # The plus sign works
%time print()
%time print(np.add(x, y))  # so does the numpy function "add"


[[322.1 324.1]
 [342.1 344.1]]
CPU times: user 351 µs, sys: 72 µs, total: 423 µs
Wall time: 387 µs

CPU times: user 23 µs, sys: 4 µs, total: 27 µs
Wall time: 31.9 µs
[[322.1 324.1]
 [342.1 344.1]]
CPU times: user 259 µs, sys: 19 µs, total: 278 µs
Wall time: 270 µs


In [82]:
# subtract
print(x - y)
print()
print(np.subtract(x, y))

[[-100.1 -100.1]
 [-100.1 -100.1]]

[[-100.1 -100.1]
 [-100.1 -100.1]]


In [83]:
# multiply
print(x * y)
print()
print(np.multiply(x, y))

[[23432.1 23755.2]
 [26753.1 27096.2]]

[[23432.1 23755.2]
 [26753.1 27096.2]]


In [None]:
# divide
print(x / y)
print()
print(np.divide(x, y))

In [84]:
# square root
print(np.sqrt(x))

[[10.53565375 10.58300524]
 [11.         11.04536102]]


<p style="font-family: Arial; font-size:2.75em;color:purple; font-style:bold"><br>

Statistical Methods, Sorting, and <br> <br> Set Operations:
<br><br>
</p>

<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold"><br>

Basic Statistical Operations:
</p>

In [85]:
# set up a random 2 x 5 matrix
arr = 10 * np.random.randn(2,5)
print(arr)

[[ -3.41979117   3.54660055  -3.6596343   -3.4243333    4.29302948]
 [ -5.05863632 -10.83581804  -8.39613719   6.48175457  11.18603826]]


In [86]:
# compute the mean for all elements
print(arr.mean())

-0.9286927451830038


In [87]:
# compute the means by row
print(arr.mean(axis = 1))

[-0.53282575 -1.32455974]


In [88]:
# compute the means by column
print(arr.mean(axis = 0))

[-4.23921374 -3.64460874 -6.02788575  1.52871063  7.73953387]


In [None]:
# sum all the elements
print(arr.sum())
# sum the elements of each column
print(arr.sum(axis = 0))

In [None]:
# compute the median of each row
print(np.median(arr, axis = 1))
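
Because the matrix above is random, the effect of the `axis` argument can be hard to verify by eye. The sketch below repeats the same calls on a small fixed array where the results are easy to check by hand.

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

print(m.sum())          # 21, all elements
print(m.sum(axis=0))    # [5 7 9], one value per column
print(m.sum(axis=1))    # [ 6 15], one value per row
print(m.mean(axis=0))   # [2.5 3.5 4.5]
print(m.mean(axis=1))   # [2. 5.]
```

A useful way to remember it: the axis you pass is the one that gets collapsed.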

<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold"><br>

Sorting:
</p>


In [None]:
# create a 10 element array of randoms
unsorted = np.random.randn(10)

print(unsorted)

In [None]:
# create a copy and sort it (named sorted_arr to avoid shadowing the built-in sorted)
sorted_arr = np.array(unsorted)
sorted_arr.sort()

print(sorted_arr)
print()
print(unsorted)

In [None]:
# inplace sorting
unsorted.sort() 

print(unsorted)
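
Two related tools are worth a brief sketch: `np.sort` returns a sorted copy without touching the input, and `np.argsort` returns the indices that would sort the array. The values below are arbitrary.

```python
import numpy as np

values = np.array([30, 10, 20])

sorted_copy = np.sort(values)   # new sorted array; values is unchanged
order = np.argsort(values)      # indices that would sort values

print(sorted_copy)     # [10 20 30]
print(order)           # [1 2 0]
print(values[order])   # [10 20 30]
print(values)          # [30 10 20]
```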

<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold"><br>

Finding Unique elements:
</p>

In [None]:
array = np.array([1,2,1,4,2,1,4,2])
#array.shape
print(np.unique(array))
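
`np.unique` can also report how often each value occurs when called with `return_counts=True`. A short sketch on the same array:

```python
import numpy as np

array = np.array([1, 2, 1, 4, 2, 1, 4, 2])

values, counts = np.unique(array, return_counts=True)

print(values)   # [1 2 4]
print(counts)   # [3 3 2]
```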

<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold"><br>

Set Operations with np.array data type:
</p>

In [None]:
s1 = np.array(['desk','chair','bulb'])
s2 = np.array(['lamp','bulb','chair'])
print(s1, s2)

In [None]:
print( np.intersect1d(s1, s2) ) 

In [None]:
print( np.union1d(s1, s2) )

In [None]:
print( np.setdiff1d(s1, s2) )# elements in s1 that are not in s2
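
Two further set-style helpers, sketched on the same arrays: `np.isin` tests membership element by element, and `np.setxor1d` returns the elements found in exactly one of the two arrays.

```python
import numpy as np

s1 = np.array(['desk', 'chair', 'bulb'])
s2 = np.array(['lamp', 'bulb', 'chair'])

mask = np.isin(s1, s2)          # True where an element of s1 also appears in s2
only_one = np.setxor1d(s1, s2)  # elements in exactly one of the arrays, sorted

print(mask)       # [False  True  True]
print(only_one)   # ['desk' 'lamp']
```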

<p style="font-family: Arial; font-size:2.75em;color:purple; font-style:bold"><br>

Broadcasting:
<br><br>
</p>

Arrays with different shapes cannot, in general, be combined directly in element-wise arithmetic such as addition or subtraction.

One way around this is to conceptually duplicate the smaller array so that it has the same dimensionality and size as the larger array. This is called array broadcasting. NumPy applies it automatically when performing array arithmetic, which can greatly reduce and simplify your code.



This section gives a brief introduction to broadcasting. <br>
For more details, please see: <br>
https://docs.scipy.org/doc/numpy-1.10.1/user/basics.broadcasting.html
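
As a quick sketch of which shapes are compatible, the snippet below adds a row-shaped and a column-shaped array to a 4 x 3 array, and then shows that an incompatible shape raises a `ValueError`. The variable names are illustrative only.

```python
import numpy as np

a = np.zeros((4, 3))
row = np.array([1, 0, 2])             # shape (3,)   matches the last dimension of a
col = np.array([[0], [1], [2], [3]])  # shape (4, 1) matches the first dimension of a

print((a + row).shape)   # (4, 3)
print((a + col).shape)   # (4, 3)

try:
    a + np.array([1, 2])  # shape (2,): 2 does not match 3 and neither is 1
    failed = False
except ValueError as err:
    failed = True
    print("Cannot broadcast:", err)
```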

In [None]:
import numpy as np
a = np.array([1, 2, 3])
print(a)
b = 2
print(b)
c = a + b
print(c)

In [None]:
import numpy as np
start = np.zeros((4,3))
print(start)
print()
# create a rank 1 ndarray with 3 values
add_rows = np.array([1, 0, 2])
print(add_rows)

In [None]:
y = start + add_rows  # add to each row of 'start' using broadcasting
print(y)

In [None]:
# create an ndarray which is 4 x 1 to broadcast across columns
add_cols = np.array([[0,1,2,3]])
add_cols = add_cols.T

print(add_cols)

In [None]:
# add to each column of 'start' using broadcasting
y = start + add_cols 
print(y)

In [None]:
# this will just broadcast in both dimensions
start = np.zeros((2,2))
add_scalar = np.array([1])  
print(start)
print(add_scalar)
print(start+add_scalar)

In [None]:
# Broadcasting rules:
#   • Dimensions are compared starting from the last dimension of each array.
#   • If the dimension lengths are equal, or one of them is 1, then we keep going.
#   • If the dimension lengths aren't equal, and neither is 1, then there's an error.

import numpy as np
A = np.array([[1, 2, 3], [1, 2, 3]])
print(A.shape)
# Try each shape for b by uncommenting one line at a time:
#b = np.array([[1, 2], [1, 2]])  # shape (2, 2): error
#b = np.array([1, 2])            # shape (2,): error
b = np.array([1, 2, 1])          # shape (3,): works
#b = np.array([1])               # shape (1,): works
#b = np.array([[1], [2]])        # shape (2, 1): works
print(b.shape)
C = A + b
print(C)


<p style="font-family: Arial; font-size:2.75em;color:purple; font-style:bold"><br>

Speedtest: ndarrays vs lists
<br><br>
</p>

First, set up the parameters for the speed test. We'll be timing how long it takes to sum the elements of an ndarray versus a list.

In [None]:
from numpy import arange
from timeit import Timer

size    = 1000000
timeits = 1000

In [None]:
# create the ndarray with values 0,1,2...,size-1
nd_array = arange(size)
print( type(nd_array) )

In [None]:
# timer expects the operation as a parameter, 
# here we pass nd_array.sum()
timer_numpy = Timer("nd_array.sum()", "from __main__ import nd_array")

print("Time taken by numpy ndarray: %f seconds" % 
      (timer_numpy.timeit(timeits)/timeits))

In [None]:
# create the list with values 0,1,2...,size-1
a_list = list(range(size))
print (type(a_list) )

In [None]:
# timer expects the operation as a parameter, here we pass sum(a_list)
timer_list = Timer("sum(a_list)", "from __main__ import a_list")

print("Time taken by list:  %f seconds" % 
      (timer_list.timeit(timeits)/timeits))

<p style="font-family: Arial; font-size:2.75em;color:purple; font-style:bold"><br>

Read or Write to Disk:
<br><br>
</p>

<p style="font-family: Arial; font-size:1.3em;color:#2462C0; font-style:bold"><br>

Binary Format:</p>

In [89]:
x = np.array([ 23.23, 24.24] )

In [90]:
np.save('an_array', x)

In [91]:
np.load('an_array.npy')

array([23.23, 24.24])

<p style="font-family: Arial; font-size:1.3em;color:#2462C0; font-style:bold"><br>

Text Format:</p>

In [92]:
np.savetxt('array.txt', X=x, delimiter=',')

In [93]:
!cat array.txt

2.323000000000000043e+01
2.423999999999999844e+01


In [94]:
np.loadtxt('array.txt', delimiter=',')

array([23.23, 24.24])
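
When several arrays belong together, `np.savez` can store them in a single `.npz` file under chosen names. The sketch below writes to a temporary directory so it leaves nothing behind; the file name and the second array `y` are made up for illustration.

```python
import os
import tempfile

import numpy as np

x = np.array([23.23, 24.24])
y = np.arange(3)   # hypothetical second array, just for the example

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "arrays.npz")
    np.savez(path, x=x, y=y)        # keyword names become the keys in the file

    with np.load(path) as data:     # the loaded object behaves like a dictionary
        x_loaded = data["x"]
        y_loaded = data["y"]

print(x_loaded)   # [23.23 24.24]
print(y_loaded)   # [0 1 2]
```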

<p style="font-family: Arial; font-size:2.75em;color:purple; font-style:bold"><br>

Additional Common ndarray Operations
<br><br></p>

<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold"><br>

Dot Product on Matrices

</p>


![image.png](attachment:image.png)

In [100]:
# determine the dot product of two matrices


x2d = np.array([[1,1],[1,1]])
y2d = np.array([[2,2],[2,2]])
print(x2d.dot(y2d))
print()
print(np.dot(x2d, y2d))

[[4 4]
 [4 4]]

[[4 4]
 [4 4]]
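
With all-ones and all-twos matrices it is hard to see what the dot product actually does. The sketch below uses two matrices with distinct entries, shows the `@` operator as an equivalent spelling of `np.dot` for 2-D arrays, and contrasts it with element-wise multiplication.

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
b = np.array([[5, 6],
              [7, 8]])

matrix_product = a @ b   # same result as np.dot(a, b) for 2-D arrays
elementwise = a * b      # multiplies matching entries only

print(matrix_product)
# [[19 22]
#  [43 50]]
print(elementwise)
# [[ 5 12]
#  [21 32]]
```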


<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold"><br>

Sum:
</p>

In [101]:
# sum elements in the array
ex1 = np.array([[11,12],[21,22]])

print(np.sum(ex1))          # add all members

66


In [102]:
print(np.sum(ex1, axis=0))  # columnwise sum

[32 34]


In [103]:
print(np.sum(ex1, axis=1))  # rowwise sum

[23 43]


<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold"><br>

Transpose:

</p>

In [104]:
# transpose
ex1 = np.array([[11,12],[21,22]])

ex1.T

array([[11, 21],
       [12, 22]])
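
On a non-square array the transpose also swaps the shape, which a 2 x 2 example cannot show. The sketch below additionally uses `np.shares_memory` to illustrate that `.T` is a view of the original data rather than a copy.

```python
import numpy as np

m = np.arange(6).reshape(2, 3)
# [[0 1 2]
#  [3 4 5]]

t = m.T
print(t.shape)   # (3, 2)
print(t)
# [[0 3]
#  [1 4]
#  [2 5]]

print(np.shares_memory(m, t))   # True, the transpose is a view
```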