Numpy is the fundamental package for _numeric computing_ with Python. 
   
It provides powerful ways to create, store, and manipulate data, and it integrates quickly and seamlessly with a wide variety of databases. 

This is also the foundation that _Pandas_ is built on, which is a high-performance data-centric package that we will learn later in the course.

In this lecture, we will talk about 

- creating arrays with certain data types, 

- manipulating arrays, 

- selecting elements from arrays, 

- and loading a dataset into an array. 


Such functions are useful for manipulating data and understanding the functionalities of other common Python data packages.

In [None]:
# You'll recall that we import a library using the `import` keyword.
# numpy's common abbreviation is np
import numpy as np
import math

# Array Creation

In [None]:
# Arrays are displayed as a list or list of lists and 
# can be created from lists as well. 
# To create an array, we pass a list as the argument to np.array()

a = np.array([1, 2, 3])
print(a)

# We can print the number of dimensions of an array using the ndim attribute
print(a.ndim)
print(a.shape) 

# Note that dimension and shape are different: ndim is the number of axes,
# while shape gives the length of each axis.

In [None]:
# If we pass a list of lists to np.array(), 
# we create a multi-dimensional array, 
# for instance, a matrix
b = np.array([ [1,2,3] , [4,5,6] ])
b

In [None]:
a.shape

We can print out the length of each dimension by calling the shape attribute, which returns a tuple.


In [None]:
print("b.ndim", b.ndim)
print("b.shape",b.shape)

In [None]:
# We can also check the type of items in the array
a.dtype

In [None]:
b.dtype

In [None]:
c = np.array([[1.3,2.12,3],[4,5,6]])
print(c.ndim,c.shape)
print(c.dtype)

In [None]:
d = np.array([[1.3,2.12,"a"],[4,5,6]])
d.dtype

In [None]:
# Besides integers, floats are also accepted in numpy arrays
c = np.array([2.2, 5, 1.1])
c.dtype.name


In [None]:
# Let's look at the data in our array
c

In [None]:
# Note that numpy automatically converts integers, like 5, up to floats, 
# since there is no loss of precision.
# Numpy will try to give you the best data type format possible
# to keep your data types homogeneous, which
# means all the same, in the array
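
The same rule explains the earlier array that mixed numbers with the string "a". As a small sketch (the variable names are illustrative and not part of the lecture), NumPy picks a single type that can hold every element:

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed = np.array([1, 2.5, 3])        # one float turns every element into a float
with_text = np.array([1, 2.5, "a"])  # one string turns every element into a string

print(ints.dtype.kind)       # i (signed integer)
print(mixed.dtype.kind)      # f (floating point)
print(with_text.dtype.kind)  # U (Unicode string)
print(with_text)             # every element is now text
```

Once an array holds strings, arithmetic on it no longer works, so it is worth checking `dtype` after building an array from mixed input.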

Sometimes we know the shape of an array that we want to create,
but are not sure what we want to put in it. 
numpy offers several functions to create arrays 
with initial placeholders, such as zeros or ones.
Let's create two arrays, both the same shape but with different filler values.


In [None]:

d = np.zeros((2,3))
print("d",d)

e = np.ones((2,3))
print("e", e)

In [None]:
# We can also generate an array with random numbers between 0 and 1. 
np.random.rand(2,3)

In [None]:
np.random.normal( size=(2, 4)) #standard normal with mean 0 and st. dev 1

You'll see zeros, ones, and rand used quite often to create example arrays, 
especially in Stack Overflow posts and other forums.

We can also create a sequence of numbers in an array 
with the arange() function. The first argument is the
starting bound (inclusive), the second argument is the ending bound (exclusive),
and the third argument is the difference between
consecutive numbers.


In [None]:

# Let's create an array of numbers from 10 (inclusive) to 20 (exclusive) in steps of 2.32
f = np.arange(10, 20, 2.32)  # like the range function, but the step may be a float
f

If we want to generate a fixed number of evenly spaced floats, we can use the linspace() function. In this function the third
argument isn't the difference between two numbers, but the total number of items you want to generate.


In [None]:
f2 = np.linspace( 0, 2, 9 ) # 9 numbers from 0 (inclusive) to 2 (inclusive)
f2
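
To put the two functions side by side, here is a minimal sketch (the variable names `evens` and `points` are illustrative). `arange` takes a step and excludes the end value, while `linspace` takes a count and includes it:

```python
import numpy as np

# arange: start, stop (exclusive), step
evens = np.arange(10, 50, 2)

# linspace: start, stop (inclusive by default), number of points
points = np.linspace(0, 2, 9)

print(evens[:5], "...", evens[-1])   # the last value is 48; 50 is excluded
print(points)                        # 9 values spaced 0.25 apart, ending at 2.0
print(len(evens), len(points))       # 20 9
```

When the step is a float, `linspace` is usually the safer choice, because the number of elements is stated explicitly instead of depending on floating point rounding.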

In [None]:
# You can also create an array of random integers (low is inclusive, high is exclusive)
r2 = np.random.randint(2, 10, (3,4), dtype=int)
r2
# random.randint(low, high=None, size=None, dtype=int)

In [None]:
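# A 2-D array with shape (3, 4): 3 rows of 4 values
[      [3, 9, 5, 2],
       [3, 9, 5, 5],
       [6, 7, 4, 5]        ]


# A 3-D array with shape (2, 3, 4): 2 blocks, each with 3 rows of 4 values
[[      [9, 5, 4, 3],
        [4, 8, 2, 4],
        [7, 6, 7, 4]   ],

 [      [6, 4, 4, 7],
        [5, 8, 4, 7],
        [2, 7, 7, 5]   ]   ]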

In [None]:
r3 = np.random.randint(2, 10, (2,3,4), dtype=int)
r3

In [None]:
r4= np.random.randint(2, 10, (2,3,4,5), dtype=int)
r4

In [None]:
print(r4.shape)
print(r4.ndim)

In [None]:
x = np.array([2,3,4])
print("dim",x.ndim)
print("shape",x.shape)



In [None]:
x = np.array([ [2,3,4] ])
print("dim",x.ndim)
print("shape",x.shape)

In [None]:
x = np.array([ [2,3,4], [2,3,4] ])
print("dim",x.ndim)
print("shape",x.shape)

In [None]:
x = np.array([ [2], [3] , [4] ])
print("dim",x.ndim)
print("shape",x.shape)

In [None]:
[ [2], 
  [3], 
  [4] ]

# Array Operations

We can do many things with arrays, such as mathematical manipulation (addition, subtraction, square, exponents), as well as use boolean arrays, which hold True/False values. 

We can also do matrix manipulation such as product, transpose, inverse, and so forth.

In [None]:
# Arithmetic operators on arrays apply elementwise.

# Let's create a couple of arrays
a = np.array([10,20,30,40])
b = np.array([1, 2, 3,4])

# Now let's look at a minus b
c = a-b
print("c", c)

# And let's look at a times b
d = a*b
print("d",d)

With arithmetic manipulation, we can convert current data 
to the way we want it to be.
Here's a real-world problem: 
in Canada people use Celsius for temperatures, while the US system uses Fahrenheit. 
With numpy we could easily convert a number of Fahrenheit values, 
say the weather forecast, to Celsius.

In [None]:


# Let's create an array of typical Ann Arbor winter Fahrenheit values
farenheit = np.array([5,-13,-22,-31,14])

# And the formula for conversion is ((°F − 32) × 5/9 = °C)
celcius = (farenheit - 32) * (5/9)
celcius

Another useful and important manipulation is the boolean array. 
We can apply a comparison operator to an array, and a
boolean array will be returned with one value for each element in the original, 
with True being emitted if the element meets the condition and False otherwise.
For instance, we can get a boolean array that checks which Celsius
temperatures are greater than -20 degrees.

In [None]:
# 
celcius > -20

Here's another example: we could use the modulus operator 
to check numbers in an array to see if they are even. 
Recall that modulus does division but throws away 
everything but the remainder, so 7 % 2 is 1 and 8 % 2 is 0.

In [None]:

celcius%2 == 0
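
To make the remainder idea concrete with whole numbers, here is a small sketch (the values are made up for illustration):

```python
import numpy as np

values = np.array([4, 7, 10, 13])

remainders = values % 2      # remainder after dividing each element by 2
is_even = remainders == 0    # True where the remainder is zero

print(remainders)  # [0 1 0 1]
print(is_even)     # [ True False  True False]
```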

Besides elementwise manipulation, it is important to know that 
numpy supports matrix manipulation. 
Let's look at matrix product. 

If we want to do elementwise product, we use the "*" sign

In [None]:

A = np.array([[1,1],[0,1]])
B = np.array([[2,0],[3,4]])

print("elementwise product", A*B)

# if we want to do matrix product, we use the "@" sign or use the dot function
print("matrix product", A@B)

You don't have to worry about complex matrix operations for this course, 
but it's important to know that
numpy is the underpinning of scientific computing libraries in Python, 
and that it is capable of doing both element-wise operations (the asterisk) 
as well as matrix-level operations (the @ sign). 

A few more linear algebra concepts are worth layering in here. 

You might recall that the product of two matrices is only defined when the _inner_ dimensions of the two matrices are the same: an m×n matrix can be multiplied by an n×p matrix, and the result is m×p. 

The dimensions refer to the number of elements both horizontally and vertically in the rendered matrices you've seen here. 

We can use numpy to quickly see the shape of a matrix:

In [None]:

A.shape
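
The inner-dimension rule can be checked directly from the shapes. The following is a minimal sketch with two made-up matrices, `M` and `N`, that are not part of the lecture:

```python
import numpy as np

M = np.ones((2, 3))   # 2 rows, 3 columns
N = np.ones((3, 4))   # 3 rows, 4 columns

# Inner dimensions match (3 and 3), so the product is defined
# and takes the outer dimensions: (2, 4)
product = M @ N
print(product.shape)

# Reversing the order gives inner dimensions 4 and 2, which do not match
try:
    N @ M
except ValueError as err:
    print("cannot multiply:", err)
```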

When manipulating arrays of different types, 
the type of the resulting array will correspond to 
the more __general__ of the two types. This is called _upcasting_.


In [None]:

# Let's create an array of integers
array1 = np.array([[1, 2, 3], [4, 5, 6]])
print(array1.dtype)

# Now let's create an array of floats
array2 = np.array([[7.1, 8.2, 9.1], [10.4, 11.2, 12.3]])
print(array2.dtype)

Integers (int) are whole numbers only, and floating point numbers (float) can have a whole number portion
and a decimal portion. The 64 in this example refers to the number of bits
used to store each number, which determines the range (or precision) of the numbers that can be
represented.

In [None]:
# Let's do an addition for the two arrays
array3=array1+array2
print(array3)
print(array3.dtype)

Notice how the items in the resulting array have been upcast into floating point numbers.
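
If you want to know the resulting type before doing the arithmetic, `np.result_type` reports it. A short sketch, with illustrative array names:

```python
import numpy as np

ints = np.array([1, 2, 3], dtype=np.int64)
floats = np.array([0.5, 0.5, 0.5], dtype=np.float64)

# np.result_type reports the type NumPy would pick when combining the two
print(np.result_type(ints, floats))   # float64

total = ints + floats
print(total, total.dtype)
```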

Numpy arrays have many interesting aggregation functions on them, such as  sum(), max(), min(), and mean()


In [None]:
print("array3 itself", array3)
print("sum", array3.sum())
print("max", array3.max())
print("min", array3.min())
print("mean",array3.mean())

For two-dimensional arrays, we can do the same thing for each row or column.
Let's create an array with 15 elements, ranging from 1 to 15, 
with a shape of 3x5.
We will use reshape().

In [None]:

b = np.arange(1,16,1) # works like the range() that we used in for loops
print(b)

In [None]:
b1 = b.reshape(3,5)
print(b.shape)

print(b1.shape)
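
With a 3x5 array in hand, per-row and per-column aggregation is done with the `axis` argument of the aggregation methods. A minimal sketch, using a separate array named `grid` so it does not interfere with the cells above:

```python
import numpy as np

grid = np.arange(1, 16).reshape(3, 5)   # 3 rows, 5 columns, values 1..15
print(grid)

# axis=0 collapses the rows, leaving one result per column
print("column sums:", grid.sum(axis=0))   # [18 21 24 27 30]

# axis=1 collapses the columns, leaving one result per row
print("row sums:   ", grid.sum(axis=1))   # [15 40 65]
print("row means:  ", grid.mean(axis=1))  # [ 3.  8. 13.]
```

A handy way to remember it: the axis you name is the one that disappears from the shape of the result.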


In [None]:
b.reshape(15)


In [None]:
b.reshape(15).shape


In [None]:
b2 = b.reshape(15,1)
b2

In [None]:
b.reshape(15,1).shape

Now, we often think about two dimensional arrays being made up of rows and columns, but you can also think
of these arrays as just a giant ordered list of numbers, and the *shape* of the array, the number of rows
and columns, is just an abstraction that we have for a particular purpose. Actually, this is exactly how
basic images are stored in computer environments.


In [None]:
x = np.random.randint(1, 20, (2,2), dtype=int)
#y = np.random.randint(1, 20, (2,2), dtype=int)
print(x)


In [None]:
np.reshape(x,(1,4))

In [None]:
x11 = x.reshape(1,4)
x11

In [None]:
x12 = x.reshape(4) # 1 dimension of size 4
x12

In [None]:
x2 = x.reshape(4,1) # 2 dimensions of size 4 and 1
x2

In [None]:
x11-x2 # elementwise subtraction with broadcasting
# x11 => (1,4)   x2 => (4,1)
# each size-1 dimension is stretched to match the other array, so the result is (4,4)
# if x11 had shape (4,1) as well, the result would be (4,1)

In [None]:
x12-x2 # x12 has shape (4,) and x2 has shape (4,1), so broadcasting again gives (4,4)
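
The behavior in the last two cells is called broadcasting. Because `x` above is random, here is a sketch with fixed values (the names `row`, `col`, and `flat` are illustrative) so the result can be followed by hand:

```python
import numpy as np

x = np.array([[1, 2], [3, 4]])
row = x.reshape(1, 4)     # shape (1, 4): [[1 2 3 4]]
col = x.reshape(4, 1)     # shape (4, 1)
flat = x.reshape(4)       # shape (4,)

# (1, 4) minus (4, 1): each size-1 dimension is stretched, giving (4, 4)
diff = row - col
print(diff.shape)
print(diff)   # diff[i, j] == row[0, j] - col[i, 0]

# A 1-D array of shape (4,) is treated like (1, 4), so the result is also (4, 4)
print((flat - col).shape)

# Two arrays with the same shape subtract element by element, no stretching
print((col - col).shape)
```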

# Indexing, Slicing and Iterating

Indexing, slicing and iterating are extremely important for data manipulation and analysis because these
techniques allow us to select data based on conditions, and copy or update data.

## Indexing

First we are going to look at integer indexing. A one-dimensional array works in a similar way to a list:
to get an element in a one-dimensional array, we simply use the offset index.


In [None]:
a = np.array([1,3,5,7])
a[2]

For a multidimensional array, we need to use integer array indexing. Let's create a new multidimensional array.


In [None]:
a = np.array([[1,2], [3, 4], [5, 6]])
a

If we want to select one particular element, we can do so by entering the index, which consists of two
integers: the first being the row, and the second the column.

In [None]:

a[2,0] # remember in python we start at 0!

In [None]:
# if we want to get multiple elements 
# for example, 1, 4, and 6 and put them into a one-dimensional array
# we can enter the indices directly into an array function
np.array([a[0, 0], a[1, 1], a[2, 1]])

We can also do that by using another form of array indexing, which essentially "zips" the first list and the
second list up:
the first list holds the row indices and the second list holds the column indices, and they are paired position by position.


In [None]:
print(a[ [0, 1, 2,1,0,1,2], [0, 1, 1,1,1,0,1]  ] )
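
The "zipping" can be written out explicitly. This sketch (the names `rows`, `cols`, `picked`, and `by_hand` are illustrative) selects 1, 4, and 6 both ways:

```python
import numpy as np

a = np.array([[1, 2], [3, 4], [5, 6]])

rows = [0, 1, 2]
cols = [0, 1, 1]

# Fancy indexing pairs the lists position by position: (0,0), (1,1), (2,1)
picked = a[rows, cols]
print(picked)   # [1 4 6]

# The same selection written out with zip, for comparison
by_hand = np.array([a[r, c] for r, c in zip(rows, cols)])
print(by_hand)  # [1 4 6]
```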

Further practice exercises:

- https://favtutor.com/blogs/numpy-exercises-python

- https://www.w3schools.com/python/numpy/exercise.asp

- https://www.geeksforgeeks.org/python-numpy-practice-exercises-questions-and-solutions/

## Boolean Indexing

In [None]:
# Boolean indexing allows us to select arbitrary elements based on conditions. For example, in the matrix we
# just talked about we want to find elements that are greater than or equal to 5, so we set up the condition a >= 5
print(a >=5)
# This returns a boolean array showing whether the value at the corresponding index is greater than or equal to 5

In [None]:
a

In [None]:
# We can then place this array of booleans like a mask over the original array to return a one-dimensional 
# array containing only the values where the mask is True.
a [a>=5] 
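
Masks can also be counted, combined, and used for assignment. A minimal sketch on the same small matrix (the names `mask`, `middle`, and `b` are illustrative):

```python
import numpy as np

a = np.array([[1, 2], [3, 4], [5, 6]])

mask = a >= 5
print(mask.sum())      # True counts as 1, so this counts the matches: 2

# Combine conditions with & (and) and | (or); the parentheses are required
middle = a[(a > 1) & (a < 6)]
print(middle)          # [2 3 4 5]

# A mask can also be used to assign to the selected elements
b = a.copy()
b[b % 2 == 0] = 0      # replace every even number with 0
print(b)
```

Python's `and` and `or` keywords do not work elementwise on arrays, which is why `&` and `|` are used here.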

In [None]:
# As we will see, this functionality is essential in the pandas toolkit which is the bulk of this course

## Slicing

In [None]:
# Slicing is a way to create a sub-array based on the original array. For one-dimensional arrays, slicing 
# works in similar ways to a list. To slice, we use the : sign. For instance, if we put :3 in the indexing
# brackets, we get elements from index 0 to index 3 (excluding index 3)
a = np.array([0,1,2,3,4,5])
print(a[:3])

In [None]:
# By putting 2:4 in the bracket, we get elements from index 2 to index 4 (excluding index 4)
print(a[2:4])

In [None]:
# For multi-dimensional arrays, it works similarly, lets see an example
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])
a

In [None]:
# First, if we put one argument in the brackets, for example a[:2], then we get all the elements from the 
# first row (index 0) and the second row (index 1)
a[ 0:2 ]

In [None]:
# If we add another argument to the array, for example a[:2, 1:3], we get the first two rows but then the
# second and third column values only
a[0:2, 1:3]

In [None]:
a[1:4, 0:2]



In [None]:
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])
a

In [None]:
# or try these 
a[ : , 2]

In [None]:
a[1]

In [None]:
a[1,:] # the same as the line above

In [None]:
a[:,1]

In [None]:
a[0:3,1]

In [None]:
a[:,0] # the first column, returned as a one-dimensional array

In [None]:
a[:,0:1] # the same values, but kept as a column with shape (3, 1)

In [None]:
a[:,-1] # the last column. Recall that a minus sign means "from the end", so -1 means the first column from the end.

In [None]:
#compare the following
print(a[0:2])

In [None]:
print(a[0:2][0])

In [None]:
print(a[0:2,0])

So, in multidimensional arrays, the first argument is for selecting rows, and the second argument is for selecting columns.
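
The difference between the last two comparisons comes down to whether the indices are applied one after another or together. A small sketch (the names `chained` and `combined` are illustrative):

```python
import numpy as np

a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

# Chained: take rows 0-1 first, then take row 0 of that result
chained = a[0:2][0]
print(chained)                 # [1 2 3 4]

# One index with a comma: rows 0-1 and column 0 at the same time
combined = a[0:2, 0]
print(combined)                # [1 5]
```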

### It is important to realize that a slice of an array is a __view__ into the same data.
### This behaves like passing by reference: no copy is made.
### So modifying the sub-array will consequently modify the original array.

In [None]:
a

Here I'll take a slice of the array and change its element at position [0, 0], which is 2, to 50. Then we can see that the value in the
original array, at position [0, 1], is changed to 50 as well.

In [None]:
sub_array = a[:2, 1:3] # this is a view onto that part of the original array a, not a copy
sub_array

In [None]:
print("sub array index [0,0] value before change:", sub_array[0,0])
sub_array[0,0] = 50
print("sub array index [0,0] value after change:", sub_array[0,0])
print("original array index [0,1] value after change:", a[0,1])

In [None]:
a

In [None]:
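L1 = [0, 1,2,3]
L2 = L1      # alias: L2 is another name for the same list
L2[0] = 'a'  # so this also changes L1
print(L1)
L3 = L1[:]   # cloning: slicing a list makes a new, independent list
L3[0] = 'b'  # L1 is not affected
L3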
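
Unlike a list slice, a NumPy slice does not clone the data. If an independent sub-array is needed, one option is the array's `copy()` method. The sketch below (the names `view` and `clone` are illustrative) uses `np.shares_memory` to show the difference:

```python
import numpy as np

a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

view = a[:2, 1:3]           # a slice: shares memory with a
clone = a[:2, 1:3].copy()   # an independent copy of the same values

print(np.shares_memory(a, view))    # True
print(np.shares_memory(a, clone))   # False

clone[0, 0] = 99    # does not touch a
print(a[0, 1])      # still 2

view[0, 0] = 50     # writes through to a
print(a[0, 1])      # now 50
```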

# Trying Numpy with Datasets

In [None]:
# Now that we have learned the essentials of Numpy let's use it on a couple of datasets

In [None]:
# Here we have a very popular dataset on wine quality, and we are going to only look at red wines. 
# The data fields include: 
# fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide,
# total sulfur dioxide, density, pH, sulphates, alcohol, quality

In [None]:
# To load a dataset in Numpy, we can use the genfromtxt() function. We can specify the data file name, the delimiter
# (which is optional but often used), and the number of rows to skip if we have a header row, hence it is 1 here

# The genfromtxt() function has an optional parameter called dtype for specifying the data type of each column.
# Without specifying the types, all values are cast to the same type,
# which defaults to float

wines = np.genfromtxt("datasets/winequality-red.csv", delimiter=";", skip_header=1)
wines
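
If the CSV file is not at hand, the same call can be tried on text held in memory, since genfromtxt() also accepts file-like objects. This is only a sketch: the three columns and the values below are made up for illustration and are not the real dataset.

```python
import io
import numpy as np

# Hypothetical three-column sample in the same layout as the wine file:
# a header row, then semicolon-separated numbers
sample = io.StringIO(
    '"fixed acidity";"pH";"quality"\n'
    "7.4;3.51;5\n"
    "7.8;3.20;5\n"
    "11.2;3.16;6\n"
)

small = np.genfromtxt(sample, delimiter=";", skip_header=1)

print(small.shape)   # (3, 3): three rows, three columns
print(small.dtype)   # float64: the integer quality column was cast to float
print(small[:, -1].mean())
```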

In [None]:
wines.shape

In [None]:
# Recall that we can use integer indexing to get a certain column or a row. 
# For example, if we want to select the fixed acidity column, which is the first column,
# we can do so by entering the index into the array.
# Also remember that for "multidimensional" arrays, the first argument refers to the row, and the second
# argument refers to the column, 
# and if we give a single integer for one of them then we'll get a one-dimensional array back.

# So all rows combined but only the first column from them would be
print("one integer 0 for slicing: ", wines[:, 0])
print(wines[:,0].shape)

In [None]:
# But if we wanted the same values while preserving that they sit in their own rows,
# we would write
print("0 to 1 for slicing: \n", wines[:, 0:1])
print( wines[:, 0:1].shape)

In [None]:
# This is another great example of how the shape of the data is an abstraction which we can layer
# intentionally on top of the data we are working with.

wines.shape

In [None]:
# If we want a range of columns in order, say columns 0 up to 3 (recall, this means the first, second, and
# third, since we start at zero and don't include the ending index value), we can do that too
wines[:, 0:3]

In [None]:
# What if we want several non-consecutive columns? We can place the indices of the columns that we want into
# an array and pass the array as the second argument. Here's an example
wines[:, [0,2,4]]

In [None]:
# We can also do some basic summarization of this dataset. For example, if we want to find out the average
# quality of red wine, we can select the quality column. We could do this in a couple of ways, but the most
# appropriate is to use the -1 value for the index, as negative numbers mean slicing from the back of the
# list. We can then call the aggregation functions on this data.
wines[:,-1].mean()

In [None]:
# Let's take a look at another dataset, this time on graduate school admissions. It has fields such as GRE
# score, TOEFL score, university rating, GPA, having research experience or not, and a chance of admission.
# With this dataset, we can do data manipulation and basic analysis to infer what conditions are associated
# with higher chance of admission. Let's take a look.

In [None]:
# We can specify data field names when using genfromtxt() to load CSV data. Also, we can have numpy try to
# infer the type of a column by setting the dtype parameter to None
graduate_admission = np.genfromtxt('datasets/Admission_Predict.csv', dtype=None, delimiter=',', skip_header=1,
                                   names=('Serial No','GRE Score', 'TOEFL Score', 'University Rating', 'SOP',
                                          'LOR','CGPA','Research', 'Chance of Admit'))
graduate_admission
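
What comes back from this call is a structured array: one dimension, with named fields in each record. The sketch below reproduces the idea on a few made-up rows (hypothetical values, not taken from the admissions file):

```python
import io
import numpy as np

# Hypothetical rows, made up for illustration only
sample = io.StringIO(
    "Serial No,GRE Score,CGPA\n"
    "1,320,9.0\n"
    "2,310,8.0\n"
    "3,300,7.5\n"
)

records = np.genfromtxt(sample, dtype=None, delimiter=",", skip_header=1,
                        names=("Serial No", "GRE Score", "CGPA"))

print(records.shape)         # (3,): one dimension, one record per row
print(records.dtype.names)   # spaces in the names are replaced by underscores
print(records["CGPA"])       # select a column by name
print(records[0])            # select a whole record by position
```

This is why the fields are later addressed as `'GRE_Score'` and `'Chance_of_Admit'` rather than by the names with spaces.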

In [None]:
# please compare the following
wines[0]


In [None]:
graduate_admission[0]




In [None]:
wines[0,0]


In [None]:
graduate_admission[0,0] # raises an IndexError: this array has only one dimension, so use graduate_admission[0][0]

In [None]:
graduate_admission.shape

In [None]:
# Notice that the resulting array is actually a one-dimensional array with 400 tuples
graduate_admission.shape

In [None]:
# We can retrieve a column from the array using the column's name. For example,
# let's get the CGPA column.
graduate_admission['CGPA']

In [None]:
graduate_admission['CGPA'][0:5]

In [None]:
# Since the GPA in the dataset ranges from 1 to 10, and in the US it's more common to use a scale of up to 4,
# a common task might be to convert the GPA by dividing by 10 and then multiplying by 4
graduate_admission['CGPA'] = graduate_admission['CGPA'] /10 *4
graduate_admission['CGPA'][0:20] #let's get 20 values

In [None]:
graduate_admission['Research'] == 1

In [87]:
# Recall boolean masking. We can use this to find the students whose converted CGPA is above 3.90 by
# creating a boolean mask and passing it to the array indexing operator
graduate_admission[graduate_admission['CGPA'] > 3.90]

array([( 25, 336, 119, 5, 4. , 3.5, 3.92 , 1, 0.97),
       ( 35, 331, 112, 5, 4. , 5. , 3.92 , 1, 0.94),
       ( 72, 336, 112, 5, 5. , 5. , 3.904, 1, 0.96),
       (131, 339, 114, 5, 4. , 4.5, 3.904, 1, 0.96),
       (144, 340, 120, 4, 4.5, 4. , 3.968, 1, 0.97),
       (149, 339, 116, 4, 4. , 3.5, 3.92 , 1, 0.96),
       (203, 340, 120, 5, 4.5, 4.5, 3.964, 1, 0.97),
       (204, 334, 120, 5, 4. , 5. , 3.948, 1, 0.97),
       (214, 333, 119, 5, 5. , 4.5, 3.912, 1, 0.96),
       (386, 335, 117, 5, 5. , 5. , 3.928, 1, 0.96)],
      dtype=[('Serial_No', '<i4'), ('GRE_Score', '<i4'), ('TOEFL_Score', '<i4'), ('University_Rating', '<i4'), ('SOP', '<f8'), ('LOR', '<f8'), ('CGPA', '<f8'), ('Research', '<i4'), ('Chance_of_Admit', '<f8')])

In [88]:
# Since we have the data field chance of admission, which ranges from 0 to 1, we can try to see if students
# with high chance of admission (>0.8) on average have higher GRE score than those with lower chance of
# admission (<0.4)

# So first we use boolean masking to pull out only those students we are interested in based on their chance
# of admission, then we pull out only their GRE scores, then we print the mean values.
print(graduate_admission[graduate_admission['Chance_of_Admit'] > 0.8]['GRE_Score'].mean())
print(graduate_admission[graduate_admission['Chance_of_Admit'] < 0.4]['GRE_Score'].mean())


328.7350427350427
302.2857142857143


In [89]:
# Take a moment to reflect here, do you understand what is happening in these calls?

# When we do the boolean masking we are left with an array with tuples in it still, and numpy holds underneath
# this a list of the columns we specified and their name and indexes
graduate_admission[graduate_admission['Chance_of_Admit'] > 0.8]

array([(  1, 337, 118, 4, 4.5, 4.5, 3.86 , 1, 0.92),
       (  6, 330, 115, 5, 4.5, 3. , 3.736, 1, 0.9 ),
       ( 12, 327, 111, 4, 4. , 4.5, 3.6  , 1, 0.84),
       ( 23, 328, 116, 5, 5. , 5. , 3.8  , 1, 0.94),
       ( 24, 334, 119, 5, 5. , 4.5, 3.88 , 1, 0.95),
       ( 25, 336, 119, 5, 4. , 3.5, 3.92 , 1, 0.97),
       ( 26, 340, 120, 5, 4.5, 4.5, 3.84 , 1, 0.94),
       ( 33, 338, 118, 4, 3. , 4.5, 3.76 , 1, 0.91),
       ( 34, 340, 114, 5, 4. , 4. , 3.84 , 1, 0.9 ),
       ( 35, 331, 112, 5, 4. , 5. , 3.92 , 1, 0.94),
       ( 36, 320, 110, 5, 5. , 5. , 3.68 , 1, 0.88),
       ( 44, 332, 117, 4, 4.5, 4. , 3.64 , 0, 0.87),
       ( 45, 326, 113, 5, 4.5, 4. , 3.76 , 1, 0.91),
       ( 46, 322, 110, 5, 5. , 4. , 3.64 , 1, 0.88),
       ( 47, 329, 114, 5, 4. , 5. , 3.72 , 1, 0.86),
       ( 48, 339, 119, 5, 4.5, 4. , 3.88 , 0, 0.89),
       ( 49, 321, 110, 3, 3.5, 5. , 3.54 , 1, 0.82),
       ( 71, 332, 118, 5, 5. , 5. , 3.856, 1, 0.94),
       ( 72, 336, 112, 5, 5. , 5. , 3.904, 1, 

In [90]:
# Let's also do this with GPA
print(graduate_admission[graduate_admission['Chance_of_Admit'] > 0.8]['CGPA'].mean())
print(graduate_admission[graduate_admission['Chance_of_Admit'] < 0.4]['CGPA'].mean())

3.7106666666666666
3.0222857142857142


In [None]:
# Hrm, well, I guess one could have expected this. The GPA and GRE for students who have a higher chance of
# being admitted, at least based on our cursory look here, seem to be higher.

So that's a bit of a whirlwind tour of numpy, the core scientific computing library in Python. Now, you're
going to see a lot more of this kind of discussion, as the library we'll be focusing on in this course is
pandas, which is built on top of numpy. Don't worry if it didn't all sink in the first time; we're going to
dig into most of these topics again with pandas. However, it's useful to know that many of the functions
and capabilities of numpy are available to you within pandas.