# Getting started with NumPy Array

>This is an accompanying notebook with the course __Data Analytics Using Python: Learning Python Functions__
><br>All the code snippets used in the section are available in this notebook for reference purposes.




## Introduction to NumPy Array, ndArray
NumPy (Numerical Python) is an open source Python library. It is the universal standard for working with numerical data in Python and it’s at the core of the scientific Python ecosystem.

### NumPy = Numerical + Python
    
Etymologically, NumPy is a portmanteau of ‘Numerical’ and ‘Python’. Numerical Python contains functions that can be used for all kinds of numerical operations in the data analysis process using Python. 
The NumPy application programming interface (API) is extensively used in Pandas, SciPy, Matplotlib, scikit-learn, and most other data science and data analytics Python packages. With respect to Python, you can consider APIs as the core functions, classes, and modules defined in the NumPy package.
    
### NumPy functionalities
    
There are many NumPy functionalities that include:
    
- multi-dimensional Array and matrix data structures
- an n-dimensional Array object of homogeneous data type – ndArray – and methods to operate on it quickly and efficiently
- standard mathematical functions for faster operations on the entire Array of data without the need for loops
- linear algebra capabilities, random number generators, and so on.

While NumPy itself doesn’t provide high-level data analysis functionality, understanding NumPy Arrays and how to use them will help you use other tools, like Pandas and Matplotlib, effectively and efficiently.

As with any Python package, you need to import NumPy before you can use it in a program.

    
#### Importing NumPy
Any time you want to use a library or a package in your code, you first need to make it accessible with the import statement. To start using NumPy and all of its functions, import the package using the following code:


<code>
##Importing NumPy package
import numpy as np
</code>

There is a widely followed convention in the Python world of using ‘np’ as the alias when importing NumPy. Technically, any other name can be used, but NumPy's own documentation and most other code use ‘np’.


In [None]:
## Importing NumPy package
import numpy as np

##Check the version of the package, and verify that it has been imported correctly
np.__version__

## NumPy data structure: ndArray

The fundamental data structure that NumPy provides is the ndArray (the `numpy.ndarray` class).

ndArray is a generic multidimensional container for homogeneous data; that is, all of the elements in the Array should be of the same data type.

It has attributes such as:
- shape, which returns a tuple indicating the size of each dimension
- dtype, which returns an object describing the data type of the Array elements.

Here are the code snippets and the outputs for the same:


In [None]:
#Creating a small one-dimensional array and checking its shape and dtype attributes
arr1 = np.array([1,2,3,4,5])
arr1, arr1.shape, arr1.dtype

In [None]:
#Formatted Print Outputs for the ndArray arr1
#Print the type of the variable arr1 -> as we all know this is ndArray.
#Print the array itself -> this would print the items stored in the array.
print("-------Printing the Type of the Variable arr1, and the contents of array-------")
print(type(arr1))
print(arr1)
print("-------Printing the Shape and DType property of the ndArray arr1-------")
print("Shape of the Array is: " + str(arr1.shape))
print("Dtype for the Array is: " + str(arr1.dtype))


### Creating ndArrays

You can create ndArrays using the np.array() function that the NumPy package provides. This function accepts a sequence-like object as an input and produces a NumPy Array containing the data that has been passed to the function.

The code snippet and output demonstrate creating a one-dimensional ndArray:


<code>
    l1 = [1,2,3,4,5,6]
    a1 = np.array(l1)
    a1
</code>

If a nested sequence is provided as the input, a multi-dimensional Array is created (dimension is related to the level of nesting in the data that has been passed). 

The code snippet demonstrates creating a two-dimensional ndArray:


<code>
    a2 = np.array([[1,2,3],
                   [4,5,6],
                   [7,8,9]])
    </code>
    
The attribute ‘ndim’ associated with every Array object tells us the number of dimensions it has.

The code snippet highlights the use of this attribute:

In [None]:
# Create ndArray by passing a list object to np.array()
l1 = [1,2,3,4,5,6]
a1 = np.array(l1)
a1

In [None]:
# Create multi-dimensional array by passing a nested list to np.array()
a2 = np.array([[1,2,3], [4,5,6], [7,8,9]])
a2

In [None]:
#Check the shape and ndim property of the arrays created in the previous step
print(a1.shape, a1.ndim)
print(a2.shape, a2.ndim)

You learned earlier that ndArrays have a dtype attribute that returns an object describing the data type of the Array elements. This means the np.array() function must either accept the dtype as an input or, if that input is optional, infer the data type from the data passed to it. 

That is exactly what happens: unless a dtype is explicitly specified, the np.array() function infers a suitable data type for the Array that it creates. 

The code snippet shows the dtype property of the Array objects that we created in the example above:

In [None]:
#Check the property dtype of the arrays created in the previous step
a1.dtype, a2.dtype
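
The inference can also be overridden. As a minimal sketch (the variable names `ints`, `as_float`, and `converted` are illustrative and not part of the course material), the `dtype` argument of `np.array()` and the `astype()` method set the element type explicitly:

```python
import numpy as np

# Let NumPy infer the dtype from the data (an integer type; the exact width is platform-dependent)
ints = np.array([1, 2, 3])

# Request a specific dtype at creation time
as_float = np.array([1, 2, 3], dtype=np.float64)

# Convert an existing array; astype() returns a new array and leaves the original unchanged
converted = ints.astype(np.float32)

print(ints.dtype, as_float.dtype, converted.dtype)
```

Specifying the dtype up front is mainly useful when the inferred type would be too narrow or too wide for later calculations.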

### ndArray with floating data
    
Let’s try to create a ndArray that has floating data, instead of integers. 

<code>
    a3=np.array([1.1,2.3,4.5,6])
</code>

And we can check the __dtype__ attribute value for a3, to confirm the dtype value inferred automatically.

In [None]:
# Creating ndArray with float values
a3=np.array([1.1,2.3,4.5,6])
a3

In [None]:
# Checking the inferred dtype
a3.dtype

We often need to create an ndArray filled with a default value, such as 0 or 1, and there are specific functions available for this purpose. The benefit of using these functions is that you don’t have to pass a data sequence containing all the 0s or 1s, which becomes very cumbersome for higher-dimensional data.

- __np.zeros()__ -> creates ndArray with values equal to 0
- __np.ones()__ -> creates ndArray with values equal to 1

Both these functions accept the shape of the Array as a parameter. The code snippet highlights the use of these functions:

In [None]:
#Creating one dimensional ndArray with all values equal to 0
n1 = np.zeros(5)
n1

In [None]:
n1.ndim, n1.shape

In [None]:
#Creating 3 dimensional array with all values equal to 1
n2 = np.ones((2,3,4))
n2

In [None]:
n2.ndim, n2.shape

There are many other functions available for creating n-dimensional Arrays. For a comprehensive reference guide, please refer to the official NumPy documentation:

Read: https://numpy.org/doc/stable/reference/routines.array-creation.html
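
As a brief, non-exhaustive sketch of a few of those other creation functions (the variable names are illustrative only):

```python
import numpy as np

# Evenly spaced integers: start (inclusive), stop (exclusive), step
evens = np.arange(0, 10, 2)

# A fixed number of evenly spaced values, with both endpoints included
points = np.linspace(0.0, 1.0, 5)

# An array of a given shape filled with a single value
sevens = np.full((2, 3), 7)

# A 3x3 identity matrix (ones on the diagonal, zeros elsewhere)
identity = np.eye(3)

print(evens)
print(points)
print(sevens)
print(identity)
```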


## NumPy: Essential operations

For the purposes of data analysis using Python, you need to understand the following essential operations that NumPy has to offer.

### Vectorization

NumPy Arrays provide vectorised mathematical operations and it’s beneficial to understand the concept of vectorisation. It has its roots in vector mathematics, but from a programming perspective it means that if data is stored in NumPy Arrays (vectors), it enables us to express the batch operations on the data without writing any loop. 

You simply express the mathematical operations as if you have scalar data types.

The code snippet explains the concept of vectorisation and shows that without using the loop functionality, we are able to perform addition on all the elements of ndArray using simple mathematical addition.



In [None]:
#Using NumPy Array we can do addition of all the data in NumPy Array
#element by element by simply specifying the + operator between two Arrays Operands
#If we are not using NumPy Array, then element wise operation will require
#loops to iterate through both arrays
a1 = np.array([[1,2,3], [4,5,6]])
a2 = np.array([[6,7,8], [9,10,11]])

In [None]:
a1, a1.shape, a1.ndim

In [None]:
a2, a2.shape, a2.ndim

In [None]:
a_sum = a1 + a2
a_sum, a_sum.shape, a_sum.ndim
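
To make the contrast with loops concrete, here is a small sketch that performs the same addition on plain nested lists and on NumPy Arrays. The names `x`, `y`, `loop_sum`, and `vector_sum` are illustrative and not part of the course material:

```python
import numpy as np

x = [[1, 2, 3], [4, 5, 6]]
y = [[6, 7, 8], [9, 10, 11]]

# Plain Python lists: element-wise addition needs explicit loops
# (x + y on lists would concatenate them instead of adding)
loop_sum = []
for row_x, row_y in zip(x, y):
    loop_sum.append([a + b for a, b in zip(row_x, row_y)])

# NumPy arrays: the same computation is a single expression
vector_sum = np.array(x) + np.array(y)

print(loop_sum)
print(vector_sum)
```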

### Indexing and slicing

We touched on indexing and slicing when we explored various foundational data structures provided within Python.

To recap:
- Indexing is used to access a particular element in the sequence. 
- Slicing is used to access subsets of a sequence (more than one element).

The notation used for both operations is [ ], with different index parameters passed inside the square brackets.

The same principle of indexing and slicing applies to NumPy Arrays as well, but with some subtle differences; for example, a NumPy Array accepts one index per dimension inside a single pair of square brackets. 

- Indexing: By providing the index in square brackets, you can access the value stored at that position.
- Slicing: By providing the slicing indices, you can access the subset of the data stored between those indices.

#### Indexing examples

The code snippet accesses the element in the first row and third column, and the element in the second row and first column.


In [None]:
a_sum

In [None]:
#Accessing element in the first row (index 0), and third column (index 2) i.e. 11
item1 = a_sum[0,2]

#Accessing element in the second row (index 1), and first column (index 0) i.e. 13
item2 = a_sum[1,0]


In [None]:
item1, item2
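
One of the differences from plain Python sequences is the index notation itself. The following sketch compares a nested list with the equivalent NumPy Array; the names `nested` and `grid` are illustrative only:

```python
import numpy as np

nested = [[7, 9, 11], [13, 15, 17]]
grid = np.array(nested)

# Nested lists need one pair of brackets per level of nesting
from_list = nested[0][2]

# NumPy arrays accept one index per dimension inside a single pair of brackets
from_array = grid[0, 2]

# Negative indices count from the end, as they do with lists
last_item = grid[-1, -1]

print(from_list, from_array, last_item)
```

The chained form `grid[0][2]` also works on an Array, but the comma-separated form is the usual NumPy idiom.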

#### Slicing examples 

The code snippet shows a two-dimensional ndArray created with three rows and four columns. By using the slicing operation we are accessing some subsets of the data:

- first two columns of all rows
    - slice parameter -> [:,0:2]
- first and third row only, and second and fourth columns only
    - slice parameter -> [0::2, 1::2]


In [None]:
n1 = np.array([[11,12,13,14], [15,16,17,18], [13,14,15,16]])
n1

In [None]:
n1.ndim, n1.shape

In [None]:
#So n1 is a two dimensional array with a shape of (3,4) i.e. 3 rows and 4 columns
#Let's collect the partial data or a subset from this array. Let say first two columns of all rows.
#Remember we specify slice as start:stop:step for every dimension
slice_data = n1[:,0:2]
slice_data

In [None]:
#Similarly if we want to only access first and third row, and for the selected rows,
#only second and fourth column, this slicing operation will be as follows:
slice_data1 = n1[0::2, 1::2]
slice_data1

### Broadcasting

It’s important to understand this characteristic of slicing NumPy Arrays. Array slices are views on the original Array; that is, a slicing operation does not copy the data. The slice is a new Array object that shares its underlying data with the original. 

We can see this in the slicing example in the previous code snippet – the variables slice_data and slice_data1 refer to data stored in the original NumPy Array. 


In [None]:
n1

This means that if we change the data through an Array slice, the original source data also changes. A related concept is broadcasting: the method that NumPy uses to allow arithmetic and assignment between Arrays of different shapes or sizes, including between an Array and a single scalar value. 

The code snippet demonstrates both behaviours. Here, we select the second and fourth columns of the first and third rows using a slice and assign the fixed value 50 to them. The scalar is broadcast to every selected element, and because the slice is a view, the original Array n1 changes.


In [None]:
# Assign the scalar 50 to the elements selected by the slice, i.e. the
# second and fourth columns of the first and third rows.
# The scalar is broadcast to every selected element
n1[0::2,1::2] = 50

In [None]:
n1
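
The view behaviour of slices can be checked directly. This is a minimal sketch, assuming the illustrative names `base`, `view`, and `independent`; `np.shares_memory()` reports whether two Arrays use the same underlying data, and `copy()` produces an independent Array when a view is not wanted:

```python
import numpy as np

base = np.array([[11, 12, 13, 14], [15, 16, 17, 18], [13, 14, 15, 16]])

view = base[:, 0:2]                 # basic slicing returns a view
independent = base[:, 0:2].copy()   # copy() allocates separate data

print(np.shares_memory(base, view))          # True
print(np.shares_memory(base, independent))   # False

# Writing through the view changes the original; writing to the copy does not
view[0, 0] = -1
independent[0, 1] = -2
print(base[0, 0], base[0, 1])
```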

Broadcasting is widely used in data analytics applications and the underlying programs so it’s a very important concept to master. It is used extensively in the data manipulations that we will conduct as part of data wrangling while creating data pipelines for data analytics applications in Module 5.
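
As an illustration of arithmetic between Arrays of different shapes, the sketch below adds a scalar, a one-dimensional Array, and a single-column Array to a 2 x 3 Array. The names `table`, `row`, and `col` are illustrative only:

```python
import numpy as np

table = np.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)
row = np.array([10, 20, 30])               # shape (3,)
col = np.array([[100], [200]])             # shape (2, 1)

# A scalar is stretched across every element
plus_one = table + 1

# A (3,) array is applied to each row of the (2, 3) array
plus_row = table + row

# A (2, 1) array is applied across the columns of each row
plus_col = table + col

print(plus_one)
print(plus_row)
print(plus_col)
```

Shapes that cannot be aligned this way, such as (2, 3) and (2,), raise a ValueError instead of being broadcast.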


## Boolean indexing and transposing Arrays

### Boolean indexing 

Combining all the concepts we have learnt so far (ie, vectorisation, indexing, and slicing), we can program some very complex and powerful filtering logic in a single Python statement.

One of these operations is called boolean indexing, and this is very widely used in data wrangling aspects of data analysis.
Let’s use an example scenario to explain:

- A sequence consisting of the names of all the customers, in order, whenever they have made a purchase (ie, a customer can appear more than once in the list).
- Another sequence consisting of the details of each transaction, in the same order as the customer names.
- Both sequences are stored in NumPy Arrays, and now we want to select the transactions for a particular customer.

We can solve the above scenario using a combination of vectorisation, indexing, and slicing.

The code snippets simulate this scenario with steps to follow.


#### Step 1: Create data
To apply the boolean filters, you need to have dummy data. Let us create one for this activity. 


In [None]:
# List of customers who have made transactions, stored in a NumPy array
customers = np.array(['Bob', 'John', 'Miller', 'Bob', 'Sammy', 'Samuel', 'Tony', 'Amanda', 'Bob'])
customers.ndim, customers.shape


In [None]:
# There are 9 customer entries in the customers array above.
# Create an array with 9 rows (one per customer entry) and 4 columns (assume 4 transaction attributes),
# filled with random integers from 1 to 99 using np.random.randint
data = np.random.randint(1,100, size=(9,4))
transactions = np.array(data)
transactions

#### Step 2: Apply comparison operator to create a boolean filter 

As operations on NumPy Arrays are vectorised, applying a comparison operator produces a boolean Array of True and False values, one for each element.

Let’s apply the comparison customers == 'Bob' and see what happens.

In [None]:
#Creating a Boolean Filter by applying vectorized comparison to customers array.
#This results in an array of 9 boolean values, with True at every position
#where 'Bob' exists in the customers array
customers == 'Bob'

#### Step 3: Pass boolean filter for the slicing of the Array

Here, we can pass this boolean filter as one of the index parameters in the indexing operation.

We want to select all the transaction rows related to customer ‘Bob’, and we want to select all the columns for those selected rows. So, we will pass the following parameters:
- boolean filter – for the row index
- :  – for the column index (ie, picking all columns)

The code snippet shows this behaviour, and you will see that rows 1, 4, and 9 are selected. To make the result easy to see, the value 1,000,000 is then assigned (broadcast) to these elements.

In [None]:
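# Boolean indexing returns a copy of the selected rows, not a view.
# The name bob_rows avoids shadowing Python's built-in slice
bob_rows = transactions[customers=='Bob', :]
bob_rows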

In [None]:
transactions[customers=='Bob', :] = 1000000
transactions

It should now be clear why boolean indexing is an efficient way of filtering rows in a NumPy Array based on selection criteria.
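
Boolean filters can also be combined. The sketch below uses a shortened customer list and a hypothetical `amounts` Array (not part of the course data) to show the element-wise operators `&` (and), `|` (or), and `~` (not):

```python
import numpy as np

customers = np.array(['Bob', 'John', 'Miller', 'Bob', 'Sammy'])
amounts = np.array([20, 75, 40, 90, 10])   # hypothetical purchase amounts

# Each comparison needs its own parentheses because & and | bind
# more tightly than == and >
bob_large = (customers == 'Bob') & (amounts > 50)
bob_or_john = (customers == 'Bob') | (customers == 'John')
not_bob = ~(customers == 'Bob')

print(amounts[bob_large])
print(amounts[bob_or_john])
print(amounts[not_bob])
```

Python's `and`, `or`, and `not` keywords do not work element-wise on Arrays, which is why the symbol operators are used here.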

### Transposing Arrays

Generally, transposing Arrays is a widely used internal operation in various machine learning and deep learning algorithms. Transposing operations on Arrays will convert rows into columns, and columns into rows.

Transposing is a special type of reshaping that returns a view on the underlying NumPy Array data without copying anything. For this purpose, there is a special attribute, T, available for every NumPy Array.

In [None]:
arr = np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12]])
arr

In [None]:
arr.T
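
The claim that T returns a view rather than a copy can be checked with a short sketch. The names `m` and `t` are illustrative only:

```python
import numpy as np

m = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
t = m.T

print(m.shape, t.shape)          # (3, 4) (4, 3)
print(np.shares_memory(m, t))    # True: no data was copied

# Element (i, j) of the original is element (j, i) of the transpose
print(m[0, 3], t[3, 0])

# Because t is a view, a write through it is visible in m
t[3, 0] = 40
print(m[0, 3])
```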

## NumPy Array functions

### Universal functions

The NumPy package provides various functions that perform fast operations on the elements of the ndArray, called universal functions. These functions can be considered as the fast vectorised wrapper functions – performing simple mathematical operations like square, square root, exponential, maximum, add, subtract, and so on.

Some of these functions only require one NumPy Array as a parameter and the function simply performs the operations. These are referred to as unary universal functions; for example, square, square root, and so on.

Similarly, universal functions that require two Arrays as parameters are referred to as binary universal functions; for example, maximum, add, and subtract.
The code snippets show examples of the various universal functions.

#### Example of Unary Array Functions

In [None]:
arr = np.array([2,3,5,6,7,8])
arr_square = np.square(arr)
arr, arr_square

In [None]:
arr_root = np.sqrt(arr_square)
arr_square, arr_root

#### Example of Binary Array Function

In [None]:
arr1 = np.array([12,15,12,1,2,3,4,66,23])
arr2 = np.array([10,4,5,11,22,4,5,6,89])

#Element wise maximum value
arr_maximum = np.maximum(arr1, arr2)
arr_maximum

For a comprehensive list of universal functions, please refer to the list from the NumPy manual.<br>
Read: https://numpy.org/doc/stable/reference/ufuncs.html#available-ufuncs
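
As a small sketch of how universal functions relate to the arithmetic operators (variable names are illustrative only), `np.add` gives the same result as the `+` operator on Arrays, and both `np.add` and `np.sqrt` are instances of `np.ufunc`:

```python
import numpy as np

a = np.array([1, 4, 9])
b = np.array([2, 2, 2])

# Universal functions are instances of np.ufunc
print(isinstance(np.add, np.ufunc), isinstance(np.sqrt, np.ufunc))

# Binary ufunc: element-wise addition, equivalent to a + b
added = np.add(a, b)
same_as_operator = a + b

# Unary ufunc: element-wise square root (the result is floating point)
roots = np.sqrt(a)

print(added, same_as_operator, roots)
```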


## Mathematical methods

There are sets of mathematical functions that perform certain computations, including statistics on the entire Array, or on all the data along a particular axis. These functions are available as Array methods.

Let’s explore some of these Array methods.


### Mean

This method will calculate the mean of all the data in the Array. If we pass the axis parameter, then the mean calculation happens along that axis.

The code snippet demonstrates the mean method:

In [None]:
arr1 = np.random.randn(4,5)
arr1

In [None]:
arr1.mean()

In [None]:
#Calculate the mean down the rows (axis=0), i.e. one mean for each of the 5 columns
arr1.mean(axis=0)

In [None]:
#Calculate the mean across the columns (axis=1), i.e. one mean for each of the 4 rows
arr1.mean(axis=1)

### Sum

This method will calculate the sum of all the data in the Array, or if we pass the axis parameter, summation will happen along that axis.

The code snippet demonstrates the sum method:

In [None]:
arr1 = np.array([[1,2,3,10],[4,5,6,11],[7,8,9,12]])
arr1

In [None]:
arr1.sum(axis=0)

In [None]:
arr1.sum(axis=1)

An easier way to understand the axis parameter is to think of it as the axis that is collapsed by the operation:

If we specify arr1.sum(axis=0), it means that in the result set the rows will be collapsed into one row (ie, summation has happened in every column). Likewise, arr1.sum(axis=1) collapses the columns into one column (ie, summation has happened in every row).
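
The collapsing effect is easiest to see in the shapes of the results. A minimal sketch, using the same values as the example above under the illustrative name `values`:

```python
import numpy as np

values = np.array([[1, 2, 3, 10], [4, 5, 6, 11], [7, 8, 9, 12]])  # shape (3, 4)

col_totals = values.sum(axis=0)   # axis 0 (rows) collapses: one total per column
row_totals = values.sum(axis=1)   # axis 1 (columns) collapses: one total per row
grand_total = values.sum()        # no axis: every element is summed

print(col_totals.shape, col_totals)
print(row_totals.shape, row_totals)
print(grand_total)
```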

There are some methods which won’t return the aggregated results, but rather the intermediate results depending on the operations performed along the axis specified. An example of such a method is the cumulative sum.

### CumSum

This method will perform the cumulative sum of the elements along the axis specified as the parameter. The following code snippet showcases an example.

In [None]:
arr1

In [None]:
arr1_0 = arr1.cumsum(axis=0)
arr1_0

In [None]:
arr1_1 = arr1.cumsum(axis=1)
arr1_1

Another way of looking at axis becomes clear in this example. 

When we specify axis = 0, the cumulative sum runs down the rows, within each column; that is:
- the first row keeps its original values
- each later row holds the running total of that row and all the rows above it.

When we specify axis = 1, the cumulative sum runs across the columns, within each row; that is: 

- the first column keeps its original values
- each later column holds the running total of that column and all the columns to its left.
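
One way to check this reading of the axis parameter is to compare the cumulative sum with the plain sum. In the sketch below (illustrative names only), the last row of the axis = 0 result equals `sum(axis=0)`, and the last column of the axis = 1 result equals `sum(axis=1)`:

```python
import numpy as np

values = np.array([[1, 2, 3, 10], [4, 5, 6, 11], [7, 8, 9, 12]])

down_rows = values.cumsum(axis=0)     # running total down each column
across_cols = values.cumsum(axis=1)   # running total across each row

print(down_rows)
print(across_cols)

# The final running totals match the aggregated sums along the same axis
print(down_rows[-1], values.sum(axis=0))
print(across_cols[:, -1], values.sum(axis=1))
```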


### Methods on Boolean Arrays


Some mathematical methods can be applied to Boolean Arrays as well; for example, sum(). In this case, it will return the count of True values in a Boolean Array.

The code snippet shows the sum method applied to Boolean Arrays.


In [None]:
bool_arr = np.array([True,False,False,True,True,True,False,True,False,True,True])
total_true = bool_arr.sum()
total_true
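
Other methods are useful on Boolean Arrays too. As a short sketch with the same values under the illustrative name `flags`, `any()` and `all()` test whether at least one or every value is True, and `mean()` gives the fraction of True values:

```python
import numpy as np

flags = np.array([True, False, False, True, True, True, False, True, False, True, True])

count_true = flags.sum()     # True counts as 1, False as 0
any_true = flags.any()       # is at least one value True?
all_true = flags.all()       # is every value True?
share_true = flags.mean()    # fraction of values that are True

print(count_true, any_true, all_true, share_true)
```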

In this section, we covered the basics of NumPy Arrays, but there is much more to n-dimensional Arrays. 

We will cover more of the NumPy Array features using application and coding exercises in the coming sections. 