Numpy Tutorial Part 2 – Vital Functions for Data Analysis

Agenda
0. Introduction
1. How to get index locations that satisfy a given condition using np.where?
2. How to import and export data as a csv file?
2.1 How to handle datasets that have both numbers and text columns?
3. How to save and load numpy objects?
4. How to concatenate two numpy arrays column-wise and row-wise?
5. How to sort a numpy array based on one or more columns?
5.1 How to sort a numpy array based on 1 column using argsort?
5.2 How to sort a numpy array based on 2 or more columns?
6. Working with dates
6.1 How to create a sequence of dates?
6.2 How to convert numpy.datetime64 to datetime.datetime object?
7. Advanced numpy functions
7.1 vectorize – Make a scalar function work on vectors
7.2 apply_along_axis – Apply a function column wise or row wise
7.3 searchsorted – Find the location to insert so the array will remain sorted
7.4 How to add a new axis to a numpy array?
7.5 More Useful Functions
8. What is missing in numpy?

How to get index locations that satisfy a given condition using np.where?
np.where locates the positions in the array where a given condition holds true.

In [1]:
import numpy as np

In [10]:
list1=([1,2,6,3],[9,8,2,0],[2,5,8,3])
arr=np.array(list1)
print("Array")
print(arr,"\n")
rs_arr=arr.reshape(4,3)
print("Reshaped Array")
print(rs_arr,"\n")
s_shape=arr.shape
print("Shape")
print(s_shape,"\n")
s_dim=arr.ndim
print("Dimensional")
print(s_dim,"\n")
row_arr,col_arr=np.where(arr>5)
print("Row Positions where value > 5")
print(row_arr,"\n")
print("Col Positions where value > 5")
print(col_arr,"\n")
val_get5=arr[row_arr,col_arr]
print("value > 5 from Row and col")
print(val_get5,"\n")

Array
[[1 2 6 3]
 [9 8 2 0]
 [2 5 8 3]] 

Reshaped Array
[[1 2 6]
 [3 9 8]
 [2 0 2]
 [5 8 3]] 

Shape
(3, 4) 

Dimensional
2 

Row Positions where value > 5
[0 1 1 2] 

Col Positions where value > 5
[2 0 1 2] 

value > 5 from Row and col
[6 9 8 8] 



Single dimensional array
Once you have the positions, you can extract the values at those positions using the array’s take method.

In [13]:
arr_rand = np.array([1,2,6,3,9,8,2,0,2,5,8,3])
print("Array: ", arr_rand)
index_gt5 = np.where(arr_rand > 5)
print("Positions where value > 5: ", index_gt5)
val_arr=arr_rand.take(index_gt5)
print("Values using take")
print(val_arr,"\n")

Array:  [1 2 6 3 9 8 2 0 2 5 8 3]
Positions where value > 5:  (array([ 2,  4,  5, 10]),)
Values using take
[[6 9 8 8]] 
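
The result above is two-dimensional because np.where returns a tuple of index arrays, and take treats that tuple as a nested index. As a small sketch on the same array, either of the two alternatives below should give a flat result instead (the names flat_take and flat_index are only illustrative):

```python
import numpy as np

arr_rand = np.array([1, 2, 6, 3, 9, 8, 2, 0, 2, 5, 8, 3])
index_gt5 = np.where(arr_rand > 5)   # tuple holding one index array

# Option 1: pass only the index array (first element of the tuple) to take
flat_take = arr_rand.take(index_gt5[0])

# Option 2: index the array directly with the tuple returned by np.where
flat_index = arr_rand[index_gt5]

print(flat_take)
print(flat_index)
```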



np.where also accepts two optional arguments, x and y. Wherever the condition is true, x is yielded, otherwise y.
You can also find the locations of the maximum and minimum values using np.argmax and np.argmin.

In [5]:
arr_rand = np.array([1,2,6,3,9,8,2,0,2,5,8,3])
print("Array: ", arr_rand,"\n")
new_arr=np.where(arr_rand > 5, 'gt5', 'le5')
print("New Array: ", new_arr,"\n")
print('Position of max value: ', np.argmax(arr_rand),"\n")  
print("Position of Min value",np.argmin(arr_rand),"\n")

Array:  [1 2 6 3 9 8 2 0 2 5 8 3] 

New Array:  ['le5' 'le5' 'gt5' 'le5' 'gt5' 'gt5' 'le5' 'le5' 'le5' 'le5' 'gt5' 'le5'] 

Position of max value:  4 

Position of Min value 7 



How to import and export data as a csv file?
A standard way to import datasets is to use the np.genfromtxt function. It can import datasets from web URLs and handle missing values, multiple delimiters, and an irregular number of columns.

    A less versatile alternative is np.loadtxt, which assumes the dataset has no missing values.
    By setting the filling_values argument of np.genfromtxt, you can replace the missing values with something else.
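
To see filling_values on its own without needing a network connection, here is a minimal sketch that reads a tiny in-memory CSV. The column names and numbers are made up for illustration and are not taken from the Auto dataset.

```python
import io
import numpy as np

# Hypothetical three-column CSV; the second row has an empty (missing) field
csv_text = "mpg,cyl,hp\n18,8,130\n15,,165\n"

data = np.genfromtxt(
    io.StringIO(csv_text),
    delimiter=",",
    skip_header=1,        # skip the header line
    filling_values=-999,  # value substituted for missing fields
)
print(data)
```

The empty field in the second row should come back as -999, while the remaining values are read as floats.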

In [6]:
# Turn off scientific notation
np.set_printoptions(suppress=True) 

In [11]:
path = 'https://raw.githubusercontent.com/selva86/datasets/master/Auto.csv'
data=np.genfromtxt(path,delimiter=',',skip_header=1,filling_values='-999',encoding=None,dtype='int')
arr_data=data[:3]
print(arr_data)

[[  18    8  307  130 3504   12   70    1 -999]
 [  15    8  350  165 3693 -999   70    1 -999]
 [  18    8  318  150 3436   11   70    1 -999]]


The last column has the same value, -999, in every row.
That happened because I specified `dtype='int'`. The last column in the file contains text values, and since all the values in a numpy array have to be of the same `dtype`, `np.genfromtxt` could not convert them to integers and substituted the filling value instead. The 11.5 in the second row became -999 for the same reason.

How to handle datasets that have both numbers and text columns?
To keep the text column intact, set the dtype to 'object' so that no value is forced into a number.
Finally, np.savetxt lets you export the array as a csv file.


In [14]:
path = 'https://raw.githubusercontent.com/selva86/datasets/master/Auto.csv'
data=np.genfromtxt(path,delimiter=',',skip_header=1,filling_values='-999',encoding=None,dtype='object')
arr_data=data[:3]
print(arr_data)
np.savetxt('ouput.csv',arr_data)

[[b'18' b'8' b'307' b'130' b'3504' b'12' b'70' b'1'
  b'"chevrolet chevelle malibu"']
 [b'15' b'8' b'350' b'165' b'3693' b'11.5' b'70' b'1'
  b'"buick skylark 320"']
 [b'18' b'8' b'318' b'150' b'3436' b'11' b'70' b'1'
  b'"plymouth satellite"']]


TypeError: Mismatch between array dtype ('object') and format specifier ('%.18e %.18e %.18e %.18e %.18e %.18e %.18e %.18e %.18e')
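
The TypeError above arises because np.savetxt defaults to a floating-point format ('%.18e'), which cannot be applied to an object array. One possible workaround is to pass fmt='%s' together with delimiter=','. The sketch below uses a small hand-made array rather than the Auto dataset. The rows, the file name and the temporary directory are assumptions for illustration.

```python
import os
import tempfile
import numpy as np

# Hypothetical mixed-type rows stored as an object array
rows = np.array(
    [[18, 8, "chevrolet chevelle malibu"],
     [15, 8, "buick skylark 320"]],
    dtype=object,
)

out_path = os.path.join(tempfile.mkdtemp(), "output.csv")

# fmt='%s' writes every value as its string form, so mixed types are accepted
np.savetxt(out_path, rows, delimiter=",", fmt="%s")

with open(out_path) as fh:
    saved_lines = fh.read().splitlines()
print(saved_lines)
```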

# Doubts :
1. Unable to load a file from the local system
2. How to save only a particular result, after the calculation, to a csv file on the system and inside the project

# Keep Pending
## How to save and load numpy objects?
At some point, we will want to save large transformed numpy arrays to disk and load them back into the console directly without having to re-run the data transformation code.
Numpy provides the .npy and .npz file types for this purpose.
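
As a minimal sketch of both file types: np.save writes a single array to a .npy file, and np.savez bundles several named arrays into one .npz file. The array contents, the file names and the keyword names first and second are made up for illustration.

```python
import os
import tempfile
import numpy as np

arr = np.arange(12).reshape(3, 4)
extra = np.array([1.5, 2.5])
tmp_dir = tempfile.mkdtemp()

# .npy: one array per file
npy_path = os.path.join(tmp_dir, "myarray.npy")
np.save(npy_path, arr)
loaded = np.load(npy_path)

# .npz: several arrays in one file, retrieved by the keyword used when saving
npz_path = os.path.join(tmp_dir, "arrays.npz")
np.savez(npz_path, first=arr, second=extra)
with np.load(npz_path) as bundle:
    first = bundle["first"]
    second = bundle["second"]

print(loaded)
print(first, second)
```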


How to concatenate two numpy arrays column-wise and row-wise?
There are 3 different ways of concatenating two or more numpy arrays.
    Method 1: np.concatenate by changing the axis parameter to 0 and 1
    Method 2: np.vstack and np.hstack
    Method 3: np.r_ and np.c_
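
The three methods listed above can be sketched on two small arrays. The arrays a and b are made up for illustration; each method should produce the same row-wise and column-wise results.

```python
import numpy as np

a = np.zeros((2, 3), dtype=int)
b = np.ones((2, 3), dtype=int)

# Method 1: np.concatenate, axis=0 stacks rows and axis=1 stacks columns
rows_1 = np.concatenate([a, b], axis=0)
cols_1 = np.concatenate([a, b], axis=1)

# Method 2: np.vstack (row-wise) and np.hstack (column-wise)
rows_2 = np.vstack([a, b])
cols_2 = np.hstack([a, b])

# Method 3: np.r_ (row-wise) and np.c_ (column-wise), note the square brackets
rows_3 = np.r_[a, b]
cols_3 = np.c_[a, b]

print(rows_1.shape, cols_1.shape)
```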