![Cloud-First](../image/CloudFirst.png) 


# SIT742: Modern Data Science
**(Module: Big Data)**

---
- Materials in this module include resources collected from various open-source online repositories.
- You are free to use, change and distribute this package.
- If you find any issue or bug in this document, please submit an issue at [tulip-lab/sit742](https://github.com/tulip-lab/sit742/issues)


Prepared by **SIT742 Teaching Team**

---


## Session 3D: Data Acquisition (1)

This week, we will learn how to use Python packages to extract, transform and load (ETL) data and files.

## Content



### Part 1 Numpy Module

1.1 [Importing Numpy](#importnp)

1.2 [Numpy arrays](#nparray)

1.3 [Manipulating arrays](#maninp)

1.4 [Array Operations](#arrayop)

1.5 [np.random](#random)

1.6 [Vectorizing Functions](#vecfunc)



### Part 2 Data Loading

2.1 [TXT](#txt)

2.2 [CSV](#csv)

2.3 [JSON](#json)



---
## <span style="color:#0b486b">1. Numpy module</span>


Python lists are very flexible and can store any sequence of Python objects. That flexibility usually comes at the price of performance, so Python lists are not ideal for numerical calculations where speed matters. This is where **NumPy** comes in. It adds support for large, multi-dimensional arrays and matrices to Python, along with high-level mathematical functions that operate on these arrays.

Relying on `'BLAS'` and `'LAPACK'`, `'NumPy'` gives Python functionality comparable to `'MATLAB'`. NumPy facilitates advanced mathematical and other operations on large amounts of data. Typically, such operations run more efficiently and need less code than is possible with Python's built-in sequences. It has become one of the fundamental packages for numerical computation.

This tutorial reviews only the basics. To learn more about NumPy, visit the [NumPy User Guide](http://docs.scipy.org/doc/numpy/user/index.html).

<a id = "importnp"></a>

### <span style="color:#0b486b">1.1 Importing Numpy</span>

As you have learnt in this session, first we have to import a package to be able to use it. NumPy is imported with:

In [None]:
import numpy

By convention, however, it is imported with an alias:

In [None]:
import numpy as np

<a id = "nparray"></a>

### <span style="color:#0b486b">1.2 Numpy arrays</span>

The core of NumPy is its arrays. You can create an array from a Python list or tuple using the `'array'` function. Arrays work similarly to lists, except that:

* you can easily perform element-wise operations on them, and
* unlike lists, their size is fixed when they are created. There is no in-place equivalent to list append for arrays: functions such as `np.append` return a new array instead of growing the existing one.
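The two differences above can be illustrated with a minimal sketch. The variable names are chosen for this example only; `np.append` is a real NumPy function, but note that it returns a new array rather than growing the original in place.

```python
import numpy as np

values = [1, 2, 3]
arr = np.array(values)

# Multiplying a list repeats it; multiplying an array scales each element.
list_times_two = values * 2
arr_times_two = arr * 2

# np.append does not grow `arr` in place; it builds and returns a new array.
extended = np.append(arr, 4)

print(list_times_two)
print(arr_times_two)
print(arr.size, extended.size)
```

Because every call to `np.append` copies the data, it is usually better to create an array of the final size up front (for example with `np.zeros`) and fill it.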



#### <span style="color:#0b486b">1.2.1 Create an array from a list</span>

In [None]:
x = [1, 7, 3, 4, 0, -5]

In [None]:
y = np.array(x)
type(y)

#### <span style="color:#0b486b">1.2.2 Create an array using a range</span>

In [None]:
range(5)

In [None]:
print(np.array(range(5)))

In [None]:
print(np.arange(2, 3, 0.2))  # Why is there no value 3.0 in the output?

In [None]:
print(np.linspace(2, 3, 5))    # returns numbers spaced evenly on a linear scale, both endpoints are included


Try replacing the value 5 with 1, 2, 4 or 10.

What pattern can you find?

Could you guess what **linspace** does without the comment?

Then use the same method to work out what **logspace** does.




In [None]:
print(np.logspace(2, 3, 5))   # returns numbers spaced evenly on a log scale
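One way to check your guess about **logspace** is to compare it with **linspace**. The sketch below assumes the default base of 10, in which case the start and stop arguments are exponents rather than the values themselves. The variable names are illustrative.

```python
import numpy as np

exponents = np.linspace(2, 3, 5)      # evenly spaced exponents from 2 to 3
log_points = np.logspace(2, 3, 5)     # values from 10**2 to 10**3

print(exponents)
print(log_points)

# With the default base, logspace matches 10 raised to the linspace values.
print(np.allclose(log_points, 10 ** exponents))
```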

**Note:** If you need any help on how to use a function or what it does, you can use IPython help. Just add a question mark (?) at the end of the function and execute the cell:

In [None]:
np.logspace?

#### <span style="color:#0b486b">1.2.3 Create a prefilled array</span>

In [None]:
print(np.zeros(5))

In [None]:
print("The 1st sample is", np.ones(5, dtype=int))   # you can specify the data type, default is float
print("The 2nd sample is", np.ones(5, dtype=float))
print("The 3rd sample is", np.ones((5, 5), dtype=int))
print("The 4th sample is", np.ones((5, 5, 5), dtype=int))

In [None]:
np.ones?

#### <span style="color:#0b486b">1.2.4 `'mgrid'`</span>
Similar to `meshgrid` in MATLAB:

In [None]:
x, y = np.mgrid[0:5, 0:3]

print(x)
print(y)

In [None]:
np.mgrid?

#### <span style="color:#0b486b">1.2.5 Array attributes</span>

NumPy arrays have multiple attributes and methods. The cell below shows a few of them. You can press tab after typing the dot operator `'(.)'` to use IPython auto-complete and see the rest of them.

In [None]:
y = np.array([3, 0, -4, 6, 12, 2])

In [None]:
print("number of dimensions:\t", y.ndim)        
print("dimension of the array:", y.shape)       
print("numerical data type:\t", y.dtype)
print("maximum of the array:\t", y.max())       
print("index of the array max:", y.argmax())    
print("mean of the array:\t", y.mean())      

#### <span style="color:#0b486b">1.2.6 Multi-dimensional arrays</span>


You can define arrays with 2 (or higher) dimensions in numpy:

##### from lists

In [None]:
x = [[1, 2, 10, 20], [3, 4, 30, 40]]
y = np.array(x)
print(y)
print()
print(y.ndim, y.shape)

##### pre-filled 

In [None]:
x = np.ones((3, 5), dtype='int')

In [None]:
print(x)
print()
print(x.ndim, x.shape)

##### `'diag()'`
Builds a diagonal matrix from a list of values:

In [None]:
np.diag([1, 2, 3])

<a id = "maninp"></a>

### <span style="color:#0b486b">1.3 Manipulating arrays</span>


#### <span style="color:#0b486b">1.3.1 Indexing</span>


Similar to lists, you can index elements in an array using `'[]'` and indices:

If `'x'` is a 1-dimensional array, `'x[i]'` indexes the `'ith'` element of `'x'` (counting from 0):

In [None]:
x = np.array([2, 8, -2, 4, 3])
print(x[3])

If `'x'` is a 2-dimensional array:

* `'x[i, j]'` or `'x[i][j]'` will index the element in the `'ith'` row and `'jth'` column
* `'x[i, :]'` will index the `'ith'` row
* `'x[:, j]'` will index the `'jth'` column

In [None]:
x = np.array([[7, 6, 8, 6, 4],
              [4, 7, -2, 0, 9]])
              
print(x[1, 3])

In [None]:
print(x[1, :])      # or x[1]

In [None]:
print(x[:, 3])

Arrays can also be indexed with other arrays:

In [None]:
x = np.array([2, 8, -2, 4, 3, 9, 0])

idx1 = [1, 3, 4]        # list
idx2 = np.array(idx1)   # array

print(x[idx1], x[idx2])
x[idx2] = 0
print(x)

You can also index with masks. An index mask is a NumPy array of data type `bool` with the same shape as the array being indexed. An element of the array is selected only if the mask is True at that element's position.

In [None]:
x = np.array([2, 8, -2, 4, 3, 9, 0])

In [None]:
mask = np.array([False, True, True, False, False, True, False])

In [None]:
x[mask]

Combining index masks with comparison operators enables you to conditionally select elements of the array.

In [None]:
x = np.array([2, 8, -2, 4, 3, 9, 0])
mask = (x >= 2) & (x < 9)    # element-wise AND of the two conditions
x[mask]
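Conditions can be combined in other ways too. The sketch below uses NumPy's element-wise operators `&` (and), `|` (or) and `~` (not) on the same array. The mask names are made up for this example.

```python
import numpy as np

x = np.array([2, 8, -2, 4, 3, 9, 0])

in_range = (x >= 2) & (x < 9)     # both conditions must hold
outside = ~in_range               # logical negation of the mask
extremes = (x < 0) | (x > 8)      # either condition may hold

print(x[in_range])
print(x[outside])
print(x[extremes])
```

The parentheses around each comparison are required, because `&` and `|` bind more tightly than `>=` and `<` in Python.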

#### <span style="color:#0b486b">1.3.2 Slicing</span>


Similar to Python lists, arrays can also be sliced:

In [None]:
x = np.array([2, 8, -2, 4, 3, 9, 0])

print(x[3:])    # slicing
print(x[3:7:2])  # slicing with a specified step

In [None]:
x = np.array([[7, 6, 8, 6, 4, 3],
              [4, 7, 0, 5, 9, 5],
              [7, 3, 6, 3, 5, 1]])
              

print(x[1, 1:4])
print()
print(x[:2, 1::2])    # rows zero up to 2, cols 1 up to end with a step=2

#### <span style="color:#0b486b">1.3.3 Iteration over items</span>


Since most NumPy functions can operate on whole arrays, in many cases iterating over the items of an array can be (and should be) avoided. When you do need it, it works much like iterating over the values of a list:

In [None]:
a = np.arange(0, 50, 7)
print(a)
for item in a:
    print(item, end=" ")    # print the items on one line

Of course you could iterate over items using their indices too:

In [None]:
a = np.arange(0, 50, 7)
for i in range(a.shape[0]):
    print(a[i], end=" ")    # print the items on one line

There are also many functions for manipulating arrays. The most used ones are:

#### <span style="color:#0b486b">1.3.4 `copy()`</span>


**Remember** that the assignment operator does not copy an array. Assignment in Python does not pass values. It passes references, so both names refer to the same object.

In [None]:
x = [1, 2, 3]
y = x
print(x, y)

In [None]:
y[0] = 0       # now we alter an element of y
print(x, y)     # note that x has changed as well

The same is true for NumPy arrays. That is why, if you need a copy of an array, you should use the `'copy()'` function.

In [None]:
x = np.array([1, 2, 3])
y = x

y[0] = 0       # now we alter an element of y
print(x, y)     # note that x has changed as well

In [None]:
x = np.array([1, 2, 3])
y = x.copy()  # or np.copy(x)
y[0] = 0

print(x, y)
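The same caution applies to slices: a basic slice of a NumPy array is a view onto the original data, not a copy. A minimal sketch, using `np.shares_memory` to confirm which arrays share data (the variable names are illustrative):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])

view = x[1:4]                  # basic slice: shares memory with x
independent = x[1:4].copy()    # explicit copy: owns its own data

view[0] = 99                   # also changes x[1]
independent[1] = -1            # leaves x untouched

print(x)
print(np.shares_memory(x, view), np.shares_memory(x, independent))
```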

#### <span style="color:#0b486b">1.3.5 `reshape()`</span>


In [None]:
x1 = np.arange(6)
x2 = x1.reshape((2, 3))    # or np.reshape(x1, (2, 3))

print(x1)
print()
print(x2)

#### <span style="color:#0b486b">1.3.6 `astype()`</span>


Used for type casting:

In [None]:
x1 = np.arange(5)
x2 = x1.astype(float)

print(x1.dtype, x1)    # dtype shows the element type; type() would report ndarray for both
print(x2.dtype, x2)

#### <span style="color:#0b486b">1.3.7 `T` Transpose</span> 

The `T` attribute returns the transpose of an array:

In [None]:
x1 = np.random.randint(5, size=(2, 4))
x2 = x1.T

print(x1)
print()
print(x2)

<a id = "arrayop"></a>

### <span style="color:#0b486b">1.4 Array operations</span>


#### <span style="color:#0b486b">1.4.1 Arithmetic operators</span>


Arrays can be added, subtracted, multiplied and divided using +, -, \* and /. These operators work **element-wise**.

In [None]:
x1 = np.array([[2, 3, 5, 7], 
               [2, 4, 6, 8]], dtype=float)
x2 = np.array([[6, 5, 4, 3], 
               [9, 7, 5, 3]], dtype=float)

In [None]:
print(x1)
print()
print(x2)

In [None]:
print(x1 + x2)

In [None]:
print(x1 - x2)

In [None]:
print(x1 * x2)

In [None]:
print(x1 / x2)

In [None]:
print(3 + x1)

In [None]:
print(3 * x1)

In [None]:
print(3 / x1)

#### <span style="color:#0b486b">1.4.2 Boolean operators</span>

Much like the arithmetic operators discussed above, boolean (comparison) operators work element-wise on arrays.

In [None]:
x1 = np.array([2, 3, 5, 7])
x2 = np.array([2, 4, 6, 7])
y = x1<x2

print( y, y.dtype)

Use the methods `'.any()'` and `'.all()'` to return a single boolean value indicating whether any or all values in the array are True, respectively. This value can in turn be used as a condition for an `'if'` statement.

In [None]:
print (y.all())
print (y.any())
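As a small illustration of using these methods in a condition, the sketch below reuses the two arrays from the cell above. The names `is_smaller` and `verdict` are made up for this example. Using a multi-element boolean array directly as the condition would raise a `ValueError`, which is why `.all()` or `.any()` is needed.

```python
import numpy as np

x1 = np.array([2, 3, 5, 7])
x2 = np.array([2, 4, 6, 7])
is_smaller = x1 < x2    # element-wise comparison

# Reduce the boolean array to a single value before branching on it.
if is_smaller.all():
    verdict = "every element of x1 is smaller"
elif is_smaller.any():
    verdict = "some elements of x1 are smaller"
else:
    verdict = "no element of x1 is smaller"

print(verdict)
```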

NumPy has many other functions that you can read about in the [NumPy User Guide](http://docs.scipy.org/doc/numpy/user/). In particular, read about:

* `np.unique`, returns unique elements of an array
* `ndarray.flatten`, an array method that flattens a multi-dimensional array
* `np.mean`, `np.std`, `np.median`
* `np.min`, `np.max`, `np.argmin`, `np.argmax`
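A quick sketch of the functions listed above, applied to a small made-up array. Note that flattening is called as a method on the array itself (`grid.flatten()`), and that `argmin` and `argmax` return positions in the flattened array unless an axis is given.

```python
import numpy as np

grid = np.array([[3, 1, 3],
                 [7, 1, 9]])

flat = grid.flatten()               # 1-D copy of the data, row by row
unique_values = np.unique(grid)     # sorted unique values

print(flat)
print(unique_values)
print(np.mean(grid), np.median(grid), np.std(grid))
print(np.min(grid), np.argmin(grid))    # smallest value and its flat index
print(np.max(grid), np.argmax(grid))    # largest value and its flat index
```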

<a id = "random"></a>

### <span style="color:#0b486b">1.5 np.random</span>


NumPy has a module called `random` to generate arrays of random numbers. There are different ways to generate a random number:

In [None]:
print( np.random.rand())

In [None]:
# 2x5 random array drawn from a uniform distribution on [0, 1)
print( np.random.random([2, 5]))

In [None]:
# 2x5 random array drawn from a uniform distribution on [0, 1), shape given as separate arguments
print (np.random.rand(2, 5))

In [None]:
# 2x5 random array drawn from a uniform distribution on {0, 1, 2, ..., 9}
print (np.random.randint(10, size=[2, 5])) 

##### <span style="color:#0b486b">1.5.1 Random seed</span>


Random numbers generated by computers are not really random. They are called pseudo-random. Thus we can set the random generator to generate the same set of random numbers every time. This is useful while testing the code.

In [None]:
for i in range(5):
    print(np.random.random(), end=" ")

In [None]:
for i in range(5):
    np.random.seed(100)    # re-seeding before every draw repeats the same number
    print(np.random.random(), end=" ")
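In practice the seed is usually set once, before a block of code, so that the whole sequence of random numbers can be reproduced. A minimal sketch, using the same seed value of 100 as above (the variable names are illustrative):

```python
import numpy as np

np.random.seed(100)
first_run = np.random.random(3)

np.random.seed(100)    # reset the generator to the same starting state
second_run = np.random.random(3)

print(first_run)
print(second_run)
print(np.array_equal(first_run, second_run))
```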

<a id = "vecfunc"></a>

### <span style="color:#0b486b">1.6 Vectorizing functions</span>


As mentioned earlier in operators, to get a good performance you should avoid looping over elements in an array and use vectorized algorithms. Many methods and functions of NumPy already support vectors, so keep this in mind while writing your own code.

But for now, suppose you have written a step function which does not work with arrays, as in the cell below:

In [None]:
def step_func(x):
    """
    scalar implementation of step function
    """
    
    if x>=0:
        return 1
    else:
        return 0

Obviously it fails when dealing with an array, because it expects a scalar as its input. Execute the cell below and see that it raises an error:

In [None]:
# since step_func expects a scalar and receives an array instead,
# it raises an error

step_func(np.array([2, 7, -4, -9, 0, 4]))

You can use the function `'np.vectorize()'` to obtain a vectorized version of `'step_func'` that can handle vector data:

In [None]:
step_func_vectorized = np.vectorize(step_func)
step_func_vectorized(np.array([2, 7, -4, -9, 0, 4]))

Although `'vectorize()'` can automatically derive a vectorized version of a scalar function, it is always better to write functions that are vector-compatible from the beginning. For example, we could write the step function as shown in the cell below, so that it can handle both scalar and vector data.

In [None]:
def step_func2(x):
    """
    vector and scalar implementation of step function
    """
    
    return 1 * (x>=0)

In [None]:
step_func2(np.array([2, 7, -4, -9, 0, 4]))
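Another way to write a vector-compatible step function is with `np.where`, which picks between two values element by element. This is only one possible alternative; the function name `step_func_where` is hypothetical and not part of the original notebook.

```python
import numpy as np

def step_func_where(x):
    """Hypothetical step function built on np.where; accepts scalars and arrays."""
    # np.asarray lets plain Python numbers and lists be handled like arrays.
    return np.where(np.asarray(x) >= 0, 1, 0)

samples = np.array([2, 7, -4, -9, 0, 4])
print(step_func_where(samples))
print(step_func_where(-3.5))    # a scalar input gives a zero-dimensional array
```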

---
## <span style="color:#0b486b">2. File I/O</span>

On online platforms such as Google Colab or IBM Cloud, it is important to become familiar with the data storage or cloud data storage functions they provide. Alternatively, you can download a file directly and load it into your notebook. The cell below installs the `wget` package used for these downloads.

In [None]:
!pip install wget

Then you can download the file into the GPFS file system.

In [None]:
import wget

link_to_data = 'https://github.com/tulip-lab/sit742/raw/master/Jupyter/data/csv_data1.csv'
DataSet = wget.download(link_to_data)

<a id = "txt"></a>

### <span style="color:#0b486b">2.1 TXT</span>


The TXT file format is the simplest way to store data.

Load a TXT file with `'np.loadtxt()'`:

In [None]:
import numpy as np

# This code is for local PC
# x = np.loadtxt("data/txt_data1.txt")

# The following code for IBM Cloud
link_to_data = 'https://github.com/tulip-lab/sit742/raw/master/Jupyter/data/txt_data1.txt'
DataSet = wget.download(link_to_data)

x = np.loadtxt("txt_data1.txt")
x

Save a TXT file with `'np.savetxt()'`:

In [None]:
y = np.random.randint(10, size=5)
np.savetxt("txt_data2.txt", y)
y
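To see saving and loading work together, the sketch below writes an array and reads it back. It uses a temporary folder and a made-up file name so that it does not touch the notebook's own data files. `fmt` and `dtype` are real parameters of `np.savetxt` and `np.loadtxt`.

```python
import os
import tempfile
import numpy as np

values = np.array([4, 0, 7, 2, 9])

# Hypothetical file name inside a temporary folder.
path = os.path.join(tempfile.mkdtemp(), "txt_roundtrip.txt")

# Without fmt, savetxt writes floats in scientific notation; "%d" writes plain integers.
np.savetxt(path, values, fmt="%d")

loaded_float = np.loadtxt(path)             # loadtxt returns floats by default
loaded_int = np.loadtxt(path, dtype=int)    # ask for integers explicitly

print(loaded_float, loaded_int)
```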

<a id = "csv"></a>

### <span style="color:#0b486b">2.2 CSV</span>


The Comma Separated Values (CSV) format and its variations are among the most widely used file formats for storing data.

You can use `'np.genfromtxt()'` to read a CSV file:

**NOTE:** The best way to read CSV and XLS files is using **pandas** package that will be introduced later.

In [None]:
import wget

link_to_data = 'https://github.com/tulip-lab/sit742/raw/master/Jupyter/data/csv_data1.csv'
DataSet = wget.download(link_to_data)

print(DataSet)

In [None]:
import numpy as np


x = np.genfromtxt("csv_data1.csv", delimiter=",")
x

Use `'np.savetxt()'` to save a 2d-array in a CSV file.

In [None]:
x = np.random.randint(10, size=(6,4))
np.savetxt("csv_data2.csv", x, delimiter=',')
x
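Real CSV files often start with a header row of column names. The sketch below, which assumes made-up column names `a,b,c` and a temporary file, writes such a header with `np.savetxt` and skips it again with the `skip_header` argument of `np.genfromtxt`.

```python
import os
import tempfile
import numpy as np

table = np.array([[1, 2, 3],
                  [4, 5, 6]])

# Hypothetical file name inside a temporary folder.
path = os.path.join(tempfile.mkdtemp(), "csv_roundtrip.csv")

# comments="" stops savetxt from prefixing the header line with "# ".
np.savetxt(path, table, delimiter=",", fmt="%d", header="a,b,c", comments="")

# skip_header=1 ignores the first line, so only the numbers are parsed.
loaded = np.genfromtxt(path, delimiter=",", skip_header=1)
print(loaded)
```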

<a id = "json"></a>

### <span style="color:#0b486b">2.3 JSON</span>


JSON is the most used file format when dealing with web services. 

To read a JSON file, use the `'json'` package: `'load()'` reads from an open file object, while `'loads()'` parses a string (or bytes) that already holds the JSON text. Either function parses the data into Python objects, typically a dictionary or a list.

In [None]:
link_to_data = 'https://github.com/tulip-lab/sit742/raw/master/Jupyter/data/json_data1.json'
DataSet = wget.download(link_to_data)

print(DataSet)

In [None]:
import json

# Read the raw bytes, decode them as UTF-8, then parse the JSON text
with open("json_data1.json", 'rb') as fp:
    fcontent = fp.read()
data = json.loads(fcontent.decode('utf-8'))
data.keys()

In [None]:
data

In [None]:
data['phoneNumbers']

You can also write a Python dictionary or list into a JSON file. To do this, use the `'dump()'` function to write to a file or `'dumps()'` to produce a string.

In [None]:
data = [{'Name': 'Zara', 'Age': 7, 'Class': 'First'},
        {'Name': 'Lily', 'Age': 9, 'Class': 'Third'}]
data

In [None]:
with open("json_data_now.json", 'w') as fp:
    json.dump(data, fp)
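The string-based counterparts `dumps()` and `loads()` work the same way without a file. A minimal sketch that round-trips the same records through a JSON string (the variable names are chosen for this example; `indent` is an optional argument that makes the output easier to read):

```python
import json

records = [{'Name': 'Zara', 'Age': 7, 'Class': 'First'},
           {'Name': 'Lily', 'Age': 9, 'Class': 'Third'}]

text = json.dumps(records, indent=2)    # serialize to a JSON string
restored = json.loads(text)             # parse the string back into Python objects

print(text)
print(restored == records)
```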