Contents (TODO) | [How to Read and Represent Data](../ica02/How_to_Read_and_Represent_Data.ipynb) >

<a href="https://colab.research.google.com/github/stephenbaek/bigdata/blob/master/in-class-assignments/ica01/hello_world.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>

# Hello World!

This notebook is designed to provide a quick and dirty introduction to Python programming. If you are already familiar with Python programming, you can safely skip many of the sections below and jump directly to the assignment cells. This notebook is designed for complete newbies (=absolutely zero experience with Python) to help them quickly dive into the materials for the rest of the course. That being said, it is NOT the purpose of this notebook to equip you with comprehensive working knowledge of, and coding skills in, Python. For more structured learning materials for Python, I strongly encourage you to purchase a book, take an (online) course, etc.

Also, I hereby acknowledge that much of the code and many of the explanations in this notebook are originally from [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas. The original code and contents are available on the book's [GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).

## 1. Jupyter Notebook

This notebook that you are reading right now is what is called a Jupyter notebook. It's actually a pretty cool way of keeping notes on your experiments, as it allows you to put nicely formatted notes alongside source code that is executable on the fly. To get started, click the code below and hit `ctrl + enter` (or if you're using Mac, `cmd + enter`) and see what happens.

In [None]:
print('Hello World!')

Did you notice that the notebook printed out a message `Hello World!`? Basically, a Jupyter notebook is composed of "small blocks of things" called *cells*. The one that you just ran is a cell containing code, and it's called a *code cell*. `ctrl + enter` (or `cmd + enter` for Mac) is usually how you run a code cell. Alternatively, you can hit `shift + enter`, which executes the code cell and moves the focus to the next cell.

There's another type of cell, which is a *text cell* or a *markdown cell*. The very paragraph that you are reading right now is actually a text cell. Text cells are essentially for making notes on your code, but there are a lot of different ways of using them. Text cells use a lightweight markup language called 'Markdown'. Markdown is actually pretty simple to learn and, for most of you, looking at [this cheat-sheet](https://www.markdownguide.org/cheat-sheet) should be just enough. Below is an example of a text cell. Double click it and change some of the text there. After changes are made, simply hit `ctrl/shift + enter` as you would to run a code cell.

---
In a text cell, you can render texts in **boldface** or *italicized*.
You can also ~~strikethrough~~ words.
> You can also quote someone like this.

1. or you can create
2. an ordered
3. list


- or even
- an unordered
- list

You can render sample codes like this:
```cpp
#include <iostream>
int main(void){
    std::cout << "Hello World!" << std::endl;
    return 0;
}
```
or, inline, `myVar = 3`.

You can also create a [hyperlink](http://www.stephenbaek.com),

or include an image ![iowa](https://lh5.googleusercontent.com/p/AF1QipOg3dTb5dkfYhENpjJeplro3cLVaNLiQvhMHL8i=w213-h160-k-no)

You can report some results in a table:

| Table | Column |
| ----- | ------ |
| Row 1 |  30 % |
| Row 2 |  70 % |


You can also build a laundry list like this:
- [ ] Do homework
- [ ] Buy some cookies
- [x] Steve: Prepare a lecture material
- [ ] Liz: Practice presentation

---

To add a cell, you can simply hit the `+` button on the toolbar, or `Insert > Insert Cell Below (Above)` in the main menu. By default, it creates a code cell, but you can always convert it to a text cell by clicking `Cell > Cell Type > Markdown`. If you use Google Colab, creating cells is even simpler. You just need to move the mouse cursor to where you want to create a cell and click either `CODE` or `TEXT`, depending on the type of cell you want to create.

### Remarks on the course materials...
In this course, I'll make lab notes, assignments, source code, etc. available as Jupyter notebooks. The best way to learn from the notebooks is to run everything in there, cell by cell, and carefully observe what happens. I will also assign quizzes, homework, and exercises using these notebooks, by leaving some blank code cells and such:

In [None]:
# Exercise Problem (Assignment)
# Fill in the blank to print your name.
print(                )

### Report typos, bugs, grammatical errors, etc.

Lastly, I have to admit that I'm an extremely unorganized, careless person. I can guarantee you that you will find lots of typos, grammatical mistakes, bugs, etc. in the lecture materials. You may also have some suggestions for improving this course, things like 'Can you add lab sessions on parallel computing?' 'Example in lecture-03 is boring. Can you use a different dataset?' or things like that. Please let me know about those by opening issues in the course GitHub. To open an issue, you can visit https://github.com/stephenbaek/bigdata/issues and click the `New issue` button. Your suggestions are always welcome and you won't be penalized for pointing out problems or making suggestions. In fact, for those who raise an issue that I find particularly useful for improving the course, I'll give you a bonus added to your final grade.

(TODO: Add a screenshot)

Finally, we are done with some basic things. Let's do some more exciting stuff now!

## 2. Data Types in Python

To be a good data scientist, you will be required to have a firm grasp on how data is stored and manipulated.
This section outlines and contrasts how arrays of data are handled in the Python language.

Users of Python are often drawn in by its ease of use, one piece of which is dynamic typing. While a statically-typed language like C or Java requires each variable to be explicitly declared, a dynamically-typed language like Python or MATLAB skips this specification. For example, in C you might specify a particular operation as follows:
```cpp
/* C code */
int result = 0;
for(int i=0; i<100; i++){
    result += i;
}
printf("%d", result);
```
while in Python, the equivalent operation could be written this way:

In [None]:
# Python code
result = 0
for i in range(100):
    result += i
print(result)

Notice the main difference: in C, the data types of each variable are explicitly declared, while in Python the types are dynamically inferred. This means, for example, that we can assign any kind of data to any variable:

In [None]:
# Python code
x = 4
print(x)
x = "four"
print(x)

Here we've switched the contents of x from an integer to a string. The same thing in C would lead (depending on compiler settings) to a compilation error or other unintended consequences:

```cpp
/* C code */
int x = 4;
x = "four";  // FAILS
```

This sort of flexibility is one piece that makes Python and other dynamically-typed languages convenient and easy to use. Understanding how this works is an important piece of learning to analyze data efficiently and effectively with Python. But what this type-flexibility also points to is the fact that Python variables are more than just their value; they also contain extra information about the type of the value. We'll explore this more in the sections that follow.
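
To make the point about type information concrete, here is a minimal sketch showing that the type travels with the object a name refers to, not with the name itself. The variable names `first_type` and `second_type` are only for illustration.

```python
# A variable name is just a label; the object it refers to carries the type.
x = 4
first_type = type(x)       # the type of the object currently bound to x

x = "four"                 # rebind the same name to a different kind of object
second_type = type(x)

print(first_type.__name__)   # int
print(second_type.__name__)  # str
```

Nothing about `x` was declared in advance. Python simply looks up the type stored inside whatever object `x` currently points to.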

In [None]:
# Exercise Problem (Assignment)
# declare an integer variable named `a` with initial value of 7
# declare a floating number variable (float) named `b` with initial value of 7
# declare a string variable named `c` with initial value of 'seven'

### 2.1. A Python integer is more than just an integer
The standard Python implementation is written in C. This means that every Python object is simply a cleverly-disguised C structure, which contains not only its value, but other information as well. For example, when we define an integer in Python, such as x = 10000, x is not just a "raw" integer. It's actually a pointer to a compound C structure, which contains several values. Looking through the Python 3.4 source code, we find that the integer (long) type definition effectively looks like this (once the C macros are expanded):
```cpp
struct _longobject {
    long ob_refcnt;
    PyTypeObject *ob_type;
    size_t ob_size;
    long ob_digit[1];
};
```

A single integer in Python 3.4 actually contains four pieces:

- `ob_refcnt`, a reference count that helps Python silently handle memory allocation and deallocation
- `ob_type`, which encodes the type of the variable
- `ob_size`, which specifies the size of the following data members
- `ob_digit`, which contains the actual integer value that we expect the Python variable to represent.


This means that there is some overhead in storing an integer in Python as compared to an integer in a compiled language like C, as illustrated in the following figure:
![C int vs Python int](figures/cint_vs_pyint.png)

Here `PyObject_HEAD` is the part of the structure containing the reference count, type code, and other pieces mentioned before.

Notice the difference here: a C integer is essentially a label for a position in memory whose bytes encode an integer value. A Python integer is a pointer to a position in memory containing all the Python object information, including the bytes that contain the integer value. This extra information in the Python integer structure is what allows Python to be coded so freely and dynamically. All this additional information in Python types comes at a cost, however, which becomes especially apparent in structures that combine many of these objects.
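
As a rough way to see this overhead for yourself, the sketch below compares the memory footprint of a Python integer object with the size of a C `long` on your machine. It assumes the standard CPython interpreter. The exact numbers depend on your platform and Python version, so none are shown here.

```python
import ctypes
import sys

x = 10000

# Size of the whole Python object: reference count, type pointer, size field, digits
py_int_bytes = sys.getsizeof(x)

# Size of a raw C long on this platform, for comparison
c_long_bytes = ctypes.sizeof(ctypes.c_long)

print("Python int object:", py_int_bytes, "bytes")
print("C long:           ", c_long_bytes, "bytes")
```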

### 2.2. A Python list is more than just a list
Let's consider now what happens when we use a Python data structure that holds many Python objects. The standard multi-element container in Python is the list. We can create a list of integers as follows:

In [None]:
L = list(range(10))
print(L)       # print the entire list
print(L[0])    # print the zero-th element
print(L[2:5])  # print from 2nd element (inclusive) to 5th element (exclusive)

In [None]:
type(L[0])

Or, similarly a list of strings:

In [None]:
L2 = [str(c) for c in L]  # this is one of the useful things about Python. It looks almost like the English language.
print(L2)

In [None]:
type(L2[0])

Because of Python's dynamic typing, we can even create heterogeneous lists:

In [None]:
L3 = [True, "2", 3.0, 4]
[type(item) for item in L3]

But this flexibility comes at a cost: to allow these flexible types, each item in the list must contain its own type info, reference count, and other information–that is, each item is a complete Python object. In the special case that all variables are of the same type, much of this information is redundant: it can be much more efficient to store data in a fixed-type array, like what you would do (and are forced to do) in C. The difference between a dynamic-type list and a fixed-type array is illustrated in the following figure:

![Array vs List](figures/array_vs_list.png)

The word 'Numpy' in the figure is something we will learn in a moment. For now, just substitute the word with 'fixed-type'.

At the implementation level, the fixed-type array essentially contains a single pointer to one contiguous block of data. The Python list, on the other hand, contains a pointer to a block of pointers, each of which in turn points to a full Python object like the Python integer we saw earlier. Again, the advantage of the list is flexibility: because each list element is a full structure containing both data and type information, the list can be filled with data of any desired type. Fixed-type arrays lack this flexibility, but are much more efficient for storing and manipulating data.
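
Python's built-in `array` module offers a simple fixed-type array, which we can use for a rough comparison before we get to NumPy. The sketch below is only an estimate: it adds up the size of the list's pointer block and the size of every integer object it points to, and it ignores the fact that CPython shares some small integer objects.

```python
import sys
from array import array

values = list(range(1000))
fixed = array('q', values)   # 'q' = signed 8-byte integers in one contiguous buffer

# List: a block of pointers plus one full Python object per element
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

# Fixed-type array: a small header plus the raw 8-byte values
array_bytes = sys.getsizeof(fixed)

print("list (estimated):", list_bytes, "bytes")
print("fixed-type array:", array_bytes, "bytes")
```

The price of the compact layout is exactly the flexibility described above: trying to append a string to `fixed` raises a `TypeError`.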

In many data science applications, efficient storage and manipulation of numerical arrays is absolutely fundamental. For those applications, fixed-type arrays would be much more useful than dynamic-type lists, as all the elements we will be dealing with are essentially numbers. Furthermore, there are a lot of mathematical operations that we will need to apply to those numbers as we get to extract, transform, and analyze data. Unfortunately, native Python does not have such capacity. Instead, there is a specialized toolset available for handling such numerical arrays, called the `NumPy` package, and we will look at it in more depth in the section below.

## 3. Introduction to NumPy

NumPy is the fundamental package for scientific computing with Python. It contains:
- a fixed-type n-dimensional array object
- operations defined on n-dimensional arrays
- useful linear algebra and other mathematical capabilities

If you followed the Python installation manual I provided on the course website, your system should already have NumPy installed. To confirm this, you can run the following code cell.

In [None]:
import numpy
numpy.__version__

By convention, you'll find that most people in the real world import NumPy using `np` as an alias:

In [None]:
import numpy as np

In fact, we will use `np` as an alias throughout the whole semester.

### 3.1. Creating NumPy Arrays from Python Lists
First, we can use `np.array` to create arrays from Python lists:

In [None]:
# integer array:
np.array([1, 4, 2, 5, 3])

Remember that unlike Python lists, NumPy is constrained to arrays that all contain the same type. If types do not match, NumPy will upcast if possible (here, integers are up-cast to floating point):

In [None]:
np.array([3.14, 4, 2, 3])

If we want to explicitly set the data type of the resulting array, we can use the dtype keyword:

In [None]:
np.array([1, 2, 3, 4], dtype='float32')

Finally, unlike the native Python lists, NumPy arrays can explicitly be multi-dimensional; here's one way of initializing a multidimensional array using a list of lists:

In [None]:
# nested lists result in multi-dimensional arrays
np.array([range(i, i + 3) for i in [2, 4, 6]])

The inner lists are treated as rows of the resulting two-dimensional array.

In [None]:
# Exercise Problem (Assignment)
# create a numpy array that contains the first five numbers of the Fibonacci sequence
# (https://www.mathsisfun.com/numbers/fibonacci-sequence.html)
a = 

### 3.2. Creating Arrays from Scratch
Especially for larger arrays, it is more efficient to create arrays from scratch using routines built into NumPy. Here are several examples:

In [None]:
# Create a length-10 integer array filled with zeros
np.zeros(10, dtype=int)

In [None]:
# Create a 3x5 floating-point array filled with ones
np.ones((3, 5), dtype=float)

In [None]:
# Create a 3x5 array filled with 3.14
np.full((3, 5), 3.14)

In [None]:
# Create an array filled with a linear sequence
# Starting at 0, ending at 20, stepping by 2
# (this is similar to the built-in range() function)
np.arange(0, 20, 2)

In [None]:
# Create an array of five values evenly spaced between 0 and 1
np.linspace(0, 1, 5)

In [None]:
# Create a 3x3 array of uniformly distributed
# random values between 0 and 1
np.random.random((3, 3))

In [None]:
# Create a 3x3 array of normally distributed random values
# with mean 0 and standard deviation 1
np.random.normal(0, 1, (3, 3))

In [None]:
# Create a 3x3 array of random integers in the interval [0, 10)
np.random.randint(0, 10, (3, 3))

In [None]:
# Create a 3x3 identity matrix
np.eye(3)

In [None]:
# Exercise Problem (Assignment)
a = # create an array of ten values evenly spaced between -1 and 1
b = # create an array filled with a linear sequence starting at 10, ending at 20, stepping by 1
c = # create a random 5x5 matrix whose elements follow the normal distribution with mean 1 and stddev 0.5
d = # create a constant 3x3 matrix filled with 7

### 3.3. NumPy Standard Data Types
NumPy arrays contain values of a single type, so it is important to have detailed knowledge of those types and their limitations. Because NumPy is built in C, the types will be familiar to users of C, Fortran, and other related languages.

The standard NumPy data types are listed in the following table. Note that when constructing an array, they can be specified using a string:

```
np.zeros(10, dtype='int16')
```

Or using the associated NumPy object:

```
np.zeros(10, dtype=np.int16)
```

| Data type	    | Description |
|---------------|-------------|
| ``bool_``     | Boolean (True or False) stored as a byte |
| ``int_``      | Default integer type (same as C ``long``; normally either ``int64`` or ``int32``)| 
| ``intc``      | Identical to C ``int`` (normally ``int32`` or ``int64``)| 
| ``intp``      | Integer used for indexing (same as C ``ssize_t``; normally either ``int32`` or ``int64``)| 
| ``int8``      | Byte (-128 to 127)| 
| ``int16``     | Integer (-32768 to 32767)|
| ``int32``     | Integer (-2147483648 to 2147483647)|
| ``int64``     | Integer (-9223372036854775808 to 9223372036854775807)| 
| ``uint8``     | Unsigned integer (0 to 255)| 
| ``uint16``    | Unsigned integer (0 to 65535)| 
| ``uint32``    | Unsigned integer (0 to 4294967295)| 
| ``uint64``    | Unsigned integer (0 to 18446744073709551615)| 
| ``float_``    | Shorthand for ``float64`` (removed in NumPy 2.0; use ``float64`` instead).| 
| ``float16``   | Half precision float: sign bit, 5 bits exponent, 10 bits mantissa| 
| ``float32``   | Single precision float: sign bit, 8 bits exponent, 23 bits mantissa| 
| ``float64``   | Double precision float: sign bit, 11 bits exponent, 52 bits mantissa| 
| ``complex_``  | Shorthand for ``complex128`` (removed in NumPy 2.0; use ``complex128`` instead).| 
| ``complex64`` | Complex number, represented by two 32-bit floats| 
| ``complex128``| Complex number, represented by two 64-bit floats| 

More advanced type specification is possible, such as specifying big or little endian numbers; for more information, refer to the [NumPy documentation](http://numpy.org/).
NumPy also supports compound data types, but we won't cover them in this course.
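
To see what "limitations" means in practice, the sketch below queries the range of `int8` with `np.iinfo` and then deliberately steps past the upper limit. The array names are made up for this illustration.

```python
import numpy as np

info = np.iinfo(np.int8)   # range information for an integer type
print(info.min, info.max)  # -128 127

largest = np.array([127], dtype=np.int8)
one = np.array([1], dtype=np.int8)

# 127 + 1 does not fit in int8, so the result silently wraps around
wrapped = largest + one
print(wrapped, wrapped.dtype)  # [-128] int8
```

NumPy does not warn about this kind of wrap-around in array arithmetic, so choosing a type that is large enough for your data is up to you.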

### 3.4. Basic operations on NumPy arrays

Data science is virtually synonymous with data preparation and data preparation in Python is nearly synonymous with NumPy array manipulation. Here I show you several examples of using NumPy array manipulation to access data and subarrays, and to split, reshape, and join the arrays. While the types of operations shown here may seem a bit dry and pedantic, they comprise the building blocks of many other examples used throughout the course. Get to know them well!

#### Attributes

In [None]:
x1 = np.random.randint(10, size=10)  # One-dimensional array
x2 = np.random.randint(10, size=(3, 4))  # Two-dimensional array
x3 = np.random.randint(10, size=(3, 4, 5))  # Three-dimensional array

Each array has attributes ndim (the number of dimensions), shape (the size of each dimension), and size (the total size of the array):

In [None]:
print("x3 ndim: ", x3.ndim)
print("x3 shape:", x3.shape)
print("x3 size: ", x3.size)

Another useful attribute is the dtype, the data type of the array (which we discussed in Section 3.3, NumPy Standard Data Types):

In [None]:
print("dtype:", x3.dtype)

In [None]:
# Exercise Problem (Assignment)
N = np.random.randint(low=3, high=5, size=1)
M = np.random.randint(low=1, high=4, size=N)
a = np.random.normal(0, 1, size=M)
del M, N
# The above lines generate a randomly sized, randomly shaped array `a`
print( ) # print the total number of elements in `a`
print( ) # print the dimension of `a`
print( ) # print the data type of `a`
print( ) # print the shape of `a`

#### Access
Accessing elements of a NumPy array is pretty similar to many other programming languages. In a one-dimensional array, the $i^{\text{th}}$ value (counting from zero) can be accessed by specifying the desired index in square brackets:

In [None]:
print(x1)     # This prints the entire array
print(x1[0])  # This prints the zero-th element
print(x1[4])  # This prints the 4th element

Negative indices allow you to index elements from the end of the array:

In [None]:
print(x1[-1]) # This prints the last element
print(x1[-2]) # This prints the second from the last element
print(x1[-3]) # This prints the third from the last element

In a multi-dimensional array, items can be accessed using a comma-separated tuple of indices:

In [None]:
print(x2)        # This prints the entire array
print(x2[0, 0])  # This prints the element at 0-th row and 0-th column
print(x2[2, 3])  # This prints the 2nd row and 3rd column
print(x2[1, -1]) # This prints the 1st row and the last column

Just as we can use square brackets to access individual array elements, we can also use them to access subarrays with the *slice* notation, marked by the colon (``:``) character.
The NumPy slicing syntax follows that of the standard Python list; to access a slice of an array ``x``, use this:
``` python
x[start:stop:step]
```
If any of these are unspecified, they default to the values ``start=0``, ``stop=``*``size of dimension``*, ``step=1``.
We'll take a look at accessing sub-arrays in one dimension and in multiple dimensions.

In [None]:
print(x1)        # original array (for reference)
print(x1[:5])    # first five elements
print(x1[5:])    # elements from index 5 onward
print(x1[4:7])   # middle sub-array
print(x1[::2])   # every other element
print(x1[1::2])  # every other element, starting at index 1
print(x1[::-1])  # all elements, reversed
print(x1[5::-2]) # reversed every other from index 5

In [None]:
print(x2)              # original array (for reference)
print(x2[:2, :3])      # two rows, three columns
print(x2[:3, ::2])     # all rows, every other column
print(x2[::-1, ::-1])  # reversed rows, reversed columns
print(x2[:, 0])        # first column of x2
print(x2[0, :])        # first row of x2
print(x2[0])           # equivalent to x2[0, :]

In [None]:
# Exercise Problem (Assignment)
a = np.random.randint(10, size=(10, 10))
# matrix `a` contains random integers of shape 10 x 10
print( ) # print the first (2 x 3) block of the matrix on the top-left corner.
print( ) # print the (3 x 4) block on the bottom-right corner.
print( ) # print the first 5 elements of column 6
print( ) # print the last 3 elements of row 4
print( ) # print the row 7 in reverse order
print( ) # print the column 5 every three element starting at index 1

#### Views and Copies

One important–and extremely useful–thing to know about array slices is that they return **views** rather than **copies** of the array data. This is one area in which NumPy array slicing differs from Python list slicing: in lists, slices will be copies. Consider our two-dimensional array from before:

In [None]:
print(x2)

Let's extract a $2 \times 2$ subarray from this:

In [None]:
x2_sub = x2[:2, :2]
print(x2_sub)

Now if we modify this subarray, we'll see that the original array is changed! Observe:

In [None]:
x2_sub[0, 0] = 99
print(x2_sub)
print(x2)

This default behavior is actually quite useful: it means that when we work with large datasets, we can access and process pieces of these datasets without the need to copy the underlying data buffer.
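
For contrast, here is a small side-by-side sketch of the same experiment on a plain Python list and on a NumPy array. The variable names are only for illustration.

```python
import numpy as np

numbers = [0, 1, 2, 3, 4]
list_slice = numbers[:2]   # slicing a list builds a new, independent list
list_slice[0] = 99
print(numbers)             # [0, 1, 2, 3, 4]  (original list unchanged)

arr = np.array([0, 1, 2, 3, 4])
arr_slice = arr[:2]        # slicing an array returns a view of the same buffer
arr_slice[0] = 99
print(arr)                 # [99  1  2  3  4]  (original array changed)

# np.shares_memory reports whether two arrays overlap in memory
is_view = np.shares_memory(arr, arr_slice)
print(is_view)             # True
```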

On the other hand, despite the nice features of array views, it is sometimes useful to instead explicitly copy the data within an array or a subarray. This can be most easily done with the ``copy()`` method:

In [None]:
x2_sub_copy = x2[:2, :2].copy()
print(x2_sub_copy)

If we now modify this subarray, the original array is not touched:

In [None]:
x2_sub_copy[0, 0] = 42
print(x2_sub_copy)
print(x2)

#### Reshaping
Another useful type of operation is reshaping of arrays.
The most flexible way of doing this is with the ``reshape`` method.
For example, if you want to put the numbers 1 through 9 in a $3 \times 3$ grid, you can do the following:

In [None]:
grid = np.arange(1, 10).reshape((3, 3))
print(grid)

Note that for this to work, the size of the initial array must match the size of the reshaped array. 
Where possible, the ``reshape`` method will use a no-copy view of the initial array, but with non-contiguous memory buffers this is not always the case.
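
One way to check whether a particular `reshape` gave you a view or a copy is `np.shares_memory`. The sketch below shows one case of each, using a transposed array as an example of a non-contiguous buffer.

```python
import numpy as np

a = np.arange(6)
grid_view = a.reshape((2, 3))        # contiguous data: reshape can return a view
shares_with_view = np.shares_memory(a, grid_view)
print(shares_with_view)              # True

transposed = grid_view.T             # same buffer, but no longer laid out row by row
flat = transposed.reshape(6)         # cannot be expressed as a view, so NumPy copies
shares_with_flat = np.shares_memory(a, flat)
print(shares_with_flat)              # False
print(flat)                          # [0 3 1 4 2 5]
```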

Another common reshaping pattern is the conversion of a one-dimensional array into a two-dimensional row or column matrix.
This can be done with the ``reshape`` method, or more easily by using ``np.newaxis`` within a slice operation:

In [None]:
x = np.array([1, 2, 3])

# row vector via reshape
print(x.reshape((1, 3)))

# column vector via reshape
print(x.reshape((3, 1)))
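
The same row and column vectors can be produced with `np.newaxis`, which inserts a new axis of length one at the position where it appears in the index. A minimal sketch:

```python
import numpy as np

x = np.array([1, 2, 3])

# row vector via newaxis: new axis in front
row = x[np.newaxis, :]
print(row.shape)   # (1, 3)

# column vector via newaxis: new axis at the end
col = x[:, np.newaxis]
print(col.shape)   # (3, 1)
```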

In [None]:
# Exercise Problem (Assignment)
# create a 5x5 grid filled with numbers 1 through 25, from top to bottom. 
# i.e.   1   6  11  ...
#        2   7  12  ...
#        3   8  ...
#        4   9  ...
#        5  10  ...
a = 

#### Array Concatenation and Splitting

All of the preceding routines worked on single arrays. It's also possible to combine multiple arrays into one, and to conversely split a single array into multiple arrays. We'll take a look at those operations here.

First off, concatenation, or joining of two arrays in NumPy, is primarily accomplished using the routines ``np.concatenate``, ``np.vstack``, and ``np.hstack``.
``np.concatenate`` takes a tuple or list of arrays as its first argument, as we can see here:

In [None]:
x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
np.concatenate([x, y])

You can also concatenate more than two arrays at once:

In [None]:
z = [99, 99, 99]
print(np.concatenate([x, y, z]))

It can also be used for two-dimensional arrays:

In [None]:
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

In [None]:
# concatenate along the first axis
np.concatenate([grid, grid])

In [None]:
# concatenate along the second axis (zero-indexed)
np.concatenate([grid, grid], axis=1)

For working with arrays of mixed dimensions, it can be clearer to use the ``np.vstack`` (vertical stack) and ``np.hstack`` (horizontal stack) functions:

In [None]:
x = np.array([1, 2, 3])
grid = np.array([[9, 8, 7],
                 [6, 5, 4]])

# vertically stack the arrays
np.vstack([x, grid])

In [None]:
# horizontally stack the arrays
y = np.array([[99],
              [99]])
np.hstack([grid, y])

Similarly, ``np.dstack`` will stack arrays along the third axis.
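
A short sketch of depth-wise stacking, reusing the same kind of $2 \times 3$ grid as above:

```python
import numpy as np

grid = np.array([[9, 8, 7],
                 [6, 5, 4]])

# stack two 2x3 arrays along a new third axis
stacked = np.dstack([grid, grid])
print(stacked.shape)   # (2, 3, 2)
print(stacked[0, 0])   # [9 9]
```

Each position in the original grid now holds a pair of values, one from each stacked array.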

The opposite of concatenation is splitting, which is implemented by the functions ``np.split``, ``np.hsplit``, and ``np.vsplit``.  For each of these, we can pass a list of indices giving the split points:

In [None]:
x = [1, 2, 3, 99, 99, 3, 2, 1]
x1, x2, x3 = np.split(x, [3, 5])
print(x1, x2, x3)

Notice that $N$ split points lead to $N + 1$ subarrays.
The related functions ``np.hsplit`` and ``np.vsplit`` are similar:

In [None]:
grid = np.arange(16).reshape((4, 4))
grid

In [None]:
upper, lower = np.vsplit(grid, [2])
print(upper)
print(lower)

In [None]:
left, right = np.hsplit(grid, [2])
print(left)
print(right)

Similarly, ``np.dsplit`` will split arrays along the third axis.
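
For completeness, here is a minimal sketch of splitting along the third axis. The array `cube` and the names `front` and `back` are made up for this example.

```python
import numpy as np

cube = np.arange(16).reshape((2, 2, 4))

# split the third axis at index 2: depth 4 becomes 2 + 2
front, back = np.dsplit(cube, [2])
print(front.shape, back.shape)   # (2, 2, 2) (2, 2, 2)
print(front[0, 0], back[0, 0])   # [0 1] [2 3]
```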

### 3.5. Broadcasting

For arrays of the same size, binary operations are performed on an element-by-element basis:

In [None]:
a = np.array([0, 1, 2])
b = np.array([5, 5, 5])
print(a + b)
print(a * b)

Broadcasting allows these types of binary operations to be performed on arrays of different sizes–for example, we can just as easily add a scalar (think of it as a zero-dimensional array) to an array:

In [None]:
print(a + 5)

We can think of this as an operation that stretches or duplicates the value ``5`` into the array ``[5, 5, 5]``, and adds the results.
The advantage of NumPy's broadcasting is that this duplication of values does not actually take place, but it is a useful mental model as we think about broadcasting.

We can similarly extend this to arrays of higher dimension. Observe the result when we add a one-dimensional array to a two-dimensional array:

In [None]:
M = np.ones((3, 3))
print(M)
print(M + a)

Here the one-dimensional array ``a`` is stretched, or broadcast, along the first dimension (down the rows) in order to match the shape of ``M``.

While these examples are relatively easy to understand, more complicated cases can involve broadcasting of both arrays. Consider the following example:

In [None]:
a = np.arange(3)
b = np.arange(3)[:, np.newaxis]

print(a)
print(b)

In [None]:
a + b

Just as before we stretched or broadcast one value to match the shape of the other, here we've stretched *both* ``a`` and ``b`` to match a common shape, and the result is a two-dimensional array!
The geometry of these examples is visualized in the following figure:

![Broadcasting Visual](figures/broadcasting.png)

The light boxes represent the broadcasted values: again, this extra memory is not actually allocated in the course of the operation, but it can be useful conceptually to imagine that it is.
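
If you want to check a broadcast before running it, NumPy can report the resulting shape and even hand you the "stretched" array as a read-only view. The sketch below uses `np.broadcast` and `np.broadcast_to`. A stride of zero along an axis means the same memory is reused for every step along that axis, which is how NumPy avoids the extra allocation.

```python
import numpy as np

a = np.arange(3)                     # shape (3,)
b = np.arange(3)[:, np.newaxis]      # shape (3, 1)

# the common shape both arrays are broadcast to
common_shape = np.broadcast(a, b).shape
print(common_shape)                  # (3, 3)

# a read-only view of `a` stretched to 3x3, without copying any data
stretched = np.broadcast_to(a, (3, 3))
print(stretched)
print(stretched.strides[0])          # 0 (rows all point at the same memory)
print(np.shares_memory(a, stretched))
```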

### 3.6. Boolean operations

NumPy implements comparison operators such as < (less than) and > (greater than) as element-wise universal functions. The result of these comparison operators is always an array with a Boolean data type. All six of the standard comparison operations are available:

In [None]:
x = np.array([1, 2, 3, 4, 5])
print(x < 3)
print(x > 3)
print(x <= 3)  # less than or equal to
print(x >= 3)  # greater than or equal to
print(x != 3)  # not equal to
print(x == 3)  # equal to

It is also possible to do an element-wise comparison of two arrays, and to include compound expressions:

In [None]:
(2 * x) == (x ** 2)  # x ** 2 stands for 'x squared'

To count the number of ``True`` entries in a Boolean array, ``np.count_nonzero`` is useful:

In [None]:
print( np.count_nonzero(x < 3) )  # how many values less than 3?
print( np.sum(x < 3) )            # or equally, you can do this

For multi-dimensional arrays, you can count along different dimensions:

In [None]:
x = np.random.randint(10, size=(3, 4))
print(x)
print(np.sum(x < 6, axis=1)) # how many values less than 6 in each row?

If we're interested in quickly checking whether any or all the values are true, we can use (you guessed it) ``np.any`` or ``np.all``:

In [None]:
# are there any values greater than 8?
np.any(x > 8)

In [None]:
# are there any values less than zero?
np.any(x < 0)

In [None]:
# are all values less than 10?
np.all(x < 10)

In [None]:
# are all values equal to 6?
np.all(x == 6)

``np.all`` and ``np.any`` can be used along particular axes as well. For example:

In [None]:
# are all values in each row less than 8?
np.all(x < 8, axis=1)

A quick warning: Python has built-in ``sum()``, ``any()``, and ``all()`` functions. These have a different syntax than the NumPy versions, and in particular will fail or produce unintended results when used on multidimensional arrays. Be sure that you are using ``np.sum()``, ``np.any()``, and ``np.all()`` for these examples!
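
Here is a small sketch of what goes wrong. The built-in functions iterate over the first axis of a two-dimensional array, so they see a sequence of rows rather than individual values.

```python
import numpy as np

x = np.array([[1, 2],
              [3, 4]])

builtin_total = sum(x)     # adds the rows together: a column-wise sum
numpy_total = np.sum(x)    # adds every element
print(builtin_total)       # [4 6]
print(numpy_total)         # 10

# built-in any() tries to turn each row into a single True/False and fails
try:
    any(x > 3)
    builtin_any_failed = False
except ValueError as err:
    builtin_any_failed = True
    print("built-in any() failed:", err)

print(np.any(x > 3))       # True
```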

Finally, Python's *bitwise logic operators*, ``&``, ``|``, ``^``, and ``~``, allow you to perform logic operations.
As with the standard arithmetic operators, these are element-wise operations on (usually Boolean) arrays.

For example, we can count how many values are greater than 1 and less than 5 as follows:

In [None]:
np.sum((x > 1) & (x < 5))
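
The other operators work the same way. The sketch below applies each of them to a small array so the results are easy to check by hand. Note the parentheses around every comparison: in Python, `&`, `|`, and `^` bind more tightly than `<` and `>`, so leaving them out changes the meaning of the expression.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])

both = (x > 1) & (x < 5)          # AND: greater than 1 and less than 5
either = (x < 2) | (x > 4)        # OR: less than 2 or greater than 4
exactly_one = (x > 1) ^ (x < 5)   # XOR: exactly one of the conditions holds
negated = ~(x > 3)                # NOT: same as x <= 3

print(both)         # [False  True  True  True False]
print(either)       # [ True False False False  True]
print(exactly_one)  # [ True False False False  True]
print(negated)      # [ True  True  True False False]
```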

The following table summarizes the bitwise Boolean operators:

| Operator	    | Name               || Operator	  | Name             |
|---------------|--------------------||---------------|------------------|
|``&``          | element-wise AND   ||&#124;         | element-wise OR  |
|``^``          | element-wise XOR   ||``~``          | element-wise NOT |

In [None]:
# Exercise Problem (Assignment)
a = np.random.randint(10, size=(10, 10))

# is there any zero element in `a`? implement a code to answer the question.

# how many elements in `a` are smaller than 5? implement a code to answer the question.

# multiply 5 to the elements of `a` ONLY WHEN they are smaller than 5.


### [NOTE] NumPy for MATLAB Users
If you learned MATLAB before, you might have already noticed that NumPy is kinda sorta similar to MATLAB. In fact, there is a whole online guide called '[NumPy for MATLAB Users](https://docs.scipy.org/doc/numpy-1.15.0/user/numpy-for-matlab-users.html)'. You may find it quite useful.
