<a href="https://colab.research.google.com/github/subhacom/np_tut_breastcancer/blob/master/colab_wisconsin_breast_cancer_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

To navigate the entries using the keyboard, press the `down arrow` key to go down and the `up arrow` key to go up.

In [0]:
# These can be used for installing RISE, which turns the notebook
# into a slide show

# !pip install RISE
# !jupyter-nbextension install rise --py --sys-prefix

In [0]:
# This forces matplotlib plots to be displayed inline.
# Otherwise you'll have to call plt.show() to make the plots visible

%matplotlib inline

# Using Jupyter



## What is Jupyter?
From Jupyter home page: "*... an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text.*"

## Pros and Cons of Python in Jupyter Notebook

### Pros:
  - Literate programming - document code with a narrative
  - Easily share online
  - Good for data exploration

### Cons:
  - Not modular (can be ... in a complicated way)
  - Gets cumbersome quickly - especially when going back and forth
  - Very basic code editor
 
  
  
  

**Spyder** - a Python IDE with a MATLAB-like interface.

## About Jupyter environments

- You can install Python/Jupyter on your own computer

- Or you can use environments available on the cloud
  - binder
  - colaboratory
  - kaggle
  - azure

- See https://www.dataschool.io/cloud-services-for-jupyter-notebook/ for a comparison of some of the popular ones.

#### Keyboard shortcuts:
 - If using `colab`, check `Tools` menu.
 - In `binder`, check `Help` menu.

## Local vs Cloud for Jupyter Notebooks

### Pros of local environment
 1. You have full control of the system
 2. No time limit
 3. No data limit - you can use your data on a local disk - no need to upload data
 4. Full function of Jupyter notebooks 
 5. No security restrictions
  



### Cons of local environments
 1. You have to set up and manage it
 2. You are limited by the hardware you have (may be expensive)
 3. Not the optimal use of resources
 4. Not easy to access from elsewhere
 5. Not easy to share and collaborate with others
 
  


## Jupyter basics

 - Cells: text boxes where you enter code or `markdown` text

 - Cells are created as code cells by default, like the one below. You can press `Shift+Enter` to run the code and go to next cell.

In [0]:
print('Hello world')

- Contents of code cells are executed: they can be Python statements, magic commands or operating system commands.

 - A code cell can be turned into markdown by pressing 
  - `Control+M` followed by `M` in `binder`
  - `Escape` followed by `M` in Jupyter

This is a markdown cell. You can double click this cell or press `Enter` after selecting it in order to edit the contents.
This applies to code cells, too.

All the text in this notebook is markdown.

You can learn more about how to write markdown text here: https://daringfireball.net/projects/markdown/basics

- Code cells are indicated by an empty pair of brackets before them. You can also recognize them from syntax highlighting.
- Keep pressing `Shift+Enter` to move down this notebook and run the code wherever appropriate.

In [0]:
message = 'Hello world'  # variable assignment
print(message)   # function call, argument
# `print` is a builtin function in Python - it is part of the Python interpreter

In [0]:
print(f'Message: {message}')

- `f'text {variable} text'` is a new and convenient way of formatting strings in Python (version 3.6 onwards).

- Magic commands: they are special commands available in Jupyter (they are not Python code, but they work in code cells). 

  For example:

In [0]:
%ls

- OS commands: you can run commands on your operating system (colab and binder are Unix-like environments, so they can take many Unix commands). Start them with "`!`".
 For example, you can print information about CPU of the system using the command below:

In [0]:
!cat /proc/cpuinfo

- Press `Control+s` to save and checkpoint the notebook in `binder` or local Jupyter.  
 - When you do this Jupyter creates a hidden file for you containing the current state of the notebook.
 - If you mess up, you can go back to the last checkpoint using `File->Revert to checkpoint` in Jupyter menu.
  - `colab` keeps version info like other google docs.

- The native file format for Jupyter notebooks is the IPython notebook format. The extension is `.ipynb`.
- You can download your edited notebook from the `File` menu.

#### Pause here: Exercise
  1. create a cell
  1. convert it to markdown
  1. create another cell below
  1. type a valid expression: e.g. (2 + 3)
  1. execute the cell in place
  1. execute and insert a new cell below and move to it
  1. insert a cell above
  1. execute and move to next cell
  
  

# Tutorial on using numpy and matplotlib

Import required libraries (modules). `io` is part of the Python Standard Library; `requests` is a third-party library.

In [0]:
import io          # provides `StringIO` mimicking a file from a string
import requests    # for accessing HTTP resources

- `numpy`, `pandas` and `matplotlib` are third-party libraries
- installed separately from `Python` 
- **Anaconda** python distribution - bundles most common scientific libraries

In [0]:
import numpy as np  # create a shorter alias `np` for numpy

In [0]:
import pandas as pd

In [0]:
from matplotlib import pyplot as plt

## Homogeneous arrays in `numpy`

An array whose elements are all of the same type.

### Basic operations

In [0]:
myarray = np.array([1, 2, 3])   # `[1, 2, 3]` is a `list`

In [0]:
print(myarray)

In [0]:
myarray   # this may display more info than print
# but it only works in an interactive session:
# inside a script, a bare expression displays nothing

### Multi-dimensional arrays

In [0]:
myarray2 = np.array([[1, 2, 3, 4], 
                     [5, 6, 7, 8]])
# you could write that in one line 
# I broke it up for readability

In [0]:
myarray2

In [0]:
myarray2.shape

In [0]:
len(myarray2)

In [0]:
myarray2.T   # Matrix transpose, i.e. rows become columns and vice versa

### Element-wise arithmetic on arrays

- Operations with scalars are applied to each element in an array

In [0]:
2 * myarray

In [0]:
2**myarray

In [0]:
0.5 * myarray2

In [0]:
myarray2**2

In [0]:
myarray2 + 2

In [0]:
print(f'Shape of first array: {myarray.shape}, second array: {myarray2.shape}')
x = myarray + myarray2
print(f'Sum: {x}')

#### Pause here: You just encountered Python's error handling mechanism!
- Can you relate to your experience with errors in other languages?
- Go back and change `myarray` to a 4-element array, then to a 2-element array, and rerun the cell each time. Which one works?
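
The error above comes from NumPy's broadcasting rule: shapes are compared starting from the last dimension, and each pair of sizes must either be equal or contain a 1. The sketch below illustrates the rule; the names `grid`, `row` and `col` are made up for this example and are not used elsewhere in the notebook.

```python
import numpy as np

grid = np.array([[1, 2, 3, 4],
                 [5, 6, 7, 8]])          # shape (2, 4)

row = np.array([10, 20, 30, 40])         # shape (4,): matches the last dimension of grid
print(grid + row)                        # row is added to every row of grid

col = np.array([[100], [200]])           # shape (2, 1): the size-1 axis is stretched
print(grid + col)                        # col is added to every column of grid

try:
    grid + np.array([1, 2, 3])           # shape (3,) against (2, 4): 3 != 4
except ValueError as err:
    print(f'Broadcast error: {err}')
```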


### Other ways of creating arrays
 - These are inspired by MATLAB

In [0]:
a0 = np.zeros((4, 5))

In [0]:
a0

In [0]:
a1 = np.ones((4, 5))

In [0]:
a1

In [0]:
ai = np.eye(4)   # NxN identity matrix

In [0]:
ai

#### Arrays containing range of values

In [0]:
a2 = np.arange(3.0, 8.0, 2.0)  
# array of numbers starting with 3.0, at increments of 2.0, less than 8.0

In [0]:
a2

In [0]:
a3 = np.linspace(2, 7, 3)    
# 3 evenly spaced values from 2 to 7, including both 2 and 7

In [0]:
a3
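
A side-by-side sketch of the two functions may help: `arange` is driven by a step size and excludes the stop value, while `linspace` is driven by the number of points and includes both endpoints. The variable names here are illustrative only.

```python
import numpy as np

by_step = np.arange(3.0, 8.0, 2.0)     # step of 2.0; the stop value 8.0 is excluded
by_count = np.linspace(2, 7, 3)        # 3 points; both endpoints 2 and 7 are included

print(by_step)     # values 3.0, 5.0, 7.0
print(by_count)    # values 2.0, 4.5, 7.0
```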

#### Array with random numbers

In [0]:
a4 = np.random.rand(3, 4)

In [0]:
a4

### Reshaping

In [0]:
a4 = np.arange(12)
a4

In [0]:

a4.shape

In [0]:
a4.reshape(3, 4)

#### Pause here
 - What other shapes can you think of?

In [0]:
a4.reshape(2, 2, 3)
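
One more option worth knowing: you can pass `-1` for one dimension and let `numpy` work out its size from the total number of elements. A small sketch, using a throwaway array `a`:

```python
import numpy as np

a = np.arange(12)
print(a.reshape(4, -1).shape)     # -1 is inferred as 3, giving (4, 3)
print(a.reshape(-1, 6).shape)     # -1 is inferred as 2, giving (2, 6)

try:
    a.reshape(5, -1)              # 12 elements cannot be split into 5 equal rows
except ValueError as err:
    print(f'Cannot reshape: {err}')
```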

### Slicing and indexing
  - Unlike R and MATLAB, Python indexing starts at 0

In [0]:
myarray2

In [0]:
myarray2[0, 0]

#### You can slice and dice an array

In [0]:
myarray = np.arange(10)
myarray

  - `a[start:stop:step]`  

In [0]:
myarray[1:8:2]

  - `a[start:stop]` - `step` defaults to `1`, i.e. shortcut for `a[start:stop:1]`

In [0]:
myarray[1:5]

  - `a[start:]` - `stop` defaults to end of array

In [0]:
myarray[5:]

  - `a[:stop]` - `start` defaults to start of array

In [0]:
myarray[:5]

  - `a[:]` - view of the whole array

In [0]:
myarray[:]
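
A "view" shares its memory with the original array, so writing through a slice changes the original. The sketch below contrasts a view with an explicit copy; `base`, `view` and `copy` are names chosen only for this illustration.

```python
import numpy as np

base = np.arange(10)
view = base[:]          # a view: shares memory with base
copy = base.copy()      # an independent copy

view[0] = -1            # also changes base
copy[1] = -2            # leaves base untouched

print(base[:2])                         # first element is now -1, second is still 1
print(np.shares_memory(base, view))     # True
print(np.shares_memory(base, copy))     # False
```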

- For multidimensional arrays each dimension can be sliced 

In [0]:
myarray2 = np.arange(100).reshape(10, 10)
myarray2

In [0]:
myarray2[:, 1]

In [0]:
myarray2[4, ::2]

In [0]:
myarray2[5:, ::2]

### Assign element values

In [0]:
myarray2 = np.arange(25).reshape(5, 5)
myarray2

In [0]:
myarray2[0, 0] = -10

In [0]:
myarray2

In [0]:
myarray2[:, 0] = -1  # broadcast

In [0]:
myarray2

#### Pause here
 - How will you set all of first row to 0?
 - How will you set every other column to 0?

### Check conditions on array elements


#### Condition check in simple Python

```python
if a < 0:
    print('a is negative')
```

---
```python
if a < 0:
    print('a is negative')
else:
    print('not negative')
```

---
```python
if a < 0:
    print('a is negative')
elif a > 0:
    print('a is positive')
else:
    print('a == 0')
```

You can put as many `elif`s as you need.

#### Array comparison and boolean arrays




In [0]:
myarray = np.array([1, 2, 3, 4])
myarray > 2

#### Functions for condition check

In [0]:
np.nonzero(myarray > 2)

In [0]:
np.where(myarray > 2)

- `nonzero` and `where` (called with only a condition) return the indices of the nonzero elements. `False` is numerically `0` in Python, so only the `True` positions are reported.

In [0]:
myarray2 = np.arange(9).reshape(3,3)
myarray2

In [0]:
np.nonzero(myarray2 > 3)

In [0]:
np.where(myarray2 > 3)

- For arrays with 2 or more dimensions, these functions return a tuple of index arrays: the indices of the nonzero elements along the first dimension, followed by those along the second dimension, etc.

In [0]:
np.where(myarray2 > 3, 'X', 'Y')

#### Pause here: Check the documentation on `where`
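
As a rough summary of the two calling styles of `where` (a sketch on a small throwaway array `grid`, not part of the original notebook): with one argument it returns indices, with three arguments it builds a new array by choosing between two alternatives.

```python
import numpy as np

grid = np.arange(9).reshape(3, 3)

rows, cols = np.where(grid > 3)                    # one index array per dimension
print(list(zip(rows.tolist(), cols.tolist())))     # (row, column) pairs of the matches
print(grid[rows, cols])                            # the matching values themselves

# With three arguments, where() takes from the second argument where the
# condition is True and from the third where it is False.
clipped = np.where(grid > 3, grid, 0)
print(clipped)
```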

## Array of heterogeneous data
#### Pause here: let us discuss data types
- Anybody noticed `dtype` so far?
- What are data types?
- What are some of the  data types in Python?


In [0]:
harray = np.array([(1, 's', 3.14)])

In [0]:
harray

In [0]:
harray = np.array([(1, 's', 3.14)], dtype=[('a', 'i8'), ('b', 'S1'), ('c', 'f4')])

In [0]:
harray['a']

In [0]:
harray[['a', 'b']]
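
To see what the `dtype` argument changes, the following sketch builds the array both ways. Without a `dtype`, `numpy` falls back to a single common type (text here); with a list of `(name, type)` pairs each field keeps its own type. The names `plain`, `record_type` and `records` are illustrative.

```python
import numpy as np

plain = np.array([(1, 's', 3.14)])
print(plain.dtype, plain.shape)    # a string dtype, shape (1, 3): everything became text

record_type = [('a', 'i8'), ('b', 'S1'), ('c', 'f4')]
records = np.array([(1, 's', 3.14), (2, 't', 2.72)], dtype=record_type)
print(records.dtype.names)         # ('a', 'b', 'c')
print(records.shape)               # (2,): one element per record
print(records['a'] + 1)            # field 'a' is numeric, so arithmetic works
```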

## Using real data
We shall use data from the Wisconsin breast cancer database available online.

In [0]:
data_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data'
attr_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.names'

- Information about the data is available, in human-readable plain text, at the internet location stored in `attr_url`.
- We shall retrieve it over HTTP using functions in the `requests` library. 
- If working on a local computer, we could just download the file manually.

In [0]:
attrs = requests.get(attr_url)

Skim through the attributes. Without this information (metadata), the data is a meaningless jumble of numbers. Pay attention to items 5-8.

In [0]:
print(attrs.content.decode('utf-8'))

Retrieve the data from the web server into a string.

In [0]:
data_str = requests.get(data_url).content.decode('utf-8')

In [0]:
print(data_str[:100])

In [0]:
type(data_str)   # get the data type of the argument

### Read the data into a numpy array.
- With `python` on local computer, you could simply download the data file manually and load it from local disk like: 

    `np.loadtxt(filename, other arguments)`

    or

    `np.genfromtxt(filename, other arguments)`. 

- But here the data is in a string, not in a file, so we need to mimic a file with a string. `StringIO` helps with that.

In [0]:
data = np.loadtxt(io.StringIO(data_str), delimiter=',')

#### Notice two levels of error pointer (-->)
 - This is called a stack trace. When one function calls another function, which calls another, and so on, and an inner function encounters an error, the trace shows the outermost function first and then goes deeper and deeper until it reaches the source of the error.
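
The failure can be reproduced offline on a tiny example. The sketch below wraps two short strings in `StringIO` so that `loadtxt` can read them like files; the rows in `sample` are made up in the same comma-separated layout, and the `try`/`except` block catches the error instead of showing the full stack trace.

```python
import io
import numpy as np

clean = np.loadtxt(io.StringIO('1,2,3\n4,5,6\n'), delimiter=',')
print(clean.shape)      # (2, 3)

# Two made-up rows; the second one contains a '?'.
sample = '1000025,5,1\n1057013,8,?\n'
try:
    np.loadtxt(io.StringIO(sample), delimiter=',')
except ValueError as err:
    # loadtxt converts every field to float by default and gives up on '?'
    print(f'loadtxt failed: {err}')
```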

In [0]:
data = np.loadtxt(io.StringIO(data_str), delimiter=',', dtype=int)

In [0]:
data = np.loadtxt(io.StringIO(data_str), delimiter=',', dtype=str)

In [0]:
print(data[20:25])

This is not very useful - we want numbers as numeric types, not as text strings, for analysis.

#### `genfromtxt`: a function that can guess data types when reading text files

In [0]:
data = np.genfromtxt(io.StringIO(data_str), delimiter=',', dtype=None)

In [0]:
data[20:25]

#### You can tell it what represents missing values

In [0]:
data = np.genfromtxt(io.StringIO(data_str), delimiter=',', dtype=None, missing_values='?')

In [0]:
data[20:25]

#### You can specify what to put for missing values

In [0]:
data = np.genfromtxt(io.StringIO(data_str), delimiter=',', dtype=None, missing_values='?', filling_values=-999)

In [0]:
data[20:25]

**Could be better!**
- We can define the field names and their data type here. 


### Check for missing values
- What did we find about representation of missing data in this case?
- We can try to compare the element values to this missing data representative.
- `numpy` function `where` lets us check a condition on each element of an array.

In [0]:
np.where(data == -999)

### Define column/field names and type

In [0]:
dtype = [('SCN', 'U8'), ('thickness', int), ('sizeu', int), ('shapeu', int), 
         ('adhesion', int), ('csize', int), ('bare', int), ('blandchrom', int), 
         ('normncl', int), ('mitoses', int), ('cclass', int)]

- `('SCN', 'U8')` above says that the first column is to be named `SCN` and read as a Unicode string of up to 8 characters.
- Types can also be given as string codes, for example:
  - `'i4'` - 4 byte integer
  - `'f8'` - 8 byte float


In [0]:
data = np.genfromtxt(io.StringIO(data_str), delimiter=',', dtype=dtype, missing_values='?', filling_values=-999)

In [0]:
print(data.shape)

In [0]:
print(data[20: 25])

**Be extra careful about how missing values are represented!**
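
The effect of `missing_values` and `filling_values` is easier to see on a handful of rows. This is a minimal sketch with made-up rows and a reduced set of columns, assuming (as in the real file) that `?` marks a missing entry; `sample`, `fields` and `records` are names invented for the example.

```python
import io
import numpy as np

# Three made-up rows; '?' marks a missing value in the third column.
sample = '1000025,5,1,2\n1057013,8,?,4\n1096800,6,10,2\n'
fields = [('SCN', 'U8'), ('thickness', int), ('bare', int), ('cclass', int)]

records = np.genfromtxt(io.StringIO(sample), delimiter=',', dtype=fields,
                        missing_values='?', filling_values=-999)

print(records['bare'])                              # the '?' was replaced by -999
print(np.flatnonzero(records['bare'] == -999))      # row numbers holding the sentinel
```

A sentinel such as `-999` only works if it can never occur as a real value in the column, which is why the choice deserves care.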

### Accessing fields in the data

In [0]:
data['SCN']

In [0]:
data[2]

In [0]:
data['SCN'][2]

In [0]:
data[2]['SCN']

In [0]:
data[2][0]

In [0]:
data[2, 0]

- The above threw an error because `data` is now a 1-dimensional array, each element of which is a record (a structure with named fields). 
- `[2, 0]` expects a 2D array.
- `data[2][0]` takes out the record that is element #2 and then looks at field #0 of that record.

**We have a record/structured array with 1 dimension**
 - Each element is a record / structure
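
The same behaviour can be checked on a small structured array built by hand. The rows below are made up for illustration; only the field names follow the notebook.

```python
import numpy as np

fields = [('SCN', 'U8'), ('thickness', int), ('cclass', int)]
records = np.array([('1000025', 5, 2), ('1002945', 5, 2), ('1015425', 3, 4)],
                   dtype=fields)

print(records.shape)             # (3,): one dimension, three records
print(records[2])                # a whole record
print(records['thickness'][2])   # the field (column) first, then the row
print(records[2]['thickness'])   # the row first, then the field: same value

try:
    records[2, 0]                # two indices need two dimensions
except IndexError as err:
    print(f'IndexError: {err}')
```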

#### Can we use `numpy.where` to check a condition on each element of a structured array?

In [0]:
np.where(data == -999)

#### Pause here: Let us try one column at a time.
 - How can we check the columns one by one?
 - Hint: loop, `np.dtype`, `np.issubdtype`

In [0]:
data.dtype['SCN']

In [0]:
data.dtype['thickness']

In [0]:
data.dtype.names

#### Pause here: loops in Python
- `for` loop: for looping over a sequence of elements.

```python
x = ['alpha', 'bravo', 'charlie', 'delta']
for ii in x:
    print(ii)
```

- `while` loop: for conditional looping.

```python
x = 0
while x < 10:
    print(x)
    x += 1
```

In [0]:
for name in data.dtype.names:
    print(f'Column: {name}')
    if np.issubdtype(data.dtype[name], np.number):
        missing = np.where(data[name] < 0)
        if missing[0].shape[0] > 0:
            print(f'        missing data in rows: {missing}')

## Selection

- You can pass an array or list of indices to select rows

In [0]:
data[[1, 100, 500]]   # pick row numbers 1, 100, and 500

- You can select rows by condition

In [0]:
data[data['cclass'] == 2]
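
Selecting by condition works because the comparison produces a boolean array (a mask) with one entry per row. The sketch below uses a few made-up rows to show the mask itself and how to combine conditions; in the real data, class `2` is benign and class `4` is malignant.

```python
import numpy as np

fields = [('thickness', int), ('cclass', int)]
records = np.array([(5, 2), (8, 4), (1, 2), (10, 4), (3, 2)], dtype=fields)

benign = records['cclass'] == 2                    # boolean array, one entry per row
print(benign)
print(records[benign])                             # rows where the mask is True

# Combine conditions with & (and), | (or), ~ (not); the parentheses are required.
thick_malignant = (records['cclass'] == 4) & (records['thickness'] > 8)
print(records[thick_malignant])
print(np.count_nonzero(benign))                    # number of selected rows
```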

## Importing data using Pandas

In [0]:
pdata = pd.read_csv(data_url, names=data.dtype.names)

In [0]:
pdata.columns

In [0]:
pdata.iloc[23]

In [0]:
pdata[['SCN', 'thickness', 'cclass']]
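
Note that the `read_csv` call above does not say how missing values are marked, so a column containing `?` is likely to be read as text. One possible approach, sketched here on made-up rows with a reduced set of columns, is to pass `na_values='?'` so that pandas stores those entries as `NaN`.

```python
import io
import pandas as pd

# Three made-up rows; '?' marks a missing value in the third column.
sample = '1000025,5,1,2\n1057013,8,?,4\n1096800,6,10,2\n'
frame = pd.read_csv(io.StringIO(sample),
                    names=['SCN', 'thickness', 'bare', 'cclass'],
                    na_values='?')

print(frame.dtypes)            # 'bare' becomes float because NaN is a float value
print(frame.isna().sum())      # number of missing values per column
print(frame.dropna())          # only the rows without any missing value
```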

## Plotting data

In [0]:
plt.ion()  # Make it interactive

In [0]:
# plt.ioff()

In [0]:
plt.plot(data['thickness'])   # by default plot connects the data points with straight lines

If you do not already see a plot above, you may need to run `plt.show()` below to display it.

In [0]:
plt.show()

In [0]:
plt.plot(data['thickness'], 'o')   
# Plot thickness (Y) against row number (X) with circles as markers

In [0]:
plt.plot(data['thickness'], data['csize'], 'o') 
# Plot csize (Y) against thickness (X) with circles as markers

In [0]:
plt.plot(data['sizeu'], data['csize'], 'ro')   # red circle markers

In [0]:
plt.plot(data['mitoses'], data['csize'], 'go-')  
# green circle markers at data points connected by lines of the same color

In [0]:
plt.hist(data['csize'])

#### Box plots - apply selection criterion

In [0]:
data.dtype

In [0]:
x = plt.boxplot([data['thickness'][data['cclass'] == 2], data['thickness'][data['cclass'] == 4]])
# boxplot returns a bunch of resulting values that are otherwise printed 
# in the output before the plots
# `x = ` part captures those in variable x and prevents the printing

### Adding legend

In [0]:
plt.plot(data['cclass'], data['sizeu'], 'ro', label='Size')
plt.plot(data['cclass'], data['thickness'], 'gv', label='thickness')
plt.legend()

### Add a little jitter and transparency

In [0]:
plt.plot(data['cclass']+np.random.rand(data.shape[0]), data['sizeu'], 'r.', alpha=0.5, label='Size')
plt.plot(data['cclass']+np.random.rand(data.shape[0]), data['thickness'], 'g.', alpha=0.5, label='thickness')
plt.legend()
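
`np.random.rand` draws values between 0 and 1, so the jitter above shifts every point to the right of its class value. A possible variation, sketched with made-up data and a seeded generator, is to draw the jitter symmetrically around zero so that the clouds stay centered on the class values.

```python
import numpy as np
from matplotlib import pyplot as plt

rng = np.random.default_rng(0)                 # seeded so the figure is reproducible
cclass = np.array([2, 2, 4, 4, 2, 4])          # made-up class labels
sizeu = np.array([1, 3, 8, 10, 2, 7])          # made-up values

jitter = rng.uniform(-0.3, 0.3, size=cclass.shape[0])   # symmetric around 0
fig, ax = plt.subplots()
ax.plot(cclass + jitter, sizeu, 'r.', alpha=0.5, label='sizeu')
ax.set_xticks([2, 4])                          # keep the ticks on the true class values
ax.legend()
```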

### Subplots


In [0]:
# split the figure into 4 axes: 2 rows and 2 columns
ax0 = plt.subplot(2, 2, 1)   # top left axis
ax1 = plt.subplot(2, 2, 2)   # top right axis
ax2 = plt.subplot(2, 2, 3)   # bottom left axis
ax3 = plt.subplot(2, 2, 4)   # bottom right axis

ax0.plot(data['cclass']+np.random.rand(data.shape[0]), data['sizeu'], 'r.', alpha=0.5, label='Size')
ax1.boxplot([data['sizeu'][data['cclass'] == 2], data['sizeu'][data['cclass'] == 4]])
ax2.plot(data['cclass']+np.random.rand(data.shape[0]), data['thickness'], 'r.', alpha=0.5, label='thickness')
x = ax3.boxplot([data['thickness'][data['cclass'] == 2], data['thickness'][data['cclass'] == 4]])

#### Sharing axis scales between subplots

In [0]:
ax0 = plt.subplot(2, 2, 1)
ax1 = plt.subplot(2, 2, 2, sharey=ax0)

In [0]:
fig, ax = plt.subplots(nrows=2, ncols=2, sharey='row')
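
To check what sharing an axis actually does, the sketch below plots two made-up data sets with very different ranges side by side. With `sharey=True` both panels end up with the same y-limits, so the heights can be compared directly.

```python
import numpy as np
from matplotlib import pyplot as plt

values_a = np.array([1, 2, 3, 4])        # made-up data with a small range
values_b = np.array([10, 20, 30, 40])    # made-up data with a larger range

fig, axes = plt.subplots(nrows=1, ncols=2, sharey=True)
axes[0].plot(values_a, 'o')
axes[1].plot(values_b, 'o')
axes[0].set_title('small range')
axes[1].set_title('large range')

# Both panels report the same y-limits because the axis is shared.
print(axes[0].get_ylim() == axes[1].get_ylim())
```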

### Layout control
 See https://matplotlib.org/3.1.0/tutorials/intermediate/gridspec.html

### Further information on plotting
 - See https://matplotlib.org/gallery.html

### Other useful libraries
  1. Multidimensional arrays with complex data and metadata: `xarray`
  1. Statistics: `statsmodels`
  1. Broad range of basic functions in science and engineering: `scipy`
  1. Image-processing: `PIL`, `opencv`, `imagej`
  1. Machine learning: `scikit-learn` (`sklearn`)
  1. Network analysis: `networkx`, `igraph`
  1. Graphical User Interface (GUI): `PyQt`, `ipywidgets`
  1. Efficient plotting: `pyqtgraph`

# Thank you