[![Launch on Google Colab](https://badgen.net/badge/Launch/on%20Google%20Colab/blue?icon=terminal)](https://colab.research.google.com/github/subhacom/np_tut_breastcancer/blob/master/Wisconsin_breast_cancer_data.ipynb)
[![Launch on Binder](https://mybinder.org/badge_logo.svg)](http://mybinder.org/v2/gh/subhacom/np_tut_breastcancer/master)


In [None]:
!pip install RISE
!jupyter-nbextension install rise --py --sys-prefix

In [None]:
%matplotlib inline

# Using Jupyter



## What is Jupyter?
From the Jupyter home page: "*... an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text.*"

## Pros and Cons of Python in Jupyter Notebook

### Pros:
  - Literate programming - document code with a narrative
  - Easily share online
  - Good for data exploration

### Cons:
  - Not modular (can be ... in a complicated way)
  - Gets cumbersome quickly - especially when going back and forth
  - Very basic code editor
 
  
  
  

**Spyder** - a Python IDE with a MATLAB-like interface.

## About Jupyter environments

- You can install Python/Jupyter on your own computer

- Or you can use environments available on the cloud
  - binder
  - colaboratory
  - kaggle
  - azure

- See https://www.dataschool.io/cloud-services-for-jupyter-notebook/ for a comparison of some of the popular ones.

## Local vs Cloud for Jupyter Notebooks

### Pros of local environment
 1. You have full control of the system
 2. No time limit
 3. No data limit - you can use your data on a local disk - no need to upload data
 4. Full function of Jupyter notebooks 
 5. No security restrictions
  



### Cons of local environments
 1. You have to set up and manage it
 2. You are limited by the hardware you have (may be expensive)
 3. Not the optimal use of resources
 4. Not easy to access from elsewhere
 5. Not easy to share and collaborate with others
 
  


## Jupyter basics
- Cells: Code and markdown

In [None]:
print('Hello world')

- Output

In [None]:
message = 'Hello world'  # variable assignment
print(message)   # function call, argument

In [None]:
print(f'Message: {message}')

- `f'text {variable} text'` is a new and convenient way of formatting strings in Python (version 3.6 onwards).
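
As a small illustration of f-strings, the sketch below embeds an expression and applies a format specifier inside the braces. The names `radius` and `area` are made up for this example.

```python
import math

radius = 2.5  # hypothetical value for illustration
area = math.pi * radius**2

# Any expression can go inside the braces
print(f'Twice the radius: {2 * radius}')

# A format specifier after ':' controls how the value is shown
formatted = f'Area: {area:.2f}'
print(formatted)
```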

- Magic commands

In [None]:
%ls

- OS commands

In [None]:
!cat /proc/cpuinfo

- Check the keyboard shortcuts from menu (`Help` in binder, `Tools` in colab menu).

- Save and checkpoint
- File format: .ipynb

#### Pause here: Exercise
  1. create a cell
  1. convert it to markdown
  1. create another cell below
  1. type a valid expression: e.g. (2 + 3)
  1. execute the cell in place
  1. execute and insert a new cell below and move to it
  1. insert a cell above
  1. execute and move to next cell
  
  

# Tutorial on using numpy and matplotlib

Import required libraries (modules). `io` is part of the Python standard library; `requests` is a third-party package.

In [None]:
import io
import requests

- `numpy`, `pandas` and `matplotlib` are third-party libraries
- installed separately from `Python` 
- **Anaconda** python distribution - bundles most common scientific libraries

In [None]:
import numpy as np  # create a shorter alias for numpy

In [None]:
import pandas as pd

In [None]:
from matplotlib import pyplot as plt  # import only specific part of a module

## Homogeneous arrays in `numpy`

### Basic operations

In [None]:
myarray = np.array([1, 2, 3])

In [None]:
print(myarray)

In [None]:
myarray

### Multi-dimensional arrays

In [None]:
myarray2 = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

In [None]:
myarray2

In [None]:
myarray2.shape

In [None]:
len(myarray2)

In [None]:
myarray2.T

### Element-wise arithmetic on arrays

In [None]:
2 * myarray

In [None]:
2**myarray

In [None]:
2*myarray2

In [None]:
myarray2**2

In [None]:
myarray2 + 2

In [None]:
print(f'Shape of first array: {myarray.shape}, second array: {myarray2.shape}')
x = myarray + myarray2
print(f'Sum: {x}')

#### Pause here: You just encountered Python's error handling mechanism!
- Can you relate to your experience with errors in other languages?
- Explain error handling for beginners.
- Go back and change `myarray` to a 4 element array, a 2 element array
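
One way to explore the error is to catch it with `try`/`except`. The sketch below uses stand-in arrays `a`, `b` and `c` (names chosen for this example) and assumes the failure is the shape mismatch seen above.

```python
import numpy as np

a = np.array([1, 2, 3])                      # shape (3,)
b = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])   # shape (2, 4)

try:
    total = a + b
except ValueError as err:
    # numpy raises ValueError when shapes cannot be broadcast together
    print(f'Could not add: {err}')
    total = None

# A 4-element array matches the last dimension of b, so it is
# broadcast across both rows
c = np.array([10, 20, 30, 40])
print(b + c)
```

A 2-element array fails in the same way as the 3-element one, because its length does not match the last dimension of `b`.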


### Other ways of creating arrays

In [None]:
a0 = np.zeros((4, 5))

In [None]:
a0

In [None]:
a1 = np.ones((4, 5))

In [None]:
a1

In [None]:
ai = np.eye(4)

In [None]:
ai

#### Arrays containing range of values

In [None]:
a2 = np.arange(2.0, 7.0, 1.5)

In [None]:
a2

In [None]:
a3 = np.linspace(2, 7, 3)

In [None]:
a3

#### Array with random numbers

In [None]:
a4 = np.random.rand(3, 4)

In [None]:
a4

### Reshaping

In [None]:
a4.shape

In [None]:
a4.reshape(4, 3)

#### Pause here
 - What other shapes can you think of?

In [None]:
a4.reshape(2, 2, 3)

### Slicing and indexing
  - Unlike R and MATLAB, Python indexing starts at 0

In [None]:
myarray2[0, 0]

#### You can slice and dice an array

In [None]:
myarray = np.arange(10)
myarray

  - `a[start:stop:step]`  

In [None]:
myarray[1:8:2]

  - `a[start:stop]` - `step` defaults to `1`, i.e. shortcut for `a[start:stop:1]`

In [None]:
myarray[1:5]

  - `a[start:]` - `stop` defaults to end of array

In [None]:
myarray[5:]

  - `a[:stop]` - `start` defaults to start of array

In [None]:
myarray[:5]

  - `a[:]` - view of the whole array

In [None]:
myarray[:]
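
Because a slice is a view, writing through it changes the original array. A minimal sketch, using the made-up names `base`, `view` and `copy`:

```python
import numpy as np

base = np.arange(5)
view = base[:]        # slicing returns a view, not a copy
view[0] = 99
print(base)           # the original array changed too

copy = base.copy()    # an explicit copy is independent
copy[1] = -1
print(base)           # unchanged by the edit to the copy
```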

- For multidimensional arrays each dimension can be sliced 

In [None]:
myarray2 = np.arange(100).reshape(10, 10)
myarray2

In [None]:
myarray2[:, 1]

In [None]:
myarray2[4, ::2]

In [None]:
myarray2[5:, ::2]

### Assign element values

In [None]:
myarray2 = np.arange(25).reshape(5, 5)
myarray2

In [None]:
myarray2[0, 0] = -10

In [None]:
myarray2

In [None]:
myarray2[:, 0] = -1  # broadcast

In [None]:
myarray2

#### Pause here
 - How will you set all of first row to 0?
 - How will you set every other column to 0?
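
One possible answer, sketched on a stand-in array `arr` so that `myarray2` is left untouched. "Every other column" is taken here to mean columns 0, 2, 4.

```python
import numpy as np

arr = np.arange(25).reshape(5, 5)  # stand-in for myarray2

arr[0, :] = 0      # every column of the first row
arr[:, ::2] = 0    # columns 0, 2, 4 (every other column)
print(arr)
```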

### Check conditions on array elements


#### Condition check in simple Python

```python

if a < 0:
  print('a is negative')
```

---
```python

if a < 0:
  print('a is negative')
else:
  print('not negative')
  ```

---
```python

if a < 0:
  print('a is negative')
elif a > 0:
  print('a is positive')
else:
  print('a == 0')
  ```
  You can put as many `elif`s as you need.

#### Array comparison and boolean arrays




In [None]:
myarray = np.array([1, 2, 3, 4])
myarray > 2

#### Functions for condition check

In [None]:
np.nonzero(myarray > 2)

In [None]:
np.where(myarray > 2)

In [None]:
myarray2 = np.arange(9).reshape(3,3)
myarray2

In [None]:
myarray2 > 3

In [None]:
np.nonzero(myarray2 > 3)

In [None]:
np.where(myarray2 > 3)

In [None]:
np.where(myarray2 > 3, 'X', 'Y')

#### Pause here: Check the documentation on `where`
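
As a starting point for reading the documentation, the sketch below contrasts the one-argument and three-argument forms of `np.where` on a stand-in array `arr`.

```python
import numpy as np

arr = np.arange(9).reshape(3, 3)

# One argument: a tuple of index arrays, one per dimension
rows, cols = np.where(arr > 3)
print(rows, cols)

# The index arrays can be used to pull out the matching elements
print(arr[rows, cols])

# Three arguments: choose element by element between the second
# argument (condition true) and the third (condition false)
clipped = np.where(arr > 3, arr, 0)
print(clipped)
```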

## Array of heterogeneous data
#### Pause here: let us discuss data types
- Anybody noticed `dtype` so far?
- What are data types?
- What are some of the  data types in Python?
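
To seed the discussion, this sketch prints the `dtype` that numpy infers for a few small arrays (all names are illustrative). The exact integer width depends on the platform, so no output is shown.

```python
import numpy as np

ints = np.array([1, 2, 3])
floats = np.array([1.0, 2.5, 3.0])
mixed = np.array([1, 2.5])      # integers are promoted to float
strings = np.array(['a', 'bc'])

for arr in (ints, floats, mixed, strings):
    print(arr.dtype)

# Plain Python values have types too
print(type(1), type(1.0), type('s'), type(True))
```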


In [None]:
harray = np.array([(1, 's', 3.14)])

In [None]:
harray

In [None]:
harray = np.array([(1, 's', 3.14)], dtype=[('a', 'i8'), ('b', 'S1'), ('c', 'f4')])

In [None]:
harray['a']

In [None]:
harray[['a', 'b']]

## Using real data
We shall use data from the Wisconsin breast cancer database available online.

In [None]:
data_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data'
attr_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.names'

- Information about the data is available at the internet location referred to in `attr_url`.
- It is in human-readable plain text format.
- We shall retrieve it over HTTP using functions in the `requests` library.
- If working on a local computer, we could just download the file manually.

In [None]:
attrs = requests.get(attr_url)

Skim through the attributes. The data is a meaningless jumble of numbers without some of this information (metadata). Pay attention to items 5-8.

In [None]:
print(attrs.content.decode('utf-8'))

Retrieve the data from the web server into a string.

In [None]:
data_str = requests.get(data_url).content.decode('utf-8')

In [None]:
print(data_str[:100])

In [None]:
type(data_str)

### Read the data into a numpy array.
- With Python on a local computer, you could simply download the data file manually and load it from the local disk with:

    `np.loadtxt(filename, other arguments)`

    or

    `np.genfromtxt(filename, other arguments)`. 

- Since we have a `str` instead of a file, we use a `StringIO` object around the string containing our data as a proxy for a file. 
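
The `StringIO` idea can be tried on a tiny example first. The two-line string `text` below is made up for illustration and is not part of the dataset.

```python
import io
import numpy as np

# Hypothetical two-line CSV text standing in for the downloaded data
text = '1,2,3\n4,5,6\n'

# StringIO wraps the string in a file-like object that loadtxt can read
table = np.loadtxt(io.StringIO(text), delimiter=',', dtype=int)
print(table)
```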

In [None]:
data = np.loadtxt(io.StringIO(data_str), delimiter=',')

In [None]:
data = np.loadtxt(io.StringIO(data_str), delimiter=',', dtype=int)

In [None]:
data = np.loadtxt(io.StringIO(data_str), delimiter=',', dtype=str)

In [None]:
print(data[20:25])

#### `genfromtxt`: a function that can guess data types when reading text files

In [None]:
data = np.genfromtxt(io.StringIO(data_str), delimiter=',', dtype=None)

In [None]:
data[20:25]

#### You can tell it what represents missing values

In [None]:
data = np.genfromtxt(io.StringIO(data_str), delimiter=',', dtype=None, missing_values='?')

In [None]:
data[20:25]

#### You can specify what to put for missing values

In [None]:
data = np.genfromtxt(io.StringIO(data_str), delimiter=',', dtype=None, missing_values='?', filling_values=-999)

In [None]:
data[20:25]
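
The effect of `missing_values` and `filling_values` is easier to see on a small input. This sketch uses a made-up string with one `?` and an explicit `dtype=int`, which differs from the `dtype=None` used above.

```python
import io
import numpy as np

# Hypothetical sample with one missing entry marked by '?'
text = '1,2,3\n4,?,6\n'

filled = np.genfromtxt(io.StringIO(text), delimiter=',', dtype=int,
                       missing_values='?', filling_values=-999)
print(filled)

# The fill value can then be located like any other number
print(np.where(filled == -999))
```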

**Could be better!**
- We can define the field names and their data type here. 


### Check for missing values
- What did we find about representation of missing data in this case?
- We can try to compare the element values to this missing data representative.
- `numpy` function `where` lets us check a condition on each element of an array.

In [None]:
np.where(data == -999)

### Define column/field names and type

In [None]:
dtype = [('SCN', 'U8'), ('thickness', int), ('sizeu', int), ('shapeu', int), 
         ('adhesion', int), ('csize', int), ('bare', int), ('blandchrom', int), 
         ('normncl', int), ('mitoses', int), ('cclass', int)]

In [None]:
data = np.genfromtxt(io.StringIO(data_str), delimiter=',', dtype=dtype, missing_values='?', filling_values=-999)

In [None]:
print(data.shape)

In [None]:
print(data[20: 25])

# Be extra careful about how missing values are represented!

### Accessing fields in the data

In [None]:
data['SCN']

In [None]:
data['SCN'][2]

In [None]:
data[2]['SCN']

In [None]:
data[2][0]

In [None]:
data[2, 0]

**We have a record/structured array with 1 dimension**
 - Each element is a record / structure
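
A small structured array makes the one-dimensional layout visible. The field names `id` and `score` and the values below are hypothetical.

```python
import numpy as np

# Hypothetical two-field records for illustration
records = np.array([('a1', 5), ('b2', 3), ('c3', 8)],
                   dtype=[('id', 'U2'), ('score', int)])

print(records.shape)        # one dimension: three records
print(records[1])           # a whole record
print(records['score'])     # one field across all records
print(records[1]['score'])  # one field of one record

# records[1, 0] raises IndexError: there is no second dimension
```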

#### Can we use `numpy.where` to check a condition on each element of a structured array?

In [None]:
np.where(data == -999)

#### Pause here: Let us try one column at a time.
 - How can we check the columns one by one?
 - Hint: loop, `np.dtype`, `np.issubdtype`

In [None]:
data.dtype['SCN']

In [None]:
data.dtype['thickness']

In [None]:
data.dtype.names

#### Pause here: loops in Python
- `for` loop: iterate over a sequence of elements.

```python
x = ['alpha', 'bravo', 'charlie', 'delta']
for ii in x:
    print(ii)
```

- `while` loop: repeat as long as a condition holds.

```python
x = 0
while x < 10:
    print(x)
    x += 1
```

In [None]:
for name in data.dtype.names:
    print(f'Column: {name}')
    if np.issubdtype(data.dtype[name], np.number):
        missing = np.where(data[name] < 0)
        if missing[0].shape[0] > 0:
            print(f'        missing data in rows: {missing}')

## Selection

- You can pass an array or list of indices to select rows

In [None]:
data[[1, 100, 500]]

- You can select rows by condition

In [None]:
data[data['cclass'] == 2]
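
Conditions can also be combined. The sketch below uses a small made-up structured array with two of the field names from above; the threshold of 8 is arbitrary.

```python
import numpy as np

records = np.array([(5, 2), (8, 4), (3, 2), (9, 4)],
                   dtype=[('thickness', int), ('cclass', int)])

# Each comparison yields a boolean array; combine them with & (and) or | (or).
# The parentheses are required because & binds tighter than == and >.
mask = (records['cclass'] == 4) & (records['thickness'] > 8)
print(records[mask])
```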

## Importing data using Pandas

In [None]:
pdata = pd.read_csv(data_url, names=data.dtype.names)

In [None]:
pdata.columns

In [None]:
pdata.iloc[23]

In [None]:
pdata[['SCN', 'thickness', 'cclass']]
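
Missing values need attention in pandas as well. As a sketch, `read_csv` accepts an `na_values` argument naming the strings to treat as missing; the two-row `text` below is a made-up sample, not the real file.

```python
import io
import pandas as pd

# Hypothetical two-row sample in a layout similar to the data file
text = '1000025,5,1\n1057013,8,?\n'

# na_values tells read_csv which strings mean "missing"
sample = pd.read_csv(io.StringIO(text), names=['SCN', 'thickness', 'bare'],
                     na_values='?')
print(sample)
print(sample.isna().sum())
```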

## Plotting data

In [None]:
plt.ion()  # Make it interactive

In [None]:
plt.ioff()

In [None]:
plt.plot(data['thickness'])

In [None]:
plt.show()

In [None]:
plt.plot(data['thickness'], 'o')

In [None]:
plt.plot(data['thickness'], data['csize'], 'o')

In [None]:
plt.plot(data['sizeu'], data['csize'], 'ro')

In [None]:
plt.plot(data['mitoses'], data['csize'], 'go-')

In [None]:
plt.hist(data['csize'])

#### Box plots - apply selection criterion

In [None]:
data.dtype

In [None]:
x = plt.boxplot([data['thickness'][data['cclass'] == 2], data['thickness'][data['cclass'] == 4]])

### Adding legend

In [None]:
plt.plot(data['cclass'], data['sizeu'], 'ro', label='Size')
plt.plot(data['cclass'], data['thickness'], 'gv', label='thickness')
plt.legend()

### Add a little jitter and transparency

In [None]:
plt.plot(data['cclass']+np.random.rand(data.shape[0]), data['sizeu'], 'r.', alpha=0.5, label='Size')
plt.plot(data['cclass']+np.random.rand(data.shape[0]), data['thickness'], 'g.', alpha=0.5, label='thickness')
plt.legend()
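
`np.random.rand` draws from [0, 1), so the jitter above only shifts points to the right. A possible alternative is noise centered on zero; the width of 0.2 and the stand-in array `classes` are assumptions for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)           # seeded so the result is repeatable
classes = np.array([2, 2, 4, 4, 2, 4])   # stand-in for data['cclass']

# Uniform noise in [-0.2, 0.2) keeps each point close to its class value
jittered = classes + rng.uniform(-0.2, 0.2, size=classes.shape[0])
print(jittered)
```

The jittered values can then be passed as the x argument of `plt.plot` in place of the class values.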

### Subplots


In [None]:
ax0 = plt.subplot(2, 2, 1)
ax1 = plt.subplot(2, 2, 2)
ax2 = plt.subplot(2, 2, 3)
ax3 = plt.subplot(2, 2, 4)

ax0.plot(data['cclass']+np.random.rand(data.shape[0]), data['sizeu'], 'r.', alpha=0.5, label='Size')
ax1.boxplot([data['sizeu'][data['cclass'] == 2], data['sizeu'][data['cclass'] == 4]])
ax2.plot(data['cclass']+np.random.rand(data.shape[0]), data['thickness'], 'r.', alpha=0.5, label='thickness')
x = ax3.boxplot([data['thickness'][data['cclass'] == 2], data['thickness'][data['cclass'] == 4]])

#### Sharing axis scales between subplots

In [None]:
ax0 = plt.subplot(2, 2, 1)
ax1 = plt.subplot(2, 2, 2, sharey=ax0)

In [None]:
fig, ax = plt.subplots(nrows=2, ncols=2, sharey='row')

### Layout control
 See https://matplotlib.org/3.1.0/tutorials/intermediate/gridspec.html

### Further information on plotting
 - See https://matplotlib.org/gallery.html

### Other useful libraries
  1. Multidimensional arrays with complex data and metadata: `xarray`
  1. Statistics: `statsmodels`
  1. Broad range of basic functions in science and engineering: `scipy`
  1. Image-processing: `PIL`, `opencv`, `imagej`

  1. Machine learning: `scikit-learn` (`sklearn`)
  1. Network analysis: `networkx`, `igraph`
  1. Graphical User Interface (GUI): `PyQt`, `ipywidgets`
  1. Efficient plotting: `pyqtgraph`

# Thank you