

# Using Jupyter
## About environments
- You can install Python/Jupyter on your own computer
- Or you can use environments available on the cloud
  - binder
  - colaboratory
  - kaggle
  - azure
- See https://www.dataschool.io/cloud-services-for-jupyter-notebook/ for a comparison of some of the popular ones.

### Pros of a local environment
 1. You have full control of the system
 2. No time limit
 3. No data limit - you can use your data on a local disk - no need to upload data
 4. Full function of Jupyter notebooks 
 5. No security restrictions
  
### Cons of a local environment
 1. You have to set up and manage it
 2. You are limited by the hardware you have (may be expensive)
 3. Not the optimal use of resources
 4. Not easy to access from elsewhere
 5. Not easy to share and collaborate with others
 
  
## Jupyter basics
- Cells: Code and markdown
- Output
- Check the keyboard shortcuts from the Help menu
- Save and checkpoint
- File format: .ipynb



# Tutorial on using numpy and matplotlib

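Import the required libraries (modules). `io` is part of the Python standard library. `requests` is a third-party library that comes preinstalled in many environments, including Anaconda and Colaboratory.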

In [0]:
import io
import requests

`numpy` and `matplotlib` are third-party libraries that are installed separately from Python. On your personal system I recommend installing the **Anaconda** Python distribution as a convenient, portable way to set up a scientific computing environment. Anaconda comes bundled with the most commonly used libraries, and you can easily install additional ones.

In [0]:
import numpy as np
# import pandas as pd
from matplotlib import pyplot as plt

## Homogeneous arrays
### Basic operations

In [0]:
myarray = np.array([1, 2, 3])

In [0]:
print(myarray)

In [0]:
myarray

In [0]:
myarray.T

#### You can make multidimensional arrays.

In [0]:
myarray2 = np.array([[0, 1, 2, 3], [4, 5, 6, 8]])

In [0]:
myarray2

In [0]:
myarray2.shape

In [0]:
len(myarray2)

In [0]:
myarray2.T

### Element-wise arithmetic on arrays

In [0]:
2 * myarray

In [0]:
2**myarray

In [0]:
2*myarray2

In [0]:
myarray2**2

In [0]:
myarray + 2

In [0]:
myarray + myarray2

#### Pause here: You just encountered Python's error handling mechanism!
- Can you relate to your experience with errors in other languages?
- Explain error handling for beginners.
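The failure of `myarray + myarray2` comes from NumPy's broadcasting rules: shapes `(3,)` and `(2, 4)` are not compatible, so NumPy raises a `ValueError`. A minimal sketch of catching such an error with `try`/`except` follows. The arrays are redefined so the block runs on its own, and `row` is a made-up array with a compatible shape.

```python
import numpy as np

myarray = np.array([1, 2, 3])                       # shape (3,)
myarray2 = np.array([[0, 1, 2, 3], [4, 5, 6, 8]])   # shape (2, 4)

try:
    result = myarray + myarray2
except ValueError as err:
    # NumPy raises ValueError when the shapes cannot be broadcast together
    result = None
    print(f'Could not add the arrays: {err}')

# Shapes (4,) and (2, 4) are compatible: the 1-D array is added to every row
row = np.array([10, 20, 30, 40])
ok = row + myarray2
print(ok)
```

Without the `try`/`except`, the exception stops the cell and Jupyter prints a traceback, which is what happened above.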


### Other ways of creating arrays

In [0]:
a0 = np.zeros((4, 5))

In [0]:
a0

In [0]:
a1 = np.ones((4, 5))

In [0]:
a1

In [0]:
ai = np.eye(4)

In [0]:
ai

In [0]:
a2 = np.arange(2.0, 7.0, 1.5)

In [0]:
a2

In [0]:
a3 = np.linspace(2, 7, 3)

In [0]:
a3

In [0]:
a4 = np.random.rand(3, 4)

In [0]:
a4

### Reshaping

In [0]:
a4.shape

In [0]:
a4.reshape(4, 3)

#### Pause here
 - What other shapes can you think of?
 - What is the most common scenario for reshaping?

In [0]:
a4.reshape(2, 2, 3)
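Any shape works as long as the total number of elements stays the same. As a small sketch on a made-up 12-element array, passing `-1` for one dimension asks NumPy to infer its size. One common use, though certainly not the only one, is converting between flat and column-shaped data.

```python
import numpy as np

a = np.arange(12)          # 12 elements: 0, 1, ..., 11
b = a.reshape(3, 4)        # 3 * 4 == 12, so this is allowed
c = a.reshape(2, -1)       # -1 lets NumPy infer the missing dimension (6)
d = b.reshape(-1)          # flatten back to one dimension
col = a.reshape(-1, 1)     # a single column with 12 rows

print(b.shape, c.shape, d.shape, col.shape)
```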

### Slicing and indexing

In [0]:
myarray2[0, 0]

In [0]:
myarray2[0, :]

In [0]:
myarray2[0, ::2]

In [0]:
myarray2[:, ::2]

### Assign values

In [0]:
myarray2[0, 0] = 10

In [0]:
myarray2

In [0]:
myarray2[:, 0] = -1  # broadcast

In [0]:
myarray2

#### Pause here
 - How will you set all of first row to 0?

### Check conditions on array elements
- comparison
- boolean arrays
- functions for conditionals

Condition checks in simple Python:
```python
a = -3  # example value

if a < 0:
  print('a is negative')
elif a > 0:
  print('a is positive')
else:
  print('a == 0')
```

In [0]:
myarray > 2

In [0]:
np.nonzero(myarray > 2)

In [0]:
np.where(myarray > 2)

In [0]:
myarray2 > 3

In [0]:
np.nonzero(myarray2 > 3)

In [0]:
np.where(myarray2 > 3)

In [0]:
np.where(myarray2 > 3, 'X', 'Y')

#### Pause here: read the documentation on `where`
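As a quick illustration of the two calling forms of `np.where` used above (with a made-up array `values`): called with only a condition it returns the indices where the condition holds, and called with three arguments it picks elements from the second or third argument depending on the condition.

```python
import numpy as np

values = np.array([[0, 1, 2, 3], [4, 5, 6, 8]])

# One argument: a tuple of index arrays, one per dimension
rows, cols = np.where(values > 3)
print(rows, cols)

# Three arguments: 'X' where the condition is True, 'Y' elsewhere
labels = np.where(values > 3, 'X', 'Y')
print(labels)

# The alternatives can be arrays too: cap every element at 3
capped = np.where(values > 3, 3, values)
print(capped)
```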

## Array of heterogeneous data
#### Pause here: let us discuss data types
- What are data types?
- What are some of the data types in Python?
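To seed the discussion, here is a small sketch comparing a few built-in Python types with NumPy's `dtype`. The variable names are only for illustration. Every element of a plain NumPy array shares one data type, and NumPy picks it from the input unless told otherwise.

```python
import numpy as np

# Built-in Python types
print(type(1), type(3.14), type('s'), type(True))

ints = np.array([1, 2, 3])                # integer dtype (size is platform dependent)
floats = np.array([1, 2, 3.0])            # a single float makes the whole array float
small = np.array([1, 2, 3], dtype='u1')   # explicitly request unsigned 8-bit integers

print(ints.dtype, floats.dtype, small.dtype)
print(small.itemsize)                     # bytes used per element
```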


In [0]:
harray = np.array([(1, 's', 3.14)])#, dtype=[('a', 'i8'), ('b', 'S1'), ('c', 'f4')])

In [0]:
harray

In [0]:
harray = np.array([(1, 's', 3.14)], dtype=[('a', 'i8'), ('b', 'S1'), ('c', 'f4')])

In [0]:
harray['a']

In [0]:
harray[['a', 'b']]

## Using real data
We shall use data from the Wisconsin breast cancer database available online.

In [0]:
data_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data'
attr_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.names'

We just defined two variables above. We ask Python to print the value of one of them:

In [0]:
print(data_url)

Information about the data is available at `attr_url` as plain text. We retrieve it over HTTP using the `requests` library. When running Python on a local computer, we could instead open the URL in a browser and download the file to the local filesystem.

In [0]:
attrs = requests.get(attr_url)

Skim through the attributes. Without this information (metadata), the data is a meaningless jumble of numbers. Pay attention to items 5-8.

In [0]:
print(attrs.content.decode('utf-8'))

Retrieve the data from the web server into a string.

In [0]:
data_str = requests.get(data_url).content.decode('utf-8')

In [0]:
print(data_str[:100])

In [0]:
type(data_str)

### Read the data into a numpy array.
With Python running on your own computer, you could simply download the data file manually and load it from local disk with something like `np.genfromtxt(filename, other arguments)`. Since we have a `str` instead of a file, we wrap the string containing our data in a `StringIO` object, which acts as a proxy for a file.
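A minimal sketch of the `StringIO` idea, using a made-up two-line string in place of the downloaded data: the object can be read like an open file, so functions that expect a file accept it.

```python
import io
import numpy as np

text = '1,2,3\n4,5,6\n'          # made-up stand-in for downloaded text
handle = io.StringIO(text)       # behaves like a file opened for reading

first_line = handle.readline()   # file-style reading works
print(first_line)

handle.seek(0)                   # rewind to the start, as with a real file
table = np.genfromtxt(handle, delimiter=',')
print(table)
```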

In [0]:
data = np.loadtxt(io.StringIO(data_str), delimiter=',')

In [0]:
data = np.loadtxt(io.StringIO(data_str), delimiter=',', dtype=int)

In [0]:
data = np.loadtxt(io.StringIO(data_str), delimiter=',', dtype=str)

In [0]:
print(data[:25])

Note that counting starts at 0 in Python. `[:25]` is a slice that selects the first 25 rows (indices 0 to 24), using the same slicing syntax as in the section on slicing and indexing above.
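A small sketch of zero-based counting, on a made-up array whose values start at 100 so that indices and values are easy to tell apart:

```python
import numpy as np

rows = np.arange(100, 130)   # 30 made-up values: 100, 101, ..., 129

first = rows[:25]            # indices 0 to 24: the stop index 25 is excluded
print(len(first))            # 25 elements
print(first[0], first[-1])   # first and last selected values

print(rows[23])              # index 23 is the 24th element
```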

In [0]:
data[23]

**Not nice!**
- We have to define the field names and their data type here. 


In [0]:
 #columns = ['SCN', 'thickness', 'sizeu', 'shapeu', 'adhesion', 'csize', 'bare', 'blandchrom', 'normncl', 'mitoses', 'cclass']
dtype = [('SCN', int), ('thickness', int), ('sizeu', int), ('shapeu', int), ('adhesion', int), ('csize', int), ('bare', int), ('blandchrom', int), ('normncl', int), ('mitoses', int), ('cclass', int)]
# dtype = [('SCN', 'u8'), ('thickness', 'u1'), ('sizeu', 'u1'), ('shapeu', 'u1'), ('adhesion', 'u1'), ('csize', 'u1'), ('bare', 'u1'), ('blandchrom', 'u1'), ('normncl', 'u1'), ('mitoses', 'u1'), ('cclass', 'u1')]


In [0]:
data = np.genfromtxt(io.StringIO(data_str), delimiter=',', dtype=dtype, missing_values='?')

In [0]:
print(data.shape)

In [0]:
print(data[: 25])

- Look at the original data file.
- Note that the 24th row (index 23) has missing data, with `?` inserted in place of a number.
- NumPy converted it to -1.
- The exact value depends on the data type you choose in the `dtype` specification.
- Be careful about how missing values are represented.
- Leaving a non-space, non-numeric character where a number is expected is generally a bad idea.
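Rather than relying on the default, the marker and its replacement can be stated explicitly through the `missing_values` and `filling_values` arguments of `np.genfromtxt`. The sketch below uses a made-up three-line string and made-up field names, not the real data file. It also shows that reading the same text as floats turns the unreadable entries into `nan`.

```python
import io
import numpy as np

sample = '1,5,1\n2,?,3\n3,4,?\n'   # made-up data with '?' for missing entries
fields = [('id', int), ('x', int), ('y', int)]

# Treat '?' as missing and replace it with -1 explicitly
parsed = np.genfromtxt(io.StringIO(sample), delimiter=',', dtype=fields,
                       missing_values='?', filling_values=-1)
print(parsed)

# With the default float dtype, entries that cannot be converted become nan
as_float = np.genfromtxt(io.StringIO(sample), delimiter=',')
print(as_float)
```

For integer fields a sentinel such as -1 only works if it can never occur as a real value. `nan` avoids that ambiguity but requires a floating-point type.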

In [0]:
type(data)

In [0]:
data.dtype

### Accessing fields in the data

In [0]:
data['SCN']

In [0]:
data['SCN'][2]

In [0]:
data[2]['SCN']

In [0]:
data[2][0]

In [0]:
data[2, 0]

#### Pause here: how can we check for missing data?
- What did we find about representation of missing data in this case?
- We can try to compare the element values to this missing data representative.
- `numpy` function `where` lets us check a condition on each element of an array.


In [0]:
np.where(data == -1)

#### Pause here: Let us try one column at a time.
- How can we check the columns one by one?

In [0]:
for name in data.dtype.names:
  print(f'{name}: missing data in rows: {np.where(data[name] < 0)}')
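Once the rows with missing data are known, a boolean array can be used to leave them out. This is a sketch on a small made-up structured array (`records`) with the same -1 convention, not on the real data:

```python
import numpy as np

# Made-up records: the second row has a missing 'bare' value marked as -1
records = np.array([(1, 5, 1), (2, -1, 3), (3, 4, 2)],
                   dtype=[('id', int), ('bare', int), ('csize', int)])

missing = records['bare'] < 0         # boolean array, True where data is missing
missing_rows = np.where(missing)[0]   # indices of the rows with missing data
print(missing_rows)

clean = records[~missing]             # keep only rows where the value is present
print(clean)
```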

#### Pause here: `for` loop and f-strings
- That is a `for` loop. Python also has a `while` loop for conditional looping.
  ```python
  x = 0
  while x < 10:
      print(x)
      x += 1
  ```
- `f'text {variable} text'` is a convenient way of formatting strings in Python (version 3.6 onwards).

## Plotting data

In [0]:
plt.plot(data['thickness'])

In [0]:
plt.plot(data['thickness'], 'o')

In [0]:
plt.plot(data['thickness'], data['csize'], 'o')

In [0]:
plt.plot(data['sizeu'], data['csize'], 'ro')

In [0]:
plt.plot(data['mitoses'], data['csize'], 'go-')

In [0]:
plt.hist(data['csize'])

In [0]:
plt.boxplot([data['csize'], data['mitoses']])

In [0]:
# pdata = pd.read_csv(data_url, names=data.dtype.names)

In [0]:
# pdata.iloc[23]

In [0]:
# pdata.columns

### Adding legend

In [0]:
plt.legend()
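`legend()` can only show entries for plots that carry a `label`. Called without any labeled plots, it has nothing to display. A minimal sketch with made-up data and labels:

```python
import numpy as np
from matplotlib import pyplot as plt

x = np.arange(10)   # made-up data for illustration

fig, ax = plt.subplots()
ax.plot(x, x, 'o', label='linear')                   # label is the legend text
ax.plot(x, x**2 / 10, 'r-', label='quadratic / 10')
legend = ax.legend()                                 # collects the labeled plots
```

In a notebook the figure is displayed automatically at the end of the cell.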

### Subplots


In [0]:
ax0 = plt.subplot(2, 2, 1)
ax1 = plt.subplot(2, 2, 2)
ax2 = plt.subplot(2, 2, 3)
ax3 = plt.subplot(2, 2, 4)

In [0]:
fig, ax = plt.subplots(nrows=2, ncols=2)

#### Sharing axis scales between subplots

In [0]:
ax0 = plt.subplot(2, 2, 1)
ax1 = plt.subplot(2, 2, 2, sharey=ax0)

In [0]:
fig, ax = plt.subplots(nrows=2, ncols=2, sharey='row')

### Layout control
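One simple option for layout control, shown here only as a sketch with made-up curves, is `fig.tight_layout()`, which adjusts the spacing so that titles and axis labels of neighboring subplots do not overlap. The `figsize` argument sets the figure size in inches.

```python
import numpy as np
from matplotlib import pyplot as plt

x = np.linspace(0, 1, 20)   # made-up data for illustration

fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(6, 4))
for index, ax in enumerate(axes.flat):   # visit the four subplots in order
    ax.plot(x, x ** (index + 1))
    ax.set_title(f'power {index + 1}')
    ax.set_xlabel('x')

fig.tight_layout()   # recompute spacing so labels and titles do not collide
```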

### Further information on plotting
 - See https://matplotlib.org/gallery.html

### Other useful libraries
  1. Multidimensional arrays with complex data and metadata: `xarray`
  1. Statistics: `statsmodels`
  1. Image-processing: `PIL`, `opencv`, `imagej`
  1. Machine learning: `scikit-learn`
  1. Network analysis: `networkx`
  1. HDF5: `h5py`
  1. User interfaces: `PyQt`
  1. Efficient plotting: `pyqtgraph`
  1. Broad range of basic functions in science and engineering: `scipy`
  1. Symbolic mathematics: `sympy`, (also `sage` - a whole environment like `Mathematica`)