<a href="https://colab.research.google.com/github/subhacom/np_tut_breastcancer/blob/master/Wisconsin_breast_cancer_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using Jupyter
- Cells: Code and markdown
- Output
- Check the keyboard shortcuts from help menu


# Tutorial on using numpy and matplotlib

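Import the required libraries (modules). `io` is part of the Python standard library; `requests` is a third-party library that must be installed separately if your environment does not already provide it.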

In [1]:
import io
import requests

`numpy` and `matplotlib` are third-party libraries that are installed separately from Python. On your personal system I recommend installing the **Anaconda** Python distribution as a convenient, portable way to set up a scientific computing environment. Anaconda comes bundled with the most commonly used libraries, and you can easily install additional ones.

For this tutorial, I am using cloud resources for running Jupyter notebooks (Google Colaboratory, Binder). There are many such free environments. See https://www.dataschool.io/cloud-services-for-jupyter-notebook/ for a comparison of some of the popular ones.

In [2]:
import numpy as np
# import pandas as pd
from matplotlib import pyplot as plt

## Homogeneous arrays
### Basic operations

In [3]:
myarray = np.array([1, 2, 3])

In [4]:
print(myarray)

[1 2 3]


In [5]:
myarray

array([1, 2, 3])

In [6]:
myarray.T

array([1, 2, 3])

#### You can make multidimensional arrays.

In [7]:
myarray2 = np.array([[0, 1, 2, 3], [4, 5, 6, 8]])

In [8]:
myarray2

array([[0, 1, 2, 3],
       [4, 5, 6, 8]])

In [9]:
myarray2.shape

(2, 4)

In [10]:
len(myarray2)

2

In [11]:
myarray2.T

array([[0, 4],
       [1, 5],
       [2, 6],
       [3, 8]])

#### Element-wise arithmetic on arrays

In [12]:
2 * myarray

array([2, 4, 6])

In [13]:
2**myarray

array([2, 4, 8], dtype=int32)

In [14]:
2*myarray2

array([[ 0,  2,  4,  6],
       [ 8, 10, 12, 16]])

In [15]:
myarray2**2

array([[ 0,  1,  4,  9],
       [16, 25, 36, 64]], dtype=int32)

In [16]:
myarray + 2

array([3, 4, 5])

In [17]:
myarray + myarray2

ValueError: operands could not be broadcast together with shapes (3,) (2,4) 

#### Pause here: You just encountered Python's error handling mechanism!
- Explain error handling for beginners.
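
As a minimal sketch of what error handling looks like (the variable names here are only for illustration), the failing addition can be wrapped in `try`/`except` so that the program reports the problem instead of stopping:

```python
import numpy as np

a = np.array([1, 2, 3])                      # shape (3,)
b = np.array([[0, 1, 2, 3], [4, 5, 6, 8]])   # shape (2, 4)

try:
    total = a + b    # shapes (3,) and (2, 4) cannot be broadcast together
except ValueError as err:
    # Python jumps here when the addition raises ValueError
    total = None
    print(f'Could not add the arrays: {err}')
```

Code after the `except` block keeps running, which is useful when one bad input should not abort a whole analysis.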

#### Slicing and indexing

In [18]:
myarray2[0, 0]

0

In [19]:
myarray2[0, :]

array([0, 1, 2, 3])

In [20]:
myarray2[0, ::2]

array([0, 2])

In [21]:
myarray2[:, ::2]

array([[0, 2],
       [4, 6]])

#### Assign values

In [22]:
myarray2[0, 0] = 10

In [23]:
myarray2

array([[10,  1,  2,  3],
       [ 4,  5,  6,  8]])

In [24]:
myarray2[:, 0] = -1

In [25]:
myarray2

array([[-1,  1,  2,  3],
       [-1,  5,  6,  8]])

#### Check conditions on array elements
- comparison
- boolean arrays
- functions for conditionals

In [26]:
myarray > 2

array([False, False,  True])

In [27]:
np.nonzero(myarray > 2)

(array([2], dtype=int64),)

In [28]:
np.where(myarray > 2)

(array([2], dtype=int64),)

In [29]:
myarray2 > 3

array([[False, False, False, False],
       [False,  True,  True,  True]])

In [30]:
np.nonzero(myarray2 > 3)

(array([1, 1, 1], dtype=int64), array([1, 2, 3], dtype=int64))

In [31]:
np.where(myarray2 > 3)

(array([1, 1, 1], dtype=int64), array([1, 2, 3], dtype=int64))

In [32]:
np.where(myarray2 > 3, 'X', 'Y')

array([['Y', 'Y', 'Y', 'Y'],
       ['Y', 'X', 'X', 'X']], dtype='<U1')

#### Pause here: read the documentation on `where`
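
A short sketch of the two forms of `np.where` used above, on a small made-up array. With only a condition it returns the indices where the condition holds; with two extra arguments it picks values element by element.

```python
import numpy as np

values = np.array([1, 5, 2, 8])

# One argument: a tuple of index arrays, one per dimension
indices = np.where(values > 3)
print(indices[0])

# Three arguments: take the second argument where the condition is True,
# otherwise take the third
capped = np.where(values > 3, 3, values)
print(capped)
```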

## Array of heterogeneous data
### Pause here: let us discuss data types
- What are data types?
- What are some of the data types in Python?
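
As a starting point for the discussion, here is a small sketch that inspects a few built-in Python types and the `dtype` of two made-up numpy arrays. The exact integer size numpy reports depends on the platform.

```python
import numpy as np

# Built-in Python types
print(type(1), type(3.14), type('s'), type(True))

# A numpy array stores one data type for all of its elements
ints = np.array([1, 2, 3])
mixed = np.array([1, 2.5, 3])   # the integers are promoted to float
print(ints.dtype, mixed.dtype)
```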


In [33]:
harray = np.array([(1, 's', 3.14)])  # no dtype given: all values are coerced to one common type

In [34]:
harray

array([['1', 's', '3.14']], dtype='<U11')

In [35]:
harray = np.array([(1, 's', 3.14)], dtype=[('a', 'i8'), ('b', 'S1'), ('c', 'f4')])

In [36]:
harray['a']

array([1], dtype=int64)

In [37]:
harray[['a', 'b']]

array([(1, b's')], dtype=[('a', '<i8'), ('b', 'S1')])

## Using real data
We shall use data from the Wisconsin breast cancer database available online.

In [38]:
data_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data'
attr_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.names'

We just defined two variables above. We ask Python to print the value of one of them:

In [39]:
print(data_url)

http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data


Information about the data is available at `attr_url` as plain text. We fetch it over HTTP using the `requests` library. When running Python on a local computer, we could instead open the URL in a browser and download the file to the local filesystem.

In [40]:
attrs = requests.get(attr_url)

Skim through the attributes. Without this information (metadata), the data is a meaningless jumble of numbers. Pay particular attention to items 5-8.

In [41]:
print(attrs.content.decode('utf-8'))

Citation Request:
   This breast cancer databases was obtained from the University of Wisconsin
   Hospitals, Madison from Dr. William H. Wolberg.  If you publish results
   when using this database, then please include this information in your
   acknowledgements.  Also, please cite one or more of:

   1. O. L. Mangasarian and W. H. Wolberg: "Cancer diagnosis via linear 
      programming", SIAM News, Volume 23, Number 5, September 1990, pp 1 & 18.

   2. William H. Wolberg and O.L. Mangasarian: "Multisurface method of 
      pattern separation for medical diagnosis applied to breast cytology", 
      Proceedings of the National Academy of Sciences, U.S.A., Volume 87, 
      December 1990, pp 9193-9196.

   3. O. L. Mangasarian, R. Setiono, and W.H. Wolberg: "Pattern recognition 
      via linear programming: Theory and application to medical diagnosis", 
      in: "Large-scale numerical optimization", Thomas F. Coleman and Yuying
      Li, editors, SIAM Publications, Philadelphia 199

Retrieve the data from the web server into a string.

In [42]:
data_str = requests.get(data_url).content.decode('utf-8')

In [43]:
print(data_str[:100])

1000025,5,1,1,1,2,1,3,1,1,2
1002945,5,4,4,5,7,10,3,2,1,2
1015425,3,1,1,1,2,2,3,1,1,2
1016277,6,8,8,1


In [44]:
type(data_str)

str

## Read the data into a numpy array.
With Python running on your own computer, you could simply download the data file manually and load it from local disk with `np.genfromtxt(filename, other arguments)`. Since we have a `str` instead of a file, we wrap the string in an `io.StringIO` object, which acts as a proxy for a file.
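
A minimal sketch of the `StringIO` idea on a made-up two-line string (not the real dataset): the wrapper behaves like a file opened for reading, so numpy's text loaders accept it.

```python
import io
import numpy as np

text = '1,2,3\n4,5,6\n'         # made-up data in the same comma-separated layout
fake_file = io.StringIO(text)   # behaves like a file opened for reading
small = np.loadtxt(fake_file, delimiter=',')
print(small.shape)
```

Every value here is numeric, so the default `float` conversion succeeds.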

In [47]:
data = np.loadtxt(io.StringIO(data_str), delimiter=',')

ValueError: could not convert string to float: '?'

In [48]:
data = np.loadtxt(io.StringIO(data_str), delimiter=',', dtype=int)

ValueError: could not convert string to float: '?'

In [49]:
data = np.loadtxt(io.StringIO(data_str), delimiter=',', dtype=str)

In [50]:
print(data[:25])

[['1000025' '5' '1' '1' '1' '2' '1' '3' '1' '1' '2']
 ['1002945' '5' '4' '4' '5' '7' '10' '3' '2' '1' '2']
 ['1015425' '3' '1' '1' '1' '2' '2' '3' '1' '1' '2']
 ['1016277' '6' '8' '8' '1' '3' '4' '3' '7' '1' '2']
 ['1017023' '4' '1' '1' '3' '2' '1' '3' '1' '1' '2']
 ['1017122' '8' '10' '10' '8' '7' '10' '9' '7' '1' '4']
 ['1018099' '1' '1' '1' '1' '2' '10' '3' '1' '1' '2']
 ['1018561' '2' '1' '2' '1' '2' '1' '3' '1' '1' '2']
 ['1033078' '2' '1' '1' '1' '2' '1' '1' '1' '5' '2']
 ['1033078' '4' '2' '1' '1' '2' '1' '2' '1' '1' '2']
 ['1035283' '1' '1' '1' '1' '1' '1' '3' '1' '1' '2']
 ['1036172' '2' '1' '1' '1' '2' '1' '2' '1' '1' '2']
 ['1041801' '5' '3' '3' '3' '2' '3' '4' '4' '1' '4']
 ['1043999' '1' '1' '1' '1' '2' '3' '3' '1' '1' '2']
 ['1044572' '8' '7' '5' '10' '7' '9' '5' '5' '4' '4']
 ['1047630' '7' '4' '6' '4' '6' '1' '4' '3' '1' '4']
 ['1048672' '4' '1' '1' '1' '2' '1' '2' '1' '1' '2']
 ['1049815' '4' '1' '1' '1' '2' '1' '3' '1' '1' '2']
 ['1050670' '10' '7' '7' '6' '4' '10' '4

Note that counting starts at 0 in Python. The slice `[:25]` selects the first 25 rows (indices 0 through 24), just like the slices used in the indexing examples earlier.
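
A quick sketch of zero-based counting and slices, using ten made-up values rather than the dataset:

```python
import numpy as np

rows = np.arange(100, 110)   # ten made-up values: 100, 101, ..., 109
print(rows[0])               # the first element has index 0
print(rows[:3])              # indices 0, 1, 2 (the stop index 3 is excluded)
print(rows[2:5])             # indices 2, 3, 4
```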

In [51]:
data[23]

array(['1057013', '8', '4', '5', '1', '2', '?', '7', '3', '1', '4'],
      dtype='<U8')

**Not nice!**
- We have to define the field names and their data type here. 


In [53]:
# columns = ['SCN', 'thickness', 'sizeu', 'shapeu', 'adhesion', 'csize', 'bare', 'blandchrom', 'normncl', 'mitoses', 'cclass']
dtype = [('SCN', int), ('thickness', int), ('sizeu', int), ('shapeu', int), ('adhesion', int), ('csize', int), ('bare', int), ('blandchrom', int), ('normncl', int), ('mitoses', int), ('cclass', int)]
# dtype = [('SCN', 'u8'), ('thickness', 'u1'), ('sizeu', 'u1'), ('shapeu', 'u1'), ('adhesion', 'u1'), ('csize', 'u1'), ('bare', 'u1'), ('blandchrom', 'u1'), ('normncl', 'u1'), ('mitoses', 'u1'), ('cclass', 'u1')]


In [54]:
data = np.genfromtxt(io.StringIO(data_str), delimiter=',', dtype=dtype, missing_values=np.nan)

In [55]:
print(data[: 25])

[(1000025,  5,  1,  1,  1, 2,  1, 3,  1, 1, 2)
 (1002945,  5,  4,  4,  5, 7, 10, 3,  2, 1, 2)
 (1015425,  3,  1,  1,  1, 2,  2, 3,  1, 1, 2)
 (1016277,  6,  8,  8,  1, 3,  4, 3,  7, 1, 2)
 (1017023,  4,  1,  1,  3, 2,  1, 3,  1, 1, 2)
 (1017122,  8, 10, 10,  8, 7, 10, 9,  7, 1, 4)
 (1018099,  1,  1,  1,  1, 2, 10, 3,  1, 1, 2)
 (1018561,  2,  1,  2,  1, 2,  1, 3,  1, 1, 2)
 (1033078,  2,  1,  1,  1, 2,  1, 1,  1, 5, 2)
 (1033078,  4,  2,  1,  1, 2,  1, 2,  1, 1, 2)
 (1035283,  1,  1,  1,  1, 1,  1, 3,  1, 1, 2)
 (1036172,  2,  1,  1,  1, 2,  1, 2,  1, 1, 2)
 (1041801,  5,  3,  3,  3, 2,  3, 4,  4, 1, 4)
 (1043999,  1,  1,  1,  1, 2,  3, 3,  1, 1, 2)
 (1044572,  8,  7,  5, 10, 7,  9, 5,  5, 4, 4)
 (1047630,  7,  4,  6,  4, 6,  1, 4,  3, 1, 4)
 (1048672,  4,  1,  1,  1, 2,  1, 2,  1, 1, 2)
 (1049815,  4,  1,  1,  1, 2,  1, 3,  1, 1, 2)
 (1050670, 10,  7,  7,  6, 4, 10, 4,  1, 2, 4)
 (1050718,  6,  1,  1,  1, 2,  1, 3,  1, 1, 2)
 (1054590,  7,  3,  2, 10, 5, 10, 5,  4, 4, 4)
 (1054593, 10

- Look at the original data file. 
- Note that the 24th row (index 23) has missing data, with `?` inserted in place of a number.
- NumPy converted it to -1.
- The exact value depends on what data type you choose in the `dtype` specification.
- Be careful about how missing values are represented.
- Leaving a non-space, non-numeric character where a number is expected is generally a bad idea.
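
The `?` marker and its replacement can also be stated explicitly through the `missing_values` and `filling_values` arguments of `np.genfromtxt`. This is a minimal sketch on three made-up rows with a shortened, hypothetical field list, not the full dataset:

```python
import io
import numpy as np

# Three made-up rows; the second has '?' in the last column
sample = '1000025,5,1\n1057013,8,?\n1015425,3,2\n'
sample_dtype = [('SCN', int), ('thickness', int), ('bare', int)]

parsed = np.genfromtxt(
    io.StringIO(sample),
    delimiter=',',
    dtype=sample_dtype,
    missing_values='?',    # string that marks a missing entry
    filling_values=-1,     # value stored in its place
)
print(parsed['bare'])
```

Naming the marker and the fill value makes the choice visible to readers of the code instead of relying on a default.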

In [56]:
type(data)

numpy.ndarray

In [57]:
data.dtype

dtype([('SCN', '<i4'), ('thickness', '<i4'), ('sizeu', '<i4'), ('shapeu', '<i4'), ('adhesion', '<i4'), ('csize', '<i4'), ('bare', '<i4'), ('blandchrom', '<i4'), ('normncl', '<i4'), ('mitoses', '<i4'), ('cclass', '<i4')])

### Accessing fields in the data

In [58]:
data['SCN']

array([ 1000025,  1002945,  1015425,  1016277,  1017023,  1017122,
        1018099,  1018561,  1033078,  1033078,  1035283,  1036172,
        1041801,  1043999,  1044572,  1047630,  1048672,  1049815,
        1050670,  1050718,  1054590,  1054593,  1056784,  1057013,
        1059552,  1065726,  1066373,  1066979,  1067444,  1070935,
        1070935,  1071760,  1072179,  1074610,  1075123,  1079304,
        1080185,  1081791,  1084584,  1091262,  1096800,  1099510,
        1100524,  1102573,  1103608,  1103722,  1105257,  1105524,
        1106095,  1106829,  1108370,  1108449,  1110102,  1110503,
        1110524,  1111249,  1112209,  1113038,  1113483,  1113906,
        1115282,  1115293,  1116116,  1116132,  1116192,  1116998,
        1117152,  1118039,  1120559,  1121732,  1121919,  1123061,
        1124651,  1125035,  1126417,  1131294,  1132347,  1133041,
        1133136,  1136142,  1137156,  1143978,  1143978,  1147044,
        1147699,  1147748,  1148278,  1148873,  1152331,  1155

In [59]:
data['SCN'][2]

1015425

In [60]:
data[2]['SCN']

1015425

In [61]:
data[2][0]

1015425

In [62]:
data[2, 0]

IndexError: too many indices for array

#### Pause here: how can we check for missing data?
- What did we find about representation of missing data in this case?
- We can try to compare the element values to this missing data representative.
- `numpy` function `where` lets us check a condition on each element of an array.


In [65]:
np.where(data == -1)

(array([], dtype=int64),)

#### Pause here: Let us try one column at a time.
- How can we check the columns one by one?

In [64]:
for name in data.dtype.names:
  print(f'{name}: missing data in rows: {np.where(data[name] < 0)}')

SCN: missing data in rows: (array([], dtype=int64),)
thickness: missing data in rows: (array([], dtype=int64),)
sizeu: missing data in rows: (array([], dtype=int64),)
shapeu: missing data in rows: (array([], dtype=int64),)
adhesion: missing data in rows: (array([], dtype=int64),)
csize: missing data in rows: (array([], dtype=int64),)
bare: missing data in rows: (array([ 23,  40, 139, 145, 158, 164, 235, 249, 275, 292, 294, 297, 315,
       321, 411, 617], dtype=int64),)
blandchrom: missing data in rows: (array([], dtype=int64),)
normncl: missing data in rows: (array([], dtype=int64),)
mitoses: missing data in rows: (array([], dtype=int64),)
cclass: missing data in rows: (array([], dtype=int64),)


#### Pause here: *for loop* and f-strings
- That is a `for` loop. Python also has a `while` loop for conditional looping.
- `f'{variable} text'` is a convenient way of formatting strings in Python (version 3.6 onwards).
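
A small sketch contrasting the two kinds of loop and showing an f-string with an expression inside the braces. The list of names is made up for illustration:

```python
names = ['thickness', 'csize', 'bare']   # made-up list for illustration

# for loop: runs once per item in the list
for name in names:
    print(f'column: {name}')

# while loop: keeps running as long as the condition is true
count = 0
while count < 3:
    count += 1

# Any expression can go inside the braces of an f-string
print(f'count is {count}, and 2 * count is {2 * count}')
```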

## Plotting data

In [67]:
plt.plot(data['thickness'])

[<matplotlib.lines.Line2D at 0x93fde80>]

In [68]:
plt.plot(data['thickness'], 'o')

[<matplotlib.lines.Line2D at 0x94671d0>]

In [70]:
plt.plot(data['thickness'], data['csize'], 'o')

[<matplotlib.lines.Line2D at 0x96ddb38>]

In [71]:
plt.plot(data['sizeu'], data['csize'], 'ro')

[<matplotlib.lines.Line2D at 0x8fef5c0>]

In [74]:
plt.plot(data['mitoses'], data['csize'], 'go-')

[<matplotlib.lines.Line2D at 0x98aa978>]

In [77]:
plt.hist(data['csize'])

(array([ 47., 386.,  72.,  48.,  39.,  41.,  12.,  21.,   2.,  31.]),
 array([ 1. ,  1.9,  2.8,  3.7,  4.6,  5.5,  6.4,  7.3,  8.2,  9.1, 10. ]),
 <a list of 10 Patch objects>)

In [0]:
# pdata = pd.read_csv(data_url, names=data.dtype.names)

In [0]:
# pdata.iloc[23]

In [0]:
# pdata.columns