<a href="https://colab.research.google.com/github/subhacom/np_tut_breastcancer/blob/master/Wisconsin_breast_cancer_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial on using numpy and matplotlib

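Import the required libraries (modules). `io` is part of the Python standard library; `requests` is a third-party package that may need to be installed separately, although many scientific Python environments ship with it.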

In [1]:
import io
import requests

`numpy` and `matplotlib` are third-party libraries that are installed separately from Python. On your personal system, I recommend installing the **Anaconda** Python distribution as a convenient, portable way to set up a scientific computing environment. Anaconda comes bundled with the most commonly used libraries, and you can easily install additional ones.

For this tutorial, I am using cloud resources to run Jupyter notebooks (Google Colaboratory, Binder). There are many such free environments. See https://www.dataschool.io/cloud-services-for-jupyter-notebook/ for a comparison of some of the popular ones.

In [4]:
import numpy as np
# import pandas as pd
from matplotlib import pyplot as plt

We shall use data from the Wisconsin breast cancer database available online.

In [0]:
data_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data'
attr_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.names'

Information about the data is available at `attr_url` as plain text. We fetch it over HTTP using the `requests` library; alternatively, you could open the URL in a browser and download the file.

In [0]:
attrs = requests.get(attr_url)

In [0]:
print(attrs.content.decode('utf-8'))

Load the data into a string and then use a StringIO object around it as a proxy for a file (alternatively you could simply download the data file manually and load it from local disk).

In [0]:
data_str = requests.get(data_url).content.decode('utf-8')

In [0]:
print(data_str[:100])
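
To see why `io.StringIO` can stand in for a file, here is a minimal sketch on a made-up two-line string (the values are illustrative, not rows taken from the dataset). The wrapper offers the same line-by-line reading interface as an open text file, which is what parsers such as `np.genfromtxt` expect.

```python
import io

# Hypothetical two-line sample in the same comma-separated layout
sample_text = "1,5,1\n2,8,?\n"

# Wrap the string so that it can be read like an open text file
fake_file = io.StringIO(sample_text)

# readline() returns the first line, including its trailing newline
first_line = fake_file.readline()

# Iterating continues from the current position, just as with a real file
remaining = [line.rstrip('\n') for line in fake_file]

print(repr(first_line))
print(remaining)
```

Because the object keeps track of a read position, a second pass over the same data needs either `fake_file.seek(0)` or a fresh `io.StringIO` object.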

In [0]:
# columns = ['SCN', 'thickness', 'sizeu', 'shapeu', 'adhesion', 'csize', 'bare', 'blandchrom', 'normncl', 'mitoses', 'cclass']
dtype = [('SCN', int), ('thickness', int), ('sizeu', int), ('shapeu', int), ('adhesion', int), ('csize', int), ('bare', int), ('blandchrom', int), ('normncl', int), ('mitoses', int), ('cclass', int)]
# dtype = [('SCN', 'u8'), ('thickness', 'u1'), ('sizeu', 'u1'), ('shapeu', 'u1'), ('adhesion', 'u1'), ('csize', 'u1'), ('bare', 'u1'), ('blandchrom', 'u1'), ('normncl', 'u1'), ('mitoses', 'u1'), ('cclass', 'u1')]
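
The list of `(name, type)` pairs above describes a structured dtype: every row becomes one record whose fields can be addressed by name. A minimal sketch with a shortened, hypothetical three-field version of that dtype and made-up values:

```python
import numpy as np

# Hypothetical three-field record type, a shortened version of the dtype above
small_dtype = [('SCN', int), ('thickness', int), ('cclass', int)]

# Each tuple becomes one record; the values are made up for illustration
records = np.array([(101, 5, 2), (102, 8, 4), (103, 1, 2)], dtype=small_dtype)

print(records.dtype.names)    # field names, in order
print(records['thickness'])   # one column, selected by field name
print(records[1])             # one whole record, selected by position
print(records[1]['cclass'])   # one field of one record
```

The commented-out alternative with `'u1'` and `'u8'` would store the same fields as unsigned 1-byte and 8-byte integers, which saves memory but cannot represent a negative fill value.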


Read the data into a NumPy array.

In [0]:
# '?' marks a missing value in this file; fill such entries with -1
data = np.genfromtxt(io.StringIO(data_str), delimiter=',', dtype=dtype, missing_values='?', filling_values=-1)

In [0]:
print(data[:25])

Note that counting starts at 0 in Python. `[:25]` is a slice that selects the first 25 rows (indices 0 through 24). We shall talk about ranges and slices soon.
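
As a quick illustration of zero-based counting and the `[:25]` notation, here is a small sketch on a plain Python list (the same slice notation applies to NumPy arrays; the list itself is just an example):

```python
# A plain Python list holding 0, 1, ..., 29
values = list(range(30))

# The slice stops before index 25, so it covers indices 0 through 24
first_25 = values[:25]

print(len(first_25))
print(first_25[0], first_25[-1])

# The 24th element sits at index 23 because counting starts at 0
print(values[23])
```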

In [0]:
data[23]

Note that the 24th row (index 23) has missing data, with `?` in place of a number. NumPy converted it to -1.
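
The conversion of `?` to -1 can be reproduced on a small made-up sample. The sketch below assumes that `?` is the only missing-value marker and passes `missing_values='?'` and `filling_values=-1` explicitly, so the fill value is stated in the call instead of being left to `genfromtxt` defaults. The three-column layout and the values are illustrative, not the full dataset.

```python
import io
import numpy as np

# Hypothetical three-column sample; the second row has a missing entry
sample = "101,5,1\n102,8,?\n103,1,10\n"
sample_dtype = [('SCN', int), ('thickness', int), ('bare', int)]

arr = np.genfromtxt(
    io.StringIO(sample),
    delimiter=',',
    dtype=sample_dtype,
    missing_values='?',   # string that marks a missing entry
    filling_values=-1,    # value stored in place of a missing entry
)

print(arr['bare'])
```

A sentinel such as -1 works here only because valid values in these columns are positive, so a negative number cannot be mistaken for real data.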

In [0]:
type(data)

In [0]:
data.dtype

In [0]:
for name in data.dtype.names:
    print(f'{name}: missing data in rows: {np.where(data[name] < 0)}')
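
The comparison `data[name] < 0` in the loop above produces a boolean mask, and `np.where` turns that mask into row indices. A minimal sketch on a made-up two-field array (field names borrowed from the dataset, values hypothetical) shows both steps, plus one possible way to keep only the complete rows:

```python
import numpy as np

# Hypothetical records; -1 marks a missing 'bare' value
sample_dtype = [('SCN', int), ('bare', int)]
sample = np.array([(101, 1), (102, -1), (103, 10), (104, -1)], dtype=sample_dtype)

# Boolean mask: True where the value is missing
missing = sample['bare'] < 0

# np.where returns a tuple with one index array per dimension
missing_rows = np.where(missing)[0]

# Inverting the mask keeps only the rows without missing data
complete = sample[~missing]

print(missing_rows)
print(complete['SCN'])
```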

In [0]:
import pandas as pd  # needed for read_csv below

pdata = pd.read_csv(data_url, names=data.dtype.names)

In [0]:
pdata.iloc[23]

In [0]:
pdata.columns