# Pipelines 


"A sequence of data processing components is called a data pipeline. Pipelines are very
common in Machine Learning systems, since there is a lot of data to manipulate and
many data transformations to apply.

Components typically run asynchronously. Each component pulls in a large amount
of data, processes it, and spits out the result in another data store, and then some time
later the next component in the pipeline pulls this data and spits out its own output,
and so on. Each component is fairly self-contained: the interface between components
is simply the data store. This makes the system quite simple to grasp (with the help of
a data flow graph), and different teams can focus on different components. Moreover,
if a component breaks down, the downstream components can often continue to run
normally (at least for a while) by just using the last output from the broken component.
This makes the architecture quite robust.

On the other hand, a broken component can go unnoticed for some time if proper
monitoring is not implemented. The data gets stale and the overall system’s performance
drops."

From Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, Chapter 2, by Aurélien Géron.
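
As a rough illustration of the idea in the quote (not taken from the book), the sketch below chains two components whose only interface is a file in a shared directory. The function names `ingest` and `clean`, the file names, and the toy rows are all hypothetical.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical data store: a temporary directory of CSV files shared by the components.
store = Path(tempfile.mkdtemp())

def ingest(out_path):
    """First component: writes raw records to the data store."""
    rows = [("a", "1"), ("b", "?"), ("c", "3")]
    with open(out_path, "w", newline="") as f:
        csv.writer(f).writerows(rows)

def clean(in_path, out_path):
    """Second component: reads the previous output and drops rows containing '?'."""
    with open(in_path, newline="") as f:
        rows = [r for r in csv.reader(f) if "?" not in r]
    with open(out_path, "w", newline="") as f:
        csv.writer(f).writerows(rows)

raw_file = store / "raw.csv"
clean_file = store / "clean.csv"

ingest(raw_file)             # component 1 runs on its own schedule
clean(raw_file, clean_file)  # component 2 runs later and only knows about the file

with open(clean_file, newline="") as f:
    cleaned_rows = list(csv.reader(f))
print(cleaned_rows)
```

Because `clean` only depends on `raw.csv`, it keeps working on the last file written even if `ingest` stops running, which is both the robustness and the staleness risk the quote describes.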

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

Data Reference: This breast cancer database was obtained from the University of Wisconsin Hospitals, Madison from Dr. William H. Wolberg. Acknowledgements to:
1. O. L. Mangasarian and W. H. Wolberg: "Cancer diagnosis via linear programming", SIAM News, Volume 23, Number 5, September 1990, pp 1 & 18. 
2. William H. Wolberg and O.L. Mangasarian: "Multisurface method of pattern separation for medical diagnosis applied to breast cytology", Proceedings of the National Academy of Sciences, U.S.A., Volume 87, December 1990, pp 9193-9196. 
3. O. L. Mangasarian, R. Setiono, and W.H. Wolberg: "Pattern recognition via linear programming: Theory and application to medical diagnosis", in: "Large-scale numerical optimization", Thomas F. Coleman and Yuying Li, editors, SIAM Publications, Philadelphia 1990, pp 22-30. 
4. K. P. Bennett & O. L. Mangasarian: "Robust linear programming discrimination of two linearly inseparable sets", Optimization Methods and Software 1, 1992, 23-34 (Gordon & Breach Science Publishers).

https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%29

Attribute values:

1) ID number
2) Outcome (R = recur, N = nonrecur)
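3) Time (recurrence time if field 2 = R, disease-free time if field 2 = N)
4-33) Ten real-valued features are computed for each cell nucleus; the mean, standard error, and "worst" (largest) value of each feature are recorded, giving 30 columns: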

	a) radius (mean of distances from center to points on the perimeter)
	b) texture (standard deviation of gray-scale values)
	c) perimeter
	d) area
	e) smoothness (local variation in radius lengths)
	f) compactness (perimeter^2 / area - 1.0)
	g) concavity (severity of concave portions of the contour)
	h) concave points (number of concave portions of the contour)
	i) symmetry 
	j) fractal dimension ("coastline approximation" - 1)
34) Tumor size - diameter of the excised tumor in centimeters
35) Lymph node status - number of positive axillary lymph nodes observed at time of surgery


In [2]:
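# create list for attributes:
# the ten nucleus features repeat three times (columns 4-33), so the repeats
# get a .1 / .2 suffix; recent pandas versions reject duplicate column names

features = ['Radius', 'Texture', 'Perimeter', 'Area', 'Smoothness', 'Compactness',
            'Concavity', 'Concave Points', 'Symmetry', 'Fractal Dimension']

attNames = ['ID', 'Outcome', 'Time']
for i in range(3):
    suffix = '' if i == 0 else '.{}'.format(i)
    attNames.extend(name + suffix for name in features)
attNames.extend(('Tumor Size', 'Lymph node status'))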

In [3]:
# Load data

bcSet = pd.read_csv('Data/wpbc.data.txt',header=None, 
                      names=attNames)
bcSet.head()


,ID,Outcome,Time,Radius,Texture,Perimeter,Area,Smoothness,Compactness,Concavity,...,Perimeter.2,Area.2,Smoothness.2,Compactness.2,Concavity.2,Concave Points.2,Symmetry.2,Fractal Dimension.2,Tumor Size,Lymph node status
0,119513,N,31,18.02,27.6,117.5,1013.0,0.09489,0.1036,0.1086,...,139.7,1436.0,0.1195,0.1926,0.314,0.117,0.2677,0.08113,5.0,5
1,8423,N,61,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,...,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,3.0,2
2,842517,N,116,21.37,17.44,137.5,1373.0,0.08836,0.1189,0.1255,...,159.1,1949.0,0.1188,0.3449,0.3414,0.2032,0.4334,0.09067,2.5,0
3,843483,N,123,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,...,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,2.0,0
4,843584,R,27,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,...,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,3.5,0


## Preprocessing

We know from the column descriptions provided with the dataset, and from inspecting the data below, that there are two non-numeric columns: Outcome and Lymph node status.

In [4]:
# Take a look at the data

bcSet.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 198 entries, 0 to 197
Data columns (total 35 columns):
ID                     198 non-null int64
Outcome                198 non-null object
Time                   198 non-null int64
Radius                 198 non-null float64
Texture                198 non-null float64
Perimeter              198 non-null float64
Area                   198 non-null float64
Smoothness             198 non-null float64
Compactness            198 non-null float64
Concavity              198 non-null float64
Concave Points         198 non-null float64
Symmetry               198 non-null float64
Fractal Dimension      198 non-null float64
Radius.1               198 non-null float64
Texture.1              198 non-null float64
Perimeter.1            198 non-null float64
Area.1                 198 non-null float64
Smoothness.1           198 non-null float64
Compactness.1          198 non-null float64
Concavity.1            198 non-null float64
Concave Points.1    

### Categorical Values

We can handle the categorical variable Outcome by label-encoding it and then one-hot encoding the labels with OneHotEncoder:

In [6]:
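# LabelEncoder maps the strings to integers (N -> 0, R -> 1);
# OneHotEncoder then expands those integers into one column per class
labelEncoder = LabelEncoder()
hotEncoder = OneHotEncoder()
outcomeLabels = labelEncoder.fit_transform(bcSet['Outcome']).reshape(-1, 1)
catEncode = hotEncoder.fit_transform(outcomeLabels)

# the result is a sparse matrix, so convert it to a dense array to view it
catEncode[:5].toarray()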

array([[1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [0., 1.]])
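
As an alternative sketch, assuming scikit-learn 0.20 or later (where OneHotEncoder accepts string categories directly), the LabelEncoder step can be skipped. The DataFrame `toy` below is a made-up stand-in for the Outcome column, not the real data.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the 'Outcome' column (same two labels as the dataset).
toy = pd.DataFrame({'Outcome': ['N', 'N', 'R', 'N', 'R']})

encoder = OneHotEncoder()
# Selecting with a list of columns keeps the 2-D shape the encoder expects.
encoded = encoder.fit_transform(toy[['Outcome']]).toarray()

print(encoder.categories_)  # the labels the encoder learned, in sorted order
print(encoded)
```

The column order of the encoded array follows `encoder.categories_`, so N comes first and R second, matching the output above.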

### Missing Values

Note that the last column, 'Lymph node status', has type "object" even though it holds counts, so it must contain some non-numeric values:

In [7]:
bcSet['Lymph node status'].value_counts()

0     87
1     35
2     17
4     10
13     6
7      6
?      4
9      4
3      4
6      3
15     3
11     3
27     2
10     2
20     2
8      2
5      2
24     1
14     1
18     1
17     1
21     1
16     1
Name: Lymph node status, dtype: int64

Look! Four rows contain '?' in place of a number, which marks a missing value.

In [8]:
# get the row indices of the missing values
missingAtt = bcSet.index[bcSet['Lymph node status'] == '?']

# create new dataframe without the missing values
df_nonMissing = bcSet.drop(missingAtt)

# add encoded variables


We can check that the '?' rows have been removed:

In [9]:
df_nonMissing['Lymph node status'].value_counts()

0     87
1     35
2     17
4     10
13     6
7      6
3      4
9      4
6      3
15     3
11     3
10     2
20     2
8      2
5      2
27     2
14     1
21     1
17     1
16     1
18     1
24     1
Name: Lymph node status, dtype: int64
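
The same removal can also be done at load time. The sketch below assumes that '?' is the only missing-value marker in the file and uses a made-up three-column extract (`raw`) in place of Data/wpbc.data.txt.

```python
import io
import pandas as pd

# Hypothetical three-column extract standing in for Data/wpbc.data.txt.
raw = io.StringIO("1,N,5\n2,R,?\n3,N,0\n")

toy = pd.read_csv(raw, header=None,
                  names=['ID', 'Outcome', 'Lymph node status'],
                  na_values='?')  # read '?' as NaN instead of a string

# Drop the rows where the status is missing, then restore an integer dtype.
toy_clean = toy.dropna(subset=['Lymph node status'])
toy_clean = toy_clean.astype({'Lymph node status': int})

print(toy_clean)
```

With this approach the column is numeric from the start, so no separate string-to-number conversion is needed before modelling.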

In [None]:
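# set the y to 'Lymph node status'
# (the column was read as strings because of the '?' rows, so cast it to int)
df_nonMissing_y = df_nonMissing['Lymph node status'].astype(int)

# set the X to everything but 'Lymph node status'
df_nonMissing_X = df_nonMissing.drop(['Lymph node status'], axis=1)

# the regressors need numeric input, so turn 'Outcome' into a 0/1 indicator (R = 1)
df_nonMissing_X['Outcome'] = (df_nonMissing_X['Outcome'] == 'R').astype(int)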

In [None]:
linearReg = LinearRegression()
ridgeReg = Ridge()

linearReg.fit(df_nonMissing_X, df_nonMissing_y)
ridgeReg.fit(df_nonMissing_X, df_nonMissing_y)

### Pipeline for categorical and numeric variables

.. to come
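
A minimal sketch of how such a pipeline could look, assuming scikit-learn 0.20 or later (for ColumnTransformer). The toy rows, the choice of StandardScaler for the numeric columns, and the step names are assumptions made for illustration, not results from the real dataset.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows with one categorical and two numeric columns; the values are made up.
toy_X = pd.DataFrame({
    'Outcome': ['N', 'R', 'N', 'R', 'N', 'N'],
    'Time': [31, 27, 61, 10, 116, 123],
    'Tumor Size': [5.0, 3.5, 3.0, 2.0, 2.5, 2.0],
})
toy_y = [5, 0, 2, 1, 0, 0]

# Route each column to the transformer that suits its type.
preprocess = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['Outcome']),
    ('num', StandardScaler(), ['Time', 'Tumor Size']),
])

# Chain preprocessing and the regressor so both are fitted in one call.
model = Pipeline([
    ('preprocess', preprocess),
    ('regress', Ridge()),
])

model.fit(toy_X, toy_y)
predictions = model.predict(toy_X)
print(predictions.shape)
```

Wrapping both steps in one Pipeline means the encoder and scaler are fitted only on the data passed to `fit`, and the same transformations are reapplied automatically at prediction time.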