# Topic 1. Introduction to Machine Learning
## Advanced Supervised Classification Methods and ML Pipelines 


### We import some commonly used Python libraries

In [1]:
import numpy as np

### We import several classifiers from sklearn

In [2]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm
from sklearn.neighbors import KNeighborsClassifier

### sklearn also contains a number of datasets that can be used to test the algorithms. We will use some of them.

In [3]:
import sklearn.datasets as data_load

### We can check which datasets are included

In [4]:
print("Available datasets:")
[name for name in data_load.__all__ if "load" in name]

Available datasets:


['load_boston',
 'load_diabetes',
 'load_digits',
 'load_files',
 'load_iris',
 'load_breast_cancer',
 'load_lfw_pairs',
 'load_lfw_people',
 'load_linnerud',
 'load_mlcomp',
 'load_sample_image',
 'load_sample_images',
 'load_svmlight_file',
 'load_svmlight_files']

### Finally, we import the methods for validating the classifiers and for constructing ML pipelines

In [5]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import recall_score
from sklearn import metrics



from sklearn import preprocessing
from sklearn.pipeline import Pipeline

### We will also use the TPOT package to search for (almost) optimal pipelines

In [7]:
from tpot import TPOTClassifier

## Inspecting the Real-World datasets

We will use the breast cancer dataset, which is included in the UCI ML Repository: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)


It has been used for the application of ML to Cancer diagnosis and prognosis: http://pages.cs.wisc.edu/~olvi/uwmp/cancer.html

In [8]:
# The dataset is loaded
breast_cancer_data = data_load.load_breast_cancer()

In [9]:
#Display options
np.set_printoptions(suppress=True)

It is good practice to inspect a dataset before applying any ML technique: its description, its header, and the characteristics of the data.

In [12]:
#Some information about the dataset, understand what we are aiming for
print(breast_cancer_data['DESCR'])

Breast Cancer Wisconsin (Diagnostic) Database

Notes
-----
Data Set Characteristics:
    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry 
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 3 is Mean Radius, field
        13 is Radius SE, field 23 is Worst Radius.

        

We now look at the dataset in more detail. Rows are observations (instances of our classification problem) and columns are the variables measured for each observation.


In [15]:
breast_cancer_data["data"]

array([[  17.99   ,   10.38   ,  122.8    , ...,    0.2654 ,    0.4601 ,
           0.1189 ],
       [  20.57   ,   17.77   ,  132.9    , ...,    0.186  ,    0.275  ,
           0.08902],
       [  19.69   ,   21.25   ,  130.     , ...,    0.243  ,    0.3613 ,
           0.08758],
       ..., 
       [  16.6    ,   28.08   ,  108.3    , ...,    0.1418 ,    0.2218 ,
           0.0782 ],
       [  20.6    ,   29.33   ,  140.1    , ...,    0.265  ,    0.4087 ,
           0.124  ],
       [   7.76   ,   24.54   ,   47.92   , ...,    0.     ,    0.2871 ,
           0.07039]])

Notice in the rows shown above that the range of values varies from column to column. Some columns seem to take values between 0 and 1, while others take much larger values. This has to be taken into account when applying the classifiers, because some of them are sensitive to the scale of the features.
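To make the difference in scale concrete, the sketch below prints the minimum and maximum of the three narrowest and the three widest columns. It reloads the dataset so that it runs on its own; the variable names (`col_range`, `order`) are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X = data.data

# Per-column minimum, maximum and range (max - min)
col_min = X.min(axis=0)
col_max = X.max(axis=0)
col_range = col_max - col_min

# Show the three narrowest and the three widest columns
order = np.argsort(col_range)
for idx in list(order[:3]) + list(order[-3:]):
    print(f"{data.feature_names[idx]:<25s} "
          f"min={col_min[idx]:10.4f} max={col_max[idx]:10.4f}")
```

Columns related to area span thousands of units, while the error and fractal-dimension columns stay well below 1.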

In [13]:
#Classes in the database
breast_cancer_data["target"]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1,
       1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0,
       1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1,
       0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,
       0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1,
       0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1,
       0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0,
       0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1,
       1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1,
       1, 0,

In the previous analysis, notice that we used breast_cancer_data["data"] to view the features and breast_cancer_data["target"] to view the classes.
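The same two keys can be used to summarize the dataset instead of printing it in full. The following is a small sketch (the names `X`, `y` and `class_counts` are chosen here for illustration) that reports the shapes and the number of instances per class.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data["data"], data["target"]

print("Feature matrix shape:", X.shape)
print("Target vector shape: ", y.shape)
print("First feature names: ", list(data["feature_names"][:3]))

# Number of instances in each class (class 0 first, then class 1)
class_counts = np.bincount(y)
for name, count in zip(data["target_names"], class_counts):
    print(f"{name}: {count} instances")
```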

### Learning and validating classifiers 

We define a logistic regression classifier

In [14]:
lr = LogisticRegression()

We estimate the classifier accuracy using k-fold cross-validation with k=5. cross_val_predict returns one prediction per instance, each made by a model that was trained on the folds that do not contain that instance.
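To see where these per-instance predictions come from, here is a minimal sketch that builds them by hand with a stratified 5-fold split. It is an illustration of the idea, not scikit-learn's internal code. One assumption differs from the notebook: the logistic regression is wrapped in a pipeline with a StandardScaler so that the solver converges without warnings.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Assumption: scaling is added only to help the solver converge
model = make_pipeline(StandardScaler(), LogisticRegression())

# -1 marks instances that have not received a prediction yet
manual_prediction = np.full(y.shape, -1)

folds = StratifiedKFold(n_splits=5)
for train_idx, test_idx in folds.split(X, y):
    # Train on four folds, predict the held-out fold
    model.fit(X[train_idx], y[train_idx])
    manual_prediction[test_idx] = model.predict(X[test_idx])

print("Instances without a prediction:", int(np.sum(manual_prediction == -1)))
print("Accuracy of pooled predictions:", np.mean(manual_prediction == y))
```

Every instance appears in exactly one held-out fold, so the loop fills the whole prediction vector.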

In [15]:
prediction = cross_val_predict(lr,breast_cancer_data.data, breast_cancer_data.target,cv=5)

In [16]:
# Let us print the predictions
print(prediction)

[0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 1 1 1 1 0 0 1 0 1 0 1 1 1 1 1 0 0 1 0 0 1 1 1 1 0 1 0 0 1 1 1 1 0 1 0 1
 1 0 1 0 0 1 1 1 0 0 1 0 1 0 1 1 1 1 0 1 0 0 1 1 1 0 0 1 1 1 1 0 1 1 0 1 1
 1 1 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0 1 0 1 0 0 1 0 1 1 1 0 1 1 0 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 0 0 1 1 1 0 0 1 0 1 1 0 0 1 1 0 0 1 1 1 1 0 1 1 0 0 0 1 0
 1 0 1 1 1 0 1 1 0 1 1 0 0 0 0 1 0 0 0 1 0 1 0 1 1 0 1 0 0 0 1 1 1 0 0 1 1
 1 0 1 1 1 1 1 0 0 1 1 0 1 1 0 0 0 0 1 1 1 1 0 1 1 1 1 1 0 1 0 0 0 1 0 0 0
 0 0 0 0 0 0 0 1 1 1 1 1 1 0 1 0 1 1 0 1 1 0 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1
 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 0 1 1 1 1 0 0 0 1 1
 1 1 0 1 0 1 0 1 1 1 0 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 1 0 0
 0 1 0 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1
 1 0 1 1 1 1 0 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 0 1 1 1 1 1 0 1 1
 0 1 0 1 1 0 1 0 1 1 1 0 1 1 1 1 0 0 1 1 1 0 1 1 0 1 1 1 1 1 1 1 0 1 1 0 1
 1 1 1 1 1 1 0 1 0 1 0 0 

With the predictions and the target (true class values) we can compute different performance measures for the classifier. We do this for the accuracy metric below.

In [17]:
lr_accuracy = metrics.accuracy_score(breast_cancer_data.target, prediction) 
print("The accuracy of the logistic regression classifier, as computed using 5-fold crossvalidation, is: ",lr_accuracy)

The accuracy of the logistic regression classifier, as computed using 5-fold crossvalidation, is:  0.95079086116


We can also compute the confusion matrix for the predictions made by the logistic regression classifier

In [18]:
lr_confusion_matrix = metrics.confusion_matrix(breast_cancer_data.target, prediction)
print("Confusion matrix for the predictions made by the logistic regression classifier:")
print(lr_confusion_matrix)

Confusion matrix for the predictions made by the logistic regression classifier:
[[194  18]
 [ 10 347]]
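The confusion matrix already contains everything needed for the usual metrics. As a sketch, the code below recomputes accuracy, precision and recall from the matrix printed above, assuming scikit-learn's convention that rows are true classes and columns are predicted classes, with class 1 treated as the positive class.

```python
import numpy as np

# Confusion matrix copied from the output above
cm = np.array([[194, 18],
               [10, 347]])

# With rows = true class and columns = predicted class:
# cm[0, 0] = TN, cm[0, 1] = FP, cm[1, 0] = FN, cm[1, 1] = TP
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print("accuracy: ", accuracy)
print("precision:", precision)
print("recall:   ", recall)
```

The accuracy obtained this way agrees with the value returned by metrics.accuracy_score.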


# Exercise 1

Using the examples from the previous cells, and the information given in http://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics   

Compute

1.1) Precision and recall scores for the logistic regression classifier


1.2) f1_score for the predictions made by a decision tree 

In [20]:
lr_precision = metrics.precision_score(breast_cancer_data.target, prediction) 
print("The precision of the logistic regression classifier, as computed using 5-fold crossvalidation, is: ",lr_precision)

The precision of the logistic regression classifier, as computed using 5-fold crossvalidation, is:  0.950684931507


In [26]:
lr_recall = metrics.recall_score(breast_cancer_data.target, prediction) 
print("The recall of the logistic regression classifier, as computed using 5-fold crossvalidation, is: ",lr_recall)

The recall of the logistic regression classifier, as computed using 5-fold crossvalidation, is:  0.971988795518


In [23]:
# Use separate names so the logistic regression results are not overwritten
dt = DecisionTreeClassifier()
dt_prediction = cross_val_predict(dt,breast_cancer_data.data, breast_cancer_data.target,cv=5)
dt_f1 = metrics.f1_score(breast_cancer_data.target, dt_prediction) 
print(dt_f1)

0.933333333333
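The F1 score is the harmonic mean of precision and recall. The sketch below checks this on a small made-up label vector (the values of `y_true` and `y_pred` are invented for illustration and are not taken from the dataset).

```python
from sklearn import metrics

# Toy labels, invented for this example
y_true = [0, 1, 1, 1, 0, 1, 0, 1]
y_pred = [0, 1, 0, 1, 1, 1, 0, 1]

p = metrics.precision_score(y_true, y_pred)
r = metrics.recall_score(y_true, y_pred)

# Harmonic mean of precision and recall
f1_manual = 2 * p * r / (p + r)
f1_sklearn = metrics.f1_score(y_true, y_pred)

print("precision:", p, "recall:", r)
print("manual F1:", f1_manual, "sklearn F1:", f1_sklearn)
```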


# Exercise 2

Program a function that receives one classifier (of any type), the training data, and the classes,  and outputs three metrics: accuracy, precision score and recall score, all computed using cross-validation. 


SUGGESTION: Complete the following function and test it in the following cell. 

In [31]:
def my_scores_function(clf,train_data,train_class):
    # Out-of-fold predictions: the features go first, then the class labels
    clf_prediction = cross_val_predict(clf,train_data, train_class,cv=5)
    # Every metric compares the true classes with the predictions
    acc = metrics.accuracy_score(train_class, clf_prediction) 
    precision = metrics.precision_score(train_class, clf_prediction) 
    recall = metrics.recall_score(train_class, clf_prediction)
    return acc,precision,recall

In [32]:
# We will test the implemented function using a KNN classifier
knn = KNeighborsClassifier(n_neighbors= 5, metric="euclidean")
my_scores_function(knn,breast_cancer_data.data, breast_cancer_data.target)
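An alternative way to obtain several cross-validated metrics at once is scikit-learn's cross_validate, sketched below for the same k-NN classifier. Note the difference in what is measured: cross_validate scores each fold separately (here we report the mean and standard deviation over folds), whereas pooling the predictions of cross_val_predict gives a single score over all instances, so the numbers need not coincide exactly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")

metric_names = ["accuracy", "precision", "recall"]
fold_scores = cross_validate(knn, X, y, cv=5, scoring=metric_names)

# Each entry "test_<metric>" holds one score per fold
for name in metric_names:
    values = fold_scores["test_" + name]
    print(f"{name}: mean={values.mean():.3f} std={values.std():.3f}")
```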

We define a standard scaler to scale all features in the dataset.


In [29]:
scaler = preprocessing.StandardScaler()
scaled_data = scaler.fit_transform(X=breast_cancer_data["data"])
#scaled_data
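As a quick sanity check, a standard scaler should leave every column with mean close to 0 and standard deviation close to 1. The sketch below verifies this on a freshly loaded copy of the data (the name `scaled` is used here only to keep the example self-contained).

```python
import numpy as np
from sklearn import preprocessing
from sklearn.datasets import load_breast_cancer

X = load_breast_cancer().data
scaled = preprocessing.StandardScaler().fit_transform(X)

# Largest deviation from the expected mean (0) and standard deviation (1)
print("max |mean|:   ", np.abs(scaled.mean(axis=0)).max())
print("max |std - 1|:", np.abs(scaled.std(axis=0) - 1).max())
```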

# Exercise 3

Use my_scores_function() to compute the accuracy, precision, and recall of a decision tree classifier that uses the scaled data.

In [None]:
dt = DecisionTreeClassifier()
dt_acc, dt_precision, dt_recall = my_scores_function(dt, scaled_data, breast_cancer_data.target)
print(dt_acc, dt_precision, dt_recall)

Scaling the data can improve the accuracy of some classifiers, particularly distance-based ones such as k-NN. We can combine the scaling and classification steps into a single object, so that both are applied with one call. Below we use a pipeline for this purpose.

In [30]:
knn_scale = Pipeline([("scaler", scaler), ("k-NN", knn)])
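A pipeline can be passed to the cross-validation helpers like any other classifier. In that case the scaler is refit on the training part of each fold, so no information from the held-out fold leaks into the scaling. The sketch below compares k-NN with and without scaling; it rebuilds the objects so that it runs on its own.

```python
from sklearn import metrics, preprocessing
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn_scale = Pipeline([("scaler", preprocessing.StandardScaler()),
                      ("k-NN", knn)])

# Same 5-fold protocol for both: raw features versus scaled-inside-the-fold
raw_pred = cross_val_predict(knn, X, y, cv=5)
scaled_pred = cross_val_predict(knn_scale, X, y, cv=5)

raw_acc = metrics.accuracy_score(y, raw_pred)
scaled_acc = metrics.accuracy_score(y, scaled_pred)
print("k-NN on raw features:   ", raw_acc)
print("k-NN inside a pipeline: ", scaled_acc)
```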

# Exercise 4

Create a pipeline that uses one scaler, one feature selection method that keeps 10 features, and a support vector machine classifier.


4.1) Compute the accuracy, precision, and recall of your pipeline. 


Suggestion: If needed, check the sklearn documentation for feature selection methods and for the definition of the support vector machine classifier. http://scikit-learn.org/0.18/index.html
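One possible shape for such a pipeline is sketched below. The specific choices are assumptions made for illustration, not the only valid answer: StandardScaler as the scaler, SelectKBest with the f_classif score to keep 10 features, and an SVC with default parameters.

```python
from sklearn import metrics
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

svm_pipeline = Pipeline([
    ("scaler", StandardScaler()),                         # rescale every feature
    ("select", SelectKBest(score_func=f_classif, k=10)),  # keep 10 features
    ("svm", SVC()),                                       # default RBF-kernel SVM
])

# Out-of-fold predictions for the whole pipeline
svm_pred = cross_val_predict(svm_pipeline, X, y, cv=5)

svm_acc = metrics.accuracy_score(y, svm_pred)
svm_precision = metrics.precision_score(y, svm_pred)
svm_recall = metrics.recall_score(y, svm_pred)
print(svm_acc, svm_precision, svm_recall)
```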


## TPOT: Optimizing Pipelines

Now, let's use TPOT, a bi-objective genetic programming tool that generates pipelines automatically by searching for the maximum accuracy while also attempting to keep the pipelines simple.

### We define a TPOT instance, similarly to the way it is done for a regular sklearn classifier.

In [31]:
tpot = TPOTClassifier(generations=5, population_size=10, verbosity=2, random_state=16)




### Then we use TPOT to "learn" a good pipeline (it may take some time)

In [33]:
tpot.fit(features=breast_cancer_data["data"], target=breast_cancer_data["target"])

Generation 1 - Current best internal CV score: 0.95801462100808
Generation 2 - Current best internal CV score: 0.95801462100808
Generation 3 - Current best internal CV score: 0.95801462100808
Generation 4 - Current best internal CV score: 0.96326279338207
Generation 5 - Current best internal CV score: 0.96326279338207

Best pipeline: DecisionTreeClassifier(RandomForestClassifier(input_matrix, RandomForestClassifier__bootstrap=False, RandomForestClassifier__criterion=gini, RandomForestClassifier__max_features=0.25, RandomForestClassifier__min_samples_leaf=DEFAULT, RandomForestClassifier__min_samples_split=2, RandomForestClassifier__n_estimators=100), DecisionTreeClassifier__criterion=DEFAULT, DecisionTreeClassifier__max_depth=9, DecisionTreeClassifier__min_samples_leaf=15, DecisionTreeClassifier__min_samples_split=20)


TPOTClassifier(config_dict={'sklearn.feature_selection.RFE': {'estimator': {'sklearn.ensemble.ExtraTreesClassifier': {'max_features': array([ 0.05,  0.1 ,  0.15,  0.2 ,  0.25,  0.3 ,  0.35,  0.4 ,  0.45,
        0.5 ,  0.55,  0.6 ,  0.65,  0.7 ,  0.75,  0.8 ,  0.85,  0.9 ,
        0.95,  1.  ]), 'n_estimators': [1... 0.45,
        0.5 ,  0.55,  0.6 ,  0.65,  0.7 ,  0.75,  0.8 ,  0.85,  0.9 ,
        0.95,  1.  ])}},
        crossover_rate=0.1, cv=5, disable_update_check=False,
        generations=5, max_eval_time_mins=5, max_time_mins=None,
        mutation_rate=0.9, n_jobs=1, offspring_size=10, population_size=10,
        random_state=16, scoring=None, subsample=1.0, verbosity=2,
        warm_start=False)

Now we can see what the result is

In [34]:

tpot.fitted_pipeline_.steps

[('stackingestimator',
  StackingEstimator(estimator=RandomForestClassifier(bootstrap=False, class_weight=None, criterion='gini',
              max_depth=None, max_features=0.25, max_leaf_nodes=None,
              min_impurity_split=1e-07, min_samples_leaf=1,
              min_samples_split=2, min_weight_fraction_leaf=0.0,
              n_estimators=100, n_jobs=1, oob_score=False, random_state=None,
              verbose=0, warm_start=False))),
 ('decisiontreeclassifier',
  DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=9,
              max_features=None, max_leaf_nodes=None,
              min_impurity_split=1e-07, min_samples_leaf=15,
              min_samples_split=20, min_weight_fraction_leaf=0.0,
              presort=False, random_state=None, splitter='best'))]

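Running TPOT on our data with this configuration suggests that, among the pipelines it evaluated, the one shown above achieves the best internal cross-validation score (about 0.963). With only 5 generations and a population of 10 the search is small, so this is a good pipeline rather than a guaranteed optimum.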
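To get a feel for what a two-step pipeline of this kind does, the sketch below imitates it with plain scikit-learn. The class `PredictionAugmenter` is a hypothetical stand-in written for this example, not TPOT's StackingEstimator: it assumes that the stacking step appends the inner classifier's predicted label and class probabilities to the original features before passing them on. The `random_state=0` arguments are also an assumption, added only for reproducibility.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier


class PredictionAugmenter(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for a stacking step (not TPOT code)."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y):
        # Fit a fresh copy so the estimator passed in stays untouched
        self.estimator_ = clone(self.estimator).fit(X, y)
        return self

    def transform(self, X):
        # Prepend the predicted label and class probabilities to the features
        label = self.estimator_.predict(X).reshape(-1, 1)
        proba = self.estimator_.predict_proba(X)
        return np.hstack([label, proba, X])


X, y = load_breast_cancer(return_X_y=True)

# Hyperparameters copied from the TPOT output above; random_state is assumed
stacked = Pipeline([
    ("stackingestimator", PredictionAugmenter(
        RandomForestClassifier(n_estimators=100, max_features=0.25,
                               bootstrap=False, random_state=0))),
    ("decisiontreeclassifier", DecisionTreeClassifier(
        max_depth=9, min_samples_leaf=15, min_samples_split=20,
        random_state=0)),
])

scores = cross_val_score(stacked, X, y, cv=5)
print("Mean 5-fold accuracy:", scores.mean())
```

Because of the assumptions above, the score printed here need not match TPOT's internal CV score.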

# Exercise 5

Moving forward to a real classification problem,

5.1) Fetch a real dataset for classification (different from the one used in the example) from the sklearn library, understand how it is structured, and get familiar with it.

5.2) Define and fit a classifier using the data.

5.3) Use cross-validation to estimate the accuracy, recall, and precision of the classifier.

5.4) Use a pre-processing method to transform the data before feeding it to the classifier

5.5) Create a Pipeline which includes (at least) one preprocessing method, and a classifier.

5.6) Apply the pipeline to the data.

5.7) Use TPOT to automatically generate a pipeline



In [25]:
#Available datasets:
[name for name in data_load.__all__ if "load" in name]

['load_boston',
 'load_diabetes',
 'load_digits',
 'load_files',
 'load_iris',
 'load_breast_cancer',
 'load_linnerud',
 'load_mlcomp',
 'load_sample_image',
 'load_sample_images',
 'load_svmlight_file',
 'load_svmlight_files',
 'load_wine']

*Note that this list contains datasets intended for both classification and regression. You will have to recognize which ones are valid for classification.
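One rough heuristic for telling the two kinds apart is to count the distinct target values: classification datasets have a handful of class labels, while regression targets take many different values. The sketch below applies this to four loaders, assuming a scikit-learn version that includes load_wine; reading the DESCR text of each dataset remains the more reliable check.

```python
import numpy as np
import sklearn.datasets as data_load

# Number of distinct target values per dataset
distinct_targets = {}
for loader_name in ["load_iris", "load_wine", "load_digits", "load_diabetes"]:
    dataset = getattr(data_load, loader_name)()
    distinct_targets[loader_name] = len(np.unique(dataset["target"]))
    print(f"{loader_name:<15s} distinct target values: "
          f"{distinct_targets[loader_name]}")
```

For datasets with more than two classes, metrics.precision_score and metrics.recall_score need an explicit average argument (for example average="macro").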