# Machine Learning in Python with Scikit-Learn

## Scikit-Learn Overview

* dominant machine learning library for Python
* very wide user base
* very good documentation
* state-of-the-art implementations
* unified API
* full integration into ***NumPy / Pandas*** workflows
* *everything but* **Deep Learning**


## Scikit-Learn Resources
* Website: https://scikit-learn.org/stable/index.html
* API Reference: https://scikit-learn.org/stable/modules/classes.html
* Tutorial: https://scikit-learn.org/stable/tutorial/index.html

## Scikit-Learn Structure
***SkLearn*** provides a wide range of ML Algorithms plus methods for:
* loading / accessing data
* data pre-processing
* data selection
* model evaluation 
* model tuning
<br>

## Data Access
### Built-in Data Sets
***sklearn*** provides many datasets that are commonly used in Machine Learning teaching and tutorials.
* see full list here: https://scikit-learn.org/stable/datasets/index.html

In [2]:
from sklearn.datasets import load_iris
iris = load_iris()    # load the data set once
X = iris['data']      # feature matrix (one row per sample)
y = iris['target']    # label vector

In [3]:
type(X)

numpy.ndarray

In [4]:
print("First 10 lines of X:\n", X[0:10])
print("\nLabels\n",y)

First 10 lines of X:
 [[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]]

Labels
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
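
As a small sketch of other ways to access the same data set: the object returned by `load_iris` also carries the feature and class names, and `return_X_y=True` returns the two arrays directly. The variable names `iris`, `X_iris` and `y_iris` are only illustrative.

```python
from sklearn.datasets import load_iris

# Load the data set once and keep the returned Bunch object
iris = load_iris()
print(iris.feature_names)   # names of the four measured features
print(iris.target_names)    # names of the three iris species

# Alternatively, get the feature matrix and the label vector directly
X_iris, y_iris = load_iris(return_X_y=True)
print(X_iris.shape, y_iris.shape)
```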


### Loading data

You can load any data that you want to work with (e.g. using pandas), as long as it ends up in an array-like form such as a NumPy array or a DataFrame.
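
A minimal sketch of loading your own data with pandas could look like this. The CSV content and the column names are made up for illustration; an in-memory string stands in for a real file so that the cell runs on its own.

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk (hypothetical content and column names);
# with a real file you would pass its path to pd.read_csv instead.
csv_text = io.StringIO(
    "sepal_length,sepal_width,petal_length,petal_width,species\n"
    "5.1,3.5,1.4,0.2,setosa\n"
    "7.0,3.2,4.7,1.4,versicolor\n"
    "6.3,3.3,6.0,2.5,virginica\n"
)
df = pd.read_csv(csv_text)

# Separate the feature columns from the label column
X_own = df.drop(columns=["species"]).to_numpy()
y_own = df["species"].to_numpy()
print(X_own.shape, y_own)
```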

## Unified API
One key feature of ***sklearn*** is its unified API, which makes it very simple to exchange one ML method for another.
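
To illustrate the idea, here is a rough sketch that trains three different classifiers with exactly the same calls. The choice of models, their hyperparameters and the variable names are assumptions made for this illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_iris(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# Three different algorithms, one identical workflow
models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "k-nearest neighbors": KNeighborsClassifier(n_neighbors=5),
    "logistic regression": LogisticRegression(max_iter=1000),
}
scores = {}
for name, estimator in models.items():
    estimator.fit(Xtr, ytr)                      # same call for every estimator
    scores[name] = estimator.score(Xte, yte)     # mean accuracy on the test data
    print(name, scores[name])
```

Only the line that instantiates the estimator differs; `fit`, `predict` and `score` stay the same.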

### Example: Classification with decision trees
In the following example we will use the Iris dataset that we loaded above and classify it with a decision tree. Note that the outlined steps are the same for all classifiers. Only the instantiation (and the respective hyperparameters) will change.

For comparison the corresponding pipeline in KNIME is shown here: 
<img src="IMG/KNIME-decision tree.jpg" width=600>


#### Step 1: Partition the data
Partition the data into a training set and a validation or test set.  
Note that, by convention, an uppercase `X` denotes the feature matrix and a lowercase `y` the label vector; the suffixes `_train` and `_test` mark the two partitions.  

Check out the documentation to find out which parameters are available! 
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [7]:
# Randomly split into train and test data (corresponds to "Partitioning" node in KNIME)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
X_train.shape  # 70% of the 150 samples end up in the training set

(105, 4)
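
One parameter that is often worth a look is `stratify`. The following sketch, with illustrative variable names, shows how it keeps the class proportions identical in both partitions, which a purely random split does not guarantee.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_iris(return_X_y=True)

# stratify=y_demo keeps the class proportions equal in both partitions
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)
print(Xtr.shape, Xte.shape)
print(np.bincount(ytr), np.bincount(yte))   # samples per class in each partition
```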

#### Step 2: Create model instance for ML Algorithm
Check out the documentation to find out which hyperparameters are available! 

In [8]:
# Import and instantiate the classifier
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
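
Hyperparameters are set when the model instance is created. As a small sketch (the chosen values are arbitrary examples, not recommendations), a shallower tree with a different split criterion could be created like this:

```python
from sklearn.tree import DecisionTreeClassifier

# Hyperparameters are passed as keyword arguments at instantiation
small_tree = DecisionTreeClassifier(max_depth=3, criterion="entropy", random_state=0)

# get_params() lists all hyperparameters, including those left at their defaults
params = small_tree.get_params()
print(params["max_depth"], params["criterion"], params["min_samples_split"])
```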

#### Step 3: Fit the model to the training data

In [9]:
# Build a classifier from the training set (X_train, y_train)
# (Corresponds to "Decision Tree Learner" in KNIME)

clf = clf.fit(X_train, y_train)

#### Step 4: Predict values (= Apply the model)

In [10]:
# Predict the class label for each row of X_test
# (Corresponds to "Decision Tree Predictor" in KNIME)

pred = clf.predict(X_test)
pred

array([1, 0, 2, 1, 1, 0, 1, 2, 1, 1, 2, 0, 0, 0, 0, 1, 2, 1, 1, 2, 0, 2,
       0, 2, 2, 2, 2, 2, 0, 0, 0, 0, 1, 0, 0, 2, 1, 0, 0, 0, 2, 1, 1, 0,
       0])

#### Step 5: Evaluate / Validate the model 

In [11]:
from sklearn import metrics

In [12]:
# Calculate the accuracy score on the test/validation data
metrics.accuracy_score(y_test, pred)

1.0

In [15]:
# Calculate the confusion matrix
cm = metrics.confusion_matrix(y_test,pred)
print(cm)

[[19  0  0]
 [ 0 13  0]
 [ 0  0 13]]


In [16]:
# Write a classification report 
report = metrics.classification_report(y_test,pred)
print(report)

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       1.00      1.00      1.00        13
           2       1.00      1.00      1.00        13

    accuracy                           1.00        45
   macro avg       1.00      1.00      1.00        45
weighted avg       1.00      1.00      1.00        45
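
The accuracy, precision and recall values can all be derived from the confusion matrix. The sketch below uses a few hypothetical labels (not the Iris results above) with one deliberate misclassification, so that the numbers are not all 1.0.

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels, chosen so that one sample is misclassified
y_true = [0, 0, 1, 1, 2, 2]
y_hat = [0, 0, 1, 2, 2, 2]

# Rows are the true classes, columns are the predicted classes
cm_demo = metrics.confusion_matrix(y_true, y_hat)

accuracy = np.trace(cm_demo) / cm_demo.sum()         # correct predictions / all predictions
recall = np.diag(cm_demo) / cm_demo.sum(axis=1)      # per true class (row sums)
precision = np.diag(cm_demo) / cm_demo.sum(axis=0)   # per predicted class (column sums)
print(cm_demo)
print(accuracy, recall, precision)
```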



#### Step 6: Apply the model

In [17]:
# Load application data (which does not have labels) and call the predict method again
# In this example we will just do it for a single data instance
import numpy as np
import pandas as pd

application_data = pd.DataFrame(np.array([2.1, 3.7, 1.3, 0.2])).transpose()
print("Input:\n", application_data.iloc[0,:])
pred = clf.predict(application_data)
print("Predicted class:", pred[0])

Input:
 0    2.1
1    3.7
2    1.3
3    0.2
Name: 0, dtype: float64
Predicted class: 0
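
Besides the predicted class, most classifiers can also report class probabilities via `predict_proba`. A minimal sketch, assuming a tree fitted on the full Iris data (the names `tree` and `new_sample` are illustrative):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# One unlabeled instance, shaped as (n_samples, n_features) = (1, 4)
new_sample = np.array([[2.1, 3.7, 1.3, 0.2]])
proba = tree.predict_proba(new_sample)
print(tree.classes_)   # order of the probability columns
print(proba)           # one row per instance, one column per class
```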


## Fine-tuning the parameters of a model

Using a validation set, we can fine-tune the hyperparameters of a model. sklearn's `GridSearchCV` tries every combination in a given parameter grid and uses cross-validation to find the best-performing one.

In [1]:
import numpy as np
from sklearn.model_selection import GridSearchCV

In [3]:
# Initializing a grid with the machine learning parameters we want to optimize
# Here, we will optimize the parameter "max_depth" of the Decision tree. 
grid = {'max_depth' : np.arange(1,25)}

In [5]:
# Initialize the classifier
from sklearn.tree import DecisionTreeClassifier
tree_clf = DecisionTreeClassifier()

In [18]:
# We will use cross-validation to find the optimal depth
# Note that GridSearchCV does the split into training and validation data for us! 
clf = GridSearchCV(tree_clf, grid, cv = 10)
clf.fit(X_train, y_train)

GridSearchCV(cv=10, estimator=DecisionTreeClassifier(),
             param_grid={'max_depth': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24])})

In [19]:
# We can now extract the best set of parameters
clf.best_params_

{'max_depth': 7}

In [20]:
# ... and the best score we got for these parameters
clf.best_score_

0.9436363636363637
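
After the search, the tuned model still has to be checked on data it has never seen. The following sketch repeats the search in a self-contained way and then scores the best model on the held-out test data; the variable names and `random_state=0` are assumptions made for reproducibility of this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_iris(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": np.arange(1, 25)},
    cv=10,
)
search.fit(Xtr, ytr)

# With the default refit=True, best_estimator_ is retrained on all of Xtr
best_tree = search.best_estimator_
test_accuracy = best_tree.score(Xte, yte)
print(search.best_params_, test_accuracy)

# Mean cross-validated score for each candidate depth
mean_scores = search.cv_results_["mean_test_score"]
print(mean_scores.round(3))
```

Looking at `mean_scores` often shows that several depths perform almost equally well, so the reported best depth can vary from run to run when no `random_state` is set.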

## Advanced: Saving and Loading Models
Fitted models can be stored via ***pickle***, the ***Python*** serialization library: https://docs.python.org/3/library/pickle.html


In [41]:
import pickle

# `model` stands for any fitted estimator, e.g. the `clf` trained above
with open("my_model.p", "wb") as f:
    pickle.dump(model, f)        # save model to file
with open("my_model.p", "rb") as f:
    model2 = pickle.load(f)      # load model from file
model2.predict(X_test)

array([1, 1, 0, 0, 1, 0, 1, 0, 0, 1])
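
A self-contained sketch of the full round trip is shown below. It assumes a decision tree fitted on the Iris data and writes to a temporary directory; the names `fitted` and `restored` are illustrative.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_iris(return_X_y=True)
fitted = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# Write to a temporary directory so that no file is left behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "my_model.p")
    with open(path, "wb") as f:
        pickle.dump(fitted, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions
same = np.array_equal(fitted.predict(X_demo), restored.predict(X_demo))
print(same)
```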