## Explore:
1. sklearn.preprocessing?
2. sklearn Pipeline?
3. Pipelines in Dask and PySpark?
4. References: data pipelines and data preprocessing

## 1. Sklearn.preprocessing?
Sources:
- https://scikit-learn.org/stable/getting_started.html
- https://scikit-learn.org/stable/user_guide.html

### Fitting and predicting: estimator basics

- An estimator is one of the built-in machine learning algorithms or models in ```Scikit-learn```. Each estimator is a class whose instances can be fitted to data using the *fit* method.
- The *fit* method generally accepts 2 inputs:
    - The sample matrix (or design matrix) X. The shape of X is typically (n_samples, n_features), which means that samples are represented as rows and features are represented as columns.
    - The target values y, which are real numbers for regression tasks, or integers for classification (or any other discrete set of values). For unsupervised learning tasks, y does not need to be specified. y is usually a 1d array where the i-th entry corresponds to the target of the i-th sample (row) of X.
- Both X and y are usually expected to be numpy arrays or equivalent array-like data types, though some estimators work with other formats such as sparse matrices.
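
To make the shapes described above concrete, here is a minimal sketch. The toy values and the choice of `LinearRegression` and `KMeans` are assumptions made only for illustration; they do not appear in the original notes.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# X: 4 samples (rows) x 2 features (columns); values are made up
X = np.array([[0.0, 1.0], [1.0, 2.0], [2.0, 3.0], [3.0, 5.0]])
# y: one real-valued target per sample (a regression task)
y = np.array([1.0, 3.0, 5.0, 8.0])

print(X.shape)  # (n_samples, n_features)
print(y.shape)  # (n_samples,)

# Supervised learning: fit receives both X and y
reg = LinearRegression().fit(X, y)

# Unsupervised learning: fit receives X only
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_.shape)  # one cluster label per sample
```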

In [2]:
## Example: fitting with Random Forest Classifier

from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=0) # Covered in the Supervised Learning session
X = [[1,2,3], [11,12,13]] # 2 samples, 3 features
y = [0,1] # classes of each sample
clf.fit(X,y)

RandomForestClassifier(random_state=0)

In [6]:
### Predicting target values of new data

print(clf.predict(X))
print(clf.predict([[4,5,6], [14,15,16]]))
print(clf.predict([[14,5,16], [14,1,16]]))

[0 1]
[0 1]
[1 1]
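
This section's title asks about `sklearn.preprocessing`, so a short transformer example may help. The sketch below assumes `StandardScaler` as the pre-processor and uses made-up values. A transformer follows the same estimator API, but exposes *transform* instead of *predict*.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# 2 samples, 2 features on very different scales (illustrative values)
X = np.array([[0.0, 15.0], [1.0, -10.0]])

scaler = StandardScaler()
# fit learns the per-column mean and standard deviation,
# transform applies them to produce zero-mean, unit-variance columns
X_scaled = scaler.fit(X).transform(X)

print(scaler.mean_)
print(X_scaled)
```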


## 2. Pipelines: chaining pre-processors and estimators

Transformers and estimators (predictors) can be combined into a single unifying object: a **Pipeline**. The pipeline offers the same API as a regular estimator: it can be fitted and used for prediction with ```fit``` and ```predict```. As we will see later, using a pipeline also protects you from **data leakage**, i.e. disclosing some testing data in your training data.

In the following example, we load the Iris dataset, split it into train and test sets, and compute the accuracy score of a pipeline on the test data:

In [13]:
## Pipelines example

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

### Create a pipeline object
pipe = make_pipeline(StandardScaler(), LogisticRegression())

### Load the iris dataset and split it into train and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=0)

### Fit the whole pipeline
pipe.fit(X_train, y_train)
#Pipeline(steps=[('standardscaler', StandardScaler()), ('logisticregression', LogisticRegression())])

### We can now use it like any other estimator
accuracy_score(y_test, pipe.predict(X_test)) # signature is accuracy_score(y_true, y_pred)

0.9736842105263158
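
As a hedged illustration of the data leakage point made earlier in this section, the sketch below rebuilds the same pipeline and checks that the scaler inside it learned its statistics from the training rows only. The `leaky_scaler` name is hypothetical and shown only for contrast.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)

# make_pipeline names each step after its lowercased class name
scaler = pipe.named_steps["standardscaler"]

# The pipeline's scaler was fitted on the training rows only
print(np.allclose(scaler.mean_, X_train.mean(axis=0)))

# Leaky alternative (for contrast): statistics computed on ALL rows,
# so information from the test rows flows into the preprocessing step
leaky_scaler = StandardScaler().fit(X)
print(leaky_scaler.mean_)
print(scaler.mean_)
```

Calling `pipe.predict(X_test)` reuses the training-set mean and standard deviation to scale the test rows, which is what keeps the test data out of the fitting step.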

## Model Evaluation
Fitting a model to some data does not entail that it will predict well on unseen data. This needs to be directly evaluated. We have just seen the **train_test_split** helper that splits a dataset into train and test sets, but ```scikit-learn``` provides many other tools for model evaluation, in particular for **cross-validation**.

The example below briefly shows how to perform a 5-fold cross-validation procedure using the **cross_validate** helper. Note that it is also possible to manually iterate over the folds, use different data splitting strategies, and use custom scoring functions. For further details, please see the User Guide: https://scikit-learn.org/stable/user_guide.html

In [14]:
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate


X, y = make_regression(n_samples=1000, random_state=0)
lr = LinearRegression()

result = cross_validate(lr, X, y) # defaults to 5-fold cross-validation
result['test_score'] # the R^2 score is high because the dataset is easy


array([1., 1., 1., 1., 1.])

In [21]:
result

{'fit_time': array([0.03181028, 0.00324631, 0.0034883 , 0.00371099, 0.00341368]),
 'score_time': array([0.00043678, 0.00040627, 0.00041699, 0.00050569, 0.00046539]),
 'test_score': array([1., 1., 1., 1., 1.])}
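
This section also mentions manual iteration over folds, other splitting strategies, and other scoring functions. The sketch below shows one way these could look, assuming a shuffled `KFold` splitter and two built-in scorer names (`"r2"` and `"neg_mean_squared_error"`). These choices are illustrative and not taken from the original notes.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

X, y = make_regression(n_samples=1000, random_state=0)
lr = LinearRegression()

# A different splitting strategy: shuffle the rows before making 5 folds
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Several metrics at once; result keys become 'test_<scorer name>'
result = cross_validate(lr, X, y, cv=cv,
                        scoring=["r2", "neg_mean_squared_error"])
print(result["test_r2"])
print(result["test_neg_mean_squared_error"])

# Manual iteration over the same folds
manual_scores = []
for train_idx, test_idx in cv.split(X):
    lr.fit(X[train_idx], y[train_idx])
    # score() returns R^2 for regressors
    manual_scores.append(lr.score(X[test_idx], y[test_idx]))
print(manual_scores)
```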

## Automatic parameter search
All estimators have parameters (often called hyper-parameters in the literature) that can be tuned. The generalization power of an estimator often critically depends on a few parameters. For example, a **RandomForestRegressor** has an *n_estimators* parameter that determines the number of trees in the forest, and a *max_depth* parameter that determines the maximum depth of each tree. Quite often, it is not clear what the exact values of these parameters should be, since they depend on the data at hand.

```Scikit-learn``` provides tools to automatically find the best parameter combinations (via cross-validation). In the following example, we randomly search over the parameter space of a random forest with a **RandomizedSearchCV** object. When the search is over, the RandomizedSearchCV behaves as a **RandomForestRegressor** that has been fitted with the best set of parameters.

In [44]:
from sklearn.datasets import fetch_california_housing
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split
from scipy.stats import randint

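#X,y = fetch_california_housing(return_X_y=True)
# Iris is used here instead; its targets are class labels (0, 1, 2), which the regressor treats as numbers
X,y = load_iris(return_X_y=True)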
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=0)

# Define the parameter space that will be searched over
param_distributions ={'n_estimators': randint(1,5), 
                     'max_depth': randint(5,10)}

# Now create a searchCV object and fit it to the data
search = RandomizedSearchCV(estimator=RandomForestRegressor(random_state=0), 
                           n_iter=5,
                           param_distributions=param_distributions,
                           random_state=0)

search

RandomizedSearchCV(estimator=RandomForestRegressor(random_state=0), n_iter=5,
                   param_distributions={'max_depth': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7f6582c29d60>,
                                        'n_estimators': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7f659b195940>},
                   random_state=0)

In [45]:
print(X)
search.fit(X_train, y_train)

[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 ...

RandomizedSearchCV(estimator=RandomForestRegressor(random_state=0), n_iter=5,
                   param_distributions={'max_depth': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7f6582c29d60>,
                                        'n_estimators': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7f659b195940>},
                   random_state=0)

In [46]:
search.best_params_

{'max_depth': 9, 'n_estimators': 4}

In [55]:
### The search object now acts like a normal random forest estimator, with max_depth=9 and n_estimators=4
### score() returns R^2 here because the underlying estimator is a regressor
#print(X_test,y_test)
search.score(X_test, y_test)

0.9710365853658537
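
Since the Iris targets are class labels, a classifier may be the more natural estimator to tune. The sketch below swaps in `RandomForestClassifier` (a substitution made here, not the original notebook's choice) and also reads the refitted model from `best_estimator_`.

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Same search space as in the regressor example
param_distributions = {"n_estimators": randint(1, 5),
                       "max_depth": randint(5, 10)}

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    n_iter=5,
    param_distributions=param_distributions,
    random_state=0,
)
search.fit(X_train, y_train)

print(search.best_params_)
# The model refitted on the whole training set with the best parameters
print(search.best_estimator_)
# For a classifier, score() returns mean accuracy rather than R^2
print(search.score(X_test, y_test))
```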

In [54]:
### For comparison: the pipeline fitted earlier (StandardScaler + LogisticRegression), evaluated on the same test split
print(pipe.predict(X_test))
print(y_test)
accuracy_score(y_test, pipe.predict(X_test)) # signature is accuracy_score(y_true, y_pred)

[2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0
 2]
[2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0
 1]


0.9736842105263158

## Data leakage
References:
- https://www.kaggle.com/alexisbcook/data-leakage    