# Readings

1. Chapter 2 of syllabus book 1 covers the material in this notebook and the related notebooks. As a guideline for your reading, you may want to either:
    - Study the notebooks first, then read the related topics in Chapter 2 and follow the web links for more detail, or
    - Read Chapter 2 first to get a broad view, then study the notebooks and consult the web links as needed.
2. Try to practice the code in the Chapter 2 section of the book 1 GitHub repository.

# Revision

## Question 1

Suppose you have a dataset with 10 million data points. Select the one combination of learning strategies below that best fits your data size, and explain why. Assume the data quality is excellent!

a. Batch and instance-based learning.   
b. Batch and model-based learning.     
c. On-line and instance-based learning.      
d. On-line and model-based learning.      

## Specific Questions

1. What is [multicollinearity](https://en.wikipedia.org/wiki/Multicollinearity)? How does this problem affect modeling?

2. What is the difference between prediction and classification?

## General Questions (See the book, lectures, and discussions)

1. What are the advantages and disadvantages of using rule-based versus ML algorithms?
2. What are the advantages and disadvantages of learning on the fly versus with/without model strategies?
3. What are the challenges of unsupervised and supervised learning algorithms?
4. How do you know whether a model is over-fitted or under-fitted?
    - How do you deal with these two problems?
5. What is the no free lunch (NFL) theorem?
6. What are the roles of training, validating, and testing a model?

# General Guide

In this Jupyter notebook:
1. Some items in the table of contents are described entirely in this notebook. Others are just links to other Jupyter notebooks located in the same folder as this notebook; click a link to open the related notebook in a new window.
2. We use some hyperlinks for citation purposes <b>only</b>, and reading them is optional. Students are encouraged to read the links cited as <b>Extra Readings</b>.
3. Students should aim to understand, not memorize, the concepts in Chapter 2, since they are used throughout the whole course. The same applies to the code in the Jupyter notebooks: understand it rather than memorize it.

# Objectives

To understand the following concepts:
1. The differences among data splitting techniques.
2. Feature engineering main steps.
3. How to implement Grid/Random search in sklearn.

# Copyright

## Copyright holder

Most of the content of this lecture is taken or modified from the book:
"Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to
Build Intelligent Systems"-- [link](https://www.amazon.com/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1492032646). Other content is either developed by the instructor or taken from web links that are cited in context.

## The Course materials copyright

The Content is made available only for your personal, noncommercial educational, and scholarly use. You may not use the Content for any other purpose, or distribute, post or make the Content available to others unless you obtain any required permission from the copyright holder. Some Content may be provided via streaming or other means that restrict copying; you may not circumvent those restrictions. You may not alter or remove any copyright or other proprietary notices included in the Content. You need to obtain permission from the above copyright holder to use the materials, and you should also check the complete Course Materials Copyright section on the course Canvas website.

# The conceptual view of the machine learning pipeline

<img style="float: center" src="./images/pipeline_sa.png" alt="drawing" height="300" width="600"/>

# Data splitting techniques

## [Hold-out cross-validation][1]

Split the data randomly into training, validation/development, and test/unseen parts, usually 60%, 20%, and 20% respectively (other ratios are also used):
1. Training set: the data used to build/fit the model.
2. Validation/development set: the data used to evaluate the learning algorithm under different configurations. It is called the development set because we use it while developing our model. Because we pick the configuration that scores best on it, its performance estimate can be somewhat biased, which is why we need the third kind of data set.
3. Test/unseen set: the data used to check the accuracy of the final model and obtain unbiased results.

### Hold-out cross-validation: notes
1. It is a simple algorithm, but its single performance estimate depends on one particular random split, so we cannot use it to demonstrate model generalization.
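
The 60%/20%/20% split described above can be sketched with two calls to `train_test_split`. This is a minimal illustration, not the notebook's own code: it assumes scikit-learn's bundled Iris data (`load_iris`) instead of the CSV file used later, and the `_demo` variable names and the `stratify` option are choices made only for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: the bundled Iris data stands in for the course data set.
X_demo, y_demo = load_iris(return_X_y=True)

# First cut: hold back 20% of the data as the test/unseen set.
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo)

# Second cut: 25% of the remaining 80% equals 20% of the full data set,
# which leaves 60% of the full data set for training.
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0, stratify=y_rest)

print(len(X_train_demo), len(X_val_demo), len(X_test_demo))  # 90 30 30
```

Model configurations would be compared on the validation part, and the test part would be touched only once, for the final model.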

## [K-fold cross-validation][2]

We usually split the data into K=10 folds (each fold holds 1/K of the data) such that:
1. For each fold i, where i = 1, 2, ..., K:
    1. Train the learner on all folds except fold i.
    2. Use fold i to test the model trained in step (1.1) and record the performance result.
2. Average the performance results over the K iterations of step (1).

### K-fold cross-validation: notes
1. It is an in-place and computationally feasible algorithm, and we can use it to estimate how well the model generalizes to the population.

<img style="float: center" src="./images/kfold-cross-validation.jpg" alt="drawing" height="300" width="500"/>
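
The K-fold steps above can be written out as an explicit loop. The sketch below is only illustrative: it assumes the bundled Iris data and a `LogisticRegression` learner with `max_iter=1000`, neither of which is prescribed by the procedure itself.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Assumption: Iris data and logistic regression are stand-ins for any data/learner.
X_demo, y_demo = load_iris(return_X_y=True)
kf = KFold(n_splits=10, shuffle=True, random_state=0)

fold_scores = []
for train_idx, test_idx in kf.split(X_demo):
    # Step 1.1: train on all folds except the current one.
    model = LogisticRegression(max_iter=1000)
    model.fit(X_demo[train_idx], y_demo[train_idx])
    # Step 1.2: test on the held-out fold and record the accuracy.
    fold_scores.append(model.score(X_demo[test_idx], y_demo[test_idx]))

# Step 2: average the results over the K iterations.
print("Mean accuracy over %d folds: %.3f" % (len(fold_scores), np.mean(fold_scores)))
```

In practice `cross_val_score` wraps this same loop in one call; the explicit version is shown only to mirror the numbered steps.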


## [Bootstrapping cross-validation][3]


1. Choose the number of bootstrap samples to draw (usually 100, 200, ..., or 1000 repetitions).
2. Choose a sample size (usually equal to the size of the original data set).
3. For each bootstrap sample:
   1. Randomly draw a sample with replacement (in-the-bag training sample) with the chosen size:
       - While the size of the sample is less than the chosen size:
            - Randomly select an observation from the dataset
            - Add it to the sample (i.e., In-the-bag training sample).
   2. Fit a model on the data sample
   3. Estimate the model performance on the remaining unselected observations (the out-of-bag sample).
4. Calculate the average of the model performance results in all bootstrap samples in step (3).
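
A minimal sketch of the bootstrap procedure above, under stated assumptions: the bundled Iris data, a `LogisticRegression` learner, 100 repetitions, and NumPy's `default_rng` for the sampling are all choices made for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Assumption: Iris data and logistic regression are stand-ins for any data/learner.
X_demo, y_demo = load_iris(return_X_y=True)
n = len(X_demo)
rng = np.random.default_rng(0)
n_bootstraps = 100  # step 1: number of bootstrap samples

oob_scores = []
for _ in range(n_bootstraps):
    # Step 3.1: draw n row indices with replacement (in-the-bag sample).
    in_bag = rng.integers(0, n, size=n)
    # Rows that were never drawn form the out-of-bag sample.
    out_of_bag = np.setdiff1d(np.arange(n), in_bag)
    # Step 3.2: fit a model on the in-the-bag sample.
    model = LogisticRegression(max_iter=1000)
    model.fit(X_demo[in_bag], y_demo[in_bag])
    # Step 3.3: estimate performance on the out-of-bag sample.
    oob_scores.append(model.score(X_demo[out_of_bag], y_demo[out_of_bag]))

# Step 4: average the performance over all bootstrap samples.
print("Mean out-of-bag accuracy: %.3f" % np.mean(oob_scores))
```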


### Bootstrapping: notes
1. We use the [0.632 bootstrap rule][4]: on average, the in-the-bag sample contains 63.2% of the distinct observations and the out-of-bag sample contains the remaining 36.8%.
2. It is a computationally intensive, out-of-place algorithm, and we can use it to estimate model generalization on the simulated population.
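
The 63.2% figure follows from the sampling itself: when n rows are drawn with replacement from n observations, a given observation is missed by every draw with probability (1 - 1/n)^n, which approaches 1/e (about 0.368) as n grows. The short check below illustrates this numerically; the helper name `expected_in_bag_fraction` is hypothetical and exists only for this example.

```python
import math

def expected_in_bag_fraction(n):
    """Expected share of distinct observations in a bootstrap sample of size n."""
    # Probability that one observation is missed by all n draws: (1 - 1/n) ** n
    return 1 - (1 - 1 / n) ** n

for n in (10, 150, 10_000):
    print(n, round(expected_in_bag_fraction(n), 4))

# Limiting value as n grows: 1 - 1/e
limit = 1 - math.exp(-1)
print("limit:", round(limit, 4))
```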


<img style="float: center" src="./images/bootstrap-example.jpg" alt="drawing" height="100" width="300"/>


## [Repeated cross validation][5]

Repeated k-fold cross-validation provides a way to improve the estimate of a machine learning model's performance. It simply repeats the cross-validation procedure multiple times, with a different random split each time, and reports the mean result across all folds from all runs.

[1]: https://www.mff.cuni.cz/veda/konference/wds/proc/pdf10/WDS10_105_i1_Reitermanova.pdf
[2]: https://en.wikipedia.org/wiki/Cross-validation_(statistics)
[3]:https://machinelearningmastery.com/a-gentle-introduction-to-the-bootstrap-method/
[4]:http://rasbt.github.io/mlxtend/user_guide/evaluate/bootstrap_point632_score/
[5]:https://machinelearningmastery.com/repeated-k-fold-cross-validation-with-python/

## Cross-validation code examples

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline 

np.random.seed(0)  # For reproducibility

In [None]:
iris = pd.read_csv("~/DATA/Iris.csv")  # Load the data
X = iris[iris.columns[1:5]]
y = iris[iris.columns[5]]

In [None]:
X.shape,y.shape

### Holdout cross-validation code

In [None]:
#Hold-out cross-validation

from sklearn.model_selection import train_test_split,KFold
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,shuffle=True)
print (X_train.shape, y_train.shape)
print (X_test.shape, y_test.shape)

In [None]:
X_train

### Kfold cross-validation code

In [None]:
#2-fold cross-validation (K=2 keeps the printed indices short; K=10 is the usual choice)

cv = KFold(n_splits=2, random_state=42, shuffle=True)
for train_index, test_index in cv.split(X):
    print("Train Index: ", train_index, "\n")
    print("Test Index: ", test_index)

### Bootstrapping cross-validation code

In [None]:
# Bootstrapping cross-validation

from sklearn.utils import resample

Index=range(0,len(X))


boot = resample(Index, replace=True, n_samples=len(Index), random_state=1)
print(len(X),len(set(boot)))
print(len(set(boot))/len(X)*100.0)


print('\nBootstrap Sample: %s' % boot)
# out of bag observations
oob = [x for x in Index if x not in boot]
print('OOB Sample: %s' % oob)
print(len(oob)/len(X)*100.0) # out of bag 
print(len(oob),len(boot))

### [Example to show the difference between repeated Kfold and Kfold](https://machinelearningmastery.com/repeated-k-fold-cross-validation-with-python/)


In [None]:
# test classification dataset
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# summarize the dataset
print(X.shape, y.shape)

In [None]:
# evaluate a logistic regression model using k-fold cross-validation
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
# create dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# prepare the cross-validation procedure
cv = KFold(n_splits=10, random_state=1, shuffle=True)
# create model
model = LogisticRegression()
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

In [None]:
# evaluate a logistic regression model using repeated k-fold cross-validation
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# prepare the cross-validation procedure
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# create model
model = LogisticRegression()
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

### Exercise 
**When to use:**
1. Hold-out
2. K-fold
3. Bootstrapping

# <a href="./feature engineering-New.ipynb">Feature engineering</a>

# Pipeline code

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline 
from sklearn.preprocessing import MinMaxScaler,StandardScaler 
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

np.random.seed(0)  # For reproducibility

In [None]:
df= pd.read_csv("~/DATA/adult.csv",na_values='?')  # Load the data

In [None]:
df.shape

In [None]:
df.dtypes

In [None]:
cols=df.columns

In [None]:
num_cols = df.select_dtypes(include='number').columns  # public API instead of the private _get_numeric_data()

In [None]:
num_cols

In [None]:
num_cols[2:]

In [None]:
num_cols

In [None]:
cat_cols=list(set(cols) - set(num_cols))

In [None]:
cat_cols

In [None]:
df[df.columns[-1]].value_counts()

In [None]:
mydict={"<=50K":1,">50K":0}

In [None]:
df.replace({df.columns[-1]:mydict},inplace=True)

In [None]:
y=df[df.columns[-1]]

In [None]:
y

In [None]:
X=df.iloc[:,:-1]

In [None]:
X

In [None]:
#Hold-out cross-validation

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,shuffle=True)
print (X_train.shape, y_train.shape)
print (X_test.shape, y_test.shape)

In [None]:
myTrainCa = [value for value in cat_cols if value in X.columns]

In [None]:
standard_transformer = Pipeline(steps=[
        ('standard', StandardScaler())])

minmax_transformer = Pipeline(steps=[
        ('minmax', MinMaxScaler())])

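onehot_transformer=Pipeline(steps=[
        # handle_unknown='ignore' avoids an error when the test set holds a category unseen in training
        ('onehot', OneHotEncoder(handle_unknown='ignore'))])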


preprocessor = ColumnTransformer(
        remainder='passthrough',  # pass through features not listed
        transformers=[
            ('std', standard_transformer , num_cols[1:3]),
            ('mm', minmax_transformer , num_cols[3:]),
            ('ohe', onehot_transformer , myTrainCa),
        ])

In [None]:
help(LogisticRegression)

In [None]:
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression())])


clf.fit(X_train, y_train)
print("The model accuracy score: %.3f" % clf.score(X_test, y_test))
y_preds = clf.predict(X_test)
print(y_preds)
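
The pipeline above is scored on a single hold-out split. Because a `Pipeline` refits its preprocessing inside each training fold, it can also be passed directly to `cross_val_score`. The sketch below shows this on a small synthetic table; the `toy` data, its column names, and the 5-fold setting are assumptions made for illustration and are not part of the Adult data set.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumption: a small synthetic table with two numeric and one categorical column.
rng = np.random.default_rng(0)
n_rows = 200
toy = pd.DataFrame({
    "age": rng.integers(18, 70, size=n_rows),
    "hours": rng.integers(10, 60, size=n_rows),
    "sector": rng.choice(["private", "public", "self"], size=n_rows),
})
# Hypothetical binary target loosely driven by the "hours" column.
toy_y = (toy["hours"] + rng.normal(0, 5, size=n_rows) > 35).astype(int)

toy_pre = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["age", "hours"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["sector"]),
])
toy_clf = Pipeline(steps=[("preprocessor", toy_pre),
                          ("classifier", LogisticRegression(max_iter=1000))])

# The scaler and encoder are fitted on the training folds only in each iteration.
toy_cv = KFold(n_splits=5, shuffle=True, random_state=0)
toy_scores = cross_val_score(toy_clf, toy, toy_y, scoring="accuracy", cv=toy_cv)
print("Accuracy: %.3f (%.3f)" % (toy_scores.mean(), toy_scores.std()))
```

Keeping the preprocessing inside the pipeline means no statistics from a held-out fold leak into the scaler or encoder.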


# <a href="./hyper-parameter optimization.ipynb">Hyper-parameter optimization</a>