# Chapter 1

### Model Validation

- determines MODEL PERFORMANCE
- Way to determine that the model is performing as expected
- common way : Accuracy on unseen data (test data) is close to accuracy on seen data (train data)
- Goal : Choose right model, right parameters, right accuracy metrics, high accuracy on new data
- Split dataset : split the dataset 80-20. Train with 80% and test with 20%
- validation dataset : Split the training part again into training and validation sets with a 75-25 split (60-20-20 overall)
- tune hyperparameters : You need validation dataset for tuning hyperparameters
- Accuracy metrics for assessing model performance:
    - Regression :
        - Good rule of thumb : MAE is in the same units as y, so if y is expressed as a percentage the metric also reads as a percentage
        - Mean absolute error (MAE) : Treats all errors equally (each error counts in proportion to its size)
        - Mean squared error (MSE) : Penalizes large errors more heavily, because each error is squared
    - Classification: 
        - Measured from confusion matrix : `cm[<true_category_index>, <predicted_category_index>]`
        - precision : True positives out of all predicted positive values
            - Used when you do not want to over-predict positive values, i.e. false positives are costly (e.g. you need to be sure before telling a patient that a cancer test is positive)
        - recall : True positives out of all real positive values
            - Used when you cannot afford to miss any positive values, i.e. false negatives are costly (e.g. you must not miss a patient who actually has cancer)
        - accuracy: overall ability of model to predict correct class
        - Specificity : True negatives out of all real negative values
        - F1-score : harmonic mean of precision and recall
- Cross validation : 
    - reduces the bias in the result that comes from relying on a single random split
    - splits the training data several times into training and validation folds, so the score is averaged over all folds
    - LOOCV : trains with n-1 data points and validates with only 1 data point, repeated n times (generally used for small datasets; computationally expensive). During cross validation, put `cv=X.shape[0]`
- Overfitting : 
    - accuracy in test data is much lower than accuracy in train data (low bias, high variance)
    - model pays too much attention to the details of training data and also learns noise 
    - model is too complex; more data may be needed
- Underfitting : 
    - accuracy in both train and test data is low since the model is too simple to capture the pattern (High bias, low variance)
    - model fails to find the relationship between the features and the target value
    - model is too simple
- BIAS VARIANCE TRADEOFF : 
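    - variance : model pays too much attention to the details of training data and also learns noise (overfitting)
    - bias : model fails to learn the details (underfitting)
    - tradeoff : making the model more complex lowers bias but raises variance, and vice versa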
- Regression model : Target is a continuous value
- Classification model : Target is a discrete value / category
- hyper-parameter tuning : 
    - 2 types of parameters.
    - model parameters : learned during training, so they do not exist before the model is trained and you cannot set them (eg: coefficients of linear regression)
    - hyper-parameters : set manually before training; try different values to see which produce the best result (eg: maximum depth of a tree)
    - GridSearch : Brute force over every combination of the candidate hyper-parameter values
    - Random Search : randomly samples combinations from the candidate hyper-parameter values
    - Bayesian Search : uses the results of past runs at each step to choose the hyper-parameters for the next run
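
The classification metrics listed above can be checked by hand from a confusion matrix. This is a minimal sketch on made-up labels (the arrays `y_true` and `y_pred` are invented for illustration), using the same `cm[<true>, <predicted>]` indexing as in the notes:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Made-up labels for illustration: 1 = positive, 0 = negative
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm[0, 0], cm[0, 1], cm[1, 0], cm[1, 1]

precision = tp / (tp + fp)      # true positives out of all predicted positives
recall = tp / (tp + fn)         # true positives out of all real positives
specificity = tn / (tn + fp)    # true negatives out of all real negatives
accuracy = (tp + tn) / cm.sum() # correct predictions out of all predictions
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, specificity, accuracy, f1)
```

The hand-computed values should agree with `precision_score`, `recall_score` and `f1_score` from `sklearn.metrics`.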

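The 80-20 split followed by a 75-25 split of the training part can be sketched with two calls to `train_test_split`. The data here is synthetic and only serves to show the resulting sizes; the names `X_temp` and `X_val` are chosen for this example:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data with 100 samples, only to show the split sizes
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# Step 1: hold out 20% of the data as the test set
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Step 2: split the remaining 80% into 75% train and 25% validation
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

Hyperparameters are then tuned against the validation set, and the test set is touched only once at the end.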

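Overfitting and underfitting can be seen by comparing train and test accuracy for a very shallow and a very deep decision tree. This is only an illustrative sketch on synthetic data from `make_classification`; the dataset settings and the labels "too simple" / "too complex" are assumptions made for this example:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy data: flip_y=0.1 assigns random labels to about 10% of the samples
X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           flip_y=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

results = {}
for name, depth in [("too simple", 1), ("too complex", None)]:
    # max_depth=None lets the tree grow until it memorizes the training data
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_train, y_train)
    results[name] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(f"{name}: train={results[name][0]:.2f}, test={results[name][1]:.2f}")
```

A large gap between train and test accuracy points to high variance; low accuracy on both points to high bias.
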
# Chapter 3

### k-fold cross validation

```
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, cross_val_predict, KFold
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.metrics import mean_absolute_error, make_scorer

kf = KFold(n_splits=5, shuffle=True, random_state=42)
# splits = list(kf.split(X))  # see the (train_index, test_index) pairs of each fold
# MAE results for 5-fold cross validation score
mae_scorer = make_scorer(mean_absolute_error)
scores = cross_val_score(your_model, X, y, cv=kf, scoring=mae_scorer)  # an array with one MAE per fold
avg_score = np.mean(scores)
# predicted_y results for 5-fold cross validation prediction
predicted_y = cross_val_predict(your_model, X, y, cv=5) # one out-of-fold prediction per sample
avg_predicted_y = np.mean(predicted_y)

# example of ridge regression with grid search with k-fold cross validation
# 10 alpha values evenly spaced between 0.0001 and 1
param_grid = {"alpha": np.linspace(0.0001, 1, 10), "solver": ["sag", "lsqr"]}
ridge = Ridge()
ridge_cv = GridSearchCV(ridge, param_grid, cv=kf)
ridge_cv2 = RandomizedSearchCV(ridge, param_grid, cv=kf, n_iter=2)  # tries only 2 random combinations
ridge_cv.fit(X_train, y_train)  # X_train, y_train come from train_test_split
```
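
LOOCV from Chapter 1 fits the same `cross_val_score` pattern. This is a minimal sketch on synthetic data, assuming a `Ridge` regressor as the model. It shows two ways of asking for leave-one-out: the `LeaveOneOut` splitter, or an integer `cv` equal to the number of samples:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Small synthetic regression problem (LOOCV is mainly practical for small datasets)
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=30)

model = Ridge(alpha=1.0)

# Option 1: explicit leave-one-out splitter
scores_loo = cross_val_score(model, X, y, cv=LeaveOneOut(), scoring="neg_mean_absolute_error")
# Option 2: as many folds as samples, so each fold holds exactly one point
scores_n = cross_val_score(model, X, y, cv=X.shape[0], scoring="neg_mean_absolute_error")

# scikit-learn reports errors as negative scores, so flip the sign to get MAE
loo_mae = -scores_loo.mean()
print(len(scores_loo), loo_mae)
```

Each of the n fits trains on n-1 points, which is why the cost grows quickly with the dataset size.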

# Chapter 4

### Hyperparameter tuning

```
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.metrics import mean_squared_error as MSE
from sklearn.metrics import accuracy_score, mean_absolute_error, make_scorer

# Split data into 70% train and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.3, random_state= 42)
# Instantiate individual classifiers
lr = LogisticRegression(random_state=42)
knn = KNN()
dt = DecisionTreeClassifier(random_state=42,max_depth=4, min_samples_leaf=0.16)
classifiers = [('Logistic Regression', lr),
                ('K Nearest Neighbours', knn),
                ('Classification Tree', dt)]

# Instantiate an ensemble VotingClassifier
from sklearn.ensemble import VotingClassifier
ensemble_model = VotingClassifier(estimators=classifiers)

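# Instantiate an ensemble VotingRegressor
# regressors is a list of (name, regressor) tuples, built like classifiers above
from sklearn.ensemble import VotingRegressor
ensemble_model = VotingRegressor(estimators=regressors)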

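# Instantiate an ensemble BaggingClassifier
# (scikit-learn >= 1.2 names the argument estimator; older versions use base_estimator)
from sklearn.ensemble import BaggingClassifier
ensemble_model = BaggingClassifier(estimator=dt, n_estimators=300, oob_score=True, n_jobs=-1)
ensemble_model.fit(X_train, y_train)  # oob_score_ exists only after fitting
oob_accuracy = ensemble_model.oob_score_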

# Instantiate an ensemble BaggingRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
base_regressor = DecisionTreeRegressor(max_depth=8, min_samples_leaf=0.13, random_state=3)
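# estimator was called base_estimator before scikit-learn 1.2
ensemble_model = BaggingRegressor(estimator=base_regressor, n_estimators=300, oob_score=True, n_jobs=-1)
ensemble_model.fit(X_train, y_train)  # oob_score_ exists only after fitting
oob_score = ensemble_model.oob_score_  # out-of-bag R^2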

# Instantiate an ensemble RandomForestRegressor
from sklearn.ensemble import RandomForestRegressor
ensemble_model = RandomForestRegressor(n_estimators=400, min_samples_leaf=0.12, random_state=42)

# Instantiate an ensemble RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
ensemble_model = RandomForestClassifier(n_estimators=400, random_state=42)

# Instantiate an ensemble AdaBoostClassifier
from sklearn.ensemble import AdaBoostClassifier
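# dt should be a weak learner here, e.g. DecisionTreeClassifier(max_depth=1)
# estimator was called base_estimator before scikit-learn 1.2
ensemble_model = AdaBoostClassifier(estimator=dt, n_estimators=100)
ensemble_model.fit(X_train, y_train)  # fit before asking for probabilities
y_pred_proba = ensemble_model.predict_proba(X_test)[:,1]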
# Evaluate testing roc_auc_score
from sklearn.metrics import roc_auc_score
adb_clf_roc_auc_score = roc_auc_score(y_test, y_pred_proba)

# Instantiate an ensemble GradientBoostingRegressor, (max_features=0.2, subsample=0.8) makes it stochastic gradient boosting
from sklearn.ensemble import GradientBoostingRegressor
ensemble_model = GradientBoostingRegressor(max_depth=1, subsample=0.8, max_features=0.2, n_estimators=300, random_state=42)

# Train using training set
ensemble_model.fit(X_train, y_train)
# Predict with test set
y_pred = ensemble_model.predict(X_test)
# Evaluate accuracy for classification
print(accuracy_score(y_test, y_pred))
# Evaluate RMSE for regression
rmse = MSE(y_test, y_pred)**(1/2)
# Visualize features importances
importances = pd.Series(ensemble_model.feature_importances_, index = X.columns)
sorted_importances = importances.sort_values()
sorted_importances.plot(kind='barh', color='lightgreen')
plt.show()

# Hyperparameter Tuning
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, KFold
# See what parameters can be tuned
your_dt_model.get_params()
kf = KFold(n_splits=5, shuffle=True, random_state=42)
params_dt = {
    'max_depth': [3, 4,5, 6],
    'min_samples_leaf': [0.04, 0.06, 0.08],
    'max_features': [0.2, 0.4,0.6, 0.8]
}
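# greater_is_better=False makes the search minimize MAE instead of maximizing it
mae_scorer = make_scorer(mean_absolute_error, greater_is_better=False)
model_cv = GridSearchCV(estimator=your_dt_model,
    param_grid=params_dt,
    cv=kf,
    scoring='neg_mean_squared_error', # or scoring=mae_scorer
    verbose=1,
    n_jobs=-1)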
model_cv.fit(X_train, y_train)
best_hyperparams = model_cv.best_params_ # Get the parameters with best result
best_model = model_cv.best_estimator_ # Get the best model
best_model.get_params() # See all parameters
y_pred = best_model.predict(X_test) # predict with best model
best_score = model_cv.best_score_ # best mean cross-validation score (negative MSE here)
model_cv.cv_results_ # See all information from dictionary
import joblib # sklearn.externals.joblib was removed in scikit-learn 0.23
joblib.dump(best_model, 'my_best_model.pkl') # Save the model in pkl file
```
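
The tuning steps above can be run end to end on synthetic data. This is a minimal sketch, assuming a `DecisionTreeRegressor` stands in for `your_dt_model` and `make_regression` stands in for the real dataset. The model is saved to a temporary directory so the example leaves no files behind:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the real dataset
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
params_dt = {"max_depth": [3, 4, 5, 6], "min_samples_leaf": [0.04, 0.06, 0.08]}

model_cv = GridSearchCV(
    estimator=DecisionTreeRegressor(random_state=42),
    param_grid=params_dt,
    cv=kf,
    scoring="neg_mean_squared_error",
)
model_cv.fit(X_train, y_train)

best_model = model_cv.best_estimator_       # refit on all of X_train with the best parameters
cv_rmse = (-model_cv.best_score_) ** 0.5    # best_score_ is negative MSE, so flip the sign

# Save the best model and load it back
model_path = os.path.join(tempfile.mkdtemp(), "my_best_model.pkl")
joblib.dump(best_model, model_path)
reloaded = joblib.load(model_path)

print(model_cv.best_params_, cv_rmse)
```

Note that `best_score_` and `best_params_` live on the search object (`model_cv`), not on the estimator returned by `best_estimator_`.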