# Chapter 1

### Hyperparameter Tuning

- Finding the optimal combination of hyperparameters for a model
- parameters:
    - the ones that are set by the model itself while learning from the dataset
    - eg: coefficients of linear regression, node decisions (split feature and threshold) of decision trees
    - accessible by attribute after fitting (in the attribute section in the documentation)
- hyperparameters :
    - the ones that we have the option to set before training the model
    - print the estimator (or call get_params()) to see what it contains
    - accessible by parameter (in the parameter section in the documentation)
- Silly things to do (some examples):
    - Creating a random forest with just 2 or 3 trees
    - Using only 1 or 2 neighbors in the k-NN algorithm
    - Increasing a hyperparameter by a tiny amount and expecting a different result
    - Choosing conflicting hyperparameters (eg: in logistic regression the 'newton-cg', 'sag' and 'lbfgs' solvers support only the l2 penalty or no penalty)
- Visualize if the hyperparameter has any effect:
    - Graph of learning curve : hyperparameter on X-axis and accuracy on Y-axis
- Problem: So many models can be built. Among these, we need to find the one that yields the best result.
- Solution: Train with different sets of hyperparameters and compare the results to find the optimal model
- Rule of thumb : Cross validation is used to estimate the generalization performance.
- Curse of dimensionality : with exhaustive search, the number of combinations to train grows exponentially with the number of hyperparameters in the grid.
- Best practice : Do this only when you really need the optimal solution, since tuning does not turn a bad model into a good one.
- optimal hyperparameters = set of hyperparameters corresponding to the best CV score.
- Some algorithms:
    - Uninformed Search:
        - Grid Search : 
            - Find the result for every combination of the listed hyperparameter values
            - Guaranteed to find the best combination within the grid (values outside the grid are never tried)
            - time consuming process, resource intensive
        - Random Search : 
            - Randomly choose a number of combinations of given parameters 
            - A good result but may not be the absolute best
            - fast 
            - Idea : You are unlikely to keep completely missing the 'good area' for a long time when randomly picking new spots
    - Informed Search :
        - Coarse to Fine: (Hybrid of grid search and randomized search)
            1. Random search
            2. Find promising areas (Narrow Down)
            3. Grid search in the smaller area, or skip straight to step 4
            4. Continue from step 1 until optimal score is obtained
            - Idea : Narrow down the optimal area. The best result will be in that area.
        - Bayesian Optimization :
            - Estimating where the best result is likely to be from the outcomes of past trials
            - eg: Knowing that one event has occurred gives us a clearer idea of the probability of a related outcome
            - Idea : Getting better as we get more evidence.
        - Genetic Algorithms :
            1. We can create some models (that have hyperparameter settings)
            2. We can pick the best (by our scoring function). These are the ones that 'survive'
            3. We can create new models that are similar to the best ones
            4. We add in some randomness so we don't reach a local optimum
            5. Repeat until we are happy!
            - Idea : As in evolution, combining the best 'genes' across generations gradually moves the population towards the optimal individual
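
The five genetic-algorithm steps listed above can be sketched in plain Python. This is only a minimal illustration, not how TPOT or any other library implements it: `score` is a hypothetical stand-in for a cross-validated scoring function, and the two hyperparameters, their ranges and the mutation sizes are assumptions made up for the example.

```python
import random

def score(params):
    # Hypothetical scoring function (stand-in for a CV score); it peaks at
    # max_depth=6 and learning_rate=0.1.
    return -((params["max_depth"] - 6) ** 2) - 100 * (params["learning_rate"] - 0.1) ** 2

def random_individual(rng):
    # One "model" = one random hyperparameter setting.
    return {"max_depth": rng.randint(1, 12), "learning_rate": rng.uniform(0.01, 0.5)}

def mutate(parent, rng):
    # A new setting that is similar to a surviving one, with small random changes.
    child = dict(parent)
    child["max_depth"] = min(12, max(1, child["max_depth"] + rng.choice([-1, 0, 1])))
    child["learning_rate"] = min(0.5, max(0.01, child["learning_rate"] + rng.gauss(0, 0.02)))
    return child

def genetic_search(generations=20, population_size=10, n_survivors=3, seed=42):
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(population_size)]       # step 1
    history = []
    for _ in range(generations):                                                 # step 5
        survivors = sorted(population, key=score, reverse=True)[:n_survivors]   # step 2
        history.append(score(survivors[0]))
        children = [mutate(rng.choice(survivors), rng)
                    for _ in range(population_size - n_survivors - 1)]          # step 3
        # Step 4: one completely random newcomer keeps some exploration going.
        population = survivors + children + [random_individual(rng)]
    best = max(population, key=score)
    return best, history

best, history = genetic_search()
print("best setting:", best)
print("best score, first vs last generation:", history[0], history[-1])
```

Because the survivors are carried over unchanged, the best score can never get worse from one generation to the next; the random newcomer is what keeps the search from settling too early.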

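To check whether a hyperparameter has any effect (hyperparameter on the X-axis, accuracy on the Y-axis, as noted above), one option is scikit-learn's `validation_curve`. The sketch below is an illustration under assumptions: the synthetic dataset from `make_classification`, the k-NN model and the list of `n_neighbors` values are all chosen just for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, used only so the example is self-contained.
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# Hyperparameter values to compare (these go on the X-axis).
param_range = np.array([1, 3, 5, 11, 21, 41])

# One row per hyperparameter value, one column per CV fold.
train_scores, test_scores = validation_curve(
    KNeighborsClassifier(), X, y,
    param_name="n_neighbors", param_range=param_range,
    cv=5, scoring="accuracy",
)

# Mean cross-validated accuracy per value (this goes on the Y-axis).
mean_test = test_scores.mean(axis=1)
for k, acc in zip(param_range, mean_test):
    print(f"n_neighbors={k:>2}  mean CV accuracy={acc:.3f}")
```

Plotting `param_range` against `mean_test` (for example with `plt.plot`) gives the curve; a flat line suggests the hyperparameter is not worth tuning over that range.
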
### Ensemble

```
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.metrics import accuracy_score, mean_squared_error, make_scorer, roc_auc_score

# Split data into 70% train and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.3, random_state= 42)
# Instantiate individual classifiers
lr = LogisticRegression(random_state=42)
knn = KNN()
dt = DecisionTreeClassifier(random_state=42,max_depth=4, min_samples_leaf=0.16)
classifiers = [('Logistic Regression', lr),
                ('K Nearest Neighbours', knn),
                ('Classification Tree', dt)]

# Import ensemble classifiers and regressors
from sklearn.ensemble import (VotingClassifier, VotingRegressor, BaggingClassifier, BaggingRegressor,
                              RandomForestClassifier, RandomForestRegressor, AdaBoostClassifier,
                              AdaBoostRegressor, GradientBoostingRegressor, GradientBoostingClassifier)
from sklearn.tree import DecisionTreeRegressor
# In each pair below keep ONE line: the classifier for classification or the regressor for regression
# Voting Ensemble (`regressors` is assumed to be a list of (name, regressor) tuples, like `classifiers`)
ensemble_model_voting = VotingClassifier(estimators=classifiers) # add voting='soft' if you need predict_proba
ensemble_model_voting = VotingRegressor(estimators=regressors)
# Bagging Ensemble (scikit-learn >= 1.2 calls this argument `estimator`; older versions use `base_estimator`)
ensemble_model_bagging = BaggingClassifier(estimator=dt, n_estimators=300, oob_score=True, n_jobs=-1)
ensemble_model_bagging = BaggingRegressor(estimator=DecisionTreeRegressor(max_depth=4, min_samples_leaf=0.16, random_state=42), n_estimators=300, oob_score=True, n_jobs=-1)
oob_score = ensemble_model_bagging.oob_score_ # only available after fit()
# Random Forest Ensemble
ensemble_model_randomforest = RandomForestRegressor(n_estimators=400, min_samples_leaf=0.12, random_state=42)
ensemble_model_randomforest = RandomForestClassifier(n_estimators=400, random_state=42)
# Adaboost Ensemble (the weak learner is a decision stump with max_depth=1)
ensemble_model_adaboost = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1, random_state=42), n_estimators=100)
ensemble_model_adaboost = AdaBoostRegressor(estimator=DecisionTreeRegressor(max_depth=1, random_state=42), n_estimators=100)
# Gradient Boost Ensemble
# subsample=0.8 trains each tree on a random 80% of the rows and max_features=0.2 considers a random 20% of the features per split; this randomness makes it stochastic gradient boosting
ensemble_model_gboost = GradientBoostingRegressor(max_depth=1, subsample=0.8, max_features=0.2, n_estimators=300, random_state=42)
ensemble_model_gboost = GradientBoostingClassifier(max_depth=1, subsample=0.8, max_features=0.2, n_estimators=300, random_state=42)

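# Choose the ensemble to evaluate (any of the models defined above)
ensemble_model = ensemble_model_voting
# Train using training set
ensemble_model.fit(X_train, y_train)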
# Predict with test set
y_pred = ensemble_model.predict(X_test)
# Evaluate accuracy and ROC AUC for classification
print(accuracy_score(y_test, y_pred))
y_pred_proba = ensemble_model.predict_proba(X_test)[:,1]
clf_roc_auc_score = roc_auc_score(y_test, y_pred_proba)
# Evaluate RMSE for regression
rmse = mean_squared_error(y_test, y_pred)**(1/2)
# Visualize feature importances (only tree-based ensembles expose feature_importances_; X is assumed to be a DataFrame)
importances = pd.Series(ensemble_model.feature_importances_, index = X.columns)
sorted_importances = importances.sort_values()
sorted_importances.plot(kind='barh', color='lightgreen')
plt.show()

```

### Decision Tree

```
# Split into train and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=3)

# Make sure to take into account the class imbalance
from sklearn.utils.class_weight import compute_sample_weight
w_train = compute_sample_weight('balanced', y_train)

# Train the classifier
from sklearn.tree import DecisionTreeClassifier
tree_clf = DecisionTreeClassifier(criterion="entropy", max_depth=4)
tree_clf.fit(X_train, y_train, sample_weight=w_train)

# Alternative approach : Train the classifier with snapml (offers multi-threaded CPU/GPU training)
from snapml import DecisionTreeClassifier as SnapDecisionTreeClassifier # alias avoids shadowing sklearn's class
snapml_dt_gpu = SnapDecisionTreeClassifier(max_depth=4, random_state=45, use_gpu=True)
snapml_dt_cpu = SnapDecisionTreeClassifier(max_depth=4, random_state=45, n_jobs=4)
snapml_dt_cpu.fit(X_train, y_train, sample_weight=w_train)
# Predict
y_pred = tree_clf.predict(X_test)

### Inspecting a random forest
# Pull out one tree from the forest (If decision tree is a random forest)
chosen_tree = randomforest_model.estimators_[7] # You can visualize it with (graphviz & pydotplus)
# Extract node decisions
split_column = chosen_tree.tree_.feature[0] # Index of the column used by the root node (the first split)
split_column_name = X_train.columns[split_column] # Name of the column
split_value = chosen_tree.tree_.threshold[0] # Threshold value of that same root split

# Compute predicted probabilities
y_pred_prob = tree_clf.predict_proba(X_test)[:,1]

# Evaluate tree
from sklearn.metrics import roc_auc_score, accuracy_score
accuracy_score(y_test, y_pred)
roc_auc_score(y_test, y_pred_prob) # ROC AUC needs probabilities or scores, not hard class labels

# Visualize the graph using plot_tree
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
plt.figure(figsize=(20, 10))
plot_tree(chosen_tree, feature_names=list(X_train.columns), filled=True, rounded=True, fontsize=10)
plt.show()
```

# Chapter 2

### Grid search, Random Search, Coarse Search, Bayes Optimization

```
# Hyperparameter Tuning
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, KFold
from sklearn.metrics import accuracy_score, mean_absolute_error, make_scorer
# See what parameters can be tuned
model.get_params()
kf = KFold(n_splits=5, shuffle=True, random_state=42)
custom_params = {
    'max_depth': [3, 4,5, 6],
    'min_samples_leaf': [0.04, 0.06, 0.08],
    'max_features': [0.2, 0.4,0.6, 0.8]
}
# MAE is an error, so greater_is_better=False makes the search minimize it (scores are then reported negated)
mae_scorer = make_scorer(mean_absolute_error, greater_is_better=False)
# Grid Search (exhaustive and deterministic, so it takes no random_state)
model_cv1 = GridSearchCV(estimator=model, param_grid=custom_params, cv=kf, scoring=mae_scorer, # or 'neg_mean_squared_error'
            verbose=1, n_jobs=-1, refit=True)
# Randomized Search
model_cv2 = RandomizedSearchCV(estimator=model, param_distributions=custom_params, n_iter=10,
            cv=kf, scoring=mae_scorer,  # Use custom scorer here
            verbose=1, n_jobs=-1, refit=True, random_state=42)
# Bayes Search (BayesSearchCV comes from the scikit-optimize package)
from skopt import BayesSearchCV
search_spaces = {
    'max_depth': (3, 6),
    'min_samples_leaf': (0.04, 0.08, 'uniform'),
    'max_features': (0.2, 0.8, 'uniform')
}
model_bayes_cv = BayesSearchCV(estimator=model, search_spaces=search_spaces, cv=kf,
    n_iter=50,  # Adjust the number of iterations as needed
    scoring=mae_scorer, verbose=1, n_jobs=-1, refit=True, random_state=42 )

# use TPOT for GENETIC SEARCH CV

model_cv = model_cv1 # pick one of the search objects defined above
model_cv.fit(X_train, y_train)
model_cv.cv_results_ # See all information from dictionary
best_hyperparams = model_cv.best_params_ # Get the parameters that produce best result
best_model = model_cv.best_estimator_  # Get the best model
best_model.get_params() # Get the parameters of the best model
y_pred = best_model.predict(X_test) # predict with best model
best_score = model_cv.best_score_ # Best mean CV score (an attribute of the search object, not of the estimator)

# Visualize contribution of parameter to get the optimal score (Scatterplot or kdeplot)
import pandas as pd
import matplotlib.pyplot as plt
results_df = pd.DataFrame({
    'Score': model_cv.cv_results_['mean_test_score'],
    'Parameter': model_cv.cv_results_['param_max_depth']  # keys look like 'param_<name>'; adjust the name as needed
})
plt.scatter(results_df['Parameter'].astype(float), results_df['Score'], s=100, alpha=0.5)
plt.show()

import joblib # sklearn.externals.joblib was removed from scikit-learn; use the standalone joblib package
joblib.dump(best_model, 'my_best_model.pkl') # Save the model in pkl file
```
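
The block above covers grid, random and Bayes search; the coarse-to-fine idea from Chapter 1 (random search first, then a grid search in the promising area) could be sketched as below. This is one possible reading of those steps, not a standard recipe: the synthetic data from `make_regression`, the `DecisionTreeRegressor`, the search ranges and the width of the fine grid are all assumptions made for the example.

```python
import numpy as np
from scipy.stats import randint, uniform
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, KFold, RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data so the example is self-contained.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
model = DecisionTreeRegressor(random_state=42)

# Step 1: coarse random search over a wide area.
coarse = RandomizedSearchCV(
    model,
    param_distributions={
        "max_depth": randint(2, 21),               # integers 2..20
        "min_samples_leaf": uniform(0.01, 0.19),   # floats in [0.01, 0.20]
    },
    n_iter=20, cv=kf, scoring="neg_mean_absolute_error", random_state=42,
)
coarse.fit(X, y)

# Step 2: narrow down to the neighbourhood of the best coarse result.
d = int(coarse.best_params_["max_depth"])
leaf = float(coarse.best_params_["min_samples_leaf"])
neighbours = np.clip([leaf - 0.02, leaf - 0.01, leaf + 0.01, leaf + 0.02], 0.005, 0.5).tolist()
fine_grid = {
    "max_depth": sorted({max(2, d - 1), d, d + 1}),
    "min_samples_leaf": sorted({leaf, *neighbours}),
}

# Step 3: exhaustive grid search inside the smaller area.
fine = GridSearchCV(model, fine_grid, cv=kf, scoring="neg_mean_absolute_error")
fine.fit(X, y)

print("coarse:", coarse.best_params_, coarse.best_score_)
print("fine:  ", fine.best_params_, fine.best_score_)
```

The fine grid contains the best coarse point itself and both searches use the same folds, so the fine score cannot be worse than the coarse one. Step 4 of the notes would repeat the process around `fine.best_params_`.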

# Chapter 3

### View a function's source code in Python

```
import inspect
print(inspect.getsource(func_name)) # func_name: any function written in Python (not a C built-in)
```

### Model Performance

```
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
r2_score(y_test, y_pred) # R-squared
mse = mean_squared_error(y_test, y_pred) # MSE
rmse = mean_squared_error(y_test, y_pred) ** 0.5 # RMSE (the `squared=False` argument is deprecated in recent scikit-learn)
mae = mean_absolute_error(y_test, y_pred) # MAE

# Classification Performance Measurement
from sklearn.metrics import classification_report, confusion_matrix, jaccard_score, log_loss, roc_auc_score, roc_curve, f1_score
confusion_matrix(y_test, y_pred) # Confusion matrix: counts of TN, FP, FN, TP
classification_report(y_test, y_pred) # Precision, recall, F1 and support per class
jaccard_score(y_test, y_pred,pos_label=0) # Jaccard score
log_loss(y_test, y_pred_prob) # log loss (needs predicted probabilities)
print(roc_auc_score(y_test, y_pred_prob)) # ROC AUC
print(f1_score(y_test, y_pred)) # F1 Score

# Visualize ROC Curve
import matplotlib.pyplot as plt
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Logistic Regression ROC Curve')
plt.show()

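# Grid-search example for hyperparameter tuning of a regression model (Ridge)
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, KFold
kf = KFold(n_splits=5, shuffle=True, random_state=42)
param_grid = {"alpha": np.linspace(0.0001, 1, 10), "solver": ["sag", "lsqr"]} # 10 evenly spaced alphas (np.arange with step 10 would yield only one)
ridge = Ridge()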
ridge_cv = GridSearchCV(ridge, param_grid, cv=kf)
ridge_cv2 = RandomizedSearchCV(ridge, param_grid, cv=kf, n_iter=2)
ridge_cv.fit(X_train, y_train)
print(ridge_cv.best_params_, ridge_cv.best_score_)

### Compare different models distribution
results = {"Model 1": model1_cv_results, "Model 2": model2_cv_results, "Model 3": model3_cv_results}
plt.boxplot(results.values(), labels=results.keys())
plt.show()

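# Leverage : measurement of how extreme the explanatory variable values are (`model` here is assumed to be a fitted statsmodels OLS result)
leverage = model.get_influence().hat_matrix_diag
# Externally studentized residuals : residuals scaled by their estimated standard deviation, useful for spotting outliers
studentized_residuals = model.get_influence().resid_studentized_external
# Influence : how much the model would change if you leave the observation out of the dataset when modeling. (eg : Cook's distance)
cooks_distance = model.get_influence().cooks_distance[0]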

# Residualplot for regression
residuals = y_test - y_pred
plt.scatter(y_pred, residuals, color='blue', alpha=0.6)
plt.axhline(y=0, color='red', linestyle='--', linewidth=2)
plt.xlabel('Fitted Values (Predicted)')
plt.ylabel('Residuals')
plt.show() # finish this figure so the Q-Q plot below is not drawn on top of it

# Q-Q plot for regression
from scipy.stats import probplot
probplot(np.ravel(residuals), dist='norm', plot=plt) # np.ravel works for both NumPy arrays and pandas Series
plt.xlabel('Theoretical Quantiles')
plt.ylabel('Sample Quantiles')
plt.show()

# Scale location plot
plt.scatter(y_pred, np.sqrt(np.abs(residuals)), color='blue', alpha=0.6)
plt.xlabel('Fitted Values (Predicted)')
plt.ylabel('Square Root of Absolute Residuals')
plt.axhline(y=np.mean(np.sqrt(np.abs(residuals))), color='red', linestyle='--', linewidth=2, label='Mean')
plt.legend()
plt.show()
```

# Chapter 4

### Bayes Rule

<center><img src="images/04.01.png"  style="width: 400px; height: 300px;"/></center>

- Statistical method that uses new evidence to iteratively update the probability of an outcome
- Formula : P(A|B) = P(B|A) P(A) / P(B)
- Left side of equation P(A|B)
    - Probability of event A given that event B has occurred (this is called the posterior)
    - Signifies the updated probability of A once new evidence / event B occurs
- Right side of equation:  (how it happens)
    - P(A) = Probability of event A before any evidence about B is taken into account (this is called the prior)
    - P(B) = marginal likelihood (probability of observing the new evidence B)
    - P(B|A) = The likelihood of observing B given that event A is true

- Example:
    - 5% of the population have a disease, P(D) = 0.05
    - 10% of the population are affected by a genetic condition, P(G) = 0.1
    - We have measured from clinical records that 20% of those who have the disease are also affected by the genetic condition, P(G|D) = 0.2
    - Question is, what is the probability that a person with the genetic condition has the disease, P(D|G) = ??
    - If we did not have data on the genetic condition, P(G), we could only conclude that anyone has a 5% chance of having the disease
    - Since we have more information, we can deduce that the likelihood of a person having the disease changes according to whether the person has the genetic condition or not.
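
The example above can be worked through numerically with Bayes' rule. This is a small sketch that uses the stated figures (5% disease, 10% genetic condition, and the 20% figure for P(G|D)); the helper name `posterior` is made up for the illustration.

```python
def posterior(likelihood, prior, evidence):
    # Bayes' rule: P(A|B) = P(B|A) * P(A) / P(B)
    return likelihood * prior / evidence

p_d = 0.05          # P(D): prior probability of the disease
p_g = 0.10          # P(G): probability of the genetic condition
p_g_given_d = 0.20  # P(G|D): share of diseased people who have the condition

# Probability of the disease for someone WITH the genetic condition.
p_d_given_g = posterior(p_g_given_d, p_d, p_g)

# Probability of the disease for someone WITHOUT the genetic condition,
# using P(not G|D) = 1 - P(G|D) and P(not G) = 1 - P(G).
p_d_given_not_g = posterior(1 - p_g_given_d, p_d, 1 - p_g)

print(f"P(D|G)     = {p_d_given_g:.3f}")      # 0.100
print(f"P(D|not G) = {p_d_given_not_g:.3f}")  # 0.044
```

Knowing that a person has the genetic condition doubles the estimate from the 5% prior to 10%, while knowing they do not have it lowers the estimate to about 4.4%.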

### Bayesian Optimization for hyperparameters

```
from hyperopt import fmin, tpe, hp, Trials
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import joblib # sklearn.externals.joblib was removed from scikit-learn; use the standalone joblib package
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

# Assuming `model`, `X_train`, `y_train`, and `kf` are defined earlier in your code

# Define the search space for hyperopt
search_space = {
    'max_depth': hp.quniform('max_depth', 3, 6, 1),
    'min_samples_leaf': hp.uniform('min_samples_leaf', 0.04, 0.08),
    'max_features': hp.uniform('max_features', 0.2, 0.8)
}

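# Define the objective function to minimize (mean cross-validated MAE)
def objective(params):
    params = dict(params)
    params['max_depth'] = int(params['max_depth'])  # hp.quniform returns floats, max_depth must be an int
    model.set_params(**params)
    scores = cross_val_score(model, X_train, y_train, cv=kf, scoring='neg_mean_absolute_error')
    return -np.mean(scores)  # the scorer returns negative MAE; negate it so hyperopt minimizes the MAE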

# Initialize Trials to store optimization results
trials = Trials()

# Use fmin from hyperopt to perform Bayesian optimization
best_hyperparams = fmin(
    fn=objective,
    space=search_space,
    algo=tpe.suggest,
    max_evals=50,  # Adjust the number of evaluations as needed
    trials=trials,
    verbose=1,
    rstate=np.random.default_rng(42)  # recent hyperopt versions expect a Generator; older ones expect np.random.RandomState(42)
)

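# Get the best hyperparameters (fmin returns quniform values as floats, so cast max_depth back to int)
best_bayes_hyperparams = {key: best_hyperparams[key] for key in search_space}
best_bayes_hyperparams['max_depth'] = int(best_bayes_hyperparams['max_depth'])

# Set the best hyperparameters to the model
model.set_params(**best_bayes_hyperparams)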

# Fit the model with the best hyperparameters
model.fit(X_train, y_train)

# Save the best model
joblib.dump(model, 'my_best_bayes_model.pkl')

# Access the results from hyperopt (each loss is the mean CV MAE of one trial)
hyperopt_results = pd.DataFrame({
    'Loss': [result['loss'] for result in trials.results],
    'Parameter': [trial['misc']['vals']['max_depth'][0] for trial in trials.trials]  # adjust the parameter name as needed
})

# Visualize contribution of parameter to get the optimal loss (Scatterplot or kdeplot)
plt.scatter(hyperopt_results['Parameter'], hyperopt_results['Loss'], s=100, alpha=0.5)
plt.show()

```

### TPOT

```
from tpot import TPOTClassifier
from sklearn.metrics import accuracy_score, make_scorer
import joblib  # standalone joblib for model persistence (sklearn.externals.joblib was removed from scikit-learn)
import matplotlib.pyplot as plt
import pandas as pd

# Define a custom scoring function using make_scorer
custom_scorer = make_scorer(accuracy_score)

# Define TPOT configuration for classification
tpot_config = {
    'sklearn.ensemble.RandomForestClassifier': {
        'n_estimators': [10, 50, 100],
        'max_depth': [3, 6],
        'min_samples_split': [2, 5],
        'min_samples_leaf': [1, 2, 4],
        'max_features': [0.2, 0.4, 0.6, 0.8]
    }
}

# Initialize TPOTClassifier with custom scorer
tpot_classifier = TPOTClassifier(
    generations=5,  # Adjust the number of generations as needed
    population_size=20,
    random_state=42,
    verbosity=2,
    config_dict=tpot_config,
    cv=kf,
    scoring=custom_scorer,  # Use custom scorer for classification
    n_jobs=-1,
    warm_start=True
)

# Fit TPOTClassifier
tpot_classifier.fit(X_train, y_train)

# Access TPOT results
tpot_results = tpot_classifier.evaluated_individuals_

# Save the best fitted TPOT pipeline to a .pkl file
best_tpot_pipeline = tpot_classifier.fitted_pipeline_
joblib.dump(best_tpot_pipeline, 'best_tpot_classifier_model.pkl')

# Export the best TPOT pipeline as a Python script
tpot_classifier.export('best_tpot_classifier_pipeline.py')

# Visualize the CV score of every evaluated pipeline (Scatterplot or kdeplot)
# evaluated_individuals_ is a dict: pipeline string -> dict with keys such as 'internal_cv_score'
results_df = pd.DataFrame({
    'Pipeline': list(tpot_results.keys()),
    'Accuracy': [info['internal_cv_score'] for info in tpot_results.values()]
})
plt.scatter(range(len(results_df)), results_df['Accuracy'], s=100, alpha=0.5)  # x-axis: pipeline index
plt.show()

```