# Chapter 1

### Decision Tree

<center><img src="images/01.01.png"  style="width: 400px, height: 300px;"/></center>

- Sequence of if-else questions about individual features
- Consists of a hierarchy of nodes. Each node asks a question or makes a prediction.
- Root node : No parent
- Internal node : Has parent, has children
- Leaf node : Has no children. It is where predictions are made
- Goal : Search for patterns that produce the purest possible leaves. Each leaf holds one dominant label.
- Information Gain : At each node, find the feature and split point that maximise information gain, i.e. the reduction in impurity from the parent node to its children. If no split gives a positive information gain (information gain = 0), the pattern is captured and the node becomes a leaf. Otherwise keep splitting recursively (the recursion can also be stopped early by specifying a maximum depth).
- Measure of impurity in a node:
    - Gini index: For classification
    - Entropy: For classification
    - MSE : For regression
- Captures non-linear relationships between features and labels / real values
- Does not require feature scaling
- At each split, only one feature is involved
- Decision region : Feature space where instances are assigned to a label / value
- Decision Boundary : Surface that separates different decision regions
- Steps of building a decision tree:
    1. Choose an attribute (column) of the dataset
    2. Calculate the significance of that attribute when splitting the data, using entropy. A good split leaves less entropy (disorder / randomness) in the resulting subsets.
    3. Find the attribute with the most significance and use it to split the data
    4. For each branch, repeat the process (recursive partitioning) to get the best information gain (the reduction in entropy achieved by the split).
- Limitations:
    - Can only produce orthogonal (axis-aligned) decision boundaries
    - Sensitive to small variations in the training set
    - High variance : an unconstrained tree tends to overfit the training set
- Solution : Ensemble learning
    - This is a joint modeling where many models come together to solve a single problem
    - Train different models on same dataset
    - Let each model make its prediction
    - Aggregate predictions of individual models 
    - One model's weakness is covered by another model's strength in that particular task
    - Final model is a combination of models that are skillful in different ways
    - Hard-voting : 
        - Ensemble method for classification that predicts the label receiving the majority of votes from the individual models
    - Bagging or Bootstrap aggregating (Sampling with replacement) : 
        - Ensemble method that trains the same algorithm on different bootstrap samples (drawn with replacement) of the training data
        - Base estimator : Decision tree, neural net, logistic regression etc
        - Reduces the variance of individual models (aggregating many models trained on different bootstrap samples smooths out their individual fluctuations)
        - OOB evaluation : on average about 37% of the training instances are not sampled by a given model (out-of-bag); use those unseen instances to evaluate it
        - Classification : Final prediction is obtained by majority voting
        - Regression : Final prediction is obtained by taking the mean
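    - Random Forest (Bootstrap samples plus random feature subsets):
        - base estimator : Decision tree
        - each tree is trained on a bootstrap sample (drawn with replacement, as in bagging); further randomization comes from considering only a random subset of features, sampled without replacement, at each split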
        - Classification : Final prediction is obtained by majority voting
        - Regression : Final prediction is obtained by taking the mean
    - Boosting:
        - Combine weak learners (models that are slightly better than random guessing) to form a strong learner
        - learners are trained sequentially, each learner trying to correct its predecessor
        - Adaboost or adaptive boosting (through contribution/weight adjustment) : 
            - each predictor pays more attention to the instances wrongly predicted by its predecessor, by increasing the weights of those instances
            - each predictor has an assigned coefficient (alpha) that signifies its contribution to the final prediction; alpha depends on the predictor's training error
            - before the data goes to the next predictor for training, alpha is used to adjust the weights of the training instances
            - Learning rate eta (between 0 and 1) shrinks the coefficient alpha
            - Classification : Final outcome decided by weighted majority voting
            - Regression : Final outcome decided by weighted average
        - Gradient boosting (through training on gradients/residuals) :
            - sequential correction of the predecessor's errors, but without adjusting the weights of training instances as adaboost does
            - each predictor is trained using its predecessor's residual errors as labels
            - base learner is a CART (Classification and Regression Tree)
            - Learning rate or shrinkage trade-off : a smaller learning rate must be compensated by a larger number of estimators
        - Stochastic gradient boosting (row and feature sampling without replacement on top of gradient boosting, to add diversity to the ensemble)
            - Gradient boosting problem : The exhaustive search for the best split may lead to CARTs using the same split points and maybe the same features, so the trees in the ensemble end up very similar.
            - Goal : add diversity (variance) among the trees of the ensemble.
            - Solution : Each tree is trained on a random subset (40%-80%) of the training rows, sampled without replacement; features are also sampled without replacement when choosing split points.
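
The impurity measures and the information gain criterion listed above can be sketched in a few lines of NumPy. This is a minimal illustration, not how scikit-learn implements it: the function names (`gini`, `entropy`, `information_gain`, `best_threshold`) and the toy arrays are assumptions made for the example.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Shannon entropy in bits: -sum_k p_k * log2(p_k)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right, impurity=entropy):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted

def best_threshold(x, y, impurity=entropy):
    """Try midpoints between sorted unique values of one feature.

    Returns (threshold, gain); threshold is None if no split has positive gain.
    """
    values = np.unique(x)
    best = (None, 0.0)
    for t in (values[:-1] + values[1:]) / 2:
        mask = x <= t
        gain = information_gain(y, y[mask], y[~mask], impurity)
        if gain > best[1]:
            best = (t, gain)
    return best

# Toy data (hypothetical): one feature, two perfectly separable classes
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0, 0, 0, 1, 1, 1])

threshold, gain = best_threshold(x, y)
print(threshold, gain)
```

A full tree would call `best_threshold` for every feature at every node and recurse on the two resulting subsets until the gain is zero or a depth limit is hit.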

### Classification Tree

```
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
# Split the dataset into 80% train, 20% test
X_train, X_test, y_train, y_test= train_test_split(X, y, test_size=0.2, stratify=y, random_state=1)
# Instantiate the Classification Tree
cl_dt = DecisionTreeClassifier(max_depth=2, random_state=42, criterion='gini')
# Train the model
cl_dt.fit(X_train,y_train)
# Predict using test set
y_pred = cl_dt.predict(X_test)
# Evaluate the test set accuracy
test_accuracy = accuracy_score(y_test, y_pred)
print(test_accuracy)
# To check for model overfitting, compare the cross-validated log loss with the training set log loss
# Compute the 10-fold CV log loss (scikit-learn returns it negated, so flip the sign)
log_loss_cv = -cross_val_score(cl_dt, X_train, y_train, cv=10, scoring='neg_log_loss', n_jobs=-1)
```
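
The block above assumes `X` and `y` already exist. As a self-contained sketch, the version below swaps in scikit-learn's bundled breast cancer dataset (an assumption for illustration, not the data used in these notes) and prints the fitted tree with `export_text`, which makes the "sequence of if-else questions" visible.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import accuracy_score

# Assumption: the bundled breast cancer dataset stands in for X and y
data = load_breast_cancer()
X, y = data.data, data.target

# Split the dataset into 80% train, 20% test, keeping class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

# A shallow tree: at most two questions on the path to any leaf
cl_dt = DecisionTreeClassifier(max_depth=2, criterion='gini', random_state=42)
cl_dt.fit(X_train, y_train)

test_accuracy = accuracy_score(y_test, cl_dt.predict(X_test))
print(f"Test accuracy: {test_accuracy:.3f}")

# Print the learned if-else questions, one line per node
rules = export_text(cl_dt, feature_names=list(data.feature_names))
print(rules)
```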

### Regression Tree

```
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as MSE
from sklearn.model_selection import cross_val_score
# Split the dataset into 80% train, 20% test
X_train, X_test, y_train, y_test= train_test_split(X, y, test_size=0.2, random_state=42)
# Instantiate the Regression Tree
reg_dt = DecisionTreeRegressor(max_depth=4, min_samples_leaf=0.1, random_state=3)
# Train with training data
reg_dt.fit(X_train, y_train)
# Predict 
y_pred = reg_dt.predict(X_test)
# Compute RMSE for testing data
mse_reg_dt = MSE(y_test, y_pred)
rmse_reg_dt = mse_reg_dt**(1/2)
print(rmse_reg_dt)
# To check for model overfitting, compare the 10-fold CV RMSE with the training set RMSE
# (scikit-learn returns the MSE negated, so flip the sign)
MSE_CV = - cross_val_score(reg_dt, X_train, y_train, cv= 10, scoring='neg_mean_squared_error', n_jobs = -1)
rmse_cv = MSE_CV.mean()**(1/2)
```

# Chapter 2

### Bias-Variance Trade-off

<center><img src="images/02.04.png"  style="width: 400px, height: 300px;"/></center>
<center><img src="images/02.05.png"  style="width: 400px, height: 300px;"/></center>
<center><img src="images/02.03.png"  style="width: 400px, height: 300px;"/></center>


- Overfitting : 
    - Model also memorises / trains on noise that resides within training data. 
    - Model performs well when evaluating on training data but does not perform well on unseen data
    - High variance is responsible for this error: the model is flexible enough to fit the noise as well as the pattern.
    - Diagnosis: cross-validation error is noticeably higher than the training set error
    - Possible remedy : Decrease model complexity, gather more data
- Underfitting :
    - Model is too simple to capture the underlying pattern.
    - Model is bad on both training and unseen data
    - Model is not flexible enough to approximate the prediction values
    - High bias is responsible for this error
    - Diagnosis: cross-validation error and training set error are roughly equal, but both are much higher than the desired error
    - Possible remedy : Increase model complexity, gather more relevant features
- Bias-Variance trade-off :
    - Generalization error = bias^2 + variance + irreducible error (noise)
    - bias = error term that tells how much, on average, the predicted value differs from the real value
    - variance = error term that tells how predicted value varies over different training sets
    - When model complexity increases, variance increases and bias decreases
    - When model complexity decreases, variance decreases and bias increases
    - The sweet spot is the minimised generalization error, which gives the optimised model
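
The two diagnoses above can be tried out on synthetic data. The sketch below is only an illustration under stated assumptions: the noisy sine data, the helper `train_and_cv_rmse`, and the two tree depths are invented for the example. A depth-1 tree is expected to show similar (and high) training and CV errors, while an unconstrained tree should show a near-zero training error and a clearly higher CV error.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error

# Hypothetical data: a sine curve plus Gaussian noise
rng = np.random.RandomState(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

def train_and_cv_rmse(model, X, y, cv=10):
    """Return (training RMSE, cross-validated RMSE) for a regressor."""
    mse_cv = -cross_val_score(model, X, y, cv=cv,
                              scoring='neg_mean_squared_error')
    rmse_cv = mse_cv.mean() ** 0.5
    model.fit(X, y)
    rmse_train = mean_squared_error(y, model.predict(X)) ** 0.5
    return rmse_train, rmse_cv

# Too simple: a single split (candidate for high bias)
rmse_train_shallow, rmse_cv_shallow = train_and_cv_rmse(
    DecisionTreeRegressor(max_depth=1, random_state=0), X, y)

# Too flexible: grown until every leaf is pure (candidate for high variance)
rmse_train_deep, rmse_cv_deep = train_and_cv_rmse(
    DecisionTreeRegressor(max_depth=None, random_state=0), X, y)

print(f"shallow: train={rmse_train_shallow:.2f}  cv={rmse_cv_shallow:.2f}")
print(f"deep:    train={rmse_train_deep:.2f}  cv={rmse_cv_deep:.2f}")
```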

### Ensemble Learning : Hard Voting

<center><img src="images/02.01.png"  style="width: 400px, height: 300px;"/></center>
<center><img src="images/02.02.png"  style="width: 400px, height: 300px;"/></center>

```
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.metrics import mean_squared_error as MSE

# Split data into 70% train and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.3, random_state= 42)
# Instantiate individual classifiers
lr = LogisticRegression(random_state=42)
knn = KNN()
dt = DecisionTreeClassifier(random_state=42,max_depth=4, min_samples_leaf=0.16)
classifiers = [('Logistic Regression', lr),
                ('K Nearest Neighbours', knn),
                ('Classification Tree', dt)]

# Instantiate an ensemble VotingClassifier
from sklearn.ensemble import VotingClassifier
ensemble_model = VotingClassifier(estimators=classifiers)

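# Instantiate an ensemble VotingRegressor
# (regressors is assumed to be a list of (name, regressor) tuples, built like classifiers above)
from sklearn.ensemble import VotingRegressor
ensemble_model = VotingRegressor(estimators=regressors)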

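# Instantiate an ensemble BaggingClassifier
# (the parameter is named estimator in scikit-learn >= 1.2; older releases call it base_estimator)
from sklearn.ensemble import BaggingClassifier
ensemble_model = BaggingClassifier(estimator=dt, n_estimators=300,oob_score=True, n_jobs=-1)
# oob_score_ only exists after fitting, so read it after ensemble_model.fit(X_train, y_train):
# oob_accuracy = ensemble_model.oob_score_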

# Instantiate an ensemble BaggingRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
base_regressor = DecisionTreeRegressor(max_depth=8, min_samples_leaf=0.13, random_state=3)
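# (estimator in scikit-learn >= 1.2; older releases call it base_estimator)
ensemble_model = BaggingRegressor(estimator=base_regressor, n_estimators=300, oob_score=True, n_jobs=-1)
# oob_score_ only exists after fitting, so read it after ensemble_model.fit(X_train, y_train):
# oob_score = ensemble_model.oob_score_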

# Instantiate an ensemble RandomForestRegressor
from sklearn.ensemble import RandomForestRegressor
ensemble_model = RandomForestRegressor(n_estimators=400, min_samples_leaf=0.12, random_state=42)

# Instantiate an ensemble RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
ensemble_model = RandomForestClassifier(n_estimators=400, random_state=42)

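# Instantiate an ensemble AdaBoostClassifier
from sklearn.ensemble import AdaBoostClassifier
stump = DecisionTreeClassifier(max_depth=1, random_state=42) # weak learner: a tree with max depth of 1
ensemble_model = AdaBoostClassifier(estimator=stump, n_estimators=100) # estimator was base_estimator before scikit-learn 1.2
# Fit before asking for the probabilities of the positive class
ensemble_model.fit(X_train, y_train)
y_pred_proba = ensemble_model.predict_proba(X_test)[:,1]
# Evaluate testing roc_auc_score
from sklearn.metrics import roc_auc_score
adb_clf_roc_auc_score = roc_auc_score(y_test, y_pred_proba)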

# Instantiate an ensemble GradientBoostingRegressor, (max_features=0.2, subsample=0.8) makes it stochastic gradient boosting
from sklearn.ensemble import GradientBoostingRegressor
ensemble_model = GradientBoostingRegressor(max_depth=1, subsample=0.8, max_features=0.2, n_estimators=300, random_state=42)

# Train using training set
ensemble_model.fit(X_train, y_train)
# Predict with test set
y_pred = ensemble_model.predict(X_test)
# Evaluate accuracy (only when ensemble_model is a classifier)
# print(accuracy_score(y_test, y_pred))
# Evaluate RMSE (only when ensemble_model is a regressor)
rmse = MSE(y_test, y_pred)**(1/2)
# Visualize feature importances (tree-based ensembles only; assumes X is a pandas DataFrame)
import pandas as pd
import matplotlib.pyplot as plt
importances = pd.Series(ensemble_model.feature_importances_, index = X.columns)
sorted_importances = importances.sort_values()
sorted_importances.plot(kind='barh', color='lightgreen')
plt.show()
```
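
To make hard voting concrete, the sketch below fits a `VotingClassifier` on synthetic data and then recomputes the majority vote by hand from the fitted individual models. The dataset from `make_classification` and the short estimator names are assumptions for the example; with three binary classifiers, a label wins as soon as it gets two votes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import VotingClassifier

# Hypothetical binary classification data with labels 0 and 1
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

classifiers = [('lr', LogisticRegression(random_state=42)),
               ('knn', KNeighborsClassifier()),
               ('dt', DecisionTreeClassifier(max_depth=4, random_state=42))]

# voting='hard' means: predict the label chosen by most classifiers
voting = VotingClassifier(estimators=classifiers, voting='hard')
voting.fit(X_train, y_train)
ensemble_pred = voting.predict(X_test)

# Recompute the vote by hand from the fitted individual models:
# one column of 0/1 predictions per classifier, label 1 wins with >= 2 votes
votes = np.column_stack([est.predict(X_test) for est in voting.estimators_])
manual_pred = (votes.sum(axis=1) >= 2).astype(int)

print("Same predictions:", np.array_equal(ensemble_pred, manual_pred))
```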

# Chapter 3

### Ensemble : Bagging (Sampling with replacement)

<center><img src="images/03.01.png"  style="width: 400px, height: 300px;"/></center>
<center><img src="images/03.02.png"  style="width: 400px, height: 300px;"/></center>
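
The share of out-of-bag instances can be checked with a quick simulation. A bootstrap sample of size n leaves out each instance with probability (1 - 1/n)^n, which approaches 1/e (about 37%) for large n. The sketch below uses only NumPy; the sample size and number of repetitions are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 1000   # size of the (hypothetical) training set
n_repeats = 200    # number of bootstrap samples to average over

oob_fractions = []
for _ in range(n_repeats):
    # Draw n_samples indices with replacement: one bootstrap sample
    idx = rng.integers(0, n_samples, size=n_samples)
    # Instances that never appear in idx are out-of-bag for this sample
    n_seen = np.unique(idx).size
    oob_fractions.append(1 - n_seen / n_samples)

mean_oob = float(np.mean(oob_fractions))
print(f"Average out-of-bag fraction: {mean_oob:.3f}")
print(f"Theoretical limit 1/e:       {1 / np.e:.3f}")
```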


### Random Forest (Bootstrap samples plus random feature subsets)

<center><img src="images/03.03.png"  style="width: 400px, height: 300px;"/></center>
<center><img src="images/03.04.png"  style="width: 400px, height: 300px;"/></center>
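
In scikit-learn the two sources of randomness in a random forest map onto two parameters: `bootstrap=True` gives each tree a bootstrap sample of the rows (drawn with replacement), and `max_features` limits how many randomly chosen features each split may consider. The sketch below sets both explicitly on synthetic data; the dataset and the parameter values are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical data: 20 features, 5 of them informative
X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,
    bootstrap=True,        # each tree sees a bootstrap sample of the rows
    max_features='sqrt',   # each split considers about sqrt(n_features) features
    oob_score=True,        # evaluate on the rows each tree did not see
    random_state=42)
rf.fit(X, y)

oob_accuracy = rf.oob_score_
# Number of features an individual tree considers per split: int(sqrt(20)) = 4
max_features_per_split = rf.estimators_[0].max_features_

print(f"OOB accuracy: {oob_accuracy:.3f}")
print(f"Features considered per split: {max_features_per_split}")
```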


# Chapter 4

### Adaboost  (Adaptive Boosting)

<center><img src="images/04.01.png"  style="width: 400px, height: 300px;"/></center>
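
The weight and coefficient updates can be sketched from scratch. The code below follows the classic discrete AdaBoost formulation for labels in {-1, +1}, with alpha = eta * 0.5 * ln((1 - err) / err); it is an illustration under those assumptions and not necessarily what `AdaBoostClassifier` does internally. The toy data and the function names `adaboost_fit` and `adaboost_predict` are made up for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: two Gaussian blobs, labels encoded as -1 / +1
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.5, 1.0, size=(200, 2)),
               rng.normal(+1.5, 1.0, size=(200, 2))])
y = np.repeat([-1, 1], 200)

def adaboost_fit(X, y, n_estimators=30, eta=1.0):
    """Train decision stumps sequentially on re-weighted instances."""
    n = len(y)
    w = np.full(n, 1.0 / n)            # start with uniform instance weights
    stumps, alphas = [], []
    for _ in range(n_estimators):
        stump = DecisionTreeClassifier(max_depth=1, random_state=0)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        # Weighted error of this stump, clipped to keep the log finite
        err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)
        # Coefficient of the stump, shrunk by the learning rate eta
        alpha = eta * 0.5 * np.log((1 - err) / err)
        # Raise weights of misclassified instances, lower the others
        w = w * np.exp(-alpha * y * pred)
        w = w / w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, np.array(alphas), w

def adaboost_predict(stumps, alphas, X):
    """Weighted majority vote: sign of the alpha-weighted sum of stump votes."""
    scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.where(scores >= 0, 1, -1)

stumps, alphas, final_weights = adaboost_fit(X, y)
train_accuracy = np.mean(adaboost_predict(stumps, alphas, X) == y)
print(f"Training accuracy of the boosted stumps: {train_accuracy:.3f}")
```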


### Gradient Boosting

<center><img src="images/04.02.png"  style="width: 400px, height: 300px;"/></center>
<center><img src="images/04.03.png"  style="width: 400px, height: 300px;"/></center>
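
Training on residuals can also be written out by hand. The sketch below assumes squared-error loss, where the residuals are exactly what each new tree should fit, and starts from the mean of the targets. A `subsample` below 1.0 trains each tree on rows drawn without replacement, as in stochastic gradient boosting. The data and the function names `gradient_boost_fit` and `gradient_boost_predict` are assumptions for the example, not scikit-learn's implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data: a sine curve plus Gaussian noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)

def gradient_boost_fit(X, y, n_estimators=100, eta=0.1, max_depth=2,
                       subsample=1.0, seed=0):
    """Fit shallow trees sequentially, each on the current residuals."""
    rng = np.random.default_rng(seed)
    n = len(y)
    base = y.mean()                      # initial constant prediction
    pred = np.full(n, base)
    trees = []
    mse_history = [np.mean((y - pred) ** 2)]
    for _ in range(n_estimators):
        residuals = y - pred             # what the ensemble still gets wrong
        # Rows for this tree, drawn without replacement (all rows if subsample=1.0)
        rows = rng.choice(n, size=int(subsample * n), replace=False)
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X[rows], residuals[rows])
        pred = pred + eta * tree.predict(X)   # shrunken correction
        trees.append(tree)
        mse_history.append(np.mean((y - pred) ** 2))
    return (base, eta, trees), np.array(mse_history)

def gradient_boost_predict(model, X):
    """Initial prediction plus the shrunken sum of all tree corrections."""
    base, eta, trees = model
    return base + eta * sum(tree.predict(X) for tree in trees)

gb_model, mse_gb = gradient_boost_fit(X, y)                    # plain gradient boosting
sgb_model, mse_sgb = gradient_boost_fit(X, y, subsample=0.8)   # stochastic variant

print(f"Training MSE, constant model: {mse_gb[0]:.3f}")
print(f"Training MSE, gradient boosting: {mse_gb[-1]:.3f}")
print(f"Training MSE, stochastic gradient boosting: {mse_sgb[-1]:.3f}")
```

Lowering `eta` in this sketch slows the drop in training error per tree, which is the learning rate versus number of estimators trade-off in practice.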
