#  Model Selection and Model Training
Five Effective Machine Learning Algorithms for Regression and Classification 
#  Model Selection
## Regularization - Penalize model coefficients to prevent overfitting  
- Discourage large coefficients by dampening them
- Remove features entirely by setting their coefficients to 0
- The penalty is tunable 
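
The first two bullets can be illustrated with a minimal sketch for a simplified, one-coefficient problem. Here `b` stands for the unpenalized estimate and `lam` for the penalty strength; both names and the objectives in the docstrings are assumptions made for illustration, not part of any particular library.

```python
def ridge_shrink(b, lam):
    """Minimizer of 0.5*(b - beta)**2 + 0.5*lam*beta**2: dampens, never reaches 0."""
    return b / (1.0 + lam)

def lasso_shrink(b, lam):
    """Minimizer of 0.5*(b - beta)**2 + lam*abs(beta): soft-thresholding."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0  # small coefficients are removed entirely

unpenalized = [3.0, -0.4, 0.1]
for lam in (0.0, 0.5, 1.0):
    ridge = [ridge_shrink(b, lam) for b in unpenalized]
    lasso = [lasso_shrink(b, lam) for b in unpenalized]
    print(f"lam={lam}: squared penalty -> {ridge}, absolute penalty -> {lasso}")
```

Raising `lam` shrinks every coefficient under the squared penalty, while under the absolute penalty the small ones drop to exactly 0.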

## Ensembles  - Combine predictions from multiple separate models 
- Bagging : Reduces the chance of overfitting complex models
  - Uses complex base models and tries to "smooth out" predictions
  - A strong learner : Relatively unconstrained model
  - Trains a large number of "strong" learners in parallel
  - Combines all the strong learners together to "smooth out" predictions
- Boosting : Improve the predictive flexibility of simple models
  - Uses simple base models and boosts aggregate complexity
  - A weak learner : A constrained model (e.g. a decision tree with a limited maximum depth)
  - Trains a large number of "weak" learners in sequence
  - Each learner learns from the mistakes of the one before it
  - Combines all the weak learners into a single strong learner
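
The bagging bullets above can be sketched in plain Python. This is only an illustration under stated assumptions: the "strong" learner is a 1-nearest-neighbor regressor (chosen because it memorizes its training data), and the helper names `fit_1nn` and `bagged_predict` are hypothetical. The models are trained in a loop here, but each one is independent of the others, so they could be trained in parallel.

```python
import random

def fit_1nn(xs, ys):
    """A deliberately unconstrained ("strong") learner: memorizes the training set."""
    points = list(zip(xs, ys))

    def predict(x):
        # Return the target of the closest training point.
        return min(points, key=lambda p: abs(p[0] - x))[1]

    return predict

def bagged_predict(xs, ys, x_new, n_models=200, seed=0):
    """Average the predictions of many 1-NN models, each fit on a bootstrap sample."""
    rng = random.Random(seed)
    n = len(xs)
    preds = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        model = fit_1nn([xs[i] for i in idx], [ys[i] for i in idx])
        preds.append(model(x_new))
    return sum(preds) / len(preds)

# Toy data: the signal is y = x, with one noisy target at x = 5.
xs = list(range(10))
ys = [float(x) for x in xs]
ys[5] = 15.0

single = fit_1nn(xs, ys)(5)         # a single strong learner reproduces the noise
bagged = bagged_predict(xs, ys, 5)  # the bagged average is pulled toward the neighbors
print(single, bagged)
```

Because some bootstrap samples leave out the noisy point, the averaged prediction is smoother than that of any single model.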


##  Linear Regression
- Fits a "straight line" or a hyperplane
- Easy to interpret and understand
- Overfitting: The model performs very well on the training data but poorly on the test data
- Prone to overfitting when there are many input features
- Cannot easily express non-linear relationships
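
As a small illustration of the first and last bullets, the sketch below fits a one-feature least-squares line with the textbook closed form. The helper name `fit_line` and the toy data are made up for this example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]

# A linear relationship is recovered exactly.
linear_ys = [2.0 * x + 1.0 for x in xs]
print(fit_line(xs, linear_ys))      # (2.0, 1.0)

# A quadratic relationship is missed: the best straight line is flat.
quadratic_ys = [x ** 2 for x in xs]
print(fit_line(xs, quadratic_ys))   # (0.0, 2.0)
```

The fitted coefficients are easy to read, but the flat line for the quadratic data shows why a non-linear relationship needs extra features or a different model.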

### Three common types of regularized linear regression algorithms to prevent overfitting
1. LASSO (Least Absolute Shrinkage and Selection Operator) Regression
 - Penalizes the absolute size of coefficients : Leads to coefficients that can be exactly 0
 - Can completely remove some features and offers automatic feature selection
 - The "strength" of the penalty should be tuned: a stronger penalty pushes more coefficients to exactly 0
2. Ridge Regression
 - Penalizes the squared size of coefficients : Leads to smaller coefficients, but it doesn't force them to 0
 - Offers feature shrinkage
 - The "strength" of the penalty should be tuned: a stronger penalty pushes coefficients closer to 0
3. Elastic-Net  
 - A compromise between Lasso and Ridge
 - Elastic-Net penalizes a mix of both absolute and squared size
 - Should tune the ratio of the two penalty types and overall strength 
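
The three penalties can be written side by side. The notation below is an assumption introduced for this sketch: $\lambda \ge 0$ is the overall penalty strength, $\rho \in [0, 1]$ is the mixing ratio, and the intercept is omitted. Libraries often scale the terms differently (for example by $1/(2n)$), so treat this as one common form, not a definitive one.

```latex
\begin{aligned}
\text{Lasso:} \quad & \min_{\beta} \; \sum_{i=1}^{n} \bigl(y_i - x_i^{\top}\beta\bigr)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \\
\text{Ridge:} \quad & \min_{\beta} \; \sum_{i=1}^{n} \bigl(y_i - x_i^{\top}\beta\bigr)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \\
\text{Elastic-Net:} \quad & \min_{\beta} \; \sum_{i=1}^{n} \bigl(y_i - x_i^{\top}\beta\bigr)^2 + \lambda \Bigl( \rho \sum_{j=1}^{p} |\beta_j| + (1-\rho) \sum_{j=1}^{p} \beta_j^2 \Bigr)
\end{aligned}
```

In this form, setting $\rho = 1$ recovers Lasso and $\rho = 0$ recovers Ridge, which is why Elastic-Net needs both the ratio and the overall strength tuned.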


## Decision Tree
- Flexibility: The hierarchical branching structure can easily model nonlinear relationships
- Individual unconstrained decision trees are prone to being overfit

### Two tree ensembles to prevent overfitting
1. Random forests
 - Train a large number of "strong" decision trees and combine their predictions through bagging
 - Feature selection: Each tree is only allowed to choose from a random subset of features to split on 
 - Resampling: Each tree is only trained on a random subset of observations 
 - Often perform well out of the box and beat many other models
 - Have relatively few complicated parameters to tune

2. Boosted trees
 - Train a sequence of "weak", constrained decision trees and combine their predictions through boosting
 - The maximum depth of each tree should be tuned
 - Each tree corrects the prediction errors of the one before it
 - Tend to have the highest performance ceilings
 - Often beat many other types of models after proper tuning
 - More complicated to tune than random forests
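
The two sources of randomness in a random forest (row resampling and feature subsetting) can be sketched without any tree code. The helper names are hypothetical, and using about `sqrt(n_features)` candidate features is a common convention assumed here. Many implementations redraw the feature subset at every split; this sketch draws it once per tree for brevity.

```python
import math
import random

def bootstrap_rows(n_rows, rng):
    """Resampling: draw n_rows row indices with replacement."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def candidate_features(n_features, rng):
    """Feature subsetting: pick a random subset of feature indices to consider."""
    k = max(1, int(math.sqrt(n_features)))  # assumed convention, tunable in practice
    return sorted(rng.sample(range(n_features), k))

rng = random.Random(42)
n_rows, n_features = 8, 9
for tree_id in range(3):
    rows = bootstrap_rows(n_rows, rng)
    feats = candidate_features(n_features, rng)
    print(f"tree {tree_id}: rows={rows} features={feats}")
```

Each tree sees a different slice of the data, which decorrelates the trees so that averaging them smooths out individual overfitting.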

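
The sequential idea behind boosted trees can be sketched with depth-1 trees ("stumps") as the weak learners. This is a minimal illustration assuming squared-error loss, where each stump is fit to the residuals left by the ones before it. The names `fit_stump` and `boost`, and the `learning_rate` and `n_rounds` parameters, are assumptions made for this example.

```python
def fit_stump(xs, residuals):
    """Weak learner: a depth-1 tree (one threshold, two constant leaves)."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the largest value so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    _, t, left_mean, right_mean = best
    return lambda x: left_mean if x <= t else right_mean

def boost(xs, ys, n_rounds=50, learning_rate=0.3):
    """Fit stumps in sequence, each on the residuals left by the previous ones."""
    base = sum(ys) / len(ys)  # start from the mean prediction
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

def mse(model, xs, ys):
    """Mean squared error of a model on the given data."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [i / 2 for i in range(-6, 7)]
ys = [x ** 2 for x in xs]  # a non-linear target that a single stump cannot fit

one_round = boost(xs, ys, n_rounds=1)
many_rounds = boost(xs, ys, n_rounds=50)
print(mse(one_round, xs, ys), mse(many_rounds, xs, ys))
```

The training error falls as rounds are added, which is also why the number of rounds, the learning rate, and the tree depth all need tuning to avoid overfitting.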

