## Random Forest

Decision trees are rarely used on their own in machine learning. You mostly see a single tree in situations where understanding the model's decisions is extremely important.

Instead, decision trees are much more commonly used in **ensembles**. An ensemble is a group of models used together to make a prediction. Ensembling multiple models often leads to a stronger predictive model, but you lose some ability to interpret it.

Random forests train a bunch of decision trees that are all different and then combine them via **voting**. You take the most common predicted class as the prediction (hard voting). The fraction of trees voting for each class gives a rough probability estimate, and if each model can output class probabilities, you can instead average those probabilities and predict the class with the highest average (soft voting). It is fairly easy to see how multiple models might be more effective than a single model if they all look at the data in a different way. For example, if you wanted to know whether a movie is good, you probably wouldn't ask just one person. Instead, you would ask many different people with different preferences.
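To make the two voting schemes concrete, here is a minimal sketch in plain Python. The function names (`hard_vote`, `vote_fractions`, `soft_vote`) and the toy predictions are made up for illustration and are not sklearn's API; soft voting is taken here to mean averaging each model's class probabilities.

```python
from collections import Counter

def hard_vote(predictions):
    """Return the class label predicted by the most models."""
    return Counter(predictions).most_common(1)[0][0]

def vote_fractions(predictions):
    """Return the fraction of models that voted for each class."""
    counts = Counter(predictions)
    return {label: count / len(predictions) for label, count in counts.items()}

def soft_vote(probabilities):
    """Average per-class probabilities across models and pick the largest."""
    classes = probabilities[0].keys()
    averaged = {
        c: sum(p[c] for p in probabilities) / len(probabilities) for c in classes
    }
    return max(averaged, key=averaged.get), averaged

# Hypothetical class predictions from five trees for one movie.
tree_predictions = ["good", "bad", "good", "good", "bad"]
print(hard_vote(tree_predictions))
print(vote_fractions(tree_predictions))

# Hypothetical class probabilities from three models for another movie.
model_probabilities = [
    {"good": 0.90, "bad": 0.10},
    {"good": 0.40, "bad": 0.60},
    {"good": 0.45, "bad": 0.55},
]
# Each model's own top class, combined by hard voting.
top_classes = [max(p, key=p.get) for p in model_probabilities]
print(hard_vote(top_classes))
# The same models combined by soft voting.
print(soft_vote(model_probabilities)[0])
```

In the second example the two schemes disagree: two of the three models lean slightly toward "bad", but the one very confident "good" model pulls the averaged probability the other way. Soft voting gives more weight to confident models, which only helps if those probabilities are trustworthy.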

Random forests are one of the most popular models in machine learning. Here is a nice paper that compares many models and shows how well random forests fare:

http://lowrank.net/nikos/pubs/empirical.pdf

### Some intuition 

P. 183 of Hands-On Machine Learning has one of the nicest explanations of why combining weak learners can lead to a strong learner. Let's take a look!
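The core of that argument can be sketched with a short calculation. Assume (unrealistically) that every model is correct with the same probability and that their errors are fully independent; then the chance that a majority vote is correct is a binomial tail sum. The function name `majority_correct_probability`, the 51% accuracy, and the ensemble sizes below are illustrative choices, not numbers taken from the book.

```python
from math import comb

def majority_correct_probability(n_models, p_correct):
    """Probability that more than half of n independent models are correct.

    Assumes every model has the same accuracy and that errors are independent.
    """
    needed = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(needed, n_models + 1)
    )

# Odd ensemble sizes avoid ties in the vote.
for n in (1, 101, 1001):
    prob = majority_correct_probability(n, 0.51)
    print(f"{n:>5} models at 51% accuracy -> majority correct with p = {prob:.3f}")
```

Even though each model is barely better than a coin flip, the majority vote gets steadily more reliable as models are added. The catch is the independence assumption: models trained on the same data tend to make correlated errors, so real ensembles gain less than this calculation suggests.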

One of the main takeaways is that you want your models to be as independent as possible. Thus, when ensembling, you want to do as much as you can to have diverse models. There are a few ways you can try to get this diversity:

1. Train several different types of model on the same data. For example, one SVM, one logistic regression, and one k-nearest neighbors. You can see how to do this in sklearn on p. 184 of Hands-On Machine Learning. We won't dive too deep into this method, but it is pretty simple and can be pretty effective.

2. ### Bagging (Bootstrap Aggregating)

Bagging uses the same type of model for each predictor, but each model gets a different view of the same data. How do we create different views? Simple: each model randomly samples from the training data **with replacement**. Each sample is the same size as the original training data, but since you sample with replacement, each model sees only about 63% of the distinct training instances; the rest of its sample is made up of duplicates. Note: sampling without replacement is called pasting, but this is rarely used.
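The 63% figure can be checked directly. With n training instances, the chance that a given instance is never picked in n draws with replacement is (1 - 1/n)^n, which approaches 1/e (about 37%) as n grows. Here is a small sketch that compares the formula with a simulated bootstrap sample; the helper name `bootstrap_indices`, the seed, and the dataset size are arbitrary choices for illustration.

```python
import math
import random

def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

rng = random.Random(0)  # fixed seed so the run is repeatable
n_samples = 10_000

indices = bootstrap_indices(n_samples, rng)

# Fraction of distinct training instances that made it into the sample.
unique_fraction = len(set(indices)) / n_samples

# The same quantity from the formula, and its large-n limit.
expected_fraction = 1 - (1 - 1 / n_samples) ** n_samples
limit_fraction = 1 - 1 / math.e

print(f"simulated: {unique_fraction:.3f}")
print(f"formula:   {expected_fraction:.3f}")
print(f"limit:     {limit_fraction:.3f}")
```

The sample has as many rows as the training set, but only roughly 63% of them are distinct. Repeating the draw with a different seed gives each model a different 63%, which is where the diversity comes from.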

Benefits:

* Scales very well since training and prediction can be run in parallel
* The ensemble has lower variance than a single model

P. 186 of Hands-On Machine Learning has an sklearn example of bagging.

Since each model sees only about 63% of the training instances, about 37% of the data per model isn't used for training. These are called **out-of-bag samples**, and they are a different 37% for each model. We can use these samples for evaluation without needing a separate validation set.
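Here is a rough sketch of the bookkeeping behind out-of-bag evaluation: for every model, record which training instances it never saw. The function name `out_of_bag_sets` and the sizes used are hypothetical; this only tracks indices and does not train any model.

```python
import random

def out_of_bag_sets(n_samples, n_models, seed=0):
    """For each model, draw a bootstrap sample and return the indices it never saw."""
    rng = random.Random(seed)
    all_indices = set(range(n_samples))
    oob_per_model = []
    for _ in range(n_models):
        in_bag = {rng.randrange(n_samples) for _ in range(n_samples)}
        oob_per_model.append(all_indices - in_bag)
    return oob_per_model

n_samples, n_models = 1_000, 50
oob_per_model = out_of_bag_sets(n_samples, n_models)

# Average share of the training data that a single model leaves out.
mean_oob_fraction = sum(len(oob) for oob in oob_per_model) / (n_models * n_samples)

# For each training instance, how many models treat it as unseen data?
oob_counts = [sum(i in oob for oob in oob_per_model) for i in range(n_samples)]

print(f"mean out-of-bag fraction per model: {mean_oob_fraction:.3f}")
print(f"fewest models any instance is out-of-bag for: {min(oob_counts)}")
```

To get an out-of-bag score, each training instance is predicted using only the models in whose out-of-bag set it appears, and those predictions are compared with the true labels. With enough models, every instance is out-of-bag for at least a few of them, so the whole training set can be scored this way.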

3. ### Random patches and subspaces

This is a simple idea: you randomly sample the features that each model can see. Sampling only features is called random subspaces. Sampling both features and training instances is called random patches. Sampling features can be especially effective when you have many features (high dimensions). Again, this typically lowers variance.
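The difference between the two schemes is easiest to see as index selection on a small table. The sketch below uses hypothetical helpers (`random_subspace`, `random_patch`) and a toy dataset, and assumes features are drawn without replacement while rows are drawn with replacement; other combinations are possible.

```python
import random

def random_subspace(n_features, max_features, rng):
    """Pick a subset of feature (column) indices; every row is kept."""
    return sorted(rng.sample(range(n_features), max_features))

def random_patch(n_samples, n_features, max_samples, max_features, rng):
    """Pick both row indices (with replacement) and a subset of column indices."""
    rows = [rng.randrange(n_samples) for _ in range(max_samples)]
    cols = sorted(rng.sample(range(n_features), max_features))
    return rows, cols

rng = random.Random(0)

# Toy dataset: 6 rows, 5 features; the value encodes (row, column) for readability.
X = [[10 * r + c for c in range(5)] for r in range(6)]

# Random subspaces: all rows, 3 of the 5 features.
subspace_cols = random_subspace(n_features=5, max_features=3, rng=rng)
subspace_view = [[row[c] for c in subspace_cols] for row in X]

# Random patches: 4 sampled rows, 3 of the 5 features.
patch_rows, patch_cols = random_patch(6, 5, max_samples=4, max_features=3, rng=rng)
patch_view = [[X[r][c] for c in patch_cols] for r in patch_rows]

print("subspace columns:", subspace_cols)
print("patch rows:", patch_rows, "patch columns:", patch_cols)
```

Each model in the ensemble would get its own draw, so no single feature can dominate every model.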


## Random Forest

Now that we understand all these methods, it is pretty easy to understand random forest. Random forests use decision trees with bagging. 

# Q: random features? 

### SKLearn Example