## Random Forest Classifiers

### Bagging
**Bagging attempts to reduce the chance of overfitting complex models.**
* It trains a large number of "strong" learners in parallel.
* A strong learner is a model that's relatively unconstrained.
* Bagging then combines all the strong learners together in order to "smooth out" their predictions.
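
The steps above can be sketched by hand. This is a minimal illustration, not a production implementation: it assumes scikit-learn's bundled iris data (`load_iris`) as a stand-in dataset, and the names `n_trees`, `votes`, and `bagged_pred` are made up for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

n_trees = 25  # illustrative choice, not a recommendation
trees = []
for _ in range(n_trees):
    # Bootstrap sample: draw len(X) row indices with replacement
    idx = rng.integers(0, len(X), size=len(X))
    # A "strong" learner: a decision tree with no depth limit
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# Aggregate: every tree votes and the most common class wins
votes = np.stack([tree.predict(X) for tree in trees])  # shape (n_trees, n_samples)
bagged_pred = np.array([np.bincount(col, minlength=3).argmax() for col in votes.T])

print("Training accuracy of the bagged ensemble:", (bagged_pred == y).mean())
```

Each tree is fitted independently of the others, which is why the loop could also run in parallel.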

### Boosting
**Boosting attempts to improve the predictive flexibility of simple models.**
* It trains a large number of "weak" learners in sequence.
* A weak learner is a constrained model (e.g., a decision tree whose maximum depth is limited).
* Each one in the sequence focuses on learning from the mistakes of the one before it.
* Boosting then combines all the weak learners into a single strong learner.
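
To make the "learn from the previous mistakes" idea concrete, here is a rough sketch of boosting on a toy regression problem with squared error, where the "mistakes" are simply the residuals. The synthetic data, the `learning_rate` of 0.5, and the 20 rounds are all assumptions chosen for illustration; regression is used only because the residual update is easiest to see there.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D data: a noisy sine wave (illustrative only)
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 6, size=200)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=200)

learning_rate = 0.5
prediction = np.full_like(y, y.mean())       # start from a constant model
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(20):
    residual = y - prediction                # what the ensemble still gets wrong
    # A "weak" learner: a depth-1 tree (a stump) fitted to the residuals
    stump = DecisionTreeRegressor(max_depth=1).fit(X, residual)
    prediction += learning_rate * stump.predict(X)
    mse_history.append(np.mean((y - prediction) ** 2))

print("Training MSE before boosting:", mse_history[0])
print("Training MSE after 20 rounds:", mse_history[-1])
```

Unlike bagging, the rounds cannot be run in parallel, because each stump depends on the predictions accumulated so far.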

### Random Forest Classification Trees
> Is random forest an example of bagging or boosting?
Random Forest is an example of the bagging (Bootstrap Aggregating) technique, which is a type of ensemble learning method used in machine learning. In bagging, multiple models are trained on different subsets of the training data, and their predictions are aggregated to produce the final prediction.

> In the case of Random Forest, the base models are decision trees, and the subsets of training data are created by randomly sampling with replacement from the original training data (known as bootstrap sampling). Each tree is trained independently on a different subset of the data and with a random subset of features. The final prediction is made by aggregating the predictions of all the trees in the forest, typically using majority voting in the case of classification or averaging in the case of regression.

> The goal of bagging is to reduce the variance (i.e., overfitting) of individual models by combining multiple models with different sources of randomness. Random Forest is a popular and effective method that uses bagging to improve the accuracy and generalization of decision trees, which can be prone to overfitting on noisy or complex datasets.

**Random forests train a large number of “strong” decision trees and combine their predictions through bagging. In addition, there are two sources of “randomness” for random forests:**
* At each split, a tree is only allowed to choose from a random subset of features (a form of random feature selection).
* Each tree is only trained on a random subset of observations, drawn with replacement (a process called bootstrap resampling).

In practice, random forests tend to perform very well right out of the box. They are often competitive with models that take weeks to develop, and they have few hyperparameters that need careful tuning, which makes them a “Swiss-army-knife” algorithm that gives good results on many problems.
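
In scikit-learn, the two sources of randomness correspond to the `max_features` and `bootstrap` parameters of `RandomForestClassifier`. The sketch below uses the bundled iris data as a stand-in, and the specific parameter values are illustrative assumptions rather than tuned settings.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=100,
    max_features="sqrt",  # size of the random feature subset considered at each split
    bootstrap=True,       # each tree is trained on a bootstrap sample of the rows
    oob_score=True,       # score each row using only the trees that did not sample it
    random_state=0,
).fit(X, y)

print("Trees in the forest:", len(forest.estimators_))
print("Out-of-bag accuracy:", forest.oob_score_)
```

Because every tree leaves some rows out of its bootstrap sample, the out-of-bag score gives a rough estimate of generalization without a separate validation set.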

### Boosted Trees
**Boosted trees train a sequence of “weak”, constrained decision trees and combine their predictions through boosting.**
* Each tree is limited to a maximum depth, which should be tuned.
* Each tree in the sequence tries to correct the prediction errors of the one before it.

In practice, boosted trees tend to have the highest performance ceilings. They often beat many other types of models after proper tuning, but they are more complicated to tune than random forests.
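
As a rough illustration of tuning the maximum depth, the sketch below compares a few depths with 5-fold cross-validation. It assumes scikit-learn's `GradientBoostingClassifier` as the boosted-tree implementation and the bundled iris data as a stand-in; the candidate depths and `n_estimators=50` are arbitrary choices for the example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

cv_scores = {}
for depth in (1, 2, 3):
    # Each tree in the boosted sequence is limited to `depth` levels
    model = GradientBoostingClassifier(n_estimators=50, max_depth=depth, random_state=0)
    cv_scores[depth] = cross_val_score(model, X, y, cv=5).mean()
    print(f"max_depth={depth}: mean CV accuracy = {cv_scores[depth]:.3f}")
```

On a dataset this small the differences between depths are likely to be minor; the point is only the shape of the tuning loop.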

In [46]:
# Load the library with the iris dataset
from sklearn.datasets import load_iris

# Load scikit's random forest classifier library
from sklearn.ensemble import RandomForestClassifier

# Load pandas
import pandas as pd

# Load numpy
import numpy as np

In [47]:
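# Read the iris data from a local CSV file into a DataFrame called iris.
# A raw string keeps the backslashes in the Windows path from being read as escapes.
iris = pd.read_csv(r"E:\datafile\iris.csv")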

In [48]:
print(type(iris), iris.shape)
iris.head()

<class 'pandas.core.frame.DataFrame'> (150, 5)


|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


In [49]:
# Add a boolean column: draw a uniform random number between 0 and 1 for each
# row and mark the row True if the draw is less than or equal to .75, False
# otherwise. This is a quick way of randomly assigning roughly 75% of the rows
# to the training data and the rest to the test data. No seed is set, so the
# split changes from run to run.
iris['is_train'] = np.random.uniform(0, 1, len(iris)) <= .75

# View the top 5 rows
iris.head()

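|   | sepal_length | sepal_width | petal_length | petal_width | species | is_train |
|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | False |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | True |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | False |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | True |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | False |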


In [50]:
# Create two new dataframes, one with the training rows, one with the test rows
train, test = iris[iris['is_train']], iris[~iris['is_train']]

In [51]:
# Show the number of observations for the test and training dataframes
print('Number of observations in the training data:', len(train))
print('Number of observations in the test data:', len(test))

Number of observations in the training data: 105
Number of observations in the test data: 45
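
The random-mask split gives a different partition on every run and does not guarantee that each species is equally represented in both parts. One common alternative, sketched here as an option rather than as the notebook's method, is scikit-learn's `train_test_split` with `stratify` and a fixed `random_state`. The example assumes the bundled iris data (`load_iris(as_frame=True)`), whose label column is named `target`, instead of the local CSV; `iris_df`, `train_df`, and `test_df` are names made up for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Bundled iris data as a DataFrame: four feature columns plus an integer 'target'
iris_df = load_iris(as_frame=True).frame

train_df, test_df = train_test_split(
    iris_df,
    test_size=0.25,              # hold out about a quarter of the rows
    stratify=iris_df["target"],  # keep class proportions similar in both parts
    random_state=0,              # fixed seed so the split is reproducible
)

print("Training rows:", len(train_df))
print("Test rows:", len(test_df))
print(test_df["target"].value_counts().sort_index())
```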


In [52]:
# Collect the feature column names (the first four columns)
features = iris.columns[:4]

# View features
features

Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width'], dtype='object')

In [53]:
# train['species'] contains the actual species names. Before we can use it,
# we need to convert each species name into an integer code. There are three
# species, which pd.factorize codes as 0, 1, or 2 in order of first appearance.
y = pd.factorize(train['species'])[0]

# View target
y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2], dtype=int64)
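
`pd.factorize` returns two things: the integer codes and the unique labels in the order they were first seen. Keeping the second return value makes it possible to map codes back to names without hard-coding the label list. A small sketch on a toy Series (the five labels below are made up for illustration):

```python
import pandas as pd

species = pd.Series(["setosa", "setosa", "versicolor", "virginica", "versicolor"])

# codes: one integer per row; uniques: the labels, indexed by their code
codes, uniques = pd.factorize(species)
print(codes)
print(list(uniques))

# Indexing the unique labels with the codes recovers the original names
decoded = uniques[codes]
print(list(decoded))
```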

In [54]:
# Create a random forest Classifier. By convention, clf means 'Classifier'
clf = RandomForestClassifier(n_jobs = 2, random_state = 0)

# Train the Classifier to take the training features and learn how they relate
# to the training y (the species)
clf.fit(train[features], y)

In [55]:
# Apply the Classifier we trained to the test data (which, remember, it has never seen before)
clf.predict(test[features])

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 1, 2, 2,
       2], dtype=int64)

In [56]:
# View the predicted probabilities of the first 10 observations
clf.predict_proba(test[features])[0:10]

array([[1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.]])
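
Each row of `predict_proba` holds one probability per class, and in scikit-learn's random forest these are the per-tree class probabilities averaged over the forest. A row of `[1., 0., 0.]` therefore means every tree agreed on class 0. The sketch below checks this relationship on the bundled iris data, which is an assumption standing in for the CSV used above; `forest`, `tree_mean`, and `top_class` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

proba = forest.predict_proba(X[:10])  # shape (10, 3): one column per class

# Average the class probabilities reported by the individual trees
tree_mean = np.mean([tree.predict_proba(X[:10]) for tree in forest.estimators_], axis=0)

# predict() picks the class with the highest averaged probability
top_class = forest.classes_[proba.argmax(axis=1)]

print(np.allclose(proba, tree_mean))
print(np.array_equal(top_class, forest.predict(X[:10])))
```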

In [57]:
# Store the predicted class codes for the test data; they are mapped to
# species names in the next cell
preds = clf.predict(test[features])
preds

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 1, 2, 2,
       2], dtype=int64)

In [64]:
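# Map each predicted code back to its species name. The list order must match
# the codes assigned by pd.factorize (0 = setosa, 1 = versicolor, 2 = virginica).
target_names = ['setosa', 'versicolor', 'virginica']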
preds_species = [target_names[i] for i in preds]
preds_species[:10]

['setosa',
 'setosa',
 'setosa',
 'setosa',
 'setosa',
 'setosa',
 'setosa',
 'setosa',
 'setosa',
 'setosa']

In [59]:
# View the PREDICTED species for the first ten observations
preds[0:10]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int64)

In [60]:
# View the ACTUAL species for the first ten observations
test['species'].head(10)

0     setosa
2     setosa
4     setosa
7     setosa
10    setosa
24    setosa
25    setosa
26    setosa
28    setosa
29    setosa
Name: species, dtype: object

In [61]:
# Create confusion matrix
pd.crosstab(test['species'], preds, rownames=['Actual Species'], colnames=['Predicted Species'])

| Actual Species \ Predicted Species | 0 | 1 | 2 |
|---|---|---|---|
| setosa | 17 | 0 | 0 |
| versicolor | 0 | 14 | 1 |
| virginica | 0 | 2 | 11 |
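
The confusion matrix can be turned into summary metrics with a little arithmetic: correct predictions sit on the diagonal, so accuracy is the diagonal sum divided by the total, here (17 + 14 + 11) / 45 = 42 / 45 ≈ 0.933. The sketch below reproduces that calculation from the counts in the table and adds per-class recall and precision; the variable names are illustrative.

```python
import numpy as np

# Rows are actual species, columns are predicted codes (counts from the table above)
cm = np.array([
    [17,  0,  0],   # setosa
    [ 0, 14,  1],   # versicolor
    [ 0,  2, 11],   # virginica
])

accuracy = np.trace(cm) / cm.sum()         # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)      # per actual class: fraction found
precision = np.diag(cm) / cm.sum(axis=0)   # per predicted class: fraction correct

print("Accuracy:", round(accuracy, 3))
print("Recall per class:", recall.round(3))
print("Precision per class:", precision.round(3))
```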


In [62]:
# View a list of the features and their importance scores
list(zip(features, clf.feature_importances_))

[('sepal_length', 0.09527400879080274),
 ('sepal_width', 0.022390332046274004),
 ('petal_length', 0.4951652842127053),
 ('petal_width', 0.3871703749502181)]
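
The importance scores are normalized so that they sum to 1, which makes them easy to read as shares. As a small sketch, the values printed above can be loaded into a pandas Series and ranked; the numbers are copied from the output, and `importances` and `ranked` are names chosen for the example.

```python
import pandas as pd

# Importance scores copied from the output above
importances = pd.Series({
    "sepal_length": 0.09527400879080274,
    "sepal_width": 0.022390332046274004,
    "petal_length": 0.4951652842127053,
    "petal_width": 0.3871703749502181,
})

# Rank the features from most to least important
ranked = importances.sort_values(ascending=False)
print(ranked)
print("Sum of importances:", round(importances.sum(), 6))
```

In this run the two petal measurements account for roughly 88% of the total importance.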