## What Is Model Overfit?
Let's assume that at our iris nursery we have a leaky heating duct that we don't know about. The irises under that duct are warmer by a few degrees, just enough to inhibit their growth. There are a variety of different irises under that duct, so the effect is random across the three species.

Because we don't know about it, this factor will not be in our model, so it will act as a random effect that results in some irises being misclassified. But we want 100% accuracy, so we continue to increase the model's depth, eventually reaching 100% accuracy. That is a good thing, right?

Remember that we don't know that it is the heating duct, and *neither does your model*. It will use the included factors to try to classify. If you push the model hard enough, it may, for example, identify a narrow combination of petal width and sepal length that explains those specific irises, but these aren't *true* effects. Instead, you are using randomness to try to classify the results. Since that combination of features isn't systemic to irises, you will compromise the generalizability of your model. It will do *worse* in the future.
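The heating-duct story can be imitated with a small experiment. The sketch below is only an illustration under stated assumptions: the "duct" is simulated by shrinking the measurements of 20 randomly chosen irises by 15%, and the names `X_demo`, `y_demo`, `duct_rows`, and `demo_scores` are made up for this example. It compares a shallow tree with one that is allowed to grow without limit.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Work on copies so the original data is left untouched.
iris_demo = load_iris()
X_demo = iris_demo.data.copy()
y_demo = iris_demo.target

# Assumption: simulate the hidden heating duct by shrinking every measurement
# of 20 randomly chosen irises by 15%. No feature records which plants these are.
rng = np.random.default_rng(0)
duct_rows = rng.choice(len(y_demo), size=20, replace=False)
X_demo[duct_rows] *= 0.85

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42)

# Compare a shallow tree (depth 2) with an unrestricted one (max_depth=None).
demo_scores = {}
for depth in (2, None):
    tree_demo = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree_demo.fit(Xd_train, yd_train)
    demo_scores[depth] = (tree_demo.score(Xd_train, yd_train),
                          tree_demo.score(Xd_test, yd_test))
    print(f"max_depth={depth}: train={demo_scores[depth][0]:.4f}, "
          f"test={demo_scores[depth][1]:.4f}")
```

The unrestricted tree will fit the training rows at least as well as the shallow one. The number to watch is the test score: if it fails to keep up with the training score, the extra depth is most likely memorizing the affected plants rather than learning anything about irises.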

Here is another way to think about it, using lines:

![Model Overfit](model_fit.png "Model Overfit")

The following code loads up the iris data and completes the feature engineering from the last class.

In [5]:
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn import tree
import sklearn
import pandas as pd

In [6]:
# load the data and put it into a dataframe:
iris = load_iris()
df = pd.DataFrame(iris.data)

# set the column names:
cols = ['sepal_length',
 'sepal_width',
 'petal_length',
 'petal_width']
df.columns = cols

# The target identifies the species as 0, 1, or 2.  
# Add target to our dataframe for later

df['target'] = iris.target

# Add the actual names of the irises to the dataframe for reference
df['species'] = 'Iris Setosa'
df.loc[df.target == 1, 'species'] = 'Iris Versicolour' 
df.loc[df.target == 2, 'species'] = 'Iris Virginica'

# Split the data into X and y dataframes for analysis
X = df[['sepal_length',
 'sepal_width',
 'petal_length',
 'petal_width']]

names = ['Iris Setosa', 'Iris Versicolour', 'Iris Virginica']
y = df['target']

In [7]:
X.head()

|   | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


Let's confirm that the species names were mapped correctly:

In [8]:
pd.crosstab(df.target, df.species)

| target | Iris Setosa | Iris Versicolour | Iris Virginica |
|---|---|---|---|
| 0 | 50 | 0 | 0 |
| 1 | 0 | 50 | 0 |
| 2 | 0 | 0 | 50 |


In [9]:
df.columns

Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'target',
       'species'],
      dtype='object')

## Random Forests
A single decision tree has a tendency to overfit, that is, to work well for a training dataset but degrade when exposed to new data. To overcome this limitation, an ensemble method called a 'random forest' is used.

Random forests create dozens (or even hundreds) of randomized decision trees, using randomized subsets of data.  Each tree is therefore different, but the advantage is that the ensemble will do a better job of predicting than just one tree.  This power of crowds dramatically improves the efficacy of decision trees.



To generate each new decision tree, a random forest draws a random sample of dataframe rows and considers a random subset of features as it builds the tree. This process is then repeated dozens, hundreds, or even thousands of times. The power of random forests comes from combining the predictions of all the generated trees: quirks that any single tree picks up from its particular sample tend to wash out across the ensemble, so the combined model is more likely to generalize, that is, less likely to be overfit.
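To make the ensemble idea concrete, here is a minimal sketch that fits a small forest and then looks inside it. It uses the real scikit-learn `estimators_` attribute, which holds the individual fitted trees. The variable names and the choice of 25 trees are arbitrary picks for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris_demo = load_iris()
X_demo, y_demo = iris_demo.data, iris_demo.target

# bootstrap=True: each tree is trained on rows sampled with replacement.
# max_features="sqrt": each split considers only a random subset of features.
forest_demo = RandomForestClassifier(
    n_estimators=25, bootstrap=True, max_features="sqrt",
    max_depth=2, random_state=0)
forest_demo.fit(X_demo, y_demo)

# The individual fitted trees are available for inspection.
print(len(forest_demo.estimators_))

# Average the per-tree class probabilities by hand...
tree_probs = np.mean(
    [t.predict_proba(X_demo) for t in forest_demo.estimators_], axis=0)

# ...and compare with what the forest itself reports.
forest_probs = forest_demo.predict_proba(X_demo)
print(np.allclose(tree_probs, forest_probs))
```

In scikit-learn the forest's predicted probabilities are the average of its trees' probabilities, which is the "power of crowds" in numerical form: no single tree decides the outcome.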

## Train/Test Split
The following divides the dataset into 70\% training and 30\% testing. We will use the train set to train our random forest. We will then run the *test* data set through the `clf` object (which contains all parameters for the trained model) to see how it performs on data that it has never seen.

The train/test concept allows us to explore the generalizability of our model: if the model performance is consistent, this is evidence that the model generalizes, that is, is not overfit. This also indicates that we can be more confident in applying our model to classify new irises.

In [10]:
# reminder: random_state will ensure you get the same results
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, \
        test_size = 0.3, random_state = 42) 
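As a quick sanity check on what a split like this produces, the sketch below repeats it on a fresh copy of the data and prints the sizes of the two pieces. The second call shows an optional variation that this lesson does not use: passing `stratify` keeps the species proportions equal in both pieces. All variable names here are placeholders for the illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_demo = load_iris()
X_demo, y_demo = iris_demo.data, iris_demo.target

# Same settings as the lesson's split: 30% held out, fixed seed.
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42)
print(Xd_train.shape, Xd_test.shape)

# Optional variation (not used in this lesson): stratify on the labels so
# each species keeps the same share in the train and test pieces.
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo)
print(np.bincount(ys_train), np.bincount(ys_test))
```

Without stratification the species counts in the training piece can be uneven, which is worth keeping in mind when reading per-species results later.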

In [11]:
from sklearn.ensemble import RandomForestClassifier

# generate 10 decision trees
clf = RandomForestClassifier(n_estimators=10, verbose=0, bootstrap=True, max_depth=2, \
            random_state=42) 
clf = clf.fit(X_train,y_train)

preds = clf.predict(X_train)
probs = clf.predict_proba(X_train)
accuracy = round(clf.score(X_train,y_train)*100,2) # make it a percentage and round to 2 places
print(accuracy)

94.29


### Train the Model using the Train dataset

The model predicts with 94.29\% accuracy, which seems pretty good.

In [105]:
# 'sepal_length', 'sepal_width', 'petal_length', 'petal_width'
print(clf.feature_importances_)

[0.18235931 0.         0.292432   0.52520869]


The above is a useful result: it indicates the relative importance of each of the included features in predicting the species of iris. According to this, the relative contribution of each is:
* sepal_length: 18.2\%
* sepal_width: 0\%
* petal_length: 29.2\%
* petal_width: 52.5\%

At least within our random forest, the petal dimensions were by far the most important in classifying the iris species.
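Matching the importances to column names by eye is error-prone, so one option is to pair them up in a pandas Series. This is a self-contained sketch that refits a forest with the same settings used in this lesson. The names `forest_demo` and `importances` are introduced only for this example.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris_demo = load_iris()
feature_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
X_demo = pd.DataFrame(iris_demo.data, columns=feature_names)
y_demo = iris_demo.target

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42)

forest_demo = RandomForestClassifier(
    n_estimators=10, max_depth=2, random_state=42)
forest_demo.fit(Xd_train, yd_train)

# Pair each importance with its column name and sort, largest first.
importances = pd.Series(forest_demo.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)

# The importances are normalized, so they sum to 1.
print(importances.sum())
```

Because the values are relative shares, a feature with an importance of 0 was simply never chosen for a split in this particular forest. That does not prove the feature is useless in general.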

### Generate a Confusion Matrix to Gauge our Training Results

In [94]:
y_train.value_counts()

2    37
1    37
0    31
Name: target, dtype: int64

Because we divided the data into test and train, the above represents 70\% of our original data, or n=105. 

In [95]:
# Store the probability and prediction of each iris
preds = clf.predict(X_train)
probs = clf.predict_proba(X_train)

In [96]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_train, preds)

array([[31,  0,  0],
       [ 0, 35,  2],
       [ 0,  4, 33]], dtype=int64)

A confusion matrix is a powerful summary of our model's performance.  The rows represent the actual species (0,1,2), while the columns represent the predicted species.  The diagonal (upper left to lower right) indicates the correct classifications.  If our model is perfect, only the diagonals would have numbers: the rest would be 0.  

The number 2 in the second row is off the diagonal, so it is a misclassification. The model predicted that the species was 2, but it was actually 1 (remember to start counting with 0!). The other classification error is in row 3: four Virginica irises (target=2) are being classified as Versicolour (target=1). The accuracy of our model, therefore, is:

`(31+35+33)/105 = 94.29%`

We can confirm this:

In [106]:
accuracy = clf.score(X_train,y_train)
print("Accuracy is: {:.4f}.".format(accuracy))

Accuracy is: 0.9429.
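The same figure can be recovered from the confusion matrix itself. The sketch below copies the training matrix shown earlier into a NumPy array and computes accuracy from the diagonal, plus the share of each actual species that was classified correctly. The variable names are made up for this example.

```python
import numpy as np

# Training confusion matrix copied from the output shown earlier
# (rows = actual species, columns = predicted species).
cm = np.array([[31, 0, 0],
               [0, 35, 2],
               [0, 4, 33]])

# Correct predictions sit on the diagonal.
accuracy_from_cm = np.trace(cm) / cm.sum()
print(round(accuracy_from_cm * 100, 2))

# Per-species recall: correct predictions divided by the actual count per row.
recall_per_species = np.diag(cm) / cm.sum(axis=1)
print(recall_per_species.round(4))
```

The per-row view shows where the errors sit: species 0 is classified perfectly, while all six mistakes involve species 1 and 2 being confused with each other.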


So the results indicate that our trained model does a good job of classifying iris species.  The question, however, is: how well does it generalize?  If our model is too tied to this particular dataset, that is called _overfitting_.  It indicates that the model will not generalize to new irises.  

This is a major concern with decision trees, so we need to confirm that our model will generalize. Fortunately, we withheld 30\% of our data in the X_test data set, so we can now see how well the model performs with new data!

### Use the test data to test the generalizability of our model
Now that we have trained a random forest, let's test its generalizability by running it using the test data. If the results are similar, we have a generalizable model.

In [108]:
accuracy = clf.score(X_test,y_test)
print("Accuracy is: {:.4f}.".format(accuracy))

Accuracy is: 1.0000.


Wow. No overfit here! The model actually did _better_ on the test data than on the train data. This indicates that our model is not overfit and may generalize to other irises!

In [109]:
# Store the probability and prediction of each iris
preds = clf.predict(X_test)
probs = clf.predict_proba(X_test)

### Generate a Confusion Matrix to Gauge our Test Results

In [110]:
confusion_matrix(y_test, preds)

array([[19,  0,  0],
       [ 0, 13,  0],
       [ 0,  0, 13]], dtype=int64)
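A test set of 45 irises is small, so a single split can look better or worse than the model really is. One way to get a feel for that variability is to repeat the split with different seeds, as in the sketch below. The seed values 0 through 4 are arbitrary, and the variable names are invented for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris_demo = load_iris()
X_demo, y_demo = iris_demo.data, iris_demo.target

# Repeat the 70/30 split with different seeds and record both accuracies.
split_results = []
for seed in range(5):
    Xd_train, Xd_test, yd_train, yd_test = train_test_split(
        X_demo, y_demo, test_size=0.3, random_state=seed)
    forest_demo = RandomForestClassifier(
        n_estimators=10, max_depth=2, random_state=42)
    forest_demo.fit(Xd_train, yd_train)
    split_results.append((forest_demo.score(Xd_train, yd_train),
                          forest_demo.score(Xd_test, yd_test)))

for seed, (train_acc, test_acc) in enumerate(split_results):
    print(f"seed={seed}  train={train_acc:.4f}  test={test_acc:.4f}")
```

If train and test accuracy stay close to each other across the different splits, that is stronger evidence of generalization than one favorable split.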

## What Our Model Indicates
The results indicate that our model is not overfitted, so it is natural at this point to assume overfit is rarely a problem.  At least with regard to decision trees, that is not true.  _You should always train/test your decision tree models, to ensure your results are generalizable._ 

Please remember that I cherry-picked a data set that is known to be easy to classify, so the amount of signal is strong and the amount of randomness is very small. In real life, of course, your data will likely have a lot of random effects, so you need to be careful of compromising generalizability by striving for perfection.

Another reason we didn't have a problem is that we restricted the model depth to two. You can always improve your decision tree's accuracy on the training data by adding more depth. As it turns out, this model will continue to improve if we add depth, but the iris dataset is popular _because_ the flowers are easily categorized. Real data will rarely behave as well. Just be prepared to train and test your results!
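The effect of depth can be explored directly. The following sketch refits the forest at several depths (the list of depths is an arbitrary choice, and `None` means no limit) and records train and test accuracy for each. The names `depth_results` and `forest_demo` are placeholders for this example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris_demo = load_iris()
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    iris_demo.data, iris_demo.target, test_size=0.3, random_state=42)

# Refit the same forest at several depths; None lets the trees grow fully.
depth_results = {}
for depth in (1, 2, 3, 5, None):
    forest_demo = RandomForestClassifier(
        n_estimators=10, max_depth=depth, random_state=42)
    forest_demo.fit(Xd_train, yd_train)
    depth_results[depth] = (forest_demo.score(Xd_train, yd_train),
                            forest_demo.score(Xd_test, yd_test))
    print(f"max_depth={depth}: train={depth_results[depth][0]:.4f}, "
          f"test={depth_results[depth][1]:.4f}")
```

On noisier data, the pattern to watch for is training accuracy that keeps climbing with depth while test accuracy stalls or falls. That gap is the signature of overfit.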

## Conclusion

At this point, you may wonder: where is our decision tree? The answer is: which one? We just generated a lot of decision trees, so there is no longer *just one tree.* Instead, we used an ensemble method to mitigate the effects of overfitting: in layman's terms, we generated a large number of trees to improve the model's ability to predict new irises in the future.