# Random Forest vs Extra Trees

# 1. Overview
Scikit-learn is packed with amazing algorithms that serve a wide variety of purposes. Needless to say, no single algorithm fits every dataset. Each algorithm comes with its pros and cons, and our role is to keep experimenting and pick the one that serves our needs. In this notebook, we will look at a variant of the famous Random Forest ensemble called Extremely Randomized Trees, or Extra Trees, and see how this ensemble can help in our projects. We will also compare it with the traditional Random Forest in terms of computational cost. To get started, let's first look at a basic Decision Tree.

# 2. Decision Tree
As the name suggests, a Decision Tree is a tree with levels (its depth). Each level has nodes, and each node makes a decision by comparing a feature against a threshold. Each node splits the dataset into two parts, and following these splits from the root down to a leaf assigns an instance to a target class. Look at the following figure for a better understanding.

![tree.png](https://miro.medium.com/max/700/1*fGX0_gacojVa6-njlCrWZw.png)

Here, the first node divides the dataset by Sex; at the second level, the dataset is divided by the features Age and Pclass, and so on. The important thing to note is that the algorithm selects the feature and threshold for each node on its own: it evaluates candidate splits and picks the one that gives the lowest Gini impurity. In simple terms, if we have a dataset that classifies students as pass or fail and only one feature, Marks, the decision tree will find the passing score (the threshold) on its own during training. The next time we enter a student's marks, the model classifies the student as pass or fail using the learned passing score.
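To make the threshold search concrete, here is a minimal sketch of how a single node could pick a passing score by minimising weighted Gini impurity. The `marks` and `passed` arrays and the helper names `gini` and `best_threshold` are made up for illustration; scikit-learn's actual implementation is more elaborate and optimised.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1-D collection of class labels."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_threshold(values, labels):
    """Return (threshold, impurity) with the lowest weighted Gini impurity."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    uniq = np.unique(values)
    # Candidate thresholds: midpoints between consecutive distinct values.
    candidates = (uniq[:-1] + uniq[1:]) / 2
    best_t, best_score = None, np.inf
    for t in candidates:
        left, right = labels[values <= t], labels[values > t]
        # Impurity of the two children, weighted by their sizes.
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Hypothetical toy data: marks and pass (1) / fail (0) labels.
marks = [12, 25, 33, 38, 41, 47, 55, 68, 90]
passed = [0, 0, 0, 0, 1, 1, 1, 1, 1]

threshold, impurity = best_threshold(marks, passed)
print(threshold, impurity)
```

On this toy data the search lands on 39.5, the midpoint between the highest failing mark (38) and the lowest passing mark (41), where both sides of the split are pure and the impurity is 0.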

# 3. Random Forest
Going from a Decision Tree to a Random Forest is simple. Instead of a single tree on an empty plot of land (say, an apple tree), imagine a forest with thousands of trees (but not every tree is an apple tree). Each tree bears a slightly different fruit, with a different taste or colour, or a different fruit altogether. This variety usually makes the forest's harvest much richer.

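A Decision Tree takes the whole dataset and builds one tree. A Random Forest builds many trees, but each tree is not shown the entire training dataset: it is trained on a random sample of the data and then predicts a class or value. In scikit-learn, `RandomForestClassifier` draws a bootstrap sample (rows sampled with replacement) for each tree and, by default, also considers only a random subset of features at each split. Finally, the predictions of all trees are aggregated, by voting for classification or averaging for regression, to make the final prediction.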

Each tree is shown a random part of the data to bring diversity. If every tree were shown the entire dataset, each would bear the same identical apple, which would add no value over using a single Decision Tree. Look at the following image for a better understanding.

![tr.png](https://miro.medium.com/max/574/0*a8KgF1IINziv7KIQ.png)
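The idea in the figure can be sketched by hand: train several trees on bootstrap samples and let them vote. This is only an illustration on synthetic data from `make_classification` (an assumption, not the notebook's dataset), and names such as `X_demo`, `trees`, and `forest_pred` are hypothetical. Note that scikit-learn's own `RandomForestClassifier` averages predicted class probabilities rather than counting hard votes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data, used only for this demonstration.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, random_state=0
)

rng = np.random.default_rng(0)
n_trees = 25  # odd number, so a binary majority vote cannot tie
trees = []
for _ in range(n_trees):
    # Bootstrap sample: draw row indices with replacement.
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    tree = DecisionTreeClassifier(
        max_features="sqrt",  # random subset of features at each split
        random_state=int(rng.integers(1_000_000)),
    )
    trees.append(tree.fit(X_demo[idx], y_demo[idx]))

# Each tree votes; the majority class wins (labels are 0 and 1 here).
votes = np.stack([t.predict(X_demo) for t in trees])  # shape (n_trees, n_samples)
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)
print("ensemble accuracy on its own training data:", (forest_pred == y_demo).mean())
```

The accuracy printed here is measured on the data the trees were trained on, so it is optimistic; it is only meant to show the voting step working.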

# 4. Extra Trees
The "Random" in Random Forest comes from training each tree on a random subset of the dataset. But what if, instead of searching for the best possible threshold at each node, we simply chose a random threshold too? Recall the Decision Tree example: imagine the algorithm does not bother to compute the passing score on the first attempt and instead picks a random marks value to divide the dataset (say, 90). At the first level, the students are divided into two groups (above and below 90). A student who scores more than 90 is declared a pass; for students below 90, the algorithm picks another random threshold (between 0 and 90) and splits the remaining instances further. This algorithm will also eventually produce an accurate model, but it will typically have more levels and nodes than a Decision Tree that searches for the best split. However, skipping the search for the best threshold makes each split much cheaper to compute, so training is usually faster. In scikit-learn's implementation, a random threshold is drawn for each candidate feature and the best of these random splits is kept; also, by default, each tree is trained on the whole training set rather than on a bootstrap sample.
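The random-threshold idea can be sketched in a few lines. The helper `random_threshold` and the toy `marks` array are hypothetical; the second half uses `DecisionTreeClassifier` with `splitter="best"` and `splitter="random"` on synthetic data to compare the size of a tree grown with searched splits against one grown with random splits. This is a rough illustration under those assumptions, not a reproduction of `ExtraTreesClassifier` internals.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def random_threshold(values, rng):
    """Draw a split point uniformly between the min and max of a feature."""
    values = np.asarray(values, dtype=float)
    return rng.uniform(values.min(), values.max())

rng = np.random.default_rng(42)

# Hypothetical toy feature: student marks.
marks = np.array([12, 25, 33, 38, 41, 47, 55, 68, 90])
t = random_threshold(marks, rng)
left, right = marks[marks <= t], marks[marks > t]
print(f"random threshold: {t:.1f}, left: {left}, right: {right}")

# Compare tree sizes with searched vs random splits on synthetic data.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=6, random_state=0
)
best_tree = DecisionTreeClassifier(splitter="best", random_state=0).fit(X_demo, y_demo)
rand_tree = DecisionTreeClassifier(splitter="random", random_state=0).fit(X_demo, y_demo)
for name, tree in [("best split", best_tree), ("random split", rand_tree)]:
    print(name, "-> nodes:", tree.tree_.node_count, "depth:", tree.get_depth())
```

The node counts and depths depend on the data and the random seed, so treat the comparison as indicative rather than as a guarantee.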

In [None]:
import numpy as np
import pandas as pd
pd.options.display.max_columns = 100

from matplotlib import pyplot as plt
import seaborn as sns
sns.set()

In [None]:
train = pd.read_csv("../input/tabular-playground-series-feb-2022/train.csv")
test = pd.read_csv("../input/tabular-playground-series-feb-2022/test.csv")
submission = pd.read_csv("../input/tabular-playground-series-feb-2022/sample_submission.csv")

In [None]:
train.head()

In [None]:
print(f'Number of rows: {train.shape[0]};  Number of columns: {train.shape[1]}; No of missing values: {sum(train.isna().sum())}')

In [None]:
train.describe().style.background_gradient(cmap="Pastel1")

In [None]:
# Count the instances of each class and their share of the training set
target_count = train['target'].value_counts().sort_index()
target_count_df = pd.DataFrame({'count': target_count})
target_count_df['target(%)'] = target_count / target_count.sum() * 100
target_count_df.sort_values('target(%)', ascending=False, inplace=True)
display(target_count_df)

In [None]:
colors_4 = ['magenta','yellow','orange','red','maroon','blue','purple','lime','chocolate','silver']
target_count.plot.pie(subplots=True, figsize=(20,10), labels=target_count.index,autopct='%1.1f%%', colors=colors_4)
plt.show()

In [None]:
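from sklearn.preprocessing import LabelEncoder

# Encode the class names as integers; fit the encoder on the training labels only
labelencoder = LabelEncoder()
train['target'] = labelencoder.fit_transform(train['target'])
# Reuse the fitted encoder so the integer codes match those of the training set.
# Note: sample_submission.csv holds placeholder labels, not ground truth.
submission['target'] = labelencoder.transform(submission['target'])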

In [None]:
train.drop(columns=["row_id"], inplace=True)
y = train['target']
X = train.drop(columns=['target'])

In [None]:
test.head()

In [None]:
print(f'Number of rows: {test.shape[0]};  Number of columns: {test.shape[1]}; No of missing values: {sum(test.isna().sum())}')

In [None]:
test.describe().style.background_gradient(cmap="Pastel1")

In [None]:
test.drop(["row_id"] , axis = 1 , inplace = True)
x_test=test

In [None]:
submission.head()

In [None]:
submission.drop(["row_id"] , axis = 1 , inplace = True)
y_test=submission.target

In [None]:
from sklearn.model_selection import train_test_split

# Hold out 20% of the training data as a validation set
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, X_valid.shape, y_train.shape, y_valid.shape

In [None]:
# Random Forest evaluated with 10-fold cross-validation
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

clf = RandomForestClassifier(n_estimators=13)
k_fold = KFold(n_splits=10, shuffle=True, random_state=0)
scoring = 'accuracy'
score = cross_val_score(clf, X_train, y_train, cv=k_fold, n_jobs=1, scoring=scoring)
print(score)

In [None]:
# Mean cross-validated accuracy of the Random Forest, in percent
round(np.mean(score)*100, 2)

In [None]:
# create the classifier with n_estimators = 100
clf = RandomForestClassifier(n_estimators=100, random_state=0)

# fit the model to the training set
clf.fit(X_train, y_train)

In [None]:
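# Error metrics calculated manually.
# Note: on integer-encoded class labels these values depend on the arbitrary
# order of the label codes, so treat them only as a rough signal of disagreement.
def my_function(y, y_preds):
    d = np.asarray(y) - np.asarray(y_preds)
    mse_f = np.mean(d ** 2)
    mae_f = np.mean(np.abs(d))
    rmse_f = np.sqrt(mse_f)

    print("Results by manual calculation:")
    print("MAE:", mae_f)
    print("MSE:", mse_f)
    print("RMSE:", rmse_f)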
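For a classification task, accuracy and a confusion matrix are the more usual way to summarise predictions. The sketch below uses `accuracy_score` and `confusion_matrix` from `sklearn.metrics` on small hypothetical arrays (`y_true_demo` and `y_pred_demo` are invented for illustration); the same calls could be applied to any pair of true and predicted label arrays.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical encoded labels, for illustration only.
y_true_demo = np.array([0, 1, 2, 2, 1, 0])
y_pred_demo = np.array([0, 2, 2, 2, 1, 0])

# Fraction of predictions that match the true label.
acc = accuracy_score(y_true_demo, y_pred_demo)
# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo)

print("Accuracy:", acc)
print(cm)
```

Here five of the six predictions are correct, and the single off-diagonal entry in the confusion matrix shows one instance of class 1 predicted as class 2.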

In [None]:
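print('Random Forest model')
# Evaluate on the held-out validation split, which has true labels
# (the targets in sample_submission.csv are only placeholders)
y_preds_Random = clf.predict(X_valid)
my_function(y_valid, y_preds_Random)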

In [None]:
# evaluate extra trees algorithm for classification
from numpy import mean
from numpy import std
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import ExtraTreesClassifier
# define the model
model = ExtraTreesClassifier()
# evaluate the model on the training split with repeated stratified 10-fold cross-validation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X_train, y_train, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

In [None]:
# fit extra trees for classification
from sklearn.ensemble import ExtraTreesClassifier

# define the model
model = ExtraTreesClassifier(n_estimators=100)
# fit the model on the training split
model.fit(X_train, y_train)

In [None]:
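# make predictions on the held-out validation split, which has true labels
# (the targets in sample_submission.csv are only placeholders)
y_preds_Extra = model.predict(X_valid)
my_function(y_valid, y_preds_Extra)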
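To compare the two ensembles on computational cost, one option is to time the `fit` call of each model on the same data. The sketch below does this on synthetic data from `make_classification` (an assumption, so the numbers will not match this notebook's dataset); `results`, `model_demo`, and the other `_demo` names are hypothetical. Timings depend on hardware, data shape, and hyperparameters, so read them as a rough indication only.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, used only for this comparison.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=20, n_informative=15, n_redundant=5, random_state=4
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

results = {}
for name, Model in [
    ("Random Forest", RandomForestClassifier),
    ("Extra Trees", ExtraTreesClassifier),
]:
    model_demo = Model(n_estimators=100, random_state=0)
    start = time.perf_counter()
    model_demo.fit(X_tr, y_tr)  # time only the training step
    fit_seconds = time.perf_counter() - start
    results[name] = {
        "fit_seconds": fit_seconds,
        "accuracy": model_demo.score(X_va, y_va),  # accuracy on held-out data
    }
    print(f"{name}: fit in {fit_seconds:.2f}s, "
          f"validation accuracy {results[name]['accuracy']:.3f}")
```

Running the loop a few times, or wrapping it in a repeated timing helper, would give a steadier picture than a single measurement.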
